Using Tungsten to Replicate MySQL Data to Hadoop

Reposted article, 2014-12-18 20:25. Tags: tungsten-replicator, continuent-tools-hadoop

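Background

Many databases are running in production, and the back end needs a data warehouse for analyzing user behavior. The popular choices at present are MySQL and the Hadoop platform.

The problem is how to replicate the production MySQL data into Hadoop in real time so that it can be analyzed. This article shows how to do that with tungsten-replicator.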

Environment

tungsten-replicator depends on ruby and gem, so install them first:

yum install ruby
yum install rubygems
gem install json

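Because of network restrictions (the GFW), the json gem may fail to download. In that case, download it manually and install it from the local file with gem: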

yum install ruby-devel
gem install --local json-xxx.gem
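
Before moving on, it can help to confirm that the prerequisites are actually in place. The following is a minimal sketch and not part of the original setup; the helper name missing_cmds is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Report missing prerequisites for tungsten-replicator.
missing=$(missing_cmds ruby gem)
if [ -n "$missing" ]; then
    echo "missing commands: $missing"
elif ruby -e 'require "rubygems"; require "json"' >/dev/null 2>&1; then
    echo "ruby, gem and the json gem are available"
else
    echo "the json gem is not installed"
fi
```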

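MySQL is installed at 192.168.12.223:3306, with the database privileges already configured.

Hadoop 2.4 is installed, and the HDFS address is 192.168.12.221:9000.

Configuration

On the MySQL machine, change into the tungsten-replicator directory, run the following command, and then start Tungsten. Commands such as trepctl and thl can be used to check the service status.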

./tools/tpm install mysql1 \
    --master=192.168.12.223 \
    --install-directory=/user/app/tungsten/mysql1 \
    --datasource-mysql-conf=/user/data/mysql_data/my-3306.cnf \
    --replication-user=stats \
    --replication-password=stats_dh5 \
    --enable-heterogenous-master=true \
    --net-ssh-option=port=20460 \
    --property=replicator.filter.pkey.addColumnsToDeletes=true \
    --property=replicator.filter.pkey.addPkeyToInserts=true

mysql1/tungsten/cluster-home/bin/startall
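
To confirm that the master replicator came up, the state field reported by trepctl status can be inspected. The sketch below assumes the usual layout in which trepctl lives under tungsten/tungsten-replicator/bin inside the install directory, and that the status output contains a line of the form "state : ONLINE". The helper name replicator_state is made up for illustration.

```shell
#!/bin/sh
# Assumed location of trepctl for the install directory used above.
TREPCTL=/user/app/tungsten/mysql1/tungsten/tungsten-replicator/bin/trepctl

# Hypothetical helper: read `trepctl status` output on stdin and
# print the value of the "state" field (for example ONLINE).
replicator_state() {
    sed -n 's/^state[[:space:]]*:[[:space:]]*//p'
}

if [ -x "$TREPCTL" ]; then
    state=$("$TREPCTL" status | replicator_state)
    echo "replicator state: $state"
else
    echo "trepctl not found at $TREPCTL"
fi
```

Any state other than ONLINE suggests checking the replicator log before continuing with the Hadoop side.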

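On the Hadoop machine, change into the tungsten-replicator directory, run the following command, and then start Tungsten. Again, trepctl, thl, and similar commands can be used to check the service status.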

./tools/tpm install hadoop1 \
    --batch-enabled=true \
    --batch-load-language=js \
    --batch-load-template=hadoop \
    --datasource-type=file \
    --install-directory=/user/app/tungsten/hadoop1 \
    --java-file-encoding=UTF8 \
    --java-user-timezone=GMT \
    --master=192.168.12.223 \
    --members=192.168.12.221 \
    --property=replicator.datasource.applier.csvType=hive \
    --property=replicator.stage.q-to-dbms.blockCommitInterval=1s \
    --property=replicator.stage.q-to-dbms.blockCommitRowCount=1000 \
    --skip-validation-check=DatasourceDBPort \
    --skip-validation-check=DirectDatasourceDBPort \
    --skip-validation-check=HostsFileCheck \
    --skip-validation-check=InstallerMasterSlaveCheck \
    --skip-validation-check=ReplicationServicePipelines \
    --rmi-port=25550

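Then check on the Hadoop file system that a directory for each replicated MySQL database has been created under the staging path, as shown below: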

└── user
......
......
    └── tungsten
        └── staging
            └── hadoop1
                └── db1
                    ├── x1
                    │   ├── x1-14.csv
                    │   └── x1-3.csv
                    └── x2
                        ├── x2-115.csv
                        ├── x2-15.csv
                        ├── x2-16.csv
                        ├── x2-17.csv
                        └── x2-18.csv
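
The same check can be scripted. The sketch below summarizes how many CSV files each table has in a recursive HDFS listing. It assumes the staging paths end in schema/table/file.csv as in the tree above. STAGING_DIR is a placeholder that must be set to the actual staging path, and the helper name csv_per_table is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `hdfs dfs -ls -R` output on stdin and print
# "<schema>.<table> <number of csv files>" for every table found.
csv_per_table() {
    awk '$NF ~ /\.csv$/ {
             n = split($NF, parts, "/")
             count[parts[n-2] "." parts[n-1]]++
         }
         END { for (k in count) print k, count[k] }' | sort
}

# STAGING_DIR is a placeholder for the service staging directory on HDFS.
if command -v hdfs >/dev/null 2>&1 && [ -n "${STAGING_DIR:-}" ]; then
    hdfs dfs -ls -R "$STAGING_DIR" | csv_per_table
fi
```

A table that exists in MySQL but never appears in this summary has either received no changes since replication started or is not being replicated.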

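Finally, the staging data must be merged into Hive: the Hive table structures are created and the data is made queryable from Hive. This is done with the load-reduce-check script from the continuent-tools-hadoop toolkit. Before using it, set up the Hive environment variables and start the Hive service on port 10000. Then copy the following jar files into the lib-ext directory of bristlecone: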

 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/hive-jdbc-0.13.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/hive-service-0.13.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/httpclient-4.2.5.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/commons-httpclient-3.0.1.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hive/apache-hive-0.13.1-bin/lib/httpcore-4.2.5.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hadoop/hadoop-2.4.0-onenode/share/hadoop/common/hadoop-common-2.4.0.jar /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
 cp -v /user/app/hadoop/hadoop-2.4.0-onenode/share/hadoop/common/lib/slf4j-* /user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext/
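
It can be worth verifying that the copies succeeded before running the script. The following sketch lists any expected jar that is absent from lib-ext. The helper name missing_jars is made up for illustration, and the slf4j jars copied by wildcard are not checked because their exact file names are not given here.

```shell
#!/bin/sh
# lib-ext directory used by the cp commands above.
LIB_EXT=/user/app/tungsten/hadoop1/tungsten/bristlecone/lib-ext

# Hypothetical helper: print each expected jar that is absent from a directory.
# Usage: missing_jars DIR JAR...
missing_jars() {
    dir=$1
    shift
    for jar in "$@"; do
        [ -f "$dir/$jar" ] || echo "$jar"
    done
}

# Prints nothing when every jar is present.
missing_jars "$LIB_EXT" \
    hive-jdbc-0.13.1.jar hive-exec-0.13.1.jar hive-service-0.13.1.jar \
    httpclient-4.2.5.jar commons-httpclient-3.0.1.jar httpcore-4.2.5.jar \
    hadoop-common-2.4.0.jar
```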

Then run the following commands:

On the first run, or whenever tables have been added or a table structure has changed:
./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ -u stats -p stats_dh5 --schema db1 --service=hadoop1 -r /user/app/tungsten/hadoop1  --no-compare


If the table structures have not changed and only the data needs to be reloaded, run:
./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ -u stats -p stats_dh5 --schema db1 --service=hadoop1 -r /user/app/tungsten/hadoop1  --no-base-ddl --no-staging-ddl --no-meta


To compare the data only (note that the compare step appears to be very slow):
./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ -u stats -p stats_dh5 --schema db1 --service=hadoop1 -r /user/app/tungsten/hadoop1  --no-base-ddl --no-staging-ddl --no-meta --no-materialize
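
The three invocations differ only in their trailing flags, so they can be wrapped in a small script. This is a sketch: the function names and the mode names (full, reload, compare) are made up for illustration, while the flags are exactly those used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: map a mode name to the load-reduce-check flags
# used in the three commands above.
lrc_flags() {
    case "$1" in
        full)    echo "--no-compare" ;;
        reload)  echo "--no-base-ddl --no-staging-ddl --no-meta" ;;
        compare) echo "--no-base-ddl --no-staging-ddl --no-meta --no-materialize" ;;
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper: run load-reduce-check for schema db1 in the given mode.
run_lrc() {
    flags=$(lrc_flags "$1") || return 1
    # $flags is intentionally unquoted so that it splits into separate options.
    ./bin/load-reduce-check -v -U jdbc:mysql:thin://192.168.12.223:3306/ \
        -u stats -p stats_dh5 --schema db1 --service=hadoop1 \
        -r /user/app/tungsten/hadoop1 $flags
}
```

For example, run_lrc reload issues the second command, and run_lrc full issues the first.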

References

Section 3.4, "Deploying MySQL to Hadoop Replication", in tungsten-replicator-3.0.pdf

https://github.com/continuent/continuent-tools-hadoop
