Installing DataX and Syncing Data from MySQL to MySQL
1. Prerequisites
1.1 JDK installation
- Download the JDK from the official site; here I install jdk-8u261.
- Extract the archive:

sudo mkdir -p /opt/moudle
sudo tar -zxvf jdk-8u261-linux-x64.tar.gz -C /opt/moudle/
- Set the environment variables (persisted in /etc/profile, as sketched below):

export JAVA_HOME=/opt/moudle/jdk1.8.0_261
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
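Since the next step sources /etc/profile, these exports are assumed to live in that file. A minimal sketch for appending them (the heredoc is a convenience choice, not from the original):

# append the JDK variables to /etc/profile (requires root)
sudo tee -a /etc/profile >/dev/null <<'EOF'
export JAVA_HOME=/opt/moudle/jdk1.8.0_261
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
EOF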
- Reload the configuration:

source /etc/profile
- Check Java:

java -version
# If the following appears, the installation succeeded:
java version "1.8.0_261"
Java(TM) SE Runtime Environment (build 1.8.0_261-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.261-b12, mixed mode)
1.2 Python installation
- Skipped here (the official docs recommend Python >= 2.6.x).
1.3 Hadoop single-node pseudo-distributed installation
2. Installing DataX
- DataX is Alibaba's offline synchronization tool for heterogeneous data sources. It provides stable and efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
- Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
- Extract:

tar -zxvf datax.tar.gz -C /opt/software/
- Run the self-check job:

cd /opt/software/datax/
bin/datax.py job/job.json
- If the job completes and prints a summary report, the self-check succeeded. The format of /opt/software/datax/job/job.json, annotated:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",        # stream reader, a reader plugin predefined by DataX
                    "parameter": {
                        "column": [                # every value listed in column is read into the stream
                            {
                                "type": "string",
                                "value": "DataX"
                            },
                            {
                                "type": "long",
                                "value": 19890604
                            },
                            {
                                "type": "date",
                                "value": "1989-06-04 00:00:00"
                            },
                            {
                                "type": "bool",
                                "value": true
                            },
                            {
                                "type": "bytes",
                                "value": "test"
                            }
                        ],
                        "sliceRecordCount": 100000
                    }
                },
                "writer": {
                    "name": "streamwriter",        # stream writer, a writer plugin predefined by DataX
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": false             # whether to print the records to the console
                    }
                }
            }
        ],
        "setting": {
            "errorLimit": {                        # errorLimit: error thresholds
                "percentage": 0.02,                # maximum tolerated error percentage: 2%
                "record": 0                        # maximum tolerated number of error records: 0
            },
            "speed": {                             # concurrency control: by byte or by channel; byte here
                "byte": 10485760                   # byte-based cap (10485760 bytes = 10 MB) on the space the printed records may occupy
            }
        }
    }
}
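Note that the # annotations above are explanatory only; a real DataX job file must be plain JSON with no comments. Before running an edited job it helps to validate the file; a small sketch using Python's built-in json.tool module:

# fails with a parse error if the job file is not well-formed JSON
python -m json.tool /opt/software/datax/job/job.json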
3. Basic Usage
3.1 Read data from a stream and print it to the console
- First, view the official json template:

# View the streamreader --> streamwriter template
python /opt/software/datax/bin/datax.py -r streamreader -w streamwriter

# The template looks like this:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

Please refer to the streamreader document:
    https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
    https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and use
    python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
- Write a json file based on the template (strip the # annotations before saving):

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {
                                "type": "string",
                                "value": "xujunkai, hello world!"
                            },
                            {
                                "type": "string",
                                "value": "徐俊凱, 你好!"
                            }
                        ],
                        "sliceRecordCount": "10"   # how many times each record is emitted
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "utf-8",       # output encoding
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {                             # concurrency control
                "channel": "2"                     # 2 channels; since this job prints, output is sliceRecordCount x channel = 20 lines. Against MySQL this would actually run concurrently.
            }
        }
    }
}
- Create the json file in the home directory:

mkdir json
cd json/
vim stream2stream.json
# paste the content above (without the # annotations)
- Run the job:
/opt/software/datax/bin/datax.py ./stream2stream.json
- The records are printed to the console as shown in the original figure (not reproduced here).
3.2 Bulk insert from MySQL to MySQL
3.2.1 Preparation
- Create the database and table on both the reader and writer sides:

# Create the database
create database `testdatax` character set utf8;

# Create the table
create table user1w(
    id int not null auto_increment,
    name varchar(10) not null,
    score int not null,
    primary key(`id`)
) engine=InnoDB default charset=utf8;
- Write a simple stored procedure to populate the table on the reader side:

DELIMITER //
create PROCEDURE add_user(in num INT)
BEGIN
    DECLARE rowid INT DEFAULT 0;
    DECLARE name CHAR(1);
    DECLARE score INT;
    WHILE rowid < num DO
        SET rowid = rowid + 1;
        set name = SUBSTRING('abcdefghijklmnopqrstuvwxyz', ROUND(1 + 25*RAND()), 1);
        set score = FLOOR(40 + (RAND()*60));
        insert INTO user1w (name, score) VALUES (name, score);
    END WHILE;
END //
DELIMITER ;
- Insert the data:
call add_user(10000);
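To verify the procedure populated the table, a quick check (assuming the mysql command-line client is available on the reader host; credentials as created above):

# row count should be 10000, plus a few sample rows
mysql -u root -p testdatax -e "select count(*) from user1w; select * from user1w limit 5;"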
3.2.2 View the mysql-to-mysql json template
- View the template:

python /opt/software/datax/bin/datax.py -r mysqlreader -w mysqlwriter

The annotated json template:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",         # reader plugin, predefined by DataX
                    "parameter": {
                        "column": [],              # columns to sync on the reader side
                        "splitPk": "",             # field used to shard the data during extraction
                        "connection": [
                            {
                                "jdbcUrl": [],     # reader-side connection info
                                "table": []        # reader-side tables
                            }
                        ],
                        "password": "",            # reader-side password
                        "username": "",            # reader-side username
                        "where": ""                # filter condition
                    }
                },
                "writer": {
                    "name": "mysqlwriter",         # writer plugin, predefined by DataX
                    "parameter": {
                        "column": [],              # columns to sync on the writer side
                        "connection": [
                            {
                                "jdbcUrl": "",     # writer-side connection info
                                "table": []        # writer-side tables
                            }
                        ],
                        "password": "",            # writer-side password
                        "preSql": [],              # SQL to execute before writing
                        "session": [],
                        "username": "",            # writer-side username
                        "writeMode": ""            # write mode (e.g. insert/replace/update)
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""                      # number of channels
            }
        }
    }
}
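A note on splitPk: when it is set (the config below uses "id"), DataX splits the value range of that key into shards and reads them concurrently, one shard per channel. The queries it issues are conceptually range scans like the following (illustrative SQL, not DataX's literal statements):

-- shard for one channel
select * from user1w where id >= 1 and id < 2001;
-- shard for the next channel
select * from user1w where id >= 2001 and id < 4001;
-- ... and so on across the remaining channels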
- My configuration json:

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123",
                        "column": ["*"],
                        "splitPk": "id",
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://<reader IP>:3306/testdatax?useUnicode=true&characterEncoding=utf8"
                                ],
                                "table": ["user1w"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://<writer IP>:3306/testdatax?useUnicode=true&characterEncoding=utf8",
                                "table": ["user1w"]
                            }
                        ],
                        "password": "123",
                        "preSql": [
                            "truncate user1w"
                        ],
                        "session": [
                            "set session sql_mode='ANSI'"
                        ],
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "5"
            }
        }
    }
}
- cd into DataX's bin directory and run:
python2 datax.py /root/json/mysql2mysql.json
- When the job finishes, it prints the synchronization statistics. For more configuration options, see the DataX GitHub repository.
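A simple way to confirm the sync succeeded is to compare row counts on the two sides; a hedged sketch with placeholder hosts and the credentials used above:

# counts should match after the job completes
mysql -h <reader IP> -u root -p testdatax -e "select count(*) from user1w;"
mysql -h <writer IP> -u root -p testdatax -e "select count(*) from user1w;"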
3.3 Importing MySQL data into HDFS

# View the mysqlreader --> hdfswriter template
python /opt/software/datax/bin/datax.py -r mysqlreader -w hdfswriter
- To be continued...
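Until the section is completed, here is a minimal, hedged sketch of a mysql-to-HDFS job. The defaultFS address (hdfs://localhost:9000), the /datax/user1w path (which must already exist), and the column list are assumptions to adapt; the hdfswriter parameter names follow its official documentation, but verify them against that doc before use.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123",
                        "column": ["id", "name", "score"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://<reader IP>:3306/testdatax?useUnicode=true&characterEncoding=utf8"],
                                "table": ["user1w"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://localhost:9000",
                        "fileType": "text",
                        "path": "/datax/user1w",
                        "fileName": "user1w",
                        "column": [
                            {"name": "id", "type": "INT"},
                            {"name": "name", "type": "STRING"},
                            {"name": "score", "type": "INT"}
                        ],
                        "fieldDelimiter": "\t",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}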