Installing DataX and Using It for MySQL-to-MySQL Data Synchronization

1. Prerequisites

1.1 Installing the JDK

  • Download the JDK from the official site; this guide uses jdk-8u261.

  • Extract it:

    sudo mkdir -p /opt/moudle
    sudo tar -zxvf jdk-8u261-linux-x64.tar.gz -C /opt/moudle/
    
    
  • Set the environment variables (e.g. append to /etc/profile):

    export JAVA_HOME=/opt/moudle/jdk1.8.0_261
    export JRE_HOME=${JAVA_HOME}/jre
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
    export PATH=${JAVA_HOME}/bin:$PATH
    
  • Reload the configuration:

    source /etc/profile
    
  • Verify the Java installation:

    java -version
    
    # output like the following means the installation succeeded
    java version "1.8.0_261"
    Java(TM) SE Runtime Environment (build 1.8.0_261-b12)
    Java HotSpot(TM) 64-Bit Server VM (build 25.261-b12, mixed mode)
    

1.2 Installing Python

  • Omitted here (DataX officially recommends Python >= 2.6.x).

1.3 Hadoop single-node pseudo-distributed installation

2. Installing DataX

  • DataX is Alibaba's offline synchronization tool for heterogeneous data sources. It provides stable, efficient data synchronization between relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and other heterogeneous sources.


  • Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

  • Extract it:

    tar -zxvf datax.tar.gz -C /opt/software/
    
  • Run the self-check job:

    cd /opt/software/datax/
    bin/datax.py job/job.json
    
  • A successful run ends with a job summary (start/end time, records read, error count).

  • The bundled /opt/software/datax/job/job.json looks like this (annotated; the `#` comments are explanatory only and must be removed for the file to be valid JSON):
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",          # built-in stream reader
                    "parameter": {
                        "column": [                  # every value listed here is emitted into the stream
                            {
                                "type": "string",
                                "value": "DataX"
                            },
                            {
                                "type": "long",
                                "value": 19890604
                            },
                            {
                                "type": "date",
                                "value": "1989-06-04 00:00:00"
                            },
                            {
                                "type": "bool",
                                "value": true
                            },
                            {
                                "type": "bytes",
                                "value": "test"
                            }
                        ],
                        "sliceRecordCount": 100000   # records generated per channel
                    }
                },
                "writer": {
                    "name": "streamwriter",          # built-in stream writer
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": false               # do not echo each record to the console
                    }
                }
            }
        ],
        "setting": {
            "errorLimit": {                          # error tolerance
                "percentage": 0.02,                  # abort if more than 2% of records fail
                "record": 0                          # abort as soon as one record fails
            },
            "speed": {                               # throughput control, by byte or by channel
                "byte": 10485760                     # byte-rate limit: 10 MB
            }
        }
    }
}
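
Strict JSON does not allow comments, so annotated configs like the one above cannot be fed to DataX as-is. A minimal sketch (plain Python, not part of DataX) that strips `#`-style trailing comments and validates the result; it assumes no `#` characters occur inside string values:

```python
import json
import re

def strip_comments(text):
    """Remove '#'-style trailing comments so annotated JSON becomes valid.

    Naive sketch: assumes no '#' appears inside a JSON string value.
    """
    return re.sub(r"#[^\n]*", "", text)

annotated = """
{
    "job": {
        "setting": {"speed": {"byte": 10485760}}  # 10 MB rate limit
    }
}
"""

job = json.loads(strip_comments(annotated))
print(job["job"]["setting"]["speed"]["byte"])  # -> 10485760
```
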

3. Basic Usage

3.1 Reading from a stream and printing to the console

  • First look at the official JSON template:

    # view the streamreader --> streamwriter template
    python /opt/software/datax/bin/datax.py -r streamreader -w streamwriter
    # the template output:
    DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
    Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
    
    
    Please refer to the streamreader document:
         https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md 
    
    Please refer to the streamwriter document:
         https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md 
     
    Please save the following configuration as a json file and  use
         python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json 
    to run the job.
    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader", 
                        "parameter": {
                            "column": [], 
                            "sliceRecordCount": ""
                        }
                    }, 
                    "writer": {
                        "name": "streamwriter", 
                        "parameter": {
                            "encoding": "", 
                            "print": true
                        }
                    }
                }
            ], 
            "setting": {
                "speed": {
                    "channel": ""
                }
            }
        }
    }
    
  • Write a JSON file based on the template (comments are explanatory only; remove them before running):

    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "column": [
                                {
                                    "type": "string",
                                    "value": "xujunkai, hello world!"
                                },
                                {
                                    "type": "string",
                                    "value": "徐俊凱, 你好!"
                                }
                            ],
                            "sliceRecordCount": "10"    # records per channel
                        }
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {
                            "encoding": "utf-8",        # output encoding
                            "print": true
                        }
                    }
                }
            ],
            "setting": {
                "speed": {                              # concurrency control
                    "channel": "2"                      # 2 channels; streamwriter prints sliceRecordCount × channel = 20 records here. With a real database writer the channels genuinely run in parallel.
                }
            }
        }
    }
    
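With streamwriter, every channel replays the full slice, so the console record count is sliceRecordCount multiplied by the channel count. Plain arithmetic (not DataX code), using the values from the job above:

```python
# streamwriter prints every record from every channel, so the total number
# of console records = sliceRecordCount * channel.
slice_record_count = int("10")  # reader's "sliceRecordCount" (given as a string)
channels = int("2")             # setting.speed "channel"
total_records = slice_record_count * channels
print(total_records)  # -> 20
```
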
  • Create the JSON file (here under the home directory):

    mkdir json
    cd json/
    vim stream2stream.json
    # paste in the JSON above
    
  • Run the job:

    /opt/software/datax/bin/datax.py ./stream2stream.json
    
  • The records and a job summary are printed to the console.

3.2 Bulk insert from MySQL to MySQL

3.2.1 Preparation
  • On both the source and target instances, create the database and table:

    # create the database
    create database `testdatax` character set utf8;
    # create the table
    create table user1w(
    id int not null auto_increment,
    name varchar(10) not null,
    score int not null,
    primary key(`id`))engine=InnoDB default charset=utf8;
    
  • Write a simple stored procedure to populate the table on the source side:

    DELIMITER //
    create PROCEDURE add_user(in num INT)
    BEGIN
        DECLARE rowid INT DEFAULT 0;
        DECLARE name CHAR(1);
        DECLARE score INT;
        WHILE rowid < num DO
            SET rowid = rowid + 1;
            SET name = SUBSTRING('abcdefghijklmnopqrstuvwxyz', ROUND(1+25*RAND()), 1);
            SET score = FLOOR(40 + (RAND()*60));
            INSERT INTO user1w (name, score) VALUES (name, score);
        END WHILE;
    END //
    DELIMITER ;
    
  • Insert 10,000 rows of test data:

    call add_user(10000);
    
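The stored procedure draws a random lowercase letter for `name` and a score between 40 and 99. The same generation logic, sketched in Python purely to clarify what the SQL does (function name is illustrative, not part of DataX):

```python
import random
import string

def make_user_rows(num):
    """Mirror add_user(num): a one-letter random name, score in [40, 99]."""
    rows = []
    for _ in range(num):
        name = random.choice(string.ascii_lowercase)  # SUBSTRING('abc...', ROUND(1+25*RAND()), 1)
        score = 40 + random.randrange(60)             # FLOOR(40 + RAND()*60)
        rows.append((name, score))
    return rows

rows = make_user_rows(10000)
print(len(rows))  # -> 10000
```
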
3.2.2 The mysqlreader --> mysqlwriter template
  • Run python /opt/software/datax/bin/datax.py -r mysqlreader -w mysqlwriter; it prints the following template (annotations added):

    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",       # built-in MySQL reader
                        "parameter": {
                            "column": [],            # columns to read on the source side
                            "splitPk": "",           # column used to split the data into shards during extraction
                            "connection": [
                                {
                                    "jdbcUrl": [],   # source connection URL(s)
                                    "table": []      # source table(s)
                                }
                            ],
                            "password": "",          # source password
                            "username": "",          # source username
                            "where": ""              # optional row filter
                        }
                    },
                    "writer": {
                        "name": "mysqlwriter",       # built-in MySQL writer
                        "parameter": {
                            "column": [],            # columns to write on the target side
                            "connection": [
                                {
                                    "jdbcUrl": "",   # target connection URL
                                    "table": []      # target table(s)
                                }
                            ],
                            "password": "",          # target password
                            "preSql": [],            # SQL to run before the write starts
                            "session": [],
                            "username": "",          # target username
                            "writeMode": ""          # write mode (insert/replace/update)
                        }
                    }
                }
            ],
            "setting": {
                "speed": {
                    "channel": ""                    # number of channels
                }
            }
        }
    }
    
  • My job configuration (saved as mysql2mysql.json; replace the placeholder hosts):

    {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "mysqlreader",
                        "parameter": {
                            "username": "root",
                            "password": "123",
                            "column": ["*"],
                            "splitPk": "id",
                            "connection": [
                                {
                                    "jdbcUrl": [
                                        "jdbc:mysql://<source-ip>:3306/testdatax?useUnicode=true&characterEncoding=utf8"
                                    ],
                                    "table": ["user1w"]
                                }
                            ]
                        }
                    },
                    "writer": {
                        "name": "mysqlwriter",
                        "parameter": {
                            "column": ["*"],
                            "connection": [
                                {
                                    "jdbcUrl": "jdbc:mysql://<target-ip>:3306/testdatax?useUnicode=true&characterEncoding=utf8",
                                    "table": ["user1w"]
                                }
                            ],
                            "password": "123",
                            "preSql": [
                                "truncate user1w"
                            ],
                            "session": [
                                "set session sql_mode='ANSI'"
                            ],
                            "username": "root",
                            "writeMode": "insert"
                        }
                    }
                }
            ],
            "setting": {
                "speed": {
                    "channel": "5"
                }
            }
        }
    }
    
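Rather than hand-editing JSON (where a missed comma breaks the job), the configuration can be assembled programmatically. A hedged sketch in plain Python (the helper name and placeholder hosts are mine, not DataX's) that builds the same reader/writer structure:

```python
import json

def mysql_job(src_url, dst_url, table, user, password, channels=5):
    """Assemble a mysqlreader -> mysqlwriter DataX job as a dict."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": user,
                        "password": password,
                        "column": ["*"],
                        "splitPk": "id",
                        "connection": [{"jdbcUrl": [src_url], "table": [table]}],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": user,
                        "password": password,
                        "column": ["*"],
                        "preSql": ["truncate %s" % table],  # empty the target first
                        "writeMode": "insert",
                        "connection": [{"jdbcUrl": dst_url, "table": [table]}],
                    },
                },
            }],
            "setting": {"speed": {"channel": str(channels)}},
        }
    }

job = mysql_job(
    "jdbc:mysql://<source-ip>:3306/testdatax?useUnicode=true&characterEncoding=utf8",
    "jdbc:mysql://<target-ip>:3306/testdatax?useUnicode=true&characterEncoding=utf8",
    "user1w", "root", "123")
# json.dumps guarantees the output is syntactically valid JSON
print(job["job"]["setting"]["speed"]["channel"])  # -> 5
```

Write the result with `json.dump(job, open("mysql2mysql.json", "w"), indent=4)` and pass that file to datax.py as below.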
  • cd into the DataX bin directory and run:

    python2 datax.py /root/json/mysql2mysql.json
    
  • DataX prints the synchronization statistics when the job finishes. See the DataX GitHub repository for more configuration options.

3.3 Importing MySQL data into HDFS

python /opt/software/datax/bin/datax.py -r mysqlreader -w hdfswriter
  • To be continued...

