Getting Started with DataX


A Quick Introduction to DataX

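Overview

What is DataX

DataX is an open-source offline synchronization tool for heterogeneous data sources, developed by Alibaba. It aims to provide stable and efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, and others), HDFS, Hive, ODPS, HBase, and FTP.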

image.png

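DataX Design

To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of synchronization links into a star-shaped topology, with DataX acting as the intermediate transport layer that connects the data sources.

When a new data source needs to be added, it only has to be connected to DataX to synchronize seamlessly with the existing data sources.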

image.png
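To make the benefit of the star topology concrete, the sketch below compares how many integrations each layout needs. It assumes every data source must be able to sync to every other one in both directions; the function names are illustrative and are not part of DataX.

```python
def mesh_links(n: int) -> int:
    """Directed point-to-point links needed when each source syncs directly to every other."""
    return n * (n - 1)


def star_plugins(n: int) -> int:
    """Plugins needed when each source only talks to the hub: one reader and one writer each."""
    return 2 * n


if __name__ == "__main__":
    for n in (3, 5, 10):
        print("sources=%d mesh_links=%d star_plugins=%d" % (n, mesh_links(n), star_plugins(n)))
```

The mesh count grows quadratically while the star count grows linearly, which is why adding one more source only costs one reader and one writer.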

Framework Design

image.png

How It Works

image.png

Quick Start

Official Links

Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

Source code: https://github.com/alibaba/DataX

Prerequisites

  • Linux
  • JDK (1.8 or later; 1.8 recommended)
  • Python (Python 2.6.x recommended)
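The prerequisites above can be checked roughly with a short script. This is only a sketch: it is written for Python 3 (which may differ from the interpreter you use to launch DataX), the function name is made up for illustration, and it only checks that a java executable is on the PATH, not which JDK version it is.

```python
import platform
import shutil
import sys


def check_prerequisites() -> dict:
    """Return a rough view of whether the host matches the listed prerequisites."""
    return {
        "linux": platform.system() == "Linux",             # DataX is expected to run on Linux
        "java_on_path": shutil.which("java") is not None,  # does not verify the JDK version
        "python_version": "%d.%d" % sys.version_info[:2],  # version of the interpreter running this script
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print("%s: %s" % (name, value))
```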

Installation

1) Upload the downloaded datax.tar.gz to /opt/software on the host named other

[root@other software]$ ls datax.tar.gz

2) Extract datax.tar.gz to /opt/module

[root@other software]$ tar -zxvf datax.tar.gz -C /opt/module/

3) Run the self-check script

[root@other ~]# cd /opt/module/datax/bin/
[root@other bin]# ll
total 40
-rwxr-xr-x 1 62265 users  8993 Nov 24  2017 datax.py
-rwxr-xr-x 1 62265 users  6906 Nov 24  2017 dxprof.py
-rwxr-xr-x 1 62265 users 16897 Nov 24  2017 perftrace.py
[root@other bin]# python datax.py /opt/module/datax/job/job.json

image-20200908233051258
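If you want to launch the same job from a script, for example from a scheduler, a minimal sketch looks like the following. The install path is the one used in this article; the helper names are hypothetical, and treating a non-zero exit code as failure is the usual process convention rather than something documented here.

```python
import subprocess
from pathlib import PurePosixPath

# Install location used in this article; adjust it to your environment.
DATAX_HOME = PurePosixPath("/opt/module/datax")


def build_command(job_file: str, python: str = "python") -> list:
    """Build the argument list equivalent to `python datax.py <job.json>`."""
    return [python, str(DATAX_HOME / "bin" / "datax.py"), job_file]


def run_job(job_file: str) -> int:
    """Run a DataX job and return the exit code of the process."""
    return subprocess.run(build_command(job_file)).returncode


# Example call (requires DataX to be installed):
# run_job("/opt/module/datax/job/job.json")
```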

Usage Examples

Read data from a stream and print it to the console

1) View the configuration template

[root@other bin]# python datax.py -r streamreader -w streamwriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the streamreader document:
     https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
[root@other bin]#

2) Write a configuration file based on the template

[root@other job]# cat stream2stream.json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 10,
            "column": [
              {
                "type": "long",
                "value": "10"
              },
              {
                "type": "string",
                "value": "hello,DataX"
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "encoding": "UTF-8",
            "print": true
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 1
      }
    }
  }
}
[root@other job]#
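Instead of editing the JSON by hand, the same job file can be generated from a script, which helps when only one or two values change between runs. The sketch below rebuilds the configuration shown above; the function name and its parameters are illustrative and not part of DataX.

```python
import json


def build_stream_job(slice_record_count: int = 10, channel: int = 1) -> dict:
    """Build the stream-to-stream job shown above as a Python dict."""
    reader = {
        "name": "streamreader",
        "parameter": {
            "sliceRecordCount": slice_record_count,
            "column": [
                {"type": "long", "value": "10"},
                {"type": "string", "value": "hello,DataX"},
            ],
        },
    }
    writer = {
        "name": "streamwriter",
        "parameter": {"encoding": "UTF-8", "print": True},
    }
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": {"channel": channel}},
        }
    }


if __name__ == "__main__":
    # Print the JSON; redirect it to stream2stream.json to use it with datax.py.
    print(json.dumps(build_stream_job(), indent=2))
```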

3) Run the job

[root@other job]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/stream2stream.json

image-20200908233724873

Oracle Database

I installed Oracle directly with Docker here; if you need the steps, see my earlier blog post.

Create a User

image-20200908175022813

Create a table and insert data:

SQL>create TABLE student(id INTEGER,name VARCHAR2(20));
SQL>insert into student values (1,'zhangsan');
SQL> select * from student; 
        ID 	NAME
---------- ----------------------------------------
         1 	zhangsan

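SQL Differences Between Oracle and MySQL

Item                        Oracle                                 MySQL
--------------------------  -------------------------------------  -------------------------
Integer                     number(N)/integer                      int/integer
Floating point              float                                  float/double
String                      varchar2(N)                            varchar(N)
NULL                        '' is treated as NULL                  NULL and '' are different
Pagination                  rownum                                 limit
Double quotes ("")          heavily restricted, generally avoided  same as single quotes
Price                       closed source, paid                    open source, free
Auto-increment primary key  ×
if not exists               ×
auto_increment              ×
create database             ×
select * from table as t    ×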
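The type rows of this comparison can be turned into a small helper for translating an Oracle column definition into a MySQL one, for instance when creating the target table by hand. This is a hypothetical helper covering only the types listed above. Mapping every number(N) to int is a simplification (a large N may need bigint or decimal), and choosing double for float is an arbitrary pick between the two MySQL options.

```python
import re


def oracle_type_to_mysql(oracle_type: str) -> str:
    """Map an Oracle column type to a MySQL type, covering only the types in the comparison."""
    t = oracle_type.strip().lower()
    match = re.fullmatch(r"varchar2\((\d+)\)", t)
    if match:
        return "varchar(%s)" % match.group(1)
    if t == "integer" or re.fullmatch(r"number\(\d+\)", t):
        return "int"  # simplification: ignores the precision N
    if t == "float":
        return "double"
    raise ValueError("type not covered by this sketch: %r" % oracle_type)


if __name__ == "__main__":
    # Columns of the student table created earlier.
    for column, oracle_type in (("id", "INTEGER"), ("name", "VARCHAR2(20)")):
        print(column, oracle_type_to_mysql(oracle_type))
```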

DataX Example

Read data from Oracle and write it to MySQL

1) Create the table in MySQL

mysql> create database oracle;
mysql> use oracle;
mysql> create table student(id int,name varchar(20));

2) Write the DataX configuration file

[root@other job]# cat oracle2mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@192.168.1.121:1521:helowin"],
                                "table": ["student"]
                            }
                        ],
                        "password": "123456",
                        "username": "dalianpai"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.1.121:3306/oracle",
                                "table": ["student"]
                            }
                        ],
                        "password": "root",
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
[root@other job]#
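In the file above, the reader's jdbcUrl is a list while the writer's jdbcUrl is a single string, and mixing the two up is an easy mistake. The sketch below is a hypothetical pre-flight check that mirrors the shapes used in this example; it is not an official DataX schema or validator.

```python
import json
import sys


def check_job(job: dict) -> list:
    """Return a list of problems found; an empty list means nothing suspicious was seen."""
    problems = []
    for index, item in enumerate(job["job"]["content"]):
        reader = item["reader"]["parameter"]
        writer = item["writer"]["parameter"]
        # Both sides in the example carry these four keys.
        for side, params in (("reader", reader), ("writer", writer)):
            for key in ("username", "password", "column", "connection"):
                if key not in params:
                    problems.append("content[%d].%s is missing '%s'" % (index, side, key))
        # In the example the reader lists its JDBC URLs, the writer gives one string.
        for conn in reader.get("connection", []):
            if not isinstance(conn.get("jdbcUrl"), list):
                problems.append("content[%d].reader jdbcUrl should be a list" % index)
        for conn in writer.get("connection", []):
            if not isinstance(conn.get("jdbcUrl"), str):
                problems.append("content[%d].writer jdbcUrl should be a string" % index)
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for problem in check_job(json.load(handle)):
            print(problem)
```

Saved under a name of your choice, the script takes the path of the job file as its only argument and prints nothing when no problem is found.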

3) Run the command

/opt/module/datax/bin/datax.py /opt/module/datax/job/oracle2mysql.json

Output:

image-20200908225726607

Result:

image-20200908234316845

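Note: this is only a simple demonstration. My HDFS is installed in CDH and I did not feel like starting that many virtual machines, so I will look into it further when I have time. datax-web seems to be friendlier and also provides a web interface.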

