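A Quick Introduction to DataX
Overview
What is DataX
DataX is an offline synchronization tool for heterogeneous data sources, open-sourced by Alibaba. It aims to provide stable and efficient data synchronization between heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase and FTP.
Design of DataX
To solve the problem of synchronizing heterogeneous data sources, DataX turns a complex mesh of point-to-point sync links into a star-shaped topology, with DataX acting as the intermediate transport that connects all the data sources.
When a new data source needs to be added, it only has to be connected to DataX, and it can then sync seamlessly with every data source that is already supported.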
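To make the benefit of the star topology concrete, here is a small sketch that compares how many direct links a full mesh needs with how many adapters a hub needs. The function names are illustrative only and are not part of DataX, and the count assumes every data source must be able to sync in both directions with every other one.

```python
def mesh_links(n):
    """Directed source-to-target links when every pair syncs directly."""
    return n * (n - 1)


def star_plugins(n):
    """One reader plus one writer per data source attached to the hub."""
    return 2 * n


# Compare both designs for a few sizes.
for n in (3, 6, 10):
    print(n, "sources:", mesh_links(n), "mesh links vs", star_plugins(n), "hub plugins")
```

The mesh grows quadratically with the number of sources, while the hub grows linearly, which is why adding one more source only requires connecting it to DataX.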
Framework design
How it runs
Quick start
Official links
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source code: https://github.com/alibaba/DataX
Prerequisites
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended)
Installation
1) Upload the downloaded datax.tar.gz to /opt/software on the host named other
[root@other software]$ ls datax.tar.gz
2) Extract datax.tar.gz to /opt/module
[root@other software]$ tar -zxvf datax.tar.gz -C /opt/module/
3) Run the self-check script
[root@other ~]# cd /opt/module/datax/bin/
[root@other bin]# ll
total 40
-rwxr-xr-x 1 62265 users 8993 Nov 24 2017 datax.py
-rwxr-xr-x 1 62265 users 6906 Nov 24 2017 dxprof.py
-rwxr-xr-x 1 62265 users 16897 Nov 24 2017 perftrace.py
[root@other bin]# python datax.py /opt/module/datax/job/job.json
Usage examples
Read data from a stream and print it to the console
1) View the configuration template
[root@other bin]# python datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [],
"sliceRecordCount": ""
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
[root@other bin]#
2) Write a configuration file based on the template
[root@other job]# cat stream2stream.json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
[root@other job]#
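If you would rather generate this file than type it by hand, the following is a minimal sketch in Python 3. The helper build_stream_job is a hypothetical name, not something shipped with DataX; it only reproduces the fields shown in the template and round-trips the result through the json module to confirm that the text is valid JSON.

```python
import json


def build_stream_job(record_count, columns, channel=1):
    """Return a job dict shaped like the streamreader/streamwriter template."""
    return {
        "job": {
            "content": [
                {
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": columns,
                        },
                    },
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            "setting": {"speed": {"channel": channel}},
        }
    }


job = build_stream_job(
    10,
    [
        {"type": "long", "value": "10"},
        {"type": "string", "value": "hello,DataX"},
    ],
)
text = json.dumps(job, indent=4)
json.loads(text)  # raises ValueError if the text is not valid JSON
print(text)
```

Redirecting the printed text to stream2stream.json gives a file equivalent to the hand-written one.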
3) Run
[root@other job]$ /opt/module/datax/bin/datax.py /opt/module/datax/job/stream2stream.json
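The same command can be wrapped in a small script when jobs need to be launched from other tooling. This is only a sketch: the wrapper itself is written for Python 3, DATAX_HOME is the install path used in this post, and the python argument is an assumption about which interpreter on your machine should run datax.py (Python 2 is the recommended one above).

```python
import subprocess
import sys

DATAX_HOME = "/opt/module/datax"  # install path used in this post


def datax_command(job_file, datax_home=DATAX_HOME, python="python"):
    """Build the argument list for launching one DataX job."""
    return [python, datax_home + "/bin/datax.py", job_file]


def run_job(job_file):
    """Run the job and return the exit code of the datax.py process."""
    result = subprocess.run(datax_command(job_file))
    return result.returncode


if __name__ == "__main__":
    # Usage: python3 run_datax.py /opt/module/datax/job/stream2stream.json
    if len(sys.argv) > 1:
        sys.exit(run_job(sys.argv[1]))
```

Passing the arguments as a list avoids shell quoting problems if a job file path contains spaces.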
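Oracle database
I installed Oracle directly with Docker; my earlier blog post covers that setup if you need it.
Create a user
Create a table and insert some test data: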
SQL> create table student(id INTEGER, name VARCHAR2(20));
SQL> insert into student values (1,'zhangsan');
SQL> commit;
SQL> select * from student;
ID NAME
---------- ----------------------------------------
1 zhangsan
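SQL differences between Oracle and MySQL
Type | Oracle | MySQL |
---|---|---|
Integer | number(N)/integer | int/integer |
Floating point | float | float/double |
String | varchar2(N) | varchar(N) |
NULL | '' is treated as NULL | NULL and '' are different |
Paging | rownum | limit |
Double quotes ("") | Heavily restricted; not usable for string literals | Same as single quotes (by default) |
Price | Closed source, paid | Open source, free |
Auto-increment primary key | × | √ |
if not exists | × | √ |
auto_increment | × | √ |
create database | × | √ |
Table alias with AS (select * from table as t) | × | √ |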
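The paging row is the difference that most often trips people up, so here is a small sketch that builds the equivalent paging query for each database. The function names are made up for illustration, and plain string formatting is used only to keep the example short; it is not safe for untrusted input. Neither query has an ORDER BY, so the page contents are not guaranteed to be stable.

```python
def mysql_page(table, offset, limit):
    """MySQL paging with LIMIT and OFFSET."""
    return "SELECT * FROM {0} LIMIT {1} OFFSET {2}".format(table, limit, offset)


def oracle_page(table, offset, limit):
    """Oracle paging with ROWNUM; the table alias is written without AS."""
    return (
        "SELECT * FROM ("
        "SELECT t.*, ROWNUM rn FROM {0} t WHERE ROWNUM <= {1}"
        ") WHERE rn > {2}"
    ).format(table, offset + limit, offset)


# Second page of 10 rows from the student table.
print(mysql_page("student", 10, 10))
print(oracle_page("student", 10, 10))
```

The Oracle version needs the inner query because ROWNUM is assigned as rows are returned, so a direct condition such as ROWNUM > 10 would never match any row.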
DataX example
Read data from Oracle and store it in MySQL
1) Create the table in MySQL
mysql> create database oracle;
mysql> use oracle;
mysql> create table student(id int,name varchar(20));
2) Write the DataX configuration file
[root@other job]# cat oracle2mysql.json
{
"job": {
"content": [
{
"reader": {
"name": "oraclereader",
"parameter": {
"column": ["*"],
"connection": [
{
"jdbcUrl": ["jdbc:oracle:thin:@192.168.1.121:1521:helowin"],
"table": ["student"]
}
],
"password": "123456",
"username": "dalianpai"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["*"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://192.168.1.121:3306/oracle",
"table": ["student"]
}
],
"password": "root",
"username": "root",
"writeMode": "insert"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
[root@other job]#
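One detail in this config is easy to miss: the reader's jdbcUrl is a JSON array, while the writer's jdbcUrl is a plain string. The sketch below checks a job dict for that shape and for mismatched column counts before you run it. It only mirrors the shapes used in the config above and is not DataX's own validation; check_rdbms_job and the sample URLs are hypothetical.

```python
def check_rdbms_job(job):
    """Return a list of problems found in a reader/writer job dict."""
    problems = []
    for item in job["job"]["content"]:
        reader = item["reader"]["parameter"]
        writer = item["writer"]["parameter"]
        for conn in reader["connection"]:
            if not isinstance(conn["jdbcUrl"], list):
                problems.append("reader jdbcUrl should be a list")
        for conn in writer["connection"]:
            if not isinstance(conn["jdbcUrl"], str):
                problems.append("writer jdbcUrl should be a string")
        # Column counts can only be compared when both sides list them explicitly.
        if "*" not in reader["column"] and "*" not in writer["column"]:
            if len(reader["column"]) != len(writer["column"]):
                problems.append("reader and writer column counts differ")
    return problems


sample = {
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "column": ["id", "name"],
                        "connection": [
                            # Placeholder URL for illustration only.
                            {"jdbcUrl": ["jdbc:oracle:thin:@127.0.0.1:1521:example"],
                             "table": ["student"]}
                        ],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["id", "name"],
                        "connection": [
                            # Placeholder URL for illustration only.
                            {"jdbcUrl": "jdbc:mysql://127.0.0.1:3306/example",
                             "table": ["student"]}
                        ],
                    },
                },
            }
        ]
    }
}
print(check_rdbms_job(sample))
```

Listing the columns explicitly, as in the sample, also makes the job less fragile than "*" if the two tables ever drift apart.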
3) Run the command
/opt/module/datax/bin/datax.py /opt/module/datax/job/oracle2mysql.json
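Output:
Result:
Note: this is only a brief demonstration. My HDFS is installed in CDH and I did not feel like starting that many virtual machines, so I will continue exploring when I have time. datax-web seems friendlier and also provides a web interface.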