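1. Introduction
TxtFileReader reads data stored on the local file system. Under the hood, it fetches the local file data, converts it to the DataX transfer protocol, and passes it to the Writer.
2. Configuration template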
{
  "setting": {},
  "job": {
    "setting": {
      "speed": {
        "channel": 2
      }
    },
    "content": [
      {
        "reader": {
          "name": "txtfilereader",
          "parameter": {
            "path": ["/home/haiwei.luo/case00/data"],
            "encoding": "UTF-8",
            "column": [
              { "index": 0, "type": "long" },
              { "index": 1, "type": "boolean" },
              { "index": 2, "type": "double" },
              { "index": 3, "type": "string" },
              { "index": 4, "type": "date", "format": "yyyy.MM.dd" }
            ],
            "fieldDelimiter": ","
          }
        },
        "writer": {
          "name": "txtfilewriter",
          "parameter": {
            "path": "/home/haiwei.luo/case00/result",
            "fileName": "luohw",
            "writeMode": "truncate",
            "format": "yyyy-MM-dd"
          }
        }
      }
    ]
  }
}
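3. Usage notes
- Reads TXT files only, and the schema inside the TXT file must be a single two-dimensional table.
- Supports CSV-like files with a custom delimiter.
- Supports reading multiple data types (represented as String), column pruning, and constant columns.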
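To make column pruning and constant columns a bit more concrete, here is a rough Python sketch of what a column list does to each line. It only illustrates the idea and is not DataX code: `project_row` and `read_rows` are hypothetical helpers, the sample data is made up, and the `value` key used for the constant column is my assumption about how a constant is configured.

```python
import csv
import io


def project_row(fields, columns):
    """Apply a txtfilereader-style column list to one already-split line."""
    out = []
    for col in columns:
        # Assumption: a constant column carries "value" instead of "index".
        raw = col["value"] if "value" in col else fields[col["index"]]
        if col["type"] == "long":
            out.append(int(raw))
        elif col["type"] == "double":
            out.append(float(raw))
        else:
            out.append(raw)  # everything else stays a string in this sketch
    return out


def read_rows(text, columns, delimiter=","):
    """Split delimited text and keep only the configured columns."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [project_row(fields, columns) for fields in reader]


columns = [
    {"index": 0, "type": "long"},            # keep column 0
    {"index": 3, "type": "string"},          # keep column 3, drop 1 and 2
    {"type": "string", "value": "batch-1"},  # constant column (made-up value)
]
sample = "1;true;2.5;abc\n2;false;3.5;def\n"
rows = read_rows(sample, columns, delimiter=";")
print(rows)  # [[1, 'abc', 'batch-1'], [2, 'def', 'batch-1']]
```

Columns 1 and 2 are simply never referenced, which is all that pruning means here, and the constant is appended to every row regardless of the file content.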
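4. In practice
I recently had to migrate a table. Its data lived in Hive, where a Python script processed the data and inserted it directly. The table now needs to be loaded into Greenplum. It holds roughly 72 million rows.
Approach: export the Hive data to a CSV file, then load it into Greenplum with DataX.
Let's get started:
Write the JSON job file: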
{
  "content": [
    {
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "column": [
            { "format": "yyyy-MM-dd", "index": 0, "type": "date" },
            { "index": 1, "type": "string" },
            { "index": 2, "type": "string" },
            { "index": 3, "type": "string" },
            { "index": 4, "type": "string" },
            { "index": 5, "type": "long" },
            { "index": 6, "type": "long" },
            { "index": 7, "type": "long" },
            { "index": 8, "type": "long" }
          ],
          "encoding": "utf-8",
          "fieldDelimiter": ",",
          "path": ["/home/tianyafu/flux_timecount_action.csv"]
        }
      },
      "writer": {
        "name": "gpdbwriter",
        "parameter": {
          "column": ["record_date", "outid", "tm_type", "serv", "app", "down_flux", "up_flux", "seconds", "count"],
          "connection": [
            {
              "jdbcUrl": "jdbc:postgresql://192.168.100.21:5432/ods",
              "table": ["ods_flux_timecount_action"]
            }
          ],
          "password": "******",
          "segment_reject_limit": 0,
          "username": "admin"
        }
      }
    }
  ],
  "setting": {
    "errorLimit": { "percentage": 0.02, "record": 0 },
    "speed": { "channel": "1" }
  }
}
And then it failed.
The cause turned out to be null values in the data, which could not be converted to the Long type.
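With a file this size it helps to see exactly which strings are breaking the conversion before touching the job file. Below is a small Python sketch that tallies the values in the long columns that do not parse as integers. `find_bad_longs` is a hypothetical helper, the sample rows are invented, and Python's `int()` is only an approximation of how DataX converts to long.

```python
import csv
import io
from collections import Counter

# Indexes of the columns typed as "long" in the job file above.
LONG_COLUMNS = [5, 6, 7, 8]


def find_bad_longs(lines, long_columns=LONG_COLUMNS, delimiter=","):
    """Count the distinct raw values in the long columns that int() rejects."""
    bad = Counter()
    for fields in csv.reader(lines, delimiter=delimiter):
        for idx in long_columns:
            if idx >= len(fields):
                bad["<missing column>"] += 1
                continue
            try:
                int(fields[idx])
            except ValueError:
                bad[fields[idx]] += 1
    return bad


# Invented sample rows; a real run would pass an open file instead.
sample = io.StringIO(
    "2019-01-01,a,b,c,d,10,20,30,40\n"
    "2019-01-02,a,b,c,d,NULL,20,30,NULL\n"
)
print(find_bad_longs(sample))  # Counter({'NULL': 2})
```

For the real file, something like `with open(path, encoding="utf-8", newline="") as f: print(find_bad_longs(f))` streams it line by line, so the whole CSV never has to fit in memory.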
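The fix I found was to add the nullFormat option:
nullFormat
Description: A text file has no standard string for null (a null pointer), so DataX provides nullFormat to define which strings represent null. For example, with nullFormat:"\N" configured, a source value of "\N" is treated by DataX as a null field.
Required: no
Default: \N
So let's add it: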
{
  "content": [
    {
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "column": [
            { "format": "yyyy-MM-dd", "index": 0, "type": "date" },
            { "index": 1, "type": "string" },
            { "index": 2, "type": "string" },
            { "index": 3, "type": "string" },
            { "index": 4, "type": "string" },
            { "index": 5, "type": "long" },
            { "index": 6, "type": "long" },
            { "index": 7, "type": "long" },
            { "index": 8, "type": "long" }
          ],
          "csvReaderConfig": {
            "safetySwitch": false,
            "skipEmptyRecords": false,
            "useTextQualifier": false
          },
          "encoding": "utf-8",
          "fieldDelimiter": ",",
          "nullFormat": "null",
          "path": ["/home/tianyafu/flux_timecount_action.csv"]
        }
      },
      "writer": {
        "name": "gpdbwriter",
        "parameter": {
          "column": ["record_date", "outid", "tm_type", "serv", "app", "down_flux", "up_flux", "seconds", "count"],
          "connection": [
            {
              "jdbcUrl": "jdbc:postgresql://192.168.100.21:5432/ods",
              "table": ["ods_flux_timecount_action"]
            }
          ],
          "password": "******",
          "segment_reject_limit": 0,
          "username": "admin"
        }
      }
    }
  ],
  "setting": {
    "errorLimit": { "percentage": 0.02, "record": 0 },
    "speed": { "channel": "1" }
  }
}
It failed again.
It looks like the value is case-sensitive, so another change:
{
  "content": [
    {
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "column": [
            { "format": "yyyy-MM-dd", "index": 0, "type": "date" },
            { "index": 1, "type": "string" },
            { "index": 2, "type": "string" },
            { "index": 3, "type": "string" },
            { "index": 4, "type": "string" },
            { "index": 5, "type": "long" },
            { "index": 6, "type": "long" },
            { "index": 7, "type": "long" },
            { "index": 8, "type": "long" }
          ],
          "csvReaderConfig": {
            "safetySwitch": false,
            "skipEmptyRecords": false,
            "useTextQualifier": false
          },
          "encoding": "utf-8",
          "fieldDelimiter": ",",
          "nullFormat": "NULL",
          "path": ["/home/tianyafu/flux_timecount_action.csv"]
        }
      },
      "writer": {
        "name": "gpdbwriter",
        "parameter": {
          "column": ["record_date", "outid", "tm_type", "serv", "app", "down_flux", "up_flux", "seconds", "count"],
          "connection": [
            {
              "jdbcUrl": "jdbc:postgresql://192.168.100.21:5432/ods",
              "table": ["ods_flux_timecount_action"]
            }
          ],
          "password": "******",
          "segment_reject_limit": 0,
          "username": "admin"
        }
      }
    }
  ],
  "setting": {
    "errorLimit": { "percentage": 0.02, "record": 0 },
    "speed": { "channel": "1" }
  }
}
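This time it finally succeeded.
So the nullFormat value is case-sensitive: it has to match the null marker in the data exactly ("NULL" here, not "null").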
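The behaviour above is what you would expect if the reader compares each field to nullFormat with a plain, exact string comparison. The Python sketch below models that assumption. It is not taken from the DataX source, and `to_long` is a hypothetical helper.

```python
def to_long(raw, null_format="\\N"):
    """Return None if raw equals the null marker exactly, else parse an integer."""
    if raw == null_format:  # exact, case-sensitive comparison (assumed)
        return None
    return int(raw)


print(to_long("\\N"))           # None: matches the default marker \N
print(to_long("42"))            # 42
print(to_long("NULL", "NULL"))  # None: the marker matches the data exactly

try:
    to_long("NULL", "null")     # marker differs only in case
except ValueError as exc:
    print("conversion failed:", exc)
```

Under this model, "null" and "NULL" are simply two different strings, so the second attempt falls through to the integer conversion and fails just as it did without nullFormat.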