Implementation approach:
1. Every day in the early morning, export the previous day's incremental data from the business system to a text file and FTP it to a master node of the Hadoop cluster. The default upload path is /mnt/data/crawler/.
2. On the master node, a shell script invokes the hive command to load the local incremental file into a Hive staging table.
3. In the same shell script, Hive SQL merges the staging table's increments into the Hive main table: rows with matching ids are replaced, and new rows are inserted.
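To make the nightly hand-off in step 1 automatic, the two scripts described below can be scheduled with cron on the master node. This is only a sketch: the install path /opt/etl/, the log paths, and the run times are assumptions, not part of the original post; adjust the times to when the business system's FTP upload actually finishes.

```shell
# Hypothetical crontab entries on the Hadoop master node.
# /opt/etl/ and /var/log/etl/ are assumed paths for illustration.
# Load yesterday's file into the staging table at 00:30 ...
30 0 * * * /opt/etl/test_temp.sh >> /var/log/etl/test_temp.log 2>&1
# ... then merge the staging table into the main table at 00:45.
45 0 * * * /opt/etl/test.sh >> /var/log/etl/test.log 2>&1
```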
Implementation steps:
1. DDL: create the two tables, test_temp and test.
create table crawler.test_temp (
    id          string,
    name        string,
    email       string,
    create_time string
)
row format delimited fields terminated by ','
stored as textfile;

create table crawler.test (
    id          string,
    name        string,
    email       string,
    create_time string
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc;
2. Write the shell script test_temp.sh, which loads the local incremental data into the Hive staging table:
#!/bin/bash
##################################
# Usage:                         #
#   script_name [yyyymmdd]       #
# The date argument is optional; #
# it defaults to yesterday.      #
##################################
dt=''
table=test_temp

# current system date
sysdate=$(date +%Y%m%d)
# yesterday's date, format YYYYMMDD
yesterday=$(date -d yesterday +%Y%m%d)
# directory holding the uploaded data files
file_path=/mnt/data/crawler/

if [ $# -eq 1 ]; then
    dt=$1
elif [ $# -eq 0 ]; then
    dt=$yesterday
else
    echo "Invalid arguments!"
    # exit code: 0 = success, non-zero = failure
    exit 1
fi

filename=${file_path}${table}_${dt}.txt
if [ ! -e "$filename" ]; then
    echo "$filename: data file does not exist!"
    exit 1
fi

hive <<EOF
load data local inpath '$filename' overwrite into table crawler.$table;
EOF

if [ $? -eq 0 ]; then
    echo ""
    echo "$dt $table loaded successfully!"
else
    echo ""
    echo "$dt $table load failed!"
fi
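The date-defaulting and filename-construction logic of test_temp.sh can be checked in isolation, with no Hive installation. The fragment below repeats just that part of the script; the variable names match the script above.

```shell
#!/bin/sh
# Stand-alone check of test_temp.sh's date/filename logic (no Hive call).
table=test_temp
file_path=/mnt/data/crawler/

# GNU date; on BSD/macOS the equivalent is `date -v-1d +%Y%m%d`.
yesterday=$(date -d yesterday +%Y%m%d)

# Use the first argument if given, otherwise default to yesterday,
# mirroring the if/elif branch in the full script.
dt=${1:-$yesterday}

filename=${file_path}${table}_${dt}.txt
echo "$filename"
```

Run without arguments it prints yesterday's filename, e.g. /mnt/data/crawler/test_temp_YYYYMMDD.txt; run with an explicit date it prints the filename for that date.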
3. Write the shell script test.sh, which incrementally merges the staging data into the main table:
#!/bin/bash
##################################
table=test
# current system date
sysdate=$(date +%Y%m%d)

# Incremental merge: staging rows replace main-table rows with the
# same id; all other main-table rows are carried over unchanged.
hive <<EOF
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table crawler.test partition (dt)
select
    a.id,
    a.name,
    a.email,
    a.create_time,
    a.create_time as dt
from (
    select id, name, email, create_time
    from crawler.test_temp
    union all
    select t.id, t.name, t.email, t.create_time
    from crawler.test t
    left outer join crawler.test_temp t1
        on t.id = t1.id
    where t1.id is null
) a;
quit;
EOF

if [ $? -eq 0 ]; then
    echo "$sysdate $0 incremental extract completed!"
else
    echo "$sysdate $0 incremental extract failed!"
fi
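The union all plus left outer join in test.sh is effectively an upsert: every staging row wins, and a main-table row survives only when its id is absent from the staging table. The same semantics can be demonstrated on two small comma-delimited files with awk; the file names and sample rows below are invented for the demo.

```shell
#!/bin/sh
# Toy data: 'main' plays the existing table, 'temp' the daily increment.
printf '1,alice,a@x.com,20200101\n2,bob,b@x.com,20200101\n' > main.txt
printf '2,bob,bob@new.com,20200102\n3,carol,c@x.com,20200102\n' > temp.txt

# Pass 1 (NR==FNR, i.e. temp.txt): remember each id and emit the row
# -- the "select ... from crawler.test_temp" branch of the union.
# Pass 2 (main.txt): emit a row only if its id was NOT seen in temp
# -- the "left outer join ... where t1.id is null" branch.
awk -F',' 'NR==FNR {seen[$1]=1; print; next} !($1 in seen)' temp.txt main.txt \
    | sort -t',' -k1,1 > merged.txt

cat merged.txt
# 1,alice,a@x.com,20200101
# 2,bob,bob@new.com,20200102
# 3,carol,c@x.com,20200102
```

Note that bob's row now carries the new email from the increment, while alice's untouched row is preserved, which is exactly what insert overwrite on the main table achieves in Hive.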
https://www.aboutyun.com/forum.php?mod=viewthread&tid=20025&ordertype=1