How to deduplicate Hive tables after using UNION ALL

Reposted article. 2019-03-15 12:24

The business scenario is roughly this: there are two Hive tables, tableA and tableB, both with the same columns:
uid cate1 cate2

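In HiveQL, UNION deduplicates automatically, but only when entire rows are identical. Here we need to deduplicate by uid instead.
In other words, the data may contain rows like these:
1234 teacher singing
1234 teacher dancing
Of these two rows, we want to keep only one.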
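
The difference between the two kinds of deduplication can be sketched in plain Python. This is only an illustration outside Hive: the sample rows and variable names are hypothetical, and keeping the last row seen per uid is an assumption made for the example.

```python
# Hypothetical sample rows: (uid, cate1, cate2).
rows_a = [("1234", "teacher", "singing"), ("5678", "student", "drawing")]
rows_b = [("1234", "teacher", "dancing"), ("5678", "student", "drawing")]

# Full-row deduplication, which is what UNION does: only exact duplicates collapse.
union_rows = sorted(set(rows_a + rows_b))

# Deduplication by uid: keep one row per uid (here, the last one seen).
by_uid = {}
for uid, cate1, cate2 in rows_a + rows_b:
    by_uid[uid] = (uid, cate1, cate2)
uid_rows = sorted(by_uid.values())

print(len(union_rows))  # 3: both rows for uid 1234 survive
print(len(uid_rows))    # 2: one row per uid
```

Full-row deduplication leaves both rows for uid 1234 in place, which is why UNION alone does not solve the problem.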

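Our general approach is to attach an artificial flag to each row when reading from the two tables, then use Python code to decide which row to keep based on that flag.
To do this we wrote two scripts: a shell script that fetches the Hive data, and a Python script that processes it.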

vim get_data.sh
function merge(){
cat <<EOF
add file ./process.py;
    select transform(a.*) using 'python process.py' as uid,cate1,cate2 from

    (select * from
    (select uid,cate1,cate2,"0" as flag from tableA where dt='sth1'
    union all
    select uid,cate1,cate2,"1" as flag from tableB where dt='sth2'
    )ts
    distribute by uid sort by uid,flag asc
    )a
EOF
}

One point in the code above deserves special attention:

distribute by uid sort by uid,flag asc

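To understand this line, I looked up the reference explanation.
In short, distribute by uid means that all rows with the same uid are sent to the same reducer. sort by uid,flag asc then orders the rows within each reducer, so rows sharing a uid arrive at the script consecutively, with the flag "0" row (tableA) ahead of the flag "1" row (tableB).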
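
The effect of that clause can be imitated in a few lines of Python. This is a rough sketch, not Hive's implementation: the function distribute_and_sort, the num_reducers parameter, and the use of int(uid) % num_reducers in place of Hive's real hash function are all assumptions made for illustration.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers=2):
    """Mimic DISTRIBUTE BY uid SORT BY uid, flag for rows of (uid, cate1, cate2, flag)."""
    reducers = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY uid: the reducer is chosen from the uid alone.
        # int(uid) % num_reducers is a stand-in for Hive's real hash function.
        reducers[int(row[0]) % num_reducers].append(row)
    # SORT BY uid, flag: each reducer sorts only its own rows.
    return {r: sorted(part, key=lambda row: (row[0], row[3]))
            for r, part in reducers.items()}

rows = [
    ("1234", "teacher", "dancing", "1"),
    ("1235", "student", "drawing", "0"),
    ("1234", "teacher", "singing", "0"),
]
for reducer_id, part in sorted(distribute_and_sort(rows).items()):
    print(reducer_id, part)
```

Both rows for uid 1234 land on the same simulated reducer, and the flag "0" row comes first, which is the ordering the Python script relies on.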

vim process.py

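#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Reads tab-separated rows (uid, cate1, cate2, flag) from stdin, already
# grouped by uid and sorted by flag, and prints one row per uid.
import sys

lastuid = ""
cate1 = ""
cate2 = ""

for line in sys.stdin:
    line = line.replace("\n", "").replace(" ", "")
    v = line.split("\t")
    # Skip malformed rows instead of letting them overwrite the current uid.
    if len(v) != 4 or not v[0].isdigit():
        continue
    uid = v[0]
    # A new uid means the previous group is complete, so emit it.
    if lastuid != "" and lastuid != uid:
        print(lastuid + "\t" + cate1 + "\t" + cate2)
    # Rows arrive sorted by flag ascending, so the last row of a group wins:
    # the tableB row (flag "1") replaces the tableA row (flag "0") if both exist.
    cate1 = v[1]
    cate2 = v[2]
    lastuid = uid

# Emit the final group, as in the classic Python streaming word-count example.
if lastuid != "":
    print(lastuid + "\t" + cate1 + "\t" + cate2)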
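
To try the idea locally before running it in Hive, the keep-the-last-row-per-uid logic can be written as a small function and fed a few hand-made lines. This is a sketch under assumptions: dedup_sorted_rows and the sample data are hypothetical, the input is assumed to be already sorted by uid and flag, and unlike the script it skips malformed rows and does not strip spaces inside values.

```python
from itertools import groupby

def dedup_sorted_rows(lines):
    """Keep the last row of each run of equal uids; input must be sorted by uid, flag."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # Drop rows that do not have four fields or a numeric uid.
    valid = (v for v in rows if len(v) == 4 and v[0].isdigit())
    out = []
    for uid, group in groupby(valid, key=lambda v: v[0]):
        last = list(group)[-1]           # highest flag within this uid
        out.append("\t".join(last[:3]))  # drop the flag column on output
    return out

sample = [
    "1234\tteacher\tsinging\t0\n",
    "1234\tteacher\tdancing\t1\n",
    "5678\tstudent\tdrawing\t0\n",
    "bad line\n",
]
for row in dedup_sorted_rows(sample):
    print(row)
```

groupby only merges adjacent rows with the same key, so it depends on the same ordering guarantee as the streaming script.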
