1 - Describe in detail the steps for importing a structured text file student.txt into a Hive table, and the key keywords involved
- Assume student.txt has three columns: id, name, gender
- 1. Create the database: create database student_info;
- 2. Create the Hive table student
create external table student_info.student(
    id     string comment 'student id',
    name   string comment 'student name',
    gender string comment 'student gender'
)
comment 'student info table'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/student';
- 3. Load the data
-- note: load data takes no location clause; the path was fixed when the table was created
load data local inpath '/root/student.txt' into table student_info.student;
- 4. Enter the hive CLI and inspect the table
select * from student_info.student limit 10;
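The select shows the data; to check the table structure itself, Hive's describe command works as well:

describe formatted student_info.student;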
Key point: you must be able to write this code by hand.
2 - Implement the following with HQL
2-1 Create the tables
- Create an employee basic information table (EmployeeInfo) with fields (employee ID, employee name, national ID number, gender, age, department, post, hire date, departure date), partitioned by hire date, with row delimiter '\n' and field delimiter '\t'. Departments include Administration, Finance, R&D, and Teaching; the corresponding posts include admin manager, admin specialist, finance manager, finance specialist, R&D engineer, test engineer, implementation engineer, lecturer, teaching assistant, class advisor, and so on. Time values look like: 2018-05-10 11:00:00
- Create an employee income table (IncomeInfo) with fields (employee ID, employee name, income amount, income month, income type, pay time), partitioned by the pay time. Income types cover four cases: salary (薪資), bonus, company benefits, and fines (罰款); time values look like: 2018-05-10 11:00:00.
Note: the time values are of the form 2018-05-10 11:00:00, so the fields need processing with date functions.
- Create the employee basic information table
create external table test.employee_info(
    id             string comment 'employee id',
    name           string comment 'employee name',
    identity_card  string comment 'national id number',
    gender         string comment 'gender',
    age            string comment 'age',
    department     string comment 'department',
    post           string comment 'post',
    hire_date      string comment 'hire date',
    departure_date string comment 'departure date'
)
comment 'employee basic information table'
partitioned by (day string comment 'employee hire date')
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/employee';
- Create the employee income table
create external table test.income_info(
    id              string comment 'employee id',
    name            string comment 'employee name',
    income_data     string comment 'income amount',
    income_month    string comment 'income month',
    income_type     string comment 'income type',
    income_datetime string comment 'pay time'
)
comment 'employee income table'
partitioned by (day string comment 'pay date')
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/income';
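A quick sketch of loading one day's data into a partition of this table (the file name /root/income_20180510.txt is hypothetical, assuming one file per pay date):

load data local inpath '/root/income_20180510.txt'
into table test.income_info
partition (day = '2018-05-10');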
2-2 Using HQL, compute the company's total employee expense for each year, sorted by year in descending order.
- Key point: process the 2018-05-10 11:00:00 timestamps with built-in functions
- Read the full income_info table and aggregate by year; since the income types include fines, the fines must be deducted from the money paid out to employees.
- No join; for large data volumes, produce the result in a single pass over the data
select income_year,
       (income_data - nvl(penalty_data, 0)) as company_cost
from (
    -- pivot income vs. fines per year, e.g. output: 2019  500  10
    select income_year,
           sum(case when income_type != '罰款' then data_total else 0 end) as income_data,
           sum(case when income_type =  '罰款' then data_total else 0 end) as penalty_data
    from (
        -- total amount per year and income type
        select year(to_date(income_datetime)) as income_year,
               income_type,
               sum(income_data) as data_total
        from test.income_info
        group by year(to_date(income_datetime)), income_type
    ) tmp_a
    group by tmp_a.income_year
) as temp
order by income_year desc;
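The same result can also be sketched as a single aggregation by signing the amounts (same table as above; fines counted as negative):

select year(to_date(income_datetime)) as income_year,
       sum(case when income_type = '罰款' then -cast(income_data as double)
                else cast(income_data as double) end) as company_cost
from test.income_info
group by year(to_date(income_datetime))
order by income_year desc;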
2-3 Using HQL, compute each department's total employee expense for each year, sorted by year descending and by department expense ascending.
- Guarantee a single pass over the data
-- join on id to pick up department, split by expense type
select income_year,
       department,
       (sum(case when income_type != '罰款' then income_data else 0 end)
      - sum(case when income_type =  '罰款' then income_data else 0 end)) as department_cost
from (
    -- first aggregate each employee's amounts by year and income type
    select id,
           year(to_date(income_datetime)) as income_year,
           income_type,
           sum(income_data) as income_data
    from test.income_info
    group by year(to_date(income_datetime)), id, income_type
) temp_a
inner join test.employee_info b on temp_a.id = b.id
group by department, income_year
order by income_year desc, department_cost asc;
2-4 Using HQL, compute each department's all-time total employee expense, ranked by total expense in descending order; equal values share a rank with no gaps.
- Adapted from the intermediate result in 2-3
- Note: this covers all historical data
select department,
       department_cost,
       dense_rank() over (order by department_cost desc) as cost_rank
from (
    -- join on id to pick up department, split by expense type
    select department,
           (sum(case when income_type != '罰款' then income_data else 0 end)
          - sum(case when income_type =  '罰款' then income_data else 0 end)) as department_cost
    from (
        -- first aggregate each employee's amounts by income type
        select id, income_type, sum(income_data) as income_data
        from test.income_info
        group by id, income_type
    ) temp_a
    inner join test.employee_info b on temp_a.id = b.id
    group by department
) tmp_c;
2-5 Using HQL, create and populate a salary-change table: employee ID, employee name, current-month salary, current-month pay time, previous-month salary, previous-month pay time. The partition field is the current-month pay time.
- This looks like a job for the dynamic partition insert feature (the exact syntax takes some working out)
- First create the table, then use insert into table **** select ***
- Employees who joined or left mid-way must be accounted for, hence a full join
- Full join the two subqueries, then filter out rows where day is null
- Needs the concat, year, month, and to_date built-in functions
- This question has quite a few cases to think through
create external table test.income_dynamic(
    id                      string comment 'employee id',
    name                    string comment 'employee name',
    income_data_current     string comment 'current-month income',
    income_datetime_current string comment 'current-month pay time',
    income_data_last        string comment 'previous-month income',
    income_datetime_last    string comment 'previous-month pay time'
)
comment 'employee salary-change table'
partitioned by (day string comment 'current-month pay date')
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/income_dynamic';  -- its own path, not the income_info location

-- dynamic partition insert, using a full outer join
insert into table test.income_dynamic partition(day)
select (case when id_a   is not null then id_a   else id_b   end) as id,
       (case when name_a is not null then name_a else name_b end) as name,
       income_data,
       income_datetime,
       income_data_b,
       income_datetime_b,
       day
from (
    -- all salary rows, keyed by their own year-month
    select id as id_a,
           name as name_a,
           income_data,
           income_datetime,
           day,
           concat(year(to_date(day)), month(to_date(day))) as day_flag
    from test.income_info
    where income_type = '薪資'
) tmp_a
full outer join (
    -- the same rows, shifted forward one month so they line up as "last month"
    select id as id_b,
           name as name_b,
           income_data as income_data_b,
           income_datetime as income_datetime_b,
           concat(year(add_months(to_date(day), 1)), month(add_months(to_date(day), 1))) as month_flag
    from test.income_info
    where income_type = '薪資'
) tmp_b
on tmp_a.day_flag = tmp_b.month_flag and tmp_a.id_a = tmp_b.id_b
where day is not null;  -- keep only rows that have a current-month payment
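One thing the notes above leave implicit: a dynamic-partition insert with no static partition value requires nonstrict mode, so these session settings must be issued before the insert:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;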
2-6 Using HQL: in terms of salary changes, whose salary rose the most in May 2018, and whose rose by the largest percentage?
- Fairly simple on top of 2-5: just reuse the select part, or query the table built in 2-5 directly (a sketch follows below)
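A minimal sketch against the income_dynamic table from 2-5 (assuming its day partition values for May 2018 start with '2018-05'); the first query gives the largest absolute raise, the second the largest percentage raise:

select id, name,
       cast(income_data_current as double) - cast(income_data_last as double) as raise
from test.income_dynamic
where day like '2018-05%'
order by raise desc
limit 1;

select id, name,
       (cast(income_data_current as double) - cast(income_data_last as double))
         / cast(income_data_last as double) as raise_ratio
from test.income_dynamic
where day like '2018-05%'
order by raise_ratio desc
limit 1;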
Hive row/column transformations
I. Row to column

1. Problem
How can Hive transform

a   b   1
a   b   2
a   b   3
c   d   4
c   d   5
c   d   6

into:

a   b   1,2,3
c   d   4,5,6

2. Data: test.txt

a   b   1
a   b   2
a   b   3
c   d   4
c   d   5
c   d   6

3. Answer

1. Create the table and load the data

drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;

load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;

2. Query

select col1, col2, concat_ws(',', collect_set(col3))
from tmp_jiangzl_test
group by col1, col2;

collect_set/concat_ws syntax references:
https://blog.csdn.net/waiwai3/article/details/79071544
https://blog.csdn.net/yeweiouyang/article/details/41286469 ([Hive] using concat_ws to merge multiple rows into one)

II. Column to row

1. Problem

How can Hive transform

a   b   1,2,3
c   d   4,5,6

into:

a   b   1
a   b   2
a   b   3
c   d   4
c   d   5
c   d   6

2. Answer

1. Create the table (same DDL as above)

drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;

2. Query

select col1, col2, col5
from tmp_jiangzl_test a
lateral view explode(split(col3, ',')) b as col5;

lateral view syntax reference:
https://blog.csdn.net/clerk0324/article/details/58600284
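Worth knowing: collect_set de-duplicates and does not guarantee order; if duplicates must be kept, collect_list is the drop-in alternative:

select col1, col2, concat_ws(',', collect_list(col3))
from tmp_jiangzl_test
group by col1, col2;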
Hive wordcount implementation
1. Create the database

create database wordcount;

2. Create the external table

create external table word_data(line string)
row format delimited fields terminated by ','
location '/home/hadoop/worddata';

3. Map the data into the table

load data inpath '/home/hadoop/worddata' into table word_data;

(Strictly speaking, with the location clause above the files are already visible to the external table, so this load is redundant.)

4. Here we assume the data lives in HDFS under /home/hadoop/worddata, as a set of word files with content roughly like:

hello man
what are you doing now
my running
hello kevin
hi man

After the HQL above, the table word_data holds each line of these files in the column line; select * from word_data; shows the data.

5. Following the MapReduce pattern, each line must be split into words. This uses a Hive built-in table-generating function (UDTF), explode(array), which takes an array and turns one row into many:

create table words(word string);
insert into table words
select explode(split(line, " ")) as word from word_data;

6. Check the words table

OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man

split works like Java's split; here it splits on spaces, so once the HQL runs, the words table holds one word per row.

7. Count words with group by

select word, count(*) from wordcount.words group by word;

(wordcount.words is database.table; the word in group by word is the column from create table words(word string).)

Result:

are     1
doing   1
hello   2
hi      1
kevin   1
man     2
my      1
now     1
running 1
what    1
you     1
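The intermediate words table is not strictly necessary; a lateral view sketch over the same word_data table does it in one query:

select word, count(*) as cnt
from word_data
lateral view explode(split(line, ' ')) t as word
group by word;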
Hive Top-N
- rank() over(): ties share a rank, leaving gaps after them
- dense_rank() over(): ties share a rank, no gaps
- row_number() over(): strictly sequential numbering (see the sketch below)
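A minimal per-group Top-N sketch, reusing the test.income_info table from section 2 (top 3 income rows per employee; swapping in rank or dense_rank only changes how ties are handled):

select *
from (
    select id, income_data,
           row_number() over (partition by id order by income_data desc) as rn
    from test.income_info
) t
where rn <= 3;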
Finding the order ids under a specified condition
- Given an orders table, find the users who have only ever bought flour (the crux: only flour, nothing else)
e.g. order: order_id, buyer_id, order_time.....
With a single pass over the data guaranteed, the point is O(1) extra work per row
select buyer_id
from (
    select buyer_id,
           sum(case when order_id = '面粉' then 0 else 1 end) as flag  -- '面粉' = flour
    from `order`  -- backticks: order is a reserved word
    group by buyer_id
) as tmp
where flag = 0;
How many mutual-follow pairs are there in the Weibo system
- In the Weibo followers table, how many pairs of users follow each other? E.g. A-->B and B-->A means A and B follow each other, counting as one pair.
Table structure: id, keep_id, time.... (id, keep_id can serve as a composite primary key) - implement this with Hive
select count(*) / 2 as weibo_relation_number
from (
    select flag
    from (
        -- concat_ws with a separator avoids ambiguous concatenations like 1+23 vs 12+3
        (select concat_ws('_', id, keep_id) as flag from weibo_relation)
        union all
        -- merge everything first; do not de-duplicate early
        (select concat_ws('_', keep_id, id) as flag from weibo_relation)
    ) as tmp
    group by flag
    having count(flag) = 2
) as pairs;
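An alternative sketch via a self-join (each mutual pair matches twice, hence the division by 2):

select count(*) / 2 as weibo_relation_number
from weibo_relation a
join weibo_relation b
  on a.id = b.keep_id and a.keep_id = b.id;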
How many things did the people who bought bananas buy
- A classic question: how many items in total did the people who bought bananas buy?
- Reusing the data and table structure from the previous question, read it as: how many people in total do the followers of C follow?
- Read carefully: the followed ids must be counted with de-duplication
select count(distinct keep_id) as total_keep_id
from weibo_relation
where id in (select id from weibo_relation where keep_id = 'c');
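Older Hive versions did not support in subqueries in the where clause; a left semi join sketch achieves the same result:

select count(distinct a.keep_id) as total_keep_id
from weibo_relation a
left semi join (
    select id from weibo_relation where keep_id = 'c'
) b on a.id = b.id;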