1 - Describe in detail the steps for importing a structured text file student.txt into a Hive table, and the key keywords involved
- Assume student.txt has three columns: id, name, gender
- 1. Create the database: create database student_info;
- 2. Create the Hive table student
create external table student_info.student(
    id     string comment 'student id',
    name   string comment 'student name',
    gender string comment 'student gender'
)
comment 'student info table'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/student';
- 3. Load the data
-- note: load data takes no location clause; the path was fixed when the table was created
load data local inpath '/root/student.txt' into table student_info.student;
- 4. Enter the hive CLI and inspect the table
select * from student_info.student limit 10;
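The select shows the data; to check the table structure itself, Hive's describe command works as well:

describe formatted student_info.student;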
Key point: you must be able to write this code by hand.
2 - Implement the following with HQL
2-1 Create the tables
- Create an employee basic information table (EmployeeInfo) with fields (employee ID, employee name, national ID number, gender, age, department, post, hire date, departure date), partitioned by hire date, with row delimiter '\n' and field delimiter '\t'. Departments include Administration, Finance, R&D, and Teaching; the corresponding posts include admin manager, admin specialist, finance manager, finance specialist, R&D engineer, test engineer, implementation engineer, lecturer, teaching assistant, class advisor, and so on. Time values look like: 2018-05-10 11:00:00
- Create an employee income table (IncomeInfo) with fields (employee ID, employee name, income amount, income month, income type, pay time), partitioned by the pay time. Income types cover four cases: salary (薪資), bonus, company benefits, and fines (罰款); time values look like: 2018-05-10 11:00:00.
Note: the time values are of the form 2018-05-10 11:00:00, so the fields need processing with date functions.
- Create the employee basic information table
create external table test.employee_info(
    id             string comment 'employee id',
    name           string comment 'employee name',
    identity_card  string comment 'national id number',
    gender         string comment 'gender',
    age            string comment 'age',
    department     string comment 'department',
    post           string comment 'post',
    hire_date      string comment 'hire date',
    departure_date string comment 'departure date'
)
comment 'employee basic information table'
partitioned by (day string comment 'employee hire date')
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/employee';
- Create the employee income table
create external table test.income_info(
    id              string comment 'employee id',
    name            string comment 'employee name',
    income_data     string comment 'income amount',
    income_month    string comment 'income month',
    income_type     string comment 'income type',
    income_datetime string comment 'pay time'
)
comment 'employee income table'
partitioned by (day string comment 'pay date')
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/income';
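A quick sketch of loading one day's data into a partition of this table (the file name /root/income_20180510.txt is hypothetical, assuming one file per pay date):

load data local inpath '/root/income_20180510.txt'
into table test.income_info
partition (day = '2018-05-10');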
2-2 Using HQL, compute the company's total employee expense for each year, sorted by year in descending order.
- Key point: process the 2018-05-10 11:00:00 timestamps with built-in functions
- Read the full income_info table and aggregate by year; since the income types include fines, the fines must be deducted from the money paid out to employees.
- No join; for large data volumes, produce the result in a single pass over the data
select income_year,
       (income_data - nvl(penalty_data, 0)) as company_cost
from (
    -- pivot income vs. fines per year, e.g. output: 2019  500  10
    select income_year,
           sum(case when income_type != '罰款' then data_total else 0 end) as income_data,
           sum(case when income_type =  '罰款' then data_total else 0 end) as penalty_data
    from (
        -- total amount per year and income type
        select year(to_date(income_datetime)) as income_year,
               income_type,
               sum(income_data) as data_total
        from test.income_info
        group by year(to_date(income_datetime)), income_type
    ) tmp_a
    group by tmp_a.income_year
) as temp
order by income_year desc;
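The same result can also be sketched as a single aggregation by signing the amounts (same table as above; fines counted as negative):

select year(to_date(income_datetime)) as income_year,
       sum(case when income_type = '罰款' then -cast(income_data as double)
                else cast(income_data as double) end) as company_cost
from test.income_info
group by year(to_date(income_datetime))
order by income_year desc;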
2-3 Using HQL, compute each department's total employee expense for each year, sorted by year descending and by department expense ascending.
- Guarantee a single pass over the data
-- join on id to pick up department, split by expense type
select income_year,
       department,
       (sum(case when income_type != '罰款' then income_data else 0 end)
      - sum(case when income_type =  '罰款' then income_data else 0 end)) as department_cost
from (
    -- first aggregate each employee's amounts by year and income type
    select id,
           year(to_date(income_datetime)) as income_year,
           income_type,
           sum(income_data) as income_data
    from test.income_info
    group by year(to_date(income_datetime)), id, income_type
) temp_a
inner join test.employee_info b on temp_a.id = b.id
group by department, income_year
order by income_year desc, department_cost asc;
2-4 Using HQL, compute each department's all-time total employee expense, ranked by total expense in descending order; equal values share a rank with no gaps.
- Adapted from the intermediate result in 2-3
- Note: this covers all historical data
select department,
       department_cost,
       dense_rank() over (order by department_cost desc) as cost_rank
from (
    -- join on id to pick up department, split by expense type
    select department,
           (sum(case when income_type != '罰款' then income_data else 0 end)
          - sum(case when income_type =  '罰款' then income_data else 0 end)) as department_cost
    from (
        -- first aggregate each employee's amounts by income type
        select id, income_type, sum(income_data) as income_data
        from test.income_info
        group by id, income_type
    ) temp_a
    inner join test.employee_info b on temp_a.id = b.id
    group by department
) tmp_c;
2-5 Using HQL, create and populate a salary-change table: employee ID, employee name, current-month salary, current-month pay time, previous-month salary, previous-month pay time. The partition field is the current-month pay time.
- This looks like a job for the dynamic partition insert feature (the exact syntax takes some working out)
- First create the table, then use insert into table **** select ***
- Employees who joined or left mid-way must be accounted for, hence a full join
- Full join the two subqueries, then filter out rows where day is null
- Needs the concat, year, month, and to_date built-in functions
- This question has quite a few cases to think through
create external table test.income_dynamic(
    id                      string comment 'employee id',
    name                    string comment 'employee name',
    income_data_current     string comment 'current-month income',
    income_datetime_current string comment 'current-month pay time',
    income_data_last        string comment 'previous-month income',
    income_datetime_last    string comment 'previous-month pay time'
)
comment 'employee salary-change table'
partitioned by (day string comment 'current-month pay date')
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/income_dynamic';  -- its own path, not the income_info location

-- dynamic partition insert, using a full outer join
insert into table test.income_dynamic partition(day)
select (case when id_a   is not null then id_a   else id_b   end) as id,
       (case when name_a is not null then name_a else name_b end) as name,
       income_data,
       income_datetime,
       income_data_b,
       income_datetime_b,
       day
from (
    -- all salary rows, keyed by their own year-month
    select id as id_a,
           name as name_a,
           income_data,
           income_datetime,
           day,
           concat(year(to_date(day)), month(to_date(day))) as day_flag
    from test.income_info
    where income_type = '薪資'
) tmp_a
full outer join (
    -- the same rows, shifted forward one month so they line up as "last month"
    select id as id_b,
           name as name_b,
           income_data as income_data_b,
           income_datetime as income_datetime_b,
           concat(year(add_months(to_date(day), 1)), month(add_months(to_date(day), 1))) as month_flag
    from test.income_info
    where income_type = '薪資'
) tmp_b
on tmp_a.day_flag = tmp_b.month_flag and tmp_a.id_a = tmp_b.id_b
where day is not null;  -- keep only rows that have a current-month payment
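One thing the notes above leave implicit: a dynamic-partition insert with no static partition value requires nonstrict mode, so these session settings must be issued before the insert:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;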
2-6 Using HQL: in terms of salary changes, whose salary rose the most in May 2018, and whose rose by the largest percentage?
- Fairly simple on top of 2-5: just reuse the select part, or query the table built in 2-5 directly (a sketch follows below)
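A minimal sketch against the income_dynamic table from 2-5 (assuming its day partition values for May 2018 start with '2018-05'); the first query gives the largest absolute raise, the second the largest percentage raise:

select id, name,
       cast(income_data_current as double) - cast(income_data_last as double) as raise
from test.income_dynamic
where day like '2018-05%'
order by raise desc
limit 1;

select id, name,
       (cast(income_data_current as double) - cast(income_data_last as double))
         / cast(income_data_last as double) as raise_ratio
from test.income_dynamic
where day like '2018-05%'
order by raise_ratio desc
limit 1;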
Hive row/column transformations
I. Row to column

1. Problem
How can Hive transform

a   b   1
a   b   2
a   b   3
c   d   4
c   d   5
c   d   6

into:

a   b   1,2,3
c   d   4,5,6

2. Data: test.txt

a   b   1
a   b   2
a   b   3
c   d   4
c   d   5
c   d   6

3. Answer

1. Create the table and load the data

drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;

load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;

2. Query

select col1, col2, concat_ws(',', collect_set(col3))
from tmp_jiangzl_test
group by col1, col2;

collect_set/concat_ws syntax references:
https://blog.csdn.net/waiwai3/article/details/79071544
https://blog.csdn.net/yeweiouyang/article/details/41286469 ([Hive] using concat_ws to merge multiple rows into one)

II. Column to row

1. Problem

How can Hive transform

a   b   1,2,3
c   d   4,5,6

into:

a   b   1
a   b   2
a   b   3
c   d   4
c   d   5
c   d   6

2. Answer

1. Create the table (same DDL as above)

drop table tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;

2. Query

select col1, col2, col5
from tmp_jiangzl_test a
lateral view explode(split(col3, ',')) b as col5;

lateral view syntax reference:
https://blog.csdn.net/clerk0324/article/details/58600284
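Worth knowing: collect_set de-duplicates and does not guarantee order; if duplicates must be kept, collect_list is the drop-in alternative:

select col1, col2, concat_ws(',', collect_list(col3))
from tmp_jiangzl_test
group by col1, col2;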
Hive wordcount implementation
1. Create the database

create database wordcount;

2. Create the external table

create external table word_data(line string)
row format delimited fields terminated by ','
location '/home/hadoop/worddata';

3. Map the data into the table

load data inpath '/home/hadoop/worddata' into table word_data;

(Strictly speaking, with the location clause above the files are already visible to the external table, so this load is redundant.)

4. Here we assume the data lives in HDFS under /home/hadoop/worddata, as a set of word files with content roughly like:

hello man
what are you doing now
my running
hello kevin
hi man

After the HQL above, the table word_data holds each line of these files in the column line; select * from word_data; shows the data.

5. Following the MapReduce pattern, each line must be split into words. This uses a Hive built-in table-generating function (UDTF), explode(array), which takes an array and turns one row into many:

create table words(word string);
insert into table words
select explode(split(line, " ")) as word from word_data;

6. Check the words table

OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man

split works like Java's split; here it splits on spaces, so once the HQL runs, the words table holds one word per row.

7. Count words with group by

select word, count(*) from wordcount.words group by word;

(wordcount.words is database.table; the word in group by word is the column from create table words(word string).)

Result:

are     1
doing   1
hello   2
hi      1
kevin   1
man     2
my      1
now     1
running 1
what    1
you     1
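The intermediate words table is not strictly necessary; a lateral view sketch over the same word_data table does it in one query:

select word, count(*) as cnt
from word_data
lateral view explode(split(line, ' ')) t as word
group by word;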
Hive Top-N
- rank() over(): ties share a rank, leaving gaps after them
- dense_rank() over(): ties share a rank, no gaps
- row_number() over(): strictly sequential numbering (see the sketch below)
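A minimal per-group Top-N sketch, reusing the test.income_info table from section 2 (top 3 income rows per employee; swapping in rank or dense_rank only changes how ties are handled):

select *
from (
    select id, income_data,
           row_number() over (partition by id order by income_data desc) as rn
    from test.income_info
) t
where rn <= 3;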
Finding the order ids under a specified condition
- Given an orders table, find the users who have only ever bought flour (the crux: only flour, nothing else)
e.g. order: order_id, buyer_id, order_time.....
With a single pass over the data guaranteed, the point is O(1) extra work per row
select buyer_id
from (
    select buyer_id,
           sum(case when order_id = '面粉' then 0 else 1 end) as flag  -- '面粉' = flour
    from `order`  -- backticks: order is a reserved word
    group by buyer_id
) as tmp
where flag = 0;
How many mutual-follow pairs are there in the Weibo system
- In the Weibo followers table, how many pairs of users follow each other? E.g. A-->B and B-->A means A and B follow each other, counting as one pair.
Table structure: id, keep_id, time.... (id, keep_id can serve as a composite primary key) - implement this with Hive
select count(*) / 2 as weibo_relation_number
from (
    select flag
    from (
        -- concat_ws with a separator avoids ambiguous concatenations like 1+23 vs 12+3
        (select concat_ws('_', id, keep_id) as flag from weibo_relation)
        union all
        -- merge everything first; do not de-duplicate early
        (select concat_ws('_', keep_id, id) as flag from weibo_relation)
    ) as tmp
    group by flag
    having count(flag) = 2
) as pairs;
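An alternative sketch via a self-join (each mutual pair matches twice, hence the division by 2):

select count(*) / 2 as weibo_relation_number
from weibo_relation a
join weibo_relation b
  on a.id = b.keep_id and a.keep_id = b.id;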
How many things did the people who bought bananas buy
- A classic question: how many items in total did the people who bought bananas buy?
- Reusing the data and table structure from the previous question, read it as: how many people in total do the followers of C follow?
- Read carefully: the followed ids must be counted with de-duplication
select count(distinct keep_id) as total_keep_id
from weibo_relation
where id in (select id from weibo_relation where keep_id = 'c');
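Older Hive versions did not support in subqueries in the where clause; a left semi join sketch achieves the same result:

select count(distinct a.keep_id) as total_keep_id
from weibo_relation a
left semi join (
    select id from weibo_relation where keep_id = 'c'
) b on a.id = b.id;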