Hive: Common Questions and Answers

Reposted article, 2021-04-18 22:18, category 00_大數據_bigdata

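1: Data skew

Theory

What can cause data skew in Hive, and what are the main ways to fix it?

Causes

1: Dirty or special data, where a large share of rows is concentrated on one kind of key value
2: Joining a large table with a small table
3: Too many small files

Solutions

1: Keep dirty data out of the join, or give the special values a random key (when building the table)
2: Use mapjoin to load the small table into memory.
3: Merge small files with set hive.merge.mapredfiles=true; or, when the computation is heavy, increase the number of map tasks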

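Code

Fix 1: rows whose id is null do not take part in the join
For example:
-- both branches of union all must return the same columns, so only log's columns are selected
select a.* from log a
join users b
on a.id is not null and a.id = b.id
union all
select a.* from log a
where a.id is null;
Fix 2: give null keys a random key value
For example:
select * from log a
left outer join users b
on
case when a.user_id is null
then concat('hive', rand())
else a.user_id end = b.user_id;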
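
Why the random key in fix 2 helps can be shown with a small simulation. This is only a sketch under stated assumptions: `crc32` stands in for the real partitioner (Hive's actual hash differs), and the 900 null rows, 100 real ids, and 10 reducers are made-up numbers.

```python
import random
import zlib
from collections import Counter

def reducer_loads(keys, num_reducers, salt_nulls, seed=0):
    """Count how many rows each reducer receives under hash partitioning."""
    rng = random.Random(seed)
    loads = Counter()
    for key in keys:
        if key is None and salt_nulls:
            # Mirrors concat('hive', rand()): every null row gets its own throwaway key.
            key = f"hive{rng.random()}"
        # crc32 is a stand-in for the partitioner's hash (an assumption, not Hive's function).
        bucket = zlib.crc32(repr(key).encode()) % num_reducers
        loads[bucket] += 1
    return loads

# Hypothetical log table: 900 rows with a null user id, 100 rows with real ids.
keys = [None] * 900 + list(range(100))
skewed = reducer_loads(keys, 10, salt_nulls=False)
salted = reducer_loads(keys, 10, salt_nulls=True)
print("largest reducer without salting:", max(skewed.values()))
print("largest reducer with salting:   ", max(salted.values()))
```

Without salting, every null row hashes to the same reducer; with salting, the null rows spread out, and because the random keys never match a real user_id the join result is unchanged for them.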

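2: Converting between rows and columns

Rows to a column

students_info(stu_id,name,depart);
1, Zhang San, Chinese
1, Zhang San, Math
1, Zhang San, English
2, Li Si, Chinese
2, Li Si, Math
Goal:
1, Zhang San, Chinese|Math|English
2, Li Si, Chinese|Math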

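Answer

select stu_id,name,concat_ws('|',collect_set(depart)) as departs from students_info group by stu_id,name;
1: group by (every non-aggregated column, here stu_id and name, must appear in it)
2: collect_set flattens each group's values into a set
3: concat_ws joins the set's elements with the separator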
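
The three steps can be mirrored in a few lines of Python. This is an illustrative sketch, not Hive code: the function name `rows_to_column` is made up, and unlike collect_set (which does not promise any order) this version keeps first-seen order.

```python
def rows_to_column(rows, sep="|"):
    """Group (stu_id, name, depart) rows and join each group's distinct departs."""
    groups = {}
    for stu_id, name, depart in rows:
        departs = groups.setdefault((stu_id, name), [])  # step 1: group by stu_id, name
        if depart not in departs:                        # step 2: collect_set keeps distinct values
            departs.append(depart)
    # step 3: concat_ws joins the collected values with the separator
    return [(stu_id, name, sep.join(departs)) for (stu_id, name), departs in groups.items()]

rows = [
    (1, "Zhang San", "Chinese"), (1, "Zhang San", "Math"), (1, "Zhang San", "English"),
    (2, "Li Si", "Chinese"), (2, "Li Si", "Math"),
]
for line in rows_to_column(rows):
    print(line)
```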

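Column to rows

students_info(stu_id,name,departs);
1, Zhang San, Chinese|Math|English
2, Li Si, Chinese|Math
Goal:
1, Zhang San, Chinese
1, Zhang San, Math
1, Zhang San, English
2, Li Si, Chinese
2, Li Si, Math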

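Answer

select stu_id, name, depart from students_info lateral view explode(split(departs,'\\|')) t as depart;
1: split turns the string into an array (the separator is a regular expression, so | must be escaped); skip this step if the column is already an array type, e.g. Array [1,2,3]
2: explode turns the array into one row per element
3: lateral view exposes the exploded rows as a virtual view (remember the table and column aliases) that the query selects from.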
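
The reverse direction is equally short in Python. Again a sketch with a made-up function name; note that Python's `str.split` takes a literal separator, whereas Hive's split takes a regular expression.

```python
def column_to_rows(rows, sep="|"):
    """Turn each (stu_id, name, departs) row into one row per depart."""
    return [
        (stu_id, name, depart)
        for stu_id, name, departs in rows
        for depart in departs.split(sep)   # split + explode in one step
    ]

packed = [(1, "Zhang San", "Chinese|Math|English"), (2, "Li Si", "Chinese|Math")]
for line in column_to_rows(packed):
    print(line)
```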

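3: TopN

Massive data processing: find the largest 10,000 numbers among 1 billion. Name as many approaches as you know.
1: Full sort; costly in storage (space complexity)
2: Divide and conquer: split the data into 100 parts and use quicksort-style partitioning around a pivot
3: Fill a container with the first 10,000 numbers, then compare each later number against it; this is the min-heap approach, with a heap of size 10,000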
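
Approach 3 is the one that scales to a stream too large for memory. A minimal sketch using Python's standard `heapq` module; the function name `top_n` is made up for illustration.

```python
import heapq

def top_n(numbers, n):
    """Return the n largest values from an iterable, largest first.

    Keeps a min-heap of at most n items, so memory stays O(n)
    no matter how long the input stream is.
    """
    if n <= 0:
        return []
    heap = []
    for x in numbers:
        if len(heap) < n:
            heapq.heappush(heap, x)      # still filling the container
        elif x > heap[0]:
            heapq.heapreplace(heap, x)   # evict the smallest of the current top n
    return sorted(heap, reverse=True)

print(top_n(iter([5, 1, 9, 3, 7, 8]), 3))
```

The heap root is always the smallest of the current candidates, so each incoming number needs one comparison and, at worst, one O(log n) replacement.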

4: Logging in on three consecutive days

Which users logged in on three or more consecutive days during the past week?
pv_detail: uid, login_time;
101, 2021/1/1
101, 2021/1/2
101, 2021/1/3
102, 2021/1/3
103, 2021/1/3
103, 2021/1/4
101, 2021/1/5
102, 2021/1/6

First layer (order within each uid and generate a row number)

select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ;

101, 2021/1/1, 1
101, 2021/1/2, 2
101, 2021/1/3, 3
101, 2021/1/5, 4
102, 2021/1/3, 1
102, 2021/1/6, 2

Second layer (subtract the row number from the date)

select uid,login_time,sort,date_sub(login_time,sort) as login_group from (
select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ) t;

101, 2021/1/1, 1, 2020/12/31
101, 2021/1/2, 2, 2020/12/31
101, 2021/1/3, 3, 2020/12/31
101, 2021/1/5, 4, 2021/1/1
102, 2021/1/3, 1, 2021/1/2
102, 2021/1/6, 2, 2021/1/4

Aggregate

select uid,min(login_time),max(login_time),date_sub(login_time,sort) as login_group,count(1) as continue_days
from (
select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ) t
group by uid ,date_sub(login_time,sort) ;

101, 2021/1/1, 2021/1/3, 2020/12/31, 3
101, 2021/1/5, 2021/1/5, 2021/1/1, 1

Three days or more

select distinct uid from (
select uid,min(login_time),max(login_time),date_sub(login_time,sort) as login_group,count(1) as continue_days
from (
select uid,login_time,row_number() over (partition by uid order by login_time) as sort from pv_detail ) t
group by uid ,date_sub(login_time,sort) ) g where continue_days >=3;
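
The "date minus row number" trick can be checked on the sample data with a short Python sketch. The function name `users_with_streak` is made up; one deliberate difference from the queries above is that this version de-duplicates several logins on the same day first, which the SQL would need a distinct for if pv_detail can hold more than one row per user per day.

```python
from datetime import date, timedelta
from itertools import groupby

def users_with_streak(logins, min_days=3):
    """Return uids that logged in on at least min_days consecutive dates."""
    days_by_user = {}
    for uid, day in logins:
        days_by_user.setdefault(uid, set()).add(day)   # one entry per user per day
    result = []
    for uid, days in days_by_user.items():
        ordered = sorted(days)
        # Within a run of consecutive days, (date - row number) stays constant.
        keyed = [(day - timedelta(days=i), day) for i, day in enumerate(ordered, start=1)]
        for _, run in groupby(keyed, key=lambda pair: pair[0]):
            if len(list(run)) >= min_days:
                result.append(uid)
                break
    return sorted(result)

logins = [
    (101, date(2021, 1, 1)), (101, date(2021, 1, 2)), (101, date(2021, 1, 3)),
    (102, date(2021, 1, 3)), (103, date(2021, 1, 3)), (103, date(2021, 1, 4)),
    (101, date(2021, 1, 5)), (102, date(2021, 1, 6)),
]
print(users_with_streak(logins))
```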

5: Given student IDs and scores, get the student IDs whose scores are in the top 30%

score_info (id,score);

select id,score from (
select id,score, ntile(10) over (order by score desc) as level from score_info ) as a
where a.level <=3;
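
What ntile(10) does to the rows can be sketched in Python. The function name `top_fraction_ids` is hypothetical; the bucket sizing follows the usual NTILE rule (bucket sizes differ by at most one, and the earlier buckets take the extra rows), so when the row count is not a multiple of 10 the "30%" cut is only approximate, and tied scores at the boundary are split arbitrarily.

```python
def top_fraction_ids(score_rows, buckets=10, keep=3):
    """Return ids that fall into the first `keep` of `buckets` ntile groups by score."""
    ordered = sorted(score_rows, key=lambda row: row[1], reverse=True)  # order by score desc
    base, extra = divmod(len(ordered), buckets)
    kept, start = [], 0
    for level in range(1, keep + 1):
        size = base + (1 if level <= extra else 0)   # the first `extra` buckets get one more row
        kept.extend(student_id for student_id, _ in ordered[start:start + size])
        start += size
    return kept

scores = [(i, i * 10) for i in range(1, 11)]   # made-up data: ids 1..10, scores 10..100
print(top_fraction_ids(scores))
```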

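6: Merge 5 large sorted files into one sorted file

Borrow the merge step of merge sort (a multi-way merge).
For each sorted file, read its first element into memory and arrange these elements into an ordered list;
take the smallest element of the list and append it to the output file.
Then read the next element from the file the smallest element came from and put it at the right position in the list.
Repeat until every file has been read completely.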
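
The steps above can be sketched as a generator that holds one element per input in memory. A heap replaces the ordered list here (a swapped-in detail, not from the original text); the inputs are any sorted iterables, for example file objects wrapped to yield parsed values. Python's standard library offers the same behaviour as `heapq.merge`.

```python
import heapq

_END = object()  # sentinel marking an exhausted input

def merge_sorted(streams):
    """Lazily merge already-sorted iterables into one sorted stream."""
    iterators = [iter(s) for s in streams]
    heap = []
    for index, it in enumerate(iterators):
        head = next(it, _END)              # read the first element of each input
        if head is not _END:
            heap.append((head, index))
    heapq.heapify(heap)
    while heap:
        value, index = heapq.heappop(heap)  # smallest element currently in memory
        yield value
        following = next(iterators[index], _END)  # refill from the same input
        if following is not _END:
            heapq.heappush(heap, (following, index))

print(list(merge_sorted([[1, 4, 7], [2, 5], [3, 6, 9], [], [0, 8]])))
```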

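7: Combining several SELECTs into one result set

[grouping sets(), with cube, with rollup]
1: Get the users' distribution by sex, by city, and by level in one query
grouping sets() aggregates a group by query over several dimension combinations at once; it is equivalent to a union all of the separate group by results. The combinations are listed inside the parentheses.
select sex, city, level, count(distinct user_id) from user_info group by sex,city,level grouping sets (sex,city,level);
2: Get the distribution by sex together with the distribution by city within each sex
grouping__id (two underscores) tells which grouping set a result row belongs to
select sex, city, count(distinct user_id), grouping__id from user_info group by sex,city grouping sets(sex,(sex,city));

Question: the user distribution over every combination of sex, city, and level
with cube aggregates over all combinations of the group by dimensions
select sex, city, level, count(distinct user_id) from user_info group by sex,city,level with cube;
Question: compute the payment total of each month and of each year at the same time
with rollup aggregates hierarchically, anchored on the leftmost dimension; its groupings are a subset of cube's
select year(dt) as year, month(dt) as month, sum(pay_amount) as pay_total from user_trade where dt > "0" group by year(dt),month(dt) with rollup;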
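
Which groupings cube and rollup expand to is easy to enumerate. The sketch below only lists the dimension combinations (it does not aggregate anything), and the function names are made up for illustration.

```python
from itertools import combinations

def cube_groupings(dims):
    """All subsets of the dimensions, 2**n in total, from most to least detailed."""
    return [combo for size in range(len(dims), -1, -1) for combo in combinations(dims, size)]

def rollup_groupings(dims):
    """Only the left-anchored prefixes, n + 1 in total."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]

print(cube_groupings(["sex", "city", "level"]))
print(rollup_groupings(["year", "month"]))
```

The empty tuple is the grand total row. Every rollup grouping also appears in the cube of the same dimensions, which is what "a subset of cube" means.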

8: row_number over a nested subquery

Raw log format:
uid url datetime
Find the top 100 URLs of each day by number of visitors.

select dt,url,cnt,top
from(
select t.*
,row_number() over(partition by dt order by cnt desc) as top
from
(
select url
,to_date(datetime)as dt
,count(distinct uid)as cnt
from log_info
group by url,to_date(datetime)
)t
)final
where final.top <=100;
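
The same two-level logic (count distinct visitors per day and URL, then rank within each day) can be sketched in Python for checking small cases by hand. The function name and sample rows are made up; ties are broken by URL here, whereas row_number breaks them arbitrarily.

```python
from collections import defaultdict

def top_urls_per_day(log_rows, n):
    """log_rows: (uid, url, day) tuples. Returns {day: [(url, visitors), ...]} with at most n URLs per day."""
    visitors = defaultdict(set)
    for uid, url, day in log_rows:
        visitors[(day, url)].add(uid)          # inner query: distinct visitors per day and url
    per_day = defaultdict(list)
    for (day, url), uids in visitors.items():
        per_day[day].append((url, len(uids)))
    return {
        day: sorted(urls, key=lambda pair: (-pair[1], pair[0]))[:n]   # rank within the day, keep top n
        for day, urls in per_day.items()
    }

sample = [
    ("u1", "/a", "2021-01-01"), ("u2", "/a", "2021-01-01"), ("u1", "/a", "2021-01-01"),
    ("u1", "/b", "2021-01-01"), ("u3", "/b", "2021-01-02"),
]
print(top_urls_per_day(sample, 1))
```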

9: Next-day retention rate of yesterday's logged-in users

-- tbl is a placeholder for the login table
select count(b.uid)/count(a.uid)
from (select distinct datetime,uid from tbl where datetime ='2021-04-22') a
left join (select distinct datetime,uid from tbl where datetime ='2021-04-23') b
on a.uid=b.uid;
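
The ratio the query computes is "yesterday's users who came back today" over "yesterday's users". A minimal Python sketch of that definition, with a made-up function name and the extra assumption that an empty previous day yields 0.0:

```python
def retention_rate(yesterday_uids, today_uids):
    """Share of yesterday's distinct users who also logged in today."""
    yesterday = set(yesterday_uids)
    if not yesterday:
        return 0.0   # assumption: report 0 rather than divide by zero
    returned = yesterday & set(today_uids)
    return len(returned) / len(yesterday)

print(retention_rate([1, 2, 3, 4], [2, 4, 5]))
```

Users who appear only today (5 in the example) do not affect the rate, which matches the left join starting from yesterday's users.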

10: A table has two columns holding x-axis and y-axis values; add two columns that mark the peaks and troughs
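
The original gives no answer for this question. One plausible reading, stated here as an assumption, is that a peak is a point whose y is strictly greater than both of its neighbours in x order, and a trough is strictly smaller than both; in Hive that neighbour comparison would typically be done with the lag() and lead() window functions. A Python sketch of the rule, with a made-up function name:

```python
def mark_peaks_and_troughs(points):
    """points: (x, y) pairs. Returns (x, y, is_peak, is_trough) rows ordered by x."""
    ordered = sorted(points)
    marked = []
    for i, (x, y) in enumerate(ordered):
        is_peak = is_trough = False
        if 0 < i < len(ordered) - 1:          # endpoints have only one neighbour
            prev_y = ordered[i - 1][1]        # what lag(y) would return
            next_y = ordered[i + 1][1]        # what lead(y) would return
            is_peak = y > prev_y and y > next_y
            is_trough = y < prev_y and y < next_y
        marked.append((x, y, is_peak, is_trough))
    return marked

for row in mark_peaks_and_troughs([(1, 1), (2, 3), (3, 2), (4, 0), (5, 4)]):
    print(row)
```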

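11: Linux commands

tail top ps du awk sort
du shows the disk space used by the given directories or files
List processes: ps -ef | grep
Check what is using a port: lsof -i:8000
Follow a log file as it grows: tail -f
Write a program's log output to a file: nohup java -jar x.jar > 1.log &
find searches for files
find -name
find -path
Check disk space: df -h
Check memory usage: free -m
sort sorts the contents of text files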

10: Hive functions

from_unixtime()
unix_timestamp()

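11: Hive data quality

1: Four aspects for assessing data quality: completeness, accuracy, consistency, timeliness
2: Safeguards:
a: Completeness and accuracy: spot checks and field content coverage rate
b: Consistency: metadata lineage analysis and data diffing
c: Timeliness: monitoring of risk points, offline DQC checks, rule-based validation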

12: sql

1 Chinese 78 Zhang San
2 Math 85 Zhang San
3 Chinese 90 Li Si
4 Math 85 Li Si
6 English 90 Wang Wu

Names of students whose scores are all at least 80
select name from b GROUP BY name HAVING min(score)>=80;

How many students scored at least 60 in each subject
select count(name),a.course from (select * from b where score>=60) as a group by a.course;

Rows that exist in table a but not in table b
SELECT a.key,a.value
FROM a
WHERE a.key not in (SELECT b.key FROM b);

select a.key,a.value
from a
left join b
on a.key=b.key
where b.key is null;
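
The two forms are not always interchangeable. The sketch below uses SQLite through Python's standard `sqlite3` module (not Hive, so treat it as an illustration of standard SQL NULL semantics and verify on your Hive version): when b contains a NULL key, NOT IN returns no rows at all, while the left join form still works. Table contents and the column name `k` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k INTEGER, v TEXT);
    CREATE TABLE b (k INTEGER);
    INSERT INTO a VALUES (1, 'x');
    INSERT INTO a VALUES (2, 'y');
    INSERT INTO a VALUES (3, 'z');
    INSERT INTO b VALUES (2);
    INSERT INTO b VALUES (NULL);
""")

# NOT IN against a set containing NULL evaluates to NULL (unknown) for every non-matching key.
not_in = conn.execute(
    "SELECT a.k, a.v FROM a WHERE a.k NOT IN (SELECT b.k FROM b)"
).fetchall()

# The anti-join keeps the rows of a that found no partner in b.
anti_join = conn.execute(
    "SELECT a.k, a.v FROM a LEFT JOIN b ON a.k = b.k WHERE b.k IS NULL ORDER BY a.k"
).fetchall()

print("NOT IN:   ", not_in)
print("LEFT JOIN:", anti_join)
```

Filtering NULLs out of the subquery (where b.key is not null) makes the NOT IN form return the same rows as the join.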

12: Metadata management

Tools: Atlas
DataWorks

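13: Data modeling

Star schema (one fact table and N dimension tables; dimension tables join only to the fact table)
Snowflake schema (one fact table and N dimension tables; some dimension tables join to other dimension tables)
Constellation schema: N star schemas combined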

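14: order by, sort by, distribute by, cluster by, and group by in Hive

order by sorts globally.
sort by does not sort globally; it sorts the data before it enters each reducer. So with sort by and mapreduce.job.reduces>1, each reducer's output is sorted but the overall output is not.
distribute by is similar to the partitioner in MapReduce: it partitions the data and is used together with sort by. It controls how map-side rows are split among the reducers: Hive distributes rows according to the columns after distribute by and the number of reducers, using a hash by default.
cluster by does what distribute by does and also sorts by the same column. When distribute by and sort by use the same column, cluster by can replace them,
i.e. cluster by col <==> distribute by col sort by col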
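
The difference between "sorted per reducer" and "sorted globally" shows up clearly in a tiny simulation. This is only a sketch: Python's `hash` on small integers stands in for Hive's hash partitioning, and the function name and rows are made up.

```python
def distribute_and_sort(rows, num_reducers, dist_key, sort_key):
    """Simulate `distribute by ... sort by ...`: hash-partition rows, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Stand-in for the default hash partitioning (Hive's real hash function differs).
        reducers[hash(dist_key(row)) % num_reducers].append(row)
    return [sorted(part, key=sort_key) for part in reducers]

rows = [(3, "c"), (1, "a"), (4, "d"), (2, "b"), (6, "f")]
by_id = lambda row: row[0]
parts = distribute_and_sort(rows, 2, dist_key=by_id, sort_key=by_id)  # behaves like cluster by id
for index, part in enumerate(parts):
    print("reducer", index, part)
```

Each reducer's list is ordered, but reading the reducers one after another does not give a globally ordered result; only order by (or a single reducer) guarantees that.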

13：spark

https://blog.csdn.net/qq_32595075/article/details/79918644

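6: Audience-pack tag tables

users: uid bigint, tags array
123, [1,2,3,.....] (1000 tags)
tags: tag_id, tag_name, tag_type_id, tag_type_name
1, Beijing, 101, region
2, 18, 201, age
3, technology, 301, interest
Audience pack: ~100 million users; region, age, interest (Beijing, Tianjin; 18, 20; anime)
tag_type_id, tag_type_name, num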

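7: Produce a report showing the sales amount of each district

Orders table: district, region, category, amount

8: Find the cities where, over the last 7 days, more than 500 stores are classified as high-end

Transactions table: trade_info (iterm_id,shop_id,sales,price,dt); stores table: shop_info (shop_id,provice,city)

9: Timing and ordering problems in data processing

10: How does Kafka make sure records with the same id end up together?

11: Common Hive optimizations

12: How many kinds of data modeling are there?

13: Normal forms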

Interview preparation
