1. Large/small table join (MapJoin)
Description: when a large table is joined with a small table, the small table can be read into memory and the join performed on the Map side.
MapJoin is triggered whether the small table is on the left or the right of the join.
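A minimal HiveQL sketch (big_table, small_table, and the column names are hypothetical): with hive.auto.convert.join enabled, Hive converts an ordinary join like this automatically; the legacy /*+ MAPJOIN(b) */ hint requests the conversion explicitly.

-- b is hash-loaded into each mapper's memory; no shuffle, no reduce stage
SELECT /*+ MAPJOIN(b) */ a.id, a.val, b.val
FROM big_table a
JOIN small_table b
ON a.id = b.id;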
2. How is a table judged to be large or small?
Large table: total file size > hive.mapjoin.smalltable.filesize
Small table: total file size <= hive.mapjoin.smalltable.filesize
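Running set with just the parameter name in the Hive CLI or Beeline prints the value currently in effect:

-- prints the active threshold, e.g. hive.mapjoin.smalltable.filesize=25000000
set hive.mapjoin.smalltable.filesize;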
3. What problems can a large/small table join cause?
To see why, walk through how an ordinary reduce-side (common) join handles the two files below.

-- Large file: table A
1 a
2 b
2 b
2 b
3 c
3 c
3 c
-- Small file: table B
1 x
2 y
3 z
-- Result file
1 a x
2 b y
2 b y
2 b y
3 c z
3 c z
3 c z
Map stage:
1. Read the large file; map outputs:
   key : 1  value : a-bigTableTag
   key : 2  value : b-bigTableTag
   key : 2  value : b-bigTableTag
   key : 2  value : b-bigTableTag
   key : 3  value : c-bigTableTag
   key : 3  value : c-bigTableTag
   key : 3  value : c-bigTableTag
2. Read the small file; map outputs:
   key : 1  value : x-smallTableTag
   key : 2  value : y-smallTableTag
   key : 3  value : z-smallTableTag
3. Reduce stage:
   key : 1  value : a-bigTableTag
   key : 1  value : x-smallTableTag
   join : key : 1  value : a,x
   key : 2  value : b-bigTableTag
   key : 2  value : b-bigTableTag
   key : 2  value : b-bigTableTag
   key : 2  value : y-smallTableTag
   join : key : 2  value : b,y
          key : 2  value : b,y
          key : 2  value : b,y
   key : 3  value : c-bigTableTag
   key : 3  value : c-bigTableTag
   key : 3  value : c-bigTableTag
   key : 3  value : z-smallTableTag
   join : key : 3  value : c,z
          key : 3  value : c,z
          key : 3  value : c,z
4. Summary
Reduce-stage join: if table A has 1 billion rows, 900 million of which have id=1 and the remaining 100 million other ids, the data is severely skewed; the single ReduceTask handling id=1 runs short of resources or crawls (the job hangs at map = 100%, reduce = 99%).
Map-stage join: table B is read into memory and the join is completed in the Map stage; since the Reduce stage is avoided entirely, an uneven id distribution has no effect.
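Before picking a join strategy it is worth checking how the join key is distributed; this is a generic sketch with hypothetical names (big_table, id), not taken from the case above:

-- hypothetical skew check: the ten most frequent join keys and their counts
SELECT id, COUNT(*) AS cnt
FROM big_table
GROUP BY id
ORDER BY cnt DESC
LIMIT 10;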
4. Parameter settings
-- Based on the input file size, decide whether to convert a common (reduce-side) join into a MapJoin (default: true)
set hive.auto.convert.join=true;
-- Small-table threshold (default: 25 MB)
set hive.mapjoin.smalltable.filesize=25000000;
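Conversely, setting the flag to false forces a plain reduce-side join, which is presumably how the reduceJoin run in the test below was produced:

-- disable automatic MapJoin conversion to force a reduce-side (common) join
set hive.auto.convert.join=false;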
5. MapJoin use cases
1. Joins between a large table and a small table
6. Test case
Table A:
  Size:      79.2 M  dt=20220209/000000_0
             72.9 M  dt=20220209/000001_0
             8.0 M   dt=20220209/000002_0
             11.1 M  dt=20220209/000003_0
  Row count: 1613772
Table B:
  Size:      27.0 K  dt=20220209/000000_0
  Row count: 523
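The source does not show the SQL behind the logs that follow; a query of roughly this shape, with hypothetical table and column names, matches the setup (the logs mention loading into a table named tab):

-- hypothetical test query joining the 20220209 partitions of A and B
INSERT OVERWRITE TABLE tab
SELECT a.*, b.val
FROM table_a a
JOIN table_b b ON a.id = b.id
WHERE a.dt = '20220209'
  AND b.dt = '20220209';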
--===============reduceJoin=================================================================
INFO : Number of reduce tasks not specified. Estimated from input data size: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO :   set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO :   set mapreduce.job.reduces=<number>
INFO : number of splits:2
INFO : Submitting tokens for job: job_1640670724826_487384
INFO : The url to track the job: j
INFO : Starting Job = job_1640670724826_487384, Tracking URL =
INFO : Kill Command = job -kill job_1640670724826_487384
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO : 2022-02-10 14:37:22,574 Stage-1 map = 0%, reduce = 0%
INFO : 2022-02-10 14:37:28,863 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 8.68 sec
INFO : 2022-02-10 14:37:42,394 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 21.03 sec
INFO : 2022-02-10 14:37:45,532 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 21.85 sec
INFO : 2022-02-10 14:37:56,990 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 27.97 sec
INFO : 2022-02-10 14:38:08,359 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 31.64 sec
INFO : 2022-02-10 14:38:13,527 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 39.42 sec
INFO : 2022-02-10 14:38:14,562 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 40.27 sec
INFO : MapReduce Total cumulative CPU time: 40 seconds 270 msec
INFO : Ended Job = job_1640670724826_487384
INFO : Loading data to table tab from
INFO : Table tab stats: [numFiles=1, numRows=1238897, totalSize=74173, rawDataSize=441047332]
No rows affected (93.818 seconds)
--================MapJoin================================================================
INFO : Execution completed successfully
INFO : MapredLocal task succeeded
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_
INFO : The url to track the job:
INFO : Starting Job = job_1640670724826_487444, Tracking URL =
INFO : Kill Command = job -kill job_
INFO : Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
INFO : 2022-02-10 14:41:32,271 Stage-4 map = 0%, reduce = 0%
INFO : 2022-02-10 14:41:43,802 Stage-4 map = 89%, reduce = 0%, Cumulative CPU 12.68 sec
INFO : 2022-02-10 14:41:46,957 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 17.03 sec
INFO : MapReduce Total cumulative CPU time: 17 seconds 30 msec
INFO : Ended Job = job_1640670724826_487444
INFO : Loading data to table tab1 from
INFO : Table tab stats: [numFiles=1, numRows=1238897, totalSize=3935316, rawDataSize=441047332]
No rows affected (49.744 seconds)
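Summary: on the same input, the reduce-side join ran 2 mappers plus 1 reducer and took 93.818 seconds, while the MapJoin ran a single map-only stage (1 mapper, 0 reducers) and finished in 49.744 seconds, since the shuffle and reduce phases are skipped entirely.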