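Technical background: http://lxw1234.com/archives/2016/04/632.htm
The Hive table is stored as ORC.
Optimization approach in this note: use a Bloom filter together with a second-level dynamic partition.
Hands-on:
1, Create the table: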
CREATE TABLE test (
    mall_id            bigint COMMENT 'store id',
    mall_collection_id bigint COMMENT 'merchant package id',
    city_id            bigint COMMENT 'city id',
    city_name          string COMMENT 'city name',
    province_id        bigint COMMENT 'province id',
    province_name      string COMMENT 'province',
    is_illegal         bigint COMMENT 'whether in violation',
    stat_day           string COMMENT 'statistics date'
)
COMMENT 'XXXX'
PARTITIONED BY (
    pt          string COMMENT 'partition date',
    mall_col_id bigint COMMENT 'id'
)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'='SNAPPY',
    'orc.create.index'='true',
    -- Bloom filters are built on these two columns because the API queries data by them
    'orc.bloom.filter.columns'='mall_collection_id,stat_day',
    'orc.bloom.filter.fpp'='0.05',
    'orc.stripe.size'='10485760',
    'orc.row.index.stride'='10000'
);
2, Insert the data into the result table:
INSERT OVERWRITE TABLE test PARTITION (pt = '${env.YYYYMMDD}', mall_col_id)
SELECT
    mall_id,
    mall_collection_id,
    city_id,
    city_name,
    province_id,
    province_name,
    is_illegal,
    stat_day,
    mall_collection_id % 1000 AS mall_col_id
FROM A
DISTRIBUTE BY mall_collection_id
-- Keep the sort columns consistent with the Bloom filter columns
SORT BY mall_collection_id, stat_day
;
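Because mall_col_id is computed as mall_collection_id % 1000, one run of the insert above can create up to 1000 second-level partitions under a single pt. Hive commonly caps dynamic partitions at 1000 per job and 100 per mapper/reducer by default, so the session may need settings along the following lines before the INSERT. This is only a sketch: the values are illustrative assumptions, not taken from the original setup, and the right limits depend on the Hive version and cluster.

```sql
-- Assumed session settings; the numeric limits are illustrative only.
SET hive.exec.dynamic.partition=true;
-- Maximum number of dynamic partitions one job may create in total.
SET hive.exec.max.dynamic.partitions=2000;
-- Maximum number of dynamic partitions a single mapper/reducer may create.
SET hive.exec.max.dynamic.partitions.pernode=2000;
```

Since pt is given statically, the statement should also be accepted under the strict dynamic partition mode.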
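The Bloom filter helps because it lets the ORC reader skip row groups that cannot contain the requested mall_collection_id or stat_day values, which reduces the amount of data scanned. Sorting by the same columns at write time clusters equal values together, so more row groups can be skipped.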
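A lookup that benefits from both the second-level partition and the Bloom filter might look like the sketch below. The id 123456, the partition date, and the stat_day format are hypothetical values chosen for illustration, and the selected columns are arbitrary. The mall_col_id value is computed by the caller as 123456 % 1000 = 456 so that Hive can prune down to a single partition.

```sql
-- Assumed setting: allows the ORC reader to use row-group indexes and Bloom filters.
SET hive.optimize.index.filter=true;

-- Hypothetical lookup for mall_collection_id = 123456.
-- pt and mall_col_id prune partitions; mall_col_id is 123456 % 1000.
-- The equality predicates on mall_collection_id and stat_day can use the Bloom filters.
SELECT mall_id, city_name, province_name, is_illegal
FROM test
WHERE pt = '20240101'
  AND mall_col_id = 456
  AND mall_collection_id = 123456
  AND stat_day = '2024-01-01';
```

Bloom filters only help equality-style predicates (=, IN); range conditions fall back to the min/max statistics in the ORC row index.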
