Environment
VM: VMware 10
Linux: CentOS-6.5-x86_64
Client: Xshell4
FTP: Xftp4
JDK 8
hadoop-3.1.1
apache-hive-3.1.1
I. Hive Parameters
1. Hive Parameter Types
Every Hive parameter and variable name begins with a namespace.

They are referenced with ${}; variables in the system and env namespaces must include their prefix when referenced.
Viewing parameters in the Hive CLI:
# show all parameters
hive> set;
# view a single parameter
hive> set hive.cli.print.header;
hive.cli.print.header=false
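A minimal sketch of referencing variables from the different namespaces (the hivevar name my_dt is made up for illustration; system:user.name and env:HOME are standard JVM/OS properties):

hive> set hivevar:my_dt=2019-02-15;
-- substitution happens on the query text before it is parsed,
-- so each ${...} below becomes a plain string literal
hive> select '${hivevar:my_dt}', '${system:user.name}', '${env:HOME}';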
2. Ways to Set Hive Parameters
(1) Edit the config file ${HIVE_HOME}/conf/hive-site.xml. This takes effect for all clients.
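For example, to make column headers print for every client, the corresponding entry in hive-site.xml would look roughly like this (the property name is standard; treat the snippet as a sketch):

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>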
(2) When starting the Hive CLI, pass --hiveconf key=value. This takes effect only in the current client session.
Example:
[root@PCS102 ~]# hive --hiveconf hive.cli.print.header=true
hive> set hive.cli.print.header;
hive.cli.print.header=true
hive>
(3) After entering the CLI, use the set command. This takes effect only in the current client session.
hive> set hive.cli.print.header;
hive.cli.print.header=false
hive> select * from wc;
OK
hadoop	2
hbase	1
hello	2
name	3
world	1
zookeeper	1
Time taken: 2.289 seconds, Fetched: 6 row(s)
hive> set hive.cli.print.header=true;
hive> set hive.cli.print.header;
hive.cli.print.header=true
hive> select * from wc;
OK
wc.word	wc.totalword
hadoop	2
hbase	1
hello	2
name	3
world	1
zookeeper	1
Time taken: 2.309 seconds, Fetched: 6 row(s)
hive>
(4) Use a .hiverc file
The .hiverc file lives in the current user's home directory (e.g., /root for the root user).
If it does not exist, create it and write the desired settings into it; Hive loads the settings from this file at startup.
[root@PCS102 ~]# vi ~/.hiverc
set hive.cli.print.header=true
:wq
[root@PCS102 ~]# ll -a | grep hive
-rw-r--r--. 1 root root 5562 Feb 15 15:01 .hivehistory
-rw-r--r--. 1 root root   31 Feb 15 15:03 .hiverc
Also:
.hivehistory records the history of commands run in Hive.
# Log in again and the setting has taken effect; it affects every client started by the current Linux user
[root@PCS102 ~]# hive
hive> set hive.cli.print.header;
hive.cli.print.header=true
hive>
II. Dynamic Partitions
Parameter settings:
Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Default: strict (at least one partition column must be static; see the sketch after this parameter list).
Other parameters:
set hive.exec.max.dynamic.partitions.pernode;
Maximum number of dynamic partitions that may be created on each MR node (default 100).
set hive.exec.max.dynamic.partitions;
Maximum number of dynamic partitions that may be created across all MR nodes (default 1000).
set hive.exec.max.created.files;
Maximum number of files that all MR jobs may create (default 100000).
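The strict/nonstrict difference in a sketch, written against the psn21/psn22 tables defined in the example below (dynamic partition columns must come last in the SELECT, in partition order):

-- strict mode: the leading partition column age is pinned to a static value,
-- sex may still be filled dynamically
insert overwrite table psn22 partition(age=21, sex)
select id, name, likes, address, sex from psn21 where age = 21;

-- nonstrict mode: both partition columns may be dynamic
insert overwrite table psn22 partition(age, sex)
select id, name, likes, address, age, sex from psn21;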
Data in /root/data:
1,小明1,18,boy,lol-book-movie,beijing:shangxuetang-shanghai:pudong
2,小明2,20,man,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,21,boy,lol-book-movie,beijing:shangxuetang-shanghai:pudong
4,小明4,21,man,lol-book-movie,beijing:shangxuetang-shanghai:pudong
5,小明5,21,boy,lol-book-movie,beijing:shangxuetang-shanghai:pudong
6,小明6,21,man,lol-book-movie,beijing:shangxuetang-shanghai:pudong
1. The source table
hive> CREATE TABLE psn21(
    > id INT,
    > name STRING,
    > age INT,
    > sex string,
    > likes ARRAY<STRING>,
    > address MAP<STRING,STRING>
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '-'
    > MAP KEYS TERMINATED BY ':'
    > LINES TERMINATED BY '\n';
OK
Time taken: 0.183 seconds
hive> LOAD DATA LOCAL INPATH '/root/data' INTO TABLE psn21;
Loading data to table default.psn21
OK
Time taken: 0.248 seconds
hive> select * from psn21;
OK
psn21.id	psn21.name	psn21.age	psn21.sex	psn21.likes	psn21.address
1	小明1	18	boy	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
2	小明2	20	man	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
3	小明3	21	boy	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
4	小明4	21	man	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
5	小明5	21	boy	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
6	小明6	21	man	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}
Time taken: 0.113 seconds, Fetched: 6 row(s)
hive>
2. The partitioned table
hive> CREATE TABLE psn22(
    > id INT,
    > name STRING,
    > likes ARRAY<STRING>,
    > address MAP<STRING,STRING>
    > )
    > partitioned by (age int, sex string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > COLLECTION ITEMS TERMINATED BY '-'
    > MAP KEYS TERMINATED BY ':'
    > LINES TERMINATED BY '\n';
OK
Time taken: 0.045 seconds
3. Load the source table's data into the partitioned table (note: the data in psn21 is unchanged)
hive> from psn21
    > insert overwrite table psn22 partition(age, sex)
    > select id, name, likes, address, age, sex distribute by age, sex;
Query ID = root_20190215170643_7aeb9dae-62d5-49fe-ab37-022446f6a004
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1548397153910_0009, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0009/
Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-02-15 17:06:50,930 Stage-1 map = 0%, reduce = 0%
2019-02-15 17:06:55,069 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.86 sec
2019-02-15 17:07:00,206 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.26 sec
MapReduce Total cumulative CPU time: 6 seconds 260 msec
Ended Job = job_1548397153910_0009
Loading data to table default.psn22 partition (age=null, sex=null)
Time taken to load dynamic partitions: 0.482 seconds
Time taken for adding to write entity : 0.001 seconds
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 6.26 sec  HDFS Read: 13250 HDFS Write: 599 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 260 msec
OK
id	name	likes	address	age	sex
Time taken: 18.572 seconds
hive>
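Given the six source rows, the insert should have created four partitions; they can be listed with show partitions (a sketch: this output was not captured in the original session, but it follows from the data above):

hive> show partitions psn22;
OK
age=18/sex=boy
age=20/sex=man
age=21/sex=boy
age=21/sex=man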
Inspect the data under one partition:
[root@PCS102 ~]# hdfs dfs -cat /root/hive_remote/warehouse/psn22/age=21/sex=boy/*
5,小明5,lol-book-movie,beijing:shangxuetang-shanghai:pudong
3,小明3,lol-book-movie,beijing:shangxuetang-shanghai:pudong
[root@PCS102 ~]#
Data across all partitions:
hive> select * from psn22;
OK
psn22.id	psn22.name	psn22.likes	psn22.address	psn22.age	psn22.sex
1	小明1	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}	18	boy
2	小明2	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}	20	man
5	小明5	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}	21	boy
3	小明3	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}	21	boy
6	小明6	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}	21	man
4	小明4	["lol","book","movie"]	{"beijing":"shangxuetang","shanghai":"pudong"}	21	man
Time taken: 0.141 seconds, Fetched: 6 row(s)
hive>
III. Bucketing
1. Bucketing
A bucketed table hashes a column's values to spread rows across separate files.
Every Hive table or partition can additionally be bucketed.
The hash of the column value, modulo the number of buckets, determines which bucket each row lands in.
Typical use cases: data sampling (TABLESAMPLE) and map-side joins.
2. Enabling bucketing
set hive.enforce.bucketing=true;
Default: false. When set to true, the MR job automatically sets the number of reduce tasks to match the number of buckets at run time.
(You can also set the reducer count yourself via mapred.reduce.tasks, but this is not recommended when writing bucketed tables.)
(Side note: from Hive 2.0 onward this property was reportedly removed and bucketing is always enforced, so on Hive 3.1.1 this step may be unnecessary.)
Note: one job produces exactly as many buckets (files) as reduce tasks.
3. Sampling a bucketed table
select * from bucket_table tablesample(bucket 1 out of 4 on <column>);
TABLESAMPLE syntax:
TABLESAMPLE(BUCKET x OUT OF y)
x: the bucket to start drawing from
y: must be a multiple or a factor of the table's total bucket count (y is the stride between successively sampled buckets)
Examples (a concrete query sketch follows this list):
For a table with 32 buckets:
(1) TABLESAMPLE(BUCKET 2 OUT OF 4): which data is drawn?
Amount of data: 32/4 = 8 buckets' worth
Buckets: 2, 6 (2+4), 10 (6+4), 14 (10+4), 18 (14+4), 22 (18+4), 26 (22+4), 30 (26+4)
(2) TABLESAMPLE(BUCKET 3 OUT OF 8): which data is drawn?
Amount of data: 32/8 = 4 buckets' worth
Buckets: 3, 11 (3+8), 19 (11+8), 27 (19+8)
(3) TABLESAMPLE(BUCKET 3 OUT OF 256): which data is drawn?
Amount of data: 32/256 = 1/8 of a bucket
Bucket: 3; take 1/8 of that bucket
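Written out as a query, case (1) above would look roughly like this (big_table and its bucketing column id are hypothetical; the table is assumed to be clustered into 32 buckets):

-- draws buckets 2, 6, 10, ..., 30 of the 32-bucket table
select * from big_table tablesample(bucket 2 out of 4 on id);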
4. Bucketing example
The source table:
CREATE TABLE psn31 (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Data in /root/data2:
1,tom,11
2,cat,22
3,dog,33
4,hive,44
5,hbase,55
6,mr,66
7,alice,77
8,scala,88
Load the data:
hive>load data local inpath '/root/data2' into table psn31;
Create the bucketed table:
CREATE TABLE psnbucket (
  id INT,
  name STRING,
  age INT
)
CLUSTERED BY (age) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Predicting bucket assignment:
age % 4
1,tom,11    -- 3
2,cat,22    -- 2
3,dog,33    -- 1
4,hive,44   -- 0
5,hbase,55  -- 3
6,mr,66     -- 2
7,alice,77  -- 1
8,scala,88  -- 0
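The same prediction can be run inside Hive with the built-in hash and pmod functions; under the classic (bucketing version 1) scheme, the hash of an int is the value itself, so this mirrors age % 4 (a sanity-check sketch, not from the original session):

-- preview which bucket each row should land in under the version-1 hash,
-- where hash(<int>) is simply the integer value
select id, name, age, pmod(hash(age), 4) as bucket from psn31;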
Load the data; this runs an MR job and leaves four files under the table directory (a bucketed table cannot be populated with LOAD DATA directly; it must be filled by inserting from another table):
hive>insert into table psnbucket select id, name, age from psn31;

Check whether each bucket file's contents match the prediction:
[root@PCS102 ~]# hdfs dfs -cat /root/hive_remote/warehouse/psnbucket/000000_0
8,scala,88
4,hive,44
[root@PCS102 ~]# hdfs dfs -cat /root/hive_remote/warehouse/psnbucket/000001_0
7,alice,77
3,dog,33
[root@PCS102 ~]# hdfs dfs -cat /root/hive_remote/warehouse/psnbucket/000002_0
6,mr,66
2,cat,22
[root@PCS102 ~]# hdfs dfs -cat /root/hive_remote/warehouse/psnbucket/000003_0
5,hbase,55
1,tom,11
Sampling: the result differs from what earlier versions would lead you to expect. Odd: why is the data not taken from 000001_0? A plausible explanation (not verified in the original post) is Hive 3's bucketing version 2 (HIVE-18910), which replaced the old hash with a Murmur hash; TABLESAMPLE ... ON age then computes bucket membership with the new hash, which no longer lines up with the simple age % 4 file layout shown above.
hive> select id, name, age from psnbucket tablesample(bucket 2 out of 4 on age);
OK
id	name	age
6	mr	66
1	tom	11
Time taken: 0.184 seconds, Fetched: 2 row(s)
hive>
