Hive Syntax and Advanced Usage (Part 1)

Reposted article. 2021-09-27 21:44. Tags: Hive

1. Full Hive CREATE TABLE syntax

    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
        [(col_name data_type [COMMENT col_comment], ...)]
        [COMMENT table_comment]
        [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
        [CLUSTERED BY (col_name, col_name, ...)
            [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
        [
            [ROW FORMAT row_format]
            [STORED AS file_format]
            | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  (Note: only available starting with 0.6.0)
        ]
        [LOCATION hdfs_path]
        [TBLPROPERTIES (property_name=property_value, ...)]  (Note: only available starting with 0.6.0)
        [AS select_statement]  (Note: this feature is only available starting with 0.5.0.)

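Notes:
　　　　[]: marks an optional clause
　　　　EXTERNAL: creates an external table
　　　　(col_name data_type [COMMENT col_comment], ...): defines the column names and column types
　　　　COMMENT col_comment: adds a comment to a column
　　　　COMMENT table_comment: adds a comment to the table
　　　　PARTITIONED BY (col_name data_type [COMMENT col_comment], ...): defines the partition columns, their types, and optional comments
　　　　CLUSTERED BY (col_name, col_name, ...): defines the bucketing columns
　　　　[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS: sets the sort columns (ascending or descending) within each bucket and the number of buckets
　　　　ROW FORMAT row_format: sets the row and column delimiters (the default row delimiter is \n)
　　　　STORED AS file_format: sets the storage format, for example textfile, rcfile, or sequencefile; the default is textfile
　　　　LOCATION hdfs_path: sets the storage location (by default, under the directory configured by hive.metastore.warehouse.dir)
　　　　TBLPROPERTIES (property_name=property_value, ...): attaches key/value properties to the table; often used together with external tables, for example when mapping an HBase table so that HBase data can be queried with HQL (although such queries are relatively slow)
　　　　AS select_statement: creates the table from the result of a query, where select_statement is a SQL SELECT statement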
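
To make the clause order concrete, here is a minimal sketch of one statement that uses most of the optional clauses. The table name `students_demo`, its columns, the `/data/students_demo` path, and the `creator` property are all hypothetical and only serve as illustration.

```sql
-- Hypothetical table: name, columns, path and property value are illustrative only.
CREATE EXTERNAL TABLE IF NOT EXISTS students_demo
(
    id   BIGINT COMMENT 'student id',
    name STRING COMMENT 'student name',
    age  INT    COMMENT 'age in years'
)
COMMENT 'demo table exercising the optional clauses'
-- The partition column is declared here, not in the column list above.
PARTITIONED BY (clazz STRING COMMENT 'class name')
-- Rows are hashed on id into 4 buckets and kept sorted by id inside each bucket.
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/students_demo'
TBLPROPERTIES ('creator' = 'demo');
```

The clauses have to appear in the order given by the grammar above; AS select_statement is left out here because it is normally used without an explicit column list.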

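2. Creating a table the default way

    create table students01
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Notes:
If no field delimiter is specified, Hive falls back to its default delimiter \001 (Ctrl-A), so a comma-separated line is not split into columns.
You should normally specify the column delimiter; it can be omitted only when the table has a single column: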

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

3. Table creation 2: specifying LOCATION

    create table students02
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'data';

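4. Table creation 3: specifying the storage format

    create table student_rc
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS rcfile;

Notes:

　　　　STORED AS rcfile sets the storage format to RCFile (inputFormat: RCFileInputFormat, outputFormat: RCFileOutputFormat). If no format is specified, the default is textfile.

　　　　Raw text files cannot be loaded directly into a table stored in any format other than textfile; such tables have to be populated from another table with a query.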
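
As a sketch of that "load through another table" route, the raw text can first be loaded into the textfile table students01 and then rewritten into student_rc with a query. The file path is the one used elsewhere in this article; that students01 starts out empty is an assumption.

```sql
-- Stage the raw comma-separated text in the textfile table.
load data local inpath '/usr/local/data/students.txt' into table students01;

-- Let Hive read the text rows and write them back out in RCFile format.
insert into table student_rc select * from students01;
```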

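5. Table creation 4: loading data from another table
　　Format:
　　　　create table xxxx as select_statement (a SQL SELECT statement; this form is used quite often)

　　Example:

    create table students4 as select * from students02;

6. Table creation 5: copying the structure of another table

　　Format:
　　　　create table xxxx like table_name (creates the table only, without loading any data)

　　Example:

    create table student04 like students;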

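7. Loading data into Hive

　　　　1. Use `hadoop dfs -put '<local data>' '<HDFS directory of the Hive table>'`

　　　　2. Use load data inpath (this moves the file within HDFS; it is a move, not a copy)

　　　　3. Use load data local inpath (the most commonly used form; it uploads a local file)

　　　　Adding overwrite replaces the existing data. In effect Hadoop performs an rmr followed by a put.
　　　　Example: load data local inpath '/usr/local/data/students.txt' overwrite into table students01;

Difference between method 1 and method 2:

　　　　1. Uploading to the HDFS directory bypasses the Hive table entirely: nothing checks the data format at upload time, although the format still has to match when Hive reads the data.

　　　　2. Loading goes through the Hive table, so the data is expected to match the table's format.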
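
Method 2 has no example above, so here is a minimal sketch. The HDFS paths `/tmp/students.txt` and `/tmp/students_new.txt` are hypothetical; the files are assumed to already be on HDFS.

```sql
-- The file is moved into the table's directory, so it no longer exists at /tmp/students.txt afterwards.
load data inpath '/tmp/students.txt' into table students01;

-- With overwrite, the table's existing files are replaced instead of being appended to.
load data inpath '/tmp/students_new.txt' overwrite into table students01;
```

Because the source file disappears from its original location, keep a copy elsewhere if other jobs still need to read it from there.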

8. Truncating a table

    truncate table students01;

Note: truncating removes the table's data; it does not drop the table.

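Inserting query results: insert into table xxxx followed by a SQL SELECT statement (no AS). This is how data is transferred into a Hive table stored in a different format.

　　Example:

    insert into table student04 select * from students01;

　　To overwrite instead of append, replace into with overwrite.

　　Example:

    insert overwrite table student04 select * from students01;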

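9. Hive managed (internal) tables vs. external tables

Differences:

　　　　Dropping a managed table also deletes its data.
　　　　Dropping an external table removes only the table definition; the data stays in place.

Notes:

　　　　In production, external tables are the usual choice, so that accidentally dropping a table does not also lose the data.
　　　　You cannot tell from a path alone whether a directory belongs to a Hive table, or whether that table is managed or external.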

Create the tables:

    -- managed table
    create table students_managed01
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- managed table with an explicit location
    create table students_managed02
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/managed';

    -- external table
    create external table students_external01
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- external table with an explicit location
    create external table students_external02
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/external';

Load data:

    hive> load data local inpath '/usr/local/data/students.txt' into table students_managed01;
    hive> load data local inpath '/usr/local/data/students.txt' into table students_managed02;
    hive> load data local inpath '/usr/local/data/students.txt' into table students_external01;
    hive> load data local inpath '/usr/local/data/students.txt' into table students_external02;

Drop the tables:

    hive> drop table students_managed01;
    hive> drop table students_managed02;
    hive> drop table students_external01;
    hive> drop table students_external02;

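Summary of external vs. managed tables:

　　　　As shown above, dropping a managed table deletes the table's data (the files on HDFS) together with its metadata.

　　　　Dropping an external table deletes only the metadata; the table's data (the files on HDFS) is left untouched.

　　　　Companies generally use external tables more, because the data may need to be shared by several programs and must be protected from accidental deletion. External tables are usually combined with LOCATION.

　　　　External tables can also map data from other sources into Hive, for example HBase or Elasticsearch.

　　　　The original motivation for external tables is to decouple a table's metadata from its data.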
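
Since the path does not reveal whether a table is managed or external, one way to check is to read the table's metadata. A minimal sketch, assuming it is run before the drop statements above:

```sql
-- In the output, the "Table Type" row reads MANAGED_TABLE or EXTERNAL_TABLE,
-- and the "Location" row shows the HDFS directory backing the table.
desc formatted students_managed02;
desc formatted students_external02;
```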

10. Creating a single-level partitioned table in Hive

1. Create the partitioned table

    create table students_pt
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    PARTITIONED BY(month string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2. Load data

    load data local inpath '/usr/local/data/students.txt' into table students_pt partition(month='2021-09-26');

3. Query by partition

Single-partition query

    select * from students_pt where month='2021-09-26';

Multi-partition query

    select * from students_pt where month='2021-09-26' or month='2021-09-24';

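4. Add partitions

Add a single partition

    alter table students_pt add partition(month='2021-09-25');

Add several partitions (note: no comma between the partition clauses)

    alter table students_pt add partition(month='2021-09-23') partition(month='2021-09-24');

5. Drop partitions

Drop a single partition

    alter table students_pt drop partition(month='2021-09-23');

Drop several partitions (note: the partition clauses are separated by a comma)

    alter table students_pt drop partition(month='2021-09-24'),partition(month='2021-09-25');

6. List the partitions of a partitioned table

    show partitions students_pt;

7. Show the structure of a partitioned table

    desc formatted students_pt;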

11. Creating a multi-level partitioned table in Hive

1. Create a two-level partitioned table

    hive> create table score_pt(
        > id int,
        > subjectid int,
        > score int)
        > partitioned by (month string,day string)
        > row format delimited fields terminated by ',';

2. Load data

    load data local inpath '/usr/local/data/score.txt' into table score_pt partition(month='2021-09',day='01');

3. Query data

    select * from score_pt where month='2021-09' and day='01';

4. Add second-level partitions

    hive> alter table score_pt add partition(month='2021-09',day='02');

    alter table score_pt add partition(month='2021-09',day='03') partition(month='2021-09',day='04');

Note: no comma, same as adding single-level partitions. Because day is a string column, its values are quoted so that the leading zero is preserved.

5. Drop second-level partitions

    alter table score_pt drop partition(month='2021-09',day='02');

    alter table score_pt drop partition(month='2021-09',day='03'),partition(month='2021-09',day='04');

Note: with a comma, same as dropping single-level partitions.
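
To verify the result of these add and drop statements on a multi-level table, the partition list can be filtered with a partial partition spec. A minimal sketch against the score_pt table from this section:

```sql
-- List every (month, day) partition of the table.
show partitions score_pt;

-- List only the day-level partitions that sit under one month.
show partitions score_pt partition(month='2021-09');
```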

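12. Dynamic partitioning

> Sometimes the raw table already contains a date column such as dt, and we want to split the rows into partitions by the distinct dates in dt, turning the raw table into a partitioned table.
>
> Hive does not enable dynamic partitioning by default in older versions (hive.exec.dynamic.partition defaults to false before Hive 0.9.0), and the default mode is strict.
>
> Dynamic partitioning: partitions are derived from the distinct values of one or more columns in the data.

##### Enabling dynamic partition support in Hive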

```
# enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
# dynamic partition mode: strict (at least one static partition is required) or nonstrict
# strict example: insert into table students_pt partition(dt='anhui',pt) select ......,pt from students;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# maximum number of dynamic partitions each mapper or reducer node may create (1000 here); tune it to your workload
hive> set hive.exec.max.dynamic.partitions.pernode=1000;
```

#### Inserting data with dynamic partitioning

1. Create the tables

Source table holding the raw data:

    create table students_dt
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string,
        dt string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Partitioned target table:

    create table students_dt_p
    (
        id bigint,
        name string,
        age int,
        gender string,
        clazz string
    )
    PARTITIONED BY(dt string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2. Insert the data (this is the only way to populate partitions dynamically)

The partition column must come last in the select list (with several partition columns, they all go last, in order); columns are matched by position, not by name.

    insert into table students_dt_p partition(dt) select id,name,age,gender,clazz,dt from students_dt;
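
For comparison, strict mode requires at least one static partition value. The sketch below mixes one static and one dynamic partition column on the two-level score_pt table; `score_src` is a hypothetical unpartitioned table with columns (id, subjectid, score, day) that does not appear elsewhere in this article.

```sql
-- score_src is hypothetical: an unpartitioned table with columns (id, subjectid, score, day).
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;

-- month is static (a fixed value); day is dynamic and is filled from the last select column.
-- Static partition columns have to come before dynamic ones in the partition clause.
insert into table score_pt partition(month='2021-09', day)
select id, subjectid, score, day from score_src;
```

In nonstrict mode the same statement could leave both month and day dynamic, in which case both columns would be selected last, in that order.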

More on partitions: https://developer.aliyun.com/article/81775

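#### Hive bucketing

> Bucketing splits the files (data) one step further.
>
> Bucketing enforcement is off by default (hive.enforce.bucketing is false in Hive 0.x and 1.x; the property was removed in Hive 2.0, where bucketing is always enforced).
>
> Purpose: when rows are inserted into a bucketed table, Hive hashes the columns named in clustered by and takes the result modulo the number of buckets, so the data ends up split into that many files. This spreads the data more evenly, which can help with data skew on the map side, makes sampling easier, and improves map join efficiency.
>
> Choose the bucketing columns according to your business needs.

##### Turning on the bucketing switch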

```
hive> set hive.enforce.bucketing=true;
```

##### Creating a bucketed table

```
create table students_buks
(
    id bigint,
    name string,
    age int,
    gender string,
    clazz string
)
CLUSTERED BY (clazz) into 12 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```
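
One of the stated benefits of bucketing is easier sampling. As a sketch, and assuming students_buks has already been populated with insert ... select so that the 12 buckets really exist, a bucket-based sample could look like this:

```sql
-- Read only bucket 1 of the 12 buckets on clazz, roughly one twelfth of the rows.
select * from students_buks tablesample(bucket 1 out of 12 on clazz);

-- With 12 physical buckets, "1 out of 6" reads buckets 1 and 7, roughly one sixth of the rows.
select * from students_buks tablesample(bucket 1 out of 6 on clazz);
```

How evenly the rows are spread depends on the values in clazz, so the fractions are only approximate.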

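##### Inserting data into a bucketed table

```
-- load data on its own does not distribute the rows across buckets
load data local inpath '/usr/local/soft/data/students.txt' into table students_buks;

-- insert through a query instead, so that the bucketed table actually works as intended
insert into table students_buks select * from students;
```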
