Hadoop Hive Concepts Study Series: Indexes in Hive (Part 13)

Reposted article, originally published 2016-11-26 21:35. Category: Hadoop Hive Concepts Study Series

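Hive supports indexes, but they differ from indexes in relational databases; for example, Hive does not support primary keys or foreign keys.

An index can be built on selected columns of a Hive table to speed up certain operations, for example by reducing the number of data blocks that a MapReduce job has to read.

When you can foresee that partitioning would leave you with an unmanageably large amount of partition data, an index is often a better choice than partitioning.

I recommend that readers consult the Hive documentation for a deeper understanding of indexes on Hive tables.

Always keep in mind that Hive, unlike a transactional database, does not query, update, or delete individual rows. In a transactional database, those row-level operations rely on efficient indexes to achieve high performance.

Hive is a batch-processing tool, typically run across many task nodes to scan large-scale data quickly. Relational databases, by contrast, suit typical single-machine, I/O-intensive workloads.

Hive gets its performance from parallelism, so it is better suited to operations such as full table scans than to the row-level access patterns of a relational database.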

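Why create an index?
The purpose of a Hive index is to speed up queries on specific columns of a Hive table.
Without an index, a query with a predicate such as 'WHERE tab1.col1 = 10' makes Hive load the entire table or partition and then process every row.
If an index exists on column col1, only part of the files has to be loaded and processed.
As in traditional databases, the faster queries come at a cost: extra resources to build the index and extra disk space to store it.
Indexing was added in Hive 0.7.0. Bitmap indexes were added in Hive 0.8.0.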
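To make the "only part of the files" point concrete, here is a minimal Python sketch, not Hive's actual implementation. It assumes a toy table cut into fixed-size blocks and an index that maps each col1 value to the blocks containing it. The names make_blocks, build_index, full_scan and indexed_scan are made up for this illustration.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # rows per block; tiny on purpose so the effect is visible


def make_blocks(rows, block_size=BLOCK_SIZE):
    """Split a list of (col1, payload) rows into fixed-size blocks."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def build_index(blocks):
    """Map each col1 value to the set of block numbers that contain it."""
    index = defaultdict(set)
    for block_no, block in enumerate(blocks):
        for col1, _payload in block:
            index[col1].add(block_no)
    return index


def full_scan(blocks, value):
    """No index: read every block and test every row."""
    hits = [row for block in blocks for row in block if row[0] == value]
    return hits, len(blocks)


def indexed_scan(blocks, index, value):
    """With an index: read only the blocks the index points to."""
    block_nos = sorted(index.get(value, ()))
    hits = [row for n in block_nos for row in blocks[n] if row[0] == value]
    return hits, len(block_nos)


# Hypothetical data: 20 rows whose col1 values are 0..19.
rows = [(i, f"row-{i}") for i in range(20)]
blocks = make_blocks(rows)
index = build_index(blocks)

print(full_scan(blocks, 10))            # same row, every block read
print(indexed_scan(blocks, index, 10))  # same row, fewer blocks read
```

Both calls return the same row; the second element of each result is the number of blocks that had to be read. The index itself also takes memory and time to build, which mirrors the trade-off described above.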

Hive's two-dimensional coordinate system (step 1: locate the row key -> step 2: locate the column qualifier)


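HBase's four-dimensional coordinate system (step 1: locate the row key -> step 2: locate the column family -> step 3: locate the column qualifier -> step 4: locate the timestamp)

The row key acts as the first-level index.

The column family acts as the second-level index.

The column qualifier acts as the third-level index.

The timestamp acts as the fourth-level index.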
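The four coordinates can be pictured as nested maps. The sketch below is only a mental model in Python, not the HBase client API. The table contents, the get_cell helper, and the choice to return the newest version when no timestamp is given are assumptions made for illustration.

```python
# row key -> column family -> column qualifier -> timestamp -> value
# All keys and values here are hypothetical sample data.
table = {
    "row-001": {
        "info": {
            "name": {
                1480167300: "wyp",
                1480167000: "wyp-old",
            },
        },
    },
}


def get_cell(table, row_key, family, qualifier, timestamp=None):
    """Resolve the four coordinates one level at a time.

    If no timestamp is given, fall back to the newest version
    (an assumption of this sketch).
    """
    versions = table[row_key][family][qualifier]
    if timestamp is None:
        timestamp = max(versions)
    return versions[timestamp]


print(get_cell(table, "row-001", "info", "name"))
print(get_cell(table, "row-001", "info", "name", timestamp=1480167000))
```

Each level narrows the search further, which is why the text above calls the four coordinates first-level to fourth-level indexes.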

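Warm-up example

Notes:
The original table is user.
The table that stores the index data is user_index_table.
The index is user_index.

First, create the original table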

create table user(
id int,
name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Load data into the original table

LOAD DATA LOCAL INPATH '/export1/tmp/wyp/row.txt' OVERWRITE INTO TABLE user;

Run a test query against the original table

SELECT * FROM user where id =500000;
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000 wyp.
Time taken: 14.107 seconds, Fetched: 1 row(s)
As the output shows, the query took 14.107 s in total.

Create the index user_index on the original table user; its index data is stored in the table user_index_table

CREATE INDEX user_index ON TABLE user(id) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH deferred REBUILD IN TABLE user_index_table;
The multi-line form below is equivalent, and it is the recommended way to write it:
hive > create index user_index on table user(id)
> as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> with deferred rebuild
> IN TABLE user_index_table;

Rebuild the index so that it is populated from the data in the original table user

ALTER INDEX user_index on user REBUILD;
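An index created WITH DEFERRED REBUILD starts out empty, and it is only filled when ALTER INDEX ... REBUILD runs. The Python toy model below sketches that life cycle, including the fact that rows added later stay invisible to the index until the next rebuild. The DeferredIndex class and the sample rows are hypothetical and have nothing to do with Hive's internal classes.

```python
class DeferredIndex:
    """Toy model of an index whose contents are built only on demand."""

    def __init__(self, table_rows, column):
        self.table_rows = table_rows  # the base table: a list of dict rows
        self.column = column          # the indexed column name
        self.entries = {}             # stays empty until rebuild() is called

    def rebuild(self):
        """Recompute the whole index from the current table contents."""
        entries = {}
        for position, row in enumerate(self.table_rows):
            entries.setdefault(row[self.column], []).append(position)
        self.entries = entries

    def lookup(self, value):
        """Return the row positions recorded for a value (possibly stale)."""
        return self.entries.get(value, [])


# Hypothetical sample rows standing in for the user table.
user_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
user_index = DeferredIndex(user_rows, "id")

print(user_index.lookup(2))   # nothing yet: the index has not been built
user_index.rebuild()
print(user_index.lookup(2))   # now the row position is known
```

The practical takeaway is that a rebuild is needed after the index is created and again after the table data changes.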

Drop the index

DROP INDEX user_index on user;

Show the indexes

SHOW INDEX on user;

Example: creating a table and an index

Step 1: Create the index test table

CREATE TABLE index_test(
id INT,
name STRING
)
PARTITIONED BY (dt STRING)
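ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Notes:
This creates an index test table named index_test, with dt as the partition column.
"ROW FORMAT DELIMITED FIELDS TERMINATED BY ','" means that the fields in each line are separated by commas; the default delimiter is '\001'.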
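The effect of the delimiter clause can be sketched in a few lines of Python. This is only an illustration of splitting text lines, not how Hive's SerDe is implemented. The split_fields helper and the sample line are hypothetical.

```python
DEFAULT_FIELD_DELIMITER = "\x01"  # Ctrl-A, written '\001' in Hive DDL


def split_fields(line, delimiter=DEFAULT_FIELD_DELIMITER):
    """Split one text line into its fields using the given delimiter."""
    return line.rstrip("\n").split(delimiter)


comma_line = "1,wyp\n"  # hypothetical line from a comma-separated file

# Declared delimiter matches the file: two fields come out.
print(split_fields(comma_line, ","))

# Default delimiter on the same file: the whole line is one field.
print(split_fields(comma_line))
```

If the declared delimiter does not match the file, the whole line lands in a single field, which is why the table definition has to name the comma explicitly.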

Step 2: Create the temporary (staging) table

create table index_tmp(
id INT,
name STRING,
dt STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Note: the temporary table is index_tmp.

Step 3: Load data into the temporary table

load data local inpath '/home/hadoop/djt/test.txt' into table index_tmp;

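Step 4: Enable dynamic partitioning (these are partitioning settings rather than index settings; the insert in Step 5 depends on them)

-- nonstrict mode allows every partition column to be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;
-- turn dynamic partitioning on
set hive.exec.dynamic.partition=true;

Step 5: Query the data in the temporary table and insert it into the index test table.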

insert overwrite table index_test partition(dt) select id,name,dt from index_tmp;
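With dynamic partitioning, the value of the trailing dt column in each selected row decides which partition the row goes to. The Python sketch below imitates that routing. The dynamic_partition_insert function and the sample rows are hypothetical, and the real work in Hive is done by the MapReduce job rather than by code like this.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, partition_column="dt"):
    """Group rows by partition value, keyed by a 'dt=<value>' directory name.

    The partition column is removed from each stored row, since its value
    is carried by the partition directory instead.
    """
    partitions = defaultdict(list)
    for row in rows:
        stored = dict(row)  # copy so the input rows are left untouched
        value = stored.pop(partition_column)
        partitions[f"{partition_column}={value}"].append(stored)
    return dict(partitions)


# Hypothetical contents of the staging table index_tmp.
index_tmp_rows = [
    {"id": 1, "name": "a", "dt": "2016-11-25"},
    {"id": 2, "name": "b", "dt": "2016-11-26"},
    {"id": 3, "name": "c", "dt": "2016-11-25"},
]

for directory, stored_rows in sorted(dynamic_partition_insert(index_tmp_rows).items()):
    print(directory, stored_rows)
```

One partition is created per distinct dt value, without the values having to be listed in the insert statement.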

Step 6: Create an index on column id of the index test table

create index index1_index_test on table index_test(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
The equivalent multi-line form is recommended:
create index index1_index_test on table index_test(id)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
The index is index1_index_test.
The index test table is index_test.
The index is created on column id of the index test table.

Step 7: Populate the index with data from the index test table

alter index index1_index_test on index_test rebuild;

Step 8: View the indexes created on the index test table

show index on index_test;

Step 9: View the partitions of the index test table

show partitions index_test;

Step 10: View the index data of the index test table

$ hadoop fs -ls /usr/hive/warehouse/default__index_test_index1_index_test__

Step 11: Drop the index on the index test table

drop index index1_index_test on index_test;
show index on index_test;

Step 12: Confirm that the index data has been deleted as well

$ hadoop fs -ls /usr/hive/warehouse/default__index_test_index1_index_test__
no such file or directory

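Step 13: Adjust the configuration

hive.optimize.index.filter and hive.optimize.index.groupby both default to false.
They must be enabled before queries can take advantage of indexes.
hive.optimize.index.filter.compact.minsize is the minimum input size
for which a compact index is used automatically. The default is 5368709120 (in bytes).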
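The interplay between the switch and the size threshold can be sketched as a simple check in Python. This is a simplification for illustration only, not Hive's optimizer code. The would_consider_compact_index function is hypothetical, and treating an input exactly at the threshold as too small is an assumption of this sketch.

```python
# Default of hive.optimize.index.filter.compact.minsize, in bytes (5 GiB).
COMPACT_MINSIZE_DEFAULT = 5368709120


def would_consider_compact_index(input_bytes,
                                 index_filter_enabled,
                                 minsize=COMPACT_MINSIZE_DEFAULT):
    """Return True if this sketch would let a compact index be used.

    Assumption: inputs must be strictly larger than minsize; the exact
    boundary behaviour in Hive is not taken from the article.
    """
    if not index_filter_enabled:
        return False  # hive.optimize.index.filter is off
    return input_bytes > minsize


one_gib = 1024 ** 3
print(would_consider_compact_index(10 * one_gib, index_filter_enabled=True))
print(would_consider_compact_index(1 * one_gib, index_filter_enabled=True))
print(would_consider_compact_index(10 * one_gib, index_filter_enabled=False))
```

For small test tables, such as the one in this walkthrough, the default threshold means the index would not be picked up automatically unless the minimum size is lowered.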
