Testing Hive's Index Feature

Reposted article, 2016-12-21 18:26, tagged Hive

Author: Syn良子. Source: http://www.cnblogs.com/cssdongl. Please credit the source when reposting.

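According to the official Hive wiki, Hive 0.7 and later can build an index on a table. I wanted to see whether this brings a real performance gain, so I consulted a few references, worked through it hands-on, and recorded the process and my takeaways here.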

I. Preparing the test data

1. Create a script named gen-data.sh with the following content

#!/bin/bash
# Generate about 1.7 GB of raw data: 5,000,000 rows of "<id><TAB><text>".
i=0
while [ "$i" -ne 5000000 ]
do
        # printf writes a real tab between the id and the text, matching the '\t' delimiter of the table.
        printf '%s\t%s\n' "$i" "A decade ago, many were predicting that Cooke, a New York City prodigy, would become a basketball shoe pitchman and would flaunt his wares and skills at All-Star weekends like the recent aerial show in Orlando, Fla. There was a time, however fleeting, when he was more heralded, or perhaps merely hyped, than any other high school player in America."
        i=$((i+1))
done

2. Generate the file

Run the script above: sh gen-data.sh > dual.txt. It finishes after a few minutes.
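Because the table in the next part splits each row on a tab, it can be worth checking the generated file before loading it. The following is a minimal sketch, not part of the original write-up; the function name check_tsv is hypothetical, and it assumes every row should be exactly a numeric id, one tab, and the text.

```shell
#!/bin/sh
# check_tsv FILE EXPECTED_ROWS   (hypothetical helper, for illustration only)
# Succeeds only if FILE has EXPECTED_ROWS lines and every line consists of
# exactly two tab-separated fields whose first field is a non-negative integer.
check_tsv() {
    awk -F'\t' -v want="$2" '
        NF != 2 || $1 !~ /^[0-9]+$/ { bad++ }        # count malformed rows
        END { if (bad > 0 || NR != want) exit 1 }    # fail on bad rows or wrong count
    ' "$1"
}

# Example call against the generated file:
#   check_tsv dual.txt 5000000 && echo "dual.txt looks loadable"
```

If the check fails, the most likely cause is that the id and the text are separated by spaces rather than a tab, in which case Hive would not split the row into the two columns as intended.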

II. Creating the table and index in Hive

1. Create the table. Its layout must match the data generated above: id and name map to the two fields separated by a tab.

create table table01(id int,name string) row format delimited fields terminated by '\t';

2. Load the data into the table

load data local inpath '~/testData/hive/dataScripts/dual.txt' overwrite into table table01; (Time taken: 160.787 seconds)

3. Create table02 from the data in table01

create table table02 as select id ,name as text from table01; (Time taken: 154.463 seconds)

4. Query test

select * from table02 where id=500000; (Time taken: 30.463 seconds, Fetched: 1 row(s))

At this point, running dfs -ls /user/hive/warehouse/ shows the data directories created for table01 and table02.

5. Use Hive's CompactIndexHandler to create an index on the id column

create index table02_index on table table02(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

alter index table02_index on table02 rebuild; (Time taken: 112.451 seconds)

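Note that the alter index ... rebuild statement above is required: because the index was created with deferred rebuild, it starts out empty, and the rebuild is what actually populates the index structure.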

6. The index table now exists; inspect its contents

hive> select * from default__table02_table02_index__ limit 3;
OK
9    hdfs://littleNameservice/user/hive/warehouse/table02/000000_0    [3168]
36    hdfs://littleNameservice/user/hive/warehouse/table02/000000_0    [12698]
63    hdfs://littleNameservice/user/hive/warehouse/table02/000000_0    [22229]

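The index table has three columns: the distinct value of the indexed column, the data file that contains that value, and the offsets of that value within the file. With this information a query can read less data (the offsets say where to start looking, so only the relevant blocks need to be located), which reduces resource consumption.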
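The idea behind this three-column layout can be sketched on a local file. This is a toy model under stated assumptions, not Hive's implementation: it records one byte offset per row, assumes single-byte text (hence LC_ALL=C), and the function names build_index and lookup are hypothetical.

```shell
#!/bin/sh
# build_index FILE: print "<id> TAB <file> TAB <byte offset of the row>" for
# every row of a tab-separated file. LC_ALL=C makes length() count bytes.
build_index() {
    LC_ALL=C awk -F'\t' -v f="$1" '
        { printf "%s\t%s\t%d\n", $1, f, off; off += length($0) + 1 }
    ' "$1"
}

# lookup INDEX ID: find the (file, offset) pairs for ID in the index, then
# jump straight to each offset instead of scanning the data file.
lookup() {
    LC_ALL=C awk -F'\t' -v id="$2" '$1 == id { print $2 "\t" $3 }' "$1" |
    while IFS="$(printf '\t')" read -r file off; do
        # tail -c +N starts at byte N (1-based), so add 1 to the 0-based offset
        tail -c +"$((off + 1))" "$file" | head -n 1
    done
}

# Small demonstration on a three-row file.
demo_dir=$(mktemp -d)
printf '0\talpha\n1\tbravo\n2\tcharlie\n' > "$demo_dir/data.txt"
build_index "$demo_dir/data.txt" > "$demo_dir/index.txt"
lookup "$demo_dir/index.txt" 2    # prints the row "2<TAB>charlie"
rm -rf "$demo_dir"
```

The index file here is itself scanned linearly, which mirrors the point made in this article: the structure is a lookup table of values, files, and offsets rather than a tree.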

7. Query again

select * from table02 where id=500000; (Time taken: 29.226 seconds, Fetched: 1 row(s))

Compared with the initial 30.463 seconds, this is essentially unchanged, so I kept digging.

8. The index has to be pruned manually, as follows

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
Insert overwrite directory "/tmp/table02_index_data" select `_bucketname`, `_offsets` from default__table02_table02_index__ where id =500000;
Set hive.index.compact.file=/tmp/table02_index_data;
Set hive.optimize.index.filter=false;
Set hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

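In short, for the query the index should serve (here id = 500000), these commands manually extract the matching entries from the existing index table default__table02_table02_index__ into a temporary directory under /tmp, then point hive.index.compact.file at that directory and turn off automatic index filtering.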
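Since these statements have to be repeated for every predicate value, one option is to generate them from a small script. The sketch below only prints the HiveQL from steps 8 and 9 and never contacts Hive; the function name emit_pruned_query and its parameter order are hypothetical, and the index table name is passed in rather than derived.

```shell
#!/bin/sh
# emit_pruned_query INDEX_TABLE BASE_TABLE COLUMN VALUE TMP_DIR
# Hypothetical helper: prints the manual-pruning statements for one equality
# predicate, followed by the query itself. It only generates text.
emit_pruned_query() {
    index_table=$1 base_table=$2 column=$3 value=$4 tmp_dir=$5
    # Unquoted here-document: variables expand, \` yields a literal backtick.
    cat <<EOF
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
INSERT OVERWRITE DIRECTORY "$tmp_dir" SELECT \`_bucketname\`, \`_offsets\` FROM $index_table WHERE $column = $value;
SET hive.index.compact.file=$tmp_dir;
SET hive.optimize.index.filter=false;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
SELECT * FROM $base_table WHERE $column = $value;
EOF
}

# Reproduce the statements used in this article.
emit_pruned_query default__table02_table02_index__ table02 id 500000 /tmp/table02_index_data
```

The printed statements could be saved to a file and run with hive -f. The values are pasted into the SQL without quoting or escaping, so this is only suitable for trusted numeric input like the id used here.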

9. Final query test

select * from table02 where id =500000; (Time taken: 17.259 seconds, Fetched: 1 row(s))

Good: this time the query took about 17 seconds, which shows the index is being used. Still, the improvement is underwhelming.

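Personal summary: judging from the official wiki, the JIRA, and my own tests, Hive indexes are hard to use. They are not traditional B-tree indexes. Instead, Hive keeps a redundant lookup table that simply divides the indexed table into ranges and offsets, and queries consult the information stored in that table. On top of that, the index cannot be used directly: it has to be pruned for the specific predicate before it actually takes effect. To me this feels like a half-finished feature, and the official documentation also says this area needs more work.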

References:

https://cwiki.apache.org/confluence/display/Hive/IndexDev

https://issues.apache.org/jira/browse/HIVE-417

http://lxw1234.com/archives/2015/05/207.htm

http://blog.csdn.net/liwei_1988/article/details/7319030
