hive隨機采樣


hive> select * from account limit 10;
OK
account.accountname     account.accid   account.platid  account.dateid  account.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
20963649        20963649        6       20171220        2017-12-20 10:13:57.0
20963674        20963674        33      20171219        2017-12-19 22:59:39.0

只通過limit河南保證數據隨機的返回,通常情況下,此種方式會按文件的順序返回,沒到達隨機采樣的目的。


scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22780583|22780583|    18|20180117|2018-01-17 07:27:...|
|   22777237|22777237|     6|20180116|2018-01-16 20:52:...|
|   22868751|22868751|     1|20180208|2018-02-08 10:20:...|
|   22993835|22993835|    83|20180327|2018-03-27 19:48:...|
|   22779551|22779551|     9|20180117|2018-01-17 00:33:...|
|   22796970|22796970|     1|20180119|2018-01-19 17:53:...|
|   22713995|22713995|    33|20171227|2017-12-27 21:05:...|
|   22811076|22811076|     6|20180123|2018-01-23 13:16:...|
|   22715294|22715294|    15|20171228|2017-12-28 01:59:...|
|   22843107|22843107|    33|20180129|2018-01-29 20:13:...|
+-----------+--------+------+--------+--------------------+

scala> hivcon.sql("select * from gamedw.account order by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22809368|22809368|    35|20180122|2018-01-22 20:55:...|
|   22810321|22810321|    18|20180123|2018-01-23 03:42:...|
|   22850089|22850089|    15|20180201|2018-02-01 07:16:...|
|   23040545|23040545|    27|20180415|2018-04-15 10:45:...|
|   21095611|21095611|     3|20180118|2018-01-18 19:02:...|
|   22787794|22787794|     9|20180117|2018-01-17 22:10:...|
|   22701928|22701928|     6|20171226|2017-12-26 14:42:...|
|   23123308|23123308|     6|20180526|2018-05-26 02:02:...|
|   23037031|23037031|    27|20180413|2018-04-13 23:52:...|
|   22798775|22798775|    27|20180120|2018-01-20 00:32:...|
+-----------+--------+------+--------+--------------------+

使用rand()確實可以實現真正意義的隨機,但性能不高。為了實現總排序,Hive必須將所有數據強制傳輸到單個reducer。該reducer將對整個數據集進行排序。這很不好。

scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22708141|22708141|    15|20171226|2017-12-26 22:13:...|
|   22780855|22780855|    15|20180117|2018-01-17 08:25:...|
|   22781579|22781579|    35|20180117|2018-01-17 09:55:...|
|   22775456|22775456|     1|20180116|2018-01-16 18:35:...|
|   22705922|22705922|    18|20171226|2017-12-26 20:18:...|
|   22773860|22773860|   138|20180116|2018-01-16 16:09:...|
|   22771885|22771885|     4|20180116|2018-01-16 14:05:...|
|   22633182|22633182|     0|20171202|2017-12-02 04:07:...|
|   22228137|22228137|     0|20170828|2017-08-28 17:55:...|
|   22593212|22593212|     0|20171121|2017-11-21 18:38:...|
+-----------+--------+------+--------+--------------------+

scala> hivcon.sql("select * from gamedw.account sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22703713|22703713|    15|20171226|2017-12-26 18:02:...|
|   22766790|22766790|     4|20180116|2018-01-16 10:11:...|
|   22728340|22728340|    27|20171231|2017-12-31 19:59:...|
|   22770064|22770064|     1|20180116|2018-01-16 12:13:...|
|   22705604|22705604|    15|20171226|2017-12-26 19:58:...|
|   22771835|22771835|    21|20180116|2018-01-16 14:01:...|
|   22720562|22720562|    33|20171229|2017-12-29 11:56:...|
|   22779785|22779785|    83|20180117|2018-01-17 01:36:...|
|   22705319|22705319|    33|20171226|2017-12-26 19:46:...|
|   22717315|22717315|    27|20171228|2017-12-28 17:04:...|
+-----------+--------+------+--------+--------------------+ 

Hive有一個非標准SQL“sort by”子句,它只在單個reducer中排序,並且不保證數據跨多個reducers中排序.

Hive的將數據拆分為多個reducer的方法是未定義的。它可能是真正隨機的,它可能基於文件順序,它可能基於數據中的某些值。Hive如何在reducers中實現limit子句也是未定義的。也許它按順序從reducer中獲取數據 - 即,reducer 0中的所有數據,然后全部來reducer1,等等。也許它通過它們循環並將所有內容混合在一起。

假設reduce 的key是基於數據列,而limit子句是reducers的順序。然后樣品會非常傾斜。

解決方案是另一個非標准的Hive功能:“distribute by”。對於reduce key不是由查詢結構確定的查詢(沒有“group by”,沒有join),可以准確指定reduce key的內容。

scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22808048|22808048|    27|20180122|2018-01-22 14:15:...|
|   22984607|22984607|    60|20180324|2018-03-24 00:56:...|
|   22775548|22775548|    15|20180116|2018-01-16 18:40:...|
|   22386073|22386073|     4|20170914|2017-09-14 22:33:...|
|   22709109|22709109|    18|20171226|2017-12-26 23:06:...|
|   22713147|22713147|     6|20171227|2017-12-27 18:55:...|
|   22847395|22847395|    15|20180131|2018-01-31 09:21:...|
|   22863523|22863523|    50|20180206|2018-02-06 10:06:...|
|   22772443|22772443|     3|20180116|2018-01-16 14:41:...|
|   23063884|23063884|    27|20180424|2018-04-24 19:25:...|
+-----------+--------+------+--------+--------------------+

scala> hivcon.sql("select * from gamedw.account distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   22702272|22702272|     6|20171226|2017-12-26 15:32:...|
|   22588521|22588521|     0|20171121|2017-11-21 17:00:...|
|   23045191|23045191|    27|20180416|2018-04-16 22:31:...|
|   22997042|22997042|    83|20180328|2018-03-28 21:17:...|
|   22801335|22801335|    18|20180120|2018-01-20 17:59:...|
|   22789961|22789961|     6|20180118|2018-01-18 09:51:...|
|   22884362|22884362|     3|20180214|2018-02-14 00:01:...|
|   22786025|22786025|     1|20180117|2018-01-17 18:45:...|
|   22769132|22769132|    19|20180116|2018-01-16 11:32:...|
|   23045125|23045125|    27|20180416|2018-04-16 23:12:...|
+-----------+--------+------+--------+--------------------+

 

作為最后一次優化,可以在map-side做一些過濾。如果表的總大小是已知的,輕松設置一個隨機閾值條件來進行數據過濾,如下:

scala> hivcon.sql("select * from gamedw.account where rand()<0.01 distribute by rand() sort by rand() limit 10").show
+-----------+--------+------+--------+--------------------+                    
|accountname|   accid|platid|  dateid|          createtime|
+-----------+--------+------+--------+--------------------+
|   23120261|23120261|    15|20180524|2018-05-24 12:05:...|
|   22726833|22726833|     4|20171231|2017-12-31 09:53:...|
|   22731460|22731460|     4|20180101|2018-01-01 22:15:...|
|   22379995|22379995|     4|20170913|2017-09-13 10:01:...|
|   22767301|22767301|    21|20180116|2018-01-16 10:25:...|
|   22701994|22701994|    21|20171226|2017-12-26 14:52:...|
|   22792831|22792831|    60|20180118|2018-01-18 20:09:...|
|   23145016|23145016|    15|20180608|2018-06-08 05:04:...|
|   22702216|22702216|    21|20171226|2017-12-26 15:28:...|
|   22858386|22858386|    33|20180204|2018-02-04 10:19:...|
+-----------+--------+------+--------+--------------------+

 抽樣可以從被抽取的數據中估計和推斷出整體的特性,是科學實驗、質量檢驗、社會調查普遍采用的一種經濟有效的工作和研究方法。當數據量特別大時,對全體數據進行處理存在困難時,抽樣就顯得尤其重要了

Hive支持桶表抽樣和塊抽樣:

 

桶表抽樣的語法如下:

   table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

TABLESAMPLE子句允許用戶編寫用於數據抽樣而不是整個表的查詢,該子句出現FROM子句中,可用於任何表中。桶編號從1開始,colname表明抽取樣本的列,可以是非分區列中的任意一列,或者使用rand()表明在整個行中抽取樣本而不是單個列。在colname上分桶的行隨機進入1到y個桶中,返回屬於桶x的行。

下例返回3000個桶中第2個桶的數據。

hive> select * from gamedw.account TABLESAMPLE(BUCKET 2 OUT OF 3000 on rand()) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
22701274        22701274        33      20171226        2017-12-26 13:17:31.0
22702663        22702663        33      20171226        2017-12-26 16:30:12.0
22714480        22714480        27      20171227        2017-12-27 22:20:26.0
22729211        22729211        4       20180101        2018-01-01 00:37:46.0
22772179        22772179        18      20180116        2018-01-16 14:23:18.0
22774547        22774547        4       20180116        2018-01-16 17:12:40.0
22779739        22779739        1       20180117        2018-01-17 01:20:23.0
22787395        22787395        4       20180117        2018-01-17 21:27:29.0
22789721        22789721        15      20180118        2018-01-18 08:27:52.0
22793281        22793281        35      20180118        2018-01-18 21:14:48.0
22795002        22795002        15      20180119        2018-01-19 10:01:58.0
22797054        22797054        3       20180119        2018-01-19 18:12:43.0
22804855        22804855        138     20180121        2018-01-21 15:38:33.0
22813246        22813246        1       20180124        2018-01-24 00:09:43.0
22877117        22877117        3       20180211        2018-02-11 13:33:44.0
23045597        23045597        21      20180417        2018-04-17 04:02:17.0
23060405        23060405        15      20180423        2018-04-23 07:07:09.0
23075002        23075002        191     20180430        2018-04-30 17:24:44.0
23076553        23076553        158     20180501        2018-05-01 15:45:16.0
23085391        23085391        18      20180506        2018-05-06 10:53:59.0
Time taken: 0.938 seconds, Fetched: 20 row(s)

通常情況下,TABLESAMPLE將會掃描整個表然后抽取樣本,顯然這種做法效率不是很高。

替代方法是,由於在使用CLUSTERED BY時指定了分桶的列,如果抽樣時TABLESAMPLE子句中指定的列匹配CLUSTERED BY子句中的列,TABLESAMPLE只掃描表中要求的分區。

與rand()相比,rand()每次執行的結果都不同,指定分桶列后,每次執行結果相同

TABLESAMPLE(BUCKET 3 OUT OF 32 on rand()):  account表在創建時使用了CLUSTEREDBY accid  INTO 32 BUCKETS,那么下面的語句將返回第3個和第19個簇中的行,因為每個桶由(32/16)=2個簇組成。為什么是3和19呢,因為要返回的是第3個桶,而每個桶由原來的2個簇組成,第3個桶就由原來的第3個和19個簇組成,根據簡單的哈希算法(3%16=19%16)。

hive> select * from gamedw.account TABLESAMPLE(BUCKET 5 OUT OF 3000 on accid) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
21078004        21078004        3       20180116        2018-01-16 20:21:32.0
22380004        22380004        4       20170913        2017-09-13 10:02:13.0
22386004        22386004        27      20170914        2017-09-14 22:15:23.0
22392004        22392004        4       20170917        2017-09-17 00:36:06.0
22671004        22671004        0       20171215        2017-12-15 11:58:57.0
22701004        22701004        33      20171226        2017-12-26 12:49:32.0
22704004        22704004        4       20171226        2017-12-26 18:22:16.0
22707004        22707004        15      20171226        2017-12-26 21:20:56.0
22710004        22710004        15      20171227        2017-12-27 01:45:43.0
22722004        22722004        27      20171229        2017-12-29 19:49:04.0
22767004        22767004        1       20180116        2018-01-16 10:17:55.0
22770004        22770004        19      20180116        2018-01-16 12:10:56.0
22773004        22773004        138     20180116        2018-01-16 15:11:39.0
22779004        22779004        3       20180116        2018-01-16 23:19:48.0
22782004        22782004        18      20180117        2018-01-17 10:38:42.0
22803004        22803004        1       20180120        2018-01-20 23:59:56.0
22812004        22812004        6       20180123        2018-01-23 18:36:55.0
22821004        22821004        21      20180124        2018-01-24 20:38:20.0
22830004        22830004        4       20180126        2018-01-26 03:44:23.0
22887004        22887004        15      20180214        2018-02-14 23:16:07.0
22911004        22911004        83      20180223        2018-02-23 18:11:57.0
22971004        22971004        1       20180318        2018-03-18 17:01:03.0
22977004        22977004        19      20180320        2018-03-20 22:22:38.0
23043004        23043004        27      20180416        2018-04-16 09:23:30.0
Time taken: 0.926 seconds, Fetched: 24 row(s)

使用塊抽樣

hive> select * from gamedw.account TABLESAMPLE(0.01 percent) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
Time taken: 0.873 seconds, Fetched: 8 row(s)

hive> select * from gamedw.account TABLESAMPLE(1k) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
20963649        20963649        6       20171220        2017-12-20 10:13:57.0
20963674        20963674        33      20171219        2017-12-19 22:59:39.0
20964032        20964032        15      20171221        2017-12-21 17:04:03.0
20964042        20964042        33      20171221        2017-12-21 20:06:33.0
20964052        20964052        3       20171221        2017-12-21 14:17:28.0
20964128        20964128        0       20171123        2017-11-23 15:52:38.0
20964153        20964153        3       20170906        2017-09-06 16:39:35.0
20964307        20964307        18      20180101        2018-01-01 17:10:39.0
20964461        20964461        18      20180116        2018-01-16 13:14:22.0
20964769        20964769        0       20170825        2017-08-25 17:03:35.0
20965027        20965027        18      20180116        2018-01-16 11:16:07.0
20965038        20965038        33      20171226        2017-12-26 20:02:26.0
Time taken: 0.588 seconds, Fetched: 20 row(s)

hive> select * from gamedw.account TABLESAMPLE(30 rows) s;
OK
s.accountname   s.accid s.platid        s.dateid        s.createtime
1004210 1004210 6       20180116        2018-01-16 10:39:50.0
20946754        20946754        0       20170913        2017-09-13 10:02:37.0
20946766        20946766        0       20170901        2017-09-01 16:51:30.0
20946793        20946793        0       20171117        2017-11-17 16:51:07.0
20946796        20946796        0       20180110        2018-01-10 13:30:46.0
20946962        20946962        0       20171219        2017-12-19 15:43:14.0
20957641        20957641        0       20171117        2017-11-17 17:44:58.0
20957642        20957642        0       20171220        2017-12-20 15:32:21.0
20963649        20963649        6       20171220        2017-12-20 10:13:57.0
20963674        20963674        33      20171219        2017-12-19 22:59:39.0
20964032        20964032        15      20171221        2017-12-21 17:04:03.0
20964042        20964042        33      20171221        2017-12-21 20:06:33.0
20964052        20964052        3       20171221        2017-12-21 14:17:28.0
20964128        20964128        0       20171123        2017-11-23 15:52:38.0
20964153        20964153        3       20170906        2017-09-06 16:39:35.0
20964307        20964307        18      20180101        2018-01-01 17:10:39.0
20964461        20964461        18      20180116        2018-01-16 13:14:22.0
20964769        20964769        0       20170825        2017-08-25 17:03:35.0
20965027        20965027        18      20180116        2018-01-16 11:16:07.0
20965038        20965038        33      20171226        2017-12-26 20:02:26.0
20965124        20965124        18      20171226        2017-12-26 11:10:57.0
20965187        20965187        33      20171226        2017-12-26 22:35:32.0
20965282        20965282        33      20171226        2017-12-26 11:50:23.0
20965301        20965301        33      20180121        2018-01-21 17:34:58.0
20965326        20965326        33      20171228        2017-12-28 10:22:44.0
20965646        20965646        18      20180119        2018-01-19 21:13:33.0
20965650        20965650        18      20180116        2018-01-16 11:08:33.0
20965815        20965815        18      20180116        2018-01-16 13:59:43.0
20965938        20965938        18      20180116        2018-01-16 10:01:19.0
20966100        20966100        18      20180122        2018-01-22 14:25:31.0
Time taken: 0.898 seconds, Fetched: 30 row(s)

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM