How to efficiently fetch one random row from a table in PostgreSQL
select C_BH from db_scld.t_scld_cprscjl order by random() LIMIT 1;
select c_jdrybm from db_scld.t_jdry
where c_bmbm = v_scdd and c_sfyx ='1' and c_ryzszt not in ('05','12','11','07','09','13') order by random() limit 1
db_jdsjpt=# explain analyze select C_BH from db_scld.t_scld_cprscjl order by random() LIMIT 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=61029.94..61029.94 rows=1 width=41) (actual time=587.193..587.193 rows=1 loops=1)
-> Sort (cost=61029.94..63172.22 rows=856911 width=41) (actual time=587.185..587.185 rows=1 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on t_scld_cprscjl (cost=0.00..56745.39 rows=856911 width=41) (actual time=0.019..380.139 rows=854682 loops=1)
Planning time: 1.179 ms
Execution time: 587.242 ms
(7 rows)
-- total number of rows in the table
db_jdsjpt=# select count(*) from db_scld.t_scld_cprscjl;
count
--------
854682
(1 row)
Fetching one random row with order by random()
Time: 389.818 ms
-- time to fetch one random row
db_jdsjpt=# select C_BH from db_scld.t_scld_cprscjl order by random() LIMIT 1;
c_bh
----------------------------------
6d861b011c854040bf5b731f49d40b48
(1 row)
Time: 389.818 ms
Rewrite 1: offset
Time: 60.022 ms
-- offset can use an index-only scan and skips the sort
db_jdsjpt=# select C_BH from db_scld.t_scld_cprscjl offset floor(random()*854682) LIMIT 1;
c_bh
----------------------------------
f90301bd8ac2485196ffae32ee70345c
(1 row)
Time: 60.022 ms
db_jdsjpt=# explain analyze select C_BH from db_scld.t_scld_cprscjl offset floor(random()*854682) LIMIT 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=3747.64..3747.68 rows=1 width=33) (actual time=30.758..30.759 rows=1 loops=1)
-> Index Only Scan using i_corscjl_cprscbh_ on t_scld_cprscjl (cost=0.42..37472.65 rows=854682 width=33) (actual time=0.047..25.808 rows=81993 loops=1)
Heap Fetches: 0
Planning time: 0.228 ms
Execution time: 30.802 ms
(5 rows)
Time: 31.779 ms
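Rewrite 2: tablesample system
PostgreSQL has provided the TABLESAMPLE clause since version 9.5.
When sampling with tablesample, the percentage must not be too low, otherwise the query may return no rows. A WHERE filter is applied only after sampling, so tablesample is a poor fit for queries with filter conditions.
system: poorer randomness, high efficiency
Time: 0.639 ms
-- time after the rewrite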
db_jdsjpt=# select c_bh from db_scld.t_scld_cprscjl tablesample system(0.1) limit 1;
c_bh
----------------------------------
e2fce25399db42f0bf49faf8e7214d5f
(1 row)
Time: 0.639 ms
-- system samples by block, which is why it is faster
db_jdsjpt=# explain analyze select c_bh from db_scld.t_scld_cprscjl tablesample system(0.1) limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.23 rows=1 width=33) (actual time=0.105..0.105 rows=1 loops=1)
-> Sample Scan on t_scld_cprscjl (cost=0.00..192.55 rows=855 width=33) (actual time=0.102..0.102 rows=1 loops=1)
Sampling: system ('0.1'::real)
Planning time: 0.190 ms
Execution time: 0.134 ms
(5 rows)
Time: 1.182 ms
Rewrite 3: tablesample bernoulli
bernoulli: better randomness, but less efficient than system
Time: 0.822 ms
db_jdsjpt=# select c_bh from db_scld.t_scld_cprscjl tablesample bernoulli(0.1) limit 1;
c_bh
----------------------------------
7ec30761ffd04bd9ad77797a33645a84
(1 row)
Time: 0.822 ms
-- bernoulli samples by row, so it is somewhat less efficient than system
db_jdsjpt=# explain analyze select c_bh from db_scld.t_scld_cprscjl tablesample bernoulli(0.1) limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=0.00..53.85 rows=1 width=33) (actual time=1.410..1.411 rows=1 loops=1)
-> Sample Scan on t_scld_cprscjl (cost=0.00..46042.55 rows=855 width=33) (actual time=1.408..1.408 rows=1 loops=1)
Sampling: bernoulli ('0.1'::real)
Planning time: 0.446 ms
Execution time: 1.436 ms
(5 rows)
Time: 25.770 ms
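The other SQL statement can be rewritten the same way, with an index added on the c_bmbm column.
When the query has filter conditions, fetch the row with offset; the offset value can also be passed in from a for loop.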
db_jdsjpt=# select count(*) from db_scld.t_jdry;
count
--------
214819
(1 row)
db_jdsjpt=# select c_jdrybm from db_scld.t_jdry where c_bmbm = '4402222804' and c_sfyx ='1' and c_ryzszt not in ('05','12','11','07','09','13') offset floor(random()*214819) limit 1;
c_jdrybm
----------
(0 rows)
Time: 1.924 ms
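The query above returned 0 rows, most likely because the offset was drawn from the whole-table count (214819) while far fewer rows pass the filter, so the offset usually lands past the end of the matching rows. A minimal sketch of the fix is to count the matching rows first and draw the offset from that count. The example below uses Python's built-in SQLite as a stand-in for PostgreSQL, with a toy `t_jdry` table and a simplified filter; the helper name `random_matching_row` and the sample data are hypothetical.

```python
import random
import sqlite3

def random_matching_row(conn, dept):
    """Return one random c_jdrybm for the department, or None if nothing matches."""
    where = "c_bmbm = ? AND c_sfyx = '1'"
    # Count only the rows that pass the filter; this bounds the offset.
    (matching,) = conn.execute(
        f"SELECT count(*) FROM t_jdry WHERE {where}", (dept,)
    ).fetchone()
    if matching == 0:
        return None
    offset = random.randrange(matching)  # 0 <= offset < matching
    row = conn.execute(
        f"SELECT c_jdrybm FROM t_jdry WHERE {where} LIMIT 1 OFFSET ?",
        (dept, offset),
    ).fetchone()
    return row[0]

# Toy data (assumption): 1000 rows, 20 of which belong to the department.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_jdry (c_jdrybm TEXT, c_bmbm TEXT, c_sfyx TEXT)")
conn.executemany(
    "INSERT INTO t_jdry VALUES (?, ?, '1')",
    [(f"ry{i:04d}", "4402222804" if i % 50 == 0 else "other") for i in range(1000)],
)
picked = random_matching_row(conn, "4402222804")
print(picked)
```

The extra count(*) costs one more query, but with an index on c_bmbm it should only touch the matching rows.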
Comparison
Method | Time |
---|---|
order by random() | 389.818 ms |
offset n | 60.022 ms to 240 ms |
system() | 0.639 ms |
bernoulli() | 0.822 ms |
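With offset, the time depends on the size of n: the larger n is, the more index blocks are scanned and the longer the query takes. It is still faster than order by random().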
Note
system(0.1) means 0.1 percent, that is, one row in a thousand: 854682 * 0.001 ≈ 854, so each query samples roughly 854 rows.
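The arithmetic above, and the earlier warning that too low a percentage may return nothing, can be checked with a short calculation. This is a sketch under a simple model: each row is kept independently with probability percent/100, which matches row-level (bernoulli) sampling; the function names are made up for illustration.

```python
def expected_sample_rows(total_rows, percent):
    """Expected number of rows in a sample taken at the given percentage."""
    return total_rows * percent / 100.0

def prob_empty_bernoulli(total_rows, percent):
    """Probability that row-level sampling keeps no row at all."""
    return (1.0 - percent / 100.0) ** total_rows

total = 854682  # row count of t_scld_cprscjl from the article

print(expected_sample_rows(total, 0.1))     # about 854 rows on average
print(prob_empty_bernoulli(total, 0.1))     # practically zero
print(prob_empty_bernoulli(total, 0.0001))  # an empty result becomes common
```

On a table of this size, 0.1 percent is comfortably safe. Once the percentage is low enough that the expected sample is around one row or fewer, empty results become frequent and the query needs a retry or a higher percentage.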
--system
db_jdsjpt=# select count(*) from db_scld.t_scld_cprscjl tablesample system(0.1) ;
count
-------
592
(1 row)
Time: 1.499 ms
--bernoulli
db_jdsjpt=# select count(*) from db_scld.t_scld_cprscjl tablesample bernoulli(0.1) ;
count
-------
840
(1 row)
Time: 86.037 ms
This shows that bernoulli is less efficient than system (86.037 ms versus 1.499 ms).
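The two counts above (592 and 840) differ because the methods sample different units. The simulation below is a simplified model, not PostgreSQL's actual implementation: it assumes a hypothetical table of 2000 blocks with exactly 100 rows each, whereas real blocks hold varying numbers of rows. The system model makes one keep-or-skip decision per block, and the bernoulli model makes one per row.

```python
import random

NUM_BLOCKS = 2000      # hypothetical table size in blocks
ROWS_PER_BLOCK = 100   # assumed constant; real blocks vary
PERCENT = 0.1
TRIALS = 5

def system_sample(rng):
    """Block-level sampling: keep whole blocks, each with probability PERCENT/100."""
    p = PERCENT / 100.0
    blocks = sum(1 for _ in range(NUM_BLOCKS) if rng.random() < p)
    return blocks * ROWS_PER_BLOCK

def bernoulli_sample(rng):
    """Row-level sampling: visit every row and keep it with probability PERCENT/100."""
    p = PERCENT / 100.0
    total_rows = NUM_BLOCKS * ROWS_PER_BLOCK
    return sum(1 for _ in range(total_rows) if rng.random() < p)

rng = random.Random()
system_counts = [system_sample(rng) for _ in range(TRIALS)]
bernoulli_counts = [bernoulli_sample(rng) for _ in range(TRIALS)]
print("system:   ", system_counts)
print("bernoulli:", bernoulli_counts)
```

In this model the system counts move in whole-block steps, which makes them coarser, while bernoulli has to make 100 times as many decisions. That is consistent with the trade-off described above: system is faster, bernoulli is more evenly random.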
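Conclusion
1. When fetching one random row, order by random() is not noticeably slow on a small table. On a large table it has to sort on every call, which makes fetching a single random row expensive.
2. A random row can be fetched with limit 1 offset n-1, or with random sampling via tablesample.
3. Both limit 1 offset n and tablesample require knowing the number of rows in the table: the offset must stay below the row count, and the sampling percentage must be high enough for the sample to contain rows.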