ES scroll (ES cursor): solving deep pagination



Why

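When Elasticsearch responds to a request, it has to determine the order of the docs and rank the results. If the requested page number is small (assume 20 docs per page), this is no problem. For a larger page number, such as page 20, Elasticsearch has to fetch all the docs from page 1 through page 20 and then discard pages 1 through 19 to obtain the docs of page 20. The deeper the page, the more docs are fetched and sorted only to be thrown away.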
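To make the cost concrete, here is a rough sketch in Python. It relies on the usual from/size behaviour, in which every shard returns its top `from + size` docs to the coordinating node, and it assumes an index with 5 primary shards (a number chosen for illustration, not taken from the text above). The function name is hypothetical.

```python
def docs_sorted_for_page(page: int, page_size: int, num_shards: int) -> int:
    """Estimate how many docs the coordinating node must sort for one page.

    Each shard returns its top (from + size) docs, so the coordinating
    node has to merge num_shards * (from + size) entries in total.
    """
    offset = (page - 1) * page_size      # the "from" value of the request
    per_shard = offset + page_size       # docs each shard must return
    return per_shard * num_shards


# Page 20 with 20 docs per page on an assumed 5-shard index.
print(docs_sorted_for_page(page=20, page_size=20, num_shards=5))
# Page 1 on the same index, for comparison.
print(docs_sorted_for_page(page=1, page_size=20, num_shards=5))
```

Only 20 of those docs are returned to the caller in either case; the rest are sorted and discarded.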

How it works

Scrolling allows us to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left. It’s a bit like a cursor in a traditional database.

A scrolled search takes a snapshot in time. Changes made to the index after the initial search request are not visible to it.

It does this by keeping old data files around.

The cost of deep pagination is the global sort. If sorting is turned off by sorting on _doc, Elasticsearch simply returns the next batch of results from every shard that still has results to return.

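Context keep-alive time (long enough for one batch) and scroll_id (always the latest)

Set the scroll value to the length of time we want to keep the scroll window open, that is, how long Elasticsearch should keep the “search context” alive.

The scroll expiry time is refreshed every time we run a scroll request, so it should be neither too long (stale contexts pile up as garbage) nor too short (the context times out). It only needs to be long enough to process one batch of data.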

<code>GET /old_index/_search?scroll=1m  // first request
{
    "query": { "match_all": {} },
    "sort": [ "_doc" ],  // the most efficient sort order
    "size": 1000
}
// The response includes a _scroll_id, a Base64-encoded string.

GET /_search/scroll  // subsequent requests
{
    "scroll": "1m",
    "scroll_id": "cXVlcnlUaGVuRmV0Y2g7NTsxMDk5NDpkUmpiR2FjOFNhNnlCM1ZDMWpWYnRROzEwOTk1OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MTA5OTM6ZFJqYkdhYzhTYTZ5QjNWQzFqVmJ0UTsxMTE5MDpBVUtwN2lxc1FLZV8yRGVjWlI2QUVBOzEwOTk2OmRSamJHYWM4U2E2eUIzVkMxalZidFE7MDs="
}</code>

The scroll parameter tells Elasticsearch how long it should keep the search context alive. The value only needs to be long enough to process the previous batch of results, because each scroll request sets a new expiry time.

An open search context prevents the old segments from being deleted while they are still in use.

Note: keeping older segments alive means that more file handles (file descriptors) are needed.
To check how many search contexts are open (open_contexts):

<code>GET _nodes/stats/indices/search</code>

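Clear scroll API
Search contexts are automatically removed when the scroll timeout has been exceeded. Because open contexts have a cost, it is still worth clearing them explicitly once a scroll is no longer needed.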

<code>// Clear all search contexts (clearing only some of them is possible, but of little use):
DELETE _search/scroll/_all</code>

size
When scanning, the size is applied to each shard, so a batch can hold up to size * number_of_primary_shards hits.

Otherwise (a regular scroll), size is the total number of hits returned per batch.

End of the scroll
The scroll is finished when no more hits are returned. Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.
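The loop described here can be sketched in Python as follows. This is a minimal illustration, not a client implementation: `start` and `fetch_next` are hypothetical callables standing in for the initial `_search?scroll=...` request and the follow-up `_search/scroll` requests, and `make_fake_cluster` is an in-memory stand-in so the sketch runs without a cluster. It stops on an empty hits array and always forwards the scroll_id from the most recent response.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def iter_scroll_hits(
    start: Callable[[], Dict],
    fetch_next: Callable[[str], Dict],
) -> Iterator[Dict]:
    """Yield every hit of a scrolled search, batch by batch."""
    response = start()
    while True:
        hits = response["hits"]["hits"]
        if not hits:  # empty hits array: nothing left to return
            break
        yield from hits
        # Always pass the scroll_id from the most recent response.
        response = fetch_next(response["_scroll_id"])


def make_fake_cluster(docs: List[Dict], size: int) -> Tuple[Callable, Callable]:
    """In-memory stand-in for Elasticsearch (illustration only)."""
    state = {"pos": 0, "calls": 0}

    def page() -> Dict:
        batch = docs[state["pos"]: state["pos"] + size]
        state["pos"] += len(batch)
        state["calls"] += 1
        # Hand out a new id on every call to mimic a changing scroll_id.
        return {"_scroll_id": f"id-{state['calls']}", "hits": {"hits": batch}}

    def start() -> Dict:
        return page()

    def fetch_next(scroll_id: str) -> Dict:
        assert scroll_id == f"id-{state['calls']}", "stale scroll_id"
        return page()

    return start, fetch_next


docs = [{"_id": str(i)} for i in range(5)]
start, fetch_next = make_fake_cluster(docs, size=2)
collected = list(iter_scroll_hits(start, fetch_next))
print(len(collected))
```

With a real cluster, the two callables would wrap the HTTP requests shown earlier; the loop itself stays the same.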

When to use it
Scrolling is not intended for real-time user requests, but rather for processing large amounts of data.

Snapshot-like behaviour
The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

Aggregations
If the request specifies aggs, only the initial search response will contain the aggs results.

Order-independent iteration
This applies when you do not care about the order in which documents are returned.
Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

<code>GET /_search?scroll=1m
{
   "sort" : [
     "_doc"
   ]
}</code>

Sliced scroll
A scroll can be split into multiple slices, and each slice can then be consumed independently.
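As a rough sketch, the request bodies for a sliced scroll could be generated as below. It assumes an Elasticsearch version that supports the `slice` parameter with `id` and `max` fields in the search body; the helper name `build_slice_bodies` is hypothetical, and each body would be sent as its own initial `_search?scroll=...` request.

```python
import json
from typing import Dict, List


def build_slice_bodies(max_slices: int, query: Dict) -> List[Dict]:
    """Return one search body per slice; each slice is scrolled on its own."""
    if max_slices < 2:
        # Assumption: a single slice is just a plain scroll, so reject it here.
        raise ValueError("max_slices must be at least 2")
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "query": query,
            "sort": ["_doc"],  # order does not matter, so use the cheapest sort
        }
        for slice_id in range(max_slices)
    ]


bodies = build_slice_bodies(2, {"match_all": {}})
for body in bodies:
    print(json.dumps(body))
```

Each slice returns its own scroll_id, so the slices can be drained by separate workers in parallel.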

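Scanning and standard scroll

A scanning scroll differs from a standard scroll in several ways (note that the scan search type was deprecated in Elasticsearch 2.1 and removed in 5.0, where a scroll sorted by _doc takes its place):
1. Scanning scroll results are not sorted; they come back in the order in which the docs were indexed.
2. Scanning scroll does not support aggregations.
3. The "hits" list of the initial scanning scroll response contains no results.
4. If "size" is set in the initial scanning scroll request, it applies to each shard: with size=3 and 5 shards, a batch returns at most 3*5=15 results.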
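The size arithmetic in point 4 can be written down as a small Python helper. The function name is hypothetical; it only restates the rule from the list above.

```python
def max_hits_per_batch(size: int, primary_shards: int, scanning: bool) -> int:
    """Upper bound on the hits in one batch.

    For a scanning scroll, size applies to every shard; for a regular
    scroll, size is the total for the whole batch.
    """
    return size * primary_shards if scanning else size


print(max_hits_per_batch(size=3, primary_shards=5, scanning=True))
print(max_hits_per_batch(size=3, primary_shards=5, scanning=False))
```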

Example

FAQ

Does the scroll_id stay the same?

The scroll_id may change over the course of multiple calls, so always pass the most recent scroll_id as the scroll_id for the subsequent request.

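Exception: SearchContextMissingException

SearchContextMissingException[No search context found for id [721283]];
Cause: the scroll keep-alive time was set too short, so the search context expired before the next scroll request arrived.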
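The timing behind this error can be illustrated with a toy model in Python. It is not Elasticsearch code: `FakeSearchContext` and `survives` are hypothetical names, time is simulated with plain numbers, and `LookupError` stands in for SearchContextMissingException. The model only encodes the rule that every scroll request sets a new expiry time.

```python
class FakeSearchContext:
    """Toy model of a search context whose expiry is reset by each request."""

    def __init__(self, keep_alive_s: float, now: float = 0.0) -> None:
        self.expires_at = now + keep_alive_s

    def scroll(self, keep_alive_s: float, now: float) -> None:
        if now > self.expires_at:
            # Stands in for SearchContextMissingException.
            raise LookupError("No search context found")
        self.expires_at = now + keep_alive_s  # each request sets a new expiry


def survives(keep_alive_s: float, batch_processing_s: float, batches: int) -> bool:
    """Return True if the context stays alive for the whole scroll."""
    now = 0.0
    ctx = FakeSearchContext(keep_alive_s, now)
    try:
        for _ in range(batches):
            now += batch_processing_s  # time spent on the previous batch
            ctx.scroll(keep_alive_s, now)
    except LookupError:
        return False
    return True


# 100 batches at 45 s each: far longer than 60 s in total, yet each gap fits.
print(survives(keep_alive_s=60, batch_processing_s=45, batches=100))
# A single batch that takes 90 s outlives the 60 s keep-alive.
print(survives(keep_alive_s=60, batch_processing_s=90, batches=100))
```

The keep-alive has to cover the processing time of one batch, not of the whole scroll.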

Reading the source (2.1.2)

How the scroll_id is generated:
…search.type.TransportSearchHelper#buildScrollId(…) takes three arguments: the search type, the result information, and the query parameters. See also TransportSearchQueryThenFetchAction.AsyncAction.finishHim().

