Preface
We use the search API all the time. By default it returns 10 hits, and the from and size parameters let us change the number of hits returned and paginate. Sometimes, however, a large result set is needed, and then scan and scroll are required. Used together, they retrieve huge numbers of results from Elasticsearch efficiently, without paying the cost of deep pagination.
For details see: https://es.xiaoleilu.com/060_Distributed_Search/20_Scan_and_scroll.html
Unlike the link above, this post describes the Python implementation.
Data
The index hz contains 29,999 documents in total. The bulk-import code is available at:
http://blog.csdn.net/xsdxs/article/details/72849796
Code Examples
ES client code:
# -*- coding: utf-8 -*-
import elasticsearch

ES_SERVERS = [{'host': 'localhost', 'port': 9200}]

es_client = elasticsearch.Elasticsearch(hosts=ES_SERVERS)
Search code using the search API:
# -*- coding: utf-8 -*-
from es_client import es_client


def search(search_offset, search_size):
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options, search_offset, search_size)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    result_items = es_result['hits']['hits']
    for item in result_items:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, search_offset, search_size,
                      index='hz', doc_type='xyd'):
    es_result = es_client.search(
        index=index,
        doc_type=doc_type,
        body=es_search_options,
        from_=search_offset,
        size=search_size
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search(0, 1000)
    print(len(final_results))
Everything seems fine, and 1000 is printed as expected. But now change the requirement: retrieve 20,000 of the documents.
if __name__ == '__main__':
    final_results = search(0, 20000)
This raises the following error:
elasticsearch.exceptions.TransportError: TransportError(500, u'search_phase_execution_exception', u'Result window is too large, from + size must be less than or equal to: [10000] but was [20000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.')
Explanation: by default the search API can return at most 10,000 hits (from + size must not exceed index.max_result_window), which is why this request fails.
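As the error message itself notes, this limit can be raised by changing the index.max_result_window index-level setting. A sketch of that workaround (the value 50000 is only an illustration; raising the window increases heap usage per request, which is exactly why the message recommends the scroll API for large result sets):

```json
PUT hz/_settings
{
  "index": {
    "max_result_window": 50000
  }
}
```

For a one-off export this can be acceptable, but for routinely pulling tens of thousands of documents, scan and scroll below is the right tool.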
Without further ado, here is the implementation based on scan and scroll:
# -*- coding: utf-8 -*-
from es_client import es_client
from elasticsearch import helpers


def search():
    es_search_options = set_search_optional()
    es_result = get_search_result(es_search_options)
    final_result = get_result_list(es_result)
    return final_result


def get_result_list(es_result):
    final_result = []
    for item in es_result:
        final_result.append(item['_source'])
    return final_result


def get_search_result(es_search_options, scroll='5m', index='hz',
                      doc_type='xyd', timeout='1m'):
    es_result = helpers.scan(
        client=es_client,
        query=es_search_options,
        scroll=scroll,
        index=index,
        doc_type=doc_type,
        timeout=timeout
    )
    return es_result


def set_search_optional():
    # search options
    es_search_options = {
        "query": {
            "match_all": {}
        }
    }
    return es_search_options


if __name__ == '__main__':
    final_results = search()
    print(len(final_results))
The output shows that all 29,999 documents were retrieved.
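One caveat: helpers.scan returns a lazy generator, so collecting every hit into a list, as get_result_list does, gives up its main advantage. For truly large result sets it is better to consume hits one at a time. A minimal sketch of that streaming pattern, using a stand-in generator instead of helpers.scan so it runs without a live cluster (fake_scan and stream_sources are illustrative names, not part of the elasticsearch library):

```python
def fake_scan(n):
    """Stand-in for helpers.scan: lazily yields hit dicts shaped like ES results."""
    for i in range(n):
        yield {'_source': {'id': i}}


def stream_sources(hits):
    """Yield only the _source payload of each hit, one at a time."""
    for hit in hits:
        yield hit['_source']


# Process hits without ever materializing the full result set in memory.
count = sum(1 for _ in stream_sources(fake_scan(29999)))
print(count)  # 29999
```

With a real cluster, replacing fake_scan(29999) with the helpers.scan call from get_search_result keeps memory usage flat regardless of how many documents the index holds.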