Python Elasticsearch DSL: Query, Filter, and Aggregation Examples

Elasticsearch Basic Concepts

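Index: the logical space in which Elasticsearch stores data, comparable to a database in a relational database system. An index can live on one or more shards, and each shard can have multiple replicas.

Document: the unit of data stored in Elasticsearch, comparable to a row of a table in a relational database.

A document consists of multiple fields, and fields with the same name in different documents must have the same type. A field may also occur more than once within a document, that is, hold several values (a multivalued field).

Document type: to make querying easier, one index can hold several kinds of documents, each identified by a document type. It is comparable to a table in a relational database. Note, however, that fields with the same name must still have the same type across the document types of an index. (Mapping types were deprecated in Elasticsearch 6.x and removed in later releases, so this concept applies to older versions only.)

Mapping: comparable to a schema definition in a relational database. It stores how each field is mapped, and different document types have different mappings.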
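
As a rough illustration of how these concepts fit together, the sketch below writes an index definition as a plain Python dict: settings carries the shard and replica counts, and mappings declares the type of each field. The field names are borrowed from the query examples in this article and their types are assumptions. The typeless mappings layout shown here is the one used by Elasticsearch 7 and later; older versions nest properties under a document type name.

```python
import json

# Hypothetical index definition; field names and types are assumptions for illustration.
index_body = {
    "settings": {
        "number_of_shards": 1,     # primary shards of the index
        "number_of_replicas": 1,   # replica copies of each primary shard
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},       # analyzed full-text field
            "country": {"type": "keyword"},  # exact-value field
            "timestamp": {"type": "date"},
            "sip": {"type": "ip"},
            "tags": {"type": "keyword"},     # may hold several values per document
        }
    },
}

# The dict is plain JSON-serializable data, which is what a create-index request carries.
print(json.dumps(index_body, indent=2))
```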

The table below compares Elasticsearch terms with their relational-database counterparts:

Relational database	Elasticsearch
Database	Index
Table	Type
Row	Document
Column	Field
Schema	Mapping
Index	Everything is indexed
SQL	Query DSL
SELECT * FROM table…	GET http://…
UPDATE table SET	PUT http://…

Introduction to Python Elasticsearch DSL

Connecting to Elasticsearch:

import elasticsearch

es = elasticsearch.Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])

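Start with a simple search. q is the query string; extra leading, trailing, or repeated spaces in q do not change the result. size sets the number of hits to return, from_ sets the starting offset, and filter_path selects which parts of the response are returned. In the second call below, only the _id and _type of each hit appear in the final result.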

res_3 = es.search(index="bank", q="Holmes", size=1, from_=1)
res_4 = es.search(index="bank", q=" 39225    5686 ", size=1000, filter_path=['hits.hits._id', 'hits.hits._type'])
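
To make the effect of filter_path more concrete, here is a simplified local model of it in plain Python. It is only a sketch: the real filtering happens inside Elasticsearch and also supports wildcards, which this version does not. The sample response and the helper names filter_response and _copy_path are made up for illustration.

```python
def _copy_path(src, keys, dst):
    """Copy the value found by following keys in src into the same place in dst."""
    key, rest = keys[0], keys[1:]
    if key not in src:
        return
    value = src[key]
    if not rest:
        dst[key] = value
    elif isinstance(value, dict):
        _copy_path(value, rest, dst.setdefault(key, {}))
    elif isinstance(value, list):
        # Apply the remaining path to every element of the list.
        items = dst.setdefault(key, [{} for _ in value])
        for item_src, item_dst in zip(value, items):
            if isinstance(item_src, dict):
                _copy_path(item_src, rest, item_dst)


def filter_response(doc, paths):
    """Keep only the dotted paths listed in paths (no wildcard support)."""
    result = {}
    for path in paths:
        _copy_path(doc, path.split("."), result)
    return result


# Hypothetical response, shaped like a search result but with invented values.
sample = {
    "took": 3,
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "1", "_type": "account", "_source": {"balance": 39225}},
            {"_id": "6", "_type": "account", "_source": {"balance": 5686}},
        ],
    },
}

filtered = filter_response(sample, ["hits.hits._id", "hits.hits._type"])
print(filtered)
```

Everything outside the listed paths (took, total, _source) is dropped, which is why the filtered response is much smaller than the full one.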

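Querying all data in a given index:

Here index names the index to search. A string selects a single index; a list selects several, for example index=["bank", "banner", "country"]; a wildcard pattern selects every matching index, for example index=["apple*"] for all indices whose names start with apple.

A search can likewise be restricted to a specific doc type.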

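from elasticsearch_dsl import Search

# With no query set, everything matches; execute() returns only the first 10 hits by default
s = Search(using=es, index="index-test")
response = s.execute()
print(response.to_dict())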
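
The three ways of passing index (a single name, a list of names, a wildcard pattern) can be mimicked locally with the standard library. This is only an approximation under stated assumptions: Elasticsearch resolves index names on the server, and fnmatch accepts more syntax (such as ? and [...]) than index patterns do. The function name resolve_indices and the list of existing indices are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_indices(patterns, existing):
    """Return the existing index names selected by a name or a list of names/patterns."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return [name for name in existing
            if any(fnmatchcase(name, pattern) for pattern in patterns)]


# Hypothetical set of indices on the cluster.
existing = ["bank", "banner", "country", "apple-2017", "apple-2018"]

print(resolve_indices("bank", existing))
print(resolve_indices(["bank", "country"], existing))
print(resolve_indices(["apple*"], existing))
```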

Querying by field; several conditions can be chained:

s = Search(using=es, index="index-test").query("match", sip="192.168.1.1")
s = s.query("match", dip="192.168.1.2")  # chained queries are combined with AND (bool must)
response = s.execute()

Multi-field queries:

from elasticsearch_dsl.query import MultiMatch

multi_match = MultiMatch(query='hello', fields=['title', 'content'])
s = Search(using=es, index="index-test").query(multi_match)
response = s.execute()

print(response.to_dict())

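A Q() object can also express a multi-field query: fields is a list of field names and query is the value to search for.

from elasticsearch_dsl import Q

q = Q("multi_match", query="hello", fields=['title', 'content'])
s = Search(using=es, index="index-test").query(q)
response = s.execute()

print(response.to_dict())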

The first argument of Q() is the query type, which can also be bool.

q = Q('bool', must=[Q('match', title='hello'), Q('match', content='world')])
s = Search(using=es, index="index-test").query(q)
response = s.execute()

print(response.to_dict())

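Q() objects can also be combined with the operators | (or), & (and), and ~ (not), which is another way of writing the bool query above.

q = Q("match", title='python') | Q("match", title='django')
s = Search(using=es, index="index-test").query(q)
print(q.to_dict())
# {"bool": {"should": [...]}}

q = Q("match", title='python') & Q("match", title='django')
s = Search(using=es, index="index-test").query(q)
print(q.to_dict())
# {"bool": {"must": [...]}}

q = ~Q("match", title="python")
s = Search(using=es, index="index-test").query(q)
print(q.to_dict())
# {"bool": {"must_not": [...]}}
response = s.execute()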
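
How the three operators map onto bool clauses can be sketched with a toy class that overloads |, & and ~. This is not the library's Q implementation (which, among other things, merges nested bool queries); ToyQ and match are hypothetical names used only to show the operator-to-clause mapping.

```python
class ToyQ:
    """Minimal stand-in for a query object; holds the query body as a dict."""

    def __init__(self, body):
        self.body = body

    def __or__(self, other):
        # a | b  ->  either clause may match
        return ToyQ({"bool": {"should": [self.body, other.body]}})

    def __and__(self, other):
        # a & b  ->  both clauses must match
        return ToyQ({"bool": {"must": [self.body, other.body]}})

    def __invert__(self):
        # ~a  ->  the clause must not match
        return ToyQ({"bool": {"must_not": [self.body]}})

    def to_dict(self):
        return self.body


def match(**fields):
    """Build a toy match query, e.g. match(title='python')."""
    return ToyQ({"match": fields})


either = match(title="python") | match(title="django")
both = match(title="python") & match(title="django")
neither = ~match(title="python")

print(either.to_dict())
print(both.to_dict())
print(neither.to_dict())
```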

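Filtering. The first example below is a range filter: range is the filter type, timestamp is the name of the field to filter on, gte means greater than or equal to, and lt means less than; set the bounds as needed.

On the difference between term and match: term matches the stored term exactly and does not analyze the query value, whereas match analyzes (tokenizes) the query text and returns a relevance score. For example, on an analyzed text field whose terms are stored in lowercase, a term query that contains uppercase letters returns no hits, while a match query finds the same documents regardless of the case used in the query.

import time

# Range filter combined with a match query (s must be a Search object, not an executed response)
s = s.filter("range", timestamp={"gte": 0, "lt": time.time()}).query("match", country="in")
# Plain terms filter
res_3 = s.filter("terms", balance_num=["39225", "5686"]).execute()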
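
The term versus match behaviour described above can be reproduced with a toy inverted index. This is a deliberately simplified model, assuming an analyzer that only splits on non-alphanumeric characters and lowercases tokens; it ignores relevance scoring, and all names and sample documents are invented for illustration.

```python
import re


def analyze(text):
    """Toy analyzer: split on non-alphanumeric characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, field):
    """Map each analyzed token to the set of document ids that contain it."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """term: the value is looked up verbatim, without analysis."""
    return sorted(index.get(value, set()))


def match_query(index, text):
    """match: the query text is analyzed first; any matching token is a hit."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


# Hypothetical documents.
docs = {
    1: {"name": "Sherlock Holmes"},
    2: {"name": "Mycroft Holmes"},
    3: {"name": "John Watson"},
}
index = build_index(docs, "name")

print(term_query(index, "Holmes"))          # [] because the stored token is "holmes"
print(term_query(index, "holmes"))          # [1, 2]
print(match_query(index, "HOLMES watson"))  # [1, 2, 3]
```

In this model the uppercase term query misses because the index only contains lowercased tokens, while match lowercases the query text before looking it up.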

Other ways to write it:

from elasticsearch_dsl import Search, Q

s = Search()
s = s.filter('terms', tags=['search', 'python'])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'terms': {'tags': ['search', 'python']}}]}}}

# Equivalent form using an explicit bool query (starting again from an empty Search)
s = Search()
s = s.query('bool', filter=[Q('terms', tags=['search', 'python'])])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'terms': {'tags': ['search', 'python']}}]}}}

# exclude() is the negated counterpart of filter()
s = Search()
s = s.exclude('terms', tags=['search', 'python'])
# or, equivalently
s = Search()
s = s.query('bool', filter=[~Q('terms', tags=['search', 'python'])])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'bool': {'must_not': [{'terms': {'tags': ['search', 'python']}}]}}]}}}

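Aggregations can be chained after queries, filters, and similar operations; they are added through aggs.

A bucket is a grouping. Its first argument is the name of the group, which you choose freely; the second is the aggregation type; the third is the field to aggregate on.

metric works the same way. The available metric types include sum, avg, max, and min. Two types return all of these values at once: stats and extended_stats, the latter of which also returns the variance and related statistics.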

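from elasticsearch_dsl import A

# The examples are alternatives; reusing a bucket name on the same Search replaces the earlier one
# Example 1
s.aggs.bucket("per_country", "terms", field="timestamp").metric("sum_click", "stats", field="click").metric("sum_request", "stats", field="request")

# Example 2
s.aggs.bucket("per_age", "terms", field="click.keyword").metric("sum_click", "stats", field="click")

# Example 3
s.aggs.metric("sum_age", "extended_stats", field="impression")

# Example 4
s.aggs.bucket("per_age", "terms", field="country.keyword")

# Example 5: aggregate by ranges; the A object must be attached to the search to take effect
a = A("range", field="account_number", ranges=[{"to": 10}, {"from": 11, "to": 21}])
s.aggs.bucket("per_account_range", a)  # "per_account_range" is an illustrative name

res = s.execute()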

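Finally, execute() still has to be called. Note that s.aggs operations modify the search in place, so their return value should not be captured in a variable as if it were a result (for example, res = s.aggs... is a mistake). The aggregation results are stored in res, under res.aggregations.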
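
For orientation, the sketch below builds by hand the aggs section that a terms bucket with nested metrics corresponds to in the request body, using the names from the first aggregation example (a per_country bucket with two stats metrics). The helper terms_bucket_with_metrics is hypothetical and is not part of elasticsearch_dsl; it only illustrates how metrics are nested inside their bucket.

```python
import json


def terms_bucket_with_metrics(bucket_name, bucket_field, metrics):
    """Build the aggs section for a terms bucket with nested metric aggregations.

    metrics maps a metric name to a (metric_type, field) pair.
    """
    nested = {
        name: {kind: {"field": field}}
        for name, (kind, field) in metrics.items()
    }
    return {
        "aggs": {
            bucket_name: {
                "terms": {"field": bucket_field},
                "aggs": nested,  # metrics are computed separately for every bucket
            }
        }
    }


body = terms_bucket_with_metrics(
    "per_country",
    "timestamp",
    {"sum_click": ("stats", "click"), "sum_request": ("stats", "request")},
)
print(json.dumps(body, indent=2))
```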

Sorting

s = Search().sort(
    'category',
    '-title',
    {"lines" : {"order" : "asc", "mode" : "avg"}}
)
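
The three argument styles accepted by sort() (a field name, a field name prefixed with - for descending order, and a full dict) can be illustrated with a small stand-alone function. This is a sketch of the convention only, not the library's code; normalize_sort is a hypothetical name.

```python
def normalize_sort(*keys):
    """Turn sort arguments into the list placed under "sort" in the request body."""
    result = []
    for key in keys:
        if isinstance(key, str) and key.startswith("-"):
            # A leading "-" means descending order on that field.
            result.append({key[1:]: {"order": "desc"}})
        else:
            # Plain field names and explicit dicts are passed through unchanged.
            result.append(key)
    return result


sort_clause = normalize_sort(
    "category",
    "-title",
    {"lines": {"order": "asc", "mode": "avg"}},
)
print(sort_clause)
```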

Pagination

s = s[10:20]
# {"from": 10, "size": 10}
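
The slice notation maps onto from and size by simple arithmetic: from is the start of the slice and size is its length. A minimal sketch of that conversion, with a hypothetical helper name and assuming non-negative bounds:

```python
def slice_to_from_size(start, stop):
    """Convert slice bounds [start:stop] into the from/size pair of a search request."""
    if start < 0 or stop < start:
        raise ValueError("expected 0 <= start <= stop")
    return {"from": start, "size": stop - start}


print(slice_to_from_size(10, 20))  # {'from': 10, 'size': 10}
print(slice_to_from_size(0, 5))    # {'from': 0, 'size': 5}
```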

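Some further methods, for readers who are interested:

s = Search()

# Set extra top-level properties of the request body with .extra()
s = s.extra(explain=True)

# Set query-string parameters with .params()
# (search_type="count" was removed in Elasticsearch 5.0; request size 0 instead for count-only searches)
s = s.params(search_type="dfs_query_then_fetch")

# To limit the returned fields, use .source()
# only return the selected fields
s = s.source(['title', 'body'])
# don't return any fields, just the metadata
s = s.source(False)
# explicitly include/exclude fields (older releases spelled these include/exclude)
s = s.source(includes=["title"], excludes=["user.*"])
# reset the field selection
s = s.source(None)

# Build a Search from a dict
s = Search.from_dict({"query": {"match": {"title": "python"}}})

# Update an existing Search in place from a dict
s.update_from_dict({"query": {"match": {"title": "python"}}, "size": 42})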
