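This article uses Elasticsearch 6.2.4 as its example.
Having covered the basics earlier, we now know the basic ES operations. Next we turn to the most powerful part of ES: full-text search.
Preparation
Bulk-importing data
First prepare some sample data, then import it: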
wget https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/account/_bulk" --data-binary "@accounts.json"
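This imports 1,000 documents into ES.
Note: every line of accounts.json, including the last one, must end with \n. If you see the error "The bulk request must be terminated by a newline [\n]", check whether the last line ends with \n.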
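The trailing-newline requirement can be checked before uploading. The following is a minimal sketch; the helper name ensure_trailing_newline is hypothetical, and it only repairs the final newline, not any other formatting problem in the file.

```python
def ensure_trailing_newline(payload: bytes) -> bytes:
    """Return the bulk payload unchanged if it ends with a newline, else append one."""
    return payload if payload.endswith(b"\n") else payload + b"\n"


# A two-line bulk body (action line + document) that is missing the final newline.
body = b'{"index":{"_id":"1"}}\n{"account_number":1,"balance":39225}'
fixed = ensure_trailing_newline(body)
print(fixed.endswith(b"\n"))
```

In practice you would read accounts.json in binary mode, pass its contents through this helper, and send the result as the request body.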
The index is named bank. We can list the indices that currently exist:
curl "localhost:9200/_cat/indices?format=json&pretty"
Result:
[
{
"health" : "yellow",
"status" : "open",
"index" : "bank",
"uuid" : "MDxR02uESgKSynX6k8B-og",
"pri" : "5",
"rep" : "1",
"docs.count" : "1000",
"docs.deleted" : "0",
"store.size" : "474.6kb",
"pri.store.size" : "474.6kb"
}
]
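Visualizing the data with Kibana
This section is optional; skip it if you are not interested.
It assumes you have already set up Elasticsearch and Kibana.
Open the Kibana web UI at http://127.0.0.1:5601, go to Management -> Kibana -> Index Patterns, and choose Create Index Pattern:
a. For Index pattern, enter bank;
b. Click Create.
Then open Discover and select bank to see the data we just imported.
We can now search the data from the visual interface.
Pretty cool!
Next, let's run the same kind of searches through the API.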
Queries
URI search
A URI search executes a search request purely through URI request parameters.
GET /bank/_search?q=Virginia&pretty
GET /bank/_search?q=firstname:Virginia
curl:
curl -XGET "localhost:9200/bank/_search?q=Virginia&pretty"
curl -XGET "localhost:9200/bank/_search?q=firstname:Virginia&pretty"
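Explanation: the first request searches all fields for the keyword "Virginia"; the second restricts the search to the firstname field. Sample result for the first request: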
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 4.631368,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "298",
"_score": 4.631368,
"_source": {
"account_number": 298,
"balance": 34334,
"firstname": "Bullock",
"lastname": "Marsh",
"age": 20,
"gender": "M",
"address": "589 Virginia Place",
"employer": "Renovize",
"email": "bullockmarsh@renovize.com",
"city": "Coinjock",
"state": "UT"
}
},
{
"_index": "bank",
"_type": "account",
"_id": "25",
"_score": 4.6146765,
"_source": {
"account_number": 25,
"balance": 40540,
"firstname": "Virginia",
"lastname": "Ayala",
"age": 39,
"gender": "F",
"address": "171 Putnam Avenue",
"employer": "Filodyne",
"email": "virginiaayala@filodyne.com",
"city": "Nicholson",
"state": "PA"
}
}
]
}
}
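Meaning of the response fields:
- took – time Elasticsearch took to execute the search, in milliseconds
- timed_out – whether the search timed out
- _shards – how many shards were searched, with counts of successful and failed shards
- hits – the search results (an object)
- hits.total – total number of documents matching the search criteria
- hits.hits – the actual array of search results (the first 10 documents by default)
- hits.sort – the sort key of each result (absent when sorting by score)
- hits._score, max_score – relevance scores; ignore these for now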
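To make the field descriptions concrete, here is a small sketch that pulls the main fields out of a parsed response. The summarize helper is hypothetical, and the sample dictionary is a shortened copy of the result shown earlier.

```python
def summarize(response: dict) -> dict:
    """Extract the most commonly used fields from a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": hits["total"],  # a plain integer in ES 6.x
        "ids": [hit["_id"] for hit in hits["hits"]],
    }


# Shortened copy of the sample response above.
sample = {
    "took": 4,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
    "hits": {
        "total": 2,
        "max_score": 4.631368,
        "hits": [
            {"_id": "298", "_score": 4.631368, "_source": {"firstname": "Bullock"}},
            {"_id": "25", "_score": 4.6146765, "_source": {"firstname": "Virginia"}},
        ],
    },
}
print(summarize(sample))
```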
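Parameters:
- q: the query string (maps to the query_string query).
- df: the default field to use when the query does not specify a field prefix.
- analyzer: the name of the analyzer to use when analyzing the query string.
- sort: the sort order, in the form fieldName, fieldName:asc, or fieldName:desc. fieldName can be an actual field in the document or the special name _score, which sorts by score. Several sort parameters may be given (order matters).
- timeout: the search timeout. Defaults to no timeout.
- from: the offset of the first hit to return. Defaults to 0.
- size: the number of hits to return. Defaults to 10.
- default_operator: the default operator, either AND or OR. Defaults to OR.
See: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-uri-request.html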
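As a rough illustration of how these parameters end up in the URI, the sketch below assembles a search path with the standard library. build_uri_search is a hypothetical helper, not part of any Elasticsearch client.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_uri_search(index: str, **params) -> str:
    """Build a URI-search path such as /bank/_search?q=...&sort=..."""
    # Drop unset parameters so they fall back to the server-side defaults.
    query = {key: value for key, value in params.items() if value is not None}
    return f"/{index}/_search?{urlencode(query)}"


path = build_uri_search("bank", q="firstname:Virginia", sort="account_number:asc", size=5)
print(path)
# Decoding the query string gives back the original parameter values.
print(parse_qs(urlsplit(path).query))
```

Because from is a reserved word in Python, it has to be passed as build_uri_search("bank", **{"from": 10}).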
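Example:
GET /bank/_search?q=*&sort=account_number:asc&pretty
Explanation: all results sorted by the account_number field in ascending order. Only the first 10 are returned by default.
The following request-body queries are intended as equivalents of the URI searches above. Note that the first one relies on the _all field, which is disabled by default for indices created in 6.x, so it may return no hits there: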
GET /bank/_search
{
"query": {
"multi_match" : {
"query" : "Virginia",
"fields" : ["_all"]
}
}
}
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
]
}
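We normally query by sending JSON. Elasticsearch provides a JSON-style domain-specific language for executing queries, known as the Query DSL.
Note: the queries above specify only the index, not the type, so ES does not restrict the search by type. To restrict it to a type, append the type to the URI, for example: GET /bank/account/_search.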
match query
GET /bank/_search
{
"query" : {
"match" : { "address" : "Avenue" }
}
}
curl:
curl -XGET -H "Content-Type: application/json" "localhost:9200/bank/_search?pretty" -d '{"query":{"match":{"address":"Avenue"}}}'
This query returns the documents whose address contains Avenue.
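The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: search_request is a hypothetical helper, the host is the local node used throughout this article, and POST is used instead of GET (the _search endpoint accepts both). The request is only built here, not sent.

```python
import json
import urllib.request


def search_request(index: str, body: dict,
                   host: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) a search request carrying a JSON body."""
    return urllib.request.Request(
        f"{host}/{index}/_search?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = search_request("bank", {"query": {"match": {"address": "Avenue"}}})
print(req.get_method(), req.full_url)
# To actually send it (requires a running node):
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode("utf-8"))
```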
term query
GET /bank/_search
{
"query" : {
"term" : { "address" : "Avenue" }
}
}
curl:
curl -XGET -H "Content-Type: application/json" "localhost:9200/bank/_search?pretty" -d '{"query":{"term":{"address":"Avenue"}}}'
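This query looks for documents whose address contains the exact term Avenue. Because address is an analyzed text field whose terms are lowercased, the capitalized Avenue may find nothing; the section on term and match queries at the end of this article explains why.
Note: if a field needs both analyzed full-text search and exact matching, it is best to define the mapping correctly from the start. For example, add a .keyword sub-field to support exact matching: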
{
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
This effectively gives you two fields, address and address.keyword. The mapping chapter covers this in more detail later.
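For context, the field definition above sits inside the index mapping. The sketch below builds a possible create-index body in Python; it assumes the 6.x layout with the account type used in this article, and text_with_keyword is a hypothetical helper.

```python
import json


def text_with_keyword(ignore_above: int = 256) -> dict:
    """A text field with a keyword sub-field for exact matching."""
    return {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": ignore_above}},
    }


# Candidate body for creating the index (assumed: ES 6.x with a single type "account").
create_index_body = {
    "mappings": {"account": {"properties": {"address": text_with_keyword()}}}
}
print(json.dumps(create_index_body, indent=2))
```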
Pagination (from/size)
Pagination uses the from and size keywords, which specify the offset and the page size respectively.
GET /bank/_search
{
"query": { "match_all": {} },
"from": 0,
"size": 2
}
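from defaults to 0 and size defaults to 10.
Note: from/size pagination in ES is not true pagination; it is known as shallow pagination. from + size cannot exceed the index.max_result_window index setting, which defaults to 10,000. For more efficient ways to do deep scrolling, see the Scroll or Search After API.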
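The relationship between page numbers, from, size, and the result window can be sketched as follows. page_params is a hypothetical helper, and 10,000 is the default window mentioned above.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_params(page: int, page_size: int, max_window: int = MAX_RESULT_WINDOW) -> dict:
    """Translate a 1-based page number into from/size, enforcing the result window."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    offset = (page - 1) * page_size
    if offset + page_size > max_window:
        raise ValueError(
            f"from + size = {offset + page_size} exceeds the result window of {max_window}"
        )
    return {"from": offset, "size": page_size}


print(page_params(3, 20))
```

With a page size of 20, page 500 is the last page this helper accepts; page 501 raises an error, which is the point at which Scroll or Search After would be needed.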
Sorting (sort)
The sort keyword sorts by field, in ascending (asc) or descending (desc) order. By default, results are sorted by _score in descending order.
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
],
"from":0,
"size":10
}
Sorting on multiple fields:
GET /bank/_search
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" },
{ "_score": "asc" }
],
"from":0,
"size":10
}
Results are sorted by account_number first, then by _score.
Sorting by script
Sorting can also be based on a custom script. Here is an example:
GET bank/account/_search
{
"query": { "range": { "age": {"gt": 20} }},
"sort" : {
"_script" : {
"type" : "number",
"script" : {
"lang": "painless",
"source": "doc['account_number'].value * params.factor",
"params" : {
"factor" : 1.1
}
},
"order" : "asc"
}
}
}
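The query above sorts by script: ascending by the value of account_number * 1.1. lang specifies the script language, here painless; painless also supports functions such as Math.log.
This example only demonstrates the usage; it has no practical meaning.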
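To see what the script sort computes, here is a local Python emulation over a few made-up documents. It is only an illustration of the ordering logic (filter on age, then sort by account_number * factor), not how ES executes scripts.

```python
def script_sort(docs: list, factor: float = 1.1) -> list:
    """Keep docs with age > 20, then sort ascending by account_number * factor."""
    older = [doc for doc in docs if doc["age"] > 20]
    return sorted(older, key=lambda doc: doc["account_number"] * factor)


# Made-up documents for illustration only.
docs = [
    {"account_number": 3, "age": 30},
    {"account_number": 1, "age": 25},
    {"account_number": 2, "age": 20},  # filtered out: age is not greater than 20
]
print([doc["account_number"] for doc in script_sort(docs)])
```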
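Filtering fields
By default, ES returns all fields of each document. This is referred to as the source (the _source field in the search hits). If we don't want the whole source returned, we can request only a few of its fields.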
GET /bank/_search
{
"query": { "match_all": {} },
"_source": ["account_number", "balance"]
}
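The _source keyword implements this field filtering.
Returning script fields
Scripts can return newly defined fields that are computed on the fly. Example: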
GET bank/account/_search
{
"query" : {
"match_all": {}
},
"size":2,
"script_fields" : {
"age2" : {
"script" : {
"lang": "painless",
"source": "doc['age'].value * 2"
}
},
"age3" : {
"script" : {
"lang": "painless",
"source": "params['_source']['age'] * params.factor",
"params" : {
"factor" : 2.0
}
}
}
}
}
Result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 1,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "25",
"_score": 1,
"fields": {
"age3": [
78
],
"age2": [
78
]
}
},
{
"_index": "bank",
"_type": "account",
"_id": "44",
"_score": 1,
"fields": {
"age3": [
74
],
"age2": [
74
]
}
}
]
}
}
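Note: doc['my_field_name'].value is faster and more efficient than params['_source']['my_field_name'], so it is the recommended form.
AND queries
How do we find documents that satisfy both condition A and condition B? Combine the clauses with the must keyword.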
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "account_number":136 } },
{ "match": { "address": "lane" } },
{ "match": { "city": "Urie" } }
]
}
}
}
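The first must query above is also equivalent to a single match query that uses the and operator:
GET /bank/_search
{
"query": {
"match": {
"address": { "query": "mill lane", "operator": "and" }
}
}
}
Do not try to express this by repeating the must key twice inside one bool object. Duplicate keys in a JSON object are ambiguous at best, and ES 6.x enables strict duplicate-key detection by default, so such a request body should be rejected with a parse error. Keep all AND clauses in a single must array.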
OR queries
ES implements OR queries with the should keyword.
GET /bank/_search
{
"query": {
"bool": {
"should": [
{ "match": { "account_number":136 } },
{ "match": { "address": "lane" } },
{ "match": { "city": "Urie" } }
]
}
}
}
NOT queries
The must_not keyword implements queries that match neither A nor B.
GET /bank/_search
{
"query": {
"bool": {
"must_not": [
{ "match": { "address": "mill" } },
{ "match": { "address": "lane" } }
]
}
}
}
This requires that the address field contains neither mill nor lane.
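The AND / OR / NOT semantics of must, should, and must_not can be modelled locally. The sketch below is a deliberately simplified model: match is approximated by lowercase token overlap, scoring is ignored, and the documents are made up. It does reflect one real rule: when a bool query has no must (or filter) clause, at least one should clause has to match.

```python
import re


def tokens(text) -> set:
    """Lowercased word tokens, a rough stand-in for an analyzed text field."""
    return set(re.findall(r"\w+", str(text).lower()))


def match(doc: dict, field: str, value) -> bool:
    """True if any token of the query value occurs in the document field."""
    return bool(tokens(value) & tokens(doc.get(field, "")))


def bool_matches(doc: dict, must=(), should=(), must_not=()) -> bool:
    """Simplified bool query: clauses are (field, value) pairs."""
    if not all(match(doc, f, v) for f, v in must):
        return False
    if any(match(doc, f, v) for f, v in must_not):
        return False
    if should and not must:
        # Without must clauses, at least one should clause has to match.
        return any(match(doc, f, v) for f, v in should)
    return True


# Made-up documents for illustration only.
docs = [
    {"id": 1, "address": "990 Mill Road"},
    {"id": 2, "address": "171 Putnam Avenue"},
    {"id": 3, "address": "198 Mill Lane"},
]
clauses = [("address", "mill"), ("address", "lane")]
and_ids = [d["id"] for d in docs if bool_matches(d, must=clauses)]
or_ids = [d["id"] for d in docs if bool_matches(d, should=clauses)]
not_ids = [d["id"] for d in docs if bool_matches(d, must_not=clauses)]
print(and_ids, or_ids, not_ids)
```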
Boolean combination queries
We can combine must, should, must_not, and filter to build complex queries.
- A AND NOT B
GET /bank/_search
{
"query": {
"bool": {
"must": [
{ "match": { "age": 40 } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}
Equivalent SQL:
select * from bank where age=40 and state!= "ID";
- A AND (B OR C)
GET /bank/_search
{
"query":{
"bool":{
"must":[
{"match":{"age":39}},
{"bool":{"should":[
{"match":{"city":"Nicholson"}},
{"match":{"city":"Yardville"}}
]}
}
]
}
}
}
Equivalent SQL:
select * from bank where age=39 and (city="Nicholson" or city="Yardville");
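Nested bool queries like the one above are easier to assemble programmatically than by hand. The following sketch uses two hypothetical helpers, match and bool_query, to build the same A AND (B OR C) body as a Python dictionary.

```python
import json


def match(field: str, value) -> dict:
    """A single match clause."""
    return {"match": {field: value}}


def bool_query(**clauses) -> dict:
    """A bool query from must / should / must_not / filter keyword arguments."""
    return {"bool": {name: value for name, value in clauses.items() if value}}


# age = 39 AND (city = Nicholson OR city = Yardville)
body = {
    "query": bool_query(
        must=[
            match("age", 39),
            bool_query(should=[match("city", "Nicholson"), match("city", "Yardville")]),
        ]
    )
}
print(json.dumps(body, indent=2))
```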
Range queries
GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
- A AND (B OR C) AND (D BETWEEN E, F)
GET /bank/_search
{
"query":{
"bool":{
"must":[
{"match":{"age":39}},
{"bool":{"should":[
{"match":{"city":"Nicholson"}},
{"match":{"city":"Yardville"}}
]}
},
{"bool":{"filter": {"range": {
"balance": {
"gte": 20000,
"lte": 30000
}}}
}
}
]
}
}
}
Equivalent SQL:
select * from bank where age=39 and (city="Nicholson" or city="Yardville") and (balance between 20000 and 30000);
For a range query on a single field, the must and filter keywords can be omitted:
GET /bank/_search
{
"query":{
"range":{
"balance":{
"gte":20000,
"lte":30000
}
}
}
}
Equivalent SQL:
select * from bank where balance between 20000 and 30000;
Range query on multiple fields:
GET /bank/_search
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"bool":{
"must":[
{"range": {"balance": {"gte": 20000,"lte": 30000}}},
{"range": {"age": {"gte": 30}}}
]
}
}
}
}
}
Querying for documents where a field is missing or equal to 0
GET /bank/account/_search
{
"query":{
"bool":{
"should":[
{
"term":{"age":0}
},
{
"bool":{
"must_not":[{"exists":{"field":"age"}}]
}
}
]
}
}
}
Highlighting results
ES can highlight the matched keywords in the returned results, marking them with HTML tags.
GET bank/account/_search
{
"query" : {
"match": { "address": "Avenue" }
},
"from": 0,
"size": 1,
"highlight" : {
"require_field_match": false,
"fields": {
"*" : { }
}
}
}
Output:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 214,
"max_score": 1.5814995,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "102",
"_score": 1.5814995,
"_source": {
"account_number": 102,
"balance": 29712,
"firstname": "Dena",
"lastname": "Olson",
"age": 27,
"gender": "F",
"address": "759 Newkirk Avenue",
"employer": "Hinway",
"email": "denaolson@hinway.com",
"city": "Choctaw",
"state": "NJ"
},
"highlight": {
"address": [
"759 Newkirk <em>Avenue</em>"
]
}
}
]
}
}
The highlight section of the response holds the highlighted fragments, wrapped in <em> by default. To change the tags, use the pre_tags and post_tags settings:
"fields": {
"*" : { "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
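* highlights all fields. To highlight only a specific field, replace * with that field's name.
require_field_match: by default (true), only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields. See: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-highlighting.html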
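To show what pre_tags and post_tags do to a fragment, here is a tiny local emulation. It is only an illustration of the output shape; the highlight function is hypothetical and far simpler than the real ES highlighters.

```python
import re


def highlight(text: str, keyword: str,
              pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Wrap whole-word, case-insensitive occurrences of keyword in the given tags."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre_tag}{m.group(0)}{post_tag}", text)


print(highlight("759 Newkirk Avenue", "Avenue"))
print(highlight("759 Newkirk Avenue", "avenue", "<strong>", "</strong>"))
```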
Aggregation queries
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
}
}
}
}
Result:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits" : {
"total" : 1000,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound": 20,
"sum_other_doc_count": 770,
"buckets" : [ {
"key" : "ID",
"doc_count" : 27
}, {
"key" : "TX",
"doc_count" : 27
}, {
"key" : "AL",
"doc_count" : 25
}, {
"key" : "MD",
"doc_count" : 25
}, {
"key" : "TN",
"doc_count" : 23
}, {
"key" : "MA",
"doc_count" : 21
}, {
"key" : "NC",
"doc_count" : 21
}, {
"key" : "ND",
"doc_count" : 21
}, {
"key" : "ME",
"doc_count" : 20
}, {
"key" : "MO",
"doc_count" : 20
} ]
}
}
}
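The result shows that ID (Idaho) has 27 accounts, TX (Texas) has 27 accounts, and so on.
Equivalent SQL:
SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10
The query groups by the state field and returns the top 10 buckets.
Setting size to 0 means no documents are returned, only the aggregation results. state.keyword refers to the un-analyzed keyword sub-field of state, which holds the exact value. A terms aggregation on the analyzed text field itself is disabled by default because it is expensive (it requires fielddata).
To count the distinct values of a field, use the cardinality keyword: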
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"cardinality": {
"field": "state.keyword"
}
}
}
}
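Conceptually, the terms aggregation is a group-and-count and the cardinality aggregation is a distinct count. The sketch below reproduces both locally over made-up data; the helper names are hypothetical. One difference worth noting: the ES cardinality aggregation is approximate, while this local version is exact.

```python
from collections import Counter

# Made-up documents for illustration only.
accounts = [
    {"state": "ID"}, {"state": "TX"}, {"state": "ID"},
    {"state": "AL"}, {"state": "TX"}, {"state": "ID"},
]


def terms_agg(docs: list, field: str, size: int = 10) -> list:
    """Group by field and return the top `size` buckets by document count."""
    counts = Counter(doc[field] for doc in docs)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common(size)]


def cardinality_agg(docs: list, field: str) -> int:
    """Exact number of distinct values of field."""
    return len({doc[field] for doc in docs})


print(terms_agg(accounts, "state"))
print(cardinality_agg(accounts, "state"))
```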
Nested aggregations
We can aggregate on top of an aggregation, for example to compute sums, averages, and so on.
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
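The query above builds on the previous aggregation and calculates the average account balance by state (again only for the top 10 states, sorted by document count in descending order).
We can nest aggregations inside aggregations arbitrarily to extract the statistics we need from the data.
Building on the previous aggregation, we now sort by the average balance in descending order: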
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"average_balance": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
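Here the buckets are ordered by the result of the sub-aggregation, descending. The previous example actually relied on the implicit default order, which is by _count (the document count of each bucket), descending:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"_count": "desc"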
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
This example shows how to group by age bracket (20-29, 30-39, and 40-49), then by gender, and finally get the average account balance per age bracket, per gender:
GET /bank/_search
{
"size": 0,
"aggs": {
"group_by_age": {
"range": {
"field": "age",
"ranges": [
{
"from": 20,
"to": 30
},
{
"from": 30,
"to": 40
},
{
"from": 40,
"to": 50
}
]
},
"aggs": {
"group_by_gender": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
}
}
This result is more complex: the grouping is nested, so the result is nested too:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"group_by_age": {
"buckets": [
{
"key": "20.0-30.0",
"from": 20,
"to": 30,
"doc_count": 451,
"group_by_gender": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M",
"doc_count": 232,
"average_balance": {
"value": 27374.05172413793
}
},
{
"key": "F",
"doc_count": 219,
"average_balance": {
"value": 25341.260273972603
}
}
]
}
},
{
"key": "30.0-40.0",
"from": 30,
"to": 40,
"doc_count": 504,
"group_by_gender": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "F",
"doc_count": 253,
"average_balance": {
"value": 25670.869565217392
}
},
{
"key": "M",
"doc_count": 251,
"average_balance": {
"value": 24288.239043824702
}
}
]
}
},
{
"key": "40.0-50.0",
"from": 40,
"to": 50,
"doc_count": 45,
"group_by_gender": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M",
"doc_count": 24,
"average_balance": {
"value": 26474.958333333332
}
},
{
"key": "F",
"doc_count": 21,
"average_balance": {
"value": 27992.571428571428
}
}
]
}
}
]
}
}
}
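A nested result like this is often easier to consume once flattened into rows. The sketch below walks the two bucket levels; flatten is a hypothetical helper, and the sample is the 40-50 bucket copied from the output above.

```python
def flatten(aggregations: dict) -> list:
    """Turn nested age/gender buckets into (age_range, gender, count, avg_balance) rows."""
    rows = []
    for age_bucket in aggregations["group_by_age"]["buckets"]:
        for gender_bucket in age_bucket["group_by_gender"]["buckets"]:
            rows.append((
                age_bucket["key"],
                gender_bucket["key"],
                gender_bucket["doc_count"],
                round(gender_bucket["average_balance"]["value"], 2),
            ))
    return rows


# The 40-50 age bucket from the result above.
sample = {"group_by_age": {"buckets": [{
    "key": "40.0-50.0", "from": 40, "to": 50, "doc_count": 45,
    "group_by_gender": {"buckets": [
        {"key": "M", "doc_count": 24, "average_balance": {"value": 26474.958333333332}},
        {"key": "F", "doc_count": 21, "average_balance": {"value": 27992.571428571428}},
    ]},
}]}}
for row in flatten(sample):
    print(row)
```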
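term vs. match queries
First, what is the difference between the following examples?
Given: ES holds exactly 1 document whose address is 171 Putnam Avenue, and 0 documents whose address is exactly Putnam. The index is bank, the type is account, and the document ID is 25.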
GET /bank/_search
{
"query": {
"match" : {
"address" : "Putnam"
}
}
}
GET /bank/_search
{
"query": {
"match" : {
"address.keyword" : "Putnam"
}
}
}
GET /bank/_search
{
"query": {
"term" : {
"address" : "Putnam"
}
}
}
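Results:
1. The first query matches the document, because the query text is analyzed.
2. The second query matches nothing, because no document has Putnam as its whole, un-analyzed address.
3. The result of the third query is uncertain; it depends on how the field was actually tokenized.
The following request shows how the address field of this document was tokenized:
GET /bank/account/25/_termvectors?fields=address
Result: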
{
"_index": "bank",
"_type": "account",
"_id": "25",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"address": {
"field_statistics": {
"sum_doc_freq": 591,
"doc_count": 197,
"sum_ttf": 591
},
"terms": {
"171": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 3
}
]
},
"avenue": {
"term_freq": 1,
"tokens": [
{
"position": 2,
"start_offset": 11,
"end_offset": 17
}
]
},
"putnam": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 4,
"end_offset": 10
}
]
}
}
}
}
}
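The address field of this document was split into 3 terms:
171
avenue
putnam
Now we can answer the third query: no match! Change the value to lowercase putnam, though, and it matches.
The reason:
- a term query looks up the exact term in the inverted index
- a match query first analyzes the query text, then looks up the resulting terms
Since Putnam is not among the indexed terms (the lookup is case-sensitive), the term query finds nothing. The match query first analyzes the query text, which turns it into putnam, and then matches that against the terms in the inverted index, so it finds the document.
By default, the standard analyzer converts all uppercase letters to lowercase.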
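The difference can be reproduced with a toy inverted index. This is a simplified model, assuming an analyzer that only splits on non-word characters and lowercases; the real standard analyzer does more, and all function names here are hypothetical.

```python
import re
from collections import defaultdict


def analyze(text: str) -> list:
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(docs: dict) -> dict:
    """Map each analyzed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index: dict, value: str) -> set:
    """Look up the value verbatim, without analysis."""
    return set(index.get(value, set()))


def match_query(index: dict, text: str) -> set:
    """Analyze the query text first, then look up each resulting term."""
    result = set()
    for term in analyze(text):
        result |= index.get(term, set())
    return result


index = build_index({"25": "171 Putnam Avenue"})
print(sorted(index))                  # the indexed terms
print(term_query(index, "Putnam"))    # verbatim lookup of the capitalized value
print(term_query(index, "putnam"))    # verbatim lookup of the lowercase term
print(match_query(index, "Putnam"))   # analyzed before lookup
```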