Elasticsearch學習記錄(入門篇)
1、 Elasticsearch
的請求與結果
請求結構
curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'
- VERB HTTP方法:GET, POST, PUT, HEAD, DELETE
- PROTOCOL http或者https協議(只有在Elasticsearch前面有https代理的時候可用)
- HOST Elasticsearch集群中的任何一個節點的主機名,如果是在本地的節點,那么就叫localhost
- PORT Elasticsearch HTTP服務所在的端口,默認為9200
- PATH API路徑(例如_count將返回集群中文檔的數量),PATH可以包含多個組件,例如_cluster/stats或者_nodes/stats/jvm
- QUERY_STRING 一些可選的查詢請求參數,例如?pretty參數將使請求返回更加美觀易讀的JSON數據
BODY 一個JSON格式的請求主體(如果請求需要的話)
PUT創建(索引創建)
$ curl -XPUT 'http://localhost:9200/megacorp/employee/3?pretty' -d '
{
"first_name" : "Douglas", "last_name" : "Fir", "age" : 35, "about": "I like to build cabinets", "interests": [ "forestry" ]
}
’{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "3",
"_version" : 1,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"created" : true
}##GET請求(搜索) ###檢索文檔
$ curl -XGET 'http://localhost:9200/megacorp/employee/1?pretty'
{
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [ "sports", "music" ]
}
}###簡單搜索 使用`megacorp`索引和`employee`類型,但是我們在結尾使用關鍵字\_search來取代原來的文檔ID。響應內容的hits數組中包含了我們所有的三個文檔。默認情況下搜索會返回前10個結果。
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [ "music" ]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [ "sports", "music" ]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"first_name" : "Douglas",
"last_name" : "Fir",
"age" : 35,
"about" : "I like to build cabinets",
"interests" : [ "forestry" ]
}
} ]
}
}接下來,讓我們搜索姓氏中包含“Smith”的員工。我們將在命令行中使用輕量級的搜索方法。這種方法常被稱作查詢字符串(query string)搜索,因為我們像傳遞URL參數一樣去傳遞查詢語句:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?q=last_name:Smith&pretty'
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.30685282,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 0.30685282,
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [ "music" ]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.30685282,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [ "sports", "music" ]
}
} ]
}
}###使用DSL語句查詢 查詢字符串搜索便於通過命令行完成特定(ad hoc)的搜索,但是它也有局限性(參閱簡單搜索章節)。Elasticsearch提供豐富且靈活的查詢語言叫做DSL查詢(Query DSL),它允許你構建更加復雜、強大的查詢。 DSL(Domain Specific Language特定領域語言)以JSON請求體的形式出現。我們可以這樣表示之前關於“Smith”的查詢:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"query" : {
"match" : {
"last_name" : "Smith"
}
}
}
'###更復雜的搜索 我們讓搜索稍微再變的復雜一些。我們依舊想要找到姓氏為“Smith”的員工,但是我們只想得到年齡大於30歲的員工。我們的語句將添加過濾器(filter),它使得我們高效率的執行一個結構化搜索:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"query" : {
"filtered" : {
"filter" : {
"range" : {
"age" : { "gt" : 30 } --<1>
}
},
"query" : {
"match" : {
"last_name" : "smith" --<2>
}
}
}
}
}
'* <1> 這部分查詢屬於區間過濾器(range filter),它用於查找所有年齡大於30歲的數據——gt為"greater than"的縮寫。 * <2> 這部分查詢與之前的match語句(query)一致。
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.30685282,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 0.30685282,
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [ "music" ]
}
} ]
}
}###全文搜索 到目前為止搜索都很簡單:搜索特定的名字,通過年齡篩選。讓我們嘗試一種更高級的搜索,全文搜索——一種傳統數據庫很難實現的功能。 我們將會搜索所有喜歡“rock climbing”的員工:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
}
'你可以看到我們使用了之前的`match`查詢,從`about`字段中搜索"**rock climbing**",我們得到了兩個匹配文檔:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.16273327,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.16273327,<1>
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [ "sports", "music" ]
}
}, {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "2",
"_score" : 0.016878016,<2>
"_source" : {
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests" : [ "music" ]
}
} ]
}
}* <1><2> 結果相關性評分。 默認情況下,Elasticsearch根據結果相關性評分來對結果集進行排序,所謂的「結果相關性評分」就是文檔與查詢條件的匹配程度。很顯然,排名第一的`John Smith`的`about`字段明確的寫到“**rock climbing**” 但是為什么`Jane Smith`也會出現在結果里呢?原因是“**rock**”在她的abuot字段中被提及了。因為只有“**rock**”被提及而“**climbing**”沒有,所以她的`_score`要低於John。 ###短語搜索 目前我們可以在字段中搜索單獨的一個詞,這挺好的,但是有時候你想要確切的匹配若干個單詞或者短語(phrases)。例如我們想要查詢同時包含"rock"和"climbing"(並且是相鄰的)的員工記錄。 要做到這個,我們只要將`match`查詢變更為`match_phrase`查詢即可:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
}
}
'{
"took" : 16,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.23013961,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.23013961,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [ "sports", "music" ]
}
} ]
}
}###高亮我們的搜索 很多應用喜歡從每個搜索結果中**高亮(highlight)**匹配到的關鍵字,這樣用戶可以知道為什么這些文檔和查詢相匹配。在Elasticsearch中高亮片段是非常容易的。 讓我們在之前的語句上增加`highlight`參數:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}
'當我們運行這個語句時,會命中與之前相同的結果,但是在返回結果中會有一個新的部分叫做`highlight`,這里包含了來自`about`字段中的文本,並且用\<em>\</em>來標識匹配到的單詞。
{
"took" : 33,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.23013961,
"hits" : [ {
"_index" : "megacorp",
"_type" : "employee",
"_id" : "1",
"_score" : 0.23013961,
"_source" : {
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests" : [ "sports", "music" ]
},
"highlight" : {
"about" : [ "I love to go rock climbing" ]
}
} ]
}
}##聚合 ###分析 最后,我們還有一個需求需要完成:允許管理者在職員目錄中進行一些分析。 Elasticsearch有一個功能叫做**聚合(aggregations)**,它允許你在數據上生成復雜的分析統計。它很像SQL中的`GROUP BY`但是功能更強大。
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}
'查詢結果:
{...
"aggregations" : {
"all_interests" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "music",
"doc_count" : 2
}, {
"key" : "forestry",
"doc_count" : 1
}, {
"key" : "sports",
"doc_count" : 1
} ]
}
}
}這些數據並沒有被預先計算好,它們是實時的從匹配查詢語句的文檔中動態計算生成的。 如果我們想知道所有姓"Smith"的人最大的共同點(興趣愛好),我們只需要增加合適的語句既可:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"query": {
"match": {
"last_name": "smith"
}
},
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}
'all_interests聚合已經變成只包含和查詢語句相匹配的文檔了:
...
"all_interests": {
"buckets": [
{
"key": "music",
"doc_count": 2
},
{
"key": "sports",
"doc_count": 1
}
]
}聚合也允許分級匯總。例如,讓我們統計每種興趣下職員的平均年齡:
$ curl -XGET 'http://localhost:9200/megacorp/employee/_search?pretty' -d '
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}
'雖然這次返回的聚合結果有些復雜,但仍然很容易理解:
...
"all_interests": {
"buckets": [
{
"key": "music",
"doc_count": 2,
"avg_age": {
"value": 28.5
}
},
{
"key": "forestry",
"doc_count": 1,
"avg_age": {
"value": 35
}
},
{
"key": "sports",
"doc_count": 1,
"avg_age": {
"value": 25
}
}
]
}該聚合結果比之前的聚合結果要更加豐富。我們依然得到了興趣以及數量(指具有該興趣的員工人數)的列表,但是現在每個興趣額外擁有`avg_age`字段來顯示具有該興趣員工的平均年齡。