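Distributed Characteristics
Benefits of Elasticsearch's distributed design:
- Storage scales horizontally
- Higher availability: if some nodes stop serving, the cluster as a whole keeps working
Elasticsearch's distributed architecture
- Clusters are distinguished by their cluster name; the default is "elasticsearch"
- The name can be changed in the configuration file, or set on the command line with -E cluster.name="ops-es"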
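Nodes
A node is a single Elasticsearch instance:
- In essence it is a Java process
- One machine can run several Elasticsearch processes, but in production it is generally recommended to run only one instance per machine
Every node has a name, set in the configuration file or at startup with -E node.name=es01
After a node starts, it generates a UID that is stored in the data directory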
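Coordinating Node
The node that receives and handles a request is called the Coordinating Node
- It routes the request to the right node; for example, a create-index request is routed to the master node
Every node is a Coordinating Node by default
Setting all other node roles to false turns a node into a dedicated Coordinating Node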
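Data Node
A node that can store data is called a Data Node
- A node is a data node by default after startup; set node.data: false to disable this
Responsibilities of a Data Node
- It stores shard data and is central to scaling data (the Master Node decides how shards are distributed across the data nodes)
Adding data nodes
- scales data horizontally and removes the single point of failure for data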
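Master Node
Responsibilities of the Master Node
- Handles requests such as creating and deleting indices, and decides which node each shard is allocated to
- Maintains and updates the cluster state
Master Node best practices
- The master node is critical, so the deployment must avoid a single point of failure
- Configure several master-eligible nodes per cluster, and let each of them carry only the master role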
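Cluster State
The cluster state holds the essential information about a cluster:
- All node information
- All indices with their mappings and settings
- Shard routing information
Every node keeps a copy of the cluster state
However, only the master node may modify the cluster state, and it is responsible for publishing changes to the other nodes
- If any node could modify it, the cluster state would become inconsistent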
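Master-Eligible Nodes & the Election Process
Master-eligible nodes ping each other, and the node with the lowest Node ID is elected master
The other nodes join the cluster without taking the master role; as soon as they detect that the elected master is gone, they elect a new master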
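The Split-Brain Problem
Split-brain is a classic network problem in distributed systems: after a network failure, one node (here Node1, the current master) can no longer reach the others
- Node2 and Node3 elect a new master
- Node1 still acts as master, forms a cluster of its own, and keeps updating the cluster state
- The result is two masters maintaining different cluster states, and when the network recovers there is no way to pick the correct one to restore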
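How to Avoid Split-Brain
Restrict elections with a quorum: an election may only take place when the number of reachable master-eligible nodes is at least the quorum
- quorum = (number of master-eligible nodes / 2) + 1, using integer division
- With 3 master-eligible nodes, set discovery.zen.minimum_master_nodes to 2 to avoid split-brain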
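The quorum rule can be checked with a few lines of arithmetic. The sketch below is only an illustration of the formula above; the function names `quorum` and `can_elect` are made up for this example and are not Elasticsearch APIs.

```python
def quorum(master_eligible: int) -> int:
    """Quorum as defined above: (master-eligible nodes / 2) + 1, integer division."""
    return master_eligible // 2 + 1


def can_elect(reachable: int, master_eligible: int) -> bool:
    """A partition may elect a master only if it sees at least a quorum of nodes."""
    return reachable >= quorum(master_eligible)


# Three master-eligible nodes split into {Node1} and {Node2, Node3}.
print("quorum for 3 nodes:", quorum(3))
print("Node1 alone may elect:", can_elect(1, 3))
print("Node2 + Node3 may elect:", can_elect(2, 3))
```

With a quorum of 2, the isolated Node1 cannot remain or become master, so only one side of the partition can hold a master at any time.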
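From 7.0 on, this setting is no longer needed
- The minimum_master_nodes parameter was removed; Elasticsearch itself chooses the nodes that can form a quorum
- A typical master election now completes in a very short time. Scaling the cluster up and down is safer and easier, and there are fewer configuration options that can lead to data loss
- Nodes log their state more clearly, which helps diagnose why they cannot join a cluster or why no master can be elected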
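Primary Shard
Shards are the foundation of Elasticsearch's distributed storage
- Primary shards / replica shards
Primary shards spread data across all nodes
- Primary shards distribute the data of one index over multiple Data Nodes, which scales storage horizontally
- The number of primary shards is set when the index is created and cannot be changed afterwards; changing it requires reindexing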
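Choosing Shard Counts
How to plan the number of primary and replica shards of an index
- Too few primary shards: for example, an index created with 1 primary shard
  - If the index grows quickly, the cluster cannot scale its data by adding nodes
- Too many primary shards: each shard holds very little data and a node ends up with too many shards, which hurts performance
- Too many replica shards lower the overall write performance of the cluster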
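Documents Are Stored on Shards
A document is stored on one specific primary shard and on its replicas; for example, document 1 is stored on P0 and R0
Mapping documents to shards:
- The mapping must spread documents evenly across all shards so that hardware is fully used and no machine sits idle while others are overloaded
- Candidate approaches
  - Random / round robin: to look up document 1 when there are many shards, several shards may have to be queried before it is found
  - Maintain a document-to-shard lookup table: expensive to maintain when the number of documents is large
  - Compute it on the fly: derive from document 1 itself which shard holds it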
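The Document-to-Shard Routing Algorithm
shard = hash(_routing) % number_of_primary_shards
- The hash function spreads documents evenly over the shards
- The default _routing value is the document id
- A custom _routing value can be supplied, for example to place all products of the same country on one shard
- This formula is the fundamental reason why the number of primaries cannot be changed freely once the index settings are in place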
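To see why the formula ties documents to a fixed number of primaries, here is a small simulation. It is a sketch under one loud assumption: `zlib.crc32` stands in for the hash function, because only the modulo step matters for the argument; Elasticsearch uses its own hash, and the helper name `route` is hypothetical.

```python
import zlib


def route(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash chosen for this sketch (assumption),
    not the hash Elasticsearch uses internally.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = [str(i) for i in range(1, 1001)]

# Where each document lives with 3 primaries, and where the same formula
# would look for it if the index suddenly had 4 primaries.
before = {doc_id: route(doc_id, 3) for doc_id in doc_ids}
after = {doc_id: route(doc_id, 4) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} documents would be looked up on a different shard")
```

Because lookups recompute the shard from the same formula, any document whose result changes would no longer be found where it was written, which is why changing the primary count requires reindexing.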
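How Shards Work Internally
What is a shard in ES
- The smallest unit of work in ES: a shard is a Lucene index
Some questions:
- Why is search in ES near real time?
- How does ES make sure no data is lost on a power failure?
- Why does deleting a document not free disk space immediately?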
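Immutability of the Inverted Index
- The inverted index follows an immutable design: once generated, it cannot be changed
- Benefits of immutability:
  - Concurrent writes to the file do not have to be considered, which avoids the performance cost of locking
  - Once read into the kernel's file system cache, the index stays there. As long as the file system cache has enough space, most requests are served from memory without hitting disk, which improves performance considerably
  - Caches are easy to build and maintain, and the data can be compressed
- The challenge of immutability: to make a new document searchable, the whole index would have to be rebuilt.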
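Lucene Index
- In Lucene, a single inverted index file is called a Segment. Segments are self-contained and immutable; several Segments together form a Lucene Index, which corresponds to a Shard in ES
- When new documents are written, a new Segment is created. A query searches all Segments and merges the results. Lucene keeps a file that records all Segments, called the Commit Point
- Information about deleted documents is stored in ".del" files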
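What Is Refresh
- Writing the Index Buffer into a Segment is called Refresh. Refresh does not perform an fsync
- Refresh frequency: once per second by default, configurable with index.refresh_interval. After a Refresh the data becomes searchable, which is why Elasticsearch search is near real time
- Under heavy write load this produces many Segments
- A Refresh is also triggered when the Index Buffer fills up; its default size is 10% of the JVM heap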
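What Is the Transaction Log
- Writing a Segment to disk is relatively slow, so a Refresh first writes the Segment to the file system cache to make it searchable
- To make sure no data is lost, every indexing operation is also written to the Transaction Log. In recent versions the Transaction Log is persisted to disk by default. Each shard has its own Transaction Log
- On an ES Refresh the Index Buffer is cleared, but the Transaction Log is not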
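What Is Flush
ES Flush & Lucene Commit
- Calls Refresh, which empties the Index Buffer
- Calls fsync to write the cached Segments to disk
- Clears the Transaction Log
- Runs every 30 minutes by default
- Also runs when the Transaction Log is full (512 MB by default)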
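The interplay of Index Buffer, Segments, Transaction Log, Refresh, and Flush can be modelled with a toy class. This is a teaching sketch only: `ToyShard` and its attributes are invented for illustration, and real durability (fsync, commit points) is reduced to a counter.

```python
class ToyShard:
    """In-memory model of the write path described above (illustration only)."""

    def __init__(self):
        self.index_buffer = []       # indexed, but not yet searchable
        self.segments = []           # immutable tuples, searchable after refresh
        self.translog = []           # every operation since the last flush
        self.committed_segments = 0  # segments made durable by a flush

    def index(self, doc):
        # Indexing writes to the buffer and to the transaction log together.
        self.index_buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Refresh turns the buffer into a new segment; the translog is kept.
        if self.index_buffer:
            self.segments.append(tuple(self.index_buffer))
            self.index_buffer = []

    def flush(self):
        # Flush = refresh + "fsync" of the segments + clearing the translog.
        self.refresh()
        self.committed_segments = len(self.segments)
        self.translog = []

    def search(self, predicate):
        # Only documents that already live in a segment are visible.
        return [doc for seg in self.segments for doc in seg if predicate(doc)]


shard = ToyShard()
shard.index({"id": 1})
print("before refresh:", shard.search(lambda d: True))
shard.refresh()
print("after refresh:", shard.search(lambda d: True), "translog size:", len(shard.translog))
shard.flush()
print("after flush: translog size:", len(shard.translog))
```

The model shows the two guarantees separately: Refresh controls visibility (near-real-time search), while the Transaction Log and Flush control durability.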
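What Is Merge
- Many Segments accumulate, so they have to be merged periodically
- Merging reduces the number of Segments and purges documents that were marked as deleted
- ES and Lucene run merges automatically
- A merge can also be forced: POST my_index/_forcemerge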
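Distributed Search
An Elasticsearch search runs in two phases:
Phase 1 - Query
- The user sends a search request to an ES node. On receiving it, the node acts as the Coordinating Node, picks 3 of the 6 primary and replica shards at random (in this example, one copy of each of 3 shards), and sends them the query
- Each selected shard runs the query and sorts its results, then returns From + Size sorted document IDs and sort values to the Coordinating Node
Phase 2 - Fetch
- The Coordinating Node re-sorts the lists of document IDs it received from the shards in the Query phase and selects the IDs of the documents from From to From + Size
- It then retrieves the full documents from the corresponding shards with a multi get request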
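The two phases can be sketched as a small simulation. All names here (`query_phase`, `search`, the sample shards and scores) are hypothetical and only mirror the steps listed above; a real cluster does this over the network.

```python
def query_phase(shard_docs, frm, size):
    """Each shard sorts locally and returns its top (from + size) ids with scores."""
    ranked = sorted(shard_docs.items(), key=lambda kv: (-kv[1]["score"], kv[0]))
    return [(doc_id, doc["score"]) for doc_id, doc in ranked[: frm + size]]


def search(shards, frm, size):
    # Query phase: collect (score, id, shard) candidates from every shard.
    candidates = []
    for shard_no, docs in enumerate(shards):
        for doc_id, score in query_phase(docs, frm, size):
            candidates.append((score, doc_id, shard_no))

    # The coordinating node re-sorts globally and cuts out the requested page.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    page = candidates[frm: frm + size]

    # Fetch phase: pull the full documents from the shards that own them.
    return [shards[shard_no][doc_id]["source"] for _, doc_id, shard_no in page]


# Two toy shards with made-up scores.
shards = [
    {"a": {"score": 3.0, "source": "doc a"}, "b": {"score": 1.0, "source": "doc b"}},
    {"c": {"score": 2.0, "source": "doc c"}, "d": {"score": 0.5, "source": "doc d"}},
]

print(search(shards, 0, 2))
print(search(shards, 1, 2))
```

Only ids and sort values travel in the Query phase; full documents are fetched for the final page alone, which keeps the first phase cheap.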
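Potential Problems of Query Then Fetch
Performance:
- Number of documents each shard has to retrieve = From + Size
- The coordinating node finally has to process number_of_shards * (From + Size) documents
- Deep pagination
Relevance scoring
- Each shard computes relevance scores from its own data only, so scores are independent between shards and can be skewed, especially when there is little data. When the total number of documents is small and there is more than one primary shard, the more primary shards there are, the less accurate the scores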
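A small calculation illustrates how per-shard statistics skew scores. The sketch assumes the BM25 inverse document frequency term, ln(1 + (N - n + 0.5) / (n + 0.5)), as the part of the score that depends on shard-local counts; the document counts are made up, and the text above does not name a specific formula.

```python
import math


def bm25_idf(doc_count: int, docs_with_term: int) -> float:
    """BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5)), computed from local counts."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical index of 6 documents split over 2 shards.
# The term occurs in 1 of 3 documents on shard 0 and in 3 of 3 on shard 1.
idf_shard0 = bm25_idf(3, 1)
idf_shard1 = bm25_idf(3, 3)
idf_whole_index = bm25_idf(6, 4)

print(f"shard 0: {idf_shard0:.3f}")
print(f"shard 1: {idf_shard1:.3f}")
print(f"whole index: {idf_whole_index:.3f}")
```

The same term is weighted very differently on the two shards, and neither value matches what a single shard holding all six documents would compute. With large, evenly distributed data the local counts converge and the effect fades.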
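Pagination & Traversal
- From: the starting position
- Size: the number of documents to return
ES is distributed by nature: a query asks for one result, but the data lives on many shards across many machines, and ES still has to deliver a globally sorted result (by relevance score)
For a query with From=990, Size=10:
- 1000 documents are retrieved from every shard. The Coordinating Node then gathers all results and finally sorts them to pick the top 1000 documents, of which the last 10 are returned
- The deeper the page, the more memory is used. To limit the memory cost of deep pagination, ES has a setting that caps the result window at 10000 documents by default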
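The cost of a page can be estimated directly from the formula number_of_shards * (From + Size). The helper names and the 5-shard example below are assumptions for illustration; the 10000 limit is the default mentioned above.

```python
MAX_RESULT_WINDOW = 10_000  # default cap on from + size described above


def page_cost(number_of_shards: int, frm: int, size: int) -> dict:
    """Documents each shard returns and the total the coordinating node sorts."""
    per_shard = frm + size
    return {"per_shard": per_shard, "coordinating_node": number_of_shards * per_shard}


def check_window(frm: int, size: int, limit: int = MAX_RESULT_WINDOW) -> None:
    """Reject requests whose from + size exceeds the result window."""
    if frm + size > limit:
        raise ValueError(f"from + size must be <= {limit}, got {frm + size}")


# From=990, Size=10 on a hypothetical index with 5 primary shards.
print(page_cost(5, 990, 10))

check_window(9_990, 10)           # exactly at the limit: accepted
try:
    check_window(10_000, 10)      # one page deeper: rejected
except ValueError as err:
    print("rejected:", err)
```

Ten visible hits cost 5000 sorted entries on the coordinating node in this example, and the cost grows linearly with the page depth.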
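Search After Avoids Deep Pagination
- Avoids the performance problem of deep pagination and retrieves the next page in real time
- Does not support jumping to a page number (From)
- Can only page forward
- The first search must specify a sort whose values are unique (adding _id to the sort guarantees uniqueness)
- Each following search passes the sort values of the last document of the previous page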
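The cursor mechanics can be sketched on an in-memory list. This is a simulation, not the Elasticsearch API: `search_after_page` and the four sample documents are hypothetical, with `age` as the sort field and `_id` as the tiebreaker that makes sort values unique.

```python
def search_after_page(docs, size, search_after=None):
    """Return one page sorted by (age, _id) plus the cursor for the next page."""
    ordered = sorted(docs, key=lambda d: (d["age"], d["_id"]))
    if search_after is not None:
        cursor = tuple(search_after)
        # Keep only documents that sort strictly after the previous page's last hit.
        ordered = [d for d in ordered if (d["age"], d["_id"]) > cursor]
    page = ordered[:size]
    next_cursor = [page[-1]["age"], page[-1]["_id"]] if page else None
    return page, next_cursor


docs = [
    {"_id": "1", "age": 32},
    {"_id": "2", "age": 41},
    {"_id": "3", "age": 25},
    {"_id": "5", "age": 25},
]

page, cursor = search_after_page(docs, size=2)
while page:
    print([d["_id"] for d in page], "next cursor:", cursor)
    page, cursor = search_after_page(docs, size=2, search_after=cursor)
```

Documents 3 and 5 share the same age, so without `_id` in the sort the cursor `[25]` could not tell them apart and a page boundary between them would skip or repeat a hit.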
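Bucket & Metric Aggregations and Nested Aggregations
- Metric: a family of statistical calculations
- Bucket: a set of documents that satisfy a condition
Metric Aggregation
Single-value analysis
- max, min, avg, sum
- cardinality (similar to distinct count)
Multi-value analysis
- stats, extended stats
- percentile, percentile rank
- top hits
Demo
Sample data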
# Define the mapping of the employees index
PUT /employees/
{
  "mappings": {
    "properties": {
      "age": { "type": "integer" },
      "gender": { "type": "keyword" },
      "job": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword", "ignore_above": 50 } }
      },
      "name": { "type": "keyword" },
      "salary": { "type": "integer" }
    }
  }
}
# Insert the data
PUT /employees/_bulk
{ "index" : { "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : { "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : { "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : { "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : { "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : { "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : { "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : { "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : { "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : { "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : { "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : { "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : { "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : { "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : { "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : { "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : { "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : { "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : { "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
Test examples
# Metric aggregation: find the lowest salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "min_salary": { "min": { "field": "salary" } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : { "min_salary" : { "value" : 9000.0 } }
}
# Metric aggregation: find the highest salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": { "max": { "field": "salary" } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : { "max_salary" : { "value" : 50000.0 } }
}
# Several Metric aggregations: lowest, highest, and average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": { "max": { "field": "salary" } },
    "min_salary": { "min": { "field": "salary" } },
    "avg_salary": { "avg": { "field": "salary" } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "max_salary" : { "value" : 50000.0 },
    "avg_salary" : { "value" : 24700.0 },
    "min_salary" : { "value" : 9000.0 }
  }
}
# One aggregation that outputs several values: stats
POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": { "stats": { "field": "salary" } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "stats_salary" : { "count" : 20, "min" : 9000.0, "max" : 50000.0, "avg" : 24700.0, "sum" : 494000.0 }
  }
}
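Bucket Aggregations
Documents are assigned to buckets according to a rule, which classifies them. ES provides the common Bucket Aggregations:
- Terms
- Numeric types
  - Range / Date Range
  - Histogram / Date Histogram
- Nesting is supported (buckets inside buckets)
Terms Aggregation
- Running a Terms Aggregation on a field requires fielddata unless the field is a keyword
- keyword fields support Terms Aggregation by default
- text fields must enable fielddata in the mapping; the buckets are then built from the analyzed terms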
# Terms aggregation on the keyword sub-field of job
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": { "terms": { "field": "job.keyword" } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "Java Programmer", "doc_count" : 7 },
        { "key" : "Javascript Programmer", "doc_count" : 4 },
        { "key" : "QA", "doc_count" : 3 },
        { "key" : "DBA", "doc_count" : 2 },
        { "key" : "Web Designer", "doc_count" : 2 },
        { "key" : "Dev Manager", "doc_count" : 1 },
        { "key" : "Product Manager", "doc_count" : 1 }
      ]
    }
  }
}
Aggregating on a text field requires enabling fielddata
# Enable fielddata on the text field so that it supports terms aggregation
PUT employees/_mapping
{
  "properties" : {
    "job": { "type": "text", "fielddata": true }
  }
}
# Terms aggregation on the text field: the buckets are the analyzed terms
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": { "terms": { "field": "job" } }
  }
}
# Result: unlike job.keyword, each bucket is a single analyzed term
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "programmer", "doc_count" : 11 },
        { "key" : "java", "doc_count" : 7 },
        { "key" : "javascript", "doc_count" : 4 },
        { "key" : "qa", "doc_count" : 3 },
        { "key" : "dba", "doc_count" : 2 },
        { "key" : "designer", "doc_count" : 2 },
        { "key" : "manager", "doc_count" : 2 },
        { "key" : "web", "doc_count" : 2 },
        { "key" : "dev", "doc_count" : 1 },
        { "key" : "product", "doc_count" : 1 }
      ]
    }
  }
}
Counting the distinct terms
# Terms aggregations on job.keyword and on job yield different numbers of buckets; cardinality counts the distinct values of job.keyword
POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": { "cardinality": { "field": "job.keyword" } }
  }
}
# Result
{
  "took" : 7, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : { "cardinate" : { "value" : 7 } }
}
Bucketing by gender
# Terms aggregation on the gender keyword field
POST employees/_search
{
  "size": 0,
  "aggs": {
    "gender": { "terms": { "field": "gender" } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "gender" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "male", "doc_count" : 12 },
        { "key" : "female", "doc_count" : 8 }
      ]
    }
  }
}
Specifying size
# Limit the number of buckets with size
POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": { "terms": { "field": "age", "size": 3 } }
  }
}
# Result
{
  "took" : 1, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "ages_5" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 12,
      "buckets" : [
        { "key" : 25, "doc_count" : 3 },
        { "key" : 32, "doc_count" : 3 },
        { "key" : 27, "doc_count" : 2 }
      ]
    }
  }
}
Bucket Size
# 指定size,不同工種中,年紀最大的3個員工的具體信息 POST employees/_search { "size": 0, "aggs": { "jobs": { "terms": { "field":"job.keyword" }, "aggs":{ "old_employee":{ "top_hits":{ "size":3, "sort":[ { "age":{ "order":"desc" } } ] } } } } } } #查詢結果 { "took" : 4, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "jobs" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Java Programmer", "doc_count" : 7, "old_employee" : { "hits" : { "total" : { "value" : 7, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "11", "_score" : null, "_source" : { "name" : "Jenny", "age" : 36, "job" : "Java Programmer", "gender" : "female", "salary" : 38000 }, "sort" : [ 36 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "15", "_score" : null, "_source" : { "name" : "King", "age" : 33, "job" : "Java Programmer", "gender" : "male", "salary" : 28000 }, "sort" : [ 33 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "9", "_score" : null, "_source" : { "name" : "Gregory", "age" : 32, "job" : "Java Programmer", "gender" : "male", "salary" : 22000 }, "sort" : [ 32 ] } ] } } }, { "key" : "Javascript Programmer", "doc_count" : 4, "old_employee" : { "hits" : { "total" : { "value" : 4, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "14", "_score" : null, "_source" : { "name" : "Marshall", "age" : 32, "job" : "Javascript Programmer", "gender" : "male", "salary" : 25000 }, "sort" : [ 32 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "18", "_score" : null, "_source" : { "name" : "Catherine", "age" : 29, "job" : "Javascript Programmer", "gender" : "female", "salary" : 20000 }, "sort" : [ 29 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "17", "_score" : null, "_source" : { "name" 
: "Goodwin", "age" : 25, "job" : "Javascript Programmer", "gender" : "male", "salary" : 16000 }, "sort" : [ 25 ] } ] } } }, { "key" : "QA", "doc_count" : 3, "old_employee" : { "hits" : { "total" : { "value" : 3, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "6", "_score" : null, "_source" : { "name" : "Lucy", "age" : 31, "job" : "QA", "gender" : "female", "salary" : 25000 }, "sort" : [ 31 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "7", "_score" : null, "_source" : { "name" : "Byrd", "age" : 27, "job" : "QA", "gender" : "male", "salary" : 20000 }, "sort" : [ 27 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "5", "_score" : null, "_source" : { "name" : "Rose", "age" : 25, "job" : "QA", "gender" : "female", "salary" : 18000 }, "sort" : [ 25 ] } ] } } }, { "key" : "DBA", "doc_count" : 2, "old_employee" : { "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "19", "_score" : null, "_source" : { "name" : "Boone", "age" : 30, "job" : "DBA", "gender" : "male", "salary" : 30000 }, "sort" : [ 30 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "20", "_score" : null, "_source" : { "name" : "Kathy", "age" : 29, "job" : "DBA", "gender" : "female", "salary" : 20000 }, "sort" : [ 29 ] } ] } } }, { "key" : "Web Designer", "doc_count" : 2, "old_employee" : { "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "4", "_score" : null, "_source" : { "name" : "Rivera", "age" : 26, "job" : "Web Designer", "gender" : "female", "salary" : 22000 }, "sort" : [ 26 ] }, { "_index" : "employees", "_type" : "_doc", "_id" : "3", "_score" : null, "_source" : { "name" : "Tran", "age" : 25, "job" : "Web Designer", "gender" : "male", "salary" : 18000 }, "sort" : [ 25 ] } ] } } }, { "key" : "Dev Manager", "doc_count" : 1, "old_employee" : { 
"hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "2", "_score" : null, "_source" : { "name" : "Underwood", "age" : 41, "job" : "Dev Manager", "gender" : "male", "salary" : 50000 }, "sort" : [ 41 ] } ] } } }, { "key" : "Product Manager", "doc_count" : 1, "old_employee" : { "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : null, "hits" : [ { "_index" : "employees", "_type" : "_doc", "_id" : "1", "_score" : null, "_source" : { "name" : "Emma", "age" : 32, "job" : "Product Manager", "gender" : "female", "salary" : 35000 }, "sort" : [ 32 ] } ] } } } ] } } }
# Range buckets
# Salary ranges; a custom key can be defined for each bucket
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field": "salary",
        "ranges": [
          { "to": 10000 },
          { "from": 10000, "to": 20000 },
          { "key": ">20000", "from": 20000 }
        ]
      }
    }
  }
}
# Result
{
  "took" : 4, "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] },
  "aggregations" : {
    "salary_range" : {
      "buckets" : [
        { "key" : "*-10000.0", "to" : 10000.0, "doc_count" : 1 },
        { "key" : "10000.0-20000.0", "from" : 10000.0, "to" : 20000.0, "doc_count" : 4 },
        { "key" : ">20000", "from" : 20000.0, "doc_count" : 15 }
      ]
    }
  }
}
# Salary histogram: salaries from 0 to 100000, one bucket per 5000
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histogram": {
      "histogram": {
        "field": "salary",
        "interval": 5000,
        "extended_bounds": { "min": 0, "max": 100000 }
      }
    }
  }
}
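The range bucket counts can be reproduced offline to see the boundary rule: in a range aggregation, `from` is inclusive and `to` is exclusive. The sketch below recomputes the buckets from the 20 sample salaries; `range_buckets` is a hypothetical helper, not an Elasticsearch function.

```python
# Salaries of the 20 sample employees, in _id order.
salaries = [35000, 50000, 18000, 22000, 18000, 25000, 20000, 20000, 22000, 9000,
            38000, 32000, 30000, 25000, 28000, 16000, 16000, 20000, 30000, 20000]


def range_buckets(values, ranges):
    """Count values per range; 'from' is inclusive and 'to' is exclusive."""
    counts = {}
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        counts[r["key"]] = sum(1 for v in values if low <= v < high)
    return counts


ranges = [
    {"key": "*-10000", "to": 10000},
    {"key": "10000-20000", "from": 10000, "to": 20000},
    {"key": ">20000", "from": 20000},
]

print(range_buckets(salaries, ranges))
print("strictly above 20000:", sum(1 for v in salaries if v > 20000))
```

The last bucket contains 15 employees because the four people earning exactly 20000 fall into it; only 11 earn strictly more, so a key such as ">=20000" would describe that bucket more precisely than ">20000".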
Bucket sub-aggregations: a sub-aggregation can be either a Bucket or a Metric aggregation
# 嵌套聚合1,按照工作類型分桶,並統計工資信息 POST employees/_search { "size": 0, "aggs": { "Job_salary_stats": { "terms": { "field": "job.keyword" }, "aggs": { "salary": { "stats": { "field": "salary" } } } } } } #查詢結果 { "took" : 9, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "Job_salary_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Java Programmer", "doc_count" : 7, "salary" : { "count" : 7, "min" : 9000.0, "max" : 38000.0, "avg" : 25571.428571428572, "sum" : 179000.0 } }, { "key" : "Javascript Programmer", "doc_count" : 4, "salary" : { "count" : 4, "min" : 16000.0, "max" : 25000.0, "avg" : 19250.0, "sum" : 77000.0 } }, { "key" : "QA", "doc_count" : 3, "salary" : { "count" : 3, "min" : 18000.0, "max" : 25000.0, "avg" : 21000.0, "sum" : 63000.0 } }, { "key" : "DBA", "doc_count" : 2, "salary" : { "count" : 2, "min" : 20000.0, "max" : 30000.0, "avg" : 25000.0, "sum" : 50000.0 } }, { "key" : "Web Designer", "doc_count" : 2, "salary" : { "count" : 2, "min" : 18000.0, "max" : 22000.0, "avg" : 20000.0, "sum" : 40000.0 } }, { "key" : "Dev Manager", "doc_count" : 1, "salary" : { "count" : 1, "min" : 50000.0, "max" : 50000.0, "avg" : 50000.0, "sum" : 50000.0 } }, { "key" : "Product Manager", "doc_count" : 1, "salary" : { "count" : 1, "min" : 35000.0, "max" : 35000.0, "avg" : 35000.0, "sum" : 35000.0 } } ] } } }
# 多次嵌套。根據工作類型分桶,然后按照性別分桶,計算工資的統計信息 POST employees/_search { "size": 0, "aggs": { "Job_gender_stats": { "terms": { "field": "job.keyword" }, "aggs": { "gender_stats": { "terms": { "field": "gender" }, "aggs": { "salary_stats": { "stats": { "field": "salary" } } } } } } } } #查詢結果 { "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 20, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "Job_gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "Java Programmer", "doc_count" : 7, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 5, "salary_stats" : { "count" : 5, "min" : 9000.0, "max" : 32000.0, "avg" : 22200.0, "sum" : 111000.0 } }, { "key" : "female", "doc_count" : 2, "salary_stats" : { "count" : 2, "min" : 30000.0, "max" : 38000.0, "avg" : 34000.0, "sum" : 68000.0 } } ] } }, { "key" : "Javascript Programmer", "doc_count" : 4, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 3, "salary_stats" : { "count" : 3, "min" : 16000.0, "max" : 25000.0, "avg" : 19000.0, "sum" : 57000.0 } }, { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 20000.0, "max" : 20000.0, "avg" : 20000.0, "sum" : 20000.0 } } ] } }, { "key" : "QA", "doc_count" : 3, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 2, "salary_stats" : { "count" : 2, "min" : 18000.0, "max" : 25000.0, "avg" : 21500.0, "sum" : 43000.0 } }, { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 20000.0, "max" : 20000.0, "avg" : 20000.0, "sum" : 20000.0 } } ] } }, { "key" : "DBA", "doc_count" : 2, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, 
"buckets" : [ { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 20000.0, "max" : 20000.0, "avg" : 20000.0, "sum" : 20000.0 } }, { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 30000.0, "max" : 30000.0, "avg" : 30000.0, "sum" : 30000.0 } } ] } }, { "key" : "Web Designer", "doc_count" : 2, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 22000.0, "max" : 22000.0, "avg" : 22000.0, "sum" : 22000.0 } }, { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 18000.0, "max" : 18000.0, "avg" : 18000.0, "sum" : 18000.0 } } ] } }, { "key" : "Dev Manager", "doc_count" : 1, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "male", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 50000.0, "max" : 50000.0, "avg" : 50000.0, "sum" : 50000.0 } } ] } }, { "key" : "Product Manager", "doc_count" : 1, "gender_stats" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "female", "doc_count" : 1, "salary_stats" : { "count" : 1, "min" : 35000.0, "max" : 35000.0, "avg" : 35000.0, "sum" : 35000.0 } } ] } } ] } } }
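Pipeline Aggregations
The pipeline concept: run a further aggregation on the results of another aggregation
The result of a pipeline aggregation is added to the original result. Depending on where it appears, there are two kinds:
- sibling: the result sits at the same level as the existing aggregation
  - min, max, avg, sum bucket
  - stats, extended stats bucket
  - percentiles bucket
- parent: the result is embedded in the buckets of the existing aggregation
  - derivative
  - cumulative sum
  - moving function (sliding window)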
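The difference between sibling and parent pipelines is easiest to see on concrete numbers. The sketch below recomputes the average salary per job from the sample employees, then applies a sibling-style minimum and a parent-style running total. It is a plain-Python illustration: the variable names are hypothetical, and the running total follows the bucket order of this dictionary rather than a histogram.

```python
from itertools import accumulate

# Salaries of the sample employees grouped by job (the "terms" buckets).
salaries_by_job = {
    "Java Programmer": [20000, 22000, 9000, 38000, 32000, 30000, 28000],
    "Javascript Programmer": [25000, 16000, 16000, 20000],
    "QA": [18000, 25000, 20000],
    "DBA": [30000, 20000],
    "Web Designer": [18000, 22000],
    "Dev Manager": [50000],
    "Product Manager": [35000],
}

# Sub-aggregation: one avg_salary metric per bucket.
buckets = [
    {"key": job, "avg_salary": sum(values) / len(values)}
    for job, values in salaries_by_job.items()
]

# Sibling pipeline: a single new value next to the buckets.
lowest = min(b["avg_salary"] for b in buckets)
min_salary_by_job = {
    "value": lowest,
    "keys": [b["key"] for b in buckets if b["avg_salary"] == lowest],
}

# Parent pipeline: a new value embedded in every bucket.
running = accumulate(b["avg_salary"] for b in buckets)
for bucket, total in zip(buckets, running):
    bucket["cumulative_salary"] = total

print(min_salary_by_job)
for bucket in buckets:
    print(bucket)
```

The sibling result is one object beside the `jobs` buckets, while the parent result adds a field to each bucket, which mirrors where the two kinds of pipeline output appear in a search response.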
# 平均工資最低的工作類型 POST employees/_search { "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "min_salary_by_job":{ "min_bucket": { "buckets_path": "jobs>avg_salary" } } } } # 平均工資最高的工作類型 POST employees/_search { "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "max_salary_by_job":{ "max_bucket": { "buckets_path": "jobs>avg_salary" } } } } # 平均工資的平均工資 POST employees/_search { "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "avg_salary_by_job":{ "avg_bucket": { "buckets_path": "jobs>avg_salary" } } } } # 平均工資的統計分析 POST employees/_search { "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "stats_salary_by_job":{ "stats_bucket": { "buckets_path": "jobs>avg_salary" } } } } # 平均工資的百分位數 POST employees/_search { "size": 0, "aggs": { "jobs": { "terms": { "field": "job.keyword", "size": 10 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } } } }, "percentiles_salary_by_job":{ "percentiles_bucket": { "buckets_path": "jobs>avg_salary" } } } } #按照年齡對平均工資求導 POST employees/_search { "size": 0, "aggs": { "age": { "histogram": { "field": "age", "min_doc_count": 1, "interval": 1 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } }, "derivative_avg_salary":{ "derivative": { "buckets_path": "avg_salary" } } } } } } #Cumulative_sum POST employees/_search { "size": 0, "aggs": { "age": { "histogram": { "field": "age", "min_doc_count": 1, "interval": 1 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } }, "cumulative_salary":{ "cumulative_sum": { "buckets_path": "avg_salary" } } } } } } #Moving Function POST employees/_search { "size": 0, "aggs": { "age": { "histogram": { "field": "age", "min_doc_count": 1, 
"interval": 1 }, "aggs": { "avg_salary": { "avg": { "field": "salary" } }, "moving_avg_salary":{ "moving_fn": { "buckets_path": "avg_salary", "window":10, "script": "MovingFunctions.min(values)" } } } } } }
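Scope and Sorting
By default an aggregation runs on the result set of the query
ES also supports the following ways of changing the scope of an aggregation:
- Filter
- Post Filter
- Global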
# Scope
# Scope set by the query
POST employees/_search
{
  "size": 0,
  "query": { "range": { "age": { "gte": 20 } } },
  "aggs": {
    "jobs": { "terms": { "field": "job.keyword" } }
  }
}
# Scope set by a filter aggregation
POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter": { "range": { "age": { "from": 35 } } },
      "aggs": {
        "jobs": { "terms": { "field": "job.keyword" } }
      }
    },
    "all_jobs": { "terms": { "field": "job.keyword" } }
  }
}
# post_filter: one request returns all job types in the aggregation, while the hits are filtered after aggregating
POST employees/_search
{
  "aggs": {
    "jobs": { "terms": { "field": "job.keyword" } }
  },
  "post_filter": { "match": { "job.keyword": "Dev Manager" } }
}
# global: the aggregation ignores the query and runs on all documents
POST employees/_search
{
  "size": 0,
  "query": { "range": { "age": { "gte": 40 } } },
  "aggs": {
    "jobs": { "terms": { "field": "job.keyword" } },
    "all": {
      "global": {},
      "aggs": {
        "salary_avg": { "avg": { "field": "salary" } }
      }
    }
  }
}
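Sorting:
Specify order to sort the buckets by count and key
- By default buckets are sorted by count in descending order
- Specify size to control how many buckets are returned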
# Sorting with order: by count, then by key
POST employees/_search
{
  "size": 0,
  "query": { "range": { "age": { "gte": 20 } } },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [ { "_count": "asc" }, { "_key": "desc" } ]
      }
    }
  }
}
# Sorting with order: by the value of the avg_salary sub-aggregation
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [ { "avg_salary": "desc" } ]
      },
      "aggs": {
        "avg_salary": { "avg": { "field": "salary" } }
      }
    }
  }
}
# Sorting with order: by one value (min) of the multi-value stats_salary sub-aggregation
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [ { "stats_salary.min": "desc" } ]
      },
      "aggs": {
        "stats_salary": { "stats": { "field": "salary" } }
      }
    }
  }
}
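UpdateByQuery & Reindex
Use cases:
Reindexing is generally needed when
- the index mapping changes: field types, analyzers, or dictionary updates
- the index settings change: the number of primary shards changes
- data has to be migrated within a cluster or between clusters
APIs built into ES
- UpdateByQuery: rebuilds on the existing index
- Reindex: rebuilds into another index
Case 1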
# Rebuilding an index
DELETE blogs/
# Index a document
PUT blogs/_doc/1
{
  "content": "Hadoop is cool",
  "keyword": "hadoop"
}
# Check the mapping
GET blogs/_mapping
# Change the mapping: add a sub-field that uses the english analyzer
PUT blogs/_mapping
{
  "properties" : {
    "content" : {
      "type" : "text",
      "fields" : {
        "english" : { "type" : "text", "analyzer": "english" }
      }
    }
  }
}
# Index another document
PUT blogs/_doc/2
{
  "content": "Elasticsearch rocks",
  "keyword": "elasticsearch"
}
# Search for the newly indexed document
POST blogs/_search
{
  "query": { "match": { "content.english": "Elasticsearch" } }
}
# Search for the document indexed before the mapping change
POST blogs/_search
{
  "query": { "match": { "content.english": "Hadoop" } }
}
# Update all documents
POST blogs/_update_by_query
{
}
# After update_by_query, search again for the document indexed earlier
POST blogs/_search
{
  "query": { "match": { "content.english": "Hadoop" } }
}
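Case 2: changing the mapping of an existing field
- ES does not allow the type of an existing field to be changed in place
- The only option is to create a new index with the correct field types and import the data again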
# 查詢 GET blogs/_mapping #結果查詢,我們看keyword 的字段類型是Text { "blogs" : { "mappings" : { "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer" : "english" }, "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "keyword" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } } } } } } #嘗試修改類型,報錯,ES不允許對已有字段進行修改 PUT blogs/_mapping { "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer" : "english" } } }, "keyword" : { "type" : "keyword" } } } # 創建新的索引並且設定新的Mapping PUT blogs_fix/ { "mappings": { "properties" : { "content" : { "type" : "text", "fields" : { "english" : { "type" : "text", "analyzer" : "english" } } }, "keyword" : { "type" : "keyword" } } } } # Reindx API POST _reindex { "source": { "index": "blogs" }, "dest": { "index": "blogs_fix" } } #查看新索引 GET blogs_fix/_doc/1 #查詢結果 { "_index" : "blogs_fix", "_type" : "_doc", "_id" : "1", "_version" : 1, "_seq_no" : 0, "_primary_term" : 1, "found" : true, "_source" : { "content" : "Hadoop is cool", "keyword" : "hadoop" } } # 測試 Term Aggregation POST blogs_fix/_search { "size": 0, "aggs": { "blog_keyword": { "terms": { "field": "keyword", "size": 10 } } } } #我們修改成keyword類型,只有keyword 才能Term Aggregation #查詢結果 { "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "blog_keyword" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ { "key" : "elasticsearch", "doc_count" : 1 }, { "key" : "hadoop", "doc_count" : 1 } ] } } }
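Reindex Summary
The Reindex API copies documents from one index into another
When to use the Reindex API:
- Changing the number of primary shards of an index
- Changing the mapped type of a field
- Migrating data within a cluster or to another cluster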
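Ingest Pipeline & Painless Script
Ingest Node
A node type introduced in ES 5.0; with the default configuration every node is an Ingest Node
- It can pre-process data by intercepting Index and Bulk API requests
- It transforms the data and hands it back to the Index or Bulk API
Data can be pre-processed without Logstash, for example:
- Set a default value for a field; rename a field; split a field
- Painless scripts are supported for more complex processing
Demo
Create a document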
# Blog data with 3 fields; tags are separated by commas
PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You konw, for big data"
}
# Simulate a pipeline that splits tags on commas
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      { "split": { "field": "tags", "separator": "," } }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You konw, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
# Additionally add a field to the document: the number of blog views
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      { "split": { "field": "tags", "separator": "," } },
      { "set": { "field": "views", "value": 0 } }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You konw, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
The requests above only simulate a pipeline for testing. Once it works, create the pipeline in ES
PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
    { "split": { "field": "tags", "separator": "," } },
    { "set": { "field": "views", "value": 0 } }
  ]
}
# View the pipeline
GET _ingest/pipeline/blog_pipeline
# Test the stored pipeline; only the array of documents has to be supplied
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
# Test 2: clear the index
DELETE tech_blogs
# Index a document without the pipeline
PUT tech_blogs/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "content": "You konw, for big data"
}
# Index a document through the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You konw, for cloud"
}
# View both documents: one was processed, the other was not
POST tech_blogs/_search
{}
# update_by_query fails here, because document 2 has already been processed by the pipeline
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
}
# Add a condition to update_by_query so that only unprocessed documents are selected
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "views" } }
    }
  }
}
# Search again; this time document 1 has been processed by the pipeline as well
POST tech_blogs/_search
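Some Built-in Processors
- Split: splits a field into an array
- Remove / Rename: removes or renames a field
- Append: appends a value to a field, for example a new tag
- Convert: converts a field's type, for example from string to float
- Date / JSON: parses a date format; converts a string to a JSON object
- Date Index Name: routes documents passing through the processor to an index named by a date pattern
- Fail: when an error occurs, returns the error message defined in the pipeline to the user
- Foreach: for an array field, applies the same processor to every element
- Grok: parses log lines by pattern
- Gsub / Join / Split: string replacement; array to string; string to array
- Lowercase / Uppercase: case conversion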
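A pipeline is an ordered list of processors applied to each document. The toy model below mimics the Split and Set processors from the list in plain Python; the function names are invented for this sketch and it does not call Elasticsearch.

```python
def split(field, separator):
    """Toy Split processor: turn a string field into a list."""
    def processor(doc):
        doc[field] = doc[field].split(separator)
        return doc
    return processor


def set_value(field, value):
    """Toy Set processor: assign a fixed value to a field."""
    def processor(doc):
        doc[field] = value
        return doc
    return processor


def run_pipeline(processors, doc):
    """Apply the processors in order to a copy of the document."""
    doc = dict(doc)
    for processor in processors:
        doc = processor(doc)
    return doc


blog_pipeline = [split("tags", ","), set_value("views", 0)]

raw = {"title": "Introducing big data......", "tags": "hadoop,elasticsearch,spark"}
processed = run_pipeline(blog_pipeline, raw)
print(processed)

# Running the pipeline again on an already processed document fails,
# because "tags" is now a list and can no longer be split as a string.
try:
    run_pipeline(blog_pipeline, processed)
except AttributeError as err:
    print("second run failed:", err)
```

The failing second run is the same situation as an update_by_query that sends already processed documents through the pipeline, which is why a query such as "views does not exist" is used to select only unprocessed documents.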
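Painless
- Introduced with ES 5.x and designed specifically for ES; it extends Java syntax
- Since 6.0 ES supports only Painless; Groovy, JavaScript, and Python are no longer supported
- Painless supports all Java data types and a subset of the Java API
- Painless scripts have the following properties:
  - High performance / secure
  - Support for explicit types or dynamically defined types
Uses of Painless:
Processing document fields
- Updating or deleting fields, and processing data in aggregations
- Script Field: computes values for returned fields
- Function Score: adjusts the score of documents
Running scripts in an Ingest Pipeline
Processing data during Reindex API and Update By Query operations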
#########Demo for Painless############### # 增加一個 Script Prcessor POST _ingest/pipeline/_simulate { "pipeline": { "description": "to split blog tags", "processors": [ { "split": { "field": "tags", "separator": "," } }, { "script": { "source": """ if(ctx.containsKey("content")){ ctx.content_length = ctx.content.length(); }else{ ctx.content_length=0; } """ } }, { "set":{ "field": "views", "value": 0 } } ] }, "docs": [ { "_index":"index", "_id":"id", "_source":{ "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data" } }, { "_index":"index", "_id":"idxx", "_source":{ "title":"Introducing cloud computering", "tags":"openstack,k8s", "content":"You konw, for cloud" } } ] } DELETE tech_blogs PUT tech_blogs/_doc/1 { "title":"Introducing big data......", "tags":"hadoop,elasticsearch,spark", "content":"You konw, for big data", "views":0 } POST tech_blogs/_update/1 { "script": { "source": "ctx._source.views += params.new_views", "params": { "new_views":100 } } } # 查看views計數 POST tech_blogs/_search { } #保存腳本在 Cluster State POST _scripts/update_views { "script":{ "lang": "painless", "source": "ctx._source.views += params.new_views" } } POST tech_blogs/_update/1 { "script": { "id": "update_views", "params": { "new_views":1000 } } } GET tech_blogs/_search { "script_fields": { "rnd_views": { "script": { "lang": "painless", "source": """ java.util.Random rnd = new Random(); doc['views'].value+rnd.nextInt(1000); """ } } }, "query": { "match_all": {} } }
