Elasticsearch 7: Distributed Architecture and the Distributed Search Mechanism


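Distributed Characteristics

Benefits of Elasticsearch's distributed design:

  • Storage scales horizontally
  • Higher availability: if some nodes stop serving, the cluster as a whole keeps working

Elasticsearch's distributed architecture

  • Clusters are distinguished by their cluster name; the default is "elasticsearch"
  • The name can be changed in the configuration file, or set on the command line with -E cluster.name="ops-es"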

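Nodes

A node is a single Elasticsearch instance:

  • In essence it is a Java process
  • One machine can run several Elasticsearch processes, but in production it is generally recommended to run only one Elasticsearch instance per machine

Every node has a name, set in the configuration file or at startup with -E node.name=es01

After a node starts, it generates a UID and stores it in the data directory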

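Coordinating Node

The node that receives and handles a request is called the Coordinating Node

  • It routes the request to the right node; for example, a create-index request is routed to the master node

Every node is a Coordinating Node by default

Setting all the other node roles to false turns a node into a dedicated Coordinating Node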

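Data Node

A node that can store data is called a Data Node

  • A node is a data node by default after startup; set node.data: false to disable this

Responsibilities of a Data Node

  • It stores shard data and is essential for scaling data (the Master Node decides how shards are distributed across the data nodes)

Adding data nodes

  • Provides horizontal data scaling and removes the single point of failure for data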

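Master Node

Responsibilities of the Master Node

  • Handles requests such as creating and deleting indices, and decides which node each shard is allocated to
  • Maintains and updates the cluster state

Master Node best practices

  • The master node is critical, so a deployment must avoid making it a single point of failure
  • Configure several master-eligible nodes per cluster, and let each of them carry only the master role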

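Cluster State

The cluster state holds the information a cluster needs in order to operate:

  • Information about all nodes
  • All indices together with their mappings and settings
  • Shard routing information

Every node keeps a copy of the cluster state

However, only the master node may modify the cluster state, and it is responsible for publishing changes to the other nodes

  • If any node could modify it, the cluster state would become inconsistent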

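Master-Eligible Nodes & the Election Process

Master-eligible nodes ping one another, and the node with the lowest Node ID becomes the elected master

The other nodes join the cluster without taking on the master role; as soon as they detect that the elected master is lost, they elect a new master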

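The Split-Brain Problem

Split-brain is a classic network problem in distributed systems: when the network fails, one node can no longer reach the others. For example, if Node1 is the master and loses its connection to Node2 and Node3:

  • Node2 and Node3 elect a new master
  • Node1 still acts as master, forms a cluster by itself, and keeps updating its cluster state
  • The result is two masters maintaining different cluster states; when the network recovers, there is no way to choose the correct one to restore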

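How to Avoid Split-Brain

Restrict when an election may happen by setting a quorum: an election can only take place when the number of reachable master-eligible nodes is at least the quorum

  • quorum = (number of master-eligible nodes / 2) + 1, using integer division
  • With 3 master-eligible nodes, set discovery.zen.minimum_master_nodes to 2 to avoid split-brain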
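The quorum rule can be checked with a few lines of arithmetic. The sketch below is only an illustration of the formula; the function names quorum and can_elect are made up for this example and are not part of any Elasticsearch API.

```python
def quorum(master_eligible: int) -> int:
    """Minimum number of master-eligible nodes that must agree: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect(visible_master_eligible: int, total_master_eligible: int) -> bool:
    """A partition may elect a master only if it sees at least a quorum of nodes."""
    return visible_master_eligible >= quorum(total_master_eligible)


if __name__ == "__main__":
    # 3 master-eligible nodes split by a network fault into groups of 1 and 2.
    print(quorum(3))        # 2
    print(can_elect(1, 3))  # False: the isolated node cannot act as master
    print(can_elect(2, 3))  # True: the majority side may elect a master
```

Because two disjoint partitions can never both hold a majority, at most one side is able to elect a master.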

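Since 7.0 this setting is no longer needed

  • The minimum_master_nodes parameter was removed; Elasticsearch itself chooses the nodes that can form a quorum
  • A typical master election now completes in a very short time. Scaling the cluster in and out is safer and easier, and there are fewer configuration options that can lead to data loss
  • Nodes record their state more clearly, which helps diagnose why they cannot join a cluster or why no master can be elected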

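Primary Shard

Shards are the cornerstone of Elasticsearch's distributed storage

  • Primary shards / replica shards

Primary shards spread the data across all nodes

  • Primary shards let the data of one index be distributed over several Data Nodes, which gives horizontal storage scaling
  • The number of primary shards is set when the index is created and cannot be changed afterwards by default; changing it requires reindexing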

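Choosing the Number of Shards

How to plan the number of primary and replica shards for an index

  • Too few primary shards: for example, an index created with 1 primary shard
    • If the index grows quickly, the cluster cannot scale its data by adding nodes
  • Too many primary shards: each shard holds very little data and a node ends up with too many shards, which hurts performance
  • Too many replica shards reduce the overall write performance of the cluster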

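Documents Are Stored on Shards

A document is stored on one specific primary shard and its replica shards; for example, document 1 is stored on shards P0 and R0

The document-to-shard mapping algorithm:

  • It should spread documents evenly over all shards so the hardware is fully used and no machine sits idle while others are busy
  • Candidate algorithms
    • Random / round robin: to find document 1 when there are many shards, several shards may have to be queried before it is found
    • Maintain a document-to-shard mapping table: costly to maintain when the number of documents is large
    • Compute it on the fly: derive from document 1 itself which shard it has to be read from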

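The Document-to-Shard Routing Algorithm

shard = hash(_routing) % number_of_primary_shards

  • The hash function ensures that documents are spread evenly across the shards
  • The default _routing value is the document id
  • A custom _routing value can be supplied, for example to send all products from the same country to the same shard
  • This formula is the fundamental reason why the number of primary shards cannot be changed freely once the index settings are in place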
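The effect of the routing formula can be illustrated with a small sketch. It assumes zlib.crc32 as a stand-in hash, which is not the hash Elasticsearch uses internally, and the helper names shard_for and moved_documents are hypothetical.

```python
import zlib


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def moved_documents(doc_ids, old_shards: int, new_shards: int):
    """Return the ids whose target shard changes when the primary shard count changes."""
    return [d for d in doc_ids if shard_for(d, old_shards) != shard_for(d, new_shards)]


if __name__ == "__main__":
    ids = [str(i) for i in range(1, 21)]
    for doc_id in ids[:5]:
        # The same id always maps to the same shard while the shard count is fixed.
        print(doc_id, "->", shard_for(doc_id, 3))
    # With a different primary shard count many ids map to another shard,
    # so a lookup by id would go to the wrong shard unless the data is reindexed.
    print(len(moved_documents(ids, 3, 4)), "of", len(ids), "ids would move")
```

The modulo by the primary shard count is what ties every stored document to the count chosen at index creation.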

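How Shards Work Internally

What is an ES shard

  • The smallest unit of work in ES: a Lucene index

Some questions:

  • Why is ES search near real time
  • How does ES make sure data is not lost on a power failure
  • Why does deleting a document not free space immediately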

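Immutability of the Inverted Index

  • The inverted index follows an immutable design: once generated, it cannot be changed
  • Benefits of immutability:
    • No need to worry about concurrent writes to a file, which avoids the performance cost of locking
    • Once the data is in the kernel's file system cache, it stays there. As long as the file system cache has enough space, most requests are served from memory and never hit the disk, which improves performance considerably
    • Caches are easy to build and maintain, and the data can be compressed
  • The challenge of immutability: to make a new document searchable, the whole index would have to be rebuilt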

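Lucene Index

  • In Lucene, a single inverted index file is called a Segment. Segments are self-contained and immutable; several Segments together make up a Lucene Index, which corresponds to a Shard in ES
  • When new documents are written, a new Segment is generated. A query searches all Segments and merges the results. Lucene keeps a file that records all Segments, called the Commit Point
  • Information about deleted documents is stored in the ".del" file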

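What Is Refresh

  • Writing the Index Buffer into a Segment is called Refresh. Refresh does not perform an fsync
  • Refresh frequency: once per second by default, configurable through index.refresh_interval. After a Refresh the data becomes searchable, which is why Elasticsearch search is near real time
  • If the system receives a large volume of writes, many Segments are produced
  • A Refresh is also triggered when the Index Buffer fills up; its default size is 10% of the JVM heap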

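What Is the Transaction Log

  • Writing a Segment to disk is relatively slow, so Refresh relies on the file system cache: the Segment is first written to the cache so that it can be searched
  • To make sure data is not lost, every document that is indexed is also written to the Transaction Log. In recent versions the Transaction Log is persisted to disk by default. Each shard has its own Transaction Log
  • On an ES Refresh the Index Buffer is cleared, but the Transaction Log is not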

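What Is Flush

ES Flush & Lucene Commit

  • Call Refresh, which clears the Index Buffer
  • Call fsync to write the Segments in the cache to disk
  • Clear the Transaction Log
  • Triggered every 30 minutes by default
  • Also triggered when the Transaction Log is full (512 MB by default)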

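What Is Merge

  • Segments accumulate and need to be merged periodically
    • Merging reduces the number of Segments and physically removes documents that were marked as deleted
  • ES and Lucene run Merge operations automatically
    • A merge can also be forced with POST my_index/_forcemerge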
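The interplay of Index Buffer, Refresh, delete markers and Merge can be pictured with a toy model. This is a deliberately simplified sketch and not how Lucene is implemented: the class ToyShard and all of its methods are invented for illustration, and updates to an existing id are not modelled.

```python
class ToyShard:
    """Toy model of a shard: an index buffer plus immutable segments."""

    def __init__(self):
        self.buffer = {}      # documents indexed but not yet searchable
        self.segments = []    # each segment is a frozen tuple of (id, doc) pairs
        self.deleted = set()  # stands in for the ".del" file

    def index(self, doc_id, doc):
        self.buffer[doc_id] = doc

    def refresh(self):
        """Turn the buffer into a new immutable segment and clear the buffer."""
        if self.buffer:
            self.segments.append(tuple(self.buffer.items()))
            self.buffer = {}

    def delete(self, doc_id):
        """Segments are immutable, so a delete only records a marker."""
        self.deleted.add(doc_id)

    def search_ids(self):
        """Search every segment and skip documents marked as deleted."""
        return sorted(i for seg in self.segments for i, _ in seg if i not in self.deleted)

    def stored_docs(self):
        """Number of documents still occupying space in segments."""
        return sum(len(seg) for seg in self.segments)

    def merge(self):
        """Rewrite all segments into one, dropping deleted documents for real."""
        live = tuple((i, d) for seg in self.segments for i, d in seg if i not in self.deleted)
        self.segments = [live] if live else []
        self.deleted = set()


if __name__ == "__main__":
    shard = ToyShard()
    shard.index("1", {"name": "Emma"})
    print(shard.search_ids())                       # []: not searchable before a refresh
    shard.refresh()
    shard.index("2", {"name": "Underwood"})
    shard.refresh()
    shard.delete("1")
    print(shard.search_ids(), shard.stored_docs())  # ['2'] 2: space is not freed yet
    shard.merge()
    print(shard.search_ids(), shard.stored_docs())  # ['2'] 1
```

The model answers two of the questions raised earlier: documents become visible only after a refresh, and a delete frees space only once a merge rewrites the segments.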

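Distributed Search Mechanism

An Elasticsearch search runs in two phases:

Phase 1: Query

  • The user sends a search request to an ES node. On receiving it, the node acts as the Coordinating Node, randomly picks 3 of the 6 primary and replica shards (one copy of each shard in an index with 3 primaries and 1 replica each), and sends them the query
  • The selected shards run the query and sort the results. Each shard then returns the ids and sort values of its top From + Size documents to the Coordinating Node

Phase 2: Fetch

  • The Coordinating Node re-sorts the sorted document id lists it received from every shard in the Query phase and selects the ids of the documents from From to From + Size
  • It then retrieves the full documents from the relevant shards with a multi get request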
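The two phases can be simulated in memory. The sketch below is a simplified model under stated assumptions: shards are plain Python lists, scores are given up front, and the names query_phase and fetch_phase are hypothetical rather than Elasticsearch functions.

```python
import heapq


def query_phase(shard_docs, from_, size):
    """Each shard returns only (score, id) for its own top from_ + size documents."""
    top = heapq.nlargest(from_ + size, shard_docs, key=lambda d: d["score"])
    return [(d["score"], d["id"]) for d in top]


def fetch_phase(shards, from_, size):
    """Coordinating step: merge the per-shard lists, re-sort, then fetch by id."""
    candidates = []
    for docs in shards:
        candidates.extend(query_phase(docs, from_, size))
    # Highest score first; the id breaks ties so the order is deterministic.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    wanted = [doc_id for _, doc_id in candidates[from_:from_ + size]]
    by_id = {d["id"]: d for docs in shards for d in docs}  # stands in for multi get
    return [by_id[i] for i in wanted]


if __name__ == "__main__":
    shards = [
        [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.2}],
        [{"id": "c", "score": 0.7}, {"id": "d", "score": 0.5}],
        [{"id": "e", "score": 0.8}, {"id": "f", "score": 0.1}],
    ]
    # Global order by score is a, e, c, d, b, f; from=1, size=2 selects e and c.
    print([d["id"] for d in fetch_phase(shards, from_=1, size=2)])  # ['e', 'c']
```

Only ids and sort values travel in the first phase; the full documents are read once the final page is known.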

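Potential Problems with Query Then Fetch

Performance:

  • Each shard has to retrieve From + Size documents
  • The coordinating node finally has to process number_of_shards * (From + Size) documents
  • Deep pagination makes both numbers grow

Relevance scoring

  • Each shard computes relevance scores from its own data only, so the scores on different shards are independent of one another. This can skew the scores, especially when there is little data: with few documents and more than one primary shard, the more primary shards there are, the less accurate the relevance scores become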

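Pagination & Traversal

  • From: the starting position
  • Size: the number of documents to return

ES is distributed by nature: the data for a query lives in several shards on several machines, yet the results still have to be returned in sorted order (by relevance score)

For a query with From=990, Size=10

  • Each shard retrieves 1000 documents. The Coordinating Node then aggregates all the results and finally sorts them to select the top 1000 documents, of which the last 10 are returned
  • The deeper the page, the more memory is used. To limit the memory cost of deep pagination, ES has a setting (index.max_result_window) that caps From + Size at 10000 documents by default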
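The cost of deep pagination follows directly from the From + Size rule. The sketch below just evaluates that arithmetic; the shard count of 5 is an assumed value for illustration and deep_paging_cost is a hypothetical helper.

```python
def deep_paging_cost(from_: int, size: int, number_of_shards: int):
    """Return (documents per shard, documents handled by the coordinating node)."""
    per_shard = from_ + size
    return per_shard, number_of_shards * per_shard


if __name__ == "__main__":
    # Assumed index with 5 shards; only the last `size` documents are returned.
    for from_ in (0, 990, 9990):
        per_shard, total = deep_paging_cost(from_, 10, 5)
        print(f"from={from_:>5} size=10 -> {per_shard} per shard, {total} at the coordinator")
```

For From=990 and Size=10 this gives 1000 documents per shard and 5000 at the coordinating node, all to return 10 documents.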

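Search After: Avoiding the Deep Pagination Problem

  • Avoids the performance problem of deep pagination and fetches the next page of documents in real time
    • Specifying a page offset (From) is not supported
    • Paging only goes forward
  • The first search must specify a sort, and the sort values must be unique (adding _id to the sort guarantees uniqueness)
  • Each following search then passes the sort values of the last document from the previous page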
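The paging logic of Search After can be mimicked on an in-memory list. This is only a sketch of the idea, assuming a sort by age descending with the document id as the unique tiebreaker; search_page is a hypothetical helper and no Elasticsearch client is involved.

```python
def search_page(docs, size, search_after=None):
    """Return one page sorted by (age desc, id asc), starting after `search_after`."""
    def sort_key(doc):
        return (-doc["age"], doc["id"])

    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        age, doc_id = search_after
        # Keep only documents that sort strictly after the last one already seen.
        ordered = [d for d in ordered if sort_key(d) > (-age, doc_id)]
    page = ordered[:size]
    last_sort = (page[-1]["age"], page[-1]["id"]) if page else None
    return page, last_sort


if __name__ == "__main__":
    employees = [
        {"id": "1", "name": "Emma", "age": 32},
        {"id": "2", "name": "Underwood", "age": 41},
        {"id": "3", "name": "Tran", "age": 25},
        {"id": "4", "name": "Rivera", "age": 26},
        {"id": "5", "name": "Rose", "age": 25},
    ]
    cursor = None
    while True:
        page, cursor = search_page(employees, size=2, search_after=cursor)
        if not page:
            break
        print([d["name"] for d in page], "next search_after:", cursor)
```

Each request carries only the sort values of the last hit, so the work per page stays at Size documents no matter how far the traversal has gone. The tiebreaker matters: Tran and Rose share the same age, and without the id one of them could be skipped or repeated.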

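Bucket & Metric Aggregations and Nested Aggregations

  • Metric: a set of statistical calculations
  • Bucket: a group of documents that satisfy a condition

Metric Aggregation

Single-value analysis

  • max, min, avg, sum
  • cardinality (similar to distinct count)

Multi-value analysis

  • stats, extended stats
  • percentiles, percentile ranks
  • top hits

Demo

Sample data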

# Define the index for the employees table
PUT /employees/ 
{
  "mappings":{
    "properties":{
      "age":{
        "type": "integer"
      },
      "gender":{
        "type": "keyword"
      },
      "job":{
        "type": "text",
        "fields":{
          "keyword": {
            "type": "keyword",
            "ignore_above": 50
          }
        }
      },
      "name":{
        "type": "keyword"
      },
      "salary":{
        "type" : "integer"
      }
    }
  }
}
# Insert the data
PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}

Test Examples

# Metric aggregation: find the lowest salary
POST employees/_search
{
  "size":0,
  "aggs": {
    "min_salary": {
      "min": {
        "field": "salary"
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "min_salary" : {
      "value" : 9000.0
    }
  }
}
# Metric aggregation: find the highest salary
POST employees/_search
{
  "size":0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_salary" : {
      "value" : 50000.0
    }
  }
}
# Several metric aggregations: find the lowest, highest and average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    },
    "min_salary": {
      "min": {
        "field": "salary"
      }
    },
    "avg_salary": {
      "avg": {
        "field": "salary"
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "max_salary" : {
      "value" : 50000.0
    },
    "avg_salary" : {
      "value" : 24700.0
    },
    "min_salary" : {
      "value" : 9000.0
    }
  }
}
# One aggregation with multi-value output: statistics
POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": {
      "stats": {
        "field":"salary"
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "stats_salary" : {
      "count" : 20,
      "min" : 9000.0,
      "max" : 50000.0,
      "avg" : 24700.0,
      "sum" : 494000.0
    }
  }
}

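Bucket Aggregations

Documents are assigned to different buckets according to some rule, which classifies them. ES provides the common Bucket Aggregations:

  • Terms
  • Numeric types
    • Range / Date Range
    • Histogram / Date Histogram
  • Nesting is supported (buckets inside buckets)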

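Terms Aggregation

  • Whether a field can be used in a Terms Aggregation depends on its type
    • keyword fields support Terms Aggregation by default
    • text fields need fielddata enabled in the mapping first; the buckets then follow the analyzed terms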

 

# Aggregate on the keyword sub-field of job
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4
        },
        {
          "key" : "QA",
          "doc_count" : 3
        },
        {
          "key" : "DBA",
          "doc_count" : 2
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1
        }
      ]
    }
  }
}

Aggregating on a text field requires the fielddata option to be enabled

# Enable fielddata on the text field so that it supports terms aggregation
PUT employees/_mapping
{
  "properties" : {
    "job":{
       "type":     "text",
       "fielddata": true
    }
  }
}
# Terms aggregation on the text field: the buckets are the analyzed terms
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job"
      }
    }
  }
}
# Query result; it differs from the keyword result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "programmer",
          "doc_count" : 11
        },
        {
          "key" : "java",
          "doc_count" : 7
        },
        {
          "key" : "javascript",
          "doc_count" : 4
        },
        {
          "key" : "qa",
          "doc_count" : 3
        },
        {
          "key" : "dba",
          "doc_count" : 2
        },
        {
          "key" : "designer",
          "doc_count" : 2
        },
        {
          "key" : "manager",
          "doc_count" : 2
        },
        {
          "key" : "web",
          "doc_count" : 2
        },
        {
          "key" : "dev",
          "doc_count" : 1
        },
        {
          "key" : "product",
          "doc_count" : 1
        }
      ]
    }
  }
}

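Counting the distinct terms

# Terms aggregations on job.keyword and on job produce a different total number of buckets; cardinality counts the distinct values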
POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": {
      "cardinality": {
        "field": "job.keyword"
      }
    }
  }
}
# Query result
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "cardinate" : {
      "value" : 7
    }
  }
}

Bucketing by gender

# Aggregate on gender, which is a keyword field
POST employees/_search
{
  "size": 0,
  "aggs": {
    "gender": {
      "terms": {
        "field":"gender"
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "gender" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "male",
          "doc_count" : 12
        },
        {
          "key" : "female",
          "doc_count" : 8
        }
      ]
    }
  }
}

Specifying size

# Specify the number of buckets to return with size
POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": {
      "terms": {
        "field":"age",
        "size":3
      }
    }
  }
}
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "ages_5" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 12,
      "buckets" : [
        {
          "key" : 25,
          "doc_count" : 3
        },
        {
          "key" : 32,
          "doc_count" : 3
        },
        {
          "key" : 27,
          "doc_count" : 2
        }
      ]
    }
  }
}

Bucket Size

# With size set on top_hits: details of the 3 oldest employees in each job type
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      },
      "aggs":{
        "old_employee":{
          "top_hits":{
            "size":3,
            "sort":[
              {
                "age":{
                  "order":"desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
# Query result
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "jobs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 7,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "11",
                  "_score" : null,
                  "_source" : {
                    "name" : "Jenny",
                    "age" : 36,
                    "job" : "Java Programmer",
                    "gender" : "female",
                    "salary" : 38000
                  },
                  "sort" : [
                    36
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "15",
                  "_score" : null,
                  "_source" : {
                    "name" : "King",
                    "age" : 33,
                    "job" : "Java Programmer",
                    "gender" : "male",
                    "salary" : 28000
                  },
                  "sort" : [
                    33
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "9",
                  "_score" : null,
                  "_source" : {
                    "name" : "Gregory",
                    "age" : 32,
                    "job" : "Java Programmer",
                    "gender" : "male",
                    "salary" : 22000
                  },
                  "sort" : [
                    32
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 4,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "14",
                  "_score" : null,
                  "_source" : {
                    "name" : "Marshall",
                    "age" : 32,
                    "job" : "Javascript Programmer",
                    "gender" : "male",
                    "salary" : 25000
                  },
                  "sort" : [
                    32
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "18",
                  "_score" : null,
                  "_source" : {
                    "name" : "Catherine",
                    "age" : 29,
                    "job" : "Javascript Programmer",
                    "gender" : "female",
                    "salary" : 20000
                  },
                  "sort" : [
                    29
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "17",
                  "_score" : null,
                  "_source" : {
                    "name" : "Goodwin",
                    "age" : 25,
                    "job" : "Javascript Programmer",
                    "gender" : "male",
                    "salary" : 16000
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "QA",
          "doc_count" : 3,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 3,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "6",
                  "_score" : null,
                  "_source" : {
                    "name" : "Lucy",
                    "age" : 31,
                    "job" : "QA",
                    "gender" : "female",
                    "salary" : 25000
                  },
                  "sort" : [
                    31
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "7",
                  "_score" : null,
                  "_source" : {
                    "name" : "Byrd",
                    "age" : 27,
                    "job" : "QA",
                    "gender" : "male",
                    "salary" : 20000
                  },
                  "sort" : [
                    27
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "5",
                  "_score" : null,
                  "_source" : {
                    "name" : "Rose",
                    "age" : 25,
                    "job" : "QA",
                    "gender" : "female",
                    "salary" : 18000
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "DBA",
          "doc_count" : 2,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "19",
                  "_score" : null,
                  "_source" : {
                    "name" : "Boone",
                    "age" : 30,
                    "job" : "DBA",
                    "gender" : "male",
                    "salary" : 30000
                  },
                  "sort" : [
                    30
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "20",
                  "_score" : null,
                  "_source" : {
                    "name" : "Kathy",
                    "age" : 29,
                    "job" : "DBA",
                    "gender" : "female",
                    "salary" : 20000
                  },
                  "sort" : [
                    29
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 2,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "4",
                  "_score" : null,
                  "_source" : {
                    "name" : "Rivera",
                    "age" : 26,
                    "job" : "Web Designer",
                    "gender" : "female",
                    "salary" : 22000
                  },
                  "sort" : [
                    26
                  ]
                },
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "3",
                  "_score" : null,
                  "_source" : {
                    "name" : "Tran",
                    "age" : 25,
                    "job" : "Web Designer",
                    "gender" : "male",
                    "salary" : 18000
                  },
                  "sort" : [
                    25
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "2",
                  "_score" : null,
                  "_source" : {
                    "name" : "Underwood",
                    "age" : 41,
                    "job" : "Dev Manager",
                    "gender" : "male",
                    "salary" : 50000
                  },
                  "sort" : [
                    41
                  ]
                }
              ]
            }
          }
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1,
          "old_employee" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "employees",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_score" : null,
                  "_source" : {
                    "name" : "Emma",
                    "age" : 32,
                    "job" : "Product Manager",
                    "gender" : "female",
                    "salary" : 35000
                  },
                  "sort" : [
                    32
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

# Range buckets

# Salary range buckets; a custom key can be defined for a bucket
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field":"salary",
        "ranges":[
          {
            "to":10000
          },
          {
            "from":10000,
            "to":20000
          },
          {
            "key":">20000",
            "from":20000
          }
        ]
      }
    }
  }
}
# Query result
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "salary_range" : {
      "buckets" : [
        {
          "key" : "*-10000.0",
          "to" : 10000.0,
          "doc_count" : 1
        },
        {
          "key" : "10000.0-20000.0",
          "from" : 10000.0,
          "to" : 20000.0,
          "doc_count" : 4
        },
        {
          "key" : ">20000",
          "from" : 20000.0,
          "doc_count" : 15
        }
      ]
    }
  }
}
# Salary histogram: salaries from 0 to 100,000, bucketed in intervals of 5000
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histogram": {
      "histogram": {
        "field":"salary",
        "interval":5000,
        "extended_bounds":{
          "min":0,
          "max":100000
        }
      }
    }
  }
}
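No result is listed for the histogram request, so the bucketing rule can be reproduced locally instead. The sketch below assumes the usual rule that a value falls into the bucket keyed floor(value / interval) * interval and that the bounds only add empty buckets; the salaries are copied from the sample data, and the function histogram is a hypothetical helper whose output is not taken from Elasticsearch.

```python
from collections import Counter

# Salaries of the 20 sample employees, in document id order.
SALARIES = [
    35000, 50000, 18000, 22000, 18000, 25000, 20000, 20000, 22000, 9000,
    38000, 32000, 30000, 25000, 28000, 16000, 16000, 20000, 30000, 20000,
]


def histogram(values, interval, min_bound=None, max_bound=None):
    """Bucket key = floor(value / interval) * interval; bounds add empty buckets."""
    counts = Counter((v // interval) * interval for v in values)
    lo = min(counts) if min_bound is None else min(min_bound, min(counts))
    hi = max(counts) if max_bound is None else max(max_bound, max(counts))
    lo = (lo // interval) * interval  # align the lower bound to a bucket key
    return {key: counts.get(key, 0) for key in range(lo, hi + 1, interval)}


if __name__ == "__main__":
    buckets = histogram(SALARIES, 5000, min_bound=0, max_bound=100000)
    for key, doc_count in buckets.items():
        if doc_count:
            print(key, doc_count)
```

With the sample data the fullest bucket is the one keyed 20000, which holds the six salaries of 20000 and 22000.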

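Bucket sub-aggregations: a sub-aggregation can be either a Bucket or a Metric aggregation

# Nested aggregation 1: bucket by job type and compute salary statistics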
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_salary_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "salary": {
          "stats": {
            "field": "salary"
          }
        }
      }
    }
  }
}
# Query result
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "Job_salary_stats" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7,
          "salary" : {
            "count" : 7,
            "min" : 9000.0,
            "max" : 38000.0,
            "avg" : 25571.428571428572,
            "sum" : 179000.0
          }
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4,
          "salary" : {
            "count" : 4,
            "min" : 16000.0,
            "max" : 25000.0,
            "avg" : 19250.0,
            "sum" : 77000.0
          }
        },
        {
          "key" : "QA",
          "doc_count" : 3,
          "salary" : {
            "count" : 3,
            "min" : 18000.0,
            "max" : 25000.0,
            "avg" : 21000.0,
            "sum" : 63000.0
          }
        },
        {
          "key" : "DBA",
          "doc_count" : 2,
          "salary" : {
            "count" : 2,
            "min" : 20000.0,
            "max" : 30000.0,
            "avg" : 25000.0,
            "sum" : 50000.0
          }
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2,
          "salary" : {
            "count" : 2,
            "min" : 18000.0,
            "max" : 22000.0,
            "avg" : 20000.0,
            "sum" : 40000.0
          }
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1,
          "salary" : {
            "count" : 1,
            "min" : 50000.0,
            "max" : 50000.0,
            "avg" : 50000.0,
            "sum" : 50000.0
          }
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1,
          "salary" : {
            "count" : 1,
            "min" : 35000.0,
            "max" : 35000.0,
            "avg" : 35000.0,
            "sum" : 35000.0
          }
        }
      ]
    }
  }
}
# Multiple levels of nesting: bucket by job type, then by gender, and compute salary statistics
POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_gender_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "gender_stats": {
          "terms": {
            "field": "gender"
          },
          "aggs": {
            "salary_stats": {
              "stats": {
                "field": "salary"
              }
            }
          }
        }
      }
    }
  }
}
# Query result
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 20,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "Job_gender_stats" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Java Programmer",
          "doc_count" : 7,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "male",
                "doc_count" : 5,
                "salary_stats" : {
                  "count" : 5,
                  "min" : 9000.0,
                  "max" : 32000.0,
                  "avg" : 22200.0,
                  "sum" : 111000.0
                }
              },
              {
                "key" : "female",
                "doc_count" : 2,
                "salary_stats" : {
                  "count" : 2,
                  "min" : 30000.0,
                  "max" : 38000.0,
                  "avg" : 34000.0,
                  "sum" : 68000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Javascript Programmer",
          "doc_count" : 4,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "male",
                "doc_count" : 3,
                "salary_stats" : {
                  "count" : 3,
                  "min" : 16000.0,
                  "max" : 25000.0,
                  "avg" : 19000.0,
                  "sum" : 57000.0
                }
              },
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 20000.0,
                  "max" : 20000.0,
                  "avg" : 20000.0,
                  "sum" : 20000.0
                }
              }
            ]
          }
        },
        {
          "key" : "QA",
          "doc_count" : 3,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 2,
                "salary_stats" : {
                  "count" : 2,
                  "min" : 18000.0,
                  "max" : 25000.0,
                  "avg" : 21500.0,
                  "sum" : 43000.0
                }
              },
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 20000.0,
                  "max" : 20000.0,
                  "avg" : 20000.0,
                  "sum" : 20000.0
                }
              }
            ]
          }
        },
        {
          "key" : "DBA",
          "doc_count" : 2,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 20000.0,
                  "max" : 20000.0,
                  "avg" : 20000.0,
                  "sum" : 20000.0
                }
              },
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 30000.0,
                  "max" : 30000.0,
                  "avg" : 30000.0,
                  "sum" : 30000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Web Designer",
          "doc_count" : 2,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 22000.0,
                  "max" : 22000.0,
                  "avg" : 22000.0,
                  "sum" : 22000.0
                }
              },
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 18000.0,
                  "max" : 18000.0,
                  "avg" : 18000.0,
                  "sum" : 18000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Dev Manager",
          "doc_count" : 1,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "male",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 50000.0,
                  "max" : 50000.0,
                  "avg" : 50000.0,
                  "sum" : 50000.0
                }
              }
            ]
          }
        },
        {
          "key" : "Product Manager",
          "doc_count" : 1,
          "gender_stats" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "female",
                "doc_count" : 1,
                "salary_stats" : {
                  "count" : 1,
                  "min" : 35000.0,
                  "max" : 35000.0,
                  "avg" : 35000.0,
                  "sum" : 35000.0
                }
              }
            ]
          }
        }
      ]
    }
  }
}

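 Pipeline Aggregations

The pipeline concept: run a further aggregation on the results of an existing aggregation.

The result of a pipeline aggregation is written into the original aggregation output. Depending on where it is placed, there are two kinds:

  • Sibling: the result sits at the same level as the existing result
    • Min, Max, Avg, Sum Bucket
    • Stats, Extended Stats Bucket
    • Percentiles Bucket
  • Parent: the result is embedded inside the existing aggregation result
    • Derivative
    • Cumulative Sum
    • Moving Function (moving window)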

 

# Job type with the lowest average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job":{
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# Job type with the highest average salary
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "max_salary_by_job":{
      "max_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# Average of the per-job average salaries
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "avg_salary_by_job":{
      "avg_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# Statistics over the per-job average salaries
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "stats_salary_by_job":{
      "stats_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}


# Percentiles of the per-job average salaries
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "percentiles_salary_by_job":{
      "percentiles_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}
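The sibling pipeline aggregations above can be pictured as plain functions over the per-bucket metric. The following is a minimal Python sketch, not the Elasticsearch implementation; the salary figures are hypothetical and stand in for the `avg_salary` value of each `jobs` bucket.

```python
from statistics import mean

# Hypothetical per-job averages, standing in for the `avg_salary` metric
# of each bucket returned by the `jobs` terms aggregation.
job_avg_salary = {
    "Java Programmer": 25000.0,
    "QA": 21000.0,
    "DBA": 24000.0,
    "Dev Manager": 50000.0,
}


def min_bucket(buckets):
    """Smallest metric value plus the keys of the buckets that hold it."""
    value = min(buckets.values())
    return value, [k for k, v in buckets.items() if v == value]


def max_bucket(buckets):
    """Largest metric value plus the keys of the buckets that hold it."""
    value = max(buckets.values())
    return value, [k for k, v in buckets.items() if v == value]


def avg_bucket(buckets):
    """Average of the per-bucket metric values."""
    return mean(buckets.values())


def stats_bucket(buckets):
    """count / min / max / avg / sum over the per-bucket metric values."""
    values = list(buckets.values())
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
        "sum": sum(values),
    }


# Sibling results sit next to the original `jobs` aggregation, not inside it.
aggregations = {
    "jobs": job_avg_salary,
    "min_salary_by_job": min_bucket(job_avg_salary),
    "max_salary_by_job": max_bucket(job_avg_salary),
    "avg_salary_by_job": avg_bucket(job_avg_salary),
    "stats_salary_by_job": stats_bucket(job_avg_salary),
}
```

Each function reads only the values selected by `buckets_path` (`jobs>avg_salary` in the requests above), which is why the sibling result appears alongside `jobs` in the response.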



# Derivative of the average salary by age
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "derivative_avg_salary":{
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}


#Cumulative_sum
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "cumulative_salary":{
          "cumulative_sum": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

#Moving Function
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "moving_avg_salary":{
          "moving_fn": {
            "buckets_path": "avg_salary",
            "window":10,
            "script": "MovingFunctions.min(values)"
          }
        }
      }
    }
  }
}
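The three parent pipeline aggregations above each add one value to every histogram bucket. The Python sketch below shows the arithmetic under simplifying assumptions: the bucket values are hypothetical, gaps are ignored, and the moving window is assumed to cover the previous `window` buckets without the current one, which is my understanding of the 7.x default.

```python
# Hypothetical avg_salary values of consecutive age buckets, in key order.
avg_salary = [9000.0, 16000.0, 14000.0, 22000.0]


def derivative(values):
    """Difference to the previous bucket; the first bucket has no value."""
    return [None] + [cur - prev for prev, cur in zip(values, values[1:])]


def cumulative_sum(values):
    """Running total of the metric across buckets."""
    out, total = [], 0.0
    for v in values:
        total += v
        out.append(total)
    return out


def moving_min(values, window):
    """Minimum over the previous `window` buckets (current one excluded).

    Assumption: this mirrors MovingFunctions.min with the default shift;
    an empty window is reported as None here.
    """
    out = []
    for i in range(len(values)):
        previous = values[max(0, i - window):i]
        out.append(min(previous) if previous else None)
    return out


derivative_avg_salary = derivative(avg_salary)
cumulative_salary = cumulative_sum(avg_salary)
moving_min_salary = moving_min(avg_salary, window=2)
```

Because each result is computed per bucket, it is embedded inside the bucket it belongs to, unlike the sibling aggregations shown earlier.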

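Scope and Sorting

By default, an ES aggregation operates on the result set of the query.

ES also supports the following ways to change the scope of an aggregation:

  • Filter
  • Post Filter
  • Global
# Scope
# Scope of Query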
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    }
  }
}


# Scope of Filter
POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter": {
        "range": {
          "age": {
            "gte": 35
          }
        }
      },
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword"
          }
        }
      }
    },
    "all_jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}



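# Post Filter: one request returns every job type in the aggregation, while the hits are filtered after aggregation to those matching the condition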
POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": {
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}


#global
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
        
      }
    },
    
    "all":{
      "global":{},
      "aggs":{
        "salary_avg":{
          "avg":{
            "field":"salary"
          }
        }
      }
    }
  }
}
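The four scopes can be compared side by side with a small in-memory model. This is a Python sketch over hypothetical employee records, not the employees index used above; it only illustrates which documents each aggregation sees.

```python
from collections import Counter

# Hypothetical documents for illustration only.
employees = [
    {"job": "Dev Manager", "age": 45, "salary": 50000},
    {"job": "QA", "age": 30, "salary": 20000},
    {"job": "QA", "age": 41, "salary": 25000},
    {"job": "DBA", "age": 25, "salary": 30000},
]


def terms(docs, field):
    """Count documents per distinct value, like a terms aggregation."""
    return dict(Counter(d[field] for d in docs))


# Query scope: the aggregation sees only documents matched by the query.
query_hits = [d for d in employees if d["age"] >= 40]
query_scope = terms(query_hits, "job")

# Filter aggregation: no query; the filter narrows a single aggregation,
# while a sibling aggregation still sees every document.
older_person = terms([d for d in employees if d["age"] >= 35], "job")
all_jobs = terms(employees, "job")

# Post filter: aggregate first, then filter only the returned hits.
post_filter_aggs = terms(employees, "job")
post_filter_hits = [d for d in employees if d["job"] == "Dev Manager"]

# Global: ignores the query and aggregates over all documents.
global_salary_avg = sum(d["salary"] for d in employees) / len(employees)
```

In this model, `query_scope` and `global_salary_avg` come from the same request yet cover different document sets, which is the point of the `global` aggregation.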

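Sorting:

Specify order to sort buckets by count and key

  • By default, buckets are sorted by count in descending order
  • Specify size to control how many buckets are returned
# Sorting with order
# By count and key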
POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword",
        "order":[
          {"_count":"asc"},
          {"_key":"desc"}
          ]
        
      }
    }
  }
}


# Sorting with order
# By a sub-aggregation metric (avg_salary)
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {
            "avg_salary": "desc"
          }
        ]
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    }
  }
}


# Sorting with order
# By one value of a multi-value sub-aggregation (stats_salary.min)
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {
            "stats_salary.min": "desc"
          }
        ]
      },
      "aggs": {
        "stats_salary": {
          "stats": {
            "field": "salary"
          }
        }
      }
    }
  }
}
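The bucket orderings used in the sorting requests correspond to ordinary multi-key sorts. Below is a Python sketch over hypothetical buckets; the keys and numbers are made up for illustration.

```python
# Hypothetical terms buckets with one sub-aggregation metric each.
buckets = [
    {"key": "QA", "doc_count": 3, "avg_salary": 21000.0},
    {"key": "DBA", "doc_count": 2, "avg_salary": 25000.0},
    {"key": "Web Designer", "doc_count": 2, "avg_salary": 20000.0},
    {"key": "Java Programmer", "doc_count": 7, "avg_salary": 25500.0},
]

# order: [{"_count": "asc"}, {"_key": "desc"}]
# Python's sort is stable, so sort by the secondary criterion first.
by_key_desc = sorted(buckets, key=lambda b: b["key"], reverse=True)
by_count_then_key = sorted(by_key_desc, key=lambda b: b["doc_count"])

# order: [{"avg_salary": "desc"}] sorts by the sub-aggregation value.
by_avg_salary = sorted(buckets, key=lambda b: b["avg_salary"], reverse=True)

count_key_order = [b["key"] for b in by_count_then_key]
avg_salary_order = [b["key"] for b in by_avg_salary]
```

The two buckets with `doc_count` 2 tie on count, so the `_key` descending rule decides that "Web Designer" comes before "DBA".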

 UpdateByQuery & Reindex

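Use cases:

Reindexing is generally needed in the following situations:

  • The index mapping changes: field types, analyzers, or dictionary updates
  • The index settings change: for example, the number of primary shards
  • Data must be migrated within a cluster or between clusters

 Built-in ES APIs

  • UpdateByQuery: rebuilds on the existing index

  • Reindex: rebuilds into a different index

 Case 1

# Rebuild the index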
DELETE blogs/

# Index a document
PUT blogs/_doc/1
{
  "content":"Hadoop is cool",
  "keyword":"hadoop"
}

# View the mapping
GET blogs/_mapping

# Update the mapping: add a sub-field that uses the english analyzer
PUT blogs/_mapping
{
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer":"english"
            }
          }
        }
      }
    }
# Index another document
PUT blogs/_doc/2
{
  "content":"Elasticsearch rocks",
    "keyword":"elasticsearch"
}

# Search for the newly indexed document
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Elasticsearch"
    }
  }

}

# Search for the document indexed before the mapping change
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Hadoop"
    }
  }
}


# Update all documents
POST blogs/_update_by_query
{

}

# After running update_by_query, search again for the document indexed earlier
POST blogs/_search
{
  "query": {
    "match": {
      "content.english": "Hadoop"
    }
  }
}
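Why does the first search for "Hadoop" on `content.english` find nothing until `_update_by_query` runs? A toy model helps: terms are written to the inverted index at index time, using the fields known at that moment. The Python sketch below is a deliberately simplified illustration; `ToyIndex` and its whitespace "analyzer" are hypothetical and do not reflect Elasticsearch internals.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical toy index: postings are built when a document is indexed."""

    def __init__(self):
        self.fields = {"content"}          # fields populated at index time
        self.docs = {}                     # stored _source by document id
        self.postings = defaultdict(set)   # (field, term) -> document ids

    def index(self, doc_id, source):
        self.docs[doc_id] = source
        for field in self.fields:
            for term in source["content"].lower().split():
                self.postings[(field, term)].add(doc_id)

    def add_subfield(self, name):
        # A mapping change only affects documents indexed afterwards.
        self.fields.add(name)

    def search(self, field, term):
        return sorted(self.postings[(field, term.lower())])

    def update_by_query(self):
        # Re-index every stored document with the current mapping.
        for doc_id, source in list(self.docs.items()):
            self.index(doc_id, source)


blogs = ToyIndex()
blogs.index("1", {"content": "Hadoop is cool"})
blogs.add_subfield("content.english")
blogs.index("2", {"content": "Elasticsearch rocks"})

new_doc_hits = blogs.search("content.english", "Elasticsearch")
old_doc_hits_before = blogs.search("content.english", "Hadoop")
blogs.update_by_query()
old_doc_hits_after = blogs.search("content.english", "Hadoop")
```

In this model, document 1 becomes searchable through the new sub-field only after it has been re-indexed, which matches the behaviour the requests above demonstrate.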

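Case 2: changing the mapping of an existing field

  • ES does not allow changing the type of a field in an existing mapping
  • The only option is to create a new index with the correct field types and then re-import the data
# Query the mapping
GET blogs/_mapping
# Result: the keyword field has type text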
{
  "blogs" : {
    "mappings" : {
      "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            },
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "keyword" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}
# Attempting to change the type fails: ES does not allow changing the type of an existing field
PUT blogs/_mapping
{
        "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            }
          }
        },
        "keyword" : {
          "type" : "keyword"
        }
      }
}
# Create a new index with the new mapping
PUT blogs_fix/
{
  "mappings": {
        "properties" : {
        "content" : {
          "type" : "text",
          "fields" : {
            "english" : {
              "type" : "text",
              "analyzer" : "english"
            }
          }
        },
        "keyword" : {
          "type" : "keyword"
        }
      }    
  }
}
# Reindex API
POST  _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fix"
  }
}
# Check the new index
GET  blogs_fix/_doc/1
# Query result
{
  "_index" : "blogs_fix",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "content" : "Hadoop is cool",
    "keyword" : "hadoop"
  }
}
# Test a terms aggregation
POST blogs_fix/_search
{
  "size": 0,
  "aggs": {
    "blog_keyword": {
      "terms": {
        "field": "keyword",
        "size": 10
      }
    }
  }
}
# The field is now of type keyword; a terms aggregation cannot run on a text field unless fielddata is enabled
# Query result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "blog_keyword" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "elasticsearch",
          "doc_count" : 1
        },
        {
          "key" : "hadoop",
          "doc_count" : 1
        }
      ]
    }
  }
}

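Reindex summary

The Reindex API copies documents from one index to another.

When to use the Reindex API:

  • Changing the number of primary shards of an index
  • Changing the field types in a mapping
  • Migrating data within a cluster or between clusters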

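 Ingest Pipeline & Painless Script

Ingest Node

A node type introduced in ES 5.0. With the default configuration, every node is an Ingest Node.

  • It can preprocess data by intercepting Index or Bulk API requests
  • It transforms the data and hands it back to the Index or Bulk API

Data can be preprocessed without Logstash, for example:

  • Setting a default value for a field, renaming a field, or splitting a field
  • Using Painless scripts for more complex processing

Demo

Create a document

# Blog data with 3 fields; tags are comma-separated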
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
}
# Simulate a pipeline that splits tags on commas
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You konw, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
# Also add a views field (the blog view count) to each document
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
  },

  "docs": [
    {
      "_index":"index",
      "_id":"id",
      "_source":{
        "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
      }
    },


    {
      "_index":"index",
      "_id":"idxx",
      "_source":{
        "title":"Introducing cloud computering",
  "tags":"openstack,k8s",
  "content":"You konw, for cloud"
      }
    }

    ]
}

The requests above only simulate the pipeline. Once testing is done, create the pipeline in ES:

PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
}
# View the pipeline
GET _ingest/pipeline/blog_pipeline
# Test the pipeline; only the array of documents needs to be provided
POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computering",
        "tags": "openstack,k8s",
        "content": "You konw, for cloud"
      }
    }
  ]
}
# Test 2: clear the index
DELETE tech_blogs

# Index a document without the pipeline
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
}

# Index a document with the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computering",
  "tags": "openstack,k8s",
  "content": "You konw, for cloud"
}


# View both documents: one was processed, the other was not
POST tech_blogs/_search
{}

# update_by_query fails here: document 2 already has tags as an array, so the split processor cannot process it again
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
}

# Add a condition to update_by_query: only process documents that do not have the views field yet
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "views"
                }
            }
        }
    }
}
# Search again: this time document 1 has also been processed by the pipeline
POST tech_blogs/_search
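
The failure and its fix can be reproduced with a small in-memory model of the two processors. This is a Python sketch, not the ingest implementation; the function names and the error type are hypothetical, and the documents are shortened versions of the ones above.

```python
import copy


def split_processor(doc, field, separator):
    """Split a string field into a list; reject values that are not strings."""
    value = doc[field]
    if not isinstance(value, str):
        raise TypeError(f"field [{field}] is not a string and cannot be split")
    doc[field] = value.split(separator)


def set_processor(doc, field, value):
    """Set a field to a fixed value."""
    doc[field] = value


def blog_pipeline(doc):
    """Model of blog_pipeline: split tags, then set views to 0."""
    out = dict(doc)
    split_processor(out, "tags", ",")
    set_processor(out, "views", 0)
    return out


def update_by_query(index, pipeline, query=lambda doc: True):
    """Run the pipeline over every document that matches the query."""
    for doc_id, doc in list(index.items()):
        if query(doc):
            index[doc_id] = pipeline(doc)


tech_blogs = {
    "1": {"tags": "hadoop,elasticsearch,spark"},     # indexed without pipeline
    "2": blog_pipeline({"tags": "openstack,k8s"}),   # indexed with pipeline
}

# Without a condition, document 2 is split a second time and fails.
try:
    update_by_query(copy.deepcopy(tech_blogs), blog_pipeline)
    unconditional_failed = False
except TypeError:
    unconditional_failed = True

# With the must_not/exists condition, only unprocessed documents are touched.
update_by_query(tech_blogs, blog_pipeline, query=lambda d: "views" not in d)
```

The condition makes the update idempotent: running the conditional `update_by_query` again leaves both documents unchanged.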

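Some built-in processors

  • Split: splits a field into an array
  • Remove / Rename: removes or renames a field
  • Append: appends a value to a field, for example a new tag
  • Convert: converts a field's type, for example from string to float
  • Date / JSON: converts date formats; parses a string into JSON
  • Date Index Name: routes documents passing through the processor to an index named by a date pattern
  • Fail: when an exception occurs, returns the error message specified in the pipeline to the user
  • Foreach: for an array field, applies the same processor to every element of the array
  • Grok: parses log lines by pattern
  • Gsub / Join / Split: string replacement, array to string, string to array
  • Lowercase / Uppercase: case conversion

Painless

  • Introduced in ES 5.x and designed specifically for ES; it extends Java syntax
  • Since 6.0, ES supports only Painless; Groovy, JavaScript, and Python are no longer supported
  • Painless supports all Java data types and a subset of the Java API
  • Painless scripts have the following characteristics:
    • High performance / secure
    • Support for explicit types or dynamically defined types

Uses of Painless:

Processing document fields

  • Updating or deleting fields, and processing data in aggregations
  • Script Field: computes values for returned fields
  • Function Score: adjusts document scores

Running scripts in an Ingest Pipeline

Processing data during Reindex API and Update By Query operations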

#########Demo for Painless###############

# Add a Script Processor
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
          if(ctx.containsKey("content")){
            ctx.content_length = ctx.content.length();
          }else{
            ctx.content_length=0;
          }


          """
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
  },

  "docs": [
    {
      "_index":"index",
      "_id":"id",
      "_source":{
        "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data"
      }
    },


    {
      "_index":"index",
      "_id":"idxx",
      "_source":{
        "title":"Introducing cloud computering",
  "tags":"openstack,k8s",
  "content":"You konw, for cloud"
      }
    }

    ]
}


DELETE tech_blogs
PUT tech_blogs/_doc/1
{
  "title":"Introducing big data......",
  "tags":"hadoop,elasticsearch,spark",
  "content":"You konw, for big data",
  "views":0
}

POST tech_blogs/_update/1
{
  "script": {
    "source": "ctx._source.views += params.new_views",
    "params": {
      "new_views":100
    }
  }
}

# Check the views count
POST tech_blogs/_search
{

}

# Store the script in the cluster state
POST _scripts/update_views
{
  "script":{
    "lang": "painless",
    "source": "ctx._source.views += params.new_views"
  }
}

POST tech_blogs/_update/1
{
  "script": {
    "id": "update_views",
    "params": {
      "new_views":1000
    }
  }
}
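
The stored-script flow above boils down to registering a function once and calling it by id with different parameters. Below is a minimal Python sketch of that idea; the `stored_scripts` registry and the helper names are hypothetical and only model the behaviour of the two `_update` requests.

```python
# Hypothetical in-memory stand-in for scripts kept in the cluster state.
stored_scripts = {}


def put_script(script_id, func):
    """Register a script under an id, like POST _scripts/<id>."""
    stored_scripts[script_id] = func


def update(doc, script_id, params):
    """Run a stored script against a document, like POST <index>/_update/<id>."""
    ctx = {"_source": doc}
    stored_scripts[script_id](ctx, params)
    return ctx["_source"]


def update_views(ctx, params):
    # Same logic as: ctx._source.views += params.new_views
    ctx["_source"]["views"] += params["new_views"]


put_script("update_views", update_views)

doc = {"title": "Introducing big data......", "views": 0}
update(doc, "update_views", {"new_views": 100})    # first update request
update(doc, "update_views", {"new_views": 1000})   # update via stored script
```

Passing values through `params` instead of hard-coding them in the source keeps the script text identical between calls, so the same stored script can be reused for any increment.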


GET tech_blogs/_search
{
  "script_fields": {
    "rnd_views": {
      "script": {
        "lang": "painless",
        "source": """
          java.util.Random rnd = new Random();
          doc['views'].value+rnd.nextInt(1000);
        """
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

 

