Elasticsearch 7.6 Study Notes 1: Getting Started with Elasticsearch



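Preface

The Chinese edition of the Definitive Guide only covers 2.x, but Elasticsearch is already at 7.6, so I installed the latest release to learn from.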

Installation

This is a setup for learning; a production installation follows a different process.

Windows

Elasticsearch download URL:

https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip

Kibana download URL:

https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip

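The latest official release is currently 7.6.0, but the download is painfully slow. With the Xunlei download manager the speed can reach xM.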

bin\elasticsearch.bat
bin\kibana.bat

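Double-click each .bat file to start it.

Docker installation

For testing and learning, the official Docker image is faster and more convenient.

Installation instructions: https://www.cnblogs.com/woshimrf/p/docker-es7.html

The following content is based on: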

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html

Index some documents

This walkthrough uses Kibana directly, though you can also send requests to localhost:9200 with curl or Postman.

Open localhost:5601 and click Dev Tools.

Create a customer index

PUT /{index-name}/_doc/{id}

PUT /customer/_doc/1
{
  "name": "John Doe"
}

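PUT is the HTTP method. If the index customer does not exist in Elasticsearch, it is created and a document is inserted with id=1 and name=John Doe.
If the document already exists, it is updated. Note that the update is a full overwrite: the stored document becomes exactly the JSON body that was sent.

The response: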
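The same request can be sent from outside Kibana. Below is a minimal sketch using only the Python standard library. It assumes a node listening on localhost:9200 without authentication or TLS, and the helper name build_index_request is made up for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no auth, no TLS


def build_index_request(index, doc_id, doc):
    """Build (but do not send) PUT /{index}/_doc/{id} with a JSON body."""
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_index_request("customer", 1, {"name": "John Doe"})
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```

Because the write is a full overwrite, sending a body without name would drop that field from the stored document.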

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 7,
  "result" : "updated",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 6,
  "_primary_term" : 1
}

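  • _index is the index name
  • _type is always _doc
  • _id is the primary key of the document, i.e. the pk of a record
  • _version is incremented on every write to this _id; here the document has been written 7 times
  • _shards reports the result per shard copy. This cluster has two nodes, and the write succeeded on both copies.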
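A client would normally check these fields rather than read them by eye. The sketch below is one possible check; the function name write_succeeded is hypothetical, and the sample is the response shown above.

```python
import json


def write_succeeded(response):
    """True if the write was applied and no shard copy reported a failure."""
    shards = response.get("_shards", {})
    return (
        response.get("result") in ("created", "updated")
        and shards.get("failed", 0) == 0
    )


sample = json.loads("""
{
  "_index": "customer", "_type": "_doc", "_id": "1",
  "_version": 7, "result": "updated",
  "_shards": {"total": 2, "successful": 2, "failed": 0},
  "_seq_no": 6, "_primary_term": 1
}
""")
print(write_succeeded(sample))
```

The check looks at failed rather than comparing successful with total, since on a single-node cluster a replica may simply be unassigned, in which case successful is lower than total without any failure being reported.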

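In Kibana, Management > Index Management shows the state of each index. For example, the customer index has one primary shard and one replica shard.

Once saved, the record can be read back immediately: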

GET /customer/_doc/1

Response:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 15,
  "_seq_no" : 14,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "John Doe"
  }
}

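  • _source is the content of the document we stored

Bulk insert

When there are many documents to insert, we can insert them in bulk. Download the prepared documents, then import them into Elasticsearch with an HTTP request.

Create an index named bank. The number of shards cannot be changed after the index is created (the number of replicas can be changed later), so configure the shards at creation time. Here we configure 3 shards and 2 replicas.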

PUT /bank
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "2"
    }
  }
}

Document URL: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json

After downloading it, send the file with curl or Postman:

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"

Once indexed, each record looks like this:

{
  "_index": "bank",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_score": 0,
  "_source": {
    "account_number": 1,
    "balance": 39225,
    "firstname": "Amber",
    "lastname": "Duke",
    "age": 32,
    "gender": "M",
    "address": "880 Holmes Lane",
    "employer": "Pyrami",
    "email": "amberduke@pyrami.com",
    "city": "Brogan",
    "state": "IL"
  }
}

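In Kibana's Monitoring app, choose self monitoring, then find the bank index under Indices. It shows how the imported data is distributed.

The 3 shards are spread across different nodes, and each has 2 replicas.

Start searching

With some data bulk-inserted, we can start learning how to query. As shown above, the data is a set of bank account records. We query all accounts, sorted by account number.

This is similar to the SQL

select * from bank order by account_number asc limit 3 offset 2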

Query DSL


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "size": 3,
  "from": 2
}
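
  • _search is the search endpoint
  • query is the search condition; match_all here matches every document
  • size is the number of hits returned per request, i.e. the page size. It defaults to 10 when omitted. The hits are listed under hits in the response.
  • from is the zero-based offset of the first hit to return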
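For regular paging, from is usually derived from a page number. Here is a small sketch of that arithmetic; the helper name page_params and the 1-based page convention are assumptions, not part of the Elasticsearch API.

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into Elasticsearch from/size values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"from": (page - 1) * page_size, "size": page_size}


body = {
    "query": {"match_all": {}},
    "sort": [{"account_number": "asc"}],
    **page_params(page=3, page_size=3),
}
print(body["from"], body["size"])
```

The request above uses from=2 with size=3, which does not fall on a page boundary; it simply skips the first two accounts.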

Response:


{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : null,
        "_source" : {
          "account_number" : 2,
          "balance" : 28838,
          "firstname" : "Roberta",
          "lastname" : "Bender",
          "age" : 22,
          "gender" : "F",
          "address" : "560 Kingsway Place",
          "employer" : "Chillium",
          "email" : "robertabender@chillium.com",
          "city" : "Bennett",
          "state" : "LA"
        },
        "sort" : [
          2
        ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "account_number" : 3,
          "balance" : 44947,
          "firstname" : "Levine",
          "lastname" : "Burks",
          "age" : 26,
          "gender" : "F",
          "address" : "328 Wilson Avenue",
          "employer" : "Amtap",
          "email" : "levineburks@amtap.com",
          "city" : "Cochranville",
          "state" : "HI"
        },
        "sort" : [
          3
        ]
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : null,
        "_source" : {
          "account_number" : 4,
          "balance" : 27658,
          "firstname" : "Rodriquez",
          "lastname" : "Flores",
          "age" : 31,
          "gender" : "F",
          "address" : "986 Wyckoff Avenue",
          "employer" : "Tourmania",
          "email" : "rodriquezflores@tourmania.com",
          "city" : "Eastvale",
          "state" : "HI"
        },
        "sort" : [
          4
        ]
      }
    ]
  }
}



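The response provides the following information:

  • took: how long Elasticsearch took to run the query, in milliseconds
  • timed_out: whether the search timed out
  • _shards: how many shards were searched, and how many succeeded, failed, or were skipped. A shard is a slice of the data: the documents of one index are split into several pieces, much like partitioning a table by id.
  • max_score: the score of the most relevant document

Next, we can try queries with conditions.

Full-text (analyzed) query

Find addresses that contain mill or lane.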

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } },
  "size": 2
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 19,
      "relation" : "eq"
    },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : {
          "account_number" : 136,
          "balance" : 45801,
          "firstname" : "Winnie",
          "lastname" : "Holland",
          "age" : 38,
          "gender" : "M",
          "address" : "198 Mill Lane",
          "employer" : "Neteria",
          "email" : "winnieholland@neteria.com",
          "city" : "Urie",
          "state" : "IL"
        }
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "970",
        "_score" : 5.4032025,
        "_source" : {
          "account_number" : 970,
          "balance" : 19648,
          "firstname" : "Forbes",
          "lastname" : "Wallace",
          "age" : 28,
          "gender" : "M",
          "address" : "990 Mill Road",
          "employer" : "Pheast",
          "email" : "forbeswallace@pheast.com",
          "city" : "Lopezo",
          "state" : "AK"
        }
      }
    ]
  }
}

  • I asked for 2 hits, but 19 documents actually matched

Exact phrase query

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

This time only one document matches the whole phrase:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 9.507477,
    "hits" : [
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "136",
        "_score" : 9.507477,
        "_source" : {
          "account_number" : 136,
          "balance" : 45801,
          "firstname" : "Winnie",
          "lastname" : "Holland",
          "age" : 38,
          "gender" : "M",
          "address" : "198 Mill Lane",
          "employer" : "Neteria",
          "email" : "winnieholland@neteria.com",
          "city" : "Urie",
          "state" : "IL"
        }
      }
    ]
  }
}
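The difference between the two queries can be illustrated with a deliberately simplified model. The sketch below lowercases and splits on whitespace, which only approximates what the real analyzer does, and it ignores scoring entirely. All function names are made up for illustration.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def match(field_value, query):
    """match: a document qualifies if ANY query token occurs in the field."""
    field = set(tokens(field_value))
    return any(t in field for t in tokens(query))


def match_phrase(field_value, query):
    """match_phrase: all query tokens must occur consecutively, in order."""
    field, q = tokens(field_value), tokens(query)
    return any(field[i:i + len(q)] == q for i in range(len(field) - len(q) + 1))


addresses = ["198 Mill Lane", "990 Mill Road", "880 Holmes Lane", "560 Kingsway Place"]
matched = [a for a in addresses if match(a, "mill lane")]
phrase_matched = [a for a in addresses if match_phrase(a, "mill lane")]
print(matched)
print(phrase_matched)
```

In this model "990 Mill Road" qualifies for match because it contains mill, but only "198 Mill Lane" contains both words next to each other.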

Multi-condition query

In practice, queries usually combine several conditions:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
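
  • bool combines multiple query clauses
  • must, should, and must_not are the clauses of a bool query. must and should contribute to the relevance score, and results are sorted by score by default.
  • must_not acts as a filter: it affects which documents are returned but not the score; it only removes documents from the results.

You can also specify arbitrary filters explicitly to include or exclude documents based on structured data.

For example, find accounts with a balance between 20000 and 30000.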

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
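When queries are assembled in application code, the bool clauses are typically collected into one body. The sketch below combines the two examples above into a single request body; the helper name bool_query and its parameters are hypothetical.

```python
import json


def bool_query(must=None, must_not=None, filters=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {"must": must, "must_not": must_not, "filter": filters}
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


body = bool_query(
    must=[{"match": {"age": "40"}}],
    must_not=[{"match": {"state": "ID"}}],
    filters=[{"range": {"balance": {"gte": 20000, "lte": 30000}}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /bank/_search in Dev Tools. Both range bounds are inclusive because gte and lte are used.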

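Aggregations (group by)

Count people per state

In SQL this might be written as

select state AS group_by_state, count(*) from tbl_bank group by state order by count(*) desc limit 3;

The corresponding Elasticsearch request is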


GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3
      }
    }
  }
}
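
  • size=0 suppresses the hits: Elasticsearch would otherwise return the matching documents, and we only want the aggregated values
  • aggs is the keyword that introduces aggregations
  • group_by_state is the name of one aggregation result; the name is user-defined
  • terms buckets documents by the exact values of a field; here it is the field to group by
  • state.keyword: state is mapped as text, and a string field must be of type keyword to be counted or grouped on, so the keyword sub-field is used
  • size=3 limits the number of buckets returned: top 3 here, top 10 by default. The system-wide maximum is 10000, which can be changed through search.max_buckets. Note that with multiple shards the counts can be imprecise; this is explored further below.

Response: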

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 26,
      "sum_other_doc_count" : 928,
      "buckets" : [
        {
          "key" : "MD",
          "doc_count" : 28
        },
        {
          "key" : "ID",
          "doc_count" : 23
        },
        {
          "key" : "TX",
          "doc_count" : 21
        }
      ]
    }
  }
}


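  • hits holds the documents that matched the query; because size=0, it is []. total shows that the query matched 1000 documents.
  • aggregations holds the aggregation results
  • group_by_state is the name we gave the aggregation in the request
  • doc_count_error_upper_bound is an upper bound on the document count of terms that were not returned by this aggregation but might belong in it. As the name suggests, it is the worst-case estimate of what was left out of the final result. The larger the value, the more likely the result is inaccurate. A value of 0 guarantees the counts are exact, but a nonzero value does not necessarily mean the result is wrong.
  • sum_other_doc_count is the total number of documents in the buckets that were not returned

Is this top 3 actually accurate? doc_count_error_upper_bound is nonzero (26), so the counts may well be off; the top 3 reported are 28, 23, and 21. Let's add another parameter and compare the results: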

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size":  60
      }
    }
  }
}
-----------------------------------------
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30
        },
        {
          "key" : "MD",
          "doc_count" : 28
        },
        {
          "key" : "ID",
          "doc_count" : 27
        }
      ]
    }
  }
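
  • shard_size is the number of buckets each shard computes and returns. A terms aggregation is computed per shard, and the partial results are then merged into the final result. Because the data is not evenly distributed across shards, each shard's top N differs, so the merged result can undercount some terms, which is why doc_count_error_upper_bound is nonzero. The default shard_size is size*1.5+10, which gives 14.5 for size=3; setting shard_size to 14.5 did return the same result as omitting it. With shard_size set to 60 the error finally drops to 0, which guarantees that these 3 really are the top 3. In other words, set shard_size generously for aggregations, for example 20 times size.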
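The effect of shard_size can be reproduced with a toy model of the merge step. The sketch below is a simplification under stated assumptions: each shard returns only its local top shard_size terms, and the coordinating step sums whatever it received. The per-shard counts are made up and are not the real bank data.

```python
from collections import Counter

# Made-up per-shard document counts by state (not the real bank data).
SHARDS = [
    {"MD": 10, "TX": 8, "ID": 2},
    {"ID": 9, "TX": 8, "MD": 1},
    {"CA": 7, "TX": 6, "MD": 1},
]


def terms_agg(shards, size, shard_size):
    """Merge each shard's local top `shard_size` terms, then keep `size`."""
    merged = Counter()
    for shard in shards:
        for key, count in Counter(shard).most_common(shard_size):
            merged[key] += count
    return merged.most_common(size)


print(terms_agg(SHARDS, size=1, shard_size=1))  # [('MD', 10)]
print(terms_agg(SHARDS, size=1, shard_size=3))  # [('TX', 22)]
```

TX is never the largest term on any single shard, so with shard_size=1 it is dropped everywhere and MD wins with an undercounted total. Once every shard returns enough terms, TX surfaces with its true count of 22.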

Count people per state and compute the average balance

To see the average balance for each state, the SQL might be

select 
  state, avg(balance) AS average_balance, count(*) AS group_by_state 
from tbl_bank
group by state
limit 3

In Elasticsearch the query looks like this:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "size": 3,
        "shard_size":  60
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        },
        "sum_balance": {
          "sum": {
            "field": "balance"
          }
        }
      }
    }
  }
}
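
  • The nested aggs computes metrics for each state bucket
  • average_balance is a user-defined name; its value is the avg of balance over documents with the same state
  • sum_balance is a user-defined name; its value is the sum of balance over documents with the same state

The result: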

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1000,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 915,
      "buckets" : [
        {
          "key" : "TX",
          "doc_count" : 30,
          "sum_balance" : {
            "value" : 782199.0
          },
          "average_balance" : {
            "value" : 26073.3
          }
        },
        {
          "key" : "MD",
          "doc_count" : 28,
          "sum_balance" : {
            "value" : 732523.0
          },
          "average_balance" : {
            "value" : 26161.535714285714
          }
        },
        {
          "key" : "ID",
          "doc_count" : 27,
          "sum_balance" : {
            "value" : 657957.0
          },
          "average_balance" : {
            "value" : 24368.777777777777
          }
        }
      ]
    }
  }
}
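To make the bucket structure concrete, here is what the nested aggregation computes, written as a plain in-memory group-by. This is only an illustration of the semantics, not how Elasticsearch implements it, and the six sample accounts are made up.

```python
from collections import defaultdict


def group_by_state(accounts, size):
    """Bucket accounts by state, attach sum/avg of balance, keep top `size` by count."""
    balances = defaultdict(list)
    for acc in accounts:
        balances[acc["state"]].append(acc["balance"])
    buckets = [
        {
            "key": state,
            "doc_count": len(vals),
            "sum_balance": sum(vals),
            "average_balance": sum(vals) / len(vals),
        }
        for state, vals in balances.items()
    ]
    # Default terms ordering: document count, descending.
    buckets.sort(key=lambda b: b["doc_count"], reverse=True)
    return buckets[:size]


accounts = [
    {"state": "TX", "balance": 100},
    {"state": "TX", "balance": 300},
    {"state": "MD", "balance": 50},
    {"state": "TX", "balance": 200},
    {"state": "MD", "balance": 150},
    {"state": "ID", "balance": 70},
]
for bucket in group_by_state(accounts, size=2):
    print(bucket)
```

Each returned dictionary has the same shape as one entry of buckets in the response above, with the sub-aggregation values flattened for brevity.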

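Count people per state and sort by average balance

The terms aggregation sorts buckets by document count in descending order by default. To sort some other way, the SQL might look like this: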

select 
  state, avg(balance) AS average_balance, count(*) AS group_by_state 
from tbl_bank
group by state
order by average_balance desc
limit 3

The corresponding Elasticsearch query:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        },
        "size": 3
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

The top 3 returned are no longer the same as before:

  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : -1,
      "sum_other_doc_count" : 983,
      "buckets" : [
        {
          "key" : "DE",
          "doc_count" : 2,
          "average_balance" : {
            "value" : 39040.5
          }
        },
        {
          "key" : "RI",
          "doc_count" : 5,
          "average_balance" : {
            "value" : 36035.4
          }
        },
        {
          "key" : "NE",
          "doc_count" : 10,
          "average_balance" : {
            "value" : 35648.8
          }
        }
      ]
    }
  }
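The top 3 changes because order is applied before the buckets are cut down to size, not afterwards. The sketch below contrasts the two orderings on five buckets whose counts and averages are taken from the responses above (averages rounded to one decimal). The helper name top is made up.

```python
# Counts and averages copied from the responses above, rounded to one decimal.
buckets = [
    {"key": "TX", "doc_count": 30, "average_balance": 26073.3},
    {"key": "MD", "doc_count": 28, "average_balance": 26161.5},
    {"key": "ID", "doc_count": 27, "average_balance": 24368.8},
    {"key": "RI", "doc_count": 5, "average_balance": 36035.4},
    {"key": "DE", "doc_count": 2, "average_balance": 39040.5},
]


def top(buckets, size, order_by):
    """Sort all buckets by `order_by` descending, then keep the first `size`."""
    ranked = sorted(buckets, key=lambda b: b[order_by], reverse=True)
    return [b["key"] for b in ranked[:size]]


by_count = top(buckets, 3, "doc_count")
by_average = top(buckets, 3, "average_balance")
print(by_count)
print(by_average)
```

Ordering by average lets tiny buckets such as DE (2 documents) rise to the top. As far as I know, the -1 in doc_count_error_upper_bound above means the error bound cannot be computed when buckets are ordered by a sub-aggregation.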
