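Elastic Stack Notes (4): Elasticsearch 5.6 Index and Document Management

Reposted article, 2018-06-03 22:17. Tags: Kibana / Elasticsearch / ELK

Blog: http://www.moonxy.com

1. Introduction

In Elasticsearch, operations such as indexing documents can be performed either through the RESTful interface or through the Java client. This article covers document indexing and management over the RESTful interface; programming against the Java client is covered in later chapters.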

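When using the RESTful API, there are four main options:

Option 1: Use the curl command in a terminal. If curl is not installed, install it with the command for your system: yum -y install curl on CentOS, or sudo apt-get install curl on Ubuntu.

Option 2: Use the Elasticsearch-head plugin, whose tabs let you perform different operations.

Option 3: Use Dev Tools in Kibana, the official Elastic Stack tool built for running this kind of request. Its predecessor is the Sense plugin for Chrome.

Option 4: Use the Sense plugin for Chrome, whose only drawback is weak support for Chinese text. The Postman plugin is another choice.

The rest of this article uses option 3, the Dev Tools provided by Kibana.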

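Before creating an index, it helps to understand the calling style of a RESTful API. When managing and using the Elasticsearch service, five HTTP verbs are commonly used:

GET request: retrieves an object from the server

Equivalent to the SQL SELECT command
GET /blogs: list all blogs

POST request: submits data to the server, typically to update an object or trigger an operation

Roughly equivalent to the SQL UPDATE command
POST /blogs/ID: update the specified blog

PUT request: creates an object at a known location on the server (or replaces the object already there)

Roughly equivalent to the SQL INSERT command
PUT /blogs/ID: create a blog with the given ID

DELETE request: deletes an object from the server

Equivalent to the SQL DELETE command
DELETE /blogs/ID: delete the specified blog

HEAD request: retrieves only basic information about an object, without the response body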
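As a rough illustration of the five verbs, the Python sketch below builds one request per verb with the standard library and prints it without sending anything. The host and port are taken from the curl examples later in this article; the helper name build_request is made up for this example.

```python
import urllib.request

# Host and port follow this article's curl examples; adjust for your cluster.
ES_URL = "http://192.168.56.110:9200"

def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the given verb and path."""
    data = body.encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        f"{ES_URL}/{path.lstrip('/')}", data=data, method=method
    )
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

for verb in ("GET", "POST", "PUT", "DELETE", "HEAD"):
    req = build_request(verb, "blog")
    print(req.get_method(), req.full_url)
```

Passing a request to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without a cluster.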

2. Index Management

2.1 Creating an index

PUT blog

Response:

{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "blog"
}

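Unless otherwise noted, all commands in this article are run in Kibana's Dev Tools.

When creating an index, you can also specify its number of shards and replicas. If you do not, Elasticsearch 5.6 uses its defaults of 5 primary shards and 1 replica. The number of primary shards cannot be changed in place once the index has been created, whereas the number of replicas can be changed at any time. To create an index with a custom number of shards and replicas, pass the initial values through the settings parameter at creation time.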

PUT blog
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    }
}
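The same index-creation request can be assembled outside Dev Tools. The following is a minimal sketch, assuming the host used in this article's curl examples; create_index_request is a hypothetical helper name, and the request is only built, not sent.

```python
import json
import urllib.request

# Host from this article's curl examples; adjust as needed.
ES_URL = "http://192.168.56.110:9200"

def create_index_request(index, shards=5, replicas=1):
    """Build a PUT request that creates `index` with explicit shard/replica counts."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        f"{ES_URL}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = create_index_request("blog", shards=5, replicas=1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```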

2.2 Viewing an index

Sending a GET request to {index}/_settings returns the index's detailed configuration.

GET blog/_settings

Response:

{
   "blog": {
      "settings": {
         "index": {
            "creation_date": "1527989901762",
            "number_of_shards": "5",
            "number_of_replicas": "1",
            "uuid": "HKq5wwD3S8ObJq7Q2tZn2g",
            "version": {
               "created": "5060099"
            },
            "provided_name": "blog"
         }
      }
   }
}

The equivalent curl command is:

curl -XGET 'http://192.168.56.110:9200/blog/_settings'

or

curl -XGET "http://192.168.56.110:9200/blog/_settings"

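On Linux, either of the two curl commands above works: the URL can be wrapped in single or double quotes.

On Windows, only the second form (double quotes) works. With single quotes, curl fails with the error curl: (1) Protocol "'http" not supported or disabled in libcurl, as shown below: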

C:\Users\Administrator>curl -XGET 'http://192.168.56.110:9200/blog/_settings'
curl: (1) Protocol "'http" not supported or disabled in libcurl

C:\Users\Administrator>curl -XGET "http://192.168.56.110:9200/blog/_settings"
{"blog":{"settings":{"index":{"creation_date":"1527989901762","number_of_shards":"5","number_of_replicas":"1","uuid":"HKq5wwD3S8ObJq7Q2tZn2g","version":{"created":"5060099"},"provided_name":"blog"}}}}

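2.3 Updating the number of replicas

Elasticsearch lets you change the number of replicas of an existing index. The request goes to the index's _settings endpoint (a plain PUT blog would try to create the index again and fail because it already exists):

PUT blog/_settings
{
    "index": {
        "number_of_replicas": 1
    }
}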
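To make the effect of the replica count concrete, the sketch below builds the JSON body for a replica update and works out how many shard copies an index ends up with. Both function names are hypothetical; the arithmetic follows from each primary shard having number_of_replicas additional copies.

```python
import json

def replica_update_body(replicas):
    """JSON body for PUT {index}/_settings that changes only the replica count."""
    return json.dumps({"index": {"number_of_replicas": replicas}})

def total_shard_copies(primaries, replicas):
    """Each primary has `replicas` extra copies: copies = primaries * (1 + replicas)."""
    return primaries * (1 + replicas)

print(replica_update_body(1))
for replicas in (0, 1, 2):
    print(replicas, "replica(s) ->", total_shard_copies(5, replicas), "shard copies")
```

With the defaults described earlier (5 primaries, 1 replica), this gives 10 shard copies in total.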

2.4 Closing and opening an index

A closed index blocks both reads and writes of its data, while an open index allows users to operate on its data.

Close the index:

POST blog/_close

Open the index:

POST blog/_open

2.5 Deleting an index

DELETE blog

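3. Document Management

A mapping defines the types and attributes of a document's fields. Mappings are either dynamic or static. In a relational database you must create a table, declaring each column's attributes, before writing any data. Elasticsearch does not require this: one of its most important features is that it lets you start exploring your data as quickly as possible. When a document is written, Elasticsearch infers each field's type from its value automatically, which is called dynamic mapping. With static mapping, you set the field attributes by hand before writing data.

So far we have only created an index and have not set a mapping. Check the index's current mapping: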

GET blog/_mapping

Response:

{
    "blog": {
        "mappings": {}
    }
}

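The mapping is empty because we only created the index and never configured a mapping.

Next, add a type named article to the blog index and set its mapping. The content field uses the ik_max_word analyzer, which requires the IK analysis plugin to be installed: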

POST blog/article/_mapping
{
    "article": {
        "properties": {
            "id": {
                "type": "text"
            },
            "title": {
                "type": "text"
            },
            "postdate": {
                "type": "date"
            },
            "content": {
                "type": "text",
                "analyzer": "ik_max_word"
            }
        }
    }
}

Response:

{
    "acknowledged": true
}

You can also define the mapping when creating the index:

PUT forum
{
    "mappings": {
        "article": {
            "properties": {
                "id": {
                    "type": "text"
                },
                "title": {
                    "type": "text"
                },
                "postdate": {
                    "type": "date"
                },
                "content": {
                    "type": "text",
                    "analyzer": "ik_max_word"
                }
            }
        }
    }
}

Response:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "forum"
}
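The mapping bodies above share one shape: a type name, a properties object, and one entry per field. A small sketch of generating that shape from a field list follows; build_mapping is a hypothetical helper written for this example, and the output mirrors the 5.x type-based layout used in this article.

```python
import json

def build_mapping(type_name, fields):
    """Return a 5.x-style mapping body: {type: {"properties": {field: {...}}}}."""
    properties = {}
    for name, spec in fields.items():
        # A spec is either a type name or a (type, analyzer) pair.
        field_type, analyzer = spec if isinstance(spec, tuple) else (spec, None)
        properties[name] = {"type": field_type}
        if analyzer:
            properties[name]["analyzer"] = analyzer
    return {type_name: {"properties": properties}}

article_mapping = build_mapping("article", {
    "id": "text",
    "title": "text",
    "postdate": "date",
    "content": ("text", "ik_max_word"),
})

# Body for PUT forum, where the mapping is defined at index-creation time.
create_body = {"mappings": article_mapping}
print(json.dumps(create_body, indent=4))
```

The article_mapping dictionary on its own matches the body sent to blog/article/_mapping, while create_body matches the body of PUT forum.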

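3.1 Creating a document

Note the difference between PUT and POST here. In general REST usage, POST is considered the method for creating resources (it is not idempotent), while PUT is used for updating resources (it is idempotent). In Elasticsearch, a PUT with an explicit ID, as below, creates the document if it does not exist and replaces it otherwise.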

PUT blog/article/1
{
    "id": "1001",
    "title": "Git簡單介紹",
    "postdate": "2018-06-03",
    "content": "Git是一款免費、開源的分布式版本控制系統"
}

Response:

{
  "_index": "blog",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "created": true
}

3.2 Retrieving a document

GET blog/article/1

Response:

{
  "_index": "blog",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "id": "1001",
    "title": "Git簡單介紹",
    "postdate": "2018-06-03",
    "content": "Git是一款免費、開源的分布式版本控制系統"
  }
}

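The first four fields of the response give the document's location and version, found indicates whether the document was found, and _source holds the document's content.

If a document does not exist, for example: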

GET blog/article/1008611

Response:

{
  "_index": "blog",
  "_type": "article",
  "_id": "1008611",
  "found": false
}

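Here found is false. Because the document does not exist, there is no version information and no _source field either.

The HEAD method checks whether an index or a document exists: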

HEAD blog/article/1

Response:

200 - OK

If the document exists, the response is "200 - OK"; otherwise it is "404 - Not Found".
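The 200/404 distinction can be turned into a boolean check. The sketch below is one way to do it with the standard library, assuming the host from this article's curl examples; the function name exists and its opener parameter (which allows substituting a fake transport for testing) are choices made for this example. Note that urllib raises HTTPError for a 404 instead of returning a response.

```python
import urllib.error
import urllib.request

# Host from this article's curl examples; adjust as needed.
ES_URL = "http://192.168.56.110:9200"

def exists(path, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `path` answers 200, False if it answers 404."""
    req = urllib.request.Request(f"{ES_URL}/{path.lstrip('/')}", method="HEAD")
    try:
        with opener(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other status is a real error, not "missing"

# Example (needs a reachable cluster):
# print(exists("blog/article/1"))
```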

3.3 Updating a document

POST blog/article/1/_update
{
    "doc": {
        "title":"Git分布式版本控制工具簡介"
    }
}

Response:

{
  "_index": "blog",
  "_type": "article",
  "_id": "1",
  "_version": 2,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  }
}
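A partial update merges the fields under doc into the existing _source, and by default Elasticsearch reports "noop" when the merge would change nothing (for example, when the same update is sent twice). The sketch below imitates that behavior locally on the article's sample document. It is a simplified model, not Elasticsearch's implementation: apply_partial_update is a hypothetical helper, and it merges only top-level fields.

```python
def apply_partial_update(source, doc):
    """Merge `doc` into a copy of `source`; report "noop" when nothing would change."""
    merged = dict(source)
    merged.update(doc)  # a shallow merge is enough for this flat example
    result = "noop" if merged == source else "updated"
    return merged, result

source = {
    "id": "1001",
    "title": "Git簡單介紹",
    "postdate": "2018-06-03",
    "content": "Git是一款免費、開源的分布式版本控制系統",
}
doc = {"title": "Git分布式版本控制工具簡介"}

updated, first = apply_partial_update(source, doc)
_, second = apply_partial_update(updated, doc)
print(first, second)  # prints: updated noop
```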

3.4 Deleting a document

DELETE blog/article/1

Response:

{
  "found": true,
  "_index": "blog",
  "_type": "article",
  "_id": "1",
  "_version": 3,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  }
}

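The total value in _shards is 2, meaning the operation had to be applied to two shard copies (one primary and one replica). successful is also 2, meaning the document was deleted from both the primary copy and the replica copy, so failed is 0.

3.5 Bulk operations

Besides the APIs that operate on one document at a time, Elasticsearch also provides a bulk mechanism: the Bulk API can index, delete, and update documents in batches.

You can download a sample dataset from https://github.com/bly2k/files/blob/master/accounts.zip, extract it into the /opt/elk directory, and then index the documents in bulk: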

[root@masternode elk]# pwd
/opt/elk
[root@masternode elk]# ll
total 240
-rwxr-xr-x   1 root   root   244848 Jun  3 20:15 accounts.json
drwxr-xr-x.  9 esuser esuser    155 May 26 23:08 elasticsearch-5.6.0
drwxrwxr-x  12 esuser esuser    232 Sep  7  2017 kibana-5.6.0-linux-x86_64
[root@masternode elk]# curl -XPOST 'http://192.168.56.110:9200/bank/account/_bulk?pretty' --data-binary @accounts.json

List the indices:

GET _cat/indices?v

Response:

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   forum   9M7dtsGDSTWOJAKTJxejXQ   5   1          0            0      1.5kb           810b
green  open   moonxy  rkQ2UUsZR-iSyOIt1HJzdw   5   1          0            0      1.8kb           955b
green  open   .kibana 1lJLwJHZTKWgbNfBpzIXXA   1   1          2            0     17.5kb          8.7kb
green  open   blog    HKq5wwD3S8ObJq7Q2tZn2g   5   1          2            0     18.4kb          9.2kb
green  open   bank    zFkIZRe5Rj6gwhejxV-Wog   5   1       1000            0      1.2mb        640.8kb
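The Bulk API expects a newline-delimited body: for an index action, one line describing the action followed by one line with the document source, and a newline at the very end. The sketch below builds such a body from a list of documents; bulk_index_body is a hypothetical helper, and the two sample documents are shortened versions of the blog articles used in this post.

```python
import json

def bulk_index_body(docs):
    """Build a Bulk API body: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the body must end with a newline

docs = [
    ("1", {"id": "1001", "title": "Git簡單介紹"}),
    ("2", {"id": "1002", "title": "SVN簡介"}),
]
body = bulk_index_body(docs)
print(body, end="")
# Send this as the request body of POST {index}/{type}/_bulk, e.g. blog/article/_bulk.
```

This is also why the curl command above uses --data-binary: unlike -d, it sends the file without stripping the newlines.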

4. Querying Data

4.1 Returning all documents

GET blog/article/_search

Response:

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "blog",
        "_type": "article",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": "1002",
          "title": "SVN簡介",
          "postdate": "2018-06-03",
          "content": "SVN是一款免費、開源的集中式版本控制系統"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": "1001",
          "title": "Git簡單介紹",
          "postdate": "2018-06-03",
          "content": "Git是一款免費、開源的分布式版本控制系統"
        }
      }
    ]
  }
}

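In the response above, took is the time the operation took (in milliseconds), timed_out indicates whether it timed out, and hits holds the matching records. Its subfields are:

total: the total number of matching records, 2 in this example.

max_score: the highest relevance score, 1.0 in this example.

hits: the array of returned records.

Each returned record has a _score field giving its relevance to the query; by default, results are sorted by this field in descending order.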
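Reading these fields programmatically is mostly a matter of walking the JSON. The sketch below parses a trimmed copy of the search response shown above and pulls out the total and one row per hit; summarize is a hypothetical helper name.

```python
import json

# Trimmed copy of the search response shown above.
response_text = """
{
  "took": 12,
  "timed_out": false,
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {"_id": "2", "_score": 1, "_source": {"title": "SVN簡介"}},
      {"_id": "1", "_score": 1, "_source": {"title": "Git簡單介紹"}}
    ]
  }
}
"""

def summarize(response):
    """Return (total, [(id, score, title), ...]) from a 5.x search response."""
    hits = response["hits"]
    rows = [(h["_id"], h["_score"], h["_source"]["title"]) for h in hits["hits"]]
    return hits["total"], rows

total, rows = summarize(json.loads(response_text))
print("total:", total)
for doc_id, score, title in rows:
    print(doc_id, score, title)
```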

4.2 Full-text search

GET blog/article/_search
{
  "query" : { "match" : { "content" : "分布式版本控制工具" }}
}

Response:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.6194323,
    "hits": [
      {
        "_index": "blog",
        "_type": "article",
        "_id": "1",
        "_score": 1.6194323,
        "_source": {
          "id": "1001",
          "title": "Git簡單介紹",
          "postdate": "2018-06-03",
          "content": "Git是一款免費、開源的分布式版本控制系統"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "2",
        "_score": 0.80971617,
        "_source": {
          "id": "1002",
          "title": "SVN簡介",
          "postdate": "2018-06-03",
          "content": "SVN是一款免費、開源的集中式版本控制系統"
        }
      }
    ]
  }
}

The two documents have different _score values: the document that contains more of the query's terms gets the higher score.
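The intuition that more matching terms means a higher rank can be shown with a toy overlap count. This is only an illustration: Elasticsearch's real scoring also weighs term frequency, term rarity, and field length, and the term lists below are hand-picked for the example, not actual ik_max_word output.

```python
def overlap_count(query_terms, doc_terms):
    """Count how many distinct query terms also occur in the document."""
    return len(set(query_terms) & set(doc_terms))

# Hand-picked terms for illustration only; not actual ik_max_word output.
query = ["分布式", "版本", "控制", "工具"]
docs = {
    "1": ["Git", "免費", "開源", "分布式", "版本", "控制", "系統"],
    "2": ["SVN", "免費", "開源", "集中式", "版本", "控制", "系統"],
}

ranked = sorted(docs, key=lambda d: overlap_count(query, docs[d]), reverse=True)
for doc_id in ranked:
    print(doc_id, overlap_count(query, docs[doc_id]))
```

Under these assumed terms, document 1 shares three terms with the query and document 2 shares two, which matches the order of the scores in the response above.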
