Elasticsearch: The join data type

This article is a repost. 2019-12-23 16:20 · ELK Stack

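In Elasticsearch, the join data type lets us create parent/child relationships. Elasticsearch is not an RDBMS, and the join data type should generally be avoided unless there is no other option. So why does Elasticsearch need a join data type at all?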

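In Elasticsearch, updating an object requires a complete reindex of its root object:

This holds even when only a single character of a single field changes
Even a nested object needs a complete reindex of its root document before the change becomes searchable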

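In most cases this is perfectly fine, but in scenarios with frequent updates it can have a significant impact on performance.

If your data needs frequent updates and those updates are hurting performance, the join data type may be a solution.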

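The join data type lets you separate two objects completely while still preserving the relationship between them.

The parent and the child are two completely separate documents
The parent can be updated on its own without reindexing the children
Children can be added, modified, or deleted freely without affecting the parent or the other children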

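Like the nested type, a parent/child relationship lets you associate different entities with each other, but the two differ in implementation and behavior. Unlike nested documents, which live inside the same document as their root, parent and child documents are completely independent documents. They follow a one-to-many principle: you define one type as the parent type and one or more types as child types.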

Although the join data type is convenient, it also brings additional memory and computation overhead at search time.

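Note: Kibana currently has limited support for the nested and join data types. If you want to display this data in a Kibana dashboard, you need to take that into account. This may change in the future.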

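**The join data type is a special field that creates parent/child relationships between documents in the same index. The relations section defines a set of possible relations within the documents, each relation being a parent name and a child name.**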

An example:

    PUT my_index
    {
      "mappings": {
        "properties": {
          "my_join_field": { 
            "type": "join",
            "relations": {
              "question": "answer" 
            }
          }
        }
      }
    }

Here we define an index called my_index. In it we define a field named my_join_field whose type is join. We also define a single relation: question is the parent of answer.
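If you generate mappings from application code, the same request body can be assembled programmatically. The following is a minimal Python sketch that only builds the body as a dictionary; the helper name `join_mapping` is made up for illustration and is not part of any Elasticsearch client.

```python
import json


def join_mapping(field_name, relations):
    """Build an index-creation body containing a single join field.

    `relations` maps each parent relation name to one child name
    or to a list of child names. (Hypothetical helper.)
    """
    return {
        "mappings": {
            "properties": {
                field_name: {"type": "join", "relations": dict(relations)}
            }
        }
    }


# Same mapping as the PUT my_index request above.
body = join_mapping("my_join_field", {"question": "answer"})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the PUT my_index request.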

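To index a document with a join, you must provide the name of the relation and, optionally, the document's parent in the source. For example, the following creates two parent documents in the question context: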

    PUT my_index/_doc/1?refresh
    {
      "text": "This is a question",
      "my_join_field": {
        "name": "question" 
      }
    }
     
    PUT my_index/_doc/2?refresh
    {
      "text": "This is another question",
      "my_join_field": {
        "name": "question"
      }
    }

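Here refresh forces a refresh so that the documents are immediately available to the searches that follow. The name value question marks each document as a question document.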

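When indexing a parent document, you can use a shortcut and specify only the name of the relation instead of wrapping it in the full object notation: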

    PUT my_index/_doc/1?refresh
    {
      "text": "This is a question",
      "my_join_field": "question" 
    }
     
    PUT my_index/_doc/2?refresh
    {
      "text": "This is another question",
      "my_join_field": "question"
    }

This is equivalent to the previous form. The only difference is that we write just question instead of expressing it as an object, as the first form does:

    "my_join_field": {
      "name": "question"
    }

In practice, you can use whichever form you prefer.
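Application code that reads documents back has to cope with both notations. As a rough sketch, a small Python helper (the name `normalize_join_value` is hypothetical) can turn either form into the object form:

```python
def normalize_join_value(value):
    """Return the object form of a join field value.

    A bare string such as "question" is the shortcut for
    {"name": "question"}; an object is returned as a copy.
    """
    if isinstance(value, str):
        return {"name": value}
    return dict(value)


shortcut = normalize_join_value("question")
full = normalize_join_value({"name": "question"})
print(shortcut == full)
```

A child value such as {"name": "answer", "parent": "1"} passes through unchanged, since children always need the object form to carry the parent id.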

When indexing a child, you must add the name of the relation and the id of the document's parent in the _source.

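Note: the whole lineage of a parent must be indexed on the same shard, so a child must always be routed with its parent's id to guarantee that the child and the parent land on the same shard. By default, the shard for a document is chosen by hashing the document's id; this can be overridden with the routing parameter. For a child we pass the parent's id as the routing value, which guarantees that both end up together. Otherwise, joining the data across shards would be extremely expensive.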

For example, the following shows how to index two child documents:

    PUT my_index/_doc/3?routing=1&refresh  (1)
    {
      "text": "This is an answer",
      "my_join_field": {
        "name": "answer",   (2)
        "parent": "1"       (3)
      }
    }
     
    PUT my_index/_doc/4?routing=1&refresh
    {
      "text": "This is another answer",
      "my_join_field": {
        "name": "answer",
        "parent": "1"
      }
    }

At (1) we must use routing so that the parent and the child end up on the same shard. The routing value is 1 because the parent's id is 1, as set at (3). (2) sets the relation name for this document.
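To see why a shared routing value keeps parent and child together, here is a rough Python sketch. It assumes shard selection is roughly "hash of the routing value modulo the number of primary shards"; `zlib.crc32` is only a stand-in for the hash Elasticsearch uses internally, and both helper names are hypothetical.

```python
import zlib


def pick_shard(routing_value, number_of_shards):
    """Illustrative shard choice: stand-in hash modulo shard count."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


def child_index_request(index, doc_id, relation, parent_id, text):
    """Build the path and body for indexing a child routed by its parent id."""
    path = f"{index}/_doc/{doc_id}?routing={parent_id}&refresh"
    body = {"text": text, "my_join_field": {"name": relation, "parent": parent_id}}
    return path, body


path, body = child_index_request("my_index", "3", "answer", "1", "This is an answer")
print(path)

# The parent (id 1) is routed by its own id; the child reuses that value,
# so both calls hash the same input and pick the same shard.
parent_shard = pick_shard("1", 5)
child_shard = pick_shard("1", 5)
print(parent_shard == child_shard)
```

Without the routing parameter the child would be routed by its own id ("3" here), which may or may not map to the parent's shard.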

Parent-join and performance

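The join field should not be used like a join in a relational database. In Elasticsearch, the key to good performance is denormalizing your data into documents. Every join field, and every has_child or has_parent query, adds a significant cost to query performance.

The only case where a join field makes sense is when your data contains a one-to-many relationship in which one entity significantly outnumbers the other. An example is a use case with products and offers for those products. If offers significantly outnumber products, it makes sense to model the product as the parent document and the offer as the child document.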
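For reference, the two query types named above have the following shape. This Python sketch only builds the request bodies as dictionaries; the builder function names are made up, and the relation names and match text are borrowed from the question/answer example in this article.

```python
def has_child_query(child_type, child_query):
    """Body for a search that returns parents with at least one matching child."""
    return {"query": {"has_child": {"type": child_type, "query": child_query}}}


def has_parent_query(parent_type, parent_query):
    """Body for a search that returns children whose parent matches."""
    return {"query": {"has_parent": {"parent_type": parent_type, "query": parent_query}}}


# Questions that have an answer mentioning "another".
questions = has_child_query("answer", {"match": {"text": "another"}})

# Answers whose question mentions "question".
answers = has_parent_query("question", {"match": {"text": "question"}})

print(questions)
print(answers)
```

Each of these is the kind of query that carries the join cost at search time, so they are worth using sparingly.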

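Parent-join restrictions

Only one join field is allowed per index
Parent and child documents must be indexed on the same shard. This also means that the same routing value must be supplied when getting, deleting, or updating a child document.
An element can have multiple children but only one parent.
A new relation can be added to an existing join field
A child can also be added to an existing element, but only if that element is already a parent.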
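The "only one parent" rule can be checked before a mapping is sent. The following is a small Python sketch, not something Elasticsearch provides; `parent_of` is a hypothetical helper that inverts a relations object and rejects a child name listed under two parents.

```python
def parent_of(relations):
    """Invert {parent: child or [children]} into {child: parent}.

    Raises ValueError if a child name appears under more than one parent.
    """
    result = {}
    for parent, children in relations.items():
        if isinstance(children, str):
            children = [children]
        for child in children:
            if child in result:
                raise ValueError(
                    f"{child!r} has two parents: {result[child]!r} and {parent!r}"
                )
            result[child] = parent
    return result


lookup = parent_of({"question": ["answer", "comment"], "answer": "vote"})
print(lookup)
```

A name may still be both a child and a parent, as answer is here; only a second parent for the same child is rejected.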

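Searching with parent-join

The parent-join creates one field to index the name of the relation within the document (my_parent, my_child, ...).

It also creates one field per parent/child relation. The name of this field is the name of the join field followed by # and the name of the parent in the relation. For example, for the relation my_parent => [my_child, another_child], the join field creates an additional field named my_join_field#my_parent.

If the document is a child (my_child or another_child), this field contains the id of the parent the document is linked to; if the document is a parent (my_parent), it contains the document's own _id.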
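The naming rule and the stored value can be expressed in a few lines. This Python sketch mirrors the description above rather than any Elasticsearch internals, and both function names are hypothetical.

```python
def parent_join_field(join_field, parent_relation):
    """Name of the extra field: join field name, '#', parent relation name."""
    return f"{join_field}#{parent_relation}"


def hidden_field_value(doc_id, join_value):
    """Value held by the extra field for one document.

    A child stores the id of its parent; a parent stores its own _id.
    """
    if isinstance(join_value, dict) and "parent" in join_value:
        return join_value["parent"]
    return doc_id


print(parent_join_field("my_join_field", "question"))
print(hidden_field_value("3", {"name": "answer", "parent": "1"}))
print(hidden_field_value("1", "question"))
```

For the question/answer mapping used in this article, both the question with id 1 and its answers therefore carry the value 1 in my_join_field#question.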

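When searching an index that contains a join field, the join field is always returned in the search response.

The description above is a little convoluted, so let's look at an example: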

    GET my_index/_search
    {
      "query": {
        "match_all": {}
      },
      "sort": ["_id"]
    }

Here we search for all documents and sort them by _id:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 4,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : null,
            "_source" : {
              "text" : "This is a question",
              "my_join_field" : "question" (1)
            },
            "sort" : [
              "1"
            ]
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : null,
            "_source" : {
              "text" : "This is another question",
              "my_join_field" : "question" (2)
            },
            "sort" : [
              "2"
            ]
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : null,
            "_routing" : "1",
            "_source" : {
              "text" : "This is an answer",
              "my_join_field" : {
                "name" : "answer", (3)
                "parent" : "1"     (4)
              }
            },
            "sort" : [
              "3"
            ]
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : null,
            "_routing" : "1",
            "_source" : {
              "text" : "This is another answer",
              "my_join_field" : {
                "name" : "answer",
                "parent" : "1"
              }
            },
            "sort" : [
              "4"
            ]
          }
        ]
      }
    }

Here we can see four documents:

(1) marks this document as a question
(2) marks this document as a question
(3) marks this document as an answer
(4) shows that this document's parent is the document with id 1
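On the client side, the join field in each hit is enough to rebuild the parent/child grouping. Here is a short Python sketch over a trimmed copy of the hits above; the function name `children_by_parent` is made up for illustration.

```python
from collections import defaultdict

# Trimmed copy of the four hits from the response above.
hits = [
    {"_id": "1", "_source": {"my_join_field": "question"}},
    {"_id": "2", "_source": {"my_join_field": "question"}},
    {"_id": "3", "_source": {"my_join_field": {"name": "answer", "parent": "1"}}},
    {"_id": "4", "_source": {"my_join_field": {"name": "answer", "parent": "1"}}},
]


def children_by_parent(hits, join_field="my_join_field"):
    """Group child document ids under the id of their parent."""
    grouped = defaultdict(list)
    for hit in hits:
        value = hit["_source"][join_field]
        # Only the object form with a "parent" key denotes a child.
        if isinstance(value, dict) and "parent" in value:
            grouped[value["parent"]].append(hit["_id"])
    return dict(grouped)


print(children_by_parent(hits))  # {'1': ['3', '4']}
```

Question 2 has no children yet, so it does not appear in the result.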

Parent-join queries and aggregations

The value of the join field can be accessed in aggregations and scripts, and it can be queried with the parent_id query:

    GET my_index/_search
    {
      "query": {
        "parent_id": { 
          "type": "answer",
          "id": "1"
        }
      }
    }

By querying on parent_id, we get back all documents of type answer whose parent id is 1:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.35667494,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 0.35667494,
            "_routing" : "1",
            "_source" : {
              "text" : "This is another answer",
              "my_join_field" : {
                "name" : "answer",
                "parent" : "1"
              }
            }
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.35667494,
            "_routing" : "1",
            "_source" : {
              "text" : "This is an answer",
              "my_join_field" : {
                "name" : "answer",
                "parent" : "1"
              }
            }
          }
        ]
      }
    }

Here we can see that the documents with ids 3 and 4 are returned. We can also run an aggregation over these documents:

    GET my_index/_search
    {
      "query": {
        "parent_id": {
          "type": "answer",
          "id": "1"
        }
      },
      "aggs": {
        "parents": {
          "terms": {
            "field": "my_join_field#question",
            "size": 10
          }
        }
      },
      "script_fields": {
        "parent": {
          "script": {
            "source": "doc['my_join_field#question']"
          }
        }
      }
    }

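As described in the previous section, indexing also creates an extra field in our example, even though it is not visible in the source. That field is my_join_field#question, and it holds the parent _id. In the query above we first find all answer documents whose parent id is 1, then aggregate those documents by parent id: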

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.35667494,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 0.35667494,
            "_routing" : "1",
            "fields" : {
              "parent" : [
                "1"
              ]
            }
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.35667494,
            "_routing" : "1",
            "fields" : {
              "parent" : [
                "1"
              ]
            }
          }
        ]
      },
      "aggregations" : {
        "parents" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "1",
              "doc_count" : 2
            }
          ]
        }
      }
    }
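The buckets above are simply a count of matching documents per value of my_join_field#question. The following Python sketch reproduces that count locally to make the result easier to read; it is an illustration over hand-copied data, not how Elasticsearch computes aggregations.

```python
from collections import Counter

# (document _id, value of my_join_field#question) for the two matching answers.
matching_children = [("4", "1"), ("3", "1")]

# Count documents per parent id, most frequent first, like a terms aggregation.
counts = Counter(parent_id for _, parent_id in matching_children)
buckets = [{"key": key, "doc_count": count} for key, count in counts.most_common()]

print(buckets)  # [{'key': '1', 'doc_count': 2}]
```

Because the query already restricts the results to children of parent 1, a single bucket with key 1 is the only possible outcome here.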

One parent with multiple children

A single parent can have several child relations defined, for example:

    PUT my_index
    {
      "mappings": {
        "properties": {
          "my_join_field": {
            "type": "join",
            "relations": {
              "question": ["answer", "comment"]  
            }
          }
        }
      }
    }

Here, question is the parent of both answer and comment.

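Multi-level parent join

Multiple levels of relations are possible, although this is not recommended because it can add further memory and computation overhead at query time: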

    PUT my_index
    {
      "mappings": {
        "properties": {
          "my_join_field": {
            "type": "join",
            "relations": {
              "question": ["answer", "comment"],  
              "answer": "vote" 
            }
          }
        }
      }
    }

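Here question is the parent of answer and comment, and answer is in turn the parent of vote. This gives the following hierarchy: question sits at the top with comment and answer as its children, and vote is the child of answer.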

Indexing a grandchild document requires a routing value equal to the id of the grandparent (the topmost parent in the lineage):

    PUT my_index/_doc/3?routing=1&refresh 
    {
      "text": "This is a vote",
      "my_join_field": {
        "name": "vote",
        "parent": "2" 
      }
    }

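This child document must be on the same shard as its grandparent, so it uses a routing value of 1, the id of the question. At the same time, the parent of a vote must be its direct parent, so parent is set to 2, the id of the answer.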
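The routing value for any document in a lineage can be found by walking up to the topmost ancestor. The sketch below assumes the application keeps its own map from document id to direct parent id (here question 1, answer 2, vote 3); `routing_for` is a hypothetical helper, not an Elasticsearch API.

```python
def routing_for(doc_id, parent_of_doc):
    """Return the id of the topmost ancestor, which serves as the routing value."""
    seen = set()
    current = doc_id
    while current in parent_of_doc:
        if current in seen:
            raise ValueError(f"cycle detected at document {current!r}")
        seen.add(current)
        current = parent_of_doc[current]
    return current


# Direct-parent links: answer 2 -> question 1, vote 3 -> answer 2.
parent_of_doc = {"2": "1", "3": "2"}

print(routing_for("3", parent_of_doc))  # 1
print(routing_for("2", parent_of_doc))  # 1
print(routing_for("1", parent_of_doc))  # 1
```

All three documents resolve to the same routing value, which is what keeps the whole lineage on one shard.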
