Elasticsearch: inverted index, doc_values, and source

Reposted article, 2019-12-23 15:44, ELK Stack

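Background that will come in handy later: disabling search, sorting, and similar operations on certain fields of an index.

When learning Elasticsearch, we keep running into the following concepts:

Inverted index
doc_values
source

What do these concepts refer to? What are they for, and how are they configured? Only once we have a solid grasp of them can we use them correctly.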

Inverted index

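The inverted index is the core data structure of Elasticsearch and of any other system that supports full-text search. It resembles the index found at the back of a book: it maps the terms that appear in documents to the documents that contain them.

For example, you can build an inverted index from the following strings: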

Elasticsearch builds a data structure from the three documents that have been indexed. The following data structure is called an inverted index:

    Term        Frequency   Documents (postings)
    choice      1           3
    day         1           2
    is          3           1,2,3
    it          1           1
    last        1           2
    of          1           2
    sunday      2           1,2
    the         3           2,3
    tomorrow    1           1
    week        1           2
    yours       1           3

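Here, "inverted" means that we look up document IDs by term, the reverse of the usual lookup of terms by document ID.

Note the following points:

Documents are broken down into terms after punctuation is removed and the text is lowercased.
Terms are sorted alphabetically.
The "Frequency" column records how many times the term appears across the entire document set.
The third column records the documents in which the term was found. It may also hold the exact positions (offsets within the document) at which the term occurs.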
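The points above can be sketched in a few lines of Python. The three sentences are an assumption: the source strings are not shown in the text, so they were chosen to be consistent with the table. `tokenize` and `build_inverted_index` are hypothetical helpers, and this is a toy model, not the way Lucene actually stores its postings.

```python
import re
from collections import defaultdict

# Assumed example documents (not given in the text), chosen to match the table.
DOCS = {
    1: "It is Sunday tomorrow.",
    2: "Sunday is the last day of the week.",
    3: "The choice is yours.",
}

def tokenize(text):
    """Lowercase the text and split it into terms, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to (total frequency, sorted list of document IDs)."""
    frequency = defaultdict(int)
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            frequency[term] += 1
            postings[term].add(doc_id)
    # Terms are emitted in alphabetical order, as in the table.
    return {t: (frequency[t], sorted(postings[t])) for t in sorted(postings)}

index = build_inverted_index(DOCS)
for term, (freq, doc_ids) in index.items():
    print(f"{term:<10}{freq:<4}{doc_ids}")
```

Keeping the terms sorted is what makes the lookup of a single term cheap; a real engine uses far more compact structures for the same purpose.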

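Looking up the documents that contain a given term is very fast. If a user searches for the term "sunday", finding sunday in the "Term" column is quick because the terms in the index are sorted. Even with millions of terms, a sorted term list can be searched quickly.

Next, consider a user who searches for two words, for example last sunday. The inverted index can be used to look up the occurrences of last and sunday separately; document 2 contains both terms, so it is a better match than document 1, which contains only one of them.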
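The two-word search can be illustrated with a minimal sketch. The postings are copied from the table; `search` is a hypothetical helper that ranks documents by the number of query terms they contain. This count is only a stand-in for illustration, since Elasticsearch's actual relevance scoring is more elaborate.

```python
from collections import Counter

# Postings taken from the table: term -> IDs of documents containing it.
POSTINGS = {
    "last": [2],
    "sunday": [1, 2],
}

def search(query, postings):
    """Rank documents by how many of the query terms they contain."""
    matched = Counter()
    for term in query.lower().split():
        for doc_id in postings.get(term, []):
            matched[doc_id] += 1
    # Highest count first; ties broken by document ID for a stable order.
    return sorted(matched.items(), key=lambda item: (-item[1], item[0]))

results = search("last sunday", POSTINGS)
print(results)  # [(2, 2), (1, 1)]
```

Document 2 comes first because it matches both terms, while document 1 matches only sunday.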

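The inverted index is the foundation of fast search. It also makes it easy to find out how many times a term occurs in the index, which is a simple count aggregation. Of course, Elasticsearch adds many innovations on top of the simple inverted index described here, and it serves both search and analytics.

By default, Elasticsearch builds an inverted index on every field of a document, pointing to the Elasticsearch documents in which the field's terms occur. In other words, every Lucene index underneath Elasticsearch has a place where this inverted index is stored.

In Kibana, we create the following document: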

    PUT twitter/_doc/1
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }

Once this document has been created, Elasticsearch has already built the corresponding inverted index for us to search against, for example:

    GET twitter/_search
    {
      "query": {
        "match": {
          "user": "張三"
        }
      }
    }

We get the corresponding search result:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.5753642,
        "hits" : [
          {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.5753642,
            "_source" : {
              "user" : "雙榆樹-張三",
              "message" : "今兒天氣不錯啊，出去轉轉去",
              "uid" : 2,
              "age" : 20,
              "city" : "北京",
              "province" : "北京",
              "country" : "中國",
              "name" : {
                "firstname" : "三",
                "surname" : "張"
              },
              "address" : [
                "中國北京市海淀區",
                "中關村29號"
              ],
              "location" : {
                "lat" : "39.970718",
                "lon" : "116.325747"
              }
            }
          }
        ]
      }
    }

If we do not want a field to be searchable, that is, we do not want an inverted index to be built for it, we can do the following:

    DELETE twitter
    PUT twitter
    {
      "mappings": {
        "properties": {
          "city": {
            "type": "keyword",
            "ignore_above": 256
          },
          "address": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "age": {
            "type": "long"
          },
          "country": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "location": {
            "properties": {
              "lat": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "lon": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "message": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "properties": {
              "firstname": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "surname": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "province": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "uid": {
            "type": "long"
          },
          "user": {
            "type": "object",
            "enabled": false
          }
        }
      }
    }
     
    PUT twitter/_doc/1
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }

Above, we changed the user field through the mapping:

    "user": {
      "type": "object",
      "enabled": false
    }

In other words, this field is not indexed, and a search on this field returns no results:

    GET twitter/_search
    {
      "query": {
        "match": {
          "user": "張三"
        }
      }
    }

The search result is:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      }
    }

Clearly there are no results. But if we fetch the document itself:

    GET twitter/_doc/1

the result is:

    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "user" : "雙榆樹-張三",
        "message" : "今兒天氣不錯啊，出去轉轉去",
        "uid" : 2,
        "age" : 20,
        "city" : "北京",
        "province" : "北京",
        "country" : "中國",
        "name" : {
          "firstname" : "三",
          "surname" : "張"
        },
        "address" : [
          "中國北京市海淀區",
          "中關村29號"
        ],
        "location" : {
          "lat" : "39.970718",
          "lon" : "116.325747"
        }
      }
    }

Clearly the user information is kept in the source; it simply cannot be searched.
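The split between "stored" and "searchable" can be modelled with a small sketch. Everything here is an assumption made for illustration: `index_document` and `matches` are hypothetical helpers, the sample data is ASCII to keep tokenization trivial, and the model only mimics the observable behavior of a disabled field, not Elasticsearch internals.

```python
import re

def index_document(doc, disabled_fields):
    """Return (stored source, set of (field, term) pairs) for one document.

    Fields listed in disabled_fields are kept in the source but not indexed.
    """
    terms = set()
    for field, value in doc.items():
        if field in disabled_fields or not isinstance(value, str):
            continue
        for term in re.findall(r"\w+", value.lower()):
            terms.add((field, term))
    return dict(doc), terms

# Hypothetical sample document.
doc = {"user": "zhang san", "city": "beijing"}
source, terms = index_document(doc, disabled_fields={"user"})

def matches(field, word):
    """True if the word was indexed for the given field."""
    return (field, word) in terms

print(matches("user", "san"))      # False: user was never indexed
print(matches("city", "beijing"))  # True
print(source["user"])              # zhang san (still present in the source)
```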

If we do not want the document as a whole to be searchable, we can go even further and use the following:

    DELETE twitter
     
    PUT twitter 
    {
      "mappings": {
        "enabled": false 
      }
    }

The twitter index then builds no inverted index at all. We now run the following commands:

    PUT twitter/_doc/1
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }
     
    GET twitter/_search
    {
      "query": {
        "match": {
          "city": "北京"
        }
      }
    }

Running the commands above returns no search results. For further reading, see "Mapping parameters: enabled" (https://www.elastic.co/guide/en/elasticsearch/reference/current/enabled.html).

Source

In Elasticsearch, every field of every document is normally stored in the shard, in the place that holds the source. For example:

    PUT twitter/_doc/2
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }

Here we created a document with id 2. We can retrieve everything stored for it with the following command:

    GET twitter/_doc/2

It returns:

    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "2",
      "_version" : 1,
      "_seq_no" : 1,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "user" : "雙榆樹-張三",
        "message" : "今兒天氣不錯啊，出去轉轉去",
        "uid" : 2,
        "age" : 20,
        "city" : "北京",
        "province" : "北京",
        "country" : "中國",
        "name" : {
          "firstname" : "三",
          "surname" : "張"
        },
        "address" : [
          "中國北京市海淀區",
          "中關村29號"
        ],
        "location" : {
          "lat" : "39.970718",
          "lon" : "116.325747"
        }
      }
    }

In the _source above we can see all the fields that Elasticsearch stored for us. If we do not want to store any fields, we can use the following settings:

    DELETE twitter
     
    PUT twitter
    {
      "mappings": {
        "_source": {
          "enabled": false
        }
      }
    }

Then we create a document with id 1 using the following command:

    PUT twitter/_doc/1
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }

As before, let's fetch this document:

    GET twitter/_doc/1

The result is:

    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true
    }

Clearly our document was found, but we cannot see any source. Can we still search for this document? Try the following command:

    GET twitter/_search
    {
      "query": {
        "match": {
          "city": "北京"
        }
      }
    }

The result is:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.5753642,
        "hits" : [
          {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.5753642
          }
        ]
      }
    }

Clearly the document with id 1 can be searched correctly. In other words, it has a complete inverted index for us to query, even though it has no source of its own.
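A toy model makes the behavior easier to see: the inverted index and the stored source are separate, so turning one off does not affect the other. `ToyIndex` and its `store_source` flag are hypothetical names invented for this sketch, and the sample data is made up.

```python
class ToyIndex:
    """A toy index: an inverted index plus an optional per-document source store."""

    def __init__(self, store_source=True):
        self.store_source = store_source
        self.postings = {}   # (field, term) -> set of document IDs
        self.sources = {}    # document ID -> original document

    def put(self, doc_id, doc):
        """Index every field, and keep the original only if configured to."""
        for field, value in doc.items():
            for term in str(value).lower().split():
                self.postings.setdefault((field, term), set()).add(doc_id)
        if self.store_source:
            self.sources[doc_id] = dict(doc)

    def search(self, field, term):
        """Return hits; a hit carries _source only if the source was stored."""
        hits = []
        for doc_id in sorted(self.postings.get((field, term.lower()), ())):
            hit = {"_id": doc_id}
            if doc_id in self.sources:
                hit["_source"] = self.sources[doc_id]
            hits.append(hit)
        return hits

index = ToyIndex(store_source=False)
index.put("1", {"user": "zhang san", "city": "beijing"})
print(index.search("city", "beijing"))  # [{'_id': '1'}]
```

The hit is found through the postings alone, which is why it comes back without a `_source` entry.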

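So how do we store only the fields we want? This is useful when we want to save storage space and keep only the fields we need in the source. We can use the following settings: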

    DELETE twitter
     
    PUT twitter
    {
      "mappings": {
        "_source": {
          "includes": [
            "*.lat",
            "address",
            "name.*"
          ],
          "excludes": [
            "name.surname"
          ]
        }    
      }
    }

Above, we use includes to list the fields we want and excludes to drop the ones we do not need. Let's index the following document:

    PUT twitter/_doc/1
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }

Fetching it with the following command:

    GET twitter/_doc/1

gives:

    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "address" : [
          "中國北京市海淀區",
          "中關村29號"
        ],
        "name" : {
          "firstname" : "三"
        },
        "location" : {
          "lat" : "39.970718"
        }
      }
    }

Clearly only a few fields were stored. This way we can choose which fields to store.
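The effect of the includes and excludes patterns can be approximated in plain Python. This is a simplified sketch under stated assumptions: `flatten`, `path_matches`, and `filter_source` are hypothetical helpers, wildcard matching is delegated to `fnmatchcase`, and the exact matching rules of Elasticsearch may differ in edge cases.

```python
from fnmatch import fnmatchcase

def flatten(value, prefix=""):
    """Yield (dotted path, leaf value) pairs; arrays count as leaf values."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, value

def path_matches(path, patterns):
    """True if a pattern matches the path or one of its parent objects."""
    parts = path.split(".")
    candidates = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(fnmatchcase(c, p) for c in candidates for p in patterns)

def filter_source(doc, includes, excludes):
    """Keep leaves that match an include pattern and no exclude pattern."""
    kept = {}
    for path, value in flatten(doc):
        if path_matches(path, includes) and not path_matches(path, excludes):
            # Rebuild the nested structure for the surviving path.
            node = kept
            *parents, leaf = path.split(".")
            for key in parents:
                node = node.setdefault(key, {})
            node[leaf] = value
    return kept

doc = {
    "user": "雙榆樹-張三",
    "city": "北京",
    "name": {"firstname": "三", "surname": "張"},
    "address": ["中國北京市海淀區", "中關村29號"],
    "location": {"lat": "39.970718", "lon": "116.325747"},
}
stored = filter_source(
    doc,
    includes=["*.lat", "address", "name.*"],
    excludes=["name.surname"],
)
print(stored)
```

With the same patterns as the mapping, the sketch keeps address, name.firstname, and location.lat, which matches the stored source shown in the response.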

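In practice, when fetching a document we can also choose which fields to display, even if many fields are stored in the source:

    GET twitter/_doc/1?_source=name,location

Here we only want the fields related to name and location, so the result is: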

    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "name" : {
          "firstname" : "三"
        },
        "location" : {
          "lat" : "39.970718"
        }
      }
    }

For further reading, see the documentation "Mapping meta-field: _source" (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html).

Doc_values

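By default, most fields are indexed, which makes them searchable. The inverted index lets a query look up a search term in a sorted list of unique terms and, from there, immediately access the list of documents that contain the term.

Sorting, aggregations, and access to field values in scripts need a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms it has in a field.

Doc values are an on-disk data structure, built at document index time, that makes this access pattern possible. They store the same values as _source, but in a column-oriented fashion that is far more efficient for sorting and aggregations. Doc values are supported on almost all field types, with the notable exception of analyzed string (text) fields.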
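The column-oriented idea can be sketched as a mapping from document ID to field value, built once per field. The documents and the `build_doc_values` helper are hypothetical, and an in-memory dict is only a stand-in for the on-disk structure described above.

```python
from collections import Counter

# Hypothetical documents; only the fields needed for the illustration.
DOCS = {
    1: {"city": "beijing", "age": 20},
    2: {"city": "shanghai", "age": 35},
    3: {"city": "beijing", "age": 28},
}

def build_doc_values(docs, field):
    """Build a column for one field: document ID -> value."""
    return {doc_id: doc[field] for doc_id, doc in docs.items() if field in doc}

age_column = build_doc_values(DOCS, "age")
city_column = build_doc_values(DOCS, "city")

# Sorting reads one column instead of every full document.
by_age_desc = sorted(age_column, key=age_column.get, reverse=True)
print(by_age_desc)  # [2, 3, 1]

# A terms aggregation is a count over the values of one column.
city_buckets = Counter(city_column.values())
print(city_buckets.most_common())  # [('beijing', 2), ('shanghai', 1)]
```

The inverted index answers "which documents contain this term", while the column answers "which value does this document hold", and sorting and aggregations need the second question.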

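By default, doc values are enabled on all fields that support them. If you are sure you do not need to sort or aggregate on a field, or access its value from a script, you can disable doc values to save disk space.

For example, the following mapping makes the city field unavailable for sorting and aggregation: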

    DELETE twitter
    PUT twitter
    {
      "mappings": {
        "properties": {
          "city": {
            "type": "keyword",
            "doc_values": false,
            "ignore_above": 256
          },
          "address": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "age": {
            "type": "long"
          },
          "country": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "location": {
            "properties": {
              "lat": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "lon": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "message": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "name": {
            "properties": {
              "firstname": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "surname": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "province": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "uid": {
            "type": "long"
          },
          "user": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }

Above, we set doc_values to false for the city field:

    "city": {
      "type": "keyword",
      "doc_values": false,
      "ignore_above": 256
    }

We create a document as follows:

    PUT twitter/_doc/1
    {
      "user" : "雙榆樹-張三",
      "message" : "今兒天氣不錯啊，出去轉轉去",
      "uid" : 2,
      "age" : 20,
      "city" : "北京",
      "province" : "北京",
      "country" : "中國",
      "name": {
        "firstname": "三",
        "surname": "張"
      },
      "address" : [
        "中國北京市海淀區",
        "中關村29號"
      ],
      "location" : {
        "lat" : "39.970718",
        "lon" : "116.325747"
      }
    }

Next, we run the following aggregation:

    GET twitter/_search
    {
      "size": 0,
      "aggs": {
        "city_bucket": {
          "terms": {
            "field": "city",
            "size": 10
          }
        }
      }
    }

In Kibana we see:

    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
          }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
        "grouped": true,
        "failed_shards": [
          {
            "shard": 0,
            "index": "twitter",
            "node": "IyyZ30-hRi2rnOpfx4n1-A",
            "reason": {
              "type": "illegal_argument_exception",
              "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
            }
          }
        ],
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead.",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
          }
        }
      },
      "status": 400
    }

Clearly the operation failed. Although we cannot aggregate or sort on the field, we can still get its source with the following command:

    GET twitter/_doc/1

The result is:

    {
      "_index" : "twitter",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 1,
      "_seq_no" : 0,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "user" : "雙榆樹-張三",
        "message" : "今兒天氣不錯啊，出去轉轉去",
        "uid" : 2,
        "age" : 20,
        "city" : "北京",
        "province" : "北京",
        "country" : "中國",
        "name" : {
          "firstname" : "三",
          "surname" : "張"
        },
        "address" : [
          "中國北京市海淀區",
          "中關村29號"
        ],
        "location" : {
          "lat" : "39.970718",
          "lon" : "116.325747"
        }
      }
    }

For further reading, see "Mapping parameters: doc_values" (https://www.elastic.co/guide/en/elasticsearch/reference/7.4/doc-values.html).
