Elasticsearch Field Options Norms

Reposted (see the original link at the end), 2019-08-03 14:46, elasticsearch

What the norms option does when defining Elasticsearch fields

This article describes what the norms parameter does for the two string field types in Elasticsearch, text and keyword.

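When creating an ES index you usually specify two kinds of configuration: settings and mappings. settings relate to data storage (how many shards, how many replicas), while mappings are the data model, similar to a table definition in MySQL. The mapping specifies the type of each field. Elasticsearch supports many field datatypes, such as String, Numeric, Date, and so on; String is further split into two types: keyword and text. When creating an index you define the fields and assign each one a type, for example: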

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": {
          "type": "text",
          "norms": false
        },
        "overview": {
          "type": "text",
          "norms": true
        },
        "body": {
          "type": "text"
        },
        "author": {
          "type": "keyword",
          "norms": true
        },
        "chapters": {
          "type": "keyword",
          "norms": false
        },
        "email": {
          "type": "keyword"
        }
      }
    }
  }
}

In the my_index index, the title field is of type text, while the author field is of type keyword.

norms are enabled by default for text fields and disabled by default for keyword fields:

Whether field-length should be taken into account when scoring queries. Accepts true (text field datatype) or false (keyword field datatype).

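Why are norms disabled by default for keyword fields? A keyword string can be understood as "Do index the field, but don't analyze the string value". In other words, a keyword field is never broken into individual terms by an Analyzer; it is a single-token field, so it has no need for normalization factors such as field length (fieldNorm) or tfNorm (term frequency norm).

A text field, by contrast, is processed by an Analyzer and produces a number of terms. Of two text fields, one may contain many terms (the body of an article) and the other only a few (the title of an article), so a multi-field query needs length normalization. That is presumably why norms are enabled by default for text fields.

As for the two scoring algorithms commonly used by Lucene, TF-IDF and BM25, TF-IDF tends to give higher scores to shorter fields. Why? Lucene's similarity score consists of three main parts: the IDF score, the TF score, and fieldNorms. In the TF-IDF formula the IDF score is essentially log(numDocs/(docFreq+1)), the TF score is sqrt(tf), and fieldNorms is 1/sqrt(length). The shorter the field, the larger fieldNorms and the higher the score, which is why TF-IDF is strongly biased toward short text.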
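The bias toward short fields can be checked with a few lines of arithmetic. The sketch below simply multiplies the three parts quoted above; it is a simplification for illustration, not Lucene's full scoring function, and the function names and the sample numbers (1000 documents, a term found in 9 of them) are assumptions made up for the example.

```python
import math

def tf_score(freq: float) -> float:
    # TF part as quoted in the text: sqrt(tf)
    return math.sqrt(freq)

def idf_score(num_docs: int, doc_freq: int) -> float:
    # IDF part as quoted in the text: log(numDocs / (docFreq + 1))
    return math.log(num_docs / (doc_freq + 1))

def field_norm(length: int) -> float:
    # fieldNorms part as quoted in the text: 1 / sqrt(length)
    return 1.0 / math.sqrt(length)

def tfidf_score(freq: float, num_docs: int, doc_freq: int, length: int) -> float:
    # Simplified product of the three parts (illustrative only).
    return tf_score(freq) * idf_score(num_docs, doc_freq) * field_norm(length)

# Same term, same frequency, but one field has 4 terms and the other 400.
short_field = tfidf_score(freq=1, num_docs=1000, doc_freq=9, length=4)
long_field = tfidf_score(freq=1, num_docs=1000, doc_freq=9, length=400)
print(short_field, long_field)
```

With everything else equal, the 4-term field scores sqrt(400/4) = 10 times higher than the 400-term field, purely because of fieldNorms.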

What do norms do?

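norms are an "adjustment factor" used when computing the score of a document or field. Both TF-IDF and BM25 use norms when scoring a document; for details, see the Lucene document-scoring formulas in the linked article.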

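A document in Elasticsearch contains multiple fields. The query parser (QueryParser) parses the user's query string into terms. In a multi-field search, each term is matched against each field and a score is computed per field. The per-field scores are then combined in some way (term-centric search vs. field-centric search) to produce the final score of the document.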
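The difference between the two ways of combining scores can be shown on a toy score table. This is a deliberately simplified sketch, not how Elasticsearch implements either strategy: the per-field, per-term scores are made-up numbers, and the two function names are hypothetical. Here "field-centric" sums the term scores inside each field and keeps the best field, while "term-centric" takes each term's best score across fields and sums over terms.

```python
# Made-up per-field, per-term scores for one document and the query "star trek".
scores = {
    "title": {"star": 2.0, "trek": 0.0},
    "overview": {"star": 0.5, "trek": 1.5},
}

def field_centric_score(field_scores: dict) -> float:
    # Score each field as a whole, then keep the best-scoring field.
    per_field = [sum(term_scores.values()) for term_scores in field_scores.values()]
    return max(per_field)

def term_centric_score(field_scores: dict) -> float:
    # For each term take its best score in any field, then add the terms up.
    terms = {term for term_scores in field_scores.values() for term in term_scores}
    return sum(
        max(term_scores.get(term, 0.0) for term_scores in field_scores.values())
        for term in terms
    )

print(field_centric_score(scores), term_centric_score(scores))
```

In this example the field-centric score is 2.0 (both fields total 2.0) and the term-centric score is 3.5, because the term-centric view rewards a document for matching every term somewhere.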

The official ES documentation explains norms as follows:

Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.

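The normalization factors mentioned here adjust (boost) the document score computed at query time. For example, when scoring with the BM25 formula (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)), the fieldLength / avgFieldLength term is the normalization factor.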
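The BM25 expression above translates directly into code. The following is a minimal sketch of just that term-frequency part (not the full BM25 score, which also multiplies by IDF); the function name is an assumption, and k1 = 1.2 and b = 0.75 are used as defaults because those are Elasticsearch's default BM25 parameters.

```python
def bm25_tf_norm(freq: float, field_length: float, avg_field_length: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    # fieldLength / avgFieldLength is the length-normalization factor from norms.
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

# One occurrence of the term in a short, an average and a long field.
short_field = bm25_tf_norm(freq=1, field_length=5, avg_field_length=50)
average_field = bm25_tf_norm(freq=1, field_length=50, avg_field_length=50)
long_field = bm25_tf_norm(freq=1, field_length=500, avg_field_length=50)
print(short_field, average_field, long_field)
```

Setting b to 0 removes the length ratio from the formula entirely, which is roughly the scoring effect of having no length normalization for that field.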

The cost of norms

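With norms enabled, each document needs one byte per norms-enabled field to store the norm. Because norms are enabled by default for text fields, you can disable them on text fields that do not need scoring, which is a small tuning opportunity.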

Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field
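A back-of-the-envelope estimate makes the "one byte per document per field" figure concrete. The helper below is a hypothetical sketch: the document count of 100 million is an invented number, and the field count of 3 corresponds to the fields in my_index that have norms enabled (overview, body and author).

```python
def norms_bytes(num_docs: int, fields_with_norms: int, bytes_per_norm: int = 1) -> int:
    # Every document pays for every norms-enabled field, even if the
    # document does not contain that field.
    return num_docs * fields_with_norms * bytes_per_norm

estimate = norms_bytes(num_docs=100_000_000, fields_with_norms=3)
print(f"{estimate / 1024 ** 2:.0f} MiB of norms")
```

Disabling norms on one of those fields would reduce the estimate by a third, which is the tuning opportunity described above.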

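norms are part of index-time boosting: when a document is indexed (written), all boosting factors are stored, and at query time they are read from memory and take part in the score calculation. The following passage from "Lucene in Action" explains this: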

During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.
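To see what "compactly encoded (quantized) into a single byte" costs in precision, here is a sketch of a one-byte float encoding with a 3-bit mantissa and a 5-bit exponent, written in the style of the encoding that older Lucene versions used for norms. Treat the exact bit layout as an assumption for illustration: the scheme differs between Lucene versions, and the function names are hypothetical.

```python
import math
import struct

EXPONENT_BASE = 63 - 15  # assumed zero-exponent offset of the small float

def float_to_byte(value: float) -> int:
    # Reinterpret the 32-bit float as an int and keep only its top bits.
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> (24 - 3)
    if small <= EXPONENT_BASE << 3:
        return 0 if bits <= 0 else 1      # underflow: zero or smallest value
    if small >= (EXPONENT_BASE << 3) + 0x100:
        return 255                        # overflow: largest value
    return small - (EXPONENT_BASE << 3)

def byte_to_float(encoded: int) -> float:
    if encoded == 0:
        return 0.0
    bits = (encoded << (24 - 3)) + (EXPONENT_BASE << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

# Length norm 1/sqrt(length) before and after the one-byte round trip.
for length in (1, 2, 3, 4):
    norm = 1.0 / math.sqrt(length)
    print(length, round(norm, 4), byte_to_float(float_to_byte(norm)))
```

Under this encoding, fields with 3 and 4 terms end up with the same stored norm (0.5), and a 2-term field is stored as 0.625 rather than about 0.707. That loss of resolution is the price of keeping norms to one byte.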

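The other kind of boosting is search-time boosting: the boosting factor is specified in the query and the document score is computed dynamically. For details see "Relevant Search: With Applications for Solr and Elasticsearch"; this article does not go further into it. Note, however, that current ES versions no longer recommend index-time boosting and recommend search-time boosting instead. The official ES documentation gives these reasons: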

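A boosting factor stored at index time (with the norms option enabled) cannot be changed once it is stored. The only way to change it is to reindex.
Search-time boosting achieves the same effect as index-time boosting, and it lets you set the boosting factor dynamically (probably at the cost of more CPU when scoring), so it is more flexible. Index-time boosting also needs extra storage.
Index-time boosting factors are stored in the norm, which is only one byte. Sharing that byte reduces the resolution of the field length normalization factor and therefore makes the similarity calculation less accurate (lower quality relevance calculations).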
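As an illustration of search-time boosting, the sketch below builds the body of a multi_match query in which each field carries a caret boost (for example title^2). The helper name is hypothetical and the snippet only constructs and prints the JSON; sending it to a cluster is left out, so no client API is assumed.

```python
import json

def boosted_multi_match(query_text: str, boosts: dict) -> dict:
    # "field^boost" is the query-time boost syntax accepted in the fields list.
    fields = [f"{name}^{boost}" for name, boost in boosts.items()]
    return {"query": {"multi_match": {"query": query_text, "fields": fields}}}

# Weight matches in title twice as much as matches in overview.
body = boosted_multi_match("elasticsearch norms", {"title": 2, "overview": 1})
print(json.dumps(body, indent=2))
```

Because the boost lives in the query, changing title from 2 to 3 only means sending a different request body, with no reindexing.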

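Appendix: the mapping of the my_index index. Note that the response only shows norms where it differs from the default for the field type: author (keyword, norms true) and title (text, norms false) keep the option, while overview (text, explicitly true) and chapters (keyword, explicitly false) are shown without it.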

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "_doc": {
        "properties": {
          "author": {
            "type": "keyword",
            "norms": true
          },
          "body": {
            "type": "text"
          },
          "chapters": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "overview": {
            "type": "text"
          },
          "title": {
            "type": "text",
            "norms": false
          }
        }
      }
    }
  }
}
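Since the response leaves out default values, the effective norms setting of each field has to be derived from its type. The sketch below does this for the properties returned above; the helper name is hypothetical, and it only covers the two types discussed in this article (text defaults to true, keyword to false).

```python
# Defaults for the two string types discussed in this article.
DEFAULT_NORMS = {"text": True, "keyword": False}

# The "properties" section of the GET my_index/_mapping response above.
properties = {
    "author": {"type": "keyword", "norms": True},
    "body": {"type": "text"},
    "chapters": {"type": "keyword"},
    "email": {"type": "keyword"},
    "overview": {"type": "text"},
    "title": {"type": "text", "norms": False},
}

def effective_norms(props: dict) -> dict:
    # An explicit "norms" value wins; otherwise fall back to the type default.
    return {
        name: spec.get("norms", DEFAULT_NORMS[spec["type"]])
        for name, spec in props.items()
    }

for name, enabled in effective_norms(properties).items():
    print(f"{name}: norms={'on' if enabled else 'off'}")
```

This lists author, body and overview as having norms on, and chapters, email and title as having them off, matching the original PUT request.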

Original: https://www.cnblogs.com/hapjin/p/11254535.html
