Elasticsearch Search: Score Analysis with explain

Reposted article. 2017-04-06 18:50. Tags: elasticsearch search execution plan / elasticsearch query analysis / elasticsearch explain / elasticsearch score analysis

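Lucene's IndexSearcher provides an explain method that shows how a Document's score was derived, printing the contribution of every component in detail. This article verifies Lucene's scoring algorithm by hand on a Chinese-language example and explains each step with reference to the Lucene source code.

First, the test case: I search for “北京東路” (Beijing East Road) against documents that have an address field.

Next is the output. Note the indentation, which reflects the nesting levels. The example below uses data from a test environment: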

{
        "value" : 0.7271681,
        "description" : "max of:",
        "details" : [ {
          "value" : 0.7271681,
          "description" : "sum of:",
          "details" : [ {
            "value" : 0.43069553,
            "description" : "weight(address:北京 in 787) [PerFieldSimilarity], result of:",
            "details" : [ {
              "value" : 0.43069553,
              "description" : "score(doc=787,freq=1.0), product of:",
              "details" : [ {
                "value" : 0.34374008,
                "description" : "queryWeight, product of:",
                "details" : [ {
                  "value" : 5.0118747,
                  "description" : "idf(docFreq=2104, maxDocs=116302)"
                }, {
                  "value" : 0.06858513,
                  "description" : "queryNorm"
                } ]
              }, {
                "value" : 1.2529687,
                "description" : "fieldWeight in 787, product of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "tf(freq=1.0), with freq of:",
                  "details" : [ {
                    "value" : 1.0,
                    "description" : "termFreq=1.0"
                  } ]
                }, {
                  "value" : 5.0118747,
                  "description" : "idf(docFreq=2104, maxDocs=116302)"
                }, {
                  "value" : 0.25,
                  "description" : "fieldNorm(doc=787)"
                } ]
              } ]
            } ]
          }, {
            "value" : 0.29647252,
            "description" : "weight(address:東路 in 787) [PerFieldSimilarity], result of:",
            "details" : [ {
              "value" : 0.29647252,
              "description" : "score(doc=787,freq=1.0), product of:",
              "details" : [ {
                "value" : 0.2851919,
                "description" : "queryWeight, product of:",
                "details" : [ {
                  "value" : 4.158218,
                  "description" : "idf(docFreq=4942, maxDocs=116302)"
                }, {
                  "value" : 0.06858513,
                  "description" : "queryNorm"
                } ]
              }, {
                "value" : 1.0395545,
                "description" : "fieldWeight in 787, product of:",
                "details" : [ {
                  "value" : 1.0,
                  "description" : "tf(freq=1.0), with freq of:",
                  "details" : [ {
                    "value" : 1.0,
                    "description" : "termFreq=1.0"
                  } ]
                }, {
                  "value" : 4.158218,
                  "description" : "idf(docFreq=4942, maxDocs=116302)"
                }, {
                  "value" : 0.25,
                  "description" : "fieldNorm(doc=787)"
                } ]
              } ]
            } ]
          } ]
        } ]
      }

This looks like a real headache, so let us try to explain it.

First, we need to look at Lucene's scoring formula.
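The formula itself is not reproduced in this copy. Assuming it is the classic Lucene TF-IDF scoring formula, whose factors match the table that follows, it can be written as:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t.\mathrm{field} \in d) \cdot \mathrm{lengthNorm}(t.\mathrm{field} \in d) \Big)
```

idf appears squared because it enters the score twice: once through queryWeight and once through fieldWeight, as the explain output shows.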

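The score is the sum, over each term t in the query q, of that term's match score against document d, adjusted by several weighting factors. The meaning of each factor is listed in the table below:

Table 3.5	Factors in the scoring formula
Factor	Description
tf(t in d)	Term frequency factor: how often term t occurs in document d.
idf(t)	Inverse document frequency: a measure of how "unique" the term is. Terms that occur in many documents have a lower idf; terms that occur in few documents have a higher idf.
boost(t.field in d)	Field and document boost, set at index time. Use it to statically boost a particular field or document.
lengthNorm(t.field in d)	Normalization value of the field, reflecting the number of terms in the field. It is computed at index time and stored in the index norms. Shorter fields (fewer tokens) receive a larger boost.
coord(q,d)	Coordination factor, based on how many of the query terms the document contains. It gives an AND-like boost to documents that contain more of the search terms.
queryNorm(q)	Normalization value for the query, derived from the sum of the squared weights of the query terms (DefaultSimilarity uses 1/sqrt of that sum).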

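Computing the total match score

In the test above, the address field matched two terms. The score of each term is computed separately and the two are then added: final result = 0.43069553 (“北京”) + 0.29647252 (“東路”) = 0.7271681 (after rounding).

Computing the match score of a query term in a field

Where does 0.43069553 come from? It is the score of the term “北京” in the field = queryWeight * fieldWeight, i.e. 0.34374008 * 1.2529687 = 0.43069553.

Likewise, the score of the term “東路” in the field = queryWeight * fieldWeight, i.e. 0.2851919 * 1.0395545 = 0.29647252.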
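The two products and their sum can be re-checked in a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the numbers are copied from the explain output above. Float arithmetic may differ from the printed values in the last digit.

```java
// Hypothetical helper; values are copied from the explain output above.
final class TermScoreSum {
    /** Score of one term in the field: queryWeight * fieldWeight. */
    static float termScore(float queryWeight, float fieldWeight) {
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        float first = termScore(0.34374008f, 1.2529687f);  // first term (Beijing)
        float second = termScore(0.2851919f, 1.0395545f);  // second term (East Road)
        System.out.println("first term  = " + first);
        System.out.println("second term = " + second);
        // The "sum of:" node simply adds the per-term scores.
        System.out.println("total       = " + (first + second));
    }
}
```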

Computing queryWeight

The implementation of the queryWeight calculation can be seen in the TermQuery$TermWeight.normalize(float) method:

public void normalize(float queryNorm) {
    this.queryNorm = queryNorm;
    // queryWeight was idf * t.getBoost(); it is now queryNorm * idf * t.getBoost().
    queryWeight *= queryNorm;
    // value multiplies in idf a second time.
    value = queryWeight * idf;
}

With default settings, queryWeight = idf * queryNorm, because the default boost in Lucene is 1.0.

Taking the term “北京” as an example: queryWeight = idf * queryNorm, i.e. 0.34374008 = 5.0118747 * 0.06858513.
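The same check works for both terms. The sketch below makes the boost explicit so the default of 1.0 is visible; the class and method names are illustrative and not part of Lucene.

```java
// Hypothetical helper mirroring queryWeight = idf * boost * queryNorm.
final class QueryWeightCheck {
    static float queryWeight(float idf, float boost, float queryNorm) {
        return idf * boost * queryNorm;
    }

    public static void main(String[] args) {
        float queryNorm = 0.06858513f; // shared by both terms in the explain output
        // Default boost of 1.0, idf values taken from the explain output.
        System.out.println("first term  = " + queryWeight(5.0118747f, 1.0f, queryNorm));
        System.out.println("second term = " + queryWeight(4.158218f, 1.0f, queryNorm));
    }
}
```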

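Computing idf

idf is the inverse document frequency of a term. It is computed as follows:

/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
@Override
public float idf(long docFreq, long numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

docFreq is the number of documents that contain the given term. For the term “北京” in our test, docFreq=2104.

numDocs is the total number of documents in the index and corresponds to maxDocs in the explain output. In our test, maxDocs=116302.

Math.log is the natural logarithm, so a quick check with a calculator gives ln(116302 / 2105) + 1 ≈ 5.01187, which matches the explain output. I will not go through it in more detail here.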
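For readers without a calculator at hand, the check can be run as a small standalone program. The idf method body is the same expression as in the Lucene snippet above; the surrounding class is only illustrative.

```java
// Standalone copy of the idf expression quoted above, applied to both terms.
final class IdfCheck {
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long maxDocs = 116302L; // maxDocs from the explain output
        System.out.println("idf(docFreq=2104) = " + idf(2104L, maxDocs));
        System.out.println("idf(docFreq=4942) = " + idf(4942L, maxDocs));
    }
}
```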

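Computing fieldWeight

fieldWeight = tf * idf * fieldNorm

idf is computed as described above, and tf in DefaultSimilarity is the square root of the term frequency, so tf = 1.0 for freq = 1.0. fieldNorm is fixed at index time and is simply read back from the index at search time, so this method does not show how it is calculated.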
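Putting the three factors together reproduces the fieldWeight values in the explain output. The sketch assumes DefaultSimilarity's tf, which is sqrt(freq); the class and method names are illustrative.

```java
// Hypothetical helper for fieldWeight = tf * idf * fieldNorm.
final class FieldWeightCheck {
    /** DefaultSimilarity-style tf: square root of the raw term frequency. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float fieldWeight(float freq, float idf, float fieldNorm) {
        return tf(freq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // freq, idf and fieldNorm are taken from the explain output for doc 787.
        System.out.println("first term  = " + fieldWeight(1.0f, 5.0118747f, 0.25f));
        System.out.println("second term = " + fieldWeight(1.0f, 4.158218f, 0.25f));
    }
}
```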

With DefaultSimilarity, fieldNorm is in effect lengthNorm: the longer the field, the smaller the norm. The calculation is in org/apache/lucene/search/similarities/DefaultSimilarity.java:

public float lengthNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
    else
        numTerms = state.getLength();
    return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
}

I will not verify this one by hand. The result is the reciprocal of the square root of the number of terms in the field, multiplied by the field's boost.
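As a rough illustration, the same expression can be evaluated without a FieldInvertState. The helper below is hypothetical and takes the term count and boost directly. A field of 16 terms with boost 1.0 yields 0.25, the fieldNorm seen in the explain output. The real term count of doc 787 is not given here, and because Lucene stores norms in a single byte with reduced precision, the stored value only approximates the computed one.

```java
// Hypothetical stand-in for lengthNorm, using plain arguments instead of FieldInvertState.
final class LengthNormCheck {
    static float lengthNorm(int numTerms, float boost) {
        return boost * ((float) (1.0 / Math.sqrt(numTerms)));
    }

    public static void main(String[] args) {
        // Longer fields get smaller norms; the boost scales the result linearly.
        System.out.println("4 terms,  boost 1.0 -> " + lengthNorm(4, 1.0f));
        System.out.println("16 terms, boost 1.0 -> " + lengthNorm(16, 1.0f));
        System.out.println("16 terms, boost 2.0 -> " + lengthNorm(16, 2.0f));
    }
}
```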
