Learning Elasticsearch: Relevance Scoring with TF & IDF

Reposted article, 2017-06-26 08:58. Categories: Elasticsearch / Learning Elasticsearch

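Put simply, the relevance score algorithm computes how closely the text stored in an index matches the text being searched for.

Elasticsearch uses the term frequency/inverse document frequency algorithm, TF/IDF for short. (Since version 5.0 the default similarity is BM25, which builds on the same ideas.)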

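Term frequency (TF): how many times each term of the search text appears in the field text. The more often it appears, the more relevant the document.

Inverse document frequency (IDF): how often each term of the search text appears across all documents in the whole index (strictly, how many documents contain it). The more often it appears, the less relevant a match on that term is.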
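The two definitions above can be sketched in a few lines of Python. This is only an illustration: the formulas tf = sqrt(frequency) and idf = 1 + ln(numDocs / (docFreq + 1)) are the ones commonly quoted for Lucene's classic TF/IDF similarity, the exact formula differs between versions, and the tiny corpus and function names are made up for this example.

```python
import math

def term_frequency(term, field_tokens):
    """Count of `term` in one field, dampened with a square root."""
    return math.sqrt(field_tokens.count(term))

def inverse_document_frequency(term, all_docs):
    """1 + ln(num_docs / (doc_freq + 1)): rarer terms get a larger weight."""
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return 1.0 + math.log(len(all_docs) / (doc_freq + 1))

# Hypothetical, already-tokenized field values (not from the article).
docs = [
    ["hello", "hello", "elasticsearch"],
    ["hello", "world"],
    ["hello", "lucene"],
    ["search", "engine"],
]

tf_hello_doc0 = term_frequency("hello", docs[0])  # "hello" appears twice
tf_hello_doc1 = term_frequency("hello", docs[1])  # "hello" appears once
idf_hello = inverse_document_frequency("hello", docs)  # in 3 of 4 docs
idf_world = inverse_document_frequency("world", docs)  # in 1 of 4 docs

print(tf_hello_doc0 > tf_hello_doc1)  # more occurrences, higher TF
print(idf_world > idf_hello)          # rarer term, higher IDF
```

Both comparisons print True: repeating a term in a field raises its TF, and a term found in fewer documents gets a higher IDF.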

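Example:

Search request: hello world
doc1: hello, today is very good
doc2: hi world, how are you

Suppose the index holds 10,000 documents. Across all of them, the word hello appears 1,000 times in total, while the word world appears only 100 times. Because world is the rarer term, a match on it counts for more, so doc2 is more relevant.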
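To put rough numbers on this example, here is a minimal sketch that plugs the figures above into the idf formula often quoted for Lucene's classic similarity, 1 + ln(numDocs / (docFreq + 1)). It assumes that the counts 1,000 and 100 can be read as the number of documents containing each term, and it deliberately ignores every other scoring factor, so the values are illustrative rather than what Elasticsearch would report.

```python
import math

NUM_DOCS = 10_000  # documents in the index, from the example

def idf(doc_freq, num_docs=NUM_DOCS):
    """Classic-style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

idf_hello = idf(1000)  # "hello" is common
idf_world = idf(100)   # "world" is rare

# Each document matches exactly one query term, once, so tf = 1 for both
# and the simplified score is just the idf of the matched term.
score_doc1 = 1.0 * idf_hello  # doc1 matches "hello"
score_doc2 = 1.0 * idf_world  # doc2 matches "world"

print(f"idf(hello) = {idf_hello:.2f}, idf(world) = {idf_world:.2f}")
print("more relevant:", "doc2" if score_doc2 > score_doc1 else "doc1")
```

Under these assumptions idf(hello) comes out around 3.3 and idf(world) around 5.6, which is why doc2 ranks first.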

Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match inside it.

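doc1: { "title": "hello article", "content": "babaaba 10,000 words" }
doc2: { "title": "my article", "content": "blablabala 10,000 words, hi world" }

Assume hello and world appear equally often across the whole index.

doc1 is more relevant: its match is in the title field, which is far shorter than the content field where doc2 matches.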
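The effect of field length can be sketched with the norm formula usually given for Lucene's classic similarity, norm = 1 / sqrt(numTerms). The term counts below are assumptions read off the example (a two-word title versus a content field of roughly 10,000 words plus "hi world"); real indexes store the norm in a compressed, lossy form, so actual values will differ.

```python
import math

def field_length_norm(num_terms):
    """Shorter fields get a larger norm, so a match in them weighs more."""
    return 1.0 / math.sqrt(num_terms)

# doc1 matches "hello" in its title, "hello article": 2 terms.
title_norm = field_length_norm(2)

# doc2 matches "world" in its content: about 10,000 words plus "hi world".
content_norm = field_length_norm(10_002)

print(f"title norm   = {title_norm:.4f}")
print(f"content norm = {content_norm:.4f}")
print("doc1 favoured:", title_norm > content_norm)
```

With equal term statistics for hello and world, the much larger norm of the short title field is what pushes doc1 ahead.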

Analyzing how a document was matched

GET /test_index/test_type/6/_explain
{
    "query": {
        "match": {
            "test_field": "test hello"
        }
    }
}
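The response to an _explain request contains an "explanation" object, a tree whose nodes carry "value", "description" and nested "details". A small helper like the one below can flatten that tree into indented lines for reading. The function name and the sample tree are invented for illustration; real descriptions and values depend on the Elasticsearch version and on the index contents.

```python
def format_explanation(node, depth=0):
    """Return the explanation tree as a list of indented text lines."""
    indent = "  " * depth
    lines = [f"{indent}{node['value']:.4f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Made-up sample in the shape of the "explanation" object (not real output).
sample_explanation = {
    "value": 0.5,
    "description": "sum of:",
    "details": [
        {"value": 0.2, "description": "weight(test_field:test)", "details": []},
        {"value": 0.3, "description": "weight(test_field:hello)", "details": []},
    ],
}

lines = format_explanation(sample_explanation)
print("\n".join(lines))
```

Each level of indentation corresponds to one level of nesting in "details", which makes it easier to see how the per-term weights add up to the final score.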
