Elasticsearch Scoring Mechanism Explained


1. Scoring mechanism in detail

1.1. Scoring with TF/IDF

1.1.1 Algorithm overview

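The relevance score algorithm, put simply, computes how closely a piece of text in the index matches the search text.

Elasticsearch scoring is based on the term frequency/inverse document frequency algorithm, TF/IDF for short: TF stands for term frequency and IDF for inverse document frequency. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output in section 1.1.2 shows the BM25 parameters k1 and b.

Term frequency: how many times each term of the search text appears in the field text. The more occurrences, the more relevant the document.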

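Example: the search request is hello world

doc1 : hello you and me, and world is very good.

doc2 : hello, how are you

doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.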

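Inverse document frequency: how many documents in the whole index contain each term of the search text. The more documents a term appears in, the less it contributes to relevance.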

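Example: the search request is hello world

doc1 : hello, today is very good

doc2 : hi world, how are you

Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.

doc2 is more relevant, because the term it matches, world, is the rarer one.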
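To make the intuition concrete, the following minimal Python sketch applies the idf formula that appears in the explain output of section 1.1.2, log(1 + (N - n + 0.5) / (n + 0.5)), to the numbers in this example. The function name idf is an illustrative choice, not an Elasticsearch API.

```python
import math

def idf(n: int, N: int) -> float:
    """Inverse document frequency as printed by explain:
    n = documents containing the term, N = documents with the field."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

TOTAL_DOCS = 100_000_000            # 100 million documents in the index
idf_hello = idf(1_000, TOTAL_DOCS)  # hello appears in 1,000 documents
idf_world = idf(100, TOTAL_DOCS)    # world appears in 100 documents

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
```

The rarer term, world, receives the larger idf, so a match on world adds more to the score than a match on hello.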

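Field-length norm: the length of the field. The longer the field in which a term is found, the weaker the relevance.

Example: the search request is hello world

doc1 : {"title": "hello article", "content": "blah blah blah ... (10,000 words)"}

doc2 : {"title": "my article", "content": "blah blah blah ... (10,000 words), world"}

doc1 is more relevant: hello matches in the short title field, whereas world matches in a content field that is 10,000 words long.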
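The effect of field length can be sketched with the tf formula that the explain output of section 1.1.2 prints, freq / (freq + k1 * (1 - b + b * dl / avgdl)). This is a simplified illustration: the average field length of 100 is a hypothetical value, and both calls use the same average even though in practice each field has its own.

```python
def tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Length-normalized term frequency as printed by explain.
    dl = length of this field, avgdl = average length of the field."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

AVGDL = 100.0  # hypothetical average field length

tf_short = tf(freq=1, dl=2, avgdl=AVGDL)       # one match in a 2-term field
tf_long = tf(freq=1, dl=10_000, avgdl=AVGDL)   # one match in a 10,000-term field

print(f"short field: {tf_short:.3f}")
print(f"long field:  {tf_long:.3f}")
```

A single occurrence counts for much more in the short field than in the long one, which is the field-length norm at work.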

1.1.2 How _score is calculated

GET /book/_search?explain=true
{
  "query": {
    "match": {
      "description": "java程序員"
    }
  }
}

Response:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.137549,
    "hits" : [
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 2.137549,
        "_source" : {
          "name" : "spring開發基礎",
          "description" : "spring 在java領域非常流行,java程序員都在用。",
          "studymodel" : "201001",
          "price" : 88.6,
          "timestamp" : "2019-08-24 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [
            "spring",
            "java"
          ]
        },
        "_explanation" : {
          "value" : 2.137549,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.7936629,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.7936629,
                  "description" : "score(freq=2.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.7675597,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 2.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 12.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3438859,
              "description" : "weight(description:程序員 in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 1.3438859,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.98082924,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.6227967,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 12.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[book][0]",
        "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
        "_index" : "book",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.57961315,
        "_source" : {
          "name" : "java編程思想",
          "description" : "java語言是世界第一編程語言,在軟件開發領域使用人數最多。",
          "studymodel" : "201001",
          "price" : 68.6,
          "timestamp" : "2019-08-25 19:11:35",
          "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
          "tags" : [
            "java",
            "dev"
          ]
        },
        "_explanation" : {
          "value" : 0.57961315,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.57961315,
              "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.57961315,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.47000363,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 3,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.56055,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 19.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 35.333332,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
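The numbers in this response can be checked by hand. The sketch below is a minimal Python reproduction that uses only the formulas and values printed in the explanation (boost, idf, tf); the helper names are illustrative, and the boost of 2.2 is simply taken from the output.

```python
import math

def idf(n, N):
    # "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))"
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))"
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(freq, n, N, dl, avgdl, boost=2.2):
    # "score(freq=...), product of:" boost, idf and tf
    return boost * idf(n, N) * tf(freq, dl, avgdl)

N = 3              # documents that have the description field
AVGDL = 35.333332  # average length of the description field

# Document 3 matches both terms; its _score is the sum of the two term scores.
doc3_java = term_score(freq=2, n=2, N=N, dl=12, avgdl=AVGDL)
doc3_programmer = term_score(freq=1, n=1, N=N, dl=12, avgdl=AVGDL)
doc3_total = doc3_java + doc3_programmer

# Document 2 matches only java.
doc2_total = term_score(freq=1, n=2, N=N, dl=19, avgdl=AVGDL)

print(doc3_java, doc3_programmer, doc3_total, doc2_total)
```

Up to floating-point rounding, the results agree with the values 0.7936629, 1.3438859, 2.137549 and 0.57961315 reported above.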

1.1.3 Analyzing how a document was matched

GET /book/_explain/3
{
  "query": {
    "match": {
      "description": "java程序員"
    }
  }
}
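An explanation is a tree of nodes, each with value, description and details. To read one more easily, a small helper can flatten it into indented lines. The following Python sketch assumes the node shape seen in section 1.1.2; flatten_explanation and the abbreviated sample are illustrative, not part of any client library.

```python
def flatten_explanation(node, depth=0):
    """Return one indented line per node of an explanation tree."""
    lines = [f"{'  ' * depth}{node['value']}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(flatten_explanation(child, depth + 1))
    return lines

# Abbreviated node, shaped like the explanation returned for document 2.
sample = {
    "value": 0.57961315,
    "description": "score(freq=1.0), product of:",
    "details": [
        {"value": 2.2, "description": "boost", "details": []},
        {"value": 0.47000363, "description": "idf", "details": []},
        {"value": 0.56055, "description": "tf", "details": []},
    ],
}

for line in flatten_explanation(sample):
    print(line)
```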

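1.2. Doc values

Searching relies on the inverted index. Sorting relies on a forward index, in which the value of each field of each document can be looked up and then sorted on. This forward index is what Elasticsearch calls doc values.

At index time, an inverted index is built for searching, and a forward index, the doc values, is built for sorting, aggregations, filtering and similar operations.

Doc values are stored on disk. If there is enough memory, the OS caches them in memory automatically and performance stays high; if there is not enough memory, they have to be read from disk.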

Inverted index

doc1: hello world you and me

doc2: hi, world, how are you

term    doc1    doc2
hello   *
world   *       *
you     *       *
and     *
me      *
hi              *
how             *
are             *

At search time:

hello you --> hello, you

hello --> doc1

you --> doc1,doc2

doc1: hello world you and me

doc2: hi, world, how are you

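Sorting is where the inverted index runs into trouble: it maps terms to documents, but a sort needs the field value of each document.

Forward index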

doc1: { "name": "jack", "age": 27 }

doc2: { "name": "tom", "age": 30 }

document    name    age
doc1        jack    27
doc2        tom     30
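The two structures can be sketched in a few lines of Python using the example documents above. This is a toy model for illustration only: the tokenizer, the OR semantics of search and the ages mapping are assumptions, not how Lucene stores its data.

```python
from collections import defaultdict

docs = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}

def tokenize(text):
    """Very small stand-in for an analyzer: drop commas, split on spaces."""
    return text.replace(",", " ").lower().split()

# Inverted index: term -> set of documents containing it (used for search).
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        inverted[term].add(doc_id)

def search(query):
    """Return every document that contains at least one query term."""
    hits = set()
    for term in tokenize(query):
        hits |= inverted.get(term, set())
    return hits

# Forward index (doc values): document -> field value (used for sorting).
ages = {"doc1": 27, "doc2": 30}

hits = search("hello you")
by_age_desc = sorted(hits, key=lambda doc_id: ages[doc_id], reverse=True)
print(by_age_desc)  # ['doc2', 'doc1']
```

The search step only needs the term-to-document mapping, while the sort step only needs the document-to-value mapping, which is why both are built at index time.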

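1.3. Query phase

1.3.1 Query phase

(1) The search request is sent to a coordinating node, which builds a priority queue whose length is determined by the paging parameters from and size (from + size); by default that is 10.

(2) The coordinating node forwards the request to every shard. Each shard runs the search locally and builds its own local priority queue.

(3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.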
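The three steps above can be sketched as follows. This is a minimal in-memory model in Python, assuming each shard is just a list of (score, doc_id) pairs; shard_top_k and query_phase are hypothetical names, and real shards return sort values and document IDs rather than Python tuples.

```python
import heapq

def shard_top_k(hits, k):
    """Local priority queue: the k best (score, doc_id) pairs of one shard."""
    return heapq.nlargest(k, hits)

def query_phase(shards, from_=0, size=10):
    k = from_ + size  # every queue must hold from + size entries
    local_queues = [shard_top_k(hits, k) for hits in shards]
    # The coordinating node merges the local queues into a global one.
    merged = heapq.nlargest(k, (hit for queue in local_queues for hit in queue))
    return merged[from_:]  # skip the first `from_` entries for paging

shards = [
    [(1.2, "a"), (0.4, "b"), (0.9, "c")],  # shard 0
    [(2.1, "d"), (0.3, "e")],              # shard 1
]

top = query_phase(shards, from_=0, size=3)
print(top)  # [(2.1, 'd'), (1.2, 'a'), (0.9, 'c')]
```

Because every shard has to return from + size entries, deep paging makes both the local queues and the merge on the coordinating node more expensive.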

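1.3.2 How replica shards increase search throughput

Each search request has to reach one copy of every shard, either the primary or one of its replicas. If every shard has several replicas, concurrent search requests can be spread across the different copies at the same time.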

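1.4. Fetch phase

1.4.1 Fetch phase workflow

(1) Once the coordinating node has built the priority queue, it sends mget (multi-get) requests to the shards that hold the corresponding documents.

(2) Each shard returns its documents to the coordinating node.

(3) The coordinating node merges the documents and returns the result to the client.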
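The fetch steps can be modeled in the same spirit. In this Python sketch, shard_docs stands in for the documents stored on each shard and ranked_hits for the global priority queue produced by the query phase; all names and data are hypothetical.

```python
from collections import defaultdict

# Documents stored on each shard, keyed by document ID.
shard_docs = [
    {"a": {"title": "A"}, "b": {"title": "B"}},  # shard 0
    {"d": {"title": "D"}},                       # shard 1
]

def fetch_phase(ranked_hits, shard_docs):
    """ranked_hits: (score, shard_index, doc_id) tuples, already in final order."""
    # (1) Group the wanted IDs by shard, one multi-get per shard.
    ids_by_shard = defaultdict(list)
    for _score, shard, doc_id in ranked_hits:
        ids_by_shard[shard].append(doc_id)
    # (2) Each shard returns the requested documents.
    fetched = {}
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            fetched[(shard, doc_id)] = shard_docs[shard][doc_id]
    # (3) Reassemble the documents in ranking order for the client.
    return [
        {"_id": doc_id, "_score": score, "_source": fetched[(shard, doc_id)]}
        for score, shard, doc_id in ranked_hits
    ]

ranked = [(2.1, 1, "d"), (1.2, 0, "a")]
results = fetch_phase(ranked, shard_docs)
print([hit["_id"] for hit in results])  # ['d', 'a']
```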

1.4.2 By default, a search without from and size returns the top 10 hits, sorted by _score

1.5. Summary of search parameters

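1. preference

Determines which shard copies are used to execute the search.

Possible values: _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3 (the available values depend on the Elasticsearch version; _primary and _primary_first were removed in 7.0).

The bouncing results problem: when two documents have the same value in the sort field, different shard copies may order them differently. Because requests are distributed round-robin across the copies of a shard, the order of the results shown on the page can change from one request to the next. These are bouncing results, that is, results that jump around.

Searches are sent round-robin to the copies of each shard (the primary and its replicas), and the same documents may be ordered differently on different copies.

The solution is to set preference to an arbitrary string, such as the user_id, so that every search from a given user is executed on the same shard copies and the user no longer sees bouncing results.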
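Why a fixed preference string removes the jumping can be illustrated with a toy model in Python. The replica names, the round-robin counter and the CRC32-based choice are all assumptions made for the sketch; Elasticsearch's real copy selection is more involved.

```python
import zlib
from itertools import count

REPLICAS = ["copy-0", "copy-1", "copy-2"]  # hypothetical copies of one shard
_next_request = count()

def pick_round_robin():
    """Without preference: each request goes to the next copy in turn."""
    return REPLICAS[next(_next_request) % len(REPLICAS)]

def pick_by_preference(preference: str):
    """With preference: the string is hashed, so the same string
    always selects the same copy."""
    return REPLICAS[zlib.crc32(preference.encode("utf-8")) % len(REPLICAS)]

round_robin = [pick_round_robin() for _ in range(3)]
sticky = [pick_by_preference("user_42") for _ in range(3)]

print(round_robin)  # three different copies, so tie order may change
print(sticky)       # the same copy three times
```

In a real request the string is passed as the preference query parameter, for example GET /book/_search?preference=user_42, where user_42 is a made-up value.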

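2. timeout

Limits how long a search may run: when the time is up, whatever partial results have been collected are returned directly, so that a query does not take too long.

3. routing

Document routing. By default a document is routed by its _id; with routing=user_id, all data belonging to the same user goes to the same shard.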
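The idea behind routing can be sketched as a hash of the routing value taken modulo the number of primary shards. The Python snippet below is an assumption-laden illustration: zlib.crc32 stands in for the hash function Elasticsearch actually uses, and NUM_PRIMARY_SHARDS and the sample values are made up.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # hypothetical index setting

def shard_for(routing_value: str) -> int:
    """Map a routing value to a shard number (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_PRIMARY_SHARDS

# Routing by _id (the default): a user's documents can land on different shards.
print(shard_for("order-1001"), shard_for("order-1002"))

# Routing by user_id: every document of that user lands on the same shard.
print(shard_for("user_42"), shard_for("user_42"))
```

A search that passes the same routing value then only needs to query that one shard instead of all of them.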

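4. search_type

Default: query_then_fetch

dfs_query_then_fetch: can improve the accuracy of relevance sorting, because term statistics are first gathered from all shards and scores are computed from these global statistics.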
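The difference between the two search types comes down to which document counts feed the idf. The Python sketch below reuses the idf formula printed by explain in section 1.1.2; the per-shard statistics are hypothetical numbers chosen to exaggerate the effect.

```python
import math

def idf(n, N):
    """n = documents containing the term, N = documents with the field."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# (documents containing the term, total documents) on each shard.
shard_stats = [(1, 100), (50, 100)]

# query_then_fetch: each shard scores with its own local statistics.
local_idfs = [idf(n, N) for n, N in shard_stats]

# dfs_query_then_fetch: statistics are summed across shards first.
global_n = sum(n for n, _ in shard_stats)
global_N = sum(N for _, N in shard_stats)
global_idf = idf(global_n, global_N)

print("local idf per shard:", [round(value, 3) for value in local_idfs])
print("global idf:", round(global_idf, 3))
```

With local statistics the same term is weighted very differently on the two shards, so scores from different shards are not directly comparable; with global statistics every shard uses the same idf, at the cost of an extra round trip.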

