1. Scoring in detail
1.1. The TF/IDF scoring mechanism
1.1.1 Introduction to the algorithm
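The relevance score algorithm, simply put, computes how closely a piece of text in the index matches the search text.
Elasticsearch's relevance scoring is based on the term frequency/inverse document frequency algorithm, TF/IDF for short, where TF is term frequency and IDF is inverse document frequency. Since version 5.0 the default similarity is BM25, a refinement built from the same ingredients, which is why the explain output in section 1.1.2 shows BM25 formulas.
Term frequency: how many times each term of the search text appears in the field text. The more often it appears, the more relevant the document.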

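Example, for the search request: hello world
doc1 : hello you and me, and world is very good.
doc2 : hello, how are you
doc1 contains both hello and world, while doc2 contains only hello, so doc1 is more relevant.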
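The term-frequency idea can be illustrated with a small sketch. It is not how Elasticsearch analyzes text: the tokenizer here is a simple regular expression, and the function name `term_frequencies` is made up for this example.

```python
import re
from collections import Counter

def term_frequencies(query: str, field_text: str) -> dict:
    """Count how often each query term occurs in the field text."""
    # Assumption: lowercase and split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", field_text.lower())
    counts = Counter(tokens)
    return {term: counts[term] for term in query.lower().split()}

doc1 = "hello you and me,and world is very good."
doc2 = "hello,how are you"

tf_doc1 = term_frequencies("hello world", doc1)
tf_doc2 = term_frequencies("hello world", doc2)
print(tf_doc1)  # {'hello': 1, 'world': 1}
print(tf_doc2)  # {'hello': 1, 'world': 0}
```

doc1 matches both query terms and doc2 only one, which is the reason doc1 ranks higher on term frequency alone.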
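Inverse document frequency: how many documents in the whole index contain each term of the search text. The more documents a term appears in, the less a match on it says about relevance, so the lower its weight.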


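Example, for the search request: hello world
doc1 : hello, today is very good
doc2 : hi world, how are you
Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.
world is the rarer term, so a match on it carries more weight, and doc2 is more relevant.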
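As a rough check of the numbers in this example, the sketch below plugs them into the idf formula that appears in the explain output of section 1.1.2, `log(1 + (N - n + 0.5) / (n + 0.5))`. Using that particular formula here is an assumption; classic TF/IDF uses a slightly different expression, but the ordering of the two terms is the same.

```python
import math

def idf(n_docs_with_term: int, n_docs_total: int) -> float:
    """idf as printed by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (n_docs_total - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))

N = 100_000_000              # documents in the index
idf_hello = idf(1_000, N)    # hello appears in 1,000 documents
idf_world = idf(100, N)      # world appears in 100 documents

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
print("world weighs more than hello:", idf_world > idf_hello)
```

Because world is ten times rarer than hello, its idf is higher, so the document that matches world (doc2) comes out ahead.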
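Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match inside it.
Example, for the search request: hello world
doc1 : {"title":"hello article","content":"balabalabal ... (10,000 words)"}
doc2 : {"title":"my article","content":"balabalabal ... (10,000 words), world"}
doc1 matches hello in a two-word title, whereas doc2 matches world only at the end of a very long content field; a match in a short field is weighted more heavily than a match in a long one.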
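The effect of field length can be sketched with the tf formula printed in the explain output of section 1.1.2, `freq / (freq + k1 * (1 - b + b * dl / avgdl))`. The values of `dl` and `avgdl` below are hypothetical, and both documents are assumed to be scored on the same field, since this formula measures length relative to the average length of that field.

```python
def bm25_tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """tf component as printed by explain; dl is the field length in terms."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

AVGDL = 100.0  # hypothetical average field length

tf_short = bm25_tf(freq=1, dl=10, avgdl=AVGDL)     # one match in a 10-term field
tf_long = bm25_tf(freq=1, dl=1_000, avgdl=AVGDL)   # one match in a 1,000-term field

print(f"short field: {tf_short:.4f}")
print(f"long field:  {tf_long:.4f}")
```

With the same single occurrence, the short field gets a much larger tf contribution than the long one.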
1.1.2 How _score is calculated
GET /book/_search?explain=true
{
  "query": {
    "match": {
      "description": "java程序員"
    }
  }
}
Response:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.137549,
"hits" : [
{
"_shard" : "[book][0]",
"_node" : "MDA45-r6SUGJ0ZyqyhTINA",
"_index" : "book",
"_type" : "_doc",
"_id" : "3",
"_score" : 2.137549,
"_source" : {
"name" : "spring開發基礎",
"description" : "spring 在java領域非常流行,java程序員都在用。",
"studymodel" : "201001",
"price" : 88.6,
"timestamp" : "2019-08-24 19:11:35",
"pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
"tags" : [
"spring",
"java"
]
},
"_explanation" : {
"value" : 2.137549,
"description" : "sum of:",
"details" : [
{
"value" : 0.7936629,
"description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.7936629,
"description" : "score(freq=2.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.47000363,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.7675597,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 2.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 12.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 35.333332,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 1.3438859,
"description" : "weight(description:程序員 in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.3438859,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.98082924,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.6227967,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 12.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 35.333332,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
},
{
"_shard" : "[book][0]",
"_node" : "MDA45-r6SUGJ0ZyqyhTINA",
"_index" : "book",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.57961315,
"_source" : {
"name" : "java編程思想",
"description" : "java語言是世界第一編程語言,在軟件開發領域使用人數最多。",
"studymodel" : "201001",
"price" : 68.6,
"timestamp" : "2019-08-25 19:11:35",
"pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
"tags" : [
"java",
"dev"
]
},
"_explanation" : {
"value" : 0.57961315,
"description" : "sum of:",
"details" : [
{
"value" : 0.57961315,
"description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.57961315,
"description" : "score(freq=1.0), product of:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.47000363,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 3,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.56055,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 19.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 35.333332,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
]
}
}
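The numbers in the explain output can be recomputed by hand. The sketch below copies the boost, n, N, freq, dl and avgdl values from the response above and applies the two formulas printed in the `description` fields; the helper names are made up for this example, and small differences in the last digits are expected because Elasticsearch computes with single-precision floats.

```python
import math

K1, B = 1.2, 0.75        # k1 and b as reported in the explain output
AVGDL = 35.333332        # avgdl as reported in the explain output

def idf(n: int, N: int) -> float:
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def tf(freq: float, dl: float, avgdl: float = AVGDL) -> float:
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + K1 * (1 - B + B * dl / avgdl))

def term_score(boost: float, n: int, N: int, freq: float, dl: float) -> float:
    """score(freq=...), product of boost, idf and tf."""
    return boost * idf(n, N) * tf(freq, dl)

# Document _id 3: both terms match, and the total is the "sum of:" the two weights.
java_doc3 = term_score(2.2, n=2, N=3, freq=2.0, dl=12.0)         # explain reports 0.7936629
programmer_doc3 = term_score(2.2, n=1, N=3, freq=1.0, dl=12.0)   # explain reports 1.3438859
score_doc3 = java_doc3 + programmer_doc3                         # explain reports 2.137549

# Document _id 2: only "java" matches.
score_doc2 = term_score(2.2, n=2, N=3, freq=1.0, dl=19.0)        # explain reports 0.57961315

print(java_doc3, programmer_doc3, score_doc3, score_doc2)
```

Document 3 scores higher for two reasons visible in these inputs: it matches the rarer term (n = 1 gives a larger idf), and its description field is shorter (dl = 12 against 19).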
1.1.3 Analyzing how a document was matched
GET /book/_explain/3
{
  "query": {
    "match": {
      "description": "java程序員"
    }
  }
}
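1.2. Doc values
Searching relies on the inverted index. Sorting relies on a forward index, which exposes every field of every document so that the values can be compared; this forward index is what Elasticsearch calls doc values.
At index time Elasticsearch builds both: the inverted index to serve searches, and the forward index (doc values) to serve sorting, aggregations, filtering, and similar operations.
Doc values are stored on disk. When there is enough memory, the OS caches them in memory automatically and performance stays high; when there is not, the OS evicts them from its cache and they are read back from disk as needed.
Inverted index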
doc1: hello world you and me
doc2: hi, world, how are you
| term | doc1 | doc2 |
|---|---|---|
| hello | * | |
| world | * | * |
| you | * | * |
| and | * | |
| me | * | |
| hi | | * |
| how | | * |
| are | | * |
At search time:
hello you --> hello, you
hello --> doc1
you --> doc1,doc2
doc1: hello world you and me
doc2: hi, world, how are you
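Sorting is where this breaks down: the inverted index maps a term to documents, not a document to its field values, so it cannot efficiently supply the value to sort by.
Forward index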
doc1: { "name": "jack", "age": 27 }
doc2: { "name": "tom", "age": 30 }
| document | name | age |
|---|---|---|
| doc1 | jack | 27 |
| doc2 | tom | 30 |
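The two structures can be sketched in a few lines. This is a toy in-memory model, not Elasticsearch's on-disk format; the function names are made up, and the text documents and the name/age documents are kept in two separate dictionaries that share the ids doc1 and doc2, purely for illustration.

```python
import re
from collections import defaultdict

texts = {"doc1": "hello world you and me", "doc2": "hi, world, how are you"}
records = {"doc1": {"name": "jack", "age": 27}, "doc2": {"name": "tom", "age": 30}}

def build_inverted_index(texts: dict) -> dict:
    """term -> set of document ids containing it (used for searching)."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index

def build_doc_values(records: dict) -> dict:
    """field -> {document id -> value}, a column per field (used for sorting)."""
    columns = defaultdict(dict)
    for doc_id, fields in records.items():
        for field, value in fields.items():
            columns[field][doc_id] = value
    return columns

inverted = build_inverted_index(texts)
doc_values = build_doc_values(records)

# Search "hello you": look up each term in the inverted index.
matches = inverted["hello"] | inverted["you"]
# Sort the matches by age: look up each document's value in the age column.
by_age_desc = sorted(matches, key=lambda doc_id: doc_values["age"][doc_id], reverse=True)

print(sorted(inverted["you"]))  # ['doc1', 'doc2']
print(by_age_desc)              # ['doc2', 'doc1']
```

The search step only touches the term-to-document mapping, while the sort step only touches the document-to-value column, which is why both structures are built at index time.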
1.3. query phase
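1.3.1 query phase
(1) The search request is sent to some coordinating node, which builds a priority queue whose length is from + size, taken from the paging parameters; with the defaults (from = 0, size = 10) that is 10.
(2) The coordinating node forwards the request to every shard; each shard searches locally and builds its own local priority queue.
(3) Each shard returns its priority queue (document IDs and sort values only) to the coordinating node, which merges them into a global priority queue.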
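The three steps above can be sketched as a scatter/gather over in-memory "shards". Everything here is a stand-in: the shards are plain dictionaries, the score is a toy term count rather than BM25, and the function names are invented for the example.

```python
import heapq

def shard_query(shard_docs: dict, query_terms: list, top_k: int) -> list:
    """Step 2: search one shard locally and keep only its top_k (score, doc_id) pairs."""
    scored = []
    for doc_id, text in shard_docs.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in query_terms)  # toy score
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(top_k, scored)

def coordinate_query(shards: list, query: str, from_: int = 0, size: int = 10) -> list:
    """Steps 1 and 3: size the queue as from + size, then merge the local queues."""
    top_k = from_ + size
    terms = query.lower().split()
    local_queues = [shard_query(shard, terms, top_k) for shard in shards]
    merged = heapq.nlargest(top_k, (hit for queue in local_queues for hit in queue))
    return merged[from_:from_ + size]

shards = [
    {"a": "hello world", "b": "hello hello world"},
    {"c": "hi there", "d": "world world world hello"},
]

page1 = coordinate_query(shards, "hello world", from_=0, size=2)
page2 = coordinate_query(shards, "hello world", from_=1, size=2)
print(page1)  # [(4, 'd'), (3, 'b')]
print(page2)  # [(3, 'b'), (2, 'a')]
```

Each shard has to return from + size entries, not just size, because the coordinating node cannot know in advance which shard holds the hits for a given page.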
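1.3.2 How replica shards increase search throughput
A single request has to hit one copy (primary or replica) of every shard. If each shard has several replicas, concurrent search requests can be spread across the other copies at the same time.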
1.4. fetch phase
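1.4.1 How the fetch phase works
(1) Once the coordinating node has built the priority queue, it sends mget requests to the shards to fetch the corresponding documents.
(2) Each shard returns its documents to the coordinating node.
(3) The coordinating node merges the documents and returns the result to the client.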
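A minimal sketch of these steps follows, assuming the query phase has already produced a global queue of (score, shard id, document id) entries. The shard contents, `shard_mget` and `fetch_phase` are hypothetical stand-ins for the real per-shard multi-get.

```python
from collections import defaultdict

# Hypothetical shard contents: shard id -> {doc_id: stored document}
SHARDS = {
    0: {"a": {"title": "hello world"}, "b": {"title": "hello hello world"}},
    1: {"d": {"title": "world world world hello"}},
}

def shard_mget(shard_id: int, doc_ids: list) -> dict:
    """Stand-in for the per-shard multi-get: return the requested documents."""
    return {doc_id: SHARDS[shard_id][doc_id] for doc_id in doc_ids}

def fetch_phase(global_queue: list) -> list:
    """global_queue: (score, shard_id, doc_id) tuples, already ordered by the query phase."""
    # Step 1: group the wanted ids by shard so each shard gets one request.
    ids_by_shard = defaultdict(list)
    for _score, shard_id, doc_id in global_queue:
        ids_by_shard[shard_id].append(doc_id)
    # Step 2: every shard returns its documents.
    fetched = {}
    for shard_id, doc_ids in ids_by_shard.items():
        fetched.update(shard_mget(shard_id, doc_ids))
    # Step 3: assemble the hits in the order fixed by the global queue.
    return [
        {"_id": doc_id, "_score": score, "_source": fetched[doc_id]}
        for score, _shard_id, doc_id in global_queue
    ]

hits = fetch_phase([(4, 1, "d"), (3, 0, "b")])
print([hit["_id"] for hit in hits])  # ['d', 'b']
```

Splitting the work this way means full documents are loaded only for the hits that actually make it onto the requested page.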
1.4.2 By default, a search without from and size returns the first 10 hits, sorted by _score.
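1.5. A short summary of search parameters
1. preference
Determines which shard copies are used to execute the search.
_primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3 (_primary and _primary_first are accepted only by older versions; they were removed in Elasticsearch 7.0)
The bouncing results problem: two documents have the same value in the sort field, and different copies of a shard may order them differently. Because search requests are distributed round-robin across the copies of each shard (primary and replicas), the order of the results on the page can change from one request to the next. These are bouncing results.
The fix is to set preference to an arbitrary string, such as the user_id, so that every search by a given user is executed on the same shard copies and that user no longer sees bouncing results.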
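The difference between round-robin selection and a custom preference string can be sketched as follows. This only illustrates the idea: the copy names are hypothetical, and `zlib.crc32` is a stand-in for whatever hashing Elasticsearch applies to the preference string internally.

```python
import itertools
import zlib

REPLICAS = ["copy-0", "copy-1", "copy-2"]   # hypothetical copies of one shard
_round_robin = itertools.cycle(REPLICAS)

def pick_copy(preference=None) -> str:
    """Without a preference, rotate through the copies; with one, always pick the same copy."""
    if preference is None:
        return next(_round_robin)
    # Assumption: a stable hash of the string, reduced modulo the number of copies.
    return REPLICAS[zlib.crc32(preference.encode("utf-8")) % len(REPLICAS)]

without_pref = [pick_copy() for _ in range(4)]
with_pref = {pick_copy("user_42") for _ in range(4)}

print(without_pref)  # ['copy-0', 'copy-1', 'copy-2', 'copy-0']
print(with_pref)     # a set containing exactly one copy
```

In a real request the string is passed as the preference query parameter, for example GET /book/_search?preference=user_42, where user_42 is a placeholder for the caller's own identifier.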
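2. timeout
Limits how long the search may run: when the time is up, whatever results have been collected so far are returned, which avoids overly long queries.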
3. routing
Document routing. By default a document is routed by its _id; with routing=user_id, all data belonging to the same user goes to the same shard.
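The general idea behind routing is "hash the routing value, take it modulo the number of primary shards". The sketch below illustrates that idea only: the shard count and the order documents are hypothetical, and `zlib.crc32` stands in for the hash function Elasticsearch actually uses.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical index setting

def shard_for(routing_value: str) -> int:
    """Pick a shard from a stable hash of the routing value."""
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_PRIMARY_SHARDS

# (document id, owning user): hypothetical data
orders = [("order-1", "user_7"), ("order-2", "user_7"), ("order-3", "user_9")]

by_id = {doc_id: shard_for(doc_id) for doc_id, _user in orders}      # default: route by _id
by_user = {doc_id: shard_for(user) for doc_id, user in orders}       # routing=user_id

print(by_id)
print(by_user)
```

When routing by user, order-1 and order-2 always land on the same shard because they share the routing value user_7, so a search for that user's data can be sent to a single shard.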
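4. search_type
default: query_then_fetch
dfs_query_then_fetch: can improve the accuracy of relevance sorting, because term statistics are first collected from all shards and then used for scoring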
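Why global statistics matter can be seen with a small calculation. The per-shard counts below are hypothetical, and the idf formula is the one printed in the explain output of section 1.1.2.

```python
import math

def idf(n: int, N: int) -> float:
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical statistics for one term on two shards of the same index.
shard_stats = [
    {"docs": 1000, "docs_with_term": 5},     # the term is rare on this shard
    {"docs": 1000, "docs_with_term": 500},   # and common on this one
]

# query_then_fetch: each shard scores with its own local statistics.
local_idfs = [idf(s["docs_with_term"], s["docs"]) for s in shard_stats]

# dfs_query_then_fetch: statistics are summed across shards before scoring.
total_docs = sum(s["docs"] for s in shard_stats)
total_with_term = sum(s["docs_with_term"] for s in shard_stats)
global_idf = idf(total_with_term, total_docs)

print([round(value, 3) for value in local_idfs])
print(round(global_idf, 3))
```

With local statistics the same term gets very different weights depending on which shard a document happens to live on; with global statistics every shard uses one weight, at the cost of an extra round trip before the query phase.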
