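- Analyzed (tokenized) fields are usually searched with a `match` query, and phrases with a `match_phrase` query. However, `match_phrase` does not let you search an analyzed field without tokenizing the query (the "exact match" that business users often ask for). The example below shows why; anyone using Elasticsearch should take note.
- Suppose an index contains the following document:
{ "id": "1", "fulltext": "亞馬遜卓越有限公司訴訟某某公司" }
The `fulltext` field is analyzed with the ik analyzer when it is indexed. The ik analysis of the field value is: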
"tokens": [ { "token": "亞馬遜", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 }, { "token": "亞", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 1 }, { "token": "馬", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 2 }, { "token": "遜", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 3 }, { "token": "卓越", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 4 }, { "token": "卓", "start_offset": 3, "end_offset": 4, "type": "CN_WORD", "position": 5 }, { "token": "越有", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 6 }, { "token": "有限公司", "start_offset": 5, "end_offset": 9, "type": "CN_WORD", "position": 7 }, { "token": "有限", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 8 }, { "token": "公司", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 }, { "token": "訴訟", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 10 }, { "token": "訟", "start_offset": 10, "end_offset": 11, "type": "CN_WORD", "position": 11 }, { "token": "某某", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 12 }, { "token": "某公司", "start_offset": 12, "end_offset": 15, "type": "CN_WORD", "position": 13 }, { "token": "公司", "start_offset": 13, "end_offset": 15, "type": "CN_WORD", "position": 14 } ]
Given the tokens above, a `match_phrase` query for "亞馬遜卓越" returns no results,
because analyzing "亞馬遜卓越" produces the following tokens:
{
"tokens": [
{
"token": "亞馬遜",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "亞",
"start_offset": 0,
"end_offset": 1,
"type": "CN_WORD",
"position": 1
},
{
"token": "馬",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 2
},
{
"token": "遜",
"start_offset": 2,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "卓越",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
},
{
"token": "卓",
"start_offset": 3,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "越",
"start_offset": 4,
"end_offset": 5,
"type": "CN_CHAR",
"position": 6
}
]
}
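Compare this with the stored tokens: the indexed text contains the token "越有", whereas the query contains "越" instead, a token that was never indexed. The `match_phrase` query therefore fails, and nothing is returned.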
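The comparison just described can be sketched in a few lines of Python. This is only a simplified model of exact phrase matching (slop 0), not Elasticsearch's actual implementation. The function name `phrase_matches` is made up for illustration, and the sketch assumes each token's position equals its index in the list, which holds for the ik output shown above.

```python
from collections import defaultdict

# Tokens of the indexed text, in position order (copied from the ik output above).
DOC_TOKENS = ["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有",
              "有限公司", "有限", "公司", "訴訟", "訟", "某某", "某公司", "公司"]


def phrase_matches(doc_tokens, query_tokens):
    """Return True if the query tokens occur at consecutive positions in the document."""
    if not query_tokens:
        return False
    # Map each token to the set of positions where it was indexed.
    positions = defaultdict(set)
    for pos, token in enumerate(doc_tokens):
        positions[token].add(pos)
    # Try every position of the first query token as the start of the phrase.
    for start in positions.get(query_tokens[0], ()):
        if all(start + i in positions.get(token, ())
               for i, token in enumerate(query_tokens)):
            return True
    return False


# Tokens produced for the query "亞馬遜卓越".
query = ["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越"]
print(phrase_matches(DOC_TOKENS, query))  # False: "越" was never indexed
```

The check fails on the last token alone: every other token lines up, but "越" has no position in the index.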
Staying with the same example, search for "亞馬遜卓越有" instead. Surprisingly, the document is found. Analyzing "亞馬遜卓越有" gives:
{
"tokens": [
{
"token": "亞馬遜",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "亞",
"start_offset": 0,
"end_offset": 1,
"type": "CN_WORD",
"position": 1
},
{
"token": "馬",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 2
},
{
"token": "遜",
"start_offset": 2,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "卓越",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
},
{
"token": "卓",
"start_offset": 3,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "越有",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 6
}
]
}
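Notice that the token "越有" now appears. The query tokens and their positions coincide exactly with the first seven tokens of the indexed text (positions 0 to 6), so `match_phrase` finds the document.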
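For anyone who wants to reproduce these lookups, the request bodies might look like the sketch below. It only builds the JSON bodies and sends nothing. The helper names `analyze_body` and `match_phrase_body` and the index name `my_index` are hypothetical; the `field`/`text` parameters of the `_analyze` API and the `query`/`slop` parameters of `match_phrase` come from the Elasticsearch query DSL.

```python
import json


def analyze_body(field, text):
    """Body for the _analyze API: analyze `text` with the analyzer mapped to `field`."""
    return {"field": field, "text": text}


def match_phrase_body(field, text, slop=0):
    """Body for the _search API: a match_phrase query on `field` (slop defaults to 0)."""
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}


# Would be sent as e.g. GET /my_index/_analyze (my_index is a hypothetical index name).
print(json.dumps(analyze_body("fulltext", "亞馬遜卓越有"), ensure_ascii=False))

# Would be sent as e.g. GET /my_index/_search.
print(json.dumps(match_phrase_body("fulltext", "亞馬遜卓越有"), ensure_ascii=False))
```

Passing `field` rather than an analyzer name means the query text is analyzed with whatever analyzer the mapping assigns to `fulltext`, so no particular ik analyzer name has to be assumed.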
Now a more extreme example: search for "越有限公司". Surprisingly, this matches as well. Analyzing "越有限公司" gives:
{
"tokens": [
{
"token": "越有",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "有限公司",
"start_offset": 1,
"end_offset": 5,
"type": "CN_WORD",
"position": 1
},
{
"token": "有限",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
},
{
"token": "公司",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 3
}
]
}
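These tokens again line up exactly with the indexed tokens (the run that starts at "越有"), so the query matches. Note the token "有限公司" here. If the search term is changed to "越有限", nothing is found, because "越有限" is analyzed as: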
{
"tokens": [
{
"token": "越有",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "有限",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
}
]
}
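Both "越有" and "有限" are present in the indexed text, but the token "有限公司" sits between them: in the document they are at positions 6 and 8, two apart, while in the query they are at positions 0 and 1, one apart. The positions do not line up, so nothing matches. If the `match_phrase` query is given `slop` set to 1, "越有限" does match. The point should now be clear: `slop` is the allowed discrepancy between the token positions in the query and those in the document. `match_phrase` is not a substring match on the concatenated text; the query is still analyzed, and the resulting tokens and their positions must match.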
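The role of `slop` can be illustrated with a rough Python model. This is a simplified sketch, not Lucene's real sloppy-phrase algorithm: it ignores scoring and edge cases such as repeated query terms, and it assumes each token's position equals its list index. The function name `sloppy_phrase_matches` is hypothetical.

```python
from itertools import product

# Tokens of the indexed text, in position order (copied from the ik output above).
DOC_TOKENS = ["亞馬遜", "亞", "馬", "遜", "卓越", "卓", "越有",
              "有限公司", "有限", "公司", "訴訟", "訟", "某某", "某公司", "公司"]


def sloppy_phrase_matches(doc_tokens, query_tokens, slop=0):
    """Return True if the query tokens can be aligned with the document within `slop`."""
    if not query_tokens:
        return False
    # For each query token, collect every document position where it occurs.
    candidates = []
    for token in query_tokens:
        found = [pos for pos, t in enumerate(doc_tokens) if t == token]
        if not found:
            return False  # a token that was never indexed can never match
        candidates.append(found)
    # Try every combination of positions; shift = document position - query position.
    for combo in product(*candidates):
        shifts = [pos - i for i, pos in enumerate(combo)]
        # All shifts equal means an exact phrase; their spread must stay within slop.
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


query = ["越有", "有限"]  # tokens produced for the query "越有限"
print(sloppy_phrase_matches(DOC_TOKENS, query, slop=0))  # False
print(sloppy_phrase_matches(DOC_TOKENS, query, slop=1))  # True
```

With "越有" at position 6 and "有限" at position 8, the shifts are 6 and 7. Their spread of 1 is exactly the slop needed for the query to match.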
