Elasticsearch at a Glance: The Difference Between the edge_ngram and ngram Tokenizers
1 year ago
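edge_ngram and ngram are two tokenizers built into Elasticsearch, and they come up often when defining an index's settings and mappings. Once the gram lengths (min_gram and max_gram) are set, either one can be assigned directly to the tokenizer of an analyzer.
Here we run both tokenizers on the same string. The output corresponds to gram lengths of 1 through 3: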
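As a rough sketch of that setup, the Python snippet below only builds the settings body for creating an index; nothing is sent to a cluster. The names my_analyzer and my_tokenizer are placeholders, and the gram lengths 1 and 3 are assumed because they reproduce the output shown in this post. Recent Elasticsearch versions also cap max_gram minus min_gram for the ngram tokenizer through index.max_ngram_diff (default 1), so the sketch raises that limit when needed.

```python
import json


def build_settings(tokenizer_type, min_gram=1, max_gram=3):
    """Return an index-settings body that wires an n-gram style tokenizer
    into a custom analyzer. All names here are illustrative placeholders."""
    index = {
        "analysis": {
            "analyzer": {
                # The analyzer simply points at the tokenizer defined below.
                "my_analyzer": {"type": "custom", "tokenizer": "my_tokenizer"}
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": tokenizer_type,  # "edge_ngram" or "ngram"
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                }
            },
        }
    }
    # Assumption: only the ngram tokenizer is subject to max_ngram_diff.
    if tokenizer_type == "ngram" and max_gram - min_gram > 1:
        index["max_ngram_diff"] = max_gram - min_gram
    return {"settings": {"index": index}}


print(json.dumps(build_settings("edge_ngram"), indent=2))
```

The resulting dictionary could be passed as the body of an index-creation request; switching the first argument to "ngram" gives the second tokenizer discussed here.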
字符串
- edge_ngram tokenizer, which produces:
{
  "tokens": [
    {
      "token": "字",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "字符",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "字符串",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
  ]
}
- ngram tokenizer, which produces:
{
  "tokens": [
    {
      "token": "字",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "字符",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "字符串",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "符",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 3
    },
    {
      "token": "符串",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 4
    },
    {
      "token": "串",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 5
    }
  ]
}
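Clear at a glance? Put simply, the edge_ngram tokenizer starts at the first character and emits progressively longer grams, growing one character at a time until it reaches the configured maximum length or the end of the text. The ngram tokenizer does the same thing, but from every character position in turn rather than only the first.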
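The rule can be mimicked in a few lines of plain Python. This is only an illustration of the idea, not how Elasticsearch implements it: the function names are made up for this post, and details such as token_chars, offsets, and positions are ignored.

```python
def edge_ngram(text, min_gram=1, max_gram=3):
    """Grams anchored at the first character only."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngram(text, min_gram=1, max_gram=3):
    """Grams starting at every character position, shortest first."""
    tokens = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(text):  # skip grams that run past the end
                tokens.append(text[start:start + n])
    return tokens


print(edge_ngram("字符串"))  # ['字', '字符', '字符串']
print(ngram("字符串"))       # ['字', '字符', '字符串', '符', '符串', '串']
```

With gram lengths 1 through 3, both lists match the token values in the JSON output above, in the same order.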
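As for practical use: when a match must begin at the first character (prefix matching, as in autocomplete), edge_ngram is the natural choice; when a match may begin anywhere in the text, ngram is the better fit.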
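To make that trade-off concrete, here is a small simulation under simplifying assumptions: the indexed field is reduced to a set of grams, and the query term is looked up as-is (as if the search side did not split it into grams). The helper grams is hypothetical and exists only for this illustration.

```python
def grams(text, min_gram, max_gram, edge_only):
    """Set of grams for `text`; edge_only=True keeps only grams that
    start at the first character (edge_ngram-like behaviour)."""
    starts = [0] if edge_only else range(len(text))
    return {
        text[s:s + n]
        for s in starts
        for n in range(min_gram, max_gram + 1)
        if s + n <= len(text)
    }


doc = "字符串"
edge_index = grams(doc, 1, 3, edge_only=True)
full_index = grams(doc, 1, 3, edge_only=False)

for query in ("字符", "符串"):
    # Print whether each simulated index contains the query term.
    print(query, "edge_ngram:", query in edge_index, "ngram:", query in full_index)
```

The prefix 字符 is found in both sets, while 符串, which starts at the second character, is found only in the ngram set. The price is that ngram stores more tokens per value.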