Elasticsearch: The Difference Between the edge_ngram and ngram Tokenizers


1 year ago
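edge_ngram and ngram are two tokenizers that ship with Elasticsearch, and both come up often when setting up an index's analysis settings and mappings. Once the gram lengths are configured (the min_gram and max_gram parameters), the tokenizer can be assigned directly to the tokenizer field of an analyzer.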
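To make the configuration step concrete, here is a minimal sketch in Python that builds an index-creation body wiring one of these tokenizers into a custom analyzer. The names build_index_settings, my_tokenizer, and my_analyzer are hypothetical and do not come from the original article. The sketch also assumes a recent Elasticsearch version in which the ngram tokenizer is subject to the index.max_ngram_diff limit (default 1, as far as I know), so it raises that limit when the requested range is wider.

```python
import json


def build_index_settings(tokenizer_type: str, min_gram: int = 1, max_gram: int = 3) -> dict:
    """Build an index-creation body with a custom analyzer on an n-gram tokenizer.

    "my_tokenizer" and "my_analyzer" are placeholder names chosen for this sketch.
    """
    if tokenizer_type not in ("edge_ngram", "ngram"):
        raise ValueError("tokenizer_type must be 'edge_ngram' or 'ngram'")

    settings = {
        "analysis": {
            "tokenizer": {
                "my_tokenizer": {
                    "type": tokenizer_type,
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                }
            },
            "analyzer": {
                "my_analyzer": {"type": "custom", "tokenizer": "my_tokenizer"}
            },
        }
    }

    # Assumption: for the ngram tokenizer, max_gram - min_gram must not exceed
    # index.max_ngram_diff, so widen the limit when the range needs it.
    diff = max_gram - min_gram
    if tokenizer_type == "ngram" and diff > 1:
        settings["index"] = {"max_ngram_diff": diff}

    return {"settings": settings}


# Print the body that would be sent when creating the index.
print(json.dumps(build_index_settings("edge_ngram"), indent=2))
```

The resulting dictionary is what would be sent as the JSON body when creating the index; a text field would then reference the analyzer by name in its mapping.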
Here, we use the same string for both tokenization examples:
字符串

  1. The edge_ngram tokenizer produces the following tokens:
    {
      "tokens": [
        {
          "token": "字",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0
        },
        {
          "token": "字符",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "字符串",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 2
        }
      ]
    }
  2. The ngram tokenizer produces the following tokens:
    {
      "tokens": [
        {
          "token": "字",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0
        },
        {
          "token": "字符",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "字符串",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 2
        },
        {
          "token": "符",
          "start_offset": 1,
          "end_offset": 2,
          "type": "word",
          "position": 3
        },
        {
          "token": "符串",
          "start_offset": 1,
          "end_offset": 3,
          "type": "word",
          "position": 4
        },
        {
          "token": "串",
          "start_offset": 2,
          "end_offset": 3,
          "type": "word",
          "position": 5
        }
      ]
    }
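The difference should be clear at a glance. Put simply, edge_ngram anchors every token at the first character and extends it one character at a time, from the minimum gram length up to the maximum or the end of the text, whichever comes first. ngram does the same thing, but it starts from every character position in turn rather than only from the first.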
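The rule just described can be sketched in a few lines of plain Python. This is only an illustration of the gram-generation logic, not Elasticsearch's implementation: it ignores options such as token_chars, offsets, and positions, and the function names edge_ngrams and ngrams are made up for this example. The gram range of 1 to 3 is an assumption chosen because it is consistent with the results shown above.

```python
def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 3) -> list[str]:
    """Return prefixes of text whose lengths run from min_gram to max_gram."""
    upper = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


def ngrams(text: str, min_gram: int = 1, max_gram: int = 3) -> list[str]:
    """Return grams of length min_gram..max_gram starting at every position."""
    grams = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            # Skip grams that would run past the end of the text.
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


print(edge_ngrams("字符串"))  # ['字', '字符', '字符串']
print(ngrams("字符串"))       # ['字', '字符', '字符串', '符', '符串', '串']
```

Both lists come out in the same order as the token lists in the two results above.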
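As for practical use: when a match must begin at the first character (prefix matching, as in search-as-you-type), edge_ngram is the natural choice. When a match may begin at any character in the text (substring matching), ngram is the better fit.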
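A small simulation may help show why the choice matters. The sketch below treats the indexed terms of a field as a plain Python set and checks whether a query term is among them. It assumes the query text is looked up as a single unsplit term (for instance, with a search analyzer that does not produce n-grams), and the helper names are hypothetical.

```python
def edge_ngram_terms(text: str, min_gram: int = 1, max_gram: int = 3) -> set[str]:
    """Terms an edge n-gram style tokenizer would index: prefixes only."""
    return {text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)}


def ngram_terms(text: str, min_gram: int = 1, max_gram: int = 3) -> set[str]:
    """Terms an n-gram style tokenizer would index: grams from every position."""
    return {
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    }


def matches(query: str, indexed_terms: set[str]) -> bool:
    """A query term matches only if it is exactly one of the indexed terms."""
    return query in indexed_terms


doc = "字符串"
for query in ("字符", "符串"):
    print(
        query,
        "edge_ngram:", matches(query, edge_ngram_terms(doc)),
        "ngram:", matches(query, ngram_terms(doc)),
    )
```

The prefix 字符 is found under both schemes, while the inner substring 符串 is found only among the ngram terms. The price is a larger index, since ngram emits more terms per value.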

