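1. The default analyzer
standard analyzer
standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it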
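The pipeline above can be sketched in a few lines of Python. This is only a rough model, not how Elasticsearch implements it: the `\w+` regex is a simplification of real word-boundary splitting, `ENGLISH_STOPWORDS` is a short illustrative list rather than the full `_english_` set, and the function names are made up for this example.

```python
import re

# Illustrative subset only (assumption); the real _english_ list is longer.
ENGLISH_STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}


def standard_tokenize(text):
    """Approximate word-boundary splitting with runs of word characters."""
    return re.findall(r"\w+", text)


def analyze(text, stopwords=None):
    """Tokenize, lowercase, then optionally drop stop words.

    stopwords=None mirrors the default, where the stop filter is disabled.
    """
    tokens = [t.lower() for t in standard_tokenize(text)]
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


if __name__ == "__main__":
    print(analyze("a dog is in the house"))
    print(analyze("a dog is in the house", ENGLISH_STOPWORDS))
```

With the stop filter off, every word survives as a lowercase token; with it on, only `dog` and `house` remain in this sketch.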
2. Changing the analyzer settings
Enable the english stop-word token filter
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
3. Building your own custom analyzer
1. Character conversion: & to and
2. Remove certain stop words
3. Convert to lowercase
If my_index already exists from section 2, delete it first (DELETE /my_index); creating an index that already exists fails.

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
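To make the order of the stages concrete (char filters, then tokenizer, then token filters), here is a minimal Python sketch of my_analyzer. It is an approximation under stated assumptions: `strip_html` is a crude stand-in for html_strip, the `\w+` regex stands in for the standard tokenizer, and the replacement is padded with spaces so that `tom&jerry` splits into three tokens. The real mapping char filter may treat whitespace around the replacement differently, so verify actual output with the _analyze request above. All function names are hypothetical.

```python
import re

MY_STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter


def strip_html(text):
    """Crude stand-in for the html_strip char filter: blank out tags."""
    return re.sub(r"<[^>]+>", " ", text)


def map_chars(text, mappings):
    """Stand-in for a mapping char filter: plain substring replacement."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def my_analyzer(text):
    # 1. Char filters run on the raw string, in the configured order.
    text = strip_html(text)
    # Assumption: padded with spaces so the words on either side stay separate.
    text = map_chars(text, {"&": " and "})
    # 2. The tokenizer splits the filtered string into tokens.
    tokens = re.findall(r"\w+", text)
    # 3. Token filters run in order: lowercase, then stop-word removal.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in MY_STOPWORDS]


if __name__ == "__main__":
    print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

In this sketch the `<a>` tag and the punctuation disappear, `HAHA` is lowercased, and `a` and `the` are dropped, while `in` and `are` stay because they are not in the custom stop list.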