# Analyzers

In Elasticsearch, every analyzer, whether built-in or custom, is made up of three parts: character filters, a tokenizer, and token filters.

Analyzer workflow:

Input text => Character filters (applied in order if there are several) => Tokenizer => Token filters (applied in order if there are several) => Output tokens

## Character Filters

Character filters preprocess the raw text, for example stripping HTML tags or mapping "&" to "and".

Note: when an analyzer has several character filters, they are applied in order.

## Tokenizer

The tokenizer splits the string into a stream of tokens, for example splitting English text on whitespace.

## Token Filters

Token filters post-process the tokens emitted by the tokenizer, for example lowercasing, removing stopwords, singular/plural normalization, or synonym expansion.

Note: when an analyzer has several token filters, they are applied in order.

## The _analyze API

The _analyze API verifies what an analyzer produces and can explain each step of the analysis.

```
# text:        the text to analyze
# explain:     explain the analysis process
# char_filter: character filters
# tokenizer:   tokenizer
# filter:      token filters
GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
  "explain": true
}
```

# Defining Multiple Custom Analyzers

## Create an index with multiple custom analyzers

Here several analyzers are defined on a single index. The `analysis` block declares two custom character filters, two custom tokenizers, two custom token filters, and two analyzers assembled from them. (Request bodies must be valid JSON, so the settings carry no inline comments.)

```
PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "char_filter": {
        "my_charfilter1": {
          "type": "mapping",
          "mappings": ["& => and"]
        },
        "my_charfilter2": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      },
      "tokenizer": {
        "my_tokenizer1": {
          "pattern": "\\s+",
          "type": "pattern"
        },
        "my_tokenizer2": {
          "pattern": "_",
          "type": "pattern"
        }
      },
      "filter": {
        "my_tokenfilter1": {
          "type": "stop",
          "stopwords": ["the", "a", "an"]
        },
        "my_tokenfilter2": {
          "type": "stop",
          "stopwords": ["info", "debug"]
        }
      },
      "analyzer": {
        "my_analyzer1": {
          "char_filter": ["html_strip", "my_charfilter1", "my_charfilter2"],
          "tokenizer": "my_tokenizer1",
          "filter": ["lowercase", "my_tokenfilter1"]
        },
        "my_analyzer2": {
          "char_filter": ["html_strip"],
          "tokenizer": "my_tokenizer2",
          "filter": ["my_tokenfilter2"]
        }
      }
    }
  }
}
```

## Verify the analyzers on my_index

### Verify my_analyzer1

Add `"explain": true` to the request body to see each step of the analysis.

```
GET /my_index/_analyze
{
  "text": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "analyzer": "my_analyzer1"
}
```

Response:

```
{
  "tokens": [
    { "token": "tom",    "start_offset": 3,  "end_offset": 6,  "type": "word", "position": 0 },
    { "token": "and",    "start_offset": 12, "end_offset": 13, "type": "word", "position": 1 },
    { "token": "jerry",  "start_offset": 17, "end_offset": 26, "type": "word", "position": 2 },
    { "token": "in",     "start_offset": 27, "end_offset": 29, "type": "word", "position": 3 },
    { "token": "room",   "start_offset": 34, "end_offset": 38, "type": "word", "position": 5 },
    { "token": "number", "start_offset": 39, "end_offset": 45, "type": "word", "position": 6 },
    { "token": "1_1_1",  "start_offset": 46, "end_offset": 51, "type": "word", "position": 7 }
  ]
}
```

### Verify my_analyzer2

```
GET /my_index/_analyze
{
  "text": "<b>debug_192.168.113.1_971213863506812928</b>",
  "analyzer": "my_analyzer2"
}
```

Response:

```
{
  "tokens": [
    { "token": "192.168.113.1",      "start_offset": 9,  "end_offset": 22, "type": "word", "position": 1 },
    { "token": "971213863506812928", "start_offset": 23, "end_offset": 45, "type": "word", "position": 2 }
  ]
}
```

## Add a mapping that assigns a different analyzer to each field

```
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field1": {
      "type": "text",
      "analyzer": "my_analyzer1",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "my_field2": {
      "type": "text",
      "analyzer": "my_analyzer2",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}
```

## Index a document

```
PUT my_index/my_type/1
{
  "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
}
```

## Full-Text Search with the Match Query

At query time, Elasticsearch analyzes the query string with the analyzer configured for the target field, then searches with the resulting terms.

```
# find documents whose my_field2 contains the IP 192.168.113.1
GET my_index/_search
{
  "query": {
    "match": {
      "my_field2": "192.168.113.1"
    }
  }
}
```

Response:

```
{
  "took": 22,
  "timed_out": false,
  "_shards": { "total": 3, "successful": 3, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
          "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
        }
      }
    ]
  }
}
```
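The char filter => tokenizer => token filter pipeline of `my_analyzer1` can be sketched in plain Python. This is a simplified stand-in, not the actual Lucene implementation: the `html_strip` regex and the function names are illustrative assumptions, but the stages and their ordering mirror the analyzer defined above.

```python
import re

def html_strip(text):
    # crude stand-in for the html_strip character filter
    return re.sub(r"<[^>]+>", "", text)

def mapping_filter(text):
    # my_charfilter1: "& => and"
    return text.replace("&", "and")

def pattern_replace(text):
    # my_charfilter2: turn digit-dash-digit into digit_digit
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)

def pattern_tokenizer(text):
    # my_tokenizer1: split on runs of whitespace
    return [t for t in re.split(r"\s+", text) if t]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    # my_tokenfilter1: drop configured stopwords
    return [t for t in tokens if t not in {"the", "a", "an"}]

def my_analyzer1(text):
    # character filters run first, in declaration order
    for char_filter in (html_strip, mapping_filter, pattern_replace):
        text = char_filter(text)
    # then the single tokenizer
    tokens = pattern_tokenizer(text)
    # then token filters, in declaration order
    for token_filter in (lowercase, stop_filter):
        tokens = token_filter(tokens)
    return tokens

print(my_analyzer1("<b>Tom </b> & <b>jerry</b> in the room number 1-1-1"))
# ['tom', 'and', 'jerry', 'in', 'room', 'number', '1_1_1']
```

The output matches the token stream the _analyze API returned for `my_analyzer1` above (offsets and positions omitted).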
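The match query hits because the query string is analyzed with the same analyzer as the indexed field, so query terms and indexed terms line up. A minimal Python sketch of that idea, using a simplified stand-in for `my_analyzer2` (the regex and variable names are illustrative assumptions):

```python
import re

STOPWORDS = {"info", "debug"}  # my_tokenfilter2

def my_analyzer2(text):
    text = re.sub(r"<[^>]+>", "", text)               # html_strip stand-in
    tokens = [t for t in text.split("_") if t]        # my_tokenizer2: split on "_"
    return [t for t in tokens if t not in STOPWORDS]  # stop filter

# index time: the field value becomes terms in the inverted index
indexed_terms = set(my_analyzer2("<b>debug_192.168.113.1_971213863506812928</b>"))

# query time: the match query analyzes the query with the field's analyzer
query_terms = my_analyzer2("192.168.113.1")

# the document matches when a query term appears among its indexed terms
print(any(t in indexed_terms for t in query_terms))  # True
```

Because "debug" is removed by the stop filter at index time, a match query for "debug" against `my_field2` would find nothing, while the IP matches exactly.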