Installing and Using the Elasticsearch Pinyin and IK Analyzers


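I. Downloading and Configuring the ES Plugins

1. Downloading and installing the IK analyzer

The IK analyzer needs little introduction: it is a very widely used Chinese analyzer with good segmentation quality, and most Elasticsearch projects that handle Chinese text rely on it.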

Download: https://github.com/medcl/elasticsearch-analysis-ik

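2. Downloading and installing the pinyin analyzer

On sites such as Taobao and JD.com you can type pinyin into the search box and still find the results you want. That is pinyin analysis: the pinyin analyzer converts Chinese text into pinyin, so documents can be found by searching on the pinyin terms it produces.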

Download: https://github.com/medcl/elasticsearch-analysis-pinyin

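Note: the plugin version you download must match your Elasticsearch version, and Elasticsearch must be restarted after a plugin is installed before the plugin takes effect.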
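The version-matching rule can be checked before anything is copied into place. The following is a minimal Python sketch, assuming the release archives are named like elasticsearch-analysis-ik-6.3.0.zip; the file names and version numbers below are hypothetical examples, and the helper names are made up for illustration. The running Elasticsearch version can be read from the version.number field returned by GET /.

```python
import re


def plugin_version(archive_name: str) -> str:
    """Extract the trailing x.y.z version from a plugin archive name."""
    match = re.search(r"-(\d+\.\d+\.\d+)\.zip$", archive_name)
    if match is None:
        raise ValueError(f"no version found in {archive_name!r}")
    return match.group(1)


def matches_es(archive_name: str, es_version: str) -> bool:
    """True when the archive targets exactly this Elasticsearch version."""
    return plugin_version(archive_name) == es_version


# Hypothetical archive names and versions, for illustration only.
print(matches_es("elasticsearch-analysis-ik-6.3.0.zip", "6.3.0"))      # True
print(matches_es("elasticsearch-analysis-pinyin-6.2.4.zip", "6.3.0"))  # False
```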

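Plugin installation location: each plugin goes in its own folder under the plugins directory of the Elasticsearch installation. (I have three plugins installed; the murmur3 plugin is not covered here and can be ignored for now.)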

 

Once the plugins are in place, restart ES.

 

II. Using the Pinyin and IK Analyzers

1. Using the IK Chinese analyzer

1.1 ik_smart: performs the coarsest-grained segmentation

GET /_analyze
{
  "text":"中華人民共和國國徽",
  "analyzer":"ik_smart"
}

Result:
{
  "tokens": [
    {
      "token": "中華人民共和國",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "國徽",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

1.2 ik_max_word: performs the finest-grained segmentation of the text

GET /_analyze
{
  "text": "中華人民共和國國徽",
  "analyzer": "ik_max_word"
}

Result:
{
  "tokens": [
    {
      "token": "中華人民共和國",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中華人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中華",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "華人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和國",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和國",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "國",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "國徽",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
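The start_offset and end_offset values in the result mark where each token sits in the input text, which also shows that ik_max_word emits overlapping tokens. A small Python check, using a few tokens copied from the result above (for the characters used here the offsets coincide with Python string indices; the helper name is made up for illustration):

```python
text = "中華人民共和國國徽"

# (token, start_offset, end_offset) triples copied from the ik_max_word result.
tokens = [
    ("中華人民共和國", 0, 7),
    ("華人", 1, 3),
    ("人民共和國", 2, 7),
    ("國", 6, 7),
    ("國徽", 7, 9),
]


def slice_matches(source: str, token: str, start: int, end: int) -> bool:
    """Check that the offsets cut exactly this token out of the source text."""
    return source[start:end] == token


for token, start, end in tokens:
    print(token, start, end, slice_matches(text, token, start, end))
```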

 

 

2. Using the pinyin analyzer

GET /_analyze
{
  "text":"劉德華",
  "analyzer": "pinyin"
}

Result:
{
  "tokens": [
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "de",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "hua",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    }
  ]
}

 

 

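Note: whether you use the pinyin analyzer or the IK analyzer, a document can be found only through the terms the analyzer actually produced for it. A search on anything else finds nothing.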
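To make this concrete, here is a toy Python model of an exact term lookup. It is only an illustration, not how Elasticsearch is implemented internally. The term set is copied from the pinyin result above; the function name is hypothetical. Keep in mind that a match query first runs the query text through the field's analyzer, so this sketch models the lookup that happens after that step.

```python
# Terms the pinyin analyzer produced for "劉德華" (copied from the result above).
indexed_terms = {"liu", "ldh", "de", "hua"}


def term_matches(query_term: str, terms: set) -> bool:
    """An exact term lookup succeeds only if the term was produced at analysis time."""
    return query_term in terms


for query_term in ["liu", "ldh", "liudehua", "劉德華"]:
    print(query_term, term_matches(query_term, indexed_terms))
```

Here liu and ldh are found, while liudehua and the original 劉德華 are not, because neither was emitted as a term by this analyzer.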

 

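III. Combining the IK and Pinyin Analyzers

When creating an index you can define custom analyzers in its settings and then reference them by name from the field mappings.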

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_smart_pinyin": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["my_pinyin", "word_delimiter"]
        },
        "ik_max_word_pinyin": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_pinyin", "word_delimiter"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true
        }
      }
    }
  }
}
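The two custom analyzers differ only in their tokenizer and share the same my_pinyin filter. One way to keep them in sync is to generate the request body from code. The sketch below builds the same settings as a Python dictionary; the function and constant names are assumptions made for illustration, and sending the body to Elasticsearch is left out.

```python
import json

# Pinyin token filter options, copied from the settings above.
PINYIN_FILTER = {
    "type": "pinyin",
    "keep_separate_first_letter": True,
    "keep_full_pinyin": True,
    "keep_original": True,
    "limit_first_letter_length": 16,
    "lowercase": True,
    "remove_duplicated_term": True,
}


def build_settings(tokenizers=("ik_smart", "ik_max_word")) -> dict:
    """Build one '<tokenizer>_pinyin' custom analyzer per IK tokenizer."""
    analyzers = {
        f"{tokenizer}_pinyin": {
            "type": "custom",
            "tokenizer": tokenizer,
            "filter": ["my_pinyin", "word_delimiter"],
        }
        for tokenizer in tokenizers
    }
    return {
        "settings": {
            "analysis": {
                "analyzer": analyzers,
                "filter": {"my_pinyin": dict(PINYIN_FILTER)},
            }
        }
    }


body = build_settings()
print(json.dumps(body, indent=2))
```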

 

 

When defining the mapping for the type, set the analyzer property of the field to the name of your custom analyzer.

PUT /my_index/my_type/_mapping
{
    "my_type":{
      "properties": {
        "id":{
          "type": "integer"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_smart_pinyin"
        }
      }
    }
}

 

 

To test, first add a few documents.

POST /my_index/my_type/_bulk
{ "index": { "_id":1}}
{ "name": "張三"}
{ "index": { "_id": 2}}
{ "name": "張四"}
{ "index": { "_id": 3}}
{ "name": "李四"}

Query using IK segmentation:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "李"
    }
  }
}

Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.47160998,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.47160998,
        "_source": {
          "name": "李四"
        }
      }
    ]
  }
}

 

Query using pinyin:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "zhang"
    }
  }
}

Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.3758317,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.3758317,
        "_source": {
          "name": "張四"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.3758317,
        "_source": {
          "name": "張三"
        }
      }
    ]
  }
}

 

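Note: before searching, check what terms the text is analyzed into (for example with the _analyze requests shown earlier). If the term you search for is not among the analyzed terms, nothing will be found.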

 

