Enhancing Elasticsearch Tokenization with HanLP

This article is a repost. 2018-12-11 16:08

hanlp-ext plugin source code: http://git.oschina.net/hualongdata/hanlp-ext or https://github.com/hualongdata/hanlp-ext

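By default, Elasticsearch tokenizes Chinese text character by character, which falls well short of what word-based search requires. There is an official SmartCN Chinese analysis plugin, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, for Chinese word segmentation.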

Elasticsearch

Elasticsearch's default tokenization of Chinese text is painful to look at.

    GET /_analyze?pretty
    {
      "text": ["重慶華龍網海數科技有限公司"]
    }

Output:

    {
      "tokens": [
        { "token": "重", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
        { "token": "慶", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
        { "token": "華", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
        { "token": "龍", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
        { "token": "網", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
        { "token": "海", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
        { "token": "數", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
        { "token": "科", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
        { "token": "技", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
        { "token": "有", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
        { "token": "限", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
        { "token": "公", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
        { "token": "司", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 }
      ]
    }

As you can see, the default analyzer emits one token per character.
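The same `_analyze` request can also be issued from a script. The following is a minimal Python sketch using only the standard library. The cluster address `http://localhost:9200` and the function names `build_analyze_request` and `analyze` are assumptions made for illustration; they do not come from the article or from any plugin. The sketch sends the request as a POST, which the `_analyze` endpoint accepts alongside GET.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer=None, base_url="http://localhost:9200"):
    """Build (but do not send) a request to the _analyze endpoint.

    base_url is an assumed local cluster address; adjust it to your setup.
    """
    body = {"text": [text]}
    if analyzer is not None:
        # When no analyzer is given, Elasticsearch falls back to its default.
        body["analyzer"] = analyzer
    return urllib.request.Request(
        url=f"{base_url}/_analyze?pretty",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer=None, base_url="http://localhost:9200"):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]
```

With a cluster running at the assumed address, `analyze("重慶華龍網海數科技有限公司")` should return the single-character tokens listed above. Passing an `analyzer` name selects a different analyzer, provided it is installed on the cluster.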

elasticsearch-hanlp

HanLP

HanLP is an excellent natural language processing toolkit implemented in Java. It provides the following features:

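Chinese word segmentation
Part-of-speech tagging
Named entity recognition
Keyword extraction
Automatic summarization
Phrase extraction
Pinyin conversion
Simplified/Traditional Chinese conversion
Text recommendation
Dependency parsing
Corpus tools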

After installing the elasticsearch-hanlp plugin (installation instructions: https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the tokenization result again.

    GET /_analyze?pretty
    {
      "analyzer": "hanlp",
      "text": ["重慶華龍網海數科技有限公司"]
    }

Output:

    {
      "tokens": [
        { "token": "重慶", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0 },
        { "token": "華龍網", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1 },
        { "token": "海數", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2 },
        { "token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3 },
        { "token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4 }
      ]
    }
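To read the response above, note that `start_offset` and `end_offset` index into the original text, so each token should equal the corresponding slice of the input. The Python sketch below checks this for the hanlp tokens shown above and also confirms that the tokens tile the text without gaps. The helper names `check_offsets` and `covers_text` are hypothetical and exist only for this illustration.

```python
TEXT = "重慶華龍網海數科技有限公司"

# Tokens copied from the hanlp analyzer output shown in the article.
HANLP_TOKENS = [
    {"token": "重慶", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0},
    {"token": "華龍網", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1},
    {"token": "海數", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2},
    {"token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3},
    {"token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4},
]


def check_offsets(text, tokens):
    """Return True if every token equals the slice of text its offsets select.

    Python slices by code point; this matches the reported offsets here
    because every character in TEXT lies in the Basic Multilingual Plane.
    """
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


def covers_text(text, tokens):
    """Return True if the tokens tile the text with no gaps or overlaps."""
    cursor = 0
    for t in sorted(tokens, key=lambda t: t["start_offset"]):
        if t["start_offset"] != cursor:
            return False
        cursor = t["end_offset"]
    return cursor == len(text)


# Print each token with its part-of-speech tag, then the two check results.
print([(t["token"], t["type"]) for t in HANLP_TOKENS])
print(check_offsets(TEXT, HANLP_TOKENS), covers_text(TEXT, HANLP_TOKENS))
```

Both checks pass for the five tokens above: the 13-character company name is split into five words rather than thirteen single characters, and no character is dropped.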

HanLP offers much more than simple Chinese word segmentation, and many of its features can be integrated into Elasticsearch.
