Elasticsearch - the IK analyzer

This article is a repost. 2019-03-28 11:26, Elasticsearch

Preface
Origins of the IK analyzer
Installing the IK analyzer plugin
Using the IK analyzer

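Preface

Among well-known Chinese word segmenters, IK is practically a household name, and with IK on board Elasticsearch handles Chinese text far better. To understand the IK Chinese analyzer, it helps to first look at where it came from.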

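Origins of the IK analyzer

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through four major versions. It started out as a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis. From version 3.0 onward, IK became a general-purpose segmentation component for Java, independent of the Lucene project, while still shipping a default implementation optimized for Lucene. The 2012 version added a simple algorithm for resolving segmentation ambiguity, marking IK's evolution from pure dictionary-based segmentation toward simulated semantic segmentation.
IK Analyzer 2012 features: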

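Uses a distinctive forward-iterating, finest-granularity segmentation algorithm and supports two segmentation modes: fine-grained and smart. Tested on an ordinary PC (Core2 i7 3.4 GHz dual-core, 4 GB RAM, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK 2012 processes 1.6 million characters per second (3000 KB/s).
The smart mode in the 2012 version supports simple ambiguity resolution and merged output of numerals and measure words.
Uses a multi-sub-processor analysis mode that handles English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters.
Optimized dictionary storage with a smaller memory footprint. User dictionary extensions are supported. Notably, in the 2012 version the dictionary supports terms that mix Chinese, English, and digits.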
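To make the difference between the fine-grained and smart modes concrete, here is a minimal sketch of dictionary-based segmentation. It is not IK's actual algorithm (IK adds ambiguity resolution and emits single characters as well); the function names and the tiny dictionary `TOY_DICT` are made up for illustration.

```python
# Toy dictionary-based segmentation, NOT the real IK implementation.
# TOY_DICT and both function names are hypothetical.

TOY_DICT = {"上海", "自來水", "自來", "來自", "海上"}
TEXT = "上海自來水來自海上"


def all_matches(text, dictionary):
    """Fine-grained style: emit every dictionary word at every start offset."""
    tokens = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                tokens.append((text[start:end], start, end))
    return tokens


def longest_matches(text, dictionary):
    """Coarse style: scan forward and keep only the longest word at each step."""
    tokens, start = [], 0
    while start < len(text):
        end = start + 1  # fall back to a single character if nothing matches
        for candidate in range(len(text), start, -1):
            if text[start:candidate] in dictionary:
                end = candidate
                break
        tokens.append((text[start:end], start, end))
        start = end
    return tokens


if __name__ == "__main__":
    print([t for t, _, _ in all_matches(TEXT, TOY_DICT)])
    print([t for t, _, _ in longest_matches(TEXT, TOY_DICT)])
```

The fine-grained variant keeps overlapping words such as 自來 and 自來水, while the coarse variant keeps only the longest word at each position.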

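Later, medcl (Zeng Yong, development engineer and evangelist at Elastic, head of the Elasticsearch open-source community, who joined Elastic in 2015) integrated it into Elasticsearch and added support for custom dictionaries.
P.S. The IK Chinese analyzer plugin for Elasticsearch is downloaded from medcl's GitHub, whereas IK Analyzer itself, if you search for it on Baidu, is listed on OSChina with Lin Liangyi as the committer. The history above is inferred from that.
That is the background for everything that follows.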

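Installing the IK analyzer plugin

Open GitHub, search for elasticsearch-analysis-ik, and click medcl/elasticsearch-analysis-ik. Alternatively, open the repository page directly.

In the README.md file, scroll down and follow the link to the historical releases.

IK releases are tied to specific Elasticsearch versions, so download the IK release whose version matches your Elasticsearch version. Here that is v6.5.4: click the elasticsearch-analysis-ik-6.5.4.zip package and it downloads to your machine.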
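The version-matching rule can be sketched as a small check. This assumes only the file naming pattern seen above (`elasticsearch-analysis-ik-<version>.zip`); the helper names are hypothetical.

```python
# Minimal sketch: derive the expected IK package name from the Elasticsearch
# version and check a downloaded file against it. Helper names are hypothetical.

def ik_zip_name(es_version: str) -> str:
    """Return the IK package file name expected for a given Elasticsearch version."""
    return f"elasticsearch-analysis-ik-{es_version}.zip"


def matches_es_version(zip_name: str, es_version: str) -> bool:
    """True when the downloaded package name matches the Elasticsearch version."""
    return zip_name == ik_zip_name(es_version)


if __name__ == "__main__":
    print(ik_zip_name("6.5.4"))
    print(matches_es_version("elasticsearch-analysis-ik-6.5.4.zip", "6.5.4"))
```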

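Once the download finishes, you have a zip package.

Installation

Open the C:\Program Files\elasticsearch-6.5.4\plugins directory, create a subdirectory named ik, and extract elasticsearch-analysis-ik-6.5.4.zip into it, that is, into C:\Program Files\elasticsearch-6.5.4\plugins\ik.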
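The manual steps above (create plugins\ik, then extract the zip into it) could also be scripted. The following is a rough sketch using only the Python standard library; the function name `install_ik` and the paths in the usage comment are assumptions, so adjust them to your own installation.

```python
# Rough sketch of the manual install steps; install_ik is a hypothetical helper.
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Extract the IK package into <es_home>/plugins/ik and return that directory."""
    target = Path(es_home) / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)  # create plugins/ik if missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Example call (paths are assumptions, adjust to your machine):
# install_ik(r"C:\Downloads\elasticsearch-analysis-ik-6.5.4.zip",
#            r"C:\Program Files\elasticsearch-6.5.4")
```

Elasticsearch still needs to be restarted afterwards so that it picks up the plugin.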

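Testing

First, restart the Elasticsearch and Kibana services.
Then open http://localhost:5601 in the browser, go to the Console in Dev Tools, enter the command in the left pane, and click the green run button.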

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "上海自來水來自海上"
}

The result appears in the right pane:

{
  "tokens" : [
    {
      "token" : "上海",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "自來水",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "自來",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "水",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "來自",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "海上",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
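When reading an _analyze response it can help to reduce it to (token, start, end) triples and confirm that each offset pair really slices the token out of the input text. A small sketch, with the response above pasted in as a string; the helper names are made up for illustration.

```python
# Sketch: summarize an _analyze response and sanity-check its offsets.
# summarize and offsets_consistent are hypothetical helper names.
import json

TEXT = "上海自來水來自海上"
RESPONSE = """
{"tokens": [
  {"token": "上海", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
  {"token": "自來水", "start_offset": 2, "end_offset": 5, "type": "CN_WORD", "position": 1},
  {"token": "自來", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2},
  {"token": "水", "start_offset": 4, "end_offset": 5, "type": "CN_CHAR", "position": 3},
  {"token": "來自", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4},
  {"token": "海上", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 5}
]}
"""


def summarize(response_text):
    """Return (token, start_offset, end_offset) for every token in the response."""
    tokens = json.loads(response_text)["tokens"]
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in tokens]


def offsets_consistent(text, summary):
    """Check that text[start:end] equals the token for every entry."""
    return all(text[start:end] == token for token, start, end in summary)


if __name__ == "__main__":
    for row in summarize(RESPONSE):
        print(row)
    print(offsets_consistent(TEXT, summarize(RESPONSE)))
```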

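OK, the installation is done. It really is that simple.

A look at the ik directory

Here is a brief overview of the IK configuration files:

IKAnalyzer.cfg.xml: configures custom dictionaries.
main.dic: IK's built-in Chinese dictionary, with more than 270,000 entries; any word listed in it is kept together as a single token.
surname.dic: Chinese surnames.
suffix.dic: special (suffix) nouns, such as 鄉, 江, 所, 省.
preposition.dic: Chinese prepositions, such as 不, 也, 了, 仍.
stopword.dic: English stop words, such as a, an, and, the.
quantifier.dic: unit nouns (measure words), such as 厘米, 件, 倍, 像素.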
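As far as I know, IKAnalyzer.cfg.xml is a Java properties-style XML file in which entries such as `ext_dict` and `ext_stopwords` list extra dictionary files separated by semicolons. The sketch below reads such entries; the sample content and the paths `custom/mydict.dic` and `custom/extra.dic` are made up for illustration, so compare with the file shipped in your own plugin directory.

```python
# Sketch: read dictionary paths from an IKAnalyzer.cfg.xml-style document.
# SAMPLE_CFG is a made-up example; the key names are assumed, not verified here.
import xml.etree.ElementTree as ET

SAMPLE_CFG = """<properties>
  <comment>IK Analyzer extension configuration</comment>
  <entry key="ext_dict">custom/mydict.dic;custom/extra.dic</entry>
  <entry key="ext_stopwords"></entry>
</properties>"""


def configured_paths(cfg_text, key):
    """Return the semicolon-separated paths stored under the given entry key."""
    root = ET.fromstring(cfg_text)
    for entry in root.findall("entry"):
        if entry.get("key") == key:
            raw = entry.text or ""
            return [part.strip() for part in raw.split(";") if part.strip()]
    return []  # key not present


if __name__ == "__main__":
    print(configured_paths(SAMPLE_CFG, "ext_dict"))
    print(configured_paths(SAMPLE_CFG, "ext_stopwords"))
```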

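Using the IK analyzer

Before you start

First, restart the Elasticsearch and Kibana services so that the plugin takes effect.
Then open http://localhost:5601 in the browser, go to the Console in Dev Tools, enter the command in the left pane, and click the green run button.

A first IK example

Let's start with a simple example.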

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "上海自來水來自海上"
}

The result appears in the right pane:

{
  "tokens" : [
    {
      "token" : "上海",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "自來水",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "自來",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "水",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "來自",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "海上",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

You may be wondering about the analyzer named at the top of the request, ik_max_word. What does it do? Let's take a closer look.

ik_max_word

Suppose we have the following index:

PUT ik1
{
  "mappings": {
    "doc": {
      "dynamic": false,
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

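In the example above, the ik_max_word analyzer splits text at the finest granularity, producing as many word combinations as possible.
Next, add a few documents to the index: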

PUT ik1/doc/1
{
  "content":"今天是個好日子"
}
PUT ik1/doc/2
{
  "content":"心想的事兒都能成"
}
PUT ik1/doc/3
{
  "content":"我今天不活了"
}

Now let's run some queries.

GET ik1/_search
{
  "query": {
    "match": {
      "content": "心想"
    }
  }
}

The query result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "心想的事兒都能成"
        }
      }
    ]
  }
}

One document is returned, as expected. Now let's search for 今天.

GET ik1/_search
{
  "query": {
    "match": {
      "content": "今天"
    }
  }
}

The result is:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "今天是個好日子"
        }
      },
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "我今天不活了"
        }
      }
    ]
  }
}
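One way to picture why these two match queries return what they do is a toy inverted index: each token points at the documents that contain it, and a match query looks up the query's tokens. In the sketch below, the token list for document 1 is the ik_max_word output shown in this post; the lists for documents 2 and 3 are assumed and deliberately partial. Scoring is ignored.

```python
# Toy inverted index; not how Elasticsearch stores data internally.
from collections import defaultdict

DOC_TOKENS = {
    "1": ["今天是", "今天", "是", "個", "好日子", "日子"],  # from the _analyze output in this post
    "2": ["心想"],   # assumed, partial token list
    "3": ["今天"],   # assumed, partial token list
}


def build_index(doc_tokens):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in doc_tokens.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def match(index, query_tokens):
    """Return ids of documents containing at least one query token."""
    hits = set()
    for token in query_tokens:
        hits |= index.get(token, set())
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(DOC_TOKENS)
    print(match(index, ["心想"]))
    print(match(index, ["今天"]))
```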

As expected, the search for 今天 returns two documents.
ik_max_word has a counterpart analyzer. Let's look at it next.

ik_smart

The counterpart of ik_max_word is the ik_smart analyzer, which splits text at the coarsest granularity.

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "今天是個好日子"
}

The example above analyzes the text at the coarsest granularity.
The result is:

{
  "tokens" : [
    {
      "token" : "今天是",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "個",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "好日子",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

Now let's analyze the same text at the finest granularity.

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "今天是個好日子"
}

The result is:

{
  "tokens" : [
    {
      "token" : "今天是",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "今天",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "個",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "好日子",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "日子",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
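The two outputs above can be compared directly. This small sketch lists the tokens that ik_max_word produced for 今天是個好日子 and ik_smart did not; the token lists are copied from the two responses, and the helper name is made up.

```python
# Compare the two token lists shown above; extra_tokens is a hypothetical helper.

SMART = ["今天是", "個", "好日子"]
MAX_WORD = ["今天是", "今天", "是", "個", "好日子", "日子"]


def extra_tokens(fine, coarse):
    """Return tokens present in the fine-grained output but not in the coarse one."""
    coarse_set = set(coarse)
    return [token for token in fine if token not in coarse_set]


if __name__ == "__main__":
    print(extra_tokens(MAX_WORD, SMART))
```

For this sentence the extra tokens are 今天, 是 and 日子, which is why a search for 今天 can only match this document when the field is indexed with ik_max_word.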

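As the comparison shows, the two analyzers tokenize the same text differently, so query results will differ as well. Choose the granularity that fits your use case.
Apart from the choice of granularity, the basic operations work as before, as in the phrase query and phrase-prefix query below.

Phrase queries with IK

Phrase queries with IK work the same way as the phrase queries covered earlier.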

GET ik1/_search
{
  "query": {
    "match_phrase": {
      "content": "今天"
    }
  }
}

The result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "今天是個好日子"
        }
      },
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "我今天不活了"
        }
      }
    ]
  }
}

Phrase-prefix queries with IK

Likewise, the operations from the quick-start in Part 2 apply to IK as well.

GET ik1/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "今天好日子",
        "slop": 2
      }
    }
  }
}

The result is:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "今天是個好日子"
        }
      },
      {
        "_index" : "ik1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "我今天不活了"
        }
      }
    ]
  }
}

Corrections are welcome. That's all. See also: [IK Analysis for Elasticsearch](https://github.com/medcl/elasticsearch-analysis-ik) | [Elasticsearch built-in analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html)
