Using the IK Analyzer in an Elasticsearch Cluster


Installing the IK analysis plugin

ES cluster environment

  • Three Ubuntu 14.04.2 LTS virtual machines running under VMware
  • JDK 1.8.0_66
  • Elasticsearch 2.3.1
  • elasticsearch-jdbc-2.3.1.0
  • IK analyzer 1.9.1
  • Cluster name: my-application
    The nodes are assigned as follows:
    VM | IP | node-x
    ----|----|----
    search1 | 192.168.235.133 | node-1
    search2 | 192.168.235.134 | node-2
    search3 | 192.168.235.135 | node-3

Downloading and building the IK analyzer

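Download the IK analyzer zip package from GitHub:
https://github.com/myitroad/elasticsearch-analysis-ik
Unzip it and import it into IntelliJ IDEA as a Maven project.
Building the jar
In the IntelliJ IDEA terminal, run the following Maven commands: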

mvn clean
mvn compile
mvn package

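The build writes the following file to F:\workspace_idea\elasticsearch-analysis-ik-master\target\releases:
elasticsearch-analysis-ik-1.9.1.zip
Uploading the IK analyzer
Upload this zip package to an Elasticsearch node-x and unzip it into the
/home/es/cluster/elasticsearch-2.3.1/plugins/ik directory. One node (e.g. node-1) is enough for the local test below, but Elasticsearch expects a plugin to be installed on every node of the cluster, so repeat the step on the other nodes before indexing with IK.
The final layout of the ik folder is: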

ik
│   ├── commons-codec-1.9.jar
│   ├── commons-logging-1.2.jar
│   ├── config
│   │   └── ik
│   │       ├── custom
│   │       │   ├── ext_stopword.dic
│   │       │   ├── mydict.dic
│   │       │   ├── single_word.dic
│   │       │   ├── single_word_full.dic
│   │       │   ├── single_word_low_freq.dic
│   │       │   └── sougou.dic
│   │       ├── IKAnalyzer.cfg.xml
│   │       ├── main.dic
│   │       ├── preposition.dic
│   │       ├── quantifier.dic
│   │       ├── stopword.dic
│   │       ├── suffix.dic
│   │       └── surname.dic
│   ├── elasticsearch-analysis-ik-1.9.1.jar
│   ├── httpclient-4.4.1.jar
│   ├── httpcore-4.4.1.jar
│   └── plugin-descriptor.properties

Configuring the dictionaries (IK ships with the Sogou dictionary)
Edit $ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml
and add the following entry:

<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/sougou.dic</entry>
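Before restarting, it can help to confirm that every dictionary named in ext_dict is actually on disk. The following is a minimal sketch, not part of the IK plugin: missing_ext_dicts is a hypothetical helper, and it assumes the dictionary paths are relative to the directory that holds IKAnalyzer.cfg.xml, as the folder layout above suggests.

```shell
#!/bin/sh
# Print every ext_dict entry of an IKAnalyzer.cfg.xml whose file is missing.
# Assumption: dictionary paths are relative to the directory of the cfg file.
missing_ext_dicts() {
    cfg="$1"
    dir=$(dirname "$cfg")
    # Pull the semicolon-separated value out of the ext_dict entry.
    list=$(sed -n 's:.*<entry key="ext_dict">\(.*\)</entry>.*:\1:p' "$cfg")
    # Split on ';' and report each path that is not a regular file.
    printf '%s\n' "$list" | tr ';' '\n' | while IFS= read -r rel; do
        [ -n "$rel" ] || continue
        [ -f "$dir/$rel" ] || printf '%s\n' "$rel"
    done
}

# Example call (prints nothing when all dictionaries exist):
# missing_ext_dicts "$ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml"
```

An empty result means all listed dictionaries were found. Any printed path points to a typo in the entry or a file that was not uploaded.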

Restart node-1.

Testing IK segmentation

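By default, Chinese text passed directly to the _analyze command on the command line may come back garbled, so URL-encode it first.
%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA is the URL-encoded (UTF-8) form of “我是中国人”.
Testing with the raw string “我是中国人” may return garbled tokens.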
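The encoded string does not have to be produced by hand. As a rough sketch, the hypothetical helper urlencode below percent-encodes every byte of its argument using only od, tr and sed. It also encodes ASCII characters, which is more than strictly needed but still valid.

```shell
#!/bin/sh
# Percent-encode every byte of the first argument (over-encodes ASCII, still valid).
urlencode() {
    # od dumps the bytes as hex (-v keeps repeated lines), tr strips the
    # layout and upper-cases the digits, sed prefixes each byte with '%'.
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F' | sed 's/../%&/g'
}

# Example call against the index used in this article:
# curl -XGET "localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=$(urlencode '我是中国人')&pretty"
```

curl can also do the encoding itself: with -G, each --data-urlencode argument is encoded and appended to the query string.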
Finest-grained segmentation with IK's ik_max_word

es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'

The segmentation result:

{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "我",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "是中国人",
    "start_offset" : 1,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中国",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "国人",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 6
  } ]
}

Coarsest-grained segmentation with IK's ik_smart

es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'

The response:

{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  } ]
}
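The two responses are easier to compare as plain token lists. The sketch below assumes the pretty-printed layout shown above, where each "token" : "..." pair sits on its own line. tokens is a hypothetical helper name, and a real JSON parser would be more robust against other layouts.

```shell
#!/bin/sh
# Print one token per line from a pretty-printed _analyze response on stdin.
# Assumption: each token appears as  "token" : "..."  on a line of its own.
tokens() {
    sed -n 's/.*"token" : "\([^"]*\)".*/\1/p'
}

# Example call:
# curl -s 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty' | tokens
```

Running the same pipeline with analyzer=ik_max_word shows at a glance how many more overlapping tokens that analyzer emits for the same input.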

Importing MySQL data with the IK analyzer

Creating the myindex index
Run on node-1:

curl -XPUT 'localhost:9200/myindex?pretty'

Write the MySQL-to-ES import script mysql-es-all.sh (it can be stored anywhere):

#!/bin/sh
bin=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/bin
lib=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/lib
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "locale" : "zh_CN",
        "statefile" : "statefile.json",
        "timezone" : "GMT+8",
        "autocommit" : true,
        "elasticsearch" : {
            "cluster" : "my-application",
            "host" : "192.168.235.133",
            "port" : "9300"
        },
        "index" : "myindex",
        "type" : "mytype",
        "url" : "jdbc:mysql://10.110.1.47:3306/ispider_data",
        "user" : "root",
        "password" : "xxx",
        "sql" : "select uuid as _id,title,content,release_time from JCY_VOICE_NEWS_INFO",
        "metrics" : {
            "enabled" : true,
            "interval" : "5m"
        },
        "index_settings" : {
            "index" : {
                "number_of_shards" : 2,
                "number_of_replicas" : 2
            }
        },
        "type_mapping": {
            "mytype" : {
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "content" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "release_time":{
                        "type":"date",
                        "store":"no",
                        "format":"YYYY-MM-dd HH:mm:ss",
                        "index":"not_analyzed",
                        "include_in_all":"true"
                    }
                }
            }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter
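The script depends on two paths under the elasticsearch-jdbc directory: the jars in lib and log4j2.xml in bin. A small pre-flight check, sketched here with the hypothetical helper jdbc_importer_ready, can catch a wrong path before Java starts. It only tests that the files exist, not that the importer or the database connection works.

```shell
#!/bin/sh
# Return 0 only if the importer directory has bin/log4j2.xml and at least one jar in lib.
jdbc_importer_ready() {
    home="$1"   # elasticsearch-jdbc install directory
    [ -f "$home/bin/log4j2.xml" ] || { echo "missing $home/bin/log4j2.xml" >&2; return 1; }
    # The glob stays unexpanded when no jar matches, so test the first result.
    set -- "$home"/lib/*.jar
    [ -f "$1" ] || { echo "no jars in $home/lib" >&2; return 1; }
}

# Example call with the path used in the script above:
# jdbc_importer_ready /home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0 && ./mysql-es-all.sh
```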

Make the script executable and run it

es@search1:~/cluster/elasticsearch-2.3.1$ chmod +x mysql-es-all.sh
es@search1:~/cluster/elasticsearch-2.3.1$ ./mysql-es-all.sh
