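The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch, for analyzing Chinese or mixed Chinese-English text. The analyzer it supports uses probabilistic knowledge, based on a hidden Markov model trained on a large corpus, to find the optimal word segmentation of Simplified Chinese text. Its strategy is to first break the input text into sentences and then segment each sentence into words. The plugin provides an analyzer called smartcn and a tokenizer called smartcn_tokenizer. Note that neither of them can be configured with any parameters.
To install the smartcn analysis plugin in an Elasticsearch Docker container, use the command shown below. We then restart the container so that the plugin takes effect: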
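Because neither component accepts parameters, any customization has to happen around them, for example by wrapping smartcn_tokenizer in a custom analyzer and adding token filters. The following is a minimal sketch of what such index settings might look like, not a configuration taken from this article. The index name smartcn_example and the analyzer name my_smartcn are hypothetical, and the lowercase filter is a standard built-in Elasticsearch token filter used here only to show where filters go. The script just prints a request body that could be sent with PUT /smartcn_example.

```python
import json

# Hypothetical names, chosen for illustration only.
INDEX_NAME = "smartcn_example"
ANALYZER_NAME = "my_smartcn"


def build_index_settings(analyzer_name=ANALYZER_NAME):
    """Return index settings defining a custom analyzer around smartcn_tokenizer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # Tokenizer provided by the plugin; it takes no parameters.
                        "tokenizer": "smartcn_tokenizer",
                        # Standard built-in filter, shown only to mark where filters go.
                        "filter": ["lowercase"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print("PUT /" + INDEX_NAME)
    print(json.dumps(build_index_settings(), indent=2))
```

The printed body can be pasted into the Kibana console after the PUT line. The custom analyzer only resolves once the plugin is installed on every node.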
./bin/elasticsearch-plugin install analysis-smartcn
Run the command above from the Elasticsearch installation directory. The output looks like this:
$ ./bin/elasticsearch-plugin install analysis-smartcn
-> Downloading analysis-smartcn from elastic
[=================================================] 100%
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.bouncycastle.jcajce.provider.drbg.DRBG (file:/Users/liuxg/elastic/elasticsearch-7.3.0/lib/tools/plugin-cli/bcprov-jdk15on-1.61.jar) to constructor sun.security.provider.Sun()
WARNING: Please consider reporting this to the maintainers of org.bouncycastle.jcajce.provider.drbg.DRBG
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
-> Installed analysis-smartcn
(base) localhost:elasticsearch-7.3.0 liuxg$ ./bin/elasticsearch-plugin list
analysis-icu
analysis-ik
analysis-smartcn
pinyin
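The output above shows that analysis-smartcn has been installed successfully. For a Docker installation, we can enter the container with the following command and then run the installation there: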
$ docker exec -it es01 /bin/bash
[root@ec4d19f59a7d elasticsearch]# ls
LICENSE.txt README.textile config jdk logs plugins
NOTICE.txt bin data lib modules
[root@ec4d19f59a7d elasticsearch]#
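Here es01 is the Elasticsearch instance running in Docker. For installation details, see my article “Elastic:用Docker部署Elastic棧” (Elastic: Deploying the Elastic Stack with Docker).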
Note: after installing the smartcn analyzer, we must restart Elasticsearch for it to take effect.
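The install-and-restart steps for the Docker case can also be scripted without opening an interactive shell. This is a minimal sketch under a few assumptions. The container name es01 comes from the example above. The relative path bin/elasticsearch-plugin assumes that docker exec starts in the Elasticsearch home directory, as the ls output above suggests. The --batch flag of elasticsearch-plugin install is used to skip confirmation prompts. The helper names build_commands and run are made up for illustration. By default the script only prints the commands.

```python
import subprocess

CONTAINER = "es01"  # container name taken from the example above
PLUGIN = "analysis-smartcn"


def build_commands(container=CONTAINER, plugin=PLUGIN):
    """Return the docker commands that install the plugin and restart the container."""
    install = [
        "docker", "exec", container,
        # Relative path: assumes the container's working directory is the
        # Elasticsearch home directory.
        "bin/elasticsearch-plugin", "install", "--batch", plugin,
    ]
    restart = ["docker", "restart", container]
    return [install, restart]


def run(commands):
    """Run each command in order; raise CalledProcessError on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in build_commands():
        print(" ".join(cmd))
    # Uncomment to execute (requires Docker and a running es01 container):
    # run(build_commands())
```

In a multi-node cluster the same two commands would have to be repeated for each Elasticsearch container, since the plugin must be present on every node.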
Example
Below, we use an example in Kibana to show how it is used:
POST _analyze
{
"text": "股市,投資,穩,賺,不,賠,必修課,如何,做,好,倉,位,管理,和,情緒,管理",
"analyzer": "smartcn"
}
Result:
{
"tokens" : [
{
"token" : "股市",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "投資",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "穩",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 4
},
{
"token" : "賺",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 6
},
{
"token" : "不",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 8
},
{
"token" : "賠",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 10
},
{
"token" : "必修課",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 12
},
{
"token" : "如何",
"start_offset" : 18,
"end_offset" : 20,
"type" : "word",
"position" : 14
},
{
"token" : "做",
"start_offset" : 21,
"end_offset" : 22,
"type" : "word",
"position" : 16
},
{
"token" : "好",
"start_offset" : 23,
"end_offset" : 24,
"type" : "word",
"position" : 18
},
{
"token" : "倉",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 20
},
{
"token" : "位",
"start_offset" : 27,
"end_offset" : 28,
"type" : "word",
"position" : 22
},
{
"token" : "管理",
"start_offset" : 29,
"end_offset" : 31,
"type" : "word",
"position" : 24
},
{
"token" : "和",
"start_offset" : 32,
"end_offset" : 33,
"type" : "word",
"position" : 26
},
{
"token" : "情緒",
"start_offset" : 34,
"end_offset" : 36,
"type" : "word",
"position" : 28
},
{
"token" : "管理",
"start_offset" : 37,
"end_offset" : 39,
"type" : "word",
"position" : 30
}
]
}
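The same request can be sent from a script, and the response reduced to a plain list of words. This is a minimal sketch, assuming an unsecured node reachable at http://localhost:9200. That URL and the helper names analyze, token_strings and offsets_match are assumptions, not part of the original setup. The sample data is the first three tokens of the response shown above, so the helpers can be tried without a running cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without authentication


def analyze(text, analyzer="smartcn", base_url=ES_URL):
    """POST text to the _analyze API and return the parsed JSON response."""
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def token_strings(response):
    """Return only the token text from an _analyze response."""
    return [t["token"] for t in response["tokens"]]


def offsets_match(text, response):
    """Check that each token equals the slice of text given by its offsets.

    Python indexes strings by code point; this agrees with the reported
    offsets as long as the text has no characters outside the Basic
    Multilingual Plane.
    """
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in response["tokens"]
    )


# First three tokens copied from the response shown above.
SAMPLE_TEXT = "股市,投資,穩"
SAMPLE_RESPONSE = {
    "tokens": [
        {"token": "股市", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
        {"token": "投資", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2},
        {"token": "穩", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4},
    ]
}

if __name__ == "__main__":
    print(token_strings(SAMPLE_RESPONSE))
    print(offsets_match(SAMPLE_TEXT, SAMPLE_RESPONSE))
    # With a running node and the plugin installed:
    # print(token_strings(analyze(SAMPLE_TEXT)))
```

Comparing each token with the slice of text at its offsets is a quick way to see which parts of the input the analyzer kept and which, such as the commas here, produced no token.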