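What is tokenization?
Tokenization is the process of converting a piece of text into a sequence of terms. It is also called text analysis, and Elasticsearch refers to it as Analysis.
Example: 我是中國人 ("I am Chinese") --> 我/是/中國人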
Result of analyzing the English text "hello world":
{ "tokens": [ { "token": "hello", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 }, { "token": "world", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 } ] }
The result shows not only the tokens themselves but also the position of each token in the text.
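To make the offset and position fields concrete, here is a minimal sketch of a toy tokenizer that produces the same shape of output for "hello world". It is a simplified stand-in written for illustration, not the analyzer Elasticsearch uses; the function name `simple_analyze` and the regular expression are assumptions.

```python
import re


def simple_analyze(text):
    """Split on runs of ASCII letters/digits and record offsets for each token."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "<ALPHANUM>",
            "position": position,           # ordinal of the token in the stream
        })
    return {"tokens": tokens}


result = simple_analyze("hello world")
for t in result["tokens"]:
    print(t["position"], t["token"], t["start_offset"], t["end_offset"])
```

Note that `end_offset` is exclusive, which is why "hello" spans 0 to 5 and "world" starts at 6, after the space.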
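Chinese tokenization
The difficulty of Chinese word segmentation is that Chinese has no explicit word boundaries, whereas in English spaces act as delimiters. An incorrect segmentation produces ambiguity.
For example:
我/愛/炒肉絲 (I / love / stir-fried shredded pork)
我/愛/炒/肉絲 (I / love / to stir-fry / shredded pork)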
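The two segmentations above can be reproduced with a small dictionary-based sketch. The technique shown is plain forward maximum matching, chosen because it is easy to follow; it is not IK's algorithm, and the function name and the two tiny dictionaries are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


text = "我愛炒肉絲"
with_dish = forward_max_match(text, {"炒肉絲", "肉絲"})
without_dish = forward_max_match(text, {"肉絲"})
print("/".join(with_dish))     # 我/愛/炒肉絲
print("/".join(without_dish))  # 我/愛/炒/肉絲
```

The only difference between the two runs is whether 炒肉絲 is in the dictionary, which shows how much the result depends on the vocabulary the tokenizer knows.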
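Commonly used Chinese tokenizers include IK, jieba, and THULAC. The IK tokenizer is recommended here.
IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through three major versions. It started as a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis algorithms. The newer IK Analyzer 3.0 became a general-purpose segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene.
- It uses a distinctive "forward-iteration finest-granularity segmentation algorithm" and processes about 800,000 characters per second.
- It uses a multi-sub-processor analysis mode that supports tokenizing English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese numeral-classifier phrases, Roman numerals, scientific notation), and Chinese vocabulary (person names, place names).
- It uses optimized dictionary storage with a smaller memory footprint.
IK tokenizer plugin for Elasticsearch: https://github.com/medcl/elasticsearch-analysis-ik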
[root@dalianpai ~]# docker run -p 9200:9200 -d -v /root/ik:/usr/share/elasticsearch/plugins/ik --name elasticsearch 3fd2f723b598
8378a1865408d30a279f9e057115cf4e68cfc4360fa2fe3866072ea9b820a27f
[root@dalianpai ~]# docker ps -l
CONTAINER ID   IMAGE          COMMAND                  CREATED         STATUS         PORTS                              NAMES
8378a1865408   3fd2f723b598   "/docker-entrypoin..."   3 seconds ago   Up 2 seconds   0.0.0.0:9200->9200/tcp, 9300/tcp   elasticsearch
Result of analyzing 我是中國人 with the IK analyzer:
{ "tokens": [ { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 }, { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 }, { "token": "中國人", "start_offset": 2, "end_offset": 5, "type": "CN_WORD", "position": 2 }, { "token": "中國", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3 }, { "token": "國人", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 4 } ] }
As you can see, the Chinese text has now been segmented into words.
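A request of roughly the following shape could produce a response like the one above. This is a sketch under assumptions: the `_analyze` endpoint and the `ik_max_word` analyzer name come from Elasticsearch and the IK plugin, but the choice of `ik_max_word` here is inferred from the overlapping tokens (中國人, 中國, 國人) in the response. The helper name `analyze_request` is hypothetical.

```python
import json


def analyze_request(text, analyzer):
    """Return (method, path, body) for an _analyze call; nothing is sent over the network."""
    return "POST", "/_analyze", {"analyzer": analyzer, "text": text}


text = "我是中國人"
method, path, body = analyze_request(text, "ik_max_word")
print(method, path)
print(json.dumps(body, ensure_ascii=False))

# (token, start_offset, end_offset) triples copied from the response shown above.
tokens = [("我", 0, 1), ("是", 1, 2), ("中國人", 2, 5), ("中國", 2, 4), ("國人", 3, 5)]
for token, start, end in tokens:
    # Each token is exactly the slice of the input between its two offsets.
    print(token, text[start:end] == token)
```

The offset check also makes it visible that several tokens overlap the same span of the input, which is the finest-granularity behaviour; the plugin's `ik_smart` analyzer is the coarser alternative.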
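Full-text search
The two most important aspects of full-text search are:
- Relevance: the ability to evaluate how well each result matches the query and to rank the results accordingly. The score can be computed with TF/IDF, geographic proximity, fuzzy similarity, or some other algorithm.
- Analysis: the process of converting a block of text into distinct, normalized tokens, used both to build the inverted index and to query it.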
{ "acknowledged": true, "shards_acknowledged": true, "index": "topcheer" }
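The acknowledgement above is the kind of response index creation returns. A minimal sketch of a matching request body follows, assuming a pre-7.x cluster that still uses mapping types. The field types, the shard and replica counts, and the use of `ik_max_word` on `hobby` are assumptions made for illustration; only the index name `topcheer`, the type `person`, and the field names appear in the responses in this article.

```python
import json


def create_index_body(shards=1, replicas=1):
    """Build a create-index body with an IK-analyzed hobby field (assumed mapping)."""
    return {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        },
        "mappings": {
            "person": {  # mapping type, as used by Elasticsearch versions before 7.x
                "properties": {
                    "name": {"type": "text"},
                    "age": {"type": "integer"},
                    "mail": {"type": "keyword"},
                    "hobby": {"type": "text", "analyzer": "ik_max_word"},
                }
            }
        },
    }


index_body = create_index_body()
print("PUT /topcheer")
print(json.dumps(index_body, indent=2))
```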
Bulk-inserting data
Result:
{ "took": 213, "errors": false, "items": [ { "index": { "_index": "topcheer", "_type": "person", "_id": "AXFzovfLg3Eko2bZmO_B", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true, "status": 201 } }, { "index": { "_index": "itcast", "_type": "person", "_id": "AXFzovfLg3Eko2bZmO_C", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true, "status": 201 } }, { "index": { "_index": "itcast", "_type": "person", "_id": "AXFzovfLg3Eko2bZmO_D", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true, "status": 201 } }, { "index": { "_index": "itcast", "_type": "person", "_id": "AXFzovfLg3Eko2bZmO_E", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true, "status": 201 } } ] }
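The `_bulk` API expects newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. The sketch below builds such a payload. It uses only the three documents that are visible in the search results later in this article; the response above reports four items, and the fourth document is not reproduced here. The helper name `bulk_body` is hypothetical.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build an NDJSON payload for POST /_bulk with one index action per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk API requires a final newline


docs = [
    {"name": "李四", "age": 21, "mail": "222@qq.com", "hobby": "羽毛球、乒乓球、足球、籃球"},
    {"name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂"},
    {"name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球"},
]
payload = bulk_body("topcheer", "person", docs)
print(payload, end="")
```

Because no `_id` is given in the action lines, Elasticsearch generates the document IDs, which is consistent with the auto-generated IDs in the response.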
Single-term search
Result:
{ "took": 38, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.3123269, "hits": [ { "_index": "itcast", "_type": "person", "_id": "AXFzovfLg3Eko2bZmO_D", "_score": 1.3123269, "_source": { "name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂" }, "highlight": { "hobby": [ "羽毛球、籃球、游泳、聽<em>音</em><em>樂</em>" ] } } ] } }
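A response like the one above comes from a `match` query with highlighting, sent to the `_search` endpoint of the index that holds the documents. A minimal sketch of the request body, assuming the search term was 音樂 (music) on the `hobby` field, as the highlighted fragment suggests; the helper name `match_search` is hypothetical.

```python
import json


def match_search(field, text):
    """Build a search body: a match query on one field plus highlighting of that field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


search_body = match_search("hobby", "音樂")
print(json.dumps(search_body, ensure_ascii=False, indent=2))
```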
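How the query is processed:
1. Check the field type.
The hobby field is a text field (with the IK analyzer specified), which means the query string itself should also be analyzed.
2. Analyze the query string.
The query string 音樂 (music) is passed to the IK analyzer, which outputs a single term, 音樂. Because there is only one term, the match query executes a single low-level term query.
3. Find matching documents.
The term query looks up 音樂 in the inverted index and retrieves the set of documents that contain the term.
4. Score each document.
The term query computes a relevance score, _score, for each document. The score combines term frequency (how often the term 音樂 appears in the hobby field of the document), inverse document frequency (how often 音樂 appears in the hobby field across all documents; the more often, the lower the weight), and field length (the shorter the field, the higher the relevance).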
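The scoring step above can be sketched with the classic TF/IDF factors. This is a simplified illustration, not the scoring function a real cluster runs: Elasticsearch 5.0 and later default to BM25, and Lucene's practical scoring function has additional terms. The token lists are hypothetical analyzer output for the three documents used in this article.

```python
import math

# Hypothetical token lists for the hobby field of each document.
docs = {
    "王五": ["羽毛球", "籃球", "游泳", "聽", "音樂"],
    "趙六": ["跑步", "游泳", "籃球"],
    "李四": ["羽毛球", "乒乓球", "足球", "籃球"],
}


def classic_score(term, name):
    """Multiply term frequency, inverse document frequency and a field-length norm."""
    terms = docs[name]
    freq = terms.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for other in docs.values() if term in other)
    tf = math.sqrt(freq)                              # more occurrences, higher score
    idf = 1 + math.log(len(docs) / (doc_freq + 1))    # rarer term, higher score
    field_norm = 1 / math.sqrt(len(terms))            # shorter field, higher score
    return tf * idf * field_norm


for name in docs:
    print(name, round(classic_score("籃球", name), 4), round(classic_score("音樂", name), 4))
```

With these inputs, 籃球 (basketball) occurs once in every document, so the only factor that separates the three scores is field length: the shortest hobby list ranks first.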
Multi-term search
Result:
{ "took": 5, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 1.2632889, "hits": [ { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_K", "_score": 1.2632889, "_source": { "name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂" }, "highlight": { "hobby": [ "羽毛球、<em>籃球</em>、游泳、聽<em>音樂</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_L", "_score": 0.42327404, "_source": { "name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球" }, "highlight": { "hobby": [ "跑步、游泳、<em>籃球</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_J", "_score": 0.2887157, "_source": { "name": "李四", "age": 21, "mail": "222@qq.com", "hobby": "羽毛球、乒乓球、足球、籃球" }, "highlight": { "hobby": [ "羽毛球、乒乓球、足球、<em>籃球</em>" ] } } ] } }
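As you can see, every document containing 音樂 (music) or 籃球 (basketball) was found.
However, this is not what we wanted. We were looking for users whose hobbies include both 音樂 and 籃球, and the results clearly reflect an OR relationship.
In Elasticsearch, the logical relationship between the terms can be specified explicitly, as follows: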
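A minimal sketch of such a request, using the `operator` parameter of the `match` query, together with a tiny in-memory simulation of the OR and AND semantics. The query text "音樂 籃球" and the per-document term sets are assumptions made for illustration; the helper names are hypothetical.

```python
import json


def match_body(field, text, operator="or"):
    """Build a match query whose terms are combined with the given operator."""
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


def matches(doc_terms, query_terms, operator="or"):
    """Return True if the document satisfies the query under OR or AND semantics."""
    hits = [term in doc_terms for term in query_terms]
    return all(hits) if operator == "and" else any(hits)


# Hypothetical analyzed terms of the hobby field.
docs = {
    "王五": {"羽毛球", "籃球", "游泳", "聽", "音樂"},
    "趙六": {"跑步", "游泳", "籃球"},
    "李四": {"羽毛球", "乒乓球", "足球", "籃球"},
}
query_terms = ["音樂", "籃球"]

and_body = match_body("hobby", "音樂 籃球", operator="and")
print(json.dumps(and_body, ensure_ascii=False))

or_names = [name for name, terms in docs.items() if matches(terms, query_terms, "or")]
and_names = [name for name, terms in docs.items() if matches(terms, query_terms, "and")]
print("or:", or_names)
print("and:", and_names)
```

In the simulation, OR keeps all three documents while AND keeps only the one that contains both terms.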
Result:
{ "took": 7, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 1, "max_score": 1.2632889, "hits": [ { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_K", "_score": 1.2632889, "_source": { "name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂" }, "highlight": { "hobby": [ "羽毛球、<em>籃球</em>、游泳、聽<em>音樂</em>" ] } } ] } }
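The result now matches the expectation.
The OR and AND searches tested above are two extremes. Real scenarios rarely use either one; more often something in between is wanted, where a document is returned as long as it matches the query to a sufficient degree. Elasticsearch supports this through the minimum_should_match parameter, which specifies the required degree of matching, for example 70%.
Example:
Result: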
{ "took": 3, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 3, "max_score": 1.2632889, "hits": [ { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_K", "_score": 1.2632889, "_source": { "name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂" }, "highlight": { "hobby": [ "羽毛球、<em>籃球</em>、游泳、聽<em>音樂</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_L", "_score": 0.42327404, "_source": { "name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球" }, "highlight": { "hobby": [ "跑步、游泳、<em>籃球</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_J", "_score": 0.2887157, "_source": { "name": "李四", "age": 21, "mail": "222@qq.com", "hobby": "羽毛球、乒乓球、足球、籃球" }, "highlight": { "hobby": [ "羽毛球、乒乓球、足球、<em>籃球</em>" ] } } ] } }
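One plausible reading of the three hits above is that the query contained two terms. For a positive percentage, Elasticsearch rounds the computed number of required terms down, so 70% of two terms requires only one match and the result looks the same as the OR search. The sketch below shows that arithmetic and a matching request body; the query text "音樂 籃球" is an assumption inferred from the highlights, and the helper names are hypothetical.

```python
import json


def required_matches(term_count, percentage):
    """Number of terms that must match for a positive percentage (rounded down)."""
    return term_count * percentage // 100


def min_match_body(field, text, minimum_should_match):
    """Build a match query with a minimum_should_match constraint."""
    return {
        "query": {
            "match": {
                field: {"query": text, "minimum_should_match": minimum_should_match}
            }
        }
    }


msm_body = min_match_body("hobby", "音樂 籃球", "70%")
print(json.dumps(msm_body, ensure_ascii=False))
for count in (2, 3, 10):
    print(count, "terms ->", required_matches(count, 70), "required")
```

With three query terms the same 70% would require two matches, so the effect of the parameter becomes visible only with longer queries.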
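Compound search
When searching, you can also use the bool compound query covered earlier in the section on filters. Example:
The search above means:
The results must contain 籃球 (basketball) and must not contain 音樂 (music); a document that also contains 游泳 (swimming) gets a higher relevance score.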
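The query described above can be sketched as a `bool` query with `must`, `must_not`, and `should` clauses. The request body below is a reconstruction from that description, not a copy of the original request, and the in-memory filter that follows only imitates which documents qualify. The per-document term sets are hypothetical.

```python
import json

bool_body = {
    "query": {
        "bool": {
            "must": {"match": {"hobby": "籃球"}},       # required
            "must_not": {"match": {"hobby": "音樂"}},   # excluded
            "should": [{"match": {"hobby": "游泳"}}],   # optional, raises the score
        }
    },
    "highlight": {"fields": {"hobby": {}}},
}
print(json.dumps(bool_body, ensure_ascii=False, indent=2))

# Hypothetical analyzed terms of the hobby field.
docs = {
    "王五": {"羽毛球", "籃球", "游泳", "聽", "音樂"},
    "趙六": {"跑步", "游泳", "籃球"},
    "李四": {"羽毛球", "乒乓球", "足球", "籃球"},
}
hits = [name for name, terms in docs.items() if "籃球" in terms and "音樂" not in terms]
# Documents that also satisfy the should clause are ranked first.
hits.sort(key=lambda name: "游泳" in docs[name], reverse=True)
print(hits)
```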
Result:
{ "took": 5, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 1.2458471, "hits": [ { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_L", "_score": 1.2458471, "_source": { "name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球" }, "highlight": { "hobby": [ "跑步、<em>游泳</em>、<em>籃球</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_J", "_score": 0.2887157, "_source": { "name": "李四", "age": 21, "mail": "222@qq.com", "hobby": "羽毛球、乒乓球、足球、籃球" }, "highlight": { "hobby": [ "羽毛球、乒乓球、足球、<em>籃球</em>" ] } } ] } }
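How the score is calculated
The bool query computes a relevance score, _score, for each document: it sums the _score of every matching must and should clause and then divides by the total number of must and should clauses.
A must_not clause does not affect the score; it only excludes irrelevant documents. By default, should clauses are not required to match, but if the query has no must clause, at least one should clause must match. This can also be controlled with the minimum_should_match parameter, whose value can be a number or a percentage.
Example:
Result: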
{ "took": 3, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 1.8243669, "hits": [ { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_K", "_score": 1.8243669, "_source": { "name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂" }, "highlight": { "hobby": [ "羽毛球、<em>籃球</em>、<em>游泳</em>、聽<em>音樂</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_L", "_score": 1.2458471, "_source": { "name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球" }, "highlight": { "hobby": [ "跑步、<em>游泳</em>、<em>籃球</em>" ] } } ] } }
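The two hits above are consistent with a `bool` query that has three `should` clauses and requires at least two of them to match. The sketch below builds such a body and checks the rule against the sample documents. The clause terms (游泳, 籃球, 音樂) and the value 2 are inferred from the highlights rather than taken from the original request, and the per-document term sets are hypothetical.

```python
import json


def bool_should_body(terms, minimum_should_match):
    """Build a bool query made only of should clauses with a minimum match count."""
    return {
        "query": {
            "bool": {
                "should": [{"match": {"hobby": term}} for term in terms],
                "minimum_should_match": minimum_should_match,
            }
        }
    }


def satisfied(doc_terms, should_terms, minimum):
    """True when at least `minimum` of the should terms occur in the document."""
    return sum(term in doc_terms for term in should_terms) >= minimum


should_terms = ["游泳", "籃球", "音樂"]
should_body = bool_should_body(should_terms, 2)
print(json.dumps(should_body, ensure_ascii=False))

docs = {
    "王五": {"羽毛球", "籃球", "游泳", "聽", "音樂"},
    "趙六": {"跑步", "游泳", "籃球"},
    "李四": {"羽毛球", "乒乓球", "足球", "籃球"},
}
kept = [name for name, terms in docs.items() if satisfied(terms, should_terms, 2)]
print(kept)
```

李四 matches only one of the three clauses and is therefore dropped, while the other two documents match two or more.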
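Boosting
Sometimes we need to boost certain terms so that they influence a document's score. For example:
The search keywords are 游泳籃球 (swimming, basketball). A result that also contains 音樂 (music) gets a boost of 10, and one that contains 跑步 (running) gets a boost of 2.
Result: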
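The query described above can be sketched as a `bool` query whose `should` clauses carry a `boost`. This is a reconstruction under assumptions: the `and` operator in the `must` clause is a guess based on a result set in which every hit contains both 游泳 and 籃球, and the helper name `boosted_body` is hypothetical.

```python
import json


def boosted_body(keywords, boosts):
    """Build a bool query: keywords are required, boosted terms only raise the score."""
    return {
        "query": {
            "bool": {
                "must": {
                    "match": {"hobby": {"query": keywords, "operator": "and"}}
                },
                "should": [
                    {"match": {"hobby": {"query": term, "boost": boost}}}
                    for term, boost in boosts
                ],
            }
        },
        "highlight": {"fields": {"hobby": {}}},
    }


boost_body = boosted_body("游泳籃球", [("音樂", 10), ("跑步", 2)])
print(json.dumps(boost_body, ensure_ascii=False, indent=2))
```

A boost does not decide whether a document matches; it only scales the contribution of that clause to the final score.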
{ "took": 5, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": 2, "max_score": 10.595525, "hits": [ { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_K", "_score": 10.595525, "_source": { "name": "王五", "age": 22, "mail": "333@qq.com", "hobby": "羽毛球、籃球、游泳、聽音樂" }, "highlight": { "hobby": [ "羽毛球、<em>籃球</em>、<em>游泳</em>、聽<em>音樂</em>" ] } }, { "_index": "topcheer", "_type": "person", "_id": "AXFzrtEJg3Eko2bZmO_L", "_score": 4.1034093, "_source": { "name": "趙六", "age": 23, "mail": "444@qq.com", "hobby": "跑步、游泳、籃球" }, "highlight": { "hobby": [ "<em>跑步</em>、<em>游泳</em>、<em>籃球</em>" ] } } ]