Elastic Stack: Using Elasticsearch (Part 1)

Reposted article. Originally published 2018-09-03 08:10. Category: Elastic Stack

I. Introduction

Elasticsearch exposes a RESTful API. The demos below mainly use Postman, a handy tool for front-end developers and for anyone who needs to debug HTTP interfaces;

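Installation is straightforward, so I will not cover it in detail. Here are some articles you can refer to:

Windows:

Linux:

I also recommend this article: Choosing a Search Engine: Elasticsearch vs. Solr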

II. Some Basic Operations

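#Create an index
PUT /test
#Delete an index
DELETE /test
#Create a document with a given id
#index/type/id
PUT /test/doc/1
#Create a document
#Without an id, one is generated automatically (this requires POST, not PUT)
#index/type/
POST /test/doc/
#Get a document
#Get a document by id
GET /test/doc/1
#Other DSL-based queries, sent as JSON request bodies, will be covered next time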

_bulk batch operations:

{"index":{"_index":"test","_type":"aaa","_id":"2"}}
{"userName":"123456"}
{"index":{"_index":"test","_type":"aaa","_id":"3"}}
{"userName":"1234567"}
{"delete":{"_index":"test","_type":"aaa","_id":"1"}}
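The _bulk body above is newline-delimited JSON: one action line, optionally followed by a source line, and a trailing newline at the end. A minimal sketch of assembling such a body in Python follows; the helper name `build_bulk_body` is hypothetical and not part of any Elasticsearch client.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, source) pairs into newline-delimited JSON for _bulk."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions have no source line
            lines.append(json.dumps(source))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


actions = [
    ({"index": {"_index": "test", "_type": "aaa", "_id": "2"}}, {"userName": "123456"}),
    ({"index": {"_index": "test", "_type": "aaa", "_id": "3"}}, {"userName": "1234567"}),
    ({"delete": {"_index": "test", "_type": "aaa", "_id": "1"}}, None),
]
body = build_bulk_body(actions)
print(body, end="")
```

The resulting string would be sent as the body of a POST to http://localhost:9200/_bulk with the header Content-Type: application/x-ndjson.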

_mget batch retrieval:

http://localhost:9200/test/_mget
{
    "docs":[
        {
            "_type":"aaa",
            "_id":"2"
        },
        {
            "_type":"bbbbb",
            "_id":"1"
        }
    ]
}

If all the documents have the same type, you can use ids instead and put the ids in an array;
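As a rough sketch of the ids form, the body shrinks to a single array once the type is given in the URL. The helper name `build_mget_body` is hypothetical, and the typed URL in the comment assumes the same type-based Elasticsearch version used throughout this article.

```python
import json


def build_mget_body(ids):
    """Build an _mget body that fetches several documents of one type by id."""
    return {"ids": [str(doc_id) for doc_id in ids]}


# With the type in the URL, only the ids are needed in the body,
# e.g. for http://localhost:9200/test/aaa/_mget
mget_body = build_mget_body([2, 3])
print(json.dumps(mget_body))
```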

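These two batch APIs are important. When you need to operate on a large amount of data at once, always use them to minimize the number of network round trips and improve system performance;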

III. Inverted Index

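I previously wrote an article, "From Trees to Database Indexes", where you can read about forward indexes in databases. Here is one more example: a book has a table of contents and, at the back, an index that maps terms to page numbers. The table of contents is the forward index, and the back-of-book index is the inverted index;

A forward index maps a document ID to the document's content and words, as shown in the figure

An inverted index maps a word to the IDs of the documents that contain it, as shown in the figure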

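When we use a search engine to find documents that contain "Elasticsearch", the flow might look like this:

1. Use the inverted index to find that the document containing "Elasticsearch" has ID 1;

2. Use the forward index to look up the content of the document with ID 1;

3. Return the result;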
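The three steps above can be sketched with two plain dictionaries. This is a toy illustration with made-up sample documents, not how Elasticsearch stores data internally.

```python
from collections import defaultdict

# Forward index: document id -> document content (hypothetical sample data).
forward_index = {
    1: "Elasticsearch is a search engine",
    2: "Solr is a search platform",
}


def build_inverted_index(docs):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            inverted[word].add(doc_id)
    return {word: sorted(ids) for word, ids in inverted.items()}


def search(word, inverted, docs):
    """Step 1: inverted index -> ids. Step 2: forward index -> content. Step 3: return."""
    doc_ids = inverted.get(word.lower(), [])
    return [(doc_id, docs[doc_id]) for doc_id in doc_ids]


inverted_index = build_inverted_index(forward_index)
print(search("Elasticsearch", inverted_index, forward_index))
```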

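Now consider how the inverted index itself is structured. Among the data structures we already know, the B+ tree supports efficient lookups and happens to fit the set of terms produced by tokenization, as shown in the figure;

Once we have quickly located the term we are looking for, the most important piece of information is the IDs of the documents that contain it. With that alone we can already return correct results, as shown in the figure

Now consider another case. When you search for an item on Taobao, the matched terms are highlighted. The structure above cannot support that, so each entry in the posting list also needs the term's position and its character offsets. With those we can implement highlighting;

Then another case comes up. As the number of documents grows, a query returns many results and we want to show the most relevant ones first. That requires one more field, term frequency, which records how many times the term occurs in a document. With it, the structure covers the cases discussed here, as shown in the figure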
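To make the posting list entry concrete, here is a small sketch that records document id, term frequency, positions, and character offsets, then uses the offsets for highlighting. The field names (`tf`, `positions`, `offsets`) and the sample text are assumptions for illustration only.

```python
import re
from collections import defaultdict


def build_postings(docs):
    """Build term -> {doc_id: {"tf", "positions", "offsets"}} from word matches."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            entry = postings[term].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []}
            )
            entry["tf"] += 1  # term frequency, used for ranking
            entry["positions"].append(position)  # token position in the document
            entry["offsets"].append((match.start(), match.end()))  # character span
    return dict(postings)


def highlight(text, offsets, tag="em"):
    """Wrap each (start, end) span in a tag, right to left so earlier offsets stay valid."""
    for start, end in sorted(offsets, reverse=True):
        text = f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"
    return text


docs = {1: "search engines make search fast"}
postings = build_postings(docs)
entry = postings["search"][1]
print(entry)
print(highlight(docs[1], entry["offsets"]))
```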

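Once you understand the overall structure, you know why a search engine can return the results we want so quickly. Satisfying, isn't it? Then give this post a follow, haha! Of course there are many more internal optimizations, which we will leave aside for now.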

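IV. Analyzers

Analyzer components

Tokenization means splitting a whole text into parts according to some rule. In Elasticsearch the component that does this is the analyzer, which is made up of:

1. Character Filters: process the raw text before it is tokenized, somewhat like filtering with regular expressions;

2. Tokenizer: splits the text into tokens according to the specified rules;

3. Token Filters: further process and transform the tokens;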
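The three stages above can be sketched as plain functions chained together. This only imitates the shape of the pipeline; the function names and the tiny stop word set are hypothetical, and real analyzers do considerably more.

```python
import re


def strip_html(text):
    """Character filter: remove HTML tags from the raw text."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop_filter(tokens, stop_words=frozenset({"the", "a", "an", "is"})):
    """Token filter: drop tokens found in a small, hypothetical stop word set."""
    return [token for token in tokens if token not in stop_words]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze(
    "<b>The</b> Quick Fox",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(result)
```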

Analyze API

Elasticsearch provides the _analyze API so that we can test whether text is tokenized the way we expect. A brief demonstration of how to use it:
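For readers who prefer a script to Postman, the following sketch builds (without sending) a request for the _analyze API using only the Python standard library. The helper name `build_analyze_request` is hypothetical, and the base URL assumes a local node on port 9200 as used elsewhere in this article.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard",
                          base_url="http://localhost:9200", index=None):
    """Build (but do not send) a POST request for the _analyze API."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request("Hello Elasticsearch")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Passing `index="test"` targets /test/_analyze instead, which runs the analysis in the context of that index.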

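Look at the response: each token carries the fields we described for the inverted index, such as its position and its start and end offsets. The type field is the token type assigned by the tokenizer and can be ignored here. You can also run the analysis against a specific index. When no analyzer is specified, the standard analyzer is used: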

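Analyzer types

1.standard

The default analyzer. It splits text on word boundaries, supports many languages, and lowercases letters. The JSON it returns is too long to capture in a screenshot. Overall its Chinese support is poor: Chinese text is split into single characters, since the analyzer was not designed with Chinese in mind;

2.simple

Splits on non-letter characters and lowercases letters;

3.whitespace

Splits on whitespace;

4.stop

Like simple, but also removes stop words such as then and an. The built-in default list is English, so Chinese particles such as 的 and 得 need a custom stop word list;

5.keyword

Does not tokenize; the whole input is emitted as a single token;

6.pattern

Uses a regular expression to define the separator. The default is \W+, meaning non-word characters act as separators;

7.language

Language analyzers; many languages are built in;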
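The differences between a few of the built-in analyzers listed above can be approximated locally with regular expressions. These functions are rough imitations for illustration (ASCII letters only, no Unicode handling), not the real Lucene implementations, and the sample string is made up.

```python
import re


def simple_analyzer(text):
    """Approximate `simple`: split on non-letters, then lowercase."""
    return [token.lower() for token in re.split(r"[^A-Za-z]+", text) if token]


def whitespace_analyzer(text):
    """Approximate `whitespace`: split on whitespace, keep case."""
    return text.split()


def keyword_analyzer(text):
    """Approximate `keyword`: emit the whole input as one token."""
    return [text]


def pattern_analyzer(text, pattern=r"\W+"):
    """Approximate `pattern`: split on a regex separator, then lowercase."""
    return [token.lower() for token in re.split(pattern, text) if token]


sample = "Quick-Brown fox_2 jumps"
for analyzer in (simple_analyzer, whitespace_analyzer, keyword_analyzer, pattern_analyzer):
    print(analyzer.__name__, analyzer(sample))
```

Note how `fox_2` survives the pattern approximation intact, because underscores and digits count as word characters for \W+, while the simple approximation cuts it down to `fox`.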

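All of the above are built-in analyzers, and none of them handles Chinese well. Next, let's look at some Chinese analyzers:

1.IK

See its GitHub page for usage. It tokenizes both Chinese and English, provides the ik_smart and ik_max_word analyzers, and supports custom dictionaries and dictionary updates;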

#url
http://localhost:9200/_analyze
#JSON body
{
    "analyzer":"ik_max_word",
    "text":"今天是個好天氣，我是中國人"
}
#ik_max_word result
{
    "tokens": [
        {
            "token": "今天是",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "是",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "個",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "好天氣",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "好天",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "天氣",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "我",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 7
        },
        {
            "token": "是",
            "start_offset": 9,
            "end_offset": 10,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "中國人",
            "start_offset": 10,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "中國",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "國人",
            "start_offset": 11,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 11
        }
    ]
}
#ik_smart result
{
    "tokens": [
        {
            "token": "今天是",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "個",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "好天氣",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "我",
            "start_offset": 8,
            "end_offset": 9,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "是",
            "start_offset": 9,
            "end_offset": 10,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "中國人",
            "start_offset": 10,
            "end_offset": 13,
            "type": "CN_WORD",
            "position": 5
        }
    ]
}

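2.jieba

A popular tokenizer in the Python world; Python users can check it out on GitHub;

The analyzers above cover most day-to-day development needs. If you are interested, also take a look at HanLP, THULAC, and others;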

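Custom analyzers

If none of the above meets your needs, you can define a custom analyzer. A custom analyzer follows the same pipeline described in the analyzer components section above;

1.Character Filters

Character filters process the raw text before the tokenizer runs, which affects the position and offset information the tokenizer computes. The following are available:

html_strip: strips HTML tags and keeps the text inside them;

mapping: replaces strings according to a mapping;

pattern_replace: replaces text using a regular expression;

I wrote a simple demo; for the rest, see the official documentation;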
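As a hedged illustration of the three character filters above, these are example request bodies for POST http://localhost:9200/_analyze. The sample texts, the mapping rule, and the regular expression are made up for demonstration; the keyword tokenizer is used so that the effect of the character filter is visible on the whole string.

```python
import json

# html_strip: removes tags such as <p> and <b>, keeping the inner text.
html_strip_body = {
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "text": "<p>I am so <b>happy</b>!</p>",
}

# mapping: replaces each occurrence of the left-hand string with the right-hand one.
mapping_body = {
    "tokenizer": "keyword",
    "char_filter": [{"type": "mapping", "mappings": ["- => _"]}],
    "text": "wo-shi-lufei",
}

# pattern_replace: regex replacement, here turning a hyphen between digits into "_".
pattern_replace_body = {
    "tokenizer": "keyword",
    "char_filter": [
        {"type": "pattern_replace", "pattern": r"(\d+)-(?=\d)", "replacement": "$1_"}
    ],
    "text": "123-456-789",
}

bodies = (html_strip_body, mapping_body, pattern_replace_body)
for body in bodies:
    print(json.dumps(body))
```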

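2.Tokenizer

The tokenizer splits the raw text into words according to some rule. Tokenizers fall into roughly three categories:

Word Oriented Tokenizers: Standard Tokenizer, Letter Tokenizer, Whitespace Tokenizer, and so on;

Partial Word Tokenizers: N-Gram Tokenizer and Edge N-Gram Tokenizer, which emit fragments of words, loosely similar to the overlapping tokens produced by ik_max_word;

Structured Text Tokenizers: Path Hierarchy Tokenizer, Keyword Tokenizer, and so on;

See the official documentation for details;

3.Token Filter

Token filters add, remove, or modify the tokens emitted by the tokenizer. An analyzer can have multiple token filters, applied in the order listed. There are many built-in types; see the official documentation;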

#url
http://localhost:9200/_analyze
#POST request body
{
    "tokenizer":"standard",
    "text": "I'm LuFei wo will haizheiwang ",
    "filter":[
        "stop",
        "lowercase",
        {
            "type":"ngram",
            "min_gram":5,
            "max_gram":8
        }
    ]
}
#Tokens returned
{
    "tokens": [
        {
            "token": "lufei",
            "start_offset": 4,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "haizh",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "haizhe",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "haizhei",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "haizheiw",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "aizhe",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "aizhei",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "aizheiw",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "aizheiwa",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "izhei",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "izheiw",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "izheiwa",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "izheiwan",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "zheiw",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "zheiwa",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "zheiwan",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "zheiwang",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "heiwa",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "heiwan",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "heiwang",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "eiwan",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "eiwang",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "iwang",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}
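The long token list above comes from the ngram filter with min_gram 5 and max_gram 8. A small sketch of that expansion follows; the function is an illustration of the idea, not the Lucene implementation.

```python
def ngrams(token, min_gram, max_gram):
    """Return every substring of token whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(ngrams("lufei", 5, 8))
print(ngrams("wo", 5, 8))
print(len(ngrams("haizheiwang", 5, 8)))
```

Tokens shorter than min_gram, such as "wo" and "i'm", yield no n-grams, which is why they are missing from the response, while "will" is removed earlier by the stop filter.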

4. Defining a custom analyzer in an index
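A minimal sketch of what such index settings can look like, combining one component from each stage described earlier. The index name `my_index` and the analyzer name `my_custom_analyzer` are hypothetical, and the particular choice of filters is only an example.

```python
import json

# Request body for: PUT http://localhost:9200/my_index  (index name is hypothetical)
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {           # hypothetical analyzer name
                    "type": "custom",
                    "char_filter": ["html_strip"],    # 1. character filters
                    "tokenizer": "standard",          # 2. tokenizer
                    "filter": ["lowercase", "stop"],  # 3. token filters, in order
                }
            }
        }
    }
}

# Request body for: POST http://localhost:9200/my_index/_analyze
analyze_body = {"analyzer": "my_custom_analyzer", "text": "<b>I'm LuFei</b>"}

print(json.dumps(index_settings, indent=2))
print(json.dumps(analyze_body))
```

Once the index exists, the second body can be sent to the index-level _analyze endpoint to check that the custom analyzer behaves as intended.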

V. Conclusion

The next article will cover Mapping and the Search API. Likes are welcome, and you can also join group 438836709 or follow the public account
