Elasticsearch Study Notes (12): filter and query

Reposted article. 2018-02-05 15:23

I. The keyword sub-field and the keyword data type

1. Prepare the test data

POST /forum/article/_bulk

{ "index": { "_id": 1 }}

{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }

{ "index": { "_id": 2 }}

{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }

{ "index": { "_id": 3 }}

{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }

{ "index": { "_id": 4 }}

{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

2. Inspect the mapping

GET /forum/_mapping/article

{

"forum": {

    "mappings": {

      "article": {

        "properties": {

          "articleID": {

            "type": "text",

            "fields": {

              "keyword": {

                "type": "keyword",

                "ignore_above": 256

              }

            }

          },

          "hidden": {

            "type": "boolean"

          },

          "postDate": {

            "type": "date"

          },

          "userID": {

            "type": "long"

          }

        }

      }

    }

}

}

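In ES 5.2, when a string field is dynamically mapped as text (type=text), ES sets up two fields by default. One is the field itself, for example articleID, which is analyzed. The other is field.keyword, for example articleID.keyword, which is not analyzed and is mapped with ignore_above: 256, so values longer than 256 characters are not indexed in that sub-field.

articleID.keyword is a sub-field that recent ES versions create automatically, and it is not analyzed. So when an articleID value arrives it is indexed twice: once as the field itself, analyzed, with the resulting tokens placed in the inverted index; and once as articleID.keyword, unanalyzed, with the whole string (if it is at most 256 characters long) placed in the inverted index as a single term.

So when a term filter is applied to text, consider matching against the built-in field.keyword. The catch is the default 256-character limit, so where possible it is still better to define the mapping by hand. In older versions that meant marking the field as not_analyzed; in the latest versions that is not needed, just set type=keyword.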

3. Tests

Test 1: search on articleID

GET /forum/article/_search

{

    "query" : {

        "constant_score" : {

            "filter" : {

                "term" : {

                    "articleID" : "XHDK-A-1293-#fJ3"

                }

            }

        }

    }

}

Result: the specified document is not found

    {

"took": 1,

"timed_out": false,

"_shards": {

    "total": 5,

    "successful": 5,

    "failed": 0

},

"hits": {

    "total": 0,

    "max_score": null,

    "hits": []

}

   }

Test 2: search on articleID.keyword

GET /forum/article/_search

{

    "query" : {

        "constant_score" : {

            "filter" : {

                "term" : {

                    "articleID.keyword" : "XHDK-A-1293-#fJ3"

                }

            }

        }

    }

}

Result:

{

"took": 2,

"timed_out": false,

"_shards": {

    "total": 5,

    "successful": 5,

    "failed": 0

},

"hits": {

    "total": 1,

    "max_score": 1,

    "hits": [

      {

        "_index": "forum",

        "_type": "article",

        "_id": "1",

        "_score": 1,

        "_source": {

          "articleID": "XHDK-A-1293-#fJ3",

          "userID": 1,

          "hidden": false,

          "postDate": "2017-01-01"

        }

      }

    ]

}

}

Test 3: term query

GET /forum/article/_search

{

    "query" : {

        "constant_score" : {

            "filter" : {

                "term" : {

                    "userID" : 1

                }

            }

        }

    }

}

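term filter/query: the search text is not analyzed; it is looked up in the inverted index exactly as entered.

For example, if the search text were analyzed, "hello world" --> "hello" and "world", and the two words would be matched against the inverted index separately.

With term, "hello world" --> "hello world", and the inverted index is searched for the single term "hello world".

4. Inspect the analysis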

GET /forum/_analyze

{

"field": "articleID",

"text": "XHDK-A-1293-#fJ3"

}

GET /forum/_analyze

{

"field": "articleID.keyword",

"text": "XHDK-A-1293-#fJ3"

}

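A text field is analyzed by default. When the inverted index is built, every articleID value is tokenized; after that the original articleID no longer exists in the index, only the individual tokens do.

term does not analyze the search text: XHDK-A-1293-#fJ3 --> XHDK-A-1293-#fJ3. But when articleID was indexed: XHDK-A-1293-#fJ3 --> xhdk, a, 1293, fj3. The unanalyzed search text therefore matches none of the indexed terms.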
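The mismatch between the analyzed index and the unanalyzed term lookup can be sketched with a toy inverted index. This is only an illustration: `standard_analyze` is a hypothetical regex approximation of the standard analyzer, and `keyword_analyze`, `build_index` and `term_lookup` are names invented for this sketch, not Elasticsearch APIs.

```python
import re

def standard_analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def keyword_analyze(text, ignore_above=256):
    """keyword sub-field: the whole string is one term; longer values are skipped."""
    return [text] if len(text) <= ignore_above else []

def build_index(docs, analyze):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, value in docs.items():
        for term in analyze(value):
            index.setdefault(term, set()).add(doc_id)
    return index

def term_lookup(index, value):
    """A term query looks the value up verbatim, without analyzing it."""
    return index.get(value, set())

docs = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}
text_index = build_index(docs, standard_analyze)      # models articleID
keyword_index = build_index(docs, keyword_analyze)    # models articleID.keyword

print(standard_analyze("XHDK-A-1293-#fJ3"))            # ['xhdk', 'a', '1293', 'fj3']
print(term_lookup(text_index, "XHDK-A-1293-#fJ3"))     # set()
print(term_lookup(keyword_index, "XHDK-A-1293-#fJ3"))  # {1}
```

The lookup against the analyzed index comes back empty because the index only holds the lowercase tokens, while the keyword index holds the original string as a single term.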

5. Define a field with the keyword data type

(1) Delete the index: DELETE /forum

(2) Recreate the index

PUT /forum

{

"mappings": {

    "article": {

      "properties": {

       "articleID": {

          "type": "keyword"

        }

      }

    }

}

}

(3) Load the data

POST /forum/article/_bulk

{ "index": { "_id": 1 }}

{ "articleID" : "XHDK-A-1293-#fJ3", "userID" : 1, "hidden": false, "postDate": "2017-01-01" }

{ "index": { "_id": 2 }}

{ "articleID" : "KDKE-B-9947-#kL5", "userID" : 1, "hidden": false, "postDate": "2017-01-02" }

{ "index": { "_id": 3 }}

{ "articleID" : "JODL-X-1937-#pV7", "userID" : 2, "hidden": false, "postDate": "2017-01-01" }

{ "index": { "_id": 4 }}

{ "articleID" : "QQPX-R-3956-#aD8", "userID" : 2, "hidden": true, "postDate": "2017-01-02" }

(4) Test a query on articleID

GET /forum/article/_search

{

    "query" : {

        "constant_score" : {

            "filter" : {

                "term" : {

                    "articleID" : "XHDK-A-1293-#fJ3"

                }

            }

        }

    }

}

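6. Summary

(1) term filter: searches by exact value; numbers, booleans and dates support it natively

(2) a text value must be indexed unanalyzed (not_analyzed in older versions, type=keyword in 5.x) before a term query on the full value can match

(3) It is equivalent to a single WHERE condition in SQL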

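II. How filters execute, in depth

1. The bitset mechanism

Each filter builds a bitset from its search of the inverted index and uses it to store the result. A simple data structure implements a complex feature, saving memory and improving performance. A bitset is a binary array in which each element is 0 or 1, indicating whether a doc matches a filter condition: 1 if it matches, 0 if it does not. For example: [0, 1, 1].

ES iterates over the bitsets of the filter conditions, starting from the sparsest one, and looks for the documents that satisfy all conditions (iterating over the sparser bitsets first eliminates as many documents as possible early).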
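A minimal sketch of the idea, using plain Python lists as bitsets over the four test documents. The helper names `build_bitset` and `intersect_sparsest_first` are invented for illustration; the real implementation lives inside Lucene and is considerably more compact.

```python
def build_bitset(docs, predicate):
    """One bit per document: 1 if the document matches the filter, else 0."""
    return [1 if predicate(d) else 0 for d in docs]

def intersect_sparsest_first(bitsets):
    """AND the bitsets together, starting with the one with the fewest set bits."""
    ordered = sorted(bitsets, key=sum)
    candidates = [i for i, bit in enumerate(ordered[0]) if bit]
    for bits in ordered[1:]:
        # Only the surviving candidates are checked against the next bitset.
        candidates = [i for i in candidates if bits[i]]
    return candidates

docs = [
    {"userID": 1, "hidden": False, "postDate": "2017-01-01"},
    {"userID": 1, "hidden": False, "postDate": "2017-01-02"},
    {"userID": 2, "hidden": False, "postDate": "2017-01-01"},
    {"userID": 2, "hidden": True,  "postDate": "2017-01-02"},
]
not_hidden = build_bitset(docs, lambda d: d["hidden"] is False)      # [1, 1, 1, 0]
user_2 = build_bitset(docs, lambda d: d["userID"] == 2)              # [0, 0, 1, 1]
jan_1 = build_bitset(docs, lambda d: d["postDate"] == "2017-01-01")  # [1, 0, 1, 0]

# Positions are 0-based, so 2 is the third document (_id 3).
print(intersect_sparsest_first([not_hidden, user_2, jan_1]))  # [2]
```

Starting from a sparse bitset keeps the candidate list short, so each later bitset is probed for only a few positions.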

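2. The bitset caching mechanism

ES tracks queries: if a filter is used more than a certain number of times (the number is not fixed) within the most recent 256 queries, its bitset is cached automatically.

Bitsets are not cached for small segments, meaning segments with fewer than 10,000 documents or less than 3% of the index's total size. A small segment holds so little data that even scanning it is fast. Segments are also merged automatically in the background, so a small segment will soon be merged with other small segments into a larger one; caching it would be pointless because the segment is about to disappear.

Automatic update of cached bitsets: if documents are added or modified, the cached bitset is updated automatically.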

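3. filter compared with query

The advantage of filter over query is that its results can be cached.

In most cases filters run before the query, removing as much data as possible first.

query: computes each doc's relevance score against the search condition and sorts by that score.

filter: simply filters out the wanted data; it neither computes a relevance score nor sorts.

III. Combining multiple filter conditions with bool

1. Search for posts whose post date is 2017-01-01 or whose post ID is XHDK-A-1293-#fJ3, while requiring that the post date is never 2017-01-02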

GET /forum/article/_search

{

"query": {

    "constant_score": {

      "filter": {

        "bool": {

          "should":[

            {"term":{"postDate":"2017-01-01"}},

            {"term":{"articleID":"XHDK-A-1293-#fJ3"}}

          ],

          "must_not":{

            "term":{

              "postDate":"2017-01-02"

            }

          }

        }

      }

    }

}

}

2. Search for posts whose post ID is XHDK-A-1293-#fJ3, or whose post ID is JODL-X-1937-#pV7 and whose post date is 2017-01-01

GET /forum/article/_search

{

"query": {

    "constant_score": {

      "filter": {

        "bool": {

         "should":[

              {"term":{"articleID":"XHDK-A-1293-#fJ3"}},

              {"bool":{

                "must":[

                  {"term":{"articleID":"JODL-X-1937-#pV7"}},

                  {"term":{"postDate":"2017-01-01"}}

                ]

              }}

            ]

        }

      }

    }

}

}
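To check what the two bool filters described in the headings above should return, their boolean logic can be replayed over the four test documents in plain Python. This is a toy model under stated assumptions: `term`, `bool_filter` and `search` are hypothetical helpers written for this sketch, articleID is treated as a keyword field, and dates are compared as plain strings.

```python
docs = {
    1: {"articleID": "XHDK-A-1293-#fJ3", "postDate": "2017-01-01"},
    2: {"articleID": "KDKE-B-9947-#kL5", "postDate": "2017-01-02"},
    3: {"articleID": "JODL-X-1937-#pV7", "postDate": "2017-01-01"},
    4: {"articleID": "QQPX-R-3956-#aD8", "postDate": "2017-01-02"},
}

def term(field, value):
    """Exact-value match on one field."""
    return lambda doc: doc.get(field) == value

def bool_filter(must=(), should=(), must_not=()):
    """Combine clauses: must = AND, should = OR, must_not = NOT."""
    def matches(doc):
        if any(clause(doc) for clause in must_not):
            return False
        if not all(clause(doc) for clause in must):
            return False
        # With no must clauses, at least one should clause has to match.
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return matches

def search(filter_fn):
    return [doc_id for doc_id, doc in docs.items() if filter_fn(doc)]

query_1 = bool_filter(
    should=[term("postDate", "2017-01-01"), term("articleID", "XHDK-A-1293-#fJ3")],
    must_not=[term("postDate", "2017-01-02")],
)
query_2 = bool_filter(
    should=[
        term("articleID", "XHDK-A-1293-#fJ3"),
        bool_filter(must=[term("articleID", "JODL-X-1937-#pV7"),
                          term("postDate", "2017-01-01")]),
    ]
)
print(search(query_1))  # [1, 3]
print(search(query_2))  # [1, 3]
```

Because `bool_filter` returns a predicate of the same shape as `term`, bool clauses nest inside each other just as they do in the request body.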

IV. term and terms

V. Range filters

Test data:

Add a view-count field to the post data

POST /forum/article/_bulk

{ "update": { "_id": "1"} }

{ "doc" : {"view_cnt" : 30} }

{ "update": { "_id": "2"} }

{ "doc" : {"view_cnt" : 50} }

{ "update": { "_id": "3"} }

{ "doc" : {"view_cnt" : 100} }

{ "update": { "_id": "4"} }

{ "doc" : {"view_cnt" : 80} }

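1. Search for posts with a view count between 30 and 60

GET /forum/article/_search

{

"query": {

    "constant_score": {

     "filter": {

        "range": {

          "view_cnt": {

            "gt": 30,              // gt: greater than; gte: greater than or equal to

            "lt": 60               // lt: less than; lte: less than or equal to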

          }

        }

      }

    }

}

}

2. Search for posts published within the last month

GET /forum/article/_search

{

"query": {

    "constant_score": {

    "filter": {

        "range": {

          "postDate": {

            "gt": "2017-03-10||-30d"

          }

        }

      }

    }

}

}

GET /forum/article/_search

{

"query": {

    "constant_score": {

    "filter": {

        "range": {

          "postDate": {

            "gt": "now-30d"

          }

        }

      }

    }

}

}
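The two date-math expressions used above, "2017-03-10||-30d" and "now-30d", both resolve to a concrete lower bound before the range is applied. The sketch below shows that arithmetic for day offsets only; `resolve_date_math` is a hypothetical helper written for illustration and covers only a small part of what Elasticsearch date math supports.

```python
from datetime import datetime, timedelta

def resolve_date_math(expression, now=None):
    """Resolve 'now-30d' or '<yyyy-MM-dd>||-30d' style expressions (day offsets only)."""
    now = now or datetime.now()
    if "||" in expression:
        # An explicit anchor date, followed by the offset after '||'.
        anchor_text, offset = expression.split("||", 1)
        anchor = datetime.strptime(anchor_text, "%Y-%m-%d")
    elif expression.startswith("now"):
        anchor, offset = now, expression[len("now"):]
    else:
        raise ValueError(f"unsupported expression: {expression!r}")
    if not offset:
        return anchor
    if not offset.endswith("d"):
        raise ValueError("this sketch only understands day offsets such as -30d")
    return anchor + timedelta(days=int(offset[:-1]))

print(resolve_date_math("2017-03-10||-30d").date())  # 2017-02-08
```

With the anchored form, the filter then keeps posts whose postDate is later than 2017-02-08; with "now-30d" the bound moves with the current time.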

VI. Precise matching with the match query

Test data: add a title field to the post data

POST /forum/article/_bulk

{ "update": { "_id": "1"} }

{ "doc" : {"title" : "this is java and elasticsearch blog"} }

{ "update": { "_id": "2"} }

{ "doc" : {"title" : "this is java blog"} }

{ "update": { "_id": "3"} }

{ "doc" : {"title" : "this is elasticsearch blog"} }

{ "update": { "_id": "4"} }

{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }

{ "update": { "_id": "5"} }

{ "doc" : {"title" : "this is spark blog"} }

1. match query

    GET /forum/article/_search

{

    "query": {

        "match": {

            "title": "java elasticsearch"

        }

    }

}

This is equivalent to:

{

"bool": {

    "should": [

      { "term": { "title": "java" }},

      { "term": { "title": "elasticsearch"   }}

    ]

}

}

If the title field is analyzed, a full-text search is performed and the documents returned contain java, elasticsearch, or both.

If it is not_analyzed, an exact-value match is performed (equivalent to a term query), and only documents whose title is exactly "java elasticsearch" are returned.

GET /forum/article/_search

{

    "query": {

        "match": {

            "title": {

                "query": "java elasticsearch",

                "operator": "and" // full text: return only documents that contain both "java" and "elasticsearch"

           }

        }

    }

}

This is equivalent to:

   {

      "bool": {

     "must": [

      { "term": { "title": "java" }},

      { "term": { "title": "elasticsearch"   }}

        ]

      }

    }

GET /forum/article/_search

{

"query": {

    "match": {

      "title": {

        "query": "java elasticsearch spark hadoop",

        "minimum_should_match": "75%" // full text: return documents that match at least 75% of the query terms

      }

    }

}

}

This is equivalent to:

{

"bool": {

    "should": [

      { "term": { "title": "java" }},

      { "term": { "title": "elasticsearch"   }},

      { "term": { "title": "hadoop" }},

      { "term": { "title": "spark" }}

    ],

    "minimum_should_match": 3

}

}
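How the three match variants above select documents can be checked by counting matching terms per title. The sketch assumes a simple lowercase word split in place of the real analyzer and rounds the percentage down, so 75% of 4 terms becomes 3; `tokens` and `match` are hypothetical helpers, and the five titles are the test data from this section.

```python
import math
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def tokens(text):
    """Lowercase word split, a rough stand-in for the analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match(query, minimum_should_match="1"):
    """Return ids of titles containing at least the required number of query terms."""
    terms = tokens(query)
    if minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        required = math.floor(len(terms) * percent / 100)  # percentages round down
    else:
        required = int(minimum_should_match)
    return [doc_id for doc_id, title in titles.items()
            if len(terms & tokens(title)) >= required]

print(match("java elasticsearch"))                      # [1, 2, 3, 4]
print(match("java elasticsearch", "100%"))              # [1, 4]
print(match("java elasticsearch spark hadoop", "75%"))  # [4]
```

Requiring 100% of the terms behaves like operator "and", while the default of one matching term behaves like the plain should form.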

2. Combine multiple search conditions with bool to search title

GET /forum/article/_search

{

"query": {

    "bool": {

      "must":     { "match": { "title": "java" }},

      "must_not": { "match": { "title": "spark" }},

      "should": [

                  { "match": { "title": "hadoop" }},

                  { "match": { "title": "elasticsearch"   }}

      ]

    }

}

}

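How bool computes the relevance score when combining multiple search conditions

The scores of the matching must and should clauses are added together and divided by the total number of must and should clauses.

Ranked first: contains java plus every keyword in should (hadoop and elasticsearch)

Ranked second: contains java plus elasticsearch from should

Ranked third: contains java but none of the should keywords

should can affect the relevance score.

must guarantees that the keyword is present, and the document's relevance score for this search is also computed from the must condition.

Once must is satisfied, the should conditions do not have to match, but the more of them match, the higher the document's relevance score.

By default, should does not need to match any clause; in the search above, for example, "this is java blog" matches none of the should conditions.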
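The ranking described above can be reproduced with a toy score. The sketch assumes every matching clause contributes a score of 1.0, which is a simplification made here for illustration (real clause scores come from the similarity model and differ per term and document). `bool_score` is a hypothetical helper, and the titles are the test data from this section.

```python
import re

titles = {
    1: "this is java and elasticsearch blog",
    2: "this is java blog",
    3: "this is elasticsearch blog",
    4: "this is java, elasticsearch, hadoop blog",
    5: "this is spark blog",
}

def tokens(text):
    """Lowercase word split, a rough stand-in for the analyzer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def bool_score(title, must, must_not, should):
    """Sum of matching must/should clause scores divided by the number of such clauses."""
    words = tokens(title)
    if any(t in words for t in must_not) or not all(t in words for t in must):
        return None  # excluded from the results
    # Assumption: each matching clause scores exactly 1.0.
    matched = len(must) + sum(1 for t in should if t in words)
    return matched / (len(must) + len(should))

scores = {doc_id: bool_score(title, ["java"], ["spark"], ["hadoop", "elasticsearch"])
          for doc_id, title in titles.items()}
ranking = sorted((d for d, s in scores.items() if s is not None),
                 key=lambda d: scores[d], reverse=True)
print(ranking)  # [4, 1, 2]
```

Document 4 matches all three clauses, document 1 matches two, and document 2 matches only the must clause, which gives the first, second and third places listed above.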

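There is one exception: if there is no must, at least one should clause has to match.

In the search below, should has 4 conditions; by default, matching any one of them is enough for a document to be returned.

The number of should conditions that must match can be controlled precisely with minimum_should_match.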

GET /forum/article/_search

{

"query": {

    "bool": {

      "should": [

        { "match": { "title": "java" }},

        { "match": { "title": "elasticsearch"   }},

        { "match": { "title": "hadoop"   }},

        { "match": { "title": "spark"   }}

      ],

      "minimum_should_match": 3

    }

}

}
