Elasticsearch In-Depth Search: Structured Search and Using the Java API

This article is a repost. Published 2018-12-12 15:39.

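I. Creating an Index in ES

1. Create the index:

The earlier article on installing and using ES plugins covered creating an index with a custom analyzer and creating a type as two separate steps. In fact, the type can be defined and its analyzer specified in the same request that creates the index.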

PUT /my_index
{
  "settings": {
        "analysis": {
            "analyzer": {
                "ik_smart_pinyin": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                },
                "ik_max_word_pinyin": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : true,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true 
                }
            }
        }
  },
  "mappings": {
    "my_type":{
      "properties": {
        "id":{
          "type": "integer"
        },
        "name":{
          "type": "text",
          "analyzer": "ik_max_word_pinyin"
        },
        "age":{
          "type":"integer"
        }
      }
    }
  }
}

2. Add data

POST /my_index/my_type/_bulk
{ "index": { "_id":1}}
{ "id":1,"name": "張三","age":20}
{ "index": { "_id": 2}}
{ "id":2,"name": "張四","age":22}
{ "index": { "_id": 3}}
{ "id":3,"name": "張三李四王五","age":20}

3. View the field types (mapping)

GET /my_index/my_type/_mapping

Result:
{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "age": {
            "type": "integer"
          },
          "id": {
            "type": "integer"
          },
          "name": {
            "type": "text",
            "analyzer": "ik_max_word_pinyin"
          }
        }
      }
    }
  }
}

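II. Using It from Java (ES must already be configured in the project; there are plenty of examples online to follow)

1. Create the ES entity class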

package com.example.es_query_list.entity.es;

import lombok.Getter;
import lombok.Setter;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

@Setter
@Getter
@Document(indexName = "my_index",type = "my_type")
public class User {
    @Id
    private Integer id;
    private String name;
    private Integer age;
}

2. Create the DAO layer

package com.example.es_query_list.repository.es;

import com.example.es_query_list.entity.es.User;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

public interface EsUserRepository extends ElasticsearchRepository<User,Integer> {
}

III. With the Groundwork Done, Start Querying

1. Exact-value queries

Querying a non-text field

GET /my_index/my_type/_search
{
  "query": {
    "term": {
      "age": {
        "value": "20"
      }
    }
  }
}


Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "張三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "name": "李四",
          "age": 20
        }
      }
    ]
  }
}

2. Querying a text field

Result of a term query on name for the full value 張三李四王五:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

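You may notice that the result is empty. Why does an exact match fail to find the exact value that was entered? As mentioned earlier, the field was given an analyzer when the type was created, and a term query for a string that is not among the analyzed tokens returns nothing. Let's verify that.

First, look at which tokens the queried text is analyzed into: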

GET my_index/_analyze
{
  "text":"張三李四王五",
  "analyzer": "ik_max_word"
}

Result:
{
  "tokens": [
    {
      "token": "張三李四",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "張三",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "三",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 2
    },
    {
      "token": "李四",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "四",
      "start_offset": 3,
      "end_offset": 4,
      "type": "TYPE_CNUM",
      "position": 4
    },
    {
      "token": "王",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 5
    },
    {
      "token": "五",
      "start_offset": 5,
      "end_offset": 6,
      "type": "TYPE_CNUM",
      "position": 6
    }
  ]
}

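The output shows that 張三李四王五 is never emitted as the single token 張三李四王五, so the query returns nothing.

Solution: update the field's mapping in the type by adding a custom sub-field of type keyword. In ES a keyword field is not run through an analyzer, so the value is indexed exactly as it was given.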

POST /my_index/_mapping/my_type
{
  "properties": {
    "name": {
            "type": "text",
            "analyzer": "ik_max_word_pinyin",
            "fields": {
              "keyword":{  // custom sub-field name
                "type": "keyword"
              }
            }
          }
  }
}

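Once the mapping is updated, delete the existing documents and index them again (documents indexed before the change are not added to the new sub-field automatically). After that, the query finds the document.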

    // Assumption: template is an injected ElasticsearchTemplate (its declaration is not shown).
    public List<User> termQuery() {
        // Exact match on a numeric field.
        QueryBuilder queryBuilder = QueryBuilders.termQuery("age", 20);
        // Exact match on the non-analyzed keyword sub-field:
        // QueryBuilder queryBuilder = QueryBuilders.termQuery("name.keyword", "張三李四王五");
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withIndices("my_index")
                .withTypes("my_type")
                .withQuery(queryBuilder)
                .build();

        return template.queryForList(searchQuery, User.class);
    }
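
To make the tokenization argument concrete, here is a minimal plain-Java sketch. It is a model, not Elasticsearch code: the class and method names are hypothetical, the token set is simply copied from the _analyze output shown earlier, and the pinyin filter is ignored. A term query is modeled as a lookup for one exact term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-Java model of why a term query on an analyzed text field can miss.
class TermLookupSketch {

    // Tokens that ik_max_word produced for "張三李四王五" (copied from the _analyze result).
    static final Set<String> TEXT_FIELD_TOKENS = new LinkedHashSet<>(
            Arrays.asList("張三李四", "張三", "三", "李四", "四", "王", "五"));

    // A keyword field is not analyzed, so the whole value is its only term.
    static final Set<String> KEYWORD_FIELD_TERMS =
            Collections.singleton("張三李四王五");

    // A term query does not analyze its input; it only looks for that exact term.
    static boolean termMatches(Set<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println(termMatches(TEXT_FIELD_TOKENS, "張三李四王五"));   // false
        System.out.println(termMatches(TEXT_FIELD_TOKENS, "張三"));          // true
        System.out.println(termMatches(KEYWORD_FIELD_TERMS, "張三李四王五")); // true
    }
}
```

The full string is absent from the analyzed token set but is the only term of the keyword sub-field, which is why the query has to target name.keyword.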

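IV. Combining Filters

The bool filter

Note: the official documentation is slightly off here. From 5.x onward, filtered has been replaced by bool: "The filtered query is replaced by the bool query".

A bool filter is made up of three parts: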

{ "bool" : { "must" : [], "should" : [], "must_not" : [] } }

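must: all of these clauses must match; equivalent to AND.

must_not: none of these clauses may match; equivalent to NOT.

should: at least one of these clauses must match; equivalent to OR. (When the same bool also contains a must or filter clause, should clauses become optional by default unless minimum_should_match is set.)

It's that simple! When we need several filters, we just place them in the appropriate parts of the bool filter.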
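
The three parts can be modeled in plain Java with predicates. This is only a sketch of the semantics, not the Elasticsearch API: BoolFilterSketch, term and doc are hypothetical names, and the rule that should is mandatory only when no must clause is present is an assumption that mirrors the default minimum_should_match behaviour.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Plain-Java model of bool clause semantics; not the Elasticsearch API.
class BoolFilterSketch implements Predicate<Map<String, Object>> {
    private final List<Predicate<Map<String, Object>>> must = new ArrayList<>();
    private final List<Predicate<Map<String, Object>>> should = new ArrayList<>();
    private final List<Predicate<Map<String, Object>>> mustNot = new ArrayList<>();

    BoolFilterSketch must(Predicate<Map<String, Object>> clause) { must.add(clause); return this; }
    BoolFilterSketch should(Predicate<Map<String, Object>> clause) { should.add(clause); return this; }
    BoolFilterSketch mustNot(Predicate<Map<String, Object>> clause) { mustNot.add(clause); return this; }

    // Hypothetical helper: exact match of one field against one value.
    static Predicate<Map<String, Object>> term(String field, Object value) {
        return doc -> value.equals(doc.get(field));
    }

    // Hypothetical helper: builds a tiny document with the fields used in this article.
    static Map<String, Object> doc(String name, int age) {
        Map<String, Object> d = new HashMap<>();
        d.put("name", name);
        d.put("age", age);
        return d;
    }

    @Override
    public boolean test(Map<String, Object> doc) {
        boolean allMust = must.stream().allMatch(c -> c.test(doc));        // AND
        boolean noMustNot = mustNot.stream().noneMatch(c -> c.test(doc));  // NOT
        // Assumption: a should clause is only mandatory when there is no must clause.
        boolean shouldOk = should.isEmpty() || !must.isEmpty()
                || should.stream().anyMatch(c -> c.test(doc));             // OR
        return allMust && noMustNot && shouldOk;
    }

    public static void main(String[] args) {
        BoolFilterSketch ageIs20Or30 = new BoolFilterSketch()
                .should(term("age", 20))
                .should(term("age", 30));
        System.out.println(ageIs20Or30.test(doc("張三", 20))); // true
        System.out.println(ageIs20Or30.test(doc("張四", 22))); // false

        BoolFilterSketch age20NotZhangSan = new BoolFilterSketch()
                .must(term("age", 20))
                .mustNot(term("name", "張三"));
        System.out.println(age20NotZhangSan.test(doc("張三", 20))); // false
        System.out.println(age20NotZhangSan.test(doc("王八", 20))); // true
    }
}
```

Because the sketch class is itself a predicate, one instance can be passed as a clause of another, which is the same composability the bool filter offers.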

GET /my_index/my_type/_search
{
  "query" : {
            "bool" : {
              "should" : [
                 { "term" : {"age" : 20}}, 
                 { "term" : {"age" : 30}} 
              ],
              "must" : {
                 "term" : {"name.keyword" : "張三"} 
              }
           }
      }
}

    // Assumption: template is an injected ElasticsearchTemplate (its declaration is not shown).
    public List<User> boolQuery() {
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
        // should: age is 20 or 30
        boolQueryBuilder.should(QueryBuilders.termQuery("age", 20));
        boolQueryBuilder.should(QueryBuilders.termQuery("age", 30));
        // must: exact match on the keyword sub-field
        boolQueryBuilder.must(QueryBuilders.termQuery("name.keyword", "張三"));
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withIndices("my_index")
                .withTypes("my_type")
                .withQuery(boolQueryBuilder)
                .build();
        return template.queryForList(searchQuery, User.class);
    }

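Nested bool filters

Although bool is a compound filter that accepts multiple sub-filters, it is still just a filter itself. This means a bool filter can be placed inside another bool filter, which makes it possible to express arbitrarily complex Boolean logic.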

GET /my_index/my_type/_search
{
  "query" : {
            "bool" : {
              "should" : [
                 { "term" : {"age" : 20}}, 
                 { "bool" : {
                   "must": [
                     {"term": {
                       "name.keyword": {
                         "value": "李四"
                       }
                     }}
                   ]
                 }} 
              ]
           }
      }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "張三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "name": "張三李四王五",
          "age": 20
        }
      }
    ]
  }
}

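Because the term and bool filters are siblings inside the outer should clause, a returned document has to match at least one of them.

Clauses placed inside the inner must are likewise siblings of one another, so a document has to match all of them at once. In this example the inner must holds a single term clause on name.keyword.

V. Finding Multiple Exact Values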

GET my_index/my_type/_search
{
  "query": {
    "terms": {
      "age": [
        20,
        22
      ]
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "id": 2,
          "name": "張四",
          "age": 22
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "張三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "name": "張三李四王五",
          "age": 20
        }
      }
    ]
  }
}

It is important to understand that term and terms are contains operations, not equals checks.
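
A small plain-Java sketch of the difference, assuming a hypothetical multi-valued tags field (the class and method names are illustrative, not part of any Elasticsearch API):

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "contains, not equals"; the tags values are hypothetical.
class TermContainsSketch {

    // term: matches when the field's terms contain the value.
    static boolean term(List<String> fieldTerms, String value) {
        return fieldTerms.contains(value);
    }

    // terms: matches when the field's terms contain at least one of the values.
    static boolean terms(List<String> fieldTerms, List<String> values) {
        return values.stream().anyMatch(fieldTerms::contains);
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("search", "open_source");

        // The field holds two terms, yet a term filter for one of them still matches.
        System.out.println(term(tags, "search"));                         // true
        // An equality check against the single value would not.
        System.out.println(tags.equals(Arrays.asList("search")));         // false
        System.out.println(terms(tags, Arrays.asList("java", "search"))); // true
        System.out.println(terms(tags, Arrays.asList("java", "lucene"))); // false
    }
}
```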

// list is a collection of the values to match, for example Arrays.asList(20, 22)
TermsQueryBuilder termsQueryBuilder = QueryBuilders.termsQuery("age", list);

VI. Range Queries

1. Numeric range query

GET my_index/my_type/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 20
      }
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "id": 1,
          "name": "張三",
          "age": 20
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "id": 3,
          "name": "張三李四王五",
          "age": 20
        }
      }
    ]
  }
}

Note: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

RangeQueryBuilder rangeQueryBuilder = QueryBuilders.rangeQuery("age").gte(10).lte(20);
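
The four bounds differ only in whether the endpoint itself is included. The following plain-Java sketch models that; RangeSketch is a hypothetical class written for illustration and is unrelated to RangeQueryBuilder's implementation.

```java
// Plain-Java model of gt / gte / lt / lte bounds; not Elasticsearch code.
class RangeSketch<T extends Comparable<T>> {
    private T lower;
    private boolean lowerInclusive;
    private T upper;
    private boolean upperInclusive;

    RangeSketch<T> gt(T v)  { lower = v; lowerInclusive = false; return this; }
    RangeSketch<T> gte(T v) { lower = v; lowerInclusive = true;  return this; }
    RangeSketch<T> lt(T v)  { upper = v; upperInclusive = false; return this; }
    RangeSketch<T> lte(T v) { upper = v; upperInclusive = true;  return this; }

    // A missing bound (null) leaves that side of the range open.
    boolean matches(T value) {
        if (lower != null) {
            int c = value.compareTo(lower);
            if (c < 0 || (c == 0 && !lowerInclusive)) return false;
        }
        if (upper != null) {
            int c = value.compareTo(upper);
            if (c > 0 || (c == 0 && !upperInclusive)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RangeSketch<Integer> inclusive = new RangeSketch<Integer>().gte(10).lte(20);
        System.out.println(inclusive.matches(20)); // true: lte includes the endpoint
        System.out.println(inclusive.matches(22)); // false

        RangeSketch<Integer> exclusive = new RangeSketch<Integer>().gt(10).lt(20);
        System.out.println(exclusive.matches(20)); // false: lt excludes the endpoint
        System.out.println(exclusive.matches(15)); // true
    }
}
```

With gte 10 and lte 20, the ages 20 and 20 are kept and 22 is dropped, which agrees with the two hits in the result above.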

2. Date range query

Update the type to add a date field:

POST /my_index/_mapping/my_type
{
  "properties": {
    "date": {
      "type": "date",
      "format": "yyyy-MM-dd"
    }
  }
}

Add data:

POST /my_index/my_type/_bulk
{ "index": { "_id":4}}
{ "id":4,"name": "趙六","age":20,"date":"2018-10-1"}
{ "index": { "_id": 5}}
{ "id":5,"name": "對七","age":22,"date":"2018-11-20"}
{ "index": { "_id": 6}}
{ "id":6,"name": "王八","age":20,"date":"2018-7-28"}

Query:

GET my_index/my_type/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2018-10-20",
        "lte": "2018-11-29"
      }
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "5",
        "_score": 1,
        "_source": {
          "id": 5,
          "name": "對七",
          "age": 22,
          "date": "2018-11-20"
        }
      }
    ]
  }
}

RangeQueryBuilder rangeQueryBuilder = QueryBuilders.rangeQuery("date").gte("2018-10-20").lte("2018-11-29");
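
As a cross-check of the result above, the same inclusive date comparison can be reproduced with java.time. This is a sketch under assumptions: DateRangeSketch is a hypothetical class, and the pattern "yyyy-M-d" is chosen here only so that java.time accepts the non-padded sample values such as 2018-10-1; it is not the format declared in the mapping.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of an inclusive date range (gte + lte); not Elasticsearch code.
class DateRangeSketch {
    // Assumption: one-letter month and day so that "2018-10-1" and "2018-11-20" both parse.
    static final DateTimeFormatter LENIENT = DateTimeFormatter.ofPattern("yyyy-M-d");

    // True when gte <= date <= lte.
    static boolean inRange(String date, String gte, String lte) {
        LocalDate d = LocalDate.parse(date, LENIENT);
        LocalDate from = LocalDate.parse(gte, LENIENT);
        LocalDate to = LocalDate.parse(lte, LENIENT);
        return !d.isBefore(from) && !d.isAfter(to);
    }

    // Keeps only the dates that fall inside the range.
    static List<String> filter(List<String> dates, String gte, String lte) {
        List<String> hits = new ArrayList<>();
        for (String date : dates) {
            if (inRange(date, gte, lte)) hits.add(date);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dates = Arrays.asList("2018-10-1", "2018-11-20", "2018-7-28");
        System.out.println(filter(dates, "2018-10-20", "2018-11-29")); // [2018-11-20]
    }
}
```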

VII. Handling null Values

1. Add data

POST /my_index/posts/_bulk
{ "index": { "_id": "1"              }}
{ "tags" : ["search"]                }  
{ "index": { "_id": "2"              }}
{ "tags" : ["search", "open_source"] }  
{ "index": { "_id": "3"              }}
{ "other_field" : "some data"        }  
{ "index": { "_id": "4"              }}
{ "tags" : null                      }  
{ "index": { "_id": "5"              }}
{ "tags" : ["search", null]          }

2. Find documents in which a given field exists

GET /my_index/posts/_search
{
    "query" : {
        "constant_score" : {    // scoring is skipped; every hit gets a score of 1 by default
            "filter" : {    
                "exists" : { "field" : "tags" }
            }
        }
    }
} 

Result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "5",
        "_score": 1,
        "_source": {
          "tags": [
            "search",
            null
          ]
        }
      },
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "2",
        "_score": 1,
        "_source": {
          "tags": [
            "search",
            "open_source"
          ]
        }
      },
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "1",
        "_score": 1,
        "_source": {
          "tags": [
            "search"
          ]
        }
      }
    ]
  }
}

BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
boolQueryBuilder.filter(QueryBuilders.constantScoreQuery(QueryBuilders.existsQuery("tags")));

3. Find documents in which a given field is missing

Note: the missing query has been removed as of ES 5; use must_not with exists instead.

GET /my_index/posts/_search
{
    "query" : {
        "bool": {
          "must_not": [
            {"constant_score": {
              "filter": {
                "exists": {
                "field": "tags"
              }}
            }}
          ]
        }
    }
}


Query result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "4",
        "_score": 1,
        "_source": {
          "tags": null
        }
      },
      {
        "_index": "my_index",
        "_type": "posts",
        "_id": "3",
        "_score": 1,
        "_source": {
          "other_field": "some data"
        }
      }
    ]
  }
}

Note: to handle null values, a field whose content is empty can be configured to be treated as a null value.

boolQueryBuilder.mustNot(QueryBuilders.boolQuery().filter(QueryBuilders.constantScoreQuery(QueryBuilders.existsQuery("tags"))));
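
The two results above can be reproduced with a small plain-Java model of the rule they suggest: a field counts as existing only if it holds at least one non-null value. ExistsSketch and its helpers are hypothetical names, and the five documents are the ones from the bulk request in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Plain-Java model of exists / missing; not Elasticsearch code.
class ExistsSketch {

    // True only if the field holds at least one non-null value.
    static boolean exists(Map<String, Object> doc, String field) {
        Object value = doc.get(field);
        if (value == null) return false; // field absent, or explicit null
        if (value instanceof Collection) {
            return ((Collection<?>) value).stream().anyMatch(Objects::nonNull);
        }
        return true;
    }

    // Hypothetical helper: a document with a single field.
    static Map<String, Object> doc(String field, Object value) {
        Map<String, Object> d = new HashMap<>();
        d.put(field, value);
        return d;
    }

    // The five documents from the bulk request, keyed by _id.
    static Map<Integer, Map<String, Object>> sampleDocs() {
        Map<Integer, Map<String, Object>> docs = new LinkedHashMap<>();
        docs.put(1, doc("tags", Arrays.asList("search")));
        docs.put(2, doc("tags", Arrays.asList("search", "open_source")));
        docs.put(3, doc("other_field", "some data"));
        docs.put(4, doc("tags", null));
        docs.put(5, doc("tags", Arrays.asList("search", null)));
        return docs;
    }

    // Ids of documents where the field exists (wantExists = true) or is missing (false).
    static List<Integer> idsWhere(String field, boolean wantExists) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, Object>> e : sampleDocs().entrySet()) {
            if (exists(e.getValue(), field) == wantExists) ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(idsWhere("tags", true));  // [1, 2, 5]
        System.out.println(idsWhere("tags", false)); // [3, 4]
    }
}
```

Document 5 still counts as having tags because one of its two values is non-null, while documents 3 and 4 are the ones returned by the must_not query.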

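VIII. About Caching

1. The core

At the heart of it is a bitset that records which documents match a filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, a bitset can be reused wherever the same filter appears, without having to evaluate the whole filter again.

These cached bitsets are "smart": they are updated incrementally. When new documents are indexed, only those new documents need to be added to the existing bitset, instead of recomputing the entire cache over and over. Like the rest of the system, filters are real-time, so there is no need to worry about cache expiry.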

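2. Independent filter caching

The bitset that belongs to a query component is independent of the rest of the search request it appears in. This means that, once cached, a query can be reused across multiple search requests; the bitset does not depend on the surrounding query context. Caching can therefore speed up the most frequently used parts of a query without wasting overhead on the less frequent, more volatile parts.

Similarly, if a single request uses the same non-scoring query more than once, its cached bitset can be reused by every instance within that request.

Consider the following example query, which looks for emails that satisfy either of these conditions:

Conditions (example): (1) in the inbox and not yet read; (2) not in the inbox but marked important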

GET /inbox/emails/_search
{
  "query": {
      "constant_score": {
          "filter": {
              "bool": {
                 "should": [
                    { "bool": {    // (1)
                          "must": [
                             { "term": { "folder": "inbox" }}, 
                             { "term": { "read": false }}
                          ]
                    }},
                    { "bool": {    // (2)
                          "must_not": {
                             "term": { "folder": "inbox" } 
                          },
                          "must": {
                             "term": { "important": true }
                          }
                    }}
                 ]
              }
            }
        }
    }
}

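Clauses 1 and 2 share one filter (the term filter on folder = inbox), so they use the same bitset.

Although one of the inbox clauses is a must clause and the other is a must_not clause, the two clauses themselves are identical. This means the bitset is computed once, when the first clause is executed, and then cached for the other. When the query is run again, the inbox filter is already cached, so both clauses use the cached bitset.

This fits in well with the composability of the query DSL. A filter is easy to move anywhere in the expression, or to reuse in several places within the same query. That is not only convenient for developers; it also has a direct performance benefit.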
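
The reuse described above can be sketched in plain Java with java.util.BitSet. This is a toy model, not how Elasticsearch or Lucene implement their caches: SharedBitsetSketch, the string cache keys and the four sample emails are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Toy model: one cached bitset per filter, shared by every clause that uses the filter.
class SharedBitsetSketch {
    // Hypothetical email documents; the array index is the document id.
    static final String[] FOLDER = {"inbox", "inbox", "archive", "archive"};
    static final boolean[] READ = {false, true, false, true};
    static final boolean[] IMPORTANT = {false, false, true, false};

    private final Map<String, BitSet> cache = new HashMap<>();
    int computations = 0; // how many bitsets were actually built

    // Returns the cached bitset for a filter key, building it only on first use.
    BitSet bitsetFor(String key, IntPredicate matches) {
        return cache.computeIfAbsent(key, k -> {
            computations++;
            BitSet bits = new BitSet(FOLDER.length);
            for (int doc = 0; doc < FOLDER.length; doc++) {
                if (matches.test(doc)) bits.set(doc);
            }
            return bits;
        });
    }

    BitSet search() {
        // (1) in the inbox AND unread; clone so the cached bitset is not modified
        BitSet first = (BitSet) bitsetFor("folder:inbox", d -> FOLDER[d].equals("inbox")).clone();
        first.and(bitsetFor("read:false", d -> !READ[d]));
        // (2) NOT in the inbox AND important; reuses the cached inbox bitset
        BitSet second = (BitSet) bitsetFor("important:true", d -> IMPORTANT[d]).clone();
        second.andNot(bitsetFor("folder:inbox", d -> FOLDER[d].equals("inbox")));
        first.or(second); // should: either branch may match
        return first;
    }

    public static void main(String[] args) {
        SharedBitsetSketch sketch = new SharedBitsetSketch();
        System.out.println(sketch.search());      // {0, 2}
        System.out.println(sketch.computations);  // 3: the inbox filter was built once
        sketch.search();
        System.out.println(sketch.computations);  // still 3: everything came from the cache
    }
}
```

The inbox filter appears twice in the query but its bitset is built once; the must clause intersects with it and the must_not clause subtracts it.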

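3. Automatic caching behavior

In older versions of Elasticsearch, the default behavior was to cache everything that could be cached. This often meant that the system cached bitsets too aggressively, and performance suffered from churning the cache. In addition, many filters are very fast to evaluate but considerably slower to cache (and to reuse from the cache). Caching such filters makes little sense, because it is cheaper simply to execute them again.

Inspecting the inverted index is very fast, and most query components are used only rarely. Take a term filter on a "user_id" field: with millions of users, any particular user ID appears only rarely. Caching the bitset for this filter does not pay off, because the cached result will most likely be evicted before it is ever reused.

This kind of cache churn has a serious effect on performance. Worse still, it makes it hard for developers to tell which components cache well and which caches are useless.

To address this, Elasticsearch caches queries automatically based on how often they are used. If a non-scoring query has been used several times in the most recent 256 queries (how many times depends on the query type), it becomes a candidate for caching. However, not every segment is guaranteed to cache the bitset. Only segments that hold more than 10,000 documents (or 3% of the total number of documents, whichever is larger) cache it. Small segments are quick to search and are soon merged away, so caching a bitset for them has little value.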
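
The policy in the previous paragraph can be written down as a small plain-Java sketch. It is only an illustration of the rule as described: CachingPolicySketch is a hypothetical class, minUses is a made-up parameter (the text only says the count depends on the query type), and reading the two segment thresholds as "whichever is larger" is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of usage-based caching; not Elasticsearch's actual policy class.
class CachingPolicySketch {
    static final int HISTORY_SIZE = 256;          // from the text
    static final int MIN_SEGMENT_DOCS = 10_000;   // from the text
    static final double MIN_SEGMENT_RATIO = 0.03; // from the text

    private final Deque<String> recentFilters = new ArrayDeque<>();
    private final int minUses; // hypothetical threshold

    CachingPolicySketch(int minUses) {
        this.minUses = minUses;
    }

    // Records one use of a filter, keeping only the most recent 256 entries.
    void onUse(String filterKey) {
        if (recentFilters.size() == HISTORY_SIZE) recentFilters.removeFirst();
        recentFilters.addLast(filterKey);
    }

    // True when the filter appears at least minUses times in the recent history.
    boolean usedOftenEnough(String filterKey) {
        long uses = recentFilters.stream().filter(filterKey::equals).count();
        return uses >= minUses;
    }

    // Assumption: a segment must exceed the larger of the two thresholds.
    static boolean segmentLargeEnough(int segmentDocs, int totalDocs) {
        return segmentDocs > Math.max(MIN_SEGMENT_DOCS, MIN_SEGMENT_RATIO * totalDocs);
    }

    boolean shouldCache(String filterKey, int segmentDocs, int totalDocs) {
        return usedOftenEnough(filterKey) && segmentLargeEnough(segmentDocs, totalDocs);
    }

    public static void main(String[] args) {
        CachingPolicySketch policy = new CachingPolicySketch(2);
        policy.onUse("folder:inbox");
        System.out.println(policy.shouldCache("folder:inbox", 50_000, 1_000_000)); // false: used once
        policy.onUse("folder:inbox");
        System.out.println(policy.shouldCache("folder:inbox", 50_000, 1_000_000)); // true
        System.out.println(policy.shouldCache("folder:inbox", 5_000, 6_000));      // false: small segment
    }
}
```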

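Once cached, a non-scoring bitset stays in the cache until it is evicted. Eviction follows an LRU policy: once the cache is full, the least recently used filter is evicted.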
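
LRU eviction itself is easy to demonstrate in plain Java with an access-ordered LinkedHashMap. The sketch below is a generic illustration with hypothetical names and keys; it says nothing about how Elasticsearch sizes or implements its own cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: the least recently used entry is dropped once capacity is exceeded.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCacheSketch(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: iteration runs from least to most recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // called after each put; evicts the eldest entry when over capacity
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("folder:inbox", "bitset A");
        cache.put("read:false", "bitset B");
        cache.get("folder:inbox");               // touch: inbox is now the most recently used
        cache.put("important:true", "bitset C"); // cache is full, so read:false is evicted
        System.out.println(cache.keySet());      // [folder:inbox, important:true]
    }
}
```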
