Elasticsearch from the Ground Up (Part 8), Search Engine: Mapping, Exact Match vs. Full-Text Search, Analyzers, and a Mapping Summary

Reposted article, 2019-08-22 20:06 [ElasticSearch]

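First, a brief description of what a mapping is.

A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
With dynamic mapping, ES automatically creates the index, the type, and the mapping for that type. The mapping records the data type of each field along with settings such as how the field is analyzed.

When we insert a few documents, ES automatically creates an index for us: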

PUT /website/article/1
{
  "post_date": "2019-08-21",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/article/2
{
  "post_date": "2019-08-22",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/article/3
{
  "post_date": "2019-08-23",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}

View the mapping:

GET /website/_mapping

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "post_date": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

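The mapping above was generated automatically when the data was inserted; a mapping can also be created manually. Either way, the data structure and related configuration created for a type in an index is called a mapping.

Try a few searches

GET /website/article/_search?q=2019                  // 3 results
GET /website/article/_search?q=2019-08-21            // 3 results
GET /website/article/_search?q=post_date:2019-08-21  // 1 result
GET /website/article/_search?q=post_date:2019        // 0 results

Why are the results inconsistent? When ES builds the mapping automatically, it assigns different data types to different fields, and different data types are analyzed and searched differently. That is why a search against the _all field and a search against the post_date field behave so differently.
Below is a manually created mapping.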

PUT /test_mapping
{
  "mappings" : {
    "properties" : {
      "author_id" : {
        "type" : "long"
      },
      "content" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "post_date" : {
        "type" : "date"
      },
      "title" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}

Exact match vs. full-text search

exact value

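A field must match in its entirety for the corresponding document to be returned.
Example:

GET /website/article/_search?q=post_date:2019-08-21  // 1 result
GET /website/article/_search?q=post_date:2019        // 0 results

With an exact value, you must search for the full value 2019-08-21 to find the document.
If you search for just 21, nothing is found.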

full text

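Full text differs from exact value: instead of matching one complete value, the value is split into terms (tokenized) and matched term by term, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.
Example:

GET /website/article/_search?q=2019                  // 3 results
GET /website/article/_search?q=2019-08-21            // 3 results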

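Core principle of the inverted index

The following walks through a simplified construction of an inverted index; in practice the process is far more complex.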
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenize the text and build a preliminary inverted index

word    doc1    doc2
I        *        *
really   *
liked    *        *
my       *        *
small    *
dogs     *
and      *
think    *
mom      *        *
also     *        
them     *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

Searching for mother like little dog returns no results, because none of the query terms appears in the index:
mother
like
little
dog
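This is certainly not what we want. For example, mother and mom mean the same thing, yet the documents cannot be found. ES can find them when the field uses an analyzer that normalizes terms. While building the inverted index, the analyzer also processes each token to raise the chance that later searches find related documents: tense conversion, singular/plural conversion, synonym conversion, and case conversion. This process is called normalization. Which of these steps actually run depends on the analyzer configured for the field.
mother -> mom
liked -> like
small -> little
dogs -> dog
Rebuilding the inverted index with normalized terms gives: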

word    doc1    doc2
I        *        *
really   *
like     *        *
my       *        *
little   *
dog      *
and      *
think    *
mom      *        *
also     *        
them     *
He                *
never             *
any               *
so                *
hope              *
that              *
will              *
not               *
expect            *
me                *
to                *
him               *

Query: mother like little dog, after tokenization and normalization:
mother -> mom
like -> like
little -> little
dog -> dog
Both doc1 and doc2 are returned:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
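
The walkthrough above can be reproduced with a small Python sketch. It is a toy model, not how Elasticsearch is implemented: the `NORMALIZE` table is a hypothetical stand-in that covers only the four rewrites listed in the text, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Hypothetical normalization table covering only the rewrites listed above.
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}


def tokenize(text):
    """Split the text into runs of letters."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Lowercase the token, then apply the rewrite table."""
    token = token.lower()
    return NORMALIZE.get(token, token)


def build_index(docs, use_normalization):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = normalize(token) if use_normalization else token
            index[term].add(doc_id)
    return index


def search(index, query, use_normalization):
    """Return the ids of documents containing at least one query term."""
    hits = set()
    for token in tokenize(query):
        term = normalize(token) if use_normalization else token
        hits |= index.get(term, set())
    return sorted(hits)


raw_index = build_index(DOCS, use_normalization=False)
norm_index = build_index(DOCS, use_normalization=True)

print(search(raw_index, "mother like little dog", False))  # []
print(search(norm_index, "mother like little dog", True))  # ['doc1', 'doc2']
```

The same normalization is applied to the documents and to the query; that symmetry is what lets mother match a document that only contains mom.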

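Analyzers

Tokenization and normalization (to improve recall)

Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
Recall: improving recall means increasing the number of relevant results that a search is able to find.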

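character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and converting & --> and (I&you --> I and you)
tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
token filter: lowercase, stop word, synonym, and so on: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

The analyzer matters a great deal: it runs the text through all of this processing, and only the final result is used to build the inverted index.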
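
The three stages can be sketched as three small Python functions chained together. This is only an illustration of the pipeline order under simplifying assumptions: the `STOP_WORDS` set and the `REWRITES` table are hypothetical and contain only the examples mentioned above, and the regular expressions are far cruder than real character filters and tokenizers.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Hypothetical synonym/stemming table limited to the examples in the text.
REWRITES = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}


def char_filter(text):
    """Stage 1: strip HTML tags and spell out '&' before tokenizing."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Stage 2: split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Stage 3: lowercase, drop stop words, then apply the rewrite table."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(REWRITES.get(token, token))
    return result


def analyze(text):
    """Run the three stages in order."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>Tom</span> liked the small dogs & mother"))
# ['tom', 'like', 'little', 'dog', 'and', 'mom']
```

The order matters: the character filter has to run first so that the tokenizer never sees markup, and the token filters run last because they operate on individual tokens.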

Built-in analyzers:

Text to analyze: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific analyzers, for example english): set, shape, semi, transpar, call, set_tran, 5
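
To see why the first three analyzers produce different tokens, here are rough Python approximations. They are assumptions for illustration only: the real standard analyzer follows Unicode text segmentation rules, and the `\w+` pattern below merely happens to agree with it on this sentence. The language analyzer is left out because stemming cannot be imitated with a one-line pattern.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def standard_like(text):
    """Rough stand-in: lowercase, keep runs of letters, digits, and '_'."""
    return re.findall(r"\w+", text.lower())


def simple_like(text):
    """Lowercase and split on every non-letter character."""
    return re.findall(r"[a-z]+", text.lower())


def whitespace_like(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


for analyzer in (standard_like, simple_like, whitespace_like):
    print(analyzer.__name__, analyzer(TEXT))
```

The differences come down to what counts as a token boundary: the underscore and the digit survive in the first variant, only letters survive in the second, and everything except whitespace survives in the third.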

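Resolving the open question from the mapping introduction

GET /_search?q=2019

This searches the _all field: all of a document's field values are concatenated into one long string, which is then analyzed. (The _all field exists only in older versions; it was deprecated in Elasticsearch 6.0 and removed in 7.0.) For the second document, the concatenated string is: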

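2019-08-22 my second article this is my second article in this website 11400

        doc1        doc2        doc3
2019      *          *           *
08        *          *           *
21        *
22                   *
23                               *

Searching _all for 2019 therefore matches all 3 documents. A search for 2019-08-21 is analyzed into 2019, 08, and 21, so it also matches all 3.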

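GET /_search?q=post_date:2019-08-21

A date field is indexed as an exact value, so each document is filed under its complete date:

             doc1        doc2        doc3
2019-08-21    *
2019-08-22                 *
2019-08-23                             *

Only the full value 2019-08-21 matches doc1. There is no entry for 2019 on its own, which is why q=post_date:2019 returns 0 results.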
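
The four search results from the introduction can be reproduced with a toy model of the two index styles. This is a sketch of the explanation above, not of how Elasticsearch stores dates internally; the helper names and the `\w+` tokenizer are assumptions made for illustration.

```python
import re

DOCS = {
    1: {"post_date": "2019-08-21", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2019-08-22", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2019-08-23", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}


def analyze(text):
    """Crude full-text analysis: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_all_index(docs):
    """Full text: join every field value into one string and tokenize it."""
    index = {}
    for doc_id, doc in docs.items():
        combined = " ".join(str(value) for value in doc.values())
        for term in analyze(combined):
            index.setdefault(term, set()).add(doc_id)
    return index


def build_exact_index(docs, field):
    """Exact value: the whole field value is the only key for the document."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc[field], set()).add(doc_id)
    return index


all_index = build_all_index(DOCS)
date_index = build_exact_index(DOCS, "post_date")


def search_all(query):
    """Analyze the query, then return documents matching any term."""
    hits = set()
    for term in analyze(query):
        hits |= all_index.get(term, set())
    return sorted(hits)


def search_date(value):
    """Look the value up as-is, without any analysis."""
    return sorted(date_index.get(value, set()))


print(search_all("2019"))         # [1, 2, 3]
print(search_all("2019-08-21"))   # [1, 2, 3]
print(search_date("2019-08-21"))  # [1]
print(search_date("2019"))        # []
```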

Testing an analyzer

Syntax:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

{
  "tokens": [
    {
      "token": "text",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "to",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyze",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
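
To make the `start_offset`, `end_offset`, and `position` fields in this response concrete, the following Python sketch computes them locally for the same input. It is an approximation under stated assumptions: it tokenizes with `\w+` instead of the real standard tokenizer, and it hard-codes the type as `<ALPHANUM>`, so it should not be read as the behavior of the `_analyze` API in general.

```python
import re


def analyze_with_offsets(text):
    """Return tokens with character offsets and positions, _analyze-style."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "<ALPHANUM>",           # simplification: always the same type
            "position": position,           # ordinal of the token in the stream
        })
    return {"tokens": tokens}


for token in analyze_with_offsets("Text to analyze")["tokens"]:
    print(token)
```

The offsets refer to the original text, which is what allows a search result to highlight the matched words even though the indexed terms were lowercased.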

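Further summary of mapping

When data is inserted directly into ES, ES automatically creates the index, along with the type and its mapping.
The mapping automatically defines the data type of each field.
Depending on the data type (text versus date, for example), a field is treated either as an exact value or as full text.
For an exact value, the whole value is added to the inverted index as a single term. Full text goes through analysis first, that is, tokenization and normalization (tense, synonym, and case conversion), before it is added to the inverted index.
At search time, whether a field is an exact value or full text also determines how it is searched, and the behavior mirrors how the inverted index was built: an exact value search matches against the whole value, while a full text search tokenizes and normalizes the query before looking it up in the inverted index.
You can rely on ES dynamic mapping to create the mapping automatically, including the data types, or you can create the mapping for the index and type manually in advance and configure each field yourself: its data type, its indexing behavior, its analyzer, and so on.

In essence, a mapping is the metadata of a type in an index. It determines the data types, how the inverted index is built, and how searches behave.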

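Core mapping data types and dynamic mapping

Core data types

text (string in versions before 5.0): string type
byte: byte type
short: short integer type
integer: integer type
long: long integer type
float: floating-point type
boolean: boolean type
date: date type

There are also more complex types, such as arrays and object, but underneath they are flattened into fields of the core types above.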

dynamic mapping

true or false -> boolean
123 -> long
123.45 -> float
"2017-01-01" -> date
"hello world" -> text (with a keyword sub-field)
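
These rules can be sketched as a small type-inference function in Python. It is a simplified model, not Elasticsearch's implementation: the `DATE_PATTERN` regex is a hypothetical stand-in for real date detection, and the function names are invented for this example. Applied to the first website article, it yields the same mapping that ES generated earlier.

```python
import re

# Simplified stand-in for date detection: matches only yyyy-MM-dd.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_field(value):
    """Guess a field mapping from a single JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        if DATE_PATTERN.match(value):
            return {"type": "date"}
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    if isinstance(value, dict):
        return {"properties": infer_mapping(value)}
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(doc):
    """Infer a mapping for every field of a document."""
    return {name: infer_field(value) for name, value in doc.items()}


article = {
    "post_date": "2019-08-21",
    "title": "my first article",
    "content": "this is my first article in this website",
    "author_id": 11400,
}
print(infer_mapping(article))
```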

View the mapping

Syntax:

GET /{index}/_mapping
GET /{index}/_mapping/{type}

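Creating and modifying mappings manually, and controlling whether string fields are analyzed

Note: a mapping can be defined manually only when the index is created, or extended later by adding mappings for new fields. The mapping of an existing field cannot be updated.

```
"analyzer": "standard": the field is analyzed (tokenized)
date: date
keyword: not analyzed
```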

# Create the index
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "long"
      },
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "content": {
        "type": "text"
      },
      "post_date": {
        "type": "date"
      },
      "publisher_id": {
        "type": "keyword"
      }
    }
  }
}


# Try to modify a field's mapping (rejected: re-sending PUT /website fails because the index already exists)
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": {
        "type": "text"
      }
    }
  }
}

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
    "index_uuid": "5xLohnJITHqCwRYInmBFmA",
    "index": "website"
  },
  "status": 400
}


# Add a new field to the mapping
PUT /website/_mapping
{
  "properties": {
    "new_field": {
      "type": "text"
    }
  }
}

{
  "acknowledged" : true
}
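
The rule stated in the note (new fields may be added, existing fields may not change type) can be modelled in a few lines of Python. This is a toy model of the rule only: `MappingConflict` and `merge_properties` are hypothetical names, and the error message is invented rather than copied from Elasticsearch.

```python
class MappingConflict(Exception):
    """Raised when an update would change an existing field's type."""


def merge_properties(existing, update):
    """Return a new properties dict, rejecting type changes on existing fields."""
    merged = dict(existing)  # copy, so a rejected update leaves the original intact
    for name, definition in update.items():
        if name in merged and merged[name]["type"] != definition["type"]:
            raise MappingConflict(
                f"field [{name}] cannot be changed from "
                f"[{merged[name]['type']}] to [{definition['type']}]"
            )
        merged[name] = definition
    return merged


website = {
    "author_id": {"type": "long"},
    "title": {"type": "text", "analyzer": "standard"},
    "content": {"type": "text"},
    "post_date": {"type": "date"},
    "publisher_id": {"type": "keyword"},
}

# Adding a new field is accepted.
website = merge_properties(website, {"new_field": {"type": "text"}})

# Changing the type of an existing field is rejected.
try:
    merge_properties(website, {"author_id": {"type": "text"}})
except MappingConflict as error:
    print(error)
```

The restriction follows from the earlier sections: documents already indexed under the old type were written into the inverted index according to that type, so changing it afterwards would leave existing data inconsistent with the mapping.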

Complex mapping types and the underlying structure of object data

multivalue field
```
{
    "tags": ["tag1", "tag2"]
}
```
An array is indexed the same way as a single string value; the values in one array must not mix data types.
empty field
```
null, [], [null]
```

object field
Initial data:

PUT /company/employee/1
{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}

View the mapping

GET /company/_mapping/employee

{
  "company": {
    "mappings": {
      "employee": {
        "properties": {
          "address": {
            "properties": {
              "city": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "country": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "province": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "age": {
            "type": "long"
          },
          "join_date": {
            "type": "date"
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

How an object field is stored internally

{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}

↓↓↓↓

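{
    "name":             [jack],
    "age":              [27],
    "join_date":        [2017-01-01],
    "address.country":  [china],
    "address.province": [guangdong],
    "address.city":     [guangzhou]
}

An array of objects is flattened the same way, with each inner field collecting the values from every element: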

{
    "authors": [
        { "age": 26, "name": "Jack White"},
        { "age": 55, "name": "Tom Jones"},
        { "age": 39, "name": "Kitty Smith"}
    ]
}

↓↓↓↓

{
    "authors.age":    [26, 55, 39],
    "authors.name":   [jack, white, tom, jones, kitty, smith]
}
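
The flattening shown above can be sketched as a short recursive function in Python. This is a minimal illustration assuming plain dicts, lists, and scalar values; the `flatten` helper is a hypothetical name, and the final tokenization step uses a plain lowercase split rather than a real analyzer.

```python
def flatten(value, prefix="", out=None):
    """Collect every leaf value under its dotted field path."""
    if out is None:
        out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            flatten(child, path, out)
    elif isinstance(value, list):
        # Array elements share the path of the array itself.
        for item in value:
            flatten(item, prefix, out)
    else:
        out.setdefault(prefix, []).append(value)
    return out


employee = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}
book = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}

print(flatten(employee))
flat = flatten(book)
print(flat["authors.age"])  # [26, 55, 39]

# Text values are analyzed afterwards; a lowercase split stands in for that here.
name_terms = [term for name in flat["authors.name"] for term in name.lower().split()]
print(name_terms)  # ['jack', 'white', 'tom', 'jones', 'kitty', 'smith']
```

One consequence is visible in the output: once the array is flattened, nothing records that 26 belonged to Jack White rather than to Tom Jones.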
