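First, a brief description of what a mapping is.
A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
With dynamic mapping, ES creates the index, the type, and the mapping for that type automatically. The mapping records the data type of each field, along with settings such as how the field is analyzed.
Let us insert a few documents and let ES create the index for us: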
PUT /website/article/1
{
  "post_date": "2019-08-21",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/article/2
{
  "post_date": "2019-08-22",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/article/3
{
  "post_date": "2019-08-23",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}
View the mapping:
GET /website/_mapping

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": { "type": "long" },
          "content": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          },
          "post_date": { "type": "date" },
          "title": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      }
    }
  }
}
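The mapping above was generated automatically when the data was inserted; a mapping can also be created manually. Either way, the data structure and related configuration created for a type in an index is called a mapping.
Now try a few searches:
GET /website/article/_search?q=2019                    // 3 hits
GET /website/article/_search?q=2019-08-21              // 3 hits
GET /website/article/_search?q=post_date:2019-08-21    // 1 hit
GET /website/article/_search?q=post_date:2019          // 0 hits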
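Why are the results inconsistent? When ES builds the mapping automatically, it assigns different data types to different fields, and different data types behave differently when they are analyzed and searched. That is why a search on the _all field and a search on the post_date field give completely different results.
Below is a manually created mapping.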

PUT /test_mapping
{
  "mappings": {
    "properties": {
      "author_id": { "type": "long" },
      "content": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "post_date": { "type": "date" },
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      }
    }
  }
}
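Exact value versus full text: a comparison
exact value
The whole value of the field must match for the document to be returned.
Example:
GET /website/article/_search?q=post_date:2019-08-21    // 1 hit
GET /website/article/_search?q=post_date:2019          // 0 hits
With an exact value, you must search for the complete value 2019-08-21 to find the document.
Searching for just 21 finds nothing.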
full text
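Full text differs from exact value: instead of matching only one complete value, the value is split into terms (tokenized) and matched term by term, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.
Example:
GET /website/article/_search?q=2019          // 3 hits
GET /website/article/_search?q=2019-08-21    // 3 hits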
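Core principles of the inverted index
The following walks through a simplified construction of an inverted index. In practice the process is considerably more complex.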
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
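Tokenize the text and build a first version of the inverted index:
word     doc1  doc2
I        *     *
really   *
liked    *     *
my       *     *
small    *
dogs     *     *
and      *
think    *
mom      *     *
also     *
them     *
He             *
never          *
any            *
so             *
hope           *
that           *
will           *
not            *
expect         *
me             *
to             *
him            *
Searching for mother like little dog returns no results, because none of the four query terms appears in the index in exactly that form: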
mother
like
little
dog
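This is clearly not what we want. For example, mother and mom mean the same thing, yet the documents cannot be found. In practice ES can find them, because when it builds the inverted index it also processes each token to raise the chance that related documents are matched later: converting tenses, singular and plural forms, synonyms, and letter case. This process is called normalization. Which of these conversions actually happen depends on the analyzer configured for the field: the default standard analyzer only lowercases, stemming requires a language analyzer such as english, and synonyms require a synonym token filter.
mother -> mom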
liked -> like
small -> little
dogs -> dog
Rebuilding the inverted index with normalized terms gives:
word     doc1  doc2
I        *     *
really   *
like     *     *
my       *     *
little   *
dog      *     *
and      *
think    *
mom      *     *
also     *
them     *
He             *
never          *
any            *
so             *
hope           *
that           *
will           *
not            *
expect         *
me             *
to             *
him            *
The query mother like little dog is tokenized and normalized in the same way:
mother -> mom
like -> like
little -> little
dog -> dog
Both doc1 and doc2 are now found:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
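The walkthrough above can be sketched in a few lines of Python. This is only an illustration of the idea, not how ES is implemented: the NORMALIZE table is a hypothetical stand-in for real stemming and synonym filters, and the function names are made up for this example. Unlike the tables above, the sketch also lowercases every term.

```python
from collections import defaultdict

# Hypothetical normalization table taken from the example above;
# a real analyzer would use stemmers and synonym filters instead.
NORMALIZE = {"liked": "like", "dogs": "dog", "small": "little", "mother": "mom"}


def analyze(text):
    """Split on whitespace, strip punctuation, lowercase, then normalize."""
    terms = []
    for raw in text.split():
        word = raw.strip(".,").lower()
        if word:
            terms.append(NORMALIZE.get(word, word))
    return terms


def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents matching at least one query term (OR semantics)."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}
index = build_index(docs)
result = search(index, "mother like little dog")
print(sorted(result))
```

Because the query goes through the same analyze function as the documents, mother is looked up as mom and both documents are returned.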
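Analyzers
An analyzer tokenizes text and normalizes the tokens (which improves recall).
Given a sentence, the analyzer splits it into individual words and normalizes each word (tense conversion, singular and plural conversion, and so on).
Recall is the share of relevant documents that a search actually returns; normalization raises it by letting more related documents match.
- character filter: preprocesses the text before it is tokenized. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)
- tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
- token filter: lowercase, stop words, synonyms: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little
The analyzer is important: it runs the text through all of these steps, and only the final processed result is used to build the inverted index.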
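The three stages described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not ES code: the function names, the stop word list, and the SYNONYMS table are hypothetical and only cover the examples used in this article.

```python
import re

# Hypothetical tables covering only the examples in this article.
STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little", "dogs": "dog", "liked": "like"}


def char_filter(text):
    """Preprocess raw text: strip HTML tags and spell out '&' as 'and'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the filtered text into word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    """Lowercase each token, drop stop words, and map synonyms and word forms."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(SYNONYMS.get(token, token))
    return result


def analyze(text):
    """Run the three stages in order: character filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<span>Tom liked the small dogs</span>"))
print(analyze("I&you"))
```

The order matters: the character filter sees the raw string, the tokenizer sees cleaned text, and the token filters see individual tokens.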
Built-in analyzers:
Text to analyze: Set the shape to semi-transparent by calling set_trans(5)
standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific, for example the english analyzer): set, shape, semi, transpar, call, set_tran, 5
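To make the difference between two of these analyzers concrete, here is a rough Python approximation of the whitespace and simple analyzers. The function names are hypothetical, and the simple analyzer is approximated for ASCII letters only; the standard and language analyzers follow more involved rules and are not reproduced here.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are left untouched."""
    return text.split()


def simple_analyzer(text):
    """Lowercase, then split on anything that is not a letter (ASCII-only sketch)."""
    return re.findall(r"[a-z]+", text.lower())


text = "Set the shape to semi-transparent by calling set_trans(5)"
print(whitespace_analyzer(text))
print(simple_analyzer(text))
```

For the sample sentence, both functions reproduce the token lists shown above for their respective analyzers.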
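Resolving the question left open in the mapping introduction
GET /_search?q=2019
This searches the _all field. All fields of a document are concatenated into one long string, which is then analyzed into terms. For the second document, for example, the string is:
2019-08-22 my second article this is my second article in this website 11400
The date-related terms in the resulting inverted index are:
term  doc1  doc2  doc3
2019  *     *     *
08    *     *     *
21    *
22          *
23                *
Searching _all for 2019 therefore matches all 3 documents.
GET /_search?q=post_date:2019-08-21
A date field is indexed as an exact value:
term        doc1  doc2  doc3
2019-08-21  *
2019-08-22        *
2019-08-23              *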
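The contrast between the two lookups can be reproduced with a small Python sketch. It is a simplification under stated assumptions: the _all field is modeled as the concatenation of all field values split into alphanumeric terms, and the date field is modeled as a dictionary keyed by the whole date string (real ES stores dates in a numeric form). All names are hypothetical.

```python
import re
from collections import defaultdict

docs = {
    "doc1": {"post_date": "2019-08-21", "title": "my first article",
             "content": "this is my first article in this website", "author_id": 11400},
    "doc2": {"post_date": "2019-08-22", "title": "my second article",
             "content": "this is my second article in this website", "author_id": 11400},
    "doc3": {"post_date": "2019-08-23", "title": "my third article",
             "content": "this is my third article in this website", "author_id": 11400},
}


def full_text_terms(value):
    """Full-text style analysis: lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", str(value).lower())


all_index = defaultdict(set)   # term -> doc ids, built from all fields combined
date_index = defaultdict(set)  # whole date string -> doc ids (exact value)

for doc_id, doc in docs.items():
    combined = " ".join(str(value) for value in doc.values())
    for term in full_text_terms(combined):
        all_index[term].add(doc_id)
    date_index[doc["post_date"]].add(doc_id)


def search_all(query):
    """Analyze the query like the indexed text and OR the matching terms."""
    hits = set()
    for term in full_text_terms(query):
        hits |= all_index.get(term, set())
    return hits


def search_post_date(query):
    """Exact-value lookup: the query must equal the whole indexed value."""
    return set(date_index.get(query, set()))


print(len(search_all("2019")), len(search_all("2019-08-21")))
print(len(search_post_date("2019-08-21")), len(search_post_date("2019")))
```

The four calls mirror the four searches from the start of the article: 3, 3, 1, and 0 matching documents.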
Testing an analyzer
Syntax:
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

{
  "tokens": [
    { "token": "text", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 },
    { "token": "to", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "analyze", "start_offset": 8, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 }
  ]
}
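To show what the fields in that response mean, the following Python sketch produces the same structure for simple inputs. It is an approximation, not the standard analyzer: it splits on word characters with a regular expression, lowercases each token, and always reports the type as <ALPHANUM>, which is an assumption that holds for this sample text only.

```python
import re


def analyze(text):
    """Return tokens with character offsets and positions, like the _analyze output."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),   # normalized form stored in the index
            "start_offset": match.start(),    # index of the first character in the input
            "end_offset": match.end(),        # index just past the last character
            "type": "<ALPHANUM>",             # assumption: fixed type for this sketch
            "position": position,             # ordinal position of the token
        })
    return {"tokens": tokens}


response = analyze("Text to analyze")
for token in response["tokens"]:
    print(token)
```

The offsets refer to the original text, which is what lets features such as highlighting locate a matched term in the source string.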
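A further summary of mapping
- When data is inserted directly into ES, ES creates the index automatically, together with the type and its mapping
- The mapping automatically defines the data type of each field
- Depending on the data type (for example text versus date), a field is treated either as exact value or as full text
- When the inverted index is built, an exact value is indexed whole, as a single term; full text goes through analysis, that is tokenization and normalization (tense conversion, synonym conversion, case conversion), before it is indexed
- At search time, whether a field is exact value or full text also determines how it is searched, and that behavior mirrors how the field was indexed: a query on an exact value field is matched against the whole value, while a query on a full text field is tokenized and normalized before it is looked up in the inverted index
- You can let ES dynamic mapping create the mapping automatically, including the data types; or you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, analyzer, and so on
In essence, a mapping is the metadata of a type in an index: it determines the data types, how the inverted index is built, and how searches behave.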
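Core mapping data types and dynamic mapping
- Core data types
text (string in older versions): string type
byte: byte type
short: short integer type
integer: integer type
long: long integer type
float: floating-point type
boolean: boolean type
date: date type
There are also more complex types, such as arrays and object; underneath, they are flattened into fields of the core types above.
- dynamic mapping
true or false -> boolean
123 -> long
123.45 -> float
2017-01-01 -> date
"hello world" -> text (string in older versions)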
- Viewing a mapping
Syntax:
GET /{index}/_mapping
GET /{index}/_mapping/{type}
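The dynamic mapping rules listed above can be sketched as a small type-inference function. This is a simplified illustration, not the ES implementation: the function names are hypothetical, date detection is reduced to a single yyyy-MM-dd pattern, and the keyword sub-field that ES adds to dynamically mapped strings is left out.

```python
import re

# Assumption: only the yyyy-MM-dd form is detected as a date in this sketch.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def infer_field(value):
    """Guess a field mapping from a single JSON value."""
    if isinstance(value, dict):
        # Inner objects get their own nested properties.
        return {"properties": {key: infer_field(item) for key, item in value.items()}}
    if isinstance(value, bool):  # checked before int, because bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        return {"type": "date" if DATE_PATTERN.fullmatch(value) else "text"}
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(doc):
    """Build a mapping for a whole document."""
    return infer_field(doc)


mapping = infer_mapping({
    "post_date": "2019-08-21",
    "title": "my first article",
    "author_id": 11400,
    "published": True,
    "score": 123.45,
})
print(mapping)
```

Because the type is guessed from the first value seen, a field that should be a keyword or a different numeric type is a common reason to define the mapping manually instead.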
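Creating and modifying a mapping manually, and controlling whether a string field is analyzed
Note: a mapping can only be defined manually when the index is created, or extended by adding mappings for new fields; the mapping of an existing field cannot be updated.
- "analyzer": "standard": the field is analyzed (tokenized) with the standard analyzer
- date: date type
- keyword: not analyzed; the whole value is indexed as a single term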
# Create the index
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": { "type": "long" },
      "title": { "type": "text", "analyzer": "standard" },
      "content": { "type": "text" },
      "post_date": { "type": "date" },
      "publisher_id": { "type": "keyword" }
    }
  }
}

# Try to modify the mapping of an existing field (rejected: the index already exists)
PUT /website
{
  "mappings": {
    "properties": {
      "author_id": { "type": "text" }
    }
  }
}

{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
    "index_uuid": "5xLohnJITHqCwRYInmBFmA",
    "index": "website"
  },
  "status": 400
}

# Add a new field to the mapping
PUT /website/_mapping
{
  "properties": {
    "new_field": { "type": "text" }
  }
}

{
  "acknowledged": true
}
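The rule illustrated above (new fields may be added, existing fields may not be redefined) can be modeled with a short Python sketch. The put_mapping function and its error message are hypothetical and only mimic the rule; the sketch is also stricter than ES, which does allow a few parameters of an existing field to be changed.

```python
def put_mapping(existing, new_fields):
    """Return the merged properties, rejecting any change to an existing field."""
    merged = dict(existing)
    for name, definition in new_fields.items():
        if name in existing and existing[name] != definition:
            # Changing the definition would invalidate data already indexed.
            raise ValueError(f"cannot change mapping of existing field [{name}]")
        merged[name] = definition
    return merged


website = {
    "author_id": {"type": "long"},
    "title": {"type": "text", "analyzer": "standard"},
    "content": {"type": "text"},
    "post_date": {"type": "date"},
    "publisher_id": {"type": "keyword"},
}

# Adding a new field succeeds.
updated = put_mapping(website, {"new_field": {"type": "text"}})
print(sorted(updated))

# Redefining an existing field is rejected.
try:
    put_mapping(website, {"author_id": {"type": "text"}})
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```

The usual way around this restriction is to create a new index with the desired mapping and reindex the data into it.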
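Complex mapping types and the underlying structure of object data
- multivalue field
{ "tags": ["tag1", "tag2"] }
A multivalue field is indexed the same way as a single string value; the values in one array must all be of the same data type.
- empty field
null, [], [null]: all three are treated as a field with no value
- object field
Sample data: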
PUT /company/employee/1
{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}
View the mapping:
GET /company/_mapping/employee

{
  "company": {
    "mappings": {
      "employee": {
        "properties": {
          "address": {
            "properties": {
              "city": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              },
              "country": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              },
              "province": {
                "type": "text",
                "fields": {
                  "keyword": { "type": "keyword", "ignore_above": 256 }
                }
              }
            }
          },
          "age": { "type": "long" },
          "join_date": { "type": "date" },
          "name": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword", "ignore_above": 256 }
            }
          }
        }
      }
    }
  }
}
How an object field is stored underneath
{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}
↓↓↓↓
{
  "name": [jack],
  "age": [27],
  "join_date": [2017-01-01],
  "address.country": [china],
  "address.province": [guangdong],
  "address.city": [guangzhou]
}
An array of objects is flattened in the same way:
{
  "authors": [
    { "age": 26, "name": "Jack White" },
    { "age": 55, "name": "Tom Jones" },
    { "age": 39, "name": "Kitty Smith" }
  ]
}
↓↓↓↓
{
  "authors.age": [26, 55, 39],
  "authors.name": [jack, white, tom, jones, kitty, smith]
}
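The flattening shown above can be sketched as a recursive Python function. This is an illustration of the idea only: the flatten function is hypothetical, and it keeps string values whole, whereas ES would additionally analyze text values into terms such as jack and white.

```python
def flatten(doc, prefix=""):
    """Flatten nested objects into dotted field paths, each holding a list of values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        # Treat a single value as a one-element array so both cases share one code path.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


employee = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}
book = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}
print(flatten(employee))
print(flatten(book))
```

Note that after flattening, the link between each author's age and name is lost: the values end up in two independent lists.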