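Background for later use: how to disable search, sorting, and similar operations on certain fields of an index.
When learning Elasticsearch, we frequently run into the following concepts:
- Inverted index
- doc_values
- _source
What does each of these concepts mean? What is it for? How do we configure it? Only once we understand them well can we use them correctly.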
Inverted index
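The inverted index is the core data structure of Elasticsearch and of any other system that supports full-text search. An inverted index is similar to the index you see at the end of a book. It maps the terms that appear in documents to the documents that contain them.
For example, you can build an inverted index from the following strings:
Elasticsearch builds a data structure from the three indexed documents. The following data structure is called an inverted index: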
Term        Frequency   Documents (postings)
choice      1           3
day         1           2
is          3           1,2,3
it          1           1
last        1           2
of          1           2
sunday      2           1,2
the         3           2,3
tomorrow    1           1
week        1           2
yours       1           3
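Here, "inverted" means that we look up document ids by term. This is the reverse of the usual direction, in which we look up terms by document id.
Note the following points:
- Documents are broken into terms after punctuation is removed and the text is lowercased.
- Terms are sorted alphabetically.
- The "Frequency" column records how many times the term appears across the entire document set.
- The third column records the documents in which the term was found. It may also contain the exact positions (offsets within the document) at which the term was found.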
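The points above can be sketched in a few lines of Python. This is only a toy illustration, not how Lucene stores its index. The three sample strings are an assumption: they are not shown in this article and were reconstructed so that they reproduce the table, and the names tokenize and build_inverted_index are made up for the example.

```python
import re
from collections import defaultdict

# Assumed sample strings, reconstructed to match the table
# (the original strings are not shown in this article).
docs = {
    1: "It is Sunday tomorrow.",
    2: "Sunday is the last day of the week.",
    3: "The choice is yours.",
}

def tokenize(text):
    # Lowercase, strip punctuation, then split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_inverted_index(docs):
    frequency = defaultdict(int)  # term -> total number of occurrences
    postings = defaultdict(set)   # term -> ids of documents containing it
    for doc_id, text in docs.items():
        for term in tokenize(text):
            frequency[term] += 1
            postings[term].add(doc_id)
    # Terms are sorted alphabetically, as in the table.
    return {t: (frequency[t], sorted(postings[t])) for t in sorted(postings)}

index = build_inverted_index(docs)
for term, (freq, doc_ids) in index.items():
    print(f"{term:<12}{freq:<4}{','.join(map(str, doc_ids))}")
```

Each printed row holds a term, its total frequency, and the ids of the documents that contain it.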
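When searching for a term, finding the documents in which it appears is very fast. If the user searches for the term "sunday", looking up sunday in the "Term" column is quick because the terms in the index are sorted. Even with millions of terms, a sorted term list can be searched quickly.
Next, consider a case where the user searches for two words, for example last sunday. The inverted index can be used to look up the occurrences of last and sunday separately; document 2 contains both terms, so it is a better match than document 1, which contains only one of them.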
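The two-word lookup can be sketched as follows. This is a simplified illustration that ranks documents by the number of query terms they contain; Elasticsearch's real relevance scoring is more elaborate. The postings are copied from the last and sunday rows of the table, and the function name search is made up for the example.

```python
# Postings for the two query terms, taken from the table.
postings = {
    "last": {2},
    "sunday": {1, 2},
}

def search(query, postings):
    # Count, per document, how many of the query terms it contains.
    matches = {}
    for term in query.lower().split():
        for doc_id in postings.get(term, set()):
            matches[doc_id] = matches.get(doc_id, 0) + 1
    # Documents matching more terms come first; ties break on document id.
    return sorted(matches.items(), key=lambda item: (-item[1], item[0]))

ranked = search("last sunday", postings)
print(ranked)  # [(2, 2), (1, 1)]: document 2 matches both terms
```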
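The inverted index is the foundation of fast search. Likewise, it is easy to find out how many times a term occurs in the index. This is a simple count aggregation. Of course, Elasticsearch builds many innovations on top of the simple inverted index explained here. It serves both search and analytics.
By default, Elasticsearch builds an inverted index on every field in a document, pointing to the Elasticsearch documents that contain the field. In other words, each Lucene index inside Elasticsearch has a place where this inverted index is stored.
In Kibana, we create the following document: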
PUT twitter/_doc/1
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
Once this document has been created, Elasticsearch has already built the corresponding inverted index for us to search, for example:
GET twitter/_search
{
"query": {
"match": {
"user": "張三"
}
}
}
We get the following search result:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642,
"_source" : {
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name" : {
"firstname" : "三",
"surname" : "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
}
]
}
}
If we do not want a field to be searchable, that is, we do not want an inverted index built for it, we can do the following:
DELETE twitter
PUT twitter
{
"mappings": {
"properties": {
"city": {
"type": "keyword",
"ignore_above": 256
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"country": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"location": {
"properties": {
"lat": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lon": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"properties": {
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"surname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"province": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"uid": {
"type": "long"
},
"user": {
"type": "object",
"enabled": false
}
}
}
}
PUT twitter/_doc/1
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
Above, we changed the user field through the mapping:
"user": {
"type": "object",
"enabled": false
}
This means the field will not be indexed. If we search on this field, we get no results:
GET twitter/_search
{
"query": {
"match": {
"user": "張三"
}
}
}
The search result is:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
Clearly there are no results. But if we fetch the document directly:
GET twitter/_doc/1
The result is:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name" : {
"firstname" : "三",
"surname" : "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
}
Clearly the user value is still stored in _source. It simply cannot be searched.
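The difference between "stored in the source" and "indexed for search" can be modeled with a small toy class. This is only a sketch of the idea, not Elasticsearch's implementation: ToyIndex and its methods are hypothetical, and splitting text into single characters is a crude stand-in for real analysis.

```python
class ToyIndex:
    """Toy model: always keeps the full source, indexes only enabled fields."""

    def __init__(self, disabled_fields=()):
        self.disabled = set(disabled_fields)
        self.sources = {}   # doc id -> original document
        self.postings = {}  # (field, token) -> set of doc ids

    def put(self, doc_id, doc):
        self.sources[doc_id] = doc  # the whole document is always stored
        for field, value in doc.items():
            if field in self.disabled or not isinstance(value, str):
                continue  # disabled fields never reach the inverted index
            for token in value:  # crude analysis: one token per character
                self.postings.setdefault((field, token), set()).add(doc_id)

    def match(self, field, text):
        # Ids of documents whose field contains any character of `text`.
        hits = set()
        for token in text:
            hits |= self.postings.get((field, token), set())
        return sorted(hits)

    def get(self, doc_id):
        return self.sources[doc_id]

index = ToyIndex(disabled_fields=["user"])
index.put("1", {"user": "雙榆樹-張三", "city": "北京"})
print(index.match("user", "張三"))  # no hits: the field was never indexed
print(index.match("city", "北京"))  # the enabled field is still searchable
print(index.get("1")["user"])       # the value is still returned from the source
```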
If we do not want any part of our documents to be searchable, we can even use the following approach:
DELETE twitter
PUT twitter
{
"mappings": {
"enabled": false
}
}
The twitter index will then build no inverted index at all. Now we run the following commands:
PUT twitter/_doc/1
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
GET twitter/_search
{
"query": {
"match": {
"city": "北京"
}
}
}
Running the commands above returns no search results. For further reading, see "Mapping parameters: enabled" (https://www.elastic.co/guide/en/elasticsearch/reference/current/enabled.html).
Source
In Elasticsearch, every field of every document is normally stored in the part of the shard that holds the source, for example:
PUT twitter/_doc/2
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
Here we created a document with id 2. We can retrieve all of its stored information with the following command:
GET twitter/_doc/2
It returns:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "2",
"_version" : 1,
"_seq_no" : 1,
"_primary_term" : 1,
"found" : true,
"_source" : {
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name" : {
"firstname" : "三",
"surname" : "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
}
In the _source above we can see all the fields that Elasticsearch stored for us. If we do not want to store any fields, we can use the following setting:
DELETE twitter
PUT twitter
{
"mappings": {
"_source": {
"enabled": false
}
}
}
Then we create a document with id 1 using the following command:
PUT twitter/_doc/1
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
As before, let's fetch this document:
GET twitter/_doc/1
The result is:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true
}
The document was clearly found, but we cannot see any source. Can we still search for this document? Try the following command:
GET twitter/_search
{
"query": {
"match": {
"city": "北京"
}
}
}
The result is:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.5753642
}
]
}
}
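Clearly the document with id 1 can be found by search, which means it has an intact inverted index for us to query, even though it has no _source.
So how do we store only the fields we want? This is useful when we want to save storage space and keep only the fields we need in the source. We can use the following setting: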
DELETE twitter
PUT twitter
{
"mappings": {
"_source": {
"includes": [
"*.lat",
"address",
"name.*"
],
"excludes": [
"name.surname"
]
}
}
}
Above, we use includes to list the fields we want and excludes to drop the ones we do not need. Let's index the following document:
PUT twitter/_doc/1
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
We fetch it with the following command:
GET twitter/_doc/1
The result is:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"name" : {
"firstname" : "三"
},
"location" : {
"lat" : "39.970718"
}
}
}
Clearly only a few fields were stored. In this way we can choose which fields to store.
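To see why exactly these three fields survive, the includes and excludes patterns can be approximated in Python. This is a simplified sketch that matches full dotted field paths against wildcard patterns using the standard fnmatch module; it is not Elasticsearch's actual filtering code, and the helper names flatten and filter_source are made up for the example.

```python
from fnmatch import fnmatchcase

def flatten(doc, prefix=""):
    # Turn nested objects into dotted paths; lists are kept as leaf values.
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

def filter_source(doc, includes, excludes):
    kept = {}
    for path, value in flatten(doc):
        included = any(fnmatchcase(path, p) for p in includes)
        excluded = any(fnmatchcase(path, p) for p in excludes)
        if included and not excluded:
            # Rebuild the nested structure for the surviving path.
            node = kept
            *parents, leaf = path.split(".")
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
    return kept

doc = {
    "user": "雙榆樹-張三",
    "city": "北京",
    "name": {"firstname": "三", "surname": "張"},
    "address": ["中國北京市海淀區", "中關村29號"],
    "location": {"lat": "39.970718", "lon": "116.325747"},
}
stored = filter_source(
    doc,
    includes=["*.lat", "address", "name.*"],
    excludes=["name.surname"],
)
print(stored)
```

The pattern name.* selects both name.firstname and name.surname, and the excludes entry then removes name.surname, which is why only firstname is kept.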
In practice, when fetching a document we can also choose which fields to display, even when many fields are stored in the source:
GET twitter/_doc/1?_source=name,location
Here we only want to display the fields related to name and location, so the result is:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : {
"firstname" : "三"
},
"location" : {
"lat" : "39.970718"
}
}
}
For further reading, see the documentation "Mapping meta-field: _source"
(https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html)
Doc_values
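By default, most fields are indexed, which makes them searchable. The inverted index allows a query to look up the search term in a unique sorted list of terms and, from there, immediately access the list of documents that contain the term.
Sorting, aggregations, and access to field values in scripts require a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms it has in a field.
Doc values are an on-disk data structure, built at document index time, that makes this data access pattern possible. They store the same values as _source, but in a column-oriented fashion that is far more efficient for sorting and aggregations. Doc values are supported on almost all field types, with the notable exception of analyzed string (text) fields.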
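The column-oriented access pattern can be illustrated with a toy model in Python. This is a sketch of the idea only, not the on-disk format Lucene uses. Documents 2 and 3 are hypothetical, and the helper names build_doc_values, terms_agg, and sort_by are made up for the example.

```python
from collections import Counter

# Document 1 follows this article's example; 2 and 3 are hypothetical.
docs = {
    1: {"city": "北京", "age": 20},
    2: {"city": "上海", "age": 25},
    3: {"city": "北京", "age": 30},
}

def build_doc_values(docs, disabled=()):
    # One column per field: field -> {doc id -> value}.
    columns = {}
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            if field not in disabled:
                columns.setdefault(field, {})[doc_id] = value
    return columns

def terms_agg(columns, field):
    # Count documents per distinct value by scanning a single column.
    if field not in columns:
        raise ValueError(f"no doc values for [{field}]")
    return Counter(columns[field].values())

def sort_by(columns, field):
    # Order document ids by the value each document has in the column.
    if field not in columns:
        raise ValueError(f"no doc values for [{field}]")
    return sorted(columns[field], key=columns[field].get)

columns = build_doc_values(docs)
print(terms_agg(columns, "city"))  # two documents for 北京, one for 上海
print(sort_by(columns, "age"))     # [1, 2, 3]

# With the column disabled, the same aggregation can no longer be served.
no_city = build_doc_values(docs, disabled={"city"})
try:
    terms_agg(no_city, "city")
except ValueError as err:
    print(err)
```

Both operations read one column from document to value, which is the opposite direction of the term-to-documents lookup in an inverted index.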
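All fields that support doc values have them enabled by default. If you are sure you do not need to sort or aggregate on a field, or access its value from a script, you can disable doc values to save disk space.
For example, we can make the city field unavailable for sorting and aggregation as follows: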
DELETE twitter
PUT twitter
{
"mappings": {
"properties": {
"city": {
"type": "keyword",
"doc_values": false,
"ignore_above": 256
},
"address": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"country": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"location": {
"properties": {
"lat": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lon": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"properties": {
"firstname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"surname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
"province": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"uid": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Above, we set doc_values to false for the city field.
"city": {
"type": "keyword",
"doc_values": false,
"ignore_above": 256
},
We create a document as follows:
PUT twitter/_doc/1
{
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name": {
"firstname": "三",
"surname": "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
Then, when we run an aggregation as follows:
GET twitter/_search
{
"size": 0,
"aggs": {
"city_bucket": {
"terms": {
"field": "city",
"size": 10
}
}
}
}
In Kibana we see:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "twitter",
"node": "IyyZ30-hRi2rnOpfx4n1-A",
"reason": {
"type": "illegal_argument_exception",
"reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
}
}
],
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead.",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Can't load fielddata on [city] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
}
}
},
"status": 400
}
Clearly the operation failed. Although we cannot aggregate or sort on the field, we can still retrieve its source with the following command:
GET twitter/_doc/1
The result is:
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"user" : "雙榆樹-張三",
"message" : "今兒天氣不錯啊,出去轉轉去",
"uid" : 2,
"age" : 20,
"city" : "北京",
"province" : "北京",
"country" : "中國",
"name" : {
"firstname" : "三",
"surname" : "張"
},
"address" : [
"中國北京市海淀區",
"中關村29號"
],
"location" : {
"lat" : "39.970718",
"lon" : "116.325747"
}
}
}
For further reading, see "Mapping parameters: doc_values" (https://www.elastic.co/guide/en/elasticsearch/reference/7.4/doc-values.html).