Problems I ran into after starting with ES

This article is a repost (see the original). 2022-01-19 16:10 1721 elastic search

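Questions from the group chat

A Hive table holds about 1.9 billion rows (with a small number of duplicate ids). After writing it to ES with id as the _id, the document count in ES is tens of thousands higher than the Hive row count after deduplicating by id.

Try a force merge and then count again. The number you are seeing may include several versions of the same document (older versions stay in their segments as deleted documents until they are merged away).

ES tidbits

Dynamic mapping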

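Everyone knows that dynamic mapping adds fields to the mapping automatically. But even if you manually declare a field's mapping when creating the index, you can still end up with a second, dynamically mapped field under the snake_case name. Elasticsearch does not rename fields by itself, so the likely cause is that the serializer writes the documents with snake_case names (brand_id) while the mapping declares camelCase (brandId). For example:

"brand_id" : { // e.g. the mapping was declared manually as brandId, yet both fields exist afterwards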
 	"type" : "long"
 },
"brandId" : { 
 	"type" : "keyword"
 }

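So it is best to use the snake_case names consistently, both in the mapping and in the data being written.

A helpful read for understanding ES compound queries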

https://cloud.tencent.com/developer/article/1689238

DSL for filtering shop products by parameters

Full-text search combined with filtering

GET /es_idx_item/_search/
{
  "from": 0,
  "size": 20, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name_en": "Shoes"
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must":[
                {"term":{"category_id":"7038"}},
                {"term":{"brand_id":"123534847"}}
              ]
          }
        }
      ]
    }
  }
}

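What is the difference between CreateIndexRequest and IndexRequest?

The former creates and configures an index. The latter indexes a document into an index so that the data becomes searchable.

How do you find out which branch a git branch was created from?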

git reflog show branch_name

For multi-language search, how should the query be set up?

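In ES 7.15, mixed-language text is tokenized without any analyzer being set in the mapping. This is most likely the default standard analyzer, which splits text on Unicode word boundaries, rather than any per-language detection on the server.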

Logstash: specifying the config file at startup

bin/logstash -f conf/conf_file

https://doc.yonyoucloud.com/doc/logstash-best-practice-cn/get_start/hello_world.html

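fastjson naming strategies for serialization

// SnakeCase  -> words joined with _
// CamelCase  -> camelCase
// KebabCase  -> words joined with -
// PascalCase -> first letter of every word capitalized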

https://github.com/alibaba/fastjson/wiki/PropertyNamingStrategy_cn
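To make the SnakeCase rule concrete, here is a minimal sketch of a camelCase to snake_case conversion in plain Java. It is not fastjson's implementation: the class NamingSketch and the method camelToSnake are hypothetical names used only for illustration, and fastjson may treat edge cases such as consecutive capitals differently.

```java
final class NamingSketch {

    // Hypothetical helper (not part of fastjson): inserts '_' before each
    // upper-case letter and lower-cases it, e.g. brandId -> brand_id.
    static String camelToSnake(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 4);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c)) {
                if (i > 0) {
                    sb.append('_');
                }
                sb.append(Character.toLowerCase(c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(camelToSnake("brandId"));   // prints brand_id
        System.out.println(camelToSnake("spuNameEn")); // prints spu_name_en
    }
}
```

If the mapping is declared with the converted names, the documents and the mapping line up and no extra dynamic field appears.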

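ES confusions

There are many query types and it is hard to know which one to pick. With nested conditions, when should you combine with bool and when is a single query enough?

First: full-text search. Use it for full-text scenarios, for example a big-data system. Example: find content containing "牙膏黑人" (a toothpaste brand query). The core query types include, but are not limited to, match, match_phrase, and query_string.

Second: exact matching. Use it for exact-match scenarios, such as looking up postal codes or phone numbers. The query types include, but are not limited to, terms, term, exists, range, and wildcard.

A single query of either kind cannot express combined conditions. For example: body = XXX, title = YYY, publish time between XXX and XXX, and the source is Xinhuanet. That calls for a compound query, which means bool. The bool syntax includes, but is not limited to, combinations of must, should, must_not, filter, and minimum_should_match.

!!!!! ES BulkProcessor data synchronization

Environment issues aside, failures are probably caused by a mismatch between the field types in the mapping and the field types in the source data.

Searching, updating, and creating docs can all go through an index alias. Creating an index cannot: you have to give the index name and then associate the alias with it.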

public void processorUpdateItem(List<Item> itemList) {
    final List<ItemIndexTemplate> templates = itemList.stream()
            .parallel()
            .map(templateConvert::convert).collect(Collectors.toList());

    templates.forEach(temp -> {
        final String source = JSONObject.toJSONString(temp, config);
        final IndexRequest indexRequest = new IndexRequest(itemIndexAlias)
                .id(String.valueOf(temp.getId()))
                .source(source, XContentType.JSON);
        processor.add(indexRequest);
    });

}

Elasticsearch exception [type=illegal_argument_exception, reason=no write index is defined for alias [item_index_alias]. The write index may be explicitly disabled using is_write_index=false or the alias points to multiple indices without one being designated as a write index]

In this case you need to designate a write index. When an alias is associated with multiple indices and none of them is marked as the write index, writes through the alias fail with the error above.

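ES data modeling

Fields with the same or similar meaning must share one field name and one field type
Model special fields separately
Index lifecycle management options
- 6.6+
  - ILM
- 7.9+
  - data stream
How many shards?
- The number of primary shards cannot be changed after the index is created
- Consider the data volume and the number of data nodes
refresh_interval
- Data is refreshed from the index buffer (on heap) to off-heap memory, forming a segment
Pagination
- search_after
  - paging through results one page at a time
- scroll
  - full export
Ingest pipeline preprocessing
- comparable to the filter stage in logstash

When an alias maps to multiple indices

ES automatically queries the shards of all those indices
When several indices point to the same alias, each index has its own shards (1 primary and 1 replica by default), and all of them are searched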

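What is the default mode for sort?

In the Java client no value is set by default; the mode is an enum. Note that sort uses SortMode (min, max, sum, avg, median), while ScoreMode is the separate enum used by nested queries. When mode is omitted, Elasticsearch uses min for ascending sorts and max for descending sorts.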

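Another day of being tortured by nested

How do you update a single object inside a nested field?

It is a non-question: just overwrite the whole document.

A document can be created through an alias, provided the alias was bound when the index was created. This works as-is only when the alias points to a single index. With multiple indices, the index to write to must be marked with is_write_index=true.

Still a lot to learn

pri: number of primary shards

rep: number of replicas

docs.count: number of documents

docs.deleted: number of deleted documents

store.size: total storage size of all shards, including replicas

pri.store.size: total size of all primary shards

Finally found the proof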

https://www.elastic.co/guide/en/elasticsearch/reference/7.6/ilm-rollover.html#ilm-rollover

To see the current index size, use the _cat indices API. The pri.store.size value shows the combined size of all primary shards.

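Spring constructor injection causing a circular dependency

Use @Autowired field injection instead.

Running an action after all Spring beans have been instantiated

To read data from the db, process it, and write it to ES, use SmartInitializingSingleton

For third-party pieces such as caches, @PostConstruct is enough

Querying nested fields together with fields on the parent document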

GET item_index_2021_12_10_102614/_search
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "spu_name_thai": {
              "query": "fruit"
            }
          }
        },
        {
          "match": {
            "spu_name_en": {
              "query": "fruit"
            }
          }
        },
        {
          "match": {
            "brand_name_thai": {
              "query": "fruit"
            }
          }
        },
        {
          "match": {
            "brand_name_en": {
              "query": "fruit"
            }
          }
        }
      ],
      "filter": [
        {
          "nested": {
            "query": {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "skus.active": {
                        "value": true
                      }
                    }
                  },
                  {
                    "term": {
                      "skus.visible": {
                        "value": true
                      }
                    }
                  }
                ],
                "adjust_pure_negative": true
              }
            },
            "path": "skus",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1
          }
        },
        {
          "term": {
            "is_gift": {
              "value": false
            }
          }
        },
        {
          "term": {
            "visible": {
              "value": true
            }
          }
        },
        {
          "term": {
            "active": {
              "value": true
            }
          }
        }
      ]
    }
  }
}

Querying a specific field inside nested documents

POST /item_index2021-12-03-182031/_search
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "query": {
              "term": {
                "skus.id": {
                  "value": "382620383",
                  "boost": 1
                }
              }
            },
            "path": "skus",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1
          }
        }
      ]
    }
  }
}

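How can the program better understand what the user wants?

ES from starts at 0

ES examples from practice

How do you find the most popular clothes of a given brand? (aggregation + filter)

Use a search with aggregations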

PUT /shirts/_doc/1?refresh
{
  "brand": "gucci",
  "color": "red",
  "model": "slim"
}
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggs": {
    "models": {
      "terms": { "field": "model" } 
    }
  }
}

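The post_filter filters the results in hits. It is applied after the aggregations have been computed, so the aggregations are not affected.

For example, keep only red items in hits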

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "brand": "gucci"
          }
        }
      ]
    }
  },
  "aggs": {
    "colors": {
      "terms": {
        "field": "color"
      }
    },
    "colors_red": {
      "filter": {
        "term": {
          "color": "red"
        }
      },
      "aggs": {
        "models": {
          "terms": {
            "field": "model"
          }
        }
      }
    }
  },
  "post_filter": {
    "term": {
      "color": "red"
    }
  }
}

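Rescoring (rescore) can improve search precision

It works by re-ordering the top documents returned by the query and post_filter. Rescoring runs independently on each shard, and the node that handles the search request then sorts the combined results.
If the rescore request is combined with a sort on a business field (anything other than _score), an exception is thrown.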

When paging, use a fixed "step" (in the official docs: do not change window_size as you step through the pages).

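By default, the score of the original query is combined with the score of the rescore query to produce the final _score. The two weights are controlled by query_weight and rescore_query_weight respectively.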
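The combination can be written out as simple arithmetic. This is a minimal sketch assuming the default score_mode (total), where the two weighted scores are added; the class and method names are hypothetical and nothing here calls Elasticsearch.

```java
final class RescoreSketch {

    // Sketch of the default "total" score_mode:
    // final = query_weight * original + rescore_query_weight * rescore.
    // Both weights default to 1 in Elasticsearch.
    static double combinedScore(double originalScore, double rescoreScore,
                                double queryWeight, double rescoreQueryWeight) {
        return queryWeight * originalScore + rescoreQueryWeight * rescoreScore;
    }

    public static void main(String[] args) {
        // original score 2.0, rescore score 3.0, weights 0.5 and 2.0
        System.out.println(combinedScore(2.0, 3.0, 0.5, 2.0)); // prints 7.0
    }
}
```

A document inside the rescore window that does not match the rescore query keeps only its weighted original score.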

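A request can contain several rescores, executed in sequence. Only the first one is combined directly with the original query score; each later rescore sees the scores produced by the previous rescore and re-orders based on them.

Specify analyzer and search_analyzer when creating a template

So that search terms are tokenized accurately

Why doesn't the best-matching result get the highest score?

Deduplicating a list with stream + TreeSet (sorted)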

final ArrayList<Integer> list = Stream.of(3, 3, 6, 3, 2, 4, 5, 6, 9).collect(Collectors.collectingAndThen(Collectors.toCollection(TreeSet::new), ArrayList::new));

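The line above deduplicates and sorts in ascending order. The version below deduplicates and sorts in descending order, because the comparator is reversed: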

public static void main(String[] args) {
    final ArrayList<Integer> list = Stream.of(3, 3, 6, 3, 2, 4, 5, 6, 9).collect(Collectors.collectingAndThen(Collectors.toCollection(() -> new TreeSet<>(Comparator.comparingInt(x -> (int) x).reversed())), ArrayList::new));
    System.out.println(list);
}

Finding the install directory of software running in docker

docker exec -it elasticsearch /bin/bash

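ES pagination parameters

The from parameter is the row offset, not the page number

For example, the first page is 0,10, the second page is 10,10, and the third page is 20,10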
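A minimal sketch of the conversion from a 1-based page number to the from offset. The class and method names are hypothetical, and it assumes the UI counts pages from 1.

```java
final class PagingSketch {

    // Converts a 1-based page number and a page size into the "from" offset.
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        // multiplyExact throws on int overflow instead of wrapping silently
        return Math.multiplyExact(page - 1, size);
    }

    public static void main(String[] args) {
        System.out.println(fromOffset(1, 10)); // prints 0
        System.out.println(fromOffset(2, 10)); // prints 10
        System.out.println(fromOffset(3, 10)); // prints 20
    }
}
```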

analyzer icu_analyzer has not been configured in mappings

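I meant to connect to the test environment (icu installed) but ended up connected to local (no icu)

Sorting on nested fields

Always specify the nested path (nested_path in older versions; the nested object with a path in 7.x)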

Unrecognized SSL message, plaintext connection?

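The SSL message cannot be recognized because the client speaks https to an endpoint that answers in plaintext. Either don't use https, or use https and make sure certificates are configured.

Freezing an index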

POST sale_index_test_bulk/_freeze
POST sale_index_test_bulk/_unfreeze

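A frozen index cannot be written to and is skipped by searches by default (it is only searched when the request passes ignore_throttled=false). After unfreezing, it can be written to and searched normally.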

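This is handy for testing the search feature when working on search degradation

Vector search (as in Taobao's search-by-photo)

Alibaba's knn has poor write performance and high memory usage. Other options: JD's vearch, scann/faiss [these are libraries, you have to wrap them in a service yourself], hnsw

How to designate the write index (switching the write index behind an alias)

Set is_write_index=true when creating the index and binding the alias (the default is false). Only one index under the same alias may have is_write_index=true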

# Create the alias for the indices
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_index", // the previous write index
        "alias": "my_alias",
        "is_write_index":false
      }
    },
    {
      "add": {
        "index": "test_index01",
        "alias": "my_alias",
        "is_write_index":true
      }
    }
  ]
}

How to migrate data

Create a new index with the modified mapping

Migrate the data

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001",
    "op_type": "create" // only creates docs that do not yet exist in the dest index
  }
}

After the migration, switch the write index behind the alias

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-000001",
        "alias": "my-index-alias",
        "is_write_index": false // better to state it explicitly
      }
    },
    {
      "add": {
        "index": "my-new-index-000001",
        "alias": "my-index-alias",
        "is_write_index": true // required
      }
    }
  ]
}

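Warn: before migrating, make sure no data is in the middle of being modified

Warn: after migrating, _freeze the original index to keep the data from getting mixed up

How to handle data that may get lost during the migration?

A compensation mechanism

For example, from the start of the migration, keep a separate index (or a cache, with an expiry based on the estimated duration) X that records itemId values, marking the data that has to be recomputed

When the migration is done, switch the alias, stop recording itemIds, then read that data again and rewrite it to ES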
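The compensation idea can be sketched in plain Java. This is only an illustration under assumptions: the in-memory set stands in for index or cache X, and the class MigrationCompensationSketch, its methods, and the reindexItem callback are hypothetical names. A real setup would also have to deal with process restarts and with changes that arrive while the replay is running.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class MigrationCompensationSketch {

    // Stand-in for "index or cache X": ids of items changed during the migration.
    private final Set<Long> dirtyItemIds = ConcurrentHashMap.newKeySet();
    private volatile boolean recording;

    // Call right before the reindex starts.
    void startRecording() {
        recording = true;
    }

    // Call from the normal write path whenever an item changes.
    void onItemChanged(long itemId) {
        if (recording) {
            dirtyItemIds.add(itemId);
        }
    }

    // Call after the alias switch: stop recording, then rewrite every recorded item.
    // reindexItem is a hypothetical callback that reloads the item and writes it to ES.
    int stopAndReplay(Consumer<Long> reindexItem) {
        recording = false;
        int replayed = 0;
        for (Long id : dirtyItemIds) {
            reindexItem.accept(id);
            replayed++;
        }
        dirtyItemIds.clear();
        return replayed;
    }
}
```

Using a set means an item changed several times during the migration is rewritten only once.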

Rollover in practice

# 1. Create a date-math index and bind the alias as its write index
PUT %3Cmy-index-%7Bnow%2Fd%7D-000001%3E
{
  "aliases": {
    "my-alias1": {
      "is_write_index": true
    }
  }
}


# 2. Bulk-load data
PUT my-alias1/_bulk
{"index":{"_id":1}}
{"title":"testing 01"}
{"index":{"_id":2}}
{"title":"testing 02"}
{"index":{"_id":3}}
{"title":"testing 03"}
{"index":{"_id":4}}
{"title":"testing 04"}
{"index":{"_id":5}}
{"title":"testing 05"}
 
# 3. Roll over the index
POST my-alias1/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 5,
    "max_primary_shard_size": "50gb"
  }
}
 
GET my-alias1/_count
 
# 4. With the rollover conditions met, write again after the rollover
PUT my-alias1/_bulk
{"index":{"_id":6}}
{"title":"testing 06"}
 
# 5. Search the data to verify that the rollover took effect
GET my-alias1/_search

GET my-index-2021.12.28-000001

How to replace special characters with spaces in ES

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "\\u005c=>\\u0020"
      ]
    }
  ],
  "text": "My license plate is/"
}

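In essence this is a native-to-ASCII conversion: the characters are written as Unicode escapes (\u005c is the backslash, \u0020 is the space).

Collected special-character mappings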

"char_filter": {
        "my_en_char_filter": {
          "type": "mapping",
          "mappings": [
            ".=>\\u0020",
            "/=>\\u0020",
            "#=>\\u0020",
            "(=>\\u0020",
            ")=>\\u0020",
            "_=>\\u0020",
            "==>\\u0020",
            "+=>\\u0020",
            "&=>\\u0020",
            "!=>\\u0020",
            "@=>\\u0020",
            "$=>\\u0020",
            "%=>\\u0020",
            "^=>\\u0020",
            "*=>\\u0020",
            "?=>\\u0020",
            ",=>\\u0020",
            "'=>\\u0020",
            "\"=>\\u0020",
            "[=>\\u0020",
            "]=>\\u0020",
            "{=>\\u0020",
            "}=>\\u0020",
            ":=>\\u0020",
            ";=>\\u0020",
            "\\u005c=>\\u0020",
            "|=>\\u0020"
          ]
        }
      }
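To check what such a mapping does to a string without a cluster at hand, the same replacement can be sketched in plain Java. This is an approximation of the char_filter above, not Elasticsearch code; the class and method names are hypothetical.

```java
final class CharMappingSketch {

    // The same characters as in the char_filter mappings above.
    private static final String SPECIAL = "./#()_=+&!@$%^*?,'\"[]{}:;\\|";

    // Replaces every listed special character with a single space.
    static String replaceSpecial(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(SPECIAL.indexOf(c) >= 0 ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceSpecial("nike_air/max")); // prints nike air max
    }
}
```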

Sending an asynchronous Reindex request from Java and handling the response

public Response<String> reindexAsync(String src, String dest) throws IOException {
    final ReindexRequest reindexRequest = new ReindexRequest().setSourceIndices(src).setDestIndex(dest).setRefresh(true);
    final Request request = RequestConverters.reindex(reindexRequest);
    final org.elasticsearch.client.Response response = lowerClient.performRequest(request);
    final InputStream inputStream = response.getEntity().getContent();

    if (inputStream != null) {
        final InputStreamReader inputStreamReader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
        final Gson gson = new Gson();
        final ReindexAsyncResponse asyncResponse = gson.fromJson(inputStreamReader, ReindexAsyncResponse.class);
        return Response.success(asyncResponse.getTask());
    }
    return Response.errorMsg(null, "cannot fetch response");
}

@Data
public class ReindexAsyncResponse {
    private String task;
}

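How to delete completed asynchronous Tasks in ES to save space?

Someone had the same question as me and asked on the official forum...
https://discuss.elastic.co/t/how-to-delete-a-task-in-elasticsearch-v7-14/283321

The final answer was to use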

DELETE .tasks/task/_doc/

"This is simply a document DELETE."

However, in my own tests the task was removed automatically once it completed. My version is 7.15

How do you find out which index has is_write_index=true in elasticsearch?

GET test_item_index_222_alias/_alias
{
  "test_item_index_666" : {
    "aliases" : {
      "test_item_index_222_alias" : {
        "is_write_index" : true
      }
    }
  },
  "test_item_index_222" : {
    "aliases" : {
      "test_item_index_222_alias" : {
        "is_write_index" : false
      }
    }
  }
}
// in Java code you probably have to iterate over the returned data to check
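A minimal sketch of that iteration. It assumes the alias response has already been parsed into a map from index name to its is_write_index flag; the parsing step depends on the client in use and is left out, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class WriteIndexSketch {

    // Returns the first index whose is_write_index flag is true, if any.
    // A null flag (is_write_index not set) is treated as "not the write index".
    static Optional<String> findWriteIndex(Map<String, Boolean> writeFlagByIndex) {
        return writeFlagByIndex.entrySet().stream()
                .filter(e -> Boolean.TRUE.equals(e.getValue()))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        flags.put("test_item_index_666", true);
        flags.put("test_item_index_222", false);
        System.out.println(findWriteIndex(flags).orElse("none")); // prints test_item_index_666
    }
}
```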
