ID auto-generation algorithm for ES bulk index writes

Reposted article. 2016-11-17 21:16. Tag: elasticsearch

How a bulk request is processed:

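1. Iterate over all the requests and pre-process each one. This mainly means resolving the routing (if the mapping defines one) and the timestamp (the current time is used when the request carries none). If no id is specified and action.bulk.action.allow_id_generation is set to true, a base64UUID is generated automatically as the id, and the request's opType is set to CREATE, because a document whose id is generated by ES is by default created rather than updated. (Note: annoyingly, in the latest ES code I pulled from GitHub, the id auto-generation path no longer sets opType, so it appears to follow the same logic as a request with an explicit id; see https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java.)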

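2. Create a shardId --> Operation map, then iterate over all the requests again to work out which shardId each request should be sent to. The shard is determined as follows: if the request has a routing value, that value is used as the hash input; otherwise the id is hashed. The hash function defaults to Murmur3, though it can be changed through the index.legacy.routing.hash.type setting. The target shard is then computed as: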

return MathUtils.mod(hash, indexMetaData.getNumberOfShards());

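That is, the hash modulo the total number of shards gives the shardId. The shardId becomes the map key, and the value is a collection of BulkItemRequest objects, each built from a request and its position in the iteration. (The position is kept because the bulk response reports the result of every individual request.) In short, all of this simply groups the requests by shard (for load balancing).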
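The grouping in step 2 can be sketched roughly as follows. This is a minimal illustration, not the Elasticsearch code: the Item class and the groupByShard and shardId helpers are hypothetical names, and String.hashCode() stands in for Murmur3, so the shard numbers it produces will not match a real cluster. Math.floorMod is used on the assumption that it plays the same role as MathUtils.mod, namely a modulo that never returns a negative value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShardGroupingSketch {

    /** Hypothetical stand-in for a bulk item: only the fields the grouping needs. */
    static final class Item {
        final int position;   // position of the request inside the bulk request
        final String id;
        final String routing; // may be null

        Item(int position, String id, String routing) {
            this.position = position;
            this.id = id;
            this.routing = routing;
        }
    }

    /** Stand-in hash for illustration only; ES defaults to Murmur3. */
    static int hash(String key) {
        return key.hashCode();
    }

    /** Hash the routing value if present, else the id, then take a non-negative modulo. */
    static int shardId(Item item, int numberOfShards) {
        String key = (item.routing != null) ? item.routing : item.id;
        return Math.floorMod(hash(key), numberOfShards);
    }

    /** Group items by target shard, keeping each item's original position. */
    static Map<Integer, List<Item>> groupByShard(List<Item> items, int numberOfShards) {
        Map<Integer, List<Item>> groups = new LinkedHashMap<>();
        for (Item item : items) {
            groups.computeIfAbsent(shardId(item, numberOfShards), k -> new ArrayList<>())
                  .add(item);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        String[] ids = {"a", "b", "c", "d", "e"};
        for (int i = 0; i < ids.length; i++) {
            items.add(new Item(i, ids[i], null));
        }
        groupByShard(items, 3).forEach((shard, group) -> {
            StringBuilder line = new StringBuilder("shard " + shard + ":");
            for (Item it : group) {
                line.append(" #").append(it.position).append('(').append(it.id).append(')');
            }
            System.out.println(line);
        });
    }
}
```

Keeping the position alongside each item is what lets the per-shard results be put back into the original request order when the bulk response is assembled.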

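3. Iterate over the map built above and create one bulkShardRequest per group, carrying the configured consistencyLevel and timeout. The primary shard is looked up in the cluster state: if the primary is on the local node the request is executed directly; otherwise it is forwarded to the node that holds that shard.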

The ID generation algorithm from step 1 above:

In ES 1.7.1 the code lives in the class org.elasticsearch.action.index.IndexRequest:

void org.elasticsearch.action.index.IndexRequest.process(MetaData metaData, @Nullable MappingMetaData mappingMd, boolean allowIdGeneration, String concreteIndex) throws ElasticsearchException
{
............
        // generate id if not already provided and id generation is allowed
        if (allowIdGeneration) {
            if (id == null) {
                id(Strings.base64UUID());
                // since we generate the id, change it to CREATE
                opType(IndexRequest.OpType.CREATE);
                autoGeneratedId = true;
            }
        }

............

IndexRequest org.elasticsearch.action.index.IndexRequest.id(String id)

Sets the id of the indexed document. If not set, will be automatically generated.
Parameters:
id

String org.elasticsearch.common.Strings.base64UUID()

Generates a time-based UUID (similar to Flake IDs), which is preferred when generating an ID to be indexed into a Lucene index as primary key. The id is opaque and the implementation is free to change at any time!

/** Generates a time-based UUID (similar to Flake IDs), which is preferred when generating an ID to be indexed into a Lucene index as
* primary key. The id is opaque and the implementation is free to change at any time! */
public static String base64UUID() {
    return TIME_UUID_GENERATOR.getBase64UUID();
}
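To give a feel for what a "time-based, Flake-like" id looks like, here is a minimal sketch. It is not the TIME_UUID_GENERATOR implementation, whose details are not shown in this article and may change at any time. The class name TimeBasedIdSketch, the 6-byte timestamp / 6-byte node / 3-byte sequence layout, and the random node identifier are all assumptions made for illustration.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicInteger;

public class TimeBasedIdSketch {

    private static final SecureRandom RANDOM = new SecureRandom();

    // Assumption: 6 random bytes stand in for a per-node identifier such as a MAC address.
    private static final byte[] NODE_ID = new byte[6];
    static {
        RANDOM.nextBytes(NODE_ID);
    }

    // Sequence counter started at a random value; only its low 3 bytes are used.
    private static final AtomicInteger SEQUENCE = new AtomicInteger(RANDOM.nextInt());

    /** Builds a 15-byte id (timestamp, node, sequence) and encodes it as URL-safe base64. */
    static String nextId() {
        long timestamp = System.currentTimeMillis();
        int sequence = SEQUENCE.incrementAndGet() & 0xFFFFFF;

        byte[] bytes = new byte[15];
        // Low 6 bytes of the timestamp, most significant byte first.
        for (int i = 0; i < 6; i++) {
            bytes[i] = (byte) (timestamp >>> (8 * (5 - i)));
        }
        // 6 bytes of node identifier.
        System.arraycopy(NODE_ID, 0, bytes, 6, 6);
        // 3 bytes of sequence number.
        bytes[12] = (byte) (sequence >>> 16);
        bytes[13] = (byte) (sequence >>> 8);
        bytes[14] = (byte) sequence;

        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(nextId());
        System.out.println(nextId());
    }
}
```

Because 15 bytes encode to exactly 20 base64 characters, the ids have a fixed length, and ids generated close together in time share a common prefix. This sketch does not handle a clock that moves backwards or a sequence that wraps within one millisecond, both of which a production generator would have to deal with.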

References:

https://discuss.elastic.co/t/generate-id/28536/2

https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing

https://github.com/elastic/elasticsearch/pull/7531/files shows how this changed across ES versions: ES originally used randomBase64UUID and later switched to a Flake-like ID for performance reasons.

http://xbib.org/elasticsearch/2.1.1/apidocs/org/elasticsearch/common/Strings.html

http://www.opscoder.info/es_indexprocess1.html has a detailed walkthrough of bulk insertion
