Kafka troubleshooting: "The provided member is not known in the current generation", sometimes accompanied by "i/o timeout" in the logs


Client log: The provided member is not known in the current generation

Also accompanied by: i/o timeout

Server log:

[2022-01-19 19:22:03,158] WARN [GroupCoordinator 0]: Sending empty assignment to member watermill-7d16c5e1-284d-45f4-a3e3-f21e306a480b of tws for generation 14 with no errors (kafka.coordinator.group.GroupCoordinator)
[2022-01-19 19:22:03,158] WARN [GroupCoordinator 0]: Sending empty assignment to member watermill-20467c49-0d37-417c-85c8-d703e4ca0172 of tws for generation 14 with no errors (kafka.coordinator.group.GroupCoordinator)


[2022-01-19 19:20:04,498] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in Stable state. Created a new member id watermill-9ce047a7-1145-4481-be91-953760cb601e for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:20:04,498] INFO [GroupCoordinator 0]: Preparing to rebalance group avast in state PreparingRebalance with old generation 48 (__consumer_offsets-23) (reason: Adding new member watermill-9ce047a7-1145-4481-be91-953760cb601e with group instance id None) (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:20:07,932] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-e1808013-a69e-426c-a21b-e08b9afc4bff for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:20:34,506] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-b0ba512a-ba9a-4de7-afb8-ded7564c2c86 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:20:37,946] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-df1a82b6-b2d7-4a1a-8255-828206492a78 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:04,501] INFO [GroupCoordinator 0]: Group avast removed dynamic members who haven't joined: Set(watermill-09d871f1-e9a7-4854-af0b-55c9ce439d11, watermill-be10473b-4021-4a66-9176-994887db1ef0) (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:04,501] INFO [GroupCoordinator 0]: Stabilized group avast generation 49 (__consumer_offsets-23) with 4 members (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:34,509] INFO [GroupCoordinator 0]: Preparing to rebalance group avast in state PreparingRebalance with old generation 49 (__consumer_offsets-23) (reason: Removing member watermill-b0ba512a-ba9a-4de7-afb8-ded7564c2c86 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:34,509] INFO [GroupCoordinator 0]: Member MemberMetadata(memberId=watermill-b0ba512a-ba9a-4de7-afb8-ded7564c2c86, groupInstanceId=None, clientId=watermill, clientHost=/192.168.102.158, sessionTimeoutMs=600000, rebalanceTimeoutMs=60000, supportedProtocols=List(roundrobin)) has left group avast through explicit `LeaveGroup` request (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:34,509] INFO [GroupCoordinator 0]: Member MemberMetadata(memberId=watermill-df1a82b6-b2d7-4a1a-8255-828206492a78, groupInstanceId=None, clientId=watermill, clientHost=/192.168.102.159, sessionTimeoutMs=600000, rebalanceTimeoutMs=60000, supportedProtocols=List(roundrobin)) has left group avast through explicit `LeaveGroup` request (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:34,517] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-5aa111c8-d7e3-43fb-827f-75be3f362fe1 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:21:34,522] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-d5f13e11-71c8-4b87-b76e-9dbf8c44cfb9 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:22:04,579] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-ac0b6540-57e4-4ba6-9d0b-bcd14a7b0f18 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:22:07,028] INFO [GroupCoordinator 0]: Dynamic Member with unknown member id joins group avast in PreparingRebalance state. Created a new member id watermill-86c882a8-b2ce-41c8-8a23-799d469ff8e6 for this member and add to the group. (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:22:34,509] INFO [GroupCoordinator 0]: Group avast removed dynamic members who haven't joined: Set(watermill-9ce047a7-1145-4481-be91-953760cb601e, watermill-e1808013-a69e-426c-a21b-e08b9afc4bff) (kafka.coordinator.group.GroupCoordinator)

[2022-01-19 19:22:34,509] INFO [GroupCoordinator 0]: Stabilized group avast generation 50 (__consumer_offsets-23) with 4 members (kafka.coordinator.group.GroupCoordinator)

 

Usage:

What it should normally be:

One client, one consumer group, one topic.

What it actually was:
One client, one consumer group, two topics (three partitions in total).

 

Possible cause 1:

https://github.com/Shopify/sarama/issues/1192

https://github.com/Shopify/sarama/blob/v1.19.0/consumer_group.go#L18

// Consume joins a cluster of consumers for a given list of topics and
// starts a blocking ConsumerGroupSession through the ConsumerGroupHandler.
//
// The life-cycle of a session is represented by the following steps:
//
// 1. The consumers join the group (as explained in https://kafka.apache.org/documentation/#intro_consumers)
// and is assigned their "fair share" of partitions, aka 'claims'.
// 2. Before processing starts, the handler's Setup() hook is called to notify the user
// of the claims and allow any necessary preparation or alteration of state.
// 3. For each of the assigned claims the handler's ConsumeClaim() function is then called
// in a separate goroutine which requires it to be thread-safe. Any state must be carefully protected
// from concurrent reads/writes.
// 4. The session will persist until one of the ConsumeClaim() functions exits. This can be either when the
// parent context is cancelled or when a server-side rebalance cycle is initiated.
// 5. Once all the ConsumeClaim() loops have exited, the handler's Cleanup() hook is called
// to allow the user to perform any final tasks before a rebalance.
// 6. Finally, marked offsets are committed one last time before claims are released.
//
// Please note, that once a rebalance is triggered, sessions must be completed within
// Config.Consumer.Group.Rebalance.Timeout. This means that ConsumeClaim() functions must exit
// as quickly as possible to allow time for Cleanup() and the final offset commit. If the timeout
// is exceeded, the consumer will be removed from the group by Kafka, which will cause offset
// commit failures.
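To make that constraint concrete, here is a minimal sketch using sarama's consumer-group API directly (watermill's Kafka subscriber is built on top of this same API). The broker address, topic names, group name and handler type are placeholders, not taken from the actual project; the setup mirrors the "one client, one group, two topics" situation described above.

package main

import (
	"context"
	"log"

	"github.com/Shopify/sarama"
)

// exampleHandler is a hypothetical sarama.ConsumerGroupHandler used only to
// illustrate the life-cycle quoted above; it is not the watermill implementation.
type exampleHandler struct{}

// Setup runs after the claims are assigned, before ConsumeClaim starts.
func (exampleHandler) Setup(sarama.ConsumerGroupSession) error { return nil }

// Cleanup runs after every ConsumeClaim loop has exited; it plus the final
// offset commit must finish within Config.Consumer.Group.Rebalance.Timeout.
func (exampleHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim runs in one goroutine per claim and must return promptly when
// the session context is cancelled, i.e. when a rebalance is triggered.
func (exampleHandler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for {
		select {
		case msg, ok := <-claim.Messages():
			if !ok {
				return nil // claim channel closed, session is ending
			}
			// ... process msg quickly ...
			sess.MarkMessage(msg, "")
		case <-sess.Context().Done():
			return nil // exit fast so Cleanup and the final commit can run
		}
	}
}

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// One client, one group, two topics -- the setup described above.
	group, err := sarama.NewConsumerGroup([]string{"broker:9092"}, "shared-group", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()

	ctx := context.Background()
	for {
		// Consume blocks for the whole session and returns when the session
		// ends (e.g. on a rebalance), so it is called in a loop to rejoin.
		if err := group.Consume(ctx, []string{"topic-a", "topic-b"}, exampleHandler{}); err != nil {
			log.Println("consume error:", err)
		}
		if ctx.Err() != nil {
			return
		}
	}
}

If ConsumeClaim keeps processing instead of watching sess.Context(), Cleanup and the final offset commit cannot run within Rebalance.Timeout, which is exactly the failure mode described in the comment above.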


Possible cause 2:

A set of production Kafka consumers suddenly fell behind after running for many days. The logs showed that these consumers were rebalancing frequently, and the rebalances usually failed.

The error messages were:

kafka server: The provided member is not known in the current generation
Request was for a topic or partition that does not exist on this broker

Sometimes the logs also contained

i/o timeout

We enabled the errors and notifications logs and found that every error coincided with a rebalance.

Our first guess was that the timeouts were too short, so we increased the connection timeout and the read/write timeouts, but that did not solve the problem.
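For reference, "increasing the connection and read/write timeouts" corresponds to sarama's Net settings, roughly as in this sketch (the 60-second values are illustrative, not the ones actually used):

package consumer

import (
	"time"

	"github.com/Shopify/sarama"
)

// newConfigWithLongerNetTimeouts raises sarama's network timeouts, which was
// the first (unsuccessful) mitigation attempt; the defaults are 30s each.
func newConfigWithLongerNetTimeouts() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Net.DialTimeout = 60 * time.Second  // connection timeout
	cfg.Net.ReadTimeout = 60 * time.Second  // read timeout
	cfg.Net.WriteTimeout = 60 * time.Second // write timeout
	return cfg
}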

We then suspected that our message processing was taking too long, causing the Kafka server to think the client had died and to trigger a rebalance. So we pushed every fetched message into a channel and had multiple workers consume from that channel, but the problem still was not solved. After reading sarama's heartbeat mechanism, we found that each consumer has a dedicated goroutine that sends a heartbeat every 3 seconds, so long processing should only slow down consumption, not trigger a rebalance.
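The channel hand-off we tried looks roughly like the sketch below; the handler name, worker count and channel size are made up for illustration, and the interfaces are the same as in the previous sketch:

package consumer

import (
	"sync"

	"github.com/Shopify/sarama"
)

// poolHandler is a hypothetical handler that forwards every message into a
// channel drained by several workers, so slow processing does not block the
// ConsumeClaim loop itself. Use as &poolHandler{workers: 4}.
type poolHandler struct {
	workers int
}

func (h *poolHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (h *poolHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

func (h *poolHandler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	jobs := make(chan *sarama.ConsumerMessage, 256)

	var wg sync.WaitGroup
	for i := 0; i < h.workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for msg := range jobs {
				// ... slow processing happens here, off the claim loop ...
				sess.MarkMessage(msg, "")
			}
		}()
	}

	// The claim loop itself stays fast: it only forwards messages.
	for {
		select {
		case msg, ok := <-claim.Messages():
			if !ok {
				close(jobs)
				wg.Wait()
				return nil
			}
			jobs <- msg
		case <-sess.Context().Done():
			close(jobs)
			wg.Wait()
			return nil
		}
	}
}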

So we had no choice but to start another consumer with a different group id, and while it was consuming we saw no rebalance at all, which left us even more confused.

Until we came across an article on frequent Kafka consumer rebalances ("kafka consumer 頻繁 reblance"), which points out:

If consumers of different Kafka topics use the same group id, then whenever a consumer of any one of those topics goes up or down, all the remaining consumers in the group go through a rebalance.

And that was exactly our case: multiple consumers on different topics sharing the same group id. To confirm this was really the cause, we paused the other consumers under that group id, and indeed the consumer that had been rebalancing frequently never rebalanced again.

So we changed those consumers' group ids, distinguishing them with different suffixes, and the problem was solved.

Even if the consumers are not on the same topic, this happens because Kafka officially supports one consumer consuming multiple topics at the same time, so when one consumer registered in ZooKeeper has a problem, ZooKeeper simply notifies every consumer under that group.

 

(When multiple topics are consumed, the partitions of all those topics are pooled together when they are mapped to consumers.)

 

The solution is simple: just use a unique consumer group name. Avoid generic consumer group names by all means.

We had put a good deal of time into naming topics, but not into consumer group names. This experience has taught us to put the same care into naming consumer groups: not only well, but uniquely.
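In code, one way to guarantee this is to derive the group id from the topic name; the helper below is only an illustration (the naming scheme is invented):

package consumer

import "fmt"

// groupIDFor derives a consumer group name that is unique per service and per
// topic, so consumers of different topics never share a group.
func groupIDFor(service, topic string) string {
	return fmt.Sprintf("%s-%s-group", service, topic)
}

// Example: groupIDFor("billing", "orders") -> "billing-orders-group"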

 

Possible cause 3:

A problem in the watermill library itself.

 

Possible causes 1 and 2, combined with this post:

read tcp :49560->:9092: i/o timeout under sarama kafka golang panic

https://www.cnblogs.com/mmgithub123/p/15855105.html  

can all be tied together as follows:

Because one client used one consumer group for multiple topics, rebalances happened frequently. A rebalance cancels the consumer session, and if the final cleanup does not finish within Config.Consumer.Group.Rebalance.Timeout, the connection is simply dropped and a tcp i/o timeout is reported. After the rebalance the generation number is incremented, so when the old member comes back it is no longer recognized and gets "The provided member is not known in the current generation".
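The two timeouts in that chain are visible in the MemberMetadata log lines above (sessionTimeoutMs=600000, rebalanceTimeoutMs=60000) and correspond to the following sarama settings; apart from those two values taken from the log, the sketch is illustrative:

package consumer

import (
	"time"

	"github.com/Shopify/sarama"
)

func newGroupConfig() *sarama.Config {
	cfg := sarama.NewConfig()

	// Matches sessionTimeoutMs=600000 in the log: the broker waits this long
	// for heartbeats before evicting a member.
	cfg.Consumer.Group.Session.Timeout = 10 * time.Minute

	// Matches rebalanceTimeoutMs=60000: once a rebalance starts, every member
	// must rejoin (i.e. finish ConsumeClaim, Cleanup and the final commit)
	// within this window, or it is removed and the generation moves on, after
	// which its requests fail with "not known in the current generation".
	cfg.Consumer.Group.Rebalance.Timeout = 60 * time.Second

	// Heartbeats are sent from a dedicated goroutine at this interval
	// (sarama's default is 3 s, as mentioned above).
	cfg.Consumer.Group.Heartbeat.Interval = 3 * time.Second

	return cfg
}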

 

References:

https://www.cnblogs.com/chanshuyi/p/kafka_rebalance_quick_guide.html

https://olnrao.wordpress.com/2015/05/15/apache-kafka-case-of-mysterious-rebalances/

