Problem 1:
2020-03-01 16:04:06,085 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@694] - Established session 0x10635fe2a6368f1 with negotiated timeout 120000 for client /10.62.3.14:55222
2020-03-01 16:06:12,006 [myid:1] - WARN [SyncThread:1:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:1 took 5073ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2020-03-01 16:06:13,906 [myid:1] - WARN [SyncThread:1:FileTxnLog@338] - fsync-ing the write ahead log in SyncThread:1 took 1123ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
Analysis: The ZooKeeper server is taking too long to fsync the write-ahead log.
Solution:
1. Add the following to zoo.cfg:
forceSync=no
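forceSync is enabled by default. To keep the on-disk state from lagging behind, ZooKeeper syncs the current state to the on-disk log file as soon as it receives data, and replies only after the sync has completed. With this option turned off, client connections get a faster response.
[Figure: ZooKeeper log-flushing source code]
Turning forceSync off carries a potential risk. The log is still flushed (log.flush() is executed first), but the operating system writes to its cache first to improve disk write efficiency. If the machine fails abnormally, some ZooKeeper state may never have reached the disk, leaving ZooKeeper's state inconsistent before and after the failure.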
2. Store ZooKeeper's log files and data files separately, so that they do not share one disk.
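Before changing either setting, it can help to measure how slow fsync actually is on the disk in question and to confirm whether the two directories really sit on different devices. The following is a minimal Python sketch of that check, not ZooKeeper's own code: `flush_and_fsync_ms` and `same_device` are hypothetical helper names, and the temporary directory stands in for the paths you would take from the `dataDir` and `dataLogDir` settings in zoo.cfg. It times the two steps separately: `flush()` only hands data to the operating system cache, while `os.fsync()` forces it to the storage device, which is the step that forceSync=no skips.

```python
import os
import tempfile
import time


def flush_and_fsync_ms(directory, payload=b"x" * 4096):
    """Write one block, then time flush() and os.fsync() separately (in ms)."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            t0 = time.perf_counter()
            f.flush()                # user-space buffer -> OS page cache
            t1 = time.perf_counter()
            os.fsync(f.fileno())     # OS page cache -> storage device
            t2 = time.perf_counter()
    finally:
        os.remove(path)              # clean up the probe file
    return (t1 - t0) * 1000.0, (t2 - t1) * 1000.0


def same_device(dir_a, dir_b):
    """True if both directories report the same filesystem device id."""
    return os.stat(dir_a).st_dev == os.stat(dir_b).st_dev


if __name__ == "__main__":
    # Placeholder paths (assumption): substitute dataDir / dataLogDir from zoo.cfg.
    data_dir = tempfile.gettempdir()
    data_log_dir = tempfile.gettempdir()
    flush_ms, fsync_ms = flush_and_fsync_ms(data_log_dir)
    print(f"flush: {flush_ms:.3f} ms, fsync: {fsync_ms:.3f} ms")
    print("same device:", same_device(data_dir, data_log_dir))
```

Note that `st_dev` identifies a filesystem, so two partitions on one physical disk would still report different devices; treat a "different device" result as necessary rather than sufficient.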
Problem 2:
2020-03-01 16:27:16,786 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:ZooKeeperServer@694] - Established session 0x3075e0e93860151 with negotiated timeout 120000 for client /10.62.3.2:60124
2020-03-01 16:34:52,706 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x3075e0e93860135, likely client has closed socket
2020-03-01 16:34:52,706 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed socket connection for client /10.62.3.2:50244 which had sessionid 0x3075e0e93860135
2020-03-01 16:35:48,351 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x10635fe2a636912, likely client has closed socket
2020-03-01 16:35:48,351 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed socket connection for client /10.62.3.14:60822 which had sessionid 0x10635fe2a636912
2020-03-01 16:35:58,226 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x10635fe2a636914, likely client has closed socket
2020-03-01 16:35:58,226 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed socket connection for client /10.62.3.14:60856 which had sessionid 0x10635fe2a636914
2020-03-01 16:36:04,902 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data from client sessionid 0x10635fe2a636910, likely client has closed socket
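Analysis: The timeout configured for clients connecting to ZooKeeper is too short.
As the log above shows, the session timeout is already set to 120 s, which should be fine for an HBase cluster. Even so, some regionserver machines still ran into trouble because flushing the memstore failed. For now, change the tickTime parameter in zoo.cfg and keep observing.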
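tickTime matters here because it bounds the session timeout: the server clamps the timeout a client requests to the range between minSessionTimeout and maxSessionTimeout, which default to 2 × tickTime and 20 × tickTime when not set explicitly. The sketch below works through that arithmetic in Python. `negotiated_timeout_ms` is a hypothetical helper written for illustration, and the tickTime values are examples, not taken from the cluster's actual zoo.cfg.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms,
                          min_session_timeout_ms=None,
                          max_session_timeout_ms=None):
    """Clamp a client's requested session timeout to the server's bounds.

    When the bounds are not configured, they fall back to
    2 * tickTime (lower) and 20 * tickTime (upper).
    """
    lower = (2 * tick_time_ms if min_session_timeout_ms is None
             else min_session_timeout_ms)
    upper = (20 * tick_time_ms if max_session_timeout_ms is None
             else max_session_timeout_ms)
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Illustrative tickTime values (assumptions, not from the original config).
    for tick in (2000, 6000):
        print(tick, negotiated_timeout_ms(120000, tick))
```

With a tickTime of 2000 ms and default bounds, a 120 s request would be cut to 40 s, so the 120000 ms seen in the log implies that this server's upper bound is already at least 120 s, either through a larger tickTime or an explicit maxSessionTimeout.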