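A Linux host monitored by Zabbix raised the alert "System time is out of sync (diff with Zabbix server > 60s)". On checking, its clock turned out to be more than an hour behind. The host had the ntpd service configured, so I logged in over ssh and checked the service, which reported the following error: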
# service ntpd status
ntpd dead but pid file exists
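The ntpd service had died. I started it, and after what felt like less than a minute it died again. After I started it once more it kept running, but the time was still not being synchronized.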
# service ntpd start
Starting ntpd: [ OK ]
# service ntpd status
ntpd (pid 14956) is running...
# service ntpd status
ntpd dead but pid file exists
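Checking the log turned up this error: "time correction of 4988 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time." With default settings, ntpd refuses to synchronize when the time offset exceeds 1000 seconds and exits instead, which explains why the service died again shortly after being started. This is a self-protection mechanism built into ntpd.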
May 16 04:02:03 xxxxx syslogd 1.4.1: restart.
May 17 00:38:03 xxxxx last message repeated 5 times
May 17 07:07:18 xxxxx ntpd[14955]: ntpd 4.2.2p1@1.1570-o Fri Jul 22 18:07:53 UTC 2011 (1)
May 17 07:07:18 xxxxx ntpd[14956]: precision = 1.000 usec
May 17 07:07:18 xxxxx ntpd[14956]: Listening on interface wildcard, 0.0.0.0#123 Disabled
May 17 07:07:18 xxxxx ntpd[14956]: Listening on interface lo, 127.0.0.1#123 Enabled
May 17 07:07:18 xxxxx ntpd[14956]: Listening on interface eth1, 192.168.xxx.xxx#123 Enabled
May 17 07:07:18 xxxxx ntpd[14956]: kernel time sync status 0040
May 17 07:07:18 xxxxx ntpd[14956]: getaddrinfo: "::1" invalid host address, ignored
May 17 07:07:18 xxxxx ntpd[14956]: frequency initialized 26.675 PPM from /var/lib/ntp/drift
May 17 07:10:33 xxxxx ntpd[14956]: synchronized to LOCAL(0), stratum 10
May 17 07:10:33 xxxxx ntpd[14956]: kernel time sync enabled 0001
May 17 07:12:41 xxxxx ntpd[14956]: synchronized to 192.168.xxx.xxx, stratum 5
May 17 07:19:12 xxxxx ntpd[14956]: time correction of 4988 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
May 17 07:25:21 xxxxx ntpd[15681]: ntpd 4.2.2p1@1.1570-o Fri Jul 22 18:07:53 UTC 2011 (1)
May 17 07:25:21 xxxxx ntpd[15682]: precision = 1.000 usec
May 17 07:25:21 xxxxx ntpd[15682]: Listening on interface wildcard, 0.0.0.0#123 Disabled
May 17 07:25:21 xxxxx ntpd[15682]: Listening on interface lo, 127.0.0.1#123 Enabled
May 17 07:25:21 xxxxx ntpd[15682]: Listening on interface eth1, 192.168.xxx.xxx#123 Enabled
May 17 07:25:21 xxxxx ntpd[15682]: kernel time sync status 0040
May 17 07:25:21 xxxxx ntpd[15682]: getaddrinfo: "::1" invalid host address, ignored
May 17 07:25:21 xxxxx ntpd[15682]: frequency initialized 26.675 PPM from /var/lib/ntp/drift
May 17 07:28:37 xxxxx ntpd[15682]: synchronized to LOCAL(0), stratum 10
May 17 07:28:37 xxxxx ntpd[15682]: kernel time sync enabled 0001
May 17 07:29:43 xxxxx ntpd[15682]: synchronized to 192.168.xxx.xxx, stratum 5
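The sanity-limit message is easy to miss among the startup lines. As a minimal sketch (the function name `sanity_events` is made up for illustration, and the message format is assumed to match the line logged above), the correction and the limit can be pulled out of a syslog file like this:

```shell
#!/bin/sh
# sanity_events FILE
# Hypothetical helper: prints "CORRECTION LIMIT" (both in seconds) for every
# ntpd message of the form
#   "time correction of N seconds exceeds sanity limit (M)"
# found in FILE. Lines that do not match are ignored.
sanity_events() {
    sed -n 's/.*time correction of \(-\{0,1\}[0-9][0-9]*\) seconds exceeds sanity limit (\([0-9][0-9]*\)).*/\1 \2/p' "$1"
}
```

Calling `sanity_events /var/log/messages` on this host would be expected to list one line per occurrence, which makes it quick to see how far the clock had drifted each time ntpd gave up.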
A clock that is this far off can be synchronized with ntpdate, or manually with ntpd:
1: Stop the ntpd service
2: Run ntpd -gnqd
3: Start the ntpd service
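The three steps can be sketched as a small script. This is only an illustration under assumptions: `run`, `resync_clock`, and the `DRY_RUN` switch are names invented here, and the script defaults to printing the commands rather than executing them. In `ntpd -gnqd`, `-g` allows one correction larger than the 1000-second limit, `-q` makes ntpd exit once the clock has been set, `-n` keeps it in the foreground, and `-d` turns on debug output.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical safety switch: 1 (default) only prints the
# commands, 0 actually executes them (root privileges would be needed).
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# resync_clock: the three manual steps, stopping at the first failure.
resync_clock() {
    # Step 1: stop the daemon so that UDP port 123 is free.
    run service ntpd stop || return 1
    # Step 2: set the clock once, ignoring the sanity limit, then exit.
    run ntpd -gnqd || return 1
    # Step 3: bring the daemon back for normal operation.
    run service ntpd start
}
```

Sourcing the file and calling `resync_clock` prints the three commands; setting `DRY_RUN=0` first would run them for real.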
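In my own testing, running ntpd -gnqd by hand still synchronized the time even without stopping the ntpd service, so it is not a big problem. It does, however, report the error "addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor": as the output below shows, the bind() calls on port 123 fail with "Address already in use" because the running daemon already holds the port. So it is best to stop the ntpd service before running the command.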
# ntpd -gnqd
ntpd 4.2.2p1@1.1570-o Fri Jul 22 18:07:53 UTC 2011 (1)
addto_syslog: precision = 1.000 usec
create_sockets(123)
addto_syslog: no IPv6 interfaces found
addto_syslog: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16
addto_syslog: bind() fd 16, family 2, port 123, addr 0.0.0.0, in_classd=0 flags=9 fails: Address already in use
addto_syslog: bind() fd 16, family 2, port 123, addr 127.0.0.1, in_classd=0 flags=5 fails: Address already in use
addto_syslog: bind() fd 16, family 2, port 123, addr 192.168.xxx.xxx, in_classd=0 flags=25 fails: Address already in use
init_io: maxactivefd 0
local_clock: time 0 base 0.000000 offset 0.000000 freq 0.000 state 0
addto_syslog: getaddrinfo: "::1" invalid host address, ignored
getaddrinfo: "::1" invalid host address, ignored.
key_expire: at 0
peer_clear: at 0 next 1 assoc ID 8694 refid INIT
newpeer: 192.168.xxx.xxx->192.168.xxx.xxx mode 3 vers 4 poll 6 10 flags 0x281 0x1 ttl 0 key 00000000
key_expire: at 0
peer_clear: at 0 next 2 assoc ID 8695 refid INIT
newpeer: 127.0.0.1->127.127.1.0 mode 3 vers 4 poll 6 10 flags 0x1221 0x1 ttl 0 key 00000000
addto_syslog: frequency initialized 26.675 PPM from /var/lib/ntp/drift
local_clock: time 0 base 0.000000 offset 0.000000 freq 26.675 state 1
report_event: system event 'event_restart' (0x01) status 'sync_alarm, sync_unspec, 1 event, event_unspec' (0xc010)
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 1 192.168.xxx.xxx->192.168.xxx.xxx mode 3
auth_agekeys: at 1 keys 1 expired 0
timer: refresh ts 0
refclock_transmit: at 2 127.127.1.0
refclock_receive: at 2 127.127.1.0
peer LOCAL(0) event 'event_reach' (0x84) status 'unreach, conf, 1 event, event_reach' (0x8014)
refclock_sample: n 1 offset 0.000000 disp 0.010000 jitter 0.000001
clock_filter: n 1 off 0.000000 del 0.000000 dsp 7.937500 jit 0.000001, age 0
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 3 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 5 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 7 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 9 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 11 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 13 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 15 192.168.xxx.xxx->192.168.xxx.xxx mode 3
addto_syslog: sendto(192.168.xxx.xxx) (fd=-1): Bad file descriptor
transmit: at 17 192.168.xxx.xxx->192.168.xxx.xxx mode 3
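After running the command above, the time was back in sync and the ntpd service ran normally. That brings us back to the root question: why did the ntpd service die in the first place? Finding out requires the logs, and unless configured otherwise, ntpd writes its log messages to /var/log/messages. I went through the messages log, but very little had been written to it and I found nothing relevant. So, regrettably, I still do not know what originally caused the ntpd service to die. As for the Zabbix alert, there had been a lot of alerts and I had been busy lately, so these particular alerts were overlooked.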
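One quick way to judge how sparse the ntpd logging is would be to count its entries per day. The sketch below assumes the traditional syslog layout seen in the excerpts above (month, day, time, host, then an `ntpd[PID]:` tag); the function name `ntpd_lines_per_day` is invented for illustration.

```shell
#!/bin/sh
# ntpd_lines_per_day [FILE]
# Hypothetical helper: counts syslog entries whose fifth field is an
# "ntpd[PID]:" tag, grouped by calendar day ("Mon DD").
# FILE defaults to /var/log/messages; output order is not guaranteed.
ntpd_lines_per_day() {
    awk '$5 ~ /^ntpd\[[0-9]+\]:$/ { count[$1 " " $2]++ }
         END { for (day in count) print day, count[day] }' "${1:-/var/log/messages}"
}
```

A day that is missing from the output, or that has only a handful of lines, marks a window in which ntpd logged little or nothing, which narrows down when the daemon may have stopped.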