Problem description:
All Ceph nodes were shut down before the holiday. After powering them back on, every OSD stayed down.
ceph -s showed: HEALTH_WARN 320 pgs stale; 320 pgs stuck stale; 3/3 in osds are down
After searching through a lot of documentation without finding a fix, the only remaining option was to remove the OSDs and add them back.
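Before rebuilding, it is worth confirming what the monitors think of the OSDs. A minimal set of checks, assuming the admin keyring is available on the node:
ceph -s              # overall cluster health
ceph osd tree        # which OSDs are down and where they sit in the CRUSH map
ceph osd stat        # one-line summary, e.g. 3 osds: 0 up, 3 in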
Re-adding the OSDs
1. Delete the old partition: fdisk /dev/sdb
2. Reformat the partition: mkfs.ext4 /dev/sdb1 (steps 1 and 2 are sketched after this list)
3. ceph-deploy --overwrite-conf osd prepare ${hostname}:sdb1
4. ceph-deploy osd activate ${hostname}:sdb1:sdc
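A minimal sketch of steps 1 and 2, assuming /dev/sdb holds no data worth keeping and the journal stays on /dev/sdc as in step 4:
fdisk /dev/sdb        # at the prompt: d to delete the old partition, n to create a new one (accept the defaults), w to write and exit
mkfs.ext4 /dev/sdb1   # reformat the fresh partition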
Running activate produced an error:
root@cephadmin:~/ceph-cluster# ceph-deploy osd activate ceph002:sdb1:sdc
[ceph_deploy.cli][INFO ] Invoked (1.4.0): /usr/bin/ceph-deploy osd activate ceph002:sdb1:sdc
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks ceph002:/dev/sdb1:/dev/sdc
[ceph002][DEBUG ] connected to host: ceph002
[ceph002][DEBUG ] detect platform information from remote host
[ceph002][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host ceph002 disk /dev/sdb1
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[ceph002][INFO ] Running command: ceph-disk-activate --mark-init upstart --mount /dev/sdb1
[ceph002][WARNIN] got monmap epoch 3
[ceph002][WARNIN] 2017-02-05 22:31:37.810083 7fe4d904b800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
[ceph002][WARNIN] 2017-02-05 22:31:37.861884 7fe4d904b800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
[ceph002][WARNIN] 2017-02-05 22:31:37.862572 7fe4d904b800 -1 filestore(/var/lib/ceph/tmp/mnt.KjVzc2) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
[ceph002][WARNIN] 2017-02-05 22:31:37.906641 7fe4d904b800 -1 created object store /var/lib/ceph/tmp/mnt.KjVzc2 journal /var/lib/ceph/tmp/mnt.KjVzc2/journal for osd.1 fsid 4f1100a0-bc37-4472-b0b0-58b44eabac97
[ceph002][WARNIN] 2017-02-05 22:31:37.906731 7fe4d904b800 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.KjVzc2/keyring: can't open /var/lib/ceph/tmp/mnt.KjVzc2/keyring: (2) No such file or directory
[ceph002][WARNIN] 2017-02-05 22:31:37.906869 7fe4d904b800 -1 created new key in keyring /var/lib/ceph/tmp/mnt.KjVzc2/keyring
[ceph002][WARNIN] Error EINVAL: entity osd.1 exists but key does not match
[ceph002][WARNIN] ERROR:ceph-disk:Failed to activate
[ceph002][WARNIN] Traceback (most recent call last):
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 2798, in <module>
[ceph002][WARNIN] main()
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 2776, in main
[ceph002][WARNIN] args.func(args)
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 2003, in main_activate
[ceph002][WARNIN] init=args.mark_init,
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 1777, in mount_activate
[ceph002][WARNIN] (osd_id, cluster) = activate(path, activate_key_template, init)
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 1978, in activate
[ceph002][WARNIN] keyring=keyring,
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 1596, in auth_key
[ceph002][WARNIN] 'mon', 'allow profile osd',
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 316, in command_check_call
[ceph002][WARNIN] return subprocess.check_call(arguments)
[ceph002][WARNIN] File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[ceph002][WARNIN] raise CalledProcessError(retcode, cmd)
[ceph002][WARNIN] subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.1', '-i', '/var/lib/ceph/tmp/mnt.KjVzc2/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
[ceph002][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk-activate --mark-init upstart --mount /dev/sdb1
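The decisive line is "Error EINVAL: entity osd.1 exists but key does not match": the monitors still hold the old auth entry for osd.1, so the key that the freshly prepared OSD generated is rejected. To confirm the stale entry (assuming the admin keyring is available):
ceph auth get osd.1   # prints the key the monitors still have registered for osd.1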
After further research: when removing the failed OSD, it also has to be deleted from the cluster itself (auth entry, OSD map, and CRUSH map), so the following commands were run:
[root@ceph-osd-1 ceph-cluster]# ceph auth del osd.1
updated
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 1
removed osd.1
[root@ceph-osd-1 ceph-cluster]# ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
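Since all three OSDs in this cluster were down and being rebuilt, the same cleanup can be applied to every id in a loop; a sketch assuming the OSD ids are 0, 1 and 2:
for i in 0 1 2; do
    ceph auth del osd.$i
    ceph osd rm $i
    ceph osd crush remove osd.$i
done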
Then run again:
ceph-deploy --overwrite-conf osd prepare ${hostname}:sdb1
ceph-deploy osd activate ${hostname}:sdb1:sdc
Running ceph health still reported errors, and ceph health detail listed 191 pgs stuck stale.
Each of these PGs needs to be repaired with the following commands:
ceph pg force_create_pg {pg-num}
ceph pg map {pg-num}
Batch repair command:
for pg in `ceph health detail | grep stale | cut -d' ' -f2`; do ceph pg force_create_pg $pg; done
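The grep/cut pipeline above depends on the exact wording of ceph health detail. An alternative sketch that parses ceph pg dump_stuck instead (the awk filter assumes pg ids of the form pool.seq and may need adjusting for your Ceph release):
for pg in $(ceph pg dump_stuck stale 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1}'); do ceph pg force_create_pg $pg; done
Afterwards, ceph pg map {pg-num} and ceph -s can be used to verify that the PGs have left the stale state.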