Problem description:
All Ceph nodes were shut down before the holiday; when they were powered back on afterward, every OSD was down.
ceph -s reported: HEALTH_WARN 320 pgs stale; 320 pgs stuck stale; 3/3 in osds are down
After a lot of searching turned up no way to recover them, the only option left was to delete the OSDs and add them back.
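Before tearing anything down, the down state can be cross-checked from the monitor side (a quick sanity check; exact output varies by release):
ceph osd tree       # shows each OSD's up/down and in/out status
ceph osd stat       # summary line, e.g. "3 osds: 0 up, 3 in"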
Re-adding the OSDs
1. Delete the partition: fdisk /dev/sdb
2. Reformat the partition: mkfs.ext4 /dev/sdb1
3. ceph-deploy --overwrite-conf osd prepare ${hostname}:sdb1
4. ceph-deploy osd activate ${hostname}:sdb1:sdc
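Expanded on one node, those four steps look roughly like this (ceph002, /dev/sdb and /dev/sdc are taken from the log below; adjust to your own layout):
fdisk /dev/sdb                                    # interactively delete and recreate the data partition (d, n, w)
mkfs.ext4 /dev/sdb1                               # wipe the old filestore contents
ceph-deploy --overwrite-conf osd prepare ceph002:sdb1
ceph-deploy osd activate ceph002:sdb1:sdc         # sdc carries the journal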
The activate step failed with the following error:
root@cephadmin:~/ceph-cluster# ceph-deploy osd activate ceph002:sdb1:sdc
[ceph_deploy.cli][INFO ] Invoked (1.4.0): /usr/bin/ceph-deploy osd activate ceph002:sdb1:sdc
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks ceph002:/dev/sdb1:/dev/sdc
[ceph002][DEBUG ] connected to host: ceph002
[ceph002][DEBUG ] detect platform information from remote host
[ceph002][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host ceph002 disk /dev/sdb1
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[ceph002][INFO ] Running command: ceph-disk-activate --mark-init upstart --mount /dev/sdb1
[ceph002][WARNIN] got monmap epoch 3
[ceph002][WARNIN] 2017-02-05 22:31:37.810083 7fe4d904b800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
[ceph002][WARNIN] 2017-02-05 22:31:37.861884 7fe4d904b800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
[ceph002][WARNIN] 2017-02-05 22:31:37.862572 7fe4d904b800 -1 filestore(/var/lib/ceph/tmp/mnt.KjVzc2) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
[ceph002][WARNIN] 2017-02-05 22:31:37.906641 7fe4d904b800 -1 created object store /var/lib/ceph/tmp/mnt.KjVzc2 journal /var/lib/ceph/tmp/mnt.KjVzc2/journal for osd.1 fsid 4f1100a0-bc37-4472-b0b0-58b44eabac97
[ceph002][WARNIN] 2017-02-05 22:31:37.906731 7fe4d904b800 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.KjVzc2/keyring: can't open /var/lib/ceph/tmp/mnt.KjVzc2/keyring: (2) No such file or directory
[ceph002][WARNIN] 2017-02-05 22:31:37.906869 7fe4d904b800 -1 created new key in keyring /var/lib/ceph/tmp/mnt.KjVzc2/keyring
[ceph002][WARNIN] Error EINVAL: entity osd.1 exists but key does not match
[ceph002][WARNIN] ERROR:ceph-disk:Failed to activate
[ceph002][WARNIN] Traceback (most recent call last):
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 2798, in <module>
[ceph002][WARNIN] main()
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 2776, in main
[ceph002][WARNIN] args.func(args)
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 2003, in main_activate
[ceph002][WARNIN] init=args.mark_init,
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 1777, in mount_activate
[ceph002][WARNIN] (osd_id, cluster) = activate(path, activate_key_template, init)
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 1978, in activate
[ceph002][WARNIN] keyring=keyring,
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 1596, in auth_key
[ceph002][WARNIN] 'mon', 'allow profile osd',
[ceph002][WARNIN] File "/usr/sbin/ceph-disk", line 316, in command_check_call
[ceph002][WARNIN] return subprocess.check_call(arguments)
[ceph002][WARNIN] File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[ceph002][WARNIN] raise CalledProcessError(retcode, cmd)
[ceph002][WARNIN] subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.1', '-i', '/var/lib/ceph/tmp/mnt.KjVzc2/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
[ceph002][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk-activate --mark-init upstart --mount /dev/sdb1
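The key line is "Error EINVAL: entity osd.1 exists but key does not match": the monitors still hold the cephx key of the old osd.1, while the freshly prepared disk generated a new keyring, so the auth add issued by ceph-disk is rejected. The leftover entry can be confirmed with, for example:
ceph auth get osd.1     # still prints the old key if it was never deleted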
After further research, the following commands were run to completely remove the faulty OSD before re-adding it:
[root@ceph-osd-1 ceph-cluster]# ceph auth del osd.1
updated
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 1
removed osd.1
[root@ceph-osd-1 ceph-cluster]# ceph osd crush remove osd.1
removed item id 1 name 'osd.1' from crush map
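Before re-running prepare/activate it is worth confirming that no trace of osd.1 remains (a quick check; output depends on the cluster):
ceph auth list | grep osd.1     # should print nothing now
ceph osd tree                   # osd.1 should no longer appear in the tree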
Then run again:
ceph-deploy --overwrite-conf osd prepare ${hostname}:sdb1
ceph-deploy osd activate ${hostname}:sdb1:sdc
Running ceph health after this still reported errors, and ceph health detail listed 191 pgs stuck stale.
Each of these pgs needs the following commands run against it to repair it:
ceph pg force_create_pg {pg-num}
ceph pg map {pg-num}
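For a single placement group this looks like the following (0.19 here is just a made-up pg id; use the ids listed by ceph health detail):
ceph pg force_create_pg 0.19    # marks the pg for re-creation
ceph pg map 0.19                # shows which OSDs the pg now maps to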
Batch repair command:
for pg in `ceph health detail | grep stale | cut -d' ' -f2`; do ceph pg force_create_pg $pg; done
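Afterwards the recreated pgs should gradually move from creating to active+clean; progress can be followed with something like:
watch -n 5 'ceph -s'            # wait until the stale counts drop and health returns to HEALTH_OK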