Environment:
centos7.3 + moosefs 3.0.97 + drbd84-utils-8.9.8-1 + keepalived-1.2.13-9
How it works:
Architecture diagram:
Node information:
Node    MFS role    Hostname    IP
node1 master & metalogger node1 172.16.0.41
node2 master & metalogger node2 172.16.0.42
node3 chunk server node3 172.16.0.43
node4 chunk server node4 172.16.0.44
node5 chunk server node5 172.16.0.45
node6 client node6 172.16.0.11
node7 client node7 172.16.0.12
node8 client node8 172.16.0.13
vip mfsmaster 172.16.0.47
Notes:
1) DRBD is installed on the two MFS master servers to provide a replicated network disk; the mfs master's metadata files are stored on it.
2) keepalived is installed on both servers with a single floating VIP. keepalived runs a check script to monitor the server state; when one server fails, the VIP automatically moves to the other.
3) Clients, chunk servers and the metalogger all connect to the VIP, so the service is not affected when one of the two master servers goes down.
Add the hosts entries on node1 and node2:
cat /etc/hosts
172.16.0.41 node1
172.16.0.42 node2
172.16.0.43 node3
172.16.0.44 node4
172.16.0.45 node5
172.16.0.11 node6
172.16.0.12 node7
172.16.0.13 node8
172.16.0.47 mfsmaster
Add the hosts entry on node3 - node8: cat /etc/hosts
172.16.0.47 mfsmaster
Install the MooseFS yum repository (all nodes)
Import the public key:
curl "http://ppa.moosefs.com/RPM-GPG-KEY-MooseFS" > /etc/pki/rpm-gpg/RPM-GPG-KEY-MooseFS
To add the MooseFS repo for RHEL-7, SL-7 or CentOS-7:
curl "http://ppa.moosefs.com/MooseFS-3-el7.repo" > /etc/yum.repos.d/MooseFS.repo
To add the MooseFS repo for RHEL-6, SL-6 or CentOS-6:
curl "http://ppa.moosefs.com/MooseFS-3-el6.repo" > /etc/yum.repos.d/MooseFS.repo
Install the ELRepo repository (node1, node2)
Get started
Import the public key:
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
Detailed info on the GPG key used by the ELRepo Project can be found on https://www.elrepo.org/tiki/key
If you have a system with Secure Boot enabled, please see the SecureBootKey page for more information.
To install ELRepo for RHEL-7, SL-7 or CentOS-7:
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
To make use of our mirror system, please also install yum-plugin-fastestmirror.
To install ELRepo for RHEL-6, SL-6 or CentOS-6:
rpm -Uvh http://www.elrepo.org/elrepo-release-6-8.el6.elrepo.noarch.rpm
Create the partition (node1, node2)
Add a separate disk of the same size on node1 and node2 for DRBD.
Build it on LVM so the volume can be extended later (a resize sketch is given after the LVM setup below).
[root@mfs-n1 /]# fdisk /dev/vdb
Welcome to fdisk (util-linux 2.23.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/vdb: 64.4 GB, 64424509440 bytes, 125829120 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xdaf38769
   Device Boot      Start         End      Blocks   Id  System

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-125829119, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-125829119, default 125829119):
Using default value 125829119
Partition 1 of type Linux and of size 60 GiB is set

Command (m for help): t
Selected partition 1
Hex code (type L to list all codes): 8e
Changed type of partition 'Linux' to 'Linux LVM'

Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
pvcreate /dev/vdb1    # create the physical volume
vgcreate vgdrbr /dev/vdb1    # create the volume group
vgdisplay    # inspect the volume group
--- Volume group ---
VG Name vgdrbr
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size <60.00 GiB
PE Size 4.00 MiB
Total PE 15359
Alloc PE / Size 0 / 0
Free PE / Size 15359 / <60.00 GiB
VG UUID jgrCsU-CQmQ-l2yz-O375-qFQq-Rho9-r6eY54
lvcreate -l +15359 -n mfs vgdrbr    # create the logical volume using all 15359 free PEs (-l 100%FREE is equivalent)
lvdisplay
--- Logical volume ---
LV Path /dev/vgdrbr/mfs
LV Name mfs
VG Name vgdrbr
LV UUID 3pb6ZJ-aMIu-PVbU-PAID-ozvB-XVpz-hdfK1c
LV Write Access read/write
LV Creation host, time mfs-n1, 2017-11-08 10:14:00 +0800
LV Status available
# open 2
LV Size <60.00 GiB
Current LE 15359
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:3
The logical volume is mapped as /dev/mapper/vgdrbr-mfs.
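Because the DRBD backing device is an LVM logical volume, it can be grown later without rebuilding the stack. A minimal sketch, assuming free extents are available in vgdrbr on both nodes and using the device names from this article (the +20G size is only an example):

# on node1 AND node2: grow the backing logical volume
lvextend -L +20G /dev/vgdrbr/mfs

# on the current DRBD primary only: let DRBD pick up the new backing size
drbdadm resize mfs

# still on the primary: grow the mounted XFS filesystem on /dev/drbd0
xfs_growfs /data/drbd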
DRBD
Install DRBD (node1, node2)
yum -y install drbd84-utils kmod-drbd84    # a newer branch also exists: yum install -y drbd90 kmod-drbd90
Load the DRBD module:
# modprobe drbd
Check that the DRBD module is loaded into the kernel:
# lsmod |grep drbd
Configure DRBD
node1:
cat /etc/drbd.conf
# You can find an example in /usr/share/doc/drbd.../drbd.conf.example
include "drbd.d/global_common.conf";
include "drbd.d/*.res";
cat /etc/drbd.d/global_common.conf
# DRBD is the result of over a decade of development by LINBIT.
# In case you need professional services for DRBD or have
# feature requests visit http://www.linbit.com
global {
    usage-count no;
    # minor-count dialog-refresh disable-ip-verification
    # cmd-timeout-short 5; cmd-timeout-medium 121; cmd-timeout-long 600;
}
common {
    handlers {
        # These are EXAMPLE handlers only.
        # They may have severe implications,
        # like hard resetting the node under certain circumstances.
        # Be careful when chosing your poison.
        # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }
    startup {
        # wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
        wfc-timeout 30;
        degr-wfc-timeout 30;
        outdated-wfc-timeout 30;
    }
    options {
        # cpu-mask on-no-data-accessible
    }
    disk {
        # size on-io-error fencing disk-barrier disk-flushes
        # disk-drain md-flushes resync-rate resync-after al-extents
        # c-plan-ahead c-delay-target c-fill-target c-max-rate
        # c-min-rate disk-timeout
        on-io-error detach;
    }
    net {
        # protocol timeout max-epoch-size max-buffers unplug-watermark
        # connect-int ping-int sndbuf-size rcvbuf-size ko-count
        # allow-two-primaries cram-hmac-alg shared-secret after-sb-0pri
        # after-sb-1pri after-sb-2pri always-asbp rr-conflict
        # ping-timeout data-integrity-alg tcp-cork on-congestion
        # congestion-fill congestion-extents csums-alg verify-alg
        # use-rle
        protocol C;
        cram-hmac-alg sha1;
        shared-secret "bqrPnf9";
    }
}
cat /etc/drbd.d/mfs.res
resource mfs {
    device /dev/drbd0;
    meta-disk internal;
    on node1 {
        disk /dev/vgdrbr/mfs;
        address 172.16.0.41:9876;
    }
    on node2 {
        disk /dev/vgdrbr/mfs;
        address 172.16.0.42:9876;
    }
}
Copy /etc/drbd.d/* from node1 to node2:/etc/drbd.d/.
Create the DRBD resource
drbdadm create-md mfs
initializing activity log
NOT initializing bitmap
Writing meta data...
New drbd meta data block successfully created.
Start the DRBD service (node1, node2)
systemctl enable drbd
systemctl start drbd
Check the initial DRBD state:
drbd-overview
Promote node1 to primary
drbdadm primary mfs
If that reports an error, run:
drbdadm primary --force mfs    # if it still fails, restart the drbd service and retry
drbdadm --overwrite-data-of-peer primary all
At this point drbd-overview or cat /proc/drbd shows that synchronization has started.
drbdsetup status mfs --verbose --statistics    // show detailed status
Format the DRBD device on node1
mkfs -t xfs /dev/drbd0    or    mkfs.xfs /dev/drbd0
Test-mount the device on node1
mkdir -p /data/drbd
mount /dev/drbd0 /data/drbd
Create the same mount directory on node2: mkdir -p /data/drbd
There is no need to create the DRBD resource or format the device on node2; once synchronization finishes, node2 holds the same data as node1, with the resource already created and formatted.
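Before keepalived is involved, the handover can be exercised by hand. A sketch of a manual role switch, assuming nothing on node1 is still using the mount (this is essentially what the keepalived scripts below automate):

# on node1 (current primary)
umount /data/drbd
drbdadm secondary mfs

# on node2
drbdadm primary mfs
mount /dev/drbd0 /data/drbd    # the XFS filesystem created on node1 is already there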
Install MFS master + metalogger (node1, node2)
yum -y install moosefs-master moosefs-cli moosefs-cgi moosefs-cgiserv moosefs-metalogger
Configure MFS master and metalogger
/etc/mfs/mfsmaster.cfg
grep -v "^#" mfsmaster.cfg
WORKING_USER = mfs
WORKING_GROUP = mfs
SYSLOG_IDENT = mfsmaster
LOCK_MEMORY = 0
NICE_LEVEL = -19
DATA_PATH = /data/drbd/mfs
EXPORTS_FILENAME = /etc/mfs/mfsexports.cfg
TOPOLOGY_FILENAME = /etc/mfs/mfstopology.cfg
BACK_LOGS = 50
BACK_META_KEEP_PREVIOUS = 1
MATOML_LISTEN_HOST = *
MATOML_LISTEN_PORT = 9419
MATOCS_LISTEN_HOST = *
MATOCS_LISTEN_PORT = 9420
# authentication between the chunkservers and the master
AUTH_CODE = mfspassword
REPLICATIONS_DELAY_INIT = 300
CHUNKS_LOOP_MAX_CPS = 100000
CHUNKS_LOOP_MIN_TIME = 300
CHUNKS_SOFT_DEL_LIMIT = 10
CHUNKS_HARD_DEL_LIMIT = 25
CHUNKS_WRITE_REP_LIMIT = 2
CHUNKS_READ_REP_LIMIT = 10
MATOCL_LISTEN_HOST = *
MATOCL_LISTEN_PORT = 9421
SESSION_SUSTAIN_TIME = 86400
/etc/mfs/mfsmetalogger.cfg
grep -v "^#" mfsmetalogger.cfg
WORKING_USER = mfs
WORKING_GROUP = mfs
SYSLOG_IDENT = mfsmetalogger
LOCK_MEMORY = 0
NICE_LEVEL = -19
DATA_PATH = /var/lib/mfs
BACK_LOGS = 50
BACK_META_KEEP_PREVIOUS = 3
META_DOWNLOAD_FREQ = 24
MASTER_RECONNECTION_DELAY = 5
MASTER_HOST = mfsmaster
MASTER_PORT = 9419
MASTER_TIMEOUT = 10
/etc/mfs/mfsexports.cfg access permissions
grep -v "^#" mfsexports.cfg
* / rw,alldirs,admin,maproot=0:0,password=9WpV9odJ
* . rw
The password value is the password MFS clients must supply when they connect.
Sync the files mfsexports.cfg, mfsmaster.cfg, mfsmetalogger.cfg and mfstopology.cfg from node1's /etc/mfs directory to node2's /etc/mfs directory.
Create the metadata storage directory:
mkdir -p /data/drbd/mfs
cp /var/lib/mfs/metadata.mfs.empty /data/drbd/mfs/metadata.mfs
chown -R mfs:mfs /data/drbd/mfs
Start mfsmaster (do not start it on node2; when node1 fails, the keepalived script starts mfsmaster on node2)
mfsmaster start
Start the MFS monitoring (CGI) service
chmod 755 /usr/share/mfscgi/*.cgi    # make sure the CGI scripts are executable on both node1 and node2
mfscgiserv start    or    systemctl start moosefs-cgiserv
Browse to http://172.16.0.41:9425
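To confirm the master really is serving, it can help to check that the ports configured above are listening (9419 metalogger, 9420 chunkserver, 9421 client, 9425 CGI); a quick sketch, assuming ss is available:

ss -tlnp | grep -E '9419|9420|9421|9425'    # 9419-9421 bound by mfsmaster, 9425 by the CGI server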
Install keepalived (node1, node2)
yum -y install keepalived
Configure keepalived
node1:
Add the helper scripts.
Mail notification script:
cat /etc/keepalived/script/mail_notify.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import smtplib
from email.mime.text import MIMEText
from email.header import Header
import sys, time, subprocess, random

# third-party SMTP service
mail_host = "smtp.qq.com"    # SMTP server

userinfo_list = [{'user':'user1@qq.com','pass':'pass1'},
                 {'user':'user2@qq.com','pass':'pass2'},
                 {'user':'user3@qq.com','pass':'pass3'}]
user_inst = userinfo_list[random.randint(0, len(userinfo_list)-1)]
mail_user = user_inst['user']    # username
mail_pass = user_inst['pass']    # password
sender = mail_user               # mail sender
receivers = ['xx1@qq.com', 'xx2@163.com']    # recipients: your QQ mailbox or any other mailbox

p = subprocess.Popen('hostname', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
hostname = p.stdout.readline().split('\n')[0]

message_to = ''
for i in receivers:
    message_to += i + ';'

def print_help():
    note = '''python script.py role ip vip '''
    print(note)
    exit(1)

time_stamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))

if len(sys.argv) != 4:
    print_help()
elif sys.argv[1] == 'master':
    message_content = '%s server: %s(%s) keepalived change to Master, vIP: %s' % (time_stamp, sys.argv[2], hostname, sys.argv[3])
    subject = '%s keepalived change to Master -- keepalived notify' % (sys.argv[2])
elif sys.argv[1] == 'backup':
    message_content = '%s server: %s(%s) keepalived change to Backup, vIP: %s' % (time_stamp, sys.argv[2], hostname, sys.argv[3])
    subject = '%s keepalived change to Backup -- keepalived notify' % (sys.argv[2])
elif sys.argv[1] == 'stop':
    message_content = '%s server: %s(%s) keepalived change to Stop, vIP: %s' % (time_stamp, sys.argv[2], hostname, sys.argv[3])
    subject = '%s keepalived change to Stop -- keepalived notify' % (sys.argv[2])
else:
    print_help()

message = MIMEText(message_content, 'plain', 'utf-8')
message['From'] = Header(sender, 'utf-8')
message['To'] = Header(message_to, 'utf-8')
message['Subject'] = Header(subject, 'utf-8')

try:
    smtpObj = smtplib.SMTP()
    smtpObj.connect(mail_host, 25)    # 25 is the SMTP port
    smtpObj.login(mail_user, mail_pass)
    smtpObj.sendmail(sender, receivers, message.as_string())
    print("mail sent successfully")
except smtplib.SMTPException as e:
    print("Error: unable to send mail")
    print(e)
DRBD check script:
cat /etc/keepalived/script/check_drbd.sh
#!/bin/bash
# set basic parameters
drbd_res=mfs
drbd_mountpoint=/data/drbd
status="ok"
#ret=`ps -C mfsmaster --no-header |wc -l`
ret=`pidof mfsmaster |wc -l`
if [ $ret -eq 0 ]; then
    status="mfsmaster not running"
    umount $drbd_mountpoint
    drbdadm secondary $drbd_res
    mfscgiserv stop
    /bin/python /etc/keepalived/script/mail_notify.py stop 172.16.0.41 172.16.0.47
    systemctl stop keepalived
fi
echo $status
Script run when keepalived becomes master:
cat /etc/keepalived/script/master.sh
#!/bin/bash
drbdadm primary mfs
mount /dev/drbd0 /data/drbd
mfsmaster start
mfscgiserv start
chmod +x /etc/keepalived/script/*.sh
Copy mail_notify.py from node1 to node2:/etc/keepalived/script/.
cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    notification_email {
        xx@xx.com
    }
    notification_email_from keepalived@xx.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id node1_mfs_master    # string identifying this node, usually the hostname (but not necessarily); used in mail notifications when a failure occurs
}
vrrp_script check_drbd {
    script "/etc/keepalived/script/check_drbd.sh"
    interval 3    # check every 3 seconds
    # weight -40  # if failed, decrease the priority by 40
    # fall 2      # require 2 failures to mark as failed
    # rise 1      # require 1 success to mark as ok
}
# net.ipv4.ip_nonlocal_bind=1
vrrp_instance VI_MFS {
    state BACKUP
    interface ens160
    virtual_router_id 16
    #mcast_src_ip 172.16.0.41
    nopreempt    ## if node2's keepalived has been elected MASTER and node1's keepalived restarts, node1 will not take MASTER back even though its priority is higher; both nodes use state BACKUP, and nopreempt is set only on the higher-priority node
    priority 100
    advert_int 1
    #debug
    authentication {
        auth_type PASS
        auth_pass O7F3CjHVXWP
    }
    virtual_ipaddress {
        172.16.0.47
    }
    track_script {
        check_drbd
    }
}
systemctl start keepalived
systemctl disable keepalived
systemctl enable moosefs-metalogger; systemctl start moosefs-metalogger
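Once keepalived is running, it is worth checking that the VIP actually landed on node1 and that the interface name in the config (ens160) matches this machine; for example:

ip addr show ens160 | grep 172.16.0.47    # the VIP should be present on the active node only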
node2:
Script run when keepalived becomes master:
cat /etc/keepalived/script/master.sh
#!/bin/bash
# set basic parameters
drbd_res=mfs
drbd_driver=/dev/drbd0
drbd_mountpoint=/data/drbd
drbdadm primary $drbd_res
mount $drbd_driver $drbd_mountpoint
mfsmaster start
mfscgiserv start
/bin/python /etc/keepalived/script/mail_notify.py master 172.16.0.42 172.16.0.47
Script run when keepalived becomes backup:
cat /etc/keepalived/script/backup.sh
#!/bin/bash
# set basic parameters
drbd_res=mfs
drbd_mountpoint=/data/drbd
mfsmaster stop
umount $drbd_mountpoint
drbdadm secondary $drbd_res
mfscgiserv stop
/bin/python /etc/keepalived/script/mail_notify.py backup 172.16.0.42 172.16.0.47
cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    notification_email {
        xx@xx.com
    }
    notification_email_from keepalived@xx.com
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id node2_mfs_backup
}
# net.ipv4.ip_nonlocal_bind=1
vrrp_instance mfs {
    state BACKUP
    interface ens160
    virtual_router_id 16
    #mcast_src_ip 172.16.0.42
    priority 80
    advert_int 1
    #debug
    authentication {
        auth_type PASS
        auth_pass O7F3CjHVXWP
    }
    virtual_ipaddress {
        172.16.0.47
    }
    notify_master "/etc/keepalived/script/master.sh"
    notify_backup "/etc/keepalived/script/backup.sh"
}
systemctl start keepalived
systemctl disable keepalived
systemctl enable moosefs-metalogger; systemctl start moosefs-metalogger
Note: do not configure the mfsmaster and keepalived services on node1/node2 to start at boot.
Install the MFS chunk servers
yum -y install moosefs-chunkserver
Configure the MFS chunk servers (node3, node4, node5)
/etc/mfs/mfschunkserver.cfg
grep -v "^#" /etc/mfs/mfschunkserver.cfg
WORKING_USER = mfs
WORKING_GROUP = mfs
SYSLOG_IDENT = mfschunkserver
LOCK_MEMORY = 0
NICE_LEVEL = -19
DATA_PATH = /var/lib/mfs
HDD_CONF_FILENAME = /etc/mfs/mfshdd.cfg
HDD_TEST_FREQ = 10
BIND_HOST = *
MASTER_HOST = mfsmaster
MASTER_PORT = 9420
MASTER_TIMEOUT = 60
MASTER_RECONNECTION_DELAY = 5
# authentication string (used only when master requires authorization)
AUTH_CODE = mfspassword
CSSERV_LISTEN_HOST = *
CSSERV_LISTEN_PORT = 9422
/etc/mfs/mfshdd.cfg specifies the disks (storage paths) used by the chunk server
mkdir -p /data/mfs; chown -R mfs:mfs /data/mfs
grep -v "^#" /etc/mfs/mfshdd.cfg
Path where the chunkserver stores its data; a dedicated LVM logical volume is recommended.
/data/mfs
systemctl enable moosefs-chunkserver; systemctl start moosefs-chunkserver
To start it manually: mfschunkserver start
Use the MFS monitoring page to check whether the chunkserver has connected to the MFS master.
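The same check can also be done from the command line on any host that resolves mfsmaster, assuming the moosefs-cli package is installed (the -SCS switch, which lists connected chunk servers, should be verified against mfscli's help for your version):

mfscli -H mfsmaster -P 9421 -SCS    # list the chunk servers currently connected to the master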
MFS client
Install the MFS client
yum -y install moosefs-client fuse
Mount MFS directories automatically when a client reboots
Shell> vi /etc/rc.local
/sbin/modprobe fuse
/usr/bin/mfsmount /mnt1 -H mfsmaster -S /backup/db
/usr/bin/mfsmount /mnt2 -H mfsmaster -S /app/image
mfsmount -H mfsmaster /mnt3    # mount the MFS root directory at /mnt3
mfsmount -H <host> -P <port> -p <mount point>    # -p prompts for the authentication password
mfsmount -H <host> -P <port> -o mfspassword=PASSWORD <mount point>
Via /etc/fstab (the recommended method)
If mfsmaster is given as a hostname, the client must be able to resolve it, e.g. via /etc/hosts.
Shell> vi /etc/fstab
mfsmount /mnt fuse mfsmaster=MASTER_IP,mfsport=9421,_netdev 0 0    (mounts the MFS root after reboot)
mfsmount /mnt2 fuse mfsmaster=MASTER_IP,mfsport=9421,mfssubfolder=/subdir,_netdev 0 0    (mounts an MFS subdirectory after reboot)
mfsmount /data/upload fuse mfsmaster=mfsmaster,mfsport=9421,mfssubfolder=/pro1,mfspassword=9WpV9odJ,_netdev 0 0    (with password authentication)
## _netdev: mount only once the network is up, to avoid mount failures
With the fstab approach, the following command both tests whether the configuration is correct and mounts all mfsmount entries from fstab:
mount -a -t fuse
To unmount (your working directory must not be inside the mount point /mnt):
umount /mnt
Check the mounts:
df -h -T
Appendix:
MFS master failover test
node1:
Stop the mfsmaster service
mfsmaster stop
Check whether node2 has taken over the VIP.
df -h -T    # check the disk mounts
cat /proc/drbd    # check the DRBD state
drbd-overview    # DRBD details
Check whether the alert mail was received.
If node2's keepalived is currently MASTER and node1 should take the master role back, first make sure the DRBD state on node1 and node2 is fully synchronized (drbd-overview shows this).
Once they are in sync, run /etc/keepalived/script/backup.sh on node2 and stop its keepalived service,
then run /etc/keepalived/script/master.sh on node1 and start its keepalived service,
and finally start the keepalived service on node2 again (the whole sequence is sketched below).
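Put together as commands, the fail-back sequence described above looks roughly like this (script paths follow this article):

# on node2: give up the master role
/etc/keepalived/script/backup.sh
systemctl stop keepalived

# on node1: take the master role back
/etc/keepalived/script/master.sh
systemctl start keepalived

# on node2: rejoin as backup
systemctl start keepalived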
Recovering data with the metalogger
A metalogger node is not mandatory, but it can be used to restore the master when the master is lost, which makes it very useful.
If the master cannot start, repair it with mfsmetarestore -a; if that does not work, copy the backup changelogs from the metalogger to the master and restore from them.
Check the DRBD state
cat /proc/drbd
version: 8.4.9-1 (api:1/proto:86-101)
GIT-hash: 9976da086367a2476503ef7f6b13d4567327a280 build by akemi@Build64R7, 2016-12-04 01:08:48
0: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:4865024 dw:4865024 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:58046556
[>...................] sync'ed: 7.8% (56684/61436)M
finish: 0:23:35 speed: 41,008 (33,784) want: 41,000 K/sec
Line 1: the DRBD version.
Line 2: build information.
Line 3 is the important one:
a leading "0:" refers to device /dev/drbd0,
a leading "1:" would refer to device /dev/data-1-1.
"cs (connection state):" the connection state; Connected means connected.
"ro (roles):" the roles; Primary/Secondary (one primary, one secondary) is the normal state.
"ds (disk states):" the disk states; UpToDate/UpToDate is normal.
"ns (network send):" volume of net data sent to the partner via the network connection.
MFS installation
Add the appropriate key to package manager:
curl "http://ppa.moosefs.com/RPM-GPG-KEY-MooseFS" > /etc/pki/rpm-gpg/RPM-GPG-KEY-MooseFS
Next you need to add the repository entry (MooseFS 3.0):
- For EL7 family:
# curl "http://ppa.moosefs.com/MooseFS-3-el7.repo" > /etc/yum.repos.d/MooseFS.repo
- For EL6 family:
# curl "http://ppa.moosefs.com/MooseFS-3-el6.repo" > /etc/yum.repos.d/MooseFS.repo
For MooseFS 2.0, use:
- For EL7 family:
# curl "http://ppa.moosefs.com/MooseFS-2-el7.repo" > /etc/yum.repos.d/MooseFS.repo
- For EL6 family:
# curl "http://ppa.moosefs.com/MooseFS-2-el6.repo" > /etc/yum.repos.d/MooseFS.repo
After those operations it should be possible to install the packages with following commands:
- For Master Server:
# yum install moosefs-master moosefs-cli moosefs-cgi moosefs-cgiserv
- For Chunkservers:
# yum install moosefs-chunkserver
- For Metaloggers:
# yum install moosefs-metalogger
- For Clients:
# yum install moosefs-client
If you want MooseFS to be mounted automatically when the system starts, first of all install the File System in Userspace (FUSE) utilities:
# yum install fuse
and then add one of the following entries to your /etc/fstab:
- "classic" entry (works with all MooseFS 3.0 and 2.0 versions):
mfsmount /mnt/mfs fuse defaults 0 0
- or "NFS-like" entry (works with MooseFS 3.0.75+):
mfsmaster.host.name: /mnt/mfs moosefs defaults 0 0
Running the system
- To start process manually:
# mfsmaster start
# mfschunkserver start
- For systemd OS family - EL7:
# systemctl start moosefs-master.service
# systemctl start moosefs-chunkserver.service
- For SysV OS family - EL6:
# service moosefs-master start
# service moosefs-chunkserver start
Fixing moosefs-master.service
The default moosefs-master.service unit times out and fails to start; the fix is to comment out the line PIDFile=/var/lib/mfs/.mfsmaster.lock.
cat /usr/lib/systemd/system/moosefs-master.service
[Unit]
Description=MooseFS Master server
Wants=network-online.target
After=network.target network-online.target

[Service]
Type=forking
ExecStart=/usr/sbin/mfsmaster start
ExecStop=/usr/sbin/mfsmaster stop
ExecReload=/usr/sbin/mfsmaster reload
#PIDFile=/var/lib/mfs/.mfsmaster.lock
TimeoutStopSec=60
TimeoutStartSec=60
Restart=no

[Install]
WantedBy=multi-user.target
drbdadm create-md mfs reports: 'mfs' not defined in your config (for this host)
Cause: the hostname does not match the host names defined in /etc/drbd.d/mfs.res; make them identical.
Recommendation when several applications share MFS
When several applications need to use MFS, it is best to mount the MFS root (/) once from a single client, create a directory there for each application, and then have each application's client machines mount only their own subdirectory.
Using the MFS file system
Clients manage the MFS file system with the tools shipped with MFS; they are introduced below.
/usr/local/mfs/bin/mfstools -h
mfs multi tool
usage:
mfstools create - create symlinks (mfs<toolname> -> /usr/local/mfs/bin/mfstools)
tools:
mfsgetgoal        // get the number of copies (goal)
mfssetgoal        // set the number of copies (goal)
mfsgettrashtime   // get the trash retention time
mfssettrashtime   // set the trash retention time
mfscheckfile      // check a file
mfsfileinfo       // file information
mfsappendchunks
mfsdirinfo        // directory information
mfsfilerepair     // repair a file
mfsmakesnapshot   // make a snapshot
mfsgeteattr       // get extra attributes
mfsseteattr       // set extra attributes
mfsdeleattr       // delete extra attributes
deprecated tools: // recursive variants
mfsrgetgoal = mfsgetgoal -r
mfsrsetgoal = mfssetgoal -r
mfsrgettrashtime = mfsgettrashtime -r
mfsrsettrashtime = mfssettrashtime -r
Mounting the file system
A MooseFS file system is mounted with the following command:
mfsmount mountpoint [-d] [-f] [-s] [-m] [-n] [-p] [-H MASTER] [-P PORT] [-S PATH] [-o OPT[,OPT...]]
-H MASTER: the IP address of the master server.
-P PORT: the master server port, matching the MATOCL_LISTEN_PORT value in mfsmaster.cfg; it can be omitted if the master uses the default port.
-S PATH: the MFS subdirectory to mount; the default is /, i.e. the whole MFS tree.
mountpoint: the previously created directory on which MFS is mounted.
Starting mfsmount with the -m or -o mfsmeta option mounts an auxiliary MFSMETA file system instead. Its purpose is to recover files that were accidentally deleted from the MooseFS volume, or files that were removed to free disk space and have already passed the trash retention period. For example:
/usr/local/mfs/bin/mfsmount -m /MFS_meta/ -H 172.16.18.137
Setting the number of copies
The goal is the number of copies (replicas) in which a file is stored. After setting it, the goal can be verified with mfsgetgoal and changed with mfssetgoal.
mfssetgoal 3 /MFS_data/test/
mfsgetgoal /MFS_data/test/
mfsgetgoal -r and mfssetgoal -r apply the same operations recursively to an entire directory tree and are equivalent to the deprecated mfsrgetgoal/mfsrsetgoal commands. The actual number of copies can be verified with mfscheckfile and mfsfileinfo.
Note the following special cases:
- A zero-length file returns an empty result from mfscheckfile even if a non-zero goal has been set; once the file is given content, copies are created according to the goal, and if the file is then emptied again the copies remain as empty files.
- When the goal of an existing file is changed, copies are added or removed accordingly, with some delay; this can be verified with mfscheckfile.
- A goal set on a directory is inherited by files and subdirectories created in it afterwards, but does not change the number of copies of files and directories that already exist.
mfsdirinfo shows a summary of an entire directory tree.
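A small worked example, assuming MFS is mounted at /mnt/mfs (paths and file names are only illustrative):

mkdir /mnt/mfs/test
mfssetgoal -r 3 /mnt/mfs/test     # everything created under test/ will keep 3 copies
cp /etc/hosts /mnt/mfs/test/hosts
mfsgetgoal /mnt/mfs/test/hosts    # should report goal 3
mfsfileinfo /mnt/mfs/test/hosts   # lists each chunk and the chunkservers holding its copies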
Trash (garbage bin)
The time a deleted file is kept in the "trash bin" is its quarantine time. It can be checked with mfsgettrashtime and set with mfssettrashtime.
mfssettrashtime 64800 /MFS_data/test/test1
mfsgettrashtime /MFS_data/test/test1
The unit is seconds (useful values: 1 hour = 3600 s, 24 hours (1 day) = 86400 s, 1 week = 604800 s). As with the number of copies, a trash time set on a directory is inherited by newly created files and directories. A value of 0 means that a deleted file is removed immediately and can no longer be recovered.
Deleted files can be accessed by mounting the MFSMETA file system separately. In particular it contains the directory /trash (holding information about deleted files that can still be restored) and /trash/undel (used to recover them). Only the administrator (uid 0, normally root) may access MFSMETA.
/usr/local/mfs/bin/mfsmount -m /MFS_meta/ -H 172.16.18.137
The name of a deleted file is still visible in the trash directory; it consists of an 8-digit hexadecimal i-node number and the name of the deleted file, joined with "|" instead of "/". If the resulting name exceeds the operating-system limit (usually 255 characters), part of the name is cut off. A deleted file can still be read and written through its full path name from the mount point.
Moving such a file into the trash/undel subdirectory restores the original file to its proper path in the MooseFS file system (provided the path has not changed). The recovery fails if a new file with the same name already exists at that path.
Deleting a file from the trash frees the space it previously occupied (the deletion is delayed because the data is removed asynchronously).
MFSMETA also contains a reserved directory holding files that were deleted but are still held open. Once users close those open files, the entries in reserved are removed and the file data is deleted immediately. Files in reserved are named the same way as in trash, but no other operations are possible on them.
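Putting the above together, recovering an accidentally deleted file might look like this (the mount point /MFS_meta and the file name are hypothetical):

mkdir -p /MFS_meta
mfsmount -m /MFS_meta -H mfsmaster                 # mount the MFSMETA filesystem (root only)
find /MFS_meta/trash -name '*myfile*'              # deleted entries are named "<inode>|path|to|myfile"
mv '/MFS_meta/trash/0000ABCD|data|myfile.txt' /MFS_meta/trash/undel/    # moving it into undel/ restores it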
Snapshots
Another MooseFS feature is making snapshots of files or directory trees with the mfsmakesnapshot tool:
mfsmakesnapshot source ... destination
mfsmakesnapshot creates a copy of a file or group of files in a single operation; any later modification of the source files does not affect the snapshot, i.e. writing to a source file does not modify the copy (and vice versa).
mfsappendchunks can also be used:
mfsappendchunks destination-file source-file ...
When several source files are given, their snapshots are appended to the same destination file (each appended file starts at a chunk boundary).
MFS cluster maintenance
Starting the MFS cluster
The safe way to start a MooseFS cluster (avoiding any read or write errors, corrupt data or similar problems) is to follow these steps in order:
- start the mfsmaster process,
- start all mfschunkserver processes,
- start the mfsmetalogger process (if a metalogger is configured),
- once all chunkservers have connected to the MooseFS master, any number of clients can mount the exported file system with mfsmount (whether all chunkservers are connected can be checked in the master's log or in the CGI monitor).
Stopping the MFS cluster
To stop a MooseFS cluster safely:
- unmount the MooseFS file system on all clients (with umount or an equivalent command),
- stop the chunkserver processes with mfschunkserver stop,
- stop the metalogger process with mfsmetalogger stop,
- stop the master process with mfsmaster stop.
Chunkserver maintenance
If every file has a goal of at least 2 and there are no under-goal files (this can be checked with mfsgetgoal -r and mfsdirinfo), then a single chunkserver can be stopped or restarted at any time. Whenever another chunkserver needs to be stopped or restarted later, first make sure the previously stopped chunkserver has reconnected and that there are no under-goal chunks; the checks are sketched below.
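A sketch of those pre-maintenance checks, run from any client with MFS mounted (the mount point /mnt/mfs is hypothetical):

mfsgetgoal -r /mnt/mfs    # per-goal file counts for the whole tree (look for files with goal < 2)
mfsdirinfo /mnt/mfs       # summary of the tree: inodes, files, chunks and sizes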
MFS metadata backup
The metadata normally consists of two parts:
- the main metadata file metadata.mfs, which is named metadata.mfs.back while mfsmaster is running,
- the changelogs changelog.*.mfs, which store the file changes of the last N hours (N is set with the BACK_LOGS parameter in mfsmaster.cfg).
The main metadata file needs to be backed up regularly; the frequency depends on how many hours of changelogs are kept. The changelogs are replicated automatically in real time; since version 1.6 all of this is done by the metalogger.
Recovering the MFS master
Once mfsmaster crashes (for example because of a host or power failure), the last changelog has to be merged into the main metadata file. This is done with the mfsmetarestore
tool; the simplest way is:
mfsmetarestore -a
If the master data is stored somewhere other than the location compiled into MooseFS, the path has to be given with the -d option, e.g.:
mfsmetarestore -a -d /opt/mfsmaster
Recovering the master from the metalogger
If mfsmetarestore -a cannot repair the metadata, recovery via the metalogger may fail as well; that situation has not been encountered here and is not covered further.
- Retrieve the metadata.mfs.back file, either from a backup or from the metalogger host (if the metalogger service was running), and place it in the master's data directory, normally {prefix}/var/mfs.
- Copy the latest metadata files from any server that was running the metalogger service before the master went down into the mfsmaster data directory.
- Merge the metadata and changelogs with mfsmetarestore, either in automatic mode (mfsmetarestore -a) or in manual mode:
mfsmetarestore -m metadata.mfs.back -o metadata.mfs changelog_ml.*.mfs
Or: force the use of metadata.mfs.back to create metadata.mfs; the master will then start, but it is impossible to tell exactly which data has been lost.
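As a sketch, restoring on the master from the metalogger's copies might look like this; the metalogger file names under /var/lib/mfs (metadata_ml.mfs.back, changelog_ml.*.mfs) are the usual defaults and should be verified on your installation:

# on the master, with the DRBD volume mounted (data directory /data/drbd/mfs in this setup)
scp "node2:/var/lib/mfs/metadata_ml.mfs.back" /data/drbd/mfs/metadata.mfs.back
scp "node2:/var/lib/mfs/changelog_ml.*.mfs" /data/drbd/mfs/
cd /data/drbd/mfs
mfsmetarestore -m metadata.mfs.back -o metadata.mfs changelog_ml.*.mfs
chown -R mfs:mfs /data/drbd/mfs
mfsmaster start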
Automated Failover
When MooseFS is used in production, the master node must be made highly available. Using ucarp is a fairly mature option, as is DRBD + (heartbeat | keepalived). ucarp is similar to keepalived: it detects the cluster state through health checks between the primary and backup servers and carries out the corresponding actions. In addition, the commercial edition of MooseFS already supports a dual-master configuration, which removes this single point of failure.
Upgrading moosefs-chunkserver
systemctl stop moosefs-chunkserver
yum -y update moosefs-chunkserver
systemctl start moosefs-chunkserver    # fails to start, with the error below
mfschunkserver -u
open files limit has been set to: 16384
working directory: /var/lib/mfs
config: using default value for option 'FILE_UMASK' - '23'
lockfile created and locked
config: using default value for option 'LIMIT_GLIBC_MALLOC_ARENAS' - '4'
setting glibc malloc arena max to 4
setting glibc malloc arena test to 4
config: using default value for option 'DISABLE_OOM_KILLER' - '1'
initializing mfschunkserver modules ...
config: using default value for option 'HDD_LEAVE_SPACE_DEFAULT' - '256MiB'
hdd space manager: data folder '/data/mfs/' already locked (used by another process)
hdd space manager: no hdd space defined in /etc/mfs/mfshdd.cfg file
init: hdd space manager failed !!!
error occurred during initialization - exiting
ps -ef |grep mfs
mfs 51699 1 1 2017 ? 1-15:27:28 /usr/sbin/mfschunkserver start
Kill this leftover process,
then remove the lock file: rm -rf /var/lib/mfs/.mfschunkserver.lock
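A full clean restart after such a failed upgrade could therefore look like this (the old pre-upgrade process is the one shown by ps above):

systemctl stop moosefs-chunkserver
pkill -f '/usr/sbin/mfschunkserver'         # stop the leftover pre-upgrade process
rm -f /var/lib/mfs/.mfschunkserver.lock     # remove the stale lock file
systemctl start moosefs-chunkserver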