1. Introduction
corosync is the cluster framework/messaging engine, pacemaker is the high-availability cluster resource manager, and crmsh is a command-line tool for pacemaker.
1.1 corosync
Corosync lets you define how messages are passed, and over which protocols, through a simple configuration file. It is a relatively young project, released in 2008, but it is not entirely new: in 2002 the OpenAIS project, having grown too large, was split into two sub-projects, and the part that implements HA heartbeat/message transport became Corosync; roughly 60% of its code comes from OpenAIS. Corosync provides complete HA membership and messaging, but for richer and more complex functionality you would still need OpenAIS. Corosync is the direction of future development, and newer projects generally adopt it. hb_gui provides good graphical HA management; another graphical option is the luci+ricci suite from RHCS.
1.2 pacemaker
pacemaker is an open-source high-availability cluster resource manager (CRM). In the HA stack it sits at the resource-management / resource-agent (RA) layer; it does not provide the underlying heartbeat/message transport itself, so to communicate with peer nodes it relies on a lower messaging layer to deliver its information. It is usually combined with corosync.
1.3 Summary
The resource management layer (pacemaker) arbitrates which node is active, moves IP addresses, and drives the local resource management system; the messaging layer (heartbeat, corosync) carries the heartbeat; Resource Agents (think of them as service scripts) start, stop, and report the status of services. Different services may run on different nodes; the remaining standby nodes form the failover domain, and which node is "primary" is only relative, as is third-party arbitration. Vote system: the majority wins. When a node fails and its resources move to a standby node, that is called failover; when the failed node is repaired and the resources move back to it, that is called failback.
CRM: cluster resource manager, i.e. pacemaker (the name literally means "cardiac pacemaker"). Every node runs a crmd daemon (5560/tcp). There are command-line interfaces, crmsh and pcs (the latter introduced by Red Hat in the heartbeat v3 era), which edit the XML configuration so that crmd can recognize it and manage the resources. In other words, crmsh and pcs are interchangeable.
Resource Agent, OCF (Open Cluster Framework).
primitive: a primitive resource, of which only one instance runs in the cluster. clone: a clone resource, which can run multiple instances across the cluster. Every resource carries a score (priority).
Positive infinity plus negative infinity equals negative infinity. Hostnames must match the names that DNS resolves.
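To make the primitive/clone and score ideas concrete, here is a small hedged crmsh sketch; the Dummy resources, constraint names and the score of 100 are placeholders for illustration only and are not part of the cluster configured later in this article:

# primitive: a single instance in the whole cluster
crm(live)configure# primitive test-res ocf:pacemaker:Dummy op monitor interval=30s
# location constraints use scores: a positive score prefers a node, -inf forbids it
crm(live)configure# location prefer-node1 test-res 100: freeswitch-node1
# clone: run an instance of another dummy resource on every node
crm(live)configure# primitive test-res2 ocf:pacemaker:Dummy
crm(live)configure# clone test-res2-clone test-res2
crm(live)configure# verify
crm(live)configure# commit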
2. Environment
Host IP      | Hostname         | Installed components
192.168.2.11 | freeswitch-node1 | corosync+pacemaker+pcsd+crmsh
192.168.2.12 | freeswitch-node2 | corosync+pacemaker+pcsd+crmsh
2.1 Environment preparation
2.1.1 Hostname resolution
Do this on both machines:
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4
::1          localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.2.11 freeswitch-node1
192.168.2.12 freeswitch-node2
2.1.2 Passwordless SSH login
Do this on both machines, in both directions:
[root@freeswitch-node1 ~]# ssh-keygen
[root@freeswitch-node1 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@freeswitch-node2
[root@freeswitch-node2 ~]# ssh-keygen
[root@freeswitch-node2 ~]# ssh-copy-id -i /root/.ssh/id_rsa root@freeswitch-node1
2.1.3 Time synchronization
Do this on both machines. There are many ways to synchronize time; this article simply syncs the system clock from the hardware clock.
[root@freeswitch-node1 ~]# hwclock -s
[root@freeswitch-node2 ~]# hwclock -s
3. Install corosync && pacemaker
3.1 Install corosync && pacemaker
Do this on both machines. The stock CentOS repositories are sufficient; alternatively, you can install only pcs.
[root@freeswitch-node1 ~]# yum install corosync pacemaker -y
[root@freeswitch-node1 ~]# cd /etc/corosync
[root@freeswitch-node1 corosync]# cp corosync.conf.example corosync.conf
[root@freeswitch-node1 corosync]# vi corosync.conf
# Modify the following:
bindnetaddr: 192.168.2.0          # change to the network the machines are on
# Add the following:
service {
    ver: 0
    name: pacemaker               # start pacemaker together with corosync
}
[root@freeswitch-node1 corosync]# mv /dev/{random,random.bak}
[root@freeswitch-node1 corosync]# ln -s /dev/urandom /dev/random
[root@freeswitch-node1 corosync]# corosync-keygen
Corosync Cluster Engine Authentication key generator.
Gathering 1024 bits for key from /dev/random.
Press keys on your keyboard to generate entropy.
Writing corosync key to /etc/corosync/authkey.
[root@freeswitch-node1 corosync]# scp corosync.conf authkey root@freeswitch-node2:/etc/corosync/
[root@freeswitch-node2 corosync]# scp corosync.conf authkey root@freeswitch-node1:/etc/corosync/
# Copy the files to each other; mind the hostnames. In fact it is enough for one node to push them to the others.
[root@freeswitch-node1 corosync]# systemctl start corosync
4. Install the pcs management tool
[root@freeswitch-node1 corosync]# yum -y install pcs
[root@freeswitch-node1 corosync]# systemctl start pcsd
[root@freeswitch-node1 corosync]# echo "passw0rd"|passwd --stdin hacluster
[root@freeswitch-node2 corosync]# yum -y install pcs
[root@freeswitch-node2 corosync]# systemctl start pcsd
[root@freeswitch-node2 corosync]# echo "passw0rd"|passwd --stdin hacluster
Make sure pcsd is started on both nodes before doing the next steps. (You may need to stop the firewall or add allow rules.)
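If firewalld is running, one option (a hedged sketch; adjust to your own firewall policy) is to allow the built-in high-availability service instead of disabling the firewall entirely:

[root@freeswitch-node1 corosync]# firewall-cmd --permanent --add-service=high-availability
[root@freeswitch-node1 corosync]# firewall-cmd --reload
# Or, as in many lab setups, simply stop the firewall:
[root@freeswitch-node1 corosync]# systemctl stop firewalld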
[root@freeswitch-node1 corosync]# pcs cluster auth freeswitch-node2 freeswitch-node1
Username: hacluster
Password:
freeswitch-node2: Authorized
freeswitch-node1: Authorized
[root@freeswitch-node2 corosync]# pcs cluster auth freeswitch-node1 freeswitch-node2
freeswitch-node1: Already authorized
freeswitch-node2: Already authorized
4.1 Create the cluster
Run on freeswitch-node1:
[root@freeswitch-node1 corosync]# pcs cluster setup --name mycluster freeswitch-node1 freeswitch-node2 --force
If this step reports an error, check whether pacemaker is running; start it if it is not. Restarting the services usually clears the problem.
After the command above completes, check the configuration file on freeswitch-node2; it has already been synchronized:
[root@freeswitch-node2 corosync]# cat corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: mycluster
    transport: udpu
}
nodelist {
    node {
        ring0_addr: freeswitch-node1
        nodeid: 1
    }
    node {
        ring0_addr: freeswitch-node2
        nodeid: 2
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
Do the following on both machines:
[root@freeswitch-node1 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
# The cluster/pcsd daemons have to be started on each node separately.
[root@freeswitch-node1 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition WITHOUT quorum
 Last updated: Thu Jun 18 11:33:51 2020
 Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
 2 nodes configured
 0 resources configured
PCSD Status:
  freeswitch-node2: Online
  freeswitch-node1: Online
[root@freeswitch-node2 ~]# pcs cluster start
Starting Cluster (corosync)...
Starting Cluster (pacemaker)...
[root@freeswitch-node2 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
 Last updated: Thu Jun 18 11:34:29 2020
 Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
 2 nodes configured
 0 resources configured
PCSD Status:
  freeswitch-node1: Online
  freeswitch-node2: Online
[root@freeswitch-node1 corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.2.11
        status  = ring 0 active with no faults
[root@freeswitch-node2 corosync]# corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
        id      = 192.168.2.12
        status  = ring 0 active with no faults
[root@freeswitch-node1 corosync]# corosync-cmapctl |grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.2.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.2.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@freeswitch-node2 corosync]# corosync-cmapctl |grep members      # check the current cluster membership
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.2.11)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.2.12)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@freeswitch-node1 ~]# pcs status
# DC (Designated Coordinator) is the designated coordinator node.
# Every node runs a CRM; one of them is elected DC and acts as the brain of the cluster.
# The CIB (cluster information base) held by the DC is the master CIB; the other nodes hold replicas.
Cluster name: mycluster
WARNINGS:
No stonith devices and stonith-enabled is not false
# stonith is enabled but no fencing device is configured; fencing is what "shoots the other node in the head" when resources are contended.
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Thu Jun 18 11:48:22 2020
Last change: Thu Jun 18 11:33:28 2020 by hacluster via crmd on freeswitch-node1
2 nodes configured
0 resources configured
Online: [ freeswitch-node1 freeswitch-node2 ]
No resources
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/disabled
[root@freeswitch-node2 corosync]# pcs status corosync
Membership information
----------------------
    Nodeid      Votes Name
         1          1 freeswitch-node1
         2          1 freeswitch-node2 (local)
[root@freeswitch-node1 ~]# crm_verify -L -V      # crm_verify validates the current cluster configuration
   error: unpack_resources:     Resource start-up disabled since no STONITH resources have been defined
   error: unpack_resources:     Either configure some or disable STONITH with the stonith-enabled option
   error: unpack_resources:     NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
# The errors above are self-explanatory: either configure a STONITH device or disable stonith. Run the command below to disable it.
[root@freeswitch-node1 ~]# pcs property set stonith-enabled=false
[root@freeswitch-node1 ~]# pcs property list      # show the properties that have been changed; use pcs property --all for the full list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.21-4.el7-f14e36fd43
 have-watchdog: false
 stonith-enabled: false
4.2 Install the crmsh command-line cluster management tool
Do this on both machines:
[root@freeswitch-node1 corosync]# cd /etc/yum.repos.d/
[root@freeswitch-node1 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
[root@freeswitch-node1 ~]# yum install crmsh -y
[root@freeswitch-node1 ~]# yum -y install httpd
[root@freeswitch-node1 ~]# systemctl start httpd      # do not enable httpd at boot; crm must manage it itself
[root@freeswitch-node1 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
[root@freeswitch-node2 corosync]# cd /etc/yum.repos.d/
[root@freeswitch-node2 yum.repos.d]# wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-7/network:ha-clustering:Stable.repo
[root@freeswitch-node2 ~]# yum install crmsh -y
[root@freeswitch-node2 ~]# yum -y install httpd
[root@freeswitch-node2 ~]# systemctl start httpd      # do not enable httpd at boot; crm must manage it itself
[root@freeswitch-node2 ~]# echo "<h1>corosync pacemaker on the openstack</h1>" >/var/www/html/index.html
At this point the web page on both nodes can be reached from a browser.
Install httpd on both nodes. Note: you may stop the httpd service manually, but do not restart it yourself, and do not enable it at boot, because the resource manager decides when these services run or stop (a quick check is sketched below).
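As a small hedged sketch (standard systemctl/curl commands; the output is omitted), you can confirm that httpd is not enabled at boot and fetch the test page via each node's own address:

[root@freeswitch-node1 ~]# systemctl is-enabled httpd      # should print "disabled"
[root@freeswitch-node1 ~]# systemctl disable httpd         # only needed if it was enabled earlier
[root@freeswitch-node1 ~]# curl http://192.168.2.11/       # use http://192.168.2.12/ on node2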
4.3 Check the configuration
Verify on both nodes:
[root@freeswitch-node1 ~]# crm
crm(live)# status                  # make sure all nodes are online before running the commands below
crm(live)# ra
crm(live)ra# list systemd
httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive
[root@freeswitch-node2 ~]# crm
crm(live)# status                  # make sure all nodes are online before running the commands below
crm(live)# ra
crm(live)ra# list systemd
httpd
crm(live)ra# help info
crm(live)ra# classes
crm(live)ra# cd
crm(live)# configure
crm(live)configure# help primitive
4.4 Define the high-availability resources
Run on freeswitch-node1:
crm(live)ra# classes
crm(live)ra# list ocf              # ocf is one of the classes
crm(live)ra# info ocf:IPaddr       # IPaddr is a resource agent of the ocf class (provider: heartbeat)
crm(live)ra# cd ..
crm(live)# configure
crm(live)configure# primitive FloadtIP ocf:IPaddr params ip=192.168.2.10    # the VIP: a highly available IP that floats between nodes
crm(live)configure# show
node 1: freeswitch-node1
node 2: freeswitch-node2
primitive FloadtIP IPaddr \
        params ip=192.168.2.10
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.21-4.el7-f14e36fd43 \
        cluster-infrastructure=corosync \
        cluster-name=mycluster \
        stonith-enabled=false
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd ../
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Thu Jun 18 14:23:04 2020
Last change: Thu Jun 18 14:22:50 2020 by root via cibadmin on freeswitch-node1
2 nodes configured
1 resource configured
Online: [ freeswitch-node1 freeswitch-node2 ]
Full list of resources:
 FloadtIP       (ocf::heartbeat:IPaddr):        Started freeswitch-node1
# The steps above add the floating-IP resource.
crm(live)# configure
crm(live)configure# primitive WebServer systemd:httpd      # systemd is one of the classes shown by the classes command
crm(live)configure# verify
crm(live)configure# commit
# The steps above add the WebServer resource.
crm(live)# configure
crm(live)configure# primitive FreeSwitch systemd:freeswitch
crm(live)configure# verify
crm(live)configure# commit
# The steps above add the FreeSwitch resource (see the previous article for setting FreeSwitch up as a service).
crm(live)configure# help group
crm(live)configure# group HAService FloadtIP WebServer FreeSwitch    # order matters: WebServer/FreeSwitch run wherever the IP is
crm(live)configure# verify
crm(live)configure# commit
# The steps above bind the floating IP and the services into one resource group.
crm(live)# node standby     # put the current node into standby to switch over to the other node
                            # equivalent to running "crm node standby" directly as root
4.5 Enable start at boot
At this point neither the pcsd service nor the cluster is set to start at boot.
# systemctl enable pcsd            # enable pcsd at boot (things still work without enabling it, you just have to start it manually)
# pcs cluster enable --all         # start the cluster at boot on all nodes
4.6 Define resources with monitoring
So far the resources are not monitored: if you stop httpd directly from the shell as root, crm status still shows it as Started. We can redefine the resources with monitoring added.
Monitoring must be declared together with the resource in the primitive definition (at the configure level), so first delete the previously defined resources, then define them again.
crm(live)# resource
crm(live)resource# show
Resource Group: HAService
FloadtIP (ocf::heartbeat:IPaddr): Started
WebServer (systemd:httpd): Started
FreeSwitch (systemd:freeswitch): Started
crm(live)resource# stop HAService      # stop all resources in the group
crm(live)resource# show
Resource Group: HAService
FloadtIP (ocf::heartbeat:IPaddr): Started (disabled)
WebServer (systemd:httpd): Stopping (disabled)
FreeSwitch (systemd:freeswitch): Stopped (disabled)
crm(live)configure# edit      # edit the resource definitions and delete the 3 primitives and the group (an alternative is sketched below)
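If you prefer not to edit the configuration by hand, a hedged alternative (same resource names as above, with the group's resources already stopped) is to remove the objects by ID with crmsh's delete command:

crm(live)configure# delete HAService                        # remove the group first
crm(live)configure# delete FloadtIP WebServer FreeSwitch    # then the primitives
crm(live)configure# verify
crm(live)configure# commit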
Redefine the resources with monitoring: check every 60 seconds with a 20-second timeout. If the timeout is smaller than the advised value, verify prints a warning (shown below).
crm(live)# configure
crm(live)configure# primitive FloadtIP ocf:IPaddr params ip=192.168.2.10 op monitor timeout=20s interval=60s
crm(live)configure# primitive WebServer systemd:httpd op monitor timeout=20s interval=60s
crm(live)configure# primitive FreeSwitch systemd:freeswitch op monitor timeout=20s interval=60s
crm(live)configure# group HAService FloadtIP WebServer FreeSwitch
crm(live)configure# property no-quorum-policy=ignore
# Ignore loss of quorum; with an odd number of nodes it is better not to set this.
crm(live)configure# verify
WARNING: FreeSwitch: specified timeout 20s for monitor is smaller than the advised 100
WARNING: WebServer: specified timeout 20s for monitor is smaller than the advised 100
crm(live)configure# commit
crm(live)configure# cd
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Jun 24 11:04:50 2020
Last change: Wed Jun 24 11:04:41 2020 by root via cibadmin on freeswitch-node1
2 nodes configured
3 resources configured
Online: [ freeswitch-node1 freeswitch-node2 ]
Full list of resources:
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started freeswitch-node1
     WebServer  (systemd:httpd):        Started freeswitch-node1
     FreeSwitch (systemd:freeswitch):   Started freeswitch-node1
Test it: stop a service manually, and after a short while it is started again automatically.
[root@freeswitch-node1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:a6:29:82 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.11/16 brd 192.168.255.255 scope global noprefixroute ens192
       valid_lft forever preferred_lft forever
    inet 192.168.2.10/16 brd 192.168.255.255 scope global secondary ens192
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fea6:2982/64 scope link
       valid_lft forever preferred_lft forever
[root@freeswitch-node1 ~]# systemctl stop httpd
crm(live)# status
Stack: corosync
Current DC: freeswitch-node1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Wed Jun 24 11:19:13 2020
Last change: Wed Jun 24 11:15:22 2020 by root via crm_attribute on freeswitch-node2
2 nodes configured
3 resources configured
Node freeswitch-node2: standby
Online: [ freeswitch-node1 ]
Full list of resources:
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started freeswitch-node1
     WebServer  (systemd:httpd):        Started freeswitch-node1
     FreeSwitch (systemd:freeswitch):   Started freeswitch-node1
Failed Resource Actions:
* WebServer_monitor_60000 on freeswitch-node1 'not running' (7): call=50, status=complete, exitreason='',
    last-rc-change='Wed Jun 24 11:08:41 2020', queued=0ms, exec=0ms
* FreeSwitch_monitor_60000 on freeswitch-node1 'not running' (7): call=60, status=complete, exitreason='',
    last-rc-change='Wed Jun 24 11:10:50 2020', queued=0ms, exec=0ms
4.7 Clear resource failure records
[Note] Once the httpd service has been restored, remember to clean up the resource's failure records, otherwise the resource cannot be started.
crm(live)# resource
crm(live)resource# cleanup HAService
Cleaned up FloadtIP on freeswitch-node2
Cleaned up FloadtIP on freeswitch-node1
Cleaned up WebServer on freeswitch-node2
Cleaned up WebServer on freeswitch-node1
Cleaned up FreeSwitch on freeswitch-node2
Cleaned up FreeSwitch on freeswitch-node1
Waiting for 1 reply from the CRMd. OK
crm(live)resource# show
 Resource Group: HAService
     FloadtIP   (ocf::heartbeat:IPaddr):        Started
     WebServer  (systemd:httpd):        Started
     FreeSwitch (systemd:freeswitch):   Started
5 Summary
1. After a resource's service has been restored, always remember to clean up the resource's failure records, otherwise the resource cannot be started.
2. When building a two-node HA cluster with corosync+pacemaker, remember to set the global properties that disable stonith and ignore the requirement of more than half of the votes.
3. Watch out for the effect of selinux and iptables (firewalld.service) on the services.
4. Make sure the nodes resolve each other through /etc/hosts.
5. Node clocks must be kept in sync.
6. The nodes must be able to reach each other over passwordless SSH.
7. With 2 nodes (or any even number of nodes), a loss of quorum can prevent resources from being moved. There are four ways to deal with this (a hedged sketch of options 7.1 and 7.4 follows the list):
7.1. Add a ping node.
7.2. Add a quorum (arbitration) disk.
7.3. Use an odd number of cluster nodes.
7.4. Simply ignore the loss of quorum: property no-quorum-policy=ignore
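As a rough sketch (the gateway address 192.168.2.1 and the resource, clone and constraint names are made up for illustration and are not part of the cluster built above), option 7.4 is a single property, and option 7.1 is commonly implemented with the ocf:pacemaker:ping agent plus a score-based location constraint:

# 7.4 - ignore loss of quorum (what this article does):
crm(live)configure# property no-quorum-policy=ignore
# 7.1 - hypothetical ping-node sketch: ping an outside address (e.g. the gateway)
# and keep the HAService group only on nodes that can still reach it.
crm(live)configure# primitive p_ping ocf:pacemaker:ping params host_list=192.168.2.1 multiplier=1000 op monitor interval=15s timeout=60s
crm(live)configure# clone c_ping p_ping
crm(live)configure# location l_ping HAService rule -inf: not_defined pingd or pingd lte 0
crm(live)configure# verify
crm(live)configure# commit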