Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#bootstrap-zookeeper
A typical Flink job-processing flow looks like this:

It is easy to see that the JobManager is a single point of failure (SPOF). Making Flink highly available therefore mainly means making the JobManager highly available. Flink clusters are deployed in different modes, Standalone and on YARN; this article covers Standalone mode.
JobManager HA is implemented with ZooKeeper, so a ZooKeeper cluster must be set up first. The HA metadata is stored in HDFS, so a Hadoop cluster is also required. Finally, the Flink configuration files are modified.
1. Deploy the ZooKeeper cluster
Reference: http://www.cnblogs.com/liugh/p/6671460.html
2. Deploy the Hadoop cluster
Reference: http://www.cnblogs.com/liugh/p/6624872.html
3. Deploy the Flink cluster
Reference: http://www.cnblogs.com/liugh/p/7446295.html
4. Modify conf/flink-conf.yaml
4.1 Required options
high-availability: zookeeper
high-availability.zookeeper.quorum: DEV-SH-MAP-01:2181,DEV-SH-MAP-02:2181,DEV-SH-MAP-03:2181
high-availability.zookeeper.storageDir: hdfs:///flink/ha
4.2 Optional options
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /map_flink
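Putting the required and optional keys together, the HA section of conf/flink-conf.yaml looks roughly like this (hostnames and the /map_flink cluster-id are taken from the example above; adjust them to your own environment):

```yaml
# Enable ZooKeeper-based high availability
high-availability: zookeeper
# ZooKeeper quorum used for leader election
high-availability.zookeeper.quorum: DEV-SH-MAP-01:2181,DEV-SH-MAP-02:2181,DEV-SH-MAP-03:2181
# HDFS directory where JobManager metadata is persisted
high-availability.zookeeper.storageDir: hdfs:///flink/ha
# Optional: ZooKeeper root node shared by all Flink clusters
high-availability.zookeeper.path.root: /flink
# Optional: sub-node for this particular cluster
high-availability.zookeeper.path.cluster-id: /map_flink
```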
After the changes, use the scp command to copy flink-conf.yaml to the other nodes.
5. Modify conf/masters
List the nodes (and Web UI ports) that should run a JobManager:
dev-sh-map-01:8081
dev-sh-map-02:8081
After the changes, use the scp command to copy the masters file to the other nodes.
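To illustrate the conf/masters format (one host:port entry per line, blank lines and comments ignored), here is a small hypothetical Python helper — not part of Flink — that parses such a file into (host, port) pairs. The fallback port 8081 is an assumption matching the default Web UI port used above:

```python
# Hypothetical helper (not part of Flink): parse the contents of a
# conf/masters file into a list of (host, port) tuples.
DEFAULT_WEBUI_PORT = 8081  # assumed default, as used in the example above

def parse_masters(text):
    masters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        host, sep, port = line.partition(":")
        masters.append((host, int(port) if sep else DEFAULT_WEBUI_PORT))
    return masters

print(parse_masters("dev-sh-map-01:8081\ndev-sh-map-02:8081"))
# → [('dev-sh-map-01', 8081), ('dev-sh-map-02', 8081)]
```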
6. Modify conf/zoo.cfg
# ZooKeeper quorum peers
server.1=DEV-SH-MAP-01:2888:3888
server.2=DEV-SH-MAP-02:2888:3888
server.3=DEV-SH-MAP-03:2888:3888
After the changes, use the scp command to copy the zoo.cfg file to the other nodes.
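For reference, the other settings in conf/zoo.cfg can stay at the values Flink ships with; a typical complete file would look roughly like this (the dataDir below is an assumption — point it at a persistent location in production):

```ini
# Length of a single tick in milliseconds
tickTime=2000
# Ticks allowed for peers to connect to the leader
initLimit=10
# Ticks allowed for a peer to sync with the leader
syncLimit=5
# Directory where ZooKeeper stores its data (assumed path; use a persistent one)
dataDir=/tmp/zookeeper
# Port for client connections (matches the :2181 quorum addresses above)
clientPort=2181
# ZooKeeper quorum peers
server.1=DEV-SH-MAP-01:2888:3888
server.2=DEV-SH-MAP-02:2888:3888
server.3=DEV-SH-MAP-03:2888:3888
```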
7. Start HDFS
[root@DEV-SH-MAP-01 conf]# start-dfs.sh
Starting namenodes on [DEV-SH-MAP-01]
DEV-SH-MAP-01: starting namenode, logging to /usr/hadoop-2.7.3/logs/hadoop-root-namenode-DEV-SH-MAP-01.out
DEV-SH-MAP-02: starting datanode, logging to /usr/hadoop-2.7.3/logs/hadoop-root-datanode-DEV-SH-MAP-02.out
DEV-SH-MAP-03: starting datanode, logging to /usr/hadoop-2.7.3/logs/hadoop-root-datanode-DEV-SH-MAP-03.out
DEV-SH-MAP-01: starting datanode, logging to /usr/hadoop-2.7.3/logs/hadoop-root-datanode-DEV-SH-MAP-01.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-DEV-SH-MAP-01.out
8. Start the ZooKeeper cluster
[root@DEV-SH-MAP-01 conf]# start-zookeeper-quorum.sh
Starting zookeeper daemon on host DEV-SH-MAP-01.
Starting zookeeper daemon on host DEV-SH-MAP-02.
Starting zookeeper daemon on host DEV-SH-MAP-03.
Note: the start-zookeeper-quorum.sh command used here is a script in FLINK_HOME/bin.
9. Start the Flink cluster
[root@DEV-SH-MAP-01 conf]# start-cluster.sh
Starting HA cluster with 2 masters.
Starting jobmanager daemon on host DEV-SH-MAP-01.
Starting jobmanager daemon on host DEV-SH-MAP-02.
Starting taskmanager daemon on host DEV-SH-MAP-01.
Starting taskmanager daemon on host DEV-SH-MAP-02.
Starting taskmanager daemon on host DEV-SH-MAP-03.
As shown, two JobManagers are started: one Leader and one Standby.
10. Test HA
10.1 Visit the Leader's Web UI:

10.2 Visit the Standby's Web UI
It also redirects to the Leader's Web UI.

10.3 Kill the Leader
[root@DEV-SH-MAP-01 flink-1.3.2]# jps
14240 Jps
34929 TaskManager
33106 DataNode
33314 SecondaryNameNode
34562 JobManager
33900 FlinkZooKeeperQuorumPeer
32991 NameNode
[root@DEV-SH-MAP-01 flink-1.3.2]# kill -9 34562
[root@DEV-SH-MAP-01 flink-1.3.2]# jps
34929 TaskManager
33106 DataNode
33314 SecondaryNameNode
14275 Jps
33900 FlinkZooKeeperQuorumPeer
32991 NameNode
Visiting the Flink Web UI again shows that the Leader has switched.

10.4 Restart the killed JobManager
[root@DEV-SH-MAP-01 bin]# jobmanager.sh start cluster DEV-SH-MAP-01
Starting jobmanager daemon on host DEV-SH-MAP-01.
[root@DEV-SH-MAP-01 bin]# jps
34929 TaskManager
33106 DataNode
33314 SecondaryNameNode
15506 JobManager
15559 Jps
33900 FlinkZooKeeperQuorumPeer
32991 NameNode
Checking the Web UI again, the previously killed Leader is back up, but it now runs as a Standby; the current Leader does not switch back. This matches Flink's own illustration:

11. Remaining issue
When the JobManager fails over, the TaskManagers also restart along with it, so running jobs are interrupted and have to be recovered under the new leader.
