一、引子
既然Kubernetes中將容器的聯網通過插件的方式來實現,那么該如何解決這個的聯網問題呢?
如果你在本地單台機器上運行docker容器的話注意到所有容器都會處在docker0
網橋自動分配的一個網絡IP段內(172.17.0.1/16。該值可以通過docker啟動參數 --bip來設置。這樣所有本地的所有的容器都擁有了一個IP地址,而且還是在一個網段內彼此就可以互相通信來。
但是Kubernetes管理的是集群,Kubernetes中的網絡要解決的核心問題就是每台主機的IP地址網段划分,以及單個容器的IP地址分配。概括為:
- 保證每個Pod擁有一個集群內唯一的IP地址
- 保證不同節點的IP地址划分不會重復
- 保證垮節點的Pod可以互相通信
- 保證不同節點的Pod可以與垮節點的主機互相通信
為了解決該問題,出現了一系列開源的Kubetnetes中的網絡插件與方案,如:
- flannel
- calico
- contiv
- weava net
- kube-router
- cilium
- canal
二、Kubernetes中的網絡解析-flannel
我們已經安裝了擁有三個節點的Kubernetes集群,節點的狀態如下:
[root@node1 ~]# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready <none> 2d v1.9.1 <none> CentOS Linux 7 (Core) 3.10.0-693.11.6.el7.x86_64 docker://1.12.6
node2 Ready <none> 2d v1.9.1 <none> CentOS Linux 7 (Core) 3.10.0-693.11.6.el7.x86_64 docker://1.12.6
node3 Ready <none> 2d v1.9.1 <none> CentOS Linux 7 (Core) 3.10.0-693.11.6.el7.x86_64 docker://1.12.6
當前Kubetnetes集群中運行的所有Pod信息
[root@node1 ~]# kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system coredns-5984fb8cbb-sjqv9 1/1 Running 0 1h 172.33.68.2 node1
kube-system coredns-5984fb8cbb-tkfrc 1/1 Running 1 1h 172.33.96.3 node3
kube-system heapster-v1.5.0-684c7f9488-z6sdz 4/4 Running 0 1h 172.33.31.3 node2
kube-system kubernetes-dashboard-6b66b8b96c-mnm2c 1/1 Running 0 1h 172.33.31.2 node2
kube-system monitoring-influxdb-grafana-v4-54b7854697-tw9cd 2/2 Running 2 1h 172.33.96.2 node3
當前etcd中的注冊的宿主機的pod地址網段信息:
[root@node1 ~]# etcdctl ls /kube-centos/network/subnets
/kube-centos/network/subnets/172.33.68.0-24
/kube-centos/network/subnets/172.33.31.0-24
/kube-centos/network/subnets/172.33.96.0-24
而每個node上的Pod子網是根據我們在安裝flannel時配置來划分的,在etcd中查看該配置:
[root@node1 ~]# etcdctl get /kube-centos/network/config
{"Network":"172.33.0.0/16","SubnetLen":24,"Backend":{"Type":"host-gw"}}
我們知道Kubernets集群內部存在三類IP,分別是:
- Node IP:宿主機的IP地址
- Pod IP:使用網絡插件創建的IP(如:flannel),該跨主機的Pod可以互通
- Cluster IP:虛擬IP,通過iptables規則訪問服務
在安裝node節點的時候,節點上的進程是按照flannel-->docker--->kubelet--->kube-proxy的順序啟動,flannnel的網絡划分和如何與docker交互,如何通過iptables訪問service。
Flannel
Flannel
是作為一個二進制文件的方式部署在每個node上,主要實現兩個功能:
- 為每個node分配subnet,容器將自動從該子網中獲取IP地址
- 當有node加入到網絡中時,為每個node增加路由配置
下面是使用 host -gw backend的flannel網絡架構圖:
Node1上的flannel配置如下:
[root@node1 ~]# cat /usr/lib/systemd/system/flanneld.service
[Unit]
Description=Flanneld overlay address etcd agent
After=network.target
After=network-online.target
Wants=network-online.target
After=etcd.service
Before=docker.service
[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/flanneld
EnvironmentFile=-/etc/sysconfig/docker-network
ExecStart=/usr/bin/flanneld-start $FLANNEL_OPTIONS
ExecStartPost=/usr/libexec/flannel/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker
Restart=on-failure
[Install]
WantedBy=multi-user.target
RequiredBy=docker.service
其中有兩個環境變量文件的配置如下:
[root@node1 ~]# cat /etc/sysconfig/flanneld
# Flanneld configuration options
FLANNEL_ETCD_ENDPOINTS="http://172.17.8.101:2379"
FLANNEL_ETCD_PREFIX="/kube-centos/network"
FLANNEL_OPTIONS="-iface=eth2"
上面的配置文件僅供flanneld使用。
[root@node1 ~]# cat /etc/sysconfig/docker-network
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=
還有一個ExecStartPost=/usr/libexec/flannel/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker,其中的/usr/libexec/flannel/mk-docker-opts.sh腳本是在flanneld啟動后運行,將會生成兩個環境變量配置文件:
- /run/flannel/docker
- /run/flannel/subnet.env
我們再來看下/run/flannel/docker的配置。
[root@node1 ~]# cat /run/flannel/docker
DOCKER_OPT_BIP="--bip=172.33.68.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=true"
DOCKER_OPT_MTU="--mtu=1500"
DOCKER_NETWORK_OPTIONS=" --bip=172.33.68.1/24 --ip-masq=true --mtu=1500"
如果你使用systemctl
命令先啟動flannel后啟動docker的話,docker將會讀取以上環境變量。
我們再來看下/run/flannel/subnet.env的配置。
[root@node1 ~]# cat /run/flannel/subnet.env
FLANNEL_NETWORK=172.33.0.0/16
FLANNEL_SUBNET=172.33.68.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=false
以上環境變量是flannel向etcd中注冊的。
Docker
Node1的docker配置如下:
[root@node1 ~]# cat /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.socket registries.service
Wants=docker-storage-setup.service
Requires=docker-cleanup.timer
[Service]
Type=notify
NotifyAccess=all
EnvironmentFile=-/run/containers/registries.conf
EnvironmentFile=-/etc/sysconfig/docker
EnvironmentFile=-/etc/sysconfig/docker-storage
EnvironmentFile=-/etc/sysconfig/docker-network
Environment=GOTRACEBACK=crash
Environment=DOCKER_HTTP_HOST_COMPAT=1
Environment=PATH=/usr/libexec/docker:/usr/bin:/usr/sbin
ExecStart=/usr/bin/dockerd-current \
--add-runtime docker-runc=/usr/libexec/docker/docker-runc-current \
--default-runtime=docker-runc \
--exec-opt native.cgroupdriver=systemd \
--userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
$OPTIONS \
$DOCKER_STORAGE_OPTIONS \
$DOCKER_NETWORK_OPTIONS \
$ADD_REGISTRY \
$BLOCK_REGISTRY \
$INSECURE_REGISTRY\
$REGISTRIES
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
TimeoutStartSec=0
Restart=on-abnormal
MountFlags=slave
KillMode=process
[Install]
WantedBy=multi-user.target
查看Node1上的docker啟動參數:
[root@node1 ~]# systemctl status -l docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/docker.service.d
└─flannel.conf
Active: active (running) since Fri 2018-02-02 22:52:43 CST; 2h 28min ago
Docs: http://docs.docker.com
Main PID: 4334 (dockerd-current)
CGroup: /system.slice/docker.service
‣ 4334 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --selinux-enabled --log-driver=journald --signature-verification=false --bip=172.33.68.1/24 --ip-masq=true --mtu=1500
我們可以看到在docker在啟動時有如下參數:--bip=172.33.68.1/24 --ip-masq=true --mtu=1500。上述參數flannel啟動時運行的腳本生成的,通過環境變量傳遞過來的。
我們查看下node1宿主機上的網絡接口:
[root@node1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 52:54:00:00:57:32 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic eth0
valid_lft 85095sec preferred_lft 85095sec
inet6 fe80::5054:ff:fe00:5732/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:7b:0f:b1 brd ff:ff:ff:ff:ff:ff
inet 172.17.8.101/24 brd 172.17.8.255 scope global eth1
valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:ef:25:06 brd ff:ff:ff:ff:ff:ff
inet 172.30.113.231/21 brd 172.30.119.255 scope global dynamic eth2
valid_lft 85096sec preferred_lft 85096sec
inet6 fe80::a00:27ff:feef:2506/64 scope link
valid_lft forever preferred_lft forever
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
link/ether 02:42:d0:ae:80:ea brd ff:ff:ff:ff:ff:ff
inet 172.33.68.1/24 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:d0ff:feae:80ea/64 scope link
valid_lft forever preferred_lft forever
7: veth295bef2@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP
link/ether 6a:72:d7:9f:29:19 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::6872:d7ff:fe9f:2919/64 scope link
valid_lft forever preferred_lft forever
我們分類來解釋下該虛擬機中的網絡接口。
- lo:回環網絡,127.0.0.1
- eth0:NAT網絡,虛擬機創建時自動分配,僅可以在幾台虛擬機之間訪問
- eth1:bridge網絡,使用vagrant分配給虛擬機的地址,虛擬機之間和本地電腦都可以訪問
- eth2:bridge網絡,使用DHCP分配,用於訪問互聯網的網卡
- docker0:bridge網絡,docker默認使用的網卡,作為該節點上所有容器的虛擬交換機
- veth295bef2@if6:veth pair,連接docker0和Pod中的容器。veth pair可以理解為使用網線連接好的兩個接口,把兩個端口放到兩個namespace中,那么這兩個namespace就能打通。
我們再看下該節點的docker上有哪些網絡。
[root@node1 ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
940bb75e653b bridge bridge local
d94c046e105d host host local
2db7597fd546 none null local
再檢查下bridge網絡940bb75e653b的信息。
[root@node1 ~]# docker network inspect 940bb75e653b
[
{
"Name": "bridge",
"Id": "940bb75e653bfa10dab4cce8813c2b3ce17501e4e4935f7dc13805a61b732d2c",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.33.68.1/24",
"Gateway": "172.33.68.1"
}
]
},
"Internal": false,
"Containers": {
"944d4aa660e30e1be9a18d30c9dcfa3b0504d1e5dbd00f3004b76582f1c9a85b": {
"Name": "k8s_POD_coredns-5984fb8cbb-sjqv9_kube-system_c5a2e959-082a-11e8-b4cd-525400005732_0",
"EndpointID": "7397d7282e464fc4ec5756d6b328df889cdf46134dbbe3753517e175d3844a85",
"MacAddress": "02:42:ac:21:44:02",
"IPv4Address": "172.33.68.2/24",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.bridge.default_bridge": "true",
"com.docker.network.bridge.enable_icc": "true",
"com.docker.network.bridge.enable_ip_masquerade": "true",
"com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
"com.docker.network.bridge.name": "docker0",
"com.docker.network.driver.mtu": "1500"
},
"Labels": {}
}
]
我們可以看到該網絡中的Config與docker的啟動配置相符。
Node1上運行的容器:
[root@node1 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a37407a234dd docker.io/coredns/coredns@sha256:adf2e5b4504ef9ffa43f16010bd064273338759e92f6f616dd159115748799bc "/coredns -conf /etc/" About an hour ago Up About an hour k8s_coredns_coredns-5984fb8cbb-sjqv9_kube-system_c5a2e959-082a-11e8-b4cd-525400005732_0
944d4aa660e3 docker.io/openshift/origin-pod "/usr/bin/pod" About an hour ago Up About an hour k8s_POD_coredns-5984fb8cbb-sjqv9_kube-system_c5a2e959-082a-11e8-b4cd-525400005732_0
我們可以看到當前已經有2個容器在運行。
Node1上的路由信息:
[root@node1 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.0.2.2 0.0.0.0 UG 100 0 0 eth0
0.0.0.0 172.30.116.1 0.0.0.0 UG 101 0 0 eth2
10.0.2.0 0.0.0.0 255.255.255.0 U 100 0 0 eth0
172.17.8.0 0.0.0.0 255.255.255.0 U 100 0 0 eth1
172.30.112.0 0.0.0.0 255.255.248.0 U 100 0 0 eth2
172.33.68.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0
172.33.96.0 172.30.118.65 255.255.255.0 UG 0 0 0 eth2
以上路由信息是由flannel添加的,當有新的節點加入Kubernetes集群中后,每個節點上的路由表都將增加。
我們在node上來traceroute下node3上的coredns-5984fb8cbb-tkfrc容器,其IP地址是172.33.96.3,看看其路由信息。
[root@node1 ~]# traceroute 172.33.96.3
traceroute to 172.33.96.3 (172.33.96.3), 30 hops max, 60 byte packets
1 172.30.118.65 (172.30.118.65) 0.518 ms 0.367 ms 0.398 ms
2 172.33.96.3 (172.33.96.3) 0.451 ms 0.352 ms 0.223 ms
我們看到路由直接經過node3的公網IP后就到達了node3節點上的Pod。
Node1的iptables信息:
[root@node1 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
KUBE-FIREWALL all -- anywhere anywhere
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
Chain FORWARD (policy ACCEPT)
target prot opt source destination
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forward rules */
DOCKER-ISOLATION all -- anywhere anywhere
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
KUBE-FIREWALL all -- anywhere anywhere
KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
Chain DOCKER (1 references)
target prot opt source destination
Chain DOCKER-ISOLATION (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
Chain KUBE-FIREWALL (2 references)
target prot opt source destination
DROP all -- anywhere anywhere /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
Chain KUBE-FORWARD (1 references)
target prot opt source destination
ACCEPT all -- anywhere anywhere /* kubernetes forwarding rules */ mark match 0x4000/0x4000
ACCEPT all -- 10.254.0.0/16 anywhere /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED
ACCEPT all -- anywhere 10.254.0.0/16 /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
Chain KUBE-SERVICES (2 references)
target prot opt source destination
三、Kubernetes中的網絡解析-calico
概述
Calico創建和管理一個扁平的三層網絡(不需要overlay),每個容器會分配一個可路由的ip。由於通信時不需要解包和封包,網絡性能消耗小,易於排查,且易於水平擴展。
小規模部署時可以通過bgp client直接互聯,大規模下可通過指定的BGP route reflector來完成,這樣保證所有的數據流量都是通過IP路由的方式完成互聯的。Calico基於iptables還提供了豐富而靈活的網絡Policy,保證通過各個節點上的ACLs來提供Workload的多租戶隔離,安全組以及其他可達性限制等功能。
Calico架構
calico主要由Felix,etcd,BGP client,BGP Route Reflector組成。
- Etcd:負責存儲網絡信息
- BGP client:負責將Felix配置的路由信息分發到其他節點
- Felix:Calico Agent,每個節點都需要運行,主要負責配置路由、配置ACLs、報告狀態
- BGP Route Reflector:大規模部署時需要用到,作為BGP client的中心連接點,可以避免每個節點互聯
部署
mkdir /etc/cni/net.d/
kubectl apply -f https://docs.projectcalico.org/v3.0/getting-started/kubernetes/installation/rbac.yaml
wget https://docs.projectcalico.org/v3.0/getting-started/kubernetes/installation/hosted/calico.yaml
修改etcd_endpoints的值和默認的192.168.0.0/16(不能和已有網段沖突)
kubectl apply -f calico.yaml
wget https://github.com/projectcalico/calicoctl/releases/download/v2.0.0/calicoctl
mv calicoctl /usr/loca/bin && chmod +x /usr/local/bin/calicoctl
calicoctl get ippool
calicoctl get node