Cluster information
Calico configuration
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  blockSize: 26
  cidr: 10.10.0.0/16
  ipipMode: Always
  natOutgoing: false
  nodeSelector: all()
  vxlanMode: Never
calico version: v3.19.1
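To make the blockSize and cidr fields of the IPPool more concrete, here is a minimal sketch that computes which /26 block a pod IP falls into. It only uses Python's standard ipaddress module; the helper name block_of is an illustrative assumption, not part of Calico.

```python
import ipaddress

# Values taken from the IPPool spec above.
pool = ipaddress.ip_network("10.10.0.0/16")  # spec.cidr
block_size = 26                              # spec.blockSize

def block_of(ip: str) -> ipaddress.IPv4Network:
    """Return the blockSize-sized block inside the pool that contains ip (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    if addr not in pool:
        raise ValueError(f"{ip} is not in pool {pool}")
    # strict=False masks off the host bits, yielding the enclosing /26.
    return ipaddress.ip_network(f"{ip}/{block_size}", strict=False)

num_blocks = 2 ** (block_size - pool.prefixlen)  # blocks the pool can be split into
addrs_per_block = 2 ** (32 - block_size)         # addresses in each block

print(num_blocks, addrs_per_block)   # 1024 64
print(block_of("10.10.55.134"))      # podA -> 10.10.55.128/26
print(block_of("10.10.86.131"))      # podB -> 10.10.86.128/26
```

The two blocks printed for podA and podB are the same /26 prefixes that show up later in the node routing table.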
Test environment
k8s-node-4 192.168.99.204 podA 10.10.55.134
k8s-node-5 192.168.99.205 podB 10.10.86.131
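Packet captures
When podA accesses podB, the packets captured at each stage are as follows.
Capture on calic211b2bb019 on node k8s-node-4
A packet sent from the pod travels through the veth pair straight to the corresponding cali* interface on the host (to find the mapping, run cat /sys/class/net/eth0/iflink inside the pod and match the value against the interface indexes on the host).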
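The iflink lookup can be scripted on the host side. The following is a minimal sketch, assuming the value read from /sys/class/net/eth0/iflink inside the pod is passed in as an integer; the function name host_iface_for_iflink and the sysfs_net parameter are illustrative assumptions. It relies on the fact that, for a veth pair, iflink of one end equals ifindex of the peer.

```python
from pathlib import Path
from typing import Optional

def host_iface_for_iflink(iflink: int, sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the host interface whose ifindex equals the pod's eth0 iflink.

    Hypothetical helper: run it in the host network namespace.
    """
    for dev in sorted(Path(sysfs_net).iterdir()):
        try:
            if int((dev / "ifindex").read_text().strip()) == iflink:
                return dev.name
        except (OSError, ValueError):
            # Skip entries that are not network devices or cannot be read.
            continue
    return None
```

For example, if the pod reports an iflink of 17 (a made-up value), host_iface_for_iflink(17) on the node would return the name of the matching cali* interface, or None if no interface has that index.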

root@k8s-node-4:~# tcpdump -i calic211b2bb019 tcp and port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on calic211b2bb019, link-type EN10MB (Ethernet), capture size 262144 bytes
12:04:02.681304 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [S], seq 2695070920, win 64860, options [mss 1410,sackOK,TS val 3501275550 ecr 0,nop,wscale 7], length 0
12:04:02.681779 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [S.], seq 2150062280, ack 2695070921, win 65160, options [mss 1460,sackOK,TS val 2555844366 ecr 3501275550,nop,wscale 7], length 0
12:04:02.681789 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [.], ack 1, win 507, options [nop,nop,TS val 3501275550 ecr 2555844366], length 0
12:04:02.683390 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3501275552 ecr 2555844366], length 76: HTTP: GET / HTTP/1.1
12:04:02.683711 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [.], ack 77, win 509, options [nop,nop,TS val 2555844368 ecr 3501275552], length 0
12:04:02.683948 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 2555844368 ecr 3501275552], length 142: HTTP: HTTP/1.1 200 OK
12:04:02.683952 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [.], ack 143, win 506, options [nop,nop,TS val 3501275553 ecr 2555844368], length 0
12:04:02.684653 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3501275553 ecr 2555844368], length 0
12:04:02.684896 IP 10.10.86.131.http > 10.10.55.134.38064: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 2555844369 ecr 3501275553], length 0
12:04:02.684901 IP 10.10.55.134.38064 > 10.10.86.131.http: Flags [.], ack 144, win 506, options [nop,nop,TS val 3501275554 ecr 2555844369], length 0
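As shown, the src IP is podA's IP and the dst IP is podB's IP.
Capture on tunl0 on node k8s-node-4
After the packet from the pod appears on the cali* interface and has passed the PREROUTING and other rules installed by kube-proxy, the node's routing table is looked up by dst IP (podB's IP) to decide how the packet is forwarded.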
root@k8s-node-4:~# ip r
default via 10.0.2.2 dev enp0s3 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.0.2.2 dev enp0s3 proto dhcp scope link src 10.0.2.15 metric 100
blackhole 10.10.55.128/26 proto bird
10.10.55.129 dev cali280cc1befad scope link
10.10.55.131 dev calie5904e003ea scope link
10.10.55.132 dev calida94a24526a scope link
10.10.55.133 dev cali37c7a3c7cb7 scope link
10.10.55.134 dev calic211b2bb019 scope link
10.10.55.135 dev cali2255478075d scope link
10.10.55.136 dev calie8f72551915 scope link
10.10.55.138 dev cali466ba5a1a55 scope link
10.10.55.139 dev cali8d25c92c70a scope link
10.10.76.128/26 via 192.168.99.203 dev tunl0 proto bird onlink
10.10.86.128/26 via 192.168.99.205 dev tunl0 proto bird onlink ### this is the route that matches
10.10.140.64/26 via 192.168.99.202 dev tunl0 proto bird onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.99.0/24 dev enp0s8 proto kernel scope link src 192.168.99.204
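As shown above, the matching route says that the next hop (via) is 192.168.99.205, the IP of the node hosting podB, and that the packet is sent out through the tunl0 device. tunl0 is a tunnel device (not to be confused with the tun/tap devices used by flannel): it prepends an additional IP header to the original packet, and the destination IP in that outer header is the next-hop address of the matching route.
Note that if Calico runs as a pure layer-3 network, that is, without IPIP, VXLAN, or any other encapsulation, via instead causes the destination MAC address of the layer-2 frame to be set to the MAC address of the external interface on podB's node. The destination node is thus treated as a gateway, and the IP packet from the pod is forwarded to it directly at layer 2.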
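The route selection described above is a longest-prefix match. The following is a minimal sketch of that lookup over a subset of the k8s-node-4 routing table shown above; the ROUTES list and the lookup function are illustrative assumptions and ignore everything the kernel does beyond prefix length (metrics, policy routing, and so on).

```python
import ipaddress

# Subset of the k8s-node-4 routing table: (prefix, next hop, device or route type).
ROUTES = [
    ("0.0.0.0/0",       "10.0.2.2",       "enp0s3"),
    ("10.10.55.128/26", None,             "blackhole"),
    ("10.10.55.134/32", None,             "calic211b2bb019"),
    ("10.10.76.128/26", "192.168.99.203", "tunl0"),
    ("10.10.86.128/26", "192.168.99.205", "tunl0"),
    ("10.10.140.64/26", "192.168.99.202", "tunl0"),
    ("192.168.99.0/24", None,             "enp0s8"),
]

def lookup(dst: str):
    """Return (prefix, next hop, device) of the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [
        (ipaddress.ip_network(prefix), via, dev)
        for prefix, via, dev in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    # The default route always matches, so candidates is never empty.
    net, via, dev = max(candidates, key=lambda route: route[0].prefixlen)
    return str(net), via, dev

print(lookup("10.10.86.131"))  # podB: ('10.10.86.128/26', '192.168.99.205', 'tunl0')
print(lookup("10.10.55.134"))  # podA: the /32 route wins over the blackhole /26
print(lookup("10.10.55.130"))  # unassigned address in the local block: blackhole
```

The same rule explains why the per-pod /32 routes take precedence over the blackhole route for the node's own block: only addresses in the block that have no pod fall through to the blackhole.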

root@k8s-node-4:~# tcpdump -i tunl0 tcp and port 80 and host 10.10.55.134
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tunl0, link-type RAW (Raw IP), capture size 262144 bytes
12:45:14.378775 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [S], seq 486181493, win 64860, options [mss 1410,sackOK,TS val 3503747247 ecr 0,nop,wscale 7], length 0
12:45:14.379210 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [S.], seq 348894761, ack 486181494, win 65160, options [mss 1460,sackOK,TS val 2558316064 ecr 3503747247,nop,wscale 7], length 0
12:45:14.379234 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [.], ack 1, win 507, options [nop,nop,TS val 3503747248 ecr 2558316064], length 0
12:45:14.380742 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3503747249 ecr 2558316064], length 76: HTTP: GET / HTTP/1.1
12:45:14.381162 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [.], ack 77, win 509, options [nop,nop,TS val 2558316065 ecr 3503747249], length 0
12:45:14.381298 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 2558316066 ecr 3503747249], length 142: HTTP: HTTP/1.1 200 OK
12:45:14.381363 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [.], ack 143, win 506, options [nop,nop,TS val 3503747250 ecr 2558316066], length 0
12:45:14.382243 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3503747251 ecr 2558316066], length 0
12:45:14.382705 IP 10.10.86.131.http > 10.10.55.134.46962: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 2558316067 ecr 3503747251], length 0
12:45:14.382735 IP 10.10.55.134.46962 > 10.10.86.131.http: Flags [.], ack 144, win 506, options [nop,nop,TS val 3503747251 ecr 2558316067], length 0
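As shown, the src IP is still podA's IP and the dst IP is podB's IP: a capture on tunl0 shows the packet without the outer header.
Capture on enp0s8 on node k8s-node-4
Packets on enp0s8, the interface used for node-to-node traffic, have already been encapsulated by tunl0 (an extra IP header has been added). When capturing with tcpdump here, the protocol must therefore be given as ip rather than tcp. With a tcp filter, tcpdump interprets the raw IP packet as ip[tcp], but after encapsulation by the ipip module the packet layout is ip[ip[tcp]], so a tcp filter captures nothing. Filter arguments that depend on TCP, such as port 80, fail to match for the same reason. One workaround is to filter on ip and pipe the output through grep (tcpdump also prints the second-layer IP header), as follows: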

root@k8s-node-4:~# tcpdump -i enp0s8 ip and host 192.168.99.205 | grep 10.10.86.131.http
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes
12:51:52.288259 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [S], seq 1172730419, win 64860, options [mss 1410,sackOK,TS val 3504145157 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
12:51:52.288717 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [S.], seq 3623141981, ack 1172730420, win 65160, options [mss 1460,sackOK,TS val 2558713973 ecr 3504145157,nop,wscale 7], length 0 (ipip-proto-4)
12:51:52.288761 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [.], ack 1, win 507, options [nop,nop,TS val 3504145157 ecr 2558713973], length 0 (ipip-proto-4)
12:51:52.288806 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3504145157 ecr 2558713973], length 76: HTTP: GET / HTTP/1.1 (ipip-proto-4)
12:51:52.289089 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [.], ack 77, win 509, options [nop,nop,TS val 2558713973 ecr 3504145157], length 0 (ipip-proto-4)
12:51:52.289277 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 2558713974 ecr 3504145157], length 142: HTTP: HTTP/1.1 200 OK (ipip-proto-4)
12:51:52.289318 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [.], ack 143, win 506, options [nop,nop,TS val 3504145158 ecr 2558713974], length 0 (ipip-proto-4)
12:51:52.289576 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3504145158 ecr 2558713974], length 0 (ipip-proto-4)
12:51:52.289856 IP k8s-node-5 > k8s-node-4: IP 10.10.86.131.http > 10.10.55.134.48402: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 2558713974 ecr 3504145158], length 0 (ipip-proto-4)
12:51:52.289891 IP k8s-node-4 > k8s-node-5: IP 10.10.55.134.48402 > 10.10.86.131.http: Flags [.], ack 144, win 506, options [nop,nop,TS val 3504145159 ecr 2558713974], length 0 (ipip-proto-4)
In the first (outer) IP header, the src IP is the node IP of k8s-node-4 and the dst IP is the node IP of k8s-node-5.
In the second (inner) IP header, the src IP is podA's IP and the dst IP is podB's IP.
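The ip[ip[tcp]] layout can be reproduced with a few lines of byte packing. This is a minimal sketch, assuming 20-byte IPv4 headers without options, a checksum left at zero, and a stub TCP header that only carries the two ports; ipv4_header and parse_ipv4 are illustrative helpers, not part of any library.

```python
import socket
import struct

IP_FMT = "!BBHHHBBH4s4s"  # 20-byte IPv4 header without options

def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Build a minimal IPv4 header (checksum left at 0 for brevity)."""
    ver_ihl = (4 << 4) | 5  # version 4, header length 5 * 4 = 20 bytes
    return struct.pack(IP_FMT, ver_ihl, 0, 20 + payload_len, 0, 0, 64, proto, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))

def parse_ipv4(packet: bytes) -> dict:
    """Parse the first IPv4 header and return its addresses, protocol and payload."""
    ver_ihl, _, total_len, _, _, _, proto, _, src, dst = struct.unpack(IP_FMT, packet[:20])
    ihl = (ver_ihl & 0x0F) * 4
    return {"src": socket.inet_ntoa(src), "dst": socket.inet_ntoa(dst),
            "proto": proto, "payload": packet[ihl:total_len]}

# Stub TCP header: source port, destination port, 16 zero bytes for the rest.
tcp = struct.pack("!HH", 48402, 80) + bytes(16)
inner = ipv4_header("10.10.55.134", "10.10.86.131", 6, len(tcp)) + tcp        # proto 6 = TCP
outer = ipv4_header("192.168.99.204", "192.168.99.205", 4, len(inner)) + inner  # proto 4 = IPIP

first = parse_ipv4(outer)
second = parse_ipv4(first["payload"])
print(first["src"], first["dst"], first["proto"])     # 192.168.99.204 192.168.99.205 4
print(second["src"], second["dst"], second["proto"])  # 10.10.55.134 10.10.86.131 6
print(struct.unpack("!HH", second["payload"][:4]))    # (48402, 80)
```

Because the outer header's protocol field is 4 rather than 6, a plain tcp filter does not match. Under the same no-options assumption the inner destination port sits at byte offset 42 of the outer packet, so a filter such as ip proto 4 and ip[42:2] = 80 should be an alternative to grep; treat that as an untested suggestion for this cluster.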
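When the packet has been forwarded over the existing layer-3 network between the nodes and reaches the destination node k8s-node-5, the kernel recognizes it as an IPIP-encapsulated packet, and the ipip driver decapsulates it to recover the original IP packet. The following route on the node then forwards the packet to the cali* interface, and it finally reaches the pod.
### Felix, running in calico-node, creates a route like the following for each pod on the node, to deliver packets destined for that pod
10.10.86.131 dev cali2be2e0f309a scope link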
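Notes
The above describes access between pods on two different nodes. Accessing a pod on another node directly from a node differs slightly.
Access podB directly from k8s-node-4 and capture on enp0s8 of k8s-node-4: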

root@k8s-node-4:~# tcpdump -i enp0s8 ip and host 192.168.99.205 -nn | grep 10.10.86.131.80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes
13:21:03.055260 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [S], seq 1659225601, win 64800, options [mss 1440,sackOK,TS val 3518023510 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
13:21:03.055706 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [S.], seq 3905354649, ack 1659225602, win 65160, options [mss 1460,sackOK,TS val 3788728882 ecr 3518023510,nop,wscale 7], length 0 (ipip-proto-4)
13:21:03.055755 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 3518023510 ecr 3788728882], length 0 (ipip-proto-4)
13:21:03.055877 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [P.], seq 1:77, ack 1, win 507, options [nop,nop,TS val 3518023510 ecr 3788728882], length 76: HTTP: GET / HTTP/1.1 (ipip-proto-4)
13:21:03.056176 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [.], ack 77, win 509, options [nop,nop,TS val 3788728883 ecr 3518023510], length 0 (ipip-proto-4)
13:21:03.056406 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [P.], seq 1:143, ack 77, win 509, options [nop,nop,TS val 3788728883 ecr 3518023510], length 142: HTTP: HTTP/1.1 200 OK (ipip-proto-4)
13:21:03.056432 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [.], ack 143, win 506, options [nop,nop,TS val 3518023511 ecr 3788728883], length 0 (ipip-proto-4)
13:21:03.056824 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [F.], seq 77, ack 143, win 506, options [nop,nop,TS val 3518023511 ecr 3788728883], length 0 (ipip-proto-4)
13:21:03.057111 IP 192.168.99.205 > 192.168.99.204: IP 10.10.86.131.80 > 10.10.55.128.39594: Flags [F.], seq 143, ack 78, win 509, options [nop,nop,TS val 3788728883 ecr 3518023511], length 0 (ipip-proto-4)
13:21:03.057136 IP 192.168.99.204 > 192.168.99.205: IP 10.10.55.128.39594 > 10.10.86.131.80: Flags [.], ack 144, win 506, options [nop,nop,TS val 3518023512 ecr 3788728883], length 0 (ipip-proto-4)
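In the first (outer) IP header, the src IP is the node IP of k8s-node-4 and the dst IP is the node IP of k8s-node-5.
The difference is in the second (inner) IP header: the src IP is the tunl0 IP of k8s-node-4 (10.10.55.128 in the capture), and the dst IP is podB's IP.
In other words, when a pod is accessed directly from a node, the tunl0 IP is used as the src IP of the original packet. Because that src IP belongs to the source node's pod subnet (which Calico calls an IP block), the destination node also IPIP-encapsulates the reply. If the src IP were 192.168.99.204 instead, the reply would not be encapsulated by tunl0 on the destination node, and the source node would see inconsistent traffic and drop it. Put differently: if a packet is processed by the ipip module on the source node, the reply must also be processed by ipip on the destination node.
Now look back at the natOutgoing setting in the Calico configuration at the top. It controls whether SNAT is applied when a pod accesses an IP that is not a pod IP. When it is set to true, Calico has Felix add the corresponding iptables rules on the node to perform the SNAT.
When natOutgoing is false, pods cannot reach the IPs of other nodes in the cluster. The reason is exactly the reverse of the case above: when a pod accesses another node's IP and SNAT does not rewrite the src IP to the IP of the pod's own node, the destination node sees a pod IP as the src IP when it replies, so its routing table hands the reply to tunl0 for encapsulation, which again leads to inconsistent traffic. Put differently: if a packet is not processed by the ipip module on the source node, the reply must not be processed by ipip on the destination node either.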
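The symmetry rule stated in the two cases above can be written down as a small check. This is a deliberately simplified model, assuming that a node sends a packet through tunl0 exactly when the destination lies in the pod pool 10.10.0.0/16 (and on another node); uses_ipip and is_symmetric are illustrative names, and real behavior also depends on iptables rules and kernel settings that are not modeled here.

```python
import ipaddress

POD_POOL = ipaddress.ip_network("10.10.0.0/16")  # cidr of the IPPool above

def uses_ipip(dst: str) -> bool:
    """Simplified model: traffic to a remote pod IP is routed through tunl0."""
    return ipaddress.ip_address(dst) in POD_POOL

def is_symmetric(src: str, dst: str) -> bool:
    """True if request and reply are either both IPIP-encapsulated or both plain."""
    request_encapsulated = uses_ipip(dst)  # the request is routed by its dst
    reply_encapsulated = uses_ipip(src)    # the reply is routed back to the original src
    return request_encapsulated == reply_encapsulated

print(is_symmetric("10.10.55.134", "10.10.86.131"))      # podA -> podB: True
print(is_symmetric("10.10.55.128", "10.10.86.131"))      # node-4 (tunl0 IP) -> podB: True
print(is_symmetric("192.168.99.204", "10.10.86.131"))    # node-4 (node IP) -> podB: False
print(is_symmetric("10.10.55.134", "192.168.99.205"))    # podA -> node-5, no SNAT: False
print(is_symmetric("192.168.99.204", "192.168.99.205"))  # podA -> node-5 after SNAT: True
```

The two False cases correspond to the two failure scenarios described above: a node using its node IP to reach a pod, and a pod reaching a node IP without SNAT.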
For this reason, Calico recommends setting natOutgoing to true when using IPIP (or VXLAN) mode. See:
issue: https://github.com/projectcalico/calicoctl/issues/1296
doc: https://docs.projectcalico.org/archive/v3.19/reference/resources/ippool