Kubernetes Service: NodePort


Environment

CLIENT                             HOSTA                                     HOSTB
192.168.55.230              eth0: 192.168.16.73                      eth0: 192.168.16.139
                               flannel.1: 10.100.40.0/32                  flannel.1: 10.100.81.0/32
                                                                                   PODIP: 10.100.81.180

Kubernetes resources

[root@c2v73 ~]# kubectl get services httpserver -o wide

NAME          CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE       SELECTOR
httpserver   10.254.112.35   <nodes>       3000:30000/TCP,8080:30080/TCP   304d      caicloud-app=httpserver

[root@c2v73 ~]# kubectl get endpoints httpserver -o wide

NAME          ENDPOINTS                      AGE
httpserver   10.100.81.180:8080,10.100.81.180:3000   304d

[root@c2v73 ~]# kubectl get pod httpserver-v2.0.0-rc.3-patch1-zvcbp -o wide

NAME                                   READY     STATUS    RESTARTS   AGE       IP              NODE
httpserver-v2.0.0-rc.3-patch1-zvcbp   1/1       Running   32         2d        10.100.81.180   kube-node-24

Request packet: 192.168.55.230:randomport -> 192.168.16.73:30000

  • HOSTA (forwarding node)

[root@c2v73 ~]# iptables -vnL PREROUTING -t nat

Chain PREROUTING (policy ACCEPT 71 packets, 5113 bytes)
 pkts bytes target     prot opt in     out     source               destination
   79  5709 cali-PREROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:6gwbT8clXdHdC1b1 */
   79  5709 KUBE-SERVICES  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
   44  4053 DOCKER     all  --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

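The KUBE-SERVICES chain first matches packets whose destination IP is a service cluster IP (VIP). The last rule in the chain is the default and jumps to KUBE-NODEPORTS.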

Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SVC-7LP2XUEI73XCWKBC  tcp  --  *      *       0.0.0.0/0            10.254.87.134        /* default/hbase-master:http cluster IP */ tcp dpt:16010
    0     0 KUBE-SVC-DJQF6FU6GLEBPD2P  tcp  --  *      *       0.0.0.0/0            10.254.233.102       /* default/cauth-redis-slave: cluster IP */ tcp dpt:6379
……
  375 30172 KUBE-NODEPORTS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
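
The matching order of this chain can be sketched in a few lines of Python. This is a simplified model, not kube-proxy code: the rule table holds only the two cluster IP rules visible above, and `LOCAL_ADDRESSES` is an assumed stand-in for what `ADDRTYPE match dst-type LOCAL` would treat as local on HOSTA.

```python
# Simplified, hypothetical model of the KUBE-SERVICES chain on HOSTA.
# Only the two cluster IP rules shown in the output above are included.
CLUSTER_IP_RULES = [
    ("10.254.87.134", 16010, "KUBE-SVC-7LP2XUEI73XCWKBC"),
    ("10.254.233.102", 6379, "KUBE-SVC-DJQF6FU6GLEBPD2P"),
]

# Assumption: addresses the kernel considers local on HOSTA.
LOCAL_ADDRESSES = {"192.168.16.73", "10.100.40.0"}


def kube_services(dst_ip, dst_port):
    """Return the chain a packet would jump to, or None if nothing matches."""
    # Cluster IP (VIP) rules are evaluated first.
    for vip, port, target in CLUSTER_IP_RULES:
        if dst_ip == vip and dst_port == port:
            return target
    # The last rule: any packet addressed to a local address goes to KUBE-NODEPORTS.
    if dst_ip in LOCAL_ADDRESSES:
        return "KUBE-NODEPORTS"
    return None


print(kube_services("192.168.16.73", 30000))
```

The request in this article is addressed to 192.168.16.73:30000, which matches no cluster IP and therefore falls through to KUBE-NODEPORTS.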

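In the KUBE-NODEPORTS chain, once the destination port matches, the packet is first given the masquerade mark (KUBE-MARK-MASQ) and then matched against the NodePort service rule.
[root@c2v73 ~]# iptables -vnL KUBE-NODEPORTS -t nat |grep 30000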

    7   448 KUBE-MARK-MASQ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/httpserver:http */ tcp dpt:30000
    7   448 KUBE-SVC-J5IAPMHBS2NR43EJ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/httpserver:http */ tcp dpt:30000

iptables distributes traffic evenly across the service's endpoints. This example has only one endpoint, so every packet matches that endpoint's rule.

Chain KUBE-SVC-J5IAPMHBS2NR43EJ (2 references)
 pkts bytes target     prot opt in     out     source               destination
    7   448 KUBE-SEP-XZ7F6CLIJGJWI66P  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/httpserver:http */
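
With more than one endpoint, an even split over sequentially evaluated rules needs a different match probability on each rule. The sketch below assumes the usual approach of a random-probability match per rule (as kube-proxy does with the iptables `statistic` module); the output above does not show it, because a single endpoint needs no probability at all. The function names are illustrative.

```python
from fractions import Fraction


def endpoint_probabilities(n):
    """Per-rule match probability for n endpoints evaluated in order."""
    # Rule i (0-based) is only reached when all earlier rules did not match,
    # so it has to match with probability 1/(n - i). The last rule always matches.
    return [Fraction(1, n - i) for i in range(n)]


def overall_share(n):
    """Fraction of all traffic that ends up at each endpoint."""
    shares = []
    remaining = Fraction(1)  # traffic not yet claimed by an earlier rule
    for p in endpoint_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print(endpoint_probabilities(3))
print(overall_share(3))
```

For one endpoint the only probability is 1, which corresponds to the single unconditional rule in the chain above.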

The endpoint rule performs DNAT, rewriting the destination IP and port to the pod IP and container port.

[root@c2v73 ~]# iptables -vnL KUBE-SEP-XZ7F6CLIJGJWI66P -t nat

Chain KUBE-SEP-XZ7F6CLIJGJWI66P (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *       10.100.81.180        0.0.0.0/0            /* default/httpserver:http */
    7   448 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/httpserver:http */ tcp to:10.100.81.180:3000

A route lookup determines the egress interface

[root@c2v73 ~]# ip route |grep 10.100.81

10.100.81.0/24 via 10.100.81.0 dev flannel.1 onlink
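
The lookup itself is a longest-prefix match, which can be sketched with Python's `ipaddress` module. The table below contains the flannel route from the output above plus a default route that is purely an assumption for illustration (its gateway 192.168.16.1 does not appear in the source).

```python
import ipaddress

# (prefix, gateway, device). The first entry comes from `ip route` above;
# the default route is an assumption added so that other destinations resolve.
ROUTES = [
    ("10.100.81.0/24", "10.100.81.0", "flannel.1"),
    ("0.0.0.0/0", "192.168.16.1", "eth0"),
]


def lookup(dst):
    """Return (gateway, device) of the most specific route covering dst."""
    addr = ipaddress.ip_address(dst)
    candidates = []
    for prefix, gateway, device in ROUTES:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            candidates.append((network.prefixlen, gateway, device))
    # The longest prefix wins.
    _, gateway, device = max(candidates)
    return gateway, device


print(lookup("10.100.81.180"))
```

For the DNATed destination 10.100.81.180, the /24 route wins over the default route, so the packet leaves through flannel.1 with next hop 10.100.81.0.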

[root@c2v73 ~]# ip -d link show flannel.1

12: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT
    link/ether 1a:cc:4c:8b:d4:6c brd ff:ff:ff:ff:ff:ff promiscuity 0
    vxlan id 1 local 192.168.16.73 dev eth0 srcport 0 0 dstport 8472 nolearning ageing 300 addrgenmode none

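Because the packet is DNATed on one node to a pod on another node, the reply has to come back through this node, which requires SNAT. Packets that are DNATed this way are therefore tagged with the packet mark 0x4000.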

[root@c2v73 ~]# iptables -vnL KUBE-NODEPORTS -t nat |grep 30000

    7   448 KUBE-MARK-MASQ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/httpserver:http */ tcp dpt:30000
    7   448 KUBE-SVC-J5IAPMHBS2NR43EJ  tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/httpserver:http */ tcp dpt:30000

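According to the route lookup, the egress interface is flannel.1, so MASQUERADE calls inet_select_addr to pick an IP address from flannel.1 and uses it as the SNAT source. The request is now 10.100.40.0:randomport->10.100.81.180:3000.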

[root@c2v73 ~]# iptables -vnL KUBE-POSTROUTING -t nat

Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
    7   798 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
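
Putting the three steps together (mark, DNAT, MASQUERADE), the request's addresses change as in the sketch below. It is a toy model: `Packet` and the helper functions are illustrative names, port 40000 stands in for `randomport`, and the source port is left unchanged for simplicity even though MASQUERADE may remap it.

```python
from dataclasses import dataclass, replace

MASQ_MARK = 0x4000  # the mark set by KUBE-MARK-MASQ in this cluster


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int
    mark: int = 0


def kube_mark_masq(pkt):
    """KUBE-MARK-MASQ: OR the masquerade bit into the packet mark."""
    return replace(pkt, mark=pkt.mark | MASQ_MARK)


def dnat(pkt, to_ip, to_port):
    """KUBE-SEP-*: rewrite the destination to the pod endpoint."""
    return replace(pkt, dst=to_ip, dport=to_port)


def kube_postrouting(pkt, egress_ip):
    """KUBE-POSTROUTING: MASQUERADE only if `mark match 0x4000/0x4000` holds."""
    if (pkt.mark & MASQ_MARK) == MASQ_MARK:
        return replace(pkt, src=egress_ip)
    return pkt


# 40000 is a placeholder for the client's random source port.
request = Packet("192.168.55.230", 40000, "192.168.16.73", 30000)
marked = kube_mark_masq(request)
translated = dnat(marked, "10.100.81.180", 3000)
sent = kube_postrouting(translated, "10.100.40.0")  # address of flannel.1
print(sent)
```

An unmarked packet passes through `kube_postrouting` untouched, which is why only service traffic that was explicitly marked gets SNATed.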

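When the packet is sent out of the VXLAN port (flannel.1), the driver calls vxlan_xmit. It looks up vxlan_fdb by the destination MAC to obtain the remote IP (the outer destination IP), 192.168.16.139, and takes the local source IP 192.168.16.73 from the vxlan_dev (&vxlan->cfg.saddr). It then calls iptunnel_xmit->ip_local_out to send the packet out of the corresponding physical interface. Because POSTROUTING contains the rules below, the VXLAN packet is not SNATed a second time.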

iptables -vnL POSTROUTING -t nat
 713K   51M RETURN     all  --  *      *       192.168.16.0/20      192.168.16.0/20
 1350 81000 MASQUERADE  all  --  *      *       192.168.64.0/20     !224.0.0.0/4
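
Why the encapsulated packet hits the RETURN rule can be checked directly: both the outer source (192.168.16.73) and the outer destination (192.168.16.139) fall inside 192.168.16.0/20. A minimal sketch, with a function name chosen only for illustration:

```python
import ipaddress

# Source and destination network of the RETURN rule shown above.
NODE_NET = ipaddress.ip_network("192.168.16.0/20")


def returns_before_masquerade(outer_src, outer_dst):
    """True if the packet matches the RETURN rule and so skips later rules."""
    return (ipaddress.ip_address(outer_src) in NODE_NET
            and ipaddress.ip_address(outer_dst) in NODE_NET)


print(returns_before_masquerade("192.168.16.73", "192.168.16.139"))
```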

[root@c2v73 ~]# ip nei |grep 10.100.81.0

10.100.81.0 dev flannel.1 lladdr aa:a1:54:36:e0:a9 PERMANENT

[root@c2v73 ~]# bridge fdb |grep aa:a1:54:36:e0:a9

aa:a1:54:36:e0:a9 dev flannel.1 dst 192.168.16.139 self permanent
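
The two lookups above chain together: the next hop resolves to a MAC in the neighbor table, and that MAC resolves to the remote node IP in the forwarding database. The sketch below models this with plain dictionaries filled from the output above; it is an illustration of the data flow, not of the kernel data structures.

```python
# Entries copied from `ip nei` and `bridge fdb` on HOSTA.
NEIGHBORS = {"10.100.81.0": "aa:a1:54:36:e0:a9"}
FDB = {"aa:a1:54:36:e0:a9": "192.168.16.139"}
LOCAL_VTEP_IP = "192.168.16.73"  # `local` address of flannel.1


def outer_addresses(next_hop):
    """Return (outer source IP, outer destination IP) for the VXLAN packet."""
    inner_dst_mac = NEIGHBORS[next_hop]  # next hop -> inner destination MAC
    remote_ip = FDB[inner_dst_mac]       # MAC -> remote VTEP (outer destination IP)
    return LOCAL_VTEP_IP, remote_ip


print(outer_addresses("10.100.81.0"))
```

Both entries are marked PERMANENT, which is consistent with flannel.1 running with `nolearning`: the mappings are installed from outside rather than learned from traffic.
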

  • HOSTB (receiving node)

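The VXLAN UDP packet is processed along the path vxlan_rcv (ens3) -> netif_rx (vxlan dev skb).
Inside vxlan_rcv, vxlan_vs_find_vni(vs, vxlan_vni(vxlan_hdr(skb)->vx_vni)) uses the VNI in the VXLAN header to find the matching vxlan_dev, so the VXLAN interfaces on both nodes must use the same VNI.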

[root@c4v139 ~]# ip -d link show flannel.1

8: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT
    link/ether aa:a1:54:36:e0:a9 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vxlan id 1 local 192.168.16.139 dev ens3 srcport 0 0 dstport 8472 nolearning ageing 300 addrgenmode none
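
The VNI lookup can be illustrated in user space. The sketch below builds and parses the 8-byte VXLAN header (one flags byte, three reserved bytes, a 24-bit VNI, one reserved byte) and then maps the VNI to a device. The `DEVICES_BY_VNI` table and the function names are illustrative and only mimic what the kernel lookup does.

```python
import struct

VNI_VALID_FLAG = 0x08  # the "I" flag: the VNI field is valid

# VNI -> device, mirroring `vxlan id 1` on flannel.1 above.
DEVICES_BY_VNI = {1: "flannel.1"}


def build_vxlan_header(vni):
    """Pack flags(1) + reserved(3) + VNI(3) + reserved(1) in network byte order."""
    return struct.pack("!II", VNI_VALID_FLAG << 24, vni << 8)


def parse_vni(header):
    """Extract the 24-bit VNI from the first 8 bytes of a VXLAN payload."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not ((flags_word >> 24) & VNI_VALID_FLAG):
        raise ValueError("VNI flag not set")
    return vni_word >> 8


def find_device(header):
    """Return the device that owns the header's VNI, or None if there is none."""
    return DEVICES_BY_VNI.get(parse_vni(header))


print(find_device(build_vxlan_header(1)))
```

A packet carrying any other VNI finds no device in this model, which is the reason the sender and receiver must agree on the VNI.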

The packet is forwarded from flannel.1, and the route lookup directs it to the veth interface calid9cdc8e8d37.
[root@c4v139 ~]# ip route |grep 10.100.81.180

10.100.81.180 dev calid9cdc8e8d37  scope link

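When the veth interface's driver transmit function is invoked, the call path is veth_xmit->dev_forward_skb(peer, skb)->enqueue_to_backlog.
dev_forward_skb forwards the packet locally to the other interface, that is, the veth peer.
A softirq then picks the packet up from the receive queue.

  • Return packet

Inside the pod, the return packet is 10.100.81.180:3000->10.100.40.0:randomport. A route lookup determines the egress interface.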
root@httpserver-v2:/app# ip route

default via 169.254.1.1 dev eth0
169.254.1.1 dev eth0  scope link

Here Calico configures 169.254.1.1 as the next hop and enables proxy ARP on the host-side veth peer, so the MAC address shown below for 169.254.1.1 is the MAC of the veth peer.
root@httpserver-v2:/app# ip nei

192.168.16.139 dev eth0 lladdr 8a:85:67:77:67:83 STALE
169.254.1.1 dev eth0 lladdr 8a:85:67:77:67:83 REACHABLE

[root@c4v139 ~]# ip addr show calid9cdc8e8d37

137: calid9cdc8e8d37@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 8a:85:67:77:67:83 brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::8885:67ff:fe77:6783/64 scope link
       valid_lft forever preferred_lft forever

[root@c4v139 ~]# sysctl -a |grep calid9cdc8e8d37 |grep proxy_arp

net.ipv4.conf.calid9cdc8e8d37.proxy_arp = 1

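calid9cdc8e8d37 answers ARP requests for 169.254.1.1 by proxy. (If an interface has proxy_arp enabled and the route lookup for the requested IP yields an egress interface other than that interface, the kernel answers ARP for that IP by proxy. The fg port in the fip namespace of OpenStack DVR mode is configured the same way.)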
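
The proxy decision described above reduces to two checks. The sketch below is deliberately simplified and leaves out the other conditions the kernel evaluates. The `route_lookup` callback and the assumption that 169.254.1.1 resolves to ens3 on HOSTB (for example through a default route) are hypothetical.

```python
def should_proxy_arp(in_dev, target_ip, proxy_arp_enabled, route_lookup):
    """Decide whether in_dev should answer an ARP request for target_ip."""
    # Check 1: proxy_arp must be enabled on the receiving interface.
    if not proxy_arp_enabled.get(in_dev, False):
        return False
    # Check 2: the route to the target must leave through a different interface.
    out_dev = route_lookup(target_ip)
    return out_dev is not None and out_dev != in_dev


# Value taken from the sysctl output above.
PROXY_ARP = {"calid9cdc8e8d37": True}


def host_route_lookup(ip):
    # Assumption: on HOSTB, 169.254.1.1 is routed out through ens3.
    return "ens3"


print(should_proxy_arp("calid9cdc8e8d37", "169.254.1.1", PROXY_ARP, host_route_lookup))
```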

A route lookup sends the packet out through flannel.1.
[root@c4v139 ~]# ip route |grep 10.100.40.0

10.100.40.0/24 via 10.100.40.0 dev flannel.1 onlink

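From here the packet is sent from the VXLAN device and encapsulated toward HOSTA. The flow is the same as the HOSTA-to-HOSTB direction described above, so it is not repeated.
After the packet arrives at HOSTA, kernel connection tracking reverses the earlier translations: the SNAT applied in POSTROUTING and the DNAT applied in PREROUTING are both undone, so the source and destination IPs become 192.168.16.73 and 192.168.55.230 respectively.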
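
The reverse translation can be sketched as a lookup in a connection tracking table that stores the reply tuple next to the original one. This is a toy model: the dictionary stands in for the kernel's conntrack table, port 40000 is a placeholder for `randomport`, and 30000 is the node port from the original request.

```python
# Tuples are (source IP, source port, destination IP, destination port).
ORIGINAL = ("192.168.55.230", 40000, "192.168.16.73", 30000)  # as sent by the client
REPLY = ("10.100.81.180", 3000, "10.100.40.0", 40000)        # as sent by the pod

# Expected reply tuple -> original tuple of the same connection.
CONNTRACK = {REPLY: ORIGINAL}


def reverse_nat(pkt):
    """Rewrite a reply so that it mirrors the original request."""
    original = CONNTRACK.get(pkt)
    if original is None:
        return pkt  # not part of a tracked, translated connection
    src, sport, dst, dport = original
    # The reply travels in the opposite direction, so the endpoints swap.
    return (dst, dport, src, sport)


print(reverse_nat(REPLY))
```

The client therefore sees a reply from the address and port it originally contacted, without any knowledge of the pod IP.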

References

veth implementation in the kernel http://ju.outofmemory.cn/entry/187069
Packet reception series: top-half implementation (kernel interface) http://blog.csdn.net/zhangskd/article/details/22211295
Network subsystem 18: ARP proxy handling http://blog.csdn.net/nerdx/article/details/12192961
Calico FAQ https://docs.projectcalico.org/v2.0/usage/troubleshooting/faq#why-cant-i-see-the-16925411-address-mentioned-above-on-my-host
Call relationships in the Linux kernel networking source code http://www.cnblogs.com/haoqingchuan/p/7882236.html

