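1. Standalone devices are hard to find today: almost every piece of IT hardware is networked and talks to other devices. To exchange data, devices have to follow an agreed protocol, otherwise they would simply talk past each other. The most widely used protocol suite in the world is TCP/IP. For simplicity it is usually described as 5 layers; the figure below shows how the different layering models map onto each other: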
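Articles on networking internals keep mentioning terms such as message, packet and header, usually next to diagrams of the header fields at each protocol layer. Beginners may well be confused: what do these terms actually refer to, and what lies behind them? My own understanding is this: a message, a packet and a header are all, in essence, just a string of bytes. Encapsulation at each layer amounts to prepending more bytes to that string. With that in mind, the way a network packet is built becomes easy to follow, as the figure below shows: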
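Suppose Li Lei wants to send Han Meimei a message with the content "hello". How does the operating system deliver it to her correctly? Simple: the packets it sends through the NIC must follow the TCP/IP protocols. There may be many routers and switches between Li Lei and Han Meimei that forward the packets, and for them to recognise and forward the data correctly, the data must have a specific format. The figure above shows how a packet in that format is built: the application constructs the string "hello" and calls the send function. The send function provided by the operating system then prepends a series of identifying fields to "hello" (these are the headers, which are themselves just bytes). For example: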
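- Below the application layer is the transport layer, which uses TCP or UDP; it adds the source and destination ports (which identify the process) and the other TCP or UDP fields;
- Next comes the network layer, which adds the source and destination IP addresses and the other IP fields;
- Below that is the link layer, which adds the hardware identifiers of the network interfaces, i.e. the MAC addresses.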
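Once all of this is done, the NIC sends the result. In essence the NIC transmits one string of bytes: the user builds the application-layer part and calls the send function provided by the operating system, and the operating system builds the transport, network and link-layer parts. That is the whole principle behind constructing outgoing network data. It is not complicated once you know which fields each layer adds. With the principle clear, how does Linux do this in code?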
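The layering described above can be sketched in a few lines of C. This is only an illustration of "prepending bytes": the `toy_*` header structs, `prepend` and `toy_encapsulate` are hypothetical names, the headers are far simpler than real TCP, IP and Ethernet headers, and byte-order conversion is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified headers; real ones carry many more fields. */
struct toy_l4_hdr { uint16_t src_port, dst_port; };      /* 4 bytes  */
struct toy_l3_hdr { uint32_t src_ip, dst_ip; };          /* 8 bytes  */
struct toy_l2_hdr { uint8_t dst_mac[6], src_mac[6]; };   /* 12 bytes */

/* Copy hdr_len bytes in front of the current start of the packet
 * and return the new start. */
static uint8_t *prepend(uint8_t *start, const void *hdr, size_t hdr_len)
{
    start -= hdr_len;
    memcpy(start, hdr, hdr_len);
    return start;
}

/* Build L2 | L3 | L4 | payload at the end of buf, working backwards.
 * Returns the start of the frame and stores its length in *frame_len. */
static uint8_t *toy_encapsulate(uint8_t *buf, size_t buf_len,
                                const char *payload, size_t *frame_len)
{
    struct toy_l4_hdr l4 = { 40000, 80 };
    struct toy_l3_hdr l3 = { 0x0a000001, 0x0a000002 };
    struct toy_l2_hdr l2 = { {2, 0, 0, 0, 0, 2}, {2, 0, 0, 0, 0, 1} };
    size_t payload_len = strlen(payload);
    size_t total = sizeof(l2) + sizeof(l3) + sizeof(l4) + payload_len;
    uint8_t *p;

    assert(total <= buf_len);
    p = buf + buf_len - payload_len;      /* application data goes last */
    memcpy(p, payload, payload_len);
    p = prepend(p, &l4, sizeof(l4));      /* transport layer */
    p = prepend(p, &l3, sizeof(l3));      /* network layer   */
    p = prepend(p, &l2, sizeof(l2));      /* link layer      */
    *frame_len = total;
    return p;
}
```

Calling `toy_encapsulate(buf, sizeof(buf), "hello", &n)` leaves the payload untouched and only grows the frame towards lower addresses, which is the same idea the kernel implements with the data pointer of sk_buff.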
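2. Since what the operating system sends is a string of bytes, a few points about that string need to be settled:
- It has to be stored somewhere in memory.
- There are many applications, each possibly sending different data, and even a single application may send different data at different times. In other words, there are a great many such strings, certainly not just one.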
So the question is how to manage all these strings. Linux uses the sk_buff structure. It is very large; I have added extra comments to the fields I consider important:
/**
 *	struct sk_buff - socket buffer
 *	@next: Next buffer in list
 *	@prev: Previous buffer in list
 *	@tstamp: Time we arrived/left
 *	@rbnode: RB tree node, alternative to next/prev for netem/tcp
 *	@sk: Socket we are owned by
 *	@dev: Device we arrived on/are leaving by
 *	@cb: Control buffer. Free for use by every layer. Put private vars here
 *	@_skb_refdst: destination entry (with norefcount bit)
 *	@sp: the security path, used for xfrm
 *	@len: Length of actual data
 *	@data_len: Data length
 *	@mac_len: Length of link layer header
 *	@hdr_len: writable header length of cloned skb
 *	@csum: Checksum (must include start/offset pair)
 *	@csum_start: Offset from skb->head where checksumming should start
 *	@csum_offset: Offset from csum_start where checksum should be stored
 *	@priority: Packet queueing priority
 *	@ignore_df: allow local fragmentation
 *	@cloned: Head may be cloned (check refcnt to be sure)
 *	@ip_summed: Driver fed us an IP checksum
 *	@nohdr: Payload reference only, must not modify header
 *	@nfctinfo: Relationship of this skb to the connection
 *	@pkt_type: Packet class
 *	@fclone: skbuff clone status
 *	@ipvs_property: skbuff is owned by ipvs
 *	@peeked: this packet has been seen already, so stats have been
 *		done for it, don't do them again
 *	@nf_trace: netfilter packet trace flag
 *	@protocol: Packet protocol from driver
 *	@destructor: Destruct function
 *	@nfct: Associated connection, if any
 *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 *	@skb_iif: ifindex of device we arrived on
 *	@tc_index: Traffic control index
 *	@tc_verd: traffic control verdict
 *	@hash: the packet hash
 *	@queue_mapping: Queue mapping for multiqueue devices
 *	@xmit_more: More SKBs are pending for this queue
 *	@ndisc_nodetype: router type (from link layer)
 *	@ooo_okay: allow the mapping of a socket to a queue to be changed
 *	@l4_hash: indicate hash is a canonical 4-tuple hash over transport
 *		ports.
 *	@sw_hash: indicates hash was computed in software stack
 *	@wifi_acked_valid: wifi_acked was set
 *	@wifi_acked: whether frame was acked on wifi or not
 *	@no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
 *	@napi_id: id of the NAPI struct this skb came from
 *	@secmark: security marking
 *	@mark: Generic packet mark
 *	@vlan_proto: vlan encapsulation protocol
 *	@vlan_tci: vlan tag control information
 *	@inner_protocol: Protocol (encapsulation)
 *	@inner_transport_header: Inner transport layer header (encapsulation)
 *	@inner_network_header: Network layer header (encapsulation)
 *	@inner_mac_header: Link layer header (encapsulation)
 *	@transport_header: Transport layer header
 *	@network_header: Network layer header
 *	@mac_header: Link layer header
 *	@tail: Tail pointer
 *	@end: End pointer
 *	@head: Head of buffer
 *	@data: Data head pointer
 *	@truesize: Buffer size
 *	@users: User count - see {datagram,tcp}.c
 */
struct sk_buff {
    union {
        struct {
            /* These two members must be first. */
            /* Doubly linked list links, used to queue packets */
            struct sk_buff      *next;
            struct sk_buff      *prev;

            union {
                /* Timestamp of arrival or departure. On receive this is the
                 * time the skb was received, usually set in the interface
                 * function through which the driver hands the packet up to
                 * layer 2. */
                ktime_t         tstamp;
                struct skb_mstamp skb_mstamp;
            };
        };
        struct rb_node  rbnode; /* used in netem & tcp stack */
    };
    struct sock         *sk;    /* socket that owns this packet */
    struct net_device   *dev;   /* device the packet arrived on or is leaving by */

    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char                cb[48] __aligned(8);

    unsigned long       _skb_refdst;
    /* Destructor, usually set to sock_rfree or sock_wfree */
    void                (*destructor)(struct sk_buff *skb);
#ifdef CONFIG_XFRM
    struct sec_path     *sp;
#endif
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
    struct nf_conntrack *nfct;
#endif
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
    struct nf_bridge_info *nf_bridge;
#endif
    /* len: total length of the data in this skb, covering both the linear
     * buffer and the fragments held via skb_shared_info */
    unsigned int        len,
                        data_len; /* length of the fragment (non-linear) data only, i.e. the part held via skb_shared_info */
    __u16               mac_len,  /* length of the MAC header */
                        hdr_len;  /* used when cloning: writable header length of the cloned skb */

    /* Following fields are _not_ copied in __copy_skb_header()
     * Note that queue_mapping is here mostly to fill a hole.
     */
    kmemcheck_bitfield_begin(flags1);
    __u16               queue_mapping; /* queue mapping for multiqueue devices, i.e. which queue this skb maps to */

/* if you move cloned around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define CLONED_MASK	(1 << 7)
#else
#define CLONED_MASK	1
#endif
#define CLONED_OFFSET()		offsetof(struct sk_buff, __cloned_offset)

    __u8                __cloned_offset[0];
    __u8                cloned:1,
                        nohdr:1,
                        fclone:2,
                        peeked:1,
                        head_frag:1,
                        xmit_more:1,
                        __unused:1; /* one bit hole */
    kmemcheck_bitfield_end(flags1);

    /* fields enclosed in headers_start/headers_end are copied
     * using a single memcpy() in __copy_skb_header()
     */
    /* private: */
    __u32               headers_start[0];
    /* public: */

/* if you move pkt_type around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_TYPE_MAX	(7 << 5)
#else
#define PKT_TYPE_MAX	7
#endif
#define PKT_TYPE_OFFSET()	offsetof(struct sk_buff, __pkt_type_offset)

    __u8                __pkt_type_offset[0];
    __u8                pkt_type:3;
    __u8                pfmemalloc:1;
    __u8                ignore_df:1;
    __u8                nfctinfo:3;

    __u8                nf_trace:1;
    __u8                ip_summed:2;
    __u8                ooo_okay:1;
    __u8                l4_hash:1;
    __u8                sw_hash:1;
    __u8                wifi_acked_valid:1;
    __u8                wifi_acked:1;

    __u8                no_fcs:1;
    /* Indicates the inner headers are valid in the skbuff. */
    __u8                encapsulation:1;
    __u8                encap_hdr_csum:1;
    __u8                csum_valid:1;
    __u8                csum_complete_sw:1;
    __u8                csum_level:2;
    __u8                csum_bad:1;

#ifdef CONFIG_IPV6_NDISC_NODETYPE
    __u8                ndisc_nodetype:2;
#endif
    __u8                ipvs_property:1;
    __u8                inner_protocol_type:1;
    __u8                remcsum_offload:1;
#ifdef CONFIG_NET_SWITCHDEV
    __u8                offload_fwd_mark:1;
#endif
    /* 2, 4 or 5 bit hole */

#ifdef CONFIG_NET_SCHED
    __u16               tc_index;   /* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
    __u16               tc_verd;    /* traffic control verdict */
#endif
#endif

    union {
        __wsum          csum;
        struct {
            __u16       csum_start;
            __u16       csum_offset;
        };
    };
    __u32               priority;   /* priority, mainly used for QoS */
    int                 skb_iif;
    __u32               hash;
    __be16              vlan_proto;
    __u16               vlan_tci;
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
    union {
        unsigned int    napi_id;
        unsigned int    sender_cpu;
    };
#endif
#ifdef CONFIG_NETWORK_SECMARK
    __u32               secmark;
#endif

    union {
        __u32           mark;
        __u32           reserved_tailroom;
    };

    union {
        __be16          inner_protocol;
        __u8            inner_ipproto;
    };

    __u16               inner_transport_header;
    __u16               inner_network_header;
    __u16               inner_mac_header;

    __be16              protocol;           /* protocol type */
    __u16               transport_header;   /* transport-layer header (offset from head) */
    __u16               network_header;     /* network-layer header (offset from head) */
    __u16               mac_header;         /* link-layer header (offset from head) */

    /* private: */
    __u32               headers_end[0];
    /* public: */

    /* These elements must be at the end, see alloc_skb() for details.
     * sk_buff_data_t is either unsigned char * or, when
     * NET_SKBUFF_DATA_USES_OFFSET is defined (64-bit builds), an
     * unsigned int offset from head. */
    sk_buff_data_t      tail;   /* end of the data currently in the buffer */
    sk_buff_data_t      end;    /* end of the allocated buffer, where skb_shared_info begins */
    unsigned char       *head,  /* start of the allocated memory block (not where the actual data starts) */
                        *data;  /* first byte of the data currently in use (where the actual data starts) */
    /* Total size of the buffer, covering the sk_buff structure itself as
     * well as the data area. When a buffer of len bytes is requested,
     * alloc_skb initialises this to roughly len + sizeof(sk_buff). It is
     * updated when the memory held by the skb changes, not on every change
     * of skb->len. */
    unsigned int        truesize;
    /* Reference count: how many entities hold a reference to this skb.
     * When the skb is freed, users is decremented first, and the memory
     * is only released once the count has dropped to zero.
     * atomic_inc() increments the count and atomic_dec() decrements it. */
    atomic_t            users;
};
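A few points to note:
- The structure does not store the network packet itself; it holds pointers to it, namely tail, end, head and data above.
- The relationship between these pointers is shown in the figure. head and end mark the start and end of the allocated buffer. data points to the first byte currently in use, i.e. the start of the outermost header added so far in front of the application data, and tail points to the end of the data. When a packet travels down from L4 to L2 (sending), each layer's header is added to the buffer in front of the existing data. When it travels up from L2 to L4 (receiving), only the data pointer is moved forward; the headers are not deleted, which saves CPU work.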
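The pointer relationships can be made concrete with a toy model. This is a sketch that assumes a purely linear buffer (no fragments); `struct toy_skb` and the `toy_*` helpers are hypothetical stand-ins, loosely mirroring what the kernel's skb_headroom() and skb_tailroom() compute.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the four buffer pointers of struct sk_buff. */
struct toy_skb {
    unsigned char *head; /* start of the allocated buffer     */
    unsigned char *data; /* first byte currently in use        */
    unsigned char *tail; /* one past the last byte in use      */
    unsigned char *end;  /* one past the allocated buffer      */
};

/* An empty buffer: data and tail both sit at head. */
static void toy_skb_init(struct toy_skb *skb, unsigned char *buf, size_t size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
}

/* Free space in front of the data, available for headers. */
static size_t toy_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* Bytes currently in use (headers added so far plus payload). */
static size_t toy_len(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}

/* Free space behind the data. */
static size_t toy_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}
```

In this model headroom + length + tailroom always equals the size of the buffer, and head and end never move after initialisation.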
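3. With the structure in place, next come the functions that operate on it. Since the core of network communication is building packets, in terms of the structure this means working with the four pointers head, data, tail and end (in these functions only data and tail actually move; head and end stay fixed once the buffer is allocated). The Linux kernel provides four key functions: __skb_put, __skb_push, __skb_pull and skb_reserve. All four take an skb and a length, so how do they differ?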
(1) First, put: it appends data at the end of the data area, i.e. it advances the tail pointer.
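/* Append room at the end of the data area, e.g. for payload or a protocol trailer */
static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb); /* current skb->tail */
    SKB_LINEAR_ASSERT(skb);
    skb->tail += len;
    skb->len  += len;
    return tmp; /* old tail: the caller writes the new bytes here */
}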
As the figure below shows, the tail pointer has increased by n.
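A typical use of put is to claim room at the tail and then copy bytes into it. The following sketch uses a hypothetical `struct toy_skb` and `toy_put` that copy the pointer arithmetic of __skb_put and add a tailroom check; `toy_append` is an invented helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model: four pointers plus the length of the data in use. */
struct toy_skb { unsigned char *head, *data, *tail, *end; unsigned int len; };

/* Same arithmetic as __skb_put(), plus a check that the room exists. */
static unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= (size_t)(skb->end - skb->tail));
    skb->tail += n;
    skb->len  += n;
    return old_tail;   /* the newly claimed bytes start here */
}

/* Append n bytes: claim room with toy_put(), then copy into it. */
static void toy_append(struct toy_skb *skb, const void *src, unsigned int n)
{
    memcpy(toy_put(skb, n), src, n);
}
```

Note that put only returns the old tail; filling the claimed bytes is left to the caller, which is why the copy is a separate step here.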
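(2) Next, push: this time data is added in front of the data area, so the data pointer decreases. The name push is a hint: as with a stack that grows downwards, writing data moves the pointer to a lower address, so data here plays a role similar to the stack pointer (sp).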
/* Add room at the front of the data area for a protocol header */
static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len  += len;
    return skb->data; /* new start of data: the header is written here */
}
As the figure below shows, the data pointer has decreased by n.
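To see push in the role of "add a header in front of the payload", here is a small sketch. `struct toy_skb`, `toy_push`, `struct toy_ports` and `toy_add_ports` are all hypothetical; the two-port header is a drastic simplification of a real transport header, and the headroom check is an addition that __skb_push itself does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct toy_skb { unsigned char *head, *data, *tail, *end; unsigned int len; };

/* Hypothetical 4-byte transport header: just two ports. */
struct toy_ports { uint16_t src, dst; };

/* Same arithmetic as __skb_push(), plus a headroom check. */
static unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert(n <= (size_t)(skb->data - skb->head));
    skb->data -= n;
    skb->len  += n;
    return skb->data;
}

/* Prepend the port header in front of whatever is already in the buffer. */
static void toy_add_ports(struct toy_skb *skb, uint16_t src, uint16_t dst)
{
    struct toy_ports h = { src, dst };

    memcpy(toy_push(skb, (unsigned int)sizeof(h)), &h, sizeof(h));
}
```

The payload is never moved: only data slides towards head, which is why enough headroom has to exist before the first push.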
(3) Then pull: it advances the data pointer by n, which amounts to popping data off the front.
/* Advance the data pointer by len, which amounts to popping data off the front */
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);

static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
    skb->len -= len;
    BUG_ON(skb->len < skb->data_len); /* must not pull past the linear data */
    return skb->data += len;
}
As shown in the figure below:
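On the receive side, pull is what lets each layer step over its own header. The sketch below assumes a linear buffer and uses hypothetical names (`struct toy_skb`, `toy_pull`, `toy_strip_headers`); the header lengths are passed in by the caller rather than parsed from the packet.

```c
#include <assert.h>

struct toy_skb { unsigned char *head, *data, *tail, *end; unsigned int len; };

/* Same arithmetic as __skb_pull(), with a length check instead of BUG_ON(). */
static unsigned char *toy_pull(struct toy_skb *skb, unsigned int n)
{
    assert(n <= skb->len);
    skb->len -= n;
    return skb->data += n;
}

/* Step over the link, network and transport headers in turn and
 * return a pointer to the application payload. */
static unsigned char *toy_strip_headers(struct toy_skb *skb, unsigned int l2_len,
                                        unsigned int l3_len, unsigned int l4_len)
{
    toy_pull(skb, l2_len);
    toy_pull(skb, l3_len);
    return toy_pull(skb, l4_len);
}
```

For an Ethernet frame carrying IPv4 without options and UDP, the three lengths would be 14, 20 and 8 bytes. The header bytes stay in the buffer between head and data; nothing is copied or erased.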
(4) skb_reserve: while the skb is still empty, it reserves space for the headers that the different protocol layers will add later.
/**
 *	skb_reserve - adjust headroom
 *	@skb: buffer to alter
 *	@len: bytes to move
 *
 *	Increase the headroom of an empty &sk_buff by reducing the tail
 *	room. This is only allowed for an empty buffer.
 *	Reserves room for protocol headers; may only be used on an empty skb.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
    skb->data += len;
    skb->tail += len;
}
As shown in the figure below:
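Putting the four operations together, a sender typically reserves headroom first, puts the payload, and then pushes one header per layer. The following is a simplified sketch of that order with hypothetical `toy_*` names; the headers are zero-filled placeholders sized like Ethernet (14), IPv4 without options (20) and UDP (8), not real header contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_skb { unsigned char *head, *data, *tail, *end; unsigned int len; };

/* Like skb_reserve(): only valid while the buffer is still empty. */
static void toy_reserve(struct toy_skb *skb, unsigned int n)
{
    assert(skb->len == 0 && n <= (size_t)(skb->end - skb->tail));
    skb->data += n;
    skb->tail += n;
}

static unsigned char *toy_put(struct toy_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= (size_t)(skb->end - skb->tail));
    skb->tail += n;
    skb->len  += n;
    return old_tail;
}

static unsigned char *toy_push(struct toy_skb *skb, unsigned int n)
{
    assert(n <= (size_t)(skb->data - skb->head));
    skb->data -= n;
    skb->len  += n;
    return skb->data;
}

/* Build a frame in buf: reserve, put the payload, then push three headers. */
static void toy_build(struct toy_skb *skb, unsigned char *buf,
                      unsigned int size, const char *msg)
{
    unsigned int n = (unsigned int)strlen(msg);

    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;

    toy_reserve(skb, 14 + 20 + 8);        /* headroom for all headers */
    memcpy(toy_put(skb, n), msg, n);      /* application payload      */
    memset(toy_push(skb, 8), 0, 8);       /* transport header slot    */
    memset(toy_push(skb, 20), 0, 20);     /* network header slot      */
    memset(toy_push(skb, 14), 0, 14);     /* link-layer header slot   */
}
```

After `toy_build` the reserved headroom is fully consumed, so data is back at head and the payload has not moved since it was written.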