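A network device sends and receives user data packets over a network medium. It transmits the packets handed down by upper-layer protocols using a particular media access control method, and it passes received packets up to those protocols.
Linux defines four layers for network device drivers:
1) the network protocol interface layer;
2) the network device interface layer;
3) the device driver function layer, which provides the actual functionality;
4) the network device and media layer.
I. Structure of Linux Network Device Drivers
The architecture of a Linux network device driver is shown in the figure below. From top to bottom it consists of four layers: the network protocol interface layer, the network device interface layer, the device driver function layer, and the network device and media layer. Each layer does the following:
Network protocol interface layer: gives network-layer protocols a uniform interface for sending and receiving packets. Whether the upper-layer protocol is ARP or IP, data is sent through dev_queue_xmit() and received through netif_rx(). This layer makes the upper-layer protocols independent of any particular device.
Network device interface layer: gives the protocol interface layer a uniform structure, net_device, that describes the attributes and operations of a specific network device. This structure is the container for the functions of the device driver function layer. Viewed from above, the network device interface layer lays out the structure of the device driver function layer that actually operates the hardware.
Device driver function layer: the functions in this layer are concrete members of the net_device data structure of the network device interface layer, and they are the code that drives the network hardware. Transmission is started through the hard_start_xmit() function (called ndo_start_xmit() in the net_device_ops structure quoted later in this article), and reception is triggered by interrupts from the network device.
Network device and media layer: the physical entities that send and receive packets, including the network adapter and the actual transmission medium. The adapter is physically driven by the functions of the device driver function layer. On Linux, both the network device and the medium may be virtual.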

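When designing a specific network device driver, our main work is to write the functions of the device driver function layer, use them to fill in the net_device data structure, and register that net_device with the kernel.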
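The layering described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: the model_ types and functions and the hypothetical xxx driver are names invented for this example, and the real dev_queue_xmit() does much more (queueing disciplines, locking) before it reaches the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_netdev;

/* Stand-in for sk_buff: the packet remembers which device it leaves by. */
struct model_skb {
    const unsigned char *data;
    size_t len;
    struct model_netdev *dev;
};

/* Stand-in for net_device_ops: the hooks a driver fills in. */
struct model_netdev_ops {
    int (*start_xmit)(struct model_skb *skb, struct model_netdev *dev);
};

/* Stand-in for net_device: attributes plus a pointer to the operations. */
struct model_netdev {
    char name[16];
    const struct model_netdev_ops *ops;
    unsigned long tx_packets;
    unsigned long tx_bytes;
};

/* Device driver function layer: the hypothetical "xxx" driver. */
static int xxx_start_xmit(struct model_skb *skb, struct model_netdev *dev)
{
    dev->tx_packets++;
    dev->tx_bytes += skb->len;
    return 0;
}

static const struct model_netdev_ops xxx_ops = {
    .start_xmit = xxx_start_xmit,
};

/* What a driver does before registration: fill in the device description. */
static void xxx_setup(struct model_netdev *dev)
{
    memset(dev, 0, sizeof(*dev));
    strncpy(dev->name, "xxx0", sizeof(dev->name) - 1);
    dev->ops = &xxx_ops;
}

/* Protocol interface layer: one entry point, whatever the protocol is. */
static int model_dev_queue_xmit(struct model_skb *skb)
{
    struct model_netdev *dev = skb->dev;

    if (dev == NULL || dev->ops == NULL || dev->ops->start_xmit == NULL)
        return -1; /* no usable device behind this packet */
    return dev->ops->start_xmit(skb, dev);
}
```

The caller of model_dev_queue_xmit() never names the driver; it only reaches xxx_start_xmit() through the operations table, which is the independence that the protocol interface layer is meant to provide.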
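1. Network protocol interface layer
The main job of the network protocol interface layer is to give upper-layer protocols a transparent interface for sending and receiving packets. When ARP or IP needs to send a packet, it calls this layer's dev_queue_xmit() function and passes it a pointer to a struct sk_buff. The prototype of dev_queue_xmit() is:
    int dev_queue_xmit(struct sk_buff *skb);
Similarly, a received packet is handed to the upper layers by passing a pointer to a struct sk_buff to netif_rx(). The prototype of netif_rx() is:
    int netif_rx(struct sk_buff *skb);
The sk_buff structure is very important. It is defined in include/linux/skbuff.h, its name means "socket buffer", and it carries data between the layers of the Linux networking subsystem. It is the "central nervous system" of data transfer in that subsystem.
When a packet is sent, the kernel's network code must build an sk_buff containing the packet to be transmitted and hand it to the layer below. Each layer adds its own protocol header to the sk_buff until the packet reaches the network device for transmission. Likewise, when the network device receives a packet from the medium, it must convert the received data into an sk_buff and pass it upward, and each layer strips its own protocol header until the data reaches the user.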
    /**
     * struct sk_buff - socket buffer
     * @next: Next buffer in list
     * @prev: Previous buffer in list
     * @tstamp: Time we arrived/left
     * @rbnode: RB tree node, alternative to next/prev for netem/tcp
     * @sk: Socket we are owned by
     * @dev: Device we arrived on/are leaving by
     * @cb: Control buffer. Free for use by every layer. Put private vars here
     * @_skb_refdst: destination entry (with norefcount bit)
     * @sp: the security path, used for xfrm
     * @len: Length of actual data
     * @data_len: Data length
     * @mac_len: Length of link layer header
     * @hdr_len: writable header length of cloned skb
     * @csum: Checksum (must include start/offset pair)
     * @csum_start: Offset from skb->head where checksumming should start
     * @csum_offset: Offset from csum_start where checksum should be stored
     * @priority: Packet queueing priority
     * @ignore_df: allow local fragmentation
     * @cloned: Head may be cloned (check refcnt to be sure)
     * @ip_summed: Driver fed us an IP checksum
     * @nohdr: Payload reference only, must not modify header
     * @nfctinfo: Relationship of this skb to the connection
     * @pkt_type: Packet class
     * @fclone: skbuff clone status
     * @ipvs_property: skbuff is owned by ipvs
     * @peeked: this packet has been seen already, so stats have been
     *  done for it, don't do them again
     * @nf_trace: netfilter packet trace flag
     * @protocol: Packet protocol from driver
     * @destructor: Destruct function
     * @nfct: Associated connection, if any
     * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
     * @skb_iif: ifindex of device we arrived on
     * @tc_index: Traffic control index
     * @tc_verd: traffic control verdict
     * @hash: the packet hash
     * @queue_mapping: Queue mapping for multiqueue devices
     * @xmit_more: More SKBs are pending for this queue
     * @ndisc_nodetype: router type (from link layer)
     * @ooo_okay: allow the mapping of a socket to a queue to be changed
     * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
     *  ports.
     * @sw_hash: indicates hash was computed in software stack
     * @wifi_acked_valid: wifi_acked was set
     * @wifi_acked: whether frame was acked on wifi or not
     * @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
     * @napi_id: id of the NAPI struct this skb came from
     * @secmark: security marking
     * @mark: Generic packet mark
     * @dropcount: total number of sk_receive_queue overflows
     * @vlan_proto: vlan encapsulation protocol
     * @vlan_tci: vlan tag control information
     * @inner_protocol: Protocol (encapsulation)
     * @inner_transport_header: Inner transport layer header (encapsulation)
     * @inner_network_header: Network layer header (encapsulation)
     * @inner_mac_header: Link layer header (encapsulation)
     * @transport_header: Transport layer header
     * @network_header: Network layer header
     * @mac_header: Link layer header
     * @tail: Tail pointer
     * @end: End pointer
     * @head: Head of buffer
     * @data: Data head pointer
     * @truesize: Buffer size
     * @users: User count - see {datagram,tcp}.c
     */
    struct sk_buff {
        union {
            struct {
                /* These two members must be first. */
                struct sk_buff *next;
                struct sk_buff *prev;

                union {
                    ktime_t tstamp;
                    struct skb_mstamp skb_mstamp;
                };
            };
            struct rb_node rbnode; /* used in netem & tcp stack */
        };
        struct sock *sk;
        struct net_device *dev;

        /*
         * This is the control buffer. It is free to use for every
         * layer. Please put your private variables there. If you
         * want to keep them across layers you have to do a skb_clone()
         * first. This is owned by whoever has the skb queued ATM.
         */
        char cb[48] __aligned(8);

        unsigned long _skb_refdst;
        void (*destructor)(struct sk_buff *skb);
    #ifdef CONFIG_XFRM
        struct sec_path *sp;
    #endif
    #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
        struct nf_conntrack *nfct;
    #endif
    #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
        struct nf_bridge_info *nf_bridge;
    #endif
        unsigned int len,
                     data_len;
        __u16 mac_len,
              hdr_len;

        /* Following fields are _not_ copied in __copy_skb_header()
         * Note that queue_mapping is here mostly to fill a hole.
         */
        kmemcheck_bitfield_begin(flags1);
        __u16 queue_mapping;
        __u8 cloned:1,
             nohdr:1,
             fclone:2,
             peeked:1,
             head_frag:1,
             xmit_more:1;
        /* one bit hole */
        kmemcheck_bitfield_end(flags1);

        /* fields enclosed in headers_start/headers_end are copied
         * using a single memcpy() in __copy_skb_header()
         */
        /* private: */
        __u32 headers_start[0];
        /* public: */

    /* if you move pkt_type around you also must adapt those constants */
    #ifdef __BIG_ENDIAN_BITFIELD
    #define PKT_TYPE_MAX (7 << 5)
    #else
    #define PKT_TYPE_MAX 7
    #endif
    #define PKT_TYPE_OFFSET() offsetof(struct sk_buff, __pkt_type_offset)

        __u8 __pkt_type_offset[0];
        __u8 pkt_type:3;
        __u8 pfmemalloc:1;
        __u8 ignore_df:1;
        __u8 nfctinfo:3;

        __u8 nf_trace:1;
        __u8 ip_summed:2;
        __u8 ooo_okay:1;
        __u8 l4_hash:1;
        __u8 sw_hash:1;
        __u8 wifi_acked_valid:1;
        __u8 wifi_acked:1;

        __u8 no_fcs:1;
        /* Indicates the inner headers are valid in the skbuff. */
        __u8 encapsulation:1;
        __u8 encap_hdr_csum:1;
        __u8 csum_valid:1;
        __u8 csum_complete_sw:1;
        __u8 csum_level:2;
        __u8 csum_bad:1;

    #ifdef CONFIG_IPV6_NDISC_NODETYPE
        __u8 ndisc_nodetype:2;
    #endif
        __u8 ipvs_property:1;
        __u8 inner_protocol_type:1;
        __u8 remcsum_offload:1;
        /* 3 or 5 bit hole */

    #ifdef CONFIG_NET_SCHED
        __u16 tc_index; /* traffic control index */
    #ifdef CONFIG_NET_CLS_ACT
        __u16 tc_verd; /* traffic control verdict */
    #endif
    #endif

        union {
            __wsum csum;
            struct {
                __u16 csum_start;
                __u16 csum_offset;
            };
        };
        __u32 priority;
        int skb_iif;
        __u32 hash;
        __be16 vlan_proto;
        __u16 vlan_tci;
    #if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
        union {
            unsigned int napi_id;
            unsigned int sender_cpu;
        };
    #endif
    #ifdef CONFIG_NETWORK_SECMARK
        __u32 secmark;
    #endif
        union {
            __u32 mark;
            __u32 dropcount;
            __u32 reserved_tailroom;
        };

        union {
            __be16 inner_protocol;
            __u8 inner_ipproto;
        };

        __u16 inner_transport_header;
        __u16 inner_network_header;
        __u16 inner_mac_header;

        __be16 protocol;
        __u16 transport_header;
        __u16 network_header;
        __u16 mac_header;

        /* private: */
        __u32 headers_end[0];
        /* public: */

        /* These elements must be at the end, see alloc_skb() for details. */
        sk_buff_data_t tail;
        sk_buff_data_t end;
        unsigned char *head,
                      *data;
        unsigned int truesize;
        atomic_t users;
    };
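As the figure below shows, the pointers that deserve particular attention are head and end, which point to the start and end of the buffer, and data and tail, which point to the start and end of the actual data. Each layer either fills in a protocol header between head and data or appends new protocol data between tail and end.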

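Next we look at the functions that operate on socket buffers. Linux provides functions to allocate, free, and modify them.
(1) Allocation
The Linux kernel functions for allocating a socket buffer are:
    struct sk_buff *alloc_skb(unsigned int len, gfp_t priority);
    struct sk_buff *dev_alloc_skb(unsigned int len);
alloc_skb() allocates one socket buffer and one data buffer. The len parameter is the size of the data buffer, which is normally aligned to L1_CACHE_BYTES bytes (32 on many ARM configurations), and priority is the memory allocation priority.
dev_alloc_skb() allocates the skb with GFP_ATOMIC priority, because it is frequently called from the receive interrupt of a device driver.
(2) Freeing
The Linux kernel functions for freeing a socket buffer are:
    void kfree_skb(struct sk_buff *skb);
    void dev_kfree_skb(struct sk_buff *skb);
    void dev_kfree_skb_irq(struct sk_buff *skb);
    void dev_kfree_skb_any(struct sk_buff *skb);
These functions free the socket buffer and data buffer allocated by alloc_skb().
The kernel itself uses kfree_skb(), but a network device driver should preferably free socket buffers with dev_kfree_skb(), dev_kfree_skb_irq(), or dev_kfree_skb_any().
dev_kfree_skb() is for non-interrupt context and dev_kfree_skb_irq() is for interrupt context. dev_kfree_skb_any() can be used in either context: it simply checks the current context and then calls __dev_kfree_skb_irq() or dev_kfree_skb(), as the implementation shows:
    void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason)
    {
        if (in_irq() || irqs_disabled())
            __dev_kfree_skb_irq(skb, reason);
        else
            dev_kfree_skb(skb);
    }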
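To make the context check concrete, here is a user-space model of the same dispatch. It is a sketch under stated assumptions: model_in_irq is a plain flag standing in for in_irq() || irqs_disabled(), and the deferred list stands in for the queue on which, as far as we understand it, the kernel parks buffers released in interrupt context so that they can be freed later. All model_ names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct model_skb {
    struct model_skb *next; /* link used only while waiting to be freed */
};

static bool model_in_irq;                /* simulated execution context */
static struct model_skb *model_deferred; /* buffers waiting for a safe context */
static int model_freed_now;              /* buffers freed immediately */

/* Non-interrupt context: release the buffer right away. */
static void model_kfree_skb(struct model_skb *skb)
{
    free(skb);
    model_freed_now++;
}

/* Interrupt context: only queue the buffer, do the real work later. */
static void model_kfree_skb_irq(struct model_skb *skb)
{
    skb->next = model_deferred;
    model_deferred = skb;
}

/* Usable from either context: pick the variant by checking the context. */
static void model_kfree_skb_any(struct model_skb *skb)
{
    if (model_in_irq)
        model_kfree_skb_irq(skb);
    else
        model_kfree_skb(skb);
}

/* Runs later, outside interrupt context; returns how many buffers it freed. */
static int model_run_deferred(void)
{
    int freed = 0;

    while (model_deferred != NULL) {
        struct model_skb *skb = model_deferred;

        model_deferred = skb->next;
        free(skb);
        freed++;
    }
    return freed;
}
```

A driver that frees buffers from a path reachable in both contexts, such as a transmit-completion routine, is the typical caller of the _any variant.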
(3) Modification
The following kernel function adds data at the tail of the buffer:
    unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
It moves skb->tail back by len (skb->tail += len) and increases skb->len by len (skb->len += len). It is usually called in the receive path of a device driver.
The following kernel function adds data at the head of the buffer:
    unsigned char *skb_push(struct sk_buff *skb, unsigned len);
It moves skb->data forward by len (skb->data -= len) and increases skb->len by len (skb->len += len). Its opposite is skb_pull(), which removes data from the head of the buffer by executing skb->len -= len and skb->data += len.
For an empty buffer, the following function adjusts the headroom:
    static inline void skb_reserve(struct sk_buff *skb, int len);
It moves skb->data and skb->tail back by len together, executing skb->data += len and skb->tail += len. The kernel contains a lot of code like this:
    skb = alloc_skb(len + headspace, GFP_KERNEL);
    skb_reserve(skb, headspace);
    skb_put(skb, len);
    memcpy_fromfs(skb->data, data, len);
    pass_to_m_protocol(skb);
This code first allocates a brand-new sk_buff, calls skb_reserve() to set aside headroom, calls skb_put() to make room for the data, copies the data in, and finally passes the sk_buff to the protocol stack.
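The pointer arithmetic of these four operations is easy to check with a user-space model. The sketch below is not the kernel implementation: struct model_skb keeps only the four pointers and len, the model_ function names are ours, and bounds violations trip an assert() where the kernel would panic or, in the case of skb_pull(), return NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct model_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len; /* bytes between data and tail */
};

/* Allocate an empty buffer: data and tail both start at head. */
static struct model_skb *model_alloc_skb(unsigned int size)
{
    struct model_skb *skb = malloc(sizeof(*skb));

    if (skb == NULL)
        return NULL;
    skb->head = malloc(size ? size : 1);
    if (skb->head == NULL) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

static void model_free_skb(struct model_skb *skb)
{
    free(skb->head);
    free(skb);
}

/* Reserve headroom in an empty buffer: data += len, tail += len. */
static void model_skb_reserve(struct model_skb *skb, unsigned int len)
{
    assert(skb->len == 0);
    assert((size_t)(skb->end - skb->tail) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Append at the tail: tail += len, len += len; returns the old tail. */
static unsigned char *model_skb_put(struct model_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= len);
    skb->tail += len;
    skb->len += len;
    return old_tail;
}

/* Prepend at the head: data -= len, len += len; returns the new data. */
static unsigned char *model_skb_push(struct model_skb *skb, unsigned int len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    skb->len += len;
    return skb->data;
}

/* Strip from the head: len -= len, data += len; returns the new data. */
static unsigned char *model_skb_pull(struct model_skb *skb, unsigned int len)
{
    assert(skb->len >= len);
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

For example, reserving 14 bytes, putting a 20-byte payload, and then pushing 14 bytes brings data back to head with len equal to 34, which is how a link-layer header ends up in front of a payload without copying it.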
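2. Network device interface layer
The main job of the network device interface layer is to define a single, abstract data structure, net_device, for the endless variety of network devices, so that different hardware looks uniform at the software level.
The net_device structure represents one network device inside the kernel and is defined in include/linux/netdevice.h. A network device driver only has to fill in the relevant members of a net_device and register it in order to hook its hardware operation functions into the kernel.
net_device is a huge structure that contains both the attributes of the network device and its operation interface. Some of its key members are described below.
(1) Global information
    char name[IFNAMSIZ];
name is the name of the network device.
(2) Hardware information
    unsigned long mem_end;
    unsigned long mem_start;
mem_start and mem_end define the start and end addresses of the shared memory used by the device.
    unsigned long base_addr;
    unsigned char irq;
    unsigned char if_port;
    unsigned char dma;
base_addr is the I/O base address of the network device.
irq is the interrupt number used by the device.
if_port selects which port a multi-port device uses and is meaningful only for such devices. For example, it can be used when a device supports both IF_PORT_10BASE2 (coaxial cable) and IF_PORT_10BASET (twisted pair).
dma is the DMA channel assigned to the device.
(3) Interface information
    unsigned short hard_header_len;
hard_header_len is the length of the hardware header. The initialization function for Ethernet devices sets it to ETH_HLEN, which is 14.
    unsigned short type;
type is the hardware type of the interface.
    unsigned mtu;
mtu is the maximum transmission unit (MTU).
    unsigned char *dev_addr;
dev_addr holds the hardware address of the device. A driver may provide an interface for setting the MAC address, in which case the address set by the user is stored in this member, as the moxart_set_mac_address() function in drivers/net/ethernet/moxa/moxart_ether.c shows:
    static int moxart_set_mac_address(struct net_device *ndev, void *addr)
    {
        struct sockaddr *address = addr;

        if (!is_valid_ether_addr(address->sa_data))
            return -EADDRNOTAVAIL;

        memcpy(ndev->dev_addr, address->sa_data, ndev->addr_len);
        moxart_update_mac_address(ndev);

        return 0;
    }
This code validates the new address, copies it into dev_addr with memcpy(), and finally changes the MAC address in the hardware.
    unsigned short flags;
flags holds the network interface flags, whose names start with IFF_ (Interface Flags). Some flags are managed by the kernel; the others are set when the interface is initialized to describe its capabilities and characteristics. The flags include IFF_UP (set by the kernel when the device is activated and can start sending packets), IFF_AUTOMEDIA (the device can switch between several media), IFF_BROADCAST (broadcast is allowed), IFF_DEBUG (debug mode, which can be used to control the verbosity of printk calls), IFF_LOOPBACK (loopback), IFF_MULTICAST (multicast is allowed), IFF_NOARP (the interface cannot perform ARP), and IFF_POINTOPOINT (the interface is connected to a point-to-point link).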
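Because flags is a bit mask, capabilities are combined with bitwise OR and tested with bitwise AND. The sketch below shows the idiom in user-space C. The MODEL_IFF_ constants and their bit positions are chosen for this illustration only and should not be taken as the kernel's values, and model_ether_default_flags() reflects our assumption that a freshly set up Ethernet device is broadcast- and multicast-capable before it is brought up.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; not the kernel's actual IFF_* values. */
enum {
    MODEL_IFF_UP          = 1 << 0,
    MODEL_IFF_BROADCAST   = 1 << 1,
    MODEL_IFF_DEBUG       = 1 << 2,
    MODEL_IFF_LOOPBACK    = 1 << 3,
    MODEL_IFF_POINTOPOINT = 1 << 4,
    MODEL_IFF_NOARP       = 1 << 5,
    MODEL_IFF_MULTICAST   = 1 << 6
};

/* Assumed defaults for an Ethernet-like device before it is activated. */
static unsigned short model_ether_default_flags(void)
{
    return MODEL_IFF_BROADCAST | MODEL_IFF_MULTICAST;
}

/* Returns flags with every bit in mask switched on. */
static unsigned short model_flags_set(unsigned short flags, unsigned short mask)
{
    return (unsigned short)(flags | mask);
}

/* Returns flags with every bit in mask switched off. */
static unsigned short model_flags_clear(unsigned short flags, unsigned short mask)
{
    return (unsigned short)(flags & ~mask);
}

/* True only if all bits in mask are present in flags. */
static bool model_flags_all_set(unsigned short flags, unsigned short mask)
{
    return (flags & mask) == mask;
}
```

Testing with (flags & mask) == mask rather than (flags & mask) != 0 matters when mask combines several bits, since the second form is already true when only one of them is set.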
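(4) Device operation functions
    const struct net_device_ops *netdev_ops;
This structure is the collection of hardware operation functions of the network device. It is also defined in include/linux/netdevice.h, and it is large: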
    /*
     * This structure defines the management hooks for network devices.
     * The following hooks can be defined; unless noted otherwise, they are
     * optional and can be filled with a null pointer.
     *
     * int (*ndo_init)(struct net_device *dev);
     *     This function is called once when network device is registered.
     *     The network device can use this to any late stage initializaton
     *     or semantic validattion. It can fail with an error code which will
     *     be propogated back to register_netdev
     *
     * void (*ndo_uninit)(struct net_device *dev);
     *     This function is called when device is unregistered or when registration
     *     fails. It is not called if init fails.
     *
     * int (*ndo_open)(struct net_device *dev);
     *     This function is called when network device transistions to the up
     *     state.
     *
     * int (*ndo_stop)(struct net_device *dev);
     *     This function is called when network device transistions to the down
     *     state.
     *
     * netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
     *                               struct net_device *dev);
     *     Called when a packet needs to be transmitted.
     *     Must return NETDEV_TX_OK , NETDEV_TX_BUSY.
     *        (can also return NETDEV_TX_LOCKED iff NETIF_F_LLTX)
     *     Required can not be NULL.
     *
     * u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
     *                         void *accel_priv, select_queue_fallback_t fallback);
     *     Called to decide which queue to when device supports multiple
     *     transmit queues.
     *
     * void (*ndo_change_rx_flags)(struct net_device *dev, int flags);
     *     This function is called to allow device receiver to make
     *     changes to configuration when multicast or promiscious is enabled.
     *
     * void (*ndo_set_rx_mode)(struct net_device *dev);
     *     This function is called device changes address list filtering.
     *     If driver handles unicast address filtering, it should set
     *     IFF_UNICAST_FLT to its priv_flags.
     *
     * int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
     *     This function is called when the Media Access Control address
     *     needs to be changed. If this interface is not defined, the
     *     mac address can not be changed.
     *
     * int (*ndo_validate_addr)(struct net_device *dev);
     *     Test if Media Access Control address is valid for the device.
     *
     * int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
     *     Called when a user request an ioctl which can't be handled by
     *     the generic interface code. If not defined ioctl's return
     *     not supported error code.
     *
     * int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
     *     Used to set network devices bus interface parameters. This interface
     *     is retained for legacy reason, new devices should use the bus
     *     interface (PCI) for low level management.
     *
     * int (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
     *     Called when a user wants to change the Maximum Transfer Unit
     *     of a device. If not defined, any request to change MTU will
     *     will return an error.
     *
     * void (*ndo_tx_timeout)(struct net_device *dev);
     *     Callback uses when the transmitter has not made any progress
     *     for dev->watchdog ticks.
     *
     * struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev,
     *                      struct rtnl_link_stats64 *storage);
     * struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
     *     Called when a user wants to get the network device usage
     *     statistics. Drivers must do one of the following:
     *     1. Define @ndo_get_stats64 to fill in a zero-initialised
     *        rtnl_link_stats64 structure passed by the caller.
     *     2. Define @ndo_get_stats to update a net_device_stats structure
     *        (which should normally be dev->stats) and return a pointer to
     *        it. The structure may be changed asynchronously only if each
     *        field is written atomically.
     *     3. Update dev->stats asynchronously and atomically, and define
     *        neither operation.
     *
     * int (*ndo_vlan_rx_add_vid)(struct net_device *dev, __be16 proto, u16 vid);
     *     If device support VLAN filtering this function is called when a
     *     VLAN id is registered.
     *
     * int (*ndo_vlan_rx_kill_vid)(struct net_device *dev, __be16 proto, u16 vid);
     *     If device support VLAN filtering this function is called when a
     *     VLAN id is unregistered.
     *
     * void (*ndo_poll_controller)(struct net_device *dev);
     *
     *     SR-IOV management functions.
     * int (*ndo_set_vf_mac)(struct net_device *dev, int vf, u8* mac);
     * int (*ndo_set_vf_vlan)(struct net_device *dev, int vf, u16 vlan, u8 qos);
     * int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate,
     *                        int max_tx_rate);
     * int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting);
     * int (*ndo_get_vf_config)(struct net_device *dev,
     *                          int vf, struct ifla_vf_info *ivf);
     * int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state);
     * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
     *                        struct nlattr *port[]);
     * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
     * int (*ndo_setup_tc)(struct net_device *dev, u8 tc)
     *     Called to setup 'tc' number of traffic classes in the net device. This
     *     is always called from the stack with the rtnl lock held and netif tx
     *     queues stopped. This allows the netdevice to perform queue management
     *     safely.
     *
     *     Fiber Channel over Ethernet (FCoE) offload functions.
     * int (*ndo_fcoe_enable)(struct net_device *dev);
     *     Called when the FCoE protocol stack wants to start using LLD for FCoE
     *     so the underlying device can perform whatever needed configuration or
     *     initialization to support acceleration of FCoE traffic.
     *
     * int (*ndo_fcoe_disable)(struct net_device *dev);
     *     Called when the FCoE protocol stack wants to stop using LLD for FCoE
     *     so the underlying device can perform whatever needed clean-ups to
     *     stop supporting acceleration of FCoE traffic.
     *
     * int (*ndo_fcoe_ddp_setup)(struct net_device *dev, u16 xid,
     *                           struct scatterlist *sgl, unsigned int sgc);
     *     Called when the FCoE Initiator wants to initialize an I/O that
     *     is a possible candidate for Direct Data Placement (DDP). The LLD can
     *     perform necessary setup and returns 1 to indicate the device is set up
     *     successfully to perform DDP on this I/O, otherwise this returns 0.
     *
     * int (*ndo_fcoe_ddp_done)(struct net_device *dev, u16 xid);
     *     Called when the FCoE Initiator/Target is done with the DDPed I/O as
     *     indicated by the FC exchange id 'xid', so the underlying device can
     *     clean up and reuse resources for later DDP requests.
     *
     * int (*ndo_fcoe_ddp_target)(struct net_device *dev, u16 xid,
     *                            struct scatterlist *sgl, unsigned int sgc);
     *     Called when the FCoE Target wants to initialize an I/O that
     *     is a possible candidate for Direct Data Placement (DDP). The LLD can
     *     perform necessary setup and returns 1 to indicate the device is set up
     *     successfully to perform DDP on this I/O, otherwise this returns 0.
     *
     * int (*ndo_fcoe_get_hbainfo)(struct net_device *dev,
     *                             struct netdev_fcoe_hbainfo *hbainfo);
     *     Called when the FCoE Protocol stack wants information on the underlying
     *     device. This information is utilized by the FCoE protocol stack to
     *     register attributes with Fiber Channel management service as per the
     *     FC-GS Fabric Device Management Information(FDMI) specification.
     *
     * int (*ndo_fcoe_get_wwn)(struct net_device *dev, u64 *wwn, int type);
     *     Called when the underlying device wants to override default World Wide
     *     Name (WWN) generation mechanism in FCoE protocol stack to pass its own
     *     World Wide Port Name (WWPN) or World Wide Node Name (WWNN) to the FCoE
     *     protocol stack to use.
     *
     *     RFS acceleration.
     * int (*ndo_rx_flow_steer)(struct net_device *dev, const struct sk_buff *skb,
     *                          u16 rxq_index, u32 flow_id);
     *     Set hardware filter for RFS. rxq_index is the target queue index;
     *     flow_id is a flow ID to be passed to rps_may_expire_flow() later.
     *     Return the filter ID on success, or a negative error code.
     *
     *     Slave management functions (for bridge, bonding, etc).
     * int (*ndo_add_slave)(struct net_device *dev, struct net_device *slave_dev);
     *     Called to make another netdev an underling.
     *
     * int (*ndo_del_slave)(struct net_device *dev, struct net_device *slave_dev);
     *     Called to release previously enslaved netdev.
     *
     *     Feature/offload setting functions.
     * netdev_features_t (*ndo_fix_features)(struct net_device *dev,
     *              netdev_features_t features);
     *     Adjusts the requested feature flags according to device-specific
     *     constraints, and returns the resulting flags. Must not modify
     *     the device state.
     *
     * int (*ndo_set_features)(struct net_device *dev, netdev_features_t features);
     *     Called to update device configuration to new features. Passed
     *     feature set might be less than what was returned by ndo_fix_features()).
     *     Must return >0 or -errno if it changed dev->features itself.
     *
     * int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[],
     *                    struct net_device *dev,
     *                    const unsigned char *addr, u16 vid, u16 flags)
     *     Adds an FDB entry to dev for addr.
     * int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[],
     *                    struct net_device *dev,
     *                    const unsigned char *addr, u16 vid)
     *     Deletes the FDB entry from dev coresponding to addr.
     * int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb,
     *                     struct net_device *dev, struct net_device *filter_dev,
     *                     int idx)
     *     Used to add FDB entries to dump requests. Implementers should add
     *     entries to skb and update idx with the number of entries.
     *
     * int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh,
     *                           u16 flags)
     * int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
     *                           struct net_device *dev, u32 filter_mask)
     * int (*ndo_bridge_dellink)(struct net_device *dev, struct nlmsghdr *nlh,
     *                           u16 flags);
     *
     * int (*ndo_change_carrier)(struct net_device *dev, bool new_carrier);
     *     Called to change device carrier. Soft-devices (like dummy, team, etc)
     *     which do not represent real hardware may define this to allow their
     *     userspace components to manage their virtual carrier state. Devices
     *     that determine carrier state from physical hardware properties (eg
     *     network cables) or protocol-dependent mechanisms (eg
     *     USB_CDC_NOTIFY_NETWORK_CONNECTION) should NOT implement this function.
     *
     * int (*ndo_get_phys_port_id)(struct net_device *dev,
     *                             struct netdev_phys_item_id *ppid);
     *     Called to get ID of physical port of this device. If driver does
     *     not implement this, it is assumed that the hw is not able to have
     *     multiple net devices on single physical port.
     *
     * void (*ndo_add_vxlan_port)(struct net_device *dev,
     *                            sa_family_t sa_family, __be16 port);
     *     Called by vxlan to notiy a driver about the UDP port and socket
     *     address family that vxlan is listnening to. It is called only when
     *     a new port starts listening. The operation is protected by the
     *     vxlan_net->sock_lock.
     *
     * void (*ndo_del_vxlan_port)(struct net_device *dev,
     *                            sa_family_t sa_family, __be16 port);
     *     Called by vxlan to notify the driver about a UDP port and socket
     *     address family that vxlan is not listening to anymore. The operation
     *     is protected by the vxlan_net->sock_lock.
     *
     * void* (*ndo_dfwd_add_station)(struct net_device *pdev,
     *                               struct net_device *dev)
     *     Called by upper layer devices to accelerate switching or other
     *     station functionality into hardware. 'pdev is the lowerdev
     *     to use for the offload and 'dev' is the net device that will
     *     back the offload. Returns a pointer to the private structure
     *     the upper layer will maintain.
     * void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv)
     *     Called by upper layer device to delete the station created
     *     by 'ndo_dfwd_add_station'. 'pdev' is the net device backing
     *     the station and priv is the structure returned by the add
     *     operation.
     * netdev_tx_t (*ndo_dfwd_start_xmit)(struct sk_buff *skb,
     *                                    struct net_device *dev,
     *                                    void *priv);
     *     Callback to use for xmit over the accelerated station. This
     *     is used in place of ndo_start_xmit on accelerated net
     *     devices.
     * netdev_features_t (*ndo_features_check) (struct sk_buff *skb,
     *                                          struct net_device *dev
     *                                          netdev_features_t features);
     *     Called by core transmit path to determine if device is capable of
     *     performing offload operations on a given packet. This is to give
     *     the device an opportunity to implement any restrictions that cannot
     *     be otherwise expressed by feature flags. The check is called with
     *     the set of features that the stack has calculated and it returns
     *     those the driver believes to be appropriate.
     *
     * int (*ndo_switch_parent_id_get)(struct net_device *dev,
     *                                 struct netdev_phys_item_id *psid);
     *     Called to get an ID of the switch chip this port is part of.
     *     If driver implements this, it indicates that it represents a port
     *     of a switch chip.
     * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
     *     Called to notify switch device port of bridge port STP
     *     state change.
     */
    struct net_device_ops {
        int (*ndo_init)(struct net_device *dev);
        void (*ndo_uninit)(struct net_device *dev);
        int (*ndo_open)(struct net_device *dev);
        int (*ndo_stop)(struct net_device *dev);
        netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
                                      struct net_device *dev);
        u16 (*ndo_select_queue)(struct net_device *dev,
                                struct sk_buff *skb,
                                void *accel_priv,
                                select_queue_fallback_t fallback);
        void (*ndo_change_rx_flags)(struct net_device *dev, int flags);
        void (*ndo_set_rx_mode)(struct net_device *dev);
        int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
        int (*ndo_validate_addr)(struct net_device *dev);
        int (*ndo_do_ioctl)(struct net_device *dev,
                            struct ifreq *ifr, int cmd);
        int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
        int (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
        int (*ndo_neigh_setup)(struct net_device *dev,
                               struct neigh_parms *);
        void (*ndo_tx_timeout)(struct net_device *dev);

        struct rtnl_link_stats64* (*ndo_get_stats64)(struct net_device *dev,
                                   struct rtnl_link_stats64 *storage);
        struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);

        int (*ndo_vlan_rx_add_vid)(struct net_device *dev,
                                   __be16 proto, u16 vid);
        int (*ndo_vlan_rx_kill_vid)(struct net_device *dev,
                                    __be16 proto, u16 vid);
    #ifdef CONFIG_NET_POLL_CONTROLLER
        void (*ndo_poll_controller)(struct net_device *dev);
        int (*ndo_netpoll_setup)(struct net_device *dev,
                                 struct netpoll_info *info);
        void (*ndo_netpoll_cleanup)(struct net_device *dev);
    #endif
    #ifdef CONFIG_NET_RX_BUSY_POLL
        int (*ndo_busy_poll)(struct napi_struct *dev);
    #endif
        int (*ndo_set_vf_mac)(struct net_device *dev,
                              int queue, u8 *mac);
        int (*ndo_set_vf_vlan)(struct net_device *dev,
                               int queue, u16 vlan, u8 qos);
        int (*ndo_set_vf_rate)(struct net_device *dev,
                               int vf, int min_tx_rate,
                               int max_tx_rate);
        int (*ndo_set_vf_spoofchk)(struct net_device *dev,
                                   int vf, bool setting);
        int (*ndo_get_vf_config)(struct net_device *dev,
                                 int vf,
                                 struct ifla_vf_info *ivf);
        int (*ndo_set_vf_link_state)(struct net_device *dev,
                                     int vf, int link_state);
        int (*ndo_set_vf_port)(struct net_device *dev,
                               int vf,
                               struct nlattr *port[]);
        int (*ndo_get_vf_port)(struct net_device *dev,
                               int vf, struct sk_buff *skb);
        int (*ndo_setup_tc)(struct net_device *dev, u8 tc);
    #if IS_ENABLED(CONFIG_FCOE)
        int (*ndo_fcoe_enable)(struct net_device *dev);
        int (*ndo_fcoe_disable)(struct net_device *dev);
        int (*ndo_fcoe_ddp_setup)(struct net_device *dev,
                                  u16 xid,
                                  struct scatterlist *sgl,
                                  unsigned int sgc);
        int (*ndo_fcoe_ddp_done)(struct net_device *dev,
                                 u16 xid);
        int (*ndo_fcoe_ddp_target)(struct net_device *dev,
                                   u16 xid,
                                   struct scatterlist *sgl,
                                   unsigned int sgc);
        int (*ndo_fcoe_get_hbainfo)(struct net_device *dev,
                                    struct netdev_fcoe_hbainfo *hbainfo);
    #endif

    #if IS_ENABLED(CONFIG_LIBFCOE)
    #define NETDEV_FCOE_WWNN 0
    #define NETDEV_FCOE_WWPN 1
        int (*ndo_fcoe_get_wwn)(struct net_device *dev,
                                u64 *wwn, int type);
    #endif

    #ifdef CONFIG_RFS_ACCEL
        int (*ndo_rx_flow_steer)(struct net_device *dev,
                                 const struct sk_buff *skb,
                                 u16 rxq_index,
                                 u32 flow_id);
    #endif
        int (*ndo_add_slave)(struct net_device *dev,
                             struct net_device *slave_dev);
        int (*ndo_del_slave)(struct net_device *dev,
                             struct net_device *slave_dev);
        netdev_features_t (*ndo_fix_features)(struct net_device *dev,
                                              netdev_features_t features);
        int (*ndo_set_features)(struct net_device *dev,
                                netdev_features_t features);
        int (*ndo_neigh_construct)(struct neighbour *n);
        void (*ndo_neigh_destroy)(struct neighbour *n);

        int (*ndo_fdb_add)(struct ndmsg *ndm,
                           struct nlattr *tb[],
                           struct net_device *dev,
                           const unsigned char *addr,
                           u16 vid,
                           u16 flags);
        int (*ndo_fdb_del)(struct ndmsg *ndm,
                           struct nlattr *tb[],
                           struct net_device *dev,
                           const unsigned char *addr,
                           u16 vid);
        int (*ndo_fdb_dump)(struct sk_buff *skb,
                            struct netlink_callback *cb,
                            struct net_device *dev,
                            struct net_device *filter_dev,
                            int idx);

        int (*ndo_bridge_setlink)(struct net_device *dev,
                                  struct nlmsghdr *nlh,
                                  u16 flags);
        int (*ndo_bridge_getlink)(struct sk_buff *skb,
                                  u32 pid, u32 seq,
                                  struct net_device *dev,
                                  u32 filter_mask);
        int (*ndo_bridge_dellink)(struct net_device *dev,
                                  struct nlmsghdr *nlh,
                                  u16 flags);
        int (*ndo_change_carrier)(struct net_device *dev,
                                  bool new_carrier);
        int (*ndo_get_phys_port_id)(struct net_device *dev,
                                    struct netdev_phys_item_id *ppid);
        void (*ndo_add_vxlan_port)(struct net_device *dev,
                                   sa_family_t sa_family,
                                   __be16 port);
        void (*ndo_del_vxlan_port)(struct net_device *dev,
                                   sa_family_t sa_family,
                                   __be16 port);

        void* (*ndo_dfwd_add_station)(struct net_device *pdev,
                                      struct net_device *dev);
        void (*ndo_dfwd_del_station)(struct net_device *pdev,
                                     void *priv);

        netdev_tx_t (*ndo_dfwd_start_xmit)(struct sk_buff *skb,
                                           struct net_device *dev,
                                           void *priv);
        int (*ndo_get_lock_subclass)(struct net_device *dev);
        netdev_features_t (*ndo_features_check)(struct sk_buff *skb,
                                                struct net_device *dev,
                                                netdev_features_t features);
    #ifdef CONFIG_NET_SWITCHDEV
        int (*ndo_switch_parent_id_get)(struct net_device *dev,
                                        struct netdev_phys_item_id *psid);
        int (*ndo_switch_port_stp_update)(struct net_device *dev,
                                          u8 state);
    #endif
    };
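ndo_open() opens the network interface and acquires the I/O addresses, IRQ, DMA channel, and other resources that the device needs. ndo_stop() stops the network interface and is the opposite of ndo_open().
    netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb, struct net_device *dev);
ndo_start_xmit() starts the transmission of a packet. When the system calls the driver's xmit function it passes in an sk_buff pointer, so that the driver can get at the packet handed down from the upper layers.
    void (*ndo_tx_timeout)(struct net_device *dev);
ndo_tx_timeout() is called when the transmission of a packet times out. It has to bring the device back to a normal state, for example by restarting the transmission or by resetting the hardware.
    struct net_device_stats* (*ndo_get_stats)(struct net_device *dev);
ndo_get_stats() returns status information about the network device as a pointer to a net_device_stats structure. That structure holds detailed traffic statistics for the device, such as the number of packets and bytes sent and received.
    int (*ndo_do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
    int (*ndo_set_config)(struct net_device *dev, struct ifmap *map);
    int (*ndo_set_mac_address)(struct net_device *dev, void *addr);
ndo_do_ioctl() performs device-specific I/O control.
ndo_set_config() configures the interface and can also be used to change the I/O address and interrupt number of the device.
ndo_set_mac_address() sets the MAC address of the device.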
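How a driver fills in such an operations table can be sketched with a reduced user-space model. Everything here is an assumption made for illustration: struct model_netdev_ops keeps only five hooks, the hypothetical xxx device has a two-slot transmit ring, transmitted packets never complete on their own, and MODEL_TX_OK and MODEL_TX_BUSY merely play the role of the kernel's NETDEV_TX_OK and NETDEV_TX_BUSY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_tx { MODEL_TX_OK = 0, MODEL_TX_BUSY = 1 };

#define XXX_TX_RING_SIZE 2u /* hypothetical hardware limit */

struct model_stats {
    unsigned long tx_packets, tx_bytes, tx_errors;
};

struct model_netdev;

struct model_netdev_ops {
    int (*open)(struct model_netdev *dev);
    int (*stop)(struct model_netdev *dev);
    enum model_tx (*start_xmit)(size_t len, struct model_netdev *dev);
    void (*tx_timeout)(struct model_netdev *dev);
    struct model_stats *(*get_stats)(struct model_netdev *dev);
};

struct model_netdev {
    const struct model_netdev_ops *ops;
    struct model_stats stats;
    bool running;
    unsigned int tx_in_flight; /* packets handed to the "hardware" */
};

static int xxx_open(struct model_netdev *dev)
{
    dev->running = true;
    dev->tx_in_flight = 0;
    return 0;
}

static int xxx_stop(struct model_netdev *dev)
{
    dev->running = false;
    return 0;
}

/* Accept the packet if the device is up and the ring has a free slot. */
static enum model_tx xxx_start_xmit(size_t len, struct model_netdev *dev)
{
    if (!dev->running || dev->tx_in_flight >= XXX_TX_RING_SIZE)
        return MODEL_TX_BUSY;
    dev->tx_in_flight++;
    dev->stats.tx_packets++;
    dev->stats.tx_bytes += len;
    return MODEL_TX_OK;
}

/* Recovery: count the error and "reset" the ring so sending can resume. */
static void xxx_tx_timeout(struct model_netdev *dev)
{
    dev->stats.tx_errors++;
    dev->tx_in_flight = 0;
}

static struct model_stats *xxx_get_stats(struct model_netdev *dev)
{
    return &dev->stats;
}

static const struct model_netdev_ops xxx_netdev_ops = {
    .open       = xxx_open,
    .stop       = xxx_stop,
    .start_xmit = xxx_start_xmit,
    .tx_timeout = xxx_tx_timeout,
    .get_stats  = xxx_get_stats,
};
```

Designated initializers keep the table readable and leave every hook that is not listed as a null pointer, which matches the kernel comment's statement that most hooks are optional.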
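Besides netdev_ops, net_device contains other operation sets such as ethtool_ops and header_ops:
    const struct ethtool_ops *ethtool_ops;
    const struct header_ops *header_ops;
The member functions of ethtool_ops correspond to the command options of the user-space ethtool utility. ethtool manages network cards and their drivers, giving Linux network developers and administrators the means to configure, inspect, and debug the NIC hardware, the driver, and the network protocol stack.
header_ops handles the hardware header. Its main tasks are creating the hardware header and parsing the hardware header out of a given sk_buff.
(5) Auxiliary members
    unsigned long trans_start;
    unsigned long last_rx;
trans_start records the timestamp at which the most recent packet started transmission, and last_rx records the timestamp at which the most recent packet was received. Both timestamps are in jiffies, and the driver is expected to maintain both members.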
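Timestamps in jiffies are unsigned counters that eventually wrap around, so they must not be compared with a plain greater-than. The user-space sketch below shows the wrap-safe comparison in the style of the kernel's time_after() macro and applies it to a transmit-timeout check. The function names are ours, and the claim that the kernel's watchdog compares the current time against trans_start plus a timeout is our understanding rather than something taken from the text above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True if a is later than b, even when the counter has wrapped between
 * the two samples. Works as long as the two values are less than half
 * the counter range apart (assumes two's-complement conversion to long).
 */
static bool model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Has the transmitter been silent for longer than the allowed timeout? */
static bool model_tx_timed_out(unsigned long now,
                               unsigned long trans_start,
                               unsigned long timeout)
{
    return model_time_after(now, trans_start + timeout);
}
```

With a naive now > trans_start + timeout test, a trans_start taken just before the counter wraps would produce a tiny deadline and then a spurious timeout on the very next check, or no timeout at all, depending on which side has wrapped.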
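Normally a network device driver receives packets through interrupts, whereas ndo_poll_controller() uses pure polling. A third way of receiving data is NAPI (New API), whose receive flow is "receive interrupt arrives -> disable the receive interrupt -> poll for packets until the queue is empty -> enable the receive interrupt -> receive interrupt arrives ...". The kernel provides the following NAPI-related API:
    static inline void netif_napi_add(struct net_device *dev, struct napi_struct *napi, int (*poll)(struct napi_struct *, int), int weight);
    static inline void netif_napi_del(struct napi_struct *napi);
These two functions initialize and remove a NAPI instance. The poll parameter of netif_napi_add() is the polling function that NAPI will schedule.
    static inline void napi_enable(struct napi_struct *n);
    static inline void napi_disable(struct napi_struct *n);
These two functions enable and disable NAPI scheduling.
    static inline int napi_schedule_prep(struct napi_struct *n);
This function checks whether NAPI can be scheduled, while napi_schedule() schedules the polling instance to run. Its prototype is:
    static inline void napi_schedule(struct napi_struct *n);
When NAPI processing is finished, the driver should call:
    static inline void napi_complete(struct napi_struct *n);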
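The NAPI receive cycle can be traced with a small state machine in user space. This is a sketch of the flow only: struct model_nic and the model_ functions are invented for the example, the budget argument plays the role of the weight passed to the poll function, and no real interrupt or softirq is involved.

```c
#include <assert.h>
#include <stdbool.h>

struct model_nic {
    unsigned int rx_pending;  /* packets waiting in the device */
    bool rx_irq_enabled;      /* is the receive interrupt unmasked? */
    bool poll_scheduled;      /* has polling been scheduled? */
    unsigned int delivered;   /* packets handed to the stack so far */
};

/* Interrupt handler: mask the receive interrupt and schedule polling. */
static void model_rx_interrupt(struct model_nic *nic)
{
    if (!nic->rx_irq_enabled)
        return; /* already polling, nothing to do */
    nic->rx_irq_enabled = false;
    nic->poll_scheduled = true;
}

/*
 * Poll function: take at most budget packets. If the device ran dry
 * before the budget was used up, finish polling and unmask the
 * interrupt; otherwise stay scheduled so polling continues.
 */
static unsigned int model_poll(struct model_nic *nic, unsigned int budget)
{
    unsigned int done = 0;

    while (done < budget && nic->rx_pending > 0) {
        nic->rx_pending--;
        nic->delivered++;
        done++;
    }
    if (done < budget) {
        nic->poll_scheduled = false;
        nic->rx_irq_enabled = true;
    }
    return done;
}
```

With five packets pending and a budget of four, the first poll returns 4 and stays scheduled, and the second returns 1, completes, and re-enables the interrupt. Under heavy load the device therefore stays in polling mode and the per-packet interrupt cost disappears.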
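3. Device driver function layer
The members of the net_device structure (its attributes and the function pointers in net_device_ops) have to be given concrete values and functions by the device driver function layer. For a specific device xxx, we write the corresponding functions of this layer, with names such as xxx_open(), xxx_stop(), xxx_tx(), xxx_hard_header(), xxx_get_stats(), and xxx_tx_timeout().
Because packet reception can be triggered by an interrupt, the other main part of the device driver function layer is the interrupt handler, which reads the packets received by the hardware and passes them to the upper-layer protocols. It may therefore consist of xxx_interrupt() and xxx_rx(): the former does basic work such as determining the interrupt type, and the latter does the complex work of building the packet and handing it upward.
For a particular device we can also define private data and operations and wrap them in a private information structure, xxx_private, whose pointer is assigned to the private member of net_device. The xxx_private structure, which we define ourselves, can contain the device's special attributes and operations, spinlocks and semaphores, timers, statistics, and so on.
When the driver needs its private data, it uses the interface defined in netdevice.h:
    static inline void *netdev_priv(const struct net_device *dev);
For example, the dm9000_probe() function of the driver drivers/net/ethernet/davicom/dm9000.c allocates the network device with alloc_etherdev(sizeof(struct board_info)), which makes the board_info structure the private data of that device. Other functions can then retrieve the private data easily, for example:
    static int dm9000_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        unsigned long flags;
        board_info_t *db = netdev_priv(dev);
        ...
    }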
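One way to see why the allocation call takes the size of the private structure is to model the memory layout in user space. The sketch below assumes, as we understand the kernel to do, that the private area is placed directly after the device structure at an aligned offset within a single allocation. MODEL_ALIGN, struct model_netdev, struct xxx_private, and the model_ functions are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_ALIGN 32u /* assumed alignment of the private area */
#define MODEL_ALIGN_UP(x) \
    (((size_t)(x) + MODEL_ALIGN - 1) & ~(size_t)(MODEL_ALIGN - 1))

/* Reduced stand-in for net_device. */
struct model_netdev {
    char name[16];
    unsigned int mtu;
};

/* Driver-defined private data for the hypothetical xxx device. */
struct xxx_private {
    unsigned long tx_count;
    int link_up;
};

/* One zeroed allocation holds the device followed by the private area. */
static struct model_netdev *model_alloc_netdev(size_t sizeof_priv)
{
    return calloc(1, MODEL_ALIGN_UP(sizeof(struct model_netdev)) + sizeof_priv);
}

/* The private area is found by pointer arithmetic, not by a stored pointer. */
static void *model_netdev_priv(const struct model_netdev *dev)
{
    return (char *)dev + MODEL_ALIGN_UP(sizeof(struct model_netdev));
}
```

Because the private data lives in the same allocation as the device, freeing the device releases both, and retrieving the private data costs only an addition.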
