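There are already many books that analyse Linux networking, including 《追蹤Linux TCP/IP代碼運行》 and 《Linux內核源碼剖析——TCP/IP實現》. Here I only follow the basic path a packet takes through the Linux kernel, and try to lay out the main flow as a framework.
How the kernel receives data from the NIC, the traditional process:
1. Data arrives at the NIC;
2. The NIC raises an interrupt to the kernel;
3. The kernel uses I/O instructions to read the data from the NIC's I/O region;
This process can be seen in the interrupt handlers of many NIC drivers (the very old ones).
However, this approach has one serious problem: when heavy traffic arrives, the NIC raises a large number of interrupts, and the kernel, in interrupt context, wastes a lot of resources handling the interrupts themselves. That raises the question, "can we avoid using interrupts?" The answer is polling, the idea behind NAPI. There is nothing mysterious about it: the kernel masks the interrupt and then asks the NIC every so often, "do you have any data?" (In practice NAPI is a hybrid: the first packet still raises an interrupt, after which the driver masks receive interrupts and the kernel polls until the queue is drained.)
As this description itself shows, when traffic is light, polling in turn burns a lot of CPU for nothing. Each approach has its strengths.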
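OK, another problem: reading data from the NIC's I/O region, whether I/O registers or I/O memory, has to be done by the CPU, and that also costs CPU time: "the CPU reads from the I/O region and then puts the data into memory" (memory here means the system's own physical memory, unrelated to the peripheral's memory; it is also called main memory). This leads naturally to DMA: let the NIC read and write its I/O data directly in main memory. CPU, this is none of your business, go find something fun to do: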
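1. First, the kernel sets up a circular buffer queue in main memory for sending and receiving data (usually called the DMA ring buffer).
2. The kernel creates a DMA mapping for this buffer and so hands the queue over to the NIC;
3. When the NIC receives data, it puts it straight into the ring buffer, that is, straight into main memory, and then raises an interrupt to the system;
4. On receiving the interrupt, the kernel removes the DMA mapping, so that it can read the data directly from main memory;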
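Heh, this process does far less work than the traditional one: the device puts the data straight into main memory without CPU intervention, so efficiency goes up quite a bit, doesn't it?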
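Corresponding to the four steps above, here is the concrete implementation:
1) Allocate the DMA ring buffer
In the Linux kernel a buffer is described by an skb. "Allocating" here means creating a certain number of skbs and linking them together through the e1000_rx_ring ring-buffer queue descriptor.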
2) Set up the DMA mapping
The kernel sets up the mapping by calling
dma_map_single(struct device *dev,void *buffer,size_t size,enum dma_data_direction direction)
struct device *dev: describes a device;
buffer: the address to map to the device, i.e. the data area of one skb; to map all of them, simply loop over the whole list;
size: the buffer size;
direction: the mapping direction, i.e. who transfers to whom: DMA_TO_DEVICE, DMA_FROM_DEVICE, or DMA_BIDIRECTIONAL when data flows both ways between device and memory. Receive buffers are normally mapped DMA_FROM_DEVICE;
For PCI devices (NICs are usually PCI), this goes through the wrapper function pci_map_single. That hands the buffer over to the device, which can then read and write data in it directly.
3) This step is done by the hardware;
4) Remove the mapping
This is dma_unmap_single; for PCI, its wrapper pci_unmap_single is mostly used. Until the mapping is removed, control of the buffer stays with the device, so this call is needed to put the CPU back in charge: the data has been received, and it is the CPU that has to pass it to the upper network stack. Before unmapping, the driver usually needs to read some status bits and the like; this is generally done by calling dma_sync_single_for_cpu(), which lets the CPU access the contents of the DMA buffer before the mapping is removed.
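First, after a packet arrives at the NIC as an optical or electrical signal, it passes through the NIC driver, is converted into an skb, and enters the link layer. So I start by analysing the flow of the NIC driver.
Source location: the drivers/net/e1000e directory.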
/* Register the NIC driver, following the usual PCI driver registration pattern */
static int __init e1000_init_module(void)
{
	int ret;

	printk(KERN_INFO "%s: Intel(R) PRO/1000 Network Driver - %s\n",
	       e1000e_driver_name, e1000e_driver_version);
	printk(KERN_INFO "%s: Copyright (c) 1999-2008 Intel Corporation.\n",
	       e1000e_driver_name);
	ret = pci_register_driver(&e1000_driver);
	pm_qos_add_requirement(PM_QOS_CPU_DMA_LATENCY, e1000e_driver_name,
			       PM_QOS_DEFAULT_VALUE);
	return ret;
}
Next, look at the driver structure. PCI driver development itself is not covered here.
/* PCI Device API Driver */
static struct pci_driver e1000_driver = {
	.name     = e1000e_driver_name,
	.id_table = e1000_pci_tbl,
	.probe    = e1000_probe,
	.remove   = __devexit_p(e1000_remove),
#ifdef CONFIG_PM
	/* Power Management Hooks */
	.suspend  = e1000_suspend,
	.resume   = e1000_resume,
#endif
	.shutdown = e1000_shutdown,
	.err_handler = &e1000_err_handler
};
The most important function here is e1000_probe. First, what does it do? The source describes it as the "Device Initialization Routine", which should not be hard to understand.
static int __devinit e1000_probe(struct pci_dev *pdev,
				 const struct pci_device_id *ent)
{
	struct net_device *netdev;
	struct e1000_adapter *adapter;
	struct e1000_hw *hw;
	const struct e1000_info *ei = e1000_info_tbl[ent->driver_data];
	resource_size_t mmio_start, mmio_len;
	resource_size_t flash_start, flash_len;
	static int cards_found;
	int i, err, pci_using_dac;
	u16 eeprom_data = 0;
	u16 eeprom_apme_mask = E1000_EEPROM_APME;

	e1000e_disable_l1aspm(pdev);

	/* Device initialization starts here: name, memory and so on. */
	err = pci_enable_device_mem(pdev);
	if (err)
		return err;

	pci_using_dac = 0;
	err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
	if (!err) {
		err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64));
		if (!err)
			pci_using_dac = 1;
	} else {
		err = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
		if (err) {
			err = pci_set_consistent_dma_mask(pdev,
							  DMA_BIT_MASK(32));
			if (err) {
				dev_err(&pdev->dev, "No usable DMA "
					"configuration, aborting\n");
				goto err_dma;
			}
		}
	}

	err = pci_request_selected_regions_exclusive(pdev,
					  pci_select_bars(pdev, IORESOURCE_MEM),
					  e1000e_driver_name);
	if (err)
		goto err_pci_reg;

	/* AER (Advanced Error Reporting) hooks */
	err = pci_enable_pcie_error_reporting(pdev);
	if (err) {
		dev_err(&pdev->dev, "pci_enable_pcie_error_reporting failed "
			"0x%x\n", err);
		/* non-fatal, continue */
	}

	pci_set_master(pdev);
	/* PCI config space info */
	err = pci_save_state(pdev);
	if (err)
		goto err_alloc_etherdev;

	err = -ENOMEM;
	/*
	 * Allocate a container for the driver (the net_device with its private
	 * e1000_adapter); everything the driver does later is built on it.
	 */
	netdev = alloc_etherdev(sizeof(struct e1000_adapter));
	if (!netdev)
		goto err_alloc_etherdev;

	SET_NETDEV_DEV(netdev, &pdev->dev);

	pci_set_drvdata(pdev, netdev);
	adapter = netdev_priv(netdev);
	hw = &adapter->hw;
	adapter->netdev = netdev;
	adapter->pdev = pdev;
	adapter->ei = ei;
	adapter->pba = ei->pba;
	adapter->flags = ei->flags;
	adapter->flags2 = ei->flags2;
	adapter->hw.adapter = adapter;
	adapter->hw.mac.type = ei->mac;
	adapter->max_hw_frame_size = ei->max_hw_frame_size;
	adapter->msg_enable = (1 << NETIF_MSG_DRV | NETIF_MSG_PROBE) - 1;

	/* 0 selects the BAR that holds the device's memory-mapped region */
	mmio_start = pci_resource_start(pdev, 0);
	mmio_len = pci_resource_len(pdev, 0);

	err = -EIO;
	/*
	 * As I understand it, this maps the BAR into the kernel's address
	 * space; hw_addr is the base address through which the driver
	 * accesses the NIC's registers (it is not the MAC address).
	 */
	adapter->hw.hw_addr = ioremap(mmio_start, mmio_len);
	if (!adapter->hw.hw_addr)
		goto err_ioremap;

	if ((adapter->flags & FLAG_HAS_FLASH) &&
	    (pci_resource_flags(pdev, 1) & IORESOURCE_MEM)) {
		flash_start = pci_resource_start(pdev, 1);
		flash_len = pci_resource_len(pdev, 1);
		adapter->hw.flash_address = ioremap(flash_start, flash_len);
		if (!adapter->hw.flash_address)
			goto err_flashmap;
	}

	/* construct the net_device struct */
	netdev->netdev_ops = &e1000e_netdev_ops;
	e1000e_set_ethtool_ops(netdev);
	netdev->watchdog_timeo = 5 * HZ;
	netif_napi_add(netdev, &adapter->napi, e1000_clean, 64);
	strncpy(netdev->name, pci_name(pdev), sizeof(netdev->name) - 1);

	netdev->mem_start = mmio_start;
	netdev->mem_end = mmio_start + mmio_len;

	adapter->bd_number = cards_found++;

	e1000e_check_options(adapter);

	/* setup adapter struct */
	err = e1000_sw_init(adapter);
	if (err)
		goto err_sw_init;

	err = -EIO;

	memcpy(&hw->mac.ops, ei->mac_ops, sizeof(hw->mac.ops));
	memcpy(&hw->nvm.ops, ei->nvm_ops, sizeof(hw->nvm.ops));
	memcpy(&hw->phy.ops, ei->phy_ops, sizeof(hw->phy.ops));

	err = ei->get_variants(adapter);
	if (err)
		goto err_hw_init;

	if ((adapter->flags & FLAG_IS_ICH) &&
	    (adapter->flags & FLAG_READ_ONLY_NVM))
		e1000e_write_protect_nvm_ich8lan(&adapter->hw);

	hw->mac.ops.get_bus_info(&adapter->hw);

	adapter->hw.phy.autoneg_wait_to_complete = 0;

	/* Copper options */
	if (adapter->hw.phy.media_type == e1000_media_type_copper) {
		adapter->hw.phy.mdix = AUTO_ALL_MODES;
		adapter->hw.phy.disable_polarity_correction = 0;
		adapter->hw.phy.ms_type = e1000_ms_hw_default;
	}

	if (e1000_check_reset_block(&adapter->hw))
		e_info("PHY reset is blocked due to SOL/IDER session.\n");

	netdev->features = NETIF_F_SG |
			   NETIF_F_HW_CSUM |
			   NETIF_F_HW_VLAN_TX |
			   NETIF_F_HW_VLAN_RX;

	if (adapter->flags & FLAG_HAS_HW_VLAN_FILTER)
		netdev->features |= NETIF_F_HW_VLAN_FILTER;

	netdev->features |= NETIF_F_TSO;
	netdev->features |= NETIF_F_TSO6;

	netdev->vlan_features |= NETIF_F_TSO;
	netdev->vlan_features |= NETIF_F_TSO6;
	netdev->vlan_features |= NETIF_F_HW_CSUM;
	netdev->vlan_features |= NETIF_F_SG;

	if (pci_using_dac)
		netdev->features |= NETIF_F_HIGHDMA;

	if (e1000e_enable_mng_pass_thru(&adapter->hw))
		adapter->flags |= FLAG_MNG_PT_ENABLED;

	/*
	 * before reading the NVM, reset the controller to
	 * put the device in a known good starting state
	 */
	adapter->hw.mac.ops.reset_hw(&adapter->hw);

	/*
	 * systems with ASPM and others may see the checksum fail on the first
	 * attempt. Let's give it a few tries
	 */
	for (i = 0;; i++) {
		if (e1000_validate_nvm_checksum(&adapter->hw) >= 0)
			break;
		if (i == 2) {
			e_err("The NVM Checksum Is Not Valid\n");
			err = -EIO;
			goto err_eeprom;
		}
	}

	e1000_eeprom_checks(adapter);

	/* copy the MAC address out of the NVM */
	if (e1000e_read_mac_addr(&adapter->hw))
		e_err("NVM Read Error while reading MAC address\n");

	memcpy(netdev->dev_addr, adapter->hw.mac.addr, netdev->addr_len);
	memcpy(netdev->perm_addr, adapter->hw.mac.addr, netdev->addr_len);

	if (!is_valid_ether_addr(netdev->perm_addr)) {
		e_err("Invalid MAC Address: %pM\n", netdev->perm_addr);
		err = -EIO;
		goto err_eeprom;
	}

	init_timer(&adapter->watchdog_timer);
	adapter->watchdog_timer.function = &e1000_watchdog;
	adapter->watchdog_timer.data = (unsigned long) adapter;

	init_timer(&adapter->phy_info_timer);
	adapter->phy_info_timer.function = &e1000_update_phy_info;
	adapter->phy_info_timer.data = (unsigned long) adapter;

	INIT_WORK(&adapter->reset_task, e1000_reset_task);
	INIT_WORK(&adapter->watchdog_task, e1000_watchdog_task);
	INIT_WORK(&adapter->downshift_task, e1000e_downshift_workaround);
	INIT_WORK(&adapter->update_phy_task, e1000e_update_phy_task);

	/* Initialize link parameters. User can change them with ethtool */
	adapter->hw.mac.autoneg = 1;
	adapter->fc_autoneg = 1;
	adapter->hw.fc.requested_mode = e1000_fc_default;
	adapter->hw.fc.current_mode = e1000_fc_default;
	adapter->hw.phy.autoneg_advertised = 0x2f;

	/*
	 * The default size of both the receive ring and the transmit ring
	 * is 256. In fact not much data arrives per interrupt; in my own
	 * experiments it was only one or two packets. The ring does not hold
	 * sk_buffs permanently: it is where DMA places the data handed to the
	 * kernel for one interrupt, and once the interrupt has been handled
	 * the sk_buff is moved on.
	 */
	/* ring size defaults */
	adapter->rx_ring->count = 256;
	adapter->tx_ring->count = 256;

	/*
	 * Initial Wake on LAN setting - If APM wake is enabled in
	 * the EEPROM, enable the ACPI Magic Packet filter
	 */
	if (adapter->flags & FLAG_APME_IN_WUC) {
		/* APME bit in EEPROM is mapped to WUC.APME */
		eeprom_data = er32(WUC);
		eeprom_apme_mask = E1000_WUC_APME;
		if (eeprom_data & E1000_WUC_PHY_WAKE)
			adapter->flags2 |= FLAG2_HAS_PHY_WAKEUP;
	} else if (adapter->flags & FLAG_APME_IN_CTRL3) {
		if (adapter->flags & FLAG_APME_CHECK_PORT_B &&
		    (adapter->hw.bus.func == 1))
			e1000_read_nvm(&adapter->hw,
				NVM_INIT_CONTROL3_PORT_B, 1, &eeprom_data);
		else
			e1000_read_nvm(&adapter->hw,
				NVM_INIT_CONTROL3_PORT_A, 1, &eeprom_data);
	}

	/* fetch WoL from EEPROM */
	if (eeprom_data & eeprom_apme_mask)
		adapter->eeprom_wol |= E1000_WUFC_MAG;

	/*
	 * now that we have the eeprom settings, apply the special cases
	 * where the eeprom may be wrong or the board simply won't support
	 * wake on lan on a particular port
	 */
	if (!(adapter->flags & FLAG_HAS_WOL))
		adapter->eeprom_wol = 0;

	/* initialize the wol settings based on the eeprom settings */
	adapter->wol = adapter->eeprom_wol;
	device_set_wakeup_enable(&adapter->pdev->dev, adapter->wol);

	/* save off EEPROM version number */
	e1000_read_nvm(&adapter->hw, 5, 1, &adapter->eeprom_vers);

	/* reset the hardware with the new settings */
	e1000e_reset(adapter);

	/*
	 * If the controller has AMT, do not set DRV_LOAD until the interface
	 * is up. For all other cases, let the f/w know that the h/w is now
	 * under the control of the driver.
	 */
	if (!(adapter->flags & FLAG_HAS_AMT))
		e1000_get_hw_control(adapter);

	strcpy(netdev->name, "eth%d");
	/* Register the network device */
	err = register_netdev(netdev);
	if (err)
		goto err_register;

	/* carrier off reporting is important to ethtool even BEFORE open */
	netif_carrier_off(netdev);

	e1000_print_device_info(adapter);

	return 0;

err_register:
	if (!(adapter->flags & FLAG_HAS_AMT))
		e1000_release_hw_control(adapter);
err_eeprom:
	if (!e1000_check_reset_block(&adapter->hw))
		e1000_phy_hw_reset(&adapter->hw);
err_hw_init:
	kfree(adapter->tx_ring);
	kfree(adapter->rx_ring);
err_sw_init:
	if (adapter->hw.flash_address)
		iounmap(adapter->hw.flash_address);
	e1000e_reset_interrupt_capability(adapter);
err_flashmap:
	iounmap(adapter->hw.hw_addr);
err_ioremap:
	free_netdev(netdev);
err_alloc_etherdev:
	pci_release_selected_regions(pdev,
				     pci_select_bars(pdev, IORESOURCE_MEM));
err_pci_reg:
err_dma:
	pci_disable_device(pdev);
	return err;
}
The function above completes driver initialization and device registration. Below are the operations registered for the network device:
static const struct net_device_ops e1000e_netdev_ops = {
	.ndo_open		= e1000_open,
	.ndo_stop		= e1000_close,
	.ndo_start_xmit		= e1000_xmit_frame,
	.ndo_get_stats		= e1000_get_stats,
	.ndo_set_multicast_list	= e1000_set_multi,
	.ndo_set_mac_address	= e1000_set_mac,
	.ndo_change_mtu		= e1000_change_mtu,
	.ndo_do_ioctl		= e1000_ioctl,
	.ndo_tx_timeout		= e1000_tx_timeout,
	.ndo_validate_addr	= eth_validate_addr,
	.ndo_vlan_rx_register	= e1000_vlan_rx_register,
	.ndo_vlan_rx_add_vid	= e1000_vlan_rx_add_vid,
	.ndo_vlan_rx_kill_vid	= e1000_vlan_rx_kill_vid,
#ifdef CONFIG_NET_POLL_CONTROLLER
	.ndo_poll_controller	= e1000_netpoll,
#endif
};
Note the last entry in the table, e1000_netpoll:
static void e1000_netpoll(struct net_device *netdev)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);

	disable_irq(adapter->pdev->irq);	/* disable the device's IRQ */
	e1000_intr(adapter->pdev->irq, netdev);	/* call the interrupt handler directly */
	enable_irq(adapter->pdev->irq);
}
This is the NIC driver's interrupt handler. It is the top half: it does little beyond scheduling NAPI, and the actual receive work is left to the softirq (the bottom half).
static irqreturn_t e1000_intr(int irq, void *data)
{
	struct net_device *netdev = data;
	struct e1000_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
	u32 rctl, icr = er32(ICR);

	if (!icr)
		return IRQ_NONE;  /* Not our interrupt */

	/*
	 * IMS will not auto-mask if INT_ASSERTED is not set, and if it is
	 * not set, then the adapter didn't send an interrupt
	 */
	if (!(icr & E1000_ICR_INT_ASSERTED))
		return IRQ_NONE;

	/*
	 * Interrupt Auto-Mask...upon reading ICR,
	 * interrupts are masked.  No need for the
	 * IMC write
	 */

	if (icr & E1000_ICR_LSC) {
		hw->mac.get_link_status = 1;
		/*
		 * ICH8 workaround-- Call gig speed drop workaround on cable
		 * disconnect (LSC) before accessing any PHY registers
		 */
		if ((adapter->flags & FLAG_LSC_GIG_SPEED_DROP) &&
		    (!(er32(STATUS) & E1000_STATUS_LU)))
			schedule_work(&adapter->downshift_task);

		/*
		 * 80003ES2LAN workaround--
		 * For packet buffer work-around on link down event;
		 * disable receives here in the ISR and
		 * reset adapter in watchdog
		 */
		if (netif_carrier_ok(netdev) &&
		    (adapter->flags & FLAG_RX_NEEDS_RESTART)) {
			/* disable receives */
			rctl = er32(RCTL);
			ew32(RCTL, rctl & ~E1000_RCTL_EN);
			adapter->flags |= FLAG_RX_RESTART_NOW;
		}
		/* guard against interrupt when we're going down */
		if (!test_bit(__E1000_DOWN, &adapter->state))
			mod_timer(&adapter->watchdog_timer, jiffies + 1);
	}

	/* __napi_schedule is called here to hook the device's napi onto this CPU */
	if (napi_schedule_prep(&adapter->napi)) {
		adapter->total_tx_bytes = 0;
		adapter->total_tx_packets = 0;
		adapter->total_rx_bytes = 0;
		adapter->total_rx_packets = 0;
		__napi_schedule(&adapter->napi);
	}

	return IRQ_HANDLED;
}
void __napi_schedule(struct napi_struct *n)
{
	unsigned long flags;

	local_irq_save(flags);
	/* hook the napi's poll_list entry (inside the adapter) onto this CPU's poll_list */
	list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list);
	/* raise the NET_RX_SOFTIRQ softirq and wait for its handler to be scheduled */
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	local_irq_restore(flags);
}
Next, see how the network device is opened:
static int e1000_open(struct net_device *netdev)
{
	struct e1000_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
	int err;

	/* disallow open during test */
	if (test_bit(__E1000_TESTING, &adapter->state))
		return -EBUSY;

	netif_carrier_off(netdev);

	/*
	 * Set up the transmit and receive descriptors. This mainly
	 * initializes the receive ring and the transmit ring, each of which
	 * needs room for 256 entries.
	 */
	/* allocate transmit descriptors */
	err = e1000e_setup_tx_resources(adapter);
	if (err)
		goto err_setup_tx;

	/* allocate receive descriptors */
	err = e1000e_setup_rx_resources(adapter);
	if (err)
		goto err_setup_rx;

	e1000e_power_up_phy(adapter);

	adapter->mng_vlan_id = E1000_MNG_VLAN_NONE;
	if ((adapter->hw.mng_cookie.status &
	     E1000_MNG_DHCP_COOKIE_STATUS_VLAN))
		e1000_update_mng_vlan(adapter);

	/*
	 * If AMT is enabled, let the firmware know that the network
	 * interface is now open
	 */
	if (adapter->flags & FLAG_HAS_AMT)
		e1000_get_hw_control(adapter);

	/*
	 * before we allocate an interrupt, we must be ready to handle it.
	 * Setting DEBUG_SHIRQ in the kernel makes it fire an interrupt
	 * as soon as we call pci_request_irq, so we have to setup our
	 * clean_rx handler before we do so.
	 */
	/*
	 * This function matters: it configures the adapter, including the
	 * receive routines that the softirq path uses later. Its body is
	 * shown after e1000_open.
	 */
	e1000_configure(adapter);

	err = e1000_request_irq(adapter);
	if (err)
		goto err_req_irq;

	/*
	 * Work around PCIe errata with MSI interrupts causing some chipsets to
	 * ignore e1000e MSI messages, which means we need to test our MSI
	 * interrupt now
	 */
	if (adapter->int_mode != E1000E_INT_MODE_LEGACY) {
		err = e1000_test_msi(adapter);
		if (err) {
			e_err("Interrupt allocation failed\n");
			goto err_req_irq;
		}
	}

	/* From here on the code is the same as e1000e_up() */
	clear_bit(__E1000_DOWN, &adapter->state);

	napi_enable(&adapter->napi);

	e1000_irq_enable(adapter);

	netif_start_queue(netdev);

	/* fire a link status change interrupt to start the watchdog */
	ew32(ICS, E1000_ICS_LSC);

	return 0;

err_req_irq:
	e1000_release_hw_control(adapter);
	e1000_power_down_phy(adapter);
	e1000e_free_rx_resources(adapter);
err_setup_rx:
	e1000e_free_tx_resources(adapter);
err_setup_tx:
	e1000e_reset(adapter);

	return err;
}
The e1000_configure function called above is:
static void e1000_configure(struct e1000_adapter *adapter)
{
	e1000_set_multi(adapter->netdev);

	e1000_restore_vlan(adapter);
	e1000_init_manageability(adapter);

	e1000_configure_tx(adapter);	/* configure transmit */
	e1000_setup_rctl(adapter);
	e1000_configure_rx(adapter);	/* configure receive */
	adapter->alloc_rx_buf(adapter, e1000_desc_unused(adapter->rx_ring));
}
Now look at how the receive side is configured:
static void e1000_configure_rx(struct e1000_adapter *adapter)
{
	struct e1000_hw *hw = &adapter->hw;
	struct e1000_ring *rx_ring = adapter->rx_ring;
	u64 rdba;
	u32 rdlen, rctl, rxcsum, ctrl_ext;

	if (adapter->rx_ps_pages) {
		/* this is a 32 byte descriptor */
		rdlen = rx_ring->count *
			sizeof(union e1000_rx_desc_packet_split);
		adapter->clean_rx = e1000_clean_rx_irq_ps;
		adapter->alloc_rx_buf = e1000_alloc_rx_buffers_ps;
	} else if (adapter->netdev->mtu > ETH_FRAME_LEN + ETH_FCS_LEN) {
		rdlen = rx_ring->count * sizeof(struct e1000_rx_desc);
		adapter->clean_rx = e1000_clean_jumbo_rx_irq;
		adapter->alloc_rx_buf = e1000_alloc_jumbo_rx_buffers;
	} else {
		rdlen = rx_ring->count * sizeof(struct e1000_rx_desc);
		/*
		 * clean_rx is the routine that takes received data out of
		 * the DMA ring and queues it for further processing. It is
		 * called from the NAPI poll routine (e1000_clean), i.e. in
		 * softirq context, not in the hard interrupt handler.
		 */
		adapter->clean_rx = e1000_clean_rx_irq;
		adapter->alloc_rx_buf = e1000_alloc_rx_buffers;
	}

	/* disable receives while setting up the descriptors */
	/* write the receive control register to stop receiving for now */
	rctl = er32(RCTL);
	ew32(RCTL, rctl & ~E1000_RCTL_EN);
	e1e_flush();
	msleep(10);

	/* set the Receive Delay Timer Register */
	/* set the RDTR register */
	ew32(RDTR, adapter->rx_int_delay);

	/* irq moderation */
	/* set the RADV register; see the developer's manual for details on RADV */
	ew32(RADV, adapter->rx_abs_int_delay);
	if (adapter->itr_setting != 0)
		ew32(ITR, 1000000000 / (adapter->itr * 256));

	ctrl_ext = er32(CTRL_EXT);
	/* Reset delay timers after every interrupt */
	ctrl_ext |= E1000_CTRL_EXT_INT_TIMER_CLR;
	/* Auto-Mask interrupts upon ICR access */
	ctrl_ext |= E1000_CTRL_EXT_IAME;
	ew32(IAM, 0xffffffff);
	ew32(CTRL_EXT, ctrl_ext);
	e1e_flush();

	/*
	 * Setup the HW Rx Head and Tail Descriptor Pointers and
	 * the Base and Length of the Rx Descriptor Ring
	 */
	/*
	 * Four registers relate to the receive descriptor ring. RDBA holds
	 * the start address of the descriptor buffer and serves as the base
	 * address; it is 64 bits wide, split into 32-bit low and high halves
	 * (RDBAL and RDBAH). RDLEN is the total size allocated for the ring.
	 * RDH and RDT are the head and tail pointers and hold offsets
	 * relative to the base. RDH is advanced by hardware and points at
	 * the descriptor the next DMA will use. RDT is advanced by software,
	 * once it has processed descriptors and refilled them with fresh
	 * buffers, to hand them back to the hardware.
	 */
	rdba = rx_ring->dma;
	ew32(RDBAL, (rdba & DMA_BIT_MASK(32)));
	ew32(RDBAH, (rdba >> 32));
	ew32(RDLEN, rdlen);
	ew32(RDH, 0);
	ew32(RDT, 0);
	rx_ring->head = E1000_RDH;
	rx_ring->tail = E1000_RDT;

	/* Enable Receive Checksum Offload for TCP and UDP */
	rxcsum = er32(RXCSUM);
	if (adapter->flags & FLAG_RX_CSUM_ENABLED) {
		rxcsum |= E1000_RXCSUM_TUOFL;

		/*
		 * IPv4 payload checksum for UDP fragments must be
		 * used in conjunction with packet-split.
		 */
		if (adapter->rx_ps_pages)
			rxcsum |= E1000_RXCSUM_IPPCSE;
	} else {
		rxcsum &= ~E1000_RXCSUM_TUOFL;
		/* no need to clear IPPCSE as it defaults to 0 */
	}
	ew32(RXCSUM, rxcsum);

	/*
	 * Enable early receives on supported devices, only takes effect when
	 * packet size is equal or larger than the specified value (in 8 byte
	 * units), e.g. using jumbo frames when setting to E1000_ERT_2048
	 */
	if ((adapter->flags & FLAG_HAS_ERT) &&
	    (adapter->netdev->mtu > ETH_DATA_LEN)) {
		u32 rxdctl = er32(RXDCTL(0));
		ew32(RXDCTL(0), rxdctl | 0x3);
		ew32(ERT, E1000_ERT_2048 | (1 << 13));
		/*
		 * With jumbo frames and early-receive enabled, excessive
		 * C4->C2 latencies result in dropped transactions.
		 */
		pm_qos_update_requirement(PM_QOS_CPU_DMA_LATENCY,
					  e1000e_driver_name, 55);
	} else {
		pm_qos_update_requirement(PM_QOS_CPU_DMA_LATENCY,
					  e1000e_driver_name,
					  PM_QOS_DEFAULT_VALUE);
	}

	/* Enable Receives */
	ew32(RCTL, rctl);
}