MSI supports at most 32 interrupt vectors, while MSI-X supports up to 2048; yet the MSI-X registers take up less room in configuration space. This is because the interrupt-vector information is not stored there directly, but in a dedicated region of device memory (MMIO), whose location is indicated by the BIR (Base address Indicator Register, or BAR Index Register), i.e. which BAR the table lives in. Both MSI and MSI-X are essentially based on Memory Writes, so they can also be subject to errors, for example ECRC errors in PCIe.
As shown in the figure below:
Pending Table
The structure of the Pending Table is shown in Figure 6-4.

As the figure above shows, an entry in the Pending Table is 64 bits wide, and each bit corresponds to one entry in the MSI-X Table; that is, one Pending Table entry covers 64 MSI-X Table entries. As in the MSI mechanism, the Pending bits are used together with the Per Vector Mask bits.
When a Per Vector Mask bit is 1, the PCIe device must not send the corresponding MSI-X interrupt request immediately; instead it sets the matching Pending bit to 1. When system software clears the Per Vector Mask bit, the PCIe device must issue the MSI-X interrupt request and clear the Pending bit.
[1] At this point the "Interrupt Disable" bit in the PCI device's configuration-space Command register is 1.
[2] The way MSI delivers interrupt requests resembles edge triggering, and with edge triggering the processor may lose some interrupt requests; therefore, during device-driver development, these two fields (Mask and Pending) may be needed.
Checking a device's MSI/MSI-X capability
`lspci -v` shows the capabilities a device supports. Look for an MSI, MSI-X, or Message Signalled Interrupts entry; each carries an Enable flag, where "+" means enabled and "-" means disabled.
```
[root@localhost ixgbe]# lspci -xxx -vv -s 05:00.0
05:00.0 Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (2*25GE) (rev 45)
	Subsystem: Huawei Technologies Co., Ltd. Device d139
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	NUMA node: 0
	Region 0: Memory at 80007b00000 (64-bit, prefetchable) [size=128K]
	Region 2: Memory at 80008a20000 (64-bit, prefetchable) [size=32K]
	Region 4: Memory at 80000200000 (64-bit, prefetchable) [size=1M]
	Expansion ROM at e9200000 [disabled] [size=1M]
	Capabilities: [40] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 128 bytes Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
		Vector table: BAR=2 offset=00000000
		PBA: BAR=2 offset=00004000
	Capabilities: [b0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [c0] Vital Product Data
		Product Name: Huawei IN200 2*100GE Adapter
		Read-only fields:
			[PN] Part number: SP572
		End
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 120, Total VFs: 120, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 1, stride: 1, Device ID: 375e
		Supported Page Size: 00000553, System Page Size: 00000010
		Region 0: Memory at 0000080007b20000 (64-bit, prefetchable)
		Region 2: Memory at 00000800082a0000 (64-bit, prefetchable)
		Region 4: Memory at 0000080000300000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Capabilities: [310 v1] #19
	Capabilities: [4e0 v1] Device Serial Number 44-a1-91-ff-ff-a4-9b-eb
	Capabilities: [4f0 v1] Transaction Processing Hints
		Device specific mode supported
		No steering table available
	Capabilities: [600 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
	Capabilities: [630 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: vfio-pci
	Kernel modules: hinic
00: e5 19 00 02 06 04 10 00 45 00 00 02 08 00 00 00
10: 0c 00 b0 07 00 08 00 00 0c 00 a2 08 00 08 00 00
20: 0c 00 20 00 00 08 00 00 00 00 00 00 e5 19 39 d1
30: 00 00 40 e6 40 00 00 00 00 00 00 00 ff 00 00 00
40: 10 80 02 00 e2 8f 00 10 37 29 10 00 03 f1 43 00
50: 08 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 92 03 00 00 00 00 00 00 0e 00 00 00
70: 03 00 1f 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 05 a0 8a 01 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 11 b0 1f 80 02 00 00 00 02 40 00 00 00 00 00 00
b0: 01 c0 03 f8 00 00 00 00 00 00 00 00 00 00 00 00
c0: 03 00 28 80 37 32 78 ff 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[root@localhost ixgbe]#
```
```
[root@localhost ixgbe]# lspci -n -v -s 06:00.0
06:00.0 0200: 19e5:0200 (rev 45)
	Subsystem: 19e5:d139
	Flags: fast devsel, NUMA node 0
	[virtual] Memory at 80010400000 (64-bit, prefetchable) [size=128K]
	[virtual] Memory at 80011320000 (64-bit, prefetchable) [size=32K]
	[virtual] Memory at 80008b00000 (64-bit, prefetchable) [size=1M]
	Expansion ROM at e9300000 [disabled] [size=1M]
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
	Capabilities: [a0] MSI-X: Enable- Count=32 Masked-
	Capabilities: [b0] Power Management version 3
	Capabilities: [c0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [200] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [310] #19
	Capabilities: [4e0] Device Serial Number 44-a1-91-ff-ff-a4-9b-ec
	Capabilities: [4f0] Transaction Processing Hints
	Capabilities: [600] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
	Capabilities: [630] Access Control Services
	Kernel driver in use: vfio-pci
	Kernel modules: hinic
```
```
root@zj-x86:~# lspci -n -v -s 1a:00.1
1a:00.1 0200: 8086:37d0 (rev 09)
	Subsystem: 19e5:d123
	Flags: bus master, fast devsel, latency 0, IRQ 31, NUMA node 0
	Memory at a0000000 (64-bit, prefetchable) [size=16M]
	Memory at a3010000 (64-bit, prefetchable) [size=32K]
	Expansion ROM at a3700000 [disabled] [size=512K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 5c-ac-f7-ff-ff-6b-1d-f4
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1b0] Access Control Services
	Kernel driver in use: i40e
	Kernel modules: i40e
root@zj-x86:~#
```
The I350 NIC sits at bus 3, device 0, function 0. From its configuration space we can see that the NIC requested BAR3, which is exactly the BAR used by MSI-X: the MSI-X table structure lives at BAR3 base + 0, and the PBA structure at BAR3 base + 0x2000.
```
root@zj-x86:~# lspci | grep -i ether
1a:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
1a:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
1a:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
1a:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
root@zj-x86:~# lspci -s 1a:00.1 -v
1a:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
	Subsystem: Huawei Technologies Co., Ltd. Ethernet Connection X722 for 10GbE SFP+
	Flags: bus master, fast devsel, latency 0, IRQ 31, NUMA node 0
	Memory at a0000000 (64-bit, prefetchable) [size=16M]
	Memory at a3010000 (64-bit, prefetchable) [size=32K]
	Expansion ROM at a3700000 [disabled] [size=512K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
	Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
	Capabilities: [a0] Express Endpoint, MSI 00
	Capabilities: [e0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 5c-ac-f7-ff-ff-6b-1d-f4
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [1a0] Transaction Processing Hints
	Capabilities: [1b0] Access Control Services
	Kernel driver in use: i40e
	Kernel modules: i40e
root@zj-x86:~#
```
```
[root@localhost ~]# lspci -s 05:00.0 -v
05:00.0 Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (2*25GE) (rev 45)
	Subsystem: Huawei Technologies Co., Ltd. Device d139
	Flags: fast devsel, NUMA node 0
	[virtual] Memory at 80007b00000 (64-bit, prefetchable) [size=128K]
	[virtual] Memory at 80008a20000 (64-bit, prefetchable) [size=32K]
	[virtual] Memory at 80000200000 (64-bit, prefetchable) [size=1M]
	Expansion ROM at e9200000 [disabled] [size=1M]
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
	Capabilities: [a0] MSI-X: Enable- Count=32 Masked-
	Capabilities: [b0] Power Management version 3
	Capabilities: [c0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [200] Single Root I/O Virtualization (SR-IOV)
	Capabilities: [310] #19
	Capabilities: [4e0] Device Serial Number 44-a1-91-ff-ff-a4-9b-eb
	Capabilities: [4f0] Transaction Processing Hints
	Capabilities: [600] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
	Capabilities: [630] Access Control Services
	Kernel driver in use: vfio-pci
	Kernel modules: hinic
[root@localhost ~]#
```
4. How does a device use MSI/MSI-X interrupts?
With legacy interrupts, interrupt numbers are assigned to devices automatically when the system scans the PCI bus tree at initialization. To use MSI, however, the driver has to do some extra setup.
The current Linux kernel provides pci_alloc_irq_vectors() to initialize the MSI/MSI-X capability and to allocate the interrupt numbers.
```c
int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
			  unsigned int max_vecs, unsigned int flags);
```
The return value is the number of interrupt vectors allocated for the PCI device.
min_vecs is the minimum number of vectors the device requires; if fewer than that are available, an error is returned.
max_vecs is the maximum number of vectors the device would like to have.
flags selects which interrupt types the device and driver can use; there are generally four:
```c
#define PCI_IRQ_LEGACY    (1 << 0) /* Allow legacy interrupts */
#define PCI_IRQ_MSI       (1 << 1) /* Allow MSI interrupts */
#define PCI_IRQ_MSIX      (1 << 2) /* Allow MSI-X interrupts */
#define PCI_IRQ_ALL_TYPES (PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX)
```
PCI_IRQ_ALL_TYPES can be used to request whatever interrupt type is available.
In addition, PCI_IRQ_AFFINITY can be set to spread the interrupts across the available CPUs.
Example:

```c
i = pci_alloc_irq_vectors(dev->pdev, min_msix, msi_count,
			  PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
```
The counterpart for releasing the interrupt resources is pci_free_irq_vectors(), which must be called when the device is removed:

```c
void pci_free_irq_vectors(struct pci_dev *dev);
```
Linux also provides pci_irq_vector() to obtain the Linux IRQ number for a given vector:

```c
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
```
5. How are a device's MSI/MSI-X interrupts handled?
5.1 MSI interrupt allocation: pci_alloc_irq_vectors()
Let's look more closely at pci_alloc_irq_vectors():
pci_alloc_irq_vectors() --> pci_alloc_irq_vectors_affinity()
```c
int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
				   unsigned int max_vecs, unsigned int flags,
				   struct irq_affinity *affd)
{
	struct irq_affinity msi_default_affd = {0};
	int msix_vecs = -ENOSPC;
	int msi_vecs = -ENOSPC;

	if (flags & PCI_IRQ_AFFINITY) {
		if (!affd)
			affd = &msi_default_affd;
	} else {
		if (WARN_ON(affd))
			affd = NULL;
	}

	if (flags & PCI_IRQ_MSIX) {
		msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
						    max_vecs, affd, flags);  /* (1) */
		if (msix_vecs > 0)
			return msix_vecs;
	}

	if (flags & PCI_IRQ_MSI) {
		msi_vecs = __pci_enable_msi_range(dev, min_vecs, max_vecs,
						  affd);                     /* (2) */
		if (msi_vecs > 0)
			return msi_vecs;
	}

	/* use legacy IRQ if allowed */
	if (flags & PCI_IRQ_LEGACY) {
		if (min_vecs == 1 && dev->irq) {
			/*
			 * Invoke the affinity spreading logic to ensure that
			 * the device driver can adjust queue configuration
			 * for the single interrupt case.
			 */
			if (affd)
				irq_create_affinity_masks(1, affd);
			pci_intx(dev, 1);                                    /* (3) */
			return 1;
		}
	}

	if (msix_vecs == -ENOSPC)
		return -ENOSPC;
	return msi_vecs;
}
```
(1) First, check whether MSI-X interrupts were requested:
```
__pci_enable_msix_range()
  +-> __pci_enable_msix()
      +-> msix_capability_init()
          +-> pci_msi_setup_msi_irqs()
```
msix_capability_init() does some configuration of the MSI-X capability.
The key function is pci_msi_setup_msi_irqs(), which creates the MSI irq numbers:
```c
static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
	struct irq_domain *domain;

	domain = dev_get_msi_domain(&dev->dev);
	if (domain && irq_domain_is_hierarchy(domain))
		return msi_domain_alloc_irqs(domain, &dev->dev, nvec);

	return arch_setup_msi_irqs(dev, nvec, type);
}
```
The irq_domain obtained here is the dev->msi_domain defined in the PCIe device structure.
Where is this msi_domain set up?
In drivers/irqchip/irq-gic-v3-its-pci-msi.c, at kernel boot:
```
its_pci_msi_init()
  +-> its_pci_of_msi_init()
      +-> its_pci_msi_init_one()
          +-> pci_msi_create_irq_domain(handle, &its_pci_msi_domain_info, parent)
```
pci_msi_create_irq_domain() creates the pci_msi irq_domain, passing its_pci_msi_domain_info and setting the ITS irq_domain as the parent.
So the logic is now fairly clear:
When the GIC interrupt controller initializes, it adds the gic irq_domain; the gic irq_domain is the parent of the its irq_domain, and the host data of the its irq_domain corresponds to the pci_msi irq_domain.
```
gic irq_domain --> irq_domain_ops(gic_irq_domain_ops)
      ^            --> .alloc(gic_irq_domain_alloc)
      |
its irq_domain --> irq_domain_ops(its_domain_ops)
      ^            --> .alloc(its_irq_domain_alloc)
      |            --> ...
      |        --> host_data(struct msi_domain_info)
      |            --> msi_domain_ops(its_msi_domain_ops)
      |                --> .msi_prepare(its_msi_prepare)
      |            --> irq_chip, chip_data, handler...
      |            --> void *data(struct its_node)
```
The ops attached to the pci_msi irq_domain:

```c
static const struct irq_domain_ops msi_domain_ops = {
	.alloc		= msi_domain_alloc,
	.free		= msi_domain_free,
	.activate	= msi_domain_activate,
	.deactivate	= msi_domain_deactivate,
};
```
Returning to pci_msi_setup_msi_irqs() above: once the pci_msi irq_domain has been obtained, msi_domain_alloc_irqs() is called to allocate the IRQ numbers.

```
msi_domain_alloc_irqs()
  // corresponds to its_pci_msi_prepare in its_pci_msi_ops
  +-> msi_domain_prepare_irqs()
  // allocates the IRQ numbers
  +-> __irq_domain_alloc_irqs()
```
msi_domain_prepare_irqs() corresponds to the its_msi_prepare function, which creates an its_device.
__irq_domain_alloc_irqs() allocates the virtual interrupt numbers, taking the first free bit in the allocated_irqs bitmap as the virtual interrupt number.
At this point the MSI-X interrupt allocation is complete, and so is the MSI-X configuration.
(2) If MSI-X was not requested, check whether MSI interrupts were requested; the flow is similar to MSI-X.
(3) If neither MSI nor MSI-X was requested, check whether a legacy INTx interrupt was requested.
5.2 MSI interrupt registration
kernel/irq/manage.c

```
request_irq()
  +-> __setup_irq()
      +-> irq_activate()
          +-> msi_domain_activate()
              // irq_chip_write_msi_msg defined in msi_domain_info
              +-> irq_chip_write_msi_msg()
                  // the irq_chip is its_msi_irq_chip, attached in pci_msi_create_irq_domain
                  +-> data->chip->irq_write_msi_msg(data, msg);
                      +-> pci_msi_domain_write_msg()
```

As this flow shows, MSI activates an interrupt by having irq_write_msi_msg send a message to an address.
Interrupt generation
1. Generating an MSI interrupt request
For the details of MSI and MSI-X, see Wang Qi's《PCI Express體系結構導讀》and the Intel® 64 and IA-32 Architectures Software Developer's Manual.
A PCIe device raises an MSI/MSI-X interrupt by writing the Message Data value to the Message Address configured in its MSI/MSI-X capability; the resulting TLP delivers the request to the processor. Different processors handle MSI requests with different mechanisms; x86 uses FSB Interrupt Messages. The figure below shows the Message Data format from the Intel manual: bits 0-7 hold the Vector, so every MSI request carries the value of its interrupt vector.
How is the Vector in Message Data set? As shown in the figure below, when the device driver calls request_irq() to register an interrupt, msi_set_affinity() is eventually invoked; it writes the vector value into the low 8 bits of msg.data and then calls __write_msi_msg() to write the Message Data into PCIe configuration space.
2. Overall interrupt flow
The MSI message carries the Vector, and the handler is looked up in the IDT by that Vector. On x86, regardless of the vector, the IDT entry jumps to common_interrupt, which then runs do_IRQ. do_IRQ consists mainly of the following steps:
- Call irq_enter() to add to the HARDIRQ count in preempt_count, marking the start of hardirq context; while preempt_count is non-zero, preemption is disabled.
- Read irq = __this_cpu_read(vector_irq[vector]) to translate the vector into an irq number.
- Call handle_irq() to do the hardirq processing.
- If handle_irq() returns false, call ack_APIC_irq(), which writes 0 to the APIC's EOI register to tell the APIC the interrupt has been serviced.
- Call irq_exit(), which subtracts the count added by irq_enter() via sub_preempt_count(HARDIRQ_OFFSET), marking the end of hardirq context; then, if in_interrupt() shows preempt_count is 0 and softirqs are pending, it calls invoke_softirq().
2.1. preempt_count
preempt_count is a member of the thread_info structure. The kernel's description of preempt_count is shown below; it contains, among others, a softirq count and a hardirq count, so preempt_count serves both as a preemption counter and as a way to determine the current context.
```c
/*
 * We put the hardirq and softirq counter into the preemption
 * counter. The bitmask has the following meaning:
 *
 * - bits 0-7 are the preemption count (max preemption depth: 256)
 * - bits 8-15 are the softirq count (max # of softirqs: 256)
 * - bits 16-19 are the hardirq count (max # of nested hardirqs: 16)
 *
 *         PREEMPT_MASK: 0x000000ff
 *         SOFTIRQ_MASK: 0x0000ff00
 *         HARDIRQ_MASK: 0x000f0000
 *             NMI_MASK: 0x00100000
 */
```
2.2. irq and vector
vector_irq is a per-CPU array reflecting the vector-to-irq mapping on each CPU: the index is the vector, and the stored value is the irq number.

```c
/* arch/x86/include/asm/hw_irq.h */
typedef int vector_irq_t[NR_VECTORS];
DECLARE_PER_CPU(vector_irq_t, vector_irq);
```
2.3. handle_irq
handle_irq looks up the interrupt descriptor irq_desc from the irq number, then calls generic_handle_irq_desc.
```c
bool handle_irq(unsigned irq, struct pt_regs *regs)
{
	struct irq_desc *desc;

	stack_overflow_check(regs);

	desc = irq_to_desc(irq);
	if (unlikely(!desc))
		return false;

	generic_handle_irq_desc(irq, desc);
	return true;
}
```
To go deeper, it helps to know that the interrupt code has three main abstraction layers:
- High-level driver API
- High-level IRQ flow handlers
- Chip-level hardware encapsulation
The analysis so far (common_interrupt and friends) is mostly low-level architecture code. When an interrupt fires, this low-level code enters the generic interrupt code by calling desc->handle_irq. The function that handle_irq points to belongs to the high-level IRQ flow handler layer. The kernel provides a set of predefined irq-flow handlers, which the architecture assigns to specific interrupts (by setting desc->handle_irq) during boot or device initialization:
```c
/*
 * Built-in IRQ handlers for various IRQ types,
 * callable via desc->handle_irq()
 */
extern void handle_level_irq(struct irq_desc *desc);
extern void handle_fasteoi_irq(struct irq_desc *desc);
extern void handle_edge_irq(struct irq_desc *desc);
extern void handle_simple_irq(struct irq_desc *desc);
extern void handle_percpu_irq(struct irq_desc *desc);
extern void handle_bad_irq(struct irq_desc *desc);
```
A high-level IRQ flow handler invokes the desc->irq_data.chip primitives (the irq_chip callbacks, e.g. irq_ack), i.e. the chip-level hardware encapsulation; if the interrupt has a specific handler registered, the peripheral's specific handler is also called.
Since I happened to stop in handle_edge_irq (the generic implementation for edge-triggered interrupts) under GDB, let's use it as the example; see the code and comments below.
```c
void handle_edge_irq(struct irq_desc *desc)
{
	raw_spin_lock(&desc->lock);

	desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);

	if (!irq_may_run(desc)) {
		/* already running elsewhere: mark pending, mask + ack */
		desc->istate |= IRQS_PENDING;
		mask_ack_irq(desc);
		goto out_unlock;
	}

	if (irqd_irq_disabled(&desc->irq_data) || !desc->action) {
		desc->istate |= IRQS_PENDING;
		mask_ack_irq(desc);
		goto out_unlock;
	}

	kstat_incr_irqs_this_cpu(desc);

	/* Start handling the irq: ack the chip (EOI for the APIC) */
	desc->irq_data.chip->irq_ack(&desc->irq_data);

	do {
		if (unlikely(!desc->action)) {
			mask_irq(desc);
			goto out_unlock;
		}

		/*
		 * A new edge arrived while we were handling: unmask if it
		 * got masked, then loop and handle the pending one too.
		 */
		if (unlikely(desc->istate & IRQS_PENDING)) {
			if (!irqd_irq_disabled(&desc->irq_data) &&
			    irqd_irq_masked(&desc->irq_data))
				unmask_irq(desc);
		}

		handle_irq_event(desc);

	} while ((desc->istate & IRQS_PENDING) &&
		 !irqd_irq_disabled(&desc->irq_data));

out_unlock:
	raw_spin_unlock(&desc->lock);
}
```
handle_irq_event(desc) calls handle_irq_event_percpu, which walks the action list and runs each specific handler; we won't expand handle_irq_event_percpu here.
Note the locking around handle_irq_event_percpu: the lock and unlock at the start and end of handle_edge_irq are not a matched pair. The raw_spin_unlock before handle_irq_event_percpu pairs with the raw_spin_lock at the top of handle_edge_irq, while the raw_spin_lock after handle_irq_event_percpu pairs with the raw_spin_unlock at the end of handle_edge_irq.
```c
irqreturn_t handle_irq_event(struct irq_desc *desc)
{
	irqreturn_t ret;

	desc->istate &= ~IRQS_PENDING;
	irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	raw_spin_unlock(&desc->lock);   /* pairs with lock in the flow handler */

	ret = handle_irq_event_percpu(desc);

	raw_spin_lock(&desc->lock);     /* pairs with unlock in the flow handler */
	irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
	return ret;
}
```
The other high-level IRQ flow handlers won't be analyzed in detail; below is a simplified excerpt from the kernel documentation for some of them:
- handle_level_irq

```
desc->irq_data.chip->irq_mask_ack();
handle_irq_event(desc->action);
desc->irq_data.chip->irq_unmask();
```

- handle_fasteoi_irq

```
handle_irq_event(desc->action);
desc->irq_data.chip->irq_eoi();
```

- handle_edge_irq

```
if (desc->status & running) {
    desc->irq_data.chip->irq_mask_ack();
    desc->status |= pending | masked;
    return;
}
desc->irq_data.chip->irq_ack();
desc->status |= running;
do {
    if (desc->status & masked)
        desc->irq_data.chip->irq_unmask();
    desc->status &= ~pending;
    handle_irq_event(desc->action);
} while (status & pending);
desc->status &= ~running;
```

- handle_simple_irq

```
handle_irq_event(desc->action);
```

- handle_percpu_irq

```
if (desc->irq_data.chip->irq_ack)
    desc->irq_data.chip->irq_ack();
handle_irq_event(desc->action);
if (desc->irq_data.chip->irq_eoi)
    desc->irq_data.chip->irq_eoi();
```
2.4. ack_APIC_irq
Back in do_IRQ: if handle_irq returns false, ack_APIC_irq is called to write 0 to the APIC's EOI register, telling the APIC the interrupt has been serviced. Does returning true mean ack_APIC_irq is never needed? No; in that case the EOI happens when the high-level IRQ flow handler invokes the desc->irq_data.chip primitives. For example, in handle_edge_irq analyzed above, the irq_ack callback it invokes is in fact ack_APIC_irq.
2.5. invoke_softirq
irq_exit subtracts the count added by irq_enter, marking the end of hardirq context; then, if in_interrupt() shows preempt_count is 0 and there are softirqs pending, it calls invoke_softirq.
```c
static inline void invoke_softirq(void)
{
	if (ksoftirqd_running(local_softirq_pending()))
		return;

	if (!force_irqthreads) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
		/* We can safely execute softirq on the current stack */
		__do_softirq();
#else
		/* Switch to the dedicated softirq stack */
		do_softirq_own_stack();
#endif
	} else {
		wakeup_softirqd();
	}
}
```