MSI Interrupts


https://www.codenong.com/cs106676560/

 

MSI supports only 32 interrupt vectors, while MSI-X supports up to 2048, and yet the MSI-X registers occupy less room in configuration space. That is because the vector information is not stored there directly; it lives in a special region of device memory (MMIO), and the BIR (BAR Indicator Register, also read as BAR Index Register) identifies which BAR that memory sits behind. Both MSI and MSI-X are ultimately based on Memory Writes, so they can also suffer errors, such as PCIe ECRC errors.
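To make the indirection concrete, here is a small Python sketch (illustration only, not driver code) of how the MSI-X table location is derived from the BIR plus offset, using BAR values borrowed from the lspci dump later in this article; each table entry is 16 bytes (Message Address, Message Data, Vector Control):

```python
def msix_table_addr(bars, bir, table_offset):
    """The MSI-X table lives in device MMIO at BAR[BIR] + offset."""
    return bars[bir] + table_offset

def msix_entry_addr(table_addr, vector):
    """Each MSI-X table entry is 16 bytes:
    Message Address (8), Message Data (4), Vector Control (4)."""
    return table_addr + vector * 16

# BAR values taken from the Hi1822 lspci output below; BIR=2, offset=0
bars = {0: 0x80007b00000, 2: 0x80008a20000, 4: 0x80000200000}
table = msix_table_addr(bars, bir=2, table_offset=0x0)
assert table == 0x80008a20000
assert msix_entry_addr(table, vector=3) == 0x80008a20000 + 48
```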

As shown in the figure below:

 Pending Table

The layout of the Pending Table is shown in Figure 6-4.




As the figure above shows, each entry in the Pending Table is 64 bits wide, and each bit corresponds to one entry in the MSI-X Table; one Pending Table entry therefore covers 64 MSI-X Table entries. As in the MSI mechanism, the Pending bit is used together with the Per Vector Mask bit.
When the Per Vector Mask bit is 1, the PCIe device must not send the MSI-X interrupt request immediately; instead it sets the corresponding Pending bit to 1. When system software clears the Per Vector Mask bit, the device must issue the MSI-X interrupt request and clear the Pending bit.
[1] At this point the "Interrupt Disable" bit in the Command register of the PCI device's configuration space is 1.
[2] MSI delivers interrupt requests in a way similar to edge triggering, and with edge triggering the processor can miss interrupt requests, so these two fields may be needed during device-driver development.
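The mask/pending interplay described above can be sketched as a tiny state machine (a simulation for illustration only, not real device logic):

```python
class MsixVector:
    """Toy model of one MSI-X vector's Per Vector Mask / Pending bits."""
    def __init__(self):
        self.masked = False
        self.pending = False
        self.sent = 0   # count of MSI-X messages actually sent

    def raise_irq(self):
        if self.masked:
            self.pending = True   # masked: latch the request into the Pending bit
        else:
            self.sent += 1        # unmasked: deliver immediately

    def unmask(self):
        self.masked = False
        if self.pending:          # software cleared the mask: deliver and clear
            self.sent += 1
            self.pending = False

v = MsixVector()
v.masked = True
v.raise_irq()
assert v.pending and v.sent == 0      # masked: latched, not sent
v.unmask()
assert not v.pending and v.sent == 1  # unmasked: delivered, Pending cleared
```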

Checking a device's MSI/MSI-X capability

lspci -v lists the capabilities a device supports. Look for an MSI, MSI-X, or message-signalled-interrupt entry; each carries an enable flag, where "+" means enabled and "-" means disabled.
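For example, the `MSI-X: Enable+ Count=32 Masked-` line in the output below can be decoded mechanically; here is a small helper (illustrative only) that turns the "+"/"-" suffixes into booleans:

```python
import re

def parse_msix_line(line):
    """Parse an lspci MSI-X capability line such as
    'Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-'."""
    m = re.search(r"MSI-X: Enable([+-]) Count=(\d+) Masked([+-])", line)
    if not m:
        return None
    return {
        "enable": m.group(1) == "+",   # '+' means the capability is enabled
        "count": int(m.group(2)),      # number of MSI-X vectors
        "masked": m.group(3) == "+",   # function-wide mask bit
    }

info = parse_msix_line("Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-")
assert info == {"enable": True, "count": 32, "masked": False}
```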

 

[root@localhost ixgbe]# lspci -xxx -vv -s 05:00.0
05:00.0 Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (2*25GE) (rev 45)
        Subsystem: Huawei Technologies Co., Ltd. Device d139
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        NUMA node: 0
        Region 0: Memory at 80007b00000 (64-bit, prefetchable) [size=128K]
        Region 2: Memory at 80008a20000 (64-bit, prefetchable) [size=32K]
        Region 4: Memory at 80000200000 (64-bit, prefetchable) [size=1M]
        Expansion ROM at e9200000 [disabled] [size=1M]
        Capabilities: [40] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
                Vector table: BAR=2 offset=00000000
                PBA: BAR=2 offset=00004000
        Capabilities: [b0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [c0] Vital Product Data
                Product Name: Huawei IN200 2*100GE Adapter
                Read-only fields:
                        [PN] Part number: SP572
                End
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [200 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 120, Total VFs: 120, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 1, stride: 1, Device ID: 375e
                Supported Page Size: 00000553, System Page Size: 00000010
                Region 0: Memory at 0000080007b20000 (64-bit, prefetchable)
                Region 2: Memory at 00000800082a0000 (64-bit, prefetchable)
                Region 4: Memory at 0000080000300000 (64-bit, prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Capabilities: [310 v1] #19
        Capabilities: [4e0 v1] Device Serial Number 44-a1-91-ff-ff-a4-9b-eb
        Capabilities: [4f0 v1] Transaction Processing Hints
                Device specific mode supported
                No steering table available
        Capabilities: [600 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [630 v1] Access Control Services
                ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
                ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Kernel driver in use: vfio-pci
        Kernel modules: hinic
00: e5 19 00 02 06 04 10 00 45 00 00 02 08 00 00 00
10: 0c 00 b0 07 00 08 00 00 0c 00 a2 08 00 08 00 00
20: 0c 00 20 00 00 08 00 00 00 00 00 00 e5 19 39 d1
30: 00 00 40 e6 40 00 00 00 00 00 00 00 ff 00 00 00
40: 10 80 02 00 e2 8f 00 10 37 29 10 00 03 f1 43 00
50: 08 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 92 03 00 00 00 00 00 00 0e 00 00 00
70: 03 00 1f 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 05 a0 8a 01 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 11 b0 1f 80 02 00 00 00 02 40 00 00 00 00 00 00
b0: 01 c0 03 f8 00 00 00 00 00 00 00 00 00 00 00 00
c0: 03 00 28 80 37 32 78 ff 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[root@localhost ixgbe]# 

 

 

[root@localhost ixgbe]# lspci -n -v -s 06:00.0
06:00.0 0200: 19e5:0200 (rev 45)
        Subsystem: 19e5:d139
        Flags: fast devsel, NUMA node 0
        [virtual] Memory at 80010400000 (64-bit, prefetchable) [size=128K]
        [virtual] Memory at 80011320000 (64-bit, prefetchable) [size=32K]
        [virtual] Memory at 80008b00000 (64-bit, prefetchable) [size=1M]
        Expansion ROM at e9300000 [disabled] [size=1M]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
        Capabilities: [a0] MSI-X: Enable- Count=32 Masked-
        Capabilities: [b0] Power Management version 3
        Capabilities: [c0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [200] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [310] #19
        Capabilities: [4e0] Device Serial Number 44-a1-91-ff-ff-a4-9b-ec
        Capabilities: [4f0] Transaction Processing Hints
        Capabilities: [600] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [630] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: hinic

 

root@zj-x86:~# lspci -n -v -s 1a:00.1
1a:00.1 0200: 8086:37d0 (rev 09)
        Subsystem: 19e5:d123
        Flags: bus master, fast devsel, latency 0, IRQ 31, NUMA node 0
        Memory at a0000000 (64-bit, prefetchable) [size=16M]
        Memory at a3010000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at a3700000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 5c-ac-f7-ff-ff-6b-1d-f4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

root@zj-x86:~# 

 

 

The I350 NIC sits at bus 3, device 0, function 0. From the configuration space you can see that the NIC requested a BAR3, which is exactly the BAR used by MSI-X: the MSI-X table structure is stored at BAR3 base + 0, and the PBA structure at BAR3 base + 0x2000.
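Using the I350 layout above (PBA at BAR3 + 0x2000, with each Pending Table entry a 64-bit qword, one bit per vector), the pending bit for a given vector can be located as follows; this is a sketch, and the BAR3 base address is a made-up example:

```python
PBA_OFFSET = 0x2000  # PBA structure at BAR3 base + 0x2000 (I350 example)

def pba_bit_location(bar3_base, vector):
    """Each PBA entry is a 64-bit qword; one bit per MSI-X vector."""
    qword_addr = bar3_base + PBA_OFFSET + (vector // 64) * 8
    bit = vector % 64
    return qword_addr, bit

bar3 = 0xfe000000  # hypothetical BAR3 base address
addr, bit = pba_bit_location(bar3, vector=70)
assert addr == bar3 + 0x2000 + 8   # vector 70 falls in the second qword
assert bit == 6                    # bit 6 within that qword
```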

root@zj-x86:~# lspci | grep -i  ether
1a:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
1a:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
1a:00.2 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
1a:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 1GbE (rev 09)
root@zj-x86:~# lspci -s 1a:00.1 -v
1a:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
        Subsystem: Huawei Technologies Co., Ltd. Ethernet Connection X722 for 10GbE SFP+
        Flags: bus master, fast devsel, latency 0, IRQ 31, NUMA node 0
        Memory at a0000000 (64-bit, prefetchable) [size=16M]
        Memory at a3010000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at a3700000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [e0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 5c-ac-f7-ff-ff-6b-1d-f4
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e
root@zj-x86:~#

 

[root@localhost ~]# lspci -s 05:00.0 -v
05:00.0 Ethernet controller: Huawei Technologies Co., Ltd. Hi1822 Family (2*25GE) (rev 45)
        Subsystem: Huawei Technologies Co., Ltd. Device d139
        Flags: fast devsel, NUMA node 0
        [virtual] Memory at 80007b00000 (64-bit, prefetchable) [size=128K]
        [virtual] Memory at 80008a20000 (64-bit, prefetchable) [size=32K]
        [virtual] Memory at 80000200000 (64-bit, prefetchable) [size=1M]
        Expansion ROM at e9200000 [disabled] [size=1M]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] MSI: Enable- Count=1/32 Maskable+ 64bit+
        Capabilities: [a0] MSI-X: Enable- Count=32 Masked-
        Capabilities: [b0] Power Management version 3
        Capabilities: [c0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [200] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [310] #19
        Capabilities: [4e0] Device Serial Number 44-a1-91-ff-ff-a4-9b-eb
        Capabilities: [4f0] Transaction Processing Hints
        Capabilities: [600] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [630] Access Control Services
        Kernel driver in use: vfio-pci
        Kernel modules: hinic
[root@localhost ~]#

 

 

4. How does a device use MSI/MSI-X interrupts?

With legacy interrupts, the interrupt number is assigned to the device automatically when the system scans the PCI bus tree at initialization. To use MSI, however, the driver has to do some extra configuration.
The current Linux kernel provides pci_alloc_irq_vectors to initialize the MSI/MSI-X capability and allocate interrupt numbers.

int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
                unsigned int max_vecs, unsigned int flags);

The return value is the number of interrupt vectors allocated to the PCI device.
min_vecs is the minimum number of vectors the device requires; if fewer than that are available, an error is returned.
max_vecs is the maximum number of vectors the caller would like allocated.
flags indicates which interrupt types the device and driver can use; there are normally four:

#define PCI_IRQ_LEGACY      (1 << 0) /* Allow legacy interrupts */
#define PCI_IRQ_MSI     (1 << 1) /* Allow MSI interrupts */
#define PCI_IRQ_MSIX        (1 << 2) /* Allow MSI-X interrupts */
#define PCI_IRQ_ALL_TYPES   (PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX)

PCI_IRQ_ALL_TYPES can be used to request any available interrupt type.
PCI_IRQ_AFFINITY may additionally be set, to spread the interrupts across the available CPUs.
Usage example:

 i = pci_alloc_irq_vectors(dev->pdev, min_msix, msi_count, PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);

Its counterpart for releasing the interrupt resources is pci_free_irq_vectors(), which must be called when the device is removed:

void pci_free_irq_vectors(struct pci_dev *dev);

Linux also provides pci_irq_vector() to look up the Linux IRQ number for a given vector:

int pci_irq_vector(struct pci_dev *dev, unsigned int nr);

5. How are a device's MSI/MSI-X interrupts set up?

5.1 MSI interrupt allocation: pci_alloc_irq_vectors()

Let's take a closer look at pci_alloc_irq_vectors():
pci_alloc_irq_vectors() --> pci_alloc_irq_vectors_affinity()

int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
                   unsigned int max_vecs, unsigned int flags,
                   struct irq_affinity *affd)
{
    struct irq_affinity msi_default_affd = {0};
    int msix_vecs = -ENOSPC;
    int msi_vecs = -ENOSPC;

    if (flags & PCI_IRQ_AFFINITY) {                        
        if (!affd)
            affd = &msi_default_affd;
    } else {
        if (WARN_ON(affd))
            affd = NULL;
    }

    if (flags & PCI_IRQ_MSIX) {
        msix_vecs = __pci_enable_msix_range(dev, NULL, min_vecs,
                            max_vecs, affd, flags);               ------(1)
        if (msix_vecs > 0)
            return msix_vecs;
    }

    if (flags & PCI_IRQ_MSI) {
        msi_vecs = __pci_enable_msi_range(dev, min_vecs, max_vecs,
                          affd);                             ----- (2)
        if (msi_vecs > 0)
            return msi_vecs;
    }

    /* use legacy IRQ if allowed */
    if (flags & PCI_IRQ_LEGACY) {
        if (min_vecs == 1 && dev->irq) {
            /*
             * Invoke the affinity spreading logic to ensure that
             * the device driver can adjust queue configuration
             * for the single interrupt case.
             */
            if (affd)
                irq_create_affinity_masks(1, affd);
            pci_intx(dev, 1);                                 ------ (3)
            return 1;
        }
    }

    if (msix_vecs == -ENOSPC)                
        return -ENOSPC;
    return msi_vecs;
}

(1) First check whether MSI-X interrupts were requested.

__pci_enable_msix_range()
    +-> __pci_enable_msix()
        +-> msix_capability_init()
            +-> pci_msi_setup_msi_irqs()

msix_capability_init() performs some configuration of the MSI-X capability.
The key function, pci_msi_setup_msi_irqs(), creates the MSI IRQ numbers:

static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
    struct irq_domain *domain;

    domain = dev_get_msi_domain(&dev->dev);      
    if (domain && irq_domain_is_hierarchy(domain))
        return msi_domain_alloc_irqs(domain, &dev->dev, nvec);

    return arch_setup_msi_irqs(dev, nvec, type);
}

The irq_domain obtained here is dev->msi_domain from the PCI device structure.
So where is this msi_domain set up?
In drivers/irqchip/irq-gic-v3-its-pci-msi.c, at kernel boot:

its_pci_msi_init()
    +-> its_pci_msi_init_one()
        +-> pci_msi_create_irq_domain(handle, &its_pci_msi_domain_info, parent)

pci_msi_create_irq_domain() creates the pci_msi irq_domain, passing it its_pci_msi_domain_info and setting the its irq_domain as its parent.
The logic is now fairly clear:
when the GIC interrupt controller initializes, it adds the gic irq_domain; the gic irq_domain is the parent of the its irq_domain, and the its irq_domain's host data points to the pci_msi irq_domain.

        gic irq_domain --> irq_domain_ops(gic_irq_domain_ops)
              ^                --> .alloc(gic_irq_domain_alloc)
              |
        its irq_domain --> irq_domain_ops(its_domain_ops)
              ^                --> .alloc(its_irq_domain_alloc)
              |                --> ...
              |        --> host_data(struct msi_domain_info)
              |            --> msi_domain_ops(its_msi_domain_ops)
              |                --> .msi_prepare(its_msi_prepare)
              |            --> irq_chip, chip_data, handler...
              |            --> void *data(struct its_node)

The ops attached to the pci_msi irq_domain:

static const struct irq_domain_ops msi_domain_ops = {
        .alloc          = msi_domain_alloc,
        .free           = msi_domain_free,
        .activate       = msi_domain_activate,
        .deactivate     = msi_domain_deactivate,
};

Back in pci_msi_setup_msi_irqs() above: once the pci_msi irq_domain is obtained, msi_domain_alloc_irqs() is called to allocate the IRQ numbers.

msi_domain_alloc_irqs()
    // ends up in its_pci_msi_prepare from its_pci_msi_ops
    +-> msi_domain_prepare_irqs()
    // allocates the IRQ number
    +-> __irq_domain_alloc_irqs()

msi_domain_prepare_irqs() resolves to the its_msi_prepare path, which creates an its_device.
__irq_domain_alloc_irqs() allocates the virtual interrupt number, taking the first free bit in the allocated_irqs bitmap.
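The "first free bit in the bitmap" step can be illustrated with a minimal sketch (not the real kernel implementation, which goes through irq descriptor allocation with extra bookkeeping):

```python
def alloc_virq(allocated, nr_irqs):
    """Return the first free (zero) bit as the new virtual irq number."""
    for virq in range(nr_irqs):
        if not (allocated >> virq) & 1:
            return virq
    return -1  # no space left; the kernel would return -ENOSPC

bitmap = 0b1011  # virtual irqs 0, 1 and 3 already taken
virq = alloc_virq(bitmap, nr_irqs=32)
assert virq == 2  # first free bit
```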

At this point the MSI-X interrupt allocation is complete, and so is the MSI-X configuration.

(2) If MSI-X was not requested, check whether MSI interrupts were; the flow is similar to MSI-X.
(3) If neither MSI nor MSI-X was requested, check whether a legacy INTx interrupt was.
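Putting (1)-(3) together, pci_alloc_irq_vectors_affinity() is essentially a preference cascade. A simplified model (Python stand-in for the C flow above; the tuple return is my own shorthand, the real function returns a vector count or a negative errno):

```python
PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX = 1 << 0, 1 << 1, 1 << 2
ENOSPC = 28

def alloc_irq_vectors(flags, msix_avail, msi_avail, have_intx, min_vecs=1):
    """Try MSI-X first, then MSI, then a legacy INTx interrupt."""
    if flags & PCI_IRQ_MSIX and msix_avail >= min_vecs:
        return ("msix", msix_avail)
    if flags & PCI_IRQ_MSI and msi_avail >= min_vecs:
        return ("msi", msi_avail)
    if flags & PCI_IRQ_LEGACY and have_intx and min_vecs == 1:
        return ("intx", 1)
    return ("error", -ENOSPC)

all_types = PCI_IRQ_LEGACY | PCI_IRQ_MSI | PCI_IRQ_MSIX
assert alloc_irq_vectors(all_types, 32, 8, True) == ("msix", 32)
assert alloc_irq_vectors(all_types, 0, 8, True) == ("msi", 8)
assert alloc_irq_vectors(all_types, 0, 0, True) == ("intx", 1)
```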

5.2 MSI interrupt registration

kernel/irq/manage.c

request_irq()
    +-> __setup_irq()
        +-> irq_activate()
            +-> msi_domain_activate()
                // irq_chip_write_msi_msg defined via msi_domain_info
                +-> irq_chip_write_msi_msg()
                    // the irq_chip here is its_msi_irq_chip, associated in pci_msi_create_irq_domain
                    +-> data->chip->irq_write_msi_msg(data, msg);
                            +-> pci_msi_domain_write_msg()

This flow shows that MSI activates an interrupt by having irq_write_msi_msg send a message to an address.

 

 

 

 

 

Interrupt generation

1. Generating an MSI interrupt request

For a detailed treatment of MSI and MSI-X, see Wang Qi's 《PCI Express體系結構導讀》 and the "Intel® 64 and IA-32 Architectures Software Developer's Manual".

  A PCIe device submits an MSI/MSI-X interrupt request to the processor by writing the Message Data value to the Message Address held in its MSI/MSI-X capability, forming a TLP. Different processors handle MSI requests with different mechanisms; x86 uses FSB Interrupt Messages. The figure below, from the Intel manual, shows the Message Data format: bits 0-7 are the Vector, so every MSI request carries the interrupt vector value.

  How does the Vector get into Message Data? As shown in the figure below, when the device driver calls request_irq() to request an interrupt, the path reaches msi_set_affinity(), which writes the vector into the low 8 bits of msg.data and then calls __write_msi_msg() to write the Message Data into PCIe configuration space.
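That update can be sketched in two lines (illustration only; the example data value is made up, and only the low 8 bits change):

```python
def write_vector(msg_data, vector):
    """Put the vector into bits 0-7 of MSI Message Data,
    leaving the other Message Data fields untouched."""
    assert 0 <= vector <= 0xff
    return (msg_data & ~0xff) | vector

data = 0x4100                       # hypothetical Message Data with upper bits set
assert write_vector(data, 0x35) == 0x4135
```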

2. Overall interrupt flow

  The MSI message carries the Vector, which is used to look up the handler in the IDT. On x86, whatever the Vector, the IDT entry jumps to common_interrupt, which then runs do_IRQ. The function do_IRQ consists mainly of the following parts:

  1. irq_enter() increments the HARDIRQ part of preempt_count, marking the start of a hardirq;

    preemption is not allowed while preempt_count is non-zero.

  2. irq = __this_cpu_read(vector_irq[vector]) obtains the irq from the vector;

  3. handle_irq() performs the hardirq processing;

  4. If handle_irq() returns false, ack_APIC_irq() is called to write 0 to the APIC's EOI register, telling the APIC the interrupt has been serviced;

  5. irq_exit() calls sub_preempt_count(HARDIRQ_OFFSET) to undo the count added by irq_enter(), marking the end of the hardirq; then, if in_interrupt() shows preempt_count is 0 and a softirq is pending, it calls invoke_softirq().
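Steps 1-5 can be summarized in a toy model (a Python stand-in for the C flow, with a simplified preempt_count and per-CPU table; the vector and irq numbers are made up):

```python
HARDIRQ_OFFSET = 1 << 16  # the hardirq count lives at bits 16-25 of preempt_count

class Cpu:
    def __init__(self):
        self.preempt_count = 0
        self.vector_irq = [-1] * 256  # per-CPU vector -> irq table
        self.eoi_sent = False

    def do_irq(self, vector, handle_irq):
        self.preempt_count += HARDIRQ_OFFSET   # 1. irq_enter()
        irq = self.vector_irq[vector]          # 2. vector -> irq lookup
        handled = handle_irq(irq)              # 3. handle_irq()
        if not handled:
            self.eoi_sent = True               # 4. ack_APIC_irq() on failure
        self.preempt_count -= HARDIRQ_OFFSET   # 5. irq_exit()

cpu = Cpu()
cpu.vector_irq[0x35] = 31
seen = []
cpu.do_irq(0x35, lambda irq: (seen.append(irq), True)[1])
assert seen == [31] and cpu.preempt_count == 0 and not cpu.eoi_sent
```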

2.1. preempt_count

preempt_count is a member of struct thread_info. The kernel describes it as follows; it holds the softirq count, hardirq count, and so on, and serves both as a preemption counter and as a way to tell which context we are currently in.

/*
* We put the hardirq and softirq counter into the preemption
* counter. The bitmask has the following meaning:
*
* - bits 0-7 are the preemption count (max preemption depth: 256)
* - bits 8-15 are the softirq count (max # of softirqs: 256)
*
* The hardirq count can in theory reach the same as NR_IRQS.
* In reality, the number of nested IRQS is limited to the stack
* size as well. For archs with over 1000 IRQS it is not practical
* to expect that they will all nest. We give a max of 10 bits for
* hardirq nesting. An arch may choose to give less than 10 bits.
* m68k expects it to be 8.
*
* - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
* - bit 26 is the NMI_MASK
* - bit 27 is the PREEMPT_ACTIVE flag
*
* PREEMPT_MASK: 0x000000ff
* SOFTIRQ_MASK: 0x0000ff00
* HARDIRQ_MASK: 0x03ff0000
* NMI_MASK: 0x04000000
*/
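The bit layout above translates directly into masks and context predicates. A quick check (mirroring the comment; the kernel's real in_interrupt() is defined in terms of these same masks):

```python
PREEMPT_MASK = 0x000000ff   # bits 0-7:  preemption count
SOFTIRQ_MASK = 0x0000ff00   # bits 8-15: softirq count
HARDIRQ_MASK = 0x03ff0000   # bits 16-25: hardirq count
NMI_MASK     = 0x04000000   # bit 26: NMI

def in_interrupt(preempt_count):
    """Non-zero hardirq/softirq/NMI bits mean interrupt context."""
    return bool(preempt_count & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))

HARDIRQ_OFFSET = 1 << 16
pc = 0
pc += HARDIRQ_OFFSET        # irq_enter()
assert in_interrupt(pc)
pc -= HARDIRQ_OFFSET        # irq_exit()
assert not in_interrupt(pc)
assert PREEMPT_MASK | SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK == 0x07ffffff
```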
 

2.2. irq and vector

vector_irq is a per-CPU array describing the vector-to-irq mapping on each CPU: the index is the vector value, and the stored value is the irq.

#define NR_VECTORS 256
typedef int vector_irq_t[NR_VECTORS];
DECLARE_PER_CPU(vector_irq_t, vector_irq);
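An equivalent toy model of the per-CPU table (illustration only; the CPU count and irq numbers are made up):

```python
NR_VECTORS = 256
NR_CPUS = 4

# per-CPU vector_irq: one 256-entry table per CPU, -1 meaning unused
vector_irq = [[-1] * NR_VECTORS for _ in range(NR_CPUS)]

def assign(cpu, vector, irq):
    vector_irq[cpu][vector] = irq

def this_cpu_read(cpu, vector):
    """Mirror of irq = __this_cpu_read(vector_irq[vector])."""
    return vector_irq[cpu][vector]

assign(cpu=0, vector=0x35, irq=31)
assert this_cpu_read(0, 0x35) == 31
assert this_cpu_read(1, 0x35) == -1   # the mapping is per-CPU
```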
 

2.3. handle_irq

handle_irq() obtains the interrupt descriptor irq_desc from the irq and then calls generic_handle_irq_desc().

bool handle_irq(unsigned irq, struct pt_regs *regs)
{
    struct irq_desc *desc;

    stack_overflow_check(regs);

    desc = irq_to_desc(irq);
    if (unlikely(!desc))
        return false;

    generic_handle_irq_desc(irq, desc);
    return true;
}

static inline void generic_handle_irq_desc(unsigned int irq, struct irq_desc *desc)
{
    desc->handle_irq(irq, desc);
}
 

To dig deeper, it helps to understand that the interrupt code has three main abstraction layers:

  • High-level driver API
  • High-level IRQ flow handlers
  • Chip-level hardware encapsulation

The analysis above (common_interrupt and friends) is mostly low-level architecture code. When an interrupt fires, this low-level code enters the generic interrupt code by calling desc->handle_irq. The function handle_irq points to belongs to the High-level IRQ flow handlers layer. The kernel provides a set of predefined irq-flow handlers, which the architecture assigns to specific interrupts (i.e., assigns to desc->handle_irq) during boot or device initialization:

/*
* Built-in IRQ handlers for various IRQ types,
* callable via desc->handle_irq()
*/
extern void handle_level_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_edge_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_edge_eoi_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_simple_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_percpu_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_percpu_devid_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_bad_irq(unsigned int irq, struct irq_desc *desc);
extern void handle_nested_irq(unsigned int irq);
 

A high-level IRQ flow handler invokes the desc->irq_data.chip primitives (members of irq_chip, e.g. irq_ack), i.e. the chip-level hardware encapsulation, and, if the interrupt has a specific handler registered, it calls the peripheral's specific handler as well.

Since I happened to stop in handle_edge_irq (the generic implementation for edge-triggered interrupts) with GDB, I'll use it as the example; see the code and comments below.

/**
 * handle_edge_irq - edge type IRQ handler
 * @irq: the interrupt number
 * @desc: the interrupt description structure for this irq
 *
 * Interrupt occurs on the falling and/or rising edge of a hardware
 * signal. The occurrence is latched into the irq controller hardware
 * and must be acked in order to be reenabled. After the ack another
 * interrupt can happen on the same source even before the first one
 * is handled by the associated event handler. If this happens it
 * might be necessary to disable (mask) the interrupt depending on the
 * controller hardware. This requires to reenable the interrupt inside
 * of the loop which handles the interrupts which have arrived while
 * the handler was running. If all pending interrupts are handled, the
 * loop is left.
 */
void
handle_edge_irq(unsigned int irq, struct irq_desc *desc)
{
    raw_spin_lock(&desc->lock);
    /*
     * Clear the IRQS_REPLAY and IRQS_WAITING flags.
     * IRQS_REPLAY relates to irq resend: check_irq_resend() sets
     * IRQS_REPLAY when it finds IRQS_PENDING set.
     */
    desc->istate &= ~(IRQS_REPLAY | IRQS_WAITING);
    /*
     * If we're currently running this IRQ, or its disabled,
     * we shouldn't process the IRQ. Mark it pending, handle
     * the necessary masking and go out
     */
    /*
     * 1. The interrupt was disabled by another CPU: mark it PENDING,
     *    mask and ack it; when the other CPU re-enables the interrupt,
     *    it will be resent.
     * 2. The descriptor is being handled by another CPU (note that
     *    "currently running this IRQ" refers to an earlier interrupt
     *    with the same irq number, not this one): mark it PENDING,
     *    mask and ack it; the other CPU will handle it shortly.
     * 3. The descriptor has no irqaction, so there is no point running
     *    the specific-handler path.
     */
    if (unlikely(irqd_irq_disabled(&desc->irq_data) ||
                 irqd_irq_inprogress(&desc->irq_data) || !desc->action)) {
        if (!irq_check_poll(desc)) {
            desc->istate |= IRQS_PENDING;
            mask_ack_irq(desc);
            goto out_unlock;
        }
    }
    kstat_incr_irqs_this_cpu(irq, desc); /* update the irq statistics */

    /* Start handling the irq */
    desc->irq_data.chip->irq_ack(&desc->irq_data); /* ack: the interrupt is re-enabled */

    do {
        /*
         * If, during the unlocked phase of handle_irq_event() in the
         * previous loop iteration, another CPU unregistered the
         * specific handler, mask the irq and bail out.
         */
        if (unlikely(!desc->action)) {
            mask_irq(desc);
            goto out_unlock;
        }

        /*
         * When another irq arrived while we were handling
         * one, we could have masked the irq.
         * Renable it, if it was not disabled in meantime.
         */
        /*
         * If the desc is in PENDING state (for the reasons explained
         * above), unmask what was masked earlier.
         */
        if (unlikely(desc->istate & IRQS_PENDING)) {
            if (!irqd_irq_disabled(&desc->irq_data) &&
                irqd_irq_masked(&desc->irq_data))
                unmask_irq(desc);
        }

        handle_irq_event(desc); /* handle the interrupt request event */

    } while ((desc->istate & IRQS_PENDING) &&
             !irqd_irq_disabled(&desc->irq_data));

out_unlock:
    raw_spin_unlock(&desc->lock);
}
 

handle_irq_event(desc) calls handle_irq_event_percpu(), which walks the action list and runs each specific handler; I won't expand handle_irq_event_percpu() here.

Note the lock operations around handle_irq_event_percpu(): the lock and unlock at the start and end of handle_edge_irq above are not a matched pair. The raw_spin_unlock before handle_irq_event_percpu() pairs with the raw_spin_lock at the start of handle_edge_irq, while the raw_spin_lock after handle_irq_event_percpu() pairs with the raw_spin_unlock at the end of handle_edge_irq.

irqreturn_t handle_irq_event(struct irq_desc *desc)
{
    struct irqaction *action = desc->action;
    irqreturn_t ret;

    desc->istate &= ~IRQS_PENDING;                    /* clear the IRQS_PENDING flag */
    irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);   /* this CPU is now handling this desc's irq */
    raw_spin_unlock(&desc->lock);                     /* unlock desc */

    ret = handle_irq_event_percpu(desc, action);

    raw_spin_lock(&desc->lock);                       /* lock desc */
    irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS); /* clear the IRQD_IRQ_INPROGRESS flag */
    return ret;
}
 

I won't analyze the remaining high-level IRQ flow handlers in detail; below is a simplified excerpt from the kernel documentation for some of them.

  • handle_level_irq

    :c:func:`desc->irq_data.chip->irq_mask_ack`;
    handle_irq_event(desc->action);
    :c:func:`desc->irq_data.chip->irq_unmask`;
     
  • handle_fasteoi_irq

    handle_irq_event(desc->action);
    :c:func:`desc->irq_data.chip->irq_eoi`;
     
  • handle_edge_irq

    if (desc->status & running) {
        :c:func:`desc->irq_data.chip->irq_mask_ack`;
        desc->status |= pending | masked;
        return;
    }
    :c:func:`desc->irq_data.chip->irq_ack`;
    desc->status |= running;
    do {
        if (desc->status & masked)
            :c:func:`desc->irq_data.chip->irq_unmask`;
        desc->status &= ~pending;
        handle_irq_event(desc->action);
    } while (status & pending);
    desc->status &= ~running;
     
  • handle_simple_irq

    handle_irq_event(desc->action);
     
  • handle_percpu_irq

    if (desc->irq_data.chip->irq_ack)
        :c:func:`desc->irq_data.chip->irq_ack`;
    handle_irq_event(desc->action);
    if (desc->irq_data.chip->irq_eoi)
        :c:func:`desc->irq_data.chip->irq_eoi`;
     

2.4. ack_APIC_irq

Back in do_IRQ: if handle_irq returns false, ack_APIC_irq writes 0 to the APIC's EOI register to report that the interrupt has been serviced. So if handle_irq returns true, is ack_APIC_irq skipped? No. In that case the acknowledgement happens when the high-level IRQ flow handler invokes the desc->irq_data.chip primitives; for example, in the handle_edge_irq analysis above, the irq_ack call is in fact ack_APIC_irq.

2.5. invoke_softirq

irq_exit() removes the count that irq_enter() added, marking the end of the hardirq; then, if in_interrupt() shows preempt_count is 0 and a softirq is pending, it calls invoke_softirq().

static inline void invoke_softirq(void)
{
    if (!force_irqthreads) {
        /*
         * We can safely execute softirq on the current stack if
         * it is the irq stack, because it should be near empty
         * at this stage. But we have no way to know if the arch
         * calls irq_exit() on the irq stack. So call softirq
         * in its own stack to prevent from any overrun on top
         * of a potentially deep task stack.
         */
        do_softirq();
    } else {
        wakeup_softirqd();
    }
}

 

