2017-08-24
Today let's talk about KVM interrupt virtualization. A virtual machine's interrupts come from roughly two kinds of sources: from user-space qemu, and from inside KVM itself.
The crux of interrupt virtualization is virtualizing the interrupt controller, and today the dominant controller architecture is the APIC. Under this architecture, a device controller signals the IO APIC through some trigger mechanism; the IO APIC consults the redirection table it maintains (the pci irq routing table), formats an interrupt message, and sends it to a local APIC. The local APIC is per-CPU, one per core; it provides the functionality and registers of a traditional interrupt controller: the Interrupt Request Register (IRR), the Interrupt Mask Register (IMR), the In-Service Register (ISR), and so on. Virtualizing these key components is the heart of interrupt virtualization. Under KVM, each virtual machine maintains one IO APIC, while each VCPU has its own local APIC.
Key data structures:
kvm_irq_routing_table
struct kvm_irq_routing_table {	/* ue->gsi */
	int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
	struct kvm_kernel_irq_routing_entry *rt_entries;
	u32 nr_rt_entries;
	/*
	 * Array indexed by gsi. Each entry contains list of irq chips
	 * the gsi is connected to.
	 */
	struct hlist_head map[0];
};
This is the interrupt routing table; each KVM instance has one. chip is a two-dimensional array covering the pins of the three interrupt chips, with 24 pin slots reserved per chip; each array element records the GSI number of the corresponding pin. rt_entries is a pointer to an array of kvm_kernel_irq_routing_entry with nr_rt_entries elements, one per IRQ. map is best understood as an array of list heads: indexed by GSI, it locates every kvm_kernel_irq_routing_entry associated with the same IRQ. The initialization of the routing table is covered in the last section of this article.
struct kvm_kernel_irq_routing_entry {
	u32 gsi;
	u32 type;
	int (*set)(struct kvm_kernel_irq_routing_entry *e,
		   struct kvm *kvm, int irq_source_id, int level,
		   bool line_status);
	union {
		struct {
			unsigned irqchip;
			unsigned pin;
		} irqchip;
		struct msi_msg msi;
	};
	struct hlist_node link;
};
gsi is the GSI number of this entry, generally identical to the IRQ; set is the trigger callback associated with the IRQ, through which the IRQ is delivered to the IO APIC; link is the hook that chains the entry into the per-IRQ list in map described above.
Inside KVM, interrupt injection starts from the function kvm_set_irq:
int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
		bool line_status)
{
	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
	int ret = -1, i = 0;
	struct kvm_irq_routing_table *irq_rt;

	trace_kvm_set_irq(irq, level, irq_source_id);

	/* Not possible to detect if the guest uses the PIC or the
	 * IOAPIC.  So set the bit in both. The guest will ignore
	 * writes to the unused one.
	 */
	rcu_read_lock();
	irq_rt = rcu_dereference(kvm->irq_routing);
	if (irq < irq_rt->nr_rt_entries)
		hlist_for_each_entry(e, &irq_rt->map[irq], link)
			irq_set[i++] = *e;
	rcu_read_unlock();

	/* invoke the set method of every chip registered on this irq */
	while (i--) {
		int r;
		/* kvm_set_pic_irq / kvm_set_ioapic_irq */
		r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level,
				   line_status);
		if (r < 0)
			continue;

		ret = r + ((ret < 0) ? 0 : ret);
	}

	return ret;
}
kvm identifies the target virtual machine; irq_source_id is the interrupt source ID, typically KVM_USERSPACE_IRQ_SOURCE_ID or KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID; irq is the global interrupt number; level selects high or low. Note that an edge trigger must be emulated with two level calls: first high, then low. Inside the function, the first step is to collect every device registered on the same irq; this matters when the irq is shared, and in the non-shared case there is at most one. Each device is abstracted as a kvm_kernel_irq_routing_entry, stashed temporarily in the irq_set array. Then, for each element, its set method is called. Most systems today use the APIC architecture, so the set method is usually kvm_set_ioapic_irq; under a legacy PIC it is kvm_set_pic_irq. We take kvm_set_ioapic_irq as our example: the function does nothing substantial itself, it simply calls kvm_ioapic_set_irq.
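The collect-then-dispatch loop above can be modeled in miniature. This is an illustrative sketch, not the kernel API: toy_entry, toy_set_irq and toy_ack are hypothetical stand-ins that mimic how entries sharing a GSI are copied into a local array and each set callback is invoked, with the return values accumulated the same way kvm_set_irq does.

```c
#include <assert.h>

/* Toy model of kvm_set_irq's dispatch (names are illustrative). */
#define MAX_CHIPS 3

struct toy_entry;
typedef int (*set_fn)(struct toy_entry *e, int level);

struct toy_entry {
	int gsi;
	set_fn set;
};

/* example callback: pretend delivery succeeded when the line is high */
int toy_ack(struct toy_entry *e, int level)
{
	(void)e;
	return level ? 1 : 0;
}

/* collect every entry registered for this gsi, then invoke each set
 * callback, mimicking the irq_set[] copy-then-invoke loop */
int toy_set_irq(struct toy_entry *entries, int n, int gsi, int level)
{
	struct toy_entry irq_set[MAX_CHIPS];
	int i, cnt = 0, ret = -1;

	for (i = 0; i < n; i++)
		if (entries[i].gsi == gsi)
			irq_set[cnt++] = entries[i];

	while (cnt--) {
		int r = irq_set[cnt].set(&irq_set[cnt], level);
		if (r < 0)
			continue;
		ret = r + ((ret < 0) ? 0 : ret);
	}
	return ret;	/* -1 if no entry handled the irq */
}
```

With two sharers on the same GSI the results add up, and an unregistered GSI yields -1, matching the semantics of the real loop.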
int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id,
		       int level, bool line_status)
{
	u32 old_irr;
	u32 mask = 1 << irq;	/* bit corresponding to this irq */
	union kvm_ioapic_redirect_entry entry;
	int ret, irq_level;

	BUG_ON(irq < 0 || irq >= IOAPIC_NUM_PINS);

	spin_lock(&ioapic->lock);
	old_irr = ioapic->irr;
	/* resolve the requested line state (high or low) */
	irq_level = __kvm_irq_line_state(&ioapic->irq_states[irq],
					 irq_source_id, level);
	entry = ioapic->redirtbl[irq];
	irq_level ^= entry.fields.polarity;
	if (!irq_level) {
		/* low level: clear the request bit */
		ioapic->irr &= ~mask;
		ret = 1;
	} else {
		/* trigger mode of this pin */
		int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);

		if (irq == RTC_GSI && line_status &&
			rtc_irq_check_coalesced(ioapic)) {
			ret = 0; /* coalesced */
			goto out;
		}
		/* latch the interrupt into the interrupt request register */
		ioapic->irr |= mask;
		/* deliver on a fresh edge, or in level mode when the
		 * previous interrupt is no longer pending at the LAPIC */
		if ((edge && old_irr != ioapic->irr) ||
		    (!edge && !entry.fields.remote_irr))
			ret = ioapic_service(ioapic, irq, line_status);
		else
			ret = 0; /* report coalesced interrupt */
	}
out:
	trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
	spin_unlock(&ioapic->lock);

	return ret;
}
At this point the interrupt has reached the emulated IO-APIC. The IO-APIC's most important state is its redirection table, and the operations on it happen mainly in ioapic_service; everything before that is preparation. Before entering ioapic_service, two tasks are done: (1) determine the trigger mode, distinguishing level from edge; (2) set the ioapic's IRR. As noted earlier, an edge trigger is emulated by two level calls with opposite polarity, so the code must first figure out which of the two it is handling: only the asserting call proceeds to delivery, while the deasserting call acts as a reset that clears the bit in the ioapic's IRR. Delivery via ioapic_service then happens either in edge mode when the IRR actually changed (a fresh edge arrived), or in level mode when the previous interrupt has been acknowledged (remote_irr is clear).
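The latch-and-deliver decision above can be condensed into a small model. This is a sketch of the logic only, under simplifying assumptions (a single pin, hypothetical mini_ioapic/mini_set_irq names); it is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the IOAPIC IRR latch logic described above. */
typedef struct {
	unsigned int irr;	/* interrupt request register */
	bool remote_irr;	/* set while a level irq is in service */
	bool edge;		/* trigger mode of the pin */
} mini_ioapic;

/* Returns true when the interrupt should be delivered
 * (i.e. when the real code would call ioapic_service). */
bool mini_set_irq(mini_ioapic *io, int pin, int level)
{
	unsigned int mask = 1u << pin;
	unsigned int old_irr = io->irr;

	if (!level) {			/* deassert: just clear the IRR bit */
		io->irr &= ~mask;
		return false;
	}
	io->irr |= mask;		/* latch the request */
	if (io->edge)
		return old_irr != io->irr;	/* only a fresh edge delivers */
	return !io->remote_irr;	/* level: deliver unless still in service */
}
```

Asserting the same edge-triggered pin twice without a deassert in between coalesces the second request, exactly the behavior the "fresh edge" check enforces.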
static int ioapic_service(struct kvm_ioapic *ioapic, unsigned int idx,
			  bool line_status)
{
	union kvm_ioapic_redirect_entry *pent;
	int injected = -1;

	/* fetch the redirection table entry */
	pent = &ioapic->redirtbl[idx];

	if (!pent->fields.mask) {
		/* send irq to local apic */
		injected = ioapic_deliver(ioapic, idx, line_status);
		if (injected && pent->fields.trig_mode == IOAPIC_LEVEL_TRIG)
			pent->fields.remote_irr = 1;
	}

	return injected;
}
This function is straightforward: it looks up the redirection table entry for the irq and, provided the entry's mask bit is not set, delivers the interrupt to the local APIC by calling ioapic_deliver. The main job of ioapic_deliver is to build a kvm_lapic_irq from the kvm_ioapic_redirect_entry, analogous to the message travelling on the bus. It then calls kvm_irq_delivery_to_apic, which routes the message to the appropriate VCPU; concretely this goes through kvm_apic_set_irq and then __apic_accept_irq, which handles the message according to its delivery mode. The common case is APIC_DM_FIXED, where the interrupt is delivered to a specific CPU; this path calls kvm_x86_ops->deliver_posted_interrupt, which corresponds to vmx_deliver_posted_interrupt in vmx.c.
static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	int r;

	/* set the vector's bit in the PIR bitmap */
	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
		return;

	/* mark the outstanding-notification flag */
	r = pi_test_and_set_on(&vmx->pi_desc);
	kvm_make_request(KVM_REQ_EVENT, vcpu);
#ifdef CONFIG_SMP
	if (!r && (vcpu->mode == IN_GUEST_MODE))
		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
				POSTED_INTR_VECTOR);
	else
#endif
		kvm_vcpu_kick(vcpu);
}
The main work here is setting the bitmap in vmx->pi_desc, namely the pir field of struct pi_desc: an array of eight 32-bit words, so 256 interrupts at most can be marked, one bit per interrupt vector. Once the bit is set, a KVM_REQ_EVENT request is raised, and the interrupt will be injected at the next VM entry.
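The 8x32-bit posted-interrupt bitmap can be sketched as follows. This is a model of the idea only, with hypothetical names (toy_pi_desc, toy_test_and_set_pir); the kernel's version uses atomic test_and_set_bit on the real pi_desc.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the posted-interrupt request bitmap described above:
 * eight 32-bit words cover 256 vectors, one bit each. */
typedef struct {
	uint32_t pir[8];	/* 8 x 32 = 256 vector bits */
	uint32_t on;		/* outstanding-notification flag */
} toy_pi_desc;

/* returns the previous value of the bit, like test_and_set_bit
 * (non-atomic here, for illustration only) */
int toy_test_and_set_pir(toy_pi_desc *d, int vector)
{
	uint32_t mask = 1u << (vector & 31);
	int idx = vector >> 5;		/* which 32-bit word */
	int old = (d->pir[idx] & mask) != 0;

	d->pir[idx] |= mask;
	return old;
}
```

A second posting of the same vector returns 1, which is exactly why vmx_deliver_posted_interrupt can bail out early: the interrupt is already pending.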
The injection path in detail:
In vcpu_enter_guest (x86.c) there is the following fragment:
	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
		kvm_apic_accept_events(vcpu);
		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
			r = 1;
			goto out;
		}
		/* inject after the vcpu has been loaded onto a physical cpu;
		 * in effect certain bits have already been set */
		inject_pending_event(vcpu);	/* interrupt injection */
		......
That is, before entering non-root mode, KVM checks for KVM_REQ_EVENT. If an event is pending, kvm_apic_accept_events is called to receive it; this mainly handles APIC INIT and IPI processing, which we will not pursue here. Then inject_pending_event checks whether there is an injectable interrupt. The check starts with kvm_cpu_has_injectable_intr, which goes through kvm_apic_has_interrupt -> apic_find_highest_irr -> vmx_sync_pir_to_irr. vmx_sync_pir_to_irr harvests posted interrupts: it scans the bitmap in vmx->pi_desc and, if bits are set, calls kvm_apic_update_irr to fold them into the APIC's IRR register. apic_search_irr then fetches the pending vector from the IRR, returning -1 if none is found. If one is found, kvm_queue_interrupt records it in the vcpu:
static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
				       bool soft)
{
	vcpu->arch.interrupt.pending = true;
	vcpu->arch.interrupt.soft = soft;
	vcpu->arch.interrupt.nr = vector;
}
Finally kvm_x86_ops->set_irq performs the last step of injection, writing the interrupt into the VMCS. The function pointer resolves to vmx_inject_irq:
static void vmx_inject_irq(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	uint32_t intr;
	int irq = vcpu->arch.interrupt.nr;	/* vector to inject */

	trace_kvm_inj_virq(irq);

	++vcpu->stat.irq_injections;
	if (vmx->rmode.vm86_active) {
		int inc_eip = 0;
		if (vcpu->arch.interrupt.soft)
			inc_eip = vcpu->arch.event_exit_inst_len;
		if (kvm_inject_realmode_interrupt(vcpu, irq, inc_eip) != EMULATE_DONE)
			kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
		return;
	}
	intr = irq | INTR_INFO_VALID_MASK;	/* mark the vector valid */
	if (vcpu->arch.interrupt.soft) {	/* software interrupt */
		intr |= INTR_TYPE_SOFT_INTR;
		/* a software interrupt must also record the instruction length */
		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
			     vmx->vcpu.arch.event_exit_inst_len);
	} else
		intr |= INTR_TYPE_EXT_INTR;	/* external interrupt */
	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
}
The result is ultimately written into the VMCS field VM_ENTRY_INTR_INFO_FIELD, which must follow a fixed format; see the Intel manual for the details. Bits 7:0 carry the vector, bits 10:8 the interrupt type (hardware or software interrupt), bit 12 the NMI-unblocking flag, and the top bit (31) is the valid bit.
#define INTR_INFO_VECTOR_MASK           0xff            /* 7:0 */
#define INTR_INFO_INTR_TYPE_MASK        0x700           /* 10:8 */
#define INTR_INFO_DELIVER_CODE_MASK     0x800           /* 11 */
#define INTR_INFO_UNBLOCK_NMI           0x1000          /* 12 */
#define INTR_INFO_VALID_MASK            0x80000000      /* 31 */
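The encoding vmx_inject_irq performs can be shown in isolation. This sketch assumes the VMX type values (external interrupt = 0, software interrupt = 4, shifted into bits 10:8, matching INTR_TYPE_EXT_INTR/INTR_TYPE_SOFT_INTR); encode_intr_info is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the VM_ENTRY_INTR_INFO_FIELD encoding described above. */
#define VALID_MASK	0x80000000u	/* bit 31: field is valid */
#define TYPE_EXT	(0u << 8)	/* external interrupt */
#define TYPE_SOFT	(4u << 8)	/* software interrupt */

uint32_t encode_intr_info(uint8_t vector, int soft)
{
	uint32_t intr = vector | VALID_MASK;	/* bits 7:0 + valid bit */
	intr |= soft ? TYPE_SOFT : TYPE_EXT;	/* type in bits 10:8 */
	return intr;
}
```

For example, injecting external vector 0x20 produces 0x80000020, and the software-interrupt variant sets bits 10:8 to 100b.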
Initialization of the interrupt routing table
User-space qemu issues the KVM_CREATE_IRQCHIP ioctl, which enters KVM's kvm_vm_ioctl handler and then kvm_arch_vm_ioctl, whose KVM_CREATE_IRQCHIP case runs the interrupt controller initialization flow. The first step is naturally registering the PIC and IO APIC, which we will not detail here; our focus is the subsequent initialization of the interrupt routing table, implemented by kvm_setup_default_irq_routing:
int kvm_setup_default_irq_routing(struct kvm *kvm)
{
	return kvm_set_irq_routing(kvm, default_routing,
				   ARRAY_SIZE(default_routing), 0);
}
The first parameter kvm identifies the VM; default_routing is a global kvm_irq_routing_entry array defined in irq_comm.c. The array serves no purpose other than seeding kvm_irq_routing_table. Let's look at kvm_set_irq_routing:
int kvm_set_irq_routing(struct kvm *kvm,
			const struct kvm_irq_routing_entry *ue,
			unsigned nr,
			unsigned flags)
{
	struct kvm_irq_routing_table *new, *old;
	u32 i, j, nr_rt_entries = 0;
	int r;

	/* normally nr_rt_entries ends up equal to nr */
	for (i = 0; i < nr; ++i) {
		if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
			return -EINVAL;
		nr_rt_entries = max(nr_rt_entries, ue[i].gsi);
	}

	nr_rt_entries += 1;

	/* one allocation for the table, the map array and the entries */
	new = kzalloc(sizeof(*new) + (nr_rt_entries * sizeof(struct hlist_head))
		      + (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
		      GFP_KERNEL);
	if (!new)
		return -ENOMEM;

	/* rt_entries starts right after the map array */
	new->rt_entries = (void *)&new->map[nr_rt_entries];

	new->nr_rt_entries = nr_rt_entries;
	for (i = 0; i < KVM_NR_IRQCHIPS; i++)
		for (j = 0; j < KVM_IRQCHIP_NUM_PINS; j++)
			new->chip[i][j] = -1;

	/* initialize each kvm_kernel_irq_routing_entry */
	for (i = 0; i < nr; ++i) {
		r = -EINVAL;
		if (ue->flags)
			goto out;
		r = setup_routing_entry(new, &new->rt_entries[i], ue);
		if (r)
			goto out;
		++ue;
	}

	mutex_lock(&kvm->irq_lock);
	old = kvm->irq_routing;
	kvm_irq_routing_update(kvm, new);
	mutex_unlock(&kvm->irq_lock);

	synchronize_rcu();

	/* free the old table */
	new = old;
	r = 0;

out:
	kfree(new);
	return r;
}
For reference, here is one of the macros involved:
#define IOAPIC_ROUTING_ENTRY(irq) \
	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
	  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
This is a key macro for initializing default_routing; every element is built by passing this macro an irq number (0-23, or 0-47 on 64-bit), so the gsi is just the irq number. Hence, back in the function, nr_rt_entries is simply the number of entries in the array. Next, space is allocated for kvm_irq_routing_table; note that the allocation has three parts: the kvm_irq_routing_table struct itself, nr_rt_entries hlist_heads, and nr kvm_kernel_irq_routing_entry structs, so the table's size tracks the size of the global array. The layout is: the struct header first, then the map array of list heads, then the rt_entries array immediately after it.
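That single-chunk layout can be demonstrated with a small model. The struct names here (mini_table, mini_entry, mini_head) are simplified stand-ins for the kernel's types, using a C99 flexible array member where the kernel used map[0]; the pointer setup mirrors the new->rt_entries line.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative model of kvm_set_irq_routing's one-shot allocation:
 * header + nr_rt_entries list heads + nr routing entries. */
struct mini_entry { unsigned gsi; };
struct mini_head  { struct mini_head *first; };

struct mini_table {
	unsigned nr_rt_entries;
	struct mini_entry *rt_entries;
	struct mini_head map[];		/* flexible array, indexed by gsi */
};

struct mini_table *mini_alloc(unsigned nr_rt_entries, unsigned nr)
{
	struct mini_table *t = calloc(1, sizeof(*t)
			+ nr_rt_entries * sizeof(struct mini_head)
			+ nr * sizeof(struct mini_entry));
	if (!t)
		return NULL;
	/* rt_entries begins immediately after the map array */
	t->rt_entries = (void *)&t->map[nr_rt_entries];
	t->nr_rt_entries = nr_rt_entries;
	return t;
}
```

The assertion below confirms rt_entries sits exactly nr_rt_entries list heads past the start of map, which is all the cast in the kernel code is doing.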

Given that layout, the meaning of the line new->rt_entries = (void *)&new->map[nr_rt_entries]; is clear. Next, every slot of the table's chip array is initialized to -1, and then a loop initializes each kvm_kernel_irq_routing_entry via setup_routing_entry. Let's look at that function:
static int setup_routing_entry(struct kvm_irq_routing_table *rt,
			       struct kvm_kernel_irq_routing_entry *e,
			       const struct kvm_irq_routing_entry *ue)
{
	int r = -EINVAL;
	struct kvm_kernel_irq_routing_entry *ei;

	/*
	 * Do not allow GSI to be mapped to the same irqchip more than once.
	 * Allow only one to one mapping between GSI and MSI.
	 */
	hlist_for_each_entry(ei, &rt->map[ue->gsi], link)
		if (ei->type == KVM_IRQ_ROUTING_MSI ||
		    ue->type == KVM_IRQ_ROUTING_MSI ||
		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
			return r;

	e->gsi = ue->gsi;
	e->type = ue->type;
	r = kvm_set_routing_entry(rt, e, ue);
	if (r)
		goto out;

	hlist_add_head(&e->link, &rt->map[e->gsi]);
	r = 0;
out:
	return r;
}
We already saw from the initialization macros that .type is KVM_IRQ_ROUTING_IRQCHIP, so here the function essentially just copies e->gsi = ue->gsi and e->type = ue->type, then calls kvm_set_routing_entry. That function mainly installs the set callback in the kvm_kernel_irq_routing_entry: kvm_set_ioapic_irq for the APIC, kvm_set_pic_irq for the PIC. It also fills in the irqchip type and pin: for the IO APIC the pin is copied straight over, while for the PIC, since the pin was computed as irq % 8, an offset of 8 must be added back for the slave. It then sets the table's chip slot to the gsi number. Back in setup_routing_entry, the kvm_kernel_irq_routing_entry is linked, with the gsi as index, into the corresponding list in the map array. Back in kvm_set_irq_routing, the remaining step is updating the irq_routing pointer in the kvm struct.
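The GSI-to-pin mapping described above can be spelled out as a sketch. The chip ids and helper names here are illustrative stand-ins, assuming the default routing: IOAPIC pins equal the GSI, while PIC entries store pin = gsi % 8 in the macro and the slave side adds back an offset of 8 when the kernel entry is set up.

```c
#include <assert.h>

/* Sketch of the GSI -> (chip, pin) mapping implied by the text. */
enum { CHIP_PIC_MASTER, CHIP_PIC_SLAVE, CHIP_IOAPIC };

int pic_chip(unsigned gsi)
{
	return gsi < 8 ? CHIP_PIC_MASTER : CHIP_PIC_SLAVE;
}

/* pin as stored in the entry after kvm_set_routing_entry */
unsigned pic_pin(unsigned gsi)
{
	unsigned pin = gsi % 8;		/* as in the default_routing macro */
	if (pic_chip(gsi) == CHIP_PIC_SLAVE)
		pin += 8;		/* slave offset added back */
	return pin;
}

unsigned ioapic_pin(unsigned gsi)
{
	return gsi;			/* direct copy */
}
```

Note that after the slave offset is restored, the stored PIC pin ends up equal to the GSI again for GSIs 0-15, which keeps the per-chip pin index and the global number consistent.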
Interrupt virtualization flow
kvm_set_irq
kvm_ioapic_set_irq
ioapic_service
ioapic_deliver
kvm_irq_delivery_to_apic
kvm_apic_set_irq
__apic_accept_irq
vmx_deliver_posted_interrupt
Injection phase
vcpu_enter_guest
kvm_apic_accept_events
inject_pending_event
kvm_queue_interrupt
vmx_inject_irq
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
Routing table initialization
kvm_arch_vm_ioctl (x86.c)
kvm_setup_default_irq_routing (irq_comm.c)
kvm_set_irq_routing (irqchip.c)
setup_routing_entry (irqchip.c)
kvm_set_routing_entry (irq_comm.c)
e->set = kvm_set_ioapic_irq; (irq_comm.c)
Immanuel!
References:
Linux 3.10.1 source code
