2017-04-20
The previous article covered the QEMU side of memory virtualization. For each FlatRange (FR) to be added, it invoked MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, region_add):
#define MEMORY_LISTENER_UPDATE_REGION(fr, as, dir, callback)            \
    MEMORY_LISTENER_CALL(callback, dir, (&(MemoryRegionSection) {      \
        .mr = (fr)->mr,                                                 \
        .address_space = (as),                                          \
        .offset_within_region = (fr)->offset_in_region,                 \
        .size = (fr)->addr.size,                                        \
        .offset_within_address_space = int128_get64((fr)->addr.start), \
        .readonly = (fr)->readonly,                                     \
    }))
This macro is really a wrapper around another macro, MEMORY_LISTENER_CALL: it builds a temporary MemoryRegionSection from the FlatRange and passes it on. The logic looks like this:
#define MEMORY_LISTENER_CALL(_callback, _direction, _section, _args...) \
    do {                                                                \
        MemoryListener *_listener;                                      \
                                                                        \
        switch (_direction) {                                           \
        case Forward:                                                   \
            QTAILQ_FOREACH(_listener, &memory_listeners, link) {        \
                if (_listener->_callback                                \
                    && memory_listener_match(_listener, _section)) {    \
                    _listener->_callback(_listener, _section, ##_args); \
                }                                                       \
            }                                                           \
            break;                                                      \
        case Reverse:                                                   \
            QTAILQ_FOREACH_REVERSE(_listener, &memory_listeners,        \
                                   memory_listeners, link) {            \
                if (_listener->_callback                                \
                    && memory_listener_match(_listener, _section)) {    \
                    _listener->_callback(_listener, _section, ##_args); \
                }                                                       \
            }                                                           \
            break;                                                      \
        default:                                                        \
            abort();                                                    \
        }                                                               \
    } while (0)
The _direction argument here is simply the traversal direction. Listeners are kept sorted by priority from low to high, so the direction really just decides who gets to handle the event first: Forward walks the list front to back, Reverse walks it back to front. Also worth noting is the memory_listener_match function. A listener may be bound to a particular AddressSpace through its address_space_filter field; if no AddressSpace is set there, the listener applies to all AddressSpaces, otherwise only to the one it names. With that in mind, the code above poses no difficulty.
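For reference, here is how memory_listener_match implements that filtering; this is a sketch based on the memory.c of this era of QEMU (newer trees keep per-AddressSpace listener lists and no longer have this helper):

static bool memory_listener_match(MemoryListener *listener,
                                  MemoryRegionSection *section)
{
    /* a listener with no filter sees every AddressSpace; otherwise it
     * only sees sections belonging to the AddressSpace it registered for */
    return !listener->address_space_filter
        || listener->address_space_filter == section->address_space;
}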
That is how kvm_region_add() comes to be executed, so let's start from kvm_region_add() and look at the KVM side of memory management. kvm_region_add() is the core listener's region_add callback: after QEMU has allocated the memory, it is invoked once per FlatRange, and its ultimate job is to hand the region's parameters over to KVM so KVM can record them. Getting straight to the point, the core of the work lives in static void kvm_set_phys_mem(MemoryRegionSection *section, bool add).
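Both the add and delete callbacks are thin wrappers that funnel into kvm_set_phys_mem, and QEMU tracks each range it has registered with KVM in a small KVMSlot bookkeeping structure. A sketch based on the kvm-all.c of this era (the field comments are mine):

typedef struct KVMSlot {
    hwaddr start_addr;      /* guest physical start address */
    ram_addr_t memory_size; /* size in bytes; 0 marks the slot as free */
    void *ram;              /* host virtual address backing this range */
    int slot;               /* slot id handed to KVM */
    int flags;
} KVMSlot;

static void kvm_region_add(MemoryListener *listener,
                           MemoryRegionSection *section)
{
    kvm_set_phys_mem(section, true);    /* add the section */
}

static void kvm_region_del(MemoryListener *listener,
                           MemoryRegionSection *section)
{
    kvm_set_phys_mem(section, false);   /* remove the section */
}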
The function is fairly long, so we'll go through it in sections.
    KVMState *s = kvm_state;
    KVMSlot *mem, old;
    int err;
    MemoryRegion *mr = section->mr;
    bool log_dirty = memory_region_is_logging(mr);
    /* is the region writeable? */
    bool writeable = !mr->readonly && !mr->rom_device;
    bool readonly_flag = mr->readonly || memory_region_is_romd(mr);
    /* start offset of the section within its address space */
    hwaddr start_addr = section->offset_within_address_space;
    /* size of the section */
    ram_addr_t size = int128_get64(section->size);
    void *ram = NULL;
    unsigned delta;

    /* kvm works in page size chunks, but the function may be called
       with sub-page size and unaligned start address. */
    /* adjustment needed to page-align the range */
    delta = TARGET_PAGE_ALIGN(size) - size;
    if (delta > size) {
        return;
    }
    start_addr += delta;
    size -= delta;      /* this keeps size page aligned */
    size &= TARGET_PAGE_MASK;
    if (!size || (start_addr & ~TARGET_PAGE_MASK)) {
        return;
    }

    /* if the region is not RAM, it may only be mapped read-only */
    if (!memory_region_is_ram(mr)) {
        if (writeable || !kvm_readonly_mem_allowed) {
            return;
        } else if (!mr->romd_mode) {
            /* If the memory device is not in romd_mode, then we actually want
             * to remove the kvm memory slot so all accesses will trap. */
            add = false;
        }
    }
The first part is groundwork: it reads a few properties off the section's MR, such as writeable and readonly_flag, and extracts the section's start_addr and size. start_addr is the section's offset_within_address_space, which per the macro above is the FR's addr.start. Next, size is aligned to the page size. If the memory the MR is associated with is not backed as RAM, extra validation applies: if the region is writeable, or KVM does not support read-only memory, the function returns immediately; and if the ROM device is not in romd_mode, add is forced to false so the slot gets removed and all accesses trap.
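To see the alignment fix-up concretely, here is a small self-contained program that replays the arithmetic; the TARGET_PAGE_* macros below are simplified 4 KiB-page stand-ins, not QEMU's own definitions:

#include <stdio.h>
#include <stdint.h>

/* simplified stand-ins for QEMU's target-page macros, assuming 4 KiB pages */
#define TARGET_PAGE_SIZE     0x1000ULL
#define TARGET_PAGE_MASK     (~(TARGET_PAGE_SIZE - 1))
#define TARGET_PAGE_ALIGN(x) (((x) + TARGET_PAGE_SIZE - 1) & TARGET_PAGE_MASK)

int main(void)
{
    uint64_t start_addr = 0x10800; /* unaligned section start */
    uint64_t size       = 0x2800;  /* 10 KiB, not a page multiple */

    uint64_t delta = TARGET_PAGE_ALIGN(size) - size; /* 0x3000 - 0x2800 = 0x800 */
    start_addr += delta;           /* 0x11000, now page aligned */
    size -= delta;                 /* 0x2000 */
    size &= TARGET_PAGE_MASK;      /* still 0x2000: exactly two pages survive */

    printf("start=0x%llx size=0x%llx\n",
           (unsigned long long)start_addr, (unsigned long long)size);
    return 0;
}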
Next comes the heart of the function: turning the current section into a slot and registering it. Before that, however, any overlap between existing slots and the new slot must be resolved; if there is no overlap at all, things are easy and the slot is simply added. Enter the while loop:
    ram = memory_region_get_ram_ptr(mr) + section->offset_within_region + delta;

    /* handle overlapping slots */
    while (1) {
        /* find a slot overlapping the new range */
        mem = kvm_lookup_overlapping_slot(s, start_addr, start_addr + size);
        /* no overlap found: exit the loop and register the new slot */
        if (!mem) {
            break;
        }

        /* the range to add is already registered */
        if (add && start_addr >= mem->start_addr &&
            (start_addr + size <= mem->start_addr + mem->memory_size) &&
            (ram - start_addr == mem->ram - mem->start_addr)) {
            /* The new slot fits into the existing one and comes with
             * identical parameters - update flags and done. */
            kvm_slot_dirty_pages_log_change(mem, log_dirty);
            return;
        }

        old = *mem;

        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
            kvm_physical_sync_dirty_bitmap(section);
        }

        /* remove the overlapping part */
        /* unregister the overlapping slot */
        mem->memory_size = 0;
        err = kvm_set_user_memory_region(s, mem);
        if (err) {
            fprintf(stderr, "%s: error unregistering overlapping slot: %s\n",
                    __func__, strerror(-err));
            abort();
        }

        /* Workaround for older KVM versions: we can't join slots, even not by
         * unregistering the previous ones and then registering the larger
         * slot. We have to maintain the existing fragmentation. Sigh.
         *
         * This workaround assumes that the new slot starts at the same
         * address as the first existing one. If not or if some overlapping
         * slot comes around later, we will fail (not seen in practice so far)
         * - and actually require a recent KVM version. */
        /* if the existing slot is smaller than the requested size, add the
         * new range on top of the old layout instead of deleting the old
         * slot and re-adding a merged one */
        if (s->broken_set_mem_region &&
            old.start_addr == start_addr && old.memory_size < size && add) {
            mem = kvm_alloc_slot(s);
            mem->memory_size = old.memory_size;
            mem->start_addr = old.start_addr;
            mem->ram = old.ram;
            mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag);

            err = kvm_set_user_memory_region(s, mem);
            if (err) {
                fprintf(stderr, "%s: error updating slot: %s\n", __func__,
                        strerror(-err));
                abort();
            }

            start_addr += old.memory_size;
            ram += old.memory_size;
            size -= old.memory_size;
            continue;
        }

        /* register prefix slot */
        /* the new start_addr lies above old.start_addr: re-register the
         * leading part that the delete above took out */
        if (old.start_addr < start_addr) {
            mem = kvm_alloc_slot(s);
            mem->memory_size = start_addr - old.start_addr;
            mem->start_addr = old.start_addr;
            mem->ram = old.ram;
            mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag);

            err = kvm_set_user_memory_region(s, mem);
            if (err) {
                fprintf(stderr, "%s: error registering prefix slot: %s\n",
                        __func__, strerror(-err));
#ifdef TARGET_PPC
                fprintf(stderr, "%s: This is probably because your kernel's "
                        "PAGE_SIZE is too big. Please try to use 4k "
                        "PAGE_SIZE!\n", __func__);
#endif
                abort();
            }
        }

        /* register suffix slot */
        /* likewise for the trailing part beyond the end of the new range */
        if (old.start_addr + old.memory_size > start_addr + size) {
            ram_addr_t size_delta;

            mem = kvm_alloc_slot(s);
            mem->start_addr = start_addr + size;
            size_delta = mem->start_addr - old.start_addr;
            mem->memory_size = old.memory_size - size_delta;
            mem->ram = old.ram + size_delta;
            mem->flags = kvm_mem_flags(s, log_dirty, readonly_flag);

            err = kvm_set_user_memory_region(s, mem);
            if (err) {
                fprintf(stderr, "%s: error registering suffix slot: %s\n",
                        __func__, strerror(-err));
                abort();
            }
        }
    }
The loop first calls kvm_lookup_overlapping_slot to find a conflicting slot. Note that the result is slot-granular: any intersection between the two address ranges counts as a conflict, and the conflicting slot is returned. If there is none, the loop breaks and the new slot gets added directly. Otherwise the conflict is handled case by case. There are really two major cases:
1. The new slot is completely contained within the conflicting slot, with identical parameters.
2. The new slot and the conflicting slot only partially intersect.
For the first case, if the flags have changed, only the slot's flags are updated; otherwise nothing needs to change. For the second case, the original region must be deleted first, by setting mem->memory_size = 0 and then calling kvm_set_user_memory_region(). Because the new region and the deleted region do not coincide exactly but merely intersect, the delete also tears down mappings that should have survived, so the steps that follow patch those mappings back in piecewise.
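Before looking at that patch-up, it helps to see the lookup helper itself. The sketch below is kvm_lookup_overlapping_slot as it appears in the kvm-all.c of this era: it returns the overlapping slot with the lowest start address, so each pass of the while loop deals with the frontmost remaining conflict.

/* Find the overlapping slot with the lowest start address. */
static KVMSlot *kvm_lookup_overlapping_slot(KVMState *s,
                                            hwaddr start_addr,
                                            hwaddr end_addr)
{
    KVMSlot *found = NULL;
    int i;

    for (i = 0; i < ARRAY_SIZE(s->slots); i++) {
        KVMSlot *mem = &s->slots[i];

        /* skip free slots, and slots starting above one already found */
        if (mem->memory_size == 0 ||
            (found && found->start_addr < mem->start_addr)) {
            continue;
        }

        if (end_addr > mem->start_addr &&
            start_addr < mem->start_addr + mem->memory_size) {
            found = mem;
        }
    }

    return found;
}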

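Schematically, the prefix case looks like this (a redrawn stand-in for the original figure, assuming old.start_addr < start_addr):

old.start_addr   start_addr                  old end    start_addr + size
      |              |                          |               |
      v              v                          v               v
      +--------------+--------------------------+
      |   region1    |         overlap          |               <- old slot
      +--------------+--------------------------+---------------+
                     |              new region                  |   <- new slot
                     +------------------------------------------+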
In the sketch, the old region is the slot found to overlap the new region. The first delete removes the entire slot, innocently taking region1 out with it, so during re-mapping region1 has to be patched back in first; the overlap part is what actually needed to go, and once it is gone the new region can be mapped in directly. If the surplus lies past the end of the new range instead, the suffix slot handles it by the same reasoning. After the innocently-deleted pieces have been re-registered, kvm_set_user_memory_region is called to map the new slot in. That's the basic idea. Now let's look at the core function kvm_set_user_memory_region:
static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot)
{
    struct kvm_userspace_memory_region mem;

    mem.slot = slot->slot;
    mem.guest_phys_addr = slot->start_addr;
    mem.userspace_addr = (unsigned long)slot->ram;
    mem.flags = slot->flags;
    if (s->migration_log) {
        mem.flags |= KVM_MEM_LOG_DIRTY_PAGES;
    }
    if (slot->memory_size && mem.flags & KVM_MEM_READONLY) {
        /* Set the slot size to 0 before setting the slot to the desired
         * value. This is needed based on KVM commit 75d61fbc. */
        mem.memory_size = 0;
        kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
    }
    mem.memory_size = slot->memory_size;
    return kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
}
As you can see, the implementation is not complicated. It fills in a struct kvm_userspace_memory_region, which is essentially the argument handed over to KVM; since the two sides cannot share a stack, KVM must copy the structure into kernel space. The code itself holds no surprises, except that for read-only memory kvm_vm_ioctl has to be called twice, with the first call setting the slot's size to 0.
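For reference, the layout of that argument as defined in the KVM UAPI header include/uapi/linux/kvm.h (the per-field comments are mine):

/* for KVM_SET_USER_MEMORY_REGION */
struct kvm_userspace_memory_region {
    __u32 slot;            /* slot id, chosen by userspace */
    __u32 flags;           /* KVM_MEM_LOG_DIRTY_PAGES, KVM_MEM_READONLY */
    __u64 guest_phys_addr; /* guest physical address the region starts at */
    __u64 memory_size;     /* bytes; 0 deletes the slot */
    __u64 userspace_addr;  /* start of the userspace allocated memory */
};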
As for why it is done this way, we'll come back to that when we analyze the KVM side.
On the receiving end, KVM handles the request in kvm_vm_ioctl():
……
    case KVM_SET_USER_MEMORY_REGION: {
        struct kvm_userspace_memory_region kvm_userspace_mem;

        r = -EFAULT;
        if (copy_from_user(&kvm_userspace_mem, argp,
                           sizeof(kvm_userspace_mem)))
            goto out;
        kvm_userspace_mem.flags |= 0x1;  /* bit 0 is KVM_MEM_LOG_DIRTY_PAGES */
        r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
        break;
    }
……
As you can see, the first task is to copy the argument into the kernel, after which kvm_vm_ioctl_set_memory_region() is called.
int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
                                   struct kvm_userspace_memory_region *mem)
{
    if (mem->slot >= KVM_USER_MEM_SLOTS)
        return -EINVAL;
    return kvm_set_memory_region(kvm, mem);
}
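Inside the kernel, each slot is tracked as a struct kvm_memory_slot; this is the definition from include/linux/kvm_host.h in linux 3.10 (the field comments are mine):

struct kvm_memory_slot {
    gfn_t base_gfn;                   /* first guest frame number */
    unsigned long npages;             /* number of pages */
    unsigned long *dirty_bitmap;      /* one bit per page when logging */
    struct kvm_arch_memory_slot arch; /* arch-specific data, e.g. rmaps */
    unsigned long userspace_addr;     /* backing host virtual address */
    u32 flags;
    short id;
};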
The function checks the slot number: if it is out of range, the slot simply cannot be added; otherwise it calls kvm_set_memory_region(). That function does nothing more than take kvm->slots_lock and invoke __kvm_set_memory_region(), which is fairly long, so again we'll walk it in sections. It opens with some routine checks:
    if (mem->memory_size & (PAGE_SIZE - 1))
        goto out;
    if (mem->guest_phys_addr & (PAGE_SIZE - 1))
        goto out;
    /* We can read the guest memory with __xxx_user() later on. */
    if ((mem->slot < KVM_USER_MEM_SLOTS) &&
        ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
         !access_ok(VERIFY_WRITE,
                    (void *)(unsigned long)mem->userspace_addr,
                    mem->memory_size)))
        goto out;
    if (mem->slot >= KVM_MEM_SLOTS_NUM)
        goto out;
    if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
        goto out;
If memory_size is not page aligned, fail. If the guest physical address of mem is not page aligned, fail as well. If the slot id is in the valid user range but the userspace address is not page aligned, or the address range cannot be accessed normally, fail. If the slot id is greater than or equal to KVM_MEM_SLOTS_NUM, fail. The final if guards against unsigned overflow: guest_phys_addr and memory_size are both u64 values, so if their sum wraps around past 2^64 it comes out smaller than guest_phys_addr, and such a bogus region must be rejected.
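A small self-contained demonstration of the wraparound this check rejects (the values are hypothetical, picked only to force the overflow):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t guest_phys_addr = 0xffffffffffff0000ULL;
    uint64_t memory_size     = 0x20000ULL; /* 128 KiB */

    uint64_t end = guest_phys_addr + memory_size; /* wraps past 2^64 */

    /* end == 0x10000, smaller than guest_phys_addr, so the check fires */
    printf("end=0x%llx overflowed=%d\n",
           (unsigned long long)end, end < guest_phys_addr);
    return 0;
}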
    /* locate the slot with the given id */
    slot = id_to_memslot(kvm->memslots, mem->slot);
    base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
    npages = mem->memory_size >> PAGE_SHIFT;

    r = -EINVAL;
    if (npages > KVM_MEM_MAX_NR_PAGES)
        goto out;

    /* an empty region cannot have dirty logging enabled */
    if (!npages)
        mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;

    /* new is built from the userspace request; old is the slot that
     * currently carries the same id */
    new = old = *slot;

    new.id = mem->slot;
    new.base_gfn = base_gfn;
    new.npages = npages;
    new.flags = mem->flags;

    r = -EINVAL;
    /* the new slot is non-empty */
    if (npages) {
        /* no slot with this id yet: create a new one */
        if (!old.npages)
            change = KVM_MR_CREATE;
        /* otherwise modify the existing one */
        else { /* Modify an existing slot. */
            if ((mem->userspace_addr != old.userspace_addr) ||
                (npages != old.npages) ||
                ((new.flags ^ old.flags) & KVM_MEM_READONLY))
                goto out;

            /* the two slots map different guest-physical bases */
            if (base_gfn != old.base_gfn)
                change = KVM_MR_MOVE;
            /* only the flags differ: just update them */
            else if (new.flags != old.flags)
                change = KVM_MR_FLAGS_ONLY;
            else { /* Nothing to change. */
                /* everything identical: nothing to do */
                r = 0;
                goto out;
            }
        }
    /* new is empty but old is not: delete the existing slot */
    } else if (old.npages) {
        change = KVM_MR_DELETE;
    } else /* Modify a non-existent slot: disallowed. */
        goto out;
Once all the checks pass, the slot id passed in is first used to locate the corresponding slot structure in the slot array KVM maintains; this structure may be empty or hold an old slot. Then the base guest frame number and the page count are computed. If the page count exceeds KVM_MEM_MAX_NR_PAGES, fail; if npages is 0, the KVM_MEM_LOG_DIRTY_PAGES flag is cleared. The new slot is initialized from the old slot's contents and then overwritten with the parameters received from userspace in kvm_userspace_memory_region. Then execution enters the if cascade below:
1. If npages is non-zero, this request registers a slot. If the old slot's npages is 0, there was no slot here before and a new one must be added: change is set to KVM_MR_CREATE. If it is non-zero, an existing slot is being modified; note that in this case the old and new slots must have the same page count and userspace address, and their readonly attribute must match, otherwise the call fails. Given that, if the mapped base frame numbers differ, change is set to KVM_MR_MOVE; if only the flags differ, to KVM_MR_FLAGS_ONLY; otherwise nothing is done.
2. If npages is 0 while old.npages is not, the old slot needs to be deleted, and change is set to KVM_MR_DELETE. Everything up to this point is preparation, pinning down which operation userspace asked for; next the concrete actions are carried out.
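The four possible actions are enumerated as enum kvm_mr_change in include/linux/kvm_host.h (the comments are mine):

enum kvm_mr_change {
    KVM_MR_CREATE,     /* register a brand-new slot */
    KVM_MR_DELETE,     /* drop an existing slot */
    KVM_MR_MOVE,       /* same size and HVA, different base_gfn */
    KVM_MR_FLAGS_ONLY, /* only the flags changed */
};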
    if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
        /* Check for overlaps */
        r = -EEXIST;
        kvm_for_each_memslot(slot, kvm->memslots) {
            if ((slot->id >= KVM_USER_MEM_SLOTS) ||
                (slot->id == mem->slot))
                continue;
            if (!((base_gfn + npages <= slot->base_gfn) ||
                  (base_gfn >= slot->base_gfn + slot->npages)))
                goto out;
        }
    }

    /* Free page dirty bitmap if unneeded */
    if (!(new.flags & KVM_MEM_LOG_DIRTY_PAGES))
        new.dirty_bitmap = NULL;

    r = -ENOMEM;
    if (change == KVM_MR_CREATE) {
        new.userspace_addr = mem->userspace_addr;

        if (kvm_arch_create_memslot(&new, npages))
            goto out_free;
    }

    /* Allocate page dirty bitmap if needed */
    if ((new.flags & KVM_MEM_LOG_DIRTY_PAGES) && !new.dirty_bitmap) {
        if (kvm_create_dirty_bitmap(&new) < 0)
            goto out_free;
    }

    /* userspace asked to delete or move the slot */
    if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) {
        r = -ENOMEM;
        slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
                        GFP_KERNEL);
        if (!slots)
            goto out_free;

        /* locate the slot by id ... */
        slot = id_to_memslot(slots, mem->slot);
        /* ... and mark it invalid first */
        slot->flags |= KVM_MEMSLOT_INVALID;

        old_memslots = install_new_memslots(kvm, slots, NULL);

        /* slot was deleted or moved, clear iommu mapping */
        kvm_iommu_unmap_pages(kvm, &old);
        /* From this point no new shadow pages pointing to a deleted,
         * or moved, memslot will be created.
         *
         * validation of sp->gfn happens in:
         *  - gfn_to_hva (kvm_read_guest, gfn_to_pfn)
         *  - kvm_is_visible_gfn (mmu_check_roots)
         */
        kvm_arch_flush_shadow_memslot(kvm, slot);
        slots = old_memslots;
    }

    r = kvm_arch_prepare_memory_region(kvm, &new, mem, change);
    if (r)
        goto out_slots;

    r = -ENOMEM;
    /*
     * We can re-use the old_memslots from above, the only difference
     * from the currently installed memslots is the invalid flag. This
     * will get overwritten by update_memslots anyway.
     */
    if (!slots) {
        slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
                        GFP_KERNEL);
        if (!slots)
            goto out_free;
    }

    /*
     * IOMMU mapping: New slots need to be mapped. Old slots need to be
     * un-mapped and re-mapped if their base changes. Since base change
     * unmapping is handled above with slot deletion, mapping alone is
     * needed here. Anything else the iommu might care about for existing
     * slots (size changes, userspace addr changes and read-only flag
     * changes) is disallowed above, so any other attribute changes getting
     * here can be skipped.
     */
    if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
        r = kvm_iommu_map_pages(kvm, &new);
        if (r)
            goto out_slots;
    }

    /* actual memory is freed via old in kvm_free_physmem_slot below */
    if (change == KVM_MR_DELETE) {
        new.dirty_bitmap = NULL;
        memset(&new.arch, 0, sizeof(new.arch));
    }

    old_memslots = install_new_memslots(kvm, slots, &new);

    kvm_arch_commit_memory_region(kvm, mem, &old, change);

    kvm_free_physmem_slot(&old, &new);
    kfree(old_memslots);

    return 0;
Here the concrete handling keys off change. For KVM_MR_CREATE or KVM_MR_MOVE, the new range is first checked for overlap against every other slot, failing with -EEXIST on a conflict. For KVM_MR_CREATE, new.userspace_addr is set to the new userspace address. If the new slot requests KVM_MEM_LOG_DIRTY_PAGES but has no dirty_bitmap allocated yet, one is allocated for it. If change is KVM_MR_DELETE or KVM_MR_MOVE, two main things happen: first, the affected slot is marked KVM_MEMSLOT_INVALID and a fresh memslots copy is installed, which also bumps slots->generation and lets the slot's shadow page tables be flushed; second, the iommu mapping is revoked. Next comes kvm_arch_prepare_memory_region: for private slots (memslot->id >= KVM_USER_MEM_SLOTS) that are being created, it has to establish the mapping by hand.
After that, slots is guaranteed to be non-NULL. For KVM_MR_CREATE or KVM_MR_MOVE, the mapping must be (re)established with kvm_iommu_map_pages, whereas for KVM_MR_DELETE there is no point in new keeping a dirty_bitmap, so it is dropped and the arch field is zeroed. Every path ends in install_new_memslots; note that for a delete, new's memory size is 0. Let's see what that function does.
static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
                                                 struct kvm_memslots *slots,
                                                 struct kvm_memory_slot *new)
{
    struct kvm_memslots *old_memslots = kvm->memslots;

    update_memslots(slots, new, kvm->memslots->generation);
    rcu_assign_pointer(kvm->memslots, slots);
    synchronize_srcu_expedited(&kvm->srcu);

    kvm_arch_memslots_updated(kvm);

    return old_memslots;
}
The core operation here is update_memslots. When adding (create or move), new's memory size is necessarily non-zero, so new's id locates the corresponding slot in the slot array KVM maintains and new's contents are copied over the old slot in one go; if the page count changed, the array has to be re-sorted. When deleting, new's memory size is 0, so this effectively clears out a slot. After the update, kvm->memslots is switched over. That pointer is protected by the RCU mechanism, so it cannot be modified in place: the new array is prepared first and published through rcu_assign_pointer, followed by synchronize_srcu_expedited. Finally, the cached MMIO page-table entries are flushed.
void update_memslots(struct kvm_memslots *slots, struct kvm_memory_slot *new,
                     u64 last_generation)
{
    if (new) {
        int id = new->id;
        struct kvm_memory_slot *old = id_to_memslot(slots, id);
        unsigned long npages = old->npages;

        *old = *new;
        /* for a delete, new->npages is 0, so the array gets re-sorted */
        if (new->npages != npages)
            sort_memslots(slots);
    }

    slots->generation = last_generation + 1;
}
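sort_memslots keeps the array ordered by size, largest first, so scans over the slot array hit the big RAM slots early, and it rebuilds the id_to_index table that id_to_memslot relies on. From kvm_main.c in linux 3.10:

static int cmp_memslot(const void *slot1, const void *slot2)
{
    struct kvm_memory_slot *s1, *s2;

    s1 = (struct kvm_memory_slot *)slot1;
    s2 = (struct kvm_memory_slot *)slot2;

    /* descending order of npages */
    if (s1->npages < s2->npages)
        return 1;
    if (s1->npages > s2->npages)
        return -1;

    return 0;
}

/*
 * Sort the memslots base on its size, so the larger slots
 * will get better fit.
 */
static void sort_memslots(struct kvm_memslots *slots)
{
    int i;

    sort(slots->memslots, KVM_MEM_SLOTS_NUM,
         sizeof(struct kvm_memory_slot), cmp_memslot, NULL);

    for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
        slots->id_to_index[slots->memslots[i].id] = i;
}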
References:
linux 3.10.1 source code
