Previous articles in this series:
"OpenStack Virtual Machine Disk File Types and Storage Methods"
"Libvirt Live Migration and the Pre-Copy Implementation"
"OpenStack VM Cold/Live Migration: Practice and Flow Analysis"
With the groundwork of the articles above in place, we finally reach the implementation. By analyzing the code we can see through to the essence of OpenStack VM migration.
Cold migration code analysis (based on Newton)
How Nova implements cold migration
- Decide whether the request is a Resize or a Cold Migrate by whether a new Flavor was passed in
- Fetch the instance's network info (network_info)
- Fetch the instance's block device info (block_device_info)
- Fetch the instance's power-off timeout and retry interval
- Power off the instance
- Migrate the instance's local disk files
- Migrate the instance's shared block devices
- Migrate the instance's network
- Update the instance's host record and status
NOTE: block_device_info does not hold only OpenStack block device (Volume) information; it holds the instance's block device information in general, i.e. its disks (image- as well as volume-backed). Being unclear on this makes the code easy to misread.
MariaDB [nova]> select device_name,destination_type,device_type,source_type,image_id from block_device_mapping where instance_uuid="1935fcf7-ba9b-437c-a7d3-5d54c6d0d6d3";
+-------------+------------------+-------------+-------------+--------------------------------------+
| device_name | destination_type | device_type | source_type | image_id |
+-------------+------------------+-------------+-------------+--------------------------------------+
| /dev/vda | local | disk | image | 0aff2888-47f8-4133-928a-9c54414b3afb |
+-------------+------------------+-------------+-------------+--------------------------------------+
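To make the NOTE above concrete, here is a minimal sketch of deciding boot-from-volume from rows like the one in this table. Plain dicts stand in for the BDM records, and the helper is a simplification written for this article, not Nova's actual `_is_booted_from_volume`, though the idea is the same:

```python
def is_booted_from_volume(bdms, root_device='/dev/vda'):
    """Return True if the root disk comes from a Cinder volume rather
    than a Glance image (simplified; rows here are plain dicts)."""
    for bdm in bdms:
        if (bdm['device_name'] == root_device and
                bdm['destination_type'] == 'volume'):
            return True
    return False

# The row from the query above: the root disk is image-backed and local.
bdms = [{'device_name': '/dev/vda', 'destination_type': 'local',
         'source_type': 'image'}]
print(is_booted_from_volume(bdms))  # False: this instance boots from an image
```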
# nova/nova/api/openstack/compute/migrate_server.py
def _migrate(self, req, id, body):
"""Permit admins to migrate a server to a new host."""
...
# Check whether the user is authorized to perform the migrate operation
context.can(ms_policies.POLICY_ROOT % 'migrate')
# Fetch the instance resource model object
instance = common.get_instance(self.compute_api, context, id)
try:
# This actually calls the instance Resize API
self.compute_api.resize(req.environ['nova.context'], instance)
...
# nova/nova/compute/api.py
def resize(self, context, instance, flavor_id=None, clean_shutdown=True,
**extra_instance_updates):
"""Resize (ie, migrate) a running instance.

If flavor_id is None, the process is considered a migration,
keeping the original flavor_id. If flavor_id is not None, the
instance should be migrated to a new host and resized to the
new flavor_id.
"""
# As the docstring says, whether this is a Migrate or a Resize depends on whether a new Flavor was passed in
...
# Get the instance's current Flavor
current_instance_type = instance.get_flavor()
# If flavor_id is not provided, only migrate the instance.
if not flavor_id:
LOG.debug("flavor_id is None. Assuming migration.",
instance=instance)
# Keep the instance's Flavor unchanged across the migration
new_instance_type = current_instance_type
...
filter_properties = {'ignore_hosts': []}
# The allow_resize_to_same_host option decides whether a resize may land on the same compute node
# In practice, when migrating to the same node, nova-compute raises UnableToMigrateToSelf
# and the scheduler retries until a suitable node is found or it gives up, provided nova-scheduler enables RetryFilter
if not CONF.allow_resize_to_same_host:
filter_properties['ignore_hosts'].append(instance.host)
...
scheduler_hint = {'filter_properties': filter_properties}
self.compute_task_api.resize_instance(context, instance,
extra_instance_updates, scheduler_hint=scheduler_hint,
flavor=new_instance_type,
reservations=quotas.reservations or [],
clean_shutdown=clean_shutdown,
request_spec=request_spec)
# nova/compute/manager.py
def resize_instance(self, context, instance, image,
reservations, migration, instance_type,
clean_shutdown):
"""Starts the migration of a running instance to another host."""
...
# Get the instance's network info
network_info = self.network_api.get_instance_nw_info(context,
instance)
...
# Get the instance's disk info
bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
context, instance.uuid)
block_device_info = self._get_instance_block_device_info(
context, instance, bdms=bdms)
# Get the instance's power-off timeout and retry interval
timeout, retry_interval = self._get_power_off_values(context,
instance, clean_shutdown)
# Power off the instance and migrate its disk files
disk_info = self.driver.migrate_disk_and_power_off(
context, instance, migration.dest_host,
instance_type, network_info,
block_device_info,
timeout, retry_interval)
# Disconnect the instance's shared block devices
self._terminate_volume_connections(context, instance, bdms)
# Migrate the instance's network
migration_p = obj_base.obj_to_primitive(migration)
self.network_api.migrate_instance_start(context,
instance,
migration_p)
...
# Update the instance's host record
instance.host = migration.dest_compute
instance.node = migration.dest_node
instance.task_state = task_states.RESIZE_MIGRATED
instance.save(expected_task_state=task_states.RESIZE_MIGRATING)
...
# nova/nova/virt/libvirt/driver.py
def migrate_disk_and_power_off(self, context, instance, dest,
flavor, network_info,
block_device_info=None,
timeout=0, retry_interval=0):
# Get the ephemeral disk info
ephemerals = driver.block_device_info_get_ephemerals(block_device_info)
# Checks if the migration needs a disk resize down.
root_down = flavor.root_gb < instance.flavor.root_gb
ephemeral_down = flavor.ephemeral_gb < eph_size
# Check whether the instance is booted from a volume
booted_from_volume = self._is_booted_from_volume(block_device_info)
# Local disk files cannot be resized down
if (root_down and not booted_from_volume) or ephemeral_down:
reason = _("Unable to resize disk down.")
raise exception.InstanceFaultRollback(
exception.ResizeError(reason=reason))
# LVM-backed (images_type=lvm) instances that are not booted from volume cannot be migrated
# NOTE(dgenin): Migration is not implemented for LVM backed instances.
if CONF.libvirt.images_type == 'lvm' and not booted_from_volume:
reason = _("Migration is not supported for LVM backed instances")
raise exception.InstanceFaultRollback(
exception.MigrationPreCheckError(reason=reason))
# copy disks to destination
# rename instance dir to +_resize at first for using
# shared storage for instance dir (eg. NFS).
inst_base = libvirt_utils.get_instance_path(instance)
inst_base_resize = inst_base + "_resize"
# Check whether the instance directory is on shared storage
shared_storage = self._is_storage_shared_with(dest, inst_base)
# try to create the directory on the remote compute node
# if this fails we pass the exception up the stack so we can catch
# failures here earlier
if not shared_storage:
try:
# Non-shared storage: create the instance directory on the destination host over SSH
self._remotefs.create_dir(dest, inst_base)
except processutils.ProcessExecutionError as e:
reason = _("not able to execute ssh command: %s") % e
raise exception.InstanceFaultRollback(
exception.ResizeError(reason=reason))
# Power off the instance
self.power_off(instance, timeout, retry_interval)
# Disconnect the shared block devices
block_device_mapping = driver.block_device_info_get_mapping(
block_device_info)
for vol in block_device_mapping:
connection_info = vol['connection_info']
disk_dev = vol['mount_device'].rpartition("/")[2]
self._disconnect_volume(connection_info, disk_dev, instance)
# Read the disk.info file, which records the file paths of the
# Root Disk, Ephemeral Disk and Swap Disk
disk_info_text = self.get_instance_disk_info(
instance, block_device_info=block_device_info)
disk_info = jsonutils.loads(disk_info_text)
try:
# Move the instance directory aside (rename to *_resize) so it can be rolled back
utils.execute('mv', inst_base, inst_base_resize)
# if we are migrating the instance with shared storage then
# create the directory. If it is a remote node the directory
# has already been created
if shared_storage:
# Shared storage: treat the destination host as the local host
dest = None
# Shared storage: create the instance directory directly on the local file system
utils.execute('mkdir', '-p', inst_base)
...
active_flavor = instance.get_flavor()
# Block-migrate the instance's local disk files
for info in disk_info:
# assume inst_base == dirname(info['path'])
img_path = info['path']
fname = os.path.basename(img_path)
from_path = os.path.join(inst_base_resize, fname)
...
# We will not copy over the swap disk here, and rely on
# finish_migration/_create_image to re-create it for us.
if not (fname == 'disk.swap' and
active_flavor.get('swap', 0) != flavor.get('swap', 0)):
# Whether to enable compression
compression = info['type'] not in NO_COMPRESSION_TYPES
# Non-shared storage: remote copy via scp
# Shared storage: local copy via cp
libvirt_utils.copy_image(from_path, img_path, host=dest,
on_execute=on_execute,
on_completion=on_completion,
compression=compression)
# Ensure disk.info is written to the new path to avoid disks being
# reinspected and potentially changing format.
# Copy the disk.info file
src_disk_info_path = os.path.join(inst_base_resize, 'disk.info')
if os.path.exists(src_disk_info_path):
dst_disk_info_path = os.path.join(inst_base, 'disk.info')
libvirt_utils.copy_image(src_disk_info_path,
dst_disk_info_path,
host=dest, on_execute=on_execute,
on_completion=on_completion)
except Exception:
with excutils.save_and_reraise_exception():
self._cleanup_remote_migration(dest, inst_base,
inst_base_resize,
shared_storage)
return disk_info_text
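The shared-storage check used above (`_is_storage_shared_with`) deserves a closer look. A rough sketch of the technique, not Nova's actual code: the source host drops a uniquely named token file into the instance directory, then asks the destination host (over SSH in Nova; a plain callable here) whether it can see the same file:

```python
import os
import tempfile
import uuid

def is_storage_shared(instance_dir, dest_sees):
    """Sketch of the shared-storage probe: create a token file in the
    instance directory on the source, then check whether the
    destination can see it. dest_sees stands in for the remote
    existence check Nova performs over SSH."""
    token = os.path.join(instance_dir, 'check-%s' % uuid.uuid4().hex)
    open(token, 'w').close()
    try:
        return dest_sees(token)
    finally:
        os.unlink(token)

inst_base = tempfile.mkdtemp()
# On NFS-style shared storage both hosts see the same directory tree:
print(is_storage_shared(inst_base, os.path.exists))      # True
# With local disks the destination cannot see the token file:
print(is_storage_shared(inst_base, lambda path: False))  # False
```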
Live migration code analysis
How Nova implements live migration
In "Libvirt Live Migration and the Pre-Copy Implementation" we went over how Libvirt Live Migration works and how KVM Pre-Copy live migration is implemented. Put simply, there are three stages:
- Stage 1: mark all of the VM's RAM as dirty memory.
- Stage 2: migrate all dirty memory, then recompute the newly dirtied memory, and iterate until some exit condition is met, e.g. the amount of dirty memory falls below a low watermark.
- Stage 3: stop the GuestOS and migrate the remaining dirty memory along with the VM's device state.
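The three stages can be sketched as a toy simulation. This is pure illustration written for this article, not Nova or libvirt code; the page counts and dirty rate are made up:

```python
def pre_copy_migrate(total_pages, dirty_rate, low_watermark, max_iters=30):
    """Toy model of Pre-Copy: Stage 1 marks all RAM dirty, Stage 2
    iteratively copies dirty pages while the guest keeps dirtying
    dirty_rate pages per round, Stage 3 stops the guest and copies
    whatever is left."""
    dirty = total_pages              # Stage 1: everything starts dirty
    copied = 0
    for _ in range(max_iters):       # Stage 2: iterate
        copied += dirty
        dirty = dirty_rate           # guest dirties pages during the copy
        if dirty <= low_watermark:   # exit condition: low watermark reached
            break
    copied += dirty                  # Stage 3: final stop-and-copy
    return copied, dirty

copied, final_dirty = pre_copy_migrate(10000, 50, 100)
print(final_dirty)  # 50: only the last dirty pages move during downtime
```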
As one would expect, the crucial part is Stage 2, i.e. the implementation of the exit condition. Libvirt's early built-in exit conditions were:
- 50% or less of the dirty memory remains to be migrated
- no second iteration is needed, or more than 30 iterations have run
- a dynamically configured max downtime
- a source-host policy (e.g. the source host shuts down in 5 minutes, so all of its VMs must be migrated immediately)
The exit condition Nova chose is the dynamically configured max downtime. On every iteration, Libvirt Pre-Copy live migration recomputes the VM's newly dirtied memory and estimates bandwidth from the time the iteration took; from the bandwidth and the current dirty page count it derives the time needed to transfer the remaining data. That time is the downtime. If the downtime falls within the administrator-configured Live Migration Max Downtime, the loop exits and Stage 3 begins.
NOTE: Live Migration Max Downtime (in ms) is the allowed duration for which the VM's data stays frozen, i.e. the tolerable service-interruption window, usually small enough to ignore. It can be set via the nova.conf option CONF.libvirt.live_migration_downtime.
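The exit test described above can be sketched as follows. This is a simplified model written for this article, not Nova's or libvirt's actual code, and the numbers are illustrative:

```python
def estimated_downtime_ms(dirty_bytes, copied_bytes, iter_seconds):
    """Estimate the downtime of the next stop-and-copy: bandwidth is
    inferred from the last iteration, and downtime is the time needed
    to push the remaining dirty data at that bandwidth."""
    bandwidth = copied_bytes / iter_seconds   # bytes per second
    return dirty_bytes / bandwidth * 1000.0   # milliseconds

# 200 MiB still dirty; the last iteration moved 1 GiB in 8 s (~128 MiB/s):
dt = estimated_downtime_ms(200 * 2**20, 1024 * 2**20, 8.0)
print(int(dt))  # 1562 ms -> keep iterating if max downtime is, say, 500 ms
```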
Note that the dynamically configured downtime exit condition has a catch: if the VM stays under heavy load (constantly producing new dirty memory), every iteration has a large amount of data to move, and the downtime may never fall into the exit range. So be prepared: a live migration can be a long process. For this situation Libvirt introduced some new features:
- Auto-converge mode: if the VM stays under heavy load, libvirtd automatically throttles its vCPUs to reduce the load and slow the growth of dirty memory, so that the downtime can reach the exit range.
Besides Pre-Copy, Libvirt also supports Post-Copy mode. The former requires all data to be copied before the VM switches over to the destination host; Post-Copy instead prioritizes switching to the destination as early as possible and copies memory afterwards. Post-Copy first transfers the VM's device state and a small portion (around 10%) of the dirty memory to the destination, then switches the VM over to run there. When the GuestOS touches a memory page that is not yet present, a remote page fault is triggered and the page is pulled from the source host. Post-Copy clearly has drawbacks: if either host crashes, fails, or loses network connectivity, the whole VM is broken. Post-Copy is therefore not recommended for critical workloads; it can be toggled with the nova.conf option live_migration_permit_post_copy.
Moreover, the Libvirt Live Migration control model Nova uses is direct client control, so Nova, as the Libvirt client, has to poll libvirtd for the data-migration status and use it to steer the migration. Hence Nova also implements a migration monitoring mechanism of its own.
In short, Nova's implementation on top of Libvirt Live Migration boils down to two things:
- acting as the Libvirt client that issues the Live Migration command to the source host's libvirtd
- a data-migration monitoring mechanism
# nova/api/openstack/compute/migrate_server.py
def _migrate_live(self, req, id, body):
"""Permit admins to (live) migrate a server to a new host."""
...
# Whether to perform a block migration
block_migration = body["os-migrateLive"]["block_migration"]
...
# Whether to run asynchronously
async = api_version_request.is_supported(req, min_version='2.34')
...
# Whether to force the migration
force = self._get_force_param_for_live_migration(body, host)
...
# Whether disk overcommit is allowed
disk_over_commit = body["os-migrateLive"]["disk_over_commit"]
...
self.compute_api.live_migrate(context, instance, block_migration,
disk_over_commit, host, force, async)
...
# nova/nova/compute/api.py
def live_migrate(self, context, instance, block_migration,
disk_over_commit, host_name, force=None, async=False):
"""Migrate a server lively to a new host."""
...
# NOTE(sbauza): Force is a boolean by the new related API version
if force is False and host_name:
...
# Not forced: set the destination host explicitly
destination = objects.Destination(
host=target.host,
node=target.hypervisor_hostname
)
request_spec.requested_destination = destination
...
self.compute_task_api.live_migrate_instance(context, instance,
host_name, block_migration=block_migration,
disk_over_commit=disk_over_commit,
request_spec=request_spec, async=async)
# nova/nova/conductor/manager.py
def _live_migrate(self, context, instance, scheduler_hint,
block_migration, disk_over_commit, request_spec):
# Get the destination host
destination = scheduler_hint.get("host")
...
task = self._build_live_migrate_task(context, instance, destination,
block_migration, disk_over_commit,
migration, request_spec)
...
task.execute()
...
# nova/nova/conductor/tasks/live_migrate.py
class LiveMigrationTask(base.TaskBase):
...
def _execute(self):
# Check that the instance is active
self._check_instance_is_active()
# Check that the source host's service is up
self._check_host_is_up(self.source)
# If no destination host was specified, have the scheduler pick one
if not self.destination:
self.destination = self._find_destination()
self.migration.dest_compute = self.destination
self.migration.save()
else:
# Check that the destination differs from the source host
# Check that the destination host's service is up
# Check that the destination host has enough free memory
# Check that source and destination run the same hypervisor type
# Check that the destination host can accept the live migration
self._check_requested_destination()
# TODO(johngarbutt) need to move complexity out of compute manager
# TODO(johngarbutt) disk_over_commit?
return self.compute_rpcapi.live_migration(self.context,
host=self.source,
instance=self.instance,
dest=self.destination,
block_migration=self.block_migration,
migration=self.migration,
migrate_data=self.migrate_data)
# nova/compute/manager.py
def live_migration(self, context, dest, instance, block_migration,
migration, migrate_data):
...
# Set migration status to 'queued'
self._set_migration_status(migration, 'queued')
def dispatch_live_migration(*args, **kwargs):
with self._live_migration_semaphore:
self._do_live_migration(*args, **kwargs)
# Spawn a queued live-migration task
utils.spawn_n(dispatch_live_migration,
context, dest, instance,
block_migration, migration,
migrate_data)
def _do_live_migration(self, context, dest, instance, block_migration,
migration, migrate_data):
...
# Set migration status to 'preparing'
self._set_migration_status(migration, 'preparing')
got_migrate_data_object = isinstance(migrate_data,
migrate_data_obj.LiveMigrateData)
if not got_migrate_data_object:
migrate_data = \
migrate_data_obj.LiveMigrateData.detect_implementation(
migrate_data)
try:
if ('block_migration' in migrate_data and
migrate_data.block_migration):
# Block migration: get the local disk info recorded in disk.info
block_device_info = self._get_instance_block_device_info(
context, instance)
disk = self.driver.get_instance_disk_info(
instance, block_device_info=block_device_info)
else:
disk = None
# Have the destination host do its pre-live-migration preparation
migrate_data = self.compute_rpcapi.pre_live_migration(
context, instance,
block_migration, disk, dest, migrate_data)
...
# Set migration status to 'running'
self._set_migration_status(migration, 'running')
...
self.driver.live_migration(context, instance, dest,
self._post_live_migration,
self._rollback_live_migration,
block_migration, migrate_data)
...
# nova/nova/virt/libvirt/driver.py
def _live_migration(self, context, instance, dest, post_method,
recover_method, block_migration,
migrate_data):
...
# Get the nova.virt.libvirt.guest.Guest object
guest = self._host.get_guest(instance)
disk_paths = []
device_names = []
if migrate_data.block_migration:
# Block migration: collect the local disk file paths
# Without block migration, only memory data is transferred
# e.g. /var/lib/nova/instances/bf6824e9-1dac-466c-ab53-69f82d8adf73/disk
disk_paths, device_names = self._live_migration_copy_disk_paths(
context, instance, guest)
# Spawn the live-migration worker function
opthread = utils.spawn(self._live_migration_operation,
context, instance, dest,
block_migration,
migrate_data, guest,
device_names)
...
# Monitor libvirtd's data-migration progress
self._live_migration_monitor(context, instance, guest, dest,
post_method, recover_method,
block_migration, migrate_data,
finish_event, disk_paths)
...
Issuing the Live Migration command to libvirtd
def _live_migration_operation(self, context, instance, dest,
block_migration, migrate_data, guest,
device_names):
...
# Get the live migration URI
migrate_uri = None
if ('target_connect_addr' in migrate_data and
migrate_data.target_connect_addr is not None):
dest = migrate_data.target_connect_addr
if (migration_flags &
libvirt.VIR_MIGRATE_TUNNELLED == 0):
migrate_uri = self._migrate_uri(dest)
# Get the GuestOS XML
new_xml_str = None
params = None
if (self._host.is_migratable_xml_flag() and (
listen_addrs or migrate_data.bdms)):
new_xml_str = libvirt_migrate.get_updated_guest_xml(
# TODO(sahid): It's not a really well idea to pass
# the method _get_volume_config and we should to find
# a way to avoid this in future.
guest, migrate_data, self._get_volume_config)
...
# Call the wrapper around libvirt.virDomain.migrate to issue
# the Live Migration command to libvirtd
guest.migrate(self._live_migration_uri(dest),
migrate_uri=migrate_uri,
flags=migration_flags,
params=params,
domain_xml=new_xml_str,
bandwidth=CONF.libvirt.live_migration_bandwidth)
...
The migration entry point in the Libvirt Python client is libvirt.virDomain.migrate:
migrate(self, dconn, flags, dname, uri, bandwidth) method of libvirt.virDomain instance
    Migrate the domain object from its current host to the destination
    host given by dconn (a connection to the destination host).
The Nova Libvirt Driver wraps libvirt.virDomain.migrate:
# nova/virt/libvirt/guest.py
def migrate(self, destination, migrate_uri=None, params=None, flags=0,
domain_xml=None, bandwidth=0):
"""Migrate guest object from its current host to the destination """
if domain_xml is None:
self._domain.migrateToURI(
destination, flags=flags, bandwidth=bandwidth)
else:
if params:
...
if migrate_uri:
# In migrateToURI3 this parameter is searched in
# the `params` dict
params['migrate_uri'] = migrate_uri
params['bandwidth'] = bandwidth
self._domain.migrateToURI3(
destination, params=params, flags=flags)
else:
self._domain.migrateToURI2(
destination, miguri=migrate_uri, dxml=domain_xml,
flags=flags, bandwidth=bandwidth)
The details of a Libvirt migration are configured through flags:
- VIR_MIGRATE_LIVE – Do not pause the VM during migration
- VIR_MIGRATE_PEER2PEER – Direct connection between source & destination hosts
- VIR_MIGRATE_TUNNELLED – Tunnel migration data over the libvirt RPC channel
- VIR_MIGRATE_PERSIST_DEST – If the migration is successful, persist the domain on the destination host.
- VIR_MIGRATE_UNDEFINE_SOURCE – If the migration is successful, undefine the domain on the source host.
- VIR_MIGRATE_PAUSED – Leave the domain suspended on the remote side.
- VIR_MIGRATE_CHANGE_PROTECTION – Protect against domain configuration changes during the migration process (set automatically when supported).
- VIR_MIGRATE_UNSAFE – Force migration even if it is considered unsafe.
- VIR_MIGRATE_OFFLINE – Migrate offline.
These flags are set via the nova.conf option live_migration_flag, e.g.
live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED
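Nova splits the comma-separated option and ORs the matching libvirt constants into the bitmask passed to guest.migrate(). A standalone sketch; the bit values below are copied from libvirt's libvirt-domain.h (verify against your libvirt version), and the parser itself is a simplification written for this article:

```python
# Flag bit values as defined by libvirt (subset).
FLAGS = {
    'VIR_MIGRATE_LIVE': 1 << 0,
    'VIR_MIGRATE_PEER2PEER': 1 << 1,
    'VIR_MIGRATE_TUNNELLED': 1 << 2,
    'VIR_MIGRATE_PERSIST_DEST': 1 << 3,
    'VIR_MIGRATE_UNDEFINE_SOURCE': 1 << 4,
    'VIR_MIGRATE_PAUSED': 1 << 5,
    'VIR_MIGRATE_CHANGE_PROTECTION': 1 << 8,
    'VIR_MIGRATE_UNSAFE': 1 << 9,
    'VIR_MIGRATE_OFFLINE': 1 << 10,
}

def parse_migration_flags(option):
    """Turn the live_migration_flag string into a migration bitmask."""
    mask = 0
    for name in option.split(','):
        mask |= FLAGS[name.strip()]
    return mask

conf = ('VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, '
        'VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED')
print(parse_migration_flags(conf))  # 23 (0b10111 = 16 | 2 | 1 | 4)
```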
Monitoring libvirtd's data migration status
# nova/nova/virt/libvirt/driver.py
def _live_migration_monitor(self, context, instance, guest,
dest, post_method,
recover_method, block_migration,
migrate_data, finish_event,
disk_paths):
# Get the total amount of data to migrate: RAM plus local disk files
# data_gb: total GB of RAM and disk to transfer
data_gb = self._live_migration_data_gb(instance, disk_paths)
# e.g. downtime_steps = [(0, 46), (300, 47), (600, 48), (900, 51), (1200, 57), (1500, 66), (1800, 84), (2100, 117), (2400, 179), (2700, 291), (3000, 500)]
# downtime_steps is derived by an algorithm whose inputs are:
# data_gb
# CONF.libvirt.live_migration_downtime
# CONF.libvirt.live_migration_downtime_steps
# CONF.libvirt.live_migration_downtime_delay
# What downtime_steps means:
# each tuple is one Step; the downtime is handed to libvirtd over several Steps
# (delay, downtime), i.e. (interval before the next hand-off, downtime value handed off)
# the final Step hands off the full CONF.libvirt.live_migration_downtime
# If the downtime libvirtd computes in its latest iteration falls within the handed-off value, the exit condition is met
# NOTE: the max downtime of each Step keeps growing, up to the user-configured maximum tolerable downtime;
# Nova keeps probing for the smallest workable max downtime so the migration can exit as early as possible.
downtime_steps = list(self._migration_downtime_steps(data_gb))
...
# Poll counter
n = 0
# Monitoring start time
start = time.time()
progress_time = start
# progress_watermark records the remaining data seen at the last check; while data is moving, the watermark keeps falling
progress_watermark = None
# Whether Post-Copy mode is enabled
is_post_copy_enabled = self._is_post_copy_enabled(migration_flags)
while True:
# Get the live migration job info
info = guest.get_job_info()
...
elif info.type == libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
# Migration is still running
#
# This is where we wire up calls to change live
# migration status. eg change max downtime, cancel
# the operation, change max bandwidth
libvirt_migrate.run_tasks(guest, instance,
self.active_migrations,
on_migration_failure,
migration,
is_post_copy_enabled)
now = time.time()
elapsed = now - start
if ((progress_watermark is None) or
(progress_watermark == 0) or
(progress_watermark > info.data_remaining)):
progress_watermark = info.data_remaining
progress_time = now
# progress_timeout guards against a transfer that stalls (e.g. because libvirtd misbehaves):
# once the stall lasts longer than progress_timeout, the migration is aborted
progress_timeout = CONF.libvirt.live_migration_progress_timeout
# completion_timeout guards against libvirtd staying in the migrating state for too long,
# e.g. because bandwidth is too low, which could congest the management network;
# it counts from the first poll, and once it is exceeded the migration is aborted
completion_timeout = int(
CONF.libvirt.live_migration_completion_timeout * data_gb)
# Decide whether the migration should be aborted
if libvirt_migrate.should_abort(instance, now, progress_time,
progress_timeout, elapsed,
completion_timeout,
migration.status):
try:
guest.abort_job()
except libvirt.libvirtError as e:
LOG.warning(_LW("Failed to abort migration %s"),
e, instance=instance)
self._clear_empty_migration(instance)
raise
# Decide whether to switch to Post-Copy mode
if (is_post_copy_enabled and
libvirt_migrate.should_switch_to_postcopy(
info.memory_iteration, info.data_remaining,
previous_data_remaining, migration.status)):
# Switch to Post-Copy
libvirt_migrate.trigger_postcopy_switch(guest,
instance,
migration)
previous_data_remaining = info.data_remaining
# Dynamically hand the next Max Downtime Step to libvirtd
curdowntime = libvirt_migrate.update_downtime(
guest, instance, curdowntime,
downtime_steps, elapsed)
if (n % 10) == 0:
remaining = 100
if info.memory_total != 0:
# Compute the percentage of data remaining
remaining = round(info.memory_remaining *
100 / info.memory_total)
libvirt_migrate.save_stats(instance, migration,
info, remaining)
# Log at info level every 60 polls
# and at debug level every 10 polls
lg = LOG.debug
if (n % 60) == 0:
lg = LOG.info
# Log how many seconds the migration has run, the memory remaining, and the progress
lg(_LI("Migration running for %(secs)d secs, "
"memory %(remaining)d%% remaining; "
"(bytes processed=%(processed_memory)d, "
"remaining=%(remaining_memory)d, "
"total=%(total_memory)d)"),
{"secs": n / 2, "remaining": remaining,
"processed_memory": info.memory_processed,
"remaining_memory": info.memory_remaining,
"total_memory": info.memory_total}, instance=instance)
if info.data_remaining > progress_watermark:
lg(_LI("Data remaining %(remaining)d bytes, "
"low watermark %(watermark)d bytes "
"%(last)d seconds ago"),
{"remaining": info.data_remaining,
"watermark": progress_watermark,
"last": (now - progress_time)}, instance=instance)
n = n + 1
# Migration completed
elif info.type == libvirt.VIR_DOMAIN_JOB_COMPLETED:
# Migration is all done
LOG.info(_LI("Migration operation has completed"),
instance=instance)
post_method(context, instance, dest, block_migration,
migrate_data)
break
# Migration failed
elif info.type == libvirt.VIR_DOMAIN_JOB_FAILED:
# Migration did not succeed
LOG.error(_LE("Migration operation has aborted"),
instance=instance)
libvirt_migrate.run_recover_tasks(self._host, guest, instance,
on_migration_failure)
recover_method(context, instance, dest, block_migration,
migrate_data)
break
# Migration cancelled
elif info.type == libvirt.VIR_DOMAIN_JOB_CANCELLED:
# Migration was stopped by admin
LOG.warning(_LW("Migration operation was cancelled"),
instance=instance)
libvirt_migrate.run_recover_tasks(self._host, guest, instance,
on_migration_failure)
recover_method(context, instance, dest, block_migration,
migrate_data, migration_status='cancelled')
break
else:
LOG.warning(_LW("Unexpected migration job type: %d"),
info.type, instance=instance)
time.sleep(0.5)
self._clear_empty_migration(instance)
def _live_migration_data_gb(self, instance, disk_paths):
'''Calculate total amount of data to be transferred

:param instance: the nova.objects.Instance being migrated
:param disk_paths: list of disk paths that are being migrated
    with instance

Calculates the total amount of data that needs to be transferred
during the live migration. The actual amount copied will be larger
than this, due to the guest OS continuing to dirty RAM while the
migration is taking place. So this value represents the minimal
data size possible.

:returns: data size to be copied in GB
'''
ram_gb = instance.flavor.memory_mb * units.Mi / units.Gi
if ram_gb < 2:
ram_gb = 2
disk_gb = 0
for path in disk_paths:
try:
size = os.stat(path).st_size
size_gb = (size / units.Gi)
if size_gb < 2:
size_gb = 2
disk_gb += size_gb
except OSError as e:
LOG.warning(_LW("Unable to stat %(disk)s: %(ex)s"),
{'disk': path, 'ex': e})
# Ignore error since we don't want to break
# the migration monitoring thread operation
# Return the combined data size of RAM plus disks
return ram_gb + disk_gb
def _migration_downtime_steps(data_gb):
'''Calculate downtime value steps and time between increases.

:param data_gb: total GB of RAM and disk to transfer

This looks at the total downtime steps and upper bound downtime
value and uses an exponential backoff. So initially max downtime
is increased by small amounts, and as time goes by it is increased
by ever larger amounts.

For example, with 10 steps, 30 second step delay, 3 GB of RAM and
400ms target maximum downtime, the downtime will be increased every
90 seconds in the following progression:

- 0 seconds -> set downtime to 37ms
- 90 seconds -> set downtime to 38ms
- 180 seconds -> set downtime to 39ms
- 270 seconds -> set downtime to 42ms
- 360 seconds -> set downtime to 46ms
- 450 seconds -> set downtime to 55ms
- 540 seconds -> set downtime to 70ms
- 630 seconds -> set downtime to 98ms
- 720 seconds -> set downtime to 148ms
- 810 seconds -> set downtime to 238ms
- 900 seconds -> set downtime to 400ms

This allows the guest a good chance to complete migration with a
small downtime value.
'''
# Config options control the details of the live migration
downtime = CONF.libvirt.live_migration_downtime
steps = CONF.libvirt.live_migration_downtime_steps
delay = CONF.libvirt.live_migration_downtime_delay
# TODO(hieulq): Need to move min/max value into the config option,
# currently oslo_config will raise ValueError instead of setting
# option value to its min/max.
if downtime < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN:
downtime = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN
if steps < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN:
steps = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN
if delay < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN:
delay = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN
delay = int(delay * data_gb)
offset = downtime / float(steps + 1)
base = (downtime - offset) ** (1 / float(steps))
for i in range(steps + 1):
yield (int(delay * i), int(offset + base ** i))
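Running the generator above with the docstring's example inputs (10 steps, 30-second delay, 3 GB of data, 400 ms target downtime) reproduces the documented progression. Here is a self-contained copy with the config values inlined as parameters, for experimentation:

```python
def migration_downtime_steps(data_gb, downtime=400, steps=10, delay=30):
    """Exponential-backoff downtime schedule, mirroring Nova's
    _migration_downtime_steps with the config values inlined."""
    delay = int(delay * data_gb)
    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))

steps = list(migration_downtime_steps(3))
print(steps[:3])  # [(0, 37), (90, 38), (180, 39)]
```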
# nova/nova/virt/libvirt/migration.py
def update_downtime(guest, instance,
olddowntime,
downtime_steps, elapsed):
"""Update max downtime if needed

:param guest: a nova.virt.libvirt.guest.Guest to set downtime for
:param instance: a nova.objects.Instance
:param olddowntime: current set downtime, or None
:param downtime_steps: list of downtime steps
:param elapsed: total time of migration in secs

Determine if the maximum downtime needs to be increased based on
the downtime steps. Each element in the downtime steps list should
be a 2 element tuple. The first element contains a time marker and
the second element contains the downtime value to set when the
marker is hit.

The guest object will be used to change the current downtime value
on the instance.

Any errors hit when updating downtime will be ignored

:returns: the new downtime value
"""
LOG.debug("Current %(dt)s elapsed %(elapsed)d steps %(steps)s",
{"dt": olddowntime, "elapsed": elapsed,
"steps": downtime_steps}, instance=instance)
thisstep = None
for step in downtime_steps:
# elapsed is the migration time so far
if elapsed > step[0]:
# if the elapsed time has passed this step's delay marker, this step becomes the current step
thisstep = step
if thisstep is None:
LOG.debug("No current step", instance=instance)
return olddowntime
if thisstep[1] == olddowntime:
LOG.debug("Downtime does not need to change",
instance=instance)
return olddowntime
LOG.info(_LI("Increasing downtime to %(downtime)d ms "
"after %(waittime)d sec elapsed time"),
{"downtime": thisstep[1],
"waittime": thisstep[0]},
instance=instance)
try:
# Hand the current max downtime to libvirtd
guest.migrate_configure_max_downtime(thisstep[1])
except libvirt.libvirtError as e:
LOG.warning(_LW("Unable to increase max downtime to %(time)d"
"ms: %(e)s"),
{"time": thisstep[1], "e": e}, instance=instance)
return thisstep[1]
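The step-selection logic above can be exercised in isolation. A minimal sketch (a standalone reimplementation of the selection loop, not Nova's function itself) using the example downtime_steps from earlier:

```python
def pick_step(downtime_steps, elapsed):
    """Select the downtime step in effect after `elapsed` seconds,
    mirroring update_downtime: the last step whose time marker has
    already passed wins; None if no marker has passed yet."""
    thisstep = None
    for step in downtime_steps:
        if elapsed > step[0]:
            thisstep = step
    return thisstep

steps = [(0, 46), (300, 47), (600, 48), (900, 51)]
print(pick_step(steps, 450))  # (300, 47): between the 300 s and 600 s markers
```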
Live migration issues with NUMA affinity, CPU pinning, and SR-IOV NICs
In "OpenStack VM Cold/Live Migration: Practice and Flow Analysis" we migrated a VM with NUMA affinity and CPU pinning, and after the migration the VM still kept those properties. Here we run a more extreme test: migrate a VM with NUMA affinity and dedicated (pinned) CPUs to a destination host whose NUMA and CPU resources are already exhausted.
[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | overcloud-ovscompute-1.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-ovscompute-1.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-000000d6 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-03-20T10:45:55.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19 |
| config_drive | |
| created | 2019-03-20T10:44:52Z |
| flavor | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788) |
| hostId | 9f1230901ddf3fe0e1a41e1c650a784c122b791f89fdf66a40cff3d6 |
| id | a17ddcbf-d936-4c77-9ea6-2e684c41cc39 |
| image | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb) |
| key_name | stack |
| name | VM1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| project_id | a6c78435075246f3aa5ab946b87086c5 |
| properties | |
| security_groups | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}] |
| status | ACTIVE |
| updated | 2019-03-20T10:45:56Z |
| user_id | 4fe574569664493bbd660abfe762a630 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
[stack@undercloud (overcloudrc) ~]$ openstack server migrate --block-migration --live overcloud-ovscompute-0.localdomain --wait VM1
Complete
[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | ovs |
| OS-EXT-SRV-ATTR:host | overcloud-ovscompute-0.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-ovscompute-0.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-000000d6 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-03-20T10:45:55.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19 |
| config_drive | |
| created | 2019-03-20T10:44:52Z |
| flavor | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788) |
| hostId | 0f2ec590cd73fe0e9522f1ba715dae7a7d4b884e15aa8254defe85d0 |
| id | a17ddcbf-d936-4c77-9ea6-2e684c41cc39 |
| image | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb) |
| key_name | stack |
| name | VM1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| project_id | a6c78435075246f3aa5ab946b87086c5 |
| properties | |
| security_groups | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}] |
| status | ACTIVE |
| updated | 2019-03-20T10:51:47Z |
| user_id | 4fe574569664493bbd660abfe762a630 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
Error output observed during the migration:
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager [req-566373ae-5282-4378-9678-d8d08e121cdb - - - - -] Error updating resources for node overcloud-ovscompute-0.localdomain.
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager Traceback (most recent call last):
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6590, in update_available_resource_for_node
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager rt.update_available_resource(context)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 536, in update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager self._update_available_resource(context, resources)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager return f(*args, **kwargs)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 896, in _update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager self._update_usage_from_instances(context, instances)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1393, in _update_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager self._update_usage_from_instance(context, instance)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1273, in _update_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager sign, is_periodic)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1119, in _update_usage
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager self.compute_node, usage, free)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1574, in get_host_numa_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager host_numa_topology, instance_numa_topology, free=free))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1447, in numa_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager newcell.pin_cpus(pinned_cpus)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 86, in pin_cpus
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager self.pinned_cpus))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [0, 1] must be a subset of free CPU set [8]
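The `CPUPinningInvalid` exception at the bottom of the traceback comes from the subset check in `nova/objects/numa.py` (`pin_cpus`). The logic can be reduced to a simple set comparison; the sketch below is a simplified illustration (function shape and exception type are not Nova's exact code):

```python
# Minimal sketch of the subset check behind CPUPinningInvalid in
# nova/objects/numa.py::NUMACell.pin_cpus (simplified for illustration).
def pin_cpus(pinned_cpus, free_cpus):
    """Raise if the requested pCPUs are not all free on this NUMA cell."""
    requested = set(pinned_cpus)
    free = set(free_cpus)
    if not requested.issubset(free):
        # Nova raises exception.CPUPinningInvalid here; ValueError stands in.
        raise ValueError(
            "CPU set to pin %s must be a subset of free CPU set %s"
            % (sorted(requested), sorted(free)))
    # Mark the pCPUs as consumed by removing them from the free set.
    return free - requested

# The migrated VM requests pCPUs [0, 1], but only pCPU 8 is free on the cell,
# reproducing the error in the traceback above:
try:
    pin_cpus([0, 1], [8])
except ValueError as e:
    print(e)  # CPU set to pin [0, 1] must be a subset of free CPU set [8]
```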
NUMA affinity and CPU pinning after the migration:
# The migrated VM
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d6
VCPU: CPU Affinity
----------------------------------
0: 0
1: 1
# A pre-existing VM on the destination host
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d0
VCPU: CPU Affinity
----------------------------------
0: 0
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
Excerpt from the migrated VM's domain XML:
<cpu mode='custom' match='exact' check='full'>
<model fallback='forbid'>IvyBridge</model>
<topology sockets='1' cores='2' threads='1'/>
<feature policy='require' name='hypervisor'/>
<feature policy='require' name='arat'/>
<feature policy='require' name='xsaveopt'/>
<numa>
<cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
</numa>
</cpu>
Conclusion: the VM migrates successfully and keeps its original NUMA and CPU-pinning properties. This works only by accident: the dedicated CPU policy is a Nova-level concept, yet the code analysis above shows that Nova's migration path is entirely NUMA-non-aware. The hypervisor layer cares about these parameters even less; it faithfully executes whatever the XML describes, so if the XML says to use pCPUs 0 and 1, the hypervisor will pin to them even when another VM already occupies them. From Nova's point of view this is a bug, and the community has described the problem and proposed a blueprint: 《NUMA-aware live migration》.
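The double-pinning shown in the two `virsh vcpupin` listings above (both domains pinned to pCPUs 0 and 1) can be detected host-side with a simple overlap check. This is a hypothetical helper, not part of Nova or libvirt:

```python
# Hedged sketch: libvirt/QEMU will happily pin two domains to the same
# pCPUs, as the vcpupin output above shows. A host-side overlap check
# (hypothetical helper, not a Nova or libvirt API) over per-domain pin maps:
def pinned_overlap(domain_pins):
    """domain_pins: {domain_name: set of pCPU ids}. Return conflicting pairs."""
    conflicts = []
    names = sorted(domain_pins)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = domain_pins[a] & domain_pins[b]
            if shared:
                conflicts.append((a, b, sorted(shared)))
    return conflicts

pins = {
    'instance-000000d6': {0, 1},         # the migrated VM
    'instance-000000d0': set(range(8)),  # the pre-existing VM
}
print(pinned_overlap(pins))
# [('instance-000000d0', 'instance-000000d6', [0, 1])]
```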
As for SR-IOV, the Nova documentation explicitly states that live migration of SR-IOV instances is not supported. As I analysed in 《啟用 SR-IOV 解決 Neutron 網絡 I/O 性能瓶頸》, to a KVM guest an SR-IOV VF device is simply a section of the domain XML, e.g.:
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address bus='0x81' slot='0x10' function='0x2'/>
</source>
</hostdev>
As long as a VF device matching this tag section can be found on the destination compute node, the SR-IOV NIC could in principle be migrated. The catch is that, strictly speaking, the domain XML of a live-migrated VM should not be modified; in practice, rewriting a single VF tag section is probably harmless, provided there is a solid rollback plan for failed migrations and Nova becomes SR-IOV-aware (able to track and manage the VF resources). Writing this, I increasingly hope that OpenStack Placement matures quickly, because Nova's "black-box" management of resources such as NUMA and SR-IOV is painful.
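Matching a VF on the destination starts with extracting the PCI address from the `<hostdev>` section. A minimal sketch, assuming only the XML fragment shown above (the helper name is hypothetical):

```python
import xml.etree.ElementTree as ET

# The <hostdev> fragment shown above, describing an SR-IOV VF by PCI address.
hostdev_xml = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address bus='0x81' slot='0x10' function='0x2'/>
  </source>
</hostdev>
"""

def vf_pci_address(xml_text):
    """Return (bus, slot, function) of the VF referenced by a <hostdev> tag."""
    addr = ET.fromstring(xml_text).find('./source/address')
    return addr.get('bus'), addr.get('slot'), addr.get('function')

print(vf_pci_address(hostdev_xml))  # ('0x81', '0x10', '0x2')
```

A destination host would compare this tuple against its own free VFs before deciding whether the tag section can be rewritten or kept as-is.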
Final Thoughts
The analysis of the implementation and code of OpenStack VM cold/live migration shows that Nova mostly wraps and orchestrates traditional migration techniques and the migration features of the underlying hypervisor stack, so that cold and live migration can meet the requirements of an enterprise-grade cloud platform. As with other OpenStack projects, the core technical value still lies in the underlying supporting technologies.
References
https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/
https://www.cnblogs.com/sammyliu/p/4572287.html
https://docs.openstack.org/nova/pike/admin/configuring-migrations.html
https://docs.openstack.org/nova/pike/admin/live-migration-usage.html
https://blog.csdn.net/lemontree1945/article/details/79901874
https://www.ibm.com/developerworks/cn/linux/l-cn-mgrtvm1/index.html
https://blog.csdn.net/hawkerou/article/details/53482268