In simple terms, migration enables you to move a VM from one physical machine to another with minimal or no downtime. We can also move VM storage, which is a resource-hungry operation that needs to be carefully planned and, if possible, executed after hours so that it doesn’t affect other VMs’ performance more than necessary.
There are several types of migration, as follows:
- Offline (cold)
- Online (live)
- Suspended migration
There are also several types of online migration, depending on what you’re moving, as follows:
- The compute part of the VM (moving the VM from one KVM host to another KVM host)
- The storage part of the VM (moving VM files from one storage pool to another storage pool)
- Both (moving the VM from host to host and storage pool to storage pool at the same time)
There are some differences in terms of which migration scenarios are supported if you’re using just a plain KVM host versus oVirt or Red Hat Enterprise Virtualization. If you want to do a live storage migration, you can’t do it on a KVM host directly, but you can easily do it if the VM is shut down. If you need a live storage migration, you will have to use oVirt or Red Hat Enterprise Virtualization.
We discussed single-root input-output virtualization (SR-IOV), Peripheral Component Interconnect (PCI) device passthrough, virtual graphics processing units (vGPUs), and similar concepts as well (in Chapter 2, KVM as a Virtualization Solution, and Chapter 4, Libvirt Networking). In CentOS 8, you can’t live-migrate a running VM that has any of these options assigned to it.
Whatever the use case is, we need to be aware of the fact that migration needs to be performed either as the root user or as a user that belongs to the libvirt user group (what Red Hat refers to as system versus user libvirt sessions).
There are different reasons why VM migration is a valuable tool to have in your arsenal. Some of these reasons are obvious; others, less so. Let’s try to explain different use cases for VM migration and its benefits.
Benefits of VM migration
- Increased uptime and reduced downtime—A carefully designed virtualized environment will give you the maximum uptime for your application.
- Saving energy and going green—You can easily consolidate your VMs based on their load and usage to a smaller number of hypervisors during off hours. Once the VMs are migrated, you can power off the unused hypervisors.
- Easy hardware/software upgrade process by moving your VM between different hypervisors—Once you have the capability to move your VMs freely between different physical servers, the benefits are countless.
VM migration needs proper planning to be put in place. Migration has some basic requirements. Let’s see them one by one:
- The VM should be using a storage pool that is created on shared storage.
- The name of the storage pool and the virtual disk’s path should remain the same on both hypervisors (source and destination).
Check out Chapter 4, Libvirt Networking, and Chapter 5, Libvirt Storage, to remind yourself how to create a storage pool using shared storage.
There are, as always, some rules that apply here. These are rather simple, so we need to learn them before starting migration processes. They are as follows:
- It is possible to do a live storage migration using a storage pool that is created on non-shared storage. You only need to maintain the same storage pool name and file location, but shared storage is still recommended in a production environment.
- If there is an unmanaged virtual disk attached to a VM that uses Fibre Channel (FC), Internet Small Computer Systems Interface (iSCSI), Logical Volume Manager (LVM) storage, and so on, the same storage should be available on both hypervisors.
- The virtual networks used by the VMs should be available on both hypervisors.
- A bridge that is configured for network communication should be available on both hypervisors.
- Migration may fail if the major versions of qemu-kvm on the hypervisors are different, but you should be able to migrate VMs running on a hypervisor that has a lower version of qemu-kvm to a hypervisor that has a higher version of those packages, without any issues.
- The time on both the source and destination hypervisors should be synced. It is highly recommended that you sync the hypervisors using the same Network Time Protocol (NTP) or Precision Time Protocol (PTP) servers.
- It is important that the systems use a Domain Name System (DNS) server for name resolution. Adding the host details in /etc/hosts will not work. You should be able to resolve the hostnames using the DNS server.
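The qemu-kvm version rule above is easy to check mechanically before migrating. This is a hedged sketch: the version strings are hard-coded examples, and on real hosts you would obtain them with rpm -q qemu-kvm on each hypervisor.

```shell
src_ver="4.2.0"   # example value; on the source, e.g. rpm -q --qf '%{VERSION}' qemu-kvm
dst_ver="6.2.0"   # example value; the same query on the destination
src_major=${src_ver%%.*}   # strip everything after the first dot
dst_major=${dst_ver%%.*}
if [ "$src_major" -le "$dst_major" ]; then
    verdict="lower-to-higher: should migrate without issues"
else
    verdict="higher-to-lower: migration may fail"
fi
echo "$verdict"
```

A pre-flight check like this is cheap to run from a wrapper script before every planned migration, instead of discovering a version mismatch halfway through a maintenance window.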
There are some prerequisites that we need to keep in mind when planning our environment for VM migration. For the most part, these prerequisites are the same for all virtualization solutions. Let’s discuss them and, in general, how to set up our environment for VM migration next.
Setting up the environment
We start this by setting up a shared storage. In this example, we are using Network File System (NFS) for the shared storage. We are going to use NFS because it is simple to set up, thus helping you to follow the migration examples easily. In actual production, it is recommended to use iSCSI-based or FC-based storage pools. NFS is not a good choice when the files are large and the VM performs heavy I/O operations. Gluster is a good alternative to NFS, and we would recommend that you try it; it is well integrated with libvirt.
We’re going to create an NFS share on a CentOS 8 server. It’s going to be hosted in the /testvms directory, which we’re going to export via NFS. The name of the server is nfs-01 (in our case, the IP address of nfs-01 is 192.168.159.134).
- The first step is creating the /testvms directory, exporting it from nfs-01, and turning off SELinux (check Chapter 5, Libvirt Storage, Ceph section to see how):
# mkdir /testvms
# echo '/testvms *(rw,sync,no_root_squash)' >> /etc/exports
- Then, allow the NFS service in the firewall by executing the following code:
# firewall-cmd --get-active-zones
public
interfaces: ens33
# firewall-cmd --zone=public --add-service=nfs
# firewall-cmd --zone=public --list-all
- Start the NFS service, as follows:
# systemctl start rpcbind nfs-server
# systemctl enable rpcbind nfs-server
# showmount -e
- Confirm that the share is accessible from your KVM hypervisors. In our case, these are PacktPhy01 and PacktPhy02. Run the following code:
# mount 192.168.159.134:/testvms /mnt
- If mounting fails, reconfigure the firewall on the NFS server and recheck the mount. This can be done by using the following commands:
firewall-cmd --permanent --zone=public --add-service=nfs
firewall-cmd --permanent --zone=public --add-service=mountd
firewall-cmd --permanent --zone=public --add-service=rpc-bind
firewall-cmd --reload
- Unmount the volume once you have verified the NFS mount point from both hypervisors, as follows:
# umount /mnt
- On both PacktPhy01 and PacktPhy02, create a storage pool named testvms, as follows:
# mkdir -p /var/lib/libvirt/images/testvms/
# virsh pool-define-as --name testvms --type netfs --source-host 192.168.159.134 --source-path /testvms --target /var/lib/libvirt/images/testvms/
# virsh pool-start testvms
# virsh pool-autostart testvms
The testvms storage pool is now created and started on both hypervisors.
In this next example, we are going to isolate the migration and VM traffic. It is highly recommended that you do this isolation in your production environment, especially if you do a lot of migrations, as it will offload that demanding process to a separate network interface, thus freeing other congested network interfaces. So, there are two main reasons for this, as follows:
- Network performance: The migration of a VM uses the full bandwidth of the network. If you use the same network for the VM traffic network and the migration network, the migration will choke that network, thus affecting the servicing capability of the VM. You can control the migration bandwidth, but it will increase the migration time.
Here is how we create the isolation:
PacktPhy01 -- ens36 (192.168.0.5) <--switch------> ens36 (192.168.0.6) -- PacktPhy02
              ens37 -> br1        <-----switch------>        ens37 -> br1
The ens36 interfaces on PacktPhy01 and PacktPhy02 are used for migration as well as administrative tasks. They have an IP address assigned and are connected to a network switch. A br1 bridge is created using ens37 on both hypervisors. br1 does not have an IP address assigned and is used exclusively for VM traffic (it is the uplink for the switch that the VMs are connected to). It is also connected to a (physical) network switch.
- Security reasons: It is always recommended that you keep your management network and virtual network isolated for security reasons, as well. You don’t want your users to mess with your management network, where you access your hypervisors and do the administration.
We will discuss three of the most important scenarios— offline migration, non-live migration (suspended), and live migration (online). Then, we will discuss storage migration as a separate scenario that requires additional planning and forethought.
As the name suggests, during offline migration, the state of the VM will be either shut down or suspended. The VM will be then resumed or started at the destination host. In this migration model,
libvirt will just copy the VM’s XML configuration file from the source to the destination KVM host. It also assumes that you have the same shared storage pool created and ready to use at the destination. As the first step in the migration process, you need to set up two-way passwordless SSH authentication on the participating KVM hypervisors. In our example, they are called PacktPhy01 and PacktPhy02.
First, in /etc/sysconfig/selinux, use your favorite editor to modify the following line of code:
SELINUX=enforcing
This needs to be modified as follows:
SELINUX=permissive
Also, in the command line, as root, we need to temporarily set the SELinux mode to permissive, as follows:
# setenforce 0
On PacktPhy01, as root, run the following commands:
# ssh-keygen
# ssh-copy-id root@PacktPhy02
On PacktPhy02, as root, run the following commands:
# ssh-keygen
# ssh-copy-id root@PacktPhy01
You should now be able to log in to both of these hypervisors as root without typing a password.
The basic syntax of the migration command is as follows:
# virsh migrate migration-type options name-of-the-vm destination-uri
Then, on PacktPhy01, run the following code:
[PacktPhy01] # virsh migrate --offline --verbose --persistent MasteringKVM01 qemu+ssh://PacktPhy02/system
Migration: [100 %]
On PacktPhy02, run the following code:
[PacktPhy02] # virsh list --all
Id Name State
----------------------------------------------------
- MasteringKVM01 shut off
[PacktPhy02] # virsh start MasteringKVM01
Domain MasteringKVM01 started
When a VM is on shared storage and you have some kind of issue with one of the hosts, you could also manually register a VM on another host. That means that you might end up in a situation where the same VM is registered on two hypervisors, after you repair the issue on your host that had an initial problem. It’s something that happens when you’re manually managing KVM hosts without a centralized management platform such as oVirt, which wouldn’t allow such a scenario. So, what happens if you’re in that kind of situation? Let’s discuss this scenario.
What if I start the VM accidentally on both hypervisors?
Accidentally starting the VM on both hypervisors can be a sysadmin’s nightmare. It can lead to VM filesystem corruption, especially when the filesystem inside the VM is not cluster-aware. The developers of libvirt thought about this and came up with a locking mechanism. In fact, they came up with two locking mechanisms. When enabled, these will prevent the VMs from starting at the same time on two hypervisors.
The two locking mechanisms are as follows:
- lockd: This makes use of the POSIX fcntl() advisory locking capability. It is started by the virtlockd daemon. It requires a shared filesystem (preferably NFS), accessible to all the hosts that share the same storage pool.
- sanlock: This is used by the oVirt project. It uses a disk paxos algorithm for maintaining continuously renewed leases.
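The idea behind both mechanisms can be sketched with two shell processes racing for the same advisory lock. Note that flock(1) from util-linux uses the flock() system call rather than fcntl(), so this illustrates the advisory-locking concept, not virtlockd’s exact mechanics; the lock file here is a temporary stand-in, not one libvirt uses.

```shell
lockfile=$(mktemp)                         # stand-in for the lockspace file
( flock -n 9 && sleep 2 ) 9>"$lockfile" &  # "first hypervisor" takes and holds the lock
holder=$!
sleep 1                                    # give the first holder time to acquire it
if flock -n "$lockfile" -c true; then      # "second hypervisor" tries a non-blocking lock
    result="started"
else
    result="resource busy"                 # the situation virsh reports as '... is locked'
fi
echo "$result"
wait "$holder"
rm -f "$lockfile"
```

Because the lock lives on a file both parties can see, the same refusal happens even when the two contenders are on different hosts, provided the lock directory sits on shared storage such as NFS.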
For libvirt-only implementations, we prefer lockd; sanlock is best left to oVirt. To enable lockd, add the following line to /etc/libvirt/qemu.conf on both hypervisors:
lock_manager = "lockd"
Now, enable and start the virtlockd service on both hypervisors. Also, restart libvirtd on both hypervisors, as follows:
# systemctl enable virtlockd; systemctl start virtlockd
# systemctl restart libvirtd
# systemctl status virtlockd
Start the MasteringKVM01 VM on PacktPhy02, as follows:
[root@PacktPhy02] # virsh start MasteringKVM01
Domain MasteringKVM01 started
Start the same MasteringKVM01 VM on PacktPhy01, as follows:
[root@PacktPhy01] # virsh start MasteringKVM01
error: Failed to start domain MasteringKVM01
error: resource busy: Lockspace resource '/var/lib/libvirt/images/testvms/MasteringKVM01.qcow2' is locked
Another method to enable lockd is to use a hash of the disk’s file path. Locks are saved in a shared directory that is exported through NFS, or similar sharing, to the hypervisors. This is very useful when you have virtual disks that are created and attached using a multipath logical unit number (LUN), where fcntl() cannot be used on the file path. We recommend that you use the method detailed next to enable the locking.
On the NFS server, run the following code (make sure that you’re not running any virtual machines from this NFS server first!):
# mkdir /flockd
# echo "/flockd *(rw,no_root_squash)" >> /etc/exports
# systemctl restart nfs-server
# showmount -e
Export list for nfs-01:
/flockd *
/testvms *
Add the following entry to /etc/fstab on both hypervisors and type in the rest of these commands:
# echo "192.168.159.134:/flockd /var/lib/libvirt/lockd/flockd nfs rsize=8192,wsize=8192,timeo=14,intr,sync" >> /etc/fstab
# mkdir -p /var/lib/libvirt/lockd/flockd
# mount -a
# echo 'file_lockspace_dir = "/var/lib/libvirt/lockd/flockd"' >> /etc/libvirt/qemu-lockd.conf
[root@PacktPhy01 ~]# virsh start MasteringKVM01
Domain MasteringKVM01 started
[root@PacktPhy02 flockd]# ls
36b8377a5b0cc272a5b4e50929623191c027543c4facb1c6f3c35bacaa7455ef
51e3ed692fdf92ad54c6f234f742bb00d4787912a8a674fb5550b1b826343dd6
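Each name in the listing above is 64 hexadecimal characters, the length of a SHA-256 digest, which suggests the lock files are named after a hash of the disk path or LUN identifier. Assuming that scheme (an assumption on our part, with a hypothetical disk path), the naming can be reproduced like this:

```shell
# Hypothetical disk path; substitute the path or LUN ID of your own disk
disk_path="/var/lib/libvirt/images/testvms/MasteringKVM01.qcow2"
# Hash the path the way we assume the lockspace does (SHA-256, hex-encoded)
lock_name=$(printf '%s' "$disk_path" | sha256sum | awk '{print $1}')
echo "$lock_name"
```

This kind of reproduction is handy when you need to map a stale lock file back to the disk it belongs to before deciding whether it is safe to delete.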
MasteringKVM01 has two virtual disks, one created from an NFS storage pool and the other created directly from a LUN. If we try to power it on from the PacktPhy02 hypervisor host while it is running on PacktPhy01, MasteringKVM01 fails to start, as can be seen in the following code snippet:
[root@PacktPhy02 ~]# virsh start MasteringKVM01
error: Failed to start domain MasteringKVM01
error: resource busy: Lockspace resource '51e3ed692fdf92ad54c6f234f742bb00d4787912a8a674fb5550b1b826343dd6' is locked
When using LVM volumes that can be visible across multiple host systems, it is desirable to do the locking based on the universally unique identifier (UUID) associated with each volume, instead of their paths. Setting the following path causes libvirt to do UUID-based locking for LVM:
lvm_lockspace_dir = "/var/lib/libvirt/lockd/lvmvolumes"
When using SCSI volumes that can be visible across multiple host systems, it is desirable to do locking based on the UUID associated with each volume, instead of their paths. Setting the following path causes libvirt to do UUID-based locking for SCSI:
scsi_lockspace_dir = "/var/lib/libvirt/lockd/scsivolumes"
If you are not able to start VMs due to locking errors, just make sure that they are not running anywhere, delete the lock files, and then start the VM again. We deviated a little from migration for the lockd topic. Let’s get back to migration.
Live or online migration
In this type of migration, the VM is migrated to the destination host while it’s running on the source host. The process is invisible to the users who are using the VMs. They won’t even know that the VM they are using has been transferred to another host while they are working on it. Live migration is one of the main features that have made virtualization so popular.
Migration implementation in KVM does not need any support from the VM. It means that you can live-migrate any VMs, irrespective of the operating system they are using. A unique feature of KVM live migration is that it is almost completely hardware-independent. You should ideally be able to live-migrate a VM running on a hypervisor that has an Advanced Micro Devices (AMD) processor to an Intel-based hypervisor.
We are not saying that this will work in 100% of the cases or that we in any way recommend having this type of mixed environment, but in most of the cases, it should be possible.
Before we start the process, let’s go a little deeper to understand what happens under the hood. When we do a live migration, we are moving a live VM while users are accessing it. This means that users shouldn’t feel any disruption in VM availability when you do a live migration.
Live migration is a complex, five-stage process, even though none of these stages is exposed to the sysadmin. libvirt will do the necessary work once the VM migration action is issued. The stages through which a VM migration goes are explained in the following list:
- Preparing the destination: When you initiate a live migration, the source libvirt daemon (SLibvirt) will contact the destination libvirt daemon (DLibvirt) with the details of the VM that is going to be transferred live. DLibvirt will pass this information to the underlying QEMU, with relevant options to enable live migration. QEMU will start the actual live migration process by starting the VM in pause mode and will start listening on a Transmission Control Protocol (TCP) port for VM data. Once the destination is ready, DLibvirt will inform SLibvirt, with the details of QEMU. By this time, QEMU, at the source, is ready to transfer the VM and connects to the destination TCP port.
- Transferring the VM: When we say transferring the VM, we are not transferring the whole VM; only the parts that are missing at the destination are transferred—for example, the memory and the state of the virtual devices (VM state). Other than the memory and the VM state, all other virtual hardware (virtual network, virtual disks, and virtual devices) is available at the destination itself. Here is how QEMU moves the memory to the destination:
a) The VM will continue running at the source, and the same VM is started in pause mode at the destination.
b) In one go, it will transfer all the memory used by the VM to the destination. The speed of transfer depends upon the network bandwidth. Suppose the VM is using 10 gibibytes (GiB); it will take the same time to transfer 10 GiB of data using the Secure Copy Protocol (SCP) to the destination. In default mode, it will make use of the full bandwidth. That is the reason we are separating the administration network from the VM traffic network.
c) Once the whole memory is at the destination, QEMU starts transferring the dirty pages (pages that are not yet written to the disk). If it is a busy VM, the number of dirty pages will be high and it will take time to move them. Remember, dirty pages will always be there and there is no state of zero dirty pages on a running VM. Hence, QEMU will stop transferring the dirty pages when it reaches a low threshold (50 or fewer pages).
- Stopping the VM on the source host: Once the number of dirty pages reaches the said threshold, QEMU will stop the VM on the source host. It will also sync the virtual disks.
- Transferring the VM state: At this stage, QEMU will transfer the state of the VM’s virtual devices and remaining dirty pages to the destination as quickly as possible. We cannot limit the bandwidth at this stage.
- Continuing the VM: At the destination, the VM will be resumed from the paused state. Virtual network interface controllers (NICs) become active, and the bridge will send out gratuitous Address Resolution Protocols (ARPs) to announce the change. After receiving the announcement from the bridge, the network switches will update their respective ARP cache and start forwarding the data for the VM to the new hypervisors.
Note that Steps 3, 4, and 5 will be completed in milliseconds. If some errors happen, QEMU will abort the migration and the VM will continue running on the source hypervisor. All through the migration process, the libvirt services on both participating hypervisors will be monitoring it.
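The pre-copy loop described in stages 2 and 3 can be sketched with a toy shell model: each round re-sends whatever memory the guest dirtied while the previous round was in flight, and the VM is stopped for the final copy once the remainder falls below a threshold. All numbers (bandwidth, dirty rate, threshold) are illustrative assumptions, not libvirt defaults.

```shell
mem_mib=4096        # VM memory, matching our example VM (MiB)
bw_mib_s=1024       # migration bandwidth in MiB/s (assumed)
dirty_mib_s=64      # rate at which the guest dirties memory, MiB/s (assumed)
threshold_mib=100   # remaining-data threshold for stop-and-copy (assumed)

remaining=$mem_mib
rounds=0
while [ "$remaining" -gt "$threshold_mib" ] && [ "$rounds" -lt 30 ]; do
    # Time to send this round's data (ceiling division) at full bandwidth
    secs=$(( (remaining + bw_mib_s - 1) / bw_mib_s ))
    # Whatever the guest dirtied during that time must be re-sent next round
    remaining=$(( dirty_mib_s * secs ))
    rounds=$(( rounds + 1 ))
done
echo "stop-and-copy after $rounds pre-copy rounds, ${remaining} MiB still to move"
```

If dirty_mib_s gets close to bw_mib_s, remaining stops shrinking and the loop hits its round cap; that is the non-converging case where a real migration needs the guest throttled, more bandwidth, or a larger allowed downtime.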
Our VM called MasteringKVM01 is now running safely on PacktPhy01, with lockd enabled. We are going to live-migrate it to PacktPhy02.
We need to open the necessary TCP ports used for migration. You only need to do that at the destination server, but it’s good practice to do this in your whole environment so that you don’t have to micro-manage these configuration changes one by one as you need them in the future. Basically, you have to open the ports on all the participating hypervisors by using the following firewall-cmd command for the default zone (in our case, the public zone):
# firewall-cmd --zone=public --add-port=49152-49216/tcp --permanent
Check the name resolution on both the servers, as follows:
[root@PacktPhy01 ~] # host PacktPhy01
PacktPhy01 has address 192.168.159.136
[root@PacktPhy01 ~] # host PacktPhy02
PacktPhy02 has address 192.168.159.135
[root@PacktPhy02 ~] # host PacktPhy01
PacktPhy01 has address 192.168.159.136
[root@PacktPhy02 ~] # host PacktPhy02
PacktPhy02 has address 192.168.159.135
Check and verify that all the attached virtual disks are available at the destination, on the same path and with the same storage pool name. This also applies to attached unmanaged virtual disks (iSCSI and FC LUNs, and so on).
Check and verify that all the network bridges and virtual networks used by the VM are available at the destination. After that, we can start the migration process by running the following code:
# virsh migrate --live MasteringKVM01 qemu+ssh://PacktPhy02/system --verbose --persistent
Migration: [100 %]
Our VM is using only 4,096 megabytes (MB) of memory, so all five stages completed in a couple of seconds. The --persistent option is optional, but we recommend adding it.
This is the output of ping during the migration process (10.10.48.24 is the IP address of MasteringKVM01):
# ping 10.10.48.24
PING 10.10.48.24 (10.10.48.24) 56(84) bytes of data.
64 bytes from 10.10.48.24: icmp_seq=12 ttl=64 time=0.338 ms
64 bytes from 10.10.48.24: icmp_seq=13 ttl=64 time=3.10 ms
64 bytes from 10.10.48.24: icmp_seq=14 ttl=64 time=0.574 ms
64 bytes from 10.10.48.24: icmp_seq=15 ttl=64 time=2.73 ms
64 bytes from 10.10.48.24: icmp_seq=16 ttl=64 time=0.612 ms
--- 10.10.48.24 ping statistics ---
17 packets transmitted, 17 received, 0% packet loss, time 16003ms
rtt min/avg/max/mdev = 0.338/0.828/3.101/0.777 ms
If you get the following error message, change the cache mode to none on the attached virtual disk:
# virsh migrate --live MasteringKVM01 qemu+ssh://PacktPhy02/system --verbose
error: Unsafe migration: Migration may lead to data corruption if disks use cache != none
# virt-xml MasteringKVM01 --edit --disk target=vda,cache=none
You can verify the change in the output of virsh dumpxml MasteringKVM01.
You can try a few more options while performing a live migration, as follows:
--undefine domain: Option used to remove a KVM domain from a KVM host.
--suspend domain: Suspends a KVM domain—that is, pauses a KVM domain until we unsuspend it.
--compressed: When we do a VM migration, this option enables us to compress memory pages during transfer, which usually means a faster migration process.
--abort-on-error: If the migration process throws an error, it is automatically stopped. This is a safe default option as it will help in situations where any kind of corruption might happen during the migration process.
--unsafe: Kind of like the polar opposite of the --abort-on-error option. This option forces migration at all costs, even in the case of errors, data corruption, or any other unforeseen scenario. Be very careful with this option: don’t use it often, and never in a situation where VM data consistency is a key prerequisite.
You can read more about these options in the RHEL 7—Virtualization Deployment and Administration guide (you can find the link in the Further reading section at the end of this chapter). Additionally, the virsh command also supports the following options:
virsh migrate-setmaxdowntime <domain>: When migrating a VM, it’s inevitable that, at times, a VM is going to be unavailable for a short period of time. This might happen—for example—because of the hand-off process, when we migrate a VM from one host to the other, and we’re just coming to the point of state equilibrium (that is, when the source and destination host have the same VM content and are ready to remove the source VM from the source host inventory and make it run on the destination host). Basically, a small pause happens as the source VM gets paused and killed, and the destination host VM gets unpaused and continues. By using this command, the KVM stack is trying to estimate how long this stopped phase will last. It’s a viable option, especially for VMs that are really busy and are therefore changing their memory content a lot while we’re migrating them.
virsh migrate-setspeed <domain> bandwidth: We can treat this as a quasi-Quality of Service (QoS) option. By using it, we can set the amount of bandwidth in MiB/s that we’re giving to the migration process. This is a very good option to use if our network is busy (for example, if we have multiple virtual local area networks (VLANs) going across the same physical network and we have bandwidth limitations because of it). Lower numbers will slow the migration process.
virsh migrate-getspeed <domain>: We can treat this as a get information counterpart to the migrate-setspeed command, to check which bandwidth we assigned with it.
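To get a feel for the bandwidth/time trade-off that migrate-setspeed controls, here is a quick back-of-the-envelope calculation. The memory size matches our example VM; the capped and uncapped rates are assumptions for illustration:

```shell
mem_mib=4096      # memory of our example VM, in MiB
cap_mib_s=128     # a bandwidth cap we might set with migrate-setspeed (assumed)
full_mib_s=1024   # the uncapped link rate (assumed)

# Ceiling division: seconds needed for the first full memory pass
capped_s=$(( (mem_mib + cap_mib_s - 1) / cap_mib_s ))
uncapped_s=$(( (mem_mib + full_mib_s - 1) / full_mib_s ))
echo "first memory pass: ~${capped_s}s capped vs ~${uncapped_s}s uncapped"
```

Capping to an eighth of the link rate makes the first pass eight times longer, and since the guest keeps dirtying pages during that window, the whole migration usually stretches by more than the ratio alone suggests.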
As you can see, migration is a complex process from a technical standpoint, and has multiple different types and loads of additional configuration options that you can use for management purposes. That being said, it’s still such an important capability of a virtualized environment that it’s very difficult to imagine working without it.