The recent commercial availability of Intel SGX (Software Guard eXtensions) provides a hardware-enabled building block for secure execution of software modules in an untrusted cloud. Because an untrusted hypervisor or OS has no access to an enclave’s runtime state, a virtual machine (VM) with enclaves running inside can no longer be live-migrated, even though live migration is a key feature of VMs in the cloud. This paper presents the first study on support for live migration of SGX-capable VMs.
We identify the security properties that a secure enclave migration process should meet and propose a software-based solution. We use several techniques, such as two-phase checkpointing and self-destroy, to implement our design on a real SGX machine. A security analysis supports the proposed design, and a performance evaluation shows that it incurs negligible performance overhead. We also give suggestions for future hardware designs that would support transparent enclave migration.
Briefly, the enclave migration process, as shown in the figure, consists of three operations. First, the source machine dumps the enclave’s runtime state. Second, the dumped state is transferred to the target machine over the network. Third, the target machine creates a new enclave and restores the state so that execution can resume.
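The three operations can be illustrated with a minimal sketch. It is a toy model, not the implementation described here: `ToyEnclave`, `dump_state`, `transfer`, and `restore_state` are hypothetical names, the enclave is plain Python data rather than SGX memory, the network is simulated by copying bytes, and the state travels unprotected, which a real design would not allow. Marking the source as destroyed after the dump is an assumption inspired by the self-destroy technique mentioned above.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyEnclave:
    """Hypothetical stand-in for an enclave: plain data, no real SGX."""
    memory: dict = field(default_factory=dict)
    context: dict = field(default_factory=dict)
    destroyed: bool = False

    def run_step(self) -> None:
        # A destroyed source must never run again, so only one copy executes.
        if self.destroyed:
            raise RuntimeError("enclave was destroyed after checkpointing")
        self.memory["counter"] = self.memory.get("counter", 0) + 1


def dump_state(enclave: ToyEnclave) -> bytes:
    """Operation 1: serialize memory and execution context on the source."""
    blob = json.dumps(
        {"memory": enclave.memory, "context": enclave.context},
        sort_keys=True,
    ).encode("utf-8")
    enclave.destroyed = True  # assumption: source self-destroys after the dump
    return blob


def transfer(blob: bytes) -> bytes:
    """Operation 2: stand-in for sending the checkpoint over the network."""
    return bytes(blob)


def restore_state(blob: bytes) -> ToyEnclave:
    """Operation 3: create a fresh enclave on the target and load the state."""
    state = json.loads(blob.decode("utf-8"))
    return ToyEnclave(memory=state["memory"], context=state["context"])
```

A caller would chain the calls as `restore_state(transfer(dump_state(source)))` and continue execution on the returned object. The sketch omits confidentiality and integrity protection of the checkpoint entirely.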
On the source machine, the control thread is responsible for generating a checkpoint that contains all of the enclave’s memory and execution context, including state that is maintained by hardware and cannot be read by software: the CSSA (Current State Save Area) field in the TCS (Thread Control Structure). Specifically, we have designed a software mechanism that runs inside the enclave and tracks CSSA by monitoring all of the enclave’s entry and exit events, without any dependency on the untrusted guest OS or hypervisor.
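The bookkeeping behind such tracking can be sketched as a small counter model. This is not the mechanism from the paper: `CssaTracker` and `replay` are hypothetical names, and the model assumes the usual SGX behavior in which an asynchronous exit (AEX) increments CSSA, ERESUME decrements it, EENTER and EEXIT leave it unchanged, and EENTER and AEX both need a free SSA frame (CSSA below NSSA). How an enclave actually observes these events is outside the sketch.

```python
class CssaTracker:
    """Toy model that mirrors a thread's CSSA value from a stream of events."""

    def __init__(self, nssa: int) -> None:
        if nssa < 1:
            raise ValueError("nssa must be at least 1")
        self.nssa = nssa  # number of SSA frames available to the thread
        self.cssa = 0     # index of the current (next free) SSA frame

    def on_enter(self) -> None:
        # Assumption: EENTER does not change CSSA but needs a free frame.
        if self.cssa >= self.nssa:
            raise RuntimeError("EENTER with no free SSA frame")

    def on_exit(self) -> None:
        # Assumption: a synchronous EEXIT does not change CSSA.
        pass

    def on_async_exit(self) -> None:
        # Assumption: AEX saves the context into the current frame, then
        # increments CSSA.
        if self.cssa >= self.nssa:
            raise RuntimeError("AEX with no free SSA frame")
        self.cssa += 1

    def on_resume(self) -> None:
        # Assumption: ERESUME restores the last saved frame, decrementing CSSA.
        if self.cssa == 0:
            raise RuntimeError("ERESUME with no saved SSA frame")
        self.cssa -= 1


def replay(events, nssa: int) -> int:
    """Feed a sequence of event names to a tracker and return the final CSSA."""
    tracker = CssaTracker(nssa)
    handlers = {
        "eenter": tracker.on_enter,
        "eexit": tracker.on_exit,
        "aex": tracker.on_async_exit,
        "eresume": tracker.on_resume,
    }
    for name in events:
        handlers[name]()
    return tracker.cssa
```

In this model, a checkpoint taken after the events `eenter`, `aex` would record a CSSA of 1, which is the value the restore side would then need to reproduce on the target.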
On the target machine, the restore process contains the following four steps.
The evaluation results show that our system’s performance overhead is negligible: to migrate a VM with 64 enclaves running inside, the total time of migration grows by 4.7%, and the downtime increases by only 3 milliseconds. More evaluation results can be found in the paper.