Crash dump, or core dump, is the routine typically executed on system crash, automatically or by the system administrator, to save the failure context of the system to persistent storage for later offline debugging and analysis. After the core dump completes, the administrator can restart the system and reuse the reclaimed memory, as a form of reboot-based recovery.
However, today's server machines have abundant memory, so core-dumping the whole memory of a server on system crash is slow. If the crashed system restarts only after the whole memory has been core-dumped, the mean time to repair (MTTR) is very long. For this reason, some administrators skip the core dump and restart the system immediately after a crash to shorten MTTR, which loses the failure context needed to analyze the root cause of the crash.
Recognizing both the usefulness of core dump and the long latency it adds to MTTR, we propose to optimize core dump in virtualized environments in order to minimize the downtime that core dump and recovery take. The optimizations are: running core dump in parallel with the restart of the crashed services in a newly spawned recovery VM, to improve overall hardware resource utilization; selectively core-dumping only the useful part of memory rather than the whole, to shorten core dump latency; and disk I/O rate control that balances I/O between the parallelized core dump and recovery, to shorten downtime as much as possible.
Figure 1 shows the system architecture to optimize core dump in virtualized environments. The crashed VM is a server providing application services to users. The management VM runs our core dump daemon which detects the system crash of a VM and then invokes our optimized core dump. The optimizations are:
While the core dump of the crashed VM is in progress, we concurrently create and start a recovery VM to continue the crashed application services. The key to realizing this concurrency is memory reallocation: the pseudo-physical memory of the crashed VM is divided into chunks, and each chunk, once core-dumped and reclaimed from the crashed VM, is added to the recovery VM.
Through chunk-based memory reallocation, memory is reallocated from the crashed VM to the recovery VM at the earliest possible moment, so that the overall memory utilization is improved. Since core dump is I/O-intensive and does not fully consume CPU, the concurrency also allows the recovery VM to utilize CPU while core dump is in progress.
The key to continuing the crashed services in the newly spawned recovery VM is that the recovery VM shares the same file system with the crashed VM, so that persistent state written by the crashed VM while it was alive is visible to the recovery VM.
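The chunk-based reallocation described above can be illustrated with a small simulation. This is a minimal sketch, not the Vicover implementation: the `VM` class, the `reallocate_in_chunks` function, and the `dump_chunk` callback are hypothetical names, and memory is modeled as a simple page count rather than real hypervisor state.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VM:
    """Toy model of a VM: only tracks how many pages it currently owns."""
    name: str
    pages: int


def reallocate_in_chunks(
    crashed: VM,
    recovery: VM,
    chunk_pages: int,
    dump_chunk: Callable[[int], None],
) -> List[int]:
    """Dump the crashed VM chunk by chunk, handing each chunk to the recovery VM.

    Returns the sizes of the chunks in the order they were moved.
    """
    if chunk_pages <= 0:
        raise ValueError("chunk_pages must be positive")
    moved = []
    while crashed.pages > 0:
        n = min(chunk_pages, crashed.pages)
        dump_chunk(n)          # 1. save the chunk to persistent storage
        crashed.pages -= n     # 2. reclaim the chunk from the crashed VM
        recovery.pages += n    # 3. add the chunk to the recovery VM
        moved.append(n)
    return moved


crashed = VM("crashed", pages=10)
recovery = VM("recovery", pages=2)
dumped: List[int] = []
moved = reallocate_in_chunks(crashed, recovery, chunk_pages=4, dump_chunk=dumped.append)
```

The point of the sketch is the ordering inside the loop: a chunk becomes available to the recovery VM as soon as it has been dumped, without waiting for the rest of the crashed VM's memory, and the total number of pages in use never grows.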
Rather than dump the whole memory of a VM, which is slow, we dump only the part of memory that is useful for analyzing the cause of the crash. Picking out the useful part requires user knowledge. Our prototype traverses the page descriptor array of the guest Linux in the crashed VM; memory pages that the page descriptors indicate as free (i.e., the reference count in the page descriptor is zero) are skipped during core dump. As a result, core dump takes much less time if there are many free pages at the moment of the crash.
To access the page descriptor array of the crashed VM at the VM management layer, we implement VM introspection of the crashed VM.
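The selection rule above (skip pages whose reference count is zero) can be sketched as a simple filter. The `PageDescriptor` class and `pages_to_dump` function below are hypothetical and only model the idea; they do not reflect the actual layout of the guest kernel's page descriptors or how introspection reads them.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PageDescriptor:
    """Toy stand-in for a guest page descriptor (assumed fields)."""
    pfn: int        # page frame number of the page it describes
    refcount: int   # zero means the page is free


def pages_to_dump(descriptors: Iterable[PageDescriptor]) -> List[int]:
    """Return the frame numbers of pages that are in use and should be dumped."""
    return [d.pfn for d in descriptors if d.refcount > 0]


descriptors = [
    PageDescriptor(pfn=0, refcount=1),
    PageDescriptor(pfn=1, refcount=0),  # free page: skipped
    PageDescriptor(pfn=2, refcount=3),
]
selected = pages_to_dump(descriptors)
```

In this toy input, one of the three pages is free, so only two are selected; the more free pages there are at crash time, the less data the dump has to write.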
We run an I/O scheduler at the VM management layer to balance disk I/O between the core dump tool and the running recovery VM according to a user-tuned policy. In essence, the policy trades off the rate at which core dump reclaims memory from the crashed VM against the QoS of the workload in the recovery VM, and a well-tuned policy should lead to the shortest possible downtime for core dump and recovery as a whole. For instance, if booting the recovery VM is as I/O-critical as core dump but is not eager to grab memory from the crashed VM, then the recovery VM can be assigned a higher I/O priority than core dump, for faster recovery and shorter overall downtime.
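One way to picture such a policy is as a set of weights that split the available disk bandwidth between the two consumers. The sketch below assumes a proportional-share scheme; the function `split_bandwidth`, the consumer names, and the numbers are illustrative assumptions, not the scheduler or the settings used in the prototype.

```python
from typing import Dict


def split_bandwidth(total_mb_per_s: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Divide disk bandwidth among consumers in proportion to their weights."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: total_mb_per_s * w / total_weight for name, w in weights.items()}


# Hypothetical policy: the recovery VM gets three times the share of core dump.
shares = split_bandwidth(100.0, {"core_dump": 1, "recovery_vm": 3})
```

Raising the weight of `recovery_vm` speeds up recovery at the cost of slower memory reclamation by core dump, which is the trade-off the user-tuned policy is meant to control.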
Our working prototype, called Vicover (“VIrtualized COre dump and recoVERy”), is based on Xen 3.3.0 with Debian Linux as the guest OS. In our experiment, which core-dumps and recovers a virtualized TPC-W server on a Dell PowerEdge R900 machine, Vicover shortens the downtime caused by crash dump by around 5X.
We propose optimizations of crash dump in virtualized environments to minimize the downtime it takes to core-dump the crashed VM and recover the crashed services. Using chunk-based memory reallocation, we core-dump the crashed VM and start a new one concurrently without requiring additional memory, while retaining persistent state by sharing the file system between the two. We also introspect the crashed VM to selectively dump the useful part of memory rather than the whole. Finally, we balance disk I/O between the concurrent core dump and recovery. Experimental results show that downtime is shortened by around 5X in our scenario.
We thank Ping-Hui Kao for his initial input that motivated this work, Rong Chen for his efforts in discussing and refining the paper, and the anonymous reviewers for their valuable comments and suggestions. This work was funded by the China National High-tech R&D Program (863 Program) under grant number 2008AA01Z138, the China National 973 Plan under grant number 2005CB321905, the Shanghai Leading Academic Discipline Project (Project Number: B114), and a research grant from Intel numbered MOE-INTEL-09-04.