COREMU is a parallel emulation framework for chip multiprocessor (CMP) systems that decouples the complexity of supporting parallel emulation from that of building a mature sequential emulator. It now also supports deterministic replay.
The continuation of Moore's Law has shifted computing into the multi-core and many-core era. Quad-core and eight-core chips are already commercially available, and it has been predicted that tens to hundreds (even thousands) of cores on a single chip will appear in the foreseeable future.
Advances in many-core hardware also make full-system emulation more important than before, because of the increasing need to develop system software before hardware is available, to characterize performance bottlenecks, and to expose and analyze software bugs (especially concurrent ones). Full-system emulation, which emulates the entire software stack including operating systems, libraries and user-level applications, is extremely useful for these purposes.
Many-core and multi-core computing also creates both challenges and opportunities for full-system emulation. On one hand, the rapidly increasing number of emulated cores requires full-system emulation to be scalable and able to handle inputs of a reasonable size. On the other hand, the abundant cores provide even more resources for full-system emulators to harness.
Unfortunately, many commodity full-system emulators are sequential and only time-slice the emulated cores on a single physical core in a round-robin fashion, or support only outdated, discontinued host and guest processor pairs. Hence, they cannot fully harness the abundant resources of current CMP architectures.
A sequential emulation design implies a linear slowdown as the number of emulated cores grows, and therefore scales poorly on current multi-core platforms. It also implies that there is limited parallelism exposed among emulated cores. This significantly restricts the use of full-system emulators for analyzing software behavior, and so sacrifices the fidelity of full-system emulation.
Building a parallel full-system emulator, however, is usually resource-intensive and takes years to mature. Full-system emulators, unlike user-mode emulators, need to model the system aspects of a computing platform, including the system ISA, address translation, privilege levels, interrupts and a set of devices.
The key observation is that CPU cores and devices in current (and likely future) multiprocessors and multi-core systems are loosely coupled, and that these cores and devices communicate through well-defined interfaces. Based on this observation, COREMU emulates multiple cores by creating multiple instances of an existing sequential emulator, and uses a thin library layer to handle inter-core and device communication and synchronization, so as to maintain a consistent view of system resources.
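The structure described above can be sketched in a few lines. This is a minimal illustration only, not COREMU's code: `EmulatedCore`, `run_machine` and the `("ipi", sender)` message format are hypothetical names, each "core" is an empty stand-in for a sequential emulator instance, and a per-core queue plays the role of the thin communication layer. Python threads do not run bytecode in parallel, so the sketch shows the organization, not the speedup.

```python
import queue
import threading


class EmulatedCore:
    """Hypothetical stand-in for one instance of a sequential emulator."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.inbox = queue.Queue()  # thin layer: pending inter-core messages
        self.received = []          # senders of the messages handled so far

    def run(self, cores):
        peers = [c for c in cores if c is not self]
        # Send one inter-processor message to every other core.
        for peer in peers:
            peer.inbox.put(("ipi", self.core_id))
        # Handle exactly one message from each peer, then stop.
        for _ in peers:
            _kind, sender = self.inbox.get()
            self.received.append(sender)


def run_machine(num_cores):
    """Run each emulated core on its own host thread and wait for all."""
    cores = [EmulatedCore(i) for i in range(num_cores)]
    threads = [threading.Thread(target=c.run, args=(cores,)) for c in cores]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cores
```

Calling `run_machine(4)` returns four cores, each of which has received one message from every other core. The cores share no state except the queues, which is the loose coupling the design relies on.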
To provide scalable performance, COREMU also incorporates several techniques to enable efficient parallel emulation. First, efficient core-to-core communication is realized through non-blocking data structures and real-time signals with adaptive signal control. Second, efficient and portable core-to-core synchronization is achieved through lightweight memory transactions, which assume only that the host architecture supports compare-and-swap (CAS) primitives and which allow the existing code generation for sequential emulation to be reused. Third, to improve the scalability of code cache management, COREMU uses a private code cache scheme and addresses excessive inter-core cache eviction through lazy cache invalidation. Finally, the core-per-thread organization enables flexible and effective dynamic load balancing. While migrating an emulated core from one thread to another is difficult, COREMU can easily bind a core thread to a different physical core. Using this technique, COREMU designs and implements a feedback-directed scheduling algorithm that dynamically maps emulated core threads to physical cores to balance the load.
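The text does not spell out how the lightweight memory transactions are built. One common way to emulate a guest atomic read-modify-write using only CAS is a retry loop: read the old value, compute the new one, and commit with CAS, retrying on conflict. The sketch below illustrates that pattern under stated assumptions and is not COREMU's implementation. Python has no CAS primitive, so `Cell.compare_and_swap` simulates one with a lock; `Cell`, `atomic_update` and `worker` are hypothetical names.

```python
import threading


class Cell:
    """A memory word with a simulated compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        # Stand-in for the atomicity a hardware CAS instruction provides.
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_update(cell, fn):
    """Apply fn to the cell atomically; return the value that was replaced."""
    while True:
        old = cell.load()
        new = fn(old)
        if cell.compare_and_swap(old, new):
            return old
        # Another core changed the cell in between: retry the transaction.


def worker(cell, n):
    """Emulated core performing n atomic increments."""
    for _ in range(n):
        atomic_update(cell, lambda v: v + 1)
```

With four threads each running `worker(cell, 1000)` on a shared `Cell(0)`, no increment is lost and the cell ends at 4000, even though each increment is computed outside any lock.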
COREMU is implemented on x64 processors and currently uses QEMU as the sequential emulator. It supports full-system emulation of up to 255 emulated x64 cores and 4 emulated ARM cores. Figure 1 shows the results of running the Word Count benchmark from the phoenix-2.0 test suite with a 100 MB input on a 16-core machine. It demonstrates that COREMU can run data-parallel applications with good performance.
COREMU is a fast and scalable full-system emulator for CMP systems. COREMU clusters multiple mature sequential emulators using a thin synchronization layer, and hence decouples the complexity of supporting parallel emulation from that of building an optimizing single-core emulator. Our experimental results show negligible uniprocessor performance overhead for CPU-intensive benchmarks. Multi-core emulation shows that COREMU scales better than a sequential emulator and is orders of magnitude faster.
Deterministic replay is a valuable tool with many applications, one of which is reproducing concurrency bugs.
Previous researchers have proposed various schemes for deterministic replay at both the application and full-system levels. A number of software-based approaches can replay a multi-threaded application with relatively low overhead. However, they cannot easily be adapted to support full-system replay efficiently, because system and device emulation in full-system emulators cannot be trivially rerun, and a large number of racy executions will significantly degrade performance.
This project, ReEmu, makes the first attempt (to our knowledge) to provide deterministic replay capability for parallel full-system emulators. Our goal is to efficiently support scalable record and replay of a relatively large number of emulated cores running the entire software stack.
The key challenge in software-based approaches to deterministic replay is recording the order of shared-memory accesses. ReEmu uses a newly designed algorithm for this that is efficient and scalable.
The algorithm maintains a version for each shared object, which is updated on each write. Each core also maintains a last-seen version for each shared object. On any memory access, if the version has changed, there have been write operations by other cores. One key observation is that the version itself provides read-after-write and write-after-write ordering.
For write-after-read ordering, notice that when a core observes a changed version (that is, when it records a read-after-write or write-after-write ordering), the last read on that core must precede a write on another core, and that is exactly the write-after-read ordering ReEmu needs to record.
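The versioning idea above can be illustrated with a single-threaded sketch that is fed a hand-written interleaving of accesses. It is a simplified model under our own assumptions, not ReEmu's data structures or log format: `VersionRecorder`, its two logs, and the `flush` step (which covers a reader that never touches the object again) are all hypothetical.

```python
class VersionRecorder:
    """Simplified, single-threaded model of version-based order recording."""

    def __init__(self):
        self.version = {}    # obj -> current version, bumped on every write
        self.last_seen = {}  # (core, obj) -> version at this core's last access
        self.last_read = {}  # (core, obj) -> op index of this core's latest read
        self.op_count = {}   # core -> number of memory operations so far
        # (core, op, obj, version): op must wait until obj reaches version.
        self.wait_log = []
        # (core, op, obj, version): the write that replaces `version`
        # must wait until `core` has finished `op`.
        self.war_log = []

    def access(self, core, obj, is_write):
        op = self.op_count.get(core, 0) + 1
        self.op_count[core] = op
        current = self.version.get(obj, 0)
        seen = self.last_seen.get((core, obj), 0)
        if current != seen:
            # Another core wrote since our last access: RAW or WAW ordering.
            self.wait_log.append((core, op, obj, current))
            # Our last read of the old version preceded that remote write.
            prev_read = self.last_read.pop((core, obj), None)
            if prev_read is not None:
                self.war_log.append((core, prev_read, obj, seen))
        if is_write:
            current += 1
            self.version[obj] = current
            self.last_read.pop((core, obj), None)
        else:
            self.last_read[(core, obj)] = op
        self.last_seen[(core, obj)] = current

    def flush(self):
        """Record WAR orderings for reads that were never followed up."""
        for (core, obj), op in list(self.last_read.items()):
            seen = self.last_seen[(core, obj)]
            if self.version.get(obj, 0) != seen:
                self.war_log.append((core, op, obj, seen))
                del self.last_read[(core, obj)]
```

For example, if core 0 reads `x`, core 1 writes `x`, and core 0 reads `x` again, the second read finds version 1 instead of 0. The sketch then logs that this read must wait for version 1, and that core 0's first read must precede the write that replaced version 0. Each core consults only its own state and the object's version, never another core's state.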
The algorithm matches the sequence lock used in the Linux kernel and is implemented using a modified sequence lock. Its benefits are that no atomic instruction is needed on the reader side and that no information from remote cores is needed when recording memory access order, so the critical section is very short. These benefits lead to the good scalability of the algorithm.
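For readers unfamiliar with sequence locks, the following sketch shows the basic (unmodified) scheme; ReEmu's modifications are not described here and are not reproduced. The `SeqLock` class and its method names are hypothetical. The writer makes the counter odd while it updates the data and even again when done. The reader takes no lock: it retries if the counter was odd or changed during the read.

```python
import threading


class SeqLock:
    """Basic sequence lock protecting a small list of values."""

    def __init__(self, data):
        self.seq = 0                      # even: stable, odd: write in progress
        self.data = list(data)
        self._writers = threading.Lock()  # serializes writers only

    def write(self, new_data):
        with self._writers:
            self.seq += 1                 # now odd: readers will retry
            for i, value in enumerate(new_data):
                self.data[i] = value
            self.seq += 1                 # even again: data is consistent

    def read(self):
        """Return (sequence number, consistent snapshot) without locking."""
        while True:
            start = self.seq
            if start % 2 == 1:
                continue                  # a write is in progress, retry
            snapshot = list(self.data)
            if self.seq == start:
                return start, snapshot    # no writer interfered
```

The sequence number returned by `read` increases by two with every completed write, so it can also serve as a per-object version of the kind the recording algorithm needs.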
ReEmu also solves some problems unique to implementing deterministic replay in a full-system emulator, such as identifying instruction locations and ordering guest page table walks.
ReEmu is implemented on top of COREMU and now supports x86_64 and ARM. Results from running the PARSEC 2.1 benchmark show that it has modest overhead compared to COREMU and scales well when running up to 16 emulated cores.
The source code (for both COREMU and ReEmu), a Linux system image file and the user guide can be found at http://sourceforge.net/p/coremu/home/