Replication-based Fault-tolerance for Large-scale Graph Processing

The increasing algorithm complexity and dataset sizes necessitate the use of networked machines for many graph-parallel algorithms, which also makes fault tolerance a must due to the increasing scale of machines. Unfortunately, existing large-scale graph-parallel systems usually adopt a distributed checkpoint mechanism for fault tolerance, which incurs not only notable performance overhead but also lengthy recovery time.
We observe that the vertex replicas created for distributed graph computation can be naturally extended for fast in-memory recovery of graph states. We propose Imitator, a new fault tolerance mechanism, which supports cheap maintenance of vertex states by replicating them to their replicas during normal message exchanges, and provides fast in-memory reconstruction of failed vertices from replicas in other machines. Imitator has been implemented by extending Hama, a popular open-source clone of Pregel. Evaluation shows that Imitator incurs negligible performance overhead (less than 5% for all cases) and can recover from failures of more than one million of vertices with less than 3.4 seconds.

If you use Imitator in your work or research, please let us know about it. We also encourage you to reference our paper.

The Imitator source code is distributed with a Apache license 2.0. The copyright is held by Shanghai Jiao Tong University. Imitator is provided "as is" without any warranties and conditions of any kind.
The latest release of Imitator (~1.6MB) is based on Hama 0.6.0.
Once you've downloaded Imitator, you can find a simple README file, and the further instructions for installing and executing it on the Hama's Documentation page.


Contact Information
Please contact Peng Wang or Rong Chen for further information. Welcome to let us know about bug reports and suggested improvements, while we may not be able to provide technical support timely. If you make some correction or improvement to Imitator, we would appreciate receiving a patch that we can include in the next release.

Related Projects
PowerLyra: differentiated Graph Computation and Partitioning on Skewed Graphs.
Cyclops: Computation and Communication Efficient Graph Processing with Distributed Immutable View.
PowerSwitch: adaptive prediction and mode switch (synchronous and asynchronous) on parallel graph computation.

This page has been visited Website Hit Counter times since 5/21/2014.
Stats provided by the Easy Counter