X10-FT: Transparent fault tolerance for APGAS language and runtime
Introduction
The emergence of multicore machines has made exploiting parallelism a necessity for harnessing the abundant computing resources of both single machines and clusters. This, however, may hinder programming productivity, because multi-threaded and distributed programs are hard to write correctly, and concurrency and distribution bugs are hard to spot.
The asynchronous partitioned global address space (APGAS) model [1] attempts to ease programming on both clusters and multicore machines. This model is an extension of PGAS [2]. PGAS abstracts a platform as a global yet partitioned address space, where each entity (e.g., a core or a machine) has its own portion of the address space, yet can directly access other portions using special language constructs. APGAS extends the PGAS model with “asynchrony”, supports heterogeneous hardware, and aims to give programmers higher productivity. Specifically, APGAS introduces two concepts, the Place and the Async, which allow tasks to be spawned dynamically and bridge machine heterogeneity. Because of the underlying heterogeneous hardware, APGAS has a richer execution framework than the SPMD style generally used by PGAS. In the PGAS model, each task is typically single-threaded, and barriers synchronize all tasks. The APGAS model instead allows each node to execute multiple tasks from a task pool and allows nodes to invoke work on other nodes. It also provides richer language constructs and a more powerful compiler and runtime than PGAS to express asynchrony and improve productivity. With these constructs, well-structured parallelism can be expressed in simpler and clearer user code. A recent embodiment of the APGAS model is the X10 language [3], which hides the underlying machine heterogeneity and lets users conveniently write multi-threaded programs that run in cluster environments. A number of other programming models, including MapReduce [4], can be easily expressed in X10 [5].
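To make the Place and Async concepts concrete, the following is a minimal Python sketch that mimics them with threads. It is not X10 code and not the X10 runtime: the names `Place`, `Finish`, `async_at`, and `count_words` are hypothetical, and a "place" here is just an object that owns a private dictionary. The sketch only illustrates the idea of spawning activities at places and waiting for all of them at the end of a finish-like block.

```python
import threading

class Place:
    """Hypothetical stand-in for an APGAS place: a partition that owns its data."""
    def __init__(self, place_id):
        self.id = place_id
        self.data = {}                # state owned by this place only
        self.lock = threading.Lock()  # several activities may run in one place

class Finish:
    """Finish-like block: leaving it waits for every activity spawned inside."""
    def __init__(self):
        self._threads = []

    def async_at(self, place, fn, *args):
        # Spawn an activity that runs fn "at" the given place.
        t = threading.Thread(target=fn, args=(place, *args))
        self._threads.append(t)
        t.start()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Simplification: activities spawned by other activities are not tracked.
        for t in self._threads:
            t.join()
        return False

def count_words(place, chunk):
    # Each activity only touches the data of the place it runs at.
    with place.lock:
        for word in chunk.split():
            place.data[word] = place.data.get(word, 0) + 1

places = [Place(i) for i in range(2)]
chunks = ["a b a", "b c", "a c c"]

with Finish() as f:
    for i, chunk in enumerate(chunks):
        f.async_at(places[i % len(places)], count_words, chunk)

# All activities have terminated here, so merging the per-place results is safe.
totals = {}
for p in places:
    for word, n in p.data.items():
        totals[word] = totals.get(word, 0) + n
```

The point of the design is that the only synchronization the user writes is the block boundary; where and when each activity runs is left to the runtime.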
Unfortunately, although the APGAS model has the potential to deliver both performance and productivity, known APGAS languages and runtimes currently offer no support for fault tolerance. Providing fault tolerance (FT) is important because many current computation tasks (such as those in HPC) run for several days or even months on thousands of cores. Hence, even a failure in one small component may render the whole computation meaningless and force a restart of the failed task or even of all tasks. As machines scale up, the potential error rate grows as well [6], which makes the problem even more serious.
Although there is some previous work on providing FT in the PGAS model, such work cannot be easily used in X10. There are two policies for providing FT support in PGAS: one uses computation redundancy, and the other uses storage replication, such as data checkpoints. In X10-FT, we use the classic checkpoint-recovery method. Compared with PGAS, FT in APGAS brings both new challenges and new opportunities. The main challenge is handling the complicated state when building checkpoints. PGAS programs are usually written in SPMD style: each task is typically single-threaded, and barriers synchronize all processors. Hence in PGAS, the state that needs to be recorded is simple and can easily be captured consistently at barriers. APGAS provides more asynchrony: multiple asynchronous tasks may execute on each node, and one node can invoke work on other nodes, so the points at which checkpoints are consistent are hard to find. We extend the X10 runtime to find such points through runtime checks. Meanwhile, because APGAS aims at high productivity, it offers richer language features, and the X10 compiler and runtime are more powerful. This leads to clear and simple user code with structured parallelism, which lets the compiler perform some analyses automatically and makes the checkpoint code transparent to programmers.
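One way to picture a runtime check for a consistent checkpoint point is a quiescence test: a checkpoint request is deferred while any activity is still running in the place and is served once the last one ends. The Python sketch below illustrates only this idea under that assumption. The class `PlaceRuntime` and its methods are hypothetical and are not taken from X10-FT, whose actual checks are not described in this section.

```python
import copy

class PlaceRuntime:
    """Hypothetical per-place runtime that checkpoints only when quiescent."""
    def __init__(self):
        self.state = {}        # application state owned by this place
        self.active = 0        # number of activities currently running
        self.pending = False   # a checkpoint was requested but deferred
        self.checkpoints = []  # in-memory stand-in for stable storage

    def begin_activity(self):
        self.active += 1

    def end_activity(self):
        self.active -= 1
        if self.pending:
            self._try_checkpoint()

    def request_checkpoint(self):
        self.pending = True
        return self._try_checkpoint()

    def _try_checkpoint(self):
        # Runtime check: state is only consistent when no activity is mid-flight.
        if self.active > 0:
            return False
        self.checkpoints.append(copy.deepcopy(self.state))
        self.pending = False
        return True

rt = PlaceRuntime()
rt.begin_activity()
rt.state["x"] = 1
taken_early = rt.request_checkpoint()  # deferred: an activity is still running
rt.state["x"] = 2
rt.end_activity()                      # last activity ends, checkpoint is taken
rt.state["x"] = 3                      # later updates do not alter the snapshot
```

In this sketch the snapshot records `x = 2`, the value at the moment the place became quiescent, rather than the intermediate value seen when the checkpoint was first requested.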
In this paper, we present the first comprehensive analysis of providing fault tolerance to an APGAS-based language and runtime with the checkpoint-recovery method, using X10 as an example. The goal is to see how well-known techniques from distributed systems and APGAS-specific features can help provide reliable and efficient fault-tolerant computation in the APGAS model.
As a typical APGAS language, X10 has a limited set of features related to fault tolerance. First, in each place, X10 maintains an exception system similar to Java's, and any exception in the user code is caught by the X10 runtime. Hence, we can focus mainly on making the X10 runtime fault tolerant, while leaving faults in user code to be handled by the X10 runtime. Second, the APGAS model ensures that a global variable is accessed only in its home place, which owns the variable during the whole execution. Local data are passed among tasks as separate copies with the help of the X10 runtime. Hence, it is possible to selectively redo a task by eliminating that task's side effects on the global variables of the system. Finally, there is a set of explicit synchronization primitives, such as finish, collecting-finish and at (P). These primitives help switch the execution flow between user code and runtime code, so that fault detection and recovery can be implemented mainly in the X10 runtime with little or no modification to user code.
X10-FT leverages the language features of APGAS and combines them with known fault-tolerance techniques from distributed systems. Based on key primitives such as finish, we modify the X10 compiler to automatically insert checkpoint code into user code. X10-FT also adds an analysis pass to the X10 compiler to identify the variables that need to be recorded in each checkpoint. In addition, users retain the flexibility to mark manually, through compiler annotations, the precise variables that need to be recorded. To reliably store checkpoints for recovery, X10-FT seamlessly incorporates a distributed file system (DFS) engine into the X10 runtime. X10-FT leverages the Paxos [7] consensus protocol to reliably detect possible node failures or network partitions. When failures are detected, X10-FT resumes the disrupted activities by rebuilding the failed place on another node, recovering the place state from the latest valid checkpoints in the DFS, and changing the control flow of these activities to that recorded in the checkpoint so that they continue their execution.
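The recovery step described above (rebuild the failed place on another node, then restore its state from the latest checkpoint) can be sketched in a few lines of Python. This is an illustration only: the function `recover`, the dictionary-based view of places, and the in-memory `checkpoints` mapping are all hypothetical, and the set of failed nodes is simply passed in rather than agreed on through Paxos.

```python
def recover(view, failed_nodes, spare_nodes, checkpoints):
    """Move places off failed nodes and restore their latest checkpoint.

    view:        dict mapping place id -> node name (hypothetical layout)
    checkpoints: dict mapping place id -> list of (version, state) pairs
    Returns (new_view, restored), where restored maps place id -> (version, state).
    """
    spares = list(spare_nodes)
    new_view = dict(view)
    restored = {}
    for place_id, node in sorted(view.items()):
        if node not in failed_nodes:
            continue  # healthy places keep their node and their live state
        if not spares:
            raise RuntimeError("no spare node left for place %d" % place_id)
        new_view[place_id] = spares.pop(0)  # rebuild the place elsewhere
        version, state = max(checkpoints[place_id], key=lambda c: c[0])
        restored[place_id] = (version, dict(state))
    return new_view, restored

view = {0: "n0", 1: "n1", 2: "n2"}
checkpoints = {1: [(1, {"count": 5}), (2, {"count": 9})]}
new_view, restored = recover(view, {"n1"}, ["n3"], checkpoints)
```

Here place 1 moves from the failed node `n1` to the spare `n3` and resumes from version 2 of its checkpoint, while places 0 and 2 are left untouched.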
Currently, we have implemented a working prototype of X10-FT, including extensions to the X10 compiler and runtime, and the incorporation of the Hadoop Distributed File System (HDFS) [8] for storing checkpoints and ZooKeeper [9] for reaching consensus on failures. This prototype can run real-world X10 programs with fault-tolerance support. To evaluate the overhead of providing fault tolerance in X10, we use WordCount [4], a typical MapReduce application in distributed systems; SSCA#1 [10], a bioinformatics application for optimal pattern matching that stresses integer and character operations; and two benchmarks from the HPC Challenge benchmark suite [11], Global RandomAccess and STREAM. Our evaluation shows that X10-FT can successfully recover the tested programs with only modest performance overhead.
This paper makes the following contributions:
- The first framework that provides APGAS programs with efficient fault-tolerance support.
- A combination of well-known distributed-systems techniques, such as Paxos and DFS, with APGAS features in the X10 runtime and compiler to find points at which checkpoints are consistent and to provide fault tolerance that is transparent to programmers.
- A detailed implementation of X10-FT.
- A detailed evaluation and analysis of the overhead of X10-FT on different kinds of benchmarks.
The rest of this paper is organized as follows. Section 2 discusses work related to X10-FT, especially work based on the PGAS model. Section 3 introduces the background of X10 and fault tolerance. Sections 4 and 5 present the design and implementation of X10-FT. Section 6 evaluates the effectiveness and overhead of fault tolerance using X10-FT. Finally, we conclude this paper with a brief remark on our future work.
Section snippets
Related work
Many fault-tolerance techniques have been developed to help applications in the HPC domain recover efficiently from frequent failures. The most popular rollback-recovery approach in HPC environments is checkpoint/restart [12]. Other techniques include the algorithm-based fault tolerance (ABFT) approach [13], [14], the proactive fault tolerance approach [15], [16], and so on. The checkpoints used for recovery can be diskless [17], [18], and can be taken at either the user level or the kernel level. The
Background
This section reviews some of the features of X10 runtime and the key X10 language constructs, which are related to X10-FT. We use the WordCount [4] program as an example. This section also briefly discusses fault-tolerance techniques in distributed systems.
Design
X10-FT adds fault tolerance support to the APGAS model in several aspects. First, it improves on the original X10's ability to detect view changes (failed places). The detectors are themselves distributed so that they survive single points of failure. Each place connects to one of the detectors and sends it heartbeat messages, which allow failures to be detected in time. X10-FT incorporates the Paxos [7] protocol to reach consensus on the current view of places among detectors when there
Implementation
Currently, we have implemented a working prototype of X10-FT based on X10 version 2.2.1. This prototype incorporates ZooKeeper [9] to detect failures and integrates HDFS [8] beneath the X10 layer to reliably save the checkpoint records for recovery. We have added place-rebuilding support to the X10 runtime. We have also enhanced both the X10 compiler and the X10 runtime so that X10 programs can record and recover their state, namely the messages, the intermediate data, and the control flow.
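As an illustration of what a checkpoint record holding these three kinds of state might look like, the Python sketch below writes records with a checksum and reads back the newest record that still verifies. Everything in it is an assumption made for the example: a local directory stands in for HDFS, the file naming scheme and the field names `messages`, `data`, and `resume_point` are hypothetical, and the SHA-256 check is just one possible way to decide whether a checkpoint is valid.

```python
import hashlib
import json
import os
import tempfile

def save_checkpoint(directory, place_id, version, messages, data, resume_point):
    """Write one record; a local directory stands in for HDFS (assumption)."""
    record = {"messages": messages, "data": data, "resume_point": resume_point}
    payload = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path = os.path.join(directory, "place%d.v%06d.ckpt" % (place_id, version))
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(digest + "\n" + payload)
    os.replace(tmp, path)  # publish the file only once it is fully written
    return path

def load_latest_valid(directory, place_id):
    """Return the newest record whose checksum verifies, or None."""
    prefix = "place%d.v" % place_id
    names = sorted((n for n in os.listdir(directory)
                    if n.startswith(prefix) and n.endswith(".ckpt")),
                   reverse=True)  # zero-padded versions sort newest first
    for name in names:
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            digest, _, payload = f.read().partition("\n")
        if hashlib.sha256(payload.encode("utf-8")).hexdigest() == digest:
            return json.loads(payload)
    return None

with tempfile.TemporaryDirectory() as d:
    save_checkpoint(d, 0, 1, [], {"count": 5}, "finish_1")
    newest = save_checkpoint(d, 0, 2, ["msg"], {"count": 9}, "finish_2")
    with open(newest, "a", encoding="utf-8") as f:
        f.write("garbage")  # simulate a damaged newest checkpoint
    recovered = load_latest_valid(d, 0)
```

Because version 2 fails its checksum, the loader falls back to version 1, so recovery would resume from the hypothetical `finish_1` point with `count` equal to 5.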
Evaluation
This section presents the evaluation of the performance overhead of adding fault-tolerance support to X10. We evaluate four benchmarks on a small cluster with 6 servers. These benchmarks are used to dissect the performance overhead of X10-FT. They represent different execution characteristics of X10 programs and have different kinds of checkpoint records. The WordCount benchmark is written in X10 according to the WordCount algorithm presented in the MapReduce paper [4]. The other three
Conclusion and future work
In this paper, we analyzed the language features and execution behaviors of a typical APGAS language, X10. Based on the analysis, we described the X10-FT framework that extends the APGAS model with fault tolerance support. We proposed a design to add fault-tolerance features to X10. Our design uses DFS and the Paxos model, along with record and recovery solutions based on concepts of Place and Async from the APGAS model.
We described our implementation of the X10-FT framework, which used
References (42)
- et al., Algorithm-based fault tolerance applied to high performance computing, J. Parallel Distrib. Comput. (2009)
- et al., Application level fault tolerance in heterogeneous networks of workstations, J. Parallel Distrib. Comput. (1997)
- V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, D. Cunningham, D. Grove, S. Kodali, I. Peshansky, O. Tardieu, The...
- et al., An evaluation of global address space languages: Co-Array Fortran and Unified Parallel C
- P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, V. Sarkar, X10: an...
- et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
- et al., Evaluating the performance and scalability of MapReduce applications on X10
- et al., Understanding failures in petascale computers
- Paxos made simple, ACM Sigact News (2001)
- et al., The Hadoop distributed file system
- Designing scalable synthetic compact applications for benchmarking high productivity computing systems, Cyberinfrastructure Technol. Watch
- The HPC challenge (HPCC) benchmark suite
- A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surv. (CSUR)
- Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput.
- Proactive fault tolerance using preemptive migration
- Proactive process-level live migration in HPC environments
- Faster checkpointing with n+1 parity
- Diskless checkpointing, IEEE Trans. Parallel Distrib. Syst.
- Catch-compiler-assisted techniques for checkpointing
- Application-transparent checkpoint/restart for MPI programs over InfiniBand