Elsevier

Parallel Computing

Volume 40, Issue 2, February 2014, Pages 136-156

X10-FT: Transparent fault tolerance for APGAS language and runtime

https://doi.org/10.1016/j.parco.2013.11.006

Highlights

  • We make the first attempt to add fault tolerance support to the APGAS programming model.

  • X10-FT leverages established advances in distributed systems, such as distributed file systems (DFS) and Paxos.

  • In most cases, the X10-FT framework is transparent to programmers because it builds on existing X10 language constructs.

  • We implement a prototype. The evaluation of four practical benchmarks shows that X10-FT has modest overhead.

Abstract

The asynchronous partitioned global address space (APGAS) model is a programming model that aims to unify programming on multicore machines and clusters while maintaining good productivity. However, it currently lacks support for fault tolerance (FT), so a single transient failure may render hours or even months of computation useless.

In this paper, we thoroughly analyze the feasibility of providing fault tolerance for the APGAS model and make the first attempt to add fault tolerance support to an APGAS language called X10. Based on the analysis, we design and implement a fault-tolerance framework called X10-FT. It leverages renowned techniques from distributed systems, such as distributed file systems and Paxos, together with specific solutions based on the characteristics of the APGAS model, to take checkpoints and reach consensus. This allows the system to transparently handle machine failures at different granularities. Using the features of the APGAS model, we extend the X10 compiler to automatically locate execution points at which to checkpoint program states without any intervention from programmers. Evaluation using a set of benchmarks shows that the cost of fault tolerance is modest.

Introduction

The emergence of multicore machines has made exploiting parallelism a necessity to harness the abundant computing resources in both a single machine and clusters. This, however, may hinder programming productivity as multi-threaded and distributed programming are hard to use correctly and concurrency/distributed bugs are hard to spot.

The asynchronous partitioned global address space (APGAS) model [1] attempts to ease programming on both clusters and multicore machines. This model is an extension of PGAS [2]. PGAS abstracts a platform as a global yet partitioned address space, where each entity (e.g., a core or a machine) has its own portion of the address space, yet can directly access other portions using special language constructs. APGAS extends the PGAS model with “asynchrony”, supports heterogeneous hardware, and intends to provide programmers with higher productivity. Specifically, APGAS introduces two concepts, the Place and the Async, which allow tasks to be spawned dynamically and bridge machine heterogeneity. Because of the underlying heterogeneous hardware, APGAS has a richer execution framework than the SPMD style generally used by PGAS. PGAS programs typically run a single thread per task and use barriers to synchronize all tasks. The APGAS model, in contrast, allows each node to execute multiple tasks from a task pool and allows nodes to invoke work on other nodes. It also provides richer language constructs and a more powerful compiler and runtime than PGAS to express asynchrony and improve productivity. With these constructs, well-structured parallelism can be expressed in simpler and clearer user code. A recent embodiment of the APGAS model is the X10 language [3], which hides underlying machine heterogeneity from users and allows them to conveniently write multi-threaded programs that can be executed in cluster environments. A number of other programming models, including MapReduce [4], can be easily expressed in X10 [5].

Unfortunately, although the APGAS model has the potential to deliver both performance and productivity, there is currently no support for fault tolerance in known APGAS languages and runtimes. Providing fault tolerance (FT) is important because many current computation tasks (like those in HPC) run for several days or even months on thousands of cores. Hence, even a small failure in a small component may render the whole computation meaningless and require restarting the failed task or even all tasks. As the scale of machines increases, the potential error rate grows as well [6], which makes the problem even more serious.

Although there is some previous work on providing FT in the PGAS model, such work cannot be easily applied to X10. There are two policies for providing FT support in PGAS: computation redundancy and storage replication, such as data checkpoints. In X10-FT, we use the classic checkpoint-recovery method. Compared with PGAS, FT in APGAS presents both new challenges and new opportunities. The main challenge is handling the complicated state when building checkpoints. PGAS programs usually follow the SPMD style: they are typically single-threaded and use barriers to synchronize all processors. Hence, in PGAS the state that needs to be recorded is simple and can easily be captured consistently at barriers. APGAS provides more asynchrony: multiple asynchronous tasks may execute on each node, and one node can invoke work on other nodes. As a result, the points at which checkpoints are consistent are hard to find. We extend the X10 runtime to solve this problem with runtime checks. The opportunity is that, because APGAS aims to provide high productivity, it offers richer language features and a more powerful compiler and runtime. This leads to clear and simple user code with structured parallelism, which lets the compiler perform some analyses automatically and makes the checkpoint code transparent to programmers.

In this paper, we present the first comprehensive analysis of providing fault tolerance to an APGAS-based language and runtime using the checkpoint-recovery method, with X10 as an example. The goal is to see how renowned techniques in distributed systems and APGAS-specific features can help provide reliable and efficient fault-tolerant computation in the APGAS model.

As a typical APGAS language, X10 has a limited set of features related to fault tolerance. First, in each place, X10 maintains an exception system similar to Java's: any exception in user code is caught by the X10 runtime. Hence, we can focus mainly on making the X10 runtime fault tolerant, while leaving faults in user code to be handled by the X10 runtime. Second, the APGAS model ensures that a global variable is accessed only in its home place, which owns the variable during the whole execution. Local data shared among tasks are accessed through separate copies with the help of the X10 runtime. Hence, it is possible to selectively redo a task by eliminating that task's side effects on the system's global variables. Finally, there is a set of explicit synchronization primitives, such as finish, collecting-finish and at (P). These primitives help switch the execution flow between user code and runtime code, so that fault detection and recovery can be implemented mainly in the X10 runtime with little or no modification to user code.
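To make the role of the synchronization primitives more concrete, the following is a rough Python analogue of a finish block, not X10 code and not the X10 runtime. The class `Finish`, its `spawn` method, and the `on_complete` hook are hypothetical names introduced only for illustration. The sketch shows how control returns to "runtime" code once every spawned task has terminated, which is the kind of point at which a runtime could run its own logic without changes to user code.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class Finish:
    """Rough analogue of a finish block (hypothetical; not the X10 runtime)."""

    def __init__(self, on_complete=None):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._futures = []
        # Hypothetical hook standing in for runtime code run at finish exit.
        self._on_complete = on_complete

    def spawn(self, fn, *args):
        """Analogue of an async: start a task governed by this finish."""
        self._futures.append(self._pool.submit(fn, *args))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Block until every spawned task has terminated.
        wait(self._futures)
        self._pool.shutdown()
        if exc_type is None:
            for future in self._futures:
                future.result()  # re-raise any exception from a task
            if self._on_complete is not None:
                self._on_complete()
        return False


def count_words(lines):
    """User code: count words per line in parallel, then sum the counts."""
    counts = [0] * len(lines)
    hook_calls = []

    def count(i):
        counts[i] = len(lines[i].split())

    with Finish(on_complete=lambda: hook_calls.append("finish-exit")) as f:
        for i in range(len(lines)):
            f.spawn(count, i)
    return sum(counts), hook_calls
```

In this sketch the user function `count_words` never mentions the hook's behavior; it only uses the block structure. Tasks that spawn further tasks while the block is exiting are not handled here.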

X10-FT leverages the language features of APGAS and combines them with known fault-tolerance techniques from distributed systems. Based on key primitives such as finish, we modify the X10 compiler to automatically insert checkpoint code into user code. X10-FT also adds an analysis pass to the X10 compiler to identify the variables that need to be recorded in each checkpoint. For flexibility, users can also manually mark the exact variables to be recorded using compiler annotations. To reliably store checkpoints for recovery, X10-FT seamlessly incorporates a distributed file system (DFS) engine into the X10 runtime. X10-FT leverages the Paxos [7] consensus protocol to reliably detect possible node failures or network partitions. When failures are detected, X10-FT resumes the disrupted activities by rebuilding the failed place on another node, recovering the place state from the latest valid checkpoints in the DFS, and changing the control flow of these activities to that recorded in the checkpoint so that they continue their execution.
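The checkpoint-recovery flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the X10-FT implementation: `CheckpointStore` is an in-memory stand-in for a distributed file system, `PlaceFailure` simulates a failed place, and all function names and the `fail_at` parameter are hypothetical.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for a reliable store such as a DFS (assumption)."""

    def __init__(self):
        self._records = {}

    def save(self, place_id, next_step, state):
        # Deep-copy so later mutation of live state cannot alter the record.
        self._records[place_id] = (next_step, copy.deepcopy(state))

    def latest(self, place_id):
        record = self._records.get(place_id)
        if record is None:
            return 0, None  # no checkpoint yet: restart from the beginning
        next_step, state = record
        return next_step, copy.deepcopy(state)


class PlaceFailure(Exception):
    """Simulated failure of a place (hypothetical)."""


def run_place(place_id, chunks, store, start_step=0, state=None, fail_at=None):
    """Sum the chunks step by step, checkpointing after each completed step."""
    state = {"total": 0} if state is None else state
    for step in range(start_step, len(chunks)):
        if fail_at is not None and step == fail_at:
            raise PlaceFailure(f"place {place_id} failed before step {step}")
        state["total"] += sum(chunks[step])
        # Record both the data and where to resume (the control flow).
        store.save(place_id, step + 1, state)
    return state["total"]


def run_with_recovery(place_id, chunks, store, fail_at=None):
    """Run a place; on failure, rebuild it from the latest checkpoint."""
    try:
        return run_place(place_id, chunks, store, fail_at=fail_at)
    except PlaceFailure:
        next_step, state = store.latest(place_id)
        return run_place(place_id, chunks, store,
                         start_step=next_step, state=state)
```

The checkpoint stores the index of the next step alongside the data, so a rebuilt place resumes where the failed one stopped instead of recomputing completed steps.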

Currently, we have implemented a working prototype of X10-FT, including extensions to the X10 compiler and runtime, incorporation of the Hadoop Distributed File System (HDFS) [8] for storing checkpoints, and ZooKeeper [9] for reaching consensus on failures. This prototype can run real-world X10 programs with fault-tolerance support. To evaluate the overhead of providing fault tolerance in X10, we use four benchmarks: WordCount [4], a typical map-reduce application in distributed systems; SSCA#1 [10], a bioinformatics application for optimal pattern matching that stresses integer and character operations; and two benchmarks from the HPC Challenge benchmark suite [11], Global RandomAccess and STREAM. Our evaluation shows that X10-FT can successfully recover the tested programs with only modest performance overhead.

This paper makes the following contributions:

  • The first framework that provides APGAS programs with efficient fault-tolerance support.

  • A combination of renowned distributed-systems techniques, such as Paxos and DFS, with APGAS features in the X10 runtime and compiler to find points at which to take consistent checkpoints and to provide fault tolerance that is transparent to programmers.

  • A detailed implementation of X10-FT.

  • A detailed evaluation and analysis of overhead using X10-FT with different kinds of benchmarks.

The rest of this paper is organized as follows. Section 2 discusses work related to X10-FT, especially work using the PGAS model. Section 3 introduces the background of X10 and fault tolerance. Sections 4 and 5 present the design and implementation of X10-FT. Section 6 evaluates the effectiveness and overhead of fault tolerance using X10-FT. Finally, we conclude this paper with a brief remark on our future work.

Related work

Many fault-tolerance techniques have been developed to help applications in the HPC domain recover efficiently from frequent failures. The most popular rollback-recovery approach in HPC environments is checkpoint/restart [12]. Other techniques include the algorithm-based fault tolerance (ABFT) approach [13], [14], the proactive fault tolerance approach [15], [16], and so on. The checkpoints used for recovery can be diskless [17], [18], and can also be taken at the user or kernel level. The

Background

This section reviews the features of the X10 runtime and the key X10 language constructs that are related to X10-FT. We use the WordCount [4] program as an example. This section also briefly discusses fault-tolerance techniques in distributed systems.

Design

X10-FT adds fault tolerance support to the APGAS model in several respects. First, it improves the original X10's ability to detect view changes (failed places). The detectors should themselves be distributed, in order to avoid a single point of failure. Each place connects to one of the detectors using heartbeat messages, which allow failures to be detected in a timely manner. X10-FT incorporates the Paxos [7] protocol to reach consensus on the current view of places among detectors when there

Implementation

Currently, we have implemented a working prototype of X10-FT based on X10 version 2.2.1. This prototype incorporates ZooKeeper [9] to detect failures and integrates HDFS [8] beneath the X10 layer to reliably save the checkpoint records for recovery. We have added place-rebuilding support to the X10 runtime. We have also enhanced both the X10 compiler and the X10 runtime so that X10 programs can record and recover their state, namely the messages, the intermediate data, and the control flow.
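As a simplified illustration of heartbeat-based failure detection, the sketch below shows a single detector that suspects a place once its heartbeat is overdue. It is an assumption-laden toy and does not use the ZooKeeper API: the class `HeartbeatDetector`, its methods, and the `timeout` parameter are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and agreement among multiple detectors is omitted.

```python
class HeartbeatDetector:
    """Toy single-node failure detector (hypothetical; not ZooKeeper)."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds without a heartbeat before suspicion
        self.last_seen = {}     # place id -> time of the last heartbeat

    def heartbeat(self, place_id, now):
        """Record that a place reported in at time `now`."""
        self.last_seen[place_id] = now

    def failed_places(self, now):
        """Return the places whose last heartbeat is older than the timeout."""
        return sorted(
            place_id
            for place_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat(0, now=0.0)
detector.heartbeat(1, now=0.0)
detector.heartbeat(0, now=2.0)  # place 1 sends nothing further
suspected = detector.failed_places(now=4.0)
```

A timeout alone cannot distinguish a crashed place from a slow or partitioned one, which is one reason a real deployment needs detectors to agree on the view before triggering recovery.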

Evaluation

This section presents the evaluation of the performance overhead of adding fault-tolerance support to X10. We evaluate four benchmarks on a small cluster with 6 servers. These benchmarks are used to dissect the performance overhead of X10-FT. They represent different execution characteristics of X10 programs and have different kinds of checkpoint records. The WordCount benchmark is written in X10 according to the WordCount algorithm presented in the MapReduce paper [4]. The other three

Conclusion and future work

In this paper, we analyzed the language features and execution behaviors of a typical APGAS language, X10. Based on the analysis, we described the X10-FT framework that extends the APGAS model with fault tolerance support. We proposed a design to add fault-tolerance features to X10. Our design uses DFS and the Paxos model, along with record and recovery solutions based on concepts of Place and Async from the APGAS model.

We described our implementation of the X10-FT framework, which used

References (42)

  • G. Bosilca et al.

    Algorithm-based fault tolerance applied to high performance computing

    J. Parallel Distrib. Comput.

    (2009)
  • A. Beguelin et al.

    Application level fault tolerance in heterogeneous networks of workstations

    J. Parallel Distrib. Comput.

    (1997)
  • V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, D. Cunningham, D. Grove, S. Kodali, I. Peshansky, O. Tardieu, The...
  • C. Coarfa et al.

    An evaluation of global address space languages: Co-Array Fortran and Unified Parallel C

  • P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. Von Praun, V. Sarkar, X10: an...
  • J. Dean et al.

    MapReduce: simplified data processing on large clusters

    Commun. ACM

    (2008)
  • C. Zhang et al.

    Evaluating the performance and scalability of MapReduce applications on X10

  • B. Schroeder et al.

    Understanding failures in petascale computers

  • L. Lamport

    Paxos made simple

    ACM Sigact News

    (2001)
  • K. Shvachko et al.

    The Hadoop distributed file system

  • P. Hunt, M. Konar, F.P. Junqueira, B. Reed, Zookeeper: wait-free coordination for internet-scale systems, in: Proc....
  • D.A. Bader et al.

    Designing scalable synthetic compact applications for benchmarking high productivity computing systems

    Cyberinfrastructure Technol. Watch

    (2006)
  • P.R. Luszczek et al.

    The HPC challenge (HPCC) benchmark suite

  • E.N. Elnozahy et al.

    A survey of rollback-recovery protocols in message-passing systems

    ACM Comput. Surv. (CSUR)

    (2002)
  • K.-H. Huang et al.

    Algorithm-based fault tolerance for matrix operations

    IEEE Trans. Comput.

    (1984)
  • C. Engelmann et al.

    Proactive fault tolerance using preemptive migration

  • C. Wang et al.

    Proactive process-level live migration in HPC environments

  • J.S. Plank et al.

    Faster checkpointing with n+1 parity

  • J.S. Plank et al.

    Diskless checkpointing

    IEEE Trans. Parallel Distrib. Syst.

    (1998)
  • C.-C. Li et al.

    CATCH: compiler-assisted techniques for checkpointing

  • Q. Gao et al.

    Application-transparent checkpoint/restart for MPI programs over InfiniBand
