POLUS

A POwerful Live Updating System

The Problems

The scale of software has increased dramatically over the past two decades, and so has the number of bugs and security vulnerabilities. Despite progress in software engineering, with better programming support, improved development models and more effective testing tools, software is still far from perfect, and this trend is likely to continue. Consequently, there has been an increasing number of software updates to fix bugs, close vulnerabilities and add new features.

Unfortunately, traditional software update approaches usually involve stopping the running software, applying the updates and restarting the software. Such stop-and-restart approaches inevitably disrupt running services and thus decrease software availability. For example, one previous study indicated that 75% of about 6000 outages in highly available applications were caused by hardware and software maintenance. Such interruption of service is hardly affordable for many mission-critical systems, such as air control systems, credit card authorization and brokerage operations, which demand highly dependable services that are available 24×7.

Dynamic updating, or live updating, is a promising software maintenance technique that aims to remedy this situation while being much cheaper and less complex than hardware-based approaches such as hot/cold standby. By allowing running systems to be updated on-the-fly without service disruption, such techniques have gained considerable interest and popularity among both researchers and practitioners. Nevertheless, few dynamic updating systems are powerful enough to support the rich semantics of modern complex applications. For example, few of them can support updates to widely used multithreaded systems when the updates involve changes to data. Further, to the best of our knowledge, there is still no effective mechanism to roll back already committed updates or to fix already tainted state in currently running software.

Approaches

Being aware of the difficulties in update-point based approaches, which can apply an update only at designated points in the program, we use a different approach that allows an update to be applied at any time. Our key idea is to allow the old and the new versions of data to coexist, and to maintain their coherence by calling state synchronization functions whenever there is a write access to either version of the data. During the patch process, POLUS write-protects both versions of the data using the debugging API provided by the operating system. This API allows one process to gain control over another process and to track write accesses to the protected data through the signal mechanism (catching and checking the SIGSEGV signal). Once no function is manipulating the old version of the data, the update process can be safely terminated. We designed POLUS to meet the criteria that we believe are required of dynamic software updating today:

Binary Compatibility: Instead of using program transformation or reconstruction to make a program updatable, POLUS utilizes the debugging API to gain control over the patching process and to modify the state of the running program. In addition, not relying on update points eliminates many constraints on the types of admissible updates and increases the flexibility of updates.

Multithreading Support: The difficulty in dealing with multithreaded software lies in the fact that several threads may concurrently access the data to be updated. As mentioned earlier, we discard the update-point based approach and instead allow an update to be applied immediately. POLUS tracks write attempts to either version of the data and uses state synchronization functions to keep the old and the new data consistent.

Recovery of Tainted State: In our experience, some running software may already be in a tainted state due to internal software bugs or external attacks against known vulnerabilities. Updating such software without being aware of this may cause the system to fail. Although solving this problem completely may be impossible, because the correct running state cannot always be known, we try to move the software to a known safe state. To achieve this, we add code that checks for a tainted state and, if the system is already tainted, fixes it using recovery code.

Usability and Manageability: To ease the burden on operators, we developed a user interface to facilitate the update process. Operators only need to give the system minimal information (process IDs and patch names) to apply an update. The patch process is also visible to operators. Moreover, POLUS allows operators to roll back committed updates. To help users construct a dynamic patch for POLUS, we provide a source-to-source compiler that identifies semantic differences between the old and the new versions of the source code. Most POLUS patches can be generated automatically, with occasional manual adjustments.

Low Overhead: As we use binary rewriting to redirect a function call from its old version to the new version, there may be a small overhead due to the function indirection while the software is evolving into the new version. This overhead is minimal: our performance measurements show that it is less than 1%.
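The coexistence of old and new data versions described above can be illustrated with a small conceptual model. This is only a sketch under stated assumptions, not POLUS code: POLUS detects writes through write protection and SIGSEGV, whereas here explicit `write_old` and `write_new` methods stand in for the trapped writes. The class `DualVersionState`, the example record layout, and the functions `old_to_new` and `new_to_old` are hypothetical names introduced for illustration.

```python
import threading
from typing import Callable, Dict

Record = Dict[str, str]


def old_to_new(old: Record) -> Record:
    # Hypothetical synchronization function: v1 keeps one "name" field,
    # v2 splits it into "first" and "last".
    first, _, last = old["name"].partition(" ")
    return {"first": first, "last": last}


def new_to_old(new: Record) -> Record:
    # Hypothetical inverse synchronization function.
    return {"name": f'{new["first"]} {new["last"]}'.strip()}


class DualVersionState:
    """Keeps an old and a new copy of one datum coherent on every write."""

    def __init__(self, old: Record,
                 to_new: Callable[[Record], Record],
                 to_old: Callable[[Record], Record]) -> None:
        self._to_new = to_new
        self._to_old = to_old
        self._lock = threading.Lock()  # several threads may write concurrently
        self.old = dict(old)
        self.new = to_new(self.old)
        self.sync_calls = 0

    def write_old(self, key: str, value: str) -> None:
        # Stands in for a trapped write to the old version (assumption).
        with self._lock:
            self.old[key] = value
            self.new = self._to_new(self.old)
            self.sync_calls += 1

    def write_new(self, key: str, value: str) -> None:
        # Stands in for a trapped write to the new version (assumption).
        with self._lock:
            self.new[key] = value
            self.old = self._to_old(self.new)
            self.sync_calls += 1
```

In this model, code that still runs against the old layout and code that already uses the new layout both see a consistent value after any write, which is the property that lets an update start at an arbitrary time.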

Figure 1: An overview of POLUS and its working flow.

As shown in Figure 1, POLUS is composed of three components: a patch constructor, in the form of a source-to-source compiler, which detects the semantic differences between two successive software versions and generates the POLUS patch files; a patch injector, which is a running process that applies the updates; and a runtime library, which provides utility functions for the patch injector to manage POLUS patches.
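To make the injector's role more concrete, the following toy model combines three ideas from the criteria above: checking for and recovering from a tainted state, redirecting calls from an old function to a new one, and rolling back a committed update. It is a hedged sketch, not the POLUS implementation: POLUS redirects calls by binary rewriting of a running process, while this model uses a plain dictionary as the indirection table. `RunningProgram`, `withdraw_v1` and `withdraw_v2` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, int]


def withdraw_v1(state: State, amount: int) -> int:
    # Hypothetical buggy version: lets the balance go negative.
    state["balance"] -= amount
    return state["balance"]


def withdraw_v2(state: State, amount: int) -> int:
    # Hypothetical fixed version: rejects overdrafts.
    if amount <= state["balance"]:
        state["balance"] -= amount
    return state["balance"]


class RunningProgram:
    def __init__(self, functions: Dict[str, Callable], state: State) -> None:
        self.table = dict(functions)  # name -> current implementation
        self.state = state
        self._undo: List[Dict[str, Optional[Callable]]] = []

    def call(self, name: str, *args):
        # Every call goes through the table, so redirection is one lookup.
        return self.table[name](self.state, *args)

    def apply_patch(self, new_functions: Dict[str, Callable],
                    is_tainted: Optional[Callable[[State], bool]] = None,
                    recover: Optional[Callable[[State], None]] = None) -> None:
        # Check for a tainted state first and repair it if possible.
        if is_tainted is not None and is_tainted(self.state):
            if recover is None:
                raise RuntimeError("state is tainted and no recovery code given")
            recover(self.state)
        # Remember what each name pointed to so the patch can be undone.
        self._undo.append({n: self.table.get(n) for n in new_functions})
        self.table.update(new_functions)

    def rollback(self) -> None:
        # Restore the function table as it was before the latest patch.
        for name, fn in self._undo.pop().items():
            if fn is None:
                self.table.pop(name, None)
            else:
                self.table[name] = fn
```

Note that this model's rollback restores only the function table. It does not undo state changes made by the recovery code.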

Figure 1 also shows the life cycle of software and the general workflow of dynamic updating with POLUS. Traditional software evolution involves stopping the running software, applying the update and restarting the software, while dynamic updating supports changes to code and data on-the-fly. To retain binary compatibility, a dynamic update to the software can be started in any running version. The static patch is obtained by analyzing the semantic differences between two successive software versions. To facilitate iterative updates, a version file is used to control the renaming of functions and data in the patches. The static patch is then compiled with a regular compiler to generate a dynamic patch in the form of a shared library. The POLUS runtime library is injected into the running software before the first update. Each dynamic patch is then injected on-the-fly by the patch injector, with the help of the POLUS runtime library.
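The role of the version file in iterative updates can be sketched as follows. The actual format of the POLUS version file and its naming scheme are not described here, so this is an assumption-laden illustration: a dictionary stands in for the version file, the `_vN` suffix is an invented convention, and `rename_for_patch` is a hypothetical helper.

```python
from typing import Dict, Iterable


def rename_for_patch(symbols: Iterable[str],
                     versions: Dict[str, int]) -> Dict[str, str]:
    """Give each changed function or datum a fresh, versioned name.

    `versions` plays the role of the version file: it records how many
    times each symbol has been patched so far and is updated in place.
    """
    renamed = {}
    for symbol in symbols:
        versions[symbol] = versions.get(symbol, 0) + 1
        # Assumed naming convention, chosen only for this example.
        renamed[symbol] = f"{symbol}_v{versions[symbol]}"
    return renamed
```

Because the counter persists across patches, a second update to the same function receives a new name and does not clash with the copy loaded by the first patch.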

Experiment Results

To demonstrate the applicability of POLUS, we report our experience in using POLUS to dynamically update three prevalent server applications: vsftpd, sshd and the Apache HTTP Server. Performance measurements show that POLUS incurs negligible runtime overhead: less than 1% performance degradation in all cases but one, where it is 5%. The time to apply an update is also minimal.

Summary

We have presented POLUS, a powerful live updating system for contemporary server software. In contrast to previous systems, POLUS is capable of updating multithreaded software and is designed to support recovering tainted software states and rolling back committed updates, while retaining good usability and backward binary compatibility. Our results suggest that POLUS has negligible impact on application performance. We plan to apply our approach to a wider range of real-life software in the future.

Publications

POLUS: A POwerful Live Updating System. Haibo Chen, Jie Yu, Rong Chen, Binyu Zang and Pen-chung Yew. In Proceedings of the 29th International Conference on Software Engineering (ICSE '07), pp. 271-281. Minneapolis, MN, USA, May 2007. [pdf][bib][ppt]

Source Code

polus-0.0.2 polus-patchgen-0.0.2

Acknowledgements

We thank the members of the system research group at the Parallel Processing Institute. This work was funded by the China National 973 Plan under grant number 2005CB321905 and by an Intel University Research Grant.