Table of Contents

Formally Verified Concurrent File System (RefFS) and MoLi Framework

News

Overview

RefFS is the first concurrent file system that guarantees both liveness and safety, backed by a machine-checkable proof in Coq. Unlike earlier concurrent file systems, RefFS provably avoids termination bugs such as livelocks and deadlocks, through its innovative introduction and use of the dynamically layered definite releases specification. This specification enables handling of blocking scenarios, facilitates modular reasoning for nested blocking, and effectively eliminates the possibility of circular blocking.

The approaches underlying the aforementioned specification are integrated into a framework called MoLi. This framework serves as a guide for developers in verifying concurrent file systems. By extending the specification, we further validated the correctness of the locking scheme for Linux Virtual File System (VFS). Remarkably, even without conducting code proofs, we uncovered a critical flaw in a recent version of the locking scheme, which could potentially lead to deadlocks of the entire OS (confirmed by the maintainers). Overall, RefFS achieves better performance than our previous work AtomFS, a state-of-the-art verified concurrent file system without the liveness guarantee.

People

Publication

Directory Locking in Linux VFS

Directory locking in the Linux kernel documentation (see link) provides an informal proof of deadlock-freedom of the locking scheme. Here, we apply our framework to give a more formal and intuitive proof. Necessary background is copied from the documentation to provide a context.

Directory Locking Background

Locking scheme used for directory operations is based on two kinds of locks - per-inode (→i_rwsem) and per-filesystem (→s_vfs_rename_mutex).

When taking the i_rwsem on multiple non-directory objects, we always acquire the locks in order by increasing address. We'll call that “inode pointer” order in the following.

For our purposes all operations fall in 6 classes:

1. read access. Locking rules:

2. object creation. Locking rules:

3. object removal. Locking rules:

4. link creation. Locking rules:

5. rename that is _not_ cross-directory. Locking rules:

6. cross-directory rename. The trickiest in the whole bunch. Locking rules:

The rules above obviously guarantee that all directories that are going to be read, modified or removed by method will be locked by the caller.

Proof

If no directory is its own ancestor, the scheme above is deadlock-free.

Proof:

There is a ranking on the locks, such that all primitives take them in order of increasing rank. Namely,

For example, if we have NFS filesystem caching on a local one, we have

The ranking above is mostly static except the internal ranking of directories. Below we will focus on internal ranking of directories on the same filesystem and present a way to dynamically define ranks of directories such that operations still take locks in order of increasing rank, i.e., operations always take a lock with rank higher than that of any already held locks.

First, without considering cross-directory renames, a directory's rank is defined as its distance from the root, i.e., root directory has rank 0 and each child directory's rank equals one plus its parent directory's rank. Because no directory is its own ancestor, every directory's rank can be uniquely determined starting from root to its descendants. This ranking accounts for the (locks parent, then child) nested locking in directory removal and same-directory rename.

Then, consider the most complex case in a cross-directory rename, where most locks need to be acquired. The rename may perform the following nested locking (only show lock statements for convenience). Here, dir1 and dir2 are the two parents and according to the locking rules, we know dir2 is not ancestor to dir1. child1 and child2 are source and target in some order.

1 lock filesystem (->s_vfs_rename_mutex);
2 lock dir1->i_rwsem; <set edge to (dir1, dir2)>
3 lock dir2->i_rwsem; <set edge to (non-parent, child1) 
  if child1 is a directory and non-parent is the one in dir1 
  and dir2 that is not parent to child1, otherwise, set edge to None>
4 lock child1->i_rwsem; <set edge to (child1, child2) if chil1 
  and child2 are directories, otherwise, set edge to None>
5 lock child2->i_rwsem; <set edge to None>

We introduce a ghost state edge to help define the ranks (the ghost state has no influence on the program execution and its only role is to assist the proof). The value of edge is either None or (inode1,inode2) representing a possible lock dependency from inode1 to inode2, i.e., some thread may request lock of inode2 with lock of inode1 held. There is only ever one edge value in a filesystem with edge set to None when the s_vfs_rename_mutex is not held and edge set to a determined value when s_vfs_rename_mutex is held. When edge is None, the ranking rules are the same as before (root is 0 and child is one plus its parent). When edge is (inode1, inode2), only the ranking rule for inode2 changes: inode2's rank is one plus the larger rank of inode2's parent and inode1. Intuitively, a directory's rank is the longest distance from root in the filesystem tree extended with edge.

Now let's check the setting of edge in angle brackets in above code to see how it ensures cross-directory rename takes locks in increasing order in this case. After the lock statement at line 2, edge is set to (dir1,dir2). Because dir2 is not ancestor to dir1, the rank of dir1 stays the same and the rank of dir2 (as well as every other directories) can be uniquely determined. So at line 3, dir2's rank is higher than dir1's. Before the code acquires child1 at line 4, it already holds dir1 and dir2. We know one of dir1 and dir2 is parent to child1, so we update edge to (the non-parent of child1, child1). Because child1 cannot be ancestor to dir1 or dir2, every directory's rank is still uniquely defined (for the same reason as above). So at line 4, we ensure child1 has higher rank than dir1 and dir2. Similarly, we update edge to (child1, child2) so that at live 5, child2 has higher rank than held locks (dir1, dir2 and child1). We reset edge to None whenever we have acquired all directory locks.

In a less complex case of cross-directory rename, the code may execute a prefix of above lines and we always reset edge to None after having acquired all directory locks.

The ranking of directories may change under following cases:

All these changes preserve the fact that at every time, the ranking of every directory can be uniquely determined (thus the ranking is a total order). Plus the fact that operations always take locks in order of increasing rank (can be easily verified). We can conclude the locking rules above are deadlock-free.

Soundness Proof

A separate paper [draft] illustrates MoLi's meta-theory.