Formally Verified Concurrent File System (RefFS) and MoLi Framework

News

March 21, 2024: our RefFS paper has been conditionally accepted by OSDI 2024

Overview

RefFS is the first concurrent file system that guarantees both liveness and safety, backed by a machine-checkable proof in Coq. Unlike earlier concurrent file systems, RefFS provably avoids termination bugs such as livelocks and deadlocks, through its innovative introduction and use of the dynamically layered definite releases specification. This specification enables handling of blocking scenarios, facilitates modular reasoning for nested blocking, and effectively eliminates the possibility of circular blocking.

The approaches underlying the aforementioned specification are integrated into a framework called MoLi. This framework serves as a guide for developers in verifying concurrent file systems. By extending the specification, we further validated the correctness of the locking scheme for Linux Virtual File System (VFS). Remarkably, even without conducting code proofs, we uncovered a critical flaw in a recent version of the locking scheme, which could potentially lead to deadlocks of the entire OS (confirmed by the maintainers). Overall, RefFS achieves better performance than our previous work AtomFS, a state-of-the-art verified concurrent file system without the liveness guarantee.

People

Publication

[OSDI 2024] Using Dynamically Layered Definite Releases for Verifying the RefFS File System. Mo Zou, Dong Du, Mingkai Dong, Haibo Chen. The 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024. [paper]

Directory Locking in Linux VFS

Directory locking in the Linux kernel documentation (see link) provides an informal proof of deadlock-freedom of the locking scheme. Here, we apply our framework to give a more formal and intuitive proof. Necessary background is copied from the documentation to provide a context.

Directory Locking Background

Locking scheme used for directory operations is based on two kinds of locks - per-inode (→i_rwsem) and per-filesystem (→s_vfs_rename_mutex).

When taking the i_rwsem on multiple non-directory objects, we always acquire the locks in order by increasing address. We'll call that “inode pointer” order in the following.

For our purposes all operations fall in 6 classes:

1. read access. Locking rules:

lock the directory we are accessing (shared)

2. object creation. Locking rules:

lock the directory we are accessing (exclusive)

3. object removal. Locking rules:

lock the parent (exclusive)
find the victim
lock the victim (exclusive)

4. link creation. Locking rules:

lock the parent (exclusive)
check that the source is not a directory
lock the source (exclusive; probably could be weakened to shared)

5. rename that is _not_ cross-directory. Locking rules:

lock the parent (exclusive)
find the source and target
decide which of the source and target need to be locked. The source needs to be locked if it's a non-directory, target if it's a non-directory or about to be removed.
take the locks that need to be taken (exlusive), in inode pointer order if need to take both (that can happen only when both source and target are non-directories, the source because it wouldn't need to be locked otherwise and the target because mixing directory and non-directory is allowed only with RENAME_EXCHANGE, and that won't be removing the target).

6. cross-directory rename. The trickiest in the whole bunch. Locking rules:

lock the filesystem
if the parents don't have a common ancestor, fail the operation.
lock the parents in “ancestors first” order (exclusive). If neither is an ancestor of the other, lock the parent of source first.
find the source and target.
verify that the source is not a descendent of the target and target is not a descendent of source; fail the operation otherwise.
lock the subdirectories involved (exclusive), source before target.
lock the non-directories involved (exclusive), in inode pointer order.

The rules above obviously guarantee that all directories that are going to be read, modified or removed by method will be locked by the caller.

Proof

If no directory is its own ancestor, the scheme above is deadlock-free.

Proof:

There is a ranking on the locks, such that all primitives take them in order of increasing rank. Namely,

rank →i_rwsem of non-directories on given filesystem in inode pointer order.
put →i_rwsem of all directories on a filesystem lower than →i_rwsem of any non-directory on the same filesystem and the internal ranking of directories are defined later
put →s_vfs_rename_mutex at rank lower than that of any →i_rwsem on the same filesystem.
among the locks on different filesystems use the relative rank of those filesystems.

For example, if we have NFS filesystem caching on a local one, we have

1. →s_vfs_rename_mutex of NFS filesystem
2. →i_rwsem of directories on that NFS filesystem, internal ranking defined later
3. →i_rwsem of non-directories on that filesystem, in order of increasing address of inode
4. →s_vfs_rename_mutex of local filesystem
5. →i_rwsem of directories on the local filesystem, internal ranking defined later
6. →i_rwsem of non-directories on local filesystem, in order of increasing address of inode.

The ranking above is mostly static except the internal ranking of directories. Below we will focus on internal ranking of directories on the same filesystem and present a way to dynamically define ranks of directories such that operations still take locks in order of increasing rank, i.e., operations always take a lock with rank higher than that of any already held locks.

First, without considering cross-directory renames, a directory's rank is defined as its distance from the root, i.e., root directory has rank 0 and each child directory's rank equals one plus its parent directory's rank. Because no directory is its own ancestor, every directory's rank can be uniquely determined starting from root to its descendants. This ranking accounts for the (locks parent, then child) nested locking in directory removal and same-directory rename.

Then, consider the most complex case in a cross-directory rename, where most locks need to be acquired. The rename may perform the following nested locking (only show lock statements for convenience). Here, dir1 and dir2 are the two parents and according to the locking rules, we know dir2 is not ancestor to dir1. child1 and child2 are source and target in some order.

1 lock filesystem (->s_vfs_rename_mutex);
2 lock dir1->i_rwsem; <set edge to (dir1, dir2)>
3 lock dir2->i_rwsem; <set edge to (non-parent, child1) 
  if child1 is a directory and non-parent is the one in dir1 
  and dir2 that is not parent to child1, otherwise, set edge to None>
4 lock child1->i_rwsem; <set edge to (child1, child2) if chil1 
  and child2 are directories, otherwise, set edge to None>
5 lock child2->i_rwsem; <set edge to None>

We introduce a ghost state edge to help define the ranks (the ghost state has no influence on the program execution and its only role is to assist the proof). The value of edge is either None or (inode1,inode2) representing a possible lock dependency from inode1 to inode2, i.e., some thread may request lock of inode2 with lock of inode1 held. There is only ever one edge value in a filesystem with edge set to None when the s_vfs_rename_mutex is not held and edge set to a determined value when s_vfs_rename_mutex is held. When edge is None, the ranking rules are the same as before (root is 0 and child is one plus its parent). When edge is (inode1, inode2), only the ranking rule for inode2 changes: inode2's rank is one plus the larger rank of inode2's parent and inode1. Intuitively, a directory's rank is the longest distance from root in the filesystem tree extended with edge.

Now let's check the setting of edge in angle brackets in above code to see how it ensures cross-directory rename takes locks in increasing order in this case. After the lock statement at line 2, edge is set to (dir1,dir2). Because dir2 is not ancestor to dir1, the rank of dir1 stays the same and the rank of dir2 (as well as every other directories) can be uniquely determined. So at line 3, dir2's rank is higher than dir1's. Before the code acquires child1 at line 4, it already holds dir1 and dir2. We know one of dir1 and dir2 is parent to child1, so we update edge to (the non-parent of child1, child1). Because child1 cannot be ancestor to dir1 or dir2, every directory's rank is still uniquely defined (for the same reason as above). So at line 4, we ensure child1 has higher rank than dir1 and dir2. Similarly, we update edge to (child1, child2) so that at live 5, child2 has higher rank than held locks (dir1, dir2 and child1). We reset edge to None whenever we have acquired all directory locks.

In a less complex case of cross-directory rename, the code may execute a prefix of above lines and we always reset edge to None after having acquired all directory locks.

The ranking of directories may change under following cases:

directory removal/creation: a directory's rank is removed/added
rename: the effect of rename is (1) moving a subtree under a node that is not in the subtree (the directories in the subtree are all recalculated) and potentially removing target (a directory's rank is removed) or (2) in case of exchange, two subtrees exchange their positions and their ranks are recalculated
edge updates in a cross-directory rename: setting edge to (inode1,inode2) or resetting edge to None modify the ranks of the inode2 subtree

All these changes preserve the fact that at every time, the ranking of every directory can be uniquely determined (thus the ranking is a total order). Plus the fact that operations always take locks in order of increasing rank (can be easily verified). We can conclude the locking rules above are deadlock-free.

Soundness Proof

A separate paper [draft] illustrates MoLi's meta-theory.

IPADS

Table of Contents