Notes On Lock Poisoning Dec 12, 2020

Rust’s libs teams is considering overhauling std::sync module. As a part of this effort, they are launching lock poisoning survey.

https://blog.rust-lang.org/2020/12/11/lock-poisoning-survey.html

This is post is a an extended response to that survey. It is not be well-edited :-)

Panics Should Propagate

Midori error model makes sharp distinction between two kinds of errors:

bugs in the program, like indexing an array with -92
error conditions in programs’ environment (reading a file which doesn’t exist)

In Rust, those correspond to panics and Results. It’s important to not mix the two.

std I think sadly does mix them in sync API. The following APIs convert panics to recoverable results:

Mutex::lock
thread::JoinHandle::join
mpsc::Sender::send

All those APIs return a Result when the other thread panicked. These leads to people using ? with these methods, using recoverable error handling for bugs in the program.

In my mind, a better design would be to make those API panic by default. Sometimes synchronization point also happen to be failure isolation boundaries. More verbose result-returning catching_lock, catching_join, catching_send would work for those special cases.

If std::Mutex did implement lock poisoning, but the lock method returned a LockGuard<T>, rather than Result<LockGuard<T>, PoisonError>, then we wouldn’t be discussing poisoning in the rust book, in every mutex example, and wouldn’t consider changing the status quo. At the same time, we’d preserve “safer” semantics of lock poisoning.

There’s an additional consideration here. In a single-threaded program, panic propagation is linear. One panic is unwound past a sequence of frames. If we get the second panic in some Drop, the result is process aborting.

In a multi-threaded program, the stack is tree-shaped. What should happen if one of the three parallel threads panics? I believe the right semantics here is that siblings are cancelled, and then the panic is propagated to the parent. How to implement cancellation is an open question. If two children panic, we should propagate a pair of panics.

Almost UnwindSafe

A topic closely related to lock poisoning is unwinding safety — UnwindSafe and RefUnwindSafe traits. I want to share an amusing story how this machinery almost, but not quite, saved my bacon.

rust-analyzer implements cancellation via unwinding. After a user types something and we have new code to process, we set a global flag. Long-running background tasks like syntax highlighting read this flag and, if it is set, panic with a struct Cancelled payload. We use resume_unwind and not panic to avoid printing backtrace. After the stack is unwound, we can start processing new code.

This means that rust-analyzer’s data, stored in the Db type, needs to be unwind safe.

One day while I was idly hacking on rust-analyzer during Rust all-hands I’ve noticed a weird compilation error, telling me that Db doesn’t implement the corresponding trait. What’s worse, removing the target directory fixed the bug. This was an instance of incorrect incremental compilation.

The problem stemmed from two issues:

UnwindSafe and RefUnwindSafe are auto traits, and inference rules for those are complicated
Db type has a curiously recurring template structure

With incremental compilation in the mix, something somewhere went wrong.

The compiler bug was fixed after several months, but, to work around it in the meantime, we’ve added a manual impl UnwindSafe for Db which masked the bug.

Couple of months more has passed, and we started integrating chalk into rust-analyzer. At that time, chalk had it’s own layer of caching, in addition to the incremental compilation of rust-analyzer itself. So we had something like this:

struct Db {
    solver: parking_lot::Mutex<ChalkSolver>,
    ...
}

(We used parking_lot for perf, and to share mutex impl between salsa and rust-analyzer).

Now, one of the differences between std::Mutex and parking_lot::Mutex is lock poisoning. And that means that std::Mutex is unwind safe (as it just becomes poisoned), while parking_lot::Mutex is not. Chalk used some RefCell’s internally, so it wasn’t unwind safe. So the whole Db stopped being UnwindSafe after addition of chalk. But because we had that manual impl UnwindSafe for Db, we haven’t noticed this.

And that lead to a heisenbug. If cancellation happened during trait solving, we unwound past ChalkSolver. And, as didn’t have strict exception safety guarantees, that messed up its internal questions. So the next trait solving query would observe really weird errors like index out of bounds inside chalk.

The solution was to:

remove the manual impl (by that time the underlying compiler bug was fixed).
get the Db: !UnwindSafe expected error.
replace parking_lot::Mutex with std::Mutex to get unwind-safety.
change calls to .lock to propagate cancellation.

The last point is interesting, it means that we need support for recoverable poisoning in this case. We need to understand that the other thread was cancelled mid-operation (so that chalk’s state might be inconsistent). And we also need to re-raise the panic with a specific payload — the Cancelled struct. This is because the situation is not a bug.

Discussion on /r/rust.