Notes On Lock Poisoning
Rust’s libs teams is considering overhauling
As a part of this effort, they are launching lock poisoning survey.
This is post is a an extended response to that survey. It is not be well-edited :-)
Panics Should Propagate
Midori error model makes sharp distinction between two kinds of errors:
bugs in the program, like indexing an array with
- error conditions in programs’ environment (reading a file which doesn’t exist)
In Rust, those correspond to panics and Results. It’s important to not mix the two.
std I think sadly does mix them in sync API. The following APIs convert panics to recoverable results:
All those APIs return a
Result when the other thread panicked.
These leads to people using
? with these methods, using recoverable error handling for bugs in the program.
In my mind, a better design would be to make those API
panic by default.
Sometimes synchronization point also happen to be failure isolation boundaries.
More verbose result-returning
catching_send would work for those special cases.
std::Mutex did implement lock poisoning, but the
lock method returned a
LockGuard<T>, rather than
Result<LockGuard<T>, PoisonError>, then we wouldn’t be discussing poisoning in the rust book, in every mutex example, and wouldn’t consider changing the status quo.
At the same time, we’d preserve “safer” semantics of lock poisoning.
There’s an additional consideration here.
In a single-threaded program, panic propagation is linear.
One panic is unwound past a sequence of frames.
If we get the second panic in some
Drop, the result is process aborting.
In a multi-threaded program, the stack is tree-shaped. What should happen if one of the three parallel threads panics? I believe the right semantics here is that siblings are cancelled, and then the panic is propagated to the parent. How to implement cancellation is an open question. If two children panic, we should propagate a pair of panics.
A topic closely related to lock poisoning is unwinding safety –
I want to share an amusing story how this machinery almost, but not quite, saved my bacon.
rust-analyzer implements cancellation via unwinding.
After a user types something and we have new code to process, we set a global flag.
Long-running background tasks like syntax highlighting read this flag and, if it is set, panic with a
struct Cancelled payload.
resume_unwind and not
panic to avoid printing backtrace.
After the stack is unwound, we can start processing new code.
This means that rust-analyzer’s data, stored in the
Db type, needs to be unwind safe.
One day while I was idly hacking on rust-analyzer during Rust all-hands I’ve noticed a weird compilation error, telling me that
Db doesn’t implement the corresponding trait.
What’s worse, removing the
target directory fixed the bug.
This was an instance of incorrect incremental compilation.
The problem stemmed from two issues:
RefUnwindSafeare auto traits, and inference rules for those are complicated
Dbtype has a curiously recurring template structure
With incremental compilation in the mix, something somewhere went wrong.
The compiler bug was fixed after several months, but, to work around it in the meantime, we’ve added a manual
impl UnwindSafe for Db which masked the bug.
Couple of months more has passed, and we started integrating chalk into rust-analyzer. At that time, chalk had it’s own layer of caching, in addition to the incremental compilation of rust-analyzer itself. So we had something like this:
(We used parking_lot for perf, and to share mutex impl between salsa and rust-analyzer).
Now, one of the differences between
parking_lot::Mutex is lock poisoning.
And that means that
std::Mutex is unwind safe (as it just becomes poisoned), while
parking_lot::Mutex is not.
Chalk used some
RefCell’s internally, so it wasn’t unwind safe.
So the whole
Db stopped being
UnwindSafe after addition of chalk.
But because we had that manual
impl UnwindSafe for Db, we haven’t noticed this.
And that lead to a heisenbug.
If cancellation happened during trait solving, we unwound past
And, as didn’t have strict exception safety guarantees, that messed up its internal questions.
So the next trait solving query would observe really weird errors like index out of bounds inside chalk.
The solution was to:
- remove the manual impl (by that time the underlying compiler bug was fixed).
Db: !UnwindSafeexpected error.
std::Mutexto get unwind-safety.
change calls to
.lockto propagate cancellation.
The last point is interesting, it means that we need support for recoverable poisoning in this case.
We need to understand that the other thread was cancelled mid-operation (so that chalk’s state might be inconsistent).
And we also need to re-raise the panic with a specific payload — the
This is because the situation is not a bug.
Discussion on /r/rust.