Notes On Lock Poisoning
Rust’s libs teams is considering overhauling std::sync
module.
As a part of this effort, they are launching lock poisoning survey.
https://blog.rust-lang.org/2020/12/11/lock-poisoning-survey.html
This is post is a an extended response to that survey. It is not be well-edited :-)
Panics Should Propagate
Midori error model makes sharp distinction between two kinds of errors:
-
bugs in the program, like indexing an array with
-92
- error conditions in programs’ environment (reading a file which doesn’t exist)
In Rust, those correspond to panics and Results. It’s important to not mix the two.
std I think sadly does mix them in sync API. The following APIs convert panics to recoverable results:
-
Mutex::lock
-
thread::JoinHandle::join
-
mpsc::Sender::send
All those APIs return a Result
when the other thread panicked.
These leads to people using ?
with these methods, using recoverable error handling for bugs in the program.
In my mind, a better design would be to make those API panic
by default.
Sometimes synchronization point also happen to be failure isolation boundaries.
More verbose result-returning catching_lock
, catching_join
, catching_send
would work for those special cases.
If std::Mutex
did implement lock poisoning, but the lock
method returned a LockGuard<T>
, rather than Result<LockGuard<T>, PoisonError>
, then we wouldn’t be discussing poisoning in the rust book, in every mutex example, and wouldn’t consider changing the status quo.
At the same time, we’d preserve “safer” semantics of lock poisoning.
There’s an additional consideration here.
In a single-threaded program, panic propagation is linear.
One panic is unwound past a sequence of frames.
If we get the second panic in some Drop
, the result is process aborting.
In a multi-threaded program, the stack is tree-shaped. What should happen if one of the three parallel threads panics? I believe the right semantics here is that siblings are cancelled, and then the panic is propagated to the parent. How to implement cancellation is an open question. If two children panic, we should propagate a pair of panics.
Almost UnwindSafe
A topic closely related to lock poisoning is unwinding safety — UnwindSafe
and RefUnwindSafe
traits.
I want to share an amusing story how this machinery almost, but not quite, saved my bacon.
rust-analyzer implements cancellation via unwinding.
After a user types something and we have new code to process, we set a global flag.
Long-running background tasks like syntax highlighting read this flag and, if it is set, panic with a struct Cancelled
payload.
We use resume_unwind
and not panic
to avoid printing backtrace.
After the stack is unwound, we can start processing new code.
This means that rust-analyzer’s data, stored in the Db
type, needs to be unwind safe.
One day while I was idly hacking on rust-analyzer during Rust all-hands I’ve noticed a weird compilation error, telling me that Db
doesn’t implement the corresponding trait.
What’s worse, removing the target
directory fixed the bug.
This was an instance of incorrect incremental compilation.
The problem stemmed from two issues:
-
UnwindSafe
andRefUnwindSafe
are auto traits, and inference rules for those are complicated -
Db
type has a curiously recurring template structure
With incremental compilation in the mix, something somewhere went wrong.
The compiler bug was fixed after several months, but, to work around it in the meantime, we’ve added a manual impl UnwindSafe for Db
which masked the bug.
Couple of months more has passed, and we started integrating chalk into rust-analyzer. At that time, chalk had it’s own layer of caching, in addition to the incremental compilation of rust-analyzer itself. So we had something like this:
(We used parking_lot for perf, and to share mutex impl between salsa and rust-analyzer).
Now, one of the differences between std::Mutex
and parking_lot::Mutex
is lock poisoning.
And that means that std::Mutex
is unwind safe (as it just becomes poisoned), while parking_lot::Mutex
is not.
Chalk used some RefCell
’s internally, so it wasn’t unwind safe.
So the whole Db
stopped being UnwindSafe
after addition of chalk.
But because we had that manual impl UnwindSafe for Db
, we haven’t noticed this.
And that lead to a heisenbug.
If cancellation happened during trait solving, we unwound past ChalkSolver
.
And, as didn’t have strict exception safety guarantees, that messed up its internal questions.
So the next trait solving query would observe really weird errors like index out of bounds inside chalk.
The solution was to:
- remove the manual impl (by that time the underlying compiler bug was fixed).
-
get the
Db: !UnwindSafe
expected error. -
replace
parking_lot::Mutex
withstd::Mutex
to get unwind-safety. -
change calls to
.lock
to propagate cancellation.
The last point is interesting, it means that we need support for recoverable poisoning in this case.
We need to understand that the other thread was cancelled mid-operation (so that chalk’s state might be inconsistent).
And we also need to re-raise the panic with a specific payload — the Cancelled
struct.
This is because the situation is not a bug.
Discussion on /r/rust.