Join Your Threads
This is a note on how to make multithreaded programs more robust. It’s not really specific to Rust, but I get to advertise my new jod-thread micro-crate :)
Let’s say you’ve created a fresh new thread with std::thread::spawn, but haven’t called JoinHandle::join anywhere in your program.
What can go wrong in this situation?
As a reminder, join blocks until the thread represented by the handle completes, either successfully or with a panic.
First, if the main function finishes earlier, some destructors on that other thread’s stack might not run.
It’s not a big deal if all the destructors do is free memory: the OS cleans up after the process exits anyway.
However, Drop could have been used for something like flushing IO buffers, and that is more problematic.
Second, not joining threads can lead to surprising interference between unrelated parts of the program and in general to more chaotic behavior.
Imagine, for example, running a test suite with many tests.
In this situation typical “singleton” threads may accumulate during a test run.
Another scenario is spawning helper threads when processing tasks.
If you don’t join these threads, you might end up using more resources than there are concurrent tasks, making it harder to measure the load.
To be clear, if you don’t call join, the thread will complete at some point anyway; it won’t leak or anything.
But this “some point” is non-deterministic.
Third, if a thread panics in a forest, and no one is around to hear it, does it make a sound?
The join method returns a Result, which is an Err if the thread has panicked.
If you don’t join the thread, you won’t get a chance to react to this event.
So, unless you are looking at stderr at that moment, you might not realize that something is wrong!
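A minimal sketch of reacting to a panic through join, using only the standard library. The helper name run_and_report is hypothetical and exists only for this illustration; it assumes the panic payload is a &str or a String, which is the usual case for panic!.

```rust
use std::thread;

/// Runs `work` on a new thread and turns a panic into an `Err` with its message.
/// `run_and_report` is a hypothetical helper for this illustration.
fn run_and_report<F>(work: F) -> Result<i32, String>
where
    F: FnOnce() -> i32 + Send + 'static,
{
    let handle = thread::spawn(work);
    // `join` returns `Err(payload)` if the thread panicked.
    handle.join().map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}
```

Without the join call, the only trace of the panic would be the message the default panic hook prints to stderr.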
It seems like joining the threads by default is a good idea.
However, just calling JoinHandle::join is not enough:
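A sketch of the naive approach, assuming a hypothetical function parse_with_helper that spawns a helper thread, does some fallible work, and joins at the end:

```rust
use std::num::ParseIntError;
use std::thread;

/// `parse_with_helper` is a hypothetical function for this illustration.
fn parse_with_helper(input: &str) -> Result<i32, ParseIntError> {
    let handle = thread::spawn(|| {
        // background work would go here
    });

    // ... the code between spawn and join:
    // on a parse error, `?` returns right here and the join below is skipped.
    let value: i32 = input.trim().parse()?;

    handle.join().unwrap(); // propagate the panic, if any
    Ok(value)
}
```

On the error path the function still returns a sensible Err, so nothing looks wrong from the outside; the helper thread is simply left running.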
The problem is, the code in … might use ? (or some other form of early return), or it can panic, and in both cases the thread won’t be joined.
As usual, the solution is to put the “cleanup” operation into a Drop impl.
That’s exactly what my crate, jod_thread, does!
Note that this is really a micro crate, so consider just rolling your own join on drop.
The value is not in the code, it’s in the pattern of never leaving a loose thread behind!
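For reference, a rolled-by-hand version might look roughly like the sketch below. The JoinOnDrop type and its methods are hypothetical names, not the crate’s actual source; propagating the child’s panic from drop (unless the current thread is already panicking) is one possible design choice among several.

```rust
use std::thread;

/// A thread handle that joins the thread when dropped.
/// `JoinOnDrop` is a hypothetical name for this illustration.
pub struct JoinOnDrop<T>(Option<thread::JoinHandle<T>>);

impl<T> JoinOnDrop<T> {
    /// Spawns a thread, just like `std::thread::spawn`.
    pub fn spawn<F>(f: F) -> Self
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        JoinOnDrop(Some(thread::spawn(f)))
    }

    /// Joins explicitly and hands the result (or the panic payload) to the caller.
    pub fn join(mut self) -> thread::Result<T> {
        // The handle is always present until `join` or `drop` takes it.
        self.0.take().unwrap().join()
    }
}

impl<T> Drop for JoinOnDrop<T> {
    fn drop(&mut self) {
        if let Some(handle) = self.0.take() {
            let result = handle.join();
            // Surface the child's panic, but avoid a double panic (which aborts).
            if result.is_err() && !thread::panicking() {
                panic!("joined thread panicked");
            }
        }
    }
}
```

With this in place, an early return or a panic in the parent still waits for the child, because the wait happens in the destructor rather than in a statement that can be skipped.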
A Look At C++
As usual, it is instructive to contrast and compare Rust and C++.
In C++, std::thread has this interesting peculiarity that it terminates the process in its destructor unless you call .join (which works just like in Rust) or .detach (which says “I won’t be joining this thread at all”).
In other words, C++ mandates that you explicitly choose between joining and detaching.
Why is that?
It’s easy to argue that detach by default is the wrong choice for C++: it can easily lead to undefined behavior if the lambda passed to the thread uses values from the parent’s stack frame.
Or, as Scott Meyers poetically puts it in Item 37 of Effective Modern C++ (which is probably the best book to read if you are into both Rust and C++):
This also happens to be one of my favorite arguments for “why Rust?” :)
The reasoning behind not making join the default is less clear cut.
The book says that join by default would be counterintuitive, but that is somewhat circular: it is surprising precisely because it is not the default.
In Rust, unlike C++, implicit detach can’t cause undefined behavior (the compiler will just reject the code if the lambda borrows from the stack). I suspect this “we can, so why not?” is the reason why Rust detaches by default.
However, there’s a twist!
The C++ Core Guidelines now recommend, in CP.25, always using gsl::joining_thread (which does an implicit join) over std::thread.
The following CP.26 reinforces the point by advising against the .detach() method.
The reasoning is roughly similar to my post: detached threads make the program more chaotic, as they add superfluous degrees of freedom to the runtime behavior.
It’s interesting that I’ve learned about these two particular guidelines only today, when refreshing my C++ for this section of the post!
So, it seems like both C++ and Rust picked the wrong default for the thread API in this case. But at least C++ has official guidelines recommending the better approach. And Rust, … well, Rust has my blog post now :-)
A Silver Bullet
Of course there isn’t one! Joining on drop seems to be a better default, but it brings its own problems. The nastiest one is deadlocks: if you are joining a thread which waits for something else, you might wait forever. I don’t think there’s an easy solution here: not joining the thread lets you forget about the deadlock, and may even make it go away (if a child thread is blocked on the parent thread), but you’ll get a detached thread on your hands! The fix is to just arrange the threads in such a way that shutdown is always orderly and clean. Ideally, shutdown should work the same for both the happy and panicking path.
I want to discuss a specific instructive issue that I’ve solved in rust-analyzer. It was about the usual setup with a worker thread that consumes items from a channel, roughly like this:
Here, the worker thread has a simple termination condition: it stops when the channel is closed.
However, here lies the problem: we create the channel before the thread, so the sender is dropped after the worker.
This is a deadlock: frobnicate waits for worker to exit, and worker waits for frobnicate to drop the sender!
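The deadlock comes down to drop order: Rust drops local variables in reverse order of declaration. Here is a small sketch of that rule, with a hypothetical Noisy type that records when it is dropped standing in for the real sender and worker (no threads or channels involved):

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records its name in a shared log when dropped.
/// `Noisy` is a hypothetical stand-in for the real sender and worker.
struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Channel declared before the worker: the worker is dropped (joined) first.
fn channel_declared_first() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _sender = Noisy { name: "sender", log: Rc::clone(&log) };
        let _worker = Noisy { name: "worker", log: Rc::clone(&log) };
    }
    let order = log.borrow().clone();
    order
}

/// Worker declared (but not yet initialized) before the channel:
/// the sender is dropped first, the worker last.
fn worker_declared_first() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _worker;
        let _sender = Noisy { name: "sender", log: Rc::clone(&log) };
        _worker = Noisy { name: "worker", log: Rc::clone(&log) };
    }
    let order = log.borrow().clone();
    order
}
```

What matters is the position of the let, not the point of initialization.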
There’s a straightforward fix: drop the sender first!
This solution, while obvious, has a pretty serious problem!
The “prepare some work ...” bit of code can contain early returns due to error handling, or it may panic.
In both cases the result is a deadlock.
Worst of all, the deadlock now happens only on the unhappy path!
There is an elegant, but tricky fix for this. Take a minute to think about it!
How can the above snippet be changed so that the worker thread is guaranteed to be joined, without deadlocks, regardless of how frobnicate exits (normal termination, ?, or panic)?
The answer will be below these beautiful Ukiyo-e prints :-)
First of all, the problem we are seeing here is an instance of a very general setup. We have a bug which only manifests itself if a rare error condition arises. In some sense, we have a bug in the (implicit) error handling (just like 92% of critical bugs). The solutions here are classic:
- Artificially trigger unhappy path often (“restoring from backup every night”).
- Make sure that there aren’t different happy and unhappy paths (“crash only software”).
We are going to do the second one.
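Specifically, we’ll arrange the code in such a way that the compiler automatically drops sender before worker, without the need for an explicit drop.
Since locals are dropped in reverse order of declaration, that means declaring worker first.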
Something like this:
The problem here is that we need receiver inside the worker, but moving let (sender, receiver) up brings us back to square one.
Instead, we do this:
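A self-contained sketch of the trick: declare worker first, and only initialize it after the channel exists. The JoinOnDrop struct is a hand-rolled stand-in for a join-on-drop handle, and the item type, the doubling “processing”, and the zero check are assumptions made up for the illustration.

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hand-rolled stand-in for a join-on-drop handle (an assumption of this sketch).
struct JoinOnDrop(Option<JoinHandle<()>>);

impl Drop for JoinOnDrop {
    fn drop(&mut self) {
        if let Some(handle) = self.0.take() {
            // The worker's panic is ignored here to keep the sketch short.
            let _ = handle.join();
        }
    }
}

/// Doubles each item on a worker thread; a zero item triggers an early return.
#[allow(unused)] // `worker` is kept only for its `Drop` impl
fn frobnicate(items: &[u32], processed: Arc<Mutex<Vec<u32>>>) -> Result<(), String> {
    let worker; // declared first, so it is dropped last
    let (sender, receiver) = channel::<u32>();
    worker = JoinOnDrop(Some(thread::spawn(move || {
        // The loop ends once every sender is dropped and the queue is drained.
        for item in receiver {
            processed.lock().unwrap().push(item * 2);
        }
    })));

    // Prepare some work and send it; this part may return early.
    for &item in items {
        if item == 0 {
            return Err("zero is not allowed".to_string());
        }
        sender.send(item).map_err(|err| err.to_string())?;
    }
    Ok(())
    // On every exit path `sender` is dropped before `worker`,
    // so the channel is closed by the time the join happens.
}
```

Both the happy path and the early return go through the same implicit drops, so there is no separate unhappy-path shutdown to get wrong.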
Beautiful, isn’t it? And super cryptic: the real code has a sizable comment chunk!
The second big issue with join by default is that, if you have many threads in the same scope, and one of them errors, you really want to not only wait until others are finished, but to actually cancel them. Unfortunately, cancelling a thread is a notoriously thorny problem, which I’ve explained a bit in another post.
Wrapping Up
So, yeah, join your threads, but be on guard about deadlocks! Note that most of the time one shouldn’t actually spawn threads manually: instead, tasks should be spawned to a common threadpool. This way, physical parallelism is nicely separated from logical concurrency. However, tasks should generally be joined for the same reason threads should be joined. A nice additional property of tasks is that joining the threadpool itself in the end ensures, in a single place, that no tasks are leaked.
A part of the inspiration for this post was the fact that I once forgot to join a thread :( This rather embarrassingly happened in my other post. Luckily, my current colleague Stjepan Glavina noticed this. Thank you, Stjepan!
Discussion on r/rust.