Join Your Threads

This is a note on how to make multithreaded programs more robust. It's not really specific to Rust, but I get to advertise my new jod-thread micro-crate :)

Let's say you've created a fresh new thread with std::thread::spawn, but haven't called JoinHandle::join anywhere in your program. What can go wrong in this situation? As a reminder, join blocks until the thread represented by the handle completes, either successfully or with a panic.

First, if the main function finishes earlier, some destructors on that other thread's stack might not run. It's not a big deal if all the destructors do is free memory: the OS cleans up after the process exits anyway. However, Drop could have been used for something like flushing IO buffers, and that is more problematic.
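To make the first point concrete, here is a minimal sketch of the joined case. FlushOnDrop and destructor_ran_after_join are hypothetical names made up for this illustration, and the atomic flag stands in for "the buffer was flushed". Once join returns, the destructor on the child thread's stack is guaranteed to have run; without the join, main could exit before that happens.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a resource that must be flushed on drop.
struct FlushOnDrop {
    flushed: Arc<AtomicBool>,
}

impl Drop for FlushOnDrop {
    fn drop(&mut self) {
        // Pretend this writes out buffered data.
        self.flushed.store(true, Ordering::SeqCst);
    }
}

/// Returns true if the destructor has run by the time `join` returns.
fn destructor_ran_after_join() -> bool {
    let flushed = Arc::new(AtomicBool::new(false));
    let guard = FlushOnDrop { flushed: Arc::clone(&flushed) };

    let handle = thread::spawn(move || {
        let _guard = guard; // lives on the child thread's stack
        // ... useful work ...
    });

    // Blocks until the thread is done, which includes running `_guard`'s Drop.
    handle.join().unwrap();
    flushed.load(Ordering::SeqCst)
}
```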

Second, not joining threads can lead to surprising interference between unrelated parts of the program and in general to more chaotic behavior. Imagine, for example, running a test suite with many tests. In this situation typical "singleton" threads may accumulate during a test run. Another scenario is spawning helper threads when processing tasks. If you don't join these threads, you might end up using more resources than there are concurrent tasks, making it harder to measure the load. To be clear, if you don't call join, the thread will complete at some point anyway, it won't leak or anything. But this "some point" is non-deterministic.

Third, if a thread panics in a forest, and no one is around to hear it, does it make a sound? The join method returns a Result, which is an Err if the thread has panicked. If you don't join the thread, you won't get a chance to react to this event. So, unless you are looking at stderr at this moment, you might not realize that something is wrong!
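A small sketch of what "hearing" the panic looks like. The child_panicked helper is a made-up name for this example; the only API used is std::thread::spawn and JoinHandle::join.

```rust
use std::thread;

/// Spawns a thread that panics when `should_panic` is true and reports
/// whether `join` observed that panic.
fn child_panicked(should_panic: bool) -> bool {
    let handle = thread::spawn(move || {
        if should_panic {
            panic!("something went wrong in the child thread");
        }
        42
    });

    // `join` returns `Err` (carrying the panic payload) if the thread panicked,
    // and `Ok` with the closure's return value otherwise.
    handle.join().is_err()
}
```

The panic message still goes to stderr, but the caller now has a value to match on instead of relying on someone watching the terminal.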

It seems like joining the threads by default is a good idea. However, just calling JoinHandle::join is not enough:

let thread = std::thread::spawn(|| {
    /* useful work */
});

// ...

thread.join().unwrap(); // propagate the panic

The problem is, the code in the // ... part might use ? (or some other form of early return), or it can panic, and in both cases the thread won't be joined. As usual, the solution is to put the cleanup operation into a Drop impl. That's exactly what my crate, jod_thread, does! Note that this is really a micro crate, so consider just rolling your own join on drop. The value is not in the code, it's in the pattern of never leaving a loose thread behind!
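For the "roll your own" route, a hand-written version could look roughly like the sketch below. This is not the jod_thread source: JoinOnDrop, spawn_joined and fallible are hypothetical names for this illustration, and the choice to re-raise the child's panic only when the parent is not already unwinding is an assumption of this sketch (a second panic during unwinding would abort the process).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical join-on-drop wrapper around a std JoinHandle.
struct JoinOnDrop<T> {
    inner: Option<JoinHandle<T>>,
}

fn spawn_joined<F, T>(f: F) -> JoinOnDrop<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    JoinOnDrop { inner: Some(thread::spawn(f)) }
}

impl<T> Drop for JoinOnDrop<T> {
    fn drop(&mut self) {
        if let Some(handle) = self.inner.take() {
            let result = handle.join();
            // Propagate the child's panic, unless we are already unwinding.
            if result.is_err() && !thread::panicking() {
                panic!("joined thread panicked");
            }
        }
    }
}

/// Returns early with an error; the thread is joined anyway.
fn fallible(flag: Arc<AtomicBool>) -> Result<(), String> {
    let _worker = spawn_joined(move || {
        thread::sleep(Duration::from_millis(20));
        flag.store(true, Ordering::SeqCst);
    });
    Err("bailing out early".to_string())
    // `_worker` is dropped here, which blocks until the thread finishes.
}
```

Calling fallible returns the Err, but only after the spawned thread has set the flag, because the drop of _worker waits for it.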

A Look At C++

As usual, it is instructive to contrast and compare Rust and C++.

In C++, std::thread has this interesting peculiarity that it terminates the process in its destructor unless you call .join (which works just like in Rust) or .detach (which says "I won't be joining this thread at all"). In other words, C++ mandates that you explicitly choose between joining and detaching. Why is that?

It's easy to argue that detach by default is the wrong choice for C++: it can easily lead to undefined behavior if the lambda passed to the thread uses values from the parent's stack frame.

Or, as Scott Meyers poetically puts it in Item 37 of Effective Modern C++ (which is probably the best book to read if you are into both Rust and C++):

This also happens to be one of my favorite arguments for "why Rust?" :)

The reasoning behind not making join the default is less clear cut. The book says that join by default would be counterintuitive, but that is somewhat circular: it is surprising precisely because it is not the default.

In Rust, unlike C++, implicit detach can't cause undefined behavior (the compiler will just refuse the code if the lambda borrows from the stack). I suspect this "we can, so why not?" is the reason why Rust detaches by default.

However, there's a twist! The C++ Core Guidelines now recommend always using gsl::joining_thread (which does an implicit join) over std::thread in CP.25. The following CP.26 reinforces the point by advising against the .detach() method. The reasoning is roughly similar to my post: detached threads make the program more chaotic, as they add superfluous degrees of freedom to the runtime behavior.

It's interesting that I've learned about these two particular guidelines only today, when refreshing my C++ for this section of the post!

So, it seems like both C++ and Rust picked the wrong default for the thread API in this case. But at least C++ has official guidelines recommending the better approach. And Rust, well, Rust has my blog post now :-)

A Silver Bullet

Of course there isn't one! Joining on drop seems to be a better default, but it brings its own problems. The nastiest one is deadlocks: if you are joining a thread which waits for something else, you might wait forever. I don't think there's an easy solution here: not joining the thread lets you forget about the deadlock, and may even make it go away (if a child thread is blocked on the parent thread), but you'll get a detached thread on your hands! The fix is to just arrange the threads in such a way that shutdown is always orderly and clean. Ideally, shutdown should work the same for both the happy and the panicking path.

I want to discuss a specific instructive issue that I've solved in rust-analyzer. It was about the usual setup with a worker thread that consumes items from a channel, roughly like this:

fn frobnicate() {
    let (sender, receiver) = channel();
    let worker = jod_thread::spawn(move || {
        for item in receiver {
            do_work(item)
        }
    });

    // prepare some work and send it via sender
}

Here, the worker thread has a simple termination condition: it stops when the channel is closed. However, here lies the problem: we create the channel before the thread, so the sender is dropped after the worker. This is a deadlock: frobnicate waits for worker to exit, and worker waits for frobnicate to drop the sender!

There's a straightforward fix: drop the sender first!

fn frobnicate() {
    let (sender, receiver) = channel();
    let worker = jod_thread::spawn(move || {
        for item in receiver {
            do_work(item)
        }
    });

    // prepare some work and send it via sender

    drop(sender);
    drop(worker);
}

This solution, while obvious, has a pretty serious problem! The "prepare some work ..." bit of code can contain early returns due to error handling, or it may panic. In both cases the result is a deadlock. Worst of all, the deadlock now happens only on the unhappy path!

There is an elegant, but tricky fix for this. Take a minute to think about it! How would you change the above snippet such that the worker thread is guaranteed to be joined, without deadlocks, regardless of the exit condition (normal termination, ?, panic) of frobnicate?

The answer will be below these beautiful Ukiyo-e prints :-)

Fine Wind, Clear Morning
Rainstorm Beneath the Summit

First of all, the problem we are seeing here is an instance of a very general setup. We have a bug which only manifests itself if a rare error condition arises. In some sense, we have a bug in the (implicit) error handling (just like 92% of critical bugs). The solutions here are classic:

  1. Artificially trigger the unhappy path often ("restoring from backup every night").
  2. Make sure that there aren't different happy and unhappy paths ("crash-only software").

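We are going to do the second one. Specifically, we'll arrange the code in such a way that the compiler automatically drops sender first and worker last, without the need for an explicit drop. Local variables are dropped in reverse order of declaration, so worker has to be declared before the channel.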

Something like this:

let worker = jod_thread::spawn(move || { ... });
let (sender, receiver) = channel();

The problem here is that we need receiver inside the worker, but moving let (sender, receiver) up brings us back to square one. Instead, we do this:

let worker;
let (sender, receiver) = channel();
worker = jod_thread::spawn(move || { ... });

Beautiful, isn't it? And super cryptic: the real code has a sizable comment chunk!
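Here is a self-contained sketch of the trick that can be run without any crates. It makes a few assumptions that are not in the snippets above: std::sync::mpsc stands in for whatever channel the real code uses, Worker is a hypothetical minimal join-on-drop wrapper rather than jod_thread, and the "work" is just counting parsed integers so that the early return via ? is easy to trigger.

```rust
use std::num::ParseIntError;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::channel;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical minimal join-on-drop wrapper (stand-in for jod_thread).
struct Worker(Option<JoinHandle<()>>);

impl Drop for Worker {
    fn drop(&mut self) {
        if let Some(handle) = self.0.take() {
            let _ = handle.join();
        }
    }
}

fn frobnicate(items: &[&str], processed: Arc<AtomicUsize>) -> Result<(), ParseIntError> {
    // Declared first, so it is dropped (joined) last.
    let _worker;
    // Declared second, so `sender` is dropped first, which closes the channel.
    let (sender, receiver) = channel::<i32>();
    _worker = Worker(Some(thread::spawn(move || {
        // The loop ends once every sender is gone.
        for _item in receiver {
            processed.fetch_add(1, Ordering::SeqCst);
        }
    })));

    for raw in items {
        let n: i32 = raw.parse()?; // early return on bad input
        sender.send(n).unwrap();
    }
    Ok(())
}
```

With input like ["1", "2", "oops", "4"], frobnicate returns an Err after sending two items, and it still returns rather than hanging: sender goes out of scope before _worker on every exit path, so the worker's loop terminates and the join completes.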

The second big issue with join by default is that, if you have many threads in the same scope, and one of them errors, you really want not only to wait until the others are finished, but to actually cancel them. Unfortunately, cancelling a thread is a notoriously thorny problem, which I've explained a bit in another post.

Wrapping Up

So, yeah, join your threads, but be on guard about deadlocks! Note that most of the time one shouldn't actually spawn threads manually: instead, tasks should be spawned to a common threadpool. This way, physical parallelism is nicely separated from logical concurrency. However, tasks should generally be joined for the same reason threads should be joined. A nice additional property of tasks is that joining the threadpool itself in the end ensures, in a single place, that no tasks are leaked.

A part of the inspiration for this post was the fact that I once forgot to join a thread :( This rather embarrassingly happened in my other post. Luckily, my current colleague Stjepan Glavina noticed this. Thank you, Stjepan!

Discussion on r/rust.