Join Your Threads Aug 23, 2019

This is a note on how to make multithreaded programs more robust. It’s not really specific to Rust, but I get to advertise my new jod-thread micro-crate :)

Let’s say you’ve created a fresh new thread with std::thread::spawn, but haven’t call JoinHandle::join anywhere in your program. What can go wrong in this situation? As a reminder, join blocks until the thread represented by handle completes successfully or with a panic.

First, if the main function finishes earlier, some destructors on that other thread’s stack might not run. It’s not a big deal if all that destructors do is just freeing memory: the OS cleanups after the process exit anyway. However, Drop could have been used for something like flushing IO buffers, and that is more problematic.

Second, not joining threads can lead to surprising interference between unrelated parts of the program and in general to more chaotic behavior. Imagine, for example, running a test suite with many tests. In this situation typical “singleton” threads may accumulate during a test run. Another scenario is spawning helper threads when processing tasks. If you don’t join these threads, you might end up using more resources than there are concurrent tasks, making it harder to measure the load. To be clear, if you don’t call join, the thread will complete at some point anyway, it won’t leak or anything. But this some point is non-deterministic.

Third, If a thread panics in a forest, and no one is around to hear it, does it make a sound? The join method returns a Result, which is be an Err if the thread has panicked. If you don’t join the thread, you won’t get a chance to react to this event. So, unless you are looking at the stderr at this moment, you might not realize that something is wrong!

It seems like joining the threads by default is a good idea. However, just calling JoinHandle::join is not enough:

let thread = std::thread::spawn(|| {
    /* useful work */
});

// ...

thread.join().unwrap(); // propagate the panic

The problem is, code in … might use ? (or some other form of early return), or it can panic, and in both cases the thread won’t be joined. As usual, the solution is to put the “cleanup” operation into a Drop impl. That’s exactly what my crate, jod_thread, does! Note that this is really a micro crate, so consider just rolling your own join on drop. The value is not in the code, it’s in the pattern of never leaving a loose thread behind!

A Look At C++

As usual, it is instructive to contrast and compare Rust and C++.

In C++, std::thread has this interesting peculiarity that it terminates the process in destructor unless you call .join (which works just like in Rust) or .detach (which says “I won’t be joining this thread at all”). In other words, C++ mandates that you explicitly choose between joining and detaching. Why is that?

It’s easy to argue that detach by default is a wrong choice for C++: it can easily lead to undefined behavior if the lambda passed to the thread uses values from parent’s stack frame.

Or, as Scott Meyer poetically puts it in the Item 37 of Effective Modern C++ (which is probably the best book to read if you are into both Rust and C++):

In doWork, for example, goodVals is a local variable that is captured by reference. It’s also modified inside the lambda (via the call to push_back). Suppose, then, that while the lambda is running asynchronously, conditionsAreSatisfied() returns false. In that case, doWork would return, and its local variables (including goodVals) would be destroyed. Its stack frame would be popped, and execution of its thread would continue at doWork’s call site.

Statements following that call site would, at some point, make additional function calls, and at least one such call would probably end up using some or all of the memory that had once been occupied by the doWork stack frame. Let’s call such a function f. While f was running, the lambda that doWork initiated would still be running asynchronously. That lambda could call push_back on the stack memory that used to be goodVals but that is now somewhere inside f’s stack frame. Such a call would modify the memory that used to be goodVals, and that means that from f’s perspective, the content of memory in its stack frame could spontaneously change! Imagine the fun you’d have debugging that.

This also happens to be one of my favorite arguments for “why Rust?” :)

The reasoning behind not making join the default is less clear cut. The book says that join by default is be counterintuitive, but that is somewhat circular: it is surprising precisely because it is not the default.

In Rust, unlike C++, implicit detach can’t cause undefined behavior (compiler will just refuse the code if the lambda borrows from the stack). I suspect this “we can, so why not?” is the reason why Rust detaches by default.

However, there’s a twist! C++ core guidelines now recommend to always use gsl::joining_thread (which does implicit join) over std::thread in CP.25. The following CP.26 reinforces the point by advising against .detach() method. The reasoning is roughly similar to my post: detached threads make the program more chaotic, as they add superfluous degrees of freedom to the runtime behavior.

It’s interesting that I’ve learned about these two particular guidelines only today, when refreshing my C++ for this section of the post!

So, it seems like both C++ and Rust picked the wrong default for the thread API in this case. But at least C++ has official guidelines recommending the better approach. And Rust, … well, Rust has my blog post now :-)

A Silver Bullet

Of course there isn’t one! Joining on drop seems to be a better default, but it brings its own problems. The nastiest one is deadlocks: if you are joining a thread which waits for something else, you might wait forever. I don’t think there’s an easy solution here: not joining the thread lets you forget about the deadlock, and may even make it go away (if a child thread is blocked on the parent thread), but you’ll get a detached thread on your hands! The fix is to just arrange the threads in such a way that shutdown is always orderly and clean. Ideally, shutdown should work the same for both the happy and panicking path.

I want to discuss a specific instructive issue that I’ve solved in rust-analyzer. It was about the usual setup with a worker thread that consumes items from a channel, roughly like this:

fn frobnicate() {
    let (sender, receiver) = channel();
    let worker = jod_thread::spawn(move || {
        for item receiver {
            do_work(item)
        }
    });

    // prepare some work and send it via sender
}

Here, the worker thread has a simple termination condition: it stops when the channel is closed. However, here lies the problem: we create the channel before the thread, so the sender is dropped after the worker. This is a deadlock: frobnicate waits for worker to exit, and worker waits for frobnicate to drop the sender!

There’s a straightforward fix: drop the sender first!

fn frobnicate() {
    let (sender, receiver) = channel();
    let worker = jod_thread::spawn(move || {
        for item receiver {
            do_work(item)
        }
    });

    // prepare some work and send it via sender

    drop(sender);
    drop(worker);
}

This solution, while obvious, has a pretty serious problem! The prepare some work ... bit of code can contain early returns due to error handling or it may panic. In both case the result is a deadlock. What is the worst, now deadlock happens only on the unhappy path!

There is an elegant, but tricky fix for this. Take a minute to think about it! How to change the above snippet such that the worker thread is guranted to be joined, without deadlocks, regardless of the exit condition (normal termination,?, panic) of frobnicate?

The answer will be below these beautiful Ukiyo-e prints :-)

First of all, the problem we are seeing here is an instance of a very general setup. We have a bug which only manifests itself if a rare error condition arises. In some sense, we have a bug in the (implicit) error handling (just like 92% of critical bugs). The solutions here are a classic:

Artificially trigger unhappy path often (“restoring from backup every night”).
Make sure that there aren’t different happy and unhappy paths (“crash only software”).

We are going to do the second one. Specifically, we’ll arrange the code in such way that compiler automatically drops worker first, without the need for explicit drop.

Something like this:

let worker = jod_thread::spawn(move || { ... });
let (sender, receiver) = channel();

The problem here is that we need receiver inside the worker, but moving let (sender, receiver) up brings us back to the square one. Instead, we do this:

let worker;
let (sender, receiver) = channel();
worker = jod_thread::spawn(move || { ... });

Beautiful, isn’t it? And super cryptic: the real code has a sizable comment chunk!

The second big issue with join by default is that, if you have many threads in the same scope, and one of them errors, you really want to not only wait until others are finished, but to actually cancel them. Unfortunately, cancelling a thread is a notoriously thorny problem, which I’ve explained a bit in another post.

Wrapping Up

So, yeah, join your threads, but be on guard about deadlocks! Note that most of the time one shouldn’t actually spawn threads manually: instead, tasks should be spawned to a common threadpool. This way, physical parallelism is nicely separated from logical concurrency. However, tasks should generally be joined for the same reason threads should be joined. A nice additional properly of tasks is that joining the threadpool itself in the end ensures that no tasks are leaked in the single place.

A part of the inspiration for this post was the fact that I once forgot to join a thread :( This rather embarrassingly happened in my other post. Luckily, my current colleague Stjepan Glavina noticed this. Thank you, Stjepan!

Discussion on r/rust.