Exceptions vs Structured Concurrency
This is partially a mild instance of xkcd://386 with respect to the great don’t panic post by @vorner (yes, it’s 2 am here) and partially a discussion of error-handling in the framework of structured concurrency, which was recently popularized by @njsmith.
Panics
In the blog post, @vorner argues that unwinding may sometimes do more harm than good if it manages to break some unsafe invariants, cross an FFI boundary, or put the application into an impossible state. I fully agree that all of these are indeed significant dangers of panics.
However, I don’t think that just disabling unwinding and using panic = "abort" is the proper fix to the problem for the majority of use cases. A lot of programs work as a series of requests and responses (often implicit), and I argue that for this pattern it is desirable to be able to handle bugs in requests gracefully.
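To make this concrete, here is a minimal sketch of that request/response pattern with graceful handling of bugs. The names and the Response type are invented for illustration; the key ingredient is std::panic::catch_unwind at the request boundary:

```rust
use std::panic::{self, AssertUnwindSafe};

enum Response {
    Ok(String),
    InternalError(String),
}

fn handle_request(request: &str) -> String {
    // Imagine a large body of feature code here; any of it may have bugs.
    if request.is_empty() {
        panic!("index out of bounds somewhere deep inside a lint");
    }
    format!("completed: {}", request)
}

fn serve(request: &str) -> Response {
    // Catch the unwinding panic at the request boundary and degrade gracefully.
    match panic::catch_unwind(AssertUnwindSafe(|| handle_request(request))) {
        Ok(body) => Response::Ok(body),
        Err(_) => Response::InternalError("Something went wrong, please file a bug".to_string()),
    }
}

fn main() {
    for request in ["", "complete this identifier"] {
        match serve(request) {
            Response::Ok(body) => println!("ok: {}", body),
            Response::InternalError(msg) => println!("error popup: {}", msg),
        }
    }
    println!("the process is still alive");
}
```

Note that this relies on the default panic = "unwind" strategy; with panic = "abort" there is simply nothing to catch.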
I’ve spent quite some time working on an IDE and, although it might not be apparent at first sight, IDEs are also based on requests/responses:
- the user types a character, and the IDE updates its internal data structures
- the user requests completion, and the IDE runs some calculations on the data and returns the results
As IDEs are large and have a huge number of features, it is inevitable that some not-too-important lint inspection will fail with an index out of bounds access on this particular macro invocation in this particular project. Killing the whole IDE process would definitely be a bad user experience. On the other hand, just showing a non-modal popup “Something went wrong, would you like to submit a bug report?” is usually only a minor irritation: errors are more common in the numerous “additional” features, while the smaller core tends to be more correct.
I do think that this pattern of “show an error message and chug along” is applicable to a significant number of applications. Of course, even in this setting a bug in the code can in theory have dire consequences, but in practice this is mitigated by the following:
- The majority of requests are read-only and can’t corrupt data.
- The low-level implementation of write requests usually has relatively bug-free transactional semantics, so bugs in write requests lead to transaction aborts and don’t corrupt data either.
- Most applications have some kind of backup/undo functionality, so even if a bug leads to a commit of invalid data, the user can often restore a good state (of course, this works only for relatively unimportant data).
However, @vorner identifies a very interesting specific problem with unwinding which I feel we should really try to solve better: if you have a bunch of threads running and one of them catches fire, what happens? It turns out that often nothing in particular happens: some more threads might die from the poisoned mutexes and closed channels, but other threads might continue, and, as a result, the application will exist in a half-dead state for an indefinite period of time.
Structured Concurrency
At this point, some of you might be silently screaming “Erlang!”:
Source: http://leftoversalad.com/c/015_programmingpeople/
You are right! Erlang, and especially OTP behaviors, are great for managing errors at scale. However, a full actor system might be overkill if all you want is just an OS thread.
If you haven’t done so already, pack some snacks, prepare lots of coffee/tea, and do read the structured concurrency blog post. The crux of the pattern is to avoid fire-and-forget concurrency:
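Roughly, fire-and-forget concurrency in Rust looks like this (a minimal illustration with std::thread::spawn):

```rust
use std::thread;

fn main() {
    // "Fire": spawn a worker thread...
    thread::spawn(|| {
        // ...which does some work in the background.
        println!("working...");
    });
    // "Forget": the JoinHandle is dropped, so nobody joins the thread,
    // observes its panic, or even knows when (or whether) it finishes.
    // Here main may well exit before the worker gets to run at all.
}
```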
Instead, each thread should be confined to some lexical scope and never escape it:
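For example, with scoped threads the same work looks roughly as follows. This sketch uses crossbeam’s scope (crossbeam 0.8-style API; details vary between versions):

```rust
fn sum_in_scope(data: &[u32]) {
    crossbeam::scope(|s| {
        // The spawned thread may borrow `data` from the parent's stack
        // precisely because it is guaranteed not to outlive this scope.
        s.spawn(|_| {
            let sum: u32 = data.iter().sum();
            println!("sum = {}", sum);
        });
    })
    .unwrap(); // every thread spawned in the scope has been joined by this point
}

fn main() {
    let data = vec![1, 2, 3];
    sum_in_scope(&data);
}
```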
The benefit of this organization is that all threads form a tree, which gives you greater control, because you know for sure which parts are sequential and which are concurrent. Concurrency is explicitly scoped.
Panics and Structured Concurrency
And we have a really, really interesting API design problem if we combine structured concurrency and unwinding. What should be the behavior of the following program?
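Something along these lines, sketched with crossbeam-style scoped threads (the concrete durations and messages are illustrative):

```rust
use std::thread;
use std::time::Duration;

fn main() {
    crossbeam::scope(|s| {
        // This thread chugs along for a long time...
        s.spawn(|_| {
            thread::sleep(Duration::from_secs(60));
            println!("finished the long computation");
        });
        // ...while its sibling dies almost immediately.
        s.spawn(|_| {
            panic!("bug in the second thread");
        });
    })
    .unwrap();
}
```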
Now, for crossbeam specifically there’s little choice here, due to the boring requirement for memory safety. But let’s pretend for now that this is a garbage-collected language.
So, we have two concurrent threads in a single scope, one of which is still running and the other of which is, unfortunately, dead.
The most obvious choice is to wait for the running thread to finish (we don’t want to let it escape the scope) and then to re-raise the panic at scope exit. The problem with this approach is that there’s a potentially unbounded window between the moment the panic is created and the moment it is propagated.
This is not a theoretical concern: some time ago a friend of mine had a fascinating debugging session with a Python machine learning application. The program was processing a huge amount of data, so, to speed things up, it partitioned the data and spawned a thread per partition (the actual processing happened in native code, so the GIL was not a problem).
The observed behavior was that a single thread died, but no exception or stack trace was printed anywhere. This was because the executor waited for all the other threads to finish before propagating the exception. Although technically the exception was not lost, in practice you’d have to wait for several hours to actually see it!
The Trio library uses an interesting refinement of this strategy: when one of the tasks in a scope fails, all the others are immediately cancelled and then awaited. I think this should work well for Trio, because it has first-class support for cancellation; any async operation is a cancellation point. So all child tasks will be cancelled in a timely manner, although I wouldn’t be surprised if there were some pathological cases where exception propagation is delayed.
Unfortunately, this solution doesn’t work for native threads, because there are just no good cancellation points. And I don’t know of any approach that would work :(
One vague idea I have is inspired by the handling of orphaned processes in Unix: if a thread in a scope dies, the scope is torn down immediately, and all the still-running threads are attached to the value that is thrown. Whoever wants to handle the failure must first wait for all the attached threads to finish. This way, the initial panic and all in-progress threads could be propagated up to the top-level init scope, which can then attempt either a clean exit by waiting for all children, or do a process::abort.
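Here is a sketch of what the thrown value could look like; every name in it is hypothetical, and no such API exists today:

```rust
use std::any::Any;
use std::thread::JoinHandle;

// Hypothetical value thrown out of a failed scope: the original panic payload
// plus handles to the still-running, now "orphaned", threads.
struct ScopeFailure {
    panic: Box<dyn Any + Send>,
    orphans: Vec<JoinHandle<()>>,
}

impl ScopeFailure {
    // Whoever catches the failure must wait for the orphans before handling
    // (or re-raising) the original panic.
    fn join_orphans_and_take_panic(self) -> Box<dyn Any + Send> {
        for orphan in self.orphans {
            // An orphan may itself have panicked; its outcome is ignored here.
            let _ = orphan.join();
        }
        self.panic
    }
}
```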
However, this attachment to the parent violates the property that a thread never leaves its original scope. Because crossbeam relies on this property for memory safety, this approach is just not applicable to threads which share stack data.
It’s already 4 am here, so I really should be wrapping the post up :) So, a challenge: design a Rust library for scoped concurrency based on native OS threads that:
- never loses a thread or a panic,
- immediately propagates panics,
- allows (optionally?) sharing stack data between the threads.
Discussion on r/rust.