Exceptions vs Structured Concurrency
This is partially a mild instance of xkcd://386 with respect to the great don’t panic post by @vorner (yes, it’s 2 am here) and partially a discussion of error-handling in the framework of structured concurrency, which was recently popularized by @njsmith.
Panics
In the blog post, @vorner argues that unwinding may sometimes do more harm than good if it manages to break some unsafe invariants, cross an FFI boundary, or put the application into an impossible state. I fully agree that all of these are indeed significant dangers of panics.
However, I don’t think that just disabling unwinding and using panic = "abort" is the proper fix to the problem for the majority of use cases. A lot of programs work as a series of requests and responses (often implicit), and I argue that for this pattern it is desirable to be able to handle bugs in requests gracefully.
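To make this concrete, here is a minimal sketch of that request/response pattern with graceful handling of bugs. The names and the Response type are invented for illustration; the key ingredient is std::panic::catch_unwind at the request boundary:

```rust
use std::panic::{self, AssertUnwindSafe};

enum Response {
    Ok(String),
    InternalError(String),
}

fn handle_request(request: &str) -> String {
    // Imagine a large body of feature code here; any of it may have bugs.
    if request.is_empty() {
        panic!("index out of bounds somewhere deep inside a lint");
    }
    format!("completed: {}", request)
}

fn serve(request: &str) -> Response {
    // Catch the unwinding panic at the request boundary and degrade gracefully.
    match panic::catch_unwind(AssertUnwindSafe(|| handle_request(request))) {
        Ok(body) => Response::Ok(body),
        Err(_) => Response::InternalError("Something went wrong, please file a bug".to_string()),
    }
}

fn main() {
    for request in ["", "complete this identifier"] {
        match serve(request) {
            Response::Ok(body) => println!("ok: {}", body),
            Response::InternalError(msg) => println!("error popup: {}", msg),
        }
    }
    println!("the process is still alive");
}
```

Note that this relies on the default panic = "unwind" strategy; with panic = "abort" there is simply nothing to catch.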
I’ve spent quite some time working on an IDE and, although it might not be apparent at first sight, IDEs are also based on requests/responses:
- the user types a character, and the IDE updates its internal data structures
- the user requests completion, and the IDE runs some calculations on the data and returns the results
As IDEs are large and have a huge number of features, it is inevitable that some not-too-important lint inspection will fail with an index out of bounds access on this particular macro invocation in this particular project. Killing the whole IDE process would definitely be a bad user experience. On the other hand, just showing a non-modal popup “Something went wrong, would you like to submit a bug report?” is usually only a minor irritation: errors are more common in the numerous “additional” features, while the smaller core tends to be more correct.
I do think that this pattern of “show an error message and chug along” is applicable to a significant number of applications. Of course, even in this setting a bug in the code can in theory have dire consequences, but in practice this is mitigated by the following:
- The majority of requests are read-only and can’t corrupt data.
- The low-level implementation of write requests usually has relatively bug-free transactional semantics, so bugs in write requests lead to transaction aborts and don’t corrupt data either.
- Most applications have some kind of backup/undo functionality, so even if a bug leads to a commit of invalid data, the user can often restore a good state (of course, this works only for relatively unimportant data).
However, @vorner identifies a very interesting specific problem with unwinding which I feel we should really try to solve better: if you have a bunch of threads running and one of them catches fire, what happens? It turns out that often nothing in particular happens: some more threads might die from the poisoned mutexes and closed channels, but other threads might continue, and, as a result, the application will exist in a half-dead state for an indefinite period of time.
Structured Concurrency
At this point, some of you might be silently screaming “Erlang!”:
Source: http://leftoversalad.com/c/015_programmingpeople/
You are right! Erlang, and especially OTP behaviors, are great for managing errors at scale. However, a full actor system might be overkill if all you want is just an OS thread.
If you haven’t done so already, pack some snacks, prepare lots of coffee/tea, and do read the structured concurrency blog post. The crux of the pattern is to avoid fire-and-forget concurrency:
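Roughly, fire-and-forget concurrency in Rust looks like this (a minimal illustration with std::thread::spawn):

```rust
use std::thread;

fn main() {
    // "Fire": spawn a worker thread...
    thread::spawn(|| {
        // ...which does some work in the background.
        println!("working...");
    });
    // "Forget": the JoinHandle is dropped, so nobody joins the thread,
    // observes its panic, or even knows when (or whether) it finishes.
    // Here main may well exit before the worker gets to run at all.
}
```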
Instead, each thread should be confined to some lexical scope and never escape it:
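For example, with scoped threads the same work looks roughly as follows. This sketch uses crossbeam’s scope (crossbeam 0.8-style API; details vary between versions):

```rust
fn sum_in_scope(data: &[u32]) {
    crossbeam::scope(|s| {
        // The spawned thread may borrow `data` from the parent's stack
        // precisely because it is guaranteed not to outlive this scope.
        s.spawn(|_| {
            let sum: u32 = data.iter().sum();
            println!("sum = {}", sum);
        });
    })
    .unwrap(); // every thread spawned in the scope has been joined by this point
}

fn main() {
    let data = vec![1, 2, 3];
    sum_in_scope(&data);
}
```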
The benefit of this organization is that all threads form a tree, which gives you greater control, because you know for sure which parts are sequential and which are concurrent. Concurrency is explicitly scoped.
Panics and Structured Concurrency
And we have a really, really interesting API design problem if we combine structured concurrency and unwinding. What should be the behavior of the following program?
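Something along these lines, sketched with crossbeam-style scoped threads (the concrete durations and messages are illustrative):

```rust
use std::thread;
use std::time::Duration;

fn main() {
    crossbeam::scope(|s| {
        // This thread chugs along for a long time...
        s.spawn(|_| {
            thread::sleep(Duration::from_secs(60));
            println!("finished the long computation");
        });
        // ...while its sibling dies almost immediately.
        s.spawn(|_| {
            panic!("bug in the second thread");
        });
    })
    .unwrap();
}
```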
Now, for crossbeam specifically there’s little choice here, due to the boring requirement for memory safety. But let’s pretend for now that this is a garbage-collected language.
So, we have two concurrent threads in a single scope, one of which is still running and the other of which is, unfortunately, dead.
The most obvious choice is to wait for the running thread to finish (we don’t want to let it escape the scope) and then to re-raise the panic at scope exit. The problem with this approach is that there’s a potentially unbounded window between the moment the panic is created and the moment it is propagated.
This is not a theoretical concern: some time ago a friend of mine had a fascinating debugging session with a Python machine learning application. The program was processing a huge amount of data, so, to speed things up, it partitioned the data and spawned a thread per partition (the actual processing happened in native code, so the GIL was not a problem).
The observed behavior was that a single thread died, but no exception or stack trace was printed anywhere. This was because the executor waited for all the other threads to finish before propagating the exception. Although technically the exception was not lost, in practice you’d have to wait for several hours to actually see it!
The Trio library uses an interesting refinement of this strategy: when one of the tasks in a scope fails, all the others are immediately cancelled and then awaited. I think this should work well for Trio, because it has first-class support for cancellation; any async operation is a cancellation point. So all child tasks will be cancelled in a timely manner, although I wouldn’t be surprised if there were some pathological cases where exception propagation is delayed.
Unfortunately, this solution doesn’t work for native threads, because there are just no good cancellation points. And I don’t know of any approach that would work :(
One vague idea I have is inspired by the handling of orphaned processes in Unix: if a thread in a scope dies, the scope is torn down immediately, and all the still-running threads are attached to the value that is thrown. Whoever wants to handle the failure must first wait for all the attached threads to finish. This way, the initial panic and all in-progress threads could be propagated up to the top-level init scope, which can then attempt either a clean exit by waiting for all children, or do a process::abort.
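Here is a sketch of what the thrown value could look like; every name in it is hypothetical, and no such API exists today:

```rust
use std::any::Any;
use std::thread::JoinHandle;

// Hypothetical value thrown out of a failed scope: the original panic payload
// plus handles to the still-running, now "orphaned", threads.
struct ScopeFailure {
    panic: Box<dyn Any + Send>,
    orphans: Vec<JoinHandle<()>>,
}

impl ScopeFailure {
    // Whoever catches the failure must wait for the orphans before handling
    // (or re-raising) the original panic.
    fn join_orphans_and_take_panic(self) -> Box<dyn Any + Send> {
        for orphan in self.orphans {
            // An orphan may itself have panicked; its outcome is ignored here.
            let _ = orphan.join();
        }
        self.panic
    }
}
```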
However, this attachment to the parent violates the property that a thread never leaves its original scope. Because crossbeam relies on this property for memory safety, this approach is just not applicable to threads which share stack data.
It’s already 4 am here, so I really should be wrapping the post up :) So, a challenge: design a Rust library for scoped concurrency based on native OS threads that:
- never loses a thread or a panic,
- immediately propagates panics,
- allows (optionally?) sharing stack data between the threads.
Discussion on r/rust.