matklad

A Missing IDE Feature

2024-10-14T00:00:00+00:00


Slightly unusual genre — with this article, I want to try to enact a change in the world. I believe that there is a “missing” IDE feature which is:

very easy to implement (these days),
is a large force multiplier for experienced users,
is conspicuously missing from almost every editor.

The target audience here is anyone who can land a PR in Zed, VS Code, Helix, Neovim, Emacs, Kakoune, or any other editor or any language server. The blog post would be a success if one of you feels sufficiently inspired to do the thing!

The Feature

Suppose you are casually reading the source code of rust-analyzer, and are curious about handling of method bodies. There’s a Body struct in the code base, and you want to understand how it is used.

Would you rather look at this?

Or this?

(The screenshots are from IntelliJ/RustRover, because of course it gets this right)

The second option is clearly superior — it conveys significantly more useful information in the same amount of pixels. Function names, argument lists and return types are so much more valuable than a body of any particular function. Especially if the function is a page-full of boilerplate code!

And this is the feature I am asking for — make the code look like the second image. Or, specifically, Fold Method Bodies by Default.

There are two components here. First, only method bodies are folded. This is a syntactic check — we are not folding the second level. For code like

fn f() { ... }

impl S {
    fn g(&self) { ... }
}

Both f and g are folded, but impl S is not. Similarly, function parameters and the function body are actually on the same level of the folding hierarchy, but it is imperative that parameters are not folded. This is the part that was hard ten years ago but is easy today. “What is a function body?” is a non-trivial question, which requires proper parsing of the code. These days, either an LSP server or Tree-sitter can answer this question quickly and reliably.

The second component of the feature is that folded is the default state. It is not a “fold method bodies” action. It is a setting that ensures that, whenever you visit a new file, bodies are folded by default. To make this work, the editor should be smart enough to seamlessly unfold a specific function when appropriate. For example, if you “go to definition” to a function, that function should get unfolded, while the surrounding code should remain folded.
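The two components can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the parser (Tree-sitter or a language server) is assumed to have already reported each function as a pair of line ranges, and Function, FoldState, on_open, on_navigate and is_hidden are hypothetical names, not any editor's API.

```rust
use std::ops::Range;

/// A function as reported by the parser, reduced to line ranges (hypothetical type).
/// `item` covers the whole function; `body` covers only the lines a fold hides.
#[derive(Clone, Debug)]
struct Function {
    item: Range<usize>,
    body: Range<usize>,
}

/// Per-file fold state: one flag per function body.
struct FoldState {
    functions: Vec<Function>,
    folded: Vec<bool>,
}

impl FoldState {
    /// Runs when a file is opened: every body starts folded.
    /// Impl blocks and parameter lists are never in the list, so they stay visible.
    fn on_open(functions: &[Function]) -> FoldState {
        FoldState {
            functions: functions.to_vec(),
            folded: vec![true; functions.len()],
        }
    }

    /// Runs when the cursor jumps to `line`, e.g. after "go to definition":
    /// unfold the function containing that line and leave the rest alone.
    fn on_navigate(&mut self, line: usize) {
        for (function, folded) in self.functions.iter().zip(self.folded.iter_mut()) {
            if function.item.contains(&line) {
                *folded = false;
            }
        }
    }

    /// Is `line` currently hidden inside a folded body?
    fn is_hidden(&self, line: usize) -> bool {
        self.functions
            .iter()
            .zip(&self.folded)
            .any(|(function, &folded)| folded && function.body.contains(&line))
    }
}
```

The point of the design is that folding is state owned by the editor and derived from the syntax tree on open, rather than an action the user has to trigger.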

Now that I have explained how the feature works, I will not try to motivate it. I think it is pretty obvious how awesome this actually is. Code is read more often than written, and this is one of the best multipliers for readability. Most of the code is in method bodies, but the most important code is in function signatures. Folding bodies auto-magically hides the boring 80% of the code, leaving the most important 20%. It was in 2018 when I last used an IDE (IntelliJ) which has this implemented properly, and I’ve been missing this feature ever since!

You might also be wondering whether it is the same feature as the Outline, that special UI which shows a graphical, hierarchical table of contents of the file. It is true that outline and fold-bodies-by-default attack the same issue. But I’d argue that folding solves it better. This is an instance of a common pattern. In a smart editor, it is often possible to implement any given feature either by “lowering” it to plain text, or by creating a dedicated GUI. And the lowering approach almost always wins, because it gets to re-use all existing functionality for free. For example, the folding approach trivially gives you an ability to move a bunch of functions from one impl block to the other by selecting them with Shift + Down, cutting with Ctrl + X and pasting with Ctrl + V.

Call to Action

So, if you are a committer to one of the editors, please consider adding a “fold function bodies by default” mode. It probably should be off by default, as it can easily scare new users away, but it should be there for power users to enable, and it should be prominently documented, so that people can learn that they want it. After the checkbox is in place, see if you can implement the actual logic! If your editor uses Tree-sitter, this should be relatively easy — its syntax tree contains all the information you need. Just make sure that:

bodies are folded when the new file is opened,
the editor unfolds them when appropriate (generally, when navigated to a function from elsewhere).

If your editor is not based on Tree-sitter, you’ll have a harder time. In theory, the information should be readily available from the language server, but LSP currently doesn’t expose it. Here’s the problem:

https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#foldingRangeKind

There’s no body kind there! Adding it should be technically trivial, but it is always a pain to get something into the protocol if you are not VS Code.

My Role

What is my job here, besides sitting there and writing blog posts? I actually think that writing this down is quite valuable!

I suppose the feature is still commonly missing due to a two-sided market failure: the feature doesn’t exist, so prospective users don’t realize that it is possible, and don’t ask editor authors to implement it. Without users asking, editor authors themselves don’t realize this feature could exist, and don’t rush to implement it. This is exacerbated by the fact that it was a hard feature to implement ten years ago, when we didn’t have Tree-sitter/LSP, so there are poor workarounds in place, such as actions to fold a certain level. These workarounds then prevent the proper feature from gaining momentum.

So here I hope to maybe tip the equilibrium’s scale a bit, and start a feedback loop where more people realize that they want this feature, such that it is implemented in some of the more experimental editors, which hopefully would expose the feature to more users, popularizing it until it gets implemented everywhere!

Still, just talking isn’t everything I did here! Six years ago, I implemented the language-server side of this in rust-analyzer:

https://github.com/rust-lang/rust-analyzer/commit/23b040962ff299feeef1f967bc2d5ba92b01c2bc

This currently isn’t exposed to LSP, because it doesn’t allow flagging a folding range as method body. To fix that I opened this VS Code issue (LSP generally avoids doing something before VS Code):

https://github.com/microsoft/vscode/issues/128912

And since then I have been quietly waiting for some editor (not necessarily VS Code) to pick this up. This hasn’t happened yet, hence this article!

Thanks for reading!

Two Workflow Tips

2024-10-08T00:00:00+00:00


An article about a couple of relatively recent additions to my workflow which I wish I knew about years ago.

Split And Go To Definition

Go to definition is super useful (in general, navigation is much more important than code completion). But often, when I use “goto def” I don’t actually mean to permanently go there. Rather, I want to stay where I am, but I need a bit more context about a particular thing at point.

What I’ve found works really great in this context is to split the screen in two, and issue “go to def” in the split. That way you see both the original context and the definition at the same time, and can choose just how far you would like to go. Here’s an example, where I want to understand how the apply function works, and, to understand that, I want to quickly look up the definition of FuzzOp:

VS Code actually has a first-class UI for something like this, called “Peek Definition” but it is just not good — it opens some kind of separate pop-up, with a completely custom UX. It’s much more fruitful to compose two existing basic features — splitting the screen and going to definition.

Note that in the example above I do move focus to the split. I also tried a version that keeps focus in the original split, but focusing the new one turned out to be much better. You actually don’t always know up front which split will become the “main” one, and moving the focus gives you the flexibility of moving around, closing the split, or closing the other split.

I highly recommend adding a shortcut for this action. It’s a good idea to make it a “complementary” shortcut for the usual goto definition. I use , . for goto definition, and hence , > is the splitting version:

{ "key": ", .",       "command": "editor.action.revealDefinition" },
{ "key": ", shift+.", "command": "editor.action.revealDefinitionAside" },

, .

Yes, you are reading this right. , ., that is a comma followed by a full stop, is my goto definition shortcut. This is not some kind of evil vim mode. I use pedestrian non-modal editing, where I copy with ctrl + c, move to the beginning of line with Home and kill a word with ctrl + Backspace (though keys like Home, Backspace, or arrows are on my home row thanks to kanata).

And yet, I use , as a first keypress in a sequence for multiple shortcuts. That is, , . is not , + . pressed together, but rather a , followed by a separate .. So, when I press , my editor doesn’t actually type a comma, but rather waits for me to complete the shortcut. I have many of them, with just a few being:

, . goes to definition,
, > goes to definition in a split,
, r runs a task,
, e s edits selection by sorting it, , e C converts to camelCase,
, o g opens magit for VS Code, , o k opens keybindings.
, w re-wraps selection at 80 (something I just did to format the previous bullet point), , p pretty-prints the whole file.

I’ve used many different shortcut schemes, but this is by far the most convenient one for me. How do I type a comma? I bind , Space and , Enter to insert a comma and a space/newline respectively, which handles most of the cases. And there’s , , which types just a lone comma.
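The behavior described above (a comma starts a sequence, nothing is typed until the sequence resolves) can be sketched as a small state machine. This is a minimal Rust illustration, not how VS Code or any extension implements it; LeaderKeys, Outcome and the command names are made up for the example, and only a handful of the bindings from the list are included.

```rust
use std::collections::HashMap;

/// What the editor should do after a key press (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
enum Outcome {
    Insert(String),    // type this text into the buffer
    Run(&'static str), // run the named command
    Pending,           // a sequence is in progress, wait for more keys
    Cancelled,         // the keys so far match no binding
}

struct LeaderKeys {
    bindings: HashMap<&'static str, Outcome>,
    pending: String,
}

impl LeaderKeys {
    fn new() -> LeaderKeys {
        // Sequences are written without the spaces used in the prose above.
        let bindings = HashMap::from([
            (",.", Outcome::Run("goto_definition")),
            (",>", Outcome::Run("goto_definition_aside")),
            (",es", Outcome::Run("sort_selection")),
            (", ", Outcome::Insert(", ".to_string())),
            (",,", Outcome::Insert(",".to_string())),
        ]);
        LeaderKeys { bindings, pending: String::new() }
    }

    fn feed(&mut self, key: char) -> Outcome {
        // Outside a sequence, every key except the comma is ordinary typing.
        if self.pending.is_empty() && key != ',' {
            return Outcome::Insert(key.to_string());
        }
        self.pending.push(key);
        // A complete binding fires and resets the sequence.
        if let Some(outcome) = self.bindings.get(self.pending.as_str()) {
            let outcome = outcome.clone();
            self.pending.clear();
            return outcome;
        }
        // A proper prefix of some binding keeps waiting.
        if self.bindings.keys().any(|k| k.starts_with(self.pending.as_str())) {
            return Outcome::Pending;
        }
        self.pending.clear();
        Outcome::Cancelled
    }
}
```

Feeding ',' then '.' yields Pending followed by Run("goto_definition"), while feeding ',' twice yields Pending followed by Insert(",").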

To remember longer sequences, I pair the comma with whichkey, such that, when I type , e, what I see is actually a menu of editing operations:

This horrible idea was born in the mind of Susam Pal, and is officially (and aptly I should say) named Devil Mode.

I highly recommend trying it out! It is the perfect interface for actions that you do once in a while. Where it doesn’t work is for actions you want to repeat. For example, if you want to cycle through compilation errors, binding , e to the “next error” would probably be a bad idea, as typing , e , e , e to cycle three times is quite tiring.

This is actually a common theme: there are many things you might want to cycle back and forth through:

completion suggestions
compiler errors
textual search results
reference search results
merge conflicts
working tree changes

It is mighty annoying to have to remember different shortcuts for all of them, isn’t it? If only there was some way to have a universal pair of shortcuts for the next/prev generalized motion…

The insight here is that you’d rarely need to cycle through several different categories of things at the same time. So I bind the venerable ctrl+n and ctrl+p to repeating the last next/prev motion. So, if the last next thing was a worktree change, then ctrl+n moves me to the next worktree change. But if I then query the next compilation error, the subsequent ctrl+n would continue cycling through compilation errors. To kick-start the cycle, I have a , n hydra:

, n e next error
, n c next change
, n C next merge Conflict
, n r next reference
, n f next find
, n . previous edit

I don’t know if there’s an existing VS Code extension that does this; I implemented it in my personal extension.

Hope this is useful! Now go and make a deal with the devil yourself!

On Ousterhout's Dichotomy

2024-10-06T00:00:00+00:00


Why are there so many programming languages? One of the driving reasons for this is that some languages tend to produce fast code, but are a bit of a pain to use (C++), while others are a breeze to write, but run somewhat slow (Python). Depending on the ratio of CPUs to programmers, one or the other might be relatively more important.

But can’t we just, like, implement a universal language that is convenient but slowish by default, but allows an expert programmer to drop to a lower, more performant but harder register? I think there were many attempts at this, and they didn’t quite work out.

The natural way to go about this is to start from the high-level side. Build a high-level, featureful language with a large runtime, and then provide granular opt-outs of specific runtime facilities. Two great examples here are C# and D. And the most famous example of this paradigm is Python, with its “rewrite slow parts in C” mantra.

It seems to me that such an approach can indeed solve the “easy to use” part of the dichotomy, but doesn’t quite work as promised for the “runs fast” one. And here’s the reason. For performance, what matters is not so much the code that’s executed, but rather the layout of objects in memory. And the high-level dialect locks in a pointer-heavy GC object model! Even if you write your code in assembly, the performance ceiling will be determined by all those pointers the GC needs. To actually get full “low-level” performance, you need to effectively “mirror” the data across the dialects across a quasi-FFI boundary.

And that’s what kills “write most of the code in Python, rewrite hot spots in C” approach — the overhead for transitioning between the native C data structures and the Python ones tends to eat any performance benefits that C brings to the table. There are some very real, very important exceptions, where it is possible to batch sufficiently large packages of work to minimize the overhead: http://venge.net/graydon/talks/VectorizedInterpretersTalk-2023-05-12.pdf. But it seems that the average case looks more like this: https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation.

And this brings me to Rust. It feels like it accidentally blundered into the space of universal languages through the floor. There are no heavy runtime-features to opt out of in Rust. The object model is universal throughout the language. There isn’t a value-semantics/reference-semantics dichotomy, references are first-class values. And yet:

There’s memory safety, which removes most of the fun aspects of low-level programming.
The language didn’t sleep on basic PL niceties like sum-types, generics and “everything-is-expression”.
And a healthy minority of rubyists in the community worked tirelessly to ensure that systems programmers can have nice things.

As a result, there is a certain spectrum of Rust:

Sloppy Rust, which allocates and clones left-and-right.
Normal Rust, which opportunistically uses pretzels and avoids gratuitous allocations but otherwise doesn’t try to optimize anything specifically.
DoD Rust, which thinks a bit about cache-lines, packs things into arenas, uses indexes instead of pointers with an occasional SoA and SIMD.
Crazy here-be-dragons Rust with untagged unions, unsafe, inline assembly and other wizardry.
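To make the first three rungs concrete, here is the same toy computation written three ways. This is only an illustrative sketch: Particle, Particles and the step_* functions are invented for the example, and no performance numbers are claimed.

```rust
#[derive(Clone, Debug)]
struct Particle {
    x: f32,
    v: f32,
}

/// Sloppy Rust: boxes every element and clones the whole collection on each step.
fn step_sloppy(particles: &[Box<Particle>], dt: f32) -> Vec<Box<Particle>> {
    particles
        .iter()
        .map(|p| {
            let mut q = p.clone();
            q.x += q.v * dt;
            q
        })
        .collect()
}

/// Normal Rust: borrows a flat slice and updates it in place, no allocation.
fn step_normal(particles: &mut [Particle], dt: f32) {
    for p in particles {
        p.x += p.v * dt;
    }
}

/// DoD Rust: struct-of-arrays layout, so each field is a dense array.
struct Particles {
    x: Vec<f32>,
    v: Vec<f32>,
}

fn step_soa(particles: &mut Particles, dt: f32) {
    for (x, v) in particles.x.iter_mut().zip(&particles.v) {
        *x += *v * dt;
    }
}
```

All three compute the same thing and use the same types at the leaves, so moving a hot loop from one style to the next does not require crossing any language boundary.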

While the bottom end here sits pretty comfortably next to C, the upper tip doesn’t quite reach the usability level of Python. But this is mostly compensated through these three effects:

A unified object model ensures that there’s no performance tax and little ceremony when going up and down the performance-sloppiness spectrum.
Unsafe abstractions not only allow an expert programmer to write optimal code, but, crucially, they allow wrapping it into a misuse-resistant interface, which a non-expert programmer can easily use from a high-level Rust dialect.
The performance option is quite an unfair advantage. When you start writing something, you don’t necessarily know how fast the thing will have to be. It often depends on the uncertain future. But, if you can sacrifice just a tiny bit of developer experience to get insurance that, if push comes to shove, you could incrementally arrive at optimal performance without whole-system rewrites, that is often a hard-to-refuse offer.
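The second effect, wrapping unsafe code in a misuse-resistant interface, can be illustrated with a small sketch. CheckedIndices is a hypothetical type invented for this example: it validates indices once, and then uses unchecked indexing internally behind an entirely safe API.

```rust
/// A list of indices that were all verified to be smaller than `len`.
/// (In a real crate the fields would be private to the defining module.)
struct CheckedIndices {
    indices: Vec<usize>,
    len: usize,
}

impl CheckedIndices {
    /// Validates every index once, up front. Returns None on any out-of-range index.
    fn new(indices: Vec<usize>, len: usize) -> Option<CheckedIndices> {
        if indices.iter().all(|&i| i < len) {
            Some(CheckedIndices { indices, len })
        } else {
            None
        }
    }

    /// Sums `values[i]` over the stored indices without per-element bounds checks.
    fn gather_sum(&self, values: &[u64]) -> u64 {
        // One cheap check re-establishes the invariant for this particular slice.
        assert_eq!(values.len(), self.len);
        self.indices
            .iter()
            // SAFETY: every stored index is < self.len, and self.len == values.len().
            .map(|&i| unsafe { *values.get_unchecked(i) })
            .sum()
    }
}
```

A caller writing high-level Rust only ever sees new and gather_sum, and cannot trigger undefined behavior through them.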

The Watermelon Operator

2024-09-24T00:00:00+00:00


In these two most excellent articles, https://without.boats/blog/let-futures-be-futures and https://without.boats/blog/futures-unordered, withoutboats introduces the concepts of “multi-task” and “intra-task” concurrency. I want to revisit this distinction — while I agree that there are different classes of patterns of concurrency here, I am not quite satisfied with this specific partitioning of the design space. I will use Rust-like syntax for most of the examples, but I am more interested in the language-agnostic patterns, rather than in Rust’s specific implementation of async.

The Two Examples

Let’s introduce the two kinds of concurrency using a somewhat abstract example. We want to handle a Request by doing some computation and then persisting the results in the database and in the cache. Notably, writes to the cache and to the database can proceed concurrently. So, something like this:

async fn process(
  db: Database,
  cache: Cache,
  request: Request,
) -> Response {
  let response = compute_response(db, cache, request).await;
  spawn(update_db(db, response));
  spawn(update_cache(cache, response));
  response
}

async fn update_db(db: Database, response: Response);
async fn update_cache(cache: Cache, response: Response);

fn spawn<T>(f: impl Future<Output = T>) -> JoinHandle<T>;

This is multi-task concurrency style: we fire off two tasks for updating the database and the cache. Here’s the same snippet in intra-task style, where we use the join function on futures:

async fn process(
  db: Database,
  cache: Cache,
  request: Request,
) -> Response {
  let response = compute_response(db, cache, request).await;
  join(
    update_db(db, response),
    update_cache(cache, response),
  ).await;
  response
}

async fn update_db(db: Database, response: Response) { ... }
async fn update_cache(cache: Cache, response: Response) { ... }

async fn join<U, V>(
  f: impl Future<Output = U>,
  g: impl Future<Output = V>,
) -> (U, V);

In other words:

Multi-task concurrency uses spawn: an operation that takes a future and starts a task that executes independently of the parent task.

Intra-task concurrency uses join — an operation that takes a pair of futures and executes them concurrently as a part of the current task.
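To show what "as a part of the current task" means mechanically, here is a minimal sketch of join using only the standard library. It is not the implementation from the futures crate; block_on is a deliberately naive busy-polling executor, and yield_now, worker and join_demo are hypothetical helpers added for the demonstration.

```rust
use std::cell::RefCell;
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Naive executor: polls the future in a loop until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Polls both futures from within the caller's own poll; no new task is created.
async fn join<F1: Future, F2: Future>(fut1: F1, fut2: F2) -> (F1::Output, F2::Output) {
    let mut fut1 = pin!(fut1);
    let mut fut2 = pin!(fut2);
    let (mut out1, mut out2) = (None, None);
    poll_fn(|cx| {
        if out1.is_none() {
            if let Poll::Ready(v) = fut1.as_mut().poll(cx) {
                out1 = Some(v);
            }
        }
        if out2.is_none() {
            if let Poll::Ready(v) = fut2.as_mut().poll(cx) {
                out2 = Some(v);
            }
        }
        if out1.is_some() && out2.is_some() {
            Poll::Ready((out1.take().unwrap(), out2.take().unwrap()))
        } else {
            Poll::Pending
        }
    })
    .await
}

/// Returns Pending exactly once, so that the other future gets a turn.
async fn yield_now() {
    let mut yielded = false;
    poll_fn(|cx| {
        if yielded {
            return Poll::Ready(());
        }
        yielded = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    })
    .await
}

async fn worker(name: &'static str, log: &RefCell<Vec<String>>) {
    log.borrow_mut().push(format!("{name} start"));
    yield_now().await;
    log.borrow_mut().push(format!("{name} end"));
}

fn join_demo() -> Vec<String> {
    let log = RefCell::new(Vec::new());
    block_on(join(worker("db", &log), worker("cache", &log)));
    log.into_inner()
}
```

Calling join_demo() returns the interleaved log "db start", "cache start", "db end", "cache end": both workers make progress concurrently, yet both are driven by the single future that block_on polls.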

But what is the actual difference between the two?

Parallelism is not

One candidate is parallelism: with spawn, the tasks can run not only concurrently, but actually in parallel, on different CPU cores. join restricts them to the same thread that runs the main task. But I think this is not quite right, abstractly, and is more of a product of specific Rust APIs. There are executors which spawn on the current thread only. And, while in Rust it’s not really possible to make join poll the futures in parallel, I think this is just an artifact of Rust’s existing API design (futures can’t opt out of synchronous cancellation). In other words, I think it is possible in theory to implement an async runtime which provides all of the following functions at the same time:

fn spawn<F>(fut: F) -> JoinHandle<F::Output>
where
  F: Future;

fn pspawn<F>(fut: F) -> PJoinHandle<F::Output>
where
  F: Future + Send + 'static,
  F::Output: Send + 'static;

async fn join<F1, F2>(
  fut1: F1,
  fut2: F2,
) -> (F1::Output, F2::Output)
where
  F1: Future,
  F2: Future;

async fn pjoin<F1, F2>(
  fut1: F1,
  fut2: F2,
) -> (F1::Output, F2::Output)
where
  F1: Future + Send, // NB: only Send, no 'static
  F1::Output:  Send,
  F2: Future + Send,
  F2::Output:  Send;

To confuse matters further, let’s rewrite our example in TypeScript:

async function process(
  db: Database,
  cache: Cache,
  request: Request,
): Promise<Response> {
  const response = await compute_response(db, cache, request);
  const db_update = update_db(db, response);
  const cache_update = update_cache(cache, response);
  await Promise.all([db_update, cache_update]);
  return response
}

and using Rust’s rayon library:

fn process(
  db: Database,
  cache: Cache,
  request: Request,
) -> Response {
  let response = compute_response(db, cache, request);
  rayon::join(
    || update_db(db, response),
    || update_cache(cache, response),
  );
  response
}

Are these examples multi-task or intra-task? To me, the TypeScript one feels multi-task: although it is syntactically close to join().await, the two update promises are running independently from the parent task. If we forget the call to Promise.all, the cache and the database would still get updated (but likely after we would have returned the response to the user)! In contrast, rayon feels intra-task: although the closures could get stolen and be run by a different thread, they won’t “escape” the dynamic extent of the encompassing process call.

To await or await to?

Let’s zoom in onto the JS and the join examples:

async function process(
  db: Database,
  cache: Cache,
  request: Request
): Promise<Response> {
  const response = await compute_response(db, cache, request);

  await Promise.all([
    update_db(db, response),
    update_cache(cache, response),
  ]);

  return response;
}

async fn process(
  db: Database,
  cache: Cache,
  request: Request,
) -> Response {
  let response = compute_response(db, cache, request).await;

  join(
    update_db(db, response),
    update_cache(cache, response),
  ).await;

  response
}

I’ve re-written the JavaScript version to be syntactically isomorphic to the Rust one. The difference is on the semantic level: JavaScript promises are eager, they start executing as soon as a promise is created. In contrast, Rust futures are lazy: they do nothing until polled. And this, I think, is the fundamental difference: lazy vs. eager “futures” (thread::spawn is an eager “future” while rayon::join is a lazy one).

And it seems that lazy semantics is quite a bit more elegant! The beauty of

join(
  update_db(db, response),
  update_cache(cache, response),
).await;

is that it’s Molière’s prose — this is structured concurrency, but without bundles, nurseries, scopes, and other weird APIs.

It makes runtime semantics nicer even in dynamically typed languages. In JavaScript, forgetting an await is a common, and very hard to spot problem — without await, code still works, but is sometimes wrong (if the async operation doesn’t finish quite as fast as usual). Imagine JS with lazy promises — there, forgetting an await would always consistently break. So, the need to statically lint missing awaits will be less pressing.

Compare this with Erlang’s take on nulls: while in typical dynamically typed languages partial functions can return a value T or a None, in Erlang the convention is to return either {ok, T} or none. That is, even if the value is non-null, the call-site is forced to unpack it, you can’t write code that happens to work as long as T is non-null.

And of course, in Rust, the killer feature of lazy futures is that you can just borrow data from the enclosing scope.
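Both properties, laziness and borrowing from the enclosing scope, are easy to observe directly. The following is a minimal sketch using only the standard library; block_on is a naive busy-polling executor written for the demonstration, and mark_updated and lazy_demo are hypothetical names.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Naive executor: polls the future in a loop until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Borrows a flag from the caller's stack frame and sets it when polled.
async fn mark_updated(updated: &Cell<bool>) {
    updated.set(true);
}

/// Returns the flag's value before and after the future is driven.
fn lazy_demo() -> (bool, bool) {
    let updated = Cell::new(false);
    // Creating the future borrows the local but runs none of its body.
    let fut = mark_updated(&updated);
    let before = updated.get();
    block_on(fut);
    let after = updated.get();
    (before, after)
}
```

lazy_demo() returns (false, true): nothing happens when the future is created, only when it is polled. An eager promise would already have started running at the point where before is read.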

But it seems like there is one difference between multi-task and intra-task concurrency.

One, Two, N, and More

In the words of withoutboats:

The first limitation is that it is only possible to achieve a static arity of concurrency with intra-task concurrency. That is, you cannot join (or select, etc) an arbitrary number of futures with intra-task concurrency: the number must be fixed at compile time.

That is, you can do join(a, b).await, and

join(
  join(a, b),
  c,
).await

and, with some macros, even

join!(a, b, c, d, e, f).await;

but you can’t do join(xs...).await.

I think this is incorrect, in a trivial and in an interesting way.

The trivial incorrectness is that there’s join_all, that takes a slice of futures and is a direct generalization of join to a runtime-variable number of futures.

But join_all still can’t express the case where you don’t know the number of futures up-front, where you spawn some work, and only later realize that you need to spawn some more.

This is sort-of possible to express with FuturesUnordered, but that’s a yuck API. I mean, even its name screams “DO NOT USE ME!”.

But I do think that this is just an unfortunate API, and that the pattern actually can be expressed in intra-task concurrency style nicely.

Let’s take a closer look at the base case, join!

Asynchronous Semicolon

Section title is a bit of a giveaway. The join operator is async ;. The semicolon is an operator of sequential composition: A; B

runs A first and then B.

In contrast, join is concurrent composition: join(A, B)

runs A and B concurrently.

And both join and ; share the same problem — they can compose only a finite number of things.

But that’s why we have other operators for sequential composition! If we know how many things we need to run, we can use a counted for loop. And join_all is an analogue of a counted for loop!

In case where we don’t know up-front when to stop, we use a while. And this is exactly what we miss — there’s no concurrently-flavored while operator.

Importantly, what we are looking for is not an async for:

async for x in iter {
  process(x).await;
}

Here, although there could be some concurrency inside a single loop iteration, the iterations themselves are run sequentially. The second iteration starts only when the first one finished. Pictorially, this looks like a spiral, or a loop if we look from the side:

What we rather want is to run many copies of the body concurrently, something like this:

A spindle-like shape with many concurrent strands, which looks like wheel’s spokes from the side. Or, if you are really short on fitting metaphors:

The Watermelon Operator

Now, I understand that I’ve already poked fun at the unfortunate FuturesUnordered name, but I can’t really find a fitting name for the construct we want here. So I am going to boringly use the concurrently keyword, which is way too long, but I’ll refer to it as “the watermelon operator”. The stripes on the watermelon resemble the independent strands of execution this operator creates:

So, if you are writing a TCP server, your accept loop could look like this:

concurrently let Some(socket) = listener.accept().await in {
  handle_connection(socket).await;
}.await

This runs accept in a loop, and, for each accepted socket, runs handle_connection concurrently. There are as many concurrent handle_connection calls as there are ready sockets in our listener!

Let’s limit the maximum number of concurrent connections, to provide back pressure:

let semaphore = Semaphore::new(16);

concurrently
  let Some((socket, permit)) = try {
    let permit = semaphore.acquire().await;
    let socket = listener.accept().await?;
    (socket, permit)
  }
in {
  handle_connection(socket).await;
  drop(permit);
}.await

You get the idea (hopefully):

In the “head” of our concurrent loop (cooloop?) construct, we first acquire a semaphore permit and then fetch a socket.
Both the socket and the permit are passed to the body.
The body releases the permit at the end.
While the “head” construct runs in a loop concurrently to bodies, it is throttled by the minimum of the available permits and ready connections.

To make this more concrete, let’s spell this out as a library function:

async fn join<F1, F2>(
  fut1: F1,
  fut2: F2,
) -> (F1::Output, F2::Output)
where
  F1: Future,
  F2: Future;

async fn join_all<F>(futs: Vec<F>) -> Vec<F::Output>
where
  F: Future;

async fn concurrently<C, FC, B, FB, T>(condition: C, body: B)
where
  C: FnMut() -> FC,
  FC: Future<Output = Option<T>>,
  B: FnMut(T) -> FB,
  FB: Future<Output = ()>;

I claim that this is the full set of “primitive” operations needed to express more-or-less everything in intra-task concurrency style.
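For illustration, here is one way the concurrently function could be written today with only the standard library. Treat it as a rough sketch rather than a proposed implementation: it boxes every future on the heap, it polls all bodies on every wake-up, and a head that is always ready would starve the bodies. block_on is a naive busy-polling executor and watermelon_demo is a hypothetical usage example.

```rust
use std::cell::{Cell, RefCell};
use std::future::{poll_fn, Future};
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Naive executor: polls the future in a loop until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Runs `condition` in a loop; for every `Some(item)` it starts `body(item)`
/// and drives all started bodies concurrently, inside the current task.
async fn concurrently<C, FC, B, FB, T>(mut condition: C, mut body: B)
where
    C: FnMut() -> FC,
    FC: Future<Output = Option<T>>,
    B: FnMut(T) -> FB,
    FB: Future<Output = ()>,
{
    // The "head" of the loop; None once the condition has returned None.
    let mut head: Option<Pin<Box<FC>>> = Some(Box::pin(condition()));
    let mut bodies: Vec<Pin<Box<FB>>> = Vec::new();
    poll_fn(|cx| {
        // Pull as many items as the head can produce right now.
        while let Some(fut) = head.as_mut() {
            let polled = fut.as_mut().poll(cx);
            match polled {
                Poll::Ready(Some(item)) => {
                    bodies.push(Box::pin(body(item)));
                    head = Some(Box::pin(condition()));
                }
                Poll::Ready(None) => head = None,
                Poll::Pending => break,
            }
        }
        // Poll every running body and drop the ones that finished.
        bodies.retain_mut(|b| b.as_mut().poll(cx).is_pending());
        if head.is_none() && bodies.is_empty() {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    })
    .await
}

/// The head yields 0, 1, 2 and then stops; each body records its item.
fn watermelon_demo() -> Vec<u32> {
    let next = Cell::new(0u32);
    let seen = RefCell::new(Vec::new());
    let (next_ref, seen_ref) = (&next, &seen);
    block_on(concurrently(
        move || {
            let n = next_ref.get();
            next_ref.set(n + 1);
            async move { if n < 3 { Some(n) } else { None } }
        },
        move |n| async move {
            seen_ref.borrow_mut().push(n);
        },
    ));
    seen.into_inner()
}
```

watermelon_demo() returns vec![0, 1, 2]. Note that both closures borrow next and seen from the enclosing stack frame, which is exactly what the intra-task style buys us.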

In particular, we can implement multi-task concurrency this way! To do so, we’ll write a universal watermelon operator, where the T which is passed to the body is a Box<dyn Future<Output = ()>>, and where the body just runs this future:

async fn multi_task_concurrency_main(
  spawn: impl Fn(impl Future<Output = ()> + 'static),
) {
    ...
}

type AnyFuture = Box<dyn Future<Output = ()> + 'static>;

async fn universal_watermelon() {
  let (sender, receiver) = channel::<AnyFuture>();
  join(
    multi_task_concurrency_main(move |fut| {
      sender.send(Box::new(fut))
    }),
    concurrently(
      // The head yields the next spawned future, or None once the channel closes.
      || async {
        receiver.recv().await
      },
      // The body simply runs the spawned future to completion.
      |fut| async {
        fut.await;
      },
    ),
  )
  .await;
}

Note that the conversion in the opposite direction is not possible! With intra-task concurrency, we can borrow from the parent stack frame. So it is not a problem to restrict that to only allow 'static futures into the channel. In a sense, in the above example we return the future up the stack, which explains why it can’t borrow locals from our stack frame.

With multi-task concurrency though, we start with static futures. To let them borrow any stack data requires unsafe.

Note also that the above set of operators, join, join_all, concurrently is orthogonal to parallelism. Alongside those operators, there could exist pjoin, pjoin_all and pconcurrently with the Send bounds, such that you could mix and match parallel and single-core concurrency.

If a Stack is a Tree, Does it Make Any Difference?

One possible objection to the above framing of watermelon as a language-level operator is that it seemingly doesn’t pass zero-cost abstraction test. It can start an unbounded number of futures, and those futures have to be stored somewhere. So we have a language operator which requires dynamic memory allocation, which is a big no-no for any systems programming language.

I think there is some truth to it, and not an insignificant amount of it, but I think I can maybe weasel out of it.

Consider recursion. Recursion also can allocate arbitrary amount of memory (on the stack), but that is considered fine (I would also agree that it is not in fact fine that unbounded recursion is considered fine, but, for the scope of this discussion, I will be a hypocrite and will ignore that opinion of mine).

And here, we have essentially the same situation — we want to allocate arbitrary many (async) stack frames, arranged in a tree. Doing it “on the heap” is easy, but we don’t like the heap here. Luckily, I believe there’s a compilation scheme (hat tip to @rpjohnst for patiently explaining it to me five times in different words) that implements this more-or-less as efficiently as the normal call stack.

The idea is that we will have two stacks — a sync one and an async one. Specifically:

Every sync function we compile normally, with a single stack.
Async functions get two stack pointers. So, we burn sp and one other register (let’s call it asp).
If an async function calls a sync function, the callee’s frame is pushed onto sp. Crucially, because sync functions can only call other sync functions, the callee doesn’t need to know the value of asp.
If an async function calls another async function, the frame (specifically, the “variables live across await point” part of it) is pushed onto asp.
This async stack is segmented. So, for async function calls, we also do a check for “do we have enough stack?” and, if not, allocate a new segment, linking them via a frame pointer.
“Allocating a new segment” doesn’t mean that we actually go and call malloc. Rather, there’s a fixed-sized contiguous slab of say, 8 megs, out of which all async frames are allocated.
If we are out of async-stack, we crash in pretty much the same way as for the boring sync stack overflow.

While this looks just like Go-style segmented stacks, I think this scheme is quite a bit more efficient (warning: I in general have a tendency to confidently talk about things I know little about, and this one is the extreme case of that. If some Go compiler engineer disagrees with me, I am probably in the wrong!).

The main difference is that the distinction between sync and async functions is maintained in the type system. There are no changes for sync functions at all, so the principle of don’t pay for what you don’t use is observed. This is in contrast to Go — I believe that Go, in general, can’t know whether a particular function can yield (that is, if any function it (indirectly) calls can yield), so it has to conservatively insert stack checks everywhere.

Then, even the async stack frames don’t have to store everything, but just the stuff live across await. Everything that happens between two awaits can go to the normal stack.

On top of that, async functions can still do aggressive inlining. So, the async call (and the stack growth check) has to happen only for dynamically dispatched async calls!

Furthermore, the future trait could have some kind of size_hint method, which returns the lower and the upper bound on the size of the stack. Fully concrete futures type-erased to dyn Future would return the exact amount (a, Some(a)). The caller would be required to allocate at least a bytes of the async stack. The callee uses that contract to elide stack checks. An unknown upper bound, (a, None), would only be returned if the type-erased concrete future itself calls something dynamically dispatched. So only dynamically dispatched calls would have to do stack growth checks, and that cost seems negligible in comparison to the cost of missing optimizations due to inability to inline.

Altogether, it feels like this adds up to something sufficiently cheap to just call it “async stack allocation”.

I guess that’s all for today? Summarizing:

Inter-task vs intra-task distinction is mostly orthogonal to the question of parallelism.
I claim that this is the same distinction as between eager and lazy futures.
In particular, there are no principled obstacles for runtime-bounded intra-task concurrency.
But we do miss a FuturesUnordered, but nice. The concurrently operator/function feels like a sufficiently low-hanging watermelon here.
One wrinkle is that watermelon requires dynamic allocation, but it looks like we could just completely upend the compilation strategy we use for futures, implement async segmented stacks which should be pretty fast, and also gain nice dynamically dispatched (and recursive) async functions for free?

Haha, just kidding! Bonus content! This really should be a separate blog post, but it is tangentially related, so here we go:

Applied Duality

So far, we’ve focused on join, the operator that takes two futures, and “runs” them concurrently, returning both results as a pair. But there’s a second, dual operator:

async fn race<F1, F2>(
  fut1: F1,
  fut2: F2,
) -> Either<F1::Output, F2::Output>
where
  F1: Future,
  F2: Future,

Like join, race runs two futures concurrently. Unlike join, it returns only one result — that which came first. This operator is the basis for a more general select facility.

Although race is dual to join, I don’t think it is as fundamental. It is possible to have two dual things, where one of them is in the basis and the other is derived. For example, it is an axiom of set theory that the union of two sets, A ∪ B, is a set. Although the intersection of sets, A ∩ B, is dual to union, the existence of intersection is not an axiom. Rather, the intersection is defined using the axiom of specification:

A ∩ B := {x ∈ A : x ∈ B}

Proposition 131.7.1: race can be defined in terms of join

The race operator is trickier than it seems. Yes, it returns the result of the future that finished first, but what happens with the other one? It gets cancelled. Rust implements this cancellation “for free”, by just dropping the future, but this is restrictive. This is precisely the issue that prevents pjoin from working.

I postulate that fully general cancellation is an asynchronous protocol:

A requests that B is cancelled.
B receives this cancellation request and starts winding down.
A waits until B is cancelled.

That is, cancellation is not “I cancel thou”. Rather it is “I ask you to stop, and then I cooperatively wait until you do so”. This is very abstract, but the following three examples should help make this concrete.

A is some generic asynchronous task, which offloads some computation-heavy work to a CPU pool. That work (B) doesn’t have checks for cancelled flags. So, if A is cancelled, it can’t really stop B, which means we are violating structured concurrency.
A is doing async IO. Specifically, A uses io_uring to read data from a socket. A owns a buffer, and passes a pointer to it to the kernel via io_uring as the target buffer for a read syscall. While A is being cancelled, the kernel writes data to this buffer. If A doesn’t wait until the kernel is done, buffer’s memory might get reused, and the kernel would corrupt some unrelated data.

These examples are somewhat unsatisfactory: the first is philosophical (who needs structured concurrency?), while the second is esoteric (who uses io_uring in 2024?). But the two can be combined into something rather pedestrianly bad:

As in the first example, an async task submits some work to a CPU pool. But this time the work is very specific: computing a cryptographic checksum of a message owned by A. Because this is cryptography, this is going to be some hyper-optimized SIMD loop which definitely won’t have any affordance for checking some sort of a cancelled flag. The loop would have to run to completion, or at least to a safe point. And, because the loop checksums data owned by A, we can’t destroy A before the loop exits, otherwise it’ll be reading garbage memory!

And this example is the reason why

async fn pjoin<F1, F2>(
  fut1: F1,
  fut2: F2,
) -> (F1::Output, F2::Output)
where
  F1: Future + Send,
  F1::Output:  Send,
  F2: Future + Send,
  F2::Output:  Send,

can’t be a thing in Rust: if fut1 runs on a thread separate from the pjoin future, then, if pjoin ends up being cancelled, fut1 would be pointing at garbage. You could have

async fn pjoin<F1, F2>(
  fut1: F1,
  fut2: F2,
) -> (F1::Output, F2::Output)
where
  F1: Future + Send + 'static,
  F1::Output:  Send + 'static,
  F2: Future + Send + 'static,
  F2::Output:  Send + 'static,

but that removes one of the major benefits of intra-task style API — ability to just borrow data.

So the fully general cancellation should be cooperative. Let’s assume that it is driven by some sort of cancellation token API:

impl CancellationSource {
  fn request_cancellation(&self) { ... }
  async fn await_cancellation(self) { ...  }

  async fn cancel(self) {
    self.request_cancellation();
    self.await_cancellation().await;
  }

  fn new_token(&self) -> CancellationToken { ... }
}

impl CancellationToken {
  fn is_cancelled(&self) -> bool { ... }
  fn on_cancelled(&self, callback: impl FnOnce()) { ... }
}

Note that the question of cancellation being cooperative is orthogonal to the question of explicit threading of cancellation tokens! They can be threaded implicitly (cooperative, implicit cancellation is how Python’s trio does this, though they don’t really document the cooperative part (the shields stuff)).

With this, we can write our own race — we’ll create a cancellation scope and then join modified futures, each of which would cancel the other upon completion:

async fn race<U, V>(
  fut1: impl async FnOnce(&CancellationToken) -> U,
  fut2: impl async FnOnce(&CancellationToken) -> V,
) -> Either<U, V> {
  let source = CancellationSource::new();
  let token = source.new_token();
  let u_or_v = join(
    async {
      let u = fut1(&token).await;
      if token.is_cancelled() {
        return None;
      }
      source.cancel();
      Some(u)
    },
    async {
      let v = fut2(&token).await;
      if token.is_cancelled() {
        return None;
      }
      source.cancel();
      Some(v)
    },
  )
  .await;
  match u_or_v {
    (Some(u), None) => Left(u),
    (None, Some(v)) => Right(v),
    _ => unreachable!(),
  }
}

In other words, race is but a cooperatively-cancelled join!
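The same shape can be tried out today with plain threads instead of futures. The sketch below is a thread-based analogue, not the async API from above: scoped threads stand in for join, an AtomicBool stands in for the cancellation token, and try_cancel is a name made up for this example. Waiting for the loser to wind down is done by the scope, which joins both threads before returning.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

#[derive(Debug, PartialEq)]
enum Either<U, V> {
    Left(U),
    Right(V),
}

/// Hypothetical stand-in for the token API sketched above.
struct CancellationToken {
    cancelled: AtomicBool,
}

impl CancellationToken {
    fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::Acquire)
    }

    /// Requests cancellation; returns true only for the first caller,
    /// which is how the winner of the race is decided.
    fn try_cancel(&self) -> bool {
        self.cancelled
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}

/// race as a cooperatively-cancelled join of two threads.
/// Both tasks may borrow from the caller, because the scope
/// does not return until both threads have finished.
fn race<U: Send, V: Send>(
    task1: impl FnOnce(&CancellationToken) -> U + Send,
    task2: impl FnOnce(&CancellationToken) -> V + Send,
) -> Either<U, V> {
    let token = CancellationToken { cancelled: AtomicBool::new(false) };
    let (u, v) = thread::scope(|scope| {
        let first = scope.spawn(|| {
            let u = task1(&token);
            // Keep the result only if we were first to cancel the other side.
            token.try_cancel().then_some(u)
        });
        let second = scope.spawn(|| {
            let v = task2(&token);
            token.try_cancel().then_some(v)
        });
        (
            first.join().expect("first task panicked"),
            second.join().expect("second task panicked"),
        )
    });
    match (u, v) {
        (Some(u), None) => Either::Left(u),
        (None, Some(v)) => Either::Right(v),
        _ => unreachable!(),
    }
}
```

The loser is never dropped mid-flight: it notices the flag at a point of its own choosing, returns, and only then does race return. A task that never checks the token simply makes race wait until it finishes.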

That’s all for real for today, viva la vida!

What is io_uring?

2024-09-23T00:00:00+00:00

What is io_uring? Sep 23, 2024

An attempt at a concise explanation of what io_uring is.

io_uring is a new Linux kernel interface for making system calls. Traditionally, syscalls are submitted to the kernel individually and synchronously: a syscall CPU instruction transfers control from the application to the kernel; control returns to the application only when the syscall is completed. In contrast, io_uring is a batched and asynchronous interface. The application submits several syscalls by writing their codes & arguments to a lock-free shared-memory ring buffer. The kernel reads the syscalls from this shared memory and executes them at its own pace. To communicate results back to the application, the kernel writes the results to a second lock-free shared-memory ring buffer, where they become available to the application asynchronously.
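The data flow can be modelled in a few lines. This is a single-threaded toy and not the real io_uring ABI or any library binding: Ring, Submission, Completion and kernel_drain are made-up names, the only "syscall" is adding two numbers, and the real rings are lock-free memory shared with the kernel. The user_data field mirrors the real idea of tagging a request so that its completion can be matched up later.

```rust
/// Fixed-capacity ring buffer with free-running head and tail counters.
struct Ring<T> {
    slots: Vec<Option<T>>,
    head: usize, // next slot to read
    tail: usize, // next slot to write
}

impl<T> Ring<T> {
    fn new(capacity: usize) -> Ring<T> {
        assert!(capacity > 0);
        Ring { slots: (0..capacity).map(|_| None).collect(), head: 0, tail: 0 }
    }

    /// Returns the item back to the caller if the ring is full.
    fn push(&mut self, item: T) -> Result<(), T> {
        if self.tail - self.head == self.slots.len() {
            return Err(item);
        }
        let index = self.tail % self.slots.len();
        self.slots[index] = Some(item);
        self.tail += 1;
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.head == self.tail {
            return None;
        }
        let index = self.head % self.slots.len();
        self.head += 1;
        self.slots[index].take()
    }
}

/// A pretend syscall request: "add a and b".
struct Submission {
    user_data: u64,
    a: u64,
    b: u64,
}

#[derive(Debug, PartialEq)]
struct Completion {
    user_data: u64,
    result: u64,
}

/// The "kernel" side: consume a batch of requests, publish the results.
fn kernel_drain(submissions: &mut Ring<Submission>, completions: &mut Ring<Completion>) {
    while let Some(request) = submissions.pop() {
        let done = Completion {
            user_data: request.user_data,
            result: request.a + request.b,
        };
        if completions.push(done).is_err() {
            panic!("completion ring is full");
        }
    }
}
```

The application pushes several submissions, the kernel drains them as one batch, and the application later pops completions at its own pace.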

You might want to use io_uring if:

you need extra performance unlocked by amortizing userspace/kernelspace context switching across entire batches of syscalls,
you want a unified asynchronous interface to the entire system.

You might want to avoid io_uring if:

you need to write portable software,
you want to use only old, proven features,
and in particular you want to use features with a good security track record.

Try to Fix It One Level Deeper

2024-09-06T00:00:00+00:00

Try to Fix It One Level Deeper Sep 6, 2024

I had a productive day today! I did many different and unrelated things, but they all had the same unifying theme:

There’s a bug! And it is sort-of obvious how to fix it. But if you don’t laser-focus on that, and try to perceive the surrounding context, it turns out that the bug is valuable, and it is pointing in the direction of a bigger related problem. So, instead of fixing the bug directly, a detour is warranted to close off the avenue for a class of bugs.

Here are the examples!

In the morning, my colleague pointed out that we are giving a substandard error message for a pretty stressful situation when the database runs out of disk space. I went ahead and added appropriate log messages to make it clearer. But then I stopped for a moment and noticed that the problem is bigger: we are missing an infrastructure for fatal errors, and NoSpaceLeft is just one error of that kind. So I went ahead and added that along the way: #2289.

Then, I was reviewing a PR by @martinconic which was fixing some typos, and noticed that it was also changing the formatting of our Go code. The latter is by far the bigger problem, as it is a sign that we somehow are not running gofmt during our CI, which I fixed in #2287.

Then, there was a PR from yesterday, where we again had a not quite right log message. The cause was a confusion between two compile-time configuration parameters, which were close, but not quite identical. So, instead of fixing the error message I went ahead and made the two parameters exactly the same. But then my colleague noticed that I actually failed to fix it one level deeper in this case! Turns out, it is possible to remove this compile-time parametrization altogether, which I did in #2292.

But these all were randomly-generated side quests. My intended story line for today was to refactor the piece of code I had trouble explaining (and understanding!) on yesterday’s episode of Iron Beetle. To get into the groove, I decided to first refactor the code that calls the problematic piece of logic, as I noticed a couple of minor stylistic problems there. Of course, when doing that, I discovered that we have a bit of dead code, which luckily doesn’t affect correctness, but does obscure the logic. While fixing that, I used one of my favorite Zig patterns: defer assert(postcondition);

It of course failed in the simulator in a way postcondition checks tend to fail — there was an unintended reentrancy in the code. So I slacked my colleague something like

I thought myself to be so clever adding this assert, but now it fails and I have to fix it TT I think I’ll just go and .next_tick the prefetch path. It feels like there should be a more elegant solution here, but I am not seeing it.

But of course I can’t just “go and .next_tick it”, so here I am, trying to figure out how to encode a Duff’s device in Zig pre-#8220, so as to make this class of issues much less likely.

The Fundamental Law Of Software Dependencies

2024-09-03T00:00:00+00:00

The Fundamental Law Of Software Dependencies Sep 3, 2024

Canonical source code for software should include checksums of the content of all its dependencies.

Several examples of the law:

Software obviously depends on its source code. The law says that something should hold the hash of the entire source, and thus mandates the use of a content-addressed version control system such as git.

Software often depends on 3rd party libraries. These libraries could in turn depend on other libraries. It is imperative to include a lockfile that covers this entire set and comes with checksums. Curiously, the lockfile itself is a part of source code, and gets mixed into the VCS root hash.

Software needs a compiler. The hash of the required compiler should be included in the lockfile. Typically, this is not done: only the version is specified. I think that is a mistake. Specifying a version and a hash is not much more trouble than just the version, but that gives you a superpower: you no longer need to trust the party that distributes your compiler. You could take a shady blob of bytes you’ve found lying on the street, as long as its checksum checks out.

Note that you can compress hashes by mixing them. For the compiler use-case, there’s a separate hash per platform, because the Linux and the Windows versions of the compiler differ. This doesn’t mean that your project should include one compiler hash per platform; one hash is enough. The compiler distribution should include a manifest: a small text file which lists all platforms and their platform-specific hashes. The single hash of that file is what is to be included by downstream consumers. To verify a specific binary, the consumer first downloads the manifest, checks that it has the correct hash, and then extracts the hash for the specific platform.
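A minimal sketch of that verification flow, under loud assumptions: the manifest format (one "platform hash" pair per line) is invented for this example, and toy_hash is 64-bit FNV-1a, used only to keep the sketch dependency-free. FNV-1a is not collision resistant, so a real setup would need a cryptographic hash in its place.

```rust
/// Stand-in hash: 64-bit FNV-1a. NOT cryptographic; illustration only.
fn toy_hash(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Checks a compiler binary against the single pinned manifest hash.
/// Assumed manifest format: one "<platform> <hex hash>" pair per line.
fn verify_compiler(
    pinned_manifest_hash: u64,
    manifest: &str,
    platform: &str,
    binary: &[u8],
) -> Result<(), String> {
    // Step 1: the manifest itself must match the hash stored in the project.
    if toy_hash(manifest.as_bytes()) != pinned_manifest_hash {
        return Err("manifest hash mismatch".to_string());
    }
    // Step 2: extract the expected hash for this platform.
    let expected = manifest
        .lines()
        .filter_map(|line| line.split_once(' '))
        .find(|(name, _)| *name == platform)
        .map(|(_, hex)| hex)
        .ok_or_else(|| format!("platform {platform} is not in the manifest"))?;
    // Step 3: the binary must match that per-platform hash.
    let actual = format!("{:016x}", toy_hash(binary));
    if actual == expected {
        Ok(())
    } else {
        Err(format!("binary hash mismatch for {platform}"))
    }
}
```

Only pinned_manifest_hash lives in the project. The manifest and the binary can come from any untrusted mirror, since both are checked before use.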

The law is an instrumental goal. By themselves, hashes are not that useful. But to get to the point where you actually know the hashes requires:

Actually learning what are your dependencies (this is not trivial! If you have a single Makefile or an .sh, you most likely don’t know the set of your dependencies).
Coming up with some automated way to download those dependencies.
Fixing dependencies’ build process to become reproducible, so as to have a meaningful hash at all.
Learning to isolate dependencies per project, as hashed dependencies can’t be installed into a global shared namespace.

These things are what actually make developing software easier.

STD Doesn't Have to Abstract OS IO

2024-08-12T00:00:00+00:00

STD Doesn’t Have to Abstract OS IO Aug 12, 2024

A short note on what goes into a language’s standard library, and what’s left for third party libraries to implement!

Usually, the main underlying driving factor here is cardinality. If it is important that there’s only one of a thing, it goes into std. If having many of a thing is a requirement, it is better handled by a third-party library. That is, the usual physical constraint is that there’s only a single standard library, and everyone uses the same standard library. In contrast, there are many different third-party libraries, and they all can be used at the same time.

So, until very recently, my set of rules of thumb for what goes into stdlib looked roughly like this:

If this is a vocabulary type, which will be used by APIs of different libraries, it should be in the stdlib.
If this is a cross platform abstraction around an IO facility provided by an OS, and this IO facility has a reasonable common subset across most OSes, it should be in the stdlib.
If there’s one obvious way to implement it, it might go to stdlib.

So for example something like Vec goes into a standard library, because all other libraries are going to use vectors at the interfaces.

Something like lazy_static doesn’t: while it is often needed, it is not a vocabulary interface type.

But it is acceptable for something like OnceCell to be in std — it is still not a vocabulary type, but, unlike lazy_static, it is clear that the API is more or less optimal, and that there aren’t that many good options to do this differently.

But I’ve changed my mind about the second bullet point, about facilities like file IO or TCP sockets. I was always under the impression that these things are a must for a standard library. But now I think that’s not necessarily true!

Consider randomness. Not the PRNG kind of randomness you’d use to make a game fun, but a cryptographically secure randomness that you’d use to generate an SSH key pair. This sort of randomness ultimately bottoms out in hardware, and fundamentally requires talking to the OS and doing IO. This is squarely the bullet point number 2. And Rust is an interesting case study here: it failed to provide this abstraction in std, even though std itself actually needs it! But this turned out to be mostly a non-issue in practice: a third party crate, getrandom, took the job of writing all the relevant bindings to various platform-specific APIs and using a bunch of conditional compilation to abstract that all away and provide a nice cross-platform API.

So, no, it is not a requirement that std has to wrap any wrappable IOing API. This could be handled by the library ecosystem, if the language allows first-class bindings to raw OS APIs outside of compiler-privileged code (and Rust certainly allows for that).

So perhaps it won’t be too unreasonable to leave even things like files and sockets to community experimentation? In a sense, that is happening in the async land anyway.

To clarify, I still believe that Rust should provide bindings to OS-sourced crypto randomness, and I am extremely happy to see recent motion in that area. But the reason for this belief changed. I no longer feel the mere fact that OS-specific APIs are involved to be particularly salient. However, it is still true that there’s more or less one correct way to do this.

Primitive Recursive Functions For A Working Programmer

2024-08-01T00:00:00+00:00

Primitive Recursive Functions For A Working Programmer Aug 1, 2024

Programmers on the internet often use “Turing-completeness” terminology. Typically, not being Turing-complete is extolled as a virtue or even a requirement in specific domains. I claim that most such discussions are misinformed — that not being Turing complete doesn’t actually mean what folks want it to mean, and is instead a stand-in for a bunch of different practically useful properties, which are mostly orthogonal to actual Turing completeness.

While I am generally descriptivist in nature and am ok with words losing their original meaning as long as the new meaning is sufficiently commonly understood, Turing completeness is a hill I will die on. It is a term from math, it has a very specific meaning, and you are not allowed to re-purpose it for anything else, sorry!

I understand why this happens: to really understand what Turing completeness is and is not you need to know one (simple!) theoretical result about so-called primitive recursive functions. And, although this result is simple, I was only made aware of it in a fairly advanced course during my masters. That’s the CS education deficiency I want to rectify — you can’t teach students the halting problem without also teaching them about primitive recursion!

The post is going to be rather meaty, and will be split in three parts:

In Part I, I give a TL;DR for the theoretical result and some of its consequences. Part II is going to be a whirlwind tour of Turing Machines, Finite State Automata and Primitive Recursive Functions. And then Part III will circle back to practical matters.

If math makes you slightly nauseous, you might want to skip Part II. But maybe give it a try? The math we’ll need will be baby math from first principles, without reference to any advanced results.

Part I: TL;DR

Here’s the key result — suppose you have a program in some Turing complete language, and you also know that it’s not too slow. Suppose it runs faster than O(2^{2^N}). That is, two to the power of two to the power of N, a very large number. In this case, you can implement this algorithm in a non-Turing complete language.

Most practical problems fall into this “faster than two to the power of two to the power of N” space. Hence it follows that you don’t need the full power of a Turing Machine to tackle them. Hence, a language not being Turing complete doesn’t in any way restrict you in practice, or give you extra powers to control the computation.

Or, to restate this: in practice, a program which doesn’t terminate, and a program that needs a billion billion steps to terminate are equivalent. Making something non-Turing complete by itself doesn’t help with the second problem in any way. And there’s a trivial approach that solves the first problem for any existing Turing-complete language — in the implementation, count the steps and bail with an error after a billion.
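The “count the steps and bail” trick fits in a dozen lines. This is a sketch with made-up names (Step, run_with_fuel), using a Collatz iteration as the stand-in program, since nobody has proven that it always halts.

```rust
/// Result of one step of some computation over a state S.
enum Step<S> {
    Continue(S),
    Halt(S),
}

/// Runs `step` until it halts or `fuel` steps have been spent.
/// Returns the final state plus the number of steps taken, or None on timeout.
fn run_with_fuel<S>(initial: S, fuel: u64, step: impl Fn(S) -> Step<S>) -> Option<(S, u64)> {
    let mut state = initial;
    for steps_taken in 0..fuel {
        match step(state) {
            Step::Continue(next) => state = next,
            Step::Halt(result) => return Some((result, steps_taken)),
        }
    }
    None // Out of fuel: indistinguishable, in practice, from "never halts".
}

/// Stand-in program: one iteration of the Collatz process.
fn collatz_step(n: u64) -> Step<u64> {
    if n == 1 {
        Step::Halt(n)
    } else if n % 2 == 0 {
        Step::Continue(n / 2)
    } else {
        Step::Continue(3 * n + 1)
    }
}
```

The wrapper always terminates, whatever the wrapped program does, and that guarantee needs no restriction on the language the program is written in.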

Part II: Weird Machines

The actual theoretical result is quite a bit more general than that. It is (unsurprisingly) recursive:

If a function is computed by a Turing Machine, and the runtime of this machine is bounded by some primitive recursive function of input, then the original function itself can be written as a primitive recursive function.

It is expected that this sounds like gibberish at this point! So let’s just go and prove this thing, right here in this blog post! We’ll work up slowly towards this result. The plan is as follows:

First, to brush up notation, we’ll define Finite State Machines.
Second, we’ll turn our humble Finite State Machine into the all-powerful Turing Machine (spoiler — a Turing Machine is an FSM with a pair of stacks), and, as is customary, wave our hands about the Universal Turing Machine.
Third, we leave the cozy world of imperative programming and define primitive recursive functions.
Finally, we’ll talk about the relative computational power of TMs and PRFs, including the teased result and more!

Finite State Machines

Finite State Machines are simple! An FSM takes a string as input, and returns a binary answer, “yes” or “no”. Unsurprisingly an FSM has a finite number of states: Q0, Q1, …, Qn. A subset of states are designated as “yes” states, the rest are “no” states. There’s also one specific starting state.

The behavior of the state machine is guided by a transition (step) function, s. This function takes the current state of FSM, the next symbol of input, and returns a new state.

The semantics of an FSM is determined by repeatedly applying the single step function for all symbols of the input, and noting whether the final state is a “yes” state or a “no” state.

Here’s an FSM which accepts only strings of zeros and ones of even length:

States:     { Q0, Q1 }
Yes States: { Q0 }
Start State:  Q0

s :: State -> Symbol -> State
s Q0 0 = Q1
s Q0 1 = Q1
s Q1 0 = Q0
s Q1 1 = Q0

This machine ping-pongs between states Q0 and Q1, and ends up in Q0 only for inputs of even length (including an empty input).
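For the programmers in the audience, the same machine transliterates to code almost one-to-one. This is a sketch: the symbols are represented as the characters '0' and '1', and anything outside that alphabet is treated as a bug in the caller.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Q0,
    Q1,
}

/// The transition function `s` from the table above.
fn step(state: State, symbol: char) -> State {
    match (state, symbol) {
        (State::Q0, '0') => State::Q1,
        (State::Q0, '1') => State::Q1,
        (State::Q1, '0') => State::Q0,
        (State::Q1, '1') => State::Q0,
        _ => panic!("symbol outside the alphabet: {symbol:?}"),
    }
}

/// Runs the machine from the start state and reports a "yes" or "no".
fn accepts(input: &str) -> bool {
    let final_state = input.chars().fold(State::Q0, step);
    final_state == State::Q0 // Q0 is the only "yes" state
}
```

Note that the whole memory of the machine is the single State value threaded through fold. There is nowhere to keep a count.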

What can FSMs do? As they give a binary answer, they are recognizers — they don’t compute functions, but rather just characterize certain sets of strings. A famous result is that the expressive power of FSMs is equivalent to the expressive power of regular expressions. If you can write a regular expression for it, you could also do an FSM!

There are also certain things that state machines can’t do. For example they can’t enter an infinite loop. Any FSM is linear in the input size and always terminates. But there are much more specific sets of strings that couldn’t be recognized by an FSM. Consider this set:

{ 1, 010, 00100, 0001000, ... }

That is, an infinite set whose elements are a single ‘1’ surrounded by an equal number of ‘0’s on both sides. Let’s prove that there isn’t a state machine that recognizes this set!

As usual, suppose there is such a state machine. It has a certain number of states — maybe a dozen, maybe a hundred, maybe a thousand, maybe even more. But let’s say fewer than a million. Then, let’s take a string which looks like a million zeros, followed by a one, followed by a million zeros. And let’s observe our FSM eating this particular string.

First of all, because the string is in fact a one surrounded by an equal number of zeros on both sides, the FSM ends up in a “yes” state. Moreover, because the length of the string is much greater than the number of states in the state machine, the state machine necessarily visits some state twice. There is a cycle, where the machine goes from A to B to C to D and back to A. This cycle might be pretty long, but it’s definitely no longer than the total number of states we have.

And now we can fool the state machine. Let’s make it eat our string again, but this time, once it completes the ABCDA cycle, we’ll force it to traverse this cycle again. That is, the original cycle corresponds to some portion of our giant string:

0000 0000000000000000000 00 .... 1 .... 00000
     <- cycle portion ->

If we duplicate this portion, our string will no longer look like a one surrounded by an equal number of zeros, but the state machine will still end up in the “yes” state. Which is a contradiction that completes the proof.

Turing Machine: Definition

A Turing Machine is only slightly more complex than an FSM. Like an FSM, a TM has a bunch of states and a single-step transition function. While an FSM has an immutable input which is being fed to it symbol by symbol, a TM operates with a mutable tape. The input gets written to the tape at the start. At each step, a TM looks at the current symbol on the tape, changes its state according to a transition function and, additionally:

Replaces the current symbol with a new one (which might or might not be different).
Moves the reading head that points at the current symbol one position to the left or to the right.

When a machine reaches a designated halt state, it stops, and whatever is written on the tape at that moment is the result. That is, while FSMs are binary recognizers, TMs are functions. Keep in mind that a TM does not necessarily stop. It might be the case that a TM goes back and forth over the tape, overwrites it, changes its internal state, but never quite gets to the final state.

Here’s an example Turing Machine:

States:  {A, B, C, H}
Start State: A
Final State: H

s :: State -> Symbol -> (State, Symbol, Left | Right)
s A 0 = (B, 1, Right)
s A 1 = (H, 1, Right)
s B 0 = (C, 0, Right)
s B 1 = (B, 1, Right)
s C 0 = (C, 1, Left)
s C 1 = (A, 1, Left)

If the configuration of the machine looks like this:

000010100000
     ^
     A

Then we are in the s A 0 = (B, 1, Right) case, so we should change the state to B, replace 0 with 1, and move to the right:

000011100000
      ^
      B
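The single step above can be replayed in code. This is a sketch with two simplifications of my own: the tape is a finite Vec rather than an infinite tape (walking off either end panics), and reaching a state/symbol pair with no rule panics as well.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    A,
    B,
    C,
    H,
}

enum Move {
    Left,
    Right,
}

/// The transition function `s` of the example machine.
fn transition(state: State, symbol: u8) -> (State, u8, Move) {
    match (state, symbol) {
        (State::A, 0) => (State::B, 1, Move::Right),
        (State::A, 1) => (State::H, 1, Move::Right),
        (State::B, 0) => (State::C, 0, Move::Right),
        (State::B, 1) => (State::B, 1, Move::Right),
        (State::C, 0) => (State::C, 1, Move::Left),
        (State::C, 1) => (State::A, 1, Move::Left),
        _ => panic!("no rule for {state:?} on symbol {symbol}"),
    }
}

struct Machine {
    tape: Vec<u8>,
    head: usize,
    state: State,
}

impl Machine {
    /// One step: rewrite the current cell, switch state, move the head.
    fn step(&mut self) {
        let (state, symbol, direction) = transition(self.state, self.tape[self.head]);
        self.tape[self.head] = symbol;
        self.state = state;
        match direction {
            Move::Left => self.head -= 1,
            Move::Right => self.head += 1,
        }
    }

    /// Steps until the halt state H is reached (which may be never!).
    fn run(&mut self) {
        while self.state != State::H {
            self.step();
        }
    }
}

fn parse_tape(text: &str) -> Vec<u8> {
    text.bytes().map(|byte| byte - b'0').collect()
}

fn show_tape(tape: &[u8]) -> String {
    tape.iter().map(|bit| (b'0' + bit) as char).collect()
}
```

Starting from the configuration above (head at index 5, state A), one call to step yields exactly the second picture.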

Turing Machine: Programming

There are a bunch of fiddly details to Turing Machines!

The tape is conceptually infinite, so beyond the input, everything is just zeros. This creates a problem: it might be hard to say where the input (or the output) ends! There are a couple of technical solutions here. One is to say that there are three different symbols on the tape — zeros, ones, and blanks, and require that the tape is initialized with blanks. A different solution is to invent some encoding scheme. For example, we can say that the input is a sequence of 8-bit bytes, without interior null bytes. So, eight consecutive zeros at a byte boundary designate the end of input/output.

It’s useful to think about how this byte-oriented TM could be implemented. We could have one large state for each byte of input. So, Q142 would mean that the head is on the byte with value 142. And then we’ll have a bunch of small states to read out the current byte. Eg, we start reading a byte in state S. Depending on the next bit we move to S0 or S1, then to S00, or S01, etc. Once we reached something like S01111001, we move back 8 positions and enter state Q121. This is one of the patterns of Turing Machine programming — while your main memory is the tape, you can represent some constant amount of memory directly in the states.

What we’ve done here is essentially lowering a byte-oriented Turing Machine to a bit-oriented machine. So, we could think only in terms of big states operating on bytes, as we know the general pattern for converting that to direct bit-twiddling.

With this encoding scheme in place, we now can feed arbitrary files to a Turing Machine! Which will be handy for the next observation:

You can’t actually program a Turing Machine. What I mean is that, counter-intuitively, there isn’t some user-supplied program that a Turing Machine executes. Rather, the program is hard-wired into the machine. The transition function is the program.

But with some ingenuity we can regain our ability to write programs. Recall that we’ve just learned to feed arbitrary files to a TM. So what we could do is to write a text file that specifies a TM and its input, and then feed that entire file as an input to an “interpreter” Turing Machine which would read the file, and act as the machine specified there. A Turing Machine can have an eval function.

Is such an “interpreter” Turing Machine possible? Yes! And it is not hard: if you spend a couple of hours programming Turing Machines by hand, you’ll see that you pretty much can do anything — you can do numbers, arithmetic, loops, control flow. It’s just very very tedious.

So let’s just declare that we’ve actually coded up this Universal Turing Machine which simulates a TM given to it as an input in a particular encoding.

This sort of construct also gives rise to the Church-Turing thesis. We have a TM which can run other TMs. And you can implement a TM interpreter in something like Python. And, with a bit of legwork, you could also implement a Python interpreter as a TM (you likely want to avoid doing that directly, and instead do a simpler interpreter for WASM, and then use a Python interpreter compiled to WASM). This sort of bidirectional interpretation shows that Python and TMs have equivalent computing power. Moreover, it’s quite hard to come up with a reasonable computational device which is more powerful than a Turing Machine.

There are computational devices that are strictly weaker than TMs though. Recall FSMs. By this point, it should be obvious that a TM can simulate an FSM. Everything a Finite State Machine can do, a Turing Machine can do as well. And it should be intuitively clear that a TM is more powerful than an FSM. An FSM gets to use only a finite number of states. A TM has these same states, but it also possesses a tape which serves as an infinitely sized external memory.

Directly proving that you can’t encode a Universal Turing Machine as an FSM sounds complicated, so let’s prove something simpler. Recall that we have established that there’s no FSM that accepts only ones surrounded by an equal number of zeros on both sides (because a sufficiently large word of this form would necessarily enter a cycle in a state machine, which could then be further pumped). But it’s actually easy to write a Turing Machine that does this:

Erase zero (at the left side of the tape)
Go to the right end of the tape
Erase zero
Go to the left side of the tape
Repeat
If what’s left is a single 1 the answer is “yes”, otherwise it is a “no”

We found a specific problem that can be solved by a TM, but is out of reach of any FSM. So it necessarily follows that there isn’t an FSM that can simulate an arbitrary TM.
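Here is the erase-from-both-ends procedure in ordinary code. It is a sketch of the algorithm rather than a literal Turing Machine: a VecDeque stands in for the tape, and popping from either end stands in for walking the head over and erasing a zero.

```rust
use std::collections::VecDeque;

/// Recognizes a single 1 surrounded by an equal number of 0s on both sides.
fn is_one_between_equal_zeros(input: &str) -> bool {
    let mut tape: VecDeque<char> = input.chars().collect();
    // Erase a zero on the left, then a matching zero on the right; repeat.
    while tape.front() == Some(&'0') {
        tape.pop_front();
        if tape.pop_back() != Some('0') {
            return false; // no matching zero at the right end
        }
    }
    // "Yes" only if what is left is exactly a single 1.
    tape.len() == 1 && tape.front() == Some(&'1')
}
```

Unlike the FSM, this procedure gets to mutate its input, and that mutable memory is precisely what lets it match up arbitrarily many zeros.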

It is also useful to take a closer look at the tape. It is a convenient skeuomorphic abstraction which makes the behavior of the machine intuitive, but it is inconvenient to implement in a normal programming language. There isn’t a standard data structure that behaves just like a tape.

One cool practical trick is to simulate the tape as a pair of stacks. Take this:

Tape: A B C D E F G
Head:     ^

And transform it to something like this:

Left Stack:  [A, B, C]
Right Stack: [G, F, E, D]

That is, everything to the left of the head is one stack, everything to the right, reversed, is the other. Here, moving the reading head left or right corresponds to popping a value off one stack and pushing it onto another.
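In code, the trick looks like this. The sketch assumes one particular convention, matching the picture above: the symbol under the head is the top of the left stack. It also keeps the tape finite, so moving past either end panics, whereas a real tape would conjure a fresh blank cell instead.

```rust
/// A tape represented as two stacks (Vec, with the top at the end).
struct Tape {
    /// Cells up to and including the one under the head.
    left: Vec<char>,
    /// Cells to the right of the head, in reverse order.
    right: Vec<char>,
}

impl Tape {
    /// The symbol under the head.
    fn current(&self) -> char {
        *self.left.last().expect("head fell off the left edge")
    }

    /// Overwrites the symbol under the head.
    fn write(&mut self, symbol: char) {
        *self.left.last_mut().expect("head fell off the left edge") = symbol;
    }

    /// Moving right: pop from the right stack, push onto the left one.
    fn move_right(&mut self) {
        let symbol = self.right.pop().expect("head fell off the right edge");
        self.left.push(symbol);
    }

    /// Moving left: pop from the left stack, push onto the right one.
    fn move_left(&mut self) {
        let symbol = self.left.pop().expect("head fell off the left edge");
        self.right.push(symbol);
    }
}
```

Every operation is a constant number of pushes and pops, so simulating a tape this way costs nothing asymptotically.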

So, an equivalent-in-power definition would be to say that a TM is an FSM endowed with two stacks.

This of course creates an obvious question: is an FSM with just one stack a thing? Yes! It would be called a pushdown automaton, and it would correspond to context-free languages. But that’s beyond the scope of this post!

There’s yet another way to look at the tape, or the pair of stacks, if the set of symbols is 0 and 1. You could say that a stack is just a number! So, something like [1, 0, 1, 1] will be 1 + 2 + 8 = 11. Looking at the top of the stack is stack % 2, removing an item from the stack is stack / 2 and pushing x onto the stack is stack * 2 + x. We won’t need this right now, so just hold onto this for a brief moment.
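In Python, the three operations could be sketched like this (the function names are chosen for the example; integer division is written as //):

```python
def push(stack: int, bit: int) -> int:
    """Push a 0 or a 1 onto a stack encoded as a number."""
    return stack * 2 + bit


def top(stack: int) -> int:
    """Look at the most recently pushed bit."""
    return stack % 2


def pop(stack: int) -> int:
    """Remove the most recently pushed bit."""
    return stack // 2


stack = 0
for bit in [1, 0, 1, 1]:
    stack = push(stack, bit)
print(stack)  # 11
```

One consequence of this encoding is that an empty stack and a stack of zeros are the same number, 0, which is convenient if 0 also plays the role of the blank symbol.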

Turing Machine: Limits

Ok, so we have some idea about the lower bound for the power of a Turing Machine — FSMs are strictly less expressive. What about the opposite direction? Is there some computation that a Turing Machine is incapable of doing?

Yes! Let’s construct a function which maps natural numbers to natural numbers, which can’t be implemented by a Turing Machine. Recall that we can encode an arbitrary Turing Machine as text. That means that we can actually enumerate all possible Turing Machines, and write them in a giant line, from the most simple Turing Machine to more complex ones:

TM_0
TM_1
TM_2
...
TM_326
...

This is of course going to be an infinite list.

Now, let’s see how TM_0 behaves on input 0: it either prints something, or doesn’t terminate. Then, note how TM_1 behaves on input 1, and generalizing, create a function f that behaves as the nth TM on input n. It might look something like this:

f(0) = 0
f(1) = 111011
f(2) = doesn't terminate
f(3) = 0
f(4) = 101
...

Now, let’s construct function g which is maximally diffed from f: where f gives 0, g will return 1, and it will return 0 in all other cases:

g(0) = 1
g(1) = 0
g(2) = 0
g(3) = 1
g(4) = 0
...

There isn’t a Turing machine that computes g. For suppose there is. Then, it exists in our list of all Turing Machines somewhere. Let’s say it is TM_1000064. So, if we feed 0 to it, it will return g(0), which is 1, which is different from f(0). And the same holds for 1, and 2, and 3. But once we get to g(1000064), we are in trouble, because, by the definition of g, g(1000064) is different from what is computed by TM_1000064! So such a machine is impossible.
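The construction of g can be illustrated on the finite prefix of the table shown above. This is just a sketch: f_table hard-codes the example values, with None standing for "doesn't terminate", and says nothing about how one would obtain them in general.

```python
# The example values of f from the text; None means "doesn't terminate".
f_table = {0: "0", 1: "111011", 2: None, 3: "0", 4: "101"}


def g(n: int) -> int:
    """Differ from f at n: return 1 where f gives 0, and 0 otherwise."""
    return 1 if f_table[n] == "0" else 0


print([g(n) for n in range(5)])  # [1, 0, 0, 1, 0]
```

By construction g(n) never equals f(n), so g can't be the nth machine in the list for any n.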

Those math savvy might express this more succinctly — there’s a countably-infinite number of Turing Machines, and an uncountably-infinite number of functions. So there must be some functions which do not have a corresponding Turing Machine. It is the same proof — the diagonalization argument is hiding in the claim that the set of all functions is an uncountable set.

But this is super weird and abstract. Let’s rather come up with some very specific problem which isn’t solvable by a Turing Machine. The halting problem: given source code for a Turing Machine and its input, determine if the machine halts on this input eventually.

As we have waved our hands sufficiently vigorously to establish that Python and Turing Machines have equivalent computational power, I am going to try to solve this in Python:

def halts(program_source_code: str, program_input: str) -> bool:
    # One million lines of readable, but somewhat
    # unsettling and intimidating Python code.
    return the_answer

raw_input = input()
[program_source_code, program_input] = parse(raw_input)
print("Yes" if halts(program_source_code, program_input) else "No")

Now, I will do a weird thing and start asking whether a program terminates, if it is fed its own source code, in a reverse-quine of sorts:

def halts_on_self(program_source_code: str) -> bool:
    program_input = program_source_code
    return halts(program_source_code, program_input)

and finally I construct this weird beast of a program:

def halts(program_source_code: str, program_input: str) -> bool:
    # ...
    return the_answer

def halts_on_self(program_source_code: str) -> bool:
    program_input = program_source_code
    return halts(program_source_code, program_input)

def weird(program_input):
    if halts_on_self(program_input):
        while True:
            pass

weird(input())

To make this even worse, I’ll feed the text of this weird program to itself. Does it terminate with this input? Well, if it terminates, and if our halts function is implemented correctly, then the halts_on_self(program_input) invocation above returns True. But then we enter the infinite loop and don’t actually terminate.

Hence, it must be the case that weird does not terminate when self-applied. But then halts_on_self returns False, and it should terminate. So we get a contradiction both ways. Which necessarily means that either our halts sometimes returns a straight-up incorrect answer, or that it sometimes does not terminate.

So this is the flip side of a Turing Machine’s power — it is so powerful that it becomes impossible to tell whether it’ll terminate or not!

It actually gets much worse, because this result can be generalized to an unreasonable degree! In general, there’s very little we can say about arbitrary programs.

We can easily check syntactic properties (is the program text shorter than 4 kilobytes?), but they are, in some sense, not very interesting, as they depend a lot on how exactly one writes a program. It would be much more interesting to check some refactoring-invariant properties, which hold when you change the text of the program, but leave the behavior intact. Indeed, “does this change preserve behavior?” would be one very useful property to check!

So let’s define two TMs to be equivalent, if they have identical behavior. That is, for each specific input, either both machines don’t terminate, or they both halt, and give identical results.

Then, our refactoring-invariant properties are, by definition, properties that hold (or do not hold) for the entire classes of equivalence of TMs.

And a somewhat depressing result here is that there are no non-trivial refactoring-invariant properties that you can algorithmically check.

Suppose we have some magic TM, called P, which checks such a property. Let’s show that, using P, we can solve the problem we know we can not solve — the halting problem.

Consider a Turing Machine that is just an infinite loop and never terminates, M1. P might or might not hold for it. But, because P is non-trivial (it holds for some machines and doesn’t hold for some machines), there’s some different machine M2 which differs from M1 with respect to P. That is, P(M1) xor P(M2) holds.

Let’s use these M1 and M2 to figure out whether a given machine M halts on input I. Using a Universal Turing Machine (an interpreter), we can construct a new machine, M12, that just runs M on input I, then erases the contents of the tape and runs M2. Now, if M halts on I, then the resulting machine M12 is behaviorally-equivalent to M2. If M doesn’t halt on I, then the result is equivalent to the infinite loop program, M1. Or, in pseudo-code:

def M1(input):
    while True:
        pass

def M2(input):
    # We don't actually know what's here
    # but we know that such a machine exists.
    ...

assert(P(M1) != P(M2))

def halts(M, I):
    def M12(input):
        M(I) # might or might not halt
        return M2(input)

    return P(M12) == P(M2)

This is pretty bad and depressing — we can’t learn anything meaningful about an arbitrary Turing Machine! So let’s finally get to the actual topic of today’s post:

Primitive Recursive Functions

This is going to be another computational device, like FSMs and TMs. Like an FSM, it’s going to be a nice, always terminating, non-Turing complete device. But it will turn out to have quite a bit of the power of a full Turing Machine!

However, unlike both TMs and FSMs, Primitive Recursive Functions are defined directly as functions which take a tuple of natural numbers and return a natural number. The two simplest ones are zero (that is, a zero-arity function that returns 0) and succ, a unary function that just adds 1. Everything else is going to get constructed out of these two:

zero = 0
succ(x) = x + 1

One way we are allowed to combine these functions is by composition. So we can get all the constants right off the bat:

succ(zero) = 1
succ(succ(zero)) = 2
succ(succ(succ(zero))) = 3

We aren’t going to be allowed to use general recursion (because it can trivially non-terminate), but we do get to use a restricted form of C-style loop. It is a bit fiddly to define formally! The overall shape is LOOP(init, f, n).

Here, init and n are numbers — the initial value of the accumulator and the total number of iterations. The f is a unary function that specifies the loop body – it takes the current value of the accumulator and returns the new value. So

LOOP(init, f, 0) = init
LOOP(init, f, 1) = f(init)
LOOP(init, f, 2) = f(f(init))
LOOP(init, f, 3) = f(f(f(init)))

While this is similar to a C-style loop, the crucial difference here is that the total number of iterations n is fixed up-front. There’s no way to mutate the loop counter in the loop body.

This allows us to define addition:

add(x, y) = LOOP(x, succ, y)
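To make the semantics concrete, here is one way LOOP could be modelled in Python. It is a sketch, not part of the formal definition: in the PRF language LOOP is a built-in operator, while here it is an ordinary higher-order function.

```python
def LOOP(init, f, n):
    """Apply f to the accumulator exactly n times, starting from init."""
    acc = init
    # n is read once, up front; the body f never sees the counter
    # and has no way to change the number of iterations.
    for _ in range(n):
        acc = f(acc)
    return acc


def succ(x):
    return x + 1


def add(x, y):
    return LOOP(x, succ, y)


print(add(3, 4))  # 7
```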

Multiplication is trickier. Conceptually, to multiply x and y, we want to LOOP from zero, and repeat “add x” y times. The problem here is that we can’t write an “add x” function yet:

# Doesn't work, add is a binary function!
mul(x, y) = LOOP(0, add, y)

# Doesn't work either, no x in scope!
add_x v = add(x, v)
mul(x, y) = LOOP(0, add_x, y)

One way around this is to define LOOP as a family of operators, which can pass extra arguments to the iteration function:

LOOP0(init, f, 2) = f(f(init))
LOOP1(c1, init, f, 2) = f(c1, f(c1, init))
LOOP2(c1, c2, init, f, 2) = f(c1, c2, f(c1, c2, init))

That is, LOOP_N takes an extra N arguments, and passes them through to any invocation of the body function. To express this idea a little bit more succinctly, let’s just allow partially applying the second argument of LOOP. That is:

All our functions are going to be first order. All arguments are numbers, the result is a number. There aren’t higher order functions, there aren’t closures.
The LOOP is not a function in our language — it’s a builtin operator, a keyword. So, for convenience, we allow passing partially applied functions to it. But semantically this is equivalent to just passing in extra arguments on each iteration.

Which finally allows us to write

mul(x, y) = LOOP(0, add x, y)

Ok, so that’s progress — we made something as complicated as multiplication, and we still are in the guaranteed-to-terminate land. Because each loop has a fixed number of iterations, everything eventually finishes.

We can go on and define x^y:

pow(x, y) = LOOP(1, mul x, y)

And this in turn allows us to define a couple of concerningly fast-growing functions:

pow_2(n) = pow(2, n)
pow_2_2(n) = pow_2(pow_2(n))
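The definitions so far can be tried out in Python, with functools.partial standing in for the "add x" style of partial application. This is a sketch under that assumption; the name power is used instead of pow only to avoid shadowing the Python built-in.

```python
from functools import partial


def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc


def succ(x):
    return x + 1


def add(x, y):
    return LOOP(x, succ, y)


def mul(x, y):
    # partial(add, x) plays the role of "add x".
    return LOOP(0, partial(add, x), y)


def power(x, y):
    return LOOP(1, partial(mul, x), y)


def pow_2(n):
    return power(2, n)


def pow_2_2(n):
    return pow_2(pow_2(n))


print(mul(3, 4), power(2, 5), pow_2_2(2))  # 12 32 16
```

Everything bottoms out in succ, so even modest inputs take a lot of steps, but every call is guaranteed to finish.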

That’s fun, but to do some programming, we’ll need an if. We’ll get to it, but first we’ll need some boolean operations. We can encode false as 0 and true as 1. Then

and(x, y) = mul(x, y)

But or creates a problem: we’ll need a subtraction.

or(x, y) = sub(
  add(x, y),
  mul(x, y),
)

Defining sub is tricky, due to two problems:

First, we only have natural numbers, no negatives. This one is easy to solve — we’ll just define subtraction to saturate.

The second problem is more severe — I think we actually can’t express subtraction given the set of allowable operations so far. That is because all our operations are monotonic — the result is never less than the arguments. One way to solve this problem is to define the LOOP in such a way that the body function also gets passed a second argument — the current iteration. So, if you iterate up to n, the last iteration will observe n - 1, and that would be the non-monotonic operation that creates subtraction. But that seems somewhat inelegant to me, so instead I will just add a pred function to the basis, and use that to add loop counters to our iterations.

pred(0) = 0 # saturate
pred(1) = 0
pred(2) = 1
...

Now we can say:

sub(x, y) = LOOP(x, pred, y)

and(x, y) = mul(x, y)
or(x, y) = sub(
  add(x, y),
  mul(x, y)
)
not(x) = sub(1, x)

if(cond, a, b) = add(
  mul(a, cond),
  mul(b, not(cond)),
)

And now we can do a bunch of comparison operators:

is_zero(x) = sub(1, x)

# x >= y
ge(x, y) = is_zero(sub(y, x))

# x == y
eq(x, y) = and(ge(x, y), ge(y, x))

# x > y
gt(x, y) = and(ge(x, y), not(eq(x, y)))

# x < y
lt(x, y) = gt(y, x)
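A quick way to convince yourself that these definitions behave as intended is to transcribe them into Python. In this sketch add and mul use Python arithmetic as a shortcut, and the trailing underscores in and_, or_, not_ and if_ are there only because the plain names are Python keywords.

```python
def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc


def pred(x):
    return x - 1 if x > 0 else 0  # saturates at zero


def add(x, y):
    return x + y  # shortcut for LOOP(x, succ, y)


def mul(x, y):
    return x * y  # shortcut for LOOP(0, add x, y)


def sub(x, y):
    return LOOP(x, pred, y)


def and_(x, y):
    return mul(x, y)


def or_(x, y):
    return sub(add(x, y), mul(x, y))


def not_(x):
    return sub(1, x)


def if_(cond, a, b):
    # Both branches are evaluated; cond (0 or 1) selects one of them.
    return add(mul(a, cond), mul(b, not_(cond)))


def is_zero(x):
    return sub(1, x)


def ge(x, y):
    return is_zero(sub(y, x))


def eq(x, y):
    return and_(ge(x, y), ge(y, x))


def gt(x, y):
    return and_(ge(x, y), not_(eq(x, y)))


def lt(x, y):
    return gt(y, x)
```

For instance, sub(3, 5) saturates to 0, if_(1, 7, 9) is 7, and lt(2, 3) is 1.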

With that we could implement modulus. To compute x % m we will start with x, and will be subtracting m until we get a number smaller than m. We’ll need at most x iterations for that.

In pseudo-code:

def mod(x, m):
  current = x

  for _ in range(x):
    if current < m:
      current = current
    else:
      current = current - m

  return current

And as a bona fide PRF:

mod_iter(m, x) = if(
  lt(x, m),
  x,        # then
  sub(x, m) # else
)
mod(x, m) = LOOP(x, mod_iter m, x)

That’s a curious structure — rather than computing the modulo directly, we essentially search for it using trial and error, and relying on the fact that the search has a clear upper bound.

Division can be done similarly: to divide x by y, start with 0, and then repeatedly add one to the accumulator until the product of the accumulator and y exceeds x:

# x <= y
le(x, y) = ge(y, x)

div_iter x y acc = if(
  le(mul(succ(acc), y), x),
  succ(acc), # then
  acc        # else
)
div(x, y) = LOOP(0, div_iter x y, x)
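Both bounded searches can be checked against Python's own % and // operators. This is a sketch: the loop bodies are written as nested Python functions for brevity (standing in for partial application), and they use Python comparisons and arithmetic rather than the PRF versions. Only the shape of the iteration, at most x rounds, follows the definitions above.

```python
def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc


def mod(x, m):
    def mod_iter(current):
        # Keep subtracting m until the value drops below m, then stay put.
        return current if current < m else current - m

    return LOOP(x, mod_iter, x)


def div(x, y):
    def div_iter(acc):
        # Bump the quotient only while (acc + 1) * y still fits into x.
        return acc + 1 if (acc + 1) * y <= x else acc

    return LOOP(0, div_iter, x)


print(mod(17, 5), div(17, 5))  # 2 3
```

Most of the x iterations are wasted no-ops once the answer has been found. That is the price of having a loop bound that is fixed up-front.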

This really starts to look like programming! One thing we are currently missing is data structures. While our functions take multiple arguments, they only return one number. But it’s easy enough to pack two numbers into one: to represent an (a, b) pair, we’ll use the number 2^a 3^b:

mk_pair(a, b) = mul(pow(2, a), pow(3, b))

To deconstruct such a pair into its first and second components, we need to find the maximum power of 2 or 3 that divides our number. Which is exactly the same shape we used to implement div:

max_factor_iter p m acc = if(
  is_zero(mod(p, pow(m, succ(acc)))),
  succ(acc), # then
  acc,       # else
)
max_factor(p, m) = LOOP(0, max_factor_iter p m, p)

fst(p) = max_factor(p, 2)
snd(p) = max_factor(p, 3)

Here again we use the fact that the maximal power of two that divides p is not larger than p itself, so we can over-estimate the number of iterations we’ll need as p.
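Here is the pair encoding as a Python sketch. As before, the loop body is a nested function standing in for partial application, and ** and % are used directly instead of the PRF pow and mod.

```python
def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc


def mk_pair(a, b):
    return 2**a * 3**b


def max_factor(p, m):
    def max_factor_iter(acc):
        # Advance only while m^(acc + 1) still divides p.
        return acc + 1 if p % m ** (acc + 1) == 0 else acc

    # p itself is a (very generous) bound on the exponent.
    return LOOP(0, max_factor_iter, p)


def fst(p):
    return max_factor(p, 2)


def snd(p):
    return max_factor(p, 3)


p = mk_pair(3, 2)
print(p, fst(p), snd(p))  # 72 3 2
```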

Using this pair construction, we can finally add a loop counter to our LOOP construct. To track the counter, we pack it as a pair with the accumulator:

LOOP(mk_pair(init, 0), f, n)

And then inside f, we first unpack that pair into accumulator and counter, pass them to actual loop iteration, and then pack the result again, incrementing the counter:

f acc = mk_pair(
  g(fst(acc), snd(acc)),
  succ(snd(acc)),
)
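Put together, a counter-aware loop might be sketched in Python as follows. The name loop_with_counter is invented for this example, and max_factor is written as a plain while loop to keep the demo fast; the encoding itself is the 2^a 3^b pair from above.

```python
def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc


def mk_pair(a, b):
    return 2**a * 3**b


def max_factor(p, m):
    # Shortcut: plain division instead of the bounded search (p is never 0).
    k = 0
    while p % m == 0:
        p //= m
        k += 1
    return k


def fst(p):
    return max_factor(p, 2)


def snd(p):
    return max_factor(p, 3)


def loop_with_counter(init, g, n):
    """Run g(accumulator, counter) for counter = 0, 1, ..., n - 1."""

    def f(acc):
        # Unpack, run the real body, and repack with an incremented counter.
        return mk_pair(g(fst(acc), snd(acc)), snd(acc) + 1)

    return fst(LOOP(mk_pair(init, 0), f, n))


# Sum of the counters 0 + 1 + 2 + 3.
print(loop_with_counter(0, lambda acc, i: acc + i, 4))  # 6
```

The intermediate numbers grow very quickly (the final pair in this example is already 2^6 * 3^4 = 5184), so this is a statement about expressiveness, not a practical technique.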

Ok, so we have achieved something remarkable: while we are writing terminating-by-construction programs, which are definitely not Turing complete, we have constructed basic programming staples, like boolean logic and data structures, and we have also built some rather complicated mathematical functions, like 2^{2^N}.

We could try to further enrich our little primitive recursive kingdom by adding more and more functions on an ad hoc basis, but let’s try to be really ambitious and go for the main prize — simulating Turing Machines.

We know that we will fail: Turing machines can enter an infinite loop, but PRFs necessarily terminate. That means that, if a PRF were able to simulate an arbitrary TM, it would have to say after a certain finite amount of steps that “this TM doesn’t terminate”. And, while we didn’t do this, it’s easy to see that you could simulate the other way around and implement PRFs in a TM. But that would give us a TM algorithm to decide if an arbitrary TM halts, which we know doesn’t exist.

So, this is hopeless! But we might still be able to learn something from failing.

Ok! So let’s start with a configuration of a TM which we somehow need to encode into a single number. First, we need the state variable proper (Q0, Q1, etc), which seems easy enough to represent with a number. Then, we need a tape and a position of the reading head. Recall how we used a pair of stacks to represent exactly the tape and the position. And recall that we can look at a stack of zeros and ones as a number in binary form, where push and pop operations are implemented using %, *, and / — exactly the operations we already can do. So, our configuration is just three numbers: (S, stack1, stack2).

And, using the 2^a3^b5^c trick, we can pack this triple into just a single number. But that means we could directly encode a single step of a Turing Machine:

single_step(config) = if(
  # if the state is Q0 ...
  eq(fst(config), 0),

  # and the symbol at the top of left stack is 0
  if(is_zero(mod(snd(config), 2)),
    mk_triple(
      1,                    # move to state Q1
      div(snd(config), 2),  # pop value from the left stack
      mul(trd(config), 2),  # push zero onto the right stack
    ),
    ... # Handle symbol 1 in state Q0
  ),
  # if the state is Q1 ...
  if(eq(fst(config), 1),
    ...
  )
)

And now we could plug that into our LOOP to simulate a Turing Machine running for N steps:

n_steps initial_config n =
  LOOP(initial_config, single_step, n)
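Here is a runnable Python sketch of the same idea for a toy machine. Everything specific is an assumption made for the example: the machine (in Q0 it walks left over 1s and moves to Q1 on the first 0, and Q1 is a halting state), and the use of a Python tuple instead of the 2^a 3^b 5^c packing. The stacks, however, are plain numbers, manipulated with %, // and *.

```python
def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc


def single_step(config):
    state, left, right = config
    if state == 0:
        symbol = left % 2  # the symbol on top of the left stack
        if symbol == 0:
            # Q0 reading 0: go to Q1, pop left, push 0 onto the right stack.
            return (1, left // 2, right * 2)
        # Q0 reading 1: stay in Q0, pop left, push 1 onto the right stack.
        return (0, left // 2, right * 2 + 1)
    # Q1 is the halting state: the configuration no longer changes.
    return config


def n_steps(initial_config, n):
    return LOOP(initial_config, single_step, n)


# Left stack [1, 1, 1] is the number 7; the right stack is empty.
print(n_steps((0, 7, 0), 10))  # (1, 0, 14)
```

Because a halted machine just repeats its configuration, running for more steps than necessary is harmless, which is what makes a generous over-estimate of n workable.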

The catch of course is that we can’t know the N that’s going to be enough. But we can have a very good guess! We could do something like this:

hopefully_enough_steps initial_config =
  LOOP(initial_config, single_step, pow_2_2(initial_config))

That is, run for some large tower of exponents of the initial state. Which would be plenty for normal algorithms, which are usually 2^N at worst!

Or, generalizing:

If a TM has a runtime which is bounded by some primitive-recursive function, then the entire TM can be replaced with a PRF. Be advised that PRFs can grow really fast.

Which is the headline result we have set out to prove!

Primitive Recursive Functions: Limit

It might seem that non-termination is the only principal obstacle. That anything that terminates at all has to be implementable as a PRF. Alas, that’s not so. Let’s go and construct a function that is computable by a TM, but is out of reach of PRFs.

We will combine the ideas of the impossibility proofs for FSMs (noting that if a function is computed by some machine, that machine has a specific finite size) and TMs (diagonalization).

So, suppose we have some function f that can’t be computed by a PRF. How would we go about proving that? Well, we’d start with “suppose that we have a PRF P that computes f”. And then we could notice that P would have some finite size. If you look at it abstractly, the P is its syntax tree, with lots of LOOP constructs, but it always boils down to some succs and zeros at the leaves. Let’s say that the depth of P is d.

And, actually, if you look at it, there are only a finite number of PRFs with depth at most d. Some of them describe pretty fast growing functions. But probably there’s a limit to how fast a function can grow, given that it is computed by a PRF of size d. Or, to use a concrete example: we have constructed a PRF of depth 5 that computes two to the power of two to the power of N. Probably if we were smarter, we could have squeezed a couple more levels into that tower of exponents. But intuitively it seems that if you build a tower of, say, 10 exponents, that would grow faster than any PRF of depth 5. And that this generalizes — for any fixed depth, there’s a high-enough tower of exponents that grows faster than any PRF with that depth.

So we could conceivably build an f that defeats our d-deep P. But that’s not quite a victory yet: maybe that f is feasible for d+2-deep PRFs! So here we’ll additionally apply diagonalization: for each depth, we’ll build its own depth-specific nemesis f_d. And then we’ll define our overall function as

a(n) = f_n(n)

So, for n large enough it’ll grow faster than a PRF with any fixed depth.

So that’s the general plan, the rest is basically just calculating the upper bound on the growth of a PRF of depth d.

One technical difficulty here is that PRFs tend to have different arities:

f(x, y)
g(x, y, z, t)
h(x)

Ideally, we’d use just one upper bound for them all. So we’ll be looking for an upper bound of the following form:

f(x, y, z, t) <= A_d(max(x, y, z, t))

That is:

Compute the depth of f, d.
Compute the largest of its arguments.
And plug that into unary function for depth d.

Let’s start with d=1. We have only primitive functions on this level, succ, zero, and pred, so we could say that

A_1(x) = x + 1

Now, let’s handle an arbitrary other depth d + 1. In that case, our function is non-primitive, so at the root of the syntax tree we have either a composition or a LOOP.

Composition would look like this:

f(x, y, z, ...) = g(
  h1(x, y, z, ...),
  h2(x, y, z, ...),
  h3(x, y, z, ...),
)

where g and h_n are d deep and the resulting f is d+1 deep. We can immediately estimate the h_n then:

f(args...) <= g(
  A_d(maxarg),
  A_d(maxarg),
  A_d(maxarg),
  ...
)

In this somewhat loose notation, args... stands for a tuple of arguments, and maxarg stands for the largest one.

And then we could use the same estimate for g:

f(args...) <= A_d(A_d(maxarg))

This is super high-order, so let’s do a concrete example for a depth-2 two-argument function which starts with a composition:

f(x, y) <= A_1(A_1(max(x, y)))
         = A_1(max(x, y) + 1)
         = max(x, y) + 2

This sounds legit: if we don’t use LOOP, then f(x, y) is either succ(succ(x)) or succ(succ(y)) so max(x, y) + 2 indeed is the bound!

Ok, now the fun case! If the top-level node is a LOOP, then we have

f(args...) = LOOP(
  g(args...),
  h(args...),
  t(args...),
)

This sounds complicated to estimate, especially due to that last t(args...) argument, which is the number of iterations. So we’ll be cowards and won’t actually try to estimate this case. Instead, we will require that our PRF is written in a simplified form, where the first and the last arguments to LOOP are simple.

So, if your PRF looks like

f(x, y) = LOOP(x + y, mul, pow_2(x))

you are required to re-write it first as

helper(u, v) = LOOP(u, mul, v)
f(x, y) = helper(x + y, pow_2(x))

So now we only have to deal with this:

f(args...) = LOOP(
  arg,
  g(args...),
  arg,
)

f has depth d+1, g has depth d.

On the first iteration, we’ll call g(args..., arg), which we can estimate as A_d(maxarg). That is, g does get an extra argument, but it is one of the original arguments of f, and we are looking at the maximum argument anyway, so it doesn’t matter.

On the second iteration, we are going to call g(args..., prev_iteration) which we can estimate as A_d(max(maxarg, prev_iteration)).

Now we plug our estimation for the first iteration:

g(args..., prev_iteration)
  <= A_d(max(maxarg, prev_iteration))
  <= A_d(max(maxarg, A_d(maxarg)))
  =  A_d(A_d(maxarg))

That is, the estimate for the first iteration is A_d(maxarg). The estimation for the second iteration adds one more layer: A_d(A_d(maxarg)). For the third iteration we’ll get A_d(A_d(A_d(maxarg))).

So the overall thing is going to be smaller than A_d iteratively applied to itself some number of times, where “some number” is one of f’s original arguments. But no harm’s done if we iterate up to maxarg.

As a sanity check, the worst depth-2 function constructed with iteration is probably

f(x, y) = LOOP(x, succ, y)

which is x + y. And our estimate gives x + 1 applied maxarg times to maxarg, which is 2 * maxarg, which is indeed the correct upper bound!

Combining everything together, we have:

A_1(x) = x + 1

f(args...) <= max(
  A_d(A_d(maxarg)),               # composition case
  A_d(A_d(A_d(... A_d(maxarg)))), # LOOP case,
   <-    maxarg A's         ->
)

That max there is significant — although it seems like the second line, with maxarg applications, is always going to be longer, maxarg, in fact, could be as small as zero. But we can take maxarg + 2 repetitions to fix this:

f(args...) <=
  A_d(A_d(A_d(... A_d(maxarg)))),
  <-    maxarg + 2 A's         ->

So let’s just define A_{d+1}(x) to make that inequality work:

A_{d+1}(x) = A_d(A_d( .... A_d(x)))
            <- x + 2 A_d's in total->

Unpacking:

We define a family of unary functions A_d, such that each A_d “grows faster” than any n-ary PRF of depth d. If f is a ternary PRF of depth 3, then f(1, 92, 10) <= A_3(92).

To evaluate A_d at point x, we use the following recursive procedure:

If d is 1, return x + 1.
Otherwise, evaluate A_{d-1} at point x to get, say, v. Then evaluate A_{d-1} again at point v this time, yielding u. Then compute A_{d-1}(u). Overall, repeat this process x+2 times, and return the final number.
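This procedure translates to Python almost word for word. The sketch below follows the "repeat x + 2 times" description literally; A_iter is a name made up for the example.

```python
def A_iter(d, x):
    """Evaluate A_d at point x by applying A_{d-1} exactly x + 2 times."""
    if d == 1:
        return x + 1
    value = x
    for _ in range(x + 2):
        value = A_iter(d - 1, value)
    return value


print(A_iter(2, 3))  # 8
print(A_iter(3, 1))  # 22
print(A_iter(4, 0))  # 2046
```

With this definition A_2(x) works out to 2x + 2, which indeed bounds the depth-2 function x + y by its larger argument, and the values explode from depth 4 onwards.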

We can simplify this a bit if we stop treating d as a kind of function index, and instead say that our A is just a function of two arguments. Then we have the following equations:

A(1, x) = x + 1
A(d + 1, x) = A(d, A(d, A(d, ..., A(d, x))))
                <- x + 2 A_d's in total->

The last equation can be re-formatted as

A(
  d,
  A(d, A(d, ..., A(d, x))),
  <- x + 1 A_d's in total->
)

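And for non-zero x that is just the following (strictly speaking, A(d + 1, x - 1) starts its x + 1 applications from x - 1 rather than from x, a small difference that is glossed over here):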

A(
  d,
  A(d + 1, x - 1),
)

So we get the following recursive definition for A(d, x):

A(1, x) = x + 1
A(d + 1, 0) = A(d, A(d, 0))
A(d + 1, x) = A(d, A(d + 1, x - 1))

As a Python program:

def A(d, x):
  if d == 1: return x + 1
  if x == 0: return A(d-1, A(d-1, 0))
  return A(d-1, A(d, x - 1))

It’s easy to see that computing A on a Turing Machine using this definition terminates — this is a function with two arguments, and every recursive call uses a lexicographically smaller pair of arguments. And we constructed A in such a way that A(d, x) as a function of x is larger than any PRF with a single argument of depth d. But that means that the following function with one argument a(x) = A(x, x)

grows faster than any PRF. And that’s an example of a function which a Turing Machine has no trouble computing (given sufficient time), but which is beyond the capabilities of PRFs.

Part III, Descent From the Ivory Tower

Remember, this is a three-part post! And we are finally at part 3! So let’s circle back to the practical matters. We have learned that:

Turing machines don’t necessarily terminate.
While other computational devices, like FSMs and PRFs, can be made to always terminate, there’s no guarantee that they’ll terminate fast. PRFs in particular can compute quite large functions!
And non-Turing complete devices can be quite expressive. For example, any real-world algorithm that works on a TM can be adapted to run as a PRF.
Moreover, you don’t even have to contort the algorithm much to make it fit. There’s a universal recipe for how to take something Turing complete and make it a primitive recursive function instead — just add an iteration counter to the device, and forcibly halt it if the counter grows too large.

Or, more succinctly: there’s no practical difference between a program that doesn’t terminate, and the one that terminates after a billion years. As a practitioner, if you think you need to solve the first problem, you need to solve the second problem as well. And making your programming language non-Turing complete doesn’t really help with this.

And yet, there are a lot of configuration languages out there that use non-Turing completeness as one of their key design goals. Why is that?

I would say that we are never interested in Turing-completeness per se. We usually want some much stronger properties. And yet there’s no convenient catchy name for that bag of features of a good configuration language. So, “non-Turing-complete” gets used as a sort of rallying cry to signal that something is a good configuration language, and maybe sometimes even to justify to others inventing a new language instead of taking something like Lua. That is, the real reason why you want at least a different implementation is all those properties you really need, but they are kinda hard to explain, or at least much harder than “we can’t use Python/Lua/JavaScript because they are Turing-complete”.

So what are the properties of a good configuration language?

First, we need the language to be deterministic. If you launch Python and type id([]), you’ll see some number. If you hit ^C, and then do this again, you’ll see a different number. This is OK for “normal” programming, but is usually anathema for configuration. Configuration is often used as a key in some incremental, caching system, and letting in non-determinism there wreaks absolute chaos!

Second, you need the language to be well-defined. You can compile Python with ASLR disabled, and use some specific allocator, such that id([]) always returns the same result. But that result would be hard to predict! And if someone tries to do an alternative implementation, even if they disable ASLR as well, they are likely to get a different deterministic number! Or the same could happen if you just update the version of Python. So, the semantics of the language should be clearly pinned-down by some sort of a reference, such that it is possible to guarantee not only deterministic behavior, but fully identical behavior across different implementations.

Third, you need the language to be pure. If your configuration can access environment variables or read files on disk, then the meaning of the configuration would depend on the environment where the configuration is evaluated, and you again don’t want that, to make caching work.

Fourth, a thing that is closely related to purity is security and sandboxing. The mechanism to achieve both purity and security is the same — you don’t expose general IO to your language. But the purpose is different: purity is about not letting the results be non-deterministic, while security is about not exposing access tokens to the attacker.

And now this gets tricky. One particular possible attack is a denial of service — sending some bad config which makes our system just spin there burning the CPU. Even if you control all IO, you are generally still open to these kinds of attacks. It might be OK to say this is outside of the threat model — that no one would find it valuable enough to just burn your CPU, if they can’t also do IO, and that, even in the event that this happens, there’s going to be some easy mitigation in the form of a higher-level timeout.

But you also might choose to provide some sort of guarantees about execution time, and that’s really hard. Two approaches work. One is to make sure that processing is obviously linear. Not just terminates, but is actually proportional to the size of inputs, and in a very direct way. If the correspondence is not direct, then it’s highly likely that it is in fact non-linear. The second approach is to ensure metered execution: during processing, decrement a counter for every simple atomic step and terminate processing when the counter reaches zero.
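As a minimal sketch of the metered approach, consider a tiny expression evaluator that spends one unit of fuel per visited node. The names Meter, OutOfFuel and evaluate, as well as the tuple-based expression format, are all invented for this illustration.

```python
class OutOfFuel(Exception):
    """Raised when evaluation exceeds its step budget."""


class Meter:
    def __init__(self, fuel: int):
        self.fuel = fuel

    def tick(self):
        # One call per atomic step; refuse to continue once the budget is spent.
        if self.fuel == 0:
            raise OutOfFuel
        self.fuel -= 1


def evaluate(expr, meter: Meter) -> int:
    """Evaluate an int or a nested ("+" | "*", lhs, rhs) tuple."""
    meter.tick()
    if isinstance(expr, int):
        return expr
    op, lhs, rhs = expr
    a = evaluate(lhs, meter)
    b = evaluate(rhs, meter)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unknown operator: {op}")


print(evaluate(("+", 1, ("*", 2, 3)), Meter(10)))  # 7
```

The same expression evaluated with Meter(4) raises OutOfFuel, because it has five nodes. The guarantee comes from the counter, and would hold just as well if the language had unbounded loops.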

Finally, one more vague property you’d want from a configuration language is for it to be simple. That is, to ensure that, when people use your language, they write simple programs. It seems to me that this might actually be the case where banning recursion and unbounded loops could help, though I am not sure. As we know from the PRF exercise, this won’t actually prevent people from writing arbitrary recursive programs. It’ll just require some roundabout code to do that. But maybe that’ll be enough of a speedbump to make someone invent a simple solution, instead of brute-forcing the most obvious one?

That’s all for today! Have a great weekend, and remember:

Any algorithm that can be implemented by a Turing Machine such that its runtime is bounded by some primitive recursive function of input can also be implemented by a primitive recursive function!

How I Use Git Worktrees

2024-07-25T00:00:00+00:00

How I Use Git Worktrees Jul 25, 2024

There are a bunch of posts on the internet about using the git worktree command. As far as I can tell, most of them are primarily about using worktrees as a replacement for, or a supplement to, git branches. Instead of switching branches, you just change directories. This is also how I had originally used worktrees, but that didn’t stick, and I abandoned them. But recently worktrees grew on me, though my new use-case is unlike branching.

When a Branch is Enough

If you use worktrees as a replacement for branching, that’s great, no need to change anything! But let me start with explaining why that workflow isn’t for me.

The principal problem with using branches is that it’s hard to context switch in the middle of doing something. You have your branch, your commit, a bunch of changes in the work tree, some of them might be staged and some unstaged. You can’t really tell Git “save all this context and restore it later.” The solution that Git suggests here is to use stashing, but that’s awkward, as it is too easy to get lost when stashing several things at the same time, and then applying the stash on top of the wrong branch.

Managing Git state became much easier for me when I realized that the staging area and the stash are just bad features, and life is easier if I avoid them. Instead, I just commit whatever and deal with it later. So, when I need to switch a branch in the middle of things, what I do is, basically:

$ git add .
$ git commit -m.
$ git switch another-branch

And, to switch back,

$ git switch -

# Undo the last commit, but keep its changes in the working tree
$ git reset HEAD~

To make this more streamlined, I have a ggc utility which does “commit all with a trivial message” atomically.
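The real ggc isn’t shown here. A minimal sketch, assuming it is nothing more than the two commands above glued into one step, could be a shell function like this:

```shell
# Hypothetical stand-in for `ggc`: stage everything under the current
# directory and commit it with the trivial message ".".
ggc() {
    git add . && git commit --quiet -m.
}
```

Because of the `&&`, nothing is committed if staging fails, so the two steps either both happen or neither does.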

And I don’t always reset HEAD~: I usually just continue hacking with . in my Git log and then amend the commit once I am satisfied with a subset of changes.

So that’s how I deal with switching branches. But why worktrees then?

Worktree Per Concurrent Activity

It’s a bit hard to describe, but:

I have a fixed number of worktrees (5, to be exact)
worktrees are mostly uncorrelated to branches
but instead correspond to my concurrent activities during coding.

Specifically:

The main worktree is a read-only worktree that contains a recent snapshot of the remote main branch. I use this tree to compare the code I am currently working on and/or reviewing with the main version (this includes things like “how long the build takes”, “what is the behavior of this test” and the like, so not just the actual source code).
The work worktree, where I write most of the code. I often need to write new code and compare it with old code at the same time. But I can’t actually work on two different things in parallel. That’s why main and work are different worktrees, but work also constantly switches branches.
The review worktree, where I check out code for code review. I can’t review code and write code at the same time, but there is one thing I am implementing and one thing I am reviewing, and the review and the implementation proceed concurrently.

Then, there’s the fuzz tree, where I run long-running fuzzing jobs for the code I am actively working on. My overall idealized feature workflow looks like this:

# go to the `work` worktree
$ cd ~/projects/tigerbeetle/work

# Create a new branch. As we work with a centralized repo,
# rather than personal forks, I tend to prefix my branch names
# with `matklad/`
$ git switch -c matklad/awesome-feature

# Start with a reasonably clean slate.
# In reality, I have yet another script to start a branch off
# fresh from the main remote, but this reset is a good enough approximation.
$ git reset --hard origin/main

# For more complicated features, I start with an empty commit
# and write the commit message _first_, before starting the work.
# That's a good way to collect your thoughts and discover dead
# ends more gracefully than hitting a brick wall coding at 80 WPM.
$ git commit --allow-empty

# Hack furiously writing throwaway code.
$ code .

# At this point, I have something that I hope works
# but would be embarrassed to share with anyone!
# So that's the good place to kick off fuzzing.

# First, I commit everything so far.
# Remember, I have `ggc` one liner for this:
$ git add . && git commit -m.

# Now I go to my `fuzz` worktree and kick off fuzzing.
# I usually split screen here.
# On the left, I copy the current commit hash.
# On the right, I switch to the fuzzing worktree,
# switch to the copied commit, and start fuzzing:

$ git add . && git commit -m.  |
$ git rev-parse HEAD | ctrlc   | $ cd ../fuzz
$                              | $ git switch -d $(ctrlv)
$                              | $ ./zig/zig build fuzz
$                              |

# While the fuzzer hums on the right, I continue to furiously refactor
# the code on the left and hammer my empty commit with a wishful
# thinking message and my messy code commit with `.` message into
# a semblance of clean git history

$ code .
$ magit-goes-brrrrr

# At this point, in the work tree, I am happy with both the code
# and the Git history, so, if the fuzzer on the right is happy,
# a PR is opened!

$                              |
$ git push --force-with-lease  | $ ./zig/zig build fuzz
$ gh pr create --web           | # Still hasn't failed
$                              |

This is again concurrent: I can hack on the branch while the fuzzer tests the “same” code. Note that it is crucial that the fuzzing tree operates in the detached head state (-d flag for git switch). In general, -d is very helpful with this style of worktree work. I am also sympathetic to the argument that, like the staging area and the stash, Git branches are a misfeature, but I haven’t taken the plunge personally yet.
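The reason the detached head matters is that Git refuses to check out a branch that is already checked out in another worktree. A small sketch in a throwaway repository shows the difference. The temporary directory and the `demo` identity are assumptions made up for the illustration:

```shell
set -eu
# Throwaway repository with a `work` tree sitting on a feature branch.
root=$(mktemp -d)
cd "$root"
git init --quiet work
cd work
git -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m init
git switch --quiet -c matklad/awesome-feature

# A second worktree next to it, created without a branch of its own.
git worktree add --detach ../fuzz >/dev/null 2>&1
cd ../fuzz

# Switching to the branch itself is refused: `work` already has it.
if git switch matklad/awesome-feature 2>/dev/null; then
    branch_switch=allowed
else
    branch_switch=refused
fi

# Switching to the same commit in detached mode is fine.
git switch --quiet -d "$(git -C ../work rev-parse HEAD)"
echo "switching to the branch was $branch_switch"
```

With -d, the fuzz tree follows commits rather than branch names, so it never competes with work for a branch.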

Finally, the last tree I have is scratch – this is a tree for arbitrary random things I need to do while working on something else. For example, if I am working on matklad/my-feature in work, and reviewing #6292 in review, and, while reviewing, notice a tiny unrelated typo, the PR for that typo is quickly prepped in the scratch worktree:
```
$ cd ../scratch
$ git switch -c matklad/quick-fix
$ code . && git add . && git commit -m 'typo' && git push
$ cd -
```

TL;DR: consider using worktrees not as a replacement for branches, but as a means to manage concurrency in your tasks. My level of concurrency is:

main for looking at the pristine code,
work for looking at my code,
review for looking at someone else’s code,
fuzz for my computer to look at my code,
scratch for everything else!
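
For completeness, a sketch of how such a five-tree layout could be created with git worktree add. This is an assumption about the setup rather than a record of it: the temporary directory stands in for a real project directory, and the empty repository stands in for a clone of the real remote.

```shell
set -eu
# Stand-in for a project directory; in practice `main` would be a
# `git clone` of the remote rather than a fresh empty repository.
root=$(mktemp -d)
cd "$root"
git init --quiet main
cd main
git -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m init

# One sibling directory per concurrent activity, all sharing one object
# store. Each starts detached, so none of them claims a branch.
for tree in work review fuzz scratch; do
    git worktree add --detach "../$tree" >/dev/null 2>&1
done

# Lists `main` plus the four additional worktrees.
git worktree list
```

Since all five directories share one repository, a commit made in work is immediately visible by hash in fuzz, with no push or fetch in between.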