Slightly unusual genre — with this article, I want to try to enact a change in the world. I believe that there is a “missing” IDE feature which is:
The target audience here is anyone who can land a PR in Zed, VS Code, Helix, Neovim, Emacs, Kakoune, or any other editor or any language server. The blog post would be a success if one of you feels sufficiently inspired to do the thing!
Suppose you are casually reading the source code of rust-analyzer, and are curious about the handling of method bodies. There’s a Body struct in the code base, and you want to understand how it is used.
Would you rather look at this?
Or this?
(The screenshots are from IntelliJ/RustRover, because of course it gets this right)
The second option is clearly superior — it conveys significantly more useful information in the same number of pixels. Function names, argument lists, and return types are so much more valuable than the body of any particular function. Especially if the function is a page full of boilerplate code!
And this is the feature I am asking for — make the code look like the second image. Or, specifically, Fold Method Bodies by Default.
There are two components here. First, only method bodies are folded. This is a syntactic check — we are not blindly folding everything at the second level of the folding hierarchy. For code like

both f and g are folded, but impl S is not. Similarly, function parameters and the function body are actually on the same level of the folding hierarchy, but it is imperative that parameters are not folded. This is the part that was hard ten years ago but is easy today: “what is a function body” is a non-trivial question, which requires proper parsing of the code. These days, either an LSP server or Tree-sitter can answer it quickly and reliably.
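As an illustration of how cheap this check has become, here is a sketch in Python using the standard ast module to compute “fold the body, keep the signature” ranges (the function name is made up; a real editor would use Tree-sitter or LSP folding ranges instead):

```python
import ast

def body_fold_ranges(source):
    """Compute (first_line, last_line) ranges that cover function *bodies*
    only, leaving signatures visible. A sketch of the syntactic check."""
    ranges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # the body starts after the signature, however many lines it spans
            ranges.append((node.body[0].lineno, node.body[-1].end_lineno))
    return ranges

src = """\
class S:
    def f(self, x: int) -> int:
        y = x + 1
        return y

    def g(self) -> None:
        pass
"""
# bodies of f and g are folded; class S and the signatures stay visible
print(body_fold_ranges(src))   # -> [(3, 4), (7, 7)]
```

Note that the check is purely structural: the class body (second folding level) is deliberately not folded, only the function bodies nested inside it.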
The second component of the feature is that folded is the default state. It is not a “fold method bodies” action; it is a setting that ensures that, whenever you visit a new file, bodies are folded by default. To make this work, the editor should be smart enough to seamlessly unfold specific functions when appropriate. For example, if you “go to definition” of a function, that function should get unfolded, while the surrounding code should remain folded.
Now that I have explained how the feature works, I will not try to motivate it. I think it is pretty obvious how awesome this actually is. Code is read more often than written, and this is one of the best multipliers for readability. Most of the code is in method bodies, but the most important code is in function signatures. Folding bodies auto-magically hides the 80% of boring code, leaving the most important 20%. It was in 2018 that I last used an IDE (IntelliJ) which had this implemented properly, and I’ve been missing the feature ever since!
You might also be wondering whether it is the same feature as the Outline, that special UI which shows a graphical, hierarchical table of contents of the file. It is true that outline and fold-bodies-by-default attack the same issue. But I’d argue that folding solves it better. This is an instance of a common pattern. In a smart editor, it is often possible to implement any given feature either by “lowering” it to plain text, or by creating a dedicated GUI. And the lowering approach almost always wins, because it gets to re-use all existing functionality for free. For example, the folding approach trivially gives you an ability to move a bunch of functions from one impl block to the other by selecting them with Shift + Down, cutting with Ctrl + X and pasting with Ctrl + V.
So, if you are a committer to one of the editors, please consider adding a “fold function bodies by default” mode. It probably should be off by default, as it can easily scare new users away, but it should be there for power users to enable, and it should be prominently documented, so that people can learn that they want it. After the checkbox is in place, see if you can implement the actual logic! If your editor uses Tree-sitter, this should be relatively easy — its syntax tree contains all the information you need. Just make sure that:
If your editor is not based on Tree-sitter, you’ll have a harder time. In theory, the information should be readily available from the language server, but LSP currently doesn’t expose it. Here’s the problem:
There’s no body kind there! Adding it should be technically trivial, but it is always a pain to get something into the protocol if you are not VS Code.
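For reference, here is the shape of the data: folding ranges as a client receives them. LSP 3.17’s FoldingRangeKind defines only “comment”, “imports”, and “region” — the “body” kind below is the hypothetical addition this post argues for:

```python
# FoldingRange objects as an LSP client would receive them from
# textDocument/foldingRange. The "body" kind does not exist in the
# protocol today; it is the proposed extension.
folding_ranges = [
    {"startLine": 10, "endLine": 14, "kind": "comment"},
    {"startLine": 16, "endLine": 20, "kind": "region"},
    {"startLine": 23, "endLine": 30, "kind": "body"},  # hypothetical!
]

# A "fold bodies by default" client would fold exactly the ranges
# the server flagged as "body" when a file is first opened:
folded_by_default = [r for r in folding_ranges if r.get("kind") == "body"]
print(folded_by_default)   # -> [{'startLine': 23, 'endLine': 30, 'kind': 'body'}]
```

The point of the kind field is precisely that the client can treat different ranges differently — which is why a dedicated “body” kind, rather than an unlabeled range, is what’s needed.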
What is my job here, besides sitting there and writing blog posts? I actually think that writing this down is quite valuable!
I suppose the feature is still commonly missing due to a two-sided market failure — the feature doesn’t exist, so prospective users don’t realize that it is possible, and don’t ask editors’ authors to implement it. Without users asking, editor authors themselves don’t realize this feature could exist, and don’t rush to implement it. This is exacerbated by the fact that it was a hard feature to implement ten years ago, when we didn’t have Tree-sitter/LSP, so there are poor workarounds in place — actions to fold a certain level. These workarounds then prevent the proper feature from gaining momentum.
So here I hope to tip the scales a bit and start a feedback loop: more people realize that they want this feature, it gets implemented in some of the more experimental editors, which exposes it to more users, popularizing it until it gets implemented everywhere!
Still, just talking isn’t everything I did here! Six years ago, I implemented the language-server side of this in rust-analyzer:
https://github.com/rust-lang/rust-analyzer/commit/23b040962ff299feeef1f967bc2d5ba92b01c2bc
This currently isn’t exposed to LSP, because the protocol doesn’t allow flagging a folding range as a method body. To fix that, I opened this VS Code issue (LSP generally avoids doing something before VS Code does):
https://github.com/microsoft/vscode/issues/128912
And since then I have been quietly waiting for some editor (not necessarily VS Code) to pick this up. This hasn’t happened yet, hence this article!
Thanks for reading!
An article about a couple of relatively recent additions to my workflow which I wish I knew about years ago.
Go to definition is super useful (in general, navigation is much more important than code completion). But often, when I use “goto def” I don’t actually mean to permanently go there. Rather, I want to stay where I am, but I need a bit more context about a particular thing at point.
What I’ve found works really great in this context is to split the screen in two, and issue “go to def” in the split. That way you see both the original context and the definition at the same time, and can choose just how far you would like to go. Here’s an example, where I want to understand how the apply function works, and, to understand that, I want to quickly look up the definition of FuzzOp:
VS Code actually has a first-class UI for something like this, called “Peek Definition” but it is just not good — it opens some kind of separate pop-up, with a completely custom UX. It’s much more fruitful to compose two existing basic features — splitting the screen and going to definition.
Note that in the example above I do move focus to the split. I also tried a version that keeps focus in the original split, but focusing the new one turned out to be much better. You don’t always know up front which split will become the “main” one, and moving the focus gives you the flexibility of moving around, closing the split, or closing the other split.
I highly recommend adding a shortcut for this action. It’s a good idea to make it a “complementary” shortcut for the usual goto definition. I use , . for goto definition, and hence , > is the splitting version:
Yes, you are reading this right. , ., that is, a comma followed by a full stop, is my goto definition shortcut. This is not some kind of evil vim mode. I use pedestrian non-modal editing, where I copy with ctrl + c, move to the beginning of the line with Home, and kill a word with ctrl + Backspace (though keys like Home, Backspace, or arrows are on my home row thanks to kanata).

And yet, I use , as the first keypress in a sequence for multiple shortcuts. That is, , . is not , and . pressed together, but rather a , followed by a separate .. So, when I press ,, my editor doesn’t actually type a comma, but rather waits for me to complete the shortcut. I have many of these, with just a few being:
I’ve used many different shortcut schemes, but this is by far the most convenient one for me. How do I type a comma? I bind , Space and , Enter to insert a comma followed by a space/newline respectively, which handles most cases. And there’s , , which types just a lone comma.
To remember longer sequences, I pair the comma with whichkey, such that, when I type , e, what I see is actually a menu of editing operations:
This horrible idea was born in the mind of Susam Pal, and is officially (and aptly I should say) named Devil Mode.
I highly recommend trying it out! It is the perfect interface for actions that you do once in a while. Where it doesn’t work is actions you want to repeat. For example, if you want to cycle through compilation errors, binding , e to “next error” would probably be a bad idea, as typing , e , e , e to cycle three times is quite tiring.
This is actually a common theme — there are many things you might want to cycle back and forward through:
It is mighty annoying to have to remember different shortcuts for all of them, isn’t it? If only there was some way to have a universal pair of shortcuts for the next/prev generalized motion…
The insight here is that you’d rarely need to cycle through several different categories of things at the same time. So I bind the venerable ctrl+n and ctrl+p to repeating the last next/prev motion. So, if the last next thing was a worktree change, then ctrl+n moves me to the next worktree change. But if I then query the next compilation error, the subsequent ctrl+n would continue cycling through compilation errors. To kick-start the cycle, I have a , n hydra:
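A minimal sketch of this “repeat the last motion” logic (all names are made up):

```python
class MotionRepeater:
    """Generic ctrl+n / ctrl+p that repeat whichever motion category
    was used last (a sketch of the idea, not any editor's real API)."""

    def __init__(self):
        self.last = None

    def do(self, category, direction):
        self.last = category              # remember the category for repeats
        return f"{category}:{direction}"  # stand-in for the real editor command

    def next(self):   # bound to ctrl+n
        return self.do(self.last, "next")

    def prev(self):   # bound to ctrl+p
        return self.do(self.last, "prev")

r = MotionRepeater()
r.do("worktree-change", "next")  # an explicit motion kicks off the cycle
print(r.next())                  # -> worktree-change:next
r.do("compile-error", "next")    # querying a new category switches the cycle
print(r.next())                  # -> compile-error:next
print(r.prev())                  # -> compile-error:prev
```

The state is a single “last category” variable, which is why the generic keys compose with any number of motion kinds.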
I don’t know if there’s an existing VS Code extension to do this; I implement it in my personal extension.
Hope this is useful! Now go and make a deal with the devil yourself!
Why are there so many programming languages? One of the driving reasons for this is that some languages tend to produce fast code, but are a bit of a pain to use (C++), while others are a breeze to write, but run somewhat slow (Python). Depending on the ratio of CPUs to programmers, one or the other might be relatively more important.
But can’t we just, like, implement a universal language that is convenient but slowish by default, but allows an expert programmer to drop to a lower, more performant but harder register? I think there were many attempts at this, and they didn’t quite work out.
The natural way to go about this is to start from the high-level side. Build a high-level featureful language with large runtime, and then provide granular opt outs of specific runtime facilities. Two great examples here are C# and D. And the most famous example of this paradigm is Python, with “rewrite slow parts in C” mantra.
It seems to me that such an approach can indeed solve the “easy to use” part of the dichotomy, but doesn’t quite work as promised for the “runs fast” one. And here’s the reason. For performance, what matters is not so much the code that’s executed, but rather the layout of objects in memory. And the high-level dialect locks in the pointer-heavy GC object model! Even if you write your code in assembly, the performance ceiling will be determined by all those pointers the GC needs. To actually get full “low-level” performance, you need to effectively “mirror” the data across the two dialects, across a quasi-FFI boundary.
And that’s what kills “write most of the code in Python, rewrite hot spots in C” approach — the overhead for transitioning between the native C data structures and the Python ones tends to eat any performance benefits that C brings to the table. There are some very real, very important exceptions, where it is possible to batch sufficiently large packages of work to minimize the overhead: http://venge.net/graydon/talks/VectorizedInterpretersTalk-2023-05-12.pdf. But it seems that the average case looks more like this: https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation.
And this brings me to Rust. It feels like it accidentally blundered into the space of universal languages through the floor. There are no heavy runtime features to opt out of in Rust. The object model is universal throughout the language. There isn’t a value-semantics/reference-semantics dichotomy — references are first-class values. And yet:
As a result, there is a certain spectrum of Rust:
While the bottom end here sits pretty comfortably next to C, the upper tip doesn’t quite reach the usability level of Python. But this is mostly compensated through these three effects:
In these two most excellent articles, https://without.boats/blog/let-futures-be-futures and https://without.boats/blog/futures-unordered, withoutboats introduces the concepts of “multi-task” and “intra-task” concurrency. I want to revisit this distinction — while I agree that there are different classes of patterns of concurrency here, I am not quite satisfied with this specific partitioning of the design space. I will use Rust-like syntax for most of the examples, but I am more interested in the language-agnostic patterns, rather than in Rust’s specific implementation of async.
Let’s introduce the two kinds of concurrency using a somewhat abstract example. We want to handle a Request by doing some computation and then persisting the results in the database and in the cache. Notably, the writes to the cache and to the database can proceed concurrently. So, something like this:
This is multi-task concurrency style — we fire off two tasks for updating the database and the cache. Here’s the same snippet in intra-task style, where we use the join function on futures:
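For a concrete, runnable illustration in Python’s asyncio (update_database and update_cache are made-up stand-ins for the database and cache writes), the two styles look like this:

```python
import asyncio

log = []

# hypothetical stand-ins for the database and cache writes
async def update_database(result):
    log.append(("db", result))

async def update_cache(result):
    log.append(("cache", result))

async def handle_request_multi_task(result):
    # multi-task style: spawn two independent tasks, then wait for both
    t1 = asyncio.create_task(update_database(result))
    t2 = asyncio.create_task(update_cache(result))
    await t1
    await t2

async def handle_request_intra_task(result):
    # intra-task style: join -- concurrent composition within this very task
    await asyncio.gather(update_database(result), update_cache(result))

asyncio.run(handle_request_multi_task(42))
asyncio.run(handle_request_intra_task(42))
print(sorted(log))   # both styles perform both writes
```

Both handlers produce the same effects; the difference is whether the concurrency lives in separate runtime tasks or inside the one task doing the handling.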
In other words:
Multi-task concurrency uses spawn — an operation that takes a future and starts a task that executes independently of the parent task.

Intra-task concurrency uses join — an operation that takes a pair of futures and executes them concurrently as part of the current task.
But what is the actual difference between the two?
One candidate is parallelism — with spawn, the tasks can run not only concurrently, but actually in parallel, on different CPU cores, while join restricts them to the same thread that runs the main task. But I think this is not quite right, abstractly, and is more a product of specific Rust APIs. There are executors which spawn onto the current thread only. And, while in Rust it’s not really possible to make join poll the futures in parallel, I think this is just an artifact of Rust’s existing API design (futures can’t opt out of synchronous cancellation). In other words, I think it is possible in theory to implement an async runtime which provides all of the following functions at the same time:
To confuse matters further, let’s rewrite our example in TypeScript:
and using Rust’s rayon library:
Are these examples multi-task or intra-task? To me, the TypeScript one feels multi-task — although it is syntactically close to join().await, the two update promises are running independently from the parent task. If we forget the call to Promise.all, the cache and the database would still get updated (though likely after we would have returned the response to the user)! In contrast, rayon feels intra-task — although the closures could get stolen and run by a different thread, they won’t “escape” the dynamic extent of the encompassing process call.
Let’s zoom in onto the JS and the join examples:
I’ve re-written the JavaScript version to be syntactically isomorphic to the Rust one. The difference is on the semantic level: JavaScript promises are eager — they start executing as soon as a promise is created. In contrast, Rust futures are lazy — they do nothing until polled. And this, I think, is the fundamental difference: lazy vs. eager “futures” (thread::spawn is an eager “future”, while rayon::join is a lazy one).
And it seems that lazy semantics is quite a bit more elegant! The beauty of

is that it’s Molière’s prose — this is structured concurrency, but without bundles, nurseries, scopes, and other weird APIs.
It makes runtime semantics nicer even in dynamically typed languages. In JavaScript, forgetting an await is a common and very hard to spot problem — without await, the code still works, but is sometimes wrong (when the async operation doesn’t finish quite as fast as usual). Imagine JS with lazy promises — there, forgetting an await would always break, consistently. The need to statically lint missing awaits would then be less pressing.
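Python’s coroutines happen to be lazy in exactly this sense, so the difference is easy to demonstrate (update_cache is a made-up stand-in):

```python
import asyncio

ran = []

async def update_cache():   # hypothetical async operation
    ran.append("cache")

async def forgot_await():
    coro = update_cache()   # oops: the coroutine is created but never awaited
    coro.close()            # (only to silence Python's "never awaited" warning)

async def correct():
    await update_cache()

asyncio.run(forgot_await())
print(ran)                  # -> [] : the forgotten await *always* breaks
asyncio.run(correct())
print(ran)                  # -> ['cache']
```

With a lazy coroutine, the forgotten await fails deterministically on every run — there is no timing window in which the code “happens to work”.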
Compare this with Erlang’s take on nulls: while in typical dynamically typed languages partial functions can return a value T or a None, in Erlang the convention is to return either {ok, T} or none. That is, even if the value is non-null, the call site is forced to unpack it; you can’t write code that happens to work as long as T is non-null.
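Transplanting the Erlang convention into Python for illustration (lookup and describe are made-up helpers):

```python
# Partial functions return ("ok", value) or "none"; call sites must unpack.
def lookup(d, key):
    return ("ok", d[key]) if key in d else "none"

def describe(result):
    if result == "none":
        return "missing"
    tag, value = result   # forced unpacking -- no "happens to work" path
    assert tag == "ok"
    return f"found {value}"

print(describe(lookup({"a": 1}, "a")))   # -> found 1
print(describe(lookup({"a": 1}, "b")))   # -> missing
```

Code that forgets the unpacking step breaks on the very first call, present value or not — the same “consistent failure” property as lazy futures.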
And of course, in Rust, the killer feature of lazy futures is that you can just borrow data from the enclosing scope.
But it seems like there is one difference between multi-task and intra-task concurrency.
In the words of withoutboats:
That is, you can do join(a, b).await,

and

and, with some macros, even

but you can’t do join(xs...).await.
I think this is incorrect, in a trivial and in an interesting way.
The trivial incorrectness is that there’s join_all, which takes a slice of futures and is a direct generalization of join to a runtime-variable number of futures.

But join_all still can’t express the case where you don’t know the number of futures up-front — where you spawn some work, and only later realize that you need to spawn some more.

This is sort-of possible to express with FuturesUnordered, but that’s a yuck API. I mean, even its name screams “DO NOT USE ME!”.
But I do think that this is just an unfortunate API, and that the pattern actually can be expressed in intra-task concurrency style nicely.
Let’s take a closer look at the base case, join!
Section title is a bit of a giveaway. The join operator is async ;. The semicolon is an operator of sequential composition: A; B runs A first and then B. In contrast, join is concurrent composition: join(A, B) runs A and B concurrently. And both join and ; share the same problem — they can compose only a finite number of things.

But that’s why we have other operators for sequential composition! If we know how many things we need to run, we can use a counted for loop. And join_all is an analogue of a counted for loop! In the case where we don’t know up-front when to stop, we use a while. And this is exactly what we miss — there’s no concurrently-flavored while operator.
Importantly, what we are looking for is not an async for:
Here, although there could be some concurrency inside a single loop iteration, the iterations themselves run sequentially — the second iteration starts only when the first one has finished. Pictorially, this looks like a spiral, or a loop if we look from the side:
What we rather want is to run many copies of the body concurrently, something like this:
A spindle-like shape with many concurrent strands, which looks like a wheel’s spokes from the side. Or, if you are really short on fitting metaphors:
Now, I understand that I’ve already poked fun at the unfortunate FuturesUnordered name, but I can’t really find a fitting name for the construct we want here. So I am going to boringly use a concurrently keyword, which is way too long, but I’ll refer to it as “the watermelon operator” — the stripes on the watermelon resemble the independent strands of execution this operator creates:
So, if you are writing a TCP server, your accept loop could look like this:
This runs accept in a loop and, for each accepted socket, runs handle_connection concurrently. There are as many concurrent handle_connection calls as there are ready sockets in our listener!
Let’s limit the maximum number of concurrent connections, to provide back pressure:
You get the idea (hopefully):
To make this more concrete, let’s spell this out as a library function:
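One possible shape for such a helper, sketched in Python’s asyncio rather than Rust (concurrently, listener, and handle_connection are made-up names; a semaphore provides the back pressure from the previous snippet):

```python
import asyncio

async def concurrently(stream, body, limit):
    """Run body(item) concurrently for each item of the async iterator
    `stream`, at most `limit` strands at a time (a sketch)."""
    sem = asyncio.Semaphore(limit)

    async def strand(item):
        try:
            await body(item)
        finally:
            sem.release()

    strands = []
    async for item in stream:
        await sem.acquire()          # back pressure: pause the accept loop
        strands.append(asyncio.create_task(strand(item)))
    await asyncio.gather(*strands)   # structured: nothing escapes this call

# toy accept loop: "sockets" are just integers
async def listener():
    for sock in range(5):
        yield sock

handled = []

async def handle_connection(sock):
    await asyncio.sleep(0)
    handled.append(sock)

asyncio.run(concurrently(listener(), handle_connection, limit=2))
print(sorted(handled))   # -> [0, 1, 2, 3, 4]
```

The final gather is what makes this the concurrent while: all strands finish before concurrently returns, just as a loop body finishes before the loop exits.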
I claim that this is the full set of “primitive” operations needed to express more-or-less everything in intra-task concurrency style.
In particular, we can implement multi-task concurrency this way! To do so, we’ll write a universal watermelon operator, where the T which is passed to the body is a Box&lt;dyn Future&lt;Output=()&gt;&gt;, and where the body just runs this future:
Note that the conversion in the opposite direction is not possible! With intra-task concurrency, we can borrow from the parent stack frame, so it is not a problem to restrict the channel to only allow 'static futures. In a sense, in the above example we return the future up the stack, which explains why it can’t borrow locals from our stack frame.
With multi-task concurrency though, we start with static futures. To let them borrow any stack data requires unsafe.
Note also that the above set of operators — join, join_all, concurrently — is orthogonal to parallelism. Alongside those operators, there could exist pjoin, pjoin_all, and pconcurrently with Send bounds, such that you could mix and match parallel and single-core concurrency.
One possible objection to the above framing of watermelon as a language-level operator is that it seemingly doesn’t pass the zero-cost abstraction test. It can start an unbounded number of futures, and those futures have to be stored somewhere. So we have a language operator which requires dynamic memory allocation, which is a big no-no for any systems programming language.
I think there is some truth to it, and not an insignificant amount of it, but I think I can maybe weasel out of it.
Consider recursion. Recursion can also allocate an arbitrary amount of memory (on the stack), but that is considered fine (I would also agree that it is not in fact fine that unbounded recursion is considered fine, but, for the scope of this discussion, I will be a hypocrite and ignore that opinion of mine).

And here we have essentially the same situation — we want to allocate arbitrarily many (async) stack frames, arranged in a tree. Doing it “on the heap” is easy, but we don’t like the heap here. Luckily, I believe there’s a compilation scheme (hat tip to @rpjohnst for patiently explaining it to me five times in different words) that implements this more-or-less as efficiently as the normal call stack.
The idea is that we will have two stacks — a sync one and an async one. Specifically:

There are two stack pointer registers: the usual sp and one other register (let’s call it asp).

Sync functions allocate their frames on the sp stack, as usual. Crucially, because sync functions can only call other sync functions, the callee doesn’t need to know the value of asp.

Async functions allocate their frames on the asp stack.
While this looks just like Go-style segmented stacks, I think this scheme is quite a bit more efficient (warning: I in general have a tendency to confidently talk about things I know little about, and this one is the extreme case of that. If some Go compiler engineer disagrees with me, I am probably in the wrong!).
The main difference is that the distinction between sync and async functions is maintained in the type system. There are no changes for sync functions at all, so the principle of don’t pay for what you don’t use is observed. This is in contrast to Go — I believe that Go, in general, can’t know whether a particular function can yield (that is, if any function it (indirectly) calls can yield), so it has to conservatively insert stack checks everywhere.
Then, even the async stack frames don’t have to store everything, but just the stuff live across await. Everything that happens between two awaits can go to the normal stack.
On top of that, async functions can still do aggressive inlining. So, the async call (and the stack growth check) has to happen only for dynamically dispatched async calls!
Furthermore, the future trait could have some kind of size_hint method, which returns the lower and the upper bound on the size of the stack. Fully concrete futures type-erased to dyn Future would return the exact amount, (a, Some(a)). The caller would be required to allocate at least a bytes of the async stack, and the callee uses that contract to elide stack checks. An unknown bound, (a, None), would only be returned if the type-erased concrete future itself calls something dynamically dispatched. So only dynamically dispatched calls would have to do stack-growth checks, and that cost seems negligible in comparison to the cost of missed optimizations due to the inability to inline.
Altogether, it feels like this adds up to something sufficiently cheap to just call it “async stack allocation”.
I guess that’s all for today? Summarizing:
We want something like FuturesUnordered, but nice. The concurrently operator/function feels like a sufficiently low-hanging watermelon here.
Haha, just kidding! Bonus content! This really should be a separate blog post, but it is tangentially related, so here we go:
So far, we’ve focused on join, the operator that takes two futures, “runs” them concurrently, and returns both results as a pair. But there’s a second, dual operator:

Like join, race runs two futures concurrently. Unlike join, it returns only one result — the one that came first. This operator is the basis for the more general select facility.
Although race is dual to join, I don’t think it is as fundamental. It is possible to have two dual things where one of them is in the basis and the other is derived. For example, it is an axiom of set theory that the union of two sets, A ∪ B, is a set. Although the intersection of sets, A ∩ B, is dual to the union, the existence of intersections is not an axiom. Rather, the intersection is defined using the axiom of specification:
Proposition 131.7.1: race can be defined in terms of join
The race operator is trickier than it seems. Yes, it returns the result of the future that finished first, but what happens to the other one? It gets cancelled. Rust implements this cancellation “for free”, by just dropping the future, but this is restrictive. This is precisely the issue that prevents pjoin from working.
I postulate that fully general cancellation is an asynchronous protocol:
That is, cancellation is not “I cancel thou”. Rather it is “I ask you to stop, and then I cooperatively wait until you do so”. This is very abstract, but the following three examples should help make this concrete.
A is some generic asynchronous task which offloads some computation-heavy work to a CPU pool. That work (B) doesn’t check any cancelled flag. So, if A is cancelled, it can’t really stop B, which means we are violating structured concurrency.
A is doing async IO. Specifically, A uses io_uring to read data from a socket. A owns a buffer, and passes a pointer to it to the kernel via io_uring as the target buffer for a read syscall. While A is being cancelled, the kernel writes data to this buffer. If A doesn’t wait until the kernel is done, the buffer’s memory might get reused, and the kernel would corrupt some unrelated data.
These examples are somewhat unsatisfactory — A is philosophical (who needs structured concurrency?), while B is esoteric (who uses io_uring in 2024?). But the two can be combined into something rather pedestrianly bad:
Like in case A, an async task submits some work to a CPU pool. But this time the work is very specific — computing a cryptographic checksum of a message owned by A. Because this is cryptography, it is going to be some hyper-optimized SIMD loop which definitely won’t have any affordance for checking some sort of cancelled flag. The loop has to run to completion, or at least to a safe point. And, because the loop checksums data owned by A, we can’t destroy A before the loop exits — otherwise it’ll be reading garbage memory!
And this example is the reason why

can’t be a thing in Rust — if fut1 runs on a thread separate from the pjoin future, then, if pjoin ends up being cancelled, fut1 would be pointing at garbage. You could have

but that removes one of the major benefits of the intra-task style API — the ability to just borrow data.
So the fully general cancellation should be cooperative. Let’s assume that it is driven by some sort of cancellation token API:
Note that the question of cancellation being cooperative is orthogonal to the question of explicit threading of cancellation tokens! They can be threaded implicitly (cooperative, implicit cancellation is how Python’s trio does this, though they don’t really document the cooperative part (the shields stuff)).
With this, we can write our own race — we’ll create a cancellation scope and then join modified futures, each of which would cancel the other upon completion:

In other words, race is but a cooperatively-cancelled join!
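A sketch of that proposition in Python’s asyncio — race assembled from gather (the join) plus an Event serving as the cancellation token. Note the cheat: task.cancel approximates what should be a cooperative request to stop at a safe point:

```python
import asyncio

async def race(coro_a, coro_b):
    """race() built from join (asyncio.gather) and a cancellation token.
    A sketch only: real cooperative cancellation would let the losing
    future run to a safe point instead of being cancelled abruptly."""
    token = asyncio.Event()   # the "please stop" token

    async def wrapped(coro):
        task = asyncio.create_task(coro)
        stop = asyncio.create_task(token.wait())
        done, pending = await asyncio.wait(
            {task, stop}, return_when=asyncio.FIRST_COMPLETED
        )
        for p in pending:     # wind down whatever didn't finish
            p.cancel()
            try:
                await p
            except asyncio.CancelledError:
                pass
        if task in done:
            token.set()       # we won: ask the sibling to stop
            return task.result()
        return None           # we lost the race

    a, b = await asyncio.gather(wrapped(coro_a), wrapped(coro_b))  # the join!
    return a if a is not None else b

async def fast():
    await asyncio.sleep(0.01)
    return "fast"

async def slow():
    await asyncio.sleep(10)
    return "slow"

print(asyncio.run(race(fast(), slow())))   # -> fast
```

Each wrapped future waits on its own work and on the token concurrently; whichever finishes first sets the token, and the join of the two wrappers is the race.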
That’s all for real for today, viva la vida!
An attempt at concise explanation of what io_uring is.
io_uring is a new Linux kernel interface for making system calls.
Traditionally, syscalls are submitted to the kernel individually and
synchronously: a syscall CPU instruction transfers control from the
application to the kernel; control returns to the application only when the
syscall is completed. In contrast, io_uring
is a batched and asynchronous
interface. The application submits several syscalls by writing their codes &
arguments to a lock-free shared-memory ring buffer. The kernel reads the
syscalls from this shared memory and executes them at its own pace. To
communicate results back to the application, the kernel writes the results to a
second lock-free shared-memory ring buffer, where they become available to the
application asynchronously.
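A toy model of the interface (plain Python queues standing in for the lock-free shared-memory rings, and lambdas standing in for real syscalls):

```python
from collections import deque

# Toy model: two queues standing in for io_uring's shared-memory rings.
# The real thing is memory mapped between application and kernel, with
# lock-free head/tail index updates instead of Python method calls.
submission_queue = deque()   # application -> kernel: (opcode, args)
completion_queue = deque()   # kernel -> application: results

def app_submit(opcode, *args):
    submission_queue.append((opcode, args))   # no syscall per operation

def kernel_drain():
    # The kernel reads submissions at its own pace and posts completions.
    handlers = {
        "read": lambda data, n: data[:n],     # pretend-syscalls
        "write": lambda buf: len(buf),
    }
    while submission_queue:
        opcode, args = submission_queue.popleft()
        completion_queue.append(handlers[opcode](*args))

app_submit("read", b"hello world", 5)   # batch two operations...
app_submit("write", b"abc")
kernel_drain()                          # ...kernel executes asynchronously
print(list(completion_queue))           # -> [b'hello', 3]
```

The key property the toy preserves: the application pays no per-syscall transition, only queue writes, and consumes results whenever they appear on the completion side.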
You might want to use io_uring if:

You might want to avoid io_uring if:
I had a productive day today! I did many different and unrelated things, but they all had the same unifying theme:
There’s a bug! And it is sort-of obvious how to fix it. But if you don’t laser-focus on that, and try to perceive the surrounding context, it turns out that the bug is valuable, and it is pointing in the direction of a bigger related problem. So, instead of fixing the bug directly, a detour is warranted to close off the avenue for a class of bugs.
Here are the examples!
In the morning, my colleague pointed out that we were giving a substandard error message in a pretty stressful situation — when the database runs out of disk space. I went ahead and added appropriate log messages to make it clearer. But then I stopped for a moment and noticed that the problem is bigger — we are missing infrastructure for fatal errors, and NoSpaceLeft is just one of a kind. So I went ahead and added that along the way: #2289.
Then, I was reviewing a PR by @martinconic which was fixing some typos, and noticed that it was also changing the formatting of our Go code. The latter is by far the bigger problem, as it is a sign that we somehow are not running gofmt during our CI, which I fixed in #2287.
Then, there was a PR from yesterday, where we again had a not quite right log message. The cause was a confusion between two compile-time configuration parameters, which were close, but not quite identical. So, instead of fixing the error message I went ahead and made the two parameters exactly the same. But then my colleague noticed that I actually failed to fix it one level deeper in this case! Turns out, it is possible to remove this compile-time parametrization altogether, which I did in #2292.
But these all were randomly-generated side quests. My intended story line for today was to refactor the piece of code I had trouble explaining (and understanding!) on yesterday’s episode of Iron Beetle. To get into the groove, I decided to first refactor the code that calls the problematic piece of logic, as I noticed a couple of minor stylistic problems there. Of course, when doing that, I discovered that we have a bit of dead code, which luckily doesn’t affect correctness, but does obscure the logic. While fixing that, I used one of my favorite Zig patterns:

defer assert(postcondition);
It of course failed in the simulator in a way postcondition checks tend to fail — there was an unintended reentrancy in the code. So I slacked my colleague something like
But of course I can’t just “go and .next_tick it”, so here I am, trying to figure out how to encode a Duff’s device in Zig pre-#8220, so as to make this class of issues much less likely.
Several examples of the law:
Software obviously depends on its source code. The law says that something should hold the hash of the entire source, and thus mandates the use of a content-addressed version control system such as git.
Software often depends on 3rd party libraries. These libraries could in turn depend on other libraries. It is imperative to include a lockfile that covers this entire set and comes with checksums. Curiously, the lockfile itself is a part of source code, and gets mixed into the VCS root hash.
Software needs a compiler. The hash of the required compiler should be included in the lockfile. Typically, this is not done — only the version is specified. I think that is a mistake. Specifying a version and a hash is not much more trouble than just the version, but that gives you a superpower — you no longer need to trust the party that distributes your compiler. You could take a shady blob of bytes you’ve found laying on the street, as long as its checksum checks out.
Note that you can compress hashes by mixing them. For the compiler use-case, there’s a separate hash per platform, because the Linux and the Windows versions of the compiler differ. This doesn’t mean that your project should include one compiler hash per platform; one hash is enough. The compiler distribution should include a manifest: a small text file which lists all platforms and their platform-specific hashes. The single hash of that file is what is to be included by downstream consumers. To verify a specific binary, the consumer first downloads the manifest, checks that it has the correct hash, and then extracts the hash for their specific platform.
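The two-step verification described above can be sketched in a few lines of Python. The manifest format here (one "platform hash" pair per line) is made up for illustration; the point is only that a single pinned hash authenticates the whole table, and the table then authenticates each binary:

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_compiler(manifest: bytes, pinned_hash: str,
                    platform: str, binary: bytes) -> bool:
    # Step 1: the single pinned hash authenticates the entire manifest.
    if sha256(manifest) != pinned_hash:
        return False
    # Step 2: the manifest maps platforms to per-platform binary hashes
    # (a hypothetical "platform hash" line-per-row format).
    table = dict(line.split() for line in manifest.decode().splitlines())
    return table.get(platform) == sha256(binary)
```

With this scheme, the downstream project pins exactly one hash, yet can verify a shady blob of bytes found on the street for any platform.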
The law is an instrumental goal. By itself, hashes are not that useful. But to get to the point where you actually know the hashes, you must actually know the full set of your dependencies (if your build is a pile of ad-hoc .sh scripts, you most likely don’t know the set of your dependencies).
These things are what actually make developing software easier.
A short note on what goes into a language’s standard library, and what’s left for third party libraries to implement!
Usually, the main underlying driving factor here is cardinality. If it is important that there’s only one of a thing, it goes into std. If having many of a thing is a requirement, it is better handled by a third-party library. That is, the usual physical constraint is that there’s only a single standard library, and everyone uses the same standard library. In contrast, there are many different third-party libraries, and they all can be used at the same time.
So, until very recently, my set of rules of thumb for what goes into stdlib looked roughly like this:
So for example something like Vec goes into a standard library, because all other libraries are going to use vectors at the interfaces. Something like lazy_static doesn’t: while it is often needed, it is not a vocabulary interface type. But it is acceptable for something like OnceCell to be in std: it is still not a vocabulary type, but, unlike lazy_static, it is clear that the API is more or less optimal, and that there aren’t that many good options to do this differently.
But I’ve changed my mind about the second bullet point, about facilities like file IO or TCP sockets. I was always under the impression that these things are a must for a standard library. But now I think that’s not necessarily true!
Consider randomness. Not the PRNG kind of randomness you’d use to make a game fun, but the cryptographically secure randomness you’d use to generate an SSH key pair. This sort of randomness ultimately bottoms out in hardware, and fundamentally requires talking to the OS and doing IO. This is squarely bullet point number 2. And Rust is an interesting case study here: it failed to provide this abstraction in std, even though std itself actually needs it! But this turned out to be mostly a non-issue in practice: a third-party crate, getrandom, took on the job of writing all the relevant bindings to various platform-specific APIs and using a bunch of conditional compilation to abstract it all away and provide a nice cross-platform API.
So, no, it is not a requirement that std has to wrap any wrappable IOing API. This could be handled by the library ecosystem, if the language allows first-class bindings to raw OS APIs outside of compiler-privileged code (and Rust certainly allows for that).
So perhaps it won’t be too unreasonable to leave even things like files and sockets to community experimentation? In a sense, that is happening in the async land anyway.
To clarify, I still believe that Rust should provide bindings to OS-sourced crypto randomness, and I am extremely happy to see recent motion in that area. But the reason for this belief changed. I no longer feel the mere fact that OS-specific APIs are involved to be particularly salient. However, it is still true that there’s more or less one correct way to do this.
Programmers on the internet often use “Turing-completeness” terminology. Typically, not being Turing-complete is extolled as a virtue or even a requirement in specific domains. I claim that most such discussions are misinformed — that not being Turing complete doesn’t actually mean what folks want it to mean, and is instead a stand-in for a bunch of different practically useful properties, which are mostly orthogonal to actual Turing completeness.
While I am generally descriptivist in nature and am ok with words losing their original meaning as long as the new meaning is sufficiently commonly understood, Turing completeness is a hill I will die on. It is a term from math, it has a very specific meaning, and you are not allowed to re-purpose it for anything else, sorry!
I understand why this happens: to really understand what Turing completeness is and is not you need to know one (simple!) theoretical result about so-called primitive recursive functions. And, although this result is simple, I was only made aware of it in a fairly advanced course during my masters. That’s the CS education deficiency I want to rectify — you can’t teach students the halting problem without also teaching them about primitive recursion!
The post is going to be rather meaty, and will be split in three parts:
In Part I, I give a TL;DR for the theoretical result and some of its consequences. Part II is going to be a whirlwind tour of Turing Machines, Finite State Automata and Primitive Recursive Functions. And then Part III will circle back to practical matters.
If math makes you slightly nauseous, you might want to skip Part II. But maybe give it a try? The math we’ll need is baby math from first principles, without reference to any advanced results.
Here’s the key result: suppose you have a program in some Turing complete language, and you also know that it’s not too slow. Suppose it runs faster than O(2^{2^{N}}). That is, two to the power of two to the power of N, a very large number. In this case, you can implement this algorithm in a non-Turing complete language.
Most practical problems fall into this “faster than two to the power of two to the power of N” space. Hence it follows that you don’t need the full power of a Turing Machine to tackle them. Hence, a language not being Turing complete doesn’t in any way restrict you in practice, or give you extra powers to control the computation.
Or, to restate this: in practice, a program which doesn’t terminate, and a program that needs a billion billion steps to terminate are equivalent. Making something non-Turing complete by itself doesn’t help with the second problem in any way. And there’s a trivial approach that solves the first problem for any existing Turing-complete language — in the implementation, count the steps and bail with an error after a billion.
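The "count the steps and bail" trick above is easy to make concrete. A minimal sketch in Python (the driver and the `step`-function protocol are my own framing, not from the post): any interpreter structured this way is trivially not Turing complete, because it always terminates.

```python
FUEL = 10**9  # "bail with an error after a billion" steps

def run_with_fuel(step, state, fuel=FUEL):
    # Repeatedly apply `step` until it signals completion by returning None,
    # or until the budget runs out. Guaranteed to terminate either way.
    for _ in range(fuel):
        nxt = step(state)
        if nxt is None:
            return state
        state = nxt
    raise RuntimeError("fuel exhausted: treating the program as non-terminating")
```

A program needing a billion billion steps and a program that never halts are indistinguishable to this interpreter, which is exactly the point.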
The actual theoretical result is quite a bit more general than that. It is (unsurprisingly) recursive:
It is expected that this sounds like gibberish at this point! So let’s just go and prove this thing, right here in this blog post! We will work our way up slowly towards this result. The plan is as follows:
Finite State Machines are simple! An FSM takes a string as input, and returns a binary answer, “yes” or “no”. Unsurprisingly an FSM has a finite number of states: Q0, Q1, …, Qn. A subset of states are designated as “yes” states, the rest are “no” states. There’s also one specific starting state.
The behavior of the state machine is guided by a transition (step) function, s. This function takes the current state of the FSM and the next symbol of input, and returns a new state.
The semantics of FSM is determined by repeatably applying the single step function for all symbols of the input, and noting whether the final state is a “yes” state or a “no” state.
Here’s an FSM which accepts only strings of zeros and ones of even length:
This machine ping-pongs between states Q0 and Q1 and ends up in Q0 only for inputs of even length (including an empty input).
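This is small enough to write down directly. A sketch in Python (the table-driven encoding is mine): the transition table sends every symbol from Q0 to Q1 and back, and Q0 is the single "yes" state.

```python
def make_fsm(start, transition, accepting):
    # An FSM is just a start state, a (state, symbol) -> state table,
    # and a set of accepting ("yes") states.
    def accepts(s: str) -> bool:
        state = start
        for symbol in s:
            state = transition[(state, symbol)]
        return state in accepting
    return accepts

# Q0 <-> Q1 on every symbol; Q0 means "an even number of symbols read so far".
even_length = make_fsm(
    start="Q0",
    transition={("Q0", c): "Q1" for c in "01"} | {("Q1", c): "Q0" for c in "01"},
    accepting={"Q0"},
)
```

Note that `accepts` does exactly one table lookup per input symbol: linear time, no possibility of an infinite loop.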
What can FSMs do? As they give a binary answer, they are recognizers — they don’t compute functions, but rather just characterize certain sets of strings. A famous result is that the expressive power of FSMs is equivalent to the expressive power of regular expressions. If you can write a regular expression for it, you could also do an FSM!
There are also certain things that state machines can’t do. For example, they can’t enter an infinite loop: any FSM is linear in the input size and always terminates. But there are also quite specific sets of strings that can’t be recognized by an FSM. Consider this set:
That is, an infinite set which contains ‘1’s surrounded by an equal number of ‘0’s on both sides. Let’s prove that there isn’t a state machine that recognizes this set!
As usual, suppose there is such a state machine. It has a certain number of states — maybe a dozen, maybe a hundred, maybe a thousand, maybe even more. But let’s say fewer than a million. Then, let’s take a string which looks like a million zeros, followed by one, followed by a million zeros. And let’s observe our FSM eating this particular string.
First of all, because the string is in fact a one surrounded by the equal number of zeros on both sides, the FSM ends up in a “yes” state. Moreover, because the length of the string is much greater than the number of states in the state machine, the state machine necessarily visits some state twice. There is a cycle, where the machine goes from A to B to C to D and back to A. This cycle might be pretty long, but it’s definitely shorter than the total number of states we have.
And now we can fool the state machine. Let’s make it eat our string again, but this time, once it completes the ABCDA cycle, we’ll force it to traverse this cycle again. That is, the original cycle corresponds to some portion of our giant string:
If we duplicate this portion, our string will no longer look like a one surrounded by an equal number of zeros, but the state machine will still end up in the “yes” state. Which is a contradiction that completes the proof.
A Turing Machine is only slightly more complex than an FSM. Like an FSM, a TM has a bunch of states and a single-step transition function. While an FSM has an immutable input which is being fed to it symbol by symbol, a TM operates with a mutable tape. The input gets written to the tape at the start. At each step, a TM looks at the current symbol on the tape, changes its state according to a transition function and, additionally:
When a machine reaches a designated halt state, it stops, and whatever is written on the tape at that moment is the result. That is, while FSMs are binary recognizers, TMs are functions. Keep in mind that a TM does not necessarily stop. It might be the case that a TM goes back and forth over the tape, overwrites it, changes its internal state, but never quite gets to the final state.
Here’s an example Turing Machine:
If the configuration of the machine looks like this:
Then we are in the s A 0 = (B, 1, Right) case, so we should change the state to B, replace 0 with 1, and move to the right:
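A minimal TM simulator can make this step rule concrete. This is a sketch in Python; the tiny two-state machine below is made up to match the s A 0 = (B, 1, Right) shape, it is not the post’s exact example. The step budget is just a safety net for the sketch, since a real TM need not halt:

```python
def run_tm(transition, state, tape, head=0, halt="HALT", max_steps=10_000):
    tape = dict(enumerate(tape))  # sparse tape; unwritten cells read as 0
    for _ in range(max_steps):
        if state == halt:
            return [tape[i] for i in sorted(tape)]
        symbol = tape.get(head, 0)
        # One application of the transition function: new state,
        # symbol to write, and direction to move.
        state, write, move = transition[(state, symbol)]
        tape[head] = write
        head += 1 if move == "R" else -1
    raise RuntimeError("no halt state reached within the step budget")

# A made-up machine: flip 0s to 1s while bouncing between A and B,
# halting at the first 1. Includes the s A 0 = (B, 1, Right) rule.
rules = {
    ("A", 0): ("B", 1, "R"),
    ("B", 0): ("A", 1, "R"),
    ("A", 1): ("HALT", 1, "R"),
    ("B", 1): ("HALT", 1, "R"),
}
```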
There are a bunch of fiddly details to Turing Machines!
The tape is conceptually infinite, so beyond the input, everything is just zeros. This creates a problem: it might be hard to say where the input (or the output) ends! There are a couple of technical solutions here. One is to say that there are three different symbols on the tape — zeros, ones, and blanks, and require that the tape is initialized with blanks. A different solution is to invent some encoding scheme. For example, we can say that the input is a sequence of 8-bit bytes, without interior null bytes. So, eight consecutive zeros at a byte boundary designate the end of input/output.
It’s useful to think about how this byte-oriented TM could be implemented. We could have one large state for each byte of input. So, Q142 would mean that the head is on the byte with value 142. And then we’ll have a bunch of small states to read out the current byte. E.g., we start reading a byte in state S. Depending on the next bit we move to S0 or S1, then to S00 or S01, etc. Once we reach something like S01111001, we move back 8 positions and enter state Q121. This is one of the patterns of Turing Machine programming: while your main memory is the tape, you can represent some constant amount of memory directly in the states.
What we’ve done here is essentially lowering a byte-oriented Turing Machine to a bit-oriented machine. So, we could think only in terms of big states operating on bytes, as we know the general pattern for converting that to direct bit-twiddling.
With this encoding scheme in place, we can now feed arbitrary files to a Turing Machine! Which will be handy for the next observation:
You can’t actually program a Turing Machine. What I mean is that, counter-intuitively, there isn’t some user-supplied program that a Turing Machine executes. Rather, the program is hard-wired into the machine. The transition function is the program.
But with some ingenuity we can regain our ability to write programs. Recall that we’ve just learned to feed arbitrary files to a TM. So what we could do is write a text file that specifies a TM and its input, and then feed that entire file as input to an “interpreter” Turing Machine which would read the file and act as the machine specified there. A Turing Machine can have an eval function.
Is such an “interpreter” Turing Machine possible? Yes! And it is not hard: if you spend a couple of hours programming Turing Machines by hand, you’ll see that you pretty much can do anything — you can do numbers, arithmetic, loops, control flow. It’s just very very tedious.
So let’s just declare that we’ve actually coded up this Universal Turing Machine which simulates a TM given to it as an input in a particular encoding.
This sort of construct also gives rise to the Church-Turing thesis. We have a TM which can run other TMs. And you can implement a TM interpreter in something like Python. And, with a bit of legwork, you could also implement a Python interpreter as a TM (you likely want to avoid doing that directly, and instead do a simpler interpreter for WASM, and then use a Python interpreter compiled to WASM). This sort of bidirectional interpretation shows that Python and TMs have equivalent computing power. Moreover, it’s quite hard to come up with a reasonable computational device which is more powerful than a Turing Machine.
There are computational devices that are strictly weaker than TMs though. Recall FSMs. By this point, it should be obvious that a TM can simulate an FSM. Everything a Finite State Machine can do, a Turing Machine can do as well. And it should be intuitively clear that a TM is more powerful than an FSM. An FSM gets to use only a finite number of states. A TM has these same states, but it also possesses a tape which serves as an infinitely sized external memory.
Directly proving that you can’t encode a Universal Turing Machine as an FSM sounds complicated, so let’s prove something simpler. Recall that we have established that there’s no FSM that accepts only ones surrounded by an equal number of zeros on both sides (because a sufficiently large word of this form would necessary enter a cycle in a state machine, which could then be further pumped). But it’s actually easy to write a Turing Machine that does this:
If what remains is a single 1, the answer is “yes”, otherwise it is a “no”.
We found a specific problem that can be solved by a TM, but is out of reach of any FSM. So it necessarily follows that there isn’t an FSM that can simulate an arbitrary TM.
It is also useful to take a closer look at the tape. It is a convenient skeuomorphic abstraction which makes the behavior of the machine intuitive, but it is inconvenient to implement in a normal programming language. There isn’t a standard data structure that behaves just like a tape.
One cool practical trick is to simulate the tape as a pair of stacks. Take this:
And transform it to something like this:
That is, everything to the left of the head is one stack, everything to the right, reversed, is the other. Here, moving the reading head left or right corresponds to popping a value off one stack and pushing it onto another.
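A sketch of this two-stack tape in Python (the class and method names are mine). The invariant: the current cell is the top of the right stack, and moving the head pops from one stack and pushes onto the other, conjuring blank cells (zeros) on demand:

```python
class Tape:
    # Everything strictly left of the head is `left` (top = nearest cell);
    # the current cell and everything to its right is `right`, reversed.
    def __init__(self, symbols):
        self.left = []
        self.right = list(reversed(symbols))

    def read(self):
        return self.right[-1] if self.right else 0  # blanks are 0

    def write(self, symbol):
        if self.right:
            self.right[-1] = symbol
        else:
            self.right.append(symbol)

    def move_right(self):
        self.left.append(self.right.pop() if self.right else 0)

    def move_left(self):
        self.right.append(self.left.pop() if self.left else 0)
```

Both stacks grow as needed, so the "conceptually infinite" tape costs only as much memory as the machine actually touches.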
So, an equivalent-in-power definition would be to say that a TM is an FSM endowed with two stacks.
This of course creates an obvious question: is an FSM with just one stack a thing? Yes! It would be called a pushdown automaton, and it would correspond to context-free languages. But that’s beyond the scope of this post!
There’s yet another way to look at the tape, or the pair of stacks, if the set of symbols is 0 and 1. You could say that a stack is just a number! So, something like [1, 0, 1, 1] will be 1 + 2 + 8 = 11.
Looking at the top of the stack is stack % 2, removing an item from the stack is stack / 2, and pushing x onto the stack is stack * 2 + x. We won’t need this right now, so just hold onto it for a brief moment.
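Those three arithmetic operations are the whole stack API. A sketch in Python (function names are mine; note integer division, since we are working with natural numbers):

```python
def top(stack: int) -> int:
    # The least significant bit is the top of the stack.
    return stack % 2

def pop(stack: int) -> int:
    # Dropping the top element shifts everything down one bit.
    return stack // 2

def push(stack: int, bit: int) -> int:
    # Shift up and place the new bit on top.
    return stack * 2 + bit
```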
Ok, so we have some idea about the lower bound for the power of a Turing Machine — FSMs are strictly less expressive. What about the opposite direction? Is there some computation that a Turing Machine is incapable of doing?
Yes! Let’s construct a function which maps natural numbers to natural numbers, which can’t be implemented by a Turing Machine. Recall that we can encode an arbitrary Turing Machine as text. That means that we can actually enumerate all possible Turing Machines, and write them in a giant line, from the most simple Turing Machine to more complex ones:
This is of course going to be an infinite list.
Now, let’s see how TM0 behaves on input 0: it either prints something, or doesn’t terminate. Then, note how TM1 behaves on input 1, and, generalizing, create a function f that behaves as the nth TM on input n. It might look something like this:
Now, let’s construct a function g which is maximally different from f: where f gives 0, g will return 1, and it will return 0 in all other cases:
There isn’t a Turing machine that computes g. For suppose there is. Then it exists somewhere in our list of all Turing Machines. Let’s say it is TM1000064. So, if we feed 0 to it, it will return g(0), which is 1, which is different from f(0). And the same holds for 1, and 2, and 3. But once we get to g(1000064), we are in trouble, because, by the definition of g, g(1000064) is different from what is computed by TM1000064! So such a machine is impossible.
Those who are math savvy might express this more succinctly: there’s a countably-infinite number of Turing Machines, and an uncountably-infinite number of functions. So there must be some functions which do not have a corresponding Turing Machine. It is the same proof — the diagonalization argument is hiding in the claim that the set of all functions is an uncountable set.
But this is super weird and abstract. Let’s rather come up with some very specific problem which isn’t solvable by a Turing Machine. The halting problem: given source code for a Turing Machine and its input, determine if the machine halts on this input eventually.
As we have waved our hands sufficiently vigorously to establish that Python and Turing Machines have equivalent computational power, I am going to try to solve this in Python:
Now, I will do a weird thing and start asking whether a program terminates, if it is fed its own source code, in a reverse-quine of sorts:
and finally I construct this weird beast of a program:
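The code itself is not shown here, so this is a hypothetical reconstruction in Python. The names halts, halts_on_self, and weird come from the surrounding prose; halts is the impossible oracle, stubbed out, since the whole argument is that it cannot exist:

```python
def halts(program_source: str, program_input: str) -> bool:
    # The hypothetical oracle: True iff the program halts on the input.
    # The point of the proof is that this cannot actually be implemented.
    raise NotImplementedError("no such oracle exists")

def halts_on_self(program_source: str) -> bool:
    # Does a program halt when fed its own source code?
    return halts(program_source, program_source)

def weird(program_input: str) -> None:
    # Loop forever exactly when the oracle predicts that we halt.
    if halts_on_self(program_input):
        while True:
            pass
```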
To make this even worse, I’ll feed the text of this weird program to itself. Does it terminate with this input? Well, if it terminates, and if our halts function is implemented correctly, then the halts_on_self(program_input) invocation above returns True. But then we enter the infinite loop and don’t actually terminate.
Hence, it must be the case that weird does not terminate when self-applied. But then halts_on_self returns False, and it should terminate. So we get a contradiction both ways. Which necessarily means that either our halts sometimes returns a straight-up incorrect answer, or that it sometimes does not terminate.
So this is the flip side of a Turing Machine’s power — it is so powerful that it becomes impossible to tell whether it’ll terminate or not!
It actually gets much worse, because this result can be generalized to an unreasonable degree! In general, there’s very little we can say about arbitrary programs.
We can easily check syntactic properties (is the program text shorter than 4 kilobytes?), but they are, in some sense, not very interesting, as they depend a lot on how exactly one writes a program. It would be much more interesting to check some refactoring-invariant properties, which hold when you change the text of the program, but leave the behavior intact. Indeed, “does this change preserve behavior?” would be one very useful property to check!
So let’s define two TMs to be equivalent, if they have identical behavior. That is, for each specific input, either both machines don’t terminate, or they both halt, and give identical results.
Then, our refactoring-invariant properties are, by definition, properties that hold (or do not hold) for entire equivalence classes of TMs.
And a somewhat depressing result here is that there are no non-trivial refactoring-invariant properties that you can algorithmically check.
Suppose we have some magic TM, called P, which checks such a property. Let’s show that, using P, we can solve the problem we know we cannot solve: the halting problem.
Consider a Turing Machine that is just an infinite loop and never terminates, M1. P might or might not hold for it. But, because P is non-trivial (it holds for some machines and doesn’t hold for others), there’s some different machine M2 which differs from M1 with respect to P. That is, P(M1) xor P(M2) holds.
Let’s use these M1 and M2 to figure out whether a given machine M halts on input I. Using the Universal Turing Machine (the interpreter), we can construct a new machine, M12, that just runs M on input I, then erases the contents of the tape and runs M2. Now, if M halts on I, then the resulting machine M12 is behaviorally-equivalent to M2. If M doesn’t halt on I, then the result is equivalent to the infinite loop program, M1. Or, in pseudo-code:
This is pretty bad and depressing — we can’t learn anything meaningful about an arbitrary Turing Machine! So let’s finally get to the actual topic of today’s post:
This is going to be another computational device, like FSMs and TMs. Like an FSM, it’s going to be a nice, always terminating, non-Turing complete device. But it will turn out to have quite a bit of the power of a full Turing Machine!
However, unlike both TMs and FSMs, Primitive Recursive Functions are defined directly as functions which take a tuple of natural numbers and return a natural number. The two simplest ones are zero (a zero-arity function that returns 0) and succ, a unary function that just adds 1. Everything else is going to get constructed out of these two:
One way we are allowed to combine these functions is by composition. So we can get all the constants right off the bat:
We aren’t going to be allowed to use general recursion (because it can trivially non-terminate), but we do get to use a restricted form of C-style loop. It is a bit fiddly to define formally! The overall shape is LOOP(init, f, n).
Here, init and n are numbers: the initial value of the accumulator and the total number of iterations. The f is a unary function that specifies the loop body; it takes the current value of the accumulator and returns the new value. So
While this is similar to a C-style loop, the crucial difference is that the total number of iterations n is fixed up-front. There’s no way to mutate the loop counter in the loop body.
This allows us to define addition:
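As a Python sketch (assuming nothing beyond the zero and succ primitives and the LOOP operator described above; the straight-line loop in LOOP is what guarantees termination):

```python
def zero():
    return 0

def succ(x):
    return x + 1

def LOOP(init, f, n):
    # The body f never sees or touches the iteration count,
    # so the loop always runs exactly n times and terminates.
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc

def add(x, y):
    # Start from x and apply succ y times.
    return LOOP(x, succ, y)
```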
Multiplication is trickier. Conceptually, to multiply x and y, we want to LOOP from zero, and repeat “add x” y times. The problem here is that we can’t write an “add x” function yet.
One way around this is to define LOOP as a family of operators, which can pass extra arguments to the iteration function:
That is, LOOP_N takes n extra arguments, and passes them through to every invocation of the body function. To express this idea a little bit more succinctly, let’s just allow partially applying the second argument of LOOP. That is:
LOOP is not a function in our language; it’s a builtin operator, a keyword. So, for convenience, we allow passing partially applied functions to it. But semantically this is equivalent to just passing in extra arguments on each iteration.
Which finally allows us to write
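In Python this partial application falls out of closures: the lambda capturing x plays the role of the partially applied body function. A self-contained sketch:

```python
def LOOP(init, f, n):
    # Fixed-iteration loop: n is decided before the loop starts.
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc

def add(x, y):
    return LOOP(x, lambda acc: acc + 1, y)

def mul(x, y):
    # The closure over x mimics "LOOP with an extra pass-through argument".
    return LOOP(0, lambda acc: add(acc, x), y)
```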
Ok, so that’s progress — we made something as complicated as multiplication, and we still are in the guaranteed-to-terminate land. Because each loop has a fixed number of iterations, everything eventually finishes.
We can go on and define x^{y}:
And this in turn allows us to define a couple of concerningly fast-growing functions:
That’s fun, but to do some programming, we’ll need an if. We’ll get to it, but first we’ll need some boolean operations. We can encode false as 0 and true as 1. Then
But or creates a problem: we’ll need subtraction.
Defining sub is tricky, due to two problems:
First, we only have natural numbers, no negatives. This one is easy to solve — we’ll just define subtraction to saturate.
The second problem is more severe: I think we actually can’t express subtraction given the set of allowable operations so far. That is because all our operations are monotonic; the result is never less than the arguments. One way to solve this problem is to define the LOOP in such a way that the body function also gets passed a second argument, the current iteration. So, if you iterate up to n, the last iteration will observe n - 1, and that would be the non-monotonic operation that creates subtraction. But that seems somewhat inelegant to me, so instead I will just add a pred function to the basis, and use that to add loop counters to our iterations.
Now we can say:
And now we can do a bunch of comparison operators:
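A Python sketch of this step (the helper names is_zero and le are mine): pred is the one non-monotonic primitive, saturating sub is just "apply pred y times", and comparisons fall out of saturation:

```python
def LOOP(init, f, n):
    acc = init
    for _ in range(n):
        acc = f(acc)
    return acc

def pred(x):
    # Added to the basis: saturating predecessor on the naturals.
    return x - 1 if x > 0 else 0

def sub(x, y):
    # Saturating subtraction: apply pred y times.
    return LOOP(x, pred, y)

def is_zero(x):
    # 1 - x saturates to 0 for any x >= 1, so this is "x == 0".
    return sub(1, x)

def le(x, y):
    # x <= y exactly when x - y saturates all the way down to zero.
    return is_zero(sub(x, y))
```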
With that, we can implement modulus. To compute x % m we will start with x, and will keep subtracting m until we get a number smaller than m. We’ll need at most x iterations for that. In pseudo-code:
And as a bona fide PRF:
That’s a curious structure — rather than computing the modulo directly, we essentially search for it using trial and error, and relying on the fact that the search has a clear upper bound.
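The bounded-search shape can be sketched in Python, with a plain fixed-count loop standing in for LOOP (once the accumulator drops below m, the remaining iterations are harmless no-ops):

```python
def mod(x, m):
    # Subtract m while the accumulator is still >= m. Since at most x
    # subtractions can ever be needed, x is a safe, known-up-front bound.
    acc = x
    for _ in range(x):
        acc = acc - m if acc >= m else acc
    return acc
```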
Division can be done similarly: to divide x by y, start with 0, and then repeatedly add one to the accumulator until the product of the accumulator and y exceeds x:
This really starts to look like programming! One thing we are currently missing is data structures. While our functions take multiple arguments, they only return one number. But it’s easy enough to pack two numbers into one: to represent an (a, b) pair, we’ll use the number 2^{a} 3^{b}:
To deconstruct such a pair into its first and second components, we need to find the maximum power of 2 or 3 that divides our number. Which is exactly the same shape we used to implement div:
Here again we use the fact that the maximal power of two that divides p is not larger than p itself, so we can over-estimate the number of iterations we’ll need as p.
Using this pair construction, we can finally add a loop counter to our LOOP construct. To track the counter, we pack it as a pair with the accumulator:
And then inside f, we first unpack that pair into accumulator and counter, pass them to actual loop iteration, and then pack the result again, incrementing the counter:
Ok, so we have achieved something remarkable: while writing terminating-by-construction programs, which are definitely not Turing complete, we have constructed basic programming staples, like boolean logic and data structures, and we have also built some rather complicated mathematical functions, like 2^{2^{N}}.
We could try to further enrich our little primitive recursive kingdom by adding more and more functions on an ad hoc basis, but let’s try to be really ambitious and go for the main prize — simulating Turing Machines.
We know that we will fail: Turing machines can enter an infinite loop, but PRFs necessarily terminate. That means that, if a PRF were able to simulate an arbitrary TM, it would have to say, after a certain finite number of steps, that “this TM doesn’t terminate”. And, while we didn’t do this, it’s easy to see that you could simulate in the other direction and implement PRFs on a TM. But that would give us a TM algorithm to decide whether an arbitrary TM halts, which we know doesn’t exist.
So, this is hopeless! But we might still be able to learn something from failing.
Ok! So let’s start with a configuration of a TM, which we somehow need to encode into a single number. First, we need the state variable proper (Q0, Q1, etc), which seems easy enough to represent with a number. Then, we need a tape and a position of the reading head. Recall how we used a pair of stacks to represent exactly the tape and the position. And recall that we can look at a stack of zeros and ones as a number in binary form, where push and pop operations are implemented using %, *, and /, which are exactly the operations we can already do. So, our configuration is just three numbers: (S, stack1, stack2).
And, using the 2^{a}3^{b}5^{c} trick, we can pack this triple into just a single number. But that means we could directly encode a single step of a Turing Machine:
And now we could plug that into our LOOP
to simulate a Turing Machine running for N steps:
The catch of course is that we can’t know an N that’s guaranteed to be enough. But we can have a very good guess! We could do something like this:
That is, run for some large tower of exponents of the initial state. Which would be plenty for normal algorithms, which are usually 2^{N} at worst!
Or, generalizing:
Which is the headline result we have set out to prove!
It might seem that non-termination is the only principal obstacle; that anything that terminates at all has to be implementable as a PRF. Alas, that’s not so. Let’s go and construct a function that a TM can compute, but that is out of reach of PRFs.
We will combine the ideas of the impossibility proofs for FSMs (noting that if a function is computed by some machine, that machine has a specific finite size) and TMs (diagonalization).
So, suppose we have some function f that can’t be computed by a PRF. How would we go about proving that? Well, we’d start with “suppose that we have a PRF P that computes f”. And then we could notice that P would have some finite size. If you look at it abstractly, P is its syntax tree, with lots of LOOP constructs, but it always boils down to some succs and zeros at the leaves. Let’s say that the depth of P is d.
And, actually, if you look at it, there are only a finite number of PRFs with depth at most d. Some of them describe pretty fast growing functions. But probably there’s a limit to how fast a function can grow, given that it is computed by a PRF of depth d. Or, to use a concrete example: we have constructed a PRF of depth 5 that computes two to the power of two to the power of N. Probably if we were smarter, we could have squeezed a couple more levels into that tower of exponents. But intuitively it seems that if you build a tower of, say, 10 exponents, that would grow faster than any PRF of depth 5. And this generalizes — for any fixed depth, there’s a high-enough tower of exponents that grows faster than any PRF with that depth.
So we could conceivably build an f that defeats our d-deep P. But that’s not quite a victory yet: maybe that f is feasible for d+2-deep PRFs! So here we’ll additionally apply diagonalization: for each depth, we’ll build its own depth-specific nemesis f_d. And then we’ll define our overall function as

So, for n large enough it’ll grow faster than a PRF with any fixed depth.
So that’s the general plan; the rest of the proof is basically just calculating an upper bound on the growth of a PRF of depth d.
One technical difficulty here is that PRFs tend to have different arities:
Ideally, we’d use just one upper bound for them all. So we’ll be looking for an upper bound of the following form:
That is: a single unary function A_d that depends only on the depth d, such that any PRF f of depth d, whatever its arity, is bounded by A_d applied to its largest argument.
Let’s start with d=1. We have only primitive functions on this level, succ, zero, and pred, so we could say that
Now, let’s handle an arbitrary other depth d + 1. In that case, our function is non-primitive, so at the root of the syntax tree we have either a composition or a LOOP.
Composition would look like this:
where g and h_n are d deep and the resulting f is d+1 deep. We can immediately estimate the h_n then:
In this somewhat loose notation, args... stands for a tuple of arguments, and maxarg stands for the largest one.
And then we could use the same estimate for g:
This is super high-order, so let’s do a concrete example for a depth-2 two-argument function which starts with a composition:
This sounds legit: if we don’t use LOOP, then f(x, y) is either succ(succ(x)) or succ(succ(y)), so max(x, y) + 2 indeed is the bound!
Ok, now the fun case! If the top-level node is a LOOP, then we have
This sounds complicated to estimate, especially due to that last t(args...) argument, which is the number of iterations. So we’ll be cowards and won’t actually try to estimate this case. Instead, we will require that our PRF is written in a simplified form, where the first and the last arguments to LOOP are simple.
So, if your PRF looks like
you are required to re-write it first as
So now we only have to deal with this:
f has depth d+1, g has depth d.
On the first iteration, we’ll call g(args..., arg), which we can estimate as A_d(maxarg). That is, g does get an extra argument, but it is one of the original arguments of f, and we are looking at the maximum argument anyway, so it doesn’t matter.
On the second iteration, we are going to call g(args..., prev_iteration), which we can estimate as A_d(max(maxarg, prev_iteration)).
Now we plug our estimation for the first iteration:
That is, the estimate for the first iteration is A_d(maxarg). The estimate for the second iteration adds one more layer: A_d(A_d(maxarg)). For the third iteration we’ll get A_d(A_d(A_d(maxarg))).
So the overall thing is going to be smaller than A_d iteratively applied to itself some number of times, where “some number” is one of f’s original arguments. But no harm’s done if we iterate up to maxarg.
As a sanity check, the worst depth-2 function constructed with iteration is probably
which is x + y. And our estimate gives x + 1 applied maxarg times to maxarg, which is 2 * maxarg, which is indeed the correct upper bound!
Combining everything together, we have:
That max there is significant — although it seems like the second line, with maxarg applications, is always going to be longer, maxarg, in fact, could be as small as zero. But we can take maxarg + 2 repetitions to fix this:
So let’s just define A_{d+1}(x) to make that inequality work:
Unpacking:
We define a family of unary functions A_d, such that each A_d “grows faster” than any n-ary PRF of depth d. If f is a ternary PRF of depth 3, then f(1, 92, 10) <= A_3(92).
To evaluate A_d at point x, we use the following recursive procedure:
If d is 1, return x + 1.

Otherwise, evaluate A_{d-1} at point x to get, say, v. Then evaluate A_{d-1} again, at point v this time, yielding u. Then compute A_{d-1}(u). Overall, repeat this process x + 2 times, and return the final number.
We can simplify this a bit if we stop treating d as a kind of function index, and instead say that our A is just a function of two arguments. Then we have the following equations:
The last equation can be re-formatted as
And for non-zero x that is just
So we get the following recursive definition for A(d, x):
As a Python program:
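One way to write it down, following the iterative procedure above directly (the purely equational form would recurse on x instead of looping):

```python
def A(d, x):
    # Depth 1: primitive functions are bounded by the successor.
    if d == 1:
        return x + 1
    # Otherwise, iterate A at the previous depth, starting from x,
    # x + 2 times (the "+ 2" covers the maxarg == 0 corner case).
    v = x
    for _ in range(x + 2):
        v = A(d - 1, v)
    return v

def a(x):
    # The diagonal function that outgrows every PRF.
    return A(x, x)
```

As a sanity check, A(2, x) works out to 2x + 2, matching the depth-2 estimate above.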
It’s easy to see that computing A on a Turing Machine using this definition terminates — this is a function with two arguments, and every recursive call uses a lexicographically smaller pair of arguments. And we constructed A in such a way that A(d, x) as a function of x is larger than any PRF with a single argument of depth d. But that means that the following function of one argument

a(x) = A(x, x)

grows faster than any PRF. And that’s an example of a function which a Turing Machine has no trouble computing (given sufficient time), but which is beyond the capabilities of PRFs.
Remember, this is a three-part post! And we are finally at part 3! So let’s circle back to the practical matters. We have learned that:
Or, more succinctly: there’s no practical difference between a program that doesn’t terminate, and the one that terminates after a billion years. As a practitioner, if you think you need to solve the first problem, you need to solve the second problem as well. And making your programming language non-Turing complete doesn’t really help with this.
And yet, there are a lot of configuration languages out there that use non-Turing completeness as one of their key design goals. Why is that?
I would say that we are never interested in Turing-completeness per se. We usually want some much stronger properties. And yet there’s no convenient catchy name for that bag of features of a good configuration language. So, “non-Turing-complete” gets used as a sort of rallying cry to signal that something is a good configuration language, and maybe sometimes even to justify to others inventing a new language instead of taking something like Lua. That is, the real reason why you want at least a different implementation is all those properties you really need, but they are kinda hard to explain, or at least much harder than “we can’t use Python/Lua/JavaScript because they are Turing-complete”.
So what are the properties of a good configuration language?
First, we need the language to be deterministic. If you launch Python and type id([]), you’ll see some number. If you hit ^C, and then do this again, you’ll see a different number. This is OK for “normal” programming, but is usually anathema for configuration. Configuration is often used as a key in some incremental, caching system, and letting in non-determinism there wreaks absolute chaos!
Second, you need the language to be well-defined. You can compile Python with ASLR disabled, and use some specific allocator, such that id([]) always returns the same result. But that result would be hard to predict! And if someone tries to do an alternative implementation, even if they disable ASLR as well, they are likely to get a different deterministic number! Or the same could happen if you just update the version of Python. So, the semantics of the language should be clearly pinned down by some sort of a reference, such that it is possible to guarantee not only deterministic behavior, but fully identical behavior across different implementations.
Third, you need the language to be pure. If your configuration can access environment variables or read files on disk, then the meaning of the configuration would depend on the environment where the configuration is evaluated, and you again don’t want that, to make caching work.
Fourth, a thing that is closely related to purity is security and sandboxing. The mechanism to achieve both purity and security is the same — you don’t expose general IO to your language. But the purpose is different: purity is about not letting the results be non-deterministic, while security is about not exposing access tokens to the attacker.
And now this gets tricky. One particular possible attack is a denial of service — sending some bad config which makes our system just spin there burning the CPU. Even if you control all IO, you are generally still open to these kinds of attacks. It might be OK to say this is outside of the threat model — that no one would find it valuable enough to just burn your CPU, if they can’t also do IO, and that, even in the event that this happens, there’s going to be some easy mitigation in the form of a higher-level timeout.
But you also might choose to provide some sort of guarantees about execution time, and that’s really hard. Two approaches work. One is to make sure that processing is obviously linear: not just terminating, but actually proportional to the size of inputs, and in a very direct way. If the correspondence is not direct, then it’s highly likely that it is in fact non-linear. The second approach is to ensure metered execution — during processing, decrement a counter for every simple atomic step and terminate processing when the counter reaches zero.
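The metered approach can be sketched with a toy evaluator (the expression language and all names here are made up for illustration):

```python
class OutOfFuel(Exception):
    pass

class Meter:
    def __init__(self, fuel):
        self.fuel = fuel

    def tick(self):
        # Charge one unit of fuel per atomic evaluation step.
        if self.fuel == 0:
            raise OutOfFuel()
        self.fuel -= 1

def eval_expr(expr, meter):
    # Expressions are plain ints or nested ("add", lhs, rhs) tuples.
    meter.tick()
    if isinstance(expr, int):
        return expr
    op, lhs, rhs = expr
    assert op == "add"
    return eval_expr(lhs, meter) + eval_expr(rhs, meter)
```

However deeply nested the input, evaluation performs at most `fuel` steps before giving a result or raising OutOfFuel, so the caller gets a hard bound on execution time.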
Finally one more vague property you’d want from a configuration language is for it to be simple. That is, to ensure that, when people use your language, they write simple programs. It seems to me that this might actually be the case where banning recursion and unbounded loops could help, though I am not sure. As we know from the PRF exercise, this won’t actually prevent people from writing arbitrary recursive programs. It’ll just require some roundabout code to do that. But maybe that’ll be enough of a speedbump to make someone invent a simple solution, instead of brute-forcing the most obvious one?
That’s all for today! Have a great weekend, and remember:
There are a bunch of posts on the internet about the git worktree command. As far as I can tell, most of them are primarily about using worktrees as a replacement of, or a supplement to, git branches. Instead of switching branches, you just change directories. This is also how I originally used worktrees, but that didn’t stick, and I abandoned them. But recently worktrees grew on me, though my new use-case is unlike branching.
If you use worktrees as a replacement for branching, that’s great, no need to change anything! But let me start with explaining why that workflow isn’t for me.
The principal problem with using branches is that it’s hard to context switch in the middle of doing something. You have your branch, your commit, a bunch of changes in the work tree; some of them might be staged and some unstaged. You can’t really tell Git “save all this context and restore it later.” The solution that Git suggests here is stashing, but that’s awkward, as it is too easy to get lost when stashing several things at the same time, and then applying the stash on top of the wrong branch.
Managing Git state became much easier for me when I realized that the staging area and the stash are just bad features, and life is easier if I avoid them. Instead, I just commit whatever and deal with it later. So, when I need to switch a branch in the middle of things, what I do is, basically:
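In terms of concrete commands, roughly (branch names hypothetical):

```shell
# Park everything, staged or not, as a single throwaway commit,
# then switch away with a clean working tree.
git add --all
git commit --message .
git switch other-branch
```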
And, to switch back,
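(again with hypothetical branch names:)

```shell
git switch my-feature
git reset HEAD~   # unpack the throwaway commit back into the working tree
```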
To make this more streamlined, I have a ggc utility which does “commit all with a trivial message” atomically.
And I don’t always reset HEAD~ — I usually just continue hacking with . in my Git log, and then amend the commit once I am satisfied with the subset of changes.
So that’s how I deal with switching branches. But why worktrees then?
It’s a bit hard to describe, but:
Specifically:
The main worktree is a readonly worktree that contains a recent snapshot of the remote main branch. I use this tree to compare the code I am currently working on and/or reviewing with the master version (this includes things like “how long the build takes”, “what is the behavior of this test” and the like, so not just the actual source code).
The work worktree, where I write most of the code. I often need to write new code and compare it with old code at the same time, but I can’t actually work on two different things in parallel. That’s why main and work are different worktrees, but work also constantly switches branches.
The review worktree, where I checkout code for code review. While I can’t review code and write code at the same moment, there is usually one thing I am implementing and one thing I am reviewing, and the review and the implementation proceed concurrently.
Then, there’s the fuzz tree, where I run long-running fuzzing jobs for the code I am actively working on. My overall idealized feature workflow looks like this:
This is again concurrent: I can hack on the branch while the fuzzer tests the “same” code. Note that it is crucial that the fuzzing tree operates in the detached head state (the -d flag for git switch). In general, -d is very helpful with this style of worktree work. I am also sympathetic to the argument that, like the staging area and the stash, Git branches are a misfeature, but I haven’t taken the plunge personally yet.
Finally, the last tree I have is scratch – this is a tree for arbitrary random things I need to do while working on something else. For example, if I am working on matklad/my-feature in work, and reviewing #6292 in review, and, while reviewing, notice a tiny unrelated typo, the PR for that typo is quickly prepped in the scratch worktree:
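Something along these lines (the branch and remote names below are made up):

```shell
cd ../scratch
git switch --detach origin/main   # detached head, no local branch needed
# ...fix the typo, then:
git add --all
git commit --message "Fix typo"
git push origin HEAD:refs/heads/typo-fix
```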
TL;DR: consider using worktrees not as a replacement for branches, but as a means to manage concurrency in your tasks. My level of concurrency is:
main for looking at the pristine code,
work for looking at my code,
review for looking at someone else’s code,
fuzz for my computer to look at my code,
scratch for everything else!