Primitive Recursive Functions For A Working Programmer
Programmers on the internet often use “Turing-completeness” terminology. Typically, not being Turing-complete is extolled as a virtue or even a requirement in specific domains. I claim that most such discussions are misinformed — that not being Turing complete doesn’t actually mean what folks want it to mean, and is instead a stand-in for a bunch of different practically useful properties, which are mostly orthogonal to actual Turing completeness.
While I am generally descriptivist in nature and am ok with words losing their original meaning as long as the new meaning is sufficiently commonly understood, Turing completeness is a hill I will die on. It is a term from math, it has a very specific meaning, and you are not allowed to re-purpose it for anything else, sorry!
I understand why this happens: to really understand what Turing completeness is and is not you need to know one (simple!) theoretical result about so-called primitive recursive functions. And, although this result is simple, I was only made aware of it in a fairly advanced course during my masters. That’s the CS education deficiency I want to rectify — you can’t teach students the halting problem without also teaching them about primitive recursion!
The post is going to be rather meaty, and will be split in three parts:
In Part I, I give a TL;DR for the theoretical result and some of its consequences. Part II is going to be a whirlwind tour of Turing Machines, Finite State Automata and Primitive Recursive Functions. And then Part III will circle back to practical matters.
If math makes you slightly nauseous, you might want to skip Part II. But maybe give it a try? The math we’ll need will be baby math from first principles, without reference to any advanced results.
Part I: TL;DR
Here’s the key result — suppose you have a program in some Turing complete language, and you also know that it’s not too slow. Suppose it runs faster than O(2^(2^N)). That is, two to the power of two to the power of N, a very large number. In this case, you can implement this algorithm in a non-Turing complete language.
Most practical problems fall into this “faster than two to the power of two to the power of N” space. Hence it follows that you don’t need the full power of a Turing Machine to tackle them. Hence, a language not being Turing complete doesn’t in any way restrict you in practice, or give you extra powers to control the computation.
Or, to restate this: in practice, a program which doesn’t terminate, and a program that needs a billion billion steps to terminate are equivalent. Making something non-Turing complete by itself doesn’t help with the second problem in any way. And there’s a trivial approach that solves the first problem for any existing Turing-complete language — in the implementation, count the steps and bail with an error after a billion.
Part II: Weird Machines
The actual theoretical result is quite a bit more general than that. It is (unsurprisingly) recursive: if a Turing Machine’s running time is bounded by some primitive recursive function of its input, then the machine itself can be implemented as a primitive recursive function.
It is expected that this sounds like gibberish at this point! So let’s just go and prove this thing, right here in this blog post! We’ll work our way up slowly towards this result. The plan is as follows:
- First, to brush up notation, we’ll define Finite State Machines.
- Second, we’ll turn our humble Finite State Machine into the all-powerful Turing Machine (spoiler — a Turing Machine is an FSM with a pair of stacks), and, as is customary, wave our hands about the Universal Turing Machine.
- Third, we leave the cozy world of imperative programming and define primitive recursive functions.
- Finally, we’ll talk about the relative computational power of TMs and PRFs, including the result teased above, and more!
Finite State Machines
Finite State Machines are simple! An FSM takes a string as input, and returns a binary answer, “yes” or “no”. Unsurprisingly an FSM has a finite number of states: Q0, Q1, …, Qn. A subset of states are designated as “yes” states, the rest are “no” states. There’s also one specific starting state.
The behavior of the state machine is guided by a transition (step) function, `s`. This function takes the current state of the FSM and the next symbol of input, and returns a new state.
The semantics of an FSM is determined by repeatedly applying the single-step function to all symbols of the input, and noting whether the final state is a “yes” state or a “no” state.
Here’s an FSM which accepts only strings of zeros and ones of even length:
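Here is a minimal sketch of that machine in Python; the transition table and the state names are illustrative:

```python
# A sketch of the even-length FSM. Q0 is the start state and the only
# "yes" state; consuming any symbol flips us between Q0 and Q1.
TRANSITIONS = {
    ("Q0", "0"): "Q1", ("Q0", "1"): "Q1",
    ("Q1", "0"): "Q0", ("Q1", "1"): "Q0",
}

def accepts(input_string: str) -> bool:
    state = "Q0"
    for symbol in input_string:
        state = TRANSITIONS[(state, symbol)]
    return state == "Q0"

assert accepts("") and accepts("0110") and not accepts("011")
```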
This machine ping-pongs between states Q0 and Q1, and ends up in Q0 only for inputs of even length (including an empty input).
What can FSMs do? As they give a binary answer, they are recognizers — they don’t compute functions, but rather just characterize certain sets of strings. A famous result is that the expressive power of FSMs is equivalent to the expressive power of regular expressions. If you can write a regular expression for it, you could also do an FSM!
There are also certain things that state machines can’t do. For example, they can’t enter an infinite loop. Any FSM is linear in the input size and always terminates. But there are much more specific sets of strings that can’t be recognized by an FSM. Consider this set:

{ 1, 010, 00100, 0001000, … }

That is, an infinite set which contains ‘1’s surrounded by an equal number of ‘0’s on both sides. Let’s prove that there isn’t a state machine that recognizes this set!
As usual, suppose there is such a state machine. It has a certain number of states — maybe a dozen, maybe a hundred, maybe a thousand, maybe even more. But let’s say fewer than a million. Then, let’s take a string which looks like a million zeros, followed by a one, followed by a million zeros. And let’s observe our FSM eating this particular string.
First of all, because the string is in fact a one surrounded by the equal number of zeros on both sides, the FSM ends up in a “yes” state. Moreover, because the length of the string is much greater than the number of states in the state machine, the state machine necessarily visits some state twice. There is a cycle, where the machine goes from A to B to C to D and back to A. This cycle might be pretty long, but it’s definitely shorter than the total number of states we have.
And now we can fool the state machine. Let’s make it eat our string again, but this time, once it completes the ABCDA cycle, we’ll force it to traverse this cycle again. The original cycle corresponds to some contiguous portion of our giant string. If we duplicate this portion, our string will no longer be a one surrounded by an equal number of zeros, but the state machine will still end up in the “yes” state. Which is a contradiction that completes the proof.
Turing Machine: Definition
A Turing Machine is only slightly more complex than an FSM. Like an FSM, a TM has a bunch of states and a single-step transition function. While an FSM has an immutable input which is being fed to it symbol by symbol, a TM operates with a mutable tape. The input gets written to the tape at the start. At each step, a TM looks at the current symbol on the tape, changes its state according to a transition function and, additionally:
- Replaces the current symbol with a new one (which might or might not be different).
- Moves the reading head that points at the current symbol one position to the left or to the right.
When a machine reaches a designated halt state, it stops, and whatever is written on the tape at that moment is the result. That is, while FSMs are binary recognizers, TMs are functions. Keep in mind that a TM does not necessarily stop. It might be the case that a TM goes back and forth over the tape, overwrites it, changes its internal state, but never quite gets to the final state.
Here’s an example Turing Machine:
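Its transition function might contain rules like these (a hypothetical fragment; only the first rule is used in the walkthrough below):

```
s A 0 = (B, 1, Right)     -- in state A over a 0: write 1, move right, go to B
s A 1 = (A, 1, Right)     -- in state A over a 1: leave it, keep scanning right
s B 0 = (HALT, 0, Right)  -- ...and so on, one rule per (state, symbol) pair
```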
Suppose the machine is in state A, and the head is over a 0. Then we are in the `s A 0 = (B, 1, Right)` case, so we change the state to B, replace the 0 with a 1, and move the head one position to the right.
Turing Machine: Programming
There are a bunch of fiddly details to Turing Machines!
The tape is conceptually infinite, so beyond the input, everything is just zeros. This creates a problem: it might be hard to say where the input (or the output) ends! There are a couple of technical solutions here. One is to say that there are three different symbols on the tape — zeros, ones, and blanks, and require that the tape is initialized with blanks. A different solution is to invent some encoding scheme. For example, we can say that the input is a sequence of 8-bit bytes, without interior null bytes. So, eight consecutive zeros at a byte boundary designate the end of input/output.
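As a quick sketch of this byte scheme (the helper name is illustrative), encoding appends a terminating null byte and flattens everything to bits:

```python
def encode(data: bytes) -> list[int]:
    # The input must not contain interior null bytes, so that eight
    # consecutive zeros at a byte boundary unambiguously mark the end.
    assert 0 not in data
    bits = [int(bit) for byte in data for bit in f"{byte:08b}"]
    return bits + [0] * 8  # the terminating null byte
```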
It’s useful to think about how this byte-oriented TM could be implemented. We could have one large state for each byte of input. So, Q142 would mean that the head is on the byte with value 142. And then we’ll have a bunch of small states to read out the current byte. E.g., we start reading a byte in state `S`. Depending on the next bit, we move to `S0` or `S1`, then to `S00` or `S01`, etc. Once we’ve reached something like `S01111001`, we move back 8 positions and enter state Q121. This is one of the patterns of Turing Machine programming — while your main memory is the tape, you can represent some constant amount of memory directly in the states.
What we’ve done here is essentially lowering a byte-oriented Turing Machine to a bit-oriented machine. So, we could think only in terms of big states operating on bytes, as we know the general pattern for converting that to direct bit-twiddling.
With this encoding scheme in place, we can now feed arbitrary files to a Turing Machine! Which will be handy for the next observation:
You can’t actually program a Turing Machine. What I mean is that, counter-intuitively, there isn’t some user-supplied program that a Turing Machine executes. Rather, the program is hard-wired into the machine. The transition function is the program.
But with some ingenuity we can regain our ability to write programs. Recall that we’ve just learned to feed arbitrary files to a TM. So what we could do is write a text file that specifies a TM and its input, and then feed that entire file as input to an “interpreter” Turing Machine, which would read the file and act as the machine specified there. A Turing Machine can have an `eval` function.
Is such an “interpreter” Turing Machine possible? Yes! And it is not hard: if you spend a couple of hours programming Turing Machines by hand, you’ll see that you pretty much can do anything — you can do numbers, arithmetic, loops, control flow. It’s just very very tedious.
So let’s just declare that we’ve actually coded up this Universal Turing Machine which simulates a TM given to it as an input in a particular encoding.
This sort of construct also gives rise to the Church-Turing thesis. We have a TM which can run other TMs. And you can implement a TM interpreter in something like Python. And, with a bit of legwork, you could also implement a Python interpreter as a TM (you likely want to avoid doing that directly, and instead do a simpler interpreter for WASM, and then use a Python interpreter compiled to WASM). This sort of bidirectional interpretation shows that Python and TMs have equivalent computing power. Moreover, it’s quite hard to come up with a reasonable computational device which is more powerful than a Turing Machine.
There are computational devices that are strictly weaker than TMs, though. Recall FSMs. By this point, it should be obvious that a TM can simulate an FSM. Everything a Finite State Machine can do, a Turing Machine can do as well. And it should be intuitively clear that a TM is more powerful than an FSM. An FSM gets to use only a finite number of states. A TM has these same states, but it also possesses a tape, which serves as an infinitely sized external memory.
Directly proving that you can’t encode a Universal Turing Machine as an FSM sounds complicated, so let’s prove something simpler. Recall that we have established that there’s no FSM that accepts only ones surrounded by an equal number of zeros on both sides (because a sufficiently large word of this form would necessarily enter a cycle in the state machine, which could then be further pumped). But it’s actually easy to write a Turing Machine that does this:
- Erase zero (at the left side of the tape)
- Go to the right end of the tape
- Erase zero
- Go to the left side of the tape
- Repeat
- If what’s left is a single `1`, the answer is “yes”; otherwise, it is “no”.
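In Python, simulating the tape directly, the strategy looks something like this (a sketch of the loop, not literal TM rules):

```python
def accepts(tape: list[str]) -> bool:
    # Repeatedly erase one zero from each end of the tape.
    while tape and tape[0] == "0" and tape[-1] == "0":
        tape = tape[1:-1]
    return tape == ["1"]

assert accepts(list("0001000")) and accepts(list("1")) and not accepts(list("0100"))
```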
We found a specific problem that can be solved by a TM, but is out of reach of any FSM. So it necessarily follows that there isn’t an FSM that can simulate an arbitrary TM.
It is also useful to take a closer look at the tape. It is a convenient skeuomorphic abstraction which makes the behavior of the machine intuitive, but it is inconvenient to implement in a normal programming language. There isn’t a standard data structure that behaves just like a tape.
One cool practical trick is to represent the tape as a pair of stacks: everything to the left of the head is one stack, and everything to the right, reversed, is the other. Moving the reading head left or right then corresponds to popping a value off one stack and pushing it onto the other.
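A sketch of this representation in Python, with one possible convention: the symbol under the head sits on top of the right stack.

```python
# left:  everything before the head; top of the stack is closest to the head.
# right: the head symbol and everything after it, reversed.
def move_right(left: list[int], right: list[int]) -> None:
    left.append(right.pop())
    if not right:
        right.append(0)  # the tape is all zeros beyond the input

def move_left(left: list[int], right: list[int]) -> None:
    right.append(left.pop() if left else 0)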
So, an equivalent-in-power definition would be to say that a TM is an FSM endowed with two stacks.
This of course creates an obvious question: is an FSM with just one stack a thing? Yes! It would be called a pushdown automaton, and it would correspond to context-free languages. But that’s beyond the scope of this post!
There’s yet another way to look at the tape, or the pair of stacks, if the set of symbols is 0 and 1. You could say that a stack is just a number! So, something like `[1, 0, 1, 1]` will be `1 + 2 + 8 = 11`. Looking at the top of the stack is `stack % 2`, removing an item from the stack is `stack / 2`, and pushing `x` onto the stack is `stack * 2 + x`. We won’t need this right now, so just hold onto this for a brief moment.
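Spelled out as a Python sketch:

```python
def top(stack: int) -> int:
    return stack % 2

def pop(stack: int) -> int:
    return stack // 2

def push(stack: int, x: int) -> int:
    return stack * 2 + x

assert push(push(0, 1), 0) == 2  # the stack [1, 0], with 0 on top
```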
Turing Machine: Limits
Ok, so we have some idea about the lower bound for the power of a Turing Machine — FSMs are strictly less expressive. What about the opposite direction? Is there some computation that a Turing Machine is incapable of doing?
Yes! Let’s construct a function which maps natural numbers to natural numbers, which can’t be implemented by a Turing Machine. Recall that we can encode an arbitrary Turing Machine as text. That means that we can actually enumerate all possible Turing Machines, and write them out in a giant line, from the simplest Turing Machine to more complex ones: TM0, TM1, TM2, and so on. This is, of course, going to be an infinite list.
Now, let’s see how TM0 behaves on input `0`: it either prints something, or doesn’t terminate. Then, note how TM1 behaves on input `1`, and, generalizing, create a function `f` that behaves as the nth TM on input `n`. It might look something like this:
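For instance (the values here are purely illustrative; ⊥ means that machine never terminates on that input):

```
 n    | 0 | 1 | 2 | 3 | 4 | ...
 f(n) | 0 | 7 | ⊥ | 0 | 2 | ...
```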
Now, let’s construct a function `g` which is maximally diffed from `f`: where `f` gives `0`, `g` will return `1`, and it will return `0` in all other cases:
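Continuing the illustrative table above:

```
 n    | 0 | 1 | 2 | 3 | 4 | ...
 g(n) | 1 | 0 | 0 | 1 | 0 | ...
```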
There isn’t a Turing Machine that computes `g`. For suppose there is. Then it exists somewhere in our list of all Turing Machines. Let’s say it is TM1000064. So, if we feed `0` to it, it will return `g(0)`, which is `1`, which is different from `f(0)`. And the same holds for `1`, and `2`, and `3`. But once we get to `g(1000064)`, we are in trouble, because, by the definition of `g`, `g(1000064)` is different from what is computed by TM1000064! So such a machine is impossible.
The math-savvy among us might express this more succinctly — there’s a countably infinite number of Turing Machines, and an uncountably infinite number of functions. So there must be some functions which do not have a corresponding Turing Machine. It is the same proof — the diagonalization argument is hiding in the claim that the set of all functions is uncountable.
But this is super weird and abstract. Let’s rather come up with some very specific problem which isn’t solvable by a Turing Machine. The halting problem: given source code for a Turing Machine and its input, determine if the machine halts on this input eventually.
As we have waved our hands sufficiently vigorously to establish that Python and Turing Machines have equivalent computational power, I am going to try to solve this in Python:
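The signature would look something like this (hypothetical, of course — we are about to prove that no such implementation exists):

```python
def halts(program: str, program_input: str) -> bool:
    """Return True if `program` eventually terminates on `program_input`,
    and False if it loops forever."""
    ...  # magic goes here
```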
Now, I will do a weird thing and start asking whether a program terminates, if it is fed its own source code, in a reverse-quine of sorts:
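```python
def halts_on_self(program: str) -> bool:
    return halts(program, program)
```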
and finally I construct this weird beast of a program:
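```python
def weird(program_input: str) -> None:
    if halts_on_self(program_input):
        while True:
            pass  # loop forever
```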
To make this even worse, I’ll feed the text of this `weird` program to itself. Does it terminate with this input? Well, if it terminates, and if our `halts` function is implemented correctly, then the `halts_on_self(program_input)` invocation above returns `True`. But then we enter the infinite loop and don’t actually terminate.

Hence, it must be the case that `weird` does not terminate when self-applied. But then `halts_on_self` returns `False`, and it should terminate. So we get a contradiction both ways. Which necessarily means that either our `halts` sometimes returns a straight-up incorrect answer, or that it sometimes does not terminate.
So this is the flip side of a Turing Machine’s power — it is so powerful that it becomes impossible to tell whether it’ll terminate or not!
It actually gets much worse, because this result can be generalized to an unreasonable degree! In general, there’s very little we can say about arbitrary programs.
We can easily check syntactic properties (is the program text shorter than 4 kilobytes?), but they are, in some sense, not very interesting, as they depend a lot on how exactly one writes a program. It would be much more interesting to check some refactoring-invariant properties, which hold when you change the text of the program, but leave the behavior intact. Indeed, “does this change preserve behavior?” would be one very useful property to check!
So let’s define two TMs to be equivalent, if they have identical behavior. That is, for each specific input, either both machines don’t terminate, or they both halt, and give identical results.
Then, our refactoring-invariant properties are, by definition, properties that hold (or do not hold) for entire equivalence classes of TMs.
And a somewhat depressing result here is that there are no non-trivial refactoring-invariant properties that you can algorithmically check.
Suppose we have some magic TM, called P, which checks such a property. Let’s show that, using P, we can solve the problem we know we cannot solve — the halting problem.
Consider a Turing Machine that is just an infinite loop and never terminates, M1. P might or might not hold for it. But, because P is non-trivial (it holds for some machines and doesn’t hold for others), there’s some different machine M2 which differs from M1 with respect to P. That is, `P(M1) xor P(M2)` holds.
Let’s use these M1 and M2 to figure out whether a given machine M halts on input I. Using the Universal Turing Machine (the interpreter), we can construct a new machine, M12, that just runs M on input I, then erases the contents of the tape and runs M2. Now, if M halts on I, then the resulting machine M12 is behaviorally equivalent to M2. If M doesn’t halt on I, then the result is equivalent to the infinite-loop program, M1. Or, in pseudo-code (a sketch; `run`, `erase_tape`, and `P` stand for the notional pieces above):
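```python
# Pseudo-code: deciding the halting problem with the property checker P.
def halts(M, I):
    def M12():
        run(M, I)    # if M loops on I, we never get past this line
        erase_tape()
        run(M2)      # otherwise, M12 behaves exactly like M2
    return P(M12) == P(M2)
```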
This is pretty bad and depressing — we can’t learn anything meaningful about an arbitrary Turing Machine! So let’s finally get to the actual topic of today’s post:
Primitive Recursive Functions
This is going to be another computational device, like FSMs and TMs. Like an FSM, it’s going to be a nice, always terminating, non-Turing complete device. But it will turn out to have quite a bit of the power of a full Turing Machine!
However, unlike both TMs and FSMs, Primitive Recursive Functions are defined directly as functions which take a tuple of natural numbers and return a natural number. The two simplest ones are `zero` (that is, a zero-arity function that returns `0`) and `succ` — a unary function that just adds 1. Everything else is going to get constructed out of these two:
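```python
# The two basis functions, modeled in Python:
def zero() -> int:
    return 0

def succ(x: int) -> int:
    return x + 1
```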
One way we are allowed to combine these functions is by composition. So we can get all the constants right off the bat:
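```python
# Constants by composition:
def one() -> int:
    return succ(zero())

def two() -> int:
    return succ(one())

def three() -> int:
    return succ(two())
```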
We aren’t going to be allowed to use general recursion (because it can trivially non-terminate), but we do get to use a restricted form of a C-style loop. It is a bit fiddly to define formally! The overall shape is `LOOP(init, f, n)`.
Here, `init` and `n` are numbers — the initial value of the accumulator and the total number of iterations. The `f` is a unary function that specifies the loop body — it takes the current value of the accumulator and returns the new value. So:
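```python
# The semantics of LOOP, modeled in Python. (In the real thing, LOOP is a
# builtin operator, not a function; the host-language `for` is only a model.)
def loop(init: int, f, n: int) -> int:
    acc = init
    for _ in range(n):  # n is fixed before the first iteration
        acc = f(acc)
    return acc
```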
While this is similar to a C-style loop, the crucial difference here is that the total number of iterations `n` is fixed up-front. There’s no way to mutate the loop counter in the loop body.
This allows us to define addition:
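```python
# add(x, y): start from x and apply succ, y times over.
def add(x: int, y: int) -> int:
    return loop(x, succ, y)

assert add(2, 3) == 5
```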
Multiplication is trickier. Conceptually, to multiply `x` and `y`, we want to `LOOP` from zero, and repeat “add `x`” `y` times. The problem here is that we can’t write an “add `x`” function yet!
One way around this is to define `LOOP` as a family of operators, which can pass extra arguments to the iteration function. That is, `LOOP_N` takes `n` extra arguments, and passes them through to any invocation of the body function. To express this idea a little bit more succinctly, let’s just allow partially applying the second argument of `LOOP`. That is:
- All our functions are going to be first order. All arguments are numbers, the result is a number. There are no higher-order functions, and there are no closures.
- The `LOOP` is not a function in our language — it’s a builtin operator, a keyword. So, for convenience, we allow passing partially applied functions to it. But semantically this is equivalent to just passing in extra arguments on each iteration.
Which finally allows us to write multiplication (below, a Python lambda stands in for the partial application):
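```python
# mul(x, y): start from zero and add x, y times over.
def mul(x: int, y: int) -> int:
    return loop(zero(), lambda acc: add(x, acc), y)

assert mul(3, 4) == 12
```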
Ok, so that’s progress — we made something as complicated as multiplication, and we still are in the guaranteed-to-terminate land. Because each loop has a fixed number of iterations, everything eventually finishes.
We can go on and define x^y:
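```python
# pow_(x, y): start from one and multiply by x, y times over.
# (The underscore avoids shadowing Python's builtin pow.)
def pow_(x: int, y: int) -> int:
    return loop(one(), lambda acc: mul(x, acc), y)

assert pow_(2, 5) == 32
```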
And this in turn allows us to define a couple of concerningly fast-growing functions:
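```python
def pow2(x: int) -> int:
    return pow_(two(), x)   # 2^x

def pow2pow2(x: int) -> int:
    return pow2(pow2(x))    # 2^(2^x)
```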
That’s fun, but to do some programming, we’ll need an `if`. We’ll get to it, but first we’ll need some boolean operations. We can encode `false` as `0` and `true` as `1`. Then:
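```python
# With this encoding, `and` is just multiplication:
def and_(x: int, y: int) -> int:
    return mul(x, y)
```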
But `or` creates a problem: we’ll need subtraction.
Defining `sub` is tricky, due to two problems:
First, we only have natural numbers, no negatives. This one is easy to solve — we’ll just define subtraction to saturate.
The second problem is more severe — I think we actually can’t express subtraction given the set of allowable operations so far. That is because all our operations are monotonic — the result is never less than the arguments. One way to solve this problem is to define the `LOOP` in such a way that the body function also gets passed a second argument — the current iteration. So, if you iterate up to `n`, the last iteration will observe `n - 1`, and that would be the non-monotonic operation that creates subtraction. But that seems somewhat inelegant to me, so instead I will just add a `pred` function to the basis, and use that to add loop counters to our iterations.
Now we can say:
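```python
# pred is the new basis function; sub then saturates at zero.
def pred(x: int) -> int:
    return x - 1 if x > 0 else 0

def sub(x: int, y: int) -> int:
    return loop(x, pred, y)

assert sub(5, 2) == 3 and sub(2, 5) == 0
```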
And now we can do a bunch of comparison operators:
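```python
def is_zero(x: int) -> int:
    return sub(one(), x)        # 1 if x == 0, else 0

def le(x: int, y: int) -> int:
    return is_zero(sub(x, y))   # x <= y

def eq(x: int, y: int) -> int:
    return and_(le(x, y), le(y, x))
```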
With that, we can implement modulus. To compute `x % m`, we will start with `x`, and will be subtracting `m` until we get a number smaller than `m`. We’ll need at most `x` iterations for that. In pseudo-code:
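```python
def mod(x, m):
    acc = x
    for _ in range(x):  # x iterations are always enough
        if acc >= m:
            acc = acc - m
    return acc
```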
And as a bona fide PRF:
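```python
# The arithmetic `if`: if_(c, a, b) is a when c == 1 and b when c == 0.
def if_(c: int, a: int, b: int) -> int:
    return add(mul(c, a), mul(is_zero(c), b))

def mod(x: int, m: int) -> int:
    return loop(x, lambda acc: if_(le(m, acc), sub(acc, m), acc), x)

assert mod(7, 3) == 1
```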
That’s a curious structure — rather than computing the modulo directly, we essentially search for it by trial and error, relying on the fact that the search has a clear upper bound.
Division can be done similarly: to divide x by y, start with 0, and then repeatedly add one to the accumulator until the product of the accumulator and y exceeds x:
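```python
# div(x, y): bump the accumulator while (acc + 1) * y still fits into x.
def div(x: int, y: int) -> int:
    step = lambda acc: if_(le(mul(succ(acc), y), x), succ(acc), acc)
    return loop(zero(), step, x)

assert div(7, 3) == 2
```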
This really starts to look like programming! One thing we are currently missing is data structures. While our functions take multiple arguments, they only return one number. But it’s easy enough to pack two numbers into one: to represent an `(a, b)` pair, we’ll use the number 2^a * 3^b:
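```python
def pair(a: int, b: int) -> int:
    return mul(pow_(two(), a), pow_(three(), b))

assert pair(2, 1) == 12  # 2^2 * 3^1
```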
To deconstruct such a pair into its first and second components, we need to find the maximum power of 2 or 3 that divides our number. Which is exactly the same shape we used to implement `div`:
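```python
# fst(p): the largest n such that 2^n divides p; snd is the same with 3.
def divides(d: int, p: int) -> int:
    return is_zero(mod(p, d))

def fst(p: int) -> int:
    step = lambda acc: if_(divides(pow_(two(), succ(acc)), p), succ(acc), acc)
    return loop(zero(), step, p)

def snd(p: int) -> int:
    step = lambda acc: if_(divides(pow_(three(), succ(acc)), p), succ(acc), acc)
    return loop(zero(), step, p)

assert fst(pair(2, 1)) == 2 and snd(pair(2, 1)) == 1
```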
Here again we use the fact that the maximal power of two that divides `p` is not larger than `p` itself, so we can over-estimate the number of iterations we’ll need as `p`.
Using this pair construction, we can finally add a loop counter to our `LOOP` construct. To track the counter, we pack it as a pair with the accumulator:
And then, inside `f`, we first unpack that pair into the accumulator and the counter, pass them to the actual loop iteration, and then pack the result again, incrementing the counter:
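```python
def loop_with_counter(init: int, f, n: int) -> int:
    def body(p: int) -> int:
        acc, i = fst(p), snd(p)          # unpack the pair
        return pair(f(acc, i), succ(i))  # iterate, then bump the counter
    return fst(loop(pair(init, zero()), body, n))
```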
Ok, so we have achieved something remarkable: while we are writing terminating-by-construction programs, which are definitely not Turing complete, we have constructed basic programming staples, like boolean logic and data structures, and we have also built some rather complicated mathematical functions, like 2^(2^N).
We could try to further enrich our little primitive recursive kingdom by adding more and more functions on an ad hoc basis, but let’s try to be really ambitious and go for the main prize — simulating Turing Machines.
We know that we will fail: Turing Machines can enter an infinite loop, but PRFs necessarily terminate. That means that, if a PRF were able to simulate an arbitrary TM, it would have to say, after a certain finite number of steps, that “this TM doesn’t terminate”. And, while we didn’t do this, it’s easy to see that the simulation can also go the other way around: a TM can implement PRFs. But then we would have a TM algorithm that decides whether an arbitrary TM halts, which we know doesn’t exist.
So, this is hopeless! But we might still be able to learn something from failing.
Ok! So let’s start with a configuration of a TM, which we somehow need to encode into a single number. First, we need the state variable proper (Q0, Q1, etc.), which seems easy enough to represent with a number. Then, we need a tape and a position of the reading head. Recall how we used a pair of stacks to represent exactly the tape and the position. And recall that we can look at a stack of zeros and ones as a number in binary form, where push and pop operations are implemented using `%`, `*`, and `/` — exactly the operations we already can do. So, our configuration is just three numbers: `(S, stack1, stack2)`.
And, using the 2^a * 3^b * 5^c trick, we can pack this triple into just a single number. But that means we can directly encode a single step of a Turing Machine (sketched below in plain Python; `DELTA` stands for the machine’s hard-wired transition table, and every operation used is one we already know how to express as a PRF):
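```python
def max_power(d: int, p: int) -> int:
    # The largest n such that d^n divides p (for p >= 1).
    n = 0
    while p % d == 0:
        p, n = p // d, n + 1
    return n

# One step on a configuration packed as 2^state * 3^stack1 * 5^stack2.
# The symbol under the head is the top of stack2; the branch and the
# table lookup are finite case analysis, expressible with the arithmetic
# if_ from earlier.
def tm_step(config: int) -> int:
    state = max_power(2, config)
    s1, s2 = max_power(3, config), max_power(5, config)
    new_state, new_symbol, move = DELTA[(state, s2 % 2)]
    s2 = (s2 // 2) * 2 + new_symbol         # overwrite the head symbol
    if move == "R":
        s1, s2 = s1 * 2 + s2 % 2, s2 // 2   # push onto left, pop from right
    else:
        s2, s1 = s2 * 2 + s1 % 2, s1 // 2   # push onto right, pop from left
    return 2**new_state * 3**s1 * 5**s2
```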
And now we can plug that into our `LOOP` to simulate a Turing Machine running for N steps:
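```python
# Simulating n steps is just our LOOP applied to the packed configuration:
def tm_run(config: int, n: int) -> int:
    return loop(config, tm_step, n)
```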
The catch, of course, is that we can’t know an `N` that’s going to be enough. But we can make a very good guess! We could do something like this:
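```python
# Guess the step budget as a tower of exponents of the starting configuration:
def tm_run_bounded(config: int) -> int:
    return tm_run(config, pow2(pow2(pow2(config))))  # 2^(2^(2^config)) steps
```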
That is, run for some large tower of exponents of the initial state. Which would be plenty for normal algorithms, which are usually 2^N at worst!
Or, generalizing:
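```python
# For any fixed primitive recursive function `bound`, running the machine
# for bound(config) steps is itself primitive recursive:
def tm_run_prf(config: int) -> int:
    return tm_run(config, bound(config))  # `bound` is any PRF of our choosing
```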
Which is the headline result we have set out to prove!
Primitive Recursive Functions: Limits
It might seem that non-termination is the only principal obstacle — that anything that terminates at all has to be implementable as a PRF. Alas, that’s not so. Let’s go and construct a function that is computable by a TM, but is out of reach of PRFs.
We will combine the ideas of the impossibility proofs for FSMs (noting that if a function is computed by some machine, that machine has a specific finite size) and TMs (diagonalization).
So, suppose we have some function `f` that can’t be computed by a PRF. How would we go about proving that? Well, we’d start with “suppose that we have a PRF P that computes `f`”. And then we could notice that P would have some finite size. If you look at it abstractly, P is its syntax tree, with lots of `LOOP` constructs, but it always boils down to some `succ`s and `zero`s at the leaves. Let’s say that the depth of P is `d`.
And, actually, if you look at it, there are only a finite number of PRFs with depth at most `d`. Some of them describe pretty fast-growing functions. But probably there’s a limit to how fast a function can grow, given that it is computed by a PRF of depth `d`. Or, to use a concrete example: we have constructed a PRF of depth 5 that computes two to the power of two to the power of N. Probably, if we were smarter, we could have squeezed a couple more levels into that tower of exponents. But intuitively it seems that if you build a tower of, say, 10 exponents, that would grow faster than any PRF of depth `5`. And this generalizes — for any fixed depth, there’s a high-enough tower of exponents that grows faster than any PRF with that depth.
So we could conceivably build an `f` that defeats our `d`-deep P. But that’s not quite a victory yet: maybe that `f` is feasible for `d+2`-deep PRFs! So here we’ll additionally apply diagonalization: for each depth, we’ll build its own depth-specific nemesis `f_d`. And then we’ll define our overall function as `f(n) = f_n(n)`, so, for `n` large enough, it’ll grow faster than any PRF with any fixed depth.
So that’s the general plan; the rest of the work is basically just calculating the upper bound on the growth of a PRF of depth `d`.
One technical difficulty here is that PRFs tend to have different arities: `succ` is unary, `add` is binary, and so on. Ideally, we’d use just one upper bound for them all. So we’ll be looking for an upper bound of the following form: `f(args...) <= A_d(maxarg)`. That is:
- Compute the depth of `f`, `d`.
- Compute the largest of its arguments, `maxarg`.
- And plug that into the unary function for depth `d`.
Let’s start with `d=1`. We have only primitive functions on this level — `succ`, `zero`, and `pred` — so we could say that `A_1(x) = x + 1`.
Now, let’s handle an arbitrary other depth `d + 1`. In that case, our function is non-primitive, so at the root of the syntax tree we have either a composition or a `LOOP`.
Composition would look like this: `f(args...) = g(h_1(args...), ..., h_n(args...))`, where `g` and the `h_n` are `d` deep, and the resulting `f` is `d+1` deep. We can immediately estimate the `h_n` then: `h_n(args...) <= A_d(maxarg)`.
In this somewhat loose notation, `args...` stands for a tuple of arguments, and `maxarg` stands for the largest one.
And then we can use the same estimate for `g`: `f(args...) <= A_d(A_d(maxarg))`.
This is super high-order, so let’s do a concrete example for a depth-2 two-argument function which starts with a composition: `f(x, y) <= A_1(A_1(max(x, y))) = max(x, y) + 2`. This sounds legit: if we don’t use `LOOP`, then `f(x, y)` is either `succ(succ(x))` or `succ(succ(y))`, so `max(x, y) + 2` indeed is the bound!
Ok, now the fun case! If the top-level node is a `LOOP`, then we have `f(args...) = LOOP(init(args...), g, t(args...))`.
This sounds complicated to estimate, especially due to that last `t(args...)` argument, which is the number of iterations. So we’ll be cowards and won’t actually try to estimate this case. Instead, we will require that our PRF is written in a simplified form, where the first and the last arguments to `LOOP` are simple.
So, if your PRF looks like `LOOP(init(args...), g, t(args...))`, you are required to first re-write it so that the initial value and the iteration count are computed outside, and passed in as plain arguments. So now we only have to deal with this: `f(args...) = LOOP(arg, g(args..., ·), arg')`, where `arg` and `arg'` come directly from `args...`.
Here, `f` has depth `d+1`, and `g` has depth `d`.
On the first iteration, we’ll call `g(args..., arg)`, which we can estimate as `A_d(maxarg)`. That is, `g` does get an extra argument, but it is one of the original arguments of `f`, and we are looking at the maximum argument anyway, so it doesn’t matter.
On the second iteration, we are going to call `g(args..., prev_iteration)`, which we can estimate as `A_d(max(maxarg, prev_iteration))`.
Now we plug in our estimate for the first iteration: the estimate for the first iteration is `A_d(maxarg)`, so the estimate for the second iteration adds one more layer: `A_d(A_d(maxarg))`. For the third iteration, we’ll get `A_d(A_d(A_d(maxarg)))`.
So the overall thing is going to be smaller than `A_d` iteratively applied to itself some number of times, where “some number” is one of `f`’s original arguments. But no harm’s done if we iterate up to `maxarg`.
As a sanity check, the worst depth-2 function constructed with iteration is probably `LOOP(x, succ, y)`, which is `x + y`. And our estimate gives “`x + 1` applied `maxarg` times to `maxarg`”, which is `2 * maxarg`, which is indeed the correct upper bound!
Combining everything together, we have: `f(args...) <= max(A_d(A_d(maxarg)), A_d^maxarg(maxarg))`, where `A_d^maxarg` means `A_d` applied `maxarg` times. That `max` there is significant — although it seems like the second branch, with `maxarg` applications, is always going to be larger, `maxarg`, in fact, could be as small as zero. But we can take `maxarg + 2` repetitions to cover both cases: `f(args...) <= A_d^(maxarg + 2)(maxarg)`.
So let’s just define `A_{d+1}(x)` to make that inequality work: `A_{d+1}(x) = A_d^(x + 2)(x)`, that is, `A_d` applied `x + 2` times to `x`.
Unpacking:
We define a family of unary functions `A_d`, such that each `A_d` “grows faster” than any n-ary PRF of depth `d`. If `f` is a ternary PRF of depth 3, then `f(1, 92, 10) <= A_3(92)`.
To evaluate `A_d` at point `x`, we use the following recursive procedure:
- If `d` is `1`, return `x + 1`.
- Otherwise, evaluate `A_{d-1}` at point `x` to get, say, `v`. Then evaluate `A_{d-1}` again, at point `v` this time, yielding `u`. Then compute `A_{d-1}(u)`. Overall, repeat this process `x + 2` times, and return the final number.
We can simplify this a bit if we stop treating `d` as a kind of function index, and instead say that our `A` is just a function of two arguments. Then we have the following equations: `A(1, x) = x + 1`, and `A(d + 1, x)` is `A(d, ·)` applied `x + 2` times to `x`. Unrolling the repeated application one call at a time gives a recursive definition for `A(d, x)`. As a Python program:
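```python
# A direct transcription of the procedure above.
def A(d: int, x: int) -> int:
    if d == 1:
        return x + 1
    acc = x
    for _ in range(x + 2):   # "repeat this process x + 2 times"
        acc = A(d - 1, acc)  # every recursive call decreases d
    return acc
```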
It’s easy to see that computing `A` on a Turing Machine using this definition terminates — this is a function of two arguments, and every recursive call uses a lexicographically smaller pair of arguments. And we constructed `A` in such a way that `A(d, x)`, as a function of `x`, is larger than any single-argument PRF of depth `d`. But that means that the following function of one argument, `a(x) = A(x, x)`, grows faster than any PRF. And that’s an example of a function which a Turing Machine has no trouble computing (given sufficient time), but which is beyond the capabilities of PRFs.
Part III, Descent From the Ivory Tower
Remember, this is a three-part post! And we are finally at Part III! So let’s circle back to practical matters. We have learned that:
- Turing machines don’t necessarily terminate.
- While other computational devices, like FSMs and PRFs, can be made to always terminate, there’s no guarantee that they’ll terminate fast. PRFs in particular can compute quite large functions!
- And non-Turing complete devices can be quite expressive. For example, any real-world algorithm that works on a TM can be adapted to run as a PRF.
- Moreover, you don’t even have to contort the algorithm much to make it fit. There’s a universal recipe for how to take something Turing complete and make it a primitive recursive function instead — just add an iteration counter to the device, and forcibly halt it if the counter grows too large.
Or, more succinctly: there’s no practical difference between a program that doesn’t terminate, and the one that terminates after a billion years. As a practitioner, if you think you need to solve the first problem, you need to solve the second problem as well. And making your programming language non-Turing complete doesn’t really help with this.
And yet, there are a lot of configuration languages out there that use non-Turing completeness as one of their key design goals. Why is that?
I would say that we are never interested in Turing completeness per se. We usually want some much stronger properties. And yet there’s no convenient catchy name for that bag of features of a good configuration language. So “non-Turing-complete” gets used as a sort of rallying cry to signal that something is a good configuration language, and maybe sometimes even to justify to others inventing a new language instead of taking something like Lua. That is, the real reason why you want at least a different implementation is all those properties you really need, but they are kinda hard to explain, or at least much harder than “we can’t use Python/Lua/JavaScript because they are Turing-complete”.
So what are the properties of a good configuration language?
First, we need the language to be deterministic. If you launch Python and type `id([])`, you’ll see some number. If you hit `^C` and then do this again, you’ll see a different number. This is OK for “normal” programming, but is usually anathema for configuration. Configuration is often used as a key in some incremental, caching system, and letting non-determinism in there wreaks absolute havoc!
Second, you need the language to be well-defined. You can compile Python with ASLR disabled, and use some specific allocator, such that `id([])` always returns the same result. But that result would be hard to predict! And if someone tries to do an alternative implementation, even if they disable ASLR as well, they are likely to get a different deterministic number! Or the same could happen if you just update the version of Python. So, the semantics of the language should be clearly pinned down by some sort of reference, such that it is possible to guarantee not only deterministic behavior, but fully identical behavior across different implementations.
Third, you need the language to be pure. If your configuration can access environment variables or read files on disk, then the meaning of the configuration will depend on the environment where the configuration is evaluated, and you again don’t want that, to make caching work.
Fourth, a thing that is closely related to purity is security and sandboxing. The mechanism to achieve both purity and security is the same — you don’t expose general IO to your language. But the purpose is different: purity is about not letting the results be non-deterministic, while security is about not exposing access tokens to the attacker.
And now this gets tricky. One particular possible attack is a denial of service — sending some bad config which makes our system just spin there burning the CPU. Even if you control all IO, you are generally still open to these kinds of attacks. It might be OK to say this is outside of the threat model — that no one would find it valuable enough to just burn your CPU, if they can’t also do IO, and that, even in the event that this happens, there’s going to be some easy mitigation in the form of a higher-level timeout.
But you also might choose to provide some sort of guarantees about execution time, and that’s really hard. Two approaches work. One is to make sure that processing is obviously linear. Not just terminating, but actually proportional to the size of the inputs, and in a very direct way. If the correspondence is not direct, then it’s highly likely that it is in fact non-linear. The second approach is to ensure metered execution — during processing, decrement a counter for every simple atomic step, and terminate processing when the counter reaches zero.
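A minimal sketch of the metered approach, with illustrative names: thread a fuel counter through the evaluator and charge one unit per atomic step.

```python
class OutOfFuel(Exception):
    pass

class Evaluator:
    def __init__(self, fuel: int):
        self.fuel = fuel

    def tick(self) -> None:
        # Called once per atomic evaluation step.
        self.fuel -= 1
        if self.fuel <= 0:
            raise OutOfFuel("configuration is too expensive to evaluate")
```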
Finally one more vague property you’d want from a configuration language is for it to be simple. That is, to ensure that, when people use your language, they write simple programs. It seems to me that this might actually be the case where banning recursion and unbounded loops could help, though I am not sure. As we know from the PRF exercise, this won’t actually prevent people from writing arbitrary recursive programs. It’ll just require some roundabout code to do that. But maybe that’ll be enough of a speedbump to make someone invent a simple solution, instead of brute-forcing the most obvious one?
That’s all for today! Have a great weekend, and remember: