Newtype Index Pattern In Zig

In efficiency-minded code, it is idiomatic to use indexes rather than pointers. Indexes have several advantages:

First, they save memory. Typically a 32-bit index is enough, a saving of four bytes per pointer on 64-bit architectures. I haven’t seen this measured, but my gut feeling is that this is much more impactful than it might initially seem. On modern architectures, saving memory saves time (and energy) as well, because the computing bottleneck is often the bit pipe between the memory and the CPU, not the computation per se. Dense data structures use CPU cache more efficiently, removing prohibitive latency of memory accesses. Bandwidth savings are even better: smaller item size obviously improves bandwidth utilization, but having more items in cache obviates the need to use the bandwidth in the first place. Best case, the working set fits into the CPU cache!

Note well that memory savings are evenly spread out. Using indexes makes every data structure slightly more compact, which improves performance across the board, regardless of hotspot distribution. It’s hard to notice a potential for such saving in a profiler, and even harder to test out. For these two reasons, I would default to indexes for code where speed matters, even when I don’t have the code written yet to profile it!

There’s also a more subtle way in which indexes save memory. Using indexes means storing multiple items in an array, but such dense storage contains extra information in relative positions of the items. If you need to store a list of items, you can often avoid materializing the list of indexes by storing a range “pointing” into the shared storage. Occasionally, you can even do UTF-8 trick and use just a single bit to mark the end of a list.

The second benefit of indexes is more natural modeling of cyclic and recursive data structures. Creating a cycle fundamentally requires mutability somewhere (“tying the knot” in Haskell relies on mutability of lazy thunks). This means that you need to make some pointers nullable, and that usually gets awkward even without borrow checker behind your back. Even without cycles and just recursion, pointers are problematic, due to a combination of two effects:

The combination works fine at small scale, but then it fails with stack overflow in production every single time, requiring awkward work-arounds. For example, rustc serializes error traces from nested macro expansions as a deeply nested tree of JSON objects, which requires using stacker hack when parsing the output (which you’ll learn about only after crashes in the hands of macro connoisseur users).

Finally, indexes greatly help serialization, they make it trivial to communicate data structures both through space (sending a network message) and time (saving to disk and reading later). Indexes are naturally relocatable, it doesn’t matter where in memory they are. But this is just a half of serialization benefit. The other is that, because everything is in few arrays, you can do bulk serialization. You don’t need to write the items one by one, you can directly memcpy arrays around (but be careful to not leak data via padding, and be sure to checksum the result).

The big problem with “naive” u32 indexes is of course using the right index with the wrong array, or vice verse. The standard solution here is to introduce a newtype wrapper around the raw index. @andrewrk recently popularized a nice “happy accident of language design” pattern for this in Zig. The core idea is to define an index via non-exhaustive enum:

const ItemIndex = enum(u32) {
    _
};

In Zig, enum designates a strongly-typed collection of integer constants, not a Rust-style ADT (there’s union(enum) for that). By default an backing integer type is chosen by the compiler, but you can manually override it with enum(u32) syntax:

const Color = enum(u16) { red, green, blue };

Finally, Zig allows making enums non-exhaustive with _. In a non-exhaustive enum, any numeric value is valid, and some have symbolic labels:

const FontWeight = enum(u16) {
    normal = 400,
    bold = 700,
    _,

    pub fn value(weight: FontWeight) u16 {
        return @intFromEnum(weight);
    }
}

test FontWeight {
    assert(FontWeight.value(.normal) == 400);

    const bold: FontWeight = @enumFromInt(700);
    assert(bold == .bold);
}

@intFromEnum and @enumFromInt builtins switch abstraction level between a raw integer and an enum value. So,

const ItemIndex = enum(u32) {
    _
};

is a way to spell “u32, but a distinct type”. Note that there’s no strong encapsulation boundary here, anyone can @enumFromInt. Zig just doesn’t provide language-enforced encapsulation mechanisms.

Putting everything together, this is how I would model n-ary tree with parent pointers in Zig:

pub const Tree = struct {
   nodes: []const Node.Data,

   pub const Node = enum(u32) {
       root = 0,
       invalid = std.math.maxInt(u32),
       _,

       pub const Data = struct {
           parent: Node, // .invalid means no parent.
           children: struct {
               index: u32,
               count: u32,
           },

           comptime {
               assert(@sizeOf(Data) == 12);
           }
       };
   };

   fn get(
       tree: *const Tree,
       node: Node,
   ) Node.Data {
       return tree.nodes[@intFromEnum(node)];
   }

   pub fn parent(
       tree: *const Tree,
       node: Node,
   ) ?Node {
       const result = tree.get(node).parent;
       return if (result == .invalid) null else result;
   }

   pub fn children(
       tree: *const Tree,
       node: Node,
   ) []const Node {
       const range = tree.get(node).children;
       return tree.nodes[range.index..][0..range.count];
   }
};

Some points of note:

P.S. Apparently I also wrote a Rust version of this post a while back? https://matklad.github.io/2018/06/04/newtype-index-pattern.html