Comptime Zig ORM

This post can be considered an advanced Zig tutorial. I will be covering some of the more unique aspects of the language, but won't be explaining the easy parts. If you haven't read the Zig Language Reference, you might want to start there. We will also learn the foundational trick for implementing the relational model.

You will learn a sizable chunk of Zig from this post, but this isn't going to be an easy read, so prepare your favorite beverage and get comfy!

On Learning

One of the most ridiculously effective ways of learning, for me, is building toy versions of programs. This is slightly more specific than "to learn to code, code": I claim that you can learn more by spending a week building your own very bad version of an application from scratch than you'd learn from working full-time for a year on a production-ready codebase. Case in point: although the code in this post is lifted from TigerBeetle, and I've been working with it for a couple of years, I've learned a bunch of new things myself during the evening of hacking on the code for this post.

The hard part about the toy problem approach is finding the right toy! I remember, early in my career, spending about a year pestering everyone with the "what is your favorite model problem?" question, and not getting a real answer. Until one day @zmacter asked "have you tried a raytracer?" and that became my model problem for learning programming languages. Seriously, if you want to learn Zig, go write yourself a raytracer; I have some notes for that here.

The Database

In this post, we'll work on a different model problem, which I think is an especially good fit for showcasing Zig's comptime capabilities. This problem is a simplified version of the LSM Forest code from TigerBeetle.

Specifically, we will be implementing an in-memory relational database whose schema is set at compile time. Before diving into the implementation, let's sketch the interface we want.

First, we define objects which we will be storing in our database, accounts and transfers:

const Account = struct {
    id: ID = @enumFromInt(0),
    balance: u128,

    pub const ID = enum(u64) { _ };
};

const Transfer = struct {
    id: ID = @enumFromInt(0),
    amount: u128,
    debit_account: Account.ID,
    credit_account: Account.ID,

    pub const ID = enum(u64) { _ };
};

In Zig, struct is an expression that yields an anonymous struct type, which needs to be explicitly bound to an identifier: const Account = struct { ... }

Structs contain fields and declarations. Fields can have default values. The curious pattern pub const ID = enum(u64) { _ }; is a Zig idiom for creating a newtype over an integer. ID is an enumeration whose backing type is u64. This enumeration doesn't have any explicitly named variants, but it is open (_): any u64 numeric value is considered to be a member. This is exactly what we want for an id: it's an opaque number with a unique type, whose numberness is not exposed (you can't add two ids together). In the transfer struct, we refer to the account id: debit_account: Account.ID

Note that although Account.ID and Transfer.ID have exactly the same definition, they are distinct types. Let this sink in: Zig's type system is nominal, but all types are anonymous!
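
To see the nominal typing in action, here is a hypothetical snippet (lookup_account is my own name, not part of our database); the commented-out call is rejected at compile time:

fn lookup_account(id: Account.ID) void {
    _ = id;
}

test "ids don't mix" {
    const t: Transfer.ID = @enumFromInt(1);
    // lookup_account(t); // error: expected type 'Account.ID', found 'Transfer.ID'
    lookup_account(@enumFromInt(@intFromEnum(t))); // an explicit cast is required
}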

Ids will be assigned by our database automatically, using an auto-incrementing counter, and we will use the zero id to signify a new object without an id assigned yet: id: ID = @enumFromInt(0),

@enumFromInt and @intFromEnum are built-in functions for casting between an enum and its backing integer. It would arguably have been cleaner to instead write:

const Account = struct {
    id: ID = .unassigned,
    balance: u128,

    pub const ID = enum(u64) {
        unassigned = 0,
        _,
    };
};

That is, to add an explicitly named variant for zero.


So, yeah, we have accounts and transfers: they both have ids assigned by the database, an account has a balance, and a transfer has an amount and refers to two accounts:

const Account = struct {
    id: ID = @enumFromInt(0),
    balance: u128,

    pub const ID = enum(u64) { _ };
};

const Transfer = struct {
    id: ID = @enumFromInt(0),
    amount: u128,
    debit_account: Account.ID,
    credit_account: Account.ID,

    pub const ID = enum(u64) { _ };
};

Now, let's define our database from the schema:

const DBType = @import("./db.zig").DBType;

const DB = DBType(.{
    .tables = .{
        .account = Account,
        .transfer = Transfer,
    },
    .indexes = .{
        .transfer = .{
            .debit_account, .credit_account,
        },
    },
});

A lot is going on here. First, we use the @import builtin function to import (look, no dedicated syntax needed!) the DBType function. DBType is a type constructor: it takes a DB schema and returns a database type. For the schema, we ask for two tables, accounts and transfers, and also ask to include indexes on the transfer's foreign keys.

The implementation of DBType is the meat of this post but, for now, let's see how we use it. Let's write a function to add a transfer to the database:

fn create_transfer(
    db: *DB,
    gpa: std.mem.Allocator,
    debit_account: Account.ID,
    credit_account: Account.ID,
    amount: u128,
) !?Transfer.ID {
    ...
}

Zig doesn't have a global allocator, so anything that needs to allocate takes an allocator argument. std.mem.Allocator is dynamically dispatched: inside it are a type-erased pointer to a particular allocator's state, and a pointer to a vtable:

const Allocator = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        alloc: *const fn (
            *anyopaque,
            len: usize,
            alignment: Alignment,
            ret_addr: usize,
        ) ?[*]u8,
        ...
    };
};

This is a trait object, coded manually. gpa stands for general purpose allocator, which behaves more or less like a global allocator would, as far as the code is concerned. You often see arena: Allocator, signifying that memory doesn't need to be freed on a per-object basis, or scratch: Allocator, signifying that the memory can be used for short-lived allocations inside the function, but can't outlive it.
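
As a sketch of the arena convention (this particular setup is not used in this post's code), an arena is typically layered on top of the gpa:

var arena_instance = std.heap.ArenaAllocator.init(gpa);
defer arena_instance.deinit(); // frees all arena allocations at once
const arena = arena_instance.allocator();

const scratch = try arena.alloc(u8, 1024); // no individual free needed
_ = scratch;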

Inserting a new object into our in-memory database could allocate, so we need an allocator argument and, conversely, need to signal the possibility of an allocation failure in our result type, which is what the bang (!) is for.

Another reason why the operation might fail is that the transfer itself might be invalid (e.g., insufficient balance). For simplicity, I choose to model this by returning null instead of a Transfer.ID, hence the question mark (?).

Let's see the implementation of create_transfer:

if (debit_account == credit_account) return null;
const dr = db.account.get(debit_account) orelse
    return null;
const cr = db.account.get(credit_account) orelse
    return null;
if (dr.balance < amount) return null;
if (cr.balance > std.math.maxInt(u128) - amount) return null;

db.account.update(.{
    .id = debit_account,
    .balance = dr.balance - amount,
});
db.account.update(.{
    .id = credit_account,
    .balance = cr.balance + amount,
});
return try db.transfer.create(gpa, .{
    .debit_account = debit_account,
    .credit_account = credit_account,
    .amount = amount,
});

Zig doesn't require braces around an if's body, which makes for concise guard ifs. It comes with an autoformatter out of the box, so there's very little possibility of indentation confusion.

db.account.get(credit_account) looks up an account by an id. The account might or might not exist, so the return type of this function is ?Account. Zig's orelse unpacks optionals. The type of a return expression is noreturn (! in Rust), so the type of cr and dr is just Account, without a question mark. Instead of returning, we could have orelse'd some default Account.
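
For illustration, here is orelse in its simplest form (a standalone sketch, not part of our database code):

const present: ?u32 = 92;
const value: u32 = present orelse 42; // 92: the optional was non-null
const missing: ?u32 = null;
const fallback: u32 = missing orelse 42; // 42: fell back to the default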

This line is the only place where we need to help type inference by spelling a type explicitly: if (cr.balance > std.math.maxInt(u128) - amount) return null;

We pass the u128 type to the maxInt function. This is a case where a sophisticated type inference algorithm could look at the surrounding context and infer the type, but Zig deliberately requires the user to spell it out in situations like this.

Having done the balance checks, we ask our database to update the two balances, and to persist the new transfer object. Only the transfer.create call gets an allocator, so it is immediately obvious that only this part of the function can allocate.

A Usage Example

Now, let's write a test program:

pub fn main() !void {
    ...
}

We are going to allocate, so we might fail, hence the !void return type. First, we'll need an allocator:

var gpa_instance: std.heap.DebugAllocator(.{}) = .{};
const gpa = gpa_instance.allocator();

gpa_instance is a concrete allocator, of type std.heap.DebugAllocator(.{}). It is initialized with default values for all the fields. When evaluating = .{}, Zig knows the result type, so it knows which fields are defaulted.

gpa is our trait object. Internally, it contains a pointer to gpa_instance. The state of gpa_instance will be mutated, so it needs to be declared as a var. The gpa itself, though, is just a pair of pointers, and those pointers won't be mutated themselves, so we can declare it const, similarly to how in Rust you'd write

let mut x = 92;
let r = &mut x; // no mut on r!

Because Zig doesn't track aliasing in the type system, figuring out what can get mutated when is generally harder in Zig than in Rust.

For the usage example, we'll need some random numbers, which follow a similar pattern: a concrete PRNG and a dynamically dispatched trait object / vtable:

var random_instance = std.Random.DefaultPrng.init(92);
const random = random_instance.random();

Finally, we create an instance of our database:

var db: DB = .{};
// defer db.deinit(gpa);

DB will allocate memory, so we absolutely do need a deinit function to free it, but I am excluding it from the tutorial, as it requires some not particularly illuminating legwork.

For starters, just create two accounts and a transfer:

const alice: Account.ID =
    try db.account.create(gpa, .{ .balance = 100 });

const bob: Account.ID =
    try db.account.create(gpa, .{ .balance = 200 });

const transfer: ?Transfer.ID =
    try create_transfer(&db, gpa, alice, bob, 100);

assert(transfer != null);

So far, this feels like a hash map with more steps. We aren't doing anything relational here. We will, soon, but we'll need some fake data. To keep things a touch more realistic, we won't be distributing transfers equally between accounts, and will instead ensure that the hottest 20% of accounts are responsible for 80% of transfers:

fn pareto_index(random: std.Random, count: usize) usize {
    assert(count > 0);
    const hot = @divFloor(count * 2, 10);
    if (hot == 0) return random.uintLessThan(usize, count);
    if (random.uintLessThan(u32, 10) < 8) {
        return pareto_index(random, hot);
    }
    return hot + random.uintLessThan(usize, count - hot);
}

Nothing new here: @divFloor is another builtin (an intention-bearing name for /), and we need to pass the type we want out of random explicitly, instead of having type inference (and the human reader) figure it out.

The loop to populate the database is slightly more interesting:

var accounts: std.ArrayListUnmanaged(Account.ID) = .empty;
defer accounts.deinit(gpa);

const account_count = 100;
try accounts.ensureTotalCapacity(gpa, account_count);

accounts.appendAssumeCapacity(alice);
accounts.appendAssumeCapacity(bob);
while (accounts.items.len < account_count) {
    const account = try db.account.create(gpa, .{ .balance = 1000 });
    accounts.appendAssumeCapacity(account);
}

const transfer_count = 100;
for (0..transfer_count) |_| {
    const debit = pareto_index(random, account_count);
    const credit = pareto_index(random, account_count);
    const amount = random.uintLessThan(u128, 10);
    _ = try create_transfer(
        &db,
        gpa,
        accounts.items[debit],
        accounts.items[credit],
        amount,
    );
}

We'll need to store the generated account ids somewhere, so we use an ArrayList. Zig strongly pushes you towards batching your allocations, so we preallocate space for a hundred accounts at once, and then append without passing a gpa in. For simplicity, we don't implement a reservation API for our database, so we do need a gpa when creating an account or a transfer.

In Zig, an unused return value is a compilation error, so we need _ = to ignore the result of create_transfer.

Finally, we come to the relational part of the tutorial: we'll do a non-trivial lookup. First, we'll ask for all transfers from alice to anyone, and then for transfers between alice and bob specifically:

var transfers_buffer: [10]Transfer = undefined;
const alice_transfers = db.transfer.filter(
    .{ .debit_account = alice },
    &transfers_buffer,
);
for (alice_transfers) |t| {
    std.debug.print("alice: from={} to={} amount={}\n", .{
        t.debit_account,
        t.credit_account,
        t.amount,
    });
}

const alice_to_bob_transfers = db.transfer.filter(
    .{ .debit_account = alice, .credit_account = bob },
    &transfers_buffer,
);
for (alice_to_bob_transfers) |t| {
    std.debug.print("alice to bob: from={} to={} amount={}\n", .{
        t.debit_account,
        t.credit_account,
        t.amount,
    });
}

The interesting parts here are the two filter calls. We can ask the database to filter transfer objects down to only those with matching attributes. You could do this with a brute-force loop over all transfers. But, if you are serious about your relational model, you obviously want to be faster than that! I wonder if this has something to do with the indexes we added when declaring DB?

As for Zig specifics: I don't want to allocate the result, and I don't want to bother with iterators, so I pass a stack-allocated out buffer in.
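
For comparison, the brute-force version we want to beat would look something like this (a sketch; it peeks at the objects list internals that we will only build later in db.zig):

var count: usize = 0;
for (db.transfer.objects.values.items) |t| {
    if (count == transfers_buffer.len) break;
    if (t.debit_account == alice) {
        transfers_buffer[count] = t;
        count += 1;
    }
}
const alice_transfers_slow = transfers_buffer[0..count];

This always scans every transfer; the index-backed filter will only touch the matching slice of an index.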

A Call To Action

Here's what we've got so far, the interface for the so-far mysterious db.zig:

const std = @import("std");
const assert = std.debug.assert;

const DBType = @import("./db.zig").DBType;

const Account = struct {
    id: ID = .unassigned,
    balance: u128,

    pub const ID = enum(u64) {
        unassigned,
        _,
    };
};

const Transfer = struct {
    id: ID = @enumFromInt(0),
    amount: u128,
    debit_account: Account.ID,
    credit_account: Account.ID,

    pub const ID = enum(u64) { _ };
};

const DB = DBType(.{
    .tables = .{
        .account = Account,
        .transfer = Transfer,
    },
    .indexes = .{
        .transfer = .{
            .debit_account, .credit_account,
        },
    },
});

fn create_transfer(
    db: *DB,
    gpa: std.mem.Allocator,
    debit_account: Account.ID,
    credit_account: Account.ID,
    amount: u128,
) !?Transfer.ID {
    if (debit_account == credit_account)
        return null;
    const dr = db.account.get(debit_account) orelse
        return null;
    const cr = db.account.get(credit_account) orelse
        return null;
    if (dr.balance < amount) return null;
    if (cr.balance > std.math.maxInt(u128) - amount) return null;

    db.account.update(.{
        .id = debit_account,
        .balance = dr.balance - amount,
    });
    db.account.update(.{
        .id = credit_account,
        .balance = cr.balance + amount,
    });
    return try db.transfer.create(gpa, .{
        .debit_account = debit_account,
        .credit_account = credit_account,
        .amount = amount,
    });
}

pub fn main() !void {
    var gpa_instance: std.heap.DebugAllocator(.{}) = .{};
    const gpa = gpa_instance.allocator();

    var random_instance = std.Random.DefaultPrng.init(92);
    const random = random_instance.random();

    var db: DB = .{};
    // defer db.deinit(gpa);

    const alice: Account.ID =
        try db.account.create(gpa, .{ .balance = 100 });
    const bob: Account.ID =
        try db.account.create(gpa, .{ .balance = 200 });
    const transfer =
         try create_transfer(&db, gpa, alice, bob, 100);
    assert(transfer != null);

    var accounts: std.ArrayListUnmanaged(Account.ID) = .empty;
    defer accounts.deinit(gpa);

    const account_count = 100;
    try accounts.ensureTotalCapacity(gpa, account_count);

    accounts.appendAssumeCapacity(alice);
    accounts.appendAssumeCapacity(bob);
    while (accounts.items.len < account_count) {
        const account =
            try db.account.create(gpa, .{ .balance = 1000 });
        accounts.appendAssumeCapacity(account);
    }

    const transfer_count = 100;
    for (0..transfer_count) |_| {
        const debit = pareto_index(random, account_count);
        const credit = pareto_index(random, account_count);
        const amount = random.uintLessThan(u128, 10);
        _ = try create_transfer(
            &db,
            gpa,
            accounts.items[debit],
            accounts.items[credit],
            amount,
        );
    }

    var transfers_buffer: [10]Transfer = undefined;
    const alice_transfers = db.transfer.filter(
        .{ .debit_account = alice },
        &transfers_buffer
    );
    for (alice_transfers) |t| {
        std.debug.print("alice: from={} to={} amount={}\n", .{
            t.debit_account,
            t.credit_account,
            t.amount,
        });
    }
    std.debug.print("\n\n", .{});

    const alice_to_bob_transfers = db.transfer.filter(
        .{ .debit_account = alice, .credit_account = bob },
        &transfers_buffer,
    );
    for (alice_to_bob_transfers) |t| {
        std.debug.print("alice to bob: from={} to={} amount={}\n", .{
            t.debit_account,
            t.credit_account,
            t.amount,
        });
    }
}

fn pareto_index(random: std.Random, count: usize) usize {
    assert(count > 0);
    const hot = @divFloor(count * 2, 10);
    if (hot == 0) return random.uintLessThan(usize, count);
    if (random.uintLessThan(u32, 10) < 8) return pareto_index(random, hot);
    return hot + random.uintLessThan(usize, count - hot);
}

If you want to get 90% out of this post, I strongly recommend that you not read any further, and instead copy the above code into your own main.zig and try to write db.zig yourself. I do think this is a most excellent exercise that can teach you more effectively than any blog post. It'll take more time, of course, but you'll get more knowledge per minute spent out of it.

If you will settle for the 10%, read on! And, if you want 100%, then do your implementation first and then come back here!

The Table

We will be building db.zig from the ground up. It all will make sense! At the end.

Our fundamental data structure is a sorted list of values. Here, I kindly ask the reader to engage suspension of disbelief: in a real database, we would be using a data structure with efficient lookups and modifications, such as a B-tree or an LSM tree. For the purposes of this tutorial, we will use a simple sorted array, and will just close our eyes to the O(N) insertions and removals.

Values are going to be sorted by a particular field. For example, we sort transfers by their ids. So, when creating a Table of transfers, we'll need to pass the type of the key, the type of the value, and functions for extracting and comparing keys:

const TransfersTable = TableType(Transfer.ID, Transfer, struct {
    pub fn key_fn(value: Transfer) Transfer.ID {
        return value.id;
    }
    pub fn key_cmp(lhs: Transfer.ID, rhs: Transfer.ID) std.math.Order {
        return std.math.order(@intFromEnum(lhs), @intFromEnum(rhs));
    }
});

Here's the corresponding declaration:

fn TableType(
    comptime KeyType: type,
    comptime ValueType: type,
    comptime Functions: type,
) type {
    const key_fn = Functions.key_fn;
    const key_cmp = Functions.key_cmp;

    return struct {
        ...
    };
}

This is a type constructor function, which takes a bunch of types as arguments and returns a new type. Such functions can only be called at compile time; Zig doesn't have the ability to create new types at runtime, unlike something like the JVM.

Passing the table of functions as a Functions type is a weird idiom of Zig. It would be more natural to use the following signature:

fn TableType(
    comptime KeyType: type,
    comptime ValueType: type,
    comptime key_fn: fn (value: ValueType) KeyType,
    comptime key_cmp: fn (lhs: KeyType, rhs: KeyType) std.math.Order,
) type

But this version is more painful to use at the call site. While struct is an expression in Zig, and you can declare one inline, fn is not an expression: you can't declare a function inline, unless you employ another Zig idiom:

const my_function = struct {
    fn double(x: u32) u32 { return x * 2; }
}.double;

Here's the implementation of the table:

struct {
    values: std.ArrayListUnmanaged(Value) = .empty,

    pub const Key = KeyType;
    pub const Value = ValueType;
    const Table = @This();

    pub fn search(
        table: *const Table,
        key: Key,
        start_index: usize,
    ) usize {
        return start_index + std.sort.lowerBound(
            Value,
            table.values.items[start_index..],
            key,
            compare_fn,
        );
    }

    fn compare_fn(key: Key, value: Value) std.math.Order {
        return key_cmp(key, key_fn(value));
    }

    pub fn get(table: *const Table, key: Key) ?Value {
        const index = table.search(key, 0);
        if (index >= table.values.items.len) return null;

        const value = table.values.items[index];
        if (key_cmp(key, key_fn(value)) != .eq) return null;

        return value;
    }

    pub fn reserve(
        table: *Table,
        gpa: std.mem.Allocator,
        extra: usize,
    ) !void {
        try table.values.ensureUnusedCapacity(gpa, extra);
    }

    pub fn insert(table: *Table, value: Value) void {
        assert(table.values.unusedCapacitySlice().len > 0);
        const index = table.search(key_fn(value), 0);
        table.values.insertAssumeCapacity(index, value);
    }

    pub fn remove(table: *Table, value: Value) void {
        const index = table.search(key_fn(value), 0);
        const removed = table.values.orderedRemove(index);
        assert(std.meta.eql(value, removed));
    }
};

The search function binary searches for the index corresponding to the given key in the list of values. For the convenience of a call site we are yet to see, we also pass in the starting index for the search. It is useful for, e.g., a pagination-style API where the index is used as a cursor.

The get function then uses search to find the index, and furthermore checks that we have an exact match.

For the insert function, we do implement the reservation pattern: memory allocation and data structure modification are split into two functions. Modification proper is infallible, and has a reservation as a precondition.

As promised, we do a naive linear memcpy for insert / remove, but we pretend that it is actually logarithmic.

Another unrealistic simplification is that our API is scalar: we insert or remove a single item at a time. Both Zig and the relational model strongly encourage operating on a batch of objects at a time, pushing the fors down: pub fn insert(table: *Table, values: []const Value) void

Even with a naive array list, the batched version runs in O(N + K log K), which is much faster than O(N K) of the scalar version repeated K times. But we leave batching as an exercise for the reader.

The Indexes

Now comes the relational model part of the tutorial. If we have

const TransfersTable = TableType(Transfer.ID, Transfer, struct {
    pub fn key_fn(value: Transfer) Transfer.ID {
        return value.id;
    }
    pub fn key_cmp(lhs: Transfer.ID, rhs: Transfer.ID) std.math.Order {
        return std.math.order(@intFromEnum(lhs), @intFromEnum(rhs));
    }
});

we can efficiently filter transfers by their ids. How can we add the ability to filter transfers by, say, debit account id?

The idea is to add a second sorted list, which stores pairs of (Account.ID, Transfer.ID).

With this setup, if you are interested in all transfers from alice, you can binary search for alice's account id in the second list, fetch the corresponding transfer ids, and then look up the transfers in the first list:

Transfers:
id=1 debit_account=alice    credit_account=bob
id=2 debit_account=charley  credit_account=bob
id=3 debit_account=alice    credit_account=charley

Index:
debit_account=alice   id=1
debit_account=alice   id=3
debit_account=charley id=2

What makes this work is that we keep the two lists in sync. When creating a transfer, we insert it into the Transfers table, but also insert the corresponding pair into the index table. Removal works similarly.

Implementing the Index

Let's implement an index table. Here, we start using comptime for real. In particular, we will parametrize our index table by the type (such as Transfer) and the name of the field to build an index over (such as .debit_account):

const TransferDebitAccountIndex =
    IndexTableType(Transfer, .debit_account);

This is the signature:

fn IndexTableType(
    comptime Value: type,
    comptime field: std.meta.FieldEnum(Value),
) type {
    ...
}

This is a type constructor, which takes a type and returns a type. The std.meta.FieldEnum(Value) call returns an enum whose variants are the fields of Value. E.g., our Transfer is

const Transfer = struct {
    id: ID,
    amount: u128,
    debit_account: Account.ID,
    credit_account: Account.ID,
};

so the corresponding FieldEnum would look like this:

const TransferFieldEnum = enum {
    id,
    amount,
    debit_account,
    credit_account,
};

Now, the body:

fn IndexTableType(
    comptime Value: type,
    comptime field: std.meta.FieldEnum(Value),
) type {
    const FieldType = @FieldType(Value, @tagName(field));

    const Pair = struct {
        field: FieldType,
        id: Value.ID,
    };

    return TableType(Pair, Pair, struct {
        pub fn key_fn(value: Pair) Pair {
            return value;
        }

        pub fn key_cmp(lhs: Pair, rhs: Pair) std.math.Order {
            return order_by(Pair, lhs, rhs, &.{ .field, .id });
        }
    });
}

Ultimately, we want to delegate to the existing TableType, as that already implements the logic for storing a sorted list of items. The item for us is a (field value, value id) pair. One note about the const FieldType = @FieldType(Value, @tagName(field)); incantation: FieldEnum is a library-level abstraction, while metaprogramming builtins, like @FieldType or @field, work with string names of fields (at compile time, of course). @tagName converts from .debit_account to "debit_account". We don't have to use FieldEnum, and could have used strings throughout, but FieldEnum gives us two advantages:

  • Earlier type errors: calling IndexTableType with a field that doesn't exist will error out at the call site, rather than at the definition site.
  • Greppability: in Zig, field access is always spelled as .debit_account syntactically, so it is advantageous to stick to the same convention during metaprogramming, to make sure it also shows up in textual searches.
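
Concretely, here is a small comptime sketch tying these pieces together for Transfer (the names F, field_name, and FieldT are mine):

const F = std.meta.FieldEnum(Transfer);
// F is equivalent to: enum { id, amount, debit_account, credit_account }
const field_name = @tagName(F.debit_account); // the comptime string "debit_account"
const FieldT = @FieldType(Transfer, field_name); // Account.ID
comptime {
    assert(FieldT == Account.ID);
}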

The Key for our table is the entire Pair. That is, we want to sort not only on the field value, but on the ID as well, to make sure that, when we look up all ids for a particular field value, we get back a sorted list. That's why our key_fn is an identity function:

pub fn key_fn(value: Pair) Pair {
    return value;
}

In key_cmp, we want to compare first by field, and then by id. We could do it manually, but it's more fun to do some metaprogramming here as well:

pub fn key_cmp(lhs: Pair, rhs: Pair) std.math.Order {
    return order_by(Pair, lhs, rhs, &.{ .field, .id });
}

fn order_by(
    comptime T: type,
    lhs: T,
    rhs: T,
    comptime fields: []const std.meta.FieldEnum(T),
) std.math.Order {
    ...
}

order_by is our first mixed-mode function. Some arguments are comptime, but some are runtime. This function should compare a pair of Ts by sequentially comparing the values of the corresponding fields, returning as soon as two unequal fields are found. Here we use an inline for:

fn order_by(
    comptime T: type,
    lhs: T,
    rhs: T,
    comptime fields: []const std.meta.FieldEnum(T),
) std.math.Order {
    inline for (fields) |field| {
        const order = order_enums(
            @field(lhs, @tagName(field)),
            @field(rhs, @tagName(field)),
        );
        if (order != .eq) return order;
    }
    return .eq;
}

Because the list of fields is known at compile time, the loop is fully unrolled, and the generated code ends up looking like a sequence of direct comparisons. @field fetches a field from a value given a comptime field name. order_enums is a little helper which allows comparing either numbers or enums:

fn order_enums(lhs: anytype, rhs: @TypeOf(lhs)) std.math.Order {
    return switch (@typeInfo(@TypeOf(lhs))) {
        .int => std.math.order(lhs, rhs),
        .@"enum" => std.math.order(
            @intFromEnum(lhs),
            @intFromEnum(rhs),
        ),
        else => comptime unreachable,
    };
}

@typeInfo is a builtin that allows reflecting on the structure of types. In particular, it classifies types as structs, unions, enums, integers, etc., which is exactly what we need here. One wrinkle is that enum is a keyword, so the enum variant of @typeInfo is spelled @"enum". The @"" syntax allows using any string as a Zig identifier; it's an escape hatch for keywords.
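
As a quick sketch, the tag of the value returned by @typeInfo can be compared directly against an enum literal:

comptime {
    assert(@typeInfo(u128) == .int);
    assert(@typeInfo(Account.ID) == .@"enum");
    assert(@typeInfo(Transfer) == .@"struct");
}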

And that's basically it for indexes! Now we have our main table (the object table), and a number of index tables. The next task is to bundle them together, so that we can enforce consistency between the tables.

The Bundle

The next thing we will be building is the Bundle. It takes a type and a list of fields to build indexes over, and provides an API for creating and looking up values, while maintaining consistency of the indexes:

pub fn BundleType(
    comptime Value: type,
    comptime indexed_fields: []const std.meta.FieldEnum(Value),
) type

With the bundle, we can finally explain the database that we've started with:

const DB = struct {
    account: BundleType(Account, &.{}),
    transfer: BundleType(Transfer, &.{
        .debit_account,
        .credit_account,
    }),
};

And this is the API we used, and which we'll now implement:

var db: DB = ...;
try db.account.create(gpa, .{ .balance = 100 });
db.account.update(.{ .id = debit_account, .balance = dr.balance - amount });
db.account.get(debit_account);

try db.transfer.create(gpa, .{
    .debit_account = debit_account,
    .credit_account = credit_account,
    .amount = amount,
})

db.transfer.filter(
    .{ .debit_account = alice, .credit_account = bob },
    &transfers_buffer,
);

Lets start:

pub fn BundleType(
    comptime Value: type,
    comptime indexed_fields: []const std.meta.FieldEnum(Value),
) type {
    return struct {
        id_counter: u64 = 0,

        objects: TableType(Value.ID, Value, struct {
            pub fn key_fn(value: Value) Value.ID {
                return value.id;
            }
            pub fn key_cmp(
                lhs: Value.ID,
                rhs: Value.ID,
            ) std.math.Order {
                return std.math.order(
                    @intFromEnum(lhs),
                    @intFromEnum(rhs),
                );
            }
        }) = .{},

        indexes: ???,

        const Bundle = @This();

        pub fn get(bundle: *Bundle, id: Value.ID) ?Value {
            return bundle.objects.get(id);
        }

        ...
    }
}

The basic structure is clear: we have an id_counter for assigning new ids when creating values, the object table, which stores the sorted values and directly powers the get method, and then we have the indexes. The indexes are tricky. For transfers, where we index debit_account and credit_account, we want the indexes to look like this:

indexes: struct {
    debit_account:  IndexTableType(Transfer, .debit_account),
    credit_account: IndexTableType(Transfer, .credit_account),
}

So we need to iterate over the passed-in indexed_fields and create a struct with a field for each index. And that's exactly what we'll do:

indexes: blk: {
    var fields: [indexed_fields.len]std.builtin.Type.StructField =
        undefined;
    for (indexed_fields, 0..) |indexed, i| {
        const Type = IndexTableType(Value, indexed);
        fields[i] = .{
            .name = @tagName(indexed),
            .type = Type,
            .default_value_ptr = &(Type{}),
            .is_comptime = false,
            .alignment = @alignOf(Type),
        };
    }
    break :blk @Type(.{ .@"struct" = .{
        .layout = .auto,
        .is_tuple = false,
        .decls = &.{},
        .fields = &fields,
    } });
} = .{},

This makes much more sense if you read it backwards:

  • break :blk returns a value from the block labeled blk:; this is Zig's more imperative take on "everything is an expression".
  • We return a @Type. @Type is the inverse of @typeInfo, in the sense that, hand waving a bit, @Type(@typeInfo(T)) == T. That is, we pass in a description of a type, and get the type back. What we want is a struct with fields, so we pass @"struct" and an array of fields (see the sketch right after this list).
  • Each element of fields is an std.builtin.Type.StructField, a description of a field, that is, its type, name, and default.
  • The type is const Type = IndexTableType(Value, indexed); That is, the index table for the indexed field of Value.
  • The name matches the name of the index field.
  • And the default is just a default for the Type.
  • Finally, we need indexed_fields.len fields.
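
To make the @Type/@typeInfo duality concrete, here is a tiny sketch of the round trip (Point and Clone are hypothetical names):

const Point = struct { x: i32, y: i32 };
const Clone = @Type(@typeInfo(Point)); // reify the description back into a type
comptime {
    assert(@FieldType(Clone, "x") == i32);
}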

Now that we have all tables, create and update are relatively straightforward. For create, we need to make sure to insert the appropriate values into the objects table and into all of the indexes:

pub fn create(
    bundle: *Bundle,
    gpa: std.mem.Allocator,
    value: Value,
) !Value.ID {
    assert(@intFromEnum(value.id) == 0);

    try bundle.objects.reserve(gpa, 1);
    inline for (indexed_fields) |field| {
        try @field(bundle.indexes, @tagName(field))
                .reserve(gpa, 1);
    }
    errdefer comptime unreachable;

    bundle.id_counter += 1;
    const id: Value.ID = @enumFromInt(bundle.id_counter);

    var value_with_id = value;
    value_with_id.id = id;

    bundle.objects.insert(value_with_id);
    inline for (indexed_fields) |indexed_field| {
        const field = @tagName(indexed_field);
        @field(bundle.indexes, field)
            .insert(.{ .field = @field(value, field), .id = id });
    }

    return id;
}

We start with asserting that the id is 0. It's our job to assign the id! But, before we do that, we reserve space for one more entry in all the tables. This is the part of the function where we allocate, and where we can fail. The cryptic errdefer comptime unreachable is a Zig tongue twister saying that no errors can happen after this point in the function. Separating memory reservation from actual modification is helpful to make sure that the data structure remains consistent even in the face of a memory error.

Had we not split the fallible reserve from the infallible insert, and kept the allocation inside insert, we could have ended up in a situation where a value is inserted into only some of the indexes.
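
Here is the mechanism in isolation, with hypothetical helpers. An errdefer body runs on any error return that occurs after it, and analyzing comptime unreachable is a compile error, so any try placed below the errdefer is rejected at compile time:

fn create_sketch(gpa: std.mem.Allocator) !void {
    try reserve_everything(gpa); // all fallible work happens up front
    errdefer comptime unreachable; // "no errors past this point"
    modify_infallibly(); // ok: cannot return an error
    // try reserve_everything(gpa); // compile error: this would make
    //                              // the errdefer body reachable
}

fn reserve_everything(gpa: std.mem.Allocator) !void {
    const probe = try gpa.alloc(u8, 1); // may fail with error.OutOfMemory
    gpa.free(probe);
}

fn modify_infallibly() void {
    // hypothetical: modify with capacity already reserved
}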

I must admit that I am deeply skeptical that it is possible to consistently handle these kinds of memory allocation errors correctly; I am in the "abort on OOM" camp personally. As a quick quiz: have you noticed where we didn't handle this issue correctly in the code we've already seen?

With that throat clearing done, the actual logic is straightforward:

  • assign id,
  • insert the value into the objects table,
  • and then, for each of the indexed fields, insert the (field, id) pair into the corresponding index table. The inline for loop is guaranteed to be fully unrolled at compile time.

As usual, we use @field to get a field by name (c.f. JavaScript's obj.foo vs obj["foo"]).

The update is even simpler, as we don't need to allocate new memory. We just remove the old values and insert the new ones:

pub fn update(bundle: *Bundle, value_new: Value) void {
    const id = value_new.id;
    assert(@intFromEnum(id) != 0);
    const value_old = bundle.get(value_new.id).?;
    assert(value_old.id == id);

    bundle.objects.remove(value_old);
    bundle.objects.insert(value_new);

    inline for (indexed_fields) |indexed_field| {
        const field = @tagName(indexed_field);
        @field(bundle.indexes, field).remove(.{
            .field = @field(value_old, field),
            .id = id,
         });
        @field(bundle.indexes, field).insert(.{
            .field = @field(value_new, field),
            .id = id,
         });
    }
}

Although simple, this is the trick that makes the whole relational model work. Note how we keep the indexes consistent by looking up the old value, and removing the corresponding old pairs from the indexes.

If that was too easy, don't worry, we'll do filter next, and that's the toughest one in the entire exercise :)

Merge Sort Join

Let's revisit our original example:

var transfers_buffer: [10]Transfer = undefined;
const alice_to_bob_transfers = db.transfer.filter(
    .{ .debit_account = alice, .credit_account = bob },
    &transfers_buffer,
);
for (alice_to_bob_transfers) |t| {
    std.debug.print("alice to bob: from={} to={} amount={}\n", .{
        t.debit_account,
        t.credit_account,
        t.amount,
    });
}

The filter takes a filtering condition: equality conditions on a subset of indexed fields. It then fills the output buffer with as many objects as possible that match all the conditions. How can we do that efficiently?

Recall that our index tables are sorted by the field's value first, and then by the object's ID. So, to find all transfers with .debit_account = alice, we binary search the debit account index for pairs starting with alice. Similarly, we can find all transfers with .credit_account = bob by binary searching in the other index table.

In both indexes, we get some slice of pairs whose first components are alice and bob respectively. What's more, the second components of the pairs are sorted transfer ids! So, if we want to find the ids which match both conditions, we merge two sorted sequences of ids!
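
Here is the core of that idea for the two-index case, over plain u64 ids (a standalone sketch; the real filter below generalizes it to k indexes and comptime field names):

fn intersect_sorted(a: []const u64, b: []const u64, out: []u64) []u64 {
    var i: usize = 0;
    var j: usize = 0;
    var n: usize = 0;
    while (i < a.len and j < b.len and n < out.len) {
        if (a[i] == b[j]) {
            out[n] = a[i]; // id present in both indexes
            n += 1;
            i += 1;
            j += 1;
        } else if (a[i] < b[j]) {
            i += 1; // advance the finger pointing at the smaller id
        } else {
            j += 1;
        }
    }
    return out[0..n];
}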

Now let's do this for an arbitrary number of indexes:

pub fn filter(bundle: *Bundle, query: anytype, out: []Value) []Value {
    ...
}

We can't really type the query parameter, so we use Zig's anytype here. This is not dynamic typing; rather, it's monomorphization maximus: every distinct type of query will generate a fresh copy of the filter function.
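
A tiny sketch of what anytype means (type_name_of is my own example): each distinct argument type stamps out a separate function at compile time:

fn type_name_of(value: anytype) []const u8 {
    return @typeName(@TypeOf(value)); // resolved at compile time
}

// type_name_of(@as(u32, 92)) and type_name_of("hello") compile
// into two unrelated specializations.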

First, we reflect on query to figure out how many index tables we need to intersect:

const fields = comptime std.meta.fieldNames(@TypeOf(query));

Then, we set up a bunch of indexes (the ijk kind, not the database index kind). We need an index into the output buffer, an index into the object table, and an index for each index table:

var out_index: usize = 0;
var object_index: usize = 0;
var indexes: [fields.len]usize = @splat(0);

Then, for each index table, we'd want to get a slice of pairs that match the query. Ideally, we'd have fields.len local variables, one variable per query field, but Zig doesn't allow creating local variables via reflection.

What we can have, though, is a single variable which is a tuple of slices. To create such a thing, we first need to create its type:

const TupleOfSlices = comptime blk: {
    var components: [fields.len]type = undefined;
    for (0..fields.len) |i| {
        const IndexTable = @FieldType(
            @TypeOf(bundle.indexes),
            fields[i],
        );
        const Pair = IndexTable.Value;
        components[i] = []Pair;
    }
    break :blk std.meta.Tuple(&components);
};

var slices: TupleOfSlices = undefined;

Then, we can use the search function to find the actual slices:

inline for (fields, 0..) |field, i| {
    const index = @field(bundle.indexes, field).search(.{
        .field = @field(query, field),
        .id = @enumFromInt(0),
    }, 0);
    slices[i] = @field(bundle.indexes, field)
        .values.items[index..];
}

Now, the merge algorithm proper. We point a finger at each of the slices and, on each step, advance the fingers that point at the smallest ID. If at some point all our fingers point at the same id, we fetch the corresponding Value and add it to the output.

Ideally, we'd use a k-way merge here (comptime-specialized to a particular k!), but, for simplicity, we'll use a linear search to find the lowest id. We will be iterating until either we run out of space in the output buffer, or we run off the end of one of our slices:

outer: while (true) {
    if (out_index == out.len) break :outer;
    inline for (0..fields.len) |i| {
        if (indexes[i] == slices[i].len) {
            break :outer;
        }
        if (slices[i][indexes[i]].field != @field(query, fields[i])) {
            break :outer;
        }
    }

    ...
}

Then, we find the minimum ID:

var id_min = slices[0][indexes[0]].id;
inline for (1..slices.len) |i| {
    const id_next = slices[i][indexes[i]].id;
    if (@intFromEnum(id_next) < @intFromEnum(id_min)) {
        id_min = id_next;
    }
}

Then, we advance all slices with the minimal ID, counting them:

var advanced_count: u32 = 0;
inline for (0..slices.len) |i| {
    if (slices[i][indexes[i]].id == id_min) {
        indexes[i] += 1;
        advanced_count += 1;
    }
}

If we advanced all the slices (that is, all ids at a particular position are the same), we look up the corresponding transfer in the object table and add it to the output:

if (advanced_count == slices.len) {
    object_index = bundle.objects.search(id_min, object_index);
    assert(object_index < bundle.objects.values.items.len);
    const value = bundle.objects.values.items[object_index];

    inline for (fields) |field| {
        assert(@field(value, field) == @field(query, field));
    }

    out[out_index] = value;
    out_index += 1;
    object_index += 1;
}

This is where we pass a non-zero index to search!

Altogether:

pub fn filter(bundle: *Bundle, query: anytype, out: []Value) []Value {
    const fields = comptime std.meta.fieldNames(@TypeOf(query));

    var indexes: [fields.len]usize = @splat(0);
    var out_index: usize = 0;
    var object_index: usize = 0;

    const TupleOfSlices = comptime blk: {
        var components: [fields.len]type = undefined;
        for (0..fields.len) |i| {
            const IndexTable = @FieldType(
                @TypeOf(bundle.indexes),
                fields[i],
            );
            const Pair = IndexTable.Value;
            components[i] = []Pair;
        }
        break :blk std.meta.Tuple(&components);
    };

    var slices: TupleOfSlices = undefined;

    inline for (fields, 0..) |field, i| {
        const index = @field(bundle.indexes, field).search(.{
            .field = @field(query, field),
            .id = @enumFromInt(0),
        }, 0);
        slices[i] = @field(bundle.indexes, field)
            .values.items[index..];
    }

    outer: while (true) {
        if (out_index == out.len) break :outer;
        inline for (0..fields.len) |i| {
            if (indexes[i] == slices[i].len) {
                break :outer;
            }
            if (slices[i][indexes[i]].field != @field(query, fields[i])) {
                break :outer;
            }
        }

        var id_min = slices[0][indexes[0]].id;
        inline for (1..slices.len) |i| {
            const id_next = slices[i][indexes[i]].id;
            if (@intFromEnum(id_next) < @intFromEnum(id_min)) {
                id_min = id_next;
            }
        }

        var advanced_count: u32 = 0;
        inline for (0..slices.len) |i| {
            if (slices[i][indexes[i]].id == id_min) {
                indexes[i] += 1;
                advanced_count += 1;
            }
        }

        if (advanced_count == slices.len) {
            object_index = bundle.objects.search(id_min, object_index);
            assert(object_index < bundle.objects.values.items.len);
            const value = bundle.objects.values.items[object_index];

            inline for (fields) |field| {
                assert(@field(value, field) == @field(query, field));
            }

            out[out_index] = value;
            out_index += 1;
            object_index += 1;
        }
    }

    return out[0..out_index];
}

Congratulations! We've finished BundleType, which means we are past the worst of it, and are almost at the end!

The Database Factory

We only need code to turn

const DB = DBType(.{
    .tables = .{
        .account = Account,
        .transfer = Transfer,
    },
    .indexes = .{
        .transfer = .{
            .debit_account, .credit_account,
        },
    },
});

into a bunch of bundles with indexes, but at this point this should be trivial:

pub fn DBType(comptime schema: anytype) type {
    const bundle_names =
        std.meta.fieldNames(@TypeOf(schema.tables));
    var bundles: [bundle_names.len]std.builtin.Type.StructField =
        undefined;
    for (bundle_names, 0..) |name, i| {
        const Value = @field(schema.tables, name);
        const indexes =
            if (@hasField(@TypeOf(schema.indexes), name))
                @field(schema.indexes, name)
            else
                .{};
        const Bundle = BundleType(Value, &indexes);
        bundles[i] = .{
            .name = name,
            .type = Bundle,
            .default_value_ptr = &Bundle{},
            .is_comptime = false,
            .alignment = @alignOf(Bundle),
        };
    }

    return @Type(.{ .@"struct" = .{
        .layout = .auto,
        .is_tuple = false,
        .decls = &.{},
        .fields = &bundles,
    } });
}

Here's the entire thing, db.zig, just under 300 lines:

const std = @import("std");
const assert = std.debug.assert;

pub fn DBType(comptime schema: anytype) type {
    const bundle_names =
        std.meta.fieldNames(@TypeOf(schema.tables));
    var bundles: [bundle_names.len]std.builtin.Type.StructField =
        undefined;
    for (bundle_names, 0..) |name, i| {
        const Value = @field(schema.tables, name);
        const indexes =
            if (@hasField(@TypeOf(schema.indexes), name))
                @field(schema.indexes, name)
            else
                .{};
        const Bundle = BundleType(Value, &indexes);
        bundles[i] = .{
            .name = name,
            .type = Bundle,
            .default_value_ptr = &Bundle{},
            .is_comptime = false,
            .alignment = @alignOf(Bundle),
        };
    }

    return @Type(.{ .@"struct" = .{
        .layout = .auto,
        .is_tuple = false,
        .decls = &.{},
        .fields = &bundles,
    } });
}

pub fn BundleType(
    comptime Value: type,
    comptime indexed_fields: []const std.meta.FieldEnum(Value),
) type {
    return struct {
        id_counter: u64 = 0,

        objects: TableType(Value.ID, Value, struct {
            pub fn key_fn(value: Value) Value.ID {
                return value.id;
            }
            pub fn key_cmp(lhs: Value.ID, rhs: Value.ID) std.math.Order {
                return std.math.order(@intFromEnum(lhs), @intFromEnum(rhs));
            }
        }) = .{},

        indexes: blk: {
            var fields: [indexed_fields.len]std.builtin.Type.StructField = undefined;
            for (indexed_fields, 0..) |indexed, i| {
                const Type = IndexTableType(Value, indexed);
                fields[i] = .{
                    .name = @tagName(indexed),
                    .type = Type,
                    .default_value_ptr = &(Type{}),
                    .is_comptime = false,
                    .alignment = @alignOf(Type),
                };
            }
            break :blk @Type(.{ .@"struct" = .{
                .layout = .auto,
                .is_tuple = false,
                .decls = &.{},
                .fields = &fields,
            } });
        } = .{},

        const Bundle = @This();

        pub fn get(bundle: *Bundle, id: Value.ID) ?Value {
            return bundle.objects.get(id);
        }

        pub fn create(
            bundle: *Bundle,
            gpa: std.mem.Allocator,
            value: Value,
        ) !Value.ID {
            assert(@intFromEnum(value.id) == 0);

            try bundle.objects.reserve(gpa, 1);
            inline for (indexed_fields) |field| {
                try @field(bundle.indexes, @tagName(field))
                    .reserve(gpa, 1);
            }
            errdefer comptime unreachable;

            bundle.id_counter += 1;
            const id: Value.ID = @enumFromInt(bundle.id_counter);

            var value_with_id = value;
            value_with_id.id = id;

            bundle.objects.insert(value_with_id);
            inline for (indexed_fields) |indexed_field| {
                const field = @tagName(indexed_field);
                @field(bundle.indexes, field)
                    .insert(.{ .field = @field(value, field), .id = id });
            }

            return id;
        }

        pub fn update(bundle: *Bundle, value_new: Value) void {
            const id = value_new.id;
            assert(@intFromEnum(id) != 0);
            const value_old = bundle.get(value_new.id).?;
            assert(value_old.id == id);

            bundle.objects.remove(value_old);
            bundle.objects.insert(value_new);

            inline for (indexed_fields) |indexed_field| {
                const field = @tagName(indexed_field);
                @field(bundle.indexes, field)
                    .remove(.{ .field = @field(value_old, field), .id = id });
                @field(bundle.indexes, field)
                    .insert(.{ .field = @field(value_new, field), .id = id });
            }
        }

        pub fn filter(bundle: *Bundle, query: anytype, out: []Value) []Value {
            const fields = comptime std.meta.fieldNames(@TypeOf(query));

            var indexes: [fields.len]usize = @splat(0);
            var out_index: usize = 0;
            var object_index: usize = 0;

            const TupleOfSlices = comptime blk: {
                var components: [fields.len]type = undefined;
                for (0..fields.len) |i| {
                    const IndexTable = @FieldType(
                        @TypeOf(bundle.indexes),
                        fields[i],
                    );
                    const Pair = IndexTable.Value;
                    components[i] = []Pair;
                }
                break :blk std.meta.Tuple(&components);
            };

            var slices: TupleOfSlices = undefined;

            inline for (fields, 0..) |field, i| {
                const index = @field(bundle.indexes, field).search(.{
                    .field = @field(query, field),
                    .id = @enumFromInt(0),
                }, 0);
                slices[i] = @field(bundle.indexes, field)
                    .values.items[index..];
            }

            outer: while (true) {
                if (out_index == out.len) break :outer;
                inline for (0..fields.len) |i| {
                    if (indexes[i] == slices[i].len) break :outer;
                    if (slices[i][indexes[i]].field != @field(query, fields[i])) break :outer;
                }

                var id_min = slices[0][indexes[0]].id;
                inline for (1..slices.len) |i| {
                    if (@intFromEnum(slices[i][indexes[i]].id) < @intFromEnum(id_min)) {
                        id_min = slices[i][indexes[i]].id;
                    }
                }

                var advanced_count: u32 = 0;
                inline for (0..slices.len) |i| {
                    if (slices[i][indexes[i]].id == id_min) {
                        indexes[i] += 1;
                        advanced_count += 1;
                    }
                }

                if (advanced_count == slices.len) {
                    object_index = bundle.objects.search(id_min, object_index);
                    assert(object_index < bundle.objects.values.items.len);
                    const value = bundle.objects.values.items[object_index];

                    inline for (fields) |field| {
                        assert(@field(value, field) == @field(query, field));
                    }

                    out[out_index] = value;
                    out_index += 1;
                    object_index += 1;
                }
            }

            return out[0..out_index];
        }
    };
}

fn IndexTableType(comptime Value: type, comptime field: std.meta.FieldEnum(Value)) type {
    const FieldType = @FieldType(Value, @tagName(field));

    const Pair = struct {
        field: FieldType,
        id: Value.ID,
    };

    return TableType(Pair, Pair, struct {
        pub fn key_fn(value: Pair) Pair {
            return value;
        }

        pub fn key_cmp(lhs: Pair, rhs: Pair) std.math.Order {
            return order_by(Pair, lhs, rhs, &.{ .field, .id });
        }
    });
}

fn order_by(
    comptime T: type,
    lhs: T,
    rhs: T,
    comptime fields: []const std.meta.FieldEnum(T),
) std.math.Order {
    inline for (fields) |field| {
        const order = order_enums(
            @field(lhs, @tagName(field)),
            @field(rhs, @tagName(field)),
        );
        if (order != .eq) return order;
    }
    return .eq;
}

fn order_enums(lhs: anytype, rhs: @TypeOf(lhs)) std.math.Order {
    return switch (@typeInfo(@TypeOf(lhs))) {
        .int => std.math.order(lhs, rhs),
        .@"enum" => std.math.order(
            @intFromEnum(lhs),
            @intFromEnum(rhs),
        ),
        else => comptime unreachable,
    };
}

fn TableType(
    comptime KeyType: type,
    comptime ValueType: type,
    comptime Functions: type,
) type {
    const key_fn = Functions.key_fn;
    const key_cmp = Functions.key_cmp;

    return struct {
        values: std.ArrayListUnmanaged(Value) = .empty,

        pub const Key = KeyType;
        pub const Value = ValueType;
        const Table = @This();

        pub fn get(table: *const Table, key: Key) ?Value {
            const index = table.search(key, 0);
            if (index >= table.values.items.len) return null;

            const value = table.values.items[index];
            if (key_cmp(key, key_fn(value)) != .eq) return null;

            return value;
        }

        pub fn reserve(table: *Table, gpa: std.mem.Allocator, extra: usize) !void {
            try table.values.ensureUnusedCapacity(gpa, extra);
        }

        pub fn insert(table: *Table, value: Value) void {
            assert(table.values.unusedCapacitySlice().len > 0);
            const index = table.search(key_fn(value), 0);
            table.values.insertAssumeCapacity(index, value);
        }

        pub fn remove(table: *Table, value: Value) void {
            const index = table.search(key_fn(value), 0);
            const removed = table.values.orderedRemove(index);
            assert(std.meta.eql(value, removed));
        }

        pub fn search(table: *const Table, key: Key, start_index: usize) usize {
            return start_index + std.sort.lowerBound(
                Value,
                table.values.items[start_index..],
                key,
                compare_fn,
            );
        }

        fn compare_fn(key: Key, value: Value) std.math.Order {
            return key_cmp(key, key_fn(value));
        }
    };
}

Post Script

Note that, while the exercise is useful, it deliberately focuses narrowly on a single aspect of Zig: comptime reflection. You should avoid this feature if possible; hopefully, I have successfully convinced you that it can lead to somewhat mind-bending code. The topic of Zig in general is much larger, and I highly recommend the following resources, in this order: