Hard Mode Rust
This post is a case study of writing a Rust application using only minimal, artificially constrained API (eg, no dynamic memory allocation). It assumes a fair bit of familiarity with the language.
Hard Mode Rust
The back story here is a particular criticism of Rust and C++ from hard-core C programmers. This criticism is aimed at RAII – the language-defining feature of C++, which was wholesale imported to Rust as well. RAII makes using various resources requiring cleanups (file descriptors, memory, locks) easy — any place in the program can create a resource, and the cleanup code will be invoked automatically when needed. And herein lies the problem — because allocating resources becomes easy, RAII encourages a sloppy attitude to resources, where they are allocated and destroyed all over the place. In particular, this leads to:
- Decrease in reliability. Resources are usually limited in principle, but actual resource exhaustion happens rarely. If resources are allocated throughout the program, there are many virtually untested codepaths.
- Lack of predictability. It usually is impossible to predict up-front how much resources will the program consume. Instead, resource-consumption is observed empirically.
- Poor performance. Usually, it is significantly more efficient to allocate and free resources in batches. Cleanup code for individual resources is scattered throughout codebase, increasing code bloat
- Spaghetti architecture. Resource allocation is an architecturally salient thing. If all resource management is centralized to a single place, it becomes significantly easier to understand lifecycle of resources.
I think this is a fair criticism. In fact, I think this is the same criticism that C++ and Rust programmers aim at garbage collected languages. This is a spectrum:
Rust programmers typically are not exposed to the lowest level of this pyramid. But there’s a relatively compact exercise to gain the relevant experience: try re-implementing your favorite Rust programs on hard mode.
Hard Mode means that you split your program into
std binary and
#![no_std] no-alloc library.
Only the small binary is allowed to directly ask OS for resources.
For the library, all resources must be injected.
In particular, to do memory allocation, the library receives a slice of bytes of a fixed size, and should use that for all storage.
Something like this:
So, this is what the post is about: my experience implementing a toy hard mode ray tracer. You can find the code on GitHub: http://github.com/matklad/crt.
The task of a ray tracer is to convert a description of a 3D scene like the following one:
Into a rendered image like this:
This works rather intuitive conceptually.
First, imagine the above scene, with an infinite fuchsia colored plane and a red Utah teapot hovering above that.
Then, imagine a camera standing at
0,10,-50 (in cartesian coordinates) and aiming at the origin.
Now, draw an imaginary rectangular 80x60 screen at a focus distance of 50 from the camera along its line of side.
To get a 2D picture, we shoot a ray from the camera through each “pixel” on the screen, note which object on the scene is hit (plan, teapot, background), and color the pixel accordingly.
See PBRT Book if you feel like falling further into this particular rabbit hole (warning: it is very deep) (I apologize for “little square pixels” simplification I use throughout the post :-) ).
I won’t focus on specific algorithms to implement that (indeed, crt is a very naive tracer), but rather highlight Hard Mode Rust specific concerns.
Ultimately, the out of a ray tracer is a 2D buffer with 8bit RGB pixels. One would typically represent it as follows:
For us, we want someone else (main) to allocate that box of colors for us, so instead we do the following:
'm lifetime we use for abstract memory managed elsewhere.
Note how the struct grew an extra lifetime!
This is extra price we have to pay for not relying on RAII to cleanup resources for us:
Note in particular how the
Ctx struct now has to include two lifetimes.
This feels unnecessary:
'a is shorter than
I wish it was possible to somehow abstract that away:
I don’t think that’s really possible (earlier post about this). In particular, the following would run into variance issues:
Ultimately, this is annoying, but not a deal breaker.
rgb::Buf<'_>, we can sketch the program:
Hard Mode Rayon
Ray tracing is an embarrassingly parallel task — the color of each output pixel can be computed independently. Usually, the excellent rayon library is used to take advantage of parallelism, but for our raytracer I want to show a significantly simpler API design for taking advantage of many cores. I’ve seen this design in Sorbet, a type checker for Ruby.
Here’s how a
render function with support for parallelism looks:
The interface here is the
in_parallel function, which takes another function as an argument and runs it, in parallel, on all available threads.
You typically use it like this:
This is similar to a typical threadpool, but different.
Similar to a threadpool, there’s a number of threads (typically one per core) which execute arbitrary jobs.
The first difference is that a typical threadpool sends a job to to a single thread, while in this design the same job is broadcasted to all threads.
The job is
Fn + Sync rather than
FnOnce + Send.
The second difference is that we block until the job is done on all threads, so we can borrow data from the stack.
It’s on the caller to explicitly implement a concurrent queue to distributed specific work items. In my implementation, I slice the image in rows
main, we implement a concrete
ThreadPool by spawning a thread per core:
The scenes we are going to render are fundamentally dynamically sized. They can contain arbitrary number of objects. So we can’t just statically allocate all the memory up-front. Instead, there’s a CLI argument which sets the amount of memory a ray tracer can use, and we should either manage with that, or return an error. So we do need to write our own allocator. But we’ll try very hard to only allocate the memory we actually need, so we won’t have to implement memory deallocation at all. So a simple bump allocator would do:
We can create an allocator from a slice of bytes, and then ask it to allocate values and arrays.
alloc looks like this:
To make this fully kosher we need to handle alignment as well, but I cut that bit out for brevity.
For allocating arrays, it’s useful if all-zeros bitpattern is a valid default instance of type, as that allows to skip element-wise initialization. This condition isn’t easily expressible in today’s Rust though, so we require initializing every array member.
The result of an allocation is
&'m T – this is how we spell
Box<T> on hard mode.
The scene contains various objects, like spheres and planes:
Usually, we’d represent a scene as
We could implement a resizable array (
Vec), but doing that would require us to either leak memory, or to implement proper deallocation logic in our allocator, and add destructors to reliably trigger that.
But destructors is exactly something we are trying to avoid in this exercise.
So our scene will have to look like this instead:
And that means we want to know the number of objects we’ll need upfront. The way we solve this problem is by doing two-pass parsing. In the first pass, we just count things, then we allocate them, then we actually parse them into allocated space.
If an error is encountered during parsing, we want to create a helpful error message.
If the message is fully dynamic, we’d have to allocate it into
'm, but it seems simpler to just re-use bits of input for error message.
Error<'i> is tied to the input lifetime
'i, rather memory lifetime
One interesting type of object on the scene is a mesh of triangles (for example, the teapot is just a bunch of triangles). A naive way to represent a bunch of triangles is to use a vector:
This is wasteful: in a mesh, each edge is shared by two triangles. So a single vertex belongs to a bunch of triangles. If we store a vector of triangles, we are needlessly duplicating vertex data. A more compact representation is to store unique vertexes once, and to use indexes for sharing:
Again, on hard mode that would be
And a scene contains a bunch of meshes :
Note how, if the structure is recursive, we have “owned pointers” of
&'m mut T<'m> shape.
Originally I worried that that would cause problem with variance, but it seems to work fine for ownership specifically.
During processing, you still need
&'a mut T<'m> though.
And that’s why parsing functions hold an uncomfortable bunch of lifetimes:
&'i str input and a
&'a mut Mem<'m> memory.
It parses input into a
&'b mut Mesh<'m>.
Bounding Volume Hierarchy
Scene<'m> fully parsed, we can finally get to rendering the picture.
A naive way to do this would be to iterate through each pixel, shooting a ray through it, and then do a nested iterations over every shape, looking for the closest intersection.
That’s going to be slow!
The teapot model contains about 1k triangles, and we have 640*480 pixels, which gives us 307_200_000 ray-triangle intersection tests, which is quite slow even with multithreading.
So we are going to speed this up. The idea is simple — just don’t intersect a ray with each triangle. It is possible to quickly discard batches of triangles. If we have a batch of triangles, we can draw a 3D box around them as a pre-processing step. Now if the ray doesn’t intersect the bounding box, we know that it can’t intersect any of the triangles. So we can use one test with a bounding box instead of many tests for each triangle.
This is of course one-sided — if the ray intersects the box, it might still miss all of the triangles. But, if we place bounding boxes smartly (small boxes which cover many adjacent triangles), we can hope to skip a lot of work.
We won’t go for really smart ways of doing that, and instead will use a simple divide-and-conquer scheme. Specifically, we’ll draw a large box around all triangles we have. Then, we’ll note which dimension of the resulting box is the longest. If, for example, the box is very tall, we’ll cut it in half horizontally, such that each half contains half of the triangles. Then, we’ll recursively subdivide the two halves.
In the end, we get a binary tree, where each node contains a bounding box and two children, whose bounding boxes are contained in the parent’s bounding box. Leaves contains triangles. This construction is called a bounding volume hierarchy, bvh.
To intersect the ray with bvh, we use a recursive procedure. Starting at the root node, we descend into children whose bounding boxes are intersected by the ray. Sometimes we’ll have to descend into both children, but often enough at least one child’s bounding box won’t touch the ray, allowing us to completely skip the subtree.
On easy mode Rust, we can code it like this:
On hard mode, we don’t really love all those separate boxes, we love arrays! So what we’d rather have is
So we want to write the following function which recursively constructs a bvh for a mesh:
The problem is, unlike the parser, we can’t cheaply determine the number of leaves and splits without actually building the whole tree.
So what we are going to do here is to allocate a pointer-tree structure into some scratch space, and then copy that into an
&'m mut array.
How do we find the scratch space?
Our memory is
We allocate stuff from the start of the region.
So we can split of some amount of scratch space from the end:
Stuff we allocate into the first half is allocated “permanently”. Stuff we allocate into the second half is allocated temporarily. When we drop temp buffer, we can reclaim all that space.
This… probably is the most sketchy part of the whole endeavor.
unsafe, requires lifetimes casing, and I actually can’t get it past miri.
But it should be fine, right?
So, I have the following thing API:
It can be used like this:
And here’s how
With this infrastructure in place, we can finally implement bvh construction! We’ll do it in three steps:
- Split of half the memory into a scratch space.
- Build a dynamically-sized tree in that space, counting leaves and interior nodes.
- Allocate arrays of the right size in the permanent space, and copy data over once.
And that’s it! The thing actually works, miri complaints notwithstanding!
Actually, I am impressed.
I was certain that this won’t actually work out, and that I’d have to write copious amount of unsafe to get the runtime behavior I want.
Specifically, I believed that
&'m mut T<'m> variance issue would force my hand to add
'mmm and further lifetimes, but that didn’t happen.
For “owning” pointers,
&'m mut T<'m'> turned out to work fine!
It’s only when processing you might need extra lifetimes.
Parser<'m, 'i, 'a> is at least two lifetimes more than I am completely comfortable with, but I guess I can live with that.
I wonder how far this style of programming can be pushed. Aesthetically, I quite like that I can tell precisely how much memory the program would use!
Code for the post: http://github.com/matklad/crt.
Discussion on /r/rust.