matklad

Code Smell: Concrete Abstraction

This is a hand-wavy philosophical article about programming, without quantifiable justification, but with some actionable advice and a case study.

Suppose that there are two types in the program, Blorb and Gonk. Suppose also that they both can blag.

Does it make sense to add the following trait?

1
2
3
trait Blag {
    fn blag(&mut self);
}

I claim that it makes sense only if you have a function like

1
2
3
fn blagyify<T: Blag>(x: T) {
    ...
}

That is, if some part of you program is generic over T: Blag.

If in every x.blag() the x is either Blorg, or Gonk, but never a T (each usage is concrete), you don’t need this abstraction. “Need” is used in a literal sense here: replace a trait with two inherent methods named blag, and the code will be essentially the same. Using a trait here doesn’t achieve any semantic compression.

Given that abstractions have costs “don’t need” can be strengthen to “probably shouldn’t”.

What Is The Cost of Abstraction?

First is the cognitive cost — generics (and abstractions in general) are often harder to understand than concretions. I think this is true regardless of the abstraction skill. I am skilled with math which is incomparably more complicated than the typical code; still, I find concrete code easier to understand. There are exceptions here (you can do less things with T: Default than with Blorb), but they seem to be exceptions rather than a common case.

Second, in the context of Rust, is the compile time cost. It is important to understand why it is the case. “Traits are more complicated for compiler to understand” would be a wrong reason. Rust uses monomorphization to compile generic code. An fn foo<T> is compiled afresh for each different T, per crate. If foo<T> is defined in crate a, and foo::<i32> is called in crates b and c, then rustc compiles the same code twice.

Third, it often is just more code to write and read. Consider the original Blag example. For non-abstract case, there are two inherent impls with blag function. For abstract case there are these same two impls, plus a trait definition, plus a use Blag on every call-site.

Not going for an abstraction often allows a for more specific interface. A monad in Haskell is a thing with >>=. Which isn’t telling much. Languages like Rust and OCaml can’t express a general monad, but they still have concrete monads. The >>= is called and_then for futures and flat_map for lists. These names are more specific than >>= and are easier to understand. The >>= is only required if you want to write code generic over type of monad itself, which happens rarely.

Another example of abstraction which is used mostly concretely are collection hierarchies. In Java or Scala, there’s a whole type hierarchy for things which can hold other things. Rust’s type system can’t express Collection trait, so we have to get by with using Vec, HashSet and BTreeSet directly. And it isn’t actually a problem in practice. Turns out, writing code which is generic over collections (and not just over iterators) is not that useful. The "but I can change the collection type later" argument also seems overrated — often, there’s only single collection type that makes sense. Moreover, swapping HashSet for BTreeSet is mostly just a change at the definition site, as the two happen to have almost identical interface anyway. The only case where I miss Java collections is when I return Vec<T>, but mean a generic unordered collection. In Java, the difference is captured by List<T> vs Collection<T>. In Rust, there’s nothing built-in for this. It is possible to define a VecSet<T>(Vec<T>), but doesn’t seem worth the effort.

Collections also suffer from >>= problem — collapsing similar synonyms under a single name. Java’s Queue has add, offer, remove, and poll methods, because it needs to be a collection, but also is a special kind of collection. In C++, you have to spell push_back for vector's push operation, so that it duck-types with deque's front and back.

Collection hierarchy is a sufficient, but not necessary condition for mixing up method names. Rust’s BinaryHeap should have had BinaryHeap::pop_max method. Alas, we are stuck with pop, which, coupled with the fact that the heap is surprisingly and uselessly a max-heap, means many student-hours wasted on debugging misbehaving Dijkstra algorithm.

Finally, the promised case study! rust-analyzer needs to convert a bunch of internal type to types suitable for converting them into JSON message of the Language Server Protocol. ra::Completion is converted into lsp::Completion; ra::Completion contains ra::TextRange which is converted to lsp::Range, etc.

The first implementation started with an abstraction for conversion:

1
2
3
4
pub trait Conv {
    type Output;
    fn conv(self) -> Self::Output;
}

This abstraction doesn’t work for all cases — sometimes the conversion requires additional context. For example, to convert a rust-analyzer’s offset (a position of byte in the file) to an LSP position ((line, column) pair), a table with positions of newlines is needed. This is easy to handle:

1
2
3
4
pub trait ConvWith<CTX> {
    type Output;
    fn conv_with(self, ctx: CTX) -> Self::Output;
}

Naturally, there was an intricate web of delegating impls. The typical one looked like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
impl ConvWith<&LineIndex> for TextRange {
    type Output = Range;
    fn conv_with(
        self,
        line_index: &LineIndex,
    ) -> lsp_types::Range {
        Range::new(
            self.start().conv_with(line_index),
            self.end().conv_with(line_index),
        )
    }
}

There were a couple of genuinely generic impls for converting iterators of convertible things.

The code was hard to understand. It also was hard to use: if calling .conv didn’t work immediately, it took a lot of time to find which specific impl didn’t apply. Finally, there were many accidental (as in “accidental complexity”) changes to the shape of code: CTX being passed by value or by reference, switching between generic parameters and associated types, etc.

I was really annoyed by how this conceptually simple pure boilerplate operation got expressed as clever and fancy abstraction. Crucially, almost all of the usages of the abstraction (besides those couple of iterator impls) were concrete. So I replaced the whole edifice with much simpler code, a bunch of functions:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
fn range(
    line_index: &LineIndex,
    range: TextRange,
) -> lsp_types::Range {
    let start = position(line_index, range.start());
    let end = position(line_index, range.end());
    lsp_types::Range::new(start, end)
}

fn position(
    line_index: &LineIndex,
    offset: TextSize,
) -> lsp_types::Position {
    ...
}

Simplicity and ease of use went up tremendously. Now instead of typing x.conv() and trying to figure out why an impl I think should apply doesn’t apply, I just auto-complete to_proto::range and let the compiler tell me exactly which types don’t line up.

I’ve lost fancy iterator impls, but the total diff for the commit was +999,-1123. There was some genuine code re-use in those impls, but it was not justified by the overall compression, even disregarding additional complexity tax.

To sum up, “is this abstraction used exclusively concretely?” is a meaningful question about the overall shape of code. If the answer is “Yes!”, then the abstraction can be replaced by a number of equivalent non-abstract implementations. As the latter tend to be simpler, shorter, and more direct, “Concrete Abstraction” can be considered a code smell. As usual though, any abstract programming advice can be applied only in a concrete context — don’t blindly replace abstractions with concretions, check if provided justifications work for your particular case!

Discussion on /r/rust.