Minimal Version Selection Revisited
In this post, I want to highlight one aspect of Go-style minimal version selection that I completely missed at first. Maybe you missed it too?
If you recall, some years ago there was a discussion of various version selection algorithms in the Go and Rust communities. See this series of articles for a refresher:
https://research.swtch.com/vgo
Broadly, when you write `foo = "1.2.3"` in the manifest for your library, what you are specifying is not a specific version of `foo`, but rather a set of versions (a version constraint). In Cargo, for example, `foo = "1.2.3"` is shorthand for the caret constraint `^1.2.3`, which matches any version from `1.2.3` up to (but not including) `2.0.0`. Some of these versions might not have existed when you wrote the `foo = "1.2.3"` line!
As an author of a library, you can’t actually pick your dependencies exactly. Instead, the final application that uses your library is in charge of picking specific dependencies, including those for your library. In particular, if that application depends on some other library which says `foo = "1.3.0"`, then your library gets at least `1.3.0` as well, although you’ve written `1.2.3`.
In general, there are many ways to pick a specific set of versions that satisfy the constraints. It’s interesting that Go and Rust chose, in some sense, opposite algorithms for solving these constraints. Rust selects maximal versions, while Go selects minimal versions.
The original discussion was focused (or at least I perceived it as being focused) on the algorithmic complexity. The straw-man claim was that min-ver can be solved linearly, while max-ver requires NP-hard SAT-solving. This is not a particularly strong claim:
- First of all, it is plainly wrong. Min-ver can be solved efficiently by a greedy algorithm that picks the minimal versions, while max-ver can be solved efficiently by the same algorithm that just greedily picks maximal versions (see the sketch after this list). NP-hardness comes not from the preferred direction of resolution, but rather from the complexity of version requirements. If you only have semver-open constraints like `^1.2.3`, the solution is easy. If you have non-convex constraints like `(^1.2.3 OR ^1.2.5) AND ≠1.2.4`, then stuff becomes NP-hard.
- Second, SAT is not a big deal. It is well understood how to solve even elaborate constraints sufficiently quickly in practice. This does require a moderate amount of engineering effort, but this effort is a fixed cost that amortizes across the entire ecosystem and can be just dismissed.
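To make the “greedy algorithm” claim concrete, here is a minimal sketch of min-ver resolution in Go. Everything in it is invented for the example: the module names, the versions, and the in-memory requirements table standing in for fetched manifests.

```go
package main

import "fmt"

// requirements maps "module@version" to that version's direct
// dependencies, each expressed as a minimal acceptable version
// (the semver-open "^x.y.z" case).
var requirements = map[string]map[string]string{
	"app@1.0.0": {"foo": "1.2.3", "bar": "2.0.0"},
	"foo@1.2.3": {"baz": "1.0.0"},
	"foo@1.3.0": {"baz": "1.1.0"},
	"bar@2.0.0": {"foo": "1.3.0"},
	"baz@1.0.0": {},
	"baz@1.1.0": {},
}

// minVer greedily walks the graph from the root, keeping for each
// module the maximum of the minimal versions anyone asked for.
// String comparison happens to work for this toy data; real code
// would compare proper semver triples.
func minVer(root string) map[string]string {
	selected := map[string]string{}
	var visit func(node string)
	visit = func(node string) {
		for dep, minReq := range requirements[node] {
			if cur, ok := selected[dep]; ok && cur >= minReq {
				continue // an equal or newer version is already selected
			}
			selected[dep] = minReq
			visit(dep + "@" + minReq)
		}
	}
	visit(root)
	return selected
}

func main() {
	// bar bumps foo to 1.3.0, and foo@1.3.0 in turn bumps baz to
	// 1.1.0: prints map[bar:2.0.0 baz:1.1.0 foo:1.3.0].
	fmt.Println(minVer("app@1.0.0"))
}
```

Max-ver is the same greedy walk with the opposite preference: instead of settling for the minimal version that satisfies a constraint, you upgrade each dependency to the newest available version that does.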
It’s much more interesting to look at the ecosystem implications of the two approaches.
The big benefit of max-ver is the ecosystem-wide live-at-head: if everyone uses the latest versions, then everyone uses the same combination of versions of libraries. In particular, ecosystem-wide testing ends up being concentrated on a few combinations that everyone is using. In practice, if you release a new version of a library and it turns out to be buggy, someone immediately discovers it in their tests, so that you can quickly yank and regroup.
But min-ver has a dual property (this is the new thing I’ve realized only recently). In min-ver, if you get `foo=1.2.3` in your build graph, that means that someone somewhere made an explicit decision to write `foo=1.2.3` in the manifest for their library. That is, any version in your dependency graph is additionally vetted manually by someone who is not the original library developer.
This is quite neat: it puts a natural damper on supply chain attacks. If a bad version of a library is released, someone needs to explicitly opt into this new version. What’s more, the deeper in your dependency tree the library is, the more explicit approvals are required for the library to propagate to your project.
Consider this hypothetical system for specifying dependencies:
In the manifest file, you specify dependencies by listing both a version and a checksum:
```
foo = { version = "1.2.3", checksum = "sha256:bee71ecaffe" }
```
Only direct dependencies are listed. Checksums are, of course, filled in by a tool.
The checksum covers the content of a dependency, including its manifest file. In other words, the checksum for `foo` above implicitly includes the checksums of all direct dependencies of `foo`.
By exploring the set of dependencies from the root, we collect the set of dependencies and their checksums that are protected by some checksum in the root manifest. Then, if we have several versions of a single dependency, we pick the largest one.
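Here is a minimal sketch of that resolution step, with the web of checksum-protected manifests represented as an in-memory table. All names, versions, and checksums are invented placeholders for this hypothetical system.

```go
package main

import "fmt"

// Dep is a single dependency specification from some manifest.
type Dep struct {
	Name, Version, Checksum string
}

// depsOf maps a manifest checksum to the direct dependencies that
// manifest declares; the root hashes pin everything reachable from them.
var depsOf = map[string][]Dep{
	"sha256:root":   {{"foo", "1.2.3", "sha256:foo123"}, {"bar", "2.0.0", "sha256:bar200"}},
	"sha256:foo123": {},
	"sha256:bar200": {{"foo", "1.3.0", "sha256:foo130"}},
	"sha256:foo130": {},
}

// resolve collects every checksum-protected version reachable from
// the root, keeping the largest version per name. String comparison
// stands in for real semver comparison; cycle detection is omitted.
func resolve(rootChecksum string) map[string]Dep {
	selected := map[string]Dep{}
	var visit func(checksum string)
	visit = func(checksum string) {
		for _, d := range depsOf[checksum] {
			if cur, ok := selected[d.Name]; !ok || d.Version > cur.Version {
				selected[d.Name] = d
			}
			visit(d.Checksum)
		}
	}
	visit(rootChecksum)
	return selected
}

func main() {
	// bar's manifest pins foo 1.3.0, which wins over the root's 1.2.3:
	// prints map[bar:{bar 2.0.0 sha256:bar200} foo:{foo 1.3.0 sha256:foo130}].
	fmt.Println(resolve("sha256:root"))
}
```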
This is cool. This is essentially “bring your own registry” version resolution, where the set of available versions is fixed, and pinned by the root hashes. All dependencies in the set have the “min-ver” property that someone somewhere in the dependency graph asked for this version of the dependency explicitly.
We also need neither lockfiles nor Go-style checksum files. Manifests form a distributed lockfile, where each dependency locks a part of the graph, and this lockfile fragment is itself hashed and locked by some direct-dependency specification.
But we still get “dependency resolution”: only direct dependencies are specified, and, if both `foo=1.2.3` and `foo=1.3.0` end up being in the set of dependencies, only `1.3.0` will be included in the final graph.
My understanding is that Zig is trying to do something like this, and, like all things Zig, it feels extremely neat, even if rough around the edges. It never occurred to me before that putting hashes of direct dependencies in the manifest also locks transitive dependencies. So cute!
Update (2024-12-25): It is worth mentioning another qualitative difference between min-ver and max-ver. The latter generally requires some form of centralized registry: an entity which knows all the versions of all the libraries. When you pick a maximal version, you pick the highest version known to the registry. For example, in the case of Cargo, the central registry is implemented as a git repository (https://github.com/rust-lang/crates.io-index) which holds JSON files describing the available versions.
With min-ver, you don’t need a registry, as the set of versions to choose from can be incrementally discovered by crawling the web of dependencies starting from the final application, as long as dependency specifications include not only a name and a version, but also a location (URL) and a checksum.
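Here is a toy sketch of such a crawl, with an in-memory map standing in for the network. The URLs and the one-dependency-per-line manifest format are invented for the example; the point is that each manifest embeds the checksums of its direct dependencies, so the root checksum transitively authenticates everything discovered.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// Dep carries everything needed for registry-free discovery.
type Dep struct {
	Name, Version, URL, Checksum string
}

// manifests stands in for the network: URL -> manifest body.
var manifests = map[string]string{}

func checksum(body string) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256([]byte(body)))
}

// fetch simulates a download and verifies the body against the
// checksum taken from the referring manifest, so no host is trusted.
func fetch(dep Dep) string {
	body := manifests[dep.URL]
	if checksum(body) != dep.Checksum {
		panic("checksum mismatch for " + dep.Name)
	}
	return body
}

// parseDeps decodes toy "name version url checksum" manifest lines.
func parseDeps(body string) []Dep {
	var deps []Dep
	for _, line := range strings.Split(body, "\n") {
		if f := strings.Fields(line); len(f) == 4 {
			deps = append(deps, Dep{f[0], f[1], f[2], f[3]})
		}
	}
	return deps
}

// crawl incrementally discovers all candidate versions reachable
// from the given dependencies, without consulting any registry.
func crawl(deps []Dep, seen map[string]bool) []Dep {
	var all []Dep
	for _, dep := range deps {
		key := dep.Name + "@" + dep.Version
		if seen[key] {
			continue
		}
		seen[key] = true
		all = append(all, dep)
		all = append(all, crawl(parseDeps(fetch(dep)), seen)...)
	}
	return all
}

func main() {
	// Build the toy web bottom-up: bar first, then foo, whose
	// manifest embeds bar's checksum and thereby pins it.
	bar := Dep{"bar", "1.0.0", "https://example.com/bar/1.0.0", checksum("")}
	manifests[bar.URL] = ""

	fooBody := strings.Join([]string{bar.Name, bar.Version, bar.URL, bar.Checksum}, " ")
	foo := Dep{"foo", "1.3.0", "https://example.com/foo/1.3.0", checksum(fooBody)}
	manifests[foo.URL] = fooBody

	for _, d := range crawl([]Dep{foo}, map[string]bool{}) {
		fmt.Println(d.Name, d.Version, d.Checksum)
	}
}
```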
Though, keep in mind that you still need some solution to ensure availability of dependencies: with a large enough dependency tree, there will be at least one dependency hosted from grandma’s old PC, prone to flooding. You can use either a Go-style global shared cache (it doesn’t have to be trusted, as long as hashes are included in the dependency specification), or make it very easy to create a project-local cache: a `./vendor` directory in the same repository which embeds all the dependencies (again, it is important that `./vendor` is strictly a cache, and the source of truth is the checksum in the manifest).
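A sketch of what “strictly a cache” could mean in practice: before using a vendored tree, re-derive its checksum and compare it against the one pinned in the manifest. The hashing scheme here (SHA-256 over sorted relative paths and file contents) and the `vendor/foo-1.2.3` path are assumptions for illustration, not any particular tool’s format.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// hashTree computes a deterministic digest of a directory: file
// paths are sorted and slash-normalized so the result is reproducible.
func hashTree(dir string) (string, error) {
	var files []string
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() {
			files = append(files, path)
		}
		return nil
	})
	if err != nil {
		return "", err
	}
	sort.Strings(files)
	h := sha256.New()
	for _, path := range files {
		rel, err := filepath.Rel(dir, path)
		if err != nil {
			return "", err
		}
		io.WriteString(h, filepath.ToSlash(rel)+"\x00")
		f, err := os.Open(path)
		if err != nil {
			return "", err
		}
		_, err = io.Copy(h, f)
		f.Close()
		if err != nil {
			return "", err
		}
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil)), nil
}

func main() {
	want := "sha256:bee71ecaffe" // the checksum pinned in the manifest (placeholder)
	got, err := hashTree("vendor/foo-1.2.3")
	if err != nil || got != want {
		// The cache is not authoritative: on any mismatch, re-fetch
		// from the URL in the dependency specification.
		fmt.Println("cache miss or tampering, re-fetching")
		return
	}
	fmt.Println("vendored copy verified")
}
```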