Context Switches

I guess I can call myself a professional asynchronous programmer, given that I’ve been working on stateful, concurrent, highly asynchronous systems for the past few years. However, I still feel like a clueless novice: I don’t have a solid mental model for concurrent programming. For example, https://github.com/jimblandy/context-switch compares context-switch overhead between OS threads and tokio tasks and finds that switching tasks is faster, unless you pin the whole thing to a single core! I can’t really explain this result from first principles.

One thing I’ve realized recently is that there are a lot of different things that can be called a “context switch”: we are doing one thing, then we stop and start doing another. In this post, I want to explain what I know about each case, mostly to organize my own thoughts. I haven’t double-checked everything, so a fair amount here is probably oversimplified, wrong, or was last correct 30 years ago!

Time Sharing

The central example of a context switch! You are running two applications, say, a compiler and a breakout game, and the operating system switches between the two every few milliseconds or so, providing the illusion of concurrency. What happens when we switch from compiling to gaming?

Let’s first enumerate the things we need to switch. The main thing, of course, is the instruction pointer, the address of the “current” instruction. Then, there’s processor state, the contents of all the registers that we need to save and restore. Curiously, the state of memory doesn’t need to be switched: each application uses its own physical pages, and both processes stay resident in memory. What we do need to change is the memory mapping from virtual addresses to physical addresses.

CPU State Machine

But to understand what actually happens when we switch all those things, we need to look inside the CPU. The processor is a looping, clocked state machine. The heart of the CPU is a combinational circuit. Mathematically, it is a function from an n-bit vector to an m-bit vector, which can be represented as a logical formula. Physically, it is a circuit built by wiring transistors up into logical gates. Mathematically, the function produces the answer immediately. Physically, when you set the inputs of the circuit, it needs some time to “settle down” before the values of the outputs become stable.

Then, to this combinational core we add state: the registers. Each register is an electronic circuit that “holds” a bit vector. Depending on a control wire, the register can either “read” the value (set the wires going out of the register to its stored state) or “write” it (set its stored state to match the state of the input wires).

Registers are wired to (a part of) the input of the combinational circuit, and the output is wired back to the registers. So the circuit can determine the next value of the registers based on their current value. Now, if you directly wire outputs to inputs, you get a feedback loop, and nothing good comes of that. So we need one more component: a clock.

A clock is simply a wire which is set every so often (for a 5GHz CPU, every 0.2 nanoseconds). Roughly, the registers are read at the beginning of a clock cycle and written at the end of it. The clock is what makes the looping state machine step forward without confusing the “next” and the “previous” states. The clock period corresponds to the time required for the combinational circuit to settle down after the inputs are set.

This boolean logic + registers + clock aggregate can now execute instructions like add a, b, c. Crucially, “executing” advances the instruction pointer register as a side effect, which causes the CPU to fetch and execute the following instruction.
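
To make the state machine framing concrete, here is a toy software model of the loop, in Rust. It is a sketch for intuition only (the instruction encoding, the four registers, and the halting convention are all made up): the registers are the state, the combinational circuit is a pure function from the current state to the next one, and the clock is the loop that latches the next state in.

#[derive(Clone, Copy)]
enum Insn {
    Add(usize, usize, usize), // add dst, a, b
    Halt,
}

#[derive(Clone)]
struct State {
    ip: usize,      // instruction pointer
    regs: [u64; 4], // architectural registers
}

// The “combinational circuit”: a pure function from the current state
// to the next one. Returns None when the machine halts.
fn next_state(s: &State, program: &[Insn]) -> Option<State> {
    let mut n = s.clone();
    match program[s.ip] {
        Insn::Add(dst, a, b) => {
            n.regs[dst] = s.regs[a].wrapping_add(s.regs[b]);
            n.ip += 1; // advancing the instruction pointer is part of “executing”
            Some(n)
        }
        Insn::Halt => None,
    }
}

fn main() {
    let program = [Insn::Add(2, 0, 1), Insn::Halt]; // add c, a, b
    let mut state = State { ip: 0, regs: [2, 3, 0, 0] };
    // The “clock”: each iteration latches the next state into the registers.
    while let Some(next) = next_state(&state, &program) {
        state = next;
    }
    println!("c = {}", state.regs[2]); // prints “c = 5”
}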

It might seem that executing an instruction is fully handled by the stateless boolean logic and occupies exactly one cycle, but that is generally not the case: some instructions take more than one cycle to execute. There are two reasons for that.

First, some instructions are genuinely asynchronous. For example, loading data from memory takes time. When executing a load instruction, the CPU sets the output wires going towards memory to the address that needs to be loaded, and then, on every cycle, checks whether the input wire from memory signals that the data is ready. There’s a little loop in the state machine!

Second, some instructions that could be computed by a boolean circuit in a single cycle are nevertheless split into phases and executed over several ticks. While you can build arbitrarily complex boolean circuits, physics dictates that more complex circuits take longer to settle down, which forces a longer clock period. If you split a complex instruction into several cycles, you can make the cycles shorter, reducing the wall clock time.

In other words, while the CPU is a state machine, state transitions do not map one-to-one to instructions. Rather, every instruction is its own little state machine, and different instructions take different numbers of steps to complete!
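
As a sketch of this idea, here is a load modeled as its own little state machine, with the “waiting for memory” loop from above. The phases and the four-cycle latency are invented for illustration, not taken from any real microarchitecture.

enum LoadPhase {
    Issue,                     // put the address on the wires towards memory
    Wait { cycles_left: u32 }, // loop here until memory signals “ready”
    WriteBack,                 // latch the loaded value into a register
    Done,
}

fn step(phase: LoadPhase) -> LoadPhase {
    match phase {
        LoadPhase::Issue => LoadPhase::Wait { cycles_left: 4 }, // made-up memory latency
        LoadPhase::Wait { cycles_left: 0 } => LoadPhase::WriteBack,
        LoadPhase::Wait { cycles_left } => LoadPhase::Wait { cycles_left: cycles_left - 1 },
        LoadPhase::WriteBack => LoadPhase::Done,
        LoadPhase::Done => LoadPhase::Done,
    }
}

fn main() {
    let mut phase = LoadPhase::Issue;
    let mut cycles = 0;
    while !matches!(phase, LoadPhase::Done) {
        phase = step(phase);
        cycles += 1;
    }
    println!("load retired after {cycles} cycles");
}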

Registers

Because instructions take several cycles to complete, the CPU pipelines them. The state machine keeps track of several in-flight instructions at different stages of execution. Generally, different stages of the same instruction require different circuitry, while identical stages of different instructions re-use the same logic. For example, arithmetic instructions start at the instruction decoder and then proceed to the ALU. So, in a given cycle, you can have one instruction at the ALU, and the next one in the decoder!
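
Here is a toy cycle-by-cycle trace of such a two-stage decode/ALU pipeline (the stages and the one-cycle-per-stage timing are, again, invented for illustration):

fn main() {
    let program = ["add a, b, c", "add d, e, f"];
    // Instruction i enters the decoder at cycle i and the ALU at cycle i + 1.
    for cycle in 0..3 {
        for (i, insn) in program.iter().enumerate() {
            match cycle as i64 - i as i64 {
                0 => println!("cycle {cycle}: [decode] {insn}"),
                1 => println!("cycle {cycle}: [alu]    {insn}"),
                _ => {}
            }
        }
    }
}

At cycle 1, the first instruction occupies the ALU while the second one is in the decoder.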

From pipelining, it’s a short step to superscalar execution. If you have

mul a, b, c
mul d, e, f

in the instruction stream and two ALUs, the instructions can execute simultaneously, even if the actual multiplication work in the ALU takes several cycles. You just need to carefully track all the in-flight instructions in the state machine logic.

The problem with keeping many instructions in flight, due to pipelining or parallelism, is that you’ll need more registers than you have in the instruction set! Several in-flight instructions might use the same named register!

So another important part of the CPU mental model, in addition to instruction != cycle, is that ISA register != hardware register. The CPU generally has more hardware registers (the “register file”), and its state machine assigns and maintains a mapping from instructions to the hardware registers they actually use. This is the genius of the ISA abstraction: you can make a faster CPU by adding more ALUs and increasing the size of the register file, but this new CPU will still run programs compiled for the old CPU, only faster.

Because the instruction set is fixed, but the hardware changes, the role of registers isn’t to assign hardware resources. Rather, we use registers to express dependencies between instructions, and the CPU uses that information to schedule instructions out of order. Instructions A and B can execute in parallel if they don’t have conflicting (read-write or write-write) register accesses.
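
Here is the renaming idea in miniature. The table sizes and the fresh-register-per-write policy are simplifications I made up; real renaming also has to free physical registers as instructions retire.

use std::collections::VecDeque;

struct Renamer {
    map: Vec<usize>,       // ISA register -> its current physical register
    free: VecDeque<usize>, // physical registers available for allocation
}

impl Renamer {
    // Rename “op dst, a, b”: the sources are read under the current mapping,
    // and the destination gets a fresh physical register, so instructions
    // writing the same ISA register no longer conflict.
    fn rename(&mut self, dst: usize, a: usize, b: usize) -> (usize, usize, usize) {
        let (pa, pb) = (self.map[a], self.map[b]);
        let pd = self.free.pop_front().expect("out of physical registers");
        self.map[dst] = pd;
        (pd, pa, pb)
    }
}

fn main() {
    // 4 ISA registers (a = 0, b = 1, c = 2, d = 3), 16 physical ones.
    let mut r = Renamer { map: vec![0, 1, 2, 3], free: (4..16).collect() };
    // mul a, b, c; mul a, c, d: a write-write conflict on a...
    println!("{:?}", r.rename(0, 1, 2)); // (4, 1, 2)
    println!("{:?}", r.rename(0, 2, 3)); // (5, 2, 3) ...renamed away
}

Because each write gets a fresh physical register, two in-flight instructions that both write a no longer conflict, and only genuine data dependencies constrain the schedule.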

So this is the picture of the CPU to have in mind. It’s a very complicated state machine which:

  • takes many steps for each instruction,
  • typically has multiple instructions in flight,
  • has enough hardware registers for all in-flight instructions.

Interrupts

Let’s context switch back to our time sharing context switch! In time sharing, the OS arranges for a particular process to run only for a certain amount of time (say, 8 milliseconds). It does that by programming an interrupt controller to raise an interrupt at a particular point in the future. When the CPU receives an interrupt, it saves the current instruction pointer and jumps to a pre-configured OS routine for interrupt handling. That routine then handles the interrupt, and can jump back to the original process using the saved instruction pointer.

Ok, but what actually happens? One of the input wires to the CPU is the “interrupt” wire. The CPU’s state transition function treats it like any other input, though an interrupt forces a pretty drastic transition. Recall that the CPU has many instructions in flight, very cleverly managed by the state machine so that they don’t get mixed up. When an interrupt arrives, the CPU should stop the fun and pick a specific instruction to stop at, such that everything before it is executed fully, and anything after it is not started. This is required to allow the OS to resume execution cleanly from that exact point. The CPU does that by finishing some instructions, cancelling the others (rolling back changes, if necessary), and collapsing the state of the register file to just the smaller set of architectural registers. After that, the CPU stores the current instruction pointer to a special register and jumps to a pre-configured interrupt handling routine. It is somewhat like taking several eggs, breaking the shells, juggling them at 90°C until an omelet is ready, and then carefully unscrambling the eggs back, catching them with one hand to answer a phone call with the other!

The timer interrupt is itself a context switch, nested inside the larger time-sharing process context switch. The visible context is tiny, just the instruction pointer, but there’s a fair amount of work involved in dealing with all the transparently in-flight instructions.

The rest of the registers are not saved automatically by the CPU; that’s the job of the OS. Similarly, interrupt handling happens in the context of the process that was running on the CPU when the interrupt arrived, so the virtual memory mapping is preserved. But the ring0 flag is set when jumping to the interrupt handler, so that the OS can change the mapping if it wants to.

Scheduling Processes

The OS interrupt handler will save the rest of the context by writing all architectural registers to memory. Then it’ll invoke the scheduling routine to determine which process to run next. The scheduler does the larger context switch: it needs to change the virtual memory mapping. This is easy to do, because the kernel is mapped at the same address in every process, and the scheduler is in the kernel. So, although the entire mapping is changed (by writing the address of the new page table to a special register), nothing observable changes until we are back in user space. After that, the OS restores the previously saved registers and jumps back to user space using the saved instruction pointer. And before doing all of that, the OS arms the interrupt controller timer to get the next interrupt after the new process’ quantum expires.
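
Putting the sequence into pseudo-kernel Rust (every name here is made up and the bodies are stubs; this shows only the order of the steps, not any real kernel’s code):

struct Context {
    rip: u64,        // saved instruction pointer
    gprs: [u64; 16], // saved general-purpose registers
}

struct Process {
    ctx: Context,
    page_table_root: u64, // physical address of this process’ page table
}

fn save_registers(_ctx: &mut Context) { /* spill architectural registers to memory */ }
fn restore_registers(_ctx: &Context) { /* reload them and return to user space */ }
fn load_page_table(_root: u64) { /* write the page-table root register (CR3 on x86_64) */ }
fn arm_timer(_ns: u64) { /* program the interrupt controller */ }

fn switch(current: &mut Process, next: &Process) {
    save_registers(&mut current.ctx);
    // The entire mapping changes here, but nothing observable does:
    // the kernel is mapped at the same address in every process.
    load_page_table(next.page_table_root);
    arm_timer(8_000_000); // next timer interrupt after the new quantum, in nanoseconds
    restore_registers(&next.ctx);
}

fn main() {
    let mut a = Process { ctx: Context { rip: 0, gprs: [0; 16] }, page_table_root: 0x1000 };
    let b = Process { ctx: Context { rip: 0, gprs: [0; 16] }, page_table_root: 0x2000 };
    switch(&mut a, &b);
}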

It’s fun to think how, although logically we seamlessly freeze and restore the process, physically the before and after states look quite different. Before, we had a full pipeline’s worth of instructions in flight, and had to abruptly cut it in the middle. After, we start with an empty pipeline, and it takes the CPU some time to fill it up again!

Furthermore, there’s state which we don’t really need to switch explicitly, but which is switched nonetheless: the CPU cache. A different process is likely to access different physical pages, breaking temporal and spatial locality completely.

A related kind of cache is the TLB, which caches the mapping from virtual to physical addresses (given the hierarchical structure of the page table, a single translation takes several dependent memory accesses, so it is imperative that translations are cached). As we switch address spaces, the TLB contents generally become useless, though modern CPUs can tag TLB entries with process identifiers.

So far, we’ve learned about two context switches:

Interrupt

Causes the CPU to “flush” the in-flight pipeline, aligning the register file with the ISA-visible registers; saves only the instruction pointer; jumps to the OS interrupt handler, transitioning to ring0.

Scheduler Switch

Triggered by a timer interrupt! Switches the instruction pointer, the registers, and the address space, leaving the cache cold and possibly flushing the TLB.

Trivial System Call

The next switch we’ll look at is the system call. It is comparatively simple, and not much of a switch at all. It works similarly to an interrupt: the CPU saves the current instruction pointer and jumps to the kernel code. But, crucially, it is triggered by an explicit syscall instruction as a part of the normal instruction stream, so it can be integrated with the CPU pipeline.

As with interrupts, it’s on the OS to save the registers upon entering the kernel space. However, the kernel doesn’t have to save all the registers, as it won’t be switching to a different process. If I am reading the code correctly, x86_64 Linux doesn’t save SIMD registers on syscall, presumably because the kernel doesn’t use these registers in every syscall. Really, when we do a syscall, most things stay the same, because the syscall happens in the context of the userspace thread! In particular, the memory map, the TLB, and the caches stay intact. One thing that does change is the stack: usually, each thread has both a large userspace stack and a small kernel stack, which is used by syscalls. It would be unsafe for the kernel to use the userspace stack, as other threads could then thrash kernel data. You can think of it as a segmented stack with two segments.
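
To see how thin the trivial syscall path is, here is getpid invoked directly via the syscall instruction. This sketch assumes x86_64 Linux, where the syscall number of getpid is 39; the syscall instruction clobbers rcx and r11, which is why they are declared as outputs.

use std::arch::asm;

fn main() {
    let mut ret: i64 = 39; // __NR_getpid on x86_64 Linux
    unsafe {
        asm!(
            "syscall",
            inout("rax") ret, // syscall number in, return value out
            out("rcx") _,     // the CPU stashes the return rip here...
            out("r11") _,     // ...and rflags here
            options(nostack),
        );
    }
    println!("pid = {ret}");
}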

IO System Call

The above holds for trivial syscalls that just read some in-kernel data structure (for example, the getpid syscall), but usually, when speaking about syscalls, we have some IO in mind. Let’s look at something like the write syscall. Usually, a write just copies data to the page cache, so let’s assume that the program writes directly to a device and is about to do real, hardware IO.

The process starts with a syscall instruction, as above, which switches to the kernel space and to the kernel’s segment of the stack. Then the kernel goes through several layers of function calls to get to the appropriate driver code. It’s worth emphasizing that at this point we have a single call stack spanning both userspace and kernel space. The kernel isn’t a separate entity; the kernel is a part of every thread.

The driver starts the IO operation, which is likely to use DMA. That is, the driver hands the device the address of a region of memory, and from this point on the device communicates with memory directly, bypassing the CPU. When the IO is complete, the device will notify the CPU via an interrupt. In the meantime, having initiated the DMA, the original thread of execution is free to do whatever.

UNIX-inspired operating systems hide this fundamentally asynchronous nature of IO behind a “blocking” interface. This means that, after scheduling the DMA, the thread (still in the kernel space) will call the scheduler to give the CPU away to a different process. We have already seen this process when discussing time sharing: the scheduler selects the next process to run, saves the current process’ state, and switches the instruction pointer and the page table.
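
In pseudo-kernel Rust (all names invented, bodies stubbed out), the shape of a blocking write is roughly:

enum ThreadState {
    Runnable,
    Blocked,
}

struct Thread {
    state: ThreadState,
}

fn driver_start_dma(_buf: &[u8]) { /* point the device at the buffer; it works on its own */ }
fn schedule() { /* save our context, pick another runnable thread, switch to it */ }

// Reached from the write syscall, still on this thread’s kernel stack:
fn blocking_write(current: &mut Thread, buf: &[u8]) {
    driver_start_dma(buf);                // the IO is now in flight, asynchronously
    current.state = ThreadState::Blocked; // no progress possible until completion
    schedule();                           // give the CPU away to somebody else
    // Execution resumes here after the completion interrupt
    // marks this thread Runnable and the scheduler picks it again.
}

fn main() {
    let mut t = Thread { state: ThreadState::Runnable };
    blocking_write(&mut t, b"hello");
}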

When the IO completes and the interrupt arrives, the kernel will mark the original process as runnable and eligible to be switched to on the next scheduler call.

Message Passing

So far, we’ve been looking at the situation where we have a single CPU and several processes, and “context switching” decides whose turn it is to run. Now we’ll switch gears and look at multiple CPU cores. Specifically, we’ll look at message passing, where one thread produces data and sends it to a different thread for processing. For example, the first thread might run networking code that fetches the next message over TCP, and the second thread might be a part of a thread pool that does the actual processing. The threads communicate via a channel.

The worker thread is waiting for a message: it issued a futex_wait syscall, and the OS scheduler marked the thread as not runnable. To publish a message, the network thread does an atomic read-modify-write on the futex word, and then issues a futex_wake syscall, which marks the worker thread as runnable. And I actually don’t know what happens next! The thing is, two completely different paths are possible.
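
Here is the wait/wake pair in miniature (Linux only; assumes the libc crate; a real channel would also pass the actual message and handle spurious wakeups, this just flips one word):

use std::sync::atomic::{AtomicU32, Ordering};

static READY: AtomicU32 = AtomicU32::new(0); // the futex word

fn futex(word: &AtomicU32, op: libc::c_int, val: u32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            word as *const AtomicU32,
            op,
            val,
            std::ptr::null::<libc::timespec>(), // no timeout
        );
    }
}

fn main() {
    let worker = std::thread::spawn(|| {
        // Sleep until READY changes; the kernel marks this thread not runnable.
        while READY.load(Ordering::Acquire) == 0 {
            futex(&READY, libc::FUTEX_WAIT, 0);
        }
        println!("worker: message is ready");
    });
    std::thread::sleep(std::time::Duration::from_millis(10));
    READY.store(1, Ordering::Release);  // the atomic write to the futex word...
    futex(&READY, libc::FUTEX_WAKE, 1); // ...and the wake syscall
    worker.join().unwrap();
}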

Here’s how the world looks right before we wake the worker thread: there are two cores. One core is busy running the network thread, the second core is idle, and the worker thread is not runnable. The busy core has the message in its cache. After waking the worker thread up, there are two possibilities: either the worker thread wakes up on the spare core, or it wakes up on the core with the message in its cache, and the network thread gets moved to a different core.

In the first case, when the worker thread attempts to access the message, the cache coherency protocol needs to be involved to physically move the message bytes between the two caches. In the second case, the worker can immediately start processing the message, but the network thread then incurs cache misses while its state moves over to the second core’s cache.

Which schedule is optimal? I think perhaps neither? What we actually want is to not run the worker thread immediately, and instead let the network thread run for a little bit longer, so that it can loop around, issue another read syscall against the socket to fetch the next message, and go to sleep. At that point, we want to hand the core over to the worker thread, so that we don’t have to move any data between the caches.