Vibe Coding Terminal Editor
I “wrote” a small tool for myself as my biannual routine check of where LLMs currently are. I think I’ve learned a bunch from this exercise. This is frustrating! I don’t want to learn by trial and error; I’d rather read someone’s blog post with lessons learned. Sadly, most of the writing on the topic that percolates to me tends to be high-level — easy to nod along with while reading, but hard to extract actionable lessons from. So that is what I want to do here: list the specific tricks I learned.
Terminal Editor
Let me quickly introduce the project. It’s a VS Code extension that allows me to run a “shell” inside my normal editor widget, such that the output is a normal text buffer where all the standard motion/editing commands work. So I can “goto definition” on paths printed as part of a backtrace, use multiple cursors to copy the compiler’s suggestions, or just PageUp / PageDown to scroll the output. If you are familiar with Emacs, it’s Eshell, just worse:
I now use terminal-editor to launch most of my compilation commands, as it has several niceties on top of what my normal shell provides. For example, by default only the last 50 lines of output are shown, but I can hit tab to fold and unfold the full output. Such a simple feature, but such a pain to implement in a UNIX shell/terminal!
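The folding behavior itself is easy to sketch. A hypothetical `renderOutput` (illustrative names, not the extension’s real API) that shows only the last 50 lines when folded:

```typescript
// Render a command's output, folded to the last `limit` lines by default;
// in the extension, hitting tab toggles `folded` and re-renders.
// Names here are illustrative, not the real API.
function renderOutput(lines: string[], folded: boolean, limit = 50): string[] {
  if (!folded || lines.length <= limit) {
    return lines;
  }
  const hidden = lines.length - limit;
  return [`… ${hidden} lines folded (tab to unfold) …`, ...lines.slice(-limit)];
}
```

The pain is not this function; it is that a text buffer, unlike a terminal, actually gives you a place to put it.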
What follows is an unstructured bag of things learned:
Plan / Reset
I originally tried to use claude code normally, by iteratively prompting in the terminal until I got the output I wanted. This was frustrating, as it was too easy to miss a good place to commit a chunk of work, or to rein in a conversation going astray. This “prompting-then-waiting” mode also had a pattern of mental context switches that didn’t match my preferred style of work. This article suggests a better workflow: https://harper.blog/2025/05/08/basic-claude-code/
Instead of writing your single prompt in the terminal, you write an entire course of action as a task list in a plan.md document, and the actual prompt is then something along the lines of “Read @plan.md, complete the next task, and mark it with X”.
After claude finishes iterating on a step, you look at the diff and interactively prompt for any necessary corrections. When you are happy, git commit and /clear the conversation, to start the next step from a clean slate.
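For concreteness, a plan.md in this workflow might look something like this (contents invented for illustration):

```
- [X] Scaffold the VS Code extension and register the terminal-editor view.
- [X] Run a shell command and append its output to the buffer.
- [ ] Fold output to the last 50 lines; tab toggles the fold.
- [ ] Mock the clock so process runtime is deterministic in tests.
```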
The plan pattern reduces context switches, because it allows you to plan several steps ahead while you are in planning mode, even if it makes sense to do the work one step at a time. I often also work on continuing the plan while claude is working on the current task.
Whiteboard / Agent Metaphor
A brilliant metaphor from another post, https://crawshaw.io/blog/programming-with-agents, is that prompting an LLM for some coding task and then expecting it to one-shot a working solution is quite a bit like asking a candidate to whiteboard an algorithm during an interview.
LLMs are clearly superhuman at whiteboarding, but you can’t go far without feedback. “Agentic” tools like claude allow LLMs to iterate on a solution.
LLMs are much better at whiteboarding than at iterating. My experience is that, starting with a suboptimal solution, an LLM generally can’t improve it by itself along the fuzzy aesthetic metrics I care about. It can make valid changes, but the overall quality stays roughly the same.
However, LLMs are tenacious, and can do a lot of iterations. If you do have a value function, you can use it to extract useful work from a random walk! Human judgement is a bad value function: sitting in the loop with an LLM and pointing out mistakes is both frustrating and slow (you are the bottleneck). In contrast, “make this test green” is very efficient at getting working (≠ good) code.
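The loop can be sketched abstractly: propose a change, consult the value function, repeat. A toy version, where `propose` and `testsPass` are hypothetical stand-ins for the agent and the test suite:

```typescript
// Extract useful work from a random walk: keep proposing changes until
// the value function (here, "do the tests pass?") accepts one.
// `propose` and `testsPass` are stand-ins for the agent and the test suite.
async function iterateUntilGreen(
  propose: () => Promise<void>,
  testsPass: () => Promise<boolean>,
  maxIters = 20,
): Promise<boolean> {
  for (let i = 0; i < maxIters; i++) {
    await propose(); // the LLM takes one step of the walk
    if (await testsPass()) return true; // the value function accepts
  }
  return false; // budget exhausted; a human re-enters the loop
}
```

The point is that the human appears only on failure, not on every iteration.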
Spec Is Code Is Tests
LLMs are good at “closing the loop”, they can make the ends meet.
This insight, combined with the plan.md pattern, gives my current workflow — the spec ↔ code ↔ test loop. Here’s the story:
I coded the first version of terminal-editor using just the plan.md pattern, but at some point I hit a complexity wall. I realized that my original implementation strategy for syntax highlighting was a dead end, and I needed to change it, but that was hard to do without making a complete mess of the code. The accumulated plan.md reflected a bunch of historical detours, and the tests were too brittle and coupled to the existing implementation (more on tests later). This worked for incremental additions, but now I wanted to change something in the middle.
I realized that what I want is not an append-only plan.md that reflects history, but rather a mutable spec.md that clearly describes how the software should behave. For normal engineering, this would have been a “damn, I guess I need to throw one out and start afresh” moment. With claude, I added plan.md and all the code to the context and asked it to write a spec.md file in the same task-list format. There are two insights here:
First, a mutable spec is a good way to instruct an LLM. When I want to apply a change to terminal-editor now, I prompt claude to update the spec first (unchecking any items that need re-doing), manually review and touch up the spec, and then use a canned prompt to align the code and tests with the spec.
Second, you can think of an LLM as a machine translator which can automatically convert between working code, specification, and tests. You can treat any of those three as the input, as if you are coding in miniKanren!
Tests
I did have this idea of closing the loop when I started terminal-editor, so I crafted the prompts to emphasize testing. You can guess the result! claude wrote a lot of tests, following all the modern “best practices” — a deluge of unit tests that just needlessly nailed down internal APIs, a jungle of bug-hiding mocks, and a bunch of unfocused integration tests which were slow, flaky, and contained a copious amount of sleeps to paper over synchronization bugs. Really, it was eerily similar to a typical test suite you find in the wild. I wonder why that is?
This is perhaps my main takeaway: if I am vibe-coding anything again, and I want to maintain it and not just one-shot it, I will think very hard about the testing strategy. Really, to toot my own horn, I think that perhaps How to Test? is the best article out there about agentic coding. Test iteration is a multiplier for humans, but a hard requirement for LLMs. Tests must be very fast and non-flaky, and should end-to-end test application features, rather than code.
Concretely, I just completely wiped out all the existing tests. Then I added a testing strategy to the spec. It is built around two functions:
```typescript
export async function sync(): Promise<void>
export function snapshot(): string
```
The sync function waits for all outstanding async work (like external processes) to finish. This requires properly threading causality throughout the code. E.g., there’s a promise you can await on to join the currently running process. The snapshot function captures the entire state of the extension as a single string. There’s just one mock, for the clock (another improvement on the usual terminal — process runtime is always shown).
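A self-contained sketch of one way to implement such a sync, by tracking every outstanding promise in a registry — this design is my assumption for illustration; the post doesn’t show the actual implementation:

```typescript
// Registry of outstanding async work. Every async operation (e.g. a
// spawned process) registers its completion promise here, so tests can
// wait for quiescence deterministically instead of sleeping.
const pending = new Set<Promise<unknown>>();

function track<T>(work: Promise<T>): Promise<T> {
  const entry = work.finally(() => pending.delete(entry));
  pending.add(entry);
  return entry;
}

// Wait for all outstanding work, including work spawned by that work.
async function sync(): Promise<void> {
  while (pending.size > 0) {
    await Promise.all([...pending]);
  }
}

// Toy state for illustration; the real snapshot serializes the entire
// extension state, with the clock mocked for stable process runtimes.
const output: string[] = [];
function snapshot(): string {
  return output.join("\n");
}
```

The loop in sync re-checks the registry because awaited work may itself have tracked new work; that is the “threading causality” part.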
Then, I prompted claude with something along the lines of “Oups, looks like someone wiped out all the tests here, but the code and the spec look decent, could you re-create the test suite using the snapshot function as per @spec.md?”
It worked. Again, “throw one away” is very cheap.
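The resulting tests all reduce to one uniform shape: drive a feature end-to-end, wait for quiescence, compare one string. A hypothetical helper showing that shape (the suite claude actually produced isn’t shown in this post):

```typescript
// Shape of a snapshot test: run a scenario, wait for quiescence, then
// compare the whole serialized state against an expected string.
// All three callbacks are stand-ins for the extension's test API.
async function checkSnapshot(
  scenario: () => Promise<void>,
  sync: () => Promise<void>,
  snapshot: () => string,
  expected: string,
): Promise<void> {
  await scenario();
  await sync();
  const got = snapshot();
  if (got !== expected) {
    throw new Error(`snapshot mismatch:\n--- got ---\n${got}\n--- want ---\n${expected}`);
  }
}
```

Because every test is a full-state string comparison, tests constrain features rather than internal APIs, which is exactly what survives a rewrite of the implementation.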
Conclusions
That’s it! LLMs obviously can code. You just need to hold them right. In particular, you need to engineer a feedback loop that lets the LLM iterate at its own pace. You don’t want a human in the “data plane” of the loop, only in the control plane. Learn to architect for testing.
LLMs drastically reduce the activation energy for writing custom tools. I have wanted something like terminal-editor forever, but it was never the most attractive yak to shave. Well, now I have the thing, and I use it daily.
LLMs don’t magically solve all software engineering problems. The biggest time sink with terminal-editor was solving the pty problem, but LLMs are not yet at the “give me UNIX, but without the pty mess” stage.
LLMs don’t solve maintenance. A while ago I wrote about an LSP for jj. I think I could actually code that up in a day with Claude now? Not the proof of concept, but the production version with everything I would need. Yet I don’t want to maintain it. I don’t want to context switch to fix a minor bug if I am the only one using the tool. And, well, if I make it for other people, I’d definitely be on the hook for maintaining it :D