I’m sure you have heard that LLMs are inevitable. One of the figures cited to demonstrate this inevitability is the adoption rate of ChatGPT, which hit however many million users so much faster than the previous record-holder. The details of the claim change periodically and frankly are not all that important for people who understand the dynamics behind technological adoption.

Adoption of truly revolutionary technologies is still slow, just as it has always been. Adoption is quick for incremental technologies that are highly compatible with existing social processes. The adoption curve of cellphones was steeper than the one for landline telephones, even though going from a landline to a cellphone is a much smaller improvement than going from nothing to a landline.

At this point, the darling of the AI discourse is Claude Code. It is popular precisely because it is not revolutionary. It fits neatly into the values software orgs have been adopting ever since managers heard of Agile and thought it meant “faster software delivery.” “Data driven” is the name of the game, and speed has been the north star of data driven orgs simply because it is much easier to measure:

Activity is easier to measure than progress, which is why shallow leaders love it. They're dazzled by the pace and don't notice everyone is just moving in a circle.

AI is, of course, the technology best suited to emulating vigorous activity. With a prompt, you can produce in seconds what would take a mediocre programmer hours. This is a clear velocity improvement, as long as you’re measuring velocity as time to outputs rather than time to outcomes.

“AI can make mistakes”

People tell me that this works better for code than it does for other things, because code quality can be checked by automated tests. This is a very 2003 way of looking at the world: our requirements are knowable because someone else (the business) gave them to us. We are only responsible for making sure that our outputs check those boxes.

This is why code can never be the source of truth. Kevin Muldoon lays it out clearly: code can be the ontological truth (what is), but it can never serve as a source of normative truth (what must be). For that, you need some kind of specification that says what the code is supposed to do, and then the tests can confirm it is doing that thing.
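To make that distinction concrete, here is a minimal hypothetical sketch (the `apply_discount` function and its 50% cap are invented for illustration). The function body is only ontological truth: it records what the code does. The assertions restate the spec, the normative truth of what it must do, which is the only reason they can catch the code drifting from it:

```python
# Hypothetical spec (normative truth): "a discount may never exceed 50%".

def apply_discount(price: float, discount: float) -> float:
    """Apply a discount fraction to a price, capped at 50% per the spec."""
    # The body is ontological truth: it describes what is, not what must be.
    return price * (1 - min(discount, 0.5))

# The test restates the spec, so it can confirm the code still obeys it.
assert apply_discount(100.0, 0.9) == 50.0    # cap enforced
assert apply_discount(100.0, 0.25) == 75.0   # normal case unchanged
```

Note that the assertions are only as good as the spec they transcribe; if nobody wrote the spec down, there is nothing for the tests to confirm.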

The trouble, of course, comes when the people responsible for writing those requirements are the ones vibe coding. And maybe your “requirements” are just a slopotype with a sticky note slapped on top that says “ship this.”

First of all, you should never ship vibe code in a professional capacity. Just don’t do it. It’s probably fine for n=1 users, but for anything remotely real what you have is just a spicy sketch. Ben Werdmuller’s feasibility rubric is the perfect demonstration of how much maintenance and risk eclipse the momentary euphoria of code that works in a demo context. In a fundamental sense, all code is technical debt, and the most maintainable code is measured in negative LOC.

Tests (especially tests written after the fact) can only tell you so much about whether the code is fit for purpose. It’s always better to just not ship the prototype than to “check the AI’s work.”

But even if you are willing to completely refactor the vibe coded prototype in the pursuit of shipping it, there is an even tougher problem at play: there is no automated testing for requirements. How do you know whether the sketch is fit for purpose?

AI is a machine that turns quality assurance into burnout

Spoiler: the AI’s output is mathematically guaranteed to fall short of the right thing. When it comes to creative problem-solving, an engine based on statistics is only ever going to output a statistically average response. And sometimes that is good enough. The trouble with LLMs is that they can’t tell you when it isn’t.

A good failure mode shows it has failed.

The burden of determining whether the proposed solution fits the problem in question is placed entirely on the operator, as Jennifer Moore explains. Not only on their judgment, crucially, but also their attention. After a dozen cases of the AI giving a good enough answer, how determined will you be to keep checking? Especially when higher oversight of AI outputs directly correlates with higher mental fatigue?

When the work happens inside your brain, you get into a flow state. When you have to keep checking the outputs of an artificial brain before you can continue, you burn out. And the very same volume of these outputs that makes you feel productive keeps you chugging along long past your body’s natural stopping point.

So it’s no surprise that people, in practice, don’t “check the work” of their AI assistant. In a way, they are preserving their own peace of mind by choosing not to do so. Great for them. Bad deal for anyone downstream who has to turn their AI-generated prototype into a product.

A common refrain in today’s UX discourse is that designers are uniquely suited to this kind of checking because they have great taste. But you will not be able to convince anyone that the design they generated is bad by saying “I looked at it with my great taste and I can see that it’s bad.” No one cares about your taste.

The definition of Good

The way you check whether AI requirements (AKA specs, AKA designs) are fit for purpose is the same way you check whether a human design is fit for purpose: critique. I went on a rant about this over on Bluesky, but Rachel Been explains it thoroughly in article form here:

Vibecoded concepts are missing the depth required for a sustainable product. … designs live and die by critique — not just aesthetic critique, but structural critique.

Why is this depth missing? Because it requires conceptual understanding of a problem, which only emerges as you engage with it. That is what good design critique seeks to strengthen. It is structured to stress-test decisions: what was the purpose, why did you choose to do it this way? Did you try other ways, was there a better one?

Having AI extrude a prototype robs you of the opportunity to interrogate those decisions throughout the process. The answer you got is not the best answer of many; it’s just the one the machine gave you, the one that looked alright enough to pass on down the line. Understanding whether the artifact is fit for purpose is part of a feedback loop of understanding the purpose itself; both evolve in parallel as we work on the problem, and that evolution does not occur when the work is skipped.

Drawing is seeing … machines only record.

Sure, AI can type a million words a minute. But productivity isn’t capped by how fast you can type. It’s capped by this process of developing conceptual fidelity. Attempts to skip it by adding more detail don’t actually add more fidelity to the design. The detail becomes a distraction.

The LLM won’t help you understand the user’s goals any faster than human pace allows. And understanding their goals is the prerequisite for understanding their problems. Only then can you define success, and only then can you start iterating on solutions that arrive at it.

The LLMs might be able to help you with production work. But skipping straight to the production work, because that’s where the magical tools live, is doing yourself a disservice.

— Pavel at the Product Picnic
