The debate about to what degree AI-generated code needs to be reviewed often feels very binary. Is vibe coding (i.e. letting AI generate code without looking at the code) good or bad? The answer is of course neither, because “it depends”.
So what does it depend on?
When I’m using AI for coding, I find myself constantly making little risk assessments about whether to trust the AI, how much to trust it, and how much work I need to put into verifying the results. And the more experience I get with using AI, the more honed and intuitive these assessments become.
Risk assessment is typically a combination of three factors:
- Probability
- Impact
- Detectability
Reflecting on these 3 dimensions helps me decide if I should reach for AI or not, if I should review the code or not, and at what level of detail I do that review. It also helps me think about mitigations I can put in place when I want to benefit from AI’s speed, but reduce the risk of it doing the wrong thing.
1. Probability: How likely is AI to get things wrong?
The following are some of the factors that help you judge the probability dimension.
Know your tool
The AI coding assistant is a function of the model used, the prompt orchestration happening inside the tool, and the level of integration the assistant has with the codebase and the development environment. As developers, we don’t have all the information about what’s going on under the hood, especially when we’re using a proprietary tool. So the assessment of a tool’s quality is a combination of knowledge about its proclaimed features and our own previous experience with it.
Is the use case AI-friendly?
Is the tech stack prevalent in the training data? What is the complexity of the solution you want AI to create? How big is the problem that AI is supposed to solve?
You can also consider more generally whether you’re working on a use case that needs a high level of “correctness” or not. E.g., building a screen exactly based on a design, versus drafting a rough prototype screen.
Be aware of the available context
Probability isn’t only about the model and the tool, it’s also about the available context. The context is the prompt you provide, plus all the other information the agent has access to via tool calls and so on.
- Does the AI assistant have enough access to your codebase to make a good decision? Is it seeing the files, the structure, the domain logic? If not, the chance that it will generate something unhelpful goes up.
- How effective is your tool’s code search strategy? Some tools index the full codebase, some make on-the-fly grep-like searches over the files, some build a graph with the help of the AST (Abstract Syntax Tree). It can help to know what strategy your tool of choice uses, though ultimately only experience with the tool will tell you how well that strategy really works. (The sketch after this list contrasts the first two approaches.)
- Is the codebase AI-friendly, i.e. is it structured in a way that makes it easy for AI to work with? Is it modular, with clear boundaries and interfaces? Or is it a big ball of mud that fills up the context window quickly?
- Is the existing codebase setting a good example? Or is it a mess of hacks and anti-patterns? If the latter, the chance of AI producing more of the same goes up if you don’t explicitly tell it what the good examples are.
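To make the difference between those search strategies more concrete, here is a minimal, illustrative sketch (not how any particular assistant is implemented; the code snippet and function name are made up): a plain text search returns every line that mentions a name, while an AST-based lookup can tell apart where a function is actually defined.

```python
import ast

# A made-up snippet of project code to search through.
SOURCE = '''
def calculate_discount(order):
    return order.total * 0.1

def apply_discount(order):
    # calls calculate_discount below
    return order.total - calculate_discount(order)
'''

def grep_like_search(source: str, term: str) -> list[str]:
    """Text matching: returns every line mentioning the term, definitions and calls alike."""
    return [line.strip() for line in source.splitlines() if term in line]

def ast_search(source: str, term: str) -> list[str]:
    """AST lookup: returns only the places where a function with that name is defined."""
    tree = ast.parse(source)
    return [f"definition at line {node.lineno}"
            for node in ast.walk(tree)
            if isinstance(node, ast.FunctionDef) and node.name == term]

print(grep_like_search(SOURCE, "calculate_discount"))  # 3 hits: definition, comment, call site
print(ast_search(SOURCE, "calculate_discount"))        # 1 hit: the definition
```

A more precise search strategy lets the agent pull more relevant context into the same context window, which is why it matters for the probability dimension.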
2. Impact: If AI gets it wrong and you don’t notice, what are the consequences?
This consideration is mainly about the use case. Are you working on a spike or on production code? Are you on call for the service you’re working on? Is it business critical, or just internal tooling?
Some good sanity checks:
- Would you ship this if you were on call tonight?
- Does this code have a high blast radius, e.g. is it used by a lot of other components or users?
3. Detectability: Will you notice when AI gets it wrong?
This is about feedback loops. Do you have good tests? Are you using a typed language? Does your stack make failures obvious? Do you trust the tool’s change tracking and diffs?
It also comes down to your own familiarity with the codebase. If you know the tech stack and the use case well, you’re more likely to spot something fishy.
This dimension leans heavily on traditional engineering skills: test coverage, system knowledge, code review practices. And it influences how confident you can be even when AI makes the change for you.
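As an illustration of what such a feedback loop can look like, here is a small, hypothetical characterization test (the function, rates and values are invented): it pins down the current behaviour, so an AI-made change that silently alters it fails loudly instead of going unnoticed.

```python
# pricing.py (hypothetical module the AI is asked to modify)
def final_price(net_amount: float, country: str) -> float:
    """Applies a country-specific tax rate to a net amount."""
    tax_rates = {"DE": 0.19, "FR": 0.20}
    return round(net_amount * (1 + tax_rates.get(country, 0.0)), 2)


# test_pricing.py (characterization tests pinning the current behaviour)
def test_known_tax_rates():
    assert final_price(100.0, "DE") == 119.0
    assert final_price(100.0, "FR") == 120.0

def test_unknown_country_gets_no_tax():
    # If an AI edit changes this fallback, the failure is immediately visible.
    assert final_price(100.0, "US") == 100.0
```

The better these safety nets already are, the less each individual AI change depends on you reading every line.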
A combination of traditional and new skills
You might have already noticed that many of these assessment questions require “traditional” engineering skills, while others call for newer ones: knowing your AI tools, and building intuition for which use cases and contexts they handle well.

Combining the three: A sliding scale of review effort
When you combine these three dimensions, they can guide your level of oversight. Let’s take the extremes as examples to illustrate the idea:
- Low probability + low impact + high detectability: Vibe coding is fine! As long as things work and I achieve my goal, I don’t review the code at all.
- High probability + high impact + low detectability: A high level of review is advisable. Assume the AI might be wrong and cover for it.
Most situations land somewhere in between, of course.
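To show how the dimensions combine rather than act in isolation, here is a deliberately simplistic sketch (the scoring and thresholds are invented, not a recommended formula): each dimension nudges the review effort up, and poor detectability amplifies the other two.

```python
from dataclasses import dataclass

@dataclass
class RiskAssessment:
    probability: int    # 1 = AI unlikely to get it wrong, 3 = very likely
    impact: int         # 1 = harmless if wrong, 3 = business critical
    detectability: int  # 1 = mistakes surface immediately, 3 = mistakes stay hidden

    def review_level(self) -> str:
        score = self.probability * self.impact * self.detectability
        if score <= 4:
            return "vibe it: skim the result, rely on the feedback loops"
        if score <= 12:
            return "review the diff, focus on interfaces and edge cases"
        return "line-by-line review, or don't delegate this to AI at all"

# The two extremes from above, plus a typical in-between case:
print(RiskAssessment(1, 1, 1).review_level())  # vibe it
print(RiskAssessment(3, 3, 3).review_level())  # line-by-line review
print(RiskAssessment(2, 3, 1).review_level())  # review the diff
```

In practice this calibration happens in your head in seconds, not in code; the sketch only makes the trade-off explicit.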

Example: Legacy reverse engineering
We recently worked on a legacy migration for a client where the first step was to create a detailed description of the current functionality with AI’s help.
- Probability of getting wrong descriptions was medium:
  - Tool: The model we had to use often did not follow instructions well.
  - Available context: We didn’t have access to all of the code; the backend code was unavailable.
  - Mitigations: We ran prompts multiple times to spot-check variance in the results, and we increased our confidence level by analysing the decompiled backend binary.
- Impact of getting wrong descriptions was medium:
  - Business use case: On the one hand, the system was used by thousands of external business partners of this organisation, so getting the rebuild wrong posed a business risk to reputation and revenue.
  - Complexity: On the other hand, the complexity of the application was relatively low, so we expected it to be fairly easy to fix errors.
  - Planned mitigations: A staggered rollout of the new application.
- Detectability of getting the wrong descriptions was medium:
  - Safety net: There was no existing test suite that could be cross-checked.
  - SME availability: We planned to bring in SMEs for review, and to create feature parity comparison tests (sketched after this example).
Without a structured assessment like this, it would have been easy to under-review or over-review. Instead, we calibrated our approach and planned for mitigations.
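For the feature parity comparison tests mentioned above, the general idea (sketched here with made-up endpoints and URLs, not the client’s actual setup) is to send the same requests to the legacy system and to the rebuilt one and diff the responses, so that divergences caused by wrong AI-generated descriptions surface automatically.

```python
import json
import urllib.request

# Hypothetical base URLs for the legacy system and the rebuilt one.
LEGACY_BASE = "http://legacy.internal.example.com"
REBUILD_BASE = "http://rebuild.internal.example.com"

def fetch(base_url: str, path: str) -> dict:
    """Fetches a JSON response from one of the two systems."""
    with urllib.request.urlopen(base_url + path) as response:
        return json.load(response)

def check_parity(path: str) -> list[str]:
    """Returns the fields where the rebuilt system diverges from the legacy one."""
    legacy, rebuild = fetch(LEGACY_BASE, path), fetch(REBUILD_BASE, path)
    return [
        f"{key}: legacy={legacy.get(key)!r} rebuild={rebuild.get(key)!r}"
        for key in sorted(set(legacy) | set(rebuild))
        if legacy.get(key) != rebuild.get(key)
    ]

if __name__ == "__main__":
    # A few representative requests; a real parity suite would cover many more.
    for path in ["/partners/42/orders", "/partners/42/invoices"]:
        differences = check_parity(path)
        print(f"{path}: {'OK' if not differences else differences}")
```

Checks like this raise detectability even when, as in this case, there is no pre-existing test suite to lean on.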
Closing thought
This kind of micro risk assessment becomes second nature. The more you use AI, the more you build intuition for these questions. You start to feel which changes can be trusted and which need closer inspection.
The goal is not to slow yourself down with checklists, but to develop intuitive habits that help you navigate the line between leveraging AI’s capabilities and reducing the risk of its downsides.









