What makes an AI-written test unacceptable? We asked testers.
The easy part of generating tests is making the coverage number go up. The hard part is knowing whether any of those tests would catch a real bug. We build a service that writes tests for under-tested code, so this is the question we care about most, and we wanted the answer from people who review tests for a living rather than from our own assumptions.
So we asked a room of QA and testing engineers a blunt question: if you were reviewing a PR that claimed to increase coverage, what would make you reject it on sight? Here is what came back, lightly organized, plus how we try to enforce each one.
The reject list
Tests that assert nothing. A test that runs the code and then checks nothing is not a test, it is a line-execution counter. It will pass forever and protect nothing.
Self-satisfying assertions. One reviewer put it sharply: a good test must not provide its own success conditions. Asserting that X is X, or asserting on a value the test itself set, while ignoring the actual output of the code under test, is the most common way a generated test looks green and means nothing.
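To make the difference concrete, here is a minimal sketch in plain TypeScript with no test framework. The names applyDiscount, buggyDiscount, assertEqual, and passes are hypothetical and exist only for this illustration. It runs a self-satisfying test and an output-checking test against a deliberately broken implementation.

```typescript
// Hypothetical code under test: price after a whole-number percentage discount.
type Discount = (price: number, percentOff: number) => number;

const applyDiscount: Discount = (price, percentOff) =>
  price - (price * percentOff) / 100;

// Deliberately broken variant: returns the discount amount, not the new price.
const buggyDiscount: Discount = (price, percentOff) =>
  (price * percentOff) / 100;

// Tiny assertion helper so the sketch needs no test framework.
function assertEqual(actual: number, expected: number): void {
  if (actual !== expected) {
    throw new Error(`expected ${expected}, got ${actual}`);
  }
}

// Self-satisfying: the output is discarded and the assertion compares
// a value the test itself set.
function selfSatisfyingTest(fn: Discount): void {
  const expected = 80;
  fn(100, 20);
  assertEqual(expected, 80);
}

// Meaningful: the assertion is made on the output of the code under test.
function outputTest(fn: Discount): void {
  assertEqual(fn(100, 20), 80);
}

// Runs a test against an implementation and reports whether it passed.
function passes(test: (fn: Discount) => void, fn: Discount): boolean {
  try {
    test(fn);
    return true;
  } catch {
    return false;
  }
}

console.log(passes(selfSatisfyingTest, buggyDiscount)); // true: green, catches nothing
console.log(passes(outputTest, buggyDiscount)); // false: the bug is caught
console.log(passes(outputTest, applyDiscount)); // true
```

Both tests execute the same line of production code, so they are indistinguishable on a coverage report. Only one of them can ever go red.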
Mock-call-only tests. This was the single most-cited reject bar. A test that only checks "this mock was called" verifies your test setup, not your software. Over-mocking also hides the integration bug, which is often the one that actually matters.
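A small sketch of why "the mock was called" is a weak check, again in plain TypeScript. Everything here (registerUser, buggyRegisterUser, the Mailer type, the hand-rolled mock) is a hypothetical example, not code from any real project.

```typescript
// Hypothetical dependency: something that sends mail.
type Mailer = (to: string, subject: string) => void;

// Correct version: normalizes the address, mails it, and returns it.
function registerUser(email: string, sendMail: Mailer): { email: string } {
  const normalized = email.trim().toLowerCase();
  sendMail(normalized, "Welcome");
  return { email: normalized };
}

// Buggy version: skips normalization entirely, but still calls the mailer.
function buggyRegisterUser(email: string, sendMail: Mailer): { email: string } {
  sendMail(email, "Welcome");
  return { email };
}

type Register = typeof registerUser;

// Hand-rolled mock that records every call it receives.
function makeMailerMock() {
  const calls: Array<{ to: string; subject: string }> = [];
  const mailer: Mailer = (to, subject) => {
    calls.push({ to, subject });
  };
  return { mailer, calls };
}

// Mock-call-only: checks that the mailer was invoked and nothing else.
function callOnlyTest(register: Register): boolean {
  const { mailer, calls } = makeMailerMock();
  register("  Ada@Example.com ", mailer);
  return calls.length === 1;
}

// Behavior test: checks the returned value and what the mailer was asked to do.
function behaviorTest(register: Register): boolean {
  const { mailer, calls } = makeMailerMock();
  const user = register("  Ada@Example.com ", mailer);
  return (
    user.email === "ada@example.com" &&
    calls.length === 1 &&
    calls[0]?.to === "ada@example.com"
  );
}

console.log(callOnlyTest(buggyRegisterUser)); // true: the bug slips through
console.log(behaviorTest(buggyRegisterUser)); // false: the bug is caught
console.log(behaviorTest(registerUser)); // true
```

The mock itself is not the problem. The second test uses the same mock, but asserts on what the software produced instead of on the fact that the wiring ran.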
Snapshot padding. Snapshot tests are fine as a supplement. As the main source of coverage they are dangerous, because people regenerate them on a red run without reading what changed. A wall of snapshots inflates the number and the maintenance bill at the same time.
Happy-path-only. Coverage that only exercises the path where everything goes right skips exactly the branches where bugs live.
Duplicates with slightly different inputs. Sneaky, because they raise the count and the maintenance load without adding a single new case that matters.
Flaky tests. A test that passes most of the time is worse than no test. It trains the team to ignore red, and it hides the real failures in the noise.
Correctness, which is the hard one. Several testers made the same point: an incorrect test is worse than no test, because it gives false confidence that someone relies on later. AI-generated tests are especially prone to this because they look plausible on the surface. When the model badly misreads the behavior, the test usually reads as obviously wrong to a human and gets caught on the diff. The dangerous case is the test that looks right, runs green, and subtly asserts the wrong thing.
The one question that cuts through all of it
One reviewer offered a single line that is better than any checklist. For every generated test, ask:
What real defect would this test catch that was not being caught before?
If you cannot answer that after reading the test, it is coverage theater. We have adopted it as our one-line review test.
How we enforce this
A checklist is only worth as much as the gate behind it. On every delivery, before it reaches the customer:
- An automated assertion audit rejects no-assertion tests and tautological, self-satisfying assertions. They do not count toward the target.
- Each new test is re-run several times before it can count, so a flaky, sometimes-passing test is caught and dropped rather than shipped.
- We never lower coverage thresholds and never exclude files to fake the number. The existing suite has to stay green. The customer can reject any individual test in the PR.
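As an illustration of the first two gates, here is a simplified sketch, not our production implementation. It assumes each generated test receives a counting assertion helper (the hypothetical check function), so a run that makes zero assertions can be detected, and it re-runs each test a fixed number of times. Detecting tautological assertions needs static analysis of the test source and is not shown. The names gate, Check, and GeneratedTest are invented for this example.

```typescript
// Hypothetical shapes: a test is a function that receives a counting assert.
type Check = (condition: boolean, message?: string) => void;
type GeneratedTest = (check: Check) => void;

interface GateResult {
  accepted: boolean;
  reason: string;
}

// Re-runs a test `runs` times. Rejects it if any passing run made no
// assertions, or if it does not pass on every run.
function gate(test: GeneratedTest, runs = 5): GateResult {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    let assertions = 0;
    const check: Check = (condition, message) => {
      assertions++;
      if (!condition) throw new Error(message ?? "assertion failed");
    };
    try {
      test(check);
    } catch {
      continue; // this run failed; keep going to tell flaky from broken
    }
    if (assertions === 0) {
      return { accepted: false, reason: "no assertions" };
    }
    passed++;
  }
  if (passed === runs) return { accepted: true, reason: "stable" };
  if (passed === 0) return { accepted: false, reason: "always fails" };
  return { accepted: false, reason: "flaky" };
}

// Executes code, checks nothing.
const noAssertions: GeneratedTest = (_check) => {
  [3, 1, 2].sort();
};

// Deterministic stand-in for a timing-dependent test: fails every third run.
let runCounter = 0;
const sometimesFails: GeneratedTest = (check) => {
  runCounter++;
  check(runCounter % 3 !== 0, "simulated intermittent failure");
};

// Asserts on real output and passes every time.
const stable: GeneratedTest = (check) => {
  const sorted = [3, 1, 2].sort((a, b) => a - b);
  check(sorted.join(",") === "1,2,3");
};

console.log(gate(noAssertions).reason); // "no assertions"
console.log(gate(sometimesFails).reason); // "flaky"
console.log(gate(stable).reason); // "stable"
```

A fixed number of re-runs only catches flakiness that shows up within those runs, so the run count is a trade-off between delivery time and confidence.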
And for the correctness problem, the one that keeps everyone in that room up at night, we run mutation testing. Mutation testing deliberately introduces bugs into the code and checks whether the new tests fail. It is the strongest signal we know of that a test actually catches a defect rather than just executing a line. We have validated it on real packages, for example unjs/defu at an 84.47% mutation score on a Vitest and pnpm stack, and vercel/ms at 85.46% on Jest and npm. Today it runs as an advisory quality check on our own output, for JavaScript and TypeScript, and we are increasingly of the view that it should be the internal acceptance gate even where we report line coverage to the customer.
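The idea fits in a few lines. The sketch below is a toy under stated assumptions: the function under test (clamp) and the four mutants are hand-written and hypothetical, whereas real mutation tools generate mutants automatically from the source. The score is the percentage of mutants that make the suite fail.

```typescript
// Hypothetical function under test.
type Clamp = (value: number, min: number, max: number) => number;
const clamp: Clamp = (value, min, max) => Math.min(Math.max(value, min), max);

// Hand-written mutants: each one is the original with a small bug injected.
const mutants: Array<{ name: string; impl: Clamp }> = [
  { name: "swap min/max", impl: (value, min, max) => Math.max(Math.min(value, min), max) },
  { name: "drop upper bound", impl: (value, min) => Math.max(value, min) },
  { name: "drop lower bound", impl: (value, _min, max) => Math.min(value, max) },
  { name: "return input", impl: (value) => value },
];

// A suite returns true when all of its assertions hold for an implementation.
type Suite = (impl: Clamp) => boolean;

// Happy path only: a value that is already inside the range.
const happyPathSuite: Suite = (impl) => impl(5, 0, 10) === 5;

// Adds the two boundary branches.
const fullSuite: Suite = (impl) =>
  impl(5, 0, 10) === 5 && impl(-3, 0, 10) === 0 && impl(42, 0, 10) === 10;

// Percentage of mutants the suite "kills" (fails on).
function mutationScore(suite: Suite): number {
  if (!suite(clamp)) {
    throw new Error("suite must pass on the unmodified code first");
  }
  const killed = mutants.filter((m) => !suite(m.impl)).length;
  return (killed / mutants.length) * 100;
}

console.log(mutationScore(happyPathSuite)); // 25
console.log(mutationScore(fullSuite)); // 100
```

Both suites execute every line of clamp, so line coverage cannot tell them apart. The mutation score can: the happy-path suite lets three of the four injected bugs survive.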
The honest part
None of this fully solves the correctness problem. A human still reviews every test, looking specifically for the plausible-but-wrong case, and you review the PR before you merge it. We would never pitch this as hands-off, and we only support JavaScript and TypeScript today. But the bar above is the difference between a number and a suite, and publishing it is the point. Most "AI test" tooling ships the slop. We would rather tell you exactly what we refuse to ship.