Coverage is a map, not the territory

If you have spent any time around testing people, you have heard the warning: coverage is a vanity metric. It is half right, and the half it gets right matters.

A coverage report tells you which lines of code ran while your tests executed, and the percentage summarizes how many. That is genuinely useful information. What it does not tell you is whether those tests would notice if the code broke. You can hit any number you like with tests that assert nothing, and plenty of products with a proud green badge still ship bugs. So the number on its own measures how much code your tests exercised, not whether they checked anything.
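To make that concrete, here is a minimal sketch of two tests that would earn the same line coverage. The function `applyDiscount` and both test functions are hypothetical names invented for illustration, and the tests are written as plain functions rather than in any particular test framework.

```typescript
// Hypothetical function under test: orders of 100 or more get 10 off.
function applyDiscount(total: number): number {
  if (total >= 100) {
    return total - 10;
  }
  return total;
}

// Executes both branches, so a coverage tool would count every line as run,
// but it never looks at a return value.
function hollowTest(): boolean {
  applyDiscount(150);
  applyDiscount(50);
  return true; // passes no matter what applyDiscount returns
}

// Executes the same lines and also checks the results.
function realTest(): boolean {
  return applyDiscount(150) === 140 && applyDiscount(50) === 50;
}
```

Both tests pass and both run the same lines. Only the second one would fail if the discount logic broke.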

We sell a service that raises coverage, so we have a commercial incentive to pretend the number is the point. It is not, and saying otherwise to engineers is a fast way to lose their trust. Here is how we actually think about it.

Coverage as a forcing function, not a goal

The honest value of a coverage target is that chasing it makes you go and look at the paths nobody tested. That is usually where the bugs are hiding. The percentage is the byproduct of having gone and looked, not the prize.

The trap is the moment coverage becomes the KPI. As soon as the number is the goal, people write tests to hit lines instead of to catch anything, and you end up with a green number sitting on top of nothing. So we treat coverage as a check that tests of real behavior exist, not as the thing we are selling. The promise we make is about regressions caught. The 80% coverage target is how we verify the work landed.

Test the code that is moving, not the code that is asleep

One reviewer pointed out something that blanket coverage targets get badly wrong: code that has not changed in ten years and has not broken does not need new tests bolted onto it for the sake of a number. You are just adding churn and maintenance for no new confidence.

The effort belongs on changed code and risky paths, the parts that are actually moving or that hurt if they break. Coverage is useful there as a way to find what is untested. Used as a blanket score across an entire repository, it is closer to a tax. Deciding "this is stable legacy, leave it alone" versus "this is a hot path with a gap" is most of the actual skill, and it is the judgment we apply before writing a single test.
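One way to picture that is a filter that only reports coverage gaps in files that recently changed. This is a rough sketch under stated assumptions: the `FileCoverage` shape, the file paths, and `gapsInChangedCode` are all hypothetical, and real coverage reporters use their own formats. It also covers only the "changed" half of the judgment; deciding what counts as risky still takes a person.

```typescript
// Hypothetical per-file summary, invented for this example.
interface FileCoverage {
  path: string;
  coveredLines: number;
  totalLines: number;
}

// Return changed files whose line coverage is below the threshold, worst first.
// Files that did not change are ignored, however low their coverage is.
function gapsInChangedCode(
  report: FileCoverage[],
  changedPaths: string[],
  threshold: number
): { path: string; ratio: number }[] {
  return report
    .filter((f) => changedPaths.indexOf(f.path) !== -1 && f.totalLines > 0)
    .map((f) => ({ path: f.path, ratio: f.coveredLines / f.totalLines }))
    .filter((f) => f.ratio < threshold)
    .sort((a, b) => a.ratio - b.ratio);
}

const report: FileCoverage[] = [
  { path: "src/legacy/format.ts", coveredLines: 10, totalLines: 100 }, // stable, untouched
  { path: "src/billing/charge.ts", coveredLines: 30, totalLines: 60 }, // changed, half covered
  { path: "src/billing/refund.ts", coveredLines: 45, totalLines: 50 }, // changed, well covered
];

const gaps = gapsInChangedCode(
  report,
  ["src/billing/charge.ts", "src/billing/refund.ts"],
  0.8
);
```

With this made-up data, `gaps` contains only `src/billing/charge.ts`. The legacy file has the lowest coverage in the report, but nobody touched it, so it is left alone.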

How do you know the tests are real? Mutation testing.

If coverage cannot tell you whether a test catches bugs, what can? The best answer we know is mutation testing.

Mutation testing makes small, deliberate changes to the code (flipping a comparison, deleting a line, altering a return value) and then runs your suite against each change to see whether any test fails. If your tests do not notice the change, they were not really protecting that behavior. It is the closest thing to an objective measure of whether a suite has teeth.
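The idea can be shown in miniature. In this sketch the mutants are written by hand and the "suites" are plain functions, whereas a real mutation testing tool generates the mutants automatically and runs your actual test suite. Every name here is hypothetical and chosen only for illustration.

```typescript
type Impl = (total: number) => number;
type Suite = (impl: Impl) => boolean; // true means every check passed

// Original behavior: orders of 100 or more get 10 off.
const original: Impl = (total) => (total >= 100 ? total - 10 : total);

// Hand-written mutants, each one small change of the kind a tool would make.
const mutants: { name: string; impl: Impl }[] = [
  { name: "flip >= to >", impl: (total) => (total > 100 ? total - 10 : total) },
  { name: "alter return value", impl: (total) => (total >= 100 ? total + 10 : total) },
  { name: "delete discount branch", impl: (total) => total },
];

// Runs the code but asserts nothing.
const hollowSuite: Suite = (impl) => {
  impl(150);
  impl(50);
  return true;
};

// Checks results, but never at the boundary value.
const weakSuite: Suite = (impl) => impl(150) === 140 && impl(50) === 50;

// Also checks the boundary at exactly 100.
const strongSuite: Suite = (impl) =>
  impl(150) === 140 && impl(100) === 90 && impl(50) === 50;

// Fraction of mutants that make the suite fail ("killed" mutants).
function mutationScore(suite: Suite): number {
  if (!suite(original)) {
    throw new Error("suite must pass on the original code");
  }
  const killed = mutants.filter((m) => !suite(m.impl)).length;
  return killed / mutants.length;
}
```

All three suites pass against the original code and run the same lines. `mutationScore(hollowSuite)` is 0, `mutationScore(weakSuite)` is 2 out of 3 because the flipped comparison survives, and `mutationScore(strongSuite)` is 1. That spread is exactly the information a coverage percentage cannot give you.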

We run it as a quality check on our own output. The result is a mutation score: the percentage of injected changes that the suite detected. We have validated it on real, public packages, for example unjs/defu at an 84.47% mutation score on a Vitest and pnpm stack, and vercel/ms at 85.46% on Jest and npm. It is the signal we trust to confirm the tests we wrote actually catch a regression, rather than just moving the coverage number.

The honest limits

Mutation testing is not free. Because the suite has to run again for each change, it is slower than a normal test run, so today we run it as an advisory check. It only covers JavaScript and TypeScript, which is all we support right now. It is not a perfect metric either. But it answers a question that a coverage percentage cannot. Pairing the two (coverage to find the gaps, mutation testing to prove the tests have teeth) gives a far more honest picture than either number alone.

Coverage is the map. Mutation testing is how you check the map matches the territory.