Testing Whether a Code-Risk Metric Predicts Anything: Defects, Then Maintenance

In an earlier post I introduced riskratchet, a tool that scores Python functions for maintainability risk and fails a PR when a function gets worse. I argued the score was a better review signal than coverage or CRAP alone.

I never checked whether it was true.

A risk score makes an implicit prediction: the functions it scores high are the ones that will cause trouble later. That is a testable claim, and most tools in this space never test it. They show you a plausible-looking number and move on. This post is about how to test that claim properly. It took two experiments to do it right: first against bugs, where the metric came up short, then against a maintenance outcome, where it held up more sharply than I expected. The method is general. It works for any function-level or file-level code metric. riskratchet is just the worked example.

All of this ships alongside 0.2.10. None of it changed the product. That is the point.

The Claim a Score Is Actually Making

When a metric assigns a function a high risk number, it is quietly promising something:

Functions I rank high are more likely to cause trouble later than functions I rank low.

That sentence is falsifiable, once you pin down what "trouble" means. The whole story below turns on that choice. Pick the wrong outcome and you can make a fine metric look broken, or a broken one look fine.

The hard part is doing it without fooling yourself. Four things make it easy to cheat by accident:

Looking at the present. If you score today's code and label today's trouble, the metric has already "seen" the outcome. You need a gap between when you score and when you judge.
Labelling by hand. Tag a few dozen functions yourself and your sample is too small and too biased to mean anything.
Trusting one run. If the score depends on a flaky input, the number you publish is partly noise, and you will never know unless you run it twice.
Measuring the wrong outcome. The subtlest one, and the reason this post has two halves.

Experiment 1: Does It Predict Bugs?

Score the past, judge with the future

The trick that removes the first problem is simple. Score the code as it looked a year ago, then use the next year of git history to find out which functions actually broke. The tool cannot see the future, so it cannot cheat.

For each repository: pick a snapshot commit S about 365 days before HEAD and pin its SHA so "a year ago" never drifts. Check out S in a throwaway worktree, run that repo's own test suite under coverage, and feed the coverage into riskratchet. Now every function at S has a real score, computed using only information that existed at S. Everything that follows, the labels included, comes strictly from commits that landed after S.
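A rough sketch of the pinning step, shelling out to git. The helper names (git, cutoff_before, pin_snapshot, checkout_snapshot) are illustrative, not the harness's actual API, and measuring the year from HEAD's commit date rather than from today's date is an assumption.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def git(repo: str, *args: str) -> str:
    """Run a git command inside `repo` and return its stripped stdout."""
    result = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def cutoff_before(head_time: datetime, days: int = 365) -> str:
    """Timestamp `days` before HEAD, in a date format git's parser accepts."""
    cutoff = head_time.astimezone(timezone.utc) - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S +0000")


def pin_snapshot(repo: str, days: int = 365) -> str:
    """SHA of the newest commit in HEAD's history at or before the cutoff."""
    head_epoch = int(git(repo, "show", "-s", "--format=%ct", "HEAD"))
    head_time = datetime.fromtimestamp(head_epoch, tz=timezone.utc)
    sha = git(repo, "rev-list", "-1",
              f"--before={cutoff_before(head_time, days)}", "HEAD")
    if not sha:
        raise ValueError("history is shorter than the requested window")
    return sha  # store this SHA; do not recompute "a year ago" on later runs


def checkout_snapshot(repo: str, sha: str, worktree_dir: str) -> None:
    """Materialize the pinned snapshot in a detached, disposable worktree."""
    git(repo, "worktree", "add", "--detach", worktree_dir, sha)
```

Writing the returned SHA into the dataset, rather than the 365-day rule, is what keeps "a year ago" from drifting between runs.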

Labelling bugs without labelling anything

Hand-labelling does not scale, and the standard answer in software research is SZZ, after Śliwerski, Zimmermann, and Zeller (2005). Derive the labels from git history:

Find the bug-fixes. Walk commits between S and HEAD, keep the ones whose message looks like a fix (fix, bug, closes #…).
Blame the fix backward. For each fix, take the lines it changed or deleted and git blame them at the commit just before the fix. Blame points at the earlier commit that, in SZZ's model, introduced the bug.
Map the line to a function, then track it back to S by path-and-name, or by a body fingerprint if it moved.

A function at S is defect-implicated if at least one later fix traces back to it. No human tags anything. The honest caveats bound everything downstream: keyword matching misses fixes phrased oddly and flags "fix typo in docstring", and git blame returns the last commit to touch a line, not always the one that introduced the bug (-w and an ignore-revs list for reformat commits reduce, but do not remove, the noise). SZZ is the standard, not the truth.
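The three SZZ steps can be sketched in a few lines. This is a minimal illustration, not the harness itself: the keyword pattern is an assumed expansion of the examples above, the ignore-revs file name is a common convention rather than something the harness is known to use, and all three function names are hypothetical.

```python
import ast
import re
from typing import List, Optional

# Assumed pattern, built from the examples "fix", "bug", "closes #...".
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|closes\s+#\d+)\b", re.IGNORECASE)


def looks_like_fix(message: str) -> bool:
    """Step 1: keyword heuristic for 'this commit message reads like a fix'."""
    return FIX_PATTERN.search(message) is not None


def blame_command(fix_sha: str, path: str, start: int, end: int,
                  ignore_revs: str = ".git-blame-ignore-revs") -> List[str]:
    """Step 2: blame lines start..end at the parent of the fix commit.

    -w ignores whitespace-only changes; --ignore-revs-file skips reformat commits.
    """
    return ["git", "blame", "-w", "--ignore-revs-file", ignore_revs,
            "-L", f"{start},{end}", "--porcelain", f"{fix_sha}^", "--", path]


def enclosing_function(source: str, lineno: int) -> Optional[str]:
    """Step 3: name of the innermost function whose span contains `lineno`."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= lineno <= node.end_lineno:
                # The latest-starting enclosing def is the innermost one.
                if best is None or node.lineno >= best.lineno:
                    best = node
    return best.name if best else None
```

Note that looks_like_fix("fix typo in docstring") returns True, which is exactly the false positive described above.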

The scoreboard: AUC

Each function at S now has a score and a label. "Do the high scores land on the buggy functions?" has a clean single-number answer: AUC, the area under the ROC curve.

AUC = the probability that a randomly chosen buggy function scored higher than a randomly chosen clean one.

0.5 is a coin flip. Above means the score ranks buggy functions above clean ones. Below means the metric is pointing the wrong way, actively backwards, the case people forget is possible. AUC drops straight out of the Mann-Whitney U statistic, which also gives a z for significance.
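For concreteness, here is a dependency-free sketch of AUC computed from the Mann-Whitney U statistic, with average ranks for ties. The function name auc_and_z is illustrative, and the z uses the plain normal approximation without a tie correction in the variance, which is a simplification.

```python
import math


def auc_and_z(scores, labels):
    """AUC and approximate z from Mann-Whitney U.

    labels: 1 = buggy, 0 = clean. Both classes must be present.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank across the tie group
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2  # U for the buggy class
    auc = u / (n_pos * n_neg)
    sigma = math.sqrt(n_pos * n_neg * (n + 1) / 12)  # no tie correction
    z = (u - n_pos * n_neg / 2) / sigma
    return auc, z
```

A negative z paired with an AUC below 0.5 is the "pointing the wrong way" case.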

Run everything twice

riskratchet's score depends on a coverage run, a coverage run depends on a test suite, and test suites are not always deterministic. So I ran the entire pipeline twice per repository, with independent fresh-coverage runs, and byte-compared the outputs:

The labels reproduced on all 34 repos. SZZ is parsing and blame, no coverage in the loop.
The scores were byte-identical on 26 of 34 repos.
8 repos had mildly flaky coverage. The worst, category-encoders, drifted on 14% of functions (unseeded scikit-learn random_state). The drift moved a few scores, never the labels, and re-running it end-to-end gave an identical AUC.

The bigger payoff was indirect: running everything twice surfaced two real bugs in my own harness that a single pass would have hidden. One was a bare uv run that walked up the tree and ran each repo's suite in riskratchet's own virtualenv (fixed with --no-project). The other was an unpinned pytest resolving to 9.x and killing year-old suites at plugin registration (pinned <9). Neither corrupted a committed number, both only ever suppressed repos, but I would not have known without the second run.
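The byte-comparison itself needs very little code. A minimal sketch, assuming each run writes its outputs into its own directory; the function names and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def digest_tree(root: str) -> dict:
    """SHA-256 of every file under `root`, keyed by path relative to `root`."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*")) if p.is_file()
    }


def differing_outputs(run_a: str, run_b: str) -> list:
    """Relative paths whose bytes differ, or that exist in only one of the runs."""
    a, b = digest_tree(run_a), digest_tree(run_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

An empty list means the two runs reproduced byte for byte. A file present in only one run shows up too, which is how a silently suppressed repo becomes visible.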

What it found

Here is the verdict for the well-powered repos (10+ buggy functions) among the 34, ordered by how much each row can be trusted, that is, by number of buggy functions. A z beyond ±2 is significant.

repo	buggy fns	total AUC	sprawl AUC	AUC without file-line term	z (total AUC)
xarray	224	0.464	0.526	0.480	−2.0
networkx	64	0.391	0.570	0.404	−3.0
croniter	55	0.479	0.720	0.356	−0.5
tenacity	52	0.502	0.473	0.507	0.0
deepdiff	41	0.517	0.525	0.523	0.4
packaging	37	0.643	0.539	0.610	2.8
pint	31	0.375	0.489	0.386	−2.6
click	28	0.648	0.542	0.664	2.7
sqlglot	28	0.788	0.562	0.792	5.3
marshmallow	26	0.571	0.496	0.587	1.2
pyparsing	22	0.575	0.456	0.623	1.2
rich	19	0.614	0.640	0.624	1.7
bayesian-optimization	18	0.722	0.652	0.554	3.1
more-itertools	17	0.757	0.496	0.757	3.6
lifelines	10	0.618	0.635	0.709	1.3
requests	10	0.632	0.520	0.615	1.4

Look at the top of the table. networkx scored 0.391. On the repo with the second-most bug data, riskratchet ranked buggy functions below clean ones, significantly (z = −3.0). xarray, the most data-rich repo, is also below chance (0.464). So is pint (0.375). The picture is split: sqlglot 0.788 (z = 5.3), more-itertools, bayesian-optimization, click and packaging are genuinely predictive. But weighted by data, the center of mass sits below chance. The clean "0.61 to 0.77 everywhere" I had seen on an early four-repo sample was a mirage that dissolved the moment I added bigger repos.

One narrow claim survived every expansion: dropping the file-line half of the sprawl component raised the AUC in 25 of 34 repos (sign test p ≈ 0.0045). On average that part of the score is dead weight, though croniter and bayesian-optimization get clearly worse without it. The honest one-word summary is heterogeneity: significantly predictive and significantly anti-predictive repos, at the same time.
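As a sanity check on that p-value, a one-sided exact sign test takes a few lines. This sketch assumes a 50/50 null and no tied repos, and sign_test_p is an illustrative name.

```python
from math import comb


def sign_test_p(wins: int, n: int) -> float:
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


print(round(sign_test_p(25, 34), 4))  # 0.0045
```

A two-sided version would double the value, so which convention is in use matters when comparing against a threshold.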

Why I held the weights

The easy move is to declare the file-line term "noise" and ship a weight change in 0.2.10. That move is wrong. When populations disagree on the sign (sqlglot says it works, networkx says it is backwards), no single global weight fits both, and a "pooled AUC went up" would be an artifact of which repos happen to be biggest, not evidence about the metric. A result this mixed calls for changing less, not more.

But the deeper problem was not the heterogeneity. It was that I had measured the wrong thing.

The Pivot: Defects Were the Wrong Outcome

Two gaps in Experiment 1 are not fixable by adding more repos:

Construct. SZZ measures defects. riskratchet claims maintainability. A function can be sprawly and miserable to work in without ever producing a logged bug-fix in a one-year window. "Does not predict defects" is not "is worthless". It only means it does not predict that particular outcome.
Population. The corpus is mature, green-CI OSS libraries, close to the opposite of riskratchet's stated target, which is messy AI-assisted side-project code. And I cannot easily reach that target, because Experiment 1 needs a runnable test suite for coverage, and untested code is exactly the code that lacks one.

So I ran a second experiment built specifically to close both gaps: a different outcome, and a way to score code without ever running its tests.

Experiment 2: Does It Predict Maintenance?

A maintainability outcome from git alone

Instead of bugs, the outcome is change-proneness: how often a function gets edited in the year after S. A function people keep reopening is a reasonable stand-in for "hard to live with", closer to maintainability than "had a bug". I binarize it to the top quartile of future edit-count within each repo, which gives far more statistical power than the 2 to 3% defect rate.
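A minimal sketch of the per-repo binarization, assuming the future edit-counts have already been extracted from git. The row shape, the function name, and the tie rule (strictly above the repo's 75th percentile) are all assumptions.

```python
from collections import defaultdict
from statistics import quantiles


def label_change_prone(rows):
    """rows: iterable of (repo, function_id, future_edit_count).

    Returns {(repo, function_id): 0 or 1}, where 1 marks a function whose
    future edit-count is strictly above its own repo's 75th percentile.
    Each repo needs at least two functions for the quartiles to be defined.
    """
    by_repo = defaultdict(list)
    for repo, fn, edits in rows:
        by_repo[repo].append((fn, edits))
    labels = {}
    for repo, fns in by_repo.items():
        counts = [edits for _, edits in fns]
        # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
        q3 = quantiles(counts, n=4, method="inclusive")[2]
        for fn, edits in fns:
            labels[(repo, fn)] = int(edits > q3)
    return labels
```

With a strict cutoff, heavy ties pull the positive rate below 25%: a repo where every function has the same edit-count gets no positives at all.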

And I score coverage-free: source plus .git, no suite, no venv. The two coverage components drop out. The four static signals (complexity, the two sprawl halves, public surface) remain, since they need only the source. This is the move that matters: untested repos are now in scope, and as a bonus the flaky-coverage problem from Experiment 1 disappears entirely. Run twice, the maintenance pipeline was byte-identical on all 34 repos, perfectly reproducible, because there is no test suite left to wobble.

The confound, and the null model that handles it

There is an obvious cheat. Functions edited a lot in the future are usually the ones edited a lot in the past. Busy code stays busy. That is autocorrelation of activity, not maintainability. So past activity is not a nuisance to ignore. It is the null model the structural signals must beat. I fit two models, the exact "proper pooled, repo-stratified model" Experiment 1 said a weight change should wait for:

null: change-prone ~ past-churn + repo
full: change-prone ~ past-churn + complexity + sprawl-fn + sprawl-file + public + repo

Both are L2 logistic regressions with repo fixed-effects, evaluated leave-one-repo-out (so every repo is predicted by a model that never saw it). The structural signals earn their keep only if full beats null, if they predict future maintenance beyond "active code stays active".
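To make the evaluation loop concrete, here is a dependency-free sketch of leave-one-repo-out scoring for one feature set. Calling it once with the null features and once with the full features gives the comparison. The tiny gradient-descent fit stands in for whatever solver the harness actually uses. The row format, hyperparameters, and function names are assumptions, and features are assumed to be pre-scaled to a similar range.

```python
import math


def fit_logistic(X, y, l2=1.0, lr=0.5, epochs=200):
    """Batch-gradient-descent L2 logistic regression (intercept unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            gb += p - yi
            for j, xj in enumerate(xi):
                gw[j] += (p - yi) * xj
        w = [wj - lr * (gj + l2 * wj) / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def auc(scores, labels):
    """Pairwise AUC; ties between a positive and a negative count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def leave_one_repo_out_auc(rows, feature_names):
    """rows: dicts with 'repo', 'label' (0/1), and numeric feature keys.

    Returns {repo: AUC on that repo from a model trained on every other repo}.
    """
    out = {}
    for held_out in sorted({r["repo"] for r in rows}):
        train = [r for r in rows if r["repo"] != held_out]
        test = [r for r in rows if r["repo"] == held_out]
        train_repos = sorted({r["repo"] for r in train})

        def design(r):
            # Features plus one-hot repo indicators (the repo fixed effects).
            # The held-out repo matches no indicator, so only shared slopes rank it.
            return ([r[f] for f in feature_names]
                    + [float(r["repo"] == t) for t in train_repos])

        w, b = fit_logistic([design(r) for r in train], [r["label"] for r in train])
        scores = [b + sum(wj * xj for wj, xj in zip(w, design(r))) for r in test]
        out[held_out] = auc(scores, [r["label"] for r in test])
    return out
```

The function returns per-repo AUCs. Whether those are averaged or the held-out predictions are pooled into a single AUC is a separate choice that this sketch leaves open.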

What it found

Across the same 34 repos (33,490 functions, 4,515 change-prone):

model	leave-one-repo-out AUC
null (past activity only)	0.574
full (+ structural signals)	0.661

The structural signals beat the activity null by Δ = +0.086, with full better in 30 of 34 repos (sign-test p < 0.00001). Unlike the defect study, this is not heterogeneous noise. The structural signals carry real maintenance signal over and above prior churn, almost everywhere. This is the result the tool was built to earn.

Then I asked which signals, by adding each one alone on top of past-churn, and by dropping each from the full model:

signal	alone, Δ vs null	dropped, Δ vs full
structural_complexity	+0.079	−0.063
sprawl_function_term	+0.027	−0.003
sprawl_file_term	+0.008	−0.001
public_surface	−0.013	−0.001

This is the sharpest result in the post. Complexity does almost all the work. Alone it lifts AUC from 0.574 to 0.654, nearly the entire full-model gain. Dropping it collapses the model. Dropping any of the other three costs essentially nothing. The function-length sprawl half has a faint standalone signal but is redundant with complexity. public_surface is slightly negative alone. And the file-line sprawl term, the one Experiment 1 already flagged, is net-noise here too (its full-model coefficient's 95% CI spans zero). Two different experiments, two different outcomes, same verdict on that term.
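The add-one and drop-one loops behind that table are simple once there is a function that returns an AUC for a given feature list. A sketch, where ablate is an illustrative name and evaluate is any callable you supply (for instance, a wrapper around a leave-one-repo-out fit):

```python
def ablate(evaluate, base, signals):
    """evaluate(feature_list) -> AUC.

    Returns {signal: (alone_delta_vs_null, dropped_delta_vs_full)}.
    """
    null_auc = evaluate(list(base))
    full_auc = evaluate(list(base) + list(signals))
    table = {}
    for s in signals:
        # Add this one signal on top of the null model's features.
        alone = evaluate(list(base) + [s]) - null_auc
        # Remove this one signal from the full model.
        rest = [t for t in signals if t != s]
        dropped = evaluate(list(base) + rest) - full_auc
        table[s] = (round(alone, 3), round(dropped, 3))
    return table
```

A signal that matters shows a large positive "alone" delta and a large negative "dropped" delta. A redundant one can look useful alone yet cost almost nothing when dropped.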

One caveat carries weight, and it cuts against the number. Both outcomes I can measure, logged bug-fixes and future edit-count, are blind to the cost a long file imposes while you work in it. That cost is real even when the term reads as noise: you scan and scroll a 950-line module just to find the function you came for, you hold far more of the file's structure in working memory before any edit is safe, reviewers sign off on diffs buried in long files with less grasp of the surroundings, and every change carries more merge-conflict and ripple surface. None of that has to produce a logged defect or an extra edit inside a one-year window to be a tax a maintainer pays every day. So the honest reading is narrower than "the file-line term is dead weight." It is this: the term earns nothing against the two outcomes I tested, and those two outcomes cannot see the cost the term is actually trying to price.

Finally, sensitivity: is "top quartile" a lucky cutoff? No. At top-decile, top-quartile, and median splits the gap stays in a tight +0.084 to +0.097 band, full beating null in 28 to 31 of 34 repos every time. The conclusion does not hinge on the threshold.

Why the weights still held

Experiment 2 is the positive result, and the obvious next move is a weight change: "complexity works, the rest is dead weight, reweight toward complexity." I held the weights anyway, for four reasons that decide it.

The population is not the target yet. All 34 repos are polished OSS, the proven-easy end. The messy, untested cohort the tool is built for is not in the corpus yet, and coverage-free scoring is what finally makes it reachable. The method is proven where it was easiest to prove. The weight change gets earned on the target, not here.
Change-proneness is a proxy. It measures "got edited", not "was painful to edit". The past-churn null guards against the laziest confound, but it does not turn a proxy into the real thing.
The finding is an input, not a conclusion. "Complexity carries the signal, the other components do not earn their weight" is exactly the kind of result that feeds a weight redesign, after a target-resembling corpus confirms it, not one you ship off a polished-only sample.
The term both experiments flag has the strongest off-data case. The file-line sprawl term prices a visual and cognitive tax, the scanning and scrolling, the working-memory load, the review and merge friction of a long file, that neither defects nor change-proneness can see. Dropping it on this evidence trades a measured null for an unmeasured but real cost. That is the wrong trade until an outcome that can see the tax says otherwise.

So 0.2.10 ships two experiments, the harness, the data, and this writeup. It ships zero changes to the scoring weights, on purpose. The weight change is queued behind one clean run on the messy corpus. The discipline is the deliverable.

Running This on Your Own Metric

Strip away riskratchet and the recipe generalizes to any code metric:

1. Pick the outcome that matches your claim, not the one that is easiest to label. If your metric claims maintainability, do not settle for defects because SZZ is convenient. Getting this wrong invalidates everything after it.
2. Score the past, judge with the future. Pin a snapshot SHA, use only information that existed then.
3. Derive labels from git, not by hand. SZZ for defects, future change-counts for maintenance.
4. Null-model the obvious confound. For maintenance, past activity predicts future activity for free. Your signal has to beat that, not just beat a coin flip.
5. Compute AUC of score against label. Below 0.5 is the result you most need to be able to see.
6. Do all of it twice and byte-compare. If the numbers do not reproduce, fix that before you believe any of them. (Dropping a flaky input, as coverage-free scoring does, can buy you reproducibility and reach at once.)
7. Ablate per-signal before concluding. A composite score that "works" may be one ingredient working and five coming along for the ride. You want to know which.
8. Pool across many repos, and check whether they agree on sign before shipping any global change.

Steps 1, 4, 6, and 7 are the ones people skip, and they are exactly the ones that stop you from publishing noise, or a single lucky ingredient, as a finding.

Final Takeaway

I built a code-risk metric, then built the machinery to find out whether it was telling the truth. The answer depended entirely on what I asked it to predict. Against bugs, on the biggest repos, it ranks buggy functions slightly below clean ones. Against future maintenance churn, it works, and the signal concentrates in one ingredient, complexity, while the rest of the recipe, including a sprawl term flagged in both experiments, does not earn its place against either outcome I could measure. I kept that term anyway, because a long file taxes the humans who maintain it in ways neither outcome can see.

Publishing this about my own tool, twice, is the job. The alternative, shipping a metric and never checking it, is how most of these tools work, and it is why so few earn trust. The reusable lesson is not "riskratchet's sprawl term is weak". It is that a score is a prediction, a prediction is only as honest as the outcome you test it against, and testing it properly sometimes means running two experiments, proving the tool out on one of them, and shipping no change because the data has not earned it yet.

The harness, the 34-repo datasets for both experiments, and the full threats-to-validity writeups are in the repo under bin/calibration/ and data/calibration/. The tool is on PyPI as riskratchet. As of 0.2.10, the score is exactly what it was in 0.2.9, and now I can show you, two different ways, why.