Measuring Filter Accuracy Honestly: What 97.1% Actually Means
Accuracy numbers are easy to misread. Here is how to think about precision, recall, and the real meaning of a filter's headline figure.
Numbers Need Context
A filter that advertises "97.1% accuracy" tells you almost nothing without context. Accuracy on what dataset? Is it measuring precision, recall, or some blend? Across which threat categories? An honest discussion of filter performance has to unpack the headline.
Precision vs Recall
Recall is the fraction of real threats the filter catches. High recall means few threats slip through. Precision is the fraction of flagged items that are actually threats. High precision means few false alarms.
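To make the two definitions concrete, here is a minimal sketch that computes both from raw counts. The function name and the counts are illustrative assumptions, not figures from any real evaluation.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from confusion-matrix counts.

    true_positives:  threats that were flagged
    false_positives: benign items that were flagged (false alarms)
    false_negatives: threats that slipped through
    """
    flagged = true_positives + false_positives
    actual_threats = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual_threats if actual_threats else 0.0
    return precision, recall


# Made-up counts: 100 items flagged, 90 of them real; 30 threats missed.
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f}")
```

With these counts the filter looks strong on precision (9 in 10 flags are real) while still missing a quarter of the threats, which is exactly the gap a single headline number can hide.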
These trade off. Crank sensitivity up and you catch more threats (higher recall) but flag more benign prompts (lower precision). Tune it down and the reverse happens. No single "accuracy" figure captures both; you have to talk about the tradeoff.
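The tradeoff can be seen by sweeping a decision threshold over scored examples. This sketch assumes the filter emits a threat score between 0 and 1, which is an assumption made for illustration; the scores and labels below are invented.

```python
# (score, is_threat) pairs. Hypothetical data, not real filter output.
SCORED = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True), (0.20, False),
    (0.10, False), (0.05, False),
]


def metrics_at(threshold, scored=SCORED):
    """Flag everything scoring at or above threshold; return (precision, recall)."""
    tp = sum(1 for score, is_threat in scored if score >= threshold and is_threat)
    fp = sum(1 for score, is_threat in scored if score >= threshold and not is_threat)
    fn = sum(1 for score, is_threat in scored if score < threshold and is_threat)
    # When nothing is flagged there are no false alarms; 1.0 is a convention here.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A lower threshold means higher sensitivity.
for threshold in (0.25, 0.50, 0.75):
    precision, recall = metrics_at(threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, the most sensitive setting catches every threat but about 29% of its flags are false alarms, while the strictest setting raises no false alarms and misses two threats in five.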
What the Two Tiers Deliver
Sprappy Filter's pattern tier catches roughly 95% of clear-cut threats. That figure refers to unambiguous cases: the injections, structured PII, and known malware that look exactly like what they are. Because it counts threats caught, it is a recall-style number; pattern matching is also high-precision on these cases because the signals are unmistakable.
The transformer cascade addresses the ambiguous middle band — the paraphrased, context-dependent prompts where patterns fall short — pushing combined accuracy to 97.1%. That number is meaningful precisely because it covers the hard cases, not just the easy ones.
What 97.1% Is Not
It is not 100%. No filter is. It does not mean 2.9% of all traffic is malicious and uncaught — most traffic is benign and trivially allowed. It means that on the threats the filter is evaluated against, it correctly resolves about 97 in 100, with the remainder being the genuinely hardest, most novel cases.
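A quick calculation shows why the miss rate on threats is not the miss rate on traffic. The traffic volume and the 1% threat rate below are assumptions chosen purely for illustration, and the sketch treats 97.1% as the share of evaluated threats that are correctly resolved.

```python
def uncaught_share_of_traffic(threat_rate, catch_rate):
    """Fraction of ALL traffic that is both malicious and missed."""
    return threat_rate * (1.0 - catch_rate)


total_prompts = 1_000_000   # assumed volume
threat_rate = 0.01          # assumed: 1 prompt in 100 is a threat
catch_rate = 0.971          # headline figure, applied to threats only

share = uncaught_share_of_traffic(threat_rate, catch_rate)
missed = total_prompts * share
print(f"uncaught threats: about {missed:.0f} of {total_prompts:,} prompts ({share:.3%})")
```

Under these assumptions, roughly 290 prompts in a million are missed threats, or 0.029% of traffic rather than 2.9%. The real share depends entirely on how much of your traffic is malicious in the first place.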
False Positives Cost Too
A filter that blocks legitimate prompts is its own problem. Users hit walls, support tickets pile up, and teams disable the filter. This is why the block/sanitize/allow design matters — sanitize lets you handle borderline cases without an outright block, reducing the user-facing cost of a false positive.
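One way to picture the three-way design is as two thresholds on a threat score. This is only a sketch: the score, the threshold values, and the `decide` function are hypothetical and do not describe how Sprappy Filter actually makes its decision.

```python
def decide(score, block_at=0.9, sanitize_at=0.6):
    """Map a hypothetical threat score in [0, 1] to an action.

    Borderline scores fall into the middle band and are sanitized
    rather than blocked outright.
    """
    if score >= block_at:
        return "block"
    if score >= sanitize_at:
        return "sanitize"
    return "allow"


for score in (0.95, 0.70, 0.10):
    print(score, decide(score))
```

The middle band is where a false positive costs least: a benign prompt that lands there is cleaned up and passed along, instead of leaving the user at a wall.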
How to Measure on Your Traffic
Do not trust any single vendor number for your specific use case. Run the filter in log-only mode on real traffic at https://api.sprapp.com/v1/filter, then:
- Sample the flagged prompts and label them by hand
- Compute your own precision (the fraction of flags that were real threats)
- Sample the allowed prompts and look for misses (a proxy for recall, since you cannot label every allowed prompt)
- Tune per-category thresholds to your tolerance
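The labeling steps above can be turned into numbers with a few lines of code. This sketch works only from hand labels and makes no assumption about the API's request or response format; the sample sizes and label counts are invented, and the Wilson score interval is one common choice for expressing sampling uncertainty, not something the filter prescribes.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def summarize(flagged_labels, allowed_labels):
    """Each list holds one bool per hand-labeled sample: True = real threat.

    Assumes both samples are non-empty.
    """
    real_flags = sum(flagged_labels)
    misses = sum(allowed_labels)
    return {
        "precision": real_flags / len(flagged_labels),
        "precision_ci": wilson_interval(real_flags, len(flagged_labels)),
        "miss_rate_in_allowed": misses / len(allowed_labels),
        "miss_rate_ci": wilson_interval(misses, len(allowed_labels)),
    }


# Hypothetical hand labels: 200 flagged prompts, 500 allowed prompts.
flagged_sample = [True] * 180 + [False] * 20
allowed_sample = [True] * 3 + [False] * 497
report = summarize(flagged_sample, allowed_sample)
print(report)
```

The miss rate among allowed prompts is not recall itself. It only becomes a recall estimate once it is scaled by how much traffic was allowed versus flagged, so treat it as a trend to watch while tuning thresholds. Running the same summary per threat category gives the per-category view that the last step calls for.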
The Honest Bottom Line
Treat headline accuracy as a starting point, not a guarantee. The pattern tier's 95% on clear-cut threats and the 97.1% combined figure, reached once the cascade handles the ambiguous band, are real, useful numbers, but the right number for you is the one you measure on your own data. Build defense in depth so that the threats the filter misses are not catastrophic.
Accuracy honesty is itself a security property. A vendor who claims 100% is the one to distrust.