Understanding face recognition accuracy: TAR, FAR, and what they mean in production

Every face recognition vendor claims high accuracy. "99.9% accurate" appears on a lot of marketing pages. But accuracy in face recognition is not a single number — it's a curve, and the operating point you choose on that curve determines your real-world false accept and false reject rates. Understanding this is essential for building systems that actually work.

Figure: ROC curve showing TAR vs FAR operating points for face verification across use cases

What is a verification system?

Face recognition is typically used in two modes:

1:1 verification — does this selfie match this ID photo? Binary yes/no above a similarity threshold.
1:N recognition — who is this person in a gallery of N known faces?

Both produce a similarity score (typically 0–100 or 0–1) that represents the confidence of a face match. You then choose a threshold: above the threshold, it's a match; below, it's not. The threshold you choose determines your operating point on the ROC (Receiver Operating Characteristic) curve.

TAR and FAR defined

True Accept Rate (TAR) — the proportion of genuine match pairs (same person) correctly identified as matching. Also called True Positive Rate or Sensitivity. You want this to be high.

False Accept Rate (FAR) — the proportion of impostor pairs (different people) incorrectly identified as matching. Also called False Positive Rate. You want this to be low.

These two rates are tied together by the threshold and move in the same direction as it changes. Lowering your threshold increases TAR (you accept more genuine matches) but also increases FAR (you accept more impostors). Raising your threshold decreases FAR but also decreases TAR (you reject more genuine users). You cannot improve one without paying for it in the other.
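To make the trade-off concrete, here is a minimal sketch that computes TAR and FAR at a few thresholds. The function name `tar_far`, the "score >= threshold means match" rule, and the scores themselves are assumptions made up for illustration, not output from any real model.

```python
def tar_far(genuine_scores, impostor_scores, threshold):
    """Return (TAR, FAR), assuming 'score >= threshold' counts as a match."""
    tar = sum(s >= threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return tar, far


# Made-up similarity scores on a 0-1 scale, for illustration only.
genuine = [0.91, 0.84, 0.78, 0.66, 0.59]   # same-person pairs
impostor = [0.62, 0.48, 0.35, 0.21, 0.10]  # different-person pairs

for threshold in (0.5, 0.6, 0.7):
    tar, far = tar_far(genuine, impostor, threshold)
    print(f"threshold={threshold:.1f}  TAR={tar:.2f}  FAR={far:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.7 drives FAR from 0.20 to 0.00, but TAR falls from 1.00 to 0.60 along the way.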

TAR@FAR — the metric that matters

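When a vendor says "99.4% accuracy," they often mean TAR at a fixed FAR, for example TAR@FAR=0.01%. This means: at the threshold where 1 in 10,000 impostor pairs would be incorrectly accepted, their model correctly accepts 99.4% of genuine pairs. If the FAR is not stated, ask for it, because the headline number is meaningless without it.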

Why FAR=0.01%? It's a commonly quoted operating point in vendor and benchmark reports. But your application may require a different one:

A consumer app unlocking a personal device might accept FAR=0.1% — a 1-in-1,000 false accept is annoying but low-stakes
A KYC system for financial services might target FAR=0.001% — the cost of a false accept (fraud) far exceeds the cost of a false reject (user friction)
A border control system might target FAR=0.0001% — a false accept is a serious security failure

TAR at your target FAR is the only accuracy metric that matters. Ask vendors for the full ROC curve, not a single headline number.

Equal Error Rate (EER)

You may also see EER quoted — the Equal Error Rate is the point on the ROC curve where FAR equals the False Reject Rate (FRR = 1 - TAR). EER is useful for comparing models but is rarely the right operating point for production systems. A good model has a low EER (below 2%), but you'll set your actual threshold based on your application's cost function, not EER.
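As a rough illustration of how an EER is located, the sketch below sweeps every observed score as a candidate threshold and keeps the one where FAR and FRR are closest. The helper name `equal_error_rate`, the "score >= threshold means match" rule, and the data are assumptions for the example. Real evaluations usually interpolate between points on a much denser curve.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (EER estimate, threshold) where FAR and FRR are closest."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR as the EER estimate.
            best = (gap, (far + frr) / 2, threshold)
    _, eer, threshold = best
    return eer, threshold


# Invented scores on a 0-1 scale.
genuine = [0.9, 0.8, 0.7, 0.55, 0.4]
impostor = [0.6, 0.5, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER ~ {eer:.2f} at threshold {threshold:.2f}")
```

On this data FAR and FRR meet at 0.20 when the threshold is 0.55. That is the point a production system would typically move away from, in whichever direction its cost function favours.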

NIST FRVT — the independent benchmark that matters

Before accepting any vendor's internal benchmark figures, check whether they participate in the NIST Face Recognition Vendor Test (FRVT). FRVT is run by the US National Institute of Standards and Technology against standardised datasets under controlled, vendor-neutral conditions. In 2023 NIST renamed the programme, splitting it into the Face Recognition Technology Evaluation (FRTE) and the Face Analysis Technology Evaluation (FATE), though many vendors and older reports still say FRVT. Results are publicly available at nist.gov/programs-projects/face-recognition-vendor-testing-frvt and are the most widely used independent data point for comparing vendors on equal footing. A vendor who refuses to participate while claiming top-tier accuracy is a red flag.

The standard governing biometric performance evaluation methodology is ISO/IEC 19795-1, which defines how TAR, FAR, and related metrics must be measured and reported. When reviewing vendor documentation, check that their evaluation follows ISO/IEC 19795-1 reporting conventions. Otherwise the numbers are not comparable across providers.

Dataset bias and demographic parity

Accuracy on a benchmark dataset is not the same as accuracy on your user population. Most public face recognition benchmarks (LFW, CFP-FP, IJB-C) are heavily skewed toward lighter skin tones and specific age ranges. A model that achieves 99.5% TAR@FAR=0.01% on LFW may perform meaningfully worse on darker skin tones, older faces, or non-Western facial structures.

Questions to ask any vendor:

What benchmark datasets were used for evaluation?
Do you have disaggregated accuracy metrics by Fitzpatrick skin tone scale?
What is the accuracy gap between the best and worst demographic subgroups?
Was the model trained on diverse data, and can you share the dataset composition?

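EU AI Act Article 10 requires providers of high-risk AI systems to examine their training, validation, and test data for possible biases, and NIST FRVT publishes demographic-differential results for submitted algorithms. Vendors who cannot provide disaggregated figures should be treated with scepticism.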
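When a vendor does share per-pair results, the accuracy gap asked about above is straightforward to compute. The sketch below assumes a hypothetical record format of (subgroup label, similarity score) for genuine pairs and a threshold already fixed at your target FAR. The group labels and scores are invented.

```python
from collections import defaultdict


def tar_by_group(genuine_records, threshold):
    """Return {group: TAR}, assuming 'score >= threshold' counts as a match."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for group, score in genuine_records:
        total[group] += 1
        accepted[group] += score >= threshold
    return {group: accepted[group] / total[group] for group in total}


# Invented genuine-pair scores, bucketed by Fitzpatrick type for illustration.
records = [
    ("I-II", 0.92), ("I-II", 0.88), ("I-II", 0.81), ("I-II", 0.74),
    ("III-IV", 0.90), ("III-IV", 0.83), ("III-IV", 0.69), ("III-IV", 0.77),
    ("V-VI", 0.86), ("V-VI", 0.71), ("V-VI", 0.64), ("V-VI", 0.79),
]

per_group = tar_by_group(records, threshold=0.70)
gap = max(per_group.values()) - min(per_group.values())
for group, tar in per_group.items():
    print(f"{group}: TAR={tar:.2f}")
print(f"gap between best and worst subgroup: {gap * 100:.0f} percentage points")
```

This only covers the TAR side. A full audit would repeat the same breakdown for FAR on impostor pairs, since a single global threshold can yield different false accept rates per subgroup.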

1:N search — accuracy scales differently

In gallery search, you're asking: "Is this probe face in a gallery of N people?" As N grows, the probability of a false match increases. At FAR=0.01% and a gallery of 1,000 people, you expect roughly 0.1 false accepts per query on average. At N=100,000, that's 10 false accepts per query. 1:N accuracy must be evaluated at the scale you intend to operate at.
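The arithmetic behind those figures can be sketched as follows. It treats each of the N comparisons as an independent trial with the same per-comparison FAR, which is a simplifying assumption, and the function names are illustrative.

```python
def expected_false_accepts(far, gallery_size):
    """Average number of gallery entries falsely matched per query."""
    return far * gallery_size


def prob_any_false_accept(far, gallery_size):
    """Chance that at least one gallery entry is falsely matched per query."""
    return 1.0 - (1.0 - far) ** gallery_size


far = 0.0001  # FAR = 0.01%
for n in (1_000, 100_000):
    expected = expected_false_accepts(far, n)
    p_any = prob_any_false_accept(far, n)
    print(f"N={n:>7,}  expected false accepts={expected:.1f}  P(at least one)={p_any:.4f}")
```

Under this model a 1,000-person gallery gives roughly a 9.5% chance of at least one false match on any given query, and at 100,000 a false match is close to certain unless the threshold is tightened.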

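For closed-set 1:N search, where every probe is known to be in the gallery, the usual metric is Rank-1 accuracy: the probability that the correct gallery identity appears as the top result. This degrades with gallery size in a way that 1:1 verification accuracy does not. For open-set search, where the probe may not be enrolled at all, ask instead for the false negative identification rate (FNIR) at a fixed false positive identification rate (FPIR).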
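Rank-1 accuracy is simple to compute once you have per-probe search results. The sketch below assumes a hypothetical input format, a list of (true identity, {gallery identity: similarity score}) pairs, with invented names and scores.

```python
def rank1_accuracy(searches):
    """Fraction of probes whose top-scoring gallery identity is the true one."""
    hits = 0
    for true_identity, scores in searches:
        top_identity = max(scores, key=scores.get)  # highest similarity wins
        hits += top_identity == true_identity
    return hits / len(searches)


# Invented search results against a three-person gallery.
searches = [
    ("alice", {"alice": 0.91, "bob": 0.40, "carol": 0.35}),
    ("bob",   {"alice": 0.52, "bob": 0.88, "carol": 0.47}),
    ("carol", {"alice": 0.66, "bob": 0.31, "carol": 0.61}),  # miss: alice ranks first
    ("alice", {"alice": 0.79, "bob": 0.58, "carol": 0.44}),
]

print(f"Rank-1 accuracy: {rank1_accuracy(searches):.2f}")
```

Rerunning the same probes against progressively larger galleries is a direct way to see how quickly the figure degrades at your intended scale.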

Production vs benchmark gap

Benchmark evaluations use controlled, high-quality face crops. Production systems encounter low-light selfies, motion blur, partial occlusion, off-angle poses, low resolution, and inconsistent backgrounds. As a rough planning figure, expect real-world TAR to be 1–5 percentage points lower than benchmark figures at the same FAR; the actual gap depends on your capture conditions.

The best way to validate this is to run a pilot evaluation on a sample of your real-world data before committing to a provider at scale. Quantilence provides evaluation API access to enterprise customers for exactly this purpose.

Quantilence benchmarks

Our face recognition model achieves 99.4% TAR@FAR=0.01% on IJB-C, with a demographic accuracy gap of under 1.2 percentage points across the Fitzpatrick scale. Latency for a 1:N search against a gallery of 10,000 faces is under 150ms. Contact us for full benchmark documentation and a DPIA-compatible evaluation dataset.