Understanding confidence scores

Every extracted field gets a 0–100 confidence score. Scores of 70 and above are auto-applied without prompts (green). Scores of 50–69 are highlighted yellow and ask for confirmation. Below 50, the field is left blank for manual entry. The score reflects how sure the AI is, not how correct the value actually is. Always glance at high-confidence fields on important invoices.

When this matters

You're looking at an extracted document and trying to decide which fields to trust at a glance vs which to verify by eye. The confidence-score colour code tells you where the AI was sure, where it hesitated, and where you need to fill in manually. This article explains what each score band means, what drives the number, and the most important caveat: a high score doesn't guarantee correctness.

For the broader extraction story (what's extracted, how to correct it, vendor learning), see "What the AI reads, and how to correct it". This article zooms in on the score itself.

The 0–100 scale

Range    Visual                      Auto-applied?     What to do
90–100   Green check, subtle         Yes, silently     Trust by default; spot-check on important docs
70–89    Green check                 Yes, silently     Same: trust; the system thinks it nailed it
50–69    Yellow highlight            Yes, with prompt  Glance at it; confirm or correct
30–49    Orange highlight            Yes, with prompt  Look closer; often needs a fix
0–29     Red highlight, often blank  No, left blank    Type the value manually

A score of 70 is the auto-apply cutoff. Fields scoring 70 or above are silently committed; the system has decided that checking them isn't worth your time. Fields scoring 50–69 are auto-applied but visually flagged so that you spot-check them; among auto-applied fields, this is where the false-positive risk is highest. Below 50, the system declines to commit a value and asks you to fill it in.
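As a rough illustration, the bands and the cutoff can be written as a small lookup. This is a sketch that mirrors the table above, not the product's actual code; the names `Band`, `band_for`, and `is_silent` are made up for this example.

```python
from dataclasses import dataclass

AUTO_APPLY_CUTOFF = 70  # scores at or above this are committed silently


@dataclass(frozen=True)
class Band:
    low: int
    high: int
    visual: str
    action: str


# Rows taken from the table in this article, highest band first.
BANDS = [
    Band(90, 100, "green check, subtle", "trust by default; spot-check on important docs"),
    Band(70, 89, "green check", "trust"),
    Band(50, 69, "yellow highlight", "glance at it; confirm or correct"),
    Band(30, 49, "orange highlight", "look closer; often needs a fix"),
    Band(0, 29, "red highlight, often blank", "type the value manually"),
]


def band_for(score: int) -> Band:
    """Return the table row that a 0-100 confidence score falls into."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for band in BANDS:
        if band.low <= score <= band.high:
            return band
    raise AssertionError("bands cover 0-100, so this is unreachable")


def is_silent(score: int) -> bool:
    """True when the field is applied without any highlight or prompt."""
    return score >= AUTO_APPLY_CUTOFF
```

Note that a score of exactly 70 lands in the silent green band, while 69 is still yellow.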

We don't expose a user-adjustable threshold. The 70 cutoff is calibrated empirically across our extraction corpus — it's the point where auto-applying does more good than harm. Lowering it produces more silent errors; raising it makes every field a manual one. We'd rather tune the model itself than ask users to fiddle with a slider.

What drives the score

For each field, the AI considers four classes of signal:

  • Did it see the value clearly in the document? Legible text, clear formatting, predictable layout. A crisply printed German invoice scores higher than a phone photo of a faded receipt.
  • Does the value fit the field type? Dates look like dates, currency amounts are plausible (not negative, not absurdly large), VAT IDs match the country's format. A 2026-05-21 extracted as the invoice date scores higher than a 21/05 with the year cut off.
  • Does the format match common patterns? Invoice numbers usually follow vendor-specific patterns (e.g. Hetzner uses R000123, Stripe uses 23-character alphanumerics). An invoice number that matches the vendor's historical pattern scores higher.
  • Cross-field consistency. Net + VAT = gross. Line-item sum = net total. Invoice date is before the due date. If the extracted fields are mutually consistent, the per-field scores stay high; if there's a discrepancy (for example, net, VAT, and gross don't add up), all three fields drop.
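The cross-field checks in the last bullet are plain arithmetic, so they are easy to picture in code. The sketch below is only an illustration of the idea: the function name, the one-cent tolerance, and the choice to accept an invoice date equal to the due date are all assumptions, and the real scoring model is not documented here.

```python
from datetime import date
from decimal import Decimal
from typing import List, Optional

TOLERANCE = Decimal("0.01")  # assumed rounding allowance of one cent


def consistency_problems(
    net: Decimal,
    vat: Decimal,
    gross: Decimal,
    line_items: List[Decimal],
    invoice_date: date,
    due_date: Optional[date] = None,
) -> List[str]:
    """List the cross-field checks that fail for one extracted document."""
    problems = []
    # Net + VAT should equal gross.
    if abs(net + vat - gross) > TOLERANCE:
        problems.append("net + VAT != gross")
    # Line items, when present, should sum to the net total.
    if line_items and abs(sum(line_items, Decimal("0")) - net) > TOLERANCE:
        problems.append("line-item sum != net")
    # The invoice date should not fall after the due date (same day is accepted here).
    if due_date is not None and invoice_date > due_date:
        problems.append("invoice date after due date")
    return problems
```

An empty list means the fields agree with each other. Any entry in the list points at the group of fields whose scores would drop together.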

A receipt where the total is partly obscured by a fold scores lower on the total field than on the vendor field, even though both are visible. The score is per-field, not document-wide.

Confidence is not correctness

This is the most important caveat in the entire scoring model.

A high score means the AI is confident it read the document correctly. It does not mean the value is right in the absolute sense.

Two concrete examples:

  • A blurry $ glyph might be read as S with a score of 95. The AI is sure it saw an "S"; it is wrong, but it doesn't know that. The currency comes out as "USS" instead of "US$".
  • A perfectly clear vendor name MICROSOFT IRELAND OPERATIONS LIMITED might score lower because it's an unusual long string the model hasn't seen often — even though the extraction is exactly right.

The implication: always glance at high-confidence fields on documents that matter (large amounts, year-end invoices, anything with regulatory implications). The yellow-highlighted fields draw your eye to where the AI hesitated, but they're not the only place errors hide.

A reasonable workflow for high-importance documents:

  1. Look at yellow-highlighted fields first — the AI flagged them for you.
  2. Glance at the four key fields on the document detail page: invoice date, net, VAT, gross. These are the fields tax authorities care about; even high-confidence ones deserve a one-second look.
  3. Trust everything else on the first pass.

For routine documents (small receipts, repeat vendors, low-stakes), trust the auto-apply and only spot-check yellow.
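The two workflows above boil down to a short review list per document. Here is a minimal sketch of that ordering, assuming hypothetical field names (`invoice_date`, `net`, `vat`, `gross`) and a simple `Field` record; it only covers the yellow band, since that is the band the workflow names.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical names for the four key fields listed in step 2.
KEY_FIELDS = ("invoice_date", "net", "vat", "gross")


@dataclass
class Field:
    name: str
    score: int  # 0-100 confidence score


def review_order(fields: List[Field], important: bool) -> List[str]:
    """Return field names in the order a reviewer should look at them."""
    # Step 1: yellow-highlighted fields (50-69) come first.
    flagged = [f.name for f in fields if 50 <= f.score <= 69]
    if not important:
        # Routine documents: spot-check yellow only.
        return flagged
    # Step 2: the key fields, even when high-confidence, unless already listed.
    key = [f.name for f in fields if f.name in KEY_FIELDS and f.name not in flagged]
    # Step 3: everything else is trusted on the first pass, so it is not returned.
    return flagged + key
```

For a routine receipt the list is often empty, which is the point of the auto-apply cutoff.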

How corrections feed back

When you change a field, at any confidence level, the system records your correction. On the next document from the same vendor (matched by name, logo, layout, sender), it does three things differently:

  1. Re-uses your correction. If you renamed "Stripe Inc" to "Stripe Payments Europe" once, the next Stripe invoice extracts as "Stripe Payments Europe" automatically.
  2. Scores that field higher. The system has now seen confirmation that this value is correct for this vendor; future readings of the same field will score higher.
  3. Updates the trading-partner record. The canonical vendor name in your Trading Partners view also updates, so historical invoices from that vendor show the corrected name.

See "How to fix a misread vendor name" for the specific vendor-rename flow.

This is per-vendor learning. Two different vendors with similar names stay separate; corrections on one don't propagate to the other. The matching uses multiple signals (visual layout, email sender, page hash, font, address format), not just text similarity.
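One way to picture per-vendor learning is a lookup keyed by vendor, then by field. The sketch below is purely illustrative: `CorrectionMemory` and the `vendor_id` key are invented for this example, and `vendor_id` stands in for the multi-signal matching described above rather than reproducing it.

```python
from typing import Dict


class CorrectionMemory:
    """Remembers user corrections separately for each vendor (hypothetical design)."""

    def __init__(self) -> None:
        # vendor_id -> field name -> extracted value -> corrected value
        self._by_vendor: Dict[str, Dict[str, Dict[str, str]]] = {}

    def record(self, vendor_id: str, field: str, extracted: str, corrected: str) -> None:
        """Store a correction the user made on one document."""
        self._by_vendor.setdefault(vendor_id, {}).setdefault(field, {})[extracted] = corrected

    def apply(self, vendor_id: str, field: str, extracted: str) -> str:
        """Return the remembered correction, or the extracted value if none exists."""
        return self._by_vendor.get(vendor_id, {}).get(field, {}).get(extracted, extracted)


memory = CorrectionMemory()
memory.record("stripe", "vendor_name", "Stripe Inc", "Stripe Payments Europe")
```

Because the outer key is the vendor, a correction recorded for one vendor never touches a similarly named one.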

How accurate is the score in practice

Across our extraction corpus, the calibration is roughly:

  • Fields scoring 90–100: ~98% correct
  • Fields scoring 70–89: ~94% correct
  • Fields scoring 50–69: ~80% correct (the yellow-prompt band)
  • Fields scoring 30–49: ~55% correct
  • Fields scoring 0–29: ~30% correct (mostly returned blank)

The numbers vary by field type. Currency amounts are easier than free-form vendor names; invoice dates are easier than line-item descriptions. The yellow-prompt band specifically is calibrated so that 1 in 5 of those fields needs editing, which is the right rate for "ask me, but don't ask too often".
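To make the calibration figures concrete, the sketch below estimates how many fields on a document are likely to need editing. The per-band accuracies are the approximate ones listed above; treating each field as independent and ignoring field type are simplifying assumptions, and the function names are made up for this example.

```python
from typing import Iterable

# (lower bound of band, approximate share of correct fields), highest band first.
CALIBRATION = [(90, 0.98), (70, 0.94), (50, 0.80), (30, 0.55), (0, 0.30)]


def expected_accuracy(score: int) -> float:
    """Approximate probability that a field with this score is correct."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for lower, accuracy in CALIBRATION:
        if score >= lower:
            return accuracy
    raise AssertionError("bands cover 0-100, so this is unreachable")


def expected_wrong_fields(scores: Iterable[int]) -> float:
    """Expected number of fields that need editing, summed across the document."""
    return sum(1 - expected_accuracy(s) for s in scores)
```

Five yellow fields come out at roughly one expected edit, which matches the "1 in 5" rate quoted for that band.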

If you find yourself correcting many high-confidence fields, something is off. Most likely the input is consistently low-quality (blurry photos, unusual vendor templates). Re-shooting or re-exporting at higher quality, or correcting a few documents from that vendor so the per-vendor learning picks up the pattern, usually fixes it.

Edge cases

Every field is yellow on this document. Probably a low-quality scan or photo. Re-upload a clearer version from the document detail page (Re-scan button gives you 3 retries without burning quota). The old document stays as a soft-duplicate marker; you can delete it after the better one extracts. If the scan is the highest quality the original allows, fill in manually — corrections feed back the same way.

Confidence 95% but the field is wrong. Edit it. Your correction is saved and remembered for that vendor going forward. If the same vendor consistently produces high-confidence-but-wrong fields, write to [email protected] with a de-identified sample; the model can be hand-tuned for that vendor.

Can I see the raw OCR text? Yes — on the document detail page, click Show raw extraction. You see what the AI literally read off the page before structured-field extraction. Useful for diagnosing why a field ended up wrong (was the text garbled? was the layout misread? was the right text picked up but assigned to the wrong field?).

Two documents from the same vendor with very different scores. The vendor's invoice template may have changed (new design, new logo, new layout) and the per-vendor learning needs to re-baseline. A few corrections on the new template will bring scores back up.

Why is there no "uncertain" mode where the system asks before applying anything? Because for routine high-volume users (50+ invoices/month), confirming every auto-apply would be the dominant time cost. The 70 threshold is the calibrated middle ground. If you want a stricter mode for review-heavy workflows, invite a tax advisor: advisors see the same scores plus the flag-for-review workflow, which is designed for verification rather than data entry.
