When this article is for you
You're a tax advisor working with multi-locale clients — a client in Madrid invoicing French customers, a Berlin freelancer with Italian + Greek suppliers, a Warsaw consultancy with English + Polish + Czech documents. You want to know how TaxItEasy preserves the original-language information through the export pipeline, what gets normalised, and what stays as-is.
For the multi-currency story alongside multi-language (often the same clients), see multi-currency for cross-border clients. For the broader bulk-export flow, see bulk export as a tax advisor. For the CSV/PDF format roadmap (currently JSON only), see CSV and PDF export status.
What's English, what's original-language
The export schema has a clear separation. The export-side structure is in English (predictable for any downstream tool); the content extracted from documents stays in the language of the source document.
Always English
These are platform-side metadata, always normalised:
- Field labels: vendor, net, vat, gross, date, currency, category, invoice_number, etc.
- Status flags: approved, pending, matched, unmatched, etc.
- Match-pipeline tier names: strong, likely, possible, weak.
- AI-suggested categories: when the AI assigned one (e.g. Office supplies, Travel, Hosting, Software). Note: if the client renamed a category to their own (e.g. German Büromaterial), that override takes precedence and is exported as the client's chosen string.
- Metadata: export-date, period-filter parameters, format version, JSON schema version.
Original-language preserved
These are document-side content fields, kept verbatim as the AI extracted them from the original document:
- Vendor name: "Société Générale" stays "Société Générale"; we don't transliterate or translate. The same goes for Cyrillic ("Сбербанк"), Greek ("Εθνική Τράπεζα"), and CJK scripts ("三井住友銀行").
- Invoice number / reference: exactly as on the invoice, including any prefix codes ("R-2026-0042", "FA-23-001", "REF#5500").
- Line-item descriptions: "Software-Lizenz Adobe Creative Cloud" stays German if the invoice was German; "Hébergement web mensuel" stays French.
- Vendor address: original language, original character set.
- Free-text fields filled in by humans: your private notes, your flag comments, the client's notes. These are in whatever language the human wrote them.
The combined export looks like: English field labels around original-language content. A French-speaking client's books export with French vendor names + French line items but English vendor / line_items / category keys. This makes the export importable into any accounting tool (which expects standard keys) while preserving original-language readability for any human eyes on the data.
The original document file
Every exported invoice JSON includes a signed URL to download the original PDF / image:
{
"invoice_number": "FA-23-001",
"vendor": "Société Générale",
...
"original_file_url": "https://signed-url.../invoice.pdf?expires=2026-05-29T..."
}
So even if you need to read the original document in its native language (e.g. a Greek-language invoice that the AI extracted but you want to inspect by hand), you can — the export doesn't strip access. Signed URLs are valid for 7 days from export generation; if you need access later, re-run the export.
For long-term archival, download the originals separately during the 7-day window and store locally with your engagement-retention process.
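The archival step can be scripted. The sketch below assumes the export is a list of invoice objects that each carry invoice_number and original_file_url, as in the snippet above; that top-level layout, the archive_originals helper, the file-naming scheme, and the injectable fetch function are illustrative assumptions, not part of TaxItEasy.

```python
import re
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def _download(url: str) -> bytes:
    """Default fetcher: plain HTTP GET using the standard library."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def archive_originals(invoices, target_dir, fetch=_download):
    """Save each invoice's original file under target_dir.

    `invoices` is assumed to be a list of dicts carrying
    `invoice_number` and `original_file_url` (hypothetical layout).
    Returns the list of paths written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, invoice in enumerate(invoices):
        url = invoice.get("original_file_url")
        if not url:
            continue  # nothing to archive for this entry
        # Keep the original extension (.pdf, .jpg, ...) from the URL path.
        suffix = Path(urlparse(url).path).suffix or ".bin"
        # Invoice numbers may contain characters that are unsafe in file names.
        safe_number = re.sub(r"[^\w.-]+", "_", invoice.get("invoice_number", "unknown"))
        path = target / f"{index:04d}_{safe_number}{suffix}"
        path.write_bytes(fetch(url))
        written.append(path)
    return written
```

Run it soon after generating the export so that every URL is still inside its 7-day validity window; a failed download after that point means the export needs to be re-run.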
The raw OCR text field
The raw OCR text (the text the OCR engine extracted from the original document, before structured-field extraction) is in the original language. It is stored verbatim under the raw_ocr_text field in the export.
This is the source of truth for any "what did the document actually say?" question — useful when you need to verify a translated line-item, or check whether a specific clause (e.g. "reverse-charge applies") was actually in the original document.
The OCR text is roughly the document linearised top-to-bottom; it's not pretty-formatted but it's complete. For documents in scripts the OCR engine isn't strong on (rare languages, handwritten text), the OCR text may be noisy — in those cases, fall back to reading the original PDF directly.
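A clause check against raw_ocr_text can be automated as a first pass. The following is a rough sketch: the phrase list is an illustrative and incomplete set of reverse-charge wordings, find_clauses is a hypothetical helper, and a hit only tells you where to look in the original, not how the clause applies.

```python
import unicodedata

# Illustrative phrases only; extend with the wordings your clients' vendors use.
REVERSE_CHARGE_PHRASES = [
    "reverse charge",
    "reverse-charge",
    "Steuerschuldnerschaft des Leistungsempfängers",
    "autoliquidation",
    "inversión del sujeto pasivo",
]


def _normalise(text: str) -> str:
    """Make comparison tolerant of case, composed/decomposed accents,
    and OCR line breaks that fall inside a phrase."""
    folded = unicodedata.normalize("NFC", text).casefold()
    return " ".join(folded.split())


def find_clauses(invoice: dict, phrases=REVERSE_CHARGE_PHRASES) -> list:
    """Return the phrases that occur in the invoice's raw_ocr_text."""
    haystack = _normalise(invoice.get("raw_ocr_text", ""))
    return [p for p in phrases if _normalise(p) in haystack]
```

Because OCR output can be noisy, an empty result is not proof that the clause is absent; for anything that matters, open the original file.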
Why no automated translation
Automated translation is not on the immediate roadmap. The product positioning is that TaxItEasy extracts and organises; downstream tools translate if needed. Reasons:
- Tax-substance terms don't translate cleanly. A German "Vorsteuer" isn't quite the same concept as a French "TVA déductible" or an Italian "IVA detraibile" — close, but with subtle legal differences that translation flattens. An automated translator would introduce drift; for tax purposes, drift is risk.
- Audit defensibility. Original-language is the document of record. A tax-authority audit asks "what did the actual invoice say?" — the answer is whatever script + language was on the document, not a translated rendition.
- Most client-firm setups already have the language alignment they need. A German client typically has a German-speaking advisor; a French client a French-speaking advisor. Cross-language relationships are the exception (and they tend to be senior advisors comfortable with both languages).
If you need bulk translation for downstream purposes (e.g. internal staff review by a colleague who doesn't speak the client's language), off-the-shelf translation tools work well on the exported JSON:
- DeepL has good API support and produces high-quality translations across EU languages.
- Google Translate API is OK; slightly less nuanced for tax terms.
- OpenAI / Anthropic LLMs can be prompted to translate with a glossary for tax-specific terms (best for finance domain).
The export's clean separation (English keys, original-language values) makes bulk-translation straightforward: translate only the values, leave the keys alone.
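As a minimal sketch of "translate only the values": the function below walks an exported record and passes selected string values to a translator of your choice. The translate callable is a placeholder for whichever service you use, and the key names description and notes are assumptions about the export layout; only raw_ocr_text, vendor, and line_items are named in this article.

```python
from typing import Any, Callable

# Assumed field names; adjust to the keys in your export.
TRANSLATABLE_KEYS = {"description", "notes", "raw_ocr_text"}


def translate_values(node: Any, translate: Callable[[str], str],
                     key: str = "") -> Any:
    """Return a copy of `node` with selected string values translated.

    Keys are never touched. Strings are translated only when they sit
    under a key in TRANSLATABLE_KEYS, so vendor names, invoice numbers,
    dates, and URLs pass through unchanged.
    """
    if isinstance(node, dict):
        return {k: translate_values(v, translate, k) for k, v in node.items()}
    if isinstance(node, list):
        # List items inherit the key of the list that contains them.
        return [translate_values(item, translate, key) for item in node]
    if isinstance(node, str) and key in TRANSLATABLE_KEYS:
        return translate(node)
    return node


invoice = {
    "vendor": "Société Générale",
    "line_items": [{"description": "Hébergement web mensuel", "net": 49.0}],
}
# Stand-in translator for the example; swap in a real client here.
glossary = {"Hébergement web mensuel": "Monthly web hosting"}
translated = translate_values(invoice, lambda text: glossary.get(text, text))
```

The function returns a new structure and leaves the input untouched, so the original-language export can be kept alongside the translated copy as the document of record.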
Multi-locale clients in practice
A few realistic patterns:
Client invoices in 3 languages
A Lisbon-based SMB receives Portuguese, English, and Spanish invoices from EU vendors. Each invoice arrives in its own language; the export carries each in original. The client's accounting system (likely Portuguese-localised) handles the import with whatever per-language rules it has.
Multi-language vendor names for the same vendor
The vendor name on the invoice changes over time (rebrand, transliteration, formal-vs-informal). You either:
- Edit the vendor on one invoice to your preferred canonical form; the system learns and applies forward (see how to fix a misread vendor name).
- Use a matching rule to normalise the vendor name on extraction.
Either approach gives you a single canonical name in your books going forward.
Client wants Dutch field labels
The export schema is English-labeled (see Always English above). For Dutch-only downstream tooling, your import pipeline does a one-time field-rename. Mapping is straightforward (one line per field). Write to [email protected] with [TECHNICAL] if you need the canonical mapping list.
A localised export schema (per-locale field labels) is on the backlog but low priority — the import-side rename is a 10-minute one-time step in most tools.
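The import-side rename can be as small as a dictionary lookup. In the sketch below the Dutch labels are illustrative guesses rather than the canonical mapping (request that list from support as described above), and rename_fields is a hypothetical helper.

```python
# Illustrative mapping only; not the canonical list from support.
FIELD_MAP_NL = {
    "vendor": "leverancier",
    "net": "netto",
    "vat": "btw",
    "gross": "bruto",
    "date": "datum",
    "currency": "valuta",
    "category": "categorie",
    "invoice_number": "factuurnummer",
}


def rename_fields(node, mapping=FIELD_MAP_NL):
    """Recursively rename dict keys; values and unmapped keys are kept."""
    if isinstance(node, dict):
        return {mapping.get(k, k): rename_fields(v, mapping) for k, v in node.items()}
    if isinstance(node, list):
        return [rename_fields(item, mapping) for item in node]
    return node
```

Unmapped keys pass through unchanged, so the step keeps working if the export gains fields that the mapping does not yet cover.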
Edge cases
"My client's bookkeeping system is in Dutch, I need Dutch field labels." Currently the export schema is English-labeled. Workaround: a post-processing rename step in your import pipeline (we can supply the canonical field mapping on request to [email protected] with [TECHNICAL]). Localised export schemas are on the backlog.
"Vendor changed their name to a transliterated version (e.g. 'IKEA' → 'ИКЕА')." We store what's on the document. If the vendor sends a Cyrillic-script invoice, the vendor name in the export is Cyrillic. Use a matching rule to normalise to your preferred canonical form (e.g. "IKEA") if you want a single name across invoices.
"My client has invoices in 5 different languages." Fine. Each invoice carries its own language. Exports preserve all 5 in original. The English field labels stay the only constant; everything else is the document's own language. Most accounting tools handle multi-locale imports natively, especially the SaaS ones.
"I want subtitles / translations for the OCR text." Not a feature. The raw OCR text is preserved as-is in the export under raw_ocr_text; translation is downstream via DeepL / Google Translate / LLM. Add a translation pass to your import pipeline if needed.
"Vendor name encoding is broken in my downstream tool (shows '???' for special chars)." The export is UTF-8. If your downstream tool defaults to a different encoding (Windows-1252, ISO-8859-1), the receive-side reads the special chars as garbled. Force UTF-8 on the receive side. Most modern tools handle this correctly out of the box.
"Different language per company under the same client." A multi-company client with companies in different countries can have invoices in different languages per company. Each company exports independently. Cross-company language handling isn't a thing — each export is per-company per its own content.
"Client wrote private notes in their language; I write mine in English." Both stored, both exported, each in its language. Your notes carry your user-attribution; theirs carries theirs. No translation, no normalisation.
"AI-suggested category in original language?" The AI's category suggestions are platform-side defaults (English strings like "Office supplies"). If the client renamed the category in their account (e.g. to German "Büromaterial"), the export carries the client's renamed string. So the category is English by default but reflects any client customisation.
"Special characters in vendor name break my downstream regex." The export is plain text Unicode; vendor names can contain any character that was on the document. If your downstream tool's regex assumes ASCII, you'll need to widen the character class (e.g. [\p{L}]+ instead of [a-zA-Z]+). Standard Unicode-aware regex is the solution.
"Date formats — original or normalised?" Dates are normalised to ISO 8601 (YYYY-MM-DD) regardless of the document's locale. The original-format date isn't stored separately; the AI parses (handling DE 01.02.2026, FR 01/02/2026, ES 01-02-2026) and outputs ISO. If a date is ambiguous (e.g. 01/02/2026 could be Jan 2 or Feb 1), the country setting drives the interpretation — see the onboarding wizard explained for the country-driven defaults.
"Currency amount formatting — comma or period as decimal?" Stored as raw float in the export (123.45). The original-document format (European 1.234,56 vs US 1,234.56) is parsed during extraction and normalised to float; the per-document currency field carries the currency code. Your downstream tool handles formatting on display.
Related
- Multi-currency for cross-border clients — multi-locale + multi-currency often go together
- CSV and PDF export status — format roadmap (JSON only today)
- Bulk export as a tax advisor — the export flow that uses these conventions