Document AI automation has quietly become the most consequential back-office decision an SMB will make in 2026. Open OCR pipelines now self-host on a single GPU at around $0.001 a page, managed vendors sell extraction by the thousand pages, and a new generation of models ships almost monthly. The cheap-inference era has arrived — which means cost per page is no longer the question. The question is build versus buy, and the honest answer is a number, not a philosophy.
What’s at stake is real money and real risk. Between 80% and 90% of newly generated enterprise data is unstructured — invoices, contracts, receipts, forms, scanned PDFs — yet only a fraction of organisations extract it reliably, according to Market.us. Get the architecture wrong and you either overpay a managed vendor for volume you don’t have, or you sink an ML engineer’s salary into a self-hosted pipeline that never reaches the break-even that justifies it.
This guide does three things. It maps the three real paths — buy a managed IDP layer, call an OCR API, or self-host open weights. It normalises every vendor’s pricing to a single unit an SMB actually cares about, the cost to process one invoice end-to-end. And it puts a number on the pivot point: the volume above which building starts to beat buying. Every figure is vendor-stated or third-party-sourced as of June 26, 2026, and benchmark claims are flagged as directional where the vendor itself says so.
- 01. Build vs buy is a volume question, not a values question. The self-hosting break-even sits at roughly 50,000–100,000 pages a month with a dedicated ML engineer on staff, per Mavik Labs and Spheron. Below that, engineering overhead erases the per-page savings.
- 02. Cheap managed APIs pushed the break-even higher. Mistral OCR 4 sells OCR-plus-structure at about $0.004 a page ($0.002 via the Batch API). At those rates the math that used to favour building only flips at much larger volumes than the classic rule of thumb.
- 03. Normalise to cost-per-invoice, not cost-per-page. Runs, pages, and operations are not fungible. Nanonets prices per run, OCR APIs per page, Rossum as a flat subscription. Our proprietary table converts all of them to one comparable unit across four volume tiers.
- 04. Self-hosted compute is a floor, not a total. The ~$7.27 per 10,000 pages L40S figure excludes ML salary, fine-tuning, infrastructure maintenance, and uptime; maintenance alone is estimated at 5× the initial development cost. Always price the floor and the overhead.
- 05. For most SMBs: buy now, migrate selectively later. Start on a managed layer for predictable cost and fast time-to-value. Migrate to open-weight self-hosting only once volume clears the break-even and you have ML-capable staff or a sovereignty mandate.
01 — Why Now
The unstructured-data problem finally has cheap tools.
Every SMB back office runs on documents it can’t easily query. Supplier invoices arrive as PDFs and scans, receipts pile up in inboxes, contracts live as flat text, and someone re-keys all of it into an accounting system by hand. Intelligent document processing (IDP) is the category that automates that re-keying: optical character recognition to read the pixels, then layout and field extraction to turn them into structured data your systems can use.
The reason this is a 2026 decision rather than a 2022 one is that the tooling got dramatically cheaper and better in parallel. Open-weight OCR models now run on a single mid-range GPU, and managed vendors have been forced to cut prices to compete. The market reflects the momentum — even if the exact size depends heavily on who you ask.
Adoption is documented to cut errors and processing time and to return positive first-year ROI, per Market.us’s aggregation of vendor case studies — though those benefit figures are vendor-supplied and vary widely by document type and pipeline quality. The practical takeaway isn’t a headline percentage. It’s that the capability is now cheap enough that the only remaining hard question is how you buy or build it.
02 — The Three Paths
Three ways to turn documents into structured data.
Most build-vs-buy debates collapse into two options. In document AI there are really three, and they sit on a spectrum from most-managed to most-owned. Picking the wrong end of that spectrum is how SMBs either overpay or over-engineer.
Turnkey platforms
A full product: extraction models, a human validation screen, retraining, and an archive. Fastest time-to-value, least engineering, highest per-document price. Best when documents vary and you have no ML team.
Pay-as-you-go extraction
You call an API per page and own the pipeline glue — queueing, validation, retries, storage. Cheap per page, metered OpEx, no platform lock-in. Best when you have light engineering and standardised documents.
Open-weight models
Run permissively-licensed OCR models on your own GPU. Lowest marginal cost and full data sovereignty, but you own infrastructure, accuracy tuning, and uptime. Best at high volume with ML staff.
The open-weight shelf is unusually deep in 2026. PaddleOCR-VL-1.6 (Apache 2.0, ~0.9B params, 100-plus languages) runs at roughly 45 pages a minute on an L40S; DeepSeek-OCR (~3B, MIT) trades a little speed for a meaningfully lower cost on the same GPU; GOT-OCR 2.0 is strong on equations in under 3GB of VRAM; and Granite-Docling (258M, Apache 2.0) is fast on financial tables, per Spheron’s 2026 self-hosting analysis. These are real production options, but they ship without a managed API, an SLA, or a validation UI. That gap is exactly what the buy paths charge for.
03 — The 2026 Shift
What the last week of June actually changed.
Two releases one day apart reset the reference points. On June 22, 2026, Baidu open-sourced Unlimited-OCR (MIT, 3B params), which parses 40-plus pages in a single forward pass and drew about 1,800 GitHub stars in its first day — but ships with no managed API, no enterprise SLA, and no bounding-box output. A day later, on June 23, Mistral released OCR 4, its fourth OCR generation in roughly fifteen months, adding the most-requested enterprise feature: structured output with bounding boxes, block-type classification, and per-word confidence scores. Our companion deep dive covers Mistral OCR 4’s capabilities and benchmark results in full.
That shift — from flat text to a layered semantic map where every block has a location, a type, and a confidence score — is what makes extraction auditable without a separate layout-analysis stage. For an SMB it matters because it collapses two tools into one and makes the output traceable for compliance. The benchmark story, though, needs reading carefully.
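To make the auditability point concrete, here is a minimal sketch of how a pipeline might route low-confidence blocks to a human validation step. The `Block` schema, its field names, and the 0.90 threshold are all assumptions for illustration; they are not Mistral's (or any vendor's) actual response format, so map them onto whatever your chosen API really returns.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical schema for illustration only; not a real vendor response format.
    text: str
    block_type: str                           # e.g. "table", "paragraph", "total"
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page
    confidence: float                         # 0.0 to 1.0


def split_for_review(blocks, threshold=0.90):
    """Split blocks into (accepted, needs_review) by confidence.

    The threshold is an assumed starting point; tune it on your own documents.
    """
    accepted = [b for b in blocks if b.confidence >= threshold]
    needs_review = [b for b in blocks if b.confidence < threshold]
    return accepted, needs_review


# Illustrative data, not real extraction output.
sample = [
    Block("Total: 120.00", "total", (0.10, 0.80, 0.40, 0.85), 0.97),
    Block("lnv0ice 4471", "paragraph", (0.10, 0.10, 0.30, 0.15), 0.62),
]
ok, review = split_for_review(sample)
for b in review:
    print(f"review {b.block_type} at {b.bbox}: {b.text!r} ({b.confidence:.2f})")
```

Because each flagged block carries its own location and type, a reviewer can be shown exactly the region in question, which is the traceability property the paragraph above describes.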
Vendor-forwarded customer testimonials point the same direction — one financial-AI engineering team described reaching equivalent accuracy at far lower cost and latency than an incumbent parser — but those evals are unpublished and should be read as marketing, not measurement. The practitioner mood is more sober. As one commenter with a decade in document parsing put it on the Unlimited-OCR thread, OCR still has rough edges in 2026: handwriting, scanned legacy documents, and multi-column math remain hard for every model on the shelf.
04 — True Cost
The real cost, normalised per invoice.
Here is the comparison no vendor publishes, because it makes the units honest. Nanonets prices per run, OCR APIs price per page, Rossum sells a flat subscription — quoting them side by side is comparing apples, kilometres, and Tuesdays. The table below converts every path to one unit an SMB actually budgets in: the cost to process one invoice end-to-end, assuming a four-page invoice, across four realistic monthly volumes. Every cell is recomputed from the vendor’s own per-unit rate retrieved June 26, 2026.
| Path | Unit rate | 5K inv / mo (20K pages) | 25K inv / mo (100K pages) | 50K inv / mo (200K pages) | 100K inv / mo (400K pages) |
|---|---|---|---|---|---|
| Mistral OCR 4 API (standard) | $0.016 / invoice | $80 | $400 | $800 | $1,600 |
| Mistral OCR 4 Batch API | $0.008 / invoice | $40 | $200 | $400 | $800 |
| AWS Textract — Analyze Expense | $0.040 / invoice | $200 | $1,000 | $2,000 | $4,000 |
| Google Document AI — Invoice Parser | $0.040 / invoice | $200 | $1,000 | $2,000 | $4,000 |
| Azure Doc Intelligence — Invoice (Prebuilt) | $0.040 / invoice | $200 | $1,000 | $2,000 | $4,000 |
| Nanonets — Complex AI (4 blocks / invoice) | $1.20 / invoice | $6,000 | $30,000 | $60,000 | $120,000 |
| Self-hosted open OCR — compute only | $0.0029 / invoice | $15 | $73 | $146 | $291 |
| Self-hosted + 0.25-FTE ML engineer | compute + ~$2,000 / mo | $2,015 | $2,073 | $2,146 | $2,291 |
| Rossum Starter (flat subscription) | $1,500 / mo flat | $1,500 | $1,500 | $1,500* | $1,500* |
How to read it. Assumptions: one invoice = four pages; Nanonets = four $0.30 complex-AI runs per invoice; AWS, Google, and Azure invoice parsers at $0.01 a page; Mistral OCR 4 at $0.004 a page ($0.002 batch); self-hosted L40S compute at $0.000727 a page (Spheron). The self-hosted-plus-engineer row adds a quarter-time ML engineer at roughly $2,000 a month — a deliberately conservative floor, since a real build typically needs more. The Rossum row is a flat $1,500-a-month subscription that doesn’t scale with volume; the asterisked cells flag that Starter-tier page ceilings would, in practice, push the two highest bands onto a custom-quoted Business or Enterprise tier (Rossum has been Coupa-owned since May 2026).
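The normalisation described above is simple enough to reproduce yourself. The sketch below recomputes monthly cost per path from the per-unit rates and the four-page-invoice assumption stated in this section; the function and dictionary names are illustrative, and you would swap in your own rates and pages-per-invoice figure.

```python
PAGES_PER_INVOICE = 4  # the article's assumption

# Per-invoice rates derived from the per-unit prices listed in the assumptions.
PER_INVOICE_RATE = {
    "Mistral OCR 4 API (standard)": 0.004 * PAGES_PER_INVOICE,
    "Mistral OCR 4 Batch API": 0.002 * PAGES_PER_INVOICE,
    "Cloud invoice parser (AWS / Google / Azure)": 0.01 * PAGES_PER_INVOICE,
    "Nanonets Complex AI": 0.30 * 4,  # four $0.30 runs per invoice
    "Self-hosted, compute only": 0.000727 * PAGES_PER_INVOICE,
}

ENGINEER_MONTHLY = 2000.0  # quarter-time ML engineer, per the assumptions
VOLUMES = [5_000, 25_000, 50_000, 100_000]  # invoices per month


def monthly_cost(path, invoices, fixed=0.0):
    """Monthly cost in dollars: optional fixed cost plus the per-invoice rate."""
    return fixed + PER_INVOICE_RATE[path] * invoices


for path in PER_INVOICE_RATE:
    print(path, [round(monthly_cost(path, v)) for v in VOLUMES])

# The self-hosted-plus-engineer row adds the fixed monthly cost.
print(
    "Self-hosted + 0.25-FTE engineer",
    [round(monthly_cost("Self-hosted, compute only", v, fixed=ENGINEER_MONTHLY))
     for v in VOLUMES],
)
```

Each printed value should land within a dollar of the corresponding table cell; any one-dollar gap comes down to rounding order. The flat Rossum subscription is omitted because it does not vary with volume.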
Two patterns jump out. Nanonets’ per-run pricing makes it wildly expensive for high-volume invoice work — the very use case buyers most often evaluate it for — because a four-block invoice costs $1.20, not four cents. And the self-hosted compute floor is almost free, but the moment you add even a fractional engineer, the fixed cost dominates until volume is large. That fixed cost is the whole build-vs-buy story.
05 — Break-Even
Where building actually starts to win.
The classic rule of thumb, synthesised from the Mavik Labs build-vs-buy framework and Spheron’s self-hosting analysis, is that document-AI self-hosting breaks even at roughly 50,000 to 100,000 pages a month — and only with a dedicated ML engineer on staff. Below that, the engineering overhead (maintenance alone estimated at 5× the initial development cost) erases the per-page savings. The chart below shows why, comparing monthly cost across the sensible paths at the top volume tier.
[Chart: Monthly cost at 100,000 invoices/month, select paths. Source: recomputed from vendor pricing, 100K invoices/mo (400K pages). Nanonets excluded as an outlier ($120,000).]
The trend the table reveals. The conventional break-even assumes you’re comparing a self-hosted build against a typical full-service managed vendor. But aggressive 2026 API pricing moved the goalposts. Run the per-invoice math and the picture sharpens: against the cheap managed parsers, a self-hosted pipeline with even a quarter-time engineer doesn’t pull ahead until roughly 50,000 invoices a month (200,000 pages), and against Mistral OCR 4’s standard API the crossover is far higher still. Our own table shows self-host-plus-engineer at $2,146 versus $2,000 for the managed parsers at 50K invoices (essentially a tie), then winning at 100K. The build case is real, but it lives further up the volume curve than most SMBs ever reach.
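The crossover can be computed directly rather than read off the table. As a minimal sketch, assuming the only fixed cost is the quarter-time engineer and the only variable costs are the per-page rates quoted earlier, the break-even volume is the fixed cost divided by the per-page saving. The function name is illustrative.

```python
def break_even_pages(fixed_monthly, managed_per_page, self_hosted_per_page):
    """Monthly page volume at which self-hosting costs the same as buying.

    Returns None when self-hosting has no per-page advantage to recover
    the fixed cost with.
    """
    saving = managed_per_page - self_hosted_per_page
    if saving <= 0:
        return None
    return fixed_monthly / saving


PAGES_PER_INVOICE = 4
SELF_HOSTED = 0.000727   # L40S compute per page (Spheron figure used above)
ENGINEER = 2000.0        # quarter-time ML engineer per month

for name, rate in [
    ("cloud invoice parser", 0.01),
    ("Mistral OCR 4 standard", 0.004),
    ("Mistral OCR 4 batch", 0.002),
]:
    pages = break_even_pages(ENGINEER, rate, SELF_HOSTED)
    print(f"{name}: ~{pages:,.0f} pages/mo "
          f"(~{pages / PAGES_PER_INVOICE:,.0f} invoices/mo)")
```

On these inputs the crossover lands at roughly 216,000 pages a month against a $0.01-a-page parser, roughly 611,000 against Mistral OCR 4’s standard rate, and about 1.57 million against the batch rate. Raising the engineer cost above the conservative $2,000 floor pushes every one of those figures up proportionally.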
Where this is heading. Open-weight OCR quality and managed-API prices are converging from opposite directions, and that convergence will keep pushing the break-even up, not down. For the next year or two, the rational default for a back office processing tens of thousands of pages a month is to buy the cheapest competent managed layer, instrument the volume, and revisit the build case only when sustained volume — or a hard sovereignty requirement — clears the bar. The same build-vs-buy calculus applies across your AI data stack, which we unpack in our CDP build, buy, or skip decision matrix and the broader AI build-vs-buy framework for agency stacks.
06 — Decision Matrix
Eight criteria that move the needle toward build or buy.
Volume is the spine of the decision, but it isn’t the whole skeleton. Run your situation down these eight criteria; if most of your answers land in the right-hand column, buy a managed layer. If they cluster on the left — and your volume clears the break-even — building becomes defensible.
| Criterion | Pushes toward Build | Pushes toward Buy |
|---|---|---|
| Monthly volume | Above ~100,000 pages / month, sustained | Below ~50,000 pages / month, or spiky |
| Document variability | Standardised, repetitive layouts you control | Arbitrary, varied contracts and one-offs |
| In-house ML engineering | Dedicated ML / MLOps capacity on staff | No ML hire; ops team owns the workflow |
| Data sovereignty | Strict GDPR / EU AI Act / sector rules | Vendor EU data residency is acceptable |
| Extraction depth | Custom fields, tables, equations, edge formats | General text plus standard invoice / receipt forms |
| Time-to-first-value | Months of build time is acceptable | You need extraction live in days or weeks |
| Budget model | CapEx plus predictable fixed OpEx | Variable, usage-metered OpEx preferred |
| Integration depth | Deeply embedded in ERP / custom accounting | Standalone, or light API into existing tools |
The honest reading for the median SMB: most answers land in the buy column. Standardised invoices at modest volume, no ML hire, a need to be live in weeks, and a preference for metered OpEx all point the same way. Building earns its keep when you have sustained six-figure monthly page volume, repetitive document types you control, an ML team already on payroll, and a sovereignty or sector-compliance mandate that a vendor can’t satisfy.
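The matrix can be run as a simple tally. The sketch below is one possible encoding, not a scoring method the source prescribes: it weights all eight criteria equally and defaults to "buy" on a tie or when volume has not cleared the break-even, both of which are assumptions you may want to change (a hard sovereignty mandate, for instance, could reasonably override the count).

```python
# The eight criteria from the matrix above; the identifiers are illustrative.
CRITERIA = (
    "monthly_volume",
    "document_variability",
    "in_house_ml",
    "data_sovereignty",
    "extraction_depth",
    "time_to_first_value",
    "budget_model",
    "integration_depth",
)


def recommend(answers, volume_clears_break_even):
    """Return "build" or "buy" from a dict mapping each criterion to one of those."""
    for criterion in CRITERIA:
        if answers.get(criterion) not in ("build", "buy"):
            raise ValueError(f"answer 'build' or 'buy' for {criterion!r}")
    build_votes = sum(answers[c] == "build" for c in CRITERIA)
    buy_votes = len(CRITERIA) - build_votes
    # Assumption: equal weights; ties and sub-break-even volume default to "buy".
    if build_votes > buy_votes and volume_clears_break_even:
        return "build"
    return "buy"


# A typical SMB profile: only sovereignty and integration depth lean toward build.
answers = {c: "buy" for c in CRITERIA}
answers["data_sovereignty"] = "build"
answers["integration_depth"] = "build"
print(recommend(answers, volume_clears_break_even=False))
```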
07 — Hidden Costs
The costs the per-page price never shows.
Every self-hosted cost comparison you’ll read online quotes the compute floor and stops. That figure is real, but it’s the start of the bill, not the total. Four categories sit underneath it, and they’re where most build projects quietly go over budget.
Ongoing maintenance
Maintenance of a self-hosted document-AI pipeline is estimated at roughly five times the initial development cost, per the Mavik Labs framework — model updates, retraining, breakage, and drift never stop.
ML engineering time
The compute floor excludes the salary that makes it run. A pipeline that processes real documents reliably needs ML and MLOps attention; a fractional hire is the optimistic case, a full one the realistic one at scale.
Uptime & infrastructure
GPU provisioning lead time, high-availability redundancy, queueing, retries, monitoring, and storage all carry cost and on-call burden that a managed API absorbs on your behalf for its per-page fee.
Compliance overhead
EU AI Act high-risk obligations take effect August 2, 2026. Whether your use case qualifies depends on the application — but data residency, audit trails, and traceability add real, ongoing cost to either path.
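One way to keep the floor and the overhead separate is to price them as distinct line items. The sketch below is a minimal illustration under stated assumptions: only the compute rate and the $2,000 quarter-time engineer come from this article, and the other three categories are left at zero as placeholders for your own estimates rather than filled with invented figures.

```python
def full_monthly_cost(pages, compute_per_page, overheads):
    """Break a self-hosted monthly bill into its compute floor and its overhead."""
    floor = pages * compute_per_page
    overhead = sum(overheads.values())
    total = floor + overhead
    return {
        "compute_floor": floor,
        "overhead": overhead,
        "total": total,
        "floor_share": floor / total if total else 0.0,
    }


# Placeholders: replace the zeros with your own monthly estimates.
overheads = {
    "ml_engineering": 2000.0,        # quarter-time engineer, from the cost table
    "maintenance": 0.0,              # e.g. the 5x build-cost tail, spread over your horizon
    "uptime_infrastructure": 0.0,    # redundancy, monitoring, storage, on-call
    "compliance": 0.0,               # residency, audit trails, traceability
}

bill = full_monthly_cost(400_000, 0.000727, overheads)
print(f"floor ${bill['compute_floor']:,.0f}, total ${bill['total']:,.0f}, "
      f"floor share {bill['floor_share']:.0%}")
```

Even with three of the four categories still at zero, compute is only a small share of the bill at 400,000 pages a month, which is the sense in which the per-page figure is a floor rather than a total.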
The compliance line deserves its own paragraph, because 2026 made the sovereignty argument concrete rather than theoretical. EU AI Act high-risk provisions (including Article 73 incident reporting) take effect August 2, 2026, and whether a given document-AI system is in-scope depends on its use — employment, credit, and healthcare applications are the obvious triggers, not document AI as a blanket category. Layer on the late-2025 invalidation of the EU-U.S. Data Privacy Framework and a growing list of data-localisation laws, and the location of your extraction pipeline becomes a legal question, not just a latency one.
Crucially, “EU data residency” is not the same as “EU legal jurisdiction.” A U.S.-headquartered vendor can store your data in Frankfurt and still be subject to U.S. jurisdiction and export controls. That distinction stopped being abstract in mid-June 2026, when U.S. export controls abruptly restricted some foreign enterprises’ access to frontier U.S. models — a real-world reminder that a kill switch you don’t control is a risk you’ve accepted. Self-hosted or EU-sovereign deployment is the only path that removes it entirely.
“At some point, you need to be able to turn it on or turn it off, and you don’t want to leave it to another country.” — Arthur Mensch, CEO, Mistral AI
08 — Vendor Diligence
How to read any vendor’s accuracy claim.
Before you commit to any path, pressure-test the numbers. “99% accuracy” is the most abused figure in IDP marketing: it almost always refers to clean, standardised printed documents. Handwriting, scanned legacy paper, multi-column layouts, and mathematical content consistently score far lower across every vendor. The recommendation below maps the four common SMB profiles to a starting path — run your own eval on your own documents before you sign anything.
The typical small back office
Under ~50,000 pages a month, mixed invoices and receipts, no ML hire. Start on a cheap managed API (Mistral OCR 4 or a cloud invoice parser) and add a human validation step. Cheapest total cost, fastest to live.
Repetitive, high-throughput forms
Tens of thousands of standardised pages. Benchmark the cheap managed parsers head-to-head on your own documents; a flat-subscription IDP only wins once volume clears its break-even versus per-page pricing.
Six-figure monthly pages
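Sustained 100,000-plus pages a month with ML and MLOps capacity already on staff. With the engineering cost already on payroll, the self-hosted compute floor undercuts per-page pricing; if the engineer’s time has to be charged to the pipeline, the cost table above puts the crossover against $0.01-a-page parsers nearer 200,000 pages a month. Pilot open weights against your current spend.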
Regulated or jurisdiction-sensitive
A hard GDPR, EU AI Act, or sector mandate. Self-hosted open weights or an EU-sovereign deployment removes cross-border transfer risk a vendor can’t — worth building below the pure-cost break-even.
Whichever profile fits, the discipline is the same: demand the accuracy figure broken out by document type, run a paid pilot on a representative sample of your own worst documents, and price the full stack — not just the per-page line. Wiring extraction into your accounting and operations systems cleanly is where most of the value (and most of the hidden work) actually lives. That integration layer is what our CRM and back-office automation work and our AI transformation services are built around — and it’s the same evaluation discipline we apply when deciding whether to self-host open-weight models.
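Breaking accuracy out by document type takes very little code once you have a hand-labelled sample. The sketch below assumes exact-match scoring on extracted fields, which is one reasonable metric among several (you may prefer normalised or fuzzy matching for dates and amounts); the function name, field names, and sample records are illustrative.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Field-level exact-match accuracy per document type.

    `results` is an iterable of (doc_type, predicted_fields, expected_fields),
    where both field collections are dicts keyed by field name.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_type, predicted, expected in results:
        for field, truth in expected.items():
            total[doc_type] += 1
            if predicted.get(field) == truth:
                correct[doc_type] += 1
    return {doc_type: correct[doc_type] / total[doc_type] for doc_type in total}


# Illustrative sample, not real benchmark data.
pilot = [
    ("printed_invoice",
     {"total": "120.00", "date": "2026-06-01"},
     {"total": "120.00", "date": "2026-06-01"}),
    ("handwritten_receipt",
     {"total": "12.00"},
     {"total": "72.00", "date": "2026-05-30"}),
]
for doc_type, score in accuracy_by_type(pilot).items():
    print(f"{doc_type}: {score:.0%}")
```

A single blended figure would hide exactly the gap this exposes: a model can look excellent on clean printed invoices while missing most fields on handwriting or legacy scans.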
09 — Conclusion
A volume number, not a philosophy.
For most SMB back offices, the answer is still buy — for now.
Document AI in 2026 is genuinely cheap, and that’s exactly why build versus buy is no longer about ideology. The self-hosting break-even sits at roughly 50,000 to 100,000 pages a month with a dedicated ML engineer — and aggressive managed-API pricing, led by Mistral OCR 4 at $4 per 1,000 pages, has only pushed that threshold higher. The median SMB processes nowhere near that volume, which means the spreadsheet, not the manifesto, points to buy.
The practical playbook is unglamorous and correct: buy the cheapest competent managed layer now, normalise your real cost to dollars per invoice, instrument the volume, and revisit the build case only when sustained throughput clears the break-even or a hard sovereignty mandate overrides the math. Sovereignty is the one lever that can justify owning the pipeline early — the rest is arithmetic.
And read every accuracy claim like an auditor. The vendors worth trusting are the ones, like Mistral, willing to call their own benchmark scores directional rather than definitive. Run your own eval on your own worst documents, price the whole stack including the hidden 5× maintenance tail, and let the number — not the marketing — make the call.