Document AI automation has quietly become the most consequential back-office decision an SMB will make in 2026. Open OCR pipelines now self-host on a single GPU at around $0.001 a page, managed vendors sell extraction by the thousand pages, and a new generation of models ships almost monthly. The cheap-inference era has arrived — which means cost per page is no longer the question. The question is build versus buy, and the honest answer is a number, not a philosophy.

What’s at stake is real money and real risk. Between 80% and 90% of newly generated enterprise data is unstructured — invoices, contracts, receipts, forms, scanned PDFs — yet only a fraction of organisations extract it reliably, according to Market.us. Get the architecture wrong and you either overpay a managed vendor for volume you don’t have, or you sink an ML engineer’s salary into a self-hosted pipeline that never reaches the break-even that justifies it.

This guide does three things. It maps the three real paths — buy a managed IDP layer, call an OCR API, or self-host open weights. It normalises every vendor’s pricing to a single unit an SMB actually cares about, the cost to process one invoice end-to-end. And it puts a number on the pivot point: the volume above which building starts to beat buying. Every figure is vendor-stated or third-party-sourced as of June 26, 2026, and benchmark claims are flagged as directional where the vendor itself says so.

Key takeaways

01
Build vs buy is a volume question, not a values question. The self-hosting break-even sits at roughly 50,000–100,000 pages a month with a dedicated ML engineer on staff, per Mavik Labs and Spheron. Below that, engineering overhead erases the per-page savings.
02
Cheap managed APIs pushed the break-even higher. Mistral OCR 4 sells OCR-plus-structure at about $0.004 a page ($0.002 via the Batch API). At those rates the math that used to favour building only flips at much larger volumes than the classic rule of thumb.
03
Normalise to cost-per-invoice, not cost-per-page. Runs, pages, and operations are not fungible. Nanonets prices per run, OCR APIs per page, Rossum as a flat subscription. Our proprietary table converts all of them to one comparable unit across four volume tiers.
04
Self-hosted compute is a floor, not a total. The L40S compute figure of ~$7.27 per 10,000 pages excludes ML salary, fine-tuning, infrastructure maintenance, and uptime; maintenance alone is estimated at 5× the initial development cost. Always price both the floor and the overhead.
05
For most SMBs: buy now, migrate selectively later. Start on a managed layer for predictable cost and fast time-to-value. Migrate to open-weight self-hosting only once volume clears the break-even and you have ML-capable staff or a sovereignty mandate.

01 · Why Now

The unstructured-data problem finally has cheap tools.

Every SMB back office runs on documents it can’t easily query. Supplier invoices arrive as PDFs and scans, receipts pile up in inboxes, contracts live as flat text, and someone re-keys all of it into an accounting system by hand. Intelligent document processing (IDP) is the category that automates that re-keying: optical character recognition to read the pixels, then layout and field extraction to turn them into structured data your systems can use.

The reason this is a 2026 decision rather than a 2022 one is that the tooling got dramatically cheaper and better in parallel. Open-weight OCR models now run on a single mid-range GPU, and managed vendors have been forced to cut prices to compete. The market reflects the momentum — even if the exact size depends heavily on who you ask.

Market context (directional)

One widely cited estimate from Market.us puts the 2026 intelligent document processing market at roughly $4.38 billion, up from about $1.5 billion in 2022. Treat that as an order-of-magnitude signal, not a precise forecast: published 2026 IDP market estimates range from the low single-digit billions to well over ten billion depending on the research firm and what they count. The direction — steep growth, double-digit CAGR — is consistent across sources; the absolute number is not.

Adoption is documented to cut errors and processing time and to return positive first-year ROI, per Market.us’s aggregation of vendor case studies — though those benefit figures are vendor-supplied and vary widely by document type and pipeline quality. The practical takeaway isn’t a headline percentage. It’s that the capability is now cheap enough that the only remaining hard question is how you buy or build it.

02 · The Three Paths

Three ways to turn documents into structured data.

Most build-vs-buy debates collapse into two options. In document AI there are really three, and they sit on a spectrum from most-managed to most-owned. Picking the wrong end of that spectrum is how SMBs either overpay or over-engineer.

Buy · Managed IDP

Turnkey platforms

Rossum · Docsumo · Nanonets · ABBYY

A full product: extraction models, a human validation screen, retraining, and an archive. Fastest time-to-value, least engineering, highest per-document price. Best when documents vary and you have no ML team.

Rossum Starter from $18,000 / yr (vendor)

Buy · OCR API

Pay-as-you-go extraction

Mistral OCR 4 · AWS Textract · Google · Azure

You call an API per page and own the pipeline glue — queueing, validation, retries, storage. Cheap per page, metered OpEx, no platform lock-in. Best when you have light engineering and standardised documents.

Mistral OCR 4 $4 / 1K pages (vendor)

Build · Self-host

Open-weight models

PaddleOCR-VL · DeepSeek-OCR · GOT-OCR 2.0

Run permissively-licensed OCR models on your own GPU. Lowest marginal cost and full data sovereignty, but you own infrastructure, accuracy tuning, and uptime. Best at high volume with ML staff.

≈$7.27 / 10K pages compute (3rd-party)

The open-weight shelf is unusually deep in 2026. PaddleOCR-VL-1.6 (Apache 2.0, ~0.9B params, 100-plus languages) runs at roughly 45 pages a minute on an L40S; DeepSeek-OCR (~3B, MIT) trades a little speed for a meaningfully lower cost at the same GPU; GOT-OCR 2.0 is strong on equations under 3GB of VRAM; and Granite-Docling (258M, Apache 2.0) is fast on financial tables, per Spheron’s 2026 self-hosting analysis. These are real production options — but they ship without a managed API, an SLA, or a validation UI. That gap is exactly what the buy paths charge for.
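To sanity-check whether a single GPU covers a given volume, the throughput figure above can be converted into a monthly ceiling. This is a minimal sketch, assuming the quoted ~45 pages a minute holds on your own documents; the `utilisation` default of 0.5 and the function names are illustrative assumptions, not figures from Spheron.

```python
import math

PAGES_PER_MINUTE = 45  # PaddleOCR-VL-1.6 on an L40S, as quoted above


def monthly_capacity(pages_per_minute: float, utilisation: float = 0.5,
                     days: int = 30) -> int:
    """Pages one GPU can process in a month at the given busy fraction."""
    minutes = days * 24 * 60
    return int(pages_per_minute * minutes * utilisation)


def gpus_needed(monthly_pages: int, pages_per_minute: float,
                utilisation: float = 0.5) -> int:
    """Smallest whole number of GPUs that covers the monthly volume."""
    capacity = monthly_capacity(pages_per_minute, utilisation)
    return math.ceil(monthly_pages / capacity)


print(monthly_capacity(PAGES_PER_MINUTE))          # ceiling for one GPU
print(gpus_needed(400_000, PAGES_PER_MINUTE))      # GPUs for 400K pages / mo
```

Under these assumptions one GPU has headroom well beyond typical SMB volumes, which suggests that availability and on-call cover, rather than raw throughput, are what drive infrastructure cost on the build path.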

03 · The 2026 Shift

What the last week of June actually changed.

Two releases one day apart reset the reference points. On June 22, 2026, Baidu open-sourced Unlimited-OCR (MIT, 3B params), which parses 40-plus pages in a single forward pass and drew about 1,800 GitHub stars in its first day — but ships with no managed API, no enterprise SLA, and no bounding-box output. A day later, on June 23, Mistral released OCR 4, its fourth OCR generation in roughly fifteen months, adding the most-requested enterprise feature: structured output with bounding boxes, block-type classification, and per-word confidence scores. Our companion deep dive covers Mistral OCR 4’s capabilities and benchmark results in full.

That shift — from flat text to a layered semantic map where every block has a location, a type, and a confidence score — is what makes extraction auditable without a separate layout-analysis stage. For an SMB it matters because it collapses two tools into one and makes the output traceable for compliance. The benchmark story, though, needs reading carefully.
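One practical use of per-block confidence is routing: auto-accept what the model is sure of and queue the rest for a human. The sketch below shows the idea under loud assumptions: the `Block` shape, the field names, and the 0.90 threshold are hypothetical and do not reproduce Mistral's or any vendor's actual response schema, and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One extracted block. Hypothetical shape, not a vendor schema."""
    text: str
    block_type: str                           # e.g. "table", "paragraph"
    bbox: tuple[float, float, float, float]   # x0, y0, x1, y1 on the page
    confidence: float                         # 0.0 to 1.0


def split_for_review(blocks: list[Block], threshold: float = 0.90):
    """Separate blocks safe to auto-accept from those needing a human."""
    accepted = [b for b in blocks if b.confidence >= threshold]
    flagged = [b for b in blocks if b.confidence < threshold]
    return accepted, flagged


# Made-up sample blocks, for illustration only.
blocks = [
    Block("Invoice total: 1,240.00", "paragraph", (72, 640, 300, 660), 0.98),
    Block("VAT no. GB12345678?", "paragraph", (72, 120, 260, 140), 0.71),
]
accepted, flagged = split_for_review(blocks)
for b in flagged:
    print(f"review needed at {b.bbox}: {b.text!r} ({b.confidence:.2f})")
```

Because each flagged block carries its bounding box, a reviewer can be shown the exact region of the page, which is the traceability benefit described above.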

On the benchmark numbers

Mistral reports OCR 4 scoring 85.20 on OlmOCRBench and 93.07 on OmniDocBench, with a 72% average win rate in head-to-head human evaluations across 600-plus documents — all of which are vendor-stated. Unusually, Mistral publishes its own caveats about benchmark artifacts and states that it treats the aggregate score as directional rather than definitive. Worth noting alongside the marketing claim: on the public OlmOCRBench leaderboard, OCR 4 currently sits third rather than first, behind open models. Read every IDP accuracy figure, including this one, as a starting point for your own evaluation — not a verdict.

Vendor-forwarded customer testimonials point the same direction — one financial-AI engineering team described reaching equivalent accuracy at far lower cost and latency than an incumbent parser — but those evals are unpublished and should be read as marketing, not measurement. The practitioner mood is more sober. As one commenter with a decade in document parsing put it on the Unlimited-OCR thread, OCR still has rough edges in 2026: handwriting, scanned legacy documents, and multi-column math remain hard for every model on the shelf.

04 · True Cost

The real cost, normalised per invoice.

Here is the comparison no vendor publishes, because it makes the units honest. Nanonets prices per run, OCR APIs price per page, Rossum sells a flat subscription — quoting them side by side is comparing apples, kilometres, and Tuesdays. The table below converts every path to one unit an SMB actually budgets in: the cost to process one invoice end-to-end, assuming a four-page invoice, across four realistic monthly volumes. Every cell is recomputed from the vendor’s own per-unit rate retrieved June 26, 2026.

Estimated monthly cost to process invoices end-to-end across nine document-AI paths and four volume tiers, assuming one invoice equals four pages. Per-unit rates were retrieved June 26, 2026 and are vendor-stated, except self-hosted compute (Spheron, third-party) and Rossum (flat annual subscription). Nanonets prices per automation run, not per page. Rossum Starter is a flat $1,500/month subscription that does not scale with volume; the asterisked higher bands would in practice exceed Starter-tier page ceilings and require a custom-quoted upgrade.
Path	Unit rate	5K inv / mo (20K pages)	25K inv / mo (100K pages)	50K inv / mo (200K pages)	100K inv / mo (400K pages)
Mistral OCR 4 API (standard)	$0.016 / invoice	$80	$400	$800	$1,600
Mistral OCR 4 Batch API	$0.008 / invoice	$40	$200	$400	$800
AWS Textract — Analyze Expense	$0.040 / invoice	$200	$1,000	$2,000	$4,000
Google Document AI — Invoice Parser	$0.040 / invoice	$200	$1,000	$2,000	$4,000
Azure Doc Intelligence — Invoice (Prebuilt)	$0.040 / invoice	$200	$1,000	$2,000	$4,000
Nanonets — Complex AI (4 blocks / invoice)	$1.20 / invoice	$6,000	$30,000	$60,000	$120,000
Self-hosted open OCR — compute only	$0.0029 / invoice	$15	$73	$146	$291
Self-hosted + 0.25-FTE ML engineer	compute + ~$2,000 / mo	$2,015	$2,073	$2,146	$2,291
Rossum Starter (flat subscription)	$1,500 / mo flat	$1,500	$1,500	$1,500*	$1,500*

How to read it. Assumptions: one invoice = four pages; Nanonets = four $0.30 complex-AI runs per invoice; AWS, Google, and Azure invoice parsers at $0.01 a page; Mistral OCR 4 at $0.004 a page ($0.002 batch); self-hosted L40S compute at $0.000727 a page (Spheron). The self-hosted-plus-engineer row adds a quarter-time ML engineer at roughly $2,000 a month — a deliberately conservative floor, since a real build typically needs more. The Rossum row is a flat $1,500-a-month subscription that doesn’t scale with volume; the asterisked cells flag that Starter-tier page ceilings would, in practice, push the two highest bands onto a custom-quoted Business or Enterprise tier (Rossum has been Coupa-owned since May 2026).
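The normalisation can be reproduced in a few lines. This is a sketch that uses only the rates and assumptions stated above; the variable and function names are illustrative, and the three cloud invoice parsers are collapsed into one row because they share the same assumed rate.

```python
PAGES_PER_INVOICE = 4
TIERS = [5_000, 25_000, 50_000, 100_000]  # invoices per month

# Per-invoice rates in USD, derived from the assumptions stated above.
PER_INVOICE = {
    "Mistral OCR 4 API (standard)": 0.004 * PAGES_PER_INVOICE,
    "Mistral OCR 4 Batch API": 0.002 * PAGES_PER_INVOICE,
    "AWS / Google / Azure invoice parser": 0.01 * PAGES_PER_INVOICE,
    "Nanonets (4 complex-AI runs)": 0.30 * 4,
    "Self-hosted, compute only": 0.000727 * PAGES_PER_INVOICE,
}
# Fixed monthly costs in USD for the two rows that carry one.
ENGINEER_PER_MONTH = 2_000
ROSSUM_FLAT_PER_MONTH = 1_500


def monthly_cost(rate_per_invoice: float, invoices: int,
                 fixed: float = 0.0) -> float:
    """Fixed monthly cost plus the volume-driven part."""
    return fixed + rate_per_invoice * invoices


rows = {name: [monthly_cost(rate, n) for n in TIERS]
        for name, rate in PER_INVOICE.items()}
rows["Self-hosted + 0.25-FTE ML engineer"] = [
    monthly_cost(PER_INVOICE["Self-hosted, compute only"], n, ENGINEER_PER_MONTH)
    for n in TIERS
]
rows["Rossum Starter (flat)"] = [
    monthly_cost(0.0, n, ROSSUM_FLAT_PER_MONTH) for n in TIERS
]

for name, costs in rows.items():
    print(f"{name:38s}" + "".join(f"{cost:>12,.0f}" for cost in costs))
```

Computed this way, the two self-hosted cells at 50,000 invoices come out at about $145 and $2,145, a dollar under the table's $146 and $2,146, which appears to be a rounding difference; every other cell matches.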

Two patterns jump out. Nanonets’ per-run pricing makes it wildly expensive for high-volume invoice work — the very use case buyers most often evaluate it for — because a four-block invoice costs $1.20, not four cents. And the self-hosted compute floor is almost free, but the moment you add even a fractional engineer, the fixed cost dominates until volume is large. That fixed cost is the whole build-vs-buy story.

05 · Break-Even

Where building actually starts to win.

The classic rule of thumb, synthesised from the Mavik Labs build-vs-buy framework and Spheron’s self-hosting analysis, is that document-AI self-hosting breaks even at roughly 50,000 to 100,000 pages a month — and only with a dedicated ML engineer on staff. Below that, the engineering overhead (maintenance alone estimated at 5× the initial development cost) erases the per-page savings. The chart below shows why, comparing monthly cost across the sensible paths at the top volume tier.

Monthly cost at 100,000 invoices/month · select paths

Path	Basis	Monthly cost
AWS / Google / Azure invoice parser	$0.04 / invoice · managed API	$4,000
Self-hosted + 0.25-FTE engineer	compute + ~$2,000/mo fixed	$2,291
Mistral OCR 4 API (standard)	$0.016 / invoice · managed API	$1,600
Rossum Starter (flat)	tier ceiling applies at this volume	$1,500
Mistral OCR 4 Batch API	$0.008 / invoice · async	$800
Self-hosted compute only	L40S · excludes all overhead	$291

Source: recomputed from vendor pricing · 100K invoices/mo (400K pages). Nanonets excluded as an outlier ($120,000).

The trend the table reveals. The conventional break-even assumes you’re comparing a self-hosted build against a typical full-service managed vendor. But aggressive 2026 API pricing moved the goalposts. Run the per-invoice math and the picture sharpens: against the cheap managed parsers, a self-hosted pipeline with even a quarter-time engineer doesn’t pull ahead until roughly 50,000 invoices a month (200,000 pages) — and against Mistral OCR 4’s standard API, the crossover is far higher still. Our own table shows self-host-plus-engineer at $2,146 versus $2,000 for the managed parsers at 50K invoices (essentially a tie), then winning at 100K. The build case is real, but it lives further up the volume curve than most SMBs ever reach.
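The crossover itself is a one-line formula: building wins once the per-invoice saving has paid off the fixed monthly cost. The sketch below applies it to the rates used in this guide; the function name is illustrative, and the $2,000 fixed cost is the same deliberately conservative quarter-time engineer assumed in the table.

```python
import math

PAGES_PER_INVOICE = 4
SELF_HOST_RATE = 0.000727 * PAGES_PER_INVOICE  # USD per invoice, compute only
ENGINEER_PER_MONTH = 2_000                     # USD, quarter-time ML engineer


def break_even_invoices(fixed_per_month: float, own_rate: float,
                        alt_rate: float) -> float:
    """Monthly invoices n at which fixed + own_rate * n == alt_rate * n.

    Returns math.inf when the owned path has no per-invoice saving.
    """
    saving = alt_rate - own_rate
    if saving <= 0:
        return math.inf
    return fixed_per_month / saving


MANAGED = {
    "Cloud invoice parser": 0.040,
    "Mistral OCR 4 standard": 0.016,
    "Mistral OCR 4 Batch": 0.008,
}
for name, rate in MANAGED.items():
    n = break_even_invoices(ENGINEER_PER_MONTH, SELF_HOST_RATE, rate)
    print(f"{name}: ~{n:,.0f} invoices / mo "
          f"(~{n * PAGES_PER_INVOICE:,.0f} pages)")
```

With these inputs the crossover lands at roughly 54,000 invoices a month against the $0.04 parsers, roughly 153,000 against Mistral's standard API, and roughly 393,000 against the Batch API. A larger engineering allocation scales each threshold up proportionally.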

Where this is heading. Open-weight OCR quality and managed-API prices are converging from opposite directions, and that convergence will keep pushing the break-even up, not down. For the next year or two, the rational default for a back office processing tens of thousands of pages a month is to buy the cheapest competent managed layer, instrument the volume, and revisit the build case only when sustained volume — or a hard sovereignty requirement — clears the bar. The same build-vs-buy calculus applies across your AI data stack, which we unpack in our CDP build, buy, or skip decision matrix and the broader AI build-vs-buy framework for agency stacks.

06 · Decision Matrix

Eight criteria that move the needle toward build or buy.

Volume is the spine of the decision, but it isn’t the whole skeleton. Run your situation down these eight criteria; if most of your answers land in the right-hand column, buy a managed layer. If they cluster on the left — and your volume clears the break-even — building becomes defensible.

Build-vs-buy decision matrix for SMB document AI. Eight criteria, each with the condition that pushes the decision toward building a self-hosted pipeline versus buying a managed layer or API. Synthesised from the Mavik Labs build-vs-buy framework and Spheron self-hosting analysis.
Criterion	Pushes toward Build	Pushes toward Buy
Monthly volume	Above ~100,000 pages / month, sustained	Below ~50,000 pages / month, or spiky
Document variability	Standardised, repetitive layouts you control	Arbitrary, varied contracts and one-offs
In-house ML engineering	Dedicated ML / MLOps capacity on staff	No ML hire; ops team owns the workflow
Data sovereignty	Strict GDPR / EU AI Act / sector rules	Vendor EU data residency is acceptable
Extraction depth	Custom fields, tables, equations, edge formats	General text plus standard invoice / receipt forms
Time-to-first-value	Months of build time is acceptable	You need extraction live in days or weeks
Budget model	CapEx plus predictable fixed OpEx	Variable, usage-metered OpEx preferred
Integration depth	Deeply embedded in ERP / custom accounting	Standalone, or light API into existing tools

The honest reading for the median SMB: most answers land in the buy column. Standardised invoices at modest volume, no ML hire, a need to be live in weeks, and a preference for metered OpEx all point the same way. Building earns its keep when you have sustained six-figure monthly page volume, repetitive document types you control, an ML team already on payroll, and a sovereignty or sector-compliance mandate that a vendor can’t satisfy.
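The matrix can be turned into a simple tally. This is a rough sketch of the reading rule described above, assuming every criterion carries equal weight (the source does not assign weights); the function name is hypothetical, and the sovereignty override discussed later in this guide is deliberately not modelled.

```python
CRITERIA = [
    "Monthly volume", "Document variability", "In-house ML engineering",
    "Data sovereignty", "Extraction depth", "Time-to-first-value",
    "Budget model", "Integration depth",
]


def recommend(answers: dict[str, str], volume_clears_break_even: bool) -> str:
    """Return "build" only if most answers say build AND volume clears the bar."""
    for criterion in CRITERIA:
        if answers.get(criterion) not in ("build", "buy"):
            raise ValueError(f"answer 'build' or 'buy' for: {criterion}")
    build_votes = sum(1 for c in CRITERIA if answers[c] == "build")
    if build_votes > len(CRITERIA) / 2 and volume_clears_break_even:
        return "build"
    return "buy"


# Example: a median SMB profile with every answer in the buy column.
median_smb = {criterion: "buy" for criterion in CRITERIA}
print(recommend(median_smb, volume_clears_break_even=False))
```

Ties and build-leaning answers without the volume to back them both fall through to buy, which mirrors the guide's default.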

07 · Hidden Costs

The costs the per-page price never shows.

Every self-hosted cost comparison you’ll read online quotes the compute floor and stops. That figure is real, but it’s the start of the bill, not the total. Four categories sit underneath it, and they’re where most build projects quietly go over budget.

Hidden cost

Ongoing maintenance

5×

Maintenance of a self-hosted document-AI pipeline is estimated at roughly five times the initial development cost, per the Mavik Labs framework — model updates, retraining, breakage, and drift never stop.

Recurring, not one-off

Hidden cost

ML engineering time

1 FTE+

The compute floor excludes the salary that makes it run. A pipeline that processes real documents reliably needs ML and MLOps attention; a fractional hire is the optimistic case, a full one the realistic one at scale.

Excluded from compute quotes

Hidden cost

Uptime & infrastructure

24/7

GPU provisioning lead time, high-availability redundancy, queueing, retries, monitoring, and storage all carry cost and on-call burden that a managed API absorbs on your behalf for its per-page fee.

On-call & redundancy

Hidden cost

Compliance overhead

Aug 2

EU AI Act high-risk obligations take effect August 2, 2026. Whether your use case qualifies depends on the application — but data residency, audit trails, and traceability add real, ongoing cost to either path.

Use-case dependent

The compliance line deserves its own paragraph, because 2026 made the sovereignty argument concrete rather than theoretical. EU AI Act high-risk provisions (including Article 73 incident reporting) take effect August 2, 2026, and whether a given document-AI system is in-scope depends on its use — employment, credit, and healthcare applications are the obvious triggers, not document AI as a blanket category. Layer on the late-2025 invalidation of the EU-U.S. Data Privacy Framework and a growing list of data-localisation laws, and the location of your extraction pipeline becomes a legal question, not just a latency one.

Crucially, “EU data residency” is not the same as “EU legal jurisdiction.” A U.S.-headquartered vendor can store your data in Frankfurt and still be subject to U.S. jurisdiction and export controls. That distinction stopped being abstract in mid-June 2026, when U.S. export controls abruptly restricted some foreign enterprises’ access to frontier U.S. models — a real-world reminder that a kill switch you don’t control is a risk you’ve accepted. Self-hosted or EU-sovereign deployment is the only path that removes it entirely.

“At some point, you need to be able to turn it on or turn it off, and you don’t want to leave it to another country.” (Arthur Mensch, CEO, Mistral AI)

The sovereignty read

For SMBs without a hard regulatory mandate, vendor EU data residency is usually enough, and the speed of buying wins. For regulated sectors — or any business that simply can’t accept a foreign kill switch on a core back-office function — the calculus tilts toward owning the pipeline, even below the pure-cost break-even. Sovereignty is the one criterion that can justify building at a volume the spreadsheet alone wouldn’t.

08 · Vendor Diligence

How to read any vendor’s accuracy claim.

Before you commit to any path, pressure-test the numbers. “99% accuracy” is the most abused figure in IDP marketing: it almost always refers to clean, standardised printed documents. Handwriting, scanned legacy paper, multi-column layouts, and mathematical content consistently score far lower across every vendor. The recommendation below maps the four common SMB profiles to a starting path — run your own eval on your own documents before you sign anything.

Low volume, varied docs

The typical small back office

Under ~50,000 pages a month, mixed invoices and receipts, no ML hire. Start on a cheap managed API (Mistral OCR 4 or a cloud invoice parser) and add a human validation step. Cheapest total cost, fastest to live.

Buy an OCR API

Mid volume, standardised

Repetitive, high-throughput forms

Tens of thousands of standardised pages. Benchmark the cheap managed parsers head-to-head on your own documents; a flat-subscription IDP only wins once volume clears its break-even versus per-page pricing.

Buy, benchmark first

High volume + ML team

Six-figure monthly pages

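Sustained 100,000-plus pages a month with ML and MLOps capacity already on staff. At this scale the self-hosted compute floor can beat per-page pricing even after engineer overhead, though against the $0.01-a-page parsers the cost table puts the crossover closer to 200,000 pages. Pilot open weights against your current spend.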

Build, self-host

Sovereignty-bound

Regulated or jurisdiction-sensitive

A hard GDPR, EU AI Act, or sector mandate. Self-hosted open weights or an EU-sovereign deployment removes cross-border transfer risk a vendor can’t — worth building below the pure-cost break-even.

Build for control

Whichever profile fits, the discipline is the same: demand the accuracy figure broken out by document type, run a paid pilot on a representative sample of your own worst documents, and price the full stack — not just the per-page line. Wiring extraction into your accounting and operations systems cleanly is where most of the value (and most of the hidden work) actually lives. That integration layer is what our CRM and back-office automation work and our AI transformation services are built around — and it’s the same evaluation discipline we apply when deciding whether to self-host open-weight models.
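Breaking accuracy out by document type does not need special tooling. The sketch below scores exact-match field accuracy per type from a hand-labelled pilot sample; the tuple layout, the function name, and the sample records are illustrative assumptions, and exact string match is only one possible scoring rule.

```python
from collections import defaultdict


def accuracy_by_doc_type(samples):
    """Fraction of ground-truth fields extracted exactly, per document type.

    samples: iterable of (doc_type, predicted_fields, true_fields), where
    both field collections are dicts of field name to string value.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_type, predicted, truth in samples:
        for field, expected in truth.items():
            total[doc_type] += 1
            if predicted.get(field) == expected:
                correct[doc_type] += 1
    return {doc_type: correct[doc_type] / total[doc_type] for doc_type in total}


# Made-up pilot records, for illustration only.
samples = [
    ("printed_invoice",
     {"total": "1240.00", "date": "2026-06-01"},
     {"total": "1240.00", "date": "2026-06-01"}),
    ("handwritten_receipt",
     {"total": "18.50", "date": "2026-06-07"},
     {"total": "13.50", "date": "2026-06-07"}),
]
scores = accuracy_by_doc_type(samples)
for doc_type, score in scores.items():
    print(f"{doc_type}: {score:.0%}")
```

Reporting the score per type rather than as one blended number is what exposes the gap between clean printed documents and the hard cases a single headline figure hides.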

09 · Conclusion

A volume number, not a philosophy.

The shape of document AI, June 2026

For most SMB back offices, the answer is still buy — for now.

Document AI in 2026 is genuinely cheap, and that’s exactly why build versus buy is no longer about ideology. The self-hosting break-even sits at roughly 50,000 to 100,000 pages a month with a dedicated ML engineer — and aggressive managed-API pricing, led by Mistral OCR 4 at $4 per 1,000 pages, has only pushed that threshold higher. The median SMB processes nowhere near that volume, which means the spreadsheet, not the manifesto, points to buy.

The practical playbook is unglamorous and correct: buy the cheapest competent managed layer now, normalise your real cost to dollars per invoice, instrument the volume, and revisit the build case only when sustained throughput clears the break-even or a hard sovereignty mandate overrides the math. Sovereignty is the one lever that can justify owning the pipeline early — the rest is arithmetic.

And read every accuracy claim like an auditor. The vendors worth trusting are the ones, like Mistral, willing to call their own benchmark scores directional rather than definitive. Run your own eval on your own worst documents, price the whole stack including the hidden 5× maintenance tail, and let the number — not the marketing — make the call.

Document AI for SMBs: Build vs Buy in 2026
