Business · Decision Matrix · 12 min read · Published June 26, 2026

An evergreen decision guide · the pivot point is roughly 50,000 pages a month, not a philosophy

Document AI for SMBs: Build vs Buy in 2026

Open OCR pipelines now self-host on a single GPU at roughly $0.001 a page, and Mistral OCR 4 (June 23, 2026) sells managed extraction at $4 per 1,000 pages. Cheap inference is no longer the deciding factor. The build-vs-buy answer is a volume number — and for most SMB back offices it still points to buy.

Digital Applied Team
Senior strategists · Published June 26, 2026
Sources: Vendor pricing + market data
  • Mistral OCR 4 API: $4 per 1K pages (vendor-stated) · Batch $2 per 1K
  • Self-host compute floor: $7.27 per 10K pages (L40S) · ≈14× under Textract
  • Build break-even: 50–100K pages / month + an ML hire
  • Maintenance vs build: 5× of initial dev cost

Document AI automation has quietly become the most consequential back-office decision an SMB will make in 2026. Open OCR pipelines now self-host on a single GPU at around $0.001 a page, managed vendors sell extraction by the thousand pages, and a new generation of models ships almost monthly. The cheap-inference era has arrived — which means cost per page is no longer the question. The question is build versus buy, and the honest answer is a number, not a philosophy.

What’s at stake is real money and real risk. Between 80% and 90% of newly generated enterprise data is unstructured — invoices, contracts, receipts, forms, scanned PDFs — yet only a fraction of organisations extract it reliably, according to Market.us. Get the architecture wrong and you either overpay a managed vendor for volume you don’t have, or you sink an ML engineer’s salary into a self-hosted pipeline that never reaches the break-even that justifies it.

This guide does three things. It maps the three real paths — buy a managed IDP layer, call an OCR API, or self-host open weights. It normalises every vendor’s pricing to a single unit an SMB actually cares about, the cost to process one invoice end-to-end. And it puts a number on the pivot point: the volume above which building starts to beat buying. Every figure is vendor-stated or third-party-sourced as of June 26, 2026, and benchmark claims are flagged as directional where the vendor itself says so.

Key takeaways
  1. Build vs buy is a volume question, not a values question. The self-hosting break-even sits at roughly 50,000–100,000 pages a month with a dedicated ML engineer on staff, per Mavik Labs and Spheron. Below that, engineering overhead erases the per-page savings.
  2. Cheap managed APIs pushed the break-even higher. Mistral OCR 4 sells OCR-plus-structure at about $0.004 a page ($0.002 via the Batch API). At those rates the math that used to favour building only flips at much larger volumes than the classic rule of thumb.
  3. Normalise to cost-per-invoice, not cost-per-page. Runs, pages, and operations are not fungible. Nanonets prices per run, OCR APIs per page, Rossum as a flat subscription. Our proprietary table converts all of them to one comparable unit across four volume tiers.
  4. Self-hosted compute is a floor, not a total. The ~$7.27 per 10,000 pages L40S figure excludes ML salary, fine-tuning, infrastructure maintenance, and uptime; maintenance alone is estimated at 5× the initial development cost. Always price the floor and the overhead.
  5. For most SMBs: buy now, migrate selectively later. Start on a managed layer for predictable cost and fast time-to-value. Migrate to open-weight self-hosting only once volume clears the break-even and you have ML-capable staff or a sovereignty mandate.

01 · Why Now · The unstructured-data problem finally has cheap tools.

Every SMB back office runs on documents it can’t easily query. Supplier invoices arrive as PDFs and scans, receipts pile up in inboxes, contracts live as flat text, and someone re-keys all of it into an accounting system by hand. Intelligent document processing (IDP) is the category that automates that re-keying: optical character recognition to read the pixels, then layout and field extraction to turn them into structured data your systems can use.

The reason this is a 2026 decision rather than a 2022 one is that the tooling got dramatically cheaper and better in parallel. Open-weight OCR models now run on a single mid-range GPU, and managed vendors have been forced to cut prices to compete. The market reflects the momentum — even if the exact size depends heavily on who you ask.

Market context (directional)
One widely cited estimate from Market.us puts the 2026 intelligent document processing market at roughly $4.38 billion, up from about $1.5 billion in 2022. Treat that as an order-of-magnitude signal, not a precise forecast: published 2026 IDP market estimates range from the low single-digit billions to well over ten billion depending on the research firm and what they count. The direction — steep growth, double-digit CAGR — is consistent across sources; the absolute number is not.

Adoption is documented to cut errors and processing time and to return positive first-year ROI, per Market.us’s aggregation of vendor case studies — though those benefit figures are vendor-supplied and vary widely by document type and pipeline quality. The practical takeaway isn’t a headline percentage. It’s that the capability is now cheap enough that the only remaining hard question is how you buy or build it.

02 · The Three Paths · Three ways to turn documents into structured data.

Most build-vs-buy debates collapse into two options. In document AI there are really three, and they sit on a spectrum from most-managed to most-owned. Picking the wrong end of that spectrum is how SMBs either overpay or over-engineer.

Buy · Managed IDP
Turnkey platforms
Rossum · Docsumo · Nanonets · ABBYY

A full product: extraction models, a human validation screen, retraining, and an archive. Fastest time-to-value, least engineering, highest per-document price. Best when documents vary and you have no ML team.

Rossum Starter from $18,000 / yr (vendor)

Buy · OCR API
Pay-as-you-go extraction
Mistral OCR 4 · AWS Textract · Google · Azure

You call an API per page and own the pipeline glue — queueing, validation, retries, storage. Cheap per page, metered OpEx, no platform lock-in. Best when you have light engineering and standardised documents.

Mistral OCR 4 $4 / 1K pages (vendor)

Build · Self-host
Open-weight models
PaddleOCR-VL · DeepSeek-OCR · GOT-OCR 2.0

Run permissively-licensed OCR models on your own GPU. Lowest marginal cost and full data sovereignty, but you own infrastructure, accuracy tuning, and uptime. Best at high volume with ML staff.

≈$7.27 / 10K pages compute (3rd-party)

The open-weight shelf is unusually deep in 2026. PaddleOCR-VL-1.6 (Apache 2.0, ~0.9B params, 100-plus languages) runs at roughly 45 pages a minute on an L40S; DeepSeek-OCR (~3B, MIT) trades a little speed for a meaningfully lower cost at the same GPU; GOT-OCR 2.0 is strong on equations under 3GB of VRAM; and Granite-Docling (258M, Apache 2.0) is fast on financial tables, per Spheron’s 2026 self-hosting analysis. These are real production options — but they ship without a managed API, an SLA, or a validation UI. That gap is exactly what the buy paths charge for.
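
The link between GPU throughput and per-page compute cost is plain division, and it is worth sketching before trusting any quoted floor. The example below is a minimal sketch, not Spheron's method: it assumes the $7.27-per-10K-pages figure and the 45-pages-a-minute figure describe the same L40S setup (the article does not say so explicitly), and the function names are illustrative.

```python
def cost_per_page(gpu_hourly_usd: float, pages_per_minute: float) -> float:
    """Compute-only cost per page for a GPU billed by the hour."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return gpu_hourly_usd / (pages_per_minute * 60)


def implied_hourly_rate(usd_per_page: float, pages_per_minute: float) -> float:
    """Invert the formula: the hourly GPU price implied by a per-page cost."""
    return usd_per_page * pages_per_minute * 60


# Figures quoted in the article; pairing them is an assumption.
PAGES_PER_MINUTE = 45          # PaddleOCR-VL-1.6 on an L40S
USD_PER_PAGE = 7.27 / 10_000   # third-party compute floor per page

rate = implied_hourly_rate(USD_PER_PAGE, PAGES_PER_MINUTE)
print(f"Implied GPU rate: ${rate:.2f}/hour")
print(f"Pages per GPU-month at full utilisation: {PAGES_PER_MINUTE * 60 * 24 * 30:,}")
```

Under that pairing the implied rate works out to roughly $1.96 an hour. The per-page floor only holds while the GPU is busy: an always-on GPU running well below full utilisation costs more per page than the quoted figure.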

03 · The 2026 Shift · What the last week of June actually changed.

Two releases one day apart reset the reference points. On June 22, 2026, Baidu open-sourced Unlimited-OCR (MIT, 3B params), which parses 40-plus pages in a single forward pass and drew about 1,800 GitHub stars in its first day — but ships with no managed API, no enterprise SLA, and no bounding-box output. A day later, on June 23, Mistral released OCR 4, its fourth OCR generation in roughly fifteen months, adding the most-requested enterprise feature: structured output with bounding boxes, block-type classification, and per-word confidence scores. Our companion deep dive covers Mistral OCR 4’s capabilities and benchmark results in full.

That shift — from flat text to a layered semantic map where every block has a location, a type, and a confidence score — is what makes extraction auditable without a separate layout-analysis stage. For an SMB it matters because it collapses two tools into one and makes the output traceable for compliance. The benchmark story, though, needs reading carefully.
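
To make the auditability point concrete, here is a minimal sketch of how a per-block confidence score can feed a human-review queue. The `Block` shape, its field names, the sample values, and the 0.90 threshold are all hypothetical illustrations, not Mistral's actual response schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical extracted block: where it is, what it is, how sure."""
    page: int
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1
    block_type: str                          # e.g. "table", "paragraph"
    text: str
    confidence: float                        # 0.0 to 1.0


def needs_review(blocks: list[Block], threshold: float = 0.90) -> list[Block]:
    """Return blocks whose confidence falls below the threshold."""
    return [b for b in blocks if b.confidence < threshold]


# Toy data for illustration only.
blocks = [
    Block(1, (40, 60, 300, 90), "paragraph", "Invoice INV-1042", 0.99),
    Block(1, (40, 400, 560, 520), "table", "Total 1,280.00", 0.82),
]

for b in needs_review(blocks):
    # The bounding box tells a reviewer exactly where to look on the page.
    print(f"page {b.page} {b.block_type} at {b.bbox}: {b.text!r} ({b.confidence:.2f})")
```

The design point is that location, type, and confidence travel together, so a reviewer can be sent straight to the uncertain region instead of re-reading the whole document.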

On the benchmark numbers
Mistral reports OCR 4 scoring 85.20 on OlmOCRBench and 93.07 on OmniDocBench, with a 72% average win rate in head-to-head human evaluations across 600-plus documents — all of which are vendor-stated. Unusually, Mistral publishes its own caveats about benchmark artifacts and states that it treats the aggregate score as directional rather than definitive. Worth noting alongside the marketing claim: on the public OlmOCRBench leaderboard, OCR 4 currently sits third rather than first, behind open models. Read every IDP accuracy figure, including this one, as a starting point for your own evaluation — not a verdict.

Vendor-forwarded customer testimonials point the same direction — one financial-AI engineering team described reaching equivalent accuracy at far lower cost and latency than an incumbent parser — but those evals are unpublished and should be read as marketing, not measurement. The practitioner mood is more sober. As one commenter with a decade in document parsing put it on the Unlimited-OCR thread, OCR still has rough edges in 2026: handwriting, scanned legacy documents, and multi-column math remain hard for every model on the shelf.

04 · True Cost · The real cost, normalised per invoice.

Here is the comparison no vendor publishes, because it makes the units honest. Nanonets prices per run, OCR APIs price per page, Rossum sells a flat subscription — quoting them side by side is comparing apples, kilometres, and Tuesdays. The table below converts every path to one unit an SMB actually budgets in: the cost to process one invoice end-to-end, assuming a four-page invoice, across four realistic monthly volumes. Every cell is recomputed from the vendor’s own per-unit rate retrieved June 26, 2026.

Estimated monthly cost to process invoices end-to-end across nine document-AI paths and four volume tiers, assuming one invoice equals four pages. Per-unit rates were retrieved June 26, 2026 and are vendor-stated, except self-hosted compute (Spheron, third-party); Rossum is priced as a flat annual subscription rather than per unit. Nanonets prices per automation run, not per page. Rossum Starter is a flat $1,500/month subscription that does not scale with volume; the asterisked higher bands would in practice exceed Starter-tier page ceilings and require a custom-quoted upgrade.
| Path | Unit rate | 5K inv / mo (20K pages) | 25K inv / mo (100K pages) | 50K inv / mo (200K pages) | 100K inv / mo (400K pages) |
|---|---|---|---|---|---|
| Mistral OCR 4 API (standard) | $0.016 / invoice | $80 | $400 | $800 | $1,600 |
| Mistral OCR 4 Batch API | $0.008 / invoice | $40 | $200 | $400 | $800 |
| AWS Textract, Analyze Expense | $0.040 / invoice | $200 | $1,000 | $2,000 | $4,000 |
| Google Document AI, Invoice Parser | $0.040 / invoice | $200 | $1,000 | $2,000 | $4,000 |
| Azure Doc Intelligence, Invoice (Prebuilt) | $0.040 / invoice | $200 | $1,000 | $2,000 | $4,000 |
| Nanonets, Complex AI (4 blocks / invoice) | $1.20 / invoice | $6,000 | $30,000 | $60,000 | $120,000 |
| Self-hosted open OCR, compute only | $0.0029 / invoice | $15 | $73 | $146 | $291 |
| Self-hosted + 0.25-FTE ML engineer | compute + ~$2,000 / mo | $2,015 | $2,073 | $2,146 | $2,291 |
| Rossum Starter (flat subscription) | $1,500 / mo flat | $1,500 | $1,500 | $1,500* | $1,500* |

How to read it. Assumptions: one invoice = four pages; Nanonets = four $0.30 complex-AI runs per invoice; AWS, Google, and Azure invoice parsers at $0.01 a page; Mistral OCR 4 at $0.004 a page ($0.002 batch); self-hosted L40S compute at $0.000727 a page (Spheron). The self-hosted-plus-engineer row adds a quarter-time ML engineer at roughly $2,000 a month — a deliberately conservative floor, since a real build typically needs more. The Rossum row is a flat $1,500-a-month subscription that doesn’t scale with volume; the asterisked cells flag that Starter-tier page ceilings would, in practice, push the two highest bands onto a custom-quoted Business or Enterprise tier (Rossum has been Coupa-owned since May 2026).
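
The normalisation above is easy to reproduce and to rerun with your own page count per invoice. This is a minimal sketch using only the rates stated in the assumptions; the dictionary keys and function names are illustrative, and the three cloud parsers are collapsed into one entry because they share the same $0.01-a-page rate.

```python
PAGES_PER_INVOICE = 4  # the article's assumption; change to match your documents

# Per-invoice unit rates derived from the stated per-unit prices.
PER_INVOICE_USD = {
    "Mistral OCR 4 API (standard)": 0.004 * PAGES_PER_INVOICE,
    "Mistral OCR 4 Batch API": 0.002 * PAGES_PER_INVOICE,
    "Cloud invoice parser (AWS / Google / Azure)": 0.01 * PAGES_PER_INVOICE,
    "Nanonets (4 runs x $0.30)": 0.30 * 4,
    "Self-hosted compute only": 0.000727 * PAGES_PER_INVOICE,
}

ENGINEER_USD_PER_MONTH = 2_000     # quarter-time ML engineer, per the article
ROSSUM_FLAT_USD_PER_MONTH = 1_500  # flat subscription, ignores tier ceilings


def monthly_cost(path: str, invoices: int) -> float:
    """Variable monthly cost for a metered path."""
    return PER_INVOICE_USD[path] * invoices


def monthly_cost_all(invoices: int) -> dict[str, float]:
    """Monthly cost for every path, including the two fixed-cost rows."""
    costs = {p: monthly_cost(p, invoices) for p in PER_INVOICE_USD}
    costs["Self-hosted + 0.25-FTE engineer"] = (
        costs["Self-hosted compute only"] + ENGINEER_USD_PER_MONTH
    )
    costs["Rossum Starter (flat)"] = float(ROSSUM_FLAT_USD_PER_MONTH)
    return costs


for volume in (5_000, 25_000, 50_000, 100_000):
    print(f"\n{volume:,} invoices / month")
    for path, usd in sorted(monthly_cost_all(volume).items(), key=lambda kv: kv[1]):
        print(f"  {path:<45} ${usd:>10,.0f}")
```

Differences of a dollar against the table come from rounding the self-hosted rate at different steps.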

Two patterns jump out. Nanonets’ per-run pricing makes it wildly expensive for high-volume invoice work — the very use case buyers most often evaluate it for — because a four-block invoice costs $1.20, not four cents. And the self-hosted compute floor is almost free, but the moment you add even a fractional engineer, the fixed cost dominates until volume is large. That fixed cost is the whole build-vs-buy story.

05 · Break-Even · Where building actually starts to win.

The classic rule of thumb, synthesised from the Mavik Labs build-vs-buy framework and Spheron’s self-hosting analysis, is that document-AI self-hosting breaks even at roughly 50,000 to 100,000 pages a month — and only with a dedicated ML engineer on staff. Below that, the engineering overhead (maintenance alone estimated at 5× the initial development cost) erases the per-page savings. The chart below shows why, comparing monthly cost across the sensible paths at the top volume tier.

Monthly cost at 100,000 invoices/month · select paths

Source: recomputed from vendor pricing · 100K invoices/mo (400K pages). Nanonets excluded as an outlier ($120,000).

| Path | Basis | Monthly cost |
|---|---|---|
| AWS / Google / Azure invoice parser | $0.04 / invoice · managed API | $4,000 |
| Self-hosted + 0.25-FTE engineer | compute + ~$2,000/mo fixed | $2,291 |
| Mistral OCR 4 API (standard) | $0.016 / invoice · managed API | $1,600 |
| Rossum Starter (flat) | tier ceiling applies at this volume | $1,500 |
| Mistral OCR 4 Batch API | $0.008 / invoice · async | $800 |
| Self-hosted compute only | L40S · excludes all overhead | $291 |

The trend the table reveals. The conventional break-even assumes you’re comparing a self-hosted build against a typical full-service managed vendor. But aggressive 2026 API pricing moved the goalposts. Run the per-invoice math and the picture sharpens: against the $0.01-a-page cloud invoice parsers (AWS, Google, Azure), a self-hosted pipeline with even a quarter-time engineer doesn’t pull ahead until just past 50,000 invoices a month (200,000 pages), and against Mistral OCR 4’s standard API the crossover is far higher still. Our own table shows self-host-plus-engineer at $2,146 versus $2,000 for those parsers at 50K invoices (essentially a tie), then winning at 100K ($2,291 versus $4,000). The build case is real, but it lives further up the volume curve than most SMBs ever reach.
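
The crossover can be computed directly instead of read off the table. This is a minimal sketch that solves fixed cost plus self-hosted per-page cost against a per-page API rate; it uses the article's $2,000-a-month quarter-time engineer and $0.000727-a-page compute floor, and the function name is illustrative.

```python
import math

SELF_HOST_USD_PER_PAGE = 0.000727   # Spheron L40S compute floor
FIXED_USD_PER_MONTH = 2_000         # quarter-time ML engineer
PAGES_PER_INVOICE = 4


def break_even_pages(api_usd_per_page: float,
                     fixed_usd: float = FIXED_USD_PER_MONTH,
                     self_host_usd_per_page: float = SELF_HOST_USD_PER_PAGE) -> float:
    """Monthly pages at which self-hosting matches a per-page API.

    Solves: fixed + self_host * pages = api * pages.
    Returns infinity if the API is not more expensive per page.
    """
    saving_per_page = api_usd_per_page - self_host_usd_per_page
    if saving_per_page <= 0:
        return math.inf
    return fixed_usd / saving_per_page


for name, rate in [("Cloud invoice parser", 0.01),
                   ("Mistral OCR 4 standard", 0.004),
                   ("Mistral OCR 4 batch", 0.002)]:
    pages = break_even_pages(rate)
    print(f"{name:<24} {pages:>12,.0f} pages  ({pages / PAGES_PER_INVOICE:>9,.0f} invoices)")
```

Under these assumptions the crossover lands near 216,000 pages a month against a $0.01-a-page parser, about 611,000 pages against Mistral's standard rate, and roughly 1.57 million pages against the batch rate. A larger engineering share raises every one of those numbers in proportion.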

Where this is heading. Open-weight OCR quality and managed-API prices are converging from opposite directions, and that convergence will keep pushing the break-even up, not down. For the next year or two, the rational default for a back office processing tens of thousands of pages a month is to buy the cheapest competent managed layer, instrument the volume, and revisit the build case only when sustained volume — or a hard sovereignty requirement — clears the bar. The same build-vs-buy calculus applies across your AI data stack, which we unpack in our CDP build, buy, or skip decision matrix and the broader AI build-vs-buy framework for agency stacks.

06 · Decision Matrix · Eight criteria that move the needle toward build or buy.

Volume is the spine of the decision, but it isn’t the whole skeleton. Run your situation down these eight criteria; if most of your answers land in the right-hand column, buy a managed layer. If they cluster on the left — and your volume clears the break-even — building becomes defensible.

Build-vs-buy decision matrix for SMB document AI. Eight criteria, each with the condition that pushes the decision toward building a self-hosted pipeline versus buying a managed layer or API. Synthesised from the Mavik Labs build-vs-buy framework and Spheron self-hosting analysis.
| Criterion | Pushes toward Build | Pushes toward Buy |
|---|---|---|
| Monthly volume | Above ~100,000 pages / month, sustained | Below ~50,000 pages / month, or spiky |
| Document variability | Standardised, repetitive layouts you control | Arbitrary, varied contracts and one-offs |
| In-house ML engineering | Dedicated ML / MLOps capacity on staff | No ML hire; ops team owns the workflow |
| Data sovereignty | Strict GDPR / EU AI Act / sector rules | Vendor EU data residency is acceptable |
| Extraction depth | Custom fields, tables, equations, edge formats | General text plus standard invoice / receipt forms |
| Time-to-first-value | Months of build time is acceptable | You need extraction live in days or weeks |
| Budget model | CapEx plus predictable fixed OpEx | Variable, usage-metered OpEx preferred |
| Integration depth | Deeply embedded in ERP / custom accounting | Standalone, or light API into existing tools |

The honest reading for the median SMB: most answers land in the buy column. Standardised invoices at modest volume, no ML hire, a need to be live in weeks, and a preference for metered OpEx all point the same way. Building earns its keep when you have sustained six-figure monthly page volume, repetitive document types you control, an ML team already on payroll, and a sovereignty or sector-compliance mandate that a vendor can’t satisfy.
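
The matrix can be run as a simple tally. The sketch below is one possible encoding, not a formal scoring model from the cited frameworks: the criterion keys, the simple-majority rule, the 100,000-page default gate, and the choice to let sovereignty stand in for the volume gate are all assumptions made for illustration.

```python
CRITERIA = (
    "monthly_volume", "document_variability", "in_house_ml", "data_sovereignty",
    "extraction_depth", "time_to_value", "budget_model", "integration_depth",
)


def recommend(leans_build: dict[str, bool], pages_per_month: int,
              break_even_pages: int = 100_000) -> str:
    """Tally the eight criteria and apply the volume gate.

    leans_build maps each criterion to True if your answer sits in the
    'Pushes toward Build' column, False for 'Pushes toward Buy'.
    """
    missing = set(CRITERIA) - set(leans_build)
    if missing:
        raise ValueError(f"unanswered criteria: {sorted(missing)}")

    build_votes = sum(leans_build[c] for c in CRITERIA)
    majority_build = build_votes > len(CRITERIA) / 2
    clears_volume = pages_per_month >= break_even_pages

    # Assumption: a sovereignty mandate can substitute for the volume gate,
    # but building still needs most criteria to lean that way.
    if majority_build and (clears_volume or leans_build["data_sovereignty"]):
        return "build"
    return "buy"


median_smb = {c: False for c in CRITERIA}
print(recommend(median_smb, pages_per_month=20_000))
```

Treat the output as a prompt for discussion, not a verdict: the criteria are not equally weighted in practice, and a single hard constraint can outweigh the rest.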

07 · Hidden Costs · The costs the per-page price never shows.

Every self-hosted cost comparison you’ll read online quotes the compute floor and stops. That figure is real, but it’s the start of the bill, not the total. Four categories sit underneath it, and they’re where most build projects quietly go over budget.

Hidden cost
Ongoing maintenance
5×

Maintenance of a self-hosted document-AI pipeline is estimated at roughly five times the initial development cost, per the Mavik Labs framework — model updates, retraining, breakage, and drift never stop.

Recurring, not one-off

Hidden cost
ML engineering time
1 FTE+

The compute floor excludes the salary that makes it run. A pipeline that processes real documents reliably needs ML and MLOps attention; a fractional hire is the optimistic case, a full one the realistic one at scale.

Excluded from compute quotes

Hidden cost
Uptime & infrastructure
24/7

GPU provisioning lead time, high-availability redundancy, queueing, retries, monitoring, and storage all carry cost and on-call burden that a managed API absorbs on your behalf for its per-page fee.

On-call & redundancy

Hidden cost
Compliance overhead
Aug 2

EU AI Act high-risk obligations take effect August 2, 2026. Whether your use case qualifies depends on the application — but data residency, audit trails, and traceability add real, ongoing cost to either path.

Use-case dependent

The compliance line deserves its own paragraph, because 2026 made the sovereignty argument concrete rather than theoretical. EU AI Act high-risk provisions (including Article 73 incident reporting) take effect August 2, 2026, and whether a given document-AI system is in-scope depends on its use — employment, credit, and healthcare applications are the obvious triggers, not document AI as a blanket category. Layer on the late-2025 invalidation of the EU-U.S. Data Privacy Framework and a growing list of data-localisation laws, and the location of your extraction pipeline becomes a legal question, not just a latency one.

Crucially, “EU data residency” is not the same as “EU legal jurisdiction.” A U.S.-headquartered vendor can store your data in Frankfurt and still be subject to U.S. jurisdiction and export controls. That distinction stopped being abstract in mid-June 2026, when U.S. export controls abruptly restricted some foreign enterprises’ access to frontier U.S. models — a real-world reminder that a kill switch you don’t control is a risk you’ve accepted. Self-hosted or EU-sovereign deployment is the only path that removes it entirely.

“At some point, you need to be able to turn it on or turn it off, and you don’t want to leave it to another country.” (Arthur Mensch, CEO, Mistral AI)

The sovereignty read
For SMBs without a hard regulatory mandate, vendor EU data residency is usually enough, and the speed of buying wins. For regulated sectors — or any business that simply can’t accept a foreign kill switch on a core back-office function — the calculus tilts toward owning the pipeline, even below the pure-cost break-even. Sovereignty is the one criterion that can justify building at a volume the spreadsheet alone wouldn’t.

08 · Vendor Diligence · How to read any vendor’s accuracy claim.

Before you commit to any path, pressure-test the numbers. “99% accuracy” is the most abused figure in IDP marketing: it almost always refers to clean, standardised printed documents. Handwriting, scanned legacy paper, multi-column layouts, and mathematical content consistently score far lower across every vendor. The recommendation below maps the four common SMB profiles to a starting path — run your own eval on your own documents before you sign anything.

Low volume, varied docs
The typical small back office

Under ~50,000 pages a month, mixed invoices and receipts, no ML hire. Start on a cheap managed API (Mistral OCR 4 or a cloud invoice parser) and add a human validation step. Cheapest total cost, fastest to live.

Buy an OCR API

Mid volume, standardised
Repetitive, high-throughput forms

Tens of thousands of standardised pages. Benchmark the cheap managed parsers head-to-head on your own documents; a flat-subscription IDP only wins once volume clears its break-even versus per-page pricing.

Buy, benchmark first

High volume + ML team
Six-figure monthly pages

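Sustained 100,000-plus pages a month with ML and MLOps capacity already on staff. With the engineer already on payroll, the self-hosted compute floor undercuts per-page pricing; if a quarter-time engineer counts as new cost, the cost table puts the crossover against $0.01-a-page parsers nearer 200,000 pages a month. Pilot open weights against your current spend.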

Build, self-host

Sovereignty-bound
Regulated or jurisdiction-sensitive

A hard GDPR, EU AI Act, or sector mandate. Self-hosted open weights or an EU-sovereign deployment removes cross-border transfer risk a vendor can’t — worth building below the pure-cost break-even.

Build for control

Whichever profile fits, the discipline is the same: demand the accuracy figure broken out by document type, run a paid pilot on a representative sample of your own worst documents, and price the full stack — not just the per-page line. Wiring extraction into your accounting and operations systems cleanly is where most of the value (and most of the hidden work) actually lives. That integration layer is what our CRM and back-office automation work and our AI transformation services are built around — and it’s the same evaluation discipline we apply when deciding whether to self-host open-weight models.
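
Breaking accuracy out by document type takes very little code once a pilot sample is labelled. This is a minimal sketch using field-level exact match; the sample structure, key names, and toy records are hypothetical, and a real evaluation would likely need normalisation (dates, currency formats) before comparing values.

```python
from collections import defaultdict


def accuracy_by_doc_type(samples: list[dict]) -> dict[str, float]:
    """Field-level exact-match accuracy, broken out by document type.

    Each sample is a dict with hypothetical keys:
      'doc_type'  - e.g. 'printed_invoice', 'handwritten_receipt'
      'expected'  - {field_name: ground-truth value}
      'extracted' - {field_name: value returned by the vendor or model}
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for s in samples:
        for field, truth in s["expected"].items():
            total[s["doc_type"]] += 1
            # A missing field counts as an error, same as a wrong value.
            if s["extracted"].get(field) == truth:
                correct[s["doc_type"]] += 1
    return {t: correct[t] / total[t] for t in total}


# Toy pilot data for illustration only.
pilot = [
    {"doc_type": "printed_invoice",
     "expected": {"total": "1280.00", "invoice_no": "INV-1042"},
     "extracted": {"total": "1280.00", "invoice_no": "INV-1042"}},
    {"doc_type": "handwritten_receipt",
     "expected": {"total": "23.50", "date": "2026-06-01"},
     "extracted": {"total": "28.50"}},
]

for doc_type, acc in accuracy_by_doc_type(pilot).items():
    print(f"{doc_type:<22} {acc:.0%}")
```

A single blended accuracy number would hide exactly the gap this breakdown exposes, which is why the per-type view is the one to ask vendors for.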

09 · Conclusion · A volume number, not a philosophy.

The shape of document AI, June 2026

For most SMB back offices, the answer is still buy — for now.

Document AI in 2026 is genuinely cheap, and that’s exactly why build versus buy is no longer about ideology. The self-hosting break-even sits at roughly 50,000 to 100,000 pages a month with a dedicated ML engineer — and aggressive managed-API pricing, led by Mistral OCR 4 at $4 per 1,000 pages, has only pushed that threshold higher. The median SMB processes nowhere near that volume, which means the spreadsheet, not the manifesto, points to buy.

The practical playbook is unglamorous and correct: buy the cheapest competent managed layer now, normalise your real cost to dollars per invoice, instrument the volume, and revisit the build case only when sustained throughput clears the break-even or a hard sovereignty mandate overrides the math. Sovereignty is the one lever that can justify owning the pipeline early — the rest is arithmetic.

And read every accuracy claim like an auditor. The vendors worth trusting are the ones, like Mistral, willing to call their own benchmark scores directional rather than definitive. Run your own eval on your own worst documents, price the whole stack including the hidden 5× maintenance tail, and let the number — not the marketing — make the call.

FAQ · Document AI build vs buy

The questions we get every week.

For most SMBs, buy. Build versus buy in document AI is fundamentally a volume question, not a values one. The self-hosting break-even sits at roughly 50,000 to 100,000 pages a month, and only with a dedicated ML engineer on staff, per the Mavik Labs framework and Spheron's self-hosting analysis. Below that volume, the engineering overhead — ongoing maintenance is estimated at around five times the initial development cost — erases the per-page savings a self-hosted pipeline promises. The pragmatic path is to buy a cheap managed OCR API or IDP platform now for predictable cost and fast time-to-value, instrument your real volume, and revisit self-hosting only once sustained throughput clears the break-even or a hard data-sovereignty mandate overrides the pure-cost math.