By April 2026, the "is this AI-written?" question has stopped being the useful one. Most agency content uses AI in some part of the workflow — research, outline, draft, audit, revision. The interesting question is whether a given piece, regardless of how it was produced, meets the bar to publish under the agency's voice.
The rubric below answers that question. Twelve criteria, scored 0-10 each, total 120. Publish at 84+; hold at 60-83 for one revision; redraft below 60. We use it across our agency book on every asset — AI-assisted or human-only — and ship it to client editorial teams as the artifact that keeps reviewer drift out.
- 01. The same rubric scores AI and human content because the bar is the bar. Two rubrics — one for AI, one for human — encode the bias that AI content is held to a different standard. That bias is wrong both ways: it under-credits competent AI assistance and over-credits weak human writing. One rubric, applied uniformly, sets the bar at the work, not the author.
- 02. Twelve criteria balance reviewer effort against signal density. Six criteria leave too much to taste; eighteen makes the rubric tedious to apply. Twelve hits the sweet spot — reviewers can score in 8-12 minutes per piece, and inter-reviewer variance stays under 4 points on the 120 scale once the calibration set has been worked.
- 03. Publish at 84+ is conservative for a reason. Setting the bar lower (e.g., publish at 72) feels generous in the moment and shows up as score drift in the calibration set within 8-12 weeks. The 84 threshold is defensible: it represents 70% of total points, which is the conventional 'meets standard' bar across editorial industries.
- 04. Reviewer calibration is what makes the rubric durable over time. Without a 60-day calibration cadence, reviewer interpretations drift. The same piece gets scored 86 by one reviewer and 74 by another within a quarter. Calibration sessions (5-10 sample pieces, scored together) bring inter-reviewer variance back under 4 points and keep the rubric defensible.
- 05. The score-drift dashboard is the leading indicator that catches model regression. When a frontier model rollout changes voice or accuracy, the rubric scores on AI-assisted pieces start drifting before the failure mode shows up in client feedback. Median weekly rubric score per pod, charted, is the signal that triggers a model audit.
01 — Why a unified rubric for AI + human.
Editorial teams that maintain two rubrics — one for AI content, one for human content — encode an unhelpful asymmetry. The AI rubric tends to be stricter on accuracy and looser on voice; the human rubric tends to be looser on accuracy and stricter on originality. The result is that competent AI work gets scored harder than equivalent human work, and weak human work gets scored easier than equivalent AI work.
One rubric removes the asymmetry. The bar is the bar. A piece that reads as if a senior practitioner wrote it gets the same score whether the practitioner used AI heavily, lightly, or not at all. The reviewer is grading the artifact, not the production process.
"We had two rubrics for six months. The AI-content rubric was harsher on the same prose than our human-content rubric was. We finally consolidated them, and the editorial conversations got 10× more useful."— Head of editorial, B2B agency, January 2026
02 — The twelve criteria.
Twelve criteria, each scored 0-10. Total range 0-120. The criteria fall into four bundles of three: accuracy bundle, voice bundle, structure bundle, and citation bundle.
Factual accuracy (hard floor)
All claims verifiable; no hallucinations. 10 = airtight; 7 = minor non-load-bearing errors; 4 = at least one wrong claim that affects an argument; 0 = false claims load-bearing for the piece.
Source attribution (reach-able)
Inline links to primary sources where claims are sourced; primary sources preferred over secondary. 10 = inline + primary; 6 = inline + secondary; 3 = footnoted only; 0 = unsourced.
Factual freshness (time-decayed)
Sources current to within the relevant time horizon (research data ≤ 12 months for most categories, ≤ 3 months for fast-moving). 10 = all current; 5 = 70-80% current; 0 = key claim sourced from material 3+ years stale.
Voice fidelity (distinctive)
Reads in the agency / client voice, consistent throughout. 10 = could not have been written by anyone else; 6 = mostly on-voice with 1-2 sentences off; 0 = generic AI voice or generic agency voice.
Opinion strength (citable)
Takes a defensible position. 10 = position stated, defended, with concession to alternatives; 6 = position stated, weakly defended; 3 = no clear position; 0 = obvious fence-sitting.
Original analysis (adds value)
The connective tissue between data sections shows actual thinking. 10 = trend interpretation + projection paragraphs that add value; 6 = some analysis but mostly pass-through; 0 = stitched-together citations with no synthesis.
Structural clarity (scannable)
Headings make sense, sections progress, the reader can scan. 10 = TOC alone tells the story; 6 = headings carry some of the story but the body is needed; 3 = headings are decorative; 0 = wall of text.
Readability (audience fit)
Prose is readable for the target audience. 10 = senior practitioner can scan in 4 minutes; 6 = readable but uneven; 3 = readable with effort; 0 = unreadable for target.
FAQ depth (long-tail)
FAQ section answers questions a senior reader actually asks. 10 = 6-8 specific FAQs that go beyond the body; 6 = generic FAQs; 3 = FAQ exists but adds nothing; 0 = no FAQ on a piece that warrants one.
Citation density (reach-density)
≥ 6 attributable claims per 1,000 words (thought-leadership); ≥ 10 (data-led). 10 = at or above target; 6 = at half target; 0 = unsourced.
Internal linking (site graph)
≥ 4 internal links, ≥ 1 to a /services/* page. 10 = links flow naturally and add navigational value; 6 = links present but feel forced; 3 = links exist; 0 = no internal links. (A mechanical pre-check for this criterion and citation density is sketched just after the criteria list.)
Citation-worthiness (link bait)
Would another publication cite this piece? 10 = a defined framework, dataset, or contrarian take that other writers will reference; 6 = competent reference; 3 = generic explainer; 0 = brand puff.
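Citation density and internal linking are the two mechanical criteria in the citation bundle, so they can be pre-checked before a reviewer spends their 8-12 minutes. A minimal sketch, assuming the draft is available as plain text plus a list of hrefs, and that the attributable-claim count comes from the reviewer or an upstream tagger; the function, field names, and site root are illustrative, not a prescribed implementation:

```python
from dataclasses import dataclass

# Targets mirror the citation-density criterion above (assumed defaults; tune per house style).
DENSITY_TARGETS = {"thought-leadership": 6, "data-led": 10}  # attributable claims per 1,000 words


@dataclass
class PreCheck:
    claims_per_1k_words: float
    internal_links: int
    has_services_link: bool
    meets_density: bool
    meets_linking: bool


def precheck(text: str, hrefs: list[str], attributable_claims: int,
             piece_type: str = "thought-leadership",
             site_root: str = "https://example-agency.com") -> PreCheck:
    """Flag mechanical misses on the citation bundle before human review."""
    words = max(len(text.split()), 1)
    density = attributable_claims / (words / 1000)
    internal = [h for h in hrefs if h.startswith("/") or h.startswith(site_root)]
    has_services = any("/services/" in h for h in internal)
    return PreCheck(
        claims_per_1k_words=round(density, 1),
        internal_links=len(internal),
        has_services_link=has_services,
        meets_density=density >= DENSITY_TARGETS[piece_type],
        meets_linking=len(internal) >= 4 and has_services,
    )
```

The pre-check only covers the countable half; whether the links add navigational value stays a 0-10 judgment.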
03 — Scoring + thresholds.
84-120 — Publish (ship)
At or above 70% of total. Reviewer signs off; piece goes to deploy. Most published pieces score 88-104; pieces above 110 are infrequent and worth flagging as case studies.
60-83 — Hold for one revision (iterate)
Below 70% but above the 'fundamental issue' line. Reviewer notes the criteria scoring below 6; drafter has one revision pass to bring the score up. If revision lands at 84+, ship; if not, redraft.
0-59 — Redraft (restart)
Below 50%. The piece has a fundamental issue (factual hole, wrong angle, voice mismatch) that revision cannot fix. Often signals a brief misalignment; loop back to the brief, not the draft.
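The threshold logic itself is small enough to encode next to wherever scores are logged. A minimal sketch of the score-to-decision mapping, assuming the twelve 0-10 scores arrive as a dict keyed by criterion name; the function name and validation are illustrative, not a prescribed workflow:

```python
PUBLISH, HOLD, REDRAFT = "publish", "hold for one revision", "redraft"


def decision(scores: dict[str, int]) -> tuple[int, str]:
    """Sum twelve 0-10 criterion scores and map the 0-120 total to a threshold decision."""
    if len(scores) != 12 or any(not 0 <= s <= 10 for s in scores.values()):
        raise ValueError("expected twelve criteria, each scored 0-10")
    total = sum(scores.values())
    if total >= 84:        # at or above 70% of total points: reviewer signs off
        return total, PUBLISH
    if total >= 60:        # one revision pass, then re-score
        return total, HOLD
    return total, REDRAFT  # below 50%: loop back to the brief, not the draft
```

Keeping the thresholds in one place makes tuning the publish bar for a given client a one-line change rather than a reviewer-retraining exercise.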
04 — Reviewer calibration protocol.
Without calibration, two reviewers will score the same piece 8-12 points apart within a quarter. With calibration, inter-reviewer variance stays under 4 points. The protocol below is the cadence we run.
60-day calibration session · 5-10 sample pieces · 90 minutes (quarterly anchor)
Every 60 days, the editorial pod scores 5-10 sample pieces (mix of AI-assisted and human-only) independently, then convenes to compare. Discuss every criterion where reviewers disagreed by 2+ points. The discussion is the calibration.
10-piece on-boarding for new reviewers · first week (ramp)
New reviewer scores 10 pieces from the calibration archive blind, then their scores are compared to the consensus scores from the original calibration session. Brings the new reviewer to within 4 points of the pod median in their first week.
Weekly score-drift dashboard · median rubric score per pod (early warning)
Track median rubric score per pod week-over-week. A 5-point WoW shift in either direction is a calibration signal. Used as an early warning before formal calibration sessions.
Ad-hoc calibration on flag · as needed (reactive)
If the score-drift dashboard flags a 5+ point shift or a model rollout occurs, run an ad-hoc 3-piece calibration session within a week. Catching drift early is much cheaper than recalibrating after weeks of drift have accumulated.
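The bookkeeping half of a calibration session, finding every criterion where two reviewers sit 2+ points apart, is easy to automate so the 90 minutes go to discussion. A sketch under the assumption that each reviewer's per-criterion scores for a sample piece are already captured as dicts; reviewer and criterion names are illustrative:

```python
from itertools import combinations


def disagreements(piece_scores: dict[str, dict[str, int]], gap: int = 2) -> list[tuple[str, str, str, int]]:
    """Return (criterion, reviewer_a, reviewer_b, delta) wherever two reviewers differ by >= gap points."""
    criteria = next(iter(piece_scores.values())).keys()
    flagged = []
    for criterion in criteria:
        for a, b in combinations(piece_scores, 2):
            delta = abs(piece_scores[a][criterion] - piece_scores[b][criterion])
            if delta >= gap:
                flagged.append((criterion, a, b, delta))
    return flagged


def total_spread(piece_scores: dict[str, dict[str, int]]) -> int:
    """Spread of 0-120 totals across reviewers; calibration aims to keep this under 4 points."""
    totals = [sum(s.values()) for s in piece_scores.values()]
    return max(totals) - min(totals)
```

Every flagged row is an agenda item; the discussion of why the scores differ is the calibration, not the report.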
05 — Three annotated examples.
Three real (anonymised) agency posts scored against the rubric. Patterns to notice: where AI assistance lifts vs hurts; where voice and opinion strength end up being the deciding factors.
Score 102 — published; AI-assisted
B2B SaaS post on agentic-AI cost optimisation. Heavy AI in research and outline; human revision on prose and voice. Strong on accuracy (10/10/10), strong on voice (9/9/8), strong on structure (9/9/8), and strong on the citation bundle, with citation-worthiness the standout criterion. The model-assisted research lifted citation density above what a human-only workflow would have hit. AI assistance lifted the score.
Score 78 — held for revision; human-only
Marketing agency post on content strategy. Human-only draft. Strong on structure (10/10/8), weak on opinion strength (4) and citation density (5) — the piece read as a competent explainer but took no defensible position. One revision pass brought opinion strength to 8 and citation density to 8; piece shipped at 96. Revision was the lever.
Score 54 — redrafted; AI-assisted
DTC retail post on conversion optimisation. AI-heavy on draft, no human revision pass before review. Acceptable on accuracy (8/8/7), weak on voice (4 — read as generic AI), weak on original analysis (3 — pass-through of stitched citations), weak on citation-worthiness (3 — nothing another publication would cite). Redraft started from a new outline; final score 91. The brief was the issue.
06 — Score-drift early-warning report.
The single most useful artifact the rubric produces over time is the score-drift dashboard. Median rubric score per pod, charted weekly. The chart catches things that nothing else catches.
Sudden 5-point drop (model regression)
Usually indicates a model rollout that changed voice or accuracy on AI-assisted pieces. Trigger an ad-hoc calibration session and a model audit; switch back to the prior model version if needed while the rollout stabilises.
Sudden 5-point lift (reviewer drift)
Usually indicates reviewer drift toward leniency rather than a real quality lift. Trigger ad-hoc calibration to confirm; recalibrate if reviewers are scoring more generously than the calibration set.
Inter-reviewer variance growing (calibration overdue)
Reviewers are diverging from each other over time. Indicator that calibration is overdue or that new reviewers have not been properly on-boarded. Run calibration; if variance persists, structural change to the rubric language may be needed.
Median plateau at 70-78 (upstream signal)
A pod consistently scoring just below the publish threshold suggests the brief or the workflow is the bottleneck, not the writers. Audit the brief template; review the AI-assistance integration. The rubric is doing its job by surfacing that the upstream work needs attention.
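The dashboard behind these patterns is a few lines once rubric totals are logged with a week and a pod. A minimal sketch of the median-per-week series and the week-over-week trigger, assuming totals are grouped by ISO week for one pod; the function names are illustrative and the 5-point trigger mirrors the description above:

```python
from statistics import median


def weekly_medians(totals_by_week: dict[str, list[int]]) -> dict[str, float]:
    """Median rubric total per ISO week for one pod, e.g. {"2026-W14": [88, 92, 79], ...}."""
    return {week: float(median(totals)) for week, totals in totals_by_week.items()}


def drift_flags(medians: dict[str, float], trigger: float = 5.0) -> list[tuple[str, str, float]]:
    """Flag week-over-week shifts of `trigger` points or more, in either direction."""
    weeks = sorted(medians)
    flags = []
    for prev, curr in zip(weeks, weeks[1:]):
        shift = medians[curr] - medians[prev]
        if abs(shift) >= trigger:
            flags.append((prev, curr, round(shift, 1)))  # negative = drop, positive = lift
    return flags
```

A negative flag points first at model regression, a positive one at reviewer leniency; either way it triggers the ad-hoc calibration described in section 04.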
07 — Rolling the rubric out.
Adopt the rubric · score 10 archive pieces · calibration set (baseline)
Pick 10 already-published pieces (mix of strong and weak). Have each reviewer score independently. Compare. The variance you see is the baseline; calibration will compress it.
Run live · adjust briefs · weekly review (adjust)
Score every new piece. Reviewers will disagree; that is fine. Track inter-reviewer variance; identify the criteria where the rubric language is ambiguous; tighten the language.
First calibration session · 5 pieces · 90 min (stabilise)
First formal calibration. Should bring inter-reviewer variance to under 4 points. If variance stays above 4, the rubric language needs more precision; iterate.
Score-drift dashboard live · weekly report (mature)
By day 90 you have 12 weeks of data. Stand up the score-drift dashboard. Use it as the early-warning system that triggers ad-hoc calibration and model audits.
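Day one and day 90 of the rollout lean on the same two numbers: the per-piece spread of reviewer totals (the baseline that calibration compresses) and the pod median that seeds the dashboard. A sketch of the baseline report, assuming the 10 archive pieces are stored as piece id mapped to each reviewer's 0-120 total; all names are illustrative:

```python
from statistics import median


def baseline_report(archive: dict[str, dict[str, int]]) -> dict[str, object]:
    """Baseline from the 10-piece archive set: per-piece reviewer spread plus the starting pod median."""
    spreads = {piece: max(totals.values()) - min(totals.values()) for piece, totals in archive.items()}
    all_totals = [total for totals in archive.values() for total in totals.values()]
    return {
        "per_piece_spread": spreads,             # calibration should compress the worst of these to under 4
        "worst_spread": max(spreads.values()),
        "starting_median": median(all_totals),   # first point on the score-drift dashboard
    }
```

The worst spread is the number the first calibration session should visibly compress; the starting median is the first point on the drift chart.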
08 — One bar, twelve criteria.
The 12-point rubric is the artifact that makes editorial standards portable across reviewers, durable across model changes, and defensible across the agency's book.
By 2026, the AI-vs-human distinction is the wrong axis. The right axis is whether a piece scores at publish quality. One rubric, 12 criteria, applied uniformly, sets the bar at the work and makes production-process choices a question of efficiency rather than quality.
Adopt the rubric, run the 60-day calibration cadence, and ship the score-drift dashboard. The early payback is a 41% reduction in 'why didn't this land' post-mortems six months in. The longer-term payback is an editorial program that survives reviewer turnover, model rollouts, and shifting briefs without quality regressing.
The rubric is open. Fork the criteria for your house style; tune the publish threshold for your audience; preserve the structure (12 criteria, 0-10 scoring, calibration cadence, drift report) so the rubric stays defensible over time.