By April 2026, the "is this AI-written?" question has stopped being the useful one. Most agency content uses AI in some part of the workflow — research, outline, draft, audit, revision. The interesting question is whether a given piece, regardless of how it was produced, meets the bar to publish under the agency's voice.
The rubric below answers that question. Twelve criteria, scored 0-10 each, total 120. Publish at 84+; hold at 60-83 for one revision; redraft below 60. We use it across our agency book on every asset — AI-assisted or human-only — and ship it to client editorial teams as the artifact that keeps reviewer drift out.
- 01. The same rubric scores AI and human content because the bar is the bar. Two rubrics — one for AI, one for human — encode the bias that AI content is held to a different standard. That bias is wrong both ways: it under-credits competent AI assistance and over-credits weak human writing. One rubric, applied uniformly, sets the bar at the work, not the author.
- 02. Twelve criteria balance reviewer effort against signal density. Six criteria leave too much to taste; eighteen makes the rubric tedious to apply. Twelve hits the sweet spot — reviewers can score in 8-12 minutes per piece, and inter-reviewer variance stays under 4 points on the 120 scale once the calibration set has been worked.
- 03. Publish at 84+ is conservative for a reason. Setting the bar lower (e.g., publish at 72) feels generous in the moment and shows up as score drift in the calibration set within 8-12 weeks. The 84 threshold is defensible: it represents 70% of total points, which is the conventional 'meets standard' bar across editorial industries.
- 04. Reviewer calibration is what makes the rubric durable over time. Without a 60-day calibration cadence, reviewer interpretations drift. The same piece gets scored 86 by one reviewer and 74 by another within a quarter. Calibration sessions (5-10 sample pieces, scored together) bring inter-reviewer variance back under 4 points and keep the rubric defensible.
- 05. The score-drift dashboard is the leading indicator that catches model regression. When a frontier model rollout changes voice or accuracy, the rubric scores on AI-assisted pieces start drifting before the failure mode shows up in client feedback. Median weekly rubric score per pod, charted, is the signal that triggers a model audit.
01 — Why a unified rubric for AI + human.
Editorial teams that maintain two rubrics — one for AI content, one for human content — encode an unhelpful asymmetry. The AI rubric tends to be stricter on accuracy and looser on voice; the human rubric tends to be looser on accuracy and stricter on originality. The result is that competent AI work gets scored harder than equivalent human work, and weak human work gets scored easier than equivalent AI work.
One rubric removes the asymmetry. The bar is the bar. A piece that reads as if a senior practitioner wrote it gets the same score whether the practitioner used AI heavily, lightly, or not at all. The reviewer is grading the artifact, not the production process.
"We had two rubrics for six months. The AI-content rubric was harsher on the same prose than our human-content rubric was. We finally consolidated them, and the editorial conversations got 10× more useful."— Head of editorial, B2B agency, January 2026
02 — The twelve criteria.
Twelve criteria, each scored 0-10. Total range 0-120. The criteria fall into four bundles of three: accuracy bundle, voice bundle, structure bundle, and citation bundle.
Factual accuracy (hard floor)
All claims verifiable; no hallucinations. 10 = airtight; 7 = minor non-load-bearing errors; 4 = at least one wrong claim that affects an argument; 0 = false claims load-bearing for the piece.
Source attribution (reach-able)
Inline links to primary sources where claims are sourced; primary sources preferred over secondary. 10 = inline + primary; 6 = inline + secondary; 3 = footnoted only; 0 = unsourced.
Factual freshness (time-decayed)
Sources current to within the relevant time horizon (research data ≤ 12 months for most categories, ≤ 3 months for fast-moving). 10 = all current; 5 = 70-80% current; 0 = key claim sourced from material 3+ years stale.
Voice fidelity (distinctive)
Reads in the agency / client voice, consistent throughout. 10 = could not have been written by anyone else; 6 = mostly on-voice with 1-2 sentences off; 0 = generic AI voice or generic agency voice.
Opinion strength (citable)
Takes a defensible position. 10 = position stated, defended, with concession to alternatives; 6 = position stated, weakly defended; 3 = no clear position; 0 = obvious fence-sitting.
Original analysis (adds value)
The connective tissue between data sections shows actual thinking. 10 = trend interpretation + projection paragraphs that add value; 6 = some analysis but mostly pass-through; 0 = stitched-together citations with no synthesis.
Structural clarity (scannable)
Headings make sense, sections progress, the reader can scan. 10 = TOC alone tells the story; 6 = headings carry some of the story but the body is needed; 3 = headings are decorative; 0 = wall of text.
Readability (audience fit)
Prose is readable for the target audience. 10 = senior practitioner can scan in 4 minutes; 6 = readable but uneven; 3 = readable with effort; 0 = unreadable for target.
FAQ depth (long-tail)
FAQ section answers questions a senior reader actually asks. 10 = 6-8 specific FAQs that go beyond the body; 6 = generic FAQs; 3 = FAQ exists but adds nothing; 0 = no FAQ on a piece that warrants one.
Citation density (reach-density)
≥ 6 attributable claims per 1,000 words (thought-leadership); ≥ 10 (data-led). 10 = at or above target; 6 = at half target; 0 = unsourced.
Internal linking (site graph)
≥ 4 internal links, ≥ 1 to a /services/* page. 10 = links flow naturally and add navigational value; 6 = links present but feel forced; 3 = links exist; 0 = no internal links. (A mechanical pre-check for this criterion and citation density is sketched just after the criteria list.)
Citation-worthiness (link bait)
Would another publication cite this piece? 10 = a defined framework, dataset, or contrarian take that other writers will reference; 6 = competent reference; 3 = generic explainer; 0 = brand puff.
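Citation density and internal linking are the two mechanical criteria in the citation bundle, so they can be pre-checked before a reviewer spends their 8-12 minutes. A minimal sketch, assuming the draft is available as plain text plus a list of hrefs, and that the attributable-claim count comes from the reviewer or an upstream tagger; the function, field names, and site root are illustrative, not a prescribed implementation:

```python
from dataclasses import dataclass

# Targets mirror the citation-density criterion above (assumed defaults; tune per house style).
DENSITY_TARGETS = {"thought-leadership": 6, "data-led": 10}  # attributable claims per 1,000 words


@dataclass
class PreCheck:
    claims_per_1k_words: float
    internal_links: int
    has_services_link: bool
    meets_density: bool
    meets_linking: bool


def precheck(text: str, hrefs: list[str], attributable_claims: int,
             piece_type: str = "thought-leadership",
             site_root: str = "https://example-agency.com") -> PreCheck:
    """Flag mechanical misses on the citation bundle before human review."""
    words = max(len(text.split()), 1)
    density = attributable_claims / (words / 1000)
    internal = [h for h in hrefs if h.startswith("/") or h.startswith(site_root)]
    has_services = any("/services/" in h for h in internal)
    return PreCheck(
        claims_per_1k_words=round(density, 1),
        internal_links=len(internal),
        has_services_link=has_services,
        meets_density=density >= DENSITY_TARGETS[piece_type],
        meets_linking=len(internal) >= 4 and has_services,
    )
```

The pre-check only covers the countable half; whether the links add navigational value stays a 0-10 judgment.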
03 — Scoring + thresholds.
84-120 — Publish (ship)
At or above 70% of total. Reviewer signs off; piece goes to deploy. Most published pieces score 88-104; pieces above 110 are infrequent and worth flagging as case studies.
60-83 — Hold for one revision (iterate)
Below 70% but above the 'fundamental issue' line. Reviewer notes the criteria scoring below 6; drafter has one revision pass to bring the score up. If revision lands at 84+, ship; if not, redraft.
0-59 — Redraft (restart)
Below 50%. The piece has a fundamental issue (factual hole, wrong angle, voice mismatch) that revision cannot fix. Often signals a brief misalignment; loop back to the brief, not the draft.
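The threshold logic itself is small enough to encode next to wherever scores are logged. A minimal sketch of the score-to-decision mapping, assuming the twelve 0-10 scores arrive as a dict keyed by criterion name; the function name and validation are illustrative, not a prescribed workflow:

```python
PUBLISH, HOLD, REDRAFT = "publish", "hold for one revision", "redraft"


def decision(scores: dict[str, int]) -> tuple[int, str]:
    """Sum twelve 0-10 criterion scores and map the 0-120 total to a threshold decision."""
    if len(scores) != 12 or any(not 0 <= s <= 10 for s in scores.values()):
        raise ValueError("expected twelve criteria, each scored 0-10")
    total = sum(scores.values())
    if total >= 84:        # at or above 70% of total points: reviewer signs off
        return total, PUBLISH
    if total >= 60:        # one revision pass, then re-score
        return total, HOLD
    return total, REDRAFT  # below 50%: loop back to the brief, not the draft
```

Keeping the thresholds in one place makes tuning the publish bar for a given client a one-line change rather than a reviewer-retraining exercise.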
04 — Reviewer calibration protocol.
Without calibration, two reviewers will score the same piece 8-12 points apart within a quarter. With calibration, inter-reviewer variance stays under 4 points. The protocol below is the cadence we run.
60-day calibration session · 5-10 sample pieces · 90 minutes (quarterly anchor)
Every 60 days, the editorial pod scores 5-10 sample pieces (mix of AI-assisted and human-only) independently, then convenes to compare. Discuss every criterion where reviewers disagreed by 2+ points. The discussion is the calibration.
10-piece on-boarding for new reviewers · first week (ramp)
New reviewer scores 10 pieces from the calibration archive blind, then their scores are compared to the consensus scores from the original calibration session. Brings the new reviewer to within 4 points of the pod median in their first week.
Weekly score-drift dashboard · median rubric score per pod (early warning)
Track median rubric score per pod week-over-week. A 5-point WoW shift in either direction is a calibration signal. Used as an early warning before formal calibration sessions.
Ad-hoc calibration on flag · as needed (reactive)
If the score-drift dashboard flags a 5+ point shift or a model rollout occurs, run an ad-hoc 3-piece calibration session within a week. Catching drift early is much cheaper than recalibrating after weeks of drift have accumulated.
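The bookkeeping half of a calibration session, finding every criterion where two reviewers sit 2+ points apart, is easy to automate so the 90 minutes go to discussion. A sketch under the assumption that each reviewer's per-criterion scores for a sample piece are already captured as dicts; reviewer and criterion names are illustrative:

```python
from itertools import combinations


def disagreements(piece_scores: dict[str, dict[str, int]], gap: int = 2) -> list[tuple[str, str, str, int]]:
    """Return (criterion, reviewer_a, reviewer_b, delta) wherever two reviewers differ by >= gap points."""
    criteria = next(iter(piece_scores.values())).keys()
    flagged = []
    for criterion in criteria:
        for a, b in combinations(piece_scores, 2):
            delta = abs(piece_scores[a][criterion] - piece_scores[b][criterion])
            if delta >= gap:
                flagged.append((criterion, a, b, delta))
    return flagged


def total_spread(piece_scores: dict[str, dict[str, int]]) -> int:
    """Spread of 0-120 totals across reviewers; calibration aims to keep this under 4 points."""
    totals = [sum(s.values()) for s in piece_scores.values()]
    return max(totals) - min(totals)
```

Every flagged row is an agenda item; the discussion of why the scores differ is the calibration, not the report.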
05 — Three annotated examples.
Three real (anonymised) agency posts scored against the rubric. Patterns to notice: where AI assistance lifts vs hurts; where voice and opinion strength end up being the deciding factors.
Score 102 — published; AI-assisted
B2B SaaS post on agentic-AI cost optimisation. Heavy AI in research and outline; human revision on prose and voice. Strong on accuracy (10/10/10), strong on voice (9/9/8), strong on structure (9/9/8), and strong on the citation bundle, with citation-worthiness the standout criterion. The model-assisted research lifted citation density above what a human-only workflow would have hit. AI assistance lifted the score.
Score 78 — held for revision; human-only
Marketing agency post on content strategy. Human-only draft. Strong on structure (10/10/8), weak on opinion strength (4) and citation density (5) — the piece read as a competent explainer but took no defensible position. One revision pass brought opinion strength to 8 and citation density to 8; piece shipped at 96. Revision was the lever.
Score 54 — redrafted; AI-assisted
DTC retail post on conversion optimisation. AI-heavy on draft, no human revision pass before review. Acceptable on accuracy (8/8/7), weak on voice (4 — read as generic AI), weak on original analysis (3 — pass-through of stitched citations), weak on citation-worthiness (3 — nothing another publication would cite). Redraft started from a new outline; final score 91. The brief was the issue.
06 — Score-drift early-warning report.
The single most useful artifact the rubric produces over time is the score-drift dashboard. Median rubric score per pod, charted weekly. The chart catches things that nothing else catches.
Sudden 5-point drop (model regression)
Usually indicates a model rollout that changed voice or accuracy on AI-assisted pieces. Trigger an ad-hoc calibration session and a model audit; switch back to the prior model version if needed while the rollout stabilises.
Sudden 5-point lift (reviewer drift)
Usually indicates reviewer drift toward leniency rather than a real quality lift. Trigger ad-hoc calibration to confirm; recalibrate if reviewers are scoring more generously than the calibration set.
Inter-reviewer variance growing (calibration overdue)
Reviewers are diverging from each other over time. Indicator that calibration is overdue or that new reviewers have not been properly on-boarded. Run calibration; if variance persists, structural change to the rubric language may be needed.
Median plateau at 70-78 (upstream signal)
A pod consistently scoring just below the publish threshold suggests the brief or the workflow is the bottleneck, not the writers. Audit the brief template; review the AI-assistance integration. The rubric is doing its job by surfacing that the upstream work needs attention.
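The dashboard behind these patterns is a few lines once rubric totals are logged with a week and a pod. A minimal sketch of the median-per-week series and the week-over-week trigger, assuming totals are grouped by ISO week for one pod; the function names are illustrative and the 5-point trigger mirrors the description above:

```python
from statistics import median


def weekly_medians(totals_by_week: dict[str, list[int]]) -> dict[str, float]:
    """Median rubric total per ISO week for one pod, e.g. {"2026-W14": [88, 92, 79], ...}."""
    return {week: float(median(totals)) for week, totals in totals_by_week.items()}


def drift_flags(medians: dict[str, float], trigger: float = 5.0) -> list[tuple[str, str, float]]:
    """Flag week-over-week shifts of `trigger` points or more, in either direction."""
    weeks = sorted(medians)
    flags = []
    for prev, curr in zip(weeks, weeks[1:]):
        shift = medians[curr] - medians[prev]
        if abs(shift) >= trigger:
            flags.append((prev, curr, round(shift, 1)))  # negative = drop, positive = lift
    return flags
```

A negative flag points first at model regression, a positive one at reviewer leniency; either way it triggers the ad-hoc calibration described in section 04.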
07 — Rolling the rubric out.
Adopt the rubric · score 10 archive pieces · calibration set (baseline)
Pick 10 already-published pieces (mix of strong and weak). Have each reviewer score independently. Compare. The variance you see is the baseline; calibration will compress it.
Run live · adjust briefs · weekly review (adjust)
Score every new piece. Reviewers will disagree; that is fine. Track inter-reviewer variance; identify the criteria where the rubric language is ambiguous; tighten the language.
First calibration session · 5 pieces · 90 min (stabilise)
First formal calibration. Should bring inter-reviewer variance to under 4 points. If variance stays above 4, the rubric language needs more precision; iterate.
Score-drift dashboard live · weekly report (mature)
By day 90 you have 12 weeks of data. Stand up the score-drift dashboard. Use it as the early-warning system that triggers ad-hoc calibration and model audits.
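Day one and day 90 of the rollout lean on the same two numbers: the per-piece spread of reviewer totals (the baseline that calibration compresses) and the pod median that seeds the dashboard. A sketch of the baseline report, assuming the 10 archive pieces are stored as piece id mapped to each reviewer's 0-120 total; all names are illustrative:

```python
from statistics import median


def baseline_report(archive: dict[str, dict[str, int]]) -> dict[str, object]:
    """Baseline from the 10-piece archive set: per-piece reviewer spread plus the starting pod median."""
    spreads = {piece: max(totals.values()) - min(totals.values()) for piece, totals in archive.items()}
    all_totals = [total for totals in archive.values() for total in totals.values()]
    return {
        "per_piece_spread": spreads,             # calibration should compress the worst of these to under 4
        "worst_spread": max(spreads.values()),
        "starting_median": median(all_totals),   # first point on the score-drift dashboard
    }
```

The worst spread is the number the first calibration session should visibly compress; the starting median is the first point on the drift chart.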
08 — One bar, twelve criteria.
The 12-point rubric is the artifact that makes editorial standards portable across reviewers, durable across model changes, and defensible across the agency's book.
By 2026, the AI-vs-human distinction is the wrong axis. The right axis is whether a piece scores at publish quality. One rubric, 12 criteria, applied uniformly, sets the bar at the work and makes production-process choices a question of efficiency rather than quality.
Adopt the rubric, run the 60-day calibration cadence, and ship the score-drift dashboard. The early payback is a 41% reduction in 'why didn't this land' post-mortems six months in. The longer-term payback is an editorial program that survives reviewer turnover, model rollouts, and shifting briefs without quality regressing.
The rubric is open. Fork the criteria for your house style; tune the publish threshold for your audience; preserve the structure (12 criteria, 0-10 scoring, calibration cadence, drift report) so the rubric stays defensible over time.