AI Development · Methodology · 14 min read · Published May 16, 2026

Reference · 60 terms · 6 categories · 1 authoritative source each

AI Agent Glossary: 60 Essential Terms, 2026

Sixty terms, six categories, one authoritative source per entry — MCP spec 2025-11-25, arXiv papers (ReAct, ToT, Reflexion, Constitutional AI, SWE-Bench, OSWorld), and canonical vendor docs. Plus a disambiguation table for the eight pairs that derail every architecture meeting.

Digital Applied Team
Senior strategists · Published May 16, 2026
Sources: 60+ primary
Terms defined: 60 (across 6 categories)
Primary sources: 60+ (one canonical link per term)
Disambiguation pairs: 8 (the most-conflated terms)
Reference half-life: Q3 2026 (refresh on major spec changes)

The vocabulary of AI agents shifted unusually fast in 2025-2026: MCP replaced its HTTP+SSE transport with Streamable HTTP in March 2025, Anthropic renamed its SDK in September 2025, and “subagent” became industry-standard terminology in a matter of months. This glossary defines 60 essential terms across six categories, each anchored to an official spec, peer-reviewed paper, or canonical vendor doc, so engineering teams stop talking past each other on PR reviews and architecture docs.

What makes a glossary obsolete is not bad definitions — it is definitions anchored to blog posts that were themselves anchored to press releases. Every entry below links to a primary source: the MCP specification itself, arXiv papers cited by the labs that coined the terms, or the official SDK docs. Secondary sources and explanatory blog posts do not qualify. This is the discipline that earns the “reference page” label.

Six categories mirror the parallel-agent stack an engineer actually has to reason about: Core primitives, Protocols & SDKs, Architectures, Memory & Context, Training & RL, and Safety & Evals. The disambiguation table in section 08 is the most link-worthy part: it captures the eight term-pairs that consistently derail architecture meetings — MCP vs. A2A, ReAct vs. Tree of Thoughts, RLHF vs. RLAIF, Skill vs. subagent, and four others.

Key takeaways
  1. Wikipedia-grade sourcing: one primary source per term. Every entry links to an official spec, arXiv paper, or canonical vendor doc, not to other agencies' glossary posts or press releases. That is the differentiating discipline.
  2. The 1-sentence + 1-sentence + 1-source format. Each term gets: a 1-sentence definition, a 1-sentence "why it matters" anchored in a 2026 production context, and one authoritative link. No fluff, no hedging, no boilerplate about AI being important.
  3. The disambiguation table is the link-bait magnet. MCP vs. A2A, RLHF vs. RLAIF, Skill vs. subagent, agent vs. assistant: eight pairs that get conflated constantly. No open-web glossary currently pairs these explicitly with canonical sources for both sides.
  4. Six categories mirror the full agent stack. Core, Protocols & SDKs, Architectures, Memory & Context, Training & RL, Safety & Evals. You can hand this to a new hire as a reading curriculum: read the Core category first, then branch.
  5. Hand this to new engineers. Link it in CLAUDE.md. The intended use case: drop this URL into your repo's CLAUDE.md or engineering onboarding docs, and reference it in PR descriptions when a reviewer asks 'what is a Skill vs. a subagent?'.

01 How to use this glossary
One definition. One why-it-matters. One source.

Each entry follows a strict three-part format: a 1-sentence definition that is precise enough to distinguish the term from its nearest neighbor; a 1-sentence “why it matters” anchored in a specific 2026 production context, not generic AI importance language; and a single authoritative source link — official spec, peer-reviewed paper, or canonical vendor doc.

“Authoritative” means: (a) official specification body, (b) published arXiv paper with known authorship, or (c) canonical vendor documentation page (Anthropic, OpenAI, Google, LangChain, OWASP). It does not mean: other agencies’ glossary posts, LinkedIn summaries, or explanatory blog posts about the term.

Fabrication watch-outs embedded in this glossary: MCP Sampling, Roots, and Elicitation are client features, not server features; the recurring error in tutorials is to assign these to the server side. Subagents in Claude Code are one level deep only; they cannot spawn their own subagents. MCP's HTTP-based transport is Streamable HTTP, not HTTP+SSE, which was deprecated on March 26, 2025 (stdio remains the other standard transport).

Category 1 · Core agentic primitives · 10 terms · Start here
Agent, LLM, agentic loop, scaffolding, tool use, function calling, context window, in-context learning, token, vibe coding. The baseline vocabulary every engineer on the team must know.

Category 2 · Protocols & SDKs · 10 terms · Integration layer
MCP server/client, Streamable HTTP, A2A, AGNTCY, Claude Agent SDK, OpenAI Agents SDK, Vercel AI SDK, AGENTS.md. The integration layer your tooling runs on.

Category 3 · Architectures & prompting · 10 terms · System design
Multi-agent orchestration, supervisor, swarm, fan-out/fan-in, subagent, Skill, ReAct, CoT, Tree of Thoughts, Reflexion. The patterns your agent system is built from.

Category 4 · Memory, context & retrieval · 10 terms · State & memory
Vector retrieval, embeddings, knowledge graph, episodic memory, hybrid retrieval, context engineering, prompt caching, needle-in-a-haystack, semantic search, RAG. How your agent remembers.

Category 5 · Training & RL · 10 terms · Model internals
Fine-tuning, instruct tuning, RLHF, RLAIF, Constitutional AI, RL post-training, MoE, DSPy, test-time compute, distillation. How models become agents.

Category 6 · Safety, evals & HITL · 10 terms · Production safety
Eval, SWE-Bench, Terminal-Bench, OSWorld, sandboxed execution, prompt injection, jailbreak, alignment, HITL, durable execution. How you keep it running safely.

02 Category 1
Core agentic primitives: 10 terms.

These ten terms form the baseline vocabulary. If a new engineer cannot define all ten before their first architecture review, the meeting will lose 20 minutes re-explaining foundations.

Agent
A computational entity that perceives an environment through sensors and acts upon it through actuators to achieve goals. Why it matters: The Russell & Norvig definition still anchors every modern AI agent abstraction — without it, “agent” becomes synonymous with “chatbot.” Russell & Norvig, AIMA
LLM (Large Language Model)
A neural network, typically Transformer-based, trained on a vast text corpus for natural-language generation and reasoning. Why it matters: The reasoning engine inside virtually every 2026 production agent. Wikipedia: LLM
Agentic loop
The repeating cycle of gather context → take action → verify results that characterizes agentic systems. Why it matters: Anthropic’s canonical framing of how Claude Code and most coding agents actually run — understanding the loop is prerequisite for debugging latency or cost. Claude Code docs
Scaffolding
The harness — prompts, tool wiring, control flow, memory glue — wrapped around a base model to make it an agent. Why it matters: Research suggests long-running agent performance may come from better scaffolding, not smarter models — this is the operative engineering variable. Anthropic engineering
Tool use
Letting an LLM call structured functions to read external state or take real-world actions. Why it matters: The mechanism that turns a chat model into an agent — without tool use, there is no action, only text generation. Anthropic: Tool use with Claude
Function calling
OpenAI’s name for tool use — structured JSON arguments emitted by the model and executed by the host. Why it matters: The original 2023 productized form of tool use; still the dominant API surface across providers. OpenAI: Function Calling
Context window
The maximum number of tokens an LLM can process in a single inference call, including prompt, history, and tool outputs. Why it matters: Determines how much state an agent can carry without external memory — the operative constraint for long-running sessions. Anthropic: Context windows
In-context learning
Conditioning model output on examples placed in the prompt itself, without any weight updates. Why it matters: The mechanism that lets a single base model serve dozens of agent personas — no fine-tuning required. Brown et al. 2020 (NeurIPS)
Token
The atomic unit of text an LLM processes, roughly 4 characters or 0.75 words in English. Why it matters: The unit your API bill is denominated in; context limits and most cost figures in agent benchmarks are quoted in tokens. Anthropic pricing FAQ
Vibe coding
Andrej Karpathy’s term for the practice of accepting AI-generated code without reading it line-by-line. Why it matters: The dominant cultural pattern that AI coding tools are built around, and the reason coding-agent evals like SWE-Bench and Terminal-Bench matter. Karpathy, Feb 2 2025
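To make the tool use and function calling entries concrete, here is a minimal sketch of the host side, assuming the model has already emitted a tool call as a JSON string. The `get_weather` tool, the `TOOLS` registry, and the simplified call shape (`name` plus `arguments`) are hypothetical; real provider APIs wrap the same idea in their own message formats.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool: a real one would query an external service."""
    return f"Sunny in {city}"

# Registry mapping tool names the model may emit to host-side functions.
TOOLS = {"get_weather": get_weather}

def execute_tool_call(raw_call: str) -> str:
    """Parse a model-emitted tool call and run it on the host."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        # Return the error as text so the model can recover on its next turn.
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**call["arguments"])

# Simulated model output: structured JSON arguments instead of free text.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
result = execute_tool_call(model_output)
print(result)
```

The model never runs anything itself in this sketch: it only names a function and supplies arguments, and the host decides whether and how to execute.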

03 Category 2
Protocols & SDKs: 10 terms.

The integration layer. MCP is the tool-to-model protocol; A2A is the agent-to-agent protocol; the SDKs are the wrappers your team writes against. Getting these terms wrong means writing documentation that is quietly incorrect — e.g., calling MCP “the agent communication protocol” (it is not; that is A2A) or referencing the old @anthropic-ai/claude-code package name (renamed September 29, 2025).

MCP (Model Context Protocol)
An open JSON-RPC 2.0 protocol that standardizes how LLM hosts connect to external data sources and tools. Why it matters: The de-facto interop standard, with six canonical hosts and reportedly 10,000+ public servers as of April 2026. MCP spec 2025-11-25
MCP server
A process that exposes tools, resources, and prompts over the MCP protocol — the service side. Why it matters: The integration unit you will be writing or wiring most often in 2026 — understanding server vs. client feature boundaries prevents auth and security errors. MCP spec — server features
MCP client
A connector inside an LLM host that talks to an MCP server — the calling side. Why it matters: Sampling, Roots, and Elicitation are client features; Tools, Prompts, and Resources are server features — conflating these is the most common MCP documentation error. MCP spec 2025-11-25
Streamable HTTP
The current HTTP-based MCP transport, which replaced HTTP+SSE on March 26, 2025. Why it matters: Any 2024-era or early-2025 “MCP over SSE” guide is stale, and targeting the deprecated transport can cause connection failures that are hard to diagnose. MCP Transports spec
A2A (Agent-to-Agent Protocol)
An open, Google-led protocol for opaque agent-to-agent communication and capability discovery. Why it matters: Complements MCP (agent ↔ tool) with the agent ↔ agent layer — the two protocols solve different problems. Google A2A announcement, Apr 2025
AGNTCY
A Cisco-led open-source framework for agent discovery, identity, messaging, and observability across vendor boundaries. Why it matters: The most production-ready inter-org agent directory standard as of mid-2026 — relevant whenever two organizations need their agents to find and call each other. AGNTCY.org
Claude Agent SDK
Anthropic’s official SDK for building autonomous agents on Claude, renamed from Claude Code SDK on September 29, 2025. Why it matters: npm and PyPI package names changed at the rename date — documentation written before that date references the wrong package. Anthropic engineering
OpenAI Agents SDK
OpenAI’s lightweight Python and TypeScript framework for building agentic applications. Why it matters: OpenAI’s recommended starting point for new agent work as of 2026 — replaces the prior Assistants API pattern for most use cases. OpenAI: Agents SDK
Vercel AI SDK
A TypeScript SDK for streaming text, tool use, structured output, and agents across multiple LLM providers. Why it matters: The default agent SDK for Next.js and React workloads — provider-agnostic with a unified interface. ai-sdk.dev
AGENTS.md
An open file convention that gives coding agents project-specific instructions, read by Claude Code, Codex, Cursor, and Grok Build. Why it matters: Adopted by reportedly 60,000+ open-source projects — the agent-era equivalent of README, and the mechanism for injecting team-specific behavior without modifying system prompts. agents.md
MCP server features vs. client features

The spec is explicit: servers expose Tools, Prompts, and Resources. Clients offer Sampling, Roots, and Elicitation. The most common fabrication in MCP tutorials is assigning Sampling to the server side — it belongs to the client. If you are reviewing an MCP integration document and see “the server handles Sampling,” flag it.
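The boundary above can be encoded as a small lookup for documentation reviews. This is a sketch under stated assumptions: `FEATURE_SIDE`, `owner_of`, and `tools_call_request` are hypothetical helper names, and the feature lists are copied from the paragraph above rather than parsed from the spec. The request builder shows the JSON-RPC 2.0 envelope for a `tools/call` message; the tool name and arguments in the usage lines are made up.

```python
import json

# Which side of an MCP connection offers each feature (from the text above).
FEATURE_SIDE = {
    "tools": "server", "prompts": "server", "resources": "server",
    "sampling": "client", "roots": "client", "elicitation": "client",
}

def owner_of(feature: str) -> str:
    """Return 'server' or 'client' for a named MCP feature."""
    return FEATURE_SIDE[feature.lower()]

def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 request asking a server to run one of its tools."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

print(owner_of("Sampling"))
print(tools_call_request(1, "search_docs", {"query": "refund policy"}))
```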

04 Category 3
Architectures & prompting strategies: 10 terms.

Architecture vocabulary is where ambiguity does the most damage — two engineers debating “supervisor vs. swarm” or “ReAct vs. Tree of Thoughts” are often arguing over naming, not substance, because neither has read the original papers. The entries below cite the papers that coined the terms.

Multi-agent orchestration
Coordinating multiple specialized agents to complete a task one model could not or should not tackle alone. Why it matters: The dominant deployment pattern for 2026 production systems — understanding the orchestration layer is prerequisite for sizing context budgets and latency targets. Microsoft Azure: Agent patterns
Supervisor pattern
A single coordinating agent assigns work to specialized worker agents and gathers their results. Why it matters: The anchor architecture for LangGraph’s langgraph-supervisor library — the most widely deployed multi-agent topology in 2026. LangChain: LangGraph Supervisor
Swarm pattern
Peer agents dynamically hand off to one another based on whichever specialty is currently needed. Why it matters: The decentralized alternative to supervisor — no single point of coordination failure, but harder to observe and debug. LangChain: LangGraph Swarm
Fan-out / fan-in
Dispatching parallel sub-tasks across multiple agents, then merging their outputs at a synchronization barrier. Why it matters: The same pattern as MapReduce — the cheapest concurrency primitive in agent frameworks and the default approach for embarrassingly parallel workloads. Microsoft Azure: Agent patterns
Subagent
A child agent spawned by a parent agent with its own context window and tool budget. Why it matters: Claude Code subagents are one level deep and cannot spawn further subagents, so system docs that imply recursive spawning describe behavior the tool does not support. Claude Code docs: subagents
Skill
A self-contained capability bundle (.claude/skills/<name>/SKILL.md) loaded on demand by a Claude Code agent. Why it matters: Anthropic’s recommended pattern for keeping context lean — only the skill description (≤1,536 chars) sits in context until the skill is invoked. Claude Code Skills docs
ReAct (Reason + Act)
A prompting pattern that interleaves verbal reasoning traces with tool calls and their observations across successive model calls in a single trajectory. Why it matters: The 2022 paper that productized ‘thinking out loud + acting’ and still the default loop shape inside most production agents. Yao et al. 2022 (arXiv)
Chain-of-thought (CoT)
Eliciting intermediate reasoning steps from a model by prompting with worked examples or explicit reasoning instructions. Why it matters: Wei et al. 2022 showed CoT unlocked complex-reasoning performance — it is the backbone of every ‘thinking’ model since. Wei et al. 2022 (arXiv)
Tree of Thoughts (ToT)
A prompting framework that explores multiple branching reasoning paths rather than a single chain. Why it matters: The paper reports 74% success on the Game of 24 task versus 4% for CoT prompting with GPT-4, which makes ToT relevant when a task has multiple viable solution paths. Yao et al. 2023 (arXiv)
Reflexion
An agent loop in which the agent verbally critiques its own prior trajectory before re-attempting the task. Why it matters: Foundational work on agent self-improvement without weight updates — the mechanism behind most ‘self-correcting agent’ patterns in 2026. Shinn et al. 2023 (arXiv)
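The fan-out / fan-in entry in this category maps directly onto ordinary concurrency primitives. Below is a minimal sketch using `asyncio`, assuming each worker agent is an async function; `worker` is a hypothetical stand-in that uppercases its task instead of calling a model.

```python
import asyncio

async def worker(name: str, task: str) -> str:
    """Hypothetical worker agent; a real one would call a model here."""
    await asyncio.sleep(0)  # yield control, standing in for network I/O
    return f"{name}: {task.upper()}"

async def fan_out_fan_in(tasks: list[str]) -> str:
    # Fan-out: create one worker coroutine per sub-task.
    coros = [worker(f"agent-{i}", t) for i, t in enumerate(tasks)]
    # Fan-in: gather() acts as the synchronization barrier and
    # returns results in the same order as the inputs.
    results = await asyncio.gather(*coros)
    return " | ".join(results)

merged = asyncio.run(fan_out_fan_in(["lint", "test", "docs"]))
print(merged)
```

The merge step here is a plain string join; in a real system it is often another model call that reconciles the workers' outputs.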
"ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting."— Yao et al., ReAct paper (arXiv:2210.03629), 2022

05 Category 4
Memory, context & retrieval: 10 terms.

Memory vocabulary is where the boundaries between complementary techniques blur most severely — RAG is not the same as fine-tuning, embeddings are not the same as semantic search, and prompt caching is a cost primitive, not a retrieval strategy. These distinctions drive real architecture decisions.

Vector retrieval
Looking up content by similarity in a high-dimensional embedding space rather than by keyword overlap. Why it matters: The default mechanism for grounding LLMs on external knowledge — the operational layer under most production RAG pipelines. Cohere: Semantic search
Embeddings
Dense numerical vectors that represent the semantic meaning of text, code, or other content. Why it matters: The substrate underneath retrieval, clustering, and semantic search — you cannot do vector retrieval without embeddings. OpenAI: Embeddings
Knowledge graph
A structured store of entities and relationships used to ground agent reasoning in symbolic facts. Why it matters: Complements vector retrieval for tasks requiring structured, relational facts — Neo4j’s ‘context graph for agents’ is the canonical 2026 pattern. Neo4j Labs: Agent Memory
Episodic memory
Memory of specific events — what happened, when, and in what context — as opposed to general factual knowledge. Why it matters: The Generative Agents paper used episodic memory to give simulated humans plausible long-horizon behavior — the same mechanism applies in production agents. Park et al. 2023 (Generative Agents)
Hybrid retrieval
Combining lexical (BM25) and vector retrieval to balance recall and semantic precision. Why it matters: The dominant production retrieval pattern as of 2026 — pure vector search misses exact-match queries; pure BM25 misses paraphrase; hybrid catches both. Elastic: Vector search guide
Context engineering
Systematic design of what loads into the model’s context window — system prompts, retrieved chunks, tool definitions, memory. Why it matters: The 2026 successor discipline to ‘prompt engineering’ — the operative craft at production scale, not the art of clever phrasing. Anthropic engineering
Prompt caching
Reusing previously processed prompt segments to reduce cost and latency on repeated or near-identical prompts. Why it matters: Anthropic’s cache hits reportedly cost 10% of standard input — at 60%+ cache rates common in long sessions, this is the single largest cost lever. Anthropic: Prompt caching
Needle-in-a-haystack
A benchmark that inserts a known fact into long context and tests whether the model can reliably retrieve it. Why it matters: The standard way to measure whether a 1M-token context window is actually usable, not just a marketing claim. Kamradt: NIAH benchmark
Semantic search
Search ranked by the similarity of embedding vectors (typically cosine similarity) rather than keyword frequency. Why it matters: The query interface most agent retrieval layers expose; understanding it explains why the same query can return different results on different embedding models. Cohere: Semantic search
RAG (Retrieval-Augmented Generation)
Generating outputs conditioned on dynamically retrieved external context to ground responses in live data. Why it matters: The pattern that bridges static model weights and live or proprietary data — the dominant architecture for enterprise LLM deployments in 2026. Lewis et al. 2020 (arXiv)
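The hybrid retrieval entry in this category combines a lexical ranking with a vector ranking. A minimal sketch follows, with loud assumptions: the documents and 2-d embeddings are hand-picked toy data (not from a real model), the lexical scorer is a crude shared-word count standing in for BM25, and the two rankings are fused with reciprocal rank fusion, which is one common choice rather than the only one.

```python
import math

DOCS = {
    "d1": "reset your password from the account settings page",
    "d2": "error code E42 means the license key expired",
    "d3": "how to recover access when you forget your login credentials",
}
# Toy 2-d embeddings, hand-picked for illustration only.
EMB = {"d1": (0.9, 0.1), "d2": (0.1, 0.9), "d3": (0.8, 0.3)}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def lexical_rank(query):
    """Rank docs by shared-word count: a crude stand-in for BM25."""
    q = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(q & set(DOCS[d].lower().split())))

def vector_rank(query_vec):
    """Rank docs by cosine similarity to the query embedding."""
    return sorted(DOCS, key=lambda d: -cosine(query_vec, EMB[d]))

def hybrid(query, query_vec, k=60):
    """Fuse both rankings with reciprocal rank fusion (k=60 is a common default)."""
    scores = {d: 0.0 for d in DOCS}
    for ranking in (lexical_rank(query), vector_rank(query_vec)):
        for rank, d in enumerate(ranking, start=1):
            scores[d] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: -scores[d])

# An exact-match style query: the error code appears verbatim in d2.
ranked = hybrid("E42 license", (0.2, 0.9))
print(ranked)
```

Rank fusion avoids having to put lexical and vector scores on a common scale, which is why it is a convenient default for a sketch like this.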

06 Category 5
Training & RL: 10 terms.

Training vocabulary matters to practitioners who need to evaluate model claims honestly — “RLHF-trained” and “Constitutional AI-trained” are not interchangeable; “RL post-training” is distinct from both. The entries below anchor each term to its originating paper.

Fine-tuning
Updating an LLM’s weights on a smaller, task-specific dataset after pre-training. Why it matters: Still the default mechanism for specializing a base model on proprietary data — understanding its cost vs. RAG alternatives is a core architecture decision. Ouyang et al. 2022 (OpenAI)
Instruct tuning
Fine-tuning a base model on instruction-response pairs to make it follow user intent. Why it matters: The pre-RLHF step that turned GPT-3 into InstructGPT — necessary context for understanding why base models and instruction-tuned models behave so differently. Ouyang et al. 2022 (OpenAI)
RLHF (Reinforcement Learning from Human Feedback)
Using human preference labels as a reward signal to fine-tune model behavior toward human intent. Why it matters: The technique that made ChatGPT possible — originating in Christiano et al. 2017, industrialized by Ouyang et al. 2022. Ouyang et al. 2022
RLAIF (Reinforcement Learning from AI Feedback)
Replacing human preference labels with an LLM-generated critic to provide the training reward signal. Why it matters: The training method behind Anthropic’s Constitutional AI work — distinct from RLHF in that the feedback source is an AI, not a human rater. Bai et al. 2022 (arXiv)
Constitutional AI (CAI)
Anthropic’s approach: a model self-critiques and self-revises against a written set of principles (the ‘constitution’). Why it matters: Anthropic uses CAI as the alignment training pipeline for Claude, which helps explain why Claude refuses or hedges in ways other models do not. Bai et al. 2022 (arXiv)
RL post-training
Reinforcement learning applied after supervised pre-training to elicit or amplify specific capabilities — most commonly reasoning. Why it matters: The technique behind DeepSeek-R1, o1, and most ‘thinking’ models — a distinct training phase from both RLHF and fine-tuning. DeepSeek-R1 paper (arXiv)
MoE (Mixture of Experts)
A model architecture where only a sparse subset of expert subnetworks activates per token, reducing active computation. Why it matters: Underlies most 2026 frontier open-source models — DeepSeek V4, Qwen 3, GLM-5, Llama 4 are all MoE; understanding it is prerequisite for sizing inference hardware. Shazeer et al. 2017 (arXiv)
DSPy
A Stanford framework for programming (rather than prompting) LLMs using composable modules and optimizers. Why it matters: Treats prompts as compiled artifacts rather than hand-written strings — emerging as a production alternative to brittle prompt engineering at scale. stanfordnlp/dspy (GitHub)
Test-time compute
Increasing inference-time computation — more reasoning tokens, more candidate samples — to improve output quality without retraining. Why it matters: The frontier scaling axis that reportedly replaced ‘just train a bigger model’ in 2025-2026 — the mechanism behind extended thinking modes. Anthropic: Extended thinking
Distillation
Training a smaller ‘student’ model to imitate the output distribution of a larger ‘teacher’ model. Why it matters: The mechanism behind every ‘Haiku-class’ and ‘mini’ model in the 2026 lineup — critical context for evaluating whether a smaller model is independently capable or derivatively capable. Hinton et al. 2015 (arXiv)
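The RLHF and RLAIF entries in this category differ in who supplies the preference labels, not in how those labels are typically turned into a training signal. As a sketch, assuming a reward model trained on pairwise comparisons in the style of Ouyang et al. 2022, the per-pair loss is the negative log-sigmoid of the reward margin. The function name and the example reward values are illustrative.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); log1p keeps it accurate.
    return math.log1p(math.exp(-margin))

# The label says which output was preferred; under RLHF a human rater
# supplies it, under RLAIF an LLM critic does. The loss is the same.
print(preference_loss(2.0, 0.0))  # reward model agrees with the label
print(preference_loss(0.0, 2.0))  # reward model disagrees, larger loss
```

When both outputs receive the same reward, the loss is ln 2, the value for a model that cannot tell the pair apart.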

Timeline of foundational agent papers — 2022–2024

Sources: arXiv publication dates for each paper
Jan 2022 · CoT (Chain-of-Thought prompting) · Wei et al. 2022 · unlocks complex reasoning
Oct 2022 · ReAct (Reason + Act) · Yao et al. 2022 · tool use + verbal traces
Dec 2022 · Constitutional AI (RLAIF) · Bai et al. 2022 · AI self-critique
Mar 2023 · Reflexion (verbal self-improvement) · Shinn et al. 2023 · agent self-correction
Apr 2023 · Generative Agents (episodic memory) · Park et al. 2023
May 2023 · Tree of Thoughts (ToT) · Yao et al. 2023 · branching reasoning
Oct 2023 · SWE-Bench (coding-agent eval) · Jimenez et al. 2023
Apr 2024 · OSWorld (computer-use eval) · Xie et al. 2024

07 Category 6
Safety, evals & human-in-the-loop: 10 terms.

Production agent teams consistently under-invest in evals until something fails publicly. These ten terms form the safety and observability vocabulary — knowing them is prerequisite for a meaningful conversation about production readiness.

Eval (LLM evaluation)
A test set plus scoring rubric used to grade an LLM or agent’s performance, typically tracked over time. Why it matters: The production agent’s equivalent of a unit-test suite — without evals, every model update is a regression risk. Braintrust docs
SWE-Bench
A benchmark of 2,294 real GitHub issues used to evaluate whether an agent can resolve them end-to-end. Why it matters: The default coding-agent benchmark; SWE-Bench Verified is the cleaned subset most labs cite, so comparing a full-set score against a Verified score is misleading. Jimenez et al. 2023 (arXiv)
Terminal-Bench (TBench)
A Stanford and Laude Institute benchmark of terminal-mastery tasks for AI agents. Why it matters: Quickly became the canonical eval for CLI coding agents — Codex, Claude Code, and Grok Build all publish TBench scores. tbench.ai
OSWorld
A multimodal benchmark of 369 real computer tasks (web, desktop, and OS-level) for computer-use agents. Why it matters: The standard eval for ‘can your agent operate a real OS’, and one of the few benchmarks that cover the full application surface an agentic computer-use system must navigate. OSWorld project page
Sandboxed execution
Running agent-generated code inside an isolated environment — microVM or container — so it cannot damage the host. Why it matters: Mandatory for autonomous code-execution agents — Vercel Sandbox, Cloudflare Workers, and Daytona are the 2026 vendor options. Vercel Sandbox docs
Prompt injection
A class of attack where untrusted input in the agent’s context manipulates its behavior against the operator’s intent. Why it matters: Ranked LLM01 in OWASP’s 2025 Top 10 for LLM apps — the #1 security risk for any agent that ingests external content. OWASP LLM01:2025
Jailbreak
A prompt or sequence designed to bypass an LLM’s safety policies and elicit prohibited outputs. Why it matters: Distinct from prompt injection — jailbreak targets the model’s alignment training; prompt injection targets a tool surface or downstream system. Anthropic: Many-shot jailbreaking
Alignment
The technical and normative problem of making AI systems reliably pursue human-intended goals. Why it matters: The framing under which all safety-relevant agent work is organized — understanding it clarifies why Constitutional AI and HITL exist as engineering patterns. Wikipedia: AI alignment
Human-in-the-loop (HITL)
A design pattern where trained humans retain authority over high-risk agent actions before they execute. Why it matters: The default risk-mitigation control for production agents in regulated industries — banking, healthcare, legal — where autonomous action carries legal liability. IBM: What is HITL?
Durable execution
A runtime pattern where workflows survive crashes, retries, and long pauses by checkpointing every step. Why it matters: Required infrastructure for long-running agents — Temporal, Inngest, Restate, and Vercel WDK are the 2026 vendors; without it, a multi-hour agent task that fails at step 47 restarts from zero. Temporal docs: Agentic loop
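The durable execution entry comes down to checkpointing after every step so that a restart resumes instead of starting over. Here is a toy sketch, assuming steps are plain Python callables with JSON-serializable outputs; `run_durable`, `step_one`, and `flaky_step` are hypothetical names, and the JSON file stands in for the persistent event history that engines such as Temporal manage for you.

```python
import json
import os
import tempfile

def run_durable(steps, checkpoint_path):
    """Run steps in order, persisting progress after each one."""
    state = {"next_step": 0, "outputs": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume where the last run stopped
    for i in range(state["next_step"], len(steps)):
        state["outputs"].append(steps[i]())
        state["next_step"] = i + 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # checkpoint after every completed step
    return state["outputs"]

executed = []   # records which steps actually ran, across both attempts
attempts = {"n": 0}

def step_one():
    executed.append("step_one")
    return "step-1 done"

def flaky_step():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("simulated crash")
    executed.append("flaky_step")
    return "step-2 done"

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_durable([step_one, flaky_step], path)
except RuntimeError:
    pass  # the first run "crashes" at step 2
outputs = run_durable([step_one, flaky_step], path)  # resumes at step 2
print(outputs, executed)
```

The second call skips `step_one` because its result was already checkpointed. A production engine also has to handle atomic writes, retries, and non-deterministic steps, all of which this sketch ignores.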
Benchmark · SWE-Bench issues · 2,294
Real GitHub issues from 12 popular Python repos, used to evaluate whether a coding agent can resolve them autonomously. SWE-Bench Verified is the cleaned subset most labs cite. Source: arXiv:2310.06770

Benchmark · OSWorld tasks · 369
Real computer tasks across web, desktop, and OS-level interactions. The standard eval for computer-use agents operating a real OS. Source: os-world.github.io

Security · Prompt injection (OWASP LLM01) · #1 risk
OWASP ranks prompt injection as the number-one vulnerability for LLM applications in their 2025 Top 10; any agent ingesting external content is exposed. Source: OWASP LLM01:2025
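The Eval entry in this category describes a test set plus a scoring rubric, tracked over time. A minimal sketch of that shape is below, assuming an agent is any callable from prompt to answer. `toy_agent`, `CASES`, and the exact-match scorer are illustrative; real evals usually need richer rubrics, such as running unit tests for SWE-Bench-style tasks.

```python
def exact_match(expected: str, actual: str) -> float:
    """Scoring rubric: 1.0 for a case-insensitive exact match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_eval(agent, cases, scorer=exact_match) -> float:
    """Run every case through the agent and return the mean score."""
    scores = [scorer(case["expected"], agent(case["input"])) for case in cases]
    return sum(scores) / len(scores)

# The test set: fixed inputs with known-good answers.
CASES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]

def toy_agent(prompt: str) -> str:
    """Hypothetical agent that knows two of the three answers."""
    answers = {"2+2": "4", "capital of France": "paris"}
    return answers.get(prompt, "unknown")

pass_rate = run_eval(toy_agent, CASES)
print(f"pass rate: {pass_rate:.2f}")
```

Re-running the same `CASES` after every model or prompt change is what turns this from a one-off check into a regression suite.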

08 The most-conflated pairs
Disambiguation table: eight pairs that derail every architecture meeting.

These are the eight conversations that consume 20 minutes of architecture meetings because neither side is working from the same definition. The table below gives: the short definition for each term, when they are confused, the actually-different bit, and a canonical source for each. No open-web glossary currently pairs these explicitly.

Columns: Term A · Term B · When they’re confused · The actually-different bit

MCP vs. A2A
Term A: MCP. Model↔tool protocol. Source: Spec
Term B: A2A. Agent↔agent protocol. Source: Google
When they’re confused: Both described as “the agent interoperability protocol.”
The actually-different bit: MCP connects a model to external tools and data sources. A2A connects one agent to another agent: different layer, different use case. Both can coexist in the same system.

ReAct vs. Tree of Thoughts
Term A: ReAct. Interleaved reasoning + tool calls. Source: arXiv
Term B: Tree of Thoughts. Multiple branching reasoning paths. Source: arXiv
When they’re confused: Both described as “advanced reasoning prompting.”
The actually-different bit: ReAct is a single-path loop: think, then act, then observe, repeat. ToT explores multiple forking solution paths simultaneously. ReAct is the default production pattern; ToT is used when there are genuinely multiple valid solution branches worth exploring.

RLHF vs. RLAIF
Term A: RLHF. Human preference labels. Source: Ouyang 2022
Term B: RLAIF. AI-generated preference labels. Source: Bai 2022
When they’re confused: Both described as “the training method that makes models safe.”
The actually-different bit: RLHF uses human raters to label which model output is preferred. RLAIF replaces those human raters with an LLM critic. Constitutional AI is Anthropic’s RLAIF-based pipeline; it is not RLHF.

Skill vs. Subagent
Term A: Skill. Capability bundle (.claude/skills). Source: Claude Code
Term B: Subagent. Child agent, own context window. Source: Claude Code
When they’re confused: Both described as “a modular agent capability.”
The actually-different bit: A Skill is a file on disk, a SKILL.md with instructions loaded into the parent context on demand. A subagent is a spawned process with its own context window and tool budget. Skills are lightweight; subagents have independent compute cost. Subagents cannot spawn their own subagents.

Agent vs. Assistant
Term A: Agent. Perceives environment, takes actions. Source: Russell & Norvig
Term B: Assistant. Responds to user queries in a session.
When they’re confused: “AI assistant” and “AI agent” used interchangeably in marketing copy.
The actually-different bit: An assistant responds to queries; it does not initiate actions or persist goals between sessions without explicit prompting. An agent maintains goals, loops autonomously, and takes actions, often without a human in the loop per step.

Fine-tuning vs. RAG
Term A: Fine-tuning. Weight updates on task-specific data. Source: OpenAI
Term B: RAG. Retrieval-grounded generation at inference. Source: Lewis 2020
When they’re confused: Both proposed as solutions to “make the model know our internal data.”
The actually-different bit: Fine-tuning bakes knowledge into weights at training time, which is expensive, and the knowledge becomes stale the moment it is trained. RAG retrieves knowledge at inference time from a live data source. They are complementary: fine-tune for style and behavior; RAG for live or frequently-updated knowledge.

Embeddings vs. Vector search
Term A: Embeddings. Dense semantic vector representation. Source: OpenAI
Term B: Vector search. Similarity lookup in embedding space. Source: Cohere
When they’re confused: Used interchangeably in architecture discussions.
The actually-different bit: Embeddings are the numeric representation, the output of an embedding model. Vector search is the retrieval operation performed over a database of embeddings. You need both; neither is the other.

RAG vs. Grounding
Term A: RAG. Retrieve then generate. Source: Lewis 2020
Term B: Grounding. Any method to anchor model output in verifiable facts.
When they’re confused: “Grounding” used as a synonym for RAG in vendor marketing.
The actually-different bit: RAG is a specific architecture: retrieve, then inject retrieved context into the prompt, then generate. Grounding is a broader goal: any technique (RAG, knowledge graph injection, structured data, citations, HITL review) that keeps output factually anchored. RAG is one grounding strategy, not the only one.
All source links point to official specifications, arXiv papers, or canonical vendor documentation — not to secondary glossary pages. Each term is defined independently; the disambiguation is the point.
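The ReAct row above describes a single-path loop: think, then act, then observe, repeat. A minimal sketch of that loop is below, with a scripted function standing in for the LLM. The `Thought:` / `Action: tool[arg]` / `Final:` text format, the `lookup_stock` tool, and its inventory data are all hypothetical conventions chosen for this example.

```python
from typing import Callable

def lookup_stock(item: str) -> str:
    """Hypothetical tool backed by a tiny in-memory inventory."""
    return {"widget": "12"}.get(item.lower(), "0")

TOOLS = {"lookup_stock": lookup_stock}

def scripted_model(transcript: list[str]) -> str:
    """Stand-in for an LLM: chooses the next step from the transcript so far."""
    observations = [l for l in transcript if l.startswith("Observation:")]
    if not observations:
        return "Thought: I should check inventory.\nAction: lookup_stock[widget]"
    count = observations[-1].removeprefix("Observation:").strip()
    return f"Thought: I have the count.\nFinal: {count} widgets in stock"

def react(question: str, model: Callable[[list[str]], str], max_steps: int = 5) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model(transcript)  # think: one model call per iteration
        transcript.append(step)
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        # Act: parse "Action: tool[arg]" and run the named tool.
        action = step.split("Action:", 1)[1].strip()
        tool_name = action[: action.index("[")]
        arg = action[action.index("[") + 1 : action.rindex("]")]
        # Observe: feed the tool result back into the transcript.
        transcript.append(f"Observation: {TOOLS[tool_name](arg)}")
    return "gave up"

answer = react("How many widgets are in stock?", scripted_model)
print(answer)
```

Tree of Thoughts would differ at the marked "think" step: instead of following one transcript, it would branch into several candidate continuations and evaluate them before committing.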

09 Practical usage
How to use this glossary in your team’s docs.

A glossary earns its keep when it stops a specific conversation from going in circles. The three highest-value deployment patterns for this one:

1. Link it in CLAUDE.md or AGENTS.md. If your repository has a CLAUDE.md, AGENTS.md, or equivalent agent-context file, add a line like: # Terminology: https://digitalapplied.com/blog/ai-agent-glossary-2026-60-essential-terms. Claude Code and most coding agents read these files at session start, so the link is in context for every session at the cost of a single line. The glossary text itself is not loaded; the agent fetches the page only when a terminology question comes up.

2. Hand it to new engineers before their first architecture review. The six categories are a reading curriculum: Core (start here), then Protocols & SDKs (before any integration work), then Architectures (before any design session). The disambiguation table deserves a separate read. Engineers who have read the disambiguation section arrive at MCP vs. A2A discussions with a clear framework rather than a vague sense that “they are both agent protocols.”

3. Reference it in PR descriptions. When a reviewer asks “what is a Skill vs. a subagent here?” or “why are we using RAG instead of fine-tuning?” — a link to the relevant section of this glossary is faster and more precise than a Slack thread. The disambiguation table entries have stable anchor IDs for deep-linking.
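
Pattern 1 can be automated with a small script. The sketch below assumes the context file is plain text in the repository root; `ensure_terminology_line` is a hypothetical helper name, and the line it appends is the one quoted in pattern 1. It adds the line only if it is not already present, so it is safe to run repeatedly.

```python
from pathlib import Path

# The terminology line quoted in pattern 1 above.
GLOSSARY_LINE = (
    "# Terminology: "
    "https://digitalapplied.com/blog/ai-agent-glossary-2026-60-essential-terms"
)

def ensure_terminology_line(path: Path, line: str = GLOSSARY_LINE) -> bool:
    """Append `line` to the file at `path` unless it is already there.

    Creates the file if it does not exist. Returns True if the file changed.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if line in existing.splitlines():
        return False  # already linked, nothing to do
    # Make sure the new line starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + line + "\n", encoding="utf-8")
    return True
```

A typical call is `ensure_terminology_line(Path("CLAUDE.md"))`, or the same with `Path("AGENTS.md")`, run from the repository root.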

The glossary is intentionally narrow: 60 terms that a 2026 production agent team will hit in a typical sprint, not 200 terms that cover the entire AI research landscape. For a broader reference, see our 200-term agentic AI glossary, and for the specific SDK documentation, the Claude Agent SDK migration playbook covers the September 2025 rename in full detail. If you are building agent systems for marketing or content operations, our AI transformation services team can help you apply these patterns at scale — from MCP server integration to multi-agent orchestration.

For teams evaluating the cost side of agent tooling, our AI coding agent cost calculator models per-task costs across 10 tools, with prompt caching at 0/30/60/90% modeled explicitly.
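
To see why the cache hit rate matters, here is one plausible way to model per-task cost as a function of it. This is an illustrative sketch, not the calculator's actual method: the function name, the default prices (per million tokens), and the `cached_multiplier` (the fraction of the input price charged for cached tokens) are all hypothetical, and cache-write surcharges are ignored.

```python
def task_cost(
    input_tokens: int,
    output_tokens: int,
    cache_hit_rate: float,
    input_price: float = 2.0,        # hypothetical USD per 1M input tokens
    output_price: float = 10.0,      # hypothetical USD per 1M output tokens
    cached_multiplier: float = 0.1,  # hypothetical: cached input billed at 10%
) -> float:
    """Estimated USD cost of one task at a given prompt-cache hit rate."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    # Uncached tokens pay full price; cached tokens pay the reduced rate.
    billable_input = input_tokens * (
        (1.0 - cache_hit_rate) + cache_hit_rate * cached_multiplier
    )
    return (billable_input * input_price + output_tokens * output_price) / 1_000_000

if __name__ == "__main__":
    # Compare the four hit rates mentioned above for one hypothetical task.
    for rate in (0.0, 0.3, 0.6, 0.9):
        print(f"{rate:.0%} cached: ${task_cost(200_000, 10_000, rate):.3f}")
```

Under these assumptions only the input side shrinks as the hit rate rises, so output-heavy tasks benefit less from caching than input-heavy ones.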

  • New hire onboarding (engineering teams). Assign Core and Protocols & SDKs before the first sprint, and the disambiguation table before the first architecture review. The six categories are a readable curriculum, not a reference dump. Action: read sequentially.
  • Ongoing reference (PR and architecture review). Deep-link to the disambiguation table when a term conflict surfaces. Use the canonical source links to settle definitions, not informal consensus. Action: link the disambiguation table.
  • Agent context file (CLAUDE.md / AGENTS.md). Add the glossary URL to your repo's CLAUDE.md or AGENTS.md. Coding agents read these files at session start, so the shared vocabulary is one link away in every session for the cost of a single line. Action: link in CLAUDE.md.
  • Quarterly refresh (glossary maintenance). Review the Protocols & SDKs category each quarter; it is the most volatile. MCP transports, SDK naming, and benchmark definitions change faster than architecture patterns. Action: audit in Q3 2026.

Reference · May 2026 snapshot

This glossary will be wrong by Q3 2026 — and that is the point.

The agentic AI vocabulary is changing weekly. MCP’s Streamable HTTP transport — which deprecated SSE in March 2025 — was not in the v0.1 spec. “Skill” did not exist as a primitive until Claude Code 1.0. “Subagent” was Anthropic-coined in 2025 and is now industry-standard. By Q3 2026 there will be new terms — likely around agent governance, durable workflow vendors, and multimodal tool use — that belong in a refreshed version of this list.

The point of this glossary is not to be eternally correct. It is to be a snapshot of the shared vocabulary as of May 2026, so engineering teams stop talking past each other on PR reviews and architecture docs. Use it as a starting point. Hand it to new hires. Link to it in CLAUDE.md. Refresh it quarterly. The disambiguation table is the part that earns its keep most reliably. MCP vs. A2A, RLHF vs. RLAIF, Skill vs. subagent, and fine-tuning vs. RAG are four of the eight pairs that derail architecture meetings, and they are unlikely to stop derailing meetings just because the underlying technology evolves.

For teams moving beyond vocabulary into production agent systems, the Digital Applied AI transformation team works on the implementation layer: MCP server integration, multi-agent orchestration, prompt caching optimization, and eval-driven iteration. The glossary is the shared language; we help build the system.

Build production agent systems

From shared vocabulary to production systems.

Our team helps engineering and marketing organizations build production agent systems — from MCP server integration to multi-agent orchestration, context engineering, and eval-driven iteration.

What we work on

Agent system engagements

  • MCP server integration across tools and data sources
  • Multi-agent orchestration — supervisor, swarm, fan-out patterns
  • Context engineering and prompt caching optimization
  • Eval setup — SWE-Bench, Terminal-Bench, custom rubrics
  • Human-in-the-loop design for regulated industry deployments
FAQ · AI Agent Glossary 2026

Common questions about agent vocabulary.

Sixty is the number of terms a 2026 production agent engineering team will encounter in a typical sprint: the vocabulary for design reviews, PR descriptions, and onboarding conversations. A list of 100+ terms starts to cover AI research territory that most practitioners will not hit in production work. For a broader reference, our 200-term agentic AI glossary covers the research layer. The constraint here is precision, not comprehensiveness: each term must earn its inclusion by appearing in real engineering conversations, not just in papers.