The vocabulary of AI agents shifted quickly in 2025-2026: MCP replaced its HTTP+SSE transport with Streamable HTTP in March 2025, Anthropic renamed its SDK in September 2025, and “subagent” became industry-standard terminology in a matter of months. This glossary defines 60 essential terms across six categories, each anchored to an official spec, peer-reviewed paper, or canonical vendor doc, so engineering teams stop talking past each other on PR reviews and architecture docs.

What makes a glossary obsolete is not bad definitions — it is definitions anchored to blog posts that were themselves anchored to press releases. Every entry below links to a primary source: the MCP specification itself, arXiv papers cited by the labs that coined the terms, or the official SDK docs. Secondary sources and explanatory blog posts do not qualify. This is the discipline that earns the “reference page” label.

Six categories mirror the parallel-agent stack an engineer actually has to reason about: Core primitives, Protocols & SDKs, Architectures, Memory & Context, Training & RL, and Safety & Evals. The disambiguation table in section 08 is the most link-worthy part: it captures the eight term-pairs that consistently derail architecture meetings — MCP vs. A2A, ReAct vs. Tree of Thoughts, RLHF vs. RLAIF, Skill vs. subagent, and four others.

Key takeaways

01
Wikipedia-grade sourcing: one primary source per term. Every entry links to an official spec, arXiv paper, or canonical vendor doc, not to other agencies' glossary posts or press releases. That is the differentiating discipline.
02
The 1-sentence + 1-sentence + 1-source format. Each term gets: a 1-sentence definition, a 1-sentence "why it matters" anchored in a 2026 production context, and one authoritative link. No fluff, no hedging, no boilerplate about AI being important.
03
The disambiguation table is the link-bait magnet. MCP vs. A2A, RLHF vs. RLAIF, Skill vs. subagent, agent vs. assistant: eight pairs that get conflated constantly. No open-web glossary currently pairs these explicitly with canonical sources for both sides.
04
Six categories mirror the full agent stack. Core, Protocols & SDKs, Architectures, Memory & Context, Training & RL, Safety & Evals. You can hand this to a new hire as a reading curriculum: read the Core category first, then branch.
05
Hand this to new engineers. Link it in CLAUDE.md. The intended use case: drop this URL into your repo's CLAUDE.md or engineering onboarding docs, and reference it in PR descriptions when a reviewer asks 'what is a Skill vs. a subagent?'

01: How to use this glossary

One definition. One why-it-matters. One source.

Each entry follows a strict three-part format: a 1-sentence definition that is precise enough to distinguish the term from its nearest neighbor; a 1-sentence “why it matters” anchored in a specific 2026 production context, not generic AI importance language; and a single authoritative source link — official spec, peer-reviewed paper, or canonical vendor doc.

“Authoritative” means: (a) official specification body, (b) published arXiv paper with known authorship, or (c) canonical vendor documentation page (Anthropic, OpenAI, Google, LangChain, OWASP). It does not mean: other agencies’ glossary posts, LinkedIn summaries, or explanatory blog posts about the term.

Fabrication watch-outs embedded in this glossary: MCP Sampling, Roots, and Elicitation are client features, not server features; the recurring error in tutorials is to assign these to the server side. Subagents in Claude Code are one level deep only; they cannot spawn their own subagents. The standard MCP transports are stdio and Streamable HTTP, not HTTP+SSE; Streamable HTTP replaced HTTP+SSE in the spec revision of March 26, 2025.

Category 1

Core agentic primitives

10 terms

Agent, LLM, agentic loop, scaffolding, tool use, function calling, context window, in-context learning, token, vibe coding. The baseline vocabulary every engineer on the team must know.

Start here

Category 2

Protocols & SDKs

10 terms

MCP, MCP server/client, Streamable HTTP, A2A, AGNTCY, Claude Agent SDK, OpenAI Agents SDK, Vercel AI SDK, AGENTS.md. The integration layer your tooling runs on.

Integration layer

Category 3

Architectures & prompting

10 terms

Multi-agent orchestration, supervisor, swarm, fan-out/fan-in, subagent, Skill, ReAct, CoT, Tree of Thoughts, Reflexion. The patterns your agent system is built from.

System design

Category 4

Memory, context & retrieval

10 terms

Vector retrieval, embeddings, knowledge graph, episodic memory, hybrid retrieval, context engineering, prompt caching, needle-in-a-haystack, semantic search, RAG. How your agent remembers.

State & memory

Category 5

Training & RL

10 terms

Fine-tuning, instruct tuning, RLHF, RLAIF, Constitutional AI, RL post-training, MoE, DSPy, test-time compute, distillation. How models become agents.

Model internals

Category 6

Safety, evals & HITL

10 terms

Eval, SWE-Bench, Terminal-Bench, OSWorld, sandboxed execution, prompt injection, jailbreak, alignment, HITL, durable execution. How you keep it running safely.

Production safety

02: Category 1

Core agentic primitives: 10 terms.

These ten terms form the baseline vocabulary. If a new engineer cannot define all ten before their first architecture review, the meeting will lose 20 minutes re-explaining foundations.

Agent

A computational entity that perceives an environment through sensors and acts upon it through actuators to achieve goals. Why it matters: The Russell & Norvig definition still anchors every modern AI agent abstraction; without it, “agent” becomes synonymous with “chatbot.” Russell & Norvig, AIMA↗

LLM (Large Language Model)

A neural network, typically Transformer-based, trained on a vast text corpus for natural-language generation and reasoning. Why it matters: The reasoning engine inside virtually every 2026 production agent. Wikipedia: LLM↗

Agentic loop

The repeating cycle of gather context → take action → verify results that characterizes agentic systems. Why it matters: Anthropic’s canonical framing of how Claude Code and most coding agents actually run; understanding the loop is prerequisite for debugging latency or cost. Claude Code docs↗
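The gather, act, verify cycle can be sketched with a toy environment. This is a minimal illustration, not how Claude Code is implemented: `LoopState`, `gather_context`, `take_action`, and `verify` are hypothetical names, and the "environment" is just an integer the agent nudges toward a goal.

```python
from dataclasses import dataclass, field


@dataclass
class LoopState:
    goal: int                      # target value the toy agent must reach
    value: int = 0                 # current state of the toy environment
    log: list = field(default_factory=list)


def gather_context(state: LoopState) -> int:
    """Gather: read the environment and compute what is still missing."""
    return state.goal - state.value


def take_action(state: LoopState, gap: int) -> None:
    """Act: move at most 3 units toward the goal per turn."""
    step = max(-3, min(3, gap))
    state.value += step
    state.log.append(step)


def verify(state: LoopState) -> bool:
    """Verify: check whether the goal condition now holds."""
    return state.value == state.goal


def run_loop(state: LoopState, max_turns: int = 10) -> bool:
    """Repeat gather -> act -> verify until success or the turn budget runs out."""
    for _ in range(max_turns):
        gap = gather_context(state)
        take_action(state, gap)
        if verify(state):
            return True
    return False
```

The `max_turns` budget is the part worth noticing: every extra turn in a real agent is another model call, which is why the loop shape matters for latency and cost.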

Scaffolding

The harness — prompts, tool wiring, control flow, memory glue — wrapped around a base model to make it an agent. Why it matters: Research suggests long-running agent performance may come from better scaffolding, not smarter models — this is the operative engineering variable. Anthropic engineering↗

Tool use

Letting an LLM call structured functions to read external state or take real-world actions. Why it matters: The mechanism that turns a chat model into an agent — without tool use, there is no action, only text generation. Anthropic: Tool use with Claude↗

Function calling

OpenAI’s name for tool use: structured JSON arguments emitted by the model and executed by the host. Why it matters: The original 2023 productized form of tool use; still the dominant API surface across providers. OpenAI: Function Calling↗
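The division of labor (the model emits structured arguments, the host executes them) can be sketched as follows. The JSON shape here is a simplified assumption for illustration and is not the exact wire format of any provider; `get_weather` and `execute_tool_call` are hypothetical names.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool: a real one would call an external API."""
    return f"Sunny in {city}"


# Registry owned by the host: tool name -> Python callable.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_call: str) -> str:
    """Parse a model-emitted tool call and run it on the host side."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        # The model only proposes calls; the host decides what actually runs.
        return f"error: unknown tool {name!r}"
    arguments = call.get("arguments", {})
    return TOOLS[name](**arguments)


# Simplified stand-in for what a model might emit.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
print(execute_tool_call(model_output))
```

Keeping the registry on the host side means an unknown or disallowed tool name is rejected before anything executes.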

Context window

The maximum number of tokens an LLM can process in a single inference call, including prompt, history, and tool outputs. Why it matters: Determines how much state an agent can carry without external memory — the operative constraint for long-running sessions. Anthropic: Context windows↗

In-context learning

Conditioning model output on examples placed in the prompt itself, without any weight updates. Why it matters: The mechanism that lets a single base model serve dozens of agent personas — no fine-tuning required. Brown et al. 2020 (NeurIPS)↗

Token

The atomic unit of text an LLM processes, roughly 4 characters or 0.75 words in English. Why it matters: The unit your API bill is denominated in; cost figures in agent benchmarks are computed from token counts, and latency scales with the number of tokens generated. Anthropic pricing FAQ↗
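The 4-characters-per-token rule of thumb is enough for a back-of-envelope budget. The sketch below assumes that ratio holds (it varies by tokenizer and language) and takes the price as a parameter, since any specific rate would be an assumption; use a real tokenizer when accuracy matters.

```python
import math

CHARS_PER_TOKEN = 4  # rough English average quoted in the entry above


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost_usd(text: str, usd_per_million_tokens: float) -> float:
    """Approximate input cost; the price is supplied by the caller."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000
```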

Vibe coding

Andrej Karpathy’s term for the practice of accepting AI-generated code without reading it line-by-line. Why it matters: The dominant cultural pattern that AI coding tools are built around, and the reason coding-agent evals like SWE-Bench and Terminal-Bench matter. Karpathy, Feb 2 2025↗

03: Category 2

Protocols & SDKs: 10 terms.

The integration layer. MCP is the tool-to-model protocol; A2A is the agent-to-agent protocol; the SDKs are the wrappers your team writes against. Getting these terms wrong means writing documentation that is quietly incorrect, e.g., calling MCP “the agent communication protocol” (it is not; that is A2A) or referencing the old @anthropic-ai/claude-code package name (renamed September 29, 2025).

MCP (Model Context Protocol)

An open JSON-RPC 2.0 protocol that standardizes how LLM hosts connect to external data sources and tools. Why it matters: The de-facto interop standard, with six canonical hosts and reportedly 10,000+ public servers as of April 2026. MCP spec 2025-11-25↗
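Because MCP messages are JSON-RPC 2.0, a tool invocation is an ordinary request object. The sketch below builds a `tools/call` request by hand to show the envelope; the tool name `search_docs` and its arguments are hypothetical, and a real client would use an MCP SDK and perform the initialization handshake first.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per session


def jsonrpc_request(method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request envelope."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


def tools_call(name: str, arguments: dict) -> str:
    """Build an MCP tools/call request for a named tool."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool name and arguments, for illustration only.
print(tools_call("search_docs", {"query": "streamable http"}))
```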

MCP server

A process that exposes tools, resources, and prompts over the MCP protocol — the service side. Why it matters: The integration unit you will be writing or wiring most often in 2026 — understanding server vs. client feature boundaries prevents auth and security errors. MCP spec — server features↗

MCP client

A connector inside an LLM host that talks to an MCP server — the calling side. Why it matters: Sampling, Roots, and Elicitation are client features; Tools, Prompts, and Resources are server features — conflating these is the most common MCP documentation error. MCP spec 2025-11-25↗

Streamable HTTP

The current MCP transport for remote servers, which replaced HTTP+SSE in the spec revision of March 26, 2025. Why it matters: Any 2024-era or early-2025 “MCP over SSE” guide is stale; targeting the deprecated transport can cause connection failures that are hard to diagnose. MCP Transports spec↗

A2A (Agent-to-Agent Protocol)

An open, Google-led protocol for opaque agent-to-agent communication and capability discovery. Why it matters: Complements MCP (agent ↔ tool) with the agent ↔ agent layer — the two protocols solve different problems. Google A2A announcement, Apr 2025↗

AGNTCY

A Cisco-led open-source framework for agent discovery, identity, messaging, and observability across vendor boundaries. Why it matters: The most production-ready inter-org agent directory standard as of mid-2026 — relevant whenever two organizations need their agents to find and call each other. AGNTCY.org↗

Claude Agent SDK

Anthropic’s official SDK for building autonomous agents on Claude, renamed from Claude Code SDK on September 29, 2025. Why it matters: npm and PyPI package names changed at the rename date — documentation written before that date references the wrong package. Anthropic engineering↗

OpenAI Agents SDK

OpenAI’s lightweight Python and TypeScript framework for building agentic applications. Why it matters: OpenAI’s recommended starting point for new agent work as of 2026; it replaces the prior Assistants API pattern for most use cases. OpenAI: Agents SDK↗

Vercel AI SDK

A TypeScript SDK for streaming text, tool use, structured output, and agents across multiple LLM providers. Why it matters: The default agent SDK for Next.js and React workloads — provider-agnostic with a unified interface. ai-sdk.dev↗

AGENTS.md

An open file convention that gives coding agents project-specific instructions, read by Claude Code, Codex, Cursor, and Grok Build. Why it matters: Adopted by reportedly 60,000+ open-source projects — the agent-era equivalent of README, and the mechanism for injecting team-specific behavior without modifying system prompts. agents.md↗

MCP server features vs. client features

The spec is explicit: servers expose Tools, Prompts, and Resources. Clients offer Sampling, Roots, and Elicitation. The most common fabrication in MCP tutorials is assigning Sampling to the server side — it belongs to the client. If you are reviewing an MCP integration document and see “the server handles Sampling,” flag it.

04: Category 3

Architectures & prompting strategies: 10 terms.

Architecture vocabulary is where ambiguity does the most damage — two engineers debating “supervisor vs. swarm” or “ReAct vs. Tree of Thoughts” are often arguing over naming, not substance, because neither has read the original papers. The entries below cite the papers that coined the terms.

Multi-agent orchestration

Coordinating multiple specialized agents to complete a task one model could not or should not tackle alone. Why it matters: The dominant deployment pattern for 2026 production systems — understanding the orchestration layer is prerequisite for sizing context budgets and latency targets. Microsoft Azure: Agent patterns↗

Supervisor pattern

A single coordinating agent assigns work to specialized worker agents and gathers their results. Why it matters: The anchor architecture for LangGraph’s langgraph-supervisor library; the most widely deployed multi-agent topology in 2026. LangChain: LangGraph Supervisor↗

Swarm pattern

Peer agents dynamically hand off to one another based on whichever specialty is currently needed. Why it matters: The decentralized alternative to supervisor — no single point of coordination failure, but harder to observe and debug. LangChain: LangGraph Swarm↗

Fan-out / fan-in

Dispatching parallel sub-tasks across multiple agents, then merging their outputs at a synchronization barrier. Why it matters: The same pattern as MapReduce — the cheapest concurrency primitive in agent frameworks and the default approach for embarrassingly parallel workloads. Microsoft Azure: Agent patterns↗
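A minimal sketch of fan-out / fan-in using Python's `asyncio.gather` as the synchronization barrier. The `worker` coroutine is a hypothetical stand-in for one agent call, and the merge step is a simple string join.

```python
import asyncio


async def worker(task: str) -> str:
    """Stand-in for one agent call; the sleep simulates model latency."""
    await asyncio.sleep(0.01)
    return task.upper()


async def fan_out_fan_in(tasks: list[str]) -> str:
    # Fan-out: start every sub-task concurrently.
    results = await asyncio.gather(*(worker(t) for t in tasks))
    # Fan-in: gather() returns only when all sub-tasks finish,
    # with results in the same order as the inputs.
    return " | ".join(results)


print(asyncio.run(fan_out_fan_in(["lint", "test", "docs"])))
```

Total wall-clock time is roughly that of the slowest worker rather than the sum, which is what makes the pattern cheap for independent sub-tasks.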

Subagent

A child agent spawned by a parent agent with its own context window and tool budget. Why it matters: Claude Code subagents are one level deep — they cannot spawn further subagents; writing system docs that imply recursive spawning will produce incorrect behavior. Claude Code docs: subagents↗

Skill

A self-contained capability bundle (.claude/skills/<name>/SKILL.md) loaded on demand by a Claude Code agent. Why it matters: Anthropic’s recommended pattern for keeping context lean; only the skill description (≤1,536 chars) sits in context until the skill is invoked. Claude Code Skills docs↗

ReAct (Reason + Act)

A prompting pattern that interleaves verbal reasoning traces with tool calls in a repeated loop of thought, action, and observation. Why it matters: The 2022 paper that popularized ‘thinking out loud + acting’; still the default loop shape inside most production agents. Yao et al. 2022 (arXiv)↗
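The thought, action, observation interleaving can be sketched with a scripted stand-in for the model. Everything here is illustrative: `scripted_model` is a hypothetical replacement for an LLM call, and `multiply` is a toy tool; only the loop shape reflects the ReAct pattern.

```python
def scripted_model(transcript: list[str]) -> dict:
    """Hypothetical stand-in for an LLM call: chooses the next step."""
    if not any(line.startswith("Observation:") for line in transcript):
        return {"thought": "I need the product of 6 and 7.",
                "action": "multiply", "args": (6, 7)}
    last = transcript[-1].removeprefix("Observation: ")
    return {"thought": "I have the result.", "action": "finish", "args": (last,)}


TOOLS = {"multiply": lambda a, b: a * b}


def react(question: str, max_steps: int = 5) -> tuple[str, list[str]]:
    """Interleave reasoning traces with tool calls until the model finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = scripted_model(transcript)
        transcript.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["args"][0], transcript
        observation = TOOLS[step["action"]](*step["args"])
        transcript.append(f"Action: {step['action']}{step['args']}")
        # The observation is fed back so the next thought can use it.
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("step budget exhausted")
```

Each pass through the loop is a separate model call in a real system, with the growing transcript as its input.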

Chain-of-thought (CoT)

Eliciting intermediate reasoning steps from a model by prompting with worked examples or explicit reasoning instructions. Why it matters: Wei et al. 2022 showed CoT unlocked complex-reasoning performance; it is the backbone of every ‘thinking’ model since. Wei et al. 2022 (arXiv)↗

Tree of Thoughts (ToT)

A prompting framework that explores multiple branching reasoning paths rather than a single chain. Why it matters: The original paper reports 74% success on the Game of 24 task versus 4% for chain-of-thought prompting with GPT-4, which makes ToT relevant when a task has multiple viable solution paths. Yao et al. 2023 (arXiv)↗

Reflexion

An agent loop in which the agent verbally critiques its own prior trajectory before re-attempting the task. Why it matters: Foundational work on agent self-improvement without weight updates; the mechanism behind most ‘self-correcting agent’ patterns in 2026. Shinn et al. 2023 (arXiv)↗

"ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting."— Yao et al., ReAct paper (arXiv:2210.03629), 2022

05: Category 4

Memory, context & retrieval: 10 terms.

Memory vocabulary is where the boundaries between complementary techniques blur most severely — RAG is not the same as fine-tuning, embeddings are not the same as semantic search, and prompt caching is a cost primitive, not a retrieval strategy. These distinctions drive real architecture decisions.

Vector retrieval

Looking up content by similarity in a high-dimensional embedding space rather than by keyword overlap. Why it matters: The default mechanism for grounding LLMs on external knowledge — the operational layer under most production RAG pipelines. Cohere: Semantic search↗

Embeddings

Dense numerical vectors that represent the semantic meaning of text, code, or other content. Why it matters: The substrate underneath retrieval, clustering, and semantic search — you cannot do vector retrieval without embeddings. OpenAI: Embeddings↗
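A minimal sketch of how embeddings support retrieval: rank stored vectors by cosine similarity to a query vector. The three-dimensional vectors below are made-up toy values; a real embedding model produces hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k corpus vectors most similar to the query."""
    ranked = sorted(corpus, key=lambda doc_id: cosine_similarity(query, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]


# Toy embeddings, invented for illustration.
corpus = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.0],
    "tax_law": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], corpus))
```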

Knowledge graph

A structured store of entities and relationships used to ground agent reasoning in symbolic facts. Why it matters: Complements vector retrieval for tasks requiring structured, relational facts; Neo4j’s ‘context graph for agents’ is the canonical 2026 pattern. Neo4j Labs: Agent Memory↗

Episodic memory

Memory of specific events — what happened, when, and in what context — as opposed to general factual knowledge. Why it matters: The Generative Agents paper used episodic memory to give simulated humans plausible long-horizon behavior — the same mechanism applies in production agents. Park et al. 2023 (Generative Agents)↗

Hybrid retrieval

Combining lexical (BM25) and vector retrieval to balance recall and semantic precision. Why it matters: The dominant production retrieval pattern as of 2026 — pure vector search misses exact-match queries; pure BM25 misses paraphrase; hybrid catches both. Elastic: Vector search guide↗
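One common way to merge a lexical ranking with a vector ranking is reciprocal rank fusion (RRF). The entry above does not prescribe a fusion method, so RRF is an assumption chosen for illustration, as are the document ids and the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a doc's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a BM25 index and a vector index for one query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still survives in the fused list.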

Context engineering

Systematic design of what loads into the model’s context window: system prompts, retrieved chunks, tool definitions, memory. Why it matters: The 2026 successor discipline to ‘prompt engineering’; the operative craft at production scale, not the art of clever phrasing. Anthropic engineering↗

Prompt caching

Reusing previously processed prompt segments to reduce cost and latency on repeated or near-identical prompts. Why it matters: Anthropic’s cache hits reportedly cost 10% of standard input; at 60%+ cache rates common in long sessions, this is the single largest cost lever. Anthropic: Prompt caching↗
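The arithmetic behind the cost lever can be worked through directly. This sketch assumes cached input tokens are billed at 10% of the standard rate, as the entry reports, and deliberately ignores any surcharge for writing to the cache, so treat the result as an optimistic bound.

```python
def effective_input_cost(hit_rate: float, cached_multiplier: float = 0.10) -> float:
    """Blended input cost relative to an uncached baseline of 1.0.

    hit_rate: fraction of input tokens served from cache (0.0 to 1.0).
    cached_multiplier: price of a cached token relative to a standard one.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * cached_multiplier


for rate in (0.0, 0.3, 0.6, 0.9):
    print(f"{rate:.0%} cached -> {effective_input_cost(rate):.2f}x baseline input cost")
```

At a 60% hit rate the blended cost is 0.4 + 0.6 × 0.10 = 0.46 of baseline, a 54% reduction in input spend under these assumptions.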

Needle-in-a-haystack

A benchmark that inserts a known fact into long context and tests whether the model can reliably retrieve it. Why it matters: The standard way to measure whether a 1M-token context window is actually usable, not just a marketing claim. Kamradt: NIAH benchmark↗

Semantic search

Search ranked by cosine similarity of embedding vectors rather than keyword frequency. Why it matters: The query interface most agent retrieval layers expose — understanding it means understanding why the same query can return different results on different embedding models. Cohere: Semantic search↗

RAG (Retrieval-Augmented Generation)

Generating outputs conditioned on dynamically retrieved external context to ground responses in live data. Why it matters: The pattern that bridges static model weights and live or proprietary data — the dominant architecture for enterprise LLM deployments in 2026. Lewis et al. 2020 (arXiv)↗

06: Category 5

Training & RL: 10 terms.

Training vocabulary matters to practitioners who need to evaluate model claims honestly: “RLHF-trained” and “Constitutional AI-trained” are not interchangeable; “RL post-training” is distinct from both. The entries below anchor each term to its originating paper.

Fine-tuning

Updating an LLM’s weights on a smaller, task-specific dataset after pre-training. Why it matters: Still the default mechanism for specializing a base model on proprietary data — understanding its cost vs. RAG alternatives is a core architecture decision. Ouyang et al. 2022 (OpenAI)↗

Instruct tuning

Fine-tuning a base model on instruction-response pairs to make it follow user intent. Why it matters: The pre-RLHF step that turned GPT-3 into InstructGPT — necessary context for understanding why base models and instruction-tuned models behave so differently. Ouyang et al. 2022 (OpenAI)↗

RLHF (Reinforcement Learning from Human Feedback)

Using human preference labels as a reward signal to fine-tune model behavior toward human intent. Why it matters: The technique that made ChatGPT possible — originating in Christiano et al. 2017, industrialized by Ouyang et al. 2022. Ouyang et al. 2022↗

RLAIF (Reinforcement Learning from AI Feedback)

Replacing human preference labels with an LLM-generated critic to provide the training reward signal. Why it matters: The training method behind Anthropic’s Constitutional AI work; distinct from RLHF in that the feedback source is an AI, not a human rater. Bai et al. 2022 (arXiv)↗

Constitutional AI (CAI)

Anthropic’s approach: a base model self-critiques and self-revises against a written set of principles (the ‘constitution’). Why it matters: Anthropic uses CAI as the alignment training pipeline for Claude — understanding it clarifies why Claude refuses or hedges in ways other models do not. Bai et al. 2022 (arXiv)↗

RL post-training

Reinforcement learning applied after pre-training to elicit or amplify specific capabilities, most commonly reasoning. Why it matters: The technique behind DeepSeek-R1, o1, and most ‘thinking’ models; a distinct training phase from both RLHF and fine-tuning. DeepSeek-R1 paper (arXiv)↗

MoE (Mixture of Experts)

A model architecture where only a sparse subset of expert subnetworks activates per token, reducing active computation. Why it matters: Underlies most 2026 frontier open-source models — DeepSeek V4, Qwen 3, GLM-5, Llama 4 are all MoE; understanding it is prerequisite for sizing inference hardware. Shazeer et al. 2017 (arXiv)↗
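Sparse activation can be sketched with a toy top-k router. In a real MoE layer the router logits come from a learned gating network applied to each token and the experts are feed-forward networks; here the logits are passed in directly and the experts are hypothetical scalar functions.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, router_logits: list[float], experts, top_k: int = 2) -> float:
    """Run only the top_k experts and mix their outputs by renormalized gate weight."""
    chosen = sorted(range(len(experts)), key=lambda i: router_logits[i],
                    reverse=True)[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    # Experts outside `chosen` are never called: that is the compute saving.
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))


# Four toy experts; only two run per input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2, lambda x: -x]
print(moe_forward(3.0, [0.0, 5.0, 5.0, -1.0], experts))
```

Total parameter count grows with the number of experts, but per-token compute grows only with `top_k`, which is why memory and compute must be sized separately.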

DSPy

A Stanford framework for programming (rather than prompting) LLMs using composable modules and optimizers. Why it matters: Treats prompts as compiled artifacts rather than hand-written strings — emerging as a production alternative to brittle prompt engineering at scale. stanfordnlp/dspy (GitHub)↗

Test-time compute

Increasing inference-time computation (more reasoning tokens, more candidate samples) to improve output quality without retraining. Why it matters: The frontier scaling axis that reportedly replaced ‘just train a bigger model’ in 2025-2026; the mechanism behind extended thinking modes. Anthropic: Extended thinking↗

Distillation

Training a smaller ‘student’ model to imitate the output distribution of a larger ‘teacher’ model. Why it matters: The mechanism behind every ‘Haiku-class’ and ‘mini’ model in the 2026 lineup; critical context for evaluating whether a smaller model is independently capable or derivatively capable. Hinton et al. 2015 (arXiv)↗
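"Imitating the output distribution" is usually expressed as a divergence between temperature-softened teacher and student distributions. The sketch below computes that soft-target loss for a single example in plain Python, following the general recipe in Hinton et al. 2015, including the T² scaling; the logits are made-up values and no training loop is shown.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / T; higher T yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy logits: the student ranks the classes in the opposite order.
print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge.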

Timeline of foundational agent papers — 2022–2024

Sources: arXiv publication dates for each paper

Jan 2022: CoT (Chain-of-Thought prompting), Wei et al. 2022. Unlocks complex reasoning.
Oct 2022: ReAct (Reason + Act), Yao et al. 2022. Tool use + verbal traces.
Dec 2022: Constitutional AI (RLAIF), Bai et al. 2022. AI self-critique.
Mar 2023: Reflexion (verbal self-improvement), Shinn et al. 2023. Agent self-correction.
Apr 2023: Generative Agents (episodic memory), Park et al. 2023.
May 2023: Tree of Thoughts (ToT), Yao et al. 2023. Branching reasoning.
Oct 2023: SWE-Bench (coding-agent eval), Jimenez et al. 2023.
Apr 2024: OSWorld (computer-use eval), Xie et al. 2024.

07: Category 6

Safety, evals & human-in-the-loop: 10 terms.

Production agent teams consistently under-invest in evals until something fails publicly. These ten terms form the safety and observability vocabulary — knowing them is prerequisite for a meaningful conversation about production readiness.

Eval (LLM evaluation)

A test set plus scoring rubric used to grade an LLM or agent’s performance, typically tracked over time. Why it matters: The production agent’s equivalent of a unit-test suite; without evals, every model update is a regression risk. Braintrust docs↗
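At its smallest, an eval is a list of cases plus a scoring function. The sketch below uses exact-match scoring as the rubric, which is an assumption chosen for simplicity; `toy_system` is a hypothetical stand-in for the model or agent under test.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Scoring rubric: case-insensitive, whitespace-insensitive equality."""
    return expected.strip().lower() == actual.strip().lower()


def run_eval(system, cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases the system answers correctly."""
    passed = sum(exact_match(expected, system(question)) for question, expected in cases)
    return passed / len(cases)


# Test set: (input, expected output) pairs.
CASES = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def toy_system(question: str) -> str:
    """Hypothetical system under test; a real one would call a model."""
    return {"2+2": "4", "capital of France": "paris"}.get(question, "unknown")


print(f"pass rate: {run_eval(toy_system, CASES):.2f}")
```

Re-running the same cases after every model or prompt change and comparing pass rates is what turns this into a regression check.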

SWE-Bench

A benchmark of 2,294 real GitHub issues used to evaluate whether an agent can resolve them end-to-end. Why it matters: The default coding-agent benchmark — SWE-Bench Verified is the cleaned subset most labs cite; comparing unverified scores creates misleading comparisons. Jimenez et al. 2023 (arXiv)↗

Terminal-Bench (TBench)

A Stanford and Laude Institute benchmark of terminal-mastery tasks for AI agents. Why it matters: Quickly became the canonical eval for CLI coding agents — Codex, Claude Code, and Grok Build all publish TBench scores. tbench.ai↗

OSWorld

A multimodal benchmark of 369 real computer tasks (web, desktop, and OS-level) for computer-use agents. Why it matters: The standard eval for ‘can your agent operate a real OS’; it covers the web, desktop, and OS-level surfaces that an agentic computer-use system must navigate. OSWorld project page↗

Sandboxed execution

Running agent-generated code inside an isolated environment — microVM or container — so it cannot damage the host. Why it matters: Mandatory for autonomous code-execution agents — Vercel Sandbox, Cloudflare Workers, and Daytona are the 2026 vendor options. Vercel Sandbox docs↗

Prompt injection

A class of attack where untrusted input in the agent’s context manipulates its behavior against the operator’s intent. Why it matters: Ranked LLM01 in OWASP’s 2025 Top 10 for LLM apps; the #1 security risk for any agent that ingests external content. OWASP LLM01:2025↗

Jailbreak

A prompt or sequence designed to bypass an LLM’s safety policies and elicit prohibited outputs. Why it matters: Distinct from prompt injection; jailbreak targets the model’s alignment training, while prompt injection targets a tool surface or downstream system. Anthropic: Many-shot jailbreaking↗

Alignment

The technical and normative problem of making AI systems reliably pursue human-intended goals. Why it matters: The framing under which all safety-relevant agent work is organized — understanding it clarifies why Constitutional AI and HITL exist as engineering patterns. Wikipedia: AI alignment↗

Human-in-the-loop (HITL)

A design pattern where trained humans retain authority over high-risk agent actions before they execute. Why it matters: The default risk-mitigation control for production agents in regulated industries — banking, healthcare, legal — where autonomous action carries legal liability. IBM: What is HITL?↗

Durable execution

A runtime pattern where workflows survive crashes, retries, and long pauses by checkpointing every step. Why it matters: Required infrastructure for long-running agents — Temporal, Inngest, Restate, and Vercel WDK are the 2026 vendors; without it, a multi-hour agent task that fails at step 47 restarts from zero. Temporal docs: Agentic loop↗
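The checkpoint-every-step idea can be sketched with a JSON file as the durable store. This is a toy illustration and not how Temporal or any named vendor works: `run_durable` and the checkpoint layout are hypothetical, and step results are assumed to be JSON-serializable.

```python
import json
import os


def run_durable(steps, checkpoint_path: str) -> dict:
    """Run named steps in order, skipping any already recorded in the checkpoint."""
    done = {}
    if os.path.exists(checkpoint_path):
        # A previous run got partway through: reload its results.
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # completed before the crash; do not redo the work
        done[name] = fn()
        # Persist after every step so a crash loses at most the step in flight.
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done
```

If a run fails partway through, calling `run_durable` again with the same checkpoint path resumes at the failed step instead of starting over.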

Benchmark

SWE-Bench issues

2,294

Real GitHub issues from 12 popular Python repos, used to evaluate whether a coding agent can resolve them autonomously. SWE-Bench Verified is the cleaned subset most labs cite.

arXiv:2310.06770

Benchmark

OSWorld tasks

369

Real computer tasks across web, desktop, and OS-level interactions. The standard eval for computer-use agents operating a real OS.

os-world.github.io

Security

Prompt injection (OWASP LLM01)

#1 risk

OWASP ranks prompt injection as the number-one vulnerability for LLM applications in their 2025 Top 10 — any agent ingesting external content is exposed.

OWASP LLM01:2025

08: The most-conflated pairs

Disambiguation table: eight pairs that derail every architecture meeting.

These are the eight conversations that consume 20 minutes of architecture meetings because neither side is working from the same definition. The table below gives: the short definition for each term, when they are confused, the actually-different bit, and a canonical source for each. No open-web glossary currently pairs these explicitly.

Term A	Term B	When they’re confused	The actually-different bit
MCP Model↔tool protocol. Spec↗	A2A Agent↔agent protocol. Google↗	Both described as “the agent interoperability protocol.”	MCP connects a model to external tools and data sources. A2A connects one agent to another agent — different layer, different use case. Both can coexist in the same system.
ReAct Interleaved reasoning + tool calls. arXiv↗	Tree of Thoughts Multiple branching reasoning paths. arXiv↗	Both described as “advanced reasoning prompting.”	ReAct is a single-path loop: think, then act, then observe, repeat. ToT explores multiple forking solution paths simultaneously. ReAct is the default production pattern; ToT is used when there are genuinely multiple valid solution branches worth exploring.
RLHF Human preference labels. Ouyang 2022↗	RLAIF AI-generated preference labels. Bai 2022↗	Both described as “the training method that makes models safe.”	RLHF uses human raters to label which model output is preferred. RLAIF replaces those human raters with an LLM critic. Constitutional AI is Anthropic’s RLAIF-based pipeline — it is not RLHF.
Skill Capability bundle (.claude/skills). Claude Code↗	Subagent Child agent, own context window. Claude Code↗	Both described as “a modular agent capability.”	A Skill is a file on disk — a SKILL.md with instructions loaded into the parent context on demand. A subagent is a spawned process with its own context window and tool budget. Skills are lightweight; subagents have independent compute cost. Subagents cannot spawn their own subagents.
Agent Perceives environment, takes actions. Russell & Norvig↗	Assistant Responds to user queries in a session.	“AI assistant” and “AI agent” used interchangeably in marketing copy.	An assistant responds to queries; it does not initiate actions or persist goals between sessions without explicit prompting. An agent maintains goals, loops autonomously, and takes actions — often without a human in the loop per step.
Fine-tuning Weight updates on task-specific data. OpenAI↗	RAG Retrieval-grounded generation at inference. Lewis 2020↗	Both proposed as solutions to “make the model know our internal data.”	Fine-tuning bakes knowledge into weights at training time — expensive, and the knowledge becomes stale the moment it is trained. RAG retrieves knowledge at inference time from a live data source. They are complementary: fine-tune for style and behavior; RAG for live or frequently-updated knowledge.
Embeddings Dense semantic vector representation. OpenAI↗	Vector search Similarity lookup in embedding space. Cohere↗	Used interchangeably in architecture discussions.	Embeddings are the numeric representation — the output of an embedding model. Vector search is the retrieval operation performed over a database of embeddings. You need both; neither is the other.
RAG Retrieve then generate. Lewis 2020↗	Grounding Any method to anchor model output in verifiable facts.	“Grounding” used as a synonym for RAG in vendor marketing.	RAG is a specific architecture: retrieve, then inject retrieved context into the prompt, then generate. Grounding is a broader goal — any technique (RAG, knowledge graph injection, structured data, citations, HITL review) that keeps output factually anchored. RAG is one grounding strategy, not the only one.
All source links point to official specifications, arXiv papers, or canonical vendor documentation — not to secondary glossary pages. Each term is defined independently; the disambiguation is the point.

09: Practical usage

How to use this glossary in your team’s docs.

A glossary earns its keep when it stops a specific conversation from going in circles. The three highest-value deployment patterns for this one:

1. Link it in CLAUDE.md or AGENTS.md. If your repository has a CLAUDE.md, AGENTS.md, or equivalent agent-context file, add a line like: # Terminology: https://digitalapplied.com/blog/ai-agent-glossary-2026-60-essential-terms. Claude Code and most coding agents read these files at session start, so the glossary URL becomes part of the shared vocabulary context at the cost of one line rather than a pasted-in glossary.

2. Hand it to new engineers before their first architecture review. The six categories are a reading curriculum: Core (start here), then Protocols & SDKs (before any integration work), then Architectures (before any design session). The disambiguation table deserves a separate read. Engineers who have read the disambiguation section arrive at MCP vs. A2A discussions with a clear framework rather than a vague sense that “they are both agent protocols.”

3. Reference it in PR descriptions. When a reviewer asks “what is a Skill vs. a subagent here?” or “why are we using RAG instead of fine-tuning?”, a link to the relevant section of this glossary is faster and more precise than a Slack thread. The disambiguation table entries have stable anchor IDs for deep-linking.
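Pattern 1 can be scripted so the terminology line is added once and never duplicated. The following is a minimal sketch, assuming the context files sit at the repository root and are UTF-8 text; `add_terminology_line` is a hypothetical helper name, and the line it writes is the one suggested above.

```python
from pathlib import Path

# The line suggested in pattern 1, copied verbatim.
GLOSSARY_LINE = (
    "# Terminology: "
    "https://digitalapplied.com/blog/ai-agent-glossary-2026-60-essential-terms"
)


def add_terminology_line(repo_root, filenames=("CLAUDE.md", "AGENTS.md")):
    """Append the glossary line to each existing agent-context file, once.

    Returns the names of the files that were modified.
    """
    updated = []
    for name in filenames:
        path = Path(repo_root) / name
        if not path.is_file():
            continue  # only touch files the repository already has
        text = path.read_text(encoding="utf-8")
        if GLOSSARY_LINE in text:
            continue  # already present, keep the operation idempotent
        # Make sure the new line starts on its own line.
        separator = "" if text == "" or text.endswith("\n") else "\n"
        path.write_text(text + separator + GLOSSARY_LINE + "\n", encoding="utf-8")
        updated.append(name)
    return updated


if __name__ == "__main__":
    print(add_terminology_line("."))
```

Because a leading `#` renders as a top-level heading in Markdown, some teams may prefer to place the URL under an existing heading instead; the helper writes the line exactly as suggested and leaves that choice to the repository owner.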

The glossary is intentionally narrow: 60 terms that a 2026 production agent team will hit in a typical sprint, not 200 terms that cover the entire AI research landscape. For a broader reference, see our 200-term agentic AI glossary, and for the specific SDK documentation, the Claude Agent SDK migration playbook covers the September 2025 rename in full detail. If you are building agent systems for marketing or content operations, our AI transformation services team can help you apply these patterns at scale — from MCP server integration to multi-agent orchestration.

For teams evaluating the cost side of agent tooling, our AI coding agent cost calculator models per-task costs across 10 tools, with prompt caching at 0/30/60/90% modeled explicitly.

New hire onboarding (engineering teams). Assign Core + Protocols & SDKs before first sprint. Disambiguation table before first architecture review. The six categories are a readable curriculum, not a reference dump. Next step: read sequentially.

Ongoing reference (PR and architecture review). Deep-link to the disambiguation table when a term conflict surfaces. Use the canonical source links to settle definitions, not informal consensus. Next step: link the disambiguation table.

Agent context file (CLAUDE.md / AGENTS.md). Add the glossary URL to your repo's CLAUDE.md or AGENTS.md. Coding agents read these files at session start, giving shared vocabulary without per-request token cost. Next step: link in CLAUDE.md.

Quarterly refresh (glossary maintenance). Review the Protocols & SDKs category each quarter; it is the most volatile. MCP transports, SDK naming, and benchmark definitions change faster than architecture patterns. Next step: audit in Q3 2026.

Reference · May 2026 snapshot

This glossary will be wrong by Q3 2026 — and that is the point.

The agentic AI vocabulary is changing weekly. MCP’s Streamable HTTP transport, which deprecated SSE in March 2025, was not in the original November 2024 spec. “Skill” did not exist as a primitive until Claude Code 1.0. “Subagent” was Anthropic-coined in 2025 and is now industry-standard. By Q3 2026 there will be new terms (likely around agent governance, durable workflow vendors, and multimodal tool use) that belong in a refreshed version of this list.

The point of this glossary is not to be eternally correct. It is to be a snapshot of the shared vocabulary as of May 2026, so engineering teams stop talking past each other on PR reviews and architecture docs. Use it as a starting point. Hand it to new hires. Link to it in CLAUDE.md. Refresh it quarterly. The disambiguation table is the part that earns its keep most reliably — MCP vs. A2A, RLHF vs. RLAIF, Skill vs. subagent, fine-tuning vs. RAG — those are the four conversations that derail every architecture meeting, and they are unlikely to stop derailing meetings just because the underlying technology evolves.

For teams moving beyond vocabulary into production agent systems, the Digital Applied AI transformation team works on the implementation layer: MCP server integration, multi-agent orchestration, prompt caching optimization, and eval-driven iteration. The glossary is the shared language; we help build the system.
