NVIDIA Dynamo 1.0: Open-Source Inference OS for AI
NVIDIA releases Dynamo 1.0, an open-source distributed inference OS achieving 7x performance boost for Blackwell GPUs. Adopted by major clouds.
As organizations move from running a single AI model on one server to operating dozens of models across thousands of GPUs simultaneously, the bottleneck shifts from model quality to infrastructure efficiency. Training compute has dominated AI investment for years, but the frontier of the industry is now inference — specifically, how to maximize the throughput and minimize the cost of serving billions of requests from large models at production scale.
NVIDIA Dynamo 1.0 is the first open-source release of what NVIDIA calls an inference operating system for AI factories. Announced at GTC 2026, it addresses the fundamental inefficiencies in how GPU clusters run LLM inference today — idle compute during mismatched workload phases, redundant attention computation across similar requests, and static resource allocation that cannot adapt to shifting demand patterns. Released under the Apache 2.0 license, it integrates with the existing open-source inference ecosystem rather than replacing it. For organizations building AI infrastructure as described in our coverage of NVIDIA GTC 2026 enterprise agentic AI announcements, Dynamo represents the practical infrastructure layer that makes those capabilities economically viable at scale.
This guide explains the architecture, performance characteristics, and deployment considerations for Dynamo 1.0. It covers the disaggregated prefill-decode design, distributed KV cache, the planner and scheduler components, ecosystem integrations, and the scenarios where Dynamo delivers the most significant benefits relative to conventional inference setups.
What Is NVIDIA Dynamo
NVIDIA Dynamo is a distributed inference runtime designed to coordinate LLM serving across multi-node GPU clusters. The "inference operating system" framing is deliberate: just as an operating system abstracts hardware resources and schedules processes, Dynamo abstracts a pool of GPUs and schedules inference workloads across them with the goal of maximizing throughput while meeting latency targets.
Traditional inference servers are designed to run on a single machine. As models grow larger and require multiple GPUs, these servers scale up by adding more GPUs to one node and using tensor parallelism to split the model across them. Dynamo takes a different approach: it decomposes the inference pipeline into functional stages — prefill, decode, KV cache management, and routing — and distributes each stage across dedicated worker pools that can span many nodes. This disaggregated architecture is the foundational innovation that enables the system's performance characteristics.
- Resource abstraction: Abstracts GPU cluster resources and schedules LLM inference workloads across them. Manages disaggregated pipeline stages, distributed caching, and dynamic resource reallocation in a single coordinated system.
- Multi-node scale: Designed from the ground up for clusters spanning hundreds to thousands of GPUs across multiple nodes. Uses NVLink and InfiniBand for low-latency inter-node communication between disaggregated pipeline stages.
- Open ecosystem: Fully open source under the Apache 2.0 license. Integrates with the existing open inference ecosystem including vLLM, TensorRT-LLM, and SGLang rather than replacing them, lowering the adoption barrier for teams with existing infrastructure.
The timing of Dynamo's release reflects a structural shift in the AI industry. Through 2023 and 2024, most AI deployments were relatively small-scale: a few GPUs serving one or two models for internal tooling or early products. In 2025 and 2026, the landscape has changed. Enterprises are operating what NVIDIA calls AI factories — large-scale GPU clusters continuously running inference workloads for customer-facing products, internal automation, and AI agents operating at high throughput. At this scale, efficiency gains in inference compute translate directly into lower operating costs and reduced power consumption.
Disaggregated Prefill-Decode Architecture
The core architectural innovation in Dynamo is disaggregating the two phases of LLM inference — prefill and decode — onto separate GPU pools. Understanding why this matters requires understanding the fundamentally different computational demands of each phase.
Prefill processes the entire input prompt in parallel. For a 1,000-token prompt, all 1,000 tokens are fed through the model simultaneously in large matrix multiplications. This phase is compute-bound: the bottleneck is raw arithmetic throughput (FLOPS), and GPUs can achieve high utilization by batching many prefill requests together. Decode generates one output token at a time, requiring a complete forward pass through the model for every token produced. Because the full set of model weights must be read from VRAM to produce each single token, this phase is memory-bandwidth-bound — the bottleneck is how quickly the GPU can stream weights from memory, not how many arithmetic operations it can perform per second.
When both phases share the same GPU, each phase degrades the other's efficiency. Prefill batches are interrupted by decode requests that arrive mid-computation. Decode workers sit partially idle during high-prefill-load periods. GPU hardware optimized for one phase's bottleneck is suboptimal for the other's. Disaggregation allows each worker pool to be configured, scaled, and batched independently — using compute-dense GPU configurations for prefill workers and memory-bandwidth-optimized configurations for decode workers, each running at near-peak efficiency for their specific computational profile.
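The compute-bound versus memory-bound distinction can be made concrete with a back-of-envelope arithmetic-intensity estimate. The sketch below is illustrative only: the 2-FLOPs-per-parameter-per-token rule of thumb and the H100-class peak figures are general assumptions, not Dynamo internals.

```python
def arithmetic_intensity(num_params: float, tokens_per_pass: int) -> float:
    """Rough FLOPs-per-byte for one forward pass of a dense fp16 model.

    Assumes ~2 FLOPs per parameter per token, and that the full set of
    fp16 weights (2 bytes/param) is read from VRAM once per pass.
    """
    flops = 2.0 * num_params * tokens_per_pass
    bytes_read = 2.0 * num_params
    return flops / bytes_read  # simplifies to tokens_per_pass

# Illustrative H100-class ridge point: peak fp16 FLOPS / memory bandwidth.
RIDGE = 990e12 / 3.35e12  # ~295 FLOPs per byte

prefill = arithmetic_intensity(70e9, tokens_per_pass=1000)  # 1000.0
decode = arithmetic_intensity(70e9, tokens_per_pass=1)      # 1.0

print(prefill > RIDGE)  # True  -> prefill is compute-bound
print(decode < RIDGE)   # True  -> decode is memory-bandwidth-bound
```

Any pass whose intensity falls below the hardware's ridge point is limited by memory bandwidth, which is why decode cannot saturate the same GPUs that prefill keeps busy.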
Benchmark result: NVIDIA reports that Dynamo's disaggregated architecture delivers approximately 2× the throughput of TensorRT-LLM on equivalent hardware for typical production LLM serving workloads, with up to 30× improvement in tokens per second on specific benchmark configurations compared to baseline single-server inference setups.
Distributed KV Cache and Smart Routing
The KV cache — short for key-value cache — stores the attention mechanism's intermediate computation results from the prefill phase so that the decode phase can access them without recomputation. In single-node inference, this cache lives in the GPU's local VRAM and is scoped to a single request. When a request finishes, its cache entries are evicted. When a similar request arrives later with the same prefix, the computation is repeated from scratch.
Dynamo introduces a distributed KV cache layer that is shared across all worker nodes in the cluster. When a request is processed, its KV states for any repeated prefix — such as a common system prompt, a frequently retrieved RAG context chunk, or a shared conversation history — are stored in the distributed cache. When future requests arrive with the same prefix, the cached states are retrieved and injected directly into the decode workers, completely skipping the prefill computation for the matching portion of the input.
- Prefix cache hits: When a request prefix matches a cached entry, prefill is bypassed entirely for the matching portion. For workloads where all requests share a common system prompt — the vast majority of production API deployments — this eliminates redundant computation proportional to the system prompt length on every single request.
- Cache-aware routing: The smart router directs incoming requests to the decode worker node that already holds the relevant KV cache entries in local memory. This minimizes inter-node cache transfer overhead and maximizes the hit rate for locality-sensitive workloads like multi-turn conversations with the same context.
- RAG chunk reuse: Retrieval-Augmented Generation workloads frequently retrieve the same popular document chunks across many requests. The distributed KV cache can store processed versions of frequently retrieved chunks so subsequent requests that retrieve the same chunk skip their prefill entirely, cutting latency and compute cost for popular knowledge items.
- Eviction and warm-up: The cache uses configurable eviction policies based on recency, access frequency, and cache entry size. For long-running deployments, the cache warms up progressively as the request distribution stabilizes, with hit rates improving over time as the most common prefixes are retained in the warm cache layer.
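The prefix-reuse mechanism can be sketched with a toy block-hash index. This is not Dynamo's actual implementation (block size, hashing scheme, and the `PrefixCache` class are all hypothetical), but it shows the core idea: chained per-block hashes let a lookup find the longest cached prefix and skip prefill for exactly that portion.

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative granularity)

def block_keys(tokens: list[int]) -> list[str]:
    """Chained hashes: each block key commits to the whole prefix before it."""
    keys, running = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        running.update(str(tokens[i:i + BLOCK]).encode("utf-8"))
        keys.append(running.hexdigest())
    return keys

class PrefixCache:
    """Toy distributed-KV-cache index mapping block keys to stored KV states."""
    def __init__(self):
        self.store: dict[str, object] = {}

    def match_length(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV states are already cached."""
        matched = 0
        for key in block_keys(tokens):
            if key not in self.store:
                break
            matched += BLOCK
        return matched

    def insert(self, tokens: list[int], kv_states: object) -> None:
        for key in block_keys(tokens):
            self.store[key] = kv_states  # real systems store per-block KV tensors

cache = PrefixCache()
system_prompt = list(range(64))            # stand-in for a tokenized system prompt
cache.insert(system_prompt, kv_states="kv")

request = system_prompt + [999, 998, 997]  # same prefix, new user turn
print(cache.match_length(request))         # 64 -> prefill skipped for 64 tokens
```

Chaining the hashes matters: a block only matches if every block before it also matches, which is exactly the longest-common-prefix semantics the cache needs.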
The smart routing layer integrates with the KV cache manager to make routing decisions that optimize for cache locality, worker load balance, and request latency simultaneously. Unlike simple round-robin or least-connections load balancing, Dynamo's router understands the semantic content of incoming requests — specifically, which prefix tokens they share with cached entries — and factors this into routing decisions. The result is significantly higher effective cache hit rates compared to routing-unaware caching systems.
Planner and Scheduler for Dynamic Scaling
The planner is Dynamo's cluster-level resource manager. It monitors the state of all worker pools in real time — queue depths, GPU utilization, memory usage, request latency — and makes decisions about how to allocate GPU resources across the models being served on the cluster. In an AI factory environment where multiple models run simultaneously, the planner enables the cluster to behave as a single shared resource pool rather than a collection of fixed per-model allocations.
When demand for one model spikes — for example, a new product feature launches and traffic to a specific assistant endpoint increases tenfold — the planner can redirect GPUs from lower-utilization model pools to the high-demand pool without human intervention or manual reconfiguration. The reallocation is not instantaneous (workers need to load model shards), but the planner operates ahead of demand by monitoring queue growth trends and initiating reallocation before queues become critically long.
Inputs Monitored
Request queue depth per model, per-worker GPU utilization and memory usage, current and projected latency percentiles (P50, P95, P99), KV cache hit rates, and worker health signals.
Actions Available
Spin up new worker instances from the available GPU pool, migrate workers between model assignments, adjust prefill and decode worker pool ratios, trigger KV cache pre-warming for anticipated workload shifts, and gracefully drain underutilized workers.
Optimization Objectives
Maximize cluster-wide throughput while meeting per-model latency SLAs. The planner supports priority weighting so latency-sensitive customer-facing models take precedence over batch internal workloads when GPU resources are constrained.
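A heavily simplified sketch of one planner decision pass, tying the monitored inputs to the available actions. All thresholds, field names, and the `plan` function itself are illustrative assumptions, not Dynamo's API; the point is acting on projected queue growth rather than current depth alone.

```python
from dataclasses import dataclass

@dataclass
class PoolStats:
    model: str
    workers: int
    queue_depth: int
    queue_growth_per_s: float   # trend from recent samples
    util: float                 # mean GPU utilization, 0..1

def plan(pools: list[PoolStats], horizon_s: float = 30.0,
         max_queue: int = 100, idle_util: float = 0.4) -> list[tuple[str, str]]:
    """Return (from_model, to_model) worker migrations.

    A pool is a donor if it is underutilized with workers to spare; a pool
    needs help if its queue, extrapolated over the planning horizon, would
    exceed max_queue. Extrapolation lets the planner act ahead of demand.
    """
    donors = sorted((p for p in pools if p.util < idle_util and p.workers > 1),
                    key=lambda p: p.util)
    moves = []
    for p in pools:
        projected = p.queue_depth + p.queue_growth_per_s * horizon_s
        if projected > max_queue and donors:
            donor = donors.pop(0)
            donor.workers -= 1
            p.workers += 1
            moves.append((donor.model, p.model))
    return moves

pools = [
    PoolStats("assistant", workers=8, queue_depth=40, queue_growth_per_s=4.0, util=0.95),
    PoolStats("batch-summarizer", workers=6, queue_depth=2, queue_growth_per_s=0.0, util=0.25),
]
print(plan(pools))  # [('batch-summarizer', 'assistant')]
```

In this example the assistant pool's queue is only at 40, but its growth trend projects 160 within the horizon, so a worker is reassigned before the SLA is actually breached.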
The scheduler operates at a lower level than the planner, making per-request assignment decisions within the current worker pool configuration. It integrates the KV cache locality information from the routing layer with real-time worker load to select the optimal worker for each request. The scheduler also handles the complexity of streaming output: in a disaggregated architecture, the decode worker that generates a response may be different from the prefill worker that processed the input, and the scheduler coordinates the KV state transfer between them transparently.
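The scheduler's combined objective can be sketched as a single scoring function over candidate workers. The weights and names below are hypothetical, not Dynamo's internals; what the sketch shows is cache locality and current load being traded off in one score rather than considered separately.

```python
def pick_worker(workers: dict[str, int], prefix_tokens_cached: dict[str, int],
                total_prefix: int, locality_weight: float = 0.7) -> str:
    """Score each worker by cache locality minus normalized load; pick the best.

    workers: {worker_id: queue depth}.
    prefix_tokens_cached: tokens of this request's prefix already resident
    on each worker. The 0.7/0.3 weighting is purely illustrative.
    """
    def score(wid: str) -> float:
        hit = prefix_tokens_cached.get(wid, 0) / max(total_prefix, 1)
        load = workers[wid] / (1 + max(workers.values()))
        return locality_weight * hit - (1 - locality_weight) * load
    return max(workers, key=score)

workers = {"w0": 12, "w1": 3, "w2": 5}
cached = {"w2": 900}  # w2 already holds most of this request's prefix
print(pick_worker(workers, cached, total_prefix=1000))  # 'w2'
```

Note that `w2` wins despite not being the least-loaded worker: avoiding a cross-node KV transfer outweighs a modest queue difference, which is exactly what round-robin or least-connections balancing cannot express.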
Multi-Framework and Multi-Model Support
One of the practical strengths of Dynamo's design is its pluggable backend architecture. Rather than building its own model execution engine, Dynamo integrates with the existing open-source inference engines that organizations are already using — vLLM, TensorRT-LLM, and SGLang — as worker backends. This means teams can adopt Dynamo incrementally: deploy the distributed coordination layer on top of existing vLLM workers without replacing the engine itself.
- vLLM: The most broadly compatible backend, supporting the widest range of model architectures and hardware configurations. Best choice for organizations already running vLLM in production and teams prioritizing model coverage over raw throughput optimization.
- TensorRT-LLM: Highest throughput on NVIDIA hardware through TensorRT kernel optimizations. Best for deployments where maximum tokens-per-second on H100/H200 hardware is the primary objective and the model set is relatively fixed.
- SGLang: Optimized for structured generation and complex multi-step prompting workflows. Best for agentic pipelines with complex output schemas, tool use, and structured data extraction workloads where output format control is critical.
On the model side, Dynamo supports all major open-weight model families including Llama 4 (Scout, Maverick), Mistral and Mixtral, Qwen 2.5, DeepSeek V3 and R1, Falcon, and Phi-4. The OpenAI API-compatible endpoint layer means existing applications built against the OpenAI API can point to a Dynamo cluster with minimal configuration changes — a critical practical detail for organizations migrating from proprietary API usage to self-hosted infrastructure.
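In practice, pointing an existing OpenAI-style application at a Dynamo cluster mostly means changing the base URL; the request and response shapes stay the same. The endpoint address below is hypothetical, and the payload follows the standard OpenAI chat-completions JSON shape.

```python
import json

# Hypothetical cluster address; any OpenAI-v1-compatible client would point here.
DYNAMO_BASE_URL = "http://dynamo.internal:8000/v1"

def chat_request(model: str, user_msg: str, system: str, stream: bool = True) -> dict:
    """Build an OpenAI-compatible /chat/completions payload.

    Keeping the system prompt byte-identical across requests matters on a
    Dynamo cluster: it is exactly the shared prefix the distributed KV
    cache can reuse to skip prefill.
    """
    return {
        "model": model,
        "stream": stream,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
    }

payload = chat_request("llama-3.1-70b", "Summarize our Q3 results.",
                       system="You are a concise financial analyst.")
print(json.dumps(payload, indent=2))
# An existing OpenAI SDK client only needs its base_url set to
# DYNAMO_BASE_URL; application code is otherwise unchanged.
```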
Open Source Ecosystem and Integrations
The Apache 2.0 license is a deliberate choice that reflects NVIDIA's strategic interest in expanding the ecosystem of organizations building on GPU infrastructure. By releasing Dynamo as fully open source, NVIDIA enables cloud providers, enterprises, and research institutions to adopt and extend the system without commercial restrictions, accelerating adoption and generating community contributions that improve the core product.
Inference Backends
- vLLM (all hardware)
- TensorRT-LLM (NVIDIA)
- SGLang (multi-hardware)
- Hugging Face Transformers
Orchestration
- Kubernetes + Helm
- NVIDIA Dynamo Operator
- Prometheus metrics
- Grafana dashboards
API Compatibility
- OpenAI API v1 compatible
- Chat completions endpoint
- Streaming response support
- Function calling / tools
Storage and Cache
- NVMe-backed KV cache
- Redis for metadata
- S3-compatible model storage
- NVLink / InfiniBand transfer
Community momentum: The Dynamo repository accumulated over 100,000 GitHub stars within weeks of its release, reflecting strong adoption interest from the AI infrastructure community. Cloud providers including AWS, Google Cloud, and Microsoft Azure are already integrating Dynamo into their managed inference service stacks.
Performance Benchmarks and Throughput Gains
NVIDIA published benchmark results at GTC 2026 comparing Dynamo to standalone TensorRT-LLM and vLLM on representative production workloads. The results demonstrate that the gains from disaggregation and distributed KV caching are highly workload-dependent — the larger the average input prompt, the higher the proportion of repeated prefixes, and the higher the request concurrency, the more significant the throughput improvement.
- Aggregate throughput: At 1,000+ concurrent requests on Llama 3.1 70B, Dynamo delivers approximately 2× the throughput of TensorRT-LLM running monolithically on equivalent hardware. The gain comes primarily from prefill-decode disaggregation enabling higher parallelism in each phase simultaneously.
- Prefix caching: For workloads where 80%+ of requests share common prefixes — typical in RAG deployments with standard system prompts and popular retrieved chunks — KV cache hits eliminate prefill for the shared portion. This produces up to 30× improvement in effective tokens per second on the cached prefix segments.
- Time to first token: P95 time-to-first-token improves by 40–60% compared to monolithic serving under high load, as dedicated prefill workers are not starved by decode traffic. Consistent TTFT under variable load is one of the most impactful user experience improvements for interactive applications.
- GPU utilization: Serving five models simultaneously with dynamic allocation achieves 85%+ average GPU utilization compared to 40–55% typical for static per-model allocation. The planner's ability to shift resources to demand prevents idle GPUs during periods of uneven load distribution across models.
It is important to contextualize these numbers. The peak improvements occur under conditions that favor disaggregation: high concurrency, long prompts, and high prefix repetition rates. For low-concurrency deployments serving a single model with short varied prompts, Dynamo's overhead may produce only modest gains or even slight latency increases compared to a well-tuned single-node vLLM setup. Dynamo is not a universal optimization — it is specifically designed for the AI factory deployment pattern where its architectural advantages are most pronounced.
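A simple Amdahl-style model illustrates why the gains are so workload-dependent. If a fraction f of prompt tokens hit the prefix cache, prefill compute drops to (1 − f) of baseline; the model below is a deliberate simplification that ignores cache-transfer and coordination overheads.

```python
def effective_prefill_speedup(prefix_hit_fraction: float) -> float:
    """Amdahl-style estimate: speedup = 1 / (1 - f), where f is the
    fraction of prompt tokens served from the prefix cache.
    Illustrative model only; ignores transfer and coordination costs."""
    assert 0 <= prefix_hit_fraction < 1
    return 1.0 / (1.0 - prefix_hit_fraction)

# Short, varied prompts with little sharing -> modest gain.
print(round(effective_prefill_speedup(0.2), 2))    # 1.25
# RAG workload where ~97% of prompt tokens are a shared prefix.
print(round(effective_prefill_speedup(0.967), 1))  # 30.3
```

Headline figures like "up to 30×" correspond to the right-hand regime, where nearly the entire prompt is cached prefix; they say little about workloads near the left-hand one.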
Deployment Scenarios and AI Factories
Understanding which deployments benefit most from Dynamo helps organizations prioritize where to invest the additional operational complexity of a distributed inference runtime. The value proposition scales with cluster size, workload diversity, and request volume.
- High-traffic consumer products: Consumer products with millions of daily active users making LLM API calls benefit most from Dynamo's throughput improvements and consistent latency under load. The planner's dynamic allocation handles traffic spikes without pre-provisioning dedicated capacity for peak demand.
- Agentic AI pipelines: Multi-step agentic workflows that call an LLM many times per task — for planning, tool use, verification, and synthesis — benefit from reduced per-call latency across the entire pipeline. Lower TTFT compounds across multi-step reasoning chains to produce meaningful end-to-end response time improvements.
- Enterprise multi-model platforms: Enterprises running a portfolio of AI applications — coding assistants, document processors, customer support bots, and analytical tools — each backed by different models benefit from sharing a GPU pool. The planner optimizes allocation across all applications simultaneously rather than overprovisioning each one independently.
- Cloud inference services: Managed LLM inference services from cloud providers serving millions of customers across thousands of GPU nodes represent the canonical AI factory use case. AWS, Google Cloud, and Azure are all integrating Dynamo components into their inference service stacks as the throughput improvements directly translate to lower operating cost per token served.
For organizations building AI-powered digital transformation initiatives, Dynamo represents the infrastructure layer that makes ambitious AI product economics viable at scale. The difference between serving one million requests per day at two cents per request and serving them at one cent — enabled by 2× throughput improvement — can be the difference between a profitable product and an unsustainable cost center. As AI becomes embedded in more business workflows and customer touchpoints, inference efficiency becomes a core competency rather than a technical detail.
Limitations and Production Considerations
Dynamo's architectural advantages come with meaningful operational complexity that organizations should evaluate carefully before committing to deployment. The disaggregated architecture that enables its performance gains also introduces new failure modes, increased network dependency, and higher minimum infrastructure requirements that make it overkill for many use cases.
Minimum scale threshold: Dynamo's overhead from distributed coordination, inter-node KV cache transfers, and planner management makes it suboptimal for small deployments. Practical benefits emerge at roughly 8+ GPU nodes serving at meaningful concurrency. Single-node or low-traffic deployments are better served by standalone vLLM or TensorRT-LLM.
Network infrastructure requirements: The distributed KV cache and inter-node coordination require low-latency, high-bandwidth networking between nodes. NVLink for intra-chassis GPU communication and InfiniBand or 400 GbE for inter-node transfers are effectively required for production performance. Standard data center Ethernet (25–100 GbE) may limit the benefits of disaggregation.
Operational complexity: Running disaggregated prefill and decode pools, a distributed KV cache service, and a planner requires significantly more operational expertise than a monolithic inference server. Teams need proficiency in Kubernetes, distributed systems debugging, and GPU infrastructure management to operate Dynamo reliably in production.
Version 1.0 maturity: As a 1.0 release, Dynamo is production-capable but still accumulating real-world operational experience. Edge cases in failure handling, cache consistency under node failure, and planner behavior at extreme load are areas where the community will refine the system over the coming months. Early production adopters should plan for active monitoring and involvement with the upstream community.
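The network bandwidth requirement above can be quantified with rough arithmetic. The sketch assumes a Llama-3.1-70B-like geometry (80 layers, 8 GQA KV heads, head dimension 128, fp16) and ignores protocol overheads; it estimates the time to ship a 1,000-token prefix's KV states across a single link.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_el: int = 2) -> int:
    """fp16 KV-cache footprint per token: one K and one V vector
    per layer per KV head."""
    return layers * 2 * kv_heads * head_dim * bytes_per_el

def transfer_ms(tokens: int, per_token_bytes: int, link_gbit: float) -> float:
    """Idealized time to move a prefix's KV states over one link (no overheads)."""
    link_bytes_per_s = link_gbit * 1e9 / 8
    return tokens * per_token_bytes / link_bytes_per_s * 1e3

# Llama-3.1-70B-like geometry: 80 layers, 8 GQA KV heads, head_dim 128.
per_tok = kv_bytes_per_token(80, 8, 128)          # 327,680 bytes (~320 KiB/token)
print(round(transfer_ms(1000, per_tok, 400), 1))  # ~6.6 ms on a 400 Gb/s link
print(round(transfer_ms(1000, per_tok, 25), 1))   # ~104.9 ms on 25 GbE
```

On a 400 Gb/s fabric the transfer hides comfortably inside typical TTFT budgets; at 25 GbE it alone can exceed 100 ms, which is why slower data center Ethernet erodes the benefit of disaggregation.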
The pragmatic guidance is that Dynamo is the right choice when inference cost and throughput have become a meaningful constraint and the organization has the engineering capacity to operate a distributed system. For teams at earlier stages, investing in well-tuned vLLM or TRT-LLM on a single multi-GPU node will deliver most of the performance needed at a fraction of the operational complexity. Dynamo becomes the natural next step when scale demands it.
Conclusion
NVIDIA Dynamo 1.0 represents a genuine architectural advance in how large-scale LLM inference is orchestrated. Disaggregated prefill-decode, distributed KV cache with smart routing, and a cluster-level planner for dynamic resource allocation address the core inefficiencies that make current GPU cluster utilization lower than it needs to be. For organizations operating at AI factory scale, these improvements translate directly to lower cost per token and higher quality of service for end users.
The Apache 2.0 release and native integration with the existing open-source inference ecosystem lower the adoption barrier considerably. Organizations can start by layering Dynamo's coordination capabilities on top of their existing vLLM infrastructure rather than undertaking a full replacement. As the system matures through community contributions and real-world production hardening, it is positioned to become the standard infrastructure layer for serious AI factory deployments. The efficiency gains it enables are not just an engineering improvement — they are a prerequisite for making AI products economically sustainable as usage scales.
Ready to Scale Your AI Infrastructure?
Infrastructure choices made today determine the economics and capabilities of AI products at scale. Our team helps businesses design AI deployment strategies that are both technically sound and commercially viable.