Practical Agentic Engineering Workflow: Production Guide 2025
Based on extensive analysis of production agentic workflows, this guide covers the strategies that ship code faster than experimental approaches: blast radius management, GPT-5 Codex optimization, parallel agent orchestration, model selection frameworks, and the anti-patterns that slow teams down.
[Figures: GPT-5 Codex usable context vs Claude Code; a concurrent terminal grid for throughput; code quality maintenance via agent-driven refactoring; AI agents writing production code.]
Key Takeaways
Agentic engineering has moved from experimental to production-ready. In the workflow described here, AI agents write 100% of the production code: not because of elaborate prompting tricks or complex tooling charades, but because we finally understand how to work with them effectively.
This isn't another post celebrating AI achievements with unrealistic benchmarks. It's a practical playbook from someone shipping a ~300K LOC TypeScript React app, Chrome extension, CLI tool, Tauri client app, and Expo mobile app—all built primarily by AI agents working in parallel terminal grids.
The Agentic Engineering Revolution
We crossed a threshold in 2025. Models moved from "this is interesting" (Sonnet 4 in May 2025) to "this is production ready" (GPT-5 Codex). The difference isn't just benchmark scores; it's how models approach problems, read codebases, and maintain context across complex changes.
From Code Completion to Code Generation
Traditional AI coding tools focused on tab completion—suggesting the next few lines based on immediate context. Agentic engineering represents a fundamental shift: agents reason about entire features, make architectural decisions, coordinate changes across dozens of files, and handle complex refactoring that would take human developers hours or days.
The workflow transition looks like this:
- Old: Write detailed specs, review 100+ file changes, manually fix inconsistencies
- New: Start conversations with screenshots, watch features build live, queue related changes, iterate in real-time
Model Selection: GPT-5 Codex vs Claude Code
After months working daily with both GPT-5 Codex and Claude Code, clear patterns emerged. This isn't about benchmark leaderboards or marketing claims—it's practical observations from shipping real-world applications.
Why GPT-5 Codex Won for Daily Work
230K tokens of usable context vs Claude's 156K. Claude technically offers a 1M-token context window (if you happen to get access), but in practice it gets "silly" long before that context is actually used up. Codex maintains coherence across far more files and longer conversations.
Context fills far more slowly. Whatever OpenAI does differently, context bloat is significantly reduced compared to Claude Code's frequent "Compacting..." messages. That means longer work sessions without a context reset.
Queue related tasks. Codex lets you queue multiple messages for sequential execution. Claude changed this behavior months ago so that new messages "steer" the model mid-work instead. Having both options (queue vs. steer) is far better: in Codex, press Escape and Enter when you want the steering behavior.
Rust-based CLI, no memory bloat. Codex is incredibly fast with no multi-second freezes, no gigabyte memory bloat, no terminal flickering. It feels lightweight and responsive in ways Claude Code doesn't match.
No "absolutely perfect" false confidence. Claude's language ("absolutely right," "100% production ready" while tests fail) causes genuine frustration. Codex communicates more like an introverted engineer—chugs along, gets stuff done, pushes back on silly requests. This matters for mental health during long coding sessions.
Reading Before Acting
The most underrated difference: GPT-5 Codex reads far more files before deciding what to do. It pushes back harder when you make questionable requests. Claude and other agents are more eager—they just try something even when uncertain.
This changes prompt engineering fundamentally. With Claude, you need extensive context in prompts to compensate. With Codex, prompts became significantly shorter—often just 1-2 sentences plus an optional screenshot. The model already understands your codebase deeply before suggesting changes.
The Blast Radius Principle
"Blast radius" refers to estimating how many files a change will touch before executing it. This concept transforms how you think about agent orchestration and parallel workflows.
Small Bombs vs Fat Man
When planning work, you have intuition about complexity and scope. You can throw many small bombs at your codebase or one "Fat Man" and a few small ones. The blast radius determines your approach (a rough code sketch of these thresholds follows the list):
- Small blast radius (1-5 files): Perfect for parallel agents, easy rollback, clean atomic commits
- Medium blast radius (6-20 files): Single agent with monitoring, "what's the status" check-ins
- Large blast radius (20+ files): Consider "give me options before making changes" to gauge impact
- Multiple large bombs: Impossible to do isolated commits, much harder to reset if something goes wrong
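As a rough illustration (not part of the original workflow), the thresholds above can be sketched as a small TypeScript helper; the file counts and strategy names come from the list, everything else is hypothetical.

```typescript
// Hypothetical sketch: mapping an estimated blast radius to an orchestration strategy.
// The thresholds mirror the list above; names and structure are illustrative only.
type Strategy =
  | "parallel-agents"        // small: safe for the terminal grid, atomic commits
  | "single-agent-monitored" // medium: one agent, periodic "what's the status" check-ins
  | "ask-for-options-first"  // large: gauge impact before any files change
  | "split-the-work";        // multiple large bombs: break into smaller tasks instead

function pickStrategy(estimatedFiles: number, largeTasksInFlight = 0): Strategy {
  if (largeTasksInFlight > 0 && estimatedFiles > 20) return "split-the-work";
  if (estimatedFiles <= 5) return "parallel-agents";
  if (estimatedFiles <= 20) return "single-agent-monitored";
  return "ask-for-options-first";
}

// Example: a refactor expected to touch ~12 files gets one monitored agent.
console.log(pickStrategy(12)); // "single-agent-monitored"
```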
Developing Blast Radius Intuition
Over time you develop feelings for task complexity. You know before starting if a change will touch 3 files or 30. This intuition guides decisions about:
- Whether to use parallel agents or a single focused agent
- Whether to ask for options/plan first or just start building
- Whether work needs separate folder/context or can share main context
- How long to let an agent run before checking status or intervening
Parallel Agent Orchestration Strategy
Run 3-8 agents in parallel in a 3x3 terminal grid, most of them in the same folder, with the occasional experiment in a separate folder. This setup ships features faster than any traditional branching strategy.
Why Same-Folder Beats Worktrees
One dev server, one application, multiple simultaneous changes. As the project evolves, click through and test multiple changes at once. This workflow proves significantly faster than alternatives:
- One dev server: Test all changes together by clicking through the application
- Atomic commits: Each agent commits only the files it edited
- Real-time integration: See how changes interact immediately
- No OAuth limits: A single domain covers all callback testing
- Faster iteration: No context switching between branches and servers
By contrast, worktrees and multiple dev servers bring their own problems:
- Multiple dev servers: Quickly get annoying and resource-heavy
- OAuth limitations: Only a limited number of domains can be registered for callbacks
- Context switching: Slower to test interactions between changes
- Setup overhead: Spinning environments up and down adds friction
Atomic Git Commits Per Agent
Agents make atomic git commits themselves for exactly the files they edited. Maintaining clean commit history required iterating on agent configuration to make git operations sharper. The result: 3-8 agents working simultaneously with minimal merge conflicts.
Models are incredibly clever—no hook will stop them if they're determined, but clear instructions in agent configuration work well. The key: explain that multiple agents work in the same folder and each should only commit their own changes.
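A minimal sketch of that per-agent commit discipline, assuming each agent keeps track of the files it actually edited (the file paths and commit message are hypothetical; the real setup relies on instructions in the agent configuration rather than a script):

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical example: an agent stages and commits only the files it edited,
// leaving other agents' work-in-progress untouched in the shared folder.
const editedFiles = ["src/billing/invoice.ts", "src/billing/invoice.test.ts"];

execFileSync("git", ["add", "--", ...editedFiles], { stdio: "inherit" });
execFileSync(
  "git",
  ["commit", "-m", "billing: handle zero-amount invoices"],
  { stdio: "inherit" },
);
```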
Prompt Engineering: Just Talk To It
The controversial truth: elaborate prompting tricks and agent instructions mostly don't matter with GPT-5 Codex. The model is good enough that you can just talk to it like a human colleague.
From Verbose to Concise
With Claude, extensive prompts helped compensate for context gaps—the more context supplied, the better results. With GPT-5 Codex, prompts became dramatically shorter. Often just 1-2 sentences plus a screenshot.
At least 50% of prompts contain screenshots. Drag image into terminal, model finds exactly what you show, matches strings, arrives at the right place. No annotation needed (though it helps for complex cases). A screenshot takes 2 seconds and provides immense context.
Trigger Words That Help
When things get hard, certain phrases improve results noticeably:
- "take your time" — prevents rushing through complex problems
- "comprehensive" — encourages thorough analysis
- "read all code that could be related" — broader context gathering
- "create possible hypothesis" — explores multiple solution paths
- "preserve your intent" — maintains code purpose through changes
- "add code comments on tricky parts" — helps future model runs
But these are gentle nudges, not elaborate instructions. The fundamental approach remains conversational.
Conversational Development Pattern
Start discussions with Codex by pasting websites, sharing ideas, asking it to read code. Flesh out features together. For complex features, ask it to write everything into a spec, send that to GPT-5-Pro via chatgpt.com for review (surprisingly often improves the plan significantly), then paste back useful suggestions.
For UI work: start with something woefully under-specified. Watch the model build. See browser update in real-time. Queue additional changes and iterate. Often don't fully know how something should look—play with ideas, see them come to life. Sometimes Codex builds something interesting you didn't even think of. Don't reset, iterate and morph the chaos into shape.
The Tooling Ecosystem Reality Check
Controversial opinion: Most agentic engineering tools solve non-problems. RAG, elaborate MCPs, custom plugins, subagent orchestration systems—these work around current inefficiencies that GPT-5 Codex largely eliminated.
Why MCPs Are Usually Wrong
Almost all MCPs should be CLIs instead. Reference a CLI by name and the model already knows how to use it from world knowledge, at zero context tax. On the first incorrect call the CLI prints its help, the context then has the full usage info, and it works from then on.
- GitHub MCP: 23K tokens gone (was 50K at launch)
- gh CLI alternative: Same feature set, zero context tax
- Models already know gh CLI from world knowledge—no explanation needed
Exception: chrome-devtools-mcp for closing the loop on web debugging; it replaced Playwright for browser interaction. Even this isn't needed daily, since hitting most API endpoints directly via curl with API keys is faster and more token-efficient.
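To make the CLI-over-MCP point concrete, here is a hedged sketch of shelling out to the gh CLI from a script: no MCP server, no schema loaded into context, just a tool the model already knows. The repository name is a placeholder.

```typescript
import { execFileSync } from "node:child_process";

// Roughly the same capability the GitHub MCP exposes, but via the gh CLI the
// model already knows from world knowledge. Output is plain JSON to parse or paste.
const out = execFileSync(
  "gh",
  ["pr", "list", "--repo", "acme/example-app", "--limit", "5", "--json", "number,title,author"],
  { encoding: "utf8" },
);

for (const pr of JSON.parse(out)) {
  console.log(`#${pr.number} ${pr.title} (${pr.author.login})`);
}
```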
Subagents vs Separate Windows
Subagents were originally called "subtasks" back in May: a way to spin work out into a separate context when the main conversation doesn't need the full output, mainly for parallelization or for keeping noisy build scripts from wasting context.
What others do with subagents, do with separate terminal windows. This gives complete control and visibility over context engineering, unlike subagents which make it harder to view and steer what gets sent back.
If you want to research something, do it in a separate terminal pane and paste results to another. Simple, controllable, visible.
The Plugin/Agent Instructions Reality
Claude Code promotes elaborate agent instructions and plugins as ways to improve model behavior. Looking at their recommended "AI Engineer" agent reveals the problem: an autogenerated soup of words that name-drops GPT-4o and o1 for integrations, with no actual meat.
Telling a model "You are an AI engineer specializing in production-grade LLM applications" doesn't change output quality. Giving it documentation, examples, and do/don't patterns helps. You'd get better results asking the agent to "google AI agent building best practices" and load some websites than using vague role-play instructions.
Real-World Implementation Patterns
After months shipping production code with agentic workflows, certain patterns consistently outperform others. These aren't theoretical—they're battle-tested approaches from real projects.
The Web-Based Agent Role
Codex web serves as a short-term issue tracker: capture ideas on the go via the iOS app, then review them later on the Mac. Mobile capabilities are intentionally kept limited; work is already addictive enough without being pulled back in during downtime.
Web tasks originally didn't count toward usage limits, but those days are numbered. They remain valuable for capturing ideas without disrupting flow.
Background Task Management
GPT-5 Codex currently lacks background task management—one of Claude's advantages. CLI tasks that don't end (dev servers, tests that deadlock) can get stuck.
Workaround: tmux, an old tool for running CLIs in persistent background sessions. The model has plenty of world knowledge about tmux; just prompt "run via tmux." No custom agent.md charade needed.
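A sketch of what "run via tmux" amounts to in practice: start the dev server in a detached session so the shell call returns, then peek at its output later. It is wrapped in TypeScript only to stay consistent with the other examples; the session name and command are assumptions.

```typescript
import { execFileSync } from "node:child_process";

// Start a long-running dev server in a detached tmux session so the agent's
// shell call returns immediately instead of hanging on a process that never exits.
execFileSync("tmux", ["new-session", "-d", "-s", "devserver", "npm run dev"]);

// Later, capture the current pane output to check logs or errors.
const logs = execFileSync("tmux", ["capture-pane", "-p", "-t", "devserver"], {
  encoding: "utf8",
});
console.log(logs);
```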
Queue Up Continue Messages
Instead of crafting the perfect prompt to keep a long-running task going, use a lazy workaround: for bigger refactors where Codex often stops mid-work, queue up a few "continue" messages before stepping away. If Codex finishes early, it happily ignores the leftover messages.
Write Tests After Each Feature
Ask the model to write tests after each feature/fix is done—use the same context. This leads to far better tests and likely uncovers bugs in your implementation. If it's purely UI tweaks, tests make less sense. For anything else, do it.
AI is generally bad at writing good tests, but tests written with full implementation context in the same conversation turn out far better than tests generated separately.
Agent Configuration Approach
The agent file is ~800 lines of organizational scar tissue. It wasn't written by hand; Codex wrote it. Anytime something notable happens, ask it to add a concise note. The file grew organically from actual pain points, not speculative "best practices."
Key sections: git instructions, product explanation, naming and API patterns, React Compiler notes, preferred React patterns, database migration management, testing, and ast-grep rules. Anything newer than the model's knowledge cutoff gets documented; anything the model already knows gets removed.
Refactoring & Code Quality Management
About 20% of time goes to agent-driven refactoring, and all of it is executed by agents, so no manual time is wasted. Refactor days are great for when you need less focus or feel tired, since you can make solid progress without intense concentration.
Typical Refactoring Work
- Code duplication: Using jscpd to identify and consolidate duplicates (see the sketch after this list)
- Dead code: Running knip to find unused exports and imports
- React Compiler: Running eslint-plugin-react-compiler and deprecation plugins
- API consolidation: Checking for routes that can be merged
- Documentation: Maintaining docs, adding comments for tricky parts
- File size: Breaking apart files that grew too large
- Test quality: Finding and rewriting slow tests
- Modern patterns: Updating to latest React patterns (you might not need useEffect)
- Dependencies: Tool upgrades and version updates
- File structure: Reorganizing for better clarity
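As one hedged example of how a refactor day can start, a tiny script can run the duplication and dead-code checks from the list above and hand the reports to an agent. The paths and thresholds are assumptions, not the article's actual setup.

```typescript
import { execFileSync } from "node:child_process";

// Illustrative only: gather refactoring signals, then paste the reports into an
// agent conversation ("consolidate the worst duplication cluster", etc.).
function run(cmd: string, args: string[]): string {
  try {
    return execFileSync(cmd, args, { encoding: "utf8" });
  } catch (err: any) {
    // jscpd and knip exit non-zero when they find issues; the report is still useful.
    return err.stdout?.toString() ?? String(err);
  }
}

const duplication = run("npx", ["jscpd", "src", "--min-tokens", "50"]);
const deadCode = run("npx", ["knip"]);

console.log("=== Duplication report (jscpd) ===\n" + duplication);
console.log("=== Unused exports and files (knip) ===\n" + deadCode);
```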
The "Code is Slop" Argument
Critics argue AI-generated code is "slop." The response: 20% of time on refactoring addresses this through systematic quality maintenance. This isn't unique to AI—human-written code also accumulates technical debt without regular cleanup.
The advantage: agents execute refactoring far faster than humans. What would take days of manual work completes in hours. This makes regular quality maintenance actually sustainable rather than perpetually postponed.
Conclusion: Develop Intuition, Skip Charades
Don't waste time on stuff like RAG, subagents, elaborate agent instructions, or custom tooling that solves non-problems. Just talk to it. Play with it. Develop intuition. The more you work with agents, the better your results will be.
Many of the skills needed to manage agents mirror those for managing human engineers; they are the characteristics of senior software engineers: understanding task complexity, breaking down problems, giving clear direction, and knowing when to intervene or let work continue. These are fundamentally people skills applied to AI systems.
Yes, writing good software is still hard. Just because you don't write code anymore doesn't mean you don't think hard about architecture, system design, dependencies, features, or how to delight users. Using AI simply means expectations of what to ship went up dramatically.