March 22, 2026 · 22 min read · 6 providers

5B Token Cost Optimization for Always-On LLM Agents

Where 5B tokens are spent in agentic workflows — and how prompt caching, model routing, RAG tuning, and compaction can cut costs 5–19× across major LLMs.

Key Finding

Loading AGENTS.md, SOUL.md, MEMORY.md, and tool schemas on every request (5,000–10,500 tokens) is the single largest hidden cost in agentic frameworks, potentially costing $60,000–$122,000/month at scale before optimization.

High confidence · Supported by Anthropic, Gemini-Lite, Perplexity, Grok-Premium
Justin Furniss

@Parallect.ai and @SecureCoders. Founder. Hacker. Father. Seeker of all things AI

anthropic · gemini · gemini-lite · grok-premium · openai · perplexity

The Hidden Cost of 5 Billion Tokens: Cross-Provider Synthesis


Executive Summary

  • The unoptimized-to-optimized cost spread is 5–19×, representing the single most important finding across all six providers. An unoptimized 5B-token operation on flagship models costs $30,000–$236,500/month; a fully optimized one costs $3,000–$16,000/month. The spread is so large that architecture is your cost structure — provider selection is secondary.

  • Prompt caching is the highest-leverage single optimization, with Anthropic offering 90% discounts on cache reads and Google offering similar rates, versus OpenAI's 50%. However, practical cache hit rates in production (45–65%) fall significantly short of marketing claims (90%+), and the gap between theoretical and achievable savings is a critical blind spot most providers understate.

  • Static context reloading — AGENTS.md, SOUL.md, MEMORY.md, tool schemas — is the largest hidden tax in agentic frameworks. Loading 5,000–10,500 tokens of static identity files on every request can cost $60,000–$122,000/month alone at scale, before any actual work is done. This is the optimization most builders miss entirely.

  • Model routing (70% Haiku / 20% Sonnet / 10% Opus or equivalent) delivers 60–75% cost reduction and is the highest-impact strategy by savings magnitude. Most builders default to flagship models for all tasks; routing simple classification, triage, and extraction to sub-$1/MTok models is the fastest path to cost sanity.

  • Long-context surcharges are an emerging "bill shock" vector: GPT-5.4 doubles input costs beyond 272K tokens; Gemini 2.5 Pro doubles beyond 200K. Uncompacted conversation history in agentic loops will silently trigger these thresholds, creating non-linear cost explosions that standard monitoring won't catch until the invoice arrives.


Cross-Provider Consensus

1. Prompt Caching Provides 50–90% Discount on Repeated Input Tokens

Providers agreeing: Anthropic, Gemini, Gemini-Lite, Grok-Premium, OpenAI, Perplexity (all six). Confidence: HIGH

Every provider independently confirmed that Anthropic and Google offer ~90% discounts on cached input tokens, while OpenAI offers ~50%. All providers agree this is the single most impactful cost lever for agentic workloads where system prompts and tool schemas repeat across requests. The mechanism (exact prefix matching) and the requirement for static content to precede dynamic content were confirmed by at least four providers.
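The prefix-matching requirement can be sketched as a request builder. The `cache_control` breakpoint below follows Anthropic's prompt-caching API shape; the model id, file contents, and tool list are placeholders, not real configuration:

```python
# Cache-friendly ordering: byte-identical static content first, a cache
# breakpoint after it, and all per-request content below the breakpoint.

STATIC_SYSTEM = "<contents of AGENTS.md + SOUL.md + MEMORY.md>"  # placeholder
TOOL_SCHEMAS = [  # static: must be identical on every request
    {"name": "search_email", "description": "Search the inbox",
     "input_schema": {"type": "object"}},
]

def build_request(user_message: str, todays_context: str) -> dict:
    return {
        "model": "claude-opus-4-6",  # placeholder model id
        "tools": TOOL_SCHEMAS,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM,
                # Everything above this breakpoint is cacheable; no
                # timestamps or session ids may appear in it.
                "cache_control": {"type": "ephemeral"},
            },
            # Dynamic content comes AFTER the breakpoint, so it never
            # invalidates the cached prefix.
            {"type": "text", "text": todays_context},
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Triage today's inbox", "Context injected at runtime")
```

A single character changed above the breakpoint invalidates the whole cached prefix, which is why the static/dynamic split matters more than the discount rate itself.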

2. Batch API Provides 50% Discount Across All Major Providers

Providers agreeing: Anthropic, Gemini, Grok-Premium, OpenAI, Perplexity (five of six). Confidence: HIGH

All providers confirmed that Anthropic, OpenAI, Google, and xAI offer approximately 50% discounts for asynchronous batch processing with ~24-hour SLAs. All agree that 30–50% of typical agentic workloads (nightly cron jobs, batch analytics, content generation) qualify for batch processing without user-experience degradation.

3. Model Routing (Haiku/Flash for Simple Tasks) Delivers 60–75% Cost Reduction

Providers agreeing: Anthropic, Gemini, Grok-Premium, OpenAI, Perplexity (five of six). Confidence: HIGH

Multiple providers independently confirmed that routing 60–70% of agentic tasks to lightweight models (Claude Haiku at $1/$5 per MTok, Gemini Flash, GPT-4o Mini, Grok 4.1 Fast at $0.20/$0.50) while reserving flagship models for complex reasoning delivers the largest single cost reduction. Haiku 4.5 achieves ~90% of Sonnet performance on coding tasks at one-third the cost (Anthropic). Real-world deployments show 60%+ cost reductions from routing alone (Gemini, Grok-Premium).
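The routing pattern can be sketched as a tier picker plus blended-price arithmetic. Haiku ($1/$5) and Opus ($5/$25) prices come from the figures above; the Sonnet price ($3/$15) and the keyword lists are illustrative assumptions (production systems use a trained classifier, not keywords):

```python
# Heuristic tier routing and the blended input price it produces.

PRICES = {"haiku": (1.00, 5.00), "sonnet": (3.00, 15.00), "opus": (5.00, 25.00)}

COMPLEX_HINTS = ("architecture", "multi-step", "refactor", "prove")
MEDIUM_HINTS = ("summarize", "draft", "analyze")

def route(task: str) -> str:
    t = task.lower()
    if any(h in t for h in COMPLEX_HINTS):
        return "opus"      # ~10% of traffic: deep reasoning
    if any(h in t for h in MEDIUM_HINTS):
        return "sonnet"    # ~20%: moderate generation and analysis
    return "haiku"         # ~70%: classification, triage, extraction

def blended_input_price(mix) -> float:
    return sum(PRICES[m][0] * share for m, share in mix.items())

routed = blended_input_price({"haiku": 0.7, "sonnet": 0.2, "opus": 0.1})
# $1.80/MTok blended vs. $5.00 all-Opus: a 64% input-cost reduction,
# inside the 60-75% consensus band
```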

4. Static Context Reloading (AGENTS.md/SOUL.md/MEMORY.md) Is the Primary Hidden Tax

Providers agreeing: Anthropic, Gemini, Gemini-Lite, Grok-Premium, Perplexity (five of six). Confidence: HIGH

All five providers independently identified the practice of loading large static identity and configuration files on every request as the most commonly overlooked cost driver. Estimates range from 5,000–10,500 tokens of static overhead per request, representing 15–25% of total input tokens. At scale, this single inefficiency can cost $60,000–$122,000/month before any productive work occurs.

5. Conversation History Growth Creates Compounding Costs Without Compaction

Providers agreeing: Anthropic, Gemini, Gemini-Lite, OpenAI, Perplexity (five of six). Confidence: HIGH

All five providers confirmed that unmanaged conversation history is a silent budget killer that grows quadratically in agentic loops. Compaction strategies (summarization after N turns, sliding window, hierarchical memory) can reduce history tokens by 60–95%. One platform reported 95% token reduction from conversation summarization (OpenAI). The consensus recommendation is to summarize after every 5–10 turns.
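The consensus recommendation can be sketched as sliding-window compaction: after N turns, everything except the most recent turns is replaced by a summary. The `summarize` stub stands in for a cheap-model summarization call; the specific constants are illustrative:

```python
# Sliding-window compaction: old turns collapse into one summary message.

from typing import Dict, List

COMPACT_EVERY = 8   # consensus above: compact every 5-10 turns
KEEP_RECENT = 4     # verbatim tail kept for conversational continuity

def summarize(turns: List[Dict]) -> str:
    # Production version: send `turns` to Haiku/Flash with a prompt like
    # "Summarize decisions, open tasks, and key facts in under 800 tokens."
    return f"[summary of {len(turns)} earlier turns]"

def compact(history: List[Dict]) -> List[Dict]:
    if len(history) < COMPACT_EVERY:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [{"role": "system", "content": summarize(old)}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)   # 10 messages become 5 (summary + last 4)
```

Because every turn re-sends the whole history, shrinking the history compounds: the savings apply to every subsequent request, not just one.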

6. Long-Context Surcharges Create Non-Linear Cost Cliffs

Providers agreeing: Gemini, Grok-Premium, OpenAI, Perplexity (four of six). Confidence: HIGH

Four providers independently confirmed that GPT-5.4 doubles input costs beyond 272K tokens and Gemini 2.5 Pro doubles beyond 200K tokens. Uncompacted agentic sessions will silently breach these thresholds. Anthropic notably does not apply long-context surcharges on its 1M context window (confirmed by Anthropic and Gemini providers), making it structurally advantageous for long-context agentic workloads.
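The cliff is easy to quantify. The sketch below assumes, as a simplification, that only tokens past the threshold are billed at the doubled rate; prices and thresholds are the figures above:

```python
# Tiered input pricing with a doubling surcharge past the threshold.

def input_cost(tokens: int, base_per_mtok: float, threshold: int) -> float:
    below = min(tokens, threshold)
    above = max(tokens - threshold, 0)
    return (below + 2 * above) * base_per_mtok / 1e6

# GPT-5.4: $2.50/MTok base, surcharge past 272K input tokens
at_threshold = input_cost(272_000, 2.50, 272_000)   # $0.68 per request
doubled_ctx  = input_cost(544_000, 2.50, 272_000)   # $2.04 per request
# Doubling the context triples the cost: the growth is non-linear.
```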

7. RAG Chunk Size Tuning (~512 Tokens) Reduces Context Bloat 60–70%

Providers agreeing: Anthropic, Gemini, OpenAI, Perplexity (four of six). Confidence: MEDIUM

Four providers confirmed that default RAG configurations (1,024-token chunks, k=5) inject 5,000+ tokens per request unnecessarily. Tuning to 512-token chunks with k=3, combined with reranking, reduces RAG context by 60–70% with minimal quality loss. The 512-token sweet spot was specifically cited by OpenAI and Perplexity as validated in production.
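The claimed reduction is direct arithmetic on those settings (reranking, which preserves quality at k=3, is not modeled here):

```python
# Per-request retrieval budget: default vs. tuned configuration.

def rag_tokens(k: int, chunk_size: int) -> int:
    return k * chunk_size

default_budget = rag_tokens(5, 1024)            # 5,120 tokens injected
tuned_budget = rag_tokens(3, 512)               # 1,536 tokens injected
reduction = 1 - tuned_budget / default_budget   # 0.70: 70% fewer RAG tokens
```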

8. Grok 4.1 Fast ($0.20/$0.50 per MTok) Is the Cheapest Production-Grade Option

Providers agreeing: Anthropic, Gemini, Grok-Premium (three of six). Confidence: MEDIUM

Three providers confirmed Grok 4.1 Fast pricing at $0.20 input / $0.50 output per million tokens with a 2M token context window, making it the cheapest option among production-grade models by a significant margin. However, ecosystem maturity, reliability data, and tool-calling quality relative to Claude/GPT remain less validated.


Unique Insights by Provider

Anthropic

  • The "SEV-level" cache hit rate signal: The Anthropic provider cited that the Claude Code team declares severity incidents (SEVs) when cache hit rates drop, and that production Claude Code sessions achieve 96% cache hit rates with aggressive breakpoint placement. This is the most specific real-world cache performance benchmark in any report and establishes a concrete target for what's achievable with disciplined engineering. It also reveals that cache architecture is treated as a first-class reliability concern, not just a cost optimization.

  • The 5× multiplier from advisory council fan-out: The Anthropic provider offered the most explicit quantification of multi-agent cost multiplication — a "business advisory council" fanning out to 5 expert personas multiplies input tokens by 5× for a single logical operation. This framing helps builders understand why multi-agent architectures require fundamentally different cost models than single-agent ones.

Gemini

  • Gemini's explicit caching storage cost warning: The Gemini provider was the only one to clearly flag that Google's explicit context caching carries a storage fee of $4.50 per million tokens per hour — a cost that can exceed the savings for always-on agents that maintain large caches continuously. This is a critical gotcha that other providers omitted entirely and could cause budget surprises for teams migrating to Gemini for its caching discount.

  • GPT-5.4 "Tool Search" feature reducing tool token usage by 47%: The Gemini provider uniquely identified a GPT-5.4 native feature called "Tool Search" that allows the LLM to receive a lightweight tool index and retrieve full schemas on demand, reducing total tool token usage by 47% across 250 tested tasks. No other provider mentioned this feature.

  • Wasted compute from hidden reasoning tokens: The Gemini provider was the most explicit about "Extended Thinking" / reasoning tokens being billed as output tokens but hidden from the final response — meaning an agent that spends 2,000 tokens reasoning in the wrong direction before hitting an error represents a total financial loss at $10–25/MTok output rates.

Gemini-Lite

  • "Reliability-adjusted cost per task" as the correct optimization metric: The Gemini-Lite provider introduced the most important reframing in any report: don't optimize for cost per token, optimize for reliability-adjusted cost per task. A slightly more expensive model that solves a task in one pass is vastly cheaper than a cheap model requiring five iterations and human intervention. This is a crucial corrective to the naive "cheapest token wins" approach and deserves to be a first-order design principle.

  • "Token Spirals" as a distinct failure mode: Gemini-Lite identified "token spirals" — where lack of context management causes agents to loop and re-read, multiplying token usage 2–3× beyond the nominal budget — as a distinct failure mode separate from simple inefficiency. This is a qualitatively different problem from static overhead and requires loop detection and retry limits as architectural safeguards.

Grok-Premium

  • Practical cache hit rate range of 80–95% with latency benefits: The Grok-Premium provider cited specific production evidence of 80–95% cache hit rates with 13–31% TTFT (time-to-first-token) improvements, providing the most detailed latency-benefit quantification alongside cost savings. The latency benefit is often ignored in cost analyses but matters significantly for real-time agentic systems.

  • Cross-provider routing tools (RunCost) with budget guards and loop detection: Grok-Premium was the only provider to specifically name RunCost as a production tool for automated model routing with budget guards and loop detection for always-on systems. This is actionable tooling intelligence that other providers omitted.

  • Tool-use overhead quantification (~300–700 tokens per tool): Grok-Premium provided the most specific quantification of per-tool overhead in the prompt — 300–700+ tokens per relevant tool definition — enabling more precise cost modeling than the aggregate estimates other providers offered.

OpenAI

  • 31% of production LLM queries have high semantic similarity to previous requests: The OpenAI provider cited research showing roughly 31% of production queries are semantically similar to prior requests, establishing a concrete baseline for the minimum expected cache hit rate even without architectural optimization. This is the most specific demand-side caching benchmark in any report.

  • YouTube analyst case study: 90% cost reduction from caching 81K-token metadata: The OpenAI provider supplied the most concrete real-world case study — an AI YouTube analyst that reduced per-query cost from €0.24 to €0.024 (90% reduction) by caching 81,000 tokens of video metadata rather than reloading it on every query. This is the most specific dollar-amount example of static context caching ROI in any report.

Perplexity

  • Practical cache hit ceiling of 60–65% for always-on agents (vs. 90% marketing claims): The Perplexity provider was the most explicit and detailed in explaining why production cache hit rates cannot reach the marketed 90%+ for always-on agents: daily context updates, user-specific personalization, rotating safety rules, and non-deterministic tool results all prevent full cache utilization. This is the most important corrective to provider marketing claims in the entire analysis.

  • Token inflation trap with quarterly monitoring recommendation: Perplexity uniquely identified "token inflation" — the gradual accumulation of new tool schemas, safety rules, and agent definitions over time — as a structural cost risk requiring quarterly audits. The specific 5% quarter-over-quarter growth threshold as a trigger for bloat audits is actionable operational guidance no other provider offered.

  • Detailed ROI calculation for optimization implementation effort: Perplexity provided the only explicit ROI analysis for the engineering investment required: partial optimization (40 hours, ~$4,000) yields 117:1 ROI in year one; full optimization (300+ hours, $30,000–50,000) yields 26–49:1 ROI. This transforms the optimization question from "should we?" to "how much should we invest?"


Contradictions and Disagreements

Contradiction 1: Achievable Cache Hit Rates (45–65% vs. 80–96%)

The disagreement: Perplexity argues that practical cache hit rates for always-on agents max out at 60–65% due to structural constraints (rotating context, personalization, non-deterministic tool results). Grok-Premium cites production evidence of 80–95% hit rates. The Anthropic provider cites 96% for Claude Code sessions.

Evidence for lower rates (Perplexity): Daily/weekly context updates, user-specific personalization, rotating safety guardrails, and non-deterministic tool results all prevent full cache utilization. These are structural constraints, not engineering failures.

Evidence for higher rates (Anthropic, Grok-Premium): Claude Code achieves 96% with aggressive breakpoint placement. Production systems with disciplined prompt architecture achieve 80–95%. The Anthropic provider notes that even a single character change invalidates cache, implying high rates require strict engineering discipline.

Resolution attempt: These may not be contradictory — they may describe different workload types. Claude Code has highly repetitive, structured prompts ideal for caching. A CRM agent with per-customer context is structurally less cacheable. Builders should target 60–65% as a realistic baseline for mixed agentic workloads and treat 80%+ as achievable only for highly structured, repetitive workflows. Do not use 90% in financial projections.


Contradiction 2: Total Cost of Unoptimized 5B Token Operation ($29,375 vs. $236,500)

The disagreement: Provider cost estimates for an unoptimized 5B token operation vary by nearly 8×:

  • Gemini provider: $29,375/month (using GPT-5.4 as baseline)
  • Anthropic provider: $55,000/month (using Claude Opus 4.6 as baseline)
  • Perplexity provider: $171,750/month (using Claude Opus 5 at projected 2026 prices)
  • Anthropic provider (full 5B over 4.3 months): $236,500 total

Root cause: These estimates use different model assumptions, different token split ratios (70/30 vs. 80/20 input/output), different pricing vintages (current vs. projected 2026), and different definitions of "unoptimized" (some include long-context surcharges, some don't).

This is not resolvable without knowing: (a) which model you're actually using, (b) your actual input/output ratio, (c) whether you're hitting long-context surcharges, and (d) whether you're on current or projected pricing.

Practical guidance: Use the Anthropic provider's math for Claude-centric operations ($55,000/month at current pricing), the Gemini provider's math for GPT-5.4-centric operations ($29,375/month), and treat Perplexity's figures as forward-looking projections for 2026 model generations. The spread between optimized and unoptimized (5–19×) is more reliable than the absolute numbers.


Contradiction 3: OpenAI Prompt Caching Capability

The disagreement: The OpenAI provider states OpenAI "does not (yet) have built-in prompt caching for developers." The Gemini provider states GPT-5.4 cached input costs $0.25/M (a 90% discount). The Grok-Premium provider states cached input is ~$0.25/M. The Anthropic provider states cached input is $1.25/M (50% discount, not 90%).

The likely resolution: OpenAI does have automatic prompt caching (confirmed by multiple providers), but the discount rate is disputed — 50% (Anthropic provider) vs. 90% (Gemini provider). The OpenAI provider's claim that caching doesn't exist appears to be outdated or incorrect. Use 50% as the conservative estimate for OpenAI cached input pricing; the 90% figure cited by Gemini may reflect a specific tier or promotional rate.


Contradiction 4: GPT-5.4 Pricing ($1.75/M vs. $2.50/M input)

The disagreement: The OpenAI provider cites GPT-5.4 at $1.75/M input / $14/M output. The Anthropic and Gemini providers cite $2.50/M input / $15/M output. The Grok-Premium provider cites $2.50/M input.

Likely explanation: The OpenAI provider may be using a different model variant (possibly GPT-5.4 standard vs. a specific tier), or pricing may have changed between report generation dates. Use $2.50/$15 as the consensus figure; $1.75/$14 should be verified directly with OpenAI's current pricing page.


Contradiction 5: Whether Gemini's Caching Is Viable for Always-On Agents

The disagreement: The Gemini provider presents Gemini 2.5 Pro as highly competitive with 90% cache discounts. The Perplexity provider explicitly flags that Gemini's 1-hour cache expiration "kills overnight batch processing" and limits effective cache hit rates to 20–35% for 24/7 agents. The Anthropic provider does not mention this limitation.

This is a genuine architectural constraint: Gemini's 1-hour cache TTL vs. Anthropic's persistent caching (with 5-minute and 1-hour TTL options that can be renewed) creates a structural disadvantage for always-on agents with variable request timing. For 24/7 agentic workloads, Anthropic's caching architecture is more suitable despite Gemini's lower base prices. Gemini's caching advantage is most relevant for batch workloads with predictable timing.


Detailed Synthesis

The Architecture of Agentic Token Consumption

The foundational insight across all six providers is that agentic token consumption is multiplicative, not additive [Gemini, Grok-Premium, Perplexity]. Unlike a chat interface where a user sends a message and receives a response, an always-on agent framework running 20+ automated systems operates through iterative ReAct loops (Reason → Act → Observe) where each step appends to an ever-growing context window [Gemini, OpenAI]. By the tenth step of a complex task, the model is re-processing the entire history of the first nine steps — a quadratic growth pattern that makes token consumption explode in ways that simple per-query cost estimates completely miss [Gemini].

Breaking down a single agentic invocation reveals the anatomy of this problem [Perplexity, Anthropic]:

The static overhead layer is the largest hidden tax. System prompts, tool definitions, and identity files (AGENTS.md, SOUL.md, MEMORY.md) loaded on every request can consume 5,000–10,500 tokens before any productive work begins [Anthropic, Perplexity]. At 4,200 requests/day with Claude Opus 4.6, this static overhead alone costs $6,615/month without caching — dropping to $728/month with 90% cache hits [Anthropic]. The Perplexity provider's detailed breakdown shows AGENTS.md/CONFIG.md reloading can cost $122,090/month at scale, making it potentially the single largest optimization target in the entire stack.
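The $6,615/month figure can be reproduced directly, assuming the full 10,500-token prefix is re-sent on every request at Opus 4.6 input pricing; the cached variant assumes Anthropic's 90%-off read rate and a 1.25× cache-write premium on misses:

```python
# Static-overhead cost model for the figures cited above.

STATIC_TOKENS = 10_500        # identity files + tool schemas per request
REQUESTS_PER_DAY = 4_200
INPUT_PRICE_PER_MTOK = 5.00   # Claude Opus 4.6 input

monthly_tokens = STATIC_TOKENS * REQUESTS_PER_DAY * 30       # 1.323B tokens
uncached_cost = monthly_tokens / 1e6 * INPUT_PRICE_PER_MTOK  # $6,615/month

def cached_cost(hit_rate: float) -> float:
    # Hits billed at the 90%-discounted read rate; misses pay the
    # assumed 1.25x cache-write premium.
    read = monthly_tokens * hit_rate / 1e6 * INPUT_PRICE_PER_MTOK * 0.10
    write = monthly_tokens * (1 - hit_rate) / 1e6 * INPUT_PRICE_PER_MTOK * 1.25
    return read + write
```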

The tool schema layer adds 2,000–5,000 tokens per request for JSON schemas across 10–30 tools [Anthropic], with Grok-Premium quantifying individual tool overhead at 300–700 tokens per tool. The critical insight from Gemini is that tool execution failures compound this cost: when agents hallucinate arguments, receive error responses, and retry, each retry sends the entire accumulated history — original prompt, failed tool call, error message, new attempt — back to the model, creating a retry tax that can multiply tool-related costs by 3–5× in poorly designed systems.

The RAG context layer typically injects 4,000–12,000 tokens per request [Anthropic, Perplexity], with unoptimized pipelines fetching far more context than necessary. The Gemini provider notes that agentic RAG (iterative retrieval across multiple steps) improves accuracy but compounds token costs as retrieved documents from Step 1 remain in context through Step 5. The consensus recommendation across four providers is 512-token chunks with k=3 and a reranking step, reducing RAG context by 60–70% [Anthropic, OpenAI, Perplexity, Gemini].

The conversation history layer is the silent budget killer that grows without bounds [OpenAI, Gemini-Lite]. Without compaction, a 20-turn conversation can accumulate 30,000+ tokens of history that gets re-sent on every subsequent turn [Anthropic]. The Gemini provider's framing is precise: token consumption grows quadratically in agentic loops, not linearly.

The multi-agent multiplication layer is where costs become truly alarming for advisory council patterns [Anthropic, Gemini]. A "business advisory council" fanning out to 5 expert personas multiplies input tokens by 5× for a single logical operation [Anthropic]. Each persona receives the full system prompt plus context, and a synthesis agent must then process all five outputs — creating an N+1 multiplier on every council query [Gemini].
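The N+1 multiplier is simple to model: each of N personas receives the full base context, and a synthesis pass then reads the base context plus all N persona outputs. The token counts below are illustrative:

```python
# Advisory-council fan-out token math.

def council_input_tokens(base_context: int, n_personas: int,
                         persona_output: int) -> int:
    fan_out = n_personas * base_context                # N copies of the context
    synthesis = base_context + n_personas * persona_output
    return fan_out + synthesis                         # the N+1 multiplier

single_agent = 20_000                                  # one logical query
council = council_input_tokens(20_000, 5, 1_500)       # 127,500 tokens
multiplier = council / single_agent                    # ~6.4x per logical query
```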

The Provider Landscape: Caching Economics Dominate

The most important finding from the provider comparison is that sticker price is a poor predictor of actual cost for agentic workloads — caching architecture is the dominant variable [Anthropic, Gemini, Grok-Premium].

Anthropic's 90% cache read discount (cached input at $0.50/MTok vs. $5.00/MTok for Opus 4.6) fundamentally transforms the economics of always-on agents [Anthropic, Gemini]. With 85% cache hit rates, the effective input cost for Claude Opus drops from $5.00/MTok to approximately $1.25/MTok — making it more competitive than its sticker price suggests [Anthropic]. The Grok-Premium provider adds that production systems achieve 80–95% cache hit rates with 13–31% TTFT improvements, confirming that caching benefits extend beyond cost to latency.
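The effective-rate arithmetic behind that paragraph: hits are billed at the discounted read rate, misses at the base rate (cache-write premiums are ignored here for simplicity):

```python
# Effective input rate as a function of cache hit rate.

def effective_input_rate(base: float, hit_rate: float,
                         read_discount: float = 0.90) -> float:
    return hit_rate * base * (1 - read_discount) + (1 - hit_rate) * base

marketed  = effective_input_rate(5.00, 0.85)   # ~$1.18/MTok on Opus
realistic = effective_input_rate(5.00, 0.60)   # ~$2.30/MTok at the 60% ceiling
```

The gap between the two outputs is exactly why the next paragraph's 60–65% ceiling matters for financial projections.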

However, the Perplexity provider's critical corrective deserves emphasis: practical cache hit rates for always-on agents with mixed workloads max out at 60–65%, not the 90%+ implied by provider marketing. Daily context updates, user-specific personalization, rotating safety rules, and non-deterministic tool results all prevent full cache utilization. Financial projections should use 60–65% as the baseline, treating 80%+ as achievable only for highly structured, repetitive workflows like Claude Code.

Google's Gemini 2.5 Pro offers the lowest base input price at $1.25/MTok with a matching 90% cache discount [Gemini, Grok-Premium], making it theoretically the most cost-effective flagship option. However, the Perplexity provider's identification of Gemini's 1-hour cache expiration as a structural limitation for 24/7 agents is critical — effective cache hit rates drop to 20–35% for always-on workloads with variable request timing, and the $4.50/MTok/hour storage fee for explicit caching can exceed the savings for large, continuously maintained caches [Gemini provider].

OpenAI's GPT-5.4 at $2.50/$15 per MTok offers only 50% cache discounts (vs. 90% for Anthropic/Google) and introduces the most dangerous cost cliff in the market: input costs double beyond 272K tokens [Gemini, Grok-Premium, Perplexity]. For agentic workloads with uncompacted conversation history, this threshold will be breached regularly, creating non-linear cost explosions. The Gemini provider's identification of GPT-5.4's "Tool Search" feature (47% reduction in tool token usage) partially offsets this, but the long-context surcharge remains a structural disadvantage for heavy agentic use.

xAI's Grok 4.1 Fast at $0.20/$0.50 per MTok is the most disruptive pricing in the market [Anthropic, Gemini, Grok-Premium], offering a 25× input cost advantage over Claude Opus. For high-volume routing tasks, background processing, and simple classification, Grok 4.1 Fast represents a compelling option. The Gemini-Lite provider's caution about "reliability-adjusted cost per task" applies here — a model that requires multiple retries to complete a task may cost more in practice than a more expensive model that succeeds on the first attempt.

The Optimization Stack: Ranked by Impact

The consensus across providers on optimization priority, with specific quantification where available:

1. Model Routing (60–75% cost reduction) [Anthropic, Gemini, Grok-Premium, OpenAI, Perplexity]: The highest-impact single strategy. Routing 70% of tasks to Haiku/Flash/Grok Fast and reserving flagship models for complex reasoning delivers the largest absolute savings. The Anthropic provider's benchmark that Haiku 4.5 achieves 90% of Sonnet performance at one-third the cost establishes the quality-cost tradeoff. The Gemini provider's RouteLLM framework provides a concrete implementation path with trained classifiers for complexity assessment.

2. Prompt Caching Architecture (40–60% reduction on input) [all six providers]: The second-highest-impact strategy, but requiring careful engineering to achieve production-viable hit rates. The key implementation requirements — static content first, dynamic content last, no timestamps or session tokens in cached prefixes, explicit cache breakpoints — were confirmed by multiple providers [Anthropic, Gemini, Grok-Premium]. The Anthropic provider's specific warning that "adding an MCP tool, putting a timestamp in your system prompt, switching models mid-session — each of these can invalidate the entire cache and 5x your costs for that turn" is the most actionable implementation guidance.

3. Static Context Caching (AGENTS.md/SOUL.md/MEMORY.md) [Anthropic, Gemini-Lite, Perplexity]: Specifically targeting the identity file reload tax. The Perplexity provider's calculation shows this single optimization can save $102,380/month at scale. The implementation pattern — cache static blocks, load only active agent definition dynamically, compress MEMORY.md to 200 tokens — is the most specific guidance available.

4. Batch API for Async Workloads (20–25% overall reduction) [all providers]: The simplest optimization with the lowest implementation effort. All providers confirmed 50% discounts for batch processing. The Anthropic provider estimates 40–50% of total tokens in a 20-system framework qualify for batch processing (nightly content generation, batch email classification, social analytics, security reports, meeting transcripts).

5. Conversation Compaction (15–30% reduction on input) [Anthropic, Gemini, OpenAI, Perplexity]: Summarizing conversation history after every 5–10 turns. The OpenAI provider's case study of 95% token reduction from conversation summarization is the most dramatic quantification. The Perplexity provider's specific implementation — extractive summarization to 1,200 tokens + semantic rollup to 800 tokens after every 10 turns — is the most actionable guidance.

6. Selective Tool Loading (5–10% reduction) [Anthropic, Gemini, Perplexity]: Using a lightweight classifier pass to load only relevant tool schemas per request. The Perplexity provider quantifies this at $48,195/month in savings at scale. The Gemini provider's identification of GPT-5.4's native Tool Search feature as achieving 47% tool token reduction without custom implementation is the most efficient path for GPT-5.4 users.

7. RAG Chunk Tuning (5–15% reduction) [Anthropic, Gemini, OpenAI, Perplexity]: Tuning to 512-token chunks with k=3 and reranking. The Perplexity provider quantifies this at $37,260/month in savings at scale. Low implementation effort relative to impact.
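For providers without a native tool-search feature, selective tool loading (item 6 above) can be sketched as: carry only a short name-and-description index, and attach full schemas per request. The keyword overlap below is a crude stand-in for the lightweight classifier pass; the tool names are invented:

```python
# Selective tool loading: index first, full schemas on demand.

TOOL_INDEX = {                      # ~15-40 tokens each vs. 300-700 per schema
    "search_email": "find messages in the inbox",
    "create_ticket": "open a support ticket",
    "run_report": "generate an analytics report",
}
FULL_SCHEMAS = {name: {"name": name, "input_schema": {"type": "object"}}
                for name in TOOL_INDEX}

def select_tools(task: str, index=TOOL_INDEX) -> list:
    # Stand-in for a Haiku/Flash classification pass over the index.
    words = set(task.lower().split())
    picked = [n for n, desc in index.items()
              if words & set(desc.split()) or n.split("_")[0] in words]
    return [FULL_SCHEMAS[n] for n in picked]

tools = select_tools("search the inbox for unpaid invoices")
# Only search_email's full schema is attached; the other two stay indexed.
```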

The Real Math: Optimized vs. Unoptimized

The definitive cost spread, synthesizing across all providers with reconciled assumptions:

Baseline assumptions for comparison:

  • 5 billion tokens/month
  • 70% input (3.5B), 30% output (1.5B)
  • Primary model: Claude Opus 4.6 at current pricing ($5/$25 per MTok)
  • No long-context surcharges assumed (Anthropic doesn't apply them)

Unoptimized (all Opus, no caching, no routing, no batch):

  • Input: 3.5B × $5.00/MTok = $17,500
  • Output: 1.5B × $25.00/MTok = $37,500
  • Total: $55,000/month [Anthropic]

Partially optimized (basic routing + 50% cache hit rate + no batch):

  • Effective blended cost drops to ~$7,900/month [Anthropic]
  • Total: ~$8,000–$34,000/month depending on routing aggressiveness

Fully optimized (70/20/10 routing, 85% cache hit rate, 45% batch, compaction, selective tool loading, RAG tuning):

  • Effective blended rate: $2.50–$3.20/MTok vs. $47.30/MTok unoptimized [Anthropic]
  • Total: $12,500–$16,000/month [Anthropic]

The spread: 3.4–4.4× on Claude alone. When model routing shifts significant volume to Haiku ($1/$5) or Grok 4.1 Fast ($0.20/$0.50), the spread widens to 15–19× [Anthropic].
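The three scenarios above can be collapsed into one cost model. Pricing follows the stated assumptions (Opus 4.6 at $5/$25, cache reads at 90% off, batch at 50% off); the 70/20/10 routing blend assumes Sonnet at $3/$15, which the text does not state, and lever interactions are simplified:

```python
# Unoptimized vs. fully optimized monthly cost for 5B tokens.

INPUT_B, OUTPUT_B = 3.5e9, 1.5e9   # 70/30 split of 5B tokens/month

def monthly_cost(in_price, out_price, cache_hit=0.0, batch_share=0.0):
    # Cache hits billed at the 90%-discounted read rate; batched tokens
    # at half price.
    eff_in = cache_hit * in_price * 0.10 + (1 - cache_hit) * in_price
    total = (INPUT_B * eff_in + OUTPUT_B * out_price) / 1e6
    return total * (1 - 0.5 * batch_share)

unoptimized = monthly_cost(5.00, 25.00)              # $55,000, matching above

routed_in = 0.7 * 1.00 + 0.2 * 3.00 + 0.1 * 5.00     # $1.80/MTok blended input
routed_out = 0.7 * 5.00 + 0.2 * 15.00 + 0.1 * 25.00  # $9.00/MTok blended output
optimized = monthly_cost(routed_in, routed_out, cache_hit=0.85, batch_share=0.45)
# ~$11,600: just under the $12,500-16,000 band because the interactions
# between levers are simplified here.
```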

The Gemini-Lite provider's "token spiral" warning adds a further multiplier: unoptimized systems with poor loop detection can see 2–3× token multiplication from agents looping and re-reading, pushing the effective unoptimized cost to $110,000–$165,000/month — making the optimized-to-unoptimized spread potentially 10–20× in worst-case scenarios.

The Perplexity provider's ROI framing provides the most actionable decision framework: partial optimization (40 hours, ~$4,000 engineering cost) delivers 117:1 ROI in year one. Full optimization (300 hours, ~$30,000–50,000) delivers 26–49:1 ROI. At any reasonable engineering cost, optimization is not optional — it is the highest-ROI engineering investment available to teams running at this scale.
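The ROI figures check out against the document's own spreads. The monthly-savings inputs below are inferred, not stated: $39,000/month is the $55,000 to $16,000 spread, and $110,000–122,000/month is the token-spiral worst case:

```python
# First-year ROI = annual savings / one-time engineering cost.

def first_year_roi(monthly_savings: float, eng_cost: float) -> float:
    return monthly_savings * 12 / eng_cost

partial = first_year_roi(39_000, 4_000)     # 117:1, matching Perplexity
full_lo = first_year_roi(110_000, 50_000)   # ~26:1
full_hi = first_year_roi(122_000, 30_000)   # ~49:1
```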



Go Deeper

Follow-up questions based on where providers disagreed or confidence was low.

What is the actual production cache hit rate distribution across different agentic workload types (CRM vs. email triage vs. security review vs. content generation), and what architectural patterns reliably achieve 80%+ hit rates?

The most important unresolved contradiction in this analysis is the 45–65% vs. 80–96% cache hit rate disagreement. This gap represents a $20,000–$50,000/month difference in projected savings for a 5B token operation. Workload-specific benchmarks with controlled prompt architectures would resolve this and provide actionable targets for different system types.

How does reliability-adjusted cost per task (successful completions per dollar) compare across Claude Opus, Claude Haiku, GPT-5.4, Gemini 2.5 Pro, and Grok 4.1 Fast on standardized agentic benchmarks (SWE-Bench, tool-use evals, multi-step reasoning tasks)?

Multiple providers recommend aggressive routing to cheaper models, but the quality degradation data is sparse and often self-reported by model providers. A rigorous benchmark measuring first-pass success rates, retry rates, and human-intervention rates across models on realistic agentic tasks would enable evidence-based routing decisions rather than rule-of-thumb percentages. The Gemini-Lite provider's "reliability-adjusted cost" framing is correct but currently unsupported by systematic data.

What is the actual cost impact of Gemini's 1-hour cache expiration vs. Anthropic's persistent caching for always-on agents with different request frequency distributions (high-frequency vs. bursty vs. overnight batch)?

This is the most practically important unresolved contradiction for teams choosing between Gemini and Anthropic. Gemini's lower base price makes it theoretically attractive, but the cache TTL limitation could eliminate its cost advantage entirely for certain workload patterns. A controlled experiment measuring effective cache hit rates and total costs for identical workloads on both platforms would provide definitive guidance.

**What is the minimum viable prompt caching architecture for a 20-system agentic framework — specifically, how should AGENTS.md/SOUL.md/MEMORY.md be structured, versioned, and cache-invalidated to maximize hit rates while accommodating legitimate context updates?**

All providers agree that static context caching is the highest-impact optimization, but none provide a complete implementation blueprint for multi-system frameworks with overlapping context requirements. The specific engineering patterns for cache breakpoint placement, versioned context blocks, and graceful cache invalidation on legitimate updates would be immediately actionable for OpenClaw-style deployments.
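One candidate pattern, sketched here as an assumption rather than a vetted blueprint: order static files from least- to most-frequently changing, embed a content hash as an explicit version marker, and place a single cache breakpoint after the whole static prefix. The block shape mirrors Anthropic's `cache_control` convention; verify field names against current API documentation before relying on it:

```python
import hashlib

def build_system_blocks(static_files: dict[str, str]) -> list[dict]:
    """Assemble cacheable system blocks for an agentic request.

    Files are ordered stable-first so the cached prefix survives edits
    to volatile files as long as possible. The content hash makes
    invalidation explicit: any edit changes the prefix and deliberately
    breaks the cache for exactly one write.
    """
    order = ["AGENTS.md", "SOUL.md", "MEMORY.md"]  # stable -> volatile
    blocks = []
    for name in order:
        text = static_files[name]
        version = hashlib.sha256(text.encode()).hexdigest()[:8]
        blocks.append({
            "type": "text",
            "text": f"<!-- {name} v{version} -->\n{text}",
        })
    # One breakpoint after the full static prefix: everything before it
    # is served from cache while the files remain unchanged.
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return blocks

blocks = build_system_blocks({
    "AGENTS.md": "You are the triage agent...",
    "SOUL.md": "Tone: concise, direct.",
    "MEMORY.md": "Last sync: 2026-03-21.",
})
```

For multi-system frameworks, keeping the shared files byte-identical across systems (and pushing per-system differences into a final, uncached block) is what lets the cached prefix be reused across all 20 systems.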

**How do long-context surcharge thresholds (GPT-5.4's 272K, Gemini's 200K) interact with multi-agent fan-out patterns in practice — specifically, what percentage of advisory council and parallel expert queries breach these thresholds in production, and what is the actual cost impact?**

Multiple providers flagged long-context surcharges as a major "bill shock" vector, but none provided empirical data on how frequently agentic workloads actually breach these thresholds. Given that a 5-agent advisory council with full context could easily approach 200K tokens per synthesis pass, this may be a much larger cost driver than the analysis suggests. Empirical measurement of context length distributions in production multi-agent systems would quantify the true exposure.
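The threshold mechanics are easy to model. Two caveats: the billing semantics below (the doubled rate applying to the entire prompt once it crosses the line) and the per-expert token counts are assumptions to check against each provider's pricing docs, and the per-MTok rates are placeholders:

```python
def input_cost(tokens, base_rate_per_mtok, threshold):
    """Input cost in dollars with a long-context surcharge. Assumes the
    doubled rate applies to the *entire* prompt once it crosses the
    threshold; some providers may double only the excess tokens."""
    rate = base_rate_per_mtok * (2 if tokens > threshold else 1)
    return tokens / 1e6 * rate

# Assumed fan-out: 5 experts returning ~35K tokens each, plus ~30K of
# shared context at synthesis time -> 205K input tokens per synthesis.
synthesis_tokens = 5 * 35_000 + 30_000
gemini_cost = input_cost(synthesis_tokens, 1.25, 200_000)  # breaches 200K
gpt_cost = input_cost(synthesis_tokens, 2.00, 272_000)     # under 272K
```

Under these assumptions the same 205K-token synthesis pass pays the doubled rate on one provider and the base rate on the other, which is why measuring the context-length distribution of real fan-out traffic matters more than comparing headline prices.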

Key Claims

Cross-provider analysis with confidence ratings and agreement tracking.

12 claims · sorted by confidence
1. Model routing (70% Haiku / 20% Sonnet / 10% Opus) reduces costs by 60–75% with minimal quality degradation for most agentic tasks.
   High confidence · Anthropic, Gemini, Grok-Premium, OpenAI, Perplexity
2. GPT-5.4 doubles input costs beyond 272K tokens; Gemini 2.5 Pro doubles beyond 200K tokens; Anthropic does not apply long-context surcharges on its 1M context window.
   High confidence · Gemini, Grok-Premium, OpenAI, Perplexity, Anthropic
3. Batch API processing (50% discount, ~24-hour SLA) is viable for 30–50% of typical agentic workloads, including nightly cron jobs, batch analytics, and content generation.
   High confidence · Anthropic, Gemini, Grok-Premium, OpenAI, Perplexity
4. Prompt caching provides a 90% discount on cached input tokens for Anthropic and Google, 50% for OpenAI.
   High confidence · Agree: Anthropic, Gemini, Grok-Premium, Perplexity · Disagree: OpenAI provider (claims OpenAI lacks developer caching — likely outdated)
5. Loading AGENTS.md/SOUL.md/MEMORY.md/tool schemas on every request (5,000–10,500 tokens) is the single largest hidden cost in agentic frameworks, potentially costing $60,000–$122,000/month at scale before optimization.
   High confidence · Anthropic, Gemini-Lite, Perplexity, Grok-Premium
6. Conversation compaction (summarizing after every 5–10 turns) can reduce history tokens by 60–95% with acceptable quality loss.
   High confidence · Anthropic, Gemini, OpenAI, Perplexity
7. The correct optimization metric is reliability-adjusted cost per task, not cost per token: a cheaper model requiring multiple retries may cost more than an expensive model succeeding on the first attempt.
   High confidence · Gemini-Lite, Grok-Premium (implicitly)
8. The spread between unoptimized and fully optimized 5B token operations is 5–19×, representing $40,000–$220,000/month in potential savings.
   Medium confidence · Agree: Anthropic, Gemini, Perplexity · Disagree: Gemini-Lite (estimates a narrower ~40–60% spread), OpenAI (estimates a 5–10× spread)
9. Grok 4.1 Fast at $0.20/$0.50 per MTok (input/output) with a 2M context window is the cheapest production-grade model available as of March 2026.
   Medium confidence · Anthropic, Gemini, Grok-Premium · No provider disagrees, but ecosystem maturity and reliability data are limited
10. RAG optimization (512-token chunks, k=3, reranking) reduces RAG context tokens by 60–70% with minimal quality loss.
    Medium confidence · Anthropic, OpenAI, Perplexity · No provider disagrees, but optimal chunk size is workload-dependent
11. Practical cache hit rates for always-on mixed agentic workloads max out at 60–65%, not the 90%+ implied by provider marketing.
    Medium confidence · Agree: Perplexity · Disagree: Grok-Premium (cites 80–95%), Anthropic (cites 96% for Claude Code)
12. Gemini's 1-hour cache expiration makes it structurally unsuitable for always-on 24/7 agentic workloads, reducing effective cache hit rates to 20–35%.
    Medium confidence · Agree: Perplexity · Disagree: Gemini provider (presents Gemini caching as highly competitive without flagging this limitation)
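The arithmetic behind claim 1 is worth making explicit. With assumed per-model input rates (placeholders, not a current price sheet), a 70/20/10 mix cuts the blended input rate by roughly 81% versus an all-Opus baseline; the lower 60–75% figure is plausible once output tokens, retries, and quality fallbacks are folded in:

```python
def blended_rate(mix, rates):
    """Blended $/MTok for a routing mix (shares must sum to 1)."""
    return sum(share * rates[model] for model, share in mix.items())

# Assumed input prices in $/MTok, for illustration only:
rates = {"haiku": 1.00, "sonnet": 3.00, "opus": 15.00}
routed = blended_rate({"haiku": 0.70, "sonnet": 0.20, "opus": 0.10}, rates)
flagship_only = rates["opus"]
print(f"{routed:.2f} vs {flagship_only:.2f} $/MTok "
      f"({1 - routed / flagship_only:.0%} cheaper on input tokens)")
```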

Topics

llm cost optimization · agentic workflows · prompt caching · model routing · rag tuning · conversation compaction · gpt-5.4 pricing · claude opus pricing


Research synthesized by Parallect AI

Multi-provider deep research — every angle, synthesized.