Saving 52% on Claude API Costs with Cache Optimization
How a bug led me down a rabbit hole of prompt caching mechanics, gap analysis on 30 days of real conversation data, and the discovery that my heartbeat system was keeping the wrong cache warm.
It Started With a Bug
I run a persistent AI assistant on OpenClaw — an open-source framework for running Claude (and other models) as always-on agents with tools, memory, and cron jobs. Think of it as systemd for AI assistants. My instance talks to Anthropic's API, and I interact with it primarily through Telegram.
The conversation is persistent. Same session, same context, growing over time. At the time of this investigation, my session transcript was about 131MB of JSONL, spanning roughly 30 days and 3,400+ messages. The context window sent to Claude on each turn was around 165,000 tokens — system prompt, personality files, tool definitions, and recent conversation history.
That's a lot of tokens to re-process every single message.
Anthropic's prompt caching is supposed to help with exactly this. The idea is simple: if the beginning of your prompt (the "prefix") hasn't changed since the last request, Anthropic can reuse the cached KV representations from last time instead of recomputing them. Cache hits cost 90% less than fresh processing.
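With Anthropic, caching is explicit: you attach a `cache_control` marker to the last stable block of your prompt, and everything up to and including that block becomes eligible for reuse. Here's a minimal sketch of a Messages API request body with caching enabled — the model ID and prompt text are placeholders, not my actual setup:

```python
# A Messages API request body with a cache breakpoint on the stable prefix.
# Placeholder values throughout — adapt to your own system prompt and model.
stable_prefix = "<system prompt, personality files, tool definitions...>"

request = {
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": stable_prefix,
            # Everything up to and including this block can be served from cache.
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL by default
        }
    ],
    "messages": [{"role": "user", "content": "good morning"}],
}
```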
The key words there are hasn't changed. The cache is invalidated if even a single byte of the prefix changes. And that's where the bug came in.
OpenClaw was injecting unique message IDs into the system prompt on every turn. Different ID each time. Cache busted every single message. My 165K token context was being fully reprocessed — and fully billed — on every interaction. No cache hits, ever.
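To see why that's fatal, remember that a prefix cache can only reuse the byte-exact shared prefix of consecutive requests. A toy illustration — the ID format here is invented, but the effect matches what the injected IDs did:

```python
import os

def common_prefix_len(a: str, b: str) -> int:
    """Length of the byte-exact shared prefix — roughly what a prefix cache can reuse."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

system_prompt = "You are a persistent assistant.\n<~165K tokens of stable context...>"

# The bug: a unique ID injected at the top of the prompt on every turn.
turn1 = f"[msg-id: {os.urandom(4).hex()}] " + system_prompt
turn2 = f"[msg-id: {os.urandom(4).hex()}] " + system_prompt

# The shared prefix ends at the first differing hex digit; everything after
# it — the entire context — must be reprocessed and billed at full price.
print(common_prefix_len(turn1, turn2))
```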
I filed a bug report with the framework maintainers, and the response was validating — they cited the evidence as excellent documentation of the problem. The real-world impact was staggering: roughly a 150× cost increase on input token processing compared to what caching should have provided. What genuinely surprised me was how few people had noticed. I found only two other reports of the issue. Thousands of users running persistent conversations, paying full price on every turn, and almost nobody had caught it. I suspect most people don't look at their cache hit rates. They see the bill, shrug, and assume "AI is expensive."
It's worth understanding that not all providers handle caching the same way, and this likely explains why the bug went unnoticed. There are two caching models in the wild.

Prefix caching — used by Anthropic, Google Gemini, and DeepSeek — caches everything from the start of the prompt and breaks at the first byte that differs. Bigger savings when it works (90% discount on hits), but fragile: a single dynamic value injected early in the prompt invalidates everything after it.

Segment caching — OpenAI's approach — hashes independent chunks of the prompt and is position-tolerant. More resilient to small changes, 50% discount on hits, no write premium, and completely automatic. There's nothing to configure and nothing to break.

OpenClaw's creator joined OpenAI in February 2026, and the project's development has naturally gravitated toward OpenAI's ecosystem. If you're developing primarily against OpenAI's API, this bug is invisible — the cache just works. It's only when users point the framework at a prefix-caching provider like Anthropic — where caching is fragile, explicit, and expensive to miss — that a stray message ID in the system prompt quietly turns a 90% discount into a 25% surcharge.
Once the fix landed — about five days after the bug was introduced — I started actually seeing cache hits again. Which naturally led me to wonder: how well is this actually working? And can I make it work better?
Understanding the Cache TTL Options
Anthropic offers two cache durations:
| | 5-Minute TTL (default) | 1-Hour TTL |
|---|---|---|
| Cache Write | 1.25× base input price | 2× base input price |
| Cache Hit | 0.1× base input price | 0.1× base input price |
Cache hits cost the same either way — 90% off. The difference is in the write cost. The 1-hour cache charges more upfront to write, but the cache sticks around 12× longer. The question is whether that tradeoff is worth it for your usage pattern.
For Opus 4.6 specifically (my primary model), here's what that looks like in real dollars at my context size:
| Operation | 5m TTL | 1h TTL |
|---|---|---|
| Cache write (165K tokens) | $1.03 | $1.65 |
| Cache hit (165K tokens) | $0.08 | $0.08 |
So: every time the cache expires and I send a message, it costs either $1.03 or $1.65 to rebuild. Every time the cache is still warm, it costs $0.08 regardless. The 1-hour option only makes sense if I frequently have gaps between messages that are longer than 5 minutes but shorter than 60 minutes.
"Frequently" is doing a lot of heavy lifting in that sentence. Time to find out what "frequently" actually means.
Analyzing 30 Days of Real Conversation Data
I have the full session transcript with timestamps on every message. Rather than guess at my usage pattern, I wrote a Python script to extract every user message timestamp, calculate the gap between consecutive messages, and bucket them.
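Here's a minimal sketch of that script. It assumes a JSONL transcript where each line carries a `role` and an ISO-8601 `timestamp` field — the real OpenClaw transcript schema may differ:

```python
import json
from datetime import datetime

# Bucket boundaries in seconds (upper bounds, exclusive).
BUCKETS = [
    ("< 1 minute", 60), ("1–5 minutes", 300), ("5–15 minutes", 900),
    ("15–30 minutes", 1800), ("30–60 minutes", 3600),
    ("1–4 hours", 14400), ("4+ hours", float("inf")),
]

def bucket_gaps(timestamps: list[datetime]) -> dict[str, int]:
    """Count the gaps between consecutive user messages, bucketed by duration."""
    counts = {label: 0 for label, _ in BUCKETS}
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = (cur - prev).total_seconds()
        for label, upper in BUCKETS:
            if gap < upper:
                counts[label] += 1
                break
    return counts

def load_user_timestamps(path: str) -> list[datetime]:
    """Pull user-message timestamps out of a JSONL session transcript."""
    stamps = []
    with open(path) as f:
        for line in f:
            msg = json.loads(line)
            if msg.get("role") == "user":
                stamps.append(datetime.fromisoformat(msg["timestamp"]))
    return stamps
```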
Here's 30 days of actual data — 3,451 gaps between messages analyzed:
| Gap Duration | Count | Percentage |
|---|---|---|
| < 1 minute | 774 | 22.4% |
| 1–5 minutes | 1,806 | 52.3% |
| 5–15 minutes | 470 | 13.6% |
| 15–30 minutes | 139 | 4.0% |
| 30–60 minutes | 103 | 3.0% |
| 1–4 hours | 125 | 3.6% |
| 4+ hours | 34 | 1.0% |
Look at the middle three rows. That's 712 gaps — 20.6% of all my messages — that fall in the 5-to-60-minute sweet spot. With a 5-minute TTL, every single one of those is a cache miss and a full cache write. With a 1-hour TTL, every single one becomes a cache hit.
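The tradeoff reduces to a one-line comparison: each converted gap saves the 5-minute write cost minus the hit cost, while each gap over an hour pays the 1-hour write premium. A quick sketch, using per-request dollar figures back-solved from the Opus pricing table above (~165K tokens of context; the base input price is an inference from those figures, not a quoted rate):

```python
# Per-request costs at ~165K tokens of Opus-class context (back-solved
# from this post's pricing table — an assumption, not official pricing).
WRITE_5M = 1.03125   # 1.25x base
WRITE_1H = 1.65      # 2x base
HIT = 0.0825         # 0.1x base

def one_hour_ttl_wins(gaps_5_to_60_min: int, gaps_over_60_min: int) -> bool:
    """True when converting 5-60 min misses to hits outweighs the write premium."""
    savings = gaps_5_to_60_min * (WRITE_5M - HIT)        # misses turned into hits
    premium = gaps_over_60_min * (WRITE_1H - WRITE_5M)   # pricier unavoidable writes
    return savings > premium
```

Plug in your own bucket counts; with my 712 mid-range gaps against 159 long ones, the answer comes out firmly in favor of the 1-hour TTL.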
Running the numbers against actual pricing:
| | 5-Minute TTL | 1-Hour TTL |
|---|---|---|
| Cache hits | 2,580 × $0.0825 = $212.85 | 3,292 × $0.0825 = $271.59 |
| Cache writes | 871 × $1.03125 = $898.22 | 159 × $1.65 = $262.35 |
| Total input cost | $1,111.07 | $533.94 |
$577 saved over 30 days. A 52% reduction in input token costs.
The tradeoff: those 159 gaps longer than an hour cost slightly more ($1.65 vs $1.03 per write). But that extra $98 is dwarfed by the $675 saved from converting 712 cache misses into hits.
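The whole comparison fits in a few lines — here's the table recomputed from the gap buckets, using the same unrounded per-request costs back-solved from the figures above:

```python
# Recompute the 30-day totals from the gap buckets (per-request costs are
# back-solved from this post's tables: ~165K tokens of Opus context).
HIT, WRITE_5M, WRITE_1H = 0.0825, 1.03125, 1.65

under_5m = 774 + 1806      # warm under either TTL
mid = 470 + 139 + 103      # warm only with the 1-hour TTL
over_60m = 125 + 34        # cold under either TTL

cost_5m = under_5m * HIT + (mid + over_60m) * WRITE_5M
cost_1h = (under_5m + mid) * HIT + over_60m * WRITE_1H

print(f"5m: ${cost_5m:,.2f}  1h: ${cost_1h:,.2f}  saved: ${cost_5m - cost_1h:,.2f}")
```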
This was a no-brainer. I switched to the 1-hour TTL immediately.
The Heartbeat Discovery
But I wasn't done. The conversation turned to whether I could do even better.
OpenClaw has a "heartbeat" system — a periodic poll that wakes the agent, lets it check for pending tasks, and goes back to sleep. Think of it like a cron job that runs inside your active session. The agent sees the full conversation context, can check files, and responds. If there's nothing to do, it replies with a short acknowledgment and the response gets stripped from the transcript — no pollution.
Something easy to overlook: the heartbeat has its own model configuration, separate from the primary model. My setup was:
- Primary model: Claude Opus 4.6 (for actual conversations)
- Heartbeat model: Claude Sonnet 4.6 (because "it's cheaper")
This is the conventional wisdom: use a cheaper model for background tasks. Makes sense on the surface. But prompt caching is per-model. The cache from a Sonnet request is completely separate from the cache for an Opus request. They don't share.
Which means my Sonnet heartbeat — firing every 30 minutes, dutifully processing my full 165K token context — was keeping the Sonnet cache warm. The Opus cache, the one I actually use for conversations, was getting zero benefit.
I was paying for heartbeats that warmed a cache I almost never used.
There's a lot of advice floating around about switching background tasks and heartbeats to cheaper models to save money. On paper it makes sense — why pay Opus prices for a health check? But if your heartbeat's primary value is keeping the prompt cache warm (and it should be, if you're running persistent sessions), then putting it on a different model means it's warming a cache you don't use. You're spending less per heartbeat while silently losing far more on cold cache writes every time you actually talk to the assistant. The "optimization" is costing you money.
The Cache Keepalive Strategy
The fix was obvious once I understood the mechanics: switch the heartbeat model to Opus. Now every heartbeat refreshes the same cache that my actual conversations use.
With the heartbeat interval set to 55 minutes (safely inside the 1-hour TTL), the Opus cache essentially never expires. Even when I'm asleep for 8 hours, the heartbeat fires roughly every 55 minutes, each one a cheap cache read ($0.08) that resets the expiration timer.
Cost of the keepalive
~26 heartbeats per day × $0.08 per cache read = ~$2/day.
In exchange, every conversation — including the first message of the day — hits a warm cache. No more cold starts. My first-message-of-the-day cost dropped from ~$1.65 (cold write) to ~$0.08 (warm read). On a usage-capped plan, that's the difference between burning 10-15% of my daily budget on "good morning" and burning less than 1%.
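The back-of-the-envelope version, using the same back-solved per-request costs as before and taking the rate of long gaps from the 30-day data:

```python
# Keepalive cost vs. the cold writes it avoids (per-request costs back-solved
# from this post's tables; ~165K tokens of Opus context assumed).
HIT, WRITE_1H = 0.0825, 1.65

heartbeats_per_day = 24 * 60 / 55          # ~26 at a 55-minute interval
keepalive = heartbeats_per_day * HIT       # every heartbeat is a warm read

long_gaps_per_day = 159 / 30               # gaps > 1h, from the 30-day analysis
cold_writes_avoided = long_gaps_per_day * WRITE_1H

print(f"keepalive ~${keepalive:.2f}/day vs cold writes avoided ~${cold_writes_avoided:.2f}/day")
```

About $2 a day to avoid roughly $9 a day in cold writes — the keepalive pays for itself several times over.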
The Isolated Session Problem
One thing worth noting for anyone using OpenClaw or similar agent frameworks: cron jobs that run in isolated sessions don't benefit from cache keepalive at all.
Each isolated cron run starts a fresh session with its own prompt. The cache from run #1 is a completely different conversation than run #2. There's no continuity to keep warm. The keepalive strategy only works for persistent sessions — like a long-running conversation — where the prompt prefix is consistent across turns.
Implementation
If you're running OpenClaw (or any system that talks to the Anthropic API), here's what I changed:
```json
{
  "agents": {
    "defaults": {
      "models": {
        "anthropic/claude-opus-4-6": {
          "params": { "cacheRetention": "long" }
        },
        "anthropic/claude-sonnet-4-6": {
          "params": { "cacheRetention": "long" }
        }
      },
      "heartbeat": {
        "every": "55m",
        "model": "anthropic/claude-opus-4-6"
      }
    }
  }
}
```
The key settings:
- `cacheRetention: "long"` — switches from 5-minute to 1-hour TTL. OpenClaw maps this to Anthropic's `cache_control` with the appropriate TTL.
- `heartbeat.model` — set to the same model you use for conversation. This is the critical insight. Don't use a "cheaper" model for heartbeats if it means warming the wrong cache.
- `heartbeat.every: "55m"` — fires every 55 minutes, safely inside the 1-hour window.
If you're calling the Anthropic API directly, the equivalent is setting `"ttl": "1h"` in your `cache_control` blocks:

```json
"cache_control": {
  "type": "ephemeral",
  "ttl": "1h"
}
```
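Whichever route you take, verify that caching is actually working: every Messages API response includes a `usage` block that splits input tokens into fresh processing, cache writes, and cache reads. A small helper for watching your hit rate (the example numbers are illustrative, not from my logs):

```python
# Compute the fraction of a request's prompt served from cache, using the
# token-accounting fields Anthropic returns in each response's usage block.
def cache_hit_rate(usage: dict) -> float:
    """Fraction of this request's prompt tokens served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0

# A warm-cache turn reads nearly the whole prefix; only the new turn is fresh.
warm_turn = {"input_tokens": 420, "cache_read_input_tokens": 165_000,
             "cache_creation_input_tokens": 0}
print(f"{cache_hit_rate(warm_turn):.1%}")
```

If this number sits near zero on a persistent session, something in your prompt prefix is changing per-request — exactly the bug that started this whole investigation.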
Summary
Three changes, compounding savings:
- Fix the cache-busting bug. Unique data in the prompt prefix invalidates caching entirely. Audit your system prompt for anything that changes per-request (timestamps, request IDs, random nonces). This alone can be the difference between 0% and 90%+ cache hit rates.
- Switch to 1-hour TTL if your usage has 5-60 minute gaps. Analyze your actual message timing data. If more than ~15% of your gaps fall in the 5-60 minute range, the 1-hour TTL will save money despite the higher write cost. For me, 20.6% of gaps fell in that range — resulting in a 52% cost reduction.
- Run heartbeats on the same model you use for conversation. A cheaper heartbeat model warms a different cache. If your heartbeat exists primarily as a keepalive mechanism, it needs to hit the same model to refresh the same cache. This is the one that conventional wisdom gets wrong.
None of this required changing how I interact with the AI. Same conversations, same quality, same features. Just less money going to token processing that didn't need to happen.