How to Prevent OpenClaw Agents from Burning Tokens
You deploy an OpenClaw agent on Friday evening. You wake up Saturday morning to a $5 bill from your LLM provider. The agent hit a Cloudflare challenge page, retried over a hundred times, and burned through tokens all night. This guide covers every common token-burning scenario and how to prevent each one with SOUL.md rules, environment variables, and smart configuration.
The $5 Overnight Bill: A Real Story
Here is what happened. A developer deployed an SEO monitoring agent with a 10-minute heartbeat. The agent's job was to check a competitor's pricing page every 10 minutes and report changes. Simple enough.
The problem: the competitor's site was behind Cloudflare. Every request returned a JavaScript challenge page instead of the actual pricing content. The agent saw HTML that did not match the expected format, logged it as a failure, and retried. The retry logic had no cap. It retried 3 times per heartbeat, and the heartbeat fired every 10 minutes.
Over 8 hours of sleep, the agent ran roughly 48 heartbeat cycles, each with 3 retries. That is 144 LLM calls. Each call included the full Cloudflare HTML response in the context (about 4,000 tokens), plus the agent's system prompt, instructions, and the conversation history it kept accumulating. Total token consumption: over 2 million tokens. Cost on Claude Sonnet: approximately $5.
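Before deploying any heartbeat-driven agent, it is worth running this arithmetic yourself. The sketch below is a back-of-envelope worst-case estimator, assuming a flat average token count per call (including system prompt and accumulated history) and Sonnet-class input pricing of $3 per million tokens; real bills also include output tokens.

```python
def worst_case_cost(hours: float, heartbeat_min: float, attempts_per_cycle: int,
                    tokens_per_call: int, usd_per_mtok: float = 3.0) -> float:
    """Estimate worst-case LLM spend if every heartbeat fails and retries."""
    cycles = hours * 60 / heartbeat_min
    calls = cycles * attempts_per_cycle
    return calls * tokens_per_call / 1_000_000 * usd_per_mtok

# 8 hours, 10-minute heartbeat, 3 attempts per cycle,
# ~12,000 tokens per call once prompt and history are included
print(round(worst_case_cost(8, 10, 3, 12_000), 2))  # → 5.18
```

Plug in your own interval and retry settings before going to sleep; if the worst-case number scares you, tighten the limits first.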
Five dollars might not sound catastrophic. But scale that to a team running 10 agents, or an agent with a 1-minute heartbeat, and you are looking at real money. The good news: every one of these scenarios is preventable with the right configuration.
The Five Token-Burning Scenarios
After reviewing hundreds of agent deployments, these are the five patterns that cause the most unexpected token usage. Understanding each one is the first step to preventing them.
1. Cloudflare Challenge Loops
The agent makes an HTTP request to a Cloudflare-protected site. Instead of real content, it receives a challenge page with JavaScript. The agent cannot execute JavaScript, so it sees garbage HTML. It assumes the request failed and retries. Each retry hits the same challenge. The loop continues until the agent is stopped or the retry limit is reached. If there is no retry limit, it runs forever.
2. Infinite Retry Chains
The agent encounters any error (API timeout, 500 response, DNS failure) and retries without a cap. Default retry configurations in many frameworks do not set a maximum. Combined with exponential backoff that resets incorrectly, the agent can retry thousands of times before anyone notices.
3. Large File Reads
The agent is told to 'read this log file' or 'analyze this codebase.' It reads the entire file into its context window. A 10MB log file is roughly 2.5 million tokens. One read operation can cost more than a full day of normal agent activity. If the agent reads the file on every heartbeat because it does not cache results, costs multiply fast.
4. Chatty Heartbeats
The agent has a heartbeat interval that is too short for its task. A monitoring agent that checks disk space every 30 seconds does not need to call an LLM each time. But if the heartbeat triggers a full LLM inference cycle (system prompt + context + response), a 30-second interval generates 2,880 LLM calls per day.
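The cure is to separate the cheap check from the expensive inference. A minimal sketch, where `call_llm` is a hypothetical hook into your agent runtime: the local check runs on every heartbeat for free, and the LLM is invoked only when the monitored value actually changes.

```python
import shutil

def disk_usage_pct(path: str = "/") -> int:
    """Cheap local check -- no LLM call, no tokens."""
    total, used, _ = shutil.disk_usage(path)
    return round(used / total * 100)

_last_reported = None

def heartbeat(call_llm) -> bool:
    """Invoke the LLM only when the monitored value changes.

    Returns True if an LLM call was made on this heartbeat.
    """
    global _last_reported
    pct = disk_usage_pct()
    if pct == _last_reported:
        return False          # nothing changed: skip the LLM entirely
    _last_reported = pct
    call_llm(f"Disk usage changed to {pct}%")
    return True
```

With this gating in place, a 30-second heartbeat costs almost nothing on the quiet days when nothing changes.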
5. Context Window Stuffing
The agent accumulates conversation history without pruning. Each new message includes the full history. After 50 exchanges, the context window is full, and every subsequent call sends the maximum token count. Some agents also include tool outputs in the history, which can be enormous if the tool returns large JSON payloads or HTML content.
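Pruning is mechanically simple. The illustrative helper below (names are hypothetical, and it uses a placeholder stub where a real agent would summarize older turns with a cheap model) shows the shape: truncate oversized tool outputs, keep recent messages verbatim, and collapse everything older.

```python
def prune_history(messages: list[str], keep_recent: int = 10,
                  max_tool_output_chars: int = 4000) -> list[str]:
    """Keep recent messages verbatim; collapse older ones to a stub."""
    # Truncate oversized tool outputs before anything else
    trimmed = [m[:max_tool_output_chars] + " [truncated]"
               if len(m) > max_tool_output_chars else m
               for m in messages]
    if len(trimmed) <= keep_recent:
        return trimmed
    dropped = len(trimmed) - keep_recent
    stub = f"[{dropped} earlier messages summarized and removed]"
    return [stub] + trimmed[-keep_recent:]

history = [f"msg {i}" for i in range(25)]
print(len(prune_history(history)))  # → 11
```

The key property: context size is now bounded by `keep_recent` and `max_tool_output_chars`, no matter how long the session runs.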
Fix 1: Cloudflare Detection and Avoidance
Cloudflare protection is the single biggest cause of token burning in web-scraping agents. The fix has two parts: detect Cloudflare responses and handle them gracefully.
Add these rules to your agent's SOUL.md file:
```markdown
# SOUL.md - Cloudflare Protection Rules

## HTTP Request Rules
- Before processing any HTTP response, check for Cloudflare indicators:
  - Response contains "cf-browser-verification"
  - Response contains "Checking if the site connection is secure"
  - Response contains "cf_clearance"
  - Response status is 403 with "cloudflare" in headers
- If Cloudflare is detected: STOP. Do not retry. Log the URL as
  "cloudflare-blocked" and move to the next task.
- Never include raw HTML from blocked responses in your output.
  Summarize as: "URL blocked by Cloudflare, skipping."
- Maximum 1 retry for any HTTP request. If the retry also fails,
  abandon the request.
```
The key principle: treat Cloudflare blocks as permanent failures, not transient errors. Retrying a Cloudflare challenge without a headless browser will never succeed. Each retry just burns tokens.
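SOUL.md rules guide the model, but you can also enforce detection in the application layer so a challenge page never reaches the LLM context at all. A minimal sketch, assuming you already have the status code, headers, and body of the response in hand:

```python
CF_MARKERS = (
    "cf-browser-verification",
    "Checking if the site connection is secure",
    "cf_clearance",
)

def is_cloudflare_block(status: int, headers: dict[str, str], body: str) -> bool:
    """Return True if the response looks like a Cloudflare challenge page."""
    if status == 403 and "cloudflare" in headers.get("server", "").lower():
        return True
    return any(marker in body for marker in CF_MARKERS)

# A blocked response gets logged and skipped -- never retried,
# and never placed in the LLM's context.
if is_cloudflare_block(403, {"server": "cloudflare"}, "<html>...</html>"):
    print("URL blocked by Cloudflare, skipping.")
```

Running this check before any LLM call means a blocked URL costs you one HTTP request and zero tokens.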
If your agent genuinely needs to access Cloudflare-protected content, use a headless browser tool like Playwright or Puppeteer as an MCP tool. The browser can solve JavaScript challenges. Retrying raw HTTP requests cannot.
Fix 2: Set Max Retries and Timeouts
Every agent that makes external requests needs explicit retry limits. Never rely on defaults, because many libraries default to unlimited retries or very high numbers.
Add this to your SOUL.md:
```markdown
# SOUL.md - Retry and Timeout Rules

## Retry Policy
- Maximum retries per request: 2
- Backoff: wait 5 seconds after first failure, 15 seconds after second
- After 2 failed retries: mark task as failed, log the error, move on
- Never retry the same URL more than 3 times total (1 original + 2 retries)

## Timeout Policy
- HTTP request timeout: 30 seconds
- Task timeout: 5 minutes (if a single task takes longer, abort it)
- Session timeout: 60 minutes (restart the session after 60 minutes
  of continuous operation)

## Error Handling
- On timeout: log "TIMEOUT: [url/task]" and move to next task
- On 4xx error: do not retry (client errors are not transient)
- On 5xx error: retry up to 2 times with backoff
- On network error: retry up to 2 times with backoff
```
For environment-level control, set these variables in your docker-compose.yml or .env file:
```bash
# .env or docker-compose environment (timeouts in milliseconds)
AGENT_MAX_RETRIES=2
AGENT_REQUEST_TIMEOUT=30000   # 30 seconds
AGENT_TASK_TIMEOUT=300000     # 5 minutes
AGENT_SESSION_TIMEOUT=3600000 # 60 minutes
```
The combination of SOUL.md rules and environment variables creates two layers of protection. The SOUL.md rules guide the LLM's behavior. The environment variables enforce hard limits at the application level, catching anything the LLM misses.
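Here is a sketch of what that application-level enforcement can look like, using the variable names from the snippet above. `with_retries` wraps any zero-argument callable; a real implementation would also distinguish 4xx permanent errors (no retry) from 5xx/network errors (retry with backoff), as the SOUL.md rules specify.

```python
import os
import time

MAX_RETRIES = int(os.getenv("AGENT_MAX_RETRIES", "2"))
BACKOFF_SECONDS = (5, 15)  # per the SOUL.md backoff rule

def with_retries(request_fn, *, max_retries: int = MAX_RETRIES, sleep=time.sleep):
    """Call request_fn, retrying failures at most max_retries times.

    request_fn is any zero-argument callable that raises on failure.
    The sleep function is injectable so tests can skip the backoff.
    """
    attempts = max_retries + 1  # 1 original attempt + N retries, then give up
    for attempt in range(attempts):
        try:
            return request_fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise RuntimeError(f"giving up after {attempts} attempts") from exc
            sleep(BACKOFF_SECONDS[min(attempt, len(BACKOFF_SECONDS) - 1)])
```

The hard cap lives in code, so even if the LLM decides a fourth attempt is a good idea, the runtime refuses.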
Fix 3: Rate Limiting Agent Requests
Rate limiting prevents your agent from making too many LLM calls in a given time period. This is your safety net against any token-burning scenario you did not anticipate.
Set rate limits at three levels:
| Level | Recommended Limit | Why |
|---|---|---|
| Per minute | 10 LLM calls | Catches fast loops immediately. No legitimate agent task needs more than 10 LLM calls per minute. |
| Per hour | 100 LLM calls | Catches slower loops that per-minute limits miss. Also prevents chatty heartbeats from accumulating. |
| Per day | 500 LLM calls | Hard ceiling for daily usage. Even aggressive agents rarely need more than 500 calls per day. |
Add this to your SOUL.md:
```markdown
# SOUL.md - Rate Limiting

## Rate Limits
- Do not exceed 10 LLM calls per minute
- Do not exceed 100 LLM calls per hour
- Do not exceed 500 LLM calls per day
- If you approach any limit, pause and wait until the window resets
- Log a warning when you hit 80% of any rate limit
```
For application-level enforcement, implement a token bucket or sliding window counter in your agent's runtime. The SOUL.md rules are behavioral guidance for the LLM. Application-level rate limiting is the hard stop that works even if the LLM ignores the guidance.
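A sliding-window limiter is only a few lines. This sketch keeps a deque of call timestamps and drops those that have aged out; stack one instance per window from the table above. The clock is injectable so the class can be tested without waiting.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` events in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.events = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False            # hard stop: skip or defer the LLM call
        self.events.append(now)
        return True

# One limiter per window; every call must pass all of them
per_minute = SlidingWindowLimiter(10, 60)
per_hour = SlidingWindowLimiter(100, 3600)

def may_call_llm() -> bool:
    return per_minute.allow() and per_hour.allow()
```

Check `may_call_llm()` immediately before every inference; when it returns False, log the refusal and move on rather than blocking.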
Fix 4: Cost Caps and Alerts
Even with all the preventive measures above, you should have a financial safety net. Cost caps ensure that no single agent can exceed a budget, regardless of what goes wrong.
Set cost limits at the LLM provider level:
OpenAI: Usage Limits Dashboard
Go to platform.openai.com, navigate to Settings, then Limits. Set a hard cap on monthly spending. Set a soft cap that sends you an email alert. For a single agent, a $10/month hard cap is reasonable. Set the alert at $5 so you get a warning before hitting the limit.
Anthropic: Usage Controls
In the Anthropic Console, go to Settings, then Plans and Billing. Set a monthly spending limit. Anthropic also lets you set per-API-key limits, which is ideal if each agent has its own API key. Set a $10 limit per key and you will never get a surprise bill from a runaway agent.
Ollama (Local): Zero Cost
If cost is a primary concern, run your agents on Ollama with a local model. Llama 3, Mistral, and Phi-3 all run well on consumer hardware. There is no API cost because the model runs on your own machine. The trade-off is speed and quality, but for monitoring and routine tasks, local models are more than sufficient.
Add cost awareness to your SOUL.md:
```markdown
# SOUL.md - Cost Control

## Token Budget
- Maximum tokens per single task: 50,000
- Maximum tokens per day: 500,000
- If a task requires reading a file larger than 10,000 tokens,
  summarize it instead of including the full content
- Never include raw HTML, JSON payloads, or log files larger than
  2,000 tokens in your context. Truncate or summarize.

## Cost Awareness
- Before making an LLM call, estimate whether the input exceeds
  10,000 tokens. If it does, look for ways to reduce the input.
- Prefer shorter, focused prompts over long, comprehensive ones.
- Cache results when possible. Do not re-fetch data you already have.
```
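The 2,000-token payload rule above can also be enforced in code before anything reaches the context window. This sketch uses the common 4-characters-per-token heuristic, which is a rough approximation, not a real tokenizer; use your provider's token-counting endpoint for billing-grade numbers.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fit_to_budget(text: str, max_tokens: int = 2_000) -> str:
    """Truncate oversized payloads before they reach the LLM context."""
    if estimate_tokens(text) <= max_tokens:
        return text
    keep = max_tokens * CHARS_PER_TOKEN
    dropped = estimate_tokens(text) - max_tokens
    return text[:keep] + f"\n[truncated, ~{dropped} tokens dropped]"

big_payload = "x" * 100_000            # ~25,000 tokens of raw HTML/JSON
print(estimate_tokens(big_payload))    # → 25000
```

Run every tool output through `fit_to_budget` before appending it to the conversation; one oversized JSON blob is often the difference between a $0.05 task and a $5 one.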
Fix 5: SOUL.md Rules for Wasteful Behavior
Your SOUL.md file is the most powerful tool for controlling agent behavior. Unlike environment variables and API limits, SOUL.md rules shape how the agent thinks. Here is a complete set of anti-waste rules you can copy into any agent's SOUL.md.
```markdown
# SOUL.md - Anti-Waste Rules

## Core Principles
- Every LLM call costs money. Be efficient.
- If you can answer from memory or cached data, do not make a new call.
- If a task is failing repeatedly, stop and report the failure.
  Do not keep trying.

## File Handling
- Never read files larger than 500 lines without user confirmation
- For log files: read only the last 100 lines unless instructed otherwise
- For codebases: read one file at a time, not the entire directory
- Never include binary files, images, or compiled assets in your context

## HTTP Requests
- Maximum 1 retry per request
- Always check for Cloudflare/bot protection before retrying
- Cache successful responses for at least 10 minutes
- Do not fetch the same URL more than once per session unless
  explicitly asked to refresh

## Conversation History
- Summarize old messages instead of keeping full history
- After 20 messages, compress the history to key points
- Never include tool outputs longer than 1,000 tokens in history.
  Summarize them.

## Heartbeat Behavior
- During heartbeat: check for new tasks only. Do not re-process
  completed tasks.
- If no new tasks exist, respond with a single short status message.
  Do not generate a detailed report on every heartbeat.
- Minimum heartbeat interval: 5 minutes (except for real-time
  monitoring agents)
```
These rules work because LLMs follow instructions well when the rules are explicit and specific. Vague instructions like "be efficient" do not work. Specific instructions like "never read files larger than 500 lines" give the model a clear boundary to follow.
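The log-file rule ("read only the last 100 lines") is a good example of a boundary you can also enforce mechanically. This sketch streams the file through a bounded deque, so even a multi-gigabyte log never sits in memory, and only the tail reaches the context:

```python
from collections import deque

def tail_lines(path: str, n: int = 100) -> list[str]:
    """Return the last n lines of a file without holding it all in memory."""
    with open(path, errors="replace") as f:
        # deque with maxlen keeps only the most recent n lines as it streams
        return list(deque(f, maxlen=n))
```

Expose this as the agent's only file-reading tool for logs and the 500-line rule becomes impossible to violate, not merely discouraged.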
Putting It All Together: The Defense Stack
The best protection is layered. No single measure catches everything. Here is the full defense stack, ordered from innermost to outermost layer:
| Layer | Mechanism | What It Catches |
|---|---|---|
| 1. SOUL.md Rules | Behavioral guidance for the LLM | Wasteful patterns, large file reads, unnecessary retries |
| 2. Application Rate Limits | Token bucket / sliding window | Fast loops, chatty heartbeats, retry storms |
| 3. Environment Variables | Max retries, timeouts, session limits | Runaway processes, stuck tasks, long sessions |
| 4. Provider Cost Caps | Hard spending limits on API keys | Everything else. The last line of defense. |
Layer 1 (SOUL.md) prevents most waste proactively. Layer 2 (rate limits) catches anything that slips through. Layer 3 (environment variables) enforces hard technical limits. Layer 4 (provider cost caps) is your financial safety net that ensures you never get a bill you cannot afford.
With all four layers in place, the $5 overnight bill scenario becomes impossible. The Cloudflare loop would be caught by SOUL.md rules at Layer 1. If the LLM ignored those rules, the per-minute rate limit at Layer 2 would cap it at 10 calls. If that somehow failed, the environment timeout at Layer 3 would kill the session after 60 minutes. And if everything else failed, the $10 provider cap at Layer 4 would cut off API access before costs got out of hand.
Monitoring Token Usage
Prevention is only half the equation. You also need visibility into how many tokens your agents are actually using. Without monitoring, you will not know if your limits are too tight (agent is being throttled unnecessarily) or too loose (agent is still wasting tokens).
Add logging rules to your SOUL.md:
```markdown
# SOUL.md - Token Logging

## Logging
- At the end of each task, log:
  - Task name
  - Number of LLM calls made
  - Approximate tokens used (input + output)
  - Whether any retries occurred
  - Duration in seconds
- At the end of each session, log a summary:
  - Total tasks completed
  - Total tasks failed
  - Total approximate token usage
  - Any rate limits that were hit
```
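It also helps to track usage in the runtime itself, so the numbers exist even if the model forgets to write its log. A minimal illustrative tracker (all names are hypothetical) that your code updates around each LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class UsageTracker:
    """Accumulates per-session LLM usage for end-of-session summaries."""
    calls: int = 0
    tokens: int = 0
    retries: int = 0
    tasks: dict = field(default_factory=lambda: {"completed": 0, "failed": 0})

    def record_call(self, input_tokens: int, output_tokens: int,
                    retried: bool = False) -> None:
        self.calls += 1
        self.tokens += input_tokens + output_tokens
        if retried:
            self.retries += 1

    def summary(self) -> str:
        return (f"calls={self.calls} tokens~{self.tokens} "
                f"retries={self.retries} "
                f"tasks_ok={self.tasks['completed']} "
                f"tasks_failed={self.tasks['failed']}")

usage = UsageTracker()
usage.record_call(1200, 300)
usage.record_call(4000, 150, retried=True)
usage.tasks["completed"] += 1
```

Emit `usage.summary()` at session end and ship it to whatever alerting channel you already watch; a sudden jump in `tokens~` is usually the first visible sign of a loop.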
Check your LLM provider's usage dashboard daily for the first week after deploying a new agent. Once you confirm that usage is within expected bounds, you can reduce monitoring to weekly. Set up email or Telegram alerts for when usage spikes above your normal baseline.
Quick Reference: Token Cost by Model
Understanding how much each token costs helps you estimate your agent's running expenses and set appropriate budgets.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | 500 calls/day est. |
|---|---|---|---|
| Claude Sonnet | $3 | $15 | ~$2-5/day |
| Claude Haiku | $0.25 | $1.25 | ~$0.20-0.50/day |
| GPT-4o | $2.50 | $10 | ~$1.50-4/day |
| GPT-4o Mini | $0.15 | $0.60 | ~$0.10-0.30/day |
| Ollama (local) | $0 | $0 | $0/day |
For most agent tasks that do not require top-tier reasoning, Claude Haiku or GPT-4o Mini offer the best cost-to-performance ratio. Reserve Sonnet or GPT-4o for tasks that genuinely need stronger reasoning, like code generation or complex analysis. Use Ollama for routine monitoring tasks where cost needs to be zero.
Frequently Asked Questions
Why is my OpenClaw agent using so many tokens?
The most common causes are Cloudflare challenge loops (the agent keeps retrying a blocked request), infinite retry configurations with no backoff or cap, reading large files or entire codebases into context, and heartbeat intervals that are too frequent. Check your agent logs for repeated identical requests, which is the clearest sign of a loop.
How do I set a cost cap on my OpenClaw agent?
Add a cost_limits section to your SOUL.md with max_tokens_per_task, max_cost_per_day, and alert_threshold fields. For API-level control, set OPENAI_MAX_TOKENS or ANTHROPIC_MAX_TOKENS in your environment variables. You can also use your LLM provider's dashboard to set hard spending limits on your API key.
What is the Cloudflare loop problem?
When an OpenClaw agent makes HTTP requests to a website protected by Cloudflare, the response is often a JavaScript challenge page instead of the actual content. The agent sees unexpected HTML, assumes the request failed, and retries. Each retry hits the same challenge page, creating an infinite loop that burns tokens on every attempt. The fix is to detect Cloudflare responses and abort instead of retrying.
Can I run OpenClaw agents for free to avoid token costs?
Yes. You can use Ollama with a local model like Llama 3 or Mistral. Local models have zero API costs because they run on your own hardware. CrewClaw deploy packages work with any LLM provider, including Ollama. The trade-off is that local models require decent hardware (at least 16GB RAM for 7B models) and may be slower than cloud APIs.
What should I set my agent's heartbeat interval to?
For most use cases, every 30 to 60 minutes is sufficient. Agents that monitor real-time systems like uptime checkers might need 5 to 10 minute intervals. Never set a heartbeat below 1 minute unless you have a specific reason and understand the token cost. A heartbeat every 30 seconds on Claude can cost over $50 per day in tokens alone.
Deploy agents that stay within budget
Free scan. Enter your URL, get an SEO analysis and a custom AI team recommendation with built-in cost controls.
Deploy a Ready-Made AI Agent
Skip the setup. Pick a template and deploy in 60 seconds.