Stop a “Denial-of-Wallet” Attack with Token-Aware Rate Limiting

Why your RPM-based rate limiter is silently bankrupting your AI infrastructure — and what to do about it.

A lot of infrastructure patterns we've relied on for years don't hold up anymore once you start building with LLMs. Rate limiting is a good example.

For a long time, counting requests per minute (RPM) was enough. Most API calls were roughly similar in cost, so limiting traffic also meant limiting spend. That assumption breaks quickly with LLMs. Today, two requests hitting the same endpoint can have completely different costs. One might be a short prompt. The other might quietly trigger thousands of times more compute — without looking any different at the gateway level.

That's where the idea of a Denial-of-Wallet attack comes in. Instead of crashing your system, the goal is simply to run up your bill. The tricky part: your infrastructure can look perfectly healthy while it happens.

We built the system described here after a misconfigured client drained our OpenAI budget overnight.

The Fundamental Failure of RPM in AI Workloads

For a long time, rate limiting was simple. In most REST APIs, one request was more or less the same as another — whether it was GET /users/123 or POST /orders, the cost didn't swing wildly. So we built our entire defensive philosophy on a simple assumption:

"If I cap requests, I cap cost."

That model no longer holds. AI workloads behave differently. The same endpoint can be cheap one second and absurdly expensive the next, depending entirely on what the user sends. Two requests can look identical at the HTTP level but be completely different in compute cost once tokens enter the picture.

The cost gap hiding in plain sight

In AI workloads, variability is the rule, not the exception. Consider two requests hitting the same endpoint: say, a short prompt that produces a one-line reply versus a prompt that fills the model's context window and asks for a maximum-length completion. That's a ~120,000× cost difference — and it's all coming through the same endpoint, the same auth layer, and the same rate limiter.

If your gateway is still capping at something like 60 RPM per user, it doesn't protect you in this world. A malicious (or even just badly designed) client can burn through a month's worth of Anthropic or OpenAI credits in under a minute. Nothing crashes. No obvious errors. The system just keeps humming along… until the invoice shows up.

So if your system is built around RPM limits, you're effectively treating two completely different financial events as identical.

Token-Aware Rate Limiting

Once you see this mismatch, the fix is conceptually simple: stop counting requests and start counting tokens. That leads to a different set of primitives:

- TPM — tokens per minute
- TPD — tokens per day
- TPR — tokens per request (as a guardrail, not a prediction)

This aligns much better with how LLM costs actually work.

But there's a complication: you don't know how many tokens a request will consume at the time you receive it. You know the input size, but not the output — and the output is often the expensive part.

So instead of trying to predict perfectly, most production systems use a two-step approach: reserve first, reconcile later.

The Two-Phase Reservation Pattern

Phase 1: The Reservation

When a request arrives, the gateway estimates the worst-case cost. It does something like this:

1. Tokenize the incoming prompt (cheap, deterministic, ~1 ms with tiktoken).
2. Read the max_tokens parameter from the request.
3. Calculate the worst-case cost: estimated_cost = prompt_tokens + max_tokens (sketched below).
4. Atomically check and decrement the user's token bucket. If insufficient, reject with 429 Too Many Tokens.

This ensures one thing very clearly: you never exceed your pre-approved budget in the worst case.
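As a minimal sketch of that estimation step (assuming tiktoken as the gateway-side tokenizer, with a hypothetical estimate_worst_case helper), the math is just the prompt count plus the client's output budget:

```python
import tiktoken

# Hypothetical gateway-side helper: this is a worst-case ceiling, not a prediction.
enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_worst_case(prompt: str, max_tokens: int) -> int:
    prompt_tokens = len(enc.encode(prompt))  # deterministic, roughly ~1 ms
    return prompt_tokens + max_tokens        # assume the model spends its full output budget
```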
Phase 2: The Refund

Once the LLM finishes generating a response, the actual usage is usually lower than the reservation. After generation completes, the system reconciles the difference:

1. Read usage.completion_tokens from the LLM response.
2. Calculate the difference between reserved and actual.
3. Refund the unused tokens back to the bucket:

refund = max_tokens - actual_completion_tokens

Without this step, you'd constantly over-reserve capacity and waste usable quota, especially for shorter-than-expected outputs.

Why this works in practice

This pattern isn't specific to LLMs — it's the same idea used in financial systems. Credit card transactions don't charge the final amount immediately. They pre-authorize a maximum and settle later.

Token usage behaves similarly. You're always dealing with uncertainty on the output side, so the only safe approach is:

- assume the worst upfront
- adjust once the real cost is known

A note on implementation

In real systems, the tricky part isn't the core idea — it's everything around concurrency and correctness under load.

A naive "check then increment" approach breaks quickly in practice. Two requests can read the same bucket state, both pass the limit check, and then both update it. At that point, your limit isn't really a limit anymore.

That's why production systems usually avoid split-step logic and rely on something atomic instead:

- Redis Lua scripts (most common in practice; a sketch follows this section)
- transactional increments with rollback semantics

Both aim for the same thing: make the reservation decision indivisible.

There are also a couple of subtleties that show up once you run this at scale.

Tokenizer accuracy is one of them. Different models — and even different versions of the same model — can produce slightly different token counts. So your estimate will always be close, but never exact.

And then there's state management. TTLs matter more than people expect. Without proper expiration, token buckets slowly accumulate stale keys, especially in multi-tenant systems. Over time, that turns into silent memory growth and inconsistent enforcement.
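For reference, here is a rough sketch of what an atomic reservation can look like as a Redis Lua script. It assumes a fixed 60-second window per user and Redis 7+ (for the NX option on EXPIRE); treat it as an illustration, not a hardened implementation:

```python
import redis.asyncio as redis

r = redis.Redis()

# The script runs atomically inside Redis, so check-and-increment cannot interleave.
RESERVE_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
if current + cost > limit then
  return -1
end
redis.call('INCRBY', KEYS[1], cost)
redis.call('EXPIRE', KEYS[1], 60, 'NX')  -- set the TTL only if the key has none (Redis 7+)
return current + cost
"""

async def atomic_reserve(user_id: str, estimated: int, tpm_limit: int) -> bool:
    # Returns True if the reservation fits inside the user's TPM budget.
    result = await r.eval(RESERVE_LUA, 1, f"tpm:{user_id}", estimated, tpm_limit)
    return result != -1
```

Keeping the whole decision inside one script is what makes the limit actually hold under concurrent requests.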
A minimal reference implementation

Here's the core logic, simplified, using Redis for the bucket operations:

```python
import tiktoken
import redis.asyncio as redis
from fastapi import HTTPException

r = redis.Redis()
enc = tiktoken.encoding_for_model("gpt-4o")


async def reserve_tokens(user_id: str, prompt: str, max_tokens: int, tpm_limit: int) -> tuple[str, int]:
    prompt_tokens = len(enc.encode(prompt))
    estimated = prompt_tokens + max_tokens
    bucket_key = f"tpm:{user_id}"

    # Increment usage, then check the limit
    current = await r.incrby(bucket_key, estimated)
    await r.expire(bucket_key, 60, nx=True)  # 60s window TTL

    if current > tpm_limit:
        await r.decrby(bucket_key, estimated)  # rollback on rejection
        raise HTTPException(429, "Token budget exceeded")

    return bucket_key, estimated  # caller uses these for the refund


async def refund_tokens(bucket_key: str, reserved: int, actual: int) -> None:
    refund = reserved - actual
    if refund > 0:
        await r.decrby(bucket_key, refund)
```

Why this still isn't enough on its own

This version is intentionally simple, but it hides a few production realities:

- The INCRBY + check + rollback pattern is not atomic under race conditions.
- In real deployments, this should be wrapped in a Lua script or an equivalent atomic transaction.
- Redis cluster behavior can change consistency guarantees depending on topology.

So this is the "correct idea," not the final hardened form.

Moving enforcement to the edge

There's another important constraint: latency. If every request requires a round trip to a database for validation, you quickly add noticeable overhead — especially in streaming LLM systems where time-to-first-token matters.

Because of that, modern systems tend to push token enforcement outward, not inward. Instead of handling it in application code, they move it into an AI gateway layer. Common approaches include:

- LiteLLM Proxy — self-hosted, flexible, supports multi-provider routing and token budgets
- Cloudflare AI Gateway — edge-based enforcement with global distribution
- Kong AI plugins — integrates token-aware limits into existing API infrastructure

The idea is consistent across all of them: cost enforcement should happen before the request reaches your application logic.

Production-Ready Checklist

Before you ship token-aware rate limiting, verify:

- Per-user, per-org, AND global limits — defense in depth
- Tiered limits — free tier ≠ enterprise tier
- Refund pipeline for streaming responses (often forgotten!). "For streaming responses, you can't wait until the end to refund. Instead, count tokens as chunks arrive and reconcile mid-stream. Most providers send usage.completion_tokens in the final chunk — use that to calculate the refund before the response closes. If you wait until after the stream ends, you've already allocated that quota for nothing." A sketch of this follows the checklist.
- Alerting at 80% bucket consumption, not just at 100%
- Hard max_tokens ceiling to prevent reservation abuse
- Tokenizer mismatch tolerance — your estimate vs. the provider's count will drift
- Graceful degradation — fall back to cheaper models when buckets are near empty
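To make the streaming-refund item concrete, here is one possible sketch that reuses reserve_tokens and refund_tokens from the reference implementation above. It assumes the OpenAI Python SDK and stream_options={"include_usage": True}, which asks the API to include a usage object in the final chunk; other providers report usage slightly differently:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_with_refund(user_id: str, prompt: str, max_tokens: int, tpm_limit: int):
    # Reserve the worst case up front, exactly as in the non-streaming path.
    bucket_key, reserved = await reserve_tokens(user_id, prompt, max_tokens, tpm_limit)
    actual = reserved  # if the stream dies before usage arrives, refund nothing

    try:
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            stream=True,
            stream_options={"include_usage": True},  # final chunk carries a usage object
        )
        async for chunk in stream:
            if chunk.usage is not None:
                # Final chunk: reconcile against total actual usage (prompt + completion),
                # matching what reserve_tokens charged.
                actual = chunk.usage.total_tokens
            elif chunk.choices:
                yield chunk.choices[0].delta.content or ""
    finally:
        # Refund as the stream closes, not in a later batch job, so quota frees up immediately.
        await refund_tokens(bucket_key, reserved, actual)
```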
Closing thoughts

The shift from RPM to token-based limiting isn't just a technical tweak — it's a change in how you think about cost in AI systems.

You're no longer just protecting compute. You're protecting financial exposure.

And that's the key difference: in LLM systems, every request is effectively a transaction, not just a call. Traditional infrastructure was never designed with that assumption in mind. Modern systems have to be.
