Rate Limiting & Quotas

Protect your API with per-minute rate limits and monthly quotas.

Holdify provides two layers of usage control:

Rate limits — Prevent burst abuse with per-minute request limits
Quotas — Enforce plan limits with monthly request allowances

Token bucket algorithm

Holdify rate-limits requests with the token bucket algorithm, which keeps throughput smooth while still allowing short bursts of traffic.

How it works

text

┌─────────────────────────────────────────────────────────────┐
│                    Token Bucket Algorithm                    │
└─────────────────────────────────────────────────────────────┘

  Bucket Capacity: 60 tokens (rate limit per minute)

  ┌──────────────────────────────────────────────────────────┐
  │ ████████████████████████████████████░░░░░░░░░░░░░░░░░░░░ │
  │ ◄──────── 40 tokens remaining ────────►                  │
  └──────────────────────────────────────────────────────────┘

  • Tokens refill at a steady rate (1 per second for 60/min)
  • Each request consumes 1 token (or more with 'units' param)
  • When bucket is empty → 429 Rate Limit Exceeded
  • Bucket never exceeds capacity (burst is limited)

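Why token bucket? It avoids the hard resets of fixed windows and is more forgiving than strict pacing. At 60 requests per minute, a user who was idle for 30 seconds has accumulated up to 30 tokens and can spend them in a short burst, rather than being strictly limited to 1 request per second.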
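The refill and consume rules above can be sketched in a few lines. This is a minimal illustration of the general algorithm, not Holdify's internal implementation; the class name, method names, and the explicit `nowMs` parameter (used to keep the example deterministic) are all assumptions made for this sketch.

```typescript
// Minimal token bucket sketch. All names here are illustrative,
// not part of Holdify's API.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;               // e.g. 60 tokens
    this.refillPerSecond = refillPerSecond; // e.g. 1 token per second for 60/min
    this.tokens = capacity;                 // the bucket starts full
    this.lastRefillMs = nowMs;
  }

  // Add the tokens earned since the last call, never exceeding capacity.
  private refill(nowMs: number): void {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
  }

  // Try to take `units` tokens. A false result is where a server would answer 429.
  tryConsume(units: number = 1, nowMs: number = Date.now()): boolean {
    this.refill(nowMs);
    if (units > this.tokens) return false;
    this.tokens -= units;
    return true;
  }

  // Whole tokens currently available.
  remaining(nowMs: number = Date.now()): number {
    this.refill(nowMs);
    return Math.floor(this.tokens);
  }
}
```

With a capacity of 60 and a refill rate of 1 token per second, a caller that drains the bucket and then waits 30 seconds can send 30 requests back to back before being rejected again.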

Rate limit vs quota

These serve different purposes and are enforced separately:

Comparison

text

┌─────────────────────────────────────────────────────────────┐
│           Rate Limit vs Quota: Key Differences               │
└─────────────────────────────────────────────────────────────┘

                    Rate Limit              Quota
                    ──────────              ─────
  Purpose           Burst protection        Usage metering
  Window            Per minute              Per month
  Resets            Every 60 seconds        Start of billing period
  HTTP Status       429 Too Many Requests   402 Payment Required
  Response Field    rateLimit.remaining     quota.remaining
  Use Case          Prevent abuse           Enforce plan limits

Configuring limits

Configure rate limits and quotas in your plan settings:

Plan configuration

json

// Plan configuration in Holdify Dashboard
// Settings → Plans → Edit Plan

{
  "name": "pro",
  "displayName": "Pro Plan",

  // Rate limiting (per minute)
  "rateLimit": {
    "requestsPerMinute": 60,    // Max 60 requests per minute
    "burstLimit": 10            // Allow burst of 10 requests
  },

  // Quota (per billing period)
  "quota": {
    "requestsPerMonth": 5000,   // 5000 requests per month
    "resetDay": 1               // Resets on 1st of each month
  },

  // Features for gating
  "features": [
    "model:gpt-4o",
    "model:claude-sonnet",
    "priority-queue"
  ]
}
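To make the `resetDay` setting concrete, the sketch below computes when the next quota reset would fall. It assumes resets happen at midnight UTC and that `resetDay` is between 1 and 28 so the day exists in every month; neither assumption is confirmed by the configuration above, and `nextQuotaReset` is a hypothetical helper, not a Holdify function.

```typescript
// Hypothetical helper: next reset instant strictly after `now`.
// Assumes midnight UTC and a resetDay of 1..28 (both are assumptions).
function nextQuotaReset(now: Date, resetDay: number): Date {
  if (!Number.isInteger(resetDay) || resetDay < 1 || resetDay > 28) {
    throw new RangeError("resetDay must be an integer from 1 to 28");
  }
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth();
  const thisMonth = new Date(Date.UTC(year, month, resetDay));
  if (thisMonth.getTime() > now.getTime()) {
    return thisMonth;
  }
  // Date.UTC rolls month index 12 over into January of the following year.
  return new Date(Date.UTC(year, month + 1, resetDay));
}
```

For a plan with `resetDay: 1`, a request made in mid-December would see its quota reset on January 1 of the following year.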

Per-key overrides

You can override plan limits for specific API keys, useful for enterprise customers:

Per-key limits

json

// You can also set per-key limits that override plan defaults
// Useful for enterprise customers or special cases

POST /v1/api-keys
{
  "tenantId": "enterprise_customer",
  "name": "Enterprise Key",
  "overrides": {
    "rateLimit": {
      "requestsPerMinute": 300  // Higher than plan default
    },
    "quota": {
      "requestsPerMonth": 100000
    }
  }
}
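One way to reason about overrides is as a field-by-field merge on top of the plan defaults. The sketch below shows that idea; whether Holdify merges individual fields or replaces whole sections is not stated here, so treat the merge behavior, the type names, and `effectiveLimits` as assumptions.

```typescript
// Shapes mirror the plan and override examples; the type names are illustrative.
interface Limits {
  rateLimit: { requestsPerMinute: number; burstLimit?: number };
  quota: { requestsPerMonth: number; resetDay?: number };
}

type LimitOverrides = {
  rateLimit?: Partial<Limits["rateLimit"]>;
  quota?: Partial<Limits["quota"]>;
};

// Hypothetical helper: fields present in the overrides win, everything
// else falls back to the plan default.
function effectiveLimits(plan: Limits, overrides: LimitOverrides = {}): Limits {
  return {
    rateLimit: { ...plan.rateLimit, ...overrides.rateLimit },
    quota: { ...plan.quota, ...overrides.quota },
  };
}
```

Under this reading, the enterprise key above would get 300 requests per minute and 100,000 requests per month, while keeping any plan settings it does not override.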

Response headers

Holdify includes rate limit and quota information in response headers:

Header	Description
X-RateLimit-Limit	Maximum requests allowed per rate limit window
X-RateLimit-Remaining	Requests remaining in current window
X-RateLimit-Reset	Unix timestamp when window resets
X-Quota-Limit	Total quota for billing period
X-Quota-Remaining	Quota remaining for billing period
X-Quota-Reset	ISO 8601 timestamp when quota resets
Retry-After	Seconds until rate limit resets (on 429)
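A client can read these headers into a typed object before deciding whether to throttle. The sketch below assumes the header formats listed in the table (Unix seconds for `X-RateLimit-Reset`, ISO 8601 for `X-Quota-Reset`); `parseUsageHeaders`, `UsageInfo`, and `HeaderSource` are hypothetical names, and the `Headers` object returned by `fetch` satisfies `HeaderSource`.

```typescript
// Anything with a get(name) method works, including fetch's Headers.
type HeaderSource = { get(name: string): string | null };

interface UsageInfo {
  rateLimit: { limit: number; remaining: number; resetAt: Date };
  quota: { limit: number; remaining: number; resetAt: Date };
  retryAfterSeconds: number | null; // only present on 429 responses
}

// Hypothetical helper: missing numeric headers become NaN, missing dates
// become an Invalid Date, so callers can detect absent values.
function parseUsageHeaders(headers: HeaderSource): UsageInfo {
  const num = (name: string): number => Number(headers.get(name) ?? NaN);
  const retryAfter = headers.get("Retry-After");
  return {
    rateLimit: {
      limit: num("X-RateLimit-Limit"),
      remaining: num("X-RateLimit-Remaining"),
      // The header carries seconds; Date expects milliseconds.
      resetAt: new Date(num("X-RateLimit-Reset") * 1000),
    },
    quota: {
      limit: num("X-Quota-Limit"),
      remaining: num("X-Quota-Remaining"),
      resetAt: new Date(headers.get("X-Quota-Reset") ?? NaN),
    },
    retryAfterSeconds: retryAfter === null ? null : Number(retryAfter),
  };
}
```

Typical use would be `parseUsageHeaders(res.headers)` on a `fetch` response, followed by a check of `rateLimit.remaining` before sending the next request.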

Handling in code

Here's how to properly handle rate limits and quotas in your API:

Handling limits

typescript

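// Handling rate limits and quota in your API.
// `holdify` is your initialized Holdify client and `response(status, body, headers?)`
// is your framework's response helper; neither is defined in this snippet.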
async function handleRequest(apiKey: string) {
  const result = await holdify.verify(apiKey, {
    resource: 'chat_requests',
    units: 1,
  });

  if (!result.valid) {
    return response(401, { error: 'Invalid API key' });
  }

  // Check rate limit (per-minute)
  if (result.rateLimit.remaining <= 0) {
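    // Clamp to at least 1 second so a stale reset timestamp never yields zero or a negative value
    const retryAfter = Math.max(1, result.rateLimit.reset - Math.floor(Date.now() / 1000));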
    return response(429, {
      error: 'Rate limit exceeded',
      retryAfter,
      message: `Try again in ${retryAfter} seconds`,
    }, {
      'Retry-After': String(retryAfter),
      'X-RateLimit-Limit': String(result.rateLimit.limit),
      'X-RateLimit-Remaining': '0',
      'X-RateLimit-Reset': String(result.rateLimit.reset),
    });
  }

  // Check quota (per-month)
  if (result.quota.remaining <= 0) {
    return response(402, {
      error: 'Monthly quota exceeded',
      resetAt: result.quota.resetAt,
      message: `Quota resets on ${result.quota.resetAt}`,
    }, {
      'X-Quota-Limit': String(result.quota.limit),
      'X-Quota-Remaining': '0',
      'X-Quota-Reset': result.quota.resetAt,
    });
  }

  // Process the request...
  return response(200, { success: true });
}

HTTP status codes

Status	Condition	User action
429	Rate limit exceeded	Wait the number of seconds given in Retry-After, then retry
402	Monthly quota exhausted	Wait for next billing period or upgrade plan

Best practices

Forward headers to clients. Pass X-RateLimit-* and X-Quota-* headers to your API consumers so they can implement client-side throttling.
Implement exponential backoff. When receiving 429, wait at least the Retry-After duration before retrying, and increase the delay exponentially if retries keep failing.
Set appropriate limits per plan. Free tiers should have lower limits to prevent abuse. Pro/Business tiers can have higher limits.
Use units for variable costs. If some operations cost more (e.g., GPT-4 vs GPT-3.5), use the units parameter to consume more quota.
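As a rough illustration of the backoff advice above, the helper below picks the delay before a retry by combining exponential backoff with the server's Retry-After value. The function name, the 500 ms base, and the 60 second cap are assumptions chosen for this sketch, not values prescribed by Holdify.

```typescript
// Hypothetical helper: delay in milliseconds before retry number `attempt` (0-based).
function retryDelayMs(
  attempt: number,
  retryAfterSeconds: number | null,
  baseMs: number = 500,
  maxMs: number = 60_000,
): number {
  // Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
  const backoff = Math.min(maxMs, baseMs * 2 ** attempt);
  // Treat a missing, zero, or invalid Retry-After as "no hint from the server".
  const serverHint =
    retryAfterSeconds !== null && retryAfterSeconds > 0 ? retryAfterSeconds * 1000 : 0;
  // Never retry sooner than the server asked for.
  return Math.max(backoff, serverHint);
}
```

A client would call this after each 429, sleep for the returned duration, and give up after a fixed number of attempts. Adding a small random jitter to the result helps keep many clients from retrying at the same instant.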