The nightly sync between your customers' Salesforce instances and your product worked fine for months. Then you added your 171st customer.
In the days that followed, the integration produced no obvious errors or alerts. But you did notice a substantial uptick in customers reporting that their data was a day behind or that records weren't appearing when they should.
The issue? The integration was running, but it wasn't doing everything it was supposed to. The Salesforce API had started sending back 429 (Too Many Requests) responses, and your code was ignoring them.
This is the rate-limiting failure pattern. It's often quiet and easy to miss.
Rate limiting is often one of the last things teams think about when building their first integration, and one of the first things that bites them when they scale. The problem isn't that rate limits exist. They're a necessary mechanism for protecting shared infrastructure and ensuring fair usage. The problem is that most integrations aren't designed around them from the start because, in sandbox environments, rate limits are often more permissive (or non-existent). The first real encounter with rate limits is often in production, at scale, with real customer data.
This post covers what rate limits actually are, how third-party APIs implement them, and the two reactive patterns every integration needs: exponential backoff with jitter, and circuit breakers. Our next post on the topic covers the proactive, architectural patterns (rate-aware scheduling, batching, priority lane segmentation, and top-level observability) that become necessary once you're operating integrations at scale.
Why rate limiting is harder in B2B SaaS
Let's look at why rate limiting for B2B integrations differs from that for B2C (consumer) applications.
In consumer apps, traffic is driven by users working through a UI. It's relatively predictable, mostly human-paced, and bursty in ways that correlate with user activity. In B2B integrations, traffic is often driven by scheduled batch jobs, event-driven flows, and bulk imports. This can lead to large, irregular bursts, with thousands of requests arriving in seconds when a nightly sync kicks off or when a downstream outage is resolved.
A single customer syncing 1,000 records per night probably isn't an issue. Five hundred customers, each syncing 1,000 records at midnight (every night), is a different story. Rate limits that were fine in testing and even early production fail when usage scales.
If multiple customers share an API key or service account, a single customer running a bulk export may consume the rate limit budget for all other customers using that credential. At the instance level, this looks like a rate limit issue; architecturally, it's a tenant isolation problem. The structural fix is per-customer credentials, where each customer authenticates with their own OAuth connection or API key, giving them their own rate limit budget from the provider. On Prismatic, this is the default: each customer instance has its own connection, which holds its own credentials. One customer's aggressive sync can't starve another's.
When a 429 triggers an automatic retry, that retry triggers another 429, and every integration instance is doing the same thing simultaneously, you get a retry storm. A spike that would have naturally subsided becomes a sustained load that keeps the API busy far longer than the original burst would have.
A 429 that isn't handled correctly often doesn't surface as a visible error; instead, it drops data or skips records, and the integration continues running as if nothing happened. Without proper logging, you have no way to know whether a rate limit was ever hit, so your first indication is a customer support ticket. If you haven't given integration logging careful thought, check out our post: Integration Logs: Neglected but Essential.
The rate limit taxonomy
The right response to a rate limit depends entirely on which type of limit you're hitting. This is the part that often gets skipped. "Just add exponential backoff" is incomplete advice.
Here's what you should know before reaching for a specific pattern to mitigate rate limit issues:
Fixed window
A counter resets at a fixed interval, typically at the top of each hour or the start of each day. It's simple to implement and the most common model for API subscriptions with daily quotas.
The common failure mode for fixed window rate limits is boundary bursting. If all your customer syncs start at midnight, they all hit the counter simultaneously at the beginning of the window, exhaust it quickly, and spend the rest of the window blocked. The counter resets, and they burst again. Fixed windows also allow a theoretical double-rate burst at the boundary: with a limit of 100 requests per hour, 100 requests at 11:59 and another 100 at 12:01 all succeed, which is 200 requests in two minutes.
APIs that use a fixed window rate limit, such as Salesforce, often include response headers such as X-RateLimit-Limit, X-RateLimit-Remaining, or X-RateLimit-Reset.
Sliding window
This limit applies to a rolling time period, such as no more than 100 requests in any 60-second span, evaluated continuously. It's smoother than a fixed window, with no boundary burst problem, but harder to work with because the counter never resets at a predictable time.
To stay within a sliding window limit, you need to track your recent request history; a client-side sliding window counter makes this manageable. GitHub uses this model for secondary rate limits.
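Here's a minimal sketch of a client-side sliding window counter in TypeScript (the class name and the 100-requests-per-60-seconds figures are illustrative, not any specific provider's limits):

```typescript
// A minimal client-side sliding window counter. It tracks the
// timestamps of recent requests and refuses to send when the count
// inside the rolling window would exceed the limit.
class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(
    private readonly maxRequests: number, // e.g., 100
    private readonly windowMs: number // e.g., 60_000 (60 seconds)
  ) {}

  // Returns true if a request may be sent right now.
  tryAcquire(now: number = Date.now()): boolean {
    // Drop timestamps that have aged out of the rolling window.
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);

    if (this.timestamps.length >= this.maxRequests) {
      return false; // Over the limit; the caller should wait.
    }
    this.timestamps.push(now);
    return true;
  }
}
```

Call tryAcquire() before each outbound request; when it returns false, delay until the oldest timestamp ages out of the window rather than retrying in a tight loop.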
Token bucket
You start with a bucket of tokens. Each request costs one token, and tokens replenish at a fixed rate up to the bucket's maximum capacity.
Brief spikes are permitted (you can drain the bucket quickly), but sustained high volume is not (you're capped at the replenishment rate). This is the most dev-friendly model because it feels responsive. Stripe and HubSpot both use token bucket variations. However, teams can misread it by testing with a burst, seeing it work, and assuming there is headroom where there isn't any.
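To make the model concrete, here's a minimal client-side token bucket sketch in TypeScript (the capacity and refill rate are placeholders, not Stripe's or HubSpot's actual values):

```typescript
// A minimal token bucket: bursts are allowed up to `capacity`,
// but sustained throughput is capped at `refillPerSecond`.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number, // e.g., 10 tokens
    private readonly refillPerSecond: number // e.g., 2 tokens/second
  ) {
    this.tokens = capacity; // Start full, so an initial burst succeeds.
  }

  tryAcquire(): boolean {
    // Replenish tokens based on elapsed time, capped at capacity.
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = now;

    if (this.tokens < 1) {
      return false; // Bucket drained: the sustained rate is exceeded.
    }
    this.tokens -= 1;
    return true;
  }
}
```

This also makes the misreading concrete: with a capacity of 10, a burst of 10 requests succeeds instantly, but an 11th in the same second fails even though the average rate still looks modest.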
Leaky bucket
Requests enter a fixed-size queue and are processed at a constant, controlled rate regardless of how quickly they arrive. Unlike a token bucket, there's no burst tolerance. Instead, everything processes at the leak rate. Excess requests beyond the queue capacity are rejected outright.
Less common in third-party public APIs, but relevant as a pattern for how you might design your own outbound request queue. The leaky bucket approach is used when you add a rate-aware scheduler that spaces requests at a fixed interval – which we'll cover in Part 2.
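As a preview of that idea, here's a minimal sketch of a leaky bucket outbound queue in TypeScript (the queue size and leak interval are placeholder values):

```typescript
// A minimal leaky bucket for outbound requests: a bounded queue
// drained at a fixed interval, with overflow rejected outright.
class LeakyBucketQueue {
  private queue: Array<() => Promise<void>> = [];

  constructor(
    private readonly maxQueueSize: number, // e.g., 1000 pending requests
    leakIntervalMs: number // e.g., 100 ms, i.e., 10 requests/second
  ) {
    // Drain one request per tick, regardless of how fast they arrive.
    setInterval(() => {
      const next = this.queue.shift();
      if (next) void next();
    }, leakIntervalMs);
  }

  // Returns false when the queue is full and the request is rejected.
  enqueue(request: () => Promise<void>): boolean {
    if (this.queue.length >= this.maxQueueSize) {
      return false;
    }
    this.queue.push(request);
    return true;
  }
}
```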
Concurrent request limits
This is a fundamentally different enforcement mechanism. Rather than limiting requests per unit of time, concurrent limits restrict the number of requests that can be in-flight simultaneously. You can hit this even with slow, carefully spaced requests if they all happen to overlap.
Some systems, such as Workday and various HRIS APIs, enforce strict concurrency limits on bulk data endpoints. For those systems, a 429 that arrives when your per-second request rate seems reasonable is a sign you're tripping a concurrency limit, just not a time-based one.
Prismatic's own runtime enforces concurrency limits at the organization level. If you reach your plan's concurrent execution limit, incoming requests each get a 429 until capacity frees up. You can configure a concurrency threshold alert to warn your team before you hit the ceiling. For webhook-triggered flows, Prismatic's throttled concurrency mode lets you cap the number of simultaneous executions at the flow level (the best way to protect an API that enforces its own concurrency limits).
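If you need to enforce a concurrency cap in your own code rather than at the platform level, a small semaphore is the standard approach. Here's a minimal sketch (the class name and the limit of 4 are illustrative):

```typescript
// A minimal semaphore: caps the number of in-flight requests,
// independent of how quickly they're issued.
class ConcurrencyLimiter {
  private inFlight = 0;
  private waiters: Array<() => void> = [];

  constructor(private readonly maxConcurrent: number) {} // e.g., 4

  async run<T>(task: () => Promise<T>): Promise<T> {
    // Wait until a slot frees up (re-check the count after each wake-up).
    while (this.inFlight >= this.maxConcurrent) {
      await new Promise<void>((resolve) => this.waiters.push(resolve));
    }
    this.inFlight++;
    try {
      return await task();
    } finally {
      this.inFlight--;
      this.waiters.shift()?.(); // Wake the next waiting task, if any.
    }
  }
}
```

Wrapping every outbound call in run() guarantees no more than maxConcurrent requests overlap, even when a burst of work arrives at once.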
Per-endpoint and per-resource limits
Many APIs apply limits by endpoint. A bulk endpoint might allow far fewer requests per hour than a single-record endpoint. Search and reporting endpoints are often more restricted than those for standard CRUD operations. Staying under the global limit doesn't guarantee you'll stay under a per-endpoint limit.
Always read the rate-limiting documentation for specific endpoints, not just the global limits page. Document anything that stands out, such as APIs that return rate limit info in the response body rather than headers, or ones that return a 200 with an error payload instead of a proper 429.
And never hard-code rate limit assumptions. Parse response headers dynamically on every response. Limits change when providers update their infrastructure or pricing tiers. What is documented for an API rate limit today is likely to change tomorrow.
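For example, here's a small TypeScript helper that reads the common X-RateLimit-* headers on every response (header names and units vary by provider, so treat this as a starting point rather than a universal parser):

```typescript
// Read rate limit state from response headers on every call,
// instead of hard-coding the limits documented today.
interface RateLimitInfo {
  remaining?: number; // Requests left in the current window
  resetAtMs?: number; // When the window resets (epoch milliseconds)
}

function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  const remaining = headers.get("X-RateLimit-Remaining");
  const reset = headers.get("X-RateLimit-Reset");

  return {
    remaining: remaining !== null ? Number(remaining) : undefined,
    // Many providers send the reset as epoch seconds; confirm the unit
    // in your provider's docs before converting.
    resetAtMs: reset !== null ? Number(reset) * 1000 : undefined,
  };
}
```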
Reactive pattern 1 – Retry with exponential backoff and jitter
When you receive a 429, wait before retrying. The wait time increases exponentially with each consecutive failure.
Use this for burst or transient rate limits, when you've sent too many requests too quickly, but the API will accept them if you slow down and give things time to recover.
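Here's a minimal sketch of the pattern in TypeScript (the base interval, wait cap, and retry count are illustrative defaults, and fetchWithBackoff is our name for the wrapper). The notes after the code unpack the key decisions:

```typescript
// Retry a request with exponential backoff, full jitter, and
// Retry-After support. The constants are illustrative defaults.
const BASE_MS = 1_000; // First backoff interval
const MAX_WAIT_MS = 30_000; // Hard cap on any single wait
const MAX_RETRIES = 5;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Parse Retry-After, which may be seconds or an HTTP date.
function retryAfterMs(header: string | null): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000;
  const dateMs = Date.parse(header);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - Date.now());
}

async function fetchWithBackoff(
  url: string,
  init?: RequestInit
): Promise<Response> {
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const response = await fetch(url, init);
    if (response.status !== 429) return response;
    if (attempt === MAX_RETRIES) break; // Retries exhausted.

    // Prefer the server's own instruction when it provides one.
    const serverWait = retryAfterMs(response.headers.get("Retry-After"));
    const backoff = BASE_MS * 2 ** attempt + Math.random() * BASE_MS; // Jitter
    await sleep(Math.min(serverWait ?? backoff, MAX_WAIT_MS));
  }
  // Persistent failure: surface it (e.g., route to a dead-letter queue).
  throw new Error(`Rate limited after ${MAX_RETRIES} retries: ${url}`);
}
```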
- Always check Retry-After first. When a 429 includes a Retry-After header, use it. Don't calculate your own wait time when the API is already telling you exactly how long to wait. Some APIs express this as seconds; others as an HTTP date. You'll need to parse both.
- Jitter is not optional. Without it, every integration instance that hits a rate limit backs off for the same interval and then retries simultaneously. Instead of waiting exactly base * 2^attempt, wait base * 2^attempt + random(0, base).
- Set hard limits. Cap the maximum wait time (30 seconds is reasonable for most scenarios) and the maximum retry count. For requests that exhaust all retries, route them to a dead-letter queue. After all, you need visibility into persistent failures, and you may need to replay the executions once the underlying issues are resolved.
Prismatic's automatic retry handles transient failures at the execution level with configurable retry counts (up to 10) and intervals. For this to work correctly with rate limits, your integration code needs to classify 429 responses as "retry-able" errors (instead of fatal failures), so the platform will re-queue the execution instead of sending an immediate alert. When you get the error classification right, the platform handles the wait and the retry logic for you, and the retry count and intervals are configurable per instance without touching code.
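The classification itself can be a few lines. In this sketch, RetryableError is a hypothetical error type standing in for however your runtime distinguishes transient failures from fatal ones:

```typescript
// A hypothetical error type that retry machinery treats as "re-queue
// and try again later" rather than "fail the execution".
class RetryableError extends Error {}

// 429 is rate limiting; 502/503/504 are transient upstream trouble.
function isRetryableStatus(status: number): boolean {
  return [429, 502, 503, 504].includes(status);
}

async function callApi(url: string): Promise<Response> {
  const response = await fetch(url);
  if (response.ok) return response;

  if (isRetryableStatus(response.status)) {
    throw new RetryableError(`Transient failure (${response.status}): ${url}`);
  }
  throw new Error(`Fatal failure (${response.status}): ${url}`);
}
```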
Reactive pattern 2 – Circuit breaker
When an API is consistently rate-limiting you (not occasionally, but repeatedly), stop sending requests and fail in a controlled way rather than hammering the API and escalating the problem.
Use this approach when backoff and retry are failing repeatedly. Sustained 429s can signal something beyond a transient spike: a structural over-capacity problem, a third-party limit change, or a degraded API. Continuing to retry against a sustained limit creates a retry storm in which multiple systems retry independently, and the resulting traffic keeps the API throttled far longer than the original burst would have lasted.
This pattern has three states:
- Closed (normal) – Requests flow through, and failure rate is tracked.
- Open (tripped) – After N consecutive failures, requests are blocked for a cooldown period.
- Half-open (probing) – After the cooldown, a single probe request is sent. Success closes the circuit; failure resets the cooldown.
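Here's a minimal sketch of a breaker with those three states in TypeScript (the failure threshold and cooldown are illustrative):

```typescript
// A minimal circuit breaker with closed / open / half-open states.
type CircuitState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: CircuitState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number, // e.g., 5 consecutive failures
    private readonly cooldownMs: number // e.g., 60_000 (one minute)
  ) {}

  async execute<T>(request: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        // Fail fast instead of hammering a throttled API.
        throw new Error("Circuit open: request blocked during cooldown");
      }
      this.state = "half-open"; // Cooldown elapsed: allow one probe.
    }

    try {
      const result = await request();
      this.state = "closed"; // Success: close and reset.
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures++;
      if (
        this.state === "half-open" ||
        this.consecutiveFailures >= this.failureThreshold
      ) {
        this.state = "open"; // Trip (or re-trip) and restart the cooldown.
        this.openedAt = Date.now();
        // Log at the error level here so alerting picks it up.
      }
      throw err;
    }
  }
}
```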
The most important design decision is what happens when the circuit trips. A circuit breaker that opens silently and drops data is worse than no circuit breaker at all. When the circuit opens, log a clear error and trigger an alert. On Prismatic, logging the failure at the error level triggers a Log Level Matched or Exceeded alert, which can route to email, Slack, PagerDuty, or any webhook your team monitors. Log streaming to Datadog, New Relic, or any other system you use means a tripped circuit breaker raises an alert in your observability stack. Prismatic also attaches execution context to every log entry (such as the customer ID, integration name, and flow name), so you'll know exactly which customer and which provider the circuit tripped on, not just that something failed somewhere.
What backoff and circuit breakers don't solve
Exponential backoff and circuit breakers are reactive. That is, they respond to rate limits after they've been hit. For many integrations, that may be sufficient. The 429 fires, the code backs off, and everything recovers.
At scale, reactive rate limit management isn't enough. When you're running integrations for hundreds or thousands of customers, rate limits are a predictable consequence of decisions you've already made about scheduling and request volume. Recovering from 429s as they occur treats the symptom. Preventing them through proactive design addresses the underlying issue.
We'll cover proactive rate-limiting patterns (rate-aware scheduling, distributed governors, batching, priority lane segmentation, and top-level observability) in Part 2.