Technical Deep Dive | 2025-02-06 | 8 min read

Rate Limiting in LLM Councils: Managing API Constraints

Handle provider rate limits gracefully while maintaining council responsiveness and user experience.

Tags: LLM council, rate limiting, API management, council of LLMs, multi-model AI

The Rate Limit Reality

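Every LLM provider enforces rate limits. A council sends each query to several models at once, so it multiplies request and token volume and reaches those limits sooner than a single-model application would. Here's how to manage them.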

Types of Rate Limits

Requests Per Minute (RPM)

OpenAI: 500-10,000 depending on tier
Anthropic: 60-1,000
Google: 60-2,000

Tokens Per Minute (TPM)

OpenAI: 200K-30M
Anthropic: 40K-400K
Google: 1M-4M

Concurrent Requests

OpenAI: Varies
Anthropic: Usually limited
Others: Provider-specific
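Because RPM and TPM apply at the same time, a request has to fit under both before it is sent. The following is a minimal sketch of that check over a sliding 60-second window. The limit values, the `fitsLimits` and `estimateTokens` names, and the four-characters-per-token heuristic are all assumptions for illustration, not figures from any provider's documentation.

```javascript
// Hypothetical per-minute limits for one provider tier; substitute your own.
const limits = { rpm: 60, tpm: 40000 };

// Rough heuristic (an assumption, not a real tokenizer): about 4 characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// `history` holds { time, tokens } entries for requests already sent.
// Returns true only if one more request fits under BOTH the RPM and TPM limits.
function fitsLimits(history, promptTokens, now, { rpm, tpm }) {
  const recent = history.filter((entry) => now - entry.time < 60000);
  const usedTokens = recent.reduce((sum, entry) => sum + entry.tokens, 0);
  return recent.length + 1 <= rpm && usedTokens + promptTokens <= tpm;
}
```

A request can be well under the RPM limit and still be rejected on tokens, which is why both counters are checked together.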

Rate Limit Strategies

1. Token Bucket Algorithm

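// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class RateLimiter {
  constructor(rate, capacity) {
    this.rate = rate; // tokens added per second
    this.capacity = capacity; // maximum burst size
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = Date.now();
  }

  async acquire(tokens = 1) {
    // A request larger than the bucket could never be satisfied.
    if (tokens > this.capacity) {
      throw new Error('Request exceeds bucket capacity');
    }
    // Loop rather than recurse so repeated waits don't chain pending promises.
    for (;;) {
      this.refill();
      if (this.tokens >= tokens) {
        this.tokens -= tokens;
        return true;
      }
      const waitTime = ((tokens - this.tokens) / this.rate) * 1000;
      await sleep(waitTime);
    }
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
    this.lastRefill = now;
  }
}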
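To see the refill arithmetic on its own, here is a small sketch that applies the same rule as `refill()` but takes the clock as an argument, so the numbers are deterministic. The `refillAt` and `waitTimeMs` helpers and the sample bucket are illustrative names and values, not part of the class above.

```javascript
// Same refill rule as the class, with the current time passed in explicitly.
function refillAt(bucket, nowMs) {
  const elapsedSeconds = (nowMs - bucket.lastRefill) / 1000;
  const tokens = Math.min(bucket.capacity, bucket.tokens + elapsedSeconds * bucket.rate);
  return { ...bucket, tokens, lastRefill: nowMs };
}

// How long a caller must wait before `needed` tokens are available.
function waitTimeMs(bucket, needed) {
  if (needed <= bucket.tokens) return 0;
  return ((needed - bucket.tokens) / bucket.rate) * 1000;
}

// A bucket that refills 2 tokens per second and holds at most 10.
const emptyBucket = { rate: 2, capacity: 10, tokens: 0, lastRefill: 0 };
const afterThreeSeconds = refillAt(emptyBucket, 3000); // 3 s * 2 tokens/s = 6 tokens
const afterOneMinute = refillAt(emptyBucket, 60000); // would be 120, capped at 10
```

The cap is what bounds bursts: however long the bucket sits idle, a caller can never spend more than `capacity` tokens at once.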

2. Multi-Model Load Balancing

Distribute across models:

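// Arguments are (refill rate per second, burst capacity); the values are illustrative.
const modelLimits = {
  'claude': new RateLimiter(60, 100),
  'gpt-4o': new RateLimiter(500, 1000),
  'gemini': new RateLimiter(100, 200)
};

const MIN_COUNCIL_SIZE = 3; // minimum number of models needed for a council

// `council` is the application's council client, assumed to be defined elsewhere.
async function balancedQuery(prompt) {
  for (;;) {
    const available = Object.entries(modelLimits).filter(([_, limiter]) => {
      limiter.refill(); // bring the count up to date before checking it
      return limiter.tokens >= 1;
    });

    if (available.length >= MIN_COUNCIL_SIZE) {
      // Spend one token per model so the limiters reflect this query.
      available.forEach(([_, limiter]) => { limiter.tokens -= 1; });
      return council.query(prompt, available.map(([model]) => model));
    }

    // Not enough models have capacity yet; wait and check again.
    await sleep(1000);
  }
}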
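The filter above treats every model with spare capacity equally. One possible refinement is to prefer the models with the most remaining headroom, so load shifts away from a provider that is close to its limit. This is a sketch only: `rankByHeadroom` is a hypothetical helper, and it assumes each entry reports its current `tokens` and its `capacity`.

```javascript
// Takes a snapshot of { modelName: { tokens, capacity } } and returns model
// names ordered from most to least remaining headroom, skipping empty buckets.
function rankByHeadroom(snapshot) {
  return Object.entries(snapshot)
    .map(([model, { tokens, capacity }]) => ({ model, headroom: tokens / capacity }))
    .filter(({ headroom }) => headroom > 0)
    .sort((a, b) => b.headroom - a.headroom)
    .map(({ model }) => model);
}

const ranked = rankByHeadroom({
  'claude': { tokens: 10, capacity: 100 },
  'gpt-4o': { tokens: 900, capacity: 1000 },
  'gemini': { tokens: 0, capacity: 200 },
});
// ranked is ['gpt-4o', 'claude']; 'gemini' is excluded because its bucket is empty.
```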

3. Priority Queues

Handle high-priority queries first:

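// PriorityQueue, hasCapacity(), and processQuery() are assumed to be defined elsewhere.
const queues = {
  high: new PriorityQueue(),
  normal: new PriorityQueue(),
  low: new PriorityQueue()
};

async function processQueue() {
  const order = ['high', 'normal', 'low'];
  while (hasCapacity()) {
    // Re-scan from 'high' after every query so urgent work that arrives
    // mid-run is not stuck behind lower-priority items.
    const priority = order.find((p) => !queues[p].isEmpty());
    if (!priority) break; // all queues are empty
    await processQuery(queues[priority].dequeue());
  }
}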
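The enqueue side is not shown above. A minimal sketch of it, using plain arrays as first-in, first-out queues in place of the `PriorityQueue` class (an assumption made so the example runs on its own; `createQueues`, `enqueue`, and `dequeueNext` are illustrative names):

```javascript
const PRIORITIES = ['high', 'normal', 'low'];

// One first-in, first-out array per priority level.
function createQueues() {
  return { high: [], normal: [], low: [] };
}

// Adds a query to the queue for its priority level; defaults to 'normal'.
function enqueue(queues, query, priority = 'normal') {
  if (!PRIORITIES.includes(priority)) {
    throw new Error(`Unknown priority: ${priority}`);
  }
  queues[priority].push(query);
}

// Returns the next query in priority order, or undefined when all queues are empty.
function dequeueNext(queues) {
  for (const priority of PRIORITIES) {
    if (queues[priority].length > 0) return queues[priority].shift();
  }
  return undefined;
}
```

Strict priority ordering like this can leave low-priority queries waiting indefinitely under sustained load, so it is worth monitoring how long items sit in the lower queues.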

4. Backpressure

Signal upstream when overloaded:

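// currentUsage, maxCapacity, and estimateWaitTime() are assumed to be tracked elsewhere.
function checkCapacity() {
  const utilization = currentUsage / maxCapacity;
  // Start rejecting at 80% so in-flight queries keep some headroom.
  if (utilization > 0.8) {
    return { accept: false, retryAfter: estimateWaitTime() };
  }
  return { accept: true };
}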
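For an HTTP API, the natural way to pass that signal upstream is a 429 response with a `Retry-After` header. The sketch below turns a capacity decision into a framework-agnostic response description. It assumes `retryAfter` is expressed in seconds, and the `toHttpResponse` name, the 202 status for accepted work, and the body fields are illustrative choices.

```javascript
// Converts a { accept, retryAfter } decision into a plain response description
// that any HTTP framework could send.
function toHttpResponse(decision) {
  if (decision.accept) {
    return { status: 202, headers: {}, body: { queued: true } };
  }
  // Retry-After must be a whole number of seconds; never tell clients to retry in 0 s.
  const retryAfterSeconds = Math.max(1, Math.ceil(decision.retryAfter));
  return {
    status: 429,
    headers: { 'Retry-After': String(retryAfterSeconds) },
    body: { error: 'Council is at capacity', retryAfterSeconds },
  };
}
```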

Handling 429 Responses

Retry with Backoff

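async function queryWithBackoff(model, prompt, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await model.query(prompt);
    } catch (error) {
      // Anything other than a rate limit is not worth retrying here.
      if (error.status !== 429) throw error;
      // Out of retries: give up without sleeping first.
      if (i === retries - 1) break;
      // Prefer the server's Retry-After (in seconds); otherwise back off 1s, 2s, 4s, ...
      const retryAfter = Number(error.headers?.['retry-after']);
      const waitTime = Number.isFinite(retryAfter) && retryAfter > 0 ? retryAfter : Math.pow(2, i);
      await sleep(waitTime * 1000);
    }
  }
  throw new Error('Rate limit exceeded');
}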
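Two details are worth handling when computing the wait. First, `Retry-After` may carry either a number of seconds or an HTTP date. Second, when many clients back off on the same schedule they tend to retry in lockstep, so a common addition is random jitter. The helpers below are a sketch under those assumptions; `parseRetryAfter`, `backoffDelayMs`, and the default base and cap values are illustrative, not taken from any provider's SDK.

```javascript
// Returns the Retry-After delay in milliseconds, or null if the header is
// missing or unparseable. Accepts a seconds value or an HTTP date string.
function parseRetryAfter(value, nowMs = Date.now()) {
  if (value === undefined || value === null || value === '') return null;
  const seconds = Number(value);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const dateMs = Date.parse(value);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// Exponential backoff with full jitter: a random delay between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
function backoffDelayMs(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

A caller would use `parseRetryAfter(...) ?? backoffDelayMs(attempt)` so that the server's hint wins whenever it is present.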

Fallback Models

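// `clients` is assumed to map each model name to an object with a query(prompt) method.
async function resilientQuery(prompt) {
  const models = ['claude', 'gpt-4o', 'gemini', 'grok'];
  let lastError;
  for (const model of models) {
    try {
      return await queryWithBackoff(clients[model], prompt);
    } catch (e) {
      lastError = e; // remember why this model failed, then try the next one
    }
  }
  throw new Error('All models failed or were rate limited', { cause: lastError });
}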
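Sequential fallback returns a single answer, while a council usually wants several. One way to combine the two ideas is to query every model in parallel and proceed as long as enough of them respond. The sketch below assumes a caller-supplied `queryModel(name, prompt)` function and a `quorum` parameter; both are hypothetical and not part of the code above.

```javascript
// Queries every model in parallel and keeps whatever succeeded.
// Throws only if fewer than `quorum` models produced an answer.
async function councilWithQuorum(models, prompt, queryModel, quorum = 2) {
  const settled = await Promise.allSettled(
    models.map((name) => queryModel(name, prompt))
  );
  const answers = [];
  settled.forEach((result, index) => {
    if (result.status === 'fulfilled') {
      answers.push({ model: models[index], answer: result.value });
    }
  });
  if (answers.length < quorum) {
    throw new Error(
      `Only ${answers.length} of ${models.length} models answered; need ${quorum}`
    );
  }
  return answers;
}
```

With three models and a quorum of two, one rate-limited provider reduces the council's size without failing the whole query.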

User-Facing Strategies

Queue Position

"In queue: position 3, estimated wait: 30 seconds"

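The estimate in a message like this can be derived from the user's queue position and recent throughput. A minimal sketch, assuming a hypothetical `queriesPerMinute` measurement of how fast the council is currently draining the queue:

```javascript
// Builds the status line from queue position and measured throughput.
// Assumes a steady processing rate; real waits will vary with provider load.
function queueStatus(position, queriesPerMinute) {
  if (queriesPerMinute <= 0) {
    throw new Error('queriesPerMinute must be positive');
  }
  const waitSeconds = Math.ceil((position / queriesPerMinute) * 60);
  return `In queue: position ${position}, estimated wait: ${waitSeconds} seconds`;
}
```

For example, at position 3 with 6 queries completing per minute, the estimated wait is 30 seconds.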
Degraded Mode

"Running in economy mode due to high demand"

Async Processing

"Your query is processing. We'll notify you when complete."

SPRAPP Rate Management

Features included:

Multi-provider rate limit tracking
Automatic load balancing
Retry with exponential backoff
Graceful degradation
User queue management

The council of LLMs stays responsive even under rate limit pressure.

Written by SPRAPP Team
