Rate Limiting in LLM Councils: Managing API Constraints
Handle provider rate limits gracefully while maintaining council responsiveness and user experience.
The Rate Limit Reality
Every LLM provider enforces rate limits. A council sends every query to several models, so it is exposed to each provider's limits at once and is throttled by whichever is tightest. Here's how to manage them.
Types of Rate Limits
Requests Per Minute (RPM)
- OpenAI: 500-10,000 depending on tier
- Anthropic: 60-1,000
- Google: 60-2,000
Tokens Per Minute (TPM)
- OpenAI: 200K-30M
- Anthropic: 40K-400K
- Google: 1M-4M
Concurrent Requests
- OpenAI: Varies
- Anthropic: Usually limited
- Others: Provider-specific
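Because each council query sends one request to every member model, the most constrained provider sets the ceiling for the whole council. The arithmetic can be sketched roughly as follows. The provider names, the limit values, and the 2,000-token average request size are placeholders chosen for illustration, not real account limits:

```javascript
// Hypothetical per-provider limits; real values depend on your account tier.
const limits = {
  providerA: { rpm: 500, tpm: 200000 },
  providerB: { rpm: 60, tpm: 40000 },
  providerC: { rpm: 60, tpm: 1000000 },
};

// Council queries per minute that one provider can sustain, assuming each
// council query sends it one request of about `tokensPerRequest` tokens.
function providerCeiling({ rpm, tpm }, tokensPerRequest) {
  return Math.min(rpm, Math.floor(tpm / tokensPerRequest));
}

// The council as a whole is bounded by its most constrained member.
function councilCeiling(providerLimits, tokensPerRequest) {
  return Math.min(
    ...Object.values(providerLimits).map((l) => providerCeiling(l, tokensPerRequest))
  );
}

const ceiling = councilCeiling(limits, 2000); // 20: providerB runs out of TPM first
```

In this example the token limit, not the request limit, is the binding constraint, which is why it helps to track both.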
Rate Limit Strategies
1. Token Bucket Algorithm
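// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
class RateLimiter {
  constructor(rate, capacity) {
    this.capacity = capacity; // maximum number of tokens the bucket can hold
    this.tokens = capacity; // start with a full bucket
    this.rate = rate; // tokens per second
    this.lastRefill = Date.now();
  }
  async acquire(tokens = 1) {
    // A request larger than the bucket could never be satisfied.
    if (tokens > this.capacity) {
      throw new Error('Requested tokens exceed bucket capacity');
    }
    this.refill();
    while (this.tokens < tokens) {
      // Sleep just long enough for the shortfall to refill, then re-check.
      const waitTime = ((tokens - this.tokens) / this.rate) * 1000;
      await sleep(waitTime);
      this.refill();
    }
    this.tokens -= tokens;
    return true;
  }
  refill() {
    // Add the tokens earned since the last refill, capped at the bucket size.
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
    this.lastRefill = now;
  }
}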
2. Multi-Model Load Balancing
Distribute across models:
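// RateLimiter(refill rate in requests per second, burst capacity).
// To mirror a provider's RPM limit, divide it by 60 to get the per-second rate.
const modelLimits = {
  'claude': new RateLimiter(60, 100),
  'gpt-4o': new RateLimiter(500, 1000),
  'gemini': new RateLimiter(100, 200)
};
const MIN_COUNCIL_SIZE = 3;
async function balancedQuery(prompt) {
  while (true) {
    const available = Object.entries(modelLimits).filter(([_, limiter]) => {
      limiter.refill(); // bring the token count up to date before reading it
      return limiter.tokens >= 1;
    });
    if (available.length >= MIN_COUNCIL_SIZE) {
      // Spend one request token per participating model.
      await Promise.all(available.map(([_, limiter]) => limiter.acquire(1)));
      // `council` is the application's council client, defined elsewhere.
      return council.query(prompt, available.map(([model]) => model));
    }
    // Not enough models have capacity; wait, then check again.
    await sleep(1000);
  }
}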
3. Priority Queues
Handle high-priority queries first:
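// PriorityQueue, hasCapacity() and processQuery() are assumed to be defined elsewhere.
const queues = {
  high: new PriorityQueue(),
  normal: new PriorityQueue(),
  low: new PriorityQueue()
};
const PRIORITY_ORDER = ['high', 'normal', 'low'];
async function processQueue() {
  while (hasCapacity()) {
    // Re-scan from the top each time so newly arrived high-priority
    // queries are not stuck behind lower-priority ones.
    const priority = PRIORITY_ORDER.find((p) => !queues[p].isEmpty());
    if (!priority) break; // every queue is empty
    await processQuery(queues[priority].dequeue());
  }
}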
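JavaScript has no built-in PriorityQueue, so the snippet above assumes one is supplied by a library or by the application. Since each priority level already has its own queue, a plain first-in, first-out queue exposing the same `isEmpty` and `dequeue` methods may be enough. The following is a minimal sketch; the class name `FifoQueue`, its `enqueue` method, and the `nextQuery` helper are illustrative assumptions rather than part of any library:

```javascript
// Hypothetical stand-in for the per-level queue: FIFO order within one priority.
class FifoQueue {
  constructor() {
    this.items = [];
  }
  enqueue(item) {
    this.items.push(item);
  }
  dequeue() {
    return this.items.shift(); // undefined when the queue is empty
  }
  isEmpty() {
    return this.items.length === 0;
  }
}

const PRIORITIES = ['high', 'normal', 'low'];

// Return the oldest item of the highest non-empty level, or undefined if nothing waits.
function nextQuery(queues) {
  for (const priority of PRIORITIES) {
    if (!queues[priority].isEmpty()) {
      return queues[priority].dequeue();
    }
  }
  return undefined;
}
```

`Array.prototype.shift` is linear in the queue length, which is usually acceptable for short queues; a linked list or ring buffer would be the natural replacement if queues grow large.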
4. Backpressure
Signal upstream when overloaded:
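// currentUsage, maxCapacity and estimateWaitTime() are assumed to be tracked elsewhere.
function checkCapacity() {
  const utilization = currentUsage / maxCapacity;
  if (utilization > 0.8) {
    // Above 80% utilization, reject new work and tell the caller when to retry.
    return { accept: false, retryAfter: estimateWaitTime() };
  }
  return { accept: true };
}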
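The check above leaves `estimateWaitTime` undefined. One possible sketch follows, assuming the service knows how many requests are queued and roughly how many it completes per second (both parameters, and the 60-second fallback, are hypothetical). It also shows how a rejection could be surfaced to HTTP callers as a 429 response with a `Retry-After` header given in seconds:

```javascript
// Estimate, in whole seconds, how long the current backlog will take to drain.
function estimateWaitTime(queuedRequests, completedPerSecond) {
  if (completedPerSecond <= 0) {
    return 60; // assumed fallback when throughput is unknown
  }
  return Math.ceil(queuedRequests / completedPerSecond);
}

// Translate a checkCapacity()-style decision into an HTTP rejection.
// Returns null when the request is accepted and should proceed normally.
function rejectionResponse(decision) {
  if (decision.accept) {
    return null;
  }
  return {
    status: 429,
    headers: { 'Retry-After': String(decision.retryAfter) },
  };
}
```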
Handling 429 Responses
Retry with Backoff
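async function queryWithBackoff(model, prompt, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await model.query(prompt);
    } catch (error) {
      if (error.status !== 429) throw error; // only rate-limit errors are retried
      if (i === retries - 1) break; // no point waiting after the final attempt
      // Prefer the server's Retry-After value (seconds); otherwise back off 1s, 2s, 4s, ...
      // Assumes error.headers, when present, is a plain object with lowercase keys.
      const retryAfter = Number(error.headers?.['retry-after']);
      const waitTime = Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter
        : Math.pow(2, i);
      await sleep(waitTime * 1000);
    }
  }
  throw new Error('Rate limit exceeded');
}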
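When several council members are throttled at the same moment, identical backoff delays make their retries land together. A common refinement is to randomize the delay. The sketch below uses the "full jitter" variant, which picks a uniformly random delay between zero and the capped exponential delay. The function name, the default base and cap, and the injectable `random` parameter are illustrative assumptions:

```javascript
// Delay in milliseconds before retry number `attempt` (0-based), with full jitter.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000, random = Math.random) {
  // Exponential growth (base, 2x base, 4x base, ...) limited to capMs.
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  // Pick a random point in [0, exponential) so concurrent clients spread out.
  return Math.floor(random() * exponential);
}
```

In `queryWithBackoff`, a value like this could replace the fixed `Math.pow(2, i)` wait whenever the response carries no `Retry-After` header.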
Fallback Models
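async function resilientQuery(prompt) {
  // `clients` is assumed to map each model name to an object exposing query(prompt).
  const models = ['claude', 'gpt-4o', 'gemini', 'grok'];
  let lastError;
  for (const name of models) {
    try {
      return await queryWithBackoff(clients[name], prompt);
    } catch (e) {
      lastError = e; // remember why this model failed, then try the next one
    }
  }
  throw new Error('All models failed or were rate limited', { cause: lastError });
}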
User-Facing Strategies
Queue Position
"In queue: position 3, estimated wait: 30 seconds"
Degraded Mode
"Running in economy mode due to high demand"
Async Processing
"Your query is processing. We'll notify you when complete."
SPRAPP Rate Management
Features included:
- Multi-provider rate limit tracking
- Automatic load balancing
- Retry with exponential backoff
- Graceful degradation
- User queue management
The council of LLMs stays responsive even under rate limit pressure.