Error Handling in LLM Councils: Building Resilient Multi-Model Systems
Learn how to handle API failures, rate limits, and unexpected responses in production LLM council systems.
LLM council, error handling, AI reliability, council of LLMs, multi-model AI
The Fragility of Multi-Model Systems
With multiple models come multiple failure points: every provider a council calls can fail independently. Robust error handling is essential for production LLM councils.
Types of Errors
API Errors
- Rate limits (429)
- Service unavailable (503)
- Authentication failures (401)
- Timeout errors
Model Errors
- Malformed outputs
- Refusal to answer
- Content policy violations
- Hallucinations
System Errors
- Network failures
- Resource exhaustion
- Configuration errors
Error Handling Strategies
1. Retry with Backoff
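// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumed policy: retry rate limits, 5xx responses, and errors with no HTTP status
// (network failures, timeouts). Adjust to match your provider's error shape.
const isRetriable = (error) =>
  error.status === 429 || error.status >= 500 || error.status === undefined;

async function queryWithRetry(model, prompt, maxRetries = 3) {
  let lastError;
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await model.query(prompt);
    } catch (error) {
      if (!isRetriable(error)) throw error; // e.g. 401: retrying cannot help
      lastError = error;
      // Exponential backoff (1s, 2s, 4s, ...); skip the wait after the final attempt.
      if (i < maxRetries - 1) await sleep(2 ** i * 1000);
    }
  }
  throw new Error(`Max retries (${maxRetries}) exceeded`, { cause: lastError });
}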
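When many requests hit a rate limit at the same moment, a fixed doubling schedule makes them all retry at the same moment too. A common refinement is to randomize each delay ("full jitter") and cap it. The sketch below is an addition to the approach shown here, not part of it; `backoffDelay` and its parameters `baseMs`, `maxMs`, and `random` are illustrative names.

```javascript
// Sketch: exponential backoff with full jitter and an upper bound.
// Returns a delay in milliseconds for the given zero-based attempt number.
function backoffDelay(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  // The ceiling doubles per attempt but never exceeds maxMs.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  // Pick a uniformly random delay in [0, ceiling).
  return Math.floor(random() * ceiling);
}
```

The `random` parameter exists only so the function can be tested deterministically; in normal use the default `Math.random` is enough.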
2. Fallback Models
When one model fails, try another:
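// `query(modelName, prompt)` is assumed to be defined elsewhere.
async function queryWithFallback(prompt, models = ['claude', 'gpt-4o', 'gemini']) {
  const errors = [];
  for (const model of models) {
    try {
      return await query(model, prompt);
    } catch (e) {
      errors.push(e); // record the failure, then try the next model
    }
  }
  // Every model failed: surface all errors instead of returning undefined.
  throw new AggregateError(errors, 'All fallback models failed');
}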
3. Graceful Degradation
Continue with partial results:
- 5 models configured, 3 respond
- Proceed with available responses
- Note reduced confidence
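A minimal sketch of this behavior, assuming each council member is an object with a `name` and an async `query(prompt)` method. The function name `fanOut`, the `minResponses` floor, and the `coverage` field are illustrative, not taken from any particular platform.

```javascript
// Sketch: query every model, keep whatever succeeds, and report the shortfall.
// `models` is assumed to be an array of { name, query(prompt) } objects.
async function fanOut(models, prompt, minResponses = 2) {
  // allSettled never rejects, so one failing model cannot sink the others.
  const settled = await Promise.allSettled(models.map(async (m) => m.query(prompt)));

  const responses = [];
  const failures = [];
  settled.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      responses.push({ model: models[i].name, text: result.value });
    } else {
      failures.push({ model: models[i].name, error: result.reason });
    }
  });

  // Below the floor there is no council left to speak of.
  if (responses.length < minResponses) {
    throw new Error(`Only ${responses.length} of ${models.length} models responded`);
  }

  return {
    responses,
    failures,
    degraded: failures.length > 0,
    // Fraction of the council that answered; a crude confidence signal.
    coverage: responses.length / models.length,
  };
}
```

With 5 models configured and 3 responding, this returns three responses, two recorded failures, `degraded: true`, and a coverage of 3/5 that downstream steps can use to lower reported confidence.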
4. Circuit Breakers
Stop calling a service that keeps failing:
- Track failure rates
- Open circuit after threshold
- Allow retry after cooldown
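The three steps can be sketched as a small class. This is a simplified illustration: it counts consecutive failures rather than a failure rate over a time window, and the names `CircuitBreaker`, `failureThreshold`, and `cooldownMs` are assumptions made for the example.

```javascript
// Sketch of a per-provider circuit breaker. Thresholds are illustrative.
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, which keeps the breaker testable
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    // After the cooldown, one trial call is allowed ("half-open").
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  async call(fn) {
    if (this.state === 'open') {
      throw new Error('Circuit open: skipping call');
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // a success closes the circuit
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, (re)opens the circuit.
      if (this.openedAt !== null || this.failures >= this.failureThreshold) {
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

One breaker per provider is the usual arrangement, so that an outage at one provider does not block calls to the others.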
5. Timeout Protection
Never hang indefinitely:
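// Rejects after `ms` milliseconds (helper the race depends on).
const timeout = (ms) => new Promise((_, reject) =>
  setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms));

const response = await Promise.race([
  model.query(prompt),
  timeout(30000), // give up after 30 seconds
]);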
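`Promise.race` on its own only stops waiting: the losing request keeps running, and the timer stays scheduled even when the model answers quickly. A hedged sketch of a wrapper that addresses both, assuming the operation accepts an `AbortSignal` the way `fetch` does. The name `withTimeout` is illustrative.

```javascript
// Sketch: time-box an async operation, signal it to cancel, and clean up the timer.
// `operation` is assumed to be a function that takes an AbortSignal.
async function withTimeout(operation, ms) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const error = new Error(`Timed out after ${ms} ms`);
      controller.abort(error); // tell the operation to stop
      reject(error);           // stop waiting even if it ignores the signal
    }, ms);
  });
  try {
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // no dangling timer once the race settles
  }
}
```

A call might look like `withTimeout((signal) => fetch(url, { signal }), 30000)`, where `url` stands in for a provider endpoint.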
Council-Specific Handling
Fan-Out Failures
If some models fail in fan-out:
- Log the failure
- Proceed with available responses
- Adjust consensus threshold
- Note degraded confidence
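One way to adjust the consensus threshold is to measure agreement against the models that actually answered, while still refusing to decide when too few did. The sketch below assumes answers can be compared by exact match; `evaluateConsensus`, `agreement`, and `minQuorum` are hypothetical names and values.

```javascript
// Sketch: scale the agreement requirement to the models that responded.
// `votes` maps model name -> chosen answer; both thresholds are illustrative.
function evaluateConsensus(votes, configuredCount, { agreement = 0.6, minQuorum = 0.5 } = {}) {
  const answered = Object.keys(votes).length;
  if (answered === 0 || answered / configuredCount < minQuorum) {
    return { status: 'no-quorum', answered, configuredCount };
  }

  // Tally identical answers.
  const tally = new Map();
  for (const answer of Object.values(votes)) {
    tally.set(answer, (tally.get(answer) || 0) + 1);
  }
  const [winner, count] = [...tally.entries()].sort((a, b) => b[1] - a[1])[0];

  // The threshold is a fraction of responders, not of configured models.
  const needed = Math.ceil(agreement * answered);
  return {
    status: count >= needed ? 'consensus' : 'split',
    winner,
    count,
    needed,
    degraded: answered < configuredCount, // flag for reduced confidence
  };
}
```

For example, with 5 models configured and 3 responding, a 60% agreement rule needs 2 matching answers instead of 3, and the result carries `degraded: true` so the caller can report lower confidence.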
Debate Failures
If debate rounds fail:
- Use available debate data
- Skip to synthesis
- Note incomplete deliberation
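A sketch of that flow, under the assumption that debate rounds and synthesis are supplied as async callbacks. `deliberate`, `runRound`, and `synthesize` are hypothetical names used only for illustration.

```javascript
// Sketch: run debate rounds, keeping completed rounds if a later one fails.
// `runRound(roundIndex, transcript)` and `synthesize(transcript)` are hypothetical callbacks.
async function deliberate(initialResponses, runRound, synthesize, maxRounds = 3) {
  const transcript = [initialResponses];
  let completedRounds = 0;
  let debateError = null;

  for (let round = 0; round < maxRounds; round++) {
    try {
      transcript.push(await runRound(round, transcript));
      completedRounds += 1;
    } catch (error) {
      debateError = error; // stop debating and synthesize from what exists
      break;
    }
  }

  return {
    answer: await synthesize(transcript),
    completedRounds,
    incomplete: completedRounds < maxRounds, // surfaced so callers can note it
    debateError,
  };
}
```

The `incomplete` flag and the stored error let the caller record that deliberation was cut short rather than silently presenting the answer as fully debated.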
Synthesis Failures
If synthesis fails:
- Return best individual response
- Try alternative synthesis model
- Fall back to voting
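These options can be chained. The ordering below (alternative synthesizer, then voting, then best individual response) is one reasonable choice rather than a prescribed one. The sketch assumes a non-empty `responses` array of `{ text, score }` objects, where `score` is a hypothetical quality estimate, and it compares answers by exact text match.

```javascript
// Sketch: try each synthesizer in order, then fall back to a vote, then to the best response.
// `synthesizers` is an array of hypothetical async functions: (responses) => string.
async function synthesizeWithFallback(responses, synthesizers) {
  for (const [index, synthesize] of synthesizers.entries()) {
    try {
      const answer = await synthesize(responses);
      return { answer, method: index === 0 ? 'primary-synthesis' : 'alternative-synthesis' };
    } catch {
      // Fall through to the next synthesizer.
    }
  }

  // Voting: pick the most frequent answer text (exact match, a simplification).
  const counts = new Map();
  for (const { text } of responses) counts.set(text, (counts.get(text) || 0) + 1);
  const [topText, topCount] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  if (topCount > 1) return { answer: topText, method: 'vote' };

  // No two models agree: return the single highest-scoring response.
  const best = responses.reduce((a, b) => (b.score > a.score ? b : a));
  return { answer: best.text, method: 'best-individual' };
}
```

Returning the `method` alongside the answer makes it possible to tell users, and dashboards, when a result did not come from full synthesis.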
Error Classification
| Error Type | Action | User Impact |
|---|---|---|
| Transient (rate limit) | Retry | Delay |
| Provider outage | Fallback | Possible quality change |
| Model refusal | Alternative model | Answer may come from a different model |
| Content policy | Alternative phrasing | Modified query |
| Timeout | Fallback/skip | Reduced confidence |
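The table can be expressed as a small dispatch function. The error shape here is an assumption: `status` is taken to be an HTTP status code, and `kind` is a hypothetical field that your own model wrapper would set for refusals, content-policy blocks, and timeouts.

```javascript
// Sketch: map an error to the "Action" column of the table above.
// The `status`, `kind`, and `name` fields are assumptions about the error shape.
function classifyError(error) {
  if (error.kind === 'timeout' || error.name === 'AbortError' || error.name === 'TimeoutError') {
    return { type: 'timeout', action: 'fallback-or-skip' };
  }
  if (error.kind === 'refusal') return { type: 'model-refusal', action: 'alternative-model' };
  if (error.kind === 'content_policy') return { type: 'content-policy', action: 'rephrase' };
  if (error.status === 429) return { type: 'transient', action: 'retry' };
  if (error.status >= 500) return { type: 'provider-outage', action: 'fallback' };
  // Anything else (for example a 401) is not recoverable by retrying or switching models.
  return { type: 'unknown', action: 'fail' };
}
```

Keeping classification in one place means the retry, fallback, and degradation logic can all branch on the same `action` value instead of re-inspecting raw errors.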
Monitoring and Alerting
Track error metrics:
- Error rate by model
- Error type distribution
- Recovery success rate
- User-facing error frequency
Set alerts for:
- Error rate spike
- Specific model failures
- Timeout increases
- Fallback frequency
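As an illustration of the first two metrics and a simple threshold alert, here is a minimal in-memory sketch. A production system would export these counters to a metrics backend instead; the class name `ErrorMetrics`, its methods, and the default thresholds are all assumptions made for the example.

```javascript
// Sketch: in-memory counters for per-model error rate and error type distribution.
class ErrorMetrics {
  constructor() {
    this.byModel = new Map(); // model name -> { calls, errors, byType }
  }

  // Record one call; pass an error type string (e.g. 'timeout') when it failed.
  record(model, errorType = null) {
    if (!this.byModel.has(model)) {
      this.byModel.set(model, { calls: 0, errors: 0, byType: {} });
    }
    const stats = this.byModel.get(model);
    stats.calls += 1;
    if (errorType !== null) {
      stats.errors += 1;
      stats.byType[errorType] = (stats.byType[errorType] || 0) + 1;
    }
  }

  // Fraction of recorded calls that failed; 0 for unknown models.
  errorRate(model) {
    const stats = this.byModel.get(model);
    return stats && stats.calls > 0 ? stats.errors / stats.calls : 0;
  }

  // Models whose error rate exceeds `threshold` after at least `minCalls` calls.
  alerts(threshold = 0.2, minCalls = 10) {
    return [...this.byModel.entries()]
      .filter(([, s]) => s.calls >= minCalls && s.errors / s.calls > threshold)
      .map(([model, s]) => ({ model, errorRate: s.errors / s.calls, byType: { ...s.byType } }));
  }
}
```

The `minCalls` guard keeps a single early failure from triggering an alert before there is enough traffic to make the rate meaningful.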
SPRAPP Resilience
Our platform provides:
- Automatic retry with backoff
- Multi-provider fallback
- Graceful degradation
- Comprehensive error logging
- Real-time error dashboards
The council of LLMs remains reliable even when individual models fail.