Caching Strategies for LLM Councils: Speed and Cost Optimization
Implement effective caching to reduce costs and latency in your LLM council while maintaining freshness.
Tags: LLM council, caching, AI optimization, council of LLMs, multi-model AI
Why Caching Matters
Every LLM council query fans out to several models, so it costs more and takes longer than a single-model call. Strategic caching can substantially reduce both cost and latency.
What to Cache
1. Exact Query Matches
Identical queries return cached results:
- Simple key-value cache
- Fastest cache hit
- Limited applicability
2. Semantic Similarity
Semantically similar queries share results:
- Embedding-based matching
- Higher hit rate
- Requires similarity threshold
3. Model Responses
Cache individual model responses:
- Reuse across councils
- Longer TTL
- More storage
4. Partial Results
Cache intermediate results:
- Fan-out responses
- Debate contributions
- Enables faster recomposition
Caching Strategies
Strategy 1: Exact Match Cache
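// Must be async because it awaits the council on a cache miss.
async function queryWithCache(prompt, models) {
  // Key on both the prompt and the model list.
  const cacheKey = hash(prompt + models.join(','));
  const cached = cache.get(cacheKey);
  if (cached && !isExpired(cached)) {
    return cached.result;
  }
  // Cache miss or stale entry: query the council and store a fresh result.
  const result = await council.query(prompt, models);
  cache.set(cacheKey, { result, timestamp: Date.now() });
  return result;
}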
Pros: Simple, fast
Cons: Low hit rate
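The `hash` helper in the exact-match example is not defined in the article. Here is a minimal sketch of one way to build the key, assuming Node.js and its built-in `crypto` module. The name `buildCacheKey` and the whitespace normalization are illustrative assumptions. Skip the normalization if whitespace is significant in your prompts.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: builds a deterministic cache key for a council query.
function buildCacheKey(prompt, models) {
  // Collapse runs of whitespace so trivially different prompts share a key.
  const normalizedPrompt = prompt.trim().replace(/\s+/g, ' ');
  // Sort a copy of the model list so ['a', 'b'] and ['b', 'a'] match.
  const sortedModels = [...models].sort();
  // JSON keeps the prompt/model boundary unambiguous before hashing.
  const payload = JSON.stringify({ prompt: normalizedPrompt, models: sortedModels });
  return createHash('sha256').update(payload).digest('hex');
}
```

Sorting the model list is a design choice. It treats a council as a set of models, which only makes sense if the order of models does not affect the result.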
Strategy 2: Semantic Cache
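async function semanticCacheQuery(prompt, threshold = 0.95) {
  const promptEmbedding = await embed(prompt);
  // Scan all entries and keep the closest match, not just the first one above the threshold.
  let best = null;
  let bestSimilarity = threshold;
  for (const cached of cache.values()) {
    const similarity = cosineSimilarity(promptEmbedding, cached.embedding);
    if (similarity > bestSimilarity) {
      best = cached;
      bestSimilarity = similarity;
    }
  }
  if (best) {
    return best.result;
  }
  // Cache miss: query the council and store the embedding alongside the result.
  const result = await council.query(prompt);
  cache.set(prompt, { result, embedding: promptEmbedding });
  return result;
}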
Pros: Higher hit rate
Cons: Embedding cost, complexity
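The `cosineSimilarity` call in the semantic cache example is not defined in the article. A minimal sketch of the standard formula follows, assuming embeddings are plain arrays of numbers with equal length.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Embedding dimensions must match');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as dissimilar to everything instead of dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The linear scan over all cached embeddings costs one similarity computation per entry, so a large cache would likely need a vector index instead.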
Strategy 3: Tiered Cache
L1: In-memory exact match (fastest, smallest)
L2: Redis semantic cache (medium speed, larger)
L3: Persistent storage (slowest, unlimited)
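The tier list above can be sketched as a lookup that walks from the fastest tier to the slowest and copies hits upward. This is an illustration under assumptions: `createMapTier` and `tieredGet` are hypothetical names, every tier here is an exact-match `Map` standing in for Redis or persistent storage, and the semantic matching of the L2 tier is left out.

```javascript
// Stand-in tier backed by a Map; a real L2/L3 would wrap Redis or a database.
function createMapTier(name) {
  const store = new Map();
  return {
    name,
    async get(key) { return store.get(key); },
    async set(key, value) { store.set(key, value); },
  };
}

// Check tiers from fastest to slowest. On a hit, copy the value into
// every faster tier so the next lookup stops earlier.
async function tieredGet(tiers, key, computeFn) {
  for (let i = 0; i < tiers.length; i++) {
    const value = await tiers[i].get(key);
    if (value !== undefined) {
      await Promise.all(tiers.slice(0, i).map((tier) => tier.set(key, value)));
      return { value, source: tiers[i].name };
    }
  }
  // Miss in every tier: compute once and populate all tiers.
  const value = await computeFn();
  await Promise.all(tiers.map((tier) => tier.set(key, value)));
  return { value, source: 'computed' };
}
```

A call such as `tieredGet(tiers, key, () => council.query(prompt, models))` would only reach the council when all three tiers miss.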
Strategy 4: Partial Cache
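// Cache individual model responses (runs inside an async function)
const responses = [];
for (const model of models) {
  const cached = modelCache.get(model, prompt);
  if (cached) {
    // Reuse the response this model already gave for this prompt.
    responses.push(cached);
  } else {
    // Only the models that miss are actually queried.
    const response = await query(model, prompt);
    modelCache.set(model, prompt, response);
    responses.push(response);
  }
}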
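The loop above queries models one at a time. As a variant, the cache misses can be fetched in parallel. The sketch below assumes `modelCache` is a plain `Map` keyed by a combined model and prompt string, and that `queryModel` is a hypothetical async function `(model, prompt) => response`. Neither comes from the article.

```javascript
// Fetch one response per model, reusing cached responses where available.
// Misses are requested concurrently; Promise.all preserves the model order.
async function queryCouncilWithPartialCache(models, prompt, modelCache, queryModel) {
  return Promise.all(
    models.map(async (model) => {
      const key = `${model}::${prompt}`;
      if (modelCache.has(key)) {
        return { model, response: modelCache.get(key), cached: true };
      }
      const response = await queryModel(model, prompt);
      modelCache.set(key, response);
      return { model, response, cached: false };
    })
  );
}
```

With this shape, adding one new model to an existing council only pays for that model's call. The other responses come from the cache.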
Cache Invalidation
Time-Based
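// TTLs in milliseconds, so they can be compared against Date.now() differences.
const HOUR = 60 * 60 * 1000;
const TTL = {
  facts: 7 * 24 * HOUR, // 7 days
  news: HOUR, // 1 hour
  code: 24 * HOUR, // 24 hours
  creative: Infinity // never expires (or use a very long TTL)
};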
Event-Based
- A model update triggers a cache flush
- A user correction invalidates the affected entry
- A change in external data invalidates entries that depend on it
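One of these triggers, the flush on a model update, could look like the sketch below. It assumes the cache is a `Map` and that each entry records which models produced it in a `models` array. Both the field and the `invalidateByModel` name are hypothetical.

```javascript
// Remove every cached entry that involved the given model.
// Returns the number of entries removed.
function invalidateByModel(cache, model) {
  let removed = 0;
  for (const [key, entry] of cache) {
    if (entry.models.includes(model)) {
      cache.delete(key); // Deleting during Map iteration is safe in JavaScript.
      removed++;
    }
  }
  return removed;
}
```

Flushing only the entries tied to the updated model keeps the rest of the cache warm, at the cost of storing the model list with every entry.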
Hybrid Approach
function shouldInvalidate(cached, context) {
  if (Date.now() - cached.timestamp > context.ttl) return true;
  if (context.modelVersion !== cached.modelVersion) return true;
  if (context.forceFresh) return true;
  return false;
}
Cache Warming
Pre-populate the cache for known queries:
- FAQ answers
- Common queries
- Scheduled refresh for time-sensitive queries
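A warm-up pass over such a list might be sketched as follows. The assumptions are that the cache is a `Map` keyed by prompt and that `runQuery` is a hypothetical async function `(prompt) => result` wrapping the council call. Queries run one at a time so the warm-up does not burst against provider rate limits.

```javascript
// Pre-populate `cache` by running each known prompt through `runQuery`.
// Failures are collected and returned instead of aborting the whole warm-up.
async function warmCache(cache, prompts, runQuery) {
  const failures = [];
  for (const prompt of prompts) {
    if (cache.has(prompt)) continue; // Already warm.
    try {
      const result = await runQuery(prompt);
      cache.set(prompt, { result, timestamp: Date.now() });
    } catch (error) {
      failures.push({ prompt, error });
    }
  }
  return failures;
}
```

For time-sensitive queries, the same function could be called on a timer (for example with `setInterval`) after deleting the stale entries.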
Performance Impact
| Cache Type | Hit Rate | Latency Reduction | Cost Reduction |
|---|---|---|---|
| Exact match | 5-15% | 90%+ | 90%+ |
| Semantic | 20-40% | 80%+ | 80%+ |
| Partial | 30-50% | 40%+ | 50%+ |
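To turn the table into an overall estimate, one reading is that the latency and cost reductions apply per cache hit. Under that assumption, the saving across all traffic is roughly the hit rate multiplied by the per-hit reduction. The sketch below uses midpoint hit rates and the lower-bound cost reductions from the table, and ignores overhead such as embedding calls.

```javascript
// Overall fractional saving = hit rate × saving per hit.
function overallReduction(hitRate, reductionPerHit) {
  return hitRate * reductionPerHit;
}

// Midpoint hit rates and lower-bound per-hit cost reductions from the table.
const scenarios = [
  { type: 'Exact match', hitRate: 0.10, costReductionPerHit: 0.9 },
  { type: 'Semantic', hitRate: 0.30, costReductionPerHit: 0.8 },
  { type: 'Partial', hitRate: 0.40, costReductionPerHit: 0.5 },
];

for (const s of scenarios) {
  const pct = (overallReduction(s.hitRate, s.costReductionPerHit) * 100).toFixed(0);
  console.log(`${s.type}: about ${pct}% lower total cost`);
}
```

With these inputs the estimates come to about 9%, 24%, and 20%. This suggests that a cache with a lower per-hit saving can still win overall if it hits often enough.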
SPRAPP Caching
Built-in features:
- Multi-tier caching
- Semantic similarity matching
- Configurable TTL
- Cache warming
- Hit rate analytics
The multi-model AI council becomes much more efficient with proper caching.