Solving Cardinality Issues at Scale: A Practical Guide to Probabilistic Data Structures

By Amir Jakoby and Ishay Shor
At Sawmills, we analyze large volumes of telemetry data in real time, under tight memory and CPU constraints. Naive answers to questions like "how many times have we seen this attribute?" hit a wall quickly. In this article, we walk through the iterative process that led us to a fix.
The Problem: When Counting Gets Expensive
In modern distributed systems, we often face the challenge of tracking and analyzing high-cardinality data streams. Whether it’s monitoring unique visitors, finding the most active users, or identifying trending topics, traditional approaches quickly become prohibitively expensive as data volume grows.
Consider a simple scenario: you’re building a web service that needs to track user logins and provide a “top users” endpoint. The naive approach might look like this:
type loginServer struct {
	counts sync.Map // map[string]int: one counter per unique user
}

func (s *loginServer) handleTopUsers(w http.ResponseWriter, r *http.Request) {
	// Copy every counter out of the map, then sort all of them.
	users := s.snapshotCounts()
	sort.Slice(users, func(i, j int) bool {
		return users[i].Count > users[j].Count
	})
	// Return top N users...
}
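The handler relies on a snapshotCounts helper that materializes the sync.Map into a slice. A minimal sketch of what that might look like (userCount is the same struct the heap example uses later):

type userCount struct {
	UserID string
	Count  int
}

// snapshotCounts copies every entry out of the concurrent map.
// This alone costs O(n) time and O(n) extra memory per request.
func (s *loginServer) snapshotCounts() []userCount {
	var users []userCount
	s.counts.Range(func(key, value any) bool {
		users = append(users, userCount{UserID: key.(string), Count: value.(int)})
		return true
	})
	return users
}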
This works beautifully for small datasets. But what happens when you have millions of unique users? The memory usage grows linearly with the number of unique items, and sorting becomes increasingly expensive.
The Performance Reality Check
To understand the real impact, we implemented and benchmarked three different approaches to solving the top-k problem (a sketch of the benchmark shape follows the list):
- Naive Approach: Store all counters, sort when needed
- Heap-based TopK: Use a min-heap to maintain only top-k items
- Space-Saving Algorithm: Probabilistic top-k with bounded memory
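The exact harness is not reproduced here, but to give a sense of the shape of the measurement, here is a minimal sketch. It assumes a common interface that all three implementations satisfy; the interface and names are illustrative, not our production code:

// A hypothetical interface so all three approaches share one benchmark.
type topKCounter interface {
	Update(val string, inc int)
	TopK() []Item
}

func benchmarkTopKQuery(b *testing.B, c topKCounter) {
	// Pre-load a high-cardinality stream of logins...
	for i := 0; i < 1_000_000; i++ {
		c.Update(fmt.Sprintf("user-%d", i%100_000), 1)
	}
	b.ReportAllocs()
	b.ResetTimer()
	// ...then time only the top-k query path.
	for i := 0; i < b.N; i++ {
		_ = c.TopK()
	}
}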
Here are the benchmark results that tell the story (the same numbers are broken out in the sections below):

Operation                Naive            Heap TopK       Space-Saving
Top-k query              31,503 ns/op     5,493 ns/op     2,681 ns/op
Concurrent operations    55,875 ns/op     n/a             2,554 ns/op
Memory per operation     189,746 B/op     6,954 B/op      6,870 B/op

The numbers speak for themselves: Space-Saving answers top-k queries 12x faster than the naive approach and uses 28x less memory, while keeping login processing performance comparable.
Solution 1: Space-Saving Algorithm
The Space-Saving algorithm, introduced by Metwally et al. (2005), is specifically designed for finding frequent items in data streams under bounded memory usage.
Core Concept
Instead of storing every item, Space-Saving maintains exactly k counters. When a new item arrives:
- If it's already tracked, increment its counter
- If there's space, add it as a new counter
- If the structure is full, evict the item with the minimum count and take over its slot
Implementation Highlights
type Summary struct {
	k     int
	items map[string]*Item // value -> tracked counter
	heap  MinHeap          // min-heap ordered by Count
	mu    sync.RWMutex
}

func (s *Summary) Update(val string, inc int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Case 1: already tracked. Bump the counter and restore heap order.
	if item, ok := s.items[val]; ok {
		item.Count += inc
		heap.Fix(&s.heap, item.index)
		return
	}
	// Case 2: free slot available. Start tracking the new item exactly.
	if len(s.items) < s.k {
		item := &Item{Value: val, Count: inc, Error: 0}
		heap.Push(&s.heap, item)
		s.items[val] = item
	} else {
		// Case 3: full. Evict the minimum and reuse its slot.
		min := heap.Pop(&s.heap).(*Item)
		delete(s.items, min.Value)
		item := min
		item.Value = val
		item.Error = min.Count       // Track potential overcount
		item.Count = min.Count + inc // Paper's formula
		s.items[val] = item
		heap.Push(&s.heap, item)
	}
}
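The snippet assumes an Item struct and a MinHeap implementing container/heap.Interface over *Item, with each item's heap position kept in an index field so heap.Fix works. A minimal sketch of those supporting types:

type Item struct {
	Value string
	Count int
	Error int // maximum possible overcount inherited at eviction time
	index int // position in the heap, maintained by MinHeap
}

type MinHeap []*Item

func (h MinHeap) Len() int           { return len(h) }
func (h MinHeap) Less(i, j int) bool { return h[i].Count < h[j].Count }
func (h MinHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *MinHeap) Push(x any) {
	item := x.(*Item)
	item.index = len(*h)
	*h = append(*h, item)
}
func (h *MinHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}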
Key Advantages
- Bounded Memory: uses at most k counters, O(k) space, regardless of stream size
- Error Bounds: guarantees that the true frequency lies between (Count - Error) and Count (see the helper sketched below)
- Streaming: processes items one at a time, with no second pass over the data
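One practical consequence of the error bounds: any item whose lower bound Count - Error clears a threshold is provably at least that frequent. A small helper (illustrative, not part of our API) makes this concrete:

// guaranteedAbove returns only items whose true frequency is provably
// at least threshold, since true count >= Count - Error.
func guaranteedAbove(items []Item, threshold int) []Item {
	var out []Item
	for _, it := range items {
		if it.Count-it.Error >= threshold {
			out = append(out, it)
		}
	}
	return out
}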
Our benchmarks show Space-Saving consistently outperforms both naive and heap-based approaches, especially under high concurrency:
- Concurrent Operations: 2,554 ns/op vs 55,875 ns/op (naive), a 22x improvement
- High Concurrency: stable performance where the naive approach degrades sharply
- Memory Efficiency: 6,870 B/op vs 189,746 B/op (naive), a 96% reduction
Solution 2: Heap-Based TopK Approach
For comparison, the heap-based TopK approach provides exact results while improving efficiency over naive sorting.
Implementation Strategy
The heap approach maintains a min-heap of size k, ensuring O(n log k) complexity instead of O(n log n):
func (s *loginServer) getTopKUsers(k int) []userCount {
	h := &MinHeap{}
	heap.Init(h)
	s.counts.Range(func(key, value any) bool {
		userID, _ := key.(string)
		count, _ := value.(int)
		user := userCount{UserID: userID, Count: count}
		if h.Len() < k {
			// Heap not full yet: accept everything.
			heap.Push(h, user)
		} else if count > (*h)[0].Count {
			// Beats the current minimum: replace it.
			heap.Pop(h)
			heap.Push(h, user)
		}
		return true
	})
	return extractResults(h)
}
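This snippet assumes a min-heap over userCount values (a separate type from the *Item heap used by the Space-Saving Summary) plus an extractResults helper. A minimal sketch of both:

// A min-heap over userCount values, ordered by Count.
type MinHeap []userCount

func (h MinHeap) Len() int           { return len(h) }
func (h MinHeap) Less(i, j int) bool { return h[i].Count < h[j].Count }
func (h MinHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *MinHeap) Push(x any)        { *h = append(*h, x.(userCount)) }
func (h *MinHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// extractResults drains the heap into a slice ordered most to least frequent.
func extractResults(h *MinHeap) []userCount {
	out := make([]userCount, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(userCount) // pops ascending, fills from the back
	}
	return out
}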
Performance Trade-offs
Advantages:
- Exact results: no approximation errors
- Better than naive: 5,493 ns/op vs 31,503 ns/op for top-k queries
- Reasonable memory: 6,954 B/op per query, comparable to Space-Saving

Limitations:
- Memory grows with users: still needs to store all unique items
- Slower than Space-Saving: 2x slower for top-k queries (5,493 vs 2,681 ns/op)
- Scalability concerns: performance degrades with very large datasets
Practical Considerations and Lessons Learned
Concurrency is Critical
Modern applications demand thread-safe operations. All implementations use appropriate synchronization:
func (s *Summary) Update(val string, inc int) {
	s.mu.Lock() // Write lock for updates
	defer s.mu.Unlock()
	// ... update logic as shown earlier
}

func (s *Summary) TopK() []Item {
	s.mu.RLock() // Read lock for queries
	defer s.mu.RUnlock()
	// ... implementation
}
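For completeness, here is one way the read path might look under that locking scheme. This is a sketch against the Summary type above, copying items out so callers never observe internal state:

func (s *Summary) TopK() []Item {
	s.mu.RLock()
	defer s.mu.RUnlock()
	// Copy the tracked counters so the caller gets a stable snapshot.
	out := make([]Item, 0, len(s.items))
	for _, item := range s.items {
		out = append(out, *item)
	}
	// Sort descending by count; there are at most k items to sort.
	sort.Slice(out, func(i, j int) bool { return out[i].Count > out[j].Count })
	return out
}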
Our benchmarks revealed that proper synchronization can make or break performance under load.
Memory vs. Accuracy Trade-offs
Each approach offers different trade-offs:
- Naive: Perfect accuracy, unbounded memory, poor scalability
- Heap TopK: Perfect accuracy for top-k, O(k) memory for results, O(n) for storage
- Space-Saving: Approximate with error bounds, O(k) memory total, excellent for top-k
Error Bound Configuration
When using probabilistic structures, proper configuration is crucial:
// Space-Saving: k=100 means top-100 results with bounded error
summary := NewSummary(100)

// Alternative: use the epsilon-based constructor for automatic sizing
summary, err := NewSummaryEpsilon(0.01) // 1% error bound, so k=100
if err != nil {
	log.Fatal(err)
}
Real-World Impact
Implementing these algorithms in production systems has delivered significant benefits:
Performance Improvements
- 28x reduction in memory usage for high-cardinality analytics (189,746 → 6,870 B/op)
- 12x faster query response times for top-k operations (31,503 → 2,681 ns/op)
- 22x improvement in concurrent operations performance
- Bounded memory and stable query times as data grows, instead of costs that climb with every new unique item
Cost Savings
- Reduced infrastructure costs due to lower memory requirements
- Improved cache hit rates from smaller working sets
- Better resource utilization across the entire system
Operational Benefits
- Predictable memory usage regardless of data volume
- Graceful degradation under extreme load
- Simplified capacity planning with bounded resource requirements
When to Use Each Algorithm
Based on our experience, here are practical guidelines:
Use Space-Saving When:
- You need top-k items from a stream
- Memory usage must be bounded and predictable
- You can tolerate approximate results with error bounds
- High throughput and concurrency are requirements
Use Heap-based TopK When:
- Dataset size is moderate and known
- You need exact results
- Memory usage isn’t the primary constraint
- Implementation complexity should be minimal
Avoid Naive Approaches When:
- Dataset size is unpredictable or large
- Memory usage is a concern
- You need consistent performance characteristics
Implementation Tips and Best Practices
1. Optimal Space-Saving Configuration
Choose the right k value based on your accuracy requirements and memory constraints:
// For top-10 queries with high accuracy, use k=50-100
summary := NewSummary(100)

// For memory-constrained environments, use epsilon-based sizing
summary, err := NewSummaryEpsilon(0.01) // 1% error bound, so k=100
if err != nil {
	log.Fatal(err)
}

// Monitor actual error rates in production
items := summary.TopK()
for _, item := range items {
	fmt.Printf("Item: %s, Count: %d, Max Error: %d\n",
		item.Value, item.Count, item.Error)
}
2. Proper Error Handling
Always validate configuration parameters:
func NewSummaryEpsilon(epsilon float64) (*Summary, error) {
	if epsilon <= 0 || epsilon >= 1 {
		return nil, fmt.Errorf("epsilon must be in (0,1), got %v", epsilon)
	}
	k := int(math.Ceil(1 / epsilon)) // k = ceil(1/epsilon) counters
	return NewSummary(k), nil
}
3. Monitoring and Observability
Track key metrics to understand algorithm behavior (a sketch of one way to surface them follows the list):
- Hit rates for tracked items
- Error distribution for approximations
- Memory usage patterns
- Query latency percentiles
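As a sketch of what this can look like in practice (assuming the Summary type above; the reporting function and its output format are illustrative), a periodic sampler can log the worst-case relative error among tracked items:

// reportSummaryMetrics logs a coarse health signal for the sketch:
// the largest Error/Count ratio among the currently tracked items.
func reportSummaryMetrics(s *Summary, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		items := s.TopK()
		worst := 0.0
		for _, it := range items {
			if it.Count > 0 {
				if r := float64(it.Error) / float64(it.Count); r > worst {
					worst = r
				}
			}
		}
		log.Printf("space-saving: tracked=%d worst_relative_error=%.4f",
			len(items), worst)
	}
}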
Conclusion
Cardinality problems are ubiquitous in modern systems, but they don’t have to break your architecture. The Space-Saving algorithm provides a mathematically sound solution that trades small amounts of accuracy for dramatic improvements in performance and bounded resource usage.
The key insights from our implementation and benchmarking effort:
- Space-Saving dominates for top-k queries: 12x faster than naive approaches with 28x less memory
- Algorithm selection matters: Heap-based approaches offer a middle ground with exact results
- Measure everything: Benchmarks reveal the true performance characteristics
- Design for scale: Solutions that work for thousands of items may fail for millions
- Bounded memory enables predictability: Fixed resource usage regardless of data volume
By implementing Space-Saving with careful attention to concurrency and error handling, you can build systems that scale gracefully while maintaining the performance characteristics your users expect. The 22x improvement in concurrent operations performance makes it particularly suitable for high-throughput production environments.
About the authors:
Amir Jakoby is the CTO and Co-founder of Sawmills. He leads the technical vision and execution of the company’s smart telemetry management platform.
Ishay Shor is a guest writer who specializes in scalable backend infrastructure and observability systems.