Solving Cardinality Issues at Scale: A Practical Guide to Probabilistic Data Structures

By Amir Jakoby and Ishay Shor
At Sawmills, we analyze large volumes of telemetry data in real time, under tight memory and CPU constraints. Naive answers to questions like "how many times have we seen this attribute?" hit a wall quickly. In this article, we walk through the iterative process that led us to a fix.
The Problem: When Counting Gets Expensive
In modern distributed systems, we often face the challenge of tracking and analyzing high-cardinality data streams. Whether it’s monitoring unique visitors, finding the most active users, or identifying trending topics, traditional approaches quickly become prohibitively expensive as data volume grows.
Consider a simple scenario: you’re building a web service that needs to track user logins and provide a “top users” endpoint. The naive approach might look like this:
type loginServer struct {
	counts sync.Map // map[string]int: one counter per unique user
}

func (s *loginServer) handleTopUsers(w http.ResponseWriter, r *http.Request) {
	// Copy every counter out of the map, then sort all of them.
	users := s.snapshotCounts()
	sort.Slice(users, func(i, j int) bool {
		return users[i].Count > users[j].Count
	})
	// Return top N users...
}
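The handler relies on a snapshotCounts helper that materializes the sync.Map into a slice. A minimal sketch of what that might look like (userCount is the same struct the heap example uses later):

type userCount struct {
	UserID string
	Count  int
}

// snapshotCounts copies every entry out of the concurrent map.
// This alone costs O(n) time and O(n) extra memory per request.
func (s *loginServer) snapshotCounts() []userCount {
	var users []userCount
	s.counts.Range(func(key, value any) bool {
		users = append(users, userCount{UserID: key.(string), Count: value.(int)})
		return true
	})
	return users
}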
This works beautifully for small datasets. But what happens when you have millions of unique users? The memory usage grows linearly with the number of unique items, and sorting becomes increasingly expensive.
The Performance Reality Check
To understand the real impact, we implemented and benchmarked three different approaches to solving the top-k problem (a sketch of the benchmark shape follows the list):
- Naive Approach: Store all counters, sort when needed
- Heap-based TopK: Use a min-heap to maintain only top-k items
- Space-Saving Algorithm: Probabilistic top-k with bounded memory
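The exact harness is not reproduced here, but to give a sense of the shape of the measurement, here is a minimal sketch. It assumes a common interface that all three implementations satisfy; the interface and names are illustrative, not our production code:

// A hypothetical interface so all three approaches share one benchmark.
type topKCounter interface {
	Update(val string, inc int)
	TopK() []Item
}

func benchmarkTopKQuery(b *testing.B, c topKCounter) {
	// Pre-load a high-cardinality stream of logins...
	for i := 0; i < 1_000_000; i++ {
		c.Update(fmt.Sprintf("user-%d", i%100_000), 1)
	}
	b.ReportAllocs()
	b.ResetTimer()
	// ...then time only the top-k query path.
	for i := 0; i < b.N; i++ {
		_ = c.TopK()
	}
}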
Here are the benchmark results that tell the story (the same numbers are broken out in the sections below):

Operation                Naive            Heap TopK       Space-Saving
Top-k query              31,503 ns/op     5,493 ns/op     2,681 ns/op
Concurrent operations    55,875 ns/op     n/a             2,554 ns/op
Memory per operation     189,746 B/op     6,954 B/op      6,870 B/op

The numbers speak for themselves: Space-Saving answers top-k queries 12x faster than the naive approach and uses 28x less memory, while keeping login processing performance comparable.
Solution 1: Space-Saving Algorithm
The Space-Saving algorithm, introduced by Metwally et al. (2005), is specifically designed for finding frequent items in data streams under bounded memory usage.
Core Concept
Instead of storing every item, Space-Saving maintains exactly k counters. When a new item arrives:
- If it's already tracked, increment its counter
- If there's space, add it as a new counter
- If the structure is full, evict the item with the minimum count and take over its slot
Implementation Highlights
type Summary struct {
	k     int
	items map[string]*Item // value -> tracked counter
	heap  MinHeap          // min-heap ordered by Count
	mu    sync.RWMutex
}

func (s *Summary) Update(val string, inc int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	// Case 1: already tracked. Bump the counter and restore heap order.
	if item, ok := s.items[val]; ok {
		item.Count += inc
		heap.Fix(&s.heap, item.index)
		return
	}
	// Case 2: free slot available. Start tracking the new item exactly.
	if len(s.items) < s.k {
		item := &Item{Value: val, Count: inc, Error: 0}
		heap.Push(&s.heap, item)
		s.items[val] = item
	} else {
		// Case 3: full. Evict the minimum and reuse its slot.
		min := heap.Pop(&s.heap).(*Item)
		delete(s.items, min.Value)
		item := min
		item.Value = val
		item.Error = min.Count       // Track potential overcount
		item.Count = min.Count + inc // Paper's formula
		s.items[val] = item
		heap.Push(&s.heap, item)
	}
}
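The snippet assumes an Item struct and a MinHeap implementing container/heap.Interface over *Item, with each item's heap position kept in an index field so heap.Fix works. A minimal sketch of those supporting types:

type Item struct {
	Value string
	Count int
	Error int // maximum possible overcount inherited at eviction time
	index int // position in the heap, maintained by MinHeap
}

type MinHeap []*Item

func (h MinHeap) Len() int           { return len(h) }
func (h MinHeap) Less(i, j int) bool { return h[i].Count < h[j].Count }
func (h MinHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *MinHeap) Push(x any) {
	item := x.(*Item)
	item.index = len(*h)
	*h = append(*h, item)
}
func (h *MinHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}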
Key Advantages
- Bounded Memory: uses at most k counters, O(k) space, regardless of stream size
- Error Bounds: guarantees that the true frequency lies between (Count - Error) and Count (see the helper sketched below)
- Streaming: processes items one at a time, with no second pass over the data
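One practical consequence of the error bounds: any item whose lower bound Count - Error clears a threshold is provably at least that frequent. A small helper (illustrative, not part of our API) makes this concrete:

// guaranteedAbove returns only items whose true frequency is provably
// at least threshold, since true count >= Count - Error.
func guaranteedAbove(items []Item, threshold int) []Item {
	var out []Item
	for _, it := range items {
		if it.Count-it.Error >= threshold {
			out = append(out, it)
		}
	}
	return out
}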
Our benchmarks show Space-Saving consistently outperforms both naive and heap-based approaches, especially under high concurrency:
- Concurrent Operations: 2,554 ns/op vs 55,875 ns/op (naive), a 22x improvement
- High Concurrency: stable performance where the naive approach degrades sharply
- Memory Efficiency: 6,870 B/op vs 189,746 B/op (naive), a 96% reduction
Solution 2: Heap-Based TopK Approach
For comparison, the heap-based TopK approach provides exact results while improving efficiency over naive sorting.
Implementation Strategy
The heap approach maintains a min-heap of size k, ensuring O(n log k) complexity instead of O(n log n):
func (s *loginServer) getTopKUsers(k int) []userCount {
	h := &MinHeap{}
	heap.Init(h)
	s.counts.Range(func(key, value any) bool {
		userID, _ := key.(string)
		count, _ := value.(int)
		user := userCount{UserID: userID, Count: count}
		if h.Len() < k {
			// Heap not full yet: accept everything.
			heap.Push(h, user)
		} else if count > (*h)[0].Count {
			// Beats the current minimum: replace it.
			heap.Pop(h)
			heap.Push(h, user)
		}
		return true
	})
	return extractResults(h)
}
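This snippet assumes a min-heap over userCount values (a separate type from the *Item heap used by the Space-Saving Summary) plus an extractResults helper. A minimal sketch of both:

// A min-heap over userCount values, ordered by Count.
type MinHeap []userCount

func (h MinHeap) Len() int           { return len(h) }
func (h MinHeap) Less(i, j int) bool { return h[i].Count < h[j].Count }
func (h MinHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *MinHeap) Push(x any)        { *h = append(*h, x.(userCount)) }
func (h *MinHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// extractResults drains the heap into a slice ordered most to least frequent.
func extractResults(h *MinHeap) []userCount {
	out := make([]userCount, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(userCount) // pops ascending, fills from the back
	}
	return out
}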
Performance Trade-offs
Advantages:
- Exact results: no approximation errors
- Better than naive: 5,493 ns/op vs 31,503 ns/op for top-k queries
- Reasonable memory: 6,954 B/op per query, comparable to Space-Saving

Limitations:
- Memory grows with users: still needs to store all unique items
- Slower than Space-Saving: 2x slower for top-k queries (5,493 vs 2,681 ns/op)
- Scalability concerns: performance degrades with very large datasets
Practical Considerations and Lessons Learned
Concurrency is Critical
Modern applications demand thread-safe operations. All implementations use appropriate synchronization:
func (s *Summary) Update(val string, inc int) {
	s.mu.Lock() // Write lock for updates
	defer s.mu.Unlock()
	// ... update logic as shown earlier
}

func (s *Summary) TopK() []Item {
	s.mu.RLock() // Read lock for queries
	defer s.mu.RUnlock()
	// ... implementation
}
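For completeness, here is one way the read path might look under that locking scheme. This is a sketch against the Summary type above, copying items out so callers never observe internal state:

func (s *Summary) TopK() []Item {
	s.mu.RLock()
	defer s.mu.RUnlock()
	// Copy the tracked counters so the caller gets a stable snapshot.
	out := make([]Item, 0, len(s.items))
	for _, item := range s.items {
		out = append(out, *item)
	}
	// Sort descending by count; there are at most k items to sort.
	sort.Slice(out, func(i, j int) bool { return out[i].Count > out[j].Count })
	return out
}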
Our benchmarks revealed that proper synchronization can make or break performance under load.
Memory vs. Accuracy Trade-offs
Each approach offers different trade-offs:
- Naive: Perfect accuracy, unbounded memory, poor scalability
- Heap TopK: Perfect accuracy for top-k, O(k) memory for results, O(n) for storage
- Space-Saving: Approximate with error bounds, O(k) memory total, excellent for top-k
Error Bound Configuration
When using probabilistic structures, proper configuration is crucial:
// Space-Saving: k=100 means top-100 results with bounded error
summary := NewSummary(100)

// Alternative: use the epsilon-based constructor for automatic sizing
summary, err := NewSummaryEpsilon(0.01) // 1% error bound, so k=100
if err != nil {
	log.Fatal(err)
}
Real-World Impact
Implementing these algorithms in production systems has delivered significant benefits:
Performance Improvements
- 28x reduction in memory usage for high-cardinality analytics (189,746 → 6,870 B/op)
- 12x faster query response times for top-k operations (31,503 → 2,681 ns/op)
- 22x improvement in concurrent operations performance
- Bounded memory and stable query times as data grows, instead of costs that climb with every new unique item
Cost Savings
- Reduced infrastructure costs due to lower memory requirements
- Improved cache hit rates from smaller working sets
- Better resource utilization across the entire system
Operational Benefits
- Predictable memory usage regardless of data volume
- Graceful degradation under extreme load
- Simplified capacity planning with bounded resource requirements
When to Use Each Algorithm
Based on our experience, here are practical guidelines:
Use Space-Saving When:
- You need top-k items from a stream
- Memory usage must be bounded and predictable
- You can tolerate approximate results with error bounds
- High throughput and concurrency are requirements
Use Heap-based TopK When:
- Dataset size is moderate and known
- You need exact results
- Memory usage isn’t the primary constraint
- Implementation complexity should be minimal
Avoid Naive Approaches When:
- Dataset size is unpredictable or large
- Memory usage is a concern
- You need consistent performance characteristics
Implementation Tips and Best Practices
1. Optimal Space-Saving Configuration
Choose the right k value based on your accuracy requirements and memory constraints:
// For top-10 queries with high accuracy, use k=50-100
summary := NewSummary(100)

// For memory-constrained environments, use epsilon-based sizing
summary, err := NewSummaryEpsilon(0.01) // 1% error bound, so k=100
if err != nil {
	log.Fatal(err)
}

// Monitor actual error rates in production
items := summary.TopK()
for _, item := range items {
	fmt.Printf("Item: %s, Count: %d, Max Error: %d\n",
		item.Value, item.Count, item.Error)
}
2. Proper Error Handling
Always validate configuration parameters:
func NewSummaryEpsilon(epsilon float64) (*Summary, error) {
	if epsilon <= 0 || epsilon >= 1 {
		return nil, fmt.Errorf("epsilon must be in (0,1), got %v", epsilon)
	}
	k := int(math.Ceil(1 / epsilon)) // k = ceil(1/epsilon) counters
	return NewSummary(k), nil
}
3. Monitoring and Observability
Track key metrics to understand algorithm behavior (a sketch of one way to surface them follows the list):
- Hit rates for tracked items
- Error distribution for approximations
- Memory usage patterns
- Query latency percentiles
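As a sketch of what this can look like in practice (assuming the Summary type above; the reporting function and its output format are illustrative), a periodic sampler can log the worst-case relative error among tracked items:

// reportSummaryMetrics logs a coarse health signal for the sketch:
// the largest Error/Count ratio among the currently tracked items.
func reportSummaryMetrics(s *Summary, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		items := s.TopK()
		worst := 0.0
		for _, it := range items {
			if it.Count > 0 {
				if r := float64(it.Error) / float64(it.Count); r > worst {
					worst = r
				}
			}
		}
		log.Printf("space-saving: tracked=%d worst_relative_error=%.4f",
			len(items), worst)
	}
}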
Conclusion
Cardinality problems are ubiquitous in modern systems, but they don’t have to break your architecture. The Space-Saving algorithm provides a mathematically sound solution that trades small amounts of accuracy for dramatic improvements in performance and bounded resource usage.
The key insights from our implementation and benchmarking effort:
- Space-Saving dominates for top-k queries: 12x faster than naive approaches with 28x less memory
- Algorithm selection matters: Heap-based approaches offer a middle ground with exact results
- Measure everything: Benchmarks reveal the true performance characteristics
- Design for scale: Solutions that work for thousands of items may fail for millions
- Bounded memory enables predictability: Fixed resource usage regardless of data volume
By implementing Space-Saving with careful attention to concurrency and error handling, you can build systems that scale gracefully while maintaining the performance characteristics your users expect. The 22x improvement in concurrent operations performance makes it particularly suitable for high-throughput production environments.
About the authors:
Amir Jakoby is the CTO and Co-founder of Sawmills. He leads the technical vision and execution of the company’s smart telemetry management platform.
Ishay Shor is a guest writer who specializes in scalable backend infrastructure and observability systems.