From maps to bitmaps (and from bitmaps to bitmaps)
At Cerbos, we spend a lot of time thinking about authorization: how to express it, how to test it, how to ship it, and (crucially) how to evaluate it quickly enough that you'd actually want to put it in the request path of your application. Our open-source Policy Decision Point (PDP) takes a question such as "can user X do Y on resource Z?", matches it against a set of policies, and returns a decision. For same-network deployments, the whole round trip typically sits comfortably in the sub-millisecond range.
Sub-millisecond is the bit we obsess over. A good chunk of it boils down to an indexing problem: given a request, find the subset of rules that might apply to it, then evaluate them. This is the story of how we rewrote that index. Twice. In about three months. With some rather extensive groundwork in the months before that.
A brief history of the rule table
Before we get to bitmaps, a quick potted history.
The rule table isn't called "the rule table" because somebody sat down and designed one. The name stuck because, at some point along the way, it ended up looking like one. Each version grew out of the limits of the one before it.
First, an index of policies. Each with its own rules, compiled and cached lazily as requests came through. With sensible caching this scaled further than "walk everything on every request" sounds, and the performance story was genuinely good for a long time. What it didn't give us was shape. Every cross-cutting concern (indexing a new dimension, asking generic questions about the whole set of rules, swapping in a different storage model) meant working against the grain of a data model that was fundamentally policy-shaped.
So we reached for a table. We walked every rule in every policy and took the Cartesian product of its multi-valued fields: one rule with two actions and three roles becomes six rows. Collect the lot and you're left with a big scannable relational thing, those expanded combinations as rows, the dimensions (resource, role, action, and so on) as columns. The appeal wasn't speed so much as shape: rules in a uniform grid meant we could reason about them uniformly, add new dimensions, and structure data in ways the policy-centric model didn't easily allow. A query was a filter predicate. Now O(N) per request.
So we added indexes. Each dimension got a lookup from value to the set of matching rules, and queries shrank to "fetch-and-intersect". In practice, though, this turned into nested maps of nested maps of proto structs, tucked away in the engine and guarded by a shared mutex. Fast to query. Miserable to change. Reloading policies, adding a new dimension, or swapping in a different backend all meant touching too many things at once.
So we carved it out. A big refactor, plus a handful of follow-ups, lifted rule storage into a standalone package with a clean interface, flattened the nested maps back into a single inverted index, made the backend pluggable (in-memory plus an experimental Redis one), and made the whole thing thread-safe. At the time my own PR description apologised that "this seems like an unnecessarily complex change", which, in hindsight, is only fair.
Nothing about that last refactor was really faster. On raw latency it was a touch slower than what it replaced, a fact I noticed with some embarrassment a few months later. What it did was make the rule table improvable: a piece of furniture with handles, that you could drag out into the light and work on, rather than something the engine kept buried in a cupboard. Small irony: the bitmap rewrite ended up being so much better than either the in-memory or Redis backends that the pluggability story quietly retired itself in the same PR. Scaffolding does its job and then comes down. 🙂
And that's where we start.
A quick tour of the rule table
In Go, the index over those rows looked like this:
```go
// Roughly
type Index struct {
	byScope    map[string]*rowSet
	byVersion  map[string]*rowSet
	byResource map[string]*rowSet
	byRole     map[string]*rowSet
	byAction   map[string]*rowSet
}

type rowSet = map[uint64]*Row // keyed by content hash
```
A Row was a protobuf-backed struct carrying everything about a rule: its routing fields and its behavioural fields (condition, effect, params, output). Running a query meant intersecting the relevant dimensions and then filtering what came out:
```go
candidates := scopeSet.intersectWith(versionSet).intersectWith(resourceSet)
for _, row := range candidates {
	if row.Matches(roles, actions) {
		// evaluate
	}
}
```
This worked fine, and performed reliably for a good while. But it has three problems that compound as the policy set gets bigger:
- Every rule appears in every dimension's map, so duplication in the policy set multiplies through the index.
- intersectWith builds a fresh map at each step. For a query touching four dimensions, that's three intermediate maps of throwaway data.
- Even after intersecting, a Matches() call still walks the candidates to confirm roles and actions line up. Not every dimension was part of the intersection, and something has to do the filtering.
Under load, a noticeable fraction of the PDP's CPU was going to the allocator rather than the actual authorization logic. Never a good sign.
Bindings and cores
Before bitmaps enter the picture, there's a smaller design move that made the whole thing possible.
A rule has two lives: where it applies, and what it does. The "where" is the (scope, version, resource, role, action, principal) tuple. The "what" is the condition, the effect, the params, the output. These two parts change at very different rates. In most policy sets the "where" tuples are highly unique; that's literally what distinguishes one rule from another. But the "what" payloads are often identical, especially in templated multi-tenant policies where every tenant inherits the same rule shape.
So we split them in two:
```go
// The routing tuple. Every field is a dimension we might query by.
type Binding struct {
	Scope, Version, Resource, Role, Action, Principal string
	CoreID                                            uint32
}

// The behavioural payload. Deduplicated by content hash.
type FunctionalCore struct {
	Condition  *Condition
	Effect     Effect
	Params     *Params
	EmitOutput *Output
}
```
Every Binding carries a CoreID pointing to a FunctionalCore. Many Bindings can, and do, share a single Core. In our multi-tenant test set, where 100 tenants share identical policy structure, Core deduplication alone claws back a good chunk of memory before bitmaps are even involved.
Each Binding also gets a monotonically increasing uint32 ID. That ID is the position it occupies in a bitmap.
Enter the roaring bitmap
The index now looks different. For each (dimension, value) pair we no longer store a set of rules. We store a bitmap in which bit i is set if Binding i matches that value.
```go
type bitmapIndex struct {
	scopes    map[string]*bitmap
	versions  map[string]*bitmap
	resources map[string]*bitmap
	roles     map[string]*bitmap
	actions   map[string]*bitmap

	bindings []Binding
	cores    []FunctionalCore
}
```
A query becomes: AND together the bitmaps of every filter, iterate over the set bits, look up the Core for each, evaluate. The match loop is gone; the bits are the matches.
For the first rewrite I reached for roaring bitmaps: a well-known compressed-bitmap format with fast set operations. The library is excellent. FastAnd takes a slice of bitmaps and intersects them in a single pass, with container-level optimisations that skip whole regions of the universe when they can't possibly match. Pretty much everything you could want from a production-grade bitmap library, off the shelf.
The numbers on our standard evaluator benchmark, after a small tweak to put the smallest-cardinality bitmap first in FastAnd, looked like this:
| Metric | before | after | Δ |
|---|---|---|---|
| time/op | 43.8 µs | 7.3 µs | ~6.0× faster |
| bytes/op | 39.3 KiB | 5.5 KiB | ~7.1× less |
| allocs/op | 207 | 100 | ~2.1× fewer |
Steady-state heap on a 22,500-rule policy set dropped from 8.4 MB to 6.2 MB. Indexing time got ~23% faster. Net code change: a bit over five hundred fewer lines. Always a good feeling.
Out of curiosity I also ran the benchmark on Cerbos v0.40, the release before the rule table, to put a specific number on the "small regression" I mentioned earlier. v0.40 managed 18.4 µs. So the rule table had indeed cost us a little latency; roaring was now ~2.5× faster than even v0.40 had been.
I shipped it, patted myself on the back, and moved on.
Enter Dennis
A couple of weeks later, my colleague Dennis, who has a particularly good eye for hot-path work, ran the change through his load-testing harness with a much bigger policy set: 1000 policy units, 8000 files, 22,000 Bindings. The benchmark numbers were as advertised. Under sustained load, though, something unwelcome showed up in the CPU profile: a hefty slice going to the garbage collector. The pattern that had plagued the old index was still there. It had just moved house.
His first move was to stay within the roaring world and go after the allocations directly. FastAnd and FastOr each return a fresh bitmap, which is useful in general and painful when you're running thousands of queries a second. He replaced them with in-place equivalents, backed the scratch bitmaps with a sync.Pool, and moved the allocation strategy from "per operation" to "per query, reused".
On the microbenchmark this barely moved the needle: allocations dropped from 100 to 82 per op, time/op essentially unchanged. Under load, though, it was a completely different world. RPS increased by 30%, and the tail of the response-time distribution compressed substantially. GC stopped being the limiting factor.
My response at the time was, roughly, "where to next, Zig?" (for real though, I'm desperate to find an excuse).
And then he asked the really uncomfortable question:
Do we actually need roaring bitmaps?
Roaring is optimised for large and very large sparse bitmaps. The canonical use case is something like "the set of 32-bit user IDs that visited a page". It achieves its compression and speed by partitioning the 32-bit universe into 16-bit containers, each of which might be represented as an array, a run-length encoding, or a raw bitmap depending on density. That's a lot of cleverness.
Our bitmaps, it turns out, are not large. With 22,000 Bindings an uncompressed bitmap is 22,000 / 64 = 344 × 64-bit words in the worst case. That's about 2.7 KiB, and most of the real bitmaps in the index are smaller. We were paying for a sophisticated container hierarchy to represent things that would fit in a dozen uint64s.
When I picked roaring I'd assumed most of our bitmaps would be sparse. The 22,000-binding load test made it fairly obvious they weren't: small, yes, and sparse only in boring ways that the container hierarchy wasn't helping with. An ungrounded optimisation on my part, in other words.
A vanilla bitmap with a twist
So he wrote one. Here's its entire state:
```go
type Bitmap struct {
	words []uint64 // the actual bits
	meta  []uint64 // bit j of meta[i] is set if words[i*64+j] != 0
}
```
Two slices. That's it. words holds the bits. meta is a second-level index tracking which words are non-zero; a skiplist, basically, baked into 64-bit words. Every operation you'd want (Add, And, Or, IsEmpty, GetCardinality) is a short, tight loop over those two slices.
The twist is the meta layer. Before doing any per-word work, intersections can ask a cheap preliminary question: could these two bitmaps possibly overlap at all?
```go
func (b *Bitmap) MetaIntersects(other *Bitmap) bool {
	n := len(b.meta)
	if len(other.meta) < n {
		n = len(other.meta)
	}
	for i := 0; i < n; i++ {
		if b.meta[i]&other.meta[i] != 0 {
			return true
		}
	}
	return false
}
```
If the AND of meta words is zero across the board, the bitmaps are provably disjoint and we skip the intersection entirely. For queries where early dimensions narrow the candidate set aggressively (which is most real queries) later dimensions often cost essentially nothing. A few meta-word ANDs and done.
The simpler representation also makes a couple of things that had been awkward over roaring fall out naturally. Pooling genuinely just works; a *Bitmap has no internal containers or hidden allocations, just two slices you can reset and hand back. In-place operations compose cleanly too: arena.andInto(dst, src) mutates dst directly, and arena.orInto sorts its inputs by length first so the accumulator grows at most once.
On the microbenchmark, swapping roaring for the vanilla bitmap got another 10% on latency, 21% on memory, and 16% on allocations. On the load test (which is the number that really matters) RPS and latency both improved again on top of the pooled-roaring baseline. Three rewrites in, the hot path is now basically: word-level ANDs, a handful of cache lines, iterate bits, call the Core.
The numbers
The standard evaluator microbenchmark, through each era of the index:
| Era | time/op | bytes/op | allocs/op | Under load |
|---|---|---|---|---|
| v0.40 (pre-rule-table) | 18.4 µs | 18.2 KiB | 279 | — |
| Rule table | 43.8 µs | 39.3 KiB | 207 | — |
| Roaring bitmap | 7.3 µs | 5.5 KiB | 100 | GC-bound |
| Pooled roaring | 7.3 µs | 5.5 KiB | 82 | ~1.3× RPS |
| Vanilla bitmap | 6.6 µs | 4.3 KiB | 69 | better again |
The rule table paid for its structural wins in raw latency; roaring then beat even v0.40 by ~2.5× on the microbenchmark, but spent its gains at sustained load on GC pressure. Pooling didn't move the microbenchmark but increased RPS in the throughput test (measured against plain roaring). The vanilla bitmap improved both columns again.
A couple of things I learned
The right data structure is the one that fits your data. Roaring bitmaps are a terrific piece of engineering, and I'd still reach for them happily for the problems they're designed to solve. But our bitmaps weren't large, weren't sparse in the interesting way, and didn't need container-level compression. A simpler structure won because it fit the actual shape of the data better. A good reminder that "industry standard" is a great starting point, not the destination.
Allocations per query and GC pressure are two different problems. The roaring rewrite halved allocations per query, which looked great on the microbenchmark but didn't translate as cleanly to sustained throughput. The fix wasn't to allocate even less per query; it was to stop allocating at all, via pooling and in-place ops. Under load, that second metric is the one that actually counts.
And, more personally: ship it, write about it, and then hand it to your colleagues. They'll find things you missed. Each rewrite did something the others couldn't, and bundling them into a single bigger PR wouldn't have helped anyone.
Where we are now
Your average authorization decision now clocks in at around 7 microseconds (plus a bit of network and gRPC overhead). 😌
The index lives at internal/ruletable/index in the Cerbos repo. If any of this piqued your interest, the code and PRs are all open.
And if you have questions, disagreements, or (best of all) you spot the next 20%, come and find us over in our Slack community. We'd love to hear from you.
Thanks for reading!