The Secret Ingredients: Protobufs and gRPC

This article was first published on The New Stack.

When designing any sort of software, your choice of data format, types and structure informs many later design decisions, which you may come to benefit from or regret. When designing Cerbos, we made an early decision to use structured, type-safe and serializable data because we wanted to expose a predictable interface to applications, with predictable data types that could be used by other services in a myriad of languages. The technology we chose to deliver on this is Protocol Buffers (protobuf), along with gRPC. All the core features, including the policies, API, data storage, data interchange and even the test cases, are defined with them.

A bit of background

Protocol buffers are a language- and platform-neutral mechanism for serializing structured data. Unlike the more widely-known JSON format, protobufs have a compact, binary wire format and mandatory schemas for every message. The schema language supports a wide range of data types, nested messages, unions and enumerations. These schemas are used by the protobuf compiler to generate code for the developer’s programming language of choice. The generated code contains type-safe, native data types and structures that are equivalent to the abstract types defined in the schema along with specialized utility functions such as optimized encode/decode functions for each type.
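To make the size difference concrete, here is a minimal sketch that hand-encodes a single integer field the way the protobuf wire format does (a tag followed by a varint) and compares it with the equivalent JSON. The message shape (one integer field numbered 1, keyed `a` in JSON) and the helper name `encodeVarintField` are assumptions made for illustration. Real code would use the encoders generated by the protobuf compiler rather than writing bytes by hand.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// wireVarint is the protobuf wire type for variable-length integers.
const wireVarint = 0

// encodeVarintField hand-encodes one varint field: the tag
// (field number << 3 | wire type) followed by the value, both as varints.
func encodeVarintField(fieldNumber int, value uint64) []byte {
	buf := binary.AppendUvarint(nil, uint64(fieldNumber)<<3|wireVarint)
	return binary.AppendUvarint(buf, value)
}

func main() {
	// Hypothetical message with a single integer field numbered 1, set to 150.
	wire := encodeVarintField(1, 150)
	asJSON, _ := json.Marshal(map[string]uint64{"a": 150})

	fmt.Printf("protobuf: % x (%d bytes)\n", wire, len(wire))   // 08 96 01 (3 bytes)
	fmt.Printf("json:     %s (%d bytes)\n", asJSON, len(asJSON)) // {"a":150} (9 bytes)
}
```

The field name never appears on the wire; only the field number does, which is why the schema is mandatory for making sense of the bytes.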

gRPC is an RPC framework that uses protobufs for data exchange. It uses HTTP/2 as the transport mechanism, which allows it to make use of all the speed, security and efficiency features provided by the HTTP/2 spec for inter-service (or even inter-process) communication. gRPC also benefits from code generation to make RPC calls resemble native function calls in the programming language.

We were intimately familiar with protobufs and gRPC from our previous roles building highly-available, data-intensive, internet-facing API services and large-scale data processing systems. Protobufs’ concise and lightweight interface definition language (IDL) allowed us to create flexible schemas describing the data we were working with and evolve them over time by adding new fields or deprecating old ones while maintaining backward and forward compatibility. The efficient binary encoding helped us save bandwidth and transmit messages between different processing pipelines quickly and efficiently. (At the scale of dealing with billions of messages, even a few bytes shaved off each message makes a massive difference.) Because encoded protobufs are language-agnostic, they were an ideal format to exchange data between applications written in different languages such as API services written in Go and data processing pipelines written in Java or Python. They were also a good choice for storing data with loosely known schemas, such as cache entries, that were still accessible in a type-safe, native language structure in the programming language of choice. Using gRPC for services was a natural extension of our use of protobufs. Again, it allowed us to build fast and efficient API services that exchanged data in binary protobuf encoding and worked over HTTP/2, providing advanced features like bi-directional streaming and connection multiplexing.
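The forward-compatibility property mentioned above comes from readers skipping fields they do not recognize. The following is a rough sketch of that idea using only the Go standard library. `EntryV1`, `decodeV1` and the field numbers are hypothetical, and only two wire types are handled. Generated protobuf code does this for you, and typically preserves unknown fields rather than discarding them.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Wire types used in this sketch (values from the protobuf encoding spec).
const (
	wireVarint = 0
	wireBytes  = 2
)

// EntryV1 is a hypothetical "old" schema that only knows field 1 (id).
type EntryV1 struct {
	ID uint64
}

// decodeV1 reads field 1 and skips any field it does not recognize, which is
// what lets an old reader accept messages produced by a newer writer.
func decodeV1(data []byte) (EntryV1, error) {
	var e EntryV1
	for len(data) > 0 {
		tag, n := binary.Uvarint(data)
		if n <= 0 {
			return e, errors.New("bad tag")
		}
		data = data[n:]
		field, wireType := tag>>3, tag&7

		switch wireType {
		case wireVarint:
			v, n := binary.Uvarint(data)
			if n <= 0 {
				return e, errors.New("bad varint")
			}
			data = data[n:]
			if field == 1 {
				e.ID = v
			}
		case wireBytes:
			size, n := binary.Uvarint(data)
			if n <= 0 || uint64(len(data)-n) < size {
				return e, errors.New("bad length")
			}
			data = data[n+int(size):] // not a field V1 knows, so skip the payload
		default:
			return e, fmt.Errorf("wire type %d not handled in this sketch", wireType)
		}
	}
	return e, nil
}

func main() {
	// A "V2" writer added field 2 (a string) after V1 readers were deployed.
	msg := binary.AppendUvarint(nil, 1<<3|wireVarint)
	msg = binary.AppendUvarint(msg, 42)
	msg = binary.AppendUvarint(msg, 2<<3|wireBytes)
	msg = binary.AppendUvarint(msg, uint64(len("new")))
	msg = append(msg, "new"...)

	entry, err := decodeV1(msg)
	fmt.Println(entry.ID, err) // 42 <nil>
}
```

Backward compatibility works the other way around: a newer reader that receives an older message simply sees the default value for any field that is absent.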

Despite all the positives listed above, using protobufs in a non-trivial way used to be quite painful. The polyglot system described above required a shared set of protobuf schemas with both local imports and third-party imports such as Google protobufs. Every time there was an update to the schemas, we needed to generate code for multiple programming languages, package, version and distribute them to various package registries. We had to build our own custom tooling and CI jobs to properly version the changes, download external dependencies, download protobuf code generators and compile toolchains for each programming language, generate packaging projects and, finally, build and upload the packages to the appropriate package registry. There were some community-built tools such as ProtoEasy and ProtoTool that helped address some of the pain points such as downloading dependencies and language generators, but none that addressed all aspects of the process.

We loved gRPC for its streaming capabilities, speed and efficiency: the resource usage of our services was measurably lower than that of our previous JSON-based REST services, and they had much better latency and throughput. However, it was difficult to expose pure gRPC services externally without a translation layer like grpc-gateway in front. Streaming, especially bi-directional streaming, was out of the question with a translation layer, so we were losing some functionality as well. Most cloud providers did not support HTTP/2 over their load balancers (support is still quite spotty, in fact). Even if that limitation could be worked around using TCP load balancers, we still had the problem of generating and distributing gRPC client code to external customers. Even our own JavaScript code running in browsers could not access our gRPC services. The introduction of grpc-web was a partial solution to this problem, but the project was in its very early stages with limited functionality and required extra infrastructure, such as an Envoy proxy, to do the translation, which was not ideal.

State of the ecosystem

The protobuf/gRPC ecosystem has significantly improved over the past few years. More organizations are investing in and adopting the technology. Projects like etcd, CockroachDB and Vitess are examples of large-scale, critical infrastructure built on top of protobufs and gRPC. Almost all popular service meshes, proxy servers and load balancers now have native support for gRPC services. Frameworks like Dapr use gRPC to provide language-agnostic, standardized component building blocks for application development. (Cerbos is quite similar in that we aim to provide a plug-and-play access control solution for any application.) Great tools and utilities such as Buf, grpcurl, ghz and many others have made developing and working with protobufs and gRPC a much more pleasant experience. Buf deserves a special mention here because it has solved almost all of the annoyances and pain points associated with protobuf development mentioned earlier in this article. (We are not affiliated with Buf in any way; we are just a bunch of happy users with deep appreciation for an awesome product.)

Building Cerbos

When we first started building Cerbos, we had a clear set of principles that we wanted to follow.

**Cerbos should be usable in polyglot environments and provide a consistent experience.**

Our main target users are those who are embracing the cloud native computing paradigm and building scalable, dynamic, service-oriented systems that run on public/private/hybrid clouds using technologies such as containers, service meshes and declarative APIs. However, from our own experience, we know that these transformations don’t happen overnight. Most organizations spend months or even years migrating to new architectures and have to deal with chimeric environments composed of legacy and new applications. Quite often, the migration involves re-platforming to a different programming language as well, and there’s a period of time where applications written in different languages have to integrate and work with each other. In some other cases, organizations embrace polyglot development as a way to increase developer happiness and velocity. Whatever the underlying reason may be, we wanted to make sure that Cerbos fit well into polyglot environments and provided a consistent experience regardless of which language was used to interact with it.

We wanted to keep the Cerbos API as simple as possible so that anyone could get started using our product with the built-in tools and libraries provided by their programming environment. Being a very small team, we knew that it would take time for us to build SDKs in different languages. However, when the time came, we also wanted to have a solid foundation to avoid reinventing the wheel with each new programming language we targeted.

**Cerbos should feel “native”**

Delegating authorization decisions to an external service is a fairly radical idea that tends to make a lot of people nervous. Given that authorization is a cross-cutting concern that permeates the whole code base, this is a very valid concern. However, we were (and still are) convinced that decoupling access controls has many advantages that far outweigh those concerns. After all, nowadays it’s not unusual for applications to rely on cloud-based services for even highly critical components like databases. Unlike those services, Cerbos is deployed to the local network (no internet hops) and is stateless (does not have to process terabytes of data on each request). Given advances in technology and maturing techniques for ensuring reliability and resiliency, we were quite confident that Cerbos could perform almost as well as an embedded solution while providing bonus features like consistent access enforcement throughout the stack, dynamic updates without redeploys and comprehensive visibility of access rules and audit trails.

Given the above requirements, it was a natural choice to pick protobufs and gRPC for some of the most fundamental portions of the Cerbos product. Today, almost all of the core data structures and most of our extensive test suite are defined using protobufs and the primary API of Cerbos is a gRPC API.

We are able to enforce a schema for our data structures using the protobuf IDL. It enforces discipline because developers are forced to think carefully about the changes they are trying to make. Automated tooling can detect and warn about any potentially breaking changes and help us avoid making backward incompatible changes, which we take quite seriously.
We don’t have to spend time copying or transforming objects received over the gRPC API because they are already in the format the Cerbos engine and other components are expecting.
Cerbos policy rules include conditions written in the Common Expression Language (CEL). The CEL runtime has native support for working with protobufs in a type-safe way. Again, this saves us the hassle and overhead of having to convert request data into a specific data format such as JSON, which does not have the same strong typing guarantees as protobufs.
We can use plugins like protoc-gen-validate to automatically generate validation code based on annotations, vtprotobuf to generate optimized marshaling and unmarshaling code, and protoc-gen-openapiv2 to generate OpenAPI specs from our gRPC service definitions. We have also developed our own custom plugins such as protoc-gen-go-hashpb to generate hash functions and protoc-gen-jsonschema to generate JSON schemas that can be used by IDEs and other tools.
We can easily convert between data formats and work with them using native language constructs. Cerbos policies have a protobuf schema, and we take advantage of different protobuf encoders to read, write and validate those policies, converting between human-friendly YAML or JSON and machine-friendly binary blobs.
Being able to easily serialize and deserialize data structures enables us to write concise tests that load their inputs and expected outputs from disk. A significant proportion of our extensive test suite is defined using protobuf schemas and stored on disk as YAML files. When the tests run, our test framework loads the test cases from those files into Go structs (generated from protobufs) that the test code can work with directly.
Audit trails are a core feature of the Cerbos offering, and we extensively make use of the schema evolution features and the compact encoding format of protobufs to store and work with audit entries. There are several types of audit entries with complex nested structures like arrays, but they are all stored as protobuf encoded blobs in our audit store. We can add new fields to audit entries in newer releases of Cerbos and still read and work with older audit entries that don’t have those fields because protobuf schema evolution supports that.
The Cerbos project itself is written in Go, but we are able to easily use the same data structures with other languages to build tools and prototypes. This gives us the freedom to use tools that are best suited for the task at hand.

The main API surface of Cerbos is implemented using gRPC. On the server side, we use interceptors to implement many important features like request validation, audit log capture, metrics collection, distributed trace propagation, error handling and authentication. We use grpc-gateway to provide a REST+JSON translation layer for the benefit of humans and languages that don’t have a gRPC implementation. Some RPCs, such as the one for retrieving audit log entries, are built as streaming RPCs that can efficiently stream large volumes of data to clients with backpressure.

Almost all of our client SDKs use the gRPC API, primarily for the speed and efficiency gains. But, almost as importantly, using gRPC allows us to generate type-safe client stubs with built-in low-level plumbing (HTTP/2 transports with support for mTLS, Unix domain sockets etc.) for most popular programming languages. This base layer gives us a solid foundation on top of which we can add a thin convenience layer to provide idiomatic language constructs for working with Cerbos.

Most popular service meshes and load balancers provide tracing, retries, circuit breaking and load balancing of gRPC requests, which gives users full control over how their services are configured to communicate with Cerbos services or sidecars in their environment. It also saves us from having to reinvent the wheel for those complex resiliency features and lets us rely instead on battle-tested implementations built by experts.

Our protobuf/gRPC workflow makes extensive use of Buf, an almost magical tool that makes working with protobufs an absolute pleasure. We use the Buf CLI and GitHub Actions to format, lint, detect breaking changes and generate code from protobufs. Buf automatically downloads the dependencies and plugins required for the build and saves us the pain of having to manage them manually. On each successful build, Cerbos protobuf definitions are automatically uploaded to the Buf Schema Registry (BSR), which allows us to effortlessly distribute the service and schema definitions for anyone to use. The BSR eliminates the need to maintain copies of the protobuf definitions in each SDK repository. With a single command, developers can pull down the latest definitions from the BSR and regenerate client code. Buf’s managed mode and remote plugins are extremely handy during this process for customizing the output and managing toolchains.

Conclusion

The decision to invest in protobufs and gRPC has had a massive positive impact on our productivity and velocity. Even with a very small team, we have managed to build a fast, lean, feature-rich product and a plethora of tools, SDKs and demos that would have taken much more time and effort to build without the convenience, safety and productivity provided by the protobuf/gRPC ecosystem. Going forward, we have plenty more exciting new Cerbos features in the pipeline that are going to be built on top of the same proven and reliable technical foundation.
