Miro is on a mission to decompose its monolithic architecture, and breaking out authorization is a crucial part of this. At Miro, we are building a dedicated service for application authorization, which is powered by Open Policy Agent (OPA) and Rego under the hood.
In this post, you’ll learn about Miro’s specific challenges with authorization, and how OPA’s versatility is helping Miro address both service-to-service and real-time, high-throughput authorization. This content was originally presented at an OPA meetup co-organized by Styra and held at Miro HQ in April 2023.
Authorization at Miro
When we say “authorization” at Miro, we’re talking about end-user access control. It’s everything involved in determining what a Miro user is allowed to do while interacting with the product, both on and off the board.
What’s special about this, and why is it a crucial part of decomposing the monolith?
Authorization is part of the fabric of Miro. Miro is a knowledge product and millions of people use it every day to brainstorm ideas, share information, and many other things that are impossible to enumerate. Much of this information is sensitive or confidential, and our customers need to control access to it, so a lot of Miro’s value comes from its access control features. Authorization is the underpinning that makes all of this work, and that’s why it’s so important to get it right.
So how do we break up the monolith without impacting the correctness of authorization? And how do we make all of this work in cases where we need authorization in real-time, like board sessions?
We’ll answer these questions in this post, but if you prefer to watch and listen rather than read, feel free to check out the video version of this talk instead.
The challenges of monolithic architecture
First, I need to say a few words about the monolith and why we’re breaking it up in the first place. In the beginning there was nothing, then there was the monolith. Monolithic architectures are often the best choice for small teams just starting out, and this was the case with Miro. But Miro’s engineering organization has grown a lot, and the monolith is now slowing us down. Think of it as something big and heavy that takes a lot of coordination to move, with many contributors pulling in different directions at different speeds. And in order to break free from the monolith, teams need a solution for authorization, because authorization is foundational.
Authorization is foundational
Almost everything you can do with the Miro product requires authorization somewhere along the line, so practically every feature depends on it. And all of the authorization logic is implemented as Java code in the monolith, which makes it directly accessible to any other code that’s also in the monolith.
The other thing that makes it tough to extract is that it directly accesses the monolith’s data stores. If a feature needs to know if a user can view a board, it calls into the authorization code, which in turn directly fetches the data needed for that decision.
No duplication
When you break out a service from the monolith, you no longer have access to the authorization code or the data stores. So what can you do? If there were no alternatives, you might be tempted to just write your own authorization code, and you’d have to figure out how to get the data you need, since you can no longer just query the same data stores.
This would not only be a lot of development effort, but it would also create a huge problem: copies of the same authorization logic would proliferate across many different services, in a variety of implementations and languages. Even very common permission checks have complex and nuanced implementations, and it’s easy to introduce a mistake. Rolling out any change would also be slow and painstaking, because we’d have to track down and update every one of these independent implementations. All of this greatly increases the likelihood of subtle mistakes, which could end up exposing confidential information. That’s clearly unacceptable.
What Miro was looking for
We needed to centralize authorization logic to prevent duplication and preserve correctness.
We needed to separate data access from authorization code, because the monolith’s data stores aren’t available from the outside.
We also needed to be able to make authorization changes independently. At Miro, any changes you introduce to the monolith are grouped into a release with many unrelated changes from many different contributors. If there’s an issue that requires a rollback, authorization changes get rolled back too, and this can cause inconsistent product behavior.
Open Policy Agent and the Rego language
If you’re familiar with Open Policy Agent, you’ll probably see why it was a clear choice: it addresses all of these points right out of the box. With just a little configuration, you can bring up OPA as a standalone service that takes data and policy code and outputs authorization decisions. And if you need a more custom solution, OPA has you covered too. It’s extremely versatile, with many different ways to leverage it, a few of which I’m going to highlight.
Of course Rego and OPA go together, since OPA is, in essence, a Rego evaluation engine. And Rego is, to quote the OPA docs, “purpose-built for expressing policies over complex hierarchical data structures,” which is exactly what we needed.
We could have extracted the Java authorization code from the monolith and factored out the data-access concerns, or maybe rewritten it in Kotlin, or something else, but a general-purpose programming language really isn’t the right tool for the job. Rego, by contrast, lets us write policy code in a way that is constrained to that purpose alone. You can’t do arbitrary I/O operations like database or file access, and there’s no inheritance, dependency injection, configuration, and so on. It’s just pure and declarative, and it actually simplifies the authorization logic a lot. As engineers, we love it when critical aspects of our systems are simple and easy to understand, maintain, and evolve, so this is a huge win.
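To give a flavor of what that looks like, here’s a toy policy in this spirit. The package name, input fields, and data layout are illustrative inventions, not Miro’s actual policies:

```rego
package authz

# Deny by default; access is granted only when a rule below matches.
default allow = false

# A user may view a board if they belong to the team that owns it.
allow {
    input.action == "VIEW_BOARD"
    team := data.teams[input.resource.team_id]
    team.members[_] == input.identity.user_id
}
```

There’s no fetching, wiring, or control flow here; the rule simply declares the conditions under which access is allowed.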
The service-to-service architecture
Here’s what the whole thing looks like.
A set of standalone OPA instances takes care of pretty much everything except actually fetching the data. The AuthZ service is where we had to do the most work, because this is the part that’s Miro-specific. It’s the brains behind the operation. Its job is to integrate with the necessary data APIs, and it knows which data is needed for which authorization checks.
Starting from the left, some Miro service makes a “Check” request via unary gRPC and waits for a boolean allow-or-deny response. The request contains an identity (usually a Miro user), an action (an application-level operation, like “VIEW_BOARD”), and a resource (the thing the action will be performed on, like a Miro board). AuthZ fetches the data relevant to the identity and resource, bundles it into a JSON document together with these request parameters, and sends that document to OPA in an HTTP POST. OPA then evaluates the current policies using this big JSON document as input.
This is the “input overloading” approach for external data described in the OPA docs.
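To make the flow concrete, here’s a rough Go sketch of that hand-off. The endpoint path, policy package, and field names are assumptions for illustration, not our actual schema:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// checkInput is the "input overloading" document: the Check parameters plus
// all the data OPA needs for the decision. Field names are illustrative.
type checkInput struct {
	Identity string         `json:"identity"`
	Action   string         `json:"action"`
	Resource string         `json:"resource"`
	Data     map[string]any `json:"data"` // e.g. team membership, board settings
}

// check POSTs the input document to OPA's Data API and reads back the
// boolean decision. "authz/allow" is a hypothetical policy path.
func check(in checkInput) (bool, error) {
	body, err := json.Marshal(map[string]any{"input": in})
	if err != nil {
		return false, err
	}
	resp, err := http.Post("http://localhost:8181/v1/data/authz/allow",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var out struct {
		Result bool `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false, err
	}
	return out.Result, nil
}

func main() {
	allowed, err := check(checkInput{
		Identity: "user-123",
		Action:   "VIEW_BOARD",
		Resource: "board-456",
		Data:     map[string]any{"team_members": []string{"user-123"}},
	})
	fmt.Println(allowed, err)
}
```

The key point is that OPA needs nothing beyond this one document to make the decision.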
Isolated authorization changes
So how do the policies actually get loaded into these OPA instances? This is done with the OPA bundle API, and it’s how we gained the ability to ship authorization changes completely independently with very little effort.
The authorization policies are a set of Rego files and some minimal static data in their own GitHub repository. When a developer wants to make changes, they write code and tests, all in Rego, and submit a pull request. After it’s merged, a GitHub Action runs OPA’s build command to package all the Rego code and data into a compressed tarball, the bundle, which it then copies to an S3 bucket. The OPA instances are configured to continuously poll this bucket, so when a new bundle appears, each instance downloads it and loads it into memory, and the authorization changes go into effect right away.
Nothing else needs to change apart from the Rego code itself, and we can roll back these changes just as easily by publishing a bundle built from an earlier revision.
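Nothing here is custom machinery on the OPA side, either; the polling is plain OPA configuration. A minimal sketch, with placeholder service, bucket, and bundle names:

```yaml
# opa-config.yaml (sketch; service, bucket, and bundle names are placeholders)
services:
  bundle-bucket:
    url: https://authz-bundles.s3.eu-west-1.amazonaws.com
    credentials:
      s3_signing:
        environment_credentials: {}

bundles:
  authz:
    service: bundle-bucket
    resource: bundle.tar.gz
    polling:
      min_delay_seconds: 30
      max_delay_seconds: 60
```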
What about real-time?
This is all wonderful, but has only gotten us part of the way there. I haven’t said anything about real-time yet. After all, Miro is a real-time visual collaboration tool, so what happens when we need authorization decisions with sub-millisecond latency?
As a Miro user, your permissions can change at any time, even when you’re in the middle of a board session. Maybe someone decides to move the board to a different project, which changes who is allowed to edit it. So we need to make authorization checks when you perform certain actions like moving shapes, because maybe you’ve lost permission in the meantime. Today, there’s very little latency when computing these authorization decisions because everything is together in the monolith.
How can we extract authorization without affecting the board experience? If the service-to-service architecture we just saw were the only option, board users would have to wait for a full round-trip authorization check to complete every time they tried to change something on the board. Even if we could get the latency down to the low tens of milliseconds, it would still make for a glitchy user experience. Another problem is that our service would have to handle a huge number of unnecessary authorization checks, even when no data changes had occurred that would affect the results.
The team asked: What if we could push authorization changes instead, whenever there were relevant data changes?
The streaming architecture
That’s how the team came up with the “streaming architecture,” which is a special operating mode for cases that need it, like on-board authorization. The idea is that clients “subscribe” to authorization changes over a bidirectional gRPC stream by specifying an identity, a resource, and a set of actions to track.
Data providers publish data change events to Kafka topics, which AuthZ consumes. AuthZ decides which data changes are relevant to its current subscriptions, recomputes the set of allowed actions for each affected subscription, and pushes the results back over the stream. The subscribed clients then cache them locally in memory. This way, clients always have the latest authorization state for a given identity and resource, and it can be looked up in memory with sub-millisecond latency.
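To sketch the client side of this, here’s roughly what receiving and caching those pushed updates could look like in Go. The update shape is a hypothetical stand-in for our protobuf messages, and a channel stands in for the gRPC stream:

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical stand-in for the message AuthZ pushes: the latest
// set of allowed actions for one (identity, resource) subscription.
type update struct {
	Identity string
	Resource string
	Actions  []string
}

// permissionCache holds the latest pushed state so permission checks can be
// answered from local memory with sub-millisecond latency.
type permissionCache struct {
	mu      sync.RWMutex
	allowed map[string]map[string]bool // identity/resource -> allowed actions
}

func newPermissionCache() *permissionCache {
	return &permissionCache{allowed: make(map[string]map[string]bool)}
}

// apply replaces the cached action set whenever AuthZ pushes an update.
func (c *permissionCache) apply(u update) {
	set := make(map[string]bool, len(u.Actions))
	for _, a := range u.Actions {
		set[a] = true
	}
	c.mu.Lock()
	c.allowed[u.Identity+"/"+u.Resource] = set
	c.mu.Unlock()
}

// can is the hot-path check the board client makes on every user action.
func (c *permissionCache) can(identity, resource, action string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.allowed[identity+"/"+resource][action]
}

func main() {
	cache := newPermissionCache()
	// In the real client these arrive over the bidirectional gRPC stream;
	// a channel stands in for it here.
	updates := make(chan update, 1)
	updates <- update{Identity: "user-123", Resource: "board-456", Actions: []string{"VIEW_BOARD"}}
	close(updates)
	for u := range updates {
		cache.apply(u)
	}
	fmt.Println(cache.can("user-123", "board-456", "EDIT_BOARD")) // false: not allowed
}
```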
This approach enables us to complete the journey of extracting authorization from the monolith, because it means we can handle real-time use cases as well.
But zooming in on the details a bit, there’s an important difference to notice: the “evaluation engine.” This is not the same as the standalone, out-of-the-box OPA we saw previously. Before we take a closer look at what it is, let’s talk about why out-of-the-box OPA wasn’t a fit here, which I think is a good illustration of how OPA’s versatility makes it applicable to a wide variety of situations.
In the streaming architecture, the most important thing we need is high evaluation throughput.
Highly active board sessions
Miro board sessions can be very active, and each AuthZ instance can handle multiple board sessions with potentially hundreds of users on each one. When a data change affects a large number of subscriptions, we need to perform thousands of policy evaluations in a short window. And they need to happen quickly, because the longer it takes to push out an authorization change, the longer a user might be able to do or see something they’re no longer allowed to.
To get high throughput, we needed to put the data closer to the policies. We couldn’t afford to construct and send a JSON payload containing all the necessary data over HTTP for each individual evaluation. So we needed something like the “push data” approach, which is another of several ways OPA offers for dealing with external data.
“Push data” is when you send data to OPA independently of evaluation. OPA stores this data in memory, and your policies can access it directly later on. This looked like a great option, because we actually only needed to write data once per data change event. Once the data was updated, we could evaluate the permissions for however many identity-resource pairs we needed to.
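With standalone OPA, pushing data means writing to its Data API. A rough Go sketch, with illustrative paths and payloads:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// pushData writes a JSON document into OPA's in-memory store via the Data
// API, independently of any evaluation. Policies can then read it under
// data.<path>. The path and payload used below are illustrative.
func pushData(path string, doc []byte) error {
	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8181/v1/data/"+path, bytes.NewReader(doc))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Write once per data change event, then evaluate as many
	// identity-resource pairs as needed against the updated data.
	err := pushData("teams/team-1", []byte(`{"members": ["user-123"]}`))
	fmt.Println(err)
}
```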
But there was still a problem: we’d need to establish an HTTP connection for each identity-resource pair, and doing that thousands of times in a short period has a real impact on throughput. If you’re wondering why we couldn’t just batch everything up in one big request over one connection, it’s mainly because that’s all-or-nothing: we would have to wait for everything to finish before being able to push any authorization changes back to clients. What we needed was to stream evaluation requests over a single connection and receive results back over the same connection as they completed.
Exploring an evaluation engine
We decided to explore our own solution written in Go. It’s very similar to “push data,” but it uses a single bidirectional gRPC stream with protobuf messages instead of JSON, and it stores data in memory in what is basically just a hashmap. The stream connection only needs to be established once, and thanks again to OPA’s versatility, we were still able to offload all the complexity of policy evaluation by leveraging the Rego SDK.
The way it works is that each AuthZ instance is paired with exactly one evaluation engine. AuthZ sends relevant data changes to the engine, followed by a stream of (identity, resource) pairs affected by those changes. The evaluation engine applies the data changes in memory, then asks the Rego SDK to evaluate the policies against that data. Because the engine is written in Go, it was straightforward to use goroutines to run these evaluations concurrently and utilize multiple cores. Results are streamed back to AuthZ as they complete, and AuthZ routes them back to the appropriate subscribed clients.
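Here’s a condensed sketch of that evaluation loop using the Go rego package; the policy, data shape, and input pairs are toy stand-ins for illustration:

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"github.com/open-policy-agent/opa/rego"
	"github.com/open-policy-agent/opa/storage/inmem"
)

// A toy policy: allow if the identity appears in the pushed member list.
const policy = `
package authz

default allow = false

allow {
	data.members[_] == input.identity
}
`

func main() {
	ctx := context.Background()

	// The in-memory store plays the role of the engine's hashmap of pushed data.
	store := inmem.NewFromObject(map[string]interface{}{
		"members": []interface{}{"user-123"},
	})

	// Prepare the query once; a PreparedEvalQuery is safe for concurrent use.
	query, err := rego.New(
		rego.Query("data.authz.allow"),
		rego.Module("authz.rego", policy),
		rego.Store(store),
	).PrepareForEval(ctx)
	if err != nil {
		panic(err)
	}

	// The (identity, resource) pairs affected by a data change.
	pairs := []map[string]interface{}{
		{"identity": "user-123", "resource": "board-1"},
		{"identity": "user-456", "resource": "board-1"},
	}

	// Fan the evaluations out across goroutines to utilize multiple cores.
	var wg sync.WaitGroup
	for _, p := range pairs {
		wg.Add(1)
		go func(input map[string]interface{}) {
			defer wg.Done()
			rs, err := query.Eval(ctx, rego.EvalInput(input))
			fmt.Println(input["identity"], err == nil && rs.Allowed())
		}(p)
	}
	wg.Wait()
}
```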
With this approach, we believe we’ll be able to handle any realistic bursty-event scenario we’re likely to encounter in production when this goes live. That said, some future use cases we’ve recently become aware of will likely demand more throughput than even this approach can deliver, so we’ve also been exploring alternatives like the Rego intermediate representation and WebAssembly, which could eliminate the need for a separate evaluation service. That would be too much to cover in this post, though, so stay tuned for something from us along these lines in the future!
Concluding thoughts
There you have it. That’s the story of how Miro is leveraging OPA to implement authorization-as-a-service, and to ultimately enable our teams to break free from the monolith. We’ve been able to keep the authorization logic centralized, and we’ve also greatly simplified it by implementing it in Rego.
OPA’s bundle API gave us the ability to deploy authorization policies independently of any other changes. And because OPA is so versatile, with so many ways to integrate, we’ve been able to tackle the real-time use case by leveraging the Rego SDK. We’re on our way to fully extracting authorization into a service while preserving the fluidity and game-like feel of Miro board sessions.
Shout out to my team who actually did the bulk of this effort: Ionut, João, Yura, and Dima — you guys are amazing and I love working with you.
Do you have ideas or a different approach to authorization? Let us know in the comments — we’re interested in what you have to say!
Interested in joining the Miro Engineering team? Check out our open positions.