We Built Osmium for Scale

How we built Osmium for scale | Osmium BlogLog In

TL;DR - We built our own actor framework in Rust from scratch. Here’s why we did that, what the benefits are over other solutions, and some details about how it works. The way many fast-moving startups solve scaling nowadays is by throwing money at cloud providers, but we wanted to do things differently. First of all, we are self funded, don’t have (and don’t want) VC funds, so we have to be cost-conscious. Second of all, we strongly dislike being dependent on third-party providers. We rent physical servers, use a third-party payment provider, and a CDN for serving the website and static files. We selfhost everything else, including Git, CI runners, VPN, databases, observability, and everything else needed for running a chat app in production. Running stuff ourselves feels great - if it breaks, we can fix it, instead of waiting for someone else to. Designing a chat app So, how does one design a chat platform? Reducing a chat app to its very basics involves roughly two things - loading previously sent messages and sending+receiving new messages. The first part is relatively simple - you have some database with a table of messages with some indexes, when you load existing messages it does a db query and gives it back to you, done. Sending+receiving messages gets a little more interesting - all users that are in a channel should receive a notification that there’s a new message in a channel they follow. Going from first principles - the way to solve this is to have some sort of a pub/sub system, where users that are online subscribe to channels they follow, and sending a message publishes a message to this system. One way to do this would be using Redis Pub/Sub - a relatively easy solution, this is for example what Stoat does. This means that if a user sends a message, the message gets serialized, sent to Redis and distributed across other nodes where users are connected, deserialized and sent to users over some kind of client socket. But what if there was a different, better way? What if, when two users are connected to the same machine, you could skip the network trips entirely? Optimizing for performance at this level is surprisingly important, and so we looked at a different kind of infrastructure modelling. Actors There is a concept in computer science called the actor model. The basics are these - you have some state wrapped in an actor and you can only talk to actors by sending them messages and them sending you responses back. By imposing this kind of structure on your code, you actually get a really nice benefit - if your messages are serializable into bytes, the actors talking to each other don’t need to be on the same machine, and suddenly you have a stateful app AND you can scale horizontally at the same time. In a “standard” REST API in modern apps the APIs are stateless and all state is stored in databases, which the APIs then talk to. This usually means that for every request you have to do a bunch of round trips to the network, to some cache for rate limits, to some database for reading/writing data, etc. With the actor model, you can store a lot of data directly in memory, and databases become significantly less important. A popular language implementing the actor model is Elixir - this is also what Discord runs on. It gives you some awesome features out of the box, like moving actors across nodes and automatic node discovery. It does, however, also have its drawbacks - relatively old-school syntax, limited type checking and the fact that it is garbage collected and runs on Erlang VM. These are significant drawbacks partially because we aren’t familiar with Elixir syntax, and partially because performance is critical and Elixir can slow you down. Discord has been known to write native modules for Elixir in Rust to make up for the performance issues. Rust is a language that we are familiar with, and was therefore our choice to go with. Osmium structure We ended up writing Nucleus, our own Rust crate implementing the actor model from scratch, given our specific requirements. It allows us to create Nucleus nodes that listen, discover, and talk to each other. We can define our own actor types, message handlers, and then Nucleus handles everything in between. As a consequence of it not running in a VM like Elixir/Erlang, it requires us to define explicitly how the messages and actors are serialized and deserialized so that they can be moved between nodes. For this we chose to use Protobuf, as it has good ecosystem support and offers nice backwards compatibility guarantees. Discovery When I have actor User-A and I send a message to Channel-B, the actor User-A first needs to find where Channel-B is located, and whether it exists at all. Nucleus keeps track of all named actors on its node, and nothing else. By default, to find Channel-B, Nucleus will send a discovery message to all nodes in the cluster. This serves well with low traffic, but scales...

We Built Osmium for Scale

Related Articles

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

How to Earn a Billion Dollars

Italy's Meloni says Trump 'made up' story that she 'begged' him for photo at G7