Kafka's Broken Promise: There is No Goldilocks Log | OpenData
TL;DR
OpenData Log maintains millions of individually keyed, ordered logs and scales to tens of thousands of active readers on a single instance. It is MIT-licensed, built directly on Object Storage, and deployed as a single Rust binary.
Create your first logs in 2 minutes with our quickstart: https://www.opendata.dev/docs/log/quickstart
There is No Goldilocks Log
In late 2013, Jay Kreps published one of the most influential blog posts in data engineering. He explained why The Log is the universal building block underneath nearly all data systems. It can replicate a state machine across a network, feed a firehose of data into downstream systems, and serve as efficient storage for real-time processing.
He was right about the abstraction, but wrong to assume that a single system (Kafka) could implement it for all of these uses.
A decade of working on logging systems has taught us that logging, as it turns out, is split into two types:
| Type | Description | Examples |
| --- | --- | --- |
| Funneling | Collecting data from high-cardinality sources and coalescing them onto a low-cardinality pipe. | Telemetry pipelines, clickstream collection, warehouse dumps |
| Routing | Delivering events from addressable sources to addressable destinations. | Messaging, feeds, agent traces |

Both of these patterns require a durable log, but they have quite different nonfunctional characteristics that make a system designed for one a poor fit for the other. Some use cases just need the biggest, cheapest pipe they can get. Others need to scale to the millions of small logs.
Funnels
When Kafka was released in 2011, fifteen years ago, it was built to funnel telemetry data (clickstream, page views, metrics) from all of LinkedIn’s servers to Hadoop and/or an online metrics database.
An intriguing similarity across the initial Kafka use cases at LinkedIn was that messages sent to Kafka were keyless. In fact, the initial Kafka implementation discarded the partitioning key after computing a partition, leaving only the payload on storage (it wasn’t until later that Kafka supported keys and compaction).
In the context of funneling, this isn’t surprising. If what you want is to shuttle data from source to destination then a simple pipe that’s optimized for well-balanced throughput is ideal.
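As a rough illustration of the keyless funnel described above, here is a minimal Python sketch. It is not Kafka's code: the `KeylessFunnel` class, the partition count, and the use of CRC32 as the hash are all assumptions made for the example. The point it shows is that the key only selects a partition and is then thrown away, so nothing on storage lets you look a record up by key later.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(key) % num_partitions


class KeylessFunnel:
    """Toy funnel: the key picks a partition and is then dropped."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS) -> None:
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)  # partition id -> list of payloads

    def append(self, key: bytes, payload: bytes):
        p = partition_for(key, self.num_partitions)
        self.partitions[p].append(payload)  # only the payload is stored
        return p, len(self.partitions[p]) - 1  # (partition, offset)


funnel = KeylessFunnel()
p1, o1 = funnel.append(b"server-17", b"page_view /home")
p2, o2 = funnel.append(b"server-17", b"page_view /jobs")
```

Records from the same source land in the same partition in order, which is all a funnel needs; the stored data carries no trace of `server-17`.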
Routers
The other use case for logs is routing, which is often seen in messaging applications, feeds and microservice communication. It takes events from input sources and delivers them to specific, addressed destinations or records them for future replay.
The defining characteristic of routers is that destinations are only interested in a particular subset of keys. If you think of the funnel as a dumb broker, then routers need smart logic to distribute the messages in the log to their respective destinations.
Why Kafka is an Excellent Funnel, But a Bad Router
We like to reason about storage systems in terms of read, write and space amplification. The only way to improve all three in a system is to restrict the usage pattern.
When the engineers at LinkedIn designed Kafka, they restricted the API to make it particularly well suited as a funnel with low write amplification. Since data is written to disk via append-only, immutable files which are only cleaned up wholesale when they fall out of retention, the write amplification α_write is ~1x.
The consequence of this low write-amplification design is that reads are heavily restricted. The only read API provided is to scan the data in the order that it was written, but if that’s what you want then your read amplification α_read is also 1x.
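The two 1x figures can be checked with a toy model. The sketch below is an assumption-laden simplification, not Kafka's storage engine: `AppendOnlyLog`, its segment size, and the byte counters are hypothetical, and it ignores replication, indexes, and filesystem overhead. It counts bytes written against bytes received (write amplification) and bytes read by a full scan against bytes retained (read amplification).

```python
class AppendOnlyLog:
    """Toy append-only log made of immutable segments."""

    def __init__(self, segment_bytes: int) -> None:
        self.segment_bytes = segment_bytes
        self.segments = [[]]     # oldest segment first
        self.active_size = 0     # bytes in the newest (active) segment
        self.logical_bytes = 0   # bytes handed in by producers
        self.physical_bytes = 0  # bytes actually written to storage

    def append(self, record: bytes) -> None:
        # Roll to a new segment when the active one would overflow.
        if self.active_size > 0 and self.active_size + len(record) > self.segment_bytes:
            self.segments.append([])
            self.active_size = 0
        self.segments[-1].append(record)
        self.active_size += len(record)
        self.logical_bytes += len(record)
        self.physical_bytes += len(record)  # written once, never rewritten

    def expire_oldest(self) -> None:
        # Retention drops a whole closed segment; nothing is copied or rewritten.
        if len(self.segments) > 1:
            self.segments.pop(0)

    def scan(self):
        # The only read path: every retained record, in write order.
        for segment in self.segments:
            yield from segment


log = AppendOnlyLog(segment_bytes=300)
for _ in range(10):
    log.append(b"x" * 100)  # ten 100-byte records
log.expire_oldest()

alpha_write = log.physical_bytes / log.logical_bytes
bytes_read = sum(len(r) for r in log.scan())
bytes_retained = sum(len(r) for seg in log.segments for r in seg)
alpha_read = bytes_read / bytes_retained
```

Because retention never rewrites surviving data and a scan touches each retained byte exactly once, both ratios come out to 1.0 in this model.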
Routers need a different access pattern on the underlying log. Instead of one massive log, routers want to store hundreds of thousands or even millions of individual logs.
Since Kafka partitions are isolated from one another, it’s not feasible to store each routing log in its own partition. This means that to find a specific message by key in a funnel like Kafka, your read amplification is the size of the entire partition divided by the size of the record:
$$
\alpha_{\text{read}} = \frac{|\text{partition}|}{|\text{record}|}
$$
This is nearly worst case read amplification, and makes it impractical to use Kafka as a router.
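To put numbers on that ratio, here is a small worked example in Python. The workload is made up for illustration (1,000 keys interleaved in one partition, 10 records per key, 100-byte payloads), and `scan_for_key` is a hypothetical helper, not a Kafka API. It models the only read path a funnel offers: scan everything and filter.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, bytes]  # (key, payload)


def scan_for_key(partition: Iterable[Record], wanted_key: str) -> Tuple[List[bytes], int]:
    """Scan a whole partition in write order, keeping only one key's payloads.

    Returns the matching payloads and the total number of payload bytes scanned.
    """
    matches: List[bytes] = []
    scanned = 0
    for key, payload in partition:
        scanned += len(payload)  # every record is read, wanted or not
        if key == wanted_key:
            matches.append(payload)
    return matches, scanned


# Hypothetical workload: 10 rounds, each writing one record for 1,000 keys.
partition = [(f"user-{k}", b"x" * 100) for _ in range(10) for k in range(1000)]

matches, scanned = scan_for_key(partition, "user-42")

# Read amplification when the reader wants one key's full log.
alpha_read_log = scanned / sum(len(m) for m in matches)

# Read amplification when the reader wants a single record.
alpha_read_record = scanned / len(matches[0])
```

In this setup the scan reads 1,000,000 bytes to return 1,000 bytes for one key, so the amplification is 1,000x for a whole per-key log and 10,000x for a single record, and it grows linearly with the number of keys sharing the partition.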
The right primitive for Routing
Today we’re announcing OpenData Log, a superior mechanism for the Routing log use cases (see OpenData Buffer for a better alternative to Kafka for funneling use cases).
Log is an MIT-Licensed, object-native, key-oriented log built on SlateDB. Breaking that down, in reverse:
Key-Oriented: Unlike Kafka, Log organizes data by key instead of topic-partition. You’re encouraged to have hundreds of thousands to millions of keys on a single log node.
Object-Native: The only requirement for durability is Object Storage. This makes it strongly consistent and absurdly durable.
MIT-Licensed: It’s open source and made for hosting yourself.
The first point alone makes Log uniquely different...
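To make the key-oriented idea concrete, here is a toy in-memory model in Python. It is only a sketch of the access pattern: `KeyedLogs`, `append`, and `scan` are hypothetical names invented for this example, not OpenData Log's API, and a dictionary of lists says nothing about how Log actually lays data out on Object Storage.

```python
from collections import defaultdict
from typing import Dict, List


class KeyedLogs:
    """Toy model: one independent, ordered log per key."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[bytes]] = defaultdict(list)

    def append(self, key: str, payload: bytes) -> int:
        """Append to the key's own log and return its per-key sequence number."""
        log = self._logs[key]
        log.append(payload)
        return len(log) - 1

    def scan(self, key: str, start_seq: int = 0) -> List[bytes]:
        """Read one key's log in order, touching no other key's records."""
        # .get() avoids creating an empty log for a key that was never written.
        return list(self._logs.get(key, [])[start_seq:])


logs = KeyedLogs()
logs.append("user-1", b"hello")
logs.append("user-2", b"unrelated")
seq = logs.append("user-1", b"world")

user1 = logs.scan("user-1")
tail = logs.scan("user-1", start_seq=1)
missing = logs.scan("user-404")
```

A reader interested in `user-1` reads only `user-1`'s records, so in this model the read amplification for a per-key scan stays near 1x no matter how many other keys exist.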