Introducing Dimster, a performance benchmarking tool for Apache Kafka — Jack Vanlightly
Dimster = DIMensional teSTER for Apache Kafka<br>On GitHub: https://github.com/dimster-hq/dimster
Most of my career in distributed systems has been as a tester, performance engineer and formal verification specialist. I’ve written performance benchmarking tools in the past, for RabbitMQ and Apache Pulsar but in recent years I’ve used OpenMessagingBenchmark (OMB) to run benchmarks against Apache Kafka and other messaging systems. But OMB is hard to deploy and has several limitations compared to more sophisticated benchmarking systems I’ve developed in the past. With Claude becoming so much better since Christmas I decided to write a Kafka-centric performance benchmarking tool, with a lot of inspiration from OMB. I took the bits I like about OMB and the things I like about the tooling I’ve built in the past, to make a performance testing tool for testing Apache Kafka.<br>In this post I’ll introduce some aspects of Dimster that are core to its design:<br>Dimensional testing
Shareable, self-contained results with reproducibility in mind
Test modes
Benchmark prep and post-processing
Kubernetes as a standardized runtime
1. Dimster (DIMensional teSTER) is born<br>A benchmarking and stress testing technique I’ve used for years is something I have called “Dimensional Testing”. We can think of all the configs and workload aspects as forming N-dimensional space. Within that space we can explore the impact of points in that space along a single dimension, or even co-varying dimensions. Take a config or an aspect of a workload as a dimension, and run a series of identical benchmarks where a set of points along that dimension are explored (while everything else remains the same). The dimension could be a client config, such as batch.size or acks. It could be an aspect of the workload such as number of consumers, type of consumer, number of consumer groups, the partition count, the produce rate and so on.<br>There are hundreds of dimensions to explore, which requires some patience and care lest you become overwhelmed. The below depicts just three dimensions, and a set of three scenarios which test performance along one or two dimensions at a time.
View fullsize
Fig 1. Three examples of varying or co-varying an aspect of a workload as dimensions
Each of the above 16 test points (across 3 scenarios) is a separate benchmark, with a fresh topic, warm-up time, recorded time, and cooldown time etc.<br>The generated charts for throughput and various latencies are repeated for each of the three scenarios, with each test point within a scenario plotted as a series/bar on those charts. This makes it easy to compare the performance results of varying the values of a single dimension (or co-varying values across multiple dimensions).
View fullsize
Fig 2. Each scenario maps to a set of charts, with the test points as data series.
Example: Consumer groups vs share groups<br>With share groups being relatively new, I could compare the performance of regular consumers against share group consumers, with identical benchmarks where the dimension explored is consumer type (CONSUMER_GROUP|SHARE_GROUP).<br>The following test has as the base workload of ten topics with each topic having 6 partitions, 6 consumers and 4 producers. Each scenario changes the producer rate, and compares consumer groups to share groups. Record keys are used, so batch sizes will be small, which is a tougher workload than a no-key test which typically results in larger batches.
The charts below show the results for an EKS deployment with Kafka deployed on 3x m6i.2xlarge with 300 MB/s provisioned gp3.<br>50 MB/s<br>At 50 MB/s we see that p99 end-to-end latency is stable, with roughly 15 ms overhead for share groups.
View fullsize
View fullsize
200 MB/s<br>At 200 MB/s, p99 end-to-end exhibits peaks in a periodic fashion.
View fullsize
Example: 2, 4 and 8 CPUs<br>Dimster uses environments. The sizing of a test is determined by which environment is used. I ran some share group consumer scaling tests, with full mTLS, on Kafka clusters assigned 2, 4, and 8 CPUs. These are the equivalent of vCPUs, as my Threadripper has SMT (hyperthreading) enabled.<br>2-CPU environment on my Threadripper:
I ran the following workload with the above environment, with the CPU requests/limit of 2, 4 and 8.
Then I used the dimster compare command to generate comparison charts based on the JSON result files of each run. Each chart compares each test point side-by-side.<br>10k msg/s - 1000 consumers (6th test point in 1st scenario)<br>We see that 2 CPUs fare a lot worse than 4 and 8 CPUs.
View fullsize
100k msg/s, 250 consumers (4th test point, 3rd scenario)<br>The 2 CPU cluster simply can’t keep up with 100k msg/s and 250 consumers.
If we unselect 2-CPU, we see that 4-CPU and 8-CPU was ok.
View fullsize
Dimster charts are interactive. Series can be toggled, time and percentile ranges can be selected.<br>2. Results as shareable...