How I'm building a better way to understand and store legacy logs

Mihir Deshpande


May 15, 2026

A deep dive into the ideation and the architecture of Logslim

In my first semester at Purdue, I took CS540 Database Systems to study database systems in depth. At its core, database design is about organizing data around a system's access patterns, whether that means optimizing for reads or writes, transactional workloads (OLTP), analytical workloads (OLAP), point lookups, scans, or some hybrid of these. That got me thinking about the data that almost every distributed system produces continuously: application logs. Logs sit awkwardly across every category. They are append-heavy like event streams, semi-structured like documents, and queried both analytically, like OLAP data, and with text search.

Another problem that got my attention is the rising cost of log storage. In a microservice architecture, log storage quietly eats the budget. You look at the bill, then look at what you're storing, and realize most of it is the same 15 patterns firing millions of times. The timestamp changes. The user ID changes. The rest of the line is identical.

The motivation to optimize for both access and storage cost led me to build Logslim. Logslim is a log ingestion tool that runs the Drain log-parsing algorithm, separates repeating log templates from their variable parameters, and stores everything in compressed Parquet via DuckDB. Logslim turns logs from immutable text streams into a structured event store with reconstructable templates and queryable parameters. You can detect anomalies, understand system behavior, spot noisy log patterns, and save on storage cost, all from the templates alone.

Compression runs at 76–81% on real workloads, with every line reconstructable, byte-for-byte. For teams paying significant Elasticsearch or S3 bills for log storage, roughly 80% compression with full queryability and zero-loss reconstruction is a meaningful win. The Drain algorithm does the heavy lifting; everything else is plumbing to make it usable. This post walks through how it all works.
Core idea

Take a typical application log:

2024-01-15T10:23:45Z DEBUG 1234 DB SELECT table=sessions 5 rows 12ms
2024-01-15T10:23:46Z DEBUG 5678 DB SELECT table=sessions 2 rows 8ms
2024-01-15T10:23:47Z DEBUG 9999 DB SELECT table=sessions 14 rows 31ms

Three seemingly unique strings, but there's really just one template and three sets of parameters:

Template: "{ts} DEBUG {num} DB SELECT table=sessions {rows} {duration}"

Parameters:

["2024-01-15T10:23:45Z", "1234", "5 rows", "12ms"]
["2024-01-15T10:23:46Z", "5678", "2 rows", "8ms"]
["2024-01-15T10:23:47Z", "9999", "14 rows", "31ms"]

Extract the template, store it once, and store the parameters separately. DuckDB's columnar zstd compression takes care of the rest.

But how do you extract templates from the logs? One way is to use regex, which is what I started with. I asked Claude to write a regex to extract log levels, messages, and timestamps. But this quickly got out of hand, since logs are largely unstructured. Regex was a good starting point for the MVP, but writing a one-size-fits-all regex is practically impossible.
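To make the round trip concrete, here is a minimal Python sketch of storing one template plus per-line parameters and rebuilding the original lines from them. It is an illustration under assumptions, not Logslim's actual code: the `reconstruct` helper, the in-memory `templates` and `events` structures, and the placeholder pattern are all hypothetical stand-ins for whatever Logslim keeps in Parquet.

```python
import re

# Assumption: placeholders are lowercase names in braces, as in the example above.
PLACEHOLDER = re.compile(r"\{[a-z]+\}")


def reconstruct(template, params):
    """Fill the placeholders left to right with the stored parameters."""
    expected = len(PLACEHOLDER.findall(template))
    if expected != len(params):
        raise ValueError(f"template has {expected} slots, got {len(params)} parameters")
    values = iter(params)
    # A function replacement is inserted literally, so parameters are never
    # reinterpreted as regex escape sequences.
    return PLACEHOLDER.sub(lambda _match: next(values), template)


# The template is stored once, keyed by a hypothetical template id.
templates = {
    1: "{ts} DEBUG {num} DB SELECT table=sessions {rows} {duration}",
}

# Each log line shrinks to (template id, parameter list).
events = [
    (1, ["2024-01-15T10:23:45Z", "1234", "5 rows", "12ms"]),
    (1, ["2024-01-15T10:23:46Z", "5678", "2 rows", "8ms"]),
    (1, ["2024-01-15T10:23:47Z", "9999", "14 rows", "31ms"]),
]

lines = [reconstruct(templates[tid], params) for tid, params in events]
for line in lines:
    print(line)
```

One caveat of this simplified version: a log line that literally contains text such as `{ts}` would collide with the placeholder syntax, so a real implementation needs an escaping scheme or positional slots.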

Drain Algorithm

Drain is a streaming log parser that groups log lines into clusters without requiring any regex configuration. The workflow is divided into two major steps.

Step 1: Tokenize

Every incoming line gets split into tokens. LogTokenizer handles this with a whitespace split, but the tokenizer is also aware of multi-word structures like quoted strings and certain compound values. Once you have tokens, TokenClassifier labels each one:

Token                                   Type
2024-01-15T10:23:45Z                    TIMESTAMP
12ms                                    NUM (numeric-ish)
550e8400-e29b-41d4-a716-446655440000    UUID
d41d8cd98f00b204e9800998ecf8427e        HASH
192.168.1.1                             IP
SELECT                                  WORD (static)

The WORD tokens are potentially static. TokenClassifier has regexes for obvious variables, such as timestamps, UUIDs, IPs, and hashes. These get replaced with placeholders like {ts} or {uuid} upfront. This is called pre-masking. Application-specific tokens like req-755556 don't get this treatment. They have to go through the Drain tree before they are marked as wildcards.

Step 2: Drain Tree

Every incoming log line finds its place in the Drain tree using two keys: how many tokens it has, and what its first non-wildcard word is.

Root
├── length=6
│   ├── "User" → [cluster: "User {num} logged in from {ip}"]
│   └── "DB" → [cluster: "DB {num} SELECT table={word}"]
└── length=8
    └── "Cache" → [cluster: "Cache {num} hit for key {uuid}"]

When a new line arrives, it walks this tree to find the closest existing cluster. Closeness is measured by how many static tokens match. A line that agrees on 4 out of 6 tokens scores higher than one that agrees on 2. If the best match scores above sim-threshold (default 50%), the line merges into that cluster. Any position where the line disagrees with the cluster becomes a...
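The two steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Logslim's implementation: the regex patterns in `MASKS`, the `{*}` wildcard name, the `DrainSketch` class, and the choice to divide matching static tokens by the total token count are all hypothetical. It also uses a plain whitespace split, so it does not model LogTokenizer's handling of quoted strings or compound values.

```python
import re

# Hypothetical pre-masking rules, checked in order; the post does not show
# TokenClassifier's real patterns.
MASKS = [
    ("{ts}", re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")),
    ("{uuid}", re.compile(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")),
    ("{ip}", re.compile(r"\d{1,3}(\.\d{1,3}){3}")),
    ("{hash}", re.compile(r"[0-9a-f]{32}")),
    ("{num}", re.compile(r"\d+(ms|s)?")),
]
WILDCARD = "{*}"  # assumed name for positions that vary within a cluster


def premask(token):
    """Step 1: replace obvious variables with a typed placeholder."""
    for placeholder, pattern in MASKS:
        if pattern.fullmatch(token):
            return placeholder
    return token


def is_placeholder(token):
    return token.startswith("{") and token.endswith("}")


def similarity(template, tokens):
    """Fraction of positions where a static token matches exactly."""
    same = sum(1 for a, b in zip(template, tokens) if a == b and not is_placeholder(a))
    return same / len(tokens)


class DrainSketch:
    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        # Step 2: (token count, first static token) -> list of cluster templates
        self.tree = {}

    def add(self, line):
        """Route a line to its cluster and return that cluster's template."""
        tokens = [premask(t) for t in line.split()]
        if not tokens:
            return ""
        first_static = next((t for t in tokens if not is_placeholder(t)), "")
        clusters = self.tree.setdefault((len(tokens), first_static), [])
        best = max(clusters, key=lambda c: similarity(c, tokens), default=None)
        if best is not None and similarity(best, tokens) > self.sim_threshold:
            # Merge: every disagreeing position is generalized to a wildcard.
            for i, (a, b) in enumerate(zip(best, tokens)):
                if a != b:
                    best[i] = WILDCARD
            return " ".join(best)
        clusters.append(tokens)
        return " ".join(tokens)


drain = DrainSketch()
print(drain.add("Request req-755556 completed in 12ms"))
print(drain.add("Request req-100001 completed in 8ms"))
```

In this sketch the first line creates a new cluster with `req-755556` still treated as static. The second line agrees on 3 of 5 static tokens, clears the 50% threshold, and merges, which turns the request ID position into a wildcard. That mirrors how application-specific tokens only become variables after passing through the tree.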
