The Rise of Single-Node Processing: Challenging the Distributed-First Mindset


Practical Data Engineering


Data Landscape Trends: 2024-2025 Series

Alireza Sadeghi<br>Jan 06, 2025


This is part two of the Data Landscape Trends 2024-2025 series, focusing on single-node processing trends.<br>Introduction

2024 witnessed growing interest in single-node processing frameworks, with tools like DuckDB, Apache DataFusion, and Polars gaining unprecedented popularity in the data community.<br>This trend represents more than a technological advancement: it marks a fundamental reassessment of how we approach data analytics.<br>As we move away from the "big data" era's distributed-first mindset, many businesses are discovering that single-node processing often provides a more efficient, cost-effective, and manageable approach to their analytical needs when their data is not actually that large.<br>When I recently published a short LinkedIn post titled "Why Single-Node Engines Are Gaining Ground in Data Processing", I didn’t anticipate the attention it would receive from the LinkedIn data community. The response underscored the industry’s growing interest in the topic.

In this article, I explore the subject in greater detail.<br>Rethinking Big Data

The past decade saw many businesses scrambling to implement big data strategies, investing heavily in distributed processing frameworks like Hadoop and Spark.<br>However, recent analyses reveal a surprising truth: most companies don't actually have "big data".<br>A significant majority of companies do not need large data platforms to meet their data analytics needs. Swayed by marketing hype, they often make substantial investments in platforms that may not resolve their actual data challenges.<br>Jordan Tigani, a founding engineer on Google BigQuery, analysed usage patterns and found that the median data storage size among heavy BigQuery users is less than 100 GB.<br>Even more revealing, a published analysis of half a billion queries run on Amazon Redshift showed that:<br>Over 99% of queries processed less than 10 TB of data.

Over 90% of sessions processed less than 1 TB.

The paper also states that:<br>Most tables have less than a million rows and the vast majority (98 %) has less than a billion rows. Much of this data is small enough such that it can be cached or replicated.

If we set the threshold for big data processing at 1 TB, this analysis shows that over 90% of queries fall below it.<br>As a result, single-node processing engines can potentially handle workloads that previously had to be spread across multiple machines by distributed systems like Spark, Trino, or Amazon Athena.

This reality challenges the common notion that big data infrastructure is a necessity for all modern businesses.
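To make the threshold argument concrete, here is a minimal Python sketch of how one might measure the same thing on their own platform: the share of queries that scan less than a chosen size. The function name `share_below` and the sample scan sizes are hypothetical and purely illustrative; they are not taken from the Redshift paper or from BigQuery.

```python
GB = 10**9
TB = 10**12


def share_below(scanned_bytes, threshold_bytes):
    """Return the fraction of queries that scanned less than threshold_bytes."""
    if not scanned_bytes:
        raise ValueError("need at least one measurement")
    below = sum(1 for size in scanned_bytes if size < threshold_bytes)
    return below / len(scanned_bytes)


# Hypothetical per-query scan sizes, for illustration only.
# In practice these would come from your warehouse's query history.
sample_scans = [5 * GB, 40 * GB, 250 * GB, 800 * GB, 3 * TB]

fraction = share_below(sample_scans, 1 * TB)
print(f"{fraction:.0%} of sample queries scanned less than 1 TB")
```

Running the same calculation over real query logs is a quick way to check whether a workload actually crosses the line where distributed processing starts to pay off.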


Workload Patterns & Rapid Data Aging

The case for single-node processing becomes even more compelling when we examine how organisations actually use their data.<br>Two key patterns emerge: the data aging effect and the 90/10 rule of analytical workloads.<br>The Data Aging Effect

As data ages, access frequency declines sharply. For the majority of companies, data access patterns follow a predictable lifecycle:<br>Hot data (0-48 hours): Accessed primarily by ETL pipelines.

Warm data (2-30 days): Accounts for most analytical queries.

Cold data (30+ days): Rarely accessed but often retained for compliance or historical analysis.
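The lifecycle above can be sketched as a simple age-based classifier. This is a minimal illustration, not a description of any particular platform: the function name `classify_tier` is hypothetical, and treating exactly 48 hours as hot and exactly 30 days as warm is an assumption, since the ranges above do not specify how the boundaries are handled.

```python
from datetime import datetime, timedelta, timezone

# Tier limits taken from the lifecycle described above.
HOT_LIMIT = timedelta(hours=48)
WARM_LIMIT = timedelta(days=30)


def classify_tier(created_at, now):
    """Classify data as 'hot', 'warm', or 'cold' based on its age."""
    age = now - created_at
    if age < timedelta(0):
        raise ValueError("created_at is in the future")
    if age <= HOT_LIMIT:  # assumption: the 48-hour boundary counts as hot
        return "hot"
    if age <= WARM_LIMIT:  # assumption: the 30-day boundary counts as warm
        return "warm"
    return "cold"


now = datetime(2025, 1, 6, tzinfo=timezone.utc)
for age in (timedelta(hours=6), timedelta(days=10), timedelta(days=90)):
    print(age, "->", classify_tier(now - age, now))
```

A classifier like this could drive partition pruning or storage tiering, so that queries over warm data never touch the much larger cold archive.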

A study of Meta and eBay's data access patterns revealed this sharp decline in access after the first few days, with data typically becoming cold after a month.<br>In our analysis of a petabyte-scale data lake, we found that raw data remains hot for only 48 hours, with 95% of access occurring in that window, mainly by downstream ETL pipelines. In the Analytics (Gold) zone, the hot period lasts about 7 days, and 95% of queries are executed within 30 days of the data's arrival.<br>The 90/10 Rule for Analytical Workloads

This aging effect leads to the 90/10 rule in analytical workloads:<br>If the combined hot and warm period is 30 days and accounts for 90% of workloads, then, with a one-year retention period, roughly 90% of workloads access less than 10% of the data, since 30 days out of 365 is about 8%.
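The arithmetic behind the rule is easy to verify. The sketch below assumes data arrives at a roughly constant daily volume, so the share of data is proportional to the share of days; that assumption and the function name `active_data_fraction` are mine, introduced only for illustration.

```python
def active_data_fraction(active_days, retention_days):
    """Fraction of retained data that falls inside the hot and warm window.

    Assumes a constant daily ingestion volume, so data share equals day share.
    """
    if retention_days <= 0 or not 0 <= active_days <= retention_days:
        raise ValueError("expected 0 <= active_days <= retention_days, retention_days > 0")
    return active_days / retention_days


fraction = active_data_fraction(active_days=30, retention_days=365)
print(f"Hot and warm data: {fraction:.1%} of retained data")  # 8.2%
```

With longer retention periods the fraction shrinks further: keeping the same 30-day window against three years of history would put the active share below 3%.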

This pattern remains remarkably consistent across industries and use cases. Even in organisations with large datasets, most analytical workloads operate on recent, aggregated data that could easily fit within the capabilities of a single node.<br>Hardware Evolution & Rethinking Scale Up

The capability of single-node systems has grown exponentially since the early days of big data.<br>The rationale and motivation behind the scale-out strategy which...
