LLM-style scaling laws hold for sensor data

Brandon Ballinger · Jun 30, 2026

Much of the magic of LLMs comes from the fact that loss scales predictably with model size, dataset size, and the amount of compute used for training. It’s easy to take scaling laws for granted, but they were only published in 2020, and their structure underlies both the economics of AI (if not for scaling laws, frontier labs couldn’t invest nine figures in a training run) and AI’s emergent capabilities (the Philip Anderson quote, “more is different”, comes to mind).

Do similar scaling laws apply to non-language foundation models, such as wearable foundation models? It turns out they do. Is the form exactly the same? It is not, which leads to some interesting questions.

First, let me describe an example of a non-LLM scaling law.

A non-LLM scaling law

Google’s Scaling Wearable Foundation Models was, to my knowledge, the first paper to establish a scaling law for physiological sensor data from wearables:

Scaling performance of a wearable foundation model as a function of data size & model size. Source: Scaling wearable foundation models.

Validation loss $L$ scaled as:

$$L = aC^{b} + c$$

where $C$ is compute, $b$ is the power-law exponent (negative, since loss falls as compute grows), and $c$ is an irreducible floor (more on that later). Across multiple orders of magnitude, loss falls along a nearly straight line on the log-log plot before bending toward the floor $c$ (the same shape holds when you vary data hours or parameters instead of compute).

LSM (the paper’s Large Sensor Model) was tested at four model sizes (2M, 7M, 110M, and 328M parameters) against data from a few thousand hours up to 40 million hours. Bigger models and more data both helped on every generative task they measured: random imputation, temporal interpolation, sensor imputation, and forecasting. The payoff on downstream, post-trained tasks was good too. Fine-tuned LSM improved interpolation and forecasting by 16-23% over baselines and lifted activity recognition by 29%.
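To make the functional form concrete, here is a minimal sketch of how one might fit L = aC^b + c to a handful of (compute, loss) points. It is not the paper’s fitting procedure: the function name `fit_saturating_power_law`, the grid search over the floor, and the data points are all assumptions for illustration, and the numbers are synthetic rather than values from the LSM paper. The trick is that for any fixed floor c, log(L − c) is linear in log C, so a and b fall out of ordinary least squares.

```python
import math

def fit_saturating_power_law(compute, loss, grid_steps=2000):
    """Fit loss ~ a * compute**b + c by scanning candidate floors c.

    For a fixed c, log(loss - c) = log(a) + b * log(compute) is a straight
    line, so a and b come from least squares in log-log space. The c with
    the smallest squared error in the original loss space wins.
    """
    xs = [math.log(x) for x in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    min_loss = min(loss)
    best = None
    for i in range(grid_steps):
        c = min_loss * i / grid_steps  # candidate floor, always below min(loss)
        ys = [math.log(l - c) for l in loss]
        mean_y = sum(ys) / n
        b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        a = math.exp(mean_y - b * mean_x)
        sse = sum((a * x ** b + c - l) ** 2 for x, l in zip(compute, loss))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c

# Synthetic example (hypothetical constants, not from the paper).
true_a, true_b, true_c = 2.0, -0.3, 0.2
compute = [10.0 ** k for k in range(8)]
loss = [true_a * x ** true_b + true_c for x in compute]

a, b, c = fit_saturating_power_law(compute, loss)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}")
```

A grid search is crude but keeps the sketch dependency-free; with real, noisy measurements a nonlinear least-squares routine would be the more usual choice.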

Non-LLM scaling laws are similar, but not identical to LLM scaling laws

LLM scaling laws were first established in Kaplan et al. (2020), and then refined in the 2022 Chinchilla paper. The Chinchilla result is that, for a fixed compute budget, you should scale parameters and tokens together, at about 20 tokens per parameter. Chinchilla was a 70B model trained on 1.4 trillion tokens, and it beat models several times its size that had been starved of data.

The Chinchilla LLM scaling laws are expressed as:

$$L(N, P) = L_\infty + a \cdot N^{-b} + c \cdot P^{-d}$$

Here, $L(N, P)$ is the validation loss; $L_\infty$ represents the irreducible loss floor; and $a$, $b$, $c$, and $d$ are fitted constants (exponents and multipliers). Note that $c$ here is a multiplier, not the floor it denotes in the wearable scaling law. One widely cited finding is that, under a fixed compute budget, optimal results are achieved by scaling data and model size together: specifically, the compute-optimal regime is where the number of training tokens $N$ is proportional to the number of parameters $P$ (in practice, about 20 tokens per parameter).
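The 20-tokens-per-parameter rule can be turned into quick arithmetic. The sketch below uses the article’s notation (N for tokens, P for parameters) and leans on one assumption that is not in the text above: the common rule of thumb that training compute is roughly 6 · P · N FLOPs. The function name `compute_optimal` is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6  # assumed rule of thumb: training FLOPs ~ 6 * P * N
TOKENS_PER_PARAM = 20      # ratio quoted in the text

def compute_optimal(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into parameters P and tokens N with N = ratio * P.

    Substituting N = ratio * P into budget = 6 * P * N gives
    P = sqrt(budget / (6 * ratio)).
    """
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Sanity check against the Chinchilla figures quoted above.
chinchilla_params = 70e9
chinchilla_tokens = 1.4e12
ratio = chinchilla_tokens / chinchilla_params
budget = FLOPS_PER_PARAM_TOKEN * chinchilla_params * chinchilla_tokens

params, tokens = compute_optimal(budget)
print(f"tokens per parameter: {ratio:.1f}")
print(f"implied budget: {budget:.2e} FLOPs")
print(f"recovered P: {params:.2e}, N: {tokens:.2e}")
```

Feeding the implied budget back through `compute_optimal` recovers the original 70B and 1.4T figures, which is just a consistency check on the two assumptions, not a reproduction of the paper’s fit.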

One major difference is that LSM’s gains flattened out around 10 million hours of data and roughly 100 million parameters. LLMs have shown no such ceiling at consumer scale. Chinchilla used 1.4 trillion tokens and frontier models have gone well past that, with no flattening yet. (Both scaling laws have an irreducible error term, so this isn’t a difference in functional form but rather an empirical result.)
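One way to see why the flattening is an empirical result rather than a different functional form is to work through the single-term wearable law. This is a sketch under the assumptions a > 0, c > 0, and b < 0, written in terms of compute C; the same algebra applies if C stands for data hours or parameters.

```latex
\begin{aligned}
L(C) &= aC^{b} + c \\
L(C) - L(10C) &= aC^{b}\left(1 - 10^{b}\right) \\
\frac{L(C) - L(10C)}{L(C)} &= \frac{aC^{b}\left(1 - 10^{b}\right)}{aC^{b} + c}
\;\longrightarrow\; 0 \quad \text{as } aC^{b} \ll c
\end{aligned}
```

Each tenfold increase in scale removes the same fraction of the reducible term, but once that term is small next to the floor, the curve looks flat. Where that happens depends on the fitted a, b, and c, not on the shape of the law.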

That’s a potentially interesting opportunity for startups. We trained a JEPA-style wearable foundation model, JETS, on the same order of magnitude of data as Google and Apple with a four-person team. So whereas starting another LLM foundation model company requires billions of dollars of investment, non-LLM domains might actually be open for smaller startups.

Some open questions I have

While the power laws rhyme, many of the underlying details are pretty different:

| Dimension | LLM scaling | Wearable sensor scaling |
| --- | --- | --- |
| Unit of data | Tokens (discrete vocabulary) | Hours of continuous, multi-channel signal |
| Pretraining objective | Next-token prediction | Masked reconstruction (80% of patches hidden, MSE loss) |
| Loss | Cross-entropy / perplexity | Mean squared error on held-out patches |
| Saturation | None yet at trillions of tokens | Flattens near 10⁷ hours and ~10⁸ parameters |
| Compute-optimal recipe | ~20 tokens per parameter (Chinchilla) | Scale data and model together; total hours dominate |
| Data supply | Finite; the public text pool is being exhausted | Renewable; billions of devices generate hours continuously |
| Economics | Oligopoly with $1B+ entry cost | Capital light? |

This leads to several interesting questions:

Data wall. LLMs are running into a data wall, where the stock of high-quality public text is close to spent and synthetic data is an uneasy substitute. As Ilya Sutskever put it in his talk on the end of pretraining, “we have but one...
