A systematic way to think about data quality

On Data Quality - The Fundamentals - by Abraham Thomas

Pivotal


On Data Quality

A systematic way to think about data quality.

Abraham Thomas Jun 27, 2026

This is the first of two essays on data quality. Today’s essay is about the basics: what is data quality, and how should we think about it? The second essay, publishing next week, is about the fun stuff: data quality in an AI world.

Introduction

Data quality. We love it, we want it, we praise it, we aspire to it. Even in these benighted and degenerate times, if there’s one belief that unites all sensible individuals, it is the belief that data quality is a Good Thing.

It’s a pity, then, that nobody seems to know what data quality is. Ask six practitioners to define data quality and you’ll get six different answers. In fact it’s worse than that: give the same data to six practitioners, and you’ll get six different evaluations of its quality. Data is the elephant and we are the blind men of Hindustan.

Fortunately, Pivotal is here to save the day. Today we shall learn all about data quality. Read on!

Standards Are Poor

Let’s start with the “standard” definitions of data quality. They are, unfortunately, not very helpful.

ISO 8000 defines quality data as data that meets its stated requirements. This is one of those tautological statements that is perfectly accurate and completely useless.

ISO 25012 defines data quality using 15 attributes, including all the usual suspects: accuracy, completeness, consistency and so on. This too is correct, but incomplete.

I take a somewhat different approach.

A Modest Assertion

I begin with an assertion: data has no innate quality. Quality is a purely emergent phenomenon, conditional entirely on use case.

Readers of How to Price a Data Asset will recognize this line of thinking. In that essay, I argued that data has no intrinsic value; instead, the value of data is the value of what can be done with it.

Data quality is that which increases data value. Since data value is a function of usage, so too is data quality. Data quality can only be assessed with reference to what can be done with the data. We care about data quality precisely because it allows us to do more; do better, faster, cheaper; or just do differently with our data.

This is still a bit abstract and hand-wavy. We’re going to make it more concrete.

Levels of the Game

Our first insight is this: data quality comes in levels. These levels are not separate or mutually exclusive; they exist simultaneously; and much of the noise around data quality stems from level confusion.

These levels are ordered and dependent. Ordered: data quality can pertain to an individual record, to a data corpus, to an application, or to a business outcome. And dependent: each level requires the ones below and above, for coherence and usability.

I’ll explain all these terms in a bit, but first, let’s examine the levels and what they cover.

Granular Quality

The first level of data quality is granular or unit-level quality. Think of an individual “unit” of data – a single database record, or sentence, or question-answer pair, or labeled example. You can test this granular unit for accuracy, precision, recency, well-formed-ness, internal consistency, plausibility, provenance, interpretability, confidence, and more[1]. This is what many data quality evaluators do, and where they stop; it’s the realm of ISO 25012, of observability and monitoring.

Two facts jump out. First, all these quality attributes exist at the level of individual units of data. You don’t need to inspect other records to know if a given record is accurate, precise, recent and so on. This is why we call this granular quality. Each unit stands alone.
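As a rough illustration of "each unit stands alone", here is a minimal sketch of unit-level checks. The `Record` fields, the specific rules, and the `max_age_days` threshold are all hypothetical and chosen only for the example; the point is that the function receives one record and no corpus.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    # Hypothetical unit of data: one booked revenue line.
    customer_id: str
    amount: float
    currency: str
    booked_on: date


def unit_level_issues(record: Record, as_of: date, max_age_days: int = 365) -> list[str]:
    """Return the granular quality issues found in a single record.

    Every check is answerable from the unit itself, plus a reference
    date supplied by whoever is doing the evaluating.
    """
    issues = []
    if not record.customer_id.strip():
        issues.append("well-formedness: missing customer_id")
    if len(record.currency) != 3 or not record.currency.isupper():
        issues.append("well-formedness: currency is not a 3-letter uppercase code")
    if record.amount < 0:
        issues.append("plausibility: negative amount")
    if record.booked_on > as_of:
        issues.append("plausibility: booked after the as_of date")
    elif (as_of - record.booked_on).days > max_age_days:
        issues.append("recency: older than max_age_days")
    return issues


good = Record("C-001", 1200.0, "USD", date(2026, 6, 1))
bad = Record("", -5.0, "usd", date(2027, 1, 1))

print(unit_level_issues(good, as_of=date(2026, 6, 27)))  # []
print(unit_level_issues(bad, as_of=date(2026, 6, 27)))   # four issues
```

Note that even this toy version needs an `as_of` date and an age threshold from the caller: the checks are local to the record, but the pass/fail criteria are not.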

Second, all these attributes are downstream of clear usage/value questions: is the data true, is it usable, is it current, and is it coherent? And the questions themselves are conditional. True, in what context? Current, relative to what? Usable, how?

Example: Revenue

Consider the most basic of financial data, revenue. Imagine you’re a CFO, or perhaps a founder hoping to one day be able to afford a CFO. It’s all too easy to book the wrong revenue number[2] – to misread contract terms, renewals, discounts, one-off versus recurring, and so on. You need to be extremely careful to ensure granular data quality for this field.

But even if you’re careful and capture revenue perfectly: what number should you use? Say you’re a marketplace. Some marketplaces report net, others report gross[3]. Which is correct? Well, it depends. Are you an active, value-adding seller; did you set the price; are you on the hook for the service? Or are you just a matchmaking middleperson? Reasonable minds – and auditors – can differ on that question, and by extension, on their evaluation of data that happens to tilt one way or the other. So much for innate data quality!
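The marketplace case can be made concrete with a small sketch. Everything here is an assumption for illustration: the `MarketplaceOrder` fields, the two basis labels, and the numbers. It is not an accounting rule, only a way to show that the same recorded figure passes or fails an accuracy check depending on the basis the evaluator has in mind.

```python
from dataclasses import dataclass


@dataclass
class MarketplaceOrder:
    # Hypothetical record: what the buyer paid and what was passed on to the seller.
    buyer_paid: float
    seller_payout: float


def reported_revenue(order: MarketplaceOrder, basis: str) -> float:
    """Revenue for one order under a chosen reporting basis.

    "gross": the marketplace treats itself as the seller and books the full price.
    "net":   the marketplace treats itself as a matchmaker and books only its cut.
    """
    if basis == "gross":
        return order.buyer_paid
    if basis == "net":
        return order.buyer_paid - order.seller_payout
    raise ValueError(f"unknown basis: {basis!r}")


def is_accurate(recorded: float, order: MarketplaceOrder, basis: str) -> bool:
    # "Accurate" is only defined once the evaluator picks a basis.
    return abs(recorded - reported_revenue(order, basis)) < 0.005


order = MarketplaceOrder(buyer_paid=100.0, seller_payout=85.0)
recorded_revenue = 100.0  # the number sitting in the revenue field

print(is_accurate(recorded_revenue, order, basis="gross"))  # True
print(is_accurate(recorded_revenue, order, basis="net"))    # False
```

The record never changes between the two calls; only the use case does, and the quality verdict flips with it.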

Aggregate Quality

The second level of data quality is aggregate or corpus-level quality. All your individual...
