How to Clean Time Series Data in Python

Bala Priya C

Real-world time series data is rarely clean. Sensors drop out, system clocks drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, it has passed through collection, transmission, and storage, and each step is a potential source of corruption.

Cleaning time series data is harder than cleaning tabular data because time is a structural constraint. You can't shuffle rows or impute a missing value with a column mean without pulling future data into a past observation. Every cleaning decision has to respect temporal ordering, or it breaks the integrity of everything built on top of it.
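To make the leakage point concrete, here is a minimal sketch on a toy series (the values and dates are made up for illustration, not taken from the dataset used later). It compares a column-mean fill, which depends on readings recorded after the gap, with a forward fill, which depends only on the past:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: a steady upward trend with one missing reading
index = pd.date_range("2024-01-01", periods=6, freq="D")
values = pd.Series([10.0, 11.0, np.nan, 13.0, 14.0, 15.0], index=index)

# Column-mean imputation: the fill value is computed from every known
# point, including 13, 14 and 15, which were recorded AFTER the gap
mean_filled = values.fillna(values.mean())

# Forward fill: the fill value is the last observation recorded
# BEFORE the gap, so temporal ordering is respected
ffilled = values.ffill()

gap = index[2]
print(f"Mean-imputed value:   {mean_filled[gap]:.1f}")  # 12.6
print(f"Forward-filled value: {ffilled[gap]:.1f}")      # 11.0
```

The mean-imputed value sits above anything observed up to that point in time, which is exactly the kind of look-ahead a model trained on this data could exploit without you noticing.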

This guide walks through the full cleaning pipeline in Python: from raw data arrival to a dataset ready for feature engineering or modelling. We'll cover missing value detection and imputation, outlier identification and treatment, duplicate handling, frequency alignment, noise smoothing, and schema validation, applied to sample sensor data throughout.

You can get the Colab notebook from GitHub and follow along.

Prerequisites

To follow along with this guide, you should be:

Comfortable working with Python and pandas DataFrames

Familiar with time-indexed data

Aware of what feature engineering and machine learning modelling involve at a high level

We'll use pandas and numpy for data manipulation, scipy for signal smoothing and statistical tests, scikit-learn for anomaly detection, and statsmodels for seasonal decomposition. Install them before running any code in this guide:

pip install pandas numpy scipy scikit-learn statsmodels

Table of Contents

How to Audit Your Time Series Before Cleaning It

How to Reindex to a Canonical Frequency

How to Handle Missing Values

Forward Fill — For Step-Function Signals

Time-Weighted Interpolation — For Continuous Signals

Seasonal Decomposition Imputation — For Long Gaps

How to Detect and Handle Outliers

Z-Score with Rolling Window

IQR-Based Outlier Detection

Isolation Forest — For Multivariate Outlier Detection

Outlier Treatment

How to Remove Duplicates

Frequency Alignment and Resampling

Smoothing Noise

Exponential Weighted Moving Average

Savitzky-Golay Filter

Schema and Sanity Validation

The Complete Cleaning Checklist

How to Audit Your Time Series Before Cleaning It

The first rule of data cleaning is: look before you cut. Before imputing, smoothing, or dropping anything, you need a complete picture of what's wrong and where.

A good audit covers the following:

The time index: Is it regular? Are there gaps?

Missing value distribution: Are missing values random or clustered?

Value range: Are there implausible values that point to outliers or sensor failures?

Duplicate timestamps
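The index-related checks in this list (regularity, gaps, duplicate timestamps) can be sketched as a small helper. This is one possible approach, not a prescribed one: `audit_index` and its `expected_freq` parameter are hypothetical names, and the helper assumes you already know the frequency the data should arrive at.

```python
import pandas as pd

def audit_index(index: pd.DatetimeIndex, expected_freq: str) -> dict:
    """Summarise ordering, duplicates, and gaps in a datetime index.

    Hypothetical helper: assumes the intended sampling frequency is known.
    """
    duplicates = int(index.duplicated().sum())
    unique_sorted = index.drop_duplicates().sort_values()
    # Build the full grid the index should cover at the expected frequency
    expected = pd.date_range(
        unique_sorted.min(), unique_sorted.max(), freq=expected_freq
    )
    missing_stamps = expected.difference(unique_sorted)
    return {
        "is_monotonic": bool(index.is_monotonic_increasing),
        "duplicate_timestamps": duplicates,
        "missing_timestamps": len(missing_stamps),
        "first_missing": missing_stamps[0] if len(missing_stamps) else None,
    }

# Toy index: hourly, with 02:00 absent and 03:00 recorded twice
idx = pd.DatetimeIndex([
    "2024-06-01 00:00", "2024-06-01 01:00",
    "2024-06-01 03:00", "2024-06-01 03:00", "2024-06-01 04:00",
])
report = audit_index(idx, expected_freq="h")
print(report)
```

Note that a missing timestamp is different from a missing value: here the row for 02:00 does not exist at all, so `isna()` on the values would never reveal it.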

Let's spin up a sample dataset (with some of the above problems):

import numpy as np
import pandas as pd

# Simulate one week of smart grid voltage readings (hourly)
# with realistic problems injected
periods = 168
# Lowercase "h": the uppercase "H" alias is deprecated in pandas 2.2+
index = pd.date_range("2024-06-01", periods=periods, freq="h")

# Daily sine cycle around 230 V plus Gaussian noise.
# The noise is random, so the mean and std printed below will vary between runs.
voltage = (
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(periods) / 24)
    + np.random.normal(0, 1.2, periods)
)

# Inject problems
voltage[14:17] = np.nan    # sensor dropout: 3 consecutive missing
voltage[42] = np.nan       # isolated missing
voltage[78] = 312.4        # spike outlier
voltage[101:104] = np.nan  # another dropout
voltage[130] = 187.2       # dip outlier

series = pd.Series(voltage, index=index, name="voltage_v")

# --- Audit ---
print("=== TIME SERIES AUDIT ===")
print(f"Period: {series.index.min()} → {series.index.max()}")
print(f"Observations: {len(series)}")
print(f"Expected freq: {pd.infer_freq(series.index)}")
print(f"\nMissing values: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
print(f"Value range: [{series.min():.2f}, {series.max():.2f}]")
print(f"Mean ± Std: {series.mean():.2f} ± {series.std():.2f}")

# Identify consecutive missing runs
missing_mask = series.isna()
missing_runs = []
run_start = None
for i, (ts, is_missing) in enumerate(missing_mask.items()):
    if is_missing and run_start is None:
        run_start = ts  # a new run begins
    elif not is_missing and run_start is not None:
        # The run ended at the previous timestamp
        missing_runs.append((run_start, missing_mask.index[i - 1]))
        run_start = None
# Close a run that extends to the last observation
if run_start is not None:
    missing_runs.append((run_start, missing_mask.index[-1]))

print(f"\nMissing runs ({len(missing_runs)} total):")
for start, end in missing_runs:
    print(f" {start} → {end}")

Output:

=== TIME SERIES AUDIT ===
Period: 2024-06-01 00:00:00 → 2024-06-07 23:00:00
Observations: 168
Expected freq: h

Missing values: 7 (4.2%)
Value range: [187.20, 312.40]
Mean ± Std: 230.22 ± 7.81

Missing runs (3 total):
 2024-06-01 14:00:00 → 2024-06-01 16:00:00
 2024-06-02 18:00:00 → 2024-06-02 18:00:00
 2024-06-05 05:00:00 → 2024-06-05 07:00:00

This audit gives you a map of the damage before you start cleaning. The key task is distinguishing between isolated missing values, which can be imputed from local context, and long missing runs, which may need a different strategy or a flag for downstream consumers.
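One way to make that distinction explicit is to tabulate each run with its length and a label. The sketch below is a vectorized alternative to the audit loop; `summarize_missing_runs` is a hypothetical helper, the `long_run_threshold` of 3 is an arbitrary assumption you should tune to your sampling rate and signal, and the series is assumed to have a unique index.

```python
import numpy as np
import pandas as pd

def summarize_missing_runs(
    series: pd.Series, long_run_threshold: int = 3
) -> pd.DataFrame:
    """Return one row per consecutive run of NaNs (hypothetical helper).

    Assumes a unique index. The threshold is an illustrative default.
    """
    is_missing = series.isna()
    # A new run id starts whenever the missing/present state flips
    run_id = (is_missing != is_missing.shift()).cumsum()
    timestamps = series.index.to_series()
    runs = (
        timestamps[is_missing]
        .groupby(run_id[is_missing])
        .agg(start="min", end="max", length="count")
        .reset_index(drop=True)
    )
    # Label each run so downstream steps can pick an imputation strategy
    runs["kind"] = np.where(
        runs["length"] >= long_run_threshold, "long", "isolated"
    )
    return runs

# Toy series: one isolated gap and one three-hour dropout
index = pd.date_range("2024-06-01", periods=10, freq="h")
values = np.arange(10, dtype=float)
values[2] = np.nan
values[5:8] = np.nan
runs = summarize_missing_runs(pd.Series(values, index=index))
print(runs)
```

Because runs are derived from state changes rather than from a loop that waits for the next valid reading, a run that extends to the final timestamp is captured as well.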

How to Reindex to a Canonical Frequency

Before imputing missing values, you need to confirm your...
