How to Clean Time Series Data in Python
Bala Priya C
Real-world time series data is rarely clean. Sensors drop out, system clocks drift, pipelines duplicate records, and manual data entry introduces mistakes. By the time a dataset reaches your notebook, it has passed through collection, transmission, and storage, and each step is a potential source of corruption.
Cleaning time series data is harder than cleaning tabular data because time is a structural constraint. You can't shuffle rows, and you can't impute a missing value with a column mean, because that mean is computed partly from later observations and so pulls future data into a past observation. Every cleaning decision has to respect temporal ordering, or it breaks the integrity of everything built on top of it.
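To make the leakage point concrete, here is a minimal sketch on a hypothetical six-day toy series (the values and variable names are illustrative assumptions, not part of the sensor dataset used later). It compares a global-mean fill with one causal alternative, an expanding mean that only sees observations up to each timestamp:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with an upward trend and one missing reading
index = pd.date_range("2024-01-01", periods=6, freq="D")
values = pd.Series([10.0, 12.0, np.nan, 16.0, 18.0, 20.0], index=index)

# Global mean imputation: the fill value depends on readings that occur
# AFTER the gap (16, 18, 20), so future data leaks into the past
global_fill = values.fillna(values.mean())

# Causal alternative: the expanding mean at each timestamp uses only
# the observations available up to and including that timestamp
causal_mean = values.expanding(min_periods=1).mean()
causal_fill = values.fillna(causal_mean)

print(global_fill.iloc[2])  # 15.2, influenced by later readings
print(causal_fill.iloc[2])  # 11.0, mean of 10 and 12 only
```

The expanding mean is only one way to stay causal; it is used here because it makes the contrast with the global mean easy to see, not because it is the recommended imputation method.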
This guide walks through the full cleaning pipeline in Python: from raw data arrival to a dataset ready for feature engineering or modelling. We'll cover missing value detection and imputation, outlier identification and treatment, duplicate handling, frequency alignment, noise smoothing, and schema validation, applied to sample sensor data throughout.
You can get the Colab notebook from GitHub and follow along.
Prerequisites
To follow along with this guide, you should be:
Comfortable working with Python and pandas DataFrames
Familiar with time-indexed data
Aware of what feature engineering and machine learning modelling involve at a high level
We'll use pandas and numpy for data manipulation, scipy for signal smoothing and statistical tests, scikit-learn for anomaly detection, and statsmodels for seasonal decomposition. Install them before running any code in this guide:
pip install pandas numpy scipy scikit-learn statsmodels
Table of Contents
How to Audit Your Time Series Before Cleaning It
How to Reindex to a Canonical Frequency
How to Handle Missing Values
Forward Fill — For Step-Function Signals
Time-Weighted Interpolation — For Continuous Signals
Seasonal Decomposition Imputation — For Long Gaps
How to Detect and Handle Outliers
Z-Score with Rolling Window
IQR-Based Outlier Detection
Isolation Forest — For Multivariate Outlier Detection
Outlier Treatment
How to Remove Duplicates
Frequency Alignment and Resampling
Smoothing Noise
Exponential Weighted Moving Average
Savitzky-Golay Filter
Schema and Sanity Validation
The Complete Cleaning Checklist
How to Audit Your Time Series Before Cleaning It
The first rule of data cleaning is: look before you cut. Before imputing, smoothing, or dropping anything, you need a complete picture of what's wrong and where.
A good audit covers the following:
The time index: Is it regular? Are there gaps?
Missing value distribution: Are missing values random or clustered?
Value range: Are there obvious spikes, dips, or sensor failures?
Duplicate timestamps: Does any timestamp appear more than once?
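The index-related checks in this list (regularity, gaps, and duplicate timestamps) can be sketched roughly as follows. The `audit_index` helper and the toy index are hypothetical names introduced for illustration, and the sketch assumes you already know the frequency the data is supposed to have:

```python
import pandas as pd

def audit_index(index: pd.DatetimeIndex, expected_freq: str) -> dict:
    """Summarize duplicate and missing timestamps (hypothetical helper)."""
    # Timestamps that repeat an earlier entry
    duplicated = index[index.duplicated(keep="first")]

    # Compare the observed timestamps with the full expected grid
    unique_sorted = index.drop_duplicates().sort_values()
    expected = pd.date_range(
        unique_sorted.min(), unique_sorted.max(), freq=expected_freq
    )
    missing = expected.difference(unique_sorted)

    return {
        "is_monotonic": index.is_monotonic_increasing,
        "n_duplicates": len(duplicated),
        "n_missing_timestamps": len(missing),
        "missing_timestamps": missing,
    }

# Toy index: 01:00 appears twice and 03:00 is absent
toy_index = pd.DatetimeIndex([
    "2024-06-01 00:00", "2024-06-01 01:00", "2024-06-01 01:00",
    "2024-06-01 02:00", "2024-06-01 04:00",
])
report = audit_index(toy_index, expected_freq="h")
print(report["n_duplicates"], report["n_missing_timestamps"])  # 1 1
```

Note that a timestamp absent from the index is a different problem from a timestamp that is present with a `NaN` value, so it helps to count the two separately.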
Let's spin up a sample dataset (with some of the above problems):
import numpy as np
import pandas as pd

# Simulate one week of smart grid voltage readings (hourly)
# with realistic problems injected
periods = 168
# "h" is the hourly alias; the uppercase "H" is deprecated as of pandas 2.2
index = pd.date_range("2024-06-01", periods=periods, freq="h")

voltage = (
    230.0
    + 3.5 * np.sin(2 * np.pi * np.arange(periods) / 24)
    + np.random.normal(0, 1.2, periods)
)

# Inject problems
voltage[14:17] = np.nan    # sensor dropout: 3 consecutive missing
voltage[42] = np.nan       # isolated missing
voltage[78] = 312.4        # spike outlier
voltage[101:104] = np.nan  # another dropout
voltage[130] = 187.2       # dip outlier

series = pd.Series(voltage, index=index, name="voltage_v")

# --- Audit ---
print("=== TIME SERIES AUDIT ===")
print(f"Period: {series.index.min()} → {series.index.max()}")
print(f"Observations: {len(series)}")
print(f"Expected freq: {pd.infer_freq(series.index)}")
print(f"\nMissing values: {series.isna().sum()} ({series.isna().mean()*100:.1f}%)")
print(f"Value range: [{series.min():.2f}, {series.max():.2f}]")
print(f"Mean ± Std: {series.mean():.2f} ± {series.std():.2f}")

# Identify consecutive missing runs
missing_mask = series.isna()
missing_runs = []
run_start = None
for i, (ts, is_missing) in enumerate(missing_mask.items()):
    if is_missing and run_start is None:
        run_start = ts  # a new run of missing values begins here
    elif not is_missing and run_start is not None:
        missing_runs.append((run_start, missing_mask.index[i - 1]))
        run_start = None
# Close a run that extends to the final timestamp
if run_start is not None:
    missing_runs.append((run_start, missing_mask.index[-1]))

print(f"\nMissing runs ({len(missing_runs)} total):")
for start, end in missing_runs:
    print(f" {start} → {end}")
Output:
=== TIME SERIES AUDIT ===
Period: 2024-06-01 00:00:00 → 2024-06-07 23:00:00
Observations: 168
Expected freq: h

Missing values: 7 (4.2%)
Value range: [187.20, 312.40]
Mean ± Std: 230.22 ± 7.81

Missing runs (3 total):
 2024-06-01 14:00:00 → 2024-06-01 16:00:00
 2024-06-02 18:00:00 → 2024-06-02 18:00:00
 2024-06-05 05:00:00 → 2024-06-05 07:00:00
This audit gives you a map of the damage before you start cleaning. The key task is distinguishing between isolated missing values, which can be imputed from local context, and long missing runs, which may need a different strategy or should be flagged for downstream consumers.
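One way to make that distinction explicit is to tabulate each run with its length and a label. The sketch below is a vectorized variant of the run detection; the `summarize_missing_runs` helper, the `short`/`long` labels, and the default threshold of 3 observations are all illustrative assumptions that you would tune to your own signal:

```python
import numpy as np
import pandas as pd

def summarize_missing_runs(series: pd.Series, long_run_threshold: int = 3) -> pd.DataFrame:
    """Return one row per run of consecutive NaNs (hypothetical helper)."""
    is_missing = series.isna()
    # A new run id starts every time the missing/present state flips
    run_id = (is_missing != is_missing.shift()).cumsum()

    rows = []
    # Group only the missing observations by the run they belong to
    for _, run in series[is_missing].groupby(run_id[is_missing].to_numpy()):
        rows.append({
            "start": run.index[0],
            "end": run.index[-1],
            "length": len(run),
            # The threshold is an assumption, not a universal rule
            "kind": "long" if len(run) >= long_run_threshold else "short",
        })
    return pd.DataFrame(rows, columns=["start", "end", "length", "kind"])

# Toy series: a single gap, a three-point gap, and a gap at the very end
toy_index = pd.date_range("2024-06-01", periods=10, freq="h")
toy = pd.Series(
    [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0, 8.0, np.nan, np.nan],
    index=toy_index,
)
runs = summarize_missing_runs(toy, long_run_threshold=3)
print(runs)
```

Because the runs are derived from state changes rather than a loop that waits for the next valid value, a gap that reaches the last timestamp is reported as well.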
How to Reindex to a Canonical Frequency
Before imputing missing values, you need to confirm your...