Parallelogram – catch fine-tuning dataset bugs before training

v0.4 · open source · --fix shipped

The linter for fine-tuning data.

Parallelogram catches the silent killers — broken role sequences, context-window overflows, duplicates, mojibake — before your $500 GPU run discovers them for you.

pip install parallelogram

If it exits 0, your run won't fail because of data.

/ 01 — pre-flight See what it catches.

One command before you train. Errors report the exact line and rule.

/ 02 — what it checks Six rules. Every silent killer.

Every check maps to a real failure mode that has cost someone a training run. Rules are pluggable; the v0.4 set is non-negotiable.

schema error

Malformed records.

Missing fields, wrong types, invalid roles. The first rule, because every other rule assumes it.

data.jsonl:84
  {"messages": [{"role": "owner", "content": "…"}]}
                         ^^^^^^^
  invalid role: must be system|user|assistant|tool
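To make the rule concrete, here is a minimal sketch of what a schema check on one JSONL line might look like. It is an illustration, not Parallelogram's actual implementation: the function name `check_schema` and the message wording are hypothetical, and it assumes the OpenAI chat layout with string `content`.

```python
import json

# Roles accepted by the schema rule, as listed in the error message above.
VALID_ROLES = {"system", "user", "assistant", "tool"}


def check_schema(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        # isinstance guard first: an unhashable role would break the set lookup.
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"message {i} has invalid role: {role!r}")
        # Assumption for this sketch: content must be a string.
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} content must be a string")
    return problems
```

Running the later rules only on records that pass this check is what lets them assume a well-formed `messages` list.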

roles error

Bad role sequences.

System out of place. Doubled turns. Conversations that don't end on the assistant. The model learns to talk to itself.

data.jsonl:23
  [user → user → assistant]
          ^^^^
  role alternation broken at turn 1: expected 'assistant', got 'user'
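The alternation rule can be sketched in a few lines. This is a simplified illustration, not the tool's real logic: `check_roles` is a hypothetical name, turns are numbered from 0 after an optional leading system message, and `tool` turns are not modeled.

```python
def check_roles(roles: list[str]) -> list[str]:
    """Check a conversation's role sequence; return a list of problems."""
    problems = []
    # An optional system message is only allowed as the very first entry.
    body = roles[1:] if roles and roles[0] == "system" else roles

    expected = "user"
    for turn, role in enumerate(body):
        if role == "system":
            problems.append(f"system message out of place at turn {turn}")
            continue
        if role != expected:
            problems.append(
                f"role alternation broken at turn {turn}: "
                f"expected {expected!r}, got {role!r}"
            )
        # Resynchronise on what was actually seen, so one slip gives one report.
        expected = "assistant" if role == "user" else "user"

    if not body:
        problems.append("conversation has no turns")
    elif body[-1] != "assistant":
        problems.append(
            f"conversation must end on 'assistant', ended on {body[-1]!r}"
        )
    return problems
```

For the sequence shown above, `check_roles(["user", "user", "assistant"])` reports a single alternation error at turn 1.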

empty-content error

Empty turns.

Whitespace-only content slips past most validators. The model trains to produce silence.

data.jsonl:312
  {"role": "user", "content": " "}
                              ^^^^^
  message 0 has empty content
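A whitespace-aware emptiness test is short enough to show in full. The sketch below is an assumption about how such a rule could work (the name `empty_turns` is hypothetical); it treats missing, non-string, and whitespace-only content as empty.

```python
def empty_turns(messages: list[dict]) -> list[int]:
    """Return the indices of messages whose content is missing or blank."""
    empty = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # str.strip() removes all leading and trailing whitespace,
        # so a content of only spaces, tabs, or newlines becomes "".
        if not isinstance(content, str) or not content.strip():
            empty.append(i)
    return empty
```

A plain `if content:` test would accept `" "`, which is why the strip matters.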

context-window error

Context-window overflow.

TRL truncates oversized records silently, usually severing the assistant turn, so you train on noise without knowing it. Tokens are counted with the model's own tokenizer (tiktoken for OpenAI, Hugging Face for open-weight models); the default count is approximate.

data.jsonl:1209
  ~8512 tokens > max_seq_len = 8192
  will be silently truncated, severing the assistant response
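As an illustration of an approximate length check, the sketch below estimates tokens from character counts. The 4-characters-per-token ratio is a common rule of thumb for English text and is an assumption here, not a documented Parallelogram default; the function names are hypothetical. For exact counts you would swap in a real tokenizer.

```python
import math


def approx_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate, assuming about `chars_per_token` characters per token."""
    return math.ceil(len(text) / chars_per_token)


def exceeds_context(
    messages: list[dict], max_seq_len: int, chars_per_token: float = 4.0
) -> tuple[bool, int]:
    """Return (over_limit, estimated_tokens) for one conversation.

    Ignores chat-template overhead (role markers, separators), so the
    estimate is a lower bound on what a trainer would actually see.
    """
    total = sum(
        approx_token_count(m["content"], chars_per_token) for m in messages
    )
    return total > max_seq_len, total
```

Because the estimate ignores template tokens, records that land just under the limit are worth re-checking with the exact tokenizer.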

duplicates error

Exact duplicates.

Repeated examples push the model toward memorization, not generalization. We hash with normalized whitespace so trivial differences don't mask real dupes.

data.jsonl:147 duplicate of line 89 (3 copies total at lines: [89, 147, 402])
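Hashing with normalized whitespace can be sketched as follows. This is one plausible approach, not necessarily the one Parallelogram uses: the helper names are hypothetical, SHA-256 is an arbitrary choice, and whitespace is collapsed in every string of the record.

```python
import hashlib
import json
from collections import defaultdict


def _normalize(value):
    """Collapse runs of whitespace in every string inside a JSON value."""
    if isinstance(value, str):
        return " ".join(value.split())
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    return value


def fingerprint(record: dict) -> str:
    """Hash a record so that key order and spacing differences do not matter."""
    canonical = json.dumps(_normalize(record), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(lines: list[str]) -> list[list[int]]:
    """Group 1-based line numbers that share a fingerprint."""
    seen = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        seen[fingerprint(json.loads(line))].append(lineno)
    return [nums for nums in seen.values() if len(nums) > 1]
```

Note that this check keeps one hash per unique record, so its memory grows with the number of distinct records rather than staying constant.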

encoding warning

Mojibake & BOM.

UTF-8 → latin-1 → UTF-8 round-trips look fine in your editor and ruin your model's punctuation forever.

data.jsonl:401
  "donâ€™t do it" → should be: "don't do it"
      ^^^
  latin-1 → UTF-8 round-trip artifact
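The round-trip artifact above can be reversed mechanically. The sketch below is an assumption about how a repair might work, not the tool's actual code. It uses Windows-1252 (`cp1252`) rather than strict latin-1, because visible characters such as `€` and `™` only arise from that code page; the function names are hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 -> cp1252 -> UTF-8 round trip when it is reversible.

    If the text cannot be re-encoded as cp1252, or the resulting bytes are
    not valid UTF-8, it was not produced by this round trip and is returned
    unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def strip_bom(text: str) -> str:
    """Drop a leading byte-order mark (U+FEFF) if present."""
    return text[1:] if text.startswith("\ufeff") else text


def looks_like_mojibake(text: str) -> bool:
    """True when the repair changes the text."""
    return repair_mojibake(text) != text
```

Legitimate accented text such as `café` fails the UTF-8 decode step and passes through untouched, which keeps false positives low for this particular pattern.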

/ in motion — input vs output Watch a broken dataset get walled off.

547 records in. Three errors and one warning identified. 543 records out, ready to train. The rejected lines stay on your disk; nothing leaves your machine.

data.jsonl (547 records) → parallelogram check → clean.jsonl (543 records)

/ 03 — anywhere a build runs Free CI integration. No config required.

Clean POSIX exit codes — 0 clean, 1 warnings, 2 errors — plus a structured JSON report. Drop one line into your workflow and your training data gets the same gate as your code.

# .github/workflows/data.yml
- run: pip install parallelogram
- run: parallelogram check data.jsonl --json

exit codes
0 clean
1 warnings
2 errors
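If you want warnings to pass but errors to fail, a small wrapper can interpret the exit codes listed above. This is a sketch under assumptions: the helper names are hypothetical, and the JSON report is passed through as raw text because its structure is not described here.

```python
import subprocess

# Exit codes as documented above.
EXIT_MEANING = {0: "clean", 1: "warnings", 2: "errors"}


def should_block(returncode: int, fail_on_warnings: bool = False) -> bool:
    """Decide whether a pipeline should stop for a given exit code."""
    if returncode == 0:
        return False
    if returncode == 1:
        return fail_on_warnings
    # 2 (errors) or anything unexpected: stop rather than train on bad data.
    return True


def run_check(path: str) -> tuple[int, str]:
    """Run the documented CLI command and return (exit code, raw report text)."""
    proc = subprocess.run(
        ["parallelogram", "check", path, "--json"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

A plain `run:` step in GitHub Actions fails on any non-zero exit code, so without a wrapper like this a warnings-only result (exit 1) also fails the job.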

No telemetry. No upload boundary. No backend.

Streams the file. Memory stays flat at 100k records.

Pluggable rules — disable or extend without forking.

github.com / Thatayotlhe04 / openai-fine-tune PR #847 · 2m ago

Check failed · parallelogram / data 3 errors, 1 warning in data.jsonl

L23 roles Conversation must end on 'assistant', ended on 'user'

L147 duplicates Duplicate of line 89

L312 context-window Record exceeds max_seq_len: ~8512 > 8192 tokens

L401 encoding Likely mojibake: 'â€™'

3 errors · 1 warning · 543 clean

/ 04 — quickstart A few commands. Then never lose a run again.

01 Install

$ pip install parallelogram
One command, on PyPI. No GPU, no network, no config.

02 Validate

$ parallelogram check data.jsonl
Exits 0 if your data is fine. Otherwise, exact lines and rules.

03 Repair

$ parallelogram check data.jsonl \
    --fix --output clean.jsonl
Mechanical fixes: dedupe, BOM strip, mojibake repair. Free, local, no network.

04 Ship

$ parallelogram check clean.jsonl
Re-validates clean. Ready to feed your trainer.

/ where it sits One step. One file. No surprises downstream.

raw.jsonl (unverified) → parallelogram (6 rules) → clean.jsonl (verified) → trainer (no nasty surprises)

/ 05 — works with Validates the formats your trainer already speaks.

axolotl

🦥 Unsloth

TRL

🤗 Hugging Face

OpenAI

GitHub Actions

v0.4 supports OpenAI chat ({"messages":[…]}) and ShareGPT ({"conversations":[…]}). Raw-completion support is shipping next.
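The two supported layouts carry the same information under different keys. The sketch below shows one way to map a ShareGPT record onto the OpenAI chat layout. It assumes the commonly used ShareGPT convention of `from`/`value` keys with `human` and `gpt` speakers; the function name and role map are hypothetical and not taken from Parallelogram.

```python
# Assumed ShareGPT speaker names mapped to OpenAI chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert {"conversations": [...]} into {"messages": [...]}.

    Raises KeyError on an unknown speaker or a missing key, so that
    malformed records are surfaced instead of silently passed through.
    """
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }
```

Normalizing to one layout first means role and content checks only need to be written once.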

0 records uploaded · no telemetry, ever

6 rules in v0.4 · covering every silent killer

~0ms per 1k records streaming, O(1) memory

$0 to use · Apache 2.0, forever

Run before you train.

Free, open...
