Parallelogram – catch fine-tuning dataset bugs before training

thatayotlhe04 · 1 pt · 0 comments

Parallelogram — The linter for fine-tuning data

v0.4 · open source · --fix shipped

The linter for<br>fine-tuning data.

Parallelogram catches the silent killers — broken role sequences, context-window<br>overflows, duplicates, mojibake — before your $500 GPU run discovers them for you.

pip install parallelogram

If it exits 0, your run won't fail because of data.

/ 01 — pre-flight<br>See what it catches.

One command before you train. Errors report the exact line and rule.<br>It auto-runs the real checks below — click a sample or edit the<br>data to lint your own, right here in the browser.

One JSON record per line. Runs entirely in your browser — nothing is uploaded.

/ 02 — what it checks<br>Six rules. Every silent killer.

Every check maps to a real failure mode that has cost someone a training run.<br>Rules are pluggable; the v0.4 set is non-negotiable.

schema<br>error

Malformed records.

Missing fields, wrong types, invalid roles. The first rule, because every other rule assumes it.

data.jsonl:84<br>{"messages": [{"role": "owner", "content": "…"}]}<br>^^^^^^^ invalid role: must be system|user|assistant|tool
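
A schema check of this kind can be sketched in a few lines of Python. This is an illustration only, not Parallelogram's actual implementation: the function name `check_schema` and the exact messages are assumptions, and it only covers the OpenAI chat layout with string content.

```python
import json

# Roles accepted by the rule, as listed in the error message above.
VALID_ROLES = {"system", "user", "assistant", "tool"}


def check_schema(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list means OK).

    Hypothetical helper: a sketch of a schema rule, not the tool's own code.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} must be an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(
                f"message {i} has invalid role {role!r}: "
                "must be system|user|assistant|tool"
            )
        # Simplification: this sketch requires string content everywhere.
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} content must be a string")
    return problems
```

Every later rule can then assume a well-formed record, which is why this check runs first.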

roles<br>error

Bad role sequences.

System out of place. Doubled turns. Conversations that don't end on the assistant. The model learns to talk to itself.

data.jsonl:23<br>[user → user → assistant]<br>^^^^ role alternation broken at turn 1: expected 'assistant', got 'user'
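
The alternation logic is easy to picture as code. The sketch below assumes a simplified policy (optional leading system message, then strict user/assistant alternation ending on the assistant) and ignores tool turns; `check_roles` is a hypothetical name, not the tool's API.

```python
def check_roles(roles: list[str]) -> list[str]:
    """Check a conversation's role order; return a list of problems.

    Assumed policy for this sketch: 'system' may only appear first, the
    remaining turns alternate user -> assistant, and the last turn is
    'assistant'. Tool turns are out of scope here.
    """
    problems = []
    if "system" in roles[1:]:
        problems.append("system message must come first")

    # Check alternation on the non-system turns only.
    body = [r for r in roles if r != "system"]
    expected = "user"
    for i, role in enumerate(body):
        if role != expected:
            problems.append(
                f"role alternation broken at turn {i}: "
                f"expected {expected!r}, got {role!r}"
            )
            break  # report the first break only
        expected = "assistant" if expected == "user" else "user"

    if body and body[-1] != "assistant":
        problems.append(
            f"conversation must end on 'assistant', ended on {body[-1]!r}"
        )
    return problems
```

Run on `["user", "user", "assistant"]`, this reports the break at turn 1, in the same shape as the message above.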

empty-content<br>error

Empty turns.

Whitespace-only content slips past most validators. The model trains to produce silence.

data.jsonl:312<br>{"role": "user", "content": " "}<br>^^^^^ message 0 has empty content
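
The whole check comes down to stripping whitespace before testing for emptiness. A minimal sketch, with `find_empty_messages` as an assumed name rather than anything from the tool:

```python
def find_empty_messages(messages: list[dict]) -> list[str]:
    """Flag messages whose content is missing or whitespace-only.

    Hypothetical helper for illustration. str.strip() removes spaces,
    tabs, newlines, and other Unicode whitespace, so " " and "\n\t" both
    count as empty.
    """
    problems = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")
    return problems
```

A plain `content != ""` test would let the single-space example above through, which is exactly how these records slip past simpler validators.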

context-window<br>error

Context-window overflow.

TRL truncates oversized records silently, usually severing the assistant turn, so you train on noise without knowing it. Token counts are approximate by default; for exact counts, use the model's own tokenizer: tiktoken for OpenAI models, a Hugging Face tokenizer for open-weight models.

data.jsonl:1209<br>~8512 tokens > max_seq_len = 8192<br>will be silently truncated, severing the assistant response
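
To make "approximate" concrete, here is one way an estimate could work. Everything in this sketch is an assumption: the four-characters-per-token ratio is a rough rule of thumb for English text, the per-message overhead is a guess at chat-template cost, and neither reflects how Parallelogram actually counts.

```python
import math


def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4 chars/token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def check_context_window(
    messages: list[dict],
    max_seq_len: int = 8192,
    per_message_overhead: int = 4,  # assumed cost of role/template tokens
):
    """Return an error string if the record likely overflows, else None."""
    total = sum(
        approx_tokens(m.get("content") or "") + per_message_overhead
        for m in messages
    )
    if total > max_seq_len:
        return f"~{total} tokens > max_seq_len = {max_seq_len}"
    return None
```

An estimate like this is cheap enough to run on every record. Records that land near the limit are the ones worth recounting with the real tokenizer.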

duplicates<br>error

Exact duplicates.

Repeated examples push the model toward memorization, not generalization. We hash with normalized whitespace so trivial differences don't mask real dupes.

data.jsonl:147<br>duplicate of line 89 (3 copies total at lines: [89, 147, 402])
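
Hashing with normalized whitespace might look like the sketch below. The choice of SHA-256, the key layout, and the names `record_key` and `find_duplicates` are illustrative assumptions, not the tool's internals.

```python
import hashlib
import json


def record_key(record: dict) -> str:
    """Hash a record after collapsing whitespace runs in each message."""
    parts = []
    for msg in record.get("messages", []):
        # " ".join(s.split()) trims the ends and collapses inner runs.
        content = " ".join(str(msg.get("content", "")).split())
        parts.append([msg.get("role"), content])
    payload = json.dumps(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_duplicates(lines):
    """Return (line, first_seen_line) pairs for repeated records."""
    first_seen = {}
    dupes = []
    for lineno, line in enumerate(lines, start=1):
        key = record_key(json.loads(line))
        if key in first_seen:
            dupes.append((lineno, first_seen[key]))
        else:
            first_seen[key] = lineno
    return dupes
```

This sketch keeps one hash per unique record, so its memory grows with the number of distinct records even though the file itself is read line by line.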

encoding<br>warning

Mojibake & BOM.

UTF-8 → latin-1 → UTF-8 round-trips look fine in your editor and ruin your model's punctuation forever.

data.jsonl:401<br>"donâ€™t do it" → should be: "don't do it"<br>^^^ latin-1 → UTF-8 round-trip artifact
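
The round trip is mechanical, so the repair can be too. When the UTF-8 bytes of a curly apostrophe (U+2019) are decoded as a single-byte encoding, they become the three characters â€™. The sketch below reverses that, assuming the single-byte step was Windows-1252 (what "latin-1" usually means in practice). The function names are hypothetical, and a real repair pass would need to handle strings that are only partly damaged.

```python
def strip_bom(text: str) -> str:
    """Drop a leading byte-order mark (U+FEFF) if present."""
    return text[1:] if text.startswith("\ufeff") else text


def repair_mojibake(text: str) -> str:
    """Undo one UTF-8 -> cp1252 -> UTF-8 round trip when it is reversible.

    Sketch only: if the text cannot be re-encoded as cp1252, or the
    resulting bytes are not valid UTF-8, the input is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# U+00E2 U+20AC U+2122 is how a mis-decoded U+2019 apostrophe appears.
broken = "don\u00e2\u20ac\u2122t do it"
print(repair_mojibake(broken) == "don\u2019t do it")
```

Clean text such as "café" survives untouched, because its cp1252 bytes are not valid UTF-8 and the function falls back to the input.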

/ in motion — input vs output<br>Watch a broken dataset get walled off.

547 records in. Three errors and one warning identified. 543 records out, ready<br>to train. The rejected lines stay on your disk; nothing leaves your machine.

data.jsonl (547 records) → parallelogram check → clean.jsonl (543 records)

/ 03 — anywhere a build runs<br>Free CI integration.<br>No config required.

Clean POSIX exit codes — 0 clean, 1 warnings, 2 errors —<br>plus a structured JSON report. Drop one line into your workflow and your training data<br>gets the same gate as your code.

# .github/workflows/data.yml<br>- run: pip install parallelogram<br>- run: parallelogram check data.jsonl --json

exit codes<br>0 clean<br>1 warnings<br>2 errors
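
The exit-code convention is simple to state in code: the worst severity found decides the status. A minimal sketch, with `exit_code` as an assumed helper name:

```python
from typing import Iterable


def exit_code(severities: Iterable[str]) -> int:
    """Map finding severities to 0 (clean), 1 (warnings), or 2 (errors)."""
    seen = set(severities)  # materialize once so generators work too
    if "error" in seen:
        return 2
    if "warning" in seen:
        return 1
    return 0
```

A CI step that should tolerate warnings can treat status 1 as a pass and fail only on 2.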

No telemetry. No uploads. No backend.

Streams the file. Memory stays flat at 100k records.

Pluggable rules — disable or extend without forking.
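
Streaming and pluggable rules fit together naturally: read one line, run every enabled rule on it, move on. The sketch below shows one possible shape. The registry, the `@rule` decorator, and the `lint` generator are all assumptions for illustration and say nothing about Parallelogram's real plugin API.

```python
import json
from typing import Callable, Iterable, Iterator

Rule = Callable[[dict], list]
RULES: dict[str, Rule] = {}  # hypothetical rule registry


def rule(name: str):
    """Register a rule function under a name (illustrative decorator)."""
    def register(fn: Rule) -> Rule:
        RULES[name] = fn
        return fn
    return register


@rule("empty-content")
def empty_content(record: dict) -> list:
    return [
        f"message {i} has empty content"
        for i, msg in enumerate(record.get("messages", []))
        if not str(msg.get("content", "")).strip()
    ]


def lint(
    lines: Iterable[str], disabled: frozenset = frozenset()
) -> Iterator[tuple]:
    """Yield (line number, rule name, message), one record at a time."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for name, check in RULES.items():
            if name not in disabled:
                for message in check(record):
                    yield lineno, name, message
```

Passing an open file object as `lines` keeps only the current record in memory, since Python iterates files line by line.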

github.com / Thatayotlhe04 / openai-fine-tune<br>PR #847 · 2m ago

Check failed · parallelogram / data<br>3 errors, 1 warning in data.jsonl

L23<br>roles<br>Conversation must end on 'assistant', ended on 'user'

L147<br>duplicates<br>Duplicate of line 89

L312<br>context-window<br>Record exceeds max_seq_len: ~8512 > 8192 tokens

L401<br>encoding<br>Likely mojibake: 'â€™'

3 errors<br>1 warning<br>543 clean

/ 04 — quickstart<br>A few commands.<br>Then never lose a run again.

01<br>Install

$ pip install parallelogram<br>One command, from PyPI. Running the linter needs no GPU, no network, and no config.

02<br>Validate

$ parallelogram check data.jsonl<br>Exits 0 if your data is fine. Otherwise, exact lines and rules.

03<br>Repair

$ parallelogram check data.jsonl \<br>--fix --output clean.jsonl<br>Mechanical fixes — dedupe, BOM strip, mojibake repair. Free, local, no network.

04<br>Ship

$ parallelogram check clean.jsonl<br>Re-validates clean. Ready to feed your trainer.
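
The validate, repair, and re-validate steps above can be chained from a training script. The sketch below uses only the commands and flags shown in the quickstart. The `preflight` function and its injectable `run` parameter are assumptions made for illustration and testing, and the sketch treats any non-zero exit, warnings included, as not ready.

```python
import subprocess
from typing import Callable, Optional, Sequence


def preflight(
    raw: str,
    clean: str,
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> Optional[str]:
    """Return the path that is safe to train on, or None if neither is.

    `run` takes an argv list and returns the exit status; by default it
    is subprocess.call, which requires the CLI to be installed.
    """
    # Step 02: validate the raw file.
    if run(["parallelogram", "check", raw]) == 0:
        return raw
    # Step 03: apply mechanical fixes into a new file.
    run(["parallelogram", "check", raw, "--fix", "--output", clean])
    # Step 04: re-validate the repaired file.
    if run(["parallelogram", "check", clean]) == 0:
        return clean
    return None
```

A `None` result means the remaining problems are not mechanical and need a human to look at the reported lines.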

/ where it sits<br>One step. One file. No surprises downstream.

raw.jsonl (unverified) → parallelogram (6 rules) → clean.jsonl (verified) → trainer (no nasty surprises)

/ 05 — works with<br>Validates the formats your trainer already speaks.

axolotl

🦥 Unsloth

TRL

🤗 Hugging Face

OpenAI

GitHub Actions

v0.4 supports OpenAI chat ({"messages":[…]}) and ShareGPT<br>({"conversations":[…]}). Raw-completion support ships next.
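
The two formats carry the same information under different keys. The sketch below converts one to the other, assuming the common ShareGPT convention of `from`/`value` fields with `human`, `gpt`, and `system` speakers. That convention is not spelled out above, and `sharegpt_to_messages` is a hypothetical helper.

```python
# Assumed ShareGPT speaker names mapped to OpenAI chat roles.
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert a ShareGPT-style record to the OpenAI chat layout.

    Raises KeyError on an unknown speaker, so odd data fails loudly.
    """
    messages = [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
    return {"messages": messages}
```

Normalizing to one layout first means each rule only has to be written once.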

0<br>records uploaded<br>no telemetry, ever

6<br>rules in v0.4<br>covering every silent killer

~0ms<br>per 1k records<br>streaming, O(1) memory

$0<br>to use<br>Apache 2.0, forever

Run before you train.

Free, open...
