Parallelogram — The linter for fine-tuning data
v0.4 · open source · --fix shipped
Parallelogram catches the silent killers (broken role sequences, context-window overflows, duplicates, mojibake) before your $500 GPU run discovers them for you.
pip install parallelogram
If it exits 0, your run won't fail because of data.
/ 01 — pre-flight<br>See what it catches.
One command before you train. Errors report the exact line and rule. Input is one JSON record per line; checks run locally and nothing is uploaded.
/ 02 — what it checks<br>Six rules. Every silent killer.
Every check maps to a real failure mode that has cost someone a training run. Rules are pluggable; the v0.4 set is non-negotiable.
schema · error
Malformed records.
Missing fields, wrong types, invalid roles. The first rule, because every other rule assumes it.
data.jsonl:84
{"messages": [{"role": "owner", "content": "…"}]}
                       ^^^^^^^ invalid role: must be system|user|assistant|tool
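A minimal sketch of what a schema check along these lines could look like. The function name and the exact set of conditions are illustrative assumptions, not Parallelogram's actual implementation; only the four role names and the error wording come from the example above. It also assumes every message carries string content, which is a simplification.

```python
import json

# Role names taken from the error message above.
VALID_ROLES = {"system", "user", "assistant", "tool"}


def check_schema(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} must be an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(
                f"message {i} has invalid role {role!r}: "
                "must be system|user|assistant|tool"
            )
        # Simplifying assumption: content is always expected to be a string.
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} 'content' must be a string")
    return problems
```

Running it on the record from the example flags the `owner` role and nothing else.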
roles · error
Bad role sequences.
System out of place. Doubled turns. Conversations that don't end on the assistant. The model learns to talk to itself.
data.jsonl:23
[user → user → assistant]
        ^^^^ role alternation broken at turn 1: expected 'assistant', got 'user'
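The three failure modes named above (system out of place, doubled turns, not ending on the assistant) can be sketched in a few lines. This is a hedged illustration: `check_roles` is a hypothetical name, and skipping `tool` turns when checking alternation is an assumption made here for brevity, not documented behavior.

```python
def check_roles(roles: list) -> list:
    """Check a conversation's role sequence; return a list of problem messages."""
    if not roles:
        return ["conversation has no turns"]

    problems = []
    expected = "user"  # assumption: the first non-system turn is the user's
    for i, role in enumerate(roles):
        if role == "system":
            # A system message is only accepted as the very first turn.
            if i != 0:
                problems.append(f"system message out of place at turn {i}")
            continue
        if role == "tool":
            continue  # assumption: tool turns do not take part in alternation
        if role != expected:
            problems.append(
                f"role alternation broken at turn {i}: "
                f"expected {expected!r}, got {role!r}"
            )
        expected = "assistant" if role == "user" else "user"

    if roles[-1] != "assistant":
        problems.append(
            f"conversation must end on 'assistant', ended on {roles[-1]!r}"
        )
    return problems
```

For the `[user → user → assistant]` sequence shown above, this reports the break at turn 1 and nothing else.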
empty-content · error
Empty turns.
Whitespace-only content slips past most validators. The model trains to produce silence.
data.jsonl:312
{"role": "user", "content": " "}
                            ^^^^^ message 0 has empty content
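Catching whitespace-only content mostly comes down to stripping before testing for emptiness. A small sketch, with `empty_turns` as a hypothetical helper name; treating missing or null content as empty is an assumption:

```python
def empty_turns(messages: list) -> list:
    """Return indexes of messages whose content is missing, empty, or only whitespace."""
    flagged = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # str.strip() with no arguments removes all leading/trailing whitespace,
        # so a whitespace-only string becomes "" and is treated as empty.
        if content is None or (isinstance(content, str) and not content.strip()):
            flagged.append(i)
    return flagged
```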
context-window · error
Context-window overflow.
TRL truncates oversized records silently, usually severing the assistant turn. You train on noise and don't know it. Tokens are counted with the model's own tokenizer (tiktoken for OpenAI, Hugging Face for open-weight models) or, by default, approximated.
data.jsonl:1209
~8512 tokens > max_seq_len = 8192
will be silently truncated, severing the assistant response
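A rough sketch of an overflow check with a swappable token counter. The four-characters-per-token heuristic is an assumption chosen for illustration, not the approximation Parallelogram uses, and the sketch ignores any tokens a chat template would add. `context_overflow` and `approx_tokens` are hypothetical names.

```python
import math
from typing import Callable, Optional


def approx_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (illustrative assumption)."""
    return math.ceil(len(text) / 4)


def context_overflow(
    messages: list,
    max_seq_len: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> Optional[str]:
    """Return a report string if the record exceeds max_seq_len, else None."""
    total = sum(count_tokens(msg["content"]) for msg in messages)
    if total > max_seq_len:
        return f"~{total} tokens > max_seq_len = {max_seq_len}"
    return None
```

Passing a real tokenizer's counting function as `count_tokens` would replace the estimate with an exact count without changing the check itself.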
duplicates · error
Exact duplicates.
Repeated examples push the model toward memorization, not generalization. We hash with normalized whitespace so trivial differences don't mask real dupes.
data.jsonl:147
duplicate of line 89 (3 copies total at lines: [89, 147, 402])
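One way to hash with normalized whitespace is sketched below. The normalization rule (collapse every whitespace run to a single space), the choice of SHA-256, and the function names are assumptions for illustration. Note that this approach keeps one digest per unique record in memory.

```python
import hashlib
import json
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def record_digest(record: dict) -> str:
    """Hash the (role, normalized content) pairs of a chat record."""
    turns = [[m["role"], normalize(m["content"])] for m in record["messages"]]
    payload = json.dumps(turns, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_duplicates(lines) -> list:
    """Return groups of 1-based line numbers that hold the same record."""
    seen = defaultdict(list)  # digest -> line numbers where it appeared
    for lineno, line in enumerate(lines, start=1):
        seen[record_digest(json.loads(line))].append(lineno)
    return [nums for nums in seen.values() if len(nums) > 1]
```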
encoding · warning
Mojibake & BOM.
UTF-8 → latin-1 → UTF-8 round-trips look fine in your editor and ruin your model's punctuation forever.
data.jsonl:401
"don’t do it" → should be: "don't do it"
    ^^^ latin-1 → UTF-8 round-trip artifact
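A hedged sketch of how this kind of damage can be reversed. It assumes the text went through exactly one wrong decode and uses Python's `cp1252` codec, because the characters in the example ("€" and "™") belong to Windows-1252 rather than strict latin-1. The helper names are hypothetical, and the repair works on the whole string at once: a string that mixes damaged and legitimate non-ASCII text is returned unchanged.

```python
BOM = "\ufeff"


def strip_bom(text: str) -> str:
    """Drop a leading byte-order mark, if present."""
    return text[1:] if text.startswith(BOM) else text


def repair_mojibake(text: str) -> str:
    """Undo one UTF-8 -> cp1252 -> UTF-8 round trip.

    Returns the text unchanged when the reverse trip is not possible,
    which is the case for text that was never damaged this way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Applied to the example above, the three-character artifact collapses back into a single right single quotation mark (U+2019); mapping that to a plain ASCII apostrophe would be a separate, optional step.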
/ in motion — input vs output<br>Watch a broken dataset get walled off.
547 records in. Three errors and one warning identified. 543 records out, ready to train. The rejected lines stay on your disk; nothing leaves your machine.
data.jsonl (547 records) → parallelogram check → clean.jsonl (543 records)
/ 03 — anywhere a build runs<br>Free CI integration.<br>No config required.
Clean POSIX exit codes (0 clean, 1 warnings, 2 errors) plus a structured JSON report. Drop one line into your workflow and your training data gets the same gate as your code.
# .github/workflows/data.yml
- run: pip install parallelogram
- run: parallelogram check data.jsonl --json
exit codes
0 clean
1 warnings
2 errors
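Because a CI shell step fails on any nonzero exit, a plain `run:` line treats warnings (exit 1) the same as errors (exit 2). If warnings should not block a merge, a small wrapper can make that policy explicit. The sketch below is an illustration: `should_block` and `run_gate` are hypothetical helpers, and only the `parallelogram check <file>` command and the three exit codes come from this page.

```python
import subprocess


def should_block(returncode: int, fail_on_warnings: bool = False) -> bool:
    """Decide whether a pipeline step should fail for a given exit code."""
    if returncode == 0:  # clean
        return False
    if returncode == 1:  # warnings only
        return fail_on_warnings
    return True  # 2 = errors; anything unexpected is treated as a failure too


def run_gate(path: str, fail_on_warnings: bool = False) -> bool:
    """Run the check command on one file; return True if the data passes the gate."""
    result = subprocess.run(["parallelogram", "check", path])
    return not should_block(result.returncode, fail_on_warnings)
```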
No telemetry. No upload boundary. No backend.
Streams the file. Memory stays flat at 100k records.
Pluggable rules — disable or extend without forking.
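To make the streaming and pluggable-rule ideas concrete, here is a sketch under stated assumptions. It is not Parallelogram's plugin API: the `Rule` signature, the `lint_stream` function, and the `disabled` parameter are all hypothetical. Rules are plain callables held in a dict, and records are read one line at a time, so per-record rules only ever hold one record in memory.

```python
import json
from typing import Callable, Iterable, Iterator

# Hypothetical rule signature: takes one parsed record, returns problem messages.
Rule = Callable[[dict], list]


def lint_stream(
    lines: Iterable[str],
    rules: dict,
    disabled: frozenset = frozenset(),
) -> Iterator[tuple]:
    """Yield (line number, rule name, message) for every problem found."""
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for name, rule in rules.items():
            if name in disabled:
                continue
            for message in rule(record):
                yield lineno, name, message
```

Passing an open file object as `lines` reads it lazily, one line per iteration; adding a rule means adding a dict entry, and disabling one means naming it in `disabled`.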
github.com / Thatayotlhe04 / openai-fine-tune · PR #847 · 2m ago
Check failed · parallelogram / data
3 errors, 1 warning in data.jsonl
L23 · roles · Conversation must end on 'assistant', ended on 'user'
L147 · duplicates · Duplicate of line 89
L312 · context-window · Record exceeds max_seq_len: ~8512 > 8192 tokens
L401 · encoding · Likely mojibake: '’'
3 errors · 1 warning · 543 clean
/ 04 — quickstart<br>A few commands.<br>Then never lose a run again.
01 Install
$ pip install parallelogram
One command, on PyPI. The tool itself needs no GPU, no network, and no config.
02 Validate
$ parallelogram check data.jsonl
Exits 0 if your data is fine. Otherwise, it reports the exact lines and rules.
03 Repair
$ parallelogram check data.jsonl \
    --fix --output clean.jsonl
Mechanical fixes: dedupe, BOM strip, mojibake repair. Free, local, no network.
04 Ship
$ parallelogram check clean.jsonl
Re-validates clean. Ready to feed your trainer.
/ where it sits<br>One step. One file. No surprises downstream.
raw.jsonl (unverified) → parallelogram (6 rules) → clean.jsonl (verified) → trainer (no nasty surprises)
/ 05 — works with<br>Validates the formats your trainer already speaks.
axolotl
🦥 Unsloth
TRL
🤗 Hugging Face
OpenAI
GitHub Actions
v0.4 supports OpenAI chat ({"messages":[…]}) and ShareGPT ({"conversations":[…]}). Raw-completion support is shipping next.
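The two supported shapes carry the same information under different keys. As an illustration, the sketch below converts a ShareGPT-style record into the OpenAI chat shape. It assumes the common ShareGPT convention of `from`/`value` fields with `system`, `human`, and `gpt` speakers; other variants exist, and nothing here reflects how Parallelogram handles the formats internally.

```python
# Assumed ShareGPT speaker names mapped to OpenAI chat roles.
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> dict:
    """Convert {"conversations": [...]} into {"messages": [...]}.

    Raises KeyError on a speaker name outside the assumed mapping.
    """
    return {
        "messages": [
            {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }
```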
0 records uploaded · no telemetry, ever
6 rules in v0.4 · covering every silent killer
~0ms per 1k records · streaming, O(1) memory
$0 to use · Apache 2.0, forever
Run before you train.
Free, open...