FineSet: training datasets, described in plain English<br>Live on HuggingFace · datasets shipped<br>Your fine-tune deserves better training data.<br>Describe a research area in plain English. FineSet pulls from arXiv, Semantic Scholar, GitHub and more, merges duplicates, scores them, and hands you clean JSONL or Parquet, refreshed on a schedule.<br>Start for free · See pricing<br>Pulls from: arXiv, Semantic Scholar, GitHub + others<br>Ships: JSONL, Parquet, HF push
raw · duplicated · unscored → dataset.jsonl
4 datasets live on HF<br>4,800+ papers assembled<br>Weekly refresh cadence<br>0.92 dedup similarity cutoff
The problem<br>Assembling a training set by hand is the worst part.<br>You know the data exists. Getting it clean, deduplicated, and fresh is a week of glue code you rewrite for every project.<br>Doing it yourself<br>Scrape arXiv, then Semantic Scholar, then reconcile two schemas by hand.<br>Same paper, three times, under three slightly different titles.<br>No quality signal, so you train on the noise too.<br>Stale the moment you download it. Re-run the whole mess next month.<br>Scrub PII and malformed records before anything touches your model.
With FineSet<br>Describe the topic once; sources are wired in for you.<br>Cross-source dedup merges the same record into one row.<br>Every record carries a 0–1 quality score you can filter on.<br>Refreshes on a schedule. Never just the version you downloaded last month.<br>PII stripped and fields normalized before it reaches you.
The pipeline<br>How it works<br>A topic you type becomes a clean dataset in three steps. No scrapers to write, nothing to maintain.<br>[Diagram: a topic you type (e.g. "llm agents") queries every relevant source (arXiv, Semantic Scholar, …); records flow through Normalize, Merge, and Score into dataset.jsonl.]<br>01 Describe your topic<br>Give it keywords and, if you want, arXiv categories. Something like "RLHF, cs.LG, since 2023".
02 The pipeline assembles it<br>It pulls from every source that fits, normalizes the fields, merges duplicates, scores quality, and strips PII.
03 Download JSONL<br>Export it and it stays fresh, refreshed on a schedule. Parquet and a one-click push to HuggingFace are coming soon.
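To make the "normalizes the fields" step concrete, here is a minimal Python sketch of reconciling two source schemas into one. It is an illustration only, not FineSet's actual code: the common field names (`title`, `abstract`, `url`, `source`) and the input keys for each source are assumptions.

```python
from __future__ import annotations

from typing import Any

# Hypothetical common schema; FineSet's real field names are not documented here.
COMMON_FIELDS = ("title", "abstract", "url", "source")


def _clean(text: str | None) -> str:
    """Collapse newlines and runs of whitespace; treat a missing value as empty."""
    return " ".join((text or "").split())


def normalize_arxiv(raw: dict[str, Any]) -> dict[str, Any]:
    """Map an arXiv-shaped record (assumed keys: title, summary, id)."""
    return {
        "title": _clean(raw["title"]),
        "abstract": _clean(raw.get("summary")),
        "url": raw["id"],
        "source": "arxiv",
    }


def normalize_semantic_scholar(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a Semantic Scholar-shaped record (assumed keys: title, abstract, url)."""
    return {
        "title": _clean(raw["title"]),
        "abstract": _clean(raw.get("abstract")),
        "url": raw["url"],
        "source": "semantic_scholar",
    }
```

Once every source maps onto the same keys, the later merge and score steps can treat all records alike.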
Try it<br>Type any topic, see real records.<br>No signup. Animals, cars, LLMs, RLHF: type anything. We check the shared pool first, and if it's a topic we don't cover yet, we pull a live sample across arXiv, GitHub and Hacker News (papers, repos and threads).<br>Preview records: RLHF · efficient LLMs · MoE architectures · mechanistic interpretability
What's inside<br>What's in every dataset<br>The boring, necessary work — done before the file reaches you, every refresh.<br>Scheduled refresh<br>Subscribe to a topic and new matching records flow in on a schedule. The dataset is never the stale copy you downloaded last month.
Cross-source dedup<br>The same record from different sources merges into one row, with every source kept. Matching uses pgvector similarity at a 0.92 cutoff.
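A similarity cutoff of this kind can be sketched in plain Python. The block is a stand-in for illustration, not the pgvector query FineSet runs. It assumes cosine similarity as the metric and that each record carries hypothetical `embedding` and `sources` fields.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_duplicates(records: list[dict], cutoff: float = 0.92) -> list[dict]:
    """Greedy merge: a record joins the first kept row it is similar enough to.

    The kept row retains its own fields and accumulates every source seen.
    """
    merged: list[dict] = []
    for rec in records:
        for row in merged:
            if cosine_similarity(rec["embedding"], row["embedding"]) >= cutoff:
                row["sources"].extend(
                    s for s in rec["sources"] if s not in row["sources"]
                )
                break
        else:
            # No similar row found: keep this record, copying its source list
            # so the caller's input is not mutated by later merges.
            merged.append({**rec, "sources": list(rec["sources"])})
    return merged
```

This loop compares each record against every kept row, which is fine for a sketch but quadratic on large sets. A vector index avoids that. If the metric is cosine, pgvector's `<=>` operator returns cosine distance, so a 0.92 similarity cutoff corresponds to a distance of at most 0.08.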
Quality scores<br>Every record carries a 0–1 score from citation signal, so you can filter the noise out before you ever start training.
JSONL today, more soon<br>One clean record per line, normalized fields, ready for your loader. Parquet and one-click push to HuggingFace are next.
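Because the export is one JSON object per line, loading it and filtering on the quality score takes only a few lines of Python. This is a hedged sketch: the field name `quality_score` and the 0.7 threshold in the usage comment are assumptions, so check an actual export for the real key.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def iter_records(
    lines: Iterable[str],
    min_score: float = 0.0,
    score_field: str = "quality_score",  # hypothetical field name
) -> Iterator[dict]:
    """Yield parsed JSONL records whose score is at least min_score.

    Blank lines are skipped; records without a score are treated as 0.0.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(score_field, 0.0) >= min_score:
            yield record


# Usage, assuming a downloaded file named dataset.jsonl:
#     with open("dataset.jsonl", encoding="utf-8") as f:
#         kept = list(iter_records(f, min_score=0.7))
```

Streaming line by line keeps memory use flat, which matters once a dataset holds thousands of records.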
PII stripped<br>Emails, phone numbers, and other PII are removed before anything is stored — so what you train on stays clean.
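As a rough illustration of what stripping emails and phone numbers can look like, here is a small regex-based Python sketch. It is not FineSet's implementation. The patterns, the 10-digit heuristic, and the `[EMAIL]` and `[PHONE]` placeholders are all assumptions, and regexes alone will miss some PII.

```python
from __future__ import annotations

import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Candidate phone numbers: a digit run with common separators.
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def _mask_phone(match: re.Match) -> str:
    """Mask only candidates with 10+ digits, so short numeric strings
    such as years or arXiv identifiers are left alone (a heuristic)."""
    digits = sum(ch.isdigit() for ch in match.group())
    return "[PHONE]" if digits >= 10 else match.group()


def strip_pii(text: str) -> str:
    """Replace email addresses and likely phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub(_mask_phone, text)
```

Placeholders are used here instead of outright deletion so the surrounding sentence stays readable. Substituting an empty string would remove the match entirely.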
Your own topic<br>Not just our templates. Describe any research area or data domain and FineSet builds the pipeline for it.
Live now<br>Datasets already shipping<br>Download a sample in the exact JSONL you'd export — no account. The full dataset is a free account away.
Who it's for<br>Built for people who train models<br>Same pipeline, different output. If you need domain-specific records and you don't want to babysit scrapers, it's for you.<br>ML practitioners<br>Thousands of papers on your topic as a JSONL dataset that refreshes on a schedule. Quality scored, so you can filter out the noise before you train.<br>e.g. “RLHF, cs.LG, since 2023”<br>Researchers<br>Every new paper in your subfield, structured and deduplicated, waiting for you in the morning.<br>e.g. “mechanistic interpretability”<br>Builders<br>Tell it what data you want to track and the records keep flowing. No scrapers to write, none to babysit.<br>e.g. “LLM agents + tool use”
Pricing<br>Free while we're in beta.<br>No plans, no limits to weigh up. Build datasets and export them at no cost while we're early. Paid tiers come later, and you'll hear about them first.<br>Free while in beta<br>$0<br>Every dataset, every export: free for now. No card, no plans to pick. We'll give plenty of notice before anything costs money.<br>Start for free
FAQ<br>Questions, answered straight.<br>How fresh is the data?<br>Pipelines refresh on a schedule. New papers land in your dataset the week they post to arXiv, with no scraper to maintain.<br>What export formats are available?<br>JSONL today, with Parquet coming soon. The shape matches what's already on...