Show HN: Describe a research topic, get a daily-updated ArXiv/S2 dataset

FineSet: training datasets, described in plain English

Live on HuggingFace · datasets shipped

Your fine-tune deserves better training data.

Describe a research area in plain English. FineSet pulls from arXiv, Semantic Scholar, GitHub and more, merges duplicates, scores them, and hands you clean JSONL or Parquet, refreshed on a schedule.

Pulls from: arXiv, Semantic Scholar, GitHub + others
Ships: JSONL, Parquet, HF push

raw · duplicated · unscored → dataset.jsonl

4 datasets live on HF
4,800+ papers assembled
weekly refresh cadence
0.92 dedup similarity cutoff

The problem

Assembling a training set by hand is the worst part.

You know the data exists. Getting it clean, deduplicated, and fresh is a week of glue code you rewrite for every project.

Doing it yourself
Scrape arXiv, then Semantic Scholar, then reconcile two schemas by hand.
Same paper, three times, under three slightly different titles.
No quality signal, so you train on the noise too.
Stale the moment you download it. Re-run the whole mess next month.
Scrub PII and malformed records before anything touches your model.

With FineSet
Describe the topic once; sources are wired in for you.
Cross-source dedup merges the same record into one row.
Every record carries a 0–1 quality score you can filter on.
Refreshes on a schedule. Never just the version you downloaded last month.
PII stripped and fields normalized before it reaches you.

The pipeline

How it works

A topic you type becomes a clean dataset in three steps. No scrapers to write, nothing to maintain.

[Diagram: a topic you type queries every relevant source; records flow through normalize, merge, and score into dataset.jsonl. Example: "llm agents" → arXiv, Sem. Scholar, … → Normalize → Merge → Score → dataset.jsonl]

01 Describe your topic
Give it keywords and, if you want, arXiv categories. Something like "RLHF, cs.LG, since 2023".
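To make step 01 concrete, here is a minimal sketch of how a topic string like the one above could be split into keywords, arXiv categories, and a start year. The `parse_topic` helper, its regular expressions, and its output keys are hypothetical illustrations, not FineSet's actual parser.

```python
import re

# Hypothetical helper, not part of FineSet: one way to read a topic string.
# Covers category forms like cs.LG or stat.ML only.
CATEGORY_RE = re.compile(r"^[a-z\-]+\.[A-Z]{2}$")
SINCE_RE = re.compile(r"^since\s+(\d{4})$", re.IGNORECASE)


def parse_topic(text: str) -> dict:
    """Split a comma-separated topic into keywords, categories, and a start year."""
    keywords, categories, since = [], [], None
    for part in (p.strip() for p in text.split(",")):
        if not part:
            continue  # ignore empty segments such as a trailing comma
        match = SINCE_RE.match(part)
        if match:
            since = int(match.group(1))
        elif CATEGORY_RE.match(part):
            categories.append(part)
        else:
            keywords.append(part)
    return {"keywords": keywords, "categories": categories, "since": since}


topic = parse_topic("RLHF, cs.LG, since 2023")
```

Anything that is neither a category nor a "since YEAR" clause is treated as a keyword, so free-form phrases such as "mechanistic interpretability" pass through unchanged.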

02 The pipeline assembles it
It pulls from every source that fits, normalizes the fields, merges duplicates, scores quality, and strips PII.
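The normalize step is the part that reconciles different source schemas. A minimal sketch, assuming arXiv entries expose the abstract as `summary` and Semantic Scholar entries expose it as `abstract`; the output schema (`title`, `abstract`, `sources`) and the `normalize` function are hypothetical, not FineSet's actual record shape.

```python
def normalize(raw: dict, source: str) -> dict:
    """Map a source-specific record onto one shared (hypothetical) schema."""
    if source == "arxiv":
        # Assumption: the abstract lives under "summary" for arXiv entries.
        title, abstract = raw.get("title", ""), raw.get("summary", "")
    elif source == "semantic_scholar":
        # Assumption: the abstract lives under "abstract" and may be null.
        title, abstract = raw.get("title", ""), raw.get("abstract") or ""
    else:
        raise ValueError(f"unknown source: {source}")
    return {
        # Collapse line breaks and repeated spaces so titles compare cleanly.
        "title": " ".join(title.split()),
        "abstract": " ".join(abstract.split()),
        "sources": [source],
    }


record = normalize({"title": "A  Paper\n Title", "summary": "Some text."}, "arxiv")
```

Keeping `sources` as a list from the start makes it straightforward for a later merge step to record every source a paper was seen in.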

03 Download JSONL
Export it and it stays fresh, refreshed every day. Parquet and a one-click push to HuggingFace are coming soon.

Try it

Type any topic, see real records. No signup.

Animals, cars, LLMs, RLHF: type anything. We check the shared pool first, and if it's a topic we don't cover yet we pull a live sample of papers, repos and threads across arXiv, GitHub and Hacker News.

Example topics: RLHF, efficient LLMs, MoE architectures, mechanistic interpretability

What's inside

What's in every dataset

The boring, necessary work, done before the file reaches you, every refresh.

Scheduled refresh
Subscribe to a topic and new matching records flow in on a schedule. The dataset is never the stale copy you downloaded last month.

Cross-source dedup
The same record from different sources merges into one row, with every source kept, using pgvector similarity at a 0.92 cutoff.
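The idea behind the similarity cutoff can be sketched in plain Python, without a database. This assumes the 0.92 figure is a cosine similarity over record embeddings, which the page does not state; the `embedding` and `sources` field names and the greedy merge strategy are hypothetical. In pgvector itself, the `<=>` operator returns cosine distance (1 minus cosine similarity), so under that assumption a 0.92 similarity cutoff would correspond to a distance of at most 0.08.

```python
import math

CUTOFF = 0.92  # similarity cutoff quoted on the page; metric assumed to be cosine


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_duplicates(records: list) -> list:
    """Greedy merge: fold each record into the first kept row it is similar to."""
    merged = []
    for rec in records:
        for kept in merged:
            if cosine(rec["embedding"], kept["embedding"]) >= CUTOFF:
                # Same paper seen again: keep one row, remember every source.
                for source in rec["sources"]:
                    if source not in kept["sources"]:
                        kept["sources"].append(source)
                break
        else:
            # No similar row found, so this record starts a new row.
            merged.append({**rec, "sources": list(rec["sources"])})
    return merged


rows = merge_duplicates([
    {"title": "Paper A", "embedding": [1.0, 0.0], "sources": ["arxiv"]},
    {"title": "Paper A.", "embedding": [0.99, 0.05], "sources": ["semantic_scholar"]},
    {"title": "Paper B", "embedding": [0.0, 1.0], "sources": ["github"]},
])
```

This pairwise loop is quadratic in the number of records; a vector index in the database is what would make the same comparison practical at scale.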

Quality scores
Every record carries a 0–1 score from citation signal, so you can filter the noise out before you ever start training.
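Filtering on the score before training could look like the sketch below. The field name `quality_score` and the 0.5 threshold are assumptions made for illustration; check the actual export for the real key.

```python
import json
from typing import Iterable, Iterator


def filter_by_score(lines: Iterable[str], min_score: float = 0.5,
                    field: str = "quality_score") -> Iterator[dict]:
    """Yield JSONL records whose score meets the threshold.

    `field` is a hypothetical key name. Records without it count as 0.0.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(field, 0.0) >= min_score:
            yield record


sample = [
    '{"title": "Paper A", "quality_score": 0.91}',
    '{"title": "Paper B", "quality_score": 0.12}',
    '',
]
kept = list(filter_by_score(sample, min_score=0.5))
```

Because the function accepts any iterable of lines, an open file handle for a downloaded dataset.jsonl can be passed in directly, and records stream through without loading the whole file.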

JSONL today, more soon
One clean record per line, normalized fields, ready for your loader. Parquet and one-click push to HuggingFace are next.

PII stripped
Emails, phone numbers, and other PII are removed before anything is stored, so what you train on stays clean.
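A rough sketch of what regex-based scrubbing of emails and phone numbers can look like. The patterns and the `[EMAIL]` / `[PHONE]` placeholders are illustrative assumptions, not FineSet's actual rules, and simple patterns like these will miss some formats and over-match others.

```python
import re

# Illustrative patterns only. Real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


cleaned = strip_pii("Contact jane.doe@example.org or +1 (555) 123-4567.")
```

One caveat with this sketch: the phone pattern also matches other long digit runs, including arXiv-style identifiers such as 2301.12345, so it should only be applied to free-text fields and never to ID fields.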

Your own topic
Not just our templates. Describe any research area or data domain and FineSet builds the pipeline for it.

Live now

Datasets already shipping

Download a sample in the exact JSONL you'd export, no account needed. The full dataset is a free account away.

Who it's for

Built for people who train models

Same pipeline, different output. If you need domain-specific records and you don't want to babysit scrapers, it's for you.

ML practitioners
Thousands of papers on your topic as a JSONL dataset that refreshes on a schedule. Quality scored, so you can filter out the noise before you train.
e.g. “RLHF, cs.LG, since 2023”

Researchers
Every new paper in your subfield, structured and deduplicated, waiting for you in the morning.
e.g. “mechanistic interpretability”

Builders
Tell it what data you want to track and the records keep flowing. No scrapers to write, none to babysit.
e.g. “LLM agents + tool use”

Pricing

Free while we're in beta.

No plans, no limits to weigh up. Build datasets and export them at no cost while we're early. Paid tiers come later, and you'll hear about them first.

Free while in beta: $0
Every dataset, every export, free for now. No card, no plans to pick. We'll give plenty of notice before anything costs money.

FAQ

Questions, answered straight.

How fresh is the data?
Pipelines refresh on a schedule. New papers land in your dataset the week they post to arXiv, with no scraper to maintain.

What export formats are available?
JSONL today, with Parquet coming soon. The shape matches what's already on...
