Building ArXiv Scholar: A Production RAG Pipeline — Without a GPU, Without LangChain, and Without a Budget
TL;DR
We built a from-scratch RAG pipeline (no abstraction frameworks) over a curated 5,600-paper subset of the arXiv corpus. We processed everything on 6 free Google Colab accounts running in parallel, uploaded the results to Qdrant's free cloud tier, and shipped a lightweight streaming frontend with intelligent query routing, HyDE, query decomposition, and hybrid dense+sparse search. Our entire production infrastructure cost exactly $1, for the domain name.
We wanted to search academic papers the way researchers actually think — not keyword-matching against titles, but asking real questions like "What is the state of the art for long-context attention mechanisms published after 2023?" and getting back grounded, cited answers from actual arXiv publications.
So we built ArXiv Scholar: an end-to-end Retrieval-Augmented Generation (RAG) system that ingests, parses, chunks, embeds, and searches thousands of academic papers from arXiv. No LangChain. No GPU in production. No paid infrastructure.
This post is the honest story of building it — what worked, what didn't, and the engineering tricks that made a zero-budget project achieve 98.8% True Recall@20 with high-precision reranking over 5,600 papers.
Why We Built This
Every week, thousands of new papers appear on arXiv. Researchers rely on keyword searches, Twitter threads, or manually scrolling through listings to find relevant work. Traditional search over arXiv — including arXiv's own search — matches against titles and abstracts using basic text retrieval. It doesn't understand concepts.
We asked a simple question: What if you could ask arXiv a question in plain English and get back a synthesized, cited answer from the actual papers?
The catch was our constraints:
Zero compute budget. No AWS, no GCP, no rented GPUs. Our total bill was exactly $1 for the custom domain.
No high-level frameworks. We wanted full architectural control — no LangChain, no LlamaIndex — just Python, raw API calls, and an understanding of what every byte was doing.
Free-tier everything. Free Colab for processing, free Qdrant Cloud for vector storage, free arXiv data from GCS, API hosted on Hugging Face Spaces, frontend on GitHub Pages, and Cloudflare free-tier for routing.
These constraints weren't limitations — they were design parameters. They forced us to make thoughtful engineering decisions at every layer.
The Architecture at a Glance
The system is split into two decoupled halves: an ingestion pipeline that runs offline (in Colab), and a retrieval pipeline that serves live queries. Let's walk through each decision.
Component Deep-Dive
1. Data Acquisition: Free Access to 1.4TB of Science
ArXiv mirrors its entire publication archive as a public Google Cloud Storage bucket (arxiv-dataset). Every paper ever uploaded — over 3 million PDFs, roughly 1.4TB — is freely accessible via anonymous GCS reads.
```python
from google.cloud import storage

# Zero credentials, zero cost: anonymous read access to the public bucket
client = storage.Client.create_anonymous_client()
bucket = client.bucket("arxiv-dataset")
```
Our ArxivUnifiedEngine is a stateful, crash-safe batch downloader. It tracks progress with a JSON cursor persisted to disk after every single file:
```json
{"current_month": "2604", "last_file": "2604.04869.pdf"}
```
If the process crashes mid-batch, a restart resumes at exactly the next unprocessed file. No duplicates, no gaps. The engine rolls over month boundaries (2604 → 2605) and switches from historical backfill to live mode once it catches up to the present.
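To make the cursor idea concrete, here is a minimal sketch of how such a checkpoint could work. It is not the actual ArxivUnifiedEngine code: the helper names (`load_cursor`, `save_cursor`, `next_month`, `pending_files`) are hypothetical, and it assumes file names within a month sort lexicographically in processing order.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def load_cursor(path: Path, start_month: str) -> dict:
    """Return the saved cursor, or a fresh one if no cursor file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"current_month": start_month, "last_file": None}


def save_cursor(path: Path, cursor: dict) -> None:
    """Write the cursor atomically so a crash never leaves a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(cursor, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)  # atomic rename onto the real cursor file


def next_month(yymm: str) -> str:
    """Advance an arXiv-style YYMM string by one month (2604 -> 2605, 2612 -> 2701)."""
    year, month = int(yymm[:2]), int(yymm[2:])
    year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year % 100:02d}{month:02d}"


def pending_files(files_in_month: list, last_file: Optional[str]) -> list:
    """Files that sort strictly after the cursor, i.e. not yet processed."""
    ordered = sorted(files_in_month)
    if last_file is None:
        return ordered
    return [name for name in ordered if name > last_file]
```

A download loop built on this would call `save_cursor` after each successful file, and call `next_month` (resetting `last_file` to `None`) once `pending_files` comes back empty for the current month.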
The curation decision: While the pipeline can ingest all 3 million papers, free-tier Qdrant comfortably holds ~5,600 papers' worth of embeddings. So we built a 4-stage manifest filter:
Papers must be updated after January 2022 and belong to core CS categories (cs.AI, cs.CL, cs.IR, cs.LG, cs.SE)
Aggressive anti-noise filtering to exclude cross-listed medical, physics, and pure math papers
Inclusion requires mentions of VIP tools (vLLM, LangChain, etc.) OR dense keyword matches across 3+ AI topic groups
Budget cap at exactly 5,600 papers, ranked by relevance tier and recency
This manifest is a cost-saving measure, not a technical limitation. Remove it, and the same pipeline ingests millions.
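The four stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the real filter: the `Paper` record, the noise-category prefixes, and the topic keyword groups are all hypothetical stand-ins, and "after January 2022" is read here as "on or after 2022-01-01".

```python
from dataclasses import dataclass
from datetime import date

CORE_CATEGORIES = {"cs.AI", "cs.CL", "cs.IR", "cs.LG", "cs.SE"}
# Assumed prefixes standing in for "medical, physics, and pure math" cross-lists.
NOISE_PREFIXES = ("physics.", "math.", "q-bio.", "astro-ph", "cond-mat")
# The post names these two tools; the full list is not given.
VIP_TOOLS = ("vllm", "langchain")
# Hypothetical topic groups; the post does not list the real ones.
TOPIC_GROUPS = {
    "retrieval": ("retrieval", "rerank", "vector search"),
    "llm": ("language model", "transformer", "attention"),
    "training": ("fine-tuning", "distillation", "rlhf"),
    "agents": ("agent", "tool use", "planning"),
}
CUTOFF = date(2022, 1, 1)  # assumed reading of "after January 2022"
BUDGET = 5600


@dataclass
class Paper:
    arxiv_id: str
    categories: list
    updated: date
    text: str  # title plus abstract


def relevance_tier(paper: Paper) -> int:
    """0 = reject, 1 = keyword hits in 3+ topic groups, 2 = mentions a VIP tool."""
    text = paper.text.lower()
    if any(tool in text for tool in VIP_TOOLS):
        return 2
    groups_hit = sum(any(kw in text for kw in kws) for kws in TOPIC_GROUPS.values())
    return 1 if groups_hit >= 3 else 0


def build_manifest(papers: list, budget: int = BUDGET) -> list:
    kept = []
    for p in papers:
        # Stage 1: recent enough and in at least one core CS category
        if p.updated < CUTOFF or not CORE_CATEGORIES.intersection(p.categories):
            continue
        # Stage 2: drop papers cross-listed into noisy fields
        if any(c.startswith(NOISE_PREFIXES) for c in p.categories):
            continue
        # Stage 3: require a VIP tool mention or dense topic coverage
        if relevance_tier(p) == 0:
            continue
        kept.append(p)
    # Stage 4: rank by tier, then recency, and cut at the budget
    kept.sort(key=lambda p: (relevance_tier(p), p.updated), reverse=True)
    return kept[:budget]
```

Raising or removing `budget` is the only change needed to widen the manifest, which matches the point that the cap is about storage cost rather than pipeline capability.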
2. Layout-Aware Chunking with Docling
This is where most RAG pipelines fail silently. The default approach — split every 500 characters — destroys the semantic structure of academic papers. You end up with chunks that start mid-equation, split a table in half, or separate a section header from its content.
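A toy example shows the failure mode. The snippet below is a generic fixed-width splitter, not code from our pipeline, and the sample document and the 40-character window are made up to keep the output small:

```python
def fixed_size_chunks(text: str, size: int = 500) -> list:
    """Naive chunking: cut every `size` characters, ignoring document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# A made-up miniature "paper": a heading, a formula sentence, a heading, a table.
doc = (
    "3 Method\n"
    "We score each query-document pair with s(q, d) = q . d / (|q| |d|).\n"
    "4 Results\n"
    "| model | recall |\n"
    "| base  | 0.91   |\n"
)

chunks = fixed_size_chunks(doc, size=40)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

With this window the formula sentence is cut in the middle, the "4 Results" heading is split across two chunks, and the table rows land in different chunks from their header. None of those chunks would embed or retrieve cleanly on its own.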
We use IBM's Docling library for visual document understanding. Instead of treating a PDF as a flat string,...