A file-level tree that lets an LLM reason over a document corpus


PageIndex File System: Massive-Scale Document Search


PageIndex now scales to millions of documents

Available today for enterprise. Cloud rollout coming soon. (Get early access)

We started PageIndex with one belief: retrieval over long documents should look more like human reading than like semantic similarity search. Since launch, the open-source PageIndex, one of the fastest-growing AI-infra repos on GitHub, has crossed 26k GitHub stars in a few months, hit #1 on GitHub Trending, been selected for the GitHub Secure Open Source Fund, and now serves 23k+ cloud users in production.

Today we're announcing the next chapter: the PageIndex File System, a new layer on top of the vectorless retrieval engine that lets a single index reason over millions of documents. It ships today as part of PageIndex Enterprise, with a cloud edition arriving later this month.

This post is a quick tour: why classic vector-based RAG hits a ceiling, what PageIndex is, why a plain file system stops working at this scale, and what the PageIndex File System adds to get past it.

Where classic vector-based RAG breaks

The standard RAG recipe is by now familiar: chunk every document into passages, run each chunk through an embedding model to get a fixed-size vector, store those vectors in a vector database, and at query time embed the question and pull back the top-K nearest neighbors. It works, until it doesn't. Two things go wrong, and both get worse as the corpus grows.
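The recipe above can be sketched in a few lines. This is a toy illustration, not any particular vector database's API: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the three chunks are made-up sample text.

```python
import math
from collections import Counter


def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int) -> list[tuple[float, str]]:
    """Embed the query and every chunk, then keep the k highest-scoring chunks."""
    vocab = sorted({w for text in chunks + [query] for w in text.lower().split()})
    query_vec = embed(query, vocab)
    scored = [(cosine(query_vec, embed(chunk, vocab)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up sample chunks (assumption, for illustration only).
chunks = [
    "the tenant is liable for damage to the property",
    "the landlord is liable for damage to the property",
    "rent is due on the first day of each month",
]
results = top_k("who is liable for damage", chunks, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

In this toy setup the tenant and landlord clauses receive identical scores even though they assign liability to opposite parties, and anything ranked below `k` is never seen by the model.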

1. Embeddings have limited representation power.

A single fixed-length vector has to summarize an entire chunk into a few hundred numbers, and embedding models cap their input length at a few hundred or a few thousand tokens. That cap forces two compromises that quietly degrade quality:

Chunking breaks semantic continuity. Real documents have sections, tables, footnotes, and cross-references that flow across page boundaries. Slicing them into fixed-size windows shreds those dependencies. The chunk that contains the answer is often missing the context that makes the answer make sense.

Retrieval is blind to context. Only the user's literal query gets embedded. The conversation that came before, the user's role, the evolving intent of a multi-turn dialogue: all of that has to be discarded before encoding. The retriever sees a context-stripped probe, not a real question in a real situation.
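The first of these compromises, chunking, is easy to reproduce. The sketch below assumes a naive fixed-size window counted in words (real pipelines usually count tokens and often add overlap); `chunk_fixed` and the sample sentence are hypothetical.

```python
def chunk_fixed(words: list[str], size: int) -> list[str]:
    """Split a word list into consecutive windows of at most `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Made-up sample sentence (assumption, for illustration only).
text = (
    "The supplier is liable for all defects unless the exception "
    "in clause nine applies to the order"
)
pieces = chunk_fixed(text.split(), size=7)
for piece in pieces:
    print(repr(piece))
```

The first window reads as an unconditional statement of liability; the "unless" that qualifies it lands in the next chunk, which may or may not be retrieved alongside it.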

2. Similarity is not the same as relevance.

Vector search ranks by cosine similarity to the query. But what users actually want is relevance, and the two come apart in both directions:

Similar but not relevant (low accuracy). In professional domains (legal, medical, financial), language is repetitive and small differences carry critical meaning. Two paragraphs can look almost identical to an embedding model and yet say opposite things about who is liable, what dose to give, or which clause applies. Vector search happily returns the wrong one because it "looks right".

Relevant but not similar (low recall). Conversely, the right answer is often phrased very differently from the query, or lives many sections away from the most-cited passage. Finding it takes reasoning over the document's structure, not surface-level word matching. Vector search has no mechanism for that, so the genuinely relevant chunk falls past rank K and disappears silently. You don't get an error; you just get a worse answer.

These aren't edge cases. They're the two failure modes our enterprise customers hit again and again, and they're exactly what motivated us to build a different kind of retriever.

What is PageIndex?

PageIndex is a vectorless RAG framework. Instead of chopping documents into chunks, embedding them into vectors, and ranking by cosine similarity, PageIndex represents each document as a tree (sections nest into subsections, subsections into pages, pages into content blocks) and lets an LLM navigate the tree to find the answer.
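To make the tree idea concrete, here is a minimal sketch that turns a flat outline of `(level, title)` pairs into a nested structure. The `Node` class, `build_tree`, and the sample outline are assumptions for illustration, not the PageIndex schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a table-of-contents-style tree (hypothetical shape)."""
    title: str
    level: int
    children: list["Node"] = field(default_factory=list)


def build_tree(outline: list[tuple[int, str]]) -> Node:
    """Nest (level, title) pairs under the nearest preceding shallower heading."""
    root = Node("document", 0)
    stack = [root]  # path from the root to the most recently added node
    for level, title in outline:
        node = Node(title, level)
        while stack[-1].level >= level:
            stack.pop()  # climb until the top of the stack can be the parent
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node: Node, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up outline (assumption, for illustration only).
outline = [
    (1, "1 Overview"),
    (2, "1.1 Scope"),
    (2, "1.2 Definitions"),
    (1, "2 Liability"),
    (2, "2.1 Caps"),
    (3, "2.1.1 Exceptions"),
]
root = build_tree(outline)
print("\n".join(render(root)))
```

A real index would also attach page ranges, summaries, or content blocks to each node; this sketch keeps only the nesting.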

The shape of the tree is the table of contents you'd see in a book. The retrieval policy is an LLM that, at each node, asks a single question: given the user's query, the conversation so far, and where I am in the document, should I look inside this subtree? No fixed top-K, no embedding bottleneck, no information dropped silently because it ranked K+1.
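A rough sketch of that traversal, under stated assumptions: the tree is a nested dict with hypothetical `title`, `summary`, and `children` keys, and `keyword_judge` is a crude word-overlap stand-in for the LLM's yes/no decision. None of these names come from PageIndex itself.

```python
def navigate(node: dict, query: str, should_open, path: tuple = ()) -> list[tuple]:
    """Depth-first walk that descends only into subtrees the judge accepts.

    Returns the full path to every leaf that was reached.
    """
    here = path + (node["title"],)
    if not node.get("children"):
        return [here]
    hits = []
    for child in node["children"]:
        # The judge sees the query, the path walked so far, and the candidate.
        if should_open(query, here, child):
            hits.extend(navigate(child, query, should_open, here))
    return hits


def keyword_judge(query: str, path: tuple, child: dict) -> bool:
    """Hypothetical stand-in for the LLM call: any word shared with the summary."""
    query_words = set(query.lower().split())
    return bool(query_words & set(child["summary"].lower().split()))


# Made-up document tree (assumption, for illustration only).
doc = {
    "title": "Master Services Agreement",
    "summary": "",
    "children": [
        {"title": "2 Payment", "summary": "fees invoices payment terms", "children": [
            {"title": "2.1 Late fees", "summary": "interest on late payment", "children": []},
        ]},
        {"title": "5 Liability", "summary": "liability caps and indemnity", "children": [
            {"title": "5.1 Cap", "summary": "liability cap amount", "children": []},
            {"title": "5.2 Carve-outs", "summary": "exceptions to the liability cap", "children": []},
        ]},
    ],
}
hits = navigate(doc, "liability cap", keyword_judge)
for hit in hits:
    print(" > ".join(hit))
```

Every subtree the judge accepts is followed, so the number of results is decided by the judgments rather than by a preset K. In practice the judge would be a model call that also receives conversation history, which a keyword check cannot capture.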

Three properties fall out of this design, and each one is exactly what classic vector RAG cannot offer:

Relevance classification, not semantic similarity. The LLM doesn't compute a cosine score; it makes a yes/no judgment at every node (is this subtree worth opening for this query?) using full-document understanding, not a 768-dimensional proxy. The two failure modes of similarity search (similar-but-irrelevant, relevant-but-dissimilar) simply don't apply.

Retrieval depends on context. The decision at each node is conditioned on the query, the conversation history, the user's role, and the path the LLM has already walked. There's no fixed-length cap forcing context to...
