How to Build an Agentic RAG with RubyLLM and Rails | Giovanni Panasiti - Personal Website and Blog
I run a RAG application for Italian pension and tax consultants. Users ask questions about INPS, professional pension funds, laws and regulations, and the app answers using a knowledge base of uploaded documents.
For a long time the app used the classic single-shot RAG pipeline: take the question, search the database, stuff the results into a system prompt, ask the model. It works, but it has a hard limit: the retrieval happens once, before the model has any chance to reason about the question. If the first search misses, the answer is bad and there is nothing the model can do about it.
So I rebuilt the pipeline as an agent. Now the model drives the retrieval itself: it decides what to search, reads the results, searches again with different terms, follows cross references between documents, and only then writes the answer. All in plain Ruby, with RubyLLM and Rails. No LangChain, no Python sidecar.
In this article I will show you exactly how it works, with the real code from my application. One note before we start: since the app serves Italian consultants, all the prompts, tool descriptions and user-facing strings are in Italian in the real codebase. I translated them to English here so you can follow along, but the structure is identical.
The stack
Rails 8.1, Ruby 3.3
RubyLLM ~> 1.15 for everything LLM related (chat, tools, embeddings, streaming)
PostgreSQL with pgvector, through the neighbor gem, for vector search
pg_search for full-text search (Italian dictionary)
Turbo Streams to stream the answer and the agent steps to the browser
A background job (Active Job) that runs the whole thing off the request cycle
Single-shot RAG vs agentic RAG
Quick recap of the two approaches, because the agentic version reuses almost everything from the single-shot one.
Single-shot RAG:
User asks a question
You embed the question and search your chunks
You build a context block from the top results
You put the context in the system prompt and call the LLM once
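To make the single-shot steps concrete, here is a minimal plain-Ruby sketch of steps 3 and 4. It assumes the hits are already retrieved and scored. `Hit` and `build_system_prompt` are hypothetical names, and the prompt wording is only illustrative, not the app's real (Italian) prompt.

```ruby
# Hypothetical stand-in for a retrieved chunk: source title, text and search score.
Hit = Struct.new(:title, :content, :score)

# Single-shot RAG, steps 3 and 4: keep the best hits, number them so the model
# can cite them, and place the resulting context block in the system prompt.
# Nothing here lets the model search again if these hits are the wrong ones.
def build_system_prompt(hits, top_k: 3)
  context = hits.max_by(top_k, &:score)
                .each_with_index
                .map { |hit, i| "[#{i + 1}] #{hit.title}\n#{hit.content}" }
                .join("\n\n")

  "Answer using only the sources below and cite them as [n].\n\n" + context
end

hits = [
  Hit.new("INPS circular", "Contribution rates for 2024.", 0.91),
  Hit.new("Fund bylaws", "Article 12 covers early retirement.", 0.42),
  Hit.new("Tax guide", "Deductions for pension contributions.", 0.77)
]
puts build_system_prompt(hits, top_k: 2)
```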
Agentic RAG:
User asks a question
You give the LLM a set of tools: search_knowledge_base, fetch_document_section, list_documents
The LLM calls the tools as many times as it needs, in a loop
When it has enough material, it writes the final answer with citations
If you have never worked with tool calling (also called function calling), here is the mental model. A “tool” is just a function you describe to the model: a name, a description, and the parameters it accepts. The model cannot run anything itself. When it decides it needs a tool, it stops generating text and replies with a structured request like “call search_knowledge_base with query: 'aliquote 2024'” (Italian for “2024 rates”). Your code runs the function and sends the result back as a new message, and the model continues from there: it can request another tool or write the final answer.
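To show that a tool really is just data plus a function, here is a provider-agnostic sketch in plain Ruby. The hash follows the JSON Schema style that function-calling APIs use for parameters, but the exact envelope differs per provider (and RubyLLM builds it for you). `run_tool_call`, the stub search and the shape of the `call` hash are hypothetical.

```ruby
require "json"

# What the model sees: a name, a description and a JSON Schema for the arguments.
SEARCH_TOOL = {
  name: "search_knowledge_base",
  description: "Search the uploaded documents and return the most relevant passages.",
  parameters: {
    type: "object",
    properties: { query: { type: "string", description: "Search terms" } },
    required: ["query"]
  }
}.freeze

# What your app runs. This stub stands in for the real hybrid search.
def search_knowledge_base(query:)
  "3 passages matching '#{query}'"
end

# The model only returns a structured request (tool name + JSON arguments).
# Your code parses it and runs the matching function; the returned string is
# what goes back to the model as the next message.
def run_tool_call(call)
  args = JSON.parse(call["arguments"], symbolize_names: true)
  case call["name"]
  when "search_knowledge_base" then search_knowledge_base(**args)
  else raise ArgumentError, "unknown tool: #{call["name"]}"
  end
end

request = { "name" => "search_knowledge_base", "arguments" => '{"query":"aliquote 2024"}' }
puts run_tool_call(request) # prints: 3 passages matching 'aliquote 2024'
```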
The key insight: you do not need to write that loop yourself. RubyLLM already has a tool loop built in. When you call chat.complete on a chat that has tools registered, RubyLLM sends the tool definitions to the model, executes the tools the model asks for, feeds the results back, and repeats until the model produces a normal text answer. Your job is only to write good tools and a good system prompt.
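For readers who want to see the shape of that built-in loop, here is a provider-free sketch of the same control flow. `FakeModel`, the message hashes and the `max_rounds` safety cap are assumptions made for illustration; they are not RubyLLM classes or options, and the real library talks to a provider API instead of replaying a scripted list of replies.

```ruby
# Hypothetical stand-in for an LLM: returns pre-scripted replies in order.
class FakeModel
  def initialize(script)
    @script = script
  end

  def reply(_messages)
    @script.shift
  end
end

# Tool loop: ask the model, run any tool it requests, append the result, and
# repeat until it answers with plain text. max_rounds is an added safety cap.
def complete(model, messages, tools, max_rounds: 10)
  max_rounds.times do
    reply = model.reply(messages)
    messages << reply
    return reply[:content] unless reply[:tool_call]

    call = reply[:tool_call]
    result = tools.fetch(call[:name]).call(**call[:arguments])
    messages << { role: :tool, name: call[:name], content: result }
  end
  raise "tool loop did not finish in #{max_rounds} rounds"
end

tools = { "search_knowledge_base" => ->(query:) { "2 passages about #{query}" } }
model = FakeModel.new([
  { role: :assistant, tool_call: { name: "search_knowledge_base", arguments: { query: "aliquote 2024" } } },
  { role: :assistant, content: "The 2024 rates are listed in source [1]." }
])
messages = [{ role: :user, content: "What are the 2024 rates?" }]
puts complete(model, messages, tools)
```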
Step 1: the retrieval primitive
The agent is only as good as its search tool. Mine is a hybrid search on the Chunk model that combines vector similarity and full-text search, merges them with Reciprocal Rank Fusion, and reranks for diversity with MMR. This is the same method the old single-shot pipeline used, so the agent got all of it for free.
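The fusion and reranking code is not shown at this point, so here is a minimal plain-Ruby sketch of the two techniques named above, working on ids and in-memory vectors instead of database rows. `k = 60` is the constant commonly used for Reciprocal Rank Fusion, and `relevance_weight` is the usual MMR trade-off parameter; both defaults and all method names are assumptions, not the app's actual settings.

```ruby
# Reciprocal Rank Fusion: every ranked list gives each id 1 / (k + rank),
# with rank starting at 1, and the summed scores decide the fused order.
# Ids found by both vector and full-text search rise to the top.
def reciprocal_rank_fusion(*ranked_lists, k: 60)
  scores = Hash.new(0.0)
  ranked_lists.each do |list|
    list.each_with_index { |id, index| scores[id] += 1.0 / (k + index + 1) }
  end
  scores.sort_by { |_id, score| -score }.map(&:first)
end

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Maximal Marginal Relevance: greedily pick the candidate with the best balance of
# similarity to the query and dissimilarity to what is already selected.
# relevance_weight = 1.0 is plain relevance ranking; lower values favour diversity.
def mmr(query_embedding, candidates, top_k:, relevance_weight: 0.5)
  selected = []
  pool = candidates.dup
  while selected.size < top_k && pool.any?
    best = pool.max_by do |cand|
      relevance = cosine_similarity(query_embedding, cand[:embedding])
      redundancy = selected.map { |s| cosine_similarity(cand[:embedding], s[:embedding]) }.max || 0.0
      relevance_weight * relevance - (1 - relevance_weight) * redundancy
    end
    selected << pool.delete_at(pool.index(best))
  end
  selected.map { |cand| cand[:id] }
end

fused = reciprocal_rank_fusion([:a, :b, :c], [:c, :a, :d]) # vector list, text list
p fused # => [:a, :c, :b, :d]

candidates = [
  { id: :a, embedding: [0.9, 0.436] },
  { id: :b, embedding: [0.89, 0.456] }, # near-duplicate of :a
  { id: :c, embedding: [0.8, -0.6] }
]
p mmr([1.0, 0.0], candidates, top_k: 2) # => [:a, :c]
```

With `relevance_weight: 1.0` the same call returns `[:a, :b]`: pure relevance keeps the near-duplicate, while MMR trades it for a different passage.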
A quick word on the data model first. When a user uploads a document, a background job splits it into Chunk records. Each chunk holds a piece of text (content), its order in the document (position), and an embedding: a vector of numbers, produced by an embedding model, that represents the meaning of the text. Texts with similar meaning get vectors that point in similar directions, so “find chunks similar to this question” becomes “find the nearest vectors”, which pgvector can do with an index.
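Here is “find the nearest vectors” as a brute-force plain-Ruby sketch, with three-dimensional toy embeddings instead of real ones (which have hundreds or thousands of dimensions). It assumes cosine distance, one of the metrics pgvector supports; which metric the app uses is not stated here, and `ToyChunk` and `nearest_chunks` are hypothetical names. pgvector computes the same kind of distance in SQL and can use an approximate index (HNSW or IVFFlat) instead of scanning every row.

```ruby
# Hypothetical in-memory stand-in for a Chunk row.
ToyChunk = Struct.new(:id, :content, :embedding)

# Cosine distance = 1 - cosine similarity: 0 for vectors pointing the same way.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  1.0 - dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Score every chunk against the question embedding and keep the closest ones.
def nearest_chunks(chunks, query_embedding, limit: 3)
  chunks.min_by(limit) { |chunk| cosine_distance(chunk.embedding, query_embedding) }
end

chunks = [
  ToyChunk.new(1, "Pension contribution rates", [0.9, 0.1, 0.0]),
  ToyChunk.new(2, "Office opening hours", [0.0, 1.0, 0.0]),
  ToyChunk.new(3, "Contribution deadlines", [0.5, 0.5, 0.0])
]
p nearest_chunks(chunks, [1.0, 0.0, 0.0], limit: 2).map(&:id) # => [1, 3]
```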
The model setup looks like this:
class Chunk < ApplicationRecord
  include PgSearch::Model
  belongs_to :document
  belongs_to :parent_chunk, class_name: "Chunk", optional: true
  has_neighbors :embedding
  pg_search_scope :text_search,
    against: :content,
    using: {
      tsearch: {
        prefix: true,
        dictionary: "italian",
        tsvector_column: "searchable"
      }
    }
end
Line by line:
has_neighbors :embedding comes from the neighbor gem. It tells Rails that the embedding column is a pgvector column and unlocks the nearest_neighbors query method, which I use later for the vector search.
belongs_to :parent_chunk is a self-reference: a chunk can point to a bigger chunk of the same document. This is the “small-to-big” pattern; I will explain it when we get to the fetch_document_section tool.
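To make the self-reference concrete, here is a minimal in-memory sketch of the small-to-big idea: search matches the small chunks, and the model receives their larger parents, each only once. `SmallChunk` and `expand_to_parents` are hypothetical; in the app the link is the `parent_chunk` association shown above, and the real expansion logic may differ.

```ruby
# Hypothetical in-memory stand-in for Chunk and its optional parent_chunk link.
SmallChunk = Struct.new(:id, :content, :parent_chunk)

# Small-to-big: precise matches come from small chunks, but the model gets the
# bigger parent for context. Chunks without a parent are returned as they are,
# and a parent shared by several hits appears only once.
def expand_to_parents(hits)
  hits.map { |chunk| chunk.parent_chunk || chunk }.uniq(&:id)
end

section = SmallChunk.new(1, "Full section on early retirement rules", nil)
hits = [
  SmallChunk.new(2, "Sentence about the minimum age", section),
  SmallChunk.new(3, "Sentence about contribution years", section),
  SmallChunk.new(4, "Standalone note", nil)
]
p expand_to_parents(hits).map(&:id) # => [1, 4]
```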
pg_search_scope...