Lessons We Learned Building a RAG Assistant Without a Separate Vector Database

By StarRocks Engineering | Jun, 2026 | Dev Genius


How we used StarRocks, Gemini, and tool-based retrieval to power grounded Q&A in a developer community Slack.

StarRocks Engineering

9 min read· 7 hours ago

Author: Billy Chang, Software Engineer at Phoenix AI

StarRocks gives data teams a fast open-source analytical database with a unified execution engine, a flexible deployment model, and strong performance for real-world workloads. But as the StarRocks community grows, the support workload grows with it: maintainers repeatedly answer the same questions about docs, GitHub issues, release notes, and historical Slack conversations.

Rocky is the official Slack assistant we built to address that problem. Its job is simple: take repetitive Q&A work off community maintainers while keeping answers grounded in StarRocks documentation and related sources.

The architecture is the important part. Rocky itself runs on StarRocks: document chunks, keyword lookup, vector retrieval, and similarity scoring all live in a single OLAP table. The AI that answers questions about StarRocks also runs on StarRocks. The result is a compact AI application built from roughly 600 lines of Python, one StarRocks table, and a Gemini API key.

[Figure: How Rocky works in the Slack channel]

The RAG Foundation in StarRocks

The conventional first step when building a RAG application is to introduce a purpose-built vector database. Each new component adds operational overhead: a separate deployment, backup strategy, and consistency model. For a lightweight community bot, that overhead is hard to justify.

Instead of adding a vector database, we kept everything in StarRocks. Document chunks, along with their 768-dimensional Gemini embeddings, live in a standard OLAP table using a PRIMARY KEY model. Retrieval is a SQL query. The architectural principle is simple: if your analytical database already supports array-type columns and cosine similarity functions, you do not need a second data system for vector search.

The table definition looks like any other StarRocks table:

    CREATE TABLE docs (
        id      BIGINT NOT NULL,
        path    VARCHAR(512),
        `index` INT,
        `text`  STRING,
        vector  ARRAY<FLOAT>  -- 768-dim Gemini embedding
    )
    ENGINE = OLAP
    PRIMARY KEY(id)
    DISTRIBUTED BY HASH(id) BUCKETS 1;

And retrieval is a single query using approx_cosine_similarity:

    SELECT path, `index`, `text`,
           approx_cosine_similarity([0.012, -0.034, ...], vector) AS similarity
    FROM docs
    ORDER BY similarity DESC
    LIMIT 8;

No SDK, no client library, no second data system. From Rocky’s perspective, the “vector database” is just another StarRocks table it already knows how to query.

Architecture: From Slack Events to Grounded Answer

The end-to-end flow is deliberately minimal. A Slack @Rocky mention triggers the bot, which delegates reasoning to Gemini 3 Flash with native function calling. The model decides whether to search the documentation or search Google, cycling through up to ten tool-call rounds per turn.
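To make the two ideas in this section concrete, here is a minimal sketch, not Rocky's actual code. It renders the retrieval SQL shown earlier from a query embedding and caps a tool-calling turn at ten rounds. The names `build_similarity_query`, `run_turn`, and the `model_step` callback protocol are hypothetical stand-ins. The sketch deliberately calls neither the Gemini SDK nor a database driver, so the model and the tools are passed in as plain Python callables, and what the real bot does when the round budget runs out is not stated in the article.

```python
import math
from typing import Any, Callable, Dict, List, Optional, Sequence

MAX_TOOL_ROUNDS = 10  # the per-turn cap described in the article


def build_similarity_query(query_vector: Sequence[float], top_k: int = 8) -> str:
    """Render the article's retrieval SQL for one query embedding."""
    if not query_vector or not all(math.isfinite(x) for x in query_vector):
        raise ValueError("query_vector must be non-empty and finite")
    # Format every element as a float so the literal can only contain numbers.
    literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    return (
        "SELECT path, `index`, `text`, "
        f"approx_cosine_similarity({literal}, vector) AS similarity "
        f"FROM docs ORDER BY similarity DESC LIMIT {int(top_k)}"
    )


def run_turn(
    model_step: Callable[[List[tuple]], tuple],
    tools: Dict[str, Callable[..., Any]],
    question: str,
    max_rounds: int = MAX_TOOL_ROUNDS,
) -> Optional[str]:
    """Run one turn with a bounded number of tool-call rounds.

    Hypothetical protocol: model_step(history) returns either
    ("answer", text) or ("tool", tool_name, kwargs).
    Returns None if the budget is exhausted (an assumption of this sketch).
    """
    history: List[tuple] = [("user", question)]
    for _ in range(max_rounds):
        step = model_step(history)
        if step[0] == "answer":
            return step[1]
        _, name, kwargs = step
        result = tools[name](**kwargs)  # execute the requested tool
        history.append(("tool", name, result))  # feed the result back to the model
    return None
```

In a real bot, the `search_starrocks_doc` tool would embed the query, pass the vector to something like `build_similarity_query`, and run the resulting SQL over an ordinary MySQL-protocol connection to StarRocks. Keeping the loop bound in one place makes the worst-case cost of a turn easy to reason about.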

The Rest of the Stack at a Glance

Slack receives the mention: slack_bolt (Python) captures the event via Socket Mode and extracts the user query plus thread context.

Gemini reasons and calls tools: the LLM receives a system prompt with strict honesty rules and two available tools: search_starrocks_doc (client-side vector retrieval) and google_search (server-side grounding via Gemini’s built-in web search).

Vector retrieval executes in StarRocks: search_starrocks_doc embeds the query using gemini-embedding-001 with task_type=RETRIEVAL_QUERY, then runs the cosine similarity SQL above.

The model synthesizes an answer: Gemini assembles the retrieved chunks, generates a Markdown response, and Rocky converts it to Slack mrkdwn format before posting.

Telemetry flows to the observability stack: every LLM call, tool invocation, and token count is captured as an OpenTelemetry span, keyed by thread_ts as the session ID.

The entire bot is about 600 lines of Python in a single file. The rest of the toolchain, including document chunking, embedding generation, and index building, adds only a few thousand more. Storage, retrieval, and similarity scoring are handled entirely by StarRocks.

Why This Stack Works for a Lightweight RAG App

Three design choices keep Rocky’s operational footprint small while still delivering useful answers.

Primary Key Table + Stream Load for Hot-Swappable Docs

The document corpus is not a streaming workload. When the StarRocks docs update, we re-chunk the entire docs/en/...
