Lessons We Learned Building a RAG Assistant Without a Separate Vector Database

HermitX2 pts0 comments

Lessons We Learned Building a RAG Assistant Without a Separate Vector Database | by StarRocks Engineering | Jun, 2026 | Dev GeniusSitemapOpen in appSign up<br>Sign in

Medium Logo

Get app<br>Write

Search

Sign up<br>Sign in

Dev Genius

Coding, Tutorials, News, UX, UI and much more related to development

Lessons We Learned Building a RAG Assistant Without a Separate Vector Database

How we used StarRocks, Gemini, and tool-based retrieval to power grounded Q&A in a developer community Slack.

StarRocks Engineering

9 min read·<br>7 hours ago

Listen

Share

Author: Billy Chang, Software Engineer at Phoenix AI<br>Press enter or click to view image in full size

StarRocks gives data teams a fast open-source analytical database with a unified execution engine, a flexible deployment model, and strong performance for real-world workloads. But as the StarRocks community grows, the support workload grows with it: maintainers repeatedly answer the same questions about docs, GitHub issues, release notes, and historical Slack conversations.<br>Rocky is the official Slack assistant we built to address that problem. Its job is simple: take repetitive Q&A work off community maintainers while keeping answers grounded in StarRocks documentation and related sources.<br>The architecture is the important part. Rocky itself runs on StarRocks: document chunks, keyword lookup, vector retrieval, and similarity scoring all live in a single OLAP table. The AI that answers questions about StarRocks also runs on StarRocks. The result is a compact AI application built from roughly 600 lines of Python, one StarRocks table, and a Gemini API key.<br>Press enter or click to view image in full size

How Rocky works in the Slack channelThe RAG Foundation in StarRocks<br>The conventional first step when building a RAG application is to introduce a purpose-built vector database. Each new component adds operational overhead: a separate deployment, backup strategy, and consistency model. For a lightweight community bot, that overhead is hard to justify.<br>Instead of adding a vector database, we kept everything in StarRocks. Document chunks — along with their 768-dimensional Gemini embeddings — live in a standard OLAP table using a PRIMARY KEY model. Retrieval is a SQL query. The architectural principle is simple: if your analytical database already supports array-type columns and cosine similarity functions, you do not need a second data system for vector search.<br>The table definition looks like any other StarRocks table:<br>CREATE TABLE docs (<br>id BIGINT NOT NULL,<br>path VARCHAR(512),<br>`index` INT,<br>`text` STRING,<br>vector ARRAY -- 768-dim Gemini embedding<br>) ENGINE = OLAP<br>PRIMARY KEY(id)<br>DISTRIBUTED BY HASH(id) BUCKETS 1;And retrieval is a single query using cosine_similarity:<br>SELECT path, `index`, `text`,<br>approx_cosine_similarity([0.012, -0.034, ...], vector) AS similarity<br>FROM docs<br>ORDER BY similarity DESC<br>LIMIT 8;No SDK, no client library, no second data system. From Rocky’s perspective, the “vector database” is just another StarRocks table it already knows how to query.<br>Architecture: From Slack Events to Grounded Answer<br>The end-to-end flow is deliberately minimal. A Slack @Rocky mention triggers the bot, which delegates reasoning to Gemini 3 Flash with native function calling. The model decides whether to search documentation or query Google, cycling through up to ten tool-call rounds per turn.<br>Press enter or click to view image in full size

The Rest of the Stack at a Glance<br>Press enter or click to view image in full size

Slack receives the mention : slack_bolt (Python) captures the event via Socket Mode and extracts the user query plus thread context.<br>Gemini reasons and calls tools : the LLM receives a system prompt with strict honesty rules and two available tools: search_starrocks_doc (client-side vector retrieval) and google_search (server-side grounding via Gemini’s built-in web search).<br>Vector retrieval executes in StarRocks : search_starrocks_doc embeds the query using gemini-embedding-001 with task_type=RETRIEVAL_QUERY, then runs the cosine similarity SQL above.<br>The model synthesizes an answer : Gemini assembles the retrieved chunks, generates a Markdown response, and Rocky converts it to Slack mrkdwn format before posting.<br>Telemetry flows to the observability stack : every LLM call, tool invocation, and token count is captured as an OpenTelemetry span, keyed by thread_ts as the session ID.<br>The entire bot is about 600 lines of Python in a single file. The rest of the toolchain, including document chunking, embedding generation, and index building, adds only a few thousand more. Storage, retrieval, and similarity scoring are handled entirely by StarRocks.<br>Why This Stack Works for a Lightweight RAG App<br>Three design choices keep Rocky’s operational footprint small while still delivering useful answers.<br>Primary Key Table + Stream Load for Hot-Swappable Docs<br>The document corpus is not a streaming workload. When the StarRocks docs update, we re-chunk the entire docs/en/...

starrocks vector gemini database slack table

Related Articles