Show HN: A 150M model that extracts verbatim evidence spans for RAG, no LLM call

Verbatim-RAG Extractor

Chill, I Ground! 🌶️

Model Name: verbatim-rag-modern-bert-v2
Organization: KRLabsOrg
GitHub: https://github.com/KRLabsOrg/verbatim-rag

Overview

The Verbatim-RAG Extractor is a query-conditioned token classifier that highlights the verbatim spans of a passage that answer a question. It is the encoder companion to VerbatimRAG and the successor to verbatim-rag-modern-bert-v1. It is built on Alibaba-NLP/gte-reranker-modernbert-base, which provides the long ModernBERT context (up to 8192 tokens) and a query-conditioned reranking prior; span extraction is fine-tuned on top of that base.

The goal is a lightweight extractor that can replace many LLM-based evidence-highlighting calls in production RAG systems: local, deterministic, cheap to serve, and still competitive on span-overlap quality. On our ACL-Verbatim gold benchmark, the ACL-specialized sibling model is on par with strong LLM extractors by word-level F1, while this generic multi-domain model beats public extractive baselines across ACL gold, RAGBench, Squeez, and QASPER.
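To make "word-level F1" concrete, here is a minimal sketch of one common way to score extracted spans against gold spans. It assumes lowercased whitespace tokens and multiset word overlap (in the style of SQuAD answer F1). The benchmark's actual tokenization and matching rules are not stated in this card, and `word_f1` is a hypothetical helper, not part of the released code.

```python
from collections import Counter


def word_f1(predicted_spans, gold_spans):
    """Word-level F1 between predicted and gold evidence spans.

    Assumption: words are lowercased whitespace tokens and overlap is
    counted as a multiset intersection. The real benchmark may differ.
    """
    pred_words = Counter(w for s in predicted_spans for w in s.lower().split())
    gold_words = Counter(w for s in gold_spans for w in s.lower().split())

    # Number of words shared by prediction and gold (with multiplicity).
    overlap = sum((pred_words & gold_words).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_words.values())
    recall = overlap / sum(gold_words.values())
    return 2 * precision * recall / (precision + recall)


predicted = ["It supports sequences up to 8192 tokens."]
gold = [
    "It supports sequences up to 8192 tokens.",
    "ModernBERT is a long-context encoder.",
]
score = word_f1(predicted, gold)
print(f"word-level F1: {score:.2f}")
```

In this example the prediction is fully precise but misses one gold sentence, so recall (and therefore F1) drops below 1.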

You can use it as the extraction stage inside VerbatimRAG, or drop it into your own RAG pipeline after retrieval/reranking to turn retrieved chunks into grounded evidence spans before displaying them to users or passing them to a generator.

Most public evidence extractors (Provence, Zilliz Semantic-Highlight, MultiSpanQA-trained models) are trained on Wikipedia-style prose QA only. This model is trained on KRLabsOrg/verbatim-spans, which adds financial tables, legal contracts, medical literature, product manuals, and — uniquely among public extractors — coding-agent tool output (pytest failures, git diff hunks, stack traces). The result is a single 150M-parameter encoder usable across the content shapes a real RAG or agent pipeline tends to retrieve, not just article paragraphs.

For an ACL-Anthology-specialized variant, see KRLabsOrg/acl-verbatim-modernbert.

Model Details

Architecture: ModernBERT (gte-reranker-modernbert-base) with 8192-token context

Task: Token classification — binary evidence labels mapped to character spans

Training Dataset: KRLabsOrg/verbatim-spans (multi-domain)

Language: English

Parameters: 150M

Training data composition

| content shape | source |
|---|---|
| scientific paragraphs with citations | ACL silver |
| Wikipedia / general QA, multi-hop | RAGBench (HotpotQA, MS MARCO, ExpertQA, ...) |
| financial tables | RAGBench (TAT-QA, FinQA) |
| medical literature | RAGBench (PubMedQA, CovidQA) |
| legal contracts | RAGBench (CUAD) |
| product manuals | RAGBench (eManual, TechQA) |
| code, tool output, stack traces, logs | Squeez (SWE-bench tool outputs) |

How It Works

A (question, context) pair is encoded as a single sequence, and the model predicts a positive-class probability for each context token. Tokens whose probability exceeds a threshold are marked positive, and contiguous positive runs are merged into character spans. Post-processing (min_span_chars, merge_gap_chars) then removes fragmentation artifacts. Long contexts are handled with sliding windows of max_length tokens stepped by doc_stride, and spans are merged across windows.
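The post-processing described above can be sketched in plain Python. This is an illustration under stated assumptions, not the model's actual implementation: `probs_to_spans` and `window_starts` are hypothetical helpers, the span score is taken to be the mean token probability, "above a threshold" is read as a strict comparison, the window step is taken to equal `doc_stride`, and merging spans across windows is not shown.

```python
import re


def probs_to_spans(probs, offsets, threshold=0.2, min_span_chars=30, merge_gap_chars=20):
    """Turn per-token probabilities into merged character spans.

    probs   -- positive-class probability for each context token
    offsets -- (start_char, end_char) for each context token
    """
    # 1. Threshold, then join contiguous positive tokens into runs.
    runs, current = [], None
    for p, (start, end) in zip(probs, offsets):
        if p > threshold:
            if current is None:
                current = [start, end, [p]]
            else:
                current[1] = end
                current[2].append(p)
        elif current is not None:
            runs.append(current)
            current = None
    if current is not None:
        runs.append(current)

    # 2. Merge runs separated by at most merge_gap_chars characters.
    merged = []
    for start, end, scores in runs:
        if merged and start - merged[-1][1] <= merge_gap_chars:
            merged[-1][1] = end
            merged[-1][2].extend(scores)
        else:
            merged.append([start, end, list(scores)])

    # 3. Drop spans shorter than min_span_chars.
    #    Score = mean token probability (an assumption for this sketch).
    return [
        {"start": s, "end": e, "score": sum(sc) / len(sc)}
        for s, e, sc in merged
        if e - s >= min_span_chars
    ]


def window_starts(num_tokens, max_length, doc_stride):
    """Start indices of sliding windows that cover num_tokens tokens."""
    if doc_stride <= 0:
        raise ValueError("doc_stride must be positive")
    starts = [0]
    while starts[-1] + max_length < num_tokens:
        starts.append(starts[-1] + doc_stride)
    return starts


# Toy example: whitespace "tokens" with hand-written probabilities.
text = "ModernBERT is a long-context encoder for NLP. It supports sequences up to 8192 tokens."
offsets = [m.span() for m in re.finditer(r"\S+", text)]
probs = [0.1] * 7 + [0.8, 0.9, 0.9, 0.7, 0.6, 0.9, 0.8]

spans = probs_to_spans(probs, offsets)
for span in spans:
    print(f"[{span['score']:.2f}] {text[span['start']:span['end']]}")
```

A real tokenizer would supply subword offsets instead of the whitespace offsets used here; the thresholding and merging logic is unchanged.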

Usage

from transformers import AutoModel

# trust_remote_code=True loads the custom model class shipped with the repo.
model = AutoModel.from_pretrained(
    "KRLabsOrg/verbatim-rag-modern-bert-v2",
    trust_remote_code=True,
)

result = model.process(
    question="What is ModernBERT?",
    context=(
        "ModernBERT is a long-context encoder for NLP. "
        "It supports sequences up to 8192 tokens. "
        "Unlike earlier BERT variants, it uses rotary position embeddings."
    ),
    threshold=0.2,
)

# Each span carries its score and the verbatim text taken from the context.
for span in result["spans"]:
    print(f"[{span['score']:.2f}] {span['text']}")

Use inside VerbatimRAG

from verbatim_rag.core import VerbatimRAG
from verbatim_rag.index import VerbatimIndex
from verbatim_rag.extractors import ModelSpanExtractor
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider

# v2 is the default ModelSpanExtractor model, but passing it explicitly makes
# the dependency clear.
extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v2",
    threshold=0.2,
    min_span_chars=30,
    merge_gap_chars=20,
    device=None,  # auto-detects cuda, then mps, then cpu
)

sparse_provider = SpladeProvider(
    model_name="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    device="cuda",  # use "cpu" if no GPU is available
)

vector_store = LocalMilvusStore(
    db_path="./index.db",
    collection_name="verbatim_rag",
    enable_dense=False,
    enable_sparse=True,
)

# Assumes the index has already been populated with your documents.
index = VerbatimIndex(
    vector_store=vector_store,
    sparse_provider=sparse_provider,
)

rag = VerbatimRAG(
    index=index,
    extractor=extractor,
    k=5,
)

response = rag.query("Main findings of the paper?")
print(response.answer)

You can also use the model directly after your own retriever/reranker:

from transformers import AutoModel

extractor = AutoModel.from_pretrained(
    "KRLabsOrg/verbatim-rag-modern-bert-v2",
    trust_remote_code=True,
)

question =...
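As a rough sketch of the post-retrieval step, the loop below runs an extractor over a list of retrieved chunks and collects the scored spans. It assumes only the `process(question=..., context=..., threshold=...)` call and the `result["spans"]` layout shown in the Usage section. `extract_evidence` and `fake_process` are hypothetical names; the stub stands in for the model so the sketch runs without downloading weights, and its word-overlap scoring has nothing to do with how the real model scores tokens.

```python
import re


def extract_evidence(process, question, chunks, threshold=0.2):
    """Run a span extractor over retrieved chunks and collect scored spans."""
    evidence = []
    for chunk_id, chunk in enumerate(chunks):
        result = process(question=question, context=chunk, threshold=threshold)
        for span in result["spans"]:
            evidence.append(
                {"chunk_id": chunk_id, "text": span["text"], "score": span["score"]}
            )
    # Highest-scoring evidence first.
    evidence.sort(key=lambda item: item["score"], reverse=True)
    return evidence


def fake_process(question, context, threshold):
    """Hypothetical stand-in for model.process (NOT the real scoring).

    Marks the whole chunk as evidence when enough question words appear in it.
    """
    q_words = set(re.findall(r"\w+", question.lower()))
    c_words = set(re.findall(r"\w+", context.lower()))
    score = len(q_words & c_words) / len(q_words) if q_words else 0.0
    spans = [{"text": context, "score": score}] if score > threshold else []
    return {"spans": spans}


chunks = ["ModernBERT supports 8192 tokens.", "Bananas are yellow."]
evidence = extract_evidence(
    fake_process, "How many tokens does ModernBERT support?", chunks
)
for item in evidence:
    print(f"chunk {item['chunk_id']} [{item['score']:.2f}] {item['text']}")
```

With the real model, the loaded model's `process` method would be passed in place of `fake_process`.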
