Show HN: A 150M model that extracts verbatim evidence spans for RAG, no LLM call


KRLabsOrg/verbatim-rag-modern-bert-v2 · Hugging Face


Verbatim-RAG Extractor

Chill, I Ground! 🌶️

Model Name: verbatim-rag-modern-bert-v2
Organization: KRLabsOrg
Github: https://github.com/KRLabsOrg/verbatim-rag

Overview

The Verbatim-RAG Extractor is a query-conditioned token classifier that highlights the verbatim spans of a passage that answer a question. It is the encoder companion to VerbatimRAG and the successor to verbatim-rag-modern-bert-v1. It is built on Alibaba-NLP/gte-reranker-modernbert-base, which provides the long ModernBERT context (up to 8192 tokens) and a query-conditioned reranking prior; span extraction is fine-tuned on top of that prior.

The goal is a lightweight extractor that can replace many LLM-based evidence highlighting calls in production RAG systems: local, deterministic, cheap to serve, and still competitive on span-overlap quality. On our ACL-Verbatim gold benchmark, the ACL-specialized sibling model is on par with strong LLM extractors by word-level F1, while this generic multi-domain model beats public extractive baselines across ACL gold, RAGBench, Squeez, and QASPER.

You can use it as the extraction stage inside VerbatimRAG, or drop it into your own RAG pipeline after retrieval/reranking to turn retrieved chunks into grounded evidence spans before displaying them to users or passing them to a generator.

Most public evidence extractors (Provence, Zilliz Semantic-Highlight, MultiSpanQA-trained models) are trained on Wikipedia-style prose QA only. This model is trained on KRLabsOrg/verbatim-spans, which adds financial tables, legal contracts, medical literature, product manuals, and, uniquely among public extractors, coding-agent tool output (pytest failures, git diff hunks, stack traces). The result is a single 150M-parameter encoder usable across the content shapes a real RAG or agent pipeline tends to retrieve, not just article paragraphs.

For an ACL-Anthology-specialized variant, see KRLabsOrg/acl-verbatim-modernbert.

Model Details

Architecture: ModernBERT (gte-reranker-modernbert-base) with 8192-token context

Task: Token classification — binary evidence labels mapped to character spans

Training Dataset: KRLabsOrg/verbatim-spans (multi-domain)

Language: English

Parameters: 150M

Training data composition

| Content shape | Source |
| --- | --- |
| scientific paragraphs with citations | ACL silver |
| Wikipedia / general QA, multi-hop | RAGBench (HotpotQA, MS MARCO, ExpertQA, ...) |
| financial tables | RAGBench (TAT-QA, FinQA) |
| medical literature | RAGBench (PubMedQA, CovidQA) |
| legal contracts | RAGBench (CUAD) |
| product manuals | RAGBench (eManual, TechQA) |
| code, tool output, stack traces, logs | Squeez (SWE-bench tool outputs) |

How It Works

A (question, context) pair is encoded as a single sequence; the model predicts a per-token positive-class probability over the context tokens. Above a threshold, contiguous positive runs are merged into character spans, with post-processing (min_span_chars, merge_gap_chars) that removes fragmentation artifacts. Long contexts are handled with sliding windows of max_length tokens stepped by doc_stride, and spans are merged across windows.
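The post-processing described above can be sketched in plain Python. This is a minimal illustration, not the model's actual implementation: the function names `probs_to_spans` and `window_ranges`, the `>=` comparison against the threshold, and the choice to merge nearby runs before filtering by length are all assumptions. Only the parameter names `threshold`, `min_span_chars`, `merge_gap_chars`, `max_length`, and `doc_stride` come from the description.

```python
from typing import List, Tuple


def probs_to_spans(
    probs: List[float],
    offsets: List[Tuple[int, int]],
    threshold: float = 0.2,
    min_span_chars: int = 30,
    merge_gap_chars: int = 20,
) -> List[Tuple[int, int]]:
    """Turn per-token probabilities into merged (start, end) character spans.

    `offsets` holds the (start, end) character offsets of each context token.
    """
    # 1. Collect contiguous runs of tokens at or above the threshold.
    runs: List[List[int]] = []
    current = None
    for p, (start, end) in zip(probs, offsets):
        if p >= threshold:
            if current is None:
                current = [start, end]
            else:
                current[1] = end
        elif current is not None:
            runs.append(current)
            current = None
    if current is not None:
        runs.append(current)

    # 2. Merge runs separated by at most merge_gap_chars characters
    #    (assumed order: merge first, then filter by length).
    merged: List[List[int]] = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= merge_gap_chars:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # 3. Drop spans shorter than min_span_chars.
    return [(s, e) for s, e in merged if e - s >= min_span_chars]


def window_ranges(n_tokens: int, max_length: int, doc_stride: int) -> List[Tuple[int, int]]:
    """Token ranges of windows of max_length tokens stepped by doc_stride."""
    if max_length <= 0 or doc_stride <= 0:
        raise ValueError("max_length and doc_stride must be positive")
    ranges = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        ranges.append((start, end))
        if end == n_tokens:
            break
        start += doc_stride
    return ranges
```

Spans predicted in each window could then be pooled and passed through the same gap-merging step to combine them across windows. Note that windows only overlap when `doc_stride` is smaller than `max_length`.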

Usage

from transformers import AutoModel

# trust_remote_code=True lets transformers run the custom model code from the repo.
model = AutoModel.from_pretrained(
    "KRLabsOrg/verbatim-rag-modern-bert-v2",
    trust_remote_code=True,
)

result = model.process(
    question="What is ModernBERT?",
    context=(
        "ModernBERT is a long-context encoder for NLP. "
        "It supports sequences up to 8192 tokens. "
        "Unlike earlier BERT variants, it uses rotary position embeddings."
    ),
    threshold=0.2,
)

# Print each extracted span with its score.
for span in result["spans"]:
    print(f"[{span['score']:.2f}] {span['text']}")
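Because the spans are verbatim, they can be located in the original context by exact string match before being shown to a user. The helper below is a hypothetical sketch, not part of the model or of VerbatimRAG: the name `highlight` and the `[[`/`]]` markers are invented for illustration, and it relies only on the `text` key used in the usage example above. It assumes the returned spans do not overlap, and it marks only the first occurrence of each span text.

```python
from typing import Dict, List


def highlight(context: str, spans: List[Dict], open_tag: str = "[[", close_tag: str = "]]") -> str:
    """Wrap each span's text in markers where it occurs verbatim in context."""
    # Locate each span; skip any text that is not an exact substring.
    found = []
    for span in spans:
        start = context.find(span["text"])
        if start != -1:
            found.append((start, start + len(span["text"])))

    # Insert markers from the end so that earlier offsets stay valid.
    out = context
    for start, end in sorted(set(found), reverse=True):
        out = out[:start] + open_tag + out[start:end] + close_tag + out[end:]
    return out
```

A call such as `highlight(context, result["spans"])` returns the context with each extracted span bracketed. A span whose text is not found verbatim is silently skipped, which also serves as a cheap grounding check.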

Use inside VerbatimRAG

from verbatim_rag.core import VerbatimRAG
from verbatim_rag.index import VerbatimIndex
from verbatim_rag.extractors import ModelSpanExtractor
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider

# v2 is the default ModelSpanExtractor model, but passing it explicitly makes
# the dependency clear.
extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v2",
    threshold=0.2,
    min_span_chars=30,
    merge_gap_chars=20,
    device=None,  # auto-detects cuda, then mps, then cpu
)

sparse_provider = SpladeProvider(
    model_name="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    device="cuda",  # use "cpu" if no GPU is available
)

vector_store = LocalMilvusStore(
    db_path="./index.db",
    collection_name="verbatim_rag",
    enable_dense=False,
    enable_sparse=True,
)

# Assumes the index has already been populated with your documents.
index = VerbatimIndex(
    vector_store=vector_store,
    sparse_provider=sparse_provider,
)

rag = VerbatimRAG(
    index=index,
    extractor=extractor,
    k=5,
)

response = rag.query("Main findings of the paper?")
print(response.answer)

You can also use the model directly after your own retriever/reranker:

from transformers import AutoModel

extractor = AutoModel.from_pretrained(
    "KRLabsOrg/verbatim-rag-modern-bert-v2",
    trust_remote_code=True,
)

question =...

