KRLabsOrg/verbatim-rag-modern-bert-v2 · Hugging Face
Verbatim-RAG Extractor
Chill, I Ground! 🌶️
Model Name: verbatim-rag-modern-bert-v2
Organization: KRLabsOrg
GitHub: https://github.com/KRLabsOrg/verbatim-rag
Overview
The Verbatim-RAG Extractor is a query-conditioned token classifier that highlights the verbatim spans of a passage that answer a question. It is the encoder companion to VerbatimRAG and the successor to verbatim-rag-modern-bert-v1. It is built on Alibaba-NLP/gte-reranker-modernbert-base, which provides the long ModernBERT context (up to 8192 tokens) and a query-conditioned reranking prior. Span extraction is fine-tuned on top of this base.
The goal is a lightweight extractor that can replace many LLM-based evidence-highlighting calls in production RAG systems: local, deterministic, cheap to serve, and still competitive on span-overlap quality. On our ACL-Verbatim gold benchmark, the ACL-specialized sibling model is on par with strong LLM extractors by word-level F1, while this generic multi-domain model beats public extractive baselines across ACL gold, RAGBench, Squeez, and QASPER.
You can use it as the extraction stage inside VerbatimRAG, or drop it into your own RAG pipeline after retrieval/reranking to turn retrieved chunks into grounded evidence spans before displaying them to users or passing them to a generator.
Most public evidence extractors (Provence, Zilliz Semantic-Highlight, MultiSpanQA-trained models) are trained on Wikipedia-style prose QA only. This model is trained on KRLabsOrg/verbatim-spans, which adds financial tables, legal contracts, medical literature, product manuals, and, uniquely among public extractors, coding-agent tool output (pytest failures, git diff hunks, stack traces). The result is a single 150M-parameter encoder usable across the content shapes a real RAG or agent pipeline tends to retrieve, not just article paragraphs.
For an ACL-Anthology-specialized variant, see KRLabsOrg/acl-verbatim-modernbert.
Model Details
Architecture: ModernBERT (gte-reranker-modernbert-base) with 8192-token context
Task: Token classification — binary evidence labels mapped to character spans
Training Dataset: KRLabsOrg/verbatim-spans (multi-domain)
Language: English
Parameters: 150M
Training data composition
| content shape | source |
| --- | --- |
| scientific paragraphs with citations | ACL silver |
| Wikipedia / general QA, multi-hop | RAGBench (HotpotQA, MS MARCO, ExpertQA, ...) |
| financial tables | RAGBench (TAT-QA, FinQA) |
| medical literature | RAGBench (PubMedQA, CovidQA) |
| legal contracts | RAGBench (CUAD) |
| product manuals | RAGBench (eManual, TechQA) |
| code, tool output, stack traces, logs | Squeez (SWE-bench tool outputs) |
How It Works
A (question, context) pair is encoded as a single sequence, and the model predicts a per-token positive-class probability over the context tokens. Contiguous runs of tokens whose probability is above the threshold are merged into character spans, and post-processing (min_span_chars, merge_gap_chars) removes fragmentation artifacts. Long contexts are handled with sliding windows of max_length tokens stepped by doc_stride, and spans are merged across windows.
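The post-processing described above can be illustrated with a minimal, model-free sketch. It is not the model's actual implementation: the function names `window_starts` and `probs_to_spans` are hypothetical, and the exact semantics assumed here (an inclusive threshold comparison, merging runs whose character gap is at most merge_gap_chars, then dropping spans shorter than min_span_chars) are guesses based on the parameter names.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def window_starts(n_tokens: int, max_length: int, doc_stride: int) -> List[int]:
    """Start indices of token windows of size max_length stepped by doc_stride."""
    if max_length <= 0 or doc_stride <= 0:
        raise ValueError("max_length and doc_stride must be positive")
    starts = [0]
    # Keep stepping until the last window reaches the end of the context.
    while starts[-1] + max_length < n_tokens:
        starts.append(starts[-1] + doc_stride)
    return starts


def probs_to_spans(
    offsets: List[Span],
    probs: List[float],
    threshold: float = 0.2,
    min_span_chars: int = 30,
    merge_gap_chars: int = 20,
) -> List[Span]:
    """Turn per-token probabilities into merged character spans (assumed semantics)."""
    # 1. Collect contiguous runs of tokens at or above the threshold
    #    (whether the real comparison is inclusive is an assumption).
    runs: List[Span] = []
    current: Optional[Span] = None
    for (start, end), p in zip(offsets, probs):
        if p >= threshold:
            current = (start, end) if current is None else (current[0], end)
        elif current is not None:
            runs.append(current)
            current = None
    if current is not None:
        runs.append(current)

    # 2. Merge runs separated by a small character gap.
    merged: List[Span] = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= merge_gap_chars:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))

    # 3. Drop fragments that are still too short after merging.
    return [(s, e) for s, e in merged if e - s >= min_span_chars]
```

In this sketch `offsets` holds one (start, end) character pair per context token, such as the offset mapping a fast tokenizer can return. Merging before filtering means that several short neighbouring runs can combine into one span that survives the length filter. Whether the model applies the two steps in that order is not stated in this card.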
Usage
from transformers import AutoModel

# trust_remote_code=True lets transformers load the custom model code
# shipped in the model repository.
model = AutoModel.from_pretrained(
    "KRLabsOrg/verbatim-rag-modern-bert-v2",
    trust_remote_code=True,
)

result = model.process(
    question="What is ModernBERT?",
    context=(
        "ModernBERT is a long-context encoder for NLP. "
        "It supports sequences up to 8192 tokens. "
        "Unlike earlier BERT variants, it uses rotary position embeddings."
    ),
    threshold=0.2,
)

# Print each extracted span with its score.
for span in result["spans"]:
    print(f"[{span['score']:.2f}] {span['text']}")
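If only the strongest evidence should be shown, the returned spans can be filtered and ranked before display. The sketch below relies only on the `score` and `text` keys used in the loop above. The helper name `top_spans`, its defaults, and the `example` dictionary with its scores are hypothetical stand-ins written by hand, not model output.

```python
from typing import Any, Dict, List


def top_spans(result: Dict[str, Any], min_score: float = 0.5, limit: int = 3) -> List[str]:
    """Return the text of the highest-scoring spans, best first."""
    # Keep only spans at or above the cutoff, then sort by descending score.
    spans = [s for s in result.get("spans", []) if s["score"] >= min_score]
    spans.sort(key=lambda s: s["score"], reverse=True)
    return [s["text"] for s in spans[:limit]]


# Hand-written stand-in for the dictionary returned by model.process(...);
# the scores are illustrative only.
example = {
    "spans": [
        {"text": "It supports sequences up to 8192 tokens.", "score": 0.41},
        {"text": "ModernBERT is a long-context encoder for NLP.", "score": 0.93},
    ]
}

for text in top_spans(example, min_score=0.5):
    print(text)
```

A display-time cutoff like this is separate from the `threshold` argument passed to `process`, which acts on per-token probabilities before spans are formed.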
Use inside VerbatimRAG
from verbatim_rag.core import VerbatimRAG
from verbatim_rag.index import VerbatimIndex
from verbatim_rag.extractors import ModelSpanExtractor
from verbatim_rag.vector_stores import LocalMilvusStore
from verbatim_rag.embedding_providers import SpladeProvider

# v2 is the default ModelSpanExtractor model, but passing it explicitly makes
# the dependency clear.
extractor = ModelSpanExtractor(
    model_path="KRLabsOrg/verbatim-rag-modern-bert-v2",
    threshold=0.2,
    min_span_chars=30,
    merge_gap_chars=20,
    device=None,  # auto-detects cuda, then mps, then cpu
)

sparse_provider = SpladeProvider(
    model_name="opensearch-project/opensearch-neural-sparse-encoding-doc-v2-distill",
    device="cuda",  # use "cpu" if no GPU is available
)

vector_store = LocalMilvusStore(
    db_path="./index.db",
    collection_name="verbatim_rag",
    enable_dense=False,
    enable_sparse=True,
)

# Assumes the index has already been populated with your documents.
index = VerbatimIndex(
    vector_store=vector_store,
    sparse_provider=sparse_provider,
)

rag = VerbatimRAG(
    index=index,
    extractor=extractor,
    k=5,
)

response = rag.query("Main findings of the paper?")
print(response.answer)
You can also use the model directly after your own retriever/reranker:
from transformers import AutoModel

extractor = AutoModel.from_pretrained(
    "KRLabsOrg/verbatim-rag-modern-bert-v2",
    trust_remote_code=True,
)

question =...