Hanno-Labs/bosun-xs · Hugging Face
Bosun-XS (0.6B)
Launch post: Introducing Bosun →
The judge that keeps an agent's memory (its knowledge graph) clean. As an agent accumulates memory as a graph of facts linked by relationships, Bosun-XS decides, edge by edge, which connections are warranted (supported, non-redundant, still true), so the graph stays useful instead of growing into noise that drowns the model reading it back. Nothing else scores that "judge" step; Bosun-XS is a small, fast, calibrated model built for it, and you program it with a sentence.
Given two findings and an instruction, it emits P = sigmoid(logit_yes - logit_no) ∈ [0, 1]: how strongly the pair satisfies the rule you supplied, with no opinion of its own. "Warranted" isn't one fixed rule (same-entity, cross-domain bridge, not-a-duplicate, still-supported-by-evidence), so you define it per graph; Bosun-XS follows the rule, respects negation, and generalizes to rules it never trained on. That same capability is what RAG filtering, content moderation, and deduplication need too; knowledge-graph curation is simply where the need bites first and hardest.
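To make the scoring rule concrete, here is a minimal sketch in plain Python. Only the formula comes from the text above; the function names and the logit values are made up for illustration and are not real model outputs. The second function shows that the sigmoid of the logit difference is the same number as a softmax over just the two tokens "yes" and "no".

```python
import math


def warrant_score(logit_yes: float, logit_no: float) -> float:
    """P = sigmoid(logit_yes - logit_no), computed in a numerically stable way."""
    d = logit_yes - logit_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def two_way_softmax_yes(logit_yes: float, logit_no: float) -> float:
    """Probability of "yes" when softmax is restricted to the yes/no pair."""
    m = max(logit_yes, logit_no)  # subtract the max for stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# Illustrative logits only (hypothetical values).
p = warrant_score(2.0, -1.0)
same = two_way_softmax_yes(2.0, -1.0)
print(round(p, 4), round(same, 4))
```

Because only the difference of the two logits matters, the score ignores how much probability mass the model puts on every other token in the vocabulary.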
LoRA fine-tune of Qwen/Qwen3-Reranker-0.6B, scored on the native reranker yes/no logits.
Inference contract
Native Qwen3-Reranker template; read the last-token logits:
: These two findings share the specified relationship.
: FINDING A:\n\n\nFINDING B:\n
score = sigmoid(logits[yes_id] - logits[no_id]) at the final position (logits_to_keep=1). The exact yes_id, no_id, template prefix and suffix, and max_len are in serving.json.
import json

import torch
from huggingface_hub import hf_hub_download
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Hanno-Labs/bosun-xs"
# serving.json from this repo: base model, template prefix/suffix, yes_id/no_id, max_len
with open(hf_hub_download(repo, "serving.json")) as f:
    cfg = json.load(f)
tok = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", padding_side="left")
base = AutoModelForCausalLM.from_pretrained(cfg["base_model"], torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", trust_remote_code=True)
# merge the LoRA weights into the base model so inference runs without the PEFT wrapper
model = PeftModel.from_pretrained(base, repo).merge_and_unload().eval().cuda()
# build ids = prefix + tokenized prompt + suffix, left-pad the batch, then:
# lg = model(input_ids=input_ids, attention_mask=attention_mask, logits_to_keep=1).logits[:, -1, :]
# p = torch.sigmoid(lg[:, cfg["yes_id"]] - lg[:, cfg["no_id"]])
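The commented steps at the end of the snippet above (assemble prefix, prompt, and suffix, then read the last position) can be sketched on plain token-id lists. This is an illustration under assumptions, not the authors' code: `build_ids` and `left_pad` are hypothetical helpers, truncating only the middle part to fit max_len is an assumed policy, and the toy ids stand in for real tokenizer output.

```python
from typing import List, Tuple


def build_ids(prefix: List[int], body: List[int], suffix: List[int],
              max_len: int) -> List[int]:
    """ids = prefix + body + suffix; only the body is truncated (assumed policy)
    so the template prefix and suffix always survive."""
    room = max_len - len(prefix) - len(suffix)
    if room < 0:
        raise ValueError("max_len is smaller than the template itself")
    return prefix + body[:room] + suffix


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad on the left so every sequence ends at the same final position."""
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


# Toy ids (hypothetical): two prompts of different length sharing one template.
prefix, suffix = [1, 2], [9]
batch = [
    build_ids(prefix, [10, 11, 12, 13], suffix, max_len=5),
    build_ids(prefix, [20], suffix, max_len=5),
]
input_ids, attention_mask = left_pad(batch, pad_id=0)
print(input_ids)
print(attention_mask)
```

With left padding, index -1 of every row is a real token (the end of the suffix), which is why the score can be read from the last position for the whole batch at once.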
Results
WarrantBench (Hanno-Labs/warrantbench): it out-steers a frontier LLM on steerability and cross-domain bridging, and comes close on negation:
| metric | cosine | Bosun-XS | gemini-3.1-flash-lite |
|---|---|---|---|
| steerability (score flips with the rule) | 0.00 | 0.94 | 0.58 |
| negation ("NOT the same topic") | 0.00 | 0.97 | 0.996 |
| cross-domain bridge | 0.32 | 0.83 | 0.38 |
On novel rules it never trained on: 0.95 ("both mention a figure ≥ $1B") and 0.95 ("both involve a government or regulator"), vs 0.35 / 0.63 for flash-lite.
FollowIR (public instruction-following retrieval, p-MRR): Bosun-XS tops the board where most retrievers score zero or negative; they read the instruction as keywords, while Bosun reads it as a rule.
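For readers unfamiliar with p-MRR, the sketch below shows the pairwise rank-change score as we understand it from the FollowIR paper; it is not taken from this model card, and the function name is hypothetical, so check the benchmark's own code before relying on it. For a document that becomes non-relevant once the instruction changes, a model that follows the instruction should push it down the ranking, which yields a positive value; ignoring the instruction gives zero, and promoting the document gives a negative value.

```python
def p_mrr_single(rank_og: int, rank_new: int) -> float:
    """Pairwise MRR change for one document (assumed FollowIR-style definition).

    rank_og:  1-based rank under the original instruction
    rank_new: 1-based rank under the altered instruction
    Returns a value in (-1, 1); positive means the document moved down.
    """
    if rank_og < 1 or rank_new < 1:
        raise ValueError("ranks are 1-based")
    mrr_og = 1.0 / rank_og
    mrr_new = 1.0 / rank_new
    if rank_og > rank_new:
        # document moved up although it should have moved down: negative score
        return mrr_og / mrr_new - 1.0
    # document stayed put or moved down: zero or positive score
    return 1.0 - mrr_new / mrr_og


print(p_mrr_single(1, 10))  # document demoted after the instruction change
print(p_mrr_single(10, 1))  # document promoted instead
print(p_mrr_single(5, 5))   # ranking ignored the instruction
```

Benchmark tables usually report the average of this quantity scaled by 100, which is presumably the scale of the FollowIR figure quoted on this page.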
Files
| file | what |
|---|---|
| adapter_model.safetensors, adapter_config.json | the LoRA adapter (load with PEFT over the base) |
| serving.json | inference contract: template + yes_id/no_id + max_len |
| tokenizer/ | Qwen tokenizer (left-padding) |
Links
Launch post — Introducing Bosun
WarrantBench: github.com/Hanno-Labs/warrantbench (dataset)
From Hanno Labs.
Model tree for Hanno-Labs/bosun-xs
Base model: Qwen/Qwen3-0.6B-Base
Finetuned: Qwen/Qwen3-Reranker-0.6B
Adapter: this model
Evaluation results (self-reported)
Steerability (score flips with the rule) on WarrantBench: 0.935
p-MRR (full pool, avg of 3 tasks) on FollowIR: 10.500