Show HN: Bosun – a small model that keeps an agent's memory graph clean

Hanno-Labs/bosun-xs · Hugging Face

Bosun-XS (0.6B)

Launch post: Introducing Bosun →

The judge that keeps an agent's memory (its knowledge graph) clean. As an agent accumulates memory as a graph of facts linked by relationships, Bosun-XS decides, edge by edge, which connections are warranted (supported, non-redundant, still true), so the graph stays useful instead of growing into noise that drowns the model reading it back. Nothing else scores that "judge" step; Bosun-XS is a small, fast, calibrated model built for it, and you program it with a sentence.
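The edge-by-edge judging described above can be sketched as a simple filter. This is a minimal illustration, not the Bosun pipeline: `prune_edges`, the `threshold` of 0.5, and the word-overlap `toy_score` stand-in are all hypothetical names chosen for this example. In practice `score_fn` would wrap a call to the model and return its score for the pair under the given rule.

```python
from typing import Callable, List, Tuple

Edge = Tuple[str, str]  # (finding A text, finding B text)
ScoreFn = Callable[[str, str, str], float]  # (rule, a, b) -> score in [0, 1]


def prune_edges(edges: List[Edge], rule: str, score_fn: ScoreFn,
                threshold: float = 0.5) -> List[Edge]:
    """Keep only the edges whose score under `rule` reaches `threshold`."""
    return [(a, b) for a, b in edges if score_fn(rule, a, b) >= threshold]


def toy_score(rule: str, a: str, b: str) -> float:
    """Stand-in scorer (NOT the model): Jaccard word overlap, ignores the rule."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


edges = [
    ("Acme acquired Beta Corp", "Beta Corp was acquired by Acme"),
    ("Acme acquired Beta Corp", "Rainfall in Peru rose"),
]
kept = prune_edges(edges, "Both findings describe the same event.", toy_score)
```

With the toy scorer, only the first edge survives. Swapping in a real scorer changes nothing about the filter itself, which is the point of keeping the rule and the scorer as parameters.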

Given two findings and an instruction, it emits P = sigmoid(logit_yes - logit_no) ∈ [0,1]: how strongly the pair satisfies the rule you supplied, with no opinion of its own. "Warranted" isn't one fixed rule (same-entity, cross-domain bridge, not-a-duplicate, still-supported-by-evidence), so you define it per graph; Bosun-XS follows the rule, respects negation, and generalizes to rules it never trained on. That same capability is what RAG filtering, content moderation, and deduplication need too; knowledge-graph curation is simply where the need bites first and hardest.
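The score formula above is the same as a softmax restricted to the two logits "yes" and "no". The short sketch below checks that identity numerically in plain Python; the function names `pair_score` and `softmax_yes` are made up for this illustration and do not come from the repo.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pair_score(logit_yes: float, logit_no: float) -> float:
    """P = sigmoid(logit_yes - logit_no), as in the formula above."""
    return sigmoid(logit_yes - logit_no)


def softmax_yes(logit_yes: float, logit_no: float) -> float:
    """Probability of 'yes' under a softmax over just the two logits."""
    m = max(logit_yes, logit_no)  # subtract the max for stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Because only the difference of the two logits matters, equal logits give exactly 0.5, and swapping the two logits maps P to 1 - P.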

LoRA fine-tune of Qwen/Qwen3-Reranker-0.6B, scored on the native reranker yes/no logits.

Inference contract

Native Qwen3-Reranker template; read the last-token logits:

: These two findings share the specified relationship. : FINDING A:\n\n\nFINDING B:\n

score = sigmoid(logits[yes_id] - logits[no_id]) at the final position (logits_to_keep=1). The exact yes_id / no_id / template prefix+suffix and max_len are in serving.json.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import PeftModel

    repo = "Hanno-Labs/bosun-xs"
    cfg = ...  # serving.json from this repo

    # Left padding keeps the last real token at position -1 for every row.
    tok = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", padding_side="left")

    # Load the base model, then apply and merge the LoRA adapter.
    base = AutoModelForCausalLM.from_pretrained(
        cfg["base_model"],
        torch_dtype=torch.bfloat16,
        attn_implementation="sdpa",
        trust_remote_code=True,
    )
    model = PeftModel.from_pretrained(base, repo).merge_and_unload().eval().cuda()

    # build ids = prefix + <instruction and findings> + suffix, then:
    # lg = model(input_ids, attention_mask, logits_to_keep=1).logits[:, -1, :]
    # p = torch.sigmoid(lg[:, cfg["yes_id"]] - lg[:, cfg["no_id"]])
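The "build ids" step in the comment above can be sketched without any model code. The helper below is an assumption-laden illustration: `build_batch` is a hypothetical name, it takes the prefix, body, and suffix as already-tokenized id lists, and it truncates the body from the end when a row would exceed `max_len` (the real truncation policy and the exact template pieces are whatever serving.json specifies). It left-pads so that the final position of every row is a real token, which is what reading the logits at position -1 relies on.

```python
from typing import List, Tuple


def build_batch(prefix_ids: List[int], bodies: List[List[int]],
                suffix_ids: List[int], max_len: int,
                pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Assemble prefix + body + suffix per row, then left-pad to equal width."""
    budget = max_len - len(prefix_ids) - len(suffix_ids)
    if budget < 0:
        raise ValueError("max_len is too small for the prefix and suffix")
    if not bodies:
        return [], []

    # Assumed policy: truncate the body, never the template prefix or suffix.
    rows = [prefix_ids + body[:budget] + suffix_ids for body in bodies]
    width = max(len(row) for row in rows)

    # Padding goes on the left; the mask is 0 over padding and 1 over tokens.
    input_ids = [[pad_id] * (width - len(row)) + row for row in rows]
    attention_mask = [[0] * (width - len(row)) + [1] * len(row) for row in rows]
    return input_ids, attention_mask


ids, mask = build_batch(prefix_ids=[1, 2], bodies=[[10, 11, 12], [20]],
                        suffix_ids=[9], max_len=5, pad_id=0)
```

In this toy call the first body is cut to two tokens to fit `max_len=5`, and the second row receives one pad token on the left. The resulting lists could be turned into tensors with `torch.tensor(...)` before being passed to the model.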

Results

WarrantBench (Hanno-Labs/warrantbench): Bosun-XS out-steers a frontier LLM on steerability and cross-domain bridging, and trails it slightly on negation:

| | cosine | Bosun-XS | gemini-3.1-flash-lite |
|---|---|---|---|
| steerability (score flips with the rule) | 0.00 | 0.94 | 0.58 |
| negation ("NOT the same topic") | 0.00 | 0.97 | 0.996 |
| cross-domain bridge | 0.32 | 0.83 | 0.38 |

On novel rules it never trained on: 0.95 ("both mention a figure ≥ $1B") and 0.95 ("both involve a government or regulator"), vs 0.35 / 0.63 for flash-lite.

FollowIR (public instruction-following retrieval, p-MRR): Bosun-XS tops the board where most retrievers score zero or negative — they read the instruction as keywords; Bosun reads it as a rule.

Files

| file | what |
|---|---|
| adapter_model.safetensors, adapter_config.json | the LoRA adapter (load with PEFT over the base) |
| serving.json | inference contract: template + yes_id/no_id + max_len |
| tokenizer/ | Qwen tokenizer (left-padding) |

Links

Launch post — Introducing Bosun

WarrantBench — github.com/Hanno-Labs/warrantbench (dataset)

From Hanno Labs.


Model tree for Hanno-Labs/bosun-xs

Base model: Qwen/Qwen3-0.6B-Base

Finetuned: Qwen/Qwen3-Reranker-0.6B

Adapter (3): this model


Evaluation results (self-reported)

Steerability (score flips with the rule) on WarrantBench: 0.935

p-MRR (full pool, avg of 3 tasks) on FollowIR: 10.500
