Improving knowledge graph creation in life sciences through agent steering | Blue Guardrails
Intro
Biomedical R&D generates increasingly diverse datasets across modalities like clinical, preclinical, omics, and real-world data, often in heterogeneous formats.<br>Knowledge graphs are a way to harmonize this data, but creating them is costly and time-consuming.<br>Many teams turn to LLMs to speed up the process, and the most capable setups use agents.<br>While agents are a step up from single-pass LLM workflows, they are not without fault.<br>Errors can compound over long agent trajectories and, unlike coding agents, domain-specific agents lack verification loops.
In this post, we demonstrate empirically how agent steering improves completeness, reduces hallucinations,<br>and improves entity resolution in an agent for knowledge graph creation.
What is agent steering?
Instead of front-loading all instructions into the initial prompt, agent steering intercepts the agent mid-run to provide<br>feedback specific to its current state.<br>The agent self-corrects and yields better outcomes than what would be achieved through prompting alone.
In our setup, an evaluator detects issues like deviations from the system prompt, missed nodes, or hallucinations.<br>It pinpoints each issue to a specific text span and provides an explanation.<br>This information is injected into the agent's trajectory as a correction prompt.<br>The correction loop can run multiple times: if the evaluator finds new issues after the initial correction, the agent is prompted to correct them too.
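The control flow of such a correction loop can be sketched in a few lines. This is a minimal illustration, not the Blue Guardrails implementation: `steer`, `Issue`, `toy_agent`, and `toy_evaluator` are hypothetical names, and the sketch evaluates after each agent pass rather than intercepting a live trajectory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Issue:
    span: str         # text span the evaluator points at
    explanation: str  # why the evaluator flagged it


def steer(
    run_agent: Callable[[list[str]], str],
    evaluate: Callable[[str], list[Issue]],
    task: str,
    max_corrections: int = 2,
) -> tuple[str, int]:
    """Run the agent, then inject corrections until no issues remain."""
    messages = [task]
    output = run_agent(messages)
    corrections = 0
    while corrections < max_corrections:
        issues = evaluate(output)
        if not issues:
            break  # evaluator is satisfied, stop steering
        feedback = "\n".join(f'- "{i.span}": {i.explanation}' for i in issues)
        # The correction prompt becomes part of the agent's trajectory.
        messages.append("Please correct these issues:\n" + feedback)
        output = run_agent(messages)
        corrections += 1
    return output, corrections


def toy_agent(messages: list[str]) -> str:
    # Toy stand-in: only "extracts" drowsiness after being told about it.
    found = ["insomnia"]
    if any("drowsiness" in m for m in messages[1:]):
        found.append("drowsiness")
    return ", ".join(found)


def toy_evaluator(output: str) -> list[Issue]:
    # Toy stand-in: flags a single missed adverse reaction.
    if "drowsiness" in output:
        return []
    return [Issue("drowsiness", "adverse reaction node was missed")]


result, n = steer(toy_agent, toy_evaluator, "Extract adverse reactions from the SmPC.")
print(result, n)
```

The `max_corrections` cap keeps the loop bounded even when the evaluator keeps finding new issues.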
Creating a knowledge graph from unstructured documents
To evaluate agent steering, we built an agent that creates a knowledge graph from Summaries of Product Characteristics (SmPCs).<br>SmPCs are regulatory documents that describe prescription medicines for medical professionals.<br>The agent extracts nodes and edges from SmPCs and stores them in a Neo4j graph database.
Here is a simplified excerpt of the schema:
from dataclasses import dataclass
from typing import Literal


@dataclass
class ClinicalCondition:
    node_id: str
    # SNOMED CT is a database for clinical terminology.
    # The agent has a tool to search it and uses SNOMED codes for entity resolution.
    snomed_code: int
    canonical_name: str


@dataclass
class Substance:
    node_id: str
    snomed_code: int
    canonical_name: str


@dataclass
class CausesAdverseReactionEdge:
    from_node: str
    to_node: str
    frequency: Literal[
        "Very common",
        "Common",
        "Uncommon",
        "Rare",
        "Very rare",
        "Not known",
    ]

The agent receives a PDF and extracts nodes and edges using the tools we provide as its harness.<br>The harness is critical to the system's success but out of scope for this post (we cover it in our upcoming live stream).
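One detail worth noting about a schema like this: `Literal` is only a type hint, and dataclasses do not check it at runtime. The sketch below shows one way an edge could reject an out-of-vocabulary frequency. The `Frequency` alias and the `__post_init__` check are assumptions added for illustration and are not part of the schema excerpt above.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Hypothetical alias so the allowed values can be reused for validation.
Frequency = Literal[
    "Very common", "Common", "Uncommon", "Rare", "Very rare", "Not known"
]


@dataclass
class CausesAdverseReactionEdge:
    from_node: str
    to_node: str
    frequency: Frequency

    def __post_init__(self) -> None:
        # get_args returns the tuple of values listed inside the Literal.
        if self.frequency not in get_args(Frequency):
            raise ValueError(f"unknown frequency: {self.frequency!r}")


# A valid edge, using the node IDs from the example sub-graph.
edge = CausesAdverseReactionEdge("drug_aerinaze", "cc_insomnia", "Common")

# A frequency outside the vocabulary is rejected at construction time.
try:
    CausesAdverseReactionEdge("drug_aerinaze", "cc_insomnia", "Sometimes")
    rejected = False
except ValueError:
    rejected = True
```

A check like this only catches values outside the vocabulary. It cannot tell whether "Common" is the frequency the document actually states, which is where an evaluator comes in.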
As an example, here is a sub-graph of some adverse reactions extracted from two SmPCs:
[Interactive figure: sub-graph of drugs and adverse reactions. Example extracted edge: Aerinaze causes adverse reaction Insomnia (frequency: Common), grounded on page 6 of aerinaze-epar-product-information_en.pdf in the span “Insomnia, somnolence, sleep disorder,”. Edge metadata: source node drug_aerinaze, target node cc_insomnia, edge type CausesAdverseReaction.]
Even with a well-designed harness, the extraction is far from perfect. The main issues we observed with the plain agent are:
Completeness
The agent underextracts nodes and edges: it misses information that it should extract and creates only a fraction of the<br>expected graph.
Hallucinations
The agent fabricates node or edge attributes, for example assigning the wrong frequency to an adverse reaction.<br>This is also the type of issue we cover in PlaceboBench (our hallucination benchmark for life sciences).
Entity Resolution
The agent uses the wrong concept ID or no ID at all, which leads to duplicate nodes and merge failures across documents.
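To make the entity resolution failure concrete, here is a small sketch of merging nodes across documents by concept code. It assumes a simplified `Node` type with an optional code (the schema excerpt uses a required `int`), and the codes and file names are placeholders for illustration, not real SNOMED CT identifiers or real SmPCs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    canonical_name: str
    snomed_code: Optional[int]  # None when the agent resolved no concept
    source: str


def merge_by_code(nodes: list[Node]) -> tuple[dict[int, list[Node]], list[Node]]:
    """Group nodes by concept code; nodes without a code cannot be merged."""
    merged: dict[int, list[Node]] = {}
    unresolved: list[Node] = []
    for node in nodes:
        if node.snomed_code is None:
            unresolved.append(node)
        else:
            merged.setdefault(node.snomed_code, []).append(node)
    return merged, unresolved


# Placeholder codes for illustration only.
nodes = [
    Node("Insomnia", 1001, "doc_a.pdf"),
    Node("Sleeplessness", 1001, "doc_b.pdf"),  # same concept, different wording
    Node("Insomnia", None, "doc_c.pdf"),       # no ID: stays a duplicate
    Node("Somnolence", 1002, "doc_a.pdf"),
]
merged, unresolved = merge_by_code(nodes)
graph_node_count = len(merged) + len(unresolved)
print(graph_node_count)
```

The two nodes sharing a code collapse into one concept even though their names differ, while the node without a code remains a separate duplicate. A wrong code is worse than a missing one, because it silently merges the node into an unrelated concept.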
Steering the extraction agent
While the agent runs, traces are sent to Blue Guardrails.<br>An evaluator using an Agent-as-a-Judge approach detects issues, pinpoints them to specific text spans, and returns explanations.<br>The agent receives a correction prompt based on this data.<br>During our experiments, we limit the system to two correction loops.
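As a rough illustration of how evaluator findings could be turned into a correction prompt, consider the sketch below. The `Issue` fields, the `build_correction_prompt` helper, and the prompt wording are all hypothetical; the post does not show the actual format Blue Guardrails returns.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    kind: str         # e.g. "missed_node" or "hallucination"
    section: str      # SmPC section containing the span
    span: str         # exact text the evaluator points at
    explanation: str  # evaluator's reasoning


def build_correction_prompt(issues: list[Issue]) -> str:
    """Render evaluator findings as one numbered correction message."""
    lines = ["An evaluator reviewed your extraction and found these issues:"]
    for number, issue in enumerate(issues, start=1):
        lines.append(
            f"{number}. [{issue.kind}] section {issue.section}, "
            f'span "{issue.span}": {issue.explanation}'
        )
    lines.append("Fix each issue using your extraction tools, then continue.")
    return "\n".join(lines)


prompt = build_correction_prompt(
    [Issue("missed_node", "4.8", "drowsiness", "An adverse reaction node was missed.")]
)
print(prompt)
```

Quoting the exact span and its section gives the agent a concrete place to look, rather than a general instruction to "be more complete".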
[Figure: agent steering loop. On the local machine, the extraction agent reads SmPC.pdf and calls extract_nodes(...) and extract_edges(...). Its traces are sent to a tracing endpoint that receives OTLP spans. An Agent-as-a-Judge evaluator identifies extraction issues (example: a missed adverse reaction node for “drowsiness” in SmPC §4.8) and produces an issue prompt asking the agent to revisit §4.8. The correction is injected before the agent's next extract_nodes(...) call.]<br>The agent steering loop: traces flow to Blue Guardrails, an evaluator detects issues, and an issue prompt is injected back into the agent run as a correction.
Impact on extraction quality
We evaluate agent steering against a ground truth dataset of seven medicines spanning various drug classes and complexities.<br>The dataset contains 1,622 nodes and 1,865 edges.<br>We run graph creation with two models in different capability tiers: Deepseek...