Building Reliable Agentic AI Systems
A case study in building production-ready agentic AI systems
This paper presents the Preclinical Information Center (PRINCE), a cloud-hosted platform developed by Bayer AG with Thoughtworks to address pharmaceutical industry challenges in drug development. PRINCE leverages Agentic Retrieval-Augmented Generation and Text-to-SQL to integrate decades of safety study reports. We describe PRINCE's evolution from keyword-based search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents. We reflect on key engineering decisions through the lens of context engineering (how information was shaped and routed between specialized agents) and harness engineering (how orchestration, recovery, and observability were built around the models to maintain control and reliability). The system prioritizes trust through transparency, explainability, and human-in-the-loop integration. PRINCE demonstrates AI's transformative potential in pharmaceuticals, significantly improving data accessibility and research efficiency while ensuring governance and compliance.
16 June 2026
Sarang Sanjay Kulkarni
Sarang Kulkarni is a Principal Consultant at Thoughtworks, working at the intersection of software engineering, data platforms, and applied AI. He focuses on building production-grade GenAI systems, particularly Retrieval-Augmented Generation (RAG) and multi-agent workflows, and helps teams take these systems from early ideas to real-world use. Sarang also contributes to Thoughtworks’ Global AI Service Development team and teaches an O’Reilly course on building production-ready RAG applications.
Contents
The Challenge: Navigating the Preclinical Data Maze
The Solution: PRINCE - An Evolutionary Platform
System Architecture: Engineering a Reliable Agentic RAG System
The Agentic RAG System
Clarify User Intent
Think & Plan: Process Reflection
The Researcher Agent
The Reflection Agent: Data Validation and Sufficiency
The Writer Agent: Answer Synthesis and Formatting
Building Trust in a Production LLM System
Transparency and Explainability
Evaluation
Monitoring
Engineering for Resilience: Error Handling and Recovery
Enhancing Data Quality: Named Entity Recognition and Annotation
The Journey Continues: Iterative Development
Conclusion
Preclinical drug discovery is inherently complex and data-intensive. Researchers face the significant challenge of efficiently accessing and analyzing vast volumes of information generated during this critical phase. Traditional keyword-based search methods, often reliant on rigid Boolean logic, frequently fall short when confronted with the nuanced and intricate nature of preclinical research questions.
The advent of Large Language Models (LLMs) has presented a transformative opportunity. Retrieval-Augmented Generation (RAG), which combines the generative power of LLMs with the precision of information retrieval systems, has emerged as a promising technique. This approach holds the potential to revolutionize preclinical data access, enabling researchers to pose complex questions in natural language and receive accurate, context-rich answers grounded in proprietary data.
Recognizing this potential early, Bayer committed to exploring how these technologies could address longstanding challenges in preclinical research.
In this post, we share that journey: how Bayer's early investment in generative AI has resulted in PRINCE, an agentic AI system built on Agentic RAG. This case study explores the technical architecture, engineering decisions, and lessons learned in transforming preclinical data retrieval from a challenging maze into an intuitive conversational experience.
Many of the engineering decisions behind PRINCE can now be understood through the lens of context engineering and harness engineering, although we did not use these terms when the system was first designed. Context engineering shaped what information each model received, what it did not receive, and how context moved between specialized steps such as research, reflection, and writing. Harness engineering shaped the scaffolding around the models: orchestration, tool boundaries, state persistence, retries, fallbacks, validation, reflection loops, observability, and human review.
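To make the distinction concrete, here is a minimal Python sketch of the two ideas. It is not PRINCE's implementation: every name in it (`RunState`, `with_retries`, `run`, and the stubbed researcher, reflection, and writer functions) is hypothetical, and the "models" are plain functions standing in for LLM calls. The harness owns the loop, the retries, and the log, while each step receives only the slice of context it needs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """State the harness persists across steps (all fields are illustrative)."""
    question: str
    evidence: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)  # harness log for observability
    answer: str = ""


def with_retries(step: Callable, max_attempts: int, log: list[str]) -> Callable:
    """Harness concern: retry a failing step and record each failure."""
    def wrapped(*args):
        for attempt in range(1, max_attempts + 1):
            try:
                return step(*args)
            except RuntimeError as exc:
                log.append(f"{step.__name__} attempt {attempt} failed: {exc}")
        raise RuntimeError(f"{step.__name__} failed after {max_attempts} attempts")
    return wrapped


def run(question: str, research: Callable, reflect: Callable, write: Callable,
        max_rounds: int = 3, max_attempts: int = 2) -> RunState:
    """Orchestrate research -> reflection -> writing with a bounded loop."""
    state = RunState(question)
    research = with_retries(research, max_attempts, state.notes)
    for round_no in range(1, max_rounds + 1):
        # Context engineering: the researcher sees the question only.
        state.evidence += research(state.question)
        # The reflection step sees the question and evidence, not the log.
        if reflect(state.question, state.evidence):
            break
        state.notes.append(f"round {round_no}: evidence insufficient")
    # The writer never receives harness notes or retry history.
    state.answer = write(state.question, state.evidence)
    return state


# Stub steps standing in for model calls (assumptions for this sketch only).
calls = {"n": 0}


def flaky_researcher(question: str) -> list[str]:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("tool timeout")  # simulate one transient failure
    return [f"finding {calls['n']} for {question!r}"]


def sufficiency_check(question: str, evidence: list[str]) -> bool:
    return len(evidence) >= 2  # toy rule: two findings count as enough


def simple_writer(question: str, evidence: list[str]) -> str:
    return f"Answer to {question!r} from {len(evidence)} findings: " + "; ".join(evidence)


result = run("liver findings in rat studies", flaky_researcher,
             sufficiency_check, simple_writer)
print(result.answer)
print(result.notes)
```

In this sketch the retry wrapper and the round limit live entirely outside the stubbed steps, so a step can be swapped or tested without touching the control flow. A real system would add the other harness concerns listed above, such as fallbacks, validation, and human review.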
While this post focuses on the technical architecture and engineering challenges, our paper published in Frontiers in Artificial Intelligence covers the product evolution and business impact in more detail.
The Challenge: Navigating the Preclinical Data Maze
The preclinical research landscape at Bayer, like many large pharmaceutical organizations, is characterized by a diverse and extensive array of data. This includes highly structured datasets from various studies, alongside vast amounts of unstructured information embedded within text documents such as study reports, publications, and regulatory submissions. Researchers...