AI agents for regulatory use cases: competition on the EU AI Act



Live competition · May to June 2026

EU AI Act Q&A Competition

An independent benchmark for purpose-built AI systems answering questions on Regulation (EU) 2024/1689.

Regulatory workflows have near-zero margin for error. Yet rigorous, agentic evaluations remain scarce in the field. We built the first Q&A evaluation competition focused on the EU AI Act so that the systems claiming to handle it can be tested against ground truth, by an independent third party, on the dimensions that matter for regulated use.

Free to participate

One submission per contestant this edition

Individual report plus opt-out option


Off-the-Shelf AI Isn't Audit-Ready for Regulation

We see the potential of AI in regulatory workflows. We also see the practical gap between promising answers and audit-ready reliability. We've been measuring it.

Finding: 1 in 3

Even the best-performing search-enabled systems give factually incorrect answers or cite the wrong references roughly one out of three times when tested on the EU AI Act.

W&B Weave report, March 2026 →

Observation: Purpose-built

Generic agents are not enough for regulated work. Purpose-built systems are necessary. So are rigorous evaluations that ensure those systems are fit for purpose, and remain so over time.

The EU AI Act Q&A Competition serves as the next step: advancing the evaluation of purpose-built regulatory AI systems.

For the broader regulatory context and how this translates into implementation work, see AI in Regulated Life Science and our AI Governance & Compliance service.

Five Dimensions

Regulatory AI is judged on more than answer correctness, so the competition is multi-dimensional. Every submission is scored against question-specific ground truth across five dimensions. The evaluation is then repeated on a simulated multi-turn conversation, to reflect how these systems are used in practice.

Answer Correctness: Tested against question-specific ground-truth correctness criteria. Variants: strict and loose.

Reference Accuracy: Proposed references checked against expected ones. Variants: strict and loose.

Conciseness: Answer and reference-set lengths assessed against benchmark exemplars.

Tone: Assessment of clarity and appropriateness of the language for regulatory contexts.

Latency: Time from prompt submission to response, measured per question.
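To make the strict and loose variants of Reference Accuracy concrete, here is a minimal sketch of one way such a check could work. The matching rules are assumptions for illustration only: the page does not define them, and the full rules document is the authority. In this sketch, "strict" means the proposed and expected reference sets are identical after normalization, and "loose" means every expected top-level provision (for example "Article 3" for "Article 3.1") is cited. The function names `normalize`, `top_level`, and `reference_scores` are hypothetical.

```python
import re


def normalize(ref: str) -> str:
    """Lowercase and collapse whitespace so 'Article  3.1' equals 'article 3.1'."""
    return re.sub(r"\s+", " ", ref.strip().lower())


def top_level(ref: str) -> str:
    """Drop everything after the first dot: 'Article 3.1' -> 'article 3'."""
    return normalize(ref).split(".", 1)[0]


def reference_scores(proposed: list[str], expected: list[str]) -> dict[str, bool]:
    """Compare proposed references with expected ones under two ASSUMED rules."""
    proposed_set = {normalize(r) for r in proposed}
    expected_set = {normalize(r) for r in expected}

    # ASSUMPTION: strict = exactly the expected references, nothing more or less.
    strict = proposed_set == expected_set

    # ASSUMPTION: loose = every expected top-level provision is cited somewhere.
    loose = {top_level(r) for r in expected} <= {top_level(r) for r in proposed}

    return {"strict": strict, "loose": loose}
```

Under these assumed rules, citing "Article 3.4" when "Article 3.1" is expected fails the strict check but passes the loose one. References written in other styles, such as "Article 3(1)", would need additional normalization.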

Illustrative visualization of results for one multi-turn configuration (e.g. on/off). Contestant names are invented.

Independent Assessment Plus Reach

Participants receive an individual benchmark report and may be included in public summary materials after the opt-out period.

💡 Independent evaluation

A best-effort, automated assessment grounded in question-specific correctness checks and reference verification.

📊 Individual report

A dedicated report for your system, with scoring across every benchmark dimension and comparisons against off-the-shelf reference methods.

🌎 Public visibility

Your system is featured in regenold's downstream publications: articles, web content, and social posts after the opt-out window.

Opt-out option. Not happy with your individual report? You can opt out in writing within 10 working days of the results being communicated. Your entry is then anonymized in our use of the results.

Three Steps to Your Benchmark

Participation is free and open to anyone willing to test their AI system on the EU AI Act.

1. Get in touch. Send your endpoint details and participant information to our technical contact. We confirm onboarding and reserve your benchmark slot.

2. We run the evaluation. We send conversation histories to your API and collect the JSON responses. Latency and multi-turn behaviour are measured automatically.

3. You receive your report. Your individual report shows performance across all dimensions, alongside reference benchmarks from popular off-the-shelf methods.
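The evaluation step can be pictured with a small harness like the one below. It is a sketch under stated assumptions, not the organizers' actual tooling: the system under test is represented by a plain Python callable (`call_system`, hypothetical) instead of a real HTTP endpoint, and the model's previous `answer` is assumed to be fed back as the assistant turn when building the multi-turn history. Latency is measured per question with `time.perf_counter`.

```python
import json
import time
from typing import Callable

Message = dict[str, str]


def run_conversation(
    call_system: Callable[[list[Message]], str],
    questions: list[str],
) -> list[dict]:
    """Send a growing conversation history and time each response."""
    history: list[Message] = []
    results = []
    for question in questions:
        history.append({"role": "user", "content": question})

        start = time.perf_counter()
        raw = call_system(list(history))  # pass a copy of the history so far
        latency_s = time.perf_counter() - start

        response = json.loads(raw)
        # ASSUMPTION: the previous answer becomes the assistant turn.
        history.append({"role": "assistant", "content": response["answer"]})
        results.append(
            {"question": question, "response": response, "latency_s": latency_s}
        )
    return results


def echo_system(messages: list[Message]) -> str:
    """Stand-in for a contestant system; it simply echoes the last question."""
    last = messages[-1]["content"]
    return json.dumps({"reasoning": "", "answer": f"Echo: {last}", "references": []})


if __name__ == "__main__":
    for row in run_conversation(echo_system, ["Question one?", "Question two?"]):
        print(row["question"], "->", row["response"]["answer"])
```

Swapping `echo_system` for a function that posts the history to your own endpoint gives a quick local rehearsal of the multi-turn flow before onboarding.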

What Your System Needs to Provide

Your system needs to expose an API that accepts a conversation history and returns a single JSON response. The format follows the OpenAI/LiteLLM message convention.

Input: You receive

# JSON with OpenAI/LiteLLM message format
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."},
...,
{"role": "user", "content": "..."}

Output: You return

# JSON with three fields
"reasoning": "Optional. Not scored.",
"answer": "Short, professional answer.",
"references": [
  "Annex IV.2",
  "Article 3.1"
]
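As a rough illustration of the contract described above, the sketch below shows a request handler and a response validator. Several details are assumptions, not part of the published format: the request body is taken to be a JSON array of message objects, the response is taken to be a single JSON object, and the names `handle_request` and `validate_response` are hypothetical. The answer itself is a placeholder where a real system would do its work.

```python
import json


def validate_response(raw: str) -> dict:
    """Check that a response carries the three fields in the expected shapes."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if not isinstance(data.get("answer"), str):
        raise ValueError("'answer' must be a string")
    refs = data.get("references")
    if not isinstance(refs, list) or not all(isinstance(r, str) for r in refs):
        raise ValueError("'references' must be a list of strings")
    # 'reasoning' is optional and not scored, but should be text if present.
    if not isinstance(data.get("reasoning", ""), str):
        raise ValueError("'reasoning', if present, must be a string")
    return data


def handle_request(body: str) -> str:
    """Turn a conversation history into a single JSON response."""
    messages = json.loads(body)  # ASSUMPTION: body is a JSON array of messages
    if (
        not isinstance(messages, list)
        or not messages
        or not isinstance(messages[-1], dict)
        or messages[-1].get("role") != "user"
    ):
        raise ValueError("conversation must end with a user message")

    question = messages[-1]["content"]
    response = {
        "reasoning": "Placeholder; not scored.",
        "answer": f"(placeholder answer to: {question})",  # real system goes here
        "references": [],
    }
    return json.dumps(response)
```

Running `validate_response(handle_request(body))` on a sample history is a cheap self-check that your endpoint's output has the right shape before it is evaluated.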

Frequently Asked Questions (FAQ)

Who can participate?

Anyone willing to test their AI system on EU AI Act questions. Participation is free of charge....
