New benchmark evaluates AI for everyday patient care


Jun 17, 2026


Technology & Innovation

Research


Mass General Brigham researchers created BRIDGE, which identified significant gaps between AI’s performance on medical licensing exams and patient care tasks.

Researchers at Mass General Brigham developed BRIDGE, a multilingual benchmark that evaluates how well large language models (LLMs) understand clinical patient-care text, including language used in electronic health records (EHRs), across nine languages. The benchmarking tool could help clinicians evaluate and compare LLMs to use in specific contexts. Results are published in Nature Biomedical Engineering.

“Unlike many existing medical AI benchmarks, BRIDGE focuses on real-world clinical data sources that better reflect the complexity of real-world care,” said senior author Jie Yang, PhD, FACMI, FAMIA, of the Division of Pharmacoepidemiology and Pharmacoeconomics in the Mass General Brigham Department of Medicine. “BRIDGE can help clinicians select the right AI tools while guiding developers in improving model performance.”

Medical LLMs have traditionally been assessed using licensing exam questions composed of standardized language and medical knowledge that may not fully reflect the complexity of real-world clinical interactions. The developers of BRIDGE created a framework for assessing LLMs using clinical text from EHRs, clinical case reports, and patient-doctor consultations. While the highest performing LLM scored as high as 92 on standardized medical exams, it earned only 44.8% on BRIDGE, highlighting the LLM’s gaps in understanding of nuanced clinical language used in health care settings.

Yang and colleagues, including co-senior author Joshua Lin, MD, MPH, ScD, and co-first authors Jiageng Wu and Bowen Gu, used BRIDGE to systematically evaluate the performance of 95 LLMs from 59 clinical sources on real-world clinical tasks spanning the patient care continuum. This involved 14 clinical specialties and included triage, information extraction, diagnosis, prognosis, and billing coding. They also created a public, continuously updated leaderboard (which now includes 107 LLMs), enabling clinicians and AI developers to compare LLM performance across clinical tasks.

BRIDGE also revealed that AI performance varies across medical specialties. Because the benchmark includes clinical data in nine languages, it enables researchers to identify LLM performance gaps and support the development of more accurate and equitable AI tools for non-English-speaking patients.
