New Benchmark Evaluates AI for Everyday Patient Care | Mass General Brigham
Jun 17, 2026
Mass General Brigham researchers created BRIDGE, which identified significant gaps between AI’s performance on medical licensing exams and patient care tasks.
Researchers at Mass General Brigham developed BRIDGE, a multilingual benchmark that evaluates how well large language models (LLMs) understand clinical patient-care text, including language used in electronic health records (EHRs), across nine languages. The benchmarking tool could help clinicians evaluate and compare LLMs for use in specific contexts. Results are published in Nature Biomedical Engineering.
“Unlike many existing medical AI benchmarks, BRIDGE focuses on real-world clinical data sources that better reflect the complexity of real-world care,” said senior author Jie Yang, PhD, FACMI, FAMIA, of the Division of Pharmacoepidemiology and Pharmacoeconomics in the Mass General Brigham Department of Medicine. “BRIDGE can help clinicians select the right AI tools while guiding developers in improving model performance.”
Medical LLMs have traditionally been assessed using licensing exam questions composed of standardized language and medical knowledge that may not fully reflect the complexity of real-world clinical interactions. The developers of BRIDGE created a framework for assessing LLMs using clinical text from EHRs, clinical case reports, and patient-doctor consultations. While the highest performing LLM scored as high as 92 on standardized medical exams, it earned only 44.8% on BRIDGE, highlighting gaps in the LLM’s understanding of the nuanced clinical language used in health care settings.
Yang and colleagues, including co-senior author Joshua Lin, MD, MPH, ScD, and co-first authors Jiageng Wu and Bowen Gu, used BRIDGE to systematically evaluate the performance of 95 LLMs from 59 clinical sources on real-world clinical tasks spanning the patient care continuum. The tasks covered 14 clinical specialties and included triage, information extraction, diagnosis, prognosis, and billing coding. They also created a public, continuously updated leaderboard (which now includes 107 LLMs), enabling clinicians and AI developers to compare LLM performance across clinical tasks.
BRIDGE also revealed that AI performance varies across medical specialties. Because the benchmark includes clinical data in nine languages, it enables researchers to identify LLM performance gaps and support the development of more accurate and equitable AI tools for non-English-speaking patients.