Testing and evaluation of generative large language models in electronic health record applications: a systematic review
Jan 13, 2026ยท,
,,,,,,,,,,,ยท
0 min read
Xinsong Du
Zhengyang Zhou
Yifei Wang
Ya-Wen Chuang
Yiming Li
Richard Yang
Wenyu Zhang
Xinyi Wang
Xinyu Chen
Hao Guan
John Lian
Pengyu Hong
David W. Bates
Li Zhou
Abstract
Background: The use of generative large language models (LLMs) with electronic health record (EHR) data is rapidly expanding to support clinical and research tasks. This systematic review characterizes the clinical fields and use cases that have been studied and evaluated to date; Methods: We followed the Preferred Reporting Items for Systematic Review and Meta-Analyses guidelines to conduct a systematic review of articles from PubMed and Web of Science published between January 1, 2023, and November 9, 2024. Studies were included if they used generative LLMs to analyze real-world EHR data and reported quantitative performance evaluations. Through data extraction, we identified clinical specialties and tasks for each included article, and summarized evaluation methods; Results: Of the 18 735 articles retrieved, 196 met our criteria. Most studies focused on radiology (26.0%), oncology (10.7%), and emergency medicine (6.6%). Regarding clinical tasks, clinical decision support made up the largest proportion of studies (62.2%), while summarizations and patient communications made up the smallest, at 5.6% and 5.1%, respectively. In addition, GPT-4 and GPT-3.5 were the most commonly used generative LLMs, appearing in 60.2% and 57.7% of studies, respectively. Across these studies, we identified 22 unique non-NLP metrics and 35 unique NLP metrics. While NLP metrics offer greater scalability, none demonstrated a strong correlation with gold-standard human evaluations; Conclusion: Our findings highlight the need to evaluate generative LLMs on EHR data across a broader range of clinical specialties and tasks, as well as the urgent need for standardized, scalable, and clinically meaningful evaluation frameworks.
Type
Publication
Journal of the American Medical Informatics Association