medRxiv Preprint | 2024
C. Chang, et al.
Abstract
As a proof-of-concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) through publicly available user interfaces on clinically relevant scenarios. Results demonstrate a significant proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.