Red Teaming Large Language Models in Medicine: Real-World Insights on Model Behavior

C. Chang, et al.

Abstract

As a proof of concept, we convened an interactive “red teaming” workshop in which medical and technical professionals stress-tested popular large language models (LLMs) on clinically relevant scenarios through publicly available user interfaces. The results show a substantial proportion of inappropriate responses across GPT-3.5, GPT-4.0, and GPT-4.0 with Internet (25.7%, 16.2%, and 17.5%, respectively) and illustrate the valuable role that non-technical clinicians can play in evaluating models.
