How Good Are AI ‘Clinicians’ at Medical Conversations?

Researchers design a more realistic test to evaluate AI’s clinical communication skills


At a glance:

  • Researchers design a new, more reliable way to evaluate AI models’ ability to make clinical decisions in scenarios that closely mimic real-life patient interactions.

  • The analysis finds that large language models excel at making diagnoses from exam-style questions but struggle to do so from conversational notes.

  • The researchers propose a set of guidelines to optimize AI tools’ performance and align them with real-world practice before integrating them into the clinic.

Artificial-intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories, and even providing preliminary diagnoses. These tools, known as large language models, are already being used by patients to make sense of their symptoms and medical test results.

But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?


Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.

Authorship, funding, disclosures

Additional authors include Jaehwan Jeong and Hong-Yu Zhou, Harvard Medical School; Benjamin A. Tran, Georgetown University; Daniel I. Schlessinger, Northwestern University; Shannon Wongvibulsin, University of California-Los Angeles; Leandra A. Barnes, Zhuo Ran Cai, and David Kim, Stanford University; and Eliezer M. Van Allen, Dana-Farber Cancer Institute.

The work was supported by the HMS Dean’s Innovation Award and a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. Johri received further support through the IIE Quad Fellowship.

Daneshjou reported receiving personal fees from DWA, Pfizer, L’Oréal, and VisualDx; stock options from MDAlgorithms and Revea outside the submitted work; and a pending patent for TrueImage. Schlessinger is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant with Appiell Inc. and LuminDx, and an investigator for AbbVie and Sanofi. Van Allen serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research, and Serinus Biosciences and provides research support to Novartis, BMS, Sanofi, and NextPoint. Van Allen holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Biosciences, and Syapse. Van Allen has filed for institutional patents on chromatin mutations and immunotherapy response and on methods for clinical interpretation, provides intermittent legal consulting on patents for Foley Hoag, and serves on the editorial board of Science Advances.