About 1 in 3 American adults now use AI chatbots for health questions, according to a recent KFF survey. That number should alarm anyone who read what landed in JAMA Network Open this morning: a Mass General Brigham team tested 21 large language models, including GPT-5, Claude 4.5 Opus, Grok 4, and Gemini 3.0, against 29 clinical vignettes using the PrIME-LLM evaluation tool. Every single model failed to produce an appropriate differential diagnosis in more than 80% of the vignettes. Not some budget chatbot. The flagship models. All of them.
Differential diagnosis is the process a clinician uses at the beginning of a case, before the labs come back, before imaging confirms anything. It is the part of medicine that requires holding uncertainty, weighing competing explanations, and knowing what you do not yet know. Study author Arya Rao put it plainly: these models are good at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case. That is precisely when patients are most vulnerable, and most likely to be using an app instead of calling a doctor.
The 65% Claim Deserves Scrutiny
OpenAI told NBC News on April 8 that a study showed 65% accuracy for non-urgent cases. The researchers behind the JAMA study dispute that figure, arguing it does not reflect real-world deployment. I would add: 65% accuracy in a non-urgent context is not a selling point. A clinician operating at 65% accuracy on non-urgent cases would face a licensing board. The bar for software marketed as health guidance should not be lower than the bar for the humans it claims to supplement.
To be fair to the technology: the same JAMA study found that final diagnosis accuracy climbs to 60-90% when models have complete lab and imaging data. That is a genuine finding, and it supports a real use case for AI as a tool inside clinical workflows, where a physician interprets the output. Co-author Marc Succi said it directly: these models are not ready for unsupervised clinical-grade deployment. The supervised version of this technology may eventually be useful. The consumer-facing version, sold to people who are scared and Googling symptoms at midnight, is a different product entirely.
The Privacy Problem Nobody Is Pricing In
The accuracy problem is compounded by a data problem. Most consumer AI health apps are not covered by HIPAA. ChatGPT, Fitbit, and similar tools can share or sell health data without user consent. A MyLymeData survey of more than 1,900 patients found that respondents fear denial of coverage by insurers and discrimination by employers as direct consequences of sharing health data with non-HIPAA platforms. The health data market is valued at $434 billion. The incentive to collect and monetize is not subtle.
Lorraine Johnson, CEO of LymeDisease.org, said AI chatbots can sound certain while being wrong. That is the specific danger. Confidence without calibration. A tool that hedges appropriately is annoying; a tool that delivers a wrong answer with authority sends people home from the ER.
The FDA has the authority to classify consumer AI diagnostic tools as medical devices and require pre-market accuracy validation, and the FTC has the authority to act against deceptive accuracy claims. Both agencies should use those powers. Vendors should be required to publish methodology, sample size, and failure rates for any accuracy claim they make publicly, the same standard we hold pharmaceutical companies to when they advertise a drug. The $434 billion market will not self-correct. It will keep selling confidence to people who cannot afford to be wrong about their health.