Dr. Adam Rodman, a physician and researcher at Beth Israel Deaconess Medical Center in Boston, was shocked by the results of an experiment he ran comparing the diagnostic abilities of doctors with those of generative artificial intelligence (genAI) tools. The data showed that physicians, even when assisted by the technology, were worse at correctly diagnosing patients than genAI alone — a lot worse.
“I reran the experiment, and my jaw dropped,” said Rodman, who is also director of AI Programs at Beth Israel. “AI alone was almost 20% better, like 90% accurate.
“We assumed that the humans would be better than the AI, and quite frankly we were shocked when we realized the AI really didn’t improve the physicians, and actually the AI alone was much better at diagnosing correctly,” he said.
The genAI model used, GPT-4 Turbo, came from OpenAI; it’s the same technology that powers Microsoft’s Copilot chatbot assistant. Not only did the model outperform physicians, but it also outdid every single AI system that had been developed for healthcare over the last 50 years, Rodman said. And it did so without any medical training.
The study, by Rodman and Dr. Jonathan Chen, an assistant professor of Medicine at Stanford University, was performed in 2023; it’s not unusual for study results to be published more than a year after completion.
The results of another study, to be published in Nature in two months, will be even more startling, Rodman said. “We’re also working on next-generation models and AI tools that try to get physicians to be better. So, we have a number of other studies that will be coming out.”
Rodman noted that the original study was performed around the same time health systems were rolling out secure GPT models for doctors. So, even though the technology was new to the workplace, Rodman and Chen both believed that combining physicians and genAI tools would yield better results than relying on the technology alone.
“The results flew in the face of the ‘fundamental theorem of informatics’ that assumes that the combination of human and computer should outperform either alone,” Chen said. “While I’d still like to believe that is true, the results of this study show that deliberate training, integration, and evaluation is necessary to actually realize that potential.”
Chen compared physicians’ use of genAI to the public’s early understanding of the Internet, noting that daily activities like searching, reading articles, and making online transactions are now taken for granted, though they were once learned skills. “Similarly,” Chen said, “I expect we will all need to learn new skills in how to interact with chatbot AI systems to nudge and negotiate them to behave in the ways we wish.”
AI is nothing new in healthcare
Healthcare organizations have utilized machine learning and AI since the early 1970s. In 1972, AAPhelp, a computer-based decision support system, became one of the first AI-based assistants developed to help in diagnosing appendicitis.
Two years ago, when OpenAI released ChatGPT into the wild, things began to change, and adoption among healthcare providers grew quickly as natural language processing made AI tools more user friendly.
By 2025, more than half (53.2%) of genAI spending by US healthcare providers will focus on chatbots and virtual health assistants, according to Mutaz Shegewi, IDC’s senior research director for healthcare strategies. “This reflects a focus on using genAI to personalize patient engagement and streamline service delivery, highlighting its potential to transform patient care and optimize healthcare operations,” Shegewi said.
According to IDC, 39.4% of US healthcare providers see genAI as a top-three technology that will shape healthcare during the next five years.
Large language models like GPT-4 are already being rolled out across the country by businesses and government agencies. While clinical decision support is one of the top uses for genAI in healthcare, there are others. For example, ambient listening models are being used to record conversations between physicians and patients and automatically write clinical notes for the doctor. Patient portals are adopting genAI, too, so when patients message their physicians with questions, the chatbot writes the first draft of responses.
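As a rough illustration of that portal pattern (not any particular vendor's implementation), a first-draft reply might be generated with a few lines of Python; the model name, prompts, and patient message below are assumptions made for the sketch.

```python
# Minimal sketch of a patient-portal "first draft" reply, for illustration only.
# The model name, prompts, and patient message are hypothetical, not any
# specific health system's implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

patient_message = (
    "My blood pressure readings have been around 150/95 this week. "
    "Should I change my dose?"
)

draft = client.chat.completions.create(
    model="gpt-4-turbo",  # assumption: the GPT-4 Turbo model discussed above
    messages=[
        {
            "role": "system",
            "content": (
                "Draft a reply to a patient portal message for a physician "
                "to review and edit. Do not give final medical advice; "
                "flag anything that sounds urgent."
            ),
        },
        {"role": "user", "content": patient_message},
    ],
)

# The physician reviews and edits the draft before anything is sent to the patient.
print(draft.choices[0].message.content)
```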
In fact, the healthcare industry is among the top adopters of AI technology, according to a study by MIT’s Sloan School of Management.
In 2021, the AI in healthcare market was worth more than $11 billion worldwide, with a forecast for it to reach around $188 billion by 2030, according to online data analysis service Statista. Also in 2021, about one-fifth of healthcare organizations worldwide were already in early-stage initiatives using AI.
Today, more than 70% of the roughly 100 US healthcare leaders surveyed — including payers, providers, and healthcare services and technology (HST) groups — are pursuing or have already implemented genAI capabilities, according to research firm McKinsey & Co.
Where AI can be found in healthcare today
AI is mostly being used for clinical decision support, where it can analyze patient information against scientific literature, care guidelines, and treatment history and offer physicians diagnostic and therapeutic options, according to healthcare credentialing and billing company Medwave.
AI models, such as deep learning algorithms, can predict the risk of patient readmission within 30 days of discharge, particularly for conditions like heart failure, according to Colin Drummond, assistant chair of the Department of Biomedical Engineering at Case Western Reserve University.
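As a minimal sketch of that kind of risk model, with invented features and data, and a simple classifier standing in for the deep learning systems Drummond describes:

```python
# Toy 30-day readmission risk model on tabular EHR-style features.
# Feature names and data are invented for illustration; real systems use far
# richer feature sets and, often, deep learning architectures.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: age, prior admissions in the last year, length of stay (days), ejection fraction (%)
X_train = np.array([
    [72, 3, 9, 30],
    [55, 0, 2, 60],
    [80, 2, 7, 25],
    [47, 1, 3, 55],
    [68, 4, 11, 20],
    [59, 0, 1, 65],
])
y_train = np.array([1, 0, 1, 0, 1, 0])  # 1 = readmitted within 30 days of discharge

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

new_patient = np.array([[75, 2, 8, 28]])
risk = model.predict_proba(new_patient)[0, 1]
print(f"Estimated 30-day readmission risk: {risk:.0%}")  # flags the patient for closer follow-up
```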
Natural language processing models can analyze clinical notes and patient records to extract relevant information, aiding in diagnosis and treatment planning. And AI-powered tools are already being used to interpret medical images with a high degree of accuracy, according to Drummond.
“This can streamline the workflow for clinicians by reducing the time spent on documentation,” he said. “These do, of course, need to be vetted and verified by staff, but, again, this can expedite decision-making.
“For instance, AI systems can detect diabetic retinopathy from retinal images, identify wrist fractures from X-rays, and even diagnose melanoma from dermoscopic images,” Drummond said. “Imaging seems to lead in terms of reimbursable CPT coding, but many other applications for screening and diagnosis are on the horizon.”
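To illustrate the clinical-notes side of this, here is a minimal sketch of pulling structured facts out of a free-text note; it uses a general-purpose spaCy model as a stand-in for the domain-tuned clinical NLP pipelines hospitals actually deploy, and the note text is invented.

```python
# Minimal entity extraction from a free-text clinical note, for illustration.
# Assumes spaCy and its small English model (en_core_web_sm) are installed;
# real clinical pipelines use domain-specific models and entity types.
import spacy

nlp = spacy.load("en_core_web_sm")

note = (
    "68-year-old male admitted on March 3, 2024 with shortness of breath. "
    "History of heart failure; discharged on furosemide 40 mg daily."
)

doc = nlp(note)
for ent in doc.ents:
    # dates, quantities, and other spans pulled into structured, machine-readable form
    print(f"{ent.label_:>10}  {ent.text}")
```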
AI also enables early intervention to reduce readmission rates and enhances risk calculators for patient care planning. It helps stratify patients by risk levels for conditions like sepsis, allowing for timely intervention.
“AI today seems more prevalent and impactful on the operational side of things,” Drummond said. “This seems to be where AI is being most successfully monetized. This involves examining operational activity and looking for optimal use and management of assets used in care. This is not so much aligned with clinical decision-making, but underpins the data available for decisions.”
For example, Johns Hopkins Medical Center researchers created an AI tool to assist emergency department nurses in triaging patients. The AI analyzes patient data and medical condition to recommend a care level, coupled with an explanation of its decision — all within seconds. The nurse then assigns a final triage level.
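A toy sketch of that recommend-and-explain pattern (not the Johns Hopkins system itself; the features, data, and triage levels below are invented) might look like this:

```python
# Toy "recommendation plus explanation" triage model, for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["heart_rate", "systolic_bp", "spo2", "pain_score"]
X = np.array([
    [130, 85, 88, 9],   # unstable vitals
    [88, 120, 98, 3],   # stable
    [115, 95, 91, 7],
    [72, 130, 99, 1],
])
y = np.array([1, 3, 2, 4])  # triage level: 1 = most urgent, 4 = least urgent

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

patient = np.array([[120, 90, 90, 8]])
print("Suggested triage level:", tree.predict(patient)[0])
print(export_text(tree, feature_names=features))  # the "explanation" shown alongside the suggestion

# A nurse reviews the suggestion and explanation, then assigns the final triage level.
```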
Dayton Children’s Hospital used an AI model to predict pediatric leukemia patients’ responses to chemotherapy drugs; it was 92% accurate, which helped inform patient care.
A second use of AI in healthcare is in operational analytics, where the algorithms analyze complex data on systems, costs, risks, and outcomes to identify gaps and inefficiencies in organizational performance.
The third major use is in workflow enhancement, where AI automates routine administrative and documentation tasks, freeing clinicians to focus on higher-value patient care, according to Drummond.
Beth Israel’s Rodman understands the skepticism that comes from having a computer algorithm influence physicians’ decisions and care recommendations, but he’s quick to point out that healthcare professionals aren’t perfect either.
“Remember that the human baseline isn’t that good. We know 800,000 Americans are either killed or seriously injured [each year] because of diagnostic errors [by healthcare providers]. So, LLMs are never going to be perfect, but the human baseline isn’t perfect either,” Rodman said.
GenAI will become a standard tool for clinical decisions
According to Veronica Walk, a vice president analyst at Gartner Research, there is “huge potential and hype” around how genAI can transform clinical decision-making. “Vendors are incorporating it into their solutions, and clinicians are already using it in practice — whether provided or sanctioned by their organizations or not,” she said.
Healthcare has primarily focused on two types of AI: machine learning (ML), which learns from examples instead of predefined rules, and natural language processing (NLP), which enables computers to understand human language and convert unstructured text into structured, machine-readable data. (An example of ML in use is when it suggests purchases based on a consumer’s selections, such as a book or shirt, while NLP analyzes customer feedback to identify sentiment trends and guide product improvements.)
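The NLP half of that example can be reproduced almost verbatim with off-the-shelf tooling; the snippet below uses Hugging Face's default sentiment model, and the feedback strings are invented.

```python
# Off-the-shelf sentiment analysis over (invented) patient feedback, for illustration.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # downloads a default English model on first use

feedback = [
    "The check-in process was fast and the staff were wonderful.",
    "I waited three hours and nobody explained the delay.",
]
for text, result in zip(feedback, sentiment(feedback)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {text}")
```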
“So, we didn’t just look at clinical accuracy,” Rodman said. “We also looked at things that in the real world we want doctors to do, like the ability to figure out why you could be wrong.”
Over the next five years, even just using today’s technology, AI could result in savings of 5% to 10% of healthcare spending, or $200 billion to $360 billion annually, according to a study by the National Bureau of Economic Research (NBER).
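A quick back-of-the-envelope check of that range, assuming annual US health spending of roughly $4 trillion (an assumption made for the arithmetic here, not a figure taken from the NBER study):

```python
# Back-of-the-envelope check of the NBER savings range, for illustration only.
us_health_spending = 4.0e12  # assumed annual US healthcare spending, in dollars

low, high = 0.05 * us_health_spending, 0.10 * us_health_spending
print(f"5-10% of ${us_health_spending / 1e12:.0f}T is ${low / 1e9:.0f}B to ${high / 1e9:.0f}B per year")
# roughly consistent with the $200 billion to $360 billion range cited above
```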
Savings for hospitals come primarily from use cases that enhance clinical operations (such as operating room optimization) and improve quality and safety (such as detecting adverse events). Physician groups benefit mainly from improved clinical efficiencies or workload management and from continuity of care (such as referral management).
Insurance companies see savings, too, from improved claims management; automatic adjudication and prior authorization; reductions in avoidable readmissions; and provider relationship management.
Case Western Reserve’s Drummond breaks AI in healthcare into two categories, contrasted in the brief sketch after this list:
- Predictive AI: using data and algorithms to predict some output (e.g., diagnosis, treatment recommendation, prognosis, etc.)
- Generative AI: generating new output based on prompts (e.g., text, images, etc.)
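A brief side-by-side sketch of the two categories, with invented data and a small open model standing in for production systems:

```python
# Predictive AI: structured inputs in, a predicted label or score out.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[61, 1, 140], [45, 0, 118], [70, 1, 160], [38, 0, 110]])  # age, smoker, systolic BP
y = np.array([1, 0, 1, 0])  # 1 = hypertension diagnosis
predictor = RandomForestClassifier(random_state=0).fit(X, y)
print("Predicted risk:", predictor.predict_proba([[66, 1, 150]])[0, 1])

# Generative AI: a prompt in, newly generated text out.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small open model, illustration only
prompt = "Discharge instructions for a patient recovering from pneumonia:"
print(generator(prompt, max_new_tokens=40)[0]["generated_text"])
```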
The problem with genAI models is that their chatbots can mimic human language and quickly return detailed, coherent-seeming responses. “These properties can obscure that chatbots might provide inaccurate information,” Drummond said.
Risks and AI biases are built in
One of the things GPT-4 “was terrible at” compared with human doctors was making causally linked diagnoses, Rodman said. “There was a case where you had to recognize that a patient had dermatomyositis, an autoimmune condition responding to cancer, because of colon cancer. The physicians mostly recognized that the patient had colon cancer, and it was causing dermatomyositis. GPT got really stuck,” he said.
IDC’s Shegewi points out that if AI models are not tuned rigorously and with “proper guardrails” or safety mechanisms, the technology can provide “plausible but incorrect information, leading to misinformation.
“Clinicians may also become de-skilled as over-reliance on the outputs of AI diminishes critical thinking,” Shegewi said. “Large-scale deployments will likely raise issues concerning patient data privacy and regulatory compliance. The risk for bias, inherent in any AI model, is also huge and might harm underrepresented populations.”
Additionally, AI’s increasing use by healthcare insurance companies doesn’t typically translate into what’s best for a patient. Doctors who face an onslaught of AI-generated patient care denials from insurance companies are fighting back — and they’re using the same technology to automate their appeals.
“One reason the AI outperformed humans is that it’s very good at thinking about why it might be wrong,” Rodman said. “So, it’s good at what doesn’t fit with the hypothesis, which is a skill humans aren’t very good at. We’re not good at disagreeing with ourselves. We have cognitive biases.”
Of course, AI has its own biases, Rodman noted. Sex and racial biases have been well documented in LLMs, but the technology is probably less prone to bias than people are, he said.
Even so, bias in classical AI has been a longstanding problem, and genAI has the potential to exacerbate the problem, according to Gartner’s Walk. “I think one of the biggest risks is that the technology is outpacing the industry’s ability to train and prepare clinicians to detect, respond to, and report these biases,” she said.
GenAI models are inherently prone to bias due to their training on datasets that may disproportionately represent certain populations or scenarios. For example, models trained primarily on data from dominant demographic groups might perform poorly for underrepresented groups, Shegewi said.
“Prompt design can further amplify bias, as poorly crafted prompts may reinforce disparities,” he said. “Additionally, genAI’s focus on common patterns risks overlooking rare but important cases.”
For example, research literature that’s ingested by LLMs is often skewed toward white males, creating critical data gaps regarding other populations, Shegewi said. “Due to this, AI models might not recognize atypical disease presentations in different groups. Symptoms for certain diseases, for example, can have stark differences between groups, and a failure to acknowledge such differences could lead to delayed or misguided treatment,” he said.
With current regulatory structures, LLMs and their genAI interfaces can’t accept liability and responsibility the way a human clinician can. So, for “official purposes,” it’s likely a human will still be needed in the loop for liability, judgment, nuance, and the many other layers of evaluation and support patients need.
Chen said it wouldn’t surprise him if physicians were already using LLMs for low-stakes purposes, like explaining medical charts or generating treatment options for less-severe symptoms.
“Good or bad, ready or not, Pandora’s box has already been opened, and we need to figure out how to effectively use these tools and counsel patients and clinicians on appropriately safe and reliable ways to do so,” Chen said.