Talk on the Street
By Beth W. Orenstein
Radiology Today
Vol. 25 No. 5 P. 18

Radiologists weigh in on the promises and pitfalls of large language models.

With workloads increasing and staff shrinking, radiologists are looking for ways to automate certain tasks and support clinical decision making. Some see large language models (LLMs) as a potential way to improve efficiency and accuracy in overburdened radiology practices. LLMs are AI models trained on large amounts of text data to generate humanlike responses and perform language-related tasks.

In theory, at least, LLMs could be used to analyze and interpret radiology reports, assist readers in making diagnostic decisions, and offer recommendations based on the information contained in patient reports. Whether LLMs can be used to perform complex tasks requiring medical reasoning is open to debate. Some say they are ready. Others believe a great deal more research is needed before radiology practices rely on LLMs for any task.

Roman J. Gertz, MD, resident in the department of radiology at the University Hospital of Cologne, Germany, is among those who believe that LLMs have significant potential to enhance radiological practices. Based on a study he and colleagues published in Radiology in April 2024, Gertz, a co-lead author, believes LLMs could help with a number of tasks, including optimizing workflow by helping determine the correct study and protocol, aiding in differential diagnoses, and identifying common errors in radiological reports. To test this, the researchers compared GPT-4 with human performance in error detection in radiology reports, assessing the LLM against radiologists of varied levels of experience in terms of accuracy, speed, and cost effectiveness.

To conduct their study, the researchers gathered 200 radiology reports (X-ray, CT, MRI) from a six-month period (June 2023 to December 2023) at a single institution. They intentionally entered 150 errors from five error categories (omission, insertion, spelling, side confusion, and other) into one-half of the reports. Six radiologists—two senior radiologists, two attending physicians, and two residents—were tasked with detecting the errors, as was GPT-4. The results from the GPT-4 model and the radiologists were comparable. In the overall analysis, GPT-4 detected fewer errors compared with the top senior radiologist (82.7% vs 94.7%).

“However, there was little difference in error detection rates between GPT-4 and the other radiologists,” Gertz says. “Similar to previous findings, we were pleasantly surprised by GPT-4’s capabilities and the consistency of its performance.”

Gertz says he and his colleagues undertook the study because of “daily clinical challenges and addressing the need for improved efficiency and accuracy in radiological services.” He believes that the key takeaway from the study is “It illustrates how AI, particularly through tools like GPT-4, can significantly improve efficiency, reduce errors, and enhance access to diagnostic services—imperatives for better patient care outcomes.”

Potential Time Savings
Woojin Kim, MD, chief medical information officer of Rad AI, agrees that LLMs have numerous potential use cases in radiology, many of which can be time saving for radiologists. “Imagine,” Kim says, “if you could dictate only the essential findings and generate the entire report with the generative AI (GenAI) taking care of the report formatting, allowing the radiologists to keep their eyes on the images the whole time.”

Radiologists often look at exams with prior images and reports. But they can’t simply copy and paste prior reports because those may have been dictated by another radiologist with a different style and report format, and dates will need to be updated manually, he says. Leveraging GenAI, however, radiologists can generate an unchanged report with a simple command. “Imagine the time savings!” Kim says. If something were different on the most current exam, the radiologist could dictate just what is different and let the GenAI take care of the rest.

Furthermore, Kim sees a time-saving role for LLMs when it comes to checking for errors within radiology reports. Even in 2024, radiologists are still largely limited to spellcheckers, he says. LLMs could be leveraged to check for other reporting errors, such as nonsense words, laterality errors, wrong body parts, and contradictions. “These features allow radiologists to practice at the top of their license,” Kim says.
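What such a check might look like is not spelled out here, but as a rough, hypothetical sketch, the snippet below uses OpenAI’s Python client to ask a general-purpose model to flag laterality errors, wrong body parts, nonsense words, and contradictions in a short synthetic report (which deliberately mixes up left and right between the findings and the impression). The model name and prompt wording are illustrative assumptions, and, as Kim cautions later in this article, protected health information should never be sent to public chatbots, so any real deployment would require a HIPAA-compliant environment.

```python
# Hypothetical sketch: asking a general-purpose LLM to flag reporting errors
# (laterality mistakes, wrong body parts, nonsense words, contradictions).
# Uses the OpenAI Python client (>=1.0) and a synthetic, de-identified report;
# real clinical use would require a HIPAA-compliant setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

report = """Exam: Chest radiograph.
Findings: There is a small left pleural effusion. The heart size is normal.
Impression: Small right pleural effusion; cardiomegaly."""

prompt = (
    "Review the radiology report below for laterality errors, wrong body parts, "
    "nonsense words, and contradictions between Findings and Impression. "
    "List each suspected error with the sentence it appears in, "
    "or reply 'No errors found'.\n\n" + report
)

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```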

Kim also sees a time-saving role for LLMs in generating impressions or summary sections of reports. “Generating this section takes time, and attempts to automatically create impressions have been a popular area of research, especially in the past year, thanks to ChatGPT,” Kim says, noting that Rad AI Omni Impressions, a commercial solution, has been in this space much longer than ChatGPT. “Hence,” he says, “I can tell you not only does this feature save time, but it also reduces radiologists’ cognitive load. People who have used it will tell you they feel less tired at the end of the day.” (A 2021 study published in The Joint Commission Journal on Quality and Patient Safety scored radiology the highest on the mental demand component of physician task load.)

Finding Its Voice
Like Kim, Rajesh Bhayana, MD, of the University of Toronto, believes that LLMs offer many clinical and research applications in radiology. In fact, he says, “several have been explored in the literature with encouraging results.” Specifically, Bhayana believes multimodal LLMs have the potential to generate reports taking into account imaging and other clinical information, closely mirroring how radiologists work today.

Bhayana, author of “Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications,” which was published in January 2024 in Radiology, also sees a role for LLMs in nearly every step of the radiology pathway, including summarizing and simplifying radiologists’ reports for their patients. AI chatbots have the potential to answer not only patients’ general medical questions but also questions about their imaging, saving physicians time, he says. LLMs can also help with many time-consuming steps of the research process, such as summarization and data analysis, he adds.

However, Kim and Bhayana also agree that radiologists need to proceed with extreme caution. “Medical expert oversight is needed for safe and responsible LLM uses in health care today,” Kim says. He cautions radiologists and radiology groups not to use public-facing chatbots such as ChatGPT for patient-related queries, as sending protected health information to such versions of ChatGPT is a HIPAA violation.

Bhayana says that while LLMs can be used for many different tasks, “much more research and validation is needed before we rely on them in practice for critical tasks. The research/validation piece needs to catch up.” For example, he says, while preliminary studies have shown that using LLMs to help patients has promise—responses may be more empathetic than those of physicians on online forums—their accuracy and safety still need to be assessed. Similarly, while using LLMs for data analysis takes a fraction of the time it would take a human, they still make mistakes, and the potential for errors must be factored in.

Blind Spots
LLMs have important weaknesses relevant to radiology, the radiologists agree. One problem is that they can generate inaccurate responses, which are called fact fabrications or hallucinations. Since LLMs use probability to generate outputs, they are more likely to hallucinate where pretraining is lacking, Bhayana says.

Kim says LLMs give answers confidently and convincingly, even if they’re wrong. “This is one of the many reasons why radiology professional education on GenAI is crucial,” he says.

Another issue is granularity, says Bradley J. Erickson, MD, PhD, a radiologist at the Mayo Clinic in Rochester, Minnesota, and former chair of the American Board of Imaging Informatics.

“By this, I mean the LLM needs to summarize at the right level. For instance, in the history, do you want to know if the patient 1) had cancer or 2) had brain cancer or 3) had IDH-wt glioma (a specific type of glioma)? All are right answers, and the right level of detail depends on the context,” Erickson says.

Radiology has also seen automation bias with deep learning solutions, Kim says. A paper published in April 2024 in The Lancet Digital Health showed this issue with LLMs. The researchers found that “the content of physician responses changed when using LLM assistance, suggesting an automation bias and anchoring, which could have a downstream effect on patient outcomes.” There is some fear, Kim says, that physicians’ overreliance on language models could lead to reduced critical thinking or independent decision making. Kim agrees with the authors of a study published in February 2024 in Diagnostic Pathology who wrote: “It is crucial to view these models as tools to augment the human expertise rather than replace it entirely.”

Erickson characterizes the problem slightly differently. It’s not exactly automation bias, he says, but the result is similar. “Unlike diagnosis, where you both see the images and the AI result and might go with the AI, in most LLM use cases, you don’t see the input data, so there is no possibility to decide not to trust the AI. You just have to trust it, or you have to go through all the documents,” he says.

Fine-Tuning Required
Andrea Cozzi, MD, PhD, radiology resident and postdoctoral research fellow at the Imaging Institute of Southern Switzerland, Ente Ospedaliero Cantonale, in Lugano, Switzerland, is concerned that the more complex the task, the less successful publicly available generic LLMs such as ChatGPT (GPT-3.5 or GPT-4) are. Cozzi and colleagues published a study at the end of April 2024 in Radiology that found that unregulated use of publicly available LLMs resulted in changes in breast imaging report classifications and concluded that their use could have a negative effect on patient management.

The Swiss researchers partnered with an American team from Memorial Sloan Kettering Cancer Center in New York and a Dutch team at the Netherlands Cancer Institute in Amsterdam. They looked at BI-RADS classifications of 2,400 breast imaging reports written in English, Italian, and Dutch. The researchers used three LLMs—GPT-3.5, GPT-4, and Google Bard (now Google Gemini)—to assign BI-RADS categories. Then, they compared the performance of the LLMs with that of board-certified breast radiologists.
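The article does not reproduce the prompts the researchers used, so the following is a purely illustrative sketch, not the study’s actual protocol: asking a general-purpose model for a BI-RADS category and comparing its answer with a radiologist’s assignment. The model name, prompt wording, synthetic report, and reference category are all assumptions for the sake of the example.

```python
# Hypothetical illustration (not the study's protocol): have an LLM assign a
# BI-RADS category to a synthetic breast imaging report, then compare it with
# a radiologist's category.
from openai import OpenAI

client = OpenAI()

report = """Bilateral screening mammogram.
Findings: Scattered fibroglandular densities. No suspicious mass,
calcifications, or architectural distortion. Stable compared with prior exam."""

radiologist_category = "2"  # benign; reference category in this toy example

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": (
            "Assign a single BI-RADS category (0-6) to the following breast "
            "imaging report. Reply with the category number only.\n\n" + report
        ),
    }],
)

llm_category = response.choices[0].message.content.strip()
print("LLM:", llm_category, "| Radiologist:", radiologist_category,
      "| Concordant:", llm_category == radiologist_category)
```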

The agreement among the human readers was almost perfect. However, the agreement between the human readers and the LLMs was only moderate. The researchers also observed a large percentage of discordant category assignments that would result in negative changes in patient management. They concluded that their results raised several concerns about the potential consequences of placing too much reliance on these widely available LLMs, which are not fine-tuned to medical knowledge. Cozzi says the results highlight the need for regulation of LLMs, given that users are highly likely to ask them health care-related questions of varying depth and complexity.

However, while default LLMs may struggle out of the box with complex clinical tasks, such as assigning BI-RADS categories, this does not mean that LLMs are not useful for these tasks, Bhayana says. “LLMs should be viewed as powerful but blunt tools that must be optimized for specific clinical tasks,” he says. Bhayana likens optimizing LLMs to optimizing MRI techniques: “Whole body MRI wouldn’t be great for detecting prostate cancer, but we all know that prostate-specific MRI performs quite well for this.”

Room for Growth
LLMs can be a wonderful tool for many tasks, Cozzi says, but they must be used wisely. “It’s still too early for radiology groups to begin integrating LLMs into their practices in any way because the current state of LLM development is far from qualifying them as fit for use in clinical practice,” he says. LLM chatbots do not yet meet key principles for AI use in health care, such as transparency, explainability, and bias control, among other areas, he adds. “And this entails critical professional, legal, and ethical issues.”

Kim respectfully disagrees, he says, “since there are already LLM-based applications in use in radiology commercially.”

Because “we are just at the first stages of LLM development in the health care field,” Cozzi believes it’s difficult to discern what the real benefits of LLM use will be and where they will be applied. The main challenge to overcome before LLMs can be integrated into clinical practice, he says, is the need for accountability and regulation, since all other challenging aspects (scientific, technical, commercial, etc) ultimately converge in the regulatory framework, “as we saw with previously developed AI tools.”

A growing body of research suggests that radiologists are receptive to LLMs, Gertz says. Eventually, “these tools could help manage the increasing demands for documentation and workflow management, allowing radiologists to focus on their core competency: image interpretation.”

Bhayana believes that over the coming years, LLMs will become ubiquitous, and the specialty will find a way to use them safely “to get what it needs when it needs it.”

Beth W. Orenstein of Northampton, Pennsylvania, is a freelance medical writer and regular contributor to Radiology Today.