Inside the Algorithm
By Beth W. Orenstein
Radiology Today
Vol. 24 No. 6 P. 10

Do radiologists need to know how AI models work?  

The use of AI in radiology is gaining momentum. A growing number of studies have shown that AI can increase radiologists’ efficiency and diagnostic confidence, reduce workload, and highlight urgent cases, among other benefits. Much of this happens, however, without radiologists knowing how or why an AI algorithm arrived at its prediction, decision, or recommendation, because the model’s inputs aren’t made clear to the clinicians who use them.

“There are no FDA-cleared AI algorithms that I am aware of that actually show you how they came up with their diagnosis or interpretation,” says Eliot Siegel, MD, a professor and vice chair of diagnostic radiology and nuclear medicine at the University of Maryland and chief of radiology and nuclear medicine at VA Maryland Healthcare in Baltimore. “There is no transparency into these AI applications with regard to specific features that the algorithm is looking for or their relative importance.” Does it matter that the AI algorithms radiologists are using don’t provide such explanations? According to radiologists, it depends.

Judy Wawira Gichoya, MD, MS, an assistant professor in the department of radiology and imaging sciences at Emory University School of Medicine in Atlanta, believes it is more important for users to have confidence in the algorithms they use than a precise understanding of how they work. “Most radiologists are not data scientists,” she says. “I think it’s a big bar to say we want to know what’s going on with the algorithm we are using.”

In a way, radiologists’ use of AI is similar to their relationship with MRI machines. Radiologists can interpret the images MRI scanners produce but do not necessarily understand the sequences that were used to produce those images. As a radiologist, “I don’t necessarily understand how the specific sequence is generated and the physics behind it in the way a physicist would, but I am able to interpret the results and determine when I cannot make an accurate diagnosis from an image,” Gichoya says.

Gichoya notes that learning how the AI algorithm comes to its conclusion would take extra time, something that few radiologists have in today’s world. “If I needed to stop and take the time to understand the algorithm’s methodology, it could affect my workflow and that would be disruptive,” Gichoya says. She adds that it could negate the value of using the algorithm in the first place.

However, Gichoya says, it is important that radiologists have the utmost confidence in the AI algorithms that they use. For some, understanding how the algorithm reaches its conclusions may give them more confidence. For others, it may not be important. To address this issue, Gichoya says, it would help if radiologists shared their experiences with AI algorithms. “What we need to do is share when AI systems fail us and when they work well,” she says.

Value Proposition
It’s more important for radiologists to know whether the AI algorithms they use add value. “Radiologists want to use something that is helpful for them,” Gichoya says. “If they know an AI algorithm adds value, then they will use it.” Knowing whether a particular AI algorithm works and under what circumstances it works best is helpful, she says, “because no one solution works for everyone.”

The most likely reason that most AI algorithms don’t show their work is that they are proprietary. “There is limited technical information on methodologies for algorithms approved by the FDA compared with open-source products, likely because of intellectual property concerns,” Gichoya says. Also, FDA-approved products use much smaller datasets compared with open-source AI tools. That’s likely because public datasets are limited to academic and noncommercial entities, which precludes their being used in commercial products, Gichoya says.

To work around this issue, Gichoya and colleagues developed a 10-question assessment tool for reviewing AI products, with an emphasis on validation and dissemination of results. They applied their assessment tool to commercial and open-source algorithms used for diagnosis to extract evidence on the clinical utility of the tools. The researchers published a study on their results in November 2020 in the Journal of the American College of Radiology. In it, they concluded that “a large gap exists in exploring the actual performance of AI tools in clinical practice.”

Salient Details
Siegel believes radiologists would be more comfortable with AI tools if the tools were to show their work. An AI algorithm can make a determination that is correct “but it doesn’t tell you how it came up with that conclusion,” Siegel says. “Whether it’s stroke detection or detection of an intracranial hemorrhage or detection of a cancer, in general, algorithms currently don’t go into very much explanation.” Users rely more on their experience with the algorithm and how well it works on their patient population, he says.

Like Gichoya, Siegel says physicians often use products, such as medications, without fully understanding how they work. “There are a lot of drugs we use even though we don’t really have an explanation for exactly how they work or why they work,” he says; AI tools can be viewed the same way.

Siegel coauthored “Artificial Intelligence: Algorithms Need to Be Explainable—or Do They?” which was published in April 2023 in the Journal of Nuclear Medicine. One method of showing work that’s discussed in the article is a saliency map. Saliency maps identify the regions of the input data that have the most impact on a model’s prediction, highlighting the parts of an image the model is focusing on. For example, saliency maps can highlight the most influential regions of the myocardium that an AI model uses to diagnose coronary artery disease in SPECT images.

It is well documented, Siegel says, that AI models have biases built in. These biases are caused by confounding factors, which can distort true associations and/or influence an algorithm’s interpretation. Siegel says he has seen AI models trained on images that happened to include labels indicating they came from a certain cancer clinic or oncologist. The AI algorithm simply learned to associate those labels with cancer positivity rather than learning from the images themselves. A saliency map used for explainability would catch these training errors and help ensure that the algorithm was relying on relevant data from the images, Siegel says.

Siegel says saliency maps work by progressively blanking out different regions of an image and then determining whether the prediction of the algorithm changed. For example, when looking for a pneumothorax, blanking out the lung apices would be expected to have a significant impact on the performance of an algorithm, while blanking out portions of the shoulders or upper abdomen would have less of an impact. Consequently, the lung apices would be displayed on a “heat map” with more intensity than other anatomic areas, he says. He adds that saliency maps are one of a multitude of techniques that help with explainability.
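For readers curious what this occlusion approach looks like in practice, the sketch below is a minimal illustration in Python. It is not the method used by any particular commercial product; predict_prob is a hypothetical stand-in for a model that returns the probability of a finding, and the patch size and stride are arbitrary choices.

```python
import numpy as np

def occlusion_saliency(image, predict_prob, patch=32, stride=16, fill=0.0):
    """Occlusion-style saliency sketch: blank out patches of the image and
    record how much the model's predicted probability drops for each region.

    image        -- 2D numpy array (H x W), e.g., a chest radiograph
    predict_prob -- hypothetical callable mapping an image to a probability
    """
    h, w = image.shape
    baseline = predict_prob(image)          # prediction on the intact image
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))

    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill   # blank a region
            drop = baseline - predict_prob(occluded)              # impact of removing it
            heat[top:top + patch, left:left + patch] += drop
            counts[top:top + patch, left:left + patch] += 1

    # Average drop per pixel; regions whose removal hurts the prediction
    # most accumulate the most "heat."
    return heat / np.maximum(counts, 1)
```

In this sketch, the regions whose occlusion causes the largest drop in the predicted probability light up most on the resulting heat map, which is what Siegel’s pneumothorax example describes for the lung apices.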

Developing Confidence
Over time, if radiologists are confident that a model is reliable in their population of patients, “then I think it would be acceptable if they did not feel the need to understand the explanation of how it worked,” Siegel says. “I think we need to take any AI explanations with a grain of salt and not be overly reliant on the explanations, whether they are consistent or not consistent with what we think intuitively. We need to look at the explanations as only one of many variables that we are using in evaluating the performance of AI algorithms and whether a particular algorithm is appropriate for a particular patient population.”

Siegel believes that the ability of an algorithm to utilize prior information about a patient could be as important as its explanation. Most algorithms today are mainly looking at the pixel information in the study, he says. “They are not looking at the probability of disease in the patient, the patient’s clinical history, or whether the patient’s cancer has been stable over the last 12 years. If I’m only looking at the study by itself, it makes it much more difficult for me to determine the significance of some of the findings,” he says. Algorithms need to learn to include such factors “for me to feel comfortable with them,” Siegel says.

Siegel also believes that explanations of decision-making would be most helpful when he disagrees with an algorithm’s findings. Knowing how an algorithm reached its conclusion could make him more confident in his findings if they are decisively different, he says. Siegel expects “explanations” to become a more important issue. He says radiologists and other health care providers are “trying to make decisions about which AI algorithms they should be using.” Explanations could be a way for radiologists to differentiate among the growing number of AI algorithms that are popping up as the field matures. Research, he says, should include improved robustness of explainability with more standardized methods of objectively measuring the quality of the explanation.

Watching Data Drift
Raym Geis, MD, a radiologist in Fort Collins, Colorado, affiliated with National Jewish Health, says the issue with AI algorithms is not explainability but data drift, the gradual divergence between the data a model encounters in practice and the data it was trained on. Data drift is a significant challenge, Geis says, “because we are finding these programs don’t work as well in some settings as advertised and, over time, the results start to get worse because the data is going to drift for a variety of reasons.”

To make a machine learning model more robust, Geis says, “You need a wider distribution of data on which to train it—datasets that represent a much larger distribution of the population.” Ideally, Geis says, the models would be trained on data from every single hospital and every single CT scanner around the world. If that were to happen, it would make the models more robust and help eliminate the issues with them, “but that’s not technologically possible.”

Users are discovering that AI tools often lack consistency. When radiologists run them on their own images, an AI tool may not work as well as expected, Geis says. Often, this is because local data are subtly different from the data the AI was trained on, even if the images “look” the same to a human. But building machine learning tools that monitor the data going in, to flag these differences, or the labels coming out is complex and “takes a sophisticated level of computer science and knowledge of systems engineering,” Geis says. It’s not a simple matter of training the models on new data or fine-tuning them on a local dataset.
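One simple form such input monitoring could take, offered purely as an illustration rather than a description of any deployed system, is a statistical comparison between a summary statistic computed on the training data and the same statistic computed on recently acquired studies. The sketch below assumes per-image mean intensity as the monitored feature and uses a two-sample Kolmogorov–Smirnov test; a real monitoring pipeline would track many features and require careful thresholding.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_check(reference_values, incoming_values, alpha=0.01):
    """Flag possible data drift by comparing a per-image summary statistic
    (e.g., mean pixel intensity) between a training-time reference sample
    and recently acquired studies.

    Returns (drifted, p_value). A small p-value suggests the two
    distributions differ more than chance alone would explain.
    """
    stat, p_value = ks_2samp(reference_values, incoming_values)
    return p_value < alpha, p_value

# Hypothetical usage: mean intensities of training images vs. this week's studies
rng = np.random.default_rng(0)
reference = rng.normal(loc=120, scale=15, size=2000)   # stand-in for training data
recent = rng.normal(loc=128, scale=15, size=200)       # slightly shifted local data
flag, p = drift_check(reference, recent)
print(f"drift flagged: {flag} (p = {p:.4g})")
```

A check like this only flags that the incoming images differ from what the model saw in training; deciding why they differ, and whether the model’s outputs are still trustworthy, is the harder systems-engineering problem Geis describes.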

Geis does not believe that model explainability will be as useful as it might seem. “I need to know how accurate the labels are that are coming out of the model. But if I’m a radiologist trying to get my work done, and I’m already overwhelmed by the volume [of images], I don’t want to add an extra chore to my day, verifying whether the machine learning output is correct or, if not, why it’s not correct. That’s a significant time cost to a radiologist. To me, the more important goal is to verify that the AI model is providing trustworthy results, both today and at any time in the future.”

— Beth W. Orenstein, a freelance medical writer and regular contributor to Radiology Today, lives in Northampton, Pennsylvania.