March/April 2026 Issue

Women’s Imaging: Collective Effort
By Rebekah Moan
Radiology Today
Vol. 27 No. 2 P. 6

Lessons Learned From Compiling a 1M DBT Dataset

Little by little, things that once seemed unimaginable can become feasible—including preparing a dataset of more than one million digital breast tomosynthesis (DBT) studies with paired histology outcomes for an AI developer. “You have to start somewhere,” says Luke Bideaux, president and CEO of Vega Imaging Informatics. “Take those first steps, and it keeps getting better with each step along the way.”

Vega has a data program that allows it to acquire information from various sites in three regions of the United States. The exact number of sites and their specific locations are confidential, but once a site joins the program, Vega can interact with that site’s protected health information as a business associate and create a deidentified dataset.

Participating health care facilities join the program and deploy the data processing server on their premises. Vega has to integrate with the facilities’ PACS so it can extract copies of the studies. Through the use of specialized tools and processes, Vega completes the end-to-end data processing on behalf of the imaging facilities. This allows these facilities to leverage Vega’s data specialists to lead the data projects, while sharing in the revenue that’s generated from them.

When Vega received a request from an AI developer to compile a large multimodal, deidentified DBT dataset, Vega contracted with sites within its data program to create it. The company performed a large-scale extraction across the sites, including DICOM images and associated clinical data from various information systems. The final dataset included deidentified images, demographic data elements, BI-RADS information, radiology reports, and other important attributes about the imaging exams. Crucially, it also included biopsy outcome information for more than 22,000 patients, including more than 7,000 cancer cases.

“We’ve done more complicated data projects than this because they were about rare diseases or hard-to-find patient populations,” Bideaux says. “But this dataset is significant because of its size, as it is intended to represent a large distribution of demographics, breast densities, BI-RADS scores, cancer types, and more, all based on real-world occurrences.”

The inclusion of such a wide swath of the patient population into new AI models means the AI solutions will work more effectively because broader patient populations are represented in the data being used for training and testing. “So many times, we see AI developers working with limited sources. That results in AI solutions working in a lab, but when they try to deploy those solutions in the real world, they fail,” he says. “With this contribution, I believe huge strides will be made in the effectiveness of the AI solutions being developed in the area of breast imaging.”

Big Data
The downside to such a broad representation of patients is that it requires immense amounts of data to be processed, validated, and distributed. Vega acknowledged that these challenges need to be addressed on a case-by-case basis with each participating facility. Every site Vega works with has a PACS that performs differently, so a tailored approach to data processing was crucial. Numerous techniques can maximize data migration performance, such as multithreading, move delays, and lossless compression, which keep the source systems from being overwhelmed while maintaining a consistent data transfer pace.

“We had to find the perfect harmony of configurations to keep the extractions humming, without noticeably impacting the operational performance of the PACS,” Bideaux says. “It takes a lot of fine tuning, and we did continue to optimize our configurations as the project went on.”
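To make the idea of multithreading with move delays concrete, here is a minimal sketch in Python. It is not Vega’s tooling: the function names, the worker count, and the delay value are all hypothetical, and `move_study` stands in for whatever call actually asks a PACS to send a study.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def extract_studies(study_ids, move_study, max_workers=4, move_delay_s=0.5):
    """Run move_study over study_ids with bounded concurrency.

    max_workers caps how many moves run at once, and move_delay_s makes
    each worker pause after a move so it does not immediately send the
    source system another request. Both values are illustrative only.
    """
    def throttled_move(study_id):
        result = move_study(study_id)  # hypothetical per-study extraction call
        time.sleep(move_delay_s)       # the "move delay" between requests
        return study_id, result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as study_ids
        return list(pool.map(throttled_move, study_ids))
```

In a setup like this, tuning means adjusting the two knobs per site: more workers and shorter delays speed up extraction, while fewer workers and longer delays lighten the load on the PACS.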

After extracting the data, Vega deidentified the protected health information at the metadata and pixel levels of every DBT image. It also managed nonimage objects, such as scanned documents, text-based reports, DICOM SR objects, presentation state objects, and more to ensure that the final version of each study was fully deidentified.

“There was an extensive amount of testing and validation completed upfront, and then once the data processing was finished, there was a full validation procedure that was done according to our process to validate that the deidentification was successful,” Bideaux says.
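One small piece of such a validation pass might look like the sketch below, which compares a study’s metadata before and after processing and reports any identifying attribute that came through unchanged. This is an illustration only, not Vega’s procedure: the attribute list is a short, assumed sample of standard DICOM keywords, the metadata is modeled as plain dictionaries, and pixel-level checks for burned-in text are out of scope.

```python
# Assumed sample of identifying DICOM attribute keywords; a real list is far longer.
PHI_KEYWORDS = ("PatientName", "PatientID", "PatientBirthDate", "AccessionNumber")


def leaked_attributes(original, deidentified, keywords=PHI_KEYWORDS):
    """Return the keywords whose original value survived deidentification."""
    leaks = []
    for keyword in keywords:
        before = original.get(keyword)
        after = deidentified.get(keyword)
        # Flag only attributes that had a value and still carry the same one.
        if before and after == before:
            leaks.append(keyword)
    return leaks


original = {"PatientName": "DOE^JANE", "PatientID": "12345", "PatientBirthDate": "19700101"}
processed = {"PatientName": "", "PatientID": "ANON-0001", "PatientBirthDate": "19700101"}
print(leaked_attributes(original, processed))  # ['PatientBirthDate']
```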

From there, Vega moved into the distribution phase, where it leveraged one of the major cloud service vendors and some of the transfer appliance solutions that the cloud provider offers to import the data at a large scale into secure cloud buckets. The overall file size of the dataset, nearly 1 petabyte, is what made the project particularly challenging for Vega. It had to use offline transfer appliances because uploading that much data over facilities’ WAN circuits would have thrown day-to-day operations at those sites into disarray.
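A rough calculation shows why uploading over the network was impractical. The sketch below assumes a hypothetical 1 Gbps circuit, since the article does not give the sites’ actual link speeds, and it treats the full petabyte as a single upload even though each site would in practice send only its own share.

```python
def transfer_days(dataset_bytes, link_bits_per_s, usable_fraction=1.0):
    """Days needed to move dataset_bytes over a link of the given speed.

    usable_fraction is the share of the link the upload may consume;
    the rest is left for a site's normal clinical traffic.
    """
    seconds = dataset_bytes * 8 / (link_bits_per_s * usable_fraction)
    return seconds / 86_400  # seconds per day


ONE_PETABYTE = 1e15  # bytes, decimal petabyte
ONE_GBPS = 1e9       # bits per second, assumed link speed

print(round(transfer_days(ONE_PETABYTE, ONE_GBPS), 1))       # 92.6
print(round(transfer_days(ONE_PETABYTE, ONE_GBPS, 0.5), 1))  # 185.2
```

Even with the entire circuit dedicated to the upload, the transfer would run for about three months. Leaving half the bandwidth for daily operations doubles that.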

Once the data was in the cloud, the AI developer was able to access the deidentified data and complete its acceptance review. The entire process, from contracting to conclusion, took about a year, according to Bideaux. His advice to others working in the imaging data field is, “Don’t be afraid to try something that has a chance of failure. That’s the only way you’re going to be able to achieve something of great significance.”

These sorts of AI projects are not possible without the cooperation of health care facilities. “I think the message is pretty clear from the health care providers that they need AI that actually works in the real world, and the only way that’s going to happen is if we get more health care providers contributing data for the advancement of medical imaging AI,” he says. “Health care organizations can’t have it both ways where they keep their data siloed off but, at the same time, reap the benefits from what other organizations have contributed. If everyone took that stance, we’d never get anywhere.”

— Rebekah Moan is a freelance journalist and ghostwriter based in Oakland, California. Her specialties are health care and profiles.