
De-Identification Software Tools for Cancer Imaging Research


Fast-Track proposals will be accepted.
Direct-to-Phase II proposals will not be accepted.
Number of anticipated awards: 3-5
Budget (total costs, per award): Phase I: up to $400,000 for up to 9 months; Phase II: up to $2,000,000 for up to 2 years

PROPOSALS THAT EXCEED THE BUDGET OR PROJECT DURATION LISTED ABOVE MAY NOT BE FUNDED.

Summary

Imaging data are a core component in the development of the National Cancer Data Ecosystem and are important in areas from basic research to diagnostics and surveillance. Sharing any data collected from patients, however, requires that information that could connect the data to the individual from whom they were collected be removed or anonymized to the extent possible.

Removal of Protected Health Information (PHI) from imaging data files is a twofold problem: both the file header and the image field itself must be examined for information that could link the file to a specific individual. In headers, this information is often found in fields not intended to contain it. In the image field itself, PHI can take different forms: it may be inserted into the image by the imaging system, or it may appear as identifying jewelry visible in the image (in the case of radiological images).

The complexity of the de-identification problem dictates that a substantial amount of human curation is required to ensure proper and complete removal of PHI from images. This need for human participation in the de-identification process is a significant bottleneck; it impedes the generation of image collections suitable for public distribution and sharing, including deposition into components of the National Cancer Data Ecosystem such as The Cancer Imaging Archive (TCIA) and the proposed Imaging Data Commons of the Cancer Research Data Commons. For example, on a TCIA data curation team, one person manually reviews files for PHI.
Improved tools would shift a large portion of the de-identification burden to software, improving data throughput and increasing data accessibility. Currently, tools do not exist to properly remove PHI from proprietary file formats (e.g., digital pathology images) while retaining other data that may be useful to researchers.

Project Goals

The goal of this contract solicitation is to support development and sustainment of software tools and pipelines for image de-identification, especially for, but not exclusive to, CT patient data sets and images produced by whole slide imagers (WSI) for digital pathology applications. These tools will selectively remove PHI while retaining other metadata fields that help provide interoperability with other image formats and other data types, such as genomic data and proteomic data.

The following tasks/objectives should be met by the software tool:
1) Removal of PHI from expected fields in multiple imaging formats
2) Scanning of fields not designed to hold PHI (e.g., comment fields that may contain PHI), with identification and subsequent removal of any PHI found
3) Scanning of images for PHI, identification, labeling, and subsequent resolution
4) Production of processed images that meet a threshold level of de-identification
5) Validation algorithm to confirm images within the processed dataset are de-identified
6) Identification (e.g., flagging) of processed data files that may require manual resolution to remove PHI

Brute force methods for de-identification (e.g., erasing all header information) are not acceptable. Retention of data and metadata necessary for downstream applications (population studies, segmentation training) is required. Solutions should not compromise the biomedical use of data files. To build upon previous work on field retention, removal, and alteration, the TCIA de-identification knowledge base may serve as a foundation for determining and prioritizing similar attributes in digital pathology images.
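
As an illustration only, the header-related objectives above (removing PHI from expected fields, scanning free-text fields, retaining other metadata, and flagging files for manual resolution) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a prescribed design: the header is modeled as a plain dictionary, the field names in KNOWN_PHI_FIELDS and FREE_TEXT_FIELDS are hypothetical, and the regular expressions cover only a few example patterns. A real tool would read each vendor format with an appropriate parser and use a much broader rule set.

```python
import re

# Hypothetical field names, for illustration only; real formats define their own.
KNOWN_PHI_FIELDS = {"PatientName", "PatientID", "PatientBirthDate"}
FREE_TEXT_FIELDS = {"ImageComments", "StudyDescription"}

# Illustrative patterns only; real coverage would need to be far broader.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-like number
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),        # date such as 03/14/1962
    re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),  # medical record number
]


def deidentify_header(header):
    """Return (cleaned_header, needs_review) for a dict-like header.

    Fields expected to hold PHI are dropped, free-text fields are scanned
    and redacted, and every other field is retained unchanged so that
    downstream uses of the metadata are not compromised.
    """
    cleaned = {}
    needs_review = False
    for name, value in header.items():
        if name in KNOWN_PHI_FIELDS:
            continue  # remove PHI from fields designed to hold it
        if name in FREE_TEXT_FIELDS and isinstance(value, str):
            for pattern in PHI_PATTERNS:
                value, hits = pattern.subn("[REDACTED]", value)
                if hits:
                    # PHI found where it was not expected: flag for a human.
                    needs_review = True
        cleaned[name] = value
    return cleaned, needs_review


example = {
    "PatientName": "DOE^JANE",
    "Modality": "CT",
    "SliceThickness": "2.5",
    "ImageComments": "Seen 03/14/1962, MRN: 12345",
}
cleaned, needs_review = deidentify_header(example)
```

In this sketch the non-PHI fields (Modality, SliceThickness) pass through untouched, which reflects the requirement that erasing all header information is not an acceptable approach.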

Phase I Activities and Deliverables

• In addressing WSI datasets, identify different WSI vendor file types and the fields that contain PHI (i.e., conduct a landscape analysis)
• Ability to recognize and open multiple WSI file formats
• Provide data set(s) for Phase I activities
• Display PHI field variable values
• Remove or alter PHI field values from fields labeled with PHI
• Identify the data sets and file types required to demonstrate software capability in Phase II
  o WSI data sets should include at least 1000 differentiated case files (i.e., one image per patient case) from across various imager systems as identified from the performed landscape analysis
  o CT data sets should include at least 1000 different case files (e.g., 100 images per patient case) from across at least 5 distinct research institutions
  o Requests to use NCI data sets from the TCIA database or similar may be directed to the NCI Contracting contact person listed for this solicitation. Requests will be granted at the discretion of NCI.

Phase II Activities and Deliverables

• Detect PHI in non-PHI fields (e.g., comment fields that may contain PHI)
• Alert the user and allow the user to edit detected fields
• Detection of PHI within the image
• Masking of PHI within the image
• Generation of de-identified images with provenance of the process
• Validation with a test data set should demonstrate successful PHI removal from the image and image file metadata for ≥95% of test files
• Statistical analysis of validation testing will be provided to NCI
• The software tool should identify and flag any cases that are less than fully verified for PHI removal
• For any dataset where 5% or less of files are not fully verified with successful PHI removal, such files should be flagged for manual correction
• In the first year of the contract, provide the Program and Contract officers with a letter(s) of commercial interest
• In the second year of the contract, provide the Program and Contract officers with a letter(s) of commercial commitment
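
As an illustration only, the ≥95% validation criterion and the flagging of unverified files described above could be checked with a small Python helper like the one below. This is a sketch under assumptions: the function name summarize_validation, the input format (a mapping from file name to a verified/not-verified boolean), and the output keys are all hypothetical, and how a file comes to be "fully verified" is left to the validation algorithm itself.

```python
def summarize_validation(results, threshold=0.95):
    """Summarize per-file validation results for a test data set.

    results: mapping of file name -> True if PHI removal was fully verified.
    threshold: minimum fraction of verified files (0.95 per the Phase II text).
    """
    if not results:
        raise ValueError("no validation results supplied")
    # Any file that is less than fully verified is flagged for manual correction.
    flagged = sorted(name for name, verified in results.items() if not verified)
    verified_fraction = (len(results) - len(flagged)) / len(results)
    return {
        "verified_fraction": verified_fraction,
        "meets_threshold": verified_fraction >= threshold,
        "flagged_for_manual_correction": flagged,
    }


# Hypothetical test set: 100 files, 4 of which could not be fully verified.
test_results = {f"case_{i:03d}.svs": True for i in range(100)}
for name in ("case_007.svs", "case_021.svs", "case_050.svs", "case_093.svs"):
    test_results[name] = False

summary = summarize_validation(test_results)
```

The returned list of flagged files is what a reviewer would work through by hand; the fraction and pass/fail flag could feed the statistical analysis provided to NCI.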