Balancing Disclosure Risk with Inferential Power: Software for Intervalized Data

Award Information

Agency:
Department of Health and Human Services
Branch:
N/A
Award ID:
Program Year/Program:
2012 / SBIR
Agency Tracking Number:
R43TR000173
Solicitation Year:
2012
Solicitation Topic Code:
NCATS
Solicitation Number:
PA11-096
Small Business Information
Applied Biomathematics
100 North Country Rd. Setauket, NY -
Woman-Owned: No
Minority-Owned: No
HUBZone-Owned: No
 
Phase 1
Fiscal Year: 2012
Title: Balancing Disclosure Risk with Inferential Power: Software for Intervalized Data
Agency: HHS
Contract: 1R43TR000173-01
Award Amount: $480,868.00
 

Abstract:

DESCRIPTION (provided by applicant): Patient data collected during health care delivery and public health surveys possess a great deal of information that could be used in biomedical and epidemiological research. Access to these data, however, is usually limited because of the private nature of most personal health records. Methods of balancing the informativeness of data for research with the information loss required to minimize disclosure risk are needed before these data can be used to improve public health.

Current methods are primarily focused on protecting privacy, but focusing on protecting privacy alone is inadequate. In statistical disclosure control techniques, information truthfulness is not well preserved so that unreliable results may be released. In generalization-based anonymization approaches, there is information loss due to attribute generalization and existing techniques do not provide sufficient control for maintaining data utility. What are currently needed are methods that protect both the privacy of individuals represented in the data as well as the integrity of relationships studied by researchers.

The problem is that there is an inherent tradeoff between protecting the privacy of individuals and protecting the informativeness of the data set. Protecting the privacy of individuals always results in a loss of information and it is the information contained by the data set that affects the power of a statistical test. For a given anonymization strategy, however, there are often multiple ways of masking the data that meet the disclosure risk criteria provided. This can be taken advantage of to choose the solution that best preserves statistical information while meeting the disclosure risk criteria provided.

This project will develop the first integrated software system that provides solutions for problems faced in all three stages in the release of sensitive health care data:
1. anonymize a data set by intervalizing/generalizing data to satisfy currently available anonymization strategies,
2. provide sufficient controls within anonymization procedures to satisfy constraints on statistical usefulness of the data, and
3. compute statistical tests for the anonymized data intervals.

There are two main challenges facing this effort. The first is that, based on existing research results, integrating our proposed new control processes into anonymization procedures is expected to be computationally difficult. We will overcome this challenge by developing efficient and practically useful greedy algorithms, approximation algorithms, or algorithms working for realistic situations (if not for general cases). The other primary challenge facing this effort is the fact that statistical calculations with interval data sets are known to be computationally difficult, and these calculations are necessary both for control processes within anonymization procedures and for subsequent statistical computation and tests. We will overcome this challenge with efficient algorithms that exploit the structure present in data sets intervalized for privacy. The software will be tested on medical data sets of various sizes and structures to demonstrate the feasibility of the approach and to characterize the scalability of the algorithms with data set size.

PUBLIC HEALTH RELEVANCE: Patient health records possess a great deal of information that is useful in medical research, but access to these data is usually limited because of the private nature of most personal health records. Methods of balancing the informativeness of data for research with the information loss required to minimize disclosure risk are needed before these data can be used to improve public health. This project will develop the first integrated software system that provides solutions for intervalizing/generalizing data, controlling data utility, and performing analyses using interval statistics.
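Illustrative sketch (not part of the award record or the applicant's software): the three stages named in the abstract can be pictured with a toy example. The Python below is a minimal sketch under stated assumptions. It assumes a single numeric attribute, fixed-width bins as the generalization scheme, and a k-anonymity-style rule (every released interval must be shared by at least k records) as the disclosure criterion. All function names are hypothetical. The sample mean is used because its bounds over interval data are easy to compute; the abstract notes that interval statistics in general are computationally difficult.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def intervalize(values: Sequence[float], width: float) -> List[Interval]:
    """Stage 1 (assumed scheme): replace each value by its fixed-width bin."""
    result = []
    for v in values:
        lo = (v // width) * width  # lower edge of the bin containing v
        result.append((lo, lo + width))
    return result


def smallest_group(intervals: Sequence[Interval]) -> int:
    """Size of the smallest set of records sharing one released interval."""
    return min(Counter(intervals).values())


def narrowest_safe_width(values: Sequence[float], k: int,
                         widths: Sequence[float]) -> Tuple[float, List[Interval]]:
    """Stage 2 (assumed rule): try widths from narrow to wide and keep the
    first one whose smallest group has at least k records, so that no more
    information is given up than the disclosure rule requires."""
    for w in sorted(widths):
        intervals = intervalize(values, w)
        if smallest_group(intervals) >= k:
            return w, intervals
    raise ValueError("no candidate width satisfies the disclosure rule")


def mean_bounds(intervals: Sequence[Interval]) -> Interval:
    """Stage 3: bounds on the sample mean given only the intervals.
    The true mean must lie between the mean of the lower endpoints
    and the mean of the upper endpoints."""
    n = len(intervals)
    return (sum(lo for lo, _ in intervals) / n,
            sum(hi for _, hi in intervals) / n)


if __name__ == "__main__":
    ages = [23, 27, 31, 38, 42, 45, 47, 51]  # made-up values for illustration
    width, released = narrowest_safe_width(ages, k=2, widths=[5, 10, 20])
    print(width, mean_bounds(released))
```

Wider intervals make the disclosure rule easier to satisfy but widen the bounds on the mean, which is the tradeoff between disclosure risk and inferential power that the abstract describes.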

Principal Investigator:

Scott D. Ferson
631-751-4350
scott@ramas.com

Business Contact:

Lev R. Ginzburg
631-751-4350
lev@ramas.com
Small Business Information at Submission:

APPLIED BIOMATHEMATICS, INC.
100 NORTH COUNTRY RD SETAUKET, NY -

EIN/Tax ID: 111259650
DUNS: N/A
Number of Employees: N/A
Woman-Owned: No
Minority-Owned: No
HUBZone-Owned: No