
Open-Source and User-Friendly Record Linkage/De-duplication Tool

Description:

Phase I SBIR proposals will be accepted. Fast-Track proposals will not be accepted. Phase I clinical trials will not be accepted.

Number of anticipated awards: 1

Budget (total costs): Phase I: up to $243,500 for up to 6 months; Phase II: up to $1,000,000 for up to 2 years

PROPOSALS THAT EXCEED THE BUDGET OR PROJECT DURATION LISTED ABOVE MAY NOT BE FUNDED.

Background

Record linkage (or de-duplication) is an essential component of many CDC-supported projects and programs. If an individual is reported as a case by more than one data source, or is reported multiple times, it is vital to link the records so that the individual is not counted as multiple incident cases. There are powerful algorithms that can automatically detect matches in many situations. However, these software tools are often proprietary or require programming/coding skills that may not be available in every state or jurisdiction. A free and easy-to-use solution would strengthen public health expertise, as the same tools could be used across programs, and users who cannot write code could use the same underlying packages and algorithms as more technically inclined users.

Motivating example: CDC’s Autism and Developmental Disabilities Monitoring (ADDM) Network currently supports autism surveillance in different states. States receive information from various medical and educational providers, and states must link records to ensure each child is counted once and that all critical data elements are linked to the child’s record. The ADDM surveillance program uses “The Link King”, a SAS-based record linkage program, for data linkages. This tool has several beneficial attributes: it uses high-performing algorithms, is free (but requires a paid SAS subscription), and has a graphical user interface that allows easy use by non-coders.
However, it is no longer actively supported or developed (the team received permission to host an archival copy at www.the-link-king.party). Future updates to SAS, Microsoft Windows, or any dependency could jeopardize the functioning of the tool, and therefore the surveillance program.

Project Goals

Short-term project goals:
• Understand basic needs and use cases for record linkage in public health applications
• Develop an R package that provides an R Shiny front-end to a high-performance record linkage package (such as fastLink, RecordLinkage, or csvdedupe)
  ○ Functionality should include the ability to set linkage parameters (select the variables used for linkage), identify the data sets to be used, manually verify and review results, and export the resulting matched and non-matched data.
  ○ Create documentation to instruct users on its use (such as a “getting started” vignette)
  ○ Create a public GitHub repository for the code, as well as for tracking issues and feature requests from end-users

Phase I Activities and Expected Deliverables

During the Phase I period, the activities can include, but are not limited to:

The following deliverables should be produced by the end of the project period:
- R Package providing interface to record linkage/de-duplication program(s)
- Includes documentation (built into package, and vignette)
- Package and materials hosted on CRAN
- Source code maintained on a public GitHub repository
- Demonstration to CDC/public health community
- Summary of potential enhancements and community feedback/requests

Impact

This project could have both long- and short-term impact on CDC surveillance programs and other projects. Most immediately, it will provide a sustainable solution for the ADDM Network, as the current record linkage software is effectively “abandonware” and requires SAS licenses.
Other “free options” (summarized here and here) often lack easy-to-use interfaces, are not updated, or are only available in programming languages that would add complexity to (or be incompatible with) a public health program. Commercial tools could be expensive (as shown here) or require uploading sensitive data to a cloud-based service, which might violate public health data privacy requirements.

Proprietary software could also be custom-tailored to each surveillance system and include this functionality. For example, the ADDM Network discontinued a $500,000 annual contract to build and maintain a proprietary data system that included rudimentary record linkage functionality. Other customizable products have linkage/de-duplication functionality, such as Conduent’s Maven software, but can be expensive and encourage fragmentation between different systems by virtue of requiring software licenses/contracts.

More broadly, this tool could fill similar gaps in functionality in other CDC and public health programs without having to resort to custom-developed software. There are already thousands of R users at CDC, and they would be able to easily integrate this tool into other systems that could benefit, such as during Epi Aids, when simple tools are needed immediately.

When we designed our current data system, we spoke with other surveillance programs and often heard that record linkage / de-duplication processes were lacking in performance (such as when a basic matching algorithm is integrated into custom software) or were deemed responsibilities that were “left up to the states” to complete without explicit support from CDC.

If selected, this project would have a high likelihood of success, as the core record linkage algorithms are already available – this project would make them easier to use by non-programmers and better integrate them into typical public health / surveillance workflows.

Commercialization Potential

Many open-source software projects have successful commercial models through selling professional services, including enhanced support, customized features, consultation, training, or analytic capacity. This record linkage tool could become part of a suite of widely used data management and analytic tools that are commonly deployed in the public health community. The developer would be well-positioned to offer premium support and technical services to programs that use the tools or need custom solutions built upon an open-source platform.