Description:
Phase I SBIR proposals will be accepted. Fast-Track proposals
will not be accepted. Phase I clinical trials will not be accepted.
Number of anticipated awards: 1
Budget (total costs): Phase I: up to $243,500 for up to 6 months; Phase II: up to $1,000,000 for up to
2 years
PROPOSALS THAT EXCEED THE BUDGET OR PROJECT DURATION LISTED ABOVE MAY NOT BE FUNDED.
Background
Record linkage (or de-duplication) is an essential component of many CDC-supported projects and programs. If an individual is
reported as a case by more than one data source, or reported more than once, it is vital to link records so that the individual will not
be counted as multiple incident cases. There are powerful algorithms that can automatically detect matches in many
situations. However, these software tools are often proprietary or require programming/coding skills that may not be available in
every state or jurisdiction. A free and easy-to-use solution would strengthen public health expertise, as the same tools could be used
across programs, and users who cannot write code could use the same underlying packages and algorithms as more technically
inclined users.
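To make the de-duplication problem concrete, the following is a minimal sketch in Python using only the standard library. The records, the `is_same_person` rule, and the 0.85 threshold are hypothetical and chosen for illustration; dedicated record linkage software uses more sophisticated matching than this simple name-similarity rule.

```python
from difflib import SequenceMatcher

# Hypothetical case reports: ids 2 and 3 describe the same child, reported
# by two different sources with a spelling variation in the first name.
reports = [
    {"id": 1, "first": "Maria", "last": "Gonzalez", "dob": "2015-03-02", "source": "clinic"},
    {"id": 2, "first": "Jon", "last": "Smith", "dob": "2014-11-20", "source": "school"},
    {"id": 3, "first": "John", "last": "Smith", "dob": "2014-11-20", "source": "clinic"},
]

def name_similarity(a, b):
    """Similarity in [0, 1] between two full names, ignoring case."""
    full_a = f"{a['first']} {a['last']}".lower()
    full_b = f"{b['first']} {b['last']}".lower()
    return SequenceMatcher(None, full_a, full_b).ratio()

def is_same_person(a, b, threshold=0.85):
    """Illustrative rule (an assumption): same date of birth and similar names."""
    return a["dob"] == b["dob"] and name_similarity(a, b) >= threshold

def deduplicate(records):
    """Group records into individuals; each group is a list of record ids."""
    groups = []
    for rec in records:
        for group in groups:
            if any(is_same_person(rec, other) for other in group):
                group.append(rec)
                break
        else:
            # No existing group matched, so this record starts a new individual.
            groups.append([rec])
    return [[r["id"] for r in g] for g in groups]

groups = deduplicate(reports)
print(len(reports), "reports,", len(groups), "individuals")
```

Without linkage, the three reports would be counted as three incident cases; with it, the two "Smith" reports collapse into one individual.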
Motivating example: CDC’s Autism and Developmental Disabilities Monitoring (ADDM) Network currently supports autism
surveillance in different states. States receive information from various medical and educational providers, and states must link
records to ensure each child is counted once and that all critical data elements are linked to the child’s record. The ADDM
surveillance program uses “The Link King”, a SAS-based record linkage program, for data linkages. There are several beneficial
attributes of this tool: it uses high-performing algorithms, is free (but requires a paid SAS subscription), and has a graphical user
interface that allows easy use by non-coders. However, it is no longer actively supported or developed (the team received
permission to host an archival copy at www.the-link-king.party). Future updates to SAS, Microsoft Windows, or any dependency
could jeopardize the functioning of the tool, and therefore the surveillance program.
Project Goals
Short-term project goals:
• Understand basic needs and use cases for record linkage in public health applications
• Develop an R package that provides an R Shiny front-end to a high-performance record linkage package (such
as fastLink, RecordLinkage, or csvdedupe)
○ Functionality should include the ability to set linkage parameters (select the variables used for linkage), identify
data sets to be used, manually verify and review results, and export the resulting matched and non-matched data.
○ Create documentation to instruct users on its use (such as a “getting started” vignette)
○ Create a public GitHub repository for the code, as well as for tracking issues and feature requests from end-users
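The functionality listed above (choosing linkage variables, identifying data sets, flagging results for manual review, and exporting matched and non-matched records) can be sketched roughly as follows. This is an illustration in Python rather than the requested R package, and it is not the API of fastLink, RecordLinkage, or csvdedupe. The data, the function names `score_pair`, `link`, and `to_csv`, and both thresholds are assumptions, and a plain string-similarity average stands in for the matching algorithm a real linkage package would provide.

```python
import csv
import io
from difflib import SequenceMatcher

# Hypothetical inputs standing in for two data sets selected by the user.
dataset_a = [
    {"id": "A1", "first": "Ana", "last": "Lopez", "dob": "2016-01-05"},
    {"id": "A2", "first": "Ben", "last": "Carter", "dob": "2015-07-19"},
]
dataset_b = [
    {"id": "B1", "first": "Anna", "last": "Lopez", "dob": "2016-01-05"},
    {"id": "B2", "first": "Dana", "last": "White", "dob": "2013-02-11"},
]

def score_pair(a, b, variables):
    """Mean string similarity across the user-selected linkage variables."""
    sims = [SequenceMatcher(None, a[v].lower(), b[v].lower()).ratio() for v in variables]
    return sum(sims) / len(sims)

def link(a_rows, b_rows, variables, match_at=0.9, review_at=0.7):
    """Pair each A record with its best-scoring B record and label the result."""
    results = []
    for a in a_rows:
        best = max(b_rows, key=lambda b: score_pair(a, b, variables))
        score = score_pair(a, best, variables)
        if score >= match_at:
            status = "match"
        elif score >= review_at:
            status = "review"  # borderline pairs go to a person for verification
        else:
            status = "non-match"
        results.append({
            "id_a": a["id"],
            "id_b": best["id"] if status != "non-match" else "",
            "score": round(score, 3),
            "status": status,
        })
    return results

def to_csv(rows):
    """Export matched and non-matched results as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id_a", "id_b", "score", "status"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

results = link(dataset_a, dataset_b, variables=["first", "last", "dob"])
print(to_csv(results))
```

In a graphical front-end, the `variables` argument and the two thresholds would correspond to user-selected settings, and the rows labeled "review" would feed the manual verification step.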
Phase I Activities and Expected Deliverables
During the Phase I period, the activities can include, but are not limited to:
The following deliverables should be produced by the end of the project period:
- R Package providing interface to record linkage/de-duplication program(s)
- Includes documentation (built into package, and vignette)
- Package and materials hosted on CRAN
- Source code maintained on a public GitHub repository
- Demonstration to CDC/public health community
- Summary of potential enhancements and community feedback/requests
Impact
This project could have both long- and short-term impact on CDC surveillance programs and other projects. Most immediately, it
will provide a sustainable solution for the ADDM Network, as the current record linkage software is effectively “abandonware” and
requires SAS licenses.
Other “free options” (summarized here and here) often lack easy-to-use interfaces, are not updated, or are only available in
programming languages that would add complexity to (or be incompatible with) a public health program. Commercial tools could be
expensive (as shown here) or require uploading sensitive data to a cloud-based service, which might violate public health data
privacy requirements. Proprietary software could also be custom-tailored to each surveillance system and include this functionality.
For example, the ADDM Network discontinued a $500,000 annual contract to build and maintain a proprietary data system that
included rudimentary record linkage functionality. Other customizable products have linkage/de-duplication functionality, such as
Conduent’s Maven software, but can be expensive and encourage fragmentation between different systems by virtue of requiring
software licenses/contracts.
More broadly, this tool could fill similar gaps in functionality in other CDC and public health programs without having to resort to
custom-developed software. There are already thousands of R users at CDC, and they would be able to easily integrate this tool into
other systems that could benefit, such as during Epi Aids, when simple tools are needed immediately. When we designed our
current data system, we spoke with other surveillance programs and often heard that record linkage / de-duplication processes were
lacking in performance (such as when a basic matching algorithm is integrated into custom software) or were deemed
responsibilities that were “left up to the states” to complete without explicit support from CDC.
If selected, this project would have a high likelihood of success, as the core record linkage algorithms are already available – this
project would make them easier to use by non-programmers and better integrate them into typical public health / surveillance
workflows.
Commercialization Potential
Many open-source software projects sustain successful commercial models by selling professional services, including enhanced
support, customized features, consultation, training, or analytic capacity. This record linkage tool could become part of a suite of
widely used data management and analytic tools that are commonly deployed in the public health community. The developer would
be well-positioned to offer premium support and technical services to programs that use the tools or need custom solutions built
upon an open-source platform.