
Probabilistic Genotyping Software for Mixture Deconvolution of Next Generation Sequencing Data



OBJECTIVE: Develop an expert probabilistic genotyping software system to reliably interpret next-generation sequencing (NGS) data using a fully continuous approach. 

DESCRIPTION: Forensic DNA laboratories are preparing for the implementation of NGS technologies to supplement and eventually replace current capillary electrophoresis (CE)-based human identification. The advances in sequencing technology provided by NGS approaches allow interrogation of the human genome in new ways, enabling both short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs) to be analyzed for forensic purposes within a single workflow. Analyzing STRs with NGS reveals the sequence of the repeat region, enabling identification of isoalleles (alleles of the same length that differ in sequence). Isoalleles can further differentiate individuals who would otherwise have the same allele designation at a particular locus under CE-based analysis. These capabilities make NGS a powerful tool for forensic human identification and may ultimately enable resolution of more complex mixtures than is currently possible [1] [2], but the transition to this technology also presents challenges. Data gathered from NGS analysis are more complex than CE-based data, and new software solutions are required to take full advantage of the improved mixture resolution. Multiple software platforms exist to analyze raw NGS data and create a visual representation from which a DNA examiner can manually interpret mixtures and low-level samples. These platforms, however, do not address how to reliably and objectively interpret the complex DNA mixtures commonly seen in forensic DNA analysis, particularly for limited or degraded samples collected in operational environments. To increase the amount of actionable information obtained from DNA evidence and fully utilize the sequence information NGS offers, an expert probabilistic genotyping software system designed to analyze sequence information must be developed.
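To illustrate the isoallele point above, the following is a minimal sketch of how two repeat-region sequences can share a length-based (CE-style) designation while remaining distinguishable by sequence. The sequences, the tetranucleotide motif length, and the function names are hypothetical and chosen only for illustration; they are not real locus data or a standard nomenclature.

```python
from collections import defaultdict

MOTIF_LENGTH = 4  # assumption: a simple tetranucleotide repeat


def length_based_allele(sequence):
    """CE-style designation: number of full repeat units, judged by length alone."""
    return len(sequence) // MOTIF_LENGTH


def group_isoalleles(sequences):
    """Map each length-based allele to the distinct sequences that share it."""
    groups = defaultdict(set)
    for seq in sequences:
        groups[length_based_allele(seq)].add(seq)
    return dict(groups)


# Hypothetical repeat regions: same length, one internal base difference.
person_a = "TATC" * 12
person_b = "TATC" * 10 + "AATC" + "TATC"

groups = group_isoalleles([person_a, person_b])
for allele, seqs in sorted(groups.items()):
    print(f"length-based allele {allele}: {len(seqs)} distinct sequence(s)")
```

By length alone both sequences receive the same designation, while a sequence-level comparison keeps them apart. This is the additional discriminating information that sequence-aware mixture interpretation could draw on.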
The software must be compatible with data generated by currently available NGS STR and SNP chemistries. Furthermore, the software should enable data-in to answer-out analysis with minimal user interaction. The software must be capable of utilizing statistical theory to calculate likelihood ratios (LRs) from published allele frequencies. In addition, computer algorithms and biological modeling must be used to infer genotypes from mixed DNA profiles entered into the software system. These capabilities should be optimized to computationally model NGS data and maximize the number of true positives while minimizing false positives. 
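As a rough illustration of calculating an LR from allele frequencies, the sketch below handles only the simplest case: a single-source profile that matches a person of interest. There the LR reduces to the reciprocal of the genotype probability at each locus, multiplied across loci. It assumes Hardy-Weinberg proportions, independent loci, and no population-substructure correction. The locus names are used only as labels, and the frequencies and function names are hypothetical, not published values.

```python
def genotype_probability(allele_a, allele_b, freqs):
    """Hardy-Weinberg probability of an unordered genotype at one locus."""
    p, q = freqs[allele_a], freqs[allele_b]
    return p * p if allele_a == allele_b else 2.0 * p * q


def single_source_lr(profile, freqs_by_locus):
    """LR for 'the person of interest is the donor' vs 'an unrelated person is'.

    Assumes a clean single-source match, so the numerator is 1 at every locus.
    """
    lr = 1.0
    for locus, (allele_a, allele_b) in profile.items():
        lr *= 1.0 / genotype_probability(allele_a, allele_b, freqs_by_locus[locus])
    return lr


# Hypothetical allele frequencies for illustration only.
freqs_by_locus = {
    "D3S1358": {"15": 0.25, "16": 0.25, "17": 0.50},
    "vWA": {"16": 0.20, "17": 0.30, "18": 0.50},
}
profile = {"D3S1358": ("15", "16"), "vWA": ("17", "17")}

lr = single_source_lr(profile, freqs_by_locus)
print(f"LR = {lr:.1f}")
```

A mixture-capable system would replace the fixed numerator and denominator with probabilities of the observed data summed over all genotype combinations consistent with each hypothesis.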

PHASE I: Develop a prototype expert probabilistic genotyping system that can ingest NGS data from at least one NGS STR and SNP chemistry/platform type. The software must demonstrate the ability to analyze clear two-person mixtures with input from a reference profile for inclusion. A fully continuous approach is required, incorporating biological parameters such as peak height ratio, mixture ratio, and stutter. The output file should deconvolute the mixture into potential genotypes, assign a weight to each inferred genotype, display sequence information for both contributors, and contain a likelihood ratio for two competing hypotheses using published allele frequencies from a single population. It is highly desirable that the software parameters include population allele frequencies, drop-out, drop-in, stutter, and kit variance. Ideally, these parameters will be customizable to allow laboratories to use any available NGS STR and SNP technology. Design of the software platform must not prohibit backward compatibility with CE data. Preferably, the software platform will run on commercially available computing systems. Ideally, by the end of the Phase I effort, the analysis of a two-person DNA mixture can be demonstrated in minutes.
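The Phase I output described above (weighted genotype combinations plus an LR for two competing hypotheses) can be sketched for a single locus as follows. This is a toy model under strong assumptions, not a proposed design: the mixture proportion is fixed and known, read counts follow a multinomial model, a small noise floor stands in for proper stutter, drop-in, and sequencing-error models, drop-out is not modeled, genotype priors follow Hardy-Weinberg proportions, and the person of interest is hypothesized to be the major contributor. All names, read counts, and frequencies are hypothetical.

```python
import itertools
import math


def genotype_prior(genotype, freqs):
    """Hardy-Weinberg prior for an unordered genotype (a, b)."""
    p, q = freqs[genotype[0]], freqs[genotype[1]]
    return p * p if genotype[0] == genotype[1] else 2.0 * p * q


def log_likelihood(reads, g1, g2, mix, noise=0.01):
    """Multinomial log-likelihood of the read counts, up to a constant.

    Contributor 1 supplies a fraction `mix` of the template; `noise` is a
    crude floor for reads that neither genotype explains.
    """
    expected = {allele: noise for allele in reads}
    for allele in g1:
        expected[allele] += mix / 2.0
    for allele in g2:
        expected[allele] += (1.0 - mix) / 2.0
    total = sum(expected.values())
    return sum(n * math.log(expected[a] / total) for a, n in reads.items())


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def score_pairs(reads, freqs, mix):
    """Log joint probability of the reads and each ordered genotype pair."""
    genotypes = list(itertools.combinations_with_replacement(sorted(reads), 2))
    return {
        (g1, g2): log_likelihood(reads, g1, g2, mix)
        + math.log(genotype_prior(g1, freqs))
        + math.log(genotype_prior(g2, freqs))
        for g1, g2 in itertools.product(genotypes, repeat=2)
    }


def genotype_weights(scores):
    """Normalize the joint scores into weights that sum to one."""
    norm = log_sum_exp(list(scores.values()))
    return {pair: math.exp(s - norm) for pair, s in scores.items()}


def likelihood_ratio(scores, freqs, poi):
    """Hp: POI is contributor 1 plus one unknown. Hd: two unknowns."""
    poi = tuple(sorted(poi))
    log_hd = log_sum_exp(list(scores.values()))
    # Under Hp the POI genotype is known, so its prior is divided out.
    log_hp = log_sum_exp([s - math.log(genotype_prior(poi, freqs))
                          for (g1, _), s in scores.items() if g1 == poi])
    return math.exp(log_hp - log_hd)


# Hypothetical single-locus data: read counts per sequence-based allele.
reads = {"A": 370, "B": 380, "C": 125, "D": 125}
freqs = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}

scores = score_pairs(reads, freqs, mix=0.75)
weights = genotype_weights(scores)
best = max(weights, key=weights.get)
lr = likelihood_ratio(scores, freqs, poi=("A", "B"))
print("most probable pair:", best, "weight:", round(weights[best], 4))
print("LR for POI (A, B) as major contributor:", round(lr, 2))
```

A fully continuous system would integrate over the mixture proportion and other nuisance parameters rather than fixing them, and would combine results across loci. The sketch only shows how weights and an LR can come from the same set of scored genotype combinations.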

PHASE II: Extend the methods and computer algorithms developed in Phase I to allow for ingestion of NGS data from all currently available NGS STR chemistry/platform types. Improve the software system to interpret at least four-person mixtures and low-level DNA samples. Calculate the likelihood ratio for each genotype using allele frequencies from user-selected population groups (typically Caucasian, African American, and Asian) in a single run. Establish a training set of samples to evaluate the software system performance. For guidance on testing probabilistic genotyping systems, please refer to the “SWGDAM Guidelines on Validation of Probabilistic Genotyping Systems” [3]. Incorporate the developed methods and computer algorithms into a mature software system with a user-friendly interface and the ability to integrate data output into case reports. The software must be designed to allow backward compatibility with CE data. It is highly desirable that, prior to mixture deconvolution, the software be able to use NGS SNP and STR information to infer the number of contributors (NOC). Ideally, the analysis of a complex four-person mixture could be completed within hours. In addition to monthly technical progress reports, deliverables will include: a detailed report demonstrating specifics on how the software obtains its answer (black box systems are unacceptable), a user guide for the software including set-up and troubleshooting, and a report describing the results of the Phase II sample set testing.
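One simple way to see how sequence information could inform the NOC is the maximum allele count rule: each diploid contributor adds at most two alleles per locus, so the locus with the most distinct alleles sets a lower bound. The sketch below applies that rule to the same hypothetical data twice, once counting sequence-based alleles and once collapsing them to length-based alleles. The rule gives only a minimum, not the probabilistic NOC inference the topic asks for, and the locus names, allele labels, and function name are assumptions made for illustration.

```python
import math


def min_contributors(alleles_by_locus):
    """Lower bound on NOC from the locus with the most distinct alleles."""
    return max(math.ceil(len(set(alleles)) / 2)
               for alleles in alleles_by_locus.values())


# Hypothetical observations: (repeat length, sequence variant label) per allele.
observed = {
    "LocusA": [(12, "v1"), (12, "v2"), (13, "v1"), (14, "v1"), (15, "v1")],
    "LocusB": [(8, "v1"), (9, "v1"), (10, "v1")],
}

# Sequence-based view: isoalleles (12, v1) and (12, v2) count separately.
noc_by_sequence = min_contributors(observed)

# Length-based (CE-style) view: isoalleles collapse to one designation.
noc_by_length = min_contributors(
    {locus: [length for length, _ in alleles]
     for locus, alleles in observed.items()}
)

print("minimum NOC from sequence-based alleles:", noc_by_sequence)
print("minimum NOC from length-based alleles:", noc_by_length)
```

In this contrived case the sequence-based count raises the lower bound because two isoalleles that share a length are counted separately. Allele sharing, stutter, and drop-out can all mislead a simple count, which is why a model-based estimate is preferable in practice.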

PHASE III: The development of an expert probabilistic genotyping software system that reliably interprets NGS data using a fully continuous approach will have a significant impact on the forensic science community at the federal, state, and local levels. The software will fully utilize the sequence information that NGS affords, allowing for the objective and reliable interpretation of more complex DNA mixtures than is possible with CE-based methods. Software developed under this topic will initially be tested and evaluated in Government forensic laboratories to assess applicability within forensic analytical workflows. Several commercial applications for sample analysis will directly benefit from this new software capability, including research, law enforcement, and medical diagnostics.


1: Butler Gettings, K., Kiesler, K. M., Faith, S. A., Montano, E., Baker, C. H., Young, B. A., Guerrieri, R. A., Vallone, P. M., "Sequence variation of 22 autosomal STR loci detected by next generation sequencing," Forensic Sci. Int. Genet., vol. 21, pp. 15-21, 2016.

KEYWORDS: Next-generation Sequencing, Probabilistic Genotyping, Fully Continuous, Forensic DNA Analysis 
