Integrated Assembly Software for Sanger and Next Generation Sequence Technologies
Small Business Information
DNASTAR, INC., 3801 REGENT STREET, MADISON, WI
Abstract
DESCRIPTION (provided by applicant): The advent of next-generation (Next-gen) sequencing technologies has triggered a surge in whole genome sequencing and resequencing, exemplified spectacularly by four papers describing five complete human genomes in 2008 alone. One company, Knome, now even offers customers their entire genome sequence using Next-gen sequencing technology. These developments, together with targeted resequencing of genomes, presage the day of the $1000 human genome. Broad-scale whole human genome resequencing (WHGR) will have enormous impact on the areas of personalized medicine, human evolution and human diversity. To fully realize that potential, however, software capabilities must be dramatically enhanced to meet the significant challenges posed by the sheer volume of data generated in these projects, the diversity of technology-specific data characteristics, and the task of analyzing the 6 billion base pair diploid human genome. Moreover, we see the day when technology improvements and cost reductions make WHGR as commonplace as bacterial genome sequencing has become today. For that to occur, assembly and analysis software must be accessible to a far broader and less computer-savvy range of researchers than the highly specialized bioinformatics teams that decode the information now. Also, computer resources are far more limited, even for a well-funded research laboratory, than those available to a large sequencing center. Therefore, the overall goal of this proposal is to develop a Next-gen sequence assembly and analysis pipeline, DESKAPP, that will run on an affordable ($5000) high-end desktop computer and produce a human genome sequence in a reasonable timeframe (days, not weeks). WHGR by DESKAPP will involve a reference-guided main assembly as well as a de novo assembly branch to characterize unique regions of the new genome relative to the reference.
Merging of the assemblies produces a complete sequence that can be evaluated for gene content, single nucleotide polymorphisms (SNPs) and structural variation (SV; indels, inversions, translocations), both by web-based searches of external databases to identify known allelic variation and by direct examination of the sequence to identify new polymorphisms. A Disk Sort Alignment (DSA) algorithm allows data sets that are far too large for in-memory processing to be evaluated and clustered for assembly by SeqMan N-Gen (SM N-Gen), our desktop assembly engine. Using a prototype DSA-SM N-Gen pipeline, we have processed the entire 7.4x 454 data set from the James Watson genome to a layout file in 31 hours using DSA and have assembled three chromosomes (8, 21, and X) using SM N-Gen. Assembly times varied from 1 hour for Chromosome 21 to 10.6 hours for an average-sized chromosome, such as Chromosome 8. Together, these results demonstrate the feasibility of constructing a DESKAPP pipeline for WHGR. The Phase II Aims are designed to build upon this foundation and produce a seamless pipeline for the desktop assembly and analysis of a human genome in a matter of days.
PUBLIC HEALTH RELEVANCE: Next-gen sequencing technologies have started a new revolution throughout biology by providing DNA sequence data in unprecedented quantities at continually decreasing costs. These data will be invaluable in the emerging era of personalized medicine and in exploring the immense diversity of life. The goal of this project is to develop desktop computer software that will enable research laboratories and clinics of any size to realize the promise of these new technologies.
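The abstract does not describe how the Disk Sort Alignment algorithm works internally. As a rough illustration of the general idea it names, ordering records on disk when the data set is too large for memory, the following is a minimal sketch of a generic external merge sort. It is not DNASTAR's method. The record layout (an integer reference position paired with a read identifier), the function names, and the `chunk_size` parameter are all assumptions made for this example.

```python
import heapq
import os
import tempfile
from itertools import islice


def _write_run(records):
    """Sort one memory-sized chunk and write it to a temporary file."""
    records.sort()
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        for position, read_id in records:
            handle.write(f"{position}\t{read_id}\n")
    return path


def _read_run(path):
    """Stream (position, read_id) pairs back from a sorted run file."""
    with open(path) as handle:
        for line in handle:
            position, read_id = line.rstrip("\n").split("\t")
            yield int(position), read_id


def external_sort(records, chunk_size):
    """Yield records in sorted order while holding at most chunk_size
    records in memory during the run-building phase.

    Hypothetical helper: `records` is any iterable of
    (position, read_id) tuples, e.g. candidate read placements.
    """
    iterator = iter(records)
    run_paths = []
    try:
        # Phase 1: cut the input into sorted runs stored on disk.
        while True:
            chunk = list(islice(iterator, chunk_size))
            if not chunk:
                break
            run_paths.append(_write_run(chunk))
        # Phase 2: k-way merge of the runs, reading each one lazily.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)


# Toy input: assumed (reference position, read id) pairs.
hits = [(500, "read_a"), (20, "read_b"), (310, "read_c"),
        (20, "read_d"), (7, "read_e")]
sorted_hits = list(external_sort(hits, chunk_size=2))
```

Once records are ordered by position, reads that fall in the same region sit next to each other in the output stream, so a later step could group them for assembly without loading the whole data set. Whether DSA clusters reads this way is not stated in the abstract.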
* Information listed above is as of the time of submission.