Funding Opportunity Description
The National Human Genome Research Institute (NHGRI) solicits R43/R44 grant applications to develop novel technologies that will enable extremely low-cost, high quality DNA sequencing. This FOA continues a program that began in 2004, when the cost to produce a high quality draft mammalian genome sequence was estimated at $5 to $10 million, and the goal was to reduce costs by four orders of magnitude – to approximately $1,000 – in ten years.
While substantial progress toward the $1,000 genome has been made, daunting scientific and technical challenges remain. This program will continue to support both fundamental scientific investigation underlying the technologies, and their engineering, to achieve this goal. The program supports development of key system components and of full systems. In this context, ‘key components’ refers to the method for determining the linear order of nucleotides, in contrast to upstream or downstream steps in sequencing. Exploration of methods other than those currently being pursued as potential $1,000 genome technologies is encouraged. High-risk/high-payoff applications are appropriate to achieve the goals of this FOA.
The ability to sequence complete genomes and the free dissemination of sequence data have dramatically changed the nature of biological and biomedical research. Sequence and other genomic data have the potential to lead to remarkable improvement in many facets of human life and society, including the understanding, diagnosis, treatment and prevention of disease; advances in agriculture, environmental science and remediation; and our understanding of evolution and ecological systems.
The ability to sequence many genomes completely has been made possible by the enormous reduction in the cost of sequencing in the past 30 years, from tens of dollars per base in the 1980s to a fraction of a cent per base today. We have progressed from the early Human Genome Project goals to sequence the genomes of the human and mouse and a few additional model organisms (E. coli, C. elegans and D. melanogaster), through programs to sequence at varying pre-determined quality levels numerous genomes across the evolutionary tree including multiple species and variants for some of those, to current programs to sequence portions of, or the entirety of, the genomes of increasing numbers of tumors and human individuals. Technology advances, and in particular the recent emergence of a new generation of sequencing systems, have enabled the launch of several such projects that are producing stunning insights into biology and disease. Nevertheless, the cost to completely sequence large numbers of entire genomes remains too high to allow complete genome sequencing to be used routinely, and we remain far from achieving the low costs and high quality needed to enable the use of comprehensive genomic sequence information in individual health care.
A few examples of high priority research to which genomic sequencing at high quality and dramatically reduced cost would make vital contributions include:
- Comparative genomic analyses across species to yield insights into the structure and function of the human genome and, consequently, the genetics of human health and disease;
- Studies of human genetic variation and its relationship to health and disease, in large numbers of individuals, that capture not only common single nucleotide variation but also rare, copy number, and structural variants that are increasingly thought to play an important role in complex disease;
- Characterization of somatic changes in the genome that contribute to cancer, revealing information from large numbers of matched tumor/normal samples on a genome-wide basis rather than restricting analysis to “suspect” genomic regions that are based on incomplete knowledge;
- Additional genome sequences of agriculturally important animals and plants that are needed to study individual variation, different domesticated breeds and wild variants of each species; and
- Sequence analysis of microbial communities, many members of which cannot be cultured. This will provide a rich source of medically and environmentally useful information. Accurate, rapid sequencing may also be the best approach to microbial monitoring of food and the environment, including rapid detection and mitigation of bioterrorism threats.
The broad utility and high importance of dramatically reducing DNA sequencing costs prompted the NHGRI, in 2004, to embark on two parallel technology development programs (Nature Biotech. 26:1113, 2008). The first had the objective of reducing the cost of producing a high quality sequence of a mammalian-sized genome by two orders of magnitude, to about $100,000. This goal has been achieved, so no additional grant applications are being solicited for it at this time. Rather, the NHGRI is focusing its efforts on attaining the goal of the second program, which, as described in this FOA and parallel FOAs for other grant mechanisms, is the development of technologies with which to sequence a human genome for about $1,000 (a cost reduction of four orders of magnitude). Implicit in this goal is the quality of the expected sequence product, as sequencing cost targets are meaningless without associated quality standards as described below.
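The cost targets above can be checked with simple arithmetic. The following is a minimal sketch, not part of the program definition; it uses only the 2004 baseline estimate of $5 to $10 million stated in this FOA, and the function name is illustrative.

```python
import math

def orders_of_magnitude(start_cost: float, target_cost: float) -> float:
    """Return the cost reduction expressed as a base-10 logarithm."""
    return math.log10(start_cost / target_cost)

# 2004 baseline range stated in this FOA (US dollars).
BASELINE_LOW = 5_000_000
BASELINE_HIGH = 10_000_000

for target in (100_000, 1_000):
    low = orders_of_magnitude(BASELINE_LOW, target)
    high = orders_of_magnitude(BASELINE_HIGH, target)
    print(f"target ${target:,}: {low:.1f} to {high:.1f} orders of magnitude")
```

Measured from the upper end of the baseline range, the two targets correspond to two and four orders of magnitude; measured from the lower end, the reductions are slightly smaller.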
Sequencing Strategy and Quality
The sequencing technology that was used to produce the reference human genome sequence (Nature 431:931, 2004; Nature 409:860, 2001; Science 291:1304, 2001) was fluorescence detection of dideoxynucleotide-terminated DNA extension reactions resolved by capillary array electrophoresis (CAE). Individual sequence “read” segments can be as long as 1000 nucleotides. If all of the DNA in a 3 Gb genome were unique, it would be possible to determine the sequence of the entire genome by generating a sufficient number (tens of millions) of randomly-overlapping 1000-base reads and aligning their overlaps. However, the human and the majority of other interesting genomes contain a substantial amount of repetitive DNA. To cope with the complexities of repetitive DNA elements and to assemble the thousand-base reads in the correct long-range order across the genome, genomic sequencing methods involve a variety of additional strategies, such as the sequencing of both ends of cloned DNA fragments, use of libraries of cloned fragments of specified lengths, incorporation of map information, achievement of substantial redundancy (multiple reads of each nucleotide from overlapping fragments) and application of sophisticated assembly algorithms to filter and align the reads. As new sequencing technologies are developed, they must incorporate means to deal with these features of the structure of the genome.
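The "tens of millions" of reads mentioned above follows from the relation coverage = (number of reads × read length) / genome size. The sketch below illustrates that arithmetic for a 3 Gb genome and 1000-base reads. The Poisson estimate of uncovered bases is an added assumption (reads placed uniformly at random, no repeats, no cloning bias), which is an idealization that real genomes do not satisfy; the function names are illustrative.

```python
import math

def reads_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Reads needed so that total sequenced bases equal coverage * genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

def expected_fraction_uncovered(coverage: float) -> float:
    """Idealized Poisson model: probability that a given base is hit by no read."""
    return math.exp(-coverage)

GENOME_SIZE_BP = 3_000_000_000  # 3 Gb, as in the text
READ_LENGTH_BP = 1_000          # upper end of dideoxy/CAE read length

for c in (1, 5, 10):
    n = reads_for_coverage(GENOME_SIZE_BP, READ_LENGTH_BP, c)
    frac = expected_fraction_uncovered(c)
    print(f"{c:>2}-fold coverage: {n:,} reads, ~{frac:.4%} of bases uncovered")
```

Even under this idealized model, single-fold coverage leaves roughly a third of bases unsampled, which is one reason substantial redundancy is needed before repeats and assembly are even considered.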
The gold standard for genomic sequencing is based on the above-described methods and remains ≥99.99% accuracy (not more than one error per 10,000 nucleotides) with essentially no gaps (http://www.genome.gov/10000923). The “finishing” steps needed to achieve that very high quality have not been automated and thus require substantial hand-crafting. However, experience shows that much comparative and medically useful sequence information can be obtained from automatically generated sequence assemblies that are known as “high-quality draft” or “comparative grade.” Therefore, the cost targets for NHGRI’s sequencing technology development programs were originally defined in terms of a mammalian-sized genome with a sequence quality equivalent to or better than that of the mouse draft assembly published in December 2002 (Nature 420:520, 2002). Producing such a product for $1,000 is still not possible, so this remains a useful and challenging technology target. Unquestionably, the ultimate need, for medical research and individualized medicine, is for far higher quality sequence of the 3 Gb diploid human genome.
Subsequent to dideoxy/CAE sequencing, a second generation of sequencing technologies has been developed and implemented. These are broadly described as array-based methods in which large numbers of templates are extended one base at a time, the extensions are detected, and that cycle is repeated (reviewed in Nature Biotechnology October 2008). These technologies enable the sequencing of larger numbers of genomes for under $100,000 each, and even less for targeted sequencing of particular genomic regions. While challenges remain related to producing and interpreting sequence information using these technologies, they are proving invaluable to the biomedical research enterprise and they validate the impetus toward further substantial decreases in cost with increases in quality, throughput and speed of genomic sequencing.
The eventual goal of these programs is to achieve technologies that can produce assembled sequence of genomes that had not been previously sequenced (i.e., de novo sequencing). However, an accompanying goal is to obtain highly accurate sequence data at the single base level that can be overlain on a reference sequence of the organism (i.e., re-sequencing). This could be achieved, for example, with short reads that lack information linking them to other reads. In spite of shortcomings (some of which are described below), re-sequencing would potentially be available sooner and of considerable value for certain studies on disease etiology and individualized medicine. Therefore, technology development for re-sequencing will be supported under this FOA. As the cost goal for the more difficult challenge, de novo assembly sequencing, is $1,000, the goal for re-sequencing is to develop technologies that will provide a genome’s worth of less-well-mapped sequence for well below $1,000.
For re-sequencing, the per-base accuracy must be sufficient to distinguish between sequencing errors and real polymorphism. Additional challenges include assigning reads to gene families with very similar sequence, the identification of copy number changes and genomic rearrangements, and the identification of haplotypes (i.e., linear juxtapositioning of particular single nucleotide polymorphism [SNP] alleles along a single chromosome) in diploid organisms. Thus, in proposing the development of re-sequencing technologies, it is essential that the applicant clearly address the extent to which the proposed technology will meet or fall short of these various challenges, and the cost tradeoffs that justify developing the technology to produce data of high value for particular biomedical studies.
Specific Areas of Research Interest
The goal of research supported under this FOA is to develop new or improved technology to enable rapid, efficient DNA sequencing of mammalian-sized genomes. The target cost for a 3 Gb diploid genome sequence determined at reasonably high quality is about $1,000 because the ability to generate routinely complete genomic sequences at that cost would revolutionize biological research and medicine.
Both fundamental scientific discovery and cutting edge engineering will likely be needed to achieve these goals. For example, new sensing and detection modalities and fabrication methods may be required, and the physics of systems operating at nm length scales will need to be better understood. It is therefore anticipated that applications responding to this FOA will involve fundamental and engineering research conducted by multidisciplinary teams of investigators. The guidance for budget requests accommodates the formation of groups having investigators at several institutions, in cases where that is needed to assemble a team of the appropriate balance, breadth and experience.
The scientific and technical challenges inherent in achieving the cost goals are significant. Achieving these goals may require research projects that entail substantial risk. That risk should be balanced by an outstanding scientific and management plan designed to achieve the very high payoff goals of this FOA. High-risk/high-payoff projects may fail for legitimate reasons, so applicants proposing such projects should identify them as such, elaborate key quantitative milestones to be achieved, and describe the consequences of not achieving those milestones in a reasonable period of time.
Applicants may propose to develop full-scale sequencing systems or investigate key components of such systems. For the latter, applicants must describe how the knowledge gained as a result of the proposed project would be incorporated into a full system that they or others might subsequently propose to develop. Such independent applications are an important path for pursuing novel, high-risk/high-payoff ideas, short of developing a full system.
While the major focus of this program is on the development of new technologies for detection of nucleotide sequence, any successful technology will have to address matters related to the practical implementation of the technologies so that the technology can form the basis of, or be incorporated into, an efficient, high quality, high-throughput DNA sequencing scheme. Any new technology will eventually need to be incorporated effectively into a sequencing workflow, starting with a biological sample and ending with sequence data of the desired quality. Sample preparation requirements can depend upon the detection method which, in turn, can affect the way in which output data are handled. If a full system development is proposed, these issues should be addressed on an appropriate schedule in the research plan; applicants should focus as early as possible in the research plan on the most critical and highest-risk aspects of the project related to determining the sequence of nucleotides, on which the rest of the project depends. Projects to address key components must address the fundamental method of determining the base sequence (that is, an application addressing only sample prep or only the downstream informatics would not be considered responsive to this FOA).
Most technology developers lack practical experience in high-throughput sequencing and in testing of methods and instruments for robust, routine sequencing operation. Applicants may therefore wish to include such expertise as they develop their teams. Academic investigators may wish to consider collaborating with commercial entities that have the experience and capabilities to bring practical systems into the hands of users.
The quality of sequence to be generated by the technology is of paramount importance for this FOA. Two major factors contributing to genomic sequence quality are per-base accuracy and contiguity of the assembly. Much of the utility of comparative sequence information will derive from characterization of sequence variation between species, and between individuals of a species. Therefore, per-base accuracy must be high enough to discern polymorphism at the single-nucleotide level (substitutions, insertions, deletions) and distinguish polymorphism from sequencing errors. Experience and resulting policy have established a target accuracy of not more than one error per 10,000 bases. All applications in response to this FOA, whether to develop re-sequencing or de novo sequencing technologies, must propose to achieve at least this standard.
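The accuracy target above is often expressed as a Phred quality score, Q = -10 log10(p), where p is the per-base error probability. The sketch below shows that one error per 10,000 bases corresponds to Q40, and that Q20 corresponds to one error per 100 bases. The expected-error count assumes independent errors at a uniform rate across the genome, which is a simplification; function names are illustrative.

```python
import math

def error_prob_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability p."""
    return -10.0 * math.log10(p)

def phred_to_error_prob(q: float) -> float:
    """Per-base error probability for a Phred quality score q."""
    return 10.0 ** (-q / 10.0)

def expected_errors(num_bases: int, p: float) -> float:
    """Expected number of erroneous bases, assuming a uniform, independent error rate."""
    return num_bases * p

TARGET_ERROR_RATE = 1 / 10_000   # not more than one error per 10,000 bases
GENOME_SIZE_BP = 3_000_000_000   # 3 Gb genome

print(f"Target accuracy as Phred score: Q{error_prob_to_phred(TARGET_ERROR_RATE):.0f}")
print(f"Q20 error probability: {phred_to_error_prob(20):.4f}")
print(f"Expected errors in 3 Gb at target: {expected_errors(GENOME_SIZE_BP, TARGET_ERROR_RATE):,.0f}")
```

At the target rate, a 3 Gb sequence would still be expected to contain on the order of 300,000 erroneous bases, which illustrates why per-base accuracy bears directly on distinguishing polymorphism from sequencing error.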
Assembly information is needed for determining sequence of new genomes and ultimately also for genomes for which a reference sequence exists, to detect rearrangements, insertions, deletions, and copy number changes. All of these are genomic changes that have been shown to be associated with disease, and knowledge of rearrangements can reveal new biological mechanisms. Determining the phase of single nucleotide polymorphisms to define haplotypes is important in understanding and diagnosing disease. Achieving a high level of sequence contiguity may be essential to achieve the full benefit from the use of sequencing for individualized medicine, e.g., to evaluate genomic contributions to risk for specific diseases and syndromes, and drug responsiveness. Nevertheless, it is recognized that perfect sequence assembly from end to end of each chromosome is unlikely to be achievable with most technologies in a fully automated fashion and without adding considerable cost. Therefore, for the purpose of this FOA, grant applications proposing technology development for de novo sequencing shall describe how they will achieve, for about $1,000, a draft-quality assembly that is at least comparable to that represented by the mouse draft sequence produced by December 2002: 7.7-fold coverage, 6.5-fold coverage in Q20 bases, assembled into 225,000 sequence contigs connected by at least two read-pair links into supercontigs [total of 7,418 supercontigs at least 2 kb long], with N50 length for contigs equal to 24.8 kb and for supercontigs equal to 16.9 Mb (Nature 420:520, 2002). Grant applications that propose technology development for re-sequencing should fully describe the qualities and characteristics of the genomic sequence information that the technology would produce, and the projected cost. That cost should be at least four orders of magnitude lower than the cost to produce comparable quality data in 2004, when this program was initiated.
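The N50 length used in the contiguity benchmark above is commonly defined as the length L such that contigs (or supercontigs) of length L or longer together contain at least half of all assembled bases. The following minimal sketch computes it under that common definition; the contig lengths in the example are hypothetical toy values, not data from the mouse assembly.

```python
from typing import Iterable

def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or supercontig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")

# Hypothetical toy assembly: eight contigs totalling 32 units of length.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print("N50 of toy assembly:", n50(toy_contigs))
```

Because N50 is weighted by length, a few long contigs dominate the statistic; an applicant comparing a proposed technology with the benchmark may also wish to report the number of contigs and total assembled bases alongside it.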
Grant applications will be evaluated, and funding decisions made, in such a way as to develop a balanced portfolio that has strong potential to develop both robust de novo and re-sequencing technologies, exploring a variety of technology approaches. If the estimate is correct that the goal of $1,000 de novo genome sequencing incorporating substantial assembly information will be achieved by about 2014, then re-sequencing technologies at even lower cost might be expected in a shorter time. Projects with a plan to achieve re-sequencing while on the path to de novo sequencing will receive priority.
Research conducted under this FOA may include development of the computational tools associated with the technology, e.g., to extract sequence information, including image analysis and signal processing, and to evaluate sequence quality and assign confidence scores. It may also address strategies to assemble the sequence from the information being obtained from the technology or by merging the sequence data with information from parallel technology. Applications that incorporate effective plans to develop systems that address the bioinformatics challenges in concert with the sequencing wet-ware will receive high priority consideration. However, this FOA will not support development of sequence assembly or sequence analysis software independent of technology development to obtain the linear nucleotide sequence.
This program is aimed at technology to sequence entire genomes. Projects are under way to determine sequence from selected important regions (e.g., all of the genes). Grant applications that propose to meet the cost targets by sequencing only selected regions of a genome will be considered unresponsive to this FOA. However, applications that propose novel ways to sequence selected genomic regions, cost-effectively, while on a path to whole-genome sequencing, will be considered responsive.
NHGRI is interested in supporting diverse approaches to achieving the goals of this FOA. To assist the research community, investigators who are pursuing one set of technology paths that involve the use of nanopores and nanogaps published an overview of the challenges they face (Nature Biotech. 26:1146, 2008). Similarly, challenges attending sequencing by synthesis have been described (Nature Biotech. 27:1013, 2009). Grant applications to meet these and related challenges, and to pursue alternative technologies, are welcome under this FOA. Information on projects funded under earlier versions of this and related FOAs and a program bibliography are available at http://www.genome.gov/10000368#6.