
Deep Learning and Extraction of Chemical Synthesis or Biosynthetic Pathways from Scientific Literature


OUSD (R&E) CRITICAL TECHNOLOGY AREA(S): Advanced Computing and Software; Biotechnology


OBJECTIVE: Conduct proof-of-concept studies to enable automated knowledge extraction, using Natural Language Processing (NLP) and machine learning (ML) approaches, of chemical synthesis or biosynthetic pathways described in publicly available scientific publications that may constitute dual-use research of concern (DURC).  This topic seeks development of (1) computational methods that employ large language models (LLMs) to automatically produce knowledge ontologies from scientific literature and (2) a scalable system with automated annotation capability.


DESCRIPTION: Scientific research of dual-use concern includes research that, based on current understanding, can be reasonably anticipated to provide knowledge, information, products, or technologies that could be directly misapplied to pose a significant threat with broad potential consequences to public health and safety, materiel, and national security, among other sectors.  Automated, large-scale extraction of procedures for chemical synthesis or biosynthesis will increase the efficiency and effectiveness of analysts seeking to provide timely courses of action (COA) for relevant decision makers concerning biological or chemical synthesis information of interest to the Defense Threat Reduction Agency (DTRA) and end users.  An ideal system shall require minimal human-in-the-loop involvement while still enabling a subject matter expert (SME) to supplement the model with ad hoc data.


The development of open-source semantic systems has expanded beyond manually curated, commercial databases such as Reaxys (Reference 1).  One recent effort, SynKB, has applied methods to automatically extract data from organic chemistry reactions described in United States (U.S.) and European commercial patents (Reference 2).  The SynKB system enables chemists to perform structured queries over large corpora of synthesis procedures.  Other research groups have applied language models to molecular design, whereby the model implicitly learns the “vocabulary” and composition of valid molecules and provides the ability to survey optimized molecular properties (Reference 3).  Text-like representations of molecules and chemical reactions, such as the Simplified Molecular-Input Line-Entry System (SMILES), and NLP neural network Transformer architectures have been applied to retrosynthesis prediction problems (Reference 4).  Another research team has developed the AiZynthTrain package for training synthesis models on U.S. Patent and Trademark Office (USPTO) patent data with the intent to integrate them into retrosynthesis software (Reference 5).  This topic seeks to build upon these and similar NLP-based approaches for knowledge extraction that may be broadly applied to chemical or biological scientific domains, improve system scalability, and automate data annotation capabilities.
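
As an illustration of why SMILES strings lend themselves to NLP models, the following minimal sketch splits a SMILES string into the tokens a Transformer would consume.  The regular expression and the function name `tokenize_smiles` are assumptions made for this example; they are not taken from the referenced works and cover only a common subset of SMILES syntax.

```python
import re

# Assumed token pattern for illustration: bracket atoms, two-letter halogens,
# common organic-subset atoms, aromatic atoms, bonds, branches, ring closures,
# and the ">" separator used in reaction SMILES.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify a lossless round trip.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Ethanol, then a reaction SMILES (acetic acid + ethanol >> ethyl acetate).
    print(tokenize_smiles("CCO"))
    print(tokenize_smiles("CC(=O)O.OCC>>CC(=O)OCC"))
```

Tokenizing at the level of atoms and bonds, rather than single characters, keeps two-letter elements such as `Cl` intact, which is one reason this style of tokenizer is common in sequence models for chemistry.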

DTRA’s areas of interest include, but are not limited to: (1) understanding viable synthesis routes to a chemical compound from precursor molecules or substrates, and other starting materials, and (2) retrosynthesis prediction, which may be used to identify possible routes of synthesis and determine the most effective route for the synthesis process.
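
The two areas of interest above can be pictured as a backward search over known reactions.  The sketch below is a toy example under stated assumptions: the reaction table, the compound names, and the use of "fewest steps" as a stand-in for "most effective route" are all hypothetical and chosen only to show the shape of the problem.

```python
# Hypothetical toy data: each product maps to alternative precursor sets.
REACTIONS = {
    "target": [("intermediate_1", "reagent_a"), ("intermediate_2",)],
    "intermediate_1": [("start_x", "start_y")],
    "intermediate_2": [("start_x", "intermediate_3")],
    "intermediate_3": [("start_z",)],
}
# Hypothetical set of starting materials assumed to be on hand.
AVAILABLE = {"start_x", "start_y", "start_z", "reagent_a"}


def enumerate_routes(compound, reactions, available, max_depth=5):
    """Return all routes to `compound` as lists of (product, precursors) steps."""
    if compound in available:
        return [[]]  # nothing to synthesize: one route with zero steps
    if max_depth == 0 or compound not in reactions:
        return []  # dead end: no known way to make this compound
    routes = []
    for precursors in reactions[compound]:
        partial = [[(compound, precursors)]]
        for precursor in precursors:
            sub_routes = enumerate_routes(precursor, reactions, available, max_depth - 1)
            # Combine every partial route with every way of making this precursor.
            partial = [route + sub for route in partial for sub in sub_routes]
        routes.extend(partial)
    return routes


def shortest_route(routes):
    """Pick the route with the fewest steps (an assumed proxy for effectiveness)."""
    return min(routes, key=len) if routes else None


if __name__ == "__main__":
    for step in shortest_route(enumerate_routes("target", REACTIONS, AVAILABLE)):
        print(step)
```

A real system would replace the step count with a learned or expert-defined score (for example yield, cost, or precursor availability) and would draw the reaction table from the extracted literature rather than a hand-written dictionary.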


PHASE I: Leverage an LLM-based ontology system that uses textual knowledge of ontologies to extract information about biosynthetic or chemical synthetic procedure details from open-source literature.  The proof-of-concept system shall provide information about viable routes to a chemical compound and possible retrosynthesis analysis.  Performers shall utilize visual analytical methods that enable users to browse and search for chemical-related data.  Performers are encouraged to represent findings to a user with chemical-pathway association graphs, knowledge graphs, or by other means.  The architecture shall be scalable and shall leverage automated annotation capabilities in lieu of human annotation.
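
One way to picture the knowledge graph representation mentioned above is as a store of (subject, relation, object) triples, each tagged with the document it was extracted from.  The sketch below is a minimal in-memory version; the class `KnowledgeGraph`, the relation names, and the source identifier `doc-001` are hypothetical and used only for illustration.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal triple store: subject -> list of (relation, object, source)."""

    def __init__(self):
        self._out = defaultdict(list)

    def add(self, subject, relation, obj, source):
        """Record one extracted fact together with its source document."""
        self._out[subject].append((relation, obj, source))

    def neighbors(self, subject, relation=None):
        """Return outgoing edges of `subject`, optionally filtered by relation."""
        return [
            (rel, obj, src)
            for rel, obj, src in self._out.get(subject, [])
            if relation is None or rel == relation
        ]

    def search(self, term):
        """Case-insensitive substring search over all node names."""
        term = term.lower()
        nodes = set(self._out)
        for edges in self._out.values():
            nodes.update(obj for _, obj, _ in edges)
        return sorted(node for node in nodes if term in node.lower())


# Hypothetical triples for a textbook esterification reaction.
kg = KnowledgeGraph()
kg.add("acetic acid", "reactant_of", "esterification_1", "doc-001")
kg.add("ethanol", "reactant_of", "esterification_1", "doc-001")
kg.add("sulfuric acid", "catalyst_of", "esterification_1", "doc-001")
kg.add("esterification_1", "produces", "ethyl acetate", "doc-001")

if __name__ == "__main__":
    print(kg.search("acet"))
    print(kg.neighbors("esterification_1", "produces"))
```

Keeping the source identifier on every edge lets a user trace any displayed relationship back to the publication it came from, which supports the browse-and-search requirement.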


The devised solution shall capture relationships between concepts that indicate possible DURC.  The performers will develop quantitative, statistically grounded metrics to evaluate the classification performance of the LLM.
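
The topic does not prescribe specific metrics.  As one possible starting point, the sketch below computes precision, recall, and F1 score for a binary classifier; the function name and the example labels are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels: 1 = passage flagged as relevant, 0 = not relevant.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    print(precision_recall_f1(y_true, y_pred))
```

Reporting precision and recall separately matters here because missed relevant passages and false alarms carry different costs for an analyst.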


Phase I deliverables will include (1) a final report and (2) demonstration of the preliminary architecture.  The report should provide results on architecture performance using unambiguous statistical methods, describe training and development, and identify advantages, limitations, and weaknesses.  The architecture shall be described, including operating system, other software requirements (if applicable), and data sources.


PHASE II: Phase II efforts will focus on refinement of the approach developed during Phase I and prototype demonstration.  The Phase II deliverables will be a prototype demonstration of the LLM neural network architecture and a report detailing: (1) a description of the approach, optimization techniques, and performance outcomes; (2) training, testing, and validation methods; (3) a real-world evaluation of the approach with a use case of mutual interest with DTRA; and (4) advantages, disadvantages, and limitations of the approach.  The performer will identify weaknesses of the approach and identify methods that may improve the performance of the classifier and other aspects of the overall architecture.  The performer will provide details about user interfaces (if applicable) and any associated executables.


PHASE III DUAL USE APPLICATIONS: The performer will identify and employ features that have the potential for use in commercial applications.



REFERENCES:

  1. “Reaxys: An Expert-curated Chemistry Database,”
  2. F. Bai, et al., “SynKB: Semantic Search for Synthetic Procedures,” in arXiv, 2022.
  3. J. Owoyemi, et al., “SmilesFormer: Language Model for Molecular Design,” in ChemRxiv, 2023 (preprint publication).
  4. I. V. Tetko, et al., “State-of-the-Art Augmented NLP Transformer Models for Direct and Single-Step Retrosynthesis,” in Nature Communications, vol. 11, pp. 1-11, 2020.
  5. S. Genheden, et al., “AiZynthTrain: Robust, Reproducible, and Extensible Pipelines for Training Synthesis Prediction Models,” in ACS Journal of Chemical Information and Modeling, vol. 63, pp. 1841-1846, 2023.


KEYWORDS: Biosynthetic Pathways, Chemical Synthesis, Deep Learning, Extraction
