BIGDATA TECHNOLOGIES FOR SCIENCE, ENGINEERING, AND MANUFACTURING

Description:

Please note that a Letter of Intent is due Tuesday, September 5, 2017.

PROGRAM AREA OVERVIEW: OFFICE OF SCIENCE

The Office of Science’s mission is to deliver scientific discoveries and major scientific tools to transform our understanding of nature and advance the energy, economic and national security of the United States. The Office of Science is the Nation’s largest Federal sponsor of basic research in the physical sciences and the lead Federal agency supporting fundamental scientific research for our Nation’s energy future. The topic below is a collaborative topic among multiple programs in the Office of Science.

Maximum Phase I Award Amount: $150,000

Maximum Phase II Award Amount: $1,000,000

Accepting SBIR Applications: YES

Accepting STTR Applications: YES

The offices of Advanced Scientific Computing Research, Biological and Environmental Research, Basic Energy Sciences, and Nuclear Physics in the Office of Science at the US Department of Energy (DOE) are soliciting grant applications in the broad technical area of “BigData” - technologies for managing and analyzing complex scientific and engineering data sets. The challenge of managing and analyzing increasingly large BigData streams is impacting every sector of modern society, from energy, defense, healthcare, and transportation to science and engineering. Unlike traditional structured data sets, today’s BigData are characterized by multi-dimensional features that include large data volumes, variety, velocity, and veracity. Despite the ubiquitous BigData challenge faced by the scientific and engineering communities, there is still a lack of cost-effective and easy-to-use tools and services that facilitate and accelerate the analysis, organization, retrieval, sharing, and modeling of complex data streams. The focus of this topic is on the development of commercializable BigData technology products and services that reduce bottlenecks and increase efficiency in the management and analysis of complex data for the science, engineering, and manufacturing sectors.

Potential grant applicants should focus on the development of innovative BigData management products in the form of turnkey subsystems, cloud-based services, and complete toolkits that can be packaged as standalone or value-added commercial products and services. Additional information on BigData products and services of interest to participating Office of Science programs is described below.

Office of Advanced Scientific Computing Research
a) Complex-Data Management Technologies
Office of Biological and Environmental Research
b) Advanced Data Analytic Technologies for Systems Biology and Bioenergy
c) Technologies for Managing and Analyzing Complex Data for Watersheds and Terrestrial Ecosystems
Office of Basic Energy Sciences
d) HDF5 Extensions for Data Streaming
e) Visualization, Data Management, and Workflow Tools for Experimental Data
Office of Nuclear Physics (See the Nuclear Physics section of this solicitation)

  • Software-Driven Network Architectures for Data Acquisition
  • Large Scale Data Storage
  • Data Science / Distributed Computing Applications

Successful grant applications will be required to satisfy the following two important criteria: a) a clear plan to develop innovative data analytics or data management techniques and b) the use of appropriate data sets that represent one or more attributes of BigData, namely, large data volume, variety, velocity, and veracity. Priority will be given to grant applications with complex data sources drawn from the domain sciences of the DOE programs participating in this solicitation. An application can cover crosscutting issues, but it must be submitted to a specific participating Office of Science program. Failure to do so will result in automatic declination of the application without review. Grant applications that focus exclusively on the following topics will be considered out of scope and will not be reviewed: a) data analytics algorithms that are not packaged as complete commercial products or services with the relevant BigData, b) improvements or extensions of data analytics and open-source software stacks that do not lead to commercializable products or services, and c) other restrictions as specified in the detailed description of each sub-topic.

Office of Advanced Scientific Computing Research
a.  BigData Management Technologies
This sub-topic focuses on complex data management technologies that go beyond traditional relational database management systems. Developing efficient and cost-effective technologies to collect, manage, and analyze distributed BigData is a challenge for many organizations, including the scientific community. Database management technologies based on traditional relational and hierarchical database systems are proving to be inadequate to deal with BigData complexities (volume, variety, veracity, and velocity), especially when applied to BigData systems in science and engineering. While the primary focus is on the development of tools and services to support complex scientific and engineering data, all sources of complex data are in-scope for this sub-topic. The focus of this sub-topic is on the development of cost- and time-effective commercial grade technologies in the following categories:

  1. BigData management software-enabling technologies – these include but are not limited to the development of software tools, algorithms, and turnkey solutions for complex data management such as NoSQL/graph databases to deal with unstructured data in new ways; visualization and data processing tools for unstructured multi-dimensional data; robust tools to test, validate, and remove defects in large unstructured data sets; tools to manage and analyze hybrid structured and unstructured data; BigData security and privacy solutions; BigData as a service system; high-speed hardware/software data encryption and reduction systems; and online management and analysis of streaming and text data from instruments or embedded systems.
  2. BigData Network-aware middleware technologies – this includes high-speed network and middleware technologies that enable the collection, archiving, and movement of massive amounts of data within datacenters, data cloud systems, and over Wide Area Networks (WANs). These may include but are not limited to hardware subsystems such as high-performance data servers and data transfer nodes; high-speed storage area network (SAN) technologies; network-optimized data cloud services such as virtual storage technologies; and other distributed BigData solutions. Grant applications must ensure the following: a) that proposed work is based on concrete BigData owned by the company or readily accessible, and b) that the proposed work goes beyond traditional data management system technologies.

Questions – Contact: Thomas Ndousse-Fetter, Thomas.ndousse-fetter@science.doe.gov

 

Office of Biological and Environmental Research (BER)
b.  Advanced Data Analytic Technologies for Systems Biology and Bioenergy

BER’s Biological Systems Science Division programs integrate multidisciplinary discovery and hypothesis-driven science with technology development to understand plant and microbial systems relevant to national priorities in sustainable energy and innovation in life sciences. These programs generate very large and complex data sets that have all of the characteristics of “big data”, often summarized as the four Vs: volume, velocity, veracity, and variety. Technology improvements in biological instruments, from sequencers to advanced imaging devices, are continuing to advance at exponential rates, with data volumes in petabytes today and expected to grow to exabytes in the future. These data are highly complex, ranging from high-throughput “omics” data and experimental and contextual environmental data across multiple scales of observation, from the molecular to the cellular to the multicellular scale (plants and microbial communities), to multiscale 3D and 4D images for conceptualizing and visualizing the spatiotemporal expression and function of biomolecules, intracellular structures, and the flux of materials across cellular compartments. Currently, the ability to generate complex multi-“omic” and associated meta-datasets greatly exceeds the ability to interpret these data.

Applications are invited to develop innovative data integration approaches and new software frameworks for management and analysis of large-scale, multimodal, and multiscale data that enhance the effectiveness and efficiency of data processing for investigations across spatial scales and scientific disciplines. This SBIR sub-topic seeks analytic solutions for biological BigData, including advanced data analytics, simulation, predictive modeling, multiscale algorithms, data visualization, visual data analytics and optimization, and data fusion to enable integrated analysis and comparison of data from multiple modalities. Of particular interest is the development of innovative, cost- and time-effective commercial technologies in the form of turnkey cloud services and value-added services to existing products. Areas of interest include (but are not limited to):

  • Improved computational tools for management of petabytes of scientific, observational and experimental data sets, with real-time integration of new results with historical findings.
  • Methods for managing complex analysis workflows and reproducible data analysis that support provenance and standardized data, storage, and interfaces.
  • Methods for data hosting, archiving, indexing and registration.
  • Methods for automated feature detection, dimensionality reduction, and interpretation.
  • Improved methods for data handling, data transport, data compression, data management, data processing, knowledge representation, machine learning, and mixed mechanistic and statistical simulation.
  • New computational methods that can extract features to map onto mathematical models for further analysis and simulation, visualization and exploration.

Questions – Contact: Ramana Madupu, Ramana.Madupu@Science.doe.gov

Office of Biological and Environmental Research (BER)
c.  Technologies for Managing and Analyzing Complex Data for Watersheds and Terrestrial Ecosystems

BER’s Environmental System Science (ESS) activity consists of the Subsurface Biogeochemical Research (SBR) and the Terrestrial Ecosystem Science (TES) programs. The SBR program has a “watershed science for energy” focus, which seeks to advance a robust predictive understanding of how watersheds within the contiguous United States function as complex hydro-biogeochemical systems and how these systems respond to perturbations such as changes in contaminant loading, land use, weather patterns, and snow melt. The TES program seeks to improve the representation of terrestrial ecosystem processes in Earth system models, focusing on ecosystems and processes that are globally important, climatically or environmentally sensitive, and comparatively understudied or underrepresented in Earth system models. Both SBR and TES investigators are encouraged to use a holistic systems approach to understand and capture in predictive models the coupled physical, chemical, and biological interactions that control the functioning of watershed systems and terrestrial ecosystems, and that extend from bedrock to the top of the vegetative canopy and across a vast range of spatial and temporal scales. Investigators are encouraged to use an iterative approach to understand the structure and functioning of complex environmental systems using a hierarchy of models to drive experimentation and observations across relevant spatial and temporal scales. A key challenge for the SBR and TES scientific communities is dealing with the extreme complexity and variety of data generated from these watershed and terrestrial ecosystem experiments and observations, and facilitating the use of these complex data sets to test and further advance predictive models of the structure and functioning of watershed systems and terrestrial ecosystems.

Watershed and terrestrial ecosystem simulation outputs are increasing in size and complexity as model fidelity and the temporal and spatial bandwidths of the simulations increase. Another important challenge is the development of tools, approaches, and workflows to facilitate the management and analysis of complex model simulation outputs and to facilitate data assimilation and uncertainty quantification.

This sub-topic focuses on the development of innovative technologies in the fields of complex data and advanced data analytics, including predictive modeling, multiscale algorithms, machine learning, data interpretation and reduction, intelligent systems, and novel visualization methods to enable the integrated analysis and comparison of environmental system science data from multiple modalities. Examples of relevant methods and tools include (but are not limited to):

  • The development of approaches and tools that enable flexible model-data integration workflows, including full provenance to support reproducibility.
  • Tools and techniques to automate the QA/QC process for watershed and terrestrial ecosystem data typical of ESS projects.
  • Tools and approaches that leverage the latest developments in open-source visualization tools (e.g., through SciDAC and Exascale projects) to demonstrate in situ data analysis and parallel visualization for watershed and terrestrial ecosystems typical of ESS applications.
  • Design and develop modular components of a data system that can create integrated views of the heterogeneous and disparate data typically found in watershed and terrestrial ecosystem applications, and prepare it for visualization, exploration, and simulation.
  • Design and develop approaches and tools to enable scientists to work effectively with data that is stored in federated distributed systems.
  • For high-fidelity simulations, develop tools and approaches to facilitate: in situ data analysis, storage and access of full or partial data sets, post-processing including compression, indexing, distribution, and the application of machine learning techniques within and across data sets.

Questions – Contact: David Lesmes, David.Lesmes@science.doe.gov or Jay Hnilo, Justin.Hnilo@science.doe.gov

 

Office of Basic Energy Sciences
d.  HDF5 Extensions for Data Streaming

HDF5 is becoming the de facto standard for storing science data at light source facilities for user data analysis. Besides being compatible with virtually any modern photon science data analysis tool, HDF5 provides an efficient means to organize images with awareness of exascale node layout, e.g., the high bandwidth memory (HBM) capacity, node topology, and burst buffer layout.

In order to effectively adopt HDF5 as a primary data format, light source facilities will need to be able to read files while writing to them, with the ability to consolidate the output of multiple writers into consistent virtual datasets. These features have been introduced recently in HDF5, but they impose unwanted constraints:

  1. The Single Writer/Multiple Readers feature, known as SWMR, doesn’t support variable length datatypes.
  2. The HDF5 virtual dataset does not work for irregular output patterns common with data acquisition. (First, one must specify the view ahead of time, and second, it must be a regular pattern. Dropped or missing shots are expected in data acquisition and contribute to the irregular pattern that needs a consistent view.)
  3. Currently the virtual dataset view must be specified when the file is created. To deal with the irregular patterns in the previous point, the description of the virtual dataset must become mutable while running.
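
The irregular-pattern problem in items 2 and 3 can be illustrated without any HDF5 calls. The following is a minimal pure-Python sketch, not HDF5 API code: the two-writer layout, the shot numbering, and the function names regular_view and consistent_view are all hypothetical, chosen only for illustration. It assumes each writer records the identifiers of the shots it actually saved. A fixed, regular interleaving (which is what a view declared ahead of time amounts to) falls out of step after a dropped shot, while an index built from the recorded shot identifiers does not.

```python
from typing import Dict, List, Optional, Tuple


def regular_view(writers: List[List[int]]) -> List[int]:
    """Interleave writer outputs round-robin, row by row.

    This mimics a view whose pattern is fixed in advance: row r of
    writer w is always assumed to sit at the same place in the view.
    """
    view: List[int] = []
    depth = max((len(shots) for shots in writers), default=0)
    for row in range(depth):
        for shots in writers:
            if row < len(shots):
                view.append(shots[row])
    return view


def consistent_view(
    writers: List[List[int]], first_shot: int, last_shot: int
) -> List[Optional[Tuple[int, int]]]:
    """Map every expected shot id to (writer, row), or None if dropped."""
    location: Dict[int, Tuple[int, int]] = {}
    for writer_id, shots in enumerate(writers):
        for row, shot in enumerate(shots):
            location[shot] = (writer_id, row)
    return [location.get(shot) for shot in range(first_shot, last_shot + 1)]


# Hypothetical acquisition: writer 0 saves even shots, writer 1 saves odd
# shots, and writer 1 dropped shot 3.
writers = [[0, 2, 4, 6], [1, 5, 7]]

naive = regular_view(writers)
mapped = consistent_view(writers, first_shot=0, last_shot=7)
```

In this example the regular view places shot 5 where shot 3 was expected, so every later position is misaligned, whereas the consistent view marks shot 3 as absent and keeps the remaining shots in order. Because such an index can only be built once the dropped shots are known, the description of the view has to be updatable while acquisition is running.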

Removing these constraints is critical for being able to write HDF5 files directly from the data acquisition system. For example, some light sources are currently saving data in a custom data format, which does provide these streaming capabilities, and later on they translate these files into HDF5 to provide easy access to various users and analysis tools. This translation step is a bottleneck for large data throughputs and is unsustainable. While these capabilities are critical for light sources, we believe they will benefit all experimental facilities which adopt HDF5.

Grant applications are sought to:

  • Add support for variable length datatypes in the HDF5 SWMR.
  • Add support for irregular output patterns in the HDF5 virtual dataset.
  • Add support for mutable description of the HDF5 virtual dataset.

Questions – Contact: Eliane Lessner, Eliane.Lessner@science.doe.gov

 

Office of Basic Energy Sciences
e.  Visualization, Data Management, and Workflow Tools for Experimental Data

When users collect data at synchrotron and neutron beamlines at BES Scientific User Facilities, they need immediate analysis feedback to ensure that a measurement is functioning properly; more detailed analysis is needed later for interpretation of the experimental results, again on a short timeframe. A wealth of software is employed for this, matching the wide range of experimental techniques; the algorithms used may themselves be research topics, so the software is frequently customized and updated. Improvements in x-ray and neutron detection are accelerating the rates at which these instruments produce data; combined with new sources, this will create instruments that produce data at rates of about a petabyte per day. Analysis of these data will require a migration to high performance computing resources.
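
To give a sense of scale for a petabyte per day, the following sketch converts that figure into a sustained rate. It assumes a decimal petabyte (10^15 bytes) and a perfectly even rate over 24 hours; both are simplifying assumptions rather than figures from the facilities, and the function name sustained_rate is illustrative only.

```python
PETABYTE = 10**15                 # assumption: decimal (SI) petabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds


def sustained_rate(bytes_per_day: float) -> dict:
    """Convert a daily data volume into average per-second rates."""
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    return {
        "GB/s": bytes_per_second / 1e9,        # gigabytes per second
        "Gbit/s": bytes_per_second * 8 / 1e9,  # gigabits per second
    }


rate = sustained_rate(PETABYTE)
print(f"{rate['GB/s']:.1f} GB/s, {rate['Gbit/s']:.1f} Gbit/s")
```

Under these assumptions, one petabyte per day corresponds to roughly 11.6 GB/s, or about 93 Gbit/s, sustained around the clock.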

Tools are sought that help computational scientists implement improved approaches to data analysis. These include (a) new approaches for graphical visualization of large and very high dimensionality data sets; (b) methodologies that allow existing software, typically written in Python, to be easily ported to run on petascale clusters; (c) data management and workflow tools that enable automation and data discovery and re-use; and (d) on-demand scheduling mechanisms that allow beamlines to obtain rapid access to HPC systems, with minimal impact on long-running tasks.

Questions – Contact: Eliane Lessner, Eliane.Lessner@science.doe.gov

References: Subtopic a:

  1. Hey, T., Tansley, S., Tolle, K., 2009, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, Redmond, Washington, p. 284. https://www.microsoft.com/enus/research/publication/fourth-paradigm-data-intensive-scientific-discovery/#
  2. Department of Energy, VACET, the Visualization and Analytics Center for Enabling Technologies (VACET), Homepage. http://www.vacet.org/about.html
  3. Department of Energy, SciDAC, 2007, Visualization & Data Management. http://www.scidac.gov/viz/viz.htm

References: Subtopic b:

  1. U.S. Department of Energy, Office of Science, 2017, Genomic Science Program, Systems Biology for Energy and Environment. http://genomicscience.energy.gov/index.shtml
  2. U.S. Department of Energy, Advanced Science Computing Research, Biological and Environmental Research, 2016, Biological and Environmental Research, Exascale Requirements Review, p. 366. http://blogs.anl.gov/exascaleage/wp-content/uploads/sites/67/2017/05/DOEExascaleReport_BER_R27.pdf
  3. U.S. Department of Energy, Office of Biological and Environmental Research, 2015, Biological Systems Science Division, Strategic Plan, p. 18. http://genomicscience.energy.gov/pubs/BSSDStrategicPlan.pdf

References: Subtopic c:

  1. U.S. Department of Energy, Office of Biological and Environmental Research, 2014, Building Virtual Ecosystems: Computational Challenges for Mechanistic Modeling of Terrestrial Ecosystems, Workshop Report, p. 48. http://science.energy.gov/~/media/ber/pdf/workshop%20reports/VirtualEcosystems.pdf
  2. U.S. Department of Energy, 2015, Building a Cyberinfrastructure for Environmental System Science: Modeling Frameworks, Data Management, and Scientific Workflows, Workshop Report, p. 44, DOE/SC-0178. https://science.energy.gov/~/media/ber/pdf/workshop%20reports/ESSWG_WorkshopReport.pdf
  3. U.S. Department of Energy, 2016, Towards a Shared ESS Cyberinfrastructure: Vision and First Steps, Report from the ESS Executive Committee Workshop on Data Infrastructure, p. 19. https://science.energy.gov/~/media/ber/pdf/workshop%20reports/Towards_a_Shared_ESS_Cyberinfrastructure.pdf
  4. U.S. Department of Energy, Office of Biological and Environmental Research, 2016, Working Group on Virtual Data Integration, Report, p. 64, DOE/SC-0180. https://science.energy.gov/~/media/ber/pdf/workshop%20reports/Virtual_Data_Integration_workshop_report.pdf
  5. U.S. Department of Energy, Advanced Science Computing Research, Biological and Environmental Research, 2016, Biological and Environmental Research, Exascale Requirements Review, p. 366. http://blogs.anl.gov/exascaleage/wp-content/uploads/sites/67/2017/05/DOEExascaleReport_BER_R27.pdf
  6. U.S. Department of Energy, Biological and Environmental Research (BER), Climate and Environmental Sciences Division (CESD), Subsurface Biogeochemical Research (SBR) Program. http://science.energy.gov/ber/research/cesd/subsurface-biogeochemical-research/.
  7. U.S. Department of Energy, Biological and Environmental Research (BER), Climate and Environmental Sciences Division (CESD), Terrestrial Ecosystem Science (TES) Program. http://science.energy.gov/ber/research/cesd/terrestrial-ecosystem-science/.
  8. AmeriFlux, U.S. Department of Energy, 2017, AmeriFlux Management Project, About AmeriFlux Management Project. http://ameriflux.lbl.gov/about/ameriflux-management-project/
  9. International Land Model Benchmarking Project (ILAMB), 2010, Welcome to ILAMB!. https://www.ilamb.org/

References: Subtopic d:

  1. The HDF Group, 2017, What is HDF5. https://support.hdfgroup.org/HDF5/whatishdf5.html
  2. Rees, N., Billich, H., Koziol, Q., et al., Developing HDF5 for the Synchrotron Community, Data Management, Analytics & Visualisation, Proceedings of ICALEPCS2015, Melbourne, Australia, WEPGF063. http://icalepcs.synchrotron.org.au/papers/wepgf063.pdf
  3. The HDF Group, 2017, Single-Writer/Multiple-Reader (SWMR) Documentation. https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html
  4. The HDF Group, 2017, Virtual Dataset (VDS) Documentation. https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.htm

References: Subtopic e:

  1. Early Science at the Upgraded Advanced Photon Source, https://www1.aps.anl.gov/files/download/Aps-Upgrade/Beamlines/APS-U%20Early-Science-103015-FINAL.pdf
  2. U.S. Department of Energy, Advanced Scientific Computing Research, Basic Energy Sciences, 2015, BES Exascale Requirements Review, p. 316. https://science.energy.gov/~/media/bes/pdf/reports/2017/BES-EXA_rpt.pdf
  3. Oak Ridge National Laboratory, Neutron Sciences, 2014, ORNL Neutron Sciences Strategic Plan, p. 74. http://neutrons.ornl.gov/sites/default/files/NScD-Strategic-Plan-2014.pdf
