
Indexing large scientific data


OBJECTIVE: Develop new indexing schemes for large heterogeneous data that operate within a cloud-computing framework to enable rapid search and analytics.

DESCRIPTION: Data continue to be generated and digitally archived at increasing rates, resulting in large volumes available for search and analysis. Access to these volumes has generated new insights through data-driven methods in the commerce, science, and computing sectors. Processing data at the requisite scale now requires specialized databases or clusters of computers, which in turn requires distributed computing paradigms for data ingestion, transformation, and loading. It is therefore critical to develop fast, scalable, and efficient indexing schemes that not only support data ingestion and transformation but also enable fast search and analytics.

Bulk data processing models like MapReduce enable users to leverage the power of thousands of commodity machines with little programming effort within easy-to-use software stacks [1]. Hadoop, the open-source implementation of MapReduce, has been used primarily to index large collections of text documents for search by exact-match string comparison [2-4]. However, little progress has been made in indexing heterogeneous scientific data: semi-structured documents with metadata and free text, schema-less structured files, spatial measurements from sensors, categorical data with possibly missing values, noisy measurements, video, speech, graphical/networked information, and other data types produced by scientific instruments.

In this solicitation, we seek new indexing schemes for large heterogeneous scientific data that operate within a cloud-computing framework. As most existing implementations of MapReduce do not provide underlying data indexing, new indexing schemes are sought to improve performance for jobs that join data across distinct inputs as well as for jobs requiring more descriptive classes of search criteria.
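For orientation, the exact-match text indexing that the description attributes to MapReduce can be sketched as follows. This is a single-process Python simulation of the map, shuffle, and reduce steps, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`) and the sample documents are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Turn the grouped document ids for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))

def build_inverted_index(docs):
    """Run map, shuffle, and reduce in sequence over an in-memory corpus."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical sample corpus (assumption, not from the solicitation).
docs = {
    "d1": "sensor temperature reading",
    "d2": "sensor pressure reading",
    "d3": "video frame",
}
index = build_inverted_index(docs)
```

An index of this kind answers only exact term lookups (for example, `index["sensor"]`), which is the limitation that motivates the richer search criteria sought here.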
Schemes are also sought that support iterative algorithms and successive search refinement, which arise in applications such as mining, ranking, traversal, and parameter estimation.

One technical challenge in building indices is addressing uncertainty in the data, which can bias the resulting analysis and lead to erroneous conclusions. For example, it is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty arises from measurement and sampling errors as well as from resource limitations. In categorical data, the correct value of an attribute is often unknown but may be selected from a number of alternatives. Current research and technology do not incorporate a rigorous method for representing, propagating, or manipulating this type of uncertainty. We seek index structures for efficiently searching uncertain categorical data, as well as index structures that intentionally approximate values for speed and efficient implementation, along with corresponding performance guarantees for the probabilistic queries they enable.

Another challenge is indexing scientific data using both foreground and background information. The effectiveness of text indices for reliable web search can partly be attributed to the inverted index, where both term frequency and inverse document frequency contribute to the match between document and query. There are no well-developed analogues to this combination for scientific data, and we therefore seek novel approaches for indexing scientific data that use similar foreground and background information to produce a relevant match.

Finally, efficient indexing methods for streaming data are lacking. Existing cloud-computing methods focus primarily on storage and query techniques for static data sets. We also seek indexing schemes designed to operate on data that arrive as continuous, rapid streams.
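As a point of reference for the foreground/background discussion above, the text-search combination of term frequency (foreground) and inverse document frequency (background) can be sketched in Python. This is a minimal illustration using one common weighting, tf * log(N / df); the function names `build_index` and `score` and the sample documents are assumptions made for this sketch, and production search engines typically use more refined variants.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Return (postings, doc_count); postings maps term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term][doc_id] = tf
    return postings, len(docs)

def score(query, postings, doc_count):
    """Rank documents by the sum over query terms of tf * log(N / df)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue  # a term absent from the corpus contributes nothing
        # Background information: terms found in fewer documents weigh more.
        idf = math.log(doc_count / len(docs_with_term))
        for doc_id, tf in docs_with_term.items():
            # Foreground information: occurrences within this document.
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample corpus (assumption, not from the solicitation).
docs = {
    "d1": "sensor sensor drift",
    "d2": "sensor noise",
    "d3": "video frame",
}
postings, doc_count = build_index(docs)
ranked = score("sensor drift", postings, doc_count)
```

Here `d1` ranks above `d2` because it contains the rarer term "drift" and repeats "sensor". An analogue for scientific data would need comparable per-record and corpus-wide statistics for non-text features, which is the gap identified above.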
The effort should also develop data annotations that provide an effective means of linking data from diverse archives to a domain conceptualization, e.g., a formal vocabulary or grammar, which then gives users an integrated view for querying the data.

PHASE I:
Task 1: Develop an approach for indexing scientific data with foreground and background information.
Task 2: Develop an approach for indexing data with uncertainty.
Task 3: Develop indexing methods for streaming data.
Task 4: Extend the methods to indexing heterogeneous data sets.
Task 5: Implement a minimal proof-of-concept system with sample scientific datasets.
Phase I deliverables should include a Final Phase I Report that includes: (1) a detailed description of the approach (or algorithms) and the benefits of the selected approach over alternatives; (2) an implementation architecture that integrates Tasks 1-4; (3) a demonstration of the approach using the proof-of-concept system on a small cloud.

PHASE II: Develop a scalable implementation of the methods. Validate and demonstrate it on a heterogeneous dataset in a significant cloud-computing environment. The required deliverables for Phase II will include: the full prototype system; demonstration and testing of the prototype system with users; quantification of performance metrics, including number of simultaneous queries per server, number of records indexed, latency, etc.; and a Final Report. The Final Report will include (1) a detailed design of the system, documentation, and technical and user manuals, and (2) a plan for Phase III.

PHASE III DUAL USE APPLICATIONS: The ability to index large volumes of scientifically collected data efficiently and effectively would impact many DARPA efforts to build and deploy instruments such as sensors. It would also enable new classes of problem solving in the information processing domain relevant to several ongoing efforts at DARPA.
The Department of Defense has many applications in which scientifically collected information cannot be stored and used in later stages of information processing and decision making because of its size and inherent format. Unlike text documents and reports, for which indexing and processing are standard, scientific data such as sensor measurements have not been effectively incorporated into the process.

REFERENCES:
1) Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 2008.
2) Olson, M. HADOOP: Scalable, Flexible Data Storage and Analysis. IQT Quarterly, pp. 14-18, Spring 2010.
3) Lin, J. and Dyer, C. Data-Intensive Text Processing with MapReduce. Morgan and Claypool, 2010.
4) Elsayed, T., Ture, F., and Lin, J. Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? Technical Report HCIL-2010-23, University of Maryland, College Park, October 2010.