
Platform for Large-scale Unsupervised and Supervised Learning


TECHNOLOGY AREA(S): Human Systems, Information Systems

ACQUISITION PROGRAM: Distributed Common Ground System-Navy (DCGS-N), Data Focused Naval Tactical

OBJECTIVE: Develop a platform which takes in large amounts of data from a variety of sources, analyzes it using sophisticated and fast algorithms and provides detailed interpretable probabilistic models as output.

DESCRIPTION: Recent advances in technology have led to the era of massive data sets which are not only larger, both in terms of sample size and dimensionality of the data, but also more complex. The data can be multi-modal, multi-relational and gathered from different sources. These massive data sets (“Big Data”) introduce unique computational and statistical challenges. Traditionally, the issues of statistical accuracy of an estimator and the computational cost of implementing it have been considered separately. This approach is suitable for small-scale data sets in which computation is not a limiting factor. However, large-scale data sets require an integrated approach to statistical and computational issues [1]. With big and messy data there is an increasing need for scalable software that will fit user-specified models that include multiple levels of variation and allow the combination of diverse data sources [2]. This software should facilitate easier customization of statistical models for big data and offer robust implementation for inference over all models. Among the challenges is finding ways to incorporate problem-specific knowledge into an analysis. This often entails customizing default methods to better suit the unique characteristics of the application at hand.
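As one concrete illustration of a model with multiple levels of variation, the sketch below shows simple partial pooling of group means under a normal-normal hierarchical model. The function name, the example groups, and the assumed variances are hypothetical and chosen only for illustration; they are not specified by this topic.

```python
import numpy as np

def partial_pool(group_means, group_sizes, sigma2, tau2):
    """Shrink per-group sample means toward the grand mean.

    sigma2: assumed within-group (observation) variance
    tau2:   assumed between-group variance
    Each pooled mean is a precision-weighted blend of the group mean
    and the grand mean, the standard normal-normal hierarchical result.
    """
    group_means = np.asarray(group_means, dtype=float)
    group_sizes = np.asarray(group_sizes, dtype=float)
    grand_mean = np.average(group_means, weights=group_sizes)
    # Weight on the group's own mean: its precision relative to the total.
    w = (group_sizes / sigma2) / (group_sizes / sigma2 + 1.0 / tau2)
    return w * group_means + (1.0 - w) * grand_mean

# A small group (2 observations) is shrunk toward the grand mean much
# harder than a large group (100 observations).
pooled = partial_pool([0.0, 10.0], [100, 2], sigma2=1.0, tau2=1.0)
```

The point of the sketch is the "multiple levels of variation" in the description: the within-group and between-group variances enter the estimate jointly, so sparsely observed groups borrow strength from the rest of the data.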

Recently, there have been some promising approaches that address these challenges. For example, the DimmWitted framework [3] examines the trade-off between statistical efficiency (roughly, the number of steps an algorithm takes to converge) and hardware efficiency (roughly, the efficiency of each of those steps). Similarly, scalable tensor-based approaches for learning latent variable models [4] provide novel analysis for tractable tensor decomposition for many classes of latent variable models, including Gaussian mixtures, latent Dirichlet allocation and hidden Markov models. Sparse coding has also led to a number of breakthroughs in automatic processing of large volumes of textual information, to the extent that billions of text documents can be processed to extract trending topics and story lines [5]. However, such success has not been matched for general media data.
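The core computational step behind the tensor-based methods of [4] can be sketched with a tensor power iteration that recovers one component of a symmetric, orthogonally decomposable third-order tensor. This is a minimal numpy illustration of the idea, not the robust procedure analyzed in the paper; the example tensor and all names are invented for the demonstration.

```python
import numpy as np

def tensor_power_iteration(T, n_iter=100, seed=0):
    """Recover one (eigenvalue, eigenvector) pair of a symmetric
    3-way tensor by repeated contraction and normalization."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        # Contract T along two modes: u_i = sum_{j,k} T[i,j,k] v_j v_k
        u = np.einsum('ijk,j,k->i', T, v, v)
        v = u / np.linalg.norm(u)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate
    return lam, v

# Rank-2 orthogonal symmetric tensor with components e1 (weight 3) and
# e2 (weight 1); the iteration converges to one of the two components.
e1, e2 = np.eye(3)[0], np.eye(3)[1]
T = 3.0 * np.einsum('i,j,k->ijk', e1, e1, e1) \
  + 1.0 * np.einsum('i,j,k->ijk', e2, e2, e2)
lam, v = tensor_power_iteration(T)
```

In the method-of-moments setting of [4], such a tensor is built from empirical second and third moments of the data, and the recovered components correspond to model parameters (e.g., mixture means or topic distributions).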

There is a clear need to develop a platform for automated and efficient analysis of big data and extraction of relevant information in real time. Such a platform should implement advanced mathematical algorithms that are backed by rigorous theoretical analysis and experiments.
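Real-time extraction over large data volumes generally implies single-pass (streaming) algorithms whose per-item cost is constant. As a minimal sketch of that regime, the class below maintains running mean and variance with Welford's update; it is an illustrative example only, not an algorithm prescribed by this topic.

```python
class RunningStats:
    """Single-pass mean and variance via Welford's update.

    Each observation is processed exactly once in O(1) time and memory,
    the regime required for real-time analysis of streaming data.
    """
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample (Bessel-corrected) variance; 0.0 until two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
```

The same one-pass pattern extends to the richer streaming-inference methods cited above, such as the online topic-model inference of [5].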

PHASE I: Determine the feasibility, advantages, and limitations of existing computational algorithms to be used, or develop novel algorithms, for the analysis of big data. Design metrics for evaluation of the platform in Phase II, including but not limited to issues related to data types (modalities), data volumes, processing time, computational efficiency, robustness of the algorithms, scalability, and adaptability. Select the data and the state-of-the-art algorithms that will be used in Phase II as a baseline for comparison, and specify a detailed testing and validation procedure.

PHASE II: Develop open source libraries for various platforms such as graphics processing units (GPUs), central processing units (CPUs), and the cloud. Develop scalable software that will fit user-specified models. Develop a prototype platform and demonstrate its operation on simulated and real-world data. Perform detailed testing and evaluation of the platform. Demonstrate the advantages of the platform in comparison to the state-of-the-art algorithms selected in Phase I.

PHASE III DUAL USE APPLICATIONS: Develop the final functional system to performance specifications. Finalize the design from Phase II, perform relevant testing, and transition the technology to appropriate Navy and commercial entities. Potential applications of this topic include defense, government and private security agencies, and law enforcement. This technology will primarily support analysis of large data sets such as satellite images, streaming audio and video signals, and text documents.


REFERENCES:

    • M. J. Wainwright. Structured regularizers for high-dimensional problems: Statistical and computational issues. Annual Review of Statistics and Its Application, 1:233–253, January 2014.

    • B. Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2015.

    • C. Zhang and C. Ré. DimmWitted: A study of main-memory statistical analytics. In Proceedings of the 40th International Conference on Very Large Data Bases, 2014.

    • A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.

    • A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, A. Smola, and E. P. Xing. Online inference for the infinite topic-cluster model: Storylines from streaming text. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

KEYWORDS: Big data; scalable algorithms; data processing; information integration; computing; inference; learning; automated analysis

  • TPOC-1: Predrag Neskovic
  • Email:
  • TPOC-2: Behzad Kamgarparsi
  • Email:

Questions may also be submitted through DoD SBIR/STTR SITIS website.
