Data compression and dimensionality reduction for exascale and other highly parallel systems
Phone: (301) 294-5271
Email: svoronin@i-a-i.com
Phone: (301) 294-5221
Email: mjames@i-a-i.com
Extreme-scale computing hardware is expected to become available in the near future, with the 'El Capitan' system slated for delivery in 2022, and there is particular interest in the storage, transmission, and manipulation of data stores for big data applications. Suitable compression and dimensionality reduction algorithms can be formulated for exascale-capable and other highly parallel systems with high scalability and reasonable communication costs. For instance, simulation models such as the Energy Exascale Earth System Model (E3SM) consume valuable computing time and produce large quantities of data that researchers subsequently analyze, yielding textual, image, and video content. Such information from many sets of runs, likely residing in distributed memory, can be effectively compressed by a heterogeneous parallel compression approach that groups similar content together and applies a targeted algorithm to each cluster to achieve favorable compression ratios. Likewise, experiments at the national collider and accelerator facilities can produce data at rates of several GB per second; it is important to analyze these data for patterns so that only anomalous portions are stored and examined by researchers. Exascale-capable dimensionality reduction algorithms such as large-scale PCA are well suited to this purpose.

We propose to develop a parallel lossless compression prototype suitable for the target computing architecture of multiple distributed-memory nodes, each with many computing cores and specialty hardware such as GPUs. By permuting the data into mega-blocks with similar content distributions, it achieves better compression performance than is possible with uniform data subdivision (a simplified sketch of this cluster-then-compress idea appears below). In addition, we propose to develop a parallel rank-adaptive matrix factorization prototype from which many different forms can be derived, suitable both for lossy compression and for dimensionality reduction methods such as large-scale PCA, which can be used to compare generated data bundles in search of anomalies.

Building on the algorithmic formulations in the proposal, we will produce high-performance implementations of these methods for the envisioned parallel architecture and for very large data sizes, using distributed-memory, shared-memory, and GPU programming. We will implement a parallel lossless compression engine that takes as input a large heterogeneous data store, possibly spread across multiple nodes, and turns it into a compressed representation suitable for storage or transmission using a combination of parallelized, content-targeted algorithms; corresponding decompression and unpacking methods will also be developed. In addition, we will implement a rank-adaptive QB matrix decomposition designed for highly parallel computing architectures. This decomposition yields two smaller factors Q and B for a large input matrix A and will be used to construct PCA outputs and other representations such as interpolative decompositions. We will apply the developed methods to the compression of simulation data (e.g., from E3SM) and to rapid anomaly detection in bundles of large experimental data from collider-like sources. A working prototype will be developed for lossless and lossy compression of heterogeneous data bundles, for rank-adaptive matrix factorizations, and for applications to PCA for dimensionality reduction.
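To make the cluster-then-compress idea concrete, the following is a minimal single-node sketch: fixed-size byte blocks are grouped into "mega-blocks" by a simple similarity statistic (byte entropy here), and each mega-block is compressed with a codec targeted at its content class. The block size, entropy thresholds, and stock zlib/lzma codecs are illustrative assumptions standing in for the proposal's actual content-targeted algorithms, and a real system would distribute this work across nodes.

```python
# Minimal single-node sketch of cluster-then-compress with mega-blocks.
# Assumptions (not from the proposal): 64 KiB blocks, byte entropy as the
# similarity statistic, zlib/lzma as stand-ins for targeted codecs.
import math
import zlib
import lzma

BLOCK = 1 << 16  # 64 KiB blocks (illustrative choice)

def byte_entropy(block: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte."""
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def compress_store(data: bytes):
    """Permute blocks into entropy-based mega-blocks and compress each
    mega-block with a codec targeted at its content class."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    clusters = {"low": [], "mid": [], "high": []}
    for idx, blk in enumerate(blocks):
        h = byte_entropy(blk)
        key = "low" if h < 1.0 else "mid" if h < 6.0 else "high"
        clusters[key].append(idx)
    codecs = {"low": lambda b: zlib.compress(b, 9),   # near-constant blocks
              "mid": lzma.compress,                   # structured text/numeric
              "high": lambda b: b}                    # already high entropy: store raw
    packed = {}
    for key, idxs in clusters.items():
        mega = b"".join(blocks[i] for i in idxs)      # permuted mega-block
        sizes = [len(blocks[i]) for i in idxs]
        packed[key] = (idxs, sizes, codecs[key](mega))
    return packed

def decompress_store(packed) -> bytes:
    """Invert compress_store: decode each mega-block and undo the permutation."""
    decode = {"low": zlib.decompress, "mid": lzma.decompress,
              "high": lambda b: b}
    restored = {}
    for key, (idxs, sizes, payload) in packed.items():
        mega, off = decode[key](payload), 0
        for i, sz in zip(idxs, sizes):
            restored[i] = mega[off:off + sz]
            off += sz
    return b"".join(restored[i] for i in sorted(restored))
```

Recording the block-index permutation alongside each mega-block is what lets decompression restore the original order; grouping similar blocks before encoding, rather than subdividing the data uniformly, is the mechanism by which the proposed engine aims for better compression ratios.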
The algorithms will be designed, analyzed, and implemented on the target architecture using a combination of distributed message passing, multi-core shared memory, and GPU programming. The results will be applied to demonstrate compression gains on large-scale multi-format data (such as that from simulation runs), to the compression of data from multi-channel recording systems and high-dimensional video data, and to anomaly detection in high-dimensional data bundles (a simplified sketch of the rank-adaptive factorization underlying this capability appears below). There is substantial need for more efficient storage, transmission, visualization, and automated analysis of large-scale data stores from big data applications, and suitable implementations for exascale systems would have many possible commercial avenues. Examples include the storage and transmission of large collections of use-case records consisting of numerical, text, and other binary data; multi-channel acoustic recording systems; anomaly detection in long sequences of data bundles; and approximation and pattern discovery in statistical and natural-science applications.
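As an illustration of the rank-adaptive QB factorization and its use for PCA, the sketch below grows the factorization block by block via randomized sampling of the residual's range until a Frobenius-norm tolerance is met, then derives approximate principal components from the small factor B. This is a serial NumPy approximation under assumed parameters (block size, tolerance, random seed); it is not the proposal's distributed, GPU-enabled implementation.

```python
# Minimal NumPy sketch of a rank-adaptive randomized QB factorization
# (A ~= Q @ B) and approximate PCA built from it. Block size, tolerance,
# and seed are illustrative assumptions; the matrix is assumed to fit in
# memory on a single node.
import numpy as np

def rand_qb(A: np.ndarray, tol: float = 1e-6, block: int = 32, max_rank=None):
    """Build A ~= Q @ B block by block, growing the rank until the
    Frobenius norm of the residual drops below tol * ||A||_F."""
    m, n = A.shape
    max_rank = max_rank or min(m, n)
    rng = np.random.default_rng(0)
    Q = np.zeros((m, 0))
    B = np.zeros((0, n))
    R = A.copy()                                     # residual A - Q @ B
    target = tol * np.linalg.norm(A)
    while np.linalg.norm(R) > target and Q.shape[1] < max_rank:
        Omega = rng.standard_normal((n, block))      # random test matrix
        Qi, _ = np.linalg.qr(R @ Omega)              # sample the residual's range
        Qi, _ = np.linalg.qr(Qi - Q @ (Q.T @ Qi))    # re-orthogonalize against Q
        Bi = Qi.T @ R
        R -= Qi @ Bi                                 # deflate the residual
        Q = np.hstack([Q, Qi])
        B = np.vstack([B, Bi])
    return Q, B

def pca_from_qb(X: np.ndarray, n_components: int):
    """Approximate PCA: center X, factor it as Q @ B, then take the SVD of
    the small factor B to recover leading directions and scores."""
    Xc = X - X.mean(axis=0)
    Q, B = rand_qb(Xc, tol=1e-3, block=16)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = min(n_components, len(s))
    components = Vt[:k]                              # principal directions
    scores = (Q @ Ub[:, :k]) * s[:k]                 # projected data (PCA scores)
    return components, scores, s[:k]
```

Because the dense SVD is applied only to the small factor B, the dominant costs are the large matrix products with A and the residual updates, which are the natural candidates for the distributed-memory and GPU parallelization described above; the same Q and B factors can also feed other representations, such as interpolative decompositions.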
* Information listed above is at the time of submission. *