Big Data Analytics On Cloud Using NEXUS

Abstract: 

Almost all existing Earth science data analysis solutions are built around large archives of files. When an analysis involves a large collection of files, performance suffers because of the amount of I/O required. Common data access solutions, such as OPeNDAP and THREDDS, provide web service interfaces to archives of observational data. They also perform poorly on large numbers of observations, because they are still built around the notion of files. In his famous 2005 paper on Scientific Data Management in the Coming Decade, the late Jim Gray stated, “The scientific file-formats of HDF, NetCDF, and FITS can represent tabular data but they provide minimal tools for searching and analyzing tabular data.” He went on to point out, “Performing this filter-then-analyze, data analysis on large datasets with conventional procedural tools runs slower and slower as data volume increases.”

NEXUS (https://github.com/dataplumber/nexus) is an emerging, open source, data-intensive analysis framework that takes a new approach to handling science data in order to enable large-scale data analysis. MapReduce is a well-known paradigm for processing large amounts of data in parallel on clusters or in Cloud environments. Unfortunately, this paradigm does not work well with temporal, geospatial, array-based data. One major issue is that such data are packaged in files of various sizes, ranging from tens of megabytes to several gigabytes each. Depending on the user input, a single analysis operation could involve hundreds to thousands of these files.

NEXUS takes a different approach to handling file-based temporal, geospatial observational artifacts by fully leveraging the elasticity of the Cloud computing environment. Rather than performing on-the-fly file I/O, NEXUS stores tiled data in Cloud-scaled databases with a high-performance spatial lookup service. NEXUS provides the bridge between science data and horizontally scaling data analysis. The platform simplifies the development of big data analysis solutions by bridging the gap between files and MapReduce solutions such as Spark. NEXUS has been integrated into the NASA Sea Level Change Portal (https://sealevel.nasa.gov) as the Big Data analytic backend for its Data Analysis Tool.
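
The tile-then-lookup idea described above can be illustrated with a minimal Python sketch. It is not the NEXUS implementation or API: the names Tile, make_tiles, find_tiles, and area_mean are hypothetical, a linear scan stands in for the spatial lookup service, and functools.reduce stands in for a distributed engine such as Spark.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Tile:
    """Hypothetical tile record: a bounding box plus its flattened values."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float
    values: tuple


def make_tiles(grid, lats, lons, tile_size):
    """Split a 2-D grid (list of rows) into square tiles with bounding boxes."""
    tiles = []
    for i in range(0, len(lats), tile_size):
        for j in range(0, len(lons), tile_size):
            rows = grid[i:i + tile_size]
            values = tuple(v for row in rows for v in row[j:j + tile_size])
            lat_block = lats[i:i + tile_size]
            lon_block = lons[j:j + tile_size]
            tiles.append(Tile(min(lat_block), max(lat_block),
                              min(lon_block), max(lon_block), values))
    return tiles


def find_tiles(tiles, min_lat, max_lat, min_lon, max_lon):
    """Return tiles whose bounding box intersects the query box.

    Assumption: a linear scan replaces a real spatial index here.
    """
    return [t for t in tiles
            if t.min_lat <= max_lat and t.max_lat >= min_lat
            and t.min_lon <= max_lon and t.max_lon >= min_lon]


def area_mean(tiles):
    """Map each tile to (sum, count), then reduce the partials to a mean.

    Whole tiles are averaged; values are not clipped to the query box.
    """
    partials = map(lambda t: (sum(t.values), len(t.values)), tiles)
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                          partials, (0.0, 0))
    return total / count if count else float("nan")


# Toy 4 x 4 grid standing in for one gridded observation file.
lats = [0.0, 1.0, 2.0, 3.0]
lons = [0.0, 1.0, 2.0, 3.0]
grid = [[float(i * 4 + j) for j in range(4)] for i in range(4)]

tiles = make_tiles(grid, lats, lons, tile_size=2)
selected = find_tiles(tiles, 0.0, 1.0, 0.0, 1.0)
print(len(tiles), len(selected), area_mean(selected))
```

Because each tile carries its own bounding box, a query touches only the tiles that intersect the requested region rather than opening whole files, and the per-tile partial results can be combined in any order, which is what makes the computation suitable for a map-and-reduce style engine.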

Creative Commons License: 
Creative Commons Attribution 3.0 License