A Framework for Comparing Data Containers

Abstract/Agenda: 

Data containers are infrastructures that facilitate storage, retrieval, and analysis of data sets. Big data applications in Earth Science require a mix of processing techniques, data sources and storage formats that are supported by different data containers. Some of the most popular data containers used in Earth Science studies are Hadoop, Spark, SciDB, AsterixDB, RasDaMan, and HDF. The goal is to develop an evaluation plan for these infrastructures to assess their suitability for Earth Science data processing needs. We have identified a selection of test cases that are relevant to most data processing exercises in Earth Science applications and we aim to evaluate these systems for optimal performance against each of these test cases. The use cases identified as part of this study are (i) data fetching, (ii) data preparation for multivariate analysis, (iii) data normalization, (iv) distance (kernel) computation, and (v) optimization.
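
Two of the test cases above, data normalization and distance (kernel) computation, can be illustrated with a minimal sketch. It assumes z-score normalization and a Gaussian (RBF) kernel; the abstract does not say which normalization or kernel the study uses, and the sample values and the `gamma` parameter are hypothetical.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Normalize each column (variable) to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

def rbf_kernel(rows, gamma=0.5):
    """Pairwise Gaussian kernel: exp(-gamma * squared Euclidean distance)."""
    n = len(rows)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            d2 = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            k[i][j] = k[j][i] = math.exp(-gamma * d2)
    return k

# Hypothetical samples: (temperature in K, precipitation in mm/day)
samples = [[280.0, 1.0], [285.0, 3.0], [290.0, 2.0]]
normalized = zscore_columns(samples)
kernel = rbf_kernel(normalized)
```

Normalizing first keeps the variable with the largest numeric range (here temperature) from dominating the distances that feed the kernel.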

Technologies to be discussed:

  • Rasdaman - P. Yang and Q. Huang 
  • SciSpark and AsterixDB - C. Mattmann
  • HDF - A. Jelenak and T. Habermann
Notes: 

 

1. AsterixDB

  • BDMS, NoSQL, JSON++, parallel query, partitioned LSM-based data, HDFS, indexing (B+-tree, R-tree)

  • Preprocessing: dynamic subsetting; uncompressing netCDF4 to ASCII JSON makes the data about 9 times larger

  • Data loading: chunking large files; data feed adapter

  • Current activities:

    • integrate to NEXUS

    • AQL query
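
The size growth noted under preprocessing comes from replacing compressed binary values with ASCII text. The following is a rough sketch of that effect using only the Python standard library. It assumes `zlib` as a stand-in for netCDF4 deflate compression and a hypothetical `{"value": ...}` record layout. The ratio it prints depends on the data and on the JSON layout, so it is not expected to reproduce the factor of 9 reported in the session.

```python
import json
import random
import struct
import zlib

random.seed(0)
# Hypothetical 1-D field: 10,000 temperatures in K, rounded to 0.01 K.
values = [round(250.0 + 50.0 * random.random(), 2) for _ in range(10_000)]

binary = struct.pack(f"<{len(values)}f", *values)  # float32, 4 bytes per value
compressed = zlib.compress(binary, 6)              # stand-in for netCDF4 deflate
as_json = json.dumps([{"value": v} for v in values]).encode("ascii")

print(len(binary), len(compressed), len(as_json))
print(f"JSON / compressed binary: {len(as_json) / len(compressed):.1f}x")
```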

 

2. Rasdaman

  • AIST: capture how to manage big data, validate innovation of technologies, ...

  • Big data management solution: 4Vs

  • Test datasets: MODIS (1 TB), Dust Storm Datasets (simulation, netCDF, 30 GB/day)

  • Query types:

  • Workflow: Rasdaman does not support high-dimensional array-based data, so the data are converted

  • Test Results:

    • Query time increases sharply with query size

    • Comparison of Rasdaman, Hive, and Spark

    • Multi-threading shows high scalability

    • Spark has scalability issues

  • Conclusions: Rasdaman supports netCDF better than HDF; Rasdaman cluster configuration is complex; how to build further indexes is still open (need to get in touch with the Rasdaman developers)

 

3. Data Container Study: HDF5 in a POSIX File System

  • Hardware: Open Science Data Cloud Griffin cluster, S3

  • Software: HDF5, MAFISC/GZIP/BLOSC, Ubuntu, HDF5-tools, python3

  • Workflow: download HDF5 to S3, repack

  • HDF5 Chunking:

    • storage layouts for HDF5 datasets

    • An HDF5 dataset is broken up into chunks, which are stored at various locations in the file

    • Chunks are of equal size in data space, but not in byte size

      • two chunking algorithms

        • Unidata’s optimal chunking formula

        • h5py's automatic chunking

      • Different chunk shapes: synoptic map, data rod, data cube (which is the best one, and what size is best?)
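
The trade-off behind the three chunk shapes can be sketched by counting how many chunks a read has to touch. This is only an illustration. The dataset dimensions, chunk shapes, and the helper `chunks_touched` are hypothetical, and reads are assumed to start on a chunk boundary.

```python
import math

def chunks_touched(selection, chunk_shape):
    """Count chunks intersecting a hyperslab that starts at the origin.

    selection and chunk_shape are (time, lat, lon) extents in elements.
    """
    return math.prod(math.ceil(s / c) for s, c in zip(selection, chunk_shape))

# Hypothetical dataset: 365 daily grids of 180 x 360 cells.
layouts = {
    "synoptic map": (1, 180, 360),  # one full grid per chunk
    "data rod": (365, 1, 1),        # one full time series per chunk
    "data cube": (73, 36, 72),      # compromise block
}

reads = {
    "one map": (1, 180, 360),
    "one time series": (365, 1, 1),
}

for name, chunk in layouts.items():
    counts = {r: chunks_touched(sel, chunk) for r, sel in reads.items()}
    print(f"{name:13s} {counts}")
```

With these shapes, the map layout needs 1 chunk for a map but 365 for a time series, the rod layout needs 64,800 chunks for a map but 1 for a time series, and the cube needs 25 and 5. This is why the best shape depends on the expected access pattern.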

4. Can Data Be Organized for Science and Reuse?

  • Data Organizations:

    • ESDIS & ESGF

    • Data Rods: WFS, SOS

    • 3D Chunking/Compression: WCS, WFS, SOS

    • Object store: metadata -> values; HDF4Maps - Arrays

    • Metadata levels & Interoperability: Collection (ECHO, DIF, Data.Gov, …, ISO 19115) & Granules (HDF-EOS, NUG/CF)

    • Evolving backends: local files -> chunked -> cloud (API stable for data moving)

  • Compression: size reduction vs. time cost

  • Data indexing: range information, coordinates stored in PyTables

  • Results: run time decreases super-linearly as the number of nodes increases.

  • Conclusions: a POSIX file system poses unique challenges; chunk size influences run time; compression: Blosc < GZIP9 < MAFISC
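
The size-versus-time trade-off of compression can be measured with a short sketch. It uses the standard-library `zlib` module, which implements the same DEFLATE algorithm as gzip, at several levels. Blosc and MAFISC are not in the standard library and are not shown. The payload is a hypothetical smooth signal and is not the data used in the study.

```python
import math
import random
import struct
import time
import zlib

rng = random.Random(0)
n = 200_000
# Hypothetical chunk: a smooth signal plus small noise, stored as 16-bit integers.
values = [int(1000 * math.sin(i / 50)) + rng.randint(-2, 2) for i in range(n)]
payload = struct.pack(f"<{n}h", *values)

def measure(level):
    """Return (compressed size in bytes, seconds spent) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    return len(compressed), time.perf_counter() - start

results = {level: measure(level) for level in (1, 6, 9)}
for level, (size, seconds) in results.items():
    print(f"level {level}: {size / len(payload):.2%} of original, {seconds * 1e3:.1f} ms")
```

Timings vary by machine, so the sketch prints both numbers rather than asserting a ranking.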

 

Questions:

  • Go to the cloud and compare them on the same dataset?

    • Good comment. How to share one environment among the different software under test is a problem that needs solving. Tests should be use-case driven, repeatable, and understandable, with pros and cons for the different cases.

  • Find the boundaries of the different containers

  • I/O is a problem for models. What about writing data back?

    • FLASH I/O performance benchmarking for netCDF-3, netCDF-4, and HDF5 (parallel I/O) has already been done by many groups?

  • Is data lost for high-dimensional arrays in Rasdaman?

    • For x, y, z, t or many bands, some dimensions may be lost. Rasdaman and HDF5: metadata for chunks

  • Is OPeNDAP possible?

    • An object store is not suitable for OPeNDAP, but it is an option in other situations. An open question is how to maintain the metadata when the data are split into pieces.

  • Does moving data lead to I/O issues?

    • Data containers are trying to solve such issues: data loading, network, I/O speed, missing metadata.

  • The results and optimizations may be useful for other studies

 

Citation:
Habermann, T.; A Framework for Comparing Data Containers; Winter Meeting 2016. ESIP Commons, November 2015