A Framework for Comparing Data Containers
Data containers are infrastructures that facilitate storage, retrieval, and analysis of data sets. Big data applications in Earth Science require a mix of processing techniques, data sources and storage formats that are supported by different data containers. Some of the most popular data containers used in Earth Science studies are Hadoop, Spark, SciDB, AsterixDB, RasDaMan, and HDF. The goal is to develop an evaluation plan for these infrastructures to assess their suitability for Earth Science data processing needs. We have identified a selection of test cases that are relevant to most data processing exercises in Earth Science applications and we aim to evaluate these systems for optimal performance against each of these test cases. The use cases identified as part of this study are (i) data fetching, (ii) data preparation for multivariate analysis, (iii) data normalization, (iv) distance (kernel) computation, and (v) optimization.
Technologies to be discussed:
- Rasdaman - P. Yang and Q. Huang
- SciSpark and AsterixDB - C. Mattmann
- HDF - A. Jelenak and T. Habermann
1. SciSpark and AsterixDB
BDMS, NoSQL, JSON++, Parallel query, Partitioned LSM-based data, HDFS, indexing (B+, R)
Preprocessing: dynamic subsetting; decompressing netCDF4 to ASCII JSON makes the data about 9 times larger
Data loading: chunking the data for large files; data feed adapter
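The size growth noted under preprocessing can be illustrated with a small sketch. It is not the workflow used in the talk: the synthetic temperature values, the field name `temperature`, and the use of `zlib` as a stand-in for netCDF4's deflate compression are all assumptions made for illustration, and the ratio it prints will differ from the 9x figure reported for the real data.

```python
import json
import random
import struct
import zlib


def compare_sizes(n_values=10_000, seed=0):
    """Return byte counts for the same values in three encodings."""
    rng = random.Random(seed)
    # Synthetic values (assumption): temperatures in kelvin, two decimals.
    values = [round(rng.uniform(250.0, 320.0), 2) for _ in range(n_values)]

    raw = struct.pack(f"<{n_values}f", *values)   # 4-byte floats
    compressed = zlib.compress(raw, 6)            # stand-in for netCDF4 deflate
    text = json.dumps({"temperature": values}).encode("ascii")

    return {
        "binary": len(raw),
        "binary_deflate": len(compressed),
        "json_ascii": len(text),
    }


sizes = compare_sizes()
for name, n_bytes in sizes.items():
    print(f"{name:>15}: {n_bytes} bytes")
ratio = sizes["json_ascii"] / sizes["binary_deflate"]
print(f"JSON / compressed binary: {ratio:.1f}x")
```

Real geophysical fields are usually smoother than random values and often stored as packed integers, so the compressed binary tends to be smaller and the inflation larger than in this sketch.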
Current activities:
integrate to NEXUS
AQL query
2. Rasdaman
AIST: capture how to manage big data, validate innovation of technologies, ...
Big data management solution: 4Vs
Test datasets: MODIS (1 TB), Dust Storm Datasets (simulation, netCDF, 30 GB/day)
Query types:
Workflow: Rasdaman does not support high-dimensional array-based data, so the data are converted
Test Results:
Query time increases sharply with query size;
Compared Rasdaman, Hive, and Spark;
multi-threading with high scalability;
Spark scalability issues
Conclusions: Rasdaman supports NetCDF better than HDF; Rasdaman cluster configuration is complex, and how to build further indexes is an open question (need to get in touch with the Rasdaman developers)
3. Data Container Study: HDF5 in a POSIX File System
Hardware: Open Science Data Cloud Griffin cluster, S3
Software: HDF5, MAFISC/GZIP/BLOSC, Ubuntu, HDF5-tools, python3
Workflow: download HDF5 to S3, repack
HDF5 Chunking:
storage layouts for HDF5 datasets
An HDF5 dataset is broken up into chunks, which are stored at various locations in the file
Chunks are of equal size in data space, but not in byte size
two chunking algorithms
Unidata’s optimal chunking formula
different chunk sizes: synoptic map; data rod; data cube; (which is the best one? What size is the best?)
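The question of which chunk shape is best depends on the access pattern. The sketch below counts how many chunks a read has to touch under each layout. The dataset dimensions (time, lat, lon), the chunk shapes, and the helper `chunks_touched` are hypothetical and not taken from the study; the count assumes the selection starts on a chunk boundary.

```python
import math


def chunks_touched(selection, chunk_shape):
    """Chunks intersected by a hyperslab that starts on a chunk boundary."""
    return math.prod(math.ceil(s / c) for s, c in zip(selection, chunk_shape))


# Hypothetical dataset with dimensions (time, lat, lon) = (1024, 512, 1024).
layouts = {
    "synoptic map": (1, 512, 1024),   # one full map per chunk
    "data rod": (1024, 1, 1),         # one full time series per chunk
    "data cube": (16, 64, 64),        # compromise between the two
}
requests = {
    "one map": (1, 512, 1024),        # all points, single time step
    "one time series": (1024, 1, 1),  # single point, all time steps
}

counts = {
    (req_name, layout_name): chunks_touched(selection, chunk_shape)
    for req_name, selection in requests.items()
    for layout_name, chunk_shape in layouts.items()
}
for (req_name, layout_name), n in counts.items():
    print(f"{req_name:>16} with {layout_name:<12} layout: {n} chunks")
```

Each layout is ideal for one request and poor for the other, while the cube sits in between for both, which is one reason the chunk size and shape need to be tested against the intended use cases.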
4. Can Data Be Organized for Science and Reuse?
Data Organizations:
Data Rods: WFS, SOS
3D Chunking/Compression: WCS, WFS, SOS
Object store: metadata -> values; HDF4Maps - Arrays
Metadata levels & Interoperability: Collection (ECHO, DIF, Data.Gov, …, ISO 19115) & Granules (HDF-EOS, NUG/CF)
Evolving backends: local files -> chunked -> cloud (the API stays stable as the data move)
Compression: size reduction vs. time cost
Data indexing: range information, coordinates stored in PyTables
Results: runtime decreases super-linearly as the number of nodes increases.
Conclusions: the POSIX file system poses unique challenges; chunk size influences runtime; compression: Blosc < GZIP9 < MAFISC
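The trade-off between size reduction and compression time, and the earlier point that chunks of equal shape differ in byte size once compressed, can both be seen in a small sketch. It uses Python's standard `zlib` at levels 1 and 9 as a stand-in for the GZIP filter; Blosc and MAFISC are not covered here. The two chunks (a fill-value chunk and a sine-wave "signal" chunk) are synthetic assumptions.

```python
import math
import struct
import time
import zlib

N = 50_000  # elements per chunk: identical shape in data space


def pack_int16(values):
    """Pack a list of small integers as little-endian 16-bit values."""
    return struct.pack(f"<{len(values)}h", *values)


# Two hypothetical chunks with the same shape but different content.
fill_chunk = pack_int16([-999] * N)
signal_chunk = pack_int16(
    [int(1000 + 500 * math.sin(i / 50)) for i in range(N)]
)


def measure(chunk, level):
    """Return (compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    packed = zlib.compress(chunk, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(packed) == chunk  # lossless round trip
    return len(packed), elapsed


for name, chunk in (("fill", fill_chunk), ("signal", signal_chunk)):
    for level in (1, 9):
        size, seconds = measure(chunk, level)
        print(f"{name:>6} level {level}: {len(chunk)} -> {size} bytes "
              f"in {seconds * 1000:.2f} ms")
```

Timings vary by machine, so the sketch prints them rather than asserting an ordering; a fair comparison of filters would repeat the measurement on the actual test dataset.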
Go to the cloud and compare them on the same dataset?
Good comment. How to share the environment among the different test software is a problem that needs to be solved. Use cases for testing should be repeatable and understandable, with pros and cons for different cases.
Find the boundaries of the different containers.
I/O is a problem for models. What about writing data back?
FLASH I/O performance benchmarking for netCDF-3/4 and HDF5 (parallel I/O) has already been done by many groups?
Are data missing for high dimensions in Rasdaman?
For x, y, z, t or many bands, some dimensions may be lost. Rasdaman and HDF5: metadata for chunks.
Is OPeNDAP possible?
An object store is not suitable for OPeNDAP, but it is an option for other situations. The question is how to maintain the metadata when the data are split into pieces.
Does moving data lead to I/O issues?
Data containers try to solve such data issues: data loading, network, I/O speed, missing metadata.
The results and optimizations may be useful for other studies.