Emerging Big Data Technologies for Geoscience
The objective of this session is to share innovative concepts, emerging solutions, and applications for Big Earth and Space Data for Geoscience. Being able to handle massive amounts of data impacts our architectural decisions and approaches. Topics include demonstrations, studies, methods, solutions, and/or architectural discussions on:
- Common enabling technologies
- Automated techniques for data analysis
- Science analysis and visualization
- Real time decision support
- Implications of data-intensive science for education
- Data management lifecycle functions from data capture through analysis
- SciSpark - Chris Mattmann
- NEXUS - Brian Wilson (Thomas Huang)
- The Arctic Boreal Vulnerability Experiment and Big Data Analytics for Ecosystem Science and Data Management - Stephen Ambrose
- Lessons Learned on Optimization Approaches for High-Scalability and High-Resiliency in AWS - Hook Hua
1. NEXUS: The Deep Data Platform
One-minute Summary:
Data volumes are exploding; need scalable storage and compute; pre-chunk and summarize key variables (Spark, Cassandra store)
Moving/copying data is more expensive than moving the computing program to the data
Hardware and software limitations on big data
Current file formats good for archiving, not for data analysis
NEXUS: cloud-based, high-performance index, small chunks for data products
HDF/NetCDF partitioned into small fixed-size chunks in a cloud database (Cassandra DB Cluster)
Data analysis modules in the cloud (Spark in-memory)
Metadata in a high-performance search engine (Solr DB Cluster)
Demo: http://smapcast.jpl.nasa.gov/, https://sealevel.nasa.gov/
SWOT: big data V’s, next big data mission
Fast ETL: ingestion, metadata harvesting, deep indexing
Standard API for analysis by Spark
Algorithms integration by cloud computing
W10N integration by data in memory
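The chunk-and-summarize idea noted above can be sketched in a few lines. This is only an illustration, not the NEXUS implementation: the in-memory dictionaries `store` and `index` are stand-ins for the Cassandra tile store and the Solr metadata index, and the function names (`tile_grid`, `summarize`, `build_index`) are hypothetical.

```python
from statistics import mean


def tile_grid(grid, tile_rows, tile_cols):
    """Yield (row0, col0, tile) for fixed-size tiles covering a 2-D grid."""
    n_rows, n_cols = len(grid), len(grid[0])
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            tile = [row[c0:c0 + tile_cols] for row in grid[r0:r0 + tile_rows]]
            yield r0, c0, tile


def summarize(tile):
    """Precompute per-tile statistics, skipping missing values (None)."""
    values = [v for row in tile for v in row if v is not None]
    if not values:
        return {"count": 0, "min": None, "max": None, "mean": None}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def build_index(grid, tile_rows, tile_cols):
    """Return (store, index), both keyed by the tile's (row0, col0) origin.

    `store` stands in for the chunk database, `index` for the search engine.
    """
    store, index = {}, {}
    for r0, c0, tile in tile_grid(grid, tile_rows, tile_cols):
        store[(r0, c0)] = tile
        index[(r0, c0)] = summarize(tile)
    return store, index


# A 4 x 4 toy variable split into 2 x 2 tiles.
grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
store, index = build_index(grid, 2, 2)
```

With the summaries in `index`, a query such as "which tiles have a maximum above some threshold" can be answered without reading any tile data, which is the point of pre-summarizing key variables.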
2. The Arctic Boreal Vulnerability Experiment and Big Data Analytics for Ecosystem Science and Data Management
ABoVE: a large-scale NASA-led study of environmental change in Arctic and boreal regions…, SciCloud
CISTO conceptual service layers: Data -> compute service -> data services -> analytic services -> knowledge service
NCCS: Advanced Data Analytics Platform (ADAPT) - High Performance Science Cloud
Persistent Data Services
Highly available database nodes with SSD
Remote visualization
High-speed/capacity storage (3 PB of raw storage capacity)
ODISE: Ontology-Driven Interactive Search environment for ABoVE
metadata search engine
find and compare variables from heterogeneous datasets
ABoVE Science Cloud:
Data: Landsat 123 TB, MODIS 57 TB, NGA High Resolution Imagery 447 TB, MERRA 89 TB
Q: How can an external NASA user compute the global monthly temperature average over 42 layers of the atmosphere for the last 30 years?
Big data, but small computing programs
ABoVE users: account setup -> visit data -> curated peer-reviewed science products -> ODISE data discovery/access system
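To illustrate the "big data, but small computing programs" point, the global monthly average in the question above reduces to a short routine once the data is reachable. The sketch below was not shown in the talk. It assumes a regular latitude-longitude grid, where weighting each row by the cosine of its latitude is the usual way to approximate cell area. The names `global_mean` and `monthly_layer_means` and the nested-list data layout are hypothetical.

```python
import math


def global_mean(field, lats):
    """Area-weighted mean of a lat x lon field.

    Each row of `field` corresponds to one latitude in `lats` (degrees);
    rows are weighted by cos(latitude), an assumption that holds for
    regular latitude-longitude grids.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for row, lat in zip(field, lats):
        w = math.cos(math.radians(lat))
        for value in row:
            weighted_sum += w * value
            weight_total += w
    return weighted_sum / weight_total


def monthly_layer_means(cube, lats):
    """cube[month][layer] is a lat x lon grid; returns means[month][layer]."""
    return [[global_mean(layer, lats) for layer in month] for month in cube]


# Toy input: 2 months, 3 layers, 3 latitudes x 2 longitudes.
lats = [-60.0, 0.0, 60.0]
layer = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
cube = [[layer, layer, layer], [layer, layer, layer]]
means = monthly_layer_means(cube, lats)
```

For 42 layers over 30 years the output is only 360 x 42 numbers, so the expensive part is reading the input, not the computation. This is why running the program next to the data is attractive.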
3. Lessons Learned on Optimization Approaches For High-Scalability and High-Resiliency in AWS
order of magnitude larger than existing missions
scaling issues
spot market for cost saving, but need resiliency
Fault tolerant
Hybrid cloud
S3 for storage
Moving existing code to the cloud, scaling up nodes
Assess capacity and demand of different regions
Benchmark performance on EC2
Use Case
Large-scale Consideration
Data locality: Different data in different regions
Transport approach
Horizontally scaling up compute workers
Cache ancillary data in EBS (local)
How to manage thousands of worker nodes
auto-scaling group policies
Auto-scaling rest periods of 60 seconds
Availability Zone load rebalancing: AWS shuts down some nodes to rebalance
Optimizing auto-scaling: multiples of Availability Zones (AZ); mitigation plan: randomization
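The effect of a rest period on scaling decisions can be shown with a toy model. The sketch below does not call any AWS API and is not the policy used in the talk. The `Scaler` class, the jobs-per-worker ratio, and the worker cap are all hypothetical; only the 60-second rest period comes from the notes.

```python
from dataclasses import dataclass


@dataclass
class Scaler:
    """Toy auto-scaler: resize to match queue depth, then rest."""
    cooldown_s: float = 60.0      # rest period from the notes
    jobs_per_worker: int = 10     # assumed sizing ratio
    max_workers: int = 1000       # assumed fleet cap
    workers: int = 0
    last_change: float = float("-inf")

    def step(self, now, queued_jobs):
        """Apply at most one scaling action at time `now` (seconds)."""
        if now - self.last_change < self.cooldown_s:
            return self.workers  # still inside the rest period
        # Ceiling division: enough workers to cover the queue, up to the cap.
        target = min(self.max_workers, -(-queued_jobs // self.jobs_per_worker))
        if target != self.workers:
            self.workers = target
            self.last_change = now
        return self.workers


scaler = Scaler()
history = [
    scaler.step(0, 95),       # first decision: scale to cover 95 jobs
    scaler.step(30, 500),     # inside the rest period: no change
    scaler.step(60, 500),     # rest period over: scale again
    scaler.step(200, 50000),  # demand exceeds the cap
]
```

The rest period keeps the group from reacting to every fluctuation in queue depth. The trade-off is that a sudden burst waits up to one full period before capacity follows.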
S3 object key naming affects performance: it determines which partition objects land in
Spot Market: cost saving
Hybrid cloud: flexibility
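One mitigation commonly used for the S3 key-naming issue noted above was to prepend a short hash to each key, so that keys which would otherwise share a long common prefix spread across partitions. The sketch below shows that pattern only. It is not necessarily what the presenters did, and the helper name `hashed_key` and the sample key are hypothetical. Later S3 guidance relaxed this advice, so treat it as a technique of its time.

```python
import hashlib


def hashed_key(natural_key, prefix_len=4):
    """Return the key with a short, deterministic hex prefix.

    The prefix is derived from the key itself, so the mapping is stable
    and a reader who knows the natural key can recompute the stored key.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{natural_key}"


# Hypothetical natural key with a date-ordered layout.
example = hashed_key("granules/2016/001/scene_0001.h5")
```

The cost of this scheme is that listing objects by natural prefix (for example, everything from one day) no longer works directly, so a separate catalog of keys is usually needed.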
4. SciSpark: Applying In-Memory Distributed Computing to Large Scale Climate Science & Analytics
Improving version 2 of RCMES using parallel computing
Spark: In-Memory MapReduce
Datasets are partitioned into RDDs (Resilient Distributed Datasets); resilience of new RDDs; rich operations on RDDs
Spark ecosystem
Envisioned architecture
scientific RDD for data loading, regridding, reshaping
reuse of array data across multi-stage operations
D3.js for interactive data exploration
Three challenges using spark for Earth Science
Spark RDD to geospatial 2D&3D arrays
URIs set -> Partitioning function
sRDD: sciTensor
N-dim in Spark: ND4J, BREEZE
Reworking complex science algorithms (like GTG)
Performance in Spark JVM (e.g. JVM heap, GC overheads)
PETAL layer for I/O issues: define users’ own loader and partitioner
sRDD: OPeNDAP, partition, load data in parallel
sciTensor (Spark in-memory abstraction of NetCDF/HDF): metadata in hashmap, a list of variable arrays
Use case:
Sequential Grab
notebook (Zeppelin)
GTG implemented in Spark
D3 visualization
Performance issues:
Garbage Collection
Accomplishments: sRDD architecture, N-dimensional array reads and operations
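The sRDD flow noted above (a set of URIs, a partitioning function, a user-defined loader, parallel loading into a tensor-like object) can be sketched without Spark. This is a loose model and not the SciSpark API. Threads stand in for Spark executors, and `SciTensorSketch`, `partition_uris`, `fake_loader`, and `load_all` are hypothetical names. The loader fabricates a tiny array instead of reading NetCDF/HDF.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SciTensorSketch:
    """Loose model of the notes: metadata in a hashmap plus variable arrays."""
    metadata: dict = field(default_factory=dict)
    variables: list = field(default_factory=list)


def partition_uris(uris, num_partitions):
    """Partitioning function: round-robin the URIs into partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, uri in enumerate(uris):
        parts[i % num_partitions].append(uri)
    return parts


def fake_loader(uri):
    """Hypothetical user-supplied loader; a real one would open the file."""
    return SciTensorSketch(
        metadata={"source": uri},
        variables=[[[float(len(uri))]]],  # one 1 x 1 placeholder array
    )


def load_all(uris, num_partitions, loader):
    """Load every partition in parallel; returns one list per partition."""
    parts = partition_uris(uris, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(lambda part: [loader(u) for u in part], parts))


uris = [f"http://example.org/file{i}.nc" for i in range(5)]
loaded = load_all(uris, 2, fake_loader)
```

Letting the user supply both the partitioner and the loader is what allows I/O to be tuned per data source, for example grouping URIs by time range so that each partition reads contiguous data.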