Emerging Big Data Technologies for Geoscience

Abstract/Agenda:

The objective of this session is to share innovative concepts, emerging solutions, and applications for Big Earth and Space Data for Geoscience. Being able to handle massive amounts of data affects our architectural decisions and approaches. Topics include demonstrations, studies, methods, solutions, and architectural discussions on:

Common enabling technologies
Automated techniques for data analysis
Science analysis and visualization
Real time decision support
Implications of data-intensive science for education
Data management lifecycle functions from data capture through analysis

Speakers

SciSpark - Chris Mattmann
NEXUS - Brian Wilson (Thomas Huang)
The Arctic Boreal Vulnerability Experiment and Big Data Analytics for Ecosystem Science and Data Management - Stephen Ambrose
Lessons Learned on Optimization Approaches for High-Scalability and High-Resiliency in AWS - Hook Hua

Notes:

1. NEXUS: The Deep Data Platform

One-minute Summary:
- Data volumes are exploding; NEXUS pairs a scalable store with scalable compute, and pre-chunks and summarizes key variables (Spark, Cassandra store)
Facts:
- Moving or copying data is more expensive than moving the computing program
- Hardware and software limitations on big data
- Current file formats are good for archiving, not for data analysis
NEXUS: cloud-based, high-performance index, small chunks for data products
Methods:
- HDF/NetCDF partitioned into small, fixed-size chunks in a cloud database (Cassandra DB cluster)
- Data analysis modules in the cloud (Spark in-memory)
- Metadata in a high-performance search engine (Solr DB cluster)
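
As an illustration of the chunking idea in the methods above, the following is a minimal sketch in plain Python, not NEXUS code. It splits a 2D grid into fixed-size tiles and precomputes a small summary per tile. The function name `tile_grid`, the tile layout, and the summary fields are assumptions made for this example; in the setup described in the talk, the tile data would be stored in Cassandra and the metadata indexed in Solr.

```python
from statistics import mean

def tile_grid(grid, tile_rows, tile_cols):
    """Split a 2D list into fixed-size tiles keyed by (tile_row, tile_col).

    Hypothetical helper for illustration only. Tiles on the right or
    bottom edge may be smaller when the grid size is not a multiple
    of the tile size.
    """
    tiles = {}
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            block = [row[c0:c0 + tile_cols] for row in grid[r0:r0 + tile_rows]]
            values = [v for row in block for v in row]
            tiles[(r0 // tile_rows, c0 // tile_cols)] = {
                "origin": (r0, c0),   # where the tile sits in the full grid
                "data": block,        # the chunk that would go to the data store
                "summary": {          # precomputed stats that would be indexed
                    "min": min(values),
                    "max": max(values),
                    "mean": mean(values),
                    "count": len(values),
                },
            }
    return tiles

# A synthetic 4 x 4 grid split into four 2 x 2 tiles.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = tile_grid(grid, 2, 2)
```

With per-tile summaries available, a query over a region only needs to read the tiles that overlap it, and some statistics can be answered from the summaries alone.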
Datasets: SMAP, MODIS, GHRSST, JASON
Demo: http://smapcast.jpl.nasa.gov/, https://sealevel.nasa.gov/
SWOT: big data V’s, next big data mission
Summary
- Fast ETL: ingestion, metadata harvesting, deep indexing
- Standard API for analysis by Spark
- Algorithms integration by cloud computing
- W10N integration by data in memory

2. The Arctic Boreal Vulnerability Experiment and Big Data Analytics for Ecosystem Science and Data Management

ABoVE: a large scale NASA-led study of environmental change in arctic and boreal regions…, SciCloud
CISTO conceptual service layers: data -> compute service -> data services -> analytic services -> knowledge service
NCCS: Advanced Data Analytics Platform (ADAPT) - High Performance Science Cloud
- Persistent data services
- Highly available database nodes with SSD
- Remote visualization
- HPC
- High-speed/capacity storage (3 PB of RAW storage capacity)

ODISE: Ontology-Driven Interactive Search environment for ABoVE
- metadata search engine
- find and compare variables from heterogeneous datasets
ABoVE Science Cloud:
- Data: Landsat 123 TB, MODIS 57 TB, NGA High Resolution Imagery 447 TB, MERRA 89 TB
- Q: How can a user external to NASA obtain a global monthly temperature average over 42 layers of the atmosphere for the last 30 years?
  - Analytics-as-a-Service
- Big data, but small computing programs
- ABoVE user workflow: account setup -> visit data -> curated, peer-reviewed science products -> ODISEA data discovery/access system
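
The question in the list above is a case of "big data, but small computing programs": the computation itself is short, and only a small table of results needs to leave the archive. The sketch below assumes a regular latitude-longitude grid and uses cosine-of-latitude weights, a standard approximation for cell area on such grids. The function names and the `cube[month][layer][lat][lon]` layout are assumptions for this example, not the ABoVE Science Cloud API.

```python
import math

def global_mean(field, lats_deg):
    """Area-weighted mean of field[lat][lon] on a regular lat-lon grid.

    Each row is weighted by cos(latitude), which approximates the
    relative area of grid cells at that latitude.
    """
    num = 0.0
    den = 0.0
    for row, lat in zip(field, lats_deg):
        w = math.cos(math.radians(lat))
        num += w * sum(row)
        den += w * len(row)
    return num / den

def monthly_layer_means(cube, lats_deg):
    """Reduce cube[month][layer][lat][lon] to means[month][layer]."""
    return [[global_mean(layer, lats_deg) for layer in month] for month in cube]

# Synthetic example: one month, two layers, three latitudes, two longitudes.
lats = [-60.0, 0.0, 60.0]
layer_a = [[10.0, 10.0], [10.0, 10.0], [10.0, 10.0]]  # uniform field
layer_b = [[0.0, 0.0], [6.0, 6.0], [0.0, 0.0]]        # warm only at the equator
means = monthly_layer_means([[layer_a, layer_b]], lats)
```

For 30 years and 42 layers the output is a 360 x 42 table, which is tiny compared with the input archive. That is the argument for running the program next to the data.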

3. Lessons Learned on Optimization Approaches For High-Scalability and High-Resiliency in AWS

order of magnitude larger than existing missions
Issues
- scaling issues
- Spot market for cost savings, but this requires resiliency
- Fault tolerance
- Hybrid cloud
Cloud:
- S3 for storage
- Moving existing code to the cloud, scaling up nodes
- Assess capacity and demand of different regions
- Benchmark performance on EC2
Use Case
Large-scale Consideration
- Data locality: Different data in different regions
  - Transport approach
- Horizontally scaling up compute workers
- Cache ancillary data in EBS (local)
- How to manage thousands of worker nodes
  - auto-scaling group policies
    - auto-scaling rest periods of 60 seconds
  - Availability Zone load rebalancing: AWS shuts down some nodes to rebalance
Optimizing auto-scaling: multiples of availability zones (AZ); mitigation plan: randomization
S3 object key naming affects performance: the key determines which partition an object lands in
Spot Market: cost saving
Hybrid cloud: flexibility
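
To illustrate the point about S3 object key naming: a mitigation widely used at the time of this session was to prepend a short hash to each key, so that sequentially named objects spread across key-space partitions instead of clustering under one prefix. The sketch below only builds key strings and makes no AWS calls. The function name `prefixed_key`, the 4-character prefix length, and the example key pattern are assumptions for illustration, not the approach confirmed in the talk.

```python
import hashlib

def prefixed_key(key, prefix_len=4):
    """Return the key with a short, deterministic hash prefix.

    Hypothetical helper: the prefix is the first `prefix_len` hex
    characters of the MD5 digest of the original key, so the same
    key always maps to the same prefixed name.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"

# Sequential, date-ordered names (made-up pattern) that would otherwise
# share a long common prefix.
original_keys = [f"granules/2016/001/scene_{i:04d}.h5" for i in range(3)]
hashed_keys = [prefixed_key(k) for k in original_keys]
```

The trade-off is that hashed prefixes make simple prefix listings (for example, "everything from day 001") harder, so a separate index of object names is usually kept alongside.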

4. SciSpark: Applying In-Memory Distributed Computing to Large Scale Climate Science & Analytics

Improving version 2 of RCMES using parallel computing
Spark: In-Memory MapReduce
- Partitioned datasets, RDDs, resilience of new RDDs, rich operations on RDDs
- Spark ecosystem
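
The RDD ideas in the list above (partitioned data, new RDDs derived from old ones, resilience) can be illustrated with a toy model in plain Python. This is not Spark and does not use the Spark API. The class `MiniRDD` and its methods are hypothetical names for this sketch. The point it shows is that a derived dataset records its parent and the function applied, so any partition can be recomputed from that lineage if it is lost.

```python
from functools import reduce

class MiniRDD:
    """Toy stand-in for an RDD: partitions plus lineage (parent + function)."""

    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions  # concrete data, or None if derived
        self._parent = parent          # lineage: where the data comes from
        self._fn = fn                  # lineage: how it is transformed

    @classmethod
    def from_list(cls, items, num_partitions):
        # Deal items round-robin into partitions.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def num_partitions(self):
        if self._parent is None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def map(self, fn):
        # Lazy: nothing is computed yet, only the lineage is recorded.
        return MiniRDD(None, parent=self, fn=fn)

    def compute(self, index):
        # Recompute one partition from lineage; this is what would be
        # re-run if that partition were lost.
        if self._parent is None:
            return list(self._partitions[index])
        return [self._fn(x) for x in self._parent.compute(index)]

    def reduce(self, fn):
        values = [v for i in range(self.num_partitions()) for v in self.compute(i)]
        return reduce(fn, values)

base = MiniRDD.from_list(list(range(1, 7)), 3)
squares = base.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b)  # sum of squares of 1..6
```

Real Spark additionally caches partitions in memory and schedules them across a cluster; the sketch covers only the lineage and partitioning concepts.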
Envisioned architecture
- scientific RDD for data loading, regridding, reshaping
- reuse of array data across multi-stage operations
- D3.js for interactive data exploration
Three challenges using Spark for Earth science
- Mapping Spark RDDs to geospatial 2D and 3D arrays
  - URIs set -> Partitioning function
  - sRDD: sciTensor
  - N-dim in Spark: ND4J, Breeze
- Reworking complex science algorithms (like GTG)
- Performance in Spark JVM (e.g. JVM heap, GC overheads)
Methods:
- PETAL layer for I/O issues: users define their own loader and partitioner
- sRDD: OPeNDAP, partition, load data in parallel
  - sciTensor (Spark in-memory abstraction of NetCDF/HDF): metadata in a hashmap, plus a list of variable arrays
- N-dim in Spark: ND4J, Breeze
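
The loader and partitioner pattern in the methods above can be sketched in plain Python. SciSpark itself is built on Spark, so none of the names below (`SciTensorSketch`, `partition_uris`, `fake_loader`, `load_partitions`) are its API. They are hypothetical stand-ins that show the shape of the idea: a set of URIs is split by a partitioning function, and each URI is loaded into a structure holding a metadata map and named variable arrays. Keying the variables by name in a dict is an assumption of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class SciTensorSketch:
    """Hypothetical stand-in: metadata map plus named variable arrays."""
    metadata: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)

def partition_uris(uris, num_partitions):
    """User-defined partitioner: deal URIs round-robin into partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, uri in enumerate(uris):
        parts[i % num_partitions].append(uri)
    return parts

def fake_loader(uri):
    """User-defined loader. A real one would read NetCDF/HDF, for example
    over OPeNDAP; this one returns fixed synthetic data."""
    return SciTensorSketch(
        metadata={"source": uri},
        variables={"demo_var": [[0.0, 1.0], [2.0, 3.0]]},
    )

def load_partitions(uris, num_partitions, loader):
    """Apply the loader to every URI, partition by partition.

    In a cluster each partition would be loaded by a different worker;
    here the loop is sequential to keep the sketch self-contained.
    """
    return [[loader(u) for u in part] for part in partition_uris(uris, num_partitions)]

# Made-up URIs for illustration.
uris = [f"example://data/file_{i}.nc" for i in range(5)]
loaded = load_partitions(uris, 2, fake_loader)
```

Swapping in a different `loader` or partitioner changes how data is read and distributed without touching the analysis code that consumes the loaded tensors.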
Use case:
- Sequential Grab
- notebook (Zeppelin)
- GTG implemented in Spark
- D3 visualization
Performance issues:
- Garbage collection
Accomplishments:
- sRDD architecture, N-dimensional array read and operation

Attachments/Presentations:

Attachment	Size
ABoVE_Science_Cloud__ESIP_Presentation_final2.pptx	5.93 MB
NEXUS_ESIPFed_2016Jan_opt.pdf	13.16 MB
SciSpark_ESIPFed_2016Jan_v2.pptx	6.38 MB

Citation:

Huang, T.; Wilson, B.; Yang, P.; Emerging Big Data Technologies for Geoscience; Winter Meeting 2016. ESIP Commons, November 2015

Submitted by thuang on 2015-11-02 18:14.
