Emerging Big Data Technologies for Geoscience

Abstract/Agenda: 

 

The objective of this session is to share innovative concepts, emerging solutions, and applications of Big Earth and Space Data for geoscience. Handling massive amounts of data shapes our architectural decisions and approaches. Topics include demonstrations, studies, methods, solutions, and/or architectural discussions on:

  • Common enabling technologies
  • Automated techniques for data analysis
  • Science analysis and visualization
  • Real-time decision support
  • Implications of data-intensive science for education
  • Data management lifecycle functions from data capture through analysis

Speakers

  • SciSpark - Chris Mattmann
  • NEXUS - Brian Wilson (Thomas Huang)
  • The Arctic Boreal Vulnerability Experiment and Big Data Analytics for Ecosystem Science and Data Management - Stephen Ambrose
  • Lessons Learned on Optimization Approaches for High-Scalability and High-Resiliency in AWS - Hook Hua
Notes: 
  1. NEXUS: The Deep Data Platform

  • One-minute Summary:

    • Data volumes are exploding; NEXUS provides scalable storage and compute by pre-chunking and summarizing key variables (Spark for compute, Cassandra for storage)

  • Facts:

    • Moving or copying data is more expensive than moving the computation to the data

    • Hardware and software impose limits on big data handling

    • Current file formats are good for archiving, but not for data analysis

  • NEXUS: cloud-based, with a high-performance index and small chunks for data products

  • Methods:

    • HDF/netCDF files are partitioned into small, fixed-size chunks held in a cloud database (Cassandra cluster); see the sketch after this list

    • Data analysis modules run in the cloud (Spark, in-memory)

    • Metadata is indexed in a high-performance search engine (Solr cluster)
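
A minimal sketch of the chunking step above, assuming a GHRSST-like sea surface temperature granule: the file is split into fixed 64x64 tiles and each tile is written as a blob to Cassandra. The file name, variable, keyspace, and table schema are hypothetical placeholders, not the actual NEXUS schema.

```python
import numpy as np
from netCDF4 import Dataset
from cassandra.cluster import Cluster

TILE = 64  # assumed fixed chunk edge length (lat x lon)

ds = Dataset("analysed_sst.nc")          # hypothetical GHRSST granule
sst = ds.variables["analysed_sst"]       # dimensions: (time, lat, lon)

session = Cluster(["cassandra-host"]).connect("nexus")  # hypothetical keyspace
insert = session.prepare(
    "INSERT INTO sea_surface_temp (tile_id, lat0, lon0, data) VALUES (?, ?, ?, ?)"
)

_, nlat, nlon = sst.shape
for i in range(0, nlat, TILE):
    for j in range(0, nlon, TILE):
        chunk = sst[0, i:i + TILE, j:j + TILE]  # one small fixed-size tile
        # NEXUS also pre-computes per-tile summaries (e.g. min/max/mean),
        # so many queries never have to touch the raw blob at all.
        session.execute(
            insert, (f"sst:{i}:{j}", i, j, np.ascontiguousarray(chunk).tobytes())
        )
```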

  • Datasets: SMAP, MODIS, GHRSST, Jason

  • Demo: http://smapcast.jpl.nasa.gov/, https://sealevel.nasa.gov/

  • SWOT: the next big data mission, with all of the big data V's

  • Summary

    • Fast ETL: ingestion, metadata harvesting, deep indexing

    • Standard API for analysis with Spark (sketched below)

    • Algorithm integration via cloud computing

    • Webification (w10n) integration, with data in memory
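
The "standard API for analysis" pairs naturally with Spark map/reduce over the pre-chunked tiles. Below is a sketch of that pattern, not NEXUS's actual API: fetch_tile and the tile IDs are invented stand-ins, and each tile contributes only a partial (sum, count) pair, so tiny tuples rather than whole arrays cross the network.

```python
import numpy as np
from pyspark import SparkContext

def fetch_tile(tile_id):
    # Hypothetical stand-in for a Cassandra tile read; fabricates a small
    # array here so the sketch runs end to end.
    rng = np.random.default_rng(abs(hash(tile_id)) % 2**32)
    return rng.random((64, 64))

sc = SparkContext(appName="tile-average")
tile_ids = ["sst:0:0", "sst:0:64", "sst:64:0"]  # normally from the Solr index

# Each tile is reduced to (sum, count); the driver only combines scalars.
total, count = (sc.parallelize(tile_ids)
                  .map(fetch_tile)
                  .map(lambda a: (float(np.nansum(a)), int(np.isfinite(a).sum())))
                  .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1])))
print("area mean:", total / count)
```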

 

2. The Arctic Boreal Vulnerability Experiment and Big Data Analytics for Ecosystem Science and Data Management

  • ABoVE: a large-scale, NASA-led study of environmental change in Arctic and boreal regions, supported by a science cloud (SciCloud)

  • CISTO conceptual service layers: data -> compute services -> data services -> analytic services -> knowledge services

  • NCCS: Advanced Data Analytics Platform (ADAPT) - a high-performance science cloud

    • Persistent Data Services

    • Highly available database nodes with SSDs

    • Remote visualization

    • HPC

    • High-speed, high-capacity storage (3 PB of raw storage capacity)

 

  • ODISEA: Ontology-Driven Interactive Search Environment for ABoVE

    • metadata search engine

    • find and compare variables from heterogeneous datasets

  • ABoVE Science Cloud:

    • Data: Landsat 123 TB, MODIS 57 TB, NGA High Resolution Imagery 447 TB, MERRA 89 TB

    • Q: How does an external user compute the global monthly temperature average over 42 layers of the atmosphere for the last 30 years?

      • A: Analytics-as-a-Service (see the sketch after this list)

    • The data are big, but the compute programs are small

    • ABoVE user workflow: account setup -> visit data -> curated, peer-reviewed science products -> ODISEA data discovery/access system
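
A sketch of the Analytics-as-a-Service answer, assuming xarray and a hypothetical on-cloud MERRA layout: the computation runs next to the ~89 TB archive, and only the small result travels to the user. The path pattern and variable name "T" (air temperature on model levels) are assumptions about the holdings.

```python
import xarray as xr

# Open 30 years of monthly MERRA files already resident on the science
# cloud (hypothetical path pattern).
ds = xr.open_mfdataset("/adapt/merra/monthly/*.nc4", combine="by_coords")

# Global mean per month across all 42 levels (unweighted, for brevity):
# the result is a small (time x level) array - all the user downloads.
global_mean = ds["T"].mean(dim=("lat", "lon"))
global_mean.to_netcdf("global_monthly_mean_T.nc")
```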

3. Lessons Learned on Optimization Approaches for High-Scalability and High-Resiliency in AWS

  • Data volumes an order of magnitude larger than those of existing missions

  • Issues

    • scaling issues

    • Spot market for cost savings, but resiliency is needed

    • Fault tolerance

    • Hybrid cloud

  • Cloud:

    • S3 for storage

    • Moving existing code to the cloud; scaling up nodes

    • Assess capacity and demand across different regions

    • Benchmark performance on EC2

  • Use Case

  • Large-scale Consideration

    • Data locality: different datasets reside in different regions

      • Transport approach

    • Horizontally scaling up compute workers

    • Cache ancillary data in EBS (local)

    • How to manage thousands of worker nodes

      • auto-scaling group policies

        • Auto-scaling rest periods of 60 seconds

      • Availability Zone load rebalancing: AWS may shut down nodes to rebalance across zones

  • Optimizing auto-scaling: scale in multiples of the number of Availability Zones (AZs); mitigation plan: randomization (see the sketch below)
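
A hedged boto3 sketch of the points above; the group name and sizes are illustrative, not the mission's actual configuration. It sets a 60-second cooldown ("rest period") between scaling actions, grows capacity in multiples of the AZ count, and suspends AZRebalance so AWS does not terminate busy workers just to rebalance zones.

```python
import boto3

asg = boto3.client("autoscaling")
N_AZ = 3  # keep capacity in multiples of the AZ count for even spread

asg.put_scaling_policy(
    AutoScalingGroupName="compute-workers",   # hypothetical ASG name
    PolicyName="scale-out-on-queue-depth",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2 * N_AZ,               # grow in AZ multiples
    Cooldown=60,                              # the 60-second rest period
)

# Disable AWS-initiated AZ load rebalancing, which shuts nodes down.
asg.suspend_processes(
    AutoScalingGroupName="compute-workers",
    ScalingProcesses=["AZRebalance"],
)
```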

  • S3 object key naming affects performance: the key prefix determines which partition an object lands in (sketched below)
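
A small sketch of that point: S3 has historically partitioned on the leading characters of the object key, so sequential prefixes (dates, counters) concentrate load on one partition. Prepending a short hash randomizes the prefix and spreads requests; the key layout here is made up.

```python
import hashlib

def randomized_key(granule_name: str) -> str:
    # A short hash prefix scatters keys across S3 partitions.
    prefix = hashlib.md5(granule_name.encode()).hexdigest()[:4]
    return f"{prefix}/{granule_name}"

# Keys like "2016/01/22/granule-000123.h5" all hit one hot partition;
# hashed keys like "9f3a/granule-000123.h5" do not.
print(randomized_key("granule-000123.h5"))
```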

  • Spot market: cost savings (see the sketch after this list)

  • Hybrid cloud: flexibility
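
A minimal boto3 sketch of the spot-market idea, with the resiliency caveat from the notes: spot instances can be reclaimed when the market price rises, so workers must be stateless and fault-tolerant. The AMI ID, instance type, and bid price are placeholder values.

```python
import boto3

ec2 = boto3.client("ec2")
ec2.request_spot_instances(
    SpotPrice="0.10",               # bid; instances reclaimed if market rises
    InstanceCount=50,               # horizontal scale-out of workers
    LaunchSpecification={
        "ImageId": "ami-00000000",  # hypothetical worker AMI
        "InstanceType": "c4.xlarge",
    },
)
# Because any node can disappear on short notice, job state lives in a
# queue or S3, and failed work items are simply re-queued.
```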

 

4. SciSpark: Applying In-Memory Distributed Computing to Large Scale Climate Science & Analytics

  • Improving version 2 of RCMES (Regional Climate Model Evaluation System) using parallel computing

  • Spark: In-Memory MapReduce

    • Datasets are partitioned into RDDs (Resilient Distributed Datasets); new RDDs are derived resiliently through lineage; rich operations on RDDs (see the example below)

    • Spark ecosystem
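
A tiny PySpark example of the RDD ideas in the notes: a dataset is partitioned across the cluster, transformations lazily build new RDDs via lineage (which is what makes lost partitions recomputable), and an action triggers the actual work.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")
nums = sc.parallelize(range(1_000_000), 8)    # dataset split into 8 partitions
squares = nums.map(lambda x: x * x)           # new RDD derived via lineage
total = squares.reduce(lambda a, b: a + b)    # action: triggers computation
print(total)
```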

  • Envisioned architecture

    • Scientific RDDs for data loading, regridding, and reshaping

    • reuse of array data across multi-stage operations

    • D3.js for interactive data exploration

  • Three challenges in using Spark for Earth science

    • Mapping Spark RDDs to geospatial 2D & 3D arrays

      • Set of URIs -> partitioning function

      • sRDD: sciTensor

      • N-dimensional arrays in Spark: ND4J, Breeze

    • Reworking complex science algorithms (like GTG)

    • Performance in the Spark JVM (e.g., JVM heap, GC overheads)

  • Methods:

    • PETAL layer for I/O: users define their own loader and partitioner

    • sRDD: OPeNDAP access, partitioning, loading data in parallel

      • sciTensor (Spark in-memory abstraction of netCDF/HDF): metadata in a hash map plus a list of variable arrays (see the sketch after this list)

    • N-dimensional arrays in Spark: ND4J, Breeze
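
A sketch of the sRDD/sciTensor loading pattern: an RDD of URIs is partitioned, each worker loads its granules in parallel, and each element becomes a (metadata hash map, variable arrays) pair. SciSpark itself is written in Scala with sciTensor objects; this PySpark version, with placeholder URIs and attribute names, only mirrors the shape of the idea.

```python
import numpy as np
from netCDF4 import Dataset
from pyspark import SparkContext

def load_granule(uri):
    # Dataset() handles local paths and OPeNDAP URLs alike.
    ds = Dataset(uri)
    meta = {"source": uri, "title": getattr(ds, "title", "unknown")}
    arrays = {name: np.asarray(var[:]) for name, var in ds.variables.items()}
    return (meta, arrays)  # the sciTensor-like (hash map, arrays) pair

sc = SparkContext(appName="sRDD-sketch")
uris = ["http://example.org/opendap/merra_t.nc"]   # placeholder URIs
srdd = sc.parallelize(uris, 4).map(load_granule)   # parallel loading
```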

  • Use case:

    • Sequential Grab

    • Notebook interface (Apache Zeppelin)

    • GTG implemented in Spark

    • D3 visualization

  • Performance issues:

    • Garbage Collection

  • Accomplishments: sRDD architecture; N-dimensional array reads and operations
Citation:
Huang, T.; Wilson, B.; Yang, P. Emerging Big Data Technologies for Geoscience. ESIP Winter Meeting 2016. ESIP Commons, November 2015.