Application of Cloud Computing for Geoscience Analytics

Abstract/Agenda: 

The objective of this session is to share innovative concepts, emerging solutions, and applications of Cloud Computing for Geoscience Analytics. The elasticity of Cloud Computing enables us to scale big data analytic solutions horizontally so that they can handle more data at the same time. Topics include demonstrations, studies, methods, solutions, and/or architectural discussions on

  • Architecture for big data analytics
  • Application of open source technologies
  • Automated techniques and solutions for data analysis
  • Browser-based data analytics and visualization
  • Real time decision support

Invited speakers

  • Mike Little - NASA ESTO, AIST Managed Cloud Environment
  • Brett McLaughlin - ESDIS/N-GAP
  • Brian Wilson/Thomas Huang - JPL - NEXUS - Deep Data Analytic Platform
  • Fei Hu/Zhenlong Li - GMU - A High Performance Framework for Big Climate Data Analytics
  • Hook Hua - JPL - Machine Learning applied on SAR processing

Notes:

Presentation 1: The AIST Managed Cloud Environment (AMCE)
Introduction:

  • Provides cloud services to AIST PIs, test users, and end users;
  • Each PI can create their own space in a common environment;
  • Some services, e.g. helpdesk, security, and management, can be shared

Access:

  • The cloud service can be accessed broadly at the discretion of the PI, not NASA;
  • The workspace is isolated and secure
  • A collaborative environment for flexible teams with diverse sets of users
  • Explore and test the newly-developed tools in a secure environment

Infusions

  • Increase the number of AIST projects that are ready to be adopted and used by end users;
  • Enable more end users to test tools, which accelerates the infusion process and makes it easier to infuse new work into their systems and processes;
  • Reduce the total cost and time to deploy a new tool

Security: Cloud services are preconfigured for security; AMCE teams can access the cloud platform from inside or outside NASA.

Benefits: accelerate project research, reduce risk and costs, encourage collaboration and support end-user implementation.

Presentation 2: NGAP: A (Brief) Update - PaaS, IaaS, Onboarding, and the Future
NGAP (NASA General Application Platform): a cloud-based Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) for ESDIS applications.

Cloud computing as defined by NIST: on-demand, self-service computing, storage, and network resources; broad network access (mobile, tablets); resources can be elastically provisioned and released; resource usage can be monitored, measured, and controlled.

NGAP as a PaaS: simplified AWS-focused architecture; hosting web applications; a demonstration: Earthdata Search running in an NGAP prototype with fault tolerance, high availability, and scaling.

NGAP as an IaaS: uses CloudFormation to manage system configuration, task setup, and application deployment to accelerate the data services.

Next steps: identify the required features for different kinds of applications; identify the applications.

Presentation 3: NEXUS: Big Data Analytics
Traditional data analysis: for big datasets, downloading and computing take many hours and require expensive local computing resources (CPU + RAM + storage); after the results are obtained, the downloaded files are purged. → Observations: traditional methods yield poor performance, and performance suffers when processing large files. A high-performance data analysis solution needs to be free from the I/O bottleneck.

NEXUS Deep data analytics:

  • Cloud-based, with the scalability to handle analysis of observation parameters
  • High-performance indexed, temporal, and geospatial search solution
  • Data are stored in small chunks in cloud data store
  • Scalable store and compute: NoSQL, parallel compute, in-memory map-reduce, hybrid cloud
  • Pre-chunk and summarize key variables → easy statistics, harder statistics on-demand, visualize original data on a map quickly
  • Technologies: Apache Solr, Cassandra, Spark/PySpark, Mesos/Yarn, Kafka, Zookeeper, Tornado, Spring XD, and EDGE
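
The pre-chunk-and-summarize idea in the list above can be sketched in plain Python. This is only an illustration under stated assumptions, not NEXUS code: the `TileSummary` layout and the `summarize_tile` and `area_average` helpers are hypothetical names introduced for the example. Each tile keeps a few precomputed statistics, so an "easy" statistic such as a mean can be answered from the summaries without rereading the raw values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TileSummary:
    """Precomputed statistics for one tile (hypothetical layout)."""
    count: int
    total: float
    minimum: float
    maximum: float


def summarize_tile(values: List[Optional[float]]) -> TileSummary:
    """Summarize one tile, skipping missing values (None)."""
    valid = [v for v in values if v is not None]
    if not valid:
        # An empty tile contributes nothing to later aggregates.
        return TileSummary(0, 0.0, float("inf"), float("-inf"))
    return TileSummary(len(valid), sum(valid), min(valid), max(valid))


def area_average(summaries: List[TileSummary]) -> float:
    """Combine tile summaries into a mean without touching the raw data."""
    count = sum(s.count for s in summaries)
    if count == 0:
        raise ValueError("no valid observations in the selected tiles")
    return sum(s.total for s in summaries) / count


# Three small tiles; None marks a missing observation.
tiles = [[1.0, 2.0, None, 3.0], [4.0, 5.0], [None, None]]
summaries = [summarize_tile(t) for t in tiles]
mean = area_average(summaries)
```

Statistics that cannot be derived from the stored summaries (the "harder statistics" mentioned above) would still require reading the chunks on demand.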

Analytics & Summarization Stack: uses a Cassandra cluster to store chunks and a Solr cluster to store metadata, to achieve fast and scalable analytics and visualization services.

DDCE (Deep Data Computing Environment): ETL + deep data processors for metadata, statistics, and tiles + index and data catalog -> Spark-based analytics platform, tile- and collection-based data access, and notebooks for interactive analysis, which are deployed on portal and custom VMs on the cloud platform.

Apache Spark: in-memory map-reduce; datasets partitioned by key; RDDs with a rich set of operations; lazy computation; runs on Yarn/Mesos.
Apache Cassandra: horizontally scalable NoSQL database; constant-time writes; no-single-point-of-failure architecture.
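
The Spark notions listed above (partitioning by key, lazy computation, map-reduce) can be illustrated without a cluster. The sketch below is a plain-Python stand-in, not the Spark API: `partition_by_key`, `lazy_map`, and `reduce_by_key` are hypothetical helpers written for this example, and the sample records are made up. Generators play the role of lazy transformations, and nothing is computed until the reduce step consumes them.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]


def partition_by_key(records: Iterable[Record], num_partitions: int) -> List[List[Record]]:
    """Hash-partition records so that equal keys land in the same partition."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def lazy_map(func: Callable[[float], float], records: Iterable[Record]) -> Iterator[Record]:
    """Generator-based map: nothing runs until the result is consumed."""
    for key, value in records:
        yield key, func(value)


def reduce_by_key(records: Iterable[Record], combine: Callable[[float, float], float]) -> Dict[str, float]:
    """Fold all values that share a key with the given combiner."""
    acc: Dict[str, float] = {}
    for key, value in records:
        acc[key] = combine(acc[key], value) if key in acc else value
    return acc


calls: List[float] = []


def to_celsius(kelvin: float) -> float:
    calls.append(kelvin)  # record that the function actually ran
    return kelvin - 273.15


records = [("sst", 290.15), ("sst", 292.15), ("air", 280.15)]
stages = [lazy_map(to_celsius, part) for part in partition_by_key(records, 2)]
calls_before_action = len(calls)  # still 0: the map stages are only described

# The "action": reduce each partition, then merge the partial results.
partials = [reduce_by_key(stage, max) for stage in stages]
merged = reduce_by_key((item for partial in partials for item in partial.items()), max)
```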

On-The-Fly analysis for Sea Level Rise Research: https://sealevel.nasa.gov

AIST-14: OceanXtremes - Data-intensive Anomaly Detection System

  • Challenges: rapidly identify features and anomalies in increasingly complex and voluminous observations
  • Two-stage procedure: determine a long-term/periodic mean → search deviations from the mean.
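
The two-stage procedure above can be sketched as follows. This is a minimal illustration, not the OceanXtremes implementation: it assumes a per-calendar-month mean as the periodic mean and a fixed absolute threshold for deviations, and the observation values are invented for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Observation = Tuple[int, int, float]  # (year, month, value)


def monthly_climatology(observations: List[Observation]) -> Dict[int, float]:
    """Stage 1: long-term mean for each calendar month."""
    by_month: Dict[int, List[float]] = defaultdict(list)
    for _, month, value in observations:
        by_month[month].append(value)
    return {month: mean(values) for month, values in by_month.items()}


def find_anomalies(
    observations: List[Observation],
    climatology: Dict[int, float],
    threshold: float,
) -> List[Observation]:
    """Stage 2: keep observations whose deviation from the mean exceeds the threshold."""
    return [
        (year, month, value - climatology[month])
        for year, month, value in observations
        if abs(value - climatology[month]) > threshold
    ]


observations = [
    (2010, 1, 20.0), (2011, 1, 20.5), (2012, 1, 19.5), (2013, 1, 24.0),
    (2010, 7, 26.0), (2011, 7, 26.5), (2012, 7, 25.5), (2013, 7, 26.0),
]
climatology = monthly_climatology(observations)
anomalies = find_anomalies(observations, climatology, threshold=2.0)
```

In this toy series only January 2013 is flagged, with a deviation of 3.0 from the January mean of 21.0.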

AIST-14: Distributed Oceanographic Matchup Service (DOMS)

  • A distributed data service to match satellite and in situ marine observations to support platform comparisons, cross-calibration, validation, and quality control.
  • Use cases: satellite Cal/Val, decision support (planning field campaigns), scientific investigations, alternate matching (satellite to satellite, satellite/in situ to model)
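
A matchup of the kind described above can be sketched as a search for the nearest in situ observation within a distance and time tolerance. This is an assumption-laden illustration rather than the DOMS algorithm: the `Obs` record, the `match_observations` helper, the tolerances, and the sample points are all hypothetical, and the brute-force search would need a spatial index at real data volumes.

```python
import math
from typing import List, NamedTuple, Tuple


class Obs(NamedTuple):
    time_s: float  # seconds since an arbitrary epoch
    lat: float     # degrees north
    lon: float     # degrees east
    value: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def match_observations(
    satellite: List[Obs], in_situ: List[Obs], max_km: float, max_seconds: float
) -> List[Tuple[Obs, Obs]]:
    """Pair each satellite observation with its nearest in situ observation in tolerance."""
    pairs = []
    for sat in satellite:
        candidates = [
            obs for obs in in_situ
            if abs(obs.time_s - sat.time_s) <= max_seconds
            and haversine_km(sat.lat, sat.lon, obs.lat, obs.lon) <= max_km
        ]
        if candidates:
            nearest = min(
                candidates,
                key=lambda obs: haversine_km(sat.lat, sat.lon, obs.lat, obs.lon),
            )
            pairs.append((sat, nearest))
    return pairs


satellite = [Obs(0, 10.0, 20.0, 25.1), Obs(3600, 40.0, -30.0, 15.0)]
in_situ = [
    Obs(600, 10.05, 20.05, 25.3),    # close in space and time to the first pass
    Obs(900, 10.2, 20.2, 25.0),      # roughly 31 km away: outside the 25 km tolerance
    Obs(100000, 40.0, -30.0, 14.8),  # co-located with the second pass, but too late
]
pairs = match_observations(satellite, in_situ, max_km=25.0, max_seconds=3600.0)
```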

NEXUS Performance Challenges:

  • NEXUS delivers a 2X or greater speed improvement compared to Giovanni, which is sponsored by NASA/ESDIS and backed by the popular NCO library
  • Smaller tiles enable more parallelism. Does more parallelism yield faster performance? → More parallelism ≠ faster performance, because of the overhead of scheduling, data transport, data queries, etc.
  • Area-averaged time series: with 16-way parallelism, NEXUS performs ~8X faster than Giovanni; with 64-way parallelism, ~15X faster
  • Global 18-year time-averaged map and global correlation map: NEXUS, with bigger tiles, took about 1 min; more tiles and more executors yielded SLOWER performance
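
One way to see why more parallelism does not always mean faster results is a toy cost model in which compute time shrinks with the number of workers while coordination overhead grows with it. The model and its constants below are assumptions made purely for illustration; they are not NEXUS measurements and do not reproduce the figures in these notes.

```python
def estimated_runtime(total_work_s: float, workers: int, overhead_per_worker_s: float) -> float:
    """Toy model: divisible compute time plus per-worker coordination cost."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return total_work_s / workers + overhead_per_worker_s * workers


def best_worker_count(total_work_s: float, overhead_per_worker_s: float, max_workers: int) -> int:
    """Worker count that minimizes the modeled runtime."""
    return min(
        range(1, max_workers + 1),
        key=lambda w: estimated_runtime(total_work_s, w, overhead_per_worker_s),
    )


# Hypothetical job: 64 s of divisible work, 1 s of overhead per worker.
runtimes = {w: estimated_runtime(64.0, w, 1.0) for w in (1, 8, 64)}
best = best_worker_count(64.0, 1.0, 64)
```

Under these made-up constants the modeled runtime falls from 65 s with one worker to 16 s with eight, then climbs back to 65 s with 64 workers, which mirrors the qualitative observation that more tiles and more executors can be slower.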

Key Takeaways:

  • Big Data ≠ Cloud Computing; Big Data is not a new computing problem. Cloud Computing opens up new approaches in tackling Big Data. While Cloud Computing has many benefits, it only plays a part in the overall Big Data architecture
  • Apply cloud computing where it makes sense: Data-Intensive Science, Cost reduction, Service reliability, etc. Truly leverage the elasticity of the Cloud;
  • The software known as NEXUS: Deep Data Platform (NTR-50157) has been approved for release as open source. You are authorized to upload the software to an open source repository when you are ready to do so.

Presentation 4: ClimateSpark: A High Performance Framework for Big Climate Data Analytics
Motivation: how to handle petabytes of climate data for comparing climate models and improving them for the next IPCC assessment?
Challenges: high dimensions; big data volume; computing intensive → Objective: develop a tool to help climatologists better handle big climate data using Spark and advanced GIS methodologies.
Although Apache Spark is fast, flexible, and composable, plain Spark does not natively support reading array-based data, spatiotemporal queries, climate data analytics, or visualization.

Architecture:

  • Persistence layer: HDFS, PostGIS
  • Distributed data storage layer: spatiotemporal index
  • Spark-based data analytics layer
  • Web-portal layer

Spatiotemporal Index: bridges the gap between the logical data model and the physical data model of multi-dimensional array-based climate datasets. A B-tree index is built to organize the logical data information and the physical data pointers. It supports array-based data with a chunked data structure and avoids data preprocessing when storing data on HDFS.
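
The role of such an index, mapping a logical query to physical chunk pointers, can be sketched with a sorted list and binary search. This is a simplified stand-in and not ClimateSpark's B-tree: it indexes the time dimension only, and the `ChunkPointer` fields, the `TimeIndex` class, and the file paths are hypothetical.

```python
import bisect
from typing import List, NamedTuple


class ChunkPointer(NamedTuple):
    path: str         # file holding the chunk (hypothetical path)
    byte_offset: int  # where the chunk starts inside the file
    byte_length: int  # size of the chunk in bytes


class TimeIndex:
    """Sorted-key index for one variable, keyed by chunk start hour."""

    def __init__(self, chunk_hours: int) -> None:
        self.chunk_hours = chunk_hours
        self._starts: List[int] = []
        self._pointers: List[ChunkPointer] = []

    def add(self, start_hour: int, pointer: ChunkPointer) -> None:
        """Insert a chunk while keeping the keys sorted."""
        pos = bisect.bisect_left(self._starts, start_hour)
        self._starts.insert(pos, start_hour)
        self._pointers.insert(pos, pointer)

    def query(self, first_hour: int, last_hour: int) -> List[ChunkPointer]:
        """Pointers of all chunks that overlap the hours [first_hour, last_hour]."""
        # A chunk covers [start, start + chunk_hours - 1].
        lo = bisect.bisect_left(self._starts, first_hour - self.chunk_hours + 1)
        hi = bisect.bisect_right(self._starts, last_hour)
        return self._pointers[lo:hi]


index = TimeIndex(chunk_hours=24)
for day in range(4):
    index.add(day * 24, ChunkPointer(f"/data/example/day{day}.nc", 0, 1024))

# Hours 30..50 fall inside the chunks that start at hour 24 and hour 48.
hits = index.query(30, 50)
```

Only the chunks returned by the query need to be read, which is how an index of this kind lets a subset request avoid scanning or preprocessing the full dataset.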

climateRDD

  • Key: variable name, geospatial boundary, temporal range, chunk info
  • Value: multi-dimensional array read via HDFS I/O

ClimateSpark: Spatiotemporal index, climateRDD, climateRDD transformation, climateRDD action, ClimateSpark SQL.

Two use cases, plus a web portal:

  • Distributed geospatial and temporal query and visualization: subset the big datasets according to the input geospatial bounding box (GeoJSON), and visualize them as a long time series in an animation (e.g. GIF)
  • Taylor-diagram service: compare different climate model outputs to evaluate the correlation between them
  • A web portal hides the technical details to ease operation
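
The statistics that a Taylor diagram summarizes can be computed directly. The sketch below uses the standard definitions (population standard deviations, Pearson correlation, and centered RMS difference); it is not the ClimateSpark service, and the `taylor_statistics` helper and the sample series are made up for illustration.

```python
import math
from typing import NamedTuple, Sequence


class TaylorStats(NamedTuple):
    model_std: float
    reference_std: float
    correlation: float
    centered_rmsd: float


def taylor_statistics(model: Sequence[float], reference: Sequence[float]) -> TaylorStats:
    """Statistics plotted on a Taylor diagram for one model against a reference."""
    n = len(reference)
    if n != len(model) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    m_mean = sum(model) / n
    r_mean = sum(reference) / n
    m_std = math.sqrt(sum((x - m_mean) ** 2 for x in model) / n)
    r_std = math.sqrt(sum((y - r_mean) ** 2 for y in reference) / n)
    if m_std == 0.0 or r_std == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    covariance = sum((x - m_mean) * (y - r_mean) for x, y in zip(model, reference)) / n
    centered_rmsd = math.sqrt(
        sum(((x - m_mean) - (y - r_mean)) ** 2 for x, y in zip(model, reference)) / n
    )
    return TaylorStats(m_std, r_std, covariance / (m_std * r_std), centered_rmsd)


reference = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [2.0, 4.0, 6.0, 8.0, 10.0]  # perfectly correlated, twice the amplitude
stats = taylor_statistics(model, reference)
```

The three quantities are tied together by the relation E'^2 = σ_m^2 + σ_r^2 - 2 σ_m σ_r R, which is what allows them to be shown on a single two-dimensional diagram.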

Performance: MERRA2, hourly data (~9 TB) → good scalability and data locality

Presentation 5: Cloud Computing-based Machine Learning on SAR Science Data Processing

Flood of Multi-sensor SAR Data: Remote sensing SAR (e.g. Sentinel-1A/1B, 1.8 TB/day) & airborne SAR; voluminous SAR data with various quality issues → need for scalable and automated quality assessment in [SAR] science data systems

Scalable Hybrid Cloud Science Data System with Machine Learning:

  • Science experts are needed to provide truth data for training
  • Machine learning for automated quality assessment
    • Phase unwrapping error detection -> tagging-based quality assessment: scientists tag the intensity level
    • Extract features from SAR data and classify the data to build a model that produces predictions; scalable on the cloud
    • Map unwrapping quality metrics to numerical features for training ML algorithms
  • UWE detection with deep neural networks - mixed approach with TensorFlow
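
The tag, featurize, train, and predict loop described above can be sketched with a very small classifier. This is an illustration only and not the presenters' method: a nearest-centroid classifier is swapped in for whatever model the science data system actually uses, and the two features (a mean coherence and a residue density), the tags, and all values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (numerical features, expert tag)


def train_centroids(samples: List[Sample]) -> Dict[str, List[float]]:
    """Average the feature vectors of each expert-assigned tag."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for features, tag in samples:
        grouped[tag].append(features)
    return {
        tag: [sum(column) / len(rows) for column in zip(*rows)]
        for tag, rows in grouped.items()
    }


def predict(centroids: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Assign the tag whose centroid is closest in feature space."""
    return min(centroids, key=lambda tag: math.dist(centroids[tag], features))


# Hypothetical features per interferogram: (mean coherence, residue density).
tagged = [
    ((0.90, 0.01), "good"),
    ((0.80, 0.02), "good"),
    ((0.30, 0.20), "bad"),
    ((0.20, 0.30), "bad"),
]
centroids = train_centroids(tagged)
prediction_a = predict(centroids, (0.70, 0.05))
prediction_b = predict(centroids, (0.35, 0.22))
```

Because prediction for each product is independent of the others, this step parallelizes naturally across cloud workers.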

High-level cloud approach for Science data systems:

  • Object storage and elastic computing
  • SDS -> S3 + Elasticsearch (user tags) -> EC2 (classification, feature extraction) -> S3 (classification model and features)
  • High-performance GPUs can be applied without needing to upload data
  • The whole environment is packaged in containers; use what you store and store what you use
  • Cache some data

Impacts:

  • Machine Learning Impact: machine learning-based predictions are exposed as facets to provide recommendations
  • Automatically scalable machine learning: high quality: L3 time series; low quality: L2 diagnostics & reprocessing; bad quality: QA assessment

Monitoring & Event Response
Events (Earthquake) + data sources () -> Hybrid cloud -> data products -> machine learning

  • Kumamoto Earthquake 2016 -> automatically generated the Sentinel-1A L2 interferogram data product

Next steps:

  • ML: Expand to Sentinel-1A and Sentinel-1B & additional QA metrics
  • Additional analytics: time series, with machine learning-based selection of stack
  • SDS for Big Data mission needs: dynamic hot data caching; HA; further expand usage of AWS spot market

 

Attachments/Presentations: 
Citation:
Huang, T.; Yang, P.; Application of Cloud Computing for Geoscience Analytics; 2016 ESIP Summer Meeting. ESIP Commons, March 2016