SciSpark 101: Introduction to Spark

Abstract/Agenda: 

We introduce a three-part course module on SciSpark, our AIST14-funded project for Highly Interactive and Scalable Climate Model Metrics and Analytics. The three-part course includes 101, 201, and 301 classes for learning how to use Spark for science.

SciSpark 101 is a 1.5-hour session in which we will use SciSpark to introduce the fundamental concepts required to develop new programs, and to convert existing programs, so that they take advantage of Spark. This will include an overview of Apache Zeppelin, Spark, and Hadoop, and will also cover the concepts of filter, map, reduce, collect, and counter. We will work within the SciSpark environment using both Scala and Python as functional programming languages.
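
As a rough illustration of the filter, map, reduce, collect, and count ideas named above, here is a minimal sketch in plain Python rather than Spark itself. Python's built-in `filter` and `map` are lazy, much as Spark transformations are, and `list(...)` plays the role that `collect` plays in Spark. The temperature values are hypothetical sample data, not taken from the session.

```python
from functools import reduce

# Hypothetical sample: daily temperatures in kelvin; None marks a missing value.
temps_k = [271.3, None, 280.1, 285.6, None, 290.2]

# filter: keep only the valid observations (lazy, nothing is computed yet)
valid = filter(lambda t: t is not None, temps_k)

# map: convert each remaining value from kelvin to degrees Celsius (still lazy)
celsius = map(lambda t: round(t - 273.15, 2), valid)

# collect: materialize the lazy pipeline into a local list
collected = list(celsius)

# count: number of elements that survived the filter
n = len(collected)

# reduce: fold all elements into one value with an associative function
total = reduce(lambda a, b: a + b, collected)
mean_c = total / n

print(collected, n, mean_c)
```

The function passed to `reduce` is associative, which is what lets a framework such as Spark apply it to partitions in parallel and then merge the partial results.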

Notes: 

Intro. to SciSpark:

a. Funded as an AIST project
b. Motivation for SciSpark: in-memory operations with frequent data reuse for Earth science
c. Envisioned architecture: Zeppelin as the front end
d. An analytics engine for science data

  1. I/O bottleneck
  2. Extend native Spark on the JVM: handle Earth science geolocation arrays; netCDF/OPeNDAP data ingest; array operations like NumPy; two complex use cases: MCS and PDF clustering of atmospheric state
  3. PySpark Gateway
  4. Three challenges: Adapting Spark RDD to geospatial 2D/3D
  5. Parallel computing styles: parallelize over time/over space/variable, model, metrics, parameters
  6. sciRDD transformations and actions
  7. SciSpark extensions for netCDF
  8. SciSpark front end: Scala, Python, Spark SQL; notebooks automatically connect to spark-shell
  9. Apache Zeppelin, SciSpark, sciRDD
  10. Virtual machines with SciSpark were given to attendees.

Warm-up, 101-1: Intro to Spark: some basic examples, such as word count
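
For readers who have not seen the word count example, the sketch below shows its usual shape in plain Python, with each step labelled by the Spark operator it corresponds to (`flatMap`, `map`, `reduceByKey`). It is an illustration under assumptions, not the code used in the session; the input lines are hypothetical.

```python
from collections import defaultdict

# Hypothetical input; in Spark this would be an RDD of text lines.
lines = ["to be or not to be", "that is the question"]

# flatMap step: split every line into words and flatten into one sequence
words = [word for line in lines for word in line.split()]

# map step: pair each word with the count 1
pairs = [(word, 1) for word in words]

# reduceByKey step: sum the counts that share the same key (word)
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
print(word_counts)
```
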
101-2: Spark SQL and DataFrames: useful for Earth science data, e.g. CSV
Using the Spark SQL package for discovery within the Storm Database
Load CSV data -> clean data -> create schema for DataFrame -> run Spark SQL to query the DataFrame alongside 'typical' RDD operations.
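
The load, clean, schema, query sequence can be sketched without a Spark installation. The example below uses Python's standard `csv` and `sqlite3` modules as a stand-in for Spark SQL, so it shows the shape of the workflow only and not the Spark API; in Spark the table would instead be a DataFrame registered for SQL queries. The column names and records are hypothetical, since the notes do not list the Storm Database fields.

```python
import csv
import io
import sqlite3

# Hypothetical storm records; the real Storm Database columns are not given in the notes.
raw_csv = """event_type,state,damage_usd
Tornado,OK,250000
Hail,TX,
Tornado,KS,120000
Flood,TX,n/a
"""

# Load: parse the CSV text into a list of dictionaries
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Clean: keep only rows whose damage field is a plain non-negative integer
clean = [
    (r["event_type"], r["state"], int(r["damage_usd"]))
    for r in rows
    if r["damage_usd"].isdigit()
]

# Schema: declare typed columns, then load the cleaned rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE storms (event_type TEXT, state TEXT, damage_usd INTEGER)")
conn.executemany("INSERT INTO storms VALUES (?, ?, ?)", clean)

# Query: aggregate with SQL, as one would against a registered DataFrame
result = conn.execute(
    "SELECT event_type, COUNT(*), SUM(damage_usd) "
    "FROM storms GROUP BY event_type ORDER BY event_type"
).fetchall()
conn.close()

print(result)
```
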

Parallel Statistical Rollups for a Time-Series of Grids

  1. Compute per-pixel statistics
  2. Statistical algorithms:
  3. Roll up statistics by day, month, and year

Demo: Parallel Statistical Rollups for a Time-Series of Grids

  1. Read files using OPeNDAP, and split URLs by month
  2. Define accumulate function -> update accumulators for a set of variable grids
  3. Define combine function to merge accumulators, to go from monthly to seasonal to yearly to total
  4. Define function to compute final statistics from the accumulators
  5. Define function to write stats to a netCDF file
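
The accumulate, combine, and finalize steps above can be sketched in plain Python as follows. This is an assumption-laden illustration, not the demo's code: grids are flattened to lists of pixel values, `None` marks a missing pixel, the function names (`new_acc`, `accumulate`, `combine`, `finalize`) and the sample values are hypothetical, and the OPeNDAP read and netCDF write steps are left out.

```python
import math
from functools import reduce

def new_acc(n_pixels):
    """Create empty per-pixel accumulators."""
    return {
        "count": [0] * n_pixels,
        "sum": [0.0] * n_pixels,
        "sumsq": [0.0] * n_pixels,
        "min": [math.inf] * n_pixels,
        "max": [-math.inf] * n_pixels,
    }

def accumulate(acc, grid):
    """Update the accumulators in place with one grid; None pixels are skipped."""
    for i, v in enumerate(grid):
        if v is None:
            continue
        acc["count"][i] += 1
        acc["sum"][i] += v
        acc["sumsq"][i] += v * v
        acc["min"][i] = min(acc["min"][i], v)
        acc["max"][i] = max(acc["max"][i], v)
    return acc

def combine(a, b):
    """Merge two accumulators (e.g. two months) into a new one."""
    return {
        "count": [x + y for x, y in zip(a["count"], b["count"])],
        "sum": [x + y for x, y in zip(a["sum"], b["sum"])],
        "sumsq": [x + y for x, y in zip(a["sumsq"], b["sumsq"])],
        "min": [min(x, y) for x, y in zip(a["min"], b["min"])],
        "max": [max(x, y) for x, y in zip(a["max"], b["max"])],
    }

def finalize(acc):
    """Compute final per-pixel statistics; pixels with no data get None."""
    mean, std = [], []
    for n, s, sq in zip(acc["count"], acc["sum"], acc["sumsq"]):
        if n == 0:
            mean.append(None)
            std.append(None)
            continue
        m = s / n
        var = max(0.0, sq / n - m * m)  # population variance, clamped at zero
        mean.append(m)
        std.append(math.sqrt(var))
    return {"count": acc["count"], "mean": mean, "std": std,
            "min": acc["min"], "max": acc["max"]}

# Hypothetical data: two months, each with two daily 2x2 grids flattened to 4 pixels.
months = {
    "jan": [[1.0, 2.0, None, 4.0], [3.0, 2.0, 5.0, 4.0]],
    "feb": [[5.0, 2.0, 7.0, None], [7.0, 2.0, 9.0, 4.0]],
}

# Each month is accumulated independently (in Spark, one task per month) ...
monthly = [reduce(accumulate, grids, new_acc(4)) for grids in months.values()]
# ... then the monthly accumulators are merged into a longer-period rollup.
yearly = reduce(combine, monthly)
stats = finalize(yearly)
print(stats)
```

Because `combine` is associative and the accumulators hold only counts, sums, sums of squares, minima, and maxima, the same merge can be reused at every level of the rollup (monthly to seasonal to yearly to total) without revisiting the original grids.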

Attachments/Presentations: 

SciSpark 101.pdf (1.07 MB)

Citation:

SciSpark 101: Introduction to Spark; 2016 ESIP Summer Meeting. ESIP Commons, February 2016