SciSpark 101: Introduction to Spark

Abstract/Agenda:

We introduce a three-part course module on SciSpark, our AIST14-funded project for Highly Interactive and Scalable Climate Model Metrics and Analytics. The course includes 101, 201, and 301 classes for learning how to use Spark for science.

SciSpark 101 is a 1.5-hour session in which we will use SciSpark to introduce the fundamental concepts required to develop new programs and convert existing programs to take advantage of Spark. This will include an overview of Apache Zeppelin, Spark, and Hadoop, and will also cover the concepts of filter, map, reduce, collect, and counter. We will work within the SciSpark environment using both Scala and Python as functional programming languages.
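As a rough illustration of the filter, map, reduce, and collect concepts named above, here is a minimal word-count sketch in plain Python. It uses only Python built-ins and `functools.reduce`, not the Spark API; in Spark the same steps would run lazily and in parallel over a distributed dataset. The sample lines are made up for illustration.

```python
from functools import reduce

# Made-up input lines (assumption: any small text sample would do).
lines = [
    "spark makes big data simple",
    "",
    "big data needs big memory",
]

# filter: keep only the non-empty lines.
non_empty = filter(lambda line: line.strip(), lines)

# map: turn each remaining line into a list of lowercase words.
word_lists = map(lambda line: line.lower().split(), non_empty)


def merge_counts(counts, words):
    """reduce step: fold one line's words into the running count dictionary."""
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# reduce: combine all per-line word lists into a single result.
word_counts = reduce(merge_counts, word_lists, {})

# "collect": materialize the result as a sorted list of (word, count) pairs.
collected = sorted(word_counts.items())
print(collected)
```

Nothing is computed until `reduce` pulls items through the `filter` and `map` iterators, which loosely mirrors how Spark defers transformations until an action such as collect is called.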

Notes:

Intro. to SciSpark:

a. Funded as an AIST project
b. Motivation for SciSpark: in-memory computation and frequent data-reuse operations for Earth science
c. Envisioned architecture: Zeppelin as the front end
d. An analytics engine for science data

I/O bottleneck
Extend native Spark on the JVM: handle Earth science geolocation arrays; netCDF/OPeNDAP data ingest; array operations like NumPy; two complex use cases: MCS and PDF clustering of atmospheric state
PySpark Gateway
Three challenges, including adapting Spark RDDs to geospatial 2D/3D data
Parallel computing styles: parallelize over time, space, variable, model, metrics, or parameters
sciRDD transformations and actions
SciSpark extensions for netCDF
SciSpark front end: Scala, Python, Spark SQL; notebooks automatically connect to spark-shell
Apache Zeppelin, SciSpark, sciRDD
Virtual machines with SciSpark were given to attendees.

Warm-up 101-1: Intro to Spark: some basic examples, such as word count
101-2: Spark SQL and DataFrames: useful for Earth science data, e.g. CSV
Using the Spark SQL package for discovery within the Storm Database
Load CSV data -> clean data -> create schema for DataFrame -> run Spark SQL to query the DataFrame with ‘typical’ RDD operations.
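The load -> clean -> schema -> query pipeline above can be sketched without a Spark installation. The example below is an assumption-laden stand-in: it swaps Spark SQL and DataFrames for Python's standard `csv` and `sqlite3` modules, and the storm records, column names, and the 100-knot threshold are all hypothetical, not taken from the Storm Database used in the session.

```python
import csv
import io
import sqlite3

# Hypothetical storm records; the blank wind value and stray spaces mimic dirty input.
RAW_CSV = """name,year,max_wind_kt
Alpha, 2004 ,95
Bravo,2004,
Charlie,2005,120
"""


def load_rows(text):
    """Load: parse CSV text into dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Clean: strip whitespace, drop rows with no wind speed, cast numeric fields."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["max_wind_kt"]:
            cleaned.append((row["name"], int(row["year"]), int(row["max_wind_kt"])))
    return cleaned


def build_table(records):
    """Schema: declare typed columns, then insert the cleaned records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE storms (name TEXT, year INTEGER, max_wind_kt INTEGER)")
    conn.executemany("INSERT INTO storms VALUES (?, ?, ?)", records)
    return conn


conn = build_table(clean_rows(load_rows(RAW_CSV)))

# Query: plain SQL over the typed table, as one would do against a registered DataFrame.
strong = conn.execute(
    "SELECT name, year FROM storms WHERE max_wind_kt >= 100 ORDER BY name"
).fetchall()
print(strong)
```

The point of the explicit schema step is the same in both settings: once columns have types, numeric comparisons such as the wind-speed filter behave as numbers rather than strings.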

Parallel Statistical Rollups for a Time-Series of Grids

Compute per-pixel
Statistical algorithms:
Roll up statistics daily, monthly, yearly

Demo: Parallel Statistical Rollups for a Time-Series of Grids

Read files using OPeNDAP, and split URLs by month
Define accumulate function -> update accumulators for a set of variable grids
Define combine function to merge accumulators to go from monthly to seasonal to yearly to total
Define function to compute final statistics from the accumulators
Define function to write stats to netCDF file
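The accumulate / combine / finalize steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code shown in the demo: grids are flat lists of floats with `None` marking missing pixels, the function names (`new_accumulator`, `accumulate`, `combine`, `finalize`) and the sample values are hypothetical, and the OPeNDAP read and netCDF write steps are left out.

```python
from functools import reduce
from math import inf, sqrt


def new_accumulator(n_pixels):
    """Empty per-pixel running totals."""
    return {
        "count": [0] * n_pixels,
        "sum": [0.0] * n_pixels,
        "sumsq": [0.0] * n_pixels,
        "min": [inf] * n_pixels,
        "max": [-inf] * n_pixels,
    }


def accumulate(acc, grid):
    """Fold one time step's grid into the accumulator, skipping missing pixels."""
    for i, value in enumerate(grid):
        if value is None:
            continue
        acc["count"][i] += 1
        acc["sum"][i] += value
        acc["sumsq"][i] += value * value
        acc["min"][i] = min(acc["min"][i], value)
        acc["max"][i] = max(acc["max"][i], value)
    return acc


def combine(a, b):
    """Merge two accumulators, e.g. two months into a season."""
    return {
        "count": [x + y for x, y in zip(a["count"], b["count"])],
        "sum": [x + y for x, y in zip(a["sum"], b["sum"])],
        "sumsq": [x + y for x, y in zip(a["sumsq"], b["sumsq"])],
        "min": [min(x, y) for x, y in zip(a["min"], b["min"])],
        "max": [max(x, y) for x, y in zip(a["max"], b["max"])],
    }


def finalize(acc):
    """Compute per-pixel mean and population standard deviation (None if no data)."""
    mean, std = [], []
    for n, s, sq in zip(acc["count"], acc["sum"], acc["sumsq"]):
        if n == 0:
            mean.append(None)
            std.append(None)
            continue
        m = s / n
        mean.append(m)
        # Clamp at zero to guard against tiny negative values from rounding.
        std.append(sqrt(max(sq / n - m * m, 0.0)))
    return {"mean": mean, "std": std, "min": acc["min"], "max": acc["max"]}


# Hypothetical two-pixel grids grouped by month.
january = [[1.0, 2.0], [3.0, 4.0]]
february = [[5.0, None]]

# In a parallel setting, each month would be accumulated on a separate worker.
monthly = [reduce(accumulate, grids, new_accumulator(2)) for grids in (january, february)]
total = reduce(combine, monthly)
stats = finalize(total)
print(stats)
```

Because `combine` is associative and commutative, partial accumulators can be merged in any grouping, which is what allows the same monthly results to be rolled up to seasonal, yearly, and total statistics without rereading the data.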

Attachments/Presentations:

Attachment	Size
SciSpark 101.pdf	1.07 MB

Citation:

SciSpark 101: Introduction to Spark; 2016 ESIP Summer Meeting. ESIP Commons, February 2016

Submitted by ChrisMattmann on 2016-02-12 12:53.