SciSpark 201: Searching for MCCs
We introduce a three-part course module on SciSpark, our AIST14-funded project for Highly Interactive and Scalable Climate Model Metrics and Analytics. The module comprises the 101, 201, and 301 classes for learning how to use Spark for science.
SciSpark 201 is a 1.5-hour session in which we use the search for Mesoscale Convective Complexes (MCCs) in satellite infrared data as a real-world example of how SciSpark enables real-time response to both search queries and modifications to the underlying code. This task is representative of the motivation behind SciSpark: iterative data-reuse algorithms that share information between multiple stages.
- Whitehall, Kim, et al. "Exploring a graph theory based algorithm for automated identification and characterization of large mesoscale convective systems in satellite datasets." Earth Science Informatics 8.3 (2015): 663-675.
- Implementation of the Grab 'em, Tag 'em, Graph 'em (GTG) algorithm in Python.
Note for SciSpark 201
A two-pronged approach to Spark
1. What is the goal of the scientific RDD (sRDD)? The scientific Resilient Distributed Dataset (sRDD) applies Apache Spark's concept of RDDs to multidimensional data representing a scientific measurement that can be subset by time or by space. The sRDD supports multidimensional data and the processing of scientific algorithms in the MapReduce paradigm within a distributed environment.
The sciTensor datatype is a self-documented array that keeps a list of variable arrays and maintains the associated metadata in a hashmap. The sciTensor is read into the sRDD, and the data within it is operated on via arithmetic and relational operations. A sciTensor can load data from HDFS, OPeNDAP, and the local file system.
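To make the idea of a self-documented array concrete, here is a minimal pure-Python sketch. It is not the SciSpark API: the class name `TensorSketch`, its methods, the variable name `ch4`, and the sample values and timestamp are all assumptions made for illustration. It only shows named arrays travelling together with a metadata map through one arithmetic and one relational operation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Grid = List[List[float]]  # a 2-D array stored as nested lists


@dataclass
class TensorSketch:
    """Hypothetical stand-in for a sciTensor: named arrays plus metadata."""
    variables: Dict[str, Grid]
    metadata: Dict[str, str] = field(default_factory=dict)

    def map_values(self, name: str, fn: Callable[[float], float]) -> "TensorSketch":
        """Arithmetic operation: apply fn element-wise to one variable.

        Returns a new object; the metadata is copied along unchanged.
        """
        new_grid = [[fn(v) for v in row] for row in self.variables[name]]
        return TensorSketch({**self.variables, name: new_grid}, dict(self.metadata))

    def mask_le(self, name: str, threshold: float, fill: float = 0.0) -> "TensorSketch":
        """Relational operation: keep values <= threshold, replace the rest with fill."""
        return self.map_values(name, lambda v: v if v <= threshold else fill)


# Illustrative data only: a 2x2 brightness temperature grid in kelvin.
frame = TensorSketch(
    variables={"ch4": [[230.0, 250.0], [241.0, 200.0]]},
    metadata={"units": "K", "time": "2006-09-11T00:00"},
)
cold = frame.mask_le("ch4", 241.0)                        # relational
shifted = frame.map_values("ch4", lambda v: v - 273.15)   # arithmetic
```

Each operation returns a new object rather than mutating the input, which mirrors how transformations on an RDD produce new RDDs.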
Scala RDD -> Python RDD -> Python visualization
Use case: Mesoscale convective complexes
- Data: brightness temperature data
- Nodes: areas with a given brightness temperature value and a given size
- Edges: determined by area overlaps between nodes within consecutive time periods
- Identify nodes and edges
- Find cloud elements and connect the cloud elements between frames.
- Find the subgraphs of cloudy areas that evolve in time
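The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not the GTG or SciSpark implementation: the function names, the 4-connected flood fill, the 241 K threshold, the minimum area of two pixels, and the three tiny frames are all assumptions chosen to keep the example small. Nodes are cold regions of sufficient size, edges link regions that share pixels in consecutive frames, and subgraphs are the connected components of the result.

```python
from collections import defaultdict, deque


def find_cloud_elements(frame, max_temp, min_area):
    """Return pixel sets for 4-connected regions with value <= max_temp
    that cover at least min_area pixels."""
    rows, cols = len(frame), len(frame[0])
    seen, elements = set(), []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or frame[r][c] > max_temp:
                continue
            region, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and frame[ny][nx] <= max_temp):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if len(region) >= min_area:
                elements.append(region)
    return elements


def build_graph(frames, max_temp, min_area):
    """Nodes are keyed (frame index, element index); edges join nodes in
    consecutive frames whose pixel sets overlap."""
    nodes = {}
    for t, frame in enumerate(frames):
        for i, region in enumerate(find_cloud_elements(frame, max_temp, min_area)):
            nodes[(t, i)] = region
    edges = [(a, b) for a in nodes for b in nodes
             if b[0] == a[0] + 1 and nodes[a] & nodes[b]]
    return nodes, edges


def connected_subgraphs(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


C, W = 220.0, 300.0  # illustrative cold and warm brightness temperatures (K)
frames = [
    [[C, C, W, W], [C, W, W, W], [W, W, W, C]],
    [[W, C, C, W], [W, C, W, W], [W, W, W, W]],
    [[W, W, W, W], [W, W, W, W], [W, W, C, C]],
]
nodes, edges = build_graph(frames, max_temp=241.0, min_area=2)
subgraphs = connected_subgraphs(nodes, edges)
```

In this toy input the cold region in the first frame overlaps the one in the second frame, so those two nodes form one subgraph, while the region in the third frame stays isolated. The single cold pixel in the first frame is dropped by the size criterion. Keeping only subgraphs that span more than one frame would select the cloudy areas that persist over time.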