SciSpark 301: Build your own Climate Metrics
We introduce a three-part course module on SciSpark, our AIST-14-funded project for Highly Interactive and Scalable Climate Model Metrics and Analytics. The module comprises 101, 201, and 301 classes that teach how to use Spark for science.
SciSpark 301 is a 1.5-hour course in which we will share lessons learned from our experience building SciSpark, along with a selection of notebooks for attendees to explore, learn from, expand on, and take in their own directions. The session is intended for anyone who wants to experiment with SciSpark and investigate its possible uses in their own work. We plan to prepare notebooks demonstrating a K-means clustering algorithm for identifying probability density functions (PDFs) of climate extremes, the Open Climate Workbench, and the Climate Model Diagnostic Analyzer. The session will include ample time for more in-depth discussion and problem-solving around attendees' interests.
- P. C. Loikith, J. Kim, H. Lee, B. R. Lintner, C. Mattmann, J. D. Neelin, D. E. Waliser, L. Mearns, S. McGinnis. Evaluation of Surface Temperature Probability Distribution Functions in the NARCCAP Hindcast Experiment. Journal of Climate, Vol. 28, No. 3, pp. 978-997, February 2015. doi:10.1175/JCLI-D-13-00457.1.
- S. Lee, et al. Climate Model Diagnostic Analyzer. Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), IEEE, 2015.
Goals for expanding SciSpark beyond the mesoscale convective complex (MCC), PDF, and other existing use cases
- Data reuse tasks: Sparkler, a data-science web crawler built on Spark (http://github.com/USCDataScience/sparkler/)
- Science: search analytics, RCMES (Regional Climate Model Evaluation System)
Use case: apply K-means clustering to group grid points together based on similarities in their probability density functions
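The clustering step above can be sketched in plain Python. This is an illustrative toy, not SciSpark code: the three-bin "histograms" standing in for per-grid-point temperature PDFs are invented data, and the deterministic first-k initialization is a simplification of real K-means seeding.

```python
import math

def kmeans(points, k, iters=20):
    """Plain K-means over equal-length vectors (here, PDF histograms).
    Deterministic: centroids start from the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute centroid as the mean of its members
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return labels, centroids

# Toy "PDFs": histograms peaked at cold vs. warm temperature bins.
cold = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
warm = [[0.1, 0.2, 0.7], [0.1, 0.3, 0.6], [0.1, 0.1, 0.8]]
labels, _ = kmeans(cold + warm, k=2)
# The three cold grid points land in one cluster, the three warm in the other.
```

In SciSpark the same grouping would run over RDDs of gridded arrays, but the objective is identical: grid points whose temperature PDFs look alike end up in the same cluster.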
Climate Model Diagnostic Analyzer (CMDA)
- Goal: develop a technology to help Earth scientists create and manage workflows for scientific calculations.
- Provenance-powered workflow:
- Read JSON parameters from the frontend
- Call the anomaly-calculation web service via REST
- Call the time-series web service with the dataURL returned by the previous call
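The three-step chain above can be sketched as follows. The service functions here are hypothetical stand-ins (the real CMDA endpoints, URLs, and payload fields are not shown in this document); a real client would issue HTTP requests, e.g. with urllib.request.

```python
import json

def call_anomaly_service(params):
    # Hypothetical stub for the anomaly-calculation REST service:
    # pretend it computed an anomaly and returned a dataURL for the result.
    return {"dataURL": "http://cmda.example/data/anomaly_%s.nc" % params["variable"]}

def call_timeseries_service(data_url):
    # Hypothetical stub for the time-series REST service, which consumes
    # the dataURL produced by the previous step.
    return {"timeSeriesURL": data_url.replace("anomaly", "timeseries")}

def run_workflow(frontend_json):
    params = json.loads(frontend_json)             # 1. read JSON from the frontend
    anomaly = call_anomaly_service(params)         # 2. anomaly-calculation REST call
    return call_timeseries_service(anomaly["dataURL"])  # 3. time-series call via dataURL

result = run_workflow('{"variable": "ts", "model": "NARCCAP"}')
```

Chaining each call through the previous call's dataURL is what makes the workflow provenance-powered: every intermediate product is addressable and can be traced or re-fed into later steps.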
Spark Idioms/Lessons Learned
- Partition data as needed to maintain "data locality" for algorithms
- Be careful with 'collect': it pulls the entire RDD back to the driver and can exhaust driver memory
- "Never" use groupByKey when reduceByKey or aggregateByKey will do; reserve re-keying plus groupByKey for cases where you deliberately need to replicate data
- Use accumulators instead of multiple reduces or collects
- In Spark shells or notebooks, global variables are not reliably captured by lambda expressions shipped to executors; use literals instead
- JVM performance: memory issues; monitor garbage collection.
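The groupByKey idiom above comes down to shuffle volume. This is not Spark code; it is a plain-Python simulation (with invented partition data) of why map-side combining, as in reduceByKey, moves far fewer records across the network than groupByKey.

```python
from collections import defaultdict

def shuffle_volume_group_by_key(partitions):
    """groupByKey ships every (key, value) record across the shuffle."""
    return sum(len(part) for part in partitions)

def shuffle_volume_reduce_by_key(partitions):
    """reduceByKey pre-aggregates within each partition (map-side combine),
    so at most one record per distinct key leaves each partition."""
    total = 0
    for part in partitions:
        combined = defaultdict(int)
        for key, value in part:
            combined[key] += value
        total += len(combined)
    return total

# Two partitions of (grid_cell, count) pairs with heavy key repetition.
parts = [
    [("cell_a", 1)] * 1000 + [("cell_b", 1)] * 500,
    [("cell_a", 1)] * 800 + [("cell_c", 1)] * 200,
]
gbk = shuffle_volume_group_by_key(parts)   # 2500 records shuffled
rbk = shuffle_volume_reduce_by_key(parts)  # 4 records shuffled
```

The same reasoning motivates the accumulator advice: a single pass that folds results into per-partition state is cheaper than repeated reduces or collects that each trigger their own traffic back to the driver.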