Avoid Tech Regret: Let a Workflow Turn the Crank


Workflows can be used to automate data processing from the field to the database. They can also be used to produce “value added” databases that integrate and harmonize existing data sources into new and useful forms, including web-consumable resources. We will provide a brief overview of the use of workflows to generate primary and derived data products (e.g., data integration, improved quality control, aggregation), and discuss the tools that can be used and the challenges they present. We will then demonstrate various workflows, including a statistical code generator web service for ingesting data in R, MATLAB, SAS and SPSS; workflows for harvesting data from Data Turbine, Campbell loggers and other sources; and workflows for integrating and summarizing multi-site climate data. The workshop will conclude with a group discussion of existing challenges and needs, such as enabling technologies, training, and improved metadata generation systems for documenting composite data sets.

The material will be more practical than conceptual. This session follows earlier workshops at the September 2012 All Scientists Meeting and an LTER workshop in winter 2012; screencasts of the demonstrations from those workshops are available and provide a useful introduction to this topic. The workflow examples will use data described with EML (Ecological Metadata Language, the metadata standard used by the LTER network and others).

Best practices:

Statistical code generation web service:

Workshop background:

Our demonstration will use the LTER Data Catalog component of the Provenance Aware Synthesis Tracking Architecture (PASTA), as well as other sources of well-described data.


Although the presenters are primarily working with LTER data, the examples shown will apply to a wider audience, and we welcome a broad range of perspectives and experience.

Introduction to workflows (J. Porter)
-What are they?
-Why use them?
-Automating Workflows
--Increased consistency & detailed documentation
-Tools for Workflows
--Manual, semi-automated and automated
--Each tool has its own strengths and weaknesses
-Web services for integration of existing data
-Danger of workflows (XKCD Comic)

Data Processing, harvesting workflows (W. Sheldon)
-GCE Data Toolbox for MATLAB
-Data management lifecycle
-Importing Data
--Generic or specified parsers
-Add/Importing Metadata
-QA/QC Analysis
--Programmatic QC Analysis
--Interactive QC Analysis and Revision
--Automatic Documentation of QC Steps
--Data analysis, synthesis tools QC-aware
-Real-world sensor data workflow
-Demo Workflow
-Resources: MATLAB, Software Distribution and User Support
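
The programmatic QC and automatic documentation steps in the outline above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the GCE Data Toolbox API (which is MATLAB); the class `QCSeries`, the flag codes "M" and "Q", and the temperature limits are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QCSeries:
    """A data column with one flag string per value and a log of QC steps."""
    values: list
    flags: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # Start with an empty flag ("" = unflagged) for every value.
        if not self.flags:
            self.flags = ["" for _ in self.values]


def flag_missing(series, flag="M"):
    """Flag missing (None) values and record the step in the log."""
    count = 0
    for i, value in enumerate(series.values):
        if value is None:
            series.flags[i] += flag
            count += 1
    series.log.append(f"missing-value check: flagged {count} value(s) as '{flag}'")


def apply_range_check(series, low, high, flag="Q"):
    """Flag values outside [low, high] and record the rule in the log."""
    count = 0
    for i, value in enumerate(series.values):
        if value is None:
            continue  # missing values are handled by flag_missing
        if value < low or value > high:
            series.flags[i] += flag
            count += 1
    series.log.append(
        f"range check [{low}, {high}]: flagged {count} value(s) as '{flag}'"
    )


def qc_aware_mean(series):
    """Mean of unflagged values only, so analysis respects the QC flags."""
    good = [v for v, f in zip(series.values, series.flags) if f == ""]
    return sum(good) / len(good)


# Hypothetical air temperature readings (degrees C) from one sensor.
series = QCSeries([12.1, 12.4, None, 85.0, 12.9, -60.0])
flag_missing(series)
apply_range_check(series, low=-40.0, high=50.0)
clean_mean = qc_aware_mean(series)

for entry in series.log:
    print(entry)
```

Because every QC function appends to `series.log`, the record of what was checked travels with the data, which is the point of documenting QC steps automatically rather than by hand.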

Data integration example workflows
--Stats packages
--Programming languages
--Graphical workflow
-Sample Workflows
--Raw data
--Data integration

-Climate Data
--Goal: create a workflow that produces a unified dataset that can be automatically updated with new data and sites
-Step 1: Amalgamation of files
-Step 2: Harvest new data and sites
-Historical Data
-ex: Climatological data
-ex: Forest Plots
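
The two climate data steps above (amalgamating files, then harvesting new data and sites) can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' workflow: the column names, the `COLUMN_MAP` table, the site labels, and the sample values are all hypothetical.

```python
import csv
import io

# Hypothetical mapping from site-specific column names to harmonized names.
COLUMN_MAP = {
    "Date": "date",
    "date": "date",
    "AirTemp_C": "air_temp_c",
    "airt": "air_temp_c",
}


def parse_site_csv(site, text):
    """Parse one site's CSV text into harmonized records tagged with the site."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {"site": site}
        for name, value in row.items():
            if name in COLUMN_MAP:
                record[COLUMN_MAP[name]] = value
        record["air_temp_c"] = float(record["air_temp_c"])
        records.append(record)
    return records


def merge_records(unified, records):
    """Add records keyed by (site, date); return how many were new."""
    added = 0
    for record in records:
        key = (record["site"], record["date"])
        if key not in unified:
            unified[key] = record
            added += 1
    return added


# Step 1: amalgamate existing files that use different column names.
site_a = "Date,AirTemp_C\n2014-01-01,3.5\n2014-01-02,4.0\n"
site_b = "date,airt\n2014-01-01,-1.0\n"
unified = {}
merge_records(unified, parse_site_csv("A", site_a))
merge_records(unified, parse_site_csv("B", site_b))

# Step 2: harvest an updated file for site B and a file for a new site C.
site_b_update = "date,airt\n2014-01-01,-1.0\n2014-01-02,0.5\n"
site_c = "date,airt\n2014-01-01,7.2\n"
new_from_b = merge_records(unified, parse_site_csv("B", site_b_update))
new_from_c = merge_records(unified, parse_site_csv("C", site_c))

print(len(unified), "records from", len({site for site, _ in unified}), "sites")
```

Keying records by (site, date) makes the harvest step safe to rerun: a record that is already present is skipped, so only new dates and new sites are added.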

Provenance-tracking workflows in R
-What is provenance?
--From the French "provenir" ("to come from")
-Data provenance: Info required to accurately document the history of an item of data including how it was created and transformed
-State of Data Provenance Today
--Standard analysis tools do not collect provenance
-Uses of Data Provenance
--Short, mid and long term examples
-From R Scripts to Provenance Graphs (Flow graphic)
-Several examples
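
The idea of turning a script into a provenance graph can be illustrated with a small sketch. It is written in Python rather than R and does not represent the tool shown in the session; the `Tracked` wrapper, the `tracked_step` decorator, and the file name are hypothetical. Each function call is recorded as an edge from its inputs to its output.

```python
import functools

# The provenance graph: one entry per executed step.
PROVENANCE = []


class Tracked:
    """A value paired with a label so it can appear as a node in the graph."""

    def __init__(self, value, label):
        self.value = value
        self.label = label


def tracked_step(func):
    """Record each call as an edge: input labels -> step name -> output label."""

    @functools.wraps(func)
    def wrapper(*inputs):
        result = func(*[item.value for item in inputs])
        # Simplification: labels are derived from the function name, so this
        # sketch assumes each step is called only once per run.
        output = Tracked(result, f"{func.__name__}_out")
        PROVENANCE.append(
            {
                "step": func.__name__,
                "inputs": [item.label for item in inputs],
                "output": output.label,
            }
        )
        return output

    return wrapper


@tracked_step
def remove_missing(values):
    return [v for v in values if v is not None]


@tracked_step
def daily_mean(values):
    return sum(values) / len(values)


def lineage(label):
    """Walk backwards from an output and list the steps that produced it.

    Simplification: only the first input of each step is followed.
    """
    steps = []
    while True:
        edge = next((e for e in PROVENANCE if e["output"] == label), None)
        if edge is None:
            break
        steps.append(edge["step"])
        label = edge["inputs"][0]
    return list(reversed(steps))


raw = Tracked([10.0, None, 14.0], "raw_temperature.csv")
clean = remove_missing(raw)
mean = daily_mean(clean)

print(lineage(mean.label))
```

The analysis functions themselves stay unchanged; the decorator collects the history as a side effect, which addresses the point above that standard analysis tools do not collect provenance on their own.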

Conclusion (W. Sheldon)
-Workflow benefits
--Increase in efficiency and scalability
--Design process requires protocol formalization and review
-Workflow challenges and barriers
--Data challenges
--Software challenges
--Documentation challenges
--Automation challenges
--Personnel challenges

Online Resource Guide

Discussion Q&A
-Other examples of existing workflows
-What applications would benefit most from workflows?
-What would you need to implement workflows?

Gastil-Buhl, M.; Porter, J.; Sheldon, W.; Boose, E.; Avoid Tech Regret: Let a Workflow Turn the Crank; Summer Meeting 2014. ESIP Commons, April 2014