Avoid Tech Regret: Let a Workflow Turn the Crank
Workflows can automate data processing from the field to the database. They can also produce “value added” databases that integrate and harmonize existing data sources into new and useful forms, including web-consumable resources. We will provide a brief overview of the use of workflows to generate primary and value-added data products (e.g., data integration, improved quality control, aggregation), and discuss the tools that can be used and the challenges they present. We will then demonstrate several workflows, including a statistical code generator web service for ingesting data into R, MATLAB, SAS, and SPSS; workflows for harvesting data from Data Turbine, Campbell loggers, and other sources; and workflows for integrating and summarizing multi-site climate data. The workshop will conclude with a group discussion of existing challenges and needs, such as enabling technologies, training, and improved metadata generation systems for documenting composite data sets.
The material will be practical more than conceptual. It follows earlier workshops at the September 2012 All Scientists Meeting and an LTER workshop in winter 2012; screencasts of those demonstrations are available and provide a useful introduction to this topic. The workflow examples will use data described in EML (Ecological Metadata Language), the metadata standard used by the LTER Network and others.
Statistical code generation web service:
Our demonstration will use the LTER Data Catalog component of the Provenance Aware Synthesis Tracking Architecture (PASTA), as well as other sources of well-described data.
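To illustrate the kind of web-service access involved, the sketch below builds the read-metadata URL for a PASTA data package and fetches its EML document. It assumes the publicly documented PASTA REST layout (`/package/metadata/eml/{scope}/{identifier}/{revision}`); verify the exact path against the current PASTA API documentation before relying on it.

```python
import urllib.request

PASTA_BASE = "https://pasta.lternet.edu/package"  # public PASTA REST API root (assumed)

def eml_metadata_url(scope: str, identifier: int, revision: int) -> str:
    """Build the read-metadata URL for one data package (assumed endpoint layout)."""
    return f"{PASTA_BASE}/metadata/eml/{scope}/{identifier}/{revision}"

def fetch_eml(scope: str, identifier: int, revision: int, timeout: int = 30) -> str:
    """Download the package's EML document (requires network access)."""
    url = eml_metadata_url(scope, identifier, revision)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

The same URL pattern underlies the statistical code generation service: given a package identifier, a client can locate the EML, read attribute descriptions, and emit ingestion code for R, MATLAB, SAS, or SPSS.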
Although the presenters work primarily with LTER data, the examples shown will apply to a wider audience, and we welcome a broad range of perspectives and experience.
Introduction to workflows (J. Porter)
-What are they?
-Why use them?
--Increased consistency & detailed documentation
-Tools for Workflows
--Manual, semi-automated and automated
--Each tool has its own strengths and weaknesses
-Web services for integration of existing data
-Dangers of workflows (XKCD comic)
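To make "increased consistency & detailed documentation" concrete, here is a minimal workflow runner, sketched in Python rather than the R/MATLAB tools demonstrated in the workshop; the step names (`drop_missing`, `convert_units`) are hypothetical examples, not part of any presented tool.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("workflow")

def run_workflow(raw_records, steps):
    """Apply each processing step in order, logging what was done
    so the run is documented consistently every time."""
    data = raw_records
    for step in steps:
        before = len(data)
        data = step(data)
        log.info("step %s: %d -> %d records", step.__name__, before, len(data))
    return data

# Hypothetical steps for illustration only
def drop_missing(records):
    return [r for r in records if r.get("value") is not None]

def convert_units(records):
    # degrees F -> degrees C
    return [{**r, "value": (r["value"] - 32) * 5 / 9} for r in records]

clean = run_workflow(
    [{"value": 212.0}, {"value": None}, {"value": 32.0}],
    [drop_missing, convert_units],
)
# clean -> [{"value": 100.0}, {"value": 0.0}]
```

Because the same step list runs identically every time and each step is logged, the workflow itself becomes part of the processing documentation.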
Data Processing, harvesting workflows (W. Sheldon)
-GCE Data Toolbox for MATLAB
-Data management lifecycle
--Generic or specialized parsers
--Programmatic QC Analysis
--Interactive QC Analysis and Revision
--Automatic Documentation of QC Steps
--QC-aware data analysis and synthesis tools
-Real-world sensor data workflow
-Resources: MATLAB, Software Distribution and User Support
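The GCE Data Toolbox performs these QC steps in MATLAB; the sketch below is a simplified Python analogue showing the general pattern of flagging suspect values (rather than deleting them) while automatically documenting each QC step. The record layout and flag codes are illustrative assumptions, not the Toolbox's actual conventions.

```python
def qc_range_check(records, variable, min_valid, max_valid):
    """Flag values outside a valid range instead of deleting them,
    and append a note to each record's processing history."""
    checked = []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's data
        value = rec[variable]
        if value < min_valid or value > max_valid:
            rec.setdefault("flags", []).append(f"{variable}: Q (out of range)")
        rec.setdefault("history", []).append(
            f"range check on {variable}: [{min_valid}, {max_valid}]"
        )
        checked.append(rec)
    return checked

# -99 is a common missing-value sentinel in logger output
obs = [{"air_temp": 21.5}, {"air_temp": -99.0}]
checked = qc_range_check(obs, "air_temp", -40.0, 50.0)
```

Keeping flags and history alongside the values is what lets downstream analysis tools be QC-aware.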
Data integration example workflows
-Goal: a workflow that produces a unified dataset that can be automatically updated with new data and sites
-Step 1: Amalgamation of files
-Step 2: Harvest new data and sites
-ex: Climatological data
-ex: Forest Plots
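The two steps above can be sketched as follows, in plain Python with a hypothetical record structure and made-up site codes, in place of the actual climate and forest-plot workflows:

```python
def amalgamate(site_tables):
    """Step 1: combine per-site tables into one unified dataset,
    tagging each record with its source site."""
    unified = []
    for site, rows in site_tables.items():
        for row in rows:
            unified.append({"site": site, **row})
    return unified

def harvest_update(unified, site, new_rows):
    """Step 2: append newly harvested records for an existing or new site."""
    return unified + [{"site": site, **row} for row in new_rows]

# Hypothetical daily precipitation records from two sites
combined = amalgamate({
    "AND": [{"date": "2013-01-01", "precip_mm": 12.0}],
    "GCE": [{"date": "2013-01-01", "precip_mm": 0.0}],
})
# A later harvest adds a third site without reworking step 1
combined = harvest_update(combined, "NWT", [{"date": "2013-01-01", "precip_mm": 4.5}])
```

Separating amalgamation from harvesting is what allows the unified dataset to grow automatically as new data and sites appear.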
Provenance-tracking workflows in R
-What is provenance?
--From French "to come from"
-Data provenance: the information required to accurately document the history of a data item, including how it was created and transformed
-State of Data Provenance Today
--Standard analysis tools do not collect provenance
-Uses of Data Provenance
--Short, mid and long term examples
-From R Scripts to Provenance Graphs (Flow graphic)
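As a language-neutral sketch of the idea (the workshop demonstrates this in R), a provenance-tracking wrapper only needs to record, for each derived item, which inputs and which operation produced it; chaining such records yields the provenance graph:

```python
def with_provenance(func, named_inputs, provenance):
    """Apply one transformation and append a provenance record
    linking the output to its inputs and operation."""
    output = func(*(value for _, value in named_inputs))
    provenance.append({
        "operation": func.__name__,
        "inputs": [name for name, _ in named_inputs],
    })
    return output

def mean(values):
    return sum(values) / len(values)

provenance = []
raw = [3.0, 4.0, 5.0]
m = with_provenance(mean, [("raw", raw)], provenance)
# provenance now documents that m was derived from raw via mean
```

This is the kind of record that standard analysis tools do not collect on their own, which is why instrumenting scripts matters.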
Conclusion (W. Sheldon)
-Increased efficiency and scalability
-Design process required protocol formalization and review
-Workflow challenges and barriers
Online Resource Guide
-Other examples of existing workflows
-What application would most benefit from workflows?
-What would you need to implement workflows?