Harness: A New Development and Testing Environment for Big Data, Algorithms (and anything else)

Abstract/Agenda: 

Harness is a generic environment designed to handle easy loading, parsing, and evaluation of complex software projects and project groups. Large-scale systems consisting of multiple languages and intricate computational routines can be loaded into Harness as components and parsed into internal and external functions. These functions can in turn be combined into workflows, which are evaluated in terms of both execution metrics and workflow state. Evaluations with matching output types can be compared against each other for differences as projects evolve over time. Harness uses a generic style that exposes system components as web-enabled microservices, making it a good candidate for modernizing and standardizing all sorts of code bases.
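As a rough illustration of the comparison idea only, two evaluation outputs of the same type could be diffed key by key. This is a minimal sketch and not the Harness API: the function name `diff_evaluations` and the plain-dictionary layout of an evaluation output are assumptions made for this example.

```python
def diff_evaluations(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs.

    Hypothetical helper: `old` and `new` are plain dicts standing in for the
    outputs of two evaluations of the same workflow. A key that is absent on
    one side is reported with None on that side.
    """
    missing = object()  # sentinel marking "key not present"
    changes = {}
    for key in sorted(set(old) | set(new)):
        before = old.get(key, missing)
        after = new.get(key, missing)
        if before != after:
            changes[key] = (
                None if before is missing else before,
                None if after is missing else after,
            )
    return changes


# Two made-up evaluation outputs from different versions of a project.
run_v1 = {"station_count": 120, "mean_adjustment": 0.12}
run_v2 = {"station_count": 120, "mean_adjustment": 0.15, "flagged": 3}
changes = diff_evaluations(run_v1, run_v2)
```

Here `changes` holds only the keys that differ between the two runs, which is the kind of record that would let a team see how results drift as a project evolves.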

Harness was developed in concert with a reengineering effort of the pairwise homogeneity algorithm at NOAA's National Centers for Environmental Information in Asheville, North Carolina, and is a perfect match for the complex climate data domain.

This will be a general overview of the concept of the open source Harness software system, examining its tenets and potential utility in Big Climate Data. We plan on having the engineers who worked on the pairwise homogeneity algorithm reengineering available to discuss a real world application of the system.

We also hope to hear feedback, including which features would be most useful, to see where the project should focus its future development resources.

For more information and a more detailed look, the white paper on the project is available. 

 

Notes: 

Always looking for collaborators to help write language drivers, meta-workflows for machine learning, and other enhancements.

Contact the principal for more information:
[email protected]

Please see the attached PowerPoint, white paper, and poster for more information.

(Following notes added by attendee)

Harness: A Development and State Testing Framework

Domain characteristics

  1. Multiple languages and project dependencies
  2. Project duplication
  3. Diverse array of tools
  4. Large amount of documents
  5. Full stack production (design, collection, analysis, production, maintenance)
  6. The scope, breadth and depth of data issues

Domain trends:

  1. Technology -> more options
  2. Stakeholders -> more demanding
  3. Algorithms -> more complex
  4. Funding -> relatively stagnant
  5. People -> increasing specialization
  6. Obligations -> increasing

Our focus: data/expertise/technology diversity
    A. Pros: better breadth and depth; deeper understanding; more options to pick the best tools
    B. Cons: more noise, processing, deeper specializations; more hands in the pot, mismatch, computing

Unmanaged complexity:
Product management: Version control: code-manual, git…; Development: Trac, FogBugz, JIRA; Deployment:

Product Development Ideal:
Some changes in Team A and B
Why Ideal isn’t:

  1. No team develops in a bubble
  2. The ‘best’ tools are arbitrary
  3. Nothing escapes change
  4. Documentation is hard
  5. Datasets are large and complex

Potential for improvement:

  1. Similar things should be interchangeable and technology-agnostic
  2. Product changes should be documented

Improvement without adding complexity
Harness concept: flexible, lightweight
Purpose: controllable and predictable; emphasizing functionality through constraint

Taking the astronaut as an example: the solar system is the harness and the sun is the algorithm package; component functions - internal; component functions - external.

Functional workflows:
Language translation: communicate among Python, Java, Fortran
Accessible states: a retrievable record of all states through time
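The "accessible states" idea can be pictured with a small sketch: every step of a workflow leaves behind a snapshot that can be retrieved later. The names `StateRecorder` and `run_workflow` are hypothetical and chosen for this example; the notes do not describe how Harness actually stores state.

```python
import copy


class StateRecorder:
    """Hypothetical recorder: keeps a snapshot of the state after each step."""

    def __init__(self):
        self._history = []  # list of (step_name, state_snapshot) in run order

    def record(self, step_name, state):
        # Deep-copy so later mutation of `state` cannot alter the record.
        self._history.append((step_name, copy.deepcopy(state)))

    def state_at(self, index):
        """Return (step_name, state) as recorded at position `index`."""
        step_name, state = self._history[index]
        return step_name, copy.deepcopy(state)

    def __len__(self):
        return len(self._history)


def run_workflow(steps, initial_state, recorder):
    """Run (name, function) steps in order, recording the state after each."""
    state = dict(initial_state)
    recorder.record("initial", state)
    for name, func in steps:
        state = func(state)
        recorder.record(name, state)
    return state


# Made-up two-step workflow operating on a single value.
steps = [
    ("double", lambda s: {**s, "x": s["x"] * 2}),
    ("add_one", lambda s: {**s, "x": s["x"] + 1}),
]
recorder = StateRecorder()
final_state = run_workflow(steps, {"x": 3}, recorder)
```

After the run, `recorder.state_at(1)` returns the state as it stood right after the `double` step, independent of anything that happened afterwards.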
Concept to Design

  1. Design tenets, abstract and generic framework -> try not to make assumptions about the project
  2. Seamless and comprehensive utility -> automatically add DRY functionality without obfuscation
  3. Accommodation for development -> promote flexible development strategies

Harness architecture:

  1. Status:
  2. Component: local copy on disk
  3. Adding a function:
  4. Functions in Harness: functions contain many sub-functions
  5. A workflow in Harness: function->inputs->(metadata, order, required, source, target)
  6. Adding an evaluation: choose workflow -> define input sources, and optionally define output targets
  7. Structure architecture overview:
  • Structure templates provide translators between each other.
  • Add a structure (group metadata); add a template (metadata structure); add a field (index, template, metadata, type, order, required).

Case study: PHA practical application

Questions: a request for more details about the evaluation part; the answer was that it heavily depends on the use case.
UI is not a problem because Angular and D3 are used.

 

Attachments/Presentations: 
Citation:
Harness: A New Development and Testing Environment for Big Data, Algorithms (and anything else); 2016 ESIP Summer Meeting. ESIP Commons, April 2016