Towards systematically curating and integrating data product descriptive information for data users


Complete, consistent, and easy-to-understand information about data products is critical for meeting requirements for data discoverability, accessibility, usability, and interoperability.

In the Big Data and Open Data era, with an ever-increasing variety and number of data products, it becomes increasingly impractical to produce and maintain this information manually. The most effective way to ensure the completeness and quality of metadata and description documents of data products is to curate them in a systematic, consistent, and automatic fashion based on standards, community best practices, and defined frameworks.

Efforts to meet this goal have been carried out in various disciplines and projects. This session invites presenters to describe and share with the ESIP community their work and progress on systems, tools, frameworks, workflows, etc. that enable repositories/data centers to systematically generate and provide descriptive information about data products to data users for improved discoverability, transparency, usability, and interoperability. Additionally, this session will discuss gaps that still need to be addressed.

Agenda - Invited Presentations - 15 mins each (presentation + Q&A)

  • NSF/EarthCube X-DOMES (Cross-Domain Observational Metadata for Environmental Sensing) team - Gayanilo, Felimon: Application of Standards-based Description of Environmental Sensor Metadata
  • DataONE MetaDIG team/ESIP Documentation Cluster Contribution - Mecum, Bryce: Improving Metadata with Automated Quality Evaluation
  • NOAA OneStop Metadata Team - Zinn, Sonny: Design and implementation of automation tools for DSMM diagrams and reports
  • NSF EDI (Environmental Data Initiative) - O'Brien, Margaret: Metadata content standardization in the Environmental Data Initiative
  • WCRP/WDC Obs4MIPs (Observations for Model Intercomparisons Project) - Ferraro, Robert: Obs4MIPs - Satellite Observations Rehosted for GCM Model Evaluation



Notes were taken by Jamie Collins, ESIP Student Fellow

Introduction by Ge Peng 
Felimon Gayanilo, X-DOMES 

Funded by NSF EarthCube
Goal: Develop a standard framework for environmental sensor metadata
Based on a schema, “SensorML 2.0”
Key challenge: Need to develop a SensorML editor to allow instrument manufacturers and users to create the metadata
i. Will allow direct import from spreadsheets, etc.
ii. Objective: Encourage manufacturers to produce a SensorML file that can be transferred to the user upon sale of the equipment
Also involves creation of a repository (hosted by X-DOMES) to warehouse all the SensorML data
Teledyne has agreed to be the first manufacturer to participate
Audience questions: none
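
As a rough illustration of the spreadsheet import idea noted above, the following Python sketch turns rows of a CSV export into minimal SensorML-style XML. This is not the X-DOMES editor: the column names, the sample values, and the helper functions `row_to_sensorml` and `csv_to_sensorml` are hypothetical, and the element layout is a simplified reading of SensorML 2.0 that has not been validated against the schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace URI of SensorML 2.0. The element layout built below is a
# simplified illustration and is NOT validated against the schema.
SML_NS = "http://www.opengis.net/sensorml/2.0"
ET.register_namespace("sml", SML_NS)


def q(tag):
    """Return a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (SML_NS, tag)


def row_to_sensorml(row):
    """Build a minimal SensorML-style element from one spreadsheet row.

    Each column becomes one identifier term (label = column name,
    value = cell content). Column names are hypothetical.
    """
    system = ET.Element(q("PhysicalSystem"))
    identification = ET.SubElement(system, q("identification"))
    id_list = ET.SubElement(identification, q("IdentifierList"))
    for column, value in row.items():
        identifier = ET.SubElement(id_list, q("identifier"))
        term = ET.SubElement(identifier, q("Term"))
        ET.SubElement(term, q("label")).text = column
        ET.SubElement(term, q("value")).text = value
    return system


def csv_to_sensorml(csv_text):
    """Convert CSV text (one sensor per row) into a list of XML strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [ET.tostring(row_to_sensorml(row), encoding="unicode")
            for row in reader]


if __name__ == "__main__":
    # Made-up sample sheet, for illustration only.
    sheet = "manufacturer,model,serial_number\nExampleCorp,EX-100,0001\n"
    for document in csv_to_sensorml(sheet):
        print(document)
```

A real editor would also need the definition URIs, capabilities, and other sections that a complete SensorML description carries; the sketch only shows the row-to-document mapping.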

Bryce Mecum, NCEAS (UCSB) and DataONE, “Improving Metadata with Automated Quality Evaluation” 

Challenge: Currently, there are various definitions of “good” metadata (tends to depend on the user community)
Right now, most data NCEAS deals with lies somewhere between “minimal” and “pretty good”
i. One current set of recommendations/QA criteria comes from NCDD
ii. LTER program also has its own set of QA checks (32 in all, called “PASTA”)
Metadata can improve at multiple stages
NCEAS has produced a “Metadata Quality Engine” that can be deployed alongside all different types of software
i. Purpose is to grade the current state of the metadata associated with a particular dataset
ii. Supports any XML-based metadata standard
iii. Supports a number of languages (e.g., Python, R), so users don’t have to learn a new language just to check their metadata (can continue using the language in which the data was produced/analyzed)
iv. Supports a number of different checks; some very common, often-used checks are built into the system, but users can add their own
Can rate each QA item as critical or simply recommended (yields “Failure” versus “Warning” rating; also has “Informational” rating)
v. Workflow: Metadata → Metadata Quality Engine → Metadata Quality Report. Report comes as a nice visual web output; goal is to assist PIs in easily identifying issues
vi. Remaining challenge: What should the Engine show the user in the quality report? For example, % of checks passed, with or without comparison to peers
Audience questions:
i. Suggestion: A NetCDF compliance checker (presented yesterday) provides a large amount of feedback to the user
ii. Suggestion: NOAA has a checker which produces reports containing links directly to the “problem areas” flagged (makes it easy to diagnose)
iii. Suggestion: Need to define/record the metadata authority for a particular dataset which has been checked (often, the particular authority which defined the standard is not retained)
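
The check-and-grade workflow described in these notes can be sketched in a few lines of Python. This is not the Metadata Quality Engine itself: the `Check` class, the `has_text` helper, the `run_checks` function, and the element names in the sample document are all hypothetical. The mapping of an unmet "informational" check to an "Informational" status is also an assumption, as is the choice to report the percentage of checks passed.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    """One quality check: a name, a level, and a test on the XML root."""
    name: str
    level: str  # "critical", "recommended", or "informational"
    test: Callable[[ET.Element], bool]


# Status reported when a check is NOT satisfied, keyed by level.
# (Assumed mapping, based on the ratings mentioned in the notes.)
STATUS_ON_MISS = {
    "critical": "Failure",
    "recommended": "Warning",
    "informational": "Informational",
}


def has_text(path: str) -> Callable[[ET.Element], bool]:
    """Build a test that passes when `path` exists and has non-blank text."""
    def _test(root: ET.Element) -> bool:
        node = root.find(path)
        return node is not None and bool((node.text or "").strip())
    return _test


def run_checks(xml_text: str,
               checks: List[Check]) -> Tuple[List[Tuple[str, str]], float]:
    """Run every check; return (name, status) pairs and the % passed."""
    root = ET.fromstring(xml_text)
    results = []
    for check in checks:
        status = "Pass" if check.test(root) else STATUS_ON_MISS[check.level]
        results.append((check.name, status))
    passed = sum(1 for _, status in results if status == "Pass")
    percent = 100.0 * passed / len(results) if results else 0.0
    return results, percent


if __name__ == "__main__":
    # Made-up metadata record with hypothetical element names.
    record = "<metadata><title>Example dataset</title><abstract/></metadata>"
    checks = [
        Check("title present", "critical", has_text("title")),
        Check("abstract present", "recommended", has_text("abstract")),
        Check("keywords present", "informational", has_text("keywords")),
    ]
    results, percent = run_checks(record, checks)
    for name, status in results:
        print(f"{name}: {status}")
    print(f"{percent:.0f}% of checks passed")
```

Because each check is just a function of the parsed document, users could add their own checks without touching the runner, which mirrors the built-in plus user-defined split mentioned above.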

Sonny Zinn, NOAA NCEI, “Design and implementation of automation tools for DSMM diagrams and reports” 

Talking about NOAA OneStop
DSMM = Data Stewardship Maturity Matrix
i. Each dataset is evaluated in 9 different areas
ii. Produces nice visual summaries for users/data managers as .pptx files
Contains an embedded macro to color-code each matrix
iii. Also produces a report for the user/data manager
Future work: integration of CEdit
Audience questions:
i. Suggestion: Can pull out a JSON representation from Google spreadsheets
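
To make the JSON suggestion concrete, here is a small Python sketch that converts one dataset's ratings into a color-coded JSON matrix. It is not the NOAA OneStop tooling: the function `ratings_to_json`, the placeholder area names, the 1 to 5 level scale, and the color mapping are assumptions made for illustration; only the count of 9 areas comes from the notes.

```python
import json

N_AREAS = 9  # the notes state that each dataset is evaluated in 9 areas

# Hypothetical mapping from maturity level to cell color.
# The 1-5 scale and the colors are assumptions for this example.
LEVEL_COLORS = {
    1: "red",
    2: "orange",
    3: "yellow",
    4: "lightgreen",
    5: "green",
}


def ratings_to_json(dataset_id, ratings):
    """Serialize {area: level} ratings as a color-coded JSON matrix."""
    if len(ratings) != N_AREAS:
        raise ValueError(f"expected {N_AREAS} areas, got {len(ratings)}")
    cells = []
    for area, level in ratings.items():
        if level not in LEVEL_COLORS:
            raise ValueError(f"unknown level {level!r} for area {area!r}")
        cells.append({"area": area, "level": level,
                      "color": LEVEL_COLORS[level]})
    return json.dumps({"dataset": dataset_id, "matrix": cells}, indent=2)


if __name__ == "__main__":
    # Placeholder area names and made-up levels.
    demo = {f"area_{i}": (i % 5) + 1 for i in range(1, N_AREAS + 1)}
    print(ratings_to_json("demo-dataset", demo))
```

A structured representation like this could feed either a diagram generator or a text report from the same source, rather than relying on a macro embedded in the presentation file.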

Margaret O’Brien, UCSB/LTER, “Environmental Data Initiative” (“EDI”) 

Funded by NSF-DEB
Evolved from LTER data management system
i. Builds on LTER data management’s existing partnerships, e.g., with NCEAS and ESIP
Lies somewhere between “Aggregators” (DataONE, etc.) and “Researchers” who generate data (such as those at LTER)
i. Objective: Process & content standardization to enhance discovery and preservation
Margaret then described the standards currently used for various types of data
There’s also an automated metadata checking system with 33 checks (growing theme here…)
Challenge for LTER (and similar) data: “Data diversity”
i. i.e., often encounter new types of data without defined standards elsewhere; in this case, have to create new, custom definitions
ii. How to address?
Focus first on dataset design (encourage researchers to think about the new kinds of data they will be creating up front… and communicate that early with data managers!)
Provide a “skills exchange” that encourages researchers to share code & tools for common data conversions and manipulations
iii. Margaret then provided two examples
Audience questions: None

Robert Ferraro, JPL WCRP, “Obs4MIPs – Satellite Observations Rehosted for GCM Model Evaluation” 
Collaboration between DOE & NASA
Objective: Allow modelers to use & easily access satellite data that might be relevant for them (i.e., leverage under-exploited satellite observations to allow modelers to assess GCM effectiveness (particularly GCMs referenced in IPCC reports))
Challenge: Modelers tend to ignore/not use satellite data because it’s in different formats and they aren’t necessarily familiar with it
i. Particularly, modelers were confused about flags in satellite data and their meaning
ii. Modelers also often don’t have time to search for/pull out satellite data that would be useful to them (geographical associations aren’t always evident or straightforward)
Strategies:
i. Common data formats (harmonize structure of CMIP5 with NetCDF)
iii. Provide accessible “technical notes” for the modelers containing desired metadata for the satellite products that have been traditionally hard to find, e.g., report of uncertainties and various sampling biases in satellite observations, written at a “graduate student level”

Audience questions:

i. Who writes the “technical notes”? The decision is important for usability
ii. How to get “buy-in” from both satellite data producers and modelers (when those in each community might not be incentivized to share…)
Can leverage relevance. For satellite data providers: If your data isn’t in the database, it won’t get used… and in order to get it in the database, you have to produce a technical note
Peng, G.; Ritchey, N.; Gordon, S.; Towards systematically curating and integrating data product descriptive information for data users; Winter Meeting 2017. ESIP Commons, October 2016