Towards systematically curating and integrating data product descriptive information for data users
Abstract
Complete, consistent, and easy-to-understand information about data products is critical for meeting requirements for data discoverability, accessibility, usability, and interoperability.
In the Big Data and Open Data era, with the ever-increasing variety and number of data products, curating this information manually becomes increasingly impractical. The most effective way to ensure the completeness and quality of metadata and description documents for data products is to curate them systematically, consistently, and automatically, based on standards, community best practices, and defined frameworks.
Efforts toward this goal have been carried out in various disciplines and projects. This session invites presentations that describe and share work and progress with the ESIP community on systems, tools, frameworks, workflows, etc. that enable repositories and data centers to systematically generate and provide descriptive information about data products to data users for improved discoverability, transparency, usability, and interoperability. The session will also discuss gaps that still need to be addressed.
Agenda - Invited Presentations - 15 mins each (presentation + Q&A)
- NSF/EarthCube X-DOMES (Cross-Domain Observational Metadata for Environmental Sensing) team - Gayanilo, Felimon: Application of Standards-based Description of Environmental Sensor Metadata
- DataONE MetaDIG team/ESIP Documentation Cluster Contribution - Mecum, Bryce: Improving Metadata with Automated Quality Evaluation
- NOAA OneStop Metadata Team - Zinn, Sonny: Design and implementation of automation tools for DSMM diagrams and reports
- NSF EDI (Environmental Data Initiative) - O'Brien, Margaret: Metadata content standardization in the Environmental Data Initiative
- WCRP/WDC Obs4MIPS (Observations for Model Intercomparisons Project) - Ferraro, Robert: Obs4MIPs - Satellite Observations Rehosted for GCM Model Evaluation
Notes were taken by Jamie Collins, [email protected], ESIP Student Fellow
Introduction by Ge Peng
Felimon Gayanilo, X-DOMES
Goal: Develop a standard framework for environmental sensor metadata
Based on the OGC “SensorML 2.0” schema
Key challenge: Need to develop a SensorML editor to allow instrument manufacturers and users to create the metadata
Objective: Encourage manufacturers to produce a SensorML file that can be transferred to the user upon sale of the equipment (see the sketch after these notes)
Teledyne has agreed to be the first manufacturer to participate.
http://esipfed.org/earthcube-xdomes
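To make the SensorML objective concrete, here is a minimal sketch (not the X-DOMES editor or tooling itself) of generating a skeletal SensorML 2.0 record with Python’s standard library; the instrument identifier and description are placeholders for illustration.

```python
# Minimal sketch: emit a skeletal SensorML 2.0 PhysicalSystem record.
# Element names follow OGC SensorML 2.0 / GML 3.2; the field values
# are hypothetical, for illustration only.
import xml.etree.ElementTree as ET

SML = "http://www.opengis.net/sensorml/2.0"
GML = "http://www.opengis.net/gml/3.2"
ET.register_namespace("sml", SML)
ET.register_namespace("gml", GML)

def describe_sensor(uid: str, description: str) -> bytes:
    """Build a bare-bones SensorML PhysicalSystem document."""
    system = ET.Element(f"{{{SML}}}PhysicalSystem")
    desc = ET.SubElement(system, f"{{{GML}}}description")
    desc.text = description
    ident = ET.SubElement(system, f"{{{GML}}}identifier", codeSpace="uniqueID")
    ident.text = uid
    return ET.tostring(system, encoding="utf-8", xml_declaration=True)

# Hypothetical instrument identifier
print(describe_sensor("urn:example:sensor:ctd-001",
                      "Example CTD instrument description").decode())
```

A manufacturer-side tool along these lines could emit such a file at the point of sale, which an editor like the one described above would then let users extend.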
Audience questions: none
Bryce Mecum, NCEAS (UCSB) and DataONE, “Improving Metadata with Automated Quality Evaluation”
Right now, most data NCEAS deals with lies somewhere between “minimal” and “pretty good”
The LTER program also has its own set of QA checks (32 in all, called “PASTA”)
Metadata can improve at multiple stages
NCEAS has produced a “Metadata Quality Engine” that can be deployed alongside many different types of software
Purpose is to grade the current state of the metadata associated with a particular dataset
Supports any XML-based metadata standard
Supports a number of languages (e.g., Python, R), so users don’t have to learn a new language just to check their metadata (they can continue using the language in which the data was produced/analyzed)
Supports a number of different checks; some very common, often-used checks are built into the system, but users can add their own (a minimal sketch of such checks appears after these notes)
Can rate each QA item as critical or simply recommended (yields “Failure” versus “Warning” rating; also has “Informational” rating)
Workflow: Metadata → Metadata Quality Engine → Metadata Quality Report. The report comes as a nice visual web output; the goal is to assist PIs in easily identifying issues
Remaining challenge: What should the Engine show the user in the quality report? E.g., the percentage of checks passed, with or without comparison to peers
Suggestion: NOAA has a checker which produces reports containing links directly to the flagged “problem areas” (makes diagnosis easy)
Suggestion: Need to define/record the metadata authority for a particular dataset which has been checked (often, the particular authority which defined the standard is not retained)
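To illustrate the kind of check described above, here is a standalone Python sketch, not MetaDIG’s actual API: severity-ranked presence checks (Failure/Warning/Informational, matching the ratings mentioned in the talk) applied to any XML-based metadata record. The check names, element paths, and sample record are assumptions for illustration.

```python
# Sketch of XML-standard-agnostic metadata quality checks with
# critical ("FAILURE") vs. recommended ("WARNING") severities.
import xml.etree.ElementTree as ET

CHECKS = [
    # (check name, element path, severity if the element is missing)
    ("Dataset has a title",     ".//title",    "FAILURE"),  # critical
    ("Dataset has an abstract", ".//abstract", "WARNING"),  # recommended
    ("Keywords are present",    ".//keyword",  "INFO"),
]

def run_checks(metadata_xml: str) -> list[tuple[str, str]]:
    """Run every check and report PASS or the check's severity."""
    root = ET.fromstring(metadata_xml)
    return [(name, "PASS" if root.findall(path) else severity)
            for name, path, severity in CHECKS]

# Hypothetical EML-like record, for illustration only
record = "<eml><dataset><title>Kelp survey</title></dataset></eml>"
report = run_checks(record)
for name, status in report:
    print(f"{status:8s} {name}")
passed = sum(status == "PASS" for _, status in report)
print(f"{passed}/{len(report)} checks passed")  # one possible summary metric
```

The “% checks passed” summary at the end is one possible answer to the open question above about what the quality report should show.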
Sonny Zinn, NOAA NCEI, “Design and implementation of automation tools for DSMM diagrams and reports”
DSMM = Data Stewardship Maturity Matrix
Produces nice visual summaries for users/data managers as .pptx files (see the sketch below)
Future work: integration of CEdit
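The DSMM-to-.pptx step could look something like the following sketch, assuming the third-party python-pptx library; this is not NOAA’s actual tool, and the component names and maturity levels are placeholders.

```python
# Sketch: render a DSMM-style summary table into a .pptx slide.
from pptx import Presentation
from pptx.util import Inches

ratings = {  # hypothetical maturity levels, for illustration
    "Preservability": 4,
    "Accessibility": 3,
    "Usability": 5,
}

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[5])  # "Title Only" layout
slide.shapes.title.text = "Data Stewardship Maturity Matrix (example)"

rows, cols = len(ratings) + 1, 2
frame = slide.shapes.add_table(rows, cols, Inches(1), Inches(1.5),
                               Inches(6), Inches(3))
table = frame.table
table.cell(0, 0).text = "Key component"
table.cell(0, 1).text = "Maturity level (1-5)"
for i, (component, level) in enumerate(ratings.items(), start=1):
    table.cell(i, 0).text = component
    table.cell(i, 1).text = str(level)

prs.save("dsmm_summary.pptx")
```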
Audience questions:
Margaret O’Brien, UCSB/LTER, “Environmental Data Initiative” (“EDI”)
Evolved from the LTER data management system
Builds on LTER data management’s existing partnerships, e.g., with NCEAS and ESIP
Lies somewhere between “Aggregators” (DataONE, etc.) and “Researchers” who generate data (such as those at LTER)
Objective: Process & content standardization to enhance discovery and preservation
Margaret then described the standards currently used for various types of data
There’s also an automated metadata checking system with 33 checks (a growing theme here…)
Challenge for LTER (and similar) data: “Data diversity”
How to address this?
Provide a “skills exchange” that encourages researchers to share code & tools for common data conversions and manipulations
Margaret then provided two examples
Audience questions: None
Robert Ferraro, JPL WCRP, “Obs4MIPs – Satellite Observations Rehosted for GCM Model Evaluation”
Collaboration between DOE & NASA
Objective: Allow modelers to easily access and use satellite data that might be relevant to them (i.e., leverage under-exploited satellite observations to allow modelers to assess GCM effectiveness, particularly GCMs referenced in IPCC reports)
Challenge: Modelers tend to ignore/not use satellite data because it’s in different formats and they aren’t necessarily familiar with it
Modelers also often don’t have time to search for/pull out satellite data that would be useful to them (geographical associations aren’t always evident or straightforward)
Strategies:
Common data formats (harmonize the structure of the satellite data with the CMIP5 conventions in NetCDF; see the sketch after this list)
Provide accessible “technical notes” for the modelers containing desired metadata for the satellite products that has traditionally been hard to find, e.g., reports of uncertainties and various sampling biases in satellite observations, written at a “graduate student level”
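As a concrete illustration of the common-format strategy, here is a sketch of writing a satellite-derived field into a CF/CMIP5-style NetCDF file, assuming the third-party netCDF4 Python library; the variable, attributes, and values are placeholders, not an actual Obs4MIPs product.

```python
# Sketch: rehost a satellite-derived field as CF/CMIP5-style NetCDF.
import numpy as np
from netCDF4 import Dataset

with Dataset("obs_example.nc", "w") as nc:
    nc.createDimension("time", None)  # unlimited, as in model output
    nc.createDimension("lat", 2)
    nc.createDimension("lon", 3)

    lat = nc.createVariable("lat", "f4", ("lat",))
    lat.units = "degrees_north"
    lon = nc.createVariable("lon", "f4", ("lon",))
    lon.units = "degrees_east"

    # CF-style names/units so modelers see the conventions they expect
    ta = nc.createVariable("ta", "f4", ("time", "lat", "lon"))
    ta.standard_name = "air_temperature"
    ta.units = "K"

    lat[:] = [0.0, 1.0]
    lon[:] = [0.0, 1.0, 2.0]
    ta[0, :, :] = 280.0 + np.zeros((2, 3))  # placeholder values
```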
Audience questions:
How to get “buy-in” from both satellite data producers and modelers (when those in each community might not be incentivized to share…)?