Information Quality Cluster - Fostering Collaborations
The goals of the Information Quality Cluster (IQC), as expressed in its Strategic Plan, are: 1. Bring together people from various disciplines to assess aspects of the quality of Earth science data; 2. Establish and publish a baseline of standards and best practices for data quality for adoption by inter-agency and international data providers; and 3. Build a framework for consistent capture, harmonization, and presentation of data quality for the purposes of climate change studies, Earth science, and applications. Moving toward these goals, the IQC has been collecting and evaluating use cases to help identify issues and analyze them to arrive at recommendations for improvements in capturing, describing, enabling discovery of, and facilitating use of data quality information. The purpose of this session is to collaborate with other ESIP clusters whose primary emphasis is on utilization of Earth science data for research and applications. The collaboration aims to answer questions such as: What type of quality information is needed for their applications? Is it easy to find? Is it complete and easy to understand? What level of data quality is important for their applications (what can they “get away with”)? What improvements should be made in conveying quality information?
A brief introduction will be provided to familiarize new attendees with the IQC. Brief presentations will be made by invited panelists from other ESIP clusters, such as the Disaster Lifecycle Cluster and the Agriculture and Climate Cluster, representing data users’ perspectives, along with a panelist from the data provider community. The presentations will be followed by a discussion period to identify gaps, approaches to filling them, and ways of addressing issues.
- Introduction to Information Quality Cluster and Status - David Moroni - 8 minutes
- Panelist Presentations (Total of 32 minutes)
- Agriculture and Climate Cluster - Jeff Campbell (USDA) - 8 minutes
- Disaster Lifecycle Cluster - Karen Moe (NASA/GSFC) - 8 minutes
- CEOS/WGCV Land Product Validation - Pierre Guillevic (U of MD) - 8 minutes
- Obs4MIPS - Robert Ferraro (JPL) - 8 minutes
- Discussion - 50 minutes
* Information Quality Cluster - Fostering Collaborations
* David Moroni - Overview of IQC
Motivation: Find out what challenges data users face in terms of data quality issues
and what the IQC can do for them.
Purpose of cluster: To become internationally recognized as an authoritative
information resource for guiding implementation of data quality standards and
best practices.
** Information quality
- Scientific quality: accuracy, precision, uncertainty, validity and suitability
- Product quality: how well the scientific quality is assessed and documented.
Completeness of metadata and documentation.
- Stewardship quality: how well data are being managed and preserved.
- Service quality: how easy it is for users to find, understand, trust and use data.
- Share experiences.
- Evaluate best practices and standards for DQ.
- Improve collection, description, discovery and usability of info.
- Support data producers with info about standards and best practices.
- Consistently provide guidance to data managers and stewards.
Some global players: ESA, OGC, ESGF, OPeNDAP, etc.
- NASA ESDSWG data quality working group
- NOAA dataset lifecycle stage based maturity matrices.
- ISO metadata quality standards.
- EUMETSAT CORE-CLIMAX data system maturity matrices.
- GEOSS data quality guidelines.
- GCOS essential climate variables inventory questions.
- NCAR community contribution pages.
Active involvement with the Documentation Cluster, Discovery Cluster, and Data Stewardship Committee.
Use case evaluation summary.
- Dataset “rice cooker” theory: when heterogeneous data products are assimilated into a
final product, data quality problems can arise.
- Appropriate amount/extent of documentation for data use.
- Improving use of SBC LTER Data Portal.
- Citizen Science
- Capture, description, discovery, and usability of information about data quality in
Earth science data products are critical for proper use of the data.
- Promote standards and best practices.
- Use case submission and evaluation is ongoing.
* Jeff Campbell - National Agricultural Library
- Library (archive) data quality topics
Research data includes current and historical: meteorological, hydrological (flow, quality),
eddy flux of CO2 and non-CO2 gases, land management practices, soil
characteristics, biological outputs, remote sensing, socio-economic.
** Perspectives on info quality
- Controlled experiment paradigm is common.
- Many factors cannot be controlled in the experiment.
- Methodologies are designed to meet research needs of each application (e.g.
corn methods may differ from soybean methods).
- Differences in collection method are larger than sensor uncertainty.
** NAL Data management
- SFTP of CSV files to reduce security risk.
- File hash comparison, record counts and hashed values in data.
- Every change in data value is logged in secured database.
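The transfer-integrity checks above (hash comparison plus record counts) can be sketched minimally as follows. This is an illustration under stated assumptions, not NAL's actual pipeline; the function names and the manifest values (`expected_hash`, `expected_rows`) are hypothetical.

```python
import csv
import hashlib
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Hash the file in chunks so large CSVs are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_count(path: Path) -> int:
    """Count data rows, excluding the header line."""
    with path.open(newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

def verify(path: Path, expected_hash: str, expected_rows: int) -> bool:
    """Accept a transferred file only if both checks match the sender's manifest."""
    return (sha256_of_file(path) == expected_hash
            and record_count(path) == expected_rows)
```

Comparing both a cryptographic hash and a row count catches two different failure modes: the hash detects any byte-level corruption in transit, while the row count gives a quick, human-checkable sanity figure.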
** Consistent QA/QC for meteorological data.
* Karen Moe - Disasters lifecycle cluster
- Wanted to create "trusted data", approved by the ESIP federation, for use in
data-driven decision making.
- What would it mean to be "trusted"?
** Trusted data concepts
- People use data that they trust, leading to greater consistency, compliance and accuracy.
- People want to feel confident that their data is in the best condition
possible, to ensure that their actions are accurate, timely, effective, and
conform to their requirements.
- Should come from carefully selected sources, be transformed in accordance with
the data's intended use, and be delivered in appropriate formats and time frames.
- Meet conditions of completeness, quality, age, schema, profile, and documentation.
  - Complete: sufficient for decision making.
  - Current: freshness, speed of delivery.
  - Consistent: metadata management.
  - Clean: the result of data quality techniques such as standardization, verification, matching and de-duplication.
- Users' perceptions of data quality are the biggest challenge to trust.
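The "clean" techniques mentioned above (standardization, matching, de-duplication) can be illustrated with a minimal sketch. The field names and match keys here are hypothetical, and real cleaning systems typically use fuzzy matching rather than the exact-key matching shown.

```python
def standardize(record: dict) -> dict:
    """Normalize free-text fields so trivially different entries compare equal."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def deduplicate(records: list, keys: tuple) -> list:
    """Keep the first record seen for each standardized match key."""
    seen = set()
    out = []
    for rec in map(standardize, records):
        match_key = tuple(rec[k] for k in keys)
        if match_key not in seen:
            seen.add(match_key)
            out.append(rec)
    return out
```

For example, `" Station A "` and `"station a"` standardize to the same value and would be collapsed into one record when matched on the site field.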
** Trusted data characteristics, as determined by meetings:
- Provide actionable info.
- Sharable, tech interoperability for user community.
- Common operational data / common view
- Sensitive info sharing framework / trusted location
- Expedited access (low latency).
** Data driven decision making
- Providing appropriate info for decision makers to enable situational awareness.
- Data is standardized, simplified, easy to access.
- Knowledge-based, value-added bundle of data provided in time to help make
adequate, timely decisions.
- Integrated data and info in formats easily recognized by decision makers;
easily incorporated into a Common Operating Picture.
- Data driven decision making: use data if expeditious and available, clearly
connected to problem at hand.
** Drivers for the disasters response user community
- Need to capture user feedback
- Determine pathway for generating collection-level metadata for NASA systematic data.
- Trust, safety and speed are the key drivers.
- Make it easy for users to access and use the data.
* Pierre Guillevic - CEOS/WGCV Land Product Validation
- Role of LPV is to provide well-characterized uncertainty of data for users (e.g. GEOGLAM).
** Focus areas:
- Snow cover, sea ice
- Surface radiation
- Land cover
- Leaf area index
- Biomass and NDVI
** Subgroup objectives
1. Foster and coordinate quantitative validation of higher-level global land products.
2. Increase the quality and efficiency of global satellite product validation by
developing and promoting international standards.
- Validation framework: 3 components integrated into master tool.
** Example: leaf area index (released in 2015).
Presents the definition of the product, the product's attributes, and the physics behind the measurements.
- Characterize in situ measurements and validation datasets at satellite product
resolution including uncertainty.
- Validation challenges: spatial vs. temporal variability. Need seasonal validation reports.
- Critical need for ground truth. Cannot necessarily compare with heritage products.
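The step of characterizing in situ measurements at satellite product resolution can be sketched roughly as follows, assuming point measurements are simply averaged within each pixel. The pixel identifiers and the choice of metrics (bias and RMSE) are illustrative, not the LPV protocol itself.

```python
from statistics import mean

def upscale(in_situ: dict) -> dict:
    """Average point measurements within each pixel to the product resolution.

    in_situ maps a pixel id to a list of point measurements inside that pixel.
    """
    return {pixel: mean(vals) for pixel, vals in in_situ.items()}

def bias_and_rmse(satellite: dict, reference: dict) -> tuple:
    """Mean error and root-mean-square error over pixels present in both maps."""
    diffs = [satellite[p] - reference[p] for p in satellite if p in reference]
    b = mean(diffs)
    rmse = mean(d * d for d in diffs) ** 0.5
    return b, rmse
```

A product can show near-zero bias yet substantial RMSE, which is one reason the notes stress seasonal validation reports: spatial and temporal variability show up differently in the two metrics.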
5-year roadmap: 2017 land surface temperature, albedo, burned area.
Conclusion: coordinate validation efforts internationally.
* Robert Ferraro - Obs4MIPs
** Two interrelated questions:
- How to bring as much observational scrutiny as possible to the CMIP/IPCC process?
- How to best utilize the wealth of satellite observations for CMIP/IPCC Process?
Many variables measured by satellites that are also output/used by CMIP models.
However, they don't necessarily match.
** Finding what you need:
- Fit for purpose
** Understanding what you found:
- Data formats
- Documentation: not always linked to the data.
** Is it any good?
- Error characteristics
- Uncertainties: often not provided.
** Obs4MIPS approach:
- Curated specifically for model evaluation on the ESGF.
- Contributions are reviewed by a task team before acceptance.
- Required format, metadata convention, and metadata content.
- Evidence of prior use for model evaluation is required (peer reviewed).
Reformatting and rehosting allowed.
- Uncertainty characterization required, with per-datum error/uncertainty estimates.
- Data and documentation that are tailored to the customer.
- Questions: how to handle in-situ data.
- Does this approach scale? Is this used as a criterion?
- (Robert) Modelers are inherently distrustful of satellite data due to measurement
error. To assuage this fear, uncertainty is provided. Thresholds on
measurement error are generally much smaller than the spread between models.
- Possible general approach: putting users in touch with the people that do
CAL/VAL, with goal of improving trust.
- (Robert) 2000 papers published on CMIP5 -- user base is very large. Not
- (Obs4MIPS) Consider using DOI. Difficult for novices to find data. Consider
using data bundle.
- (Robert) There is a document that each contributor has to fill out (small).
Includes error characterization targeted at people with little knowledge about
- (Jeff) What is the concept of data-sharing? Often little incentive to provide data.
- (Robert) Data scientists and computer scientists often don't get credit for
their products. Difficult to get scientists to write documentation. Data
conversion often takes just a couple of days; however writing documentation
can take much longer.
- What about having data papers instead of scientific papers?
- (Jeff) 6 of 18 agricultural sites have produced data papers. Discovery tools
are still lacking.
- What are some of the differences you've observed with respect to CAL/VAL?
- Even with the same product, you have multiple teams using different
techniques. Goal is to agree on a validation protocol.
- What does the selection process look like for evaluating products?
- New products include biomass (motivated by the GEDI mission).
- Often little incentive to publish data. Tenure committees don't necessarily
give articles the same weight. Perhaps it's the same with data. Perhaps better
quality data are more worthy of tenure/promotion.
- (Pierre) For data, best journals are the users. NASA requires you to identify
stakeholders, users, and usefulness of data.
- Data itself has inherent value.
- (Karen) In disaster response (e.g. power grid outage) timely response is worth
$12 million per hour.
- In disaster response, there may be difficulty in getting the right kind of
data (e.g. real-time data).
- (Karen) NASA's LANCE system produces real-time data. Often producers are not
working with user community.
- (Maggi) Rapid-assessment products; satellite products come later. Work with
user community to inform and manage expectations.