Data Quality

Abstract/Agenda: 

The generation, delivery, and access of Earth Observation (EO) data quality information is a difficult problem because quality is not uniquely defined, is user dependent, is difficult to quantify, is handled differently by different teams, and is perceived differently by data providers and data users. Initiatives such as the International Organization for Standardization (ISO) 19115 and 19157 standards are important steps forward, but they are difficult to implement, too complex, and out of reach for the majority of data producers and users. Most users only want a quick and intelligible way to compare data sets from different providers and find the ones that best fit their interests. Therefore, we need to simplify the problem by focusing on a few relevant quality parameters and developing a common framework to deliver them.

 

This session is intended to tap into the audience's knowledge and expertise on understanding and using data quality to refine the proposed “Data Quality Matrix Concept” put forward during the ESIP Winter Meeting in January. The Quality Matrix is a “consumer report”-style view that displays no more than 20 quality parameters, with an importance rating for each product and group of users.

 

Notes: 

Gilberto Vicente

Gilberto introduced the session; there will be three speakers: one is calling in, and Rama and Eric are also presenting.

Main issue: for both users and producers, data quality is a major problem, but there is no solution that works for both.

The session's purpose is to work on combining efforts: how can we connect individual users and producers, especially those that might not be communicating already? Building on the January session, this session will follow up on those ideas, specifically the data quality matrix.

Introduction

Scope of the session: the themes will be top-down, ground rules, presenting the information, from the more expert to the less expert, and "what can I do?"

Out of scope: the details of ISO 19115 and ISO 19157, individual data sets, and personal interests in this topic.

Gilberto talked about uniformity and simplicity and how that is achieved in business (nutrition labels, car shopping, etc.), and how that can be applied in our domain. The private sector has already found ways to address these issues, as have areas like reporting hurricane intensity and other essential information (categories 4 and 5 refer to wind speed and what that implies for safety measures). The EPA has a simple way of describing air quality.

He reviewed user classifications for various interest groups: the features one might expect from, say, graduate students, or from researchers in a specific field, may differ from those of other groups. As data producers, being aware of the information needs of these various interest groups requires you to know more about who is using your data, and how best to convey quality information to each group.

Gilberto showed some examples from Chris Lynnes from the 2014 Winter Meeting: specific examples of five different users of the data.

Next he showed how we see things today versus how we should be doing things in the future with a proposed system. This includes an area for one standard organization format plus classified user groups, as opposed to the current system of different formats, where all users have to figure out the formats before they can access the data.

Shipping containers as an example: define the standards, make them simple, and people will accept them.

If we give people boxes, they will fill them in, as with a food label or other data quality requirements.

Steve “NASA perspectives on data quality”

Overall goal: to answer the user's question, "Which product is better for me?"

Moving from data to information, to knowledge, to wisdom (in the future); some of what we are doing parallels this. Aspects of quality: users' impressions of the quality of the data, spatial and temporal completeness, how mature it is; intrinsic as well as extrinsic aspects.

Next, structured information quality: aspects of quality presented in a way that allows quick choices, like the food labels.

Knowledge of quality: tools that can make inferences about the right product. How do you make that knowledge quickly extractable by the widest possible range of users? A consumer quality sticker, etc. This matrix or sticker is an abstraction over the data and the standards that provides users with a quick view of fitness for a particular purpose.

How do you do this? You have to understand users' needs. Depending on what the purpose is (developing an app, etc.), what is their list of needs? Assign relative importance to those items and make it a profile that can define, at some level, the broad user groups. At the last meeting they discussed users and how one might use the data, and voted on what the important needs might be. From this sort of activity they might be able to develop a data quality sticker, say for a climate researcher (but not one doing modeling), or for someone developing apps. The hard part is defining these things and representing them somewhere; they also need to come up with information for users who do not fall into the classified user groups. You can decide what is useful to you, but this is the hard part. We are already good at representing quality; the challenge is communicating it quickly to users who do not want to look at the details, and extracting it in a concise, usable way.
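
A minimal sketch in Python of how such a quality matrix and user-group profile might combine into a single "sticker" score; the product names, parameter names, weights, and ratings below are hypothetical illustrations, not values from the session:

# Hypothetical quality matrix: each product rated 0-5 on a few quality parameters,
# combined with per-user-group importance weights to yield one "sticker" score.
QUALITY_MATRIX = {
    "Product_A": {"spatial_completeness": 4, "temporal_completeness": 5, "maturity": 3},
    "Product_B": {"spatial_completeness": 5, "temporal_completeness": 3, "maturity": 4},
}

USER_PROFILES = {
    # User group -> relative importance (weights summing to 1) of each parameter.
    "climate_researcher": {"spatial_completeness": 0.2, "temporal_completeness": 0.5, "maturity": 0.3},
    "app_developer": {"spatial_completeness": 0.5, "temporal_completeness": 0.2, "maturity": 0.3},
}

def sticker_score(product, user_group):
    """Weighted average of parameter ratings under one user group's profile."""
    ratings = QUALITY_MATRIX[product]
    weights = USER_PROFILES[user_group]
    return sum(weights[p] * ratings[p] for p in weights)

# "Which product is better for me?" -- rank products for a given user group.
for product in QUALITY_MATRIX:
    print(product, round(sticker_score(product, "climate_researcher"), 2))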

Question about intrinsic and extrinsic data quality points: data usability, data maturity, etc. We are not going to re-educate people, but we need to address whether people will find the data useful or not. The questioner asked about focusing on the extrinsic data; Steve mentioned not wanting to overload the user.

Question: is there a plan to address different types of data products, which might have different features? Gilberto discussed that they have many different products but use different packaging, and do not always think about the needs of user A vs. user B, who might have different needs for the same product. How can they find the categories a user fits into and target the presentation accordingly? The questioner asked about different users and different products: will they address this? Gilberto said they would make a matrix with different user classifications and data types (?). Steve followed up: this is about which format is recommended for different types of users, in a standard way, extended to other user groups that are not typically the focus, like applied science, and making data more available to entrepreneurs and commercial users. The same information, but each user group would be able to pick from that list.

Question about data quality statements: the questioner gave an example and asked about how such statements evolve. This is hard to generalize, though they like the direction this is going in. So, a simple quality statement: how much will it cover? For many of their data sets this might not work. Steve said it might not do more than the quick look; you still have to look closer at the data, at the lower-level or more detailed information, to determine if it fits your needs. The questioner's solution is to have users contact the original researcher. Steve said yes, but this is also meant to solve problems like "I can't figure out which of the many products is right for my needs." Once you get down to your particular use or application, what you are trying to do with the tool, then yes, you will have to look at the more detailed information. Gilberto said the more detailed information would still be there, like the ingredient list on a food label. Steve said it is more like a Walmart greeter: someone who can direct you where to look.

Ward: as user groups expand, how often have you encountered different groups using different terminology? Steve said this has to be addressed, the same way such issues were addressed in ISO 19115, but agreed we have to do better. Gilberto said that as a result of the chaos we created, we have to go back and clean it up. Beyond getting a mission off the ground, we have to consider users outside the mission's planned users. He recalled a request for the data in a specific format from users who did not care about learning each other's fields. How can we fit these users' needs?

Question on scoping: with standard products, are you looking at standard reproducible data or other types as well? Steve said that data quality has a scoping problem. The questioner said that this metaphor does not lend itself well to the problem as it moves on. Gilberto suggested that people will keep these things in mind at the start of a project instead of as a secondary concern. The questioner talked about engaging customers and developing a product, and wants to test how extensible the metaphor is. Gilberto replied about trusting the data and users with various capabilities: is there continuity? What about clouds? etc. Steve said they have to focus on the higher-level details first.

Question about data quality and open source software: the more people who see it, the better the comments you get. So how do you get engagement with the community and the ability to contextualize their comments? Steve said that should be a factor, along with crowdsourcing within the community: whether something is widely used, widely supported, etc. Nancy said it sounded more like Yelp; it must be clear and authoritative what the stars represent. Steve said you have to make sure it is disentangled from intrinsic data quality: those measures are determined and not up for discussion. Not the uncertainty, but the other things about how the data can be used, for evaluation.

Questioner asked about open resources like edX: massive open online systems that can be used for education and forums for discussion. Steve said they are thinking about this: how to run webinars that target specific users. The ARSET (?) program, out of Applied Sciences, works with different user groups on how to use NASA data ("here are some tricks," etc.) and gets feedback from those groups. Ana Prados (sp?).

Bob: looking at the original analogy (peanut butter), if you were studying the users, you would be asking about opinions of taste, whereas the label carries determined information that is important for anyone who eats it, regardless of why or when. We need to determine those measures. Gilberto agreed; this is the type of information they are trying to capture, map, and communicate. How can I facilitate comparison of data from NASA and another source like NOAA: which is better for my study? How do I do this? Steve mentioned again the differences between intrinsic and extrinsic quality: things that are a matter of preference rather than the standards of the lab. How do you factor those in and normalize them so they can be compared?

Questioner, re: Yelp: if anyone goes down that road, an advanced user might think something is easy that novice users find hard. Steve: that is why it is based on a huge pool of users, so the outliers will be few. But we do have to be careful about doing this and setting the context appropriately.

Mustapha asked about different recipes for peanut butter, and using the makeup to help determine appropriateness for users. Steve mentioned recording whether other similar users found it appropriate or not, and how to capture that information: social aspects, etc. That will be harder. We can grab what we can and hope for the best, but it must be presented to the users in some way.

Rama “Product quality and documentation”

This cluster needs to think about how to work in this chaotic environment. He started with some motivation and context: scientists are motivated to make high quality products and to make users aware of limitations, uncertainties, and other intrinsic information. Users need to know the quality, and there are many ways of expressing data quality across different organizations, which makes it difficult for both groups. Data centers should be in a position to explain this information.

Referring to Gilberto's presentation: we need similar-looking interfaces so that providers can answer questions once rather than ten times, with the answers outward facing.

Background: Rama listed a number of activities starting in 2010 and leading up to 2013, focused on NASA's MEaSUREs program (2012), NCAR's community contribution pages (2013), and WGISS standards from CEOS in 2013.

Product quality checklist created by NASA a few years ago under the MEaSUREs program. There is a distinction between science data quality and product quality: for example, science data maturity and where the data are best used, versus product quality items such as publications included in the package. They generated two checklists, one for PIs and one for DAACs. PIs make long-term recommendations that are delivered to the DAACs, who can fill things out with the PI's point of view. It has been adopted and is being implemented in a number of projects. Rama provided an example of the PI's product quality checklist; most items are yes or no questions, in three sections: science data quality level, documentation quality level, and usage/satisfaction.

Product quality checklist from the DAACs: science data quality level, documentation quality level, accessibility/support services, and usage/satisfaction from the users' input.
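
A rough sketch of how the two checklists might be encoded side by side so the PI and DAAC versions stay comparable; the section names follow the notes above, but the individual questions are hypothetical placeholders, not the actual MEaSUREs checklist items:

# Section names follow the notes above; the questions are illustrative
# placeholders rather than the real MEaSUREs checklist wording.
PI_CHECKLIST = {
    "science_data_quality": ["Has the product been validated against independent data?"],
    "documentation_quality": ["Is an algorithm description document available?"],
    "usage_satisfaction": ["Are known uses of the product documented?"],
}

DAAC_CHECKLIST = {
    **PI_CHECKLIST,  # the DAAC list adds accessibility/support to the PI sections
    "accessibility_support": ["Are support services and contact points listed?"],
}

def blank_answers(checklist):
    """Most items are yes/no questions; None marks 'not yet answered'."""
    return {section: dict.fromkeys(questions) for section, questions in checklist.items()}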

NCAR Climate Data Guide, community contributions pages: a list of questions shown on the slide. This is for users. It also provides a summary statement about suitability for climate research and modeling.

CEOS WGISS metadata quality exploration questionnaire: "why did you choose this data set?" etc.; questions shown on slides. You can see some similarity between these and the other documents. Different organizations have developed different approaches. As a cluster, we need to find a way to simplify the questions being asked.

Question: how do these questions map to the ISO metadata standard? Rama said this has been suggested and they do need to do it.

Janet F, question and comment: it was great to hear on the first slide about qualified people asking the right questions before making data public; something different than good, bad, or not determined. Question: is this group working with GeoViQua (http://www.geoviqua.org/Outreach.htm), which is making data quality labels? It is part of GEOSS, with brokering efforts such as transforming parts of the ISO standards into the labels, and markup languages to describe quality, uncertainty, etc. Perhaps we can work together.

Bob: what might be some initial steps this working group might take to go beyond this? Rama: there we have a practical problem. How does this apply to other agencies? How do they all negotiate the commonalities and differences? And how do you provide access to commonly asked questions from different groups?

Gilberto mentioned different file formats and how that gets to the core of the session: how do you put this in one place, and how do you express this information? You can make a wiki or other services; if we can come up with a framework, those things can be populated instead of having to be translated.

Rama said we should look at the ISO 19115 standard: if everyone provides the information required, as a provider, is there one way everyone can understand, with multiple ways of answering the questions?
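
A minimal sketch of what "one way everyone can understand" could look like in practice, assuming an ISO 19139 XML encoding of an ISO 19115 record in a hypothetical file metadata.xml; a common framework could read lineage statements from the standard gmd:dataQualityInfo location:

import xml.etree.ElementTree as ET

# Namespaces used by the ISO 19139 XML encoding of ISO 19115 metadata.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# "metadata.xml" is a hypothetical ISO 19139-encoded record.
root = ET.parse("metadata.xml").getroot()

# Quality reports and lineage live under gmd:dataQualityInfo in ISO 19115.
path = (".//gmd:dataQualityInfo/gmd:DQ_DataQuality"
        "/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString")
for stmt in root.findall(path, NS):
    print("Lineage statement:", stmt.text)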

Question: who are the data providers? Rama: a scientist, a PI who has been funded to produce materials, etc.

David, from personal experience with a specific type of data: there needs to be more rigorous data testing, especially for data that have been published and later have to be taken down. At some point we need to run tests to ensure people processed the data correctly (like ground truthing) so that one is not embarrassed in the future; something the DAACs should do instead of the PI. Steve said right now that is assumed to be done, but that is a piece we are peeling off: assuming the intrinsic quality of the data has been addressed, how do we assess it along with other aspects in a standard way? We still have to do better than just the ISO standard. David suggested a data audit.

Gilberto talked about calls for proposals: you will have to deliver the data with these features addressed, or as a by-product. David said this data is still being processed, and the errors went back to 1981. He elaborated on the example and how they found errors in the data quality that a few simple checks would have discovered, preventing 30 papers from being published with flawed data. Steve said that is important, but we are not addressing it here.

Discussion of errors in systems and quality: what is good enough for some might not be for others; some errors are not human but system errors, etc.

Mustapha: what about including this in a data management plan, in one framework? Rama said yes; if we can come up with a set of questions we all agree on, it needs to be broader than NASA, broader than the US. It is a global issue. Gilberto said this is why it is good to have this discussion at ESIP.

Discussion of ISO as a dictionary, and how to develop phrases people can understand.

There is a monthly telecon, and a session on Thursday will continue the discussion.

John Bates “Assessing the maturity of climate data records”

Cancelled: the speaker was not able to call in.

Citation:
Data Quality; Summer Meeting 2014. ESIP Commons, March 2014