Data/Information Quality BoF


Carol introduces the discussion topic and the first speaker


Gilberto Vicente from GES DISC

Problem: Data quality is an ill-posed problem because it is hard to define, user dependent, difficult to quantify, handled differently across teams, and perceived differently by different data providers and users

Data Quality aspects: elements of completeness and consistency

Quality concerns are poorly addressed: lower priority than the instrument, little attention to validation between types of data

Addressing what users want vs. what they get: they want gridded data without gaps and they get swath data with poorly defined quality flags

User perspective: make do with what they can get

Poses a question to the group:

Can we develop a framework to help the community deal with the assessment and quantification of data quality?

Reminder that this work was started by Greg Leptoukh

Introduces the next speaker, Chris Lynnes, from NASA Goddard

Passes out clickers to use during the presentation

Eye of the beholder: User-oriented data quality

Addresses facets of data quality:

  • accuracy

  • completeness

  • consistency

  • resolution

  • ease of use

  • latency

Introduces a thought experiment: if you were a museum curator (designing an exhibit on wildfires with cool satellite images), what element of data quality would be most important to you?


It is clear that, depending on what kind of user you are, different facets of data quality are most important to you

This is something that is already being done: User Needs Analysis working group


  • develop a user model: both human and machine (includes DSS)

  • inventory sources of user input

  • starting with ASDI survey comments

  • assess the importance/utility of each complaint or suggestion

  • look at relative scores
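The notes do not record how the working group computes its relative scores. As a minimal sketch only, assuming each comment is rated for importance and utility on a hypothetical 1-5 scale, a relative score per topic could be computed as below. The comment topics, the ratings, and the importance-times-utility rule are all illustrative assumptions, not the group's method.

```python
from collections import defaultdict

# Hypothetical survey comments: (topic, importance 1-5, utility 1-5).
# These values are invented for illustration only.
comments = [
    ("quality flags unclear", 5, 4),
    ("gaps in gridded data", 4, 4),
    ("quality flags unclear", 4, 5),
    ("slow delivery", 2, 3),
]

def relative_scores(rated_comments):
    """Sum importance * utility per topic, then normalize so scores sum to 1."""
    totals = defaultdict(float)
    for topic, importance, utility in rated_comments:
        totals[topic] += importance * utility
    grand_total = sum(totals.values())
    return {topic: score / grand_total for topic, score in totals.items()}

scores = relative_scores(comments)

# Print topics from highest to lowest relative score.
for topic, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{topic}: {score:.2f}")
```

Normalizing to a sum of 1 makes the scores comparable across surveys with different numbers of comments. Any other weighting rule could be swapped in without changing the structure.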


Shows list of user classes

Shows an example of the assessment chart

Suggestion that we need to repurpose the user needs analysis methodology

Step 1: Quality Needs Assessment

Step 2: Quality Communication (We should be looking at metrics, but also beyond that to visuals)

Asks for questions

Q: asks to clarify ease of use

A: Put yourself in the shoes of a museum coordinator and ask whether you could understand specific data sets

Q: Suggests a useful publication for user needs categories

Suggestion about groups of professionals (policy members, etc.) and how to reach them

Q: Questioning whether latency is actually an attribute of quality

A: It is, if you consider quality as fitness for use; often there is a trade-off with accuracy

A: Latency is timeliness - a more general understanding

A: Stressing the phrase “fitness for use” and the ability to measure this

Suggestion: a great idea coming out of the question of latency: how do we discuss the trade-offs between these issues of quality?

Example: selection bias, a type of bias that is a trade-off for several of these quality aspects

Suggestion: formal ISO standards - stating a range to deal with uncertainty - computing an uncertainty statistic

Response: this does not align with what the community needs - the importance of developing a template that does address these in a model

Suggestion: Encourages using some of the previous work on data quality measures that has already been done

Individual offers use case example: Calypso

Reminder that we have to talk in the language of the user

Suggestion that the key is that we have to use these meetings to have these discussions

Suggestion: begin to think about the interaction between the interface and the user as a librarian would a “reference interview”: collapsing down the choices, disclosing the hints, and allowing the user to

Suggestion: Greg was trying to get at a consumer reports approach: you have all these different user bases - but you have a template

Suggestion: Stresses the act of a dialogue

Response: Questions how this idea would work with machine users

Suggestion: don’t want to focus on just one area but keep addressing all these user groups

Wants to extend this beyond just the search process, but also in the management aspect

Uncertainty as having three different scales

Individual value

How does the uncertainty vary, and how would you summarize it for a single file

For a whole collection, how do you summarize the characteristics so that they come back to a global statement on the collection
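The three scales above (individual value, single file, whole collection) can be sketched as a simple roll-up. This is only an illustration, assuming each file carries a per-value 1-sigma uncertainty. The file names, the numbers, and the choice of summary statistics are hypothetical and were not proposed in the session.

```python
import math
import statistics

# Hypothetical collection: file name -> per-value 1-sigma uncertainties.
# Scale 1 (individual value) is each number in these lists.
collection = {
    "granule_a": [0.10, 0.12, 0.30, 0.11],
    "granule_b": [0.20, 0.22, 0.25],
}

def summarize_file(uncertainties):
    """Scale 2: describe how the per-value uncertainty varies within one file."""
    return {
        "n": len(uncertainties),
        "min": min(uncertainties),
        "median": statistics.median(uncertainties),
        "max": max(uncertainties),
        "rms": math.sqrt(sum(u * u for u in uncertainties) / len(uncertainties)),
    }

def summarize_collection(files):
    """Scale 3: one global statement, built here by pooling every value."""
    all_values = [u for values in files.values() for u in values]
    return summarize_file(all_values)

file_summaries = {name: summarize_file(v) for name, v in collection.items()}
collection_summary = summarize_collection(collection)

for name, summary in file_summaries.items():
    print(name, summary)
print("collection", collection_summary)
```

Pooling all values is the simplest possible collection-level summary. It hides file-to-file differences, which is one reason the per-file summaries are kept alongside it.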

Society for Industrial and Applied Mathematics (SIAM) - deals with calculating uncertainty

May want to try to cherry pick some of the issues of quality that are easier to solve first

Suggestion about the quality of the container of the data, not just the data itself

Asks whether anyone else wants to discuss examples and ideas from their institution/ organization

GCMD has tried to do things along this line in the past and they would love to have something like this to use

This would be a way to begin building some consensus in one area

Addressing the DOI example which is a similar situation

Suggestion to take into account the amount of work that can be done in this area given the budget

Monte Carlo approach - we need some way of simplifying statistical processes for discussion with different groups, but the rigor that the subject requires makes this very complex
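For readers unfamiliar with the Monte Carlo approach mentioned here, a minimal sketch follows. It assumes independent, normally distributed input errors and a hypothetical derived quantity (a simple sum). The function name `monte_carlo_spread` and all numbers are illustrative and do not come from the session.

```python
import random
import statistics

def monte_carlo_spread(func, means, sigmas, n_samples=20000, seed=0):
    """Estimate the mean and spread of func(inputs) by random sampling.

    Each input is drawn from a normal distribution with the given mean and
    1-sigma uncertainty; inputs are assumed independent (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    outputs = []
    for _ in range(n_samples):
        inputs = [rng.gauss(m, s) for m, s in zip(means, sigmas)]
        outputs.append(func(*inputs))
    return statistics.mean(outputs), statistics.stdev(outputs)

# Hypothetical derived quantity: the sum of two measured values.
mean, spread = monte_carlo_spread(
    lambda a, b: a + b, means=[1.0, 2.0], sigmas=[0.3, 0.4]
)
print(f"derived value: {mean:.2f} +/- {spread:.2f}")
```

For this particular sum the analytic spread is sqrt(0.3^2 + 0.4^2) = 0.5, so the sampled value should land close to that. The appeal of the sampling approach is that the same code works when `func` has no simple analytic error formula.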

Chris is suggesting that we work on the communities where accuracy is a lower priority

Suggestion to invite more scientists into the discussion

Question of whether a list is needed to define where to go to get best informed

Response - it will be different depending on what facet we are addressing

There is a wiki page for this cluster, and it is a good tool for collaboration - it is no longer linked from the commons page but is easy to get to


Session ends - there is a fire alarm



Continue this discussion and visit the wiki for collaboration

Meyer, C.; Data/Information Quality BoF; Winter Meeting 2014. ESIP Commons, January 2014