Data/Information Quality BoF
Carol introduces the discussion topic and the first speaker
Gilberto Vicente from GES DISC
Problem: Data quality is an ill-posed problem because it is hard to define, user-dependent, difficult to quantify, handled differently across teams, and perceived differently by different data providers and users
Data Quality aspects: elements of completeness and consistency
Quality concerns are poorly addressed: quality gets lower priority than the instrument itself, and little attention is paid to validation across types of data
Addressing what users want vs. what they get: they want gridded data without gaps, but they get swath data with poorly defined quality flags (see the sketch below)
User perspective: make do with what they can get
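A minimal sketch of this mismatch, assuming a hypothetical swath product whose pixels carry a 0-3 quality flag; the arrays, the flag convention, and the 1-degree target grid are all invented for illustration:

```python
import numpy as np

# Hypothetical swath granule: per-pixel values with a 0 (best) to 3 (bad) quality flag.
rng = np.random.default_rng(0)
n_pix = 10_000
lat = rng.uniform(-90, 90, n_pix)
lon = rng.uniform(-180, 180, n_pix)
value = rng.normal(300.0, 5.0, n_pix)      # e.g. a brightness temperature
qc_flag = rng.integers(0, 4, n_pix)        # assumed flag convention, not any real product's

# Screen by quality, then average surviving pixels onto a 1-degree grid.
good = qc_flag <= 1
rows = np.clip((lat[good] + 90).astype(int), 0, 179)
cols = np.clip((lon[good] + 180).astype(int), 0, 359)
sums = np.zeros((180, 360))
counts = np.zeros((180, 360))
np.add.at(sums, (rows, cols), value[good])
np.add.at(counts, (rows, cols), 1)

grid = np.full((180, 360), np.nan)
grid[counts > 0] = sums[counts > 0] / counts[counts > 0]

# Cells that received no good pixels stay NaN: the gaps users did not ask for.
print(f"Gap fraction after quality screening: {np.isnan(grid).mean():.1%}")
```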
Poses a question to the group:
Can we develop a framework to help the community with the assessment and quantification of data quality?
Reminder that this work was started by Greg Leptoukh
Introduces the next speaker, Chris Lynnes (NASA Goddard)
Passes out clickers to use during the presentation
Eye of the beholder: User-oriented data quality
Addresses facets of data quality:
- accuracy
- completeness
- consistency
- resolution
- ease of use
- latency
Introduces a thought experiment: if you were a museum curator designing an exhibit on wildfires with cool satellite images, which facet of data quality would be most important to you?
It is clear that different kinds of users consider different facets of data quality most important
This is something that is already being done by the User Needs Analysis working group
Methodology:
- Develop a user model: both human and machine (includes DSS)
- Inventory sources of user input, starting with comments from an ASDI survey
- Assess the importance/utility of each complaint or suggestion
- Look at relative scores
Shows list of user classes
Shows an example of the assessment chart
Suggestion that we need to repurpose the user needs analysis methodology
Step 1: Quality Needs Assessment
Step 2: Quality Communication (We should be looking at metrics, but also beyond that to visuals)
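A minimal sketch of what those two steps might produce together, read as a fitness-for-use scorecard; the user classes, weights, and dataset scores below are invented solely to illustrate the shape of such a template:

```python
# Facets from the presentation; everything else here is an illustrative assumption.
FACETS = ["accuracy", "completeness", "consistency", "resolution", "ease_of_use", "latency"]

# Step 1 (Quality Needs Assessment): hypothetical importance weights (0-5) per user class.
needs = {
    "research_scientist": {"accuracy": 5, "completeness": 4, "consistency": 4,
                           "resolution": 3, "ease_of_use": 2, "latency": 1},
    "museum_curator":     {"accuracy": 1, "completeness": 2, "consistency": 1,
                           "resolution": 5, "ease_of_use": 5, "latency": 2},
}

# Step 2 (Quality Communication): a dataset's facet scores (0-5), reported alongside the weights.
dataset_scores = {"accuracy": 4, "completeness": 3, "consistency": 4,
                  "resolution": 5, "ease_of_use": 2, "latency": 3}

for user, weights in needs.items():
    fitness = sum(weights[f] * dataset_scores[f] for f in FACETS) / sum(weights.values())
    print(f"{user}: weighted fitness-for-use {fitness:.1f} / 5")
```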
Asks for questions
Q: Asks to clarify "ease of use"
A: Put yourself in the shoes of a museum coordinator and ask whether you could understand specific data sets
Q: Suggests a useful publication for user needs categories
Suggestion about groups of professionals (policy makers, etc.) and how to reach them
Q: Questioning whether latency is actually an attribute of quality
A: If you consider quality as fitness for use, it is; often there is a trade-off with accuracy
A: Latency is timeliness, in a more general sense
A: Stressing the phrase “fitness for use” and the ability to measure this
Suggestion: a great idea coming out of the latency question: how do we discuss the trade-offs between these quality facets?
Example: selection bias, a type of bias that trades off against several of these quality aspects
Suggestion: formal ISO standards: stating a range to deal with uncertainty, computing an uncertainty statistic
Response: this does not align with what the community needs; the important thing is developing a template that does address these needs in a model
Suggestion: encourage use of the previous data quality measures work that has already been done
Individual offers a use case example: CALIPSO
Reminder that we have to talk in the language of the user
Suggestion that the key is to use these meetings to have these discussions
Suggestion: begin to think about the interaction between the interface and the user the way a librarian would a “reference interview”: collapsing down the choices, disclosing the hints, and allowing the user to …
Suggestion: Greg was trying to get at a Consumer Reports approach: you have all these different user bases, but you have a template
Suggestion: Stresses the act of a dialogue
Response: questions how this idea would work for machine users
Suggestion: don’t want to focus on just one area but keep addressing all these user groups
Wants to extend this beyond just the search process to the data management aspect as well
Uncertainty has three different scales (see the sketch after this list):
- Individual value
- Single file: how does the uncertainty vary, and how would you summarize it for a single file?
- Whole collection: how do you summarize the characteristics into a global statement about the collection?
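A minimal sketch of reporting uncertainty at those three scales, assuming each value already carries a 1-sigma estimate; the data, the gamma error model, and the choice of median/95th-percentile summaries are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def file_summary(pixel_sigma):
    """Scale 2: summarize the per-value uncertainties of one file (granule)."""
    return {"median_sigma": float(np.median(pixel_sigma)),
            "p95_sigma": float(np.percentile(pixel_sigma, 95))}

# Scale 1: every individual value has its own uncertainty (here, a fake sigma per pixel).
files = [rng.gamma(shape=2.0, scale=0.5, size=5_000) for _ in range(20)]

# Scale 2: one summary per file.
per_file = [file_summary(sigma) for sigma in files]

# Scale 3: a collection-level statement built from the file summaries.
collection = {
    "files": len(per_file),
    "median_of_file_medians": float(np.median([f["median_sigma"] for f in per_file])),
    "worst_file_p95": float(max(f["p95_sigma"] for f in per_file)),
}
print(collection)
```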
Society for Industrial and Applied Mathematics (SIAM) deals with calculating uncertainty
May want to cherry-pick the quality issues that are easier to solve first
Suggestion about the quality of the container of the data, not just the data itself
Asks whether anyone else wants to discuss examples and ideas from their institution/organization
GCMD has tried to do things along this line in the past and they would love to have something like this to use
This would be a way to begin building some consensus in one area
Addressing the DOI example which is a similar situation
Suggestion to take into account the amount of work that can be done in this area given the budget
Monte Carlo approach: we need some way of simplifying statistical processes for discussion with different groups, but the rigor this subject requires is very complex
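A minimal sketch of the Monte Carlo idea for a non-specialist audience: sample the assumed input error, push the samples through the derived quantity, and report a simple interval; the retrieval function and the 1-sigma value are placeholders, not anything presented in the session:

```python
import numpy as np

rng = np.random.default_rng(2)

def derived_quantity(radiance):
    """Stand-in for a nonlinear retrieval; purely illustrative."""
    return 10.0 * np.log1p(radiance)

measured = 5.0   # nominal measurement (made up)
sigma = 0.3      # assumed 1-sigma measurement uncertainty (made up)

# Propagate the input uncertainty by repeated sampling.
samples = derived_quantity(rng.normal(measured, sigma, size=100_000))
lo, hi = np.percentile(samples, [2.5, 97.5])
print(f"Derived value {samples.mean():.2f}, 95% interval {lo:.2f} to {hi:.2f}")
```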
Chris is suggesting that we work on the communities where accuracy is a lower priority
Suggestion to invite more scientists into the discussion
Question of whether we need a list defining where to go to get best informed
Response - it will be different depending on what facet we are addressing
There is a wiki page for this cluster, and it is a good tool for collaboration; it is no longer linked from the Commons page but is easy to get to
Session ends; there is a fire alarm
Continue this discussion and visit the wiki for collaboration