Identifying and Assessing Best Practices in Data Quality

Abstract/Agenda: 

Some of the challenging but important aspects of stewarding environmental data products are to provide guidance on best practices in data quality, to quantitatively assess the practices applied to individual data products, and to consistently convey the data quality information to users.

What key functional areas should we focus on?  What are the best practices for each key functional area of data quality? How do we use metadata in recording the quality practices?

What are the right and measurable levels of maturity for assessing quality practices? 

In this session, we will discuss issues related to, but not limited to, the above questions, including a progressive scale for assessing data quality practices that is applicable to diverse digital environmental data products within the framework of a scientific data stewardship maturity matrix.

Notes: 

Identifying and Assessing Best Practices in Data Quality – Ge Peng

·         Once a data center has a data set – the work doesn’t end there

·         Questions include:

o   Congress – are you compliant with the law?

o   business – are products credible, in a common data format, and sustainable?

o   modelers – is the quality of a routinely updated product being assessed?

o   we don't have a way to assess the community's practices

o   define a stewardship maturity matrix

·         scientific data stewardship – activities to preserve or improve the information content, accessibility, and usability of environmental data and metadata (NRC, 2007)

o   quality (DQ screening) and usability (common data format, spatial/temporal characteristics, uncertainty estimates)

o   ensuring data is always meaningful

·         how to identify key components and define levels of the stewardship maturity matrix

o   policies – started from US laws and mandates – 8 nonfunctional requirements

o   process – 19 key functional areas defined

o   Procedures/standards – identify key components that the community uses

o   Tasks – evaluate products for community practices

§  Relevant, measurable (quantitatively), progressive, and possibly independent

·         Maturity levels follow the CMMI level structure

o   CMMI – capability maturity model integration

o   CDRMM = climate data record maturity matrix

o   Until you are at level 3 you can't confirm consistency in your dataset

o   Level 1 – ad hoc / not managed

o   Level 2 – minimal / limited managed

o   Level 3 – intermediate / managed / community good practices – the minimum

o   Level 4 – advanced / well managed / community best practices

o   Level 5 – optimal / well managed, measured, controlled, audited – difficult to get to
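
The five-level scale above can be captured in a small lookup table. This is a hypothetical sketch paraphrasing the notes; the level names and the function are illustrative, not an official enumeration or API:

```python
# Hypothetical sketch of the CMMI-style stewardship maturity scale from the
# notes; level names paraphrase the talk and are not an official enumeration.
MATURITY_LEVELS = {
    1: "ad hoc / not managed",
    2: "minimal / limited managed",
    3: "intermediate / managed / community good practices",
    4: "advanced / well managed / community best practices",
    5: "optimal / well managed, measured, controlled, audited",
}

def can_confirm_consistency(level: int) -> bool:
    """Per the notes, dataset consistency cannot be confirmed below level 3."""
    return level >= 3
```

For example, `can_confirm_consistency(2)` is `False` while `can_confirm_consistency(3)` is `True`, reflecting the note that level 3 is the minimum for confirming consistency.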

·         Goals – general, simple, and concise

o   Assess and convey quality, and provide a path forward – not reinventing wheels, but leveraging existing practices

o   Now seeking feedback from the ESIP community

·         Key components

o   DQ assessment/validation vs. DQ screening/assurance/monitoring/control

o   Discovery and access, preservation (already have a good handle on these)

o   Data integrity, data usability

o   Product sustainability & transparency/traceability – still working on best practices

·         Who could use

o   Data providers and scientific stewards

o   Modelers, decision-support system users, and scientists

o   Data managers/stewards of data centers and repositories

§  Assess the current state and decide if we want to improve

o   Data users would have the choice of which product to use – look at the ranking that best fits their purpose

·         Should we use community best practice at level 4 – what is a best practice?

o   Best practice – the best in a particular business or industry

§  If everyone is at level 4 – everyone feels good, but the bar is too low

o   Need for multiple data quality key components

§  DQ assurance/screening vs. assessment/validation vs. monitoring/control

o   How can we integrate our efforts with the ESIP community?

·         [email protected]

 

ISO 19157 – Questions and Answers – Ted Habermann

·         ISO 19157 is referenced from 19115

·         19157 is for data quality

·         Data Quality scope – questions and answers (handout)

o   Q – Ken Casey – he means in a structured way, in XML, that can be represented

o   Data type DQ_Scope – has an extent-level description

o   Quality is a spatial and temporal database

·         Papers and webpages to describe quality – you can now point via citations

o   Q – Ken – are they distinct from other types of citations?

§  Yes – lots of different kinds of citations

o   One citation class, but it can be implemented in different places – can be referenced in multiple places

o   Can now include the full citation even with an abstract

·         What is a data quality element? It includes a measure, a method (for applying the measure), and a result

o   QA_PercentMissingData

§  Number of pixels flagged as missing / total number of pixels = e.g., 15%

o   Provides a standard way of describing how you do DQ
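
The QA_PercentMissingData example above (measure, method, result) amounts to a simple computation; this sketch mirrors the measure's name for clarity and is not an official API:

```python
def qa_percent_missing_data(missing_pixel_count: int, total_pixel_count: int) -> float:
    """Measure: percentage of missing data.
    Method: count pixels with missing flags, divide by total pixel count.
    Result: a percentage, e.g. 15.0.
    """
    if total_pixel_count <= 0:
        raise ValueError("total_pixel_count must be positive")
    return 100.0 * missing_pixel_count / total_pixel_count

# 150 of 1000 pixels flagged missing -> 15.0 percent
print(qa_percent_missing_data(150, 1000))  # 15.0
```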

·         Quality measures

o   QA flags – classes of quality measures with product-specific implementations

o   Need an explanation of how QA flags are applied

o   Proposed a set of properties so that users can understand measures

o   Have names and aliases, classes, parameters, descriptions, references, and illustrations

o   There is a registry or database of quality measures – then you point to measures – use DOIs, UUIDs…

·         Modular data quality information

o   Measures, methods, and results can all be referenced, say from a web service… need one of each

o   Q when dealing with archival data, where measures might change, you may have just manipulated an archival file where the measure changes

§  Benefit to having pointers in webpages, how you manage links is different

§  Data citations

o   Q – how do you go from not filling in data quality to filling it in?

§  It already exists, but it varies how it exists

§  Q – if you link to something that isn't machine readable – then the quality information is hard to find

§  Also – by referencing then you have that information in the future

o   In DIF you have a quality field – you can put whatever you want in it

o   Iso allows same character string in metadata record

·         Data Usage

o   In 19115 there is a structured representation of a user-found limitation (with date/time) – 19157 adds to this – additional documentation goes to a webpage & the citation used to identify issues

o   Q – is this a place for a user to modify metadata? – No, the user provides feedback

§  Could have a feedback database and a DOI – if looking for a known database, have a service that comes back and gives you the pieces

§  Info at ngdc.noaa.gov…

·         International standard

o   ISO is an international standard – accepted by the 122 member nations of ISO

·         Q – Jeff – if I'm a scientist and I've got a bad channel in my sensor, how am I going to figure it out from the metadata?

o   Metadata is a way to share that with the user – report channel at a specific time

o   User – will likely use a web interface: "are there any quality reports related to this data set?"

o   AJAX allows you to download everything around where you are – e.g., a quality report or user input about a channel

·         19157 now supports data quality management

·         Ken – in the future, you could force this information on the user

·         Matt – tying back to Ge, what about mapping ISO to a maturity matrix?

o   Most matrices are so general it is difficult to connect to real metadata

o   1 data uses metadata standards

o   Matrices are general guidance

o   Measuring completeness of metadata = rubric (test completeness vs. quality)

o   Matt – completeness is only one aspect of quality; consistency is another

o   Matt – users want information about an aspect of an element; that information would be an aggregation of metadata

·         Metadata is mainly used for discovery – need to think about metadata for use & understanding

·         Q – Ge – direct mapping between metadata record and maturity matrix – is there a way to capture from metadata to populate the maturity matrix?

·         Q – what are we trying to get users to see about quality?

o   The data provider should make metadata because they know the data

o   But the data center needs to make it because they know the users

o   If we could take words and put them into structure

·         Matt – the metadata steward doesn't understand the science well enough to put it in the report

·         Jeff - metadata has to be a joint effort of the producer and the DAAC

o   the DAAC has to keep maintaining the data after

Talking Points

·         Roles and responsibilities

·         An efficient way to report, resolve, and notify users in a timely and effective manner

o   There needs to be a process to prioritize reports, identify issues, and then notify users

o   Ken – there have been interesting discussions about datasets as entities in a social network – can manage user feedback – can follow them

§  Who is going to do it – how (Ge)

o   Ted – one useful goal – data citations and DOIs for data – when looking for a paper about a data set, you can sometimes search for the DOI and find papers

o   Need to start connecting citations, or make sure they are in systems that can be connected to a metadata system – could even include abstracts

o   For scientists – can you start requiring consistent measurements of data quality – using language they can understand?

o   Across communities – looking for similarities

o   Have mechanisms to take existing quality standards

o   The Information Quality cluster has had a number of false starts

§  19157 provides signposts for how to move forward – can start building structured content for metadata

§  & build a system for user feedback – things will grow as a function of time

§  3 legs – providers, data centers, and users

o   Matt – relations between uncertainty and quality in netCDF ways of measuring – how do they map to ISO?

§  UncertML – can also be a report with a DOI

§  Don’t know of many implementations

·         Lineage was part of data quality in 19115 – it is still in 19115-1, which added "citation or reference" – put there to allow a more detailed provenance record – the ISO metadata record provides a hub where users can find info to understand some elements

·         19115-2 – metadata for imagery – it is about platforms and instruments – the worst-named standard in the world

o   Ed – 19115-2 – object quality description – good container – good way to recognize

§  The ability to identify coverage is in 19115-1 – 19157 didn't want to be dependent on it, so you will be able to combine 19157 & 19115-1 – there were a lot of bureaucratic obstacles in ISO & they were finally overcome – the new schema will include everything

 

Attachments/Presentations: 
Attachment: GePeng_20140710_v6_final.pdf (3.06 MB)
Citation:
Peng, G.; L., J.; Identifying and Assessing Best Practices in Data Quality; Summer Meeting 2014. ESIP Commons, April 2014

Comments


Ge Peng: Stewardship maturity matrix introduction and talking points for the discussion
Ted Habermann: Overview of ISO 19157 data quality capabilities
Ed Armstrong: GHRSST Data Quality – Satellite / in Situ Comparisons