Metadata cataloging for discovery of open data
Data discovery has traditionally focused on providing documentation (metadata) to enable finding a particular, discrete dataset, evaluating the content, and accessing the data for use in some fashion. In recent years, the emergence of web-service enabled data access has allowed the discovery process to delve into exploratory analytics and data visualization at a more granular level. Enabling this deeper level of discovery requires additional, machine-actionable metadata content. Many metadata content models and interchange formats have emerged in different Earth Science communities due to varying requirements and community practice. The Obama Administration's Project Open Data initiative calls for a common, simple set of metadata attributes to be applied across every described data asset of the Federal Government and to be used in pan-governmental aggregate catalogs such as Data.gov. Much work is being done in the earth science agencies to assemble bureau, office, and agency catalogs of various kinds to feed the broader “system,” with efforts focused on everything from the application of metadata standards to catalog technologies and interface mechanisms.
This session will include brief presentations from agency staff involved in implementing metadata catalogs at various levels and an open discussion on technologies, methodologies, and practices. The following key questions will be pursued:
- What metadata standards are being employed, and what dynamics are being encountered in the mapping to or integration with the more abstract Project Open Data conventions?
- At what levels of granularity are existing catalogs presenting their resources to the “Public Data Listing” and “Data Inventory” under Project Open Data?
- Following the initial release of agency “open data” catalogs/data listings in November 2013, what are the primary challenges faced in next steps and growth of these resources?
- How have the collaborative opportunities brought about through the ESIP community helped to advance capabilities in metadata cataloging and data discovery?
Please contact Sky Bristol and Ben Wheeler if interested in presenting/participating in this session. Organizers are especially interested in garnering a wide range of agency input.
Ben Wheeler, USGS
· This morning we heard the high-level overview
· Last session we learned what to do with metadata: human- and machine-readable formats
· This session is about how to get to and discover the metadata: what the catalogs and agencies are doing, crosswalks, and formats
· ESIP Documentation Cluster meets at 1:30 tomorrow, most likely here
· 3 or 4 talks: Anna, Ben, Ted, and then possibly Steve, followed by discussion
ISO Metadata Foundation for Discovery and Citation – Anna Milan, NOAA NGDC
· How using existing
· NGDC has been doing really well at managing metadata
· Take data out of the metadata record, use it to mint a DOI, put the DOI back into the metadata record, then republish the record; the view of the metadata record is the landing page of the DOI
o An easy ID system means we can update URLs quickly
· The DOI landing page was originally the "get the data" page; everything comes straight from the metadata
· Because NGDC already had good metadata, this was an easy process
· Mapping Resources
o NOAA metadata working groups have been meeting regularly about mapping the data
o The DCAT mapping is done in a Google spreadsheet
o Currently working on the data citation/ISO mapping
· Q: What was the thinking behind putting the recommended citation into the resource constraints?
o It made the most sense; it is a text field
· Q (Bruce): Have you worried about the notion that observations are better associated with their date and time than with a publication?
o Aside from this approach, the actual scope of what a DOI is minted for requires thought and varies with the resource
o For this case, the DOI is for a database; as a subset it has a set
o Depending on the real community, then data observation
§ But search would be by observation date
o Two things are going on; one is citation, which has a standard based on publications
· Q (Sky): Conversation about citing analysis from a particular publication versus citing the data used in a publication
o NOAA has a DOI for the collection; this gets described in the publication and then clarified in the citation
o If you have one DOI for one data collection, you can get metrics on that DOI; if you have a DOI for each granule, it is harder to get metrics on the granule, and you can’t dereference subsetted text
· Q: What in general are you using for the registration URL?
o It is a landing page, generated dynamically from the ISO metadata
· Q (Ben): The fields that came out with open data, such as the agency codes from OMB: are you applying those only when a record goes up to the NOAA CKAN?
o Bureau/program code – agencies have to use CKAN response
o Embedded in the JSON output
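The mapping and JSON embedding discussed in this talk can be illustrated with a minimal Python sketch. It is not NOAA's actual crosswalk. The input keys (abstract, date_stamp, file_identifier) are hypothetical names for values already pulled out of an ISO record, the output keys follow the Project Open Data metadata schema as commonly published (title, description, keyword, modified, identifier, accessLevel, bureauCode, programCode), and the code values are placeholders rather than real OMB codes.

```python
import json

# Placeholder values only: real bureau and program codes come from the OMB lists.
BUREAU_CODE = "000:00"
PROGRAM_CODE = "000:000"


def to_pod_entry(iso_fields, bureau_code=BUREAU_CODE, program_code=PROGRAM_CODE):
    """Map a flat dict of values taken from an ISO record to a
    Project Open Data style catalog entry (assumed field names)."""
    return {
        "title": iso_fields["title"],
        "description": iso_fields["abstract"],
        "keyword": list(iso_fields.get("keywords", [])),
        "modified": iso_fields["date_stamp"],
        "identifier": iso_fields["file_identifier"],
        "accessLevel": "public",
        # Agency-level codes are added at export time, not stored in the ISO record.
        "bureauCode": [bureau_code],
        "programCode": [program_code],
    }


# Hypothetical values standing in for a parsed ISO record.
record = {
    "title": "Example bathymetry collection",
    "abstract": "Gridded bathymetry for an example region.",
    "keywords": ["bathymetry", "ocean"],
    "date_stamp": "2013-11-01",
    "file_identifier": "example:collection:0001",
}

entry = to_pod_entry(record)
print(json.dumps(entry, indent=2))
```

Keeping the agency codes as function arguments reflects the point raised in the Q&A: they can be applied only at the step where records are exported to the aggregate catalog.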
USGS Metadata Criteria and Response to Open Data - Ben Wheeler, USGS
· Goals
o Trying to facilitate Data.gov responses to Department of the Interior actions
o Discuss other agency approaches
· Questions
o What to provide – granularity or systems
§ When you are looking at a catalogue for an agency – what level of granularity
o Then – how to find it, use it… and views at different levels
· Background - ODM
o Data.gov population and redesign
o Mostly heard about ISO and DCAT, USGS still works in FGDC
· USGS
o Growing focus on data management – trying for best practices, managing data for the whole lifecycle
o Legacy and new metadata
o Trying to get data to data.gov before ODM – before it was all old maps
o Engage in cross-agency collaboration – what is the way to support
o Support DOI
o www.usgs.gov/datamanagement/
· USGS Science Data Catalog (SDC)
o Checks minimum metadata requirements
o Featuring datasets and systems that will show up on the top of a search result
· SDC Search and Discovery tool
o 1) metadata discovery
o 2) visualization
o 3) downloading
o 4) link to data
o 5)
· SDC dashboard – have ~20-30 harvest sources
· SDC draft catalog content criteria – 7-page document sent out for review
o Want the most robust records possible
o But also want a comprehensive scope
o This causes a balancing act
o Creating more comprehensive metadata requirements (next few months)
· CKAN search gives the collection level, and then you can search within a catalog
· Next step/challenges
o Data from USGS pubs warehouse – some have been taken apart … are these duplicates
o What is data – publications/ samples/ print
o Quality vs. quantity
· Sky – quality vs. quantity – browsing… Fish and Wildlife – lineage info – includes “vintage” i.e. crap metadata – populating data.gov with useless metadata – USGS is not just pushing this stuff out
o USGS is checking the metadata before it is put out on data.gov – getting flak over this
o Anna – do you have people update their metadata? Yes – Coastal and Marine Geology is interested in updating metadata
· Q (Bruce): Where is the information on the user community's data search and data use habits?
o The levels where it gets translated up to – datacenter, science, … way to find record at data.gov or coastal marine geology science center
o How do you make the same record discoverable at each place along the way – specialist to known agency to average Joe
o Book “Thinking, Fast and Slow” – be careful about answering this question
o Q WHERE users THINK
§ Ted – user model of data.gov is simple – only discovery metadata
§ Search with keywords and time ranges
o Q which demographic community?
§ This has not been done
§ User communities change fast – moving target
§ Steve A. - Also, way people are using data is changing – now people want small piece of data
§ Bruce - Marketing is difficult – suspicions about how rapidly vocabulary is evolving – might be interesting to think about what kinds of communities – based on what vocabulary, education level – need to bridge tribal vocabulary difficulty in metadata
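The idea of checking minimum metadata requirements before records go out, as described for the Science Data Catalog above, can be sketched in Python. This is not the SDC implementation. The list of required fields is a hypothetical stand-in for the draft content criteria, which these notes do not spell out.

```python
# Hypothetical minimum fields; the real criteria are in the SDC draft document.
REQUIRED_FIELDS = ("title", "description", "contact", "distribution_url")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


def split_harvest(records, required=REQUIRED_FIELDS):
    """Separate harvested records into those that pass the minimum check
    and those held back, with the reasons for holding them back."""
    accepted, rejected = [], []
    for record in records:
        gaps = missing_fields(record, required)
        if gaps:
            rejected.append((record.get("title", "<untitled>"), gaps))
        else:
            accepted.append(record)
    return accepted, rejected


# Made-up records standing in for one harvest source.
harvested = [
    {"title": "Streamflow", "description": "Daily values",
     "contact": "data steward", "distribution_url": "https://example.org/streamflow"},
    {"title": "Old map scan", "description": "Legacy record",
     "contact": "", "distribution_url": "https://example.org/map"},
]

accepted, rejected = split_harvest(harvested)
for title, gaps in rejected:
    print(f"held back: {title} (missing: {', '.join(gaps)})")
```

Reporting the missing fields, rather than only rejecting the record, supports the quality versus quantity discussion: the data owner gets a concrete list of what to fix before the record is published.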
Metadata Evaluation and Improvement - Ted Habermann – The HDF Group
· Talk is more about organizational change: most of the problems in ESIP require a large amount of organizational change; draws on the book “Switch” by the Heath brothers
o “What looks like a people problem is often a situation problem”
o “To change someone’s behavior, you’ve got to change that person’s situation”
o NOAA metadata plans… haven’t really changed
· Two targets
o Rational (rider) and emotional (elephant) parts of the game
o Directing the rider
o Motivating the elephant
· Point to the destination
o Problem is trying to document scientific data
o 1) use
o 2) discovery
o 3) understanding of the data
o Data.gov is trying to simplify the problem so that discovery becomes the destination
o Discovery is the antithesis of data stewardship
o By making discovery the destination – destination needs to be the whole metadata
§ Or reproducible provenance
o Difference between science data and other data (e.g., baseball stats versus average temperature of X location)
o Create a reward card… discovery, use, understanding… these are the goals
o Feeling → data are important for science, complete documentation makes them trustworthy
§ Trust is the key point… want people to trust your work
o Need to make things simpler – so we have discovery covered
§ Shrinking the change helps people complete
§ Discovery has been done
· Shrinking the change
o But metadata is more complicated
o NGDC is good at breaking things up
o Now lots of things are checked – Identification is done… some extent… sort of half way there
o Example FGDC to ISO record rubric – 17/41
§ Discovery is mostly done because FGDC is a discovery record
· Script critical moves
o Broke metadata into spirals
o This means you can see what you need to do
o Making sure the people that are improving metadata know what they need to do – they can script the critical moves
· Growing your people
o Red and green cells and table that provides a user guide to develop better metadata
o Includes xml so people get used to it
o Best practice is based on which group (or tribe) or community can help define these
o Table links to the wiki
o With an XML editor, people can do this on the desktop and publish when they are happy with their score
· Identify Bright Spots
o Rows in the rubric/spirals
o Shows the residual /score for many records (ex. 2400 records)
o 52 records in 2 groups… exactly the same except for one spiral – there are 14 bright spots in the spiral (the good examples) and the other things are opportunities for improvements
o Can provide recommendations on how to correct metadata and examples of how to do it, then find good examples locally and globally
· If you want a successful change, you need to address both aspects: rational and emotional
· Q (Ed): Know your destination – data.gov has an awful target… but if we had a general search tool (e.g., a Google tab for data)… who is developing the tool that leverages the data for a general user?
o Google kind of already did
o Problem – goal is not to deliver data to typical user
o Goal is to deliver data to ANY user – they can find, tools can use, and then they can understand – this is harder now that there are so many datasets – get multiple data sets for the same thing
§ Need the metadata for data to help user translate between datasets
o Have good examples of faceted searches: ESRI Geoportal, NODC, USGS
o Dan – there are different uses of metadata – tools like worldview – consume metadata with visualization to present the user with data
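The rubric scoring described in Ted Habermann's talk (a record scored as, for example, 17/41, with rows grouped into spirals and the best records treated as bright spots) can be sketched in Python. This is not the actual NOAA rubric. The spiral names and element lists below are hypothetical and much shorter than a real rubric.

```python
# Hypothetical spirals and elements; a real rubric has many more rows.
RUBRIC = {
    "Identification": ["title", "abstract", "keywords"],
    "Extent": ["bounding_box", "time_period"],
    "Distribution": ["online_link", "format"],
    "Lineage": ["process_steps", "sources"],
}


def score_record(record, rubric=RUBRIC):
    """Return {spiral: (elements present, elements possible)} for one record."""
    scores = {}
    for spiral, elements in rubric.items():
        present = sum(1 for element in elements if record.get(element))
        scores[spiral] = (present, len(elements))
    return scores


def total(scores):
    """Collapse per-spiral scores into one (present, possible) pair."""
    present = sum(p for p, _ in scores.values())
    possible = sum(n for _, n in scores.values())
    return present, possible


def bright_spots(records, spiral, rubric=RUBRIC):
    """Identifiers of records that fully satisfy one spiral (the good examples)."""
    full = len(rubric[spiral])
    return [r["id"] for r in records if score_record(r, rubric)[spiral][0] == full]


# Made-up records: one discovery-only record and one fully documented record.
records = [
    {"id": "rec-1", "title": "A", "abstract": "x", "keywords": ["k"],
     "bounding_box": [-10, 10, -5, 5], "online_link": "https://example.org/a"},
    {"id": "rec-2", "title": "B", "abstract": "y", "keywords": ["k"],
     "bounding_box": [-10, 10, -5, 5], "time_period": "2013",
     "online_link": "https://example.org/b", "format": "netCDF",
     "process_steps": ["quality control"], "sources": ["instrument log"]},
]

for rec in records:
    present, possible = total(score_record(rec))
    print(f"{rec['id']}: {present}/{possible}")
print("Lineage bright spots:", bright_spots(records, "Lineage"))
```

Scoring by spiral rather than as one number shows where a record already does well (typically discovery) and which spiral to work on next, which is the "shrinking the change" step from the talk.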