Streamlining Metadata using the CMR

Abstract/Agenda: 

The NASA ESDIS Common Metadata Repository (CMR) is being jointly developed by the GCMD and ECHO teams.  The CMR team is beginning to reconcile metadata between GCMD, EMS, and ECHO into this authoritative metadata source.

This session will outline the overall goals for the system, update attendees on the current status of the system's development, and report on the progress of the metadata reconciliation efforts. We will also cover current plans to include new metadata concepts in the system, such as parameter-level and documentation metadata.

Outline

  • CMR Overview
  • Building the CMR
  • Expanding Metadata Capabilities
  • Unified Metadata Model for Collections (UMM-C)
  • Reconciliation Process
  • Metadata Quality/Validation Service
  • General Question & Answer
Notes: 

Streamlining Metadata using the CMR (ECHO & ESIP) – Katie Baynes

·         CMR = Common Metadata Repository

o    Unified, high-quality and reliable Earth Science metadata

o    High-performance ingest and search architecture for all of EOSDIS (Earth Observing System Data and Information System)

·         Key benefits

o    single authoritative source of EOSDIS metadata

o    UMM and metadata concepts

o    docBUILDER records go through a science coordinator for quality assurance

o    High availability

·         UMM – science metadata – collections and granules… want to expand the CMR to more concepts – service, parameter, visualization, and future metadata, plus meta-metadata

o    Ted's scoring rubrics – XSLT that works above what the schema requires – meta-metadata is something that needs to be updated every few years

·         All the metadata needs to be high quality: who, what, where

·         Sub-second search

o    Format translations – don’t have all the options cached

o    Spatial search – points, polygons, and bounding boxes – performance tests showed bounding-box searches take almost 2 seconds (see the sketch after this list)

o    Elasticsearch indexing – determines the way a query is sliced when it comes in

o    Caching – don't currently keep ISO caches… but are exploring it

o    Parallelization – the more hardware, the better
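
As a rough illustration of the spatial-search piece above (not the actual CMR implementation): assuming granule footprints were indexed in Elasticsearch under a geo_shape field, a bounding-box query could look like the sketch below. The index name, field name, and endpoint are invented for the example.

    # Hypothetical bounding-box search against an Elasticsearch index of
    # granule footprints. "granules" and "footprint" are made-up names,
    # not the CMR's actual index layout.
    import json
    import requests

    ES_SEARCH_URL = "http://localhost:9200/granules/_search"  # assumed local ES

    query = {
        "query": {
            "geo_shape": {
                "footprint": {
                    "shape": {
                        # envelope coordinates: [[min_lon, max_lat], [max_lon, min_lat]]
                        "type": "envelope",
                        "coordinates": [[-125.0, 49.0], [-66.0, 24.0]],
                    },
                    "relation": "intersects",
                }
            }
        },
        "size": 50,
    }

    resp = requests.post(ES_SEARCH_URL, data=json.dumps(query),
                         headers={"Content-Type": "application/json"})
    print(resp.json()["hits"]["total"])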

·         Standards-based & backward-compatible formats/API

o    Project Open Data "common core"

o    Have minimal JSON format

o    ISO 19115 or 19139

o    The API will continue to be backward compatible through the move to the CMR (see the sketch below)
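
A minimal sketch of what the format flexibility above can look like from a client's point of view, assuming the public CMR search endpoint and its format extensions (the URL, extensions, and parameters here are inferred from the talk and may differ from the final API):

    # Request the same collection search in several response formats.
    # Endpoint and extensions are assumptions for illustration only.
    import requests

    BASE = "https://cmr.earthdata.nasa.gov/search/collections"
    params = {"keyword": "MODIS", "page_size": 5}

    for ext in ("json", "atom", "echo10", "iso19139"):
        resp = requests.get("{}.{}".format(BASE, ext), params=params)
        print(ext, resp.status_code, resp.headers.get("Content-Type"))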

 

Building the CMR – Dan Pilone

·         Performance, extensibility and metadata quality

·         ECHO Search Flow –

o    ECHO granule query ≈ 100 ms

o    Performance degrades as search complexity increases

o    2-phase search – bounding-box pre-filter, then a high-resolution check in Oracle (sketched at the end of this search-flow section)

o    Result lookup – ~5 seconds for 2,000 granules

o    Response format frequency: (1) XML references 42%, (2) ECHO10 38.7%, Atom XML 11.9%, Atom JSON 7.4%

§  Clients search ECHO first, then request a smaller set of results

§  No DB lookup needed for XML references

o    Q – what is the native format that you keep metadata in?

§  Whatever it was submitted in – ECHO, DIF, ISO… we always have the record from the provider

§  Extract native polygon points from xml
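
A minimal sketch of the two-phase search idea described above (a cheap bounding-box pre-filter followed by an exact geometry check on the survivors); the data structures and the use of shapely are illustrative, not the ECHO/Oracle implementation:

    # Illustrative two-phase spatial filter: cheap bounding-box test first,
    # exact polygon intersection only for the survivors. Not ECHO/CMR code.
    from shapely.geometry import Polygon, box

    def two_phase_search(granules, search_bbox):
        """granules: list of (granule_id, Polygon footprint);
        search_bbox: (min_lon, min_lat, max_lon, max_lat)."""
        search_box = box(*search_bbox)

        # Phase 1: fast bounding-box overlap test (what an index answers quickly)
        candidates = [
            (gid, footprint) for gid, footprint in granules
            if box(*footprint.bounds).intersects(search_box)
        ]

        # Phase 2: precise (slower) geometry intersection on the reduced set
        return [gid for gid, footprint in candidates
                if footprint.intersects(search_box)]

    granules = [("G1", Polygon([(-110, 30), (-100, 30), (-100, 40), (-110, 40)])),
                ("G2", Polygon([(10, 10), (20, 10), (20, 20), (10, 20)]))]
    print(two_phase_search(granules, (-105, 32, -95, 38)))  # -> ['G1']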

·         How many items? Most clients do a search, then walk through and select what they want – or they pull "pages"

·         Response creation – how fast each format is created – ECHO10 is the fastest because it is stored as-is; other formats jump in magnitude

·         Q – why is the format decision made at query time, not ingest time?

o    As formats were added

o    Trade-off of costs

·         Micro-service based architecture (like sustainable architecture)

o    Decoupling units to scale, change, and swap out

o    Lets us look at flow & look at performance profile

o    Ingest adapters, metadata DB services, format conversion, search services, metrics… each is independent

·         How to optimize discovery search – what if we sharded it more? Performance increased as we sharded; ended up with a hybrid that balances the shards, with each query executing against its own shard – as the collection grows we can spread across more machines, decreasing each index's size (see the sketch below)

·         Retrieving results – Elasticsearch has scaled well – swapped out a lot of XML handling and moved to code written in Clojure to get results out as fast as possible – trade latency for performance; willing to offset the availability of data by a couple of seconds
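
A rough sketch of the sharding lever discussed in the two bullets above: spreading an index across more primary shards keeps each shard small and lets queries execute in parallel across machines. The index name and shard counts below are invented, not the CMR's settings:

    # Illustrative only: create an index whose primary shards can be spread
    # across nodes as the collection grows. Values are not CMR's settings.
    import json
    import requests

    settings = {
        "settings": {
            "number_of_shards": 6,     # more shards -> smaller per-shard index,
            "number_of_replicas": 1,   # parallel query execution across nodes
        }
    }

    resp = requests.put("http://localhost:9200/granules-index",
                        data=json.dumps(settings),
                        headers={"Content-Type": "application/json"})
    print(resp.status_code, resp.text)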

·         Q – talk about the workflow for ingest – what are you doing to manage the Lucene indexes?

o    Don't deal with the Lucene indexes directly because of Elasticsearch

§  Will do cleanup, rubric scoring, … the things needed to accept metadata

§  Will persist all versions of metadata in the metadata DB

·         CMR extensibility

o    Idea of metadata concepts – collections and granules, but there is room for parameters, meta-metadata, … the CMR is being built to handle them – working with providers to make more metadata available to the community

o    Collection – DIF, ECHO10, Atom…

o    For each concept – it doesn't matter what format – records have to have "these" fields – the UMM mapping says where they exist, and then there is a set of validation rules

o    Q – do you do any validation of fields for content (e.g., acronyms)?

§  Yes – for example, a granule's start time has to be after the start time of its collection (see the sketch below)

§  For text – only validate that there is content, not what is included… it is flagged for review by a human (at the collection level) and scored against the rubric
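
A minimal sketch of the kind of cross-record temporal rule mentioned in the answer above; the field names are hypothetical stand-ins, not UMM element names:

    # Illustrative validation rule: a granule's start time must not precede its
    # parent collection's start time. Field names here are hypothetical.
    from datetime import datetime, timezone

    def validate_granule_temporal(collection, granule):
        errors = []
        if granule["beginning_date_time"] < collection["beginning_date_time"]:
            errors.append("Granule starts before its parent collection")
        if granule["ending_date_time"] < granule["beginning_date_time"]:
            errors.append("Granule end time precedes its start time")
        return errors

    collection = {"beginning_date_time": datetime(2000, 2, 24, tzinfo=timezone.utc)}
    granule = {"beginning_date_time": datetime(1999, 12, 31, tzinfo=timezone.utc),
               "ending_date_time": datetime(2000, 1, 1, tzinfo=timezone.utc)}
    print(validate_granule_temporal(collection, granule))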

o    Q – that handles the simple cases – quality issues are often more subtle – what about reporting mechanisms from clients?

§  Yes – there is interest – but there isn't an automatic process in the CMR

§  GCMD does some of that now – some will transition

o    Q is there an artifact of this conceptual model

§  UMM document is currently being circulated for review

§  The next version will have a UML diagram

§  There are required and recommended parts

o    Q (Alek) – a collection assumes that its granules' times fall within it

§  Near-real-time collections – can provide no end date (i.e., "now"), or a fixed window width and then update to slide it

o    Q (Mike) – flagged for review – what about an appraisal process before that?

§  Yes, that is part of the workflow – the CMR will surface and score records – work with the science coordinator team

o    Q – is the data owner in the loop of changes

§  Yes – they own it unless they say otherwise

o    Q – are they submitted by scientists or metadata experts

§  Sometimes submitted by computers… it varies

o    Concept differences

§  Scale: 1,000s to 100s of millions of records

§  Relationships can also be complicated… "hey, that should be on a t-shirt" – a visualization metadata record: what parameters, what granules, what tools… there is a lot of cross-related metadata

o    Q – NASA provides a user survey; at NOAA we had problems getting information from users – the #1 thing users ask for is better, more complete documentation – where on that list was faster search?

§  Not from user pushback but from the tools

§  Some of this is being driven by non-human users

§  Within the user services community – you lose users if search takes too long

o    Q – have you had feedback on CSW – e.g., if someone wanted an XPath…

§  ECHO does not expose CSW – cannot give usage numbers

§  The overhead of SOAP is significant enough that moving to a REST API has been a dramatic improvement

§  Does not know of a request from clients for CSW

§  CSW has just published 3.0, which is out for review

Expanding metadata capabilities – Katie Baynes

·         UMM-C is out for review – it includes DIF, ESDIS and ECHO10 models

·         UMM metadata model

o    Ingest adapters run before records get to the indexing queue, converting into a centralized UMM representation, which can then go out in different formats (see the sketch after this list)

o    ECHO10 – will move to UMM-G (granule)

o    Should include ISO in all sections on this slide too

o    Visualization, parameters/variables, future concepts – UMM-V, UMM-P, and UMM-?… there will be a process to extend this model

o    All records can come out in other formats, including ISO, Atom…
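
A highly simplified sketch of the pivot model described above: ingest adapters map each native format into one common UMM-like representation, and exporters render that representation into any supported output format. Every field name and mapping below is a made-up stand-in, not the actual UMM:

    # Illustrative pivot through a common model: every ingest adapter produces
    # the same internal (UMM-like) dict, and exporters render it outward.
    # The field names and mappings are invented for illustration.

    def ingest_echo10(record):
        # hypothetical mapping from an ECHO10-style record
        return {"native_id": record["ShortName"], "title": record["DataSetId"]}

    def ingest_dif(record):
        # hypothetical mapping from a DIF-style record
        return {"native_id": record["Entry_ID"], "title": record["Entry_Title"]}

    def export_minimal_json(umm):
        return {"id": umm["native_id"], "title": umm["title"]}

    ADAPTERS = {"echo10": ingest_echo10, "dif": ingest_dif}

    def ingest(native_format, record):
        """Route a native record through the right adapter into the common model."""
        return ADAPTERS[native_format](record)

    umm = ingest("dif", {"Entry_ID": "MOD021KM", "Entry_Title": "MODIS L1B"})
    print(export_minimal_json(umm))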

·         Targeted concepts

o    Granules, parameters, services, visualization, documentation, meta-metadata – these are things we are still developing

o    Always trying to raise the bar – metadata that humans and computers can read

·         How to adopt UMM concepts – this is a curated process

o    Start with stakeholder interviews – what they are doing currently – their needs

o    Existing implementation – don’t want to reinvent

o    Develop use cases – why we need it

o    Defining fields, relations – temporal extent

o    Identifying subsystem interactions – e.g., the need for caching…

o    Tailoring the lifecycle – these are not dead concepts… this is intended to be a living feedback model

§  The intention is to revisit quarterly – the meta-metadata then records which version a record was approved under

o    ESO review – currently doing it for UMM-C, but it will be part of the process for all metadata concepts

·         Lifecycle of metadata – publishing metadata models on the Earthdata space… you are part of the process

·         Q – what do you mean by visualization

o    Granules will have information that is valid for GIBS… it is a fuzzy future – if you have ideas, we want your input


 

UMM-C … Tyler Stevens

·         A more detailed look at the fields

o    Had a set of required (15), highly recommended (5), and recommended (18) fields

o    These fields come from GCMD, DIF, ECHO, and EMS

·         The UMM-C crosswalk study document is out for review – it discusses the mapping and analysis of each model

o    Also have a spreadsheet

o    Q are these documents available

§  The crosswalk is on the Earthdata wiki (http://wiki.earthdata.nasa.gov/display/ESO/UMM-C+Review)

·         ESDIS Standards Office (ESO) review – what is missing, is it correct, the utility of each field, the precision of the mapping, do we need more controlled vocabularies

·         Have a governance process for changes & feedback in the system

·         Q – excited about the lifecycle for extending this – on the ESO review part, doing the review is straightforward... did we get this right… but changes are more bleeding edge – will it account for stuff that should already be there? Need to be able to propose cutting-edge stuff

o    Originally just looked at what is in there – the lifecycle will include future stuff

o    Drifting towards a proof-of-concept track that then feeds into the main track – otherwise you can't really test things out

·         Reconciliation process

o    Goal – through iterations – develop high-quality data products

·         Linking records in ECHO and the DIF and identifying things that need to be linked or merged

o    Sometimes have parent/child relationships

o    Work with providers as to their best representative collection record

o    Edit/QA of UMM-C fields

·         Quality in CMR – there are various checks as the metadata goes into the CMR

o    Science coordinator teams work with the provider to maintain the quality

·         Workflow – first automated validation… later have human checks

o    Developing a QA validation service for providers to validate their metadata before ingest (see the sketch below)

§  Q – when will this be available – not yet
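
If a UMM-C JSON Schema (or the planned validation service) were available, a provider-side pre-ingest check could be as simple as the sketch below; the schema fragment and field names are invented for illustration, not the real UMM-C schema:

    # Illustrative provider-side pre-ingest check against a hypothetical JSON
    # Schema fragment for a collection record. Requires the jsonschema package.
    from jsonschema import Draft4Validator

    UMM_C_FRAGMENT = {
        "type": "object",
        "required": ["ShortName", "EntryTitle", "TemporalExtent"],
        "properties": {
            "ShortName": {"type": "string", "minLength": 1},
            "EntryTitle": {"type": "string", "minLength": 1},
            "TemporalExtent": {"type": "object"},
        },
    }

    record = {"ShortName": "MOD021KM", "EntryTitle": ""}  # missing TemporalExtent

    validator = Draft4Validator(UMM_C_FRAGMENT)
    for error in validator.iter_errors(record):
        print("QA issue:", error.message)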

·         Q – do you have this in an ISO profile? – later we will check against ISO – the ECHO team has a mapping from their model to UMM and back – this will be based off the MENDS group's work – also have a well-known mapping between DIF and ISO

·         Want input/questions – [email protected], include CMR in the subject line

·         Q – are you planning a similar workflow for data providers with DIFs in the international community, like USGS and NOAA, from GCMD?

o    Phase 1 is cleanup among the EOS data centers – no plan yet beyond that

o    What is the schedule –

§  For reconciliation – look at high-priority targets – EOS data centers first; some data providers have already been contacted (low-hanging fruit)

§  What about new data sets? – we have begun to add fields to the ECHO schema – including metadata standard name and metadata schema – there are fields that are not currently required, and are not in the ECHO10 data model, that will become required – should start thinking about it; new collections will need things like a more standard platform name – and should be dialing into UTC

§  If you could provide a schedule, it would help with long-term planning

o    Q subsecond search timeline

§  The CMR is not going to be a separate, additional system – it will slowly replace ECHO components… stand up the new system and retire the old one

§  Early testing

o    Q will it just suddenly change – sort of… it is slowly going to migrate

o    Q – is there a goal end date for the metadata records to be complete?

§  Originally 75% by the end of this year

§  This isn't something we have to wait to do – we can do this right now

§  Already started to put together an outreach package to help facilitate process

o    Q (Ted) – part of this is linking GCMD to ECHO records – not sure how to implement this linkage and how it goes to the data provider – making the match-up will be interesting – are the linkages already available?

§  They will be available to the provider as we start the process – there is a report that will show the linkages

§  Ted – a report is the most difficult way – is there a service keyed by entry ID?

·         Currently have an associated DIF entry ID in the ECHO metadata – mostly filled out, but there are duplicates in the records

§  Q – eventually there will be one CMR with GCMD and ECHO – rather than the idea of a link between them… it will actually be a single record (starting with ID linkage)


Attachments/Presentations: 
Attachment: CMR ESIP Summer 2014.pptx (2.53 MB)
Citation:
Baynes, K.; Pilone, D.; Stevens, T.; Streamlining Metadata using the CMR; Summer Meeting 2014. ESIP Commons, April 2014