Design and Implementation of ISO Metadata in Science Data Products
The NASA SMAP and ICESat-2 missions have experience developing tools that design and implement data products that employ ISO metadata. SMAP explored two major methods to automate the generation of ISO metadata. One method employs the Java-based Saxon processor and an XSL transform to convert HDF5 XML into XML that conforms to the ISO 19139 serialization. A second method employs a data binding approach that generates a set of C++ classes which construct components conforming to the ISO 19139 serialization. Both teams are working on methods to improve the design of ISO metadata that best describes their data products. Both teams are also leveraging HDF5 to represent ISO metadata in an alternative form. Others with experience developing ISO metadata that represents scientific data products are encouraged to participate.
This session will address the methods that have been used, the lessons learned and recommendations for future development of scientific data products that employ ISO metadata.
Design and Implementation of ISO Metadata in Science Data Products – Barry Weiss
· Need to get the right model in the right format – ISO 19139:2007, Geographic information – Metadata – XML schema implementation
SMAP – Barry Weiss
· SMAP – Soil Moisture Active Passive mission
· Products must be usable by the standard user community
· ISO 19115 defines the metadata content; ISO 19139 dictates how that ISO metadata is represented in XML
· Employ HDF5 groups and attributes to represent ISO metadata (19139 compliant) – people who know HDF5 can find the metadata and don’t need to use the XML
· Challenge – how to get the ISO metadata into the products – 15 products
· Metadata handling – ISO 19115 content in native HDF5 groups/attributes AND ISO 19139 XML, which is then written into the HDF file (see the sketch after these notes)
o Q: What is the difference between the two ISO types?
§ 19139 mimics the structure of 19115
§ 19115 is the content model of the standard and 19139 is the XML representation of the same standard
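A minimal h5py sketch of the idea above: ISO-style metadata carried as HDF5 groups and attributes so that HDF5-aware users can read it without touching the XML. The group names and values here are illustrative, not SMAP's actual product layout.

    import h5py

    with h5py.File("granule.h5", "w") as f:
        md = f.create_group("Metadata")                      # illustrative group names
        ident = md.create_group("DatasetIdentification")
        ident.attrs["shortName"] = "SPL2SMP"                 # example values only
        ident.attrs["creationDate"] = "2014-01-27T00:00:00Z"
        extent = md.create_group("Extent")
        extent.attrs["westBoundLongitude"] = -180.0
        extent.attrs["eastBoundLongitude"] = 180.0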
· XSL SMAP Tool chain
o Configuration files drive a standard executable – at this point there is only the group/attribute form, not the XML – run h5dump to produce HDF5 XML, then feed it into Saxon, which converts it into ISO-compliant XML and adds the 19139 serialization, creating a series (a sketch of this step follows these notes)
o Q: "Put into the product" – you mean the granule? – Yes
o Q: The attribute is an HDF5 container? – Yes
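A rough sketch of the transform step in the tool chain above. SMAP uses h5dump plus Java Saxon; here lxml's XSLT engine stands in for Saxon, and the file names (including the stylesheet) are hypothetical.

    import subprocess
    from lxml import etree

    # 1. Dump the HDF5 group/attribute metadata to HDF5 XML.
    hdf5_xml = subprocess.run(["h5dump", "--xml", "granule.h5"],
                              capture_output=True, check=True).stdout

    # 2. Apply an XSL transform mapping HDF5 XML onto the ISO 19139 serialization.
    transform = etree.XSLT(etree.parse("hdf5_to_iso19139.xsl"))
    iso_doc = transform(etree.fromstring(hdf5_xml))
    iso_doc.write("granule_iso19139.xml", pretty_print=True,
                  xml_declaration=True, encoding="UTF-8")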
· XSL transform features
o Requires multiple executables to operate
o Lineage becomes a problem because there are many lineage entries coming in
o XPaths become critical in the data binding approach
· Advantages
o Tailor an XSLT for each product
o Don’t need to tell the designer about each XPath
· Disadvantage – requires a complex software chain
o Simple XSLTs invite cut-and-paste errors
· XML data binding
o XML is treated as objects in memory – each little piece
o Then you need a library when you run the code – it is a huge library (~2 GB)
o Able to serialize and deserialize instances to/from code
o With the pieces you can build any XML (see the sketch after these notes)
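A conceptual sketch only: SMAP's binding generates C++ classes from the schema, but the same "XML as objects in memory" idea can be shown in Python with lxml.objectify. The element names and values are made up for illustration.

    from lxml import etree, objectify

    xml = b"""
    <DatasetIdentification>
      <shortName>SPL2SMP</shortName>
      <creationDate>2014-01-27</creationDate>
    </DatasetIdentification>
    """

    ident = objectify.fromstring(xml)        # deserialize: XML -> objects in memory
    print(ident.shortName)                   # navigate components like attributes
    ident.shortName = "SPL2SMP_E"            # modify the in-memory object
    objectify.deannotate(ident, cleanup_namespaces=True)
    print(etree.tostring(ident, pretty_print=True).decode())   # serialize back to XML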
· SMAP binding flow …
o It works very well
· Data binding features
o With 15 products and many enumerated options… people started to make mistakes a lot; better now
o Wrapper solved the problem
o Advantage – no cut and paste
o Disadvantage – the wrapper requires a large library
· Larger missions might want a data binding approach – start early
· Q: Does putting metadata into the file bloat the file? – No, it doesn’t
Jeffrey Lee – ICESat-2 Metadata
· ICESat-2 – the next decadal survey mission after SMAP – the instrument is ATLAS – MABEL is a related airborne mission
· ASAS – ATLAS Science Algorithm Software
· Data product goals – provide data and metadata, compliant…
· Characteristics
o 80 GB of L0 data daily – 1 TB of L1A–L3B data daily
o 3.5 PB over 3 years
o Sparse, multi-rate, along-track products (L3B is gridded)
o 3,200 science parameters and counting
· Going to use the HDF5 data model and try to be netCDF-4 compatible
· Need to collect the stats, transform them, and then get them out in the right format
· Metadata – self-documenting, providing provenance information and traceability – also need ISO 19115 content and ISO 19139 XML
· “granules are forever” and should stand alone
· Pieces – ACDD, ancillary data, tools…
o ACDD/CF global attributes
o ACDD/CF variable attributes
o Groups – group metadata organizes information about a product – attributes can be added
o /ancillary_data – just having a label is insufficient for some metadata – store HDF5 compact datasets with CF-standard labelling (see the sketch after this list)
o /ancillary_data content – algorithm constants, data settings (e.g., flags), control info, directory paths
o /metadata – OCDD – provides the information on the product needed to get an ISO 19115 representation
§ Flat attributes were insufficient to represent this – need groups/organization
§ Issues – no standard labelling convention
o Metadata example: MABEL
§ Q: You mentioned that there weren’t tools to read this – is that the groups?
· If you pulled this into a future version of Panoply(?) or IDL, it would see HDF5 groups and attributes, but none of the tools would know that it is metadata
· If there were a standard saying that a “metadata” group with specific fields can be parsed, tools could make it useful
· Look at H5U
o ISO 19139 XML is what is required to be delivered to the data center to ingest products
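A hedged h5py sketch of the /ancillary_data idea above (the dataset name, value, and attributes are illustrative): each ancillary value is a small dataset whose attributes carry the CF-style labelling that a bare label could not.

    import h5py
    import numpy as np

    with h5py.File("atlas_product.h5", "a") as f:
        anc = f.require_group("ancillary_data")
        epoch = anc.create_dataset("sdp_gps_epoch", data=np.float64(1198800018.0))
        epoch.attrs["units"] = "seconds since 1980-01-06T00:00:00Z"
        epoch.attrs["long_name"] = "SDP GPS epoch"
        epoch.attrs["description"] = "Algorithm constant; a label alone is insufficient"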
· Generation workflow – the PGE reads the product, there is a QA product … most of this already exists
o ASAS DD – most data centers want a data dictionary – this tool reads the output and transforms it into HTML tables, which is similar to translating to XML (a sketch follows)
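A sketch of the data-dictionary idea (not the actual ASAS DD tool): walk an HDF5 product and emit an HTML table of every dataset with its shape, type, and CF attributes; the same walk could just as easily emit XML.

    import h5py

    def data_dictionary_html(h5_path, out_path):
        rows = []
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                long_name = obj.attrs.get("long_name", "")
                units = obj.attrs.get("units", "")
                rows.append(f"<tr><td>{name}</td><td>{obj.shape}</td>"
                            f"<td>{obj.dtype}</td><td>{long_name}</td><td>{units}</td></tr>")
        with h5py.File(h5_path, "r") as f:
            f.visititems(visit)                 # recurse over every group and dataset
        with open(out_path, "w") as out:
            out.write("<table>\n<tr><th>parameter</th><th>shape</th><th>dtype</th>"
                      "<th>long_name</th><th>units</th></tr>\n")
            out.write("\n".join(rows) + "\n</table>\n")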
· The challenge
o 20 standard data products
o 3,200 science parameters
o 6 different flavors of metadata
· Programming steps… lots of steps yikes
· A solution – a web-based product database to store and maintain relations between files/groups/attributes/parameters (MySQL/PP: h5es_builder)
o Produces HDF5 template files
o Called H5-ES or HDF5-EaSY
o Template files – HDF5 file skeletons – all the groups, attributes, and data parameters with no values filled in (see the sketch after this list)
§ Chunked datasets with dimension 0 & attributes; values can be overwritten; H5Ocopy allows copying a piece or a whole file to another
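An illustrative h5py sketch of the template-file concept above (group and parameter names are hypothetical): every parameter exists as a chunked, zero-length dataset, so the skeleton carries structure and attributes but no values, and the skeleton can be copied into a product file.

    import h5py
    import numpy as np

    with h5py.File("template.h5", "w") as tpl:
        seg = tpl.create_group("gt1l/land_ice_segments")
        h_li = seg.create_dataset("h_li", shape=(0,), maxshape=(None,),
                                  chunks=(4096,), dtype=np.float32)
        h_li.attrs["units"] = "meters"
        h_li.attrs["long_name"] = "land ice height"
        tpl.attrs["title"] = "example template"

    # Copy the skeleton (a piece or the whole file) into a product file,
    # roughly what the notes describe for the HDF5 copy facility.
    with h5py.File("template.h5", "r") as tpl, h5py.File("product.h5", "w") as prod:
        tpl.copy("gt1l", prod)      # brings groups, datasets, and attributes along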
· Example PGE code
o Create a grouped, 2D, 90k-element HDF5 file with CF/ACDD & DS
o BUT… how did the parameters get created, how did the groups get created, how did the attributes get created… all of that was defined in the template (see the sketch below)
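Continuing the template sketch above, the PGE side is small because the structure and attributes already exist in the copied skeleton; it only resizes the zero-length dataset and writes values (the 90k size mirrors the example in the notes).

    import h5py
    import numpy as np

    heights = np.random.default_rng(0).normal(1500.0, 10.0, 90_000).astype(np.float32)

    with h5py.File("product.h5", "a") as prod:
        h_li = prod["gt1l/land_ice_segments/h_li"]
        h_li.resize((heights.size,))    # grow the zero-length chunked dataset
        h_li[:] = heights               # the PGE writes only the data values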
· Product development strategy
o The product designer (not a programmer) works with the database – designs the file – generates the template file – then the programmer gets involved
· Strategy
o The PGE uses the metadata template and then only has to update the data
· The HDF Group is partnering to extend this concept – it will be generalized and new features will be added
EDGE: the multi-metadata standard platform – Thomas Huang and Ed Armstrong
· PO.DAAC is the oceanographic data center for NASA
· EDGE: Extensible Data Gateway Environment
· Data management and archive system – automated ingest
o Rich metadata model in Oracle
o Different data handlers for different data products
o Export all the metadata into Solr (enterprise search engine), then built a front end on top (see the sketch after this list)
o Supports OpenSearch
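A hedged sketch of exporting a metadata record into Solr for search; the Solr URL, core name, and field names are placeholders rather than PO.DAAC's actual schema.

    import requests

    doc = {
        "id": "PODAAC-EXAMPLE-0001",                 # placeholder dataset id
        "title": "Example sea surface temperature dataset",
        "keywords": ["OCEANS", "SEA SURFACE TEMPERATURE"],
        "start_time": "2014-01-01T00:00:00Z",
    }

    resp = requests.post(
        "http://localhost:8983/solr/datasets/update?commit=true",
        json=[doc],                                  # Solr accepts a JSON list of docs
        headers={"Content-Type": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()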
· There is no pregeneration for metadata translation – it is indexed every 15 min
· EDGE is the brain behind the web portal & the back end for ISO support, OpenSearch, and datacasting
· When you want a file in ISO… start by searching with OpenSearch, get an OpenSearch document, then request it in ISO & the system generates it for you – really fast (see the sketch after this item)
o In the past system the FGDC generation took hours… the new system takes seconds
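A sketch of the flow just described; the endpoint URLs, parameter names, and dataset id are placeholders, not the real EDGE/PO.DAAC API.

    import requests

    # 1. OpenSearch query for datasets matching a keyword.
    search = requests.get("https://example.org/ws/search/dataset",
                          params={"keyword": "sea surface temperature",
                                  "format": "atom"},
                          timeout=30)
    search.raise_for_status()

    # 2. Request the chosen record as ISO; EDGE generates the 19139 XML on the fly.
    iso = requests.get("https://example.org/ws/metadata/dataset",
                       params={"datasetId": "PODAAC-EXAMPLE-0001", "format": "iso"},
                       timeout=30)
    iso.raise_for_status()
    print(iso.text[:200])    # ISO 19139 XML, produced in seconds rather than hours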
· Python code exports dataset metadata in ISO format – it is open to the web… Ted and Katie have played with it
o Want to minimize the coding
o Katie – should be using the UMM as the back end data model
o Actually used the ECHO data model… so very close
o Use this to export to the GCMD records
· The architecture is very important – if you do your homework up front, you don’t want to have to go back to the file
· Challenges
o Can get metadata for collections and granules
o Internal templates
o Found holes in the mapping – Barry’s work is building it in now… bad when it wasn’t received from the provider… better in the future
· Onward – look at quality information, missing attributes, tools for the user, …
· Back end – upgrading to Solr 4 – looking to support Elasticsearch because it is open source and can be repackaged
· Q: Supporting Elasticsearch, which is JSON-based – will there be a JSON-to-XML translation?
o Yes, but that will happen at the EDGE level – the user won’t see it – they see ISO and OpenSearch
ECHO Catalogue and ISO 19115 metadata – Katie Baynes
· ECHO is a metadata catalog covering all the EOSDIS holdings – 3.9k collections with 160 million granules, growing by 600k/week – supports a REST API and Reverb
· Currently not able to ingest ISO metadata, but testing with SMAP and GRACE
· Translating content on retrieval – if a record was ingested in DIF or ECHO10, there is a mechanism using an XSLT to produce NASA ISO (NASA best-practice ISO) – some information is lost during this
o Q (Ted) – there is only visible/orderable that maps to ISO – there is no loss from ECHO to ISO (but you can’t go back)
· Translation use cases – augmenting ECHO10 with lineage, provenance, and data quality via the UMM, which will then go into the NASA ISO record
o Ted – it is the same problem with FGDC and DIF… discovery-level metadata (which is what we extracted from ISO); ISO has so many things that you can’t go back from ISO
· Reverb results list – info icon – gives a list of downloadable metadata – native (exactly how it was received); ISO 19115 can also be retrieved
· Retrieving ISO via the ECHO API
o There isn’t a format parameter; it is a dataset parameter… just tack it on (see the sketch below)
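A sketch of "just tack it on": request a dataset record and ask for the ISO form by appending it to the dataset resource. The host, path, and id below are placeholders; consult the ECHO/CMR documentation for the real endpoint.

    import requests

    dataset_id = "C123456789-EXAMPLE"      # hypothetical collection id
    url = ("https://example.earthdata.nasa.gov/catalog-rest/echo_catalog/"
           f"datasets/{dataset_id}.iso19115")

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    print(resp.headers.get("Content-Type"))   # expect an XML media type
    print(resp.text[:200])                    # NASA best-practice ISO 19115/19139 XML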
· What version of the translation am I looking at? This translation is in flux – an XPath in the record says which version and the date of that version… version 1.15 – Jan 27, 2014 is current
o A new version as of June 6 is not yet available
o http://cdn.earthdata.nasa.gov/iso/
o http://cdn.earthdata.nasa.gov/iso/resources/transforms/
· ESO MENDS review
o MENDS = Metadata Evolution for NASA Data Systems
o Outputs – base reference, XSL transforms, crosswalk …
o Voted on issues, e.g., name representation
o Review groups: 1. science experts/previous MENDS team; 2. data center staff with ISO experience; 3. data center staff with little/no ISO experience
o Gotchas – metadata quality problems, not issues with ISO… Ted – this is happening all over the place
o Hierarchical keywords – voted on (GCMD keywords are hierarchical)
o Translation time – translated on the fly, which can be a problem as translations become more complex – CMR will look at caching results
· Questions – [email protected] (put CMR in subject)
· Ted – sees content come in and automatically go into a database – systems recognize the same concepts & they go into the database. The DB is 19115, the concepts are 19115, the complete representation is 19139… the DB has to have the concepts of 19115
· 19115 was created in 2003 – in the last 11 years we have improved our metadata. April 1, 2014 – ISO 19115-1 was accepted – NASA and NOAA
· How difficult is it to translate these concepts into 19139 or 19115-3? Thomas has templates – they look like ISO XML – in the values of the fields there are tokens that map back to the concepts in the DB
· LTER – has an export tool – not capturing ISO – question: what will be done with the ISO that is useful?
· Q (Ken Casey) – why come up with a new metadata model when ISO may be it? Because global metadata standards change. Why is NASA going down this road when ISO is already there?
o All of the newer missions are required to use ISO – current missions are told they have to reconstruct their metadata
o GCMD and ECHO were tasked with CMR – maybe the name UMM will change one day – it is where CMR is heading
· Q (Soren) – problem with metadata records in ISO – vector data from NASA
o People use ISO effectively with vector data