HDF: New Representations, Conventions and Features

Abstract/Agenda: 

HDF is the format of choice for scientific data in the Earth sciences and many other disciplines. Traditionally, data in HDF have been accessed using various HDF APIs (Fortran, C, Java, ...) on desktops and mainframes. We are now exploring access to HDF data using web services and several new representations. This work will be demonstrated using the THREDDS Data Server to generate XML and JSON representations of HDF metadata and data on demand. We will use the HDF4Map XML representation to characterize the metadata that are available from granules in the NASA HDF4 archive. Finally, we will seek advice and ideas for features required to support new HDF use cases for NASA data users and others.

Slides from this session:

Notes: 

·         HDF has a long history with NASA. Last year the HDF Group held a workshop at the ESIP meeting.

·         Earth Science Platform

o   Formats, services, tools, conventions – make interoperable information systems

o   The HDF Group has tools for writing, reading, and visualizing HDF

o   Today's focus is mostly on services and conventions, which are newer

o   Formats and tools are well known

·         Many groups are using HDF – new project called HDFinside

o   New book – Python and HDF5 by Andrew Collette (a minimal h5py sketch follows this list)

o   Andrew is a scientist in Boulder who works on the dust accelerator

o   NASA facility – researchers can get time on the dust accelerator (and can bring their own dust)

o   This is a large, expensive facility, funded by NASA, that shares data – Andrew uses HDF5 and MySQL and sends users away with HDF5 files

o   Interview on ESIP website - http://esipfed.org/ESIPInnovator-AndrewCollette
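
Collette's book centers on the h5py API. Below is a minimal sketch of the kind of write/read round trip it teaches; the file name, dataset name, and attribute are illustrative, not from the session.

    import h5py
    import numpy as np

    # Write a small dataset with a units attribute (names are illustrative).
    with h5py.File("dust_shots.h5", "w") as f:
        dset = f.create_dataset("velocity", data=np.random.rand(100))
        dset.attrs["units"] = "km/s"

    # Read it back; the file is self-describing.
    with h5py.File("dust_shots.h5", "r") as f:
        print(f["velocity"].attrs["units"], f["velocity"][:5])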

·         HDF Inside

o   NetCDF – the netCDF-4 library stores its data in HDF5

o   EOSDIS

o   Bag

o   NeXus – also has an XML dialect that it uses to create HDF5 files

o   FastQuery and FastBit – bitmap indexes stored in HDF5

o   PyTables – has its own indexing scheme – used for trading stocks

o   GlobeClaritas and Bioconductor

o   KAIRA – atmospheric radar – shipping large data files daily

o   MATLAB, IDL, …

o   These are communities whose own data formats have HDF underneath, which allows data sharing

·         If you are using HDF to share data, tweet #HDFInside

·         Q: How is Andrew using the metadata? A: He uses it to drive searches

Two talks – Aleksandar Jelenak and Ted Habermann

 

Opening HDF Archives With Services – Hyo-Kyung Joe Lee, Ted, Aleksandar

·         In the beginning there was the EOSDIS archive – desktop users retrieved HDF data via FTP

·         Evolving data access paradigm

o   Accessing from more devices

o   Have THREDDS or Hyrax

o   Can pull data via OPeNDAP – access doesn't have to be a whole single file anymore

o   HTTP – file-based download is still available

o   WMS/WCS – web services

o   Metadata – recently, metadata can be retrieved from these files as well

§  This has changed how people get data

§  Look at the metadata in XML or JSON and then retrieve a smaller subset of the HDF files (see the sketch after this list)

o   This talk presents two new web services
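
A sketch of the subsetting pattern described above, using the netCDF4-python client against an OPeNDAP endpoint; the server URL and variable name are hypothetical.

    from netCDF4 import Dataset

    # Open a remote granule through OPeNDAP; nothing is downloaded yet.
    url = "http://example.org/opendap/granule.hdf"  # hypothetical endpoint
    ds = Dataset(url)

    # Inspect the metadata first...
    print(list(ds.variables))

    # ...then pull only the slab you need; only those bytes cross the wire.
    subset = ds.variables["Temperature"][0:10, 0:10]  # hypothetical variable
    ds.close()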

·         HDF THREDDS Services (4.4 in beta)

o   Can embed your own/new services

o   THREDDS runs as a Java web application

o   Allows you to create hierarchies of your datasets that are independent of your file system

o   Built-in ISO Metadata support

o   2 services – HDF Map Writer and HDF5 JSON

o   Catalog with HDF4 and HDF5 files

·         HDF5 Data + Services

o   Dataset page – presents the services for a dataset (one file or multiple files)

o   Data services – OPeNDAP, HTTPServer, WCS, NetCDFSubset

o   Metadata services – NcML (an XML representation of the data following the netCDF Common Data Model)

§  UDDC – from ncISO – provides a rubric to check the metadata file against ISO

§  ISO – actually ISO 19115 metadata

§  H5JSON (new) – a JSON representation of HDF5 (described below)

§  WMS – Web Map Service – the link is a GetCapabilities request

o   Viewers – links to various viewers for additional displays of the data (a scripting sketch of the metadata services follows)
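
These metadata services are plain HTTP GETs on per-dataset paths, so they are easy to script. Below is a sketch using the standard TDS URL patterns; the server and dataset path are hypothetical.

    import requests

    server = "http://example.org/thredds"  # hypothetical TDS instance
    path = "testdata/sample.h5"            # hypothetical dataset path

    # Each metadata service is a different prefix on the same dataset path.
    for service in ("ncml", "uddc", "iso"):
        r = requests.get("%s/%s/%s" % (server, service, path))
        print(service, r.status_code, r.headers.get("Content-Type"))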

·         H5JSON

o   Shows when the file was created

o   UUIDs

o   Groups, e.g. HDFEOS

o   ID elements

o   File attributes

§  Each has its path within the HDF5 file

§  Name

§  Shape, type, and values

o   All the information that would be in the binary HDF5 objects is available (a consuming sketch follows)
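
A sketch of consuming an H5JSON response. The exact schema was not shown in the session, so the key names here ("datasets", "alias", "shape") are assumptions based on the description above.

    import json

    # Load the output of the H5JSON service (hypothetical file name).
    with open("granule.json") as f:
        doc = json.load(f)

    # Objects are keyed by UUID; print each dataset's path(s) and shape.
    # (Key names are assumed, not taken from the session.)
    for uuid, dset in doc.get("datasets", {}).items():
        print(uuid, dset.get("alias"), dset.get("shape"))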

·         HDF4 Data + Services

o   Same as HDF5…

o   The new service is H4MAP – it was a command-line tool and is now a full service

o   H4MAP gives XML output – the driver was data preservation – keeping the data usable when HDF4 is no longer around

o   Gives all the file attributes and data in the file

§  File information and file contents – describe what is in the file

§  File contents include attributes and groups – e.g., an earth bounding coordinate attribute with its data type and numeric values (a parsing sketch follows the URL below)

·         The HDF4 map was written for future scientists and engineers – it is now being repurposed for current scientists

·         http://www.hdfgroup.org/projects/h4map/h4map_writer.html
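
Because the map is plain XML, standard tooling applies. Below is a sketch that lists named elements from a map file with ElementTree; the tag vocabulary is an assumption, so consult the writer's schema at the URL above.

    import xml.etree.ElementTree as ET

    # Parse the XML map produced by the h4map writer (hypothetical name).
    tree = ET.parse("granule.hdf.xml")

    # Walk the whole document and report every element that carries a
    # 'name' attribute (element names are assumed; check the h4map schema).
    for elem in tree.iter():
        if "name" in elem.attrib:
            print(elem.tag, elem.attrib["name"])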

·         Q: NetCDF is HDF, but HDF is not NetCDF – HDF has more capabilities

·         Q: The JSON representation showed metadata – does the JSON version include the same information, including the data?

o   Doesn’t actually show values

o   Working on this

o   Gallagher – when looking for a specific group or variable, it is better to use JSON to ask for the specific variable and then use D3

§  D3 – can do more than just Earth

§  For D3, GeoJSON can be used, but it doesn't use BSON

o   How can JSON and the metadata space be used to help people see what is in the file? The focus is on metadata

·         All of this is two-way: from HDF to JSON to XML and back. There are also RDF representations of HDF and graph databases – textual representations… we want a consistent approach

 

The EOSDIS Metadata Archive – Ted Habermann

·         Ted works to evaluate and improve discovery metadata

o   Discovery is only first step, then use and understanding

o   Starting to evaluate use of EOSDIS metadata archive

o   Data archive is not useful without metadata

·         Metadata Types

o   Use

o   Understanding – e.g., evaluating why two datasets don't agree

o   Concepts are in HDF-EOS

o   Discovery = Core = ACDD (netCDF)

o   Use/Mashup = Structural

o   Understanding = Archive

·         Data Tools

o   Classified based on function

§  Visualization, geolocation/mapping, subsetting and filtering, data handling, search/order

§  ~30-35 tools in each category

§  Which of these tools use structural metadata?

·         HDF-EOS Grid Metadata

o   Made UML models of the metadata

o   What is HDF-EOS actually? Really an API document…

·         HDF-EOS Swath Metadata

o   More complicated than Grid

o   The UML diagrams are on the USGS wiki(?)

·         NASA Metadata Archive

o   Used the HDF4 mapper to get metadata from archive files – there are blobs of ODL in these files

§  ODL (Object Description Language) = a precursor to XML

§  There aren’t standard tools to deal with ODL

§  Many people have written parsers for ODL

§  Need to move away from ODL and toward XML (for standard tools)

o   ODL2XML.xsl – then analyzed over 600 files

§  StructMetadata.0 – if a file has this attribute, then it is HDFEOS (see the detection sketch after this list)

§  Some have only “StructMetadata” – that won't work for machines

§  The problem of “this looks like” vs. “this is”… it doesn't work for machines – users have problems when scientists don't follow the convention

§  Also will do a compliance test

§  About 50% (granule level) are compliant with HDFEOS – 76% of the datasets (products)
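
A sketch of the granule-level test described above, using pyhdf to look for the StructMetadata.0 file attribute; the file name is hypothetical.

    from pyhdf.SD import SD, SDC

    # Open the HDF4 granule read-only and list its file attributes.
    granule = SD("granule.hdf", SDC.READ)
    attrs = granule.attributes()

    # Per the convention, exactly "StructMetadata.0" marks an HDF-EOS file;
    # a bare "StructMetadata" looks similar but fails machine checks.
    if "StructMetadata.0" in attrs:
        print("HDF-EOS granule (contains ODL structural metadata)")
    else:
        print("Not machine-detectable as HDF-EOS:",
              [k for k in attrs if k.startswith("StructMetadata")])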

o   78 different tools

§  Either custom – they only work with a specific set of data

§  General tools – deal with EVERYTHING

§  Having unconventional data is EXPENSIVE

·         The entire archive can be read by the HDF4 mapper because it is all HDF4

o   Only half the archive has conventional use of metadata

o   The rest of the data can't be reached with conventional tools

·         Earth Science Platform

o   Depends on consistent formats

o   Conventions to support efficient tool/service development

o   NASA is working in a format that has been obsolete for over a decade – it needs to get away from HDF4

o   Conventions need work

·         HDF-EOS future working group – WE WANT YOU

o   Take datasets that are not compliant with HDFEOS and talk to the people who designed and wrote them, to see what technical problems with HDFEOS caused the problems in the files – the goal is a list of requirements

 

What features need to be added to the HDF library? What do you like, what don't you like, and how can we improve it?

·         Jeff – what is the goal of updating HDFEOS?

o   Build a collection of earth science data sets that can be accessed from tools and shared easily

o   Isolate scientists from underlying computer stuff (original goal)

o   Jeff - This is for new missions and old data

o   There are lots of communities that are using HDF – want to learn the lessons these groups have learned

·         James – one aspect of the netCDF work was taking the netCDF API and rewriting it to have HDF under the hood, using a simplified model. But other communities are also using HDF5, which is a general computer-science format – are there efforts in those domains to make conventions and enshrine them in an API? Yes

o   We can learn two things from each community – the technical lessons, and how they are building their community (what has worked in terms of adoption)

·         Joe Lee is using THREDDS and connecting Hyrax via OPeNDAP – need to work on how these support THREDDS catalogs

·         Use netCDF-4 classic (rather than netCDF-3) and add THREDDS – then you have HDF (with the metadata in the THREDDS catalog)

·         James – taking the concept of a catalog, embedding it in the data, and then using the data access API to access it is a very powerful tool – it leads to aggregation… We should do this and fix the THREDDS catalog

·         HDF is a directory structure, a “smart container”

·         CF is happy with flat file systems – going to directories is a very difficult thing

·         James – you find people who gzip HDF files – then the entire file has to be decompressed just to access the metadata. Move from gzip to smart, internally compressed files (see the sketch below)
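
The fix is to compress inside the file rather than around it: with HDF5's chunked storage, each chunk is deflated independently, so metadata and subsets stay readable without inflating the whole file. A minimal h5py sketch; names are illustrative.

    import h5py
    import numpy as np

    data = np.zeros((1000, 1000))

    with h5py.File("internal.h5", "w") as f:
        # gzip is applied per 100x100 chunk; a reader can touch the
        # metadata or a single chunk without decompressing the rest.
        f.create_dataset("grid", data=data,
                         chunks=(100, 100), compression="gzip")

    # By contrast, gzipping internal.h5 itself would force a full
    # decompression before even the header could be read.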

·         Ed – HDF4 to HDF5 – is it the scientists or the ground processing that is slowing this down?

o   It is a lot of existing code. The problem is in the software layer above the data

o   Management has become complacent

o   Need to change the way HDF helps groups like NASA

·         Make suggestions – HDF forum, email (Ted, Alek, others at HDF)

Citation:
HDF: New Representations, Conventions and Features; Winter Meeting 2014. ESIP Commons, December 2013