Enabling Technologies for the Ecological Data Life-cycle: Tools from the LTER Network.
Introduction and overview – John Chamblee, Coweeta LTER – University of Georgia
· Anthropologist – so covering the history of the tools being presented today
· “Ages” of LTER Information Management
o Late 1980s–early 1990s – site data access and data catalogues
o 1994–2001 – use of the internet expands
o 2010–2014 – developing web services
· Four components: culture, technology, structure, and contents
· Every LTER site (and network office) deals with IT stack in various ways
· Process – NSF has data availability guidelines
· Culture – LTER isn’t really a network – actually a network of networks – high diversity
o Each manager has to deal with university/institution-level issues (e.g., software licenses)
o Then ecological diversity – sites represent the biomes where they operate (shown on slide)
o Then data diversity
· Data sets by discipline – highest is chemistry – there is also hydrology, anthropology…
· There is diversity in ecosystem types – highest is lakes, then streams, but also beaches, dunes…
· Need way to deal with diversity for a network wide data catalogue
· Data catalogue milestones:
o 1989-1991 – Data Catalog “paper”
o … 2014 – PASTA DataONE
· Have structured data and structured metadata in PASTA – the quality manager is especially important
· Additional Network Info systems components
o ClimDB, Hydro DB, Veg E, Chem E, Site DB, Personnel DB, LTER Bibliography
· Have a long list of information management tools (small list presented here)
o Ex. GeoNIS, PASTAprog (download data into R & other programs)
· LTER and the ESIP Federation have parallel institutional and cultural structures – LTER developed internal tools, and it is time to present those tools to the wider world to see if they are useful to others… this session starts that conversation
· Q – can you list examples of other groups (partner organizations)?
o Forest Service (3 sites), US Agricultural Research, University of Georgia Marine Institute, The Nature Conservancy, Integrated Ocean Observing System
EML: Not Just for Ecology – Lessons Learned Using LTER Data – Margaret O’Brien (Santa Barbara Coastal LTER, UC Santa Barbara)
· EML = ecological metadata language
o It can handle lots of different kinds of data
· Goal – integration of heterogeneous data from diverse sources (didn’t yet know which)
o Partnered with programmers & experts
· EML – adopted syntax from other standards (ISO 19115, Dublin Core, STMML)
o Made it modular and extensible – users could adapt it & avoid extensive review process
o Some of the specifications were not granular enough
· History – started with EML (based on RDF triples), then EML 2 – hierarchical schema (2.0.1); now (2012) EML 2.1.1 – internationalized
o Not backwards compatible below 2.1
· EML highlights
o Detailed descriptions – sufficient for RDBMS
o Text description for web display
o 4 top-level elements – dataset, software, citation, protocol
o <additionalMetadata> for… EML is rich, but you might want more – it can hold any other XML
· Each data type can suggest the best way to use EML – everything teaches something new about EML
o Automated use & human use are expanding
· What does EML enable that we couldn’t do before?
o <dataTable>
o Usually deal with files that are text – models need to be flexible because you don’t know what you will get from a scientist
o Needed a way to compare metadata and data – create rules – each is a lesson
o Developed a technique to read the data and metadata and confirm that the information is there
o Metadata, report, and data are all stored in LTER system
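The metadata/data congruence check described above can be sketched as follows. This is a minimal illustration, not the LTER quality engine: the EML fragment and CSV content are invented, and real EML documents carry namespaces that this sketch ignores.

```python
# Sketch: compare the column names declared in an EML <dataTable> against
# the header row of the actual data file, as a congruence rule would.
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical EML fragment (real EML is namespaced and far richer).
EML = """<dataTable>
  <attributeList>
    <attribute><attributeName>date</attributeName></attribute>
    <attribute><attributeName>site</attributeName></attribute>
    <attribute><attributeName>temp_c</attributeName></attribute>
  </attributeList>
</dataTable>"""

# Hypothetical data file content.
DATA = "date,site,temp_c\n2014-01-01,GCE1,12.3\n"

def declared_columns(eml_xml: str) -> list:
    """Column names promised by the metadata."""
    root = ET.fromstring(eml_xml)
    return [a.text for a in root.iter("attributeName")]

def actual_columns(csv_text: str) -> list:
    """Column names found in the data's header row."""
    return next(csv.reader(io.StringIO(csv_text)))

def congruent(eml_xml: str, csv_text: str) -> bool:
    """One example check; a real engine runs many configurable checks."""
    return declared_columns(eml_xml) == actual_columns(csv_text)
```

A failing check like this is what would generate an entry in the quality report stored alongside the metadata and data.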
· Additional metadata is modular
o It is a development space … not machine readable … yet
o If pattern became adopted, it could be added to eml module
· Extension – lter-project
o EML’s project model was extended to provide structured content – not yet submitted to the EML group, but able to test and extend as needed
· Where might we want to go next
o Additional structures added to leaf nodes – incorporating other kinds of markup in text fields
o Better integration between EML & other specs
o … all go into an issue tracking system
o Cross links between EML resources – possibly go back to RDF triples
o Data type for streaming data – especially with so many sensors
· http://im.lternet.edu http://knb.ecoinformatics.org/#tools/eml [email protected]
· List of tools that use EML – compatible SQL models, XSL templates for HTML display and conversion (e.g., crosswalks to ISO 19115), R and Matlab code
· Q (Ted) – spec for EML… it is a schema for an instance
· Q (Ted) – the additional metadata tag as a holding place for stuff that doesn’t fit is an interesting model that is extremely dangerous… there are no machines that can read it (thus not validate it)
o OGC did this (called it metadata – it was an xlink to something); Java had “generic metadata” – these will still validate because the rules = lax
o ISO 19115 has a model to extend metadata, and at NASA there are additional attributes – an XML structure for adding something that is not in the standard – also record and record type… there are ways to extend things
o Showed worse way
o Use STMML – includes a schema to validate – can point to internal references
· Q – what is the control mechanism (in relation to Ted above)
o Internal policing… schema only goes so far
o Quality check is helpful
o None of the tools for creating EML support adding additional metadata (except units) – this is a small part of most documents
o Within the closed community do look at each other’s data
· Ted – ECHO at NASA includes something like 400 fields plus additional attributes – looked at 2,500–3,400 additional attributes on top of the 400… this has one of the strictest governance models in Earth Science… if I can add my own attributes, why check the standard?
o Question of whether this is how the community operates… or of looking at metadata at large scale
o As LTER grows – need to have standardized metadata
Provenance Aware Synthesis Tracking Architecture (PASTA) – Mark Servilla (LTER Network Office – UNM)
· PASTA
o It is the core infrastructure for network
o Permanent archive
o Had input in design by community (scientists and students)
o SOA, RESTful API, EML (2.x), multiple data formats (tabular, raster, spatial, vector), strong versioning
· Gatekeeper reverse proxy – single entry point into services – provides some validation of authentication tokens
· Identity management – currently only LTER members… adding more
· Audit/logging management – all audits go into a separate service – can query & output
· Data package management – data package specific services
o Metadata/data quality engine – looks at congruence between the two… data must adhere – creates a report – blocks data that does not meet certain criteria – there are ways to configure each check
o Metadata manager – metacat system
o Data manager – allows distributed storage across file systems
o Query/search – coupled to the data manager – looking at different types of solutions
o Provenance metadata – create a fragment of provenance metadata – ask PASTA to provide metadata for derived data
o Event manager – allows users to subscribe to data-upload events – sends an HTTP POST to an end service – used with the DataONE member node implementation – allows automated workflows to push data back into PASTA
· System monitor – monitor state of health of system (architecture & services)
· LTER Network information system
o Middleware sits between the consumer and the producer – PASTA is like middleware
o Think of it like a co-op:
§ have a number of producers that input data (these push),
§ pull content (SOA Services) – LTER LDAP,
§ Consumer applications – Data One Member Node, report generators
· Data package event time-line
o Site collects, uploads into PASTA – if a check fails, the site can refactor the data or reapply the tests, …
· Website – includes browse catalogue based on keywords
· Statistics – over 4,000 site-contributed data packages, over 15,000 synthesis data packages from EcoTrends, adding Landsat packages
o Most is publicly accessible – trying to be more publicly accessible
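Since PASTA exposes data packages through a RESTful API, downloading a package can be scripted directly. A minimal sketch, assuming the pasta.lternet.edu base URL and the metadata/data path layout of the data package services; the scope/identifier/revision values below are placeholders, not a real package.

```python
# Sketch: build PASTA data package service URLs and fetch their content.
# URL layout is assumed from the service design described in the talk.
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://pasta.lternet.edu/package"

def metadata_url(scope: str, identifier: int, revision: int) -> str:
    """URL of the EML metadata document for one package revision."""
    return f"{BASE}/metadata/eml/{scope}/{identifier}/{revision}"

def entity_url(scope: str, identifier: int, revision: int,
               entity_id: str) -> str:
    """URL of one data entity (e.g. a CSV table) inside the package."""
    return f"{BASE}/data/eml/{scope}/{identifier}/{revision}/{quote(entity_id)}"

def fetch(url: str) -> bytes:
    """Retrieve a resource; PASTA serves most content without auth."""
    with urlopen(url) as resp:
        return resp.read()

if __name__ == "__main__":
    # Placeholder package identifiers for illustration only.
    print(metadata_url("knb-lter-xyz", 1, 1))
```

The same URL scheme is what tools like PASTAprog use to pull data into R and other analysis environments.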
· Q (Ted) – you have a mechanism to keep poorly documented data out of the system… called the “data Nazi” approach… suggests an FDA-style “friends don’t let friends download bad data” – more like a directed help system…
Putting the Archives to Work: Workflow and Metadata-driven Analysis in LTER Science – Wade Sheldon (Georgia Coastal Ecosystems LTER, University of Georgia)
· Data-intensive effort so focus on informatics – managing, curating, and archiving
· Trying for EML implementation of integration level
· Putting a lot of emphasis on physical metadata for entities (e.g., tables) and the need to understand attributes
· EML Generation
o There have been lots of techniques for developing EML
o Metabase MMS
§ Generalized RDBMS for managing environmental metadata
§ Provides content management for all bits going into metadata
§ Has been effective for 12 years… it is getting adopted outside the LTER network
o DEIMS (drupal based system)
§ IMS built on the popular Drupal CMS framework
§ Provides prebuilt graphical forms for people to put metadata into
§ Structured EML doesn’t just happen
· Metadata-driven analysis
o PASTA has simplified using EML-described data for metadata-driven analysis and workflows
o Workflow tools – Kepler, R, SAS, SPSS, MatLab, GCE Data Toolbox
· Kepler
o Supports data downloading via REST URLs,
o ClimDB export
· R, SAS, SPSS, and MatLab
o EML is transformed via XSLT to generate native data-acquisition programs for the target platform
o Works from the website (restful service)… produces output for each program
· GCE Data Toolbox – Matlab framework for metadata-based processing, quality control, and analysis of environmental data
· Just starting to see what EML and PASTA can do
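The XSLT code-generation step described above can be sketched in Python: read the table name, attribute names, and missing-value codes from an EML fragment and emit a native data-acquisition statement (here a pandas call; the real tools emit R, SAS, SPSS, or Matlab code via XSLT). The EML fragment, file name, and column names are invented for illustration.

```python
# Sketch: derive a native data-loading statement from EML metadata,
# mirroring the EML -> XSLT -> native program workflow described above.
import xml.etree.ElementTree as ET

# Hypothetical EML fragment (real EML is namespaced and far richer).
EML = """<dataTable>
  <physical><objectName>well_data.csv</objectName></physical>
  <attributeList>
    <attribute>
      <attributeName>depth_m</attributeName>
      <missingValueCode><code>-9999</code></missingValueCode>
    </attribute>
    <attribute><attributeName>salinity</attributeName></attribute>
  </attributeList>
</dataTable>"""

def generate_reader(eml_xml: str) -> str:
    """Emit a pandas read_csv() call driven entirely by the metadata."""
    root = ET.fromstring(eml_xml)
    fname = root.findtext(".//objectName")
    names = [a.findtext("attributeName") for a in root.iter("attribute")]
    na_codes = sorted({c.text for c in root.iter("code")})
    return (f"pd.read_csv({fname!r}, names={names}, "
            f"na_values={na_codes}, header=0)")
```

Because the column names and missing-value codes come from the EML rather than being hand-typed, the generated program stays in step with the archived metadata.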
· Q – did you consider VisTrails in addition to Kepler? – aware of them but don’t have any active users – the new generation is using these tools more (otherwise they are using Excel)
· Q – using Schematron? – played with it a bit – have been using W3C rules – more traditional XML compliance
· Q – is DEIMS being maintained? – open source with a good community – used at about 9 sites – it is growing
· Q – scientific equivalence of data when transposing into R code – what is the provenance step when you upload the data? – there is a step in PASTA (creates the metadata)… then upload re-links – the modified data is no longer the same – it has a new identifier
· This has been a long evolution – the first code generation was in 1992 – provided a way to deliver structured metadata… then needed standards – then had a content standard, then the Ecological Applications papers, then EML