Data Lifecycle Interoperability - Broadening our Perspective on What it Means to be Interoperable

Abstract/Agenda: 

This breakout session is intended to build on discussions begun at the Winter Meeting around considering the full data lifecycle in interoperability models. It seeks to answer the question of how we can leverage interoperability concepts to enable more effective data management through all phases of the lifecycle. We hope to bring together a diverse group (state and US federal government agencies, data curation organizations, technology development groups, and research data networks) that can provide a rich collection of perspectives on how to more effectively apply data and metadata interoperability concepts across the data lifecycle, share information, and develop a plan for moving forward with achieving increased efficiencies throughout the lifecycle.

 

Notes: 

This session will be more discussion-based than presentation-based.

Session topic:

What are we doing or thinking about to facilitate movement through the data lifecycle? How can we think about efficiencies as we move data through the lifecycle?

Brief presentations, then conversation on these topics.

Karl

Points of interaction with the research process and data. Two examples: the JISC research presentation and the DCC Data Curation Lifecycle. See slides.

Karl is working in this framework to conceptualize how we can map the researcher's process (as in JISC) into the data curator's perspective (DCC): to curate, archive, and facilitate reuse of these objects.

Visualization of the researcher's lifecycle mapped to the activities of the data curation lifecycle (see slide for graph).

What roles can the standards we are using play in this lifecycle, helping us work through it more efficiently and increase the discoverability, usability, and accessibility of the data?

He showed a diagram from DataONE, which Bill will discuss more later: data that may or may not be documented, reusable, etc. We will think about answering basic questions in terms of what work we are doing now to enable progression through a lifecycle view.

How do we integrate the work of the various ESIP working groups and clusters to develop organizing principles?

Bill Michener

From DataONE. In DataONE they have realized that the software, curation, and research lifecycles overlap. DataONE has been in production for two years now, making it easier for researchers studying life on Earth and the environment that sustains it. They have been working on community building and education, on cyberinfrastructure, and finally on enabling the use of data with tools that cross points of the lifecycle.

There are three different components: the member nodes, the coordinating nodes, and the investigator toolkit.

Data management tools, etc. Long-term preservation: many researchers do not consider (or perhaps understand) long-term usability and access.

He named a few tools they encourage, keyed to points in the lifecycle -- see slides.

Someone asks: where is the data I need? They can go through the R portal to DataONE to access the data, and then upload the results of their processing when complete. Bill gave a few examples of this process.
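
The "where is the data I need?" step can also be scripted. Below is a minimal Python sketch (not shown in the session) of querying DataONE's search index; the coordinating-node URL and the Solr field names are assumptions for illustration.

```python
# Hypothetical sketch: discover datasets through DataONE's search index.
# The endpoint path and field names below are assumptions, not taken
# from the session.
import requests

CN_QUERY = "https://cn.dataone.org/cn/v2/query/solr/"  # assumed endpoint

params = {
    "q": "soil moisture",        # free-text discovery query
    "fl": "identifier,title",    # fields to return for each match
    "wt": "json",                # ask Solr for JSON output
    "rows": 10,                  # first ten matches only
}

resp = requests.get(CN_QUERY, params=params, timeout=30)
resp.raise_for_status()

for doc in resp.json()["response"]["docs"]:
    print(doc.get("identifier"), "-", doc.get("title"))
```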

Libraries are the front line at universities and colleges and need to be included in this process.

Chris Lenhardt

Middleware and iRODS

Data interoperability, data management, data quality, data lifecycle, etc. all go back to the metadata and structured content. You need the basic structure in place before you can do anything else.

From the GEO data meeting: agile data management, or deconstructing the data lifecycle. Ruth had talked about data having a life beyond creation - the life story of the data product.

Middleware - you can have the greatest middleware in the world, but the technology will not help you if you don't have the metadata. He reviewed a diagram of iRODS, and then a comparison of the individual scientist's workflow with iRODS and other technology used to structure that workflow for repeatability, etc., with rules you can implement.

How can middleware help? Include iRODS in the data management plan that is part of your proposal. Turn it on, and it creates the archive. It will also help create the structured content, provenance, etc. - all the changes as you go through the data lifecycle - as well as supporting data interoperability, data formats, etc. And finally (with DataNet), it helps implement trusted digital repository practices, which can be turned into rules.
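
To give a flavor of the structured-content and provenance side of this, here is a minimal Python sketch using the python-irodsclient library: ingest a file into an iRODS zone and attach provenance metadata that later lifecycle stages (QC, archiving, rules) can query. The host, credentials, paths, and metadata keys are hypothetical.

```python
# Hypothetical sketch: put a file under iRODS management and attach
# provenance metadata (AVUs). Connection details and attribute names
# are invented for illustration.
from irods.session import iRODSSession

with iRODSSession(host="irods.example.org", port=1247, user="alice",
                  password="secret", zone="tempZone") as session:
    irods_path = "/tempZone/home/alice/survey/site01.csv"

    # ingest the raw file into the managed collection
    session.data_objects.put("site01.csv", irods_path)

    # attach provenance as AVU metadata so downstream lifecycle
    # stages can find and act on it
    obj = session.data_objects.get(irods_path)
    obj.metadata.add("provenance:instrument", "CTD-042")
    obj.metadata.add("provenance:processing_level", "raw")
```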

Need to unpack the lifecycle and think about it from the very beginning.

Sky Bristol

USGS

http://www.usgs.gov/datamanagement

He discussed some of the major themes this site covers, as well as an upcoming policy that will outline how to implement these things. The site is designed by a working group, along with several other groups working on data integration, and it includes people from outside the organization.

There is a data management dashboard which will bring many of the tools mentioned here together in one cohesive place: instead of having to go back and forth between tools, users can go directly into one system.

Elsewhere in the community: the Parsons and Fox (2013) paper asks "Is data publication the right metaphor?" We need to put out good metadata and keep updating it, as opposed to treating it as a point-in-time publication.

The CDI data management working group is a place to make contact for more information.

The US Group on Earth Observations data management working group has published a document through the White House that covers much of this data lifecycle discussion. The working group is focusing on the civil Earth observation agencies and trying to capture the tools, policies, and challenges across the data lifecycle in each of these agencies: finding commonalities and differences, getting a feel for the state of the union, and helping us understand the lifecycle and manage our assets.

Denise Hills

Geological Survey of Alabama

Thinking about how we can share our data and make it interoperable, more from a user side but with some management as well.

Their sister agency, the Oil and Gas Board, is mandated by law to accession all data related to oil and gas wells. That means cores, cuttings, logs, permits, and other analyses related to those wells.

In addition, they have all the survey materials - maps, etc. - and other things dating from as early as the 1850s.

The more people can find their data, the more valuable it becomes. For example, 30 years ago no one thought to look for oil in shales; now it is the hot topic. They get many word-of-mouth visitors, but they need to get beyond that and make the data findable and accessible.

A key staff member just had a stroke, and they are afraid they will lose his knowledge, as it was just in his head. She is trying to learn from the ESIP community how to manage better without additional funding, and how to demonstrate the data's value in order to get the support (staff hours and money) to manage it. All of these issues cross-cut many of the ESIP clusters; interoperability and preservation go hand in hand.

She comes at this issue from a background different from computer science or data management, but does have experience in management, and is now in a position where she can implement policies which might become standards for other similar agencies that are not under NSF purview.

Karl

He discussed the overlap of these ideas and how it can help provide access to users who might not want to interact with the whole lifecycle, just a portion of it.

The full cycle: we cover an arc of it by focusing on a data management system (like Gstore), a joint project with the state government (New Mexico?), NASA, and NSF. It tries to provide a unified system on top of data objects, with management and access levels that can be mixed and matched to provide different types of access. A key principle is that these things are invisible to the users: they can add data objects and documentation to the system and then manage them through data services.

His group just joined DataONE as a member node. When it comes to documentation and discovery, they have an internal model they use to present metadata as ISO 19115, Dublin Core, etc., and it can evolve as they move forward. They are working at the University of New Mexico on a crosswalk (as this is not a long-term storage place) to allow university institutional repositories to pull materials from this source and move them into a more robust preservation system.
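
At its simplest, the crosswalk idea is a field mapping from the internal model onto a standard vocabulary. A hypothetical Python sketch, with internal field names invented for illustration:

```python
# Hypothetical sketch: map an internal metadata record onto Dublin Core
# element names. The internal field names are invented for illustration.
INTERNAL_TO_DC = {
    "dataset_title": "dc:title",
    "author":        "dc:creator",
    "summary":       "dc:description",
    "release_date":  "dc:date",
    "keywords":      "dc:subject",
}

def to_dublin_core(record: dict) -> dict:
    """Return the Dublin Core view of an internal metadata record."""
    return {dc_term: record[key]
            for key, dc_term in INTERNAL_TO_DC.items()
            if key in record}

example = {
    "dataset_title": "Rio Grande streamflow, 1990-2010",
    "author": "Earth Data Analysis Center",
    "keywords": ["hydrology", "New Mexico"],
}
print(to_dublin_core(example))
```

Because the mapping is data rather than code, it can evolve with the internal model, and additional targets (ISO 19115, etc.) are just more mapping tables.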

Data services are kept as simple as possible so that you can use them with many tools (GIS clients, catalogs like data.gov that harvest data, etc.), to broaden impact and discoverability and to move things into a long-term system for reuse.

Comments and insights from the room?

Comment on the data lifecycle itself: it describes the function of the data, and some metadata has been captured already, but metadata creation happens at many different points of the lifecycle instead of just one point.

Sky agreed and commented on treating metadata more like a living document than just a report at the end of the project.

Comment on metadata creation - with my own colleagues: at most large universities, you have a research administration office which handles reporting and administrative work. If at the end I asked them for multiple reports and accounts, they would think it crazy; yet at the end of a project we often create lots of metadata. This is something we are struggling to overcome.

Karl - there are social dimensions to this issue as well. What do people think is effective in institutionalizing this knowledge-capture process?

Commenter - my experience is with real-time oceanographic data, where one consideration is that you never get to look at everything and put a stamp of approval on it. I envision not hand-editing an ISO file, but building the lineage with a NOAA program that pulls many pieces into a web service, including the processing I used and other QC test materials. That information should ride along with the data, not replace it; data aggregations can use it for QC tests. We can build upon metadata. It is intimidating to put up data knowing people might ask when the equipment was last calibrated.
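
The "ride along, don't replace" idea amounts to appending lineage and QC events to a metadata record rather than overwriting it. A hypothetical sketch, with field names invented for illustration:

```python
# Hypothetical sketch: accumulate lineage/QC events on a metadata record
# instead of replacing it. Field names are invented for illustration.
import datetime
import json

def append_lineage(metadata: dict, step: str, detail: str) -> dict:
    """Add one lineage event, preserving everything already recorded."""
    metadata.setdefault("lineage", []).append({
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "step": step,
        "detail": detail,
    })
    return metadata

record = {"identifier": "buoy-42-temperature"}
append_lineage(record, "qc_gross_range", "passed: -5C to 40C bounds")
append_lineage(record, "calibration_check", "sensor last calibrated 2014-03-12")
print(json.dumps(record, indent=2))
```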

Bill - DataONE has looked at the social implications: there are lots of good tools, but many researchers have never heard of workflows and such things. On best practices: a PI needs a quick answer, not a 30-page technical document, so they created a database of one-page answers, plus best-practices primers that point to other tools. They also offer training at various workshops. The next phase of DataONE will have monthly to quarterly webinars on community-driven topics related to data management.

Sky gave an example: creating a new research program with sites around the country, they needed to put things in place to facilitate this, and put together an RFP as a starting point. There were social challenges along the way, specifically with integrating the data management plan into the proposal and seeing it turn up in the results as well. ScienceBase was used. It took several years and lots of carrots and sticks. There is never going to be one tool to rule them all, but context really matters in trying to be successful at this.

Question on interoperability, to Karl, about APIs and access to the data. Karl spoke about interoperability in relation to the data lifecycle: acknowledging that one tool will not rule them all, and identifying areas where definitions, conventions, common APIs, or other cultural and technical processes might be speed bumps as people move into the data management planning phase - translating that plan into a set of processes that can be streamlined into documentation, and lowering the barrier for carrying ideas and concepts through the lifecycle. Not everyone will use the same tools and technologies, but we can provide for the transitions through the process.

Karl spoke about creating processes that could be implemented in their institutional repository to make this straightforward.

Question - at the LP DAAC, we have a different lifecycle. It is part of both NASA and USGS, so there are differences. They developed three phases - active management, historic/older data management, and long-term preservation - and have to think of both NASA and USGS.

Karl - you might have nesting, and this is not just a loop.  Processes have their own cycles.

Chris - with big data systems, you don't launch a satellite before putting thought into the outputs, which is different from some of the long-tail data. There is an assumption that we have to keep all data forever. But we know that in science certain books or treatises become the go-to source for a specific subject: there might be 10 different books, but only one is recognized as the true source, and the other nine fade away. There might be something like this with data. Kevin said at the Winter Meeting: if no one touches the data for 10 years, it disappears.

Denise said that was her example with shale: if they had tossed it all 20 years ago, they would not be able to do some of the things they are doing today.

Question about data in a federal system being part of a records schedule. Sky gave some examples. There is a lot of data, though, that falls into a grey area: which records schedule does it fall under, how should it be made available, and for how long?

Sky - we are looking at not being the sole provider of data that customers are demanding. NOAA has asked its customers how it might make things more publicly accessible. There may be many distribution sites, but the records management mandate is to keep the archive.

Dynamic of public domain and licensed access.

Comment - separating sourcing from provisioning, and authoritative data principles.

Chris - there might be an authoritative source that scientists go to: there might be five different versions, but there will be one authoritative version, as opposed to an authoritative copy.

Karl spoke about university issues: looking through the faculty handbook at the ownership of databases, they do not have an extensive policy that is actionable or understandable, either for faculty rights and obligations or for our efforts toward preservation.

Question - Bill was talking earlier about all the avenues this community and others like EarthCube have to educate people on what tools are available; what is the responsibility of the university to make its people aware of these tools and tasks?

Bill - through ALA and CLR, many libraries have taken on this role of education on data management and the data lifecycle, but it has not reached critical mass yet. We teach students where the library is, etc., but we have not also shown them where the data management side is, especially with the federal mandates coming out. In the future ESIP can do a lot about this; Erin is working with RENCI to create tutorials, as is DataONE. We can work through other societies to get this work out - a one-stop shop for training events.

Question - ideas on research data? There will be a research publication, but it does not stop there.

Commenter - everyone is doing this (Harvard, etc.), but each has their own flavor, so maybe we can set the standard here at ESIP, similarly to the way we did with data citations.

Karl - that is good; what can we do to help with these challenges, and what are the challenges we can focus our community on?

Sky - one thing we could do as ESIP would be to eat our own dog food. Over the years it has become more and more difficult to find the information that ESIP has produced; we have thrown technology at it, but it is getting worse. We are not really managing the lifecycle of the ideas, or curating them in a digestible form.

Chris - some of this stuff grew up organically, like the wiki, while the Commons is supposed to be more curated. Maybe we need a better process for educating the cluster/committee chairs in what they have to do in the various processes, and why.

Sky - perhaps a model of the data lifecycle and how ESIP uses its tools could be created and used as an example to help others.

Denise - this is the best description of the difference between the Commons and the wiki. Regarding what is being suggested about better socializing: we need to do a better job of disseminating our training and when it needs to be employed - like a one-pager on the differences between the Commons and the wiki, etc.

Sky - for many of us who are managing an institutional repository, we don't just need a catalog, etc.; we need something that brings all of that together. And once you bring it all together, you need to apply thinking about how it is managed.

Denise - education on what ESIP is would help other groups, such as data users as opposed to data providers, engage more with this process.

Comment - with ⅓ of the 300 people here, do we have a mechanism to get them to take some of this knowledge back to their parent organizations?

Karl - the standard strategy has been ESIP 101 for learning about ESIP, but what are other ways?

Sky - you have to carry the context with you as data moves outside of where it started. When people discover outputs and outcomes from this community, they need to be viewable outside that context, so that you don't need to know the details of how ESIP works.

Commenters - if librarians are the key contact point, that might be a good place to start.

Chris has been advocating working with these other organizations and professional societies.

Citation:
Data Lifecycle Interoperability - Broadening our Perspective on What it Means to be Interoperable; Summer Meeting 2014. ESIP Commons, April 2014