Preservation and Stewardship Committee Telecon 2012-12-10
Updates on Winter meeting sessions
- DOI Landing Pages
Curt discussed AGU - there were a number of relevant presentations which aligned with our way of thinking. There was a town hall on data citations. There seems to be a lot of movement around this topic in other organizations.
Curt asked if anyone had any reflections. Denise asked if any of the sessions were recorded - she was unable to attend. Curt thought that only a few were, but the posters are posted on the eposter site, which also has a discussion area for each poster.
Curt felt that AGU was a good place for sounding board moments that will lead to future collaborations. Even for those who missed AGU, that material will not be lost.
Rama pointed to a couple of talks that he and Ruth (and others) were running - Margaret Headstum’s talk and the lessons-learned talk on archiving long-term data. Rama thought that it would be useful to send the data preservation content spec to one of the caller's colleagues who spoke on these concepts. (Did not catch name).
Curt talked about equivalence as a topic at AGU and at an earlier conference this year. Anne thought it would be worth gathering our scattered thoughts on equivalence and the outcomes from the NISO meeting. Bruce asked about outlines on equivalence - whether the presentations focused on how to prove that two different data files are equivalent. Curt talked about different forms and about assigning identifiers to the content rather than the form. Bruce asked if we had any text about specific file formats and reformatting data. Curt mentioned that the UNF (Universal Numerical Fingerprint) effort has done some work on this.
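As a rough illustration of identifying content rather than form, the sketch below hashes the numeric values of a data set after normalizing them, so two differently formatted files with the same values get the same fingerprint. This is a simplified stand-in written for these notes, not the actual UNF algorithm; the function name content_fingerprint, the choice of SHA-256, and the 7 significant digits are all assumptions.

```python
import hashlib


def content_fingerprint(values, sig_digits=7):
    """Hash numeric content by value, independent of file layout.

    Simplified illustration only; NOT the real UNF specification.
    """
    h = hashlib.sha256()
    for v in values:
        # Write every value in one canonical text form, rounded to a
        # fixed number of significant digits, before hashing it.
        canonical = format(float(v), f".{sig_digits - 1}e")
        h.update(canonical.encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


# The same three numbers stored in two different (made-up) layouts.
csv_text = "1.50,2.250,3"
fixed_width_text = "  1.5   2.25   3.0"

from_csv = [float(x) for x in csv_text.split(",")]
from_fixed = [float(x) for x in fixed_width_text.split()]

same_content = content_fingerprint(from_csv) == content_fingerprint(from_fixed)
changed_content = content_fingerprint([1.5, 2.25, 3.1]) == content_fingerprint(from_csv)
```

Here same_content is True and changed_content is False. A real equivalence test would also have to settle questions this sketch ignores, such as missing values, ordering of records, and non-numeric fields.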
Steve posted the link to the videos from AGU: http://fallmeeting.agu.org/2012/scientific-program/video-on-demand-lectu...
Curt moved on to the ESIP meeting. The first session is a keynote of interest to this group. That first afternoon there is a dual session on the National Climate Assessment and the climate assessment process. The second session is on the Global Change Information System itself. This has a lot to do with assigning identifiers, capturing provenance and other related things that went into the climate assessment. There are a number of other parallel sessions, but if interested please attend these sessions.
We also have a planning session on day three in the afternoon, where we can lay out activities and plans for the future. Data citations: more work is needed on refining those guidelines and the guidelines for editors and reviewers - who is going to work on these and what are the deliverables? The PCCS: turning that into a standard. A good case was made at the AGU talks for that type of standard, covering the elements of provenance and other specific pieces of information. There was also talk of a domain expansion on the W3C “prov” - it would be timely for us to take some of the elements in the PCCS and the more specific granularity-level workflow work (by Hook and Helen), pull these proposals into this, and align them so you can summarize and get specific. Other topics: connecting terms across systems and archives, and capturing and archiving provenance while capturing the data. ProvES ISO 1195 (?) has been taken on by the documentation cluster. This ISO is important in documentation of information, provenance and context as part of stewardship. We really need to work out who will do which activities. ProvES was also proposed for a NASA working group - we need to establish how their needs relate to what we want to do in our committee and ESIP.
Bruce asked about reliability (hard to hear; not sure what he was asking about). He referenced readings by Elizabeth Conway.
Rama was not sure if Bruce was talking about the same thing that he has been working on, but the content items are the same; it just has more details for specific types of instruments.
Bruce said it was coming out of Oxford - something that came up at the Knoxville meeting. It dealt with the probability of information loss, which gets at an important issue of long-term documentation and access. He will dig up some of the references and share them.
Curt also wanted to draw people’s attention to Denise’s presentation on the preservation of physical samples (afternoon of the second day). We need to tie that into the way we describe things. Denise said we are still working out the details, but we want to make sure the physical samples and objects are searchable in a similar way to digital data. We need standards for preservation of physical samples to make them interoperable. Curt asked if we would be addressing provenance. Denise said we need to capture as much information as possible; some of these samples are so old that we do not have it. The same is true of some digital data - we have lost provenance or access to file formats. If we can’t access it digitally or can’t find it, what good is it?
Denise wants to start at her Survey with the chips and core collection from drilling data. There is no typical workflow for that.
Bruce raised the issue of how you audit an inventory of items to check that they are still there. This will be important for data evaluation. He was looking at audit procedures - for all things donated over $5000 you want to check that the physical inventory matches the records that you kept. This will become a documentation item in provenance, although it is itself a matter of custodianship.
Denise is also worried about the fact that physical items take up space. As budget cuts occur, we need to prioritize what we keep and what we don’t. Data evaluation and data provenance are important for this.
Robert mentioned auditing again - another way to look at it is to sample at different time periods of ingest - 5 vs 10 vs 20 years in the collection. Sampling from different years of ingest will give you a different idea.
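Robert's suggestion amounts to a stratified audit sample: group the inventory by ingest year and draw a few items from each group to check against the records. A minimal sketch follows, assuming the inventory is available as (item id, ingest year) pairs; the function name, the item ids and the years are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_audit_sample(records, per_stratum, seed=0):
    """Pick up to `per_stratum` items from each ingest year.

    `records` is an iterable of (item_id, ingest_year) pairs.
    """
    rng = random.Random(seed)  # fixed seed so an audit can be repeated
    by_year = defaultdict(list)
    for item_id, ingest_year in records:
        by_year[ingest_year].append(item_id)

    sample = {}
    for year in sorted(by_year):
        items = by_year[year]
        # A small stratum is audited in full rather than raising an error.
        k = min(per_stratum, len(items))
        sample[year] = rng.sample(items, k)
    return sample


# Hypothetical inventory: items ingested roughly 20, 10 and 5 years ago.
years = [1992] * 5 + [2002] * 3 + [2007] * 1
records = [(f"core-{i:03d}", year) for i, year in enumerate(years)]

picked = stratified_audit_sample(records, per_stratum=2)
```

Each ingest year then contributes its own small checklist, so loss rates can be compared between older and newer parts of the collection.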
Bruce said that for accounting purposes we have to think about depreciation. What do we save and what don't we?
Curt thinks provenance helps with that. Not just the items, but the provenance of the items that cite those items. Making the connections through the scientific record that contribute to the provenance trail. We need the connections between recommendations and policies.
Rama thought that we could recommend suggestions for how to make throwing things away easier. The development of tools to help. But who makes the decision and what is the process is a different aspect of things.
Heather from the NCDC asked if anyone had looked up the NARA requirements on how long to keep what. They have someone there working through their materials, digital and physical, to come up with NCDC’s interpretation of NARA’s guidelines on what they are keeping and for how long.
Rama said for scientific data it tends to be - keep them for as long as they are useful. Heather said they have many documents that are not digitized and they have to decide what to do with them on a case by case basis. Denise pointed out cores - even if you digitize them you can't ever chemically sample a photograph. So what do you keep and what you don't - there is a whole other level of this topic.
Denise said it would be great to keep things as long as they are useful - as a state agency they have to keep things for as long as they can. But they are not sure who would decide, in a governmental agency as opposed to a research institute, when things would be thrown away.
Rama said we can talk about the consequences of throwing away materials. When they relate to each other we can have techniques that explain the consequences.
Bruce said that NARA says there is a value of 75 years which might drive the auditing process. Not sure what the guidance for state agencies or universities would be for retention. Denise said that in Alabama they are required to keep all records for eternity. But there are no stipulations on how they need to keep them. They are scanning logs but they have not digitized them, and they are not sure if they need to keep both the physical and the digital copy.
Bruce asked what happens if you destroy one - which one will keep the information alive the longest.
Curt suggested at the meeting we discuss deliverables that could come out of this group and this is an interesting area to explore.
The data citation guidelines for editors and reviewers - Curt mentioned exploring next steps now that the ESIP standards are one of many. The CODATA group has a number of standards as well. We wanted to do outreach to editors and reviewers within our field - ESIP has some standing to push those guidelines and encourage reviewers to look at citations and require that submissions include proper data citations etc. We need to refine this into a specific deliverable that contains these guidelines. We can talk about this more in coming meetings and develop a roadmap at the winter meeting. Rama asked if Mark was going to do something with the materials on the wiki, but Curt mentioned it has been a busy month. However, there was a town hall and a number of discussions at AGU, which is a positive. We know the right way and that we ought to do these things; we just need to get it into wider practice. It has not been happening as much as it should and is still incomplete today according to the standards. There is also the need for the archives to provide the right landing pages.
At the town hall there was a question about assigning DOIs to incomplete data sets. Sarah Chalhan (?) would not assign a DOI until a data set was complete. Brian had talked at AGU about work on subsets within a data set. Curt thought we should write up the ideas we have implemented. He mentioned that the identifiers paper notes this need to precisely cite a subset, but it is still an unsolved area. Curt was encouraged that DOIs were widely mentioned, and the ESIP recommendations in particular as well.
Rama believed they are using DOIs for data sets that are growing. There were suggestions that you not do that; Rama heard that at a talk as well. We need to discuss the ramifications of these different processes and how to handle precise references to data sets: growth vs revisions after the fact. The DOI alone is not enough. Mark has mentioned in the past including the access time, but Curt believes this is not good enough. Bruce asked if there were any examples of the NCGC data (gave examples of different data types) where the data file does not remain the same but is added onto as we move downstream. Curt thought these would be very useful as use cases given the disagreements on this topic at AGU, and that the pluses and minuses of these different types would be useful. Robert pointed out that it would allow people to make informed decisions. Heather said that the data Bruce was talking about is accessible online, but they also archive it every time it is updated. For provenance purposes they do archive the materials, but they only provide the latest and greatest on the website. Curt asked whether, if someone were to cite the data, you would be able to tell which version. Heather said they are working on correcting that, but in the past it might not have been as straightforward as they would have liked. Heather and the NOAA perspective might be very useful in a paper we might work on.
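To make the options in this discussion concrete, the sketch below shows one way a citation for a growing data set could carry a version, a subset description and an access date alongside the DOI. It is only an illustration for comparing the pluses and minuses, not an ESIP recommendation; the class, its field names, the citation layout and the DOI 10.0000/example are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical citation record for a data set that may grow or be revised."""
    authors: str
    year: int
    title: str
    doi: str
    version: str = ""                 # distinguishes revisions after the fact
    subset: str = ""                  # free-text description of the part used
    accessed: Optional[date] = None   # weakest pin: when the data were read

    def format(self) -> str:
        parts = [f"{self.authors} ({self.year}). {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        parts.append(f"doi:{self.doi}.")
        if self.subset:
            parts.append(f"Subset: {self.subset}.")
        if self.accessed:
            parts.append(f"Accessed {self.accessed.isoformat()}.")
        return " ".join(parts)


# Placeholder values; not a real data set or DOI.
citation = DataCitation(
    authors="Example Data Center",
    year=2012,
    title="Daily Station Observations",
    doi="10.0000/example",
    version="3.1",
    subset="stations 001-050, 1990-2000",
    accessed=date(2012, 12, 10),
)
text = citation.format()
```

With only the DOI and access date filled in, a reader has to trust that the archive can reconstruct what was online on that day, which is the weakness Curt pointed to; a version field only helps if the archive actually keeps and exposes each version.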
Curt mentioned DOI landing pages - we don't have much time, but we have done some work to capture specific information after the creation of a DOI and on constructing the pages - what information is available on that first page or from it. We should also talk about machine-readable ways of working from a DOI to more specific metadata, either of the data or of a subset, and about the issues of provenance and context. Can we recommend a common way to represent these without knowing the specific details of an archive? ISO 11905 (?) We have done a lot of work looking at different data centers and how they use the landing page, and we can do a number of things to look at standardizing that.
Rama asked what should happen to landing pages for growing sets - should they be static or dynamic?
Curt and Sarah will work on laying out the specific activities for the meeting, with the aim of producing specific deliverables and milestones. If there are specific deliverables in the near term, Curt would really like to develop a plan for those and for these activities. So try to think about this for the upcoming meeting.
Curt also mentioned that we might need to move more into the mailing list given problems with the time schedule and international attendance.
Bruce will send out an email concerning a topic he thinks we should discuss.