Collection Structure Group Breakout Session
Winter 2014 Proposed Breakout Session for the Collection Structure Group
One of the difficulties the Collection Structure Group has identified in
its discussions is the granularity of the items that should be included
in an archive's inventory of data and documentation. We had moved toward
an identification of a small number of distinct and reasonably stable
discrete object types, which included such items as physical objects
(biological or geological samples, as well as written or printed documents)
and digital objects (primarily digital files or databases).
One objective of the proposed breakout session is to solidify the list of
these basic types, together with the descriptive attributes that can provide
their functional definition.
Beyond that definitional purpose, there remains the difficulty of organizing
the record keeping for an archive's inventory.
The OAIS RM defines two types of Archival Information Packages: the Archival
Information Collection (AIC) and the Archival Information Unit (AIU).
"An AIU is viewed as having a single Content Information object that is
described by exactly one set of PDI (Preservation Description Information)."
[p. 4-41] "The AIUs can be viewed as the `atoms' of information that the Archive
is tasked to store." [p. 4-42]
From that standpoint, it appears reasonable to regard AIUs as the objects
that should correspond with the items in the archive's inventory control
or accounting system. In other words, an Archive should be able to assign
an AIU to an account in the inventory control system it uses to keep track of
what the Archive contains. To do this sensibly, we'll need to assume that
the objects in the inventory are stable objects. If we make progress on this
issue, we can discuss how to track an inventory of objects that the Archive
can assemble on the fly. The accounting transactions that track state changes
in the inventory are one way of keeping track of provenance, using standard
bookkeeping conventions and avoiding a deep descent into the syntax of some
of the Web database approaches.
This approach has the advantage of making it relatively easy to identify
distinct accounts and identify which objects belong to them. It may also
clarify how to handle uniqueness and equivalence of objects.
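
As a minimal sketch of the bookkeeping idea, assuming each AIU is a single
stable item that moves between accounts, the Python below records every state
change as a journal entry and derives both account balances and an object's
provenance from that journal. The Transaction and Ledger classes, the account
names, and the identifier "AIU-0001" are hypothetical illustrations; they are
not taken from the OAIS RM or from any existing inventory system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One journal entry: an AIU leaves one account and enters another."""
    date: str      # ISO date of the state change
    aiu_id: str    # hypothetical identifier for the AIU
    debit: str     # account that receives the AIU
    credit: str    # account that gives the AIU up
    memo: str = ""


class Ledger:
    """Append-only journal; balances and provenance are derived from it."""

    def __init__(self):
        self.journal = []

    def post(self, txn):
        self.journal.append(txn)

    def balances(self):
        # Every entry adds one item to the debit account and removes one
        # from the credit account, so the balances always sum to zero.
        totals = defaultdict(int)
        for txn in self.journal:
            totals[txn.debit] += 1
            totals[txn.credit] -= 1
        return dict(totals)

    def provenance(self, aiu_id):
        # The provenance of an AIU is its transaction history, in order.
        return [txn for txn in self.journal if txn.aiu_id == aiu_id]


ledger = Ledger()
ledger.post(Transaction("2014-01-06", "AIU-0001",
                        debit="Ingest staging", credit="Producer submissions",
                        memo="received from producer"))
ledger.post(Transaction("2014-01-07", "AIU-0001",
                        debit="Main storage", credit="Ingest staging",
                        memo="validated and stored"))

for txn in ledger.provenance("AIU-0001"):
    print(txn.date, txn.credit, "->", txn.debit, "|", txn.memo)
print(ledger.balances())
```

The journal is never edited in place, so an auditor could, under these
assumptions, replay it to reconstruct where an AIU has been and confirm that
the balances agree with what is physically held.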
A second objective of the proposed breakout session is to clarify the group's
preferences regarding how to handle the distinctions between similar objects.
To move our discussion forward, we will consider several examples
involving various cases of replicating, copying, and comparing different
instances of objects that an archive can keep in its inventory. The specific
cases will require a careful examination of files with different degrees of
similarity, different charts of accounts, and different ways users might
specify whether or not the data in two (or more) files are equivalent.
The discussion of these cases will require each breakout session participant
to discuss their preferences for particular approaches. If the group emerges
with a clear consensus, then we will move toward writing up that consensus
with a view toward publication in one of the Earth science informatics or
data journals.
Here are some sample issues we might discuss.
1. Consider a digital file originally stored in the archive. One copy is made
and stored in an online backup, and a second copy is transferred
to a remote backup site. The archive uses standard copy software for this
purpose and checks that the copies have the same cryptographic digest. How
should the Archive inventory these three apparently identical copies?
a. Should they go into separate inventory accounts, with location treated as
part of the definition of the object (since location is the only thing
that distinguishes the copies)? [Account names might be "File A in main
storage," "File A in online backup," and "File A in remote backup."]
b. Should they go into one inventory account that just keeps track of how
many individual copies there are? [This is a bit like a grocery store keeping
track of just kg of potatoes.]
Extra credit questions: If there are three indistinguishable files - what is
unique and how would you test for uniqueness? If the files are indistinguishable,
are all three "authentic?" If not, which one is authentic and how could an
independent auditor verify the authenticity claim for that particular copy?
2. Assume that the archive makes an indistinguishable copy and sends it to
another archive, and that either archive then sends an indistinguishable copy
to a user.
Does the user now have an authentic copy of the original? If not, can the
user even extract data from the file and use it, assuming that the copy is
"inauthentic?"