The SEAD Prototype: Data Curation and Preservation for Sustainability Science


Sustainability science is a growing area of interdisciplinary research that focuses on the complex interactions between nature and human activities. Sustainability research requires heterogeneous data from across the earth science and social science disciplines. Because scientific disciplines collect, describe, and store their data in different ways, sustainability science poses challenges in data access, use, and preservation. We address those challenges in the Sustainable Environments – Actionable Data (SEAD) project, a DataNet project funded by the National Science Foundation. In this poster, we describe the goals of the SEAD project and outline the SEAD prototype, along with some of the solutions adopted in its development.

The SEAD project addresses the needs of sustainability researchers for data discovery, aggregation, and management, and supports long-term data reuse scenarios. SEAD aims to support the following use cases:

1. Ingestion of heterogeneous data types (e.g., images, geospatial data, and sensor data), mapping of semantic relationships among research data collections, and semantic annotation and tagging.

2. Support of data discovery through interoperable standards and algorithms, social networking, and data publishing.

3. Enhancement of existing data through automated scientific metadata extraction and data visualization plugins.

4. Ingestion of new data sets directly via bench-work tools.

5. Curation of data via federated deposit into institutional and disciplinary repositories.

We integrate and extend existing tools and services to create a generalizable framework of services. The framework combines a rich curation infrastructure with a robust archiving infrastructure that provides long-term access to preserved datasets via a network of institutional and disciplinary repositories. The SEAD strategy aims to improve the quality, relevance, and usefulness of data and to reduce the cost of data management and preservation in the following ways:

· Moving data curation tasks earlier in the data life cycle, toward the beginning of research projects.

· Involving domain scientists in setting priorities for the evolution of data and services.

· Developing mechanisms for facilitating and improving data discoverability, such as automatic metadata capture in diverse forms and community annotation of data.

· Re-engineering long-term curation processes to leverage rich metadata and volunteered efforts in data curation.

We realize our strategy by implementing a three-component architecture. The three interacting components of the SEAD architecture are the active curation services in the Active Content Repository (ACR), which provide interfaces to upload, store, organize, and annotate data; the community social curation services, SEAD-VIVO, which connect researchers' profiles to data and research products and offer rich network exploration and visualization capabilities; and the SEAD Virtual Archive (VA), which allows researchers to store data long-term and to search for data across multiple institutional repositories.
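The interaction among the three components can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the SEAD implementation: the class names, method names, package fields, and sample data below are all hypothetical and are not taken from the ACR, SEAD-VIVO, or VA codebases. It only shows the flow described above: data is uploaded and annotated in the active space, linked to a researcher profile, and then packaged and deposited into a repository where it can be found by search.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data item in the active curation space (hypothetical model)."""
    name: str
    content: bytes
    annotations: dict = field(default_factory=dict)
    creators: list = field(default_factory=list)


class ActiveContentRepository:
    """Stand-in for the ACR role: upload, store, organize, annotate."""

    def __init__(self):
        self.datasets = {}

    def upload(self, name, content):
        self.datasets[name] = Dataset(name, content)
        return self.datasets[name]

    def annotate(self, name, key, value):
        self.datasets[name].annotations[key] = value


class ResearcherNetwork:
    """Stand-in for the SEAD-VIVO role: connect profiles to data."""

    def __init__(self):
        self.links = {}  # researcher name -> list of dataset names

    def link(self, researcher, dataset):
        self.links.setdefault(researcher, []).append(dataset.name)
        dataset.creators.append(researcher)


class VirtualArchive:
    """Stand-in for the VA role: long-term deposit and cross-repository search."""

    def __init__(self, repositories):
        self.repositories = {repo: [] for repo in repositories}

    def deposit(self, dataset, repository):
        # Package the data description together with its metadata (assumed fields).
        package = {
            "name": dataset.name,
            "size_bytes": len(dataset.content),
            "annotations": dict(dataset.annotations),
            "creators": list(dataset.creators),
        }
        self.repositories[repository].append(package)
        return package

    def search(self, key, value):
        # Look for matching annotations across every registered repository.
        return [
            (repo, pkg["name"])
            for repo, pkgs in self.repositories.items()
            for pkg in pkgs
            if pkg["annotations"].get(key) == value
        ]


if __name__ == "__main__":
    acr = ActiveContentRepository()
    data = acr.upload("flume_run_01.csv", b"t,depth\n0,1.2\n")  # made-up sample
    acr.annotate("flume_run_01.csv", "topic", "sediment transport")

    network = ResearcherNetwork()
    network.link("A. Researcher", data)

    archive = VirtualArchive(["institutional", "disciplinary"])
    archive.deposit(data, "institutional")
    print(archive.search("topic", "sediment transport"))
```

In this sketch, annotations added during active curation travel with the deposited package, which is what lets the archive answer searches without returning to the active repository.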

The SEAD prototype represents the efforts of a large collective of software developers, data curators, scientists, and librarians, who contribute their diverse expertise to define and enable new practices related to active curation, research social networks, and long-term preservation. SEAD leverages existing open-source software, including the Tupelo and VIVO semantic web technologies, the Medici data repository software, the MySQL database, Data Conservancy software, and other components.

Currently, SEAD has implemented core functionality for uploading, annotating, and viewing data and for linking data to researcher profiles, as well as mechanisms to package this information and transfer it to institutional repositories or archival cloud storage. The curation pipeline to institutional repositories via the VA supports both long-term preservation and search and discovery workflows. The SEAD prototype is currently being tested by ingesting, annotating, and preserving datasets from the National Center for Earth Surface Dynamics (1.6 terabytes of data containing over 450,000 files), a process that involves the transfer of data and metadata between the SEAD ACR, VIVO, and VA components.