Data Publication/Ingest Systems - Tools and Best Practices
Data centers struggle with difficulties related to data submission. Data are acquired through many avenues by many people. Many data submission activities involve intensive manual processes, and the level of information received from data providers varies. In addition, collecting and tracking the information on pending data set submissions is arduous. For data providers, the submission process can be inconsistent and confusing. Scientists generally provide data from previous projects, and archival can be a low priority. Incomplete or poor documentation accompanies many data sets. However, complicated questionnaires deter busy data providers. The data centers have to work through multiple data sets at a time, tracking, documenting, and curating the data files. The process of ingest can be time consuming. In addition, journals are requiring data to be published before they will accept an article for publication.
This session is aimed at discussing approaches taken by various agencies and groups to address data publication.
The goals of the session are to:
- Build an understanding of various approaches taken by agencies and groups to handle data publication
- Demonstrate functionality and tools that aid in the ingestion and publication of data sets
- Share best practices in developing data publication systems
- Share experience to tackle common issues and work as a community towards new/shared solutions
1. The NEON approach to data ingest, curation, and sharing
Christine M. Laney and Mark Brundege
The National Ecological Observatory Network (NEON) is a new continental-scale observation system, currently under construction, that is dedicated to examining ecological change over time. Highly diverse data streams are collected using airborne hyperspectral and LiDAR sensors, tower-based ground, water, and air sensors, NEON field technicians who collect data and samples in the field, and contract laboratories that process specimens and samples. Data are delivered to headquarters via hard drive (for large remote sensing data), network streaming, web UIs, and in the near future, hand-held PDAs. At this time, more than 20 terrestrial sites are streaming sensor data, 15 terrestrial sites have collected observational data, and 10 sites have collected airborne (hyperspectral, LiDAR, and hi-res camera) data. With more than 100 sites planned to be operational by the fall of 2017, the process of managing and publishing data for eventual use by external researchers and the general public is increasing in complexity. Here, we present the system by which NEON data are ingested, curated, processed, and eventually shared via a data portal. System considerations include hardware development, data QA/QC, algorithm development and implementation by combined science and cyberinfrastructure teams, and discoverability and usability by NEON’s user community.
2. Global Hydrology Resource Center (GHRC) Data Ingestion Process
Helen Conover, Rahul Ramachandran
3. Long Term Ecological Research Network (LTER) Information Management
Corinna Gries, John Porter
The U.S. Long-Term Ecological Research (LTER) Network incorporates ecological research at sites in many biomes. Hundreds of faculty and student researchers, along with technical staff, collect many different kinds of data (everything from ice thickness, to forest growth, to coral populations). A variety of tools, including ad hoc spreadsheets, structured spreadsheets, databases, GIS, remote sensing software, custom programming, and statistical packages, are used to process and ingest diverse data from LTER researchers and instruments. Individual LTER sites, or small groups of sites, have developed common-use tools such as the GCE Matlab Toolbox (https://gce-lter.marsci.uga.edu/public/im/tools/data_toolbox.htm), which incorporates a large number of functions useful for the management, QA/QC, documentation, and ingestion of time-series data. The Drupal Ecological Information Management System (DEIMS), built on the Drupal content management system (https://www.drupal.org/project/deims), provides web-form-based interfaces for managing many routine information management tasks supporting the data publishing process. LTER data are published via site-based data catalogs hosted on individual web sites and via a network-wide data portal. Ecological Metadata Language (EML) serves as the standard for exchanges of metadata and is used for a variety of purposes, including generation of searchable data catalogs, informational web pages, and even automated generation of statistical scripts (R, Matlab, SAS, and SPSS). The LTER Data Portal (https://portal.lternet.edu) provides additional metadata quality checks, a data versioning system, unique identifiers, a repository for data from LTER sites, and a mechanism for sharing with other archives (e.g., DataONE).
4. Establishing Best Practices in Data Accession at the NASA NSIDC DAAC
Donna J. Scott, Amanda Leon
At the National Snow and Ice Data Center (NSIDC) Distributed Active Archive Center (DAAC), data accession requires a manual review by scientists and data managers to ensure the scientific integrity of the data as well as its fit within the scope of a cryospheric data center. Over the last several years we have focused on adequately capturing information and decisions surrounding a submission request, as well as improving the timeliness of collecting expert input from multiple groups. The time between accession approval and release remains a challenge. The NSIDC DAAC is starting to explore a more automated approach to capturing the submitted information and data to improve efficiency in the process. An additional challenge is rethinking how we distribute data at the DAAC, and the expectations of support. With some publishers requiring actively distributed data sets, there is more pressure on scientific researchers to ensure their data are distributed by approved data centers before their literature is published. This is forcing a new conversation at the DAAC about how to meet publishers’ deadlines and requirements while maintaining our desired levels of service for our distributed data and information.
5. Send2NCEI: Improving Producer/Archive Propinquity
John Relph, Kenneth S. Casey
The National Centers for Environmental Information (NCEI) are responsible for preserving, monitoring, assessing, and providing access to the Nation's treasure of environmental data and information. NCEI often receives "one-off" Submission Information Packages (SIPs) which are not well documented. The process of documenting these data packages can be a burden to NCEI staff, can involve numerous discussions with the Producers, and can take weeks or months to complete. To address these issues, NCEI has developed Send2NCEI, which consists of a Producer-friendly web interface for collecting metadata and uploading data files to the Archive as a well-formed SIP, and an internal set of services for inserting the SIP into the existing Archive workflow.
6. Oak Ridge National Laboratory "Semi-"Automated Data Ingest Process
Suresh Vannan, Tammy Beaty, Bob Cook, Daine Wright, Yaxing Wei, Ranjeet Deverakonda, Harold Shanafield
The ORNL DAAC archives and publishes data and information relevant to biogeochemical, ecological, and environmental processes. The data archived at the ORNL DAAC must be well formatted, self-descriptive, and documented, as well as referenced in a peer-reviewed publication. The ORNL DAAC ingest team curates diverse data sets from multiple data providers simultaneously. To streamline the ingest process, the data set submission process at the ORNL DAAC has recently been updated and semi-automated to provide a consistent data provider experience and to create a uniform data product. The goals of ingest automation are to:
- Provide the ability to track a data set from acceptance to publication
- Automate steps that can be automated to improve efficiencies and reduce redundancy
- Update legacy ingest infrastructure
- Provide a centralized system to manage the various aspects of ingest
- Manage data-set-related communications, both internal and external
This talk will cover the workflow and tools developed through this semi-automated system. Presentation in Google Docs.
7. ASDC Ingest Automation Efforts through Collaboratory for quAlity Metadata Preservation (CAMP)
Aubrey Beach, Tiffany Mathews, Emily Northup
ORNL DAAC presentation