Towards a Data Commons for the Geosciences
Modern scientific discovery, particularly in the geosciences, is driven by a model in which teams of specialized experts collaborate around data and software. Common to these collaborations is the need to share and control data and descriptions of the data, to share compute resources and tools, to share and develop code, to move data between compute resources and team members, and to save and publish data and results. Several groups have been developing technological solutions to enable data-centric collaborations based on the concept of a data commons. However, the notion of a data commons, and what constitutes one, is not well defined. We propose a workshop to discuss what a data commons should provide for the geoscience community based on some representative science use cases, where we are today, and what needs to be accomplished.
- Presentation: "Towards a Data Commons for the Geosciences" by Chris Lenhardt, Charles Schmitt, Brian Blanton, and Howard Lander of RENCI
- Presentation file is attached.
- The term "Data Commons" is borrowed from other applications, so it is open for discussion.
- The basic idea is that as data grow in volume and heterogeneity, it becomes increasingly vital to facilitate open, collaborative, and interdisciplinary research.
- However, in order to achieve a "Data Commons", several capabilities are needed, such as federated sources, access to computing and networking, and integrated data management.
- During this session, "Data Commons" can be understood as an environment/infrastructure that provides/enables integrated scientific research.
- The research data life cycle is one way to consider the approach for a Data Commons.
- "box" or "Dropbox" has been used as a particular manifestation of a Data Commons; however, it is not the best approach for a complete Data Commons.
- EuDAT (https://www.eudat.eu/) provides a suite of services that correspond to the different stages of the research data life cycle.
- EuDAT is deployed across Europe.
- Other examples of Data Commons:
- INCF Dataspace (https://www.incf.org/resources/incf-products-and-services/incf-dataspace)
- KnowledgeSpace (http://www.knowledgespace.com.au/)
- NIH Data Commons (https://datascience.nih.gov/commons)
- NCI Genomics Data Commons (https://gdc.nci.nih.gov/)
- Associated APIs are also made available through the Commons
- An additional data portal, cBioPortal for Cancer Genomics (http://www.cbioportal.org/), is also linked to the Commons.
- Data Commons instances are becoming more popular, so interoperability among them is an emerging issue.
- Other examples:
- Datanet Federation Consortium (DFC) - http://datafed.org
- An infrastructure to bring together the following major areas: services, collaboration, resources, analysis, workflows, and repository.
- iRODS (http://irods.org/) is at the heart of the computational resources.
- How about DataONE (https://www.dataone.org/)? Does it count as a Data Commons?
- From Bill Michener: The current structure of DataONE would more likely be considered a federated community. This is mostly because the definitions of "Data Commons" and "Data Repositories/Archives" have not yet been clearly distinguished. As data repositories/archives and data commons proliferate, perhaps we need to examine more closely whether the two should be differentiated.
- Additional comments from audience: It is important to consider how these different infrastructures could be engineered/implemented, so that they help in fulfilling the variety of data and data management/stewardship needs.
- Open Science Framework from Center for Open Science - https://osf.io/
- Meant to be an interdisciplinary commons, and to minimize the movement of data.
- Connections have been made to several existing systems that are being used as "data commons", such as Dropbox, GitHub, box, figshare, Dataverse, Mendeley, Amazon, and Google Drive, to improve interoperability.
- Comment: Working across existing and evolving communities of practice would be challenging but crucial to facilitating Data Commons and their interoperation.
- One way to approach this might be to couple lightly at the upper layers but allow loose coupling at the lower layers that involve specific operating components.
- Questions to consider:
- Does the geoscience domain need a Data Commons?
- If yes, what are the assumptions and capabilities that this Data Commons might need?
- How could we fulfill the goals that we would set for a geoscience-specific Data Commons?
- Key feedback from the breakout discussions:
- Best practices and interoperability were the two factors emphasized by the first breakout group.
- In order to answer this question, it might be necessary to define success criteria first.
- Additionally, what are some of the best practices that are already in progress? For example, within the ESIP community, different committees, working groups, and clusters might already be tackling certain aspects that could contribute to the formation/structuring of a Data Commons.
- The concept of loose coupling might help in facilitating the interoperability of Data Commons.
- For the second group, the main focus of the discussion was whether it is possible to have a single, universal Data Commons.
- It is potentially OK for each discipline to develop its own Data Commons; however, these Data Commons need to be linked together.
- The development of a Data Commons needs to be tightly coupled with the needs/problems that it aims to resolve. This would help guide the capabilities and features that should be designed and implemented in order to meet those needs and resolve the identified problems.
- Funding is also necessary to enable the development of linkages between Data Commons.
- For the third group, the discussion focused mainly on the impact of the Data Commons.
- There are many "open science" organizations that already exist. Perhaps it would be possible to review these organizations and to compile and synthesize their recommendations and lessons learned before making further decisions regarding the geoscience data commons.
- Next steps?
- The ESIP Sustainable Data Management cluster would be one place where this investigation/conversation could be continued.
- It is important not to duplicate any existing efforts.