Building a Google for data: The current state of the art

Abstract: 

Researchers are under increasing pressure from agencies and professional societies to make all outputs of their science open and accessible. However, a large percentage of scientists either feel the need to maintain control of their data or lack an appropriate repository in which to deposit them. Even when data are transferred to a repository, discovering those data can be difficult because repositories are numerous and change frequently. Brokering services can provide harmonized discovery and access across heterogeneous catalogs, but populating and maintaining those catalogs has proven to be a challenge and will become more so as more researchers seek to deposit their data. There is a growing need for mechanisms that make it possible to discover and aggregate all data that are in repositories and registries or are self-published on the web.

Researchers at the National Snow and Ice Data Center and the Ronin Institute for Independent Scholarship have piloted a way for researchers to advertise their data by posting syndication-type metadata alongside their data, and have developed crawlers that find and aggregate these advertisements. We are now working on a crawler capable of discovering other types of documents that describe data, such as OAI-PMH, OpenSearch, OGC W*S, and THREDDS documents. Owing to the vast number of such documents on the web, we are also developing a means to discriminate which of the discovered resources are applicable to specific scientific interests. The locations of these discovered resources could then be incorporated into discovery and access services such as those provided by GI-Cat/GI-Axe. The status of this work will be discussed, along with our plans for working with communities such as ESIP, RDA, and possibly the Belmont Forum to develop good-practice guidelines for researchers advertising their data holdings and for service providers describing and validating their web services.
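As a rough illustration of how a crawler might recognize the document types named above, the following sketch classifies a fetched XML document by the namespace of its root element. It is not the crawler described in this abstract: the function name `classify_document`, the `NAMESPACE_TYPES` table, and the idea of matching on namespace prefixes are assumptions made for this example, and the table is far from exhaustive.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: root-element namespace prefix -> coarse document type.
# Illustrative only; real services use more namespaces and versions than listed here.
NAMESPACE_TYPES = [
    ("http://www.openarchives.org/OAI/2.0/", "OAI-PMH"),
    ("http://a9.com/-/spec/opensearch/", "OpenSearch"),
    ("http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/", "THREDDS"),
    ("http://www.opengis.net/", "OGC W*S"),
    ("http://www.w3.org/2005/Atom", "Atom feed"),
]


def classify_document(xml_text):
    """Return a coarse type label for an XML document, or "unknown"."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Not well-formed XML (for example, an HTML page or plain text).
        return "unknown"

    # ElementTree encodes a namespaced tag as "{namespace}localname".
    tag = root.tag
    namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

    for prefix, doc_type in NAMESPACE_TYPES:
        if namespace.startswith(prefix):
            return doc_type
    return "unknown"
```

A crawler could call `classify_document` on each fetched response body and retain only the documents whose label is not "unknown"; deciding which of those are relevant to a specific scientific interest is a separate step that this sketch does not attempt.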

Creative Commons License:
Creative Commons Attribution 3.0 License