Earth Science Data Analytics - What are your analytics requirements?
The Earth Science Data Analytics (ESDA) Cluster has made great strides in understanding the utilization of data analytics in Earth science, an area virtually untouched in the literature. In achieving its goal to support advancing science research that increasingly includes very large volumes of heterogeneous data, the ESDA Cluster has defined terms, documented use cases, and loosely identified tools and technologies that facilitate a better understanding of the needs of Earth science research.
ESDA Definition: Earth Science Data Analytics is the process of examining large amounts of spatial (3D), temporal, and/or spectral data of a variety of data types to uncover hidden patterns, unknown correlations and other useful information, involving one or more of the following:
- Data Preparation – Preparing heterogeneous data so that they can ‘play’ together
- Data Reduction – Smartly removing data that do not fit research criteria
- Data Analysis – Applying techniques/methods to derive results
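As a rough illustration of the three steps above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy datasets, the function names (`prepare`, `reduce_rows`, `analyze`), the 15 °C threshold, and the choice of Pearson correlation as the analysis technique. It is not a tool or method identified by the Cluster.

```python
from datetime import datetime, timezone
from math import sqrt

# Hypothetical inputs: two sources with different time formats and units.
station_temps_f = [  # ISO date strings, degrees Fahrenheit
    ("2015-01-01", 50.0),
    ("2015-01-02", 59.0),
    ("2015-01-03", 68.0),
    ("2015-01-04", 77.0),
]
satellite_temps_k = [  # Unix timestamps (UTC), kelvin
    (1420070400, 283.15),
    (1420156800, 288.15),
    (1420243200, 293.15),
    (1420329600, 298.15),
]


def prepare(station, satellite):
    """Data Preparation: align both sources on date and convert to deg C."""
    a = {day: (f - 32.0) * 5.0 / 9.0 for day, f in station}
    b = {
        datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat(): k - 273.15
        for ts, k in satellite
    }
    common_days = sorted(a.keys() & b.keys())
    return [(day, a[day], b[day]) for day in common_days]


def reduce_rows(rows, min_station_c):
    """Data Reduction: drop rows that do not fit the research criterion."""
    return [row for row in rows if row[1] >= min_station_c]


def analyze(rows):
    """Data Analysis: Pearson correlation between the two sources."""
    xs = [row[1] for row in rows]
    ys = [row[2] for row in rows]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


prepared = prepare(station_temps_f, satellite_temps_k)
reduced = reduce_rows(prepared, min_station_c=15.0)  # assumed threshold
r = analyze(reduced)
print(len(prepared), len(reduced), round(r, 6))
```

In this sketch, most of the code deals with making the two sources 'play' together, while the analysis itself is a few lines of arithmetic.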
This cluster session will discuss and initiate the work still to be done, including evaluating use cases, extracting data analytics requirements from use cases (this will be a major part of the discussion), surveying existing data analytics tools and techniques, and sharing derived ESDA requirements and identified technology gaps with the ESIP group interested in 'Emerging Big Data Technologies for Geoscience'.
Note 1: Thanks to Lindsay for recording a great set of notes
Note 2: Session presentations are attached below
In introducing the session, Steve surveyed the group regarding their interests in the ESDA Session, and found that more than 50% of the participants were present to learn. The group comprised about 25% Earth scientists and 25% people whose work encompasses performing data analytics; the remainder are assumed to be information technologists and managers. (Admittedly, these are all rather broad categories.) The win here is that in a mere 2 years since our first Cluster Meeting, the numbers of Earth scientists and, let's say, Data Scientists have each quintupled (from 1 each). (We take our wins where we can get them.) In any case, the ESDA Cluster was new to most people.
Given the number of new participants, Steve briefly introduced the purpose of the ESDA Cluster: to specifically address (Big) Data Variety, that is, the environment, tools, and techniques that information technologists can provide to facilitate the analysis of heterogeneous datasets (different formats, instruments, where they reside, etc.).
Ushered in by the advancement of computing technologies, ESDA is a new field that is maturing due to the availability of large amounts of data and the recognition that analyzing all this data as a unit may lead to the discovery of new Earth science information and knowledge. Thus, at this time, this cluster has more of an academic nature: to gain a clear understanding of what Earth Science Data Analytics means and what it can do for our Earth science community...and what information technologists can do to help. The ESDA Cluster's ultimate goal: to facilitate the ability to glean knowledge about Earth from all available data and information.
Steve also pointed out that the literature is almost void of discussion specifically about Earth science data analytics, again inviting much academic discussion on ESDA scope and needs. (Currently, Data Analytics discussions in the literature have been business oriented, solving very different types of problems with different types of methodologies.)
The ESDA Cluster has accomplished much over the past 2 years (see presentation). Currently, the ESDA Cluster has provided the ESDA definition and goals to the Federation Executive Committee for potential endorsement as 'The ESIP Federation endorsed definition and goals of Earth science data analytics'. In addition, the cluster is currently analyzing acquired use cases to determine ESDA requirements and, from the other direction, gathering known data analytics tools and techniques for eventual match-up to requirements.
ESDA Definition: The process of examining, preparing, reducing, and analyzing large amounts of spatial (multi-dimensional), temporal, or spectral data using a variety of data types to uncover patterns, correlations and other information, to better understand our Earth.
Steve continued presenting: the Cluster's examination of data analytics tools and techniques that can be applied to Earth science analysis; Steve's work at AGU, visiting posters in the atmospheric science and hydrology sections and discussing with authors the analytical methodologies used in their research; and a resource that Ethan McMahon brought to our attention, “The Field Guide to DATA SCIENCE”, Booz Allen Hamilton, 2015. (All analysis results are in the presentation downloadable below.)
The discussion that followed helped clarify participant questions and provided ideas for the way forward in our work:
- Various participants described their 'big data' and data analytics issues. These can make for excellent additional use cases
- Analytics is a process, not the result. This point came out of a helpful exchange that gave the group a better understanding of the data analytics scope and definition.
- Why we look at Use Cases: To acquire a better understanding of the scope/direction of the community to better ensure that we provide tools and support in response to the community needs
- One of the conclusions (confirmations) from the information gathered on the methodologies used in the AGU science posters is that, in performing science research, scientists know the methodologies (i.e., data analytics techniques), tools, and models available to them... so what can we, as information technologists, do to help them? Of the three types of Earth science data analytics (Preparation, Reduction, Analysis), as defined above, this discussion focused on Analysis. It seems that Analysis tools and techniques are either already available or developed specifically for a given research effort. The scientists present did not disagree. It was suggested that a valuable resource for scientists would be a framework showing where data analytics tools and techniques for analyzing data can be found. (A catalog?) Should we then be focusing information technology expertise/interest on the first 2 types of data analytics, Preparation and Reduction....
- ... Does the ESDA Cluster overlap with the Interoperability Cluster? (preparing and reducing datasets so they can be interoperable?) This led to the following thoughts:
- We will engage the Interoperability Cluster.
- We need to know their scope and their drivers. We will research all this.
Next, Shea Caspersen gave a short presentation entitled 'Modeling Terrestrial Ecology Under Climate Change', describing his work as a Data Scientist. The presentation, a 'teaser' for a potential longer presentation at next summer's ESIP Meeting, provided very good insights into the work Shea does and the skills he needs to be a Data Scientist.
- Evaluate further whether we can have an impact on data analytics for analysis, or maybe be a “Clearinghouse” for different data analytical techniques
- Gather and document Use Cases to be provided by our AGU Session presenters
- Complete the analysis of tools and techniques for Earth science data analytics, to better match what is available to what is needed
- Determine if today's participants have Use Cases to provide.