Analytics and Data Scientists
Large Heterogeneous Datasets (LHD; allow me to avoid the term Big Data) have come upon us, but what exactly does that mean? Federation presentations by Emily Law and Karl Benedict have described the scope and directions of LHD attention with regard to hardware, software, definition, framework, etc. In fact, the ‘3 Vs’ (or sometimes 4) (http://www-01.ibm.com/software/data/bigdata/) that characterize LHD are being addressed from various angles (at a high level): Foundation, Infrastructure, Management, Search and Mining, Security & Privacy, Applications. Also, strictly speaking, we have Gartner’s definition – “Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (source: http://www.gartner.com/it-glossary/big-data/)
LHD is, indeed, multi-faceted. For this session, we focus on the “cost-effective, innovative forms of information processing”: the techniques and methodologies that are, in particular, unique to LHD.
On the continuum of ever-evolving data management systems, we need to understand and develop ways that allow data relationships to be examined and information to be manipulated, such that knowledge can be enhanced to facilitate science. In short, we have a lot of data that users have not really been given the opportunity to ‘mine’ holistically.
Enter Analytics:
Analytics is a two-sided coin. On one side, it uses descriptive and predictive models to gain valuable knowledge from data - data analysis. On the other, analytics uses this insight to recommend action or to guide decision making - communication. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology (http://en.wikipedia.org/wiki/Analytics)
And the Data Scientist:
A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding. Perhaps the most important skill a data scientist possesses, however, is the ability to explain the significance of data in a way that can be easily understood by others. (http://searchbusinessanalytics.techtarget.com/definition/Data-scientist)
Federation Partners are forward thinking by nature, and are industry leaders positioned to apply smart, innovative ideas to conceptualizing and developing (implementing?) analytic tools and techniques that are unique to serving large heterogeneous datasets and that facilitate their use (not necessarily repurposing existing tools and techniques). It truly appears that analytics, with data scientists to usher it in, is the logical next step in our quest up the pyramid to 'knowledge'.
This session aims to explore the possibilities, interest, involvement, experience, and understanding that Partners are willing to bring to the table to enhance and foster the next evolutionary phase of enabling science research.
Potential long term goals (time periods are best guess):
<1 year:
- To look forward, brainstorm, and specify/analyze user scenarios that involve the need for analytics (new techniques and methodologies) required to further analyze large heterogeneous Earth science datasets. Dozens of excellent user scenarios have already been collected by the NIST Big Data Public Working Group (http://bigdatawg.nist.gov/workshop.php)
- To define the specific qualifications of the Earth science Data Scientist. Work on this is already in progress, but what is our input?
1-2 years:
- To document a wish list of clearly specified analytics that can satisfy user scenarios, lending themselves to gleaning knowledge from information. Have them peer reviewed.
- To peer review the Earth science Data Scientist Qualifications. Collaborate with other groups looking at this.
2-3 years:
- To see that collaborations are formed to implement analytics tools and techniques
- To publish: The Earth science Data Scientist Qualifications. Or work with other groups to formalize qualifications
This session's goal:
- To determine whether there is enough work here to form a cluster.
This session is an open discussion to address:
- The definition of analytics: What does it imply? What do we want it to imply?
- Same for the Data Scientist
- What is our ultimate vision?
- What resources are available to us? (Not money, just information sources, known work in this area, knowledge of college curricula for training Data Scientists, etc.)
- What activities could we define to move out on this?
- Is there enough momentum to continue?
Summary:
The group comprised people well versed in analytics, those formulating what analytics means and how information systems can be enhanced to facilitate the use of Earth science data analytics, and those learning the ‘lingo’. All were present because of their interest in the subject, at some level.
Observation:
Analytics is not only a new paradigm for advancing the holistic usability of heterogeneous datasets, but it also invites potential new paradigms for information management system frameworks that can handle Earth science analytics: Knowledge Management Systems?
Notes:
Slides are meant to introduce the subject and stimulate discussion (slides attached). This is intended to be an interactive session covering current thinking and the things that need to be done related to big data. We want to define and gather information about what can be done by members of the Federation.
Session goals (see slides for details): To define terms: analytics and data scientist, and define activities that will advance analytics techniques and methodologies, and the data scientist as a profession. We may not get that far, but can keep it going via telecons.
The ultimate long term goal – Facilitate the advancement of Information Analysis to Knowledge Analysis through analytics (see DIKW slide). We are beginning to talk about knowledge management, where people can glean knowledge from information, specifically when serving unique large heterogeneous datasets (LHD).
Session Goal one: Define LHD, analytics in context of LHD, and definition of Data Scientists in relation to LHD
Session Goal two: Define activities, list actions, and draft a timeline related to analytics and data scientists.
Let’s see how far we get…
Presentation:
Starting point - defining Large Heterogeneous Datasets (see slides)
Gartner’s definition: “Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (source: http://www.gartner.com/it-glossary/big-data/)
Consider the “3 V’s”: Data Management: Controlling Data Volume, Velocity and Variety. Current business conditions and mediums are pushing traditional data management principles to their limits, giving rise to novel and more formalized approaches
Big data spans four dimensions: Volume, Velocity, Variety, Veracity
Volume: Enterprises are awash with ever-growing data of all types
Velocity: Big data must be made available and used in a reasonable time period
Variety: Big data is any type of data - structured and non-structured
Veracity: Trusting the information to make decisions.
(Source: http://www-01.ibm.com/software/data/bigdata/)
Thus, we have big data, but what is it that makes big data different from what we have today, and what do we need to do differently to handle the big data? As a federation, what can we do to facilitate the use of large amounts of data?
Analytics vs Analysis: Analytics is a two-sided coin. On one side, it uses descriptive and predictive models to gain valuable knowledge from data - data analysis. On the other, analytics uses this insight to recommend action or to guide decision making - communication. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology (Source: http://en.wikipedia.org/wiki/Analytics)
The 2011 McKinsey report provides representative analytics technologies (see slides)
Data Scientists: A data scientist possesses a combination of analytic, machine learning, data mining and statistical skills as well as experience with algorithms and coding. Perhaps the most important skill a data scientist possesses, however, is the ability to explain the significance of data in a way that can be easily understood by others. (Source: http://searchbusinessanalytics.techtarget.com/definition/Data-scientist)
Rising alongside the relatively new technology of big data is the new job title data scientist. While not tied exclusively to big data projects, the data scientist role does complement them because of the increased breadth and depth of data being examined, as compared to traditional roles. (Source: http://www-01.ibm.com/software/data/infosphere/data-scientist/)
Discussion(, Thoughts and Ideas):
How can we make big data more manageable and properly and efficiently used by the user? Not just the technical description, but how it would be used and managed.
What could we do? What could we try to achieve? Eliza brought up the DC Data Science Meetup group - http://www.meetup.com/Data-Science-DC/ (“Data Science DC is a non-profit professional group that meets monthly to discuss diverse topics in predictive analytics, applied machine learning, statistical modeling, open data, and data visualization”.) Thanks Eliza
It was asked whether the words ‘big’ or ‘large’ are necessary. The issue is more the heterogeneity of the data than its size. What was large a decade ago is now small; size will keep changing, but the heterogeneous part will not.
The purpose of analytics is not to improve on the product but to use it to extract new information from datasets that are not typically analyzed together, holistically. One possibility is to use use cases to flesh out possible analytics scenarios.
How do we create enough metadata, so that there is one tool that everyone can use? We need better metadata.
Discovery vs. modeling: Discovery issues are being addressed in other venues and that might not be something we need to worry about here. Instead, we ask: How to cut across one or multiple data sets and come up with an answer that transcends these datasets?
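To make the idea of cutting across datasets concrete, here is a minimal sketch, not something presented in the session. It assumes two hypothetical datasets with different time resolutions (hourly rainfall samples and daily streamflow values; all names and numbers are made up for illustration) and brings them onto a common daily key so they can be examined together.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, made-up values: hourly rainfall samples (timestamp, mm).
hourly_rain_mm = [
    ("2014-01-01T00:00", 1.0),
    ("2014-01-01T12:00", 3.0),
    ("2014-01-02T06:00", 0.0),
    ("2014-01-03T06:00", 5.0),
]

# Hypothetical, made-up values: daily streamflow keyed by ISO date.
daily_streamflow = {"2014-01-01": 12.0, "2014-01-02": 9.0, "2014-01-04": 15.0}


def daily_totals(samples):
    """Aggregate (timestamp, value) samples into per-day totals."""
    totals = defaultdict(float)
    for stamp, value in samples:
        day = datetime.strptime(stamp, "%Y-%m-%dT%H:%M").date().isoformat()
        totals[day] += value
    return dict(totals)


rain_by_day = daily_totals(hourly_rain_mm)

# Keep only the days present in both datasets, then pair the values.
common_days = sorted(set(rain_by_day) & set(daily_streamflow))
joined = [(day, rain_by_day[day], daily_streamflow[day]) for day in common_days]

for row in joined:
    print(row)
```

The choice of a daily key is itself a scientific decision, which is where domain expertise comes in; the code only shows the mechanical part of the alignment.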
Analytics can provide the ability for the data to tell stories that have not been thought about.
Also mentioned: how do we make sure that bad analytics manipulations get handled as well?
Data scientists with appropriate domain expertise would know techniques that scientifically, not just mathematically, relate heterogeneous datasets.
Provided by Aleksandar: Here's the blog with the "metro map" on what becoming a Data Scientist entails: http://nirvacana.com/thoughts/becoming-a-data-scientist/ Thanks Aleksandar
Data Provider Perspectives:
1. What are the challenges that arise because the data is heterogeneous, in different locations, and being managed differently? What can we, as data provider organizations, do to enable these kinds of things? For example, create an environment where smart people can get into the system and determine the process for themselves. Different organizations handle different data. Conceptually, given a data flow, everything that is called for would be done smoothly and automatically. That will lead to advances in converting information to knowledge.
2. Thus, a generalized framework among the ESIP partners can enable this. Defining the relevant questions and the techniques for combining data is going to be a very science-community-driven effort that we cannot solve, but that a framework can facilitate. (Comment: Information users know what information they need and how to combine information, but is the framework, as we have thought of it in the past for tools and services, applicable to the uniqueness of analytics?)
Scientist Perspectives:
1. Modeling data has a fundamental issue: data comes in different formats, with different descriptions and degrees of truthfulness. How do we address those problems? We are not always asked to deal with the analysis. The data scientist component is taking the computer skills and applying them to the reference question. How do we get to the real question that these skills are needed for?
2. If data is not in the required structure or does not have the required attributes, it is thrown out. Rather than the datasets being manipulated to fit, a lot of good data is thrown out.
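As an illustration of the second point, the sketch below contrasts a strict filter that discards any record not already in the expected structure with a small harmonization step that maps differently named attributes onto a common schema first. This is only a sketch under stated assumptions: the field names, aliases, and records are hypothetical and were not part of the session.

```python
from typing import Optional

# Hypothetical aliases: each provider may name the same attribute differently.
FIELD_ALIASES = {
    "temperature_k": ("temperature_k", "temp_k", "T"),
    "latitude": ("latitude", "lat"),
    "longitude": ("longitude", "lon", "lng"),
}


def harmonize(record: dict) -> Optional[dict]:
    """Map a provider record onto the common schema.

    Returns None only when a required attribute is missing under every alias.
    """
    out = {}
    for target, aliases in FIELD_ALIASES.items():
        for name in aliases:
            if name in record:
                out[target] = float(record[name])
                break
        else:
            return None  # attribute genuinely absent
    return out


# Made-up records from three hypothetical providers.
records = [
    {"temperature_k": 288.1, "latitude": 10.0, "longitude": 20.0},
    {"temp_k": "290.4", "lat": "11.5", "lon": "21.0"},  # same content, other names
    {"T": 285.0, "lat": 12.0},  # longitude really is missing
]

# Strict approach: keep only records that already match the schema exactly.
strict = [r for r in records if set(r) == set(FIELD_ALIASES)]

# Harmonizing approach: rename and convert first, drop only what cannot be mapped.
harmonized = [h for h in (harmonize(r) for r in records) if h is not None]

print(len(strict), "kept by strict filter;", len(harmonized), "kept after harmonizing")
```

In this toy case the strict filter keeps one record while harmonization keeps two; the second record was good data that differed only in naming and value types.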
Use Cases: NIST had a big data workshop, with dozens of user scenarios, several applicable to Earth science. RDA has a big data analytics interest group as well, also compiling use cases. In addition, at AGU David Lary had a very good use case. Generally, the science community wants capabilities that enable analytics to be done more easily. User scenarios or use cases guide us in making that job easier for them. Also, we need to have scenarios that are broader than the science community.
In addition, Information Partners need to help information users understand, not from a mathematical perspective but from an infrastructure perspective, what the possibilities are for utilizing analytics methodologies for heterogeneous data.
Sarah e-mailed me the following very pertinent announcement: Second International Symposium on Big Data and Data Analytics in Collaboration (BDDAC 2014), May 19-23, 2014, http://cts2014.cisedu.info/2-conference/symposia/symp-02-big-data-2014. It’s worth a look. Thanks Sarah
David Batchelor provided the following link to interesting subject articles: DataScienceCtrl bit.ly/1dkwxJP #abdsc. Thanks David
Take away messages:
1. Address the ‘heterogeneous’ components of data/information, not the ‘big’ part.
2. We have all this data, how do we make it more usable?
3. Provide a generalized framework, albeit different from today’s tools and services framework structure, that will lead to advances in converting information to knowledge.
4. Examine use cases (see action) to flesh out what is needed to architect an analytics ‘framework’.
Actions and Moving Forward:
1. Steve to distribute user scenarios and their references, thus far collected
Response:
NIST: http://bigdatawg.nist.gov/usecases.php lots of good scenarios
RDA (cut and paste this one): https://www.rd-alliance.org/groups/big-data-analytics-ig/wiki/science-use-cases.html
2. Steve to check in with Dave Jones: Dave is acquiring, combining, and sharing data. What are his methods, and how does he ensure that all the providers are trustworthy?
3. Steve to distribute notes, and provide the presentation and additional acquired information, including external activities that are consistent with this activity. Also, solicit session participant and Federation-wide feedback to acquire, share and promote a better understanding of Earth science data analytics and Data Science and, most prominently, to encourage Federation discussion on these subjects.
4. Participants to provide their ideas and thoughts on what Earth science data analytics means to them and what they think is the best Federation approach to addressing analytics and data science on behalf of the broader community. Please respond to Steve. Once these are compiled and distributed, we can decide what our next steps are.