Introduction to Data Science
Digital Data
• Digital data is data that represents other forms of data using specific machine
language systems that can be interpreted by various technologies.
• The most fundamental of these systems is a binary system, which simply stores
complex audio, video, image or text information in a series of binary characters,
traditionally 0’s and 1’s.
• Whenever we send an email, read a social media post, or take pictures with a
digital camera, we are working with digital data.
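To make the binary idea concrete, here is a minimal sketch of how a short piece of text can be turned into 0's and 1's and back again. It assumes UTF-8 as the text encoding; the helper names to_bits and from_bits are made up for this illustration.

```python
def to_bits(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Inverse of to_bits: rebuild the text from its 8-bit groups."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")


encoded = to_bits("Hi")
print(encoded)             # 01001000 01101001
print(from_bits(encoded))  # Hi
```

Audio, images and video are stored in the same spirit, only with different encodings that map samples or pixels to bytes.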
• Data growth has accelerated exponentially since the advent of computers and the
internet.
Structured Data
• Structured data refers to any data that resides in a fixed field within a
record or file.
• This includes data contained in relational databases and spreadsheets.
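As a small illustration of data living in fixed fields, the sketch below builds a tiny relational table with Python's built-in sqlite3 module and queries it. The table name, column names and sample rows are invented for this example.

```python
import sqlite3

# An in-memory relational database: every record has the same fixed fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city, balance) VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Pune", 120.50), (2, "Ben", "Leeds", 75.00), (3, "Chen", "Pune", 310.25)],
)

# Because the structure is known in advance, querying by field is straightforward.
rows = conn.execute(
    "SELECT name, balance FROM customers WHERE city = ? ORDER BY balance DESC",
    ("Pune",),
).fetchall()
print(rows)  # [('Chen', 310.25), ('Asha', 120.5)]
```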
Unstructured Data
• The phrase unstructured data usually refers to information that doesn't reside in a traditional
row-column database.
– As you might expect, it's the opposite of structured data (i.e. the data stored in fields in a database).
• Unstructured data files often include text and multimedia content.
• Examples include
– e-mail messages,
– word processing documents,
– videos,
– photos,
– audio files,
– presentations,
– webpages and
– many other kinds of business documents.
• Note that while these sorts of files may have an internal structure, they are still considered
"unstructured" because the data they contain doesn't fit neatly in a database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as
semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
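The split between email metadata and message text can be seen with Python's standard email package. This is only a sketch: the addresses and message content are made up, and a real mailbox would contain many more headers.

```python
from email import message_from_string

raw = (
    "From: alice@example.com\n"
    "To: support@example.com\n"
    "Subject: Order arrived late\n"
    "\n"
    "Hi, my order came a week late and the box was damaged.\n"
)

msg = message_from_string(raw)

# The headers behave like named fields (the structured part).
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# The body is free text with no fields to query (the unstructured part).
body = msg.get_payload()

print(headers["Subject"])
print(body)
```

The headers can be loaded into a table directly, while the body needs text analytics before it yields anything a database could store.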
Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, land forms, military movements, etc.
Unstructured Data Management
Organizations use a variety of software tools to organize and manage unstructured
data. These can include the following:
• Big data tools: Software like Hadoop can process stores of both unstructured and
structured data that are extremely large, very complex and changing rapidly.
• Business intelligence software: Also known as BI, business intelligence is a broad
category of analytics, data mining, dashboards and reporting tools that help companies
make sense of their structured and unstructured data for the purpose of making better
business decisions.
• Data integration tools: These tools combine data from disparate sources so that they can be
viewed or analyzed from a single application. They sometimes include the capability to
unify structured and unstructured data.
• Document management systems: Also called enterprise content management systems, a
DMS can track, store and share unstructured data that is saved in the form of document files.
• Information management solutions: This type of software tracks structured and
unstructured enterprise data throughout its lifecycle.
• Search and indexing tools: These tools retrieve information from unstructured data files
such as documents, Web pages and photos.
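To give a feel for what a search and indexing tool does internally, here is a toy inverted index: a map from each word to the documents that contain it. The file names and texts are hypothetical, and real tools add ranking, stemming and support for non-text formats.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    "memo.txt": "Quarterly sales report for the retail team",
    "page.html": "Retail store opening hours and contact details",
    "notes.txt": "Meeting notes: sales targets for next quarter",
}
index = build_index(documents)
print(search(index, "retail sales"))  # {'memo.txt'}
```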
Structured vs. Unstructured Data:
What’s the Difference?
• Besides the obvious difference between storing in a relational database and storing
outside of one, the biggest difference is the ease of analyzing structured data vs.
unstructured data.
• Mature analytics tools exist for structured data, but analytics tools for mining
unstructured data are emerging and developing.
• Users can run simple content searches across textual unstructured data. But its
lack of orderly internal structure defeats the purpose of traditional data mining
tools, and the enterprise gets little value from potentially valuable data sources
like rich media, network or weblogs, customer interactions, and social media data.
• Even though unstructured data analytics tools are in the marketplace, no single
vendor or toolset is a clear winner.
• And many customers are reluctant to invest in analytics tools with uncertain
development roadmaps.
• On top of this, there is simply much more unstructured data than structured.
• Unstructured data makes up 80% or more of enterprise data, and is growing at
a rate of 55% to 65% per year.
• And without the tools to analyze this massive data, organizations are leaving vast
amounts of valuable data on the business intelligence table.
Structured vs. Unstructured Data:
Next Gen Tools are Game Changers
Unstructured data analytics with
machine learning
• Track high-volume customer conversations in social media.
– Text analytics and sentiment analysis let analysts review positive and
negative results of marketing campaigns, or even identify online threats.
– This level of analytics is far more sophisticated than simple keyword search, which
can only report basics like how often posters mentioned the company name
during a new campaign.
– New analytics also include context:
• Was the mention positive or negative?
• Were posters reacting to each other?
• What was the tone of reactions to executive announcements?
– The automotive industry for example is heavily involved in analyzing social
media, since car buyers often turn to other posters to gauge their car buying
experience.
– Analysts use a combination of text mining and sentiment analysis to track
auto-related user posts on Twitter and Facebook.
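The difference between counting mentions and judging their tone can be sketched with a toy word-list scorer. The word lists and sample posts are invented for illustration. Production sentiment analysis normally relies on trained models, and this sketch ignores negation, sarcasm and context.

```python
import re

# Hypothetical, hand-picked word lists; a real lexicon would be far larger.
POSITIVE = {"love", "great", "smooth", "reliable", "happy"}
NEGATIVE = {"hate", "terrible", "noisy", "broken", "disappointed"}


def sentiment(post: str) -> str:
    """Label a post by comparing its counts of positive and negative words."""
    words = re.findall(r"[a-z']+", post.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


posts = [
    "Love my new car, the ride is smooth and reliable",
    "Terrible dealership experience, really disappointed",
    "Picked up the car on Tuesday",
]
labels = [sentiment(post) for post in posts]
print(labels)  # ['positive', 'negative', 'neutral']
```

A keyword search would report three mentions of cars; the scorer adds whether each mention was favourable.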
Semi-Structured Data
Examples of Semi-structured Data
What is Machine Learning?
• Initially, the model is fed parameter data for which the answer is known.
• The algorithm is then run, and adjustments are made until the algorithm’s
output (learning) agrees with the known answer.
• At this point, increasing amounts of data are input to help the system learn
and make more complex decisions.
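The loop described above (feed data with known answers, run, adjust, repeat) can be sketched with a one-variable linear model trained by gradient descent. The data, learning rate and epoch count are assumptions chosen for this illustration, not values from the slides.

```python
# Training data for which the answer is known, generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by repeatedly nudging w and b to reduce the squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # How far the current output is from each known answer.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
print(round(w * 10 + b, 3))      # prediction for an unseen input, x = 10
```

Once the output agrees with the known answers, the same parameters are used on inputs the model has never seen.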
What Can Machine Learning Do?
• Algorithms have been invented that are effective for certain types
of learning tasks, and a theoretical understanding of learning is
beginning to emerge.
• Many practical computer programs have been developed to exhibit
useful types of learning, and significant commercial applications
have begun to appear.
Achievements of ML
A few achievements of ML
Programs have been developed that:
• Successfully learn to recognize spoken words (Waibel 1989; Lee
1989).
• Predict recovery rates of pneumonia patients (Cooper et al.
1997).
• Detect fraudulent use of credit cards.
• Drive autonomous vehicles on public highways (Pomerleau 1989).
• Play games such as backgammon at levels approaching the
performance of human world champions (Tesauro 1992, 1995).
etc.
WELL-POSED LEARNING PROBLEMS
For example, a computer program that learns to play checkers might improve
its performance as measured by its ability to win at the class of tasks involving
playing checkers games, through experience obtained by playing games
against itself.
To define a learning problem, we must identify these three features: the
class of tasks, the measure of performance to be improved, and the source
of experience.
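One way to keep the three features in view is to write them down as a small record. The sketch below does this for the checkers example; the class and field names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningProblem:
    """The three features that make a learning problem well posed."""
    task: str                 # the class of tasks
    performance_measure: str  # the measure of performance to be improved
    experience: str           # the source of experience


checkers = LearningProblem(
    task="playing checkers games",
    performance_measure="ability to win checkers games",
    experience="games played against itself",
)
print(checkers)
```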
Example 1
Why Machine Learning?
• Machine learning has several very practical applications that drive the kind of real business results –
such as time and money savings – that have the potential to dramatically impact the future of your
organization.
• At Interactions in particular, we see tremendous impact occurring within the customer care industry,
whereby machine learning is allowing people to get things done more quickly and efficiently.
• Through Virtual Assistant solutions, machine learning automates tasks that would otherwise need to
be performed by a live agent – such as changing a password or checking an account balance. This
frees up valuable agent time that can be used to focus on the kind of customer care that humans
perform best: high touch, complicated decision-making that is not as easily handled by a machine.
• At Interactions, we further improve the process by eliminating the decision of whether a request
should be sent to a human or a machine: through our unique Adaptive Understanding technology, the
machine learns to be aware of its limitations and bails out to humans when it has low confidence in
providing the correct solution.
• Machine learning has made dramatic improvements in the past few years, but we are still very far
from reaching human performance. Many times, the machine needs the assistance of a human to
complete its task. At Interactions, we have deployed Virtual Assistant solutions that seamlessly blend
artificial with true human intelligence to deliver the highest level of accuracy and understanding.
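The general idea of handing a request to a human when the machine is unsure can be sketched as a simple confidence threshold. This is a generic illustration, not a description of how Adaptive Understanding works. The Prediction type, the route function and the 0.80 threshold are all assumptions.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    intent: str        # what the model thinks the customer wants
    confidence: float  # the model's own confidence, between 0.0 and 1.0


# Assumed cut-off; in practice it would be tuned on real traffic.
CONFIDENCE_THRESHOLD = 0.80


def route(prediction: Prediction, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Let the machine answer only when it is confident enough."""
    if prediction.confidence >= threshold:
        return f"machine handles: {prediction.intent}"
    return "hand off to human agent"


print(route(Prediction("reset password", 0.95)))    # machine handles: reset password
print(route(Prediction("dispute a charge", 0.42)))  # hand off to human agent
```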
How does it work?
Stages of a data science project
Types of machine learning
Supervised Learning
• Supervised learning, as the name indicates, involves a supervisor acting as a teacher.
• In supervised learning, we teach or train the machine using well-labeled data, that is,
data already tagged with the correct answer.
• The supervised learning algorithm analyses the training data (a set of training examples).
The machine is then given a new set of examples (data) and is expected to produce the
correct outcome for them, based on what it learned from the labeled data.
• Thus the machine learns from the training data (for example, a basket containing labeled
fruits) and then applies that knowledge to test data.
• Supervised learning is classified into two categories of algorithms:
– Classification: A classification problem is when the output variable is a category,
such as “red” or “blue”, or “disease” or “no disease”.
– Regression: A regression problem is when the output variable is a real value, such
as “dollars” or “weight”.
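The two categories can be sketched side by side in plain Python. The classifier is a 1-nearest-neighbour rule and the regressor is a least-squares line; both are common textbook choices picked for this illustration, not methods named in the slides. The fruit measurements and prices are made up.

```python
import math

# Classification: labeled training data, (weight in g, diameter in cm) -> fruit.
fruit_basket = [
    ((8.0, 2.0), "cherry"),
    ((10.0, 2.5), "cherry"),
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
]


def classify(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(fruit_basket, key=lambda item: math.dist(item[0], features))
    return label


# Regression: labeled training data, weight in g -> price in dollars.
weights = [100.0, 200.0, 300.0]
prices = [1.0, 2.0, 3.0]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(weights, prices)

# Apply what was learned to new, unlabeled test data.
print(classify((9.0, 2.2)))     # cherry  (output is a category)
print(classify((160.0, 7.2)))   # apple
print(slope * 250 + intercept)  # about 2.5 (output is a real value)
```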
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Machine Learning Applications
Machine Learning in Digital Marketing
• Machine learning can help significantly in digital marketing. It allows
more relevant personalization, so companies can interact and engage
with the customer. Sophisticated segmentation focuses on the appropriate
customer at the right time, with the right message. Companies also hold
customer information that can be leveraged to learn customer behavior.
• Nova uses machine learning to write personalized sales emails. It knows
which emails performed better in the past and suggests changes to the
sales emails accordingly.
Machine Learning in Healthcare
• Healthcare has remained a hot topic for machine learning over the last three
years. Several promising start-ups are gearing up their efforts with a focus on
healthcare. These include Nervanasys (acquired by Intel), Ayasdi, Sentient and
Digital Reasoning System, among others.
• Computer vision, which uses deep learning, is one of the most significant
contributors in the field of machine learning. It is an active healthcare
application for ML: Microsoft’s InnerEye initiative, which started in 2010, is
currently working on an image diagnostic tool.
Assignment 1
Thank You