Unit 1: Introduction to Data Science
Digital Data
• Digital data is data that represents other forms of data using specific machine
language systems that can be interpreted by various technologies.
• In information theory and information systems, digital data is the discrete,
discontinuous representation of information or works.
• It is data that is not physical but is stored in a digital format.
• The most fundamental of these systems is the binary system, which stores
complex audio, video, image or text information as a series of binary digits,
traditionally written as 0s and 1s.
• Whenever we send an email, read a social media post, or take pictures with a
digital camera, we are working with digital data.
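As a small illustration of the binary representation described above, the following sketch turns a short piece of text into 0s and 1s and back again. The helper names `to_bits` and `from_bits` are made up for this example, and UTF-8 is assumed as the text encoding.

```python
def to_bits(text: str) -> str:
    """Encode text as UTF-8 and render each byte as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Reverse to_bits: parse each 8-digit group into a byte, then decode."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")


encoded = to_bits("Hi")
print(encoded)             # 01001000 01101001
print(from_bits(encoded))  # Hi
```

Audio, images and video are stored the same way in principle; only the rule that maps the content to bytes differs.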
Digital Data
• Data growth has accelerated exponentially since the advent of computers and
the internet.
Structured Data
• Structured data is any data that resides in a fixed field within a
record or file.
• This includes data contained in relational databases and spreadsheets.
Characteristics of Structured Data:
• Structured data depends first on creating a data model: a model of the types
of business data that will be recorded and how they will be stored,
processed and accessed.
– This includes defining:
• what fields of data will be stored and
• how that data will be stored: data type (numeric, currency, alphabetic, name, date,
address) and
• any restrictions on the data input (number of characters; restricted to certain terms
such as Mr., Ms. or Dr.; M or F).
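The data-model idea above (fields, data types and input restrictions) can be sketched with Python's built-in `sqlite3` module. The `customer` table, its columns and the 40-character limit are assumptions chosen for illustration, not part of the original material.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# The data model: which fields exist, their types, and input restrictions.
conn.execute("""
    CREATE TABLE customer (
        id      INTEGER PRIMARY KEY,
        title   TEXT    NOT NULL CHECK (title IN ('Mr.', 'Ms.', 'Dr.')),
        name    TEXT    NOT NULL CHECK (length(name) <= 40),
        joined  DATE    NOT NULL,
        balance NUMERIC NOT NULL
    )
""")


def insert_customer(title, name, joined, balance):
    """Return True if the row satisfies the data model, False otherwise."""
    try:
        conn.execute(
            "INSERT INTO customer (title, name, joined, balance) VALUES (?, ?, ?, ?)",
            (title, name, joined, balance),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when a CHECK or NOT NULL restriction is violated.
        return False


print(insert_customer("Dr.", "Asha Rao", "2024-01-15", 120.50))  # accepted
print(insert_customer("Sir", "Asha Rao", "2024-01-15", 120.50))  # rejected title
```

Because every row must pass the same checks, later queries can rely on each field having a known type and range of values.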
Unstructured Data
• The phrase unstructured data usually refers to information that doesn't reside in a traditional row-
column database.
– As you might expect, it's the opposite of structured data (i.e. the data stored in fields in a database).
• Unstructured data files often include text and multimedia content.
• Examples include
– e-mail messages,
– word processing documents,
– videos,
– photos,
– audio files,
– presentations,
– webpages and
– many other kinds of business documents.
• Note that while these sorts of files may have an internal structure, they are still considered
"unstructured" because the data they contain doesn't fit neatly in a database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as
semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, land forms, military movements, etc.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Unstructured Data Management
Organizations use a variety of software tools to organize and manage unstructured
data. These can include the following:
• Big data tools: Software like Hadoop can process stores of both unstructured and
structured data that are extremely large, very complex and changing rapidly.
• Business intelligence software: Also known as BI, business intelligence is a broad
category of analytics, data mining, dashboards and reporting tools that help companies
make sense of their structured and unstructured data for the purpose of making better
business decisions.
• Data integration tools: These tools combine data from disparate sources so that they can be
viewed or analyzed from a single application. They sometimes include the capability to
unify structured and unstructured data.
• Document management systems: Also called enterprise content management systems, a
DMS can track, store and share unstructured data that is saved in the form of document files.
• Information management solutions: This type of software tracks structured and
unstructured enterprise data throughout its lifecycle.
• Search and indexing tools: These tools retrieve information from unstructured data files
such as documents, Web pages and photos.
Structured vs. Unstructured Data:
What’s the Difference?
• Besides the obvious difference between storing in a relational database and storing
outside of one, the biggest difference is the ease of analyzing structured data vs.
unstructured data.
• Mature analytics tools exist for structured data, but analytics tools for mining
unstructured data are emerging and developing.
• Users can run simple content searches across textual unstructured data. But its
lack of orderly internal structure defeats the purpose of traditional data mining
tools, and the enterprise gets little value from potentially valuable data sources
like rich media, network or weblogs, customer interactions, and social media data.
• Even though unstructured data analytics tools are in the marketplace, no one
vendor or toolset is a clear winner.
• And many customers are reluctant to invest in analytics tools with uncertain
development roadmaps.
• On top of this, there is simply much more unstructured data than structured.
• Unstructured data makes up 80% or more of enterprise data, and is growing at
a rate of 55% to 65% per year.
• And without the tools to analyze this massive data, organizations are leaving vast
amounts of valuable data on the business intelligence table.
Next Gen Tools are Game Changers
• New tools are available to analyze unstructured data, particularly given
specific use case parameters.
– Most of these tools are based on machine learning.
• Structured data analytics can use machine learning as well, but the massive
volume and many different types of unstructured data require it.
• A few years ago, analysts using keywords and key phrases could search
unstructured data and get a decent idea of what the data involved.
• eDiscovery was (and is) a prime example of this approach.
• However, unstructured data has grown so dramatically that users need to
employ analytics that not only work at compute speeds, but also
automatically learn from their activity and user decisions.
• Natural Language Processing (NLP), Pattern sensing and classification,
and text-mining algorithms are all common examples, as are document
relevance analytics, sentiment analysis, and filter-driven Web harvesting.
Unstructured data analytics with
machine learning
It allows organizations to:
• Analyze digital communications for compliance.
– Failed compliance can cost companies millions of dollars in fees, litigation, and lost business.
Pattern recognition and email threading analysis software searches massive amounts of email and
chat data for potential noncompliance.
– A recent example is Volkswagen, which might have avoided huge fines and
reputational damage by using analytics to monitor communications for suspicious messages.
• Gain new marketing intelligence.
– Machine-learning analytics tools quickly work on massive amounts of documents to analyze
customer behavior.
– A major magazine publisher applied text mining to hundreds of thousands of articles, analyzing
each separate publication by the popularity of major subtopics.
– Then they extended analytics across all their content properties to see which overall topics got the
most attention by customer demographic.
– The analytics ran across hundreds of thousands of pieces of content across all publications, and
cross-referenced hot topic results by segments.
– The result was a rich education on which topics were most interesting to distinct customers, and
which marketing messages resonated most strongly with them.
Unstructured data analytics with
machine learning
• Track high-volume customer conversations in social media.
– Text analytics and sentiment analysis let analysts review positive and
negative results of marketing campaigns, or even identify online threats.
– This level of analytics is far more sophisticated than simple keyword search,
which can only report basics like how often posters mentioned the company name
during a new campaign.
– New analytics also include context:
• Was the mention positive or negative?
• Were posters reacting to each other?
• What was the tone of reactions to executive announcements?
– The automotive industry for example is heavily involved in analyzing social
media, since car buyers often turn to other posters to gauge their car buying
experience.
– Analysts use a combination of text mining and sentiment analysis to track
auto-related user posts on Twitter and Facebook.
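The difference between keyword search and sentiment analysis can be sketched in a few lines. This is a deliberately simple, lexicon-based illustration: the word lists, the brand name "Acme" and the sample posts are all invented, and the production tools described above typically use trained machine learning models rather than fixed word lists.

```python
# Hypothetical word lists; a real system would learn these from data.
POSITIVE = {"love", "great", "reliable"}
NEGATIVE = {"hate", "broken", "terrible"}


def mentions(posts, keyword):
    """Keyword search: count the posts that mention the keyword at all."""
    return sum(keyword.lower() in post.lower() for post in posts)


def sentiment(post):
    """Crude sentiment score: positive words minus negative words."""
    words = [word.strip(".,!?").lower() for word in post.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


posts = [
    "I love my new Acme hatchback, great mileage!",
    "Acme service was terrible and the radio is broken.",
    "Test drove an Acme today.",
]

print(mentions(posts, "Acme"))        # 3
print([sentiment(p) for p in posts])  # [2, -2, 0]
```

Keyword search reports three mentions and nothing more; the score adds whether each mention was positive, negative or neutral.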
Semi-Structured Data
• Semi-structured data is information that doesn't reside in a relational
database but that does have some organizational properties that make it
easier to analyze.
• Semi-structured data maintains internal tags and markings that identify
separate data elements, which enables information grouping and hierarchies.
• Email is a very common example of a semi-structured data type.
• Examples of semi-structured data might include XML documents and
NoSQL databases.
• Both documents and databases can be semi-structured.
Examples of Semi-structured Data
• Markup language XML: This is a semi-structured document language. XML is a set of
document encoding rules that defines a human- and machine-readable format. (Although
saying that XML is human-readable doesn’t pack a big punch: anyone trying to read an
XML document has better things to do with their time.) Its value is that its tag-driven
structure is highly flexible, and coders can adapt it to universalize data structure, storage,
and transport on the Web.
• Open standard JSON (JavaScript Object Notation): JSON is another semi-structured
data interchange format. JavaScript is in the name, but JSON is language-independent and
most programming languages can read it. Its structure consists of name/value pairs (an object,
hash table, etc.) and ordered value lists (an array, sequence, list). Since the structure is
interchangeable among languages, JSON excels at transmitting data between web applications
and servers.
• NoSQL: Semi-structured data is also an important element of many NoSQL (“not only
SQL”) databases. NoSQL databases differ from relational databases because they do not
separate the organization (schema) from the data. This makes NoSQL a better choice to store
information that does not easily fit into the record and table format, such as text with varying
lengths. It also allows for easier data exchange between databases. Some newer NoSQL
databases like MongoDB and Couchbase also incorporate semi-structured documents by
natively storing them in the JSON format.
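To make the "internal tags and markings" of semi-structured data concrete, the sketch below parses one small JSON document and one small XML document using only Python's standard library. The sample records (name, email address, city) are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same kind of record in two semi-structured formats (sample data).
json_text = '{"name": "Asha", "emails": ["asha@example.com"], "address": {"city": "Pune"}}'
xml_text = "<person><name>Asha</name><email>asha@example.com</email></person>"

record = json.loads(json_text)  # name/value pairs become a dict, arrays a list
root = ET.fromstring(xml_text)  # tags become a tree of elements

print(record["address"]["city"])      # nested name/value pairs
print(record["emails"][0])            # ordered value list
print(root.find("name").text)         # element located by its tag
print([child.tag for child in root])  # tags describe the structure
```

Neither document needed a schema declared in advance: the names and tags travel with the data, which is what separates semi-structured data from rows in a fixed table.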
Introduction to Machine Learning
What is Machine Learning?
• Machine learning is an application of artificial intelligence (AI) that gives systems
the ability to automatically learn and improve from experience without being explicitly
programmed.
• Machine learning focuses on the development of computer programs that can access
data and use it to learn for themselves.
• The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions
in the future based on the examples that we provide.
• The primary aim is to allow computers to learn automatically, without human
intervention or assistance, and adjust their actions accordingly.
• Machine learning enables analysis of massive quantities of data.
• While it generally delivers faster, more accurate results in order to identify profitable
opportunities or dangerous risks, it may also require additional time and resources to
train it properly.
• Combining machine learning with AI and cognitive technologies can make it even
more effective in processing large volumes of information.
What is Machine Learning?
• Machine learning is made up of three parts:
– Computational algorithms at the core of making determinations.
– Variables and features that make up the decision.
– Base knowledge for which the answer is known, which enables (trains) the
system to learn.
• Initially, the model is fed data for which the answer is known.
• The algorithm is then run, and adjustments are made until the algorithm’s
output (learning) agrees with the known answer.
• At this point, increasing amounts of data are input to help the system learn
and handle more complex decisions.
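The "adjust until the output agrees with the known answer" step can be sketched as follows. This is a minimal example under stated assumptions: the model is a single weight `w` in `y = w * x`, the adjustment rule is plain gradient descent on the mean squared error, and the data, learning rate and tolerance are chosen only for illustration.

```python
# Base knowledge: inputs with known answers (here the hidden rule is y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def train(xs, ys, learning_rate=0.01, tolerance=1e-6, max_steps=10_000):
    """Adjust the weight w until the predictions w * x agree with the answers."""
    w = 0.0  # initial guess
    for _ in range(max_steps):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        if abs(gradient) < tolerance:
            break  # output now agrees with the known answers
        w -= learning_rate * gradient  # the adjustment
    return w


w = train(xs, ys)
print(round(w, 3))      # close to 2.0
print(round(w * 5, 2))  # prediction for an unseen input
```

Here the algorithm is the update rule, the feature is `x`, and the base knowledge is the list of known answers `ys`.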
What can Machine Learning do?
• Ever since computers were invented, we have wondered whether they
might be made to learn.
• If we could understand how to program them to learn, that is, to improve
automatically with experience,
– the impact would be dramatic!
• Imagine
– computers learning from medical records which treatments are most
effective for new diseases,
– houses learning from experience to optimize energy costs based on the
particular usage patterns of their occupants, or
– personal software assistants learning the evolving interests of their
users in order to highlight especially relevant stories from the online
morning newspaper.
• A successful understanding of how to make computers learn would
open up many new uses of computers and new levels of
competence and customization.
• And a detailed understanding of information processing algorithms
for machine learning might lead to a better understanding of
human learning abilities (and disabilities) as well.
We do not yet know how to make computers learn nearly as well
as people learn.
• Algorithms have been invented that are effective for certain types
of learning tasks, and a theoretical understanding of learning is
beginning to emerge.
• Many practical computer programs have been developed to exhibit
useful types of learning, and significant commercial applications
have begun to appear.
For problems such as speech recognition, algorithms based on
machine learning outperform all other approaches that have been
attempted to date.
Achievements of ML
• In the field known as data mining, machine learning algorithms
are being used routinely to discover valuable knowledge from
large commercial databases containing
– equipment maintenance records,
– loan applications,
– financial transactions,
– medical records and the like.
A few achievements of ML
Programs have been developed that:
• Successfully learn to recognize spoken words (Waibel 1989; Lee
1989).
• Predict recovery rates of pneumonia patients (Cooper et al.
1997).
• Detect fraudulent use of credit cards.
• Drive autonomous vehicles on public highways (Pomerleau 1989).
• Play games such as backgammon at levels approaching the
performance of human world champions (Tesauro 1992, 1995).
WELL-POSED LEARNING PROBLEMS
Tom Mitchell’s Definition:

“A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.”
For example, a computer program that learns to play checkers might improve
its performance as measured by its ability to win at the class of tasks involving
playing checkers games, through experience obtained by playing games
against itself.
In order to define a learning problem, we must identify these three features: the
class of tasks, the measure of performance to be improved, and the source
of experience.
Example 1
In the checkers learning problem, a computer program that
learns to play checkers might improve its performance as measured by
its ability to win at the class of tasks involving playing checkers
games, through experience obtained by playing games against itself.
What is
– Task T:
• playing checkers
– Performance measure P:
• percent of games won against opponents
– Training experience E:
• playing practice games against itself
Example 2
A Handwriting Recognition learning problem:
What is
– Task T:
• recognizing and classifying handwritten words within images
– Performance measure P:
• percent of words correctly classified
– Training experience E:
• a database of handwritten words with given classifications
Example 3
A Robot Driving learning problem:
What is
– Task T:
• driving on public four-lane highways using vision sensors
– Performance measure P:
• average distance travelled before an error
– Training experience E:
• a sequence of images and steering commands recorded while
observing a human driver
Why Machine Learning?
• Machine learning has several very practical applications that drive the kind of real business results –
such as time and money savings – that have the potential to dramatically impact the future of your
organization.
• At Interactions in particular, we see tremendous impact occurring within the customer care industry,
whereby machine learning is allowing people to get things done more quickly and efficiently.
• Through Virtual Assistant solutions, machine learning automates tasks that would otherwise need to
be performed by a live agent – such as changing a password or checking an account balance. This
frees up valuable agent time that can be used to focus on the kind of customer care that humans
perform best: high touch, complicated decision-making that is not as easily handled by a machine.
• At Interactions, we further improve the process by eliminating the decision of whether a request
should be sent to a human or a machine: with unique Adaptive Understanding technology, the machine
learns to be aware of its limitations, and bails out to humans when it has low confidence in
providing the correct solution.
• Machine learning has made dramatic improvements in the past few years, but we are still very far
from reaching human performance. Many times, the machine needs the assistance of a human to
complete its task. At Interactions, we have deployed Virtual Assistant solutions that seamlessly blend
artificial with true human intelligence to deliver the highest level of accuracy and understanding.
How does it work?
• A Machine Learning algorithm is trained using a training data set to
create a model.
• When new input data is introduced to the ML algorithm, it makes
a prediction on the basis of the model.
• The prediction is evaluated for accuracy and if the accuracy is
acceptable, the Machine Learning algorithm is deployed.
• If the accuracy is not acceptable, the Machine Learning algorithm
is trained again and again with an augmented training data set.
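The train, predict, evaluate and retrain cycle described above can be sketched with a very small classifier. Everything here is an assumption made for illustration: the model is a nearest-centroid rule on a single numeric feature, the labels "small" and "large" and all data points are invented, and "acceptable accuracy" is taken to mean 100% on a held-out validation set.

```python
def train(examples):
    """Create a model: the mean feature value (centroid) of each label."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(predict(model, x) == label for x, label in examples) / len(examples)


def fit_until_acceptable(training, extra_batches, validation, target=1.0):
    """Retrain with augmented training data until accuracy reaches the target."""
    training = list(training)
    model = train(training)
    rounds = 0
    for batch in extra_batches:
        if accuracy(model, validation) >= target:
            break  # accuracy acceptable: the model would be deployed
        training.extend(batch)  # augment the training data set
        model = train(training)
        rounds += 1
    return model, accuracy(model, validation), rounds


training = [(1.0, "small"), (2.0, "small"), (6.0, "large")]
extra_batches = [[(4.0, "small"), (8.0, "large")], [(3.5, "small"), (7.5, "large")]]
validation = [(3.0, "small"), (4.5, "small"), (7.0, "large"), (9.0, "large")]

print(accuracy(train(training), validation))  # 0.75 before retraining
model, final_accuracy, rounds = fit_until_acceptable(training, extra_batches, validation)
print(final_accuracy, rounds)                 # 1.0 1
```

The first model misclassifies one validation example; adding one batch of new labeled data moves the decision boundary enough to fix it, so the second batch is never needed.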
[Figure: Stages of a data science project]
Types of machine learning
Supervised Learning
• Supervised learning, as the name indicates, involves a supervisor acting as a teacher.
• In supervised learning we teach or train the machine using well-labeled data,
that is, data already tagged with the correct answer.
• After that, the machine is given a new set of examples (data); the supervised
learning algorithm, having analysed the training data (the set of training examples),
produces an outcome for the new examples based on what it learned from the labeled data.
• Thus the machine learns from training data (for example, a basket of labeled fruits)
and then applies that knowledge to test data.
• Supervised learning algorithms fall into two categories:
• Classification: A classification problem is one where the output variable is a category,
such as “red” or “blue”, or “disease” or “no disease”.
• Regression: A regression problem is one where the output variable is a real value, such
as “dollars” or “weight”.
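The two categories can be illustrated side by side in plain Python. Both parts are sketches on invented data: classification uses a 1-nearest-neighbour rule on labeled fruits described by (weight in grams, roundness score), and regression fits a straight line by ordinary least squares. Neither is meant as the only or best algorithm for its category.

```python
def nearest_neighbour(labeled, point):
    """Classification: return the label of the closest training example."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(labeled, key=lambda example: sq_dist(example[0], point))[1]


def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Labeled training data: (weight in grams, roundness 0..1) -> fruit name.
fruits = [
    ((150, 0.9), "apple"), ((170, 0.8), "apple"),
    ((120, 0.2), "banana"), ((110, 0.1), "banana"),
]
print(nearest_neighbour(fruits, (160, 0.85)))  # output is a category

# Labeled training data: input value -> real-valued answer.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)                   # output is a real value
```

In both cases the training data carries the correct answer, which is what makes the learning supervised.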
Unsupervised Learning
• Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information
without guidance. Here the task of the machine is to group unsorted information
according to similarities, patterns and differences without any prior training on
the data.
• Unlike supervised learning, no teacher is provided, which means no training is
given to the machine. The machine must therefore find the hidden structure
in unlabeled data by itself.
• For instance, suppose it is given an image containing both dogs and cats, which
it has never seen before.
• The machine has no idea about the features of dogs and cats, so it cannot
label them as “dogs” and “cats”. But it can group them according to their
similarities, patterns and differences: the first group may contain all pictures
with dogs in them and the second group all pictures with cats in them. Nothing
was learned beforehand, meaning there was no training data and no examples.
• Unsupervised learning algorithms fall into these categories:
• Clustering: A clustering problem is one where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is one where you want to discover
rules that describe large portions of your data, such as people who buy X also
tend to buy Y.
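Clustering, the first category above, can be sketched with a compact version of the k-means algorithm. The customer data (visits per month, average spend) is invented, k-means is just one of several clustering algorithms, and the starting centroids are picked by hand here to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Unlabeled customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 14), (9, 80), (10, 85), (8, 90)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])

print(clusters[0])  # occasional, low-spend customers
print(clusters[1])  # frequent, high-spend customers
```

No labels were supplied at any point; the two groups emerge only from how close the customers are to one another.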
Semi-Supervised Learning
• The most basic disadvantage of any Supervised Learning algorithm is that the dataset
has to be hand-labeled, either by a Machine Learning Engineer or a Data Scientist.
• This is a very costly process, especially when dealing with large volumes of data.
• The most basic disadvantage of any Unsupervised Learning algorithm is that its application
spectrum is limited.
• To counter these disadvantages, the concept of Semi-Supervised Learning was
introduced.
• In this type of learning, the algorithm is trained upon a combination of labeled and
unlabeled data.
• The acquisition of unlabeled data is relatively cheap while labeling the said data is very
expensive.
• The basic procedure involved is that first, the programmer will cluster similar data using
an unsupervised learning algorithm and then use the existing labeled data to label the rest
of the unlabeled data.
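The cluster-then-label procedure just described can be sketched as follows. The assumptions are significant and chosen only to keep the example short: the data is one-dimensional, the clustering step is a two-cluster k-means, each cluster takes the majority label of the labeled points it contains, and the values and the labels "low" and "high" are invented.

```python
from collections import Counter


def two_means_1d(values, iterations=10):
    """Unsupervised step: split 1-D values into two clusters (index 0 or 1)."""
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        assignment = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, a in zip(values, assignment) if a == 0]
        group1 = [v for v, a in zip(values, assignment) if a == 1]
        low, high = sum(group0) / len(group0), sum(group1) / len(group1)
    return assignment


def label_by_cluster(values, known_labels):
    """Spread the few known labels to every point in the same cluster."""
    assignment = two_means_1d(values)
    votes = {}
    for index, label in known_labels.items():
        votes.setdefault(assignment[index], Counter())[label] += 1
    cluster_label = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [known_labels.get(i, cluster_label.get(assignment[i]))
            for i in range(len(values))]


values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
known_labels = {0: "low", 4: "high"}  # only two of eight points are hand-labeled

print(label_by_cluster(values, known_labels))
```

Two hand-labeled points are enough to label all eight, which is the cost saving that motivates the approach; it works only to the extent that points in the same cluster really share a label.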
Semi-Supervised Learning
A semi-supervised algorithm assumes the following about the data:
• Continuity Assumption: The algorithm assumes that the points
which are closer to each other are more likely to have the same
output label.
• Cluster Assumption: The data can be divided into discrete clusters
and points in the same cluster are more likely to share an output
label.
• Manifold Assumption: The data lie approximately on a manifold of
much lower dimension than the input space. This assumption allows
the use of distances and densities which are defined on a manifold.
Reinforcement Learning
• Learning “how to cycle” is neither completely supervised nor completely
unsupervised; it is a Reinforcement Learning process.
• Reinforcement learning is an area of Machine Learning.
• It is about taking suitable actions to maximize reward in a particular
situation.
• It is employed by various software and machines to find the best
possible behavior or path to take in a specific situation.
• Reinforcement learning differs from supervised learning: in supervised
learning the training data comes with an answer key, so the model is
trained on the correct answers, whereas in reinforcement learning there
is no answer key and the reinforcement agent decides what to do to
perform the given task.
• In the absence of a training dataset, it is bound to learn from its
experience.
• Main points in Reinforcement learning:
– Input: The input should be an initial state from which the model will start.
– Output: There are many possible outputs, as there are a variety of solutions to
a particular problem.
– Training: The training is based upon the input. The model returns a
state, and the user decides whether to reward or punish the model based on its
output.
– The model continues to learn.
– The best solution is decided based on the maximum reward.
• Various practical applications of Reinforcement Learning:
– RL can be used in robotics for industrial automation.
– RL can be used in machine learning and data processing.
– RL can be used to create training systems that provide custom instruction
and materials according to the requirements of students.
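The input, output, training and reward ideas above can be sketched with tabular Q-learning, one common reinforcement learning algorithm (the slides do not name a specific one). The environment is an invented five-cell corridor in which the agent starts in cell 0 and earns a reward of 1 on reaching cell 4; here a fixed reward rule plays the part of the user who rewards or punishes. The learning rate, discount factor, exploration rate and seed are illustrative assumptions.

```python
import random

N_STATES, GOAL = 5, 4  # corridor cells 0..4; reaching cell 4 ends an episode
MOVES = (-1, +1)       # action 0 = step left, action 1 = step right


def step(state, action):
    """Environment: apply the move and return (next state, reward)."""
    next_state = min(max(state + MOVES[action], 0), GOAL)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table of action values from experience alone."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0  # input: the initial state
        while state != GOAL:
            # Explore sometimes (and on ties); otherwise take the best-known action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = step(state, action)
            # Move the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
policy = ["left" if left > right else "right" for left, right in q[:GOAL]]
print(policy)  # the learned behaviour for cells 0..3
```

No example of the correct move is ever supplied; the agent discovers that stepping right earns the most reward purely from the outcomes of its own actions.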
Machine Learning Applications
• Machine Learning in Education
Teachers can use Machine Learning to check how much of the
lessons students are able to absorb, how they are coping
with the lessons taught, and whether they are finding them too
much to take in. This allows teachers to help their students
grasp the lessons and to prevent at-risk students from
falling behind or, even worse, dropping out.
• Machine Learning in Search Engines
It is no secret that search engines rely on Machine Learning
to improve their services. Using it, Google has introduced
services such as voice recognition and image search.
What further features they come up with, time will tell.
Machine Learning Applications
Machine Learning in Digital Marketing
• This is where Machine Learning can help significantly. Machine Learning
allows more relevant personalization, so companies can interact and
engage with the customer. Sophisticated segmentation focuses on the
appropriate customer at the right time, with the right message.
Companies have information that can be leveraged to learn customers'
behavior.
• Nova uses Machine Learning to write personalized sales emails.
It knows which emails performed better in the past and accordingly
suggests changes to the sales emails.
Machine Learning in Healthcare
• This application has remained a hot topic for the last three years.
Several promising start-ups in this industry are gearing up their
efforts with a focus on healthcare. These include Nervanasys (acquired by
Intel), Ayasdi, Sentient and Digital Reasoning System, among others.
• Computer vision, which uses deep learning, is one of the most significant
contributors in the field of Machine Learning and an active healthcare
application for ML. Microsoft’s InnerEye initiative, which started in 2010,
is currently working on an image diagnostic tool.
Assignment 1
1. Explain why we need Machine Learning.
2. List 10 real-life applications of Regression and Classification.
3. Explain labeled and unlabeled datasets and their
application in Machine Learning.
4. Explain Reinforcement Learning using a real-life example.
Explain how reinforcement learning is applicable in the
example.
5. Explain how we can use the structured, semi-structured and
unstructured digital data in Machine Learning.
Thank You
