Introduction To Data Science and Machine Learning

What is Data Science?
1. Data Science is the science of extracting hidden patterns from large data sets.
2. Hidden patterns can appear in the form of trends, cycles, associations, rules, groups etc. in the data.
3. Data sets usually refer to a large volume of cleansed, structured data prepared for analysis.
4. Science refers to the statistical tools and techniques employed to understand the data and the reliability of the identified patterns.
a. The part of statistics used to understand the data is called descriptive statistics. Descriptive statistics give vital insights into the data in terms of central values, spread and distribution shape.
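
The three kinds of insight named above (central value, spread, distribution shape) can be sketched with Python's standard library. This is a minimal illustration only: the `describe` helper and the blood-pressure readings are hypothetical and not part of the original slides.

```python
import statistics


def describe(values):
    """Summarise central value, spread and distribution shape of a sample."""
    n = len(values)
    mean = statistics.mean(values)        # central value
    median = statistics.median(values)    # central value, robust to outliers
    stdev = statistics.pstdev(values)     # spread (population standard deviation)
    # Shape: skewness as the third standardised moment.
    # A value above 0 means the distribution has a longer right tail.
    skewness = sum((v - mean) ** 3 for v in values) / (n * stdev ** 3)
    return {"mean": mean, "median": median, "stdev": stdev, "skewness": skewness}


# Hypothetical blood-pressure readings, for illustration only.
readings = [110, 115, 118, 120, 122, 125, 160]
summary = describe(readings)
print(summary)
```

Here the single high reading pulls the mean above the median and gives a positive skewness, which is the kind of signal descriptive statistics are meant to surface before any modelling starts.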
What is Machine Learning?
1. Machine Learning is an integral and critical part of data science. It refers to a collection of algorithms used to extract hidden patterns from a data set.
2. These algorithms use a learning process through which they identify the patterns in the data set. The patterns they learn from the data are called models.
3. The models can be expressed in the form of mathematical equations, rules, probability ratios etc.
4. Machine learning algorithms work on the data prepared for analytics to express the hidden patterns in the form of models.
5. For machine learning algorithms to successfully identify reliable

When is machine learning useful?
1. We cannot express our knowledge about patterns as a program, e.g. character recognition or natural language processing.
2. We do not have an algorithm to identify a pattern of interest, e.g. spam mail detection.
3. The problem is too complex and dynamic, e.g. weather forecasting.
4. There is no prior experience or knowledge, e.g. a Mars rover.

Machine Learning Applications (examples)
1. Fraud detection
2. Sentiment analysis
3. Credit risk management
4. Prediction of equipment failures
5. New pricing models / strategies
6. Network intrusion detection
7. Pattern and image recognition
8. Email spam filtering

Machine Learning Pre-requisites
1. A rich set of data representing the environment where the model is to be used
2. Knowledge and skills in
a. Mathematics and statistics (graduate level or more)
b. Programming in a language such as Python or R (considered the two most popular languages for data science)
c. Domain knowledge
3. Usually data science is a team effort where the team

Real World as Mathematical Space
Machine learning happens in mathematical space / feature space:
1. A data set representing the real world is a collection of attributes that define an entity.
2. Each entity is represented as one record / line in the data set.
Attributes / Dimensions
1. Each attribute becomes a dimension.
2. Each record becomes a point in the space.
[Figure: feature space with axes Age, BP level and Sugar; points are labelled "Heart healthy" or "Potential heart ailments".]
Feature space:
1. The position of a point in space is defined with respect to the origin.
2. The position is decided by the values of the attributes for a point.
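
The mapping from records to points can be sketched as follows. This is an illustrative sketch only: the attribute names, the `to_point` and `distance_from_origin` helpers and the patient records are assumptions made for the example.

```python
import math

# The order of the attributes fixes the axes of the feature space.
ATTRIBUTES = ("age", "bp_level", "sugar")


def to_point(record):
    """Turn one record (a dict of attribute values) into a point in feature space."""
    return tuple(float(record[name]) for name in ATTRIBUTES)


def distance_from_origin(point):
    """Euclidean distance of a point from the origin (0, 0, 0)."""
    return math.sqrt(sum(coord ** 2 for coord in point))


# Hypothetical patient records, for illustration only.
records = [
    {"age": 45, "bp_level": 120, "sugar": 90},
    {"age": 62, "bp_level": 150, "sugar": 140},
]
points = [to_point(r) for r in records]
for p in points:
    print(p, round(distance_from_origin(p), 2))
```

Each record yields exactly one coordinate per attribute, so a data set with three attributes lives in a three-dimensional space.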
3. A model represents the real world process that generated the different sets of data points.
4. The model could be a simple plane, complex plane or hyperplane.
5. But multiple planes can do the job, each representing an alternate hypothesis.
6. The learning algorithm selects that hypothesis
[Figure: axes Age, BP level and Sugar; labels "Heart healthy" and "Erroneous classification".]
7. In the figure, since the separator is a plane, the model will be the equation ax + by + cz = d representing the plane.
8. x, y, z represent the three dimensions, i.e. BP, Age, Sugar, while d represents
9. A new data point enters the system.
10. Its x, y and z values will be fed into the model to get the value of d (healthy or ailing).
11. The data point will be placed above or below the plane ax + by + cz = d, based on d.
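
One way to read the steps above is to evaluate ax + by + cz for the new point and compare the result with d. The sketch below assumes that reading. The coefficients, the threshold and the choice of which side means "heart healthy" are hypothetical values, not a model learned from real data.

```python
def side_of_plane(point, coeffs, d):
    """Return 'above' if a*x + b*y + c*z > d, otherwise 'below'.

    Points lying exactly on the plane are treated as 'below' in this sketch.
    """
    a, b, c = coeffs
    x, y, z = point
    score = a * x + b * y + c * z
    return "above" if score > d else "below"


# Hypothetical plane, chosen only for illustration.
COEFFS = (1.0, 1.0, 1.0)   # a, b, c
D = 300.0                  # right-hand side of ax + by + cz = d
LABELS = {"above": "potential heart ailments", "below": "heart healthy"}


def classify(point):
    """Classify a point given as (BP, Age, Sugar)."""
    return LABELS[side_of_plane(point, COEFFS, D)]


print(classify((120, 45, 90)))    # 120 + 45 + 90 = 255, not above 300
print(classify((150, 62, 140)))   # 150 + 62 + 140 = 352, above 300
```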
12. Whether the new data point is correctly placed (above or below the plane), i.e. correctly classified as an ailing or healthy heart, will be known only after direct observation.
13. Only a direct test on the object of interest will tell whether the classification is correct or not.
15. If the majority of new data points are correctly classified, the model is
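
Checking classifications against direct observation can be sketched as a simple accuracy calculation. The `accuracy` helper and the labels below are hypothetical, and the sketch does not fix any threshold for what counts as an acceptable model.

```python
def accuracy(predicted, observed):
    """Fraction of data points whose predicted class matches direct observation."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(predicted)


# Hypothetical model output and later direct observations for four new points.
predicted = ["healthy", "ailing", "healthy", "ailing"]
observed = ["healthy", "ailing", "ailing", "ailing"]

score = accuracy(predicted, observed)
print(score)
print("majority correct:", score > 0.5)
```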
Machine Learning Categories
There are broadly three categories into which machine learning algorithms are grouped:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Machine Learning:
1. A class of algorithms which work in two stages. The first stage is called training and the second is usually called testing. Sometimes there is also a validation stage, followed by testing.
2. At each stage the algorithm takes input data prepared for that stage, i.e. training data for the training stage, validation data for the validation stage and test data for the test stage.
3. During training, the machine learning algorithm gets the training data in the form of independent and dependent variables.
4. In the process of learning, the algorithm learns the relationship between the dependent and the independent variables.
5. This relationship is expressed as a model which can take the form of an equation, probability ratios, hidden rules etc.
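
The two stages can be sketched with a one-variable least-squares line fitted on training data and checked on held-out test data. This is a minimal sketch under stated assumptions: the weight and mileage figures are invented, only one independent variable is used, and the helper names `fit_line` and `mean_absolute_error` are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(model, xs, ys):
    """Average absolute gap between the model's predictions and the true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: car weight in tonnes (independent) and mileage in km per litre (dependent).
weights = [0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2]
mileage = [22.0, 20.1, 18.2, 15.9, 14.1, 12.0, 10.2, 7.9]

# Stage 1: training on the first six records.
train_x, train_y = weights[:6], mileage[:6]
# Stage 2: testing on the two records the algorithm has never seen.
test_x, test_y = weights[6:], mileage[6:]

model = fit_line(train_x, train_y)
test_error = mean_absolute_error(model, test_x, test_y)
print(model, test_error)
```

The learned model here is an equation, one of the forms listed above. In practice the records would usually be shuffled before splitting, rather than holding out the last rows as done in this sketch.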
Examples of Supervised Machine Learning:
1. Regression - Predicting the mileage of a car given other features such as weight, engine capacity, horsepower, transmission type, number of cylinders etc.
a. In this example, mileage is the dependent variable and weight, engine capacity, horsepower, transmission type and number of cylinders are independent variables.
b. Mileage = f(weight, engine capacity, horsepower, transmission type, number of cylinders)
2. Classification - Categorizing a mail as spam or ham
a. In this example, the email category (spam or ham) is the target variable and the occurrences of certain words and their frequencies are independent variables.
b. P(ham) = f( words, frequencies) where P stands for probability. 1-
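
One possible choice for f in the classification example is a naive Bayes model over word counts. The slides do not say which algorithm is meant, so the technique, the add-one smoothing and the tiny training set below are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count word frequencies per class. messages is a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def prob_ham(text, model):
    """Return P(ham) for a message using naive Bayes with add-one smoothing.

    Assumes the training data contained at least one spam and one ham message.
    """
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)  # class prior
        for word in text.lower().split():
            count = word_counts[label][word] + 1                 # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalise the two log scores into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights["ham"] / sum(weights.values())


# Hypothetical labelled mails, for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
MODEL = train(TRAINING)
print(prob_ham("free prize now", MODEL))
print(prob_ham("monday meeting", MODEL))
```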

Unsupervised Machine Learning:
1. A class of algorithms which work in a single stage. Unlike supervised learning algorithms, they do not have separate training, testing or validation stages.
2. Unsupervised learning algorithms take the data as a whole, not in the form of independent and dependent variables.
3. The algorithms are not used to find any relationship between dependent and independent variables.
4. This class of algorithms usually finds patterns in the form of clusters and associations reflecting some kind of commonality or togetherness among the data points in the given data set.
5. It is the responsibility of the data scientist to analyse the identified clusters/associations and give meaning to them.
6. Clustering and PCA (Principal Component Analysis), a mathematical

Examples of Unsupervised Machine Learning:
1. Clustering - Identifying groups in the given data set, where a group represents some kind of commonality among the data points. “Birds of a feather flock together”.
2. Clustering can further be categorized into:
a. Flat clustering, e.g. K-means clustering - The clusters identified are disjoint and non-overlapping, e.g. segmenting customers into different groups based on their purchase amount, frequency of purchase and types of items purchased.
b. Hierarchical clustering - Clustering is done at multiple levels, indicating clusters inside clusters indicating some
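
Flat clustering with K-means can be sketched in a few lines of plain Python. This is a simplified sketch under stated assumptions: the first k points are used as starting centroids (random starting points are more common), the customer data is invented, and only two of the attributes mentioned above are used.

```python
import math


def kmeans(points, k, iterations=10):
    """Flat clustering with the standard K-means loop; returns (centroids, assignments)."""
    # Assumption: take the first k points as the initial centroids.
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical customers as (purchase amount, purchases per year).
customers = [(20, 1), (500, 12), (25, 2), (520, 10), (22, 1), (480, 11)]
centroids, assignments = kmeans(customers, k=2)
print(assignments)
print(centroids)
```

Every customer ends up in exactly one cluster, which is what makes the result disjoint and non-overlapping. With attributes on very different scales, as here, the values would normally be rescaled first so that one attribute does not dominate the distance.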
Reinforcement Machine Learning:
1. A reinforcement learning algorithm learns through trial and error and the feedback it receives from the environment in which it learns.
2. During the initial stages of learning, it is likely to commit many errors in learning the patterns. However, through a process of reward and punishment, it learns to identify the patterns correctly.
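
Trial and error with reward and punishment can be sketched with an epsilon-greedy agent that chooses among a few actions. The slides name no specific algorithm, so epsilon-greedy is a swapped-in example. The environment, the reward values and the `run_agent` helper are hypothetical.

```python
import random


def run_agent(rewards, steps=200, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over a fixed set of actions.

    rewards[a] is the feedback for action a: positive is a reward,
    negative is a punishment. Returns (estimates, history).
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # the agent's current value estimate per action
    counts = [0] * len(rewards)       # how often each action has been tried
    history = []                      # feedback received at each step
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))   # explore: try a random action
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]
        counts[action] += 1
        # Move the estimate towards the observed feedback (running mean).
        estimates[action] += (reward - estimates[action]) / counts[action]
        history.append(reward)
    return estimates, history


# Hypothetical environment: only the third action is rewarded.
ENV_REWARDS = [-1.0, -1.0, 1.0]
estimates, history = run_agent(ENV_REWARDS)
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, estimates)
print("punishments received:", history.count(-1.0))
```

The agent starts with no knowledge, gets punished for the wrong actions it tries, and ends up preferring the rewarded action.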
Thank You
