
Introduction to Data Science and Machine Learning


What is Data Science?

1. Data Science is the science of extracting hidden patterns from large data sets

2. Hidden patterns can appear in the form of trends, cycles, associations, rules, groups etc. in the data

3. Data sets usually refer to a large volume of cleansed, structured data prepared for analysis

4. Science refers to the statistical tools and techniques employed to understand the data and the reliability of the identified patterns

a. The part of statistics used to understand the data is called descriptive statistics. Descriptive statistics give vital insights into the data in terms of central values, spread and distribution shape

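The three kinds of insight named above (central values, spread, distribution shape) can be sketched in a few lines of Python. This is only an illustration: the sample values are invented, the helper name `describe` is hypothetical, and skewness is used here as one possible shape measure.

```python
import statistics

# Invented sample values, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]

def describe(values):
    """Return central values, spread and a shape measure for a list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)  # population standard deviation
    # Skewness (third standardized moment): above 0 means a longer right tail
    skew = sum((v - mean) ** 3 for v in values) / (len(values) * spread ** 3)
    return {"mean": mean, "median": median, "std_dev": spread, "skewness": skew}

summary = describe(sample)
print(summary)
```

A mean above the median together with a positive skewness suggests that a few large values pull the distribution to the right.
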
What is Machine Learning?

1. Machine Learning is an integral and critical part of data science. It refers to a collection of algorithms used to extract the hidden patterns from the dataset

2. These algorithms use a learning process through which they identify the patterns in the dataset. The patterns they learn from the data are called models

3. The models can be expressed in the form of mathematical equations, rules, probability ratios etc.

4. Machine learning algorithms work on the data prepared for analytics to express the hidden patterns in the form of models

5. For machine learning algorithms to successfully identify reliable


When is machine learning useful?

1. We cannot express our knowledge about patterns as a program, e.g. character recognition or natural language processing

2. We do not have an algorithm to identify a pattern of interest, e.g. spam mail detection

3. The problem is too complex and dynamic, e.g. weather forecasting

4. There is no prior experience or knowledge, e.g. Mars rover


Machine Learning Applications (examples)

1. Fraud detection

2. Sentiment analysis

3. Credit risk management

4. Prediction of equipment failures

5. New pricing models / strategies

6. Network intrusion detection

7. Pattern and image recognition

8. Email spam filtering


Machine Learning Pre-requisites

1. Rich set of data representing the environment where the model is to be used

2. Knowledge and skills in

a. Mathematics and statistics (graduate level or more)
b. Programming in a language such as Python or R (considered the two most popular languages for data science)
c. Domain knowledge

3. Usually data science is a team effort where the team


Real World as Mathematical Space
Machine learning happens in mathematical space / feature space:

1. A data set representing the real world is a collection of attributes that define an entity

2. Each entity is represented as one record / line in the data set


Attributes / Dimensions

1. Each attribute becomes a dimension

2. Each record becomes a point in the space

[Figure: feature space with axes Age, BP level and Sugar; legend: Heart healthy, Potential heart ailments]

1. Position of a point in space is defined with respect to the origin

2. The position is decided by the values of the attributes for a point

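A minimal sketch of this mapping, assuming three attributes taken from the figure labels (Age, BP level, Sugar). The records and their values are invented, and `to_point` is a hypothetical helper name.

```python
import math

# Attribute order defines the axes of the feature space (an assumed ordering)
ATTRIBUTES = ("age", "bp", "sugar")

# Hypothetical records; the values are invented for illustration
records = [
    {"age": 45, "bp": 130, "sugar": 110},
    {"age": 62, "bp": 155, "sugar": 180},
]

def to_point(record):
    """Map one record to a point: one coordinate per attribute."""
    return tuple(record[name] for name in ATTRIBUTES)

points = [to_point(r) for r in records]
origin = (0,) * len(ATTRIBUTES)

for p in points:
    # Position is measured relative to the origin of the space
    print(p, "distance from origin:", round(math.dist(origin, p), 2))
```

Adding a fourth attribute would simply add a fourth coordinate to every point.
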
3. A model represents the real world process that generated the different sets of data points

4. The model could be a simple plane, complex plane, hyper plane

5. But multiple planes can do the job, each representing an alternate hypothesis

6. The learning algorithm selects that hypothesis

[Figure: feature space (Age, BP level, Sugar) with one point marked "Erroneous classification"]

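The idea of several candidate planes competing as hypotheses can be sketched as follows. The slide does not say which criterion the algorithm uses to choose, so this sketch assumes the simplest one: pick the plane with the fewest erroneous classifications on labelled points. All data, plane coefficients and names (`count_errors`, `candidates`) are hypothetical.

```python
# Labelled points as ((bp, age, sugar), label); 1 = potential ailment, 0 = healthy.
# Values are invented for illustration.
data = [
    ((120, 40, 90), 0),
    ((110, 35, 85), 0),
    ((160, 65, 180), 1),
    ((170, 70, 200), 1),
]

# Each hypothesis is a plane a*x + b*y + c*z = d, stored as (a, b, c, d)
candidates = {
    "h1": (1, 1, 1, 300),
    "h2": (0, 1, 0, 38),
    "h3": (0, 0, 1, 80),
}

def count_errors(plane, labelled_points):
    """Number of points that fall on the wrong side of the plane."""
    a, b, c, d = plane
    wrong = 0
    for (x, y, z), label in labelled_points:
        predicted = 1 if a * x + b * y + c * z > d else 0
        if predicted != label:
            wrong += 1
    return wrong

# Assumed selection rule: the hypothesis with the fewest errors wins
best = min(candidates, key=lambda name: count_errors(candidates[name], data))
print(best, count_errors(candidates[best], data))
```

Real learning algorithms search over infinitely many planes rather than a short fixed list, but the comparison step is the same in spirit.
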
7. In the figure, since the separator is a plane, the model will be the equation ax + by + cz = d representing the plane

8. x, y, z represent the three dimensions, i.e. BP, Age, Sugar, while d represents

9. A new data point enters the system

10. Its x, y and z values will be fed into the model to get the value of d (healthy or ailing)

11. The data point will be placed above or below the plane ax + by + cz = d, based on d

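Placing a new point relative to the plane can be sketched like this. The coefficients are hypothetical (a trained model would supply them), and treating "above the plane" as the ailing side is an assumption made only for this example.

```python
# Hypothetical plane coefficients; a real model would learn these from data
A, B, C, D = 1.0, 1.0, 1.0, 300.0

def classify(bp, age, sugar):
    """Place a point relative to the plane a*x + b*y + c*z = d."""
    value = A * bp + B * age + C * sugar
    # Assumption: points above the plane are treated as potentially ailing
    return "Potential heart ailments" if value > D else "Heart healthy"

print(classify(bp=120, age=40, sugar=90))
print(classify(bp=160, age=65, sugar=180))
```
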
12. Whether the new data point is correctly placed (above or below the plane), i.e. correctly classified as ailing or healthy heart, will be known only after direct observation

13. Only a direct test on the object of interest will tell whether the classification is correct or not

15. If the majority of new data points are correctly classified, the model is

Machine Learning Categories

There are broadly three categories into which machine learning algorithms are grouped:

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning

Supervised Machine Learning:

1. Class of algorithms which work in two stages. The first stage is called training and the second is usually called testing. Sometimes a validation stage is also involved, followed by testing

2. Each stage takes input data prepared for that stage, i.e. training data for the training stage, validation data for the validation stage and test data for the test stage

3. During training, the machine learning algorithm gets the training data in the form of independent and dependent variables

4. In the process of learning, the algorithm learns the relationship between the dependent and the independent variables

5. This relationship is expressed as a model which can take the form of an equation, probability ratios, hidden rules etc.

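The two stages can be sketched with a single independent variable and a straight-line model fitted by least squares. The data (car weight and mileage) is invented, the split sizes are arbitrary, and `fit_line` is a hypothetical helper, so this is an illustration of the workflow rather than a recommended setup.

```python
# Hypothetical data: car weight in tonnes (independent) and mileage (dependent)
weights = [1.0, 1.5, 2.0, 2.5, 3.0]
mileage = [40.0, 35.0, 30.0, 25.0, 20.0]

# Stage split: first four records for training, the last one held out for testing
train_x, test_x = weights[:4], weights[4:]
train_y, test_y = mileage[:4], mileage[4:]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Training stage: learn the relationship (the model is the fitted equation)
slope, intercept = fit_line(train_x, train_y)

# Testing stage: compare predictions with the held-out observations
errors = [abs((slope * x + intercept) - y) for x, y in zip(test_x, test_y)]
print("model: mileage =", slope, "* weight +", intercept)
print("mean absolute test error:", sum(errors) / len(errors))
```

Because the test record is never shown to `fit_line`, the test error gives a first impression of how the model behaves on unseen data.
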
Examples of Supervised Machine Learning:

1. Regression - Predicting mileage of a car given other features such as weight, engine capacity, horsepower, transmission type, number of cylinders etc.

a. In this example, mileage is the dependent variable and weight, engine capacity, horsepower, transmission type and number of cylinders are independent variables

b. Mileage = f(weight, engine capacity, horsepower, transmission type, number of cylinders)

2. Classification - Categorizing a mail as spam or ham

a. In this example, the email category (spam or ham) is the target variable and the occurrences of certain words and their frequency are independent variables

b. P(ham) = f(words, frequencies) where P stands for probability. 1-
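
One way to make P(ham) = f(words, frequencies) concrete is a naive Bayes estimate with add-one smoothing. The slides do not name a specific classifier, so this choice is an assumption made for illustration; the training mails and the names `train` and `p_ham` are likewise hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled mails (invented for illustration)
training_mails = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def train(mails):
    """Count word frequencies per class and mails per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in mails:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def p_ham(text, word_counts, class_counts):
    """Estimate P(ham) for a mail from its words (naive Bayes, add-one smoothing)."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_mails = sum(class_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_mails)
        for word in text.split():
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        log_score[label] = score
    ham, spam = math.exp(log_score["ham"]), math.exp(log_score["spam"])
    return ham / (ham + spam)

word_counts, class_counts = train(training_mails)
print(p_ham("win free money", word_counts, class_counts))
print(p_ham("meeting tomorrow", word_counts, class_counts))
```

A mail would then be labelled ham when the estimate exceeds a chosen threshold such as 0.5, and spam otherwise.
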


Unsupervised Machine Learning:

1. Class of algorithms which work in a single stage. Unlike supervised learning algorithms, they do not have separate training, testing or validation stages

2. Unsupervised learning algorithms take the data as a whole, not in the form of independent and dependent variables

3. The algorithms are not used to find any relationship between dependent and independent variables

4. This class of algorithms usually finds patterns in the form of clusters and associations reflecting some kind of commonality or togetherness among the data points in the given data sets

5. It is the responsibility of the data scientist to analyse the identified clusters/associations and give meaning to them

6. Clustering and PCA (Principal Component Analysis), a mathematical


Examples of Unsupervised Machine Learning:

1. Clustering - Identifying groups in the given data set where a group represents some kind of commonality among the data points. “Birds of a feather flock together”.

2. Clustering can further be categorized into -

a. Flat clustering, e.g. K-means clustering - The clusters identified are disjoint and non-overlapping, e.g. segmenting customers into different groups based on their purchase amount, frequency of purchase and types of items purchased

b. Hierarchical clustering - clustering is done at multiple levels, indicating clusters inside clusters indicating some

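The flat clustering case (item a) can be sketched with a plain K-means loop on a customer segmentation example. The customer values are invented, the function name `k_means` is hypothetical, and seeding the centroids with the first k points is a simplifying assumption; practical implementations usually choose the starting centroids more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical customers: (purchase amount, purchases per month)
customers = [(10, 1), (12, 2), (11, 1), (100, 10), (105, 12), (98, 11)]
labels, centroids = k_means(customers, k=2)
print(labels)
print(centroids)
```

The algorithm only returns group numbers; deciding that one group means "occasional buyers" and the other "frequent high spenders" is left to the analyst.
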
Reinforcement Learning:

1. A reinforcement learning algorithm learns through trial and error and the feedback it receives from the environment in which it learns.

2. During the initial stages of learning, it is likely to commit many errors in learning the patterns; however, through a process of reward and punishment, it learns to identify the patterns correctly.

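Trial and error with feedback can be sketched with an epsilon-greedy strategy over a few possible actions. The slides do not name a particular algorithm, so this choice is an assumption; the reward values, the noise level and the function name `run_bandit` are all hypothetical.

```python
import random

def run_bandit(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error over a few actions with noisy rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # learned value of each action
    counts = [0] * len(true_rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try an action at random
            action = rng.randrange(len(true_rewards))
        else:
            # Exploit: pick the action that currently looks best
            action = max(range(len(estimates)), key=estimates.__getitem__)
        # Feedback from the environment: high values act as reward, low ones as punishment
        reward = true_rewards[action] + rng.gauss(0, 0.05)
        counts[action] += 1
        # Incremental average of the rewards seen for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.9])
best_action = max(range(len(estimates)), key=estimates.__getitem__)
print("estimated values:", [round(e, 2) for e in estimates])
print("preferred action:", best_action)
```

Early on the learner often picks poor actions because its estimates start at zero; as feedback accumulates, it settles on the action with the highest reward.
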
Thank You
