
UNIT IV

(Data Analytics)
Object Segmentation: Regression Vs Segmentation – Supervised and
Unsupervised Learning, Tree Building – Regression, Classification,
Overfitting, Pruning and Complexity, Multiple Decision Trees etc. Time
Series Methods: Arima, Measures of Forecast Accuracy, STL approach,
Extract features from generated model as Height, Average Energy etc
and Analyze for prediction.
Segmentation:
Segmentation is a methodology that involves dividing a broad market (items, customers, etc.) into subsets of entities with common characteristics, i.e., homogeneous groups; designing and implementing strategies specific to these segments then makes decision making easier.
Segmentation is used in different areas of Risk Management like credit
risk, operational risk, reserving and investment among others.
Segmentation is often used for modeling Credit risk. Applicants are
segmented based on the estimated credit risk and decisions are made
based on the segment in which the applicant falls.
Supervised Machine Learning:
In Supervised learning, you train the machine using data which is well
"labeled." It means some data is already tagged with the correct answer.
It can be compared to learning which takes place in the presence of a
supervisor or a teacher.
A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unforeseen data. Successfully building, scaling, and deploying accurate supervised machine learning models takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights given remain true as the data changes.
Why Supervised Learning?
 Supervised learning allows you to collect data or produce a data output from previous experience.
 It helps you to optimize performance criteria using experience.
 Supervised machine learning helps you to solve various types of real-world computation problems.
Unsupervised Learning:
Unsupervised learning is a machine learning technique where you do not need to supervise the model. Instead, you allow the model to work on its own to discover information. It mainly deals with unlabelled data.

Unsupervised learning algorithms allow you to perform more complex processing tasks compared to supervised learning, although unsupervised learning can be more unpredictable than other methods such as deep learning and reinforcement learning.
Why Unsupervised Learning?
Here, are prime reasons for using Unsupervised Learning:
 Unsupervised machine learning finds all kinds of unknown patterns in data.
 Unsupervised methods help you to find features which can be useful for categorization.
 Unsupervised learning takes place in real time, so all the input data is analyzed and labeled in the presence of learners.
 It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.
Types of Supervised Machine Learning Techniques:
Supervised learning problems are further grouped into regression and classification problems.
Types of Unsupervised Machine Learning Techniques:


Unsupervised learning problems are further grouped into clustering and association problems.
Clustering:
Clustering is an important concept when it comes to unsupervised
learning. It mainly deals with finding a structure or pattern in a collection
of uncategorized data. Clustering algorithms will process your data and
find natural clusters(groups) if they exist in the data. You can also modify
how many clusters your algorithms should identify. It allows you to
adjust the granularity of these groups.
Decision Trees:
Decision trees are used to solve both classification and regression problems. They take the form of trees built by incrementally splitting the dataset into smaller subsets (numerical and categorical), with the results represented in the leaf nodes.

There are many specific decision-tree algorithms. Notable ones include:


 ID3 (Iterative Dichotomiser 3)
 C4.5 (successor of ID3)
 CART (Classification And Regression Tree)
 CHAID (CHI-squared Automatic Interaction Detector). Performs multi-level splits when computing classification trees.
 MARS: extends decision trees to handle numerical data better.
 Conditional Inference Trees.
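As a brief illustration of building such a tree in practice, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (a CART-style implementation); the iris dataset and the parameter values are chosen only for demonstration:

# A minimal CART-style classification tree with scikit-learn (illustrative only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion can be "gini" or "entropy"; max_depth limits growth to reduce over fitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                              # text view of the learned splits
print("Test accuracy:", tree.score(X_test, y_test))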
CHAID:
CHAID (Chi-square Automatic Interaction Detector) analysis is an
algorithm used for discovering relationships between a categorical
response variable and other categorical predictor variables. It is
useful when looking for patterns in datasets with lots of categorical
variables and is a convenient way of summarizing the data as the
relationships can be easily visualized.
In practice, CHAID is often used in direct marketing to understand how
different groups of customers might respond to a campaign based on their
characteristics. So suppose, for example, that we run a marketing
campaign and are interested in understanding what customer
characteristics (e.g., gender, socio-economic status, geographic
location, etc.) are associated with the response rate achieved. We build
a CHAID “tree” showing the effects of different customer characteristics
on the likelihood of response.

The algorithm performs stepwise splitting. It begins with a single cluster of cases and searches a candidate set of predictor variables for a way to split this cluster into two clusters.
Each predictor is tested for splitting as follows:
 Sort all the n cases on the predictor and examine all n-1 ways to
split the cluster in two.
 For each possible split, compute the within-cluster sum of squares
about the mean of the cluster on the dependent variable.
 Choose the best of the n-1 splits to represent the predictor’s
contribution. Now do this for every other predictor.
 For the actual split, choose the predictor and its cut point which
yields the smallest overall within-cluster sum of squares.
Categorical predictors require a different approach. Since categories are unordered, all possible splits between categories must be considered. For deciding on one split of k categories into two groups, this means that 2^(k−1) − 1 possible splits must be considered.
 Once a split is found, its suitability is measured on the same within-
cluster sum of squares as for a quantitative predictor.
 Morgan and Sonquist called their algorithm AID because it naturally
incorporates interaction among predictors. Interaction is not
correlation. It has to do instead with conditional discrepancies. In
the analysis of variance, interaction means that a trend within one
level of a variable is not parallel to a trend within another level of
the same variable.
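The split search described above can be sketched in a few lines of code. The arrays x (a quantitative predictor) and y (the dependent variable) below are hypothetical; the function scans all n−1 candidate cut points and keeps the one with the smallest total within-cluster sum of squares:

import numpy as np

def best_split(x, y):
    """Scan all n-1 splits of a quantitative predictor and return the cut point
    that minimises the within-cluster sum of squares on the dependent variable."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_cut, best_wss = None, np.inf
    for i in range(1, len(x)):                      # n-1 candidate splits
        left, right = y[:i], y[i:]
        wss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if wss < best_wss:
            best_cut, best_wss = (x[i - 1] + x[i]) / 2, wss   # midpoint as cut point
    return best_cut, best_wss

# hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
print(best_split(x, y))    # cut point 6.5 with a small within-cluster sum of squares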

CART (Classification and Regression Trees):


A Classification and Regression Tree (CART) is a predictive algorithm
used in machine learning. It explains how a target variable's values
can be predicted based on other values. It is a decision tree where
each fork is a split in a predictor variable and each node at the end
has a prediction for the target variable.
Classification Trees:
A classification tree is an algorithm where the target variable is fixed or
categorical. The algorithm is then used to identify the “class” within which
a target variable would most likely fall.
An example of a classification-type problem would be determining who
will or will not subscribe to a digital platform; or who will or will not
graduate from high school.
These are examples of simple binary classifications where the categorical
dependent variable can assume only one of two, mutually exclusive
values. In other cases, you might have to predict among a number of
different variables.
For instance, you may have to predict which type of smart phone a
consumer may decide to purchase. In such cases, there are multiple
values for the categorical dependent variable.
Here’s what a classic classification tree looks like.

Regression Trees:
A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value. As an example of a regression
type problem, you may want to predict the selling prices of a residential
house, which is a continuous dependent variable.
This will depend on both continuous factors like square footage as well as
categorical factors like the style of home, area in which the property is
located and so on.

Regression trees are leveraged in cases where the response variable is either continuous or numeric, but not categorical. Regression trees can be applied to prices, quantities, or other data involving quantities.
The main elements of CART (and any decision tree algorithm) are:
 Rules for splitting data at a node based on the value of one
variable;
 Stopping rules for deciding when a branch is terminal and can be
split no more; and
 Finally, a prediction for the target variable in each terminal node.
Advantages of CART:
 Simple to understand, interpret and visualize.
 Variable screening and feature selection.
 It can use both numerical and categorical data.
When to use Classification and Regression Trees:
Classification trees are used when the dataset needs to be split into classes which belong to the response variable. In many cases, the classes are simply Yes or No; in other words, there are just two and they are mutually exclusive. In some cases, there may be more than two classes, in which case a variant of the classification tree algorithm is used.
Regression trees, on the other hand, are used when the response variable
is continuous. For instance, if the response variable is something like
the price of a property or the temperature of the day, a regression tree
is used. In other words, regression trees are used for prediction-type
problems while classification trees are used for classification-type
problems.
How does a tree-based algorithm decide where to split?
The decision to make strategic splits heavily affects a tree’s accuracy.
The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node in two or
more sub-nodes. The creation of sub-nodes increases the homogeneity of
resultant sub-nodes. In other words, we can say that purity of the node
increases with respect to the target variable. Decision tree splits the
nodes on all available variables and then selects the split which results in
most homogeneous sub nodes.
The algorithm selection is also based on the type of target variable. Let’s look at the four most commonly used algorithms in decision trees:
1. Information gain:
The information gain is the amount of information gained about a
random variable or signal from observing another random variable.
Entropy is the average rate at which information is produced by a stochastic source of data; equivalently, it is a measure of the uncertainty associated with a random variable.
Information theory defines this degree of disorganization in a system as entropy. If the sample is completely homogeneous, the entropy is zero, and if the sample is equally divided (50% – 50%), it has an entropy of one.
Entropy = − Σ p_i · log2(p_i), where p_i is the proportion of class i in the node.
Information gain = Entropy(parent) − weighted average Entropy(children); the split with the highest information gain is selected.
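As a short sketch (assuming a binary "Success"/"Failure" target), entropy and the information gain of a candidate split could be computed as follows:

import numpy as np

def entropy(labels):
    """Entropy of a node: -sum(p_i * log2(p_i)) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# hypothetical labels, for illustration only
parent = np.array(["Success"] * 3 + ["Failure"] * 3)
left, right = parent[:3], parent[3:]                 # a perfectly separating split
print(entropy(parent))                               # 1.0 for a 50-50 node
print(information_gain(parent, [left, right]))       # 1.0, the maximum possible gain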
2. Reduction in Variance:
Till now, we have discussed the algorithms for categorical target variable.
Reduction in variance is an algorithm used for continuous target
variables (regression problems). This algorithm uses the standard
formula of variance to choose the best split. The split with the lower variance is selected as the criterion for splitting the population:
Variance = Σ (X − X̄)² / n
where X̄ is the mean of the values, X is an actual value, and n is the number of values.
Steps to calculate Variance:
1. Calculate variance for each node.
2. Calculate variance for each split as weighted average of each
node variance.
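A minimal sketch of these two steps, with hypothetical numeric target values:

import numpy as np

def node_variance(y):
    """Variance of the target values in one node: sum((X - mean)^2) / n."""
    return ((y - y.mean()) ** 2).mean()

def split_variance(left, right):
    """Weighted average of the child-node variances (weights = node sizes)."""
    n = len(left) + len(right)
    return len(left) / n * node_variance(left) + len(right) / n * node_variance(right)

# hypothetical target values, for illustration only
parent = np.array([10.0, 12.0, 11.0, 30.0, 31.0, 29.0])
left, right = parent[:3], parent[3:]
print(node_variance(parent))         # variance before the split
print(split_variance(left, right))   # much lower weighted variance after the split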
3. Gini Index:
The Gini index says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure.
1. It works with categorical target variable “Success” or “Failure”.
2. It performs only Binary splits
3. The higher the value of Gini, the higher the homogeneity.
4. CART (Classification and Regression Tree) uses Gini method to
create binary splits.
Steps to Calculate Gini for a split
1. Calculate Gini for sub-nodes, using formula sum of square of
probability for success and failure (p^2+q^2).
2. Calculate Gini for split using weighted Gini score of each node
of that split.
You might often come across the term ‘Gini Impurity’ which is
determined by subtracting the gini value from 1. So mathematically we
can say,
Gini Impurity = 1-Gini
To compute Gini impurity for a set of items, suppose i ∈ {1, 2, ..., m}, and let fi be the fraction of items labeled with value i in the set. Then
Gini Impurity = 1 − Σ fi²
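As an illustrative sketch (again assuming a binary "Success"/"Failure" target), the Gini score of each sub-node and the weighted Gini of a split could be computed as follows:

import numpy as np

def gini_score(labels):
    """Gini for a node as defined above: p^2 + q^2 (higher means more homogeneous)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return (p ** 2).sum()

def gini_of_split(children):
    """Weighted Gini score of a split (weights = node sizes)."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * gini_score(c) for c in children)

# hypothetical split, for illustration only
left = np.array(["Success"] * 8 + ["Failure"] * 2)
right = np.array(["Success"] * 3 + ["Failure"] * 7)
print(gini_of_split([left, right]))        # weighted p^2 + q^2 score of the split
print(1 - gini_of_split([left, right]))    # the corresponding Gini impurity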
4. Chi-Square:
It is an algorithm to find out the statistical significance between the
differences between sub-nodes and parent node. We measure it by sum of
squares of standardized differences between observed and expected
frequencies of target variable.
1. It works with categorical target variable “Success” or “Failure”.
2. It can perform two or more splits.
3. The higher the value of Chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
4. The Chi-square of each node is calculated using the formula:
Chi-square = ((Actual – Expected)^2 / Expected)^(1/2)
5. It generates tree called CHAID (Chi-square Automatic
Interaction Detector)
Steps to calculate Chi-square for a split:
1. Calculate the Chi-square for each individual node by computing the deviation for both Success and Failure.
2. Calculate the Chi-square of the split as the sum of the Chi-square values for Success and Failure across all nodes of the split.
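A minimal sketch of these two steps for a binary "Success"/"Failure" target; the node counts are hypothetical and the expected frequencies follow the parent node's Success rate:

import math

def node_chi_square(success, failure, parent_success_rate):
    """Chi-square of one node: sqrt((Actual - Expected)^2 / Expected),
    computed separately for Success and Failure and then summed."""
    n = success + failure
    expected_success = n * parent_success_rate
    expected_failure = n * (1 - parent_success_rate)
    return (math.sqrt((success - expected_success) ** 2 / expected_success) +
            math.sqrt((failure - expected_failure) ** 2 / expected_failure))

# hypothetical split: the parent node has a 50% Success rate
left = node_chi_square(success=8, failure=2, parent_success_rate=0.5)
right = node_chi_square(success=3, failure=7, parent_success_rate=0.5)
print(left + right)    # Chi-square of the split = sum over all nodes of the split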
Pruning in Decision Tree:
Pruning is a data compression technique in machine learning and search
algorithms that reduces the size of decision trees by removing sections of
the tree that are non-critical and redundant to classify instances.
Pruning reduces the size of decision trees by removing parts of the tree
that do not provide power to classify instances.
Pruning gives better results, but how do we implement it in a decision tree? The idea is simple:
1. We first grow the decision tree to a large depth.
2. Then we start at the bottom and remove leaves which give us negative returns when compared from the top.
3. Suppose a split gives us a gain of, say, -10 (a loss of 10) and the next split on that node gives us a gain of 20. A simple decision tree will stop at the first split, but with pruning we see that the overall gain is +10 and keep both leaves.
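scikit-learn does not expose this exact bottom-up gain comparison, but it offers a closely related technique, cost-complexity pruning, which can serve as a practical sketch of pruning a fully grown tree (the dataset and the ccp_alpha value are only illustrative):

# Cost-complexity pruning with scikit-learn (a related pruning technique, shown as a sketch)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)    # grown to full depth
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("Full tree leaves:", full_tree.get_n_leaves(), "test accuracy:", full_tree.score(X_test, y_test))
print("Pruned tree leaves:", pruned_tree.get_n_leaves(), "test accuracy:", pruned_tree.score(X_test, y_test))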
Under fitting in Decision Tree:
A statistical model or a machine learning algorithm is said to be under fitting when it cannot capture the underlying trend of the data.
Under fitting destroys the accuracy of our machine learning model. Its
occurrence simply means that our model or the algorithm does not fit the
data well enough. It usually happens when we have less data to build an
accurate model and also when we try to build a linear model with a
non- linear data.
Under fitting – High bias and low variance
Techniques to reduce under fitting:
1. Increase model complexity
2. Increase number of features
3. Remove noise from the data.
Over fitting in Decision Tree:
A statistical model is said to be over fitted when we train it with a lot of data. When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our data set.
Then the model does not categorize the data correctly, because of too
many details and noise.
Over fitting is one of the key challenges faced while using tree-based algorithms. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing over fitting is pivotal while modeling a decision tree.

Time series analysis:
Time series analysis is a statistical technique that deals with time series data, or trend analysis. Time series data means that the data is in a series of particular time periods or intervals.
Time series analysis can be useful to see how a given asset, security, or
economic variable changes over time. It can also be used to examine
how the changes associated with the chosen data point compare to
shifts in other variables over the same time period.
Such data may be collected at regular time intervals, such as monthly (e.g., CPI), weekly (e.g., money supply), quarterly (e.g., GDP) or annually (e.g., the government budget).
Time series are used in statistics, econometrics, mathematical finance,
weather forecasting, earthquake prediction and many other
applications.
The various reasons or the forces which affect the values of an
observation in a time series are the components of a time series.
The four categories of the components of time series are

 Trend
 Seasonal Variations
 Cyclic Variations
 Random or Irregular movements
Seasonal and Cyclic Variations are the periodic changes or short-term
fluctuations.

 Long term trend – The smooth long term direction of time series
where the data can increase or decrease in some pattern.
 Seasonal variation – Patterns of change in a time series within a
year which tends to repeat every year.
 Cyclical variation – It is much like seasonal variation, but the rise and fall of the time series occur over periods longer than one year.
 Irregular variation – Any variation that is not explainable by any of the three components mentioned above. It can be classified into stationary and non-stationary variation.
The data is considered in three types:
Time series data: A set of observations on the values that a variable
takes at different times.
Cross-sectional data: Data of one or more variables, collected at the
same point in time.
Pooled data: A combination of time series data and cross-sectional data.
Applications of Time Series:
(a) Forecasting inflation rate or unemployment rates of the net inflow
of foreign funds in the near future could be of interest to the
government.
(b) Firms may be interested in demand for their product (e.g. two-
wheelers, soft drinks bottles, or soaps etc.) or the market share of their
product.
(c) Housing finance companies may want to forecast both the mortgage
interest rate and the demand for housing loans.

There are different time series models (methods). These are:
 Auto regression (AR)
 Moving Average (MA)
 Autoregressive Moving Average (ARMA)
 Autoregressive Integrated Moving Average (ARIMA)

Time series models can be simulated, estimated from data, and used to
produce forecasts of future behavior.
White Noise:
A series is called white noise if it is purely random in nature. Let {εt} denote such a series; then it has zero mean [E(εt) = 0], a constant variance [V(εt) = σ²], and is uncorrelated across time [E(εt εt−k) = 0 for k ≠ 0].

The scatter plot of such a series across time will indicate no pattern, and hence forecasting the future values of such a series is not possible.
1. Auto Regression (AR):
In an autoregressive model, the value of the series at time t is a linear combination of its own past values plus a random error term:
Xt = c + φ1·Xt−1 + ... + φp·Xt−p + εt
2. Moving Average (MA):
In a moving average model, the value at time t depends on the current and past random error terms:
Xt = μ + εt + θ1·εt−1 + ... + θq·εt−q
3. Autoregressive Moving Average (ARMA):
An ARMA(p, q) model combines the AR(p) and MA(q) parts, so the value at time t depends on both its own past values and past error terms.


Stationarity of a Time Series:
A time series is stationary when its statistical properties, such as the mean and variance, remain constant over time.
Autocorrelation function plot (ACF):


Autocorrelation refers to how correlated a time series is with its past values, whereas the ACF is the plot used to see the correlation between the points, up to and including the lag unit. In the ACF plot, the number of lags is shown on the x-axis and the correlation coefficient on the y-axis.

4. Autoregressive Integrated Moving Average (ARIMA)


ARIMA stands for Auto Regressive Integrated Moving Average. There
are seasonal and Non-seasonal ARIMA models that can be used for
forecasting. Stationary time series is when the mean and variance are
constant over time. It is easier to predict when the series is
stationary.
Differencing is a method of transforming a non-stationary time series into
a stationary one. This is an important step in preparing data to be used in
an ARIMA model.
ARIMA models are generally denoted ARIMA(p, d, q) where parameters p,
d, and q are non-negative integers, p is the order of the Autoregressive
model, d is the degree of differencing, and q is the order of the Moving-
average model.
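A minimal sketch of fitting an ARIMA(p, d, q) model with the statsmodels library; the series and the (1, 1, 1) order are purely illustrative, and in practice the order would be chosen by inspecting the ACF/PACF plots or by an automated search:

# Fitting an ARIMA(p, d, q) model with statsmodels (illustrative sketch)
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# hypothetical monthly series with a trend plus noise
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(1.0, 2.0, 60)),
              index=pd.date_range("2015-01-01", periods=60, freq="MS"))

model = ARIMA(y, order=(1, 1, 1))   # p = 1 (AR), d = 1 (differencing), q = 1 (MA)
result = model.fit()

print(result.summary())
print(result.forecast(steps=6))     # forecasts for the next 6 periods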
Measures of Forecast Accuracy:
These measures of forecast accuracy represent how well
the forecasting method can predict the historical values of the time series.
Forecast Accuracy can be defined as the deviation of Forecast or Prediction
from the actual results.

Error = Actual demand – Forecast


Or ei = At – Ft

We measure forecast accuracy using the following methods:
1. Mean Forecast Error (MFE)
2. Mean Absolute Deviation (MAD)
3. Mean Absolute Percentage Error (MAPE)
1. Mean Forecast Error (MFE)
For n time periods where we have actual demand and forecast values:
MFE = Σ (At − Ft) / n
Ideal value = 0;
MFE > 0: the model tends to under-forecast;
MFE < 0: the model tends to over-forecast.
While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.
Uses of Forecast error:
 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models
 Identify forecast models that need adjustment
2. Mean Absolute Deviation (MAD)
Also called MAD for short, it is the average of the absolute values of the differences between the actual and forecast values, and it is used for the calculation of demand variability. For n time periods where we have actual demand and forecast values:
MAD = Σ |At − Ft| / n
3. Mean Absolute Percentage Error (MAPE):
The mean absolute percentage error (MAPE) is a measure of how accurate a forecast system is. It measures this accuracy as a percentage and is calculated as the average of the absolute differences between actual and forecast values, divided by the actual values:
MAPE = (100 / n) × Σ |At − Ft| / |At|
Where:
 n is the number of fitted points,
 At is the actual value,
 Ft is the forecast value.
 Σ is summation notation (the absolute value is summed for every
forecasted point in time).

The mean absolute percentage error (MAPE) is the most commonly used measure of forecast error, and it works best if there are no extremes in the data (and no zeros).
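A small sketch computing MFE, MAD and MAPE for hypothetical actual and forecast values, following the formulas above:

import numpy as np

def forecast_accuracy(actual, forecast):
    """Return MFE, MAD and MAPE for matching arrays of actual and forecast values."""
    errors = actual - forecast
    mfe = errors.mean()                                    # bias: >0 under-forecast, <0 over-forecast
    mad = np.abs(errors).mean()                            # average absolute error size
    mape = (np.abs(errors) / np.abs(actual)).mean() * 100  # percentage error (no zero actuals)
    return mfe, mad, mape

# hypothetical demand figures, for illustration only
actual = np.array([100.0, 110.0, 120.0, 130.0])
forecast = np.array([98.0, 112.0, 118.0, 135.0])
print(forecast_accuracy(actual, forecast))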
Seasonal And Trend Decomposition Using LOESS (STL):
STL uses the LOESS (LOcal regrESSion) method to model the trend and
seasonal components using polynomial regression. This is a statistical
method of decomposing a Time Series data into 3 components containing
seasonality, trend and residual.
A time series is a sequence of data points that varies across a continuous time axis.
 Loess is a regression technique that uses local weighted
regression to fit a smooth curve through points in a sequence,
which in our case is the Time Series data.
 Trend gives you a general direction of the overall data.
 Seasonality is a regular and predictable pattern that recurs at a fixed interval of time.
 Randomness (noise or residual) is the random fluctuation or unpredictable change.
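A minimal sketch of STL decomposition with statsmodels; the synthetic monthly series below is purely illustrative:

# STL (Seasonal-Trend decomposition using LOESS) with statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# hypothetical monthly series: trend + yearly seasonality + noise
rng = np.random.default_rng(1)
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
trend = np.linspace(10, 30, 72)
seasonal = 5 * np.sin(2 * np.pi * np.arange(72) / 12)
noise = rng.normal(0, 1, 72)
y = pd.Series(trend + seasonal + noise, index=idx)

result = STL(y, period=12).fit()    # period = 12 observations per seasonal cycle
print(result.trend.head())          # trend component
print(result.seasonal.head())       # seasonal component
print(result.resid.head())          # residual (noise) component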
Feature Extraction for Prediction:
 In ML, dimensionality refers to the number of features in your dataset.
 The difficulty of searching through the feature space grows substantially as you add more dimensions.
 When the number of features is very large relative to the number of
observations in your dataset, certain algorithms struggle to train
effective models.
 This is called the Curse of Dimensionality, and it is especially relevant for distance-based models such as k-NN and clustering.

Types of Dimensionality Reduction:


Feature Selection: Here we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem.
Feature Extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e., one with fewer dimensions.
Principal Component Analysis:
 PCA is a method of extracting important features (in the form of
components) from a large set of variables in the dataset
 It extracts a low-dimensional set of features from a high-dimensional dataset with the aim of retaining as much information as possible.
 With fewer variables, visualization also becomes much more
meaningful.
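A brief sketch with scikit-learn, using the iris data only as an example, of extracting two principal components from a higher-dimensional dataset:

# Principal Component Analysis with scikit-learn (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep the two most informative components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2): 4 original features reduced to 2 components
print(pca.explained_variance_ratio_)    # share of variance captured by each component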
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage
and especially in data warehousing that:

Extracts data from homogeneous or heterogeneous data sources;
Transforms the data for storing it in the proper format or structure for querying and analysis purposes; and
Loads it into the final target (a database, more specifically an operational data store, data mart, or data warehouse).
Usually all three phases execute in parallel. Since the data extraction takes time, a transformation process executes while the data is still being pulled, processing the already received data and preparing it for loading; as soon as there is some data ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases.
ETL systems commonly integrate data from multiple applications
(systems), typically developed and supported by different vendors or
hosted on separate computer hardware.
The disparate systems containing the original data are frequently
managed and operated by different employees. For example, a cost
accounting system may combine data from payroll, sales, and
purchasing.
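As a hedged, minimal sketch of the three phases using pandas and SQLite (the file name, column names and target database are all hypothetical):

# Minimal ETL sketch: extract from a CSV file, transform with pandas, load into SQLite
import sqlite3
import pandas as pd

# Extract: read raw data from a (hypothetical) source file
raw = pd.read_csv("sales_raw.csv")

# Transform: clean the data and reshape it into the structure needed for analysis
clean = (raw.dropna(subset=["amount"])
            .assign(amount=lambda df: df["amount"].astype(float))
            .groupby("region", as_index=False)["amount"].sum())

# Load: write the transformed data into the (hypothetical) target data store
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_by_region", conn, if_exists="replace", index=False)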
Long Answer Questions:

1. Explain about the Advantages of Decision Tree?


2. Discuss about Regression & Classification trees?
3. Explain about Gini index with example?
4. Explain about Supervised and Unsupervised Learning?
5. Explain CHAID algorithm for classification and also discuss its limitations?
6. What are the 4 components in Time series? Explain in detail.
7. What is the difference between ARMA and ARIMA?
8. What are the various steps involved in ETL Process?
9. Discuss about information gain and Chi Square approach?
10. What are the applications of time series? Discuss about various time
series methods?
11. What is STL Approach? Explain in detail?
12. Discuss about Pruning in decision trees?
