You are on page 1of 34

Unit-IV:

Additional Predictive Methods: Search-based Algorithms - Optimization-based


Algorithm- Decision Tree Induction Algorithms-Decision Trees for Regression-Model
Trees-Multivariate Adaptive Regression Splines-Optimization-based Algorithms:
Artificial Neural Networks-Deep Networks and Deep Learning Algorithms-Support Vector
Machines: SVM for Regression

SEARCHING ALGORITHMS
Searching algorithms are methods or procedures used to find a specific item or element
within a collection of data. These algorithms are widely used in computer science and are
crucial for tasks like searching for a particular record in a database, finding an element in a
sorted list, or locating a file on a computer.
These are some commonly used searching algorithms:
1. Linear Search: In this simple algorithm, each element in the collection is
sequentially checked until the desired item is found, or the entire list is traversed. It is
suitable for small-sized or unsorted lists, but its time complexity is O(n) in the worst
case.
2. Binary Search: This algorithm is applicable only to sorted lists. It repeatedly
compares the middle element of the list with the target element and narrows down
the search range by half based on the comparison result. Binary search has a time
complexity of O(log n), making it highly efficient for large sorted lists.
3. Hashing: Hashing algorithms use a hash function to convert the search key into an
index or address of an array (known as a hash table). This allows for constant-time
retrieval of the desired item if the hash function is well-distributed and collisions are
handled appropriately. Common hashing techniques include direct addressing,
separate chaining, and open addressing.
4. Interpolation Search: Similar to binary search, interpolation search works on sorted
lists. Instead of always dividing the search range in half, interpolation search uses the
value of the target element and the values of the endpoints to estimate its
approximate position within the list. This estimation helps in quickly narrowing
down the search space. The time complexity of interpolation search is typically O(log
log n) on average if the data is uniformly distributed.
5. Tree-based Searching: Various tree data structures, such as binary search trees
(BST), AVL trees, or B-trees, can be used for efficient searching. These structures
impose an ordering on the elements and provide fast search, insertion, and deletion
operations. The time complexity of tree-based searching algorithms depends on the
height of the tree and can range from O(log n) to O(n) in the worst case.
6. Ternary Search: Ternary search is an algorithm that operates on sorted lists and
repeatedly divides the search range into three parts instead of two, based on two
splitting points. It is a divide-and-conquer approach and has a time complexity of
O(log₃ n).
7. Jump Search: Jump search is an algorithm for sorted lists that works by jumping
ahead a fixed number of steps and then performing linear search in the reduced
subarray. It is useful for large sorted arrays and has a time complexity of O(sqrt(n)),
where n is the size of the array.
8. Exponential Search: Exponential search is a technique that combines elements of
binary search and linear search. It begins with a small range and doubles the search
range until the target element is within the range. It then performs a binary search
within that range. Exponential search is advantageous when the target element is
likely to be found near the beginning of the array and has a time complexity of O(log
n).
9. Fibonacci Search: Fibonacci search is a searching algorithm that uses Fibonacci
numbers to divide the search space. It works on sorted arrays and has a similar
approach to binary search, but instead of dividing the array into halves, it divides it
into two parts using Fibonacci numbers as indices. Fibonacci search has a time
complexity of O(log n).
10. Interpolation Search for Trees: This algorithm is an extension of interpolation
search designed for tree structures such as AVL trees or Red-Black trees. It combines
interpolation search principles with tree traversal to efficiently locate elements in the
tree based on their values. The time complexity depends on the tree structure and can
range from O(log n) to O(n) in the worst case.
11. Hash-based Searching (e.g., Bloom Filter): Hash-based searching algorithms
utilize hash functions and data structures like Bloom filters to determine whether an
element is present in a set or not. These algorithms provide probabilistic answers,
meaning they can occasionally have false positives (indicating an element is present
when it is not), but no false negatives (if an element is not present, it will never claim
it is). Bloom filters have a constant-time complexity for search operations.
12. String Searching Algorithms: Searching algorithms specific to string data include
techniques like Knuth-Morris-Pratt (KMP) algorithm, Boyer-Moore algorithm,
Rabin-Karp algorithm, and many others. These algorithms optimize the search for
patterns within text or strings and are widely used in text processing, pattern
matching, and string matching tasks.

OPTIMIZATION FOR DATA SCIENCE


From a mathematical foundation viewpoint, it can be said that the three pillars for data
science that we need to understand quite well are Linear Algebra, Statistics and the third
pillar is Optimization which is used pretty much in all data science algorithms. And to
understand the optimization concepts one needs a good fundamental understanding of linear

algebra. What’s Optimization? Wikipedia


defines optimization as a problem where you maximize or minimize a real function by
systematically choosing input values from an allowed set and computing the value of the
function. That means when we talk about optimization we are always interested in finding
the best solution. So, let say that one has some functional form(e.g in the form of f(x)) that
he is interested in and he is trying to find the best solution for this functional form. Now,
what does best mean? One could either say he is interested in minimizing this functional
form or maximizing this functional form. Why Optimization for Machine Learning?
 Almost all machine learning algorithms can be viewed as solutions to
optimization problems and it is interesting that even in cases, where the original
machine learning technique has a basis derived from other fields for example,
from biology and so on one could still interpret all of these machine learning
algorithms as some solution to an optimization problem.
 A basic understanding of optimization will help in:
 More deeply understand the working of machine learning
algorithms.
 Rationalize the working of the algorithm. That means if
you get a result and you want to interpret it, and if you had
a very deep understanding of optimization you will be able
to see why you got the result.
 And at an even higher level of understanding, you might be
able to develop new algorithms yourselves.
Components of an Optimization Problem Generally, an optimization problem has three
components.
minimize f(x), w.r.t x, subject to a ≤ x ≤ b
1. The objective function(f(x)): The first component is an objective
function f(x) which we are trying to either maximize or minimize. In general,
we talk about minimization problems this is simply because if you have a
maximization problem with f(x) we can convert it to a minimization problem
with -f(x). So, without loss of generality, we can look at minimization
problems.
2. Decision variables(x): The second component is the decision variables which
we can choose to minimize the function. So, we write this as min f(x).
3. Constraints(a ≤ x ≤ b): The third component is the constraint which basically
constrains this x to some set.
So, whenever you look at an optimization problem you should look for these three
components in an optimization problem. Types of Optimization Problems: Depending on
the types of constraints only:
1. Constrained optimization problems: In cases where the constraint is given
there and we have to have the solution satisfy these constraints we call them
constrained optimization problems.
2. Unconstrained optimization problems: In cases where the constraint is
missing we call them unconstrained optimization problems.
Depending on the types of objective functions, decision variables and constraints:
1. If the decision variable(x) is a continuous variable: A variable x is said to be
continuous if it takes an infinite number of values. In this case, x can take an
infinite number of values between -2 to 2.
min f(x), x ∈ (-2, 2)
 Linear programming problem: If the decision variable(x) is a
continuous variable and if the objective function(f) is linear and all
the constraints are also linear then this type of problem known as a
linear programming problem. So, in this case, the decision
variables are continuous, the objective function is linear and the
constraints are also linear.
 Nonlinear programming problem: If the decision variable(x)
remains continuous; however, if either the objective function(f) or
the constraints are non-linear then this type of problem known as a
non-linear programming problem. So, a programming problem
becomes non-linear if either the objective or the constraints
become non-linear.
2. If the decision variable(x) is an integer variable: All numbers whose
fractional part is 0 (zero) like -3, -2, 1, 0, 10, 100 are integers.
min f(x), x ∈ [0, 1, 2, 3]
 Linear integer programming problem: If the decision
variable(x) is an integer variable and if the objective function(f) is
linear and all the constraints are also linear then this type of
problem known as a linear integer programming problem. So, in
this case, the decision variables are integers, the objective function
is linear and the constraints are also linear.
 Nonlinear integer programming problem: If the decision
variable(x) remains integer; however, if either the objective
function(f) or the constraints are non-linear then this type of
problem known as a non-linear integer programming problem. So,
a programming problem becomes non-linear if either the objective
or the constraints become non-linear.
 Binary integer programming problem: If the decision
variable(x) can take only binary values like 0 and 1 only then this
type of problem known as a binary integer programming problem.
min f(x), x ∈ [0, 1]
3. If the decision variable(x) is a mixed variable: If we combine both
continuous variable and integer variable then this decision variable known as a
mixed variable.
min f(x1, x2), x1 ∈ [0, 1, 2, 3] and x2 ∈ (-2, 2)
 Mixed-integer linear programming problem: If the decision
variable(x) is a mixed variable and if the objective function(f) is
linear and all the constraints are also linear then this type of
problem known as a mixed-integer linear programming problem.
So, in this case, the decision variables are mixed, the objective
function is linear and the constraints are also linear.
 Mixed-integer non-linear programming problem: If the
decision variable(x) remains mixed; however, if either the
objective function(f) or the constraints are non-linear then this type
of problem known as a mixed-integer non-linear programming
problem. So, a programming problem becomes non-linear if either
the objective or the constraints become non-linear.

DECISION TREE INDUCTION

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents a
test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows −
 It does not require any domain knowledge.
 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine researcher named J. Ross Quinlan in 1980 developed a decision tree algorithm
known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, which was the successor of
ID3. ID3 and C4.5 adopt a greedy approach. In this algorithm, there is no backtracking; the
trees are constructed in a top-down recursive divide-and-conquer manner.
Generating a decision tree form training tuples of data partition D
Algorithm : Generate_decision_tree

Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute selection method, a procedure to determine the
splitting criterion that best partitions that the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or splitting subset.

Output:
A Decision Tree
Method
create a node N;

if tuples in D are all of the same class, C then


return N as leaf node labeled with class C;

if attribute_list is empty then


return N as leaf node with labeled
with majority class in D;|| majority voting

apply attribute_selection_method(D, attribute_list)


to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and


multiway splits allowed then // no restricted to binary trees

attribute_list = splitting attribute; // remove splitting attribute


for each outcome j of splitting criterion

// partition the tuples and grow subtrees for each partition


let Dj be the set of data tuples in D satisfying outcome j; // a partition

if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by Generate
decision tree(Dj, attribute list) to node N;
end for
return N;
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
 Pre-pruning − The tree is pruned by halting its construction early.
 Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
 Number of leaves in the tree, and
 Error rate of the tree.

A DECISION TREE
A Decision Tree is an algorithm used for supervised learning problems such as classification
or regression. A decision tree or a classification tree is a tree in which each internal (nonleaf)
node is labeled with an input feature. The arcs coming from a node labeled with a feature are
labeled with each of the possible values of the feature. Each leaf of the tree is labeled with a
class or a probability distribution over the classes.
A tree can be "learned" by splitting the source set into subsets based on an attribute value
test. This process is repeated on each derived subset in a recursive manner called recursive
partitioning. The recursion is completed when the subset at a node has all the same value of
the target variable, or when splitting no longer adds value to the predictions. This process of
top-down induction of decision trees is an example of a greedy algorithm, and it is the most
common strategy for learning decision trees.
Decision trees used in data mining are of two main types −
 Classification tree − when the response is a nominal variable, for example if an
email is spam or not.
 Regression tree − when the predicted outcome can be considered a real number (e.g.
the salary of a worker).
Decision trees are a simple method, and as such has some problems. One of this issues is the
high variance in the resulting models that decision trees produce. In order to alleviate this
problem, ensemble methods of decision trees were developed. There are two groups of
ensemble methods currently used extensively −
 Bagging decision trees − These trees are used to build multiple decision trees by
repeatedly resampling training data with replacement, and voting the trees for a
consensus prediction. This algorithm has been called random forest.
 Boosting decision trees − Gradient boosting combines weak learners; in this case,
decision trees into a single strong learner, in an iterative fashion. It fits a weak tree to
the data and iteratively keeps fitting weak learners in order to correct the error of the
previous model.
# Install the party package
# install.packages('party')
library(party)
library(ggplot2)

head(diamonds)
# We will predict the cut of diamonds using the features available in the
diamonds dataset.
ct = ctree(cut ~ ., data = diamonds)

# plot(ct, main="Conditional Inference Tree")


# Example output
# Response: cut
# Inputs: carat, color, clarity, depth, table, price, x, y, z

# Number of observations: 53940


#
# 1) table <= 57; criterion = 1, statistic = 10131.878
# 2) depth <= 63; criterion = 1, statistic = 8377.279
# 3) table <= 56.4; criterion = 1, statistic = 226.423
# 4) z <= 2.64; criterion = 1, statistic = 70.393
# 5) clarity <= VS1; criterion = 0.989, statistic = 10.48
# 6) color <= E; criterion = 0.997, statistic = 12.829
# 7)* weights = 82
# 6) color > E

#Table of prediction errors


table(predict(ct), diamonds$cut)
# Fair Good Very Good Premium Ideal
# Fair 1388 171 17 0 14
# Good 102 2912 499 26 27
# Very Good 54 998 3334 249 355
# Premium 44 711 5054 11915 1167
# Ideal 22 114 3178 1601 19988
# Estimated class probabilities
probs = predict(ct, newdata = diamonds, type = "prob")
probs = do.call(rbind, probs)
head(probs)

MULTIVARIATE ADAPTIVE REGRESSION SPLINES


Several previous tutorials (i.e. linear regression, logistic regression, regularized regression)
discussed algorithms that are intrinsically linear. Many of these models can be adapted to
nonlinear patterns in the data by manually adding model terms (i.e. squared terms,
interaction effects); however, to do so you must know the specific nature of the nonlinearity
a priori. Alternatively, there are numerous algorithms that are inherently nonlinear. When
using these models, the exact form of the nonlinearity does not need to be known explicitly
or specified prior to model training. Rather, these algorithms will search for, and discover,
nonlinearities in the data that help maximize predictive accuracy.
This tutorial discusses multivariate adaptive regression splines (MARS), an algorithm that
essentially creates a piecewise linear model which provides an intuitive stepping block into
nonlinearity after grasping the concept of linear regression and other intrinsically linear
models.
Prerequisites
For this tutorial we will use the following packages:
library(rsample) # data splitting
library(ggplot2) # plotting
library(earth) # fit MARS models
library(caret) # automating the tuning process
library(vip) # variable importance
library(pdp) # variable relationships
To illustrate various MARS modeling concepts we will use Ames Housing data, which is
available via the AmesHousing package.
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility

set.seed(123)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
The basic idea
Some previous tutorials (i.e. linear regression, logistic regression, regularized regression)
have focused on linear models. In those tutorials we illustrated some of the advantages of
linear models such as their ease and speed of computation and also the intuitive nature of
interpreting their coefficients. However, linear models make a strong assumption about
linearity, and this assumption is often a poor one, which can affect predictive accuracy.
We can extend linear models to capture non-linear relationships. Typically, this is done by
explicitly including polynomial parameters or step functions. Polynomial regression is a
form of regression in which the relationship between the independent variable x and the
dependent variable y is modeled as an nth�ℎ degree polynomial of x. For example, Equation
1 represents a polynomial regression function where y is modeled as a function
of x with d degrees. Generally speaking, it is unusual to use d greater than 3 or 4 as the
larger d becomes, the easier the function fit becomes overly flexible and oddly
shapened…especially near the boundaries of the range of x values.
yi=β0+β1xi+β2x2i+β3x3i⋯+βdxdi+ϵi,(1)(1)��=�0+�1��+�2��2+�3��3⋯+�
����+��,
An alternative to polynomial regression is step function regression. Whereas polynomial
functions impose a global non-linear relationship, step functions break the range of x into
bins, and fit a different constant for each bin. This amounts to converting a continuous
variable into an ordered categorical variable such that our linear regression function is
converted to Equation 2
yi=β0+β1C1(xi)+β2C2(xi)+β3C3(xi)⋯+βdCd(xi)+ϵi,(2)(2)��=�0+�1�1(��)+�2�
2(��)+�3�3(��)⋯+����(��)+��,
where C1(x)�1(�) represents x values ranging
from c1≤x<c2�1≤�<�2, C2(x)�2(�) represents x values ranging
from c2≤x<c3�2≤�<�3, ……, Cd(x)��(�) represents x values ranging
from cd−1≤x<cd��−1≤�<��. Figure 1 illustrate polynomial and step function fits
for Sale_Price as a function of Year_Built in our ames data.

Figure 1: Blue line represents predicted Sale_Price values as a function of Year_Built for
alternative approaches to modeling explicit nonlinear regression patterns. (A) Traditional
nonlinear regression approach does not capture any nonlinearity unless the predictor or
response is transformed (i.e. log transformation). (B) Degree-2 polynomial, (C) Degree-3
polynomial, (D) Step function fitting cutting Year_Built into three categorical levels.
Although useful, the typical implementation of polynomial regression and step functions
require the user to explicitly identify and incorporate which variables should have what
specific degree of interaction or at what points of a variable x should cut points be made for
the step functions. Considering many data sets today can easily contain 50, 100, or more
features, this would require an enormous and unncessary time commitment from an analyst
to determine these explicit non-linear settings.
Multivariate regression splines
Multivariate adaptive regression splines (MARS) provide a convenient approach to capture
the nonlinearity aspect of polynomial regression by assessing cutpoints (knots) similar to
step functions. The procedure assesses each data point for each predictor as a knot and
creates a linear regression model with the candidate feature(s). For example, consider our
simple model of Sale_Price ~ Year_Built. The MARS procedure will first look for the single
point across the range of Year_Built values where two different linear relationships
between Sale_Price and Year_Built achieve the smallest error. What results is known as a
hinge function (h(x−a)ℎ(�−�) where a is the cutpoint value). For a single knot (Figure 2
(A)), our hinge function is h(Year_Built−1968)ℎ(Year_Built−1968) such that our two linear
models for Sale_Price are
Sale_Price={136091.022Year_Built≤1968,136091.022+3094.208(Year_Built−1968)Year_B
uilt>1968Sale_Price={136091.022Year_Built≤1968,136091.022+3094.208(Year_Built−196
8)Year_Built>1968
Once the first knot has been found, the search continues for a second knot which is found at
2006 (Figure 2 (B)). This results in three linear models for Sale_Price:
Sale_Price=⎧⎨⎩136091.022Year_Built≤1968,136091.022+2898.424(Year_Built−1968)196
8<Year_Built≤2006,136091.022+20176.284(Year_Built−2006)Year_Built>2006Sale_Price
={136091.022Year_Built≤1968,136091.022+2898.424(Year_Built−1968)1968<Year_Built
≤2006,136091.022+20176.284(Year_Built−2006)Year_Built>2006
Figure 2: Examples of fitted regression splines of one (A), two (B), three (C), and four (D)
knots.
This procedure can continue until many knots are found, producing a highly non-linear
pattern. Although including many knots may allow us to fit a really good relationship with
our training data, it may not generalize very well to new, unseen data. For example, Figure 3
includes nine knots but this likley will not generalize very well to our test data.
Figure 3: Too many knots may not generalize well to unseen data.
Consequently, once the full set of knots have been created, we can sequentially remove
knots that do not contribute significantly to predictive accuracy. This process is known as
“pruning” and we can use cross-validation, as we have with the previous models, to find the
optimal number of knots.
Fitting a basic MARS model
We can fit a MARS model with the earth package. By default, earth::earth() will assess all
potential knots across all supplied features and then will prune to the optimal number of
knots based on an expected change in R2�2 (for the training data) of less than 0.001. This
calculation is performed by the Generalized cross-validation procedure (GCV statistic),
which is a computational shortcut for linear models that produces an error value
that approximates leave-one-out cross-validation (see here for technical details).
Note: The term “MARS” is trademarked and licensed exclusively to Salford Systems
http://www.salfordsystems.com. We can use MARS as an abbreviation; however, it cannot
be used for competing software solutions. This is why the R package uses the name earth.
The following applies a basic MARS model to our ames data and performs a search for
required knots across all features. The results show us the final models GCV statistic,
generalized R2�2 (GRSq), and more.
# Fit a basic MARS model
mars1 <- earth(
Sale_Price ~ .,
data = ames_train
)

# Print model summary


print(mars1)
## Selected 37 of 45 terms, and 26 of 307 predictors
## Termination condition: RSq changed by less than 0.001 at 45 terms
## Importance: Gr_Liv_Area, Year_Built, Total_Bsmt_SF, ...
## Number of terms at each degree of interaction: 1 36 (additive model)
## GCV 521186626 RSS 995776275391 GRSq 0.9165386 RSq 0.92229
It also shows us that 38 of 41 terms were used from 27 of the 307 original predictors. But
what does this mean? If we were to look at all the coefficients, we would see that there are
38 terms in our model (including the intercept). These terms include hinge functions
produced from the original 307 predictors (307 predictors because the model automatically
dummy encodes our categorical variables). Looking at the first 10 terms in our model, we
see that Gr_Liv_Area is included with a knot at 2945 (the coefficient
for h(2945−Gr_Liv_Area)ℎ(2945−��_���_����) is -49.85), Year_Built is
included with a knot at 2003, etc.
Tip: You can check out all the coefficients with summary(mars1).
summary(mars1) %>% .$coefficients %>% head(10)
## Sale_Price
## (Intercept) 301399.98345
## h(2945-Gr_Liv_Area) -49.84518
## h(Year_Built-2003) 2698.39864
## h(2003-Year_Built) -357.11319
## h(Total_Bsmt_SF-2171) -265.30709
## h(2171-Total_Bsmt_SF) -29.77024
## Overall_QualExcellent 88345.90068
## Overall_QualVery_Excellent 116330.48509
## Overall_QualVery_Good 36646.55568
## h(Bsmt_Unf_SF-278) -21.15661
The plot method for MARS model objects provide convenient performance and residual
plots. Figure 4 illustrates the model selection plot that graphs the GCV R2�2 (left-hand y-
axis and solid black line) based on the number of terms retained in the model (x-axis) which
are constructed from a certain number of original predictors (right-hand y-axis). The vertical
dashed lined at 37 tells us the optimal number of non-intercept terms retained where
marginal increases in GCV R2�2 are less than 0.001.
plot(mars1, which = 1)

Figure 4: Model summary capturing GCV $R^2$ (left-hand y-axis and solid black line)
based on the number of terms retained (x-axis) which is based on the number of predictors
used to make those terms (right-hand side y-axis). For this model, 37 non-intercept terms
were retained which are based on 26 predictors. Any additional terms retained in the model,
over and above these 37, results in less than 0.001 improvement in the GCV $R^2$.
In addition to pruning the number of knots, earth::earth() allows us to also assess potential
interactions between different hinge functions. The following illustrates by including
a degree = 2 argument. You can see that now our model includes interaction terms between
multiple hinge functions (i.e. h(Year_Built-2003)*h(Gr_Liv_Area-2274)) is an interaction
effect for those houses built prior to 2003 and have less than 2,274 square feet of living
space above ground).
# Fit a basic MARS model
mars2 <- earth(
Sale_Price ~ .,
data = ames_train,
degree = 2
)

# check out the first 10 coefficient terms


summary(mars2) %>% .$coefficients %>% head(10)
## Sale_Price
## (Intercept) 242611.63686
## h(Gr_Liv_Area-2945) 144.39175
## h(2945-Gr_Liv_Area) -57.71864
## h(Year_Built-2003) 10909.70322
## h(2003-Year_Built) -780.24246
## h(Year_Built-2003)*h(Gr_Liv_Area-2274) 18.54860
## h(Year_Built-2003)*h(2274-Gr_Liv_Area) -10.30826
## h(Total_Bsmt_SF-1035) 62.11901
## h(1035-Total_Bsmt_SF) -33.03537
## h(Total_Bsmt_SF-1035)*Kitchen_QualTypical -32.75942
Tuning
Since there are two tuning parameters associated with our MARS model: the degree of
interactions and the number of retained terms, we need to perform a grid search to identify
the optimal combination of these hyperparameters that minimize prediction error (the above
pruning process was based only on an approximation of cross-validated performance on the
training data rather than an actual k-fold cross validation process). As in previous tutorials,
we will perform a cross-validated grid search to identify the optimal mix. Here, we set up a
search grid that assesses 30 different combinations of interaction effects (degree) and the
number of terms to retain (nprune).
# create a tuning grid
hyper_grid <- expand.grid(
degree = 1:3,
nprune = seq(2, 100, length.out = 10) %>% floor()
)

head(hyper_grid)
## degree nprune
## 1 1 2
## 2 2 2
## 3 3 2
## 4 1 12
## 5 2 12
## 6 3 12
We can use caret to perform a grid search using 10-fold cross-validation. The model that
provides the optimal combination includes second degree interactions and retains 34 terms.
The cross-validated RMSE for these models are illustrated in Figure 5 and the optimal
model’s cross-validated RMSE is $24,021.68.

ARTIFICIAL NEURAL NETWORKS

Artificial Neural Network Tutorial provides basic and advanced concepts of ANNs. Our
Artificial Neural Network tutorial is developed for beginners as well as professions.
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An Artificial neural network is usually a computational
network based on biological neural networks that construct the structure of the human brain.
Similar to a human brain has neurons interconnected to each other, artificial neural networks
also have neurons that are linked to each other in various layers of the networks. These
neurons are known as nodes.
Artificial neural network tutorial covers all the aspects related to the artificial neural
network. In this tutorial, we will discuss ANNs, Adaptive resonance theory, Kohonen self-
organizing map, Building blocks, unsupervised learning, Genetic algorithm, etc.
What is Artificial Neural Network?
The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are known as
nodes.
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.

Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks,
cell nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between Biological neural network and artificial neural network:

Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output
An Artificial Neural Network in the field of Artificial intelligence where it attempts to
mimic the network of neurons makes up a human brain so that computers will have an
option to understand things and make decisions in a human-like manner. The artificial neural
network is designed by programming computers to behave simply like interconnected brain
cells.
There are around 1000 billion neurons in the human brain. Each neuron has an association
point somewhere in the range of 1,000 and 100,000. In the human brain, data is stored in
such a manner as to be distributed, and we can extract more than one piece of this data when
necessary from our memory parallelly. We can say that the human brain is made up of
incredibly amazing parallel processors.
ADVERTISEMENT
We can understand the artificial neural network with an example, consider an example of a
digital logic gate that takes an input and gives an output. "OR" gate, which takes two inputs.
If one or both the inputs are "On," then we get "On" in output. If both the inputs are "Off,"
then we get "Off" in output. Here the output depends upon input. Our brain does not perform
the same task. The outputs to inputs relationship keep changing because of the neurons in
our brain, which are "learning."
The architecture of an artificial neural network:
To understand the concept of the architecture of an artificial neural network, we have to
understand what a neural network consists of. In order to define a neural network that
consists of a large number of artificial neurons, which are termed units arranged in a
sequence of layers. Lets us look at various types of layers available in an artificial neural
network.
Artificial Neural Network primarily consists of three layers:

Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the calculations
to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the inputs and
includes a bias. This computation is represented in the form of a transfer function.
ADVERTISEMENT
ADVERTISEMENT

It determines weighted total is passed as an input to an activation function to produce the


output. Activation functions choose whether a node should fire or not. Only those who are
fired make it to the output layer. There are distinctive activation functions available that can
be applied upon the sort of task we are performing.

DEEP NETWORKS AND DEEP LEARNING ALGORITHMS

Types of Algorithms used in Deep Learning


Here is the list of top 10 most popular deep learning algorithms:
1. Convolutional Neural Networks (CNNs)
2. Long Short Term Memory Networks (LSTMs)
3. Recurrent Neural Networks (RNNs)
4. Generative Adversarial Networks (GANs)
5. Radial Basis Function Networks (RBFNs)
6. Multilayer Perceptrons (MLPs)
7. Self Organizing Maps (SOMs)
8. Deep Belief Networks (DBNs)
9. Restricted Boltzmann Machines( RBMs)
10. Autoencoders
Deep learning algorithms work with almost any kind of data and require large amounts of
computing power and information to solve complicated issues. Now, let us, deep-dive, into
the top 10 deep learning algorithms.
1. Convolutional Neural Networks (CNNs)
CNN's, also known as ConvNets, consist of multiple layers and are mainly used for image
processing and object detection. Yann LeCun developed the first CNN in 1988 when it was
called LeNet. It was used for recognizing characters like ZIP codes and digits.

CNN's are widely used to identify satellite images, process medical images, forecast time
series, and detect anomalies.
How Do CNNs Work?
CNN's have multiple layers that process and extract features from data:
Convolution Layer
 CNN has a convolution layer that has several filters to perform the convolution
operation.
Rectified Linear Unit (ReLU)
 CNN's have a ReLU layer to perform operations on elements. The output is a rectified
feature map.
Pooling Layer
 The rectified feature map next feeds into a pooling layer. Pooling is a down-sampling
operation that reduces the dimensions of the feature map.
 The pooling layer then converts the resulting two-dimensional arrays from the pooled
feature map into a single, long, continuous, linear vector by flattening it.
Fully Connected Layer
 A fully connected layer forms when the flattened matrix from the pooling layer is fed
as an input, which classifies and identifies the images.
Below is an example of an image processed via CNN.
2. Long Short Term Memory Networks (LSTMs)
LSTMs are a type of Recurrent Neural Network (RNN) that can learn and memorize long-
term dependencies. Recalling past information for long periods is the default behavior.
LSTMs retain information over time. They are useful in time-series prediction because they
remember previous inputs. LSTMs have a chain-like structure where four interacting layers
communicate in a unique way. Besides time-series predictions, LSTMs are typically used for
speech recognition, music composition, and pharmaceutical development.
How Do LSTMs Work?
 First, they forget irrelevant parts of the previous state
 Next, they selectively update the cell-state values
 Finally, the output of certain parts of the cell state
Below is a diagram of how LSTMs operate:

3. Recurrent Neural Networks (RNNs)


RNNs have connections that form directed cycles, which allow the outputs from the LSTM
to be fed as inputs to the current phase.
The output from the LSTM becomes an input to the current phase and can memorize
previous inputs due to its internal memory. RNNs are commonly used for image
captioning, time-series analysis, natural-language processing, handwriting recognition, and
machine translation.
An unfolded RNN looks like this:
How Do RNNs work?
 The output at time t-1 feeds into the input at time t.
 Similarly, the output at time t feeds into the input at time t+1.
 RNNs can process inputs of any length.
 The computation accounts for historical information, and the model size does not
increase with the input size.
Here is an example of how Google’s autocompleting feature works:

4. Generative Adversarial Networks (GANs)


GANs are generative deep learning algorithms that create new data instances that resemble
the training data. GAN has two components: a generator, which learns to generate fake data,
and a discriminator, which learns from that false information.
The usage of GANs has increased over a period of time. They can be used to improve
astronomical images and simulate gravitational lensing for dark-matter research. Video game
developers use GANs to upscale low-resolution, 2D textures in old video games by
recreating them in 4K or higher resolutions via image training.
GANs help generate realistic images and cartoon characters, create photographs of human
faces, and render 3D objects.
How Do GANs work?
 The discriminator learns to distinguish between the generator’s fake data and the real
sample data.
 During the initial training, the generator produces fake data, and the discriminator
quickly learns to tell that it's false.
 The GAN sends the results to the generator and the discriminator to update the model.
Below is a diagram of how GANs operate:

5. Radial Basis Function Networks (RBFNs)


RBFNs are special types of feedforward neural networks that use radial basis functions as
activation functions. They have an input layer, a hidden layer, and an output layer and are
mostly used for classification, regression, and time-series prediction.
How Do RBFNs Work?
 RBFNs perform classification by measuring the input's similarity to examples from the
training set.
 RBFNs have an input vector that feeds to the input layer. They have a layer of RBF
neurons.
 The function finds the weighted sum of the inputs, and the output layer has one node
per category or class of data.
 The neurons in the hidden layer contain the Gaussian transfer functions, which have
outputs that are inversely proportional to the distance from the neuron's center.
 The network's output is a linear combination of the input’s radial-basis functions and
the neuron’s parameters.
See this example of an RBFN:
6. Multilayer Perceptrons (MLPs)
MLPs are an excellent place to start learning about deep learning technology.
MLPs belong to the class of feedforward neural networks with multiple layers of perceptrons
that have activation functions. MLPs consist of an input layer and an output layer that are
fully connected. They have the same number of input and output layers but may have
multiple hidden layers and can be used to build speech-recognition, image-recognition, and
machine-translation software.
How Do MLPs Work?
 MLPs feed the data to the input layer of the network. The layers of neurons connect in a
graph so that the signal passes in one direction.
 MLPs compute the input with the weights that exist between the input layer and the
hidden layers.
 MLPs use activation functions to determine which nodes to fire. Activation functions
include ReLUs, sigmoid functions, and tanh.
 MLPs train the model to understand the correlation and learn the dependencies between
the independent and the target variables from a training data set.
Below is an example of an MLP. The diagram computes weights and bias and applies
suitable activation functions to classify images of cats and dogs.
7. Self Organizing Maps (SOMs)
Professor Teuvo Kohonen invented SOMs, which enable data visualization to reduce the
dimensions of data through self-organizing artificial neural networks.
Data visualization attempts to solve the problem that humans cannot easily visualize high-
dimensional data. SOMs are created to help users understand this high-dimensional
information.
How Do SOMs Work?
 SOMs initialize weights for each node and choose a vector at random from the training
data.
 SOMs examine every node to find which weights are the most likely input vector. The
winning node is called the Best Matching Unit (BMU).
 SOMs discover the BMU’s neighborhood, and the amount of neighbors lessens over
time.
 SOMs award a winning weight to the sample vector. The closer a node is to a BMU,
the more its weight changes..
 The further the neighbor is from the BMU, the less it learns. SOMs repeat step two for
N iterations.
Below, see a diagram of an input vector of different colors. This data feeds to a SOM, which
then converts the data into 2D RGB values. Finally, it separates and categorizes the different
colors.
8. Deep Belief Networks (DBNs)
DBNs are generative models that consist of multiple layers of stochastic, latent variables.
The latent variables have binary values and are often called hidden units.
DBNs are a stack of Boltzmann Machines with connections between the layers, and each
RBM layer communicates with both the previous and subsequent layers. Deep Belief
Networks (DBNs) are used for image-recognition, video-recognition, and motion-capture
data.
How Do DBNs Work?
 Greedy learning algorithms train DBNs. The greedy learning algorithm uses a layer-by-
layer approach for learning the top-down, generative weights.
 DBNs run the steps of Gibbs sampling on the top two hidden layers. This stage draws a
sample from the RBM defined by the top two hidden layers.
 DBNs draw a sample from the visible units using a single pass of ancestral sampling
through the rest of the model.
 DBNs learn that the values of the latent variables in every layer can be inferred by a
single, bottom-up pass.
Below is an example of DBN architecture:
9. Restricted Boltzmann Machines (RBMs)
Developed by Geoffrey Hinton, RBMs are stochastic neural networks that can learn from a
probability distribution over a set of inputs.
This deep learning algorithm is used for dimensionality reduction, classification, regression,
collaborative filtering, feature learning, and topic modeling. RBMs constitute the building
blocks of DBNs.
RBMs consist of two layers:
 Visible units
 Hidden units
Each visible unit is connected to all hidden units. RBMs have a bias unit that is connected to
all the visible units and the hidden units, and they have no output nodes.
How Do RBMs Work?
RBMs have two phases: forward pass and backward pass.
 RBMs accept the inputs and translate them into a set of numbers that encodes the inputs
in the forward pass.
 RBMs combine every input with individual weight and one overall bias. The algorithm
passes the output to the hidden layer.
 In the backward pass, RBMs take that set of numbers and translate them to form the
reconstructed inputs.
 RBMs combine each activation with individual weight and overall bias and pass the
output to the visible layer for reconstruction.
 At the visible layer, the RBM compares the reconstruction with the original input to
analyze the quality of the result.
Below is a diagram of how RBMs function:

10. Autoencoders
Autoencoders are a specific type of feedforward neural network in which the input and
output are identical. Geoffrey Hinton designed autoencoders in the 1980s to solve
unsupervised learning problems. They are trained neural networks that replicate the data
from the input layer to the output layer. Autoencoders are used for purposes such as
pharmaceutical discovery, popularity prediction, and image processing.
How Do Autoencoders Work?
An autoencoder consists of three main components: the encoder, the code, and the decoder.
 Autoencoders are structured to receive an input and transform it into a different
representation. They then attempt to reconstruct the original input as accurately as
possible.
 When an image of a digit is not clearly visible, it feeds to an autoencoder neural
network.
 Autoencoders first encode the image, then reduce the size of the input into a smaller
representation.
 Finally, the autoencoder decodes the image to generate the reconstructed image.
The following image demonstrates how autoencoders operate:
SVM FOR REGRESSION
There are three kernels which SVM uses the most
 Linear kernel: Dot product between two given observations
 Polynomial kernel: This allows curved lines in the input space
 Radial Basis Function (RBF): It creates complex regions within the feature space

In general, regression problems involve the task of deriving a mapping function which
would approximate from input variables to a continuous output variable.
Support Vector Regression uses the same principle of Support Vector Machines. In other
words, the approach of using SVMs to solve regression problems is called Support Vector
Regression or SVR.
Read more on Difference between Data Science, Machine Learning & AI
Now let us look at the classic example of the Boston House Price dataset. It has got 14
columns, where we have one target variable and 13 independent variables with 506 rows/
records. The approach will be to use SVR and use all the three Kernels and then compare the
error metric and choose the model with the least error and then come up with the observation
on the dataset whether its linear or not. The dataset is available at Boston Housing Dataset.
Though the dataset looks simple and we could do a traditional regression we want to see if
the dataset is linear or not and how SVR performs with various kernels, here we will be
focusing on MAE & RMSE only, lower the value better is the model. You can look at the
code which is available in my GitHub repository that is mentioned in the reference section.
Based on the above results we could say that the dataset is non- linear and Support Vector
Regression (SVR)performs better than traditional Regression however there is a caveat, it
will perform well with non-linear kernels in SVR. In our case, SVR performs the least in
Linear kernel, performs well to moderate when Polynomial kernel is used and performs
better when we use the RBF (or Gaussian) Kernel. Maybe I wouldn’t have left with burnt
potato chips had I used SVR while selecting the potatoes as they were non-linear though
they looked very similar. I hope this article was useful in understanding the basics of SVM
and SVR, and when we need to use SVR.

You might also like