
Introduction to Machine Learning
Module - 1

Prepared By: Anit James, Asst. Professor, Amal Jyothi College of Engineering, Kanjirappally.
Module - 1
 Introduction to Machine Learning: How do machines learn; selecting the right features.
 Understanding Data: numeric variables (mean, median, mode); measuring spread.
 Review of Distributions: uniform and normal. Categorical variables.
 Dimensionality Reduction: Principal Component Analysis
Sophia (Robot)
• A social humanoid robot
• Developed by Hanson Robotics.
• Activated on February 14, 2016
• First public appearance in Texas, US.
• Able to display more than 50 facial expressions.
• Participated in many high-profile interviews.

Robot – ‘Sophia’ Interview

Artificial Intelligence - Examples
 Robotics
 Smartphones
 Smart Cars and Drones
 Social Media
 Music and Media Streaming Services (YouTube)
 Video Games
 Online Ads Networks
 Navigation and Travel (Google Maps)
 Banking and Finance
 Smart Home Devices
Artificial Intelligence

• Artificial intelligence is the simulation of human intelligence processes by machines, especially computer systems.
• Specific applications of AI include expert systems, natural language processing, speech recognition, and machine vision.
Machine Learning
 A type of Artificial Intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
 Machine learning focuses on the development of computer programs that can change when exposed to new data.
Machine Learning (contd.)
• A branch of artificial intelligence, concerned with the design and development of algorithms that allow computers to evolve behaviors based on data.
• As intelligence requires knowledge, it is necessary for computers to acquire knowledge.
• Machine learning is about predicting the future based on the past.
Machine Learning

• The field of study interested in the development of computer algorithms for transforming data into intelligent action is known as machine learning.
Machine Learning (contd.)
• Algorithms or techniques that enable a computer (machine) to "learn" from data.
• Related to many areas, such as:
• Data Mining
• Statistics, etc.
Machine Learning

A machine learning algorithm takes data and identifies patterns that can be used for action.

Uses of Machine Learning
• Predict the outcomes of elections
• Identify and filter spam messages from e-mail
• Foresee criminal activity
• Automate traffic signals according to road conditions
• Produce financial estimates of storms and natural disasters
• Examine customer churn
• Create auto-piloting planes and auto-driving cars
• Identify individuals with the capacity to donate
• Target advertising to specific types of consumers

How do machines learn?

• Human brains are naturally capable of learning from birth, but the conditions necessary for computers to learn must be made explicit.
How do machines learn?
• Regardless of whether the learner is a human or a machine, the basic learning process is similar.
 It can be divided into four interrelated components:
 Data storage
 Abstraction
 Generalization
 Evaluation
How Machines Learn?
1. Data storage:
 Utilizes observation, memory, and recall to provide a factual basis for further reasoning.
2. Abstraction:
 Involves the translation of stored data into broader representations and concepts.
3. Generalization:
 Uses abstracted data to create knowledge and inferences that drive action in new
contexts.
4. Evaluation:
 Provides a feedback mechanism to measure the utility of learned knowledge and
inform potential improvements.

How Machines Learn?
• Steps in the Learning Process:
1. Data Storage

 All learning must begin with data.
 Humans and computers utilize data storage as a foundation for more advanced reasoning.
 In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall.
 Computers use hard disk drives, flash memory, and random access memory (RAM) in combination with a central processing unit (CPU).
Steps in the Learning Process
2. Abstraction:

 Assigning meaning to stored data.
 Basis of knowledge representation.
 The formation of logical structures that assist in turning raw sensory information into meaningful insight.
 During this process, the computer summarizes stored raw data using a model.
 Model - an explicit description of the patterns within the data.
Steps in the Learning Process
 Different types of models, e.g.:
– Mathematical equations
– Relational diagrams (trees, graphs, ...)
– Logical if/else rules
– Groupings of data known as clusters
• Training:
 The process of fitting a model to a dataset is known as training.
 When the model has been trained, the data is transformed into an abstract form that summarizes the original information.
Steps in the Learning Process
2. Abstraction (contd.)

• Eg: the discovery of gravity.
• By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity. The force we now know as gravity was always present, but it was not recognized until Newton described it as an abstract concept that relates some data to others
– i.e., by becoming the 'g' term in a model that explains observations of falling objects.
Steps in the Learning Process
3. Generalization

 The process of turning abstracted knowledge into a form that can be utilized for future action.
 This process is a bit difficult to describe.
 It has been imagined as a search through the entire set of models (inferences) that could be abstracted during training.
 It involves the reduction of the dataset into a manageable number of important findings.
Steps in the Learning Process
 3. Generalization (contd.)

 In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks.
 Generally, it is not feasible to reduce the number of patterns by examining them one by one and ranking them by future utility.
 So, machine learning algorithms generally employ shortcuts that reduce the search space more quickly.
 The algorithm uses heuristics - techniques designed to solve a problem more quickly.
Steps in the Learning Process
4. Evaluation

 To evaluate/measure the learner's success.
 Use this information to inform additional training if needed.
• The model is evaluated on a new test dataset in order to judge how well its characterization of the training data generalizes to new, unseen data.
Steps in the Learning Process
4. Evaluation (contd.)

 Models fail to perfectly generalize due to the problem of noise:
 Noise: unexplainable variations in data, e.g.:
 Errors due to sensors
 Issues with human subjects
 Data quality problems (missing, null, truncated, incorrectly coded, or corrupted values)
Steps in the Learning Process
4. Evaluation (contd.)
Overfitting:
 The problem of trying to model the noise.
 Solutions to the problem of overfitting are specific to particular machine learning approaches.
 A model that seems to perform well during training but does poorly during evaluation is said to be overfitted to the training dataset, as it does not generalize well to the test dataset.
 Overfitting happens for several reasons, such as:
• The training dataset is too small and does not contain enough samples to accurately represent all possible input data values.
• The training data contains large amounts of irrelevant information, called noisy data.
• The model complexity is high, so it learns the noise within the training data.
Machine Learning in Practice
• A 5-step process:
1. Data Collection
2. Data Exploration and Preparation
 Checking the quality of the data and preparing it for the learning process.
3. Model Training
 Selecting an appropriate algorithm; the algorithm will represent the data in the form of a model.
4. Model Evaluation
 Evaluating the accuracy of the model using the test dataset.
5. Model Improvement
 If better performance is needed, use advanced strategies to improve the performance of the model, or switch to a different model.
Terminologies
• Model: Also known as a "hypothesis", a machine learning model is the mathematical representation of a real-world process. A machine learning algorithm along with the training data builds a machine learning model.
• Feature: A feature is a measurable property or parameter of the dataset.
• Feature Vector: A set of multiple numeric features. We use it as the input to the machine learning model for training and prediction purposes.
• Training: An algorithm takes a set of data known as "training data" as input. The learning algorithm finds patterns in the input data and trains the model for the expected results (target). The output of the training process is the machine learning model.
• Prediction: Once the machine learning model is ready, it can be fed input data to provide a predicted output.
• Target (Label): The value that the machine learning model has to predict is called the target or label.
Types of input data
Unit of Observation:
 The smallest entity with measured properties of interest for a study.
 Eg: persons, objects or things, transactions, time points, geographic regions, or measurements.
Unit of Analysis:
 The smallest unit from which the inference is made.
 Eg: classes.
 The two are not always the same; e.g., data observed from people might be used to analyse trends across different countries.
Types of input data

• Datasets:
 Store the units of observation and their properties.
 Collections of data consisting of:
 Examples: instances of the unit of observation for which properties have been recorded.
 Features: recorded properties or attributes of examples that may be useful for learning.
Types of input data
• Eg: to build a learning algorithm to identify spam e-mail:
– Unit of observation -> e-mail messages
– Examples -> specific messages
– Features -> the words used in the messages
• Examples and features:
– Do not have to be collected in any specific form.
– Commonly gathered in matrix format, which means that each example has exactly the same features.
Types of input data

 Each row -> example
 Each column -> feature
 Eg: examples of automobiles
Various forms of 'Features'
 Numeric
 Categorical / Nominal
 Ordinal
Various forms of 'Features' (contd.)
1. Numeric feature:
– A feature which represents a characteristic measured in numbers.
 Number of black pixels
 Noise ratios, length of sounds, relative power
 Frequency of specific terms
 Height, weight, etc.
Various forms of 'Features' (contd.)
2. Categorical / nominal feature:
– An attribute that consists of a set of categories.
– Allows you to assign categories.
– Eg:
 Gender
 Colour of a ball
Various forms of 'Features' (contd.)
3. Ordinal feature:
– A special case of a categorical variable: a nominal variable whose categories fall in an ordered list.
– Eg: clothing sizes: small, medium, and large.
– Eg: a measurement of customer satisfaction on a scale from "not at all happy" to "very happy."
Various forms of 'Features' (contd.)
• Eg: ordinal variable - educational qualification:
• SSLC
• Plus Two
• Degree
• PG

It is important to consider what the features represent, as the type and number of features in your dataset will assist in determining an appropriate machine learning algorithm for your task.
Types of machine learning algorithms
 Machine learning algorithms are divided into categories according to their purpose:
 Predictive models
 Descriptive models
 Learning algorithms:
 Supervised learning
 Unsupervised learning
Predictive Model:
 Used for tasks that involve the prediction of one value using other values in the dataset.
 The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features.
 Eg: a model used to predict whether an e-mail is spam or "ham".
Descriptive Model:
 No single feature is more important than any other.
 No target to learn.
 The process of training a descriptive model is called unsupervised learning.
 Descriptive modeling tasks:
– Pattern discovery
– Clustering

Eg: if a user buys product A (bread), the machine should automatically suggest that they also buy product B (jam).
Descriptive Model:
 Pattern Discovery:
• Used to identify useful associations within data.
• An association rule learning problem is one where you want to discover rules that describe large portions of your data, such as "people who buy X also tend to buy Y."

 Clustering:
• Dividing a dataset into homogeneous groups.
• Eg: grouping customers by purchasing behavior.
Supervised Learning
• The process of training a predictive model is known as supervised learning.
 A supervised learning algorithm attempts to optimize a function (model) to find the combination of feature values that result in the target output.
Supervised Learning

Supervised learning is where you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output:

Y = f(X)
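The mapping Y = f(X) can be made concrete with a minimal sketch: fit a straight line to labelled (x, y) pairs using least squares. This is pure-Python illustration code; the helper `fit_line` and the sample numbers are assumptions, not from the slides.

```python
# A minimal sketch of supervised learning: estimate the mapping Y = f(X)
# from labelled (x, y) pairs by fitting a least-squares straight line.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]   # input variable X (made-up training data)
ys = [2, 4, 6, 8]   # output variable Y (here exactly y = 2x)
a, b = fit_line(xs, ys)
predict = lambda x: a * x + b   # the learned mapping f
print(a, b, predict(5))         # slope 2.0, intercept 0.0, f(5) = 10.0
```

Training finds the parameters (a, b); prediction then applies the learned f to new inputs.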

Supervised Learning
• The often-used supervised machine learning task of predicting which category an example belongs to is known as classification.
• Eg: we could predict whether:
– An e-mail message is spam
– A person has cancer
– A football team will win or lose
– An applicant will default on a loan
Types of supervised Machine learning Algorithms:
1. Regression
• Regression algorithms are used when there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, e.g. weather forecasting, market trends, etc.
• Below are some popular regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
Types of supervised Machine learning Algorithms:
2. Classification
• Classification algorithms are used when the output variable is categorical, i.e. there are distinct classes such as Yes-No, Male-Female, True-False, etc.
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Unsupervised Learning
 We have input data (X) and no corresponding output variables.
 The main goal of unsupervised learning is to discover hidden and interesting patterns in unlabeled data.
 Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.
 Unsupervised learning problems can be further categorized into two types:
 Clustering, e.g. k-means for clustering problems.
 Association rules, e.g. the Apriori algorithm for association rule learning problems.
Unsupervised Learning
Clustering:
• Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group.
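The grouping idea can be sketched with a tiny one-dimensional version of k-means (the k-means algorithm is named on a later slide). The helper `kmeans_1d`, its deterministic initialization, and the sample values are illustrative assumptions, not from the slides; it assumes k >= 2.

```python
# A minimal 1-D k-means sketch: repeatedly assign each value to its
# nearest center, then move each center to the mean of its cluster.
def kmeans_1d(values, k, iters=20):
    """Cluster 1-D values into k groups (assumes k >= 2)."""
    data = sorted(values)
    # deterministic init: spread initial centers across the sorted data
    centers = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
print(centers)   # the two group means: [2.0, 11.0]
```

The two clusters {1, 2, 3} and {10, 11, 12} emerge without any labels being given.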

Unsupervised Learning
Association:
• An association rule is an unsupervised learning method used for finding relationships between variables in large databases.
• It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam).
• A typical example of an association rule is Market Basket Analysis.
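The core of market basket analysis can be sketched by counting how often item pairs occur together across baskets (full algorithms such as Apriori additionally prune by support, which is not shown here). The baskets below are made-up examples reusing the slide's bread/jam items.

```python
# Count co-occurring item pairs across shopping baskets - the raw
# statistic behind rules like "people who buy bread also buy jam".
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "jam"},
    {"bread", "butter"},
    {"bread", "jam", "milk"},
    {"milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # each unordered item pair
        pair_counts[pair] += 1

# ("bread", "jam") occurs together in 2 of the 4 baskets
print(pair_counts[("bread", "jam")])  # 2
```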

Unsupervised Learning
• Below is a list of some popular unsupervised learning algorithms:

• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
General types of Machine Learning Algorithms

R
 A programming language and environment.
 Commonly used in:
 Statistical computing
 Data analytics
 Scientific research
Exploring and Understanding Data
 After collecting data and loading it into R's data structures, the next step in the machine learning process involves examining the data in detail.
 I.e., explore the data's features and examples, and understand the peculiarities that make your data unique.
 The better you understand your data, the better you will be able to match a machine learning model to your learning problem.
Numerical Variables
 Variables whose values are numbers.
 Summary statistics:
 A common set of measurements used to describe the values of numeric variables in the data.
 The summary() function displays common summary statistics.
 It provides six summary statistics.
Numerical Variables

• Year -> year of manufacture of vehicles that were recently listed for sale.
• The summary() function can also be used to obtain summary statistics for several numeric variables at the same time.
Numerical Variables
• The summary() function provides six summary statistics.
• Simple and powerful tools to investigate data.
• Summary statistics can be divided into two types:
– Measures of center
– Measures of spread
Measuring the Central Tendency: Mean, Median & Mode

 Measures of central tendency:
• A class of statistics used to identify a value that falls in the middle of a set of data.
 Mean (average)
 Median
 Mode
Mean (average)
 Sum of all values divided by the number of values.
 For example, to calculate the mean income in a group of 3 people with incomes of $36,000, $44,000, and $56,000:
 (36000 + 44000 + 56000) / 3 = 45333.33
Mean (average) (contd.)
• In R: the mean() function.
 Mean:
 The most commonly used statistic to measure the center of a dataset.
 It is not always the most appropriate one.
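The slide's calculation, done by hand and with a library call. The slides use R's mean(); the snippet below is a Python equivalent shown for illustration.

```python
# The mean computed two ways: by the definition on the slide, and with
# Python's statistics.mean (the counterpart of R's mean()).
import statistics

incomes = [36000, 44000, 56000]
mean_by_hand = sum(incomes) / len(incomes)
print(round(mean_by_hand, 2))              # 45333.33
print(round(statistics.mean(incomes), 2))  # 45333.33
```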

Median
 Another commonly used measure of central tendency.
 The value that occurs halfway through an ordered list of values.
 The "middle" number in the set of numbers.
Median (contd.)
• The median() function is used in R.
• Middle value = 44000.
• Median income = $44,000.
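The same median, mirrored in Python (the slides use R's median()). The even-length example is an added illustration: with an even count, the median is the mean of the two middle values.

```python
# The median of the three incomes from the slide.
import statistics

incomes = [36000, 44000, 56000]
print(statistics.median(incomes))  # 44000 - the middle value

# With an even number of values, the median is the mean of the two middle ones:
print(statistics.median([36000, 44000, 56000, 60000]))  # 50000.0
```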

Mode
 The mode is the number in a set that appears most frequently.
 A variable may have more than one mode:
 a variable with a single mode is unimodal;
 a variable with two modes is bimodal;
 data having multiple modes is more generally called multimodal.
 Eg: 12, 8, 7, 15, 7 -> mode = 7
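The example above, checked in Python's statistics module (shown for illustration; the multimode example set is an added assumption demonstrating the bimodal case).

```python
# Mode of the slide's example set, plus a bimodal example.
import statistics

values = [12, 8, 7, 15, 7]
print(statistics.mode(values))                # 7, which appears twice
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2] -> bimodal
```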

Measuring Spread
 Measures the diversity of the data.
 Another type of summary statistic, concerned with the spread of data:
 how tightly or loosely the values are spaced.
 Knowing about spread provides a sense of the data's highs and lows and whether most values are like or unlike the mean and median.
Measuring Spread (contd.)
1. Five-number summary:
 A set of five statistics that roughly depict the spread of a feature's values.
 All five statistics are included in the output of the summary() function. They are:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
Measuring Spread (contd.)
Quantiles:
• Numbers that divide data into equally sized quantities.
• Commonly used quantiles:
• Quartiles (4 equal parts, divided by three cut points)
• Tertiles (3 parts)
• Quintiles (5 parts)
• Deciles (10 parts)
• Percentiles (100 parts)
• Quartiles divide a dataset into four portions, each with the same number of values.
Measuring Spread (contd.)
1. Quartiles:

• First quartile (Q1) -> the middle number between the smallest value (min) and the median.
• Second quartile (Q2) -> the median.
• Third quartile (Q3) -> the middle value between the median and the highest value (max).
 The first and third quartiles (Q1 and Q3) refer to the values below or above which one quarter of the values are found.
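The five-number summary can be sketched in Python, mimicking the fields of R's summary() output. The helper name is illustrative; statistics.quantiles with n=4 uses the "exclusive" quartile method, which can differ slightly from R's default.

```python
# Five-number summary in the style of R's summary() output.
import statistics

def five_number_summary(values):
    q1, median, q3 = statistics.quantiles(values, n=4)  # the three cut points
    return {"Min.": min(values), "1st Qu.": q1, "Median": median,
            "3rd Qu.": q3, "Max.": max(values)}

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))
```

For the values 1 through 8 this gives Q1 = 2.25, median = 4.5, and Q3 = 6.75.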

Measuring Spread (contd.)
2. Variance and Standard Deviation

• Variance: the average of the squared differences between each value and the mean value.
Standard Deviation
• The square root of the variance.
• Denoted by sigma (σ).
Measuring Spread (contd.)
• The var() and sd() functions can be used to obtain the variance and standard deviation in R.
• When interpreting the variance, a larger number indicates that the data are spread more widely around the mean.
• The SD indicates how much, on average, each value differs from the mean.
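The definitions above, computed by hand and with Python's statistics module for illustration (note a subtle difference: R's var() and sd() divide by n-1, the sample versions, while pvariance/pstdev divide by n, matching the slide's "average of the squared differences").

```python
# Variance (average squared difference from the mean) and its square root.
import math
import statistics

incomes = [36000, 44000, 56000]
mean = sum(incomes) / len(incomes)
variance = sum((x - mean) ** 2 for x in incomes) / len(incomes)
sd = math.sqrt(variance)

print(round(variance, 2), round(sd, 2))
print(statistics.pvariance(incomes), statistics.pstdev(incomes))
```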

Measuring Spread (contd.)
• Standard deviation is the spread of a group of numbers from the mean.
• The variance measures the average degree to which each point differs from the mean.
• Standard deviation is useful when comparing the spread of two separate datasets that have approximately the same mean.
• A low standard deviation means the data are clustered around the mean (more reliable); a high standard deviation indicates the data are more spread out.
• Low variability is ideal because it means that you can better predict information about the population based on sample data. High variability means that the values are less consistent, so it is harder to make predictions.
Visualizing Numeric Variables
• Boxplots and histograms are data exploration tools.
• Helpful in diagnosing data problems.
Boxplot
• A box-and-whisker plot (also called a boxplot) displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum.
• The boxplot() function in R.
• Depicts the five-number summary values using horizontal lines and dots.
• A boxplot is often used to compare and contrast two or more groups.
Boxplot
• A boxplot contains a lot of information, so its interpretation can be very versatile.
• The boxplot consists of three parts:
• the box
• two strokes (lines)
• the T-shaped whiskers (also called feelers)
Boxplot
• The box itself indicates the range in which the middle 50% of the data is located.
• The lower end of the box is therefore the 1st quartile and the upper end the 3rd quartile.
• In the boxplot, the solid line indicates the median and the dashed line the mean value.
• The T-shaped whiskers extend to the last point that is still within 1.5 times the interquartile range; points further away are considered outliers.
• If no point is more than 1.5 times the interquartile range away, the T-shaped whisker indicates the maximum or minimum value.
Boxplot
• Minimum: the minimum value in the given dataset.
• First quartile (Q1): the median of the lower half of the dataset.
• Median: the middle value of the dataset, which divides it into two equal parts; the median is considered the second quartile.
• Third quartile (Q3): the median of the upper half of the data.
• Maximum: the maximum value in the given dataset.
Apart from these five terms, the other terms used in the boxplot are:
• Interquartile range (IQR): the difference between the third quartile and the first quartile, i.e. IQR = Q3 - Q1.
• Outlier: data that fall on the far left or right side of the ordered data are tested to be outliers. Generally, outliers fall more than a specified distance from the first and third quartiles, i.e. outliers are greater than Q3 + (1.5 · IQR) or less than Q1 - (1.5 · IQR).
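The IQR and the 1.5 · IQR outlier fences can be sketched as follows. The data are made-up illustration values; statistics.quantiles uses the "exclusive" quartile method, which can differ slightly from R's default.

```python
# IQR and the 1.5*IQR outlier rule from the slide.
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 100]
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]
print(iqr, outliers)  # 100 lies far beyond the upper fence
```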
Boxplot
Applications
• It is used to know:
• The outliers and their values
• Symmetry of the data
• Tight grouping of the data
• Data skewness - whether it exists, in which direction, and how much
Boxplot
• Example: find the maximum, minimum, median, first quartile, and third quartile for the dataset: 23, 42, 12, 10, 15, 14, 9.
• Solution:
• Arrange the given dataset in ascending order: 9, 10, 12, 14, 15, 23, 42
• Hence,
• Minimum = 9
• Maximum = 42
• Median = 14
• First quartile = 10 (middle value of 9, 10, 12)
• Third quartile = 23 (middle value of 15, 23, 42)
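The worked example above can be checked in Python for illustration: statistics.quantiles with n=4 reproduces the same quartiles for this dataset.

```python
# Verifying the slide's worked example: Q1 = 10, median = 14, Q3 = 23.
import statistics

data = [23, 42, 12, 10, 15, 14, 9]
q1, median, q3 = statistics.quantiles(data, n=4)
print(min(data), q1, median, q3, max(data))  # 9 10.0 14.0 23.0 42
```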
Boxplot
• The boxplot distribution shows how tightly the data are grouped and how the data are skewed.
• Positively skewed: if the distance from the median to the maximum is greater than the distance from the median to the minimum, the boxplot is positively skewed.
• Negatively skewed: if the distance from the median to the minimum is greater than the distance from the median to the maximum, the boxplot is negatively skewed.
• Symmetric: the boxplot is said to be symmetric if the median is equidistant from the maximum and minimum values.
Visualizing Numeric Variables: Histograms
• Another way to graphically depict the spread of a numeric variable.
• Similar to a boxplot, it divides the variable's values into a predefined number
of portions, or bins, that act as containers for values.
• In R, the hist() function is used.
Histograms
• Composed of a series of bars with heights indicating the count, or frequency,
of values falling within each of the equal-width bins partitioning the values.
• The vertical lines that separate the bars, as labelled on the horizontal axis,
indicate the start and end points of the range of values for each bin.
Histograms (contd.)
 The used car prices tend to be evenly divided on both sides of the middle,
while the car mileages stretch further to the right.
 This characteristic is known as skew, specifically right skew, because the
values on the high end (right side) are far more spread out than the values on
the low end (left side): the right side (or "tail") is longer than the left side.
Histograms
Skew:
 The ability to quickly diagnose patterns in our data is one of the strengths of
the histogram as a data exploration tool.
 Histograms also let us examine other patterns of spread in numeric data.
Data Distributions
 A variable's distribution describes how likely a value is to fall within
various ranges.
 Uniform Distribution
 Normal Distribution
Uniform Distribution
• In statistics, a uniform distribution is a probability distribution in which all
outcomes are equally likely.
• A coin toss follows a uniform distribution because the probability of getting
either heads or tails is the same.
• The uniform distribution can be visualized as a straight horizontal line: for a
coin flip, heads and tails each have probability p = 0.50, depicted by a
horizontal line at 0.50 on the y-axis.
Uniform Distribution
Uniform distribution of one six-sided die:
The roll of a single die yields one of six numbers: 1, 2, 3, 4, 5, or 6. Because
there are only 6 possible outcomes, the probability of landing on any one of
them is 16.67% (1/6).
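The die's uniform distribution can be checked empirically with a quick Python simulation (the seed and sample size are arbitrary choices, not from the deck):

```python
import random
from collections import Counter

random.seed(42)
rolls = [random.randint(1, 6) for _ in range(60000)]  # 60,000 fair die rolls
freq = Counter(rolls)

# Each face should appear roughly 1/6 ≈ 0.167 of the time.
for face in range(1, 7):
    print(face, round(freq[face] / len(rolls), 3))
```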
Normal Distribution
• Normal distribution is a probability distribution that peaks in the middle and
gradually decreases towards both ends of the axis.
• It is also known as the Gaussian distribution, or the bell curve, because of its
bell-like shape.
• Data tends to be around a central value with no bias to left or right.
Normal Distribution
 Some values are more likely to occur than
others.
 Eg:- On the price histogram, it seems that
values grow less likely to occur as they are
further away from both sides of the center
bar, resulting in a bell-shaped distribution of
data.
 This characteristic is so common in real-world
data that this type of data distribution is known
as the normal distribution.
Normal Distribution
• Normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the
mean are more frequent in occurrence than data far from the mean.
Normal Distribution
Skewness
• Skewness measures the degree of symmetry of a distribution.
• The normal distribution is symmetric and has a skewness of zero.
• If the distribution of a data set instead has a skewness less than zero, or
negative skewness (left-skewness), then the left tail of the distribution is
longer than the right tail; positive skewness (right-skewness) implies that the
right tail of the distribution is longer than the left.
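Skewness as described here can be computed as the third standardized moment; a minimal sketch with made-up samples (a long right tail gives a positive value, a long left tail a negative one):

```python
def skewness(xs):
    """Sample skewness: the average cubed z-score (third standardized moment)."""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # long right tail -> positive skew
left_skewed = [-10, 1, 2, 2, 3, 3, 3, 4]   # long left tail  -> negative skew
print(skewness(right_skewed) > 0, skewness(left_skewed) < 0)
# → True True
```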
Uniform vs Normal distribution
• Uniform distributions are probability distributions with equally likely
outcomes.
• In a discrete uniform distribution, outcomes are discrete and have the same
probability.
• In a normal distribution, data around the mean occur more frequently.
• The frequency of occurrence decreases the farther you are from the mean in a
normal distribution.
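This difference is easy to verify by simulation; a sketch using Python's random.gauss (the seed and sample size are arbitrary):

```python
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(50000)]  # standard normal samples

within1 = sum(1 for x in xs if abs(x) < 1) / len(xs)         # within 1 SD of the mean
between12 = sum(1 for x in xs if 1 <= abs(x) < 2) / len(xs)  # between 1 and 2 SDs

# Frequency drops as we move away from the mean: roughly 0.68 vs roughly 0.27.
print(round(within1, 2), round(between12, 2))
```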
Categorical / Nominal Variables
 A feature (attribute) whose values consist of a set of categories.
 It allows each example to be assigned a category.
 The used car dataset had three categorical variables:
 model, color, and transmission.
Categorical /Nominal Variables
 Categorical data is typically examined using tables rather than summary
statistics.
• One-way Table:
 A table that presents a single categorical variable.
 In R, the table() function can be used to generate one-way tables.
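The deck demonstrates R's table(); a comparable one-way table can be sketched in Python with the standard library's Counter (the color values below are made up):

```python
from collections import Counter

# Hypothetical 'color' column from a used-car dataset.
colors = ["Black", "Silver", "Black", "White", "Silver", "Black"]

# Counter maps each category to its count, like R's table().
print(Counter(colors))
# → Counter({'Black': 3, 'Silver': 2, 'White': 1})
```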
Categorical / Nominal Variables
• The table() output lists the categories of the nominal variable and a count of
the number of values falling into each category.
Relationships Between Variables.
• Bivariate relationships:
 Relationship between two variables.
• Multivariate relationships:
 Relationships among more than two variables.
Visualizing Relationships: – Scatterplots
Scatterplot:
• A diagram that visualizes a bivariate relationship.
• It is a two-dimensional figure in which dots are drawn on a coordinate plane,
using the values of one feature as the horizontal x coordinates and the values
of another feature as the vertical y coordinates.
• Patterns in the dots reveal associations between the two features.
• In R, the plot() function is used.
Scatterplots
• The full command to create a scatterplot passes the x and y features to plot(), along with axis labels and a main title.
 Positive Association:
 A relationship that forms a pattern of dots
in a line sloping upward.
 Negative Association:
 A relationship that forms a pattern of dots
in a line sloping downward.
 Eg:- The relationship between car prices and
mileage is a negative association.
Two-way Cross Tabulation
• To examine a relationship between two nominal variables, a two-way cross-
tabulation is used (also known as a crosstab or a contingency table).
• Two-way tables are used in statistical analysis to summarize the relationship
between two categorical variables.
• The levels of one categorical variable are entered as the rows of the table,
and the levels of the other categorical variable are entered as the columns.
• The CrossTable() function in R is used to display two-way cross tables.
• CrossTable(mydata$myrowvar, mydata$mycolvar)
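R's CrossTable() prints a full contingency table; the core counting logic can be sketched in Python (the (model, transmission) pairs below are hypothetical):

```python
from collections import Counter

# Hypothetical rows: (model, transmission) pairs from a used-car dataset.
cars = [("SE", "AUTO"), ("SE", "MANUAL"), ("SEL", "AUTO"),
        ("SES", "AUTO"), ("SE", "AUTO"), ("SEL", "AUTO")]

counts = Counter(cars)                    # joint counts per (row, column) pair
rows = sorted({m for m, _ in cars})       # levels of the row variable
cols = sorted({t for _, t in cars})       # levels of the column variable

# Print a simple two-way table: rows = model, columns = transmission.
print("\t" + "\t".join(cols))
for r in rows:
    print(r + "\t" + "\t".join(str(counts[(r, c)]) for c in cols))
```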
Dimensionality Reduction:
Principal Component Analysis (PCA)
Introduction to Dimensionality Reduction
• In machine learning, “dimensionality” simply refers to the number of features
(i.e. input variables) in your dataset.
• In machine learning classification problems, there are often too many factors
on the basis of which the final classification is done.
• These factors are basically variables called features.
• The higher the number of features, the harder it gets to visualize the training
set and then work on it.
• Sometimes, most of these features are correlated, and hence redundant.
• This is where dimensionality reduction algorithms come into play.
• Dimensionality reduction is the process of reducing the number of random
variables under consideration, by obtaining a set of principal variables.
Dimensionality Reduction:
 Process of reducing the number of random variables under consideration, &
obtaining a set of principal variables.
 Process of converting a set of data having vast dimensions into data with
lesser dimensions, ensuring that it conveys similar information concisely.
 Used for solving machine learning problems to obtain better features for a
classification or regression task.
Dimensionality Reduction (contd.):
 Consider 2 dimensions x1 and x2, which are
measurements of the same objects in cm (x1)
and inches (x2).
 Both dimensions convey similar information
and introduce a lot of noise into the system,
so it is better to use just one dimension.
 We convert the data from 2D (x1 and x2) to
1D (z1), which makes the data relatively
easier to explain.
Dimensionality Reduction (contd.):
 "Reduce the n dimensions of a data set to k dimensions (k < n)."
 These k dimensions can be directly identified (filtered), or they can be:
A combination of dimensions (weighted averages of dimensions), or
New dimension(s) that represent the existing multiple dimensions well.
Dimensionality Reduction (contd.):
 Real-world data, such as:
 Speech signals
 Digital photographs, or
 MRI scans, usually has a high dimensionality.
 To handle real-world data adequately, its dimensionality needs
to be reduced.
Dimensionality Reduction
 Dimensionality reduction is the transformation of high-dimensional data
into a meaningful representation of reduced dimensionality.
 Reduced representation should have a dimensionality that corresponds to
the intrinsic dimensionality of the data.
 The intrinsic dimensionality of data is the minimum number of parameters
needed to account for the observed properties of the data.
Benefits of Dimensionality Reduction
 It helps in compressing data and reduces the storage space required.
 It reduces the time required to perform the same computations.
 Fewer dimensions mean less computing, and can also allow the use of
algorithms unfit for a large number of dimensions.
 Improves model performance.
 Removes redundant features.
 Makes it possible to plot and visualize data precisely.
 Helpful in noise removal.
What Is Principal Component Analysis?
• Principal component analysis, or PCA, is a dimensionality-reduction method
that is often used to reduce the dimensionality of large data sets by
transforming a large set of variables into a smaller one that still contains
most of the information in the large set.
• Smaller data sets are easier to explore and visualize, and they make analyzing
data much easier and faster for machine learning algorithms.
• PCA reduces the number of variables of a data set while preserving as
much information as possible.
Principal Component Analysis (PCA)
 The main linear technique for dimensionality reduction.
 Performs a linear mapping of the data to a lower-dimensional space in
such a way that the variance of the data in the low-dimensional
representation is maximized.
Principal Component Analysis (PCA):
• Variables are transformed into a new set of variables, which are linear
combinations of the original variables.
• These new variables are known as principal components.
• They are obtained in such a way that the 1st principal component accounts for
most of the possible variation in the original data, after which each succeeding
component has the highest possible variance.
 The 2nd principal component must be orthogonal to the 1st principal
component.
 In other words, it does its best to capture the variance in the data that is
not captured by the 1st principal component.
Principal Component Analysis (PCA)
 Below is a snapshot of the data and its 1st and 2nd principal
components.
 i.e., the second principal component is orthogonal to the first
principal component.
Principal Component Analysis (PCA) (contd.)
• X1, X2: original axes (attributes)
• Y1,Y2: principal components
PRINCIPAL COMPONENT ANALYSIS - STEPS
• Standardize the range of continuous initial variables
• Compute the covariance matrix to identify correlations
• Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components
• Create a feature vector to decide which principal components to keep
• Recast the data along the principal components axes
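The five steps can be sketched end-to-end in pure Python for a 2D case (the cm/inch measurements below are hypothetical; for a 2×2 covariance matrix the eigenvalues follow from the quadratic formula, so no linear-algebra library is needed):

```python
import math

# Hypothetical data: the same lengths measured in cm (x1) and inches (x2).
data = [(2.5, 1.0), (5.1, 2.0), (7.6, 3.0), (10.2, 4.0), (12.7, 5.0), (15.2, 6.0)]
n = len(data)
x1, x2 = zip(*data)

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Step 1: standardize each variable (z-scores).
z1 = [(v - mean(x1)) / sd(x1) for v in x1]
z2 = [(v - mean(x2)) / sd(x2) for v in x2]

# Step 2: covariance matrix of the standardized data (diagonal entries are 1).
cov12 = sum(a * b for a, b in zip(z1, z2)) / (n - 1)

# Step 3: eigenvalues of the 2x2 symmetric matrix [[1, cov12], [cov12, 1]].
tr, det = 2.0, 1.0 - cov12 ** 2                 # trace and determinant
disc = math.sqrt(tr * tr - 4 * det)
eig1, eig2 = (tr + disc) / 2, (tr - disc) / 2   # sorted: eig1 >= eig2

# Steps 4-5: keep only the first component and project each point onto its
# eigenvector (1/sqrt(2), 1/sqrt(2)), recasting the 2D data as 1D scores.
w = 1 / math.sqrt(2)
scores = [w * a + w * b for a, b in zip(z1, z2)]

explained = eig1 / (eig1 + eig2)   # share of variance kept by PC1
print(round(explained, 4))
```

For near-collinear data like this, the first principal component carries almost all of the variance, which is exactly why one dimension suffices.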
Step-by-Step Explanation of PCA
STEP 1: STANDARDIZATION
• The aim of this step is to standardize the range of the continuous initial
variables so that each one of them contributes equally to the analysis.
• If there are large differences between the ranges of the initial variables, those
with larger ranges will dominate those with smaller ranges (for example, a
variable that ranges between 0 and 100 will dominate one that ranges
between 0 and 1), which will lead to biased results.
• Standardization is done by subtracting the mean and dividing by the standard
deviation for each value of each variable.
• Once the standardization is done, all the variables will be transformed to the
same scale.
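Step 1 as a minimal Python sketch (the values are made up; the population standard deviation is used here):

```python
import math

values = [0.0, 50.0, 100.0]
m = sum(values) / len(values)                                    # mean
s = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))   # std deviation

z = [(v - m) / s for v in values]   # standardized (z-scored) values
print([round(v, 4) for v in z])
# → [-1.2247, 0.0, 1.2247]
```

After standardization the values have mean 0 and standard deviation 1, so both a 0-100 variable and a 0-1 variable end up on the same scale.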
Step-by-Step Explanation of PCA
STEP 2: COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the input data set
are varying from the mean with respect to each other, or in other words, to
see if there is any relationship between them
• What do the covariances that we have as entries of the matrix tell us about
the correlations between the variables?
• If positive: the two variables increase or decrease together (correlated).
• If negative: one increases when the other decreases (inversely correlated).
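The sign interpretation can be checked with a tiny covariance function (the study-hours data below is made up):

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divide by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score_up = [10, 20, 30, 40, 50]     # rises with hours -> positive covariance
score_down = [50, 40, 30, 20, 10]   # falls as hours rise -> negative covariance
print(covariance(hours, score_up) > 0, covariance(hours, score_down) < 0)
# → True True
```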
Step-by-Step Explanation of PCA
• STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE
COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS
• Eigenvectors and eigenvalues computed from the covariance matrix in order to
determine the principal components of the data
• Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables.
• These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within
the initial variables is squeezed or compressed into the first components. So,
10-dimensional data gives you 10 principal components, but PCA tries to put
the maximum possible information in the first component, then the maximum
remaining information in the second, and so on.
• The axis that explains the maximum amount of variance in the training set is
called the first principal component.
• It is very important to choose the right hyperplane so that, when the data is
projected onto it, it preserves the maximum amount of information about how
the original data is distributed.
Step-by-Step Explanation of PCA
• Organizing information in principal components this way allows you to
reduce dimensionality without losing much information, by discarding the
components with low information and considering the remaining components
as your new variables.
• Geometrically speaking, principal components represent the directions of the
data that explain a maximal amount of variance, that is to say, the lines that
capture most of the information in the data.
• The relationship between variance and information here is that the larger the
variance carried by a line, the larger the dispersion of the data points along it;
and the larger the dispersion along a line, the more information it has.
• I.e., principal components are the new axes that provide the best angle to see
and evaluate the data, so that the differences between the observations are
better visible.
Step-by-Step Explanation of PCA
• Compute the eigenvalues and eigenvectors of the covariance matrix.
• Sort these pairs based on eigenvalues, in descending order.
• Eigenvalues are simply the coefficients attached to eigenvectors, and they
give the amount of variance carried in each principal component.
• In PCA, the principal components are the eigenvectors.
Step-by-Step Explanation of PCA
STEP 4: FEATURE VECTOR
• The feature vector is simply a matrix that has as columns the eigenvectors of
the components that we decide to keep.
• This makes it the first step towards dimensionality reduction, because if we
choose to keep only p eigenvectors (components) out of n, the final data set
will have only p dimensions.
Step-by-Step Explanation of PCA
STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES
• In this step, which is the last one, the aim is to use the feature vector formed
using the eigenvectors of the covariance matrix, to reorient the data from the
original axes to the ones represented by the principal components (hence the
name Principal Components Analysis).
• This can be done by multiplying the transpose of the original data set by the
transpose of the feature vector.

• Refer: https://www.youtube.com/watch?v=MLaJbA82nzk
Review Questions
• Mention the difference between categorical variables and nominal variables.
• Mention the difference between uniform distribution and normal distribution.
• How can machine learning be done in practice? Explain.
• Explain the steps of machine learning in detail with the help of a relevant
diagram.
• Explain the various forms of features with examples.
• Explain in detail the different measures of central tendency and measures of
spread.
• Explain PCA and its steps in detail.
• What is the purpose of Ordinary Least Squares Estimation?
• Discuss the learning process of machines.