
MACHINE LEARNING

Dr P Bhagath M.Tech (IITG), Ph.D(IITG)

Department of CSE,
LBRCE Mylavaram

August 2, 2023
CONTENT

1 Introduction to Machine Learning
  1.1 Types of Machine Learning
  1.2 Applications of Machine Learning
  1.3 Issues in Machine Learning

2 Preparing to Model
  2.1 Machine Learning Activities
  2.2 Basic Types of Data in Machine Learning

Part I

FUNDAMENTALS OF MACHINE LEARNING

ARTIFICIAL INTELLIGENCE, MACHINE LEARNING, AND DEEP LEARNING

ARTIFICIAL INTELLIGENCE

Definition 1.1 (Artificial Intelligence)


The effort to automate intellectual tasks normally performed by humans. As such, AI is a general field that
encompasses machine learning and deep learning.

MACHINE LEARNING

Definition 1.2 (Machine Learning)


A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples
relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come
up with rules for automating the task.

Figure. ML Paradigm

MACHINE LEARNING - 2

A machine-learning program discovers rules to execute a data-processing task, given examples of
what's expected. A machine-learning algorithm requires three components:

1. Input data points: text, speech, images, etc.

2. Examples of the expected outputs: labels, transcripts, etc.

3. A way to measure whether the algorithm is doing a good job: the difference between the
algorithm's current output and its expected output. This measurement is used as a feedback
signal to adjust the way the algorithm works. The adjustment step is what we call learning.

Machine-learning models are all about finding appropriate representations for their input
data—transformations of the data that make it more amenable to the task at hand, such as a
classification task.
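
To make these three components concrete, the following is a minimal sketch in Python (the synthetic data set and the scikit-learn library are assumptions for illustration, not part of the slides): input data points, examples of expected outputs, and a measured feedback signal.

```python
# Minimal sketch of the three components of a machine-learning setup.
# Assumes scikit-learn is installed; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Input data points (feature vectors) and
# 2. examples of the expected outputs (class labels).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The model is trained rather than explicitly programmed.
model = LogisticRegression()
model.fit(X, y)

# 3. A way to measure whether the algorithm is doing a good job:
#    compare the current outputs against the expected outputs.
print("training accuracy:", accuracy_score(y, model.predict(X)))
```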

TYPES OF MACHINE LEARNING

TYPES OF MACHINE LEARNING

1. Supervised learning – Also called predictive learning. A machine predicts the class of
unknown objects based on prior class-related information of similar objects.

2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in
unknown objects by grouping similar objects together.

3. Reinforcement learning – A machine learns to act on its own to achieve the given goals.

TERMINOLOGY

▶ Model: Also known as a "hypothesis", a machine learning model is the mathematical
representation of a real-world process. A machine learning algorithm, along with the training
data, builds a machine learning model.

▶ Feature: A feature is a measurable property or parameter of the dataset.

▶ Feature Vector: It is a set of multiple numeric features. We use it as an input to the machine
learning model for training and prediction purposes.

▶ Training: An algorithm takes a set of data known as “training data” as input. The learning
algorithm finds patterns in the input data and trains the model for expected results (target).
The output of the training process is the machine learning model.

▶ Prediction: Once the machine learning model is ready, it can be fed with input data to provide
a predicted output.

▶ Target (Label): The value that the machine learning model has to predict is called the target or
label.
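
A rough illustration of these terms with made-up numbers (the tiny house-price data below is hypothetical, not from the slides):

```python
import numpy as np

# Feature vectors: each row is one example; the single feature is house size.
X_train = np.array([[1000.0], [1500.0], [2000.0]])

# Target (label): the value the model has to predict (price, in lakhs).
y_train = np.array([50.0, 72.0, 95.0])

# Training: fit a straight line y = w*x + b to the training data.
w, b = np.polyfit(X_train.ravel(), y_train, deg=1)

# The fitted pair (w, b) is the model, i.e. the learned hypothesis.
# Prediction: apply the model to new input data.
print("predicted price for 1800 sq. ft:", w * 1800 + b)
```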

SUPERVISED LEARNING

▶ Supervised learning is a class of problems that uses a model to learn the mapping between the
input and target variables. Applications in which the training data describes the various input
variables and the target variable are known as supervised learning tasks.

▶ Let the set of input variables be (x) and the target variable be (y). A supervised learning
algorithm tries to learn a hypothesis function f that maps the input to the target, given by the
expression y = f(x).

▶ The learning process here is monitored, or supervised. Since we already know the expected
output, the algorithm is corrected each time it makes a prediction, to optimize the results. The
model is fit on training data, which consists of both the input and output variables, and is then
used to make predictions on test data.

▶ There are basically two types of supervised problems:

1. Classification – involves the prediction of a class label
2. Regression – involves the prediction of a numerical value
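
A minimal sketch of both problem types, assuming scikit-learn and its bundled example data sets:

```python
# Classification predicts a class label; regression predicts a numerical value.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict the iris species (a class label).
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a disease-progression score (a numerical value).
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))
```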

CLASSIFICATION

UNSUPERVISED LEARNING

▶ In an unsupervised learning problem, the model tries to learn by itself, recognize patterns, and
extract the relationships among the data. Unlike supervised learning, there is no supervisor or
teacher to drive the model.

▶ Unsupervised learning operates only on the input variables. There are no target variables to
guide the learning process. The goal here is to interpret the underlying patterns in the data in
order to obtain more proficiency over the underlying data.
1. Clustering – where the task is to find out the different groups in the data

2. Association Analysis – a search through the data for combinations of items whose statistics
are interesting. It helps us establish rules such as "If A occurs, then B is likely to occur as well."
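
A minimal clustering sketch, assuming scikit-learn (association analysis usually relies on a separate library, e.g. an Apriori implementation, and is not shown here):

```python
# Group unlabelled points into clusters; no target variable is involved.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabelled input data.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments of the first 10 points:", kmeans.labels_[:10])
print("cluster centres:\n", kmeans.cluster_centers_)
```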

UNSUPERVISED LEARNING

REINFORCEMENT LEARNING

▶ Reinforcement learning is a class of problems in which an agent operates in an environment
and acts based on the feedback or reward that the environment gives it.

▶ The rewards could be either positive or negative. The agent then proceeds in the environment
based on the rewards gained.

▶ The reinforcement agent determines the steps to perform a particular task. There is no fixed
training dataset here and the machine learns on its own.

▶ Playing a game is a classic example of a reinforcement problem, where the agent's goal is to
acquire a high score. It makes successive moves in the game based on the feedback given by the
environment, which may take the form of rewards or penalties. Reinforcement learning has
shown tremendous results in Google's AlphaGo, which defeated the world's number one Go
player.
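
A toy Q-learning sketch of the agent/environment/reward loop; the 5-state corridor environment and all parameter values below are invented for illustration:

```python
import random

# The agent starts in state 0 and gets a reward of +1 only on reaching state 4;
# every other move gives a reward of 0.
N_STATES, ACTIONS = 5, (-1, +1)           # actions: move left or move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1     # learning rate, discount, exploration


def greedy(state):
    """Pick the action with the highest Q-value, breaking ties at random."""
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])


for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit what was learned, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(s)
        s_next = min(max(s + a, 0), N_STATES - 1)     # environment transition
        r = 1.0 if s_next == N_STATES - 1 else 0.0    # reward from environment
        # Q-learning update: adjust behaviour using the reward as feedback.
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print("learned greedy action per state:", [greedy(s) for s in range(N_STATES)])
```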

REINFORCEMENT LEARNING

COMPARISON

Supervised vs. Unsupervised vs. Reinforcement Learning:

▶ When used – Supervised: when class labels are known. Unsupervised: when there is no idea of
class labels. Reinforcement: when there is no idea about the class label.

▶ Data – Supervised: labelled training data is needed; the model is built based on the training
data. Unsupervised: any unknown, unlabelled data set is given to the model as input and the
records are grouped. Reinforcement: the model learns and updates itself through
reward/punishment.

▶ Evaluation – Supervised: performance is measured by comparing predicted and actual values.
Unsupervised: homogeneity of the records grouped together is the only measure.
Reinforcement: the model is evaluated by means of the reward function after it has had some
time to learn.

▶ Example algorithms – Supervised: Naive Bayes, k-nearest neighbour, decision tree, linear
regression, logistic regression, SVM. Unsupervised: k-means, PCA, DBSCAN, self-organizing
maps (SOMs), Apriori. Reinforcement: Q-Learning, SARSA.
APPLICATIONS OF MACHINE LEARNING

▶ Facial recognition/Image recognition

▶ Automatic Speech Recognition

▶ Financial Services

▶ Marketing and Sales

▶ Healthcare

MACHINE LEARNING PROCESS

MODEL PREPARATION

▶ Understand the type of data in the given input data set.

▶ Explore the data to understand the nature and quality.

▶ Explore the relationships amongst the data elements, e.g. inter-feature relationship.

▶ Find potential issues in data.

▶ Do the necessary remediation, e.g. impute missing data values, etc., if needed.

▶ Apply pre-processing steps, as necessary. Once the data is prepared for modelling, the learning
tasks start:
1. The input data is first divided into two parts – the training data and the test data (called the
holdout). This step is applicable to supervised learning only.

2. Consider different models or learning algorithms for selection.

3. For a supervised learning problem, train the model on the training data and then apply it to
unknown data. For an unsupervised learning problem, apply the chosen model directly to the
input data.
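
A minimal sketch of these learning steps (holdout split, model choice, training, evaluation on unseen data), assuming scikit-learn and one of its bundled data sets:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Step 1: divide the input data into training data and test data (holdout).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 2: consider a model / learning algorithm (k-nearest neighbour here).
model = KNeighborsClassifier(n_neighbors=5)

# Step 3: train on the training data, then apply to the unseen holdout data.
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```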

ACTIVITIES IN MACHINE LEARNING

Step 1 – Preparing to Model: understand the type and quality of data; understand the
relationships; find the potential issues; remediate the data; apply pre-processing steps

Step 2 – Learning: data partitioning, model selection, cross-validation

Step 3 – Performance Evaluation: examining the model performance; visualization

Step 4 – Performance Improvement: tuning, ensembling
TYPES OF DATA IN MACHINE LEARNING

▶ A data set is a collection of related information or records

▶ The information may be on some entity or some subject area

▶ For example, we may have a data set on students in which each record consists of information
about a specific student

▶ Each row of a data set is called a record

▶ Each data set also has multiple attributes, each of which gives information on a specific
characteristic

▶ An attribute can also be termed a feature, variable, dimension, or field

TYPES OF DATA IN MACHINE LEARNING

1. Qualitative: Qualitative data provides information about the quality of an object, or
information that cannot be measured
1.1 Nominal: Nominal data has no numeric value, only a named value. It is used for assigning
named values to attributes. Nominal values cannot be quantified.

1.2 Ordinal: Ordinal data also assigns named values to attributes but unlike nominal data,
they can be arranged in a sequence of increasing or decreasing value so that we can say
whether a value is better than or greater than another value.

2. Quantitative: Quantitative data relates to information about the quantity of an object


2.1 Interval: Interval data is numeric data for which not only the order is known, but the exact
difference between values is also known. An ideal example of interval data is Celsius
temperature.

2.2 Ratio: represents numeric data for which the exact value can be measured. An absolute zero
is available for ratio data. Also, these variables can be added, subtracted, multiplied, or
divided. Central tendency can be measured by mean, median, or mode, and dispersion by
methods such as standard deviation.
EXPLORING STRUCTURE OF DATA

Understanding the structure of the data means understanding its nature.

There are different ways to understand the structure:

▶ Graph Plots: Box plot and Histogram

▶ Statistics: (Exploring Numerical Data)


• Central Tendency

• Data Spread

• Data Dispersion

• Measuring the data value position

UNDERSTANDING CENTRAL TENDENCY

▶ Measures of central tendency help us understand the central point of a set of data

▶ Mean: the sum of all data values divided by the count of data elements

▶ Median: the value of the element appearing in the middle of an ordered list of data elements

▶ Example: 21, 89, 34, 67, 96

▶ Mean = (21 + 89 + 34 + 67 + 96) / 5 = 61.4

▶ Median: the ordered list is 21, 34, 67, 89, 96, so the median is 67

▶ Mean and median are impacted differently by data values appearing at the beginning or at the
end of the range

▶ The mean is sensitive to outliers: it shifts even in the presence of a small number of outliers
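
The worked example can be checked with NumPy; the extra outlier value below is added only to show how the mean shifts while the median barely moves:

```python
import numpy as np

values = np.array([21, 89, 34, 67, 96])
print("mean:  ", values.mean())        # (21 + 89 + 34 + 67 + 96) / 5 = 61.4
print("median:", np.median(values))    # middle of 21, 34, 67, 89, 96 -> 67.0

# The mean is sensitive to outliers; the median is much more robust.
with_outlier = np.append(values, 1000)
print("mean with outlier:  ", with_outlier.mean())
print("median with outlier:", np.median(with_outlier))
```
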
UNDERSTANDING DATA SPREAD

▶ Dispersion
1. Variance
\[
\text{Variance} = \frac{\sum_{i=1}^{n} x_i^2}{n} - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right)^2 \tag{1}
\]

2. Standard Deviation
\[
\text{Standard Deviation} = \sqrt{\text{Variance}(x)} \tag{2}
\]

Important Point

A larger value of variance or standard deviation indicates more dispersion in the data, and
vice versa.

▶ Data Spread: Measuring Data value position


1. When the data values of an attribute are arranged in an increasing order, the median gives
the central data value which divides the entire data set into two halves

2. Q1 (Quartile 1)

3. Q2 (Median)

4. Q3 (Quartile 3)
EXAMPLE

▶ Attribute 1 Values: 44, 46, 48, 45, and 47

▶ Attribute 2 values: 34, 46, 59, 39, and 52

Example

▶ Attribute 1: Variance = (44² + 46² + 48² + 45² + 47²)/5 − ((44 + 46 + 48 + 45 + 47)/5)²
= (1936 + 2116 + 2304 + 2025 + 2209)/5 − (230/5)² = 2118 − (46)² = 2

▶ Attribute 2: Variance = (34² + 46² + 59² + 39² + 52²)/5 − ((34 + 46 + 59 + 39 + 52)/5)²
= 10978/5 − (230/5)² = 2195.6 − (46)² = 79.6

▶ Both attributes have the same mean (46), but Attribute 2 shows far greater dispersion.
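
The same computation in NumPy (np.var with its default ddof=0 matches equation (1)):

```python
import numpy as np

attr1 = np.array([44, 46, 48, 45, 47])
attr2 = np.array([34, 46, 59, 39, 52])

# Population variance, i.e. E[x^2] - (E[x])^2 as in equation (1).
print("attribute 1: mean =", attr1.mean(), " variance =", attr1.var())
print("attribute 2: mean =", attr2.mean(), " variance =", attr2.var())
print("standard deviations:", attr1.std(), attr2.std())
```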

BOXPLOTS

▶ A box plot is an extremely effective mechanism to get a one-shot view and understand the
nature of the data.

▶ They enable us to study the distributional characteristics of a group of scores as well as the
level of the scores.

▶ The box plot (also called box and whisker plot) gives a standard visualization of the
five-number summary statistics of a data set, namely minimum, first quartile (Q1), median (Q2),
third quartile (Q3), and maximum

BOX PLOT

BOXPLOTS

▶ Median: The median (middle quartile) marks the mid-point of the data and is shown by the
line that divides the box into two parts. Half the scores are greater than or equal to this value
and half are less.

▶ Inter-quartile range (IQR): The middle “box” represents the middle 50% of scores for the
group. The range of scores from lower to upper quartile is referred to as the inter-quartile
range. The middle 50% of scores fall within the inter-quartile range.

▶ Upper quartile: Seventy-five percent of the scores fall below the upper quartile.

▶ Lower quartile: Twenty-five percent of scores fall below the lower quartile.

▶ Whiskers: The upper and lower whiskers represent scores outside the middle 50%. Whiskers
often (but not always) stretch over a wider range of scores than the middle quartile groups.
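
A short sketch that computes the five-number summary and draws a box plot, assuming NumPy and matplotlib; the scores are made up, with one deliberate outlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores; in practice this would be one attribute of the data set.
scores = np.array([34, 39, 41, 44, 46, 46, 48, 52, 55, 59, 61, 90])

# Five-number summary: minimum, Q1, median (Q2), Q3, maximum.
q1, q2, q3 = np.percentile(scores, [25, 50, 75])
print("min:", scores.min(), " Q1:", q1, " median:", q2,
      " Q3:", q3, " max:", scores.max())
print("IQR:", q3 - q1)

# The box plot shows the same summary; points beyond 1.5 * IQR from the box
# (such as 90 here) are drawn as individual outliers by default.
plt.boxplot(scores, vert=False)
plt.title("Box plot of scores")
plt.show()
```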

BOXPLOTS EXAMPLE

EXPLORING RELATIONSHIP BETWEEN VARIABLES

▶ An important aspect of data exploration is to explore the relationships between attributes


• Scatter Plot:
• Two-way cross-tabulations:

▶ Scatter Plot:
• A scatter plot helps in visualizing a bi-variate relationship, i.e. the relationship between two
variables

• It is a two-dimensional plot in which points or dots are drawn on coordinates provided by the
values of the attributes

• We want to understand the impact of one variable on another variable

• attr_1 is said to be the independent variable and attr_2 the dependent variable

• It also helps to identify outliers
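
A minimal scatter-plot sketch with matplotlib; attr_1 and attr_2 below are synthetic, with one deliberate outlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up bivariate data: attr_1 (independent) and attr_2 (dependent).
rng = np.random.default_rng(0)
attr_1 = rng.uniform(0, 10, 50)
attr_2 = 3 * attr_1 + rng.normal(0, 2, 50)   # roughly linear relationship
attr_2[0] = 60                               # one deliberate outlier

plt.scatter(attr_1, attr_2)
plt.xlabel("attr_1 (independent variable)")
plt.ylabel("attr_2 (dependent variable)")
plt.title("Scatter plot of attr_2 vs attr_1")
plt.show()
```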

SCATTERPLOT EXAMPLE

TWO-WAY CROSS TABULATIONS

▶ Two-way cross-tabulations (also called cross-tab or contingency table) are used to understand
the relationship of two categorical attributes in a concise way

▶ It has a matrix format that presents a summarized view of the bi-variate frequency distribution

▶ It helps to understand how the data values of one attribute change with changes in the data
values of another attribute
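
A small cross-tabulation sketch using pandas.crosstab on invented categorical data:

```python
import pandas as pd

# Two hypothetical categorical attributes for a handful of students.
df = pd.DataFrame({
    "branch": ["CSE", "CSE", "ECE", "ECE", "CSE", "ECE", "CSE"],
    "placed": ["yes", "no",  "yes", "yes", "yes", "no",  "no"],
})

# Two-way cross-tab: a summarized view of the bivariate frequency distribution.
print(pd.crosstab(df["branch"], df["placed"], margins=True))
```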

SCATTERPLOT EXAMPLE

DATA QUALITY

▶ The success of machine learning depends largely on the quality of the data. Data of the right
quality helps achieve better prediction accuracy in the case of supervised learning

▶ Problems:
1. Certain data elements without a value or data with a missing value

2. Data elements having value surprisingly different from the other elements, which we term
as outliers

▶ Factors:
1. Incorrect sample set selection: The data may not reflect normal or regular quality due to
incorrect selection of sample set

2. Errors in data collection: result in missing values and outliers

DATA REMEDIATION

▶ Handling outliers:
1. Remove outliers: If the number of records which are outliers is not many, a simple
approach may be to remove them

2. Imputation: One other way is to impute the value with mean or median or mode. The
value of the most similar data element may also be used for imputation.

3. Capping: For values that lie outside the 1.5 × IQR limits, we can cap them by replacing
observations below the lower limit with the value of the 5th percentile, and those above the
upper limit with the value of the 95th percentile (see the sketch after this list).

▶ Handling missing values


• Eliminate records having a missing value of data elements

• Imputing missing values

• Estimate missing values
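
A sketch of IQR-based capping on a hypothetical pandas Series, following the 1.5 × IQR rule and the 5th/95th percentile replacement described above:

```python
import pandas as pd

# Hypothetical attribute with two extreme values.
s = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95, -40], dtype=float)

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: values below the lower limit become the 5th percentile,
# values above the upper limit become the 95th percentile.
p5, p95 = s.quantile(0.05), s.quantile(0.95)
capped = s.where(s >= lower, p5).where(s <= upper, p95)
print(capped.tolist())
```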

MISSING VALUES

How to identify the missing value in a dataset?
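
One common way, assuming the data is already loaded into a pandas DataFrame (the frame below is hypothetical), is to count the null entries per attribute:

```python
import numpy as np
import pandas as pd

# Hypothetical data set with a few missing entries (NaN).
df = pd.DataFrame({
    "age":    [21, np.nan, 25, 23, np.nan],
    "salary": [30000, 42000, np.nan, 38000, 45000],
})

# Missing values per attribute, and the rows containing any missing value.
print(df.isnull().sum())
print(df[df.isnull().any(axis=1)])
```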

DATA IMPUTATION MECHANISMS

▶ Next or Previous Value

▶ K Nearest Neighbors

▶ Maximum or Minimum Value

▶ Missing Value Prediction

▶ Most Frequent Value

▶ Average or Linear Interpolation

▶ (Rounded) Mean or Moving Average or Median Value

▶ Fixed Value
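
Two of these mechanisms in code, assuming scikit-learn's imputers and a made-up feature matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical feature matrix with missing entries.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [8.0, 9.0]])

# Mean-value imputation: each NaN is replaced by its column mean.
print(SimpleImputer(strategy="mean").fit_transform(X))

# K-nearest-neighbour imputation: each NaN is estimated from similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```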

DATA PREPROCESSING

▶ Dimensionality Reduction: refers to techniques that reduce the dimensionality of a data set
by creating new attributes that combine the original attributes.

▶ Feature Subset Selection

DIMENSIONALITY REDUCTION

Need?
▶ High-dimensional data sets need a large amount of computational space and time

▶ Irrelevant features degrade the performance of the ML model

▶ Dimensionality reduction helps in reducing irrelevance and redundancy in features

▶ It is easier to understand a model if the number of features involved in the learning activity is
small (traceability/explainability)

▶ Examples: PCA, SVD
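
A minimal PCA sketch, assuming scikit-learn and its bundled iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 4 original features

# Create 2 new attributes as linear combinations of the original 4 attributes.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("reduced shape:", X_reduced.shape)
print("variance explained by the 2 components:",
      pca.explained_variance_ratio_.sum())
```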

FEATURE SUBSET SELECTION

▶ Find out the optimal subset of the entire feature set which significantly
reduces computational cost without any major impact on the learning
accuracy
▶ Only features that are relevant and not redundant are selected

▶ All irrelevant features are eliminated while selecting the final feature
subset
▶ A feature is potentially redundant when the information contributed by the feature is more or
less the same as that of one or more other features
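
A minimal feature-subset-selection sketch using a univariate filter (SelectKBest from scikit-learn); this is one illustrative approach among many:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features whose ANOVA F-score against the target is highest;
# the remaining (less relevant) features are dropped from the subset.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_selected.shape)
```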

