Department of CSE,
LBRCE Mylavaram
August 2, 2023
CONTENT
2 Preparing to Model
2.1 Machine Learning Activities
2.2 Basic Types of Data in Machine Learning
Part I
ARTIFICIAL INTELLIGENCE, MACHINE LEARNING, AND DEEP LEARNING
ARTIFICIAL INTELLIGENCE
MACHINE LEARNING
Figure. ML Paradigm
MACHINE LEARNING - 2
A machine learning program discovers rules to execute a data-processing task, given examples of
what’s expected. A machine learning algorithm requires three components:
1. Input data points.
2. Examples of the expected output.
3. A way to measure whether the algorithm is doing a good job: the difference between the
algorithm’s current output and its expected output. The measurement is used as a feedback
signal to adjust the way the algorithm works. This adjustment step is what we call learning.
Machine-learning models are all about finding appropriate representations for their input
data—transformations of the data that make it more amenable to the task at hand, such as a
classification task.
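The feedback loop described above can be sketched in a few lines. This is a minimal illustration, not a method from the slides: the toy data, the single-weight model y = weight * x, the learning rate, and the use of gradient descent on the mean squared error are all assumptions chosen for brevity.

```python
# Toy data (assumed): inputs with their expected outputs; the hidden rule is y = 2x.
inputs = [1.0, 2.0, 3.0, 4.0]
expected = [2.0, 4.0, 6.0, 8.0]

def mean_squared_error(weight):
    """Measure how far the current outputs are from the expected outputs."""
    return sum((weight * x - y) ** 2 for x, y in zip(inputs, expected)) / len(inputs)

def learn(weight=0.0, learning_rate=0.01, steps=500):
    """Repeatedly adjust the weight, using the error as the feedback signal."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        gradient = sum(2 * (weight * x - y) * x for x, y in zip(inputs, expected)) / len(inputs)
        weight -= learning_rate * gradient  # the "adjustment step"
    return weight

learned_weight = learn()
print(round(learned_weight, 3), mean_squared_error(learned_weight))
```

The three components map directly onto the code: `inputs`, `expected`, and `mean_squared_error`.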
TYPES OF MACHINE LEARNING
1. Supervised learning – Also called predictive learning. A machine predicts the class of
unknown objects based on prior class-related information of similar objects.
2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in
unknown objects by grouping similar objects together.
3. Reinforcement learning – A machine learns to act on its own to achieve the given goals.
TERMINOLOGY
▶ Feature Vector: It is a set of multiple numeric features. We use it as an input to the machine
learning model for training and prediction purposes.
▶ Training: An algorithm takes a set of data known as “training data” as input. The learning
algorithm finds patterns in the input data and trains the model for expected results (target).
The output of the training process is the machine learning model.
▶ Prediction: Once the machine learning model is ready, it can be fed with input data to provide
a predicted output.
▶ Target (Label): The value that the machine learning model has to predict is called the target or
label.
SUPERVISED LEARNING
▶ Supervised learning is a class of problems that uses a model to learn the mapping between the
input and target variables. Applications whose training data describes both the input
variables and the target variable are known as supervised learning tasks.
▶ Let the set of input variables be x and the target variable be y. A supervised learning
algorithm tries to learn a hypothesis function f that maps the inputs to the target,
y = f(x).
▶ The learning process here is monitored, or supervised. Since we already know the expected output, the
algorithm is corrected each time it makes a prediction, to optimize the results. Models are fit on
training data, which consists of both the input and the output variables, and are then used to
make predictions on test data.
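As a rough sketch of learning y = f(x) from labelled data, the example below fits a straight line by least squares on training data and then predicts on unseen test inputs. The data values, the choice of a linear model, and the helper names `fit_line` and `predict` are illustrative assumptions, not part of the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (the learned f)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the learned mapping to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: inputs x with known targets y (here y = 3x + 1).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 4.0, 7.0, 10.0]
model = fit_line(train_x, train_y)

# Test data: inputs the model has not seen during training.
test_x = [4.0, 5.0]
predictions = [predict(model, x) for x in test_x]
print(model, predictions)
```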
CLASSIFICATION
UNSUPERVISED LEARNING
▶ In an unsupervised learning problem the model tries to learn by itself, recognizing patterns
and extracting the relationships among the data. Unlike in supervised learning, there is no
supervisor or teacher to drive the model.
▶ Unsupervised learning operates only on the input variables. There are no target variables to
guide the learning process. The goal here is to interpret the underlying patterns in the data in
order to gain a better understanding of the data. Two common tasks are:
1. Clustering – where the task is to find out the different groups in the data
2. Association Analysis – Association Analysis is simply a search through the data for
combinations of items whose statistics are interesting. It helps us establish rules dictating
something like "If A occurs then B is likely to occur as well."
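A rule of the form "if A occurs then B is likely to occur" can be scored with simple counts. The sketch below uses support and confidence, which are standard association-analysis measures but are not named in the slides; the transactions are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing A, the fraction that also contain B."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # how often bread and milk occur together
print(confidence({"bread"}, {"milk"}))  # strength of the rule bread -> milk
```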
REINFORCEMENT LEARNING
▶ Reinforcement learning is a type of problem in which an agent operates in an environment
and learns from the feedback, or reward, that the environment gives it in response to its
actions.
▶ The rewards could be either positive or negative. The agent then proceeds in the environment
based on the rewards gained.
▶ The reinforcement agent determines the steps to perform a particular task. There is no fixed
training dataset here and the machine learns on its own.
▶ Playing a game is a classic example of a reinforcement problem, where the agent’s goal is to
acquire a high score. The agent makes successive moves in the game based on the feedback given by
the environment, which may take the form of rewards or penalties. Reinforcement learning
has shown tremendous results in Google’s AlphaGo, which defeated the world’s
number one Go player.
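The agent, environment, and reward loop can be illustrated with a tiny example. The sketch below uses tabular Q-learning, which is one common reinforcement learning algorithm and is not named in the slides. The five-position corridor, the reward of 1 at the goal, and the hyperparameters are all assumptions.

```python
import random

N_STATES = 5                 # positions 0..4 on a line; position 4 is the goal
ACTIONS = ("left", "right")

def step(state, action):
    """Environment: move the agent and return (next_state, reward)."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from interaction; there is no fixed training set."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore by acting randomly
            next_state, reward = step(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

q_table = train()
# Greedy policy: in each non-goal state, pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The learned policy should prefer moving toward the goal, because actions that lead to the reward sooner accumulate higher values.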
COMPARISON
APPLICATIONS OF MACHINE LEARNING
▶ Financial Services
▶ Healthcare
MACHINE LEARNING PROCESS
MODEL PREPARATION
▶ Explore the relationships amongst the data elements, e.g. inter-feature relationship.
▶ Do the necessary remediation, e.g. impute missing data values, etc., if needed.
▶ Apply pre-processing steps, as necessary. Once the data is prepared for modelling, the
learning tasks start:
1. The input data is first divided into parts – the training data and the test data (called
holdout). This step is applicable for supervised learning only.
2. Consider different models or learning algorithms for selection.
3. Train the model based on the training data for a supervised learning problem and apply it to
unknown data. Directly apply the chosen unsupervised model on the input data for an
unsupervised learning problem.
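The holdout step can be sketched as a shuffle followed by a cut. The 25% test fraction, the fixed seed, and the function name `holdout_split` are assumptions made for this illustration; the slides do not prescribe a split ratio.

```python
import random

def holdout_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records and split it into (train, test) parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 20 records, identified here by index.
records = list(range(20))
train_data, test_data = holdout_split(records)
print(len(train_data), len(test_data))
```

Shuffling before the cut avoids a test set made up only of the last records, which matters when the data is stored in some meaningful order.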
ACTIVITIES IN MACHINE LEARNING
2 Learning
▶ Data partitioning, Model Selection,
Cross-validation
3 Performance Evaluation
▶ Examining the model performance
▶ Visualization
4 Performance Improvement
▶ Tuning, Ensembling
TYPES OF DATA IN MACHINE LEARNING
▶ For example, we may have a data set on students in which each record consists of information
about a specific student
▶ Each data set also has multiple attributes, each of which gives information on a specific
characteristic
1.2 Ordinal: Ordinal data also assigns named values to attributes but unlike nominal data,
they can be arranged in a sequence of increasing or decreasing value so that we can say
whether a value is better than or greater than another value.
2.2 Ratio: Ratio data represents numeric data for which the exact value can be measured. An
absolute zero is available for ratio data. These variables can also be added, subtracted,
multiplied, or divided. Central tendency can be measured by the mean, median, or mode, and
dispersion by measures such as the standard deviation.
EXPLORING STRUCTURE OF DATA
• Data Spread
• Data Dispersion
UNDERSTANDING CENTRAL TENDENCY
▶ Measures of central tendency help us understand the central point of a set of data
▶ Mean: the sum of all data values divided by the count of data elements
▶ Median: the value of the element appearing in the middle of an ordered list of data elements
▶ $\text{Mean} = \frac{21 + 89 + 34 + 67 + 96}{5} = 61.4$
▶ Mean and median are impacted differently by data values appearing at the beginning or at the
end of the range
▶ Mean is sensitive to outliers. It shifts even when only a small number of outliers are
present
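The different sensitivity of mean and median can be checked with Python's standard `statistics` module. The five values are the ones used in the mean example above; the extra value 1000 is a hypothetical outlier added for illustration.

```python
from statistics import mean, median

values = [21, 89, 34, 67, 96]
with_outlier = values + [1000]  # hypothetical outlier

print(mean(values), median(values))              # mean 61.4, median 67
print(mean(with_outlier), median(with_outlier))  # mean jumps, median barely moves
```

Adding one extreme value more than triples the mean, while the median only moves from 67 to 78.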
UNDERSTANDING DATA SPREAD
▶ Dispersion
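1. Variance
\[
\mathrm{Variance}(x) = \frac{\sum_{i=1}^{n} x_i^2}{n} - \left( \frac{\sum_{i=1}^{n} x_i}{n} \right)^2 \tag{1}
\]
2. Standard Deviation
\[
\mathrm{StandardDeviation}(x) = \sqrt{\mathrm{Variance}(x)} \tag{2}
\]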
Important Point
A larger value of variance or standard deviation indicates more dispersion in the data, and
vice versa.
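As a quick check of the variance formula (mean of the squares minus the square of the mean), the sketch below computes it directly and compares it with `statistics.pvariance`. The data values are reused from the mean example purely for illustration.

```python
from math import sqrt
from statistics import pvariance

def variance(values):
    """Population variance: mean of squares minus square of the mean."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) / n) ** 2

def standard_deviation(values):
    """Square root of the variance."""
    return sqrt(variance(values))

values = [21, 89, 34, 67, 96]
print(variance(values), pvariance(values))  # both approximately 874.64
print(standard_deviation(values))
```

Note that this is the population form, dividing by n; the sample variance divides by n - 1 instead.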
2. Q1 (Quartile 1)
3. Q2 (Median)
4. Q3 (Quartile 3)
EXAMPLE
BOXPLOTS
▶ A box plot is an extremely effective mechanism to get a one-shot view and understand the
nature of the data.
▶ Box plots enable us to study the distributional characteristics of a group of scores as well as
the level of the scores.
▶ The box plot (also called box and whisker plot) gives a standard visualization of the
five-number summary statistics of a data set, namely minimum, first quartile (Q1), median (Q2),
third quartile (Q3), and maximum.
▶ Median: The median (middle quartile) marks the mid-point of the data and is shown by the
line that divides the box into two parts. Half the scores are greater than or equal to this value
and half are less.
▶ Inter-quartile range (IQR): The middle “box” represents the middle 50% of scores for the
group. The range of scores from lower to upper quartile is referred to as the inter-quartile
range. The middle 50% of scores fall within the inter-quartile range.
▶ Upper quartile: Seventy-five percent of the scores fall below the upper quartile.
▶ Lower quartile: Twenty-five percent of scores fall below the lower quartile.
▶ Whiskers: The upper and lower whiskers represent scores outside the middle 50%. Whiskers
often (but not always) stretch over a wider range of scores than the middle quartile groups.
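The quantities a box plot draws can be computed with the standard library. This is a sketch under one assumption worth stating: quartile conventions differ between tools, and `method="inclusive"` in `statistics.quantiles` is only one of them. The scores are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole population, so the
    # minimum and maximum count as the 0th and 100th percentiles.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

scores = [21, 89, 34, 67, 96]
minimum, q1, q2, q3, maximum = five_number_summary(scores)
iqr = q3 - q1  # width of the box: the middle 50% of the scores
print(minimum, q1, q2, q3, maximum, iqr)
```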
BOXPLOTS EXAMPLE
EXPLORING RELATIONSHIP BETWEEN VARIABLES
▶ Scatter Plot:
• A scatter plot helps in visualizing bi-variate relationship, i.e. relationship between two
variables
• attr_1 is said to be the independent variable and attr_2 as the dependent variable
SCATTERPLOT EXAMPLE
TWO-WAY CROSS TABULATIONS
▶ Two-way cross-tabulations (also called cross-tab or contingency table) are used to understand
the relationship of two categorical attributes in a concise way
▶ It has a matrix format that presents a summarized view of the bi-variate frequency distribution
▶ It helps us understand how much the data values of one attribute change with a change in the
data values of another attribute
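A two-way cross-tabulation is just a count of how often each pair of category values occurs together. The sketch below builds one with `collections.Counter`; the attributes (section and exam result) and the records are hypothetical.

```python
from collections import Counter

def cross_tab(pairs):
    """Build a contingency table from (row_value, column_value) pairs."""
    counts = Counter(pairs)
    row_values = sorted({row for row, _ in pairs})
    col_values = sorted({col for _, col in pairs})
    table = [[counts[(r, c)] for c in col_values] for r in row_values]
    return row_values, col_values, table

# Hypothetical records: (section, exam result).
records = [
    ("A", "pass"), ("B", "fail"), ("A", "pass"), ("B", "pass"),
    ("A", "fail"), ("B", "fail"), ("A", "pass"),
]
rows, cols, table = cross_tab(records)

print("     ", cols)
for name, counts_row in zip(rows, table):
    print(name, "   ", counts_row)
```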
SCATTERPLOT EXAMPLE
DATA QUALITY
▶ The success of machine learning depends largely on the quality of the data. Data of the right
quality helps to achieve better prediction accuracy in the case of supervised learning
▶ Problems:
1. Certain data elements without a value or data with a missing value
2. Data elements having value surprisingly different from the other elements, which we term
as outliers
▶ Factors:
1. Incorrect sample set selection: The data may not reflect normal or regular quality due to
incorrect selection of sample set
DATA REMEDIATION
▶ Handling outliers:
1. Remove outliers: If the number of records which are outliers is not many, a simple
approach may be to remove them
2. Imputation: One other way is to impute the value with mean or median or mode. The
value of the most similar data element may also be used for imputation.
3. Capping: For values that lie outside the 1.5 × IQR limits, we can cap them by replacing
those observations below the lower limit with the value of the 5th percentile and those that lie
above the upper limit with the value of the 95th percentile.
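The capping rule can be sketched as follows. The percentile helper uses linear interpolation between neighbouring sorted values, which is an assumption, since the slides do not say which percentile convention to use. The data values are hypothetical.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of already sorted data."""
    position = (len(sorted_values) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction

def cap_outliers(values):
    """Replace values outside the 1.5 x IQR limits with the 5th/95th percentile."""
    ordered = sorted(values)
    q1, q3 = percentile(ordered, 25), percentile(ordered, 75)
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    p5, p95 = percentile(ordered, 5), percentile(ordered, 95)
    return [p5 if v < lower_limit else p95 if v > upper_limit else v for v in values]

# Hypothetical data with one obvious outlier (100).
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 100]
capped = cap_outliers(values)
print(capped)
```

With so few points the 95th percentile is itself pulled upward by the outlier, so the capped value is still large; on bigger samples the effect is less pronounced.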
MISSING VALUES
DATA IMPUTATION MECHANISMS
▶ K Nearest Neighbors
▶ Fixed Value
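Two of the simpler imputation strategies can be sketched in a few lines. Missing entries are represented here as `None`, and the function name `impute` and its parameters are hypothetical. K-nearest-neighbour imputation is omitted because it needs a distance measure over the other attributes.

```python
from statistics import mean, median

def impute(values, strategy="mean", fill_value=0):
    """Fill None entries using the mean, the median, or a fixed value."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        replacement = mean(observed)
    elif strategy == "median":
        replacement = median(observed)
    elif strategy == "fixed":
        replacement = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]

# Hypothetical attribute with two missing values.
ages = [20, None, 30, 40, None]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "fixed", fill_value=-1))
```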
DATA PREPROCESSING
DIMENSIONALITY REDUCTION
Need?
▶ High-dimensional data sets need a high amount of computational
space and time
▶ Irrelevant features degrade the performance of the ML model
FEATURE SUBSET SELECTION
▶ Find out the optimal subset of the entire feature set which significantly
reduces computational cost without any major impact on the learning
accuracy
▶ Only features which are relevant and not redundant are selected
▶ All irrelevant features are eliminated while selecting the final feature
subset
▶ A feature is potentially redundant when the information it contributes is more or less the
same as that contributed by one or more other features
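One possible way to flag potentially redundant features is to look for pairs that are almost perfectly correlated. This is only a sketch: Pearson correlation is an assumed measure of "contributing the same information", the 0.95 threshold is arbitrary, and the feature names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def redundant_pairs(features, threshold=0.95):
    """Return feature pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(features, 2)
        if abs(pearson(features[a], features[b])) >= threshold
    ]

# Hypothetical data set: height is recorded twice, in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [30, 22, 41, 25, 35],
}
print(redundant_pairs(features))
```

Here one of the two height columns could be dropped with little loss of information, which is the idea behind eliminating redundant features.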