Machine Learning (B21EP0502)
Dr. Vidyasagar K N
COURSE OBJECTIVES:
Through the use of statistical methods, algorithms are trained to make classifications or predictions,
uncovering key insights within data mining projects.
These insights subsequently drive decision making within applications and businesses, ideally impacting key
growth metrics.
As big data continues to expand and grow, the market demand for data scientists will increase, requiring
them to assist in the identification of the most relevant business questions and subsequently the data to
answer them.
• Traditional Programming: Data + Program → Computer → Output
• Machine Learning: Data + Output → Computer → Program
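The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the rule `2 * x + 1` and the helper names `handwritten_rule`, `fit_line` and `learned_program` are assumptions chosen for the example. In the first case a person writes the program; in the second, the program (a slope and an intercept) is estimated from data and outputs.

```python
# Traditional programming: a person writes the rule (the "program") by hand.
def handwritten_rule(x):
    return 2 * x + 1


# Machine learning: the rule is estimated from example inputs and outputs.
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


data = [0, 1, 2, 3, 4]                         # inputs
output = [handwritten_rule(x) for x in data]   # observed outputs
slope, intercept = fit_line(data, output)      # the learned "program"


def learned_program(x):
    return slope * x + intercept


print(slope, intercept, learned_program(10))
```

Here the learner never sees the rule itself, only example pairs, yet it recovers a function that behaves the same way on new inputs.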
Machine Learning Real Examples
If you have used Netflix, you know that it recommends movies or shows
based on what you have watched earlier. Machine learning is used to make
these recommendations and to select the titles that match your
preferences, using your earlier viewing data.
When you upload a photo to Facebook, it can recognize a person in that
photo and suggest mutual friends to you. ML is used for these predictions.
It uses data such as your friend list and the photos available, and makes
predictions based on that.
Some software shows how you will look when you get older. This image
processing also uses machine learning.
Types of Machine Learning
SUPERVISED LEARNING
• Supervised learning is when the model is trained on a labelled dataset.
A labelled dataset is one that has both input and output parameters. In
this type of learning, both the training and validation datasets are
labelled, as shown in the figures below.
• Training the system: While training the model, the data is commonly split
in the ratio 80:20, i.e. 80% as training data and the rest as testing data.
• In the training data, we feed the input as well as the output for that
80% of the data.
• The model learns from the training data only. We use different machine
learning algorithms (which we will discuss in detail in the next articles)
to build our model.
• Learning means that the model builds some logic of its own.
Once the model is ready, it can be tested.
• At the time of testing, the input is fed from the remaining 20% of the
data, which the model has never seen before. The model predicts some value,
and we compare it with the actual output to calculate the accuracy.
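The 80:20 procedure described above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: the data set, the toy threshold "model" and the function names `train_test_split`, `fit_threshold` and `accuracy` are hypothetical and do not come from the slides.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the labelled rows and cut them into training and testing parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_threshold(train):
    """Toy model: a threshold halfway between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, test):
    """Fraction of unseen rows whose predicted label matches the actual one."""
    correct = sum(1 for x, y in test if (1 if x > threshold else 0) == y)
    return correct / len(test)


# Labelled data: each row is (input x, output y).
data = [(x, 0) for x in range(0, 50)] + [(x, 1) for x in range(50, 100)]

train, test = train_test_split(data)   # 80 rows for training, 20 for testing
threshold = fit_threshold(train)       # the model learns from training data only
acc = accuracy(threshold, test)        # evaluated on rows it has never seen
print(len(train), len(test), acc)
```

The key design point is that `fit_threshold` never touches `test`, so the reported accuracy reflects behaviour on unseen data.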
SUPERVISED LEARNING
CLASSIFICATION AND REGRESSION
• Linear Regression (used for regression, i.e. predicting continuous values)
• Logistic Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest
UNSUPERVISED LEARNING
In a driverless car, training data is fed to the algorithm, such as how to
drive on a highway or on busy and narrow streets, with factors like the
speed limit, parking, stopping at signals, etc.
After that, a logical and mathematical model is created from this data, and
the car then operates according to that model.
Also, the more data is fed, the more efficient the output becomes.
Steps for Designing a Learning System:
STEP 1- CHOOSING THE TRAINING EXPERIENCE
• The first and most important task is to choose the training data or
training experience that will be fed to the machine learning algorithm.
• It is important to note that the data or experience we feed to the
algorithm has a significant impact on the success or failure of the model.
• So the training data or experience should be chosen wisely.
ATTRIBUTES WHICH WILL IMPACT ON SUCCESS AND
FAILURE OF DATA
• For example: while playing chess, the training data provides feedback
such as: if this move is chosen instead of that one, the chances of success
increase.
• The second important attribute is the degree to which the learner controls
the sequence of training examples.
• For example: when training data is first fed to the machine, its accuracy
is very low, but as it gains experience by playing again and again against
itself or an opponent, the algorithm gets feedback and controls the chess
game accordingly.
STEP 2- CHOOSING THE TARGET FUNCTION
• For example: while playing chess, after the opponent plays, the machine
learning algorithm decides which of the possible legal moves to take in
order to succeed.
STEP 3- CHOOSING REPRESENTATION FOR TARGET
FUNCTION
• Once the algorithm knows all the possible legal moves, the next step is to
choose the optimized move using some representation, e.g. linear equations,
a hierarchical graph representation, tabular form, etc.
• The NextMove function then picks the target move, i.e. the one among these
moves that gives the highest success rate.
• For example: if the machine has 4 possible moves while playing chess, it
will choose the optimized move that leads to success.
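One of the representations listed above, a linear equation, can be sketched as follows. Everything specific here is an assumption made for illustration: the three board features, the weight values and the helper names `evaluate` and `next_move` are hypothetical, and in a real learner the weights would be adjusted from experience rather than fixed by hand.

```python
# Hypothetical features describing the position after each candidate move:
# (material_advantage, own_pieces_threatened, king_safety)
WEIGHTS = (1.0, -0.5, 0.3)   # assumed weights; a learner would tune these


def evaluate(features, weights=WEIGHTS, bias=0.0):
    """Linear target-function representation: V = w0 + sum(w_i * x_i)."""
    return bias + sum(w * x for w, x in zip(weights, features))


def next_move(candidates):
    """Return the name of the move whose resulting position scores highest."""
    return max(candidates, key=lambda name: evaluate(candidates[name]))


# Four possible legal moves, as in the example above.
candidates = {
    "move_a": (1, 2, 0),
    "move_b": (0, 0, 1),
    "move_c": (2, 1, 1),
    "move_d": (1, 0, 0),
}
best = next_move(candidates)
print(best, evaluate(candidates[best]))
```

A tabular or graph representation would replace `evaluate` while leaving `next_move` unchanged, which is why the choice of representation is treated as its own design step.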
STEP 4- CHOOSING FUNCTION APPROXIMATION
ALGORITHM
The final design emerges after the system has worked through many
examples, failures and successes, correct and incorrect decisions, and what
the next step should be.
Example: Deep Blue, IBM's chess-playing computer, won a match against world
chess champion Garry Kasparov in 1997, becoming the first computer to defeat
a reigning world champion in a match.
ISSUES IN MACHINE LEARNING
• What algorithms exist for learning general target functions from specific
training examples?
• In what settings will algorithms converge to the desired function, given
sufficient training data?
• Which algorithms perform best for which types of problems and
representations?
• When and how can prior knowledge held by the learner guide the process
of generalizing from examples?
• Can prior knowledge be helpful even when it is only approximately
correct?
• What is the best strategy for choosing a useful next training experience,
and how does the choice of this strategy alter the complexity of the
learning problem?
• What is the best way to reduce the learning task to one or more function
approximation problems?
• Put another way, what specific functions should the system attempt to
learn? Can this process itself be automated?
• How can the learner automatically alter its representation to improve its
ability to represent and learn the target function?
WHY DATA PREPROCESSING?
• Data integration
  • Integration of multiple databases, data cubes, files, or notes
• Data transformation
  • Normalization (scaling to a specific range)
  • Aggregation
• Data reduction
  • Obtains reduced representation in volume but produces the same or similar
    analytical results
• Data discretization: with particular importance, especially for numerical data
  • Data aggregation, dimensionality reduction, data compression, generalization
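As one concrete instance of the transformation tasks above, normalization to a specific range can be sketched with min-max scaling. The slides do not name a particular formula, so min-max scaling is an assumption here, as are the function name `min_max_scale` and the sample income values.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


incomes = [20000, 35000, 50000, 80000]   # illustrative data
scaled = min_max_scale(incomes)
print(scaled)
```

Scaling features to a common range keeps attributes with large raw magnitudes from dominating distance-based methods.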
Forms of data preprocessing
DATA CLEANING
• Ignore the tuple: usually done when class label is missing (assuming the task is
classification—not effective in certain cases)
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean for all samples of the same class to fill in the
missing value: smarter
• Use the most probable value to fill in the missing value: inference-based such
as regression, Bayesian formula, decision tree
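The third strategy above, filling a missing value with the attribute mean of samples from the same class, might look like the following sketch. The row layout `(value, class_label)`, the use of `None` for a missing value, and the name `impute_by_class_mean` are assumptions made for illustration.

```python
from collections import defaultdict


def impute_by_class_mean(rows):
    """rows: list of (value_or_None, class_label). Returns a new, filled list."""
    totals = defaultdict(lambda: [0.0, 0])       # label -> [sum, count]
    for value, label in rows:
        if value is not None:
            totals[label][0] += value
            totals[label][1] += 1
    means = {label: s / n for label, (s, n) in totals.items()}
    # A class with no observed values has no mean, so its gaps stay None.
    return [
        (means.get(label) if value is None else value, label)
        for value, label in rows
    ]


rows = [(10.0, "a"), (None, "a"), (20.0, "a"), (100.0, "b"), (None, "b")]
filled = impute_by_class_mean(rows)
print(filled)
```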
NOISY DATA
Q: What is noise?
A: Random error in a measured variable.
• Incorrect attribute values may be due to
✓ faulty data collection instruments
✓ data entry problems
✓ data transmission problems
✓ technology limitation
✓ inconsistency in naming convention
• Other data problems that require data cleaning
✓ duplicate records
✓ incomplete data
✓ inconsistent data
HOW TO HANDLE NOISY DATA?
• Binning method:
✓ first sort data and partition into (equi-depth) bins
✓ then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
✓ used also for discretization
• Clustering
✓ detect and remove outliers
• Semi-automated method: combined computer and human
inspection
✓ detect suspicious values and check manually
• Regression
✓ smooth by fitting the data into regression functions
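The binning method listed above can be sketched as follows: sort the data, partition it into equi-depth bins, then smooth each bin by its mean or by its boundaries. The price values and the function names are illustrative assumptions, and the sketch puts any remainder into the last bin when the data does not divide evenly.

```python
def equi_depth_bins(values, n_bins):
    """Sort the values and split them into bins holding equal numbers of items."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])   # last bin takes any remainder
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # illustrative data
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_bounds)
```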
DATA VISUALIZATION
• Data visualization is an important skill to possess for anyone trying to
extract and communicate insights from data.
• Great business narratives and presentations often stem from brilliant
visualizations that convey the key ideas in a concise and aesthetic manner.
• In the field of machine learning, visualization plays a key role throughout
the entire process of analysis - to obtain relationships, observe trends and
portray the results as well.
NECESSITY OF DATA VISUALIZATION
• It is difficult for the human eye to decipher patterns from raw numbers
only.
• Sometimes, even the statistical information summarized from the data
may mislead you to wrong conclusions.
• Therefore, you should visualize the data often to understand how different
features are behaving.
DATA VISUALIZATION-RETAIL STORE SALES EXAMPLE
• Graphics and visuals, when used intelligently and innovatively, can convey
a lot more than what raw data alone can.
• Matplotlib serves the purpose of providing multiple functions to build
graphs from the data stored in your lists, arrays, etc.
• … variable. Some key industries and services that rely on line graphs
include financial markets and weather forecasting.
• plt.plot(x_axis, y_axis)
• Box plots are quite effective in summarizing the spread of a large data set
into a visual representation. They use percentiles to divide the data range.
• The percentile value gives the proportion of the data points that fall
below a chosen data point when all the data points are arranged in
ascending order.
• For example, if a data point with a value of 700 has a percentile value of
99% in a data set, then it means that 99% of the values in the data set are
less than 700.
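The percentile definition above can be checked with a short sketch that counts how many values fall strictly below a chosen data point. The function name `percentile_rank` and the sample data are assumptions for illustration.

```python
def percentile_rank(values, x):
    """Percentage of the values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100.0 * below / len(values)


data = list(range(1, 101))          # the integers 1..100
top = percentile_rank(data, 100)    # share of values below the largest one
mid = percentile_rank(data, 51)     # share of values below 51
print(top, mid)
```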
BOX PLOT
• A Box and Whisker Plot (or Box Plot) is a convenient way of visually
displaying the data distribution through their quartiles.
• The lines extending parallel from the boxes are known as the “whiskers”,
which are used to indicate variability outside the upper and lower
quartiles.
• Outliers are sometimes plotted as individual dots that are in-line with
whiskers. Box Plots can be drawn either vertically or horizontally.
• A scatter plot can help correlate two variables, whereas a bubble chart
adds one more dimension, i.e., the size of the bubble (usually indicative of
the frequency of occurrence of that particular data point)
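The quantities a box plot draws can be computed directly, as in the sketch below. It uses `statistics.quantiles` from the Python standard library for the quartiles. The rule that whiskers stop at 1.5 times the interquartile range, with points beyond them treated as outliers, is a common plotting convention and an assumption here, since the slides do not specify one.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends and outliers for a box-and-whisker plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: fences at 1.5 * IQR beyond the quartiles.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
print(summary)
```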
CHOOSING PLOT TYPES- DISTRIBUTION
• A distribution chart tries to answer the question ‘How is the data
distributed?’.
• For example, suppose you asked everyone their age in a survey.
• Using a distribution chart will help you visualize the distribution of ages in
the data set.
• The distribution can be over a variable, or it can also be over a period of
time. Two of the most used charts for visualizing distribution are as
follows:
• Histogram
• Scatter plots
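For the age survey example above, the counts behind a histogram can be sketched without any plotting library. The ages, the bin width of 10 and the function name `histogram_counts` are illustrative assumptions.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


ages = [18, 22, 25, 31, 34, 35, 41, 47, 52, 58, 63]   # illustrative survey data
counts = histogram_counts(ages, bin_width=10)

# A text rendering: one '#' per respondent in each age band.
for left, n in counts.items():
    print(f"{left}-{left + 9}: {'#' * n}")
```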
Supervised Learning Examples
• Face Detection
• Language Parsing
• Structured Prediction
In each case the output is modelled as a function of the input, e.g. cat = 𝑓(image).
Supervised Learning – k-Nearest Neighbors
[Figure: a query image is compared with labelled training images (cat, dog, bear). With k=3, the nearest neighbours in the first example are "cat, cat, dog" (majority: cat); in the second example they are "bear, dog, dog" (majority: dog).]
•How do we choose the right K?
•How do we choose the right features?
•How do we choose the right distance metric?
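A minimal k-nearest-neighbours classifier makes these three choices explicit. In the sketch below, k is 3, the features are hypothetical 2-D points standing in for image features, and the distance metric is Euclidean distance via `math.dist`. All of these, and the name `knn_predict`, are assumptions for illustration rather than settings taken from the slides.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label). Majority vote among the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors with their labels.
train = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"), ((0.9, 1.3), "cat"),
    ((5.0, 5.0), "dog"), ((5.2, 4.8), "dog"),
    ((9.0, 1.0), "bear"), ((8.8, 1.2), "bear"),
]

first = knn_predict(train, (1.1, 1.0))    # neighbours: cat, cat, cat
second = knn_predict(train, (5.1, 5.0))   # neighbours: dog, dog, bear
print(first, second)
```

Changing `k`, the feature vectors, or the distance function changes which neighbours vote, which is why each of the three questions matters.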
Unsupervised Learning – k-means clustering (k=3)
1. Initially assign all images to a random cluster.
2. Compute the mean image (in feature space) for each cluster.
3. Reassign images to clusters based on similarity to cluster means.
4. Keep repeating this process until convergence.
•How do we choose the right K?
•How do we choose the right features?
•How do we choose the right distance metric?
•How sensitive is this method with respect to the random
assignment of clusters?
Answer: Just choose the one combination that works best!
BUT not on the test data.
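The four k-means steps (random initial assignment, cluster means, reassignment, repeat until convergence) can be sketched as below. This is a simplified illustration under stated assumptions: points are plain tuples rather than images, similarity is Euclidean distance, and an empty cluster is re-seeded with a random point, a detail the slides do not cover. The names `k_means` and `mean_point` are hypothetical.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, k=3, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: assign every point to a random cluster.
    assignment = [rng.randrange(k) for _ in points]
    means = []
    for _ in range(max_iter):
        # Step 2: compute the mean of each cluster in feature space.
        means = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            means.append(mean_point(members) if members else rng.choice(points))
        # Step 3: reassign each point to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        # Step 4: repeat until the assignment stops changing.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, means


points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.8)]
assignment, means = k_means(points, k=3)
print(assignment)
print(means)
```

Running the function with different `seed` values gives a feel for how sensitive the result is to the random initial assignment, which is the last question raised above.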
Supervised Learning - Classification
[Figure: labelled example images (cat, dog, cat, …, bear) divided into Training Data and Test Data.]
Training Data
𝑥1 = [image]   𝑦1 = [cat]
𝑥2 = [image]   𝑦2 = [dog]
𝑥3 = [image]   𝑦3 = [cat]
⋮
𝑥𝑛 = [image]   𝑦𝑛 = [bear]
Training Data: inputs 𝑥; targets / labels / predictions 𝑦
We need to find a function that