4) Machine learning is very similar to data mining, as it also deals with huge amounts of data.
In supervised learning, you train the machine using data which is well "labelled." This means the
training data is already tagged with the correct answer. A supervised learning algorithm learns from
labelled training data and helps you predict outcomes for unseen data.
• Prediction of disease
The goal of a regression model is to build a mathematical equation that defines y as a function of
the x variables. Examples:
• Weather forecast
Unsupervised learning is a machine learning technique, where you do not need to supervise the
model. Instead, you need to allow the model to work on its own to discover information. It mainly
deals with the unlabelled data.
Unsupervised learning is a type of machine learning in which models are trained on an unlabelled
dataset and allowed to act on that data without any supervision.
• Unsupervised learning is helpful for finding useful insights from the data.
• Unsupervised learning resembles the way a human learns to think from their own experience,
which makes it closer to real AI.
• In the real world, we do not always have input data with the corresponding output, so to solve
such cases we need unsupervised learning.
o K-means clustering
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Apriori algorithm
o Unsupervised learning is used for more complex tasks as compared to supervised learning
because, in unsupervised learning, we don't have labelled input data.
o Unsupervised learning is intrinsically more difficult than supervised learning as it does not
have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data is not
labelled, and algorithms do not know the exact output in advance.
Regression Analysis:
Dependent Variable
Independent Variable
Outliers
Multicollinearity
Ø By performing the regression, we can confidently determine the most important factor, the
least important factor, and how each factor is affecting the other factors
Linear Regression:
Ø Salary forecasting
When working with linear regression, our main goal is to find the best fit line, which means the
error between the predicted values and the actual values should be minimized. The best fit line has
the least error.
Residuals: The distance between the actual value yi and the predicted value is called the residual.
Gradient Descent: A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
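As a minimal sketch of the idea, the following fits a line y = a0 + a1 * x by gradient descent on the mean squared error. The names a0, a1, learning_rate and epochs, the choice of mean squared error as the cost function, and the sample data are assumptions made for illustration; they are not taken from the notes.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = a0 + a1 * x by repeatedly stepping against the MSE gradient."""
    a0, a1 = 0.0, 0.0  # start from a flat line through the origin
    n = len(xs)
    for _ in range(epochs):
        # Error of the current line on each point: predicted minus actual.
        errors = [(a0 + a1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. a0 and a1.
        grad_a0 = (2.0 / n) * sum(errors)
        grad_a1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Update the coefficients in the direction that reduces the cost.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = gradient_descent(xs, ys)
```

On this toy data the coefficients approach an intercept of 1 and a slope of 2. A learning rate that is too large makes the updates diverge, so the value used here is only suitable for inputs of this scale.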
Model Performance: The goodness of fit determines how well the regression line fits the set
of observations. The process of finding the best model out of various models is called optimization.
It can be achieved by the method below:
R-squared method:
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high R-squared value indicates a smaller difference between the predicted values and the
actual values and hence represents a good model.
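A small sketch of the R-squared calculation, using the standard definition of one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """Return R-squared = 1 - SS_res / SS_tot for paired observations."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the model.
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Total variation of the actual values around their mean.
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]
score = r_squared(actual, predicted)  # multiply by 100 for the 0-100% scale
```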
Regression Analysis…:
Simple Linear Regression:
3. Splitting the data set into training set and testing set
o A linear relationship should exist between the Target and predictor variables.
1. To identify the strength of the effect that the independent variables have on a
dependent variable
3. To guess the precise values / trends (price of gold after 6 months from now)
A low P value suggests that your sample provides enough evidence that you can reject
the null hypothesis for the entire population.
Step-1: Firstly, we need to select a significance level to stay in the model. (SL=0.05)
Step-2: Fit the complete model with all possible predictors/independent variables.
Step-3: Choose the predictor which has the highest P-value, such that.
Step-5: Rebuild and fit the model with the remaining variables
2) Heteroskedasticity
4. Splitting the dataset into the Training set and Test set
8. End
What is Model Selection?
Model selection is the process of selecting one final machine learning model from among a
collection of candidate machine learning models for a training dataset.
• Model selection is a process that can be applied across different types of models (e.g.
logistic regression, SVM, KNN, etc.)
7. End
Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes;
coronary heart disease), based on observed characteristics of the patient (age, sex, body mass
index, results of various blood tests, etc.). Another example might be to predict whether an Indian
voter will vote BJP or Trinamool Congress or Congress, based on age, income, sex, state of
residence, votes in previous elections, etc.
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
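A minimal sketch of the sigmoid function, 1 / (1 + e^(-z)), is shown below. The function name and the split into two branches (used only to avoid floating-point overflow for large negative inputs) are choices made for this illustration.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids computing exp of a large positive number.
    e = math.exp(z)
    return e / (1.0 + e)


# A score of 0 maps to a probability of 0.5; larger scores map closer to 1.
midpoint = sigmoid(0.0)
high = sigmoid(5.0)
low = sigmoid(-5.0)
```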
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
• Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "Low", "Medium", or "High".
3. Splitting the dataset into the Training set and Test set
10. End
Logistic Regression (PLR) with some Mathematical background and case study:
The odds are defined as the probability that the event will occur divided by the probability that the
event will not occur. Unlike probability, the odds are not constrained to lie between 0 and 1, but can
take any value from zero to infinity.
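The definition of odds can be checked with a short sketch. The function names are hypothetical; the second function simply inverts the first.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def probability_from_odds(o):
    """Recover the probability from the odds: o / (1 + o)."""
    return o / (1.0 + o)


even = odds(0.5)        # equally likely to occur or not
likely = odds(0.75)     # three times as likely to occur as not
back = probability_from_odds(likely)
```

As the probability approaches 1 the odds grow without bound, which is why odds are not limited to the range 0 to 1.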
Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model
(or "classifier") on a set of test data for which the true values are known.
• There are four possibilities with regards to the cricket match win/loss prediction
1) The model predicted win and the team won- TP-True Positive
2) The model predicted win and the team lost- FP-False Positive
3) The model predicted loss and the team won- FN-False Negative
4) The model predicted loss and the team lost- TN-True Negative
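The four outcomes above can be counted with a short sketch. The function name, the "win"/"loss" label strings, and the sample match results are hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="win"):
    """Count TP, FP, FN and TN, treating `positive` as the positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted win: correct if the team actually won.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted loss: wrong if the team actually won.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical results for six matches.
actual = ["win", "win", "loss", "loss", "win", "loss"]
predicted = ["win", "loss", "win", "loss", "win", "loss"]
counts = confusion_counts(actual, predicted)
```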
Decision Tree Classification Algorithm:
• In a Decision tree, there are two types of nodes, which are the Decision Node and Leaf Node
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation
• A decision tree can contain categorical data (YES/NO) as well as numeric data.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify the
nodes, and call the final node a leaf node.
o Information Gain
o Gini Index
1. Information Gain:
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first
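The notes do not show the formula, so the sketch below assumes the usual definition: information gain is the entropy of the parent node minus the weighted average entropy of the subsets produced by the split. The function names and sample labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent_labels, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted


# Hypothetical node with two "yes" and two "no" examples.
parent = ["yes", "yes", "no", "no"]
perfect_split = information_gain(parent, [["yes", "yes"], ["no", "no"]])
useless_split = information_gain(parent, [["yes", "no"], ["yes", "no"]])
```

A split that separates the classes completely gains the full entropy of the parent, while a split that leaves each subset as mixed as the parent gains nothing.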
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
• The CART algorithm creates only binary splits, and it uses the Gini index to choose
them.
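As a sketch, the Gini index of a node is commonly computed as one minus the sum of the squared class proportions, and a binary split is scored by the size-weighted Gini index of its two sides. That formula is the standard one rather than something stated in the notes, and the function names and labels are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted Gini impurity of the two sides of a binary split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_index(left) + (
        len(right) / total
    ) * gini_index(right)


mixed = gini_index(["a", "a", "b", "b"])        # evenly mixed node
pure = gini_index(["a", "a", "a"])              # pure node
split_score = weighted_gini(["a", "a"], ["b", "b"])
```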
Example:
o The K-NN algorithm assumes similarity between the new case and the available cases, and
puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
o The K-NN algorithm can be used for regression as well as classification, but it is mostly
used for classification problems.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and acts on it at the time of
classification.
o At the training phase, the KNN algorithm just stores the dataset; when it gets new data,
it classifies that data into the category most similar to the new data.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
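The steps above can be sketched as follows: compute the Euclidean distance from the new point to every stored point, take the K nearest, count the categories among them, and pick the most common one. The function name, the default K = 3, and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training_points, training_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Euclidean distance from the new point to every stored training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # Keep the k nearest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the data points in each category and take the largest count.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional data with two categories.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
first = knn_classify(points, labels, (2, 2))
second = knn_classify(points, labels, (6, 5))
```

Because every prediction scans all stored points, the cost of a single classification grows with the size of the training set.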
o It is simple to implement.
o It always needs the value of K to be determined, which can sometimes be complex.
o The computation cost is high because of calculating the distance between the data points for
all the training samples.
Real-time Prediction: Being a fast learning algorithm, the Naïve Bayes Classifier can be used
to make predictions in real time as well, because it is an eager
learner.
Multi Class Classification: It can be used for multi-class classification problems also.
• It is used for Credit Scoring.
• It is used in medical data classification.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using
this dataset, we need to decide whether or not we should play on a particular day according to
the weather conditions. To solve this problem, we need to follow the steps below:
3. Now, use Bayes theorem to calculate the posterior probability. Problem: If the weather is
sunny, should the player play or not? Solution: To solve this, first consider the below
dataset:
Example:
Solution:
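The dataset for this problem is not reproduced in these notes, so the following is only a sketch of the posterior calculation with hypothetical counts. It applies Bayes theorem, P(A|B) = P(B|A) * P(A) / P(B), to compare P(Yes|Sunny) with P(No|Sunny). All variable names and numbers below are assumptions for illustration.

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical counts, NOT the dataset referred to in the notes.
total_days = 14      # days observed
yes_days = 10        # days with Play = Yes
sunny_days = 5       # days with sunny weather
sunny_and_yes = 3    # sunny days with Play = Yes

no_days = total_days - yes_days
sunny_and_no = sunny_days - sunny_and_yes
p_sunny = sunny_days / total_days

# P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = posterior(sunny_and_yes / yes_days, yes_days / total_days, p_sunny)
# P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
p_no_given_sunny = posterior(sunny_and_no / no_days, no_days / total_days, p_sunny)

decision = "play" if p_yes_given_sunny > p_no_given_sunny else "do not play"
```

With these made-up counts the posterior for Yes is larger, so the sketch decides to play; a different frequency table would of course give different numbers.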
Clustering in Machine Learning:
Clustering or cluster analysis is a machine learning technique
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Image segmentation
4. Hierarchical Clustering
5. Fuzzy Clustering
Clustering Algorithms:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms.
It classifies the dataset by dividing the samples into different clusters of equal variances. The
number of clusters must be specified in this algorithm. It is fast with fewer computations
required, with the linear complexity of O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model that works by updating
the candidates for centroid to be the center of the points within a given region.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not require
the number of clusters to be specified. In this algorithm, messages are sent between pairs of
data points until convergence. It has O(N²T) time complexity, which is the main drawback of
this algorithm.
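As an illustration of the K-Means algorithm described in item 1, here is a minimal sketch that alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The function name, the use of caller-supplied initial centroids, the iteration cap, and the sample points are assumptions; library implementations choose initial centroids differently.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Cluster points around len(initial_centroids) centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (9, 8)])
```

The number of clusters is fixed by the number of initial centroids, which matches the requirement that it be specified in advance.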
Applications of Clustering:
o In Identification of Cancer Cells
o In Search Engines
o Customer Segmentation
o In Biology: It is used in biology to classify different species of plants and
animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use
in a GIS database. This can be very useful for finding the purpose for which a particular
piece of land is most suitable.
o Fraud Detection: Anomaly or fraud detection in the banking sector by identifying the
patterns of loan defaulters.