
Machine Learning: we need to feed data into the machine for it to learn.

Machine learning is the process of helping a machine learn how to make decisions logically.

It uses data and algorithms to learn, and it is a subset of AI.

Ex: Netflix and Amazon built machine learning models using tons of data in order to identify profitable opportunities and avoid risk.

The term machine learning was first coined by Arthur Samuel in the year 1959.

The first formal definition of ML was given by Tom M. Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

The need for machine learning

With the help of ML we can solve complex problems.

We can uncover patterns and trends in data.

We can improve decision-making.

We can do predictive modeling.

We can solve complex problems with a reasonable level of accuracy.

Machine Learning Definitions

Algorithm: a machine learning algorithm is a set of rules and statistical techniques used to learn patterns from data and draw significant information from it. It is the logic behind an ML model.

Model: a model is the main component of ML. A model is trained using a machine learning algorithm.

An algorithm maps all the decisions that a model is supposed to take on the given input in order to produce the correct output.

Predictor variable: a feature of the data that can be used to predict the output.

Response variable: the output variable that needs to be predicted using the predictor variables.

Training data: the ML model is built using training data, which helps it identify the key trends and patterns needed to predict the output.

Testing data: after the model is trained, it must be tested to evaluate how accurately it can predict the outcome.

Locality           | Carpet Area        | No. of rooms       | House price
Predictor variable | Predictor variable | Predictor variable | Response variable
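
A minimal pandas sketch of this split; the column names and values below are made up for illustration:

```python
import pandas as pd

# Hypothetical house-price data matching the table above
df = pd.DataFrame({
    "Locality": ["A", "B", "C"],
    "CarpetArea": [650, 900, 1200],
    "NoOfRooms": [2, 3, 4],
    "HousePrice": [50, 75, 110],
})

X = df[["Locality", "CarpetArea", "NoOfRooms"]]  # predictor variables
y = df["HousePrice"]                             # response variable
```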
Basic pipeline: Data → Train the machine → Build the model → Predict the outcome.

Or

Data → Define the objective → Prepare the data → Explore the data → Build the model → Evaluate the model → Select the best model → Predict on the test data.
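
A rough scikit-learn sketch of this pipeline, using synthetic numbers in place of a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data: carpet area vs. house price (made up for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(500, 1500, size=(100, 1))
y = 50 * X[:, 0] + rng.normal(0, 1000, size=100)

# Prepare the data: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)      # build and train the model
y_pred = model.predict(X_test)                        # prediction on test data
print("R^2 on test data:", r2_score(y_test, y_pred))  # model evaluation
```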

Machine Learning Types:

Supervised learning: teach or train the machine using data that is well labelled.

Unsupervised learning: we train the machine using unlabeled data, without any guidance.

Reinforcement learning: a part of machine learning where an agent is placed in an environment and learns how to behave in it by observing the rewards it gets from its actions.

E.g. Alexa, self-driving cars.

[Diagram: Machine Learning branches into Supervised ML, Unsupervised ML, and Reinforcement Learning]

1. Numeric target → regression (supervised learning)

2. Categorical target → classification (supervised learning)

Unsupervised: clustering techniques.

Regression          | Classification           | Clustering
Supervised          | Supervised               | Unsupervised
Numeric format      | Categorical              | Clusters
Forecast or predict | Compute the category     | Make clusters of similar items
House price dataset | Classifying iris species | Flagging fraudulent transactions
Linear regression   | Logistic regression      | K-means
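
A short scikit-learn illustration of all three, using the built-in iris dataset (the regression target here is just one iris column, picked for convenience):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification (supervised, categorical target): predict the species
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Regression (supervised, numeric target): predict petal width from the rest
reg = LinearRegression().fit(X[:, :3], X[:, 3])

# Clustering (unsupervised, no target): group similar flowers
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(clf.predict(X[:2]), reg.predict(X[:2, :3]), km.labels_[:2])
```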
Data preprocessing is the conversion of raw data into a meaningful form.

It converts data into a useful format.

It makes our data ready for model building.

Steps involved in data preprocessing:

1. Handling missing values

2. Treating outliers

3. Scaling the dataset

4. Encoding the categorical variables

Handling the missing values: to avoid losing data.

To avoid missing out on patterns and trends.

Methods of handling missing values:

1. Mean/median/mode imputation: the mean can be used when the variable is numeric and normally distributed.

The median can be used to fill missing values when the variable is numeric and skewed.

The mode can be used when the variable is categorical.
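
For example, with pandas (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 45],                # numeric, roughly normal -> mean
    "income": [30, 32, 35, np.nan],             # numeric, skewed -> median
    "city": ["Pune", np.nan, "Delhi", "Pune"],  # categorical -> mode
})

df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```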

2. Random sample imputation: a random value from the existing set of values is taken and used to fill the missing value. It is easy, and the variance stays the same as in the original dataset.
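
One possible pandas implementation (the column is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, np.nan, 45, np.nan, 38]})

# Draw one random observed value per missing cell
missing = df["age"].isna()
sample = df["age"].dropna().sample(missing.sum(), replace=True, random_state=0)
sample.index = df.index[missing]   # align the sampled values with the gaps
df.loc[missing, "age"] = sample
```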

3. Capturing NaN values with a new feature: this method is used when data is missing due to some cause; we create a new feature in the dataframe that flags the null values of that particular feature. It is easy to implement and makes it easy to identify where the missing values were.
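
A small sketch of this idea (hypothetical column name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [50, np.nan, 70, np.nan]})

# New flag feature records where the original value was missing
df["salary_missing"] = df["salary"].isna().astype(int)
df["salary"] = df["salary"].fillna(df["salary"].median())
```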

4. End of distribution: we fill the missing value with an extreme value of the feature.

Extreme value = mean ± 3 × std
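
For instance (upper tail shown; use mean - 3 × std for the lower tail):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, np.nan, 45, 38]})

# Extreme value = mean + 3 * std
extreme = df["age"].mean() + 3 * df["age"].std()
df["age"] = df["age"].fillna(extreme)
```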

5. Arbitrary value imputation: we fill the missing value with an arbitrary value. It is a purely judgement-based decision.

6. Frequent category imputation: when the variable is categorical, the best way to fill missing values is with the most popular class, i.e. the mode.

7. KNN imputation (k-nearest neighbours): if a value is missing at, say, the 6th position, we fill it based on its neighbours, e.g. using the 5th and 7th values.
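
scikit-learn ships this as KNNImputer; a minimal example:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

# Each NaN is filled from the k nearest rows, measured on the non-missing features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```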

8. Dropping all the NaN values: if 60% or more of a particular feature is missing, it is advised to drop that feature.
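
A possible pandas one-liner for this rule (threshold and data are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, 4, np.nan],  # 80% missing
})

# Drop any column where 60% or more of the values are NaN
df = df.loc[:, df.isna().mean() < 0.6]
```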
Handling the outliers: values that differ significantly from the rest of the data are called outliers.

There are visual and mathematical ways of dealing with outliers (see the sketch after this list):

Boxplot

Scatterplot

IQR

Z-score
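
A sketch of the two mathematical rules on a toy series (the 1.5 and 3 cut-offs are the common conventions):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # 95 looks like an outlier

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule: flag values with |z| > 3
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

print(iqr_outliers.tolist(), z_outliers.tolist())
```

Boxplots and scatterplots (e.g. s.plot(kind="box") in pandas) show the same outliers visually.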

3. Feature scaling: when features vary over different ranges, it is difficult to put them into a model.

The process of bringing all features into the same range is called scaling.

There are different methods of feature scaling:

1. Absolute max scaling: every value is divided by the maximum absolute value of the feature, converting all values into the range -1 to +1. It is prone to outliers.

Marks | Abs max scaling | Min max scaling                   | Normalisation
10    | 10/300          | (10-10)/(300-10) = 0              | (10-110)/(300-10)
20    | 20/300          | (20-10)/(300-10) = 10/290 = 0.034 | (20-110)/(300-10)
300   | 300/300         | (300-10)/(300-10) = 1             | (300-110)/(300-10)

2. Min max scaling: this method follows the formula

(x - xmin) / (xmax - xmin)

which maps values into the range 0 to 1.

3. Normalisation: (x - xmean) / (xmax - xmin)

4. Standardisation: converts each value of a feature to a z-score; suited when the data is normally distributed.

z = (x - xmean) / std

5. Robust scaling: when the dataset is skewed, this method comes into the picture.

Scaled value = (x - xmedian) / IQR

Method                 | Formula                     | Python code
1 Absolute max scaling | x / xmax                    | MaxAbsScaler()
2 Min max scaling      | (x - xmin) / (xmax - xmin)  | MinMaxScaler()
3 Normalisation        | (x - xmean) / (xmax - xmin) | normalize
4 Standardisation      | (x - xmean) / xstd          | StandardScaler()
5 Robust scaling       | (x - xmedian) / IQR         | RobustScaler()
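
A quick comparison of these scalers on the Marks column from the worked table; note that mean normalisation has no dedicated scikit-learn class, so it is done with plain NumPy here:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, RobustScaler

X = np.array([[10.0], [20.0], [300.0]])   # the Marks column

for scaler in (MaxAbsScaler(), MinMaxScaler(), StandardScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(X).ravel())

# Mean normalisation by hand: (x - mean) / (max - min)
x = X.ravel()
print("normalised", (x - x.mean()) / (x.max() - x.min()))
```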
Encoding the Dataset

Encoding converts categorical columns into numerical ones.

A machine learning model may fail to incorporate categorical variables directly; that is why it is important to convert them into simple numeric codes. This process is called encoding.

There are two ways by which we do encoding:

1. Nominal encoding

2. Ordinal encoding

Nominal encoding: used when the categories are not present in any ordered form; it increases the number of features in the dataset.

If a feature has n categories (e.g. 10), it will create n additional features in the dataset.

One-hot encoding:

color  | red | yellow | green
red    | 1   | 0      | 0
red    | 1   | 0      | 0
yellow | 0   | 1      | 0
green  | 0   | 0      | 1
yellow | 0   | 1      | 0
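
In pandas this table can be produced with get_dummies:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "red", "yellow", "green", "yellow"]})

# One new column per category; 1 marks the row's category
encoded = pd.get_dummies(df["color"], dtype=int)
print(encoded)
```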
A disadvantage of one-hot encoding is that it creates additional features in the dataset, which can lead to the curse of dimensionality.

This refers to the scenario where an excessive number of features in the dataset decreases model performance.

It is difficult to design a model in a higher-dimensional space.

In a higher-dimensional space the model takes more processing time, noise and error increase, and accuracy ultimately decreases.

Ordinal encoding: uses label encoding, which encodes each category of a feature within the same column.

State       | State (encoded)
Maharashtra | 3
Delhi       | 0
Karnataka   | 2
Gujarat     | 1
TamilNadu   | 4
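
A sketch with scikit-learn's LabelEncoder, which assigns codes in sorted order and reproduces the table above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"state": ["Maharashtra", "Delhi", "Karnataka",
                             "Gujarat", "TamilNadu"]})

# Codes follow alphabetical order: Delhi=0, Gujarat=1, Karnataka=2, ...
le = LabelEncoder()
df["state_encoded"] = le.fit_transform(df["state"])
print(df)
```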
