
Machine Learning

Unit 1: Introduction to ML

Faculty Name : Dr. Gargi Sameer Phadke


Index

Lec.-1: Dissemination of Institute & department vision-mission, PEO, POs, PSO, COs & POs
mapping

Lec.-2: Introduction to Machine Learning, Categories of Algorithms

Lec.-3: Machine Learning Tasks, Issues, Applications

Lec.-4: Key Technologies, Steps in Developing machine Learning Applications

2
Lecture 1

Dissemination of
Institute & department
vision-mission, PEO, POs,
PSO,
COs & POs mapping
Institute vision

To foster and permeate higher and quality education with value added
engineering, technology programs, providing all facilities in terms of technology
and platforms for all round development with social awareness and nurture the
youth with international competencies and exemplary level of employability
even under highly competitive environment so that they are innovative,
adaptable and capable of handling problems faced by our country and world
at large.

Source: dypatil.edu/engineering/vision-mission-goal.php
4
Institute vision

RAIT’s firm belief in a new form of engineering education that lays equal stress on
academics and leadership building extracurricular skills has been a major
contribution to the success of RAIT as one of the most reputed institutions of higher
learning. The challenges faced by our country and the world in the 21st century needs
a whole new range of thoughts and action leaders, which a conventional educational
system in engineering disciplines are ill equipped to produce. Our reputation in
providing good engineering education with additional life skills ensures that high
grade and highly motivated students join us. Our laboratories and practical
sessions reflect the latest that is being followed in the industry. The project
works and summer internships make our students adept at handling the real-life
problems and be industry ready. Our students are well placed in the industry and
their performance make reputed companies visit us with renewed demands and vigor.

Source: dypatil.edu/engineering/vision-mission-goal.php
5
Institute mission

The Institution is committed to mobilize the resources and equip itself with men
and materials of excellence, thereby ensuring that the Institution becomes a
pivotal center of service to Industry, Academy, and society with the latest
technology. RAIT engages different platforms such as technology enhancing Student
Technical Societies, Cultural platforms, Sports excellence centers,
Entrepreneurial Development Centers and a Societal Interaction Cell. To develop
the college to become an autonomous institution & deemed university at the
earliest, we provide facilities for advanced research and development programs on
par with international standards. We also seek to invite international and reputed
national Institutions and Universities to collaborate with our institution on the
issues of common interest of teaching and learning sophistication.

Source: dypatil.edu/engineering/vision-mission-goal.php
6
Institute mission

RAIT’s Mission is to produce engineering and technology professionals who


are innovative and inspiring thought leaders, adept at solving problems
faced by our nation and world by providing quality education.

The Institute is working closely with all stake holders like industry,
Academy to foster knowledge generation, acquisition, dissemination using
the best available resources to address the great challenges being faced by our
country and World. RAIT is fully dedicated to provide its students skills that
make them leaders and solution providers and are industry ready when
they graduate from the Institution.

Source: dypatil.edu/engineering/vision-mission-goal.php
7
Department of Computer Engineering vision

To impart higher and quality education in computer science with value


added engineering and technology programs to prepare technically sound,
ethically strong engineers with social awareness. To extend the facilities, to
meet the fast-changing requirements and nurture the youths with
international competencies and exemplary level of employability and research
under highly competitive environments.

Source: dypatil.edu/engineering/vision-of-the-dept.php
8
Department of Computer Engineering mission

• To mobilize the resources and equip the institution with men and materials
of excellence to provide knowledge and develop technologies in the thrust
areas of computer science and Engineering.

• To provide the diverse platforms of sports, technical, cocurricular and


extracurricular activities for the overall development of student with ethical
attitude.

• To prepare the students to sustain the impact of computer education for


social needs encompassing industry, educational institutions & public service.

• To collaborate with IITs, reputed universities and industries for the technical
and overall upliftment of students for continuing learning and entrepreneurship.

Source: dypatil.edu/engineering/mission-of-the-department.php
9
Program Outcomes (POs)

PO1- Engineering knowledge: Apply the knowledge of mathematics, science,


engineering fundamentals, and an engineering specialization to the solution of
complex engineering problems.
PO2- Problem analysis: Identify, formulate, review research literature, and
analyze complex engineering problems reaching substantiated conclusions
using first principles of mathematics, natural sciences, and engineering sciences.
PO3- Design/development of solutions: Design solutions for complex
engineering problems and design system components or processes that meet
the specified needs with appropriate consideration for the public health and
safety, and the cultural, societal, and environmental considerations.
PO4- Conduct investigations of complex problems: Use research-based
knowledge and research methods including design of experiments, analysis and
interpretation of data, and synthesis of the information to provide valid
conclusions.

Source: dypatil.edu/engineering/pdf/po-co-anyalsis.pdf
10
POs

PO5-Modern tool usage: Create, select, and apply appropriate techniques,


resources, and modern engineering and IT tools including prediction and
modelling to complex engineering activities with an understanding of the
limitations.
PO6- The engineer and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the
consequent responsibilities relevant to the professional engineering practice.
PO7-Environment and sustainability: Understand the impact of the
professional engineering solutions in societal and environmental contexts, and
demonstrate the knowledge of, and need for sustainable development.
PO8-Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.

Source: dypatil.edu/engineering/pdf/po-co-anyalsis.pdf
11
POs

PO9-Individual and team work: Function effectively as an individual, and as a


member or leader in diverse teams, and in multidisciplinary settings.
PO10-Communication: Communicate effectively on complex engineering
activities with the engineering community and with society at large, such as,
being able to comprehend and write effective reports and design documentation,
make effective presentations, and give and receive clear instructions.
PO11-Project management and finance: Demonstrate knowledge and
understanding of the engineering and management principles and apply these to
one’s own work, as a member and leader in a team, to manage projects and in
multidisciplinary environments.
PO12-Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context of
technological change.

Source: dypatil.edu/engineering/pdf/po-co-anyalsis.pdf
12
Course Outcomes (COs)

After completion of this course learner will be able to,

CO1: Understand the basic concepts of machine learning.


CO2. Extract different feature vectors from the given data.
CO3. Apply different regression techniques on the input
data.
CO4. Apply and analyse the performance of classification
algorithms.
CO5. Form clusters using various similarity measures.
CO6. Understand the working of reinforcement learning.
Source: dypatil.edu/engineering/pdf/po-co-anyalsis.pdf
13
CO-PO mapping

14
Course scheme

Teaching Scheme(Hrs) Credit Assigned


Subject
Subject Name
Code
Theory Pract. Tut. Theory Pract. Total

Machine
CEC601 Learning
03 - - 03 - 03

Examination Scheme
Total
Theory

Sub IA
Subject Name
code (out of20) Exam Duration Pract. and oral
Mid Oral
End sem
Sem
Test1 Test2 Avg.

100
Machine
CEC601 20 20 20 20 60 2 Hr. - -
Learning

15
Course LAB scheme

Teaching Scheme(Hrs) Credit Assigned


Subject
Subject Name
Code
Theory Pract. Tut. Theory Pract. Total

Machine
CEL601 Learning Lab
- 02 - - 01 01

Examination Scheme
Total
Theory

Sub IA
Subject Name
code (out of20)
Mid TW Pract. and oral Oral
End sem
Sem
Test1 Test2 Avg.

Machine
CEL601 - - - - - 25 25 - 50
Learning Lab

16
Syllabus

Detailed Content Hours

1. Introduction to Machine Learning: Introduction, Categories of Learning 4


Algorithms, Machine Learning tasks, Issues, Applications, Key terminologies,
Steps
in developing machine learning applications

2.Data Pre-processing: Need, creating training and test sets, managing 5


categorical
data, Managing missing features, Data scaling and normalization, Feature
selection and Filtering, Dimension Reduction-Principal Component Analysis (PCA)

3. Learning for Regression: Linear models, Linear Regression and higher 8


dimensionality
Logistic Regression, Classification metrics. Decision Tree, Random forest,
Introduction to Neural Networks, NN for Regression, Model selection, evaluation
and validation

17
Syllabus

Detailed Content Hours

4. Supervised Learning: Naïve Bayes Classifiers, Support Vector Machine 10


(SVM)- Linear SVM, Decision Tree, Construction of Decision tree for rule-based
classification, Ensemble Learning- Random Forest, HMM, NN for classification-
feed forward network, Model selection, evaluation and validation

5. Unsupervised learning: Fundamentals, K-means, Hierarchical Clustering, 08


Expectation maximization clustering. NN for clustering- SOM, Model selection,
evaluation and validation

6.Reinforcement Learning: Introduction, Learning Task, Q Learning, Temporal 06


Difference Learning, Generalization Time series forecasting , Model selection,
evaluation and validation

18
Text books

• Tom M. Mitchell, "Machine Learning", McGraw Hill Education


• Peter Harrington, “Machine Learning in action”

19
Lecture 2

Introduction to
Machine Learning (ML)
What is Machine Learning?

What is Machine Learning?

21 Lec-3: Introduction to ML
Machine Learning

1959: Arthur Samuel


first coined the term
Machine Learning

22 Lec-3: Introduction to ML
Machine Learning

1989: Tom Mitchell proposed a formal definition of the learning problem

23 Lec-3: Introduction to ML
Machine Learning

24 Lec-3: Introduction to ML
What is Machine Learning

• Machine learning enables a machine to automatically learn from data, improve its performance with experience, and make predictions without being explicitly programmed.

• Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experience on its own.

25
A machine has the ability to learn if it can improve its
performance by gaining more data.

26
How does Machine Learning work

• A machine learning system learns from historical data and builds prediction models.
• Whenever it receives new data, it predicts the output for it.
• The accuracy of the predicted output depends largely on the amount of data:
• a large amount of data helps to build a better model, which predicts the output more accurately.

27
Features of Machine Learning:

• Machine learning uses data to detect various patterns in a


given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is similar to data mining, as it also deals with huge amounts of data.

28
Need for Machine Learning

• The need for machine learning is increasing day by day.


• Machine learning is capable of doing tasks that are too complex for a person to implement directly.
• As humans, we have limitations: we cannot process huge amounts of data manually.
• Instead, we train machine learning algorithms by providing them with large amounts of data,
• letting them explore the data, construct models, and predict the required output automatically.
• The performance of a machine learning algorithm depends on the amount of data, and it can be measured with a cost function. With the help of machine learning, we can save both time and money.

29
• The importance of machine learning can be easily understood from its use cases.
• Currently, machine learning is used in

–Self-driving cars
– Cyber fraud detection
– Face recognition, and
– friend suggestion by Facebook

30
Key points which show the importance of Machine Learning:

• Rapid increment in the production of data


• Solving complex problems, which are difficult for a
human
• Decision making in various sector including finance
• Finding hidden patterns and extracting useful
information from data.

31
Machine Learning Basics

Machine learning basics

32 Lec-3: Introduction to ML
Categories of Algorithms

Machine Learning
Supervised Unsupervised Reinforcement

Dimension
Regression Classification Clustering
reduction

33 Lec-3: Introduction to ML
34
Supervised Learning

• Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data,
• and on that basis the machines predict the output.
• Labelled data means the input data is already tagged with the correct output.

35
Steps Involved in Supervised Learning:

• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training set, a test set, and a validation set.
• Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
• Determine a suitable algorithm for the model, such as a support vector machine, a decision tree, etc.
• Execute the algorithm on the training set. Sometimes a validation set is needed to tune control parameters; it is a subset of the training data.
• Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, the model is accurate (a minimal workflow is sketched below).
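The steps above can be condensed into a few lines of code. This is a minimal sketch assuming scikit-learn is available; the built-in iris dataset and the DecisionTreeClassifier are illustrative choices, not prescribed by the slide.

```python
# A minimal supervised learning workflow (illustrative dataset and model)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1-2. Gather the labelled training data
X, y = load_iris(return_X_y=True)

# 3. Split into training and test sets (a validation set could be split off similarly)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4-6. Choose a suitable algorithm and execute it on the training data
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# 7. Evaluate the accuracy of the model on the test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```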

36
Regression

 Regression algorithms are used if there is a relationship between the input


variable and the output variable. It is used for the prediction of continuous
variables,
 such as Weather forecasting,
 Market Trends
 Regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression

37
Classification

 Classification algorithms are used when the output variable is categorical, i.e., the output belongs to two or more discrete classes, such as
 Yes-No,
 Male-Female,
 True-False, etc.
 Spam filtering is a typical example. Classification algorithms which come under supervised learning include:
• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines

38
Advantages and Disadvantages of Supervised learning:

 Advantages of supervised learning:
• With the help of supervised learning, the model can predict the output on the basis of prior experience.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning models help us to solve various real-world problems, such as fraud detection, spam filtering, etc.
 Disadvantages of supervised learning:
• Supervised learning models are not suitable for handling very complex tasks.
• Supervised learning cannot predict the correct output if the test data is different from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.

39
Applications of Supervised Learning

• Image segmentation
• Medical Diagnosis
• Fraud Detection
• Spam detection
• Speech Recognition

40
Unsupervised Machine Learning

 Unsupervised learning is a type of machine learning in which models are


trained using unlabeled dataset and are allowed to act on that data without
any supervision.

 Find the underlying structure of dataset, group that data according to


similarities, and represent that dataset in a compressed format.

41
Unsupervised Machine Learning

42
Why use Unsupervised Learning?

• Unsupervised learning is helpful for finding useful insights


from the data.
• Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to true AI.
• Unsupervised learning works on unlabeled and uncategorized
data which make unsupervised learning more important.
• In real-world, we do not always have input data with the
corresponding output so to solve such cases, we need
unsupervised learning.

43
Types of Unsupervised Learning Algorithm:

• Clustering:
• Clustering is a method of grouping objects into clusters (cluster analysis)
• by finding the commonalities between the data objects.
• Association:
– An association rule finds relationships between variables in a large database.
• It determines the sets of items that occur together in the dataset.

44
Advantages and Disadvantage of Unsupervised Learning

Advantages of Unsupervised Learning


•Unsupervised learning is used for more complex tasks as compared to
supervised learning because, in unsupervised learning, we don't have labeled
input data.
•Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.
Disadvantages of Unsupervised Learning
•Unsupervised learning is intrinsically more difficult than supervised learning
as it does not have corresponding output.
•The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

45
Reinforcement

"Reinforcement learning is a type of machine learning method where an


intelligent agent (computer program) interacts with the environment and
learns to act within that."

46
Supervised vs Unsupervised vs Reinforcement Learning

Supervised vs Unsupervised vs Reinforcement Learning

47 Lec-3: Introduction to ML
Classification vs Regression

Classification and regression

48 Lec-3: Introduction to ML
Classification vs Regression

• Share price prediction • Music genre detection

• Customer behaviour prediction • Employee retention status

• Product categorization • Loan eligibility detection

• Heart disease detection • Ad popularity prediction

• Customer churn prediction • Market forecasting

• Car price prediction • Identity fraud

49 Lec-3: Introduction to ML
Lecture 3

Machine Learning Tasks, Issues

50
Machine learning Task

1. Regression
2. Classification
3. Clustering
4. Transcription
5. Machine translation
6. Anomaly detection
7. Synthesis & sampling
8. Estimation of probability density and probability mass function
9. Similarity matching
10.Co-occurrence grouping
11.Causal modeling
12. Link profiling
Issues in Machine Learning
1. Inadequate Training Data

1. A major issue that arises while using machine learning algorithms is the lack of quality as well as quantity of data.
• Noisy Data- It is responsible for an inaccurate prediction
that affects the decision as well as accuracy in
classification tasks.
• Incorrect data- It is also responsible for faulty
programming and results obtained in machine learning
models. Hence, incorrect data may affect the accuracy of
the results also.
• Generalizing of output data- Sometimes, it is also found
that generalizing output data becomes complex, which
results in comparatively poor future actions.

52
2. Poor quality of data

• Data plays a significant role in machine learning,


and it must be of good quality as well.
• Noisy data,
• incomplete data,
• inaccurate data, and
• unclean data lead to less accuracy in classification and
low-quality results.

53
Overfitting and Underfitting

• Non-representative training data
• Overfitting is one of the most common issues faced in machine learning: the model fits the training data too closely and generalizes poorly.
• Underfitting is just the opposite of overfitting; it occurs whenever a machine learning model is trained with too little data or is too simple to capture the pattern.
• Monitoring and maintenance

54
• Getting bad recommendations
• Lack of skilled resources
• Customer Segmentation
• Process Complexity of Machine Learning
• Data Bias
• Lack of Explainability
• Slow implementations and results
• Irrelevant features

55
Lecture 4

Key Technologies, Steps in Developing


machine Learning Applications

56
Key Technologies

 Gathering data
 Data preparation
 Choosing a model
 Evaluation
 Hyper-parameter tuning
 Prediction
Gathering data

• First step of ML life cycle.

• Identify and obtain all data-related problems.

• Various sources such as files, database, internet, or mobile


devices, etc.

• The quantity and quality of the data determine the efficiency of the output.

• The more data there is, the more accurate the prediction will be.

• The result is a coherent set of data, also called a dataset.

58 Lec-3: Introduction to ML
Data preparation

• Put all data together, and then randomize the ordering of data.

• Understand the nature, format, and quality of data.

• Data may have various issues

• Missing Values

• Duplicate data

• Invalid data

• Noise

59 Lec-3: Introduction to ML
Choosing a model

• Selection of analytical techniques

• Building models

• Review the result


• Choose ML model to analyze data using various analytical techniques.
o Classification
o Regression
o Cluster analysis
o Association
o etc.

• Build the model using prepared data,

• Evaluate the model.

60 Lec-3: Introduction to ML
Training

• Train the model using selected machine learning algorithms.

• Training a model is required so that it can understand the various


patterns, rules, and, features.

• Generally data is segmented in training, evaluation, and validation


sets.

61 Lec-3: Introduction to ML
Training

62 Lec-3: Introduction to ML
Evaluation

• Evaluation data is used to calculate the efficiency of the trained


model.
• Different efficiency measures are used depending on the algorithm class.
• For example:
o Classification – confusion matrix, accuracy, sensitivity, specificity, etc.
o Regression – mean squared error, mean absolute error, etc.
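A small sketch of these measures, assuming scikit-learn is available; the y_true/y_pred arrays are illustrative only.

```python
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, mean_absolute_error

# Classification example: confusion matrix and accuracy
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print(confusion_matrix(y_true_cls, y_pred_cls))      # rows = actual class, columns = predicted class
print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))

# Regression example: mean squared error and mean absolute error
y_true_reg = [3.0, 2.5, 4.1, 5.0]
y_pred_reg = [2.8, 2.9, 4.0, 4.6]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
```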

63 Lec-3: Introduction to ML
Hyper-parameter tuning

• Data is used to find the optimal rules and (hyper) parameters of the
trained model.
• Primary focus is to increase the model efficiency.

64 Lec-3: Introduction to ML
Prediction

• Use testing data and trained model to check for the efficiency as per
the requirement of project or problem.

• Deploy the model in the real-world system.

• If the model is trained, evaluated, and tuned correctly, then it will perform about as well on real-world data as it did in the training step.

65 Lec-3: Introduction to ML
Thank You
Machine Learning
Unit 2: Data Preprocessing

Faculty Name : Pallavi H. Chitte


Index

Lecture -1: Need of data preprocessing, creating training and test sets

Lecture -2: Managing categorical data, Managing missing features

Lecture -3: Data scaling and normalization

Lecture -4: Feature selection and Filtering

Lecture -5: Dimension Reduction

Lecture 6: Principal Component Analysis (PCA)

2
Lecture 1

Need of data preprocessing


Why Data Preprocessing ?

• Data in the real world is dirty


– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names

• No quality data, no quality results!


– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data

• A multi-dimensional measure of data quality:


– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value added,
interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.

4
Remember…

• Data preparation plays an important role in


your workflow. You need to transform the
data in a way that a computer would be able
to work with it

5
Data Preprocessing in Machine learning

• Data preprocessing is a process of preparing the raw data and


making it suitable for a machine learning model. It is the first and
crucial step while creating a machine learning model.

• When creating a machine learning project, it is not always a case


that we come across the clean and formatted data. And while
doing any operation with data, it is mandatory to clean it and put
in a formatted way.

• So, for this, we use data preprocessing task.

6
Need of Data Pre-processing

• To achieve better results from the applied model in machine learning projects, the data must be in a proper format.

• Some specified Machine Learning model needs information in a


specified format. For example, Random Forest algorithm does not
support null values, therefore, to execute random forest algorithm
null values must be managed from the original raw data set.

• Another aspect is that the data set should be formatted in such a way that more than one machine learning or deep learning algorithm can be executed on it, and the best of them chosen.

7
Data pre-processing

• Process of preparing the raw data and making it suitable for a machine
learning model.
• It is the first and crucial step. When creating a machine learning project, we do not always come across clean and formatted data, and while doing any operation with data it is mandatory to clean it and put it in a formatted way. For this, we use the data pre-processing task.
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be directly used by machine learning models.

• Data pre-processing is the required task of cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.

8
Why Preprocess Data?

• Data in the real world is dirty

• The purpose of preparation is to transform this data so that a


computer would be able to work with it

9
Why Preprocess Data?

• Data need to be formatted and made adequate for a given method

• Data in the real world is dirty

• incomplete: lacking attribute values, lacking certain attributes of


interest, or containing only aggregate data
• e.g., occupation=“”
• noisy: containing errors or outliers
• e.g., Salary=“-10”, Age=“222”
• inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records

10
Major Tasks in Data Preparation

• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
12
TYPES OF DATA

13
Types of Data
• Types of Data: Broader categories

• Discrete

• Continuous

• Types of Measurements (listed in order of increasing information content):

• Nominal scale (qualitative)

• Categorical scale (qualitative)

• Ordinal scale (qualitative)

• Interval scale (quantitative)

• Ratio scale (quantitative)

14
Discrete or Continuous
Types of Measurements: Examples

• Nominal:

• ID numbers, Names of people


• Categorical:

• eye color, zip codes


• Ordinal:

• rankings (e.g., taste of potato chips on a scale from 1-10), grades,


height in {tall, medium, short}
• Interval:

• calendar dates, IQ scores


• Ratio:

• temperature in Kelvin, length, time, counts

15
Data Conversion

• Some models can deal with nominal values but others need fields to be numeric.

• Convert ordinal fields to numeric to be able to use ">" and "<" comparisons on such fields, e.g.:
• A → 4.0
• A- → 3.7
• B+ → 3.3
• B → 3.0

• Multi-valued, unordered attributes with small no. of


values

• e.g. Color=Red, Orange, Yellow, …, Violet


• for each value v create a binary "flag" variable C_v, which is 1 if Color=v and 0 otherwise (see the pandas sketch below)
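A minimal pandas sketch of both conversions; the example DataFrame and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Grade": ["A", "B+", "A-", "B"],
                   "Color": ["Red", "Orange", "Yellow", "Red"]})

# Ordinal field -> numeric, so ">" and "<" comparisons become meaningful
grade_map = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}
df["Grade_num"] = df["Grade"].map(grade_map)

# Multi-valued, unordered attribute -> one binary "flag" column per value
flags = pd.get_dummies(df["Color"], prefix="Color")   # Color_Red, Color_Orange, ...
df = pd.concat([df, flags], axis=1)
print(df)
```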
20
Conversion: Nominal, Many Values

• Examples:
• US State Code (50 values)
• Profession Code (7,000 values, but only few frequent)

• Ignore ID-like fields whose values are unique for each record

• For other fields, group values “naturally”:


• e.g. 50 US States → 3 or 5 regions
• Profession – select most frequent ones, group the rest

• Create binary flag-fields for selected values

17
OUTLIERS

18
Outliers

• Outliers are values thought to be out of range.


• “An outlier is an observation that deviates so much from other observations
as to produce suspicion that it was generated by a different mechanism”

• Can be detected by standardizing observations and labelling the standardized values outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data cleaning

• Approaches:
• do nothing / treat separately
• Imputing/ enforce upper and lower bounds
• Deleting/let binning (discarding) handle the problem

19
Outlier detection

• Univariate
• Compute mean and std. deviation.
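A small numpy sketch of this idea: standardize the observations and flag values that fall outside a chosen bound. The data values and the 3-sigma threshold are illustrative.

```python
import numpy as np

values = np.array([21, 16, 34, 33, 57, 18, 44, 41, 63, 72, 54, 44, 39, 150])  # 150 is a suspect value
z_scores = (values - values.mean()) / values.std()   # standardized observations

threshold = 3.0                                       # a common, but arbitrary, bound
outliers = values[np.abs(z_scores) > threshold]
print("Outliers:", outliers)
```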

20
MISSING DATA

21
Missing Data

• Data is not always available


• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data

• Missing data may be due to


• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data

• Missing data may need to be inferred.

22
How to Handle Missing Data?

• Ignore records (use only cases with all values)

• Usually done when class label is missing as most prediction methods do not
handle missing data well
• Not effective when the percentage of missing values per attribute varies
considerably as it can lead to insufficient and/or biased sample sizes

• Ignore attributes with missing values

• Use only features (attributes) with all values (may leave out important
features)

• Fill in the missing value manually

• tedious + infeasible?

23
How to Handle Missing Data?

• Use a global constant to fill in the missing value

• e.g., “unknown”. (May create a new class!)

• Use the attribute mean to fill in the missing value

• It will do the least harm to the mean of existing data


• If the mean is to be unbiased
• What if the standard deviation is to be unbiased?

• Use the attribute mean for all samples belonging to the same class to
fill in the missing value

24
How to Handle Missing Data?

• Use the most probable value to fill in the missing value

• Inference-based such as Bayesian formula or decision tree

• Identify relationships among variables


• Linear regression, Multiple linear regression, Nonlinear regression

• Nearest-Neighbour estimator
• Finding the k neighbours nearest to the point and fill in the most frequent value or
the average value
• Finding neighbours in a large dataset may be slow

25
Missing Values Treatment

 It can lead to wrong prediction or classification.

 How to deal:
o Deletion
o Mean/ Mode/ Median Imputation
o Prediction Model

26
Handling Missing data practically with python

• This strategy is useful for features which hold numeric data, such as age, salary, year, etc. Here, we will use this approach.
• Calculating the mean: we calculate the mean of the column that contains missing values and put it in place of each missing value.

• # Handling missing data (replacing missing values with the mean value)
• import numpy as np
• from sklearn.impute import SimpleImputer   # SimpleImputer replaces the older, deprecated Imputer class
• imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
• # Fitting the imputer object to the independent variables x (columns 1 and 2)
• imputer = imputer.fit(x[:, 1:3])
• # Replacing missing data with the calculated mean values
• x[:, 1:3] = imputer.transform(x[:, 1:3])

27
Summary

• Every real world data set needs some kind of data pre-
processing

• Deal with missing values


• Correct erroneous values
• Select relevant attributes
• Adapt the data set format to the model to be used

28
Splitting the Dataset into the Training set and Test set

• In data pre-processing, we divide our dataset into a training set and


test set. This is one of the crucial steps of data pre-processing as by
doing this, we can enhance the performance of our machine learning
model.
• Suppose we train our machine learning model on one dataset and then test it on a completely different dataset. This will make it difficult for the model to capture the correlations in the data.
• If we train our model very well and its training accuracy is very high, but it then performs poorly on a new dataset, the model does not generalize. So we always try to build a machine learning model that performs well on the training set and also on the test set. Here, we can define these datasets as:

29
Training and Testing Set

• Training Set: A subset of dataset to train the machine learning model,


and we already know the output.
• Test set: A subset of dataset to test the machine learning model, and by
using the test set, model predicts the output.
• For splitting the dataset, we will use the below lines of code:
• from sklearn.model_selection import train_test_split
• # 80% of the samples go to the training set and 20% to the test set
• x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

30
Links for Useful videos

• https://www.google.com/search?q=data+preprocessing+in+ML+youtube&rlz=1C1C
HZN_enIN1030IN1030&biw=1366&bih=600&tbm=vid&ei=WNPOY-
nQCt66seMPu9S5gAw&ved=0ahUKEwjp3PDCqt78AhVeXWwGHTtqDsAQ4dUDC
A0&uact=5&oq=data+preprocessing+in+ML+youtube&gs_lcp=Cg1nd3Mtd2l6LXZp
ZGVvEAMyBQghEKABOgUIABCiBDoICCEQFhAeEB06BwghEKABEApQtAZYthdg
wBloAHAAeACAAc4BiAHJC5IBBTAuNy4ymAEAoAEBwAEB&sclient=gws-wiz-
video#fpstate=ive&vld=cid:8146a305,vid:4i9aiTjjxHY

• https://www.google.com/search?q=data+preprocessing+in+ML+youtube&rlz=1C1C
HZN_enIN1030IN1030&biw=1366&bih=600&tbm=vid&ei=WNPOY-
nQCt66seMPu9S5gAw&ved=0ahUKEwjp3PDCqt78AhVeXWwGHTtqDsAQ4dUDC
A0&uact=5&oq=data+preprocessing+in+ML+youtube&gs_lcp=Cg1nd3Mtd2l6LXZp
ZGVvEAMyBQghEKABOgUIABCiBDoICCEQFhAeEB06BwghEKABEApQtAZYthdg
wBloAHAAeACAAc4BiAHJC5IBBTAuNy4ymAEAoAEBwAEB&sclient=gws-wiz-
video#fpstate=ive&vld=cid:5f6afa8c,vid:9uvIazKs2uI

31
Lecture 2

Managing categorical data, Managing


missing features

32
How Do You Handle Missing Values

Ways to handle missing values in the dataset:


Deleting Rows
• If a column has around 70%–75% of its rows as null, then the complete column is dropped. Rows that have one or more column values as null can also be dropped.
• Dropping rows or columns is suggested only if there are enough samples in the data set. Always check the data after information is deleted: deleting too much causes a loss of knowledge, which will not give the expected results when predicting the output.

• Syntax: data.drop(['Cabin'], axis=1, inplace=True)

33
How Do You Handle Missing Values

34
Example

• Consider a data set where we collected the ages of the people who attend a yoga class. After asking all 25 people in the class, the data obtained about their ages is the following:
21, 16, 34, 33, 57, 18, 44, 41, 63, 72, 54, 44, 39, 30, 45, 45, 61, 18, 29, 27, 55, 48, 59, 66, 70.

• We start by ordering the numbers:


16 18 18 21 27 29 30 33 34 39 41 44 44 45 45 48 54 55 57 59 61 63 66 70
72

• With this data in mind, let us look at the calculation and definition of mean, median, mode
and range:

35
Example

Mean:

• The mean of a data set is the sum of the values divided by the number of values in the data set. Notice that, given this definition, the mean is the same as the arithmetic average of a set of numbers; thus the terms mean and average are usually used as synonyms.

The mean of a data set tries to find the central value of a set by comparing all of
the values in the set and producing the average of them; if all of the values in the
set were to be equal, the mean of this set would be equal to all of them too.

36
How to find the mean:

• We obtain the mean by adding the 25 ages and dividing by 25:

  x̄ = (21 + 16 + 34 + ... + 70) / 25 = 1089 / 25 = 43.56

• The mean is usually represented as an x with a horizontal line on top (x̄).

37
Replacing With Mean / Median

• This method calculates the mean, median, or mode of the feature and replaces the missing values with it. The approach can be applied to a feature which has numeric data, like the age of a person or the ticket fare. Some detail is lost, but this method gives better results compared to deleting rows and columns.

• Syntax for mean: data["Age"] = data["Age"].replace(np.nan, data["Age"].mean())
• Syntax for median: data["Age"] = data["Age"].replace(np.nan, data["Age"].median())

38
How Do You Handle Missing Values

39
Replacing Missing Data With The Most Frequent Values

• When missing values occur in categorical columns (whether string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can instead be replaced with a new category.

• Syntax: data['Cabin'].fillna('Unknown')[:10]

40
Handling Categorical Data In Machine Learning

• Categorical data is data that is grouped into categories; it is defined as "a collection of information that is divided into groups". For example, if a school or college is trying to get details of its students, the resulting data grouped according to variables such as branch, section, gender, etc. is called categorical data.

• There are two subcategories of categorical data:

Nominal data:

• Nominal data is used to name variables without providing any numerical value. Nominal data is also called labelled or named data. It helps to arrive at better conclusions. Examples of nominal data include division, gender, etc.

42
Handling Categorical Data In Machine Learning

Ordinal data:

• Data whose variables have natural, ordered categories, where the distances between the categories are not known, is called ordinal data. Ordinal data has an order or scale to it; still, this order does not have a standard scale on which the difference between variables in each scale is measured.

• Examples of ordinal data include interval-style ratings, likes/dislikes, customer satisfaction survey data, etc. Each of these examples may have different collection and analysis techniques, but they are all ordinal data. A small encoding sketch for both kinds is shown below.
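A minimal pandas sketch of encoding the two kinds of categorical data; the column names and the category order are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"],          # nominal
                   "Satisfaction": ["Low", "High", "Medium", "Medium"]})    # ordinal

# Nominal data: no order, so use one-hot encoding
df = pd.get_dummies(df, columns=["Gender"])

# Ordinal data: map the natural order onto integers
order = {"Low": 0, "Medium": 1, "High": 2}
df["Satisfaction"] = df["Satisfaction"].map(order)
print(df)
```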

43
Lecture 3

Data scaling and Normalization

44
What is data scaling and normalization in machine learning?

• Machine learning scaling and Normalization are part of data


preparation

• The scaling technique brings data points that are far from each other closer together, in order to increase algorithm effectiveness and speed up machine learning processing.
• Goal: scaled data enables the model to learn and understand the problem.

• Normalization changes the values of numeric columns in the dataset to


use a common scale, without distorting differences in the ranges of
values or losing information.
• Goal- to transform features to be on a similar scale. This improves the
performance and training stability of the model.

45
The difference is that:

• In Scaling, we're changing the range of the distribution of the


data. While in normalizing, we're changing the shape of the distribution
of the data.
• Range is the difference between the smallest and largest element in a
distribution.

46
Feature Scaling

• A method used to normalize the range of the independent variables or features of the data is known as feature scaling.

• Normalization is performed during data preparation in machine learning because it is a technique used to change the values of numeric columns in the dataset to a common scale, without distorting the differences in the ranges of values or losing information.

47
Feature Scaling

Shape of the data doesn't change, but that instead of ranging from 0
to 8, it now ranges from 0 to 1.

48
Normalization

• For distance-based methods, normalization helps to


prevent that attributes with large ranges out-weight
attributes with small ranges

• min-max normalization
• z-score normalization
• normalization by decimal scaling

49
Normalization

• min-max normalization

  v' = ((v − min_v) / (max_v − min_v)) · (new_max_v − new_min_v) + new_min_v

• z-score normalization (does not eliminate outliers)

  v' = (v − v̄) / σ_v

• normalization by decimal scaling

  v' = v / 10^j   where j is the smallest integer such that max(|v'|) < 1

  Example: range -986 to 917  =>  j = 3, so -986 -> -0.986 and 917 -> 0.917
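A small numpy sketch of the three formulas above; the array of values is illustrative.

```python
import numpy as np

v = np.array([-986.0, -200.0, 0.0, 350.0, 917.0])

# min-max normalization to a new range [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization (does not eliminate outliers)
v_zscore = (v - v.mean()) / v.std()

# normalization by decimal scaling: j is the smallest integer such that max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
v_decimal = v / (10 ** j)   # here j = 3, so -986 -> -0.986 and 917 -> 0.917
print(v_minmax, v_zscore, v_decimal, sep="\n")
```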

50
Age min‐max (0‐1) z‐score dec. scaling
44 0.421 0.450 0.44
35 0.184 ‐0.450 0.35
34 0.158 ‐0.550 0.34
34 0.158 ‐0.550 0.34
39 0.289 ‐0.050 0.39
41 0.342 0.150 0.41
42 0.368 0.250 0.42
31 0.079 ‐0.849 0.31
28 0.000 ‐1.149 0.28
30 0.053 ‐0.949 0.3
38 0.263 ‐0.150 0.38
36 0.211 ‐0.350 0.36
42 0.368 0.250 0.42
35 0.184 ‐0.450 0.35
33 0.132 ‐0.649 0.33
45 0.447 0.550 0.45
34 0.158 ‐0.550 0.34
65 0.974 2.548 0.65
66 1.000 2.648 0.66
38 0.263 ‐0.150 0.38

Minimum = 28, Maximum = 66, Average = 39.50, Standard deviation = 10.01
Normalization

Scaling just changes the range of your data.

Normalization is a more radical transformation. The point of


normalization is to change your observations so that they can be
described as a normal distribution.

Normal distribution: Also known as the "bell curve"

A specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and more of the observations lie close to the mean. The normal distribution is also known as the Gaussian distribution.

52
Normalization

Shape of our data has changed. Before normalizing it was almost L-


shaped. But after normalizing it looks more like the outline of a bell
(hence "bell curve").

53
Some Common Methods To Perform Feature Scaling..
1. Standardization:

In standardization, the values are replaced by their z-scores:

  x' = (x − μ) / σ

This redistributes the features so that their mean equals zero (μ = 0) and their standard deviation equals one (σ = 1).

54
2. Min-Max Normalization:

Normalization is also known as min-max scaling or rescaling.

The range of the normalized distribution is between 0 and 1 (or -1 to 0 if there are negative values).

The formula for a min-max normalization to [0, 1] is given as:

  x' = (x − min(x)) / (max(x) − min(x))

The formula for rescaling a range to an arbitrary set of values [a, b] is given as:

  x' = a + ((x − min(x)) · (b − a)) / (max(x) − min(x))

55
3. Unit Vector:

Scaling is done considering the whole feature vector to be of unit length. That means dividing each component by the Euclidean length of the vector:

  x' = x / ||x||

A scikit-learn sketch of these three scaling methods follows below.
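A minimal scikit-learn sketch of the three scaling methods above (standardization, min-max normalization, unit-vector scaling); the tiny data matrix is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 600.0]])

X_std = StandardScaler().fit_transform(X)        # standardization: each feature gets mean 0, std 1
X_minmax = MinMaxScaler().fit_transform(X)       # min-max normalization: rescales each feature to [0, 1]
X_unit = Normalizer(norm="l2").fit_transform(X)  # unit vector: each row divided by its Euclidean length
print(X_std, X_minmax, X_unit, sep="\n\n")
```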

56
Normalization Techniques

Four common normalization techniques may be useful:


• scaling to a range
• clipping
• log scaling
• z-score

57
Summary of normalization techniques

58
Scaling to a range

Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range, usually 0 to 1 (or sometimes -1 to +1).

Formula to scale to a range:  x' = (x − x_min) / (x_max − x_min)

Scaling to a range is a good choice when both of the following conditions are
met:
• You know the approximate upper and lower bounds on your data with few or
no outliers.
• Your data is approximately uniformly distributed across that range.
• A good example is age. Most age values fall between 0 and 90, and every
part of the range has a substantial number of people.
• In contrast, you would not use scaling on income, because only a few people
have very high incomes. The upper bound of the linear scale for income
would be very high, and most people would be squeezed into a small part of
59 the scale.
Feature Clipping

If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain threshold at a fixed value.

For example, you could clip all temperature values above 40 to be exactly 40.
You may apply feature clipping before or after other normalizations.
Formula: set min/max values to avoid outliers, e.g. x' = min(max(x, min_value), max_value)

60
Comparing a raw distribution and its clipped version.

Another simple clipping strategy is to clip by z-score to +-Nσ (for


example, limit to +-3σ). Note that σ is the standard deviation

61
Log Scaling

• Log scaling computes the log of your values to compress a wide range to a
narrow range.

• Log scaling is helpful when a handful of your values have many points, while
most other values have few points. This data distribution is known as
the power law distribution. Movie ratings are a good example.

• In the chart below, most movies have very few ratings (the data in the tail),
while a few have lots of ratings (the data in the head).

• Log scaling changes the distribution, helping to improve linear model


performance.

62
Comparing a raw distribution to its log..

63
Z-Score

• Z-score is a variation of scaling that represents the number of standard


deviations away from the mean.

• You would use z-score to ensure your feature distributions have mean =
0 and std = 1. It’s useful when there are a few outliers, but not so
extreme that you need clipping.

• The formula for calculating the z-score of a point x is:  x' = (x − μ) / σ

64
Comparing a raw distribution to its z-score distribution.

z-score squeezes raw values that have a range of ~40000 down into a range from
roughly -1 to +4.
Suppose you're not sure whether the outliers truly are extreme. In this case, start
with z-score unless you have feature values that you don't want the model to learn;
for example, the values are the result of measurement error or a quirk.
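A small numpy sketch of clipping, log scaling, and z-scoring; the values are illustrative.

```python
import numpy as np

temps = np.array([12.0, 25.0, 31.0, 38.0, 55.0])

clipped = np.clip(temps, None, 40.0)          # feature clipping: cap all values above 40 at exactly 40

ratings_count = np.array([1.0, 3.0, 10.0, 2500.0])
log_scaled = np.log(ratings_count)            # log scaling compresses a wide range into a narrow one

z = (temps - temps.mean()) / temps.std()      # z-score: number of standard deviations from the mean
print(clipped, log_scaled, z, sep="\n")
```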

65
Summary

66
Lecture 4

Feature selection and Filtering

67
Feature Selection

• Feature - attribute that has an impact on a problem or is useful for the


problem
• Choosing the important features for the model is known as feature
selection.
• Each machine learning process depends on feature engineering, which
mainly contains two processes; which are Feature Selection and Feature
Extraction.

• The main difference between them is that feature selection is about


selecting the subset of the original feature set, whereas feature
extraction creates new features. Feature selection is a way of reducing
the input variable for the model by using only relevant data in order to
reduce overfitting in the model.

68
Feature Selection

It is a process of automatically or manually selecting the subset of most


appropriate and relevant features to be used in model building."

Feature selection is performed by either including the important features or


excluding the irrelevant features in the dataset without changing them.

69
Benefits of Feature Selection

• It helps in avoiding the curse of dimensionality.


• It helps in the simplification of the model so that it can be easily
interpreted by the researchers.
• It reduces the training time.
• It reduces overfitting hence enhance the generalization.

70
Feature Selection Techniques

• There are mainly two types of Feature Selection techniques, which are:

• Supervised Feature Selection technique


Supervised Feature selection techniques consider the target variable and
can be used for the labelled dataset.

• Unsupervised Feature Selection technique


Unsupervised feature selection techniques ignore the target variable and can be used for the unlabeled dataset (a small supervised example is sketched below).
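As an illustration of the supervised (filter-style) case, here is a small scikit-learn sketch using SelectKBest on the built-in iris dataset; the choice of score function and k is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features most related to the target according to the ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)           # (150, 4)
print("Reduced shape:", X_selected.shape)   # (150, 2)
print("Selected feature indices:", selector.get_support(indices=True))
```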

71
Feature Selection Techniques

72
Lecture 5

Dimension Reduction

73
Dimensionality reduction

• Dimensionality reduction is the process of reducing the number of


features in a dataset while retaining as much information as possible.

• This can be done to reduce the complexity of a model, improve the


performance of a learning algorithm, or make it easier to visualize the
data.

Techniques for dimensionality reduction include:

• principal component analysis (PCA)


• singular value decomposition (SVD) and
• linear discriminant analysis (LDA).

74
Dimensionality reduction

• Each technique projects the data onto a lower-dimensional space


while preserving important information.

• Dimensionality reduction is performed during pre-processing


stage before building a model to improve the performance

• It is important to note that dimensionality reduction can also


discard useful information, so care must be taken when applying
these techniques.

75
Advantages of Dimensionality Reduction

• It helps in data compression, and hence reduced storage space.


• It reduces computation time.
• It also helps remove redundant features, if any.

76
Disadvantages of Dimensionality Reduction

• It may lead to some amount of data loss.


• PCA tends to find linear correlations between variables, which is
sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define
datasets.
• We may not know how many principal components to keep- in practice,
some thumb rules are applied.

77
Lecture 6

Principal Component Analysis (PCA)

78
Principal Components Analysis ( PCA)

• This method was introduced by Karl Pearson.


• It works on a condition that while the data in a higher dimensional
space is mapped to data in a lower dimension space, the variance of the
data in the lower dimensional space should be maximum.

79
It involves the following steps:

• Construct the covariance matrix of the data.


• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues are
used to reconstruct a large fraction of variance of the
original data.

80
Principal Components Analysis ( PCA)

• An exploratory technique used to reduce the dimensionality of the data set to 2D or


3D
• Can be used to:

– Reduce number of dimensions in data


– Find patterns in high-dimensional data
– Visualize data of high dimensionality
• Example applications:

– Face recognition
– Image compression
– Gene expression analysis

81
Principal Components Analysis Ideas ( PCA)

• Does the data set ‘span’ the whole of d dimensional space?

• For a matrix of m samples x n genes, create a new


covariance matrix of size n x n.

• Transform some large number of variables into a smaller


number of uncorrelated variables called principal
components (PCs).

• developed to capture as much of the variation in data as


possible

82
Principal Components Analysis Ideas

[Figure: scatter of data points in the X1–X2 plane with the principal directions Y1 and Y2 drawn through the cloud. Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable. Key observation: the variance along Y1 is largest.]

83
Principal Component Analysis: one attribute first

• Question: how much spread is in the data along the axis? (distance to the mean)
• Variance = (standard deviation)^2

  s^2 = Σ_{i=1..n} (X_i − X̄)^2 / (n − 1)

Temperature data: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30

84
Now consider two dimensions

• Covariance: measures the correlation between X and Y
  • cov(X,Y) = 0: independent
  • cov(X,Y) > 0: X and Y move in the same direction
  • cov(X,Y) < 0: X and Y move in opposite directions

  cov(X, Y) = Σ_{i=1..n} (X_i − X̄)(Y_i − Ȳ) / (n − 1)

Data (X = Temperature, Y = Humidity):
  X: 40 40 40 30 15 15 15 30 15 30 30 30 40 30
  Y: 90 90 90 90 70 70 70 90 70 70 70 90 70 90

85
More than two attributes: covariance matrix

• Contains covariance values between all possible dimensions (= attributes):

  C(n×n) = (c_ij | c_ij = cov(Dim_i, Dim_j))

• Example for three attributes (x, y, z):

      | cov(x,x)  cov(x,y)  cov(x,z) |
  C = | cov(y,x)  cov(y,y)  cov(y,z) |
      | cov(z,x)  cov(z,y)  cov(z,z) |

86
Eigenvalues & eigenvectors

• Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix).

• In the equation Ax = λx, λ is called an eigenvalue of A.

  | 2  3 |   | 3 |   | 12 |        | 3 |
  | 2  1 | × | 2 | = |  8 |  = 4 × | 2 |

87
Eigenvalues & Eigenvectors

• Ax = λx  ⇒  (A − λI)x = 0

• How to calculate x and λ:

  – Calculate det(A − λI); this yields a polynomial of degree n

  – Determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ

  – Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x

88
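A quick numerical check of the 2×2 example above, assuming numpy is available.

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)        # the eigenvalues are 4 and -1
print(eigenvectors)       # columns are eigenvectors; the one for 4 is parallel to [3, 2]

x = np.array([3.0, 2.0])
print(A @ x)              # [12, 8] = 4 * [3, 2]
```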
Principal components

• principal component (PC1)


– The eigenvalue with the largest absolute value will
indicate that the data have the largest variance along
its eigenvector, the direction along which there is
greatest variation
• principal component (PC2)
– the direction with the maximum variation left in the data, orthogonal to the first PC

In general, only few directions manage to capture most


of the variability in the data.

89
Steps of PCA

• Let X̄ be the mean vector (taking the mean of all rows).

• Adjust the original data by the mean: X' = X − X̄

• Compute the covariance matrix C of the adjusted data X'.

• Find the eigenvectors and eigenvalues of C.

• For matrix C, the (column) vectors e having the same direction as Ce are the eigenvectors of C: Ce = λe, where λ is called an eigenvalue of C.

• Ce = λe  ⇒  (C − λI)e = 0
  – Most data mining packages do this for you (a small numpy sketch follows below).
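A minimal numpy sketch of these steps; the small data set reuses the example given later in this lecture and is only illustrative.

```python
import numpy as np

X = np.array([[19.0, 63.0], [39.0, 74.0], [30.0, 87.0], [30.0, 23.0],
              [15.0, 35.0], [15.0, 43.0], [15.0, 32.0], [30.0, 73.0]])

X_mean = X.mean(axis=0)                  # mean vector
X_adj = X - X_mean                       # adjust the original data by the mean
C = np.cov(X_adj, rowvar=False)          # covariance matrix of the adjusted data

eigenvalues, eigenvectors = np.linalg.eig(C)   # Ce = lambda * e
order = np.argsort(eigenvalues)[::-1]          # sort components by variance (largest first)
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

variance_pct = 100 * eigenvalues / eigenvalues.sum()   # % of total variance per component
Y = X_adj @ eigenvectors                 # transformed data (principal component scores)
print(variance_pct)
print(Y[:, 0])                           # projection onto the first principal component
```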

90
Eigenvalues

• Calculate the eigenvalues λ and eigenvectors x of the covariance matrix:

  – The eigenvalues λ_j are used to calculate the percentage of total variance V_j captured by each component j:

    V_j = 100 · λ_j / Σ_{x=1..n} λ_x

91
Principal components - Variance

[Figure: bar chart of the variance (%) captured by each principal component, PC1 through PC10; the vertical axis runs from 0 to 25%.]

92
Transformed Data

• The eigenvalue λ_j corresponds to the variance on component j
• Thus, sort the components by λ_j
• Take the first p eigenvectors e_i, where p is the number of top eigenvalues
• These are the directions with the largest variances

  | y_i1 |   | e_1 |   | x_i1 − x̄_1 |
  | y_i2 | = | e_2 | × | x_i2 − x̄_2 |
  |  ... |   | ... |   |     ...    |
  | y_ip |   | e_p |   | x_in − x̄_n |

93
An Example
Mean1 = 24.1, Mean2 = 53.8

X1   X2   X1'    X2'
19   63   -5.1    9.25
39   74   14.9   20.25
30   87    5.9   33.25
30   23    5.9  -30.75
15   35   -9.1  -18.75
15   43   -9.1  -10.75
15   32   -9.1  -21.75
30   73    5.9   19.25

[Figures: scatter plot of the original data (X1 vs X2) and scatter plot of the mean-adjusted data (X1' vs X2').]
94
Example

95
PCA –> Original Data

• Retrieving old data (e.g., in data compression)

– RetrievedRowData = (RowFeatureVectorT x FinalData) +


OriginalMean
– Yields original data using the chosen components

96
Principal components

• General about principal components


– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capture as much of the original variance as possible

97
Applications – Gene expression analysis

– Reference: Raychaudhuri et al. (2000)


– Purpose: Determine core set of conditions for useful
– gene comparison
– Dimensions: conditions, observations: genes
– Yeast sporulation dataset (7 conditions, 6118 genes)
– Result: Two components capture most of variability (90%)
– Issues: uneven data intervals, data dependencies
– PCA is common prior to clustering
– Crisp clustering questioned : genes may correlate with multiple
clusters
– Alternative: determination of a gene's closest neighbours

98
References

• ‘Data preparation for data mining’, Dorian Pyle, 1999

• ‘Data Mining: Concepts and Techniques’, Jiawei Han and Micheline Kamber, 2000

• ‘Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations’, Ian H. Witten
and Eibe Frank, 1999

• ‘Data Mining: Practical Machine Learning Tools and Techniques second edition’, Ian H. Witten and Eibe
Frank, 2005

• DM: Introduction: Machine Learning and Data Mining, Gregory Piatetsky-Shapiro and Gary Parker
• (http://www.kdnuggets.com/data_mining_course/dm1-introduction-ml-data-mining.ppt)

• ESMA 6835 Mineria de Datos (http://math.uprm.edu/~edgar/dm8.ppt)

99
Thank You
Subject Name: MACHINE LEARNING
Unit 3: Learning with Regression

Faculty Name : Ms.Rajashree Shedge


Index

Lecture 13 – Linear Regression (Simple and


Multivariable)

Lecture 14 – Logistic regression

Lecture 15 – Evaluation Metrics


Unit No: 3 Unit Name : Learning with Regression

Lecture No: 13
Linear Regression
Supervised
Learning Tasks

Week4:
Week 2 Data Science with Machine Learning
What is Regression?

• Regression analysis is a predictive modelling technique that analyzes the


relation between the target or dependent variable and independent variable
in a dataset.

• Function: a mathematical relationship enabling us to predict what


values of one variable (Y) correspond to given values of another
variable (X).

• Y: is referred to as the dependent variable, the response variable or


the predicted variable.
• X: is referred to as the independent variable, the explanatory variable
or the predictor variable.

• Used Mainly for Prediction & Estimation

5
Regression….

6
Broad categories of Regression
Regression can be broadly classified into two major types.
• Linear Regression.
o The simplest case of linear regression is to find a
relationship using a linear model (i.e. line) between an
input independent variable (input single feature) and an
output dependent variable.
o This is also called Bivariate Linear Regression.

• Multivariate Linear Regression:


o When there is a linear model representing the relationship
between a dependent output and multiple independent
input variables is called Multivariate Linear Regression.

7
Broad categories of Regression….
• Logistic Regression:
o It is used when the output is categorical. It is more like a
classification problem. The output can be Success / Failure,
Yes / No, True/ False or 0/1. There is no need for a linear
relationship between the dependent output variable and
independent input variables.

8
Scatter Plot

• A scatter plot can be a helpful tool in determining the strength of the


relationship between two variables.

• A scatter plot is often employed to identify potential associations


between two variables, where one may be considered to be an
explanatory variable (such as years of education) and another may
be considered a response variable (such as annual income).

9
Scatter Plot : Example

x 1 2 3 4 5 6 7 8 9 10 11 12
y 16 35 45 64 86 96 106 124 134 156 164 182

10
Linear Regression

11
Regression Line

• Linear regression consists of finding the best-fitting straight line


through the points. The best-fitting line is called a regression line.

12
Error in prediction

• The black diagonal line in Figure is the regression line and consists
of the predicted score on Y for each possible value of X. The
vertical lines from the points to the regression line represent the
errors of prediction.

• The error of prediction for a point is the value of the point minus the
predicted value.

13
Error in
prediction
Objective: Minimize the difference between the observation and
its prediction according to the line.
y_i = β_0 + β_1 x_i + ε_i
for i = 1, 2, ..., n

ε_i = y_i − ŷ_i
    = y_i − (β̂_0 + β̂_1 x_i)

ŷ_i = predicted y value when x = x_i

Week 2
Method of Least
Squares
We want the line which is best for all points. This is done by
finding the values of b0 and b1 which minimize some sum of
errors. There are a number of ways of doing this. Consider
these two:

    min_{β_0, β_1} Σ_{i=1}^{n} ε_i        and        min_{β_0, β_1} Σ_{i=1}^{n} ε_i²

β̂_0, β̂_1 are referred to as the least squares estimates

The method of least squares produces estimates with


statistical properties (e.g. sampling distributions) which are
easier to determine

Week4:
Week 2 Data Science with Machine Learning
Method of
Least Squares
‘Best Fit’ Means Difference Between Actual Y Values &
Predicted Y Values Are a Minimum. But Positive Differences
Off-Set Negative. So square errors

    E(β_0, β_1) = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i − β_0 − β_1 x_i)²

LS Minimizes the Sum of the Squared Differences (errors) (SSE)

    ∂E/∂β_0 = 0        ∂E/∂β_1 = 0
Week 2 Data Science with Machine Learning
Week4:
Least Square Graphically
LS minimizes  Σ_{i=1}^{4} ε_i² = ε_1² + ε_2² + ε_3² + ε_4²

[Figure: four data points around the fitted line, with the prediction errors ε̂_1 … ε̂_4 shown as vertical distances]

Week 2 Data Science with Machine Learning


Week4:
Derivation of
Parameters
Least Squares (L-S): Minimize squared error

    ∂/∂β_1 Σ ε_i² = ∂/∂β_1 Σ (y_i − β_0 − β_1 x_i)² = 0

    ⇒  −2 Σ x_i (y_i − β_0 − β_1 x_i) = 0
    ⇒  −2 Σ x_i (y_i − ȳ + β_1 x̄ − β_1 x_i) = 0        (using β_0 = ȳ − β_1 x̄)
    ⇒  β_1 Σ x_i (x_i − x̄) = Σ x_i (y_i − ȳ)
    ⇒  β_1 Σ (x_i − x̄)(x_i − x̄) = Σ (x_i − x̄)(y_i − ȳ)

    β̂_1 = SS_xy / SS_xx

Week 2 Data Science with Machine Learning


Week4:
Derivation of
Parameters
S_xx = Σ_{i=1}^{n} (x_i − x̄)²
     = (x_1 − x̄)² + (x_2 − x̄)² + … + (x_n − x̄)²        Sums of squares of x.
     = Σ x_i² − (1/n)(Σ x_i)²

S_yy = Σ_{i=1}^{n} (y_i − ȳ)²
     = (y_1 − ȳ)² + (y_2 − ȳ)² + … + (y_n − ȳ)²        Sums of squares of y.
     = Σ y_i² − (1/n)(Σ y_i)²

Week 2 Data Science with Machine Learning


Week4:
Derivation of
Parameters
S_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)
     = (x_1 − x̄)(y_1 − ȳ) + … + (x_n − x̄)(y_n − ȳ)        Sums of cross products of x and y.
     = Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i)

β̂_1 = S_XY / S_XX
β̂_0 = ȳ − β̂_1 x̄

Week 2 Data Science with Machine Learning


Week4:
Linear Regression : In Simpler Form
• The simple linear model is expressed using the following equation:

y_i = β_0 + β_1 x_i + ε_i
for i = 1, 2, ..., n
Where,
y – variable that is dependent
x – Independent (explanatory) variable
β_0 – Intercept
β_1 – Slope
ϵ – Residual (error)

• For simplicity in calculations, we assume the error to be 0.

• The simple form of the regression model is a line equation: ŷ = β_0 + β_1 x

21
Linear Regression : In Simpler Form

• Regression model: ŷ = b_0 + b_1 x

• Here b_0 and b_1 are called the regression coefficients.

• To calculate b_0 and b_1, use the following formulas:

    b_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²        b_0 = ȳ − b_1 x̄

• Where, x̄ is the mean of x and ȳ is the mean of y
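
A short Python sketch of these formulas; the x and y arrays reuse the CPU-time example that follows in the slides:

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) for the fitted line y_hat = b0 + b1 * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    s_xy = np.sum((x - x_bar) * (y - y_bar))   # sum of cross products
    s_xx = np.sum((x - x_bar) ** 2)            # sum of squares of x
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Number of jobs vs CPU time (the example data on the next slide)
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 9, 10]
print(least_squares_fit(x, y))   # approximately (0.0, 2.0), i.e. y_hat = 2x
```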

22
Linear Regression : Example

1. The following data pertain to number of computer jobs per day and the
central processing unit (CPU) time required.

Number of jobs CPU time


x y
1 2
2 5
3 4
4 9
5 10

23
Linear Regression : Example

ŷ = 2x   (x̄ = 3, ȳ = 6, S_xy = 20, S_xx = 10, so b_1 = 20/10 = 2 and b_0 = 6 − 2·3 = 0)
Week 2 Data Science with Machine Learning
Week4:
Exercise
2. The following table shows the midterm and final exam grades obtained for
students in a database course. Use the method of least squares
regression to predict the final exam grade of a student who received 80 on
the midterm exam.

Midterm Exam (X)   Final Exam (Y)

72 84
50 63
81 77
74 78
94 90
86 75
59 49
83 79
65 77
33 52
88 74
81 90
25
Exercise
3. A clinical trial gave the following data about the BMI and cholesterol level
of 10 patients. Predict the likely value of the cholesterol level for a patient who
has a BMI of 27.
BMI Cholesterol
17 140
21 189
24 210
28 240
14 130
16 100
19 135
22 166
15 130
18 170

26
Exercise

4. Find the regression coefficients for the following data:

27
Let’s revise through a small video

Introduction to Linear Regression:

https://www.youtube.com/watch?v=zPG4NjIkCjc

28
Multivariable/Multivariate Regression

• Multivariate regression is a technique used to measure


the degree to which the various independent variable
and various dependent variables are linearly related
to each other.
• The relation is said to be linear due to the correlation
between the variables.
• E.g.
o An agriculture expert decides to study the
crops that were ruined in a certain region.
He collects data about recent climatic
changes, water supply, irrigation methods,
pesticide usage, etc., to understand why
the crops are turning black and do not yield well.
29
Steps for Multivariate Regression

1. Select the features


• Features that are highly responsible for the change in
your dependent variable.
2. Normalize the feature
• Scale them to a certain range (preferably 0–1) so that
analysing them becomes easier.
3. Select Loss function and Hypothesis
• A formulated hypothesis is nothing but a predicted
value of the response variable and is denoted by
h(x).
• A loss function is a calculated loss when the
hypothesis predicts a wrong value.

30
Steps for Multivariate Regression

4. Minimize the loss function


• Loss minimization algorithms can be run over the
datasets. These algorithms then adjust the
parameters of the hypothesis.
• One of the minimization algorithms that can be used
is the gradient descent algorithm.
5. Test the hypothesis
• The formulated hypothesis is then tested with a test
set to check its accuracy and correctness.

31
Multivariable regression

Solved Problem:
https://www.statology.org/multiple-linear-regression-by-hand/
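
As a hedged sketch of fitting a multiple linear regression in Python (the feature matrix and target below are placeholders, not the data from the linked article):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data: two features per observation and one continuous target
X = np.array([[1, 0], [2, 1], [3, 1], [4, 2], [5, 2], [6, 3]], dtype=float)
y = np.array([55, 60, 66, 70, 78, 85], dtype=float)

model = LinearRegression().fit(X, y)
print("intercept (b0):", model.intercept_)
print("coefficients (b1, b2):", model.coef_)
print("prediction for [3.5, 2]:", model.predict([[3.5, 2.0]]))
```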

32
Applications of Regression
• Forecasting continuous outcomes like house prices, stock prices, or
sales.
• Predicting the success of future retail sales or marketing campaigns to
ensure resources are used effectively.
• Predicting customer or user trends, such as on streaming services or
ecommerce websites.
• Analysing datasets to establish the relationships between variables and an
output.
• Predicting interest rates or stock prices from a variety of factors.
• Creating time series visualisations.

33
Real Life Examples on Linear Regression
1. Businesses often use linear regression to understand the relationship
between advertising spending and revenue.
• The regression model would take the following form:
revenue = β0 + β1(ad spending)
• The coefficient β0 would represent total expected revenue when ad
spending is zero.
• The coefficient β1 would represent the average change in total revenue
when ad spending is increased by one unit (e.g. one dollar).
• If β1 is negative, it would mean that more ad spending is associated with
less revenue.
• If β1 is close to zero, it would mean that ad spending has little effect on
revenue.
• And if β1 is positive, it would mean more ad spending is associated with
more revenue.
• Depending on the value of β1, a company may decide to either
decrease or increase their ad spending.

34
Real Life Examples on Linear Regression
2. Medical researchers often use linear regression to understand the
relationship between drug dosage and blood pressure of patients.
• The regression model would take the following form:
blood pressure = β0 + β1(dosage)
• The coefficient β0 would represent the expected blood pressure when
dosage is zero.
• The coefficient β1 would represent the average change in blood
pressure when dosage is increased by one unit.
• If β1 is negative, it would mean that an increase in dosage is associated
with a decrease in blood pressure.
• If β1 is close to zero, it would mean that an increase in dosage is
associated with no change in blood pressure.
• If β1 is positive, it would mean that an increase in dosage is associated
with an increase in blood pressure.
• Depending on the value of β1, researchers may decide to change the
dosage given to a patient.

35
Real Life Examples on Linear Regression
3. Agricultural scientists often use linear regression to measure the effect of
fertilizer and water on crop yields.
• The regression model would take the following form:
crop yield = β0 + β1(amount of fertilizer) + β2(amount of water)
• The coefficient β0 would represent the expected crop yield with no
fertilizer or water.
• The coefficient β1 would represent the average change in crop yield
when fertilizer is increased by one unit, assuming the amount of water
remains unchanged.
• The coefficient β2 would represent the average change in crop yield
when water is increased by one unit, assuming the amount of fertilizer
remains unchanged.
• Depending on the values of β1 and β2, the scientists may change the
amount of fertilizer and water used to maximize the crop yield.

36
Real Life Examples on Linear Regression
4. Data scientists for professional sports teams often use linear regression to
measure the effect that different training regimens have on player
performance.
• For example, data scientists in the NBA might analyze how different
amounts of weekly yoga sessions and weightlifting sessions affect the
number of points a player scores.
• The regression model would take the following form:
points scored = β0 + β1(yoga sessions) + β2(weightlifting
sessions)
• The coefficient β0 would represent the expected points scored for a
player who participates in zero yoga sessions and zero weightlifting
sessions.

37
Unit No: 3 Unit Name :Learning with regression

Lecture No: 14
Logistic Regression
Logistic regression
introduction
• Logistic regression models a relationship between predictor
variables and a categorical response variable.

• Logistic regression helps us estimate a probability of falling into


a certain level of the categorical response given a set of
predictors

• We can choose from three types of logistic regression,


depending on the nature of the categorical response variable:
o Binary logistic regression
o Nominal logistic regression
o Ordinal logistic regression

Week 2
Why not Linear Regression?

Suppose we have data on tumor size vs. its malignancy. As it is a


classification problem, if we plot it, we can see that all the values lie
on 0 and 1. And if we fit the best-found regression line, by
assuming the threshold at 0.5, the line does a pretty reasonable
job.

Week 2
Why not Linear
Regression?
1. We cannot use any of the well-established routines for statistical
inference with least squares (e.g., confidence intervals, etc.),
because these are based on a model in which the outcome is
continuously distributed. At an even more basic level, it is hard to
precisely interpret β

2. We cannot use this method when the number of classes exceeds


2. If we were to simply code the response as 1, . . . , K for a
number of classes K > 2, then the ordering here would be
arbitrary, but it actually matters

Week 2
Why not Linear
Regression?

Week 2
Logistic regression
The y is usually a yes/no type of response.

This is usually interpreted as the probability of an event happening (y = 1) or not


happening (y = 0). This can be deconstructed as:

● If y is an event (response, pass/fail, etc.),

● and p is the probability of the event happening (y = 1),

● then (1 - p) is the probability of the event not happening (y = 0),

● and p/(1 - p) are the odds of the event happening

But there is an issue here: the value of P can exceed 1


or go below 0, and we know that the range of a probability is (0–1).
To overcome this issue we take the "odds" of P:

Week 2
Logistic regression
For a more general case, involving multiple independent variables, x, there is:

logit = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2 +…+ 𝑏𝑛 𝑥𝑛

The logit is the logarithm of the odds of the response, y, expressed as a function of
independent or predictor variables, x, and a constant term.

Week 2
Logistic regression

The problem here is that the range is restricted and we don’t want a restricted range
because if we do so then our correlation will decrease.

It is difficult to model a variable that has a restricted range. To control this we take
the log of odds which has a range from (-∞,+∞).

Logit function: logit(p) = ln(p / (1 − p))

This formulation is also useful for interpreting the model, since the logit can be
interpreted as the log odds of a success

Week 2
Logistic regression
We exponentiate both sides and solve for P: p/(1 − p) = e^(b0 + b1x), so P = e^(b0 + b1x) / (1 + e^(b0 + b1x)) = 1 / (1 + e^−(b0 + b1x)).

Week 2
Logistic regression

Euler's number, e ≈ 2.71828,
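
A small Python sketch of this sigmoid relationship; the coefficients b0 and b1 below are placeholders, not fitted values:

```python
import math

def logistic_probability(x, b0, b1):
    """P(y = 1 | x) for a simple logistic regression model."""
    log_odds = b0 + b1 * x                    # the logit
    return 1.0 / (1.0 + math.exp(-log_odds))  # inverse of the logit

# Placeholder coefficients, just to show the S-shaped behaviour
for x in (-2.0, 0.0, 0.5, 2.0):
    print(x, round(logistic_probability(x, b0=-1.0, b1=2.0), 3))
```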

Week 2
Linear Vs Logistic regression

Week 2
Logistic regression- Example

Week 2
Logistic regression- Example

Week 2
Logistic regression- Example

Week 2
Logistic regression –
Exercise
The dataset of amount of savings and loan non-defaulter status is
given in the table below. Find the sigmoid function values for logistic
regression given
Log odds = −4.0778 + 1.5046 × (amount of savings)
Calculate the probability of being a loan non-defaulter for an amount
of savings of 2.5.

X                      Y
(Amount of savings)    (Loan Non-Defaulter)
0.5                    0
1.0                    0
2.0                    1
2.5                    0
4.0                    1

Week 2
Unit No: 3 Unit Name : Learning with Regression

Lecture No: 15
Evaluation Metrics for
Regression
Model Evaluation
• Model evaluation helps you to understand the
performance of your model and makes it easy to
present your model to others.
• There are 3 main metrics for model evaluation in
regression:
1. R Square/Adjusted R Square
2. Mean Square Error(MSE)/Root Mean Square
Error(RMSE)
3. Mean Absolute Error(MAE)

Week 2
R Square/Adjusted R
Square
• R Square measures how much of the variability in the dependent
variable can be explained by the model.
• It is the square of the Correlation Coefficient (R), which is
why it is called R Square.

• R Square is calculated as one minus the sum of squared prediction errors


divided by the total sum of squares, which uses the mean in place of
the calculated prediction.
• The R Square value is between 0 and 1, and a bigger value indicates a
better fit between prediction and actual value.
• R Square is a good measure to determine how well the model
fits the dependent variable. However, it does not take the
overfitting problem into consideration.
• Adjusted R Square is introduced because it penalizes the addition of
independent variables that do not improve the model.
Week 2
Mean Square Error(MSE)/Root Mean
Square Error(RMSE)
• While R Square is a relative measure of how well the model fits
dependent variables, Mean Square Error is an absolute measure
of the goodness for the fit.

• MSE is calculated as the sum of the squared prediction errors


(real output minus predicted output) divided by the
number of data points.
• It gives you an absolute number on how much your predicted
results deviate from the actual number.
• Root Mean Square Error(RMSE) is the square root of
MSE. MSE is calculated by the square of error, and thus square
root brings it back to the same level of prediction error and makes
it easier for interpretation.
Week 2
Mean Absolute
Error(MAE)
• Mean Absolute Error(MAE) is similar to Mean Square
Error(MSE). However, instead of the sum of square of error in
MSE, MAE is taking the sum of the absolute value of error.

• Compared to MSE or RMSE, MAE is a more direct


representation of sum of error terms.
• MSE penalizes large prediction errors more heavily by squaring
them, while MAE treats all errors the same.
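
A short sketch of computing these metrics with scikit-learn; y_true and y_pred below are placeholder arrays:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Placeholder actual and predicted values
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

mse = mean_squared_error(y_true, y_pred)
print("R Square:", r2_score(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
```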

Week 2
Thank You
Subject Name: MACHINE LEARNING
Unit No: 3 Classification

Faculty Name : Ms. Rajashree Shedge
Index

Lecture 16 – Introduction to NN

Lecture 17 – McCulloch-Pitt’s Neuron

Lecture 18 – NN case study


Unit No: 3 Learning for Regression

Lecture No: 16
Introduction to NN
Human Brain

• The human brain is composed of specific types of cells, called neurons, which do not


regenerate.
• As they aren't replaced, they have the ability to remember, think & apply previous
experience to everyday actions.

 Human brain:
one hundred billion (100,000,000,000) neurons
each with about 1000 synaptic connections

 Each connected to other neurons

 Together these neurons & their connections form a process.


 How synapses are wired defines our brilliance
 Learning : changing effectiveness of synapses.

4 Week 2
INTERCONNECTIONS IN BRAIN
Biological Neural Network (Visualization)

6
Biological neuron

• collects inputs using dendrites


• sums up all inputs from
dendrites
• if the resulting value is greater
than its firing threshold, the
neuron fires.
• Firing neuron sends an
electrical impulse through the
neuron's axon to its boutons.
• Boutons connect to other
neurons via synapses.
Artificial Neural Network : An Introduction

• Resembles the characteristics of a biological neural network.


• Nodes – interconnected processing elements (units or neurons)
• Each neuron is connected to the others by a connection link.
• Each connection link is associated with a weight which carries information about the
input signal.
• ANN processing elements are called neurons or artificial neurons, since
they have the capability to model networks of original neurons as found in the
brain.
• The internal state of a neuron is called the activation or activity level of the neuron, which
is a function of the inputs the neuron receives.
• A neuron can send only one signal at a time.

8 Week 2
Artificial Neural Networks

• Hopes to reproduce human brain by artificial means.

• Mimics how our nervous system process information.

• ANN is composed of a large number of highly interconnected


processing elements (neurons) working in unison to solve specific
problems.

• ANNs, like people, learn by example/experience.

• It is configured for special application such as pattern recognition and data


classification through a learning process.

• 85-90% accurate.
Definition of Neural Networks

• According to the DARPA Neural Network Study (1988, AFCEA


International Press, p. 60):

“A neural network is a system composed of many simple processing


elements operating in parallel whose function is determined by network
structure, connection strengths, and the processing performed at computing
elements or nodes.”

• According to Haykin (1994), p. 2:

“A neural network is a massively parallel distributed processor that has a


natural propensity for storing experiential knowledge and making it available
for use.”
It resembles the brain in two respects:
• Knowledge is acquired by the network through a learning process.
• Interneuron connection strengths known as synaptic weights are used to
store the knowledge.
From Human Neurons to Artificial Neurons

Biological Neuron Artificial Neuron


Cell Neuron
Dendrites Weights or interconnections
Soma/cell body Net input
Axon Output
Basic Operation of a Neural Net

• X1 and X2 – input neurons.


• Y – output neuron

• Weighted interconnection links – w1
and w2.
• Net input calculation is:
    y_in = x1·w1 + x2·w2

• Output is:
    y = f(y_in), where f is the activation function

• In general the net input is calculated by:
    y_in = Σ_i x_i·w_i

12 Lecture 5 – Basics of NN
Components of Neural Networks

Basic Models of
ANN

Activation
Interconnections Learning rules
function

13
Basic models of ann

 The arrangement of neurons to form layers and the connection pattern formed
within and between layers is called the network architecture.
 Five types:
o Single layer feed forward network
o Multilayer feed-forward network
o Single node with its own feedback
o Single-layer recurrent network
o Multilayer recurrent network

14 Lecture 7 – Basic Models of NN


Single layer Feed- Forward Network

 Layer is formed by taking


processing elements and combining
it with other processing elements.
 Input and output are linked with
each other
 Inputs are connected to the
processing nodes with various
weights, resulting in series of
outputs one per node.

15 Lecture 7 – Basic Models of NN


Multilayer feed-forward network

 Formed by the interconnection of


several layers.
 Input layer receives input and
buffers input signal.
 Output layer generated output.
 Layer between input and output is
called hidden layer.
 Hidden layer is internal to the
network.
 Zero to several hidden layers in a
network.
 More the hidden layer, more is the
complexity of network, but efficient
output is produced.

16 Lecture 7 – Basic Models of NN


Feedback network

• If no neuron in the output layer is an


input to a node in the same layer /
preceding layer – feed-forward
network.
• If outputs are directed back as inputs
to the processing elements in the
same layer / preceding layer –
feedback network.
 If the output are directed back to the
input of the same layer then it is
lateral feedback.
 Recurrent networks are networks
with feedback networks with closed
loop.

17 Lecture 7 – Basic Models of NN


Continued….

• Processing element output can be


directed back to the nodes in the
preceding layer, forming a
multilayer recurrent network.
• Processing element output can be
directed to processing element
itself or to other processing
element in the same layer.

18 Lecture 7 – Basic Models of NN


Learning

• Two broad kinds of learning in ANNs are:


i) parameter learning – updates the connecting weights in a neural net.
ii) structure learning – focuses on changes in the network structure.
• Apart from these, learning in ANN is classified into three categories:
i) supervised learning
ii) unsupervised learning
iii) reinforcement learning

19 Lecture 7 – Basic Models of NN


Supervised learning

 Learning with the help of a teacher.


 Example : learning process of a small
child.
• The child doesn't know how to read/write.
• Each and every action is supervised
by a teacher.
• In ANN, each input vector requires a
corresponding target vector, which
represents the desired output.
• The input vector along with the target vector
is called a training pair.
• The input vector results in an output vector.
• The actual output vector is compared with
the desired output vector.
• If there is a difference, an error
signal is generated by the network. It is
used to adjust the weights until the
actual output matches the desired output.

20 Lecture 7 – Basic Models of NN


Unsupervised learning

 Learning is performed without the


help of a teacher.
• Example: a tadpole – it learns to swim by
itself.
• In ANN, during the training process, the
network receives input patterns and
organizes them to form clusters.
• From the figure it is observed that no
feedback is applied from the environment
to indicate what the output should be or
whether it is correct.
• The network itself discovers patterns,
regularities, features/categories from
the input data and relations of the
input data to the output.
• Exact clusters are formed by
discovering similarities &
dissimilarities, so this is called self-
organizing.
21 Lecture 7 – Basic Models of NN
Activation functions

• To make the work more efficient and to obtain an exact output, some force or activation is
applied.
• In the same way, an activation function is applied over the net input to calculate the output
of an ANN.
• The information processing of a processing element has two major parts: input and
output.
• An integration function (f) is associated with the input of a processing element.
• There are several activation functions.
1. Identity function:
it is a linear function which is defined as
f(x) = x for all x
The output is the same as the input.
2. Binary step function
it is defined as
    f(x) = 1 if x ≥ θ;  0 if x < θ
where θ represents the threshold value.


It is used in single-layer nets to convert the net input to an output that is
binary (0 or 1)
22 Lecture 7 – Basic Models of NN
Activation functions….

3. Bipolar step function:


It is defined as
    f(x) = 1 if x ≥ θ;  −1 if x < θ
where θ represents the threshold value.


It is used in single-layer nets to convert the net input to an output that is
bipolar (+1 or −1).
4. Sigmoid function
used in back-propagation nets.
Two types:
a) binary sigmoid function
- also called the logistic sigmoid function or unipolar sigmoid function.
- it is defined as
    f(x) = 1 / (1 + e^(−λx))
where λ – steepness parameter.
- The derivative of this function is
f'(x) = λ f(x)[1 − f(x)]. The range of the sigmoid function is 0 to 1.

23 Lecture 7 – Basic Models of NN


Activation functions….

b) Bipolar sigmoid function
    f(x) = 2 / (1 + e^(−λx)) − 1 = (1 − e^(−λx)) / (1 + e^(−λx))

where λ – steepness parameter; the sigmoid range is between −1 and


+1.

5. Ramp function
    f(x) = 1 if x > 1;  x if 0 ≤ x ≤ 1;  0 if x < 0

The graphical representation of all these functions is given in the
following figure.
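
A small Python sketch of these activation functions; the steepness parameter lam and threshold theta are illustrative defaults:

```python
import numpy as np

def identity(x):
    return x                                   # f(x) = x

def binary_step(x, theta=0.0):
    return np.where(x >= theta, 1, 0)          # output 0 or 1

def bipolar_step(x, theta=0.0):
    return np.where(x >= theta, 1, -1)         # output -1 or +1

def binary_sigmoid(x, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * x))      # range (0, 1)

def bipolar_sigmoid(x, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * x)) - 1  # range (-1, +1)

def ramp(x):
    return np.clip(x, 0.0, 1.0)                # 0 below 0, x in [0, 1], 1 above 1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x), np.round(binary_sigmoid(x), 3))
```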

24 Lecture 7 – Basic Models of NN


Activation Functions....

25 Lecture 7 – Basic Models of NN


Unit No: 3 Learning for Regression

Lecture No: 17
McCulloch-Pitt’s Neuron
McCulloch-Pitts neuron

• Proposed by McCulloch and Pitts in 1943.
• Usually called the M-P neuron.
• M-P neurons are connected by directed weighted paths.
• The activation of an M-P neuron is binary, i.e. at any time step the neuron may fire
or may not fire.
• Weights associated with the communication links may be excitatory (weights are
positive) / inhibitory (weights are negative).
• The threshold plays a major role here. There is a fixed threshold for each neuron,
and if the net input to the neuron is greater than the threshold then the
neuron fires.
• They are widely used in logic functions.

27
Continued...

 A simple M-P neuron is shown in


the figure.
• It is excitatory with weight w (w > 0) or
inhibitory with weight −p (p > 0).
• In the Fig., inputs x1 to xn
possess excitatory weighted
connections and xn+1 to xn+m have
inhibitory weighted interconnections.
• Since the firing of the neuron is based
on a threshold, the activation function is
defined as
    f(y_in) = 1 if y_in ≥ θ;  0 if y_in < θ

28
Continued....

• For inhibition to be absolute, the threshold with the activation function should
satisfy the following condition:
    θ > nw − p
• The output will fire if it receives "k" or more excitatory inputs but no inhibitory
inputs, where
    kw ≥ θ > (k − 1)w
 The M-P neuron has no particular training algorithm.
 An analysis is performed to determine the weights and the threshold.
 It is used as a building block where any function or phenomenon is modeled
based on a logic function.

29
Practice Problem

• Implement the AND function using a McCulloch-Pitts neuron.


• Steps:
  • Provide training data (truth table of the AND operation)
  • Assume weights
  • Draw the NN architecture
  • Calculate the net input

  • Choose a threshold to apply the activation function


  • Calculate the output using the activation function (a code sketch follows below).

30
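
A minimal Python sketch of these steps for the AND function, assuming the common choice of excitatory weights w1 = w2 = 1 and threshold θ = 2:

```python
def mp_neuron(inputs, weights, theta):
    """McCulloch-Pitts neuron: fires (1) if the net input reaches the threshold."""
    y_in = sum(x * w for x, w in zip(inputs, weights))   # net input
    return 1 if y_in >= theta else 0                     # binary step activation

# AND function with assumed weights w1 = w2 = 1 and threshold theta = 2
weights, theta = [1, 1], 2
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", mp_neuron([x1, x2], weights, theta))
# Only the input (1, 1) fires, matching the AND truth table
```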
Practice Problem

31
Practice Problem

32
Practice Problem

33
Reference for problem solving

PRINCIPLES OF SOFT COMPUTING- Book by S.N. Deepa & S.N. Sivanandam

https://pg.its.edu.in/sites/default/files/MCAKCA032-
PRINCIPALES%20OF%20SOFT%20COMPUTING-SN%20SIVNANDAM%20AND%20DEEPA%20SN.pdf

• McCulloch-Pitt’s Model (M-P Neuron): Book Pg No. 34


Unit No: 3 Learning for Regression

Lecture No: 18
NN Case Study
NN for Regression

• The purpose of using Artificial Neural Networks for regression instead of linear
regression is that linear regression can only learn the linear relationship
between the features and the target, and therefore cannot learn complex non-
linear relationships.
• In order to learn the complex non-linear relationship between the features
and the target, we need other techniques. One of those techniques is to
use Artificial Neural Networks.
• Artificial Neural Networks have the ability to learn the complex relationship
between the features and the target due to the presence of an activation function in
each layer.
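
To illustrate the point above, a hedged sketch using scikit-learn's MLPRegressor, a small neural network fitting a non-linear relationship; the data are synthetic placeholders, not the case-study dataset:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic non-linear data: y = x^2 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

# One hidden layer of 20 units; the non-linear activation lets the
# network capture curvature that a straight line cannot
model = MLPRegressor(hidden_layer_sizes=(20,), activation="relu",
                     max_iter=2000, random_state=0)
model.fit(X, y)
print(model.predict([[1.5]]))   # should be reasonably close to 2.25
```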

36
NN Case Study on Regression

Regression-based neural networks: Predicting Average Daily Rates for


Hotels
https://towardsdatascience.com/regression-based-neural-networks-with-
tensorflow-v2-0-predicting-average-daily-rates-e20fffa7ac9a

37
Thank You
