
FACIAL EXPRESSION RECOGNITION USING MACHINE LEARNING

CHAPTER 1


1.1 INTRODUCTION

Facial expression is the most effective form of non-verbal communication, and it provides cues about a person's emotional state, mindset and intention. Facial expressions not only change the flow of a conversation but also give listeners a way to communicate a wealth of information to the speaker without uttering a single word. When a facial expression does not match the spoken words, the information conveyed by the face carries more weight in interpreting the message. Image processing is the field of signal processing in which both the input and output signals are images, and facial expression recognition is one of its most important applications. Our emotions are revealed by the expressions on our faces, and facial expressions play an important role in interpersonal communication.

Facial expression is a non-verbal gesture that appears on our face according to our emotions. Automatic recognition of facial expressions plays an important role in artificial intelligence and robotics, and is thus increasingly in demand. Related applications include personal identification and access control, videophones and teleconferencing, forensics, human-computer interaction, automated surveillance, cosmetology and so on. The objective of this project is to develop an Automatic Facial Expression Recognition System which can take a human facial image containing some expression as input and recognize and classify it into one of eight expression classes:

1. Neutral
2. Angry
3. Disgust
4. Fear
5. Contempt
6. Happy
7. Sadness
8. Surprise

1. Anger: involves three main features: revealed teeth, eyebrows pulled down and tightened on the inner side, and squinting eyes. The function is clear: preparing for attack. The teeth are ready to bite and threaten enemies, while the eyes and eyebrows squint to protect the eyes without closing entirely, so that the enemy can still be seen.

2. Disgust: involves a wrinkled nose and mouth, and sometimes even the tongue coming out. This expression mimics a person who has tasted bad food and wants to spit it out, or who has smelled a foul odor.

3. Fear: involves widened eyes and sometimes an open mouth. The function: opening the eyes so wide is supposed to help increase the visual field (though studies show it does not actually do so) and speed up eye movement, which can assist in finding threats. Opening the mouth enables quiet breathing, so that the person is not revealed to the enemy.

4. Surprise: very similar to the expression of fear, perhaps because a surprising situation can frighten us for a brief moment, before it becomes clear whether the surprise is a good or a bad one. The function is therefore similar.

5. Sadness: involves a slight pulling down of the lip corners and a rising of the inner side of the eyebrows. Darwin explained this expression as a suppression of the urge to cry. The control over the upper lip is greater than the control over the lower lip, and so the lower lip drops. When a person screams during a cry, the eyes close in order to protect them from the blood pressure that accumulates in the face; so, when we have the urge to cry and want to stop it, the eyebrows rise to prevent the eyes from closing.

6. Contempt: involves the lip corner rising on only one side of the face. Sometimes only one eyebrow rises. This expression might look like half surprise, half happiness. It can imply to the person who receives this look that we are surprised by what he said or did (not in a good way) and that we are amused by it. It is obviously an offensive expression, leaving the impression that one person feels superior to another.

7. Happiness: usually involves a smile: both corners of the mouth rise, the eyes squint and wrinkles appear at their corners. The initial functional role of the smile, which represents happiness, remains a mystery. Some biologists believe that the smile was initially a sign of fear: monkeys and apes clenched their teeth in order to show predators that they are harmless. A smile encourages the brain to release endorphins that help lessen pain and produce a feeling of well-being.

Those good feelings that a smile can produce may also help in dealing with fear. A smile can produce positive feelings in someone who witnesses it, and might even get him to smile too. Newborn babies have been observed to smile involuntarily, without any external stimulus, while they are sleeping. A baby's smile helps his parents connect with him and become attached to him; it makes sense that, for evolutionary reasons, an involuntary smile helps create positive feelings in the parents, so they will not abandon their offspring.

8. Neutral: the relaxed face, with none of the muscular actions described above, expressing no particular emotion.

Fig.1.1 Different Expressions

CHAPTER 2

2.1 IMPORTANT CONCEPTS

2.1.1 MOTIVATION

Significant debate has arisen in the past regarding the emotion portrayed in the world-famous Mona Lisa. The British weekly New Scientist has stated that she is in fact a blend of many different emotions: 83% happy, 9% disgusted, 6% fearful and 2% angry.

Fig. 2.1.1 Mona Lisa

We have also been motivated by the difficulties of people with speech and hearing impairments. If a fellow human being or an automated system can understand their needs by observing their facial expressions, it becomes much easier for them to communicate those needs.


Fig. 2.1.2 Anonymous picture

2.2 PROBLEM DEFINITION

Human facial expressions can be easily classified into 7 basic emotions: happy, sad,
surprise, fear, anger, disgust, and neutral. Our facial emotions are expressed through activation of
specific sets of facial muscles. These sometimes subtle, yet complex, signals in an expression
often contain an abundant amount of information about our state of mind. Through facial
emotion recognition, we are able to measure the effects that content and services have on the
audience/users through an easy and low-cost procedure. For example, retailers may use these
metrics to evaluate customer interest. Healthcare providers can provide better service by using
additional information about patients' emotional state during treatment. Entertainment producers
can monitor audience engagement in events to consistently create desired content.

Humans are well-trained in reading the emotions of others; in fact, at just 14 months old, babies can already tell the difference between happy and sad. But can computers do a better job than us at assessing emotional states? To answer this question, we designed a deep learning neural network that gives machines the ability to make inferences about our emotional states. In other words, we give them eyes to see what we can see.


Fig 2.2.1 Algorithm

Problem formulation of our project:

Facial expression recognition is a process performed by humans or computers, which consists of:

1. Locating faces in the scene (e.g., in an image; this step is also referred to as face detection),
2. Extracting facial features from the detected face region (e.g., detecting the shape of facial components or describing the texture of the skin in a facial area; this step is referred to as facial feature extraction),
3. Analyzing the motion of facial features and/or the changes in their appearance, and classifying this information into facial-expression-interpretative categories such as facial muscle activations like a smile or frown, emotion (affect) categories like happiness or anger, attitude categories like (dis)liking or ambivalence, etc. (this step is also referred to as facial expression interpretation).
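
As a concrete illustration of the face-detection step (1), the sketch below uses OpenCV's Haar cascade detector. This is only a minimal sketch under assumed file paths, not the pipeline this report actually implements (that is described in Chapter 3).

    import cv2

    def detect_faces(image_path):
        # Step 1: locate faces in the scene (face detection).
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        # Return cropped grayscale face regions for the later stages.
        return [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]

Each returned crop then feeds the feature extraction and interpretation steps (2) and (3).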

Several projects have already been done in this field, and our goal is not only to develop an Automatic Facial Expression Recognition System but also to improve its accuracy compared to other available systems.

From the perspective of automatic recognition, a facial expression can be considered to consist of deformations of facial components and their spatial relations, or of changes in the pigmentation of the face. Facial expressions represent changes in facial appearance in reaction to a person's internal emotional states, social communications or intentions. Facial expression is the most powerful, natural, non-verbal and instant way for humans to communicate emotions and express intentions; it is faster to communicate emotions through facial expressions than through verbalization. The requirement for proficient communication channels between machines and humans becomes progressively important as machines and people start to share a variety of tasks. Systems that form these communication channels are known as human-machine interaction (HMI) systems.

2.3 The Importance of Facial Recognition


When computers look at an image, what they 'see' is simply a matrix of pixel values. In order to classify an image, the computer has to discover and classify numerical patterns within the image matrix. These patterns can be variable, and hard to pin down, for multiple reasons. Several human emotions can be distinguished only by subtle differences in facial patterns, with emotions like anger and disgust often expressed in very similar ways.

Each person's expression of emotion can be highly idiosyncratic, with particular quirks and facial cues, and there can be a wide variety of orientations and positions of people's heads in the photographs to be classified. For these reasons, FER is more difficult than most other image classification tasks. However, well-designed systems can achieve accurate results when constraints are taken into account during development. For example, higher accuracy can be achieved when classifying a smaller subset of highly distinguishable expressions, such as anger, happiness, and fear. Lower accuracy is achieved when classifying larger subsets, or small subsets with less distinguishable expressions, such as anger and disgust.

The study of human facial expressions has many aspects, from computer analysis, emotion recognition, lie detection, airport security and nonverbal communication to the role of expressions in art. Improving one's skill at reading expressions is an important step towards successful relations.
2.3.1 Expressions and Emotions
A facial expression is a gesture executed with the facial muscles which conveys the emotional state of the subject to observers. An expression sends a message about a person's internal feeling. In Hebrew, the word for "face" has the same letters as the word for "within" or "inside"; that similarity hints at the facial expression's most important role: being a channel of nonverbal communication. Facial expressions are a primary means of conveying nonverbal information among humans, though many animal species display facial expressions too. Although humans have developed a wide range of powerful verbal languages, the role of facial expressions in interactions remains essential, and sometimes even critical. Expressions and emotions go hand in hand; that is, special combinations of facial muscular actions reflect a particular emotion. For certain emotions it is very hard, and maybe even impossible, to avoid the fitting facial expression. For example, a person who is trying to ignore his boss's annoying offensive comment by keeping a neutral expression might nevertheless show a brief expression of anger. This phenomenon of a brief, involuntary facial expression shown on a human face according to the emotion experienced is called a 'micro expression'.
Micro expressions express the seven universal emotions: happiness, sadness, anger, surprise, contempt, fear and disgust. However, Paul Ekman, an American psychologist who was a pioneer in the study of emotions and their relation to facial expressions, expanded the list of classical emotions, adding nine more: amusement, shame, embarrassment, excitement, pride, guilt, relief, satisfaction and pleasure. A micro expression lasts only 1/25 to 1/15 of a second; nonetheless, capturing it can illuminate one's real feelings, whether one wants it or not. That is exactly what Paul Ekman did. Back in the 1980s, Ekman was already known as a specialist in the study of facial expressions when he was approached by a psychiatrist who asked whether he could detect liars. The psychiatrist wanted to know whether a patient who was threatening suicide was lying. Ekman watched a tape of the patient over and over again, looking for a clue, until he found a split second of desperation, meaning that the patient's threat wasn't empty. Since then, Ekman has found those critical split seconds in almost every liar's documentation. The leading character in the TV series "Lie to Me" is based on Paul Ekman himself, the man who dedicated his life to reading people's expressions: the "human polygraph".
The research of facial expressions and emotions began many years before Ekman's work. Charles Darwin published his book "The Expression of the Emotions in Man and Animals" in 1872. This book was dedicated to nonverbal patterns in humans and animals and to the source of expressions. Darwin's two earlier books, "The Descent of Man, and Selection in Relation to Sex" and "On the Origin of Species", presented the idea that man did not come into existence in his present condition but through a gradual process: evolution. This was, of course, a revolutionary theory, since in the middle of the 19th century no one believed that man and animal obey the same rules of nature. Darwin's work attempted to find parallels between behaviors and expressions in animals and humans. Ekman's work supports Darwin's theory about the universality of facial expressions, even across cultures.

The main idea of "The Expression of the Emotions in Man and Animals" is that the source of the nonverbal expressions of man and animals is functional, and not communicative, as we might have thought. This means that facial expressions were not created for communication purposes, but for something else. An important observation was that individuals who were born blind had facial expressions similar to those of individuals born with the ability to see. This observation was intended to contradict the idea of Sir Charles Bell (a Scottish surgeon, anatomist, neurologist and philosophical theologian who influenced Darwin's work), who claimed that human facial muscles were created to provide humans the unique ability to express emotions, meaning, for communicational reasons. According to Darwin, there are three "chief principles", which are three general principles of expression:

1. The first one is called the "principle of serviceable habits". He described it as a habit that was reinforced at the beginning and then inherited by offspring. For example, he noticed a serviceable habit of raising the eyebrows in order to increase the field of vision, and connected it to a person who raises his eyebrows while trying to remember something, as though he could "see" what he is trying to remember.

2. The second principle is called "antithesis". Darwin suggested that some actions or habits might not be serviceable in themselves, but are carried out only because they are opposite in nature to a serviceable habit. I have found this principle very interesting, and I will go into more detail later on.

3. The third principle is called "the principle of actions due to the constitution of the Nervous System". This principle is independent of will and, to a certain extent, of habit. For example, Darwin noticed that animals rarely make noises, but in special circumstances, like fear or pain, they respond by making involuntary noises.

2.4 Facial Expressions Evolutionary Reasons


A common assumption is that facial expressions initially served a functional role rather than a communicative one. Each of the seven classical expressions can be justified by its initial functional role, as was done expression by expression in Chapter 1.

2.5 The "Antithesis" Principle


As I mentioned earlier, the antithesis phenomenon refers to the way that some muscle movements represent an emotion while the opposite muscle movements represent the opposite emotion. An impressive explanation for the facial expression representing helplessness can be given using antithesis. The helplessness body gesture involves the hands spreading to the sides, the fingers spreading and the shoulders shrugging; its facial expression involves pulling down the bottom lip and raising the eyebrows, as shown in the accompanying image. Darwin explained the features of this expression using the antithesis principle. He observed that all of those movements are opposite to the movements of a man who is ready to face something. A person who is preparing himself for something will show closed hands and fingers (as if preparing for a fight, for example), hands held close to the body for protection, and a raised, tight neck. In the helpless posture, the shrugging of the shoulders releases the neck; as for the face, the eyebrows are low (as in a mode of attack or firmness), and the upper lip might reveal teeth.

The functional source of the antithesis can be explained by investigating the muscles, and to be precise, the antagonist muscles. Every muscle has an antagonist muscle that performs the opposite movement. Spreading the fingers is a movement done by some muscles, and closing the fingers is done by the antagonist muscles. For some expressions we cannot always tell, just by looking, what the opposite expression is, but if we look at the muscles involved in the process it becomes very clear.

An interesting explanation of the antithesis's functional source relies on inhibition. If a person or an animal is trying to prevent itself from performing a particular action, one way is to use the antagonistic muscles. In fact, when a stimulus signal is sent to a muscle, an inhibitory signal is automatically sent to the antagonist muscle. Facial expressions that can be explained by antithesis usually relate to aggression and its avoidance.

2.6 Approach and Method


2.6.1 Experiment Description
My experiment was composed of two sections. First, a series of 35 full facial images, each representing one of the seven different facial expressions, was presented to each participant. Each picture appeared for 4 seconds only, followed by a black screen, in order to prevent the participant from adapting to the image. The participant labeled each picture as one of the seven given facial expressions and filled in a form, made for this purpose, during the black screen. The participant also noted which 1-3 facial features helped him or her classify the image. Second, a series of 21 facial-feature images was presented to each participant, again for 4 seconds only; the participant classified each feature as a facial expression and filled in the form during the black screen. The purpose of the second part was to see whether a person can identify expressions better or worse using only some facial parts, and whether the time needed for recognition changes. 20 men and 20 women took part in my experiment.

2.7 My Assumption
It is no secret that women are considered to be more intuitive than men. More often, women are considered more compassionate and empathetic towards their surroundings. Therefore, the gift of interpreting facial expressions is usually attributed to women. I believe that my experiment results will support this assumption.
2.7.1 Common expression analysis components
Like most image classification systems, FER systems typically use image preprocessing
and feature extraction followed by training on selected training architectures. The end result of
training is the generation of a model capable of assigning emotion categories to newly provided
image examples.

Fig 2.7.1 Common expression analysis components

2.7.2 Commonly used FER system architectures

The image preprocessing stage can include image transformations such as scaling, cropping, or filtering. It is often used to accentuate relevant image information, like cropping an image to remove the background. It can also be used to augment a dataset, for example by generating multiple versions of an original image with varying cropping or other transformations applied. The feature extraction stage goes further in finding the more descriptive parts of an image. Often this means finding the information most indicative of a particular class, such as edges, textures, or colors. The training stage takes place according to the defined training architecture, which determines the combinations of layers that feed into each other in the neural network. Architectures must be designed with the composition of the feature extraction and image preprocessing stages in mind.

This is necessary because some architectural components work better with others when applied separately or together. For example, certain types of feature extraction are not useful in conjunction with deep learning algorithms: both find relevant features in images, such as edges, so it is redundant to use the two together. Applying feature extraction prior to a deep learning algorithm is not only unnecessary, but can even negatively impact the performance of the architecture.
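
A minimal preprocessing and augmentation sketch in Python follows, assuming 48 x 48 grayscale inputs (a common FER convention); the exact sizes and transforms vary from system to system.

    import cv2
    import numpy as np

    def preprocess(face, size=(48, 48)):
        face = cv2.resize(face, size)              # uniform size
        face = cv2.equalizeHist(face)              # reduce lighting variation
        return face.astype(np.float32) / 255.0     # normalized intensity

    def augment(face):
        # One cheap augmentation: keep the original plus its mirror image.
        return [face, cv2.flip(face, 1)]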

2.7.3 A comparison of training algorithms

Once any feature extraction or image preprocessing stages are complete, the training
algorithm produces a trained prediction model. A number of options exist for training FER
models, each of which has strengths and weaknesses making them more or less suitable for
particular situations. Here we compare some of the most common algorithms used in FER:

• Multiclass Support Vector Machines (SVM)
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN)
• Convolutional Long Short-Term Memory (ConvLSTM)
Multiclass Support Vector Machines (SVM) are supervised learning algorithms that analyze and classify data, and they perform well when classifying human facial expressions. However, they only do so when the images are created in a controlled lab setting with consistent head poses and illumination. SVMs perform less well when classifying images captured "in the wild," or in spontaneous, uncontrolled settings. Therefore, the latest training architectures being explored are all deep neural networks, which perform better under those circumstances. Convolutional Neural Networks (CNN) are currently considered the go-to neural networks for image classification, because they pick up on patterns in small parts of an image, such as the curve of an eyebrow. CNNs apply kernels, which are matrices smaller than the image, to chunks of the input image.

By applying kernels to inputs, new activation matrices, sometimes referred to as feature maps, are generated and passed as inputs to the next layer of the network. In this way, CNNs process more granular elements within an image, making them better at distinguishing between two similar emotion classifications. Alternatively, Recurrent Neural Networks (RNN) use dynamic temporal behavior when classifying an image. This means that when an RNN processes an input example, it doesn't just look at the data from that example; it also looks at the data from previous inputs, which provide further context. In FER, the context could be previous image frames of a video clip.

The idea of this approach is to capture the transitions between facial patterns over time,
allowing these changes to become additional data points supporting classification. For example,
it is possible to capture the changes in the edges of the lips as an expression goes from neutral to
happy by smiling, rather than just the edges of a smile from an individual image frame.
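
To ground the CNN description above, here is a minimal Keras sketch for 7-class FER on 48 x 48 grayscale images; the layer sizes are illustrative assumptions, not a reference architecture.

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),  # kernels scan small patches
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(7, activation="softmax"),         # one output per emotion
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])

Each Conv2D layer produces the feature maps described above; the final softmax layer turns them into per-emotion probabilities.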

2.8 Face expression recognition system


The overview of the FER system is illustrated in Fig. 2.8.1. The FER system includes major stages such as face image preprocessing, feature extraction and classification.

Fig 2.8.1 Architecture of face expression recognition system.

CHAPTER 3

3.1 METHODOLOGY

We use these approaches: Support Vector Machine, Artificial Neural Network and K-Nearest Neighbors (KNN). In pattern recognition, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression; in both cases, the input consists of the k closest training examples in the feature space. The input image is first preprocessed using the techniques described below, features are then extracted at predefined facial positions, a subset of those features is selected, and finally the expression is classified.

Fig 3.1 Overview of Facial Expression Recognition System

3.1.1. Image Acquisition

The images used for facial expression recognition are static images. To capture the expressions, a Panasonic camera (model DMC-LS5) with a focal length of 5 mm was used. The images are 24-bit color JPEGs with a resolution of 4320 x 3240 pixels. The distance between the camera and the person was four feet, and images of six basic expressions of each person were taken.

3.1.2 Image Preprocessing

The image preprocessing procedure is a very important step in the facial expression recognition task. The objective of the preprocessing phase is to obtain images which have normalized intensity and uniform size and shape, and which represent only a face expressing a certain emotion.

The preprocessing procedure should also reduce the effects of illumination and lighting. The expression representation can be sensitive to translation, scaling, and rotation of the head in a picture. To counter the effect of these irrelevant changes, the facial image may be geometrically normalized before classification.
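
A sketch of one common geometric normalization, rotating the face so the eyes are horizontal, is shown below; the eye coordinates are assumed to come from a separate landmark detector, and the output size is an assumption.

    import cv2
    import numpy as np

    def align_face(gray, left_eye, right_eye, size=(128, 128)):
        dx = right_eye[0] - left_eye[0]
        dy = right_eye[1] - left_eye[1]
        angle = np.degrees(np.arctan2(dy, dx))     # head tilt in degrees
        center = ((left_eye[0] + right_eye[0]) / 2.0,
                  (left_eye[1] + right_eye[1]) / 2.0)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        h, w = gray.shape
        rotated = cv2.warpAffine(gray, M, (w, h))  # undo the in-plane rotation
        return cv2.resize(rotated, size)           # uniform size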

3.1.3 Feature Extraction

Feature extraction is the most important stage in developing an accurate facial expression recognition system. Unprocessed facial images hold vast amounts of data, and feature extraction is required to reduce them to smaller sets of data called features. Feature extraction converts pixel information into a higher-level representation of the color, shape, motion, texture and spatial configuration of the face or its parts. The extracted representation is used for subsequent expression categorization. Feature extraction ordinarily reduces the dimensionality of the input space; the reduction procedure should retain essential information with high discriminative power and high stability.
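
The report does not fix a particular descriptor, so as one plausible example, the sketch below extracts a HOG (histogram of oriented gradients) feature vector with scikit-image; all parameter values are assumptions.

    from skimage.feature import hog

    def extract_features(face):
        # Collapse the raw pixel grid into a much smaller vector that
        # summarizes local edge orientations (shape/texture information).
        return hog(face, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))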

3.1.4 Feature Selection

Feature selection is concerned with choosing, from a larger set of candidate features, a subset of features that is necessary and sufficient to perform the classification task. The feature selection step affects both the computational complexity and the quality of the classification results. It is essential that the information contained in the selected features is adequate to correctly determine the input class. Too many features may unnecessarily raise the complexity of the training and classification tasks, while a poor, inadequate selection of features may have a detrimental effect on the classification results. Selecting a subset of features improves the efficiency of the classifier and reduces execution time.
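
A feature-selection sketch with scikit-learn's SelectKBest follows; the scoring function, k = 200 and the random stand-in data are all illustrative assumptions.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 500))        # stand-in feature matrix (100 faces)
    y = rng.integers(0, 7, size=100)       # stand-in labels for 7 expressions
    selector = SelectKBest(score_func=f_classif, k=200)
    X_sel = selector.fit_transform(X, y)   # keep the 200 most discriminative
    print(X_sel.shape)                     # (100, 200)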


3.1.5 Classification

The last step of a facial expression recognition system is to recognize the facial expression based on the extracted features. Classification refers to an algorithmic approach for recognizing a given expression as one of a given number of expressions. We use the K-Nearest Neighbor classifier for classification. The K-Nearest Neighbor algorithm is a non-parametric method used for classification and regression: the input consists of the k closest training examples in the feature space, and the output is a class membership. An object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors.
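
A minimal sketch of this classification stage with scikit-learn's KNN follows; k = 5, the split ratio and the random stand-in features are assumptions (random data will of course only reach chance accuracy, so real extracted features are needed).

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(140, 200))        # stand-in feature vectors
    y = rng.integers(0, 7, size=140)       # stand-in expression labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5)  # majority vote of 5 neighbors
    knn.fit(X_train, y_train)
    print("accuracy:", knn.score(X_test, y_test))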

CHAPTER 4

4.1 Machine learning

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.
Machine learning is closely related to computational statistics, which focuses on making
predictions using computers. The study of mathematical optimization delivers methods, theory
and application domains to the field of machine learning. Data mining is a field of study within
machine learning, and focuses on exploratory data analysis through unsupervised learning. In its
application across business problems, machine learning is also referred to as predictive analytics.

Fig 4.1 Machine learning

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

4.2 Overview of Machine Learning


The name machine learning was coined in 1959 by Arthur Samuel. Tom M.
Mitchell provided a widely quoted, more formal definition of the algorithms studied in the
machine learning field: "A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P if its performance at tasks in T, as measured
by P, improves with experience E." This definition of the tasks in which machine learning is
concerned offers a fundamentally operational definition rather than defining the field in cognitive
terms.
This follows Alan Turing's proposal in his paper "Computing Machinery and
Intelligence", in which the question "Can machines think?" is replaced with the question "Can
machines do what we (as thinking entities) can do?". In Turing's proposal the various
characteristics that could be possessed by a thinking machine and the various implications in
constructing one are exposed.
4.2.1 Machine learning tasks
Machine learning tasks are classified into several broad categories. In supervised
learning, the algorithm builds a mathematical model from a set of data that contains both the
inputs and the desired outputs. For example, if the task were determining whether an image
contained a certain object, the training data for a supervised learning algorithm would include
images with and without that object (the input), and each image would have a label (the output)
designating whether it contained the object. In special cases, the input may be only partially
available, or restricted to special feedback. Semi-supervised learning algorithms develop
mathematical models from incomplete training data, where a portion of the sample input doesn't
have labels.


Fig 4.2.1.1 Machine learning tasks


Classification algorithms and regression algorithms are types of supervised learning.
Classification algorithms are used when the outputs are restricted to a limited set of values. For a
classification algorithm that filters emails, the input would be an incoming email, and the output
would be the name of the folder in which to file the email. For an algorithm that identifies spam
emails, the output would be the prediction of either "spam" or "not spam", represented by
the Boolean values true and false. Regression algorithms are named for their continuous outputs,
meaning they may have any value within a range. Examples of a continuous value are the
temperature, length, or price of an object.
In unsupervised learning, the algorithm builds a mathematical model from a set of data
which contains only inputs and no desired output labels. Unsupervised learning algorithms are
used to find structure in the data, like grouping or clustering of data points. Unsupervised
learning can discover patterns in the data, and can group the inputs into categories, as in feature
learning. Dimensionality reduction is the process of reducing the number of "features", or inputs,
in a set of data.
Active learning algorithms access the desired outputs (training labels) for a limited set of inputs based on a budget, and optimize the choice of inputs for which they will acquire training labels. When used interactively, these can be presented to a human user for labeling. Reinforcement learning algorithms are given feedback in the form of positive or negative reinforcement in a dynamic environment, and are used in autonomous vehicles or in learning to play a game against a human opponent. Other specialized algorithms in machine learning include topic modeling, where the computer program is given a set of natural language documents and finds other documents that cover similar topics.
Machine learning algorithms can be used to find the unobservable probability density
function in density estimation problems. Meta learning algorithms learn their own inductive
bias based on previous experience. In developmental robotics, robot learning algorithms generate
their own sequences of learning experiences, also known as a curriculum, to cumulatively
acquire new skills through self-guided exploration and social interaction with humans. These
robots use guidance mechanisms such as active learning, maturation, motor synergies, and
imitation.
4.3 History of Machine Learning
Arthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term "machine learning" in 1959 while at IBM. As a scientific endeavour, machine learning grew out of the quest for artificial intelligence. Already in the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what were then termed "neural networks"; these were mostly perceptrons and other models that were later found to be reinventions of the generalized linear models of statistics. Probabilistic reasoning was also employed, especially in automated medical diagnosis.
However, an increasing emphasis on the logical, knowledge-based approach caused a rift
between AI and machine learning. Probabilistic systems were plagued by theoretical and
practical problems of data acquisition and representation. By 1980, expert systems had come to
dominate AI, and statistics was out of favor. Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval. Neural network research had been abandoned by AI and computer science around the same time. This line, too, was continued outside the AI/CS field, as "connectionism", by researchers from other disciplines including Hopfield, Rumelhart and Hinton. Their main success came in the mid-1980s with the reinvention of backpropagation.

Machine learning, reorganized as a separate field, started to flourish in the 1990s. The
field changed its goal from achieving artificial intelligence to tackling solvable problems of a
practical nature. It shifted focus away from the symbolic approaches it had inherited from AI,
and toward methods and models borrowed from statistics and probability theory. It also benefited
from the increasing availability of digitized information, and the ability to distribute it via
the Internet.
4.3.1 Relation to data mining
Machine learning and data mining often employ the same methods and overlap
significantly, but while machine learning focuses on prediction, based on known properties
learned from the training data, data mining focuses on the discovery of
(previously) unknown properties in the data (this is the analysis step of knowledge discovery in
databases). Data mining uses many machine learning methods, but with different goals; on the
other hand, machine learning also employs data mining methods as "unsupervised learning" or as
a preprocessing step to improve learner accuracy. Much of the confusion between these two
research communities (which do often have separate conferences and separate journals, ECML
PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.
4.3.2 Relation to optimization
Machine learning also has intimate ties to optimization: many learning problems are
formulated as minimization of some loss function on a training set of examples. Loss functions
express the discrepancy between the predictions of the model being trained and the actual
problem instances (for example, in classification, one wants to assign a label to instances, and
models are trained to correctly predict the pre-assigned labels of a set of examples). The
difference between the two fields arises from the goal of generalization: while optimization
algorithms can minimize the loss on a training set, machine learning is concerned with
minimizing the loss on unseen samples.
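
As a tiny concrete instance of the loss functions described above, the snippet below computes a mean squared error between predictions and true values; the numbers are invented for illustration.

    import numpy as np

    y_true = np.array([3.0, -0.5, 2.0, 7.0])   # pre-assigned labels
    y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions
    mse = np.mean((y_true - y_pred) ** 2)      # (0.25 + 0.25 + 0 + 1) / 4
    print(mse)                                 # 0.375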

4.3.3 Relation to statistics
Machine learning and statistics are closely related fields. According to Michael I. Jordan, the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics. He also suggested the term data science as a placeholder for the overall field. Leo Breiman distinguished two statistical modeling paradigms: the data model and the algorithmic model, where "algorithmic model" means more or less machine learning algorithms like random forests. Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning.
4.4 Theory of Machine Learning
A core objective of a learner is to generalize from its experience. Generalization in this
context is the ability of a learning machine to perform accurately on new, unseen examples/tasks
after having experienced a learning data set. The training examples come from some generally
unknown probability distribution (considered representative of the space of occurrences) and the
learner has to build a general model about this space that enables it to produce sufficiently
accurate predictions in new cases.
The computational analysis of machine learning algorithms and their performance is a
branch of theoretical computer science known as computational learning theory. Because
training sets are finite and the future is uncertain, learning theory usually does not yield
guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance
are quite common. The bias–variance decomposition is one way to quantify generalization error.
For the best performance in the context of generalization, the complexity of the hypothesis should match the complexity of the function underlying the data. If the hypothesis is less complex than the function, then the model has underfit the data. If the complexity of the model is increased in response, then the training error decreases. But if the hypothesis is too complex, then the model is subject to overfitting and generalization will be poorer. In addition to performance bounds, learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results: positive results show that a certain class of functions can be learned in polynomial time, and negative results show that certain classes cannot be learned in polynomial time.

4.5 Types of learning algorithms
The types of machine learning algorithms differ in their approach, the type of data they
input and output, and the type of task or problem that they are intended to solve.
Supervised learning
Unsupervised learning
Reinforcement learning
4.5.1 Supervised learning
Supervised learning algorithms build a mathematical model of a set of data that contains
both the inputs and the desired outputs. The data is known as training data, and consists of a set
of training examples. Each training example has one or more inputs and a desired output, also
known as a supervisory signal. In the case of semi-supervised learning algorithms, some of the
training examples are missing the desired output. In the mathematical model, each training
example is represented by an array or vector, and the training data by a matrix.
Through iterative optimization of an objective function, supervised learning algorithms learn a
function that can be used to predict the output associated with new inputs. An optimal function
will allow the algorithm to correctly determine the output for inputs that were not a part of the
training data. An algorithm that improves the accuracy of its outputs or predictions over time is
said to have learned to perform that task.

Fig 4.5.1.1 Supervised learning

Supervised learning algorithms include classification and regression. Classification algorithms are used when the outputs are restricted to a limited set of values, and regression algorithms are used when the outputs may have any numerical value within a range. Similarity learning is an area of supervised machine learning closely related to regression and classification, but the goal is to learn from examples using a similarity function that measures how similar or related two objects are. It has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification.
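
A tiny supervised-learning sketch follows: a model is fit on labeled (input, output) pairs and then predicts the output for an unseen input. The data is invented for illustration.

    from sklearn.linear_model import LinearRegression

    X = [[1], [2], [3], [4]]          # inputs
    y = [2.0, 4.1, 5.9, 8.2]          # desired outputs (supervisory signal)
    model = LinearRegression().fit(X, y)
    print(model.predict([[5]]))       # roughly [10.]: output for an unseen input
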
4.5.2 Unsupervised learning
Unsupervised learning algorithms take a set of data that contains only inputs, and find structure in the data, like grouping or clustering of data points. The algorithms therefore learn from data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. A central application of unsupervised learning is in the field of density estimation in statistics, though unsupervised learning encompasses other domains involving summarizing and explaining data features.

Fig 4.5.2.1 Unsupervised learning

Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to one or more predesignated criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions about the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness (the similarity between members of the same cluster) and separation (the difference between clusters). Other methods are based on estimated density and graph connectivity.

4.5.3 Reinforcement learning

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Due to its generality, the field is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In machine learning, the environment is typically represented as a Markov Decision Process (MDP).

Fig 4.5.3.1 Reinforcement learning


Many reinforcement learning algorithms use dynamic programming techniques. Reinforcement
learning algorithms do not assume knowledge of an exact mathematical model of the MDP, and
are used when exact models are infeasible. Reinforcement learning algorithms are used in
autonomous vehicles or in learning to play a game against a human opponent.
4.6 List of Common Machine Learning Algorithms
Here is the list of commonly used machine learning algorithms. These algorithms can be
applied to almost any data problem:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM

5. Naive Bayes
6. kNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms

1. GBM
2. XGBoost
3. LightGBM
4. CatBoost

4.6.1. Linear Regression

Linear regression is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish the relationship between independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to relive an experience of childhood. Let us say you ask a child in fifth grade to arrange the people in his class by increasing order of weight, without asking their weights! What do you think the child will do? He or she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated with weight by a relationship, which looks like the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
These coefficients a and b are derived by minimizing the sum of squared differences of distance between the data points and the regression line. Look at the example below: here we have identified the best-fit line with the linear equation y = 0.2811x + 13.9. Using this equation, we can find the weight of a person, knowing his or her height.
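
Fitting such a line by least squares takes a few lines of Python; the height and weight values below are made-up stand-ins, not the data behind the figure.

    import numpy as np

    height = np.array([58, 60, 63, 66, 70])   # assumed inputs
    weight = np.array([30, 31, 32, 33, 34])   # assumed outputs
    a, b = np.polyfit(height, weight, 1)      # slope and intercept minimizing
    print(f"y = {a:.4f}x + {b:.1f}")          # the sum of squared differences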


Fig 4.6.1.1 Linear Regression


Linear regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, while Multiple Linear Regression (as the name suggests) is characterized by multiple (more than one) independent variables. While finding the best-fit line, you can also fit a polynomial or curvilinear regression.
4.6.2 Logistic Regression
Don't get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function; hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). Again, let us try to understand this through a simple example. Let's say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either you solve it or you don't. Now imagine that you are being given a wide range of puzzles/quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p / (1 - p)
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk

Here, p is the probability of presence of the characteristic of interest. The method chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression). Now, you may ask, why take a log? For the sake of simplicity, let's just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this section.
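
A minimal logistic-regression sketch with scikit-learn follows; the "minutes practiced versus puzzle solved" data is invented.

    from sklearn.linear_model import LogisticRegression

    X = [[5], [10], [15], [20], [25], [30]]   # minutes spent practicing
    y = [0, 0, 0, 1, 1, 1]                    # 0 = not solved, 1 = solved
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[18]])[0, 1])    # probability of solving, in (0, 1)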

Fig 4.6.2.1 Logistic Regression


4.6.3 Decision Tree
This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, so as to make the groups as distinct as possible.

Fig 4.6.3.1 Decision Tree
In the image above, you can see that the population is classified into four different groups based on multiple attributes, to identify whether they will play or not. To split the population into different heterogeneous groups, it uses various techniques like Gini, information gain, Chi-square and entropy. A good way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft: you have a room with moving walls and you need to create walls such that a maximum area gets cleared off without the balls. Every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.
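
A minimal decision-tree sketch with scikit-learn follows; the 0/1 weather encoding and the labels are invented for illustration.

    from sklearn.tree import DecisionTreeClassifier

    X = [[0, 1], [1, 1], [1, 0], [0, 0]]   # [sunny?, windy?] (assumed encoding)
    y = ["play", "no", "no", "play"]       # whether they played
    tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
    print(tree.predict([[0, 1]]))          # ['play']
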
4.6.4 SVM (Support Vector Machine)
SVM is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. For example, if we only had two features, like the height and hair length of an individual, we would first plot these two variables in two-dimensional space, where each point has two coordinates. (The points that lie closest to the separating boundary are known as support vectors.)

Fig 4.6.4.1 SVM (Support Vector Machine)


Now, we will find a line that splits the data between the two differently classified groups of data. This will be the line for which the distance to the closest point in each of the two groups is largest.


Fig 4.6.4.2 SVM (Support Vector Machine)


In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest from the line. This line is our classifier. Then, depending on which side of the line the test data lands, that is the class we assign to the new data.
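
A minimal SVM sketch with scikit-learn follows, reusing the height/hair-length illustration with invented values.

    from sklearn.svm import SVC

    X = [[150, 30], [155, 28], [175, 5], [180, 4]]  # [height cm, hair length cm]
    y = ["female", "female", "male", "male"]        # assumed labels
    clf = SVC(kernel="linear").fit(X, y)            # maximum-margin line
    print(clf.predict([[160, 25]]))                 # ['female']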

4.6.5. Naive Bayes

Naive Bayes is a classification technique based on Bayes' theorem, with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple. A naive Bayesian model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)


Here,

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.

Example: Let's understand it using an example. Below I have a training data set of weather and the corresponding target variable 'Play'. We need to classify whether players will play or not based on the weather conditions. Let's follow the steps below.

Step 1: Convert the data set to frequency table

Step 2: Create a likelihood table by finding the probabilities, like overcast probability = 0.29 and probability of playing = 0.64.


Fig. 4.6.5.1 Naive Bayes

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.
Problem: Players will play if the weather is sunny – is this statement correct?
We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) /
P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability. Naive Bayes
uses a similar method to predict the probabilities of different classes based on various attributes.
This algorithm is mostly used in text classification and in problems having multiple classes.
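The worked example above can be reproduced in a few lines of plain Python (a sketch of the
arithmetic only, not of a full classifier):

# Reproducing the P(Yes | Sunny) calculation from the example above.
p_sunny_given_yes = 3 / 9   # P(Sunny | Yes)
p_yes = 9 / 14              # P(Yes)
p_sunny = 5 / 14            # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6, so "play" is the more probable class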
4.6.6. kNN (k-Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used
for classification problems in industry. K nearest neighbors is a simple algorithm that stores all
available cases and classifies new cases by a majority vote of their k neighbors. The case is
assigned to the class most common amongst its K nearest neighbors, measured by a distance
function.
These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. The
first three are used for continuous variables, and the fourth (Hamming) for categorical variables.
If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing
K turns out to be a challenge while performing kNN modeling.


Fig 4.6.6.1 kNN (k-Nearest Neighbors)
KNN can easily be mapped to our real lives. If you want to learn about a person of whom you
have no information, you might find out about their close friends and the circles they move in,
and gain access to information about them that way!
Things to consider before selecting kNN:
• KNN is computationally expensive.
• Variables should be normalized, or else higher-range variables can bias it.
• Invest in the pre-processing stage (e.g. outlier and noise removal) before running kNN.
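These points can be seen in a short sketch, assuming scikit-learn and made-up two-feature data;
note the normalization step recommended in the list above:

# Illustrative sketch only: kNN with normalization (assumed data and library).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [1.2, 180.0], [5.0, 10.0], [5.5, 15.0]])
y = np.array([0, 0, 1, 1])

scaler = StandardScaler().fit(X)  # normalize so the large-range feature cannot bias distances
Xs = scaler.transform(X)

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(Xs, y)
print(knn.predict(scaler.transform([[1.1, 190.0]])))  # majority vote of the 3 neighbors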
4.6.7 K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure
follows a simple and easy way to classify a given data set through a certain number of clusters
(assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with
respect to peer clusters.

Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You
look at the shape and spread to decipher how many different clusters, or populations, are present!


Fig. 4.6.7.1 How K-means forms clusters

How K-means forms clusters:

1. K-means picks k points for each cluster, known as centroids.
2. Each data point forms a cluster with the closest centroid, i.e. k clusters.
3. The centroid of each cluster is recomputed based on the existing cluster members. Now we
have new centroids.
4. As we have new centroids, repeat steps 2 and 3: find the closest centroid for each data point
and get associated with the new k clusters. Repeat this process until convergence occurs, i.e. the
centroids no longer change.
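The four steps above map directly onto scikit-learn's KMeans, shown here as a hedged sketch
on invented 2-D points:

# Illustrative sketch only: K-means repeats the assign/re-center loop until
# the centroids stop moving (assumed data and library).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
print(km.cluster_centers_)  # final centroids
print(km.labels_)           # cluster assigned to each point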

4.6.7.1 How to determine value of K:

In K-means, we have clusters, and each cluster has its own centroid. The sum of the squared
differences between the centroid and the data points within a cluster constitutes the within-
cluster sum of squares for that cluster. When the within-cluster sums of squares of all the
clusters are added together, the total becomes the total within-cluster sum of squares for the
cluster solution.

We know that as the number of clusters increases, this value keeps decreasing, but if you plot
the result you may see that the sum of squared distances decreases sharply up to some value of
k, and then much more slowly after that. Here, we can find the optimum number of clusters.

Fig. 4.6.7.2 Determining the value of K
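A hedged sketch of this elbow method, assuming scikit-learn, where inertia_ is the total
within-cluster sum of squares described above:

# Illustrative sketch only: print the within-cluster sum of squares for each k
# and look for the "elbow" where the decrease slows down.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # invented data for illustration
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)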

4.6.8. Random Forest


Random Forest is a trademark term for an ensemble of decision trees. In a Random Forest, we
have a collection of decision trees (hence the "forest"). To classify a new object based on its
attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest
chooses the classification having the most votes over all the trees in the forest. Each tree is
planted and grown as follows:
1. If the number of cases in the training set is N, a sample of N cases is taken at random, but
with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m
variables are selected at random out of the M, and the best split on these m is used to split the
node. The value of m is held constant while the forest grows.
3. Each tree is grown to the largest extent possible. There is no pruning.
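As a hedged sketch, the three steps above correspond to the bootstrap and max_features options
of scikit-learn's RandomForestClassifier (illustrative choices, not this project's configuration):

# Illustrative sketch only: a random forest where each tree votes on the class.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,    # 100 trees in the forest
                            max_features="sqrt", # m << M variables tried per split
                            bootstrap=True)      # N samples drawn with replacement
rf.fit(X, y)
print(rf.predict(X[:3]))  # the majority vote over all trees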
4.6.9. Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capturing at every possible
stage. Corporates, government agencies and research organizations are not only coming up with
new data sources, but are also capturing data in great detail. For example, e-commerce
companies are capturing more details about customers – their demographics, web crawling
history, what they like or dislike, purchase history, feedback and many others – to give them
personalized attention, more than your nearest grocery shopkeeper.
As data scientists, the data we are offered also consists of many features. This sounds good for
building a robust model, but there is a challenge: how would you identify the highly significant
variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us,
along with various other techniques like Decision Trees, Random Forests, PCA, Factor Analysis,
identification based on the correlation matrix, the missing value ratio, and others.
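As one hedged example of such an algorithm, PCA can compress an assumed 50-feature data
set down to its 10 most informative directions:

# Illustrative sketch only: PCA as a dimensionality reduction step (invented data).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)        # 100 samples, 50 features
pca = PCA(n_components=10)         # keep the 10 strongest directions of variation
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (100, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained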
4.6.10. Gradient Boosting Algorithms
4.6.10.1. GBM
GBM is a boosting algorithm used when we deal with plenty of data and want to make a
prediction with high predictive power. Boosting is an ensemble of learning algorithms which
combines the predictions of several base estimators in order to improve robustness over a single
estimator. It combines multiple weak or average predictors to build a strong predictor. These
boosting algorithms tend to work well in data science competitions like Kaggle, AV Hackathon,
and CrowdAnalytix.
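A minimal sketch, assuming scikit-learn's GradientBoostingClassifier and a toy data set
(illustrative parameters only):

# Illustrative sketch only: each new tree corrects the errors of the ensemble so far.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X, y)
print(gbm.score(X, y))  # accuracy of the combined weak learners on the training data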
4.6.10.2. XGBoost
Another classic gradient boosting algorithm, known to be the decisive choice between winning
and losing in some Kaggle competitions. XGBoost has immensely high predictive power, which
makes it a strong choice when accuracy matters, as it possesses both a linear model and the tree
learning algorithm, making it almost 10x faster than earlier gradient boosting techniques. The
support includes various objective functions, including regression, classification and ranking.
One of the most interesting things about XGBoost is that it is also called a regularized boosting
technique. This helps to reduce overfitting, and it has massive support for a range of languages
such as Scala, Java, R, Python, Julia and C++. It supports distributed training on many
machines, encompassing GCE, AWS, Azure and YARN clusters. XGBoost can also be
integrated with Spark, Flink and other cloud dataflow systems, with built-in cross-validation at
each iteration of the boosting process.
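A minimal sketch using the xgboost package's scikit-learn interface; the data set and parameter
values are assumptions for illustration, with reg_lambda standing in for the regularization
mentioned above:

# Illustrative sketch only: regularized gradient boosting with XGBoost.
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
model = XGBClassifier(n_estimators=100, max_depth=4,
                      learning_rate=0.1, reg_lambda=1.0)  # L2 regularization term
model.fit(X, y)
print(model.score(X, y))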
4.6.10.3. LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is
designed to be distributed and efficient, with the following advantages:
• Faster training speed and higher efficiency
• Lower memory usage
• Better accuracy
• Support for parallel and GPU learning
• Capability of handling large-scale data
The framework is a fast, high-performance gradient boosting framework based on decision tree
algorithms, used for ranking, classification and many other machine learning tasks. It was
developed under the Distributed Machine Learning Toolkit project of Microsoft. Since
LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the best fit,
whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise.
So when growing on the same leaf, the leaf-wise algorithm in LightGBM can reduce more loss
than the level-wise algorithm, and hence results in better accuracy than most existing boosting
algorithms can achieve. It is also surprisingly fast, hence the word 'Light'.
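A minimal sketch using the lightgbm package's scikit-learn interface (assumed data and
parameters); num_leaves controls the leaf-wise growth described above:

# Illustrative sketch only: LightGBM grows trees leaf-wise.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = lgb.LGBMClassifier(n_estimators=100, num_leaves=31, learning_rate=0.1)
model.fit(X, y)
print(model.score(X, y))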
4.6.10.4. CatBoost
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It can easily
integrate with deep learning frameworks like Google's TensorFlow and Apple's Core ML. The
best part about CatBoost is that it does not require extensive data training like other ML models,
and it can work on a variety of data formats without undermining how robust it can be. Make
sure you handle missing data well before you proceed with the implementation. CatBoost can
automatically deal with categorical variables without throwing a type conversion error, which
helps you to focus on tuning your model better rather than sorting out trivial errors.
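A minimal sketch with the catboost package, using invented data with one categorical column
to show the automatic categorical handling mentioned above:

# Illustrative sketch only: CatBoost consumes a categorical column directly.
from catboost import CatBoostClassifier

X = [["red", 1.0], ["green", 2.0], ["red", 0.5], ["blue", 3.0]]
y = [1, 0, 1, 0]
model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X, y, cat_features=[0])   # column 0 is categorical; no manual encoding
print(model.predict([["green", 1.5]]))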

4.7 Processes and techniques


Various processes, techniques and methods can be applied to one or more types of machine
learning algorithms to enhance their performance.

4.8 Feature learning
Several learning algorithms aim at discovering better representations of the inputs
provided during training. Classic examples include principal components analysis and cluster
analysis. Feature learning algorithms, also called representation learning algorithms, often
attempt to preserve the information in their input but also transform it in a way that makes it
useful, often as a pre-processing step before performing classification or predictions. This
technique allows reconstruction of the inputs coming from the unknown data-generating
distribution, while not being necessarily faithful to configurations that are implausible under that
distribution. This replaces manual feature engineering, and allows a machine to both learn the
features and use them to perform a specific task.
Feature learning can be either supervised or unsupervised. In supervised feature learning,
features are learned using labeled input data. Examples include artificial neural networks,
multilayer perceptrons, and supervised dictionary learning. In unsupervised feature learning,
features are learned with unlabeled input data. Examples include dictionary learning,
independent component analysis, autoencoders, matrix factorization and various forms of
clustering.
Manifold learning algorithms attempt to do so under the constraint that the learned
representation is low-dimensional. Sparse coding algorithms attempt to do so under the
constraint that the learned representation is sparse, meaning that the mathematical model has
many zeros. Multilinear subspace learning algorithms aim to learn low-dimensional
representations directly from tensor representations for multidimensional data, without reshaping
them into higher-dimensional vectors. Deep learning algorithms discover multiple levels of
representation, or a hierarchy of features, with higher-level, more abstract features defined in
terms of (or generating) lower-level features. It has been argued that an intelligent machine is
one that learns a representation that disentangles the underlying factors of variation that explain
the observed data.
Feature learning is motivated by the fact that machine learning tasks such as
classification often require input that is mathematically and computationally convenient to
process. However, real-world data such as images, video, and sensory data has not yielded to
attempts to algorithmically define specific features. An alternative is to discover such features or
representations through examination, without relying on explicit algorithms.
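As a hedged sketch of feature learning used as a pre-processing step, the pipeline below learns
a PCA representation of an assumed digits data set before a supervised classifier (illustrative
choices only):

# Illustrative sketch only: unsupervised feature learning (PCA) feeding a classifier.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
model = make_pipeline(PCA(n_components=30),               # learn a compact representation
                      LogisticRegression(max_iter=1000))  # classify on the learned features
model.fit(X, y)
print(model.score(X, y))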

4.9 Artificial neural networks

Fig. 4.9.1 Artificial neural network


An artificial neural network is an interconnected group of nodes, akin to the vast network of
neurons in a brain. Here, each circular node represents an artificial neuron, and an arrow
represents a connection from the output of one artificial neuron to the input of another. Artificial
neural networks (ANNs), or connectionist systems, are computing systems vaguely inspired by
the biological neural networks that constitute animal brains. The neural network itself is not an
algorithm, but rather a framework for many different machine learning algorithms to work
together and process complex data inputs.[48] Such systems "learn" to perform tasks by
considering examples, generally without being programmed with any task-specific rules.
An ANN is a model based on a collection of connected units or nodes called "artificial
neurons", which loosely model the neurons in a biological brain. Each connection, like
the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron
to another. An artificial neuron that receives a signal can process it and then signal additional
artificial neurons connected to it. In common ANN implementations, the signal at a connection
between artificial neurons is a real number, and the output of each artificial neuron is computed
by some non-linear function of the sum of its inputs. The connections between artificial neurons
are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning
proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial
neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that

threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform
different kinds of transformations on their inputs. Signals travel from the first layer (the input
layer), to the last layer (the output layer), possibly after traversing the layers multiple times.
The original goal of the ANN approach was to solve problems in the same way that
a human brain would. However, over time, attention moved to performing specific tasks, leading
to deviations from biology. Artificial neural networks have been used on a variety of tasks,
including computer vision, speech recognition, machine translation, social
network filtering, playing board and video games, and medical diagnosis. Deep learning
consists of multiple hidden layers in an artificial neural network. This approach tries to model
the way the human brain processes light and sound into vision and hearing. Some successful
applications of deep learning are computer vision and speech recognition.[49]
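A minimal Keras sketch of such a layered network; the layer sizes, the 784-value input and the
seven-class softmax output are assumptions chosen to echo the seven expression classes, not the
network actually trained in this project:

# Illustrative sketch only: layers of artificial neurons with non-linear activations.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),  # input layer
    tf.keras.layers.Dense(32, activation="relu"),                      # hidden layer
    tf.keras.layers.Dense(7, activation="softmax"),  # one output per class (assumed)
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # the weights on the edges are what adjust during learning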
4.10 Support vector machines
Support vector machines (SVMs), also known as support vector networks, are a set of
related supervised learning methods used for classification and regression. Given a set of training
examples, each marked as belonging to one of two categories, an SVM training algorithm builds
a model that predicts whether a new example falls into one category or the other.
The resulting SVM model is a non-probabilistic, binary, linear classifier, although methods
such as Platt scaling exist to use SVMs in a probabilistic classification setting. In addition to
performing linear classification, SVMs can efficiently perform a non-linear classification using
what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature
spaces.
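A hedged sketch of the kernel trick, with invented points that no straight line can separate
(scikit-learn assumed):

# Illustrative sketch only: an RBF kernel separates an inner cluster from a surrounding ring.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.1, 0.1], [2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma="scale")  # implicit mapping to a higher-dimensional space
clf.fit(X, y)
print(clf.predict([[0.05, 0.0], [1.9, 0.1]]))  # inner point vs. point near the ring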
4.11 Bayesian networks

A simple Bayesian network: rain influences whether the sprinkler is activated, and both rain
and the sprinkler influence whether the grass is wet. A Bayesian network, belief network or
directed acyclic graphical model is a probabilistic graphical model that represents a set of
random variables and their conditional dependencies with a directed acyclic graph (DAG). For
example, a Bayesian network could represent the probabilistic relationships between diseases
and symptoms. Given symptoms, the network can be used to compute the probabilities of the
presence of various diseases.


Fig. 4.11.1 Bayesian network


Efficient algorithms exist that perform inference and learning. Bayesian networks that
model sequences of variables, like speech signals or protein sequences, are called dynamic
Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision
problems under uncertainty are called influence diagrams.
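The rain/sprinkler/grass example can be made concrete with inference by enumeration; the
probability values below are illustrative assumptions in the style of the standard textbook
version of this network:

# Illustrative sketch only: P(Rain | grass is wet) by enumerating the joint distribution.
P_rain = 0.2
P_sprinkler = {True: 0.01, False: 0.4}            # P(Sprinkler | Rain)
P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(Wet | Sprinkler, Rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    pr = P_rain if rain else 1 - P_rain
    ps = P_sprinkler[rain] if sprinkler else 1 - P_sprinkler[rain]
    pw = P_wet[(sprinkler, rain)] if wet else 1 - P_wet[(sprinkler, rain)]
    return pr * ps * pw

num = sum(joint(True, s, True) for s in (True, False))                      # P(Rain, Wet)
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))  # P(Wet)
print(num / den)  # roughly 0.36 with these assumed numbers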
4.12 Genetic algorithms
A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the
process of natural selection, using methods such as mutation and crossover to generate
new genotypes in the hope of finding good solutions to a given problem. In machine learning,
genetic algorithms were used in the 1980s and 1990s.[51][52] Conversely, machine learning
techniques have been used to improve the performance of genetic and evolutionary algorithms.
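A small, hedged sketch of selection, crossover and mutation, maximizing the number of 1-bits
in a string (an invented toy problem):

# Illustrative sketch only: a genetic algorithm on 20-bit genotypes.
import random

def fitness(bits):
    return sum(bits)  # more 1-bits = fitter genotype

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                  # selection: keep the fittest
    children = []
    while len(children) < 20:
        a, b = random.sample(parents, 2)
        cut = random.randint(1, 19)
        child = a[:cut] + b[cut:]              # crossover of two parents
        if random.random() < 0.1:
            i = random.randrange(20)
            child[i] = 1 - child[i]            # mutation flips one bit
        children.append(child)
    population = parents + children
print(fitness(max(population, key=fitness)))   # best genotype found (likely 20)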
In 2006, the online movie company Netflix held the first "Netflix Prize" competition to find a
program that could better predict user preferences and improve the accuracy of its existing
Cinematch movie recommendation algorithm by at least 10%. A joint team made up of
researchers from AT&T Labs-Research, in collaboration with the teams Big Chaos and
Pragmatic Theory, built an ensemble model to win the Grand Prize of $1 million in 2009.
Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best
indicators of their viewing patterns ("everything is a recommendation"), and they changed their
recommendation engine accordingly. In 2010, The Wall Street Journal wrote about the firm
Rebellion Research and its use of machine learning to predict the financial crisis. In 2012, Sun
Microsystems co-founder Vinod Khosla predicted that 80% of medical doctors' jobs would be
lost in the next two decades to automated machine learning medical diagnostic software.
In 2014, it was reported that a machine learning algorithm had been applied in the field of
art history to study fine art paintings, and that it may have revealed previously unrecognized
influences between artists. Although machine learning has been transformative in some fields,
machine-learning programs often fail to deliver expected results. Reasons for this are numerous:
lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks
and algorithms, wrong tools and people, lack of resources, and evaluation problems.
In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed in the
resulting collision. Attempts to use machine learning in healthcare with the IBM Watson system
failed to deliver even after years of time and billions of dollars of investment.


CHAPTER 5


5.1 Experiment Results


I have analyzed each of the seven facial expressions, and compared the success percentages of
men and women in each part of the experiment. As we can see in the graph, women and men
identified happiness in similar percentages. The most interesting part, in my opinion, is that
happiness was identified with almost 100% success.

Fig: 5.1.1 Graph of Happy


My assumption is that happiness, being a positive emotion, is a mood that people want to be
around. Happy people project their feelings onto others and help to create good vibes in their
surroundings. Because this kind of emotion is helpful to be around, it is important to recognize
it easily. Maybe this is the reason for the easy identification of happiness.


Fig: 5.1.2 Graph of Sad


The graph shows that though women recognized sadness better than men in the full facial
images, men recognized it better in the facial-feature images.

Fig: 5.1.3 Graph of Contempt


This expression was difficult to recognize for both men and women. In the first part of the
experiment, most of the mistakes were labeling sadness instead of contempt. In the second part,
on the other hand, most of the mistakes were labeling happiness instead of contempt. I think that
the confusion in the second part came from the mouth shape for contempt, which involves the
lip corner rising on only one side of the face. Most of the people indicated that the eyebrows,
along with the lips, helped them to recognize the expression.


Fig: 5.1.4 Graph of Surprise


I believe that the shape of the eyebrows led people to confuse contempt with sadness in the
first part. Surprise is a tricky expression, and I expected confusion in its identification.
Surprisingly, this expression was recognized better by men. I also found that most of the time,
in the second part of the experiment, recognition took less than 4 seconds. This means that there
are special features for each emotion, and if we focus on those features alone we will be able to
decode the expression faster.

The three expressions fear, disgust and anger were recognized better by women, and we can
see that in most cases the difference in the success percentage between men and women is
significant. An interesting question that should be asked is: why were those specific emotions
recognized better by women? From an evolutionary point of view, evolutionary psychologists
have suggested that females, due to their role as primary caretakers, are "programmed" to
accurately decode and detect distress in preverbal infants, or threatening signals from other
adults, to enhance their chances of survival. Fear, anger and disgust are indeed situations of
distress. The two special features that most commonly helped the participants decode the
expression, and the emotion behind it, were the lips and the eyebrows.


Fig: 5.1.5 Graph of Fear, Disgust and Anger


Appendix

1. Code for Cropping the Image of the Face

## This program first checks whether the face of a person exists in the given image;
## if it does, it crops the image of the face and saves it to the given directory.

## Importing modules
import cv2
import os

#################################################################################

## Make changes to these lines for getting the desired results.

## DIRECTORY of the images
directory = r"C:\Users\Rajashekar Reddy\madhav\images"

## Directory where the cropped images are to be saved:
f_directory = r"C:\Users\Rajashekar Reddy\madhav\images"

#################################################################################

def facecrop(image):
    ## Crops the face of a person from any image.

    ## OpenCV XML file for frontal face detection using HAAR CASCADES.
    facedata = "haarcascade_frontalface_alt.xml"
    cascade = cv2.CascadeClassifier(facedata)

    ## Reading the given image with OpenCV.
    img = cv2.imread(image)

    try:
        ## Some downloaded images are of an unsupported type and would raise an
        ## exception when processed, so they are skipped using try/except.
        minisize = (img.shape[1], img.shape[0])
        miniframe = cv2.resize(img, minisize)

        faces = cascade.detectMultiScale(miniframe)

        for f in faces:
            x, y, w, h = [v for v in f]
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

            sub_face = img[y:y + h, x:x + w]

            ## os.path.basename handles both / and \ path separators.
            f_name = os.path.basename(image)

            ## Change here the desired directory. os.path.join inserts the
            ## missing path separator between directory and file name.
            cv2.imwrite(os.path.join(f_directory, f_name), sub_face)
            print("Writing: " + image)

    except Exception:
        pass

if __name__ == '__main__':
    images = os.listdir(directory)
    i = 0

    for img in images:
        file = os.path.join(directory, img)
        print(i)
        facecrop(file)
        i += 1

2. Code for Labeling the Image

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import sys
import time

import numpy as np
import tensorflow as tf

def load_graph(model_file):
    ## Load the frozen (retrained) TensorFlow graph from disk.
    graph = tf.Graph()
    graph_def = tf.GraphDef()

    with open(model_file, "rb") as f:
        graph_def.ParseFromString(f.read())
    with graph.as_default():
        tf.import_graph_def(graph_def)

    return graph

def read_tensor_from_image_file(file_name, input_height=299, input_width=299,
                                input_mean=0, input_std=255):
    ## Decode, resize and normalize the image so it matches the network input.
    input_name = "file_reader"
    output_name = "normalized"
    file_reader = tf.read_file(file_name, input_name)
    if file_name.endswith(".png"):
        image_reader = tf.image.decode_png(file_reader, channels=3,
                                           name='png_reader')
    elif file_name.endswith(".gif"):
        image_reader = tf.squeeze(tf.image.decode_gif(file_reader,
                                                      name='gif_reader'))
    elif file_name.endswith(".bmp"):
        image_reader = tf.image.decode_bmp(file_reader, name='bmp_reader')
    else:
        image_reader = tf.image.decode_jpeg(file_reader, channels=3,
                                            name='jpeg_reader')
    float_caster = tf.cast(image_reader, tf.float32)
    dims_expander = tf.expand_dims(float_caster, 0)
    resized = tf.image.resize_bilinear(dims_expander, [input_height, input_width])
    normalized = tf.divide(tf.subtract(resized, [input_mean]), [input_std])
    sess = tf.Session()
    result = sess.run(normalized)

    return result

def load_labels(label_file):
    ## Read one label per line from the retrained labels file.
    label = []
    proto_as_ascii_lines = tf.gfile.GFile(label_file).readlines()
    for l in proto_as_ascii_lines:
        label.append(l.rstrip())
    return label

def main(img):
    file_name = img
    model_file = "C:/Users/Rajashekar Reddy/madhav/retrained_graph.pb"
    label_file = "C:/Users/Rajashekar Reddy/madhav/retrained_labels.txt"
    input_height = 224
    input_width = 224
    input_mean = 128
    input_std = 128
    input_layer = "input"
    output_layer = "final_result"

    parser = argparse.ArgumentParser()
    parser.add_argument("--image", help="image to be processed")
    parser.add_argument("--graph", help="graph/model to be executed")
    parser.add_argument("--labels", help="name of file containing labels")
    parser.add_argument("--input_height", type=int, help="input height")
    parser.add_argument("--input_width", type=int, help="input width")
    parser.add_argument("--input_mean", type=int, help="input mean")
    parser.add_argument("--input_std", type=int, help="input std")
    parser.add_argument("--input_layer", help="name of input layer")
    parser.add_argument("--output_layer", help="name of output layer")
    args = parser.parse_args()

    if args.graph:
        model_file = args.graph
    if args.image:
        file_name = args.image
    if args.labels:
        label_file = args.labels
    if args.input_height:
        input_height = args.input_height
    if args.input_width:
        input_width = args.input_width
    if args.input_mean:
        input_mean = args.input_mean
    if args.input_std:
        input_std = args.input_std
    if args.input_layer:
        input_layer = args.input_layer
    if args.output_layer:
        output_layer = args.output_layer

    graph = load_graph(model_file)
    t = read_tensor_from_image_file(file_name,
                                    input_height=input_height,
                                    input_width=input_width,
                                    input_mean=input_mean,
                                    input_std=input_std)

    input_name = "import/" + input_layer
    output_name = "import/" + output_layer
    input_operation = graph.get_operation_by_name(input_name)
    output_operation = graph.get_operation_by_name(output_name)

    with tf.Session(graph=graph) as sess:
        start = time.time()
        results = sess.run(output_operation.outputs[0],
                           {input_operation.outputs[0]: t})
        end = time.time()
        results = np.squeeze(results)

    top_k = results.argsort()[-5:][::-1]
    labels = load_labels(label_file)

    for i in top_k:
        ## Returning inside the loop yields only the top-scoring label.
        return labels[i]

3. Code for Running the Camera

import cv2
import label_image

size = 4

# We load the xml file
classifier = cv2.CascadeClassifier('haarcascade_frontalface_alt.xml')

webcam = cv2.VideoCapture(0)  # Using the default webcam connected to the PC.

while True:
    (rval, im) = webcam.read()
    im = cv2.flip(im, 1, 0)  # Flip to act as a mirror

    # Resize the image to speed up detection
    mini = cv2.resize(im, (int(im.shape[1] / size), int(im.shape[0] / size)))

    # Detect faces with detectMultiScale
    faces = classifier.detectMultiScale(mini)

    # Draw rectangles around each face
    for f in faces:
        (x, y, w, h) = [v * size for v in f]  # Scale the shape size back up
        cv2.rectangle(im, (x, y), (x + w, y + h), (0, 255, 0), 4)

        # Save just the rectangular face region
        sub_face = im[y:y + h, x:x + w]

        FaceFileName = "test.jpg"  # Saving the current frame's face for testing.
        cv2.imwrite(FaceFileName, sub_face)

        # Getting the result from the label_image file, i.e. the classification result.
        text = label_image.main(FaceFileName)
        text = text.title()  # Title case looks stunning.
        font = cv2.FONT_HERSHEY_TRIPLEX
        cv2.putText(im, text, (x + w, y), font, 1, (0, 0, 255), 2)

    # Show the image
    cv2.imshow('Capture', im)
    key = cv2.waitKey(10)
    # If the Esc key is pressed, break out of the loop
    if key == 27:  # The Esc key
        break

# Release the webcam and close the window when done.
webcam.release()
cv2.destroyAllWindows()
