
MSc Artificial Intelligence

Artificial Intelligence Foundations

Week 10: NLP, part 2


Rameez Kureshi: Hello to all. Let's continue with NLP. Dr Fahim, in the previous session,
gave us good context for NLP. Now we're going to see how text analysis is used in NLP
based on a data-driven approach. So, Dr Fahim will explain a bit more. Dr Fahim, over to you.

Dr Muhammad Fahim: Thank you, Rameez. Let me share the slides and quickly go through
them; the slides are now in full-screen mode. So today, in this session, I'm going to explain
text analysis using a data-driven approach. As Dr Rameez said, we covered the basic
introduction to natural language processing in the previous session. In this session, we are
going to look at text analysis. I have brought some very nice examples that make this more
interactive, and I'd very much like to get into the details. So we will continue our discussion
with text analysis, and our focus will be on the data-driven approach.

So, for the contents of this session, we are going to have feature extraction, where I'm
going to talk about bag-of-words. Then I will explain the Naive Bayes classifier, which some
of you may have encountered before in machine learning or AI. Then I will move on to
model evaluation, where I will talk about the confusion matrix and how we can calculate
accuracy, and finally I will summarise the session.

These are the acknowledgments for the material I used to prepare these slides. By the end
of this session, students will understand, and even be in a position to build, an end-to-end
text analysis model. That is one of the good things about starting natural language
processing: at this stage, students should know how to deal with each step in the pipeline.

Text analysis is sometimes called text classification; the two terms, analysis and
classification, are interchangeable in natural language processing. We can see applications
in spam versus non-spam detection of emails, and we use text analysis to understand
audience sentiment from reviews or social media, mostly tweets.

Another very nice application is auto-tagging of customer queries. Just imagine a company
with a lot of customers to deal with, for example an airline interacting with thousands of
customers. Even at peak times the queries increase, when many people are taking their
Christmas, Easter, summer or winter breaks. A lot of queries can overwhelm the system, so
there is a need to put text classification in place so that customers can automatically be
directed to the relevant department. It is one of the very nice applications of text analysis
and natural language processing, bringing this automation into industry. Similarly, there is
assigning subject categories and topics to articles; most of the time on the internet, these
articles relate to sports, politics or some other topic of interest. So in text analysis we again
have rule-based approaches and data-oriented approaches.

Data-oriented approaches most of the time outperform the classical ones, and this is our
focus today. Here on the screen we can see Avatar: The Way of Water, and maybe some of
the students in the class have already watched this film. Of course, they have their own
reviews. I just copied these two very recent reviews from IMDb. Sometimes, before you go
to watch a movie, you ask: what rating does it have on IMDb, and what are the reviews like?
There are two ways to give reviews: for example, we can give a number of stars, five stars
or four stars, and then some viewers also add comments. Here one says, 'I agree the movie,
while not fantastic story wise, was very enjoyable and well worth going to see it', and
another user's review says, 'I like it but it wasn't worth waiting 13 years to see'. So we have
these kinds of reviews, and if you check IMDb for the very latest Avatar, you will see almost
25,000 reviews.

Now, if I am a person who is really concerned about the reviews and what is going on, I
can't really read all of them and then say whether the film is good or bad. We need a text
analysis tool that automates this and tells us whether the movie reviews are trending
positive or leave a negative impression. In data-driven approaches, we are really concerned
about annotations: some of the reviews are already classified as positive or negative. Since
we have these annotations, and I'll go into a little more detail on this, it is a supervised
machine learning approach.

We have the data and we have the labels: whether each review is positive or negative. Now
the question is, if we have the data, how can we represent each review? Of course, as I
discussed in the previous session, we really need to represent this data in a vector format,
as numbers, so that these numbers are understandable to our model, and then we can
analyse the data using our whole natural language processing pipeline.

I'll leave this one to the students. You can decide, after reading the whole paragraph,
whether this review is positive or negative.

But our task today is to understand the whole pipeline, and I am going to give you a very
good overview of how we can extract the features and how we can represent this data as
vectors. In the next session, when we have the practical challenge, you will get hands-on
experience with this kind of review.

Okay, so in the natural language processing pipeline, we have data collection; here we
assume the data comes from IMDb. Then we have data pre-processing and text cleansing.
That is very much an engineering detail, and I will keep those details for our challenge
session. The two more important stages here are feature extraction and the model, which
are what we will discuss in this session.

The very first stage we are talking about is feature extraction, and the very first model that
comes to mind for extracting features is bag-of-words. This kind of feature extraction,
bag-of-words, is sometimes referred to as a simple form of word embedding in natural
language processing.

In bag-of-words, the name says it all: we carry a bag of, literally, the words. We can see
here three small sentences: 'movie is fantastic', 'dumb movie', and 'movie is great', and here
is a bag. Into the bag, we just put movie, fantastic, dumb and great. We removed the word
'is', and we also apply case normalisation, so 'Movie' becomes lowercase. All these
engineering details we will discuss in our challenge, the practical session. Here our focus is
on counting: movie appears three times, fantastic appears once, dumb appears once and
great appears once. So now we have this bag-of-words, and this bag-of-words will represent
each of the reviews. Of course, real reviews are quite long; we are just taking a simple
example to understand the concept of bag-of-words.
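
To make this concrete, here is a minimal sketch in Python (my own illustration, not code from the lecture) of how the three example sentences could be turned into a shared vocabulary with word counts, assuming we lowercase everything and drop the word 'is' as described above.

```python
from collections import Counter

# Three toy "reviews" from the slide
sentences = ["Movie is fantastic", "Dumb movie", "Movie is great"]

# Case normalisation plus a tiny stop-word list (just "is" here, as in the example)
stop_words = {"is"}
tokenised = [
    [w for w in s.lower().split() if w not in stop_words]
    for s in sentences
]

# The "bag": total word counts over all sentences
bag = Counter(word for tokens in tokenised for word in tokens)
print(bag)  # e.g. Counter({'movie': 3, 'fantastic': 1, 'dumb': 1, 'great': 1})
```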

Let's move a little ahead. Here we can see 'movie is fantastic', 'dumb movie', and 'movie is
great', and we have feature vectors such as 1, 1, 0, 0 for the first sentence. Maybe some
students already see why the vector has length four and why some entries are one. If not,
let me explain. We have four words in the vocabulary, so we have a vector of length four.
For 'movie is fantastic' we have movie and fantastic: the first attribute is movie, the second
attribute is fantastic, then dumb and great. There is no dumb or great in the first sentence,
so we get this feature vector representation for it, and similarly for 'dumb movie' and 'movie
is great'. Now, these feature vectors don't say anything about whether the review is positive
or negative, so let me introduce positive and negative labels here: we can denote the
feature vectors x1, x2 and x3 and consider whether each of them is positive or negative. In
simple cases, we can represent positive or negative with 1 or 0.

So these feature vectors now represent which of the small reviews, or sentences, are
positive or negative. If we map this back, it tells us that we have a representation as
numerical vectors. The next steps, once we have this feature extractor, are data processing,
which, as I said, covers things like upper and lowercase letters and removing some tokens,
engineering details that we will discuss in the challenge or practical session. That is the
point we have reached.

We need to split our dataset into training and testing. We get these feature vectors, as we
can see from the previous slide.

These are the feature vectors with positive and negative class labels. We bring these
feature vectors and the labels to train the model. Later on, at inference time, we go through
all these steps and predict whether an unseen review is positive or negative. I hope it is
clear how we are progressing from feature extraction to the model; of course, we need to
look in more detail at model training and how we do the inference.
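
As a rough illustration of this pipeline (this is my own sketch, not code from the module, and it assumes scikit-learn is available; the toy reviews are made up), the split, training and inference steps might look something like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labelled reviews: 1 = positive, 0 = negative
reviews = ["movie is fantastic", "dumb movie", "movie is great", "boring and slow"]
labels = [1, 0, 1, 0]

# Feature extraction: bag-of-words vectors
vectoriser = CountVectorizer()
X = vectoriser.fit_transform(reviews)

# Split into training and testing sets (stratified so both classes appear in each split)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)

# Train a Naive Bayes model and predict labels for the unseen reviews
model = MultinomialNB()
model.fit(X_train, y_train)
print(model.predict(X_test))
```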

So, model training. For the model, we can use classical machine learning models: Naive
Bayes, Support Vector Machines, the Perceptron, and the list becomes very long once we
also include deep learning models, especially recurrent neural networks, long short-term
memory cells, GRUs, or even transformers. But today our discussion will stay at a very
introductory level, and we will talk about Naive Bayes.

There are very good reasons to start natural language processing with Naive Bayes. The
first reason is that it explicitly manipulates probabilities, and I assume we have some basic
knowledge of probabilities, prior and posterior probabilities. It is a practical approach and
competitive with other models such as decision trees; to some extent it is even comparable
with a neural network, at least a shallow one, not the deep models.

The second reason is that it gives a very good understanding of how machines learn from
these natural language vectors and produce an outcome from the analysis, positive or
negative. At the same time, it sets a kind of gold standard: for example, if students come up
with a new idea and a new model, they can compare their results against the Naive Bayes
one. Now, Bayes' theorem. We are familiar with these probabilities; we are going to have a
hypothesis.

A hypothesis is just like a guess: it may be correct or not. In science we can define a
hypothesis, say 'I want to test this', and we are not sure whether it is correct, so we look at
the data. We do some experiments, and then we can find the probability that this
hypothesis holds given the data. So H is the hypothesis, D is our data, and we can write
Bayes' theorem: the probability of the hypothesis given the data is the probability of the
data given the hypothesis, times the prior probability of the hypothesis, divided by the prior
probability of the data. This is the very basic model underlying Bayes' theorem.
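
Written out symbolically (this is just the standard form of Bayes' theorem, added here for reference):

```latex
P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}
```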

Let's get into a little more detail. We are going to use our bag-of-words, written BOW in this
case. We consider the vectors from our previous extraction, x1, x2, up to xn, where n is how
many attributes each feature vector has, and each vector has a class label, which we usually
represent with C or Y.

Just to give the idea, this is the representation of the class label, positive or negative, so we
can represent it with C, Y, or C_MAP, where MAP stands for maximum a posteriori;
sometimes it is also written as C hat. We are interested in finding, over all the arguments,
the class label with the maximum probability for the given feature vector.

We can rewrite this probability with Bayes' theorem, and then we simply drop the
probability in the denominator. The reason is that the probability of x1, x2, ..., xn is
independent of the class, and here we are only interested in how the class relates to the
attributes in our feature vector.

In a little more detail, these are probabilities that we calculate over the given examples and
the given class labels. We can take this product: the pi sign is the product of the
probabilities of xi given class cj. This is the very basic assumption in Naive Bayes: we
consider each attribute that appears in the vector to be independent of the others given the
class.
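
Putting the last few slides into one formula (standard Naive Bayes notation; the symbols x_i and c_j match the attributes and class labels described above):

```latex
c_{\mathrm{MAP}}
  = \arg\max_{c_j} P(c_j \mid x_1, \dots, x_n)
  = \arg\max_{c_j} \frac{P(x_1, \dots, x_n \mid c_j)\,P(c_j)}{P(x_1, \dots, x_n)}
  = \arg\max_{c_j} P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j)
```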

This independence assumption tells us that we can replace the joint probability with the
product. And of course, when we substitute this term, this is from the previous slide, here
we are: we replace the term with the product, and this is just the prior probability of the
classes. Specifically, in our case we have just two classes, positive and negative, so we will
calculate those prior probabilities. And this is P(xi given the class), whether or not each
attribute belongs to that class; we just multiply those individual probabilities. Sometimes,
for students who don't have a strong background, by which I mean they haven't had much
interaction with probabilities, it is not really clear how to handle these situations. But I have
a very good slide here; let me explain the probabilities.

This relates to the probabilities we are estimating: individual attributes given the class,
which reduces the number of parameters. We are interested in these attribute probabilities
given the class, and in the prior probabilities of the classes based on their frequencies in the
training data. Everything we need to calculate is available in the data, and the data is
nothing more than what you see here: we can calculate everything from these simple
feature vectors. These are the inputs to the model, and now we are very close to building
some intuition.

So thank you to Alvin, who made this very nice intuition slide for Naive Bayes. We can see
here the prior probabilities, the prior probabilities of the positive and negative classes. We
are estimating the training parameters, calculating the probability of each word in each of
the positive reviews and, similarly, of each word in the negative reviews. At inference time,
we just multiply these probabilities, and whichever class gives the maximum, we assign that
label.
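
Here is a small, self-contained sketch of that training-then-multiplication idea (my own illustration with made-up toy data, using raw counts without smoothing, so it mirrors the intuition above rather than a production implementation):

```python
from collections import Counter, defaultdict

# Toy training data: (tokenised review, label)
training = [
    (["movie", "fantastic"], "pos"),
    (["movie", "great"], "pos"),
    (["dumb", "movie"], "neg"),
    (["boring", "movie"], "neg"),
]

# Training parameters: class priors and per-class word counts
class_counts = Counter(label for _, label in training)
priors = {c: class_counts[c] / len(training) for c in class_counts}

word_counts = defaultdict(Counter)
for tokens, label in training:
    word_counts[label].update(tokens)

def word_prob(word, label):
    # P(word | class), estimated directly from frequencies (no smoothing, purely illustrative)
    total = sum(word_counts[label].values())
    return word_counts[label][word] / total

def predict(tokens):
    scores = {}
    for label in priors:
        score = priors[label]          # start from the class prior
        for w in tokens:
            score *= word_prob(w, label)  # multiply the individual word probabilities
        scores[label] = score
    return max(scores, key=scores.get)    # pick the class with the higher score

print(predict(["fantastic", "movie"]))  # expected: 'pos'
```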

So, let me quickly recap the intuition: we estimate the training parameters, and at
prediction or inference time we just do this multiplication; whichever probability is higher
decides whether we call it a positive or a negative review. Now, once we have these model
parameters and have learned these probability distributions, we have to check whether our
model performed well or not. So we have the two labels, and I hope students remember
that zero represents negative and one represents positive, and this is the predicted label
from our model. Now we have to compare them, so please concentrate a little. The true
label here is zero and the predicted label is zero.

That means the review is negative and the label predicted by our Naive Bayes is also
negative, so the model is working fine.

But in another case, the true label is negative and the model says it is a positive review. So
we have all these predictions, and we can measure this performance with the help of a
confusion matrix. The confusion matrix simply records true positives, false negatives, false
positives, and true negatives. In this case, the true positives are 2 and the false negatives
are 2; I will leave it to you to map these four definitions onto the predictions, which also
give one false positive and five true negatives.

We can calculate the accuracy by placing these numbers into the formula and adding them
up: the accuracy is (2 + 5) / 10 = 0.7, which tells us the model has an accuracy of 70%. Of
course, accuracy is not always a very good measure; we can also calculate precision and
recall, and I will give a little more detail on those during the challenge or practical session.
But for evaluation, we have to look at this confusion matrix.

The confusion matrix represents our true positives, false negatives, true negatives, and false
positives, and from it we can calculate the different performance measures.
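
As a small illustration (again my own sketch, assuming scikit-learn; the labels are chosen so the counts match the 2/2/1/5 example above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# True and predicted labels: 1 = positive, 0 = negative
# Chosen so that TP = 2, FN = 2, FP = 1, TN = 5, as in the lecture example
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))  # rows are true class: [[TN, FP], [FN, TP]] = [[5, 1], [2, 2]]
print(accuracy_score(y_true, y_pred))    # (2 + 5) / 10 = 0.7
```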

Bag-of-words has some limitations. A sentence like 'today is off' and its reordering 'is today
off' have very different meanings, but when we represent them as feature vectors, word
order is lost and both map to the same bag-of-words. Similarly, pay attention to the
semantics: 'buy used cars' and 'purchase old automobiles' are two different sentences with
the same meaning, but we get completely different representations. These are the
limitations of bag-of-words.

Of course, we have other techniques: word2vec, term frequency-inverse document
frequency (TF-IDF), and many more. But at a basic level we have a very good start on how
to build the bag-of-words, represent our feature vectors and go through all these steps. We
also have some limitations in Naive Bayes, where we assume every attribute is independent,
but naturally, in text, the attributes are not independent. They have some context, and we
can represent that context with Bayesian networks, or with RNNs and LSTMs, but these are
just high-level mentions; the deeper details of those techniques are not part of today's
discussion, I just want to show that nicer models exist as well.

So, in summary, we went through feature extraction and through the model; we saw how
we can build the vectors, how we can calculate these probabilities, and how we obtain the
maximum a posteriori class from the posterior probabilities and the class priors. We also
discussed some of the limitations. Discussing these limitations doesn't mean the methods
are not worthwhile; rather, in some scenarios we have more complex and more powerful
methods, and it totally depends on how complex the problem in hand is.

Thank you and if there are any questions, I'm happy to answer.

RK: Thank you, Dr Fahim. It was really informative. The way you described bag-of-words,
the Naive Bayes model, how to design it with the help of feature vectors, and then the
evaluation of the model with the confusion matrix was really helpful. This will help the
students by showing a path for how to design a model and how to evaluate it with the
pipeline you showed. So it was amazing, thank you so much again. And by the way, I have a
question about your slide 8. If you go to slide 8, in the bag-of-words here, the only word
that repeats is 'movie'. So how do you treat the vector then? If a word appears twice or
three times, how do you treat it?

MF: Okay, so for instance, some words can repeat if the review in the review section is long.
At the moment, because the sentences are small, 'movie' appears just once in each, but
there is a possibility that in a large review it appears twice. Is this the question? Okay. In
that case, I can say that the bag-of-words doesn't have to be a binary representation of one
or zero; you can keep the counts. For example, if a word appears more than once, we just
put that number: if 'fantastic' appears three times in the review, the corresponding entry in
the feature vector will be three. So it is possible to incorporate a word appearing more than
once into a feature vector. Is it okay?
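
To illustrate the difference between the binary and the count representation mentioned here (my own sketch, assuming scikit-learn's CountVectorizer):

```python
from sklearn.feature_extraction.text import CountVectorizer

review = ["fantastic movie, fantastic story, fantastic effects"]

# Count representation: each entry is how many times the word appears
counts = CountVectorizer().fit_transform(review)
print(counts.toarray())  # the column for 'fantastic' gets a 3

# Binary representation: each entry is just presence (1) or absence (0)
binary = CountVectorizer(binary=True).fit_transform(review)
print(binary.toarray())  # every present word gets a 1
```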

RK: Yeah, okay, that answers the question. The second question is from the students, who
want to know: what if the probabilities are 50-50? How do you treat that with the Naive
Bayes model?

MF: In that case, and I think this is a good point to explain, we have this prior probability.
When I say prior, it is because we can calculate the class probability. The question is, if we
have 100 reviews, 50 positive and 50 negative, then the probability of a review belonging to
the positive class is 0.5 and the probability of it belonging to the negative class is 0.5. It
means we are multiplying the same number every time in this product. So we can even
change this term: if the priors are 50-50, we can remove it and just calculate the likelihood
term, and it still gives us the maximum. The model will work fine either way, but if you know
the prior probabilities are equal, then in terms of computation you can remove this term
and reduce the number of calculations in the multiplication.
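
In other words (just restating the point above in symbols, assuming exactly equal priors):

```latex
P(\mathrm{pos}) = P(\mathrm{neg}) = 0.5
\;\Longrightarrow\;
\arg\max_{c}\; P(c)\prod_{i} P(x_i \mid c)
  \;=\; \arg\max_{c}\; \prod_{i} P(x_i \mid c)
% the constant factor 0.5 does not change which class attains the maximum
```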

RK: Okay, thank you. So this is the end of part two, and we will continue with Dr Fahim,
who will explain how NLP, natural language processing, is used in computer vision and for
information retrieval. So, thank you.

MF: Thank you.

[End 00:30:48]
