
Date:- 25/08/22

N-GRAM IN PYTHON

Submitted By:- Ayush Jain
Enrollment No.:- 0103AL213D01
Branch:- CSE-AIML

Submitted To:- Prof. Tanushree Dholpuria
TABLE OF CONTENTS
 Describe N-gram in Python
 Implementation of N-gram in Python
1. Explore the data
2. Feature extraction
3. Training
4. Basic preprocessing
5. Creating Unigram, Bigram & Trigram
N-GRAM
 An N-gram can be defined as a contiguous sequence of n items from a
given sample of text or speech. The items can be letters, words, or
base pairs according to the application. N-grams are typically
collected from a text or speech corpus (a long text dataset).
 An N-gram model predicts the most probable word that might follow a
given sequence. It is a probabilistic model trained on a corpus of
text. Such a model is useful in many NLP applications, including
speech recognition, machine translation and predictive text input.
 An N-gram model is built by counting how often word sequences
occur in the corpus text and then estimating the probabilities. Since a
simple N-gram model has limitations, improvements are often made
via smoothing, interpolation and backoff (a small counting sketch
follows the table below).
N Terms
1 Unigram
2 Bigram
3 Trigram
n N-gram
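
For example, the counting idea can be sketched in a few lines of plain
Python (a toy example using maximum-likelihood estimates; smoothing and
backoff are omitted):

from collections import Counter

# Toy corpus: in practice this would be a large text dataset
corpus = "the cat sat on the mat the cat ate the food".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_probability(w1, w2):
    # Maximum-likelihood estimate: count(w1, w2) / count(w1)
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# "the" is followed by "cat" in 2 of its 4 occurrences
print(bigram_probability("the", "cat"))  # 0.5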
Implementation of N-gram in
Python
To start detecting N-grams in Python, you will first have to
install the TextBlob package through this command:
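Assuming pip is available on your system:

pip install textblob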

Let's get started:

We have created a sentence string containing the sentence we want to
analyse. We have then passed that string to the TextBlob constructor,
injecting it into the TextBlob instance that we will run operations on.
The ngrams() function returns a list of groups of n successive words. For
our sentence, a bigram model will give us the following set of strings:
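
A minimal sketch of these steps, assuming TextBlob and its corpora are
installed (the sentence below is only an illustration):

from textblob import TextBlob

# An illustrative sentence (any short piece of text will do)
sentence = "Natural language processing helps computers understand human language"

# Wrap the raw string in a TextBlob instance so we can run operations on it
blob = TextBlob(sentence)

# ngrams(n=2) yields every pair of successive words (a bigram model)
for ngram in blob.ngrams(n=2):
    print(ngram)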
FEATURE EXTRACTION
Our objective is to predict the sentiment of a given news headline. The
‘News Headline’ column is our only feature and the ‘Sentiment’ column is
our target variable.
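A minimal loading sketch, assuming the headlines live in a CSV file with
'News Headline' and 'Sentiment' columns (the filename below is hypothetical):

import pandas as pd

# Hypothetical filename; replace it with the actual dataset path
df = pd.read_csv("news_headlines.csv")
df.head()
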
y = df['Sentiment'].values
y.shape

x = df['News Headline'].values
x.shape

Both outputs return a shape of (4846,), i.e. 4846 rows each, since we have
4846 rows of data with just one feature (x) and one target (y).
Training–Testing
In any machine learning, deep learning, or NLP (Natural Language Processing)
task, splitting the data into train and test sets is a highly crucial step.
The train_test_split() method provided by sklearn is widely used for this
purpose. So, let's begin by importing it:

from sklearn.model_selection import train_test_split

We have split the data this way: 60% for train and the remaining 40% for
test. I had started with 20% for the test set and kept playing with the
test_size parameter, only to realise that a 60-40 split provides more useful
and meaningful insights from the trigrams generated. Don't worry, we will
be looking at trigrams in just a while.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)

# Check the shapes of the resulting splits
x_train.shape
y_train.shape
x_test.shape
y_test.shape
Basic preprocessing
In order to pre-process our text data, we will remove punctuation from the
'news' column in the train and test data, using the punctuation constant
provided by the string library.
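
The code below expects df_train and df_test DataFrames with a 'news'
column; one hypothetical way to build them from the split arrays is:

import pandas as pd

# Hypothetical reconstruction: wrap the split arrays back into DataFrames
df_train = pd.DataFrame({'news': x_train, 'sentiment': y_train})
df_test = pd.DataFrame({'news': x_test, 'sentiment': y_test})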

import string

string.punctuation

# Remove every punctuation character from a piece of text
def remove_punctuation(text):
    if type(text) == float:
        return text
    ans = ""
    for i in text:
        if i not in string.punctuation:
            ans += i
    return ans

df_train['news'] = df_train['news'].apply(lambda x: remove_punctuation(x))
df_test['news'] = df_test['news'].apply(lambda x: remove_punctuation(x))
df_train.head()

Compare the above output with the previous output of df_train. You can
observe that punctuation has been successfully removed from the feature
column of the training dataset. Similarly, with the above code, punctuation
is removed from the news column of the test data frame as well. You can
optionally view df_test.head() to confirm it.
Creating Unigram, Bigram & Trigram
 Creating Unigram :-
To generate 1-grams we pass the value n=1 to the ngrams() function of NLTK.
But first, we split the sentence into tokens and then pass these tokens to
the ngrams() function. As we can see, we get one word in each tuple for the
unigram model.

from nltk.util import ngrams

n = 1
sentence = 'You will face many defeats in life, but never let yourself be defeated.'

# Each tuple holds a single word, e.g. ('You',)
unigrams = ngrams(sentence.split(), n)
for item in unigrams:
    print(item)

 Creating Bigram :-
For generating 2-grams we pass the value n=2 to the ngrams() function of
NLTK. But first, we split the sentence into tokens and then pass these
tokens to the ngrams() function.
from nltk.util import ngrams

n = 2
sentence = 'You will face many defeats in life, but never let yourself be defeated.'

# Each tuple holds two successive words, e.g. ('You', 'will')
bigrams = ngrams(sentence.split(), n)
for item in bigrams:
    print(item)

 Creating Trigram :-
In the case of 3-grams, we pass the value n=3 to the ngrams() function of
NLTK. But first, we split the sentence into tokens and then pass these
tokens to the ngrams() function.
from nltk.util import ngrams

n = 3
sentence = 'You will face many defeats in life, but never let yourself be defeated.'

# Each tuple holds three successive words, e.g. ('You', 'will', 'face')
trigrams = ngrams(sentence.split(), n)
for item in trigrams:
    print(item)
