N-GRAM IN PYTHON
N    Term
1    Unigram
2    Bigram
3    Trigram
n    N-gram
Implementation of N-grams in Python
To start detecting N-grams in Python, you will first have to install the TextBlob package through this command: pip install textblob
Next, extract the feature values from the data frame:
x = df['News Headline'].values
x.shape
Both outputs return a shape of (4846,), which means 4846 rows and 1 column: we have 4846 rows of data, with just one feature (x) and one target (y).
Training–Testing
In machine learning, deep learning, and NLP (Natural Language Processing) tasks, splitting the data into train and test sets is a highly crucial step. The train_test_split() function provided by sklearn is widely used for this. So, let's begin by importing it:
from sklearn.model_selection import train_test_split
We have split the data this way: 60% for train and the remaining 40% for test. I started with 20% for the test set and kept adjusting the test_size parameter, only to realize that the 60-40 split provides more useful and meaningful insights from the trigrams generated. Don't worry, we will be looking at trigrams in just a while.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
x_train.shape
y_train.shape
x_test.shape
y_test.shape
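To see the split in action end to end, here is a minimal, self-contained sketch using a small synthetic array instead of the article's data frame (the dummy headlines and labels are purely illustrative), with the same 60-40 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the 'News Headline' feature and a binary target
x = np.array([f"headline {i}" for i in range(10)])
y = np.array([i % 2 for i in range(10)])

# test_size=0.4 reproduces the 60-40 split used above;
# random_state is fixed here only to make the example repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

print(x_train.shape)  # (6,)
print(x_test.shape)   # (4,)
```

Note that without random_state, each run produces a different random split, which is fine for exploration but makes results harder to reproduce.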
Basic preprocessing
In order to pre-process our text data, we will remove punctuation from the 'news' column in the train and test data, using the punctuation constant provided by the string library.
import string
string.punctuation

def remove_punctuation(text):
    # Missing values in pandas are floats (NaN); return them unchanged
    if type(text) == float:
        return text
    ans = ""
    for i in text:
        if i not in string.punctuation:
            ans += i
    return ans
df_train['news'] = df_train['news'].apply(lambda x: remove_punctuation(x))
df_test['news'] = df_test['news'].apply(lambda x: remove_punctuation(x))
df_train.head()
Compare the above output with the previous output of df_train. You can observe that punctuation has been successfully removed from the feature column of the training dataset. Similarly, the code above removes punctuation from the news column of the test data frame as well. You can optionally view df_test.head() to confirm it.
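As a quick sanity check, the remove_punctuation helper defined above behaves like this on a sample headline (a standalone sketch, independent of the data frame; the sample string is made up):

```python
import string

def remove_punctuation(text):
    # Same logic as above: skip any character listed in string.punctuation
    if type(text) == float:
        return text
    ans = ""
    for i in text:
        if i not in string.punctuation:
            ans += i
    return ans

# Hyphens and commas are in string.punctuation, so they are stripped too
print(remove_punctuation("Wow!!! N-grams, explained."))  # Wow Ngrams explained
```

One consequence worth noting: because the hyphen is punctuation, hyphenated words like "N-grams" collapse into a single token ("Ngrams") after this step.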
Creating Unigrams, Bigrams & Trigrams
Creating Unigrams:
To generate 1-grams we pass the value of n=1 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function. As we can see, we get one word in each tuple for the unigram model.
from nltk.util import ngrams
n = 1
sentence = 'You will face many defeats in life, but never let yourself be defeated.'
unigrams = ngrams(sentence.split(), n)
for item in unigrams:
    print(item)
Creating Bigrams:
For generating 2-grams we pass the value of n=2 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function.
from nltk.util import ngrams
n = 2
sentence = 'You will face many defeats in life, but never let yourself be defeated.'
bigrams = ngrams(sentence.split(), n)
for item in bigrams:
    print(item)
Creating Trigrams:
In the case of 3-grams, we pass the value of n=3 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function.
from nltk.util import ngrams
n = 3
sentence = 'You will face many defeats in life, but never let yourself be defeated.'
trigrams = ngrams(sentence.split(), n)
for item in trigrams:
    print(item)
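For reference, the same sliding-window behaviour can be sketched in plain Python without NLTK. The helper name my_ngrams below is made up for illustration; it is a minimal stand-in for nltk.util.ngrams:

```python
def my_ngrams(tokens, n):
    # Slide a window of size n over the token list, yielding one tuple
    # per position, exactly as NLTK's ngrams does for a list input
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "You will face many defeats in life"
tokens = sentence.split()

print(my_ngrams(tokens, 2)[:2])
# [('You', 'will'), ('will', 'face')]
```

This makes the pattern behind all three sections explicit: unigrams, bigrams, and trigrams differ only in the window size n passed over the same token list.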