
------------------------------------------------------------------------------------

NLP Using Python


sentence = """At eight o'clock on Thursday morning... Arthur didn't feel very good."""

tokens = nltk.word_tokenize(sentence)

print(tokens)
tagged = nltk.pos_tag(tokens)

print(tagged[0:6])

entities = nltk.chunk.ne_chunk(tagged)
print(entities)

from nltk.corpus import treebank


t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()

wordfreq = nltk.FreqDist(words)   # words: a token list (e.g. from nltk.word_tokenize)
wordfreq.most_common(2)
[('programming', 2), ('.', 2)]

nltk.download('book')
from nltk.book import *

text1.findall("<tri.*r>")
type(text1)

n_unique_words = len(set(text1))

text1_lcw = [ word.lower() for word in set(text1) ]


n_unique_words_lc = len(set(text1_lcw))
n_words = len(text1)
word_coverage1 = n_words / n_unique_words
word_coverage2 = n_words / n_unique_words_lc
big_words = [word for word in set(text1) if len(word) > 17 ]
sun_words = [word for word in set(text1) if word.startswith('Sun') ]
text1_freq = nltk.FreqDist(text1)

top3_text1 = text1_freq.most_common(3)
####TEXT CORPORA
Popular Text Corpora
Genesis: the Book of Genesis in several languages.
Brown: the first electronic corpus of one million English words.

Other corpora in nltk


Gutenberg : Collections from Project Gutenberg
Inaugural : Collection of U.S. Presidents' inaugural speeches

stopwords : Collection of stop words.


reuters : Collection of news articles.
cmudict : Collection of CMU Dictionary words.
movie_reviews : Collection of Movie Reviews.
nps_chat : Collection of chat text.
names : Collection of names associated with males and females.
state_union : Collection of State of the Union addresses.
wordnet : Collection of all lexical entries.
---------------------------------------------------------------------------------------------------

2166
18.55

Words in text6 ending with 'ise': ['noise','surprise','wise','apologise'] = 4

How many times each unique word of text6 collection is repeated on an average?
7.8

Count the number of words in text collection, text6, ending with ship?
4 (x)

How many times does the word 'BROTHER' occur in text collection text6?

What is the frequency of word 'ARTHUR' in text collection text6?


X-0.0101

Which of the following modules is used for performing Natural language processing in python?
nltk

Which of the following expression is used to download all the required corpora and collections related to the NLTK Book?
nltk.download('book')

What is the range of length of words present in text collection text6?


1 to 13
Into how many categories are all text collections of the brown corpus grouped?
15
Which of the following method is used to determine the number of characters present in a corpus?
X-char()
raw()   # character count: len(corpus.raw())

#############
items = ['apple', 'apple', 'kiwi', 'cabbage', 'cabbage', 'potato']
nltk.FreqDist(items)   # FreqDist({'apple': 2, 'cabbage': 2, 'kiwi': 1, 'potato': 1})

How many times does the word sugar occur in text collections, grouped into genre 'sugar'? Consider reuters corpus.
X-252

How many times does the word zinc occur in text collections, grouped into genre 'zinc'? Consider reuters corpus
x-50

Which of the following class is used to determine count of all tokens present in a given text ?
FreqDist

What is the number of sentences obtained after breaking 'Python is cool!!!' into sentences using sent_tokenize?
2

Which of the following method is used to tokenize a text based on a regular expression?
regexp_tokenize()
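A minimal, data-free sketch of regexp_tokenize (the pattern r'\w+' is an illustrative choice, consistent with the re.findall snippet later in these notes):

```python
from nltk.tokenize import regexp_tokenize

# tokenize by matching runs of word characters; punctuation is dropped
tokens = regexp_tokenize('Python is cool!!!', r'\w+')
print(tokens)   # ['Python', 'is', 'cool']
```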

Which of the following class is used to convert a list of tokens into NLTK text?
X-nltk.text
nltk.Text

Which of the following module can be used to read text data from a pdf document?
pypdf

Which of the following module is used to download text from a HTML file?
urllib

Which of the following is not a collocation, associated with text6?


squeak squeak

What is the frequency of bigram ('King', 'Arthur') in text collection text6?


X32

Which of the following function is used to generate a set of all possible n consecutive words appearing in a text
ngrams()
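A short sketch of nltk.util.ngrams. Note that tokenizing 'Python is cool!!!' yields 6 tokens (the word tokenizer splits off each '!'), which is consistent with the recorded answer of 4 trigrams:

```python
from nltk.util import ngrams

# 6 tokens -> 6 - 3 + 1 = 4 trigrams
tokens = ['Python', 'is', 'cool', '!', '!', '!']
trigrams = list(ngrams(tokens, 3))
print(len(trigrams))   # 4
print(trigrams[0])     # ('Python', 'is', 'cool')
```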
#########
For the word 'builder': Lancaster Stemmer returns build; Porter Stemmer returns builder.
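The contrast can be reproduced directly; 'builder' is the word the notes appear to refer to:

```python
import nltk

# Lancaster is the more aggressive of the two stemmers
lancaster = nltk.LancasterStemmer()
porter = nltk.PorterStemmer()
print(lancaster.stem('builder'))   # build
print(porter.stem('builder'))      # builder
```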

################FINAL############################
What is the output of the following expression?
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('power'))
pow

What is the total number of unique words present in text collection text6, considering characters too as words?
2166

How many words are ending with 'ing' in text collection text6?
109

Count the number of words in text collection, text6, which have only digits as characters?
24

Which of the following NLTK corpus represent a collection US presidential inaugural addresses, starting from 1789?
inaugural

Which tag occurs maximum in text collections associated with news genre of brown corpus?
NN

How many words are obtained when the sentence Python is cool!!! is tokenized into words with a regular expression?
3

import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('women'))
wom

Which of the following is a Text corpus structure?


All of those mentioned

Which of the following module is used to download text from a HTML file
urllib
How many times does the word sugar occur in text collections, grouped into genre 'sugar'? Consider reuters corpus.
0

How many times do the words tonnes and year occur in text collections, grouped into genre sugar? Consider reuters corpus.
355, 196

How many times is the tag AT associated with the word The in brown corpus?
7824

How many times do the words lead and smelter occur in text collections, grouped into genre zinc? Consider reuters corpus.
32, 33

###################
import re
text = 'Python is cool!!!'
tokens = re.findall(r'\w+', text)
len(tokens)
3

#get tags from brown


from nltk.corpus import brown
brown_tagged = brown.tagged_words()
len(brown_tagged)   # 1161192

import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}
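One plausible continuation (an assumption, not shown in the original): a dictionary like defined_tags can serve as the model of a lookup tagger via nltk.UnigramTagger(model=...). A pre-tokenized sentence is used here to avoid needing the punkt data:

```python
import nltk

# lookup model: known words map to fixed tags, unknown words get None
defined_tags = {'is': 'BEZ', 'over': 'IN', 'who': 'WPS'}
tagger = nltk.UnigramTagger(model=defined_tags)
print(tagger.tag(['Python', 'is', 'awesome', '.']))
# [('Python', None), ('is', 'BEZ'), ('awesome', None), ('.', None)]
```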

-------------------------------------------------------------------------------------------------------
LIBRARY MANUAL:
https://www.nltk.org/book/ch02.html
ONLINE CONSOLE PYTHON3:
https://www.katacoda.com/courses/python/playground
pip3 install --user setuptools && pip3 install nltk
python3 -c "import nltk; nltk.download('book')"
--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
Which of the following is not a collocation, associated with text6 ?
import nltk
from nltk.book import text6
gen_text = nltk.Text(text6)
gen_text.collocations()   # prints the collocations itself (returns None)
Straight Table
--------------------------------------------------------------------------------------------------------
How many times is the tag AT associated with the word The in brown corpus?
import nltk
from nltk.corpus import brown
brown_text_tagged = nltk.corpus.brown.tagged_words()
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_text_tagged if tag == 'AT' and word == 'The')
print(tag_fd['AT'])
6725
--------------------------------------------------------------------------------------------------------
Which of the following function is used to tag parts of speech to words appearing in a text?
pos_tag()
--------------------------------------------------------------------------------------------------------
How many words are ending with 'ly' in text collection text6?
import nltk
from nltk.book import text6
ly_ending_words = [word for word in text6 if word.endswith('ly') ]
print(len(ly_ending_words))
109
Which of the following NLTK corpus represents a collection of US presidential inaugural addresses, starting from 1789?
inaugural
--------------------------------------------------------------------------------------------------------
Which of the following method can be used to determine the number of text collection files associated with a corpus?
fileids()
--------------------------------------------------------------------------------------------------------
Count the number of words in text collection, text6, which have only digits as characters?
24
--------------------------------------------------------------------------------------------------------
Which of the following method is used to view the tagged words of text corpus
tagged_words()
--------------------------------------------------------------------------------------------------------
What is the output of the following expression?
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('lying'))
lying
--------------------------------------------------------------------------------------------------------
What is the frequency of bigram ('HEAD', 'KNIGHT') in text collection text6
import nltk
from nltk.book import text6
bigrams = nltk.bigrams(text6)
filtered_bigrams = [(w1, w2) for w1, w2 in bigrams if w1 == 'HEAD' and w2 == 'KNIGHT']
print(len(filtered_bigrams))
29
--------------------------------------------------------------------------------------------------------
What is the output of the following expression ?
import nltk
porter = nltk.PorterStemmer()
print(porter.stem('ceremony'))
ceremoni
--------------------------------------------------------------------------------------------------------
Which of the following method is used to tokenize a text based on a regular expression
regexp_tokenize()
--------------------------------------------------------------------------------------------------------
What is the frequency of word 'ARTHUR' in text collection text6
import nltk
from nltk.book import text6
fdist = nltk.FreqDist(text6)
print(fdist.freq('ARTHUR'))
0.0132
--------------------------------------------------------------------------------------------------------
Which of the following function is used to obtain the set of all pairs of consecutive words appearing in a text?
bigrams()
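A data-free sketch of bigrams() (the sample token list is illustrative):

```python
import nltk

# consecutive word pairs; nltk.bigrams returns a generator
pairs = list(nltk.bigrams(['We', 'are', 'the', 'knights']))
print(pairs)   # [('We', 'are'), ('are', 'the'), ('the', 'knights')]
```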
--------------------------------------------------------------------------------------------------------
What is the range of length of words present in text collection text6?
X-1 to 10
--------------------------------------------------------------------------------------------------------
What is the output of the following code?
import re
s = 'Python is cool!!!'
print(re.findall(r'\s\w+\b', s))
[' is', ' cool']
--------------------------------------------------------------------------------------------------------
What are the categories to which the text collection test/16438 of the reuters corpus is tagged?
earn, gas

Which of the following method can be used to determine the location of a text collection, associated with a corpus
abspath()

Which of the following class is used to convert your own collections of text into a corpus?
PlaintextCorpusReader
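A minimal sketch, assuming your own plain-text files live in a local directory (the directory, file name, and pattern below are illustrative; words() needs no downloaded data because it uses a word/punctuation tokenizer):

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# build a tiny corpus from our own text files
root = tempfile.mkdtemp()
with open(os.path.join(root, 'doc1.txt'), 'w') as f:
    f.write('Python is cool. NLTK is fun.')

corpus = PlaintextCorpusReader(root, r'.*\.txt')
print(corpus.fileids())            # ['doc1.txt']
print(len(corpus.words('doc1.txt')))
```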
--------------------------------------------------------------------------------------------------------
What is the output of the following expression?
import nltk
wnl = nltk.WordNetLemmatizer()
print(wnl.lemmatize('women'))
woman
--------------------------------------------------------------------------------------------------------
Which of the following NLTK corpus represent a collection of around 10000 news articles?
reuters
--------------------------------------------------------------------------------------------------------
How many times each unique word of text6 collection is repeated on an average?
X-6.5 times
--------------------------------------------------------------------------------------------------------
What is the frequency of bigram ('BLACK', 'KNIGHT') in text collection text6?
import nltk
from nltk.book import text6
bigrams = nltk.bigrams(text6)
filtered_bigrams = [ (w1, w2) for w1, w2 in bigrams if w1=='BLACK' and w2=='KNIGHT']
print(len(filtered_bigrams))
32
--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
pip3 install --user setuptools && pip3 install nltk
python3 -c "import nltk; nltk.download('book')"

import nltk
from nltk.book import text6
n = len(text6)
print(n)

u = len(set(text6))
print(u)

wc = n/u
print(wc)

ise_ending_words = [word for word in set(text6) if word.endswith('ise') ]


print(len(ise_ending_words))

contains_z = len([word for word in set(text6) if 'z' in word])


print(contains_z)

contains_pt = len([word for word in set(text6) if 'pt' in word])


print(contains_pt)

import re
# re.findall expects a string, not a token list like text6;
# search the vocabulary word by word instead:
title_words = [word for word in set(text6) if re.search(r'([A-Z][a-z]+)', word)]


--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk, re
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    n_words = len(gutenberg.words(fileid))
    n_unique_words = len(set(gutenberg.words(fileid)))
    word_coverage = n_words / n_unique_words
    print(word_coverage, fileid)

aus_words = len(gutenberg.words('austen-sense.txt'))
aus_words_alpha = len([word for word in gutenberg.words('austen-sense.txt') if word.isalpha()])
aus_words_gt4_z = len([word for word in gutenberg.words('austen-sense.txt')
                       if word.isalpha() and len(word) > 4 and 'z' in word])
print(aus_words_gt4_z)

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
brown_cdf = nltk.ConditionalFreqDist([
    (genre, word.lower())
    for genre in brown.categories()
    for word in brown.words(categories=genre)])

brown_cdf.tabulate(conditions=['news', 'religion','romance'], samples=['can', 'could', 'may', 'might', 'must', 'will'])

from nltk.corpus import inaugural


inaugural_cfd = nltk.ConditionalFreqDist(
    (target, fileid)
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))

print(inaugural_cfd.conditions())

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk
from urllib import request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
html_content = request.urlopen(url).read()
soup = BeautifulSoup(html_content, 'html.parser')
n_links = len(soup.find_all('a'))
print(n_links)

table = soup.find('table', attrs={'class': 'wikitable'})  # find returns one tag; find_all returns a list
rows = [elm.text for elm in table.find_all('tr')]
print(rows[1:])

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
news_words = brown.words(categories='news')
lc_news_words = [w.lower() for w in news_words]
len_news_words = [len(w) for w in lc_news_words]
news_len_bigrams = list(nltk.bigrams(len_news_words))
#Compute the conditional frequency of news_len_bigrams, where condition and event refers to length of a words.
#Store the result in cfd_news
#Determine the frequency of 6-letter words appearing next to a 4-letter word
cfd_news = nltk.ConditionalFreqDist(news_len_bigrams)
cfd_news.tabulate(conditions=[6,4])

#############
# alternative ways to count the (6, 4) length bigrams directly
filtered_bigrams = [(w1, w2) for w1, w2 in news_len_bigrams if w1 == 6 and w2 == 4]
fd_64 = nltk.FreqDist(filtered_bigrams)
print(fd_64[6, 4])

fd_64 = nltk.FreqDist((l1, l2) for (l1, l2) in news_len_bigrams if l1 == 6 and l2 == 4)
print(fd_64[6, 4])

--------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------
from nltk.corpus import brown
humor_words = brown.words(categories='humor')
lc_humor_words = [word.lower() for word in humor_words]
lc_humor_uniq_words = set(lc_humor_words)
from nltk.corpus import words
wordlist_words = words.words()
wordlist_uniq_words = set(wordlist_words)
print(len(lc_humor_uniq_words))
print(len(wordlist_uniq_words ))

--------------------------------------------------------------------------------------------------------

Import the text corpus brown.


Extract the list of tagged words from the corpus brown.
Store the result in brown_tagged_words
Generate trigrams of brown_tagged_words and store the result in brown_tagged_trigrams.
For every trigram of brown_tagged_trigrams, determine the tags associated with each word.

This results in a list of tuples, where each tuple contains the POS tags of 3 consecutive words occurring in the text.
Store the result in brown_trigram_pos_tags.
Determine the frequency distribution of brown_trigram_pos_tags and store the result in brown_trigram_pos_tags_freq.
Print the number of occurrences of trigram ('JJ','NN','IN')

--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
brown_tagged_words = [word for (word, tag) in nltk.corpus.brown.tagged_words()]
brown_tagged_trigrams = list(nltk.trigrams(brown_tagged_words))
brown_trigram_pos_tags = list()
for trigram in brown_tagged_trigrams:
    trigram_tagged = nltk.pos_tag(trigram)   # re-tags the words; slower than reusing the brown tags
    tags = [tag for (word, tag) in trigram_tagged]
    brown_trigram_pos_tags.append(tags)

brown_trigram_pos_tags_freq = nltk.FreqDist((t1, t2, t3) for (t1, t2, t3) in brown_trigram_pos_tags)
print(brown_trigram_pos_tags_freq['JJ', 'NN', 'IN'])


--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
brown_tagged_words = [word for (word, tag) in nltk.corpus.brown.tagged_words()]
brown_tagged_trigrams = list(nltk.trigrams(brown_tagged_words))
brown_trigram_pos_tags = [nltk.pos_tag(t) for t in brown_tagged_trigrams]
brown_trigram_pos_tags_freq = nltk.FreqDist(
    (t1, t2, t3) for [(_, t1), (_, t2), (_, t3)] in brown_trigram_pos_tags
    if t1 == 'JJ' and t2 == 'NN' and t3 == 'IN')

#TASK1
import nltk
from nltk.corpus import brown
brown_tagged_words = nltk.corpus.brown.tagged_words()
brown_tagged_trigrams = list(nltk.trigrams(brown_tagged_words))
#[(('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'))]
brown_trigram_pos_tags = list()
for trigram in brown_tagged_trigrams:
    tags = [tag for (word, tag) in trigram]
    brown_trigram_pos_tags.append(tags)
#[['AT', 'NP-TL', 'NN-TL']]
brown_trigram_pos_tags_freq = nltk.FreqDist((t1,t2,t3) for (t1,t2,t3) in brown_trigram_pos_tags)
print(brown_trigram_pos_tags_freq['JJ','NN','IN'])
#TASK2
import nltk
from nltk.corpus import brown
brown_tagged_sents = nltk.corpus.brown.tagged_sents()
total_size = len(brown_tagged_sents)
train_size = int(total_size * 0.8)
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
tag_performance = unigram_tagger.evaluate(test_sents)
print(tag_performance)

------------------------------------------------------------------------------------
NLP Using Python

Which of the following is not a collocation, associated with text6?


Straight table

BIGRAMS appearing in a text

What is the frequency of bigram ('clop','clop') in text collection text6?


26

How many trigrams are possible from the sentence Python is cool?
1 (3 tokens give a single trigram)

How many trigrams are possible from the sentence Python is cool!!!?
4

Which of the following word occurs frequently after the word FRENCH in text collection text6?
GUARD

What is the frequency of bigram ('HEAD','KNIGHT') in text collection text6?


29

What is the frequency of bigram ('BLACK','KNIGHT') in text collection text6?


32

What is the frequency of bigram ('King','Arthur') in text collection text6?


16

Which of the following word occurs frequently after the word Holy in text collection text6?
Grail

Which of the following function is used to generate a set of all possible n consecutive words appearing in a text?
ngrams()

Which of the following class is used to convert a list of tokens into NLTK text?
nltk.Text correct

Which of the following function is used to break given text into sentences?
sent_tokenize

---------------------------------------------------------------------------------------------------
How many times does the word gas occur in text collections, grouped into genre 'gas'? Consider reuters corpus.
10
How many times do the words gasoline and barrels occur in text collections, grouped into genre gas ? Consider reuters corpus
77,64

How many times do the words tonnes and year occur in text collections, grouped into genre sugar ? Consider reuters corpus.
355,196

Which of the following method is used to view the conditions used while computing a conditional frequency distribution?
conditions()

Which of the following class is used to determine count of all tokens present in a given text ?
FreqDist

How many times do the words lead and smelter occur in text collections, grouped into genre zinc? Consider reuters corpus.
40,33

2166
18.55

['noise','surprise','wise','apologise'] = 4

How many times each unique word of text6 collection is repeated on an average?
7.8 times

Count the number of words in text collection, text6, ending with ship?
1

How many times does the word 'BROTHER' occur in text collection text6?
4

What is the frequency of word 'ARTHUR' in text collection text6?


0.0132

Which of the following modules is used for performing Natural language processing in python?
nltk

Which of the following expression is used to download all the required corpora and collections related to the NLTK Book?
nltk.download('book')

What is the range of length of words present in text collection text6?


1 to 13

What are the categories to which the text collection test/16438 of the reuters corpus is tagged?
crude, nat-gas
Into how many categories are all text collections of the brown corpus grouped?
15

Which of the following method is used to determine the number of characters present in a corpus?
char() wrong; use len(corpus.raw())

Which of the following expression imports genesis corpus into the working environment?
from nltk.corpus import genesis

#############
items = ['apple', 'apple', 'kiwi', 'cabbage', 'cabbage', 'potato']
nltk.FreqDist(items)

How many times does the word sugar occur in text collections, grouped into genre 'sugar'? Consider reuters corpus.
521

How many times does the word zinc occur in text collections, grouped into genre 'zinc'? Consider reuters corpus
70

Which of the following class is used to determine count of all tokens present in a given text ?
FreqDist

Which of the following class is used to determine count of all tokens present in text collections, grouped by a specific condition?
ConditionalFreqDist
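A small, data-free sketch of ConditionalFreqDist with illustrative (condition, event) pairs:

```python
import nltk

# (condition, event) pairs: counts are kept separately per condition
pairs = [('news', 'the'), ('news', 'the'), ('romance', 'love')]
cfd = nltk.ConditionalFreqDist(pairs)
print(cfd.conditions())     # the two conditions, 'news' and 'romance'
print(cfd['news']['the'])   # 2
```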

Which of the following method is used, on a conditional frequency distribution, in order to display the frequencies of a few samples for the given conditions?
tabulate()

What is the number of sentences obtained after breaking 'Python is cool!!!' into sentences using sent_tokenize
2

Which of the following method is used to tokenize a text based on a regular expression?
regexp_tokenize()

Which of the following class is used to convert a list of tokens into NLTK text?
nltk.Text correct

Which of the following module can be used to read text data from a pdf document?
pypdf

Which of the following module is used to download text from a HTML file?
urllib
Which of the following is not a collocation, associated with text6?
squeak squeak

What is the frequency of bigram ('King', 'Arthur') in text collection text6?


X32 28

The process of breaking text into words and punctuation marks is known as
Tokenization
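A data-free illustration using wordpunct_tokenize (which splits words and punctuation without needing downloaded models; the sample sentence is taken from the snippet at the top of these notes):

```python
from nltk.tokenize import wordpunct_tokenize

# words and punctuation marks become separate tokens
tokens = wordpunct_tokenize("Arthur didn't feel very good.")
print(tokens)
# ['Arthur', 'didn', "'", 't', 'feel', 'very', 'good', '.']
```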

Which of the following function is used to generate a set of all possible n consecutive words appearing in a text
grams() X
#########
Lancaster Stemmer returns build
Porter Stemmer returns builder.

################FINAL############################
What is the output of the following expression?
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('power'))
pow

What is the total number of unique words present in text collection text6, considering characters too as words?
2166

What is the total number of words present in text collection text6, considering characters too as words?
16967

How many words are ending with 'ing' in text collection text6?
109

Count the number of words in text collection, text6, which have only digits as characters?
24

Which of the following NLTK corpus represent a collection US presidential inaugural addresses, starting from 1789?
inaugural

Which tag occurs maximum in text collections associated with news genre of brown corpus?
NN
How many words are obtained when the sentence Python is cool!!! is tokenized into words with a regular expression?
3

How many words are obtained when the sentence Python is cool!!! is tokenized into words?
6
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('women'))
wom

Which of the following is a Text corpus structure?


All of those mentioned

Which of the following module is used to download text from a HTML file
urllib

How many times does the word sugar occur in text collections, grouped into genre 'sugar'? Consider reuters corpus.
0

How many times do the words tonnes and year occur in text collections, grouped into genre sugar? Consider reuters corpus.
355, 196

How many times is the tag AT associated with the word The in brown corpus?
7824

How many times do the words lead and smelter occur in text collections, grouped into genre zinc? Consider reuters corpus.
40, 33

###################
import re
text = 'Python is cool!!!'
tokens = re.findall(r'\w+', text)
len(tokens)
3

#get tags from brown


from nltk.corpus import brown
brown_tagged = brown.tagged_words()
1161192

import nltk
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
defined_tags = {'is':'BEZ', 'over':'IN', 'who': 'WPS'}
-------------------------------------------------------------------------------------------------------
LIBRARY MANUAL:
https://www.nltk.org/book/ch02.html
ONLINE CONSOLE PYTHON3:
https://www.katacoda.com/courses/python/playground
pip3 install --user setuptools && pip3 install nltk
python3 -c "import nltk; nltk.download('book')"
--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
Which of the following is not a collocation, associated with text6 ?
import nltk
from nltk.book import text6
gen_text = nltk.Text(text6)
print(gen_text.collocations())
Straight Table
--------------------------------------------------------------------------------------------------------
How many times does the tag AT is associated with the word The in brown corpus?
import ntltk
from nltk.corpus import brown
brown_text_tagged = nltk.corpus.brown.tagged_words()
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_text_tagged if tag=='AT' and word =='The')
print(tag_fd)
6725
--------------------------------------------------------------------------------------------------------
Which of the following function is used to tag parts of speech to words appearing in a text?
pos_tag()
--------------------------------------------------------------------------------------------------------
How many words are ending with 'ly' in text collection text6?c
import nltk
from nltk.book import text6
ly_ending_words = [word for word in text6 if word.endswith('ly') ]
print(len(ly_ending_words))
109
--------------------------------------------------------------------------------------------------------
Which of the following method can be used to determine the number of text collection files associated with a corpus?
fileids()

Which of the following method can be used to view the conditions, which are used while computing conditional frequency dis
conditions()

Which of the following method can be used to determine the number of text collection, associated with a corpus?
abspath()
--------------------------------------------------------------------------------------------------------
Count the number of words in text collection, text6, which have only digits as characters?
24
--------------------------------------------------------------------------------------------------------
Which of the following method is used to view the tagged words of text corpus
tagged_words()
--------------------------------------------------------------------------------------------------------
What is the output of the following expression?
import nltk
lancaster = nltk.LancasterStemmer()
print(lancaster.stem('lying'))
lying
--------------------------------------------------------------------------------------------------------
What is the frequency of bigram ('HEAD', 'KNIGHT') in text collection text6
import nltk
from nltk.book import text6
bigrams = nltk.bigrams(tokens)
filtered_bigrams = [ (w1, w2) for w1, w2 in bigrams if w1=='HEAD' and w2=='KNIGHT']
print(filtered_bigrams)
29
--------------------------------------------------------------------------------------------------------
What is the output of the following expression ?
import nltk
porter = nltk.PorterStemmer()
print(porter.stem('ceremony'))
ceremoni
--------------------------------------------------------------------------------------------------------
Which of the following method is used to tokenize a text based on a regular expression
regexp_tokenize()
--------------------------------------------------------------------------------------------------------
What is the frequency of word 'ARTHUR' in text collection text6 R: 0.0132
import nltk
from nltk.book import text6
fdist = nltk.FreqDist(text6)
print(fdist.freq('ARTHUR'))
0.0132
--------------------------------------------------------------------------------------------------------
Which of the following function is used to obtain set of all pair of consecutive words appearing in a text?
bigrams()
--------------------------------------------------------------------------------------------------------
What is the range of length of words present in text collection text6?
X-1 to 10
--------------------------------------------------------------------------------------------------------
What is the output of the following code?
import re
s = 'Python is cool!!!'
print(re.findall(r'\s\w+\b', s))
[' is', ' cool']
--------------------------------------------------------------------------------------------------------
Which of the following classes is used to convert your own collection of text into a corpus?
PlaintextCorpusReader
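A runnable sketch of that class; the directory and file name are made up for illustration, using a throwaway temp directory:

```python
import os, tempfile
from nltk.corpus import PlaintextCorpusReader

# Build a tiny one-file corpus in a temporary directory.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, 'doc1.txt'), 'w') as f:
    f.write('Python is cool. NLTK makes NLP easy.')

# The second argument is a regex selecting which files belong to the corpus.
my_corpus = PlaintextCorpusReader(corpus_dir, r'.*\.txt')
print(my_corpus.fileids())              # ['doc1.txt']
print(my_corpus.words('doc1.txt')[:3])  # ['Python', 'is', 'cool']
```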
--------------------------------------------------------------------------------------------------------
What is the output of the following expression?
import nltk
wnl = nltk.WordNetLemmatizer()
print(wnl.lemmatize('women'))
woman
--------------------------------------------------------------------------------------------------------
Which of the following NLTK corpora represents a collection of around 10,000 news articles?
reuters
--------------------------------------------------------------------------------------------------------
How many times each unique word of text6 collection is repeated on an average?
X-6.5 times
--------------------------------------------------------------------------------------------------------
What is the frequency of bigram ('BLACK', 'KNIGHT') in text collection text6?
import nltk
from nltk.book import text6
bigrams = nltk.bigrams(text6)
filtered_bigrams = [ (w1, w2) for w1, w2 in bigrams if w1=='BLACK' and w2=='KNIGHT']
print(len(filtered_bigrams))
32
--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
pip3 install --user setuptools && pip3 install nltk
python3 -c "import nltk; nltk.download('book')"

import nltk
from nltk.book import text6
n = len(text6)
print(n)

u = len(set(text6))
print(u)

wc = n/u
print(wc)
ise_ending_words = [word for word in set(text6) if word.endswith('ise') ]
print(len(ise_ending_words))

contains_z = len([word for word in set(text6) if 'z' in word])


print(contains_z)

contains_pt = len([word for word in set(text6) if 'pt' in word])


print(contains_pt)

import re
# re.findall needs a string, so filter the token set with re.search instead:
title_words = [word for word in set(text6) if re.search(r'([A-Z][a-z]+)', word)]
print(len(title_words))

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk, re
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    n_words = len(gutenberg.words(fileid))
    n_unique_words = len(set(gutenberg.words(fileid)))
    word_coverage = n_words / n_unique_words
    print(word_coverage, fileid)

aus_words = len(gutenberg.words('austen-sense.txt'))
aus_words_alpha = len([word for word in gutenberg.words('austen-sense.txt') if word.isalpha()])
aus_words_gt4_z = len([word for word in gutenberg.words('austen-sense.txt') if word.isalpha() and len(word) > 4 and 'z' in word])
print(aus_words_gt4_z)

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
brown_cdf = nltk.ConditionalFreqDist([
    (genre, word.lower())
    for genre in brown.categories()
    for word in brown.words(categories=genre) ])

brown_cdf.tabulate(conditions=['news', 'religion', 'romance'], samples=['can', 'could', 'may', 'might', 'must', 'will'])

from nltk.corpus import inaugural
inaugural_cfd = nltk.ConditionalFreqDist(
    (target, fileid)
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))

print(inaugural_cfd.conditions())

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk
from urllib import request
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
html_content = request.urlopen(url).read()
soup = BeautifulSoup(html_content, 'html.parser')
n_links = len(soup.find_all('a'))
print(n_links)

# find (not find_all) returns a single table element we can search within
table = soup.find('table', attrs={'class':'wikitable'})

rows = [elm.text for elm in table.find_all('tr')]
print(rows[1:])

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
news_words = brown.words(categories='news')
lc_news_words = [w.lower() for w in news_words]
len_news_words = [len(w) for w in lc_news_words]
news_len_bigrams = list(nltk.bigrams(len_news_words))
#Compute the conditional frequency of news_len_bigrams, where condition and event refers to length of a words.
#Store the result in cfd_news
#Determine the frequency of 6-letter words appearing next to a 4-letter word
cfd_news = nltk.ConditionalFreqDist(news_len_bigrams)
cfd_news.tabulate(conditions=[6,4])
#############
lc_news_bigrams = nltk.ConditionalFreqDist(news_len_bigrams)

# Alternative: filter the (6, 4) pairs first, then count them
filtered_bigrams = [(w1, w2) for w1, w2 in news_len_bigrams if w1==6 and w2==4]
cfd_news = nltk.FreqDist(filtered_bigrams)
print(cfd_news[6,4])

# Equivalent, with the filter inside the generator expression
cfd_news = nltk.FreqDist((l1, l2) for (l1, l2) in news_len_bigrams if l1==6 and l2==4)
print(cfd_news[6,4])

--------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------
from nltk.corpus import brown
humor_words = brown.words(categories='humor')
lc_humor_words = [word.lower() for word in humor_words]
lc_humor_uniq_words = set(lc_humor_words)
from nltk.corpus import words
wordlist_words = words.words()
wordlist_uniq_words = set(wordlist_words)
print(len(lc_humor_uniq_words))
print(len(wordlist_uniq_words ))

--------------------------------------------------------------------------------------------------------

Import the text corpus brown.


Extract the list of tagged words from the corpus brown.
Store the result in brown_tagged_words
Generate trigrams of brown_tagged_words and store the result in brown_tagged_trigrams.
For every trigram of brown_tagged_trigrams, determine the tags associated with each word.

This results in a list of tuples, where each tuple contains the POS tags of 3 consecutive words occurring in the text.
Store the result in brown_trigram_pos_tags.
Determine the frequency distribution of brown_trigram_pos_tags and store the result in brown_trigram_pos_tags_freq.
Print the number of occurrences of trigram ('JJ','NN','IN')

--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
brown_tagged_words = [word for (word, tag) in nltk.corpus.brown.tagged_words()]
brown_tagged_trigrams = list(nltk.trigrams(brown_tagged_words))
brown_trigram_pos_tags = list()
for trigram in brown_tagged_trigrams:
    trigram_tagged = nltk.pos_tag(trigram)
    tags = [tag for (word, tag) in trigram_tagged]
    brown_trigram_pos_tags.append(tags)

brown_trigram_pos_tags_freq = nltk.FreqDist((t1,t2,t3) for (t1,t2,t3) in brown_trigram_pos_tags)

print(brown_trigram_pos_tags_freq['JJ','NN','IN'])

brown_trigram_pos_tags_freq = nltk.FreqDist((t1,t2,t3) for (t1,t2,t3) in brown_trigram_pos_tags if t1=='JJ' and t2=='NN' and t3=='IN')


--------------------------------------------------------------------------------------------------------
import nltk
from nltk.corpus import brown
brown_tagged_words = [word for (word, tag) in nltk.corpus.brown.tagged_words()]
brown_tagged_trigrams = list(nltk.trigrams(brown_tagged_words))
# Keep only the tags from each re-tagged trigram
brown_trigram_pos_tags = [ tuple(tag for (word, tag) in nltk.pos_tag(t)) for t in brown_tagged_trigrams ]
brown_trigram_pos_tags_freq = nltk.FreqDist((t1,t2,t3) for (t1,t2,t3) in brown_trigram_pos_tags if t1=='JJ' and t2=='NN' and t3=='IN')

#TASK2
import nltk
from nltk.corpus import brown
brown_tagged_words = nltk.corpus.brown.tagged_words()
brown_tagged_trigrams = list(nltk.trigrams(brown_tagged_words))
#[(('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'))]
brown_trigram_pos_tags = list()
for trigram in brown_tagged_trigrams:
    tags = [tag for (word, tag) in trigram]
    brown_trigram_pos_tags.append(tags)
#[['AT', 'NP-TL', 'NN-TL']]
brown_trigram_pos_tags_freq = nltk.FreqDist((t1,t2,t3) for (t1,t2,t3) in brown_trigram_pos_tags)
print(brown_trigram_pos_tags_freq['JJ','NN','IN'])
#TASK2
import nltk
from nltk.corpus import brown
brown_tagged_sents = nltk.corpus.brown.tagged_sents()
total_size = len(brown_tagged_sents)
train_size = int(total_size * 0.8)
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
tag_performance = unigram_tagger.evaluate(test_sents)
print(tag_performance)
python nlp using simple operation 1

def calculateWordCounts(text):
    # Write your code here
    print(len(text))
    print(len(set([w for w in text])))
    num_chars = len(text)
    num_words = len(text)
    num_sents = len(text)
    num_vocab = len(set([w for w in text]))
    print(int(num_words/num_vocab))
