
NATURAL LANGUAGE PROCESSING & APPLICATIONS


21EC4082

STUDENT ID: 2100040341 ACADEMIC YEAR: 2023-24


STUDENT NAME: P. Leena Sri

Table of Contents

1. Session 01: Introductory Session
2. Session 02: Tokenization_of_text #1
3. Session 03: Text_2_Sequences #2
4. Session 04: One_Hot_Encoding #3
5. Session 05: Vectorization_of_texts #4
6. Session 06: Databases_how_to_Use #5
7. Session 07: Parsing_nltk_toolbox #6
8. Session 08: TF_Testing_fail #7
9. Session 09: IDF_Why #8
10. Session 10: TFIDF_Vertorization #9
11. Session 11: TF_IDF_Failure_meaning #10
12. Session 12: Distance_Metrics #11
13. Session 13: Word_similarities_nltk #12
14. Session 14: Document_recognition_tfidf_vectors #13 (Adv/Peer)
15. Session 15: Zipf's_Law_nlp #14 (Adv/Peer)
16. Session 16: Simple_topic_modelling_ex #15 (Adv/Peer)
17. Session 17: PCA_From_SCratch #16 (Adv/Peer)
18. Session 18: Singular_Value_Decomposition_SVD_Ex #17 (Adv/Peer)
19. Session 19: Latent_Semantic_Analysis_SVD #18 (Adv/Peer)
20. Session 20: spam_dect_class #19 (Adv/Peer)
21. Session 21: Sentiment_Analysis_RNN #20 (Adv/Peer)

https://github.com/pvvkishore/NLP-A_LAB_2023 : Code for the entire lab sessions.


A.Y. 2023-24 LAB/SKILL CONTINUOUS EVALUATION

Columns: S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature

1. Introductory Session -NA-
2. Tokenization_of_text #1
3. Text_2_Sequences #2
4. One_Hot_Encoding #3
5. Vectorization_of_texts #4
6. Databases_how_to_Use #5
7. Parsing_nltk_toolbox #6
8. TF_Testing_fail #7
9. IDF_Why #8
10. TFIDF_Vertorization #9
11. TF_IDF_Failure_meaning #10
12. Distance_Metrics #11
13. Word_similarities_nltk #12
14. Document_recognition_tfidf_vectors #13 (Adv/Peer)
15. Zipf's_Law_nlp #14 (Adv/Peer)
16. Simple_topic_modelling_ex #15 (Adv/Peer)
17. PCA_From_SCratch #16 (Adv/Peer)
18. Singular_Value_Decomposition_SVD_Ex #17 (Adv/Peer)
19. Latent_Semantic_Analysis_SVD #18 (Adv/Peer)
20. spam_dect_class #19 (Adv/Peer)
21. Sentiment_Analysis_RNN #20 (Adv/Peer)
Experiment # <TO BE FILLED BY STUDENT> Student ID 2100040341
Date <TO BE FILLED BY STUDENT> Student Name P. Leena Sri

Experiment Title: Tokenization_of_text

Aim/Objective:

The aim is to compare and evaluate different tokenization techniques or libraries, such as NLTK,
SpaCy, and TensorFlow, to determine their effectiveness in handling various types of text data.

Description:

Tokenization is the first step in most NLP pipelines. This experiment explores how tokenization
with NLTK, spaCy, and TensorFlow can be integrated into a broader NLP pipeline or used as a
preprocessing step for tasks such as sentiment analysis, machine translation, named entity
recognition, and text summarization. The focus is on understanding how tokenization choices
affect downstream model performance and on analyzing the performance characteristics of the
NLTK and TensorFlow tokenizers.
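
As a minimal illustration of what a tokenizer does, the sketch below splits text into sentences and words using only regular expressions from the Python standard library. It is not the NLTK, spaCy, or TensorFlow implementation: the helper names simple_sentence_tokenize and simple_word_tokenize are hypothetical and introduced only for this example, and the library tokenizers handle many cases (abbreviations, URLs, emoji) that these patterns do not.

```python
import re

def simple_sentence_tokenize(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+(?:'\w+)? keeps contractions such as "Don't" together;
    # [^\w\s] emits each punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Tokenization is the first step. Don't skip it!"
sentences = simple_sentence_tokenize(text)
tokens = [simple_word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Comparing the output of such a naive tokenizer with the library results is one simple way to see which extra cases each library covers.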

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions which
help the student to understand the Program/Experiment that must be performed in the Laboratory
Session.

1. What is tokenization in the context of NLP?

2. How can you tokenize a sentence into individual words using NLTK?

3. What is the purpose of tokenizing text in NLP?

4. Name a few tokenization techniques other than word tokenization.

5. How can you tokenize a text document into sentences using NLTK?

Course Title: NATURAL LANGUAGE PROCESSING & APPLICATIONS    ACADEMIC YEAR: 2023-24
Course Code(s): 21EC4082, 21EC4082A, 21EC4082P    Page 1 of 174

In-Lab:

1. Apply the tokenization methods of the NLTK library to a 5-line text sample available in NLTK.
2. Apply the tokenization methods of the TensorFlow (TF) library to a 5-line text sample available in NLTK.
3. Draw comparisons based on text handling capabilities.
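
One way to organise the comparison in step 3 is to run every tokenizer on the same text and tabulate simple statistics. The sketch below uses two stand-in tokenizers written with the standard library (whitespace splitting and a regular expression); in the lab these callables would be replaced by the NLTK and TensorFlow tokenizers. The helper name compare_tokenizers and the chosen statistics are assumptions made for illustration.

```python
import re

def whitespace_tokenize(text):
    """Baseline: split on runs of whitespace; punctuation stays attached."""
    return text.split()

def regex_tokenize(text):
    """Separate punctuation from words and keep contractions whole."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def compare_tokenizers(text, tokenizers):
    """Return {name: (token_count, unique_token_count)} for each tokenizer."""
    report = {}
    for name, tokenize in tokenizers.items():
        tokens = tokenize(text)
        report[name] = (len(tokens), len(set(tokens)))
    return report

sample = "Hello, world! Hello again."
report = compare_tokenizers(
    sample,
    {"whitespace": whitespace_tokenize, "regex": regex_tokenize},
)
for name, (n_tokens, n_unique) in report.items():
    print(f"{name:<12} tokens={n_tokens} unique={n_unique}")
```

Here the whitespace tokenizer treats "Hello," and "Hello" as different tokens, while the regular-expression tokenizer merges them; differences of this kind are what the comparison is meant to surface.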

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What is tokenization?
2. According to your experiment, which tokenizer API is the best?
3. How do NLTK and TensorFlow handle tokenization for different languages?
4. List the metrics used to evaluate tokenization techniques.
5. Can you tokenize multiple text documents simultaneously using TensorFlow?


Post-Lab:

1. Try tokenization with the spaCy library and compare it with NLTK and TensorFlow.
2. Try tokenization on the large corpus dataset given below:
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # <TO BE FILLED BY STUDENT> Student ID 2100040341
Date <TO BE FILLED BY STUDENT> Student Name P. Leena Sri

Experiment Title: Text_2_Sequences

Aim/Objective:

The aim is to evaluate different techniques or libraries, such as NLTK, SpaCy, and TensorFlow, to
determine their effectiveness in converting text to a sequence of numbers.

Description:

Converting text to a sequence of numbers is a fundamental step in natural language
processing (NLP) tasks. Its primary goal is to represent textual data in a numerical
format that machine learning models can process effectively, which in turn enables the
application of machine learning and NLP techniques.
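
The idea can be sketched in plain Python: build a vocabulary that maps each word to an integer ID, then replace every word by its ID. The helper names build_vocabulary and texts_to_sequences are hypothetical. IDs are assigned here in order of first appearance, which is a simplifying assumption; library tokenizers may order the vocabulary differently (for example by word frequency).

```python
def build_vocabulary(sentences):
    """Map each distinct lowercase word to an integer ID, starting at 1."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                # ID 0 is left unused so it can serve as a padding value later.
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary

def texts_to_sequences(sentences, vocabulary):
    """Replace every known word by its ID; unknown words are skipped."""
    return [
        [vocabulary[w] for w in sentence.lower().split() if w in vocabulary]
        for sentence in sentences
    ]

sentences = ["I love NLP", "I love deep learning"]
vocabulary = build_vocabulary(sentences)
sequences = texts_to_sequences(sentences, vocabulary)
print(vocabulary)
print(sequences)
```

Note that the two resulting sequences have different lengths, which is why a padding step is usually needed before model training.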

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions which
help the student to understand the Program/Experiment that must be performed in the Laboratory
Session.

1. Why convert text to numbers?

2. How effective is the method used by you?

3. Are all sentences in the text considered to have the same length? If not, what did you do?

4. In NLTK, which function is used to assign numeric IDs to tokens?

5. What is the difference between word tokenization and sentence tokenization?


In-Lab:

1. Apply tokenization and convert a sequence of sentences from the NLTK library to a sequence of
numbers.
2. Convert a 10-sentence dataset with sentences of different lengths into a numeric array of equal
size for ML model training.
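
Step 2 amounts to padding (or truncating) every integer sequence to a common length. Keras provides a pad_sequences utility for this purpose; the hand-written sketch below shows the same idea in plain Python. The function name pad_sequences_to_length is hypothetical, and padding on the right with 0 is a choice made for this example.

```python
def pad_sequences_to_length(sequences, max_len=None, pad_value=0):
    """Right-pad (or truncate) each sequence so that all have the same length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = seq[:max_len]
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded

sequences = [[1, 2, 3], [1, 2, 4, 5], [6]]
padded = pad_sequences_to_length(sequences)
truncated = pad_sequences_to_length(sequences, max_len=2)
print(padded)
print(truncated)
```

Padding to the longest sentence keeps all information, while a smaller max_len trades information for a more compact array.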

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does NLTK's FreqDist class provide?
2. According to your experiment, which API is the best?
3. Do you think your sequence conversion is suitable for GPT?
4. List the metrics used to evaluate sequence conversion techniques.
5. Can you convert text to sequences using spaCy?


Post-Lab:

1. Try normalization of the numbers converted from text data.
2. Try text-to-sequence conversion on the large corpus dataset given below:
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
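
One possible reading of "normalization" in task 1 is to scale the integer token IDs into the range [0, 1]. The sketch below divides each ID by the vocabulary size; the function name normalize_sequences and this particular scaling are assumptions, not a prescribed method. Scaled IDs still impose an ordering on words that has no linguistic meaning, which is worth discussing in the analysis.

```python
def normalize_sequences(sequences, vocab_size):
    """Scale integer token IDs into [0, 1] by dividing by the vocabulary size."""
    return [[token_id / vocab_size for token_id in seq] for seq in sequences]

# 0 is assumed to be the padding ID, so it stays 0.0 after scaling.
sequences = [[1, 2, 4], [3, 0, 0]]
normalized = normalize_sequences(sequences, vocab_size=4)
print(normalized)
```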
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # <TO BE FILLED BY STUDENT> Student ID 2100040341
Date <TO BE FILLED BY STUDENT> Student Name P. Leena Sri

Experiment Title: One Hot Encoding of Text

Aim/Objective:

The aim is to convert the text into numbers and eventually code those converted numbers into
encodings for downstream NLP tasks using NLTK, SpaCy, and TensorFlow.

Description:

One hot encoding of text data is a process of transforming categorical data, such as words or symbols,
into numerical data that can be used by machine learning models. It involves creating a binary vector
for each categorical value, where only one element is 1 and the rest are 0. The length of the vector is
equal to the number of unique categories in the data. One hot encoding allows the representation of
categorical data as multidimensional binary vectors that can be fed to models that require numerical
input.
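
The definition above can be sketched in a few lines of plain Python. The helper name one_hot_encode is hypothetical, and sorting the vocabulary is only a convenience to make the column order reproducible.

```python
def one_hot_encode(tokens):
    """Return (vocabulary, vectors) with one binary vector per token."""
    vocabulary = sorted(set(tokens))  # fixed, reproducible column order
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in tokens:
        vector = [0] * len(vocabulary)  # length = number of unique categories
        vector[index[token]] = 1        # exactly one element is 1
        vectors.append(vector)
    return vocabulary, vectors

tokens = "the cat sat on the mat".split()
vocabulary, vectors = one_hot_encode(tokens)
print(vocabulary)
for token, vector in zip(tokens, vectors):
    print(f"{token:<4} {vector}")
```

Both occurrences of "the" receive the same vector, and the vector length grows with the vocabulary, which is the source of the sparsity discussed in the viva questions.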

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions which
help the student to understand the Program/Experiment that must be performed in the Laboratory
Session.

1. Why convert text to encoded ones and zeros?

2. How effective is the method used by you?

3. Are all sentences in the text considered to have the same length? If not, what did you do?

4. What is the role of the pandas function get_dummies in one hot encoding?

5. Which according to you is a good NLP practice: OHE words or sentences?


In-Lab:

1. Convert a sequence of sentences from the NLTK library to a sequence of numbers and then
apply one hot encoding (OHE).
2. Convert a 10-sentence dataset with sentences of different lengths into an OHE array of equal
size for ML model training.
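
A possible way to obtain an equally sized OHE array is to pad the integer sequences first and then one-hot encode every position. The sketch below does this with nested lists and uses two short sentences instead of ten. The function name sentences_to_one_hot is hypothetical, and reserving index 0 as a dedicated padding class is an assumption; representing padding as an all-zero vector would be another valid choice.

```python
def sentences_to_one_hot(sentences):
    """Encode sentences as nested lists of shape
    (num_sentences, max_len, vocab_size + 1); index 0 is the padding class."""
    tokenized = [s.lower().split() for s in sentences]
    words = sorted({w for tokens in tokenized for w in tokens})
    vocabulary = {w: i + 1 for i, w in enumerate(words)}
    max_len = max(len(tokens) for tokens in tokenized)
    width = len(vocabulary) + 1
    encoded = []
    for tokens in tokenized:
        # Pad the ID sequence on the right with 0 up to max_len.
        ids = [vocabulary[w] for w in tokens] + [0] * (max_len - len(tokens))
        encoded.append([[1 if j == i else 0 for j in range(width)] for i in ids])
    return vocabulary, encoded

sentences = ["NLP is fun", "I like NLP a lot"]
vocabulary, encoded = sentences_to_one_hot(sentences)
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)
```

Every sentence now has the same number of rows, so the result can be converted to a rectangular array for model training.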

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What is one hot encoding and why is it used?


2. What are the advantages and disadvantages of one hot encoding?
3. How can you implement one hot encoding in Python using pandas or scikit-learn?
4. What are some alternatives to one hot encoding for categorical data?
5. How does one hot encoding affect the dimensionality and sparsity of the data?


Post-Lab:

1. Try using OHE data for training a simple neural network model.
2. Try text-to-OHE conversion on the large corpus dataset given below and train an ANN model:
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # <TO BE FILLED BY STUDENT> Student ID 2100040341
Date <TO BE FILLED BY STUDENT> Student Name P. Leena Sri

Experiment Title: Vectorization_of_texts

Aim/Objective:

The aim is to convert text into vectors by computing term frequencies and create a corpus.

Description:

The objective is to convert text to a sequence of numbers using a term frequency (TF) vectorizer
function. The primary goal of this conversion is to represent textual data in a numerical format
that machine learning models can process effectively, enabling the application of machine
learning and NLP techniques.
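
A term frequency vector can be sketched with the standard library as follows. This uses the relative-frequency convention, tf(t, d) = (count of t in d) / (number of terms in d); raw counts are another common convention. The helper names term_frequencies and tf_vectors are hypothetical, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

def term_frequencies(document):
    """Relative term frequency: count of a term / total terms in the document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def tf_vectors(documents):
    """Build one TF vector per document over a shared, sorted vocabulary."""
    per_doc = [term_frequencies(doc) for doc in documents]
    vocabulary = sorted({term for tf in per_doc for term in tf})
    vectors = [[tf.get(term, 0.0) for term in vocabulary] for tf in per_doc]
    return vocabulary, vectors

documents = ["the cat sat on the mat", "the dog barked"]
vocabulary, vectors = tf_vectors(documents)
print(vocabulary)
for vector in vectors:
    print([round(value, 3) for value in vector])
```

Each document becomes a single vector whose length equals the vocabulary size, regardless of how many words the document contains.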

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions which
help the student to understand the Program/Experiment that must be performed in the Laboratory
Session.

1. Why is TF better than OHE?

2. How effective is the method used by you?

3. What is the mathematical formulation to compute TF?

4. Is TF a good representation for text transformation?

5. What difference did you find between OHE and TF?


In-Lab:

1. Apply tokenization and convert a sequence of sentences in the NLTK library to a sequence of
numbers. Use those sequences and calculate term frequencies for representing text data on
a small corpus.
2. Convert a 10-sentence dataset with sentences of different lengths into TF representations and
compare them with OHE.
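
The comparison in step 2 can be illustrated on two toy documents. The sketch below takes the document-level OHE to be a multi-hot vector (1 if the word occurs in the document, 0 otherwise), which is an assumption about how the comparison is set up; the helper names binary_vector and tf_vector are hypothetical.

```python
from collections import Counter

def binary_vector(tokens, vocabulary):
    """Multi-hot vector: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def tf_vector(tokens, vocabulary):
    """Relative term frequency of every vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) for term in vocabulary]

doc_a = "good good good movie".split()
doc_b = "good movie movie movie".split()
vocabulary = sorted(set(doc_a) | set(doc_b))

binary_a, binary_b = binary_vector(doc_a, vocabulary), binary_vector(doc_b, vocabulary)
tf_a, tf_b = tf_vector(doc_a, vocabulary), tf_vector(doc_b, vocabulary)
print(binary_a, binary_b)
print(tf_a, tf_b)
```

The two documents are indistinguishable in the binary representation but differ in the TF representation, because TF retains how often each word occurs.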
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does TF stand for?
2. According to your experiment, which text encoding is the best?
3. Do you think your sequence conversion is suitable for GPT?
4. List the metrics used to evaluate sequence conversion techniques.
5. Can you convert text to TF vectors using spaCy?


Post-Lab:

1. Try an ANN model on the text transformed using TF.
2. Try TF conversion on the large corpus dataset given below and apply ANN training:
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment # <TO BE FILLED BY STUDENT> Student ID 2100040341
Date <TO BE FILLED BY STUDENT> Student Name P. Leena Sri

Experiment Title: Text Datasets_how_to_use

Aim/Objective:

The aim is to use the online resources of text data to test NLP applications.

Description:

A text corpus is a large and structured collection of texts, typically stored in a digital format, that
serves as a linguistic resource for language analysis and research. It consists of a diverse range of
written or spoken texts from various sources and domains, such as books, articles, newspapers,
websites, social media, conversations, and more.
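
As a small illustration of how such a collection can be turned into a usable corpus, the sketch below cleans a few in-memory documents by lowercasing, removing punctuation, and dropping stopwords. The document names, the sample texts, the helper names clean_document and build_corpus, and the tiny stopword list are all made up for this example; real stopword lists, such as those shipped with NLTK or spaCy, are much longer.

```python
import string

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}

def clean_document(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]

def build_corpus(raw_documents):
    """Map each document name to its list of cleaned tokens."""
    return {name: clean_document(text) for name, text in raw_documents.items()}

raw_documents = {
    "news.txt": "The market is up, and investors are happy.",
    "tweet.txt": "Loving the new phone!",
}
corpus = build_corpus(raw_documents)
for name, tokens in corpus.items():
    print(name, tokens)
```

In practice the raw texts would be read from files or downloaded datasets rather than typed inline, but the cleaning steps stay the same.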

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. How can I create a text corpus from a collection of documents using Python?

2. What Python libraries can I use to tokenize and preprocess text data for corpus creation?

3. How can I handle different file formats (e.g., PDF, Word documents) when building a text
corpus in Python?

4. What are the steps involved in cleaning and preprocessing text data for corpus creation?

5. How can I remove stopwords and punctuation from text documents when creating a corpus
in Python?


In-Lab:

1. From the NLTK library, download and use the built-in wordnet corpus package. Extract the
requirements of a text dataset and tokenize the text.
2. From spaCy, load en_core_web_sm (the small English pipeline) and tokenize this text.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What are the different types of text datasets available in NLTK?


2. Can you give an example of a text dataset available in NLTK?
3. How can you access and explore the content of a text dataset in NLTK?
4. Can you explain the concept of text datasets in spaCy?
5. Can spaCy handle multi-language text datasets? If yes, name two supported languages.

Post-Lab:

1. Try to encode the wordnet text as TF vectors and as one-hot encodings (OHE). Measure the
memory occupied by each representation.
2. Try to find some text datasets available online and load into your current program.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: Parsing_nltk_toolbox

Aim/Objective:

The aim is to analyze the grammatical structure of sentences in natural language text data using
NLTK and spaCy.

Description:

To perform parsing in NLTK, you typically start by defining a context-free grammar (CFG), then apply a
parsing algorithm to a tokenized sentence to obtain a parse tree or a dependency tree. NLTK provides
classes to assist in these tasks, such as nltk.CFG for defining a CFG, nltk.ChartParser for chart
parsing, and nltk.ProjectiveDependencyParser for dependency parsing. By utilizing NLTK's parsing
capabilities, you can analyze sentence structure, extract syntactic information, and facilitate
further natural language understanding and processing tasks.
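
The workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed solution: the toy grammar, its vocabulary, and the example sentence are assumptions made up for this sketch, and it only covers chart parsing with a CFG.

```python
import nltk

# Toy grammar: the rules and the vocabulary are illustrative assumptions.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
tokens = "the dog chased the cat".split()

# parse() yields every tree that the grammar licenses for the token list.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

Every token must be covered by the grammar; a word that does not appear in any rule makes the parser raise an error instead of returning a tree, which is one practical limit of hand-written grammars.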

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following links:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. What is parsing in natural language processing (NLP), and what is its goal?

2. Explain the concept of Context-Free Grammars (CFG) and their role in parsing.

3. What are the two main parsing strategies supported by NLTK?

4. How does recursive descent parsing work in NLTK?

5. How can you define a CFG in NLTK?

In-Lab:

1. Analyze the grammatical structure of sentences and extract syntactic information from a small
text corpus to assess the performance of the parsers available in NLTK.
2. Show that the parsers in the spaCy and NLTK libraries have the capability to extract
semantic information.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What is NLTK's parser called?
2. According to your experiment, which parsing model is the best?
3. Do you think parsing is necessary in other languages too?
4. List the metrics used to evaluate parsing techniques.
5. How is data stored after parsing?

Post-Lab:

1. Try parsing with a context-free grammar on the wordnet text data in NLTK.
2. Try parsing on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: TF_Testing_fail

Aim/Objective:

The aim is to evaluate term frequency (TF) on large text corpora and note its breaking point.

Description:

The formula for calculating term frequency (TF) is:

TF = (Number of occurrences of the term in the document) / (Total number of terms in the
document)

The TF value reflects the relative importance or prevalence of a term within a specific document. It
helps to identify which terms are more frequently used and potentially carry more significance or
relevance in the context of that document. However, term frequency alone does not consider the
significance of the term in the overall corpus.
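
The TF formula above can be sketched in plain Python. This is a minimal illustration that uses only the standard library; the whitespace tokenization and the sample sentence are assumptions made for the sketch, and a real experiment would use an NLTK or spaCy tokenizer instead.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: occurrences / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Naive whitespace tokenization, assumed here only to keep the sketch short.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)

# "the" occurs 2 times out of 6 tokens, so it gets the highest TF in this
# document even though it carries little meaning.
for term, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(f"{term}: {value:.3f}")
```

The ranking illustrates the limitation stated above: a function word such as "the" scores highest because TF looks at a single document and knows nothing about how common the term is across the corpus.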

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following links:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why use TF when OHE is simpler?

2. Hypothesize: in your view, why does TF fail?

3. How can you calculate the term frequency of a specific term in a document using NLTK?

4. How can you count the occurrences of a specific term in a list of tokens using NLTK?

5. How do you normalize the term frequency to account for the document length in NLTK?

In-Lab:

1. Compute TF vectors on a large corpus using the NLTK library and identify the reason why they
cannot capture the semantic information in the text data.
2. Investigate the above process in depth on a text corpus of your choice to arrive at the solution
in a faster way.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. Why does TF fail to capture semantics?
2. According to your experiment, is there an alternative to TF?
3. Does TF reduce dimensionality or increase it?
4. List the metrics used to evaluate the performance of TF representations.
5. Can you perform the above conversion using spaCy?

Post-Lab:

1. Try to compare the dimensionality of TF and OHE. Show which is better through a program.
2. Try to explain how TF fails on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: IDF_Why

Aim/Objective:

The aim is to evaluate the importance of IDF as an alternative to TF. IDF is used in information
retrieval to quantify the importance or rarity of a term in a collection of documents.

Description:

The IDF of a term is calculated as the logarithm of the ratio between the total number of documents
in the collection and the number of documents that contain the term. The formula for IDF is as follows:

IDF = log(N / DF)

where N is the total number of documents in the collection and DF is the number of documents that
contain the term. The IDF value increases as the term becomes less frequent in the document
collection. It helps to identify terms that are relatively rare and potentially carry more important or
distinctive information. Terms with higher IDF scores are considered to have more discriminative
power.
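
The formula IDF = log(N / DF) can be sketched in plain Python as follows. This is a minimal illustration using only the standard library; the three toy documents and the whitespace tokenization are assumptions made for the sketch, and the natural logarithm is used because the text does not fix a base.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / DF)} for a list of tokenized documents."""
    n_docs = len(documents)
    document_frequency = {}
    for tokens in documents:
        # set() makes each document count at most once per term.
        for term in set(tokens):
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return {
        term: math.log(n_docs / df)
        for term, df in document_frequency.items()
    }


# Toy corpus, assumed only for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
idf = inverse_document_frequency(docs)

for term in ("the", "cat", "dog"):
    print(f"{term}: {idf[term]:.3f}")
```

A term that appears in every document, such as "the" here, gets an IDF of log(3 / 3) = 0, while a term that appears in a single document gets the largest value, log(3).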

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following links:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why is IDF more important than TF?

2. How effective is IDF in discriminating text?

3. How can you calculate IDF for a specific term using Python and a given collection of
documents?

4. In NLTK, which function is used to perform IDF calculations?

5. Can you handle the presence of stop words during IDF calculations in NLTK?

In-Lab:

1. Compute IDF on small text data and show how IDF is better than TF in the context of text
discrimination in documents of a corpus.
2. Use wordnet dataset in NLTK and show how IDF beats TF as a text discriminator.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. Which is better, TF or IDF?
2. According to your experiment, what is the intuition behind IDF and its significance in information
retrieval and text mining?
3. Do you think IDF calculated for a term in a collection of documents is suitable for GPT?
4. List the metrics used to evaluate sequence conversion techniques.
5. Can you handle terms with zero IDF?

Post-Lab:

1. Try to compute IDF and TF together using spaCy.


2. Try IDF on big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: TFIDF_Vectorization

Aim/Objective:

The aim is to transform a collection of documents into a numerical representation with TF-IDF
vectors.

Description:

The TF-IDF value is computed by multiplying the term frequency (TF) of a term in a document by the
inverse document frequency (IDF) of the term. Each document is represented as a vector, where each
dimension corresponds to a unique term in the collection. The TF-IDF value for each term in the
document becomes the value of the corresponding dimension in the vector.
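
The vector construction described above can be sketched in plain Python. This is a minimal illustration using only the standard library; it assumes TF = count / document length and IDF = log(N / DF) as defined in the earlier experiments, non-empty documents, and a toy corpus made up for the sketch. Library implementations often add smoothing and normalization, so their numbers will generally differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenized document."""
    n_docs = len(documents)
    # One dimension per unique term in the collection, in sorted order.
    vocabulary = sorted({term for tokens in documents for term in tokens})
    df = Counter(term for tokens in documents for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}

    vectors = []
    for tokens in documents:  # assumes every document has at least one token
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, vectors


# Toy corpus, assumed only for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
vocabulary, vectors = tfidf_vectors(docs)

print(vocabulary)
for vector in vectors:
    print([round(value, 3) for value in vector])
```

Each vector has as many entries as there are unique terms in the collection, and most entries are zero because a document uses only a small part of the vocabulary.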

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following links:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

This Section must contain at least 5 Descriptive type questions or Self-Assessment Questions which
help the student to understand the Program/Experiment that must be performed in the Laboratory
Session.

1. Why does TF-IDF work for text vectorization?

2. How effective is the method used by you?

3. Why is the logarithm used in IDF calculations?

4. Which function in NLTK is used to extract TF-IDF vectors?

5. Can spaCy extract TF-IDF vectors?

In-Lab:

1. Apply the TF-IDF vectorization model in NLTK on a small set of text data and show that the
representation is better than TF and IDF alone.
2. Convert the NLTK wordnet corpus into a TF-IDF data representation.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. Can you calculate TF-IDF values for terms in a collection of documents?
2. Explain, in your own words, the concept of term weighting and its role in TF-IDF calculations.
3. Which Python libraries or modules can be used to perform TF-IDF calculations?
4. List the metrics used to evaluate TF-IDF vectors.
5. Can you handle stop words or common words when computing TF-IDF using Python?

Post-Lab:

1. Try spaCy library to convert text to TF-IDF.


2. Try TF-IDF conversion on big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: TF_IDF_Failure_meaning

Aim/Objective:

The aim is to evaluate the reason for failure of TF-IDF vectors and estimate the meaning of failure
to represent the text data in terms of semantics, context extraction and corpus size.

Description:

TF-IDF does not capture the semantic meaning of words or the context in which they are used. It treats
each term independently, without considering their relationships within the document or across the
collection. This can lead to issues when dealing with tasks that require a deeper understanding of
language, such as sentiment analysis or question-answering.

TF-IDF is influenced by document length, as longer documents generally have higher raw term counts.
Unless the vectors are normalized, this bias can result in longer documents dominating the similarity
or importance measures, overshadowing shorter and potentially relevant documents.

TF-IDF treats documents as bags of words, disregarding the order and context in which the words
appear. This can be problematic in tasks like text generation or language translation, where
word order and context play a crucial role.
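
The bag-of-words limitation can be seen in a small sketch. This is a minimal illustration using only the standard library; the two sentences are assumptions chosen so that they share the same words in a different order. Only the TF part is computed, because IDF scales each term by the same factor in both documents and therefore cannot separate them either.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return relative term frequencies in vocabulary order; word order is discarded."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) for term in vocabulary]


# Two sentences with opposite meanings but identical words (illustrative assumption).
a = "dog bites man".split()
b = "man bites dog".split()
vocabulary = sorted(set(a) | set(b))

vec_a = bag_of_words(a, vocabulary)
vec_b = bag_of_words(b, vocabulary)

print(vocabulary)
print(vec_a == vec_b)  # True: both orderings map to the same vector
```

Because the two vectors are identical, any similarity measure computed on them treats the sentences as the same document, which is the failure in context and word order that this experiment asks you to investigate.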

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following links:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why does TF-IDF fail?

2. How effective is TF-IDF vectorization on long documents?

3. Do TF-IDF vectors preserve the context between words?

4. In NLTK, which function is used to assign TF-IDF vectors to text?

5. What is the length of TF-IDF vector?

In-Lab:

1. Apply TF-IDF and evaluate the vectors to check their failure related to context and semantic
representation.
2. Show the reason for failure of TF-IDF on large datasets such as wordnet in nltk.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does NLTK's TF-IDF vectorizer do?
2. According to your experiment, which API is the best for TF-IDF?
3. Do you think your TF-IDF is suitable for GPT?
4. List the metrics used to evaluate TF-IDF conversion techniques.
5. Can you convert large text to TF-IDF using spaCy?

Post-Lab:

1. Normalize the TF-IDF vectors obtained from the text data and check whether they still fail.
2. Try TF-IDF on the large corpus dataset given below and use an ANN for classification.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.


Experiment Title: Distance Metrics

Aim/Objective:

The aim is to evaluate different distance-based techniques and their effectiveness on encoded text
representations such as TF-IDF vectors.

Description:

Cosine Distance: Cosine similarity measures the cosine of the angle between two vectors in a
high-dimensional space, and cosine distance is 1 minus that similarity. In NLP, it is commonly used to
compare documents represented as TF-IDF vectors or word embeddings.

Euclidean Distance: It measures the straight-line distance between two points in Euclidean space. It is
often employed to quantify the dissimilarity between word embeddings or document vectors.
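
The two metrics described above can be sketched in plain Python. This is a minimal illustration, not the prescribed lab solution: the three vectors are hypothetical term-count vectors, cosine distance is taken as 1 minus cosine similarity (a common convention, assumed here), and returning a similarity of 0 for an all-zero vector is a choice made only for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, taken here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical term-count vectors: doc_b is doc_a repeated three times,
# doc_c shares no terms with doc_a.
doc_a = [1.0, 2.0, 0.0]
doc_b = [3.0, 6.0, 0.0]
doc_c = [0.0, 0.0, 5.0]

print("cosine a-b:   ", cosine_distance(doc_a, doc_b))
print("euclidean a-b:", euclidean_distance(doc_a, doc_b))
print("cosine a-c:   ", cosine_distance(doc_a, doc_c))
```

Because doc_b points in the same direction as doc_a, their cosine distance is close to zero even though their Euclidean distance is not. This is one reason cosine distance is often preferred when documents differ in length.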

Pre-Requisites:

Install Python 3.6 or above, then install the required packages using the guides below.

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why are distance metrics important in NLP?

2. How effective is Euclidean distance compared with cosine distance?

3. Write the formula for cosine similarity.

4. What does a cosine distance of ‘0’ between two words indicate?

5. What is Jaccard Distance?


In-Lab:

1. Apply different distance metrics to TF-IDF vectorized text data and verify which one captures the
closeness between the words most effectively.
2. Use a large corpus and find the effectiveness of the various distance metrics defined in the
objective.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does a distance measure tell us about text data?
2. According to your experiment, which distance metric is the best?
3. Which distance metric do you think is suitable for GPT?
4. List the distance metrics used for text classification.
5. Can you compute them using spaCy?


Post-Lab:

1. Compare the Jaccard distance metric with the two previously used metrics.
2. Compare all three distance metrics on the large corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.


Experiment Title: Word_similarities_nltk

Aim/Objective:

The aim is to extract a piece of text by calculating similarities between the words using NLTK.

Description:

Word similarities are measurements that quantify the degree of relatedness or similarity between
words based on their meanings, contexts, or semantic properties. Cosine Similarity measures the
cosine of the angle between two vectors, which indicates the similarity between their directions in a
high-dimensional space. It is commonly used with word embeddings or TF-IDF representations to
compute word similarities.
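
As a rough sketch of the idea above, the snippet below builds a simple context-count vector for each word and compares words with cosine similarity. It uses only the Python standard library rather than NLTK, and the toy corpus, the window size, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; any tokenised sentences would do.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "students write the exam",
]

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = context_vectors(corpus)
sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_exam = cosine_similarity(vectors["cat"], vectors["exam"])
print(f"cat vs dog:  {sim_cat_dog:.3f}")
print(f"cat vs exam: {sim_cat_exam:.3f}")
```

Words that appear in similar contexts ("cat" and "dog") score higher than unrelated words. The frequent word "the" still inflates every score, which is one motivation for stop-word removal or TF-IDF weighting.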

Pre-Requisites:

Install Python 3.6 or above, then install the required packages using the guides below.

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. What is word similarity, and why is it important in natural language processing (NLP)?

2. Explain the concept of word embeddings and how they are used to measure word similarity.

3. What is cosine similarity, and how is it applied to compute word similarity using word
embeddings?

4. Describe the distributional hypothesis and how it relates to measuring word similarity.

5. How does WordNet contribute to measuring word similarity, and what are some common
similarity metrics used in WordNet?


In-Lab:

1. Extract a small piece of text from a tiny corpus using the NLTK word similarities model.
2. Use the WordNet dataset and apply word similarities to extract the words closest to at least 10 words.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)

• Data and Results:



This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:



This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. Can you calculate word similarity using NLTK?
2. How is WordNet in NLTK used for measuring word similarity?
3. What are synsets in NLTK's WordNet, and how are they used for computing word similarity?
4. Describe the metrics available in NLTK's WordNet for measuring word similarity.
5. Can you compute the similarity between two words without the corpus?


Post-Lab:

1. Design an experiment to compare different word similarity metrics provided by NLTK's
WordNet, such as path similarity, Wu-Palmer similarity, or Leacock-Chodorow similarity.
2. Investigate the impact of context on word similarity using NLTK on the large corpus dataset
given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.


Experiment Title: Document_recognition_tfidf_vectors

Aim/Objective:

The aim is to analyse text using TF-IDF vectors and apply them to information retrieval.

Description:

Clean and preprocess the text data by removing any unnecessary elements like punctuation, stop
words, and special characters. Apply techniques such as tokenization, stemming, or lemmatization to
reduce words to their base form. Represent each document as a vector using the computed TF-IDF
values. Each dimension of the vector corresponds to a term in the corpus, and the value represents
the TF-IDF weight of the term in the document. This vectorization process creates TF-IDF vectors for
all documents in the corpus. Identify important terms or phrases in a document using the TF-IDF
vectors. High TF-IDF values indicate terms that are specific to a document and may carry important
information. Extracting these terms can help in tasks such as keyword extraction or named entity
recognition.
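
The pipeline described above (preprocess, vectorize, pick high-weight terms) can be sketched from scratch as follows. This is an illustrative sketch under stated assumptions: the five documents and the stop-word list are hypothetical, stemming and lemmatization are omitted, and the weight is the plain tf × log(N / df) form. Libraries may use smoothed or normalized variants, so their numbers can differ.

```python
import math
import re
from collections import Counter

# Hypothetical five-document corpus for illustration.
documents = [
    "The exam portal opens on Monday.",
    "Students register for the exam on the portal.",
    "The library closes early on Friday.",
    "Hall tickets are issued through the portal.",
    "The cafeteria serves lunch on Friday.",
]
STOP_WORDS = {"the", "on", "for", "are", "through"}  # tiny illustrative list

def preprocess(text):
    """Lower-case, strip punctuation, tokenise, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors), one vector per document."""
    tokenised = [preprocess(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenised for t in tokens})
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for tokens in tokenised for t in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

def top_terms(vocabulary, vector, k=3):
    """Terms with the k highest non-zero TF-IDF weights in one document."""
    ranked = sorted(zip(vocabulary, vector), key=lambda pair: pair[1], reverse=True)
    return [term for term, weight in ranked[:k] if weight > 0]

vocabulary, vectors = tfidf_vectors(documents)
for doc, vec in zip(documents, vectors):
    print(top_terms(vocabulary, vec), "<-", doc)
```

Terms that occur in only one document (such as "monday") rank above terms shared by several documents (such as "portal"), which is the behaviour the description attributes to high TF-IDF values.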

Pre-Requisites:

Install Python 3.6 or above, then install the required packages using the guides below.

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. How to apply TF-IDF on text data?

2. How effective is the method used by you?

3. What are the advantages of using TF-IDF?

4. What are the disadvantages of using TF-IDF?

5. Perform a text clustering example using TF-IDF vectors.


In-Lab:

1. Compute TF-IDF vectors for a text dataset of 5 documents and analyze the relationships between
similar phrases by applying the cosine similarity metric.
2. Cluster texts with similar meaning using the TF-IDF vectors developed in the above objective.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does the sklearn.feature_extraction.text.TfidfVectorizer() class do to text data?
2. According to your experiment, how close are the words ‘man’ and ‘king’?
3. Do you think TF-IDF vectors are suitable for developing a chatbot?
4. List the metrics used in the text clustering application.
5. Can you implement the same using spaCy?


Post-Lab:

1. Try the same on the WordNet dataset in NLTK and comment on the limitations of TF-IDF vectors.
2. Vectorize the large corpus dataset given below using TF-IDF and apply clustering.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.


Experiment Title: Zipf's_Law_nlp

Aim/Objective:

The aim is to evaluate Zipf’s law on small and large corpus text data and use it as a corpus analysis
tool for building a text dataset specific to a university exam portal chatbot.

Description:

Zipf's law states that the frequency of any word in a text is inversely proportional to its rank in the
frequency table. The most common word will occur about twice as often as the second most common
word, three times as often as the third most common word, and so on. Zipf's law has been observed
in a wide variety of data sets, including natural language texts, computer code, and even the population
of cities. The law has been used in a variety of applications, including information retrieval, natural
language processing, and network analysis.
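
The relationship above can be written as f(r) ≈ f(1) / r, where f(r) is the frequency of the word at rank r. The sketch below builds a rank-frequency table and compares it with that prediction. The sample text is hypothetical and far too small for the law to hold closely. A real check would use a large corpus such as Brown.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return [(rank, word, frequency)] sorted from most to least frequent."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def zipf_prediction(table):
    """Frequency predicted by f(r) = f(1) / r for each rank in the table."""
    top_freq = table[0][2]
    return [top_freq / rank for rank, _, _ in table]

# Hypothetical sample text used only to exercise the functions.
sample = (
    "the exam is on monday and the results of the exam are on the portal "
    "the portal is open and the students can see the results"
)
table = rank_frequency(sample)
predicted = zipf_prediction(table)
for (rank, word, freq), expected in zip(table[:5], predicted):
    print(f"{rank:>2} {word:<8} observed={freq:<3} predicted={expected:.1f}")
```

Plotting observed frequency against rank on log-log axes is a common way to inspect the fit, since the prediction then appears as a straight line with slope close to -1.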

Pre-Requisites:

Install Python 3.6 or above, then install the required packages using the guides below.

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why does Zipf's law hold for a wide variety of data sets?

2. How effective is the law on small datasets?

3. How can Zipf's law be used to improve the performance of machine learning algorithms?

4. In NLTK, which function is used to import the Brown dataset?

5. Why is the frequency of the most common word approximately twice that of the second most common word?


In-Lab:

1. Apply Zipf's law to the Brown dataset in NLTK and show that it holds.
2. Apply Zipf's law to your own dataset of not more than 100 words and check whether it holds.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)

• Data and Results:



This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:



This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. Which NLTK dataset was used in your experiment?
2. According to your experiment, when does Zipf's law hold?
3. Do you think Zipf's law is hypothetical and not an effective tool for developing a text corpus?
4. List other laws that are close to Zipf's law.
5. Can you implement Zipf's law using spaCy?


Post-Lab:

1. Try verifying Zipf's law on the answers you wrote during an in-semester exam.
2. Try Zipf's law on the large corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.


Experiment Title: Simple_topic_modelling

Aim/Objective:

The aim is to compute topic vectors on text data.

Description:

Topic modelling is a type of statistical modelling that is used to discover the abstract "topics" that
occur in a collection of documents. Topic modelling is a frequently used text-mining tool for discovery
of hidden semantic structures in a text body. The "topics" produced by topic modelling techniques are
clusters of similar words. A topic model captures this intuition in a mathematical framework.
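
One way to make the "mathematical framework" concrete is a topic-by-word weight matrix that maps a document's word counts to a topic vector. The sketch below sets the weights by hand purely for illustration. The topic names, words, and weights are hypothetical, and statistical topic models learn such weights from the corpus instead.

```python
# Hypothetical topic-by-word weight matrix, set by hand for illustration.
# Statistical topic models learn such weights from the corpus instead.
TOPIC_WEIGHTS = {
    "pets":   {"cat": 0.5, "dog": 0.5, "apple": 0.0, "mango": 0.0, "bowl": 0.1},
    "fruits": {"cat": 0.0, "dog": 0.0, "apple": 0.5, "mango": 0.5, "bowl": 0.1},
}

def word_counts(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return {word: tokens.count(word) for word in vocabulary}

def topic_vector(text):
    """Project a document's word counts onto each topic (a matrix-vector product)."""
    vocabulary = list(next(iter(TOPIC_WEIGHTS.values())))
    counts = word_counts(text, vocabulary)
    return {
        topic: sum(weights[word] * counts[word] for word in weights)
        for topic, weights in TOPIC_WEIGHTS.items()
    }

doc_pets = "the cat and the dog share a bowl"
doc_fruit = "an apple and a mango in a bowl"
print(topic_vector(doc_pets))
print(topic_vector(doc_fruit))
```

Each document is reduced from one dimension per word to one dimension per topic. The shared word "bowl" contributes a little to both topics, which shows how one word can carry weight in more than one cluster.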

Pre-Requisites:

Install Python 3.6 or above, then install the required packages using the guides below.

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why is topic modelling necessary for text classification?

2. How effective is the method in clustering similar words?

3. Is topic modelling a machine learning framework?

4. Who decides the weights across the words for topic clustering?

5. What is the role of the weight matrix in topic modelling?


In-Lab:

1. Apply a statistical model to extract topics from a set of documents in a corpus.
2. Compute the relation between the topics through a weight matrix on the considered text
data.

• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages to record the Procedure/Program)



• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)


• Analysis and Inferences:

This Section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does the weight matrix in topic modelling do?
2. According to your experiment, how were the relations between topics formulated?
3. Do you think topic modelling has been automated in word2vec?
4. Can topic modelling find semantic meaning between the words?
5. Can you implement word2vec or GloVe word embeddings in NLTK and spaCy?


Post-Lab:

1. Try implementing word2vec on a small corpus of text data using spaCy.
2. Try the word2vec model on the large corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This Section is meant for the student to Write the program/Procedure for the Experiment


(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: Principal Component Analysis

Aim/Objective:

The aim is to compute PCA from scratch on a corpus of text data transformed into TF-IDF vectors
and reduce its dimensionality. Use the dimensionally reduced features to reconstruct the original
information and report the reconstruction error.

Description:

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables into a set of values of
linearly uncorrelated variables called principal components. PCA is a widely used technique in data
analysis for dimensionality reduction, feature extraction, and visualization. It can be used to reduce
the number of variables in a dataset while preserving as much of the variation in the data as possible.
This can be useful for making the data easier to visualize and interpret, and for improving the
performance of machine learning algorithms.
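The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference solution: the function names (`pca_fit`, `pca_transform`, `pca_reconstruct`) are hypothetical, and the random non-negative matrix `X` only stands in for a real TF-IDF matrix (rows are documents, columns are terms).

```python
import numpy as np

def pca_fit(X, k):
    """Return the feature means, the top-k principal axes and all eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # centre every feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # re-order to descending
    return mean, eigvecs[:, order[:k]], eigvals[order]

def pca_transform(X, mean, W):
    """Project centred data onto the principal axes."""
    return (X - mean) @ W

def pca_reconstruct(Z, mean, W):
    """Map reduced features back to the original feature space."""
    return Z @ W.T + mean

# Stand-in for a 10-document, 6-term TF-IDF matrix (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.random((10, 6))

mean, W, eigvals = pca_fit(X, k=3)
Z = pca_transform(X, mean, W)                  # reduced features, shape (10, 3)
X_hat = pca_reconstruct(Z, mean, W)            # reconstruction, shape (10, 6)
squared_error = ((X - X_hat) ** 2).sum()
print("reduced shape:", Z.shape)
print("total squared reconstruction error:", squared_error)
```

The total squared error equals (n - 1) times the sum of the discarded eigenvalues, which is why the eigenvalue spectrum can be used to decide how many components to keep.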

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Why is PCA commonly applied in NLP?

2. How effective is PCA in reducing dimensionality?

3. When should PCA be applied in the NLP pipeline?

4. What are the limitations of implementing PCA?

5. Which is the most widely used dimensionality reduction model for text analysis?

In-Lab:

1. Develop step-by-step code for implementing PCA. Generate a dimensionally reduced
feature and reconstruct the original using the covariance matrix. Compute the reconstruction error.
2. Convert a corpus of 10 documents with 6 words each using a TF-IDF vectorizer and reduce its
dimensionality with 10% reconstruction loss.
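One possible way to approach the second task is sketched below. Several things here are assumptions rather than part of the exercise: the ten six-word documents are invented, TF-IDF is computed with the plain `tf * log(N / df)` variant (library vectorizers may use a different weighting), and "10% reconstruction loss" is interpreted as discarding at most 10% of the total variance of the centred TF-IDF matrix.

```python
import math
import numpy as np

# Hypothetical toy corpus: 10 documents, 6 words each.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are good pets",
    "the bird sang in the tree",
    "dogs chase cats around the yard",
    "birds fly high in the sky",
    "the mat was on the floor",
    "pets need food water and care",
    "the tree grew in the yard",
    "good food makes pets very happy",
]

tokens = [d.split() for d in docs]
vocab = sorted({w for t in tokens for w in t})
N = len(docs)
df = {w: sum(w in t for t in tokens) for w in vocab}   # document frequency

# TF-IDF matrix: term frequency times log inverse document frequency.
X = np.array([[t.count(w) / len(t) * math.log(N / df[w]) for w in vocab]
              for t in tokens])

# PCA via eigen-decomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (N - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]     # descending order
eigvals = np.clip(eigvals, 0.0, None)                  # remove round-off negatives

# Smallest k that retains at least 90% of the variance.
retained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(retained, 0.90) + 1)

W = eigvecs[:, :k]
Xc_hat = Xc @ W @ W.T
loss = np.linalg.norm(Xc - Xc_hat) ** 2 / np.linalg.norm(Xc) ** 2
print(f"vocabulary size: {len(vocab)}, components kept: {k}, relative loss: {loss:.3f}")
```

The relative loss printed at the end is the fraction of variance carried by the discarded components, so it can be compared directly against the 10% target.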

• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does an eigenvalue mean to you?
2. According to your experiment, which eigenvalue did you select?
3. Do you think reducing dimensionality is a good NLP practice?
4. Have you come across any other dimensionality reduction techniques?
5. Can you wrap PCA in a function for use with spaCy and NLTK?

Post-Lab:

1. Try building a small 100-word text dataset and apply PCA on its TF-IDF vectors. Train an ANN model
on the dimensionally reduced features and verify the performance of the model.
2. Try the same on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: Singular Value Decomposition

Aim/Objective:

The aim is to implement SVD from scratch and apply it to reduce dimensionality of text data.

Description:

SVD can be used to reduce the dimensionality of a document-term matrix while preserving as much
of the variation in the data as possible. This can be useful for making the data easier to visualize and
interpret, and for improving the performance of machine learning algorithms. SVD can also be used
to extract features from a document-term matrix. This can be useful for tasks such as topic modelling
and text classification. Latent semantic analysis (LSA) is a technique that uses SVD to identify the latent
semantic structures in a document-term matrix. This can be useful for tasks such as text clustering and
information retrieval.
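To make the idea of a truncated SVD concrete, the sketch below builds a rank-k decomposition from the eigen-decomposition of A^T A, which is one simple "from scratch" route (it is not the numerically preferred algorithm, and library routines work differently). The helper name `svd_topk` and the small count matrix `A` (rows as documents, columns as terms) are hypothetical and used only for illustration.

```python
import numpy as np

def svd_topk(A, k):
    """Rank-k SVD of A built from the eigen-decomposition of A^T A.

    Assumes the top-k singular values are non-zero.
    """
    eigvals, V = np.linalg.eigh(A.T @ A)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    s = np.sqrt(np.clip(eigvals[order], 0.0, None))   # singular values
    V = V[:, order]                               # right singular vectors
    U = A @ V / s                                 # left singular vectors
    return U, s, V.T

# Hypothetical document-term count matrix: 5 documents, 4 terms.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 0, 0, 1],
], dtype=float)

U, s, Vt = svd_topk(A, k=2)
A_k = U @ np.diag(s) @ Vt        # best rank-2 approximation of A
doc_coords = U * s               # documents in the 2-D latent space
term_coords = Vt.T * s           # terms in the same latent space
print("singular values:", s)
print("squared approximation error:", np.linalg.norm(A - A_k) ** 2)
```

In an LSA setting, `doc_coords` and `term_coords` are the low-dimensional representations that would be compared (for example with cosine similarity) for clustering or retrieval.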

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. What are the applications of Singular Value Decomposition?

2. How effective is the method on small text data?

3. Why does SVD lose information when only a small number of singular values are retained?

4. Can SVD tell similarity between words?

5. What is the difference between SVD and PCA?

In-Lab:

1. Apply SVD from scratch on a text corpus and construct an analysis framework.
2. Compare SVD and PCA on the text data used earlier and conclude which has performed
better.
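A small numerical comparison may help frame the second task. The sketch below is an illustration under assumptions: the random matrix `X` is only a stand-in for the TF-IDF data used earlier, the helper names are hypothetical, and "performance" is measured here purely as squared reconstruction error for the same number of components k.

```python
import numpy as np

def svd_rank_k_error(X, k):
    """Squared error of the best rank-k approximation of X (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.linalg.norm(X - X_k) ** 2

def pca_rank_k_error(X, k):
    """Squared error of a k-component PCA reconstruction (mean added back)."""
    return svd_rank_k_error(X - X.mean(axis=0), k)

rng = np.random.default_rng(1)
X = rng.random((10, 6))          # stand-in for a TF-IDF matrix (assumption)
n = X.shape[0]

# PCA eigenvalues can be read off the singular values of the centred data.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]
s = np.linalg.svd(Xc, compute_uv=False)
eigvals_from_svd = s ** 2 / (n - 1)

for k in (1, 2, 3):
    print(k, svd_rank_k_error(X, k), pca_rank_k_error(X, k))
```

On centred data the two methods coincide. On uncentred data, PCA also stores the feature means, so its error for a given k is never larger than that of truncated SVD; whether that counts as "better" depends on the task (for sparse text matrices, centring destroys sparsity, which is one reason truncated SVD is often preferred for LSA).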

• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What is the sklearn function for accessing SVD?
2. According to your experiment, what are the drawbacks of SVD?
3. Do you think SVD performs better than PCA?
4. Which has fewer steps, SVD or PCA?
5. Can you model topics using SVD in spaCy?

Post-Lab:

1. Try SVD for topic modelling on a text corpus of 10 documents with 6 words each.
2. Try SVD on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: spam_dect_class

Aim/Objective:

The aim is to develop pseudo-code in Python for spam SMS classification using semantic
representation (LSA) and to construct an SMS spam elimination algorithm with the LSA model.

Description:

SMS spam detection is the process of identifying and filtering out unwanted or malicious SMS
messages. Spam messages can be a nuisance, and they can also be dangerous, as they can contain
malware or phishing links.

The dataset:

https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Judge the effects of homographs and homophones in chatbot voice-based applications.

2. What is the function of the t-SNE representation, and what is its role in the analysis of NLP pipelines?

3. Explain how polysemy, homonyms and zeugma impact the outcome of a text-based NLP
application.

4. What are the steps in an SMS spam detection model?

5. What is the difference between spam detection and elimination?

In-Lab:

1. Download the data from the link in the description. Build an NLP pipeline for pre-processing
the text in the dataset. Convert the text to a vector corpus.
2. Develop an SMS spam filter model with a Bayes classifier and test the trained model. Report the
failure rate of the model.
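The Bayes classifier step can be illustrated with a small multinomial Naive Bayes written in plain Python. This is only a sketch under assumptions: the class name `NaiveBayes`, the whitespace `tokenize` helper and the handful of example messages are hypothetical and do not come from the Kaggle dataset, pre-processing is reduced to lowercasing, and add-one (Laplace) smoothing is one common choice among several.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Very small pre-processing step: lowercase and split on whitespace."""
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)    # log prior
            total_words = sum(self.word_counts[label].values())
            for w in tokenize(text):
                if w not in self.vocab:              # ignore unseen words
                    continue
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy messages (not taken from the SMS Spam Collection).
train = [
    ("win a free prize now", "spam"),
    ("free entry claim your cash prize", "spam"),
    ("urgent call now to claim free cash", "spam"),
    ("are we meeting for lunch today", "ham"),
    ("see you at home tonight", "ham"),
    ("call me when you reach home", "ham"),
]
test = [("claim your free prize", "spam"), ("lunch at home today", "ham")]

model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
predictions = [model.predict(t) for t, _ in test]
failure_rate = sum(p != y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, "failure rate:", failure_rate)
```

On the real dataset, the same failure-rate calculation would be applied to a held-out test split rather than to two hand-written messages.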

• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What does the Bayes classification model do in the spam filtering algorithm?
2. According to your experiment, is the obtained error good or bad?
3. Do you think there are better models than Bayes? Give one.
4. What difficulties did you face during implementation?
5. Can you replace the model with an ANN classifier?

Post-Lab:

1. Try converting the above model into a text classifier.
2. Try an ANN model for text classification on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.

Experiment Title: Sentiment_Analysis_RNN

Aim/Objective:

The aim is to develop an end-to-end NLP pipeline for sentiment analysis using a time series model
trained on Twitter sentiment data.

Description:

Sentiment analysis using RNNs can be implemented in a variety of ways. One common approach is to
use a Long Short-Term Memory (LSTM) network. LSTMs are a type of RNN that is specifically designed
to address the problem of long-term dependencies. This makes them well-suited for tasks such as
sentiment analysis, where it is important to be able to understand the context of the text.
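To show how an LSTM carries context along a sentence, here is a minimal NumPy sketch of the forward pass of a single LSTM cell applied to a short sequence. Everything in it is an assumption made for illustration: the weights are random and untrained, the sequence `xs` is a random stand-in for word embeddings, the gate ordering (input, forget, output, candidate) is just one convention, and the final sigmoid read-out `w_out` is a hypothetical classifier head. A real experiment would train such a model with a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases.
    Gate order in the stacked matrices: input, forget, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])              # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])          # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])      # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])      # candidate cell values
    c = f * c_prev + i * g           # cell state carries long-term context
    h = o * np.tanh(c)               # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
D, H, T = 8, 4, 5                    # embedding size, hidden size, sequence length
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
xs = rng.normal(size=(T, D))         # stand-in for T word embeddings

h, c = np.zeros(H), np.zeros(H)
for x in xs:                         # read the sequence one token at a time
    h, c = lstm_step(x, h, c, W, U, b)

w_out = rng.normal(scale=0.1, size=H)        # hypothetical classifier head
p_positive = sigmoid(w_out @ h)              # untrained "positive sentiment" score
print("final hidden state:", h)
print("positive probability (untrained):", p_positive)
```

The additive update of the cell state `c` is what lets information from early tokens survive to the end of the sequence, which is the long-term dependency property mentioned above.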

https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

Pre-Requisites:

Install Python 3.6 or above and the required packages using the following guides:

1. https://pip.pypa.io/en/stable/installation/

2. https://packaging.python.org/en/latest/tutorials/installing-packages/

3. https://pypi.org/project/nltk/

4. https://www.tensorflow.org/install/pip

5. https://spacy.io/usage

6. https://pypi.org/project/gensim/

Pre-Lab:

1. Judge the effects of recurrence in neural networks over dense layers.

2. What is the function of VADER in sentiment analysis?

3. Explain how the model learns sentiment from data.

4. What are the steps in a sentiment analysis model?

5. What is SentiWordNet?

In-Lab:

1. Download the data from the link in the description. Build an NLP pipeline for pre-processing
the text in the dataset. Convert the text to a vector corpus.
2. Develop a sentiment analysis model using VADER and SentiWordNet. Report the results
using data visualization techniques.

• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages to record the Procedure/Program)


• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page to record the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for each Program)

Sample VIVA-VOCE Questions (In-Lab):

1. What is the VADER model?
2. According to your experiment, is the obtained accuracy good or bad?
3. Do you think there are better models than SentiWordNet? Give one.
4. What difficulties did you face during implementation?
5. Can you replace the model with an LSTM classifier?

Post-Lab:

1. Try converting the above model into a sentiment analysis model using TensorFlow with LSTMs.
2. Try an LSTM model for text classification on the big corpus dataset given below.
https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter
• Procedure/Program:

This section is meant for the student to write the program/procedure for the experiment.

(Leave at least 2-3 Pages for each Procedure/ Program/ Solution)

• Data and Results:

This Section is meant for the students to collect, record the results generated during the
Program/Experiment execution. Include instructions on how to present the results, such as creating
tables, graphs, or visualizations.

(Leave at least 1 Page for recording the results)

• Analysis and Inferences:

This section is meant for the students to analyse their data and perform calculations. Include
questions or prompts to encourage critical thinking and interpretation of the data.

(Leave at least 1 Page for recording the analysis and inferences)

Evaluator Remark (if Any):

Marks Secured: _____out of 50

Signature of the Evaluator with Date

Evaluator MUST ask Viva-voce prior to signing and posting marks for each experiment.
