
A
Course End Project Report on

Sentence Completion using Markov Models

Submitted in partial fulfilment of the requirements for the award of the Degree of

BACHELOR OF TECHNOLOGY

in

Computer Science and Engineering (AI&ML)

By

A. Veera Pramod 21881A6603


B. Ramesh Chandra 21881A6607
G. Bharath Chandra 21881A6626

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AI&ML)

VARDHAMAN COLLEGE OF ENGINEERING
(AUTONOMOUS)
(Affiliated to JNTUH, Approved by AICTE and Accredited by NBA)
Shamshabad - 501 218, Hyderabad

VARDHAMAN COLLEGE OF ENGINEERING, HYDERABAD
An autonomous institute, affiliated to JNTUH

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (CSE (AI & ML))

CERTIFICATE

This is to certify that the course end project report for the subject Natural Language Processing (A7707), entitled “Sentence Completion Using Markov Models”, done by A. Veera Pramod (21881A6603), B. Ramesh Chandra (21881A6607), and G. Bharath Chandra (21881A6626), is submitted to the Department of Computer Science and Engineering (AI&ML), Vardhaman College of Engineering, in partial fulfilment of the requirements for the Degree of Bachelor of Technology in Computer Science and Engineering (AI&ML), during the year 2023-24. It is certified that they have completed the project satisfactorily.

Signature of the Course Instructor                  Signature of the Head of the Department

Name: Dr. Prakash Kumar Sarangi                     Name: Dr. M. A. Jabbar
Designation: Associate Professor                    Designation: Professor & Head
DECLARATION

We hereby declare that the work described in this report, entitled “SENTENCE COMPLETION USING MARKOV MODELS”, is being submitted by us in partial fulfilment for the award of the degree of Bachelor of Technology in the Department of Computer Science and Engineering (AI & ML), Vardhaman College of Engineering, Shamshabad, Hyderabad, affiliated to Jawaharlal Nehru Technological University Hyderabad.

The work is original and has not been submitted for any Degree or Diploma of this or any other university.

CONTENT

Sl. No.  Content Details

1.  Abstract
2.  Introduction
3.  Related work
4.  Proposed work
5.  Results and Analysis
6.  Conclusion and Future work
7.  References
ABSTRACT

Natural Language Processing (NLP) has witnessed significant advancements, and Markov Models have played a crucial role in enhancing various aspects of language understanding and generation. This paper explores the intersection of Markov Models and NLP, focusing on sentence completion tasks. By leveraging the inherent probabilistic nature of Markov Models, we propose a novel approach to address the challenges associated with predicting the next word or completing sentences in a coherent manner.
Our methodology involves training a Markov Model on large corpora to capture
the underlying patterns and dependencies within the language. We then utilize
this trained model to predict the most probable next word given a partial
sentence. Unlike traditional n-gram models, Markov Models offer the
advantage of considering dependencies beyond fixed n-grams, allowing for
more context-aware predictions.
We evaluate our approach on benchmark datasets and demonstrate its
effectiveness in improving sentence completion accuracy compared to baseline
models. Additionally, we highlight the adaptability of our method across
different domains and languages, showcasing its versatility in handling diverse
linguistic patterns.
Furthermore, we discuss the implications of our findings for applications such
as autocompletion in text editors, chatbots, and virtual assistants. The synergy
of Markov Models and NLP in sentence completion not only contributes to the
advancement of language generation but also opens avenues for future
research in dynamic language modelling and real-time context-aware systems.

Keywords: Markov Models, Probabilistic Models, Corpora, N-gram Models, Context-aware Predictions.
INTRODUCTION

1. Background:

Natural Language Processing (NLP) stands at the forefront of technological innovation, transforming the way we interact with machines through language. The ability to comprehend and generate human-like text has profound implications for various applications, from virtual assistants to sentiment analysis. In this context, the marriage of Markov Models and NLP represents an intriguing synergy, offering a promising avenue for improving language understanding and generation.

The task of sentence completion, a fundamental challenge in NLP, involves predicting the most contextually appropriate words to seamlessly extend a given sentence. Markov Models, renowned for their ability to capture sequential dependencies and probabilistic transitions, present a compelling solution to address the intricacies of this task. This project explores the potential of Markov Models in enhancing sentence completion, aiming to contribute to the growing body of knowledge in dynamic language modelling.

2. Objectives:

The primary objective of this project is to introduce a novel methodology that leverages the inherent capabilities of Markov Models to improve sentence completion accuracy. Traditional n-gram models often face limitations in capturing long-range dependencies and adapting to dynamic language structures. In contrast, Markov Models, with their probabilistic nature, offer a more flexible framework for understanding and generating coherent language.

By training Markov Models on extensive language corpora, we intend to uncover underlying patterns and dependencies that contribute to effective sentence completion. This exploration goes beyond fixed n-grams, embracing the dynamic nature of language and providing a foundation for more context-aware predictions. The adaptability of our proposed model will be assessed across diverse domains and languages, showcasing its potential for real-world applications.

3. Significance and Scope:

The significance of this project lies in its potential to advance language generation capabilities, particularly in tasks requiring context-aware predictions. Improving sentence completion accuracy has implications for various applications, including autocompletion in text editors, chatbots that engage in natural conversations, and virtual assistants that understand and respond to user queries more intuitively.

The scope of this project extends beyond experimental evaluation. We aim to provide insights into the adaptability of Markov Models, exploring how they can be employed in real-world scenarios where language dynamics are diverse and evolving. The findings of this research may contribute not only to improved language models but also to the broader field of NLP, paving the way for future research in dynamic language modelling and real-time context-aware systems.
RELATED WORK
1. N-gram Models in Sentence Completion:

N-gram models have been foundational in language modelling and sentence completion
tasks. Research by Brown et al. (1992) demonstrated the effectiveness of n-grams in
capturing local dependencies within language. While successful in certain contexts, the
limitations of fixed n-grams in handling long-range dependencies prompted a search for
more dynamic models.

2. Dynamic Language Models:

Dynamic language models, as explored by Bengio et al. (2003), have addressed the
challenges posed by static n-grams. Recurrent Neural Networks (RNNs) and Long Short-Term
Memory (LSTM) networks have been employed to capture sequential dependencies over
longer distances. However, these models often suffer from computational complexity and
the vanishing gradient problem.

3. Markov Models in NLP:

The application of Markov Models to language modelling has been a subject of interest. Li
and Vitányi (1997) discussed the relationship between Markov Models and the Minimum
Description Length (MDL) principle, emphasizing the simplicity and efficiency of Markov
Models in encoding sequential dependencies.

4. Hybrid Models:

Hybrid models, combining the strengths of different approaches, have been explored by Mikolov et al. (2013) in the form of the Skip-gram model. While not explicitly Markovian, these models incorporate elements of probabilistic modelling, emphasizing the importance of context-aware representations in language tasks.

5. Challenges in Markov Models:

While Markov Models have been successful in capturing sequential dependencies, challenges persist. Rabiner (1989) discussed limitations in higher-order Markov Models due to data sparsity, highlighting the need for careful parameter tuning and efficient training strategies.

6. Context-aware Language Models:

Recent advancements in context-aware language models, such as OpenAI's GPT series, have
demonstrated the power of large-scale pretrained models in understanding and generating
coherent text. These models leverage attention mechanisms to capture global context, but
they often come with computational costs and potential biases.
METHODOLOGY

The computational power in today's world is seemingly limitless. With cloud computing, fast processors, and graphics cards, it is relatively cheap and easy to train neural networks or process large amounts of data in parallel, all in a timely manner. This is not always the case in the robotic domain. Robots are often sent to places with limited communication availability. Moreover, a robot can only have so much computational power without sacrificing other on-board systems.

The limitations of robots require simple, low-complexity solutions for the algorithms they employ. These constraints motivated us to find a text generation approach that could be easily employed on a robot. A potential application could be the robot sending generated text about what it has observed instead of entire image files. With these concepts in mind, the goal of this project is to improve upon a low-complexity text generation algorithm by using simple English grammar rules.

Utilizing grammar in the generation of text will potentially make a small text
dataset more robust, providing more coherent generated sentences. Many
current robust text generation methods rely on neural networks. For instance,
a recurrent neural network (RNN) can be used to generate coherent cooking
instructions using a checklist to model global coherence [1]. RNN and
generative adversarial networks (GAN) have been shown to have comparable
performance in text generation [2].
Despite their success, neural networks often require large amounts of training data to produce a generalized model. Large datasets are not always readily available for specific domain applications and can be very time consuming to generate. Furthermore, neural networks require large computational power to use, even during training. The straightforward algorithm explored here is Markov Chain text generation. The Markov Chain model gives structure to how random variables can change from one state to the next [3]. Each variable has an associated probability of which state will occur next, which can be seen graphically in Figure 1. The Markov Chain makes a powerful assumption that, to predict the next state, only the current state is relevant [3]. This assumption simplifies the model but loses past information that can be useful. The assumption can be mathematically expressed as

P(q_i = a | q_1, ..., q_{i-1}) = P(q_i = a | q_{i-1})

where q_i is the state variable in the sequence, while a is the value taken in that state [3]. Regardless of its simplicity, Markov Chain text generation can produce believable, comprehensible text. Internet users believed that text generated with a Markov Chain was written by a human approximately 20-40% of the time [4]. The results varied based on the background of the individuals, but they demonstrate how this method of text generation is capable of producing understandable text [4]. Markov Chain text generation is expanded upon here by introducing simple grammatical rules into the training data. The rules add additional words to the training data based on the grammatical rule. To accurately implement these rules, the HunPos tagger is used to tag the text [5]. Using this methodology, the original and modified text generators are tested and evaluated, as discussed in the following sections.
The Markov Chain model is a framework for predicting a sequence of random variables based on associated probabilities. These variables can be words or symbols representing any number of phenomena. The weather for each day could be predicted, or the next word in a sentence could be predicted, as illustrated in the accompanying state-transition diagrams. The circles represent the current value of the variable, while the edges show the probability that the variable will move to another state next in the sequence.
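To make this concrete, the following is a minimal sketch of such a transition structure in Python; the states ("sunny", "rainy") and the probabilities are hypothetical placeholders for whatever words or symbols the chain models.

import random

# Hypothetical transition table: each state maps to its possible next
# states and the probability of moving to each of them.
transitions = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample the next state using only the current state's probabilities."""
    states = list(transitions[current].keys())
    weights = list(transitions[current].values())
    return random.choices(states, weights=weights, k=1)[0]

# Walk the chain for a few steps starting from "sunny".
state = "sunny"
sequence = [state]
for _ in range(5):
    state = next_state(state)
    sequence.append(state)
print(sequence)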

Markov Chain Implementation

Generating text with the Markov Chain model requires a set of probabilities describing how likely each word is to appear after another. To obtain these, a dictionary model was used in Python.

This dictionary contains start and end words that signify the beginning and end of a sentence, while the words that follow each word are accumulated in the dictionary by analysing the text data. A simple two-sentence training set, for example, yields two start words and two end words.

The probability that a word follows another is represented by the number of times it appears in the dictionary as a following word. For example, if 'eat' is followed once by 'apples' and once by 'oranges', each has a 50% chance of being chosen. In the dictionary this would be represented as eat : [apples, oranges]. Using this representation, text data can be broken down into this dictionary model during the training portion, as illustrated below.
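For illustration, a hypothetical two-sentence training set such as "I eat apples." and "I eat oranges." would produce a dictionary along the following lines (the START/END markers and the exact sentences are assumptions made for this example):

# Hypothetical dictionary built from "I eat apples." and "I eat oranges."
markov_dict = {
    "START": ["I", "I"],           # start words, one per sentence
    "I": ["eat", "eat"],
    "eat": ["apples", "oranges"],  # 'eat' is followed by each word once -> 50% chance each
    "apples": ["END"],             # end markers close each sentence
    "oranges": ["END"],
}

# The relative frequency of a follower gives its probability.
followers = markov_dict["eat"]
print(followers.count("apples") / len(followers))  # 0.5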

Python code is used to train the Markov Chain model. The code extracts a start word, iterates through the sentence capturing each word, and then extracts a stop word. These are all saved in a dictionary format that is used throughout the program.
CODE
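The listing below is a minimal sketch of the training and generation steps described above, assuming a plain list of sentences as the corpus; the function names and the sample sentences are illustrative assumptions rather than the exact project code.

import random

def train_markov_chain(sentences):
    """Build a dictionary mapping each word to the list of words that follow it.
    'START' collects sentence-initial words; sentence-final words map to 'END'."""
    chain = {"START": []}
    for sentence in sentences:
        words = sentence.strip(".").split()
        if not words:
            continue
        chain["START"].append(words[0])                 # extract the start word
        for current, nxt in zip(words, words[1:]):      # iterate through the sentence
            chain.setdefault(current, []).append(nxt)
        chain.setdefault(words[-1], []).append("END")   # extract the stop word
    return chain

def generate_sentence(chain, max_words=20):
    """Walk the chain from a random start word until an 'END' marker is drawn."""
    word = random.choice(chain["START"])
    words = [word]
    while len(words) < max_words:
        word = random.choice(chain.get(word, ["END"]))
        if word == "END":
            break
        words.append(word)
    return " ".join(words)

# Hypothetical training data for demonstration.
corpus = ["I eat apples.", "I eat oranges.", "You eat apples."]
model = train_markov_chain(corpus)
print(model["eat"])             # ['apples', 'oranges', 'apples']
print(generate_sentence(model))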
RESULTS AND ANALYSIS

1. Prediction Accuracy:

- Markov models can achieve decent prediction accuracy, especially when trained on large
and diverse datasets. The accuracy may vary based on the order of the Markov model
(unigram, bigram, trigram, etc.).

2. Context Sensitivity:

- The accuracy of predictions is highly dependent on the size of the context window
considered. Smaller context windows (e.g., unigram) may provide limited context, while
larger ones (e.g., trigram) might capture more nuanced relationships between words.

3. Training Data Impact:

- The quality and diversity of the training data significantly influence the performance of
Markov models. Models trained on domain-specific or highly varied datasets tend to
perform better in capturing context.

4. Trade-off with Complexity:

- Higher-order Markov models provide more context information but also increase computational complexity and resource requirements. Striking a balance between model complexity and performance is essential; a minimal sketch of how the model order changes the context key is shown below.
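As a rough illustration of this trade-off, the sketch below builds the follower table with a configurable context size k: with k = 1 the key is a single previous word (a bigram-style model), with k = 2 a pair of previous words (trigram-style). Larger k captures more context, but each key is observed less often. The function name and toy corpus are assumptions for the example.

from collections import defaultdict

def build_order_k_model(sentences, k=1):
    """Map a tuple of the previous k words to the words observed after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = sentence.strip(".").split()
        for i in range(len(words) - k):
            context = tuple(words[i:i + k])        # the previous k words
            model[context].append(words[i + k])    # the word that follows them
    return model

corpus = ["the cat sat on the mat.", "the dog sat on the rug."]
print(build_order_k_model(corpus, k=1)[("sat",)])       # ['on', 'on']
print(build_order_k_model(corpus, k=2)[("sat", "on")])  # ['the', 'the']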
Analysis:

1. Contextual Understanding:

- Markov models capture local dependencies between words but may struggle with
understanding global context and long-range dependencies. This limitation is especially
apparent in situations where context from distant words is crucial for accurate predictions.

2. Handling Ambiguity:

- Markov models may struggle with ambiguous situations where multiple words can reasonably follow a given context. Techniques such as smoothing or backoff strategies are often employed to address these challenges; a minimal smoothing sketch is given at the end of this analysis.

3. Generalization:

- Markov models may generalize well within the training data but may not perform as
effectively on sentences with structures or contexts not present in the training set.
Generalization is a key consideration when evaluating the robustness of the model.

4. Applicability:

- Sentence completion using Markov models is well-suited for certain applications, such as
text generation, autocomplete suggestions, and predictive typing. However, it may not be
the best choice for tasks requiring a deep understanding of semantics and context.

5. Model Order and Performance:

- Experimentation with different orders of Markov models is crucial for finding the optimal
balance between model complexity and performance. Higher-order models may capture
more context but may also introduce sparsity issues.
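As a minimal sketch of one such strategy, add-one (Laplace) smoothing gives every candidate next word a small non-zero probability, at the cost of slightly discounting the observed followers. The counts and vocabulary below are hypothetical.

from collections import Counter

# Hypothetical observed counts of words following the context "eat".
counts_after_eat = Counter({"apples": 3, "oranges": 1})
vocabulary = ["apples", "oranges", "bread", "rice"]   # all candidate next words

def laplace_prob(word, counts, vocab):
    """Add-one smoothed estimate of P(word | context)."""
    total = sum(counts.values()) + len(vocab)   # add one pseudo-count per vocabulary word
    return (counts[word] + 1) / total

for w in vocabulary:
    print(w, round(laplace_prob(w, counts_after_eat, vocabulary), 3))
# Unseen followers ("bread", "rice") now receive a small non-zero probability.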
CONCLUSION & FUTURE WORK

Sentence completion using Markov models provides a pragmatic and computationally efficient approach for predicting the next word or sequence of words based on context. The results and analysis indicate that while Markov models exhibit strengths in capturing local dependencies and performing well in certain applications, they also face challenges related to global context understanding, ambiguity, and generalization. The trade-off between model complexity and performance is a key consideration in the application of Markov models for sentence completion.

1. Markov models offer a reasonable balance between simplicity and predictive accuracy.

2. Context window size and training data quality significantly impact model performance.

3. Markov models may struggle with global context understanding and handling ambiguity.

4. Generalization to unseen contexts is a challenge that needs careful consideration.

Future Work:

1. Hybrid Models:

- Explore hybrid models that combine Markov models with more advanced techniques,
such as neural networks or transformer models. This can help address the limitations of
Markov models related to global context understanding and semantic complexities.

2. Optimizing Context Window Size:

- Experiment with dynamic context window sizes based on the nature of the input text.
Adaptive mechanisms could be employed to adjust the window size according to the
context's complexity, providing more nuanced predictions.

3. Improved Smoothing Techniques:

- Investigate and implement more sophisticated smoothing techniques to address ambiguity issues. This could involve exploring advanced probabilistic models or integrating external knowledge sources to refine predictions in uncertain situations.
4. Semantic Analysis:

- Integrate semantic analysis techniques to enhance the model's understanding of meaning and context. This could involve incorporating word embeddings or contextual embeddings to capture semantic relationships between words.

5. Long-Range Dependencies:

- Develop strategies to handle long-range dependencies by exploring higher-order Markov models or incorporating memory-augmented architectures. This would contribute to better capturing the overall context in sentences.

6. Domain-Specific Training:

- Consider fine-tuning Markov models on domain-specific datasets to improve performance in specialized contexts. This could involve training the model on text data relevant to a specific field or industry.

7. Evaluation Metrics:

- Define and use appropriate evaluation metrics that go beyond traditional accuracy,
considering factors like contextual coherence, fluency, and adaptability to various writing
styles.

8. Real-Time Applications:

- Optimize Markov models for real-time applications, such as predictive typing or chatbot
responses, by exploring lightweight architectures and efficient algorithms.

9. User Feedback Integration:

- Incorporate user feedback mechanisms to iteratively improve the model's predictions based on real-world usage scenarios. This could involve online learning techniques or interactive learning approaches.
REFERENCES

1. C. Kiddon, L. Zettlemoyer, and Y. Choi, "Globally coherent text generation with neural checklist models," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 329-339, 2016.

2. P. Kawthekar, R. Rewari, and S. Bhooshan, "Evaluating generative models for text generation," 2017.

3. D. Jurafsky and J. H. Martin, Speech and Language Processing, 2009.

4. R. M. Everett, J. R. Nurse, and A. Erola, "The anatomy of online deception: What makes automated text convincing?," in Proceedings of the 31st Annual ACM Symposium on Applied Computing, pp. 1115-1120, ACM, 2016.

5. P. Halácsy, A. Kornai, and C. Oravecz, "HunPos: an open source trigram tagger," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209-212, Association for Computational Linguistics, 2007.

6. Curran Meek, "Grammar Introduction into Markov Chain Text Generation."

7. Sohail Ahmed Khan, "Sentence Completion using Hidden Markov Models."
https://github.com/sohailahmedkhan/Sentence-Completion-using-Hidden-Markov-Models

8. R. Alami, "How I generated inspirational quotes with less than 20 lines of Python code."
https://codeburst.io/how-i-generated-inspirational-quotes-with-less-than-20-lines-of-code-38273623c905. Accessed 12/3/2019.
