Team-03 NLP Course End Project Report
CERTIFICATE

This is to certify that the course end project report for the subject Natural Language Processing (A7707), entitled "Sentence Completion Using Markov Models", done by A. Veera Pramod (21881A6603), A. Ramesh Chandra (21881A6607), and G. Bharath Chandra (21881A6626), is submitted to the Department of Computer Science and Engineering (AI & ML), Vardhaman College of Engineering, in partial fulfilment of the requirements for the Degree of Bachelor of Technology in Computer Science and Engineering (AI & ML), during the year 2023-24. It is certified that they have completed the project satisfactorily.
DECLARATION

I hereby declare that the work described in this report entitled "SENTENCE COMPLETION USING MARKOV MODELS", which is being submitted by us in partial fulfilment for the award of Bachelor of Technology in the Department of Computer Science and Engineering (AI & ML), is original and has not been submitted for any Degree or Diploma of this or any other university.
LITERATURE REVIEW

1. N-gram Models:
N-gram models have been foundational in language modelling and sentence completion
tasks. Research by Brown et al. (1992) demonstrated the effectiveness of n-grams in
capturing local dependencies within language. While successful in certain contexts, the
limitations of fixed n-grams in handling long-range dependencies prompted a search for
more dynamic models.
2. Dynamic Language Models:
Dynamic language models, as explored by Bengio et al. (2003), have addressed the
challenges posed by static n-grams. Recurrent Neural Networks (RNNs) and Long Short-Term
Memory (LSTM) networks have been employed to capture sequential dependencies over
longer distances. However, these models often suffer from computational complexity and
the vanishing gradient problem.
3. Markov Models:
The application of Markov Models to language modelling has been a subject of interest. Li
and Vitányi (1997) discussed the relationship between Markov Models and the Minimum
Description Length (MDL) principle, emphasizing the simplicity and efficiency of Markov
Models in encoding sequential dependencies.
4. Hybrid Models:
Hybrid models, combining the strengths of different approaches, have been explored by
Mikolov et al. (2010) in the form of the Skip-gram model. While not explicitly Markovian,
these models incorporate elements of probabilistic modeling, emphasizing the importance
of context-aware representations in language tasks.
5. Context-Aware Language Models:
Recent advancements in context-aware language models, such as OpenAI's GPT series, have
demonstrated the power of large-scale pretrained models in understanding and generating
coherent text. These models leverage attention mechanisms to capture global context, but
they often come with computational costs and potential biases.
METHODOLOGY
Utilizing grammar in the generation of text will potentially make a small text
dataset more robust, providing more coherent generated sentences. Many
current robust text generation methods rely on neural networks. For instance,
a recurrent neural network (RNN) can be used to generate coherent cooking
instructions using a checklist to model global coherence [1]. RNN and
generative adversarial networks (GAN) have been shown to have comparable
performance in text generation [2].
Despite their success, neural networks often require large amounts of training
data to produce a generalized model. Large datasets are not always readily
available for specific domain applications and can be very time consuming to
generate. Furthermore, neural networks require large computational power to
use, even during training. The straightforward algorithm explored here is Markov
Chain text generation. The Markov Chain model gives structure to how random
variables can change from one state to the next [3]. Each variable has an
associated probability for which state will occur next; this can be seen
graphically in Figure 1. The Markov Chain makes a powerful assumption: to
predict the next state, only the current state is relevant [3]. This assumption
simplifies the model but loses past information that could be useful. The
assumption can be mathematically expressed as

P(q_i = a | q_1, ..., q_{i-1}) = P(q_i = a | q_{i-1})

where q_i is the state variable in the sequence, while a is the value taken in
that state [3]. Regardless of its simplicity, Markov Chain text generation can
produce believable, comprehensible text. Internet users believed generated
text, produced using a Markov Chain, was written by a human approximately
20-40% of the time [4]. The results varied based on the background of the
individuals, but demonstrate how this method of text generation is capable of
producing understandable text [4]. Markov Chain text generation is expanded
upon by introducing simple grammatical rules in the training data. The rules
add additional words to the training data based on the grammatical rule. To
implement these rules accurately, the Hunpos tagger is used to tag the text
[5]. Using this methodology, the original and modified text generators are
tested and evaluated, as discussed in the following sections.
The Markov Chain model is a framework for predicting a sequence of random
variables based on associated probabilities. These variables can be words, or
symbols representing any number of phenomena. The weather for each day could
be predicted, for example, or the next word in a sentence. The circles in such
a diagram represent the current value of the variable, while the lines show
the probability that the variable will transition to another state next in
the sequence.
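The state-transition view described above can be made concrete with a small sketch. The states and probabilities below are invented for illustration (they are not taken from the report's figures); the key point is that the next state is sampled from the current state alone:

```python
import random

# Hypothetical weather transition table: P(next state | current state).
transitions = {
    "sunny": {"sunny": 0.7, "rainy": 0.3},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng=random):
    """Sample the next state using only the current state (the Markov assumption)."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]

# Walk the chain for a few steps starting from "sunny".
state = "sunny"
history = [state]
for _ in range(5):
    state = next_state(state)
    history.append(state)
```

Replacing the weather states with words turns the same structure into a text generator.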
Markov Chain Implementation

To generate text using the Markov Chain model requires a set of probabilities
of a word appearing after another. In order to do this, a dictionary model was
used in Python.
This dictionary contains start and end words that signify the beginning and end
of a sentence, while the words that follow are accumulated in the dictionary
by analyzing some text data. For a simple two-sentence training set, two start
words and two end words are generated.
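As an illustration of the dictionary format (the two training sentences here are invented, not the report's), a two-sentence training set might produce:

```python
# Dictionary built from the hypothetical sentences "the cat sat" / "the dog ran".
# "START" and "END" are sentinel keys marking sentence boundaries.
markov_dict = {
    "START": ["the", "the"],   # two start words, one per sentence
    "the": ["cat", "dog"],
    "cat": ["sat"],
    "dog": ["ran"],
    "sat": ["END"],            # "sat" and "ran" are the two end words
    "ran": ["END"],
}
```

Repeated followers are stored as duplicates, so choosing uniformly from a word's list reproduces the observed transition probabilities.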
Python code is used to train the Markov Chain model. The code extracts a start
word, iterates through the sentence capturing each word, then extracts a stop
word. These are all saved in a dictionary format that is used throughout the
program.
CODE
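The report's original listing is not reproduced here; the following is a minimal, self-contained sketch of the training and generation steps described above. The function names, and the "<s>"/"</s>" boundary markers, are our own choices, not the report's:

```python
import random

START, END = "<s>", "</s>"

def train(sentences):
    """Build a word -> list-of-followers dictionary from raw sentences."""
    model = {START: []}
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        model[START].append(words[0])                 # extract the start word
        for cur, nxt in zip(words, words[1:]):        # capture each word pair
            model.setdefault(cur, []).append(nxt)
        model.setdefault(words[-1], []).append(END)   # extract the stop word
    return model

def generate(model, rng=random):
    """Walk the chain from a random start word until an end marker is drawn."""
    word = rng.choice(model[START])
    out = []
    while word != END:
        out.append(word)
        word = rng.choice(model[word])
    return " ".join(out)

model = train(["the cat sat", "the dog ran"])
sentence = generate(model)
```

With the two training sentences above, every generated sentence starts with "the" and ends with "sat" or "ran", since those are the only recorded boundary words.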
RESULTS AND ANALYSIS
1. Prediction Accuracy:
- Markov models can achieve decent prediction accuracy, especially when trained on large
and diverse datasets. The accuracy may vary based on the order of the Markov model
(unigram, bigram, trigram, etc.).
2. Context Sensitivity:
- The accuracy of predictions is highly dependent on the size of the context window
considered. Smaller context windows (e.g., unigram) may provide limited context, while
larger ones (e.g., trigram) might capture more nuanced relationships between words.
3. Training Data:
- The quality and diversity of the training data significantly influence the performance of
Markov models. Models trained on domain-specific or highly varied datasets tend to
perform better in capturing context.
4. Model Complexity:
- Higher-order Markov models provide more context information but also increase
computational complexity and resource requirements. Striking a balance between model
complexity and performance is essential.
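The order trade-off discussed above can be sketched concretely. The toy corpus below is invented for illustration: a bigram model conditions on one previous word, a trigram model on two, so the trigram context is less ambiguous but each context is seen less often:

```python
from collections import defaultdict

corpus = "the cat sat on the mat".split()

# Bigram model: next word conditioned on one previous word.
bigram = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    bigram[a].append(b)

# Trigram model: next word conditioned on the two previous words.
trigram = defaultdict(list)
for a, b, c in zip(corpus, corpus[1:], corpus[2:]):
    trigram[(a, b)].append(c)

# "the" alone is ambiguous (followed by "cat" and "mat"),
# while the pair ("on", "the") pins the continuation down to "mat".
```

The trigram table also has more distinct keys than the bigram table for the same text, which is the sparsity cost mentioned above.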
Analysis:
1. Contextual Understanding:
- Markov models capture local dependencies between words but may struggle with
understanding global context and long-range dependencies. This limitation is especially
apparent in situations where context from distant words is crucial for accurate predictions.
2. Handling Ambiguity:
- Markov models may struggle with ambiguous situations where multiple words can
reasonably follow a given context. Techniques such as smoothing or backoff strategies are
often employed to address these challenges.
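Add-one (Laplace) smoothing, one of the techniques mentioned above, can be sketched as follows; the counts and vocabulary size are invented for illustration:

```python
def laplace_prob(count, context_total, vocab_size):
    """P(word | context) with add-one smoothing: every word, including
    unseen ones, receives a small non-zero probability."""
    return (count + 1) / (context_total + vocab_size)

# Suppose some context word appeared 10 times in training,
# over a vocabulary of 5 word types.
seen = laplace_prob(3, 10, 5)    # a word that followed the context 3 times
unseen = laplace_prob(0, 10, 5)  # a word that never followed the context
```

Without smoothing the unseen case would get probability zero, which would make any sentence containing it impossible under the model.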
3. Generalization:
- Markov models may generalize well within the training data but may not perform as
effectively on sentences with structures or contexts not present in the training set.
Generalization is a key consideration when evaluating the robustness of the model.
4. Applicability:
- Sentence completion using Markov models is well-suited for certain applications, such as
text generation, autocomplete suggestions, and predictive typing. However, it may not be
the best choice for tasks requiring a deep understanding of semantics and context.
5. Model Order:
- Experimentation with different orders of Markov models is crucial for finding the optimal
balance between model complexity and performance. Higher-order models may capture
more context but may also introduce sparsity issues.
CONCLUSION & FUTURE WORK
1. Markov models offer a reasonable balance between simplicity and predictive accuracy.
2. Context window size and training data quality significantly impact model performance.
3. Markov models may struggle with global context understanding and handling ambiguity.
Further Work:
1. Hybrid Models:
- Explore hybrid models that combine Markov models with more advanced techniques,
such as neural networks or transformer models. This can help address the limitations of
Markov models related to global context understanding and semantic complexities.
2. Adaptive Context Windows:
- Experiment with dynamic context window sizes based on the nature of the input text.
Adaptive mechanisms could be employed to adjust the window size according to the
context's complexity, providing more nuanced predictions.
5. Long-Range Dependencies:
6. Domain-Specific Training:
7. Evaluation Metrics:
- Define and use appropriate evaluation metrics that go beyond traditional accuracy,
considering factors like contextual coherence, fluency, and adaptability to various writing
styles.
8. Real-Time Applications:
- Optimize Markov models for real-time applications, such as predictive typing or chatbot
responses, by exploring lightweight architectures and efficient algorithms.
REFERENCES

3. C. Kiddon, L. Zettlemoyer, and Y. Choi, "Globally coherent text generation with neural checklist models," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 329–339, 2016.
5. R. M. Everett, J. R. Nurse, and A. Erola, "The anatomy of online deception: What makes automated text convincing?," in Proceedings of the 31st Annual ACM Symposium on Applied Computing, pp. 1115–1120, ACM, 2016.
6. P. Halácsy, A. Kornai, and C. Oravecz, "HunPos: an open source trigram tagger," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209–212, Association for Computational Linguistics, 2007.
7. R. Alami, "How I generated inspirational quotes with less than 20 lines of Python code." https://codeburst.io/how-i-generated-inspirational-quotes-with-less-than-20-lines-of-code-38273623c905. Accessed 12/3/2019.