
Instituto Tecnológico y de Estudios Superiores de Monterrey

Monterrey Campus

School of Engineering and Sciences

Intent Discovery from Conversational Logs to Prepare a Student Admission Chatbot for Tecnológico de Monterrey
A thesis presented by

Rolando Treviño Lozano


Submitted to the
School of Engineering and Sciences
in partial fulfillment of the requirements for the degree of

Master of Science
in

Computer Science

Monterrey, Nuevo León, June, 2021


Instituto Tecnológico y de Estudios Superiores de Monterrey
Campus Monterrey

The committee members hereby certify that they have read the thesis presented by Rolando Treviño Lozano and that it is fully adequate in scope and quality as a partial requirement for the degree of Master of Science in Computer Science.

Neil Hernández Gress, Ph.D.


Tecnológico de Monterrey
Principal Advisor

Héctor Gibrán Ceballos Cancino, Ph.D.


Tecnológico de Monterrey
Co-Advisor

Joanna Alvarado Uribe, Ph.D.


Tecnológico de Monterrey
Committee Member

Noé Alejandro Castro Sánchez, Ph.D.


Centro Nacional de Investigación y Desarrollo Tecnológico (CENIDET)
Committee Member

Rubén Morales Menendez, Ph.D.


Associate Dean of Graduate Studies
School of Engineering and Sciences

Monterrey, Nuevo León, June, 2021

Declaration of Authorship

I, Rolando Treviño Lozano, declare that this thesis titled, Intent Discovery from Conversa-
tional Logs to Prepare a Student Admission Chatbot for Tecnológico de Monterrey and the
work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this
University.

• Where any part of this thesis has previously been submitted for a degree or any other
qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear
exactly what was done by others and what I have contributed myself.

Rolando Treviño Lozano


Monterrey, Nuevo León, June, 2021

©2021 by Rolando Treviño Lozano


All Rights Reserved

Dedication

To my family, who support me in my decisions, motivate me, and always show me that there are no limits in life but our imagination itself.

To my friends, who make difficult times go by as easily as possible.

To past and future me, for never surrendering when pursuing my dreams.

Acknowledgements

This research would not have been possible without the guidance and support of outstanding people at Tecnológico de Monterrey.

Firstly, I want to show my gratitude to Dr. Neil Hernández for allowing me to experience this research and for introducing me to the area of natural language processing, which I had no idea about at the beginning of the research.

I wish to express special thanks to Dr. Héctor Ceballos for his outstanding guidance and deepest support throughout this research. Thank you for your patience and the lessons you provided during this time.

Also, I would not have completed this research without the support of Dr. Joanna Alvarado. Thank you for demonstrating that all things are possible, for your unique friendship, and for dedicating your time to assisting me with difficult tasks.

I also wish to express my deepest appreciation to my closest friends, who are always there to guide and listen to me, and to the colleagues I met during this master's program: Emmanuel Vázquez, Andree Vela, Raúl Martínez, and Miguel Lara. Thanks for all the laughs, lessons, and motivation; we made it through each of our projects during the unprecedented time of the Covid-19 quarantine. I am proud of each one of my friends.

Special recognition goes to my family, who showed me their support, motivation, and patience during this research, which was not always easy to go through. If it were not for their love and belief in me, I would not be where I am at each stage of my life. Thanks for always being there for me.

Lastly, I would like to express deep gratitude to Tecnológico de Monterrey and CONACyT for their financial aid with tuition and their support, allowing me to grow academically and contribute to the scientific community.

Intent Discovery from Conversational Logs to Prepare a
Student Admission Chatbot for Tecnológico de Monterrey
by
Rolando Treviño Lozano
Abstract

Online chat services allow companies to serve their customers and resolve problems or doubts about a specific topic. Lately, conversational bots have been adapted to this domain, allowing a broader attention capacity while easing interactions between users and the company; they also ease the work of agents, increasing productivity and service quality. Designing a chatbot is a time-consuming task, as the designer has to provide the core key concepts, known as intents, that the conversational bot will respond to, along with example sentences and their respective answers. We propose a framework that receives as input conversational transcripts between prospects and agents and transforms them, through the use of regular expressions, into a tabular dataset of the conversations in log format, easing their analysis. The conversations are then converted into a convenient word representation, TF-IDF, which serves as input for unsupervised machine learning algorithms: Non-Negative Matrix Factorization for topic modeling and K-Means for utterance clustering. The discovered clusters suggest possible intents, which can then be passed on to the design of a knowledge base; this last step of intent discovery enables an iterative process in which new conversations are processed to identify changes in the intents or the addition of new ones. Results demonstrate that it is possible to cluster the utterances and find clusters that align with a possible intent out of a list of candidate intents, where such a list is subject to change over time, continuously improving intent discovery. A cosine similarity threshold of 0.47 was set to differentiate correctly aligned clusters from those not aligned; 18 intents out of 55 were correctly aligned with the initial intent list, and a total of 35 different intents were captured by the clustering process. No closely similar research was found in the literature: other works in the domain assume an already curated and labeled dataset and begin by classifying the intents rather than discovering them during the knowledge base design, and they do not take into account the whole process of transforming the raw conversations into a tabular, processed dataset.

List of Figures

2.1 General text-based chatbot architecture [46]

3.1 General framework for discovering intents
3.2 Research methodology
3.3 Top 25 countries from users
3.4 Top 25 Mexican states from users
3.5 Messages by department
3.6 Messages by month
3.7 Messages by day of week
3.8 Messages by hour
3.9 Tecbot conversation duration in contrast to all conversations
3.10 Average messages sent by an agent and Tecbot per conversation
3.11 Non-Negative Matrix Factorization visualized
3.12 K-means algorithm extracted from [48]

4.1 Message count by month, SOAD department only
4.2 Message length box plot (raw)
4.3 Message length box plot (cleaned)
4.4 Word clouds before and after text preprocessing
4.5 Word frequency, raw text without stopwords filter (Top 25)
4.6 Word frequency, lemmatized text with stopwords filter (Top 25)
4.7 BoW - NMF (Frobenius norm)
4.8 BoW - NMF (Kullback-Leibler divergence)
4.9 TF-IDF - NMF (Frobenius norm)
4.10 TF-IDF - NMF (Kullback-Leibler divergence)
4.11 Silhouette score, k=50-1000 (BoW)
4.12 Silhouette score, k=50-1000 (TF-IDF)
4.13 Histogram of similarities between utterances and intents
4.14 Similarity thresholds and their respective percentage of utterances that are able to be found
4.15 Clustering to intent alignment for 100 clusters with TF-IDF
4.16 Clustering to intent alignment for 300 clusters with TF-IDF
4.17 Histogram for the similarity values of the 300 clusters to the intents
4.18 Process to reproduce the proposed methodology

List of Tables

2.1 Comparison between text classification models, as mentioned in [34]

3.1 Features of reports regarding conversations
3.2 Features of log format conversations
3.3 Example Bag-of-Words representation
3.4 Example TF representation
3.5 Example IDF representation
3.6 Example TF-IDF representation

4.1 Example conversation in log format
4.2 Example of texts and their lemmatized results
4.3 Statistical information of message length in filtered data (raw)
4.4 Statistical information of message length in filtered data (cleaned)
4.5 Statistical information of message length in filtered data (cleaned messages with length greater than one)
4.6 Statistical information of BoW NMF topics (Frobenius norm)
4.7 Statistical information of BoW NMF topics (Kullback-Leibler divergence)
4.8 Statistical information of TF-IDF NMF topics (Frobenius norm)
4.9 Statistical information of TF-IDF NMF topics (Kullback-Leibler divergence)
4.10 Intents found in correctly aligned clusters (k=100)
4.11 Cluster #4 examples for alignment validation
4.12 Cluster #6 examples for alignment validation
4.13 Cluster #63 examples for alignment validation
4.14 Cluster #2 examples for alignment validation
4.15 Cluster #50 examples for alignment validation
4.16 Cluster #87 examples for alignment validation
4.17 Intents found in correctly aligned clusters (k=300)
4.18 Conversational Bots Services Comparison

Contents

Abstract

List of Figures

List of Tables

1 Introduction
1.1 Problem Statement and Motivation
1.2 Hypothesis and Research Questions
1.3 Objectives
1.4 Main Contributions
1.5 Summary

2 State of the Art
2.1 Machine Learning
2.2 Text Mining and Natural Language Processing
2.2.1 Text Mining Scopes
2.2.2 Natural Language Processing Pipeline
2.2.3 Applications of Text Mining
2.3 Conversational bots
2.3.1 Chatbot Architecture
2.4 Related Work

3 Intent Discovery from Conversational Logs
3.1 Proposed Solution
3.2 Methodology
3.2.1 Methodology Definition
3.2.2 Data Collection
3.2.3 Data Preprocessing
3.2.4 Data Analysis
3.2.5 Data Preparation
3.2.6 Model
3.2.7 Evaluation
3.3 Mathematical Theory of Applied Techniques

4 Results
4.1 Data Preprocessing
4.1.1 Transformation to Log Format
4.1.2 Text Preprocessing
4.2 Data Preparation and Exploration
4.3 Model
4.3.1 Word Representations
4.3.2 Topic Modeling
4.3.3 Text Clustering
4.4 Evaluation
4.4.1 Silhouette Coefficient
4.4.2 Clustering to Intent Alignment
4.5 Reproducibility Illustration
4.6 Conversational Bots Comparison Benchmark

5 Discussion
5.1 Data Preprocessing
5.2 Data Analysis and Data Preparation
5.3 Model and Evaluation

6 Conclusion
6.1 Future Work

Bibliography

Chapter 1

Introduction

Artificial Intelligence (AI) has been growing at an incredible rate in the past years [16, 54].
Text Mining is the process of discovering, by computer, information that was previously unknown due to the unstructured representation of raw text [20]. Implementation of text
mining in business has transitioned from novelty to common usage; in fact, the most popular
application in customer service is chatbots [32]. A chatbot system is a software program that
interacts with users via conversations using natural language [49].
The very first chatbot dates back to 1966 at the Massachusetts Institute of Technology (MIT) Artificial Intelligence Laboratory: ELIZA, created by Joseph Weizenbaum. It simulated a psychotherapist character named "Eliza Doolittle" and maintained conversations using a pattern matching and substitution methodology; that is, it recognized certain keywords that allowed it to answer in the form of a question, simulating a context. ELIZA gave the illusion of being capable of understanding context while having no built-in framework for contextualizing events [50].
The next major chatbot was A.L.I.C.E. (Artificial Linguistic Internet Computer Entity), brought to life in 1995. It won the Loebner Prize, awarded to the computer programs considered the most human-like, three times (2000, 2001, and 2004). Even though it was not able to pass the Turing Test, it served as the basis for the creation of other chatbots thanks to its AIML (Artificial Intelligence Markup Language) implementation [49].
One of the latest innovations in artificial intelligence and conversational systems is IBM Watson, a program that is able to answer questions in conversational form, that is, using natural language. It became popular in 2011 after participating in the TV contest show Jeopardy! [30] and winning against the show's champions. Watson has contributed to healthcare by analyzing critical issues in which an AI like Watson could help physicians and by finding the best way to achieve successful human-machine interaction in order to provide optimal assistance [3].
Correspondingly, in 2014 bots completed about 15% of all the edits in the Wikipedia encyclopedia. They completed tasks such as cleaning up maliciously intended modifications to documents, enforcing bans, inter-linking language links, importing content automatically, and identifying copyright violations, among others [52].
As an example of the impact that chatbots may have on society, Microsoft launched Tay in 2016. It was implemented on Twitter, intending to simulate a teenage girl. Tay was set to interact with and learn from interactions with other users on the platform. After a few hours, Tay began to express derogatory and racist comments [16], forcing Microsoft to shut the bot down.
Nowadays, messaging services like Facebook Messenger [16], Telegram, and WhatsApp host thousands of chatbots on their platforms, allowing many businesses to increase their sales and improve their customer service.
To illustrate, customer support is an intimate experience, as the customer's fidelity to the company depends on how well his/her needs are treated. The customer must always be at the center of any decision made by companies, and every implementation of an AI should benefit the customer first and the company second. All businesses should take the opportunity to deliver a better experience to their customers, and one of the best ways to project such a caring corporate image is through an automated chatbot service [23]. Thus, online customer support has become a more demanding department for enterprises as the Internet has become more accessible to users [8], and it is more practical for a user to log into a chat room with the company's support team than to visit the offices in person to request help.
In fact, human-agent to user interactions represent an investment for each of the agents attending the users' inquiries. A conversational chatbot service could help the industry reduce the number of agents, minimizing the investment in such a department. A modern AI-based chatbot's process consists of receiving data, giving meaning to it by applying natural language understanding techniques, and then acting according to its knowledge base [46], which provides a proper response following the rules set by the chatbot designers. The knowledge base of a modern AI-based chatbot can be seen in terms of entities and intents. Entities are the subjects that a user is referring to, and intents are the actions that the user seeks to do with such entities [46]. For example, in the sentence "tell me the weather for Monterrey city", Monterrey city is the entity, and the user wants to know the weather for that city: the intent. The whole sentence can be reduced to weather(Monterrey city). The design of such entities and intents can require considerable effort. A poorly designed knowledge base will result in the chatbot not being able to understand user utterances or, even worse, in giving wrong answers to such utterances.
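As a minimal illustration of this idea, the parsed form of an utterance can be represented as an intent name plus entity slots. The sketch below is hypothetical: the ParsedUtterance type and its fields are ours, not part of any chatbot service's API.

```python
# Hypothetical parsed representation of the utterance
# "tell me the weather for Monterrey city".
from dataclasses import dataclass, field

@dataclass
class ParsedUtterance:
    intent: str                                    # the action the user wants
    entities: dict = field(default_factory=dict)   # domain-specific items

parsed = ParsedUtterance(
    intent="weather",
    entities={"city": "Monterrey city"},
)
print(f"{parsed.intent}({parsed.entities['city']})")  # -> weather(Monterrey city)
```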

1.1 Problem Statement and Motivation


This research represents an internal use case for the Tecnológico de Monterrey. The goal of
this work is to allow a smooth transition for an online chat service driven by human agents to
a conversational agent set up in a chatbot service incorporating natural language processing
and machine learning.
Tecnológico de Monterrey currently has 93,168 students in total, of which 27,402 are high school students, 58,782 are undergraduate students, and 6,984 are graduate students. This means that a large number of students can consult and make use of the admissions chat service [51].
A total of 12 agents working in the chat service can only attend to a limited number of users. An automated chatbot service helps in such a scenario by providing service twenty-four hours a day, seven days a week, to support the users asking for assistance in the chat service [32]. At the moment, the admissions department provides a decision-based chatbot, TecBot, as a front line for online support: the most common queries can be answered by the bot, with transfers to human agents for more specific problems or questions. However, a decision-based bot limits the domain of questions that can be asked of the conversational system while also providing a computerized experience [36] that may not comply with user satisfaction evaluations, as mentioned in [44, 49]. These issues are addressed by implementing an AI-based bot, which creates a natural communication flow with the user.
We set out to analyze, clean, process, and apply text mining methods to conversations carried out from January 2019 until the end of December 2019, in order to identify the sets of similar dialogues that, when grouped together based on their similarities, help to identify the set of intents to be added to the knowledge base.
One challenge we face is language. Natural language processing research for Spanish is less common than it is for English. In order to perform a considerable analysis and processing of the data, we need to pay further attention to what is to be done, and also add or create custom functionality in order to reach a specific goal.

1.2 Hypothesis and Research Questions


It is possible to group a text corpus of conversations between users and agents of a student admission online chat service into a set of clusters, discovering intents through unsupervised machine learning algorithms, in order to aid in the design of an AI-based chatbot.
The research questions are:

• What is the process to transform the unstructured data into a representation to be used
by machine learning algorithms?

• How can we identify similar utterances and cluster them?

• Which machine learning models are appropriate to gather the sets of similar utterances?

• How can we evaluate the results and determine if a cluster is associated with an existing intent? How can we resolve new intents?

• What is the similarity threshold that distinguishes clusters aligned with an intent? How can such a value be established?

1.3 Objectives
Conversational agents have become a key focus of customer service in many industries. The objective of this thesis is to help construct a knowledge base consisting of intents extracted from log data in an unsupervised manner. Natural language conversations from the student admission online chat service of Tecnológico de Monterrey served as input for our methodology to extract the information needed to set up a conversational chatbot.
To fulfill this thesis’ main objective, several specific objectives are proposed:

• Collect data from the student admissions department containing conversations between
users and agents during a determined period.

• Analyze the collected data and create a general report of insights into the conversations.

• Transform the unstructured text data into a convenient representation to be processed by machine learning algorithms.

• Apply natural language processing techniques to the new representation of the text data
to find clusters of similar utterances.

• Compare the sets of similar utterances against existing intents and gather their similarity measures.

• Evaluate the alignment of the clusters with their intent similarity, determine those correctly aligned with such an intent and those that are not, and establish the similarity threshold that differentiates them.

1.4 Main Contributions


The contributions of this research to the scientific community are as follows:

• Our research shows practical approaches to transform unstructured text data belonging to conversations into a log format that clearly presents the structure of the conversations. It also provides the necessary steps to process, clean, and standardize the text so that machine learning algorithms can be applied to the resulting texts in a convenient word representation.

• We apply two different techniques to discover insights into the intents found in the texts: topic modeling and clustering. The former serves as a guide to discover those texts that are not yet identified as related to an intent, and the latter captures the groups of similar questions given a distance metric.

• Our solution aids the process of manually labeling conversational logs, a time-consuming activity, by providing the groups that explain a possible user intent. This contributes to the task of implementing conversational bots using cloud services such as Microsoft Azure, Google Dialogflow, or IBM Watson, where a set of intents and their respective examples are requested; such intents and examples are the results of our framework.

1.5 Summary
For this work, we will develop a framework that outputs a well-designed knowledge base for a conversational chatbot. The input to such a framework is a text file of logs from real conversations, containing specific fields as columns to maintain a standardized format. Such files will be pre-processed and analyzed, and then algorithms belonging to text mining, machine learning, and natural language processing will be applied in order to output the desired intents that construct the chatbot's knowledge base.
The scope of this research is as follows: we work with chat logs from an online customer support system, which for this research comes from the admissions department of Tecnológico de Monterrey, and the technologies applied to such input are text mining, machine learning, and natural language processing. We seek to group similar dialogues in order to construct the knowledge base of intents.
Chapter 2

State of the Art

This chapter provides the concepts that will support the presented work. It is divided into
different sections of theory. First, machine learning general concepts are shown. Next, a more
specific theory is presented about text mining and natural language processing (NLP). Then,
the theory about the processing of text is described, later showing the procedure that is used
for the completion of this research. Finally, concepts regarding chatbots are presented.

2.1 Machine Learning


Aurélien Géron [19] states that machine learning is the science of programming computers to learn from data. In his book Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, he supports the concept of Machine Learning with two more definitions:

• “Machine Learning is the field of study that gives computers the ability to learn without
being explicitly programmed.”, Arthur Samuel, 1959.
• “A computer program is said to learn from experience E concerning some task T and
some performance measure P, if its performance on T, as measured by P, improves with
experience E.”, Tom Mitchell, 1997.

The importance of machine learning lies in the broad possibility of aiding humans in decision processes that would normally consume a considerable amount of time if done manually [35].
Machine Learning systems can be classified according to the following characteristics, as mentioned in [19]:

• Depending on whether the system is trained under human supervision.


– Supervised learning: refers to when the training set fed to the algorithm includes the outcome of each data point, also known as labels [19].
The two main tasks in this supervised approach are:
* Classification: given a training dataset with known classes (labels corresponding to a classification of the data points), a classification model learns how to classify new data points accordingly.
* Regression: a task in which the training dataset contains features along with the desired predictor or target variable (the label). A regression system needs to be fed many examples of the data in context in order to predict accordingly.
Some of the most important examples of supervised learning algorithms are:
* k-Nearest Neighbors
* Linear Regression
* Logistic Regression
* Support Vector Machines (SVMs)
* Decision Trees and Random Forests
* Neural Networks (some architectures of which can also be used in unsupervised and semisupervised contexts)
– Unsupervised learning: contrary to supervised learning, the training set fed to the algorithm does not include the labels. Unsupervised learning algorithms try to learn without given guidance of the data [19].
The most important unsupervised approaches along with their algorithms are:
* Clustering: Helps detect groups with similar features within the data.
· K-Means
· DBSCAN
· Hierarchical Clustering Analysis (HCA)
* Anomaly detection and novelty detection: helps determine data points that do not follow a pattern present in the rest of the dataset.
· One-class SVM
· Isolation Forest
* Visualization and dimensionality reduction: helps reduce the dimensions of a dataset (commonly to 2D or 3D representations) while maintaining structure and integrity, in order to visualize the data in a plot and appreciate how it is separated in space.
· Principal Component Analysis (PCA)
· Kernel PCA
· Locally Linear Embedding (LLE)
· t-Distributed Stochastic Neighbor Embedding (t-SNE)
* Association rule learning: helps find specific patterns or associations within the data, revealing insightful information that may explain the problem in context.
· Apriori
· Eclat
– Semisupervised learning: located most of the time between supervised and unsupervised learning, semisupervised learning deals with training datasets that are partially labeled. Given a few examples of well-labeled data points, a semisupervised algorithm propagates labels across similar data points, saving the time-consuming task of manually labeling many examples [19].
– Reinforcement learning: the context represents an agent in an environment that is able to select among actions leading to rewards or penalties, depending on whether the chosen action improves or decreases a given performance measure. As the agent performs actions, it updates its internal strategy for action selection in a given context, called the policy. The goal is to learn the policy that maximizes the reward obtained over time [19].

• Depending on whether the system can learn incrementally while running in deployment.

– Batch learning: also known as offline learning; the system is unable to learn incrementally from new incoming data. The workflow is: first, the system is trained on all available data; then it is launched or deployed into a production environment and keeps running without continuous learning.
– Online learning: the system is able to continue learning as data streams keep arriving, allowing the system to learn from new data incrementally.

• Depending on whether the system works by comparing new data points to previously known ones, or by finding patterns in the available data and building a predictive model.

– Instance-based learning: the system assigns to new data points the labels of the training data points that are closest according to a similarity measure.
– Model-based learning: the system is trained on a data set, fitting a specific algorithm's parameters to best fit the data; then, when new data points arrive, applying the model with the parameters found during training predicts the labels of the new data.
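A minimal sketch contrasting the supervised and unsupervised settings above, under toy data (scikit-learn is used here only for illustration; none of this is prescribed by [19]):

```python
# Toy contrast between supervised learning (labels given) and
# unsupervised learning (no labels): same points, two tasks.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = [[0.0], [0.2], [0.9], [1.1]]       # one feature per data point
y = [0, 0, 1, 1]                       # labels, available only when supervised

clf = LogisticRegression().fit(X, y)   # supervised: learns from (X, y)
print(clf.predict([[0.1], [1.0]]))     # -> [0 1]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: X only
print(km.labels_)                      # group assignments discovered from X alone
```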

2.2 Text Mining and Natural Language Processing


As stated previously, text mining allows users to discover information by transforming unstructured raw text into a structured representation, easing the extraction of valuable information that was not accessible in the previous form [20]. Text mining uses Natural Language Processing (NLP) to transform the unstructured data into a structured form. NLP is an area of computer science in charge of methods to analyze, model, and understand human language [53].
The main difference between the two concepts is that text mining (or text analytics) focuses on obtaining insights from textual data to aid in decision-making processes, while natural language processing looks into the data to extract key information or create a model that complies with a specific task, be it classification, generation, or clustering of texts.

2.2.1 Text Mining Scopes


Kowsari et al. [34] mention the following scopes of text mining:

• Document-level: Defines categories present in a full-text document.

• Paragraph-level: Defines categories present per paragraph in a document.

• Sentence-level: Defines categories present per sentence found in a paragraph.

• Sub-sentence-level: Defines categories present per expression found in a sentence.

2.2.2 Natural Language Processing Pipeline


Vajjala et al. [53] define a step-by-step processing pipeline for text, as follows:

• Data acquisition: some strategies for acquiring data are: using public datasets, scraping
data from the web, data augmentation, among others.

• Text cleaning: covers cleaning up the data, e.g., HTML parsing, Unicode normalization, and spelling and system-specific error correction.

• Pre-processing: defines the procedure to clean the data at a sentence or word level, covering sentence and word tokenization, stop word removal, stemming and lemmatization, removal of digits and punctuation, and lowercasing, as well as normalization, parsing, and Parts-of-Speech (POS) tagging.

• Feature engineering: captures the characteristics of text in a numeric vector representation that can be understood by ML algorithms. These can be statistical measures of word presence, such as One-Hot Encoding, Bag-of-Words, Bag-of-NGrams, and term frequency-inverse document frequency (TF-IDF), or distributed representations such as word embeddings.

• Modeling: refers to the application of ML algorithms in terms of the task to be solved given the context.

• Evaluation: in order to determine the "goodness" of a model, we can apply metrics such as accuracy, precision, recall, F1 score, and AUC, among others, chosen according to the algorithm.

• Deployment: most machine learning deployments are part of a larger system and run as web services, taking the specific input for the task and returning the result as part of a broader pipeline in the system.

• Monitoring and model updating: it is important to keep monitoring the behavior of the model given its modality (either offline or online learning), in order to ensure that its performance on new input data remains as desired, and to update the model if necessary.
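As a minimal sketch of the text cleaning, pre-processing, and feature engineering steps above, assuming Spanish chat utterances (the cleaning rules and example sentences are illustrative, not the exact pipeline used later in this thesis):

```python
# Minimal sketch of text cleaning + feature engineering for Spanish chat text.
# The cleaning rules and example utterances are illustrative only.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def clean(text: str) -> str:
    text = text.lower()                          # lowercasing
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"\d+", " ", text)             # remove digits
    text = re.sub(r"[^\wáéíóúüñ\s]", " ", text)  # remove punctuation, keep accents
    return re.sub(r"\s+", " ", text).strip()

utterances = [
    "Hola, ¿cuáles son los requisitos de admisión?",
    "¿Cuánto cuesta la colegiatura en 2019?",
]
cleaned = [clean(u) for u in utterances]

# TF-IDF turns the cleaned texts into a numeric matrix for ML algorithms.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cleaned)
print(X.shape, vectorizer.get_feature_names_out()[:5])
```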

2.2.3 Applications of Text Mining


Text mining is an interdisciplinary field, as it covers many areas of computing in order to process textual data and output the expected results. Some of its areas are mentioned in [55]:

Text Classification
This area corresponds to training a model that learns to classify documents according to their training labels, so new records can be classified as they are given to the trained model. Several classification algorithms are mentioned in [34]; some of them are stated in Table 2.1.

Logistic Regression
Advantages:
• Easy to implement
• It does not require too many computational resources
• It does not require input features to be scaled
• It does not require any tuning
Disadvantages:
• It cannot solve non-linear problems
• Prediction requires that each data point be independent
• Attempts to predict outcomes based on a set of independent variables

Naïve Bayes Classifier
Advantages:
• It works very well with text data
• Easy to implement
• Fast in comparison to other algorithms
Disadvantages:
• A strong assumption about the shape of the data distribution
• Limited by data scarcity: for any possible value in feature space, a likelihood value must be estimated by a frequentist approach
• Attempts to predict outcomes based on a set of independent variables

K-Nearest Neighbor
Advantages:
• Effective for text data sets
• Non-parametric
• More local characteristics of the text or document are considered
• Naturally handles multi-class data sets
Disadvantages:
• Computation of this model is very expensive
• Difficult to find the optimal value of k
• A constraint for large search problems to find nearest neighbors
• Finding a meaningful distance function is difficult for text data sets

Support Vector Machines
Advantages:
• SVM can model non-linear decision boundaries
• Performs similarly to logistic regression when the data is linearly separable
• Robust against overfitting problems (especially for text data sets, due to the high-dimensional space)
Disadvantages:
• Lack of transparency in results, caused by a high number of dimensions (especially for text data)
• Choosing an efficient kernel function is difficult (susceptible to overfitting/training issues depending on the kernel)
• Memory complexity

Decision Tree
Advantages:
• Can easily handle qualitative (categorical) features
• Works well with decision boundaries parallel to the feature axis
• A decision tree is a very fast algorithm for both learning and prediction
Disadvantages:
• Issues with diagonal decision boundaries
• Can be easily overfit
• Extremely sensitive to small perturbations in the data
• Problems with out-of-sample prediction

Random Forest
Advantages:
• Ensembles of decision trees are very fast to train in comparison to other techniques
• Reduced variance (relative to regular trees)
• It does not require preparation and pre-processing of the input data
Disadvantages:
• Quite slow to create predictions once trained
• More trees in the forest increase the time complexity of the prediction step
• Not as easy to visually interpret
• Overfitting can easily occur
• Need to choose the number of trees in the forest

Deep Learning
Advantages:
• Flexible with feature design (reduces the need for feature engineering, one of the most time-consuming parts of machine learning practice)
• Architecture that can be adapted to new problems
• Can deal with complex input-output mappings
• Can easily handle online learning (it makes it very easy to re-train the model when newer data becomes available)
• Parallel processing capability (it can perform more than one job at the same time)
Disadvantages:
• Requires a large amount of data (if you only have small sample text data, deep learning is unlikely to outperform other approaches)
• It is extremely computationally expensive to train
• Model interpretability is the most important problem of deep learning (deep learning is most of the time a black box)
• Finding an efficient architecture and structure is still the main challenge of this technique

Table 2.1: Comparison between text classification models, as mentioned in [34]

Text Clustering
Similar to document classification, the task of document clustering is to organize documents into groups of similar ones. In contrast to document classification, in this task the document labels are not provided.
• K-Means: groups the data into k different groups by iterating multiple times: assigning different centroids, calculating which centroid is closest for each data point, then calculating an average centroid for each group and starting the iteration again until no changes are made to the assigned clusters [4] (a short sketch follows this list).

• Hierarchical: can be agglomerative (AGNES), where every single instance starts as a cluster and similarities between them allow grouping by iteratively aggregating them until a single group is reached, or divisive (DIANA), where the data starts as one single set and is divided until single groupings of the data points are found [4].

• Density-Based (DBSCAN): is able to handle noise and find a shape in unstructured data. It also does not need an indicator equivalent to the k number of clusters, as seen in the K-Means algorithm [9].
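A minimal sketch of K-Means over TF-IDF vectors follows, with the silhouette score as a quick quality check; the toy corpus and the choice of k are illustrative only:

```python
# Minimal sketch: K-Means over TF-IDF vectors, evaluated with the silhouette score.
# The toy corpus and the choice of k are illustrative only.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = [
    "what are the admission requirements",
    "which documents do I need to apply",
    "how much is the tuition",
    "what scholarships are available",
]
X = TfidfVectorizer().fit_transform(docs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:", kmeans.labels_)
print("silhouette:", silhouette_score(X, kmeans.labels_))
```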

Another approach to organizing documents by their contents is the technique known as Topic Modeling, an unsupervised learning technique that represents the topics present in a set of documents according to their representative contents [4]. Algorithms for topic modeling are described as follows:
• Latent Semantic Analysis (LSA) [13]: in contrast to the statistics behind LDA, LSA
returns groups of documents that contain the same words.

• Probabilistic Latent Semantic Analysis (PLSA) [24]: compared with LSA, it makes use of mixture decomposition inferred from a latent class model.

• Latent Dirichlet Allocation (LDA) [6]: topics are expressed as probability distributions
in which a given set of terms may occur. It allows for flexibility of topics as non-distinct
words may be encountered in different themes.

• Non-Negative Matrix Factorization (NMF or NNMF) [40, 37]: states that a term frequency-inverse document frequency matrix can be decomposed into two factors: terms per topic and documents per topic (a short sketch follows below).
The general recommendation stated in [4] is that LSA works better when dealing with larger corpora for learning descriptive topics, while LDA, as well as NMF, can be used when dealing with shorter and more compact textual data. Chen et al. [10] performed a set of experiments comparing LDA and NMF on short texts such as tweets, news headlines, and forum questions, and concluded that NMF is able to produce higher-quality topics than LDA.
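A minimal sketch of NMF for topic modeling, under an illustrative corpus: the TF-IDF matrix X (documents x terms) is factored into W (documents x topics) and H (topics x terms), and the largest entries of each row of H give the top terms per topic.

```python
# Minimal sketch: NMF decomposes the TF-IDF matrix X (documents x terms)
# into W (documents x topics) and H (topics x terms). Corpus is illustrative.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "tuition cost payment plans",
    "tuition scholarships financial aid",
    "admission exam dates and requirements",
    "admission documents and enrollment",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)        # document-topic weights
H = nmf.components_             # topic-term weights
terms = vectorizer.get_feature_names_out()
for t, row in enumerate(H):     # print the top terms per topic
    top = row.argsort()[::-1][:3]
    print(f"topic {t}:", [terms[i] for i in top])
```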

Information Retrieval
Given a set of documents and a query, the task of information retrieval is to search through the documents and return those matching the query. One basic example is computing document similarity over the set of documents. Schütze et al. mention in [48] that the concept is to find material of an unstructured nature (referring to the raw text form) that satisfies an information need, from within a large collection stored on computers.

Information Extraction
Information Extraction refers to extracting specific information and transforming the textual representation into a corresponding structured model, such as translating a sales document into a spreadsheet format reporting the stated textual sales information. Jurafsky et al. [31] define information extraction as turning the unstructured data embedded in raw text files into structured information, allowing storage, for example, in a relational database. Common tasks include relation extraction, temporal data extraction, event extraction, and template filling.

2.3 Conversational bots


A chatbot is a software program that interacts with users via conversations using natural language [49]. Users are able to access data and services by exchanging natural language communication with such a chatbot [16].
Humans are able to communicate by signs, text, and voice. Human-machine interactions are currently limited to either text-based or voice-based conversations. The most popular form in which a human can interact with a machine is through text; examples of this channel are IBM Watson and QA platforms. Voice-based dialogue platforms, such as Apple Siri, Amazon Alexa, and Google Assistant, have been developed in the previous decade [17].
Chatbots have become more popular in sectors such as automotive, customer support, education, entertainment, finance, healthcare, marketing, manufacturing, and retail, in systems that help workers interact with their virtual assistants in order to automate complex workflows with support 24 hours a day, 7 days a week [17, 32].

2.3.1 Chatbot Architecture


Sánchez-Díaz et al. [46] show the general architecture of a text-based chatbot, as seen in Figure 2.1. A description of each step of the architecture follows:

Figure 2.1: General text-based chatbot architecture [46]



• User input: an utterance in the form of text is received from the user, stating the desired message to be sent to the conversational bot. This is done through an interface that connects the user and the conversational bot service.

• Raw data: refers to extracting the information from the utterance, focusing on its contents and also decomposing features such as context.

• Language understanding: specific to the branch known as natural language understanding; refers to extracting the information that the utterance presents, such as a possible intent (an action desired to be done) or an entity adding to the context (such as a specified place or time). Information such as the grammar, semantics, pragmatics, and logic present in the text is processed in this step in order to consult the corresponding information in the knowledge base, described next.

• Knowledge base: the chatbot knowledge base is the core of the whole bot. It is the database in which all the information that is going to be handled for the responses is stored. Knowledge base designers are responsible for the correct management and interconnection of the information [46] that the bot uses to produce a set of answers to given user inputs.
Intents and entities are defined as follows, as described in [7]:

– Intents: represent the actions that the user intends to accomplish with the chatbot.
– Entities: domain-specific information items found in the utterances associated with an intent.

• Response: once the knowledge base has been consulted, scores are assigned to the most relevant intent found in the utterance, along with its suitable, related answer, which is provided back to the user via the same channel from which the input text was received.

2.4 Related Work


Works related to the scope of this research are described next. The literature analysis covers clustering techniques applied to textual domains, as well as their relation to conversational bot functionality, such as the creation of a knowledge base or natural language understanding dealing with interpreting conversational data.
Intent classification tasks have been applied in the web domain [28, 39, 38, 56], using keywords in query data as well as analyzing a user's clicks through a website in order to predict the intent of the user. Similar to this task, but applied to a commercial domain, is online commercial intent (OCI) [11, 27, 25, 21], where the main objective is to identify the commercial intent (find, buy, among others) of a query.
Jizhou Huang [29] proposed the automatic extraction of knowledge from online discussion forums by extracting <question, answer> pairs using Support Vector Machines (SVMs) and an evaluation that required human interaction, scoring how related an answer is to a topic before training. Kim et al. [33] explored different classifiers (Support Vector Machines for Sequence Tagging, Naïve Bayes, and Conditional Random Fields), along with varying representations of words (Bag-of-Words, TF-IDF), additionally testing extra information over one-on-one conversational data. The results showed that using dialogue structure and inter-utterance dependency provided an increase in performance, and also that using lemmas rather than raw words increases the accuracy of classifiers. Haponchyk et al. [22] make use of labeled datasets to propose a supervised clustering methodology to identify user intents, improving results by comparing their approach against a semantic classifier. Deepak et al. [12] also work on supervised clustering, with a proposal to cluster pairs of questions and their respective answers in order to create and further curate a question archive.
Zhao et al. [58], as well as Aggarwal et al. [2], analyze clustering algorithms for the domain of textual documents. Zhao et al. also reach a conclusion in a comparison between partitional clustering algorithms and agglomerative clustering algorithms, stating that the first kind always leads to better solutions by providing higher-quality results while not requiring as many computational resources as the agglomerative kind, which makes partitional algorithms the recommended implementation for large collections of documents. As for cluster validation, Dudoit et al. [14] make use of the silhouette index in order to estimate the optimal number of clusters. More validation metrics are shown in [48].
Additional tools used in industry to identify intents exist: Microsoft LUIS [57] and Wit.ai [15] are capable of processing conversations, identifying the intents present in a set of utterances, and providing the user a broader insight into how a conversational bot is to be configured to respond to the desired intents.
In the literature reviewed in this domain, no clear procedure is presented on how to transform and standardize data to account for a text preprocessing pipeline. Also, conversational tasks often imply an existing labeled and curated dataset, from which the target class is removed so that a machine learning algorithm can generate a prediction that is compared with the actual target to define the performance of the classification or clustering model. The scope of our research does not account for a label for each of the utterances; instead, it is a framework to discover the possible intents that may be contained in the utterances, starting from the transformation of unstructured data into a tabular dataset and finalizing with the application of an unsupervised machine learning technique to establish clusters of similar utterances. This aids in lowering the costs of manually labeling data and provides insights into the discovery of intents in conversational data in the domain of academic online chat services, specifically from the admissions department.
Chapter 3

Intent Discovery from Conversational Logs

This chapter comprises the methodology followed during the development of this research, covering its definition, data collection, data preprocessing, data analysis, data preparation, modeling, and evaluation.

3.1 Proposed Solution


The proposed framework allows the maintenance of the content in the knowledge base, keeping it updated as new, unknown questions are asked of the chatbot. The proposed framework is shown in Figure 3.1, representing the general concepts that the framework covers for the intents.

Figure 3.1: General framework for discovering intents

1. Data Preprocessing: the conversations dataset goes through a data preprocessing pipeline consisting of two main steps: transforming the data into a log format (which allows separating each dialogue of the conversations), and then cleaning the text to remove noise and standardize the format so it can be converted into a viable word representation.

2. Word Representation: after the conversations have gone through the data preprocessing pipeline, a viable word representation is chosen for use in machine learning algorithms.

3. Intent Discovery: the selected word representation is then used in a machine learning method. In this case, we make use of unsupervised techniques such as clustering.

4. Intent Mapping: each of the clusters is then mapped against existing intents using a similarity measure (a short sketch follows this list). For the case of this research, there exists a collection of options provided for the user to select in the menu-based chatbot implementation, which were selected to represent possible existing intents in the dialogues prior to such implementation.

5. Intent Evaluation: in this step, the clusters and their similarities to existing intents are evaluated. A sample of dialogues from each cluster is extracted, and the intent mapping is validated, allowing us to state whether such mapping is correctly aligned with the intent or not. For those clusters not aligned, this step allows stating whether the cluster can be categorized as a new intent; if so, it is added to the intent collection. It is also possible to define a similarity threshold to automate the decision of an aligned intent, calculated as the mean similarity measure of those clusters correctly aligned with their intents.
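A minimal sketch of the intent mapping and threshold decision, under illustrative data: each cluster is represented by the centroid of its utterances in TF-IDF space and mapped to the most cosine-similar intent. The intent texts and utterances below are invented; the 0.47 threshold is the one established later in this thesis.

```python
# Minimal sketch of intent mapping: compare cluster centroids against intent
# descriptions with cosine similarity. Texts here are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

intents = ["admission requirements", "tuition cost"]      # existing intent list
cluster_utterances = [
    ["what documents do I need for admission", "admission requirements please"],
    ["how much is the tuition", "tuition payment cost"],
]

vectorizer = TfidfVectorizer().fit(intents + [u for c in cluster_utterances for u in c])
intent_vecs = vectorizer.transform(intents)

THRESHOLD = 0.47  # similarity threshold established in this thesis
for i, cluster in enumerate(cluster_utterances):
    centroid = np.asarray(vectorizer.transform(cluster).mean(axis=0))
    sims = cosine_similarity(centroid, intent_vecs)[0]
    best = sims.argmax()
    status = "aligned" if sims[best] >= THRESHOLD else "candidate new intent"
    print(f"cluster {i}: {intents[best]} (sim={sims[best]:.2f}, {status})")
```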

Steps 4 and 5 are two-way, since updating the structure of the intents modifies the similarities between the utterances and the intents themselves. If an intent can be defined in a more suitable way, it may better describe the utterances that are meant to be similar to it.
The process is also iterative in the sense that new conversations are analyzed over time. Intents are prone to modification as users ask new questions about services integrated into the organization, and intents may also change in their own structure in order to specify a variant of an intent that is no longer of use.
The final result of each iteration of our proposed framework is a set of intents accompanied by the respective examples that trigger them. These serve to configure and train a conversational bot, which receives a set of intents along with the examples associated with each respective intent.

3.2 Methodology
In this section, we explain the research methodology carried out to develop this research. A methodology helps us define a road map to fulfill a completion goal. The steps of the research methodology are explained in this section.

3.2.1 Methodology Definition


In order to reach the completion of this research, the methodology shown in Figure 3.2 is composed as follows:

Figure 3.2: Research methodology

• Data Collection: chat conversations are retrieved from the admissions department of ITESM to be treated.

• Data Preprocessing: textual data needs to be parsed into a structured form. We follow the pre-processing data pipeline stated in Section 2.2.

• Data Analysis: before diving into the modeling of a natural language processing model for the data, we first need to understand the behavior of the resulting cleansed dataset by visualizing descriptive statistics, defining a story that explains what the data represent.

• Data Preparation: after the analysis, we can move further into changing the structured
form of the data into a representation that can serve as input to a natural language
processing model.

• Application of Machine Learning Methods: during this phase, natural language processing methods are applied to the chat logs in their respective representation form in order to obtain the desired results, which are the intents represented in the texts.

• Evaluation: this phase consists of manually verifying a sample of the results from the modeling phase. This allows us to generalize the performance of the model and scale the verification up to an explanation of the model.

3.2.2 Data Collection


The dataset utilized for this research was provided by the admissions department of Tecnológico
de Monterrey. The data was acquired in two batches. At the very beginning of this research,
we had access to the first half, covering the dates from January 2019 to August 2019. Later,
in 2020, we gained access to the second batch of conversations, covering the dates from
September 2019 to December 2019, thus completing the whole history of conversations that took
place during 2019. First insights into the data revealed that in July 2019 the admissions
department began deploying a menu-based chatbot in their online chat service; it gave
intermittent service alongside the regular work of the agents until December 2019, when
attention was provided only by such bot.
The original data consisted of monthly reports, each in a comma-separated values (CSV) file.
The columns of these files are described in Table 3.1, covering technical specifications of
the user device, date and time, and the transcript of the conversation in a single string.
The transcript format varied over time; thus data preprocessing represented a challenging
task to perform.

Feature Name | Data Type | Brief Description
Case number | Numeric | Non-unique identification number for specific prospect's tickets
Chat transcript name | Numeric | Unique identification number for the conversation
Chat visitor number | Numeric | Unique identification number of a prospect given the login-to-chat information
Visitor IP address | String | Hashed IP address of prospect when connected to chat
Maximum response time agent | Numeric | Maximum time in seconds which an agent took to respond
Maximum response time visitor | Numeric | Maximum time in seconds which a prospect took to respond
Messages total visitor | Numeric | Number of messages sent by the prospect
Messages total agent | Numeric | Number of messages sent by the agent
Ended by | String | Indicates who ended the conversation (sent the last message)
Chat button | String | Indicates from which website the prospect accessed the chat, representing the respective department
Location | String | Prospect's location when connected to chat
Average response time agent | Numeric | Average response time by an agent
Site reference | String | Unused field
Screen resolution | String | Screen resolution of prospect's device when connected to chat
ISP | String | Name of prospect's Internet Service Provider (ISP) when connected to chat
Platform | String | The operating system used in the prospect's device when connected to chat
Browser | String | Web browser used by the prospect when connected to chat
Chat duration | Numeric | Total duration of the conversation in seconds
Abandoned after | Numeric | Unused field
Creation date | Date | Date in which the conversation took place
Init date | Time | Time of the day in which the conversation took place
Last modification date | Date | Date in which any field of the conversation was modified
Body | String | Whole conversation transcript
Owner | Numeric | Name of agent attending the conversation (states if attended by bot)

Table 3.1: Features of reports regarding conversations

We also had access to the chatbot implemented during mid-2019. The structure of the menus
provided by the bot to the user extends up to five levels in depth, with each option leading
to more specific information about a subject.

3.2.3 Data Preprocessing


This step can be broken down into two major phases: first, the transformation from raw texts
into a log format, and second, the cleaning of the text itself.

Transformation to Log Format


An anonymized example of the original raw string covering the whole transcript of a conver-
sation is the following:

Chat has begun: Friday, January 1, 2019, 12:12:12 (-0500) Chat origin: SOAD
Agent One ( 0s ) Agent One: Hi, how can I help you? ( 42s ) Prospect Two: Hi,
thanks for your support. ( 1m 7s ) Agent One: Bye.

As can be seen, a series of regular expressions can aid in identifying the groups into which
a transcript can be broken down. Once the regular expressions are set, we can convert
such a transcript into a data frame with the columns shown in Table 3.2:

Feature Name | Data Type | Brief Description
CaseID | Numeric | Non-unique representation for specific prospect's tickets
ConversationID | Numeric | A unique number indicating the transcript identifier
Sequence | Numeric | A sequence number of the message corresponding to a conversation
Department | String | Department from admissions agent
DateAdjusted | Datetime | Date and time in which the message was sent
Timezone | String | The timezone of the conversation
Emitter | String | Represents who sends a specific message. It can be SYS (System), AGENTE (Agent), PROSPECTO (Prospect), or TECBOT
Body | String | Message sent by the sender
BeginsTecBotBool | Boolean | Indicates if the chat was first attended by bot
TransferBool | Boolean | Indicates if the corresponding message indicates a transfer (either from agent to agent, or from bot to an agent)
TransferTecBotBool | Boolean | False until TecBot requests a transfer to an agent; all messages after such transfer are marked as true
MessageNotUnderstoodTecBotBool | Boolean | Indicates if the message is one for which TecBot will return an automated message stating it was not able to understand a prompt (natural language)
MessageNotUnderstoodPromptTecBotBool | Boolean | Indicates that TecBot responded with an automated message stating it was not able to understand a prompt (natural language)

Table 3.2: Features of log format conversations

Text Preprocessing
In order to process raw textual data into a machine readable format, the following are the steps
carried out during the preprocessing phase for the presented work:

• Load unstructured data: text data can be found in many formats, such as comma-separated
values (CSV), JSON, or distributed across folders containing text files. Such data can be
read through direct read functions or by creating custom ones that follow a specific
structure depending on how the documents are stored [47].

• Creating the corpus: a corpus is the collection of related documents containing natural
language [4]. In order to clean and produce a structured dataset from the documents, it
is important to work with a corpus object that contains the whole set of documents to
process.

• Pre-processing Pipeline: in Practical Natural Language Processing [53], Vajjala et al.
denote the following pre-processing steps:

– Sentence segmentation and tokenization: every document is presented as a vector,
where each element represents a word in the sentences found in such a document.

– Text normalization: characters must be standardized into a specific format. In the
case of this work, characters were set to UTF-8, removing accents and transforming
each word to lower case. Further replacements help by reducing many words or
characters to a single idea: for example, differently written URLs can be
standardized into a single token indicating that the word is a URL, and the same
applies to phone numbers and identification numbers. The benefit of this action can
be seen in the creation of a document-term matrix, where, e.g., multiple phone
numbers can be represented by the single token phone number. Regular expressions
are an important component as well, as they can aid in identifying key patterns,
such as dates, specific input messages, and places, among other principal aspects
of the given corpus.
– Stemming and lemmatization: the former refers to removing suffixes and reducing a
word to a base form that can represent all of its variants. The latter replaces
each token with its equivalent lexeme [18]. This removes noise and reduces the
variety of words that belong to a single dictionary root.
In order to clean the respective text column (Cuerpo), we followed the next text prepro-
cessing pipeline (a code sketch is given after this list):

1. Lowercasing: change all textual data to lowercase.

2. Word segmentation: also called tokenization, split each message into a list where each
word is an element.

3. Encoding standardizing: remove all characters that are out of range of the alphabet,
transforming them into their equivalent alphabetical representation when possible, e.g.,
turning é into e.

4. String normalization: replace a variety of strings that can be generalized by a single
form:

• Names of agents
• Names of prospects
• Numbers
• Student IDs
• General structure (start, transfer)
• Emails
• URLs

5. Stop word removal: remove words that provide no information and do not alter the
context if removed.

6. Lemmatization: change the form of each word to its equivalent lexeme.

This allowed creating a dataset containing the clean texts, ready to be passed later to
more advanced techniques used in machine learning algorithms.
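As a reference, the following is a minimal sketch of steps 1 to 5 in Python. The stop-word
set and the student-ID pattern are illustrative placeholders: in this research the stop words
come from the es_core_news_lg vocabulary of Spacy, and the replacement tokens follow the zzz
convention detailed in Section 4.1.2.

```python
import re
import unicodedata

STOP_WORDS = {"de", "el", "la", "que", "en"}  # placeholder subset


def normalize_message(text):
    text = text.lower()                                          # 1. lowercasing
    # 3. encoding standardizing: strip accents (e.g., é -> e)
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # 4. string normalization: collapse variable strings into single tokens
    text = re.sub(r"\S+@\S+", "zzzEMAILzzz", text)               # emails
    text = re.sub(r"https?://\S+|www\.\S+", "zzzURLzzz", text)   # URLs
    text = re.sub(r"\ba\d{8}\b", "zzzMATRICULAzzz", text)        # student IDs (assumed pattern)
    text = re.sub(r"\d+", "zzzNUMEROzzz", text)                  # numbers
    tokens = text.split()                                        # 2. word segmentation
    return [t for t in tokens if t not in STOP_WORDS]            # 5. stop word removal


print(normalize_message("Hola, quiero información de la beca: https://tec.mx"))
# ['hola,', 'quiero', 'informacion', 'beca:', 'zzzURLzzz']
```

Lemmatization (step 6) is performed separately with the Stanza library, as described in
Section 4.1.2.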

3.2.4 Data Analysis


Before continuing with machine learning processes, we first need to understand the data.
So far, we have prepared the dataset to be used in text mining and natural language
processing modules. We now present a graphical analysis of the conversations.
The analysis consists of exploring the data found in the conversational logs to discover
the contents of the writing and the length of the messages sent by the prospects. Graphical
visualizations are shown along with their respective statistics for a deeper understanding.
Understanding the data helps in realizing what tasks and preparation remain in order to
move a step further into the modeling section. The conversational nature of the collected
data requires careful understanding, and its scope must be tightened in order to align its
structure with what the objectives state is to be achieved.

General Analysis
The total number of conversations covered in the year 2019 was 65,200, representing 1,281,222
interactions between users, agents, and the menu-based bot.
The conversation data includes features such as the location from which the user connected
to the online chat service. Figure 3.3 excludes Mexico (91%) and the United States (3.13%)
due to the high share of users from those countries. The graph highlights a predominant
presence of Latin American countries: 14 out of the 25 countries with the most visits.

Figure 3.3: Top 25 countries from users

As previously stated, Mexico accounts for the majority of conversations. Figure 3.4 high-
lights the Mexican states from which users contact the online admissions chat service of
Tecnológico de Monterrey the most. Distrito Federal (currently Mexico City) (22.42%),
Nuevo León (17.74%), Jalisco (5.43%), and Chihuahua (4.45%) cover half of the conversations
attended by the chat service.

Figure 3.4: Top 25 Mexican states from users

Proceeding with the analysis of the conversations, Figure 3.5 shows the difference in the
number of conversations per department. SOAD (Solicitud de Admisión) covers the majority
of the online chat service's attendance with 42,079 conversations, followed by Admi-
sion Profesional (10,553) and Admision Preparatoria (6,334).

Figure 3.5: Messages by department

Figure 3.6 reveals a seasonality. The year begins with January containing 6,341 messages
and maintains a decreasing trend until June, July, and August, which hold similar message
counts, with the bottom in August. From there until November there is an increasing trend,
peaking at 8,966 conversations in November, and the year ends with December at 5,840
conversations.

Figure 3.6: Messages by month

The graph shown in Figure 3.7 reveals a decreasing trend from the beginning of the week
to its end, with demand dropping on Fridays and decreasing further on weekends.

Figure 3.7: Messages by day of week

Analyzing the conversations by hour of the day, as shown in Figure 3.8, it is possible to
note the work schedule from 8 AM to 9 PM, with the busiest hours at 4 and 5 PM. Conversations
then show a major decrease outside the working schedule (10 PM to 7 AM).

Figure 3.8: Messages by hour

Tecbot was first deployed in July. Figure 3.9 helps visualize the participation of Tecbot
in terms of average chat duration. At the November peak, as seen in Figure 3.6, Tecbot
accounted for 25% of the total duration of chats.

Figure 3.9: Tecbot conversation duration in contrast to all conversations

Figure 3.10 shows the participation of agents and Tecbot in terms of average message
counts. Tecbot maintains a constant number of messages sent to users, around 10 to 11 per
conversation, while agents show a decreasing trend in the number of messages they send.

Figure 3.10: Average messages sent by an agent and Tecbot per conversation

3.2.5 Data Preparation


The collected conversational data is mixed. During the first part of the year, the chats
were between agents and users; at mid-year, the admissions department began deploying the
menu-based bot, but it was not until December that the bot covered all incoming
conversations.
Also, the mix of multiple departments implied a different design for the chatbot and slight
structural differences that constantly changed how the transcript was represented.
In order to work with the data, it needed to be delimited to a set of conversations that
serves the purpose of this research.

3.2.6 Model
Once the desired collection of messages has been cleaned, it is possible to move into the
modeling of a machine learning algorithm whose output aligns with and fulfills this
research's objective. In this phase, we make use of the techniques described in Section 3.3.
For the word representations, we implemented two kinds, Bag-of-Words and TF-IDF, with the
following parameters (a configuration sketch follows the list):

• Bag-of-Words: minimum word appearance (min_df) = 2, maximum word appearance
(max_df) = 0.95 (a decimal number between 0 and 1, referring to a percentage),
and total maximum of features to extract (max_features) = 500.

• TF-IDF: minimum word appearance (min_df) = 2, maximum word appearance (max_df)
= 0.95 (a decimal number between 0 and 1, referring to a percentage), total
maximum of features to extract (max_features) = 500, plus another parameter indicating
how many n-grams to account for in the analysis (ngram_range) = (1, 3).
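A minimal sketch of both representations with scikit-learn is shown below; the docs list is
a toy placeholder for the cleaned, lemmatized messages.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [  # placeholder corpus; in this research, the lemmatized messages
    "hola querer informacion beca",
    "enviar solicitud informacion beca",
    "hola enviar solicitud",
]

# Bag-of-Words with the parameters listed above
bow = CountVectorizer(min_df=2, max_df=0.95, max_features=500)
X_bow = bow.fit_transform(docs)       # sparse document-term count matrix

# TF-IDF with the same limits plus unigrams, bigrams, and trigrams
tfidf = TfidfVectorizer(min_df=2, max_df=0.95, max_features=500,
                        ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(docs)   # sparse document-term tf-idf matrix

# On the full corpus these shapes are (104535, 500), as reported in Section 4.3.1.
print(X_bow.shape, X_tfidf.shape)
```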

After implementing the word representations, and since both approaches suffer from
sparsity, we include a dimensionality reduction known as Singular Value Decomposition
(SVD). Given a target percentage of variance to retain, we iterate through the features
starting from n − 1 and decreasing by one, obtaining the explained variance on each
iteration until the desired percentage is reached; a percentage of 90% was established
for this step. This step was taken because the number of features drives the complexity
of the calculations a machine learning algorithm performs on such a sparse matrix. Since
computational resources are limited, the 90% variance target was set to retain as many
words and as much information as possible while significantly reducing the size of the
matrix.
For topic modeling we implemented Non-Negative Matrix Factorization (NMF), with both
variants that the scikit-learn library allows: the Frobenius norm and the Kullback-Leibler
divergence. We analyzed these variants for each of the two word representations with the
number of topics set to ten. This value was arbitrarily selected after experiments aimed
at understanding the contents of the topics; since no measurement is available for NMF to
evaluate how good or bad a selected number of topics is, ten gave an understandable and
broad insight into the contents, but nothing beyond the surface.
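A minimal sketch of both variants with scikit-learn, assuming X is one of the term-document
matrices described above; note that the Kullback-Leibler loss requires the multiplicative-
update solver.

```python
from sklearn.decomposition import NMF

# Frobenius norm variant (the library default loss)
nmf_fro = NMF(n_components=10, beta_loss="frobenius", random_state=0)
W = nmf_fro.fit_transform(X)   # document-topic weights, shape (N, 10)
H = nmf_fro.components_        # topic-term weights, shape (10, T)

# Kullback-Leibler divergence variant
nmf_kl = NMF(n_components=10, beta_loss="kullback-leibler", solver="mu",
             max_iter=500, random_state=0)
W_kl = nmf_kl.fit_transform(X)

# Assign each document to its dominant topic.
doc_topics = W.argmax(axis=1)
```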

For the clustering of the conversations, we implement the K-Means algorithm. Since this
algorithm requires the number of clusters k to be specified, we use the silhouette
coefficient to assess how many clusters are appropriate. Starting from a range of candidate
values, we look for a k at which the silhouette coefficient changes drastically, indicating
an accurate separation. After selecting a convenient k and completing the evaluation, it is
possible to adopt a higher k together with a similarity threshold that automates the
identification of correctly aligned clusters.
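A sketch of the sweep used to choose k is shown below; silhouette_sweep is an illustrative
helper, and the silhouette is computed on a sample to keep the cost tractable on a large
corpus.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_sweep(X, ks, sample_size=10000, random_state=0):
    # Fit K-means for each candidate k and record the mean silhouette
    # coefficient of the resulting assignment.
    scores = {}
    for k in ks:
        labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels, sample_size=sample_size,
                                     random_state=random_state)
    return scores


# e.g., scores = silhouette_sweep(X_reduced, range(50, 1001, 50))
```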

3.2.7 Evaluation

The evaluation procedure is performed in separate steps. First, we analyze the silhouette
score that is found when performing K-means under a changing value of K clusters. Then,
after selecting the value of K and performing its K-means implementation, we establish an
average best value for comparison against a set of intents that are brought from the chatbot
menus. Finally, a manual clustering alignment is performed to verify if a cluster is actually
similar to the best intent obtained from the similarity comparisons.

Cluster to Intent Similarity Matching

Before detailing this step, we gather the options from the menus through which the user
interacts with the bot and interpret them as possible intents, since the expert team of
the admissions department designed such options.

The options go through the same data cleaning process as the messages. Once the cleaning
and filtering process is finished, a total of 55 possible intents is determined. These
options are used as the base for string similarity matching between cluster items and the
options.

In order to obtain which intent predominates in a cluster, we select the intent with the
maximum average similarity over all the data points belonging to the same cluster. The
procedure is detailed in Algorithm 1:

Algorithm 1 Cluster to Intent Similarity Matching

Require: User messages word representation vector X, number of clusters k, menu options
in the same word representation as users' messages W
Results = {}
for i = 0 to k − 1 do
    Let vect be a term-document vectorizer
    Perform fit_transform on vect over W and assign it to D
    Perform transform on vect over the elements in X_i and assign it to Q
    Apply cosine similarity cos(D, Q) and store it in sims
    Gather the mean value of the vectors in sims corresponding to each D_i and store them
    in means
    Store arg max(means) into u
    Store max(means) into v
    Append (u, v) to Results
end for
return Results

In general terms, Algorithm 1 computes the similarity of every intent to each of the items
in a cluster and returns the single intent with the highest average value, that is, the one
most similar to that specific cluster. This follows the idea that the menu options
interpreted as possible intents should be present in the messages from the prospects.
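A minimal Python sketch of Algorithm 1 is shown below; match_clusters_to_intents is an
illustrative helper that assumes the messages, their cluster labels, and the cleaned menu
options are given as plain strings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def match_clusters_to_intents(messages, labels, intents, k):
    results = []
    for i in range(k):
        cluster_msgs = [m for m, lbl in zip(messages, labels) if lbl == i]
        vect = TfidfVectorizer()
        D = vect.fit_transform(intents)    # intent vectors (vectorizer fit on the options)
        Q = vect.transform(cluster_msgs)   # cluster messages in the same vocabulary
        sims = cosine_similarity(D, Q)     # |intents| x |cluster items| similarities
        means = sims.mean(axis=1)          # mean similarity per intent over the cluster
        results.append((int(means.argmax()), float(means.max())))
    return results  # (best intent index, its mean similarity) per cluster
```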

Cluster Alignments
Once the previous steps are finished, we manually verify the results by analyzing the
messages and the most similar intent for each cluster.

Since the universe of messages is enormous, as stated in Table 4.5, we select a represen-
tative sample from each of the clusters at an 80% confidence interval +/- 10%; this is due
to time constraints.

The purpose of this exercise is to determine a possible threshold for the similarity value
(which ranges from 0 to 1) above which an intent returned as similar is indeed correct.

Clusters that are not found to be aligned (below the determined threshold, or above it but
showing an incorrect alignment) are inspected to determine their possible intent and to
assess whether it could be a variant of the one returned by the similarity calculation.

3.3 Mathematical Theory of Applied Techniques


In this section, we provide a brief explanation of the mathematical theory behind the
machine learning techniques implemented for this research project.

Word Representations
Textual data cannot be fed directly into a model; as mentioned previously in Section 2,
first we need to apply a transformation that converts the textual information into a
numerical representation, which is then passed to a machine learning model.
Among the possible numerical representations of words, we analyzed two: a Bag-of-Words
approach and Term Frequency-Inverse Document Frequency [4].

• Bag-of-Words (BoW):
A Bag-of-Words is a basic representation that stores the usage of words in one of two
possible ways: binary or count. The matrix has each word of the corpus on its horizontal
axis and each of the documents analyzed when building the BoW representation on its
vertical axis.
An example representation with three sentences is as follows:

1. hola quiero informacion de la beca


2. quiero tener informacion
3. quiero tener la beca

A resulting vocabulary would be composed of: "hola", "quiero", "informacion", "de",
"la", "beca", "tener". In vector form, the representation is as shown in Table 3.3.

Document hola quiero informacion de la beca tener


1 1 1 1 1 1 1 0
2 0 1 1 0 0 0 1
3 0 1 0 0 1 1 1

Table 3.3: Example Bag-of-Words representation

• TF-IDF: a BoW approach may not fully capture how important a word is in a sentence,
as all counts are weighted equally. That is where TF-IDF comes in. A TF-IDF
representation weights more heavily the words that are mentioned rarely and penalizes
those mentioned most frequently throughout the collection of documents.
The calculation formula to create a TF-IDF matrix is composed of two parts:

– TF: states the relation of each word in a document to the total number of words
the document contains. The formula, as shown in [48], is as follows:

tf_{t,d} = \frac{n_{t,d}}{n_d}    (3.1)

Where tf_{t,d} is the term frequency of a term t in a document d, n_{t,d} refers to the
number of times a term t appears in a document d, and n_d denotes the total number of
terms in document d.
Applied to the sentences given in the BoW explanation, this yields the results shown
in Table 3.4.

Term | Document 1 | Document 2 | Document 3 | TF Document 1 | TF Document 2 | TF Document 3
hola | 1 | 0 | 0 | 1/6 | 0 | 0
quiero | 1 | 1 | 1 | 1/6 | 1/3 | 1/4
informacion | 1 | 1 | 0 | 1/6 | 1/3 | 0
de | 1 | 0 | 0 | 1/6 | 0 | 0
la | 1 | 0 | 1 | 1/6 | 0 | 1/4
beca | 1 | 0 | 1 | 1/6 | 0 | 1/4
tener | 0 | 1 | 1 | 0 | 1/3 | 1/4

Table 3.4: Example TF representation

– IDF: the following formula, also stated in [48], assigns a high score to rare words
throughout the documents, whereas those mentioned frequently receive a lower value.

idf_t = \log \frac{N}{df_t}    (3.2)

Where idf_t represents the idf value for a term t, N states the total number of docu-
ments, and df_t denotes the document frequency, that is, the number of documents
containing the term t.
The resulting IDF table for the previous example sentences is shown in Table 3.5.

Term Document 1 Document 2 Document 3 IDF


hola 1 0 0 0.48
quiero 1 1 1 0
informacion 1 1 0 0.18
de 1 0 0 0.48
la 1 0 1 0.18
beca 1 0 1 0.18
tener 0 1 1 0.18

Table 3.5: Example IDF representation

Once both formulas have been applied, we can formulate the TF-IDF as follows, as
stated in [48]:

tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t    (3.3)

Where tf-idf_{t,d} is the tf-idf weighting of a term t in a document d, tf_{t,d} is the
tf value of a term t in a document d, and idf_t is the idf value of a term t.
Applying this formula to the previous example sentences results in the values shown
in Table 3.6.

Term | TF D1 | TF D2 | TF D3 | IDF_t | TF-IDF D1 | TF-IDF D2 | TF-IDF D3
hola | 1/6 | 0 | 0 | 0.48 | 0.080 | 0 | 0
quiero | 1/6 | 1/3 | 1/4 | 0 | 0 | 0 | 0
informacion | 1/6 | 1/3 | 0 | 0.18 | 0.030 | 0.060 | 0
de | 1/6 | 0 | 0 | 0.48 | 0.080 | 0 | 0
la | 1/6 | 0 | 1/4 | 0.18 | 0.030 | 0 | 0.045
beca | 1/6 | 0 | 1/4 | 0.18 | 0.030 | 0 | 0.045
tener | 0 | 1/3 | 1/4 | 0.18 | 0 | 0.060 | 0.045

Table 3.6: Example TF-IDF representation
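The worked example can be verified with a few lines of Python; base-10 logarithms are
assumed, which reproduces the IDF values above (small differences with respect to the table
come from rounding the intermediate IDF).

```python
import math

docs = [
    "hola quiero informacion de la beca".split(),
    "quiero tener informacion".split(),
    "quiero tener la beca".split(),
]
vocab = sorted({w for d in docs for w in d})
N = len(docs)


def tf(term, doc):        # Equation 3.1
    return doc.count(term) / len(doc)


def idf(term):            # Equation 3.2
    df = sum(term in d for d in docs)
    return math.log10(N / df)


for term in vocab:
    print(term, [round(tf(term, d) * idf(term), 3) for d in docs])  # Equation 3.3
```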

Singular Value Decomposition (SVD)

It is important to note that these matrices are sparse, and that with a broader set of
documents and a growing corpus, the dimension of the resulting matrix will be N × T,
where N is the number of documents and T the total number of terms in the vocabulary.
In order to reduce the size of the word representation matrix by an acceptable amount, we
employ a dimensionality reduction technique known as truncated singular value decomposition
(truncated SVD). When this technique is applied to a term-document matrix, either from
BoW or TF-IDF, the result is a Latent Semantic Analysis (LSA). The technique seeks to
approximate the original matrix with one that contains only a number of factors given by
the user (between 0 and the number of factors in the original matrix) by applying the
following formula, as stated in [48, 13]:

X \approx X_k = U_k \Sigma_k V_k^T    (3.4)

Where k is the given number of top-ranked features to which the term-document matrix will
be reduced, X is the N × T term-document matrix, X_k is the reduced approximation of X
with the top k features, U_k holds the orthogonal eigenvectors of X X^T with dimensions
N × k, \Sigma_k is the diagonal matrix of singular values with dimensions k × k, in which
the non-top-k singular values are zero, and V_k^T holds the orthogonal eigenvectors of
X^T X with dimensions k × T.
This reduces sparsity while retaining as much variance as possible. The function that per-
forms this operation comes from scikit-learn [42], which also allows calculating the
variance explained with respect to the original term-document matrix.
Instead of indicating the number k of features to retain when performing truncated SVD,
it is possible to determine k from an expected variance stated by the user. A function was
designed to iterate from the highest possible value of k (the number of features minus
one), apply the truncated SVD with one less feature on each iteration, and obtain the
variance explained at each k.
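A sketch of this selection with scikit-learn is shown below. Rather than refitting the
decomposition at every candidate k as in the iterative description, a single fit with the
maximum number of components yields the per-component explained variance, whose cumulative
sum can be scanned for the 90% target; components_for_variance is an illustrative helper.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def components_for_variance(X, target=0.90, random_state=0):
    max_k = min(X.shape) - 1
    svd = TruncatedSVD(n_components=max_k, random_state=random_state)
    svd.fit(X)
    # Smallest k whose cumulative explained variance reaches the target
    cumulative = np.cumsum(svd.explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, target)) + 1
    reduced = TruncatedSVD(n_components=k, random_state=random_state).fit_transform(X)
    return k, reduced
```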

Topic Modeling
When analyzing a collection of documents that may relate to each other based on their content,
topic modeling is a machine learning technique that seeks to group documents that refer to
a similar context by a set of statistical calculations [5]. The nature of this application is
unsupervised [5] since it is not dependent on a labeled dataset in order to produce results.
Two algorithms that highlight in topic modeling applications are Latent Dirichlet Allo-
cation (LDA) [6] and Non-Negative Matrix Factorization (NMF or NNMF) [40, 37]. Since
our dataset contains a majority of short texts, as stated further in Table 4.3, Table 4.4, and
Table 4.5, we attended Chen et al. [10], where it is found that NMF is able to produce higher-
quality results as compared with LDA in terms of short texts.
A visualization of the NMF calculation is shown in Figure 3.11, where a term-document
matrix V (of dimensions N × T, where N indicates the documents and T the terms) is
approximated by the product of two matrices: W (of dimensions N × P, where N indicates
the documents and P the topics) and H (of dimensions P × T, where P represents the topics
and T the terms of the corpus) [4].

Figure 3.11: Non-Negative Matrix Factorization Visualized

Text Clustering
The objective of this research is to explore the grouping of similar questions into
clusters and derive whether a user intent can be inferred from each group.
One common technique to group similar data points in a dataset is K-Means. The
K-means algorithm is shown in Figure 3.12.

Figure 3.12: K-means algorithm extracted from [48]

In general terms, given a set of N data points and a number of clusters K, we begin by
randomly initializing the K cluster centroids. Each data point is then assigned to its
nearest centroid based on Euclidean distance. If any cluster assignments changed, the
centroids are recalculated and the same computations are performed again, until there are
no further new assignments to the clusters, meaning that the algorithm has converged.
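A minimal NumPy sketch of this loop (Lloyd's algorithm) is shown below; it is illustrative
only and not tuned for a corpus of this size.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: recompute each centroid as the mean of its points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```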

Silhouette Score as Clustering Evaluation Metric


A clustering evaluation can be done by analyzing the silhouette score coefficient [45]. Let a
be the mean intra-cluster distance and b the mean nearest-cluster distance for each sample.
Then, the Silhouette Coefficient [42, 45] for a sample can be computed with the following
equation:
s = \frac{b - a}{\max(a, b)}    (3.5)

Thus the mean Silhouette Coefficient for a set of samples can be computed with:

S = \frac{1}{N} \sum s    (3.6)
Where N stands for the total number of samples and s refers to the Silhouette Coefficient
of each sample as given by the previous formula.
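A direct NumPy rendering of Equations 3.5 and 3.6 is shown below; it is suitable only for
small samples and assumes every cluster has at least two members (in practice, the
scikit-learn implementation [42] is used).

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def mean_silhouette(X, labels):
    labels = np.asarray(labels)
    D = pairwise_distances(X)            # all pairwise Euclidean distances
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                  # exclude the point itself from a
        a = D[i, same].mean()            # mean intra-cluster distance
        b = min(D[i, labels == c].mean()                  # mean distance to the
                for c in set(labels) if c != labels[i])   # nearest other cluster
        scores[i] = (b - a) / max(a, b)  # Equation 3.5
    return scores.mean()                 # Equation 3.6
```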
Chapter 4

Results

This chapter comprises the results of each of the steps taken to accomplish the purpose of
this research. The following sections cover the data preprocessing, the analysis of the
collected data, the preparation of the dataset used in the modeling section, and finally
the evaluation of the model.

4.1 Data Preprocessing


Raw data needs to be transformed before the modeling stage. Human conversational data
represents a challenge in itself, and in addition, Spanish language models are not as
advanced as English ones. Still, the majority of the data transformation depended on
regular expressions to convert the transcript string into the tabular log format.

4.1.1 Transformation to Log Format


Returning to the previous anonymized example of the raw conversation transcript:
Chat has begun: Friday, January 1, 2019, 12:12:12 (-0500) Chat origin: SOAD
Agent One ( 0s ) Agent One: Hi ¿how can I help you? ( 42s ) Prospect Two: Hi,
thanks for your support. ( 1m 7s ) Agent One: Bye.
The transcript can be visually separated into the following parts:
The different parts of this basic example are:
• Conversation initialization: the system makes clear that a conversation has started by
adding Ha comenzado el chat. This indicates that a user correctly connected with an agent
and, therefore, that a conversation is about to be carried out. Some variants of these
system clarifications exist, for example, when a chat is transferred from one agent to
another or when the chat encountered an error.


• Date of conversation: after stating that the conversation connection has been estab-
lished, the system appends the date at that precise moment in the format: Day of week,
Month, Day number, Year, Hours:Minutes:Seconds (Timezone). This marks the beginning of
the conversation; later messages occur at time offsets added after the stated time.

• Department: Origen de chat indicates that the word appended next to it, SOAD in our
anonymized example, refers to the department. The conversation service can be reached
through different Tecnológico de Monterrey URLs, and the address through which the
service was reached determines the department attended by the online Admissions chat
service.

• Agent attending the conversation: the keyword appended next to the department is
Agente, which can be followed by a variable number of words, commonly four (in the
name format: first name, second name, first last name, and second last name),
indicating which agent of the online chat service is attending the conversation.
To confirm this when processing the full set of transcripts, we make use of the field
Propietario, which indicates the agent that attended a conversation, so the name given
in the transcript must match one of the agent names appearing across all the
conversations.

• Interval of response: The sets of time deltas (#h #m #s) indicate how many seconds,
minutes, or hours the responder (agent, prospect, or Tecbot) took after the initial time
(conversation start). This allows building a timeline of responses over each conversa-
tion.

• Prospect name: the prospect’s name is the one mentioned after the first agent or Tecbot
message. It is shown in the same variable-length format, which in its four-length vari-
able stands for first name, second name, first last name, and second last name.

• The message: composed of the time interval indicator, the name of the sender, and the
message itself. The first two components are already covered; after them comes the
message, which is the natural text of the communication intended by the sender.

This process is performed on each conversation with the aid of regular expressions.
By extracting the information through these expressions, we can explode the data and
transform the transcript into a dataset with the format specified in Table 3.2. A sketch
of this parsing is shown below.
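The following is a minimal sketch of such extraction under the basic transcript format shown
earlier; the patterns are illustrative and would need tuning to cover every format variant
found in the reports.

```python
import re

HEADER_RE = re.compile(
    r"Chat has begun: (?P<date>.+?) \((?P<tz>[+-]\d{4})\) "
    r"Chat origin: (?P<department>\S+)"
)
MESSAGE_RE = re.compile(
    r"\( (?:(?P<h>\d+)h )?(?:(?P<m>\d+)m )?(?P<s>\d+)s \) (?P<sender>[^:]+): "
)


def parse_transcript(body):
    header = HEADER_RE.search(body)
    rows, last = [], None
    for match in MESSAGE_RE.finditer(body):
        if last is not None:
            # The previous message's text runs until the next time marker.
            rows[-1]["text"] = body[last:match.start()].strip()
        offset = (int(match["h"] or 0) * 3600 + int(match["m"] or 0) * 60
                  + int(match["s"]))
        rows.append({"department": header["department"] if header else None,
                     "offset_seconds": offset,
                     "sender": match["sender"].strip(),
                     "text": ""})
        last = match.end()
    if rows:
        rows[-1]["text"] = body[last:].strip()
    return rows
```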
An example of an anonymized conversation in the log format is as follows in Table 4.1:

ConversacionID | Secuencia | Departamento | FechaAjustada | Emisor | Texto
999999 | 0 | SOAD | 02/03/2025 12:12:12 | SYS | zzzinicio
999999 | 1 | SOAD | 02/03/2025 12:12:12 | SYS | zzzorigen
999999 | 2 | SOAD | 02/03/2025 12:12:12 | AGENTE | hola en que puedo servirle?
999999 | 3 | SOAD | 02/03/2025 12:12:51 | PROSPECTO | hola, me llamaron que tengo que finalizar mi solicitud
999999 | 4 | SOAD | 02/03/2025 12:13:17 | AGENTE | claro, deje le proporciono la liga donde podra consultar toda la informacion con respecto a su solicitud https://liga.com
999999 | 5 | SOAD | 02/03/2025 12:13:32 | PROSPECTO | le agradezco mucho
999999 | 6 | SOAD | 02/03/2025 12:13:56 | AGENTE | es un gusto apoyarle, pase un buen dia

Table 4.1: Example conversation in log-format

4.1.2 Text Preprocessing


The preprocessing and cleaning of the text followed the procedure explained in
Section 3.2.3. Note that some of the text preprocessing steps are performed while
transforming the transcripts to the log format: lowercasing, text normalization, and
encoding standardizing are carried out during that phase.
System indications and the names of agents and prospects are the first elements replaced
by a common indicator:

• System indication replacements cover statements made by the system about what is
happening in a conversation: zzzinicio when a conversation has just started, zzzorigen
when the system states the origin or department of the conversation, zzztransfer for the
message telling the user that the chat was correctly transferred, and zzztransferfail,
which works like the previous one except for failed transfer scenarios.

• Sender name replacement: the names of agents are replaced by the word AGENTE, and
those of prospects by PROSPECTO. This maintains a generalization of the senders and also
keeps further analyses confidential.

After finishing the transformation of the data to the log format, we apply the remaining
steps of the text preprocessing procedure, continuing with string normalization:

• Numbers are replaced by the word zzzNUMEROzzz



• Student IDs are replaced by the word zzzMATRICULAzzz

• Names and last names mentioned during the messages are matched against a dataset
containing common names in Mexico [1]; matches are replaced with zzzNOMBREzzz for names
and zzzAPELLIDOzzz for last names.

• Emails are replaced by the word zzzEMAILzzz

• URLs are replaced by the word zzzURLzzz

After completing the replacements, we proceed by removing stop words, that is, words that
do not alter the context of a sentence, using a stop word dictionary. We consult the
dictionary from the es_core_news_lg language model vocabulary of Spacy [26].
First experiments on lemmatization were done with the Spacy library [26], but upon manually
evaluating the results, some words lost their original meaning, such as para (for) → parir
(give birth). Another library, Stanza [43], was explored and tested, and showed better
consistency in keeping words' original meanings. An additional lemmatization library
exists, Freeling [41], but it was not explored as part of this research.
The lemmatization step is therefore done through the Stanza library [43] by the Stanford
NLP Group. Every word is replaced by its lemma except for a few excluded ones that must
not be altered, such as terms specific to Tecnológico de Monterrey academics like tec21.
As a result, we obtain sentences as shown in Table 4.2.

Text | Lemmatized Text
thank you very much for your support and the information provided | thank support information provide
I would like to receive support with the cancelation of documents registered in order to upload the updated ones | receive support cancel document register upload update
hi, how are you? I am a student from tec, I am completing my scholarship process | hi student tec complete scholarship process
can you show me what is the next requirement? I am currently stuck | show next requirement current stuck
the instructions say there must be 2 different folios | instruction zzzNUMBERzzz folio

Table 4.2: Example of texts and their lemmatized results
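As a reference, the following is a minimal sketch of this lookup with Stanza; the exclusion
set is an illustrative placeholder, and the Spanish models are assumed to be downloaded
beforehand with stanza.download("es").

```python
import stanza

nlp = stanza.Pipeline(lang="es", processors="tokenize,mwt,pos,lemma")

KEEP_AS_IS = {"tec21"}  # placeholder exclusion set


def lemmatize(text):
    doc = nlp(text)
    # Keep excluded terms verbatim; fall back to the surface form when no lemma exists.
    lemmas = [w.text if w.text.lower() in KEEP_AS_IS else (w.lemma or w.text)
              for sentence in doc.sentences for w in sentence.words]
    return " ".join(lemmas)


print(lemmatize("hola quiero informacion de la beca"))
```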

An important note is that spelling errors were not corrected. A process to resolve spelling
errors was tried, but the dictionary also affected correctly written words; since modifying
words is a sensitive action, this process was discarded. According to an analysis comparing
a BoW matrix of the complete corpus against one requiring a minimum of 2 occurrences per
word (so that non-unique words are kept), approximately 45% of the words potentially
contain spelling errors. Raising the threshold to 3 appearances increases this percentage
to 55%. Even though spelling errors are present, further into the modeling phase the word
representations make it possible to constrain the words considered by the models.
With this step, the text preprocessing concludes. The result is a cleaned dataset con-
taining only words that are important to the context of the messages. As Table 4.2 shows,
the lemmatization purges around 50% of the words contained in the original messages; this
high reduction is a direct consequence of the corpus consisting of short texts.

4.2 Data Preparation and Exploration


This step consists of selecting the data used in the exploratory model that finds the
intents contained in the conversations.
Figure 3.5 visually highlights that SOAD is the most requested department. Furthermore,
since the research objective is to discover intents in conversational data, chats attended
by Tecbot are discarded, as they do not provide information vital to accomplishing the
goal. Figure 4.1 therefore shows the messages per month after filtering to the SOAD
department and discarding Tecbot chats. A final addition to the filter concerns what the
users are asking in the conversations, so the EMISOR (Sender) field is filtered to return
only messages from prospects.

Figure 4.1: Message count by month SOAD department only

Statistical information on the length of the raw messages returned by the filtered data can
be found in Table 4.3, supported by the word-length figures shown in Figure 4.2. There are
13 messages with lengths greater than 250 words. Figure 4.2a shows the complete spectrum
of the data, including those long messages, while Figure 4.2b excludes those data points,
giving a clearer representation of word counts. Table 4.3 shows statistics on the
conversations and their words:

Statistic Value
Number of Conversations 21,257
Number of Messages 168,462
Mean 8.62
Standard Deviation 11.27
Min 1
Q1 2
Median 5
Q3 11
Upper Fence 24
Max 936

Table 4.3: Statistical information of message length in filtered data (raw)

(a) Message length box plot including max values (raw)

(b) Message length box plot excluding max values (up to 250 words, raw)

Figure 4.2: Message length box plot (raw)

Statistical information on the length of the cleaned messages returned by the filtered data
can be found in Table 4.4, supported by the word-length figures shown in Figure 4.3. There
are 3 messages with lengths greater than 250 words. Figure 4.3a shows the complete spectrum
of the data, including those long messages, while Figure 4.3b excludes those data points,
giving a clearer representation of word counts. Table 4.4 shows statistics on the
conversations and their words:

Statistic Value
Number of Conversations 21,257
Number of Messages 168,462
Mean 3.58
Standard Deviation 4.70
Min 1
Q1 1
Median 2
Q3 4
Upper Fence 8
Max 425

Table 4.4: Statistical information of message length in filtered data (cleaned)

(a) Message length box plot including max values (cleaned)

(b) Message length box plot excluding max values (up to 250 words, cleaned)

Figure 4.3: Message length box plot (cleaned)

To avoid dealing with messages of only one word, filtering was also applied to keep only
messages whose lemmatization has a length of two or greater. The statistics can be
visualized in Table 4.5.

Statistic Value
Number of Conversations 21,116
Number of Messages 104,268
Mean 5.16
Standard Deviation 5.39
Min 2
Q1 2
Median 4
Q3 6
Upper Fence 12
Max 425

Table 4.5: Statistical information of message length in filtered data (cleaned messages with
length greater than one)

A visualization that helps in understanding how words are mentioned in a corpus is a
word cloud. A word cloud arranges a given number of words ordered from the most mentioned
to the least mentioned and highlights their use by displaying the more frequent ones in a
greater size.
To highlight the difference between the raw texts and the text preprocessed through the
lemmatization, Figure 4.4 shows two word clouds. Figure 4.4a displays a word cloud using
the raw, yet-to-be-cleaned texts, while Figure 4.4b displays a similarly structured word
cloud whose input is the texts after cleaning, lemmatization, and stop word filtering. In
the first figure, not much information can be appreciated: stop words dominate the
distribution (e.g., de, el, que, en, among others), making it hard to interpret the
contents of the messages, and only a few keywords such as solicitud, admision, educativo
are visible. In the latter figure, the preprocessed version of the texts, we can appreciate
broader information on the messages' contents: the word replacements for names and numbers
take the top positions in mentions (zzzNOMBREzzz and zzzNUMEROzzz), which both help to
generalize their usage, and then words such as apoyo, beca, campus, informacion, paa,
examen, among others, are easily visible.

(a) Wordcloud raw text without stopwords filter (Top 100)
(b) Wordcloud lemmatized text with stopwords filter (Top 100)

Figure 4.4: Word clouds before and after text preprocessing

To further support the understanding of word mentions, Figure 4.5 and Figure 4.6 show the
top 25 words along with their number of mentions across the messages. They follow the same
structure as the word clouds: one uses the messages as they were sent, and the other
displays the information after cleaning. The first highlights the prevalence of stop words
in the messages, where the only important word that stands out is solicitud, with close to
ten thousand mentions. In the latter, more keywords can be appreciated, accompanied by
their usage counts in decreasing order. Additionally, the graphs also show the vocabulary
size (unique word count) and the total number of words in the corpus. For the raw texts,
the vocabulary size is 26,245 unique words out of 1,250,990 total words; after cleaning
through the preprocessing steps, the vocabulary is reduced to 16,522 unique words out of
510,051 total words.

Figure 4.5: Word frequency raw text without stopwords filter (Top 25)

Figure 4.6: Word frequency lemmatized text with stopwords filter (Top 25)

The resulting prepared dataset is saved in a file following the structure of Table 3.2,
with an additional column named OracionLematizada, which holds the lemmatized sentence of
each message. With such a dataset, the data preparation step is finalized, and the cleaned
information can be used in a machine learning model that fulfills the objective of our
research.

4.3 Model
This section covers the modeling of a machine learning algorithm with the purpose of finding
the groups of intents as stated in the objectives.

4.3.1 Word Representations


The parameters used when building the term-document matrices are as follows. Note that
limiting the number of words extracted by these models helps minimize the words containing
spelling errors, since only the most consistent and most used words throughout the corpus
are kept:

• min_df: if a float between 0 and 1, it indicates the minimum document frequency that a
token must meet to be used in the calculation. If an integer, it refers to the minimum
number of times a token must appear in the set of documents to be included in the
term-document matrix.

• max_df: if a float between 0 and 1, it indicates the maximum document frequency that a
token may reach to be used in the calculation. If an integer, it refers to the maximum
number of times a token may appear in the set of documents to be included in the
term-document matrix.

• max_features: maximum number of top-ranked features to be used in the term-document
matrix. This results in a term-document matrix with dimensions N × T, where N refers
to the number of documents and T to max_features.

• ngram_range: a tuple indicating the range of n-grams to be used when obtaining the
term-document matrix.

The returned dimensions for the word representations of both BoW and TF-IDF are as
follows:

• BoW (min_df = 2, max_df = 0.95, max_features = 500): (104535, 500).

• TF-IDF (min_df = 2, max_df = 0.95, max_features = 500, ngram_range = (1,3)): (104535,
500).

As mentioned in Section 3.2.6, we apply a truncated SVD to the term-document matrices,
selecting an explained variance after truncation of 90%. The resulting dimensions are as
follows:

• BoW: (104535, 277) explaining 0.8988 of variance.

• TF-IDF: (104535, 319) explaining 0.8986 of variance.



4.3.2 Topic Modeling

Topic modeling is performed over the full dimensions of the term-document matrices, that
is, those with a shape of (104535, 500). As stated in Section 3.2.6, we perform Non-
Negative Matrix Factorization, which can be computed using either the Frobenius norm or
the Kullback-Leibler divergence, both available as parameters of the NMF implementation
in the scikit-learn library.
As there is no metric for determining a specific number of topics in the case of NMF,
an arbitrary number of 10 topics was chosen after experimentation, in order to gather a
generic insight into the contents of the utterances via this technique.

• BoW:

– Frobenius norm: Figure 4.7 shows the distribution of messages over 10 separate
topics along with their most representative words, accompanied by how important each
word is to its respective topic.

It is possible to infer that the topics refer to:

* Topic 1: dates given to finish a procedure.

* Topic 2: sort of greeting or introductory message, not clear to interpret.

* Topic 3: request filling information.

* Topic 4: farewells.

* Topic 5: delivery of documents.

* Topic 6: greetings.

* Topic 7: scholarships information.

* Topic 8: admission and exams information.

* Topic 9: aid with payments processes and treasury related processes.

* Topic 10: delivery of documents related to credit score checking.



Figure 4.7: BoW - NMF (Frobenius norm)

Table 4.6 details how many documents are found on each of the 10 topics calcu-
lated:

Topic Documents
Topic 1 9,632
Topic 2 9,013
Topic 3 8,497
Topic 4 14,185
Topic 5 11,248
Topic 6 7,182
Topic 7 6,391
Topic 8 19,180
Topic 9 6,363
Topic 10 12,844

Table 4.6: Statistical information of BoW NMF topics (Frobenius norm)

– Kullback-Leibler divergence: Figure 4.8 shows the distribution of messages over 10
separate topics along with their most representative words, accompanied by how
important each word is to its respective topic.
It is possible to infer that the topics refer to:
* Topic 1: dates for admission information.

* Topic 2: admission information.

* Topic 3: request filling information.

* Topic 4: farewells.

* Topic 5: delivery of documents and treasury.

* Topic 6: greetings.

* Topic 7: scholarships information.

* Topic 8: campus related information.

* Topic 9: aid with payments processes and treasury related processes.

* Topic 10: delivery of documents related to credit score checking.

Figure 4.8: BoW - NMF (Kullback-Leibler divergence)

Table 4.7 details how many documents are found on each of the 10 topics calcu-
lated:

Topic Documents
Topic 1 8,426
Topic 2 7,015
Topic 3 10,506
Topic 4 17,455
Topic 5 11,682
Topic 6 11,453
Topic 7 7,251
Topic 8 9,897
Topic 9 8,840
Topic 10 12,010

Table 4.7: Statistical information of BoW NMF topics (Kullback-Leibler divergence)

• TF-IDF:

– Frobenius norm: Figure 4.9 shows the distribution of messages over 10 separate
topics along with their most representative words, accompanied by how important each
word is to its respective topic.
It is possible to infer that the topics refer to:

* Topic 1: texts on agreeing an idea.


* Topic 2: greetings.
* Topic 3: greetings.
* Topic 4: dates to finish a certain process.
* Topic 5: farewells.
* Topic 6: request filling information.
* Topic 7: delivery of documents and credit score checking.
* Topic 8: campus related information.
* Topic 9: aid with errors in process.
* Topic 10: e-mail delivery information.

Figure 4.9: TF-IDF - NMF (Frobenius norm)

Table 4.8 details how many documents are found on each of the 10 topics calcu-
lated:

Topic Documents
Topic 1 7,923
Topic 2 8,822
Topic 3 4,528
Topic 4 7,886
Topic 5 9,238
Topic 6 14,314
Topic 7 17,423
Topic 8 19,642
Topic 9 6,505
Topic 10 8,254

Table 4.8: Statistical information of TF-IDF NMF topics (Frobenius norm)

– Kullback-Leibler divergence: Figure 4.10 shows the distribution of messages over
10 separate topics along with their most representative words, accompanied by how
important each word is to its respective topic.
It is possible to infer that the topics refer to:
* Topic 1: agreements and farewells.

* Topic 2: aid with information for son/daughter requests.

* Topic 3: doubts on diverse processes.

* Topic 4: date to finish a process.

* Topic 5: agreements and farewells.

* Topic 6: request filling information.

* Topic 7: delivery of documents and credit score checking.

* Topic 8: campus related information.

* Topic 9: aid with errors in process.

* Topic 10: e-mail delivery information.

Figure 4.10: TF-IDF - NMF (Kullback-Leibler divergence)

Table 4.9 details how many documents are found on each of the 10 topics calcu-
lated:

Topic Documents
Topic 1 11,588
Topic 2 7,520
Topic 3 9,768
Topic 4 7,032
Topic 5 8,532
Topic 6 10,998
Topic 7 13,793
Topic 8 12,521
Topic 9 11,207
Topic 10 11,576

Table 4.9: Statistical information of TF-IDF NMF topics (Kullback-Leibler divergence)

Topic modeling provides significant insight into the contents of the messages. However, a
disadvantage of this procedure is that there is no principled way to determine the most
appropriate number of topics to extract.

4.3.3 Text Clustering


Clustering results vary based on the word representation used and the value of k given.
For this reason, we ran two sweeps of k from 50 to 1000 in steps of 50, one for each of
the two word representations used in the previous steps: BoW and TF-IDF. The corre-
sponding evaluation of these runs can be seen in Section 4.4, specifically Figure 4.11
and Figure 4.12.
As explained further below, the TF-IDF results are more consistent in that a peak value
of the silhouette coefficient can be determined for a specific k; in contrast, for BoW
the coefficient maintains an increasing trend and no base value of k for the clustering
can be determined.
Once TF-IDF was chosen, we selected a value of k equal to 100, as the coefficient
increases meaningfully from 50 to 100 and 100 clusters is not an excessive amount for
the manual evaluation.

4.4 Evaluation
The evaluation of an unsupervised machine learning application requires manually checking
the resulting groups of data points. We use the Silhouette Coefficient as a baseline to
establish a base value of k whose clusters are then manually analyzed.

4.4.1 Silhouette Coefficient

Silhouette Coefficient curves over the range 50 to 1000 with steps of 50 can be found in
Figure 4.11 for BoW and Figure 4.12 for TF-IDF. It can be visually appreciated that for
BoW, as k increases, the silhouette coefficient keeps increasing with no sign of a peak.
On the contrary, for TF-IDF there is a peak value at k = 300; also important is that the
increase in the coefficient from 50 to 100 is considerable, meaning there is a major
impact on the score at k = 100. For the manual validation exercise, we perform the
procedure stated in Section 3.2.7, where each cluster is assigned the intent that
maximizes the average similarity over all the data points of the cluster group.

Figure 4.11: Silhouette Score k=50-1000 (BoW)



Figure 4.12: Silhouette Score k=50-1000 (TF-IDF)

4.4.2 Clustering to Intent Alignment


The alignment is done through the procedure shown in Algorithm 1, where each cluster is
compared against the list of existing possible intents and the intent that maximizes the
average similarity value is assigned to the cluster. To gather a general insight into how
the utterances compare against the intents, Figure 4.13 shows the similarity values from
zero to one on the x-axis and the number of utterances on the y-axis.

Figure 4.13: Histogram of similarities between utterances and intents



Another visualization of these values can be seen in Figure 4.14, where the x-axis
represents a similarity threshold and the y-axis shows the percentage of utterances whose
similarity is equal to or higher than that threshold.

Figure 4.14: Similarity thresholds and the respective percentage of utterances found at or
above each threshold

When computing the overall similarities of the utterances to the intents, 5 out of the
55 total intents gathered from the menu-based options were not found to be similar to any
utterance; thus, the total number of intents is reduced to 50. The unrelated intents were:

• informacion sobre becas y apoyos educativos

• requisitos prepa tec

• beca al talento atletico profesional

• beca al talento artistico profesional

• requisitos para profesional

A possible explanation is that, during the design of the menu-based options, these intents
were added for the sake of completeness of the menu even though there was no record of
utterances related to them. It is also possible that not all intents were taken into account
during that design, or that some were omitted on purpose.
After applying the procedure for assigning an intent to each cluster, the alignment of
the clusters is manually verified on a sample from each cluster (80% CI +/- 10%; as
explained in Section 3.2.7, these confidence values are chosen because the amount of data
to verify is constrained by the time the validation requires). The resulting number of data
points to verify from each of the 100 clusters ranged between 39 and 41, and if a majority
(more than half) of the sample shows correct intent similarity, the cluster is declared
correctly aligned, or incorrectly aligned otherwise.
The results from the validation of the 100 clusters are shown in Figure 4.15, where the
x-axis corresponds to each of the clusters and the y-axis indicates the mean similarity
value among all the data points of each cluster. Additionally, green bars represent clusters
that are correctly aligned with the resulting similar intent, and blue bars those that are
not aligned.

Figure 4.15: Clustering to intent alignment for 100 clusters with TF-IDF

The mean similarity value of the correctly aligned clusters is 0.65 with a standard
deviation of 0.18, while the incorrectly aligned ones have a mean of 0.09 with a standard
deviation of 0.13. A threshold can be set at the mean score minus one standard deviation,
leading to a value of 0.47, where the clusters above this value have a high probability of
corresponding to a correct alignment. Such a threshold is an approximation: as can be seen
in Figure 4.15, clusters #50 and #87 have a similarity value higher than the threshold but
are not aligned, while clusters below the threshold, such as clusters #9, #32, #73, #78, and
#97, are correctly aligned.
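The thresholding rule described above can be expressed in a few lines; this is only a sketch, where the 0.65 and 0.18 statistics come from the manual validation and the alignment mapping follows the earlier sketch.

```python
MEAN_ALIGNED, STD_ALIGNED = 0.65, 0.18      # statistics from the manual validation
THRESHOLD = MEAN_ALIGNED - STD_ALIGNED      # = 0.47

def likely_correct(alignment, threshold=THRESHOLD):
    """Keep clusters whose mean similarity to their best intent clears the threshold."""
    return {c: intent for c, (intent, sim) in alignment.items() if sim >= threshold}
```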

A total of 13 different intents were found in the correctly aligned clusters, out of the 55
total intents. The summary of intents found in the correctly aligned clusters can be
appreciated in Table 4.10.

Intent                              Cluster    Utterances Count
acta nacimiento                     9          319
apoyo documento                     5          1414
apoyo pago                          89         1705
beca socioeconomico                 6          1800
carta razon apoyo                   73         352
duda buro credito                   97         1172
enviar solicitud                    4          1146
examen admision                     38         368
examen admision                     31         1592
informacion beca apoyo educativo    32         2014
informacion beca apoyo educativo    80         618
llenar solicitud                    63         833
problema solicitud                  1          302
prueba aptitud academico            78         2158
taller preparacion                  64         865

Table 4.10: Intents found in correctly aligned clusters (k=100)

An analysis of these clusters’ alignment is shown next.

• Correctly aligned:

– Cluster #4 (Table 4.11), aligned intent: enviar solicitud.

Lemmatized Text
opcion enviar solicitud
llenar enviar solicitud
enviar solicitud carrera correcto
escribir fecha llenar enviar solicitud linea
enviar solicitud admision linea procedimiento inscribir tryouts

Table 4.11: Cluster #4 examples for alignment validation

– Cluster #6 (Table 4.12), aligned intent: beca socioeconomico.



Lemmatized Text
hijo admitir prepatec agosto zzzNUMEROzzz inicio solicitud beca socioeconomico
llenar beca socioeconomico
llenar informacion solicitud beca socioeconomico
hola querer fecha limite campus terminar solicitud beca socioeconomico
hola querer enviar solicitud beca socioeconomico

Table 4.12: Cluster #6 examples alignment validation

– Cluster #63 (Table 4.13), aligned intent: llenar solicitud.

Lemmatized Text
hola querer llenar solicitud admision
encontrar llenar solicitud sitio web
llenar solicitud admision preparatoria
intentar llenar solicitud admision desplegar programa academico
noche encontrar seccion consulta solicitud acabar llenar solicitud ingreso

Table 4.13: Cluster #63 examples alignment validation

• Incorrectly aligned:

– Cluster #2 (Table 4.14), aligned intent: apoyo pago. Possible better intent: acknowledgements.

Lemmatized Text
gracias zzzNOMBREzzz
duda gracias zzzNOMBREzzz
completar brevedad documento gracias zzzNOMBREzzz
ok gracias zzzNOMBREzzz
valer gracias zzzNOMBREzzz

Table 4.14: Cluster #2 examples alignment validation

– Cluster #50 (Table 4.15), aligned intent: enviar solicitud. Possible better intent: envio documento.

Lemmatized Text
enviar estafeta costo
enviar copia nomina mes padre correo
seguir paso deshabilitar casilla confirmacion enviar
introducir dato dar confirmacion enviar
gracias poder enviar documento
haber alguno correo enviar informe

Table 4.15: Cluster #50 examples alignment validation

– Cluster #87 (Table 4.16), aligned intent: apoyo pago. Possible better intent: consultar informacion.

Lemmatized Text
resultado llamado ver reflejado pagina
hola realizar pago paa ver reflejado portal
comentar nomina ver reflejado deduccion
gracias poder imprimir pase ver examen impreso
estar ver rapidamente costo santa fe mayor edo mex zona

Table 4.16: Cluster #87 examples alignment validation

Making use of the established threshold of 0.47 for the mean similarity values, we can
apply and automate the discovery for the clustering that returns the highest silhouette
value. As can be seen in Figure 4.12, using TF-IDF and grouping into 300 clusters returns
the peak value for the silhouette.

The results for the clustering-to-intent alignments for 300 clusters are shown in Fig-
ure 4.16. Green bars represent mean similarity values above or equal to the established
threshold of 0.47.

Figure 4.16: Clustering to intent alignment for 300 clusters with TF-IDF

The resulting histogram for the similarities of the clusters to the intents is shown in
Figure 4.17.

Figure 4.17: Histogram for the similarity values of the 300 clusters to the intents

In the same format as the previous Table 4.10, the intents discovered when using
300 clusters and applying the threshold of 0.47 are shown in Table 4.17. With this, we can
appreciate that the total number of unique intents found increased from 13 to 18.

Intent                              Count
acta nacimiento                     2
apoyo documento                     3
apoyo pago                          3
beca socioeconomico                 3
beca talento academico              1
carta razon apoyo                   1
carta recomendacion                 1
consulta solicitud                  1
duda buro credito                   2
duda ensayo curriculo               1
enviar solicitud                    3
examen admision                     2
informacion beca apoyo educativo    2
informacion lider manana            1
llenar solicitud                    2
problema solicitud                  1
prueba aptitud academico            1
resumen requisito                   1

Table 4.17: Intents found in correctly aligned clusters (k=300)

4.5 Reproducibility Illustration

In order to reproduce the results found during this research, the methodology (stated in
Figure 3.1) is provided along with the core functions to be applied at each of its stages.
This detailed illustration is shown in Figure 4.18:

Figure 4.18: Process to reproduce the proposed methodology
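As a complement to the figure, the following self-contained sketch stitches the core stages together: vectorization, dimensionality reduction, clustering, and cluster-to-intent alignment. The parameter values follow the ones reported in this chapter, while the function and variable names are illustrative assumptions rather than the exact code used in this research.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def discover_intents(cleaned_utterances, intent_options, k=100, seed=42):
    """End-to-end sketch: TF-IDF (1- to 3-grams, 500 features), truncated SVD
    for clustering, K-Means, and cluster-to-intent alignment."""
    vec = TfidfVectorizer(ngram_range=(1, 3), max_features=500)
    X = vec.fit_transform(cleaned_utterances)          # sparse term-document matrix
    X_red = TruncatedSVD(n_components=300, random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=k, random_state=seed).fit_predict(X_red)

    # Align each cluster to the intent with the highest mean cosine similarity.
    sims = cosine_similarity(X, vec.transform(intent_options))
    alignment = {}
    for c in np.unique(labels):
        mean_sims = sims[labels == c].mean(axis=0)
        alignment[c] = (int(mean_sims.argmax()), float(mean_sims.max()))
    return alignment   # cluster id -> (best intent index, mean similarity)
```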

4.6 Conversational Bots Comparison Benchmark

A comparison exercise between three major online conversational bot services was carried
out. The cloud services compared were Google Dialogflow, Amazon Lex, and IBM Watson.
The results of the compared features are shown in Table 4.18:

Feature: Performance
  Dialogflow: Post-conversation analysis
  Lex: Post-conversation analysis
  Watson: Post-conversation analysis and middleware

Feature: Response time
  Dialogflow: -
  Lex: -
  Watson: 200-500 ms + middleware time through database; ≤ 8 s

Feature: Costs
  Dialogflow:
  • Standard: $0.002 USD per request, 600 requests/minute
  • Plus: $0.004 USD per request, 600 requests/minute
  Lex:
  • 10k texts per month during the first year
  • $0.00075 USD per request
  Watson:
  • Free: 100k texts per month, 5 skills, analytics dashboard with 7 days storage
  • Standard: no limit for texts, $0.002875 USD/API call, 20 skills, analytics dashboard
    with 30 days storage
  • Plus: no limit for texts, TBD USD/API call, 50 skills, Intent Conflict Detection

Feature: Access to knowledge base
  Dialogflow: Yes
  Lex: Yes, with templates
  Watson: Yes, with templates plus manual generation

Feature: Unknown utterance handling
  Dialogflow: Yes
  Lex: Yes
  Watson: Yes

Feature: Training size
  Dialogflow: 2k intents, 250 entities, 2k trainings per intent
  Lex: 200k symbols for declarations
  Watson: 100 nodes per skill on free; unlimited from Standard and on

Feature: Training for intents/entities
  Dialogflow: Yes
  Lex: Yes
  Watson: Yes

Feature: Languages
  Dialogflow: 14, Spanish included
  Lex: English
  Watson: 13, Spanish included

Feature: Additional features
  Dialogflow: Remembers user; Feedback; Text-to-Speech; Easy implementation
  Lex: Remembers user; Feedback; Personalization; Text-to-Speech; Easy implementation;
    Conversation sent to user
  Watson: Remembers context; Feedback; Personalization; Text-to-Speech; Easy
    implementation; Conversation sent to user; IBM Cloud ($+)

Feature: Positive differential
  Dialogflow: Easily implemented
  Lex: Low cost
  Watson: Robust

Feature: Negative differential
  Dialogflow: Analytics
  Lex: Languages
  Watson: High cost

Table 4.18: Conversational Bots Services Comparison

An exercise involving costs was also done, comparing Google Dialogflow and IBM Watson,
where the average number of messages per minute was set to 5, taken from the maximum
activity in the analysis of chats per hour. Number of messages:

• Minute: 5

• Hour: 300

• Day: 4,800

• Month: 148,800

Estimated cost from this high mean activity approximation (a quick arithmetic check is sketched after the list):

• Google Dialogflow

– Essentials: $297.60 USD


– Plus: $595.20 USD

• IBM Watson

– Standard: $372 USD


– Plus: TBD
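These figures can be reproduced with straightforward arithmetic, as sketched below. Note that the 16 active hours per day are implied by the 4,800 messages-per-day figure, and the Watson rate of $0.0025 USD per message is an assumption that reproduces the $372 estimate (it differs slightly from the rate quoted in Table 4.18).

```python
# 5 messages/minute, 16 active hours/day, 31 days/month -> 148,800 messages/month
messages_per_month = 5 * 60 * 16 * 31

estimated_costs_usd = {
    "Dialogflow Essentials": messages_per_month * 0.002,    # $297.60
    "Dialogflow Plus":       messages_per_month * 0.004,    # $595.20
    "Watson Standard":       messages_per_month * 0.0025,   # $372.00 (assumed rate)
}
```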

Even though IBM Watson is more expensive, the difference in cost with respect to
Google Dialogflow is not substantial, and taking into account the analysis of the features of
both cloud services, IBM Watson was selected for a further deployment of a conversational
bot using our proposed framework.
Chapter 5

Discussion

In this chapter, we discuss the results from the experiments carried out during this research,
starting with the data preprocessing, then the data analysis, and finally the implementation
of the machine learning model. We revisit the hypothesis and objective to contrast them with
the results of each procedure.
A straightforward comparison with the literature is not possible. As mentioned in Sec-
tion 2.4, related works assume a pre-defined labeled dataset, and their contributions consist
of developing supervised techniques to achieve a correct categorization of the data. Our
work is about intent discovery and covers everything from the beginning phase of preparing
the transcripts and converting the whole set of conversations into a log file that allows for
specific analyses and that is preprocessed before being used in machine learning algorithms.
Once the data is prepared, the process is iterative: the user establishes intents, observes how
the utterances correspond to those intents, and finally digs into the clusters that do not relate
to any intent in order to determine a possible new intent to add to the knowledge base. This
last phase is repeated and allows for better decisions about the intents.

5.1 Data Preprocessing


Data preprocessing represented a challenging activity to carry out. The structure of the texts
varied over time, with the greatest impact at the beginning of Tecbot's deployment, and it
was not until the month of December that its deployment attended 100% of incoming chat
requests.
It was important to divide this stage into two parts: the transformation of transcripts
into log format, and the text preprocessing itself to clean the natural language texts.
The use of regular expressions in the first step, the transformation of transcripts into
log format, should be emphasized. Transcripts follow a specific structure, and even with the
challenge of the transcripts' key words varying over time, it was possible to identify each of
the scenarios and key terms used throughout the year, and finally to construct a tabular
dataset representing the logs of the conversations, separating the messages sent by each of
the parties involved.
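As a minimal illustration of this step, the sketch below parses one hypothetical transcript line format with Python's re module; the actual transcripts used several formats that changed throughout the year, so the pattern and field names here are assumptions.

```python
import re

# Hypothetical line format: "(10:32:05) Agente Maria: Buen dia, ¿en qué puedo ayudarte?"
LINE = re.compile(r"\((?P<time>\d{2}:\d{2}:\d{2})\)\s+(?P<speaker>[^:]+):\s+(?P<message>.*)")

def transcript_to_log(lines):
    """Turn raw transcript lines into (time, speaker, message) log rows."""
    rows = []
    for line in lines:
        m = LINE.match(line)
        if m:
            rows.append((m["time"], m["speaker"], m["message"]))
    return rows
```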


With this stage, we covered part of our first research question about how to construct a
representation that is viable for use in machine learning algorithms; the specific objectives
of data collection and of transforming unstructured data into a more convenient representa-
tion are also completed.

5.2 Data Analysis and Data Preparation


In order to fulfill our specific objective of performing data analysis to generate insights
into the contents of the conversations, we performed a set of analytical exercises over the
conversations in each of their forms: the raw data as the admissions department delivered it,
the transformed conversations in log format, and the contents of the utterances before and
after being cleaned by the text preprocessing pipeline.
Another interesting insight is the number of prospect-to-agent interactions compared
with prospect-to-Tecbot ones, which suggests that Tecbot is able to provide a solution in
fewer interactions than are needed when an agent attends a conversation. Additionally,
Tecbot enabled free-schedule attendance: since agents are only able to attend requests
during working hours, Tecbot makes it possible to attend requests at any hour of the day
without delay or impediment.
In the case of the utterance contents analysis, we can determine that there exists a variety
of utterances, from interactions containing only one word to dozens, with a few exceptional
utterances of hundreds of words. Even though we only focused on the predominant depart-
ment of SOAD, which takes up to one-third of the conversations of the admissions depart-
ment, it covered more than one thousand utterances, and that is only counting those carried
out by an agent and not attended by Tecbot. Word clouds and word frequency plots highlight
the heavy usage of words such as solicitud, beca, documento, admision, campus, apoyo, and
examen, among others, which is in line with the utterances expected at a department respon-
sible for attracting new students to the university. An additional conclusion from this analysis
is that the text preprocessing pipeline allows us to focus on the words that matter by general-
izing words to their lemmas and removing stop words, as can be appreciated in the compari-
son of the word clouds and word frequency plots of the raw utterances against the cleaned
ones.
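A rough sketch of such a pipeline using spaCy [26] is shown below. The zzzNOMBREzzz and zzzNUMEROzzz placeholders follow the convention visible in the cluster examples of Chapter 4, while zzzMAILzzz, the model name, and the exact rule ordering are assumptions rather than the precise pipeline used in this research.

```python
import unicodedata
import spacy

nlp = spacy.load("es_core_news_sm")   # small Spanish model; assumed available

def clean_utterance(text):
    """Lowercase, strip accents, normalize names/e-mails/numbers, lemmatize,
    and drop stop words, roughly mirroring the pipeline described above."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")   # removes accentuation
    lemmas = []
    for tok in nlp(text):
        if tok.is_stop or tok.is_punct or tok.is_space:
            continue
        if tok.ent_type_ == "PER":
            lemmas.append("zzzNOMBREzzz")     # person names
        elif tok.like_email:
            lemmas.append("zzzMAILzzz")       # hypothetical e-mail placeholder
        elif tok.like_num:
            lemmas.append("zzzNUMEROzzz")     # numbers
        else:
            lemmas.append(tok.lemma_)
    return " ".join(lemmas)
```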

5.3 Model and Evaluation


We used two possible word representations for the machine learning algorithms: Bag-of-
Words and Term Frequency-Inverse Document Frequency. For both term-document matri-
ces, we added a configuration on TF-IDF that allows for a broader explanation of the data,
namely an n-gram range from unigrams to trigrams. Because of time and space constraints,
we also set the maximum number of features to 500. Since term-document matrices suffer
from sparsity, increasing this number means higher use of memory and processing capacity
to perform calculations. Not only did we fix the maximum number of features, but we also
made use of a dimensionality reduction technique known as truncated SVD, which keeps the
top-k ranked components that approximate the original matrix at an estimated explained
variance. The original matrix was used for topic modeling and the truncated one for the
text clustering procedures, with an explained variance of approximately 90%, reducing the
number of features by around 200. This helps lower the cost of calculations on large sparse
matrices, since computational resources were limited during this research.
We also performed topic modeling, using Non-Negative Matrix Factorization to decom-
pose a term-document matrix into two matrices that explain how the terms, documents,
and topics are related. This exercise was applied to both word representations stated in the
previous paragraph (BoW and TF-IDF) and two principal configurations of the NMF imple-
mentation in the scikit-learn machine learning API: Frobenius norm and Kullback-Leibler
divergence. The application was set to determine ten different topics, and we attempted to
explain them by analyzing the words that contribute the most weight to each topic, ranging
from greetings to requests for scholarship information. This technique allowed us to discover,
on a broad scale, the kinds of messages found in the universe of the corpus. However, it is not
possible to determine a convenient number of topics. For that reason, we performed a broader
analysis with the K-means text clustering technique.
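Both configurations map onto the scikit-learn NMF implementation as sketched below; the solver choices follow the library's requirement that Kullback-Leibler divergence be paired with the multiplicative-update solver, and the remaining parameters (feature_names would come from the vectorizer's get_feature_names_out) are illustrative.

```python
from sklearn.decomposition import NMF

def topic_words(X, feature_names, n_topics=10, kl=False, top_n=10):
    """Factorize the term-document matrix and list the heaviest words per topic."""
    model = NMF(
        n_components=n_topics,
        init="nndsvda",
        beta_loss="kullback-leibler" if kl else "frobenius",
        solver="mu" if kl else "cd",
        max_iter=400,
        random_state=42,
    )
    W = model.fit_transform(X)   # document-topic weights (unused in this sketch)
    H = model.components_        # topic-term weights
    return [
        [feature_names[i] for i in topic.argsort()[-top_n:][::-1]]
        for topic in H
    ]
```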
The clustering that follows is an automatic equivalent of the work done by Haponchyk et al.
[22]. The authors use the existing Quora corpus, originally built for the task of finding
duplicate questions, and manually label a portion of the questions with intent-based clusters,
grouping together questions that refer to an equivalent intent and adding information about
the intent's entities. With our approach, we use a list of existing intents found in the chatbot
conversations; specifically, we gather the clickable options available to the users, which
represent intents. After gathering the intents, we can perform clustering on the utterances
and discover whether a specific cluster is associated with one of the intents based on its
maximum average similarity among the intents.
Given technical memory constraints, we decided to use the partitional clustering algo-
rithm K-Means, in contrast to the hierarchical techniques implemented in [22] for labeling
the data, which have the advantage of flexibly adjusting the clusters by choosing the level at
which the links are expanded. To compensate for that flexibility, we use the silhouette co-
efficient, which, after running the clustering over a range of values, allowed us to plot what
should be the optimal value of k and use it as a reference. With this, we also follow the
results found in [58], which concluded that partitional clustering implementations lead to
better solutions and do not require as many computational resources as hierarchical methods.
We calculated the silhouette coefficient over a range of k values for the clustering,
specifically 50 to 1000 with steps of 50. Analyzing the behavior of the coefficient as the
value of k increases, we found that for the BoW representation a convenient k could not be
determined, as the coefficient maintains an increasing trend as k increases. On the contrary,
the TF-IDF representation showed a peak value for the coefficient at k = 300. As the
clustering application was going to be manually validated, we also noticed a meaningful
increase in the coefficient when the value of k goes from 50 to 100, so the validation phase
was performed on an exercise of 100 clusters.
After obtaining the results from the clustering with k = 100, we performed a similarity
exercise using the menu options, representing them as possible intents, since for the menu-
based conversations these are the options the prospect uses to communicate with and receive
answers from Tecbot. The cluster-to-intent similarity procedure involved selecting, for each
of the 100 clusters, the intent with the highest mean cosine similarity among the cluster's
utterances. This provides the opportunity to do two things: verify the alignments of the
clusters, indicating whether a cluster is indeed associated with the intent returned by the
similarity procedure, and estimate a threshold that separates good alignments from incorrect
ones.
The validation procedure was done on a sample from each of the clusters, with 80% CI
+/- 10%, allowing the verification to be done on sample sizes ranging from 39 to 41 for each
of the 100 clusters, instead of analyzing the universe of utterances, which numbers more
than one thousand. The resulting threshold, which was set to 0.47, is not failure-proof, since
there exist a few clusters above the threshold that are not really aligned with their intent, and
also a few below the threshold that actually are. For that reason, we show examples of clus-
ters that are correctly aligned and of clusters that do not align with their intent (for which we
indicate an intent that would be more suitable after analyzing the contents of the utterances).
Results show that taking chatbot options as intents allowed us to automatically discover a set
of intents similar to the given ones. At k = 100, 22 possible intents were associated with the
clusters, and 13 of them were correctly aligned, which allowed us to establish the similarity
threshold of 0.47. At k = 300, 35 unique intents were detected, of which 18 comply with the
similarity threshold, having a cluster average similarity above 0.47. In general, we were able
to discover up to 36% of the 50 initial intents (after discarding 5 unrelated intents); as this is
an iterative process, the intents may be reconsidered and reworked in order to keep aligning
clusters and adding intents to the knowledge base. The clustering with k = 300 is the one
that maximizes the silhouette coefficient, based on the silhouette coefficient visualization
over a range of values of k. Given that in this application 35 of the 50 intents were detected,
it may be interpreted that the other 15 are to be reworked in order to be matched with the
utterances.
One phase not included is the deployment of this framework in the design of a real-
scenario chatbot. The key to gaining the most from the proposed framework is to set a
specific cadence to iterate again through the process with new conversations, to identify
whether changes should be made to the knowledge base in terms of modification, addition,
or even deletion, since intents change over time.
With this phase, we provide an answer to our remaining research questions about how
utterances can be considered similar to one another, and how a similarity threshold can be
established to differentiate correctly aligned clusters from those that are not aligned. We
also provide a K-means model that is able to group utterances into clusters while evaluating
their relation to a specific intent, either from an existing list or by inferring one from an
analysis of their contents. The remaining specific objectives are also fulfilled, dealing with
natural language processing techniques that can represent the data for a machine learning
model and provide clusters of similar utterances, while also evaluating them and defining
their respective intent either by similarity matching or by manual analysis.
Chapter 6

Conclusion

The hypothesis declared for this research was confirmed, as we were able to demonstrate
that clusters of utterances can be established to discover intents, which in turn aid the process
of designing an AI-based chatbot.
The objective set for this thesis research was to discover intents from a collection of
utterances, where the data consists of natural language conversations between prospects and
the agents attending the online chat service. Each of the specific objectives was met, and the
research questions were answered throughout the development of this research work. The
following paragraphs cover each of the stages of this research as well as the answers to each
of the research questions established at the beginning of this project.
The pipeline followed in order to fulfill the specific objectives that together describe the
general objective consisted of data collection, data analysis, data preprocessing, data prepa-
ration, modeling, and finally the evaluation of the machine learning model.
The data collection was possible because the admissions department of Tecnológico de
Monterrey provided us with the transcripts of the conversations attended during the year
2019. The delivery was made in two parts: the first half of the year was delivered when we
first requested the information, around September and October 2019, and in January 2020
they provided the rest of the conversations, corresponding to the second half of 2019. This
covered our first specific objective about data collection from the admissions department,
setting the timeline to the conversations recorded during the year 2019.
By completing the data analysis phase, we were able to understand the performance of
the admissions department's online chat service, and we also observed how Tecbot was de-
ployed gradually until it reached full attendance of conversations by the end of 2019. It was
possible to identify seasonality in the service's demand, which peaked in November 2019.
We were also able to identify the demand for requests by day of the week, with a high de-
mand on Mondays that decreases steadily toward Friday. Additionally, we noticed that the
busiest hours in an ordinary working day are around 4 and 5 PM. The Tecbot implementation
allowed the chat service to be open at any hour of the day, even late at night. This resolves
our second specific objective, in which we stated the development of data analysis insights
regarding the conversations received by the admissions department.
The most challenging stage was data preprocessing. Unstructured text data requires a
deeper understanding of the business process, which was only possible by analyzing the
transcripts and understanding their structure. Initial attempts at transforming the transcript
data into log format were required, which allowed us to transform basic conversations; but
as we analyzed them further, their structure varied slightly as the year passed, from new
agents attending the conversations, new formats for indicating the department, and informa-
tion with respect to the system handling transfers (either successful or unsuccessful), to
special cases where a user or an agent pasted a past conversation inside the chat. The results
of this first process were satisfactory, having transformed natural conversational data into a
tabular and more easily understood format. The second step in the data preprocessing stage
allowed us to clean and normalize the text. As the utterances correspond to natural language
conversations, there is no control over the texts that enter the chat service. There were
accents, spelling errors, slang, plurals, and terms specific to the Tecnológico vocabulary
(such as Tec21 or paa). We followed a text preprocessing pipeline consisting of unifying the
casing, re-encoding the texts (allowing the removal of accentuation), normalizing words
such as names, e-mails, and numbers, and finally transforming words to their equivalent
lemmas, which refer word variations to a form that generalizes them. As part of the data
preparation, we also established two word representations in order to apply machine learning
algorithms to the information from the preprocessed conversations; these representations
were Bag-of-Words and TF-IDF. With this, we concluded our first research question and
third specific objective regarding the transformation of the data into a form that can serve as
input for a machine learning algorithm and yield results corresponding to the conversations.
The modeling and evaluation phase allowed us to try two approaches to discovering
the groups of utterances found in the corpus: topic modeling and text clustering. The first
gave insightful information on the contents of the utterances sent by the prospects. We were
able to apply NMF to the utterances and to define the topics of different groups, determined
by the words used in each topic; however, there was no way to state an appropriate number
of topics to be defined. Text clustering had the advantage of being evaluable through the
silhouette coefficient, which was computed over a range of values when defining the number
of clusters and provided a good measurement for setting the number of clusters to be defined
by the algorithm. By testing both word representations, we found that TF-IDF provided a
peak value for the silhouette coefficient, and so we performed the evaluation on that ap-
proach. By implementing the cluster-to-intent similarity algorithm, we were able to deter-
mine which intent a cluster is most similar to, and by performing manual validation on a
sample of each cluster, we could conclude whether a cluster was actually aligned with the
resulting similar intent or whether another intent could be stated for that group. This allowed
us to set a threshold that can inform further clustering applications on this dataset by esti-
mating whether a similarity score is sufficient to consider the associated intent correct.
During these two last stages of modeling and evaluation, we were able to resolve the remain-
ing research questions and specific objectives, in which we declared the application of
machine learning algorithms to collect sets of similar utterances and to calculate their respec-
tive similarity measures against the intents found in the deployed menu-based chatbot,
which serve as a base for the kinds of intents present in the conversations. We also proposed
how to evaluate the clusters based on their similarity against the existing intents, allowing a
threshold to be established to differentiate those correctly aligned from those that are not,
and demonstrated how to infer a possible intent for those found not to be related to any
existing intent.

Having concluded the proposed methodology, the final consideration would be to deploy
this process into a production environment. Doing so would allow iterating through the
methodology over time, enabling the discovery of intents as time changes how intents are
represented and posed by the users. Since a knowledge base must be kept updated, this
process allows for such updates either by replacing how an intent is represented or by adding
new intents as newer information is sought in the conversational service.
This research demonstrated that it is possible to discover intents among a set of utter-
ances corresponding to conversations between prospects and agents in the field of academia,
specifically the admissions department. A straightforward comparison with the literature is
not possible since, in the literature, a pre-defined labeled dataset is assumed and the works
treat the problem as having a golden label for each of the utterances, which is not considered
part of the scope of this research. The main target of providing a framework for the discovery
of intents and the design of a knowledge base is met. This required transforming the conver-
sations into a log format, cleaning the texts, and finding a suitable word representation, after
which we were able to apply a clustering machine learning algorithm to discover intents. The
framework also provides the opportunity for repeated iterations, where new intents can be
added to the knowledge base and existing ones can be modified in order to improve the
results for the clusters corresponding to each intent. By applying this framework, it is pos-
sible to reduce the time taken to analyze the texts and define the intents; as the timeline
grows, the number of utterances to analyze grows as well, so this automation aids the dis-
covery of intents and the design of a knowledge base.

6.1 Future Work


There exists a vast range of opportunities to extend this research. We could consider this
version the beginning of a process of understanding the messages sent through the chat
service of the admissions department of Tecnológico de Monterrey. A few key points to
consider for future research are:

• The conversations that could be accessed for this research covered only the year 2019.
Future contributions should extend this to account for information from the following
years. This would allow broader and more significant analyses of the performance of
the online chat service, Tecbot, and satisfaction rates, and would provide more informa-
tion on the new kinds of requests that users are asking about that were not covered
during this research, for example, the impact of COVID-19-related guidelines for
admissions.

• The data preprocessing can be improved. For example, during this research we did not
account for resolving spelling errors, leading to a larger vocabulary size; results could
therefore be improved by applying an auto-correction process. Even though the model-
ing phase focused on the words used the most throughout the corpus, it would be more
efficient and would return better results if all words were corrected and accounted for
in the word representations. Another improvement to ease the data preparation would
be to gather synonyms in order to generalize specific actions and determine more
direct intents.

• The modeling and evaluation phases could be extended to the agents' side, where we
could draw a direct relationship between the intents found in the prospects' messages
and how they are dealt with from the agent's perspective. This would allow providing
a set of answers to a collection of given intents.

• Another future contribution is to make use of word embeddings, as term-document
matrices suffer from sparsity and are not able to capture the semantics of a text.

• A different way to contribute to this research's future work is to use this framework
to create a manually curated dataset for research on intent classification tasks, specif-
ically for this use case of an online chat service in academia.

• The techniques used in this research do not consider the field of deep learning, in
which interest has increased over the past years, with quite significant improvements
in natural language processing; thus, this research can be extended by applying state-
of-the-art deep learning techniques, such as transformer models, which make it pos-
sible to carry out classification tasks with broader features and increased accuracy.

• Another way to contribute to this research would be through broader computational
capacity. The algorithms were implemented on a computer with an i7-7700 processor
and 16 GB of RAM, so some calculations were limited, such as handling sparse
matrices, which were constrained to a maximum number of features and then reduced
with a dimensionality reduction technique.

• The scope of this research could not cover the deployment of a resulting chatbot in a
real-time environment. Even though a comparison exercise between chatbot services
was performed and IBM Watson was selected for implementation, this work's final
area of opportunity is to test how a chatbot configured with a set of intents defined
following the framework presented in this thesis performs with real prospects seeking
information through the admissions department's online chat service.
Bibliography

[1] Muestra de Nombres y Apellidos Comunes en México. http://www.datamx.io/dataset/muestra-de-nombres-y-apellidos-comunes-en-mexico, visited August 2020.

[2] Aggarwal, C. C., and Zhai, C. A survey of text clustering algorithms. In Mining Text Data. Springer, 2012, pp. 77–128.

[3] AOCNP, D. Watson will see you now: a supercomputer to help clinicians make informed treatment decisions. Clinical Journal of Oncology Nursing 19, 1 (2015), 31.

[4] Bengfort, B., Bilbro, R., and Ojeda, T. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O'Reilly Media, Inc., 2018.

[5] Blei, D. M. Probabilistic topic models. Communications of the ACM 55, 4 (2012), 77–84.

[6] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.

[7] Cappello, P., Comerio, M., and Celino, I. BotDCAT-AP: An extension of the DCAT application profile for describing datasets for chatbot systems. In PROFILES ISWC (2017).

[8] Chang, H. H., and Chen, S. W. The impact of customer interface quality, satisfaction and switching costs on e-loyalty: Internet experience as a moderator. Computers in Human Behavior 24, 6 (2008), 2927–2944.

[9] Chen, Y., and Tu, L. Density-based clustering for real-time stream data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007), pp. 133–142.

[10] Chen, Y., Zhang, H., Liu, R., Ye, Z., and Lin, J. Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowledge-Based Systems 163 (2019), 1–13.

[11] Dai, H., Zhao, L., Nie, Z., Wen, J.-R., Wang, L., and Li, Y. Detecting online commercial intention (OCI). In Proceedings of the 15th International Conference on World Wide Web (2006), pp. 829–837.

[12] Deepak, P. MixKMeans: Clustering question-answer archives. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016), pp. 1576–1585.

[13] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.

[14] Dudoit, S., and Fridlyand, J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3, 7 (2002), 1–21.

[15] Facebook. Wit.ai. https://github.com/wit-ai/wit, 2019.

[16] Følstad, A., and Brandtzæg, P. B. Chatbots and the new world of HCI. Interactions 24, 4 (2017), 38–42.

[17] Følstad, A., Skjuve, M., and Brandtzaeg, P. B. Different chatbots for different purposes: towards a typology of chatbots to understand interaction design. In International Conference on Internet Science (2018), Springer, pp. 145–156.

[18] Friedman, C., and Elhadad, N. Natural language processing in health care and biomedicine. In Biomedical Informatics. Springer, 2014, pp. 255–284.

[19] Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, 2019.

[20] Gupta, V., Lehal, G. S., et al. A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence 1, 1 (2009), 60–76.

[21] Gupta, V., Varshney, D., Jhamtani, H., Kedia, D., and Karwa, S. Identifying purchase intent from social posts. In Proceedings of the International AAAI Conference on Web and Social Media (2014), vol. 8.

[22] Haponchyk, I., Uva, A., Yu, S., Uryupina, O., and Moschitti, A. Supervised clustering of questions into intents for dialog system applications. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018), pp. 2310–2321.

[23] Herrera, A., Yaguachi, L., and Piedra, N. Building conversational interface for customer support applied to Open Campus, an open online course provider. In 2019 IEEE 19th International Conference on Advanced Learning Technologies (ICALT) (2019), vol. 2161, IEEE, pp. 11–13.

[24] Hofmann, T. Probabilistic latent semantic analysis. arXiv preprint arXiv:1301.6705 (2013).

[25] Hollerit, B., Kröll, M., and Strohmaier, M. Towards linking buyers and sellers: detecting commercial intent on Twitter. In Proceedings of the 22nd International Conference on World Wide Web (2013), pp. 629–632.

[26] Honnibal, M., Montani, I., Van Landeghem, S., and Boyd, A. spaCy: Industrial-strength Natural Language Processing in Python, 2020.

[27] Hu, D. H., Shen, D., Sun, J.-T., Yang, Q., and Chen, Z. Context-aware online commercial intention detection. In Asian Conference on Machine Learning (2009), Springer, pp. 135–149.

[28] Hu, J., Wang, G., Lochovsky, F., Sun, J.-T., and Chen, Z. Understanding user's query intent with Wikipedia. In Proceedings of the 18th International Conference on World Wide Web (2009), pp. 471–480.

[29] Huang, J., Zhou, M., and Yang, D. Extracting chatbot knowledge from online discussion forums. In IJCAI (2007), vol. 7, pp. 423–428.

[30] Joigneau, A. Utterances classifier for chatbots' intents, 2018.

[31] Jurafsky, D. Speech & Language Processing. Pearson Education India, 2000.

[32] Kalyanathaya, K. P., Akila, D., and Rajesh, P. Advances in natural language processing – a survey of current research trends, development tools and industry applications. International Journal of Recent Technology and Engineering (2019).

[33] Kim, S. N., Cavedon, L., and Baldwin, T. Classifying dialogue acts in one-on-one live chats. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (2010), pp. 862–871.

[34] Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., and Brown, D. Text classification algorithms: A survey. Information 10, 4 (2019), 150.

[35] Kubat, M. An Introduction to Machine Learning. Springer, 2017.

[36] Latham, A., Crockett, K., McLean, D., and Edmonds, B. Adaptive tutoring in an intelligent conversational agent system. In Transactions on Computational Collective Intelligence VIII. Springer, 2012, pp. 148–167.

[37] Lee, D. D., and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (1999), 788–791.

[38] Li, X. Understanding the semantic structure of noun phrase queries. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010), pp. 1337–1345.

[39] Li, X., Wang, Y.-Y., and Acero, A. Learning query intent from regularized click graphs. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008), pp. 339–346.

[40] Paatero, P., and Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 2 (1994), 111–126.

[41] Padró, L., and Stanilovsky, E. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012) (Istanbul, Turkey, May 2012), ELRA.

[42] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[43] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2020).

[44] Radziwill, N. M., and Benton, M. C. Evaluating quality of chatbots and intelligent conversational agents. arXiv preprint arXiv:1704.04579 (2017).

[45] Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.

[46] Sánchez-Díaz, X., Ayala-Bastidas, G., Fonseca-Ortiz, P., and Garrido, L. A knowledge-based methodology for building a conversational chatbot as an intelligent tutor. In Mexican International Conference on Artificial Intelligence (2018), Springer, pp. 165–175.

[47] Saracco, B. H. Data science and predictive analytics: Biomedical and health applications using R. Journal of the Medical Library Association 108, 2 (2020), 334.

[48] Schütze, H., Manning, C. D., and Raghavan, P. Introduction to Information Retrieval, vol. 39. Cambridge University Press, Cambridge, 2008.

[49] Shawar, B. A., and Atwell, E. Chatbots: are they really useful? In LDV Forum (2007), vol. 22, pp. 29–49.

[50] Shawar, B. A., and Atwell, E. S. Using corpora in machine-learning chatbot systems. International Journal of Corpus Linguistics 10, 4 (2005), 489–516.

[51] Datos y Cifras. https://tec.mx/es/datos-y-cifras, Tecnológico de Monterrey, visited September 2020.

[52] Tsvetkova, M., García-Gavilanes, R., Floridi, L., and Yasseri, T. Even good bots fight: The case of Wikipedia. PLoS ONE 12, 2 (2017), e0171774.

[53] Vajjala, S., Majumder, B., Gupta, A., and Surana, H. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O'Reilly Media, 2020.

[54] Wei, C., Yu, Z., and Fong, S. How to build a chatbot: Chatbot framework and its capabilities. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing (2018), pp. 369–373.

[55] Weiss, S. M., Indurkhya, N., Zhang, T., and Damerau, F. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer Science & Business Media, 2010.

[56] Wen, J.-R., Nie, J.-Y., and Zhang, H.-J. Clustering user queries of a search engine. In Proceedings of the 10th International Conference on World Wide Web (2001), pp. 162–168.

[57] Williams, J. D., Kamal, E., Ashour, M., Amr, H., Miller, J., and Zweig, G. Fast and easy language understanding for dialog systems with Microsoft Language Understanding Intelligent Service (LUIS). In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (2015), pp. 159–161.

[58] Zhao, Y., Karypis, G., and Fayyad, U. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery 10, 2 (2005), 141–168.
Curriculum Vitae

Rolando Treviño Lozano was born in Monterrey, México, on October 22, 1996. He is a
student of the MSc in Computer Science at Tecnologico de Monterrey, Mexico. He holds a
BSc in Computer Science from Universidad Autonoma de Nuevo Leon, Mexico (2019). His
main research interests are data science, machine learning, and artificial intelligence. He has
worked on processing international trade data to determine trends, later participating in a
conference in Saudi Arabia. His current thesis work is on the application of machine learning
techniques for natural language processing (NLP).

This document was typeset in LaTeX 2ε by Rolando Treviño Lozano.

The template MCCi-DCC-Thesis.cls used to set up this document was prepared by the Research Group
with Strategic Focus in Intelligent Systems of Tecnológico de Monterrey, Monterrey Campus.
