
Sentiment Analysis on Dutch Phone Call Conversations

Delano de Ruiter
DelanodeRuiter@hotmail.com
Master information studies: Data Science
Faculty of Science
University of Amsterdam

Internal Supervisor: Dr. Bert Bredeweg, UvA, FNWI, IvI, B.Bredeweg@uva.nl
External Supervisor: Wout Hoeve, Zonneplan, Wout@zonneplan.nl

ABSTRACT
In research and business there is a need for capturing subjective information in speech. When sentiment analysis is successfully applied to speech, this can be done timely and automatically. In this study speech recognition is used to transcribe Dutch phone calls. Classification is done on the transcripts, resulting in a positive or negative prediction on sale. This study determines a significant difference in word frequency usage between the 'sale' and 'no sale' classes. This makes it possible to predict the class of a conversation correctly. It is expected that this approach can be applied to other domains and languages.

KEYWORDS
Automatic Speech Recognition, Sentiment Analysis, Sentiment Classification, Corpus Comparison

1 INTRODUCTION
Companies striving for the best customer satisfaction may benefit from customer sentiment analysis. Customer satisfaction results in ongoing loyalty, increased word-of-mouth, greater brand value, and is correlated with higher earnings [16].
Traditionally, reviews and surveys are used to measure satisfaction. These methods can be slow and limited to a small group. Also, customers might see surveys as a nuisance. The advent of social media creates a change in communication between company and customer. The communication is mostly unstructured but still contains a lot of information about the customer's sentiment. Sentiment classification has been performed successfully on social media [21], but this is still only a small part of the total communication between customer and business.
Nowadays, almost every bit of communication with the customer is stored. For many companies, communication goes mostly by phone. This research tries to predict customer conversion by sentiment analysis on phone call transcripts using speech recognition in Dutch. The objective of this study is to determine a difference in word frequency between phone call conversations that lead to a sale and those that do not, in the domain of solar panels.
Sentiment analysis on phone calls helps in tailoring to a customer's specific needs and wishes. Also, insights can be gathered on word usage in positive or negative calls, the optimum length of a conversation, recent shifts in sentiment, product-specific sentiment, and so on [16].
In recent years the field of automatic speech recognition (ASR) has made great progress and has been adopted in all kinds of applications [14]. Sentiment analysis (SA) also has an extensive body of research and is used a lot in marketing [15]. The combination of these two fields has, however, not been studied extensively, but has great potential [3, 19].
This paper is organized as follows. Section 2 describes related literature. In section 3 the approach of this study is explained. Section 4 shows the results. In section 5 these results are discussed. A conclusion is given in section 6. The last section presents some possible future work.

2 RELATED LITERATURE
This paper combines two fields of research, namely automatic speech recognition (ASR) and sentiment analysis (SA). These two fields are reviewed first. After that, research on speech sentiment analysis is described, followed by the benefits of visualizing sentiment. Lastly, suitable statistical tests for textual differences in corpora are given.

2.1 Automatic Speech Recognition
ASR systems are commonly used in everyday applications [14]. These systems allow natural communication between humans and computers. There are a number of commercial systems, such as those by Google, Amazon, and Microsoft, and two open-source systems: CMU Sphinx and Kaldi [14]. Most systems use a Hidden Markov Model (HMM). In recent years the results of ASR systems have improved dramatically with the addition of deep learning [14].
A HMM describes the probabilistic relationship between different states. Speech has a temporal structure and can be encoded as a sequence of audio frequencies. At each time step the HMM is able to make a transition to another state [5]. On entering a HMM state, an acoustic feature vector is generated. A word is made up of a combination of these states. The single unit of sound is called a phoneme. For example, in English the phoneme /th/ is a single sound and is often followed by /ĕ/ to make the word 'the'.
HMMs are the basis of most speech systems; however, improvements can be made with additions. "Refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination." [5]
In addition to the probabilistic HMM, three models are used to make the match. Single units of sound are combined to make words. For any combination of sounds the most likely word is composed. This word is then seen in the context of its predecessors. Some words follow other words more often. The three models used to make the match are:
– Acoustic model: Finds the most likely phoneme from acoustic properties. The acoustic properties are taken from a small piece of audio.
– Phonetic dictionary: Contains a mapping from words to phonemes. Some words can be pronounced in different ways, therefore there can be multiple phoneme mappings for one word. The words in this dictionary are the words that can be recognized.
– Language model: Restricts the word search by taking previous words into consideration. Some words follow other words more often.
All these components work together to go from audio to text. The acoustic model maps the feature vector to a phoneme. The most likely word for a combination of phonemes is found in the phonetic dictionary. This word is placed in the context of the previously found words by the language model [5].
Since individual speech systems work differently, they produce different results. The amount of processing that needs to be done also varies between models. The adaptability of models is also different among systems [2].
The method of deployment affects the adaptability. Therefore, it should be taken into account. Below, three methods of deployment for a speech system are given with their advantages and disadvantages:
– Cloud solution: A cloud solution abstracts the model from the user. Google Speech API is one that has a Dutch model available [14]. A few parameters can be changed, but options are limited. This makes it easy to use but more limited in changes.
– Server solution: This method makes it possible to tune a model and then deploy it on a server. For example, the Kaldi toolkit is available only on Linux but can be used in a container and then run on any server. A Dutch model is available for Kaldi.
– Local solution: A local solution does processing on the device of recognition. Since speech recognition uses substantial processing, it might not be suitable for every scenario. With this method adaptation of models is possible [17]. CMU Sphinx is suitable for local training and is available on Mac and Windows [17].
Creating or adapting a new model is only possible in open-source systems. Two of those systems are Kaldi and CMU Sphinx [2, 17]. Creating or adapting a model for a specific domain or acoustic environment might provide better performance than an existing model.
All speech models depend on data. Training data consists of audio speech accompanied by a transcription. This data might be hard to come by, especially in a language with fewer speakers and a specific domain.
Speaker diarization is the process of partitioning speech by speaker identity. The LIUM tool is able to perform such a task [11]. It must be trained for a specific language, of which Dutch is not available. Speakers can be discriminated and passed as a feature to the transcript [19]. Diarization can also be solved for phone calls when the incoming and outgoing channels are split.
Validation of ASR is commonly done by the word error rate (WER). The equation is as follows:

WER = (I + D + S) / N × 100%    (1)

where I is the number of inserted words, D the number of deleted words, S the number of substituted words, and N the number of words in the reference transcript [14].
The word error rate gives a score to compare speech recognition systems to each other. In addition to this, the Jaccard index is used to measure the similarity between two sets, namely the recognized transcript and the human-validated transcript.
The Jaccard index measures the overlap of two sets. This measure states the percentage of words captured in the transcript in comparison to the real words used. The equation is:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (2)

The speed of speech recognition systems may differ between models. Speed is commonly measured as the ratio of recognition time to the duration of the speech. When this measure is 1, the recognition takes as long as the audio itself; depending on the system it can be faster or slower.
Performance of ASR can be influenced by: adjusting or creating a model suitable for the application, a matching acoustic environment, bigger but slower models, audio quality, audio level and noise, and matching sample rate and bit rate between training and test data¹.

¹ https://cmusphinx.github.io/wiki/faq/
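To make equations (1) and (2) concrete, the sketch below computes both measures for a recognized transcript against a reference transcript. It is a minimal illustration in plain Python (word-level Levenshtein alignment for WER, word sets for the Jaccard index), not the exact evaluation code used in this study.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (insertions + deletions + substitutions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100


def jaccard(reference: str, hypothesis: str) -> float:
    """Jaccard index over the sets of unique words in both transcripts."""
    a, b = set(reference.split()), set(hypothesis.split())
    return len(a & b) / len(a | b) * 100


# Illustrative example: one substituted word out of six.
ref = "de zonnepanelen worden volgende week geplaatst"
hyp = "de zonnepanelen worden volgende maand geplaatst"
print(f"WER: {wer(ref, hyp):.1f}%  Jaccard: {jaccard(ref, hyp):.1f}%")
```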

2.2 Sentiment Analysis
Sentiment analysis deals with extracting subjective information from language. There are roughly two approaches [22]: the linguistic approach and the machine learning approach. The linguistic approach uses a dictionary of words with a semantic score. Every token of a text is matched to the dictionary and a score is calculated based on the matching words. A dictionary has to be made first for this approach. These can be created manually.
In the machine learning approach a classification algorithm is presented with a series of feature vectors of previous data and, in the case of supervised classification, labels are attached. Feature extraction is a step taken to go from text to a numerical representation. A machine learning algorithm is trained on data, and a test set is used to measure the performance of the algorithm. The machine learning approach is usually more adaptable and more accurate [22]. For this study, textual data is available as transcripts of speech.

2.3 Speech Sentiment Analysis
Sentiment analysis on speech can be based on emotion detection from acoustic features [7] and on textual sentiment from audio [3]. Both approaches can be combined to give a single prediction. Textual sentiment has a bigger predictive power than acoustic features [7, 9].
Another approach is doing classification on the transcript of speech [3]. A number of classification algorithms are used, of which a Support Vector Machine (SVM) performed the best. This work states that the transcripts represent just 44% of what is originally said and is still able to cluster successfully on sentiment. A limitation of this work is the usage of an artificially generated data set.
Differentiating speakers and performing sentiment analysis on each individual speaker is a viable approach [19]. A shortcoming of this work is also the artificially generated data set. An alternative argument is keeping the conversational structure and performing sentiment analysis on the whole conversation [3].
Besides the multiple options for transcripts, there are also multiple approaches to sentiment analysis on transcripts. For text, the linguistic approach or the machine learning approach is used in research [1]. Both approaches are successful in predicting sentiment. The problem with a linguistic method is the manual evaluation and judgment of the lexicon.
In short, some research has been done on speech sentiment analysis, but more should be done using real data and a machine learning classification approach. Difficulties with audio sentiment analysis are that speech is often less structured, contains pauses and breaks, and does not always follow the rules of grammar. Also, the speech recognizer makes errors, which creates noise in the transcripts.

2.4 Sentiment Visualization
Sentiment visualization has matured from pie and bar charts to extensive visualizations and has become a notable topic of research [13]. Often used techniques in the literature are clustering and classification. Comparison and creating an overview are also mentioned often in the literature [13].
An information need is not always present during exploration [6]. Therefore, a visual approach is helpful in the exploration of sentiment. A visual approach can help guide the user and has a positive effect on engagement. The results suggest that users spend more time performing tasks when using scatter plots. Words that are distinctive for their respective class can be found more easily.

2.5 Corpus Comparison
Comparing word frequencies in corpora can be done with statistical significance testing [8]. Used tests are Pearson's χ² test, the log-likelihood ratio test, and the Wilcoxon rank-sum test. However, according to some research the χ² test is not suitable for text [12]. This is due to common words defeating the null hypothesis easily. The log-likelihood ratio is also problematic since the test is based on the assumption of independent samples, whereas words in text are not independent [8].
Word frequencies in transcripts can be tested for significant differences between the 'sale' and 'no sale' classes. The Wilcoxon rank-sum test can be applied to the total word frequencies in both classes. Individual words can also be tested on their frequency in multiple conversations.

3 METHODOLOGY
The approach presented in this paper is using automatic speech recognition (ASR) on recorded phone calls. This results in a transcript of all spoken words in the conversation. Three speech models are used to transcribe and their performance is compared. The models are:
– CMU Sphinx: A model is trained with the use of the open source toolkit CMU Sphinx.
– Google Speech API: A cloud solution with an existing model for Dutch.
– Kaldi-NL: An existing model for Kaldi is used on a server.
Transcriptions are made with the use of speech recognition. Text-based sentiment analysis is performed on the transcript and results in a 'sale' or 'no sale' prediction. Classification of telephone calls that convert or not is used to find key differences between both classes.
Corpus differences between the 'sale' and 'no sale' classes are visualized and tested for word frequency differences. The visualizations and measures are then shown in the results section.
The first subsection describes the creation of a model, after that the other two models are described. Then the approach to sentiment analysis is reported. After that the approach to visualizations is reported.

3.1 Creating a speech recognition model with CMU Sphinx
Creating or adapting a model for a specific domain or acoustic environment might provide better performance than an existing model. Two of those open-source systems are Kaldi and CMU Sphinx [2, 17]. Kaldi is only available for Linux. CMU Sphinx is available for Mac and Windows. Therefore, CMU Sphinx is chosen for this research.
A speech recognition model is trained on data. This data consists of speech audio with a transcript.
The model of this research is Dutch. For the best performance the acoustic environment should match between training and testing. The sampling rate should also be the same².

3.1.1 Data. The Dutch audio that is used for training of the model is the Spoken Wikipedia Corpora³. Wikipedia contains articles about individual topics, which makes it a diverse set of words. The Dutch Spoken Wikipedia Corpora contains 79 hours of word-aligned spoken language. The time alignment is not done for every word, but every word in the speech audio needs a transcript. Misalignment is a problem that can occur when the speech contains utterances that are not in the transcript. The recognizer then hears audio and uses the wrong word for that utterance. This is solved by taking the time stamp of a word near the beginning of a sentence and at the end of a sentence and taking all words in between. This creates sentence-length audio clips accompanied by the text of the sentence.

3.1.2 Preprocessing Audio. In CMU Sphinx audio clips must be the length of roughly one sentence. The audio clips are stored per article, so these are cut to the appropriate length. The reason for roughly one sentence length is that longer audio clips might get out of sync with the text. When the words are no longer aligned with the speech, the wrong words are recognized for that speech. This can also be a problem when some words are not in the text, but are spoken.
Audio clips cannot simply be cut at a set interval, because then words are cut in half. The audio clips must be cut on a time stamp of the words that are spoken in the audio clip and contain the words that are between time stamps. The audio is cut on the time stamp and the words that are spoken in that audio clip are saved. Audio is cut by using FFMPEG⁴.
CMU Sphinx uses the audio format '.WAV'. FFMPEG is used to convert audio formats. It is also necessary to have uniformity in the sample rate. Other sample rates can be used, but the training data must have the same sample rate as the recognized speech. The audio should also be in mono, with a single channel.

3.1.3 Training the CMU Sphinx speech model. CMU Sphinx has a Dutch dictionary and language model available⁵. The acoustic model is trained on 13 hours of speech, which is usually far too little [20]. However, this model is still useful for adapting with the data discussed above. Adaptation of a model is suitable for increasing the accuracy of the model and adapting to an acoustic environment. This means no model has to be built from scratch. With the language model, dictionary and training data in place the model is trained.
The steps taken in training the model are as follows. An acoustic feature file is generated for every individual audio clip. The next step is collecting statistics from the adaptation data. After that, adaptation of the HMM is done. Two methods that are frequently used are Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR). A combination of the two is most successful [18]. Adaptation is therefore done using a combination of the two methods.
The trained model is able to transcribe phone call conversations it has not been trained on. The other two existing models can also be used to transcribe the same phone call conversations. These models can then be compared on errors and speed. How the other two models are used to transcribe is discussed below.

3.2 Speech Recognition on a server with Kaldi-NL
A model for Dutch has been created with the Kaldi speech toolkit at the University of Twente⁶. However, the Kaldi toolkit is only available on Linux. To make it available on Windows and Mac, a container is set up with Docker. Containerization solves the problem of system dependencies by providing the container with all its dependencies and abstracting it from the operating system. A port is opened on the Docker container to communicate with the host computer. A bind mount serves the purpose of accessing files between host and container. A decode script is called in the container environment with the audio and output directory as arguments.
Preparation of transcription involves segmentation. The recognition is done with the use of a neural network. Rescoring is done as a last step to improve the recognition rate. The transcripts are saved as text for every phone call.

3.3 Cloud speech recognition with Google Speech API
Google Speech API⁷ is a cloud solution for speech recognition. A connection is made to the API with a Python script. The audio is sent to the cloud and a transcription is retrieved. A Dutch model is selected for this. Google Speech API has a synchronous and an asynchronous process. The synchronous process uses local files of a maximum of one minute. The asynchronous process can handle longer audio files, but these must be stored in the Google cloud. Since phone calls are often longer than one minute, the asynchronous process is used to transcribe.
Although a Dutch model is available, there is no phone call acoustic model for Dutch. The acoustic environment of a phone call is different from, for example, a microphone recording. The effect of a mismatch in acoustic environment is worse performance on recognition.

3.4 Speech Recognition Comparison
The evaluation of speech recognition is done with the use of the textdistance⁸ package in Python. The Jaccard index and WER are calculated for the transcripts of the three speech systems in comparison to a manually adjusted transcript. One hour of phone calls is manually transcribed for evaluation.

² https://cmusphinx.github.io/wiki/tutorialadapt/
³ https://nats.gitlab.io/swc/
⁴ https://ffmpeg.org/
⁵ https://sourceforge.net/projects/cmusphinx/files/
⁶ https://github.com/opensource-spraakherkenning-nl
⁷ https://cloud.google.com/speech-to-text/docs/
⁸ https://pypi.org/project/textdistance/
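As an illustration of the audio preparation described in section 3.1.2, the sketch below cuts one sentence-length clip out of a longer recording and converts it to mono WAV with FFMPEG, called from Python. The file names, time stamps, and the 16 kHz sample rate are illustrative assumptions, not the exact values used in this study.

```python
import subprocess

def cut_clip(source: str, start: float, end: float, target: str,
             sample_rate: int = 16000) -> None:
    """Cut the audio between two word time stamps and write a mono WAV clip."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", source,
        "-ss", str(start),        # clip start time in seconds
        "-to", str(end),          # clip end time in seconds
        "-ar", str(sample_rate),  # resample so training and test data match
        "-ac", "1",               # mono, single channel
        target,
    ], check=True)

# Hypothetical example: one sentence between two time stamps of the aligned transcript.
cut_clip("article_12.ogg", 14.2, 19.8, "article_12_sentence_03.wav")
```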

3.5 Sentiment Analysis
The purpose of sentiment analysis on phone calls is creating insights into positive and negative calls. A positive call is a sale and a negative call is no sale. This is a binary classification problem. Three algorithms are used to predict the correct class. The algorithms are capable of finding distinguishing terms between classes. These terms are visualized and a comparison is made between the corpora. The data used for classification is discussed in the data subsection.

3.5.1 Data. The data for sentiment analysis are phone call conversations. The conversation is between an advisor and a customer. The subject of these phone calls is solar panels. The language spoken is Dutch. The conversations are about placement of solar panels on roofs and everything that has to do with it. Conversations do not follow guidelines and can go in many different directions.
Attributes for the phone calls are: sale or no sale, date, duration, customer name, advisor name, and direction of call. Classification is done on the sale or no sale attribute.
Two groups of phone calls are made: sales intakes, which are the first call with a customer, and a random sample, of which some calls can be sales intakes or follow-up calls about sales. For both groups a thousand calls are selected.
Two thousand calls are selected and automatically transcribed by the speech recognition system. Calls are roughly ten minutes long. This makes the total duration 166 hours for both groups. Table 1 shows the distribution between 'sale' and 'no sale' conversations.

Table 1: Phone call data

Call group           Sale   No Sale   Total   Duration (hours)
First conversation   410    590       1000    166
Random sample        630    370       1000    166

3.6 Classification of phone calls
Classification is done with the use of machine learning. The machine learning approach uses a label for each call. This is an indicator of customer interest and can be seen as positive or negative. Making a distinction between these groups is binary classification. Classification is done on words. Different words are used in positive and negative phone calls. The sale or no sale attribute serves as a label for classification. Therefore it is supervised classification.
Suitable supervised classification algorithms for text are: Naive Bayes, Support Vector Machine (SVM) and Logistic Regression [4, 23]. These algorithms are tested on the same transcriptions of speech. The performance metrics are shown in the results section.
Before words can be used by a classifier some pre-processing needs to be done. First, special characters are removed and words are turned to lowercase. Also, the words that are used in almost every document or in almost none are removed, since these are not important for classification. These parameters are tuned to the best classification performance.
A classifier algorithm is not able to use words directly, so the words in a document are represented by a vector. The words are given a tf-idf score. Tf stands for term frequency: the number of times the word occurs in the document. Idf stands for inverse document frequency, which is a score that gets lower when a word appears in more documents. The vector can consist of each word as a term or use multiple terms, called n-grams. When, for example, the word "good" has "not" in front of it, it means the opposite of good. A bi-gram can take these negations into account whereas a single term cannot. Uni-grams and bi-grams are tested and performance is shown in section 4.2.
Parameters are tuned for the best classification performance. The ROC curve gives insight into the true positive rate and the false positive rate. The true positive rate is the proportion of positives that are correctly identified as positive. The false positive rate is the proportion of negatives that are incorrectly identified as positive. With the use of the ROC curve the best tuned model is selected. The ROC curve for SVM is shown in figure 1.

Figure 1: ROC curve of SVM on sales conversations

Accuracy can be a deceiving measure in skewed data. This is not the case here since training and test data are balanced for both groups. Therefore, an increase in precision and recall is usually an increase in accuracy.

3.7 Sentiment Visualization
Classification coefficients indicate words that occur more often in one group than in the other. The bigger factor weights contribute more to the classification of positive or negative. These terms have the biggest impact on the classification. Visualizations clarify these factors with the purpose of finding terms that are distinctive for a group.
The scatter plot of Figure 2 shows the word frequency in the 'sale' and 'no sale' groups. This scatter plot is created with Python Pyplot. The scatter plot shows the frequency of words but does not show linguistic variation between groups clearly. Since the number of words between corpora is not equal, the word occurrence per 25k words is used. The words shown in the scatter plot are chosen because of their coefficients.
Visualizing linguistic variation between groups of text is done with the Scattertext tool⁹. Instead of frequency, Scattertext uses a ratio to visualize the relative occurrence. Scattertext is based on a scatter plot which displays a number of words in the corpus. The coordinates of the word indicate the frequency of the word and the ratio of the occurrence in both categories [10]. The Scattertext plot is shown in Figure 3.

⁹ https://github.com/JasonKessler/scattertext
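A minimal sketch of how such a Scattertext plot can be produced is given below, assuming the transcripts and their 'sale'/'no sale' labels are available in a pandas DataFrame. The column names, example sentences, and output file name are illustrative placeholders, not the exact configuration of this study.

```python
import pandas as pd
import scattertext as st

# Hypothetical frame: one row per transcribed call, labelled 'sale' or 'no sale'.
df = pd.DataFrame({
    "label": ["sale", "no sale"],
    "text": ["ja dat is akkoord dan plannen we het zometeen in",
             "ik wil eerst meer informatie over het rendement"],
})

# Build a corpus split on the label column; a simple whitespace tokenizer suffices here.
corpus = st.CorpusFromPandas(df,
                             category_col="label",
                             text_col="text",
                             nlp=st.whitespace_nlp_with_sentences).build()

# Interactive HTML scatter plot of term frequency versus class association.
html = st.produce_scattertext_explorer(corpus,
                                       category="sale",
                                       category_name="Sale",
                                       not_category_name="No sale",
                                       width_in_pixels=1000)
with open("scattertext_sale_vs_no_sale.html", "w", encoding="utf-8") as f:
    f.write(html)
```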

The odds ratio is a measure of association between two groups. The log-odds ratio is calculated by dividing the relative word occurrence in 'sale' by the relative word occurrence in 'no sale' and taking the logarithm.
With the use of Scattertext, terms can also be queried and the transcripts that contain that term are found. When, for example, a word is predominantly positive or negative, the phone calls that contain this word can be found.

3.8 Corpus Comparison
To determine a difference in word frequency between both conversation classes a statistical test is used. According to the literature the Wilcoxon rank-sum test is suitable for comparing word frequencies between two corpora [8, 12].
The Wilcoxon rank-sum test serves to test the significance of word frequency in 'sale' and 'no sale' conversations. The test is performed on words with their total frequency in the 'sale' or 'no sale' group. The test is also performed on individual words and their frequency in a thousand individual conversations.
The reason that the test is not done on every word for every conversation is the introduction of the multiple comparisons problem. This problem is avoided here by not testing all words individually.
Classification delivers a list of 1500 words. For SVM the list contains 750 positive and 750 negative coefficients. The positive and negative words are both tested. The word occurrence for 'sale' and 'no sale' is counted and tested. Positive and negative coefficients are split in the test to measure a difference between the two.

4 RESULTS
The results section follows the structure: speech recognition evaluation, classification performance, sentiment visualization and corpus comparison. Measurements are given of speech recognition and classification performance. Visualizations are shown in multiple figures. Significance of word differences is given in tables. Findings will be stated and problems encountered will be discussed.

4.1 Speech Recognition Evaluation
The speech recognition models compared in this paper are: Google Speech API for Dutch, the open source Kaldi-NL speech model, and a model trained in CMU Sphinx. These models work differently and have different attributes. First the performance measures are given. After that other differences are stated.
Different speech recognition models make different errors. The systems are compared on word error rate (WER) and Jaccard index as mentioned in the related literature. Table 2 gives the performance measures on phone call conversations.

Table 2: ASR performance measures on Dutch phone call conversations

ASR system          WER     Jaccard index
Kaldi-NL            37.6%   55.4%
Google Speech API   74.2%   22.8%
CMU Sphinx          79.3%   15.9%

As can be seen in Table 2, Kaldi-NL has the lowest error rate and captures most of the original conversation. Besides these metrics there are other differences between systems. Some differences are: speed of computation, access and ease of use, additions to the model, and format of output. The models are each discussed in the sections below.

4.1.1 CMU Sphinx. A problem that was encountered during training is that the training data mostly consists of a few speakers. The Spoken Wikipedia Corpus has 145 speakers, however a few top speakers contributed a big part of the corpus. When training on just a few people the model fits to those speakers. Dictation for many speakers requires training data from many people. Therefore the performance of this system is not optimal.
There is no segmentation to solve speakers talking at the same time. Speed of dictation is also lacking compared to other systems. For recognition CMU Sphinx uses '.WAV' files. These are uncompressed and can be multiple times larger than the '.mp3' files that the other systems are able to use for recognition. There are other factors that contribute to speed. This is however difficult to compare, since Google Speech API is only available in the cloud and this is not comparable to local recognition.

4.1.2 Kaldi-NL. One limitation of Kaldi is its availability on Linux only. This problem was solved by using a Docker container to run Kaldi in. A connection can be made to the container from any device that has network capabilities.
One way in which the performance of Kaldi increases is segmentation. Individual speakers are partitioned into segments. After that, recognition is started. The problem of speakers speaking at the same time is solved this way.
It is difficult to compare the speed of a local system, a server, and a cloud because of different hardware. An advantage of using Kaldi in Docker is the scalability. A Docker environment can scale to multiple machines in a cluster. Speed of dictation is easily adjusted by the scale of the cluster.

4.1.3 Google Speech API. Google Speech API is a cloud solution. The advantage is ease of implementation. The drawback of an API connection is the limitation in adjustments. The Dutch model of Google does not have diarization and there is also no phone call acoustic model available. These features are available for the English model.
The output of dictation is significantly shorter for Google than for the other speech systems. The number of words in the evaluation is just 26% of that of Kaldi-NL. However, the words that are in the output capture 22.8% of the spoken text, as can be seen in Table 2. This suggests that Google only outputs words it is confident about above a certain threshold.

4.2 Classification
Classification is done twice: on sales intakes and on a random sample of sales calls. The three classification algorithms used are: Naive Bayes, Logistic Regression, and SVM. For each algorithm uni-grams and bi-grams are calculated. The accuracy, precision, and recall on a random sample of sales calls are shown in Table 3. On the test data SVM has the highest accuracy of 79%. Using bi-grams lowers the classification scores for all algorithms.
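The classification set-up of section 3.6 can be sketched as follows, assuming scikit-learn. The frequency cut-offs, the n-gram range, the test split, and the hypothetical `load_transcripts` loader are illustrative choices rather than the exact settings of this study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# transcripts: list of call transcripts; labels: 1 = sale, 0 = no sale.
transcripts, labels = load_transcripts()  # hypothetical loader

X_train, X_test, y_train, y_test = train_test_split(
    transcripts, labels, test_size=0.2, stratify=labels, random_state=0)

# Tf-idf features; drop words that occur in almost no or almost every document.
# Switching ngram_range to (1, 2) adds bi-grams.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 1),
                             min_df=5, max_df=0.7)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LinearSVC()  # the other algorithms tested: MultinomialNB, LogisticRegression
clf.fit(X_train_vec, y_train)
pred = clf.predict(X_test_vec)

print("accuracy ", accuracy_score(y_test, pred))
print("precision", precision_score(y_test, pred))
print("recall   ", recall_score(y_test, pred))
```
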
Table 3: Classification on sales transcripts

Algorithm                     Accuracy   Precision   Recall
SVM                           79%        0.79        0.79
Naive Bayes                   77%        0.76        0.77
Logistic Regression           75%        0.76        0.77
Naive Bayes bi-gram           75%        0.76        0.75
SVM bi-gram                   72.5%      0.73        0.72
Logistic Regression bi-gram   69%        0.73        0.73

Classification is also done on the first call with a customer. These classification measures are shown in Table 4. Here the Naive Bayes algorithm performs best. Therefore one classification algorithm does not outperform another in all cases.

Table 4: Classification on sales intakes

Algorithm                     Accuracy   Precision   Recall
Naive Bayes                   67%        0.68        0.67
Naive Bayes bi-gram           65%        0.67        0.65
SVM                           63%        0.62        0.63
Logistic Regression           62%        0.62        0.62
Logistic Regression bi-gram   62%        0.61        0.63
SVM bi-gram                   55%        0.54        0.55

The performance measures indicate that precision and recall are often similar. This indicates that the classification algorithms do not favor one over the other. This means that the algorithms balance retrieving all relevant items against retrieving only relevant items.

4.3 Sentiment Visualization
The frequency of words in a 'sale' or 'no sale' conversation is shown in the scatter plot of figure 2. The words shown are the top ten words in 'sale' and the top ten words in 'no sale'.

Figure 2: Frequency/frequency scatterplot of words

The scatter plot shows some words that occur more often in one group than in the other. However, the difference is not easily quantified this way. The Scattertext plot adds to this.
The Scattertext plot shows the log-odds ratio and the log frequency for words. The 1500 words of the classification are shown. Words that are in fewer than 5 documents are removed. Also, words that are in more than 70% of the documents are removed. The top positive and top negative words are shown, since not all words fit in one figure. The Scattertext plot is shown in figure 3.

4.4 Corpus Comparison
To test the statistical significance of word frequency differences the Wilcoxon rank-sum test is used. The Wilcoxon rank-sum test is applied to the total word frequency for positive and negative words in the 'sale' and 'no sale' class. The test is also done on some individual words and their word frequency in a thousand transcripts.
The result of the Wilcoxon rank-sum test for positive and negative words is shown in Table 5. The results are highly significant for both groups according to the p-value.

Table 5: Wilcoxon rank-sum test on total word frequency

Group      Frequency ratio   p-value
Positive   1.59764           2.05208e-06
Negative   0.62592           1.40809e-10

Testing for both positive and negative word frequency results in significant differences. However, on an individual word level results may vary. Visualizations resulted in words that appear more frequently in one group. Some of these words are shown in Table 6 with a p-value for the Wilcoxon rank-sum test.
The words used in Table 6 are all significant. Not all words are tested for significance, since testing all words individually introduces the multiple comparisons problem.
As can be seen in Table 6, a higher sale/no-sale ratio does not necessarily mean a lower p-value. This is likely due to the distribution of words in the corpus, where the appearance of words in a corpus is in bursts.

Table 6: Wilcoxon rank-sum test on word frequency

Dutch word   English       Sale/no-sale ratio   p-value
Akkoord      Agree         3.64                 1.16603e-45
Prima        Fine          1.29                 4.80939e-11
Goed         Good          1.1                  2.50576e-11
Zometeen     Later         1.32                 2.63690e-05
Straks       Soon          1.06                 1.99887e-05
Jij          You           1.13                 7.77938e-06
Misschien    Maybe         0.8                  0.00142
Informatie   Information   0.6                  0.00084
Rendement    Profit        0.52                 0.00545
Interesse    Interest      0.32                 3.19611e-11
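A minimal sketch of the word-level test, assuming SciPy: for a single word, the per-conversation frequencies in the 'sale' and 'no sale' groups are compared with the Wilcoxon rank-sum test. The counts below are made up for illustration; in the study each group contains the frequencies from the actual transcripts.

```python
from scipy.stats import ranksums

# Per-conversation frequencies of one word, e.g. "akkoord",
# in 'sale' and in 'no sale' transcripts (illustrative numbers).
freq_sale = [3, 1, 0, 4, 2, 2, 0, 5]
freq_no_sale = [0, 1, 0, 0, 2, 0, 1, 0]

stat, p_value = ranksums(freq_sale, freq_no_sale)
print(f"statistic = {stat:.3f}, p-value = {p_value:.5f}")
```
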
Figure 3: Scattertext of positive and negative words

5 DISCUSSION
The research presented in this paper finds a significant difference in word frequency between 'sale' and 'no sale' phone call conversations. There is a significant difference on individual words and on a group of positive and negative words.
Classification can successfully be performed because of a difference in word frequency. The classification accuracy of SVM on a random sample of sale conversations is 79%. Classification accuracy for the first conversation is 67% with the Naive Bayes algorithm. This indicates that more diverse conversation content brings about a higher classification accuracy. Also, there does not seem to be a best classification algorithm in every case.
The approach of this paper is generic enough to be applied to other domains and other languages. When introducing another domain the classification algorithm should be trained for that domain, since words might have different meanings in different contexts. When using this approach for another language a different speech model is necessary for that language. Also, the classification algorithm needs to be trained for that language.
Transcriptions used in this study are created with the Kaldi-NL model. These transcriptions capture more than half of the spoken words with a word error rate of 37.6%. This is considerably more accurate than the models of Google and CMU Sphinx. The speech recognition systems have distinct limitations and different strengths. It might be possible for the other systems to perform better when placed in a different context.
The approach of this study is using speech recognition to create a transcript. Even when this transcript contains errors, classification
can still be performed. Classification is done with machine learning. There is no algorithm that performs best in all cases.
Words that appear more often in one class than the other are found with visualizations. With the current approach significant distinctive words can be found. Insights can be gathered on distinctive words. Some findings are given below:
– Words that indicate confirmation occur significantly more often in sale conversations.
– Uncertain words occur significantly more often in no sale conversations.
– The words interested, information and options occur significantly more often in no sale conversations.
– Words that indicate advance occur significantly more often in sale conversations.
– When on a first-name basis, the conversation is significantly more often a sale.
It is beyond the scope of this research to conclude why a word appears more often in one group than the other. Other limitations of this study are the lack of speaker diarization and of acoustic features for sentiment prediction. These additions might be able to improve classification performance.

6 CONCLUSION
This research demonstrates that significant differences in word frequency can be found between 'sale' and 'no sale' phone call conversations about solar panels in Dutch. Classification can be successfully performed due to the difference in word frequency. In general, these results suggest that classification can be performed when a significant difference in word frequency is apparent.

7 FUTURE WORK
Transcripts used in this study contain all spoken words in a conversation. Who said what is not taken into account. This work proved it is possible to classify accurately without speaker diarization. However, it might be useful to know who said what in a conversation for better classification performance.
In this system speech recognition is done after the conversation. It is also possible to do online recognition. In that case the speech recognition is done during the conversation. Words that contain sentiment can be highlighted during the conversation and a classification score can be calculated.
Terms that appear more often in one class can be found with the use of classification. The effect that these words have on a conversation is not known. A next step would be finding out why these appear more often in either group and what the effect is of using these words in a conversation.

REFERENCES
[1] Birgitta Ojamaa, Päivi Kristiina Jokinen, and Kadri Muischnek. 2015. Sentiment analysis on conversational texts. NODALIDA (2015).
[2] Christian Gaida, Patrick Lange, Rico Petrick, Patrick Proba, Ahmed Malatawy, and David Suendermann-Oeft. 2014. Comparing Open-Source Speech Recognition Toolkits. DHBW (2014).
[3] Souraya Ezzat, Neamat El Gayar, and Moustafa M. Ghanem. 2010. Investigating Analysis of Speech Content through Text Classification. International Conference of Soft Computing and Pattern Recognition (Dec. 2010), 105–110. https://doi.org/10.1109/SOCPAR.2010.5686000
[4] G V Garje, Apoorva Inamdar, Apeksha Bhansali, Saif Ali Khan, and Harsha Mahajan. 2016. Sentiment Analysis: Classification and Searching Techniques. IRJET 3, 4, Article 651 (April 2016), 2796–2798 pages.
[5] Mark Gales and Steve Young. 2007. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing 1, 3, Article 3 (Jan. 2007), 195–304 pages. https://doi.org/10.1561/2000000004
[6] Eduardo Graells-Garrido, Mounia Lalmas, and Ricardo Baeza-Yates. 2016. Sentiment Visualisation Widgets for Exploratory Search. Social Personalization Workshop (Jan. 2016).
[7] David Griol, José Manuel Molina, and Zoraida Callejas. 2019. Combining speech-based and linguistic classifiers to recognize emotion in user spoken utterances. Neurocomputing 326-327 (Jan. 2019), 132–140. https://doi.org/10.1016/j.neucom.2017.01.120
[8] Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. 2014. Significance testing of word frequencies in corpora. Literary and Linguistic Computing 31, Article 2 (Dec. 2014), 374–397 pages. https://doi.org/10.1093/llc/fqu064
[9] Jia Sun, Weiqun Xu, Yonghong Yan, Chaomin Wang, Zhijie Ren, Pengyu Cong, Huixin Wang, and Junlan Feng. 2016. Information Fusion in Automatic User Satisfaction Analysis in Call Center. IHMSC (Aug. 2016). https://doi.org/10.1109/IHMSC.2016.49
[10] Jason Kessler. 2017. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations (April 2017).
[11] Eva Kiktova and Jozef Juhar. 2015. Comparison of Diarization Tools for Building Speaker Database. Information and Communication Technologies and Services 13, 4 (Nov. 2015), 6. https://doi.org/10.15598/aeee.v13i4.1468
[12] Adam Kilgarriff. 2011. Comparing Corpora. International Journal of Corpus Linguistics 6, Article 1 (Nov. 2011), 97–113 pages. https://doi.org/10.1075/ijcl.6.1.05kil
[13] Kostiantyn Kucher, Carita Paradis, and Andreas Kerren. 2017. The State of the Art in Sentiment Visualization. Computer Graphics Forum 37, 1, Article 1 (June 2017), 71–96 pages. https://doi.org/10.1111/cgf.13217
[14] Veton Këpuska and Gamal Bohouta. 2017. Comparing Speech Recognition Systems (Microsoft API, Google API And CMU Sphinx). IJERA 7, 3, Article 2 (March 2017), 5 pages. https://doi.org/10.9790/9622-0703022024
[15] Mika V. Mäntylä, Daniel Graziotin, and Miikka Kuutila. 2018. The Evolution of Sentiment Analysis - A Review of Research Topics, Venues, and Top Cited Papers. Computer Science Review 27 (Feb. 2018), 16–32. https://doi.org/10.1016/j.cosrev.2017.10.002
[16] Don O'Sullivan and John McCallig. 2012. Customer satisfaction, earnings and firm value. European Journal of Marketing 46, 6, Article 2 (March 2012), 20 pages. https://doi.org/10.1108/03090561211214627
[17] Paul Lamere, Philip Kwok, Evandro Gouvêa, Bhiksha Raj, Rita Singh, William Walker, Manfred Warmuth, and Peter Wolf. 2003. The CMU Sphinx-4 Speech Recognition System. ICASSP (Jan. 2003).
[18] T Ramya, S Lilly Christina, P Vijayalakshmi, and T Nagarajan. 2014. Analysis on MAP and MLLR based speaker adaptation techniques in speech recognition. IEEE (March 2014). https://doi.org/10.1109/iccpct.2014.7054938
[19] Maghilnan S and Rajesh Muthu. 2018. Sentiment Analysis on Speaker Specific Speech Data. I2C2 (Feb. 2018). https://doi.org/10.1109/I2C2.2017.8321795
[20] Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-Resource Speech-to-Text Translation. Interspeech (Sept. 2018). https://doi.org/10.21437/Interspeech.2018-1326
[21] Shaha Al-Otaibi, Allulo Alnassar, Asma Alshahrani, Amany Al-Mubarak, Sara Albugami, Nada Almutiri, and Aisha Albugami. 2018. Customer Satisfaction Measurement using Sentiment Analysis. IJACSA 9, 2, Article 2 (Jan. 2018), 106–117 pages. https://doi.org/10.14569/IJACSA.2018.090216
[22] Harsh Thakkar and Dhiren Patel. 2015. Approaches for Sentiment Analysis on Twitter: A State-of-Art study. (Dec. 2015).
[23] Alexandre Trilla and Francesc Alias. 2013. Sentence-Based Sentiment Analysis for Expressive Text-to-Speech. IEEE 21, 2, Article 2 (Feb. 2013), 223–233 pages. https://doi.org/10.1109/TASL.2012.2217129
