
A Call Center Agent Productivity Modeling

Using Discriminative Approaches

Abdelrahman Ahmed, Yasser Hifny, Sergio Toral and Khaled Shaalan

Abstract In this article, we present a novel framework for measuring the productivity of customer service representatives (CSRs) in real estate call centers. The framework casts productivity measurement as a binary classification task and compares generative and discriminative classifiers: the generative classifier is Naive Bayes (NB), while the discriminative classifiers are logistic regression (LR) and the linear support vector machine (LSVM). To train the classifiers, a speech corpus (7 h) was collected and annotated from three different call centers located in Egypt. The accuracy results on this corpus show that LSVM leads to the best results (82%) and that machine learning methods may successfully replace the subjective evaluation methods commonly used in this domain.

Keywords Productivity measurement ⋅ Naïve Bayes ⋅ Logistic regression ⋅ Support vector machine

A. Ahmed (✉) ⋅ S. Toral
Electronic Engineering Department, University of Seville, Seville, Spain
e-mail: abdahm@alum.us.es
S. Toral
e-mail: storal@us.es
Y. Hifny
Department of Information Technology, University of Helwan, Cairo, Egypt
e-mail: yhifny@fci.helwan.edu.eg
K. Shaalan
School of Informatics, Edinburgh, UK
e-mail: Khaled.shaalan@buid.ac.ae
K. Shaalan
The British University in Dubai, Dubai, UAE

© Springer International Publishing AG 2018
K. Shaalan et al. (eds.), Intelligent Natural Language Processing:
Trends and Applications, Studies in Computational Intelligence 740,
https://doi.org/10.1007/978-3-319-67056-0_24

1 Introduction

Call centers are the front door of any organization, where crucial interactions with the customers are handled [Reynolds (2010)]. Effective and efficient operations are the key ingredient in the profitability and reputation of both in-house and outsourced call centers. It is very difficult to measure productivity objectively because the output of an agent, as a firm worker, is the spoken words delivered to the customer over the phone. Evaluation is therefore mostly handled subjectively. Subjective evaluation in call centers is the essence of the qualitative method, in which the interactions between the agent and the customer are monitored and evaluated according to the evaluator's perception [Sharp (2003)]. It is performed by listening to an agent's recorded call, tapping a live call, or having a member of the quality team or an anonymous caller make a test call [Rubingh (2013)]. The quality team listens to the agents' recorded calls and uses predefined evaluation forms (evaluation checklists) [Cleveland (2012)]. This evaluation process has many drawbacks. One is that quality teams evaluate the agents according to their own perceptions and prior experiences [Cleveland (2012)]. Subjective evaluation also opens the door to favoritism through what are called social ties [Breuer et al. (2013)]. A quantitative study of social ties and subjective performance evaluation [Breuer et al. (2013)] shows that closer social attachment between supervisors and subordinates leads to better performance ratings even when there are no differences in true performance. Another drawback of subjective evaluation is that resources are too limited to evaluate all agents consistently over time. For instance, some agents are evaluated in different shifts (day shift/night shift), which leads to inconsistent or unfair evaluation from one agent to another. A typical challenge in performance evaluation studies is that the true performance is not observable to the researcher, so it is hard to assess the gap and detect evaluation distortions [Breuer et al. (2013)]. This means that subjective evaluation may underestimate an agent whose true performance is higher. Conversely, an agent may be overestimated because of factors that are not relevant to the true performance or the quality of service.

This paper proposes three classification methods for objectively measuring the agent's productivity through a machine learning approach. The next section (Sect. 2) discusses the conceptual framework and gives an overview of the main building blocks. Section 3 discusses in some detail the binary classification methods and parameter optimization. Section 4 explains the experiment carried out, and Sect. 5 discusses the study results. Finally, Sect. 6 concludes the study and recommends research opportunities for future work.

2 Related Work and Proposed Framework

This section gives an overview of productivity measurement and highlights general concepts and methods of agent evaluation in the call center environment.

2.1 Productivity Measurement Definition

A productivity measure is commonly understood as a ratio of outputs produced divided by resources consumed (Eq. (1)) [Steemann Nielsen (1963)]. The observer has many different choices with respect to the scope and nature of both the outputs and the resources considered [Card (2006)]. For example, the output can be measured in terms of delivered product or service, while the resource can be measured in terms of effort or monetary cost [Card (2006)]. An effective productivity measurement enables the establishment of a baseline against which performance improvements can be measured [Thomas and Zavrki (1999)]. This is the crucial part of productivity measurement, because each call center has its own objectives and productivity criteria, which differ from one domain to another. It therefore requires a dynamic approach to grasping the salient productivity characteristics of each call center, which helps organizations make better decisions about investments in processes, methods, tools, and outsourcing [Card (2006)]. The productivity measurement can be formulated as in Eq. (1):

$$\text{Productivity} = \frac{\text{Agent Output}}{\text{Input Effort}} \qquad (1)$$

The quality of service in call centers is measured using quantitative and qualitative methods [Rubingh (2013)]. The quantitative method considers first call resolution, the average handling time of the call, the wrap-up time, and adherence time [Cleveland (2012)]. Every call center draws its own baseline that measures the overall performance objectively according to the ultimate call center objectives and strategies [Abbott (2004)]. The qualitative method is the process of monitoring and evaluating the handling of customer interactions subjectively [Cleveland (2012)]. Both methods aim at fulfilling the call center key performance indicators (KPIs), which are widely applied to measure agent performance as well as overall call center productivity [Reynolds (2010)]. Taylor et al. (2002) argue that qualitative and quantitative measures are not simply polar opposites but represent a continuum, along which operations can be located according to a combination of the identified quality/quantity dimensions. As mentioned, the qualitative method in call centers is subject to the evaluator's perception and/or favoritism, which affects the agent's evaluation. This study tries to replace the legacy qualitative methods by classifying the calls objectively using machine learning approaches.
The conceptual framework of the study is described in Fig. 1, where the evaluation process is automated from beginning to end. The block diagram includes a speaker diarization process, speech recognition, and a modeling and classification process. The next section discusses each building block in the framework.

Fig. 1 The study framework

2.2 Speech Recognition and Speaker Diarization

Speech recognition systems date back to the 1980s and have achieved significant improvements in the new era of machine learning using neural networks [Yu and Deng (2012)]. Once the calls are transcribed into text, content analysis becomes a powerful tool for feature prediction and interpretation [Othman et al. (2004)]. Arabic speech recognition has achieved high accuracy in terms of word error rate (WER) [Ahmed et al. (2016)]. The WER is the main indicator of speech recognition accuracy and performance [Young et al. (2015)]: the lower the WER, the better the recognition performance [Woodland et al. (1994)]. An inbound or outbound call comprises the agent's speech, the customer's speech, silences, music on hold, and noise. As the agent's part is the target of the analysis, a diarization process is required. Diarization uses an acoustic model and sophisticated signal and speech processing to split a single-channel (mono) recording into the different speakers [Tranter and Reynolds (2006)]. It removes silences and music and yields clean single-speaker speech [Tranter and Reynolds (2006)].

The diarization process was intended to be performed with the LIUM diarization toolkit [Meignier and Merlin (2010)], a Java-based open source toolkit specialized in diarization using speech recognition models. It requires a Gaussian Mixture Model (GMM) trained on voice data and the corresponding labels, using two or more clustering states according to the number of speakers (the number of states equals the number of speakers). It uses a monophone GMM to represent the local probability of each speaker [Meignier and Merlin (2010)]. For speech recognition, we use both GMM and Hidden Markov Model (HMM) methods, but in different configurations. For Arabic, we use 3 HMM states per phone (40 proposed Arabic phones), with each state represented by 16 Gaussian components. The Arabic speech recognition system was presented in [Ahmed et al. (2016)]. We leave the diarization process for future work.

Table 1 Sample of Arabic letters, corresponding character transliterations, and their English equivalents

Arabic letter
Transliteration       ga   d    sh   l   k
English equivalent    A    D    SH   L   K

2.3 Converting the Arabic Text into Latin (Transliteration Process)

This step is essential for converting the Arabic transcription into Latin script for machine processing. The character set consists of 36 characters; a sample is shown in Table 1.
The transliteration process maps each letter from Arabic to the corresponding
Latin character. The next example shows a transliteration of an Arabic statement:

l̂ykm aslam w r@mt all* w brkat*


Example 1. Sample of Arabic statement transliteration in Buckwalter.
The transliteration shown above transforms the statement from the right-to-left Arabic writing direction into left-to-right Latin script. Buckwalter1 is a powerful open source tool for transliterating Arabic into Latin and is used in various Arabic language processing applications.
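As a concrete illustration, the following minimal Python sketch performs the character-level substitution. The dictionary covers only an illustrative subset of the character set; the full scheme follows the Buckwalter table referenced in footnote 1, and the "sh" entry mirrors the variant shown in Table 1.

```python
# -*- coding: utf-8 -*-
# Minimal sketch of the Arabic-to-Latin transliteration step.
# The mapping is an illustrative subset only (see footnote 1 for the full table).

BUCKWALTER_MAP = {
    u'\u0627': 'A',   # alif
    u'\u0628': 'b',   # ba
    u'\u062F': 'd',   # dal
    u'\u0633': 's',   # seen
    u'\u0634': 'sh',  # sheen (per Table 1)
    u'\u0643': 'k',   # kaf
    u'\u0644': 'l',   # lam
    u'\u0645': 'm',   # meem
    u'\u0648': 'w',   # waw
    u'\u064A': 'y',   # ya
}

def transliterate(arabic_text):
    """Map each Arabic character to its Latin counterpart; spaces and
    unmapped characters pass through unchanged."""
    return u''.join(BUCKWALTER_MAP.get(ch, ch) for ch in arabic_text)
```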

2.4 Sentiment Analysis and Binary Classification

After the speech has been transcribed into text, the next step is to process the text using sentiment analysis. Sentiment analysis refers to natural language processing that detects and classifies the sentiment expressed by an opinion holder [Murphy (2012)]. Sentiment analysis, also called opinion mining, is a way to classify text based on the opinions and emotions expressed about a particular topic [Richert et al. (2013); Chen and Goodman (1996)]. The technique classifies text by polarity (on/off, yes/no, good/bad) and is used for assessing people's opinions of books, movies, etc. It deals with billions of words over the web and classifies positive and negative opinions according to the most informative features (words) extracted from the text.

Sentiment analysis uses different binary classification methods to classify and predict the most informative features. Binary classification means that the classifier yields only two possible outcomes, i.e. productive/non-productive in our study.
1 http://www.qamus.org/transliteration.htm.

This article selects three classification methods and compares their classification performance on text (Sect. 3). Note that agent productivity differs from opinion mining (emotion classification): productivity is an assessment of the agent's output, as expressed in Eq. (1), regardless of emotional words. However, this work assumes that call content whose semantics carry a positive meaning tends to be classified as a productive call [Ezpeleta et al. (2016)].

2.5 Data Validity

Classification accuracy is one of the most important factors in this study. Human accuracy studies show that the level of agreement regarding sentiment is about 80%.2 However, this figure cannot be considered a baseline for the study, for two reasons. First, accuracy depends on the domain of the collected text and varies from one domain to another; for example, productivity features are perceived differently than features in domains such as spam emails or movie reviews. Second, the machine learning approach and human perception are incommensurable. Hence, the study presents an unprecedented performance baseline for real estate call centers located in Egypt. For the data validation of the study, the classifier should be able to classify the test set accurately as intended. The accuracy calculation is given by Eq. (2):

$$\text{Accuracy} = \frac{F^c_{cor}}{F_{tot}}, \qquad (2)$$

where $F^c_{cor}$ is the number of correctly classified features per class and $F_{tot}$ is the total number of features extracted.
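As a minimal sketch, Eq. (2) amounts to the following computation over predicted and reference labels (the label names are ours for illustration):

```python
def classification_accuracy(predicted, reference):
    """Accuracy as in Eq. (2): correctly classified items over the total.

    Both arguments are equal-length lists of class labels, e.g.
    'productive' / 'non-productive'."""
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return float(correct) / len(reference)

# Example: 4 of 5 test items classified correctly -> accuracy 0.8
print(classification_accuracy(
    ['productive', 'productive', 'non-productive', 'productive', 'productive'],
    ['productive', 'non-productive', 'non-productive', 'productive', 'productive']))
```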

3 Binary Classification Methods

This section describes classification using Naïve Bayes (NB), logistic regression (LR), and the linear support vector machine (LSVM).

3.1 Naïve Bayes Classifier (NB)

The Naïve Bayes classifier is built on Bayes' theorem under the assumption that the features are independent of each other. Naïve Bayes satisfies the following equation:

2 http://www.webmetricsguru.com/archives/2010/04/sentiment-analysis-best-done-by-humans.

$$p(c|x) = \frac{p(x|c)\,p(c)}{p(x)}, \qquad (3)$$

where $c$ is the class type (productive/non-productive) and $x_1, x_2, \ldots, x_n$ are the text features. The term $p(x|c)$ is the likelihood of the features given the class, estimated in the training process. We ignore $p(x)$ in the denominator, as it is a constant that never changes. Accordingly, we look for the class that maximizes the product of the likelihood $p(x|c)$ and the prior probability $p(c)$—Eq. (4):

$$\hat{c} = \operatorname{argmax}_c \left[ p(x|c)\,p(c) \right] \qquad (4)$$

We calculate the joint probability by multiplying the probabilities of the words given the class, $p(x_i|c)$, by the class probability $p(c)$ and taking the class with the highest result [Murphy (2006)]:

$$p(C_k|x_1, x_2, \ldots, x_n) \propto p(C_k) \prod_{i=1}^{n} p(x_i|C_k) \qquad (5)$$

The data set is manually transcribed and classified into productive/non-productive. The learning process finds the maximum likelihood estimates of $p(x|c)$ and $p(c)$. The prior $p(c)$ is simply calculated as follows:

$$p(c) = \frac{N_c}{N_{tot}}, \qquad (6)$$

where $N_c$ is the number of words (features) annotated for class $c$ and $N_{tot}$ is the total number of features in both classes. For the maximum likelihood estimate, we count the frequency of a word in a class and divide it by the overall word count of the same class [Jurafsky and Martin (2014)]:

$$p(x_i|c) = \frac{\text{count}(x_i, c)}{\sum_{x} \text{count}(x, c)} \qquad (7)$$

Since some words may not occur in one of the classes (productive/non-productive), $\text{count}(x_i, c)$ can be zero, which makes the whole product of probabilities zero as well. Laplace smoothing [Jurafsky and Martin (2014)] avoids this problem by adding one:

$$p(x_i|c) = \frac{\text{count}(x_i, c) + 1}{\sum_{x} \left( \text{count}(x, c) + 1 \right)} \qquad (8)$$

To avoid numerical underflow and to speed up the computation, we use log probabilities in the Naïve Bayes calculations [Yu and Deng (2012)].
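The following sketch puts Eqs. (4), (6), and (8) together in log space. It is a minimal illustration rather than the exact code used in the experiment: the prior is estimated from document counts, and the smoothing uses the standard add-one denominator (class word count plus vocabulary size).

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (word_list, label) pairs.
    Returns log priors (Eq. 6) and Laplace-smoothed log likelihoods (Eq. 8)."""
    class_count = Counter()
    word_count = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_count[label] += 1
        word_count[label].update(words)
        vocab.update(words)
    total = float(sum(class_count.values()))
    log_prior = {c: math.log(n / total) for c, n in class_count.items()}
    log_lik = {}
    for c in class_count:
        denom = float(sum(word_count[c].values()) + len(vocab))
        log_lik[c] = {w: math.log((word_count[c][w] + 1) / denom) for w in vocab}
    return log_prior, log_lik

def classify(words, log_prior, log_lik):
    """Return the class maximizing log p(c) + sum_i log p(x_i|c), i.e. Eq. (4)
    computed in log space to avoid underflow."""
    best, best_score = None, float('-inf')
    for c in log_prior:
        s = log_prior[c] + sum(log_lik[c][w] for w in words if w in log_lik[c])
        if s > best_score:
            best, best_score = c, s
    return best
```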

3.2 Logistic Regression (LR)

Logistic regression (also called logit regression) is a binary classification method for linearly separable classes [Jurafsky and Martin (2014)]. It is required to output 1 when a sample is classified as productive and 0 when it is non-productive. Following linear regression and model training, the score of class $c$ given features $x$ is presented in Eq. (9):

$$p(c|x) = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n = \sum_{i=1}^{N} w_i x_i = \mathbf{w}^T \mathbf{x} \qquad (9)$$

where $w_i$ is the weight of each feature per class.

The model in Eq. (9) yields values ranging from $-\infty$ to $\infty$, which does not represent the required probability between 0 and 1. The odds ratio is a mathematical device that limits the result to the range 0 to $\infty$, behaving very close to linearly in Eq. (10), by dividing the probability of success (productive) by the probability of failure (non-productive):

$$\text{Odds-Ratio} = \frac{p(c|x)}{1 - p(c|x)} \qquad (10)$$

However, the original Eq. (9) ranges from $-\infty$ to $\infty$; this range can be matched by taking the natural log:

$$\log\left(\frac{p(c|x)}{1 - p(c|x)}\right) = z = \mathbf{w}^T \mathbf{x} \qquad (11)$$

By mathematical derivation, the logit equation $\Phi(z)$ becomes:

$$p(c|x) = \Phi(z) = \frac{1}{1 + e^{-z}} \qquad (12)$$

where $z$ is the net input of the features and their weights:

$$z = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n = \sum_{i=1}^{N} w_i x_i = \mathbf{w}^T \mathbf{x}$$

The logit function $\Phi(z)$ is a sigmoid that bounds the classification output between 0 and 1 (a probability function). If we assume the classification threshold is 0.5 and the logit output is above 0.5 (e.g., $\Phi(z) = 0.8$), the probability that a particular sample with features $x$ and weights $w$ is productive is 80%, as shown in Fig. 2. The threshold value corresponds to the hyperplane where $z = 0$, i.e. $\sum_{i=0}^{N} w_i x_i = 0$ [Jurafsky and Martin (2014)].
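For instance, under the assumed threshold of 0.5, a net input of $z \approx 1.386$ yields $\Phi(z) \approx 0.8$ and hence a "productive" decision (the value of $z$ here is purely illustrative):

```python
import math

def sigmoid(z):
    """Logit function of Eq. (12)."""
    return 1.0 / (1.0 + math.exp(-z))

z = 1.386                  # hypothetical net input w^T x for one call
print(sigmoid(z))          # ~0.80 -> probability of "productive"
print(sigmoid(z) >= 0.5)   # True: classified as productive
```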
Equation (12) predicts the class given the features. We then have to train the model to find the value of $\mathbf{w}$ required for optimum classification. Figure 3 illustrates the network diagram: an initial value of the weights $\mathbf{w}$ is defined,

Fig. 2 Logit regression function—Wikipedia

Fig. 3 Logit regression training model

the net input is summed, the sample is classified, and the weights are then adjusted according to the detected error [Raschka (2015)].
In linear regression, the error is determined by the sum-squared-error (SSE) cost function, Eq. (13):

$$J(\mathbf{w}) = \sum_{i} \left( \Phi(z^{(i)}) - y^{(i)} \right)^2 \qquad (13)$$

The optimum value of $\mathbf{w}$ can be estimated by minimizing the cost function. Gradient-based methods are used in language processing and machine learning to find the value that reduces the cost function to its minimum. In logistic regression, the cost function minimization is performed by conditional maximum likelihood estimation [Jurafsky and Martin (2014)]: we choose the parameters $\mathbf{w}$ that make the probability of the observed classes $y$ in the training data highest, given the features $x$.

Assuming the features are independent of each other, the maximum joint probability of the class $c$ given the features $x$ is the product of the probabilities of all class observations given the input features. The likelihood of the logistic function is defined in Eq. (14):

$$L(\mathbf{w}) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}, \mathbf{w}) \qquad (14)$$

The logit function $\Phi(z)$ is a probability function governed by probability rules; it behaves like a Bernoulli distribution, as stated in Eq. (15):

$$L(\mathbf{w}) = \prod_{i=1}^{n} \left( \Phi(z^{(i)}) \right)^{y^{(i)}} \left( 1 - \Phi(z^{(i)}) \right)^{1 - y^{(i)}} \qquad (15)$$

Taking the log allows small values to be handled by summation rather than multiplication, and it eliminates the exponents for easier manipulation; the log-likelihood becomes:

$$l(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} y^{(i)} \log(\Phi(z^{(i)})) + (1 - y^{(i)}) \log(1 - \Phi(z^{(i)})) \qquad (16)$$

Therefore, the objective (cost) function to be minimized is the negative log of Eq. (16):

$$J(\mathbf{w}) = -\log L(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log(\Phi(z^{(i)})) + (1 - y^{(i)}) \log(1 - \Phi(z^{(i)})) \right] \qquad (17)$$

We estimate $\mathbf{w}$ to maximize the likelihood (minimize the cost function) by differentiating Eq. (16):

$$\frac{\partial l(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^{n} \left( y^{(i)} \frac{1}{\Phi(z^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - \Phi(z^{(i)})} \right) \frac{\partial \Phi(z^{(i)})}{\partial \mathbf{w}} \qquad (18)$$

The partial derivative of the sigmoid function is:

$$\frac{\partial \Phi(z)}{\partial z} = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \Phi(z)(1 - \Phi(z)) \qquad (19)$$

Now, substituting Eq. (19) into Eq. (18):

$$\frac{\partial l(\mathbf{w})}{\partial \mathbf{w}} = \sum_{i=1}^{n} \left( y^{(i)} \frac{1}{\Phi(z^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - \Phi(z^{(i)})} \right) \Phi(z^{(i)}) (1 - \Phi(z^{(i)})) \frac{\partial z^{(i)}}{\partial \mathbf{w}}$$

$$= \sum_{i=1}^{n} \left( y^{(i)} (1 - \Phi(z^{(i)})) - (1 - y^{(i)}) \Phi(z^{(i)}) \right) x^{(i)}$$

Hence, the weight update of the learning process is:

$$\Delta \mathbf{w} = \sum_{i=1}^{n} \left( y^{(i)} - \Phi(z^{(i)}) \right) x^{(i)} \qquad (20)$$

The new value of $\mathbf{w}$ is the previous value plus the update of Eq. (20):

$$\mathbf{w}_{new} := \mathbf{w}_{old} + \Delta\mathbf{w} = \mathbf{w}_{old} + \sum_{i=1}^{n} (y^{(i)} - \Phi(z^{(i)})) x^{(i)} = \mathbf{w}_{old} - \eta \nabla J(\mathbf{w}) \qquad (21)$$

One way to update the parameters is gradient descent, as shown in Eq. (21), but Newton's method is usually used to improve the training speed [Fan et al. (2008)]. $\eta$ is the learning rate (a constant between 0 and 1) that controls the step size, i.e. the amount of effect of $\Delta\mathbf{w}$. It is adjusted to set the speed (number of steps) of reaching the optimum value; a poor choice may overfit the model. Overfitting is one of the well-known problems in machine learning, arising when the model fits the training set but does not generalize well to unseen data [Murphy (2012)]. An overfitted model has high variance and is too complex. Conversely, a model suffers from underfitting when it fails to capture the pattern in the training data and shows low performance [Raschka (2015)].

Regularization (here, L2 regularization) is a method to control collinearity (high correlation among features) and prevent overfitting [Raschka (2015)]. It penalizes extreme parameter weights and model complexity [Murphy (2012)]. L2 regularization is the summation of the squared weights over the parameters, as described in Eq. (22):

$$L2 = \lambda \|\mathbf{w}\|^2 = \lambda \sum_{i=1}^{n} w_i^2 \qquad (22)$$

The cost function after adding L2 regularization is given in Eq. (23):

$$J(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log(\Phi(z^{(i)})) + (1 - y^{(i)}) \log(1 - \Phi(z^{(i)})) \right] + \lambda \|\mathbf{w}\|^2 \qquad (23)$$

where $\lambda$ is the regularization parameter.
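In scikit-learn, Eq. (23) corresponds to an L2-penalized logistic regression, where the parameter C is the inverse regularization strength $1/\lambda$. A minimal sketch over bag-of-words features follows; the two sample strings are hypothetical transliterated transcripts, not data from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical transliterated call chunks; 1 = productive, 0 = non-productive.
calls = ["Alsaylz Fwq m%*r alsT@", "trw@ ambar@ Nmrty Ktyr"]
labels = [1, 0]

X = CountVectorizer().fit_transform(calls)   # bag-of-words matrix

# L2 penalty as in Eq. (23); C = 1/lambda; liblinear is the solver
# family mentioned in Sect. 4.
clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')
clf.fit(X, labels)
print(clf.predict(X))
```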



3.3 Linear Support Vector Machine (LSVM)

The Support Vector Machine (SVM) is among the most powerful and widely used methods in machine learning and binary classification [Raschka (2015)]. The margin is defined as the distance between the separating hyperplane (decision boundary) and the training samples that are closest to this hyperplane [Murphy (2012)]. The method rests on the hypothesis that maximizing the margin gives the best hyperplane position for linearly classifying the samples. The support vectors are the samples nearest to (almost touching) the margin that bounds the hyperplane. Figure 4 illustrates the hyperplane and the margin for binary classes in the feature space $(x_1, x_2)$. The hyperplane is expressed in terms of the weight vector $\mathbf{w}$, the offset $b$, and the feature vector $\mathbf{x}$:

$$\mathbf{w}^T \mathbf{x} + b = 0 \qquad (24)$$

For any vector residing on the hyperplane, the equation yields zero. In Fig. 4, it is assumed that the first support vector of the positive class resides on the margin at distance +1 from the hyperplane, and the corresponding point of the negative class at −1. So we have two equations:

$$\mathbf{w}^T \mathbf{x}_{pos} + b = +1 \qquad (25)$$

$$\mathbf{w}^T \mathbf{x}_{neg} + b = -1 \qquad (26)$$

Fig. 4 Support vector machine—margin maximization—Wikipedia

Subtracting the two equations and normalizing by $\|\mathbf{w}\| = \sqrt{\sum_{i=1}^{n} w_i^2}$ [Raschka (2015)]:

$$\frac{\mathbf{w}^T (\mathbf{x}_{pos} - \mathbf{x}_{neg})}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|} \qquad (27)$$

The left side of Eq. (27) is the distance between the positive and negative boundaries (the margin). The SVM objective function maximizes this margin by minimizing $\|\mathbf{w}\|$ (or $\frac{1}{2}\|\mathbf{w}\|^2$ for mathematical convenience) [Murphy (2012)], under the constraint that the samples are classified according to the following:

$$\mathbf{w}^T \mathbf{x}_{pos} + b \geq +1, \quad \text{if } y = +1 \text{ (productive class)} \qquad (28)$$

$$\mathbf{w}^T \mathbf{x}_{neg} + b \leq -1, \quad \text{if } y = -1 \text{ (non-productive class)} \qquad (29)$$

where $y$ is the label given in the training set. The general form of the last two equations is:

$$y(\mathbf{w}^T \mathbf{x} + b) \geq +1 \qquad (30)$$

Now we use a Lagrangian to combine $\mathbf{w}$ and the constraint $y(\mathbf{w}^T \mathbf{x} + b) \geq +1$ in one equation [Gunn et al. (1998)]. Adding the Lagrange multipliers $\alpha_n$, the Lagrangian is:

$$\ell(\mathbf{w}, b, \alpha) = \frac{1}{2} \mathbf{w}^T \mathbf{w} - \sum_{n=1}^{N} \alpha_n \left( y_n (\mathbf{w}^T \mathbf{x}_n + b) - 1 \right) \qquad (31)$$

Differentiating, setting the derivatives to zero, and substituting back gives the final dual problem in $\alpha$:

$$\ell(\alpha) = \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \alpha_n \alpha_m \mathbf{x}_n^T \mathbf{x}_m \qquad (32)$$

where $\alpha_n \geq 0$ for $n = 1, \ldots, N$ and $\sum_{n=1}^{N} \alpha_n y_n = 0$.
Note that $\mathbf{w}$ has disappeared from the first derivative of the Lagrangian; however, the partial derivative with respect to $\mathbf{w}$ shows that $\mathbf{w}$ is directly determined by $\alpha$:

$$\mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n \qquad (33)$$

Fig. 5 Support vector machine—non-linearly separable data—Wikipedia

Equation (32) is the dual form used in the case of a hard margin. The hard margin is the maximum distance between the hyperplane and the margin when the data is linearly separable [Friedman et al. (2001)]. In other cases the data is almost linearly separable, but some data points violate the margin. Those points reside either within the margin (correctly classified) or beyond it on the opposite side of the hyperplane (misclassified), as shown in Fig. 5.

Referring to the constraint in Eq. (30), an error term $\xi$ should be added, as shown in Eq. (34):

$$y(\mathbf{w}^T \mathbf{x} + b) \geq +1 - \xi \qquad (34)$$

$\xi$ is the error distance (slack variable) of a data vector that violates the margin, with $\xi \geq 0$ [Fan et al. (2008)]. Hence, $\xi = 0$ means there is no error: the data vector is correctly classified and resides outside the margin (the decision boundary). When $0 \leq \xi \leq 1$, the data vector resides inside the margin but is still correctly classified (between the margin and the hyperplane). When $\xi \geq 1$, the data vector is misclassified (on the wrong side of the hyperplane). The soft margin is the case where some errors or misclassified data points are accepted. The primal form of the loss function in this case is Eq. (35):

$$\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{n=1}^{N} \xi_n \qquad (35)$$

Fig. 6 Hinge loss function—CVX Research Inc

The constant $C$ is a penalty parameter ($C = \frac{1}{\lambda}$) that controls the accepted amount of error. The errors are added to the constraints of the Lagrangian (dual form) and determined by finding the derivative of $\ell(\mathbf{w}, b, \alpha, \xi)$ with respect to $\xi$. Referring to the primal form in Eq. (35), $\xi = 0$ whenever $y(\mathbf{w}^T \mathbf{x} + b) > 1$; on the other hand, $\xi = 1 - y(\mathbf{w}^T \mathbf{x} + b)$ whenever $y(\mathbf{w}^T \mathbf{x} + b) < 1$ [Murphy (2012)]. In short, $\xi = \max(0, 1 - y(\mathbf{w}^T \mathbf{x} + b))$, which is called the hinge function, denoted by $H$—Eq. (36):

$$H = \begin{cases} 0 & \text{if } y\,\mathbf{w}^T \mathbf{x} \geq 1 \\ 1 - y\,\mathbf{w}^T \mathbf{x} & \text{if } y\,\mathbf{w}^T \mathbf{x} < 1 \end{cases} \qquad (36)$$

The hinge function is plotted in Fig. 6. Substituting the hinge function into Eq. (35) gives the primal form:

$$\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i) \qquad (37)$$

The form in Eq. (37) is called L1-SVM. Its quadratically smoothed form is the L2-regularized form of SVM (L2-SVM), which is used in our experiment, in Eq. (38) [Fan et al. (2008); Rennie and Srebro (2005)]:

$$\min_{\mathbf{w}} \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{N} \max(0, 1 - y_i \mathbf{w}^T \mathbf{x}_i)^2 \qquad (38)$$

It is worth mentioning that both the dual and the primal form of the LSVM can be used independently to obtain the optimum classification margin [Wang (2005)], as applied in the Python libraries mentioned in the next section.
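Equation (38) is what scikit-learn's LinearSVC solves with its L2 penalty and squared hinge loss. A minimal sketch follows; the sample strings are again hypothetical transliterated transcripts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical transliterated call chunks; 1 = productive, 0 = non-productive.
calls = ["Alsaylz Fwq m%*r alsT@", "trw@ ambar@ Nmrty Ktyr"]
y = [1, 0]
X = CountVectorizer().fit_transform(calls)

# L2-SVM of Eq. (38): L2 penalty with squared hinge loss, solved by liblinear.
clf = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)
clf.fit(X, y)
print(clf.predict(X))
```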

4 The Experiment

The corpus consists of 7 h of calls from real estate call centers hosted in Egypt. A call center quality team listened carefully to the calls (30 calls) and subjectively categorized the data set into productive/non-productive. The criterion used for file annotation is the ability of the agent (CSR) to respond to the customer's inquiries with the right answers and to cover the expected answer items. The evaluator gives a score for each item fulfilled out of the total items required. For example, suppose the customer asks about an apartment for sale and the agent's answer is expected to cover a set of 5 main items (asking for the customer's name, the contact number, the available budget, the city, and the number of bedrooms). When the agent misses one of the answer items, the score is reduced by one point. Each call center draws its baseline of productivity according to its ultimate objectives. Referring to the previous example, when the agent misses one item out of 5, this may be considered non-productive in call center (X) and still productive in call center (Y). Herein lies the power of machine learning for drawing the baseline for every call center environment. The 30 calls are split into smaller audio chunks, each around 20–50 s long. This process simulates the output files produced by the diarization and speech recognition decoding processes [Meignier and Merlin (2010)]. The result is 500 files, divided into a training set of 400 files and a test set of 100 files (20% of the corpus). The quality team subjectively annotated the files as productive (400 files) and non-productive (100 files). This unbalanced annotation biases the class probability p(c) in Eq. (3) toward one class; ideally the set would contain, for example, 400 productive versus 400 non-productive files. In this case, we may have to use a scaling factor to adjust the probability per class. The text was converted from Arabic into Latin using the Buckwalter script for machine processing and then converted back into Arabic. The code was developed in Python 2.7 using the Natural Language Toolkit (NLTK) Naïve Bayes classifier and the scikit-learn library [Raschka (2015)]. The code uses a bag-of-words representation, an unordered set of word frequencies regardless of position in the sentence [Jurafsky and Martin (2014)]. Both logistic regression and the linear support vector machine are trained using the liblinear solver in scikit-learn rather than libsvm, because liblinear offers more flexibility in adjusting penalties and loss-function parameters and scales better to large numbers of samples.3 The scope of the study is to apply the three classification methods: Naïve Bayes (NB), logistic regression (LR), and linear support vector machine (LSVM). The comparison shows the best classification performance among them.

3 http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html.
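A condensed sketch of the experimental pipeline follows. It is an approximation of the setup described above, not the original code: load_corpus is a hypothetical loader for the 500 annotated chunks, scikit-learn's MultinomialNB stands in for the NLTK Naïve Bayes classifier, and the imports assume a recent scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

texts, labels = load_corpus()  # hypothetical loader: 500 transliterated chunks

X = CountVectorizer().fit_transform(texts)        # bag-of-words features
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels)    # 400 train / 100 test

for name, clf in [('NB', MultinomialNB()),
                  ('LR', LogisticRegression(solver='liblinear')),
                  ('LSVM', LinearSVC(loss='squared_hinge'))]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```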

5 Results

We generated the data model and applied the different binary classification approaches to the whole data set to compare their performance. The classification accuracy results are shown in Table 2.

The data set and the test set were classified in order to extract the most informative features (MIF). A most informative feature is characterized by its occurrence ratio, i.e. how much more often the feature appears in one class than in the other. The features were converted from Buckwalter to Arabic and then into English, as follows.

The NB classifier provides these ratios, called likelihood ratios, which compare how often different features in the training set are repeated in one class relative to the other [Murphy (2012)]. As shown in Table 3, NB yields the 10 most

Table 2 Binary classification methods and accuracy

Classification method              Accuracy (%)
Naïve Bayes                        67.3
Logistic regression                80.769
Linear support vector machine      82.692

Table 3 The most informative features

Word index   Feature in BuckWalter   Translation into English    Classification tendency   Likelihood ratio
1            trw@                    To go                       Non-productive            19.3
2            ambar@                  Yesterday                   Non-productive            16.3
3            M∧rf$                   I have no idea              Non-productive            13.4
4            Fwq                     Upper                       Productive                13.4
5            Alsaylz                 Sales                       Productive                13.4
6            Nmrty                   My mobile number            Non-productive            10.4
7            Aktwbr                  October (city in Egypt)     Productive                10.4
8            m%*r                    The view                    Productive                10.4
10           Ktyr                    Expensive                   Non-productive            9.8
9            alsT@                   The roof                    Productive                8.3

informative features out of the 100 features extracted, those with the strongest tendency toward a specific class. We tried subjectively to explore the meaning behind the classification, to better understand the definition of productivity in real estate call centers. Feature number 3, in English "I have no idea", is non-productive because it signals a lack of awareness of the product or the service. In feature number 6, the agent dictates his/her mobile number over the phone, which is considered non-productive as it consumes much time. Feature number 10, about prices ("expensive"), is classified as non-productive because it drags the CSR into useless debate and consumes much time; furthermore, it might be an unjustified answer by the agent, indicating less awareness of market changes and prices, and may be categorized together with feature number 3. Productive agents mention the key features of the apartments or villas, such as "the view", "the roof", and the city ("October"), which can be considered product awareness; such features work well for evaluating whether the agent mentions selling points during the call. Other features are meaningless, for example "to go", "yesterday", "Upper", and "Sales"; these may be related to the error percentage, which was expected from the beginning. The accuracy is expected to improve by training on a larger corpus and using a balanced training set of productive and non-productive calls.
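The likelihood ratios in Table 3 are the kind of output NLTK's Naïve Bayes classifier reports directly. A sketch follows, under the assumption that labeled_chunks holds the annotated (word list, label) pairs:

```python
import nltk

def bag_of_words(words):
    """Presence-based feature dict, ignoring word position."""
    return {w: True for w in words}

# labeled_chunks: hypothetical list of (word_list, label) pairs.
train_set = [(bag_of_words(words), label) for words, label in labeled_chunks]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)  # prints likelihood ratios
```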
Referring to Table 2, the experiment results can be summarized in two main themes. The first theme is generative versus discriminative models: Naïve Bayes gives the lowest classification performance of the compared methods, which is expected because it is a generative method, and it is well known in machine learning that generative models give lower classification accuracy than discriminative ones [Jurafsky and Martin (2014)]. The second theme concerns linear support vector classification: the margin maximization approach proves to be the most appropriate method in this experiment.

6 Conclusion

This article proposed evaluating call center agents' performance using different machine learning approaches. Three methods were developed in this work: the Naïve Bayes classifier, logistic regression, and the support vector machine, applied to binary classification (productive/non-productive). The annotation was performed by a call center quality team using scores over itemized answers. The resulting classification accuracy shows that the discriminative models (logistic regression and support vector machine) give higher accuracy (80% and 82%) than the generative model (Naïve Bayes, 67%). There is still a research gap in productivity measurement regarding better extraction of the productivity features (the most informative features, MIF). Furthermore, extending productivity measurement to a range of scales (rather than binary classification) and considering the conversation context may help in objectively understanding the evaluation gap.

References

Abbott, J.C.: The Executive Guide to Call Center Metrics. Robert Houston Smith Publishers (2004)
Ahmed, A., Hifny, Y., Shaalan, K., Toral, S.: Lexicon free Arabic speech recognition recipe.
In: International Conference on Advanced Intelligent Systems and Informatics, pp. 147–159.
Springer (2016)
Breuer, K., Nieken, P., Sliwka, D.: Social ties and subjective performance evaluations: an empirical investigation. Rev. Manag. Sci. 7(2), 141–157 (2013)
Card, D.N.: The challenge of productivity measurement. In: Proceedings of the Pacific Northwest
Software Quality Conference (2006)
Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. In:
Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, Asso-
ciation for Computational Linguistics, pp. 310–318 (1996)
Cleveland, B.: Call Center Management on Fast Forward: Succeeding in the New Era of Customer
Relationships. ICMI Press (2012)
Ezpeleta, E., Zurutuza, U., Hidalgo, J.M.G.: Does sentiment analysis help in Bayesian spam filter-
ing? In: International Conference on Hybrid Artificial Intelligence Systems, pp. 79–90. Springer
(2016)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: a library for large linear
classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning, vol. 1. Springer Series
in Statistics. Springer, Berlin (2001)
Gunn, S.R., et al.: Support vector machines for classification and regression. ISIS Tech. Rep. 14,
85–86 (1998)
Jurafsky, D., Martin, J.H.: Speech and Language Processing. Pearson (2014)
Meignier, S., Merlin, T.: LIUM SpkDiarization: an open source toolkit for diarization. In: CMU
SPUD Workshop, vol. 2010 (2010)
Murphy, K.P.: Naive Bayes Classifiers. University of British Columbia (2006)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press (2012)
Othman, E., Shaalan, K., Rafea, A.: Towards resolving ambiguity in understanding Arabic sentence.
In: International Conference on Arabic Language Resources and Tools, pp. 118–122. NEMLAR,
Citeseer (2004)
Raschka, S.: Python Machine Learning. Packt Publishing Ltd (2015)
Rennie, J.D., Srebro, N.: Loss functions for preference levels: regression with discrete ordered
labels. In: Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference
Handling, pp. 180–186. Kluwer, Norwell, MA (2005)
Reynolds, P.: Call center metrics: best practices in performance measurement and management to
maximize quitline efficiency and quality. North American Quitline Consortium (2010)
Richert, W., Chaffer, J., Swedberg, K., Coelho, L.: Building Machine Learning Systems with
Python, vol. 1. Packt Publishing, GB (2013)
Rubingh, R.: Call Center Rocket Science: 110 Tips to Creating a World Class Customer Service
Organization. CreateSpace Independent Publishing Platform. https://books.google.ae/books?
id=IknGmgEACAAJ (2013)
Sharp, D.: Call Center Operation: Design, Operation, and Maintenance. Digital Press (2003)
Steemann Nielsen, E.: Productivity, definition and measurement. The Sea 2, 129–164 (1963)
Taylor, P., Mulvey, G., Hyman, J., Bain, P.: Work organization, control and the experience of work
in call centres. Work Employ. Soc. 16(1), 133–150 (2002)
Thomas, H.R., Zavrki, I.: Construction baseline productivity: theory and practice. J. Construct. Eng.
Manag. 125(5), 295–303 (1999)
Tranter, S.E., Reynolds, D.A.: An overview of automatic speaker diarization systems. IEEE Trans.
Audio Speech Lang. Process. 14(5), 1557–1565 (2006)
Wang, L.: Support Vector Machines: Theory and Applications, vol. 177. Springer Science & Busi-
ness Media (2005)
Woodland, P.C., Odell, J.J., Valtchev, V., Young, S.J.: Large vocabulary continuous speech recog-
nition using HTK. In: 1994 IEEE International Conference on Acoustics, Speech, and Signal
Processing, 1994. ICASSP-94. IEEE, vol. 2, pp. II/125–II/128 (1994)
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Olla-
son, D., Povey, D.: The HTK Book (for HTK version 3.5). Cambridge University Engineering
Department, Cambridge, UK (2015)
Yu, D., Deng, L.: Automatic Speech Recognition. Springer (2012)
