AIMS TALK: Intelligent Call Center Support in Bangla Language with Speaker
Authentication
Shehan Irteza Pranto (AIMS Lab, UIU, Dhaka, Bangladesh; shehanirteza@gmail.com)
Rahad Arman Nabid (AIMS Lab, UIU, Dhaka, Bangladesh; ran.nabid@gmail.com)
Ahnaf Mozib Samin (AIMS Lab, UIU, Dhaka, Bangladesh; asamin9796@gmail.com)
Nabeel Mohammed (Dept. of CSE, North South University, Dhaka, Bangladesh; nabeel.mohammed@northsouth.edu)
Abstract—Call support centers operate over the telephone, connecting customers and receptionists to ensure customer satisfaction by solving their problems. Due to the pandemic, call support centers have become a popular means of communication in domains such as e-commerce, hospitals, banks, credit card support, and government offices. Moreover, humans' inability to serve 24 hours a day and fluctuating waiting times make it challenging to satisfy all customers through a call center. Customer service therefore needs to be automated to handle customers with domain-based responses in the native language, especially in a developing country like Bangladesh, where call support centers are increasing in number. Although most people communicate in Bangla, little work has been done on customer care automation in the native language. Our architecture, "AIMS TALK", responds to customers' needs by recognizing users' voices, identifying customers' problems in standardized Bangla, and collecting customers' responses in a database to give feedback according to their queries. The system uses MFCC feature extraction for speaker recognition, with an average accuracy of 94.38% on 42 people in real-time testing, and an RNN-based model for Bangla Automatic Speech Recognition (ASR), with a word error rate (WER) of 42.15%; for sentence summarization we use a sentence-similarity measurement technique with an average loss of 0.004. Lastly, we use gTTS, which performs Text to Speech synthesis for the Bangla language with the WaveNet architecture.

Index Terms—Text to Speech Synthesis (TTS), Automatic Speech Recognition, Mel-Frequency Cepstral Coefficients

I. INTRODUCTION

Phone calls are the top customer support channel for almost all e-commerce companies. Customers try to resolve their issues and queries by phone more than through any other medium, such as email, social media, or chat. Customer satisfaction over telephone calls depends on the behavior of customer service employees, which in turn depends on their mood; sometimes biases and unresponsiveness occur. Moreover, human service members cannot provide 24-hour customer service, so customers cannot get support from the hotline after office hours. Due to the pandemic, the shortcomings of traditional call center service are becoming apparent as an increasing number of people try to get service through call centers.

With the increasing difficulty of serving customers, artificial intelligence (AI) can be the most scalable and cost-efficient solution for improving customer service [2]. According to a recent smallbiztrends report titled "Local Business Websites and Google My Business Comparison Report", 60% of customers prefer to call over the phone during the pandemic instead of visiting a physical shop [14]. Known issues include service being available only during office hours, sluggish service that increases customers' waiting time, and service delays during peak periods [4]. Studies show that 66% of customers prefer to solve their issues within 10 minutes, or they may readily switch to an alternative service [1]. AI can automate resolution while cutting costs and enhancing customer satisfaction, allowing human agents to work on more complex issues [3]. Many e-commerce companies have started to implement various forms of AI to understand their customers better and provide an improved customer experience, but work with the Bangla language has seen limited progress. In addition, around 37% of e-commerce customers use automated services such as chatbots to react quickly in an emergency and to get 24/7 customer support [1].

Researchers have been working on finding the best possible solution integrating AI assistants with human customer care

This research work is funded by the ICT Innovation Fund (a2i), ICT Division, Ministry of Posts, Telecommunications and Information Technology, the People's Republic of Bangladesh.
Authorized licensed use limited to: UNIV OF ALABAMA-TUSCALOOSA. Downloaded on June 13,2023 at 17:22:35 UTC from IEEE Xplore. Restrictions apply.
Figure 2: Workflow of Speaker Recognition module
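The GMM-based speaker recognition step in Figure 2 can be sketched as below; this is a minimal illustration, not the paper's implementation, and the synthetic arrays stand in for the 13-dimensional MFCC frames the real pipeline would extract from phone audio. One Gaussian mixture is fitted per enrolled speaker and a test utterance is assigned to the model with the highest average log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dim MFCC frames of two enrolled speakers.
enroll = {
    "speaker_a": rng.normal(loc=0.0, scale=1.0, size=(500, 13)),
    "speaker_b": rng.normal(loc=3.0, scale=1.0, size=(500, 13)),
}

# Enrollment: fit one GMM per speaker on that speaker's feature frames.
models = {
    name: GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(feats)
    for name, feats in enroll.items()
}

def identify(frames: np.ndarray) -> str:
    """Return the enrolled speaker whose GMM gives the highest
    average log-likelihood for the test frames."""
    return max(models, key=lambda name: models[name].score(frames))

# A test utterance drawn from speaker_b's feature distribution.
test_frames = rng.normal(loc=3.0, scale=1.0, size=(200, 13))
print(identify(test_frames))  # speaker_b
```

In the real system, a caller whose frames score poorly under every enrolled model would fall through to the extra security questions described below.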
Figure 3: Sentence Summarizing Architecture
recognize the voice of the user, the system will ask some extra security questions (National Identity No., Mobile No.) and check the personal information database (PID) to see whether the user is already registered in the system. If the system finds the user's information in the PID, the previous method applies. Otherwise, the system adapts the user's voice features with a GMM-based algorithm and saves them, along with the user's information, into the PID. Afterward, the system similarly asks the reason for calling, converts it into text using ASR, saves it into the PID, applies sentence summarization, and encodes it with the sentence transformer. After a similarity check against the generic information database, the system responds to the user with the most appropriate answer from the predefined knowledge-based generic information database. Finally, the system asks the user to rate their satisfaction after using the system, and it refers the user to a human assistant if the rating is low.

Using the "Large Bengali-Automatic Speech Recognition Training Data" (LB-ASRTD), we trained our ASR model with an RNN-based end-to-end speech recognition architecture called "Deep Speech 2" (DS2) [19], developed by Baidu Research. HPC techniques are used extensively in DS2, resulting in 7x faster training than its predecessor, "Deep Speech." Batch normalization is combined with the RNNs in this architecture, and a curriculum-learning strategy called SortaGrad [19] is employed. The model minimizes the CTC loss

    L(x, y; θ) = −log Σ_{l ∈ Align(x,y)} Π_t p_ctc(l_t | x; θ)

Here, Align(x, y) denotes the set of all possible alignments of the characters of the transcription y to the frames of the input x.
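The CTC objective described above can be checked on a toy example by brute-force enumeration of alignments; this is a sketch assuming a two-symbol alphabet (blank plus one character), not the DS2 implementation, which uses dynamic programming over full character sets.

```python
import itertools
import math

# Toy per-frame distributions over {blank, 'a'} for T = 2 frames:
# probs[t][s] = p_ctc(symbol s at frame t | x)
BLANK, A = 0, 1
probs = [[0.4, 0.6],   # frame 0
         [0.3, 0.7]]   # frame 1

def collapse(path):
    """CTC collapse: merge repeated symbols, then drop blanks."""
    out = []
    for s in path:
        if out and out[-1] == s:
            continue
        out.append(s)
    return tuple(s for s in out if s != BLANK)

def ctc_loss(target, T=2):
    # Sum the probability of every frame-level path in Align(x, y),
    # i.e. every path that collapses to the target, then take -log.
    total = sum(
        math.prod(probs[t][s] for t, s in enumerate(path))
        for path in itertools.product([BLANK, A], repeat=T)
        if collapse(path) == target
    )
    return -math.log(total)

# Paths collapsing to ('a',): (A,A), (A,BLANK), (BLANK,A)
# total probability = 0.6*0.7 + 0.6*0.3 + 0.4*0.7 = 0.88
print(round(ctc_loss((A,)), 4))  # 0.1278
```

Summing over all alignments is exactly the Σ over Align(x, y) in the loss; real ASR systems replace the exponential enumeration with the forward-backward algorithm.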
into a series of tokens, add the full meaning of contractions, remove stopwords and punctuation, apply lemmatization to convert each word into its root form, and finally tag words with their parts of speech.

2) Count Vocabulary: The vocabulary is counted from our own dataset, in which 12,676 unique words are found. Of these, only 11,712 words are used, after removing English words from the Bangla articles.

3) Model Architecture: For training on the dataset, an RNN encoder-decoder and Seq2Seq learning with an attention mechanism are used to summarize the articles. This framework has three components: an encoder network, an attention network, and a decoder network. The encoder converts the input sequence into a fixed-size context vector, which represents a semantic summary of the article. This context vector serves as the initial state of the decoder's hidden units, although the numbers of time steps in the encoder and decoder are not equal. If a is the target sequence of sentences and b is the source sequence of sentences, the decoder seeks the word-vector sequence of maximum probability:

    arg max_a p(a | b)

A Sequence-to-Sequence model with Bahdanau attention is used, mapping to a fixed-length output. A pre-trained Bangla word vector file, "bn w2c model", is used for word embedding, converting all words into numerical form. The final vector-form output of each word is used for model training.

D. Workflow of Interactive Agent:

We encoded all the sentences from our database using the Sentence Transformer and saved the encoded scores in an array, as encoding them repeatedly for every user costs time. The sentences pass through a pooling layer, followed by the BERT model, to produce a vector used to predict a score. With the same procedure, we use "Paraphrase mpnet base v2" to encode new sentences as they come from users. The new sentence and the array of database sentences are then compared with a cosine-similarity check. The maximum score is likely the query the user is looking for. The system transfers the answer via the TTS system, and at the same time it enlists necessary data, for example doctors' appointments, into the database. To avoid irrelevant queries, we set a threshold of 0.96, which gives relevant results in our case.

Sentence Transformer: For text from the ASR system in the Bangla language, the model encodes every word with the BERT sentence transformer model; a pre-trained weight called "paraphrase mpnet base v2" [17] is used to convert each word into vector format. The paraphrase mpnet base v2 model is based on Microsoft's mpnet-base, with a dimension of 768 and a mean-pooling structure. The model is trained on the AllNLI, sentence compression, SimpleWiki, Quora duplicates, COCO captions, Flickr30k captions, Yahoo Answers title-questions, and S2ORC pairs datasets. BERT itself is unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. N. Reimers et al. [17] present Sentence-BERT (SBERT), which uses siamese and triplet objective functions to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. This reduces the effort of finding the most similar pair from 65 hours with BERT/RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT. Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Mathematically, we minimize the following loss function [17]:

    max(||s_a − s_p|| − ||s_a − s_n|| + ε, 0)

with s_x the sentence embedding for a/p/n, ||·|| a distance metric, and ε a margin.

E. Text to Speech Synthesis

Our intended model can assist customers by interacting with them; it can reply to various speech in audio format. As we target the domain of e-commerce in Bangladesh, most of the users communicate in the Bangla language. We used the Python-based "gTTS" library to turn our text into speech. The library converts the text by calling an API that serves the best sound quality among the available TTS options. It uses DeepMind's WaveNet [15] to deliver the highest accuracy possible. The architecture of the WaveNet model is shown in Fig. 4 [15]. Furthermore, the speed of speech generation from a text response is quite impressive and useful for real-time use.

Figure 4: WaveNet Architecture

F. Database

The system consists of three types of databases: 1. Personal Information Database (PID), 2. Generic Information Database (GID), 3. Credential Information Database (CID). PID mainly
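The interactive agent's matching step (encode the database once and cache it, then compare each incoming query by cosine similarity against the 0.96 threshold) can be sketched as follows. Here `embed` is a deliberately crude, hypothetical stand-in for the "paraphrase mpnet base v2" encoder; the real system would call the sentence transformer's encode function instead.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for the SBERT encoder: a bag-of-characters
    vector, used only so this sketch is self-contained."""
    v = np.zeros(256)
    for ch in sentence.lower():
        v[ord(ch) % 256] += 1.0
    return v

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Generic information database: predefined query -> answer pairs.
database = {
    "What are your opening hours?": "We are open 9am-5pm.",
    "How do I reset my password?": "Use the reset link on the login page.",
}
# Encode once and cache, since re-encoding for every caller costs time.
cached = {q: embed(q) for q in database}

THRESHOLD = 0.96  # queries below this similarity are treated as irrelevant

def answer(query: str):
    """Return the answer for the most similar stored query, or None."""
    qv = embed(query)
    best_q = max(cached, key=lambda q: cosine(qv, cached[q]))
    return database[best_q] if cosine(qv, cached[best_q]) >= THRESHOLD else None

print(answer("What are your opening hours"))  # near-duplicate -> matched
print(answer("zzzz qqqq xxxx"))               # dissimilar -> None
```

The argmax implements "the maximum score is likely the expected query", and the threshold check implements the irrelevant-query rejection described above.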
Figure 5: Personal Information Database (PID)

Figure 7: Credential Information Database (CID)
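The three stores pictured above (PID, GID, CID) could be laid out as in this minimal SQLite sketch; every table and column name here is an illustrative assumption, not the paper's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Personal Information Database (PID): caller identity and voice profile.
cur.execute("""CREATE TABLE pid (
    user_id INTEGER PRIMARY KEY,
    name TEXT, nid TEXT, mobile TEXT,
    gmm_profile BLOB)""")

# Generic Information Database (GID): predefined query/answer pairs.
cur.execute("CREATE TABLE gid (query TEXT, answer TEXT)")

# Credential Information Database (CID): security-question answers.
cur.execute("""CREATE TABLE cid (
    user_id INTEGER REFERENCES pid(user_id),
    question TEXT, answer_hash TEXT)""")

cur.execute("INSERT INTO pid (name, nid, mobile) VALUES (?, ?, ?)",
            ("Test User", "1234567890", "01700000000"))
cur.execute("INSERT INTO gid VALUES (?, ?)", ("opening hours", "9am-5pm"))
conn.commit()

# The interactive agent would look up answers from GID like this:
print(cur.execute("SELECT answer FROM gid WHERE query = ?",
                  ("opening hours",)).fetchone()[0])  # 9am-5pm
```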
authors' parameters, the average loss is 0.004, which is quite satisfactory.

We used the weights of the BERT sentence transformer (paraphrase-mpnet-base-v2) trained on the STSb, DupQ, TwitterP, SciDocs, and Clustering datasets, with an average accuracy of 76.84%, shown in Table I.

Table I: Accuracy of the sentence transformer "paraphrase-mpnet-base-v2" for the quality of embedded sentences and embedded search-query paragraphs

    Dataset      Accuracy (%)
    STSb         86.99
    DupQ         87.80
    TwitterP     76.05
    SciDocs      80.57
    Clustering   52.81

IV. CONCLUSION

Our developed system can communicate with users in standardized Bangla. This AI-based architecture will automate the call center and increase the efficiency and quality of its services. Moreover, the model can reduce waiting time in the call center and elicit a positive response from customers by providing unbiased answers. In practice, however, our system may not handle converting regional forms of speech into text. As we do not include a noise-cancellation module during the phone call, accuracy may degrade in a noisy environment. The GMM-based MFCC model is suitable for a small dataset, but an increasing number of users may degrade its accuracy. In the future, we plan to replace this model with SincNet to improve overall accuracy. Nevertheless, our system is a milestone for call center support in the Bangla language, setting a state of the art for future development to ensure customer satisfaction.

V. ACKNOWLEDGEMENT

The ICT Innovation Fund from the ICT Division, Ministry of Posts, Telecommunications and Information Technology, the People's Republic of Bangladesh, funds this research work and pilot project. Full technical support is provided by AIMS Lab, United International University, Bangladesh.

REFERENCES

[1] Conduent. (2018). The State of Consumer Experience Communication, Edition 2018.
[2] BrandGarage, & Linc. (2018). How AI Technology Will Transform Customer Engagement.
[3] Microsoft. (2018). State of Global Customer Service Report.
[4] Li, F., Qiu, M., Chen, H., Wang, X., Gao, X., Huang, J., ... Chu, W. (2017). AliMe Assist: An intelligent assistant for creating an innovative e-commerce experience. In CIKM'17: Proceedings of the 2017 ACM Conference on Information and Knowledge Management (pp. 2-5). https://doi.org/10.1145/3132847.3133169
[5] Zhao, Q., Tu, D., Xu, S., Shao, H., & Meng, Q. (2014, November). Natural human-robot interaction for elderly and disabled healthcare application. In 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 39-44). IEEE.
[6] Prado, J. A., Simplício, C., Lori, N. F., & Dias, J. (2012). Visuo-auditory multimodal emotional structure to improve human-robot interaction. International Journal of Social Robotics, 4(1), 29-51.
[7] Yan, H., Ang, M. H., & Poo, A. N. (2014). A survey on perception methods for human-robot interaction in social robots. International Journal of Social Robotics, 6(1), 85-119.
[8] Reynolds, D. A., & Rose, R. C. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1), 72-83.
[9] Sarkar, A. K., Matrouf, D., Bousquet, P. M., & Bonastre, J. F. (2012). Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In Thirteenth Annual Conference of the International Speech Communication Association.
[10] Liu, G., & Hansen, J. H. (2014). An investigation into back-end advancements for speaker recognition in multi-session and noisy enrollment scenarios. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 1978-1992.
[11] Hasan, T., & Hansen, J. H. (2013). Maximum likelihood acoustic factor analysis models for robust speaker verification in noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2), 381-391.
[12] Zheng, M., & Meng, M. Q. H. (2012, December). Designing gestures with semantic meanings for humanoid robot. In 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO) (pp. 287-292). IEEE.
[13] Nadarzynski, T., et al. (2019). Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: A mixed-methods study. Digital Health, 5, 2055207619871808.
[14] https://smallbiztrends.com/2019/05/customer-contact-statistics.html
[15] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[16] Bhattacharjee, P., Mallick, A., Islam, M. S., & Jannat, M. (2020). Bengali abstractive news summarization (BANS): A neural attention approach. In 2nd International Conference on Trends in Computational and Cognitive Engineering. arXiv:2012.01747
[17] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.
[18] Li, F.-L., et al. (2017). AliMe Assist: An intelligent assistant for creating an innovative e-commerce experience. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management.
[19] Amodei, D., et al. (2016, June). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning (pp. 173-182). PMLR.
[20] Sun, L., et al. (2018). Design of integrated vision and speech technology for a robot receptionist. In 2018 IEEE International Conference on Mechatronics and Automation (ICMA). IEEE.
[21] Hardy, H., Strzalkowski, T., & Wu, M. (2003). Dialogue management for an automated multilingual call center. State University of New York at Albany, Institute for Informatics, Logics and Security Studies.
[22] Zweig, G., et al. (2006). Automated quality monitoring for call centers using speech and NLP technologies. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Demonstrations.
[23] McLean, G., & Osei-Frimpong, K. (2017). Examining satisfaction with the experience during a live chat service encounter-implications for website providers. Computers in Human Behavior, 76, 494-508.
[24] Warnapura, A. K., et al. (2014). Automated customer care service system for finance companies. Research and Publication of Sri Lanka Institute of Information Technology (SLIIT), NCTM.
[25] Atayero, A. A., et al. (2009). Implementation of 'ASR4CRM': An automated speech-enabled customer care service system. In IEEE EUROCON 2009. IEEE.
[26] Sultana, M., Chakraborty, P., & Choudhury, T. (2022). Bengali abstractive news summarization using Seq2Seq learning with attention. In Cyber Intelligence and Information Retrieval (pp. 279-289). Springer, Singapore.