
Deep Learning-Based Multilingual Speech Synthesis Using Multi-Feature Fusion Methods

Praveena Nuthakki
Department of CSIT, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522302, AP, India.
Madhavi Katamaneni
Department of IT, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India.
Chandra Sekhar J. N.
Department of EEE, Sri Venkateswara University College of Engineering, Sri Venkateswara University 517502, Tirupati, A.P., India.
Kumari Gubbala
Associate Professor, Department of CSE (CS), CMR Engineering College, Hyderabad, Telangana, India. (ORCID ID: 0000-0002-6378-1065)
Bullarao Domathoti*
Department of CSE, Shree Institute of Technical Education, Jawaharlal Nehru Technological University, Ananthapuram, 517501, India.
Venkata Rao Maddumala
Department of CSE, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522302, AP, India.
Kumar Raja Jetti
Department of CSE, Bapatla Engineering College, Bapatla, Guntur, Andhra Pradesh, India. (ORCID ID: 0009-0000-5169-3829)
*Corresponding Author: Bullarao Domathoti (Email: bullaraodomathoti@gmail.com)

Traditional concatenative speech synthesis technologies suffer from two major problems: poor intelligibility and unnatural-sounding output. Purely context-based CNN deep learning approaches are not robust enough for expressive speech synthesis. The approach suggested here can satisfy these needs and handle the complexities of voice synthesis. The suggested model's minimal aperiodic distortion makes it an excellent candidate for a communication recognition model, and its output is as close to human speech as possible, even though synthesized speech still contains a number of audible flaws. Considerable work also remains in incorporating sentiment analysis into text categorization using natural language processing, since the intensity of expressed emotion varies greatly from country to country. To improve their voice synthesis outputs, models need to incorporate more and more hidden layers and nodes into the updated mixture density network, and for our suggested algorithm to perform at its best, a more robust network foundation and better optimization methods are needed. We hope that after reading this article and trying out the example data provided, both experienced researchers and those just starting out will have a better grasp of the steps involved in creating a deep learning approach. The model makes progress in overcoming fitting issues when little training data is available, although more space is needed to hold the input parameters in the DL-based method.

Keywords: Natural Language Processing, Deep Learning, Machine Learning, Speech to Text.

1 INTRODUCTION
For voice synthesis to sound genuine and intelligible, fundamental frequency (F0) modelling is essential. Mandarin speech synthesis is not degraded by the use of a mixed training corpus that includes Tibetan [1]. Compared to the Tibetan monolingual framework, the mixed BBERT-based cross-lingual speech synthesis framework requires only 60% of the learning corpus to synthesise a comparable voice. The suggested approach may therefore be applied to speech synthesis for low-resource languages by exploiting the corpus of a high-resource language [2]. The process of synthesising the sounds of numerous voices from a single model is known as "multi-speaker speech synthesis." There have been various proposals for employing deep neural networks (DNNs); however, DNNs are vulnerable to overfitting when there is insufficient training data [3]. Using deep Gaussian processes (DGPs), which are deep structures of Bayesian kernel regressions that are resistant to overfitting, a framework for multi-speaker voice synthesis can be provided. Research on voice synthesis that conveys a speaker's personality and feelings is thriving thanks to recent advances made possible by deep learning-based voice synthesis.
For speakers with extremely low or very high vocal ranges, as well as speakers with dialects, current systems do not adequately portray a variety of moods and features [5]. Modern neural text-to-speech (TTS) systems are able to create speech that is indistinguishable from natural speech [6], thanks to the remarkable progress in deep learning. However, rather than exhibiting extensive prosodic variation, the produced utterances often retain the average prosodic character of the database. To effectively transmit meaning in pitch-stressed languages such as English, proper intonation and emphasis are essential [7]. Automatically recognizing emotions in a speaker's voice is a difficult and time-consuming task, and the complex range of human emotions makes categorization challenging. Extracting relevant speech features is a significant challenge because of the need for human intervention [8]. Deep learning methods that make use of high-level aspects of the voice signal improve the accuracy of emotion identification. One of the primary challenges of employing target-driven models for articulatory-based speech synthesis is representation learning [9]. Computer-aided diagnostic systems that help in clinical decision-making rely heavily on deep learning methodologies; because of the scarcity of annotated medical data, however, the development of such automated systems is difficult [10, 11]. The purpose of the user interface is to accurately read the user's mood. The most pressing concern in speech emotion identification research is how to efficiently extract relevant speech characteristics in tandem with an adequate classification engine [12][13]. Emotion recognition and analysis from speech signals also require well-defined speech databases. Numerous real-world applications, such as human-computer interaction, computer gaming, mobile services, and emotion assessment, have made use of data science in recent years. Speech emotion recognition (SER) is a relatively new and difficult area of study with a broad variety of potential applications. Current SER research has relied on handcrafted features that provide the greatest results but fall short when applied to complicated settings. Later, SER made use of automated feature detection from voice data using deep learning methods. Although deep learning-based SER approaches have addressed the accuracy problems, they still have a long way to go before they can be considered fully comprehensive [14][15].
The following is a problem statement we developed to address the aforementioned challenges and provide a solution to the issue of
multilingual voice recognition.
 The ability to identify and convert a multilingual sentence into a sentence in a single language.
 Creating a model that can interpret voice navigation questions spoken in many languages and provide a unified response in
one of those languages.
Although there are 22 official languages in India, most are only spoken within a single state. As a result, resources for these
languages are limited, and high-quality voice data is hard to come by. Furthermore, no publicly accessible multilingual dataset
includes these languages and English.

2 RELATED WORK
In this study, we work to refine the modelling of F0 for Isarn speech synthesis. For this problem, we recommend an RNN-based F0 model. Supra-segmental aspects of the F0 contour are represented by sampled F0 values at the syllable level of continuous Isarn speech in conjunction with their dynamic features. We investigate several deep RNN architectures and feature combinations to find those that provide the highest performance. We evaluated the suggested technique against different RNN-based baselines to determine its efficacy. Objective and subjective evaluations show that the proposed model outperforms both the baseline RNN strategy that predicts F0 values at the frame level and the baseline RNN model that represents the F0 contours of words using a discrete cosine transform [16].
This study introduces a deep Gaussian process (DGP) and sequence-to-sequence (Seq2Seq) learning-based approach to voice synthesis, with the goal of producing high-quality end-to-end speech. Since DGP makes use of Bayesian training and kernel regression, it is able to create more lifelike synthetic speech than deep neural networks (DNNs). Such DGP models, however, need extensive knowledge of text processing since they are pipeline architectures of separate models, including acoustic and duration models. The suggested model makes use of Seq2Seq learning, which allows for joint training of the spectral and temporal models. Parameters are learned with a Bayesian model, and Gaussian process regressions (GPRs) are used to represent the encoder and decoder layers. To adequately model character-level inputs in the encoder, we additionally propose a self-attention method based on Gaussian processes. In a subjective assessment, the suggested Seq2Seq-SA-DGP was shown to produce more natural-sounding speech synthesis than DNNs with self-attention and recurrent structures. In addition, Seq2Seq-SA-DGP is useful when a straightforward input is provided for a complete system, and it helps alleviate the smoothing issues of recurrent structures. The field evaluations also demonstrate the efficacy of the self-attention structure within the DGP framework, with Seq2Seq-SA-DGP able to synthesise more natural speech than Seq2Seq-SRU-DGP and SRU-DGP [17].

In this study, we offer a deep learning-based system for cross-lingual speech synthesis that can simultaneously synthesise Mandarin and Tibetan voices. Because recording a Tibetan training corpus is difficult, we employ a large-scale Mandarin multi-speaker corpus together with a small Tibetan single-speaker corpus to train the acoustic models. The acoustic models are trained using deep neural networks (DNN), hybrid long short-term memory (BERT) networks, and hybrid bidirectional recurrent (BBERT) networks. To generate context-dependent labels from incoming Chinese or Tibetan phrases, we additionally extend our Chinese text analyzer with a Tibetan text analyzer. The Tibetan text analyzer includes text normalisation, a prosodic boundary predictor, a grapheme-to-phoneme converter, and a new Tibetan word segmenter that combines a BBERT with a conditional random field. To train a speaker-independent multilingual average voice model (AVM) with the DNN, hybrid BERT, and hybrid BBERT from a mixed corpus of Mandarin and Tibetan, we choose initials and finals from both languages as the speech synthesis units. Next, a small target speech corpus is used together with speaker adaptation from the AVM to train a speaker-dependent DNN, hybrid BERT, or hybrid BBERT system for Mandarin or Tibetan. Finally, the speaker-dependent Mandarin or Tibetan models are used to synthesise speech in the respective language. Compared to the other two cross-language frameworks as well as the Tibetan monolingual framework [18], the hybrid BBERT-based cross-language speech synthesis framework performs best in the testing.
This work introduces a technique for improving the effectiveness of meta-learning-based multilingual voice synthesis by separating pronunciation and prosody modelling. To forecast mel-spectrograms across languages, the default meta-learning synthesis technique uses a single text encoder, with parameter generation conditioned on a language embedding, and a single decoder. Our suggested approach, in contrast, models pronunciation and prosody separately using separate encoders and decoders, since these two types of information should be shared across languages in different ways. The experimental results show that, compared to the baseline meta-learning synthesis technique [19], the suggested method significantly enhances the intelligibility and naturalness of multilingual speech synthesis.
In this setup [20], speaker codes are used to convey speaker information to the duration/acoustic models. To effectively account for speaker similarities and differences, the method learns the representation of each speaker concurrently with the other model parameters. To test the efficacy of the suggested methodologies, we conducted experimental evaluations of two scenarios: one with an equal quantity of data from all of the speakers (speaker-balanced) and one with less data from some of the speakers (speaker-imbalanced). The evaluation findings demonstrated that the DGP and DGPLVM synthesise multi-speaker speech better than a DNN in the speaker-balanced setting, both subjectively and objectively. We also observed that, in the speaker-imbalanced case, the DGPLVM performs far better than the DGP [20].
We present MIST-Tacotron, a Tacotron 2-based speech synthesis model that incorporates an image style transfer module into the reference encoder. The approach uses a deep learning model trained to extract the speaker's features from the reference mel-spectrogram in conjunction with image style transfer. The speaker's style, including pitch, tone, and duration, can be learned from the extracted features to better convey the speaker's style and emotion. With only a speaker ID and an emotion ID as inputs, the speaker's style can be extracted completely independently of the speaker's timbre and mood. The mel-cepstral distortion (MCD), band aperiodicity distortion (BAPD), voiced/unvoiced error (VUVE), false positive rate (FPR), and false negative rate (FNR) are used to assess performance. In comparison to the existing designs, GST (Global Style Token) Tacotron and VAE (Variational Autoencoder) Tacotron, the suggested model was shown to have better performance and lower error values. The suggested model's audio quality scored best on MOS measures of both emotional expressiveness and reflection of speaker style [21].
We provide a Tacotron-based approach to multilingual text-to-speech (TTS) synthesis that can generate high-quality speech in a number of different languages using a single set of speakers. Further, the model can synthesise natural-sounding Spanish speech using the voice of an English speaker, without being trained on bilingual or parallel samples. Such transfer is possible even across very distantly related languages such as English and Mandarin. Two key ingredients in this recipe for success are the use of a phonemic input representation to facilitate cross-language sharing of model capacity and the addition of an adversarial loss term that trains the model to separate its representation of speaker identity from the speech content. Incorporating an autoencoding input to help stabilise attention during training and expanding the model to train on several speakers per language yields a model that can reliably synthesise intelligible speech for all training speakers in every language seen during training, whether or not that language is the speaker's native one [22].
In this study, we present a deep architectural model for voice synthesis using deep Gaussian processes (DGPs), which are layered Bayesian kernel regressions. In this approach, we mimic deep neural network (DNN)-based voice synthesis by training a statistical model of the transformation from contextual data to speech parameters. Our approach employs the approximation technique of doubly stochastic variational inference to implement DGPs inside a statistical parametric voice synthesis framework; this method works well with large datasets. Compared to DNNs, DGPs are more robust to overfitting since their training depends on the marginal likelihood, which considers not just data fitting but also model complexity. We conducted experiments to compare the effectiveness of the suggested DGP-based architecture to that of a feedforward DNN-based alternative. Our DGP architecture outperformed the status quo in both subjective and objective evaluations [23], with a higher mean opinion score and reduced distortion of acoustic features.
In this paper, we propose a neural speech synthesis approach that incorporates ToBI (Tones and Break Indices) encoding for fine-grained prosody modelling. The proposed solution consists of a text front-end for ToBI prediction and a Tacotron-based TTS component with prosody modelling. The ToBI representation allows us to exert fine-grained control over the system, so that speech can be synthesised with natural intonation and syllable-level stress. Experiments demonstrate that, in comparison to Tacotron and an unsupervised technique, our model is able to create more realistic speech with more precise prosody and to exert more precise control over stress, intonation, and pauses [24].
In this research, we introduced a system that uses deep learning to identify the emotional state of a speaker from their speech. Using a time-distributed flatten (TDF) layer, the system integrates a deep convolutional neural network (DCNN) with a bidirectional long short-term memory (BBERT) network. The suggested model was evaluated on the newly created SUBESCO emotional speech corpus in Bangla, which is audio only. All the models in this study were put through a battery of tests comparing them in single-language, cross-language, and multilingual training-testing configurations. Compared to previous state-of-the-art CNN-based SER systems that operate on both temporal and sequential representations of emotions, the model with a TDF layer achieves greater performance, as shown by the test data. The Bangla and English SUBESCO and RAVDESS datasets were used for cross-corpus training, multi-corpus training, and transfer learning during the cross-lingual studies. Weighted accuracies (WAs) of 86.9% and 82.7% were achieved on the SUBESCO and RAVDESS datasets, respectively [25], demonstrating that the proposed model achieves state-of-the-art perceptual efficiency.
In this paper, we present a deep learning-based approach for extracting basic characteristics from raw data with good accuracy across languages and speaker demographics (male/female) in voice corpora. To prepare the .wav files for a Deep Convolutional Neural Network (DCNN), we first transform them into RGB spectrograms (images) and normalise their size to 224x224x3. The DCNN model is trained in two phases: stage 2 involves re-training the model using the optimum learning rate determined in stage 1 through a Learning Rate (LR) range test. To down-sample the features with a smaller model size, a custom stride is employed. Positive, negative, neutral, and a few other emotions are taken into account. The proposed approach is evaluated using the publicly available German-language EMODB, Italian-language EMOVO, and British-English SAVEE speech corpora. The reported emotion identification accuracy is higher than that of previous research across languages and speakers [26].
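As a rough illustration of this spectrogram-image idea (not the cited authors' code), the sketch below converts a .wav file into a 224x224x3 RGB spectrogram using librosa, Matplotlib, and Pillow; the mel-scale settings, colormap, and file name are assumptions.

```python
# Illustrative sketch: convert a .wav file into a 224x224x3 RGB spectrogram image
# suitable as DCNN input. Mel settings and the colormap are illustrative choices.
import numpy as np
import librosa
import matplotlib.cm as cm
from PIL import Image

def wav_to_rgb_spectrogram(wav_path, size=(224, 224)):
    y, sr = librosa.load(wav_path, sr=None)              # keep native sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)         # log-scale magnitudes
    # Normalise to [0, 1] and map through a colormap to obtain 3 RGB channels.
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = (cm.viridis(norm)[:, :, :3] * 255).astype(np.uint8)
    return np.asarray(Image.fromarray(rgb).resize(size))  # shape (224, 224, 3)

# img = wav_to_rgb_spectrogram("sample.wav")              # placeholder file name
```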
In this study, we offer a computational technique for learning the underlying articulatory targets of a 3D articulatory speech synthesis model, using simultaneous long short-term memory recurrent neural networks trained with a limited collection of representative seed samples. A larger training set was produced from the seed data, allowing the model to be exposed to more diverse contexts. The articulation process was reverse-engineered by training a deep learning model for acoustic-to-target mapping. This technique uses the trained model to map the provided audio data to the articulatory target parameters, allowing the identification of distributions according to linguistic context. The model's performance was assessed by the perceived accuracy of speech reconstructed from the predicted articulation, along with its ability to map acoustics to articulation. The findings suggest that the simulator can successfully mimic speech without a high level of phonemic accuracy [27].
To develop generalizable and domain-invariant representations across a variety of medical imaging applications, including malaria, diabetic retinopathy, and TB, we offer a unique and computationally efficient deep learning technique. We call the process by which we construct our CNNs Incremental Modular Network Synthesis (IMNS), and the resulting CNNs themselves IMNets. Our IMNS method makes use of specialised network components called SubNets, each of which can provide problem-specific salient features. We then use these smaller SubNets to construct bigger and more robust networks in a variety of alternative configurations. Only the new SubNet module is updated at each stage, which lessens the computational burden during training and assists in optimising the network. In this research, we compare IMNets against established and cutting-edge deep learning architectures including AlexNet, ResNet-50, Inception v3, DenseNet-201, and NasNet. Average classification accuracies of 97.0%, 97.9%, and 88.6% are achieved with our suggested IMNS architecture for malaria, diabetic eye disease, and TB, respectively. In all benchmarked use cases, our modular deep learning architecture outperforms the competition. Unlike most conventional deep learning architectures, the IMNets built here need less processing power to run. In this experiment, the biggest
IMNet had 0.95 million trainable parameters and 0.08 billion MAdd operations (floating-point multiplications and additions). The simplest IMNets examined [28] outperformed the benchmark techniques in terms of training speed, memory requirements, and image processing speed.
In this study, we built a Korean emotional speech dataset for emotion analysis in spoken language and present a feature combination to boost the accuracy of recurrent neural network-based emotion detection. We extracted F0, Mel-frequency cepstral coefficients, spectral characteristics, and harmonic features, among other data, to further explore the acoustic features that might indicate discrete instantaneous shifts in emotional expression. Statistical analysis was used to select the best possible set of speech acoustic variables for conveying emotion, and an emotion classification model based on a recurrent neural network was deployed. According to the findings, the suggested system outperforms its predecessors [29] in terms of accuracy.
To overcome such problems, the authors of this research presented a new SER model. Due to the lack of previous studies, we focused on Arabic vocal expressions in particular. The suggested model augments the data before extracting features, and emotion identification is accomplished by feeding the 273 extracted features into a transformer model. We test this model on four different datasets (BAVED, EMO-DB, SAVEE, and EMOVO). The experimental results showed that the suggested model outperformed state-of-the-art methods with ease. The best results were achieved on the BAVED dataset, demonstrating the proposed model's suitability for identifying emotions in Arabic speech [30].
The Griffin-Lim vocoding method is a technique used in audio signal processing, specifically for the synthesis of audio signals such as speech and music. It is commonly applied in the field of speech and audio processing, particularly for spectrogram inversion or phase reconstruction. The algorithm is named after its creators, Daniel Griffin and Jae Lim, who introduced it in their 1984 paper "Signal Estimation from Modified Short-Time Fourier Transform". Its primary goal is to reconstruct a time-domain audio signal from its magnitude spectrogram, which contains information about the magnitudes of the frequencies present in the signal.
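As a minimal sketch of this idea, the snippet below drops the phase of an STFT and recovers a waveform from the magnitude spectrogram with librosa's Griffin-Lim implementation; the file name, FFT size, and iteration count are illustrative choices, not values taken from this paper.

```python
# Minimal Griffin-Lim spectrogram-inversion sketch using librosa
# (file name and STFT parameters are placeholders).
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=22050)
S = abs(librosa.stft(y, n_fft=1024, hop_length=256))     # keep magnitudes only

# Griffin-Lim iteratively estimates the missing phase from the magnitude spectrogram.
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)
sf.write("speech_reconstructed.wav", y_rec, sr)
```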

3 PROPOSED WORK

3.1 Research Gaps


There is a rich history of research into voice synthesis that has led to the proposal of several models. When attempting to forecast parameters, a great deal of contextual data is considered. The acoustic feature model incorporates a hierarchical framework to map context data to the waveform of the human voice. Convolutional Neural Network (CNN)-based approaches use this idea as a foundation for predicting the acoustic feature parameters of voice synthesis. In most cases, the linguistic features are converted into speech synthesis probability densities over different decision trees. A complete end-to-end voice synthesis model can be trained rather quickly using deep learning. Template-based speech synthesis algorithms struggle to produce highly natural voices because of voice glitches, and feature extraction for voice synthesis still faces obstacles.
In the first stage of VecMap, the monolingual word embeddings are normalized and orthogonalized. Then, a compact bilingual dictionary is used to provide an approximation of the linear mapping, which eliminates the need for a massive seed dictionary. After that, new terms in the source language are transformed using what was learnt, thereby expanding the vocabulary. The procedure is iterated until the convergence condition is reached [14].
Many studies use the Mean Squared Error (MSE) regularisation term for VecMap:

MSE = || XW − Z ||_F²

where X and Z are the aligned source- and target-language embedding matrices and W is the linear mapping; the formula computes the squared Frobenius norm of the residual matrix.
In this work, we use VecMap under supervision. A 5,000-word English-Persian lexicon is used for training and testing purposes under close supervision.
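For illustration only, the following sketch shows the supervised mapping step under the stated objective: given embedding matrices X and Z aligned through a seed dictionary, it learns a linear map W minimising ||XW − Z||_F², with the orthogonal (Procrustes) variant that VecMap-style methods typically use. The random matrices merely stand in for the real 300-dimensional embeddings.

```python
# Illustrative supervised mapping step: learn W minimising ||XW - Z||_F^2 for
# seed-dictionary-aligned source (X) and target (Z) embedding matrices.
import numpy as np

def learn_mapping(X, Z, orthogonal=True):
    if orthogonal:
        U, _, Vt = np.linalg.svd(X.T @ Z)       # orthogonal Procrustes: W = U V^T
        return U @ Vt
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)   # unconstrained least squares
    return W

# Placeholder data standing in for the 5,000 aligned 300-dimensional embeddings.
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 300))
W = learn_mapping(X, Z)
residual = np.linalg.norm(X @ W - Z, ord="fro") ** 2    # the Frobenius/MSE objective
```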

Fig. 1. Proposed system flowchart: English, Japanese, and Mandarin TTS corpora are collected and preprocessed, a CNN is trained to obtain the multilingual, multi-speaker TTS model, and input text is then tested against the model until the speech is converted, after which the process terminates.

Fig. 2. CNN model for TTS: inputs pass through CNN-based feature extraction, a global pooling layer, and a softmax layer to produce the speech outputs.

3.2 CNN
Figure 2 shows how words in the system correspond to real-number vectors. The words are indexed by the integers 0 through N−1, and these word indices are represented as vectors of length N that can be fed directly into the corresponding network model. The textual sequences are ordered in time, and target words are predicted from context much as in a skip-gram model. The conditional probability of each target word is obtained by applying the softmax function to the score vector, as shown in Equation (1), which identifies the most probable target word:

P(w_O | w_C) = exp(u_O^T v_C) / Σ_{i=0}^{N−1} exp(u_i^T v_C)    (1)

where {0, …, N−1} is the lexicon index set, T is shorthand for the duration (length) of the sequence, v_C is the vector of the context word C, and u_i is the output vector of the word with lexicon index i; the term in context is looked up in the lexicon by its index. During the approximate training phase, the loss function is built using the path from one node to another in a binary tree structure. Gradient computation, a part of the training process, has a complexity that grows with the lexicon. The serial data are pre-processed using negative sampling. We have developed a word-embedding model that can compare words such as "Hi" and "Hai" to determine their level of similarity. Instead of using a morphology function, a separate vector is used for word-to-vector conversion; in other words, the skip-gram model is used. We set both the minimum and maximum length of the retrieved subwords, but the dictionary size is not a fixed parameter; for this purpose, we used a byte-pair encoding strategy [1]. The Convolutional Neural Network (CNN) can be trained rapidly depending on the quality of the input voice. The input layers accept input vectors that conform to the probability distributions, and the parameters of the underlying hidden layers are then determined. Data are not passed on from the hidden layer until convergence is reached. Based on the length of the recovered subwords, the output layer interpolates the acoustic characteristics. These variables are trained within the same network, so there will be no future fragmentation issues. Additionally, the over-smoothing issue is mitigated by correlating the acoustic properties with varying frame sizes.
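A small numpy sketch of the softmax in Equation (1) may make the word-prediction step concrete; the lexicon size, embedding dimension, and random vectors are placeholders rather than values from our system.

```python
# Sketch of Eq. (1): the probability of a target word given a context word is the
# softmax of dot products between the context vector and every output vector in
# the lexicon (sizes and vectors are illustrative).
import numpy as np

N, dim = 10000, 300                              # lexicon size, embedding dimension
rng = np.random.default_rng(0)
V_in = rng.normal(scale=0.1, size=(N, dim))      # input (context) vectors v_i
U_out = rng.normal(scale=0.1, size=(N, dim))     # output (target) vectors u_i

def target_probabilities(context_idx):
    scores = U_out @ V_in[context_idx]           # u_i^T v_C for every i in the lexicon
    scores -= scores.max()                       # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()                       # Eq. (1)

p = target_probabilities(context_idx=42)
most_likely_target = int(np.argmax(p))
```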
Model averaging is an ensemble method in which the predictions of all individual models are weighted equally. Overfitting is a major problem in machine-learning tasks, and model averaging helps to mitigate it. One variant of this strategy, the weighted-average ensemble, gives more weight to members with higher expected performance on a hold-out dataset, so that better-performing models contribute more and weaker models contribute less. In this study the simple model-averaging ensemble is employed, although a weighted-average ensemble might perform even better. For a classification task, the certainty of each class can be determined using the softmax activation function, shown in Equation (2), which yields the predicted probability for each class:

σ(Z)_i = exp(Z_i) / Σ_{j=1}^{C} exp(Z_j),  i = 1, …, C    (2)

where Z is the vector of input values, C is the total number of categories in the dataset, and exp(Z_i) and exp(Z_j) are the standard exponential functions applied to the input and output vectors. The output of the model-averaging method is the class label with the greatest average confidence, obtained by averaging the classification probabilities of the base learners for each class. Equation (3) describes this confidence-averaged prediction of a class label, taken as the argmax over the anticipated labels of the target classes:

ŷ = argmax_i (1/n) Σ_{k=1}^{n} p_k(y = i | x)    (3)

where p_k is the k-th base learner, n is the number of selected base CNN learners, and, in the proposed TTS, p_k(y = i | x) represents the likelihood that the category value i is present in a given data sample x.
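The following sketch illustrates the model-averaging rule of Equations (2) and (3): the softmax outputs of the base CNN learners are averaged per class and the argmax gives the predicted label. The list of trained base models is assumed to exist.

```python
# Sketch of the model-averaging ensemble in Eqs. (2)-(3). `base_models` is a
# placeholder for the trained base CNN learners, each returning class scores.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)              # Eq. (2)

def ensemble_predict(base_models, x):
    # Each model returns scores of shape (num_samples, C).
    probs = np.stack([softmax(m.predict(x)) for m in base_models])
    avg = probs.mean(axis=0)                              # average confidence per class
    return np.argmax(avg, axis=-1)                        # Eq. (3)
```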

Fig. 3. Proposed CNN architecture for TTS: input text and speech features pass through convolution (Conv), max-pooling (MaxPool), fully connected (FC), and softmax layers for feature extraction, followed by parameter generation and waveform synthesis.

4 RESULTS & DISCUSSION


We utilised a number of Python packages, including Keras, Pandas, Gensim, and Matplotlib, to put our study into practice. Table 1 provides a brief overview of the model parameters we employed. For BERT and the other vector models, we use embeddings of 300 dimensions.
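For illustration, pre-trained 300-dimensional vectors can be loaded with Gensim and packed into an embedding matrix roughly as follows; the vector file name and the toy vocabulary are placeholders.

```python
# Illustrative sketch: load 300-dimensional pre-trained vectors with Gensim and
# build an embedding matrix for a downstream model. File name and vocabulary are
# placeholders, not the resources used in this paper.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)
vocab = ["hello", "world", "speech", "synthesis"]          # toy vocabulary

embedding_dim = 300
embedding_matrix = np.zeros((len(vocab), embedding_dim), dtype="float32")
for idx, word in enumerate(vocab):
    if word in vectors:                                    # out-of-vocabulary rows stay zero
        embedding_matrix[idx] = vectors[word]
```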

In this work, we experimented with voice synthesis to confirm a viable strategy for realising a multilingual synthesis system. We built each system based on the voice quality of a Japanese speaker (nicknamed NKY) and tested it using English-only synthesised speech, since judging the tone quality of several languages is challenging. (ELN denotes our natural English-language speaker.) The experimental conditions were as follows:
 All of the speech was listened to in a quiet room with headphones.
 The speech samples were presented in no particular order.
 One native English speaker from the United States rated the level of naturalness.
 Four native Japanese speakers rated the similarity.
As input text, we employed a training corpus and 30 sentences from English news articles. The listening test included both synthetic speech samples from seven different TTS systems (Tables 1 and 2) and actual human speech. Information about each speech sample, together with the CNN architecture used for it, is provided in Table 1.

Table 1. Parameter settings used in the model.

Model Parameters Values

The specifics of the ASR systems tested here are listed in Table 3.
The confusion matrix [19] uses the following fundamental terminology.

 A true positive (TP) is a positive observation that is consistent with a positive prediction.
 A false negative (FN) is a positive observation that is given a negative prediction.
 A true negative (TN) is a negative observation that matches a negative prediction.
 A false positive (FP) is a negative observation that is given a positive prediction.

To analyse the test results, we calculated the Accuracy, Precision, Recall, and F1-score. These performance evaluation measures are given by Equations (4), (5), (6), and (7):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (7)

Accuracy is the easiest of these assessment indices to interpret; however, a supplementary index is necessary when the data categories are not evenly distributed. The F1-score is a useful measure since it takes both precision and recall into account.
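A short sketch of Equations (4)-(7) computed directly from the confusion-matrix counts (binary case; the counts are illustrative values only):

```python
# Evaluation measures of Eqs. (4)-(7) from confusion-matrix counts (binary case).
def evaluate(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

print(evaluate(tp=86, tn=80, fp=10, fn=14))   # illustrative counts
```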
We use the CSS10 voice corpus, which has a single speaker for each of 10 languages. We choose German, Dutch, Chinese, and Japanese to build a task sequence. We use the published train/validation splits, which provide 15.1 hours, 11.5 hours, 5.4 hours, and 14.0 hours of training audio, respectively. For each language, we used the last 20 examples from the validation set as test samples. The languages in the lifelong-learning sequence are ordered as follows: German (DE), Dutch (NL), Chinese (ZH), and Japanese (JA). Our buffer size for replay-based algorithms is 300 utterances, or around 0.6 h of audio. After training on each language, random samples are added to the buffer. To maintain linguistic parity in the buffer over the course of the training sequence, old samples are removed at random as new ones are added.
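The replay buffer described above can be sketched as follows; the data structures and the equal-share eviction policy are our own simplifying assumptions rather than the exact implementation.

```python
# Illustrative replay buffer: capacity of 300 utterances, random samples added
# after training on each language, and random eviction of older samples so that
# all seen languages stay roughly balanced.
import random

class BalancedReplayBuffer:
    def __init__(self, capacity=300, seed=0):
        self.capacity = capacity
        self.buffer = {}                          # language -> list of utterances
        self.rng = random.Random(seed)

    def add_language(self, language, utterances):
        self.buffer[language] = []
        quota = self.capacity // len(self.buffer)  # equal share per language
        # Shrink previously stored languages to the new quota by random eviction.
        for lang, items in self.buffer.items():
            if len(items) > quota:
                self.buffer[lang] = self.rng.sample(items, quota)
        self.buffer[language] = self.rng.sample(utterances, min(quota, len(utterances)))

    def sample(self, n):
        pool = [u for items in self.buffer.values() for u in items]
        return self.rng.sample(pool, min(n, len(pool)))
```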

Framework and tuning settings: each language undergoes 100 iterations of training. We use the Adam optimizer, with a learning rate of 0.001 at the outset that decays to 0.049 after 60 iterations. The default batch size is 84. All sequential language-learning techniques begin training on the current language from the best model parameters obtained on the previously learned language.
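A hedged Keras sketch of this training configuration is given below; it treats the 100 iterations as epochs, uses an illustrative decayed learning-rate value, and trains a toy model on dummy data purely to show where the reported settings (Adam, 0.001 initial learning rate, decay after 60, batch size 84) would plug in.

```python
# Illustrative training-configuration sketch (Keras). The toy model, dummy data,
# and the decayed learning-rate value are placeholders, not the paper's setup.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(840, 128).astype("float32")            # dummy features
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 4, 840), 4)

def lr_schedule(epoch, lr):
    return 1e-3 if epoch < 60 else 1e-4        # step change after epoch 60 (illustrative)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=100, batch_size=84,
          callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```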

Table 2. When compared to the results of the baseline monolingual models using CNN as the classifier, the suggested cross-lingual sentiment analysis approach shows significant improvement.

Train data   Static/dynamic       BERT                               VectorMap
                                  Precision   Recall   F-measure     Precision   Recall   F-measure
English      Static embedding     60.29       58.36    59.31         67.05       66.21    66.63
English      Dynamic embedding    68.28       65.99    67.13         73.93       72.81    73.37
Japanese     Static embedding     70.32       68.54    69.42         72.69       71.71    72.18
Japanese     Dynamic embedding    79.73       78.18    78.96         81.48       80.01    80.75
Mandarin     Static embedding     73.29       71.64    72.46         80.12       78.33    79.22
Mandarin     Dynamic embedding    77.73       76.16    76.94         85.33       83.29    84.28

Table 3. When compared to the results of the baseline multilingual model using BERT as the classifier, the suggested cross-lingual sentiment assessment model shows significant improvement.

Train data   Static/dynamic       BERT                               VectorMap
                                  Precision   Recall   F-measure     Precision   Recall   F-measure
English      Static embedding     63.75       62.13    62.93         69.98       68.51    69.24
English      Dynamic embedding    70.03       68.72    69.37         76.38       75.18    75.79
Japanese     Static embedding     70.99       69.25    70.11         75.13       73.68    74.38
Japanese     Dynamic embedding    78.33       76.23    77.27         82.04       80.82    81.43
Mandarin     Static embedding     77.24       75.55    76.39         80.92       79.45    80.18
Mandarin     Dynamic embedding    80.63       79.28    79.96         87.24       86.06    86.65

Table 4. When compared to the results of the baseline multilingual models using CNN-BERT as the classifier, the suggested cross-lingual sentiment estimation model shows significant improvement.

Train data   Static/dynamic       BERT                               VectorMap
                                  Precision   Recall   F-measure     Precision   Recall   F-measure
English      Static embedding     65.85       64.03    64.93         72.31       70.96    71.63
English      Dynamic embedding    71.77       70.44    71.11         78.09       76.74    77.41
Japanese     Static embedding     70.72       68.96    69.83         77.31       76.27    76.84
Japanese     Dynamic embedding    78.28       76.91    77.59         84.93       83.48    84.18
Mandarin     Static embedding     76.83       75.28    76.05         84.26       83.26    83.76
Mandarin     Dynamic embedding    85.03       83.58    84.31         89.69       88.13    88.91

Fig. 4. Comparison of accuracy for CNN, BERT, and CNN-BERT: (a) accuracy (%) versus number of samples; (b) accuracy (%) versus feature size.

Fig. 5. Comparison of precision for CNN, BERT, and CNN-BERT: (a) precision (%) versus number of samples; (b) precision (%) versus feature size.

Fig. 6. Comparison of recall: (a) recall (%) versus number of samples for CNN, BERT, and CNN-BERT; (b) recall (%) versus feature size for CNN, BPNN, and the proposed model.

Fig. 7. Comparison of F-measure for CNN, BPNN, and the proposed model: (a) F-measure (%) versus number of samples; (b) F-measure (%) versus feature size.

Fig. 8. Comparison of runtime (s) versus feature size for CNN, BERT, and the proposed model.

Two distinct cross-lingual embeddings (BERT and VectorMap) and classifiers (CNN and CNN-BERT) are employed in our research. Furthermore, because the cross-lingual embeddings were trained with no information from the training data of the sentiment evaluation task, we conducted our tests in two modes. There are two types of embedding models: static, which uses the pre-trained vectors directly, and dynamic, which fine-tunes the embeddings using the training data. All experiments are reported in terms of accuracy, precision, recall, and F-measure.
The following noteworthy insights were gleaned from a comparison of the outputs of the various training models and cross-lingual embeddings:
Our suggested model, trained on English data rather than Persian data, consistently outperforms state-of-the-art baseline algorithms in our experimental settings.
Although the suggested cross-lingual sentiment analysis performs much better than the monolingual models, we can further enhance the model by exploiting both English and Persian information in a bilingual training phase, as shown by a comparison of the results from English-only training with bilingual training. Bilingual training allows the use of the extensive English dataset as well as the language-dependent characteristics present in the Persian dataset.
VectorMap consistently beats BERT, although the performance gaps are small. Because BilBOWA needs parallel data, the corpora available for training that algorithm are restricted, whereas with VectorMap this is not the case.
We can further enhance the model by capturing the semantic properties present in the subject matter of the training data, as shown by a comparison of the static and dynamic embeddings.
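The static/dynamic distinction can be sketched in Keras as follows: the same pre-trained embedding matrix is either frozen (static) or fine-tuned together with the classifier (dynamic). The random matrix and the small CNN head are placeholders, not our actual architecture.

```python
# Illustrative static vs. dynamic embedding setup: the pre-trained matrix is either
# frozen (static) or fine-tuned (dynamic). The matrix and CNN head are placeholders.
import numpy as np
import tensorflow as tf

vocab_size, dim = 20000, 300
pretrained = np.random.rand(vocab_size, dim).astype("float32")   # placeholder vectors

def build_classifier(dynamic):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size, dim,
            embeddings_initializer=tf.keras.initializers.Constant(pretrained),
            trainable=dynamic),                                    # False = static mode
        tf.keras.layers.Conv1D(128, 5, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(3, activation="softmax"),            # sentiment classes
    ])

static_model = build_classifier(dynamic=False)
dynamic_model = build_classifier(dynamic=True)
```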
Figure 8 shows that, when all classifiers are compared, the hybrid models perform best. By exploiting the strengths of both models, the BERT-CNN and CNN-BERT architectures are able to capture both sequential and atypical information from the text, leading to more precise predictions. Furthermore, the fact that BERT-CNN outperforms CNN-BERT demonstrates the importance of capturing sequential information early in the architecture; with the CNN-BERT model, some sequential information may be lost because the CNN stage is applied first. When combined, BERT-CNN and VectorMap provide an F-measure of 95.04, making them the most effective combination.
In contrast to concatenative speech synthesis, the presented method is capable of creating synthetic speech that is not only highly intelligible but also sounds very much like real speech. One of the limitations of the existing TTS-based speech synthesis model is that it uses context decision trees to cluster and tie speech parameters. As a direct consequence, the artificially produced speech lacks the degree of realism required for expressive speech synthesis. In the suggested method, the clustering phase of the context decision tree is replaced by deep learning-based speech synthesis models. These models employ the full context information and a distributed representation in place of the clustering process; they map the context characteristics to high-dimensional acoustic data through several hidden layers, which makes the quality of the synthesized speech superior to that of the approaches used in the past. The enormous representational capacity of DL-based models, on the other hand, has led to a number of new difficulties. The models need more hidden layers and nodes to provide more accurate predictions, which in turn increases the total number of network parameters as well as the time and space necessary for training the network. Over-fitting is a common problem when there is insufficient data to train the models. As a result, training the model requires a huge quantity of computing resources as well as a large number of corpora.

5 CONCLUSION
The poor intelligibility and unnatural output of traditional concatenative speech synthesis technologies are two major problems, and purely context-based CNN deep learning approaches are not robust enough for expressive speech synthesis. Our suggested model can satisfy these voice synthesis needs and handle the associated difficulties. The suggested model's minimal aperiodic distortion makes it an excellent candidate for a communication recognition model, and its output is as close to human speech as possible, even though synthesized speech still has a number of audible flaws. Additionally, considerable work remains in incorporating sentiment analysis into text categorization using natural language processing; the intensity of expressed emotion varies greatly from country to country. To improve their voice synthesis outputs, models need to incorporate increasingly many hidden layers and nodes into the updated mixture density network. To evaluate the model as well as possible, our suggested method will require a more robust network structure and better optimization approaches. We hope that after reading this article and trying out the example data provided, both experienced researchers and those just starting out will have a better grasp of the steps involved in creating a deep learning approach. The model makes progress in overcoming fitting issues when training data are scarce, although more space is needed to hold the input parameters in the DL-based method.
Authors' Contribution: Each author contributed equally to each part.

DATA AVAILABILITY STATEMENT:


The data used to support the findings of this study are available from the corresponding author upon request.

ETHICAL APPROVAL:
This article does not contain any studies with human participants or animals performed by any of the authors.

CONFLICTS OF INTEREST
The authors declare that there are no conflicts of interest regarding the publication of this paper.

FUNDING STATEMENT
The authors declare that no funding was received for this research and publication.

ACKNOWLEDGMENT
The authors are grateful for the characterization support provided to complete this research work.

REFERENCES
[1]. Bollepalli, B., Juvela, L., & Alku, P. (2019). Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System.
Interspeech.
[2]. Mishev, K., Karovska Ristovska, A., Trajanov, D., Eftimov, T., & Simjanoska, M. (2020). MAKEDONKA: Applied Deep Learning Model for
Text-to-Speech Synthesis in Macedonian Language. Applied Sciences.
[3]. Nishimura, Y., Saito, Y., Takamichi, S., Tachibana, K., & Saruwatari, H. (2022). Acoustic Modeling for End-to-End Empathetic Dialogue Speech
Synthesis Using Linguistic and Prosodic Contexts of Dialogue History. Interspeech.
[4]. Barhoush, M., Hallawa, A., Peine, A., Martin, L., & Schmeink, A. (2023). Localization-Driven Speech Enhancement in Noisy Multi-Speaker
Hospital Environments Using Deep Learning and Meta Learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 670-
683.
[5]. Ning, Y., He, S., Wu, Z., Xing, C., & Zhang, L. (2019). A Review of Deep Learning Based Speech Synthesis. Applied Sciences.
[6]. Gudmalwar, A.P., Basel, B., Dutta, A., & Rao, C.V. (2022). The Magnitude and Phase based Speech Representation Learning using Autoencoder
for Classifying Speech Emotions using Deep Canonical Correlation Analysis. Interspeech.
[7]. Tu, T., Chen, Y., Liu, A.H., & Lee, H. (2020). Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech
Representation. Interspeech.
[8]. Wu, P., Watanabe, S., Goldstein, L.M., Black, A.W., & Anumanchipalli, G.K. (2022). Deep Speech Synthesis from Articulatory Representations.
Interspeech.
[9]. Lee, M., Lee, J., & Chang, J. (2022). Non-Autoregressive Fully Parallel Deep Convolutional Neural Speech Synthesis. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 30, 1150-1159.
[10]. Khalil, R.A., Jones, E., Babar, M.I., Jan, T., Zafar, M.H., & Alhussain, T. (2019). Speech Emotion Recognition Using Deep Learning Techniques:
A Review. IEEE Access, 7, 117327-117345.

[11]. Kumar, Y., Koul, A., & Singh, C. (2022). A deep learning approaches in text-to-speech system: a systematic review and recent research
perspective. Multimedia Tools and Applications, 82, 15171 - 15197.
[12]. Ma, Y., & Wang, W. (2022). MSFL: Explainable Multitask-Based Shared Feature Learning for Multilingual Speech Emotion Recognition.
Applied Sciences.
[13]. Kulkarni, A., Colotte, V., & Jouvet, D. (2020). Transfer Learning of the Expressivity Using FLOW Metric Learning in Multispeaker Text-to-
Speech Synthesis. Interspeech.
[14]. Bhattacharya, S., Borah, S., Mishra, B.K., & Mondal, A. (2022). Emotion detection from multilingual audio using deep analysis. Multimedia
Tools and Applications, 81, 41309 - 41338.
[15]. Azizah, K., & Jatmiko, W. (2022). Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker
Text-to-Speech on Low-Resource Languages. IEEE Access, 10, 5895-5911.
[16]. Janyoi, P., & Seresangtakul, P. (2020). Tonal Contour Generation for Isarn Speech Synthesis Using Deep Learning and Sampling-Based F0
Representation. Applied Sciences.
[17]. Nakamura, T., Koriyama, T., & Saruwatari, H. (2021). Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using
Self-Attention GP Layer. Interspeech.
[18]. Zhang, W., Yang, H., Bu, X., & Wang, L. (2022). Deep Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis. IEEE Access, 7,
167884-167894.
[19]. Peng, Y., & Ling, Z. (2022). Decoupled Pronunciation and Prosody Modeling in Meta-Learning-based Multilingual Speech Synthesis.
Interspeech.
[20]. Mitsui, K., Koriyama, T., & Saruwatari, H. (2020). Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes. ArXiv,
abs/2008.02950.
[21]. Moon, S., Kim, S., & Choi, Y. (2022). MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer.
IEEE Access, PP, 1-1.
[22]. Zhang, Y., Weiss, R.J., Zen, H., Wu, Y., Chen, Z., Skerry-Ryan, R.J., Jia, Y., Rosenberg, A., & Ramabhadran, B. (2019). Learning to Speak
Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. ArXiv, abs/1907.04448.
[23]. Koriyama, T., & Kobayashi, T. (2019). Statistical Parametric Speech Synthesis Using Deep Gaussian Processes. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 27, 948-959.
[24]. Zou, Y., Liu, S., Yin, X., Lin, H., Wang, C., Zhang, H., & Ma, Z. (2021). Fine-Grained Prosody Modeling in Neural Speech Synthesis Using
ToBI Representation. Interspeech.
[25]. Sultana, S., Iqbal, M.Z., Selim, M.R., Rashid, M.M., & Rahman, M.S. (2022). Bangla Speech Emotion Recognition and Cross-Lingual Study
Using Deep CNN and BBERT Networks. IEEE Access, 10, 564-578.
[26]. Singh, Y.B., & Goel, S. (2021). An efficient algorithm for recognition of emotions from speaker and language independent speech using deep
learning. Multimedia Tools and Applications, 80, 14001 - 14018.
[27]. Lapthawan, T., Prom-on, S., Birkholz, P., & Xu, Y. (2022). Estimating underlying articulatory targets of Thai vowels by using deep learning
based on generating synthetic samples from a 3D vocal tract model and data augmentation. IEEE Access, PP, 1-1.
[28]. Ali, R.A., Hardie, R.C., Narayanan, B.N., & Kebede, T.M. (2022). IMNets: Deep Learning Using an Incremental Modular Network Synthesis
Approach for Medical Imaging Applications. Applied Sciences.
[29]. Byun, S., & Lee, S. (2021). A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning
Algorithms. Applied Sciences.
[30]. Al-onazi, B.B., Nauman, M.A., Jahangir, R., Malik, M.M., Alkhammash, E.H., & Elshewey, A.M. (2022). Transformer-Based Multilingual
Speech Emotion Recognition Using Data Augmentation and Feature Fusion. Applied Sciences.
[31]. Sumalatha Mahankali, Jagadish Kalava, Yugandhar Garapati, Bullarao Domathoti, Venkata rao Maddumala, Venkatesa Prabhu Sundramurty, "A
Treatment to Cure Diabetes Using Plant-Based Drug Discovery", Evidence-Based Complementary and Alternative Medicine, vol. 2022, Article
ID 8621665, 12 pages, 2022. https://doi.org/10.1155/2022/8621665
