
Transfer Learning from News Domain to Lecture Domain in Automatic Speech Recognition

Iftitakhul Zakiah
School of Electrical Engineering and Informatics
Institut Teknologi Bandung (ITB), Indonesia
iftitakhulzakiyah@gmail.com

Dessi Puji Lestari
School of Electrical Engineering and Informatics
Institut Teknologi Bandung (ITB), Indonesia
dessipuji@informatika.org

Abstract—Automatic Speech Recognition (ASR) is increasingly being developed, including in the lecture domain. Building an ASR system from scratch requires very large amounts of data, so another approach can be used: transfer learning, in which models are built by reusing existing models as source models. This experiment begins with data collection from lectures of the Informatics Undergraduate Program at ITB. We used spontaneous language models from the news domain as source models. We built three systems: one that uses the news domain only, one that uses the lecture domain only, and one that uses both. In all three systems, the acoustic model is a triphone GMM-HMM, with MAP adaptation added only in system C. The language models are N-gram models and an LSTM with a projection layer. Transfer learning is applied both as N-gram interpolation and as weight initialization of the LSTM model. The news domain system gives a WER of 78.30% (5-fold) and 85.18% (10sp), the lecture domain system 58.232% (5-fold) and 62.18% (10sp), and the transfer learning system 52.734% (5-fold) and 67.0% (10sp). The smaller the WER, the better the model, so the best ASR for lectures uses the transfer learning approach for the language model and a triphone model for the acoustic model.

Keywords—ASR, transfer learning, LM, AM, triphone, MAP, N-gram, interpolation, LSTM, WER
I. INTRODUCTION

Automatic Speech Recognition (ASR) has been implemented in various fields, such as personal assistants or speech translation for deaf people. ASR can also be applied in the lecture domain, where lecturers deliver material in the classroom. When a lecturer explains material in class, each student takes notes in their own version, so the knowledge is distributed among them. Learning activities work better with centralized notes, because a number of things delivered by lecturers are not available in the slides or course modules. Some universities let their students build centralized notes through shared applications; however, other universities do not allow students to use electronic devices such as laptops and cellphones during teaching and learning activities. These are the reasons an ASR system for the lecture domain is needed.

Transfer learning uses prior knowledge to learn a new task or domain. In this experiment, we used data from previous work [7] as the source domain. [7] used spontaneous language from the news, so the language style is similar to that used in lectures. In addition, the news domain is relatively broad, and much of its vocabulary also appears in the lecture domain. On the other hand, the news domain lacks the specialized vocabulary that only exists in lectures, so that vocabulary has to be added from the lecture domain. These differences are the reasons for choosing transfer learning for the language model. The language model acts as the rules of a language, covering its vocabulary and grammar.

The rooms used in the lecture domain also differ from those in the news domain: the news data were recorded in a soundproof room, while the lecture data were recorded in classrooms. This can affect ASR performance because each room has its own characteristics. Therefore, transfer learning also needs to be applied to the acoustic model. The acoustic model constructs statistical representations of the feature-vector sequences obtained from the speech signal.

II. RELATED WORKS

Various studies on transfer learning in speech processing have been carried out. [5] divided research on transfer learning in speech processing into three main groups: cross-lingual and multilingual transfer, speaker adaptation, and model transfer. In the first group, cross-lingual and multilingual transfer rests on the observation that some common patterns can be reused across languages.

Transfer learning with model adaptation was done by [10], which used Maximum Likelihood Linear Regression (MLLR) and Maximum a Posteriori (MAP) adaptation in a car environment. That study compared data with natural noise against data with computer-generated noise, and MLLR + MAP produced better performance on the data with natural noise.

Another approach is model transfer, where a new model (child model) learns from an existing model (teacher model). The idea is that the teacher model holds a lot of knowledge from its training data, and that knowledge can guide the training of a simpler child model, which without the teacher's guidance could not learn the details it must know. [2] used existing N-gram models, interpolated with new data, to produce new models; that experiment was conducted on the lecture domain with data sources ranging from teaching transcripts to textbooks. [9] tried three types of LSTM approaches aimed at optimizing the decoding process in a second pass with pre-training. The LSTM in [9] used a projection layer as proposed by [8] and an adaptation layer as proposed by [6].
III. TRANSFER LEARNING

Machine learning uses a statistical approach to build a model from labeled or unlabeled training data. If the training data are labeled, it is called supervised learning; learning from unlabeled data is called unsupervised learning; and semi-supervised learning is used when the labeled data are too small to build a good classifier and a large amount of unlabeled data is used as well. Machine learning with the same task, domain, and distribution is hereinafter referred to as traditional machine learning. Unlike traditional machine learning, transfer learning uses a different domain, task, or distribution between the training and testing data [12].

Fig. 1. The differences between traditional machine learning (a) and transfer learning (b) [12]

Formally, given a source domain DS with source task TS and a target domain DT with target task TT, transfer learning aims to help learn the target predictive function fT(·) in DT using knowledge from DS and TS, where DS ≠ DT or TS ≠ TT [12].
A. N-gram Interpolation

Interpolation creates new data points from a range of known data points. In the N-gram model, interpolation can use existing models as the known data points, so a new N-gram model is created by combining them. Interpolation is needed to build new models from relatively small data.

Linear interpolation uses weights λ that take the conditions or context into account. When the N-gram counts are reliable, the λ values can be set so that the higher-order N-grams receive greater weight. For a trigram model:

P(wi | wi−2, wi−1) = λ1 · PML(wi | wi−2, wi−1) + λ2 · PML(wi | wi−1) + λ3 · PML(wi)   (1)

where each λk may be conditioned on the context wi−2, wi−1, and the weights sum to one. A small numeric sketch of (1) is given below.
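To make equation (1) concrete, the following is a minimal Python sketch with fixed, context-independent weights; the probability tables and weight values are illustrative assumptions, not the models used in this paper.

# Minimal sketch of linear N-gram interpolation, eq. (1), with fixed weights.
# The probability tables below are toy values, not the paper's models.

def interpolated_trigram_prob(w2, w1, w, p_tri, p_bi, p_uni,
                              lambdas=(0.6, 0.3, 0.1)):
    """P(w | w2, w1) as a weighted sum of ML trigram, bigram, and
    unigram estimates. The lambdas must sum to one."""
    l1, l2, l3 = lambdas
    return (l1 * p_tri.get((w2, w1, w), 0.0)
            + l2 * p_bi.get((w1, w), 0.0)
            + l3 * p_uni.get(w, 0.0))

# Toy maximum-likelihood tables (assumed for illustration).
p_tri = {("kuliah", "basis", "data"): 0.5}
p_bi = {("basis", "data"): 0.4}
p_uni = {"data": 0.05}

print(interpolated_trigram_prob("kuliah", "basis", "data", p_tri, p_bi, p_uni))
# 0.6*0.5 + 0.3*0.4 + 0.1*0.05 = 0.425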
B. LSTM Language Model

LSTM was introduced by Hochreiter and Schmidhuber in 1997 and was designed to address the long-term dependency problem, where a value estimated at the current time step still depends on values from far earlier, even though the currently estimated value may no longer have a strong connection with those distant values. The experiment in [9] used the LSTM architecture proposed by [8], with a projection layer after the LSTM layer and an adaptation layer as proposed by [6].
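This paper builds its LSTM models with KALDI (see the topology options in TABLE III). Purely as an illustration of the projection-layer idea of [8] and of transfer by weight initialization, here is a minimal sketch in PyTorch, assuming PyTorch >= 1.8 for the proj_size argument; all sizes and the checkpoint file name are hypothetical.

# Minimal sketch of an LSTM language model with a projection layer (LSTMP,
# as in [8]); sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class LstmpLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden=1024, proj=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # proj_size adds the recurrent projection that shrinks the
        # hidden state from `hidden` to `proj` at every time step.
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                            proj_size=proj, batch_first=True)
        self.out = nn.Linear(proj, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)          # (batch, time, embed_dim)
        h, _ = self.lstm(x)             # (batch, time, proj)
        return self.out(h)              # logits over the vocabulary

# Transfer by weight initialization: start the lecture-domain model from
# news-domain weights, then fine-tune (the checkpoint name is hypothetical).
model = LstmpLanguageModel(vocab_size=50000)
model.load_state_dict(torch.load("news_lstm.pt"), strict=False)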
C. Maximum a Posteriori (MAP)

Maximum a Posteriori (MAP) estimation can be used effectively for sparse-data problems by bringing prior knowledge into the estimate. Assume the HMM parameters form a random vector Φ whose prior knowledge is expressed by the distribution p(Φ); these parameters are obtained from earlier experimental results. The MAP estimate is then formulated as follows [14]:

Φ̂ = argmaxΦ p(Φ | X) = argmaxΦ [ p(X | Φ) p(Φ) ]   (2)

If p(Φ) is not available from a previous experiment, it can be taken as a uniform distribution; in other words, MAP reduces to ordinary maximum likelihood. The limitation of MAP is that it needs accurate prior knowledge.
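As an illustration of equation (2): for a Gaussian mean with a conjugate prior, the MAP estimate has a simple closed form that interpolates between the prior (source-domain) mean and the sample mean [14]. The following sketch assumes that setting; tau and the data are illustrative, not this paper's configuration.

# Sketch of MAP estimation for a Gaussian mean (a standard closed-form
# instance of eq. (2) with a conjugate prior, cf. [14]).
import numpy as np

def map_mean(prior_mean, data, tau=10.0):
    """Interpolate between the prior mean and the sample mean.
    tau controls how much weight the prior (source domain) keeps;
    with no data the estimate falls back to the prior."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    if n == 0:
        return prior_mean
    return (tau * prior_mean + data.sum()) / (tau + n)

# A source-domain mean adapted with a little target-domain data.
print(map_mean(prior_mean=0.0, data=[1.0, 1.2, 0.8]))  # pulled toward 1.0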
IV. EXPERIMENTS

We used the KALDI [4] and SRILM [1] toolkits for this experiment: KALDI for everything from data preparation to decoding the test data, and SRILM to build the N-gram based language models. Before that, we collected the lecture data by recording in classrooms and transcribing the recordings. Apart from the lecture data, the spontaneous data (news domain) [7] is also used as source-domain data, both for transfer learning and for the baseline systems. The datasets, experiment design, and parameter tuning for the models are explained in the following sections.

A. Datasets

In this experiment we used datasets from two domains: the news domain and the lecture domain. The news domain uses the dataset from previous work [7], while the lecture-domain datasets were collected by us in the classrooms. We have two datasets in the lecture domain: lecture_all for training and 10sp for testing.

The first step was collecting data by recording courses of the Informatics Undergraduate Program at ITB. Besides the audio corpus, a text corpus was collected by transcription, writing down word by word what each speaker said. This text corpus is then used to build the language model.

Text data were also collected from course materials of the Informatics Undergraduate Program at ITB. We chose only the Indonesian material from the slides and lecture modules; several courses are not taught in Indonesian, so they are not included. We divided the courses into six clusters based on the same fifth digit of the course ID: cluster 1 represents programming, 2 computer engineering, 3 distributed systems, 4 databases, 5 software engineering, and 6 artificial intelligence.

The text is also used to obtain a list of words not found in the previous experiment [7], collected as the unique words of the text. This word list is added to a new lexicon using the G2P (grapheme-to-phoneme) tools developed by PT. Prosa Solusi Cerdas.

The spontaneous text corpus was crawled from MBDC, Beritagar (the Bincang section), Spontaneous Freelance, and Hipwee, and the corresponding audio corpus was collected by reading the text. TABLE I summarizes the datasets used in the experiment (see also the sketch below).

TABLE I. DATASETS CHARACTERISTICS

Dataset         | Text                                   | Total Sentences                                      | Audio
spontan_all     | Crawled from websites                  | 81966                                                | Read
lecture_all     | Transcribed from audio                 | 2999                                                 | Spontaneous
10sp            | Arranged by the authors                | 500                                                  | Read
cluster_lecture | lecture_all + slides + lecture modules | 15457, 12853, 4153, 7046, 6569, 4845 (clusters 1-6)  | -
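A minimal sketch of the OOV word collection described above; the file names are hypothetical placeholders, and the actual phonetization was done with the proprietary G2P tool mentioned in the text.

# Sketch of collecting the lecture-domain words missing from the
# news-domain lexicon; file names are hypothetical placeholders.
def read_lexicon_words(path):
    # Kaldi-style lexicon: "<word> <phone> <phone> ...", word in column one.
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

def oov_words(transcript_path, lexicon_words):
    with open(transcript_path, encoding="utf-8") as f:
        tokens = {tok.lower() for line in f for tok in line.split()}
    return sorted(tokens - lexicon_words)

lexicon = read_lexicon_words("news_lexicon.txt")
for word in oov_words("lecture_transcripts.txt", lexicon):
    print(word)  # e.g. 'wumpus', 'scrum' -> sent to the G2P tool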
B. Design

In general, the experiment is divided into three systems: two baseline systems (systems A and B) and one system that uses transfer learning (system C). System A models are trained with spontaneous data only; system B models are trained with lecture data only; system C uses both datasets to build its acoustic and language models. For every system we used a 5-fold cross-validation scenario, because the lecture dataset is relatively small and this makes the test results representative of the whole data (a sketch of the split follows TABLE II). Detailed experiments are described in the following sections.

TABLE II. DESIGN EXPERIMENTS

System     | Training                  | Testing      | AM            | N-gram LM             | LSTM LM
A          | spontan_all (informal)    | 5-fold, 10sp | Triphone      | Trigram               | LSTM
B          | lecture_all               | 5-fold, 10sp | Triphone      | Unigram, Trigram      | LSTM
C          | spontan_all + lecture_all | 5-fold, 10sp | Triphone, MAP | Trigram interpolation | LSTM (weight initialization)
Additional | cluster_lecture           | 10sp         | MAP           | Trigram interpolation | -
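A minimal sketch of the 5-fold split over the lecture utterances, assuming scikit-learn is available; the utterance IDs are placeholders.

# Sketch of the 5-fold cross-validation split over lecture utterances
# (utterance IDs are hypothetical placeholders).
from sklearn.model_selection import KFold

utterances = [f"lecture_utt_{i:04d}" for i in range(2999)]  # cf. TABLE I
for fold, (train_idx, test_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(utterances)):
    train = [utterances[i] for i in train_idx]
    test = [utterances[i] for i in test_idx]
    print(f"fold {fold}: {len(train)} train / {len(test)} test utterances")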
1) Language Model
Language models are built using N-grams, in this experiment unigrams and trigrams. The choice of N is tuned on system B, because the corpus used in system B (lecture_all) is the closest to the target of this experiment; from there we chose the best N for the other systems.

On systems A and B we used vanilla N-gram and LSTM language models. In system C we interpolated the N-grams and initialized the LSTM weights with the source model (news domain). We also measured the impact of OOV words on all systems (N-gram only) by decoding with both a closed and an open lexicon. The LSTM model is decoded only with the OOV (open) lexicon.

Separately from systems A, B, and C, we also ran experiments in which the courses were divided into clusters, to see how the system performs when the domain is narrowed. This language model uses N-gram interpolation with the same source model as system C; each cluster also has its own lexicon, namely the news-domain lexicon extended with the words of that cluster.

2) Acoustic Model
The existence of noise and other environmental conditions that can disturb speech recognition is one reason we use triphone-based acoustic models rather than word-based ones. Word-based acoustic models are good for ASR with a small, limited lexicon and many repetitive sentences; on the other hand, triphones are better suited to cases where the lexicon is large [11].

The acoustic model for system A is trained only on the spontan_all corpus (without 5-fold cross validation). System B uses lecture_all to build a triphone model with the 5-fold scheme. System C also uses the 5-fold scheme, combining two acoustic conditions: clean conditions (spontaneous data) and classroom conditions.

In system C, besides the triphone GMM-HMM model, the model was also adapted using MAP (Maximum a Posteriori). The MAP acoustic model used the same training data as the triphone GMM-HMM model. This is used to compare the adaptation model with the plain triphone model, and to see how much impact extra training data from another domain has. TABLE II above summarizes the design of all experiments, combining language and acoustic models.

C. Setup Parameter
We only configured parameters for the language models; the acoustic models used the default parameters of the tools. For the N-gram language model, parameters are configured only in system C, when interpolating with the existing spontaneous language model. The tuned parameter is lambda, the weight given to the lecture model when it is interpolated with the spontaneous language model. We tried 0.5, 0.7, 0.75, 0.8, and 0.9 for lambda, and 0.8 gave the best perplexity (see the sketch after TABLE III).

For the LSTM model, the configuration was tuned on the lecture_all corpus, because that corpus represents the target domain; the 10sp corpus was then used to evaluate the model. TABLE III lists the parameter options tried for the LSTM.

TABLE III. TUNING PARAMETER LSTM

Parameter      | Options tried
Topology       | embedding - tdnn - lstmp - tdnn - lstmp - tdnn - output;
                 embedding - tdnn - lstmp - tdnn - output;
                 embedding - tdnn - lstmp - tdnn - lstmp - tdnn - lstmp - tdnn - output
Regularization | tdnn = 0.001, lstm = 0.001, output = 0.001;
                 tdnn = 0.001, lstm = 0.0005, output = 0.0005
Embedding size | 512; 800; 1024
Units          | 128; 256
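A minimal sketch of the lambda search described above; perplexity() and interpolate() are hypothetical stand-ins for the SRILM evaluation and mixing steps, not real SRILM bindings.

# Sketch of the lambda grid search for interpolation weights: pick the
# weight for the lecture model that minimizes held-out perplexity.
def tune_lambda(lecture_lm, news_lm, heldout, perplexity, interpolate):
    best_lam, best_ppl = None, float("inf")
    for lam in (0.5, 0.7, 0.75, 0.8, 0.9):  # the grid used in this paper
        mixed = interpolate(lecture_lm, news_lm, lam)
        ppl = perplexity(mixed, heldout)
        if ppl < best_ppl:
            best_lam, best_ppl = lam, ppl
    return best_lam, best_ppl  # this paper found 0.8 to be best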
V. RESULT AND ANALYSIS

We evaluated the experiments with both intrinsic and extrinsic evaluation. The intrinsic evaluation uses perplexity for the language model and PER (Phoneme Error Rate) for the acoustic model; the extrinsic evaluation uses WER (Word Error Rate). A sketch of the WER computation is given below; the following subsections evaluate the results per system.
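For reference, a minimal sketch of the WER computation used as the extrinsic metric; PER is the same edit-distance computation applied to phoneme sequences instead of words.

# Sketch of the WER computation (edit distance over words).
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / max(len(r), 1)

print(wer("sistem basis data", "sistem basis dua"))  # 33.33 (1 sub / 3 words)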
A. System A

As seen in TABLE IV., the average perplexity at 5-fold is better than the perplexity on the 10sp corpus. This can occur because the language style in the lecture_all corpus is spontaneous, while the 10sp corpus is more formal. In addition, a number of words never appear in the training data, so they receive a small probability, and there is an OOV problem where a searched word is not in the lexicon of the model.

Looking at the PER in TABLE IV., there is a significant difference between the evaluation on the lecture_all corpus and on the 10sp corpus. This can be caused by the different speaking styles of the two corpora: the spontaneous triphone acoustic model, built from read spontaneous data, has characteristics similar to the 10sp corpus, so its PER there is better.

Overall, the extrinsic evaluation of the acoustic model and the system A language model is given in TABLE IV. The average result of the 5-fold cross-validation scheme is a WER of 58.76. Per speaker, speaker m03 has the highest WER because of poor recording quality: the volume is low because the clip-on microphone was not placed properly. Speaker f06 has the best WER because this speaker read text more than speaking spontaneously, which matches the speech style of the spontan_all corpus used to train the acoustic and language models. As in the 5-fold test data, the relatively large error is also caused by OOV words in the 10sp test data, especially lecture terms such as 'wumpus', 'scrum', 'mysql', and others. OOV has a large impact on ASR performance when the training data are spontaneous.

Compared with the trigram models, the WER of the LSTM spontaneous model increases. Its very large WER occurs because the lexicon of the trained model and the spontaneous training data have different writing styles, whereas for the LSTM the lexicon used for training and for decoding must be the same.

TABLE IV. SYSTEM A RESULTS

LM               | AM                   | Perplexity | OOV    | PER      | WER
Trigram Informal | Triphone spontaneous | 702.115**  | 0      | 92.528** | 57.76**
                 |                      | 1125.81*   | 0      | 43.46*   | 70.16*
Trigram Spontan  | Triphone spontaneous | 6773.48**  | 6619   |          | 78.30**
                 |                      | 5237.92*   | 3640   |          | 85.18*
LSTM spontaneous | Triphone spontaneous | 19996.88** | 1007.6 |          | 93.432**

** used the 5-fold cross-validation scenario
* used all of lecture_all as training data (evaluated on the 10sp corpus)
The PER describes the acoustic model and is shared by all rows.
B. System B

The average perplexity of the unigram model at 5-fold is better than the perplexity evaluated on the 10sp corpus, as in TABLE V. This may happen because the language style in the lecture_all corpus is spontaneous, while the 10sp corpus is more formal. In addition, the open experiment on the lexicon gives a better perplexity than the closed experiment; in this case, however, perplexity is inversely proportional to WER. This may be because the closed-experiment lexicon is larger, so it assigns smaller probabilities to the words of the corpus. Still, the presence or absence of OOV words in the lexicon does not change the WER significantly (less than 1%). This may be because the text data used to train the language model are not close to the test data, so the probability these words receive from the learning process is small.

Seen from the PER in TABLE V., the difference between evaluating on the lecture_all corpus and on the 10sp corpus is not very significant. This can be caused by the similar environmental conditions of the two corpora: the lecture triphone model was built from spontaneous lecture data with similar characteristics, so the resulting PERs do not differ much.

The LSTM language model gives a higher perplexity than the N-gram model, as in TABLE V., and its WER is also not better than the N-gram model's. This may be because LSTM predictions get worse when there are OOV words; in addition, the corpus used to build the model is small, whereas neural network based models do better with a sufficiently large corpus.

TABLE V. SYSTEM B RESULTS

LM              | AM               | Perplexity | OOV   | PER      | WER
Unigram lecture | Triphone lecture | 1689.286** | 0     | 27.574** | 58.82**
                |                  | 2068.13*   | 0     | 33.81*   | 62.98*
                |                  | 1210.87**  | 411.8 |          | 59.506**
                |                  | 1484.02*   | 245   |          | 63.64*
Trigram lecture | Triphone lecture | 989.216**  | 0     |          | 56.694**
                |                  | 933.613*   | 0     |          | 61.12*
                |                  | 865.66**   | 411.8 |          | 58.232**
                |                  | 774.208*   | 245   |          | 62.18*
LSTM lecture    | Triphone lecture | 2026.42**  | 411.8 |          | 60.87**
                |                  | 1900.7*    | 245   |          | 66.06*

** used the 5-fold cross-validation scenario
* used all of lecture_all as training data (evaluated on the 10sp corpus)
Rows with OOV = 0 are the closed-lexicon experiments; the PER describes the acoustic model and is shared by all rows.
C. System C

If we compare the WER of the trigram interpolation (TABLE VI.) and the trigram lecture model (TABLE V.) with the same acoustic model (triphone), interpolation with the trigram from the spontaneous language improves the model at 5-fold (about a 1.5% gain) but not on the 10sp corpus (a degradation of about 9%). This is probably caused by the different characteristics of the test data: the 5-fold test corpus uses a spontaneous language style, not dictation speech, so it is sufficiently covered by the source language model built from spontaneous language, whereas the 10sp test corpus uses a stiffer style because it is dictated.

The extrinsic evaluation using MAP as the acoustic model with the lecture trigram (TABLE VI.) improves by about 3.5% at 5-fold compared with the lecture triphone (TABLE V.). However, on the 10sp corpus the WER is much worse (down about 6.5%). This shows that MAP adaptation works well for data whose acoustic conditions are similar to the adaptation data, in contrast to the 10sp corpus, which has a condition mismatch with the training data.

System C, which uses the transfer learning approach, generally gives better results than the two previous systems. This can happen because, in general, system C uses more training data than systems A and B. Unlike the previous two systems, where the perplexity of the open experiment is inversely proportional to accuracy (WER), in system C it is directly proportional.

TABLE VI. also shows the intrinsic evaluation of the triphone and MAP acoustic models trained on the combined data of the two domains. Compared with TABLE V., which uses data from the lecture domain only, the differences are not significant. This may be because the acoustic conditions of the news-domain data are not very similar to those of the lecture-domain data.

When we compare the extrinsic evaluation of the acoustic model that uses transfer learning (MAP, TABLE VI.) with the triphone model that does not, MAP gives better results in the 5-fold test but not on the 10sp corpus. This can be caused by differences in distribution and characteristics between the test corpora. It indicates that MAP performs better when the acoustic conditions of the test data are similar to those of the adapted training data, whereas the triphone model does better when there is an acoustic mismatch between training and test data.

Meanwhile, comparing MAP + N-gram interpolation with MAP + N-gram lecture, which differ only in the language model, the WER on the 10sp corpus is practically the same (a delta of 0.02%). This shows that on the 10sp corpus the news source-domain model has little impact, because the lecture domain has specific words and does not use a spontaneous language style. This differs from the 5-fold results, which are better with the news-domain source model, because those test data come from transcripts of spontaneous lecture speech.

TABLE VI. also shows that OOV and perplexity are directly proportional. In general, the perplexity of the LSTM in systems A, B, and C is no better than that of the N-gram models (see TABLE IV., TABLE V., and TABLE VI.). The LSTM is not very robust when there are many OOV words: it makes predictions by considering the context before and after a word, so a single OOV word can affect the predictions of the preceding and following words. An N-gram model, by contrast, only looks at the preceding sequence; if one OOV word occurs there, the N-gram can still back off to the word before it (for N > 2), so it may still give the correct prediction. This suggests the LSTM is better suited to cases with a large lexicon and larger training data with high variance.

TABLE VI. SYSTEM C RESULTS

LM: Trigram interpolation
Perplexity: 737.758** / 758.793* (closed lexicon, OOV 0); 925.813** / 846.058* (open lexicon, OOV 384.2 / 220)

AM                             | PER               | WER (closed lexicon) | WER (open lexicon)
Triphone lecture               | 27.574** / 33.81* | 55.99** / 61.90*     | 56.81** / 71.1*
Triphone spontaneous + lecture | 27.966** / 30.85* | 54.11** / 56.90*     | 54.606** / 62.24*
MAP spontaneous + lecture      | 26.242** / 31.82* | 51.38** / 54.64*     | 52.734** / 67.0*

LM                                       | AM                        | Perplexity          | OOV         | WER
Trigram lecture                          | MAP spontaneous + lecture | 865.66** / 774.208* | 411.8 / 245 | 54.708** / 66.98*
LSTM spontaneous -> lecture (weight init.) | MAP spontaneous + lecture | 3890.6** / 2829.9*  | 411.8 / 245 | 59.86** / 68.02*

** used the 5-fold cross-validation scenario
* used all of lecture_all as training data (evaluated on the 10sp corpus)
Closed lexicon: experiment with OOV = 0; open lexicon: OOV words allowed.
D. Additional Experiment

Besides the three systems above, we ran an additional experiment to see the model's performance when the scope of the lecture domain is narrowed to course clusters, using the cluster_lecture corpus. Based on the results (TABLE VII.), almost all clusters (except clusters 1 and 6) give better WER than system C with the same language- and acoustic-model configuration. This can be caused by the increased training data for the language models, and also by the narrower, more specific scope (domain).

For cluster_1 and cluster_6, even though training data for the language model were added, the OOV rate of the test corpus remains large, and this affects the WER. We also see that OOV and WER are correlated, as shown in Fig. 2.

TABLE VII. ADDITIONAL EXPERIMENT RESULTS

LM (trigram interpolation) | AM                        | Perplexity | OOV  | WER
Cluster_1                  | MAP spontaneous + lecture | 2730.85    | 2055 | 88.06
Cluster_2                  |                           | 4023.86    | 285  | 61.34
Cluster_3                  |                           | 1573.32    | 255  | 56.38
Cluster_4                  |                           | 2792.66    | 255  | 59.08
Cluster_5                  |                           | 1890.08    | 275  | 55.92
Cluster_6                  |                           | 1386.71    | 1620 | 85.46

PER of the acoustic model: 26.242** / 31.82*, as in TABLE VI.; testing on the 10sp corpus.
Fig. 2. The correlation between OOV and WER per course cluster
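The trend in Fig. 2 can be checked directly from the TABLE VII numbers; a minimal sketch, assuming Python 3.10+ for statistics.correlation:

# Sketch of the OOV-WER correlation shown in Fig. 2, using the
# per-cluster numbers from TABLE VII.
import statistics

oov = [2055, 285, 255, 255, 275, 1620]           # clusters 1..6
wer = [88.06, 61.34, 56.38, 59.08, 55.92, 85.46]
print(statistics.correlation(oov, wer))  # close to 1: strongly correlated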
VI. CONCLUSIONS

In this paper we have run several experiments and built three systems: two baseline systems (systems A and B) and one transfer learning system (system C), plus an additional experiment. Based on the experiments, we conclude that transfer learning with N-gram interpolation works better than LSTM weight initialization when the target-domain dataset is small. The adaptation model (MAP) gives better results when the test and training data have similar acoustic conditions, and conversely the triphone model gives better results when there is a condition mismatch between the two datasets. Overall, system C performs better than the other systems: system A gives a WER of 78.30% (5-fold) and 85.18% (10sp), system B 58.232% (5-fold) and 62.18% (10sp), and system C 52.734% (5-fold) and 67.0% (10sp). The best ASR for the lecture domain uses the transfer learning approach for the language model and the triphone model for the acoustic model.

FUTURE WORKS

For future work, we suggest focusing on online decoding, because the purpose of the lecture ASR is to be used in the classroom for automatic transcription; adding more lecture data, both for training and for testing the model; and exploring deep transfer learning for the acoustic models.

ACKNOWLEDGMENT

This experiment is part of the research roadmap of the Artificial Intelligence and Graphics Laboratory, ITB. Thanks to PT. Prosa Solusi Cerdas for supporting the hardware, software, and environment for the experiment.
REFERENCES

[1] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit", ICSLP, 2002.
[2] B. J. Paul, J. Glass, "N-gram Weighting: Reducing Training Data Mismatch in Cross-Domain Language Model Estimation", Empirical Methods in Natural Language Processing, pages 829-838, 2008.
[3] D. Jurafsky, J. H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.)", Prentice-Hall, 2009.
[4] D. Povey et al., "The Kaldi Speech Recognition Toolkit", Automatic Speech Recognition and Understanding (ASRU), pages 1-4, 2011.
[5] D. Wang, T. F. Zheng, "Transfer Learning for Speech and Language Processing", APSIPA ASC, 2015.
[6] E. Arisoy, T. N. Sainath, B. Kingsbury, B. Ramabhadran, "Deep Neural Network Language Models", NAACL-HLT Workshop, pages 20-28, 2012.
[7] F. Y. Putri, D. P. Lestari, D. H. Widyantoro, "Long Short-Term Memory Based Language Model for Indonesian Spontaneous Speech Recognition", IC3INA, 2018.
[8] H. Sak, A. Senior, F. Beaufays, "Long Short-Term Memory Recurrent Neural Network Architectures", Interspeech, 2014.
[9] M. Ma, M. Nirschl, "Approaches for Neural-Network Language Model Adaptation", Interspeech, 2017.
[10] R. Bippus, A. Fischer, V. Stahl, "Domain Adaptation for Robust Automatic Speech Recognition in Car Environments", ISCA-Speech, 1999.
[11] R. Thangarajan, A. M. Natarajan, M. Selvam, "Word and Triphone Based Approaches in Continuous Speech Recognition for Tamil Language", WSEAS Transactions on Signal Processing, 2008.
[12] S. J. Pan, Q. Yang, "A Survey on Transfer Learning", IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 10, pages 1345-1359, 2009.
[13] S. Karpagavalli, E. Chandra, "A Review on Automatic Speech Recognition Architecture and Approaches", International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 9, No. 4, pages 393-404, 2016.
[14] X. Huang, A. Acero, H. W. Hon, "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development", Prentice-Hall, 2001.