
Ministry of Education and Science of the Republic of Kazakhstan

Ye. A. Buketov Karaganda University

IT-TECHNOLOGIES IN
TRANSLATION
specialty:
6B02301 - «Translation studies»
author: Klunnaya V.O., Master of Humanities,
Teacher of Chair of Theory and Practice of Translation

Type of lessons: seminars

Karaganda 2020
Theme 8
Speech recognition systems in translator work
The idea of using digital computers to translate
natural languages first emerged about 50 years ago.
The communication technology available today -
specifically, anything that enables real-time, online
conversation - is getting a tremendous amount of
attention, mostly due to the Internet’s continuous
expansion.
The rise of social networking has also
contributed to this growing interest as more users
join and speak different languages to communicate
with each other. But in spite of this technology’s
recent progress, we still lack a thorough
understanding of how real-time MT affects
communication.
MT is challenging because translation requires
a huge amount of human knowledge to be
encoded in machine-processable form.
In addition, natural languages are highly
ambiguous: two languages seldom express the
same content in the same way.
Google Translate is an example of a text-based
MT system that applies statistical learning
techniques to build language and translation models
from a large number of texts.
Other tools offer cross language chat services,
such as IBM Lotus Translation Services for
Sametime and VoxOx.
In the 2013 version of the Gartner Hype Cycle for
emerging technologies, we can see an important trend in
speech-to-speech recognition technology (the
translation of spoken words into text and back to speech
in the translated language), which leads us to predict
that in a few years’ time, we might see speech-to-speech
or voice-based MT technology.
Although speech-to-speech translation technology is
considered to be an innovation trigger - there’s a
potential technology breakthrough but no usable
products or proven commercial viability - speech
recognition technology falls in the plateau of
productivity, meaning that mainstream adoption is
starting to take off.
Speech-to-speech translation has three
components:
automatic speech recognition (ASR),
MT,
and voice synthesis (or text to speech; TTS).
The ASR component processes the voice in its
original language, creating a text version of what the
speaker said.
This text in the original language goes through
the MT component, which translates it to the target
language.
Finally, this translated text goes through the TTS
component, which “speaks” the text using a
synthesized voice in the target language.
For each step of this process, many other
technologies could be used in the future to improve
the quality of the overall speech-to-speech
translation.
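To make this pipeline concrete, below is a minimal sketch in Python of how the three stages fit together. All three component functions are hypothetical placeholders rather than a real API: in practice each would be backed by an ASR engine, an MT engine, and a TTS engine of the developer's choice.

# A minimal sketch of the three-stage speech-to-speech pipeline described above.
# The three component functions are hypothetical placeholders; in a real system
# they would be backed by an ASR engine, an MT service, and a TTS engine.

def recognize_speech(audio_bytes: bytes, source_lang: str) -> str:
    """ASR step: turn source-language audio into source-language text."""
    raise NotImplementedError("plug in an ASR engine, e.g. CMU Sphinx or a cloud API")

def translate_text(text: str, source_lang: str, target_lang: str) -> str:
    """MT step: translate source-language text into target-language text."""
    raise NotImplementedError("plug in an MT engine or service")

def synthesize_speech(text: str, target_lang: str) -> bytes:
    """TTS step: 'speak' the translated text with a synthesized voice."""
    raise NotImplementedError("plug in a TTS engine")

def speech_to_speech(audio_bytes: bytes, source_lang: str, target_lang: str) -> bytes:
    source_text = recognize_speech(audio_bytes, source_lang)             # ASR
    target_text = translate_text(source_text, source_lang, target_lang)  # MT
    return synthesize_speech(target_text, target_lang)                   # TTS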
Microsoft Speech API

Microsoft Speech API (SAPI) allows access to Windows' built-in speech recognition and speech synthesis components.
The API was released as part of the OS from
Windows 98 forward.
The most recent release, Microsoft Speech API 5.4,
supports a small number of languages: American
English, British English, Spanish, French, German,
simplified Chinese, and traditional Chinese.
Because it is a native Windows API, SAPI isn’t
easy to use unless you’re an experienced C++
developer.
The Microsoft .NET Framework offers an alternative way to access Windows' speech resources through the System.Speech namespace.
This library gives C# developers access to the same SAPI features through much simpler interfaces.
Its speech recognition components allow the
system to interpret strings from single spoken words
up to complete phrases; typically, however, developers
use Microsoft speech technologies to let applications
recognize spoken, predefined commands instead of
complex phrases.
In these cases, the accuracy of the speech
recognition is very high.
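The command-and-control pattern can be illustrated with a small, engine-agnostic sketch in Python (the recognizer itself is assumed to exist elsewhere, for example SAPI accessed through .NET): whatever text the recognizer returns, the application reacts only to a predefined set of commands, which is what keeps accuracy high.

# A minimal sketch of command-and-control style recognition: whichever engine
# produces the text, the application only reacts to a small set of predefined
# commands and ignores everything else. Command names here are illustrative.

COMMANDS = {
    "open file": lambda: print("opening file..."),
    "save document": lambda: print("saving..."),
    "close window": lambda: print("closing window..."),
}

def handle_recognized_text(text: str) -> None:
    action = COMMANDS.get(text.strip().lower())
    if action:
        action()                       # a known command: execute it
    else:
        print(f"ignored: {text!r}")    # free-form speech is ignored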
Microsoft Server-Related Technologies

The Microsoft Speech Platform provides access to speech recognition and synthesis components that encourage the development of complex voice/telephony server applications.
This technology supports 26 different languages, although it primarily recognizes isolated words stored in a predefined grammar (http://msdn.microsoft.com/en-us/library/hh361571(v=office.14).aspx).
Microsoft also provides the Microsoft Unified Communications API (UCMA 3.0), a platform for developing server applications that require integration with technologies such as voice over IP, instant messaging, voice calls, or video calls.
The UCMA API allows easy integration with Microsoft
Lync and enables developers to create middle-layer
applications.
Sphinx
Sphinx 4 is a modern open source speech recognition
framework based on hidden Markov models (HMMs) and
developed in the Java programming language.
This free platform also allows the implementation of continuous-speech, speaker-independent, and large-vocabulary recognition systems (http://cmusphinx.sourceforge.net/sphinx4).
The framework is language independent, so developers can
use it to build a system that recognizes any language. However,
Sphinx requires a model for the language it needs to recognize.
The Sphinx group has made available models for English,
Chinese, French, Spanish, German, and Russian languages.
Sphinx 4 is a flexible, modular, pluggable framework that is fostering a lot of the current research in speech recognition; many studies use it to create speech recognition systems for new languages or to test algorithms.
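Sphinx 4 itself is used from Java, but the same CMU Sphinx family can be tried quickly from Python through the pocketsphinx backend of the SpeechRecognition package. The sketch below assumes both packages are installed and that a mono WAV file with English speech is available; the file name and settings are illustrative.

# A minimal sketch of offline recognition with the CMU Sphinx family from Python.
# Assumes "pip install SpeechRecognition pocketsphinx" and the bundled English models.

import speech_recognition as sr

recognizer = sr.Recognizer()

# "utterance.wav" is a hypothetical mono WAV file containing English speech.
with sr.AudioFile("utterance.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # Decode with the bundled English acoustic and language models.
    print(recognizer.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")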
HTK
The HMM Toolkit, also known as HTK, is a
platform for building and manipulating HMMs.
Researchers primarily use HTK for speech recognition, although it has also been applied to other areas such as speech synthesis research, character recognition, and DNA sequencing.
HTK can build complete continuous speech,
speaker-independent, and large-vocabulary recognition
systems for any desired language.
It also provides tools to create and train acoustic
models (http://htk.eng.cam.ac.uk/docs/faq.shtml).
Microsoft bought the platform in 1999 and retains
the copyright to existing HTK code.
Developers can still use HTK to train the models
used in their products, but the HDecoder module has a
more restrictive license and can be used only for
research purposes.
(The HDecoder is the module responsible for analyzing the digital signal and identifying the most likely word or phrase being said, according to the available acoustic and language models.)
Julius
Julius is a high-performance decoder designed to
support a large vocabulary and continuous speech
recognition, which are important features to allow
the implementation of dictating systems.
The decoder is at the heart of a speech
recognition system; its job is to identify the most
likely spoken words for given acoustic evidence.
It can perform near-real-time decoding for
60,000-word dictation tasks on most current PCs.
This decoder implements the most popular
speech recognition algorithms, which makes it very
efficient.
The system also supports models created with other tools, such as HTK and the Cambridge Statistical Language Modeling toolkit (CAM SLM) from Carnegie Mellon University (CMU; http://julius.sourceforge.jp/en_index.php).
A great advantage of using the Julius decoder is
that it’s freely available.
The VoxForge project is currently developing
acoustic models to be used in the Julius recognition
system.
The main platform for Julius is Linux,
although it works on Windows as well; it also has a
Microsoft SAPI–compatible version.
Java Speech API
The Java Speech API (JSAPI) is a specification for
cross-platform APIs that supports command-and-control
recognizers, dictation systems, and speech synthesizers.
Currently, the Java Speech API includes javadoc-style API documentation for the approximately 70 classes and interfaces in the API.
The specification includes a detailed programmer’s
guide that explains both introductory and advanced
speech application programming with JSAPI, but it
doesn’t yet offer the source code or binary classes required
to compile the applications.
JSAPI is freely available, and its owners welcome
anyone to develop an implementation for it; so far, it has
just a few implementations, such as FreeTTS for voice
synthesis and IBM Speech for Java for speech
recognition (the discontinued IBM ViaVoice).
Google Web Speech API

In early 2013, Google released Chrome version 25, which included support for speech recognition in several different languages via the Web Speech API.
This new API is a JavaScript library that lets developers easily integrate sophisticated continuous speech recognition features, such as voice dictation, into their Web applications.
However, features built using this technology can only be used in the Chrome browser; other browsers do not support the same JavaScript library.
Nuance Dragon SDK

Dragon NaturallySpeaking by Nuance Communications is an application suite for speech recognition, supporting several languages other than English, including French, German, Italian, and Dutch.
It’s available as a desktop application for PC and Mac and
as a mobile app for Android and iOS.
Nuance also provides software development kits (SDKs) for
enabling speech recognition in third-party applications.
Developers use the Dragon SDK client to add speech recognition to existing Windows applications, the SDK as a back end to support non-Windows clients, and the Mobile SDK to develop apps for iOS, Android, and Windows Phone.
MT adoption is starting to take off in industry.
MT technology is currently available in the form of
cross-language Web services that can be embedded into
multiuser and multilingual chats without disrupting
conversation flow, but it’s mostly text-based.
Many factors affect how MT systems are used and
evaluated, including the intended use of the translation,
the nature of the MT software, and the nature of the
translation process.
Voice-based technology is already available, but it isn’t
capable of handling typical human conversations where
people talk over one another, use slang, or chat on noisy
streets.
Moreover, to overcome the language barrier worldwide,
multiple languages must be supported by speech-to-speech
translation technology, requiring speech, bilingual, and text
corpora for each of the several thousands of languages that
exist on our planet today.
There are many methods used for ASR, including HMMs, dynamic time warping (DTW), neural networks, and deep neural networks.
HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary or short-time stationary signal: on a short time scale (around 10 ms), speech can be approximated as a stationary process.
Speech can be thought of as a Markov model for
many stochastic purposes.
Another reason why HMMs are popular is because
they can be trained automatically and are simple and
computationally feasible to use.
Dynamic time warping was used for SR before HMMs, but it has long been considered less successful and is now rarely used.
Neural networks can be used efficiently in SR but are rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies.
Deep neural networks have been shown to succeed in SR, but because a speech-to-speech system consists of three components - recognition, translation, and synthesis - HMMs are often chosen for their lower time complexity.
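A minimal sketch of the classic HMM approach to isolated-word recognition is shown below, using the hmmlearn library (an assumption; the lecture does not prescribe a particular toolkit). One HMM is trained per word on sequences of acoustic feature vectors, and a new utterance is assigned to the word whose model gives it the highest likelihood; the feature extraction itself is discussed below.

# A minimal sketch of HMM-based isolated-word recognition: train one Gaussian HMM
# per word on feature sequences and pick the word whose model assigns the highest
# likelihood to a new utterance. Training data and features come from elsewhere.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_word_models(training_data):
    """training_data: dict mapping word -> list of (n_frames, n_features) arrays."""
    models = {}
    for word, sequences in training_data.items():
        X = np.vstack(sequences)                   # stack all frames of all utterances
        lengths = [len(seq) for seq in sequences]  # per-utterance frame counts
        model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=25)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognize(models, features):
    """Return the word whose HMM gives the highest log-likelihood for the utterance."""
    return max(models, key=lambda word: models[word].score(features))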
SR can be further divided into word-based, phrase-
based and phoneme-based mechanisms.
While word-based models give excellent performance in isolated-word recognition, they fail at continuous SR because of complexity.
The phoneme-based approach is generally preferred because it can work with large corpora and makes it easy to add new words to the vocabulary.
The main component of ASR is feature extraction. It retains the most relevant information about the speech signal, helps distinguish between different linguistic units, and filters out external noise, disturbances, and speaker-specific variation such as emotion.
Commonly used feature extraction techniques are
MFCCs (Mel Frequency Cepstral Coefficients), LPCs
(Linear Prediction Coefficients) and LPCCs (Linear
Prediction Cepstral Coefficients).
MFCCs are the features most commonly used for this step, for the reasons given below.
MFCCs are regarded as the state of the art for feature extraction, largely because they are modelled on the human auditory system and use a perceptual frequency scale, the mel scale.
They combine the advantages of cepstral analysis with a perceptual frequency scale based on critical bands, using logarithmically spaced filter banks.
Prior to the introduction of MFCCs, LPCs and LPCCs
were used for the feature extraction process.
These are now considered largely obsolete, since both are purely linear in nature and lack a perceptual frequency scale, even though they are efficient at tuning out environmental noise and other disturbances from the sampled speech signal.
Furthermore, several studies of feature extraction for various languages, including English, Chinese, Japanese, and Indian languages such as Hindi, have shown experimentally that MFCCs achieve at least 80 percent accuracy, compared with only about 60 percent for LPCs and LPCCs.
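As an illustration, the sketch below extracts MFCCs with the librosa library (an assumption; any comparable toolkit would do). The 25 ms window and 10 ms hop reflect the short-time stationarity assumption mentioned earlier, and 13 coefficients per frame is a common choice.

# A minimal sketch of MFCC extraction using librosa. The input file name and the
# frame settings are illustrative assumptions, not prescribed by the lecture.

import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input recording
mfccs = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # 13 cepstral coefficients per frame
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms frame shift
)
print(mfccs.shape)  # (13, number_of_frames): rows are coefficients, columns are frames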
There are numerous tools developed for ASR such as
HTK, JULIUS, SPHINX, KALDI, iATROS, AUDACITY,
PRAAT, SCARF, RWTH ASR and CSL.
Of all the available tools, the two most popular and widely accepted frameworks are HTK and Sphinx.
Both of these are based on Hidden Markov Model
(HMM) and are open source.
Both frameworks can be used to develop, train, and test a
speech model from existing corpus speech utterance data by
using Hidden Markov modeling techniques.
The comparison below is based on results obtained when decoding the test data from the AN4 corpus on a PC running Windows and Cygwin, with a 2.6 GHz Pentium 4 processor and 2 GB of system RAM.
The HTK decoder did not make any deletions, which gave it a slight advantage in overall word error rate.
Also, while HVite uses less memory during decoding, the difference in time to run the test set is significant, at 30 seconds.
In addition, HTK provides the HCopy tool, which supports a wealth of input/output combinations of model data and front-end features, some of which are not present in Sphinx, although a certain degree of compatibility is provided.
However, the efficiency of this compatibility is debatable.
Other slight advantages of HTK over Sphinx are that HTK is supported on multiple operating systems, whereas Sphinx is largely supported on Linux platforms only, and that HTK is written in the C programming language while Sphinx uses Java.
On the MT side, automatic segmentation was one of the first methods for handling long sentences: meaningful phrases are translated together after successful segmentation.
The sentence-length issue arises because the neural network fails to recognize some of the initial words of the input sentence from the vector into which the sentence is encoded.
The input sentence is segmented into cohesive phrases
that can be translated easily by the NN.
After every segment of the source sentence is translated into the target language, the target phrases are concatenated to produce the output.
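A minimal sketch of this segment-translate-concatenate strategy is given below. Both the segmenter and the phrase translator are hypothetical placeholders: a real system would use a trained clause segmenter and a neural MT model rather than the naive punctuation-based split shown here.

# A minimal sketch of the segment-translate-concatenate strategy described above.
# Both the segmenter and the phrase translator are placeholders for illustration.

import re

def segment_sentence(sentence: str) -> list[str]:
    """Naive stand-in for a clause segmenter: split on commas and conjunctions."""
    parts = re.split(r",\s*|\s+(?:and|but|because)\s+", sentence)
    return [p.strip() for p in parts if p.strip()]

def translate_phrase(phrase: str) -> str:
    """Placeholder for the neural network translating one cohesive phrase."""
    raise NotImplementedError("plug in an MT model here")

def translate_by_segments(sentence: str) -> str:
    segments = segment_sentence(sentence)
    translated = [translate_phrase(seg) for seg in segments]
    return " ".join(translated)   # concatenate the translated phrases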
While this gets rid of the vanishing gradient problem, it
poses a few new difficulties.
Because the neural network can only work with easy-to-translate (cohesive) clauses, a clause-division problem arises: the system cannot easily decide which cohesive phrases are best to translate into the target language.
Computational complexity also increases because of the parallel processing required to read input words and translate the previously read phrase at the same time.
Moreover, this method of concatenating translated phrases only works with languages that do not have long-range dependencies between the words of a sentence, and only with source-target language pairs that have semantically similar grammatical structures.
The use of system combination approaches for speech
recognition is now common in state-of-the-art speech
recognition systems.
There are two types of approaches to combining
Speech-to-Text (STT) systems.
The output from one system may be used as the
adaptation supervision for a second system. This is
referred to as cross-adaptation.
The information propagated between the two systems is the 1-best hypothesis with associated confidence scores.
The other approach is to take multiple system
hypotheses and combine them together, for example
using ROVER or Confusion Network Combination
(CNC).
Both of these types of system combination have shown significant gains for reducing word error rates (WERs) both individually and conjointly.
The gains from these approaches are increased when
the cross-site combination is used, where the systems are
built at different sites.
This tends to result in greater diversity between the
hypotheses than when using systems from a single site.
The impact of the chosen cross-site STT combination scheme on speech translation is examined below.
When combining multiple STT systems for speech recognition, the criterion commonly minimised is WER or, in the case of Mandarin, the character error rate (CER).
This is not necessarily the appropriate criterion to
minimise if the output of the system is to be fed into a
statistical machine translation (SMT) system.
In particular as the majority of translation systems
use phrase-based translation, it is important to
maintain phrases and appropriate word-order.
The choice of STT system combination approach
for speech translation must take this into
consideration.
Hypothesis Combination
The first stage in most hypothesis combination schemes is
to align the hypotheses against some “base” hypothesis.
For ROVER combination this is performed by minimising the Levenshtein distance; in CNC, the expected Levenshtein distance over the confusion networks is minimised.
Once the hypotheses have all been aligned they are
converted into a consensus network.
The final output for each set of arcs is then made based on a combination of voting and confidence scores from the hypotheses.
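To illustrate the align-then-vote idea, below is a toy, two-system sketch of ROVER-style combination (an illustration only, not the NIST ROVER tool): the second hypothesis is aligned against a base hypothesis by Levenshtein distance, and each aligned slot is then filled by the word with the higher confidence. Keying confidences by word string is a simplification; real systems attach confidence scores to individual word arcs.

# A toy two-system sketch of ROVER-style hypothesis combination.
# "" denotes a null arc (a deletion or insertion in the alignment).

def align(base, other):
    """Return a list of (base_word, other_word) pairs from a Levenshtein alignment."""
    n, m = len(base), len(other)
    # dp[i][j] = edit distance between base[:i] and other[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if base[i - 1] == other[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to recover the aligned word pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if base[i - 1] == other[j - 1] else 1):
            pairs.append((base[i - 1], other[j - 1])); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((base[i - 1], "")); i -= 1
        else:
            pairs.append(("", other[j - 1])); j -= 1
    return list(reversed(pairs))

def combine(base, other, conf_base, conf_other):
    """Per-slot selection by word confidence; conf_* map words to scores in [0, 1]."""
    output = []
    for w1, w2 in align(base, other):
        best = w1 if conf_base.get(w1, 0.0) >= conf_other.get(w2, 0.0) else w2
        if best:                      # skip null arcs that win a slot
            output.append(best)
    return output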
There are a number of issues to consider if the STT
output is to be used for translation:
Alignment:
The first stage in combining hypotheses together is to
align the 1-best hypotheses or confusion networks against
one another.
As this alignment process is related to minimising a WER-like measure, it can result in a number of peculiarities that may disrupt translation.
For example, the penalty for incorrectly inserting a word into a phrase is 1 for the recognition system.
However, since it may break a possible phrase that could
be translated in its entirety, the cost in terms of translation
score may be far higher.
This alignment issue becomes far more important the
greater the difference between the hypotheses being aligned.
Voting/Selection:
The criterion used to select the output from the consensus
networks is tuned to minimise WER/CER.
In contrast, for translation, the translation edit rate (TER) or BLEU score is commonly used. As words/characters in the consensus network are selected independently of each other (given the aligned hypotheses), there are no phrase constraints imposed.
Though this does not impact WER/CER, it may impact the
translation performance.
For schemes such as ROVER or CNC the selection may also
make use of word-level confidence scores.
In the case of combining two systems, confidence scores must be used. As the performance difference between STT systems increases, it becomes increasingly important to have good confidence scores. The use of poor word confidence estimates may result in inappropriate selections from the consensus network, again possibly breaking phrases.
Cross Adaptation
The vast majority of STT systems make use of an
adaptation stage, normally based on linear transformations
such as MLLR or CMLLR. To estimate these transforms an
initial hypothesis is required.
Normally this is obtained from an initial unadapted
recognition run.
However in cross-adaptation it is obtained from the
output of another STT system, which may itself have made
use of adaptation.
Provided the adaptation transforms are not overly tuned
to the adaptation supervision, gains are possible provided
the errors in the supervision differ from those of the system
to be adapted.
In terms of STT combination for translation, cross-
adaptation may be viewed as a “safe” option.
The system output does not involve the combination of
two highly diverse systems.
Note that there is still a level of hypothesis combination internal to each individual STT system described here.
However, in contrast to the cross-site hypothesis
combination, the within-site hypotheses tend to be more
consistent so shouldn’t be severely impacted by the issues
described before.
STT Post-Processing
In processing data such as Broadcast News (BN) or Broadcast Conversations (BC) for an STT system, the first stage is to segment the data into homogeneous blocks, or turns.
Each block should consist of data belonging to an
individual speaker in a particular acoustic environment.
However, SMT systems are trained on text data, where
individual sentences are aligned with each other and the
punctuation is provided.
It is therefore necessary to post-process the STT output
so that “sentence-like” segments with punctuation are
passed to the translation system.
Arabic STT output, whether generated by a stand-
alone system or by combining multiple system
outputs, was first divided into segments based on
BBN’s acoustic segmentation.
This segmentation starts by detecting speaker turns in the audio, and then divides long speaker turns into shorter segments by splitting on pauses of significant duration (at least 0.3 seconds). The resulting segments are about 10 seconds long on average.
The acoustic segmentation is further refined based on a Hidden-Event Language Model (HELM), a Kneser-Ney (KN) smoothed 4-gram trained on 850M words of Arabic news.
The HELM integrates pause duration as an observation in the HMM search and uses a bias to insert boundaries at a high rate; boundaries of low posterior probability are then removed while constraining the maximum sentence length to 51 words.
For the Mandarin system the post-processing consisted
of a two stage process.
First, a simple splitting process based on inter-word silence gaps was performed: the data was split at silences longer than 0.9 seconds.
In addition, a maximum segment length of 20 seconds was imposed by splitting longer segments at the longest inter-word gap.
A HELM was then used to further split the segments.
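The first, silence-based stage of this post-processing can be sketched as follows (an illustration of the idea, not the actual system code). The input is assumed to be a list of (word, start_time, end_time) tuples from the recognizer, with times in seconds; the HELM stage is not shown.

# A minimal sketch of pause-based segmentation of STT output.
# Assumes a non-empty list of (word, start, end) tuples with times in seconds.

SILENCE_SPLIT = 0.9   # split whenever the inter-word gap exceeds 0.9 s
MAX_LENGTH = 20.0     # maximum allowed segment length in seconds

def split_on_silence(words):
    segments, current = [], [words[0]]
    for prev, cur in zip(words, words[1:]):
        gap = cur[1] - prev[2]                  # silence between two words
        if gap > SILENCE_SPLIT:
            segments.append(current); current = []
        current.append(cur)
    segments.append(current)
    return segments

def enforce_max_length(segment):
    """Recursively split an over-long segment at its longest inter-word gap."""
    if segment[-1][2] - segment[0][1] <= MAX_LENGTH or len(segment) < 2:
        return [segment]
    gaps = [segment[i + 1][1] - segment[i][2] for i in range(len(segment) - 1)]
    k = gaps.index(max(gaps)) + 1               # split just before word k
    return enforce_max_length(segment[:k]) + enforce_max_length(segment[k:])

def segment_stt_output(words):
    result = []
    for seg in split_on_silence(words):
        result.extend(enforce_max_length(seg))
    return [[w for (w, _, _) in seg] for seg in result]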
In addition to adding sentence boundaries and
sentence end punctuation, both STT outputs had a
reverse number mapping applied.
STT systems are built on acoustic data where
numbers are in their spoken form.
However, translation systems are built on text data,
where numbers are normally written in digit form.
Language-specific number mappings were generated and applied prior to translation.
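A toy sketch of such a reverse number mapping for English is shown below: runs of spoken number words in the STT output are rewritten in digit form before translation. Real systems use far more elaborate language-specific rules; this illustration handles only simple combinations up to the hundreds.

# A toy sketch of "reverse number mapping": spoken number words -> digit form.

UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
         "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
         "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def map_numbers(tokens):
    """Replace runs of number words (up to 'hundred' combinations) with digits."""
    out, value, in_number = [], 0, False
    for tok in tokens + ["<end>"]:               # sentinel flushes a trailing number
        word = tok.lower()
        if word in UNITS or word in TENS or word == "hundred":
            in_number = True
            if word == "hundred":
                value = max(value, 1) * 100
            else:
                value += UNITS.get(word, 0) + TENS.get(word, 0)
        else:
            if in_number:
                out.append(str(value)); value, in_number = 0, False
            if tok != "<end>":
                out.append(tok)
    return out

# Example: ["about", "three", "hundred", "twenty", "people"] -> ["about", "320", "people"]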
The results described here were obtained with STT systems built at three different sites: BBN, Cambridge University (CU), and LIMSI. The final, sentence-boundary-marked STT output was translated using an SMT system developed at BBN.
The BBN translation engine employs statistical phrase-based translation models, with a decoding strategy similar to those of other phrase-based systems. Phrase translations are extracted from word alignments obtained by running GIZA++ on a bilingual parallel training corpus.
A significant portion of the phrase translations are generalized through the use of part-of-speech classes, for improved performance on unseen data.
Both forward and backward phrase translation
probabilities are estimated and used in decoding along with
a pruned trigram English LM, a penalty for phrase
reordering, a phrase segmentation score, and a word
insertion penalty.
These scores are combined in log-linear fashion, with weights estimated discriminatively on held-out BN and BC data in order to minimize overall TER.
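The log-linear combination itself is simple to sketch: each hypothesis is scored by a weighted sum of the logarithms of its feature scores, and the decoder keeps the highest-scoring hypotheses. The feature names, values, and weights below are illustrative assumptions, and penalty-type features are omitted for brevity.

# A minimal sketch of log-linear score combination in phrase-based SMT decoding.
# Feature names, values, and weights are illustrative assumptions.

import math

def loglinear_score(features, weights):
    """features: feature name -> probability; weights: tuned on held-out data."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

hypothesis_features = {
    "fwd_phrase_prob": 0.42,   # forward phrase translation probability
    "bwd_phrase_prob": 0.37,   # backward phrase translation probability
    "lm_prob": 1e-9,           # language model probability of the hypothesis
}
weights = {"fwd_phrase_prob": 0.3, "bwd_phrase_prob": 0.2, "lm_prob": 0.5}

print(loglinear_score(hypothesis_features, weights))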
The output of the decoding process is an N-best list of
unique translation hypotheses, which is subsequently rescored
using an unpruned, KN smoothed 5-gram English LM.
This LM is trained on more than 5B words of text,
consisting of the Gigaword corpus, archives downloaded from
news web sites, and other news and conversational web data.
The majority of the data used for training the SMT system
was text, not speech.
The translation system is thus not ideally matched to translating speech. However, it should be more closely matched in style to BN data than to the more conversational BC data.
As the translation systems were tuned for TER, translation performance is assessed here in terms of TER. Note that the same general trends were observed with BLEU, though the differences between the systems were reduced.
The first commercial speech recognition devices
appeared in the early 1980s.
They were imperfect, could only recognize commands
and were used to control the computer, its programs and
applications.
The first more or less successful examples of integrated continuous speech recognition programs appeared only in the last few years, such as Dragon NaturallySpeaking (DNS).
This program is considered to be the best. It allows one to type at speeds of up to 160 words per minute.
Recognition accuracy is about 99%.
The program can simultaneously work in dictation and
command mode.
The "Scratch that" voice command allows one to correct errors made during dictation.
Unfortunately, the program cannot work with the
Russian language.
Despite the promise of such programs, they are still not widely used in translation today.
What are the benefits for a translator using a
speech recognition device?
1. Physical strain on the fingers and wrists is significantly reduced.
2. The speed of information input is more than double that of the traditional typing method.
3. The program can be taught to recognize not only common vocabulary but also specialized terminology, and to use abbreviations to speed up the input of recurring multi-word phrases.
In theory, a speech recognition program is very promising for written translation from Russian into English. In practice, it requires regular training and constant updating and replenishment of the glossary (voice input recording).
To use this technology effectively, the translator should be fluent in the subject matter of the text and have good sight-translation skills.
The translator needs to gradually get used to a
completely new and unusual form of information
input.
It may be advisable to use a combined method in translation, optimally combining different means of information input: through the voice channel, and, in difficult cases, with the help of the keyboard.
Tasks:
1. Name the well-known English speech recognition programs that you know.
2. Which of these programs is considered the best?
3. Are there any similar recognition systems for the Russian
language? What are some of these programs?
4. What features does the speech recognition program offer the
translator?
5. Do such programs have any limitations? Name which ones.
6. In what cases is it justified to use recognition systems?
7. What maximum typing speed for English text does a speech recognition program provide?
8. What do translators think about using such programs?
References:
1. Main literature
Shevchuk V.N. Electronic resources of a translator. – M.: Libright, 2015.
Palazhchenko P.R. My unsystematic dictionary. – M.: R. Valent, 2016.
Marchuk Yu.N. Translation: traditions and modern technology. – M.: VTsP, 2015.
Krupnov V.N. In the creative laboratory of a translator. – M.: International relations, 2015.
Nelyubin L.L. Computer linguistics and machine translation. – M.: VTsP, 2017.
Klimzo B.N. The craft of a technical translator. – M.: R. Valent, 2017.
2. Additional literature
Kakzhanova F.A. Practical course of the English language. – Almaty: Bastau, 2015. – 448 p.
Schweizer A.D. Through the eyes of a translator. – M.: Stella, 2016.
