PROJECT TITLE:
BUILDING AND TRAINING A SYSTEM TO GENERATE SUBTITLE USING
SPEECH RECOGNITION
BY
OWUSU ANSAH ASARE
FAS/3532/09
SUPERVISOR:
MR STEPHEN AKOBRE
JUNE 2013
ABSTRACT
Video sharpens the focus, creates rallying points, places the message in context, and builds a
safe place where viewers can be challenged on a thoughtful and heartfelt level. The demand
for text accompanying videos has grown over the past decade and a half. This is because
most viewers would like to understand the videos they watch, while others are deaf or have gaps in
the spoken language. Therefore, it is necessary to find solutions for making these
media artefacts accessible to most people. Several software packages offer utilities to create subtitles
for videos, but all require extensive participation of the user. Hence, a more automated
concept is envisaged.
This report describes a way to build and train a system to generate subtitles following
standards by using speech recognition. Three parts are distinguished. The first consists of
separating the audio from the video and converting the audio into a suitable format if necessary. The
second phase proceeds to the recognition of the speech contained in the audio. The final stage
generates a subtitle file from the recognition results of the previous step. Directions of
implementation have been proposed for the three distinct modules. The experimental results were
not fully satisfactory, and adjustments must be made in further work. Decoding
parallelization, the use of well-trained models, and punctuation insertion are some of the
improvements to be made.
DECLARATION
I hereby declare that the project entitled
Building and training a system to generate subtitle using speech recognition
is my own original work, carried out as a Bachelor of Science student's project at the University
for Development Studies under the supervision of Mr Stephen Akobre, except to the extent that
assistance from others in the project's design and conception or in style, presentation and
linguistic expression is duly acknowledged.
All sources used for the project have been fully and properly cited. It contains no material
which to a substantial extent has been accepted for the award of any other degree at the University
for Development Studies.
__________________________________
ASARE OWUSU ANSAH (FAS/3532/09)
___________________________
Date
CERTIFICATION
We certify that this work was carried out independently by ASARE OWUSU ANSAH
(FAS/3532/09) in the Department of Computer Science of the University for Development Studies
as part of the requirements for the award of a Bachelor of Science degree in Computer Science.
______________________________
Mr Stephen Akobre (Supervisor)
___________________________
Date
______________________________
Mr David Laar (Head of Department)
___________________________
Date
ACKNOWLEDGEMENT
I investigated this topic as part of my Bachelor of Science degree in Computer Science.
Dedication and intense research work have been determining factors in the writing of this
project report.
First of all, I give thanks to Almighty God for His guidance, knowledge and abundant favour
in the completion of this report.
I would like to thank Boris Guenebaut for the insight gained through his work
(Automatic Subtitle Generation for Sound in Videos). I would also like to thank the SourceForge.net
and javaforums.org communities for giving insightful tips about technical concerns when
needed. Finally, I express my gratitude to my supervisor, Mr Stephen Akobre,
for offering valuable advice on the main issues of the project work.
TABLE OF CONTENTS
CHAPTER ONE
INTRODUCTION
1.0 BACKGROUND STUDY
1.1 SUITABLE PROGRAMMING LANGUAGES DISCUSSION
1.2 PROBLEM STATEMENT
1.3 OBJECTIVES
1.4 SIGNIFICANCE OF STUDY
1.5 SCOPE OF THE STUDY
1.6 LIMITATIONS OF THE STUDY
2.0 SUBTITLES
2.1 FUNCTIONS OF SUBTITLES
2.2 THE POSITIVE EFFECTS OF SUBTITLES
2.3 THE NEGATIVE EFFECTS OF SUBTITLES
2.4 PREVIOUS STUDIES ON SUBTITLES
3.0 INTRODUCTION
3.1 PLANNING
3.1.1 DATA COLLECTION
3.1.2 HARDWARE AND SOFTWARE REQUIREMENTS
3.2 ANALYSIS
3.3 DESIGN
3.3.1 AUDIO EXTRACTION
TABLE OF FIGURES
Figure 3.1 - Phases of methodology
Figure 3.2 - Steps of Project Design
Figure 4.1 - Activity Diagram for Audio Extraction
Figure 4.2 - Sphinx-4 Architecture
Figure 4.3 - Activity Diagram for Speech Recognition
Figure 4.4 - Activity Diagram for Subtitle Generation
Figure 4.5 - Class Diagram for Subtitle Generation
LIST OF TABLES
Table 3.1 - Hardware Requirements
LIST OF EQUATIONS
Equation 3.1 - Bayes' Theorem
CHAPTER ONE
INTRODUCTION
1.0 BACKGROUND STUDY
Video is one of the most used visual sources of entertainment. Videos are also used for keeping
historical records. The demand for visual content has grown over the past few years, and the
demand for text accompanying videos has grown with it. Some people have hearing problems
or are deaf and cannot grasp the meaning of visual content, and most subtitles are in a foreign
language, making them difficult for viewers to understand. Therefore, there is a need to find
a solution to this kind of problem faced by the public.
Several software packages offer utilities to create subtitles for videos, but all require extensive
participation of the user. Most websites provide subtitle databases for movies, but mostly
for popular movies rather than general visual content. Hence, a more automated concept is
envisaged. This project report indicates a way to generate subtitles following standards by
using speech recognition. Three parts are distinguished.
The first consists of separating the audio from the video and converting the audio into a suitable
format if necessary. The second phase proceeds to the recognition of the speech contained in the
audio. The final stage generates a subtitle file from the recognition results of the previous
step.
I. AUDIO EXTRACTION
The audio extraction routine is expected to return a suitable audio format that can be used by
the speech recognition module as pertinent material. It must handle a defined list of video and
audio formats, and it has to verify the input file so that it can evaluate the feasibility of the
extraction. The audio track has to be returned in the most reliable format.
II. SPEECH RECOGNITION
The speech recognition routine is the key part of the system. Indeed, it directly affects
performance and the evaluation of results. First, it must get the type (film, music, information,
home-made, etc.) of the input file as often as possible. Then, if the type is provided, an appropriate
processing method is chosen. Otherwise, the routine uses a default configuration. It must be
able to recognize silences so that text delimitations can be established.
III. SUBTITLE GENERATION
The subtitle generation routine aims to create and write a file containing multiple chunks
of text corresponding to utterances delimited by silences, together with their respective start and
end times. Time synchronization considerations are of main importance.
Java provides an API allowing developers to deal with media tracks (audio and video) in an efficient way. The
Java Media Framework (JMF) API enables audio, video and other time-based media to be
added to applications and applets built on Java technology. This optional package, which can
capture, play back, stream, and transcode multiple media formats, extends the Java 2 Platform,
Standard Edition (J2SE) for multimedia developers by providing a powerful toolkit to develop
scalable, cross-platform technology. Consequently, Java seemed to be the most suitable
language.
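To make this concrete, the following is a minimal sketch of audio extraction with JMF. It is an assumption based on the standard JMF transcoding pattern rather than code from this project; the file names are placeholders, and the polling loops stand in for the ControllerListener a production version would use.

import javax.media.*;
import javax.media.control.TrackControl;
import javax.media.format.AudioFormat;
import javax.media.protocol.FileTypeDescriptor;

// Hypothetical sketch: extract the audio track of a video into a WAV file.
public class AudioExtractorSketch {
    public static void main(String[] args) throws Exception {
        Processor p = Manager.createProcessor(new MediaLocator("file:input.avi"));
        p.configure();
        while (p.getState() < Processor.Configured) Thread.sleep(50); // crude wait
        // Keep only the audio tracks; disable everything else (e.g. video).
        for (TrackControl tc : p.getTrackControls())
            tc.setEnabled(tc.getFormat() instanceof AudioFormat);
        p.setContentDescriptor(new FileTypeDescriptor(FileTypeDescriptor.WAVE));
        p.realize();
        while (p.getState() < Processor.Realized) Thread.sleep(50); // crude wait
        // Write the processor's output stream to a WAV file.
        DataSink sink = Manager.createDataSink(p.getDataSource(),
                new MediaLocator("file:output.wav"));
        sink.open();
        sink.start();
        p.start();
    }
}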
using speech recognition have not been developed. Therefore, it seems necessary to start
investigations on this concept.
1.3 OBJECTIVES
I. The purpose of this study is to come up with a method of generating subtitles automatically for viewers.
II.
III.
IV. To make the system able to accept audio files and produce a subtitle file.
V. Another aim of this project is to generate a subtitle file so that it is available for translation into other languages.
I. Time constraints
II. Insufficiency of funds and inflation of the prices of goods and services
III.
IV.
CHAPTER TWO
LITERATURE REVIEW
2.0 SUBTITLES
Video is an electronic medium for recording, copying and broadcasting moving visual
images. The demand for visual content has grown over the last decade and a half. Video
is one of the most used visual sources of entertainment. Video sharpens the focus, creates
rallying points, places the message in context, and builds a safe place where viewers can be
challenged on a thoughtful and heartfelt level. Videos are also used to keep historical records.
Films started without sound, and not long afterwards many efforts were made
to convey the actors' message to the audience. These efforts produced what we now call
intertitles: texts, drawn or printed on paper, filmed and placed between scenes of the
film. The original titles were removed, translated, filmed and re-inserted, and sometimes a speaker
even gave a simultaneous interpretation in another language.
In 1909, a projection method using a sciopticon (a kind of slide projector) to
show the subtitles on the screen came into use. According to Gottlieb (2002), M.N. Topp
registered a patent for this device for the rapid showing of titles for moving pictures. From
1927 onwards, sound films were made, so the titles inserted between scenes were removed.
The idea of making versions in several languages arose, but the efforts were highly cost-intensive.
So the attempts at subtitling started with manual projection of slides with printed texts
directly onto the screen. A frame containing the title was kept in position while the film negative
and positive print strip were fed forward and exposed. Later, the process was automated
by inserting exposed blank frames between the title frames; the titles were fed forward
by means of a counter to ensure subtitles were of the right length and in the right place. But film
negatives were difficult to obtain, and large quantities of negatives were needed to publish
many copies of a film. Therefore, the titles were photographed onto a separate film of the same length
as the original, synchronized with the sound. The negative and the roll of titles were then copied
and displayed simultaneously.
In 1930, a Norwegian inventor, Leif Eriksen, registered a patent for a method of stamping titles
directly onto the images on the film strip. He was followed by a Hungarian inventor, O. Turchányi, who
used high temperature to melt away the emulsion on the film, without the need for the softening bath
used by Eriksen, and patented his method in 1935.
Laser subtitling, according to Ivarsson (2004), developed by Denis Auboyer, is the latest
development; it uses lasers to burn away or vaporize the emulsion and has
been in commercial use since 1988. The titles themselves are computer-typeset and can be
shown on a video display by means of time coding or frame counting. Laser subtitling is cheaper
but requires costly investment in equipment.
France and Spain were among the first to decide on dubbing, while other European countries had
their movies subtitled.
Not only that, subtitling also serves to reduce production costs, because of the high
costs of dubbing. It is true that dubbed films distract viewers' attention to
a lesser extent, and the dialogues are much more understandable. However, dubbing is less favoured
because many viewers complain that it is not natural to watch a movie with someone else
talking for another person. In addition, dubbing costs more than subtitling because
several people must be hired to lend their voices in the process of dubbing, and the
process requires a lot of time too.
Subtitles are also used to maintain the originality of the audio and the voices of the
original professional cast. Subtitles that appear in time with each line spoken by the cast
do not prevent the viewers from listening to the real voices of the cast.
Moreover, in dubbing, the people involved sometimes do not reproduce
the correct intonation and rhythm of the original cast. This is quite distracting for the
audience when they watch the movie, because the environment in the movie does not fit
the incorrect intonation of those voices.
recognition problems could understand what they are watching with the aid of subtitles.
Television teletext subtitles are hidden unless requested by the viewer from a menu. This
type of subtitle always comes with additional sound representations for deaf and hard-of-hearing
viewers. Furthermore, the teletext subtitle language follows the original audio, except in
multi-lingual countries where the broadcaster may provide subtitles in additional languages on
other teletext pages.
In some East Asian countries, such as China, Korea and Japan, subtitling is common in some
genres of television. In these languages, written text is less ambiguous than spoken text, so
subtitling may offer a distinct advantage to aid comprehension. For example, although people
in China generally speak Putonghua, the standard spoken language, different speakers have
different accents due to their native dialects; subtitles bridge this gap, as most Chinese
speakers understand the one standard form of written Chinese.
According to Van Der Kamp (2007), subtitled movies have been found to improve reading
skills, especially for children. Subtitled movies enable people to read automatically and
subconsciously while watching. To comprehend subtitled movies, audiences
need to read the lines quickly and accurately, which indirectly improves literacy skills.
One of the most important roles of subtitles is to help people who want to learn a
language, whether their mother tongue or a foreign language. Research shows that
young people in Hungary watch subtitled movies to learn languages.
People who are learning a foreign language may sometimes use same-language subtitles to
better understand the dialogue without having to refer to a translation. For example, a person
who is learning English could boost his or her language skills by watching English
movies with English subtitles. However, this way of learning only suits beginners, because
learners cannot develop their language skills if they keep looking at the
subtitles. Therefore, once they have improved, they are encouraged to watch movies without
the aid of subtitles.
Besides, subtitles help a lot in scenes where the actor or actress speaks in a low tone,
for example when whispering. The audience often cannot make out the words
spoken, especially during the romantic scenes of a movie or, even worse, when the
background sounds are too loud (Anonymous, 2008). Reading the subtitles is then an
alternative way for the audience to understand the film and, most importantly, to enjoy it.
In conclusion, subtitles bring many advantages to the audience in terms of language learning,
reading and translating.
language may contain verbal padding or culturally implied meanings, which are confusing
and not usually carried over into written subtitles. Inaccurate translation in subtitling can
harm those who use subtitling as a method of foreign language learning, because they end up
referring to low-quality sentence structures and poor grammatical references.
It is undeniable that subtitling does improve language learning skills such as reading and
writing, but it also stands to reason that subtitling weakens listening skills.
When people keep referring to the subtitles, they exercise their listening less. In some cases, people
tend to read what is written at the bottom of the screen rather than listen to what the speaker says.
This does not happen only to people watching foreign movies, documentaries or
the like; even when watching a television programme in their mother tongue, they prefer
to read the subtitles. For language learners, this is not good and will not help their
language learning either.
Other than that, some people find that reading and interpreting the on-screen stimuli at the same
time can be annoying, regardless of the type of programme. For example, in movies, the subtitles
may take away from the movie experience because the audience is reading, not
viewing. Some people prefer good, well-subtitled foreign-language movies to equivalent
same-language-subtitled movies: by reading the dialogue instead of hearing it, there is more
room for interpretation, creating a movie that means more to certain viewers. In this case,
the subtitles should be of high quality, with good grammar and interpretation, as well as
precise.
To conclude, subtitles do not necessarily harm audiences. It depends on the individual's
perception and how he or she deals with them. An optimist can surely see the positive
side of providing subtitles and make good use of them. The important role should therefore be
played by the media in providing high-quality subtitles, both to assist language learners
and to satisfy movie lovers.
First, the participants were given a synopsis of the earlier portion of Rear
Window; since they would be watching only the last 40 minutes of the film, they were asked to
read the synopsis carefully. Participants were then asked to write down any thoughts that came
to mind while watching the film, and were instructed to stop the movie whenever
they had thoughts about what was going to happen or what had happened earlier in the film. These
notes were gathered as the inferences. The inferences were categorized as either backward (focused
on previous information), forward (focused on future information), or current (focused on
information in the current scene).
After finishing the film, the participants were asked whether they had ever seen the film before
this occasion. They were then debriefed and dismissed; participants were run individually,
and each session took less than 60 minutes.
Of all the data gathered, only 28 subjects' inferences were retained, because one subject
did not follow the instructions and the remaining five had watched the movie before.
To compare inference generation between the two groups, three categories were
defined according to the source of the inference. When an inference concerned
a future scene, it was categorized as a forward inference. When an inference was
generated using either earlier information or general knowledge, it was considered a
backward inference. Lastly, when an inference was generated using information within
the current scene, it was considered a current inference.
Analysis of the gathered inferences shows that participants generated more current inferences
for the foreign-language film with subtitles. In contrast, participants generated more backward
inferences in the native-language condition. These results are consistent with the
idea that participants comprehended the subtitled foreign-language film less well, which
supports the assumption made by Mina Lee and David Roskos-Ewoldsen.
Another study of subtitles was done by Maria Bernschütz, a PhD holder in marketing from the
Corvinus University of Budapest, Hungary. In her research, she studied the attitude of
young people towards subtitled movies. The participants consisted of young people only, because it is
essential for them to be proficient in foreign languages, especially English. The sample comprised
413 third-year students of the Corvinus University of Budapest, Hungary;
63% were women and 37% men, and the average age of the group was 21 years.
They were given questionnaires on subtitled movies. The respondents had to answer the
question: for what types of programme would they recommend subtitles? The numbers of yes
answers were counted cumulatively. 86% of the respondents agreed that subtitles are
appropriate for movies in the cinema, while only 65% recommended subtitles for documentaries.
65% of the sample suggested subtitles for historical films, followed by comedies with 60% of
the sample.
From the results, it was found that Hungarian young people prefer subtitled movies to
dubbed movies, in order to preserve the originality of the movies. They also watch subtitled
movies mostly as an easy way of learning a foreign language.
CHAPTER THREE
METHODOLOGY
3.0 INTRODUCTION
This chapter covers a detailed explanation of the methodology that was used to make this
project complete and working well. The method is used to achieve the objectives of the project
and to accomplish a sound result. In order to evaluate this project, the methodology was
based on the System Development Life Cycle (SDLC); generally, three major steps were involved,
which are planning, implementing and analysis.
3.1 PLANNING
To identify all the information and requirements, such as hardware and software, planning must
be done in a proper manner. The planning phase has two main elements, namely data
collection and the hardware and software requirements.
3.1.1 DATA COLLECTION
Data collection is an important stage in any area of study. At this stage I planned the
project's resources and requirements, the literature studies, and a schedule for gathering further
material from libraries and the Internet.
Within the data collection period, I found studies about subtitle generation using speech
recognizers on the Internet and did some research related to the project. Once I got the
project manual, I tried to find out the hardware components, software materials and other
materials that would be needed for the project.
While planning, I did research related to the project, including the study of
speech recognition topics such as the building of language models from text corpora, the creation
of acoustic models, and data gathering.
B. SOFTWARE REQUIREMENTS
Several software applications can be used for the implementation of this project work, which
can be carried out on both Linux and Windows operating systems. The following are the basic
software applications used to ensure an effective implementation of subtitle generation using
speech recognition.
I. SYSTEM SOFTWARE: Ubuntu 13.04 DVD 64-bit (Linux operating system)
II.
3.2 ANALYSIS
Analysis is the process of collecting factual data, understanding the processes involved,
identifying problems, and recommending feasible suggestions for improving the functioning
of the system. This includes sub-dividing the complex processes involving the entire
system, and identifying the data stores and manual processes. The major objectives of
systems analysis are to find answers for each business process: What is being done?
How is it being done? Who is doing it? When is it being done? Who is going to use the
system? Why is it being done? How can it be improved?
3.3 DESIGN
A media file (either a video or directly an audio file) is given as input. The audio track is extracted
and then read chunk by chunk until the end of the track is reached. Within this loop, three tasks
happen successively: speech recognition, time synchronization, and subtitle generation. Finally, a
subtitle file is returned as output.
Following this theorem, it is observed that P(acoustics) is a constant for any given sequence of
acoustic data and can thereby be ignored. P(word) represents the probability of the word
according to a specific language model.
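The equation itself (Equation 3.1) is not reproduced in this copy of the report; the standard form of Bayes' theorem assumed by the discussion above is:

\[ P(\text{word} \mid \text{acoustics}) = \frac{P(\text{acoustics} \mid \text{word})\, P(\text{word})}{P(\text{acoustics})} \]

The decoder therefore searches for the word sequence that maximizes P(acoustics | word) P(word), the product of the acoustic model score and the language model score.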
Even though word recognition has reached a very satisfying rate of 80-90%, grammar still needs
multiple enhancements. Punctuation determination involves distinguishing stressed syllables or
words in an utterance. In a natural conversation, it is straightforward to tell whether the speaker
affirms (.), exclaims (!) or interrogates (?). Nonetheless, it is much more difficult for
speech recognition systems to tell the difference; they usually only recognize which word
was pronounced. Therefore, stresses and intonations are rarely detected, and punctuation
remains a manual process.
A. ACOUSTIC MODEL
We now introduce the concept of an acoustic model. It is a file containing a statistical
representation of the distinct sounds that make up each word in the language model. We
differentiate two kinds of acoustic model:
I. Speaker-dependent acoustic model: an acoustic model designed to handle a specific
individual's speech. This kind of model is generally trained using audio from the
person concerned.
II. Speaker-independent acoustic model: a model meant to recognize speech from
different people, especially those who did not participate in its training. Obviously,
a speaker-independent acoustic model requires much more speech audio for training
to provide correct results.
C. DECODER
The following definition for Decoder comes from the VoxForge website Software program
that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent
sounds. When a match is made, the Decoder determines the phoneme corresponding to the
sound. It keeps track of the matching phonemes until it reaches a pause in the users speech. It
then searches the Language Model or Grammar file for the equivalent series of phonemes. If a
match is made it returns the text of the corresponding word or phrase to the calling program.
D. SPHINX-4
Sphinx-4 is an open source project led by Carnegie Mellon University, Sun Microsystems Inc.
and Mitsubishi Electric Research Laboratories. It is completely written in Java. It offers a
highly modularized and flexible architecture as well as versatile APIs, supports any acoustic
model structure, handles most types of language models, provides new algorithms in order to
get word level hypotheses, and accepts multimodal inputs. This will be our main speech
recognition tool to implement our project.
3.3.4 CODING
Java is our preferred programming language for coding the project because it offers an
intuitive syntax, portability across multiple operating systems and reliable libraries. The absence
of pointers, the efficient garbage collector and run-time compilation play a key role in Java.
Besides, Sphinx-4, our speech recognition tool, is written entirely in Java.
II. Coding and programming: Java was the preferred programming language used to code
our program. C++ could also be used, but Java is preferred because it can deal directly
with media files: the Java Media Framework (JMF) API allows developers to deal with
media tracks (audio and video) in an efficient way.
III. System analysis: for a problem to be solved with maximum output, system analysis
is needed. A well-structured technique was used for handling the problem of subtitle
generation, which led to an efficient allocation of resources to meet well-defined goals
and objectives.
IV. Flow charts
3.4 IMPLEMENTATION
Implementation is the carrying out or execution of a plan, a method or any design for doing
something. As such, implementation is the action that must follow any preliminary thinking in
order for something to actually happen. In project management terminology, it refers to the
process of carrying out the project plan by performing the activities included therein. The major
steps involved in this phase are:
I. Acquisition and Installation of Hardware and Software
II. Construction of the subtitle generation system using speech recognition
III. Documentation
3.4.3 DOCUMENTATION
Documentation is the presentation of the project work in the form of documents.
3.5 SUMMARY
In summary, every project has its own methodology that is used to make the
project successful and work well. Generally, the methodology here is divided into three parts:
planning, implementing and analysis. The planning phase includes reading
activities and the task of determining the hardware and software requirements.
In the reading activities, I did research using several sources, such as textbooks, journals, paper
references and the Internet, to get information related to the project.
For the hardware and software requirements, I studied the functions and
operation of the relevant hardware and software.
The next step is the implementing phase, where the subtitle is generated into the .srt file format.
With appropriate steps and methodology, the process of completing the project can be managed
wisely and will produce a good result.
CHAPTER FOUR
PROPOSED SYSTEM
4.0 INTRODUCTION
This chapter presents the procedural nature of the project work. We now offer a deeper
analysis in order to clearly measure the real needs. Java was chosen as the preferred
programming language to implement the project, as it has components that were found to be
essential for speech recognition. The subtitle generation will use the different symbols and tags
produced by the speech recognizer in order to generate SRT files.
recognition systems in which they were integrated. It is therefore natural that we opted for the
Sphinx-4 decoder. Sphinx-4 has been entirely written in Java, making it totally portable, and
provides a modular architecture, allowing configurations to be modified with ease. The architecture
overview is shown below:
An audio file and some parameters are given as arguments to the module. First, the audio file
is checked: if its format is valid, the process continues; otherwise, an exception is thrown and
the execution ends. According to the category (potentially amateur, movie, news, series, music)
given as an argument, the related acoustic and language models are selected. Some alterations are
made to the Sphinx-4 configuration based on the set parameters. Then, all the components
used in the ASR process are allocated the required resources. Finally, the decoding phase takes
place and the results are periodically saved to be reused later.
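As an illustration of this flow, here is a minimal sketch of driving Sphinx-4 from Java. It follows the standard Sphinx-4 transcription pattern; the configuration file name, audio file name and component names are assumptions rather than this project's actual code, and the two-argument getTimedBestResult is the form found in the 1.0beta releases.

import java.io.File;
import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class RecognizerSketch {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration chosen for the file's category.
        ConfigurationManager cm =
                new ConfigurationManager(new File("config.xml").toURI().toURL());
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate(); // allocate resources for all ASR components
        // Point the front end at the extracted WAV file.
        AudioFileDataSource source =
                (AudioFileDataSource) cm.lookup("audioFileDataSource");
        source.setAudioFile(new File("output.wav").toURI().toURL(), null);
        // Decode utterance by utterance, saving each timed result.
        Result result;
        while ((result = recognizer.recognize()) != null) {
            System.out.println(result.getTimedBestResult(false, true));
        }
        recognizer.deallocate();
    }
}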
In the case of a SIL utterance, we check whether the current line is empty: if not, the end time of
the current line is set to the speech time of the SIL; if the line is empty, we ignore the SIL utterance.
Once the list has been traversed, the file is finalized and released to the user.
However, we face some limitations. Indeed, our system will not be able to determine punctuation,
since that involves much more speech analysis and a deeper design.
4.2 IMPLEMENTATION
4.2.1. GENERAL SETTINGS OF THE SYSTEM
The implementation of the system was realized on a personal computer with at least the
following specifications:
Table 4.1 - Hardware Requirements
This segment provides guidelines about the use of the peripheral software required to make the
subtitle generation by speech recognition run.
Apache Ant is needed in the general setting of the system because it works together with
Java; both Java and Apache Ant are needed for the Sphinx-4 speech recognition tool to work. First
of all, it is required to download the binary distribution. The next step consists of extracting
the root directory and copying it to the target location. Then, the ANT_HOME environment
variable should be set to the directory where Ant is installed, and the bin directory must be added
to the user's path. The implementation and installation of external software were done solely
on Linux distributions.
We download the binary distribution of Apache Ant from http://archive.apache.org/dist/ant/
We then uncompressed the downloaded file into the directory /usr/local/.
We set the environment variable JAVA_HOME to the Java installation directory and ANT_HOME
to the directory Ant was uncompressed to, and added ${ANT_HOME}/bin to the PATH:
export ANT_HOME=/usr/local/ant
export JAVA_HOME=/usr/lib/jvm/Oracle-java-8
export PATH=${PATH}:${ANT_HOME}/bin
Finally, we checked that Ant was installed:
ant -version
C. AUDIO EXTRACTION
This stage is where we figure out how to convert the extracted audio file into a format recognized
by Sphinx-4. We did not obtain the expected result using the Java Media Framework.
Consequently, we based this part of the system on the VLC media player. The trick
thus consists of a few straightforward steps:
I.
II.
III. Select the media file - several files of the same directory can be selected at a time.
IV.
V. In Audio codec, check Audio, choose WAV as the Codec, set the Bitrate to 256 kb/s
and set Channels to 1.
VI.
VII.
D. SPEECH RECOGNITION
I. SPHINX-4
Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java
programming language. It was created via a joint collaboration between the Sphinx group at
Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research
Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California
at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT). Sphinx-4 started
out as a port of Sphinx-3 to the Java programming language, but evolved into a recognizer
designed to be much more flexible than Sphinx-3, thus becoming an excellent platform for
speech research. CMU Sphinx is one of the most popular speech recognition applications for
Linux and it can correctly capture words. It also gives the developers the ability to build speech
systems, interact with voice and build something unique and useful.
Based on the official manual of Sphinx-4, we now explain the different phases we went through
in order to obtain a fully operational speech recognition system on a Linux operating system
(Ubuntu 13.04):
Sphinx-4 has two packages available for download:
sphinx4-{version}-bin.zip: provides the jar files, documentation and demos.
sphinx4-{version}-src.zip: provides the sources, documentation, demos, unit tests and
regression tests.
After downloading the distribution, we unpacked the ZIP files into /usr/local/ using the jar
command found in the bin directory of the Java installation:
jar xvf sphinx4-{version}-bin.zip
jar xvf sphinx4-{version}-src.zip
The next step was to build the Sphinx-4 installation. Before that, we needed to install sharutils,
since it is required for the installation of the Java Speech Application Programming Interface
(JSAPI):
sudo apt-get install sharutils
We then installed the Java Speech Application Programming Interface by opening the lib directory
in the sphinx4 folder and typing the following in the terminal:
sh jsapi.sh
For Sphinx-4 to work, we needed to set the required environment variables: JAVA_HOME to the
location of the JDK, ANT_HOME to the location of Ant, and PATH to include the bin subfolders
of both the JDK and Ant:
export ANT_HOME=/usr/local/ant
export JAVA_HOME=/usr/lib/jvm/Oracle-java-8
export PATH=${PATH}:${ANT_HOME}/bin
We then built the distribution and its documentation by typing the following:
ant
ant clean
ant javadoc
ant -Daccess=private javadoc
utility to generate DMP format models: http://cmusphinx.sourceforge.net/webpage/html/download.php#utilities#
Unix commands used: gawk, uniq, mv, rmdir, rm. All commands should be in your path.
bn=weather.txt
We want a closed-vocabulary language model, so we use extractVocab to extract just the
sentences that entirely match our vocabulary:
gawk -f extractVocab.awk weather.vocab weather.txt > $bn.tmp.closed
We generate the 'test' file that can be used by the live decoder as the prompt for the user,
eliminating adjacent duplicate entries:
gawk -f genTranscript.awk < $bn.tmp.closed > weather.transcript
We then generate the word frequencies:
text2wfreq < $bn.tmp.closed > $bn.tmp.wfreq
We generate the vocabulary (this should be a subset of weather.vocab):
wfreq2vocab < $bn.tmp.wfreq > $bn.tmp.vocab
We generate the idngram:
text2idngram -vocab $bn.tmp.vocab < $bn.tmp.closed > $bn.tmp.idngram
Finally, we generate the language model.
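The command itself falls on a page missing from this copy of the report. With the CMU-Cambridge SLM toolkit used above, the step that typically follows is idngram2lm; the flags below are an assumption, not the original command:

idngram2lm -vocab_type 0 -idngram $bn.tmp.idngram -vocab $bn.tmp.vocab -arpa $bn.arpa

The resulting ARPA file can then be converted to the DMP format expected by Sphinx with the lm3g2dmp utility mentioned earlier.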
follow the Robust group's Open Source Tutorial.
associate this sequence with the corresponding speech signal. The trainer then looks into a
dictionary which maps every word to a sequence of sound units, to derive the sequence of
sound units associated with each signal. Thus, in addition to the speech signals, you will also
be given a set of transcripts for the database (in a single file) and two dictionaries: one in which
legitimate words in the language are mapped to sequences of sound units (or sub-word units), and
another in which non-speech sounds are mapped to corresponding non-speech or speech-like
sound units. We will refer to the former as the language dictionary and the latter as the filler
dictionary.
In summary, the components provided to you for training will be:
1. The trainer source code
2. The acoustic signals
3. The corresponding transcript file
4. A language dictionary
5. A filler dictionary
B. COMPONENTS PROVIDED FOR DECODING
The decoder also consists of a set of programs, which have been compiled to give a single
executable that will perform the recognition task, given the right inputs. The inputs that need
to be given are: the trained acoustic models, a model index file, a language model, a language
dictionary, a filler dictionary, and the set of acoustic signals that need to be recognized. The
data to be recognized are commonly referred to as test data.
I. asg-am.dic: includes each word and the phonemes that make up each word.
II. asg-am.filler: includes filler sounds, generally for <s>, <sil> and </s>.
III. asg-am.phone: includes the phonemes that are part of the training set (it must not
contain unused phonemes).
IV. asg-am_train.fileids: contains the names of the audio files of the asg-am/wav folder,
without extension - one on each line, in the same order as the audio files.
V. asg-am_train.transcription: contains the transcription of each audio file - one on each
line, in the same order as the audio files, surrounded by the markers <s> and </s>.
VI. feat.params and sphinx_train.cfg: both generated by SphinxTrain.
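For illustration, hypothetical contents of the fileids and transcription files might look as follows; the utterance names and sentences are invented, not taken from the project's corpus:

asg-am_train.fileids:
utt_0001
utt_0002

asg-am_train.transcription:
<s> HELLO COMPUTER </s> (utt_0001)
<s> GENERATE THE SUBTITLE FILE </s> (utt_0002)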
After adjusting the configuration in sphinx_train.cfg, the model creation can take place, first by
generating the feature files from the WAV files and then by running all the Perl scripts to
complete the process. Here are the two successive commands used:
.../asg-am>./scripts_pl/make_feats.pl -ctl etc/asg-am_train.fileids
.../asg-am>./scripts_pl/RunAll.pl
E. SUBTITLE GENERATION
Based on the analysis and design realized previously, a module generating subtitles in the SRT
format has been implemented in Java. To do so, we created three classes: TimedToken, Subtitle
and SubtitleGenerator. The Subtitle class encapsulates a subtitle in SRT format. The
SubtitleGenerator class provides static methods to create Subtitle instances using Result objects
from the Sphinx-4 apparatus. The method getTimedBestResult() from Result is used to
retrieve both tokens and times in string form. The string is then parsed, and each token and its
times are used as input to instantiate a TimedToken. From this point, a list of TimedToken
objects is available. Finally, the list is traversed and different operations are made according
to the type of token. Each subtitle text is delimited by two silences.
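The report does not reproduce the source of these classes, so the sketch below is a reconstruction under stated assumptions: the class names follow the description above, but the fields, method signatures and the "<sil>" silence label are guesses. It shows the silence-delimited grouping and the SRT timestamp format (HH:MM:SS,mmm).

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the subtitle generation module.
class TimedToken {
    final String text;       // recognized word, or "<sil>" for a silence
    final double start, end; // token times in seconds
    TimedToken(String text, double start, double end) {
        this.text = text; this.start = start; this.end = end;
    }
    boolean isSilence() { return "<sil>".equals(text); }
}

class Subtitle {
    final StringBuilder text = new StringBuilder();
    double start, end;
    boolean isEmpty() { return text.length() == 0; }
    // SRT timestamps have the form HH:MM:SS,mmm
    static String stamp(double seconds) {
        int ms = (int) Math.round(seconds * 1000);
        return String.format("%02d:%02d:%02d,%03d",
                ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
    }
    String toSrt(int index) {
        return index + "\n" + stamp(start) + " --> " + stamp(end) + "\n"
                + text.toString().trim() + "\n";
    }
}

class SubtitleGenerator {
    // Group tokens into subtitles delimited by silences, as described above.
    static List<Subtitle> fromTokens(List<TimedToken> tokens) {
        List<Subtitle> subs = new ArrayList<>();
        Subtitle current = new Subtitle();
        for (TimedToken t : tokens) {
            if (t.isSilence()) {
                if (!current.isEmpty()) { // a silence closes the current line
                    current.end = t.start;
                    subs.add(current);
                    current = new Subtitle();
                } // a silence before any word is simply ignored
            } else {
                if (current.isEmpty()) current.start = t.start;
                current.text.append(t.text).append(' ');
            }
        }
        return subs;
    }

    static void writeSrt(List<Subtitle> subs, String path) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            for (int i = 0; i < subs.size(); i++) {
                out.println(subs.get(i).toSrt(i + 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        List<TimedToken> demo = List.of(
                new TimedToken("HELLO", 0.0, 0.4),
                new TimedToken("WORLD", 0.5, 0.9),
                new TimedToken("<sil>", 0.9, 1.6));
        writeSrt(fromTokens(demo), "demo.srt");
    }
}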
F. SPHINX-4 TUNING
For the Sphinx-4 system to work, we needed a Sphinx-4 configuration. The configuration of a
particular Sphinx-4 system is determined by a configuration file, which defines
the following:
I. The names and types of all of the components of the system.
II. The connectivity of these components - that is, which components talk to each other.
III. The detailed configuration for each of these components.
I. SPHINX-4 XML-CONFIGURATION
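The configuration listing itself spans pages missing from this copy of the report. As a stand-in, here is a minimal illustrative fragment in the Sphinx-4 XML configuration style; the component names and the single global property are assumptions, not the project's actual configuration:

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- a global property -->
    <property name="logLevel" value="WARNING"/>
    <!-- the recognizer wires the decoder and monitors together -->
    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoder"/>
        <propertylist name="monitors"/>
    </component>
    <component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
        <property name="searchManager" value="searchManager"/>
    </component>
</config>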
CHAPTER FIVE
5.0 CONCLUSION AND RECOMMENDATION
5.1 CONCLUSION
We proposed a way to build and train a system to generate subtitles for the sound in videos using
speech recognition. A complete system including the three required modules, namely audio
extraction, speech recognition and subtitle generation, could not be realized, since the audio
conversion needed more resources. VLC gave an appropriate solution, but a custom component
coded in Java is expected in further work so that portability and installation of the system are
rendered uncomplicated. Nonetheless, the expected output for each phase has been reached.
The audio extraction module provides a suitable audio format to be used by the speech
recognition module. The latter generates a list of recognized words and their corresponding times
in the audio, although accuracy is not guaranteed. This list is used by the subtitle
generation module to create a standard subtitle file readable by the most common media players
available.
In recent years, the Internet has seen a multiplication of websites based on videos, most of
which come from amateurs and rarely have transcripts available. This work was
mostly orientated towards video media and suggested a way to produce a transcript of the audio from
a video, for the ultimate purpose of making content comprehensible to deaf persons. Although
the current system is not yet stable enough to be widely used, it proposes one
interesting approach that can certainly be improved. The main aim of this system is to build and
train a system that can be used to generate a subtitle file.
5.2 RECOMMENDATION
Larger language models and acoustic models were not easy to come by; they are accessible
only by purchase. My recommendation is that websites hosting these language models and
acoustic models should make them available to students, to make such work easier.
Concerning the subtitle generation module, we emphasized that the insertion of punctuation
is a complicated task for an automatic speech recognition apparatus. It
could be interesting to lead a study on this subject, because the output of an automatic
speech recognition system is generally a raw text, in lower case and without any
punctuation marks, while punctuation plays a significant role in the understanding of spoken
exchanges. Several methodologies should be considered, such as the use of transducers or
language model enhancements.
REFERENCES
1. Apache Ant 1.7.1 manual, 2013. URL: http://ant.apache.org/manual/index.html
2. B. Guenebaut (2009): Automatic Subtitle Generation for Sound in Videos.
3. Configuration management for Sphinx-4, 2013. URL: http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/docfiles/ConfigurationManagement.html
4. Engineered Station. How speech recognition works, 2013. URL: http://project.uet.itgo.com/speech.htm
5. Essays on subtitles, 2013. URL: http://www.ukessays.com/essays/english-language/ahistory-and-development-of-subtitles-english-language-essay.php
6. Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.
7. Gaupol, 2013. URL: http://home.gna.org/gaupol/
8. ImTOO DVD Subtitle Ripper, 2013. URL: http://www.imtoo.com/dvd-subtitleripper.html
9. Installing Java on Ubuntu, 2013. URL: http://ubuntuguide.net/install-oracle-java-jdk6-7-8-in-ubuntu-13-04
10. J.P. Lewis and Ulrich Neumann. Performance of Java versus C++, 2013. URL: http://idiom.com/zilla/Computer/javaCbenchmark.html
11. Jubler subtitle editor in Java, 2005. URL: http://www.jubler.org/
12. Language model (Weather language model), 2013. URL: http://code.metager.de/source/xref/cmusphinx/sphinx4/models/language/weather/
13. Robust group's open source tutorial: learning to use the CMU Sphinx automatic speech recognition system, 2013. URL: http://www.speech.cs.cmu.edu/sphinx/tutorial.html
14. Sphinx for the Java platform: architecture notes, 2013. URL: www.speech.cs.cmu.edu/sphinx/twiki/pub/Sphinx4/WebHome/Architecture.pdf
15. Sphinx-4: a speech recognizer written entirely in the Java programming language, 2013. URL: http://cmusphinx.sourceforge.net/sphinx4/
16. Sphinx-4, 2013. URL: http://cmusphinx.sourceforge.net/sphinx4/
17. Statistical language modelling toolkit, 2013. URL: http://www.speech.cs.cmu.edu/SLM/toolkit.html
18. Sphinx-4 Architecture, 2013. URL: www.speech.cs.cmu.edu/sphinx/twiki/pub/Sphinx4/WebHome/Architecture.pdf
19. Subtitle editor, 2013. URL: http://home.gna.org/subtitleeditor/
20. The CMU Sphinx group open source speech recognition engines, 2013. URL: http://cmusphinx.sourceforge.net/
21. Ubuntu 13.04, 2013. URL: http://www.ubuntu.com/download/desktop
22. VLC media player, 2013. URL: http://www.videolan.org/vlc/
23. VoxForge, 2013. URL: http://www.voxforge.org/home
24. Welcome to the Ant wiki, 2013. URL: http://wiki.apache.org/ant/FrontPage
25. Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel (2004): Sphinx-4: A flexible open source framework for speech recognition. SMLI TR2004-0811, Sun Microsystems Inc. URL: http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf
26. Xilisoft DVD Subtitle Ripper, 2013. URL: http://www.xilisoft.com/dvd-subtitleripper.html