
UNIVERSITY FOR DEVELOPMENT STUDIES

DEPARTMENT OF COMPUTER SCIENCE

PROJECT TITLE:
BUILDING AND TRAINING A SYSTEM TO GENERATE SUBTITLE USING
SPEECH RECOGNITION

BY
OWUSU ANSAH ASARE
FAS/3532/09

SUPERVISOR:
MR STEPHEN AKOBRE

2013

UNIVERSITY FOR DEVELOPMENT STUDIES


DEPARTMENT OF COMPUTER SCIENCE

PROJECT TITLE:
BUILDING AND TRAINING A SYSTEM TO GENERATE SUBTITLE USING
SPEECH RECOGNITION

BY
OWUSU ANSAH ASARE
FAS/3532/09

SUPERVISOR:
MR STEPHEN AKOBRE

A project submitted to the Department of Computer Science,


Faculty of Mathematical Sciences, University for Development
Studies in partial fulfilment of the requirements for the award of
Bachelor of Science degree in Computer Science

JUNE 2013

ABSTRACT
Video sharpens the focus, creates rallying points, places the message in context, and builds a
safe place where viewers can be challenged on a thoughtful and heartfelt level. The demand
for text accompanying videos has grown over the past decade and a half. This is because
most viewers would like to understand the videos they watch, while others are deaf or have gaps in
the spoken language. It is therefore necessary to find solutions that make these
media artefacts accessible to most people. Several software packages offer utilities to create subtitles
for videos, but all of them require extensive participation by the user. Hence, a more automated
concept is envisaged.
This report describes a way to build and train a system that generates subtitles, following
existing standards, by using speech recognition. Three parts are distinguished. The first consists of
separating the audio from the video and converting the audio into a suitable format if necessary. The
second phase performs the recognition of the speech contained in the audio. The final stage
generates a subtitle file from the recognition results of the previous step. Directions of
implementation have been proposed for the three distinct modules. The experimental results are
not yet satisfactory, and adjustments will have to be made in further work. Decoding
parallelization, the use of well-trained models, and punctuation insertion are some of the
improvements still to be done.


DECLARATION
I hereby declare that the project entitled,
Building and training a system to generate subtitle using speech recognition
is my own original work, carried out as a Bachelor of Science student's project at the University
for Development Studies under the supervision of Mr Stephen Akobre, except to the extent that
assistance from others in the project's design and conception or in style, presentation and
linguistic expression is duly acknowledged.
All sources used for the project have been fully and properly cited. It contains no material
which to a substantial extent has been accepted for the award of any other degree at the University
for Development Studies or any other educational institution, except where due
acknowledgement is made in the project.

__________________________________
ASARE OWUSU ANSAH

___________________________
DATE

FAS/3532/09


CERTIFICATION
We certify that this work was carried out independently by ASARE OWUSU ANSAH
(FAS/3532/09) in the Department of Computer Science of University for Development Studies
as part of the requirements for the award of a Bachelor of Science degree in Computer Science.

______________________________                    __________________________
Mr Stephen Akobre (Supervisor)                      Date

______________________________                    __________________________
Mr David Laar (Head of Department)                  Date


ACKNOWLEDGEMENT
I investigated this topic as part of my Bachelor of Science degree in Computer Science.
Dedication and intense research work have been determining factors in the writing of this
project report.
First of all, I give thanks to Almighty God for His guidance, knowledge and abundant favour
in the completion of this report.
I would like to thank Boris Guenebaut for the insight gained from his work
(Automatic subtitle generation in videos). I would also like to thank the SourceForge.net
community and javaforums.org for giving insightful tips about technical concerns when
needed. Finally, I would like to express my gratitude to my supervisor, Mr Stephen Akobre,
for offering precious advice on the main issues of the project work.


TABLE OF CONTENTS
CHAPTER ONE ........................................................................................................................ 1
INTRODUCTION ..................................................................................................................... 1
1.0 BACKGROUND STUDY ............................................................................................... 1
1.1 SUITABLE PROGRAMMING LANGUAGES DISCUSSION ..................................... 2
1.2 PROBLEM STATEMENT .............................................................................................. 3
1.3 OBJECTIVES .................................................................................................................. 4
1.4 SIGNIFICANCE OF STUDY ................................................................................... 4
1.5 SCOPE OF THE STUDY ................................................................................................ 5
1.6 LIMITATIONS OF THE STUDY ................................................................................... 5
2.0 SUBTITLES ..................................................................................................................... 6
2.1 FUNCTIONS OF SUBTITLES ....................................................................................... 7
2.2 THE POSITIVE EFFECTS OF SUBTITLES ................................................................. 8
2.3 THE NEGATIVE EFFECTS OF SUBTITLES ............................................................. 10
2.4 PREVIOUS STUDIES ON SUBTITLES ...................................................................... 12
3.0 INTRODUCTION .......................................................................................................... 15
3.1 PLANNING ................................................................................................................... 15
3.1.1 DATA COLLECTION ............................................................................................ 15
3.1.2 HARDWARE AND SOFTWARE REQUIREMENTS .......................................... 16
3.2 ANALYSIS .................................................................................................................... 17
3.3 DESIGN ......................................................................................................................... 18
3.3.1 AUDIO EXTRACTION .......................................................................................... 19

3.3.2 SPEECH RECOGNITION ...................................................................................... 19


3.3.3 SUBTITLE GENERATION.................................................................................... 21
3.3.4 CODING .................................................................................................................. 22
3.3.5 METHODOLOGIES USED.................................................................................... 22
3.4 IMPLEMENTATION .................................................................................................... 23
3.4.1 ACQUISITION AND INSTALLATION OF HARDWARE AND SOFTWARE . 23
3.4.2 CONSTRUCTION OF THE SUBTITLE GENERATION USING SPEECH
RECOGNITION ............................................................................................................... 23
3.4.3 DOCUMENTATION .............................................................................................. 24
3.5 SUMMARY ................................................................................................................... 24
4.0 INTRODUCTION .......................................................................................................... 25
4.1 SYSTEM OVERVIEW .................................................................................................. 25
4.1.1 AUDIO EXTRACTION .......................................................................................... 25
4.1.2 SPEECH RECOGNITION ...................................................................................... 26
4.1.3 SUBTITLE GENERATION.................................................................................... 29
4.2 IMPLEMENTATION .................................................................................................... 30
4.2.1. GENERAL SETTINGS OF THE SYSTEM .......................................................... 30
CHAPTER FIVE ..................................................................................................................... 46
5.0 CONCLUSION AND RECOMMENDATION ............................................................. 46
5.1 CONCLUSION .............................................................................................................. 46
5.2 RECOMMENDATION ................................................................................................. 47
REFERENCES ........................................................................................................................ 48

TABLE OF FIGURES
Figure 3. 1 - Phases of methodology ....................................................................................... 15
Figure 3. 2 - Steps of Project Design ....................................................................................... 18
Figure 4. 1 - Activity Diagram for Audio Extraction .............................................................. 26
Figure 4. 2 - Sphinx-4 Architecture ......................................................................................... 27
Figure 4. 3 - Activity Diagram for Speech Recognition .......................................................... 28
Figure 4. 4 - Activity Diagram for Subtitle Generation ........................................................... 29
Figure 4. 5 - Class Diagram for Subtitle Generation ............................................................... 41

LIST OF TABLES
Table 3. 1-Hardware Requirements ......................................................................................... 16

Table 4. 1-Hardware Requirements ......................................................................................... 30

LIST OF EQUATIONS
Equation 3. 1-Bayes' Theorem ................................................................................................. 19


CHAPTER ONE
INTRODUCTION
1.0 BACKGROUND STUDY
Video is one of the most widely used visual sources of entertainment. Videos are also used for
keeping historical records. The demand for visual content has grown over the past few years.
The demand for text accompanying videos has also grown considerably; apart from that,
some people have hearing problems or are deaf and cannot follow the meaning of the
visual content, and many subtitles are in a foreign language, giving viewers difficulty
in understanding them. Therefore, there is a need to find a solution to this kind of problem
faced by the public.
Several software packages offer utilities to create subtitles for videos, but all require extensive
participation by the user. Most websites provide subtitle databases for movies, but mostly
for popular movies and not for visual content in general. Hence, a more automated concept is
envisaged. This project report describes a way to generate subtitles, following existing standards,
by using speech recognition. Three parts are distinguished.
The first consists of separating the audio from the video and converting the audio into a suitable
format if necessary. The second phase performs the recognition of the speech contained in the
audio. The final stage generates a subtitle file from the recognition results of the previous
step.
I. AUDIO EXTRACTION

The audio extraction routine is expected to return a suitable audio format that can be used by
the speech recognition module as pertinent material. It must handle a defined list of video and
audio formats. It has to verify the file given as input so that it can evaluate the feasibility of the
extraction. The audio track has to be returned in the most reliable format.


II. SPEECH RECOGNITION

The speech recognition routine is the key part of the system. Indeed, it directly affects
performance and the evaluation of results. First, it must obtain the type (film, music, information,
home-made, etc.) of the input file whenever possible. Then, if the type is provided, an appropriate
processing method is chosen. Otherwise, the routine uses a default configuration. It must be
able to recognize silences so that text delimitations can be established.
III. SUBTITLE GENERATION

The subtitle generation routine aims to create and write a file containing multiple chunks
of text corresponding to utterances delimited by silences, together with their respective start and
end times. Time synchronization considerations are of main importance.

1.1 SUITABLE PROGRAMMING LANGUAGES DISCUSSION


There are several programming languages that could be used to build a subtitle generator based on
speech recognition. A quick overview on the Internet suggested focusing on C/C++ and Java.
On the one hand, C++ provides remarkable characteristics in terms of speed, cross-system
capabilities and well-tested packages. On the other hand, Java offers an intuitive syntax,
portability across multiple operating systems and reliable libraries. Both are used in the Sphinx
speech recognition engine.
Java's performance is now broadly comparable to that of C/C++, independently of the hardware.
The absence of pointers, the use of an efficient garbage collector and run-time compilation
play a key role in Java, and their continued enhancement may eventually allow it to go beyond C++
performance. In addition, Java offers the Java Media Framework (JMF) API, which allows developers
to deal with media tracks (audio and video) in an efficient way. The JMF API enables audio, video
and other time-based media to be added to applications and applets built on Java technology. This
optional package, which can capture, play back, stream and transcode multiple media formats,
extends the Java 2 Platform, Standard Edition (J2SE) for multimedia developers by providing a
powerful toolkit to develop scalable, cross-platform technology. Consequently, Java seemed to be
the most suitable language.

1.2 PROBLEM STATEMENT


Visual content (video) is one of the most popular types of multimedia content used on the Internet,
on PCs and in our homes. In the majority of cases, the sound within a video holds an important
place. From this observation, it appears essential to make the understanding of a video with sound
available to people with auditory problems as well as to people with gaps in the spoken language.
Most viewers would like to have a better understanding of the spoken words in videos, and many
viewers, especially those watching videos in a foreign language, would like to learn the language
of the foreign country. The most natural way lies in the use of subtitles. However, manual subtitle
creation is a long and tedious activity and requires the presence of the user.
Nowadays, there is plenty of software dealing with subtitle creation. Some programs work on
copyrighted DVDs by extracting the original subtitle track and converting it into a format
recognized by media players, for example IM-TOO DVD Subtitle Ripper and XILISOFT
DVD Subtitle Ripper. Others allow the user to watch the video and to insert subtitles using
the timeline of the video, e.g. Subtitle Editor and Subtitle Workshop. There are also subtitle
editors providing facilities to handle subtitle formats and to ease changes, for instance JUBLER
and GAUPOL. Nonetheless, software that generates subtitles using speech recognition, without
the intervention of an individual, has not been developed. Therefore, it seems necessary to start
investigations on this concept.

1.3 OBJECTIVES
I. The purpose of this study is to come up with a method of generating subtitles automatically
for viewers.
II. To build and train a system able to convert speech to text.
III. To help viewers generate subtitles automatically.
IV. To make the system able to accept audio files and produce a subtitle file.
V. Another aim of this project is to generate a subtitle file so that it is available for
translation into other languages.

1.4 SIGNIFICANCE OF STUDY


The demand for text accompanying videos has grown considerably; apart from that, some people
have hearing problems or are deaf and cannot follow the meaning of visual content. Therefore,
there is a need to find a solution to this kind of problem faced by the public.
Several software packages offer utilities to create subtitles for videos, but all require extensive
participation by the user. Most websites provide subtitle databases for movies, but mostly
for popular movies and not for visual content in general. Hence, a more automated concept is
envisaged.


1.5 SCOPE OF THE STUDY


The project work principally aims to answer the stated problem by presenting a potential system.
Three distinct modules have been defined, namely audio extraction, speech recognition, and
subtitle generation (with time synchronization). The system should take a video file as input
and generate a subtitle file (sub/srt/sst/son/txt) as output.
1.6 LIMITATIONS OF THE STUDY
In the course of the software design and development, some limitations were encountered.
Notable among them are:
I. Time constraints
II. Insufficiency of funds and inflation of the prices of goods and services
III. Unstable Internet connection
IV. Inaccessible language models for Sphinx to use to generate subtitles


CHAPTER TWO
LITERATURE REVIEW
2.0 SUBTITLES
Video is an electronic medium for recording, copying and broadcasting moving visual
images. The demand for visual content has grown over the last decade and a half. Video
is one of the most widely used visual sources of entertainment. Video sharpens the focus, creates
rallying points, places the message in context, and builds a safe place where viewers can be
challenged on a thoughtful and heartfelt level. Videos are also used to keep historical records.
The invention of film started without sound, and not long afterwards many efforts were made
to convey the message of the actors to the audience. These efforts produced what we now call
intertitles: texts, drawn or printed on paper, filmed and placed between scenes of the
film. The original titles were removed, translated, filmed and re-inserted, and sometimes a speaker
was even used to give a simultaneous interpretation in another language.
In 1909, a projectionist's method, using a sciopticon (somewhat like a slide projector) to
show the subtitles on the screen, came into use. According to Gottlieb (2002), M. N. Topp
registered a patent for this device for the rapid showing of titles for moving pictures. From
1927 onwards, sound films appeared, so the titles inserted between scenes were removed.
The idea of making versions in several languages arose, but the effort was highly cost-intensive.
So the attempts at subtitling started with the manual projection of slides with printed texts
directly onto the screen. A frame containing the title was kept in position while the film negative
and positive print strip were fed forward and exposed. Later, the process was automated
by inserting exposed blank frames between the title frames, and the titles were fed forward
by means of a counter to ensure subtitles were of the right length and in the right place. But film
negatives were difficult to obtain, and large quantities of negatives were needed to produce
a large number of copies of a film. Therefore, the titles were photographed onto a separate film of
the same length as the original, synchronized with the sound. The negative and the roll of titles
were then copied and displayed simultaneously.
In 1930, a Norwegian inventor, Leif Eriksen, registered a patent for a method of stamping titles
directly onto the images on the film strip. He was followed by a Hungarian inventor, O. Turchanyi,
who used high temperature to melt away the emulsion on the film, without the need for the softening
bath used by Eriksen, and patented his method in 1935.
Laser subtitling, developed by Denis Auboyer and described by Ivarrson (2004), is the latest
development; it involves the use of lasers to burn away or vaporize the emulsion. It has
been in commercial use since 1988. The titles themselves are computer-typeset and can be
shown on a video display by means of time coding or frame counting. Laser subtitling is cheaper
but requires costly investment in equipment.

2.1 FUNCTIONS OF SUBTITLES


Subtitles are textual versions of the dialogue in films and television programmes, usually displayed
at the bottom of the screen. They can either be a written translation of dialogue in a
foreign language or a written rendering of the dialogue in the same language, with or without
added information. Since their beginnings, subtitles have functioned in many ways.
Translated subtitles have been used in live broadcast programmes, where they help viewers
understand live television programmes in a foreign language. These subtitles are all the more
important because this type of programme is shown only once.
In addition, subtitles have also been used to translate foreign films. At the beginning of the 20th
century, the spreading use of mass media increased the demand for foreign films and news.
The translation fever started in the 1920s and 30s. In the era of talking pictures, Germany, Italy,
France and Spain were the first to decide on dubbing, while the other European countries had
their movies subtitled.
Furthermore, subtitling has also served to reduce production costs, because of the high cost of
dubbing. It is true that dubbed films distract viewers' attention to a lesser extent and that the
dialogue is more readily understandable. However, dubbing is less favoured because many viewers
complain that it is not natural to watch a movie with someone else speaking for another person.
Added to that, dubbing costs more than subtitling because several people have to be hired to
provide voices in the dubbing process, and the process of dubbing requires a lot of time too.
Subtitles are also used to preserve the original audio and the voices of the original, professional
cast of actors and actresses. Subtitles, which appear in time with every line said by the cast, do
not prevent the viewers from listening to the real voices of the cast.
Moreover, in dubbing, the people involved sometimes do not deliver the lines with the correct
intonation and rhythm of the original cast. This is quite distracting for the audience watching the
movie, because the atmosphere of the movie does not fit the incorrect intonation of those voices.

2.2 THE POSITIVE EFFECTS OF SUBTITLES


Subtitles are provided in various types of media such as television programmes, movies and
broadcasts. Subtitles help a lot, as they bring many advantages to the audience.
Subtitles may be in the same language as the media or may be translated into other languages,
mainly the viewers' specific mother tongues.
Subtitles assist audience members who are deaf or hard of hearing to follow the dialogue.
In addition, people who cannot understand the spoken language or who have trouble
recognizing accents can understand what they are watching with the aid of subtitles.
Television teletext subtitles are hidden unless requested by the viewer from a menu. This
type of subtitle usually comes with additional representations of sounds for deaf and
hard-of-hearing viewers. Furthermore, the teletext subtitle language follows the original audio,
except in multilingual countries, where the broadcaster may provide subtitles in additional
languages on other teletext pages.
In some East Asian countries, such as China, Korea and Japan, subtitling is common in some
genres of television. In these languages, written text is less ambiguous than spoken text, so
subtitling may offer a distinct advantage in aiding comprehension. For example, although people
in China generally speak Putonghua, the standard spoken language, different speakers have
different accents due to their native dialects, and subtitles bridge this gap because most Chinese
speakers understand the one standard form of written Chinese.
According to Van Der Kamp (2007), subtitled movies have been found to improve reading
skills, especially for children. Subtitled movies enable people to read automatically and
subconsciously while watching. In order to comprehend subtitled movies, audiences need to read
the lines quickly and accurately. Indirectly, this can improve literacy skills.
One of the most important roles of subtitles is to help people who want to learn a language,
whether their mother tongue or a foreign language. One study showed that young people in
Hungary watched subtitled movies in order to learn languages.
People who are learning a foreign language may sometimes use same-language subtitles to
better understand the dialogue while not having to refer to a translation. For example, a person
who is learning the English language could boost his or her language skills by watching English
movies with English subtitles. However, this way of learning only suits beginners, because
learners cannot develop their language skills further if they keep looking at the
subtitles. Therefore, once they have improved, they are encouraged to watch movies without
the aid of subtitles.
Besides, subtitles help a lot in scenes where the actor or actress has to speak in a low tone,
for example when whispering. The audience usually cannot make out the words spoken, especially
during a romantic scene of the movie or, even worse, when the background sounds are too loud
(Anonymous, 2008). In such cases, reading the subtitles is the alternative way for the audience
to understand the film and, most importantly, to enjoy it.
In conclusion, subtitles bring many advantages to the audience in terms of language learning,
reading and translation.

2.3 THE NEGATIVE EFFECTS OF SUBTITLES


Some viewers say that subtitles make it harder for them to understand the message of a film.
In other words, according to Tyler (2010), subtitles may create misunderstandings among
viewers when they interpret the film. We cannot deny that there are some mistakes in the
translations made by media corporations. This is because the translations are not always made
by people who are professionals in translating the foreign language and who know the culture
of the native audience. If the translations are not made properly and with high quality, they are
not going to have any positive effect, and people will keep complaining about them. Sometimes,
the translation used for subtitling can be very different from the translation of a written text.
Usually, a person making subtitles may or may not have access to the written transcript
of the dialogue. Therefore, in some cases, the person who makes the subtitles interprets what is
meant, rather than translating how it is said. In other cases, the subtitling is done word by word,
which confuses viewers trying to interpret what is meant by the whole sentence. This can be
frustrating for audience members who know some of the spoken language, because spoken
language may contain verbal padding or culturally implied meanings, which are confusing
words not usually carried over into written subtitles. Inaccurate translation in subtitling can
harm those who use subtitles as a method of foreign language learning, because they are
referring to low-quality sentence structures and poor grammar.
It is undeniable that subtitling improves many language learning skills such as reading and
writing, but it also stands to reason that subtitling can weaken listening skills.
When people keep referring to the subtitles, they exercise their listening skills less. In some
cases, people tend to read what is written at the bottom of the screen rather than listen to what
the speaker says. This does not only happen to people who watch foreign movies or documentaries;
even when they are watching a television programme in their mother tongue, they prefer
to read the subtitles. For language learners, this is not good and will not help them develop
their language skills.
Other than that, some people find that reading and interpreting what they see at the same
time can be annoying, regardless of the type of programme. For example, in movies, the subtitles
provided may take away from the movie experience because the audience is reading, not
viewing. Some people prefer good, well-subtitled foreign-language movies to equivalent
same-language-subtitled movies. By reading the dialogue instead of hearing it, there is more
room for interpretation, creating a movie that means more to certain viewers. In this case,
the subtitles should be made with high quality, good grammar and interpretation, as well as
precision.
To conclude, subtitles do not necessarily harm the audience. It depends on individual
perception and on how each person deals with them. An optimistic person can surely see the
positive side of providing subtitles and would use them to his or her benefit. Therefore, the
important role should be played by the media in providing high-quality subtitles, with the purpose
of assisting language learners and also giving satisfaction to movie lovers.

2.4 PREVIOUS STUDIES ON SUBTITLES


A journal article entitled "Foreign Subtitles Help but Native-Language Subtitles Harm Foreign
Speech Perception" reports research done by Holger Mitterer, from the Max Planck Institute for
Psycholinguistics, Nijmegen, The Netherlands, and James M. McQueen from the Behavioural
Science Institute and the Donders Institute for Brain, Cognition and Behaviour, Centre for
Cognition, Radboud University Nijmegen, Nijmegen, The Netherlands.
121 participants from the subject pool of the Max Planck Institute for Psycholinguistics
took part in the experiment. The participants were native speakers of Dutch studying at the
Radboud University Nijmegen, with a good command of spoken and written English. They had
not been to Scotland or Australia for longer than two weeks, so they were unfamiliar with Scottish
and Australian English.
The objective of the research was to investigate whether subtitles, which provide lexical
information, support perceptual learning about foreign speech.
Another study, by Mina Lee and David Roskos-Ewoldsen (2004), investigated the different patterns
of inferences generated by participants watching a native-language film versus a foreign-language
film with subtitles. It was assumed that reading subtitles has a detrimental effect on inference
generation. In the study, they used two versions of the film Rear Window (Alfred Hitchcock, 1954):
one in English and the other dubbed in French and subtitled in English. 34 participants were
selected. The participants were recruited from an introductory mass communication class or a mass
communication law and regulation class.

In the first procedure, the participants were given a synopsis of the earlier portion of Rear
Window. They would be watching only the last 40 minutes of the film, so they were asked to
read the synopsis carefully. Participants were then asked to write down any thoughts coming to
mind while watching the film. They were instructed to stop the movie whenever they had thoughts
about what was going to happen or what had happened earlier in the film. These thoughts were
gathered as the inferences. The inferences were categorized as either backward (focused on
previous information), forward (focused on future information) or current (focused on
information in the current scene).
After finishing the film, the participants were asked whether they had ever seen the film before
this occasion. They were then debriefed and dismissed, and their data were processed individually;
each session took less than 60 minutes.
Of all the inferences gathered, data from only 28 of the subjects were used, because one participant
did not follow the instructions and the remaining five had watched the movie before.
To compare inference generation between the two groups, three categories were used, depending on
the source of the inference. When an inference was made about a future scene, it was categorized
as a forward inference. When an inference was generated by using either earlier information or
general knowledge, it was considered a backward inference. Lastly, when an inference was generated
by using current information within the scene, it was considered a current inference.
The analysis of the gathered inferences shows that participants generated more current inferences
for the foreign-language film with subtitles. In contrast, more backward inferences were generated
by the participants in the native-language condition. These results are consistent with the
idea that participants had less comprehension of the foreign-language film with subtitles, which
supports the assumption made by Mina Lee and David Roskos-Ewoldsen.


Another study of subtitles was done by Mária Bernschütz, who holds a PhD in marketing from the
Corvinus University of Budapest, Hungary. In her research, she studied the attitude of young
people towards subtitled movies. The participants consisted of young people only, because it is
essential for them to be proficient in foreign languages, especially English. 413 third-year
students of the Corvinus University of Budapest, Hungary participated in this research as the
sample; 63% were women and 37% were men. The average age of the group was 21 years.
They were given questionnaires on subtitled movies. The respondents had to answer the question
of the types of programme for which they would recommend subtitles. The numbers of yes
answers are reported cumulatively: 86% of the respondents agreed that subtitles are appropriate
for movies in the cinema, while 65% recommended subtitles for documentaries; 65% of the sample
suggested subtitles for historical films, followed by comedies with 60% of the sample.
From the results, it was found that young Hungarians prefer subtitled movies to dubbed movies.
This is to maintain the originality of the movies. They also watch subtitled movies mostly to
learn foreign languages more easily.


CHAPTER THREE
METHODOLOGY
3.0 INTRODUCTION
This chapter covers a detailed explanation of the methodology that was used to complete this
project and make it work well. The methodology is used to achieve the objectives of the project
and to accomplish a good result. The methodology of this project was based on the System
Development Life Cycle (SDLC); generally, three major steps were involved, which are planning,
implementation and analysis.

Figure 3. 1-Phases of methodology

3.1 PLANNING
To identify all the information and requirements, such as hardware and software, planning must
be done in a proper manner. The planning phase has two main elements, namely data collection
and the hardware and software requirements.
3.1.1 DATA COLLECTION
Data collection is an important stage in any area of study. At this stage I planned the project's
resources and requirements, the literature studies and the schedule, gathering material from
libraries and the Internet.


Within the data collection period I found some studies about subtitle generation using speech
recognition on the Internet and did some research related to the project. Once I had the project
manual, I tried to find out the hardware components, software materials and other materials that
would be needed for the project.
While planning, I did some research related to the project, including studying speech recognition
topics such as the building of language models from text corpora, the creation of acoustic models
and data gathering.

3.1.2 HARDWARE AND SOFTWARE REQUIREMENTS


A. HARDWARE REQUIREMENTS
Basically, the hardware required is a personal computer with at least the following specifications
or higher.
Table 3. 1-Hardware Requirements

B. SOFTWARE REQUIREMENTS
Several software applications can be used for the implementation of this project work. The
project can be implemented on both Linux and Windows operating systems. The following are the
basic software applications used to ensure effective implementation of subtitle generation using
speech recognition.

I. SYSTEM SOFTWARE
Ubuntu 13.04 64-bit (Linux operating system)

II. PERSONAL COMPUTER SOFTWARE
Java Development Kit
Text corpora
Apache Ant
VLC media player 2.0.5 (Twoflower)
Sphinx-4
Sphinx-3
CMU-Cambridge Statistical Language Modelling Toolkit v2
Sphinxbase
SphinxTrain
AN4
RM1
Perl

3.2 ANALYSIS
Analysis is the process of collecting factual data, understanding the processes involved,
identifying problems and recommending feasible suggestions for improving the functioning of the
system. This includes sub-dividing complex processes involving the entire system and identifying
data stores and manual processes. The major objectives of systems analysis are to find answers
for each business process: what is being done, how is it being done, who is doing it, when is it
being done, who is going to use the system, why is it being done and how can it be improved?

3.3 DESIGN
A media file (either video or directly audio) is given as input. The audio track is extracted and
then read chunk by chunk until the end of the track is reached. Within this loop, three tasks
happen successively: speech recognition, time synchronization, and subtitle generation. Finally,
a subtitle file is returned as output.

Figure 3. 2-Steps of Project Design


For the system to be fully implemented, three stages are essential. These are audio extraction,
speech recognition and subtitle generation.


3.3.1 AUDIO EXTRACTION


A movie file is generally composed of a video track and an audio track. During the montage, the
tracks are gathered together and the final artefact is a single file. In this module, we isolate
the audio track from the rest of the file in order to process the sound alone and convert it into
a suitable format to be used by the speech recognition tool. A suitable tool for this purpose is
VLC media player.

3.3.2 SPEECH RECOGNITION


This technology permits a computer to handle sound input, through either a microphone or an
audio file, in order to transcribe it or to use it to interact with the machine. A speech
recognition system can be designed to handle either a unique speaker or an arbitrary number of
speakers. The first case, called a speaker-dependent model, presents an accuracy rate greater
than 90%, with peaks reaching 98% under optimal conditions (quiet room, high-quality microphone,
etc.). Modern speech recognition engines are generally based on Hidden Markov Models (HMMs).
An example often used to ease understanding of this technique is the following: if a person says
that he wore a raincoat yesterday, then one would predict that it must have been raining. In the
same way, a speech recognition system determines the probability of a sequence of acoustic data
given one word (or word sequence). The most likely word sequence can then be derived according to
Bayes' theorem.
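The equation referenced below is the standard Bayes decomposition used in speech recognition;
written out (this is a reconstruction, with "word" standing for a word or word sequence and
"acoustics" for the observed acoustic data), it reads:

P(\text{word} \mid \text{acoustics}) = \frac{P(\text{acoustics} \mid \text{word}) \, P(\text{word})}{P(\text{acoustics})}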

Equation 3. 1-Bayes' Theorem


Following this theorem, it can be observed that P(acoustics) is a constant for any given sequence
of acoustic data and can thereby be ignored. P(word) represents the probability of the word
according to a specific language model.
Even if word recognition has reached a very satisfying rate of 80-90%, grammar still needs
multiple enhancements. Punctuation determination involves distinguishing stressed syllables or
words in an utterance. In a natural conversation, it is straightforward to tell whether the
speaker affirms (.), exclaims (!) or interrogates (?). Nonetheless, it is much more difficult for
speech recognition systems to make this distinction; they usually only recognize which word
was pronounced. Therefore, stresses and intonations are rarely detected and punctuation
remains a manual process.

A. ACOUSTIC MODEL
We now introduce the concept of an acoustic model. It is a file containing a statistical
representation of the distinct sounds that make up each word in the language model. We
differentiate two kinds of acoustic model:
I. Speaker-dependent acoustic model: an acoustic model that has been designed to handle a
specific individual's speech. This kind of model is generally trained using audio from the
person concerned.
II. Speaker-independent acoustic model: a model that is supposed to recognize speech from
different people, especially people who did not participate in the training of the acoustic
model. Obviously, a speaker-independent acoustic model requires much more training audio to
provide correct results.


B. LANGUAGE MODEL OR GRAMMAR


A language model groups a very broad list of words and their probabilities of occurrence in a
given sequence. In a grammar, a list of phonemes is associated with every word. The phonemes
correspond to the distinct sounds forming the word.
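For illustration, a pronunciation dictionary of the kind used with Sphinx associates each word with
its phoneme sequence; the two entries below follow the conventions of the CMU Pronouncing
Dictionary (ARPAbet phoneme symbols) and are given here only as an example:

HELLO    HH AH L OW
WEATHER  W EH DH ER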

C. DECODER
The following definition of a decoder comes from the VoxForge website: "Software program that
takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds. When
a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of
the matching phonemes until it reaches a pause in the user's speech. It then searches the Language
Model or Grammar file for the equivalent series of phonemes. If a match is made it returns the
text of the corresponding word or phrase to the calling program."

D. SPHINX-4
Sphinx-4 is an open source project led by Carnegie Mellon University, Sun Microsystems Inc.
and Mitsubishi Electric Research Laboratories. It is completely written in Java. It offers a
highly modularized and flexible architecture as well as versatile APIs, supports any acoustic
model structure, handles most types of language models, provides new algorithms in order to
get word level hypotheses, and accepts multimodal inputs. This will be our main speech
recognition tool to implement our project.

3.3.3 SUBTITLE GENERATION


The module is expected to get a list of words and their respective speech times from the speech
recognition module and then to produce an SRT subtitle file. To do so, the module must go through
the list of words and use silence (SIL) utterances as delimiters between two consecutive sentences.

3.3.4 CODING
Java is the preferred programming language used to code this project, because Java offers an
intuitive syntax, portability across multiple operating systems and reliable libraries. The absence
of pointers, the use of an efficient garbage collector and run-time compilation play a key role in
Java. Besides, Sphinx-4, our speech recognition tool, is written entirely in Java.

3.3.5 METHODOLOGIES USED


This is where we demonstrate the project's feasibility by detailing the experience and resources
that were drawn upon to carry out the project. The methodology section enables us to describe the
specific activities that took place to achieve the objectives. The following methodologies were
used to carry out the project's objectives:
I. Research and consultations: My research methodology required gathering relevant data from
the specified documents in order to analyse the material and arrive at a more complete
understanding of subtitle generation. The Internet also helped in the research work, and
enhanced knowledge was acquired from resource persons. A qualitative evaluation was used for
this research project, leveraging subjective methods such as interviews and observations to
collect substantive and relevant data.
II. Coding and programming: Java was the preferred programming language used to code the
program. C++ could also have been used, but Java is preferred because of its ability to deal
directly with media files through the Java Media Framework (JMF) API, which allows developers
to deal with media tracks (audio and video) in an efficient way.
III. System analysis: For a problem to be solved with maximum output, system analysis is needed.
A well-structured technique was used for handling the problem of subtitle generation, which led
to an efficient allocation of resources to meet well-defined goals and objectives.
IV. Flow charts

3.4 IMPLEMENTATION
Implementation is the carrying out or execution of a plan, a method or any design for doing
something. As such, implementation is the action that must follow any preliminary thinking in
order for something to actually happen. In project management terminology, it refers to the
process of carrying out the project plan by performing the activities included therein. The major
steps involved in this phase are:
I. Acquisition and Installation of Hardware and Software
II. Construction of the subtitle generation using speech recognition
III. Documentation

3.4.1 ACQUISITION AND INSTALLATION OF HARDWARE AND SOFTWARE


This is the stage where we get the hardware system installed. The hardware used is mainly a
personal computer (DELL INSPIRON 1750), which conforms to the hardware specifications provided
in Table 3.1 above.
At this stage the various software packages listed in section 3.1.2 above are also installed and
made ready for the construction of the subtitle generation system.

3.4.2 CONSTRUCTION OF THE SUBTITLE GENERATION USING SPEECH RECOGNITION


This is where we implement the various stages of the subtitle generation, namely audio
extraction, speech recognition and subtitle generation. In the audio extraction stage we used the
VideoLAN player (VLC) to extract the audio from video files into WAV format, which is compatible
with Sphinx-4. In the speech recognition stage, Sphinx-4 is used to perform the recognition of
speech. The final stage is the generation of a subtitle file in .srt format.

3.4.3 DOCUMENTATION
Documentation is the recording of the project work in the form of documents.

3.5 SUMMARY
In summary, every project has a methodology that is used to make the project successful and
working well. Generally, the methodology here is divided into three parts: planning,
implementation and analysis. The planning phase includes some reading activity and the work of
determining the hardware and software requirements.
In the reading activity I did research using several sources, such as textbooks, journals, paper
references and the Internet, to obtain information related to the project. For the hardware and
software requirements, I studied and found out the functional and operational characteristics of
the related hardware and software.
The next step is the implementation phase, where the subtitles are generated into a .srt file.
With appropriate steps and methodology, the process of completing the project can be managed
wisely and will give a good result.


CHAPTER FOUR
PROPOSED SYSTEM
4.0 INTRODUCTION
This chapter presents the procedural nature of the project work. We now offer a deeper analysis
in order to clearly measure the real needs. Java was chosen as the preferred programming language
to implement the project, since it has components that were found to be essential for speech
recognition. The subtitle generation module uses the different symbols and tags produced by the
speech recognizer in order to generate SRT files.

4.1 SYSTEM OVERVIEW


4.1.1 AUDIO EXTRACTION
This module aims to output an audio file from a media file. It takes as input a file URL and,
optionally, the desired audio format of the output. Next, it checks the file content and creates a
list of the separate tracks composing the initial media file. An exception is thrown if the file
content is irrelevant. Once the list is built, the processor analyses each track, selects the first
audio track found and discards the rest (note that an exception is thrown if there is more than
one audio track). Finally, the audio track is written to a file in the default or desired format.
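The behaviour described above can be sketched with the Java Media Framework. The fragment below
is only an illustrative sketch of such a JMF-based extractor (the class name and arguments are our
own, and real code must handle JMF's asynchronous state transitions, track formats and errors much
more carefully); as noted in section 4.2.1, the final system actually relies on VLC for this step.

import javax.media.*;
import javax.media.control.TrackControl;
import javax.media.format.AudioFormat;
import javax.media.protocol.ContentDescriptor;
import javax.media.protocol.FileTypeDescriptor;

public class AudioExtractor {
    public static void main(String[] args) throws Exception {
        // args[0] = input media URL, args[1] = output WAV path (hypothetical usage)
        Processor p = Manager.createProcessor(new MediaLocator(args[0]));
        p.configure();
        waitFor(p, Processor.Configured);

        // Keep the first audio track found; disable (discard) every other track.
        boolean audioFound = false;
        for (TrackControl tc : p.getTrackControls()) {
            if (!audioFound && tc.getFormat() instanceof AudioFormat) {
                audioFound = true;
            } else {
                tc.setEnabled(false);
            }
        }
        if (!audioFound) {
            throw new IllegalArgumentException("No audio track found in " + args[0]);
        }

        // Ask for a WAV container on the processor output, then write it through a DataSink.
        p.setContentDescriptor(new ContentDescriptor(FileTypeDescriptor.WAVE));
        p.realize();
        waitFor(p, Processor.Realized);

        DataSink sink = Manager.createDataSink(p.getDataOutput(),
                new MediaLocator("file://" + args[1]));
        sink.open();
        sink.start();
        p.start();
    }

    // Naive polling wait for an asynchronous JMF state transition
    // (a real program would use a ControllerListener instead).
    private static void waitFor(Processor p, int state) throws InterruptedException {
        while (p.getState() < state) {
            Thread.sleep(50);
        }
    }
}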

A. AUDIO FORMATS DISCUSSION


We now consider the audio formats accepted by Sphinx. At the moment, it principally supports WAV
or RAW files. It is therefore a complicated task to obtain the right format directly; it often
requires various conversions and adaptations. The tool used to make this kind of conversion and
adaptation possible is the VideoLAN media player (VLC).


Figure 4. 1-Activity Diagram for Audio Extraction

4.1.2 SPEECH RECOGNITION


Given the limited duration of this project work, and because of the complexity of the task, it
would not have been feasible to design and build a new Automatic Speech Recognition (ASR) module.
There are different speech recognition systems available; we limit ourselves to the Sphinx family
of speech recognition systems. The Sphinx systems have been developed for nearly two decades and
have proved their reliability in the several Automatic Speech Recognition systems into which they
have been integrated. It was therefore natural to opt for the Sphinx-4 decoder. Sphinx-4 has been
written entirely in Java, making it totally portable, and it provides a modular architecture,
allowing configurations to be modified with ease. The architecture overview is shown below:

Figure 4. 2-Sphinx-4 Architecture


We aim to take advantage of Sphinx-4's structure in order to satisfy the needs of the subtitle
generation speech recognition module. The latter is expected to select the most suitable models
with regard to the audio and the parameters (category, length, etc.) given as input and thereby
generate an accurate transcript of the audio speech.

An audio file and some parameters are given as arguments to the module. First, the audio file is
checked: if its format is valid, the process continues; otherwise, an exception is thrown and the
execution ends. According to the category (potentially amateur, movie, news, series, music) given
as an argument, the related acoustic and language models are selected. Some alterations are made
to the Sphinx-4 configuration based on the set parameters. Then, all the components used in the
ASR process are allocated the required resources. Finally, the decoding phase takes place and
results are periodically saved to be reused later.

Figure 4. 3-Activity Diagram for Speech Recognition
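As an illustration of how this module can drive Sphinx-4, the sketch below follows the pattern of
the Sphinx-4 transcriber demo: an XML configuration file defines the acoustic model, language
model, front end and named components, the front end is pointed at the extracted WAV file, and the
recognizer is called repeatedly until the end of the audio. The file names and component names used
here ("config.xml", "recognizer", "audioFileDataSource", "audio.wav") are placeholders that depend
on the actual configuration; this is a sketch, not the final module.

import java.net.URL;
import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class SpeechRecognizerModule {
    public static void main(String[] args) throws Exception {
        // config.xml selects the acoustic model, the language model and the front end.
        ConfigurationManager cm = new ConfigurationManager(new URL("file:config.xml"));

        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();                       // allocate resources for all ASR components

        // Point the front end at the extracted WAV file instead of a microphone.
        AudioFileDataSource dataSource = (AudioFileDataSource) cm.lookup("audioFileDataSource");
        dataSource.setAudioFile(new URL("file:audio.wav"), null);

        // Decode utterance by utterance until the end of the audio track is reached.
        Result result;
        while ((result = recognizer.recognize()) != null) {
            // For subtitle timing, the Result can also be queried for word-level times
            // (e.g. through its timed-result methods); here we only print the text.
            System.out.println(result.getBestResultNoFiller());
        }
        recognizer.deallocate();
    }
}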


4.1.3 SUBTITLE GENERATION


The module is expected to get a list of words and their respective speech times from the speech
recognition module and then to produce an SRT subtitle file. To do so, the module must go through
the list of words and use silence (SIL) utterances as delimiters between two consecutive sentences.

Figure 4. 4-Activity Diagram for Subtitle Generation


The activity diagram for subtitle generation exhibits the major steps of the subtitle generation
module. First, it receives a list of Utterance-Speech Time pairs. Then, it traverses the list to
the end. In each iteration, the current utterance is checked. If it is a real utterance, we verify
whether the current line is empty. If so, the subtitle number is incremented and the start time of
the current line is set to the utterance's speech time. Then, the utterance is added to the current
line.
In the case of a SIL utterance, we check whether the current line is empty: if not, the end time
of the current line is set to the SIL speech time. If the line is empty, we ignore the SIL
utterance. Once the list has been traversed, the file is finalized and released to the user.
However, we face some limitations. Indeed, the system is not able to insert punctuation, since
that would involve much more speech analysis and deeper design.
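A minimal Java sketch of this traversal is shown below. The TimedWord structure and the method
names are hypothetical (the real module receives its word/time pairs from the speech recognition
module), but the loop implements exactly the SIL-delimited grouping described above and writes
entries in the SRT format (index, start --> end, text, blank line):

import java.io.PrintWriter;
import java.util.List;

/** One recognized token with its speech time in seconds (hypothetical structure). */
class TimedWord {
    final String word;      // the utterance, or "SIL" for a detected silence
    final double time;      // speech time in seconds
    TimedWord(String word, double time) { this.word = word; this.time = time; }
}

public class SubtitleGenerator {

    /** Writes the word list to an SRT stream, using SIL utterances as sentence delimiters. */
    public static void generateSrt(List<TimedWord> words, PrintWriter out) {
        int subtitleNumber = 0;
        StringBuilder line = new StringBuilder();
        double startTime = 0;

        // Assumes the list ends with a SIL utterance, as produced by the recognizer.
        for (TimedWord w : words) {
            if (!"SIL".equals(w.word)) {                 // a real utterance
                if (line.length() == 0) {                // current line is empty
                    subtitleNumber++;                    // start a new subtitle entry
                    startTime = w.time;                  // its start time is this word's time
                }
                line.append(w.word).append(' ');
            } else if (line.length() > 0) {              // SIL closes a non-empty line
                out.printf("%d%n%s --> %s%n%s%n%n",
                        subtitleNumber, srtTime(startTime), srtTime(w.time),
                        line.toString().trim());
                line.setLength(0);
            }                                            // SIL on an empty line is ignored
        }
        out.flush();
    }

    /** Formats seconds as the SRT timestamp HH:MM:SS,mmm. */
    private static String srtTime(double seconds) {
        long ms = Math.round(seconds * 1000);
        return String.format("%02d:%02d:%02d,%03d",
                ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000);
    }
}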

4.2 IMPLEMENTATION
4.2.1. GENERAL SETTINGS OF THE SYSTEM
The implementation of the system was realized on a personal computer with at least the following
specifications or higher:
Table 4. 1-Hardware Requirements

This segment provides guidelines on the use of the peripheral software required to make the
subtitle generation system based on speech recognition run.
Apache Ant is needed in the general setting of the system because Sphinx-4 is built with it; both
Java and Apache Ant are needed for the Sphinx-4 speech recognition tool to work. First of all, it
is necessary to download the binary distributions. The next step consists in extracting the root
directory and copying it to the target location. Then, the ANT_HOME environment variable should be
set to the directory where Ant is installed, and the bin directory must be added to the user's
path. The implementation and installation of external software were done solely on Linux
distributions.


A. INSTALLATION OF ORACLE JAVA


Since installation is done on Linux, we performed the installation of Java in the Ubuntu terminal.
Before installing it, we removed OpenJDK:
sudo apt-get purge openjdk*
To install Oracle Java we added the Personal Package Archive (PPA):
sudo add-apt-repository ppa:webupd8team/java
We then updated the repository index:
sudo apt-get update
We then installed Oracle Java:
sudo apt-get install oracle-java8-installer
We then checked whether Java was installed:
java -version

B. INSTALLATION OF APACHE ANT


First we make sure a Java environment is installed.
We download the binary distribution of Apache Ant from
http://archive.apache.org/dist/ant/
We then uncompress the downloaded file into the directory /usr/local/
We set the environment variable JAVA_HOME to the Java environment, ANT_HOME to the directory Ant
was uncompressed to, and add ${ANT_HOME}/bin to the path:
export ANT_HOME=/usr/local/ant
export JAVA_HOME=/usr/lib/jvm/Oracle-java 8
export PATH=${PATH}:${ANT_HOME}/bin
Finally we check whether Ant is installed:
ant -version

C. AUDIO EXTRACTION
This stage is where we figure out how to convert the output audio file into a format recognized
by Sphinx-4. We did not obtain the expected result using the Java Media Framework. Consequently,
we based this part of the system on the use of VLC media player. The procedure thus consists of a
few straightforward steps:
I. Install VLC media player.
II. Launch VLC and go to Media -> Convert / Save...
III. Select the media file - several files from the same directory can be selected at a time.
IV. Under Encapsulation, check WAV.
V. Under Audio codec, check Audio, choose WAV as the Codec, set the Bitrate to 256 kb/s and set
Channels to 1.
VI. Define the output file path.
VII. Click on the Save button.
The resulting file is quite heavy but can be used perfectly well in Sphinx-4.

D. SPEECH RECOGNITION
I. SPHINX-4
Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java programming
language. It was created through a joint collaboration between the Sphinx group at Carnegie
Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and
Hewlett-Packard (HP), with contributions from the University of California at Santa Cruz (UCSC)
and the Massachusetts Institute of Technology (MIT). Sphinx-4 started out as a port of Sphinx-3 to
the Java programming language, but evolved into a recognizer designed to be much more flexible
than Sphinx-3, thus becoming an excellent platform for speech research. CMU Sphinx is one of the
most popular speech recognition applications for Linux and it can correctly capture words. It
also gives developers the ability to build speech systems, interact with voice, and build
something unique and useful.
Based on the official manual of Sphinx-4, we explain now the different phases we went through
in order to obtain a fully operational speech recognition system on a Linux Operating System
(OS) - Ubuntu 13.04:
Sphinx-4 has two packages available for download:
Sphinx4-{version}-bin.zip: provides the jar files, documentation, and demos
Sphinx4-{version}-src.zip: provides the sources, documentation, demos, unit tests and
regression tests.
After downloading the distribution, we unpacked the ZIP files into /usr/local/ using the jar
command, which is found in the bin directory of the Java installation:
jar xvf sphinx4-{version}-bin.zip
jar xvf sphinx4-{version}-src.zip
The next step was to build the Sphinx-4 installation. Before doing that, we needed to install
sharutils, since it is required for the installation of the Java Speech Application Programming
Interface (JSAPI):
sudo apt-get install sharutils
We then installed JSAPI by opening the lib directory in the sphinx4 folder and typing the
following in the terminal:
sh jsapi.sh
For Sphinx-4 to work, we needed to set the required environment variables: JAVA_HOME to
the location of the JDK, ANT_HOME to the location of Ant, and PATH to include the bin
subfolders of both the JDK and Ant:
export ANT_HOME=/usr/local/ant
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=${PATH}:${ANT_HOME}/bin:${JAVA_HOME}/bin
We then ran Ant to build and document the system; the following targets are available:
ant - builds Sphinx-4
ant clean - removes all build output
ant javadoc - generates the documentation for public classes and methods
ant -Daccess=private javadoc - generates the documentation for all classes and methods
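
Once built, Sphinx-4 is driven from Java code. The following is a minimal sketch of a
recognition run, modelled on the Transcriber demo shipped with Sphinx-4; the component names
"recognizer" and "audioFileDataSource", the configuration file config.xml and the audio file
name are assumptions that must match the XML configuration actually used:

import java.net.URL;
import edu.cmu.sphinx.frontend.util.AudioFileDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class TranscriberSketch {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration that wires the recognizer together.
        ConfigurationManager cm = new ConfigurationManager(new URL("file:config.xml"));

        // Look up the recognizer and the audio source declared in the configuration.
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();
        AudioFileDataSource dataSource =
                (AudioFileDataSource) cm.lookup("audioFileDataSource");
        dataSource.setAudioFile(new URL("file:speech.wav"), null);

        // Decode utterance by utterance until the audio file is exhausted.
        Result result;
        while ((result = recognizer.recognize()) != null) {
            System.out.println(result.getBestFinalResultNoFiller());
        }
        recognizer.deallocate();
    }
}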

II. LANGUAGE MODEL CREATION


This task requires the installation of the CMU-Cambridge Statistical Language Modeling
Toolkit. As the implementation is done on a Linux machine, we install the software from the
terminal. We briefly describe the successive steps:
1. Download the tarball CMU-Cam_Toolkit_v2.tar.gz.
2. Extract CMU-Cam_Toolkit_v2.tar.gz into /usr/local/.
3. Open a terminal and change directory to /usr/local/CMU-Cam_Toolkit_v2/src.
4. Execute the command make install. The executable files are generated in the bin
directory.
5. Copy all the executable files contained in bin to /bin.
We now present the steps to generate a language model from a text file of sentences called
weather.txt, located in the current directory. Before doing this, we need to create a makefile
that turns the text corpus into the language model.
First of all, we design the makefile of the language model. We use a text corpus of weather
forecasts.
The makefile generates the language model for a 'weather' vocabulary.
The makefile uses:
weather.txt - transcript of weather forecasts

weather.vocab - hand-prepared vocabulary list
The makefile generates:
weather.lm - ARPA format of the language model
weather.DMP - CMU binary format of the language model
weather.transcript - transcript used by the live decoder as prompts
The makefile requires:
CMU language model toolkit: http://www.speech.cs.cmu.edu/SLM_info.html
lm3g2dmp utility to generate DMP format models: http://cmusphinx.sourceforge.net/webpage/html/download.php#utilities
Unix commands: gawk, uniq, mv, rmdir, rm
All commands should be in the PATH.
bn=weather.txt
We want a closed vocabulary language model so we use extractVocab to extract just the
sentences that entirely match our vocabulary
gawk -f extractVocab.awk weather.vocab weather.txt > $bn.tmp.closed
We generate the 'test' file that can be used by the live decoder as the prompt for the user. We
eliminate adjacent duplicate entries
gawk -f genTranscript.awk < $bn.tmp.closed > weather.transcript
We then generate the word frequencies
text2wfreq < $bn.tmp.closed > $bn.tmp.wfreq
We generate the vocabulary (this should be a subset of weather.vocab)
wfreq2vocab < $bn.tmp.wfreq > $bn.tmp.vocab
We generate the idngram
text2idngram -vocab $bn.tmp.vocab < $bn.tmp.closed > $bn.tmp.idngram
We generate the language model
idngram2lm -vocab_type 0 -idngram $bn.tmp.idngram -vocab $bn.tmp.vocab -arpa $bn.arpa
We generate the DMP version of the language model
mkdir dmp
lm3g2dmp weather.txt.arpa dmp
mv dmp/weather.txt.arpa.DMP weather.DMP
mv $bn.arpa weather.lm
We finally perform a cleanup
rmdir dmp
rm *.tmp.*
Alternatively, the language model can be generated directly with the toolkit commands, given
the same text file weather.txt:
1. Compute the word unigram counts:
cat weather.txt | text2wfreq > weather.wfreq
2. Convert the word unigram counts into a vocabulary:
cat weather.wfreq | wfreq2vocab > weather.vocab
3. Generate a binary id 3-gram of the training text, based on the above vocabulary:
cat weather.txt | text2idngram -vocab weather.vocab > weather.txt.idngram
4. Convert the idngram into a binary format language model:
idngram2lm -idngram weather.txt.idngram -vocab weather.vocab -binary weather.binlm
5. Generate an ARPA format language model from the binary format language model:
binlm2arpa -binary weather.binlm -arpa weather.arpa


III. ACOUSTIC MODEL PACKAGE CREATION


We follow the Robust group's Open Source Tutorial from
http://www.speech.cs.cmu.edu/sphinx/tutorial.html. In this tutorial, we handled a complete
state-of-the-art HMM-based speech recognition system. The system we used is the SPHINX
system, designed at Carnegie Mellon University. SPHINX is one of the best and most versatile
recognition systems in the world today. An HMM-based system, like all other speech
recognition systems, functions by first learning the characteristics (or parameters) of a set of
sound units, and then using what it has learned about the units to find the most probable
sequence of sound units for a given speech signal. The process of learning about the sound
units is called training. The process of using the acquired knowledge to deduce the most
probable sequence of units in a given signal is called decoding, or simply recognition.
Accordingly, we need those components of the SPHINX system that can be used for training
and for recognition; in other words, we need the SPHINX trainer and a SPHINX decoder.
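
In probabilistic terms, decoding applies the standard maximum a posteriori decision rule: given
the observed sequence of acoustic features O, the recognizer searches for the word (or sound
unit) sequence W* that maximizes

W* = argmax over W of P(O | W) * P(W)

where P(O | W) is supplied by the trained acoustic models (the HMMs) and P(W) by the
language model; training is the estimation of the parameters that define P(O | W).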
A. COMPONENTS PROVIDED FOR TRAINING
The SPHINX trainer consists of a set of programs, each responsible for a well-defined task,
and a set of scripts that organizes the order in which the programs are called. We have to
compile the code on our platform. The trainer learns the parameters of the models of the sound
units using a set of sample speech signals. This is called a training database. A choice of
training databases is also provided. The trainer also needs to be told which sound units we want
it to learn the parameters of, and at least the sequence in which they occur in every speech
signal of the training database. This information is provided to the trainer through a file called
the transcript file, in which the sequence of words and non-speech sounds is written exactly as
it occurred in a speech signal, followed by a tag which can be used to associate this sequence
with the corresponding speech signal. The trainer then looks into a dictionary which maps every
word to a sequence of sound units, in order to derive the sequence of sound units associated
with each signal. Thus, in addition to the speech signals, we also need a set of transcripts for
the database (in a single file) and two dictionaries: one in which legitimate words in the
language are mapped to sequences of sound units (or sub-word units), and another in which
non-speech sounds are mapped to corresponding non-speech or speech-like sound units. We
refer to the former as the language dictionary and the latter as the filler dictionary.
In summary, the components provided for training are:
1. The trainer source code
2. The acoustic signals
3. The corresponding transcript file
4. A language dictionary
5. A filler dictionary
B. COMPONENTS PROVIDED FOR DECODING
The decoder also consists of a set of programs, which have been compiled to give a single
executable that will perform the recognition task, given the right inputs. The inputs that need
to be given are: the trained acoustic models, a model index file, a language model, a language
dictionary, a filler dictionary, and the set of acoustic signals that need to be recognized. The
data to be recognized are commonly referred to as test data.


In summary, the components provided for decoding are:


1. The decoder source code
2. The language dictionary
3. The filler dictionary
4. The language model
5. The test data
We create a root directory for the task, called asg-am, in the /usr/local/Speech directory. In
order to get relevant results from the decoder, it is imperative to use SphinxTrain. We download
the file SphinxTrain.nightly.tar.gz containing the source code and extract it into the /usr/local/
directory. The next phase is the compilation of the trainer: from the terminal command line,
move to the SphinxTrain directory and type the following commands:
.../SphinxTrain>./configure
.../SphinxTrain>make
Once SphinxTrain has been installed, we need to configure the task. We move to the asg-am directory and input the command:
.../asg-am>../SphinxTrain/scripts_pl/setup_SphinxTrain.pl -task asg-am
The next step consists of preparing the audio data. We put the gathered audio in the asg-am/wav
folder and the necessary property files in the asg-am/etc folder. The files which have to be
created from the audio material are listed here:

I. asg-am.dic: includes each word and the phonemes that make up each word.
II. asg-am.filler: includes filler sounds, generally for <s>, <sil> and </s>.
III. asg-am.phone: includes the phonemes that are part of the training set (it must not
contain unused phonemes).
IV. asg-am_train.fileids: contains the names of the audio files of the asg-am/wav folder,
without extension, one per line, in the same order as the audio files.
V. asg-am_train.transcription: contains the transcription of each audio file, one per line,
in the same order as the audio files and surrounded by the markers <s> and </s>
(illustrative lines from these two files are shown after this list).
VI. feat.params and sphinx_train.cfg: both generated by SphinxTrain.
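
To make these formats concrete, a corresponding pair of lines in the fileids and transcription
files could look as follows; the file name and sentence are invented for illustration:

asg-am_train.fileids:          utt_0001
asg-am_train.transcription:    <s> the weather will be sunny today </s> (utt_0001)

The trailing identifier in parentheses is the tag that associates the transcription with its audio
file.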

After adjusting the configuration in sphinx_train.cfg, the model creation can take place, first by
generating the feature files from the WAV files and then by running all the Perl scripts to
complete the process. Here are the two successive commands used:
.../asg-am>./scripts_pl/make_feats.pl -ctl etc/asg-am_train.fileids
.../asg-am>./scripts_pl/RunAll.pl


E. SUBTITLE GENERATION
Based on the analysis and design realized previously, a module generating subtitles in SRT
format has been implemented in Java. To do so, we created three classes: TimedToken, Subtitle
and SubtitleGenerator. The Subtitle class encapsulates a subtitle in SRT format. The
SubtitleGenerator class provides static methods that create Subtitle instances from the Result
objects produced by the Sphinx-4 recognizer. The method getTimedBestResult() of Result is
used to retrieve both tokens and their times in string form. The string is then parsed, and each
token and its times are used to instantiate a TimedToken. From this point, a list of TimedToken
objects is available. Finally, the list is traversed and different operations are performed
according to the type of token. Each subtitle text is delimited by two silences. A sketch of how
such a Subtitle class can render SRT entries is given below the class diagram.

Figure 4.5 - Class Diagram for Subtitle Generation
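
As an illustration of the SRT side of this design, the following is a minimal sketch of how the
Subtitle class could render one SRT entry; the field and method names are illustrative and may
differ from the classes actually written for this project:

public class Subtitle {
    private final int index;          // running number of the subtitle
    private final double startTime;   // start of the speech segment, in seconds
    private final double endTime;     // end of the speech segment, in seconds
    private final String text;        // recognized words between two silences

    public Subtitle(int index, double startTime, double endTime, String text) {
        this.index = index;
        this.startTime = startTime;
        this.endTime = endTime;
        this.text = text;
    }

    // Formats this subtitle as one SRT entry.
    public String toSrt() {
        return index + "\n"
                + formatTime(startTime) + " --> " + formatTime(endTime) + "\n"
                + text + "\n";
    }

    // Converts seconds into the SRT time stamp HH:MM:SS,mmm.
    private static String formatTime(double seconds) {
        long millis = Math.round(seconds * 1000);
        long h = millis / 3600000;
        long m = (millis / 60000) % 60;
        long s = (millis / 1000) % 60;
        long ms = millis % 1000;
        return String.format("%02d:%02d:%02d,%03d", h, m, s, ms);
    }
}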



F. SPHINX-4 TUNING
For the Sphinx-4 system to work, we needed a Sphinx-4 configuration. The configuration of a
particular Sphinx-4 system is determined by a configuration file. This configuration file defines
the following:
I. The names and types of all of the components of the system
II. The connectivity of these components - that is, which components talk to each other
III. The detailed configuration for each of these components.

I. SPHINX-4 XML-CONFIGURATION
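
A skeleton of such a configuration file, modelled on the demo configuration files distributed
with Sphinx-4, is shown below; the component names, types and property values are illustrative
and are not the exact configuration used in this project:

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Global properties referenced by several components -->
    <property name="absoluteBeamWidth" value="-1"/>
    <property name="languageWeight" value="8"/>

    <!-- The recognizer ties the decoder and its monitors together -->
    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoder"/>
    </component>

    <!-- The decoder drives the search manager -->
    <component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
        <property name="searchManager" value="searchManager"/>
    </component>

    <!-- Further components (search manager, linguist, dictionary, language
         model, acoustic model, front end) are declared in the same way -->
</config>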


Sample of a subtitle output
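
An SRT fragment of the kind produced by the generator would look as follows; the timings and
text are invented for illustration:

1
00:00:01,120 --> 00:00:03,540
good morning this is the weather forecast

2
00:00:04,200 --> 00:00:06,010
expect light showers in the afternoon

As noted in the recommendations, the text is lower-case and unpunctuated because it comes
directly from the recognizer output.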


CHAPTER FIVE
5.0 CONCLUSION AND RECOMMENDATION
5.1 CONCLUSION
We proposed a way to build and train a system to generate subtitles for sound in videos using
speech recognition. A complete system including the three required modules, namely audio
extraction, speech recognition and subtitle generation, could not be fully realized, since the
audio conversion needed more resources than were available. VLC gave an appropriate
solution, but a custom component coded in Java is expected in further work so that portability
and installation of the system become uncomplicated. Nonetheless, the expected output for each
phase has been reached. The audio extraction module provides a suitable audio format to be
used by the speech recognition module. The latter generates a list of recognized words and their
corresponding times in the audio, although the accuracy is not guaranteed. This list is used by
the subtitle generation module to create a standard subtitle file readable by the most common
media players available.
In recent years, the Internet has seen a multiplication of video-based websites, most of whose
videos come from amateurs and rarely have transcripts available. This work was mostly
oriented towards video media and suggested a way to produce a transcript of the audio from a
video, with the ultimate purpose of making content comprehensible to deaf persons. Although
the current system is not stable enough to be widely used, it proposes one interesting approach
that can certainly be improved. The main aim of this work was to build and train a system that
can be used to generate a subtitle file.


5.2 RECOMMENDATION
Larger language models and acoustic models were not easy to come by; they are generally only
accessible by purchase. My recommendation is that websites hosting these language models
and acoustic models should make them freely available to students, to make work of this kind
easier.
When discussing the subtitle generation module, we emphasized that the insertion of
punctuation is a complicated task for an Automatic Speech Recognition system to perform. It
would be interesting to study this subject further, because the output of an Automatic Speech
Recognition system is generally raw, lower-case text without any punctuation marks, while
punctuation plays a significant role in the understanding of spoken exchanges. Several
methodologies could be considered, such as the use of transducers or language model
enhancements.


REFERENCES
1. Apache ant 1.7.1 manual, 2013. URL: http://ant.apache.org/manual/index.html.
2. B. Guenebaut (2009): Automatic Subtitle Generation for Sound in Videos
3. Configuration management for Sphinx-4, (2013). URL: http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/util/props/docfiles/ConfigurationManagement.html
4. Engineered Station. How speech recognition works. (2013). URL: http://project.uet.itgo.com/speech.htm.
5. Essays on subtitles. (2013). URL: http://www.ukessays.com/essays/english-language/a-history-and-development-of-subtitles-english-language-essay.php
6. Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, (2013).
7. Gaupol. (2013).URL: http://home.gna.org/gaupol/.
8. Imtoo Dvd Subtitle ripper. (2013). URL: http://www.imtoo.com/dvd-subtitleripper.html.
9. Installing Java on Ubuntu. (2013). URL: http://ubuntuguide.net/install-oracle-java-jdk6-7-8-in-ubuntu-13-04
10. J.P. Lewis and Ulrich Neumann. Performance of java versus C++. (2013). URL:
http://idiom.com/zilla/Computer/javaCbenchmark.html.
11. Jubler subtitle editor in java, (2005). URL:http://www.jubler.org/.
12. Language model (Weather language model). (2013). URL: http://code.metager.de/source/xref/cmusphinx/sphinx4/models/language/weather/


13. Robust group's open source tutorial: learning to use the CMU Sphinx automatic speech recognition system. (2013). URL: http://www.speech.cs.cmu.edu/sphinx/tutorial.html.
14. Sphinx for the Java platform: architecture notes. (2013). URL: www.speech.cs.cmu.edu/sphinx/twiki/pub/Sphinx4/WebHome/Architecture.pdf.
15. Sphinx-4: a speech recognizer written entirely in the Java programming language, (2013). URL: http://cmusphinx.sourceforge.net/sphinx4/
16. Sphinx-4 (2013). URL: http://cmusphinx.sourceforge.net/sphinx4/
17. Statistical language modelling toolkit, (2013). URL: http://www.speech.cs.cmu.edu/SLM/toolkit.html.
18. Sphinx-4 architecture, (2013). URL: www.speech.cs.cmu.edu/sphinx/twiki/pub/Sphinx4/WebHome/Architecture.pdf
19. Subtitle editor, (2013). URL:http://home.gna.org/subtitleeditor/.
20. The CMU sphinx group open source speech recognition engines, (2013). URL:
http://cmusphinx.sourceforge.net/.
21. Ubuntu 13.04 (2013) URL: http://www.ubuntu.com/download/desktop
22. Vlc media player. (2013).URL: http://www.videolan.org/vlc/.
23. Voxforge. (2013) URL:http://www.voxforge.org/home
24. Welcome to the Ant wiki, (2013). URL: http://wiki.apache.org/ant/FrontPage.
25. Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea,
Peter Wolf, and Joe Woelfel (2004): Sphinx-4: A flexible open source framework for
speech recognition. SMLI TR-2004-0811, Sun Microsystems Inc. URL:
http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf.
26. Xilisoft Dvd subtitle ripper. (2013). URL:http://www.xilisoft.com/dvd-subtitleripper.html


27. Z. Ghahramani (2001): An introduction to hidden Markov models and Bayesian
networks. World Scientific Publishing Co., Inc., River Edge, NJ, USA.

