
Proceedings of the SMART–2019, IEEE Conference ID: 46866

8th International Conference on System Modeling & Advancement in Research Trends, 22nd–23rd November, 2019
College of Computing Sciences & Information Technology, Teerthanker Mahaveer University, Moradabad, India

Text Extraction through Video Lip Reading Using Deep Learning

S.M. Mazharul Hoque Chowdhury1, Mushfiqur Rahman2, Marzan Tasnim Oyshi3 and Md. Arid Hasan4
1,2,3,4Department of Computer Science & Engineering, Daffodil International University, Dhaka, Bangladesh
E-mail: 1mazharul2213@diu.edu.bd, 2mushfiqur.cse@diu.edu.bd, 3marzan@diu.edu.bd, 4arid15-5332@diu.edu.bd

Abstract—Automated text extraction from video data through lip reading can overcome the language barrier and open the door to opportunities in terms of security, connectivity and support for the physically challenged. The conversion is possible by analyzing facial expressions with a deep learning method. However, this conversion is a challenging task, because varieties of pronunciation and accent cause the same word to produce different facial expressions. In this research, a method for converting video data to text data through lip reading is proposed. The proposed method includes a test dataset, image frame analysis, and text output produced from the identified words. In the proposed technique, the test dataset is organized by combining all the possible facial expressions of different words.

Keywords: Automated, Audio, Conversion, Frame, Identification, Training Data, Video, Sequence, Word

I. Introduction

Technology is developing rapidly all over the world. People are introduced to new devices and technological solutions every single day. Cameras are among the most widely used devices, integrating new features and demonstrating their capabilities in different sectors. It is now possible to detect faces, recognize emotion from facial expressions, identify objects and much more. Medical imaging has even made it easy to detect diseases, brain damage, bone fractures, etc. At the same time, intelligent computer systems can now convert audio data into text data, which can be used to identify suspicious conversations and to assist people who struggle with hearing. Video data without audio is one of the most commonly available kinds of data, but retrieving audio information from it is a very difficult task. If we can identify and analyze the facial expressions in such video data, we will be able to retrieve text data from it. This can help reduce crime, or make it possible to understand what a person is trying to say who could previously talk but lost his voice in an accident. For that, we can simply convert the audio-less video data into text or speech. In this research we discuss how to convert this type of video data into text data.

II. Literature Review

A considerable amount of work has been done on the audio-to-video synchronization problem, but converting audio-less video data into audio or text (visual speech recognition) is still a rare case. These works are fundamentally based on specific methods developed by individual researchers. Zhou et al. [1] surveyed methods in this area; their review is brief but effective. Many studies in this field follow relatively similar approaches: extract information (features around the lips) and then classify it against a template. Pfister et al. [2] distinguished lip movement by the state of mouth openness using a single SIFT descriptor of the facial region. Pei et al. [3] described the state of the art on many databases, extracting features and then aligning them into motion patterns. Koller et al. [4] used a deep convolutional neural network to extract sign language information from mouth shapes. Similarly, Zoric et al. [5] encoded sample images, framed them, trained on them, and then classified them to produce a word-level classification. Chen [6] presented a speech-assisted frame-rate conversion method for speech-assisted video coding. Lavagetto [7] developed a multimedia telephone for hard of hearing persons, converting the conversation into graphic animation suitable for lip reading.

Word classification with a bag of words/lexicon has not previously been attempted in visual speech recognition; however, [8] has tackled the same problem in the context of text recognition.

Their work shows that it is possible to train a general and scalable word recognition model for a very large pre-defined vocabulary as a multi-class classification problem. Work on silent speech recognition based on lip movement was done by Ke Sun et al. (2018), with the purpose of improving mobile device interaction [9]. Joon Son Chung et al. (2018) worked on a lip reading technique in which only video data, without audio, was provided to supervise learning [10]. Sumita Nainan et al. (2018) worked on lip tracking using deformable models and geometric approaches [11]. Gelder et al. [12] discussed autistic children and compared their abilities with those of typical children; the research shows that memory control for lip reading is poorer in autistic children than in the general population. Summerfield [13] reviewed the physiological understanding of lip reading and audio-visual speech perception based on the speech signal. Lee et al. [14] extracted lip motion features from the differences between consecutive frames, using Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Luettin et al. [15] discussed a recent approach to identifying a speaker by lip reading from a small video database; they used spatio-temporal, speaker-specific features extracted from images by identifying the grey-level area of the mouth, and applied HMMs together with a mixture of Gaussians. Zhang et al. [16] presented a model that integrates color and spatial edge information to address the problem of lip feature extraction for speechreading; they used a prominent red hue as an indicator to locate the lip position, and, depending on the identified lip area, further refined the exterior and interior lip boundaries using color and spatial edge information combined within a Markov random field (MRF) framework. Nguyen et al. [17] proposed a recent technique for face detection and lip feature extraction using a real-time Field Programmable Gate Array (FPGA). We intend to take a similar approach.

III. Proposed Method

In this research we propose a method to convert video data without audio into text data. First, a test data set will be needed to compare against and identify the input data. The system can be divided into four main sections: test data, data preprocessing, identification of words, and output. In the test data section, raw video clips are taken as the input to the system. This data is preprocessed in the data preprocessing section, where it is split into frames. These frames are the processed data and will be matched with the training data in the word identification section. Finally, the system provides the result of the analysis in the output section. The data flow diagram is given below in Fig. 1.

Fig. 1: Data Flow Diagram of Proposed System

IV. Word Identification

The identification of a word will be based on the training data set provided to the system. For each and every word, a sequence of image frames will be stored. A database of words and their corresponding image frame sequences needs to be stored in the system as training data. When the system takes an input data set, it will first separate the video into sequences based on the time spent between speaking one word and the next. It will then split each video sequence into image frames so that they can be matched with the training data set.

The training data set will be built from the collected data. Data collection must be done for individual words, and for each word multiple videos will be accumulated, because the precision of the analysis depends on the amount of data used for training. In addition, the number of frames per word affects the precision of word detection. During data collection, blank frames that do not contribute to the word spoken by the person will be removed, because the more blank frames there are, the harder it becomes to detect a word: the system would first try to match the blank frames against the test data and would then fail to find the word accurately. Consequently, it is important to remove those blank frames from the training video data. On the other hand, if any audio data exists alongside the video data, that audio must also be eliminated, because audio plays no role in this system. A database must be created to store each word and its respective video data frames; this database will be used by the system for word identification.
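The frame extraction and word segmentation steps described above can be illustrated with a short sketch. The example below is a minimal illustration only, assuming OpenCV (cv2) and NumPy are available; the motion threshold, the minimum gap length, and all function and file names are hypothetical choices rather than parts of the proposed system.

import cv2
import numpy as np

def extract_frames(video_path):
    """Read an audio-less video clip and return its frames as grayscale images."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames

def split_into_word_sequences(frames, motion_thresh=4.0, min_gap=5):
    """Group frames into per-word sequences, using low inter-frame motion
    (the pause between two spoken words) as the separator and dropping the
    blank frames themselves, as described in Section IV."""
    sequences, current, still_run = [], [], 0
    prev = None
    for frame in frames:
        moving = prev is not None and np.mean(cv2.absdiff(frame, prev)) > motion_thresh
        prev = frame
        if moving:
            current.append(frame)
            still_run = 0
        else:
            still_run += 1
            # A long enough run of still (blank) frames ends the current word.
            if still_run >= min_gap and current:
                sequences.append(current)
                current = []
    if current:
        sequences.append(current)
    return sequences

# Example usage on a hypothetical input clip:
# word_sequences = split_into_word_sequences(extract_frames("input_clip.mp4"))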


This system depends entirely on the training data set provided to it: the more data it is given, the more precise it becomes. Depending on the angle or position of the speaker's face, the system may produce different output, because from a different side the same movement can look different. So, for a particular word, multiple training samples will be required from different angles. Moreover, the way a word is spoken varies from person to person, so it is also important to take training data from multiple users. Since there will be many training samples for a single word, the total amount of training data will be high as well, but a larger amount of data gives higher precision.

After building the training data set with enough pairs of words and their video frames, the system will be able to deal with the test data. To convert silent video data into text data, the system must handle both video data and text data. It therefore needs to map the user's lips and how they change while the user is speaking. The system first breaks the video into image sequences. Every person takes at least a minimum amount of time between speaking one word and the next, so identifying a word requires identifying this time. When someone speaks, they take a tiny break between words, perhaps only microseconds long. Someone may take longer than that, but this does not change anything in the system, because the minimum time does not change; the longer the break someone takes, the more accurately the system can identify a word.

After breaking the test video data into image sequences like TD-X in Fig. 2, the system splits them according to the time gaps between one spoken word and the next. During this process the system also eliminates the frames that fall within these gaps, because in the training data all blank or extraneous frames were likewise removed to avoid errors or word mismatches. Once the splitting is done as in Fig. 2, the system is ready to analyze the test data. As mentioned before, the system is already trained, so it matches every image sequence from the test data set against the training database. For each match, the system produces a word and stores it in a text file, then moves on to the next sequence.

Fig. 2: Image Sequence Matching Process

In this system, grammar is not the main priority; determining what someone is saying is the main focus, because from the basic text structure it is quite possible to infer what someone is talking about. In the text file, all the words must be placed in order according to their sequence in the video, because a broken order can change the entire meaning of the sentence. When all the sequences have been analyzed and their corresponding words written to the text file, the process ends and a text file is produced as output.

Since this system identifies words according to the movement of the lips, it will be possible to detect words of any language. Emotions can even be predicted from the face along with the word, making it easier to identify what the speaker really means by the sentence. This system has a very wide range of uses in real life. However, it also has some drawbacks and complexities: the huge amount of data cannot be stored in traditional databases, and if it is not handled carefully, the data will not give good results after analysis. There is no upper limit to the amount of training data that can be used to improve precision.

This work can be compared with the Lip-Interact technique of Ke Sun's team, which allows interaction between a smart device and a human [9]. The main similarity between our studies is that both process language from the lip movement of the user. In the Lip-Interact project, silent commands are collected from the user, and the model is trained on a set of fixed inputs such as do, undo, screenshot, open camera, and so on. The front camera of the smart device captures the input while the user looks at the screen and says something. To improve the quality of command recognition, they used Spatial Transformer Networks. Their training model uses real-time user input as the data set instead of random data or news data. This can be a good solution for a small vocabulary, because complexity is reduced: the possibility of inaccurate data is very low, noisy data is already preprocessed, and therefore the amount of training data stays within a range that can easily be handled. For general conversation, however, it remains difficult, and our research is still in progress to improve the analysis.

An example of the data training can be explained through the sample given below in Fig. 3.

Fig. 3: Data training for word recognition

The given sample, collected from the Lip-Interact research, shows that the camera, which works as a sensor in that research, acts as a lip gesture identifier, taking a snapshot of every movement from the video, with a time-gap condition deciding after how much time each image is taken. Figure 4 shows the lip movement while a user is saying something, taken from the research of Ahmad B. A. Hassanat on automated lip reading [18].
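Returning to the image sequence matching process of Fig. 2 described in Section IV, the loop that compares each segmented sequence against the stored word database and writes the recognized words to a text file in order could look roughly as follows. The database structure, the fixed sequence length, and the distance measure are illustrative assumptions, not fixed design choices of the proposed system; it also assumes all frames share the same resolution.

import numpy as np

def resample(seq, length=20):
    """Resample a frame sequence to a fixed number of frames so that
    sequences of different durations can be compared frame by frame."""
    idx = np.linspace(0, len(seq) - 1, length).astype(int)
    return np.stack([seq[i] for i in idx]).astype(np.float32)

def sequence_distance(seq_a, seq_b, length=20):
    """Mean absolute pixel difference between two resampled sequences."""
    return float(np.mean(np.abs(resample(seq_a, length) - resample(seq_b, length))))

def recognize_words(test_sequences, training_db):
    """Match every test sequence against each word's template sequences
    and return the recognized words in their original spoken order."""
    words = []
    for seq in test_sequences:
        best_word, best_dist = None, float("inf")
        for word, templates in training_db.items():
            for template in templates:
                d = sequence_distance(seq, template)
                if d < best_dist:
                    best_word, best_dist = word, d
        words.append(best_word)
    return words

# Example usage with a hypothetical database {word: [frame sequences]}:
# recognized = recognize_words(word_sequences, training_db)
# with open("output.txt", "w") as f:
#     f.write(" ".join(recognized))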


Each spoken word has its own type of movement, and the movement sequence also differs from word to word. The change in the position of the lips during speaking is shown in the figure below.

Fig. 4: Lip Movement While Speaking

This technique can fail for different reasons. One of the most common problems researchers face is similar lip gestures for similar words: it is sometimes difficult to tell what the speaker is trying to express and what the machine actually understands. With enough training and sample data, however, it is possible to reduce this kind of error using machine learning, since the machine can learn sentence construction from its training dataset. This analysis can also be done using a convolutional neural network.

Considering the previous studies and several techniques as well as research projects, accuracy varies in a structured way. In general, depending on the training data and the model built for the analysis, accuracy tends to lie between 85% and 95%, as observed in the references cited in this paper. Based on that study, we expect this analysis to reach about 90% accuracy on the data set we are working with.
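As a rough illustration of the convolutional neural network option mentioned above, the sketch below shows a small per-word classifier over fixed-length stacks of lip frames. The framework (PyTorch), input size, and layer configuration are illustrative assumptions; this is not the model used to obtain the accuracy figures cited above.

import torch
import torch.nn as nn

class LipWordClassifier(nn.Module):
    """Toy CNN over a fixed-length stack of grayscale lip frames.

    The input has shape (batch, frames, height, width); the frame dimension
    is treated as input channels, so the temporal pattern of the lip movement
    is learned jointly with its spatial appearance."""

    def __init__(self, num_words, num_frames=20, height=48, width=48):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_frames, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * (height // 4) * (width // 4), num_words)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Example: score a batch of 4 sequences against a 100-word vocabulary.
model = LipWordClassifier(num_words=100)
logits = model(torch.randn(4, 20, 48, 48))  # -> shape (4, 100)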
V. Conclusion

Day by day, the amount of video data is increasing worldwide, but the number of systems that can analyze this data is not yet sufficient. The day is not far when there will be no language barrier among people all over the world: they will understand each other and communicate in their own native languages. Based on lip reading together with voice and text data analysis, it is possible to build a system that can translate one language into another, so that someone from one country can say something in his native language and, through data analysis, someone from another country will understand it in his own language. Research on data science is making our daily life easier every day, and lip reading has its own important role to play in this development. Social development, crime reduction and more are possible through this type of analysis. So it is high time to start working with audio-less video data for our own development.

Acknowledgement

This research was conducted with the help of the DIL NLP and Machine Learning Lab, where enthusiastic researchers are still working on this topic. We hope that this lab will be able to present good research and products in the future.

References

[1] Z. Zhou, G. Zhao, X. Hong and M. Pietikäinen, "A review of recent advances in visual speech decoding," Image and Vision Computing, vol. 32, no. 9, pp. 590–605, September 2014.
[2] T. Pfister, J. Charles and A. Zisserman, "Large-scale learning of sign language by watching TV (using co-occurrences)," Proceedings of the British Machine Vision Conference, DOI: 10.5244/c.27.20, January 2013.
[3] Y. Pei, T. K. Kim and H. Zha, "Unsupervised random forest manifold alignment for lipreading," Proceedings of the IEEE International Conference on Computer Vision, DOI: 10.1109/ICCV.2013.23, pp. 129–136, December 2013.
[4] O. Koller, H. Ney and R. Bowden, "Deep learning of mouth shapes for sign language," Proceedings of the IEEE International Conference on Computer Vision Workshops, DOI: 10.1109/ICCVW.2015.69, pp. 85–91, December 2015.
[5] G. Zoric and I. S. Pandzic, "A real-time lip sync system using a genetic algorithm for automatic neural network configuration," IEEE International Conference on Multimedia and Expo, DOI: 10.1109/ICME.2005.1521684, pp. 1366–1369, July 2005.
[6] T. Chen, "Audiovisual speech processing, lip reading and lip synchronization," IEEE Signal Processing Magazine, vol. 18, pp. 9–21, January 2001.
[7] F. Lavagetto, "Converting speech into lip movements: A multimedia telephone for hard of hearing people," IEEE Transactions on Rehabilitation Engineering, vol. 3, March 1995.
[8] M. Jaderberg, K. Simonyan, A. Vedaldi and A. Zisserman, "Synthetic data and artificial neural networks for natural scene text recognition," Workshop on Deep Learning, NIPS, June 2014.
[9] K. Sun, C. Yu, W. Shi, L. Liu and Y. Shi, "Lip-Interact: Improving mobile device interaction with silent speech commands," Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, DOI: 10.1145/3242587.3242599, pp. 581–593, October 2018.
[10] J. S. Chung and A. Zisserman, "Learning to lip read words by watching videos," Computer Vision and Image Understanding, vol. 173, pp. 76–85, August 2018.
[11] S. Nainan and V. Kulkarni, "Lip tracking using deformable models and geometric approaches," in S. Satapathy and A. Joshi (eds.), Information and Communication Technology for Intelligent Systems, Smart Innovation, Systems and Technologies, vol. 106, pp. 655–663, December 2018.
[12] B. de Gelder, J. Vroomen and L. van der Heide, "Face recognition and lip-reading in autism," European Journal of Cognitive Psychology, vol. 3, no. 1, pp. 69–86, 1991, DOI: 10.1080/09541449108406220.
[13] Q. Summerfield, "Lipreading and audio-visual speech perception," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 335, no. 1273, pp. 71–78, 1992, DOI: 10.1098/rstb.1992.0009.
[14] K. D. Lee, M. J. Lee and Soo-Young Lee, "Extraction of frame-difference features based on PCA and ICA for lip-reading," Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, 2005, DOI: 10.1109/ijcnn.2005.1555835.
[15] J. Luettin, N. A. Thacker and S. W. Beet, "Speaker identification by lipreading," Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), DOI: 10.1109/icslp.1996.607030.
[16] X. Zhang and R. M. Mersereau, "Lip feature extraction towards an automatic speechreading system," Proceedings of the 2000 International Conference on Image Processing, DOI: 10.1109/icip.2000.899336.
[17] D. Nguyen, D. Halupka, P. Aarabi and A. Sheikholeslami, "Real-time face detection and lip feature extraction using field-programmable gate arrays," IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol. 36, no. 4, pp. 902–912, 2006, DOI: 10.1109/tsmcb.2005.862728.
[18] A. B. A. Hassanat, "Visual speech recognition," in Speech and Language Technologies, 2011.