Project Report 2022-23: Acoustic

CONTENTS
CHAPTER NO. TITLE PAGE NO.
1 INTRODUCTION 1
1.1 Background 1
1.2 Existing System 2
1.3 Problem Statement 3
1.4 Objectives 3
1.5 Scope 4
2 LITERATURE REVIEW 5
3 SYSTEM ANALYSIS 7
3.1 Expected System Requirements 7
3.2 Feasibility Study 8
3.2.1 Technical Feasibility 9
3.2.2 Operational Feasibility 9
3.2.3 Economic Feasibility 10
3.3 Hardware Requirements 11
3.4 Software Requirements 12
3.5 Life Cycle Used 13
4 SOFTWARE REQUIREMENT SPECIFICATION 15
4.1 OpenCV
4.2 Tesseract 16
4.3 Xampp
4.4 Speech Recognition
5 SYSTEM DESIGN 17
5.1 Detailed Design
5.2 Working OpenCV with Tesseract
6 SYSTEM IMPLEMENTATION 22
6.1 Flow chart
6.2 Algorithm
7 ADVANTAGES & DISADVANTAGES 25
8 SYSTEM TESTING 27

9 SCREENSHOTS 29
10 RESULT ANALYSIS 30
11 CONCLUSION & FUTURE SCOPE 32
REFERENCES 34


LIST OF FIGURES
FIGURE NO. TITLE PAGE NO.
1 Level 1 diagram 18
2 Detailed design 19
3 Audio to text 22
4 Video to text 23
5 Work flow diagram 23
6 Application run in console 29
7 Image conversion into gray scale 29


CHAPTER 1
INTRODUCTION
Visual speech recognition is a technology that enables machines to identify and transcribe
spoken words by analyzing the movement of the speaker's lips and other facial features.
Traditional speech recognition systems rely on audio input, but visual speech recognition can
be used in noisy environments or for individuals with hearing impairments. When it comes to
multiple languages, visual speech recognition can be challenging due to the different phonetic
characteristics and lip movements associated with each language. The purpose of this project
is to develop a visual speech recognition system that can recognize speech in multiple
languages.
Visual Speech Recognition (VSR) is a field that deals with the automatic recognition of speech
from visual cues such as lip movements, facial expressions, and gestures. VSR has many
applications in fields such as speech recognition, computer vision, human-computer
interaction, and assistive technology. In this report, we will discuss the state-of-the-art research
in VSR using multiple languages. Speech is an important medium of communication: it is easy and natural, anyone can speak without the help of any device, and no technical skill set is needed. The problem with primitive interfacing devices is that a basic level of technical skill is necessary to use them, so people who lack that skill set find it difficult to interact with such devices. Since the main focus of this work is speech recognition, no technical skill set is required, and this helps people speak to computers in a language they know rather than giving inputs through the system's other input devices. Nowadays, common technological issues concern computer usage, such as how effective the interaction with computers is and how user-friendly it is compared with conventional methods. Knowing English has become almost compulsory for interacting with computers and accessing information technology, which keeps many common people away from computers and other electronic devices. With the rapid improvement in information technology, it is necessary for common people to stay in the lane of technological growth.

1.1 BACKGROUND
Visual speech recognition is a technology that enables machines to recognize and interpret human speech by analyzing
the movements of the lips, tongue, and other facial features. This technology has been widely
used in applications such as speech-to-text transcription, language learning, and human-
computer interaction.
When it comes to recognizing speech in multiple languages, visual speech recognition becomes
more challenging. Different languages have unique phonetic features and require different lip
and tongue movements to produce sounds. Thus, developing a visual speech recognition system
that can recognize multiple languages requires a comprehensive understanding of the phonetics
of each language and the ability to distinguish between them.
Several research studies have explored the development of visual speech recognition systems
that can recognize multiple languages. One approach is to use machine learning algorithms to
train models that can recognize visual features across different languages. Another approach is
to develop language-specific visual features that can be used to distinguish between different
languages.
Despite these efforts, the development of a reliable visual speech recognition system that can
recognize multiple languages remains a challenging task. Some of the major challenges include
the variability of lip movements across different speakers, the presence of accents and dialects,
and the need for large datasets to train and evaluate the performance of the system.
Nevertheless, visual speech recognition remains an active area of research, and advancements
in this field could lead to significant improvements in speech recognition technology and
language learning.

1.2 EXISTING SYSTEM


There are several existing speech recognition systems that support multiple languages, although most of them rely primarily on audio rather than visual input. Here are some examples:
Google Cloud Speech-to-Text: This is a cloud-based speech recognition system that supports over 120 languages and dialects. It can transcribe spoken words in real time and can also recognize different speakers in a conversation.

Microsoft Azure Speech Services: This is another cloud-based speech recognition system that supports over 60 languages and dialects. It can transcribe speech in real time, recognize different speakers, and perform speech-to-text translation.

Amazon Transcribe: This is a speech recognition service from Amazon Web Services that supports over 31 languages and dialects. It can transcribe audio and video files in real time and can also perform speaker recognition.


1.3 PROBLEM STATEMENT


The problem statement of visual speech recognition using multiple languages is to develop a
system that can accurately recognize and transcribe spoken words in different languages by
analyzing the lip movements and facial expressions of the speaker. This is a challenging task
because different languages have unique phonetic characteristics, and the visual cues associated
with these sounds can vary depending on the speaker's dialect and accent.
Additionally, there may be variations in the lighting, background, and camera quality that can
affect the accuracy of the visual speech recognition system. Another challenge is to ensure that
the system can recognize and distinguish between multiple speakers, particularly in situations
where there is background noise or overlapping speech.
Therefore, the problem statement of visual speech recognition using multiple languages
involves developing a system that can overcome these challenges and accurately recognize
spoken words in different languages by analyzing the visual cues associated with speech
sounds, while also addressing issues related to speaker identification, noise reduction, and
variability in lighting and camera quality.

1.4 OBJECTIVES
The objectives of visual speech recognition using multiple languages are as follows:

• To develop a system that can accurately recognize spoken words in different languages by analyzing the visual cues associated with speech sounds, such as lip movements and facial expressions.

• To create a system that can distinguish between different speakers, particularly in situations where there is background noise or overlapping speech.

• To ensure that the system is robust and reliable, and can perform well in a variety of lighting and camera conditions.

• To support a wide range of languages and dialects, and to be able to adapt to different accents and speaking styles.

• To improve the accessibility of speech recognition technology by providing support for languages that are not widely spoken or may not have established speech recognition systems.

• To enable automatic subtitling and transcription of video and audio content in different languages, making it easier for users to access and understand multimedia content.

1.5 SCOPE
Visual speech recognition (VSR) is the process of recognizing spoken language from visual
cues such as lip movements, facial expressions, and gestures. VSR has a wide range of
applications, including in the fields of speech recognition, computer vision, and robotics.
The scope of VSR using multiple languages is significant, as it can enable communication
across language barriers. With VSR, individuals who speak different languages can
communicate with each other through visual cues, even if they do not share a common
language. This can be particularly useful in situations such as international conferences, where
participants may speak different languages, or in emergency situations where quick
communication is essential. In addition to facilitating communication across languages, VSR
can also be used to enhance speech recognition accuracy. By incorporating visual cues, VSR
can help to disambiguate speech sounds that are difficult to distinguish acoustically, improving
speech recognition accuracy in noisy environments or for individuals with speech impairments.


CHAPTER 2
LITERATURE REVIEW
Visual speech recognition (VSR) using multiple languages has been the subject of extensive
research in recent years. A literature review of studies in this area reveals the following key
findings:
Zhang et al. (2018) [1] showed that visual speech recognition can significantly improve speech recognition accuracy, particularly in noisy environments: VSR improved the recognition accuracy of Mandarin Chinese syllables by 7.5% in noisy conditions.

Huang et al. (2019) [2] found that participants were able to recognize spoken phrases in different languages (including Mandarin Chinese, English, and Japanese) through visual cues, even if they did not speak the language, suggesting that VSR can be used to facilitate communication across language barriers.

Gao et al. (2020) [3] showed that the accuracy of VSR can be affected by factors such as lighting conditions, camera angles, and speaker variability; in particular, VSR accuracy decreased when there was a large variation in speakers' lip movements.

Petridis et al. (2020) [4] demonstrated that deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are effective for VSR; a CNN-RNN model was used to recognize spoken phrases in multiple languages (including English, Spanish, and Greek) with high accuracy.

R. Chaudhry et al. (2021) [5] proposed a deep learning-based approach for audio-visual speech recognition (AVSR) using multiple languages. The authors used a dataset called AV-LRS (Audio-Visual Language Recognition Dataset), which contains audio and visual data from five different languages. The proposed approach achieved an accuracy of 87.8% on the AV-LRS dataset, outperforming state-of-the-art approaches.

N. H. Nguyen et al. (2020) [6] proposed a multimodal fusion approach for VSR that can recognize speech in multiple languages using both visual and audio data. The authors used the GRID dataset, which contains visual and audio data from three
different languages. The proposed approach achieved an accuracy of 80.8%, outperforming the
baseline approach that used only visual data.

Overall, the above studies demonstrate the feasibility and effectiveness of VSR using multiple
languages. The use of datasets that contain speech in multiple languages and the development
of cross-lingual models can improve the performance of VSR in multilingual environments.
The literature suggests that VSR using multiple languages has significant potential for
improving speech recognition accuracy, facilitating communication across language barriers,
and improving the performance of speech recognition systems for individuals with speech
impairments. However, further research is needed to address issues such as speaker variability
and lighting conditions, and to explore the potential of deep learning models for VSR in
multiple languages.


CHAPTER 3
SYSTEM REQUIREMENT ANALYSIS
A system analysis of visual speech recognition (VSR) using multiple languages can help
understand the various components and processes involved in this technology.

The system analysis approach includes five components: Input, Process, Output, Feedback, and
Control.
Input: The input to VSR using multiple languages includes audio and visual data from multiple
languages. The audio data contains spoken words, and the visual data contains information on
lip movements, facial expressions, and head gestures.

Process: The process involved in VSR using multiple languages includes the development of
cross-lingual models that can recognize speech in multiple languages. This process involves
training the models on datasets that contain speech in multiple languages and developing
algorithms that can extract relevant features from the audio and visual data.

Output: The output of VSR using multiple languages is the recognition of spoken words from
visual cues. The output can be in the form of text or audio output, depending on the application.

Feedback: Feedback in VSR using multiple languages is crucial for improving the models as new data is collected.

Control: Control involves monitoring recognition performance and adjusting the models and system parameters in response to that feedback.

3.1 PROPOSED SYSTEM


Visual speech recognition is a field of study that focuses on using video recordings of the
human face to recognize spoken words. It is an interdisciplinary field that combines computer
vision, signal processing, and machine learning. The following is a methodology for
developing a visual speech recognition system that can recognize multiple languages:

Data Collection: The first step is to collect video data for the different languages you want to
recognize. You need a large dataset of videos of people speaking the languages you want to
recognize. The videos should capture different speakers with different accents and speaking
styles.


Preprocessing: Preprocessing the data involves cleaning and normalizing the data. This step
includes removing background noise, normalizing the lighting conditions, and aligning the
videos to a common time frame.

Feature Extraction: Extracting relevant features from the video data is essential for visual
speech recognition. One popular approach is to use 3D Convolutional Neural Networks (CNN)
to extract features from the video frames.

Model Selection: Once the language has been identified, the system would select an appropriate
model to recognize the spoken words in that language. This could involve using Hidden
Markov Models (HMM), Support Vector Machines (SVM), or Recurrent Neural Networks
(RNN).
Training and Evaluation: The model would be trained and evaluated on a large dataset of
labeled video data. The evaluation metrics could include accuracy, precision, recall, and F1
score.
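
As an illustration of this evaluation step, the sketch below computes those metrics with scikit-learn. The library choice and the toy labels are assumptions made for the example, not part of the report's design.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy ground-truth and predicted word labels for a held-out test set
y_true = ["hello", "world", "hello", "speech", "world"]
y_pred = ["hello", "world", "world", "speech", "world"]

accuracy = accuracy_score(y_true, y_pred)
# Macro-averaging treats every class (word) equally, regardless of frequency
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")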

3.2 EXPECTED SYSTEM REQUIREMENTS


The expected system requirements of visual speech recognition (VSR) using multiple
languages depend on the specific application and the complexity of the models used. Here are
some general requirements that can be expected for VSR using multiple languages:

Hardware Requirements: VSR using multiple languages requires powerful hardware, such as high-performance CPUs, GPUs, and ample memory. This is because VSR involves processing large amounts of audio and visual data, and developing complex models for feature extraction and recognition.

Model Requirements: VSR using multiple languages requires complex models that can
recognize speech in multiple languages. These models should be able to extract relevant
features from both audio and visual data and combine them to improve recognition accuracy.
Examples of models that can be used for VSR include convolutional neural networks (CNNs),
recurrent neural networks (RNNs), and transformer-based models.

Performance Requirements: VSR using multiple languages requires high-performance models that can achieve high accuracy and real-time processing. This is important for applications such as speech recognition in noisy environments or speech recognition for the hearing-impaired.

In summary, VSR using multiple languages requires powerful hardware, specialized software,
large and diverse datasets, complex models, and high-performance requirements. These
requirements can vary depending on the specific application and the complexity of the models
used.

3.3 FEASIBILITY STUDY


Feasibility analysis of visual speech recognition using multiple languages would depend on
several factors, including:

Availability of data: Visual speech recognition requires large amounts of data to train and improve accuracy. The availability of data for multiple languages can be a challenge: the more languages considered, the more data required.

Complexity of languages: Some languages may be more complex than others, making it more difficult to accurately recognize visual speech. This may impact the accuracy of the system.

Technical requirements: Developing a visual speech recognition system requires expertise in computer vision, machine learning, and natural language processing. It may be challenging to find individuals or teams with the required expertise in multiple languages.

Language-specific characteristics: Different languages have unique characteristics, including variations in pronunciation, intonation, and facial expressions. These factors can impact the accuracy of visual speech recognition.


3.3.1 TECHNICAL FEASIBILITY


The technical feasibility of visual speech recognition using multiple languages depends on the
state of the art in computer vision, machine learning, and natural language processing. Here are
some of the technical considerations that are important to consider:

Data availability: One of the biggest challenges in developing a visual speech recognition
system is the availability of high-quality data. For multiple languages, this challenge is
compounded because there are fewer resources available for some languages. To overcome this
challenge, researchers may need to create their own datasets or work with smaller, more
targeted datasets.

Feature extraction: A key step in visual speech recognition is extracting relevant features from
visual input. This typically involves analyzing video frames and identifying relevant facial
features and movements. Extracting features accurately is crucial for building accurate models.

Language-specific characteristics: Different languages have different phonetic and prosodic characteristics that can impact visual speech recognition. For example, some languages may have subtle differences in the way certain sounds are pronounced or may use different facial expressions to convey meaning. Accounting for these differences is important for developing accurate models.
3.3.2 OPERATIONAL FEASIBILITY
Operational feasibility refers to the practicality of implementing a new system or technology.
In the case of visual speech recognition using multiple languages, there are several operational
considerations that need to be taken into account:
User acceptance: One of the primary operational considerations is user acceptance. If the
system is not user-friendly, or if it does not provide value to users, it is unlikely to be widely
adopted. This means that the system needs to be designed with user needs in mind, and it should
be tested with users to ensure that it meets their needs.

Cost: Developing and implementing a visual speech recognition system can be expensive. This
includes the cost of data collection, infrastructure, and ongoing maintenance. To be feasible,
the system needs to be cost-effective and provide sufficient benefits to justify the investment.


Integration with existing systems: In many cases, visual speech recognition systems will need
to be integrated with existing systems, such as communication platforms, customer service
tools, or healthcare systems. Ensuring that the system can be easily integrated with existing
systems is an important operational consideration.

3.3.3 ECONOMIC FEASIBILITY


The economic feasibility of visual speech recognition using multiple languages depends on
several factors, including the cost of development, deployment, and maintenance of the system,
as well as the potential benefits of the system. Here are some of the economic considerations
that are important to consider:

Development costs: Developing a visual speech recognition system can be expensive, particularly if it requires significant data collection and technical expertise. The cost of development will depend on factors such as the number of languages supported, the complexity of the system, and the availability of existing tools and resources.

Deployment costs: Once the system has been developed, there may be additional costs associated with deployment, such as the cost of hardware and software infrastructure, integration with existing systems, and training for users.

Maintenance costs: Visual speech recognition systems require ongoing maintenance to ensure that they continue to operate effectively. This may include costs for monitoring, troubleshooting, updating the system, and training new users.
Benefits of the system: The potential benefits of visual speech recognition using multiple
languages can include increased efficiency, improved accuracy, and enhanced communication.
These benefits can translate into cost savings or increased revenue for businesses and
organizations that use the system.

Return on investment (ROI): The economic feasibility of the system will ultimately depend on
the ROI, which compares the costs of development, deployment, and maintenance to the
benefits generated by the system. To be economically feasible, the ROI should be positive,
indicating that the benefits of the system outweigh the costs.


3.4 HARDWARE REQUIREMENTS


The hardware requirements for visual speech recognition using multiple languages will depend
on the specific system design and the complexity of the task. Here are some hardware
requirements that may be necessary for a visual speech recognition system using multiple
languages:

1) Camera

2) Microphones

3) Processing power

4) Memory

5) Storage

6) Network connectivity

7) 16 GB RAM


3.5 SOFTWARE REQUIREMENTS


The software requirements for visual speech recognition using multiple languages will depend
on the specific system design and the underlying algorithms used for speech recognition. Here
are some of the software requirements that may be necessary for a visual speech recognition
system using multiple languages:

Video processing software: A visual speech recognition system requires software that can
process video data to extract facial and mouth movement information. This may include
software for facial detection, landmark detection, and image processing.

Audio processing software: In addition to video processing software, a visual speech recognition system will require software that can process audio data. This may include software for noise reduction, signal processing, and speech enhancement.

Speech recognition software: The system will require software that can recognize speech from the extracted video and audio data. This may include machine learning algorithms such as deep neural networks (DNNs) or hidden Markov models (HMMs), or a combination of different approaches.

Language models: To support multiple languages, the system will require language models for each language, which can be used to improve the accuracy of speech recognition. These language models may be developed in-house or obtained from third-party sources.

Integration software: Depending on the system design, the visual speech recognition system may require integration software to connect with other systems, such as databases or other speech recognition servers.

User interface software: The system may require software to provide a user interface for users
to interact with the system. This may include software for displaying video data, accepting user
input, and providing feedback to users.

Overall, the software requirements for visual speech recognition using multiple languages will
depend on the specific system design and the underlying algorithms used for speech
recognition. The system will require software for video processing, audio processing, speech
recognition, language models, integration, and user interface. Careful consideration of these
software requirements is essential for the development and deployment of an effective visual
speech recognition system.


3.6 LIFE CYCLE USED


The life cycle used in visual speech recognition using multiple languages typically follows a
software development life cycle (SDLC) model. Here are the common stages in the SDLC for
visual speech recognition systems using multiple languages:

Requirements gathering: In this stage, the requirements for the visual speech recognition
system are gathered from stakeholders, such as users, business owners, and technical experts.
The requirements may include the languages to be supported, the accuracy of speech
recognition, the performance requirements, and the user interface requirements.

System design: In this stage, the system architecture, software design, and algorithms used for
speech recognition are developed based on the requirements gathered in the previous stage.
This may include the selection of video processing and speech recognition software, the design
of the user interface, and the development of language models.

Implementation: In this stage, the visual speech recognition system is built using the system
design developed in the previous stage. This may include the development of software
modules, integration of different components, and testing of the system.

Testing: In this stage, the visual speech recognition system is tested to ensure that it meets the
requirements defined in the first stage. This may include functional testing, performance
testing, and user acceptance testing.

Deployment: In this stage, the visual speech recognition system is deployed to the production
environment. This may include the installation of software and hardware components,
configuration of the system, and training of users.

Maintenance: In this stage, the visual speech recognition system is maintained to ensure that it
continues to operate effectively. This may include bug fixing, performance optimization, and
updates to language models.

The life cycle used in visual speech recognition using multiple languages is iterative, with each
stage building upon the previous one. The requirements gathering and system design stages
may be repeated multiple times as the system evolves and new requirements are identified.
Overall, the life cycle used in visual speech recognition using multiple languages follows a
standard SDLC model, with specific adaptations and considerations for the unique
requirements of speech recognition systems.

CHAPTER 4
SOFTWARE REQUIREMENT SPECIFICATION
OPENCV
OpenCV is a popular package frequently used in Python-based applications and is one of the best packages for image and video processing. OpenCV can be used for various purposes such as image recognition, image cropping, image preprocessing, and other video operations.

In our application we used OpenCV to open the camera and capture video frames. With the help of OpenCV it is possible to open any video input port; in this application we use the default camera or a webcam as the input device to read video frames. After the video is read, it is converted into a sequence of image frames, and these image frames are converted from RGB to grayscale for better preprocessing. The preprocessed images are passed into the Tesseract module. OpenCV is a highly compatible library, which means it rarely fails to give results even in difficult conditions, and it has a large community that makes it better year by year.
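
The snippet below is a minimal sketch of this capture-and-preprocess step; the camera index 0 and the frame count of 30 are illustrative assumptions rather than values fixed by the report.

import cv2

cap = cv2.VideoCapture(0)  # default camera; an external webcam may use index 1

gray_frames = []
for _ in range(30):  # illustrative: grab about one second of frames at 30 fps
    ok, frame = cap.read()  # read one BGR frame from the video stream
    if not ok:
        break
    # Convert the colour frame (BGR in OpenCV) to grayscale for preprocessing
    gray_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

cap.release()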

TESSERACT
We use OpenCV to capture image frames from live video; after capture, these image frames are passed to Tesseract to determine which word is spoken. Tesseract is an open-source OCR library from Google. The results are accurate because the system combines them with the results from the audio stream. Tesseract supports a wide range of languages, such as English, French, and other foreign languages. The Tesseract OCR engine is used in our system because it is under continuous development, unlike some other packages, and it is easy to integrate into our system. Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks, including layout analysis.
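
A minimal sketch of the hand-off to Tesseract through the pytesseract wrapper, assuming a grayscale frame from the OpenCV stage saved as frame.png; the file name and the language code "eng" are illustrative assumptions.

import cv2
import pytesseract

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # a preprocessed frame
# Otsu thresholding gives a clean binary image before OCR
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# lang may be any installed Tesseract language pack, e.g. "eng" or "fra"
text = pytesseract.image_to_string(binary, lang="eng")
print(text)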

XAMPP
XAMPP is a local server; an application run on a local server is available only on the corresponding device. We can start MySQL, Apache, Tomcat, and other services just by clicking the start button, and its minimal interface makes it very user friendly. XAMPP is mostly used for educational and development purposes.


SPEECH RECOGNIZER
The speech recognizer is built on SpeechRecognition, an open-source Python library (backed by Google's Web Speech API by default) used for opening an audio jack or audio device to capture sound signals. It is capable of identifying audio devices such as an external microphone or the default microphone.
In our project we can use both internal and external microphones to capture audio. After the audio is captured, it is analyzed for noise; if any is found, it is cleaned using the recognizer package. The audio is then converted into the corresponding text using the speech-to-text module.
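
A minimal sketch of this audio path using the SpeechRecognition package; recognize_google calls Google's free Web Speech API (so an internet connection is needed), and the language code is an illustrative assumption.

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:  # default microphone; pass device_index for an external one
    recognizer.adjust_for_ambient_noise(source)  # sample background noise to calibrate
    audio = recognizer.listen(source)            # capture one utterance

try:
    # Convert the captured audio to text; language can be e.g. "en-IN" or "fr-FR"
    print(recognizer.recognize_google(audio, language="en-IN"))
except sr.UnknownValueError:
    print("Speech was not intelligible")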


CHAPTER 5
SYSTEM DESIGN

System design is the process of defining the architecture, modules, interfaces, and data of a system to satisfy specified requirements.

(Level 0 diagram)

The Level 0 diagram shows the flow of data and processes involving three components: OpenCV, Tesseract, and speech recognition. Here is a brief explanation of each component and their interactions:

1. OpenCV: OpenCV (Open Source Computer Vision Library) is an open-source computer vision and image processing library. It provides various algorithms and functions for tasks such as image and video processing, object detection, and recognition. In the diagram, OpenCV is the initial component where the input data, likely an image or video frame, is processed.


2. Tesseract: Tesseract is an optical character recognition (OCR) engine developed by Google. It is capable of extracting text from images or scanned documents. In the diagram, the processed data from OpenCV, which likely contains text, is passed to Tesseract for OCR processing. Tesseract analyzes the image and converts the text content into machine-readable text data.

3. Speech Recognition: The third component in the diagram is speech recognition. Speech
recognition is the process of converting spoken language into written text. It involves capturing
audio input, analyzing the audio signals, and transcribing them into textual information. In this
case, speech recognition is another method of obtaining text data, likely from audio input.

Both Tesseract and speech recognition components generate text data as their output, and the
objective is to validate which one is better. The validation process likely involves comparing
the results obtained from both components and assessing factors such as accuracy, speed, and
robustness. The output of the validation process would be the determination of which
component performs better in terms of text extraction.

Overall, the diagram represents a pipeline where OpenCV is used to preprocess the data,
Tesseract is applied for OCR-based text extraction from images, and speech recognition is used
to convert spoken language into written text. The evaluation of the components would help
determine the better option for text extraction based on the specific requirements and criteria
considered.

Detailed Design

Fig.2 detailed design


1. Video Input: The process begins with a video file as the input. This video file can be in
various formats such as MP4, AVI, or MOV. OpenCV, being a powerful computer vision
library, can read the video file and extract individual frames from it. Each frame represents a
still image captured at a specific moment in the video.

2. Frame Processing: Once the frames are extracted, OpenCV employs various computer vision
algorithms and techniques to process each frame. The exact processing steps may vary
depending on the specific requirements of the application. Some common tasks in frame
processing include noise reduction, image enhancement, object detection, or image
segmentation. The goal is to optimize the quality and clarity of the frames to improve the
subsequent OCR process.

3. Optical Character Recognition (OCR) using Tesseract: After frame processing, each frame
is passed to Tesseract, an open-source OCR engine developed by Google. Tesseract analyzes
the visual patterns in the frames, recognizes characters, and converts them into machine-
readable text data. It applies sophisticated algorithms and machine learning techniques to
identify and interpret text within the images.

4. Text Validation and Consolidation: The text extracted by Tesseract may contain errors or
inaccuracies, especially in the case of complex or distorted images. In this step, various
techniques can be employed to validate and improve the accuracy of the extracted text. This
may involve spell-checking, language modeling, or comparison with a dictionary or known
vocabulary. The goal is to enhance the quality and reliability of the extracted text.

5. Audio Input: While the video frames are being processed, the original video file is also
analyzed for its audio component. The audio is extracted from the video and used as input for
speech recognition. Audio extraction is typically done using appropriate libraries or techniques
depending on the video file format.

6. Speech Recognition: The extracted audio is passed through a speech recognition system.
This system employs advanced techniques such as acoustic modeling, language modeling, and
speech-to-text conversion algorithms. It analyzes the audio signals, captures spoken words, and converts them into written text. The output is a textual representation of the spoken words
present in the audio.


7. Text Integration: At this point, the text obtained from Tesseract (OCR) and the speech
recognition system are available as separate text outputs. In this step, the texts generated from
both sources can be integrated and merged. They can be aligned with the corresponding
timestamps in the video or audio to create a unified representation. The integration ensures that
the textual information is synchronized with the visual and audio content.

8. Text Evaluation and Comparison: The integrated text outputs are evaluated and compared
to determine their accuracy, consistency, and quality. Factors such as word error rate,
readability, and context coherence are considered during the evaluation process. This
evaluation helps in assessing the performance of each method (OCR and speech recognition)
and their suitability for the specific application.

9. Output Selection: Based on the evaluation, a decision is made regarding which text source
to select as the final output. The decision can be based on predefined criteria or specific
requirements of the application. For example, if the OCR-based text is found to be more
accurate and reliable for the given video content, it may be chosen as the preferred output.
Similarly, if the speech recognition-based text shows better results, it may be selected instead.

10. Output Presentation: Finally, the selected text output is presented in the desired format.
This could involve displaying the text on a screen, saving it to a file, or integrating it into a
larger system or application. The output format and presentation depend on the intended use
case and user requirements.

In summary, the pipeline involves extracting frames from a video file, processing the frames with computer vision techniques, performing OCR using Tesseract to extract text from the images, extracting audio from the video file, applying speech recognition to convert the audio into text, integrating the text outputs, evaluating their accuracy and quality, selecting the best output based on the evaluation, and presenting the final text output according to the desired format. This comprehensive approach leverages the strengths of both OCR and speech recognition to generate accurate and reliable text information from video and audio sources.
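
As a concrete illustration of steps 7 to 9, the sketch below merges the two text streams into a final choice. The vocabulary-overlap score is a hypothetical stand-in for whatever validation metric an application would actually use (word error rate, a language-model score, and so on).

def choose_transcript(ocr_text: str, asr_text: str, vocabulary: set) -> str:
    """Pick the candidate whose words overlap more with a known vocabulary."""
    def score(text: str) -> float:
        words = text.lower().split()
        # Fraction of words found in the vocabulary; empty text scores zero
        return sum(w in vocabulary for w in words) / len(words) if words else 0.0

    return ocr_text if score(ocr_text) >= score(asr_text) else asr_text

# Example usage with toy inputs
vocab = {"hello", "world", "speech", "recognition"}
print(choose_transcript("hello world", "helo wrld", vocab))  # -> "hello world"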


CHAPTER 6

SYSTEM IMPLEMENTATION

Implementation is the process that actually yields the lowest-level system elements in the
system hierarchy.

6.1 FLOW CHART

Fig 6.1 Audio to text


Fig 6.2 Video to text

Fig 6.3 Work flow diagram


6.2 ALGORITHM

1) Video frames are captured using cv2. The video frames should be clear.

2) These video frames are then passed to OpenCV for the following processes:

a. Video to image conversion

b. RGB to grayscale conversion of images

c. Preprocessing of the images

3) The output accuracy of this system depends on the above-mentioned steps.

4) After conversion to grayscale and binary preprocessing, each image frame is passed into Tesseract.

5) Simultaneously, the speech recognizer captures audio.

6) The video frames and audio frames are converted into text.

7) Both texts are analyzed to determine the meaning and the best result, as sketched below.
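
A condensed sketch of this algorithm, assuming the libraries introduced in Chapter 4 (opencv-python, pytesseract, SpeechRecognition); the camera index, frame count, and use of Google's Web Speech API are illustrative assumptions.

import cv2
import pytesseract
import speech_recognition as sr

def video_to_text(num_frames: int = 30) -> str:
    """Steps 1-4: capture frames, grayscale and binarize them, then run OCR."""
    cap = cv2.VideoCapture(0)
    pieces = []
    for _ in range(num_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # RGB/BGR to grayscale
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        pieces.append(pytesseract.image_to_string(binary))
    cap.release()
    return " ".join(p.strip() for p in pieces if p.strip())

def audio_to_text() -> str:
    """Step 5: capture one utterance from the microphone and transcribe it."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""

# Steps 6-7: produce both texts; downstream code compares them and keeps the best
print("video text:", video_to_text())
print("audio text:", audio_to_text())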


CHAPTER 7
ADVANTAGES & DISADVANTAGES

ADVANTAGES
Language Flexibility: By incorporating multiple languages, visual speech recognition systems
become versatile and capable of understanding and transcribing speech from different linguistic
backgrounds. This flexibility enables users to interact with the system in their preferred
language, facilitating broader accessibility and user engagement.

Multilingual Communication: Visual speech recognition with multiple languages enhances communication between individuals who speak different languages. It allows real-time translation or transcription of spoken content, enabling seamless conversations and bridging language barriers.

Cross-Lingual Learning: Training visual speech recognition models with multiple languages
promotes cross-lingual learning. The models can leverage shared phonetic patterns and visual
cues across languages, improving their overall accuracy and performance.

Improved Speech Understanding: Incorporating multiple languages enhances the contextual understanding of speech. The visual cues captured during the recognition process can help disambiguate ambiguous words or phrases, reducing errors and improving the accuracy of transcription or translation.

Enhanced Accessibility: Visual speech recognition with multiple languages benefits individuals with hearing impairments or those who rely on visual information for communication. By leveraging lip movements and facial expressions, the system can provide more accessible and inclusive services, enabling a broader range of users to engage effectively.

Multicultural Applications: In diverse societies or global settings, visual speech recognition with multiple languages can find numerous applications. It can be used in international conferences, multilingual customer service, language learning platforms, or any situation
where effective communication across languages is essential.


Expanded User Base: By supporting multiple languages, visual speech recognition systems can cater to a larger user base. This expands the reach of the technology and increases its adoption, benefiting both individuals and organizations looking for effective speech-related solutions.

It is important to note that the effectiveness and performance of visual speech recognition systems may vary based on the availability and quality of training data for each language. Additionally, technical challenges associated with accent variations, dialects, or unique linguistic characteristics must be considered when developing such systems.
DISADVANTAGES
Increased Complexity: Supporting multiple languages in a visual speech recognition system
adds complexity to the overall design and implementation. Different languages have distinct
phonetic characteristics, pronunciation variations, and grammar rules. Handling the diverse
linguistic aspects requires additional development effort, resources, and maintenance.

Data Requirements: Training accurate visual speech recognition models for multiple languages
requires a significant amount of language-specific training data. Acquiring and annotating
large-scale datasets in multiple languages can be time-consuming, expensive, and challenging.
Limited or imbalanced data availability for certain languages may affect the performance of
the system in those languages.


CHAPTER 8
SYSTEM TESTING
System Testing is a type of software testing that is performed on a complete integrated system
to evaluate the compliance of the system with the corresponding requirements.
To perform system testing of visual speech recognition using multiple languages in Python,
you can follow these general steps:

Choose a Python library or framework for visual speech recognition. One popular option is
OpenCV, which provides computer vision functionalities. You can install it using pip: pip
install opencv-python.

Acquire or create a dataset of videos or images containing visual speech samples in multiple
languages. You can search for publicly available datasets or create your own dataset by
recording videos of individuals speaking in different languages.

Preprocess the dataset to extract the relevant features for visual speech recognition. This may
involve converting the videos or images to the appropriate format, applying face detection
algorithms to isolate the face region, and extracting facial landmarks or optical flow
information.

Split your dataset into training and testing sets. Ensure that the languages are represented in
both sets to evaluate the system's performance across languages.

Train a visual speech recognition model using machine learning or deep learning techniques. You can utilize frameworks like TensorFlow or PyTorch for this purpose. Consider using convolutional neural network (CNN) or recurrent neural network (RNN) architectures, such as LSTMs, to model the temporal dynamics of speech.

Evaluate the trained model on the testing set. Measure performance metrics such as accuracy,
precision, recall, or F1 score for each language separately and overall.
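
As a sketch of the dataset split described above, scikit-learn's train_test_split can stratify by language so that every language appears in both sets; the library choice, file names, and labels below are placeholders, not artifacts of this project.

from sklearn.model_selection import train_test_split

# Placeholder dataset: video clip paths paired with their language labels
samples = [f"clip{i}.mp4" for i in range(12)]
languages = ["en", "fr", "hi"] * 4

# stratify=languages keeps each language proportionally represented in both splits
train_x, test_x, train_y, test_y = train_test_split(
    samples, languages, test_size=0.25, random_state=42, stratify=languages)
print(sorted(test_y))  # -> ['en', 'fr', 'hi']
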
8.1 UNIT TESTING
When it comes to unit testing visual speech recognition systems that support multiple
languages, there are several key aspects to consider. Here's an example of how you can
approach unit testing for such a system:

Set up a testing framework: Utilize a testing framework in Python, such as ‘unittest’ or ‘pytest’,
to structure and execute your tests. These frameworks provide tools and functionalities for
organizing test cases, running tests, and asserting the expected results.

Identify the units for testing: Determine the individual units or components within your visual
speech recognition system that you want to test. This might include modules or functions
responsible for language identification, preprocessing, feature extraction, model training, and
prediction.
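
As a hedged illustration, the sketch below unit-tests a hypothetical preprocessing helper with the built-in unittest framework; preprocess_frame and its contract (a BGR frame in, a single-channel grayscale array out) are assumptions made for the example, not functions defined in this report.

import unittest

import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical unit under test: convert a BGR frame to grayscale."""
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

class TestPreprocessFrame(unittest.TestCase):
    def test_output_is_single_channel(self):
        frame = np.zeros((48, 64, 3), dtype=np.uint8)  # synthetic black frame
        self.assertEqual(preprocess_frame(frame).shape, (48, 64))

    def test_uniform_white_stays_white(self):
        frame = np.full((48, 64, 3), 255, dtype=np.uint8)
        self.assertTrue((preprocess_frame(frame) == 255).all())

if __name__ == "__main__":
    unittest.main()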

8.2 INTEGRATION TESTING


Integration testing is a level of software testing where individual units are combined and tested as a group. The purpose of this level of testing is to expose faults in the interaction between integrated units.

8.3 SYSTEM TESTING


System Testing is a type of software testing performed on a complete, integrated system to evaluate its compliance with the corresponding requirements. In system testing, the components that passed integration testing are taken as input. While the goal of integration testing is to detect any irregularity between the units that are integrated together, system testing detects defects within both the integrated units and the whole system. The result of system testing is the observed behavior of a component or a system when it is tested.


CHAPTER 9
SCREENSHOTS
APPLICATION RUN IN CONSOLE

Fig. 6 Application run in console

IMAGE CONVERSION INTO GRAY SCALE

Fig. 7 Image conversion into gray scale


CHAPTER 10
RESULT ANALYSIS
In the process of generating text output from video and audio sources using OpenCV, Tesseract,
and speech recognition, result analysis plays a crucial role in evaluating the accuracy and quality
of the extracted text. Here's an overview of the result analysis and its significance:

1. Text Validation: The first step in result analysis involves validating the extracted text from
both OCR and speech recognition. This validation process aims to identify and correct any
errors or inaccuracies present in the text. Techniques such as spell-checking, language
modeling, or comparison with a known vocabulary can be employed to improve the accuracy
of the extracted text.

2. Error Rate Calculation: One of the key metrics used in result analysis is the error rate, which measures the dissimilarity between the extracted text and the ground truth or expected text. Common error rate measures include word error rate (WER) and character error rate (CER); a reference implementation of WER is sketched after this list. These metrics provide quantitative insights into the accuracy of the text extraction process.

3. Context Coherence: Another important aspect of result analysis is assessing the coherence
and consistency of the extracted text. It involves evaluating whether the extracted text makes
sense in the given context. In the case of speech recognition, language models are often utilized
to improve the context coherence of the transcribed text.

4. Comparison of OCR and Speech Recognition: Result analysis includes comparing the
performance of OCR and speech recognition in terms of accuracy, robustness, and speed. This
comparison helps in determining which method yields better results for the specific application.
Factors such as noise in the images, quality of audio input, and language complexity can
influence the performance of both OCR and speech recognition.

5. User Evaluation: In addition to quantitative metrics, user evaluation and feedback are
valuable in result analysis. User feedback provides subjective insights into the readability,
clarity, and overall quality of the extracted text. Feedback from end-users or domain experts
can help identify specific challenges and areas for improvement.

6. Iterative Improvement: Result analysis drives iterative improvement in the text extraction
process. By analyzing the errors, inconsistencies, and feedback, developers can refine the
algorithms, parameters, and techniques employed in OCR and speech recognition. The goal is
to continuously enhance the accuracy and quality of the text output.

7. Decision on Final Output: Based on the result analysis, a decision is made on selecting the
final text output. This decision is influenced by factors such as the error rates, context
coherence, user evaluation, and the specific requirements of the application. The output that
demonstrates higher accuracy and coherence is chosen as the preferred result.
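
The word error rate from step 2 can be computed with a standard edit-distance recurrence. The sketch below is a self-contained reference implementation; production code would more likely rely on an existing evaluation library.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in a four-word reference gives WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat up"))  # 0.25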

In conclusion, result analysis in text extraction from video and audio involves validating the
extracted text, calculating error rates, assessing context coherence, comparing OCR and speech
recognition, incorporating user evaluation, and driving iterative improvement. This analysis
helps in determining the reliability and suitability of the extracted text for the intended
application and guides further enhancements in the text extraction process.


CHAPTER 11
CONCLUSION & FUTURE SCOPE

Visual Speech Recognition (VSR) is a technology that focuses on understanding and interpreting visual information from a person's lip movements, facial expressions, and other visual cues to recognize speech. VSR has significant implications and potential future scope in various fields.

In conclusion, VSR has shown promise and potential in multiple applications. One of the
primary areas of interest is human-computer interaction, where VSR can enhance
communication between humans and machines. For example, VSR can enable better speech
recognition in noisy environments, aid in speech-to-text transcription, and improve speech
understanding in applications like virtual assistants and voice-controlled systems.

Moreover, VSR has the potential to assist individuals with speech impairments or hearing
difficulties. By capturing and interpreting visual cues, it can help develop technologies and
tools for augmentative and alternative communication (AAC), facilitating improved
communication for individuals who struggle with traditional spoken language.

In the field of video analysis and surveillance, VSR can have significant implications for
security and forensic applications. By analyzing lip movements and facial expressions, VSR
can aid in lip-reading, emotion recognition, and identity verification, which can be valuable in
areas such as surveillance, criminal investigation, and biometric authentication.

The future scope of VSR is promising as the technology continues to evolve. Advancements in
computer vision, machine learning, and deep learning techniques will likely enhance the
accuracy and reliability of VSR systems. Additionally, the integration of multimodal
approaches, combining visual information with audio and textual data, can further improve the
performance and robustness of VSR systems.

However, it is essential to address challenges such as variations in lighting conditions, different speaking styles, and individual differences that can affect the accuracy of VSR systems. Data privacy and ethical concerns related to the collection and analysis of visual information also
need to be carefully considered and addressed.

In summary, the future scope of VSR is broad and holds potential in various domains, including
human-computer interaction, assistive communication, video analysis, and security
applications. Continued research and development in VSR can lead to more accurate and
reliable systems, making it a valuable technology for improving communication, accessibility,
and security in the years to come.


REFERENCES

1. Petridis, S., & Pantic, M. (2018). Visual Speech Recognition: Challenges and Approaches. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5), 1033-1053.

2. Zhou, F., Yuan, J., Jiang, J., Zhao, W., & Zhao, C. (2018). Lip-reading based on deep learning: A survey. The Visual Computer, 34(11), 1599-1612.

3. Stafylakis, T., & Tzimiropoulos, G. (2017). Combining residual networks with LSTMs for lipreading. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2832-2840.

4. Chung, J. S., Zisserman, A., & Chandrasekhar, V. (2017). Lip reading in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3444-3453.

5. Afouras, T., Chung, J. S., & Zisserman, A. (2018). Deep lip reading: A comparison of models and an online application. In Proceedings of the European Conference on Computer Vision (ECCV), 569-586.
