A PROJECT REPORT
Submitted by
SRIVATHSAN N (913120104097)
SRIRAM S (913120104095)
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
ABSTRACT
In recent years, advancements in machine learning (ML) techniques have
revolutionized the field of healthcare by offering innovative solutions for
monitoring and improving patient care. This abstract presents a novel ML-based
health monitoring system integrated with hand, eye, and speech recognition
capabilities, implemented using Python. The primary objective of this system is
to provide an assistive communication platform for individuals with speech
impairments or disabilities. The system comprises three key components: hand
recognition, eye tracking, and speech recognition modules. Firstly, the hand
recognition module utilizes convolutional neural networks (CNNs) to detect and
recognize hand gestures made by the user. These gestures serve as input signals
for text generation, enabling individuals to convey messages through hand
signs. Secondly, the eye tracking module employs computer vision techniques
to track the movement of the user's eyes, facilitating intuitive interaction with
the system interface. Thirdly, the speech recognition module utilizes deep
learning models to convert spoken words into text, enabling seamless
communication for users with speech impairments. Moreover, the system
incorporates text-to-speech (TTS) and speech-to-text (STT) functionalities to
support bidirectional communication. The TTS module converts textual
information generated from hand signs and eye movements into audible speech,
enabling users to convey messages verbally. Conversely, the STT module
converts spoken words captured by the system's microphone into textual form,
enhancing accessibility and enabling natural interaction.
TABLE OF CONTENTS
1. Introduction
1.1. Overview
1.2. Objective
2. Literature Survey
5.2. Software Specification
8.2. Algorithm
9. Conclusion
Appendix 1
Appendix 2
Appendix 3
Appendix 4
Appendix 5
LIST OF FIGURES
1. Sample Data
4. Activation Maps
5. Kernel Filter
6. Max Pooling
LIST OF ABBREVIATIONS
ML - Machine Learning
CNN - Convolutional Neural Network
TTS - Text-to-Speech
STT - Speech-to-Text
1. INTRODUCTION
1.1 OVERVIEW
In an era marked by rapid technological advancement, machine learning
(ML) has emerged as a powerful tool with transformative potential, particularly
in the realm of healthcare. Leveraging the capabilities of ML algorithms,
researchers and developers have been exploring innovative solutions to address
various challenges in patient care and assistive technologies. Among the most
pressing needs is the development of communication systems tailored to
individuals with speech impairments or disabilities, enabling them to express
themselves effectively and interact with their environment.

This introduction
presents a novel ML-based health monitoring system designed to address the
communication needs of individuals with speech impairments. The system
integrates hand, eye, and speech recognition functionalities, implemented using
Python, to create a comprehensive assistive communication platform. By
combining these diverse modalities, the system aims to offer intuitive and
accessible communication channels, empowering users to convey their thoughts,
needs, and emotions efficiently. The impetus for developing such a system
stems from the recognition of the challenges faced by individuals with speech
impairments in conventional communication settings. While existing assistive
technologies have made significant strides in enhancing accessibility, they often
rely on single modalities such as text-based interfaces or speech-to-text systems,
which may not adequately address the diverse needs and preferences of users.
Moreover, the lack of real-time feedback and adaptability limits the usability
and effectiveness of these systems in dynamic healthcare environments.

To
address these limitations, the proposed ML-based health monitoring system
adopts a multifaceted approach, integrating hand, eye, and speech recognition
technologies within a unified framework. This approach capitalizes on the
complementary nature of these modalities, offering users multiple channels
through which they can communicate and interact with the system. By
leveraging ML algorithms, the system can continuously learn and adapt to user
behavior, improving its performance over time and enhancing the user
experience.
Furthermore, the integration of text-to-speech (TTS) and speech-to-text (STT)
functionalities enhances the versatility and inclusivity of the system, enabling
seamless bidirectional communication between users and caregivers or
healthcare professionals. Through these features, the system aims to promote
autonomy, independence, and social integration for individuals with speech
impairments, fostering a more inclusive and supportive healthcare environment.
1.2 OBJECTIVE
The primary objective of this project is to provide an assistive communication platform for individuals with speech impairments or disabilities by integrating hand gesture recognition, eye tracking, and speech recognition, together with text-to-speech and speech-to-text conversion, into a single ML-based health monitoring system.
2. LITERATURE SURVEY
Article Title: "Hand Gesture Recognition in Robotics: A Survey of Techniques
and Applications."
Authors: M. Zhao, L. Wang
Summary: This survey paper provides an overview of hand gesture recognition
techniques and applications in robotics, discussing their role in enhancing human-
robot interaction, collaborative tasks, and assistive functions in various domains.
Article Title: "Speech Recognition Systems in Clinical Documentation: A
Review of Implementation Strategies and User Perspectives."
Authors: L. Johnson, M. Thompson
Summary: This review evaluates the implementation strategies and user
perspectives of speech recognition systems in clinical documentation, discussing
factors influencing adoption, workflow integration, and user satisfaction among
healthcare professionals.
Summary: The article discusses the implications of eye-tracking technology in
automotive design, addressing its role in understanding driver behavior,
attentional patterns, and cognitive workload to inform the development of safer
and more intuitive vehicle interfaces.
3. SYSTEM STUDY
3.1. FEASIBILITY STUDY
The feasibility study for implementing the proposed multimodal recognition system involves assessing:
Economic Feasibility
Technical Feasibility
Social Feasibility
ECONOMIC FEASIBILITY
Initial Investment: Estimating the cost of acquiring the necessary sensing devices (cameras, microphones, and eye-tracking hardware), as well as computing hardware for processing, is imperative to determine the initial investment.
Potential Cost Reduction Strategies: Exploring avenues for cost reduction
while maintaining service quality is paramount. The integrated system is designed
with cost considerations in mind, aiming to streamline medical workflows, reduce
manual labor, and ultimately lower overall project costs while improving
diagnostic accuracy and patient outcomes.
TECHNICAL FEASIBILITY
As healthcare systems evolve, there is a growing interest in leveraging
artificial intelligence (AI) and deep learning algorithms for real-time patient
monitoring and diagnosis. Assessing the technical feasibility of implementing
such systems is crucial to ensure their effectiveness and reliability in enhancing
patient care. This section provides an overview of key technical considerations
for evaluating the feasibility of integrating speech recognition, hand gesture
interpretation, and eye-tracking technologies in medical applications.
Data Requirements: What are the challenges associated with collecting and
annotating large volumes of diverse and representative data for training speech
recognition, hand gesture interpretation, and eye-tracking models in medical
contexts? How can data quality and diversity be ensured to improve the robustness
and generalization capabilities of the models, especially considering patient
variability and medical conditions?
Integration with Existing Infrastructure: Why is it essential for these
technologies to seamlessly integrate with existing medical infrastructure, such as
electronic health records (EHRs), medical imaging systems, and patient monitoring
devices? What protocols and standards need to be followed to facilitate
interoperability and data exchange between different components of the medical
infrastructure, ensuring seamless integration and workflow optimization?
Scalability: How does the scalability of these technologies impact their feasibility
for deployment across diverse healthcare settings, including hospitals, clinics, and
telemedicine platforms? What strategies can be employed to design scalable
architectures that can accommodate increasing data volumes and computational
demands while maintaining performance and reliability?
SOCIAL FEASIBILITY
Understanding the social feasibility of implementing speech recognition,
hand gesture interpretation, and eye-tracking technologies in the medical field is
crucial to ensure acceptance, adoption, and positive impacts on healthcare
providers, patients, and society. This section provides an overview of key social
considerations for evaluating the feasibility of integrating these technologies into
medical applications.
Acceptance and Trust: How do healthcare providers and patients perceive the use of these technologies in medical settings, particularly concerning privacy, data security, and trust in automated systems?
User Experience and Satisfaction: What are the expectations and preferences of
healthcare providers and patients regarding the usability and user experience of
speech recognition, hand gesture interpretation, and eye-tracking systems?
How can the design and implementation of these technologies be tailored to meet
the diverse needs and preferences of users, ensuring a positive and intuitive
interaction experience?
Ethical and Legal Considerations: What ethical and legal implications arise
from the use of speech recognition, hand gesture interpretation, and eye-tracking
technologies in medical practice, particularly concerning patient confidentiality,
consent, and data protection?
How can healthcare organizations ensure compliance with relevant regulations
and guidelines, such as HIPAA (Health Insurance Portability and Accountability
Act) and GDPR (General Data Protection Regulation), while implementing these
technologies?
Addressing these questions provides a holistic view of the social feasibility of adoption and implementation, guiding decision-making processes and promoting successful integration of these technologies into healthcare practice.
4. SYSTEM ANALYSIS
a. EXISTING SOLUTIONS
In older healthcare systems, patients with speech and motor disabilities faced
barriers accessing assistive technologies due to high costs and limited availability,
resorting to inadequate traditional communication methods. Traditional
methods such as pen and paper or basic communication boards
may not have been sufficient for individuals with complex communication needs.
These methods often require fine motor skills and may be challenging to use for
individuals with motor disabilities or cognitive impairments, leading to frustration
and inefficiency in communication. Without advanced technology, healthcare
professionals heavily relied on subjective interpretation of non-verbal cues,
potentially leading to misunderstandings, especially with diverse communication
styles or cultural backgrounds.
Moreover, assistive solutions were often fragmented: a patient might rely on one system for speech-based communication, another for hand gesture recognition, and yet another for eye-tracking assessments, resulting in disjointed interactions and increased cognitive load.
c. PROPOSED WORK
The proposed work focuses on integrating speech recognition, hand gesture
interpretation, and eye-tracking technologies into healthcare systems to enhance
patient care, communication, and diagnostic capabilities. By combining these
modalities, the system aims to improve accessibility, efficiency, and accuracy in
medical applications.
Validation and Evaluation: Rigorous validation and evaluation processes are
conducted to assess the performance and reliability of the integrated system.
Testing scenarios include simulated patient interactions, real-world clinical use
cases, and usability evaluations conducted with healthcare professionals and
patients.
Comprehensive Diagnostic Insights:
The unified platform enables comprehensive analysis of patient interactions,
combining multiple modalities for more accurate diagnostic evaluations.
By integrating speech recognition, hand gesture interpretation, and eye-
tracking data, healthcare professionals gain valuable insights into patients'
cognitive function, attention span, and visual perception, aiding in diagnostic
assessments and treatment planning.
Streamlined Workflow Efficiency:
The integration of speech recognition, hand gesture interpretation, and eye-
tracking technologies streamlines medical workflows, reducing manual
documentation efforts and improving data capture accuracy.
Healthcare providers can access real-time transcriptions, gesture
interpretations, and eye movement analyses, enhancing decision-making
processes and optimizing patient care delivery.
Enhanced Patient Experience:
The proposed system enhances the overall patient experience by providing
personalized and interactive communication channels.
Patients feel more empowered and engaged in their care through the ability
to express themselves using speech, gestures, and eye movements, leading to
increased satisfaction and adherence to treatment plans.
Accessibility and Inclusivity:
By integrating multiple communication modalities, the system improves
accessibility and inclusivity for individuals with disabilities, language
barriers, or communication impairments.
Patients from diverse backgrounds and with varying communication needs
can effectively interact with healthcare providers, fostering a more inclusive
healthcare environment.
Cost-Effectiveness and Resource Optimization:
The proposed integration offers cost-saving opportunities by consolidating
multiple technologies into a unified platform, reducing the need for separate
hardware and software solutions.
By optimizing resource utilization and streamlining workflows, healthcare
organizations can achieve operational efficiencies and cost savings in the long
term.
Facilitation of Research and Development:
The integrated system provides a rich source of data for research and
development purposes, enabling healthcare professionals and researchers to
study patient communication patterns, cognitive function, and diagnostic
outcomes.
By facilitating data-driven research initiatives, the system contributes to
advancements in healthcare delivery and the development of innovative
treatments.
5. SYSTEM SPECIFICATION
This chapter specifies the hardware and software environment required to develop, train, and deploy the system.
5.1. HARDWARE SPECIFICATION
Processor: Pentium IV, 2.4 GHz
This refers to the central processing unit (CPU) of the system, a Pentium IV processor clocked at 2.4 GHz. The Pentium IV is a microprocessor manufactured by Intel that was commonly used in computers during the early 2000s. The clock speed of 2.4 GHz indicates how many instruction cycles the CPU can execute per second.

Hard Disk: 200 GB
This specifies the storage capacity of the hard disk drive (HDD) in the system, which is 200 gigabytes (GB). The hard disk is a non-volatile storage device used to store data permanently on the computer; the 200 GB capacity indicates the amount of data that can be stored on it.

RAM: 4 GB
5.2. SOFTWARE SPECIFICATION
Integration Software:
Middleware or integration platforms for combining data streams from
multiple sources.
Development frameworks such as Unity or Unreal Engine for creating
interactive user interfaces.
Programming languages such as Python, C++, or Java for application
development.
Operating System:
Compatibility with the chosen operating system (e.g., Windows, Linux, or
macOS) for running the integrated software components.
Accessibility and Inclusivity Considerations:
Compliance with accessibility standards (e.g., WCAG - Web Content
Accessibility Guidelines) for ensuring usability for individuals with
disabilities.
Language: Python
Development Environment:
6. MODULES
1. Speech Recognition:
Speech recognition technology enables the conversion of spoken words into text,
allowing individuals to communicate verbally with the system. This technology
utilizes advanced algorithms to analyze audio input, identify speech patterns, and
transcribe spoken language accurately. Key components of speech recognition technology include acoustic analysis of the audio signal, language modeling, and decoding of the most likely word sequence.
2. Eye Tracking:
Pupil Detection: Detection of the pupil center and estimation of gaze direction
using image processing techniques.
Calibration: Calibration process to map eye movements to screen coordinates
accurately, accounting for individual differences in eye anatomy and movement
patterns.
Gaze Analysis: Analysis of gaze patterns, fixation durations, and saccadic
movements to infer cognitive states, visual attention, and information processing.
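As a concrete illustration, here is a minimal pupil-detection sketch assuming OpenCV and its bundled Haar eye cascade; the threshold value and cascade choice are illustrative, not the report's exact method:

import cv2

# Haar cascade shipped with OpenCV for locating the eye region
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_eye.xml')

def detect_pupil(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in eye_cascade.detectMultiScale(gray, 1.3, 5):
        eye = gray[y:y + h, x:x + w]
        # The pupil is the darkest region: threshold, then find contours
        _, thresh = cv2.threshold(eye, 40, 255, cv2.THRESH_BINARY_INV)
        contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            c = max(contours, key=cv2.contourArea)
            M = cv2.moments(c)
            if M['m00'] > 0:
                # Pupil centre in full-frame coordinates
                return (x + int(M['m10'] / M['m00']),
                        y + int(M['m01'] / M['m00']))
    return None

The returned pupil coordinates can then be mapped to screen positions during the calibration step described above.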
CNN:
CNNs are a class of deep learning models specifically designed for image
and video analysis tasks.
They consist of multiple layers, including convolutional layers, pooling
layers, and fully connected layers.
Convolutional layers apply filters to input frames of video footage,
extracting spatial features such as patterns, textures, and objects.
Pooling layers reduce the spatial dimensions of feature maps, aiding in
feature extraction and computational efficiency.
Fully connected layers perform classification based on the extracted features, determining which hand gesture is present in the video frame.
7. SYSTEM DESIGN
System design is the process of visualizing the entire architecture required for the product. The process spans everything from designing the training datasets to creating the convolutional models, then testing and validating them; it covers the full pipeline from input images to predicted values.
7.1. SYSTEM ARCHITECTURE
The system architecture describes the high-level component design required for the analysis. It has three major components: preprocessing the data, validating the data, and deploying the models for use.
Video Input Data: Acquiring video footage from cameras installed in medical
facilities, capturing hand gestures and eye movements during patient consultations
and examinations.
Data Processing:
Preprocessing: Enhancing the quality of audio recordings and video footage to
reduce noise and improve clarity. Segmenting video streams into individual
frames for analysis.
Feature Extraction: Extracting relevant features from audio and video data, such
as speech patterns, hand gestures, and eye movements, to facilitate analysis.
Format Conversion: Converting audio recordings and video frames into formats
suitable for input into the respective recognition and interpretation models.
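To make the preprocessing step concrete, here is a minimal frame-segmentation sketch assuming OpenCV; the frame interval, blur kernel, and output size are illustrative choices, not the report's exact parameters:

import cv2

def extract_frames(video_path, size=(64, 64), every_n=5):
    # Segment a video stream into denoised, resized grayscale frames
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Light Gaussian blur reduces sensor noise before analysis
            gray = cv2.GaussianBlur(gray, (3, 3), 0)
            frames.append(cv2.resize(gray, size))
        index += 1
    cap.release()
    return frames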
Model Engineering:
Execution:
Hardware Requirements: Deploying the system on hardware platforms capable
of processing audio and video data in real time, such as dedicated servers or
cloud-based infrastructure.
This is the step-by-step procedure in which the imported data is validated, statistical analysis is performed on it, and the data is merged into sets; error validation is also done, the algorithm is validated, and the accuracy is tested.
Participants: The diagram involves several participants or modules: User,
HandSignToTextModule, AI_Module, TextToSpeechModule,
SpeechToTextModule, and EyeballTrackingModule. Each participant represents
a component or actor involved in the process.
Hand Sign Detection: The sequence begins with the User performing a hand
sign, which is then detected by the HandSignToTextModule. This module
communicates with the AI_Module to analyze the hand sign and determine the corresponding text.
Deactivation: Modules are deactivated using the deactivate keyword once their
respective tasks are completed.
8. SYSTEM IMPLEMENTATION
Integration with Hand Sign to Text and Text to Speech: Incorporate hand sign
to text and text to speech functionalities into the data collection pipeline. Collect
data encompassing hand sign gestures and their corresponding textual
representations, ensuring alignment between the two modalities. Likewise,
capture textual inputs and their corresponding spoken outputs to facilitate text to
speech conversions. By integrating these functionalities, the dataset encompasses
a broader spectrum of multimodal interactions, enabling the development of
comprehensive gesture interpretation systems.
Sample data:
OpenCV, a popular library in computer vision, offers efficient methods to detect such keypoints, allowing for robust feature extraction.

Edge detection is another crucial aspect of image processing, aiming to identify sudden changes in pixel intensity, which often correspond to object boundaries or discontinuities in an image.

Histogram of Oriented Gradients (HOG) is a powerful technique for feature extraction, particularly suited for detecting shape patterns within images.
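As an illustration, here is a minimal HOG feature-extraction sketch using OpenCV's built-in descriptor; the window, block, and cell sizes are common defaults assumed here, and the image path is the one used in Appendix 3:

import cv2

# (winSize, blockSize, blockStride, cellSize, nbins)
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

image = cv2.imread('captured_images/image_1.jpg', cv2.IMREAD_GRAYSCALE)
image = cv2.resize(image, (64, 64))
features = hog.compute(image)  # one gradient-orientation histogram per cell
print(features.shape)          # flattened HOG feature vector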
Text to Speech:
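The details of this module were not recoverable here; below is a minimal offline text-to-speech sketch assuming the pyttsx3 library, one common Python choice rather than the report's confirmed implementation:

import pyttsx3

# Convert text produced by the hand-sign module into audible speech
engine = pyttsx3.init()
engine.setProperty('rate', 150)  # speaking rate in words per minute (illustrative)
engine.say("Hello, I need assistance.")
engine.runAndWait()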
Speech to Text:
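Likewise, a minimal speech-to-text sketch assuming the SpeechRecognition library and its Google Web Speech backend; the backend choice is an assumption, not the report's stated one:

import speech_recognition as sr

# Capture audio from the microphone and transcribe it to text
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)
try:
    print("You said:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was not understood.")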
Eyeball Tracking:
Eye Movement Data Collection: Gather data on eye movements using eye-
tracking devices or algorithms.
Data Preprocessing: Filter out noise and artifacts from the eye movement
data. Normalize eye movement data to a common coordinate system.
Feature Extraction: Extract features such as fixation points, saccades, and
gaze patterns.
Data Labeling: Annotate eye movement data with corresponding visual
stimuli or tasks.
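To illustrate the feature-extraction step, here is a minimal sketch that separates fixations from saccades using a simple velocity threshold; the sampling rate and threshold values are illustrative assumptions:

import numpy as np

def label_fixations(gaze_xy, sample_rate_hz=60, velocity_threshold=1000.0):
    # Label each gaze sample as fixation (True) or saccade (False)
    # based on gaze speed in pixels per second
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    speeds = np.linalg.norm(np.diff(gaze_xy, axis=0), axis=1) * sample_rate_hz
    return np.concatenate([[True], speeds < velocity_threshold])

# Example: mostly stationary gaze with one large jump (a saccade)
samples = [(100, 100), (101, 100), (101, 101), (300, 250), (301, 250)]
print(label_fixations(samples))  # [ True  True  True False  True]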
Convolution:
In mathematics, the term convolution refers to an operation on two functions, defined as the integral of the product of the functions as one is shifted over the other, producing the convolution function. A CNN is a deep learning algorithm that takes images as input, assigns weights and biases to different aspects of the images, and is capable of differentiating between them. The time taken to preprocess the data in a convolutional neural network is very low compared to other machine learning algorithms. The architecture of a convolutional neural network is similar to the human brain, where multiple neurons are connected across the cortex and information is processed through a series of signals. Images are generally made up of pixels; an image's size is typically represented in terms of rows, columns, and RGB color channels. In a binary image, that is, a grayscale image, the pixels are treated as binary values: black is represented as zero and white as one. A ConvNet can successfully capture the overlapping parts of an image. An image is processed pixel by pixel by sliding a 3×3 or 5×5 convolution kernel across the entire image to create new features; after each convolution, the result is processed through weights and biases. The figure below shows an RGB image divided into three color planes: red, green, and blue. Images exist in different color spaces, such as RGB, grayscale, and HSV. Processing an entire image is very time consuming and takes a lot of computational resources; instead, we process the image with 5×5 or 3×3 filters without losing features, and the filtered outputs are reconstructed into the image used for analysis.
Enhanced Model Convergence:
Pooling generates multiple images of smaller sizes, and the convolutions are used to create feature maps; the richer the feature maps, the higher the accuracy the model can achieve over time. The complete image is processed by applying better kernel filters across it.

The feature maps are the actual features that go as input into the convolutional neural network. The images are processed through predefined activation functions, which are used to calculate the differently weighted features, and this processing is very fast. Before the fully connected layers, all of the image processing happens through the convolutional layers, which process the images at large scale and faster than conventional methods. With different types of kernels applied, there is a good chance of obtaining the best features in the convolutional neural network, from which the final outcome is predicted.
Activation Maps:
Creation of activation maps is one of the key features of deep learning algorithms. The steps used in creating them are:
1. Decide the size of the convolutional filter that needs to be applied to the original image.
2. Slide the kernel from the top of the image across its entire extent, which results in the first activation map.
3. Take another filter and process it across the entire image, which leads to another map used in predicting the final outputs.
4. A series of multiple filters leads to the final output; the resulting maps are called convolutional activation maps.
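A minimal NumPy sketch of this procedure, sliding small kernels over an image to produce one activation map per filter; the image and kernels are illustrative:

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image and return the activation map
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of kernel and image patch, summed
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)            # stand-in for a grayscale image
kernels = [np.ones((3, 3)) / 9.0,       # averaging (blur) filter
           np.array([[1, 0, -1],
                     [2, 0, -2],
                     [1, 0, -1]])]      # vertical-edge (Sobel) filter
activation_maps = [convolve2d(image, k) for k in kernels]
print([m.shape for m in activation_maps])  # each map is 6x6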
Kernel Filter:
The kernel filters are generally a matrix of operations applied to the original image matrix.
1. Convolutional 1D: The filter is a single row of randomly generated values, which is multiplied with the original image kernel.
2. Convolutional 2D: These filters are 2×2 matrices, with rows and columns, applied to the image.
3. Convolutional 3D: These filters are 3×3 and are used to process RGB images. The image is processed through multiple channels: the filter slides over each channel to generate multiple feature maps. This is the standard filter we use whenever we want to process images through convolutions and create features.
Max Pooling:
In a convolutional neural network, a filter is applied systematically across an image to process it. Converting multiple feature maps can produce even better feature maps; we use a method called downsampling to process the images, and images processed through downsampling have better features. Convolutional neural networks are proven to be very effective when multiple layers are stacked. A common approach is to achieve downsampling by taking the maximum of the convolution outputs; a better approach is to use a pooling layer. A pooling layer that keeps the maximum value from each region of the image is called max pooling. The two types of pooling layers are listed below:
1. Average pooling: After applying the filter, we take the average of all the values in each region.
2. Maximum pooling: After applying the kernel, we take the maximum value in each region for the analysis.
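A minimal NumPy sketch of 2×2 pooling over a feature map, showing both variants listed above; the input values are illustrative:

import numpy as np

def pool2d(feature_map, size=2, mode='max'):
    # Downsample by taking the max (or average) of each
    # non-overlapping size x size region
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]  # trim to a multiple of size
    blocks = fm.reshape(fm.shape[0] // size, size, fm.shape[1] // size, size)
    return blocks.max(axis=(1, 3)) if mode == 'max' else blocks.mean(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [3, 4, 2, 8]], dtype=float)
print(pool2d(fm, mode='max'))  # [[6. 4.] [7. 9.]]
print(pool2d(fm, mode='avg'))  # [[3.75 2.25] [4.   5.  ]]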
9. CONCLUSION
Enhanced Communication Accessibility: The seamless integration
of hand gesture recognition, text-to-speech conversion, text recognition,
and text-to-hand sign translation, augmented by eye-tracking technology,
significantly enhances communication accessibility for individuals across
a wide spectrum of abilities. This comprehensive approach ensures that
communication barriers are effectively overcome, fostering more
inclusive interactions in both personal and professional settings.
10. FUTURE WORK
Improving Gesture Recognition Accuracy: Future research could focus on
enhancing the accuracy and robustness of hand gesture recognition algorithms,
particularly in complex or noisy environments. This could involve exploring
advanced machine learning techniques, incorporating additional sensor
modalities, or developing more sophisticated gesture modeling approaches.
User Feedback and Iteration: Further work could involve iterative design processes aimed at refining system interfaces, features, and functionalities based on user feedback and real-world usage scenarios.
11. APPENDICES
Appendix 1:
import cv2
import os
import mediapipe as mp

mp_hands = mp.solutions.hands

def main():
    # Create a directory to store captured images
    output_dir = 'captured_images'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    cap = cv2.VideoCapture(0)
    image_count = 0
    hands = mp_hands.Hands(max_num_hands=1)

    while True:
        success, image = cap.read()
        if not success:
            break
        # MediaPipe expects RGB input
        results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                # Draw landmarks on the hands (optional)
                mp.solutions.drawing_utils.draw_landmarks(
                    image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
                # Crop the hand region and convert it to grayscale
                x_min, y_min, x_max, y_max = hand_bbox(hand_landmarks, image.shape)
                hand_region = image[y_min:y_max, x_min:x_max]
                hand_gray = cv2.cvtColor(hand_region, cv2.COLOR_BGR2GRAY)
                # Save the cropped hand image
                cv2.imwrite(os.path.join(output_dir,
                                         f'image_{image_count}.jpg'), hand_gray)
                image_count += 1
        cv2.imshow('Hand Capture', image)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()

def hand_bbox(hand_landmarks, shape):
    # Pixel bounding box around the detected hand landmarks
    h, w = shape[:2]
    xs = [lm.x for lm in hand_landmarks.landmark]
    ys = [lm.y for lm in hand_landmarks.landmark]
    return (int(min(xs) * w), int(min(ys) * h),
            int(max(xs) * w), int(max(ys) * h))

if __name__ == "__main__":
    main()
Appendix 2:
import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.utils import to_categorical

# Root folder with one subdirectory of images per gesture class (assumed layout)
data_dir = 'captured_images'
class_names = sorted(os.listdir(data_dir))
num_classes = len(class_names)

images, labels = [], []
for class_id, class_name in enumerate(class_names):
    class_path = os.path.join(data_dir, class_name)
    if not os.path.isdir(class_path):
        print(f"Skipping {class_path} as it is not a directory.")
        continue
    for file_name in os.listdir(class_path):
        image_path = os.path.join(class_path, file_name)
        try:
            image = cv2.imread(image_path)
            if image is not None:  # Check if image is loaded successfully
                image = cv2.resize(image, (64, 64))  # Resize to match your image size
                images.append(image)
                labels.append(class_id)
            else:
                print(f"Skipping {image_path} as it could not be loaded.")
        except Exception as e:
            print(f"Error loading {image_path}: {str(e)}")

images = np.array(images)
labels = to_categorical(labels, num_classes)
X_train, X_test, y_train, y_test = train_test_split(images, labels, test_size=0.2)

# Define your CNN model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(64, 64, 3)))  # Input shape matches the resized images
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))  # Use num_classes
Appendix 3:
import cv2
import numpy as np
from keras.models import load_model

# Load the trained model (path assumed)
model = load_model('hand_sign_model.h5')

# Load and preprocess the input image(s)
def preprocess_input_image(image_path):
    image = cv2.imread(image_path)
    if image is not None:
        image = cv2.resize(image, (64, 64))
        image = image.astype('float32') / 255
        return np.expand_dims(image, axis=0)  # add a batch dimension
    else:
        print(f"Failed to load image from {image_path}.")
        return None

# Example usage
input_image_path = "captured_images/image_1.jpg"
input_image = preprocess_input_image(input_image_path)

# Make predictions
predictions = model.predict(input_image)

# Interpret predictions
predicted_class_index = np.argmax(predictions)
confidence = predictions[0][predicted_class_index]

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an audio recording (path assumed)
y, sr = librosa.load('speech_sample.wav')

# Extract MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Display MFCCs
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()
plt.show()
Appendix 4:
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np

num_classes = 10  # number of output classes (e.g., number of unique words); assumed value

model = models.Sequential([
    # Initial convolution; input shape assumed to match the 64x64 RGB
    # images used elsewhere in this report
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(num_classes, activation='softmax')
])
Appendix 5:
import cv2
import dlib

# dlib CNN detection model; the weights path below is a placeholder
eye_detector = dlib.cnn_face_detection_model_v1('path_to_eye_detector_model.dat')

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    # dlib expects RGB images
    detections = eye_detector(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # Display the frame
    cv2.imshow('Eye Tracking', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()