
A

Project Report
on
Virtual Mouse Using Hand Gesture and Voice Assistant
submitted in partial fulfillment for the award of
BACHELOR OF TECHNOLOGY
DEGREE
SESSION 2022-23

in

Department of Computer Science And Technology


By
Hrithik Chandok (1900290110045)

Huzaifa Ansari (1900290110047)

Manas Khare (1900290110056)

Under the supervision of


Prof. Vinay Kumar

KIET Group of Institutions, Ghaziabad


Affiliated to
Dr. A.P.J. Abdul Kalam Technical University, Lucknow

May, 2023
DECLARATION

We hereby declare that this submission is our own work and that, to the best of our knowledge
and belief, it contains no material previously published or written by another person nor material
which to a substantial extent has been accepted for the award of any other degree or diploma of
the university or other institute of higher learning, except where due acknowledgment has been
made in the text.

Signature: Signature:

Name: Hrithik Chandok Name: Huzaifa Ansari

Roll No.:1900290110045 Roll No.: 1900290110047

Signature:

Name: Manas Khare

Roll No.:1900290110056

CERTIFICATE

Certified that the Project Report entitled “Gesture Control Virtual Mouse” submitted by Hrithik
Chandok (1900290110045), Manas Khare (1900290110056), and Huzaifa Ansari
(1900290110047) is their own work and has been carried out under my supervision. It is
recommended that the candidates may now be evaluated for their project work by the University.

Date: Supervisor

Prof. Vinay Kumar

(Professor)

ACKNOWLEDGEMENT

We wish to express our heartfelt gratitude to all the people who have played a crucial role in
the research for this project; without their active cooperation, the preparation of this project could
not have been completed within the specified time limit.
We are also thankful to our project guide, Prof. Vinay Kumar, who supported us throughout
this project with the utmost cooperation and patience and helped us in completing it.

Date:

Signature: Signature:

Name: Hrithik Chandok Name: Huzaifa Ansari

Roll No.:1900290110045 Roll No.: 1900290110047

Signature:

Name: Manas Khare

Roll No.:1900290110056

ABSTRACT

Gesture-controlled laptops and computers have recently gained a lot of traction; Leap
Motion is a well-known example of this technique. Simple gestures of the hand in front of
the computer or laptop allow us to manage its operations. Unfortunately, employing such
dedicated devices is more complicated: they are difficult to use in the dark, and
manipulating them disrupts the presentation. Hand gestures, in contrast, are the most
natural and effortless manner of communicating.
The concept is to use a simple camera instead of a classic or standard mouse to control
the mouse cursor functions. The camera's output is displayed on the monitor, and the user
can see their image and gestures in a window for better accuracy. The Virtual Mouse thus
provides an interface between the user and the system using only a camera.
It allows users to interact with machines without the use of mechanical or physical
devices and still control all mouse functionalities. This study presents a method for
controlling the cursor's position without the need for any electronic equipment, while
actions such as clicking and dragging are carried out using various hand gestures. In
addition, functionality such as volume and brightness control is given to the user, which
creates further motivation to use this version of the mouse. As an input device, the
suggested system requires only a webcam.
The suggested system is built with OpenCV and MediaPipe in the Python programming
environment, along with several other libraries and tools. The Python dependencies used
for implementing it include NumPy, math, PyAutoGUI, PyCaw, MessageToDict,
screen_brightness_control and others.
In this report we present an approach for human-computer interaction (HCI) in which
cursor motion is controlled using a real-time camera: a way to control the position of the
cursor with the bare hands, without using any electronic device, while operations like
clicking and dragging of objects are accomplished with dedicated hand gestures.
The proposed system requires only a webcam as the input device, and the main software
required to implement it is OpenCV and Python. The camera's output is presented on the
system's screen so that the user can further calibrate it.

TABLE OF CONTENTS Page
No.
DECLARATION……………………………………………………………………. ii
CERTIFICATE……………………………………………………………………… iii
ACKNOWLEDGEMENTS…………………………………………………………. iv
ABSTRACT………………………………………………….…………………….... v
LIST OF FIGURES………………………………………………………………….. x
LIST OF ABBREVIATIONS……………………………….………………………. xii

CHAPTER 1 (INTRODUCTION)………………………………………………….. 13

1.1. Flaws in the Existing Solution…………………………………………………..... 14


1.1.1 Head Control……………............................................................................... 15
1.1.2 Eye Control…................................................................................................. 15
1.1.3 Touch Control…............................................................................................. 15
1.2. Proposed Solution…………………………………………………………………. 15
1.3. Existing System…………………………………………………………………… 16
1.4. Industrial Benefits…………………………………………………………………. 16

CHAPTER 2 (LITERATURE REVIEW)………………………………………….… 18

2.1. Literature Survey ………………………................................................................ 18


2.1.1 Cursor Control................................................................................................ 19
2.1.2 Auditory Based Interaction............................................................................. 20
2.1.3 Cursor Control and Text Entry using Speech…............................................. 20
2.1.4 Multimodal System…………………............................................................. 21
2.1.5 Cursor Control and Text Entry....................................................................... 22
2.1.6 Modality Integration………………............................................................... 23

CHAPTER 3 (HARDWARE AND SOFTWARE REQUIREMENTS) ......................... 25

3.1. Hardware Requirements............................................................................................ 25


3.1.1 Computer Desktop or Laptop............................................................................ 25
3.1.2 Webcam……………......................................................................................... 25
3.2. Software Requirements............................................................................................... 25
3.2.1 Python……………..………………................................................................. 25
3.2.2 OpenCV……………………………................................................................ 27
3.2.3 MediaPipe…………………………................................................................. 28
3.2.4 PyAutoGUI……….……………….................................................................. 29
3.2.5 Math……………………………….................................................................. 29
3.2.6 PyCaw…………..………………..................................................................... 30
3.2.7 ENUM…………….……………….................................................................. 30
3.2.8 Screen Brightness Control…………................................................................. 31

CHAPTER 4 (METHODOLOGY) ………………………………….............................. 32

4.1. Camera Control………............................................................................................... 33


4.2. Video Capturing & Processing……………………………………………………… 33
4.3. Frame Display………………………………………………………………………. 33
4.4. Module Division......................................................................................................... 34
4.4.1 Hand Tracking…………….……...................................................................... 34
4.4.2 Cursor and Clicking using Hand Gesture.......................................................... 36
4.4.3 Volume Control…..……………….................................................................... 39
4.4.4 Scrolling Control….………………................................................................... 41
4.4.5 Brightness Control……………………………................................................. 41

CHAPTER 5 (SOURCE CODE) ..........................................................….……………… 43

CHAPTER 6 (SCREENSHOTS) ……………….............................................................. 48

CHAPTER 7 (RESULTS AND ANALYSIS) …………………….................................. 59

CHAPTER 8 (CONCLUSIONS AND FUTURE SCOPE) ............................................... 60

8.1. Future Scope............................................................................................................... 61


8.2. Applications….............................................................................................................. 61

REFERENCES 62

LIST OF FIGURES

Figure No. Description Page No.

2.1 Python Logo 26

2.2 OpenCV Logo 28

2.3 MediaPipe Logo 28

2.4 Co-ordinates or Landmarks on Hand 29

2.5 PyAutoGui Logo 29

3.1 Flow Chart of Methodology 34

3.2 Neutral Gesture 36

3.3 Mouse Click 37

3.4 Right Click 38

3.5 Increase Volume Gesture 40

3.6 Decrease Volume Gesture 40

3.7 Low Brightness Gesture 41

3.8 Use Case Diagram 42

6.1 Neutral Gesture 48

6.2 Mouse Click 49

6.3 Left Click 50

6.4 Right Click 51

6.5 Double Click 52


6.6 Left to Right-Brightness Controls Top to Bottom 53

6.7 Scroll Up and Down 54

6.8 Low Brightness Gesture 55

6.9 High Brightness Gesture 56

6.10 Increase Volume Gesture 57

6.11 Decrease Volume Gesture 58

6.12 Volume and Brightness Neutral Gesture 58

LIST OF ABBREVIATIONS

HCI Human Computer Interaction

OpenCV Open Source Computer Vision Library

CNN Convolutional Neural Network

KNN k-Nearest Neighbor

SVM Support Vector Machine

CHAPTER 1
INTRODUCTION


Gesture recognition has been a very interesting problem in the computer vision community
for a long time. Hand gestures are an aspect of body language that can be conveyed
through the center of the palm, the finger positions and the shape constructed by the hand.
Hand gestures can be classified into static and dynamic. As its name implies, the static
gesture refers to the stable shape of the hand, whereas the dynamic gesture comprises a
series of hand movements such as waving. There are a variety of hand movements within
a gesture; for example, a handshake varies from one person to another and changes
according to time and place. The main difference between posture and gesture is that
posture focuses more on the shape of the hand whereas gesture focuses on the hand
movement.
Computer technology has grown tremendously over the past decade and has become a
necessary part of everyday life. The primary computer accessory for Human Computer
Interaction (HCI) is the mouse. The mouse is not suitable for HCI in some real-life
situations, such as Human Robot Interaction (HRI), and there has been much research on
alternative methods to the computer mouse for HCI. The most natural and intuitive
technique for HCI that is a viable replacement for the computer mouse is the use of hand
gestures.
Our vision was to develop a virtual mouse system that uses a web camera to
communicate with the device in a more user-friendly way, as an alternative to using a
touch screen or a physical mouse. To harness the full potential of a webcam, it can be
used for vision-based cursor control, which effectively tracks the hand and predicts the
gesture on the basis of its label.
The software enables the user to control the complete functionality of a physical mouse
just by using simple symbols and gestures. It utilizes a digital camera and computer
vision technology to control numerous mouse activities and is able to perform every task
that the physical computer or laptop mouse can.
The major motivation for this project arose amidst the spread of COVID-19. We wanted to
build a solution which enables users to use their device without physically touching it.
It also reduces e-waste and helps to reduce the cost of hardware peripherals.

1.1 FLAWS IN THE EXISTING SOLUTION

The proposed AI virtual mouse system can be used to overcome problems in the real
world, such as situations where there is no space to use a physical mouse, and also for
persons who have problems with their hands and are not able to control a physical mouse.
Also, amidst the COVID-19 situation, it is not safe to use devices by touching them,
because this may result in a possible spread of the virus. The proposed AI virtual mouse
can be used to overcome these problems, since hand gesture and hand-tip detection is used
to control the PC mouse functions by using a webcam or a built-in camera.
The current setup comprises a generic mouse and trackpad for monitor control and lacks a
hand gesture control system; using a hand gesture to access the monitor screen from a
distance is not possible. Even where such control has been attempted, its scope within the
virtual mouse field remains limited.
The existing virtual mouse control systems consist of simple mouse operations using a
hand recognition system, in which we can control the mouse pointer, left click, right
click, drag, and so on. Most of these systems use static hand recognition, which is simply
recognition of the shape made by the hand and a definition of an action for each shape;
this is limited to a few defined actions and causes a lot of confusion. As technology
advances, there are more and more alternatives to using a mouse.
The following are some of the techniques that were employed:

1.1.1 Head Control
A special sensor (or built-in webcam) can track head movement to move the mouse
pointer around on the screen. In the absence of a mouse button, the software's dwell-delay
feature is usually used. Clicking can also be accomplished with a well-placed
switch.
1.1.2 Eye Control
The cost of modern eye gaze systems is decreasing. These enable users to move the
pointer on the screen solely by moving their eyes. Instead of mouse buttons, a dwell-delay
feature, blinks, or a switch are used. The Tobii PCEye Go is a peripheral eye tracker that
lets you use your eyes to control your computer as if you were using a mouse.
1.1.3 Touch Screens
Touch screens, which were once seen as a niche technology used primarily in special
education schools, have now become mainstream. Following the success of smartphones
and tablets, touch-enabled Windows laptops and all-in-one desktops are becoming more
common. Although this is a welcome new technology, the widespread use of touch
screens has resulted in a new set of touch accessibility issues.
However, each of the methods above has its own set of disadvantages. Using the head or
eyes to control the cursor regularly can be hazardous and can lead to a number of health
problems. When using a touch screen, the user must always maintain their focus on the
screen, which can cause drowsiness. By comparing these techniques, we aim to create a
new project that will not harm the user's health.

1.2 PROPOSED SOLUTION

This project promotes an approach to Human Computer Interaction (HCI) in which cursor
movement is controlled using a real-time camera. It is an alternative to the current
methods, which involve manually pressing buttons or changing the position of a physical
computer mouse. Instead, it utilizes a camera and computer vision technology to control
various mouse events and is capable of performing every task that the physical computer
mouse can.
We first use MediaPipe to recognize the hand and the hand key points; MediaPipe
returns a total of 21 key points for each detected hand. MediaPipe uses a Palm Detection
Model and a Hand Landmark Model to detect the hand. First the palm is detected, as this
is an easier task than detecting the full hand with its landmarks. Then the Hand Landmark
Model performs precise key-point localization of 21 3D hand-knuckle coordinates inside
the detected hand regions via regression, that is, direct coordinate prediction.
We detect which finger is up using the tip ID of the respective finger found with
MediaPipe and the corresponding coordinates of the fingers that are up, and according to
that, the particular mouse function is performed.
We then apply formulas such as distance calculation to track the gestures being made by
the user; for example, if the distance between the index finger and the middle finger
becomes (close to) zero, we perform a single-click operation by calling the single-click
function.
Similarly, we have defined and deployed various other gestures: sliding the pinched thumb
and index finger up increases the volume of the system, keeping a constant distance
between the index and middle fingers keeps the system in a stable (neutral) state, and
moving the joined fingers up or down performs a scrolling action. Movement of the palm
performs the drag gesture. The software can also detect multiple hands, but it is deployed
in such a way that only the gestures of one hand are active at a time, so that multiple hands
do not cause ambiguity for the system or the user. A minimal sketch of the distance-based
click check is given below.
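As an illustration of the distance check described above, the following minimal sketch assumes MediaPipe's normalized hand landmarks (index 8 is the index fingertip and index 12 is the middle fingertip) and uses PyAutoGUI for the click; the 0.05 threshold is an assumed value that has to be tuned in practice, since the measured distance never becomes exactly zero:

    import math
    import pyautogui

    INDEX_TIP, MIDDLE_TIP = 8, 12   # MediaPipe landmark indices

    def fingertip_distance(landmarks, a, b):
        # Euclidean distance between two normalized landmark points
        return math.hypot(landmarks[a].x - landmarks[b].x,
                          landmarks[a].y - landmarks[b].y)

    def check_single_click(landmarks, threshold=0.05):
        # Treat touching index and middle fingertips as a single click
        if fingertip_distance(landmarks, INDEX_TIP, MIDDLE_TIP) < threshold:
            pyautogui.click()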

1.3 EXISTING SYSTEM

The existing systems for mouse control are either the conventional physical mouse and
trackpad or virtual mouse prototypes that offer only simple operations through hand
recognition, such as moving the pointer, left click, right click and drag. Most of these
prototypes rely on static hand recognition, i.e. recognizing a fixed hand shape and mapping
it to a predefined action, which limits them to a few gestures and can easily cause
confusion. They also do not let the user operate the screen from a distance or extend
control to functions such as volume and brightness, which is the gap the proposed system
addresses.

1.4 INDUSTRIAL BENEFITS

The proposed project will help avoid the spread of COVID-19 by removing human
intervention and the dependency on physical devices to control the computer. Amidst the
COVID-19 situation, it is not safe to use devices by touching them because this may
result in a possible spread of the virus, so the proposed AI virtual mouse can be used to
overcome these problems, since hand gesture detection is used to control the mouse
functions using a webcam or the built-in camera of the user's computing device, be it a
PC, laptop, workstation, etc.
Hence, in public hotspots like cyber cafes, offices, academic institutes, etc., a person can
perform operations on a laptop with no physical contact with it. This in turn will reduce
the spread of viruses, bacteria and communicable diseases.
Home and office automation is a large field in which gesture recognition is being
employed. For example, smart TVs can sense finger movements and hand gestures and
offer touchless control over lighting and audio systems.
Also, the project aims to reduce e-waste, which is a very common concern with physical
hardware devices. The use of this software can do away with remotes for TVs, buttons on
smart appliances, and mice for personal computers, laptops and workstations.
The project also reduces the user's cost, be it the upfront cost of buying hardware with
new machines or the cost of replacing unusable or broken mice on old machines; one can
simply install this software and use it to control the mouse functions of the system.
It is fair to say that the virtual mouse may substitute the traditional physical mouse in the
near future, as people are moving towards a lifestyle in which every technological gadget
can be controlled and interacted with remotely, without the use of peripheral devices such
as remotes, keyboards, and so on. It does not just offer convenience; it is cost effective as
well.
The software's functionality can also be extended to augmented reality applications such
as VR/AR headsets and gesture-based games. Apart from these, mobile applications for
Android phones and smart TVs can be implemented so that those devices can be operated
wirelessly.

CHAPTER 2

LITERATURE REVIEW

2.1 LITERATURE SURVEY

In the 1970s, command-line interfaces (the typewriter paradigm) were introduced,
whereby interaction was only through text. This style of interaction suffered from rigid
protocols, which limited the power of computers. In the 1980s the graphical user interface
(GUI) and the desktop metaphor were introduced at XEROX PARC. This paradigm is best
described by the acronym WIMP (windows, icons, menus, and a pointing device).
Despite the desktop paradigm being very useful for providing a direct-manipulation style
of interaction, two forces, the evolving nature of computers and the desire for a more
powerful and compelling user experience, have driven change in interface design.
Sherin Mohammed Sali Shajideen and Preetha V. H. [7] set up two USB cameras that
capture the side and top views. MATLAB is used to separate the views. For the two
distinct viewpoints, they train two detectors and select different picture samples for the
top and side views.
There are two broad classes of gesture-based HCI methods: those which use data gloves
and those which are vision-based. Data-glove methods use sensors attached to the glove
that transduce finger flexion into electrical signals for determining hand postures.
However, this approach forces a user to carry many cables needed for connection to the
computer, which compromises naturalness and ease of use. On the other hand, vision-based
methods are non-invasive and intuitive, since they are based on the way humans perceive
information about their environment.
In [2], Nathaniel Rossol provides a unique approach for hand position estimation by
applying smart depth-sensing technology to track hand postures. Their method specifically
addresses the problem of tracking objects by using posture estimates from several
detectors positioned at different vantage points; greater flexibility is gained by carefully
examining the independently generated skeletal posture estimations from each sensor
system. In comparison to the single-sensor technique, the testing results reveal that they
were able to reduce the total estimation error by 30% in a two-detector configuration,
although that is still insufficient.
In [3], Shining Song, Dongsong Yan, and Yongjun Xie use a simple camera to capture
the input data as images; the input image is then converted into YCbCr space and
binarized with the help of the Otsu algorithm. Some irregular little holes or small protrusions inside the
border will remain in the binarized picture because of other objects in the background
interfering with image segmentation, which may cause issues with gesture recognition
processing. As a result, the outcome must be processed using mathematical morphology.
After locating the centroid point of the gesture picture, the motion direction is assessed
during the dynamic gesture recognition phase in accordance with the motion trajectory
of the centroid point. [4] This study proposes the k-cosine curvature technique for
fingertip recognition and provides an enhanced cut-off segmentation technique to address
the issue of range with depth data for hand motion segmentation.
In [5], Liu Qiongli, Xu Dajun, Li Zhiguo, Zhou Peng, Zhou Jingjing and Xu Yongxia
describe a method for recognising hand posture using KNN classification and distance
learning. The method is split into two phases: Mahalanobis distance-matrix learning and
hand posture recognition. The primary goal of the distance-matrix learning phase is to
learn a reasonable matrix from the samples. After extracting the hand shape characteristics
(Fourier descriptors) from the image, the hand posture recognition phase uses a KNN
classifier to identify the result from the K nearest neighbours in the training samples.
2.1.1 Cursor Control and Text Entry using VB
Cursor control has been the easiest mouse function to achieve through visual techniques.
Using these techniques, cursor control is achieved by tracking different features and
mapping the changes in their location to the cursor's x, y coordinates on the screen.
Mokhtar M. [1] uses a multivariate Gaussian distribution to track hand movements. In
order to increase the likelihood of matching the tested input gesture with the already
trained classes in feature space, they use a bivariate Gaussian PDF for fitting and
capturing the movement of the hand. This step helps to reduce the rotation disturbance
that is typically treated by increasing the number of trained gestures in each gesture class.
The virtual mouse approach described by Tran, D.S., Ho, N.H., Yang, H.J. et al. in 2021
[6] uses RGB-D pictures and fingertip recognition, exploiting detailed bone-joint (palm
and finger) information from a Microsoft Kinect Sensor version 2. The hand's region of
interest and the palm's centre are first retrieved and then translated into a binary image. A
border-tracing method is then used to extract and characterise the hand's outline.
Depending on the coordinates of the hand joint points, the K-cosine method is used to
determine the position of the fingertip. Finally, the mouse cursor is controlled via hand
movement by mapping the joint and fingertip positions to the RGB pictures. This research
still has several flaws.
In differential mode, the accumulation of displacement of motion parameters drives the
navigation of the mouse cursor. However, head tracking lacks precision and therefore
feature tracking can be used to control the cursor.

2.1.2 Auditory Based Interaction


This section discusses audio-based interface systems and consists of three parts. The first
part discusses the interaction device; this is followed by methods for interface selection,
and lastly cursor control and text entry.
The concept of talking to computers has steadily attracted funding and interest and has
brought naturalness to communication. Rosenfeld et al. stated the advantages of using
speech for interaction:
1. Speech is an ambient medium rather than an attentional one. Unlike vision-based
interaction, which needs our focused attention, speech permits us to interact while
using other senses to perform other tasks.
2. Descriptive rather than referential: unlike vision-based interaction, where we
point to or grasp objects of interest, in speech-based interaction objects are
described by roles and attributes. Hence, it is easy to combine with other modalities.

Different approaches have been employed to enable users to interact via speech; three
commonly used ones are natural language, dialog trees, and commands.
Dialog-tree systems reduce the difficulty of recognition by breaking an activity down into
a sequence of choice points at which the user selects. The disadvantage of this approach
is that the user is unable to directly access the parts of a domain that are of
immediate interest.
Despite the difficulty of constructing such systems on the designer's side, they simplify
the interaction as well as lessen the need for user training.
2.1.3 Cursor Control and Text Entry using Speech.
In different research areas, both academic and commercial, voice-based cursor control
has been proposed to enable control of a cursor using speech input.
Lohr and Brugge proposed two approaches for simulating cursor control using speech:
target-based and direction-based navigation. Direction-based mouse emulation is done by
having the cursor move in the direction uttered by the user. In their research, Sears et al.
implemented movement of the cursor with commands like "Move down", "-up", "-right"
or "-left", stopping when "stop" is uttered. However, this has a precision bottleneck
around the stopping command: the cursor keeps moving while the speech recognizer is
still processing "stop". In the study by Igarashi and Hughes, mouse control was achieved
by uttering the direction followed by a non-verbal vocalization; the cursor moves as long
as the vocalization lasts, e.g., "move down". Using non-verbal vocalization has the
advantage of high precision.
Harada et al. used a similar approach. In their study they assigned each direction a specific
sound: the user utters the vowel sound corresponding to one of the desired directions ("a"
for up, "e" for right, "i" for down, and "o" for left). The cursor speed starts out slow and
gradually increases with time. The cursor is stopped by uttering the same vowel again,
and a click is performed by uttering a two-vowel command ("a-e"). The advantage of this
system is that it offers immediate processing of vocal input.
Target-based mouse emulation involves defining specific targets on the screen and
assigning speakable identifiers which are displayed close to the target. Uttering an
identifier places the cursor within the corresponding target; for example, uttering the
widget name (with the widget used as a target) causes the mouse cursor to be placed over
the button. However, this method suffers from layout and usability issues if the number of
widgets is high.
In the Vocal Joystick, vowel sounds are mapped to cursor directions according to their
dominant articulatory configurations; in the 8-way mode all of the mapped vowels are
used, while in the 4-way mode only the vowels along the horizontal and vertical axes are
used. In order to counteract the usability problem above, the number of targets might be
restricted by, for instance, dividing the screen into a coarse-grained grid of named cells.
Nevertheless, a number of commands will be needed per task.

2.1.4 Multimodal Systems


Multimodal systems are systems which combine two or more modalities. These
modalities refer to the ways in which the system responds to inputs, that is, the
communication channels. Due to the problems faced with unimodal systems, a
combination of different modalities helps solve them: when one modality is inaccessible,
a task can be completed with the other modalities. Modalities can be combined as
redundant or complementary depending on the number and type of modalities integrated.
A combination of devices is used in a multimodal system, and interaction is done using
the devices which enable input for the specific modality. For a system which has visual
and speech modalities, a camera and a microphone (or a headset with a microphone) can
be used for interface selection.
Frangeskides et al. explicitly provided three methods for interface selection: idle click,
external switch and voice-command click. In the idle click they used a threshold value
whereby a mouse click is triggered once the condition becomes true. The external-switch
method involves the use of sound to invoke a mouse click when the system is in Sound
Click mode; the mouse click is triggered if the sound produced has a higher intensity than
the background noise.
In order to simulate all the mouse events, Lanitis et al. categorised voice commands into
five classes, including Mouse, Move cursor, Computer and Open. Manipulation of
interface widgets by voice command involved the use of commands in the category
"Mouse"; these actions include drag, drop and click, among others. The advantage of
operating in Sound Click mode is that it is speaker-independent, hence there is no need for
training, as compared to Voice Command mode.

2.1.5 Cursor Control and Text Entry


In the work by Frangeskides and Lanitis, a system which uses speech and visual input to
perform HCI tasks was proposed. In their system, visual input is used for cursor control
through face-tracking; their face-tracking is achieved by tracking the eyes and nose. The
regions generated around the eyes and nose are used to calculate the horizontal and
vertical projections for tracking, to enable cursor control. The difference of the face
location from the original position is mapped into cursor movement, that is, towards the
direction of the movement.
In their work, cursor control could also be enabled when Voice Command mode was
activated. In that case cursor control is done by uttering commands from one of the five
groups ("Move cursor") they designed. Although cursor control using voice commands
can be used, it lacks precision and therefore needs to be integrated with face tracking in
order to reach a widget precisely.
For text entry an "On-Screen Keyboard" (a Windows operating system utility) was used.
As soon as the keyboard was activated, keys were entered by using head tracking to move
the cursor onto the keys. Speech was then used to invoke mouse click events when the
system was in Sound Click mode, or, if it was in Voice Command mode, cursor
movement was used.
2.1.6 Modality Integration
This section focuses on integrating modalities to provide the best interface to the user.
First, two taxonomies of multimodal systems are summarized, followed by a discussion of
how the different modalities relate to each other.
Benoit et al. stated two taxonomies of multimodal integration:

1. How are the different modalities supported by a particular system?


2. How does information from different modalities relate to each other and how is it
combined?

The first taxonomy is based on the types and goals of cooperation between modalities;
they stated the following types of cooperation:
Equivalence: a chunk of information may be processed by either modality as an
alternative.
Redundancy: the same piece of information is transmitted using more than one modality,
for example if the user types "close" on the keyboard and also utters "close".
Complementarity: different chunks of information belonging to the same command are
transmitted over different modalities.
Specialisation: a given kind of information is always processed by the same modality.
Concurrency: independent kinds of information are processed using different modalities
and overlap in time (e.g. moving the cursor whilst editing a document); that is, parallel use
of different modalities.
Dung-Hua Liou, Chen-Chiung Hsieh, and David Lee in 2010 [10] proposed a study on
"A Real-Time Hand Gesture Recognition System Using Motion History Image." The
main limitation of this model is its handling of more complicated hand gestures.
In [11] (June 2010), "Vision Based Gesture Recognition for Human Computer Interaction"
was published, which used a motion detection and recognition algorithm, a solid method
for gesture detection. Devanshu Singh, in the International Journal for Research in Applied
Science and Engineering Technology, described a novel method of controlling mouse
movement with a real-time camera using OpenCV. Apart from this, the official
documentation of OpenCV and MediaPipe was referred to extensively.
Monika B. Gandhi, Sneha U. Dudhane, and Ashwini M. Patil in 2013 [12] proposed a
study on "Cursor Control System Using Hand Gesture Recognition." In this work, the
limitation is that stored frames need to be processed for hand segmentation and skin-pixel
detection.
In [8] (2013) a research paper named "Vision-Based Multimodal Human-Computer
Interaction Using Hand and Head Gestures" was published, which used the hand and head
to control computer-vision-based applications; it recognized gestures based on the pattern
of the hand and the motion of the head.
In [9] (2015), the paper "Vision-Based Computer Mouse Control Using Hand Gesture"
described a camera-based technique which used real-time video acquisition and
implemented left and right click. It mainly used binary-image generation and filtering.
CHAPTER 3
HARDWARE AND SOFTWARE REQUIREMENTS

For the purpose of hand gesture detection and hand tracking, the MediaPipe framework
is used, and the OpenCV library is used for computer vision. The algorithm makes use of
machine learning concepts to track and recognize the hand gestures and hand tips.
3.1 HARDWARE REQUIREMENTS

The following describes the hardware needed in order to develop and execute the Virtual
Mouse application.
3.1.1 Computer Desktop or Laptop
A desktop computer or a laptop will be utilized to run the software and display what the
webcam has captured. A notebook, which is a small, lightweight and inexpensive laptop
computer, is proposed to increase mobility. The system used for development has the
following specification:
Processor: Core 2 Duo
Main Memory: 4 GB RAM
Hard Disk: 320 GB
Display: 14" Monitor
3.1.2 Webcam
A webcam is utilized for image acquisition; it continuously captures images so that the
program can process them and find pixel positions.

3.2 SOFTWARE REQUIREMENTS


3.2.1 PYTHON
Python is a multi-paradigm programming language. Object-oriented programming and
structured programming are fully supported, and many of its features support functional
programming and aspect-oriented programming (including by metaprogramming and
metaobjects [magic methods]). Many other paradigms are supported via extensions,
including design by contract and logic programming.
Python uses dynamic typing, and a combination of reference counting and a cycle-
detecting garbage collector for memory management. It uses dynamic name resolution
(late binding), which binds method and variable names during program execution.
Its design offers some support for functional programming in the Lisp tradition: it has
filter, map and reduce functions; list comprehensions, dictionaries, sets, and generator
expressions. The standard library has two modules (itertools and functools) that
implement functional tools borrowed from Haskell and Standard ML.
Its core philosophy is summarized in the document The Zen of Python (PEP 20), which
includes aphorisms such as:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Readability counts.

Rather than building all of its functionality into its core, Python was designed to be highly
extensible via modules. This compact modularity has made it particularly popular as a
means of adding programmable interfaces to existing applications. Van Rossum's vision
of a small core language with a large standard library and easily extensible interpreter
stemmed from his frustrations with ABC, which espoused the opposite approach.
Python strives for a simpler, less-cluttered syntax and grammar while giving developers
a choice in their coding methodology. In contrast to Perl's "there is more than one way
to do it" motto, Python embraces a "there should be one—and preferably only one—
obvious way to do it" philosophy. Alex Martelli, a Fellow at the Python Software
Foundation and Python book author, wrote: "To describe something as 'clever' is not
considered a compliment in the Python culture."

Figure 2.1 – Python Logo

3.2.2 OpenCV
OpenCV (Open Source Computer Vision Library) is a large open-source, cross-platform
library for computer vision, machine learning, and image processing, with which we can
develop real-time computer vision applications. OpenCV supports a wide variety of
programming languages like Python, C++, Java, etc. It mainly focuses on image
processing and on video capture and analysis to identify objects, faces, or even the
handwriting of a human. It can be installed using "pip install opencv-python".
OpenCV was built to provide a common infrastructure for computer vision applications
and to accelerate the use of machine perception in the commercial products. Being a
BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the
code.
Computer Vision can be defined as a discipline that explains how to reconstruct,
interrupt, and understand a 3D scene from its 2D images, in terms of the properties of the
structure present in the scene. It deals with modeling and replicating human vision using
computer software and hardware.
Computer Vision overlaps significantly with the following fields:
Image Processing − It focuses on image manipulation.
Pattern Recognition − It explains various techniques to classify patterns.
Photogrammetry − It is concerned with obtaining accurate measurements from images.
Image processing deals with image-to-image transformation. The input and output of
image processing are both images.
Computer vision is the construction of explicit, meaningful descriptions of physical
objects from their image. The output of computer vision is a description or an
interpretation of structures in 3D scene.
Applications of Computer Vision:
Here we have listed some of the major domains where Computer Vision is heavily used.

Robotics Application
1. Localization − Determine robot location automatically
2. Navigation
3. Obstacles avoidance
4. Assembly (peg-in-hole, welding, painting)
5. Manipulation (e.g. PUMA robot manipulator)
6. Human Robot Interaction (HRI) − Intelligent robotics to interact with and
serve people

Figure 2.2 – OpenCV Logo


3.2.3 MediaPipe
MediaPipe is an open-source framework from Google for building machine learning
pipelines. The MediaPipe framework is based on three fundamental parts: performance
evaluation, a framework for retrieving sensor data, and a collection of reusable components
called calculators. A pipeline is a graph which consists of components called calculators,
where each calculator is connected by streams through which packets of data flow.
Developers are able to replace or define custom calculators anywhere in the graph,
creating their own application. The calculators and streams combined create a data-flow
diagram [5]. It can be installed using "pip install mediapipe".

Figure 2.3 MediaPipe Logo

A single-shot detector model is used by MediaPipe for detecting and recognizing a hand
or palm in real time [3]. In the hand detection module, a palm detection model is trained
first because palms are easier to train on; furthermore, non-maximum suppression works
significantly better on small objects such as palms or fists. The hand landmark model then
locates 21 joint or knuckle coordinates in the hand region.
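As a minimal sketch of how these 21 landmarks can be read out with the MediaPipe Python API (the input here is assumed to be a single BGR frame; "hand.jpg" is a hypothetical sample image, and a webcam frame works the same way):

    import cv2
    import mediapipe as mp

    hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)

    frame = cv2.imread("hand.jpg")   # assumed sample image
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # each detected hand carries 21 landmarks with normalized x, y and relative z
            for idx, lm in enumerate(hand.landmark):
                print(idx, lm.x, lm.y, lm.z)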

Figure 2.4 – Co-ordinates or Landmarks on Hand

3.2.4 PyAutoGUI
PyAutoGUI is a cross-platform GUI automation Python module used to programmatically
control the mouse and keyboard. In other words, it helps us automate mouse and keyboard
actions to interact with other applications from a Python script. It can be installed with
"pip install pyautogui". A brief usage sketch is given after the feature list below.
PyAutoGUI has several features:

• Moving the mouse and clicking in the windows of other applications.


• Sending keystrokes to applications (for example, to fill out forms).
• Taking screenshots and, given an image (for example, of a button or checkbox),
finding it on the screen.
• Locating an application’s window and moving, resizing, maximizing, minimizing, or closing it.
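The following short sketch shows a few of these calls; the coordinates and scroll amount are arbitrary example values:

    import pyautogui

    screen_w, screen_h = pyautogui.size()                         # current screen resolution
    pyautogui.moveTo(screen_w // 2, screen_h // 2, duration=0.2)  # move the cursor to the centre
    pyautogui.click()                                             # left click at the current position
    pyautogui.rightClick()                                        # right click
    pyautogui.scroll(-200)                                        # scroll down
    pyautogui.press("volumeup")                                   # send a media key press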

Figure 2.5 – PyAutoGUI Logo

3.2.5 Math
This module provides access to the mathematical functions defined by the C standard.
These functions cannot be used with complex numbers; use the functions of the same
name from the cmath module if you require support for complex numbers. The distinction
between functions which support complex numbers and those which don't is made because
most users do not want to learn quite as much mathematics as is required to understand
complex numbers. Receiving an exception instead of a complex result allows earlier
detection of an unexpected complex number used as a parameter, so that the programmer
can determine how and why it was generated in the first place.
The following functions are provided by this module. Except when explicitly noted
otherwise, all return values are floats.
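In this project the math module is mainly needed for simple geometry, for example the straight-line distance between two fingertip positions. A small sketch with assumed pixel coordinates:

    import math

    x1, y1 = 320, 240          # e.g. thumb tip position in pixels (assumed values)
    x2, y2 = 400, 260          # e.g. index finger tip position (assumed values)

    length = math.dist((x1, y1), (x2, y2))   # Euclidean distance between the two tips
    print(round(length, 2))                   # 82.46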

3.2.6 PyCaw
PyCaw (Python Core Audio Windows library) lets Python code control the audio devices
on Windows; in this project it is used to read and set the master system volume for the
volume-control gestures. It can be installed using "pip install pycaw".
The project also relies on protocol buffers (Protobuf), a language-agnostic data
serialization format developed by Google, whose messages (such as MediaPipe's outputs)
can be converted into Python dictionaries with MessageToDict. Protobuf is attractive for
two reasons: low data volume, since it uses a compact binary format, and persistence,
since its serialization is backward-compatible.
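A minimal, Windows-only sketch of the usual PyCaw pattern for setting the master volume (the 0.5 value, i.e. 50%, is just an example):

    from ctypes import cast, POINTER
    from comtypes import CLSCTX_ALL
    from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

    devices = AudioUtilities.GetSpeakers()
    interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
    volume = cast(interface, POINTER(IAudioEndpointVolume))

    print(volume.GetVolumeRange())                 # (min dB, max dB, step)
    volume.SetMasterVolumeLevelScalar(0.5, None)   # set system volume to 50%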

3.2.7 ENUM
Enum is a class in python for creating enumerations, which are a set of symbolic names
(members) bound to unique, constant values. The members of an enumeration can be
compared by these symbolic names, and the enumeration itself can be iterated over. An
enum has the following characteristics.

• Enum members have an evaluable string representation of the object, also called repr().
• The name of an enum member is displayed using the 'name' attribute.

Using type() we can check the enum types.


We have used the enum.IntEnum class of this library: enum.IntEnum is the base class for
creating enumerated constants that are also subclasses of int.
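As an illustration, a hypothetical gesture encoding with enum.IntEnum might look like the following (the names and values are illustrative, not necessarily those used in the project's source code):

    from enum import IntEnum

    class Gest(IntEnum):
        # illustrative gesture codes
        PALM = 0
        FIST = 1
        PINCH = 2
        V_GESTURE = 3

    g = Gest.PINCH
    print(repr(g), g.name, int(g))   # <Gest.PINCH: 2> PINCH 2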

3.2.8 Screen_Brightness_Control

A Python tool for controlling the brightness of your monitor. It supports Windows and
most flavors of Linux. We can install this library with "pip install
screen-brightness-control".
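A minimal sketch of its use (the 60% value is just an example):

    import screen_brightness_control as sbc

    print(sbc.get_brightness())   # current brightness, one value per detected display
    sbc.set_brightness(60)        # set the brightness of all displays to 60%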

CHAPTER 4

METHODOLOGY
We all use new technology in our day-to-day life, including in our devices. When we talk
about technology, the best example is the computer. Computers have evolved from very
modest beginnings and advanced significantly over the decades since they originated;
however, we still use the same setup, which includes a mouse and keyboard. Although
technology has brought many changes to computers, such as laptops where the camera is
now an integrated part of the machine, we still have a mouse which is either integrated or
an external device.
This is how we came to implement a new technology for the mouse, where we can control
the computer with our fingertips; this system is known as Hand Gesture Movement. With
the aid of our fingers, we are able to guide the cursor. For this project we have used the
following.
Python is the base language, as it is open source, easy to understand and developer
friendly. Anaconda is a packaged Python distribution that ships with many important
packages and provides a friendly environment. The main packages required here are
PyAutoGUI and OpenCV: PyAutoGUI is a Python module for programmatically
controlling the mouse and keyboard, and OpenCV is used to capture and process the
frames that drive the mouse events. In the color-marker variant of the system, Red,
Yellow, and Blue are the three colors used to mark the fingertips. The program uses image
processing to extract the required data and then feeds it to the computer's mouse interface
according to predefined notions. It is written in Python, uses the cross-platform image
processing module OpenCV, and implements the mouse actions using the Python-specific
library PyAutoGUI. Real-time video captured by the webcam is processed and only the
three colored fingertips are extracted; their centers are measured using the method of
moments, and the action to be taken is determined based on their relative positions.
The first step is to use the function cv2.VideoCapture(), which captures the live video
stream from the camera; OpenCV provides a very easy interface for this. To capture an
image we need to create a video-capture object. We then convert the captured images into
HSV format. The second step is the function Calibratecolor(), with which the user can
calibrate the color ranges for the three fingers individually.
The third step is the function cv2.inRange(); depending on the calibration, only the three
fingertips are extracted. We remove the noise from the feed using two morphological
steps, erosion and dilation.
The next step is to find the center and radius of each fingertip so that we can start moving
the cursor; ChooseAction() is used in the code to do this. Based on its result, the
performAction() method then uses the PyAutoGUI library to perform actions such as free
cursor movement, left click, right click, drag/select, scroll up, scroll down, and so on.

4.1 CAMERA CONTROL


The runtime operations are managed by the webcam of the connected laptop or desktop.
To capture a video, we need to create a VideoCapture object. Using the Python computer
vision library OpenCV, the video-capture object is created and the web camera starts
capturing video. Its argument can be either a device index or the name of a video file. The
device index is just a number specifying which camera to use; since we only use a single
camera we pass it as 0. We can add additional cameras to the system and pass them as 1, 2
and so on. After that, frames can be captured one by one, and at the end the capture must
be released. We could also apply color-detection techniques to any image with simple
modifications to the code.
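A minimal sketch of creating and releasing the capture object described above:

    import cv2

    cap = cv2.VideoCapture(0)      # 0 = default webcam; 1, 2, ... for additional cameras
    success, frame = cap.read()    # grab a single BGR frame
    if success:
        print(frame.shape)         # e.g. (480, 640, 3), depending on the camera
    cap.release()                  # always release the device when done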

4.2 VIDEO CAPTURING & PROCESSING


The AI virtual mouse system uses the webcam, capturing each frame until the program
terminates. The video frames are converted from BGR to RGB color space in order to
find the hands in the video frame by frame, as shown in the following code:
def findHands(self, img, draw=True):
    imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    self.results = self.hands.process(imgRGB)

4.3 FRAME DISPLAY


imshow() is a function of HighGUI, and it requires waitKey() to be called regularly; the
event-loop processing of the imshow() function is done by calling waitKey(). The function
waitKey() waits for a key event for a "delay" (here, 5 milliseconds). Window events like
redrawing, resizing, input events, etc. are processed by HighGUI, so we call the waitKey
function even with a 1 ms delay [4].
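Putting the capture and display together, a minimal frame-display loop might look like this (pressing 'q' ends the loop; the window name is arbitrary):

    import cv2

    cap = cv2.VideoCapture(0)
    while True:
        success, frame = cap.read()
        if not success:
            break
        cv2.imshow("Virtual Mouse", frame)         # show the current frame in a window
        if cv2.waitKey(1) & 0xFF == ord('q'):      # process GUI events; quit on 'q'
            break
    cap.release()
    cv2.destroyAllWindows()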

Figure 3.1 – Flowchart of Methodology

4.4 MODULE DIVISION

4.4.1 Hand Tracking


This module involves capturing the image using the palm model, detecting unique points
on the palm using the hand landmark model, connecting the detected unique points, and
implementing a frame-rate counter for the hand tracking.
MediaPipe Hands is a high-fidelity hand and finger tracking solution. It employs machine
learning (ML) to infer 21 3D landmarks of a hand from just a single frame. Whereas
current state-of-the-art approaches rely primarily on powerful desktop environments for
inference, our method achieves real-time performance on a mobile phone, and even
scales to multiple hands. We hope that providing this hand perception functionality to the
wider research and development community will result in an emergence of creative use
cases, stimulating new applications and new research avenues.

ML Pipeline:
MediaPipe Hands utilizes an ML pipeline consisting of multiple models working
together: A palm detection model that operates on the full image and returns an oriented
hand bounding box. A hand landmark model that operates on the cropped image region
defined by the palm detector and returns high-fidelity 3D hand keypoints. This strategy
is similar to that employed in the MediaPipe Face Mesh solution, which uses a face
detector together with a face landmark model.
Providing the accurately cropped hand image to the hand landmark model drastically
reduces the need for data augmentation (e.g. rotations, translation and scale) and instead
allows the network to dedicate most of its capacity towards coordinate prediction
accuracy. In addition, in our pipeline the crops can also be generated based on the hand
landmarks identified in the previous frame, and only when the landmark model could no
longer identify hand presence is palm detection invoked to relocalize the hand.
The pipeline is implemented as a MediaPipe graph that uses a hand landmark tracking
subgraph from the hand landmark module, and renders using a dedicated hand renderer
subgraph. The hand landmark tracking subgraph internally uses a hand landmark
subgraph from the same module and a palm detection subgraph from the palm detection
module.
Palm Detection Model :
To detect initial hand locations, we designed a single-shot detector model optimized for
mobile real-time uses in a manner similar to the face detection model in MediaPipe Face
Mesh. Detecting hands is a decidedly complex task: our lite model and full model have
to work across a variety of hand sizes with a large scale span (~20x) relative to the image
frame, and be able to detect occluded and self-occluded hands.

Figure 3.2 – Neutral Gesture
Whereas faces have high contrast patterns, e.g., in the eye and mouth region, the lack of
such features in hands makes it comparatively difficult to detect them reliably from their
visual features alone. Instead, providing additional context, like arm, body, or person
features, aids accurate hand localization.

Our method addresses the above challenges using different strategies. First, we train a
palm detector instead of a hand detector, since estimating bounding boxes of rigid objects
like palms and fists is significantly simpler than detecting hands with articulated fingers.
In addition, as palms are smaller objects, the non-maximum suppression algorithm works
well even for two-hand self-occlusion cases, like handshakes. Moreover, palms can
be modelled using square bounding boxes (anchors in ML terminology) ignoring other
aspect ratios, and therefore reducing the number of anchors by a factor of 3-5.
Second, an encoder-decoder feature extractor is used for bigger scene context awareness
even for small objects (similar to the RetinaNet approach). Lastly, we minimize the focal
loss during training to support a large amount of anchors resulting from the high scale
variance.

4.4.2 Cursor and Clicking using Hand Gesture


This module covers movement of the mouse cursor through hand tracking, implementation
of single click, and implementation of the neutral gesture.
Hand Landmark Model
After the palm detection over the whole image our subsequent hand landmark model
performs precise keypoint localization of 21 3D hand-knuckle coordinates inside the
detected hand regions via regression, that is direct coordinate prediction. The model
learns a consistent internal hand pose representation and is robust even to partially visible
hands and self-occlusions.
To obtain ground truth data, we have manually annotated ~30K real-world images with
21 3D coordinates, as shown below (we take Z-value from image depth map, if it exists
per corresponding coordinate). To better cover the possible hand poses and provide
additional supervision on the nature of hand geometry, we also render a high-quality
synthetic hand model over various backgrounds and map it to the corresponding 3D
coordinates.

Figure 3.3 – Mouse Click


STATIC_IMAGE_MODE
If set to false, the solution treats the input images as a video stream. It will try to detect
hands in the first input images, and upon a successful detection it further localizes the hand
landmarks. In subsequent images, once all max_num_hands hands are detected and the
corresponding hand landmarks are localized, it simply tracks those landmarks without
invoking another detection until it loses track of any of the hands.
This reduces latency and is ideal for processing video frames. If set to true, hand detection
runs on every input image, ideal for processing a batch of static, possibly unrelated,
images. Default to false.

MAX_NUM_HANDS
Maximum number of hands to detect. Default to 2.

MODEL_COMPLEXITY
Complexity of the hand landmark model: 0 or 1. Landmark accuracy as well as inference
latency generally go up with the model complexity. Default to 1.

Figure 3.4 – Right Click

MIN_DETECTION_CONFIDENCE
Minimum confidence value ([0.0, 1.0]) from the hand detection model for the detection
to be considered successful. Default to 0.5.

MIN_TRACKING_CONFIDENCE:
Minimum confidence value ([0.0, 1.0]) from the landmark-tracking model for the hand
landmarks to be considered tracked successfully, or otherwise hand detection will be
invoked automatically on the next input image. Setting it to a higher value can increase
robustness of the solution, at the expense of a higher latency. Ignored if
static_image_mode is true, where hand detection simply runs on every image. Default to
0.5.
Output
Naming style may differ slightly across platforms/languages.

MULTI_HAND_LANDMARKS
Collection of detected/tracked hands, where each hand is represented as a list of 21 hand
landmarks and each landmark is composed of x, y and z. x and y are normalized to [0.0,
1.0] by the image width and height respectively. z represents the landmark depth with
the depth at the wrist being the origin, and the smaller the value the closer the landmark
is to the camera. The magnitude of z uses roughly the same scale as x.
MULTI_HAND_WORLD_LANDMARKS
Collection of detected/tracked hands, where each hand is represented as a list of 21 hand
landmarks in world coordinates. Each landmark is composed of x, y and z: real-world 3D
coordinates in meters with the origin at the hand’s approximate geometric center.
MULTI_HANDEDNESS
Collection of the handedness of the detected/tracked hands (i.e., whether each is a left or right hand).
Each hand is composed of label and score. label is a string of value either "Left" or
"Right". score is the estimated probability of the predicted handedness and is always
greater than or equal to 0.5 (and the opposite handedness has an estimated probability of
1 - score).
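As an illustration of how these options and outputs fit together, the sketch below configures the Hands solution with the parameters described above and prints the handedness and wrist landmark of each detected hand; the image path is a placeholder, not a file provided by this project.

    import cv2
    import mediapipe as mp

    mp_hands = mp.solutions.hands

    # static_image_mode=True: hand detection runs on every image (batch of photos).
    with mp_hands.Hands(static_image_mode=True,
                        max_num_hands=2,
                        model_complexity=1,
                        min_detection_confidence=0.5) as hands:
        image = cv2.imread("hand.jpg")    # placeholder path
        if image is None:
            raise SystemExit("replace hand.jpg with a real image path")
        results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            for handedness, landmarks in zip(results.multi_handedness,
                                             results.multi_hand_landmarks):
                label = handedness.classification[0].label   # "Left" or "Right"
                score = handedness.classification[0].score   # always >= 0.5
                wrist = landmarks.landmark[0]                # x, y normalized; z relative to the wrist
                print(label, round(score, 2), wrist.x, wrist.y, wrist.z)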
4.4.3 Volume Controls

Dynamic gestures are used for volume control: the rate of increase/decrease of volume is proportional to the distance moved by the pinch gesture from its start point.

The volume of the system is increased and decreased using hand gestures.

Figure 3.5 – Increase Volume Gesture

We can now think of creating a straight line between landmarks 4 and 8 and computing its length, which will be proportional to the volume. We need to be careful with the following: the length of this line might not be 0 even when the fingers are touching each other, because the landmark points are not on the fingertip edges, and we do not know the distance in pixels when the fingers are farthest apart. We therefore have to print the length and derive an UPPER_BOUND and LOWER_BOUND from it. Second, the volume range exposed by the audio package may be 0 to 100, but it can also be something else; in any case we need to map the [LOWER_BOUND, UPPER_BOUND] interval onto [MIN_VOL, MAX_VOL].
Finally, just to make it look prettier, we will draw a circle at the midpoint that changes colour when both fingers are very close to each other, and a volume bar on the left.
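A minimal sketch of this mapping is given below. It assumes the thumb tip and index fingertip have already been converted to pixel coordinates, and it uses the pycaw library for the actual Windows volume call; the library choice and the LOWER_BOUND/UPPER_BOUND values are assumptions made for illustration, not fixed by this report.

    import math
    import numpy as np
    from ctypes import cast, POINTER
    from comtypes import CLSCTX_ALL
    from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

    # Illustrative pixel bounds, found by printing the thumb-index distance beforehand.
    LOWER_BOUND, UPPER_BOUND = 30, 250

    # Acquire the system volume endpoint (Windows, via pycaw).
    device = AudioUtilities.GetSpeakers()
    interface = device.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
    volume = cast(interface, POINTER(IAudioEndpointVolume))
    min_vol, max_vol, _ = volume.GetVolumeRange()       # this range is in dB, not 0-100

    def set_volume_from_pinch(thumb_xy, index_xy):
        # Map the thumb-index distance in pixels onto the system volume range.
        length = math.hypot(index_xy[0] - thumb_xy[0], index_xy[1] - thumb_xy[1])
        level = np.interp(length, [LOWER_BOUND, UPPER_BOUND], [min_vol, max_vol])
        volume.SetMasterVolumeLevel(level, None)

    set_volume_from_pinch((200, 300), (320, 370))       # a medium pinch sets a mid-range volume

np.interp conveniently clamps lengths outside the bounds to the end values, so a fully open or fully closed pinch simply saturates at maximum or minimum volume.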

Figure 3.6 – Decrease Volume Gesture

4.4.4 Scrolling Commands
Dynamic gestures are used for horizontal and vertical scroll. The speed of scrolling is proportional to the distance moved by the pinch gesture from its start point. Vertical and horizontal scrolls are controlled by vertical and horizontal pinch movements respectively. Both scroll-up and scroll-down commands are implemented.
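A minimal sketch of proportional scrolling is shown below, assuming pyautogui for the scroll calls and a pinch midpoint tracked from its start position; the gain value is illustrative, and horizontal scrolling via pyautogui.hscroll is not available on every platform.

    import pyautogui

    SCROLL_GAIN = 0.5   # illustrative scaling from pixels of pinch movement to scroll units

    def scroll_from_pinch(start_xy, current_xy):
        # Scroll vertically and horizontally in proportion to the pinch displacement.
        dx = current_xy[0] - start_xy[0]
        dy = current_xy[1] - start_xy[1]
        pyautogui.scroll(int(-dy * SCROLL_GAIN))    # moving the pinch up scrolls the page up
        if dx:
            pyautogui.hscroll(int(dx * SCROLL_GAIN))

    scroll_from_pinch((300, 400), (300, 340))       # pinch moved 60 px upwards -> scroll up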

4.4.5 Brightness Control


Dynamic gestures are used for brightness control: the rate of increase/decrease of brightness is proportional to the distance moved by the pinch gesture from its start point.
The brightness of the system is increased and decreased using hand gestures. Changing the brightness of an image means changing the value of its pixels, that is, adding some integer value to, or subtracting it from, the current value of each pixel. Adding a value to every pixel makes the image brighter; subtracting a constant value from all of the pixels reduces the brightness. We first look at how to increase the brightness and then at how to reduce it.
Increasing the Brightness:
Increasing the brightness using OpenCV is straightforward: add some value to each channel and the brightness increases. For example, BGR images have three channels: blue (B), green (G) and red (R), so the current value of a pixel is (B, G, R). To increase the brightness, we add a scalar to it, such as (B, G, R) + (10, 10, 10) or (B, G, R) + (20, 20, 20), or whatever amount is required.
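The snippet below sketches this pixel-wise adjustment with OpenCV; the image path and the offset of 20 are placeholders. cv2.add and cv2.subtract are used because they saturate at 255 and 0 instead of wrapping around.

    import cv2
    import numpy as np

    img = cv2.imread("frame.jpg")                       # placeholder path to a BGR image
    offset = np.full(img.shape, 20, dtype=np.uint8)     # adds (20, 20, 20) to every pixel

    brighter = cv2.add(img, offset)       # values are clipped at 255 rather than overflowing
    darker = cv2.subtract(img, offset)    # values are clipped at 0

    cv2.imwrite("brighter.jpg", brighter)
    cv2.imwrite("darker.jpg", darker)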

Figure 4.7 – Low Brightness Gesture

We already have a hand-tracking module in place, so suppose we want to control the volume of the computer by moving the thumb and index finger closer together and further apart. From before, we know that the thumb tip is landmark number 4 and the index fingertip is landmark number 8.

Figure 4.8 – Use Case Diagram

CHAPTER 5

SOURCE CODE

CHAPTER 6

SCREENSHOTS

Figure 6.1 – Neutral Gesture

Figure 6.2 – Mouse Click

Figure 6.3 – Neutral Gesture

Figure 6.4 – Neutral Gesture

Figure 6.5 – Neutral Gesture

Figure 6.6 – Neutral Gesture

Figure 6.7 – Scroll Up and Down

Figure 6.8 – Low Brightness Gesture

Figure 6.9 – High Brightness Gesture

Figure 6.10 – Increase Volume Gesture

Figure 6.11 – Decrease Volume Gesture

Figure 6.12 – Volume & Brightness Neutral Gesture

CHAPTER 7

EXPERIMENTAL RESULT AND ANALYSIS

In the proposed AI virtual mouse system, the concept of advancing human-computer interaction using computer vision is presented.
Cross-comparison of the testing of the AI virtual mouse system is difficult because only a limited number of datasets is available. The hand gestures and fingertip detection have been tested under various illumination conditions and at different distances from the webcam for tracking of the hand gestures and hand-tip detection. An experimental test has been conducted and its results are summarized in Table 1. The test was performed 25 times by 4 persons with manual labelling, resulting in 600 gestures, and it was carried out in different light conditions and at different distances from the screen. Each person tested the AI virtual mouse system 10 times in normal light conditions, 5 times in faint light conditions, 5 times at a close distance from the webcam, and 5 times at a long distance from the webcam.
The purpose of this project was to make the machine interact with and respond to human behavior. Beyond that, the aim was to make the technology accessible and compatible with any standard operating system.
The proposed system controls the mouse pointer by detecting a human hand and positioning the cursor according to the location of the hand. The system controls mouse activities as simple as a left click, cursor dragging and movement. It detects the skin region of the hand, follows it continuously with the movement of the cursor, and, when a particular angle is formed between the fingers of the hand, performs the function of a left click.

CHAPTER 8

CONCLUSION

8.1 FUTURE SCOPE


This implementation of the virtual mouse has a few accuracy and precision issues: there is a precision gap in the volume and brightness controls, and the click function may also show accuracy issues. These issues will be resolved in a future model.
Further features such as a voice assistant and a virtual keyboard can also be implemented. With these additions, people having some kind of disability or hand problems will be able to operate the mouse; in this way the system can also contribute to the medical industry. With the addition of a voice assistant and a virtual keyboard, it can become a complete solution for people who cannot see or cannot move some parts of the upper body correctly.
More functions, such as direct shortcuts for saving, copy, paste and select-all, can be added. These functions are not present in a normal mouse, so adding such features will both increase the overall functionality of the mouse and create additional motivation for the user to use it instead of the physical mouse or trackpad of the system.
In the future this application can also be used on Android devices or in mobile applications, where the touchscreen concept can be replaced by hand gestures. The application/software can be made cross-platform so that it creates an ecosystem-like experience for the user and adds additional functionality on both platforms.
The proposed model cannot be used effectively in dark environments; this can be resolved by automatically increasing the brightness of the monitor, which will be implemented in a future model. This is an inherent problem that barely has a solution even today, but we can overcome it to a certain extent by asking for the relevant permissions from the system and increasing the screen brightness using the light sensor in the laptop or mobile phone.
An automatic zoom-in/zoom-out function is required to handle varying distances, where the focus is adjusted automatically based on the distance between the user and the camera. This improves the user experience, so that the user can get going straight away and does not face any focusing issues with the webcam, which could otherwise lead to wrong or missed gesture detection.

8.2 APPLICATIONS
The virtual mouse system is useful for many applications: it removes the space needed for an actual mouse, and it can be used in situations where a physical mouse cannot be used. The system eliminates the need for additional devices and improves human-computer interaction.
Further applications of the proposed system are as follows:

• The proposed model has an accuracy far greater than that of other proposed virtual mouse models, and it has many applications.
• In a COVID-19 scenario, it is not safe to use devices by physically touching them, since this can spread the virus, so the proposed virtual mouse can be used to control PC mouse functions without touching a physical mouse.
• The system can be used to control robots and other systems without the use of physical devices.
• It can be used to play augmented reality games and to use AR applications.
• Persons with certain disabilities will be able to use the mouse.

