Project Report
on
Virtual Mouse Using Hand Gesture and Voice Assistant
submitted in partial fulfillment for the award of
BACHELOR OF TECHNOLOGY
DEGREE
SESSION 2022-23
in
May, 2023
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our knowledge
and belief, it contains no material previously published or written by another person, nor material
which to a substantial extent has been accepted for the award of any other degree or diploma of
the university or other institute of higher learning, except where due acknowledgment has been
made in the text.
Signature: Signature:
Signature:
Roll No.:1900290110056
CERTIFICATE
Certified that the Project Report entitled “Gesture Control Virtual Mouse” submitted by Hrithik
Chandok (1900290110045), Manas Khare (1900290110056), and Huzaifa Ansari
(1900290110047) is their own work and has been carried out under my supervision. It is
recommended that the candidates may now be evaluated for their project work by the University.
Date: Supervisor
(Assistant Professor)
ACKNOWLEDGEMENT
We wish to express our heartfelt gratitude to all the people who have played a crucial role in
the research for this project, without their active cooperation thepreparation of this project could
not have been completed within the specifiedtime limit.
We are also thankful to our project guide Mr. Vinay Kumar sir who supported methroughout
this project with utmost cooperation and patience and for helping me in doing this Project.
Date:
Signature: Signature:
Signature:
Roll No.:1900290110056
ABSTRACT
Gesture-controlled laptops and computers have recently gained a lot of traction. Leap
Motion is the name for this technique: simple gestures of our hand in front of our
computer or laptop allow us to manage its operations. Unfortunately, employing such
techniques can be complicated: in the dark, these devices are difficult to see, and
manipulating them can disrupt a presentation. Hand gestures, by contrast, are the most
natural and effortless manner of communicating. The camera’s output will be displayed
on the monitor.
The user will be able to see their image and gestures in a window for better accuracy.
The concept is to use a simple camera instead of a classic or standard mouse to control
mouse cursor functions. The Virtual Mouse provides an interface between the user
and the system using only a camera.
It allows users to interface with machines without the use of mechanical or physical
devices, and even to control mouse functionality. This study presents a method for
controlling the cursor’s position without the need for any electronic equipment, while
actions such as clicking and dragging are carried out using various hand gestures. In
addition, functions such as volume and brightness control are provided to the
user, creating additional motivation to use this version of the mouse. As an input
device, the suggested system will require only a webcam.
The suggested system will require OpenCV and MediaPipe, the Python
programming environment, and several other libraries and tools. The Python
dependencies used to implement this system are NumPy, math,
PyAutoGUI, pycaw, MessageToDict, screen_brightness_control and others.
The system positions the cursor with the bare palm, without using any digital tool, while
operations like clicking and dragging of objects are accomplished with specific
hand gestures. The proposed system will require only a webcam as an input tool, and the
software required to implement it is OpenCV and Python. The camera’s output will be
presented on the system’s screen so that the user can further calibrate it.
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 (INTRODUCTION)
CHAPTER 3 (HARDWARE AND SOFTWARE REQUIREMENTS)
CHAPTER 7 (RESULTS AND ANALYSIS)
REFERENCES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
Gesture recognition has been an interesting problem in the computer vision community for a
long time. Hand gestures are an aspect of body language that can be conveyed through the
center of the palm, the finger positions and the shape constructed by the hand. Hand gestures
can be classified into static and dynamic. As its name implies, a static gesture refers to a
stable shape of the hand, whereas a dynamic gesture comprises a series of hand movements
such as waving. There is a variety of hand movements within a gesture; for example, a
handshake varies from one person to another and changes according to time and place. The
main difference between posture and gesture is that posture focuses more on the shape of the
hand, whereas gesture focuses on the hand movement.
Computer technology has grown tremendously over the past decade and has become a
necessary part of everyday life. The primary computer accessory for Human-Computer
Interaction (HCI) is the mouse, but the mouse is not suitable for HCI in some real-life
situations, such as Human-Robot Interaction (HRI). There has been much research on
alternatives to the computer mouse for HCI; the most natural and intuitive technique,
and a viable replacement for the computer mouse, is the use of hand gestures.
Our vision was to develop a virtual mouse system that uses a web camera to communicate
with the device in a more user-friendly way, as an alternative to using a touch screen and a
physical mouse. To harness the full potential of a webcam, it can be used for vision-based
cursor control, which effectively tracks the hand and predicts the gesture on the basis of its label.
The software enables the user to control the complete functionality of a physical mouse just by
using simple symbols and gestures. It utilizes a digital camera and computer vision
technology to control numerous mouse activities, and is capable of performing every task that the
physical computer/laptop mouse can.
The major motivation for this project arose amidst the spread of COVID-19. We wanted to build a
solution that enables users to use their devices without physically touching them. It also
reduces e-waste and helps reduce the cost of peripheral hardware.
The proposed AI virtual mouse system can be used to overcome real-world problems,
such as situations where there is no space to use a physical mouse, and for persons who
have problems with their hands and are unable to control a physical mouse. Also, amidst the
COVID-19 situation, it is not safe to use devices by touching them, because doing so may
spread the virus; the proposed AI virtual mouse can overcome these problems, since hand
gesture and hand-tip detection are used to control the PC mouse functions using a webcam
or a built-in camera.
The current system comprises a generic mouse and trackpad for monitor control, and lacks
any hand gesture control system; using a hand gesture to access the monitor screen from a
distance is not possible. Even where such systems have been attempted, their scope in the
virtual mouse field remains limited.
The existing virtual mouse control systems provide simple mouse operations using hand
recognition, with which we can control the mouse pointer, left click, right click, drag,
and so on.
Although there are a variety of systems for hand recognition, most use static
hand recognition, which simply recognizes the shape made by the hand and assigns a
defined action to each shape; this is limited to a few defined actions and causes
a lot of confusion. As technology advances, more and more alternatives to the
mouse are appearing.
The following are some of the techniques that have been employed:
1.1.1 Head Control
A special sensor (or built-in webcam) can track head movement to move the mouse pointer
around on the screen. In the absence of a mouse button, the software's dwell-delay feature is
usually used; clicking can also be accomplished with a well-placed switch.
1.1.2 Eye Control
The cost of modern eye gaze systems is decreasing. These enable users to move the pointer on
the screen solely by moving their eyes. Instead of mouse buttons, a dwell delay feature, blinks,
or a switch are used. The Tobii PCEye Go is a peripheral eye tracker that lets you use your
eyes to control your computer as if you were using a mouse.
1.1.3 Touch Screens
Touch screens, once seen as a niche technology used primarily in special-education
schools, have now become mainstream. Following the success of smartphones and tablets,
touch-enabled Windows laptops and all-in-one desktops are becoming more common.
Although this is a welcome new technology, the widespread use of touch screens has resulted
in a new set of touch accessibility issues.
However, each of the methods above has its own set of disadvantages. Regularly using the
head or eyes to control the cursor can be hazardous, leading to a number of health problems.
When using a touch screen, the user must always maintain focus on the screen, which can
cause drowsiness. By comparing these techniques, we aim to create a new project that will
not harm the user's health.
This project promotes an approach to Human-Computer Interaction (HCI) in which cursor
movement is controlled using a real-time camera. It is an alternative to current methods,
such as manually pressing buttons or changing the position of a physical computer mouse.
Instead, it utilizes a camera and computer vision technology to control various mouse events,
and is capable of performing every task that the physical computer mouse can.
We first use MediaPipe to recognize the hand and its key points. MediaPipe returns a
total of 21 key points for each detected hand, using a Palm Detection Model and a Hand
Landmark Model. The palm is detected first, since this is an easier process than full hand
landmark detection. The Hand Landmark Model then performs precise key-point localization
of 21 3D hand-knuckle coordinates inside the detected hand regions via regression, that is,
direct coordinate prediction.
We detect which finger is up using the tip ID of the respective finger, found via
MediaPipe, together with the coordinates of the fingers that are up; according to these,
the particular mouse function is performed.
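The tip-ID logic described above can be sketched in Python (a minimal illustration, not the project's actual code; it assumes MediaPipe's 21-landmark ordering, with fingertips at indices 4, 8, 12, 16 and 20, and landmarks given as normalized (x, y) pairs where y grows downward in image coordinates):

```python
TIP_IDS = [4, 8, 12, 16, 20]  # thumb, index, middle, ring, pinky tips

def fingers_up(landmarks):
    """landmarks: list of 21 (x, y) pairs in normalized image coords.
    Returns five 0/1 flags, one per finger, in TIP_IDS order."""
    flags = []
    # Thumb: compare x of the tip (4) with the joint below it (3);
    # this naive test assumes a right hand facing the camera.
    flags.append(1 if landmarks[4][0] > landmarks[3][0] else 0)
    # Other fingers: a finger is "up" when its tip is above (smaller y
    # than) the PIP joint two indices below the tip.
    for tip in TIP_IDS[1:]:
        flags.append(1 if landmarks[tip][1] < landmarks[tip - 2][1] else 0)
    return flags
```

In the real system these flags, combined with the fingertip coordinates, select which mouse function to perform.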
We then apply formulas such as distance calculation to track the gestures performed by
the user; for example, if the distance between the index finger and the middle finger becomes
zero, we perform a single-click operation by calling the single-click function.
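The fingertip-distance click test might look like the following sketch (the 0.05 threshold is a hypothetical value, not taken from the report; in the real system a call such as pyautogui.click() would fire when the test succeeds):

```python
import math

def norm_distance(p1, p2):
    """Euclidean distance between two (x, y) landmark points."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

# Hypothetical threshold: fingertips closer than 5% of the frame
# width count as "touching" (the report's "distance becomes zero").
CLICK_THRESHOLD = 0.05

def is_click(index_tip, middle_tip):
    return norm_distance(index_tip, middle_tip) < CLICK_THRESHOLD
```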
Similarly, we have defined and deployed various other gestures: sliding up with the thumb
and index finger increases the system volume; keeping a constant distance between the index
and middle finger keeps the system in a stable state, that is, a neutral gesture; moving joined
fingers up or down performs the scrolling action; and movement of the palm performs the
drag gesture. The software can detect multiple hands, but it is deployed in such a way that
only the gestures of one hand are active at a time, so that multiple hands do not cause
ambiguity for the system or the user.
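The gesture-to-action mapping described above can be illustrated with a minimal dispatcher (gesture labels and action names here are placeholders, not the project's identifiers; in the actual system the actions would call PyAutoGUI, pycaw or screen_brightness_control):

```python
def dispatch(gesture):
    """Map a recognized gesture label to a system action name."""
    actions = {
        "pinch_up": "volume_up",       # thumb + index sliding up
        "two_finger_hold": "neutral",  # index + middle at constant distance
        "joined_up": "scroll_up",      # joined fingers moving up
        "joined_down": "scroll_down",  # joined fingers moving down
        "palm_move": "drag",           # palm movement drags
    }
    return actions.get(gesture, "ignore")  # unknown gestures do nothing
```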
The proposed project will help prevent COVID-19 spread by eliminating human contact with,
and dependency on, devices used to control the computer. Amidst the COVID-19 situation,
it is not safe to use devices by touching them, because doing so may spread the virus; the
proposed AI virtual mouse can overcome these problems, since hand gesture detection is used
to control the mouse functions using a webcam or the built-in camera of the user's computing
device, be it a PC, laptop or workstation.
Hence, in public hotspots like cyber cafes, offices and academic institutes, a person
can perform operations on a computer with no physical contact with it. This in turn will reduce
the spread of viruses, bacteria and communicable diseases.
Home and office automation is a large discipline in which gesture recognition is being employed.
For example, smart TVs can sense finger actions and hand gestures, offering touchless
control over lighting and audio systems.
The project also aims to reduce e-waste, which is a very common concern with physical
hardware devices. This software can eliminate the need for TV remotes, buttons on smart
appliances, and mice for personal computers, laptops and workstations.
The project also reduces users' costs, whether the upfront cost of buying hardware with new
machines, or for old machines: rather than replacing a mouse that is unusable or no longer
working, users can simply install this software to control the mouse functions of the
system.
It is fair to say that the virtual mouse may soon substitute for the traditional physical
mouse, as people are aiming towards a lifestyle in which every technological gadget can be
controlled and interacted with remotely, without the use of peripheral devices such as
remotes and keyboards. It does not just offer convenience; it is cost-effective as well.
The software's functionality can also be extended to augmented-reality applications such as
gaming, VR/AR headsets and gesture-based games. In addition, mobile applications for
Android phones and smart TVs can be implemented, so that they can be
operated wirelessly.
CHAPTER 2
LITERATURE REVIEW
Sherin Mohammed Sali Shajideen and Preetha V. H. [7] set up two USB cameras for the
side and top views. MATLAB software is used for separating the views. For the two
distinct viewpoints, they train two detectors and select different picture samples for the
diverse top and side views.
There are two categories of hand-gesture methods for HCI: those which use data gloves and
those which are vision-based. Data-glove methods use sensors attached to the glove that
transduce finger flexion into electrical signals for determining hand postures. However, this
approach forces the user to carry a lot of cables for connection to the computer, compromising
the naturalness and ease of vision-based interfaces. Vision-based methods, on the other hand,
are non-invasive and intuitive, being based on the way humans perceive information about
their environment.
In [2], Nathaniel Rossol provides a unique approach that applies smart depth-sensing
technology to track hand postures and collect the results of hand position estimation. Their
method specifically addresses the problem of tracking objects by using posture estimates with
the help of several detectors positioned at different vantage points; greater flexibility
is gained by carefully examining the independently generated skeletal posture estimates from
each sensor system. In comparison to the single-sensor technique, this approach's testing
results reveal that they were able to reduce total estimation error by 30% in a two-detector
configuration, although that is still insufficient.
In [3], Shining Song, Dongsong Yan and Yongjun Xie use a simple camera to capture the input
data as images; the input image is then converted into YCbCr space with the help of the Otsu
algorithm. Some irregular little holes or small protrusions inside the border will remain in the
binarized picture, because other objects in the background interfere with image segmentation,
which may cause issues for gesture recognition processing; as a result, the outcome must be
processed using mathematical morphology. After locating the centroid point of the gesture
picture, the motion direction is assessed during the dynamic gesture recognition phase in
accordance with the motion trajectory of the centroid point. [4] proposes the k-cosine
curvature technique for fingertip recognition and provides an enhanced cut-off segmentation
technique to address the issue of range with depth data for hand motion segmentation.
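The centroid-trajectory step can be illustrated with a small sketch that classifies the dominant motion direction between two successive centroid points (an assumption-laden simplification of the method in [3], using image coordinates where y grows downward):

```python
def motion_direction(prev, curr):
    """Classify the dominant motion direction between two centroid
    points (x, y), as in trajectory-based dynamic gesture recognition."""
    dx, dy = curr[0] - prev[0], curr[1] - prev[1]
    if dx == 0 and dy == 0:
        return "none"
    if abs(dx) >= abs(dy):                 # horizontal motion dominates
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"      # image y grows downward
```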
In [5], Liu Qiongli, Xu Dajun, Li Zhiguo, Zhou Peng, Zhou Jingjing and Xu Yongxia present
methods for recognising hand posture using KNN classification and distance metric learning.
The method is split into two phases: hand posture recognition and Mahalanobis distance
metric learning. The primary goal of the distance-metric learning phase is to learn a
reasonable matrix from the samples. After the hand-shape characteristics (Fourier
descriptors) are extracted from the image, the hand posture recognition phase uses a KNN
classifier to identify the outcome from the K nearest neighbours in the training sample.
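The KNN phase can be sketched as follows (a plain-Euclidean stand-in for the learned Mahalanobis metric, with synthetic feature vectors in place of Fourier descriptors):

```python
import math
from collections import Counter

def knn_classify(sample, training, k=3):
    """training: list of (feature_vector, label) pairs. Classify
    `sample` by majority vote among its k nearest neighbours;
    Euclidean distance stands in for the learned metric here."""
    dists = sorted(
        (math.dist(sample, vec), label) for vec, label in training
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```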
Cursor control has been the easiest mouse function to achieve through visual techniques:
it is accomplished by tracking different features and mapping the changes in their location
to the cursor's x, y coordinates on the screen.
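A common way to realize this mapping, sketched below, is to interpolate camera-frame coordinates into screen coordinates, with a margin so the user can reach the screen edges without leaving the camera frame (all dimensions here are illustrative defaults, not values from any cited paper):

```python
def to_screen(x, y, frame_w=640, frame_h=480,
              screen_w=1920, screen_h=1080, margin=100):
    """Map a fingertip position in camera-frame pixels to screen
    pixels. The margin shrinks the active region so screen edges
    remain reachable."""
    def interp(v, lo, hi, out_hi):
        v = min(max(v, lo), hi)            # clamp to the active region
        return (v - lo) / (hi - lo) * out_hi
    sx = interp(x, margin, frame_w - margin, screen_w)
    sy = interp(y, margin, frame_h - margin, screen_h)
    return sx, sy
```

A smoothing factor is often applied on top of this mapping to damp camera jitter before moving the real cursor.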
Mokhtar M. [1] uses a multivariate Gaussian distribution algorithm to track hand movements.
To increase the likelihood of matching the input gesture with the already-trained classes in
feature space, they use a bivariate Gaussian pdf method for fitting and capturing the
movement of the hand. This step helps reduce the rotation disturbance that is typically
treated by increasing the number of trained gestures in each gesture class.
The virtual mouse approach described by Tran, D.-S., Ho, N.-H., Yang, H.-J. et al. in 2021 [6]
uses RGB-D pictures and fingertip recognition, drawing on detailed bone-joint information of
the hand (palm and fingers). From photos taken by a Microsoft Kinect Sensor version 2, the
hand's region of interest and the palm's centre are first retrieved and then translated into a
binary image. A border-tracing method is then used to extract and characterise the hand's
outline. Depending on the coordinates of the hand joint points, the K-cosine method is used to
determine the position of the fingertip. Finally, the mouse cursor is controlled via hand
movement by mapping the joint and fingertip positions to RGB pictures. This research still
has several flaws.
This section discusses audio-based interface systems and consists of three parts. The first
part discusses the interaction device; this is followed by methods for interface selection,
and lastly cursor control and text entry.
The concept of talking to computers has comfortably attracted funding and interest, and has
brought naturalness to communication. Rosenfeld et al. enumerated the advantages of using
speech for interaction.
Different approaches have been employed to enable users to interact via speech. Three of the
most commonly used are natural language, dialog trees, and commands.
Dialog-tree systems reduce recognition difficulty by breaking an activity down into a
sequence of choice points at which the user makes a selection. The disadvantage of this
approach is that the user is unable to directly access the parts of a domain that are of
immediate interest. Regardless of the difficulty of constructing such systems on the
designer's side, they simplify the interaction and lessen the need for user training.
In different research areas, both academic and commercial, voice-based cursor control has
been proposed to enable control of a cursor using speech input.
Lohr and Brugge proposed two approaches for simulating cursor control using speech:
target-based and direction-based navigation. Direction-based mouse emulation is done by
having the cursor move in the direction uttered by the user. In their research, Sears et al.
implemented cursor movement with commands like "move down", "move up", "move right"
or "move left", stopping when "stop" is uttered. However, this has a precision bottleneck
when the stopping command "stop" is uttered: the cursor keeps moving while the speech
recognizer is still processing the command. In the study by Igarashi and Hughes, mouse
control was achieved by uttering the direction followed by a non-verbal vocalization; the
cursor moves as long as the vocalization lasts, e.g., "move down". Using non-verbal
vocalization has the advantage of high precision.
Harada et al. used a similar approach, assigning each direction a specific sound. In their
system the user utters the vowel sound corresponding to the desired direction ("a" for up,
"e" for right, "i" for down, and "o" for left). The cursor speed starts out slow and
gradually increases with time. The cursor is stopped by uttering the same vowel again, and
a click is performed by uttering a two-vowel command ("a-e"). The advantage of this system
is that it offers immediate processing of vocal input.
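The vowel scheme of Harada et al. can be sketched as a small state machine (an illustration of the description above, not their implementation):

```python
# Vowel-to-direction mapping: a vowel starts movement in its
# direction, the same vowel again stops it, and the two-vowel
# command "a-e" triggers a click.
VOWEL_DIRS = {"a": "up", "e": "right", "i": "down", "o": "left"}

class VoiceCursor:
    def __init__(self):
        self.moving = None  # current direction, or None when stopped

    def hear(self, utterance):
        if utterance == "a-e":
            return "click"
        direction = VOWEL_DIRS.get(utterance)
        if direction is None:
            return "ignore"
        if self.moving == direction:   # same vowel again: stop
            self.moving = None
            return "stop"
        self.moving = direction
        return "move_" + direction
```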
Target-based mouse emulation involves defining specific targets on the screen and assigning
speakable identifiers, which are displayed close to the targets. Uttering an identifier places
the cursor within the corresponding target; for example, uttering a widget's name (with the
widget used as the target) causes the mouse cursor to be placed over that widget. However,
this method suffers from layout and usability issues if the number of widgets is high.
Multimodal Systems
Multimodal systems are systems which combine two or more modalities, where modalities
refer to the ways in which the system responds to inputs, that is, communication channels.
Due to the problems faced by unimodal systems, a combination of different modalities helps
solve them: when one modality is inaccessible, a task can be completed with the other
modalities. Modalities can be combined as redundant or complementary, depending on the
number and type of modalities integrated.
A combination of devices has been used in multimodal systems; interaction is done using the
devices which enable input for the specific modality. For a system with visual and speech
modalities, a camera and a microphone, or a headset with a microphone, can be used.
Frangeskides et al. explicitly provided three methods for interface selection: idle click,
external switch, and voice-command click. For the idle click they used a threshold value
whereby a mouse click is triggered once a condition becomes true. The external-switch method
involves the use of sound to invoke a mouse click when the system is in Sound Click mode:
the mouse click is triggered if the sound produced has a higher intensity than the background
noise.
To simulate all the mouse events, Lanitis et al. categorised voice commands into five
classes, including Mouse, Move cursor, Computer and Open. Manipulation of interface
widgets by voice command involved the commands in the category "Mouse", whose actions
include drag, drop and click, among others. The advantage of operating in Sound Click mode
is that it is speaker-independent, so no training is needed, in contrast to Voice Command
mode.
In the work by Frangeskides and Lanitis, a system which uses speech and visual input to
perform HCI tasks was proposed. In their system, visual input is used for cursor control
through face-tracking, achieved by tracking the eyes and nose. The generated regions around
the eyes and nose are used to calculate the horizontal and vertical projections for tracking,
enabling cursor control; the difference of the face location from its original position is
mapped into cursor movement towards the direction of the movement.
In their work, cursor control would also be enabled if the Voice Command mode was activated;
in that case, cursor control is done by uttering commands from one of the five groups
("Move cursor") they designed. Although cursor control using voice commands can be used,
it lacks precision, and hence needs to be integrated with face tracking in order to reach a
widget precisely.
For text entry, an "On-Screen Keyboard" (a Windows operating system utility) was used. Once
the keyboard was activated, keys were entered by using head tracking to move the cursor onto
the keys. Speech was then used to invoke mouse click events when the system was in Sound
Click mode, or cursor movement was used when it was in Voice Command mode.
2.1.6 Modality Integration
This section focuses on integrating modalities to provide the best interface to the user.
First, two taxonomies of multimodal systems are summarized, followed by a discussion of how
the different modalities relate to each other.
The first taxonomy is based on the types and goals of cooperation between modalities; six
different types of cooperation were stated, including:
Specialisation: the same chunk of information is always processed by the same modality.
Concurrency: independent kinds of information are processed using different modalities and
overlap in time (e.g., moving the cursor whilst editing a document); that is, parallel use of
different modalities.
Dung-Hua Liou, Chen-Chiung Hsieh, and David Lee in 2010 [10] proposed a study on "A Real-
Time Hand Gesture Recognition System Using Motion History Image." The main limitation
of this model is with more complicated hand gestures.
In [11] (June 2010), "Vision-based Gesture Recognition for Human-Computer Interaction" was
published, which used a motion detection and recognition algorithm, a solid method for
gesture detection. Devanshu Singh, in the International Journal for Research in Applied
Science and Engineering Technology, mentioned a novel method of controlling mouse
movement with a real-time camera using OpenCV. Apart from these, the official documentation
of OpenCV and MediaPipe was referred to extensively.
Monika B. Gandhi, Sneha U. Dudhane, and Ashwini M. Patil in 2013 [12] proposed a study
on "Cursor Control System Using Hand Gesture Recognition." The limitation of this work is
that stored frames need to be processed for hand segmentation and skin pixel detection.
In [8] (2013), a research paper named "Vision-based Multimodal Human-Computer Interaction
using Hand and Head Gestures" was published, which used the hand and head to control
computer-vision-based applications; it recognized gestures based on the pattern of the hand
and the motion of the head.
In [9] (2015), the paper "Vision-based Computer Mouse Control using Hand Gestures" described
a camera-based technique which used real-time video acquisition and implemented left and
right click; it mainly used binary image generation and filtering.
CHAPTER 3
HARDWARE AND SOFTWARE REQUIREMENTS
For the detection of hand gestures and hand tracking, the MediaPipe framework is
used, and the OpenCV library is used for computer vision. The algorithm makes use of
machine learning concepts to track and recognize the hand gestures and the hand tip.
The following describes the hardware needed to develop and run the Virtual Mouse
application.
3.1.1 Computer/Laptop
A desktop computer or laptop will be used to run the vision software and display what the
webcam has captured. A notebook, which is a small, lightweight and inexpensive laptop
computer, is proposed to increase mobility.
The system will use:
Processor: Core 2 Duo
Main Memory: 4 GB RAM
Hard Disk: 320 GB
Display: 14" monitor
3.1.2 Webcam
The webcam is utilized for image processing: it continuously captures images so that the
program can process them and find pixel positions.
3.2.1 Python
Python uses dynamic typing, and a combination of reference counting and a cycle-detecting
garbage collector for memory management. It uses dynamic name resolution (late binding),
which binds method and variable names during program execution.
Its design offers some support for functional programming in the Lisp tradition. It has
filter, map and reduce functions; list comprehensions, dictionaries, sets, and generator
expressions. The standard library has two modules (itertools and functools) that implement
functional tools borrowed from Haskell and Standard ML.
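These functional tools can be demonstrated briefly:

```python
from functools import reduce
from itertools import accumulate

nums = [1, 2, 3, 4, 5]
evens = list(filter(lambda n: n % 2 == 0, nums))  # keep even numbers
squares = [n * n for n in nums]                   # list comprehension
total = reduce(lambda a, b: a + b, nums)          # fold the list into a sum
running = list(accumulate(nums))                  # running totals (itertools)
```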
Its core philosophy is summarized in the document The Zen of Python (PEP 20), which
includes aphorisms such as "Beautiful is better than ugly", "Explicit is better than
implicit", and "Simple is better than complex".
Rather than building all of its functionality into its core, Python was designed to be highly
extensible via modules. This compact modularity has made it particularly popular as a means
of adding programmable interfaces to existing applications. Van Rossum's vision of a small
core language with a large standard library and easily extensible interpreter stemmed from his
frustrations with ABC, which espoused the opposite approach.
Python strives for a simpler, less-cluttered syntax and grammar while giving developers a
choice in their coding methodology. In contrast to Perl's "there is more than one way to do it"
motto, Python embraces a "there should be one—and preferably only one—obvious way to do
it" philosophy. Alex Martelli, a Fellow at the Python Software Foundation and Python book
author, wrote: "To describe something as 'clever' is not considered a compliment in the Python
culture."
Figure 3.1 : Python Logo
3.2.2 OpenCV
OpenCV (Open Source Computer Vision Library) is a huge open-source, cross-platform library
for computer vision, machine learning, and image processing, with which we can develop
real-time computer vision applications. OpenCV supports a wide variety of programming
languages like Python, C++ and Java. It mainly focuses on image processing and on video
capture and analysis, to identify objects, faces, or even the handwriting of a human. It
can be installed using "pip install opencv-python". OpenCV was built to provide a common
infrastructure for computer vision applications and to accelerate the use of machine
perception in commercial products. Being a BSD-licensed product, OpenCV makes it easy for
businesses to utilize and modify the code.
Computer vision can be defined as a discipline that explains how to reconstruct, interpret,
and understand a 3D scene from its 2D images, in terms of the properties of the structures
present in the scene. It deals with modeling and replicating human vision using computer
software and hardware. Robotics is one of its application areas.
3.2.3 MediaPipe
MediaPipe is an open-source framework from Google for building machine learning pipelines.
The MediaPipe framework is based on three fundamental parts: performance evaluation, a
framework for retrieving sensor data, and a collection of reusable components called
calculators.
A pipeline is a graph consisting of components called calculators, connected by streams
through which packets of data flow. Developers can replace or define custom calculators
anywhere in the graph to create their own applications. The calculators and streams combined
create a data-flow diagram [5]. MediaPipe can be installed using "pip install mediapipe".
MediaPipe uses a single-shot detector model for detecting and recognizing a hand or palm in
real time [3]. The hand detection module is first trained as a palm detection model, because
palms are easier to train on. Furthermore, non-maximum suppression works significantly
better on small objects such as palms or fists. The hand landmark model then locates 21
joint or knuckle coordinates in the hand region.
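For reference, the 21 landmarks follow a fixed indexing scheme in which each fingertip has a well-known index (per the MediaPipe Hands documentation); a minimal lookup table might be sketched as:

```python
# Fingertip indices in MediaPipe's 21-point hand landmark model.
# The wrist is landmark 0; each finger contributes four points,
# ending at its tip.
FINGERTIPS = {
    "thumb": 4,
    "index": 8,
    "middle": 12,
    "ring": 16,
    "pinky": 20,
}

def fingertip_index(finger):
    """Return the landmark index of the given fingertip."""
    return FINGERTIPS[finger]

print(fingertip_index("index"))  # 8
```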
3.2.4 PyAutoGUI
PyAutoGUI is a cross-platform GUI automation Python module used to programmatically control
the mouse and keyboard; in other words, it lets us automate mouse and keyboard input from a
Python script in order to interact with other applications. It can be installed by "pip
install pyautogui".
PyAutoGUI provides, among other features, functions for moving the cursor, clicking,
dragging, scrolling, and sending keystrokes.
3.2.5 Math
This module provides access to the mathematical functions defined by the C standard. These
functions cannot be used with complex numbers; use the functions of the same name from the
cmath module if you require support for complex numbers. The distinction between functions
which support complex numbers and those which do not is made because most users do not want
to learn quite as much mathematics as is required to understand complex numbers. Receiving an
exception instead of a complex result allows earlier detection of the unexpected complex
number used as a parameter, so that the programmer can determine how and why it was
generated in the first place.
The following functions are provided by this module. Except when explicitly noted otherwise,
all return values are floats.
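In this project the math module is mainly useful for geometric helpers such as the distance between two fingertips. A small sketch (the landmark coordinates here are made-up example values):

```python
import math

def pinch_distance(p1, p2):
    """Euclidean distance between two (x, y) landmark points."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

thumb_tip = (0.40, 0.50)   # hypothetical normalized coordinates
index_tip = (0.43, 0.54)
print(round(pinch_distance(thumb_tip, index_tip), 2))  # 0.05
```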
3.2.6 Pycaw
Pycaw (Python Core Audio Windows library) is used to control the system audio volume on
Windows. It can be installed by "pip install pycaw".
3.2.7 ENUM
Enum is a class in Python for creating enumerations, which are sets of symbolic names
(members) bound to unique, constant values. The members of an enumeration can be compared
by these symbolic names, and the enumeration itself can be iterated over. An enum has the
following characteristics:
• Enums have an evaluable string representation of the object, obtained with repr().
• The name of an enum member is displayed using the 'name' attribute.
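A sketch of how the project's gestures could be encoded as an enum (the member names here are illustrative, not necessarily the project's actual ones):

```python
from enum import Enum

class Gesture(Enum):
    """Hypothetical gesture codes for the virtual mouse."""
    PALM = 0
    FIST = 1
    PINCH = 2
    V_SIGN = 3

g = Gesture.PINCH
print(g.name)    # PINCH
print(g.value)   # 2
print(repr(g))   # <Gesture.PINCH: 2>
```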
3.2.8 Screen_Brightness_Control
A Python tool for controlling the brightness of your monitor. It supports Windows and most
flavors of Linux. We can install this library by "pip install screen-brightness-control".
CHAPTER 4
METHODOLOGY
We all use new technological developments in our day-to-day life, including in our devices.
When we talk about technology, the best example is the computer, which has evolved and
advanced significantly over the decades since it originated. However, we still use the same
setup, which includes a mouse and keyboard. Although technology has brought many changes to
computers, such as the laptop, where the camera is now an integrated part of the machine, we
still have a mouse which is either integrated or an external device.
This is how we came to implement a new technology for the mouse, where we can control the
computer with our fingertips; this system is known as Hand Gesture Movement. With the aid of
our fingers, we are able to guide the cursor.
For this project we have used Python as the base language, as it is open source, easy to
understand, and beginner friendly. Anaconda is a packaged Python distribution that ships
with many important packages. The packages required here are PyAutoGUI and OpenCV.
PyAutoGUI is a Python module for programmatically controlling the mouse and keyboard, while
OpenCV lets us process the video through which mouse events are controlled. Red, yellow,
and blue are the three colors we use for our fingertips. The program uses image processing
to extract the required data and then feeds it to the computer's mouse interface according
to predefined notions. It is written in Python, uses the cross-platform image processing
module OpenCV, and implements the mouse actions using the Python-specific library
PyAutoGUI. Real-time video captured by the webcam is processed, and only the three colored
fingertips are extracted.
Their centers are computed using the method of moments, and the action to be taken is
determined based on their relative positions.
The first step is to use the function cv2.VideoCapture(), which captures the live video
stream from the camera; OpenCV provides a very easy interface for this. To capture images
we need to create a video capture object. We then convert the captured images into HSV
format. The second step is the function calibrateColor(), with which the user can calibrate
the color range for each of the three fingers individually.
The third step is the function cv2.inRange(), in which, depending on the calibrations, only
the three fingers are extracted. We remove noise from the feed using two morphological
operations: erosion and dilation. The next step is to find the center and radius of each
fingertip so that we can start moving the cursor; chooseAction() is used in the code to do
this. Based on its output, the performAction() method then uses the PyAutoGUI library to
perform actions such as free cursor movement, left click, right click, drag/select, scroll
up, and scroll down.
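A sketch of how performAction() could dispatch recognized actions to PyAutoGUI calls. The action names and the backend-injection pattern are illustrative (the real code calls pyautogui directly); passing the backend in makes the dispatch logic testable without a GUI:

```python
def perform_action(action, backend, point=None):
    """Map a recognized action name to a mouse backend call.

    `backend` is any object with moveTo/click/scroll methods,
    e.g. the pyautogui module itself.
    """
    if action == "move" and point is not None:
        backend.moveTo(point[0], point[1])
    elif action == "left_click":
        backend.click(button="left")
    elif action == "right_click":
        backend.click(button="right")
    elif action == "scroll_up":
        backend.scroll(40)
    elif action == "scroll_down":
        backend.scroll(-40)

# In the real program one would `import pyautogui` and call, e.g.,
# perform_action("move", pyautogui, point=(640, 360)).
```

With a fake backend that records calls, the gesture-to-action mapping can be exercised headlessly, which is useful since PyAutoGUI needs a live display.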
The runtime operations are managed by the webcam of the connected laptop or desktop. To
capture a video, we need to create a VideoCapture object. Using the Python computer vision
library OpenCV, the video capture object is created and the web camera starts capturing
video. Its argument can be either the device index or the name of a video file. The device
index is just a number specifying which camera to use; since we only use a single camera we
pass it as '0'. We can add additional cameras to the system and pass them as 1, 2, and so
on. After that, we can capture the video frame by frame, and at the end we must not forget
to release the capture. We could also apply color detection techniques to any image with
simple modifications to the code.
The AI virtual mouse system uses the webcam, where each frame is captured until the
termination of the program. The video frames are converted from BGR to RGB color space to
find the hands in the video frame by frame, as shown in the following code:

def findHands(self, img, draw=True):
    imgRGB = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    self.results = self.hands.process(imgRGB)
The imshow() function belongs to HighGui, and it requires calling waitKey() regularly. The
event loop of the imshow() function is processed by calling waitKey(). The function
waitKey() waits for a key event for a "delay" (here, 5 milliseconds). Window events like
redraw, resizing, input events, etc. are processed by HighGui, so we must call the waitKey
function, even with a 1 ms delay [4].
MediaPipe Hands is a high-fidelity hand and finger tracking solution. It employs machine
learning (ML) to infer 21 3D landmarks of a hand from just a single frame. Whereas
current state-of-the-art approaches rely primarily on powerful desktop environments for
inference, our method achieves real-time performance on a mobile phone, and even scales to
multiple hands. We hope that providing this hand perception functionality to the wider research
and development community will result in an emergence of creative use cases, stimulating new
applications and new research avenues.
ML Pipeline:
MediaPipe Hands utilizes an ML pipeline consisting of multiple models working together: A
palm detection model that operates on the full image and returns an oriented hand bounding
box. A hand landmark model that operates on the cropped image region defined by the palm
detector and returns high-fidelity 3D hand keypoints. This strategy is similar to that
employed in our MediaPipe Face Mesh solution, which uses a face detector together with a
face landmark model.
Providing the accurately cropped hand image to the hand landmark model drastically reduces
the need for data augmentation (e.g. rotations, translation and scale) and instead allows the
network to dedicate most of its capacity towards coordinate prediction accuracy. In addition, in
our pipeline the crops can also be generated based on the hand landmarks identified in the
previous frame, and only when the landmark model could no longer identify hand presence is
palm detection invoked to relocalize the hand.
The pipeline is implemented as a MediaPipe graph that uses a hand landmark tracking subgraph
from the hand landmark module, and renders using a dedicated hand renderer subgraph. The
hand landmark tracking subgraph internally uses a hand landmark subgraph from the same
module and a palm detection subgraph from the palm detection module.
The palm detection model is optimized for real-time use, in a manner similar to the face
detection model in MediaPipe Face Mesh. Detecting hands is a decidedly complex task: our
lite model and full model have to work across a variety of hand sizes with a large scale
span (~20x) relative to the image frame, and be able to detect occluded and self-occluded
hands.
Whereas faces have high contrast patterns, e.g., in the eye and mouth region, the lack of such
features in hands makes it comparatively difficult to detect them reliably from their visual
features alone. Instead, providing additional context, like arm, body, or person features, aids
accurate hand localization.
Our method addresses the above challenges using different strategies. First, we train a palm
detector instead of a hand detector, since estimating bounding boxes of rigid objects like palms
and fists is significantly simpler than detecting hands with articulated fingers. In
addition, as palms are smaller objects, the non-maximum suppression algorithm works well
even for
two-hand self-occlusion cases, like handshakes. Moreover, palms can be modelled using
square bounding boxes (anchors in ML terminology) ignoring other aspect ratios, and therefore
reducing the number of anchors by a factor of 3-5.
Second, an encoder-decoder feature extractor is used for bigger scene context awareness even
for small objects (similar to the RetinaNet approach). Lastly, we minimize the focal loss during
training to support a large number of anchors resulting from the high scale variance.
To obtain ground truth data, we have manually annotated ~30K real-world images with 21 3D
coordinates, as shown below (we take the Z-value from the image depth map, if it exists for
the corresponding coordinate). To better cover the possible hand poses and provide additional
supervision on the nature of hand geometry, we also render a high-quality synthetic hand model
over various backgrounds and map it to the corresponding 3D coordinates.
Figure 4.3 : Mouse Click
STATIC_IMAGE_MODE
If set to false, the solution treats the input images as a video stream. It will try to detect hands
in the first input images, and upon a successful detection further localizes the hand landmarks.
In subsequent images, once all max_num_hands hands are detected and the corresponding
hand landmarks are localized, it simply tracks those landmarks without invoking another
detection until it loses track of any of the hands.
This reduces latency and is ideal for processing video frames. If set to true, hand detection runs
on every input image, ideal for processing a batch of static, possibly unrelated, images.
Default to false.
MAX_NUM_HANDS
Maximum number of hands to detect. Default to 2.
MODEL_COMPLEXITY
Complexity of the hand landmark model: 0 or 1. Landmark accuracy as well as inference latency
generally go up with the model complexity. Default to 1.
Figure 4.4 : Right Click
MIN_DETECTION_CONFIDENCE
Minimum confidence value ([0.0, 1.0]) from the hand detection model for the detection to be
considered successful. Default to 0.5.
MIN_TRACKING_CONFIDENCE:
Minimum confidence value ([0.0, 1.0]) from the landmark-tracking model for the hand
landmarks to be considered tracked successfully, or otherwise hand detection will be invoked
automatically on the next input image. Setting it to a higher value can increase robustness of
the solution, at the expense of a higher latency. Ignored if static_image_mode is true, where
hand detection simply runs on every image. Default to 0.5.
Output
Naming style may differ slightly across platforms/languages.
MULTI_HAND_LANDMARKS
Collection of detected/tracked hands, where each hand is represented as a list of 21 hand
landmarks and each landmark is composed of x, y and z. x and y are normalized to [0.0, 1.0]
by the image width and height respectively. z represents the landmark depth with the depth at
41
the wrist being the origin, and the smaller the value the closer the landmark is to the camera.
The magnitude of z uses roughly the same scale as x.
MULTI_HAND_WORLD_LANDMARKS
Collection of detected/tracked hands, where each hand is represented as a list of 21 hand
landmarks in world coordinates. Each landmark is composed of x, y and z: real-world 3D
coordinates in meters with the origin at the hand’s approximate geometric center.
MULTI_HANDEDNESS
Collection of handedness of the detected/tracked hands (i.e. is it a left or right hand). Each
hand is composed of label and score. label is a string of value either "Left" or "Right".
score is the estimated probability of the predicted handedness and is always greater than or
equal to 0.5 (and the opposite handedness has an estimated probability of 1 - score).
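Since each landmark's x and y are normalized to [0.0, 1.0] with the origin at the top-left (y grows downward), a raised finger can be detected by comparing the y of its tip against the y of a lower joint. A simplified sketch, where the landmark list is a made-up stand-in for the multi_hand_landmarks output:

```python
# Fingertip and PIP-joint indices for the four fingers
# (thumb omitted: it folds sideways rather than downward).
TIPS = [8, 12, 16, 20]
PIPS = [6, 10, 14, 18]

def fingers_up(landmarks):
    """landmarks: list of 21 (x, y) pairs with y growing downward.
    Returns one boolean per finger (index..pinky): True if raised."""
    return [landmarks[t][1] < landmarks[p][1] for t, p in zip(TIPS, PIPS)]

# Fake hand: all landmarks at y=0.5 except an extended index finger.
hand = [(0.5, 0.5)] * 21
hand[8] = (0.5, 0.2)     # index tip above its PIP joint
print(fingers_up(hand))  # [True, False, False, False]
```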
Dynamic Gestures for Volume control - The rate of increase/decrease of volume is proportional
to the distance moved by pinch gesture from start point.
The volume of the system is increased or decreased using hand gestures.
Finally, just to make it look prettier, we draw a circle at the midpoint that changes color
when both fingers are very close to each other, and a volume bar on the left.
Dynamic Gestures for horizontal and vertical scroll. The speed of scroll is proportional to the
distance moved by pinch gesture from start point. Vertical and Horizontal scrolls are controlled
by vertical and horizontal pinch movements respectively.
Dynamic Gestures for Brightness control - The rate of increase/decrease of brightness is
proportional to the distance moved by pinch gesture from start point.
The brightness of the system is increased or decreased using hand gestures. Changing the
brightness of an image means changing the value of its pixels: adding some integer value to
or subtracting it from the current value of each pixel. When you add some integer value to
every pixel, you are making the image brighter; when you subtract some constant value from
all of the pixels, you are reducing the brightness. First we consider increasing the
brightness, and then reducing it.
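The pixel arithmetic described above needs clamping so that values stay in the valid 0-255 range; a stdlib-only sketch:

```python
def adjust_brightness(pixels, delta):
    """Add `delta` to every pixel value, clamping to [0, 255].
    `pixels` is a flat list of 8-bit intensity values."""
    return [max(0, min(255, p + delta)) for p in pixels]

row = [0, 100, 200, 250]
print(adjust_brightness(row, 40))    # [40, 140, 240, 255]
print(adjust_brightness(row, -120))  # [0, 0, 80, 130]
```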
We already have a hand tracking module, so let us say we want to control the volume of our
computer by moving the thumb and index finger closer to and further away from each other.
From before we know the thumb tip is landmark number 4 and the index tip is landmark
number 8.
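Mapping the thumb-index distance onto a 0-100 volume range can be done with a linear interpolation; the distance bounds below are hypothetical calibration values, not ones from the project:

```python
def distance_to_volume(dist, d_min=0.05, d_max=0.30):
    """Linearly map a pinch distance onto a 0-100 volume scale.
    d_min/d_max are assumed calibration bounds for the gesture."""
    t = (dist - d_min) / (d_max - d_min)
    return round(100 * max(0.0, min(1.0, t)))  # clamp, then scale

print(distance_to_volume(0.05))   # 0   (fingers touching)
print(distance_to_volume(0.175))  # 50
print(distance_to_volume(0.40))   # 100 (clamped at the top)
```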
CHAPTER 5
SOURCE CODE
CHAPTER 6
SCREENSHOTS
Figure 6.2 : Mouse Click
Figure 6.3 : Left Click
Figure 6.4 : Right Click
Figure 6.5 : Double Click
Figure 6.6 : Left to Right-Brightness Controls Top to Bottom
Figure 6.7 : Scroll Up and Down
Figure 6.8 : Low Brightness Gesture
Figure 6.9 : High Brightness Gesture
Figure 6.10 : Increase Volume Gesture
Figure 6.11 : Decrease Volume Gesture
CHAPTER 7
RESULTS AND DISCUSSION
In the proposed AI virtual mouse system, the concept of advancing human-computer
interaction using computer vision is presented.
A cross comparison of the testing of the AI virtual mouse system is difficult because only a
limited number of datasets are available. The hand gestures and fingertip detection have
been tested in various illumination conditions and at different distances from the webcam
for tracking of the hand gesture and hand tip detection. An experimental test has been
conducted to summarize the results shown in Table 1. The test was performed 25 times by 4
persons, resulting in 600 gestures with manual labelling, in different light conditions and
at different distances from the screen: each person tested the AI virtual mouse system 10
times in normal light conditions, 5 times in faint light conditions, 5 times at close
distance from the webcam, and 5 times at long distance from the webcam, and the
experimental results were recorded.
The purpose of this project was to make the machine interact with and respond to human
behavior. The aim was also to make technology accessible and compatible with any standard
operating system.
The proposed system controls the mouse pointer by seeing a human hand and placing the
cursor according to the position of the hand. The system controls mouse activities as
simple as left click, cursor drag, and movement.
The system detects a human hand by its skin color and follows it continuously with the
movement of the cursor; at a certain angle between the fingers of the hand, the process
performs the function of the left click.
CHAPTER 8
CONCLUSION
This implementation of the virtual mouse has a few accuracy and precision issues: it has a
precision gap in volume and brightness control, and it may have accuracy issues in the
click function. These issues will be resolved in a future model.
Further additional features such as a voice assistant and a virtual keyboard can be
implemented. With these additions, people having some kind of disability or hand problems
will be able to operate the mouse; this way it can also contribute to the medical industry.
With the addition of a voice assistant and virtual keyboard, it can become a complete
solution for people who cannot see or cannot correctly move some parts of the upper body.
More functions such as saving and copy, paste, and select-all shortcuts can be added. These
functions are not present in a normal mouse, so adding these kinds of features will both
increase the overall functionality of the mouse and create additional motivation for the
user to use it instead of the physical mouse/trackpad of the system.
In the future this application can also be used on Android devices or in mobile
applications, where the touchscreen concept can be replaced by hand gestures. The
application/software can be made cross-platform so that it creates an ecosystem-like
experience for the user, with additional functionality on both platforms.
The proposed model cannot be used effectively in dark environments. This can be mitigated
by automatically increasing the brightness of the monitor, which will be implemented in a
future model. This is an inherent problem even with a physical mouse, which barely has a
solution even today, but we can overcome it to a certain extent by asking for the relevant
permissions from the system and increasing the screen brightness using the light sensor in
the laptop/mobile phone.
Automatic zoom-in/out functions are required to handle distance, automatically adjusting
focus based on the distance between the user and the camera. This would improve the user
experience so that the user can get going straight away without facing any focusing issues
while using the webcam, which can lead to wrong or missed gesture detection.
8.2 APPLICATIONS
The virtual mouse system is useful for many applications; it can be used to reduce the
space needed for the physical mouse, and it can be used in situations where we cannot use a
physical mouse. The system eliminates the need for such devices and improves human-computer
interaction.
• The proposed model has an accuracy far greater than that of other proposed models for the
virtual mouse, and it has many applications.
• In the COVID-19 scenario, it is not safe to use devices by physically touching them, as
this can spread the virus; the proposed virtual mouse can be used to control the PC mouse
functions without touching a physical mouse.
• The system can be used to control robots and systems without the usage of devices.
• It can be used to play augmented reality games and use AR applications.
• Persons with certain disabilities will be able to use the mouse.
REFERENCES
[1] Mokhtar M. Hasan and Pramod K. Mishra, "Robust gesture recognition using gaussian
distribution for features fitting," International Journal of Machine Learning and
Computing, vol. 2, no. 3, p. 266, 2012.
[2] Nathaniel Rossol, Irene Cheng, and Anup Basu, "A Multi-Sensor Technique for Gesture
Recognition through Intelligent Skeletal Pose Analysis," IEEE, 2015.
[3] Shining Song, Dongsong Yan, and Yongjun Xie, "Design of control system based on hand
gesture recognition," Natural Science Foundation of Guangdong Province
(No. 2017A030310184), IEEE, 2018.
[4] Xuhong Ma and Jinzhu Peng, "Kinect Sensor-Based Long-Distance Hand Gesture Recognition
and Fingertip Detection with Depth Information," Hindawi Journal of Sensors, vol. 2018,
Article ID 5809769, 2018.
[5] Liu Qiongli, Xu Dajun, Li Zhiguo, Zhou Peng, Zhou Jingjing, and Xu Yongxia, "A New
Distance Metric Learning Algorithm for Hand Posture Recognition," 3rd International
Conference on Mechatronics and Industrial Informatics (ICMII 2015), 2015.
[6] D.-S. Tran, N.-H. Ho, H.-J. Yang, et al., "Real-time virtual mouse system using RGB-D
images and fingertip detection," Multimedia Tools and Applications, vol. 80,
pp. 10473-10490, 2021.
[7] Sherin Mohammed Sali Shajideen and Preetha V. H., "Hand Gestures - Virtual Mouse for
Human Computer Interaction," International Conference on Smart Systems and Inventive
Technology (ICSSIT 2018), IEEE Xplore Part Number: CFP18P17-ART,
ISBN: 978-1-5386-5873-4, 2018.
[8] Pooja Kumari, Saurabh Singh, and Vinay Kr. Pasi, "Cursor Control using Hand Gestures,"
International Journal of Computer Applications (0975-8887), 2013.
[9] Sandeep Thakur, Rajesh Mishra, and Buddhi Prakash, "Vision based computer mouse control
using hand gestures," International Conference on Soft Computing Techniques and
Implementations (ICSCTI), 2015.
[10] Chen-Chiung Hsieh, Dung-Hua Liou, and David Lee, "A real time hand gesture recognition
system using motion history image," IEEE Xplore, 23 August 2010.
[11] X. Zabulis, H. Baltzakis, and A. Argyros, "Vision-Based Hand Gesture Recognition for
Human-Computer Interaction," DOI: 10.1201/9781420064995-c34, 2010.
[12] Bharath Kumar Reddy Sandra, Katakam Harsha Vardhan, Ch. Uday, V. Sai Surya, Bala Raju,
and Dr. Vipin Kumar, "GESTURE-CONTROL-VIRTUAL-MOUSE," International Research Journal of
Modernization in Engineering Technology and Science, 2012.