
Virtual Trial Room for Online Shopping

K. Durga Prasad
Department of Computer Applications
Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India
durgaprasadkorikana88165@gmail.com

V. Esther Jyothi
Asst. Professor, Department of Computer Applications
Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India
vejyothi@vrsiddhartha.ac.in

Abstract—Human beings have the natural ability to recognize body and sign language easily. This is possible because of vision and the synaptic interactions formed during brain development. Humans can also pick up contextual information to understand body and sign language. If we want to replicate this ability in computers, we need to solve several issues: the ability to differentiate between objects of interest and the background, the choice of image capture technology, the choice of image classification algorithms, and so on. Hand gesture recognition is therefore quickly becoming an important and relevant technology. Given the recent growth of AR and VR technologies, hand gesture recognition has become an exciting new field, but current gesture interfaces are highly specialized and have not acquired mainstream adoption. We aim to show that gesture recognition can enhance many aspects of computer work and teaching. In our proposed model, the user starts the program, which provides the live webcam feed. The user then makes the hand gesture inside the detection frame shown on the screen. When the hand gesture is made, the gesture is segmented and isolated. Possible applications include reducing the use of a physical keyboard and mouse by using hand gesture recognition to control the computer (pause/play, closing and opening windows, and manipulating media controls). Another application is sign language, where hand gesture recognition can be used for communication and teaching purposes.

Index Terms - Convolutional Neural Networks, Pattern Recognition, Hand Gesture Recognition, Segmentation of Hand Region, Data Collection, and Deep Learning.

I. INTRODUCTION

Purchasing clothing or accessories online poses inherent risks, mainly because it is difficult to gauge how the items will appear on oneself. Similarly, shopping for clothes or jewellery in physical stores demands a significant time investment, involving the search for suitable shops and trying on multiple items in fitting rooms. Our proposed solution aims to streamline this process by digitizing the try-on experience, thereby saving users time. We have opted to use OpenCV for its efficiency and pre-trained capabilities in detecting the user's body, enabling us to overlay clothing swiftly. This approach not only expedites the process but also enhances the user experience. Users receive real-time results, with the selected attire superimposed on their bodies as they interact with the application. This real-time feedback is achieved by capturing and processing every frame of the video input, seamlessly integrating the attire onto the user's body and returning the modified frame for immediate viewing. Importantly, our implementation is cost-effective, requiring no additional hardware expenses, unlike some existing solutions. Moreover, it is platform-independent, capable of running on any operating system and device with a camera, internet access, and a web browser. The primary concerns driving our project are twofold: ensuring the accuracy of attire superimposition according to the user's body and delivering a realistic viewing experience.

A. Precision in Superimposition:
The accuracy of superimposing wearables in a virtual trial room is paramount, as it dictates the system's overall precision. The effectiveness relies heavily on whether the application's algorithm can successfully identify the user within video frames. Two primary methods exist: using neural networks trained to detect the human body, or employing an RGB color marker to locate the user within the frame. The marker-based method is less user-friendly; its drawbacks are avoided by OpenCV's pre-trained classifiers, which recognize body parts such as the face, upper body, and lower body, making OpenCV the preferred approach.

B. Enhancing Realism in the Virtual Trial Room:
A virtual try-on setup falls short if it fails to evoke a sense of reality, even though achieving complete realism, such as the sensation of wearing different fabric types like cotton versus wool, remains challenging. Nevertheless, we can enhance the user's experience by simulating a more realistic environment, akin to trying on clothes in front of a mirror within an authentic try-on room. Augmented Reality blends real-world elements seamlessly with virtual ones, refining the user's perception of their surroundings. By overlaying projected images onto the user's view, augmented reality merges synthetic and natural light, creating an immersive experience. Augmented Reality devices can also operate independently, without the need for cables or desktop computers.

OpenCV, short for Open Source Computer Vision Library, supports programming interfaces for Python, C++, and Java. It prioritizes computational efficiency and real-time applications, offering the advantage of multi-core processing for C and C++ code. Leveraging augmented reality technology streamlines the virtual try-on process, saving customers time and reducing the confusion associated with online wearables shopping.
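To make the real-time superimposition loop described in this section concrete, the following sketch (a simplified illustration, not the production code of our application) reads webcam frames with OpenCV, alpha-blends a transparent PNG of a garment onto each frame at a fixed anchor, and displays the result. The file name shirt.png and the fixed anchor coordinates are placeholders; in the actual system the anchor comes from the body detection described in Section III.

import cv2

def overlay_attire(frame, attire_rgba, x, y):
    # Alpha-blend a garment image with transparency onto the frame at (x, y).
    # Bounds checking is omitted for brevity.
    h, w = attire_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w]
    alpha = attire_rgba[:, :, 3:] / 255.0
    roi[:] = ((1 - alpha) * roi + alpha * attire_rgba[:, :, :3]).astype("uint8")
    return frame

attire = cv2.imread("shirt.png", cv2.IMREAD_UNCHANGED)   # placeholder asset with alpha channel
cap = cv2.VideoCapture(0)                                 # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = overlay_attire(frame, attire, x=200, y=150)   # fixed anchor, for illustration only
    cv2.imshow("Virtual Trial Room", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()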

II. RELATED WORK

The current methodology of online searching frequently fails to ensure accurate apparel sizing, resulting in a plethora of product returns and prolonged replacement processes. This challenge presents a substantial hurdle for the e-commerce sector. Numerous strategies have been proposed to tackle this issue:

Srinivasan K. and Vivek S. [1] delve into the burgeoning realm of online shopping, highlighting the imperative need for algorithms capable of digitally fitting clothing onto individual silhouettes. They underscore the intricacies of achieving precise fitment, especially within static images amid varying backgrounds and noise levels.
Pros: Enriched online shopping experiences through virtual try-on functionalities.
Cons: Absence of 3D visualization capabilities and susceptibility to lighting variations.

Ari Kusumaningsih and Eko Mulyanto Yuniarno [2] pioneer a bespoke virtual dressing room designed specifically for Madura batik attire. Their objective is to invigorate sales and perpetuate cultural heritage by delivering a tailor-made digital fitting encounter, necessitating streamlined computational methodologies for efficacious handling of 3D models.
Pros: Accurate spatial perception for seamless object placement.
Cons: Impact of lighting conditions on depth map precision.

Ting Liu and Ling Zhi Li [3] harness Kinect technology for user segmentation and skeletal tracking to align clothing models with users within a prototype virtual dressing software. They underscore the criticality of swift and authentic clothing modeling techniques.
Pros: Real-time visualization of clothing ensembles.
Cons: Static alignment constraints overlooking dynamic movements.

Stephen Karungaru and Kenji Terada [4] advocate for a Kinect-centric approach to effortlessly acquire human body measurements. Despite successful data acquisition, challenges persist in enhancing accuracy and user-friendliness.
Pros: Emphasis on human-centric data acquisition.
Cons: Shortfall in interactive features and accuracy calibration.

Dr. Anthony L. Brooks and Dr. Eva Petersson Brooks [5] collate extensive feedback through open surveys to refine their wheelchair-adaptive product, aiming to elevate user experience and accessibility, notwithstanding technical intricacies in execution.
Pros: Cutting-edge camera technologies facilitating personalized digital overlays.
Cons: Escalating system complexity.

Reizo Nakamura and Masaki Izutsu [6] delineate a methodology for estimating body dimensions leveraging Kinect data. They advocate for the utilization of multiple Kinect sensors to enhance accuracy, albeit at a premium cost.
Pros: Augmented size estimation through amalgamated sensor inputs.
Cons: Escalating implementation expenses.

Poonpong Boonbrahma and Charlee Kaewrat [7] prognosticate fabric appearances based on physical parameters, aiding in material differentiation for virtual fitting rooms. However, they stress the necessity for meticulous experimentation to validate their predictions.
Pros: Fabric simulation across diverse environments.
Cons: Imperative requirement for comprehensive experimental validation.

Umut Gültepe and Uğur Güdükbay [8] introduce a virtual fitting room harnessing depth sensor data for lifelike garment fitting. While promising, there exists a pressing need to enhance measurement accuracy and collision detection capabilities.
Pros: Authentic fitting experience enabled by depth sensing technology.
Cons: Subpar measurement precision and redundant data.

Ayushi Gahlot and Purvi Agarwal [9] focus on Kinect-driven action recognition and pose estimation, showcasing the versatility of this technology. However, they stress the importance of leveraging pose estimation for practical applications beyond mere action prediction.
Pros: Precision pose estimation facilitated by Kinect interactions.
Cons: Limited applicability beyond action prediction.

Furthermore, we have extensively explored several scholarly works delving into the concept of superimposing attire, predominantly clothing, onto human forms. This functionality empowers users to visualize themselves donning various clothing items without the need for physical try-ons. Initially, the user faces a camera, capturing their image, which is then overlaid with assorted garments for display. This streamlined process aids users in making informed decisions while augmenting their satisfaction levels.

In the research conducted by Shreya Kamani, F. Isikdogan, and Vipin Paul (cited as [1], [2], and [6] respectively), the implementation of virtual trial room applications utilizing hardware sensors such as the Microsoft Kinect is proposed. These sensors primarily capture the user's skeletal structure, providing crucial data for determining the appropriate size for virtual clothing augmentation. Presently, commercial products leverage technologies such as Microsoft's Kinect and Asus Xtion devices, which detect users' body coordinates when positioned in front of a scanner. This data serves as the foundation for generating a comprehensive 3D model of the user.
III. METHODOLOGY

A. Detecting and Sizing the Body:
The initial phase of the proposed Online Virtual Trial Room method involves acquiring the body's shape, including the head or neck depending on the attire, to establish reference points. These points are subsequently utilized to determine the appropriate placement of specific clothing or accessories. To achieve this, we experimented with various techniques: i) filtering with thresholding, Canny edge detection, and K-means, and ii) motion detection or skeleton detection, where multiple frames were scrutinized for any movement. However, the outcomes were inconsistent and insufficient for obtaining accurate reference points to display the attire. Consequently, we developed a novel detection approach centered on identifying the user's face, adjusting a reference point at the neck, and positioning the attire based on that reference point. Furthermore, an Augmented Reality (AR) marker can serve as an additional reference point. While this method proved adequate for small items like glasses or ornaments, it fell short in accurately mapping clothing onto the user's body.

Fig. 1. - Freeman's Codification

To determine the user's size, we employ an automated body feature extraction technique similar to the one demonstrated in [7]. The process involves positioning the user in front of the camera at a predetermined distance. The algorithm identifies key points on the shoulders and abdomen. By measuring the distance between these points and the user-to-camera distance, we derive the user's size. Once the image (video frame) is captured, we apply a Canny edge detection filter to isolate the body's silhouette. Given that Canny edge detection is sensitive to noise present in unprocessed data, we use a Gaussian filter to smooth the raw image. Following convolution, four filters are employed to detect horizontal, vertical, and diagonal edges within the processed image. Morphological functions are then utilized to generate a closed silhouette. Subsequently, we use an 8-point Freeman chain code, as illustrated in Fig. 1, to assign a direction to each pixel. The choice between using 8 or 4 chain codes depends on the scenario, with the following formula applicable:

z = 4*(Δx + 2) + (Δy + 2)    (1)

Formula (1) generates a sequence corresponding to rows 1-8 of the codification table in Fig. 1: z = {11, 7, 6, 5, 9, 13, 14, 15}. These values serve as indices into the table, enhancing the speed of computing the chain code. Each difference between consecutive codes represents a 45° variation. If the difference in direction between consecutive points exceeds two (90°), a feature point is identified and marked in the image:

ek = |dj+1 - dj| >= 2    (2)

That is, as stated in Eq. (2), a feature point is marked wherever the absolute difference between the directions of two consecutive contour points is at least 2. Ultimately, the distance between these feature points is measured in the image and correlated with the distance from the user to the camera to determine the size.

Numerous commercially available technologies also exist for assessing the precision, accuracy, and completeness of hand configuration data. These technologies encompass exoskeletons and instrumented gloves, known as "data gloves," which are worn on the hand and body. Data gloves offer direct measurement of hand and finger characteristics, direct data provision, high sampling frequency, ease of use, independence from line of sight, cost-effectiveness, and translation-independent data characteristics. However, using a data glove presents certain challenges, including calibration difficulties, reduced comfort and range of motion, noise in cheaper systems, and the cost of acquiring an accurate device.
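As a concrete illustration of the chain-code step, the sketch below applies formula (1) to successive contour points, maps the resulting z values to Freeman directions through a small lookup table, and flags feature points according to Eq. (2). The contour is assumed to come from cv2.findContours applied to the closed silhouette, and the direction numbering in the lookup table is illustrative.

import numpy as np

# z = 4*(dx + 2) + (dy + 2) for the eight neighbours, mapped to directions 0..7.
Z_TO_DIRECTION = {11: 0, 7: 1, 6: 2, 5: 3, 9: 4, 13: 5, 14: 6, 15: 7}

def freeman_chain_code(contour):
    # contour: output of cv2.findContours, shape (N, 1, 2), traversed as a closed loop.
    points = contour.reshape(-1, 2)
    directions = []
    for i in range(len(points)):
        dx, dy = points[(i + 1) % len(points)] - points[i]
        dx, dy = int(np.sign(dx)), int(np.sign(dy))      # unit steps between neighbours
        if dx == 0 and dy == 0:
            continue
        z = 4 * (dx + 2) + (dy + 2)                      # formula (1)
        directions.append(Z_TO_DIRECTION[z])
    # Feature points: consecutive directions differing by 2 or more (>= 90 degrees), Eq. (2).
    feature_points = [k for k in range(1, len(directions))
                      if abs(directions[k] - directions[k - 1]) >= 2]
    return directions, feature_points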

B. Face Detection:
When the user approaches the screen, the first discrete structure targeted for identification is the face. To achieve face detection, Haar feature-based cascade classifiers are employed. Rather than working on raw pixel intensity values, Haar classifiers leverage contrasts between adjacent groups of pixels, using the intensity difference to discern light and dark areas in the image [8]. This method adopts a machine learning approach, requiring the cascade function to be trained with numerous negative and positive images. A plethora of negative images (devoid of faces) and positive images (containing faces) are presented to the classifier for training, enabling it to extract features effectively. OpenCV facilitates this process by providing pre-trained classifiers for faces, eyes, smiles, and more. Equipped with both a trainer and a detector, OpenCV also allows straightforward training of custom classifiers for object detection. Upon identifying a match, it returns a rectangle Rect(x, y, w, h), where (x, y) is the top-left corner of the detected region and w and h are its width and height.

C. Image Masking:
Image masking involves setting certain pixel intensity values of the masked image to zero. Wherever the pixel intensity value in the original image is zero, the corresponding pixel intensity in the resulting masked image is typically set to the background value, which is commonly zero. The regions of interest (ROIs) for each slice are used to define the mask. If necessary, masking can be managed on a slice-by-slice basis within the ROI toolkit. Importantly, masking operations in the ROI toolkit do not affect a slice that lacks an ROI.
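A minimal sketch of the face-detection step, assuming OpenCV's bundled haarcascade_frontalface_default.xml; the offset used to place a neck reference point below the detected face box is a heuristic chosen for illustration, not a value reported in this work.

import cv2

# Pre-trained frontal-face Haar cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def neck_reference_point(frame):
    # Return an (x, y) anchor just below the first detected face, or None.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                    # top-left corner plus width and height
    return (x + w // 2, y + int(1.2 * h))    # heuristic point near the neck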

D. Edge Detection:
Numerous techniques exist for edge detection, among which the Canny edge detection technique [9] has been employed for body detection, as previously discussed. Gaussian filters are used in this edge detection method to eliminate noise in digital images, thereby preventing false detections. These filters smooth the image and diminish the impact of noise, allowing the detector to function properly. Through this process, the intensity gradients of the image are determined. Since edges in an image can be oriented horizontally, vertically, or diagonally, the algorithm employs four filters to detect all types of edges in the blurred image. Following this stage, non-maximum suppression is applied to thin the edges, yielding edge pixels that lie close to the actual edges. Additionally, some detected pixels may be artifacts of noise, for which a double threshold is applied.
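A short sketch of this pipeline, assuming illustrative kernel sizes and Canny thresholds: Gaussian smoothing, Canny edge detection (which internally performs gradient computation, non-maximum suppression, and double thresholding), and a morphological close to produce the closed silhouette used in Section III-A.

import cv2
import numpy as np

def body_silhouette(frame):
    # Smooth, detect edges, and close small gaps to approximate a body outline.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                # noise suppression
    edges = cv2.Canny(blurred, 50, 150)                        # thresholds are illustrative
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)    # close the outline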

E. Scaling of Attire:
Scaling refers to resizing the image based on specific conditions. When the user approaches the screen, the attire's size should adjust accordingly to fit the body. As the user moves closer to the screen, the image size should increase to accommodate the user's proximity, but the actual measurements of the attire should remain constant. For instance, if the person is trying on clothes with a size measurement of S, the size should not change to M or L as the person moves closer to the screen. Instead, only the overall view of the clothing should be enlarged or reduced as needed. This adjustment is achieved through scaling methods.
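One plausible way to implement this behaviour is sketched below, under the assumption that the detected face width is used as a proxy for the user-to-camera distance; the reference width is an illustrative constant rather than a calibrated value.

import cv2

def scale_attire(attire_rgba, face_width_px, reference_face_width_px=120):
    # Grow or shrink the garment with the user's apparent size; the garment's
    # nominal size label (S, M, L) is metadata and is never altered by this step.
    factor = face_width_px / reference_face_width_px
    new_w = max(1, int(attire_rgba.shape[1] * factor))
    new_h = max(1, int(attire_rgba.shape[0] * factor))
    return cv2.resize(attire_rgba, (new_w, new_h), interpolation=cv2.INTER_LINEAR)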
F. Proposed Application:
Our application is built using the Python Flask web application framework. It enables users to browse through various clothing items and accessories on the website, giving them the option to either make a purchase or virtually try on the attire. To initiate the virtual try-on process, users click on the 'Quick View' button, triggering the execution of the Tryon script. Using OpenCV, the application captures video from the device camera and seamlessly superimposes the selected attire onto the user's body in real time. Should the user find the attire satisfactory, they can proceed with the purchase or continue exploring additional wearables available on the website, mimicking the experience of shopping in a physical store.

Fig. 2. Screenshots of the Web Application

A. Algorithm
Steps involved in the algorithm:
• Step-1: Begin; import the required libraries such as os, numpy, and cv2.
• Step-2: Make directories for the training and test data using os.path.exists().
• Step-3: Call cv2.VideoCapture() to capture the gestures and load the training data.
• Step-4: Map the gestures to the desired outputs using cv2 modules.
• Step-5: Import the required Keras modules, such as Sequential, Convolution2D, MaxPooling2D, Flatten, and Dense, for building the CNN model.
• Step-6: Perform image preprocessing using keras.preprocessing.image, import ImageDataGenerator, and train the model with the loaded training data.
• Step-7: Execute the code; the model will start capturing live video sequences and predicting gestures.
• Step-8: Check the results predicted by the model and train the model with different gestures.
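The first steps of this algorithm might look roughly as follows; the directory layout, gesture labels, and key mapping are illustrative placeholders rather than the exact values used in our implementation.

import os
import cv2

GESTURES = ["0", "1", "2", "3", "4", "5"]        # illustrative gesture labels

# Step-2: create train/test directories if they do not already exist.
for split in ("train", "test"):
    for label in GESTURES:
        path = os.path.join("data", split, label)
        if not os.path.exists(path):
            os.makedirs(path)

# Step-3: open the webcam for capturing gesture images.
cap = cv2.VideoCapture(0)

# Step-4: map keyboard keys to gesture labels; pressing a key saves a sample.
key_to_label = {ord(label): label for label in GESTURES}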

IV. RESULTS AND ANALYSIS

Hand gesture recognition and the working of the CNN model follow standard steps, making it easier to split the project into modules.

• Segmentation of the hand region: The hand region of the webcam feed is segmented to save space, remove unwanted parts of the image, and reduce background noise. This is achieved by background subtraction, motion detection, thresholding, and contour extraction. OpenCV is used here as it is a module that deals with image processing and computer vision and can be used to identify objects and faces. Example functions: cv2.cvtColor, used to convert an image from one color space to another; cv2.threshold, used to assign pixel values relative to the threshold value; and cv2.imshow, used to display an image in a window.

• Dataset collection: The data set is obtained by manually providing the program with images of segmented hand gestures. Separate folders are made to store the images corresponding to each hand gesture. The os module is used within the program to make new directories, e.g., os.makedirs, used to make directories, and os.listdir, used to list all files in a directory. OpenCV is also used here, e.g., cv2.imwrite, used to save an image to a storage location. Steps involved in dataset creation:
1) First, we make new directories for each hand gesture in the 'train' and 'test' sets.
2) We then turn on video capture to capture the images for the dataset.
3) Next, we create a small frame with specific dimensions within the webcam feed where the hand gesture is to be made.
4) We also convert the RGB feed to a black-and-white feed to eliminate background noise and make it easier to distinguish the hand from its surroundings.
5) We only capture the image within the mentioned frame to remove unwanted background and reduce space.
6) We build the dataset by making the corresponding hand gesture and then pressing the key that indicates the gesture. For example, the user gestures for '5' by opening the palm of her hand with outstretched fingers within the window frame and pressing the key '5' on the numeric keypad. This saves the image in the directory for the '5' hand gesture.

• Training the CNN model: The CNN model is trained by classifying the collected data set into the appropriate hand gestures and creating a 2D Convolutional Neural Network. Keras is used to create the convolutional model. There are two ways to use Keras, sequential or functional; here the sequential API is used for layer-by-layer model creation. Functions used:
1) Convolution2D - a layer that convolves the input image into feature maps.
2) MaxPooling2D - a layer that takes the maximum value from each pooling window.
3) Flatten - a layer that flattens the image's dimensions after it has been convolved.
4) Dense - a layer that turns this into a fully connected model.
5) ImageDataGenerator - rescales the image, applies shear in some range, zooms the image, and performs horizontal flipping. train_datagen.flow_from_directory is the function used to prepare data from the train_dataset directory; target_size specifies the target size of the image.
6) test_datagen.flow_from_directory - is employed to prepare the test data for the model.

• Recognition of hand gestures: Get the thresholded image from the live feed and use the model to recognize the hand gesture in the image, as shown in Fig. 2. Functions used: cv2.cvtColor, used to change an image's color space from one to another; cv2.imshow, used to show a picture in a cv2 window; and cv2.VideoCapture, used to obtain a video capture object for the camera.
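A compact sketch of the sequential CNN and the data generators described above; the input resolution, layer widths, number of gesture classes, and epoch count are assumptions for illustration, not values reported in this paper.

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

NUM_CLASSES = 6                                   # assumed number of gestures

model = Sequential([
    Convolution2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Convolution2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Augment the thresholded gesture images and stream them from the directories.
train_datagen = ImageDataGenerator(rescale=1.0 / 255, shear_range=0.2,
                                   zoom_range=0.2, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_set = train_datagen.flow_from_directory("data/train", target_size=(64, 64),
                                              color_mode="grayscale",
                                              class_mode="categorical")
test_set = test_datagen.flow_from_directory("data/test", target_size=(64, 64),
                                            color_mode="grayscale",
                                            class_mode="categorical")

model.fit(train_set, validation_data=test_set, epochs=10)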
Fig. 2. Segmented hand region

V. CONCLUSION
The model can recognize the hand gesture in the webcam feed with 80-90% accuracy. The accuracy depends on factors such as room lighting, exposure, distance from the camera, and background noise; improving these conditions can yield much better hand gesture recognition accuracy. Remaining discrepancies can be ironed out by providing more data set images to train the CNN model. The segmented images need a plain background for the hand gesture to be picked up easily; too much background noise will cause the hand not to be detected by the camera. Likewise, overexposure or underexposure of the camera will also cause issues in picking up hand gestures.

REFERENCES
[1] S. P. Praveen, T. B. Murali Krishna, C. Anuradha, S. R. Mandalapu,
P. Sarala, and S. Sindhura, “A robust framework for handling health
care information based on machine learning and big data engineering
techniques,” International Journal of Healthcare Management, pp. 1–
18, 2022.
[2] S. Ahlawat, V. Batra, S. Banerjee, J. Saha, and A. K. Garg, “Hand
gesture recognition using convolutional neural network,” in
International Conference on Innovative Computing and
Communications: Proceedings of ICICC 2018, Volume 2. Springer,
2019, pp. 179–186.
[3] S. Ahmed, K. D. Kallu, S. Ahmed, and S. H. Cho, “Hand gestures
recognition using radar sensors for human-computer-interaction: A
review,” Remote Sensing, vol. 13, no. 3, p. 527, 2021.
[4] C. J. L. Flores, A. G. Cutipa, and R. L. Enciso, “Application of convolutional neural networks for static hand gestures recognition under different invariant features,” in 2017 IEEE XXIV International Conference on Electronics, Electrical Engineering and Computing (INTERCON). IEEE, 2017, pp. 1–4.
[5] S. P. Praveen, S. Sindhura, A. Madhuri, and D. A. Karras, “A novel effective framework for medical images secure storage using advanced cipher text algorithm in cloud computing,” in 2021 IEEE International Conference on Imaging Systems and Techniques (IST). IEEE, 2021, pp. 1–4.
[6] V. E. Jyothi, D. L. S. Kumar, B. Thati, Y. Tondepu, V. K. Pratap, and S. P. Praveen, “Secure data access management for cyber threats using artificial intelligence,” in 2022 6th International Conference on Electronics, Communication and Aerospace Technology. IEEE, 2022, pp. 693–697.
[7] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell, “Hidden conditional random fields for gesture recognition,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1521–1527.
[8] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[10] P. Y. Simard, D. Steinkraus, J. C. Platt et al., “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR, vol. 3. Edinburgh, 2003.
[11] V. E. Jyothi, B. D. C. N. Prasad, and R. K. Mojjada, “Analysis of cryptography encryption for network security,” IOP Conference Series: Materials Science and Engineering, vol. 981, no. 2. IOP Publishing, 2020.
[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456.
