
REAL-TIME MOTION INSIGHT USING MEDIAPIPE

A. Lakshmiprabha 1, Dr. G. Arockia Sahaya Sheela 2,


Student 1, Assistant Professor and Head 2,
PG Department of Data Science, Holy Cross College (Autonomous), Tiruchirappalli.

ABSTRACT
Real-time motion insight using MediaPipe is an approach that leverages advanced computer vision techniques to analyze and interpret human movements in real time. Motion insight refers to the understanding or knowledge gained from observing and analyzing movement: extracting meaningful patterns, recognizing specific motions, or gaining insight into human behavior based on observed movements. This work introduces an AI-driven real-time body language decoder developed using MediaPipe and Python. Utilizing sophisticated computer vision methodologies, the system interprets and decodes human gestures, facial expressions, and body movements instantaneously. The implementation utilizes the MediaPipe framework, enabling seamless integration and efficient processing of visual data, and the real-time nature of the decoder allows for instantaneous analysis of live interactions. Python serves as the programming language, facilitating a robust and accessible solution for developers. The model demonstrates the capability to decode nuanced nonverbal cues, enhancing our understanding of human communication. The potential applications span various fields, including human-computer interaction, behavioral analysis, sign language prediction, drowsiness detection, and communication enhancement. The model utilizes pose estimation strategies and evaluates their appropriateness for body language recognition. The proposed system and architecture exhibit real-time inference capabilities as well as offer precise predictions.

Keywords: Motion insight, MediaPipe, Real-time pose tracking, Holistic framework, Pose estimation, Prediction.

I. INTRODUCTION

In an era of rapid technological breakthroughs, the integration of artificial intelligence and computer vision has enabled ground-breaking applications in comprehending and deciphering human behavior. "Real-Time Motion Insight Using MediaPipe" is a new exploration of dynamic motion analysis that incorporates these innovative technologies. The intention is to explore the world of real-time motion understanding, which is enabled by advanced computer vision techniques, in particular the varied capabilities of the MediaPipe framework. MediaPipe serves as the foundation of this endeavor, providing a cross-platform solution for leveraging the potential of machine learning applications. The core objective is to unravel the deep complexities of human motion, including gestures, facial expressions, and broader body movements, and to offer instantaneous, accurate insights through real-time analysis. Leveraging MediaPipe's powerful capabilities, the aim is to develop a system that not only recognises and archives movements efficiently, but also intelligently evaluates them, facilitating a wide range of applications.

Background:

By creating systems that can imitate human cognitive abilities, artificial intelligence (AI) has made strides towards success in the past few years. Artificial intelligence is the exploration of developing computer systems that can perform operations involving human reasoning. Machine learning (ML) is a branch of artificial intelligence that enables computers to learn and improve over time without explicit programming. The integration of AI and ML has significantly enhanced the capabilities of systems to analyze and interpret motion in real time. Through algorithms and models, computers can now identify, track, and understand various movements, opening up possibilities for applications ranging from augmented reality to health and fitness monitoring. MediaPipe is a versatile, open-source framework developed by Google, designed to address real-time perception tasks. The goal is to harness MediaPipe's capabilities to obtain real-time motion knowledge: analysing and decoding motion in a live video feed or camera input by employing pose estimation, hand tracking, or other pertinent modules.

MediaPipe:

MediaPipe is an extensible, freely available framework designed by Google for managing real-time perception challenges. For operations like pose prediction, hand tracking, and face recognition, it provides pre-built solutions. It is supported by an active community and abundant documentation, and it is cross-platform, enabling implementation on a variety of operating systems. MediaPipe's capabilities in real-time motion analysis are facilitated through modules such as pose estimation, hand tracking, and face detection. These modules provide the essential building blocks for understanding and interpreting human movement, making the framework a valuable tool for developers seeking to integrate motion insight into their applications.

The Body Landmark model from MediaPipe tracks and identifies important spots on the human body in real time to facilitate pose estimation. Pose estimation functions in real time, enabling continuous tracking of body movements and providing a dynamic representation of an individual's posture by identifying and localising the positions of body joints and landmarks. The Body Landmark model from MediaPipe locates and monitors important body parts such as the shoulders, elbows, wrists, hips, knees, and ankles. As a result, it is possible to interpret the motion and positioning of the body in great depth.

Fig 1.1 Pose Landmarks
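
As an illustrative sketch (using MediaPipe's standard Python solutions API rather than the exact code of this work), the pose landmarks described above can be read as follows; the image path and the chosen landmark are placeholders.

# Minimal sketch: reading pose landmarks with MediaPipe Pose.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

image = cv2.imread("person.jpg")  # hypothetical input image
with mp_pose.Pose(static_image_mode=True, min_detection_confidence=0.5) as pose:
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.pose_landmarks:
    # Each landmark exposes normalized x, y, z coordinates and a visibility score.
    left_shoulder = results.pose_landmarks.landmark[mp_pose.PoseLandmark.LEFT_SHOULDER]
    print(left_shoulder.x, left_shoulder.y, left_shoulder.z, left_shoulder.visibility)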

MediaPipe's Holistic Hand Connections is a feature within the MediaPipe framework that focuses on providing comprehensive hand tracking and gesture recognition capabilities. It is designed to recognize key hand landmarks and establish connections between them, enabling the modelling and analysis of hand movements. Dynamic tracking of hand movements is made possible by the real-time nature of Holistic Hand Connections, which is particularly beneficial for applications that must respond to hand gestures instantaneously.
Fig 1.2 Hand Landmarks

MediaPipe provides a Face Mesh model that is designed for facial landmark detection and tracking. It identifies and tracks a set of 468 unique landmarks on the face, including points on the eyebrows, eyes, nose, mouth, and jawline. While the term "Face Mesh Connections" may not be explicitly used, connections can be inferred from the sequential arrangement of facial landmarks; these connections represent the natural structure and geometry of the face. Face Mesh is commonly used in applications such as facial expression analysis, augmented reality (AR) effects, avatar creation, and virtual try-on experiences.

Fig 1.3 Face Landmarks

Scope:

Real-Time Motion Insight Using MediaPipe is a broad field that encompasses various aspects of leveraging the capabilities of the MediaPipe framework for analyzing and understanding human motion in real time. It can be used to integrate real-time motion insight into AR applications, creating immersive and interactive experiences and developing AR effects or overlays based on user movements and gestures. It can also be leveraged to elevate human-computer interaction through natural and intuitive interactions grounded in real-time motion, exploring touchless interfaces, gesture controls, and adaptive UI elements. Real-time motion analysis can be explored in a variety of contexts with this system, ranging from gaming and entertainment to healthcare and education, offering opportunities for innovation and inventiveness in making use of the potent capabilities of the MediaPipe framework to interpret human motion in real time.

Performance metrics and continuous improvement initiatives will be integral, fostering a dynamic project that adapts to evolving technological landscapes. Future enhancements may involve machine learning integration for personalised insights and expanding the range of supported gestures and movements.

Limitations:

Although motion analysis has great and promising scope, challenges and limitations also arise. Real-world factors such as lighting conditions and occlusions can impact the precision of the MediaPipe framework's models for pose estimation and gesture recognition. The system's vocabulary for recognizing gestures may be limited, potentially hindering its ability to comprehend nuanced or less common movements effectively.

Computational intensity poses a constraint: real-time motion analysis demands significant processing power, which may result in performance issues on resource-constrained devices. Environmental factors, such as dynamic backgrounds, can introduce variability in the accuracy of motion tracking. Learning curves for users might exist, necessitating user training or clear documentation to ensure effective interaction with the system. Achieving low-latency real-time performance can also be challenging, requiring careful trade-offs between accuracy and responsiveness.

Objective:

The primary goal is to revolutionize human-computer interaction and motion analysis through the sophisticated capabilities of the MediaPipe framework:

 Pose Estimation Excellence: Utilize MediaPipe's pose estimation models to achieve highly accurate tracking of essential body landmarks in real time.
 Facial Expression Analysis Mastery: Harness face mesh and facial landmark tracking in MediaPipe to delve into intricate facial expressions and emotions.
 Fitness and Health Monitoring Advancements: Apply real-time motion analysis to monitor and assess physical activities, providing constructive feedback on exercise routines, posture, and overall fitness.
 Educational Tool Development and Interactive Learning: Develop educational tools that leverage real-time motion insights to instruct complex concepts related to anatomy, physics, or body kinetics.
 Gaming and Entertainment Innovation: Implement real-time motion analysis to enrich gaming experiences, providing a more immersive and dynamic user interface.
 Holistic Motion Analysis: Combine body pose, hand gestures, and facial expressions to create a comprehensive understanding of holistic human motion.
 Augmented Reality Integration and Immersive Experiences: Integrate real-time motion insights seamlessly into augmented reality applications for immersive and interactive experiences.
 Accessibility Solutions for Inclusive Technology: Employ real-time motion analysis for creating adaptive technology solutions, addressing the needs of individuals with disabilities.

The above objectives collectively form a comprehensive roadmap for the project, outlining a vision that combines technological sophistication with a user-centric approach, fostering advancements in motion analysis and human-computer interaction.
II. LITERATURE REVIEW

Aashish Ananthanarayan et al. [1] develops a system that uses the YOLO (You Only Look Once) network to detect human body language during subconscious actions and determine how people are feeling in real time. The system uses PyTorch for hand detection and runs on an in-lab GPU. For body language detection it leverages YOLO, and the work aims to demonstrate YOLO's use in recognizing certain body language features and its potential for speedy detection. The system produced some misclassifications, and ongoing work targets error-free detection of more sophisticated body language to read human emotions.

Babita Sonare et al. [2] proposes a system that can translate sign language gestures to text and speech in real time using video input. The system uses two deep learning algorithms, a convolutional neural network (CNN) and a recurrent neural network (RNN), to recognize the hand movements and convert them to words. The system also uses an open-source text-to-speech API to generate speech from the text. The paper claims that the system can achieve an accuracy of 92.4% on dynamic hand gestures and that it can be useful for deaf and mute people to communicate with others, especially in educational and business settings. The paper also discusses the challenges and future work of the system, such as improving its robustness, scalability, and efficiency.

Calin Alexandru Octavian et al. [3] establishes a system that can recognize hand gestures and translate them to text and speech using a glove equipped with sensors and a computer running Python. The system uses a neural network to classify the gestures based on the data from the sensors, and an open-source text-to-speech API to generate speech from the text. The system achieved an accuracy of 92.5% on dynamic hand gestures and can be useful for people with disabilities, especially deaf and mute people, to communicate with others.

Bhuvi Sharma et al. [4] aims to provide a solution for recognizing hand gestures in real time using computer vision and deep learning techniques. It proposes a method for creating a custom dataset of hand gestures using a webcam, and then training a TensorFlow model using the Single Shot MultiBox Detector (SSD) algorithm. The system leverages OpenCV to preprocess the images and an open-source text-to-speech API to generate speech from the recognized gestures. It is well organized and provides a clear and detailed explanation of the proposed method and the experimental results.

Jong-Wook Kim et al. [5] aims to develop a system for human pose estimation using computer vision and optimization techniques. The method estimates the joint angles of a 3D humanoid model from a video input, using MediaPipe Pose, a 2D pose estimation tool, and a global optimization method. The paper claims that the system can achieve high accuracy and real-time performance, and that it can be useful for monitoring and assisting seniors who live alone at home. The paper reviews the existing methods and their limitations, and refines the proposed method, which combines an off-the-shelf 2D pose estimation tool and a fast optimization method.

Sherzod Turaev et al. [6] uses data analysis approaches to conduct various descriptive and exploratory examinations of the findings. These strategies yield more accurate analysis of machine learning methodologies as well as more authentic domain knowledge of abnormal behaviors and body motions. The discoveries of this study are critical to the development of intelligent automated systems that effectively evaluate the physical and psychological well-being of patients, recognize outward signs and symptoms of disease, and appropriately track the health of patients.
Amritanshu Kumar Singh et al. [7] develops real-time human pose detection and recognition using MediaPipe, an open-source framework for building perception pipelines. The system leverages MediaPipe Holistic, which provides pose, face, and hand landmark detection models, to parse the frames obtained from a real-time device feed using OpenCV. The project exports the coordinates of 501 landmarks to a CSV file and trains a custom multi-class classification model to learn the relationship between the class and the coordinates. It compares four machine learning classification algorithms: random forest, linear regression, ridge classifier, and gradient boosting classifier.

Swati Raman et al. [8] delves into the fascinating field of emotion and gesture detection. Leveraging advanced techniques, they explore how emotions can be inferred from human gestures. Their work significantly contributes to enhancing human-computer interaction, enabling more intuitive and empathetic interfaces. The study emphasizes the importance of accurate emotion recognition for applications such as virtual reality, robotics, and assistive technologies. Overall, Raman and colleagues shed light on the intricate relationship between gestures and emotions, paving the way for innovative solutions in this domain.

Mohamed S. Abdallah et al [9] examines the success of deep learning approaches in hand gesture and
dynamic sign language recognition. Despite their remarkable achievements, the deployment of sign
language recognition applications on mobile phones, constrained by limited storage and computing
capacities, poses significant challenges. The system proposes the adoption of lightweight deep neural
networks with advanced processing tailored for real-time dynamic sign language recognition (DSLR). Their
contribution includes the development of a DSLR application utilizing two robust deep learning models, the
GRU and the 1D CNN, in conjunction with the MediaPipe framework.

Rawad Abdulghafor et al. [10] examines the landscape of epidemic and pandemic illnesses in recent decades, with COVID-19 serving as a prominent case study; researchers have extensively documented the prevalence of such global health challenges. In the realm of medical applications, smart technology emerges
as a necessary player, particularly in the domain of automated symptom detection. Notably, artificial
intelligence techniques have been harnessed by researchers to delve into body language analysis, addressing
diverse tasks ranging from fall detection to the identification of COVID-19 symptoms. This comprehensive
meta-review meticulously evaluates various methodologies proposed in preceding papers, highlighting their
individual significance and presenting the outcomes achieved. As the exploration of gesture recognition
unfolds, this study elucidates the field's dynamic nature and its promising potential to influence the
landscape of illness diagnosis and management, offering valuable insights for future research endeavors.

III. PROBLEM STATEMENT

The current landscape of motion analysis lacks a robust and real-time solution for extracting comprehensive insights from dynamic visual data. Existing methods often suffer from limitations in accuracy, speed, and accessibility, hindering their applicability in various domains such as sports analytics, health monitoring, and human-computer interaction. To address these challenges, there is a pressing need for an advanced real-time motion insight system leveraging the capabilities of MediaPipe, a powerful open-source framework for building multi-modal applied machine learning solutions.

The primary goal is to develop a solution that can efficiently and accurately analyze real-time motion data,
extracting meaningful insights from live video feeds. This involves overcoming challenges related to
occlusions, diverse motion patterns, and the need for low-latency processing. The system should be
adaptable to different use cases, providing a versatile tool for applications ranging from gesture recognition
and posture analysis to sports performance tracking.
The key challenges include:

 Instantaneous Computation: Minimising processing delay to guarantee the extraction of motion insights in real time, enabling instantaneous evaluation and reaction.

 Precision and Sturdiness: Improving motion analysis's precision and stability to manage
intricate and varied movement patterns in a range of settings while reducing mistakes brought
on by obstacles or inaccurate data.

 Hybrid Integration: Investigating how several modalities, such as hand tracking, facial recognition, and skeleton tracking, can be integrated to give an in-depth understanding of the individual's motion.

 Scalability: Producing a system that can be easily deployed on a wide variety of devices, from robust servers to resource-constrained edge devices, by designing it to scale smoothly across multiple hardware configurations.

 User-Friendly Interface: Developing an interface that is easy to use, allowing for customization according to particular use cases and simple integration into a variety of applications.

 Safety Factors: Taking steps to ensure that sensitive information is managed appropriately in
order to address privacy concerns, particularly in applications that use biometrics or health-
related information about an individual.

The proposed real-time motion insight system using MediaPipe is aimed at overcoming these obstacles in
order to offer a flexible, precise, and affordable solution for a variety of applications, advancing the fields of
healthcare, sports analytics, human-computer interaction, and other areas.

IV. EXISTING SYSTEM

Body language decoding systems have witnessed remarkable evolution, leveraging cutting-edge
technologies such as computer vision and machine learning to interpret non-verbal cues. These systems play
a crucial role in various domains, including human-computer interaction, security, and mental health
assessments.

 Computer Vision-based Approaches: Many systems rely on computer vision algorithms to analyze video footage and extract relevant pose information. OpenPose, a popular open-source library, has been extensively used for multi-person pose estimation. These systems excel in detecting key joints and skeletal structures, forming the foundation for nuanced body language interpretation.

 Deep Learning Architectures: Deep learning models, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated remarkable success in capturing temporal dependencies and spatial features within video sequences. These models, when trained on large datasets, showcase improved accuracy in recognizing intricate body language patterns.

 Sensor-based Approaches: Some systems integrate wearable sensors, such as accelerometers and gyroscopes, to capture subtle body movements. These approaches offer real-time feedback and are especially valuable in scenarios where camera-based solutions face limitations.
 Fusion of Modalities: To enhance robustness, several systems integrate multiple modalities,
combining video analysis with audio and textual cues. This multi-modal fusion enables a more
holistic understanding of communication dynamics, reducing ambiguity in decoding body
language.

The landscape of body language decoding systems is vibrant, driven by a convergence of computer vision,
deep learning, and sensor technologies. While challenges persist, the continuous innovation and
commitment to ethical considerations position these systems as valuable tools in diverse applications. As we
look towards the future, the refinement of decoding algorithms, increased customization, and a heightened
focus on ethical implementation will shape the next generation of body language analysis systems.

Fig 4.1 Real-Time Sign Language Detection

Existing systems like DeepSign and Microsoft Kinect Sign Language Translator have revolutionized sign
language decoding by employing deep learning and computer vision techniques. These systems, trained on
extensive sign language datasets, utilize convolutional neural networks (CNNs) and recurrent neural
networks (RNNs) to accurately interpret sign language gestures in real-time, offering seamless
communication for the deaf community. Simultaneously, face expression decoding systems such as
Affectiva and FaceReader have paved the way for understanding human emotions through facial micro-
expressions, contributing to applications in market research and mental health. The integration of these
advancements forms the foundation of real-time body language decoding systems. By combining pose
estimation from sign language decoders and emotion recognition from face expression decoders, these
systems can interpret a wide array of non-verbal cues in real-time scenarios. This multi-modal fusion
enhances the system's ability to comprehend nuanced human communication, establishing a robust
framework for inclusive and sophisticated human-computer interaction experiences.

V. PROPOSED SYSTEM

Body language decoding systems have witnessed significant advancements, with various approaches
employing computer vision and deep learning. One prominent system utilizes the MediaPipe library for real-
time pose detection. The integration of MediaPipe's pose model allows for accurate identification of key
body landmarks, enabling nuanced interpretations of human gestures.
The system impressively captures live video feed or webcam input, converting frames to RGB format for
optimal processing. It effectively extracts and analyzes pose landmarks, such as shoulders and wrists, to
interpret gestures. For instance, the system intelligently calculates angles between body parts to discern
raised hands, associating them with positive or neutral gestures. The modular design facilitates the
incorporation of additional pose landmarks for more comprehensive body language analysis. The code's
clarity and simplicity make it accessible for customization and expansion. This system's reliance on
MediaPipe enhances its robustness, ensuring reliable and efficient pose detection in real-world scenarios.

The final section of the model showcases the real-time application of the trained model, making instant
predictions on detected body language and providing immediate feedback. The incorporation of probabilities
enhances the interpretability of the system's predictions, contributing to a more nuanced understanding of
the recognized gestures.

In summary, this comprehensive and modular implementation underscores the technical process of
combining pose estimation, machine learning, and real-time visualization. The system not only captures the
intricate details of human body language but also presents a robust framework for developing sophisticated
and interactive systems that decode and interpret non-verbal communication cues in real-time.

The Real-Time Motion Insight using MediaPipe exemplifies a powerful fusion of computer vision and deep
learning, offering a versatile foundation for understanding and interpreting human gestures in real-time
applications. The system implemented in the provided code offers a multitude of practical applications in
various real-time scenarios. One significant use is in the domain of human-computer interaction, where the
system can enhance user experiences by interpreting and responding to natural gestures. This is particularly
relevant in interactive virtual environments, gaming, and immersive simulations, where users can engage
with the system through intuitive body language.

In the context of video conferencing and online communication platforms, the system can provide valuable
insights by analyzing participants' non-verbal cues, contributing to more nuanced and effective virtual
communication. Furthermore, the system's application extends to security and surveillance, where it can
assist in the real-time identification of suspicious behavior or potential security threats through the analysis
of body language patterns. Overall, the versatility of this real-time body language decoding system makes it
a valuable tool for creating more interactive, engaging, and secure applications across various domains.

Fig 5.1 Flowchart of working module


VI. METHODOLOGY
MediaPipe:

MediaPipe, an advanced open-source framework developed by Google, provides an all-encompassing solution for building real-time applications centered around perceptual computing tasks. Its primary goal is to streamline the development of applications that involve the processing and interpretation of multimedia data, such as image and video processing. With a modular and flexible architecture, MediaPipe includes a diverse range of pre-built models and components tailored for tasks like face detection, hand tracking, and pose estimation. Its versatility is highlighted by its adept handling of various input sources, including cameras and video streams, making it applicable across a wide range of applications, from augmented reality to gesture recognition.

Designed to cater to the needs of both developers and researchers, the framework boasts user-friendly APIs
and readily available pre-trained models. Supporting cross-platform development ensures that applications
can seamlessly operate on different devices and operating systems. Furthermore, users have the flexibility to
customize models or seamlessly integrate their own, enhancing the framework's adaptability.

In the specific context of the Body Language Decoder, which leverages MediaPipe's capabilities, the
drawing_utils and holistic components play crucial roles. The drawing_utils module aids developers by
facilitating the visualization of identified landmarks and annotations on images or videos, contributing to a
deeper understanding and refinement of their applications. Simultaneously, the holistic module proves
particularly advantageous for comprehensive human body pose estimation, capturing key points such as
facial landmarks, hand gestures, and overall body posture.

By harnessing the capabilities of MediaPipe, the Body Language Decoder can effectively interpret and
analyze real-time body language cues. This functionality is invaluable across various applications, from
human-computer interaction to emotion recognition, offering a nuanced understanding of non-verbal
communication through gestures, poses, and facial expressions. Overall, MediaPipe stands as a flexible tool
for developers looking to implement sophisticated perceptual computing tasks in their applications.

Fig 6.1 MediaPipe Holistic Module

MediaPipe's Face Mesh model:

The Face Mesh model is a sophisticated tool designed to accurately track and analyze facial features in
real-time. Developed by Google, it is part of the broader MediaPipe framework and serves various
applications in fields such as augmented reality, facial animation, and emotion analysis. The Face Mesh
model operates by detecting key landmarks on a person's face, allowing for precise tracking and
manipulation of facial expressions.
Fig 6.2 Visual Representation of Face Mesh Model

MediaPipe's Face Mesh model is a versatile tool for accurately tracking and analyzing facial features in real-
time. Its high accuracy, real-time performance, and broad applicability make it a valuable asset for
developers and researchers in various industries. Whether it's for creating immersive AR experiences,
enhancing facial animation, or conducting market research, the Face Mesh model offers a powerful solution
for understanding and interacting with facial expressions.
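
As a hedged illustration (based on MediaPipe's standard Face Mesh Python API rather than code from this paper), the 468 facial landmarks can be detected as sketched below; the frame source is a placeholder, and the connection constant name can vary across MediaPipe releases.

# Sketch: detecting and drawing face landmarks with MediaPipe Face Mesh.
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
mp_drawing = mp.solutions.drawing_utils

frame = cv2.imread("face.jpg")  # hypothetical input frame
with mp_face_mesh.FaceMesh(max_num_faces=1, min_detection_confidence=0.5) as face_mesh:
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    for face_landmarks in results.multi_face_landmarks:
        # Draw the contour connections inferred from the landmark ordering.
        mp_drawing.draw_landmarks(frame, face_landmarks, mp_face_mesh.FACEMESH_CONTOURS)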

MediaPipe's Hand Connections model:

MediaPipe's Hand Connections Model is a component within the broader MediaPipe framework, developed
by Google, designed specifically for hand tracking and gesture recognition tasks. This model employs
machine learning techniques to accurately detect and track hand keypoints in real-time video streams or
static images. The primary objective of the Hand Connections Model is to identify and establish connections
between keypoints representing different parts of the hand, enabling robust hand tracking and gesture
analysis.

Fig 6.3 Visual Representation of Hand Connections Model

MediaPipe's Pose Connections model:

MediaPipe's Pose Connections Model is a crucial component within the broader MediaPipe framework
developed by Google. It focuses on human pose estimation, which involves detecting and tracking key body
joints or keypoints to infer the body's pose, such as the position and orientation of various body parts. This
model utilizes machine learning algorithms to accurately predict the spatial arrangement of these keypoints,
facilitating a wide range of applications in fields like augmented reality (AR), gaming, healthcare, and more.

Fig 6.4 Visual Representation of Pose Connections Model

Install and import dependencies:

 MediaPipe: The core library for utilizing the MediaPipe framework, including its modules for
holistic body pose estimation.

 OpenCV (cv2): OpenCV is a widely used computer vision library that provides tools for image
and video processing. It's often used in conjunction with MediaPipe for handling frames and
displaying visualizations.

 NumPy: NumPy is a library for numerical operations, and it can be handy for various
mathematical and array manipulation tasks when working with the output from MediaPipe.

Additional libraries: Depending on your specific requirements, you may need other libraries such as Pandas, scikit-learn, Seaborn, Matplotlib, etc. Incorporating libraries such as MediaPipe in a Body Language Decoder not only streamlines the development process but also ensures accuracy, efficiency, and community support, contributing to the creation of robust and effective applications for decoding and interpreting non-verbal communication cues.

The system leverages both the drawing_utils and holistic modules from the MediaPipe framework, where drawing_utils is utilized for the visualization of results generated by the models and the Holistic module is used for pose estimation.
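
A brief sketch of this setup is given below (package versions are not specified in this work and may need adjustment for a given environment):

# Install the core dependencies (run once in the Python environment or notebook):
#   pip install mediapipe opencv-python pandas scikit-learn numpy

import cv2                      # frame capture and visualization
import numpy as np              # array handling for landmark coordinates
import mediapipe as mp          # MediaPipe framework

mp_drawing = mp.solutions.drawing_utils   # visualization of detected landmarks
mp_holistic = mp.solutions.holistic       # holistic pose, face, and hand estimation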

Make some detections:

The system performs holistic (full-body) pose detection and hand tracking on the video feed from the webcam using OpenCV. The webcam captures video frames, the holistic model processes them, and the results are visualized by drawing landmarks and connections on the video frames. The detection stage performs the following steps:
Fig 6.5 Visual Representation of Detected landmarks

1.Setting up Video Capture:

Use a library like OpenCV (cv2) to capture video frames from a camera source or a video file. cv2.VideoCapture() initializes the video capture object and sets up the connection to the camera or video file, allowing individual frames to be grabbed for processing.

2.Initializing Holistic Model:

Import and initialize the MediaPipe library, then create a Holistic model using mp.solutions.holistic.Holistic(). This step sets up the holistic body pose estimation model, which includes detection for various body parts such as the face, hands, and posture.

3.Processing Video Frames:

Continuously capture video frames in a loop using the capture object's read() method. Convert the frames to RGB format if needed (MediaPipe processes RGB images), then pass the RGB frames to the MediaPipe Holistic model using holistic.process(image).

4.Recoloring and Making Detections:

Some image processing may be necessary, such as recoloring or resizing, to match the requirements of the model. The Holistic model processes the frames and detects various landmarks on the face, hands, and body, providing information about their positions in the frame. Retrieve the results from the Holistic model.

5.Drawing Landmarks and Connections:

Use the results obtained from the Holistic model to draw landmarks and connections on the processed frame. The detected landmarks represent key points on the body, such as joints or facial features. Draw lines or connections between these landmarks to visualize the body posture and movements.

6.Displaying the Processed Frame:

Use cv2.imshow() to display the processed frame with landmarks and connections. This allows real-time visualization of the detected body language cues and gestures.
By following these steps, a pipeline has been created for capturing video frames, processing them through the MediaPipe Holistic model, and visualizing the results in real time; a condensed sketch of this loop is shown below. The drawn landmarks and connections on the frame provide insights into the subject's body language, enabling applications like motion insight, gesture recognition, or emotion analysis. It provides a great starting point for building applications that involve real-time pose and hand tracking using a webcam, and the code provides a visual representation of detected landmarks and connections on the human body.
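
The following sketch condenses steps 1 to 6; it assumes a default webcam at index 0 and the imports and aliases shown earlier, and the face connection constant may be named FACE_CONNECTIONS in older MediaPipe releases.

cap = cv2.VideoCapture(0)  # step 1: open the default webcam
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:  # step 2
    while cap.isOpened():
        ret, frame = cap.read()                          # step 3: grab a frame
        if not ret:
            break
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # step 4: recolor to RGB
        results = holistic.process(image)                # run detection

        # Step 5: draw pose, face, and hand landmarks back onto the BGR frame.
        mp_drawing.draw_landmarks(frame, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION)
        mp_drawing.draw_landmarks(frame, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)

        cv2.imshow('Holistic Detections', frame)         # step 6: display
        if cv2.waitKey(10) & 0xFF == ord('q'):           # press 'q' to quit
            break

cap.release()
cv2.destroyAllWindows()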

Capture landmarks and export to CSV:

Initially, a comma-separated values (CSV) file is created, and the landmark coordinates are then exported to it. This code captures video frames from the webcam, processes them using the MediaPipe holistic model, and then exports the pose and face landmark coordinates, along with visibility information, to a CSV file named 'coords.csv'. The order of landmarks in the CSV file corresponds to the order in which they are flattened. The major tasks in this section are:

1. CSV file initialization.

The method involves the initialization of a CSV file and the subsequent exportation of landmark
coordinates, playing a crucial role in body language detection. In the initialization step, a CSV file is
created, setting the groundwork for storing and organizing the landmark coordinates efficiently. The file
serves as a structured repository for the pose and facial landmark data extracted during real-time body
language decoding.

2. Exporting landmark co-ordinates to CSV.

Following this initialization, the method proceeds to export the landmark coordinates to the CSV file.
This step involves collecting and arranging the relevant data, such as x, y, z coordinates, and visibility, into
rows. The CSV file then becomes a comprehensive dataset capturing the spatial and visibility aspects of
detected landmarks over time. This data, stored in a structured format, becomes invaluable for subsequent
processes, particularly in training machine learning models. By systematically organizing landmark
information, this method lays the foundation for effective data collection, ensuring that the system is
equipped with the necessary inputs to discern and interpret body language patterns accurately.

Fig 6.6 Exporting Detected landmarks to coords.csv file


This portion is leveraged to create a dataset that can be used to train the model on specific poses and gestures, each labelled with a human-readable term called the class name. The gestures and poses are processed, and the landmarks are exported with their corresponding class name to the created 'coords.csv' file.
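
The sketch below illustrates, under the same assumptions as the earlier loop (a `results` object from holistic.process() and the NumPy import), how the CSV header can be initialized for the 501 pose and face landmarks and how one labelled row can be appended; the 'victory' label is only an example class name.

import csv

num_coords = 501  # 33 pose landmarks + 468 face landmarks
landmarks = ['class']
for i in range(1, num_coords + 1):
    landmarks += [f'x{i}', f'y{i}', f'z{i}', f'v{i}']

# 1. CSV file initialization: write the header row once.
with open('coords.csv', mode='w', newline='') as f:
    csv.writer(f).writerow(landmarks)

# 2. Exporting landmark coordinates (inside the capture loop, per processed frame).
class_name = 'victory'  # example class label for the gesture being recorded
pose = results.pose_landmarks.landmark
face = results.face_landmarks.landmark
pose_row = list(np.array([[lm.x, lm.y, lm.z, lm.visibility] for lm in pose]).flatten())
face_row = list(np.array([[lm.x, lm.y, lm.z, lm.visibility] for lm in face]).flatten())
row = [class_name] + pose_row + face_row

with open('coords.csv', mode='a', newline='') as f:
    csv.writer(f).writerow(row)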

Train Custom Model Using scikit-learn:

Read collected data and process:

Here, the 'coords.csv' file is imported as a dataframe using the pandas library. The class column is dropped to separate the features from the labels: the dataset is partitioned into target and input variables, and the system then performs some initial data exploration and splits the data into training and testing sets using scikit-learn's train_test_split.

The random_state parameter ensures reproducibility. The script concludes by printing the target values of the test set (y_test), suggesting an emphasis on evaluating model performance. The code follows standard data preprocessing and splitting practices for machine learning tasks and assumes basic familiarity with pandas and scikit-learn conventions.
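
A minimal sketch of this loading and splitting step follows; the test_size and random_state values shown are illustrative assumptions rather than values specified in this work.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('coords.csv')        # landmark dataset collected earlier
X = df.drop('class', axis=1)          # input features: landmark coordinates
y = df['class']                       # target: body language class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234)  # random_state for reproducibility
print(y_test)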

Train different ML models:

The system utilizes scikit-learn to define and train machine learning pipelines for different
classification algorithms, including Logistic Regression, Ridge Classifier, Random Forest, and Gradient
Boosting. It employs a systematic approach with pipelines, integrating feature scaling using StandardScaler.
The pipelines are encapsulated in a dictionary, enhancing code modularity and readability. The 'fit_models'
dictionary stores the trained models for each algorithm, allowing for easy retrieval and further analysis. The
loop iterates through the pipelines, fitting models to the training data, and populates the 'fit_models'
dictionary.
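
The pipeline dictionary described above can be sketched as follows; the short keys and the default hyperparameters (with max_iter raised for convergence) are assumptions for illustration.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# One pipeline per algorithm, each with feature scaling before the classifier.
pipelines = {
    'lr': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'rc': make_pipeline(StandardScaler(), RidgeClassifier()),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier()),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}

fit_models = {}
for name, pipeline in pipelines.items():
    fit_models[name] = pipeline.fit(X_train, y_train)  # train each pipeline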

Logistic Regression:

Logistic Regression, employed as a statistical method for tasks involving binary classification, constitutes a
fundamental element within the real-time body language decoding pipeline. In the domain of decoding body
language, Logistic Regression emerges as a potent instrument for predicting discrete classes, proving
particularly apt for scenarios where the objective involves categorizing gestures or expressions into
predefined classes.

Integrated into the real-time body language decoding pipeline through the scikit-learn library, Logistic
Regression forms an integral part of a comprehensive machine learning framework that encompasses
preprocessing stages and various classification algorithms. Its contributions to the overall functionality can
be delineated as follows:

Feature Scaling and Standardization:

Logistic Regression's sensitivity to the scale of input features necessitates a preprocessing step within the
pipeline. Typically involving feature scaling or standardization prior to model fitting, this ensures equitable
contributions from all input features, ultimately enhancing the model's performance.

Pipeline Consistency:

The adoption of a pipeline in the real-time body language decoder ensures a uniform and systematic
approach to data processing. Logistic Regression seamlessly integrates into this pipeline alongside crucial
components like data preprocessing and other classification algorithms. This modular and well-organized
structure enhances the readability of the code, facilitates maintenance, and allows for future modifications.

Probabilistic Output:

An inherent strength of Logistic Regression lies in its provision of probability estimates for each class.
This feature is particularly advantageous in real-time applications where discerning the confidence of
predictions holds paramount importance. In the context of a body language decoder, the model's capability
to furnish probabilities allows for nuanced interpretation of detected gestures, offering insights into the
certainty of the classification.
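
For instance, under the fit_models dictionary sketched earlier, the probability estimates can be inspected as follows (an illustrative snippet, not code from the paper):

# Probability estimates for the first test sample from the logistic regression pipeline.
probs = fit_models['lr'].predict_proba(X_test.iloc[[0]])[0]
print(dict(zip(fit_models['lr'].classes_, probs.round(3))))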

Interpretability:
Logistic Regression models are inherently interpretable, shedding light on the influence of each feature on
classification decisions. In the realm of body language analysis, this interpretability proves valuable in
understanding which landmarks or features contribute significantly to the predicted body language,
providing insights into the model's decision-making process.

Real-time Inference:

Logistic Regression's computational efficiency renders it well-suited for real-time applications. The
model's streamlined nature, coupled with its ability to generate prompt predictions, ensures seamless
operation of the body language decoder in real-time, delivering instantaneous feedback on detected gestures.

Logistic Regression assumes a crucial role in the real-time body language decoding pipeline, furnishing a
robust and interpretable classification model. Its adept handling of binary classification tasks, provision of
probabilistic outputs, and maintenance of computational efficiency establish it as a valuable component for
decoding and interpreting human gestures in real-time applications.

Ridge Classifier:

The Ridge Classifier, a regularization-based classification algorithm, has a crucial role within the pipeline
of a real-time body language decoder, contributing to the accurate and efficient interpretation of human
gestures. Tailored for binary and multiclass classification tasks, the Ridge Classifier stands out as a
valuable component in scenarios where the goal is to categorize body language expressions into predefined
classes.

In the specific context of a real-time body language decoder, the seamless integration of the Ridge
Classifier is achieved through the scikit-learn library. This integration is part of a comprehensive machine
learning framework that encompasses preprocessing steps and various classification algorithms.

The Ridge Classifier plays a crucial role in the real-time body language decoding pipeline, providing
stability through regularization, effective handling of multicollinearity, and probabilistic outputs for
nuanced interpretation. Its integration contributes to the accuracy and efficiency of classifying human
gestures in real-time applications.

RandomForest Classifier:

The Random Forest Classifier, an influential ensemble learning algorithm, plays a critical role in the
operational framework of a real-time body language decoder, significantly enhancing the precision and
resilience of gesture classification. Tailored for tasks encompassing both classification and regression, the
Random Forest Classifier emerges as a crucial element in scenarios where the objective is to categorize body
language expressions into predefined groups.
In the real-time body language decoder, the incorporation of the Random Forest Classifier seamlessly occurs
through the utilization of the scikit-learn library. This assimilation is an integral part of a holistic machine
learning framework that includes preprocessing steps and various classification algorithms.

The Random Forest Classifier assumes a fundamental role in the real-time body language decoding pipeline,
harnessing ensemble learning for heightened accuracy, offering insights into feature importance, and
demonstrating robustness to noisy and non-linear data. Its integration significantly contributes to the model's
proficiency in accurately classifying diverse human gestures in real-time applications.

Gradient Boosting Classifier:

The Gradient Boosting Classifier, a sophisticated machine learning algorithm, stands as a valuable
component in the pipeline of a real-time body language decoder, contributing to the model's accuracy and
predictive capabilities. Particularly effective for classification tasks, Gradient Boosting Classifier excels in
scenarios where precise classification of body language expressions into predefined categories is paramount.

Within the real-time body language decoder pipeline, the integration of the Gradient Boosting Classifier is
seamlessly achieved using the scikit-learn library. The Gradient Boosting Classifier plays an essential role in
the real-time body language decoding pipeline by offering enhanced accuracy through sequential learning,
robustness to overfitting, insights into feature importance, the ability to handle non-linearity, and real-time
efficiency. Its integration contributes significantly to the model's capability to accurately classify and
interpret diverse human gestures in real-time applications.

The Ridge Classifier ('rc') is specifically used for prediction on the test data, showcasing flexibility in model
selection. The overall structure promotes scalability and experimentation with various algorithms. The
inclusion of StandardScaler reflects a commitment to proper preprocessing, ensuring consistent and
improved model performance. The concise code encourages clarity and ease of maintenance, making it
suitable for iterative model development and testing. Established scikit-learn conventions align with best
practices in machine learning development, enhancing code reliability. The clear separation of algorithmic
components into pipelines fosters code reusability and streamlines the addition of new classifiers. This
Portion of implementation showcases a structured and principled approach to building, training, and
utilizing machine learning models for classification tasks.

Evaluate and serialize the model:

This section calculates and prints the accuracy scores for each model in the 'fit_models'
dictionary using the scikit-learn accuracy_score metric. The code iterates through each model, predicting on
the test set and evaluating its accuracy. The Random Forest model is then used to make predictions on the
test set, and the true target values are printed. Subsequently, the Random Forest model is serialized using
pickle and saved to a file named 'body_language.pkl'. It demonstrates an evaluation step for multiple models
and the use of accuracy as a performance metric. The selective focus on the Random Forest model suggests
a specific interest in its performance. The accuracy scores provide a quantitative measure of model
effectiveness on the test data. The use of pickle facilitates model persistence, enabling easy reuse or
deployment of the trained Random Forest model.

The code assumes familiarity with the concept of accuracy in classification tasks and the serialization process
using pickle. The serialization code segment is succinct and efficient for saving the model to a file. The file
name 'body_language.pkl' implies potential relevance to a body language classification task. The absence of
comments may reduce verbosity but could impact code understanding, especially for users less familiar with
the specific context or requirements. Overall, the code serves the purpose of evaluating and persisting a
specific model efficiently, with an emphasis on accuracy as a performance metric.
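
A compact sketch of this evaluation and serialization step, reusing the fit_models dictionary and test split from the earlier sketches:

import pickle
from sklearn.metrics import accuracy_score

# Accuracy of each trained pipeline on the held-out test set.
for name, model in fit_models.items():
    yhat = model.predict(X_test)
    print(name, accuracy_score(y_test, yhat))

# Serialize the chosen model (here the random forest pipeline) for later reuse.
with open('body_language.pkl', 'wb') as f:
    pickle.dump(fit_models['rf'], f)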
Make Detections with Model:

The system segment "Make Detections with Model" showcases the operationalization of a real-time body
language decoder, incorporating key technical concepts for effective gesture recognition. The script begins
by loading a pre-trained machine learning model from a saved pickle file, initializing the necessary libraries,
such as MediaPipe (mp_holistic) and OpenCV (cv2), and opening a video capture stream from the webcam
(cv2.VideoCapture(0)).

Within a continuous loop, the script processes each frame retrieved from the webcam feed. It employs
MediaPipe's holistic model to detect facial landmarks, hand movements, and full-body pose simultaneously.
The obtained results are then visualized on the frame using OpenCV, with distinct drawings for face
landmarks, right and left hand landmarks, and overall body pose. A crucial aspect of the code involves the
extraction and concatenation of landmark coordinates from both the pose and face landmarks. These
coordinates are organized into a row and used to make predictions with the loaded machine learning model.
The model's predictions, representing the detected body language class and corresponding probabilities, are
then displayed in real-time on the video feed.

The script further enhances user interaction by overlaying informative boxes on the video feed. These
include a rectangle highlighting the detected body language class near the left ear coordinates, a status box
displaying the class name, and another box indicating the probability of the detected class.

The utilization of coordinates and real-time predictions ensures immediate feedback on the recognized body
language, creating an interactive and informative user experience. In essence, this program seamlessly
integrates machine learning, pose estimation, and real-time visualization to interpret and respond to human
body language, making it a powerful tool for applications such as virtual environments, gaming, and
interactive simulations.
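
The prediction stage can be sketched as follows, reusing the landmark-flattening logic from the export step; it assumes the webcam loop, `results`, `frame`, and imports shown earlier, the overlay details are simplified, and in practice the live DataFrame columns should match the training feature names.

import pickle
import pandas as pd

with open('body_language.pkl', 'rb') as f:
    model = pickle.load(f)  # previously serialized classifier

# Inside the webcam loop, after holistic.process() has produced `results`:
try:
    pose = results.pose_landmarks.landmark
    face = results.face_landmarks.landmark
    pose_row = list(np.array([[lm.x, lm.y, lm.z, lm.visibility] for lm in pose]).flatten())
    face_row = list(np.array([[lm.x, lm.y, lm.z, lm.visibility] for lm in face]).flatten())
    X_live = pd.DataFrame([pose_row + face_row])

    body_language_class = model.predict(X_live)[0]        # predicted class label
    body_language_prob = model.predict_proba(X_live)[0]   # class probabilities

    # Overlay the prediction and its confidence on the video frame.
    cv2.putText(frame, f'{body_language_class} {body_language_prob.max():.2f}',
                (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)
except AttributeError:
    pass  # some landmarks were not detected in this frame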

VII. IMPLEMENTATION

MediaPipe: MediaPipe is a library developed by Google that provides pre-trained models for various
computer vision tasks, including hand tracking and pose estimation. In this system, MediaPipe is used for
extracting information about the body language from images or videos.

OpenCV: A well-known computer vision library, OpenCV offers capabilities for processing images and videos. It is used in conjunction with MediaPipe for tasks such as reading and preprocessing images or videos, as well as other computer vision operations.

Seaborn: Built on Matplotlib, Seaborn is a library for statistical data visualisation. It is often used to create visually appealing and informative statistical graphics. Seaborn is used here to visualize the data and analysis results in a more intuitive way.

Pandas: Pandas is an effective library for data manipulation and analysis. It is widely used for handling structured data, such as CSV files or dataframes. Pandas is used for organizing and processing the data.

NumPy: NumPy is a crucial package for numerical computing. It supports large, multi-dimensional arrays and matrices, along with the algebraic operations that can be applied to them, and can be used for a variety of numerical operations and calculations.

Scikit-learn: Scikit-learn is a machine learning library with user-friendly tools for modeling and data evaluation. It is applied in the system to train machine learning models that use the extracted features to decode body language.

Matplotlib: A flexible Python plotting package, Matplotlib enables programmers to produce static, animated, and interactive visuals. Matplotlib can be used together with Seaborn, which builds on Matplotlib and provides high-level interfaces for drawing attractive and informative plots.

The combination of these libraries allows for a comprehensive approach to analyzing body language using computer vision, data processing, and machine learning techniques. Each library brings its own set of functionalities to the table, contributing to a holistic solution for the task at hand.

The drawing_utils is a module within MediaPipe that facilitates the visualization of results generated by the
models. It helps in drawing the output on the input image or video stream, making it easier for developers to
interpret and understand the detected features or keypoints. For example, if you are using a pose detection
model, drawing_utils can help draw lines connecting keypoints to represent the detected pose on the image. MediaPipe's holistic approach represents a comprehensive framework crafted by Google, aimed at
facilitating the creation of machine learning-driven solutions tailored for diverse multimedia processing
tasks. These encompass real-time video analysis, image processing, and audio manipulation. The framework
furnishes developers with a suite of pre-configured modules and pipelines, streamlining the development
process of sophisticated multimedia applications without necessitating the creation of algorithms from the
ground up.

Install and import dependencies:

The attempt to install necessary Python packages via the pip package manager and subsequently import the
MediaPipe and OpenCV libraries in the Python script reflects a proactive approach towards leveraging
robust toolsets for multimedia processing tasks. However, a discrepancy surfaces in the import statement
concerning MediaPipe. The corrected version of the script encapsulates the essence of precision and
attention to detail requisite in programming endeavors.

Executing the provided commands initiates the installation of essential libraries, including MediaPipe,
OpenCV, Pandas, and scikit-learn, essential components for facilitating diverse multimedia processing
tasks within the Python environment. The utilization of pip underscores the seamless integration of
external dependencies, thereby fortifying the script's ecosystem with indispensable functionalities.

Importing MediaPipe as 'mp' and OpenCV as 'cv2' signifies adherence to standardized naming
conventions, promoting code readability and maintainability. Furthermore, the assignment of
'mp.solutions.drawing_utils' to 'mp_drawing' and 'mp.solutions.holistic' to 'mp_holistic' optimally
encapsulates functionality, enhancing the script's modularity and scalability.

The conscientious directive to execute the commands within the designated Python environment or Jupyter
Notebook underscores a prudent approach towards ensuring script execution within a controlled and
conducive computational environment. By adhering to best practices in package management and library
importation, the script cultivates a foundation conducive to seamless development and deployment of
multimedia applications, characterized by efficiency, reliability, and adaptability.
Make Some Detections

In this implementation, a real-time body language decoder is orchestrated using OpenCV and the
MediaPipe library, focusing on holistic pose estimation. The code initiates a video capture stream, utilizing
the cv2.VideoCapture(0) function, and integrates the MediaPipe Holistic model to concurrently detect
facial landmarks, hand movements, and full-body pose. A confidence threshold is applied to ensure robust
detection results. Throughout the execution loop, each frame from the video feed is recolored to RGB
format and processed by the Holistic model. The detected landmarks, including facial, hand, and pose
landmarks, are meticulously visualized on the frame using OpenCV's drawing functions, creating an
overlay that enhances the interpretability of the model's output. Technical terms such as
'min_detection_confidence' and 'min_tracking_confidence' reflect the minimum confidence levels required
for initiating detection and tracking processes. The 'mp_drawing' module from MediaPipe is employed for
drawing landmarks with specified visual specifications, like color, thickness, and circle radius. The
'cv2.imshow' function facilitates the real-time display of the modified frames, enabling continuous
monitoring of the decoded body language.

The code structure encapsulates essential computer vision and machine learning concepts, ensuring
seamless integration of pose estimation and visualization. The closing segments involve releasing the video
capture resources ('cap.release()') and closing display windows ('cv2.destroyAllWindows()'). The reference
to 'results.face_landmarks.landmark[0].visibility' implies the extraction of visibility information for the first
facial landmark, offering insights into the reliability of the detected features. This systematic and
technically grounded approach showcases the utilization of advanced computer vision techniques for real-
time body language decoding.
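
For illustration, the confidence thresholds and drawing specifications mentioned above might look like the following sketch; the colors and radii are arbitrary, and the face connection constant name varies by MediaPipe version.

holistic = mp_holistic.Holistic(min_detection_confidence=0.5,   # threshold to start detection
                                min_tracking_confidence=0.5)    # threshold to keep tracking

face_spec = mp_drawing.DrawingSpec(color=(80, 110, 10), thickness=1, circle_radius=1)
conn_spec = mp_drawing.DrawingSpec(color=(80, 255, 121), thickness=1, circle_radius=1)

# Draw face landmarks with the custom specs on the current frame.
mp_drawing.draw_landmarks(frame, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION,
                          landmark_drawing_spec=face_spec,
                          connection_drawing_spec=conn_spec)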

Capture Landmarks & Export to CSV

The model exemplifies an advanced body language decoding system, seamlessly integrating OpenCV,
MediaPipe, and CSV data handling to export landmark coordinates. The script commences by determining
the total number of coordinates, incorporating both pose and face landmarks, necessary for capturing a
comprehensive set of information. Subsequently, it initializes a CSV file ('coords.csv') to store these
landmark coordinates, establishing a structured format that includes class labels, x, y, z coordinates, and
visibility attributes.

During the real-time execution loop, the code employs the MediaPipe Holistic model to capture facial,
hand, and full-body pose landmarks. These landmarks are visualized on the video feed using OpenCV,
enhancing interpretability. Notably, the script dynamically extracts and concatenates pose and face
landmarks into a coherent row, encompassing spatial coordinates and visibility information. The export
process involves appending the class label to the row and writing the entire dataset to the CSV file. This
meticulous handling of data ensures a robust foundation for subsequent model training and analysis.
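
A sketch of this export logic is given below, continuing from the Holistic loop above. The landmark counts (33 pose and 468 face landmarks) and the 'victory' label are illustrative assumptions.

import csv
import numpy as np

# Build the CSV header: one class column plus x, y, z, visibility per landmark
num_coords = 33 + 468        # assumed: 33 pose landmarks + 468 face landmarks
landmarks = ['class']
for val in range(1, num_coords + 1):
    landmarks += ['x{}'.format(val), 'y{}'.format(val),
                  'z{}'.format(val), 'v{}'.format(val)]

with open('coords.csv', mode='w', newline='') as f:
    csv.writer(f).writerow(landmarks)

class_name = 'victory'       # label for the gesture currently being recorded

# Inside the capture loop, after results = holistic.process(image):
try:
    pose_row = list(np.array(
        [[lm.x, lm.y, lm.z, lm.visibility]
         for lm in results.pose_landmarks.landmark]).flatten())
    face_row = list(np.array(
        [[lm.x, lm.y, lm.z, lm.visibility]
         for lm in results.face_landmarks.landmark]).flatten())

    row = pose_row + face_row
    row.insert(0, class_name)        # prepend the class label

    with open('coords.csv', mode='a', newline='') as f:
        csv.writer(f).writerow(row)
except AttributeError:
    # Skip frames where the pose or the face was not detected
    pass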

Furthermore, the code exhibits robust exception handling, preventing disruptions in case of unforeseen
errors during landmark extraction. The combination of real-time visualization, dynamic data export, and
comprehensive exception handling underscores the technical sophistication of this body language decoding
system. It serves as a foundational script for capturing, processing, and exporting intricate landmark
information, paving the way for further exploration and analysis in the realm of computer vision and
machine learning.

Train Custom Model Using scikit-learn
This section walks through the essential phases involved in training a machine learning classification
model that recognizes body language from landmark coordinates. The initial stage loads the landmark data
from the CSV file produced in the previous step. The dataset is then prepared for training: using the pandas
library, the code inspects the data's structure by viewing the first and last rows and by filtering the instances
belonging to specific classes, such as the 'victory' class. Finally, the coordinates are separated from the
class labels and split into training and test subsets, as illustrated below.
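
A brief sketch of this preparation step; the 70/30 split ratio and the random seed are assumptions rather than values prescribed by the system.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('coords.csv')

df.head()                        # inspect the first rows
df.tail()                        # inspect the last rows
df[df['class'] == 'victory']     # filter the instances of a single class

X = df.drop('class', axis=1)     # features: landmark coordinates
y = df['class']                  # target: body-language label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234)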

The subsequent segment navigates through a model training loop, systematically iterating over each
algorithm. This iterative process involves fitting each algorithm's pipeline to the training data and adeptly
storing the resultant trained models in a dictionary for future reference. Evaluation metrics, specifically
accuracy scores, are meticulously computed for each model on the test set. This robust evaluation
framework affords a comprehensive assessment of the models' performance. Noteworthy among the
ensemble, the Random Forest Classifier emerges as a prominent selection, showcasing a commendable
accuracy in proficiently predicting various body language classes.
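
One plausible realization of this loop, continuing from the split above, is sketched here. The exact set of candidate algorithms is an assumption; four common scikit-learn pipelines are shown.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

pipelines = {
    'lr': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'rc': make_pipeline(StandardScaler(), RidgeClassifier()),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier()),
    'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier()),
}

# Fit every pipeline and keep the trained models for later comparison
fit_models = {}
for name, pipeline in pipelines.items():
    fit_models[name] = pipeline.fit(X_train, y_train)

# Evaluate each trained model on the held-out test set
for name, model in fit_models.items():
    yhat = model.predict(X_test)
    print(name, accuracy_score(y_test, yhat))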

In anticipation of future applications, the code concludes with the crucial step of model serialization.
The Random Forest Classifier, discerned as the most effective model through the evaluation process,
undergoes serialization and is persistently saved as 'body_language.pkl'. This serialized model file stands
as a compact, transportable, and efficient representation of the rigorously trained classifier. It stands ready
for seamless deployment in real-world scenarios necessitating the instantaneous and precise recognition of
body language cues in real-time, marking a significant stride in the realm of human-computer interaction
and behavior analysis.
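
A minimal serialization sketch, assuming the random-forest pipeline is stored under the key 'rf' in the fit_models dictionary above:

import pickle

# Persist the best-performing model for real-time inference later on
with open('body_language.pkl', 'wb') as f:
    pickle.dump(fit_models['rf'], f)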

Make Detections with Model


The segment exemplifies a sophisticated real-time body language decoder leveraging MediaPipe, OpenCV,
and a pre-trained machine learning model stored in 'body_language.pkl'. The implementation seamlessly
integrates holistic pose estimation and real-time visualization, delivering an intricate analysis of facial,
hand, and full-body gestures. The holistic model is initialized with minimum detection and tracking
confidences, ensuring robust performance. The loop efficiently processes webcam frames, executing the
entire pipeline for pose estimation.

The rendered output showcases these landmarks, differentiated by distinct colors, providing a
comprehensive view of detected gestures. Notably, facial landmarks, hand movements, and body pose are
discerned and visualized in real-time, contributing to a holistic understanding of body language.
Furthermore, the implementation integrates machine learning to classify body language based on landmark
coordinates. The 'body_language.pkl' model is deserialized, and for each frame, relevant landmarks are
extracted, forming a feature vector. This vector is then fed into the model for real-time predictions,
demonstrating the system's ability to interpret and classify body language on-the-fly. The code incorporates
dynamic feedback by overlaying class and probability information on the webcam feed, enhancing the
interpretability of the system's predictions.
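
A sketch of the inference step inside the webcam loop. It assumes that 'row' holds the concatenated pose and face coordinates extracted exactly as during data collection (without the class label) and that 'image' is the current frame; the text positions and colors passed to cv2.putText are illustrative.

import pickle
import cv2
import numpy as np
import pandas as pd

with open('body_language.pkl', 'rb') as f:
    model = pickle.load(f)

# Inside the capture loop, after the landmarks have been flattened into `row`:
X = pd.DataFrame([row])   # newer scikit-learn versions may warn about missing feature names
body_language_class = model.predict(X)[0]
body_language_prob = model.predict_proba(X)[0]

# Overlay the predicted class and its highest probability on the frame
cv2.putText(image, str(body_language_class), (10, 30),
            cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)
cv2.putText(image, str(round(float(np.max(body_language_prob)), 2)), (10, 70),
            cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2, cv2.LINE_AA)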

Output:
Fig 7.1 Displaying the class “thumbs-up” and landmarks in coords

Fig 7.2 Displaying the class “welcome” and landmarks in coords


Fig 7.3 Displaying the test results with class

Fig 7.4 Comparing the accuracy of each pipeline


Fig 7.5 Array of each class detected during real-time test

Fig 7.6 Detecting Motion Real-Time: "Welcome"


Fig 7.7 Detecting Motion Real-Time: "Thumbs up"

VIII. CONCLUSION

The real-time motion insight system utilizing MediaPipe has demonstrated a commendable integration of
pose estimation, machine learning, and real-time visualization, contributing to the field of human-computer
interaction and behavior analysis. The holistic approach employed in this project, encompassing face, hand,
and full-body pose estimation, has exhibited the capability to discern intricate gestures and expressions in
real time. The system leveraged the MediaPipe library and showcased a robust foundation for capturing and
interpreting diverse human movements.

The integration of OpenCV has played a central role in seamless webcam feed processing, ensuring
efficient frame handling for instantaneous analysis. The modular structure of the code has enhanced
readability and maintainability, delineating specific sections for landmark extraction, machine learning
model inference, and real-time visualization. This structure not only facilitates code understanding but also
opens avenues for future extensions and enhancements.

The real-time motion insight project has laid a strong foundation for the intersection of pose estimation and
machine learning in decoding body language. Future enhancements can build upon this foundation,
exploring avenues for increased accuracy, broader applicability, and a more user-centric experience. The
system is a testament to the synergy of computer vision and machine learning, opening doors for
advancements in human-computer interaction and behavior analysis.