
Unified Model for Multiple Human-Centric Vision Tasks

Vaibhav Sharma
Enrollment No: T23156

Indian Institute of Technology Mandi

March 4, 2024

Guide: Dr. Parimala Kancharla


Introduction
This presentation summarizes my study, covering
▶ learning and understanding the idea of developing a unified
model for multiple tasks specific to human perception,
▶ surveying existing papers on unified models for vision tasks,
▶ reviewing relevant papers and understanding existing
architectures.
Before diving into the complexities, let us explore the foundational
concepts and terms I have encountered in my first two months of
study. We begin by addressing the essential questions:
▶ What are these tasks? Unveiling the ultimate goals we aim to
achieve.
▶ What is a model? Defining the essence of a deep learning
model.
▶ How do we develop such a model? Investigating the recent
developments that have shaped the current landscape.
Figure: Different human-related vision tasks performed on an image
(Source: Internet)
Neural Network as Function Approximators


Figure: f(X) = f3(f2(f1(X)))

Figure: Y = f(X; W)

Figure: A layer as a set of functions
Figure: A single perceptron


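To make this view concrete, here is a minimal sketch of the
function-composition idea in PyTorch; the layer sizes are
illustrative assumptions, not taken from the slides.

    import torch
    import torch.nn as nn

    # Each layer f_i is a function; the network is their composition,
    # f(X) = f3(f2(f1(X))), with learnable weights W.
    f1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
    f2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
    f3 = nn.Linear(32, 1)

    X = torch.randn(4, 16)    # a batch of 4 inputs
    Y = f3(f2(f1(X)))         # Y = f(X; W)
    print(Y.shape)            # torch.Size([4, 1])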
Convolutional Neural Network

Figure: Convolution operation


Figure: Representation of a fully connected layer

Figure: Convolution with multiple channels
Figure: Convolution with multiple filters

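As a small illustration of the operations above, here is a sketch of
a convolution with multiple input channels and multiple filters; all
shapes are arbitrary choices made for the example.

    import torch
    import torch.nn as nn

    # 3 input channels (e.g. RGB) convolved with 8 filters,
    # producing 8 output channels.
    conv = nn.Conv2d(in_channels=3, out_channels=8,
                     kernel_size=3, padding=1)

    x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
    y = conv(x)
    print(y.shape)                  # torch.Size([1, 8, 32, 32])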
Unified model for multiple tasks

▶ A single model handles multiple tasks with less compute and
memory than running several separate models.
▶ Based on the intuition that different human-related tasks
share common semantic information about the human body.
▶ Uses multitask training, in which the model weights are
shared among the tasks (a minimal sketch follows below).
▶ A model trained on multiple tasks may generalize better and
be less prone to overfitting.

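A minimal sketch of this weight-sharing idea (hard parameter
sharing): a shared backbone with one small head per task. All names,
sizes, and tasks here are illustrative assumptions, not taken from
any of the surveyed papers.

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        """Shared backbone + one lightweight head per task."""
        def __init__(self, feat_dim=64):
            super().__init__()
            # These weights are shared among all tasks.
            self.backbone = nn.Sequential(
                nn.Linear(128, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            )
            # Task-specific parameters are a small fraction of the total.
            self.heads = nn.ModuleDict({
                "pose": nn.Linear(feat_dim, 17 * 2),  # e.g. 17 keypoints
                "attributes": nn.Linear(feat_dim, 40),
            })

        def forward(self, x, task):
            return self.heads[task](self.backbone(x))

    model = MultiTaskModel()
    x = torch.randn(2, 128)
    print(model(x, "pose").shape)        # torch.Size([2, 34])
    print(model(x, "attributes").shape)  # torch.Size([2, 40])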
Paper 1

About the paper

This 2023 paper introduces a new unified model for human-related
tasks, called UniHCP: Unified Model for Human-Centric Perceptions.
UniHCP considers five tasks:
1. Pose estimation
2. Semantic part segmentation / human parsing
3. Pedestrian detection
4. Person re-identification
5. Person attribute recognition
The proposed model has some key features. It unifies five distinct
human-centric tasks, handling them simultaneously, and it is built
upon a shared encoder-decoder Transformer network in which the
encoder is a plain Vision Transformer.

Paper 2

Transformer

Introduced by Google in 2017, the Transformer revolutionized
sequence-to-sequence tasks, surpassing RNNs. The model relies on
attention mechanisms to capture dependencies in input sequences
(a minimal attention sketch follows after this list).
▶ Encoder-Decoder Framework: Transformers consist of an
encoder and a decoder. The encoder encodes the input
sequence, while the decoder decodes it to produce the output.
▶ Encoder: The encoder stack comprises six identical blocks.
Each block includes multi-head self-attention, a fully
connected feed-forward network, and normalization steps.
▶ Decoder: Similar to the encoder, the decoder comprises six
blocks with (masked) self-attention, fully connected networks,
and cross-attention to capture information from the entire
encoder output. Add & Norm layers follow each sublayer.

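At the core of every encoder and decoder block is scaled dot-product
attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A
minimal sketch, with illustrative shapes:

    import math
    import torch

    def attention(Q, K, V):
        """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V

    Q = torch.randn(2, 10, 64)   # (batch, query positions, d_k)
    K = torch.randn(2, 12, 64)   # (batch, key positions, d_k)
    V = torch.randn(2, 12, 64)
    out = attention(Q, K, V)
    print(out.shape)             # torch.Size([2, 10, 64])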
The Transformer architecture

Figure: The Transformer architecture (Source: Internet)


Paper 3

Vision Transformer
▶ uses the Transformer encoder and feeds the image as a
sequence of 16 × 16 patches, ordered like words in NLP.
▶ works on the concept of pre-training on a large dataset and
then fine-tuning for a specific task.
▶ outperforms convolution-based networks like ResNet when
pre-trained on a large dataset.
▶ does not have the inductive biases of locality and the 2D
neighbourhood structure of pixels that a CNN has.
▶ the positional encodings do not carry information about the
2D location of the patches.

Figure: Vision Transformer (ViT) architecture as depicted in the
paper.
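To illustrate the patch-sequence idea, here is a sketch of the ViT
patchify-and-embed step; the 16 × 16 patch size and 768-d embedding
follow the ViT-Base configuration, but the code is an illustrative
assumption, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Patchify via a strided convolution: kernel = stride = patch size.
    patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

    img = torch.randn(1, 3, 224, 224)            # one RGB image
    patches = patch_embed(img)                   # (1, 768, 14, 14)
    tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768)
    print(tokens.shape)   # 196 patch tokens, each a 768-d "word"

    # Learnable positional embeddings are added so the encoder can
    # distinguish positions (they carry no built-in 2D structure).
    pos = nn.Parameter(torch.zeros(1, 196, 768))
    tokens = tokens + pos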
UniHCP Architecture
UniHCP is composed of the following four components (a simplified
structural sketch follows below):
1. Encoder: a plain Vision Transformer, shared across all the
tasks.
2. Decoder: a Transformer decoder with cross-attention placed
before the self-attention layer in each decoder block.
3. Task-specific Queries: task-specific, learnable embeddings
that act as input to the decoder. The parameters of all other
modules are shared among tasks, so 99.97% of the weights are
shared.
4. Task-guided Interpreter: the shared output head for all
tasks. It produces four kinds of output: feature representations,
a local probability map (pixel classification), a global
probability (image classification), and bounding-box coordinates
as used in object detection.

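A highly simplified structural sketch of these four components. It
is illustrative only: PyTorch's stock decoder layer applies
self-attention before cross-attention, the reverse of UniHCP's
ordering, and the real interpreter has multiple output units.

    import torch
    import torch.nn as nn

    class TinyUniHCP(nn.Module):
        """Toy analogue of UniHCP's structure: shared encoder/decoder,
        per-task learnable queries, one shared output head."""
        def __init__(self, d=256, n_queries=8, tasks=("pose", "parsing")):
            super().__init__()
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), 2)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), 2)
            # Only these queries are task-specific; all else is shared.
            self.queries = nn.ParameterDict(
                {t: nn.Parameter(torch.randn(1, n_queries, d)) for t in tasks})
            self.head = nn.Linear(d, d)   # stand-in for the interpreter

        def forward(self, tokens, task):
            mem = self.encoder(tokens)                      # shared
            q = self.queries[task].expand(tokens.size(0), -1, -1)
            return self.head(self.decoder(q, mem))          # shared

    model = TinyUniHCP()
    tokens = torch.randn(2, 196, 256)    # e.g. ViT patch tokens
    print(model(tokens, "pose").shape)   # torch.Size([2, 8, 256])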
UniHCP Architecture

Figure: The UniHCP architecture (Source: UniHCP original paper)

Paper 4

Paper 5

Recent Advances in Unified Models
▶ Two notable papers released in 2023 adopt an approach
similar to UniHCP's in the field of unified models.
▶ The first paper addresses six tasks and introduces the
Projector-Assisted Hierarchical Pretraining method (PATH),
utilizing a hierarchical architecture in contrast to UniHCP's
plain-transformer encoder-decoder approach.
▶ The second paper, released in December 2023, introduces
HQNet, a flexible model that learns a single shared "human
query." It consists of four key components: a backbone, a
Transformer encoder, a Transformer decoder, and task-specific
heads.
▶ Additionally, this paper contributes a dataset for human
perception tasks, named COCO-UniHuman, created by adding
annotations to the COCO dataset.

Conclusion and Future work
The studies I have done not only demonstrate the feasibility of a
unified model for multiple related tasks but also showcase its
better performance compared to existing models, both with and
without fine-tuning for specific tasks.

The model learns shared semantic information, reflecting the
fundamental structure of the human body. This shared knowledge
not only makes the model more efficient but also enhances
generalization, making it effective on tasks with limited training
data while requiring minimal fine-tuning or transfer learning.

Continuing my work on this problem, I am focusing on its
implementation, particularly the UniHCP PyTorch implementation
released by the authors of the paper. This involves refactoring the
code to remove the dependence on distributed training and
extracting inferences from the pre-trained model.
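
One concrete step in removing distributed training is loading a
checkpoint saved from a DistributedDataParallel model into a plain
model, since DDP prefixes every parameter name with "module.". A
hedged sketch of that cleanup; the file name, checkpoint layout, and
build_model() are placeholders, not details from the UniHCP
repository.

    import torch

    # A checkpoint saved from a DistributedDataParallel model stores
    # keys like "module.encoder....": strip the prefix so a plain
    # (non-DDP) model can load it.
    ckpt = torch.load("unihcp_checkpoint.pth", map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    state = {k.removeprefix("module."): v for k, v in state.items()}

    model = build_model()   # placeholder for the real model constructor
    model.load_state_dict(state, strict=False)
    model.eval()            # inference only; no distributed setup needed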

Thank you
t23156@students.iitmandi.ac.in

