
Unified Model for Multiple Human-Centric Vision Tasks

Vaibhav Sharma
Enrollment No: T23156

Indian Institute of Technology Mandi

March 4, 2024

Guide: Dr. Parimala Kancharla


Introduction
This presentation summarizes my study, covering
▶ learning and understanding the idea of developing a unified
model for multiple tasks specific to human perception,
▶ surveying existing papers on unified models for vision tasks,
▶ reviewing relevant papers and understanding existing
architectures.
Before diving into the complexities, let us explore the foundational
concepts and terms I have encountered in my first two months of
study. We begin by addressing the essential questions:
▶ What are these tasks? Unveiling the ultimate goals we aim to
achieve.
▶ What is a model? Defining the essence of a deep learning
model.
▶ How do we develop such a model? Investigating the recent
developments that have shaped the current landscape.
Figure: Different human-related vision tasks performed on an image
(Source: Internet)
Neural Network as Function Approximators


Figure: f(X) = f3(f2(f1(X)))

Figure: Y = f(X; W)

Figure: A layer as a set of functions
Figure: A single perceptron


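To make this view concrete, here is a minimal sketch of the
function-composition idea in PyTorch; the layer sizes are
illustrative assumptions, not taken from the slides.

    import torch
    import torch.nn as nn

    # Each layer f_i is a function; the network is their composition,
    # f(X) = f3(f2(f1(X))), with learnable weights W.
    f1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
    f2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
    f3 = nn.Linear(32, 1)

    X = torch.randn(4, 16)    # a batch of 4 inputs
    Y = f3(f2(f1(X)))         # Y = f(X; W)
    print(Y.shape)            # torch.Size([4, 1])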
Convolutional Neural Network

Figure: Convolution operation


Figure: Representation of a fully connected layer

Figure: Convolution with multiple channels
Figure: Convolution with multiple filters

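As a small illustration of the operations above, here is a sketch of
a convolution with multiple input channels and multiple filters; all
shapes are arbitrary choices made for the example.

    import torch
    import torch.nn as nn

    # 3 input channels (e.g. RGB) convolved with 8 filters,
    # producing 8 output channels.
    conv = nn.Conv2d(in_channels=3, out_channels=8,
                     kernel_size=3, padding=1)

    x = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
    y = conv(x)
    print(y.shape)                  # torch.Size([1, 8, 32, 32])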
Unified model for multiple tasks

▶ A single model handles multiple tasks with less compute and
memory than running several separate models.
▶ Based on the intuition that different human-related tasks
share common semantic information about the human body.
▶ Uses multitask training, in which the model weights are
shared among the tasks (a minimal sketch follows below).
▶ A model trained on multiple tasks may generalize better and
be less prone to overfitting.

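A minimal sketch of this weight-sharing idea (hard parameter
sharing): a shared backbone with one small head per task. All names,
sizes, and tasks here are illustrative assumptions, not taken from
any of the surveyed papers.

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        """Shared backbone + one lightweight head per task."""
        def __init__(self, feat_dim=64):
            super().__init__()
            # These weights are shared among all tasks.
            self.backbone = nn.Sequential(
                nn.Linear(128, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            )
            # Task-specific parameters are a small fraction of the total.
            self.heads = nn.ModuleDict({
                "pose": nn.Linear(feat_dim, 17 * 2),  # e.g. 17 keypoints
                "attributes": nn.Linear(feat_dim, 40),
            })

        def forward(self, x, task):
            return self.heads[task](self.backbone(x))

    model = MultiTaskModel()
    x = torch.randn(2, 128)
    print(model(x, "pose").shape)        # torch.Size([2, 34])
    print(model(x, "attributes").shape)  # torch.Size([2, 40])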
Paper 1

About the paper

This 2023 paper introduces a new unified model for human-related
tasks, called UniHCP: Unified Model for Human-Centric Perceptions.
UniHCP considers five tasks:
1. Pose estimation
2. Semantic part segmentation / human parsing
3. Pedestrian detection
4. Person re-identification
5. Person attribute recognition
The proposed model has some key features. It unifies five distinct
human-centric tasks, handling them simultaneously, and it is built
upon a shared encoder-decoder Transformer network in which the
encoder is a plain Vision Transformer.

Paper 2

Transformer

Introduced by Google in 2017, the Transformer revolutionized
sequence-to-sequence tasks, surpassing RNNs. The model relies on
attention mechanisms to capture dependencies in input sequences
(a minimal attention sketch follows after this list).
▶ Encoder-Decoder Framework: Transformers consist of an
encoder and a decoder. The encoder encodes the input
sequence, while the decoder decodes it to produce the output.
▶ Encoder: The encoder stack comprises six identical blocks.
Each block includes multi-head self-attention, a fully
connected feed-forward network, and normalization steps.
▶ Decoder: Similar to the encoder, the decoder comprises six
blocks with (masked) self-attention, fully connected networks,
and cross-attention to capture information from the entire
encoder output. Add & Norm layers follow each sublayer.

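At the core of every encoder and decoder block is scaled dot-product
attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A
minimal sketch, with illustrative shapes:

    import math
    import torch

    def attention(Q, K, V):
        """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V

    Q = torch.randn(2, 10, 64)   # (batch, query positions, d_k)
    K = torch.randn(2, 12, 64)   # (batch, key positions, d_k)
    V = torch.randn(2, 12, 64)
    out = attention(Q, K, V)
    print(out.shape)             # torch.Size([2, 10, 64])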
The Transformer architecture

Figure: The Transformer architecture (Source: Internet)


Paper 3

Vision Transformer
▶ uses the Transformer encoder and feeds the image as a
sequence of 16 × 16 patches, ordered like words in NLP.
▶ works on the concept of pre-training on a large dataset and
then fine-tuning for a specific task.
▶ outperforms convolution-based networks like ResNet when
pre-trained on a large dataset.
▶ does not have the inductive biases of locality and the 2D
neighbourhood structure of pixels that a CNN has.
▶ the positional encodings do not carry information about the
2D location of the patches.

Figure: Vision Transformer (ViT) architecture as depicted in the
paper.
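To illustrate the patch-sequence idea, here is a sketch of the ViT
patchify-and-embed step; the 16 × 16 patch size and 768-d embedding
follow the ViT-Base configuration, but the code is an illustrative
assumption, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Patchify via a strided convolution: kernel = stride = patch size.
    patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

    img = torch.randn(1, 3, 224, 224)            # one RGB image
    patches = patch_embed(img)                   # (1, 768, 14, 14)
    tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768)
    print(tokens.shape)   # 196 patch tokens, each a 768-d "word"

    # Learnable positional embeddings are added so the encoder can
    # distinguish positions (they carry no built-in 2D structure).
    pos = nn.Parameter(torch.zeros(1, 196, 768))
    tokens = tokens + pos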
UniHCP Architecture
UniHCP is composed of the following four components (a simplified
structural sketch follows below):
1. Encoder: a plain Vision Transformer, shared across all the
tasks.
2. Decoder: a Transformer decoder with cross-attention placed
before the self-attention layer in each decoder block.
3. Task-specific Queries: task-specific, learnable embeddings
that act as input to the decoder. The parameters of all other
modules are shared among tasks, so 99.97% of the weights are
shared.
4. Task-guided Interpreter: the shared output head for all
tasks. It produces four kinds of output: feature representations,
a local probability map (pixel classification), a global
probability (image classification), and bounding-box coordinates
as used in object detection.

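A highly simplified structural sketch of these four components. It
is illustrative only: PyTorch's stock decoder layer applies
self-attention before cross-attention, the reverse of UniHCP's
ordering, and the real interpreter has multiple output units.

    import torch
    import torch.nn as nn

    class TinyUniHCP(nn.Module):
        """Toy analogue of UniHCP's structure: shared encoder/decoder,
        per-task learnable queries, one shared output head."""
        def __init__(self, d=256, n_queries=8, tasks=("pose", "parsing")):
            super().__init__()
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), 2)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), 2)
            # Only these queries are task-specific; all else is shared.
            self.queries = nn.ParameterDict(
                {t: nn.Parameter(torch.randn(1, n_queries, d)) for t in tasks})
            self.head = nn.Linear(d, d)   # stand-in for the interpreter

        def forward(self, tokens, task):
            mem = self.encoder(tokens)                      # shared
            q = self.queries[task].expand(tokens.size(0), -1, -1)
            return self.head(self.decoder(q, mem))          # shared

    model = TinyUniHCP()
    tokens = torch.randn(2, 196, 256)    # e.g. ViT patch tokens
    print(model(tokens, "pose").shape)   # torch.Size([2, 8, 256])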
UniHCP Architecture

Figure: The UniHCP architecture (Source: UniHCP original paper)

Paper 4

Paper 5

Recent Advances in Unified Models
▶ Two notable papers released in 2023 adopt an approach
similar to UniHCP's in the field of unified models.
▶ The first paper addresses six tasks and introduces the
Projector-Assisted Hierarchical Pretraining method (PATH),
utilizing a hierarchical architecture in contrast to UniHCP's
plain-transformer encoder-decoder approach.
▶ The second paper, released in December 2023, introduces
HQNet, a flexible model that learns a single shared "human
query." It consists of four key components: a backbone, a
Transformer encoder, a Transformer decoder, and task-specific
heads.
▶ Additionally, this paper contributes a dataset for human
perception tasks, named COCO-UniHuman, created by adding
annotations to the COCO dataset.

Conclusion and Future work
The studies I have done not only demonstrate the feasibility of a
unified model for multiple related tasks but also showcase its
better performance compared to existing models, both with and
without fine-tuning for specific tasks.

The model learns shared semantic information, reflecting the
fundamental structure of the human body. This shared knowledge
not only makes the model more efficient but also enhances
generalization, making it effective on tasks with limited training
data while requiring minimal fine-tuning or transfer learning.

Continuing my work on this problem, I am focusing on its
implementation, particularly the UniHCP PyTorch implementation
released by the authors of the paper. This involves refactoring the
code to remove the dependence on distributed training and
extracting inferences from the pre-trained model.
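
One concrete step in removing distributed training is loading a
checkpoint saved from a DistributedDataParallel model into a plain
model, since DDP prefixes every parameter name with "module.". A
hedged sketch of that cleanup; the file name, checkpoint layout, and
build_model() are placeholders, not details from the UniHCP
repository.

    import torch

    # A checkpoint saved from a DistributedDataParallel model stores
    # keys like "module.encoder....": strip the prefix so a plain
    # (non-DDP) model can load it.
    ckpt = torch.load("unihcp_checkpoint.pth", map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    state = {k.removeprefix("module."): v for k, v in state.items()}

    model = build_model()   # placeholder for the real model constructor
    model.load_state_dict(state, strict=False)
    model.eval()            # inference only; no distributed setup needed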

Thank you
t23156@students.iitmandi.ac.in

