
TEEP Internship Program

Weekly Report (25/10/2023 – 08/11/2023)


Intern: Thanh-Nguyen Truong
To-do list

Doing this week
• Research “DFRF: Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis [ECCV 2022]”
• Research “ER-NeRF: Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis [ICCV 2023]”

To do
• Research “HiDe-NeRF: One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field [CVPR 2023]”

Overview of 3D Talking Head Synthesis

Scene Representation Network - 3D View Synthesizing
• SRN [Sitzmann 2019]
• Neural Volumes [Lombardi 2019]
• NeRF [Mildenhall 2020]

Audio Driven Face Generation
• GAN models, 3DMM models, ….
• NerFACE [Gafni 2020]
• AD-NeRF [Guo 2021] (+ torso)
• DFRF [Shen 2022]
• ER-NeRF [Li 2023]
• HiDe-NeRF [Li 2023]
DFRF: Learning Dynamic Facial Radiance Fields
Problem Statement

Previous Methods
• 2D-based: unnatural talking style
• 3D-based: information loss due to the use of 3DMM intermediate representations
• NeRF-based methods:
  • High computational cost
  • Data burden
  • Identity-specific

Proposed: DFRF
• Less training data
• Low cost, fast convergence
• High generalization: each new identity requires only a small amount of fine-tuning

Contributions
1. Dynamic Facial Radiance Field for fast learning of new identity
2. Differentiable Face Warping for better facial dynamics modelling
DFRF: Learning Dynamic Facial Radiance Fields
Proposed Framework

DFRF: Learning Dynamic Facial Radiance Fields
Dynamic Facial Radiance Field

NeRF (static scenes only): 3D query point + 2D view direction -> MLP network -> color and density

For a talking head, audio information needs to be provided.
Formula (labelled terms): audio features, identity learning
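
To make the mapping above concrete, here is a minimal sketch of a radiance-field MLP that takes audio as an extra input. It is an illustration only, not the DFRF network: the weights are random and untrained, the layer sizes are hypothetical, and passing image features as the identity input is an assumption taken from the face warping slide.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained, illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x):
    """Apply one dense layer: weights @ x + biases."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def radiance_field(point, view_dir, audio_feat, image_feat, layers):
    """Map (3D point, 2D view direction, audio, image features) to (rgb, density)."""
    x = list(point) + list(view_dir) + list(audio_feat) + list(image_feat)
    hidden = [max(0.0, h) for h in dense(layers[0], x)]   # ReLU
    out = dense(layers[1], hidden)                        # 4 values: r, g, b, sigma
    rgb = [1.0 / (1.0 + math.exp(-v)) for v in out[:3]]   # sigmoid keeps colors in [0, 1]
    sigma = max(0.0, out[3])                              # density must be non-negative
    return rgb, sigma

rng = random.Random(0)
AUDIO_DIM, IMAGE_DIM, HIDDEN = 8, 8, 16                   # hypothetical sizes
layers = [make_layer(3 + 2 + AUDIO_DIM + IMAGE_DIM, HIDDEN, rng),
          make_layer(HIDDEN, 4, rng)]
rgb, sigma = radiance_field((0.1, -0.2, 0.3), (0.5, 1.0),
                            [0.1] * AUDIO_DIM, [0.2] * IMAGE_DIM, layers)
```

With a different audio vector the same query point can return a different color and density, which is what lets a single field represent a moving mouth.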

DFRF: Learning Dynamic Facial Radiance Fields
Differentiable Face Warping
WHY? Strict NeRF mapping fails to model complex facial movements.

Face warping module (or deformation field)
• Inputs: 3D query point, audio features, image features
• Output: offset $\Delta o_n$, added to the reference position to give $p_n^{ref\prime}$
• Problem: hard indexing with the predicted offset is non-differentiable
• Solution: soft indexing by bilinear interpolation
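
The soft index can be sketched as plain bilinear interpolation on a feature grid. The grid values, the reference position and the offset below are hypothetical. The point is that the sampled value changes smoothly with the offset, so gradients can reach the module that predicts it.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Sample a 2D grid at continuous (x, y); x indexes columns, y indexes rows."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so the four neighbours always lie inside the grid.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom

grid = [[0.0, 1.0],
        [2.0, 3.0]]
ref_x, ref_y = 0.0, 0.0          # projected reference position (hypothetical)
offset_x, offset_y = 0.5, 0.5    # predicted offset (hypothetical)
value = bilinear_sample(grid, ref_x + offset_x, ref_y + offset_y)
```

Rounding the warped position to the nearest pixel would return a piecewise-constant value whose derivative with respect to the offset is zero almost everywhere, which is why a hard index cannot be trained through.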


DFRF: Learning Dynamic Facial Radiance Fields
Differentiable Face Warping

Regularization term to limit the offset values
• Terms labelled on the slide: number of reference images, all the points in the 3D space, learnable parameters
• Low-density points are more likely to be background => less offset

Volume Rendering
• Similar to conventional NeRF
• Loss function: MSE

$L = \lVert C - I \rVert^2 + \lambda \cdot L'_r$

$C$: rendered color
$I$: ground truth
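
As a rough sketch of how the rendering and the loss fit together: `render_ray` uses the standard NeRF quadrature. The regularizer is an assumed form, since its exact formula is not in these notes. It takes the mean offset length weighted by `exp(-sigma)`, so that low-density points are penalized more. The names and the value of `lam` are hypothetical.

```python
import math

def render_ray(colors, sigmas, deltas):
    """Composite per-sample colors along one ray with the standard NeRF quadrature."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0                      # fraction of light not yet absorbed
    for color, sigma, delta in zip(colors, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel

def offset_regularizer(offsets, sigmas):
    """Assumed form: mean 2D offset length, weighted more heavily at low density."""
    total = 0.0
    for (dx, dy), sigma in zip(offsets, sigmas):
        low_density_weight = math.exp(-sigma)   # assumption: near 1 where density is low
        total += low_density_weight * math.hypot(dx, dy)
    return total / len(offsets)

def total_loss(rendered, target, offsets, sigmas, lam=0.01):
    """Squared color error plus lambda times the offset regularizer."""
    photometric = sum((c - t) ** 2 for c, t in zip(rendered, target))
    return photometric + lam * offset_regularizer(offsets, sigmas)

colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sigmas = [0.0, 50.0]             # empty space first, then a dense surface
deltas = [0.1, 0.1]
pixel = render_ray(colors, sigmas, deltas)
loss = total_loss(pixel, [0.0, 1.0, 0.0], [(0.2, 0.0), (0.0, 0.1)], sigmas)
```

The first sample has zero density and contributes nothing, so the pixel is almost entirely the green of the dense second sample.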
DFRF: Learning Dynamic Facial Radiance Fields
Effectiveness of the Face Warping Module

Results showing the contribution of the proposed differentiable face warping module.
DFRF – Reported Results

Results with different numbers of reference images

Method comparisons when using different lengths of training videos


DFRF – Reported Results

DFRF

ER-NeRF: Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis

Observations
• Only the head region needs to be focused on
  -> Unrelated neurons can be pruned
• Distinct audio-facial manners
  -> Unique audio-driven local motions

Proposal
Efficient Region-aware NeRF (ER-NeRF): uneven concentration level for different spatial regions

Contribution
• Tri-Plane Hash Representation
• Region Attention Module: capture the correlation between the audio condition and spatial regions
ER-NeRF
Tri-Plane Hash Representation
Problems
• Hash collisions increase linearly with the number of sampling points
• Every point in the 3D space is sampled equally
  -> MLP needs to handle multiple audio features at the same time
• Naïve methods of sampling reduction lower the quality
  -> Avoid hash collisions from high dimensions (concatenate with other 2 planes’ features)

Method
For each plane: 3D points -> project to 2D -> hash decoder -> plane-level geometry feature
Concatenate (+) the plane-level features of the three planes -> final tri-plane geometry features
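
The collision argument can be checked with a small experiment: hash every cell of a 3D grid into a fixed-size table, then do the same for its three 2D projections. This is a sketch under assumptions, not ER-NeRF's encoder. The XOR-of-primes hash follows the style used by Instant-NGP, and the resolution and table size are hypothetical.

```python
def spatial_hash(cell, table_size):
    """XOR-of-primes spatial hash (Instant-NGP style; the primes are an assumption here)."""
    primes = (1, 2654435761, 805459861)
    h = 0
    for coord, prime in zip(cell, primes):
        h ^= coord * prime
    return h % table_size

def project_to_planes(cell):
    """Project a 3D grid cell (x, y, z) onto the XY, YZ and XZ planes."""
    x, y, z = cell
    return (x, y), (y, z), (x, z)

def count_collisions(cells, table_size):
    """Distinct cells minus occupied slots: how many cells landed in an already-used slot."""
    cells = set(cells)
    slots = {spatial_hash(c, table_size) for c in cells}
    return len(cells) - len(slots)

RES, TABLE_SIZE = 32, 2 ** 10    # hypothetical grid resolution and hash-table size
cells_3d = [(x, y, z) for x in range(RES) for y in range(RES) for z in range(RES)]
collisions_3d = count_collisions(cells_3d, TABLE_SIZE)
collisions_per_plane = [
    count_collisions([project_to_planes(c)[i] for c in cells_3d], TABLE_SIZE)
    for i in range(3)
]
```

At this resolution the 3D grid has 32,768 cells competing for 1,024 slots, while each plane has only 1,024 distinct cells, so each plane sees far fewer collisions. The three plane-level features are then concatenated to recover 3D information.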
ER-NeRF
Region Attention Module

Purpose: Connect audio features with related spatial features, i.e. attend to certain parts of the head given the audio

Visualization of the Region Attention Module

Query point -> 3-plane hashing -> final tri-plane geometry features -> 2-layer MLP -> attention vector
Attention vector + audio features (same number of channels A) -> channel-wise attention
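
The data flow above can be sketched as follows. This is a minimal illustration with random, untrained weights and hypothetical sizes. The ReLU between the two layers and the plain element-wise product are assumptions, not details confirmed by these notes.

```python
import random

def dense(weights, x):
    """Bias-free dense layer: weights @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def region_attention(geometry_feat, audio_feat, w1, w2):
    """Re-weight audio channels with an attention vector computed from spatial features."""
    hidden = [max(0.0, h) for h in dense(w1, geometry_feat)]   # layer 1 + ReLU
    attention = dense(w2, hidden)                              # layer 2: one value per audio channel
    region_audio = [a * f for a, f in zip(attention, audio_feat)]
    return region_audio, attention

rng = random.Random(0)
GEO_DIM, HIDDEN, AUDIO_DIM = 6, 8, 4                           # hypothetical sizes
w1 = [[rng.uniform(-1, 1) for _ in range(GEO_DIM)] for _ in range(HIDDEN)]
w2 = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(AUDIO_DIM)]

geometry_feat = [rng.uniform(0, 1) for _ in range(GEO_DIM)]    # tri-plane features of one point
audio_feat = [0.5, -0.2, 0.1, 0.9]
region_audio, attention = region_attention(geometry_feat, audio_feat, w1, w2)
```

Because the attention vector depends on the query point, the same audio feature is scaled differently in different regions of the head.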
ER-NeRF: Reported Results

Results of the head reconstruction setting, Obama dataset.

ER-NeRF: Reported Results

Results on lip-synchronization

Key-frame picking
ER-NeRF: Different training length results
[Bar chart: PSNR ↑ per subject. Subject labels as extracted: “Obama Biden reporter Chinese man Chinese woman Trump reporter French man”. Labelled bar values range from 18.647 to 30.976.]
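
For reference, PSNR is computed from the mean squared error between the rendered and ground-truth pixels. The sketch below assumes pixel values in [0, 1]; the sample values are made up.

```python
import math

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the ground truth."""
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)
    if mse == 0:
        return float("inf")      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel is off by 0.1, so MSE = 0.01 and PSNR = 10 * log10(1 / 0.01) = 20 dB.
score = psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.4, 0.6, 0.4])
```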

ER-NeRF: Different training length results
[Bar chart: LPIPS ↓ per subject. Subject labels as extracted: “Obama Biden reporter Chinese man Chinese woman Trump reporter French man”. Labelled bar values range from 0.0242 to 0.1597.]

ER-NeRF: Different training length results
[Bar chart: LMD ↓ per subject. Subject labels as extracted: “Obama Biden reporter Chinese man Chinese woman Trump reporter French man”. Labelled bar values range from 2.531 to 4.375.]
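
Assuming LMD here is the usual landmark distance, it is the mean Euclidean distance between predicted and ground-truth facial landmarks. The sketch below uses made-up 2D landmark coordinates.

```python
import math

def landmark_distance(pred_landmarks, true_landmarks):
    """Mean Euclidean distance between corresponding 2D landmarks; lower is better."""
    distances = [math.hypot(px - tx, py - ty)
                 for (px, py), (tx, ty) in zip(pred_landmarks, true_landmarks)]
    return sum(distances) / len(distances)

# One landmark matches exactly, the other is 5 pixels away: mean distance 2.5.
lmd = landmark_distance([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```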

ER-NeRF: Cross-lingual tests

• Chinese on Obama
• Chinese woman on French man
• French on American man

Future Plan
• Continue to run experiments on custom data with DFRF and ER-NeRF
• Research and experiment with HiDe-NeRF
