TEEP Weekly Report

TEEP Internship Program
Weekly Report (25/10/2023 – 08/11/2023)

Intern: Thanh-Nguyen Truong
To-do list
Doing this week
• Research “DFRF: Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis [ECCV 2022]”
• Research “ER-NeRF: Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis [ICCV 2023]”
To do
• Research “HiDe-NeRF: One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field [CVPR 2023]”
Overview of 3D Talking Head Synthesis
Scene Representation Network - 3D View Synthesizing
• SRN [Sitzmann 2019]
• Neural Volumes [Lombardi 2019]
• NeRF [Mildenhall 2020]
• …
Audio Driven Face Generation
• GAN models
• 3DMM models
• NerFACE [Gafni 2020]
• AD-NeRF [Guo 2021] (+ torso)
• DFRF [Shen 2022]
• ER-NeRF [Li 2023]
• HiDe-NeRF [Li 2023]
• …
DFRF: Learning Dynamic Facial Radiance Fields
Problem Statement
Previous Methods
• 2D-based: unnatural talking style
• 3D-based: information loss due to the use of 3DMM intermediate representations
• NeRF-based methods:
  • High computational cost
  • Data burden
  • Identity-specific
Proposed: DFRF
• Less training data
• Low cost, fast convergence
• High generalization: each new identity requires only a small amount of fine-tuning
Contributions
1. Dynamic Facial Radiance Field for fast learning of new identity
2. Differentiable Face Warping for better facial dynamics modelling
Proposed Framework
Dynamic Facial Radiance Field
NeRF (but only for static scenes): 3D query point + 2D view direction → MLP network → color and density
For talking heads, audio information needs to be provided.
Diagram labels: audio features, identity learning
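As a hedged sketch of what this conditioning amounts to, written in notation assumed here rather than taken from the slide: a static NeRF maps a 3D query point $p$ and a view direction $d$ to a color $c$ and density $\sigma$, while the talking-head variant also receives audio features $a$ and reference image features $f$. The symbols $F_\theta$, $a$ and $f$ are illustrative names.

```latex
\begin{align}
\text{NeRF (static scene):} \quad & F_\theta : (p, d) \mapsto (c, \sigma) \\
\text{Audio- and image-conditioned field:} \quad & F_\theta : (p, d, a, f) \mapsto (c, \sigma)
\end{align}
```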
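Differentiable Face Warping
WHY? – Strict NeRF mapping fails to model complex facial movements.
Face warping module (or deformation field)
• Inputs: 3D query point, audio features, image features
• Output: an offset $\Delta o_n$ for the $n$-th reference image, giving the warped position $p'_{ref} = p_{ref} + \Delta o_n$
• Problem: indexing image features at the warped (non-integer) position is non-differentiable
• Solve: soft index by bilinear interpolation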
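The soft indexing step can be sketched as follows. This is a minimal NumPy illustration of bilinear interpolation at a warped position, not the DFRF implementation; the function name `bilinear_sample`, the toy feature map and the offset values are all hypothetical.

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Sample an (H, W, C) feature map at a continuous pixel position (x, y).

    The result is a weighted mix of the four surrounding pixels, so it varies
    smoothly with (x, y), unlike rounding to the nearest integer index.
    Assumes H >= 2 and W >= 2.
    """
    h, w, _ = feature_map.shape
    # Clamp so the 2x2 neighbourhood stays inside the map.
    x = float(np.clip(x, 0.0, w - 1.0))
    y = float(np.clip(y, 0.0, h - 1.0))
    x0 = min(int(np.floor(x)), w - 2)
    y0 = min(int(np.floor(y)), h - 2)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

# Toy feature map whose single channel equals 4 * row + col.
feature_map = np.arange(12, dtype=float).reshape(3, 4, 1)
p_ref = np.array([1.0, 0.5])      # projected position (x, y) on a reference image
delta_o = np.array([0.25, 0.5])   # hypothetical predicted offset
warped = p_ref + delta_o          # p'_ref = p_ref + delta_o
feature = bilinear_sample(feature_map, warped[0], warped[1])
```

Because the output is a linear blend of neighbouring features, gradients can flow back to the offset, which is what allows the warping module to be trained end to end.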
Regularization term to limit the offset values
• Quantities annotated in the formula: number of reference images, all the points in the 3D space, learnable parameters
• Low-density points are more likely to be background => less offset
Volume Rendering
• Similar to conventional NeRF
• Loss function: MSE
$L = \|C - I\|^2 + \lambda \cdot L'_r$
where $C$ is the rendered color and $I$ is the ground truth.
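A rough sketch of how the rendering and loss terms fit together is given below. The compositing follows the standard NeRF quadrature. The regulariser is an assumption: it weights the mean offset norm by (1 - density) so that low-density (likely background) points are pushed towards smaller offsets. The names `composite_ray`, `total_loss` and `lam`, and the choice of densities scaled to [0, 1], are illustrative and not taken from the paper's code.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Standard NeRF compositing along one ray.

    colors: (S, 3), densities: (S,), deltas: (S,) distances between samples.
    """
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light reaching sample i without being absorbed.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)

def total_loss(rendered, target, offsets, point_density, lam=0.01):
    """Photometric MSE plus a density-weighted offset regulariser (assumed form).

    offsets: (P, N, 2) offsets for P points and N reference images.
    point_density: (P,) densities assumed to be scaled to [0, 1].
    """
    mse = np.mean((rendered - target) ** 2)
    offset_norm = np.linalg.norm(offsets, axis=-1).mean(axis=1)  # (P,)
    reg = np.mean((1.0 - point_density) * offset_norm)
    return mse + lam * reg

# Two samples on a ray; the first is effectively opaque and red.
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rendered = composite_ray(colors, np.array([1e6, 1.0]), np.array([1.0, 1.0]))
loss = total_loss(rendered, np.array([1.0, 0.0, 0.0]),
                  offsets=np.zeros((4, 3, 2)), point_density=np.zeros(4))
```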
Effectiveness of the Face Warping Module
Result to show the contribution of the proposed differentiable face warping module.
DFRF – Reported Results
Results with different numbers of reference images
Method comparisons when using different lengths of training videos

ER-NeRF: Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
Observations
• Only the head region needs to be focused
-> Unrelated neurons can be pruned
• Distinct audio-facial manners
-> Unique audio-driven local motions
Proposal
Efficient Region-aware NeRF (ER-NeRF): uneven concentration level for different spatial regions
Contribution
• Tri-Plane Hash Representation
• Region Attention Module: capture the correlation between the audio condition and spatial regions
ER-NeRF
Tri-Plane Hash Representation
Problems
• Hash collisions increase linearly with the number of sampling points
• Every point in the 3D space is sampled equally
-> MLP needs to handle multiple audio features at the same time
• Naïve methods of sampling reduction lower the quality
-> Avoid hash collisions from high dimensions (concatenate with the other 2 planes’ features)
Method
For each plane: 3D points → project to 2D → hash decoder → plane-level geometry feature
The three plane-level geometry features are combined (+) into the final tri-plane geometry features.
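The per-plane lookup can be sketched as below. This is a deliberately simplified illustration: it uses a single grid resolution and a nearest-vertex lookup, whereas hash encoders in practice typically use several resolutions with interpolation. The function names, table size, feature width and the XOR-with-large-prime hash are assumptions chosen for the example.

```python
import numpy as np

def hash2d(ix, iy, table_size):
    """XOR-based spatial hash of non-negative integer 2D grid coordinates."""
    return (ix ^ (iy * 2654435761)) % table_size

def plane_features(points, table, axes, resolution):
    """Project (P, 3) points in [0, 1)^3 onto two axes and look up hashed features."""
    uv = points[:, list(axes)]                          # (P, 2) projection
    cell = np.floor(uv * resolution).astype(np.int64)   # lower grid vertex per point
    idx = hash2d(cell[:, 0], cell[:, 1], table.shape[0])
    return table[idx]                                   # (P, F)

def triplane_features(points, tables, resolution=64):
    """Concatenate features from the XY, YZ and XZ planes into (P, 3F)."""
    planes = ((0, 1), (1, 2), (0, 2))
    feats = [plane_features(points, t, ax, resolution)
             for t, ax in zip(tables, planes)]
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
tables = [rng.normal(size=(256, 4)) for _ in range(3)]  # one feature table per plane
points = rng.random((5, 3))
features = triplane_features(points, tables)            # shape (5, 12)
```

Each table only has to cover a 2D grid, so for the same table size there are far fewer cells competing for each slot than in a 3D grid, which is the intuition behind the reduced collisions.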
ER-NeRF
Region Attention Module
Purpose: Connect audio features with related spatial features, i.e. attend to certain parts of the head given the audio
Visualization of the Region Attention Module
Query point → 3-plane hashing → final tri-plane geometry features → 2-layer MLP → attention vector (same number of channels as the audio features A) → channel-wise attention with the audio features
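The data flow of the module can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: a ReLU hidden layer, no normalisation of the attention vector, and an element-wise product as the channel-wise attention. The function name `region_attention`, the layer sizes and the random weights are hypothetical.

```python
import numpy as np

def region_attention(geo_feat, audio_feat, w1, b1, w2, b2):
    """Reweight audio channels per query point.

    geo_feat: (P, G) tri-plane geometry features for P query points.
    audio_feat: (A,) audio feature vector shared by all points.
    Returns (P, A) region-aware audio features.
    """
    hidden = np.maximum(geo_feat @ w1 + b1, 0.0)  # first MLP layer with ReLU
    attention = hidden @ w2 + b2                  # (P, A), same channels as audio
    return attention * audio_feat                 # channel-wise product

rng = np.random.default_rng(0)
geo = rng.normal(size=(5, 12))     # 5 query points, 12 geometry channels
audio = rng.normal(size=(8,))      # 8 audio channels
w1, b1 = rng.normal(size=(12, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
region_audio = region_attention(geo, audio, w1, b1, w2, b2)  # shape (5, 8)
```

Since the attention vector depends on the query point's geometry features, different spatial regions can respond to the same audio input with different strength.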
ER-NeRF: Reported Results
Results of the head reconstruction setting, Obama dataset.
ER-NeRF: Reported Results
Results on lip-synchronization
Key-frame picking
ER-NeRF: Different training length results
[Bar chart: PSNR ↑, y-axis from 0 to 35. X-axis labels as extracted: Obama Biden reporter Chinese man Chinese woman Trump reporter French man. Individual bar values cannot be matched to subjects in the extracted text.]
[Bar chart: LPIPS ↓, y-axis from 0 to 0.18.]
[Bar chart: LMD ↓, y-axis from 0 to 5.]
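For reference, two of the metrics reported above can be computed as in the sketch below. PSNR follows its standard definition; LMD is taken here to be the mean Euclidean distance between predicted and ground-truth landmarks, which is an assumption about the exact variant used in the experiments. LPIPS is omitted because it needs a pretrained network. The function names are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def landmark_distance(pred_landmarks, target_landmarks):
    """Mean Euclidean distance between two (K, 2) landmark sets."""
    diffs = np.linalg.norm(pred_landmarks - target_landmarks, axis=-1)
    return float(np.mean(diffs))

target_img = np.zeros((4, 4, 3))
pred_img = np.full((4, 4, 3), 0.1)          # uniform error of 0.1 per pixel
psnr_value = psnr(pred_img, target_img)      # MSE = 0.01

pred_lm = np.array([[3.0, 4.0], [0.0, 0.0]])
target_lm = np.zeros((2, 2))
lmd_value = landmark_distance(pred_lm, target_lm)
```

Higher PSNR means the rendered frame is closer to the ground truth, while lower LMD means the predicted landmarks track the reference more closely.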
ER-NeRF: Cross-lingual tests
• Chinese on Obama
• Chinese woman on French man
• French on American man
Future Plan
• Continue to run experiments on custom data with DFRF and ER-NeRF
• Research and experiment with HiDe-NeRF