
Reference-based Video Super-Resolution Using Multi-Camera Video Triplets

Junyong Lee, Myeonghee Lee, Sunghyun Cho, Seungyong Lee


Motivation

Multi-Camera Smartphones
The longer the focal length, the higher the resolution and detail with which the subject is captured

Examples: Apple iPhone 13 Pro Max, Samsung Galaxy S21 Ultra



Reference-based Super-Resolution (RefSR)


Transferring high-quality details from a reference image to a low-resolution (LR) image

Figures: Yang et al., CVPR 2020; Wang et al., ICCV 2021 (RefSR in a dual-camera setting)

Patch-based global matching modules are widely employed to match LR-Ref pairs
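To make "patch-based global matching" concrete, here is a minimal sketch in plain Python. It assumes each patch has already been encoded as a flat feature vector; every LR patch is compared with every reference patch by cosine similarity, and the best reference index and its score are kept. The function names are hypothetical, and practical modules typically run the same exhaustive search as a batched matrix product on feature maps rather than with Python loops.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def global_match(lr_patches, ref_patches):
    """Exhaustively compare every LR patch with every reference patch.

    Returns, for each LR patch, the index of the most similar reference
    patch and that similarity (usable as a matching confidence).
    """
    indices, confidences = [], []
    for query in lr_patches:
        scores = [cosine_similarity(query, key) for key in ref_patches]
        best = max(range(len(scores)), key=lambda j: scores[j])
        indices.append(best)
        confidences.append(scores[best])
    return indices, confidences


lr = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]]
ref = [[0.0, 2.0], [3.0, 0.0]]
idx, conf = global_match(lr, ref)
print(idx)  # [1, 0, 0]
```

The cost grows with the product of the number of LR and reference patches, which is why such matching is expensive to repeat across many frames.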

Learning texture transformer network for image super-resolution, Yang et al., CVPR 2020
Dual-camera super-resolution with aligned attention modules, Wang et al., ICCV 2021


Matching often fails in regions outside the overlap between the cameras' fields of view
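A rough way to see how large the non-overlapping region can be: under a pinhole model in which two cameras share an optical centre, orientation and aspect ratio, the longer lens sees a centred crop of the shorter lens's frame, and the covered area fraction is the squared ratio of the half-FoV tangents. The sketch assumes 35mm-equivalent focal lengths of 13 mm (ultra-wide) and 26 mm (wide-angle); these values, the 36 mm sensor width and the function names are illustrative assumptions, not figures from the slides.

```python
import math

# Assumption: width of the full-frame sensor behind "35mm-equivalent" focal lengths.
SENSOR_WIDTH_MM = 36.0


def horizontal_fov_deg(focal_mm):
    """Horizontal field of view of an ideal pinhole camera, in degrees."""
    return math.degrees(2 * math.atan(SENSOR_WIDTH_MM / (2 * focal_mm)))


def overlap_area_fraction(f_short_mm, f_long_mm):
    """Fraction of the shorter-focal-length frame that the longer lens also sees.

    Assumes a shared optical centre, orientation and aspect ratio, so the longer
    lens sees a centred crop whose width and height shrink by
    tan(fov_long / 2) / tan(fov_short / 2).
    """
    half_short = math.radians(horizontal_fov_deg(f_short_mm)) / 2
    half_long = math.radians(horizontal_fov_deg(f_long_mm)) / 2
    return (math.tan(half_long) / math.tan(half_short)) ** 2


# Assumed focal lengths: 13 mm ultra-wide, 26 mm wide-angle (35mm-equivalent).
print(round(horizontal_fov_deg(13), 1), round(horizontal_fov_deg(26), 1))
print(round(overlap_area_fraction(13, 26), 3))  # 0.25
```

Under these assumptions only a quarter of each ultra-wide frame has a wide-angle counterpart at the same instant. Real camera modules are offset by a small baseline, so the actual overlap differs somewhat.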

Figure columns: LR↑ / Ref↓, Bicubic, DCSR (RefSR)1
Figure rows: inside the overlap, outside the overlap

1Dual-camera super-resolution with aligned attention modules, Wang et al., ICCV 2021

Reference-based Video Super-Resolution (RefVSR)


Extending the RefSR task to video

Figure columns: LR↑ / Ref↓, Bicubic, DCSR (RefSR)1, Ours (RefVSR)
Figure rows: inside the overlap, outside the overlap



High-fidelity textures from reference frames at other time steps

Figure columns: LR↑ / Ref↓, Bicubic, DCSR (RefSR)1, Ours (RefVSR)
Figure rows: inside the overlap, outside the overlap

Reference-based Video Super-Resolution Using Multi-Camera Video Triplets

Proposed RefVSR Framework


Input: low-resolution (LR: ultra-wide) and reference (Ref: wide-angle) video frames $\{I_t^{LR}, I_t^{Ref}\}_{t=1}^{N}$
Output: super-resolved (SR) frame sequence $\{I_t^{SR}\}_{t=1}^{N}$

Figure: input pairs $(I_{t-1}^{LR}, I_{t-1}^{Ref})$, $(I_t^{LR}, I_t^{Ref})$, $(I_{t+1}^{LR}, I_{t+1}^{Ref})$ feed forward cells $F_f$ and backward cells $F_b$, followed by modules $U$ that output $I_{t-1}^{SR}$, $I_t^{SR}$, $I_{t+1}^{SR}$
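The framework can be read as a bidirectional recurrence over the frame pairs. The sketch below is a schematic, not the authors' code: `cell_f`, `cell_b` and `upsample` are hypothetical stand-ins for $F_f$, $F_b$ and $U$, and states are opaque Python values. The toy cells only record which reference frames a state has seen, which shows that every output can draw on every reference frame in the sequence.

```python
def bidirectional_refvsr(lr_frames, ref_frames, cell_f, cell_b, upsample):
    """Bidirectional recurrent super-resolution over (LR, Ref) frame pairs.

    cell_f / cell_b(prev_state, lr, ref) -> new_state   stand in for F_f / F_b.
    upsample(h_f, h_b, lr) -> sr                        stands in for U.
    prev_state is None at the start of each pass.
    """
    if len(lr_frames) != len(ref_frames):
        raise ValueError("LR and reference sequences must have the same length")
    n = len(lr_frames)

    # Backward pass (t = N-1 ... 0): the state at t summarises frames t..N-1.
    backward = [None] * n
    state = None
    for t in reversed(range(n)):
        state = cell_b(state, lr_frames[t], ref_frames[t])
        backward[t] = state

    # Forward pass (t = 0 ... N-1): the state at t summarises frames 0..t.
    outputs = []
    state = None
    for t in range(n):
        state = cell_f(state, lr_frames[t], ref_frames[t])
        outputs.append(upsample(state, backward[t], lr_frames[t]))
    return outputs


def track(prev, lr, ref):
    """Toy cell: remember the set of reference frames seen so far."""
    return (prev or frozenset()) | {ref}


out = bidirectional_refvsr(["lr0", "lr1", "lr2"], ["ref0", "ref1", "ref2"],
                           track, track, lambda hf, hb, lr: sorted(hf | hb))
print(out)  # every output has seen ref0, ref1 and ref2
```

Storing the backward states first and then running the forward pass is a common way to organise bidirectional recurrent VSR; the slides do not state how the original implementation orders the two passes.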

Key Idea: Fuse and Propagate


At each recurrent step, global matching is needed only between the LR frame and the reference frame of that step
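The benefit of matching only within each step can be illustrated with a simple counting model (an illustration, not a measured cost): letting every LR frame consult every reference frame directly would take $N^2$ global matchings for $N$ frames, whereas fusing at each step and propagating needs one per frame. The function name and strategy labels are hypothetical.

```python
def matching_count(num_frames, strategy):
    """Global LR-Ref matchings needed so that every output frame can draw on
    every reference frame (hypothetical cost model)."""
    if strategy == "all_pairs":
        # Each LR frame is matched directly against all N reference frames.
        return num_frames * num_frames
    if strategy == "fuse_and_propagate":
        # Each LR frame is matched only with its own reference frame; other
        # frames' reference information arrives through propagated features.
        return num_frames
    raise ValueError(f"unknown strategy: {strategy!r}")


print(matching_count(100, "all_pairs"), matching_count(100, "fuse_and_propagate"))  # 10000 100
```

Even if the forward and backward cells each perform their own matching, the fuse-and-propagate count stays linear in $N$.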

Figure: the same pipeline, with features propagated from cell to cell in the forward ($F_f$) and backward ($F_b$) directions; each step receives only its own pair $(I_t^{LR}, I_t^{Ref})$, and modules $U$ output $I_{t-1}^{SR}$, $I_t^{SR}$, $I_{t+1}^{SR}$

Recurrent Cell
1. Inter-frame fusion
Propagated features + LR features → temporally aggregated features

2. Reference Alignment and Propagation (RAP)


Temporally aggregated features + reference features → propagating features

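Figure: internal structure of the recurrent cell $F_f$ (inter-frame fusion followed by RAP), shown with inputs $I_t^{Ref}$, $I_{t+1}^{LR}$, $I_{t+1}^{Ref}$ and backward cells $F_b$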

Reference Alignment and Propagation


1. Reference feature alignment1
Aligning reference features to LR features
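In its simplest form, aligning reference features to LR features means gathering, for every LR position, the reference feature at the matched position. The sketch assumes the matching index is already available (for example from a cosine-similarity search) and that features are stored as flat lists. The cited alignment module (Wang et al., ICCV 2021) is more elaborate, so treat this only as the core idea; names and values are hypothetical.

```python
def align_reference(ref_features, match_index):
    """Gather reference features onto the LR grid using a matching index.

    ref_features: reference feature vectors, one per reference position.
    match_index:  for each LR position, the reference position it matched.
    Returns one reference feature per LR position (reference aligned to LR).
    """
    return [ref_features[j] for j in match_index]


ref_feats = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]]
p_t = [2, 2, 0, 1]  # hypothetical matching index for four LR positions
print(align_reference(ref_feats, p_t))
# [[0.5, 0.5], [0.5, 0.5], [0.1, 0.2], [0.9, 0.8]]
```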

2. Propagative temporal fusion


Matching confidence-based fusion between
temporally aggregated features and reference features
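The slides fuse the two feature sets with learned convolutions guided by matching confidence. The sketch below swaps in a fixed, hypothetical weighting rule so that the roles of the two confidences are visible: the aligned reference feature receives weight $c / (c + \tilde{c})$, where $c$ is the current matching confidence and $\tilde{c}$ the accumulated one, and the accumulated confidence is updated with a running max, mirroring the max operation in the RAP figure. It assumes $\tilde{c}$ has already been carried over to the current frame's grid; names and numbers are illustrative only.

```python
def propagative_fusion(aggregated, aligned_ref, conf, acc_conf, eps=1e-8):
    """Confidence-guided fusion at each spatial position (hypothetical rule).

    aggregated:  temporally aggregated feature vectors
    aligned_ref: reference feature vectors aligned to the LR frame
    conf:        matching confidence of the current frame
    acc_conf:    accumulated matching confidence carried over from earlier steps
    Returns the fused features and the updated accumulated confidence.
    """
    fused, new_acc = [], []
    for h, r, c, c_acc in zip(aggregated, aligned_ref, conf, acc_conf):
        c = max(c, 0.0)  # cosine similarity can be negative; treat it as no match
        # Trust the reference only to the extent that its current match
        # is strong relative to what has already been accumulated.
        w = c / (c + c_acc + eps)
        fused.append([w * rv + (1.0 - w) * hv for hv, rv in zip(h, r)])
        new_acc.append(max(c_acc, c))  # accumulated confidence: running max
    return fused, new_acc


fused, acc = propagative_fusion([[1.0], [1.0]], [[3.0], [3.0]],
                                conf=[0.9, 0.0], acc_conf=[0.1, 0.8])
print(fused, acc)
```

At the first position the strong current match pulls the output toward the reference feature; at the second the current match is poor, so the propagated features pass through unchanged.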

Figure: the Reference Alignment and Propagation (RAP) module.
Reference feature alignment: a cosine-similarity matrix between the features $\phi(I_t^{LR})$ and $\phi(I_t^{Ref})$ gives a matching index $p_t$, a matching confidence $c_t$, and reference features aligned to LR.
Propagative temporal fusion: the temporally aggregated features and the aligned reference features are concatenated and fused by conv layers, guided by $c_t$ and the accumulated matching confidence $\tilde{c}_t^{\{f,b\}}$ (updated by a max with $c_t$), producing the propagating features $h_t^{\{f,b\}}$.



Matching confidence-based fusion lets only well-matched reference features be propagated
Figure: results without (w/o) and with (w/) propagative temporal fusion (PTF), inside and outside the overlap


Training

RealMCVSR Dataset

The RealMCVSR dataset provides real-world HD video triplets concurrently recorded by the triple cameras of an Apple iPhone 12 Pro Max

(a) Ultra-wide Video (b) Wide-angle Video (c) Telephoto Video



2-Stage Training Scheme Fully Utilizing Video Triplets


Pre-training stage (supervised setting)
Inputs: 4x downsampled ultra-wide (LR) and wide (reference) videos
Supervision: ultra-wide (HR) and wide (reference) videos

Wide-angle videos for reference supervision

$$\ell_{pre} = \underbrace{\lVert I^{SR}_{t,blur} - I^{HR}_{t,blur} \rVert}_{\text{low-frequency term}} + \delta(I^{SR}_t, I^{HR}_t) + \underbrace{\ell_{M\cdots}(I^{SR}_t, I^{Wide}_{t\in\Omega})}_{\text{high-frequency term}}$$
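The low-frequency term compares blurred images, so it constrains colour and coarse structure while leaving fine texture to the high-frequency term. A minimal sketch, assuming a box blur and a mean absolute difference (the slides specify neither the kernel nor the norm); images are lists of rows, and the function names are hypothetical. The checkerboard example shows that a pattern with the right local mean incurs no low-frequency penalty.

```python
def box_blur(img, radius=1):
    """Mean filter over a (2*radius+1)^2 window; at the borders only the
    pixels inside the image are averaged."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - radius), min(h, y + radius + 1))
                    for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def low_frequency_l1(sr, hr, radius=1):
    """Mean absolute difference between the blurred SR and blurred HR images."""
    bs, bh = box_blur(sr, radius), box_blur(hr, radius)
    n = len(sr) * len(sr[0])
    return sum(abs(a - b) for rs, rh in zip(bs, bh) for a, b in zip(rs, rh)) / n


# A fine checkerboard and a flat image with the same local mean: no penalty.
print(low_frequency_l1([[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]))  # 0.0
# A global brightness offset is penalised.
print(low_frequency_l1([[3.0] * 3] * 3, [[2.0] * 3] * 3))  # 1.0
```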


Adaptation stage for real-world 4x VSR (self-supervised setting)
Inputs: real-world ultra-wide (LR) and wide (reference) videos
Supervision: telephoto (reference) video

Telephoto videos for reference supervision

$$\ell_{8K} = \underbrace{\lVert I^{SR\downarrow}_{t,blur} - I^{UW}_{t,blur} \rVert}_{\text{low-frequency term}} + \underbrace{\ell_{M\cdots}(I^{SR}_t, I^{Tele}_{t\in\Omega})}_{\text{high-frequency term}}$$
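In the adaptation stage there is no high-resolution ultra-wide frame, so the low-frequency term compares the super-resolved output, downsampled back to the input size, with the real ultra-wide frame itself. The sketch assumes 4x average pooling as the downsampling operator and omits the blur for brevity (average pooling already acts as a low-pass filter); function names are hypothetical.

```python
def avg_pool(img, factor=4):
    """Downsample a 2D image (list of rows) by averaging factor x factor blocks."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]


def self_supervised_low_freq(sr, uw, factor=4):
    """Mean absolute difference between the downsampled SR frame and the real
    ultra-wide input frame (blur omitted for brevity)."""
    down = avg_pool(sr, factor)
    assert len(down) == len(uw) and len(down[0]) == len(uw[0]), "size mismatch"
    n = len(uw) * len(uw[0])
    return sum(abs(a - b) for rd, ru in zip(down, uw) for a, b in zip(rd, ru)) / n


sr = [[1.0] * 4 + [2.0] * 4 for _ in range(4)]  # 4x8 super-resolved frame
uw = [[1.0, 2.0]]                               # 1x2 ultra-wide input frame
print(self_supervised_low_freq(sr, uw))  # 0.0
```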
Experiments

Qualitative Results
(8K ultra-wide VSR from real-world HD videos)
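The 8K figure follows from the 4x scale factor of the adaptation stage, assuming "HD" here means Full HD (1920x1080); the helper name is hypothetical.

```python
def upscaled_size(width, height, scale=4):
    """Output resolution of a scale-x super-resolution model."""
    return width * scale, height * scale


# Assumption: "HD" input means Full HD (1920x1080).
print(upscaled_size(1920, 1080))  # (7680, 4320), i.e. 8K UHD
```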

Video 1
The region inside the overlapping field of view (FoV) of the ultra-wide and wide-angle frames

Video 1
The region outside the overlapping FoV of the ultra-wide and wide-angle frames

Video 2
The region inside the overlapping FoV of the ultra-wide and wide-angle frames

Video 2
The region outside the overlapping FoV of the ultra-wide and wide-angle frames

Video 3
The region inside the overlapping FoV of the ultra-wide and wide-angle frames

Video 3
The region outside the overlapping FoV of the ultra-wide and wide-angle frames

Results

Quantitative Results
State-of-the-art super-resolution performance
Improved fidelity balance > 49%

More results
Analysis of reference video types
Analysis of alignment modules
Analysis of propagative temporal fusion

Thank You!