Jeong Hun Yeo∗, Seunghee Han∗, Minsu Kim, Yong Man Ro†
Integrated Vision and Language Lab, KAIST, South Korea
{sedne246,gkstmdgml211,ms.k,ymro}@kaist.ac.kr
as a speech recognition decoder, so that the speech sequence features obtained from a conformer encoder are directly mapped into text tokens, the input domain of LLaMA. Moreover, Wu et al. (2023a) have tried to address the inherent problem of mismatched sequence lengths between speech signals and text while taking LLaMA as a speech translation decoder. To this end, they compressed the speech sequence features and matched their length to that of the text.

However, while existing studies have primarily focused on incorporating LLMs with the audio speech modality, such integration remains unexplored for visual speech processing. In this paper, we propose a novel framework that integrates visual speech processing with LLMs. Specifically, we attempt to mitigate the homophene problem, one of the key challenges in the field of visual speech processing, by leveraging the rich context modeling capabilities of LLMs. Additionally, to address the training load arising from integrating the visual speech model with the LLM, we introduce the concept of a visual speech unit. Through the implementation of visual speech units, we propose a novel visual speech deduplication method that compresses redundant representations while preserving contextual information.

3 Method

Figure 1 shows the overall framework of the proposed Visual Speech Processing incorporated with LLMs (VSP-LLM). It includes a visual encoder that embeds the input video into the input space of a pre-trained LLM, a visual speech unit based deduplication module that discards redundant information in contiguous frames, and an instruction embedding component that serves as a task specifier. In the following, we describe each component in detail.

3.1 Visual-to-Text Space Mapping

Our primary objective is to employ the rich context modeling capability of the LLM in our visual speech modeling. To accomplish this, we need to represent the input video in a manner that aligns closely with linguistic information, thereby facilitating the association between visual inputs and the text space of the pre-trained LLM. Motivated by the recent success of self-supervised speech models (Hsu et al., 2021; Shi et al., 2022), whose learned representations are highly correlated with phonetic information (e.g., phonemes) (Pasad et al., 2023), we employ AV-HuBERT (Shi et al., 2022) as our base visual encoder. Then, a learnable visual-to-text embedding layer is introduced to map the visual representations into the input space of the LLM. We name this process visual-to-text space mapping.
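For illustration, a minimal PyTorch-style sketch of this visual-to-text space mapping is given below; the module name, the visual feature dimension (~1024, typical of an AV-HuBERT encoder), and the LLM embedding size (~4096, a LLaMA-scale value) are illustrative assumptions rather than the exact configuration of our model.

```python
import torch
import torch.nn as nn

class VisualToTextMapping(nn.Module):
    """Sketch: project visual speech features into the LLM input embedding space.

    Assumed (illustrative) dimensions: visual features from a frozen AV-HuBERT
    encoder (~1024-d) and an LLM expecting ~4096-d token embeddings.
    """

    def __init__(self, visual_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Learnable visual-to-text embedding layer.
        self.proj = nn.Linear(visual_dim, llm_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_frames, visual_dim) from the visual encoder.
        # The output can be concatenated with text token embeddings (e.g., an
        # instruction prompt) and fed to the pre-trained LLM.
        return self.proj(visual_feats)


# Usage sketch: map a 2-second clip (~50 frames at 25 fps) into the LLM space.
features = torch.randn(1, 50, 1024)       # stand-in for AV-HuBERT outputs
mapped = VisualToTextMapping()(features)  # (1, 50, 4096)
print(mapped.shape)
```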
To investigate how well the visual representation aligns with the text embedding space of the LLM, we compute the cosine similarity between the visual speech representation and the token embeddings of the LLM, mapping it to the text token with the highest similarity.
[Figure 2: textualized visual speech representations, panels (a) and (b), for the ground-truth utterance "race is a social category that has staggering biological consequences".]

Figure 2a shows an example of a textualized visual speech representation. An intriguing observation is that, with a well-structured visual-to-text space mapping, textualized visual speech representations can exhibit pronunciations resembling real words. However, we observe redundant information when mapping entire video frames to text, due to the similarity of adjacent frames. For instance, words like ’is’ and ’a’ are repeated multiple times, as is the word ’social’.
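As a rough illustration of how such textualized representations can be obtained, the sketch below assigns each mapped visual feature to the LLM token whose input embedding is most cosine-similar; the function name and the stand-in tensors are hypothetical, and the repeated token ids it produces for adjacent frames correspond to the redundancy discussed above.

```python
import torch
import torch.nn.functional as F

def textualize(visual_embeds: torch.Tensor, token_embeddings: torch.Tensor) -> list[int]:
    """Map each visual feature to the most similar LLM token (illustrative sketch).

    visual_embeds:    (num_frames, llm_dim) features after visual-to-text mapping.
    token_embeddings: (vocab_size, llm_dim) input embedding matrix of the LLM.
    Returns one token id per frame; adjacent frames often map to the same token.
    """
    v = F.normalize(visual_embeds, dim=-1)
    t = F.normalize(token_embeddings, dim=-1)
    sims = v @ t.T                       # (num_frames, vocab_size) cosine similarities
    return sims.argmax(dim=-1).tolist()  # highest-similarity token id per frame

# Usage sketch with random stand-ins for the real embeddings.
frames = torch.randn(50, 4096)         # mapped visual features
vocab = torch.randn(32000, 4096)       # e.g., a LLaMA-sized vocabulary
token_ids = textualize(frames, vocab)  # decode with the LLM tokenizer to view text
```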
Table 1: Performance comparisons with state-of-the-art supervised and self-supervised VSR methods. Compared to the self-supervised methods, the proposed unified model, which can perform both VSR and VST, achieves state-of-the-art recognition performance. Furthermore, our method obtains comparable performance even with a much smaller amount of training data. Note that the proposed model is the only method that can perform recognition and translation at once. (Columns: Method, Modality, Labeled data (hrs), and BLEU ↑ for En-It, En-Fr, En-Pt, En-Es, and Avg.)
[Figure 3: transcription examples comparing Ground Truth, AV-HuBERT, and the proposed method. For instance, for the ground truth "you destroy all the bone marrow in the cancer patient with massive doses of chemotherapy", AV-HuBERT outputs "you destroy all the bone barrow in the cancer patient with massive doces of chemotherapy", whereas the proposed method transcribes the sentence correctly; similar corrections appear for "kennedy", "air system", "planetary", and "judges".]
Figure 3: Qualitative results showing that the contextual modeling ability of the LLM adopted in our method can alleviate the homophene problem and other confusing cases. The red and blue words indicate the wrong predictions from AV-HuBERT. However, as shown in the examples, the proposed method can generate the correct words by considering the entire context (e.g., ‘barrow’ to ‘marrow’).
For instance, our method generates the accurate phrase "bone marrow" in the context of describing a cancer treatment process. Similarly, AV-HuBERT's transcription of "hair" is contextually inappropriate for a sentence discussing considerations regarding building systems; our method, however, accurately outputs "air", resulting in the correct phrase "air system". We also observe similar results in other cases beyond the homophene problem. For example, the proposed method can generate the word "planetary" in the context of asking about the relationship between greenhouse gases and "planetary". These results corroborate that our approach can more effectively comprehend contextual clues and generate more precise and natural answers, owing to the integration of the LLM.

[Figure 4: bar chart of word error rate (%) versus test-video length (0–2, 2–4, 4–6, and 6+ seconds) for Ma et al., Prajwal et al., AV-HuBERT, and the proposed method.]

Figure 4: VSR performance analysis on LRS3 with varying video lengths of test samples. Due to the strong contextual understanding ability of the LLM, the proposed method shows superior performance on longer videos.

Additionally, we evaluate the VSR performance across various video length segments to explore the effectiveness of the LLM in handling long speech. Figure 4 shows that WER decreases as video length increases. Notably, our proposed method exhibits outstanding recognition performance, with a WER of 13.9% on videos longer than 6 seconds. Furthermore, our method demonstrates consistent performance improvements as the video length increases, compared to other methods. This indicates the effectiveness of the LLM's context modeling in longer video utterances, which demand a more comprehensive understanding of context.

Table 3: Analysis of computational efficiency with varying numbers of visual speech unit clusters. When the deduplication strategy is adopted, the proposed method obtains comparable performance with a greatly reduced sequence length and training FLOPs.

Number of clusters | BLEU ↑: En-It | En-Fr | En-Pt | En-Es | Avg | Length of sequence | FLOPs (P)
- (no deduplication) | 18.4 | 23.7 | 22.4 | 23.6 | 22.0 | 1.00 | 27.2
2000 | 18.1 | 22.9 | 22.3 | 21.4 | 21.2 | 0.70 | 21.3 (21.7%)
200 | 20.3 | 22.8 | 23.3 | 22.0 | 22.1 | 0.53 | 17.9 (34.2%)
50 | 18.5 | 22.2 | 22.9 | 21.5 | 21.3 | 0.45 | 16.1 (40.8%)

4.3.3 Effectiveness of Deduplication

We conduct experiments to assess the effectiveness of our deduplication strategy. For the deduplication process, the number of clusters for the visual speech units must be determined, and we show the results obtained with different numbers of clusters in Table 3.
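To make the deduplication idea concrete, the following is a minimal sketch under our description: each frame feature is assigned a discrete visual speech unit (e.g., a k-means cluster id, with 50, 200, or 2000 clusters as in Table 3), runs of identical units are collapsed, and the features within each run are pooled; the helper name and the choice of mean pooling are assumptions made for illustration.

```python
import torch

def deduplicate(visual_feats: torch.Tensor, unit_ids: torch.Tensor):
    """Collapse runs of frames that share the same visual speech unit (sketch).

    visual_feats: (num_frames, dim) continuous features from the visual encoder.
    unit_ids:     (num_frames,) discrete units, e.g., k-means cluster assignments.
    Returns the deduplicated features and units, shortening the sequence the LLM sees.
    """
    merged_feats, merged_units = [], []
    start = 0
    for i in range(1, len(unit_ids) + 1):
        # Close the current run when the unit changes or the sequence ends.
        if i == len(unit_ids) or unit_ids[i] != unit_ids[start]:
            merged_feats.append(visual_feats[start:i].mean(dim=0))  # assumed: mean-pool the run
            merged_units.append(int(unit_ids[start]))
            start = i
    return torch.stack(merged_feats), merged_units

# Usage sketch: 10 frames whose units contain repeated neighbours.
feats = torch.randn(10, 1024)
units = torch.tensor([3, 3, 3, 7, 7, 1, 1, 1, 1, 5])
dedup_feats, dedup_units = deduplicate(feats, units)
print(dedup_feats.shape, dedup_units)  # torch.Size([4, 1024]) and [3, 7, 1, 5]
```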
[Figure 5: deduplication example for the utterance "What do you".]

Figure 5: Visualization results showing how video frame features are deduplicated and mapped into visual speech units. By doing so, the redundant frame features can be reduced efficiently.

Table 4: Impact of the amount of labeled data. It shows that a small amount of labeled data is sufficient to construct a unified VSR and VST model by leveraging the contextual understanding capabilities of the LLM. (Columns: Method, Labeled Data (hrs), WER (%) ↓, and BLEU ↑ for En-It, En-Fr, En-Pt, En-Es, and Avg.)
Pingchuan Ma, Yujiang Wang, Stavros Petridis, Jie Shen, and Maja Pantic. 2022b. Training strategies for improved lip-reading. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.

Dmitriy Serdyuk, Otavio Braga, and Olivier Siohan. 2022. Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video. arXiv preprint arXiv:2201.10439.

Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Abdelrahman Mohamed. 2022. Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv preprint arXiv:2201.02184.

Brendan Shillingford, Yannis M. Assael, Matthew W. Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, Marie Mulville, Misha Denil, Ben Coppin, Ben Laurie, Andrew W. Senior, and Nando de Freitas. 2019. Large-scale visual speech recognition. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 4135–4139. ISCA.

Themos Stafylakis and Georgios Tzimiropoulos. 2017. Combining residual networks with lstms for lipreading. In Proc. Interspeech.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al. 2023a. On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE.

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023b. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519.

Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, and Yong Man Ro. 2023a. Akvsr: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model.

Jeong Hun Yeo, Minsu Kim, and Yong Man Ro. 2023b. Multi-temporal lip-audio memory for visual speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022a. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, and Furu Wei. 2022b. Speechut: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1663–1676.

Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, and Mingli Song. 2020. Hearing lips: Improving lip reading by distilling speech recognizers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6917–6924.

Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, and Furu Wei. 2023. Vatlm: Visual-audio-text pre-training with unified masked prediction for speech representation learning. IEEE Transactions on Multimedia.
VSR (En)
Ground Truth: it was faulty and most of the time I had to restart it over and over before it worked
Prediction: it was failing most of the time I had to restart it over and over before it worked
Ground Truth: like we evolved on this planet in the context of all the other animals with which we share
Prediction: can we evolve on this planet in the context of all the other animals with which we share

VST, Spanish (En-Es)
Ground Truth: tenemos las herramientas pero perdemos la voluntad y el momento colectivo
Prediction: tenemos las herramientas pero falta la voluntad colectiva y el momento
Ground Truth: hay amor y hay amistad y hay protección

VST, Italian (En-It)
Ground Truth: utilizza esperienza basata su situazioni simili per imparare a gestire
Prediction: utilizza l'esperienza passata basata su situazioni simili per imparare a fare
Ground Truth: il testo si è sviluppato da questo slash

VST, Portuguese (En-Pt)
Ground Truth: e eu quero fazer o ponto que como membros da sociedade que precisamos
Prediction: e eu quero fazer o ponto de que como membros da sociedade podemos fazer
Ground Truth: mas a magnitude do problema é quando precisamos aceitar

Figure 7: Examples of VSR and VST predictions produced by our proposed model on the LRS3 test set. Deletions from the ground-truth text are highlighted in red, while substitutions are shown in blue.