
MASTER THESIS DISSERTATION, MASTER IN COMPUTER VISION, SEPTEMBER 2023

Large Language Models for Document Visual Question Answering
Anna Oliveras Tous

Abstract

This study introduces innovative methods for Document Visual Question Answering (DocVQA) through the utilization of
Large Language Models (LLMs). Our approach involves fine-tuning the Flan-T5 model on the SP-DocVQA dataset with diverse
context types, revealing the effectiveness of incorporating spatial information. By utilizing both the document’s textual content
and the corresponding bounding box locations of words, we achieve the best performance, reaching an ANLS score of 0.76, using
only the text modality. Furthermore, we attempt to incorporate word recognition in the language model itself. To this purpose, we
present a multimodal DocVQA pipeline and establish a pre-training task aimed at aligning the visual features of cropped word
images with the LLM space. This approach enables the LLM to effectively understand and process visual information. Finally,
we explore two novel methods for performing DocVQA by utilizing the visual embeddings of words. These approaches represent
initial steps toward developing a comprehensive and robust solution for addressing this challenging task in an end-to-end manner.

Index Terms

Document Visual Question Answering, DocVQA, Large Language Models, Sequence to Sequence Transformer, Text to Text
Transformer and Vision-Language Transformer.

I. INTRODUCTION

A document is a structured representation of information, typically in a physical or digital form, which conveys knowledge, data, or a message to its intended audience. The way information is organized within a document plays a crucial role
in conveying the intended message effectively. In this context, we can break down the aspects of a document into three key
components. First, textual content is the written or typed language used to convey information that can include sentences,
paragraphs, headings, and various writing styles. Second, spatial arrangement: contrary to classic Natural Language Processing (NLP), where text is often treated as a linear sequence of words, documents are two-dimensional entities. Spatial arrangement
includes elements like layout, columns, indentation, margins, and other formatting choices that influence how readers navigate
and interpret the content. Third, visual information such as graphs, images, separators, fonts, and colours provide additional
context and enhance the understanding of the content.

Interpreting information from documents is crucial across various sectors such as finance, insurance, business, public
administration, and personal document management, where dealing with large volumes of documents is essential. However,
extracting meaningful insights and answers from these documents can be challenging, particularly when users have specific
questions that require both textual comprehension and visual understanding. Traditional approaches to question-answering have
primarily focused on textual information, overlooking the valuable insights that visual elements can provide. Nevertheless,
recent advancements in document intelligence have addressed this limitation by combining the fields of Computer Vision
(CV) and Natural Language Processing (NLP). Specifically, Document Visual Question Answering (DocVQA), a particular
area within document intelligence, employs machine learning models to analyze document images and provide responses to
questions about the content. In the realm of DocVQA, responses are extractive, meaning they are always directly drawn from
the document context.

In recent years, NLP has witnessed a surge in popularity, largely driven by the remarkable success of Large Language Models
(LLMs). Notably, the Generative Pre-trained Transformer (GPT) models [1], such as GPT-3 [2], have demonstrated remarkable
Author: Anna Oliveras Tous, contact email
Advisor 1: Dr. Dimosthenis Karatzas, Vision, Language and Reading Group, Computer Vision Center, Universitat Autònoma de Barcelona
Advisor 2: Dr. Mohamed Ali Souibgui, Vision, Language and Reading Group, Computer Vision Center, Universitat Autònoma de Barcelona
Advisor 3: Ruben Perez, Vision, Language and Reading Group, Computer Vision Center, Universitat Autònoma de Barcelona
Thesis dissertation submitted: September 2023

success in various tasks, including question answering and contextual comprehension. These achievements stem from their
extensive pre-training on massive amounts of text data, enabling them to understand and generate human-like responses. The
more recent emergence of GPT-4 [3], [4] has further expanded the capabilities of LLMs in multimodal scenarios. For example,
GPT-4 has demonstrated its potential in generating accurate descriptions of images and explaining complex visual phenomena.
However, certain models like GPT-4 are not accessible to the general public. Moreover, working with GPT models can be
challenging due to their large size, which necessitates substantial computational resources when fine-tuning them for specific
tasks like DocVQA. Fortunately, there is a recent development in the form of open-source LLMs like T5 [5] and LLAMA [6],
which are now available and can be fine-tuned for particular tasks, offering increased accessibility and flexibility.

Motivated by this, our project aims to harness the power of LLMs in resource-constrained environments for multimodal
DocVQA tasks. We recognize the vast amount of knowledge present in these language models and expect that incorporating
them into the DocVQA pipeline will enhance the overall performance. In our research, we initially focus on the textual content
within documents. We begin by treating text as a one-dimensional string, and then we explore a two-dimensional approach
that considers the spatial arrangement of text within the document. For both approaches, we start by locating and recognizing
text within documents using Optical Character Recognition (OCR) software. Subsequently, we aim to enhance the process by
training a Large Language Model to directly interpret text from word images, creating a multimodal DocVQA pipeline. This
approach offers a more comprehensive and context-aware way of working with textual content in documents. Overall, in this
research, we have four main objectives:

• Establish text-only baselines for DocVQA.


• Demonstrate the utility of spatial information in addressing DocVQA challenges.
• Map the visual features of document word images to the Large Language Model’s space to enable comprehension of
visual information.
• Achieve comparable or improved performance when transitioning from textual to multimodal domain.

II. STATE OF THE ART

In this section, we present a comprehensive overview of the cutting-edge advancements in the field of LLMs. Furthermore,
we explore the latest breakthroughs in Multimodal LLMs, which combine text with other modalities like images to enhance their performance and extend their usability across different tasks. Finally, we delve into various methodologies in Document Question
Answering.

A. Large Language Models

Large language models have revolutionized natural language processing and enabled significant advancements in various
language-related tasks. These models, typically based on transformer architectures, have demonstrated remarkable capabilities
in tasks such as generating language, answering questions, and classifying text. In fact, early models like BERT [7] and GPT-2
[8] have greatly influenced the progress in natural language processing. The first one, which stands for Bidirectional Encoder
Representations from Transformers, brought about advancements through techniques such as masked language modelling and
next-sentence prediction in its pre-training phase. It operates bidirectionally, considering both the preceding and following
context of a word. BERT does not directly generate text as output but it embeds the input text into a fixed-dimensional vector
representation that can be used as features for downstream tasks. The second one, GPT-2 (Generative Pre-trained Transformer)
introduced an autoregressive language modelling approach. It predicts the next word in a sequence based on the context
preceding it, resulting in the generation of coherent and contextually relevant text.

Later, the T5 (Text-To-Text Transfer Transformer) [5] model was proposed as a unified framework that addresses various
natural language processing tasks by treating them as text generation challenges. It achieves impressive performance on different
language benchmarks by training on a diverse range of tasks. The model is based on the original Transformer architecture
with minimal modifications while incorporating pre-training on a substantial volume of data using a novel de-noising task.
Moreover, Flan-T5 [9], built upon the T5 framework, leveraged additional data for fine-tuning, leading to even more impressive
results across various natural language processing tasks.

Better results were obtained by scaling up language models. For example, GPT-3 [2], with its massive 175 billion parameters,
made significant strides across multiple language benchmarks. This model is also autoregressive like its predecessors but
surpasses previous non-sparse models by 10 times in terms of parameter count. The model exhibits strong performance across

various NLP datasets, including translation and question answering, although it can struggle in few-shot learning on certain
datasets.

To make large language models understand and follow natural language instructions, researchers have explored different approaches that help them perform real-world tasks and avoid untruthful or unhelpful outputs. One of the techniques
is known as instruction-tuning and consists of aligning pre-trained models with human intent, instructions, and feedback to
facilitate conversational interactions and complex question-answering. One example is InstructGPT [10] which aligns the GPT-3
model with user intent through fine-tuning with human feedback. The dataset used consists of labelled demonstrations and
rankings of model outputs, which are used for supervised and reinforcement learning-based fine-tuning. Many other models
have emerged as a result of instructing previous models like ChatGPT [11], FLAN-T5 [9], FLAN-PaLM [9] and have proven
to enhance the zero- and few-shot generalization capabilities of large language models.

Recently, Meta released LLAMA [6], a state-of-the-art foundational LLM that is much smaller than the previously mentioned
models. In fact, the model is available in several sizes (7B, 13B, 33B, and 65B parameters) which is crucial for the research
community who do not have access to large infrastructure. Smaller models are trained on more tokens but are easier to retrain
and fine-tune for specific potential product use cases. Further improvements of this model include Alpaca 7B [12], fine-tuned
from the LLaMA 7B model on 52K instruction-following demonstrations. As a result, performance similar to ChatGPT [11] is achieved with a surprisingly small model. Similarly, Vicuna-13B [13] fine-tunes LLaMA-13B on user-shared conversations collected from ShareGPT, outperforming Stanford Alpaca in more than 90% of cases.

B. Multimodal Large Language Models

In the field of multimodal tasks, certain models such as Flamingo [14] have made significant advancements. These models
achieve this by combining pre-trained vision encoders and language models through the use of gated cross-attention mechanisms.
By incorporating visual inputs into language models, they have showcased remarkable learning capabilities even with limited
examples. To address the expensive nature of vision-and-language pre-training, BLIP-2 [15] proposed a cost-effective pre-
training strategy. This approach involved using a frozen image encoder and a frozen language model, effectively aligning
visual features with a Querying Transformer. Despite its lightweight nature, this model managed to learn meaningful visual
representations and produce outputs that are easily understandable to the language model.

Furthermore, T5 [5] emerged as a prominent basis for cutting-edge techniques in a wide range of natural language processing
(NLP) and multimodal tasks. Unified-IO [16] addressed the challenge of handling various artificial intelligence (AI) tasks,
including pose estimation, object detection, depth estimation, image generation, region captioning, referring expression, question
answering, and paraphrasing. This model tackles the issue of diverse inputs and outputs by standardizing them as a sequence of
discrete vocabulary tokens. Through training a single transformer-based architecture on over 90 different datasets, Unified-IO
achieves a shared representation across tasks.

LaTr [17] introduced an innovative multimodal architecture designed for Scene Text Visual Question Answering (STVQA).
The researchers explore the impact of each modality, emphasizing the significance of the language module enriched with layout
information. They propose a unified pre-training approach that employs text from huge amounts of scanned documents and
spatial cues. Their method enables decoding without relying on a specific vocabulary and demonstrates strong generalization
capabilities beyond the training vocabulary showing good performance on real scenes with text. Additionally, LaTr improves
resilience to OCR errors, a common challenge encountered in STVQA.

Recently, there has been a surge in multimodal instruction-tuned models. LLaVA [18], for instance, generates a high-quality
multimodal language-image instruction-following dataset by representing visual images as captions and bounding boxes. It
utilizes a pre-trained CLIP ViT-L/14 model [19] as a visual encoder and Vicuna as the language model. Another model,
MiniGPT4 [20], connects different modalities using a single projection layer and employs ViT-G/14 [21] and Q-Former for
encoding images. It updates the projection layer using combined datasets and conversational data.

Other works like mPLUG-Owl [22] integrate unimodal and multimodal data during training, fine-tuning the language module
on instruction-following data. It consumes a large amount of data during training, more than similar approaches [18], [20] but
exhibits enhanced visual understanding and reasoning abilities.

Finally, the state-of-the-art GPT-4 [4], OpenAI’s most advanced system, produces safer and more accurate responses than
previous models. It is a large-scale multimodal model pre-trained on aligned image-text data and showcases improved visual

understanding and reasoning capabilities. However, this model is not publicly available to the research community.

C. Document Question Answering

Since the introduction of the SP-DocVQA dataset [23], numerous techniques have been employed to address this task from
various angles. This benchmark dataset is specifically designed for the task of visual question answering on document images
and comprises real-world documents with diverse layouts and textual and visual elements, providing a valuable resource for
training and evaluating DocVQA models.

In the field of Natural Language Processing (NLP), several frameworks have been introduced to enhance the BERT
architecture [24]. In its original proposal, BERT introduced masked language modelling, a technique that enables the learning of
bidirectional representations. This approach involves predicting the original vocabulary ID of a randomly masked word token
by analyzing its surrounding context. Later works involve adjusting crucial hyperparameters during training and proposing
novel pre-training tasks [25]–[29].

One notable development is LayoutLM [30], which builds upon BERT and focuses specifically on document-related tasks.
It introduces a unique technique of decoupling position embeddings into two dimensions using token bounding box infor-
mation obtained from Optical Character Recognition (OCR). By integrating visual and textual features, LayoutLM enhances
performance in downstream tasks.

Additionally, LayoutLMv2 [31] and TILT [32] incorporate visual information into a multimodal transformer. They explicitly
model relative positions by introducing a learnable bias to self-attention scores. TILT goes a step further by employing a
decoder to dynamically generate answers rather than simply extracting them from the context. LayoutLMv3 [33] extends the
previous version by utilizing visual patch embeddings instead of relying on a convolutional neural network (CNN) backbone.
It employs pre-training with three distinct objectives: aligning text, layout position, and image context. This approach aims to
enhance the alignment between text and visual elements in documents.

In contrast to the aforementioned methods that rely on text recognized with an off-the-shelf OCR, Donut [34] and Dessurt
[35] are end-to-end encoder-decoder methods. These approaches take the document image and the question as input, implicitly
learning to read and understand the semantics and layout of the images. By directly leveraging the image content alongside
textual input, these methods offer a comprehensive approach to document understanding.

Furthermore, LATIN-prompt [36] introduced a layout and task-aware instruction prompt that recovers the layout information
among text segments from OCR tools by appropriate spaces and linebreaks. Moreover, it ensures that the model generates
answers that meet the requirements by providing detailed descriptions of the task. In this way, they align document image
question answering to off-the-shelf instruction-tuning language foundation models to directly utilize their zero-shot learning
capability.

Finally, among the methods proposed in the Document Visual Question Answering (DocVQA) challenge [23], Ernie-Layout 2.0 [37] achieves the best performance, reaching a 0.88 ANLS. Ernie offers an innovative document pre-training method
that enhances layout understanding by merging text, layout, and image features. Initially, sequences are reorganized during
serialization and a reading order prediction task is introduced to learn document structure. To enhance layout awareness,
spatial-aware disentangled attention is added to the multimodal transformer, along with a replaced regions prediction task
during pre-training.

III. DATASET

In this section, we present the dataset used in our study. Moreover, we state the specific data subsets we use in our experiments
due to the Large Language Model’s input sequence length limitation.

A. SP-DocVQA Dataset

The SP-DocVQA dataset, proposed in [23], contains 50k questions defined over 12k images collected from the UCSF Industry Documents Library, together with the corresponding question and answer annotations. The majority of the documents were binarized

Fig. 1. Examples of documents in the DocVQA [23] dataset. For the highlighted document, possible questions and the corresponding answers are shown.

TABLE I
STATISTICS AND SPLIT DETAILS OF THE SP-DOCVQA DATASET

DocVQA statistics                              Value
Unique questions                               70.72%
Unique answers                                 63.2%
Average question length                        8.12 words
Average answer length                          2.17 words
Average number of text tokens of the context   182.75

Split         Images     Questions
Train         10,194     39,463
Validation     1,286      5,349
Test           1,287      5,188
Total         12,767     50,000

and they include images, tables, forms and figures apart from text. The documents exhibit a remarkable diversity, stemming
from five distinct industries: Tobacco, Food, Drug, Fossil Fuel, and Chemical. Furthermore, they span an extensive time
frame, ranging from the early 1900s to 2018. It is worth mentioning that there are various types of documents like letters,
forms, reports, tables, contracts, and notes among others. Moreover, the text may appear typewritten, printed, handwritten and
born-digital.

All the DocVQA questions are extractive, so the answer is always somewhere in the textual content. An example of different
types of documents and possible questions and answers is shown in Figure 1. In the SP-DocVQA dataset, the data is split randomly in an 80-10-10 ratio into train, validation and test splits, as seen in Table I along with some additional statistics of the
dataset.

B. Input sequence limitation

The main Large Language Model (LLM) that we use is the Flan-T5 model [9], which has a maximum token input size
of 2048. Thus, we had to exclude the questions from the SP-DocVQA dataset [23] whose context exceeds this token limit.
Since various formats of context will be employed and fed into the Large Language Model (LLM), the context format with the most stringent token constraint determines the amount of data used in the experiments, ensuring comparability across them. Thus, our experiments involve 23,635 questions from the training set and 2,963 questions
for validation purposes.
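As an illustration of this filtering step, the following minimal sketch (not the thesis code) counts the tokens of a prompt with the Hugging Face tokenizer and discards questions whose context exceeds the limit; the checkpoint name and prompt template are assumptions.

# A minimal sketch (not the thesis code) of the token-limit filtering described above.
# The checkpoint name and the prompt template are assumptions.
from transformers import AutoTokenizer

MAX_TOKENS = 2048
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")  # assumed Flan-T5 3B checkpoint

def fits_in_context(question: str, context: str) -> bool:
    """Return True if the tokenized prompt stays within the model's input limit."""
    prompt = f"{context}\nQuestion: {question}"          # hypothetical prompt layout
    return len(tokenizer(prompt).input_ids) <= MAX_TOKENS

# Keep only the questions whose most verbose context variant still fits:
# kept = [q for q in questions if fits_in_context(q["question"], q["context"])]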

While it is not the preferred scenario, we will not be able to furnish results for the test set due to its unavailability to the
public. While it is possible to submit the model’s predictions to the SP-DocVQA dataset’s Challenge website [23], we must

acknowledge that the LLM’s token limitations might prevent us from providing answers to all questions. Furthermore, the risk
of overfitting to the validation set seems minimal, given the stable trajectory of the validation loss.

Finally, for the reading pre-task, we will use randomly cropped words from the images of the documents. In total, we will
use 80,000 images of words for training, 10,000 for validation and 10,000 for testing.

IV. METHOD

In this section, we provide an overview of the key components used in our study. Firstly, we define the Document Visual
Question Answering (DocVQA) problem. Then, we present the architecture of the primary Large Language Model we use,
the Flan-T5 model [9], along with its corresponding Tokenizer. We also explain our methods for conducting Document Visual
Question Answering, covering both text-to-text scenarios and the multimodal domain. Lastly, we discuss the loss function
employed in training our models.

A. DocVQA problem definition

Document Visual Question Answering (DocVQA) is a computer vision and natural language processing (NLP) task that
combines both text and image understanding. In DocVQA, the goal is to answer questions about the content of a document
or image. This task involves processing both textual information and visual information to generate accurate textual answers
to questions posed in natural language.

The input to the model is a document or an image that contains text and possibly other visual elements, such as images,
diagrams, or tables. This document can be a scanned page from a book, a screenshot of a webpage, or any other source of
textual and visual content. Along with the document or image, a natural language question is provided as input. In our case,
for the SP-DocVQA dataset [23], all questions are extractive, so the answer can always be retrieved from the document. The
system needs to perform both text understanding and image understanding and generate a textual answer in natural language,
which has to be accurate and concise.

B. Flan-T5

Flan-T5 [9] stands as an enhanced version of the T5 model [5], which was initially proposed by Google researchers in
2020. The term ”Flan” is derived from ”Fine-tuning language models”. The T5 model presents a unified framework with the
capability to tackle diverse tasks such as translation, linguistic acceptability, sentence similarity and document summarization.
In all these tasks, the model takes text as input and is trained to produce the corresponding target text. Notably, what sets
this approach apart is that the model architecture, hyperparameters, and loss function remain consistent across all tasks. This
unified approach utilizes transfer learning in the realm of natural language processing (NLP), proving to be a potent technique
for achieving cutting-edge performance.

Regarding the architecture of the model, it draws its foundation from the Transformer architecture [38], which has proven
effective in various NLP scenarios. A crucial element within this architecture is self-attention [39], a variant of the attention
mechanism [40], [41], which processes a sequence by replacing each element with a weighted average of the other elements
within the sequence. The architecture of T5 [5] takes the form of an encoder-decoder Transformer implementation as shown in
Figure 2. While the original paper suggests multiple architecture sizes, our research centres around the 3B model. This model
is the second-largest option proposed, chosen due to the constraints of limited memory resources that make the 11B model
unfeasible to utilize on our available infrastructure.

The input sequence of tokens undergoes several steps in a Transformer model. First, the tokens are mapped into embeddings.
This sequence is then processed through the encoder, which comprises stacked blocks, each consisting of a self-attention layer
and a small feed-forward network. In T5, a simplified version of layer normalization is applied to each subcomponent’s input.
This involves rescaling the activation values without adding any bias. A residual skip connection connects the input and output
of each subcomponent, adding the input to the output. Dropout, a regularization technique, is applied at various points. It is
used within the feed-forward network, on the skip connection, on attention weights, and at the input and output of the entire
stack.

Fig. 2. T5 3B model [5] architecture based on the Transformer model [38]. It is an encoder-decoder structure based on stacked self-attention and point-wise,
fully connected layers. Differently from the original Transformer, the positional encoding is a simplified relative position embedding and the layer normalization does
not have bias.

The Transformer model uses two masking strategies. In the encoder and encoder-decoder attention, fully visible masking
enables self-attention to consider the entire input sequence when generating corresponding outputs, aiding in prediction using
context. On the other hand, in the decoder, causal masking ensures that while generating the output sequence, the model cannot
access future input elements (those beyond the current position). This safeguards the model from using information from the
future. The output of the decoder block is fed into a dense layer with a softmax output.

The Transformer’s attention mechanisms are divided into separate heads. These heads operate independently and their outputs
are concatenated before further processing. A position signal is included to maintain order information in the self-attention
process, which is, by design, order-independent. A simplified form of relative position embeddings [42] is used for this purpose.
These embeddings introduce an offset-dependent learned embedding according to the offset between the “key” and the “query”
being compared in the self-attention mechanism. In this case, a scalar is added to the corresponding logit used for computing
the attention weights. For efficiency, the position embedding parameters are shared across all layers, but each attention head
within a layer employs a distinct learned position embedding. Layer normalization is applied outside the residual path.

Both the encoder and decoder are made up of 24 blocks, each containing self-attention, optional encoder-decoder attention,
and a feed-forward network. The feed-forward networks within each block consist of a dense layer with an output dimension of d_ff = 16,384, followed by a ReLU nonlinearity, and another dense layer. The key and value matrices for all attention mechanisms have an inner dimensionality of d_kv = 128, and all attention mechanisms consist of 32 heads. All other sub-layers and embeddings have a dimensionality of d_model = 1024, resulting in a total of around 2.8 billion parameters. Dropout with a probability of 0.1 is employed for regularization.
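For reference, the dimensions above can be written out with the Hugging Face T5Config; this is an illustrative sketch only, since the pretrained checkpoint already ships with these values.

# Illustrative only: the T5-3B dimensions quoted above, written as a Hugging Face T5Config.
from transformers import T5Config

config = T5Config(
    vocab_size=32128,     # 32,000 word pieces plus extra sentinel tokens
    d_model=1024,         # dimensionality of embeddings and sub-layers
    d_kv=128,             # inner dimension of the key/value projections
    d_ff=16384,           # hidden dimension of the feed-forward networks
    num_layers=24,        # encoder blocks (the decoder mirrors this depth)
    num_heads=32,         # attention heads per layer
    dropout_rate=0.1,     # dropout probability used for regularization
)
print(config)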

In this unified text-to-text environment, training is done using standard maximum likelihood, employing teacher forcing [43]
and cross-entropy loss. The T5 model is pre-trained on the Colossal Clean Crawled Corpus (C4) dataset, introduced in the
same paper. This dataset comprises numerous gigabytes of English text scraped from the internet. Such a large and diverse
dataset is key for the unsupervised training approach. The authors highlight that BERT-style [24] objectives that corrupt and replace spans of words lead to better performance. Across various corruption rates, models exhibit relatively consistent behaviour, so it is advisable to employ objectives leading to concise target sequences, as shorter sequences are more cost-effective to pre-train. Afterwards, multiple fine-tuning tasks are conducted, including text classification with the GLUE [44]
and SuperGLUE [45] datasets, abstractive summarization with CNN/Daily Mail [46], SQuAD [47] for question answering and
WMT14 [38] for translation tasks from English to German, French and Romanian.

Flan-T5 [9] proposes instructional fine-tuning, which results in robust zero-shot, few-shot, and chain-of-thought (COT) capabilities that make the model reason step by step. One key aspect is the variety and large amount of data used. The
fine-tuning data encompasses 473 datasets and 1,836 tasks and the following 4 mixtures of data:

1) T0-SF: Comprises tasks from T0 [48] that do not overlap with the data used in Muffin (SF stands for ”Sans Flan”). Some
examples of tasks are commonsense reasoning, question generation, closed-book question answering (QA), adversarial QA, extractive QA, title/context generation, topic classification, and struct-to-text, among others.
2) Muffin: Multi-task fine-tuning with instructions comprises 80 tasks like natural language inference, code instruction generation, program synthesis, dialog context generation, closed-book QA, conversational QA, and code repair, among others.
3) Natural Instructions v2: This is a reduced version of the NIV2 dataset [49] (1554 tasks) since 44 tasks were removed
because they are related to MMLU [50], used for evaluation. NIV2 tasks comprise cause-effect classification, common-
sense reasoning, named entity recognition, toxic language detection, question answering, question generation, program
execution, and text categorization among others.
4) COT (reasoning): Chain-of-thought data with annotations. It is a combination of 9 datasets from prior studies. Tasks
include arithmetic reasoning, commonsense reasoning, multi-hop reasoning and natural language inference.

Moreover, the fine-tuning process includes instances both with and without exemplars, as well as with and without chain-
of-thought annotations. This results in a model with the capability to undertake COT reasoning in a zero-shot scenario. This
joint fine-tuning with both COT and non-COT data is needed to preserve the performance on non-COT tasks while enhancing
COT performance.

C. Tokenizer

In the context of Sequence to Sequence models, the tokenizer plays a pivotal role in bridging the gap between raw textual
input and the machine learning processes that follow. By breaking down the input text into smaller, manageable units known
as tokens, the tokenizer serves as a crucial intermediary that enables the model to effectively process, understand, and generate
human-like text. This transformative step not only enhances the model’s ability to comprehend the intricacies of language but
also facilitates various downstream tasks such as text generation, translation, and summarization.

In this work, the T5 tokenizer implementation from Huggingface Transformers [51] is used. It is based on the SentencePiece
approach as introduced by Kudo and Richardson [52] which encodes text into units referred to as WordPiece tokens, following
the concepts from Sennrich et al. [53] and Kudo [54]. The vocabulary has a fixed size of 32,000 word pieces.

SentencePiece has four main components: the Normalizer, the Trainer, the Encoder and the Decoder. The Normalizer is used to normalize semantically-equivalent Unicode characters into canonical forms, and the Trainer learns the sub-word model from a corpus. The Encoder internally normalizes the input text using the Normalizer and tokenizes it into a sub-word sequence with the sub-word model trained by the Trainer. Finally, the Decoder converts the sub-word sequence back into the normalized text, so it performs the detokenization. One particularity is that SentencePiece uses ID mapping for managing the vocabulary, so it can directly convert text into a sequence of IDs and vice versa. This can be summarized with the following equation: Decode(Encode(Normalized(text))) = Normalized(text).

Moreover, there are IDs reserved for special symbols: unknown (<unk>), BOS (<s>), EOS (</s>) and padding (<pad>). Regarding white spaces, SentencePiece replaces them with the meta symbol ‘▁’ (U+2581) prepended to the word that follows the white space, and upon decoding it simply replaces every ‘▁’ with a white space. This makes the text reconstructible, although the value of ‘▁’ itself is lost if it appears in the original input.
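The round-trip behaviour described above can be illustrated with the Hugging Face T5 tokenizer; the checkpoint name and the example sentence are assumptions, and the printed pieces are indicative only.

# Sketch of the SentencePiece round trip with the Hugging Face T5 tokenizer
# (checkpoint name assumed; printed pieces are indicative).
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

text = "What is the total amount?"
pieces = tokenizer.tokenize(text)                      # word pieces, '▁' marks a word start
ids = tokenizer(text).input_ids                        # piece IDs, with the EOS id appended
decoded = tokenizer.decode(ids, skip_special_tokens=True)

print(pieces)      # e.g. ['▁What', '▁is', '▁the', '▁total', '▁amount', '?']
print(decoded)     # "What is the total amount?"  (the normalized text is recovered)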

D. Proposed method for Text-to-text DocVQA

The initial idea is centred around using large language models capable of processing text and leveraging their understanding
of NLP tasks to address Document Visual Question Answering. To accomplish this, textual information from documents is
extracted using Optical Character Recognition (OCR) software, which provides each word as a string along with its normalized
location in the document through bounding box coordinates. Thus, the OCR not only captures individual words but also their
spatial arrangement using bounding boxes, enhancing the model’s understanding of the document’s layout. The question and
OCR data are then combined into a prompt, which is tokenized before inputting it into the LLM. Then the Large Language
Model generates the answer that is later detokenized to obtain the final text answer. The proposed method’s flow is depicted
in Figure 3.

Fig. 3. Initially proposed architecture for DocVQA. First, the document image is passed to the OCR to obtain the text. Then with the extracted words from
the OCR and the question, a prompt is generated. This prompt is then tokenized and passed through the Large Language Model (LLM) to obtain the answer,
which is finally detokenized.

Different ways of using the OCR information will be explored in the experiments of Section V. The Flan-T5 3B model [9]
will serve as the primary LLM in most cases. This model is recognized for its strong performance in various tasks involving
language understanding and generation while not being one of the largest ones. It is worth mentioning that the implementation
needed to consider the constraints posed by limited memory resources. Moreover, the model is publicly available for research
purposes on the FastChat GitHub repository [55].
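A minimal sketch of this text-to-text flow using the Hugging Face Transformers API is given below; the checkpoint name, prompt wording and generation settings are assumptions rather than the exact thesis configuration.

# Minimal sketch of the pipeline in Figure 3 (checkpoint, prompt wording and
# generation settings are assumptions, not the exact thesis configuration).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

def answer(question: str, ocr_words: list[str]) -> str:
    context = " ".join(ocr_words)                                  # OCR words as plain text
    prompt = f"I have this document with the following words: {context}\nQuestion: {question}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    output_ids = model.generate(**inputs, max_new_tokens=32)       # the LLM generates the answer
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)  # detokenized answer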

E. Proposed method for Multimodal DocVQA

The proposed approach builds upon the concept of leveraging multimodal models to enhance the understanding of textual
information by incorporating visual embeddings similar to LLaVA [18] and MiniGPT4 [20]. In this case, the idea is to replace
each word in the OCR text with its corresponding visual embedding obtained from processing images of the words using CLIP
[56] visual encoder, which is based on the visual transformer ViT [57]. Specifically, we use the CLIP implementation from
Huggingface based on the ViT-B/32 model.

Our multimodal pipeline involves cropping each word from the document using the bounding box information, resizing the
images to 224 × 224, processing these images through the CLIP [56] visual encoder, and extracting meaningful visual features
using the CLS token from the CLIP model. To ensure compatibility with downstream tasks, a linear layer is introduced to
map the dimension of the CLS embedding to fit the input requirements of the Flan-T5 model [9]. The prompt consists of two
parts: the first part involves introducing the document to the LLM by stating, “I have this document with the following words:”
while the second part contains the actual question. Both parts are tokenized and passed through the Flan-T5 embedding. These
word embeddings are concatenated with the visual word embeddings as shown in Figure 4. Finally, this is passed to the
Encoder-Decoder of the Flan-T5 and the output is detokenized to obtain the answer.

It is important to note that, since CLIP’s original training might not have been explicitly focused on reading text within images, we will fine-tune the CLIP visual encoder on data containing textual images, as explained in the Experiments Section V.
This fine-tuning, combined with the addition of the linear layer, is expected to bridge the gap between visual and linguistic
understanding, enabling the Large Language Model to comprehend the visual representations of the document content. However,
a critical consideration is the potential limitation of this approach in terms of direct usability since the proposed method relies
on prior word localization for word cropping. Thus, and also due to the limited time for conducting this research, this method is
proposed as a preliminary step rather than a comprehensive solution. We will explore the synergy between visual and linguistic
understanding but highlight the need for further refinement and exploration to address the aforementioned limitations and yield
more practical outcomes.
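The following sketch illustrates how such a multimodal input could be assembled with the Hugging Face CLIP and T5 implementations; checkpoint names, the prompt wording and the omission of the space embeddings between visual words are simplifying assumptions.

# Sketch of the multimodal input assembly in Figure 4 (checkpoint names and prompt
# wording are assumptions; the space embeddings between visual words are omitted).
import torch
from transformers import (CLIPVisionModel, CLIPImageProcessor,
                          T5Tokenizer, T5ForConditionalGeneration)

clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

# linear projection from the CLIP CLS dimension to the Flan-T5 embedding dimension
proj = torch.nn.Linear(clip.config.hidden_size, llm.config.d_model)

def embed_word_crops(word_images):
    """word_images: list of PIL word crops, already resized to 224x224."""
    pixel_values = processor(images=word_images, return_tensors="pt").pixel_values
    cls = clip(pixel_values=pixel_values).pooler_output     # (n_words, clip_dim)
    return proj(cls).unsqueeze(0)                           # (1, n_words, d_model)

def build_inputs_embeds(question, word_images):
    embed = llm.get_input_embeddings()
    part1 = tokenizer("I have this document with the following words:",
                      return_tensors="pt", add_special_tokens=False).input_ids
    part2 = tokenizer(question, return_tensors="pt").input_ids
    visual = embed_word_crops(word_images)
    # [prompt part 1] + [visual word embeddings] + [question]
    return torch.cat([embed(part1), visual, embed(part2)], dim=1)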

F. Loss

As the LLM used is the Flan-T5 [9] and all tasks are formulated as text-to-text, the training is done using the standard
maximum likelihood. Thus, the teacher forcing [43] and Cross-Entropy (CE) loss are used, as commonly done with Language
Models. The cross-entropy loss is calculated between each predicted token and its corresponding target token. In this context,
for each step in the sequence, the model predicts the next token and the cross-entropy loss is used to measure the dissimilarity
between the predicted token’s probability distribution and the actual target token’s probability distribution. This helps the model
learn to generate sequences of tokens that match the desired target sequence.

Fig. 4. Architecture of the proposed method for DocVQA. For each document image, each of the words is cropped, resized to 224 × 224 and then passed
through the CLIP [56] visual encoder to extract the CLS token. Then a linear layer is used to align the dimension of each embedding from CLIP to the
Flan-T5 [9] embedding dimension. The Visual embeddings are concatenated with the space embedding between words. Moreover, the words from the prompt
are tokenized and passed through the Flan-T5 embedding. The word embeddings and the visual embeddings are then concatenated to pass through the
Encoder-Decoder. Finally, the output from the Flan-T5 is detokenized to obtain the answer. In the figure, n is the number of words from the document, B is
the batch size and p1 and p2 are the number of tokens of each part of the prompt. Note that the concatenation operation is denoted by the plus symbol.

Let’s denote:

1) y to be the one-hot encoded target distribution over the vocabulary for the current token
2) p to be the predicted probability distribution over the vocabulary for the current token

The CE loss for a single token prediction can be expressed as:


\text{Cross-Entropy Loss} = - \sum_{i} y_i \log(p_i) \quad (1)

where i ranges over all the possible tokens in the vocabulary, y_i is the actual probability (1 or 0) of token i according to the target, and p_i is the predicted probability of token i according to the model's prediction.

For a sequence of tokens, the individual token losses are summed and then divided by the total number of tokens to compute
the average loss. In neural network frameworks like PyTorch, this loss is implemented as a function that takes the predicted
logits (p) of shape [batch_size, seq_len, vocab_size] and the target ground-truth labels (y), and calculates the cross-entropy loss
for all the tokens in one go:

loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
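For completeness, a self-contained sketch of this computation with random tensors is shown below; the shapes are illustrative and padded target positions are assumed to carry the usual ignore label of -100.

# Self-contained sketch with random tensors; shapes are illustrative and padded
# target positions are assumed to carry the usual ignore label -100.
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 5, 32128
lm_logits = torch.randn(batch_size, seq_len, vocab_size)        # predicted logits p
labels = torch.randint(0, vocab_size, (batch_size, seq_len))    # target token ids y

loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
print(loss)   # average cross-entropy over all (non-ignored) tokens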

V. EXPERIMENTS

In this section, we first specify the evaluation metrics. Then, we explain the different experimental scenarios and, finally,
we outline the training details for each experiment.

A. Evaluation metrics

The main evaluation metric for DocVQA is the Average Normalised Levenshtein Similarity (ANLS). This metric is more
suitable than the standard accuracy since the answer to a question is text extracted from the document image. Thus, the metric
needs to respond softly to answer mismatches due to recognition imperfections. ANLS is defined in Equation (3), where N is the total number of questions, M the number of GT answers per question, a_ij the ground truth answers with i = {0, . . . , N} and j = {0, . . . , M}, and o_qi the returned answer for the i-th question q_i. NL(a_ij, o_qi) is the Normalized Levenshtein distance between the strings a_ij and o_qi. Take note that the ANLS metric is categorized as a “Similarity” rather than a distance, as indicated by NL similarity = 1 − NL distance. The Levenshtein distance between two strings a and b (with lengths |a| and |b| respectively) is denoted as lev(a, b), as defined in Equation (2). Here, the term “tail of x” signifies a string comprising all characters of x except for the first one, and x[n] represents the n-th character in the string x, commencing from character 0. It is important to highlight that the first component in the minimum value corresponds to deletion (from string a to b), the second pertains to insertion, and the third relates to replacement.


lev(a, b) = \begin{cases}
  |a| & \text{if } |b| = 0, \\
  |b| & \text{if } |a| = 0, \\
  \mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b)) & \text{if } a[0] = b[0], \\
  1 + \min\bigl(\mathrm{lev}(\mathrm{tail}(a), b),\ \mathrm{lev}(a, \mathrm{tail}(b)),\ \mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b))\bigr) & \text{otherwise.}
\end{cases} \quad (2)

\mathrm{ANLS} = \frac{1}{N} \sum_{i=0}^{N} \max_{j} \, s(a_{ij}, o_{q_i}) \quad (3)

s(a_{ij}, o_{q_i}) = \begin{cases}
  1 - \mathrm{NL}(a_{ij}, o_{q_i}) & \text{if } \mathrm{NL}(a_{ij}, o_{q_i}) < \tau \\
  0 & \text{if } \mathrm{NL}(a_{ij}, o_{q_i}) \geq \tau
\end{cases}

It is important to highlight that the threshold τ is configured as τ = 0.5 to screen out NL values surpassing this threshold.
In such cases, a score of 0 is assigned to the NL when it exceeds τ . The rationale underlying this threshold choice is that
when the normalized edit distance between an output and an answer is greater than 0.5, it indicates a likely mismatch caused
by incorrect scene text retrieval.

Additionally, we will include the accuracy metric in our reporting. The accuracy (Acc) will be assigned a value of 1 if the
ANLS equals 1, and it will be set to 0 otherwise. Finally, in some cases, we also include whether the ground truth is contained in the predicted answer. We call this metric GTIP, for “Ground truth in prediction”.
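A small sketch of how these three metrics can be computed is given below; it uses a plain Python Levenshtein implementation, and the case-insensitive comparison is an assumption made for illustration rather than part of the definition above.

# Sketch of ANLS, accuracy and GTIP with a plain Python Levenshtein distance
# (lower-casing before comparison is an assumption).
def levenshtein(a: str, b: str) -> int:
    if len(a) == 0:
        return len(b)
    if len(b) == 0:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # replacement
        prev = curr
    return prev[-1]

def anls(gt_answers, prediction, tau=0.5):
    """Best thresholded similarity over the ground-truth answers of one question."""
    best = 0.0
    for gt in gt_answers:
        nl = levenshtein(gt.lower(), prediction.lower()) / max(len(gt), len(prediction), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

def metrics(gt_answers, prediction):
    s = anls(gt_answers, prediction)
    acc = 1.0 if s == 1.0 else 0.0                                      # exact match
    gtip = any(gt.lower() in prediction.lower() for gt in gt_answers)   # ground truth in prediction
    return s, acc, gtip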

B. Experimental Scenarios

1) Text-to-Text DocVQA

We present three different experiments to evaluate the performance of our proposed method explained in Section IV. We aim to show that spatial information encoding the text layout on the page helps to solve DocVQA. For this reason, we
fine-tune the Flan-T5 model [9] starting with the weights provided by the FastChat repository [55] with three different types
of context:

• Text: The whole document is inputted to the LLM model as plain text.
• Text and bounding boxes: Each word from the document is inputted to the LLM model followed by the coordinates of its spatial position on the page, in the form word: (x, y). The coordinates are normalized to the range 0-1000.
• Layout: In this case, the text is inputted to the LLM model explicitly including the number of spaces between each word
and the line breaks.

Fig. 5. Example of the 3 different types of prompts, using different contexts. On the first prompt, only the text from the document is used. On the second,
the coordinates (x,y) of each word are provided. Finally, the third approach explicitly shows the separation between words trying to preserve the layout
information.

An example of the three different types of context given a document image is shown in Figure 5. Note that for the third type of context we could have directly inserted the necessary spaces between words, as suggested by Wang et al. [36], instead of writing x <Spaces>, where x is the number of spaces. Although plain spaces might have been easier for the LLM to interpret, they occupy more tokens, and given the 2048-token input limit of the LLM we opted for the compact notation.
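The sketch below shows one possible way to build the three context variants from the OCR output; the word and line data structures, the coordinate fields and the gap-to-spaces conversion are assumptions made for illustration.

# Sketch of the three context variants (word dictionaries with "text", "x", "y" and
# "width" fields, and the character-width conversion, are illustrative assumptions).
def text_context(words):
    return " ".join(w["text"] for w in words)

def text_bb_context(words):
    # each word followed by its normalized (0-1000) coordinates, as "word: (x, y)"
    return " ".join(f'{w["text"]}: ({w["x"]}, {w["y"]})' for w in words)

def layout_context(lines, char_width=8):
    # encode horizontal gaps as "x <Spaces>" tokens and keep explicit line breaks
    rendered = []
    for line in lines:                       # each line is a list of word dictionaries
        parts, prev_end = [], 0
        for w in line:
            gap = max((w["x"] - prev_end) // char_width, 0)
            if gap > 1:
                parts.append(f"{gap} <Spaces>")
            parts.append(w["text"])
            prev_end = w["x"] + w["width"]
        rendered.append(" ".join(parts))
    return "\n".join(rendered)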

At inference time, we compare the three different types of context and two LLMs: Flan-T5 [9] and Vicuna-7B [13]. The
first model has 3 billion parameters while the second one has 7 billion. We then fine-tune the Flan-T5 model on each of the
different types of context using the SP-DocVQA dataset [23]. The Vicuna model is not fine-tuned since it is too large to fit in
our computational resources.

2) Reading words as a pre-training task for multimodal DocVQA

Before jumping to the multimodal DocVQA, we designed a pre-training task so that the LLM could understand images of
words. To do so, we follow the same pipeline as for Multimodal DocVQA, presented in the Method Section IV and shown in
Figure 4. The aim is to teach the Flan-T5 model [9] to understand the images of the words. After some prompt engineering, we
decided to ask the model to “Type just the following text”. The word “just” is included to discourage the model from outputting too many words, while “text” is chosen instead of “word” since there are cases where the image may contain a compound word or a website address that is not a single common word. The visual input is always only one token, corresponding to the embedding
of the word image after being passed through the CLIP [56] visual encoder and the linear layer as depicted in Figure 6. The
expected LLM output is the transcription of the visual word, which can be multiple tokens. After detokenization, ideally, only
one word should be obtained since we ask the LLM to read one word at a time. Note that the LLM is frozen, as we are trying
to align the visual features extracted with the CLIP visual encoder to the Flan-T5 embedding space with a linear layer.
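A sketch of this parameter setup is shown below; it reuses the variable names of the earlier multimodal sketch and is an assumption of how the freezing could be implemented, not the thesis code.

# Sketch of the parameter setup for the reading pre-task; variable names follow the
# earlier multimodal sketch, and the helper itself is an illustrative assumption.
import torch

def configure_pretask(llm, clip, proj, train_clip=True):
    for p in llm.parameters():          # the LLM stays frozen during the pre-task
        p.requires_grad = False
    for p in clip.parameters():         # CLIP encoder trained jointly (or frozen if train_clip=False)
        p.requires_grad = train_clip
    for p in proj.parameters():         # the linear projection is always trained
        p.requires_grad = True
    trainable = [p for p in list(clip.parameters()) + list(proj.parameters()) if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-5)   # AdamW and the learning rate used in our experiments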

We conduct experiments involving the training of the linear layer alone and the joint training of both the linear layer and
CLIP visual encoder. Furthermore, we evaluate two distinct sets of weights for the Flan-T5 model. Firstly, we consider the
weights obtained after fine-tuning Flan-T5 for DocVQA, utilizing text as context, denoted as the PT1 model (Pre-Task 1).
Subsequently, we replicate the process with the Flan-T5 weights sourced from the FastChat repository [55], referred to as the
PT2 model (Pre-Task 2). We do this because our Flan-T5 model fine-tuned for DocVQA may not understand the pre-training task, since it never saw this type of question during fine-tuning.

Fig. 6. Example of the reading pre-training task for the word ”15.12”. We ask the LLM (Flan-T5 model [9]) to type the word in the image. Note that the
LLM is frozen while the CLIP [56] visual encoder and the linear layer are trained. Concatenation is represented by the plus symbol.

3) Multimodal DocVQA

In this experiment, we use the CLIP [56] visual encoder and linear layer trained with the reading pre-training task and replace the words from the documents with their visual embeddings to perform Multimodal DocVQA, as explained in the Method Section IV and shown in Figure 4. We start by replacing one random word from the document with its visual embedding, leaving the other words as text. Then we replace two random words and, at the end, we replace them all.

Note that we have two possibilities to do Multimodal DocVQA:

• Directly use the PT1 model. In theory, this model is able to understand the visual words, since the CLIP [56] visual
encoder and linear layer are taught how to read during the pre-training task and include the Flan-T5 model [9] trained
on DocVQA text data. So, we should be able to directly perform DocVQA by replacing words with their visual embeddings. However, we also experiment with fine-tuning the linear layer on visual DocVQA data to improve the results in this downstream task with visual words; we call this model PT1F. Importantly, we should highlight that as
we fine-tune the linear layer, when we replace certain words with their corresponding visual embeddings, a deliberate
decision is made. Specifically, we selected 50% of the words necessary for answering the question and 50% that are not
as pertinent. This approach is adopted to prevent any undue biasing of the model towards the visual words.

• Fine-tune the Flan-T5 of the PT2 model for visual DocVQA with all the words substituted by the visual embeddings.
Differently from the first approach, in this case, we first teach the LLM to understand the visual words by fine-tuning the CLIP visual encoder and training the linear layer in the pre-task. Then we fine-tune the Flan-T5 to perform multimodal
DocVQA. However, with the visual data the whole model does not fit in the available GPUs. Thus, we decide to only
fine-tune the Flan-T5 decoder, leaving the rest of the model frozen. We call this model PT2F.

C. Training Details

In our DocVQA experiments we use the questions from the SP-DocVQA [23] dataset that do not exceed the 2048-token limit, as explained in Section III-B. For the pre-training task, we use randomly cropped words from the images of the documents: 80,000 images of words for training, 10,000 for validation and 10,000 for testing.

In all the experiments we use the cosine scheduler, a popular learning rate scheduling technique that gradually reduces the
learning rate in a cosine-shaped manner over a specified number of epochs. This can help models converge to better solutions
during training. Moreover, we apply the AdamW optimizer, which implements the Adam algorithm with a weight decay fix, as introduced in [58]. We chose very small batch sizes due to memory limitations. For this reason, gradient accumulation is used
to try to simulate a bigger batch size. The key idea behind gradient accumulation is that while you are effectively processing
smaller batches at a time, you are still utilizing the benefits of larger batch sizes in terms of a more accurate estimate of the
gradient direction. This can lead to more stable updates and better convergence properties, compared to using tiny batch sizes
without accumulation.
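The following minimal PyTorch sketch illustrates gradient accumulation with a toy model standing in for Flan-T5; the accumulation factor matches the 16 steps used in our text-to-text experiments.

# Minimal gradient accumulation sketch with a toy model standing in for Flan-T5.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fct = nn.CrossEntropyLoss()
accumulation_steps = 16      # per-device batch size 2 x 16 steps gives an effective batch of 32

optimizer.zero_grad()
for step in range(64):                                  # stand-in for iterating the dataloader
    x = torch.randn(2, 16)                              # small per-device batch
    y = torch.randint(0, 4, (2,))
    loss = loss_fct(model(x), y) / accumulation_steps   # scale so the accumulated gradient is an average
    loss.backward()                                     # gradients add up across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                # one optimizer update per effective batch
        optimizer.zero_grad()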

All the training details for each of the experimental scenarios are shown in Table V-C. Note that for the pre-training task,

more epochs are needed to reach convergence when fine-tuning the CLIP [56] visual encoder and the linear layer, since it is a different task and CLIP is not accustomed to seeing images of text. Also, in the second Multimodal DocVQA approach, when we replace all words with visual embeddings, the per-device batch size and gradient accumulation steps had to be reduced to 1 and 8, respectively, to deal with memory issues. Note that when performing DocVQA tasks, we fine-tuned the model for the same
amount of epochs to achieve comparability between the experiments, always ensuring convergence.

                              Text-to-Text DocVQA     Reading pre-training task            Multimodal DocVQA 1   Multimodal DocVQA 2
Fine-tuned module             Flan-T5                 CLIP visual encoder + linear layer   Linear layer          Flan-T5 encoder
GPUs                          2 NVIDIA RTX 6000 Ada   1 NVIDIA A40                         1 NVIDIA A40          2 NVIDIA RTX 6000 Ada
Number of epochs              8                       35                                   8                     8
Per device batch size         2                       32                                   2                     1
Gradient accumulation steps   16                      2                                    16                    8
Warmup ratio                  0.03                    0.03                                 0.03                  0.03
Learning rate                 2e-5                    2e-5                                 2e-5                  2e-5
Learning rate scheduler       Cosine                  Cosine                               Cosine                Cosine
Optimizer                     AdamW                   AdamW                                AdamW                 AdamW

VI. RESULTS

In this section, we analyze the results for each experimental scenario. First, we focus on discussing the outcomes of the
text-to-text modality. Afterwards, we present the results of the pre-training task and, lastly, we cover the findings in the multimodal
domain.

A. Results by Experimental Scenarios

1) Text-to-text DocVQA

Running inference with the three different types of context (Text, Text + Bounding Boxes (BB) and Layout) and the two LLMs (Flan-T5 [9] and Vicuna-7B [13]), we see that, as expected, the best results are obtained with plain text, since the model
does not understand the coordinates of the bounding boxes or the layout. Analyzing the predicted results we see that although
the ANLS is quite low, some answers were correct despite having a 0 ANLS since the model generates a wordy response,
explaining more than what is being asked. In fact, Vicuna seems to be more verbose than Flan-T5, as the Ground truth in
prediction (GTIP) metric is much higher than the ANLS. Moreover, we see that with the Flan-T5 (3B parameters) model, we
obtain a higher ANLS and accuracy in all the cases compared to the Vicuna (7B parameters), maybe because of the data they
were originally fine-tuned with.

Regarding the fine-tuning of the Flan-T5 model on DocVQA data with each of the three types of context, we obtain the best
ANLS with the text + Bounding Boxes, achieving a 0.76 ANLS and 0.65 accuracy. Moreover, although the layout does not
work as well as the bounding boxes, it also outperforms the results obtained with just the text. Thus, we can confirm that the
spatial information helps to solve the DocVQA task. Although initially, the model was not able to understand coordinates, it
seems that after fine-tuning, it has been able to learn some spatial information. Regarding the layout, introducing the number of spaces as x <Spaces>, where x is the number of spaces, seems to be more difficult to learn. In addition, this method is
less precise and does not take into account vertical spacing.

All the quantitative results are shown in Table II. Note that the second part of the table includes multiple SOTA methods that
use the text modality although direct comparison with our results is not possible since the results are from different subsets
as explained in Section III-B. Nevertheless, we observe the enhanced capabilities of larger language models such as ChatGPT
[11] when coupled with layout information, surpassing even the performance achieved by our fine-tuned Flan-T5 in a zero-shot
setting.

When visually analyzing the cases where the model is not able to answer correctly regardless of the context type, we find some OCR inconsistencies due to unclear handwritten text, difficult diagrams, or questions relying on non-textual information that the context currently does not encode, such as asking for underlined text. On the contrary, when the question keywords and the answer are close together, the model tends to respond correctly regardless of the context type. When dealing with graphs, diagrams,
or other documents where the position of the data is necessary, the text + bounding boxes (BB) outperforms the rest, as we
can visually see in Figure 7. More graphic examples can be found in Figures 8-10 in the Appendix A.

TABLE II
EXPERIMENT 1: TEXT-TO-TEXT DOCVQA WITH DIFFERENT TYPES OF CONTEXT

                                                           ANLS                        Accuracy                    GTIP
Model                       Fine-tuned  Model size   Text    Text+BB  Layout    Text    Text+BB  Layout    Text    Text+BB  Layout
Flan-T5 [9]                 no          3B           0.3617  0.0966   0.2174    0.0180  0.0091   0.0119    0.2516  0.07     0.2602
Flan-T5 [9]                 yes         3B           0.7346  0.7633   0.7448    0.6193  0.6493   0.6318    0.7098  0.7418   0.7158
Vicuna [13]                 no          7B           0.1690  0.0710   0.1332    0.0265  0.0358   0.0547    0.2476  0.1002   0.1657
BERT Large [7]              yes         340M         0.6768* -        -         -       -        -         -       -        -
LayoutLM Large [30]         yes         343M         -       -        0.7259*   -       -        -         -       -        -
Alpaca [12]                 no          7B           0.3567* -        -         -       -        -         -       -        -
Alpaca+LATIN-Prompt [36]    no          7B           -       -        0.4200*   -       -        -         -       -        -
ChatGPT+LATIN-Prompt [36]   no          unknown      -       -        0.8255*   -       -        -         -       -        -
Claude+LATIN-Prompt [36]    no          unknown      -       -        0.8336*   -       -        -         -       -        -
On the top, results for our experiments. On the bottom, SOTA methods. Note that the * indicates that the results are provided on the complete
SP-DocVQA [23] test set, while our experiments were conducted on a subset of the validation set as explained in Section III-B. We include the SOTA
as a reference, although direct comparison is not possible. Moreover, in this context, no fine-tuning is equivalent to zero-shot for the specific task of
DocVQA. Regarding the Layout context, the implementation may differ from ours, but the idea is similar.

Fig. 7. On the left, the document image with the three different types of context below. On the right, two questions that require spatial knowledge are asked
of each of the fine-tuned Flan-T5 models. We see that the text + bounding box (BB) context seems to be the best type of context for solving such
questions.

2) Reading words as a pre-training task for multimodal DocVQA

Our initial idea was to unfreeze only the linear layer and keep the CLIP visual encoder and the LLM frozen, as suggested
by other multimodal works [18], [20] that ask questions to the LLM about an input image. However, in our case this was not
enough and the model did not learn to read the cropped words. For this reason, we only report the results of the pre-training
task with both the CLIP [56] visual encoder and the linear layer unfrozen, which are shown in Table III. We believe that
training only the linear layer does not work because our task is considerably harder: on the one hand, CLIP is trained on
general images rather than images of words; on the other hand, our data is challenging, with different font types, different
font sizes, and both handwritten and printed words.
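
For reference, the following is a minimal sketch of this training configuration, assuming a Hugging Face CLIP vision encoder and a Flan-T5 checkpoint; the specific model names, the use of a single linear projection, and the pooled-feature choice are illustrative assumptions rather than the exact setup.

```python
# Sketch: freeze the LLM, train the CLIP visual encoder and the projection layer.
import torch.nn as nn
from transformers import CLIPVisionModel, T5ForConditionalGeneration

llm = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")
visual_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
# Linear layer projecting CLIP features into the LLM embedding space.
projector = nn.Linear(visual_encoder.config.hidden_size, llm.config.d_model)

# Keep the LLM frozen; train the CLIP encoder and the projection layer,
# which is the configuration that finally worked for the reading pre-task.
for p in llm.parameters():
    p.requires_grad = False
for p in visual_encoder.parameters():
    p.requires_grad = True
for p in projector.parameters():
    p.requires_grad = True

trainable = list(visual_encoder.parameters()) + list(projector.parameters())
print(sum(p.numel() for p in trainable), "trainable parameters")
```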

In general, we observe that the PT2 model consistently produces better results. This is expected, since its Flan-T5
[9] version uses the weights from the FastChat [55] repository rather than the ones obtained from fine-tuning the model on
DocVQA. During the fine-tuning on DocVQA, our model never saw a question similar to the one asked in the pre-task; as a
result, it seems not to understand the prompt in the reading task. However, both models still struggle with not knowing when
to stop generating text: the full-answer metrics are much lower than the ones computed on only the first word generated by
the model. This text-generation problem might arise because the LLM is kept frozen, and despite our prompt-engineering
efforts, these measures may not be sufficient given the model's lack of exposure to this type of data. The PT1 model suffers
even more from this generation problem, since when analyzing only the first generated word its results are close to those of
PT2, even surpassing it when checking whether the ground truth appears in the predicted answer. Some examples of this
generation issue are shown in Table IV. Regarding the test results, they are similar to those on the validation split, so both
models generalize well to unseen words.
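
For clarity, the sketch below illustrates the scoring used in this discussion: an ANLS-style score with the standard 0.5 threshold and its "first predicted word only" variant. The evaluation code itself is not reproduced here, so treat the helper names and the normalisation choices as assumptions.

```python
# Sketch of ANLS-style scoring, including the "first word only" variant.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def nls(pred: str, gt: str, tau: float = 0.5) -> float:
    """Normalised Levenshtein similarity, zeroed below the ANLS threshold."""
    pred, gt = pred.strip().lower(), gt.strip().lower()
    if not pred and not gt:
        return 1.0
    dist = levenshtein(pred, gt) / max(len(pred), len(gt))
    return 1.0 - dist if dist < tau else 0.0

def anls(predictions, ground_truths, first_word_only=False):
    """Average best similarity per sample, optionally keeping only the first predicted word."""
    scores = []
    for pred, gts in zip(predictions, ground_truths):
        if first_word_only:
            pred = pred.split()[0] if pred.split() else ""
        scores.append(max(nls(pred, gt) for gt in gts))
    return sum(scores) / len(scores)

# The over-generation problem is visible when the two variants diverge.
preds = ["bringing syringe"]
gts = [["bringing"]]
print(anls(preds, gts), anls(preds, gts, first_word_only=True))  # 0.0 vs. 1.0
```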

Analyzing the worst cases, where the ANLS of both the full prediction and the first predicted word is 0 for both models (PT1
and PT2), we see that in the majority of cases the failure can be attributed to the difficulty of the images: handwritten and
unclear text, vertically printed text, long words that become harder to read when resized to 224 × 224, compound or
uncommon words, and small crops that become blurry after resizing. Some visual examples are shown in Table VI in Appendix
A.

TABLE III
EXPERIMENT 2: READING PRE-TASK

Model Split ANLS Acc ANLS 1st word Acc 1st word GTIP
PT1 Validation 0.2627 0.0458 0.7677 0.6367 0.8077
PT2 Validation 0.4591 0.2953 0.8072 0.6737 0.7775
PT1 Test 0.2640 0.061 0.7761 0.6380 0.8180
PT2 Test 0.4589 0.2910 0.8121 0.6780 0.7711

TABLE IV
GENERATION STOPPING PROBLEM

GT Predicted by PT1 Predicted by PT2


what what tv what
with with el with
bringing bringing syringe bringing
IN IN gizmo IN.
BROKERS BROKERS CENTRE BROOKERS text
Social Social Informationen Social
Service Service ad Service url
return) return) . /Return/ return(email)\”?¿
M. M. e-mail M. page
Facility Facility, london Facility
d’alene, Daleya, Dianne, e-mail Adele,
lists. lists. org. lists.html
04/04/2000 01/30/00,0508,5/29/00,12/02,04/30/00,07/02,05/30/00,12/02,04/30/00 04/21/2019.html
general general-based-selection general textbook
metabolic metabolic chemistry metabolic

3) Multimodal DocVQA

First, performing inference with the PT1 model while substituting words by their visual embeddings does not work at all. Even
in the easiest case, where only one word is replaced by its corresponding visual embedding, the ANLS and the accuracy
drop to 0. Analyzing the results, we see that introducing the visual word produces a lot of unexpected noise, even when that
word is not part of the answer. Table VII in Appendix A shows 30 predicted answers for which the ANLS was 0. Moreover,
even in the cases with the highest ANLS, the model adds special characters or tries to concatenate all the answer words into a
single one, as shown in Table VIII in Appendix A. This could be related to the pre-training task, where we asked the model
to output only a single word.
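
The sketch below illustrates the kind of substitution performed at inference time: the text-token embeddings of one word are replaced by a single projected CLIP feature before generation. The function name, the simplified subword matching, and the use of inputs_embeds with generate (available in recent transformers versions for encoder-decoder models) are assumptions for illustration, not our exact implementation.

```python
# Sketch: swap one word's text-token embeddings for its projected visual embedding.
import torch

@torch.no_grad()
def answer_with_visual_word(llm, tokenizer, prompt, word, visual_feature, projector):
    """Replace the token embeddings of `word` in `prompt` by one projected
    visual embedding and generate an answer. Subword alignment is simplified."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    embs = llm.get_input_embeddings()(ids)                    # (1, seq_len, d_model)
    word_ids = tokenizer(word, add_special_tokens=False).input_ids

    seq = ids[0].tolist()
    for start in range(len(seq) - len(word_ids) + 1):
        if seq[start:start + len(word_ids)] == word_ids:
            break
    else:
        raise ValueError("word tokens not found in the prompt")

    # One projected CLIP feature replaces all subword embeddings of the word.
    visual_emb = projector(visual_feature).view(1, 1, -1)     # (1, 1, d_model)
    embs = torch.cat([embs[:, :start], visual_emb, embs[:, start + len(word_ids):]], dim=1)

    out = llm.generate(inputs_embeds=embs, max_new_tokens=20)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```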

To try to obtain better results, the linear layer (LL) was fine-tuned on the DocVQA data with the corresponding words
substituted by their visual embeddings, and the results improved significantly: the ANLS when substituting one word reaches
0.5 for the PT1F model. In this case, contrary to our expectations, slightly better results are obtained when the substituted
word belongs to the answer. As more words are substituted, all the metrics drop, as shown in Table V. Analyzing the
predicted answers, we find that the model adds less noise and predicts the correct answer more often, especially when the
answer is a single word. We think that substituting all words produces the worst results because the LLM is not used to such
inputs, since in the pre-training task only one token was replaced by a visual embedding.

When conducting inference with the PT2 model directly, the expected outcome is observed: the model struggles to provide
accurate responses to any question. This was anticipated, as this Flan-T5 model had not been fine-tuned on the SP-DocVQA
dataset [23]. After fine-tuning the decoder (model PT2F), there was a noticeable improvement in results, albeit still at a
relatively modest level. This could be attributed to not being able to perform a comprehensive fine-tuning of the entire LLM
due to memory limitations, which may have led to sub-optimal behaviour. In fact, fine-tuning the decoder on a task similar
to the original pre-training tasks of T5 [5] (such as summarization, translation, or question answering) can help the model
adapt more effectively to the characteristics of the specific task and improve the quality and coherence of the generated
outputs. However, since we have changed the input type, the model does not seem able to correctly understand the task and
the context.
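
As an illustration of this decoder-only fine-tuning, the sketch below freezes the full model and re-enables gradients only for the T5 decoder stack, keeping the tied input embeddings frozen. The checkpoint name and the choice to leave the LM head and embeddings untouched are assumptions, not the exact PT2F configuration.

```python
# Sketch: train only the decoder of a Flan-T5 checkpoint.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

for p in model.parameters():           # freeze the whole model first
    p.requires_grad = False
for p in model.decoder.parameters():   # then unfreeze only the decoder stack
    p.requires_grad = True
for p in model.shared.parameters():    # keep the tied input embeddings frozen
    p.requires_grad = False

print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```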

A summary of all the quantitative results is shown in Table V. Note that we also include three SOTA methods, although a
direct comparison is not possible because the evaluation data is different, as mentioned in Section III-B. Moreover, the first
two SOTA methods do not use an LLM but are specifically designed for DocVQA, achieving higher results. Furthermore, it is
worth noting that the multimodal GPT-4 model [4] has demonstrated the most impressive outcomes. However, it is essential
to acknowledge that this model's size may be substantially larger, potentially reaching a scale of trillions of parameters,
although official confirmation of its size is currently unavailable.

Finally, to understand why we obtain much lower results when transitioning to the multimodal domain, we analyzed the cases
where the initial Flan-T5 [9] model fine-tuned with just the text data from DocVQA produced the correct answer but the
multimodal models (PT1F and PT2F) did not. In such cases, the answer tends to be longer than one word, so the pre-training
task may be partly responsible. Moreover, the PT1F model outputs more coherent answers, closer to the ground truth, than the
PT2F model, which sometimes generates nonsensical output, possibly because its encoder was frozen during fine-tuning.
Multiple examples can be seen in Figures 11-12 in Appendix A. Regarding the cases where PT1F obtains the correct answer
but PT2F does not, many of the ground-truth answers are just one word. Moreover, PT2F outputs words that are not even in
the document, suggesting it is not able to understand the task correctly, as shown in Figures 13-14 in Appendix A.

TABLE V
EXPERIMENT 3: MULTIMODAL DOCVQA

                                        All data            From Answer         Not from Answer
Model Parameters Visual words           ANLS  Acc  GTIP     ANLS  Acc  GTIP     ANLS  Acc  GTIP
PT1 3B 1 word 0.0918 0 0.2474 0.0668 0 0.1992 0.1168 0 0.2955
PT1F 3B 1 word 0.4954 0.2963 0.2754 0.5134 0.3045 0.2822 0.4774 0.2881 0.2686
PT1 3B 2 words 0.0819 0 0.2308 0.0387 0 0.1681 0.1251 0 0.2935
PT1F 3B 2 words 0.4438 0.2646 0.2592 0.4315 0.2546 0.2505 0.4561 0.2746 0.2679
PT1 3B all words 0.0033 0 0.0098 0.0033 0 0.0098 - - -
PT1F 3B all words 0.3459 0.1404 0.1475 0.3459 0.1404 0.1475 - - -
PT2 3B all words 0 0 0.0121 0 0 0.0121 - - -
PT2F 3B all words 0.1121 0.0472 0.0472 0.1121 0.0472 0.0472 - - -
LayoutLMv2 Large [31] 426M all words 0.8348* - - - - - - - -
ERNIE-Layout Large [37] 507M all words 0.8321* - - - - - - - -
GPT-4 [4] unknown all words 0.8840* - - - - - - - -
On the top, results for our experiments. On the bottom, SOTA methods. Note that the * indicates that the results are provided on the complete
SP-DocVQA [23] test set, while our experiments were conducted on a subset of the validation set as explained in Section III-B. We include the
SOTA results from [36] as a reference, although direct comparison is not possible.

VII. CONCLUSIONS

In our research, we introduced an innovative approach to enhance Document Visual Question Answering (DocVQA) by
harnessing the capabilities of Large Language Models. Initially, we evaluated fine-tuning the Flan-T5 model [9] on the
SP-DocVQA dataset [23] with three distinct forms of context: textual content only, textual content combined with bounding
boxes, and textual content combined with layout information. Interestingly, after fine-tuning, both the layout and bounding-box
context types yielded superior outcomes compared to plain textual context. This highlights the significant performance gain
achievable by incorporating spatial information, particularly when dealing with documents that contain intricate elements
such as tables or diagrams. Overall, the use of bounding boxes yielded the most promising results, achieving an ANLS score
of 0.76.

In the context of the text-to-text DocVQA task, it would have been interesting to also fine-tune a larger model. However,
due to resource constraints, the Vicuna model [13] was only used for DocVQA inference, unlike the Flan-T5 model, which
we fine-tuned. Comparing a fine-tuned Flan-T5 with 3B parameters to a fine-tuned Vicuna with 7B parameters would have
given us a first indication of the dependence on model size. Furthermore, with additional time at our disposal, we could have
explored alternative training methods that require less memory, such as Low-Rank Adaptation (LoRA) [59]. This option
remains open for testing to determine its applicability to our specific pipeline, potentially leading to overall performance
enhancements.
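
As a pointer for this future direction, the following is a minimal sketch of applying LoRA [59] to a Flan-T5 checkpoint with the Hugging Face peft library; the rank, scaling, and target modules are typical choices for T5-style models, not values validated in this work.

```python
# Sketch: parameter-efficient fine-tuning of Flan-T5 with LoRA adapters.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import T5ForConditionalGeneration

base = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # low-rank dimension of the adapters
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # inject adapters into query/value projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```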

In this study, we also introduced a pipeline for multimodal Document Visual Question Answering. Initially, we designed a
pre-training task to enable the Flan-T5 model to comprehend visual information. In our approach, we successfully aligned
the visual features of words in the documents with the Large Language Model (LLM) space, allowing the model to recognize
word images. However, we encountered an issue with the termination of text generation: since the LLM remained frozen
during the fine-tuning on the preliminary task, it could not learn when to stop generating text, leading to the generation of
undesired words. Nevertheless, it is worth noting that in most cases the initial words were correct, achieving an ANLS score
of 0.81 when using just the first predicted word.

In the multimodal domain, we presented two approaches using different weights of the Flan-T5 model. In the first, we
fine-tuned Flan-T5 on DocVQA with the text modality, and then froze it while training the CLIP [56] visual encoder together
with an additional linear layer on the word-reading pre-training task. Finally, we further fine-tuned the linear layer for
multimodal DocVQA, keeping all other components frozen, by substituting the text tokens of document words with their
visual embeddings. In the second, we used Flan-T5 with the FastChat [55] weights, froze them, and trained the CLIP visual
encoder and the linear layer on the pre-training task in the same way as before; we then fine-tuned the decoder of Flan-T5
for the multimodal DocVQA task. Although the first approach yielded superior outcomes, both methods drastically degraded
the results obtained with text alone, especially when all words were substituted with visual embeddings.

We believe that both approaches face limitations primarily rooted in the reading pre-training task, which essentially instructed
the model to process one word at a time. To address this constraint, future research endeavours should focus on designing
and evaluating more complex pre-training tasks. Additionally, we anticipate that the second approach has the potential to yield
even superior outcomes compared to the first approach if the entire Flan-T5 model were to undergo fine-tuning with visual
data. It is noteworthy that in this scenario, the reading pre-training task demonstrated enhanced performance, implying that the
model achieved a deeper understanding of visual features. Consequently, allocating more resources to fine-tune the complete
Large Language Model with visual embeddings may result in heightened performance levels for DocVQA.

Finally, it is worth noting that our multimodal approach was conceived as an initial step rather than a fully comprehensive
pipeline. This choice was influenced by the inherent complexity of the task at hand and the constraints of our research timeline.
Looking ahead, we believe the project should evolve to eliminate its reliance on Optical Character Recognition (OCR) for
text extraction. A more robust solution would integrate visual information directly, without the intermediary of OCR.
Furthermore, there is an intriguing opportunity to enhance our approach by combining spatial information with the
multimodal framework. This integration could potentially result in more accurate and contextually rich interpretations of
documents, thereby advancing the effectiveness of the overall system.

ACKNOWLEDGMENTS

I want to sincerely thank my supervisors, Dr. Dimosthenis Karatzas, Dr. Mohamed Ali Souibgui, and Ruben Perez, for their
invaluable guidance, support and encouragement throughout my thesis. I am also grateful to the professors in my master’s
program for imparting valuable knowledge in the field of Computer Vision and for establishing the groundwork for my
machine-learning projects. To my fellow classmates and teammates during this intense year, thank you for sharing knowledge
and experiences as we navigated our academic journey together. To my CVC colleagues, thank you for creating a welcoming
atmosphere, supporting each other’s projects, and always being ready to lend a hand. Lastly, a special thanks to my family,
especially my parents, for their unwavering presence and the opportunities they’ve provided me with. Your support means the
world to me.

APPENDIX A
GRAPHICAL RESULTS

In this appendix, we offer more visual results from the various experiments we conducted. These extra visuals provide a
clearer picture of our research findings and help to better understand the different outcomes we observed during our experiments.
Thus, we encourage readers to explore this appendix for a deeper understanding of our study.

Fig. 8. Examples where the Flan-T5 [9] model is able to get the correct answer regardless of the type of context. Note that in all the examples the keywords
from the question and the answer are side by side. In the figure, BB refers to the bounding box.

Fig. 9. Examples where the Flan-T5 [9] model is not able to get the correct answer regardless of the type of context. In the first example, the answer should
be "Selection Criteria"; however, "Criteria" is not in the OCR output, which makes it difficult to get the answer. In the second document, none of the context
types takes underlined text into account; however, the LLM could at least have output the title.

Fig. 10. Three different documents and the predicted answers from the Flan-T5 [9] model trained with each of the context types: text, text + bounding boxes
(BB), or layout. We see how the BB information helps to solve questions that require positional information.

TABLE VI
PRE-TASK: FAILED EXAMPLES

Original image Resized image GT PT1 prediction PT2 prediction

JRSJnamely Krankheitserscheinung #Stylized##########################

. THE THIEL HERE ad

1403465516 144105170 144044691

NONESSENTIAL Nonnative English Non-SENSITIVE licensor

Citrus Caritas Crisptsea

Healthletter, Higher, Helena, sand, b., et al., eds., New York, n.d., p. 199

TABLE VII
MULTIMODAL DOCVQA WITH 1 VISUAL WORD. UNEXPECTED BEHAVIOUR OF THE PT1 MODEL AT INFERENCE

Visual word From answer Ground truth Prediction


predictor FALSE 0.28 8.28,721(70)(99.4)(75)(99.4)(95)(95)(99)(99)(99)(99)(99)(99)(99)(99)(99)(99)
ocame FALSE paul dr.abc.commons,sacramento,sanandro,mt.Diego,sanantonio,mt.dale,sanantonio
the FALSE dark fantasy cell-renew-fight-skin-damage-cell-level-escape-into-one-bingo!,
the FALSE wills lifestyle flamasssssssssssssssssssssssssssssssssssssssssssssssssssssssss
1993 FALSE 31 31.23.467313.234.9541.367343.9541.31.39.
india FALSE itc limited itc,tcf,tcc,tci,tci,tci,tci,tcm,tcc,tci,tcv,tcv,tcc,tc,tci,tc
lbs. FALSE 22 22-22-004-004-004-004-004-004-004-004-004-004-004-004-004-005-0089.
more FALSE 57 61.57”,59.75”,61.57”,61.57”,61.57”,61.57”,61.57”,61.57”,61.57”,61.57”
72 FALSE the environment theaffectivenesscondominiumtotakeconsumeractionremainststrong.writewrite
yankelovich FALSE 82 80-25%-22
proposed FALSE february 24 .1966 february- 24-30,1960.20.55.20.
modern FALSE 64% 70.46$.
filter FALSE basal diet basal-diet.,-clinically-acclimated-sugar-acid-fed-20-weeks-to-ten-male-rats-
first FALSE general food fund, inc general,st.s.kresge,inc..(
joins FALSE johnplayers jeans distressed.,.
university TRUE university of california ucsf.edu.,write.ucsf.edu.,contains.ucsf.edu.,contains.ucsf.edu.
limited TRUE itc limited itc.,astm,cnxt,citc,india’s highest-performing, fmcg,fmcg,fmcg,fmcg,fmcg, fmc
aashirvaad TRUE aashirvaad flama,asahi,multigrains,simple,simple,simple,simple,simple,simple,simple,s
welch TRUE the robert a. welch foundation the-robert-a.-wheel-foundation.com., all rights reserved..
975.00. TRUE 975.00 $.957.00$.757.00$.757.00$.757.00$.757.00$.757.00$.757.00$.757.00
c. TRUE a. c. monk a.c.monk.23.0.91.59.472(100)twice.
in TRUE in bengaluru in,hyderabad..
22 TRUE 2 2(1).,(2%).,(22)(1).,(4%)(4),(4)(4)(4)(4),(4)(4)(4)(54)(0089)(1).,(111)(111)(111
70 TRUE 70 64.29.
485 TRUE 485 685.50.$.485.d.$.
enhancement TRUE emotional enhancement emotional,mg/pl,mg/pl,mg/pl,mg/pl,mg/pl,mg/mg/pl,mg/mg/pl,mg/pl,mg/m
29 TRUE 29 29.22.9113369882.9882.
meharry TRUE meharry medical college medical,rules.kresge,facs,cmd,cmdh,cmdhc,cmdhc,cmdhc,cmdhc,cmdhc,cmdhc
fresh TRUE scissors menthol fresh royal.4stinson’s.com.au.
vein TRUE caudal vein caudal,middle,middle,middle,middle,middle,middle,middle,middle,middle

TABLE VIII
MULTIMODAL DOCVQA WITH 1 VISUAL WORD. BEST CASES AT INFERENCE WITH THE PT1 MODEL

Visual word From answer Ground truth Prediction


vendor: FALSE $13,519,151 $13,519,151,
nutrition FALSE northeast brazil northeast-brazil-
ml./kg. FALSE (weeks) (weeks)(
bethesda, FALSE nutrition foundation office nutritiona foundation, office..
of FALSE protected document protected-document.-
28.39 FALSE the nutrition foundation, inc. thef nutritionfoundation,inc.,.
order FALSE publication and abstract tracking report publication-and-abstract-tracking-report.20
one FALSE $7.80/hour $7.80/hour.t
p3(1.1 FALSE $10,000-$20,000 $10,000-$20,000.$.
172.8 FALSE 661.1 661.1.
drink TRUE national soft drink association national soft drinks association.,,
carbon TRUE cigarette smoke contains carbon monoxide. cigarette-smoke-contains-carbon-monoxide..
association TRUE ’the indiana dietetic association the-indiana-dietetic-association-
r. TRUE r. j. reynolds tobacco co. R.j.reynolds tobaccocco..
industries’ TRUE the petroleum industries’ air pollution control program the-petroleum-industries’-air-pollution-control-program.22
(n=20) TRUE baseline clinical characteristics of study participants (n=20) baseline-clinical-characteristics-of-study-participants,(1-2)-
classified TRUE classified material receipt classified-material-receipt.7.
6, TRUE washington 6, d. c. washington,c
2004 TRUE april 21-24, 2004 april-21-24,2004..
pentile TRUE highest pentile highest-pentile..

Fig. 11. Examples where the initial Flan-T5 [9] model with text as context obtained the correct answer while the multimodal models (PT1F, PT2F) did not.
In the majority of cases the correct answers are more than one word, so our multimodal models may be struggling with the task because of the pre-task,
where we asked the model to read only one word.

Fig. 12. Examples where the initial Flan-T5 [9] model with text as context obtained the correct answer while the multimodal models (PT1F, PT2F) did not.
In the majority of cases the correct answers are more than one word, so our multimodal models may be struggling with the task because of the pre-task,
where we asked the model to read only one word.

Fig. 13. Examples where the initial Flan-T5 [9] model with text as context and the multimodal model PT1F obtained the correct answer while the PT2F
model did not. Note that PT2F outputs words that are not even in the document, suggesting it does not understand the task.

Fig. 14. Examples where the initial Flan-T5 [9] model with text as context and the multimodal model PT1F obtained the correct answer while the PT2F
model did not. Note that PT2F outputs words that are not even in the document, suggesting it does not understand the task.

REFERENCES

[1] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,
G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information
Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901.
[Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[3] OpenAI, “GPT-4 Research Page,” 2023. [Online]. Available: https://openai.com/research/gpt-4
[4] OpenAI, “GPT-4 technical report,” 2023.
[5] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified
text-to-text transformer,” 2020.
[6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin,
E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” ArXiv, vol.
abs/1810.04805, 2019.
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
[9] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun,
X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov,
E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” 2022.
[10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton,
L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human
feedback,” 2022.
[11] OpenAI, “ChatGPT,” https://openai.com, 2021.
[12] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,”
https://github.com/tatsu-lab/stanford_alpaca, 2023.
[13] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An
open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
[14] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han,
Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals,
A. Zisserman, and K. Simonyan, “Flamingo: a visual language model for few-shot learning,” 2022.
[15] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” ArXiv,
vol. abs/2301.12597, 2023.
[16] J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi, “Unified-io: A unified model for vision, language, and multi-modal tasks,” ArXiv, vol.
abs/2206.08916, 2022.
[17] A. F. Biten, R. Litman, Y. Xie, S. Appalaraju, and R. Manmatha, “Latr: Layout-aware transformer for scene-text vqa,” 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 16527–16537, 2021.
[18] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” 2023.
[19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning
transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021.
[20] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” 2023.
[21] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, and Y. Cao, “Eva: Exploring the limits of masked visual representation
learning at scale,” 2022.
[22] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen, J. Tian, Q. Qi, J. Zhang, and F. Huang, “mplug-owl:
Modularization empowers large language models with multimodality,” 2023.
[23] M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar, “Docvqa: A dataset for vqa on document images,” 2021 IEEE Winter Conference on
Applications of Computer Vision (WACV), pp. 2199–2208, 2020.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available:
https://aclanthology.org/N19-1423
[25] L. Garncarek, R. Powalski, T. Stanisławek, B. Topolski, P. Halama, M. Turski, and F. Graliński, “Lambert: Layout-aware language modeling for
information extraction,” in IEEE International Conference on Document Analysis and Recognition, 2020.
[26] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert: Pre-training of generic visual-linguistic representations,” 2020.
[27] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,”
2020.
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining
approach,” 2019.
[29] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” 2020.
[30] Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “LayoutLM: Pre-training of text and layout for document image understanding,” in
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, Aug. 2020. [Online]. Available:
https://doi.org/10.1145%2F3394486.3403172
[31] Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. Florencio, C. Zhang, W. Che, M. Zhang, and L. Zhou, “Layoutlmv2:
Multi-modal pre-training for visually-rich document understanding,” in ACL-IJCNLP 2021, January 2021. [Online]. Available:
https://www.microsoft.com/en-us/research/publication/layoutlmv2-multi-modal-pre-training-for-visually-rich-document-understanding/
[32] R. Powalski, Łukasz Borchmann, D. Jurkiewicz, T. Dwojak, M. Pietruszka, and G. Pałka, “Going full-tilt boogie on document understanding with
text-image-layout transformer,” 2021.
[33] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei, “Layoutlmv3: Pre-training for document ai with unified text and image masking,” 2022.
[34] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park, “Ocr-free document understanding transformer,” 2022.
[35] B. Davis, B. Morse, B. Price, C. Tensmeyer, C. Wigington, and V. Morariu, “End-to-end document recognition and understanding with dessurt,” 2022.
[36] W. Wang, Y. Li, Y. Ou, and Y. Zhang, “Layout and task aware instruction prompt for zero-shot document image question answering,” 2023.
[37] Q. Peng, Y. Pan, W. Wang, B. Luo, Z. Zhang, Z. Huang, T. Hu, W. Yin, Y. Chen, Y. Zhang, S. Feng, Y. Sun, H. Tian, H. Wu, and H. Wang, “Ernie-layout:
Layout knowledge enhanced pre-training for visually-rich document understanding,” 2022.
[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
[39] J. Cheng, L. Dong, and M. Lapata, “Long short-term memory-networks for machine reading,” 2016.
[40] A. Graves, “Generating sequences with recurrent neural networks,” 2014.

[41] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2016.
[42] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” 2018.
[43] R. J. Williams and D. Zipser, “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, pp. 270–280,
1989. [Online]. Available: https://api.semanticscholar.org/CorpusID:14711886
[44] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language
understanding,” 2019.
[45] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose
language understanding systems,” 2020.
[46] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” 2015.
[47] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” 2016.
[48] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker,
S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey,
R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. Bers, S. Biderman, L. Gao, T. Wolf, and A. M.
Rush, “Multitask prompted training enables zero-shot task generalization,” 2022.
[49] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, E. Pathak,
G. Karamanolakis, H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, M. Patel, K. K. Pal, M. Moradshahi, M. Parmar, M. Purohit,
N. Varshney, P. R. Kaza, P. Verma, R. S. Puri, R. Karia, S. K. Sampat, S. Doshi, S. Mishra, S. Reddy, S. Patro, T. Dixit, X. Shen, C. Baral, Y. Choi,
N. A. Smith, H. Hajishirzi, and D. Khashabi, “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” 2022.
[50] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in
International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=d7KBjmI3GmQ
[51] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen,
C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art natural language
processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online:
Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[52] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” 2018.
[53] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” 2016.
[54] T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” 2018.
[55] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging
llm-as-a-judge with mt-bench and chatbot arena,” 2023.
[56] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning
transferable visual models from natural language supervision,” 2021.
[57] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and
N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.
[58] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2019.
[59] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” 2021.
