

DocLLM: JPMorgan’s New AI for Visually Rich Multimodal Document Intelligence

Introduction
Documents are everywhere in our daily lives, from forms and invoices to
reports and contracts. They often contain rich and complex information
that requires both textual and spatial understanding. However, most of
the existing artificial intelligence (AI) models are not well-equipped to
handle such multimodal documents, as they either ignore the layout
structure or rely on expensive image encoders.


A new generative language model has been developed by a team of researchers at JPMorgan AI Research. JPMorgan Chase is one of the largest financial institutions in the world, and the primary goal behind this new model was to create a system capable of understanding and reasoning over visual documents, taking into account both textual semantics and spatial layout. It provides a scalable and robust solution for document intelligence, a key area of interest for JPMorgan and other businesses that deal with large volumes of diverse documents. This new model is called 'DocLLM'.

What is DocLLM?

DocLLM is a lightweight extension to traditional large language models (LLMs) designed for reasoning over visual documents. It stands out by focusing exclusively on bounding box information to incorporate the spatial layout structure, avoiding the need for expensive image encoders.
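To make the bounding-box idea concrete, here is a minimal illustrative sketch (my own example, not code from the paper or its repository) of the kind of layout-aware input such a model consumes: each OCR token is paired with a normalized bounding box, and no page image is needed.

```python
# Hypothetical example of a layout-aware input record. Each OCR token is
# paired with a normalized bounding box (x0, y0, x1, y1); no page image or
# vision encoder is involved. The values below are made up for illustration.

ocr_tokens = [
    {"text": "Invoice",   "bbox": (0.08, 0.05, 0.22, 0.08)},
    {"text": "Number:",   "bbox": (0.08, 0.12, 0.20, 0.15)},
    {"text": "INV-0042",  "bbox": (0.22, 0.12, 0.35, 0.15)},
    {"text": "Total:",    "bbox": (0.08, 0.80, 0.16, 0.83)},
    {"text": "$1,250.00", "bbox": (0.18, 0.80, 0.30, 0.83)},
]

# The text sequence and the box sequence are kept side by side, so the model
# can reason over what the tokens say and where they sit on the page.
texts = [t["text"] for t in ocr_tokens]
boxes = [t["bbox"] for t in ocr_tokens]
print(texts)
print(boxes)
```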

Key Features of DocLLM

DocLLM has several key features that make it a unique and powerful
model for multimodal document understanding. Some of these features
are:

● Disentangled Spatial Attention Mechanism: One of the standout features of DocLLM is its disentangled spatial attention mechanism. This novel approach decomposes the attention computation found in classical transformers into a set of disentangled matrices, which allows for a more nuanced understanding of the document.
● Handling of Irregular Layouts: Traditional models often struggle with the irregular layouts found in visual documents. However, DocLLM’s unique approach allows it to handle these irregular layouts effectively, making it a versatile tool for document understanding.
● Dealing with Heterogeneous Content: Visual documents often
contain heterogeneous content, which can be challenging for many
models. DocLLM, with its unique features, is capable of dealing
with such content, making it a robust model for multimodal
document understanding.

These features make DocLLM a powerful tool for understanding and reasoning over visual documents, taking into account both textual semantics and spatial layout. Its ability to handle irregular layouts and heterogeneous content sets it apart from traditional models.

Capabilities/Use Case of DocLLM

● Fine-Tuning Using a Large-Scale Instruction Dataset: DocLLM is fine-tuned using a large-scale instruction dataset. This dataset covers four core document intelligence tasks, providing a comprehensive training ground for the model (a hypothetical prompt format for these tasks is sketched after this list).
● Superior Performance on Diverse Datasets: DocLLM has
demonstrated its robustness and effectiveness by outperforming
state-of-the-art large language models on 14 out of 16 datasets
across all tasks. This shows the model’s ability to handle a wide
range of document types and layouts.
● Strong Generalization Capabilities: In addition to its impressive
performance on known datasets, DocLLM also generalizes well to
previously unseen datasets. It has shown strong performance on 4
out of 5 such datasets, indicating its potential for real-world
applications.
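To illustrate what such instruction data might look like, below is a hypothetical sketch of prompt/target pairs for the four core tasks referenced in the evaluation section later in this article: visual question answering (VQA), natural language inference (NLI), key information extraction (KIE), and document classification (CLS). The template wording and the toy document are assumptions, not the authors' exact instruction format.

```python
# Hypothetical prompt/target pairs for four core document intelligence tasks.
# The template wording and the toy document are illustrative assumptions,
# not the authors' exact instruction format.

doc_text = "Invoice Number: INV-0042 Total: $1,250.00"

instruction_examples = {
    "vqa": {"prompt": f"{doc_text}\nWhat is the invoice number?",
            "target": "INV-0042"},
    "nli": {"prompt": f"{doc_text}\nDoes the document state a total amount? Answer Yes or No.",
            "target": "Yes"},
    "kie": {"prompt": f"{doc_text}\nExtract the value of the field 'Total'.",
            "target": "$1,250.00"},
    "cls": {"prompt": f"{doc_text}\nWhat type of document is this?",
            "target": "invoice"},
}

for task, pair in instruction_examples.items():
    print(task, "->", pair["target"])
```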

Working Mechanism and Architecture of DocLLM

DocLLM is a lightweight extension to standard Large Language Models (LLMs) that excels in visually rich form understanding tasks. It models both spatial layouts and text semantics, making it intrinsically multi-modal. The model incorporates spatial layout information through the bounding box coordinates of text tokens, typically obtained using Optical Character Recognition (OCR), without the need for any vision encoder component.

source - https://github.com/dswang2011/DocLLM

The architecture of DocLLM is built upon the foundation of an auto-regressive transformer language model. It follows a causal decoder structure and is composed of stacked transformer blocks, each containing a multi-head self-attention layer and a fully connected feed-forward network. Unlike standard language models, which are typically unimodal and accept only a sequence of text tokens as input, DocLLM is a multimodal system: it integrates lightweight visual information by utilizing the spatial positions and dimensions of text tokens obtained using OCR.

The attention mechanism of LLMs is extended in DocLLM to capture dependencies between text semantics and spatial layouts. This extension allows DocLLM to understand both the textual content and the spatial arrangement of elements in a document, treating the spatial information as a distinct modality. It computes the inter-dependency between the text modality and this spatial modality in a disentangled manner.
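A simplified way to picture this disentangled computation: the attention score between two positions is assembled from separate text-to-text, text-to-spatial, spatial-to-text, and spatial-to-spatial terms. The single-head sketch below is my own illustration of that idea, not the authors' implementation; the projection names, the fixed lambda weights, and the tensor sizes are assumptions.

```python
import torch

def disentangled_attention(text_h, spatial_h, w):
    """Single-head sketch of disentangled spatial attention.

    text_h    : (seq, d) hidden states of the text tokens
    spatial_h : (seq, d) embeddings of the tokens' bounding boxes
    w         : dict of projection matrices (names are assumptions), each (d, d)

    The score between two positions mixes four terms: text-to-text,
    text-to-spatial, spatial-to-text and spatial-to-spatial.
    """
    d = text_h.shape[-1]
    Qt, Kt = text_h @ w["q_text"], text_h @ w["k_text"]
    Qs, Ks = spatial_h @ w["q_spatial"], spatial_h @ w["k_spatial"]

    # Illustrative fixed mixing weights; in practice these would be
    # hyperparameters or learned values.
    lam_ts, lam_st, lam_ss = 1.0, 1.0, 1.0

    scores = (Qt @ Kt.T
              + lam_ts * (Qt @ Ks.T)
              + lam_st * (Qs @ Kt.T)
              + lam_ss * (Qs @ Ks.T)) / d ** 0.5

    # Causal mask, since DocLLM follows a causal decoder structure.
    seq = text_h.shape[0]
    causal = torch.tril(torch.ones(seq, seq)).bool()
    scores = scores.masked_fill(~causal, float("-inf"))

    attn = torch.softmax(scores, dim=-1)
    return attn @ (text_h @ w["v_text"])  # values taken from the text stream

# Tiny usage example with random tensors.
d, seq = 16, 5
w = {name: torch.randn(d, d) / d ** 0.5
     for name in ["q_text", "k_text", "q_spatial", "k_spatial", "v_text"]}
out = disentangled_attention(torch.randn(seq, d), torch.randn(seq, d), w)
print(out.shape)  # torch.Size([5, 16])
```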


DocLLM uses text block infilling as a pre-training objective, which allows the model to better leverage contextual information and handle visual documents more effectively. The pre-trained model is then fine-tuned for several document intelligence tasks on instruction data curated from several datasets.
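As a rough, self-contained illustration of block infilling (my own toy sketch, not the authors' pre-training code; the placeholder tokens are assumptions), contiguous text blocks are removed from the input and the model is trained to regenerate them from the surrounding context:

```python
import random

def make_infilling_example(blocks, mask_ratio=0.25, seed=0):
    """Toy sketch of a block-infilling pre-training example.

    `blocks` is a list of text blocks (e.g. OCR lines or segments). A subset
    of blocks is replaced by placeholder tokens, and the training target is
    to regenerate the masked blocks. The marker tokens are assumptions.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(blocks) * mask_ratio))
    masked_ids = set(rng.sample(range(len(blocks)), n_mask))

    corrupted, targets = [], []
    for i, block in enumerate(blocks):
        if i in masked_ids:
            corrupted.append(f"<infill_{i}>")        # hole left in the input
            targets.append(f"<infill_{i}> {block}")  # content to regenerate
        else:
            corrupted.append(block)

    return " ".join(corrupted), " ".join(targets)

blocks = ["Invoice Number: INV-0042", "Date: 2024-01-05",
          "Bill To: Acme Corp", "Total: $1,250.00"]
model_input, model_target = make_infilling_example(blocks)
print(model_input)
print(model_target)
```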

In essence, DocLLM modifies the standard pre-training objective to better handle visual documents, making it a powerful tool for document intelligence tasks.

Performance Evaluation with Other Models

The performance of DocLLM was evaluated in two experimental settings:

1. Same Datasets, Different Splits (SDDS): In this setting, DocLLM was evaluated on the unseen test split of each of the 16 datasets used for instruction-tuning. This evaluation aimed to assess how DocLLM performs when tasks and domains remain the same from training to testing.
2. Same Tasks, Different Datasets (STDD): In this setting, DocLLM
was evaluated on held-out datasets. The model was
instruction-tuned on prompts from 11 of the 16 datasets considered
in SDDS, and then evaluated on the test split of the remaining five datasets. This evaluation aimed to assess the performance
of DocLLM when tasks remain unchanged but domains and
layouts differ from training to testing.

In both the SDDS and STDD settings, DocLLM was benchmarked against comparably sized, state-of-the-art large language models (LLMs) using zero-shot (ZS) prompts.


The evaluation metrics used included Average Normalized Levenshtein Similarity (ANLS) for the VQA datasets, CIDEr for VisualMRC, accuracy for the WTQ, CLS, and NLI datasets, and F1 score for the KIE datasets.
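For reference, ANLS scores a prediction by its normalized edit distance to the closest acceptable answer and truncates low-similarity matches to zero (a threshold of 0.5 is commonly used). The sketch below shows this standard formulation; it is not taken from the paper's evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, gold_answers, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    `gold_answers[i]` is a list of acceptable answers for question i.
    Similarities below `threshold` are truncated to 0, as in the standard
    definition of the metric.
    """
    scores = []
    for pred, answers in zip(predictions, gold_answers):
        best = 0.0
        for ans in answers:
            dist = levenshtein(pred.lower().strip(), ans.lower().strip())
            sim = 1.0 - dist / max(len(pred), len(ans), 1)
            best = max(best, sim)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / max(len(scores), 1)

print(anls(["INV-0042"], [["INV-0042", "inv 0042"]]))  # 1.0
```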

source - https://arxiv.org/pdf/2401.00908.pdf

In the SDDS setting (shown in the figure above), DocLLM-7B excelled in 12 out of 16 datasets, outperforming equivalent models on 14 out of 16 datasets, and demonstrated superior performance in layout-intensive tasks such as KIE and CLS. In the STDD setting, DocLLM outperformed Llama2 on four out of five datasets and achieved the best overall score on two of them. However, it’s important to note that classification accuracy was notably lower for DocLLM, possibly because the model was trained on only one classification dataset, limiting its ability to generalize to new datasets.

How to Access and Use This Model?

DocLLM is an open-source model, and its source code is readily available for anyone interested in using or studying it. The source code for DocLLM is hosted on GitHub, and the repository contains the necessary code files along with instructions on how to set up and use the model. It’s a great resource for developers and researchers who want to use DocLLM in their projects or study its inner workings.

Remember, while the model is open-source and freely available, it’s important to use it responsibly and ethically, respecting all relevant guidelines and regulations. If you are interested in learning more about this model, all relevant links are provided under the 'source' section at the end of this article.

Conclusion

DocLLM represents a significant advancement in the field of document understanding. Its unique approach to incorporating spatial layout information and its impressive performance on various datasets make it a promising tool for future applications.

Source
Research paper - https://arxiv.org/abs/2401.00908
GitHub repo - https://github.com/dswang2011/DocLLM
Hugging Face page - https://huggingface.co/papers/2401.00908

To read more such articles, please visit our blog https://socialviews81.blogspot.com/
