
OCR in the Wild

Contents :
1. The Origins and Prevalence of Texture Bias in Convolutional Neural Networks

2. Models trained on SynthText and MJSynth


- EAST Model
- Synthetic to Real Unsupervised DA for Scene Text Detection in the Wild
- Masked Vision Language Transformer

3. Segment Anything
- Internal Components
- Working
- Results on Flyers along with OCR

4. Results on SynthText images using Different Models and Natural Augmentation Techniques
- SAM + OCR
- Text Spotter
- OCR
- Proof of texture bias over shape bias
- Natural augmentations (as suggested in item 1 to overcome the texture-bias problem)
- Cutout
- Gaussian Noise and Blur
- Sobel filtering
- Color Discoloration
The Origins and Prevalence of Texture Bias in Convolutional Neural Networks
● ImageNet-trained CNNs tend to classify images by texture rather than by shape, in contrast to human perception: the models rely on superficial textural features rather than shape information.

● Texture bias refers to a model's tendency to rely on the local visual patterns (textures) of the characters; a texture-biased OCR model recognizes text by matching these patterns rather than the underlying character shapes.
● Shape bias refers to a model's tendency to rely on the shapes of the characters; a shape-biased OCR model can still recognize text in new domains where the visual patterns differ from those in the training data.
● Texture Bias Limitations:
- Struggles with handwritten text or low-resolution scans that have different visual patterns and textures.
- Performance may degrade when faced with text that deviates significantly from the patterns seen in the training data.
- Less effective in scenarios where text appearance is altered, distorted, or obscured.

● Shape Bias Limitations:


- May face challenges with certain unconventional fonts or non-standard character shapes that are not well-represented
in the training data.
- Less effective in cases where texture and overall appearance provide critical cues for accurate text recognition.
- May require additional adaptations or data augmentation techniques to handle variations in character shapes.
Baseline Model - EAST : An Efficient and Accurate Scene Text Detector

The EAST model is a deep learning-based text detection model that can detect text in natural scene images. The SynthText dataset is a
synthetically generated dataset that contains word instances placed in natural scene images while taking into account the scene layouts.
It is used to train EAST models for text detection.

The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps with a single neural network. The simplicity of the pipeline allows effort to be concentrated on designing the loss functions and the network architecture.

Experiments on standard datasets including ICDAR 2015, COCO-Text and MSRA-TD500 demonstrate that the proposed algorithm
significantly outperforms state-of-the-art methods in terms of both accuracy and efficiency.

On the ICDAR 2015 dataset, the proposed algorithm achieves an F-score of 0.7820 at 13.2fps at 720p resolution.

Github : argman/EAST: A tensorflow implementation of EAST text detector (github.com)


Paper : [1704.03155v2] EAST: An Efficient and Accurate Scene Text Detector (arxiv.org)
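
For reference, below is a minimal sketch of running a pre-trained EAST detector through OpenCV's DNN module (not the argman/EAST TensorFlow code). The frozen-graph file name, input image path, input size, and score threshold are assumptions, and the box decoding is simplified: it ignores the rotation channel of the geometry map.

import cv2
import numpy as np

# load the commonly distributed frozen EAST graph (assumed file name)
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

image = cv2.imread("flyer.jpg")
W, H = 640, 640  # EAST needs input dimensions that are multiples of 32
blob = cv2.dnn.blobFromImage(image, 1.0, (W, H),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

# scores: per-cell text confidence; geometry: distances to the four box edges
# plus a rotation angle. The decode below ignores the angle for brevity.
conf_threshold = 0.5
rows, cols = scores.shape[2:4]
boxes = []
for y in range(rows):
    for x in range(cols):
        score = float(scores[0, 0, y, x])
        if score < conf_threshold:
            continue
        d = geometry[0, :, y, x]        # d[0..3]: top/right/bottom/left offsets
        cx, cy = x * 4.0, y * 4.0       # each output cell maps to 4x4 input pixels
        w, h = d[1] + d[3], d[0] + d[2]
        boxes.append((cx - d[3], cy - d[0], w, h, score))

print(f"{len(boxes)} candidate text boxes before non-maximum suppression")

The candidate boxes would normally be passed through non-maximum suppression and rescaled to the original image size.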
Broader View of EAST Model Text Detection on Flyers :
Synthetic to Real Unsupervised DA for Scene Text Detection in the Wild

To address the severe domain distribution mismatch, the authors propose a synthetic-to-real domain adaptation method for scene text
detection, which transfers knowledge from synthetic data (source domain) to real data (target domain).

In this paper, a text self-training (TST) method and adversarial text instance alignment (ATA) for domain adaptive scene text
detection are introduced. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial
manner. TST diminishes the adverse effects of false positives (FPs) and false negatives (FNs) from inaccurate pseudo-labels. Both components improve the performance of scene text detectors when adapting from synthetic to real scenes.

They evaluate the proposed method by transferring from SynthText and VISD to ICDAR 2015 and ICDAR 2013. The results demonstrate the effectiveness of the method, with improvements of up to 10%, which is of significant value for the exploration of domain adaptive scene text detection.

GitHub : weijiawu/SyntoReal_STD (github.com)


Paper :Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild (thecvf.com)
Architecture :
Paper : Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild | Papers With Code
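
To illustrate the adversarial alignment idea behind ATA, here is a minimal PyTorch sketch of a gradient reversal layer feeding a small domain classifier, so the detector backbone learns features that cannot distinguish synthetic (source) from real (target) images. The layer sizes, loss weight, and feature shapes are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # flip the gradient sign so the backbone is trained to fool the classifier
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    def __init__(self, in_channels=256, lam=0.1):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 1))

    def forward(self, features):
        reversed_feat = GradientReversal.apply(features, self.lam)
        return self.head(reversed_feat)  # logit: 1 = real, 0 = synthetic

# usage: add a BCE domain loss on backbone features from both domains
clf = DomainClassifier()
feat_syn = torch.randn(2, 256, 32, 32)   # backbone features, synthetic batch
feat_real = torch.randn(2, 256, 32, 32)  # backbone features, real batch
logits = torch.cat([clf(feat_syn), clf(feat_real)])
labels = torch.cat([torch.zeros(2, 1), torch.ones(2, 1)])
domain_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)

This domain loss would be added to the detection loss on the source domain; the gradient reversal makes the backbone and the domain classifier play an adversarial game.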
Masked Vision Language Transformer
The paper introduces a novel model, the Masked Vision-Language Transformer (MVLT), which combines visual and linguistic information. Its encoder is a Vision Transformer, and its decoder is a multi-modal Transformer.

Recent studies have explored incorporating textual semantics to address the challenges of scene text recognition (STR), either through explicit language models or implicit extraction from visual cues. The authors propose a method that combines explicit and implicit textual semantics, leveraging the strengths of both approaches to enhance STR performance.

The MVLT model was pre-trained in a first stage using masked autoencoders and a multi-modal Transformer decoder. The decoder
combined visual cues with language semantics to incorporate linguistic information. The model was fine-tuned in a second stage
using unmasked scene text images and an iterative correction method to optimize the pretrained knowledge of both the encoder and
decoder.

GitHub : onealwj/MVLT: PyTorch implementation of BMVC2022 paper Masked Vision-Language Transformers for Scene Text
Recognition (github.com)

Paper : 2211.04785v1.pdf (arxiv.org)


First Stage : Pre-training stage

In the first stage of the training strategy, the researchers pretrained the MVLT model by adopting the concept of masked autoencoders (MAE): the model learns to recognize scene text while simultaneously reconstructing the masked patches. To incorporate
linguistic information, a multi-modal Transformer decoder was introduced, which combines visual cues with language semantics. The
input to the decoder consists of encoded patches, mask tokens for visual information, and character embeddings derived from the
ground-truth text label of the corresponding image.
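
As an illustration of the MAE-style masking used in this stage, the following PyTorch sketch randomly hides most patch tokens before they reach the encoder. The patch size, mask ratio, and token dimensions are assumed values for illustration, not the paper's settings.

import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """patch_tokens: (batch, num_patches, dim). Returns the visible tokens and
    the indices needed to restore the original patch order for the decoder."""
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)    # lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore

# e.g. a 32x128 text image split into 8x8 patches -> 64 patch tokens of dim 192
tokens = torch.randn(4, 64, 192)
visible, ids_restore = random_masking(tokens)
print(visible.shape)  # (4, 16, 192): only 25% of the patches reach the encoder

The decoder then receives the encoded visible patches, mask tokens re-inserted via ids_restore, and the character embeddings of the ground-truth label, as described above.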
Second Stage : Fine-tuning stage

In the second stage, the pretrained model from the first stage was further refined through a process called fine-tuning. This stage
aimed to optimize the model's performance specifically for the task of scene text recognition.

During the fine-tuning stage, the unmasked scene text images were provided as input to the encoder, which extracted relevant features
from the images. The decoder, in turn, generated the predicted text based on the extracted features. Unlike previous methods that only
fine-tuned the encoder, in this approach, both the encoder and the decoder were fine-tuned to leverage the pretrained knowledge
effectively. Additionally, the researchers introduced an iterative correction method.

In each iteration, the decoder took the image features output by the encoder and the text features output by the previous iteration as input. The predicted text was gradually modified and refined over these iterations, improving the accuracy of the recognition results.
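
A minimal sketch of this iterative correction loop is shown below; encoder, decoder, and embed_text are hypothetical placeholders standing in for the actual MVLT modules, and the number of iterations is an assumption.

import torch

@torch.no_grad()
def iterative_decode(encoder, decoder, embed_text, image, num_iters=3):
    img_feats = encoder(image)                # computed once, reused each round
    text_tokens = None                        # first round: no text hypothesis yet
    for _ in range(num_iters):
        logits = decoder(img_feats, text_tokens)   # (B, max_len, vocab)
        pred_ids = logits.argmax(dim=-1)           # current best characters
        text_tokens = embed_text(pred_ids)         # feed the prediction back in
    return pred_ids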
Segment Anything

Dive into SAM's Network Architecture and Design

SAM’s design hinges on three main components:

● The promptable segmentation task to enable zero-shot generalization.


● The model architecture.
● The dataset that powers the task and model.
Task
SAM was trained on millions of images and over a billion masks to return a
valid segmentation mask for any prompt. The prompt can be foreground/background points, a rough box or mask, clicks, text, or, in general, any information indicating what to segment in an image. This promptable segmentation task is also used as the pre-training objective for
the model.

Model
SAM’s architecture comprises three components that work together to
return a valid segmentation mask:

● An image encoder to generate one-time image embeddings.


● A prompt encoder that embeds the prompts.
● A lightweight mask decoder that combines the embeddings from the
prompt and image encoders.
Inside Segment Anything Model (SAM)

Image encoder

At the highest level, an image encoder (a masked auto-encoder, MAE, pre-trained Vision Transformer, ViT)
generates one-time image embeddings and can be applied prior to prompting the model.

Prompt encoder

The prompt encoder encodes points, masks, bounding boxes, or text into embedding vectors in real time. The work considers two sets of prompts: sparse (points, boxes, text) and dense (masks).

Points and boxes are represented by positional encodings and added with learned embeddings for each prompt type.
Free-form text prompts are represented with an off-the-shelf text encoder from CLIP. Dense prompts, like masks,
are embedded with convolutions and summed element-wise with the image embedding.
Mask decoder

A lightweight mask decoder predicts the segmentation masks based on the embeddings from both the image and
prompt encoders. It maps the image embedding, prompt embeddings, and an output token to a mask. All of the
embeddings are updated by the decoder block, which uses prompt self-attention and cross-attention in two directions
(from prompt to image embedding and back).

The predicted masks are then annotated and used to update the model weights. This data-engine loop grows the dataset and allows the model to learn and improve over time, making it efficient and flexible.
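
As a concrete illustration of this flow, below is a minimal sketch using the official segment_anything package: the image encoder runs once per image, while the prompt encoder and mask decoder run per prompt on the cached embedding. The checkpoint file name, image path, and example point coordinates are assumptions.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# the ViT-H checkpoint file name is an assumed local download
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("flyer.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # image encoder runs once and caches the embedding

# a single foreground point prompt; the prompt encoder and mask decoder then
# run in real time on the cached image embedding
masks, scores, _ = predictor.predict(
    point_coords=np.array([[400, 300]]),  # (x, y) in pixels, assumed location
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True)                # return three candidate masks
print(masks.shape, scores)                # (3, H, W) boolean masks + IoU scores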
SAM RESULTS
SAM + EASY OCR RESULTS
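
One plausible way to wire SAM together with EasyOCR for experiments like these is sketched below: generate masks automatically, then run OCR on each mask's bounding box. This is an illustrative pipeline, not necessarily the exact one used for these results; the checkpoint name, image path, and confidence cut-off are assumptions.

import cv2
import easyocr
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed file
mask_generator = SamAutomaticMaskGenerator(sam)
reader = easyocr.Reader(["en"])

image = cv2.cvtColor(cv2.imread("flyer.jpg"), cv2.COLOR_BGR2RGB)
for mask in mask_generator.generate(image):
    x, y, w, h = map(int, mask["bbox"])          # XYWH box of the segment
    crop = image[y:y + h, x:x + w]
    if crop.size == 0:
        continue
    for _, text, conf in reader.readtext(crop):
        if conf > 0.4:                           # assumed confidence cut-off
            print(text, round(conf, 2))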
Results on SynthText images using Different Models and Natural Augmentation Techniques
Results on SynthText using SAM + OCR :
Results on SynthText using Textspotter :
Results on SynthText using OCR :
Texture Bias over Shape Bias Results :

[Result panels: Color discoloration, Gaussian blur, Cutout, Gaussian noise, and Sobel filtering, shown for SAM + OCR, Textspotter, and OCR]
Data Augmentation Techniques applied using OCR as discussed in ‘The Origins
and Prevalence of Texture Bias in Convolutional Neural Networks’
Results using Color Discoloration on SynthText :
Results using Cutout on SynthText :
Results using Gaussian Blur on SynthText :
Results using Gaussian Noise on SynthText :
Results using Sobel Filtering on SynthText :
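
A minimal sketch of the augmentation operations listed above, implemented with OpenCV/NumPy, is given below; the kernel sizes, noise level, hue shift, cut-out size, and image path are illustrative assumptions rather than the exact settings used in the experiments.

import cv2
import numpy as np

def cutout(img, size=50):
    out = img.copy()
    h, w = out.shape[:2]
    y, x = np.random.randint(0, h - size), np.random.randint(0, w - size)
    out[y:y + size, x:x + size] = 0          # zero out a random square patch
    return out

def gaussian_noise(img, sigma=25):
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def gaussian_blur(img, ksize=7):
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def sobel_filtering(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    return cv2.convertScaleAbs(cv2.magnitude(gx, gy))  # edge magnitude image

def color_discoloration(img):
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 0] = (hsv[..., 0] + np.random.randint(0, 180)) % 180  # shift hue
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

image = cv2.imread("synthtext_sample.jpg")
augmented = [f(image) for f in
             (cutout, gaussian_noise, gaussian_blur, sobel_filtering,
              color_discoloration)]

Each augmented image can then be fed to the OCR pipeline to compare how strongly the model's predictions depend on texture versus shape cues.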
