
2/7/22, 6:30 PM Optical Character Recognition | OCR Text Recognition


Build your own Optical Character Recognition (OCR) System using Google’s Tesseract and OpenCV
Aniruddha Bhandari — May 16, 2020
Beginner
Computer Vision
Image
Python
Technique

Overview
Optical Character Recognition (OCR) is a widely used system in the computer vision space
Learn how to build your own OCR for a variety of tasks
We will leverage the OpenCV library and Tesseract for building the OCR system
 

Introduction
Do you remember the days when you had to fill in the dots of the right answer during an exam? Or how about the aptitude test
you gave before your first job? I can vividly recall the olympiads and multiple-choice tests where universities and organizations
used an Optical Character Recognition (OCR) system to grade the answer sheets in droves.

OCR has applications across a broad range of industries and functions. Everything from scanning documents (bank
statements, receipts, handwritten documents, coupons, etc.) to reading street signs in autonomous vehicles falls under
the OCR umbrella.

OCR systems used to be quite expensive and cumbersome to build a couple of decades ago. But advances in the computer
vision and deep learning field mean we can build our own OCR system right now!

But building an OCR system isn’t a straightforward task. For starters, it is filled with problems like different fonts in images,
poor contrast, multiple objects in an image, etc.

https://www.analyticsvidhya.com/blog/2020/05/build-your-own-ocr-google-tesseract-opencv/ 1/11

So, in this article, we will explore some very famous and effective approaches for the OCR task and how you can implement one
yourself.

If you are new to object detection and computer vision, I suggest going through the following resources:

Step-by-Step Introduction to Basic Object Detection Algorithms


Computer Vision Course
 

Table of Contents
1. What is Optical Character Recognition (OCR)?
2. Popular OCR Applications in the Real World
3. Text Recognition with Tesseract OCR
4. The Different Ways for Text Detection

What is Optical Character Recognition (OCR)?


Let’s first understand what OCR is, in case you haven’t come across this concept before.

OCR, or Optical Character Recognition, is a process of recognizing text inside images and converting it into an electronic form.
These images could be of handwritten text, printed text like documents, receipts, name cards, etc., or even a natural scene
photograph.

OCR has two parts to it. The first part is text detection where the textual part within the image is determined. This localization
of text within the image is important for the second part of OCR, text recognition, where the text is extracted from the image.
Using these techniques together is how you can extract text from any image.

But nothing is perfect and OCR is no exception. However, with the advent of deep learning, it has become possible to get better
and more generalized solutions to this problem.

Before we dive into how to build your own OCR, let’s take a look at some of the popular applications of OCR.

Popular OCR Applications in the Real World


OCR has widespread applications across industries, primarily with the aim of reducing manual human effort. It has been
incorporated into our everyday life to such an extent that we hardly ever notice it, yet these applications quietly
improve the user experience.


OCR is used for handwriting recognition tasks to extract information. A lot of work is going on in this field and we have made
some really significant advancements. Microsoft has come up with an awesome mathematical application that takes as input a
handwritten mathematical equation and generates the solution along with a step-by-step explanation of the working.

OCR is increasingly being used for digitization by various industries to cut down manual workload. This makes it very easy and
efficient to extract and store information from business documents, receipts, invoices, passports, etc. Also, when you upload
your documents for KYC (Know Your Customer), OCR is used to extract information from these documents and store them for
future reference.

OCR is also used for book scanning where it turns raw images into a digital text format. Many large scale projects like the
Gutenberg project, Million Book Project, and Google Books use OCR to scan and digitize books and store the works as an
archive.

The banking industry is also increasingly using OCR to archive client-related paperwork, like onboarding material, to easily
create a client repository. This significantly reduces the onboarding time and thereby improves the user experience. Also, banks
use OCR to extract information like account number, amount, cheque number from cheques for faster processing.

The applications of OCR are incomplete without mentioning their use in self-driving cars. Autonomous cars rely extensively on
OCR to read signposts and traffic signs. An effective understanding of these signs makes autonomous cars safe for pedestrians
and other vehicles that ply on the roads.

There are definitely many more applications of OCR like vehicle number plate recognition, converting scanned documents into
editable word documents, and many more. I would love to hear your experience of using OCR – let me know in the comments
section below.

The digitization using OCR obviously has widespread advantages like easy storage and manipulation of the text, not to mention
the unfathomable amount of analytics that you can apply to this data! OCR is definitely one of the most important fields of
Computer Vision.

Now, let’s look at one of the most famous and widely used text recognition techniques – Tesseract.

Text Recognition with Tesseract OCR


Tesseract is an open-source OCR engine that was originally developed as proprietary software by HP (Hewlett-Packard) and was
made open source in 2005. Google has since adopted the project and sponsored its development.

As of today, Tesseract can recognize over 100 languages and can process even right-to-left text such as Arabic or Hebrew! No
wonder it is used by Google for text detection on mobile devices, in videos, and in Gmail’s image spam detection algorithm.

From version 4 onwards, Google has given a significant boost to this OCR engine. Tesseract 4.0 has added a new OCR engine
that uses a neural network system based on LSTM (Long Short-Term Memory), one of the most effective solutions for sequence
prediction problems. Its previous OCR engine, which uses pattern matching, is still available as legacy code.

Once you have installed Tesseract on your system, you can easily run it from the command line using the following command:

tesseract <test_image> <output_file_name> -l <language(s)> --oem <mode> --psm <mode> 

You can change the Tesseract configuration for results best suited for your image:

1. Language (-l) – You can recognize a single language or multiple languages with Tesseract
2. OCR engine mode (--oem) – As you already know, Tesseract 4 has both LSTM and Legacy OCR engines. There are four valid
operation modes based on their combination
3. Page segmentation mode (--psm) – Can be adjusted according to the text in the image for better results
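
The three options above map directly onto an argument list. As a rough sketch, a small helper could assemble the command for you. The helper name build_tesseract_args and its defaults are hypothetical and not part of Tesseract; the mode descriptions follow what tesseract --help-oem prints for Tesseract 4, and the 0 to 13 range for --psm is the one listed by tesseract --help-psm.

```python
import shlex

# OCR engine modes as listed by `tesseract --help-oem` in Tesseract 4
OEM_MODES = {
    0: "legacy engine only",
    1: "neural-net LSTM engine only",
    2: "legacy + LSTM engines",
    3: "default, based on what is available",
}


def build_tesseract_args(image, output_base, langs=("eng",), oem=1, psm=3):
    """Return the argument list for a tesseract call (hypothetical helper)."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown --oem mode: {oem}")
    if not 0 <= psm <= 13:
        raise ValueError(f"--psm must be between 0 and 13, got {psm}")
    return [
        "tesseract", image, output_base,
        "-l", "+".join(langs),   # multiple languages are joined with '+'
        "--oem", str(oem),
        "--psm", str(psm),
    ]


args = build_tesseract_args("test3.jpg", "out", langs=("eng", "hin"))
print(shlex.join(args))
```

Once Tesseract is installed, the resulting list can be handed to subprocess.run(args, check=True); building a list rather than a single string avoids quoting problems with file names that contain spaces.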


Pytesseract
However, instead of the command-line method, you could also use Pytesseract, a Python wrapper for Tesseract. Using this you
can easily implement your own text recognizer using Tesseract OCR by writing a simple Python script.

You can install Pytesseract using the pip install pytesseract command.

The main function in Pytesseract is image_to_string(), which takes the image and the command-line options as its arguments:

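# text recognition
import cv2
import pytesseract

# read image (OpenCV loads it in BGR channel order)
im = cv2.imread('./test3.jpg')
# pytesseract assumes RGB, so convert a copy before recognition
im_rgb = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)

# configurations: English, LSTM engine, automatic page segmentation
config = '-l eng --oem 1 --psm 3'

# pytesseract
text = pytesseract.image_to_string(im_rgb, config=config)

# print text, one list entry per line
text = text.split('\n')
print(text)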
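
Besides plain strings, Pytesseract can also return per-word boxes and confidences through image_to_data(). The sketch below shows one way to keep only the confident words. It assumes the dictionary shape produced by pytesseract.image_to_data(im, output_type=pytesseract.Output.DICT), that is, parallel lists under keys such as 'text', 'conf', 'left', 'top', 'width' and 'height'. The function name confident_words and the cut-off of 60 are illustrative choices, and the sample dictionary is hand-written so that the code runs without Tesseract installed.

```python
def confident_words(data, min_conf=60.0):
    """Keep words whose confidence is at least min_conf.

    `data` is assumed to hold parallel lists under the keys
    'text', 'conf', 'left', 'top', 'width' and 'height'.
    """
    words = []
    for i, raw in enumerate(data["text"]):
        word = raw.strip()
        # conf may arrive as str, int or float; -1 marks rows that are not words
        conf = float(data["conf"][i])
        if word and conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append((word, conf, box))
    return words


# Hand-written sample in the assumed shape (not real Tesseract output)
sample = {
    "text":   ["", "Hello", "wor1d", " "],
    "conf":   ["-1", "96", "41", "-1"],
    "left":   [0, 10, 80, 0],
    "top":    [0, 12, 12, 0],
    "width":  [200, 60, 62, 0],
    "height": [50, 20, 20, 0],
}
print(confident_words(sample))
```

Filtering on confidence is a cheap way to drop garbage characters that Tesseract produces on noisy regions, at the cost of occasionally discarding a correct word.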


What are the Challenges with Tesseract?


It’s no secret that Tesseract is not perfect. It performs poorly when the image has a lot of noise or when the text uses a font
on which Tesseract was not trained. Other conditions like brightness or skew of the text will also affect the
performance of Tesseract. Nevertheless, it is a good starting point for text recognition that delivers useful results for little effort.

The Different Ways for Text Detection



Tesseract assumes that the input text image is fairly clean. Unfortunately, many input images will contain a plethora of objects
and not just a clean preprocessed text. Therefore, it becomes imperative to have a good text detection system that can detect
text which can then be easily extracted.

There are a fair few ways to do text detection:

Traditional way of using OpenCV
Contemporary way of using Deep Learning models, and
Building your very own custom model

Text Detection using OpenCV


Text detection using OpenCV is the classic way of doing things. You can apply various manipulations like image resizing,
blurring, thresholding, morphological operations, etc. to clean the image.

# preprocessing
import os
import cv2

# cv2.imwrite does not create missing folders, so make sure the output folder exists
os.makedirs("./preprocess", exist_ok=True)

# gray scale
def gray(img):
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cv2.imwrite(r"./preprocess/img_gray.png", img)
    return img

# blur
def blur(img):
    img_blur = cv2.GaussianBlur(img, (5, 5), 0)
    # save the blurred result, not the input image
    cv2.imwrite(r"./preprocess/img_blur.png", img_blur)
    return img_blur

# threshold
def threshold(img):
    # THRESH_OTSU picks the threshold automatically, so the value 100 is ignored;
    # pixels at or below the chosen threshold turn black (0), the rest white (255)
    img = cv2.threshold(img, 100, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY)[1]
    cv2.imwrite(r"./preprocess/img_threshold.png", img)
    return img
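
To make the thresholding step less of a black box, here is a minimal pure-Python sketch of the idea behind Otsu's method: try every gray level as a cut-off and keep the one that best separates dark pixels from bright ones (maximum between-class variance). This is an illustration on a flat list of gray values, not OpenCV's implementation, and the function name otsu_threshold is my own.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet, nothing to split
        w_bright = total - w_dark
        if w_bright == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        between_var = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Toy "image": dark text pixels around 20-30, bright paper around 200-220
pixels = [20] * 50 + [30] * 50 + [200] * 50 + [220] * 50
t = otsu_threshold(pixels)
binary = [255 if p > t else 0 for p in pixels]
print(t, binary.count(0), binary.count(255))
```

Because the cut-off is derived from the histogram of each image, it adapts to overall brightness, which a fixed value such as 100 cannot do.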


Here we have the grayscale, blurred, and thresholded images, in that order.

Once you have done that, you can use OpenCV's contour detection to find the contours and extract chunks of text:

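# Finding contours
im_gray = gray(im)
im_blur = blur(im_gray)
im_thresh = threshold(im_blur)

# findContours returns (contours, hierarchy) in OpenCV 4.x and
# (image, contours, hierarchy) in OpenCV 3.x; [-2] picks the contours in both
contours = cv2.findContours(im_thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]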
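
Contours come back in no particular reading order and usually include specks of noise. A hedged sketch of one way to tidy them up: convert each contour to an (x, y, w, h) box with cv2.boundingRect, drop the tiny ones, and sort the rest top-to-bottom and left-to-right. The helper name and the min_area and line_tol defaults are illustrative assumptions, not tuned values.

```python
def filter_and_sort_boxes(boxes, min_area=100, line_tol=10):
    """Drop tiny boxes, then order the rest in reading order.

    boxes: iterable of (x, y, w, h) tuples, e.g. from cv2.boundingRect.
    """
    kept = [b for b in boxes if b[2] * b[3] >= min_area]
    kept.sort(key=lambda b: (b[1], b[0]))   # sort by top edge first

    lines, current = [], []
    for box in kept:
        # start a new text line when the top edge jumps by more than line_tol
        if current and box[1] - current[0][1] > line_tol:
            lines.append(sorted(current, key=lambda b: b[0]))
            current = []
        current.append(box)
    if current:
        lines.append(sorted(current, key=lambda b: b[0]))
    return [box for line in lines for box in line]


boxes = [(200, 12, 50, 20), (10, 10, 60, 20), (10, 60, 80, 20), (5, 5, 3, 3)]
print(filter_and_sort_boxes(boxes))
```

With the boxes in reading order, the recognized snippets can simply be joined to recover the text in a sensible sequence.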


Finally, you can apply text recognition on the contours that you got to predict the text:

# text detection
def contours_text(orig, img, contours):
    for cnt in contours:
        x, y, w, h = cv2.boundingRect(cnt)

        # Cropping the text block first, so the rectangle drawn below is not part of the OCR input
        cropped = orig[y:y + h, x:x + w].copy()

        # Drawing a rectangle on the image to visualize the detected block
        rect = cv2.rectangle(orig, (x, y), (x + w, y + h), (0, 255, 255), 2)

        cv2.imshow('cnt', rect)
        cv2.waitKey()

        # Apply OCR on the cropped image
        config = '-l eng --oem 1 --psm 3'
        text = pytesseract.image_to_string(cropped, config=config)

        print(text)
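
Bounding boxes from contours hug the text tightly, and Tesseract tends to do better when a crop keeps a small margin around the characters. As a sketch under that assumption, a helper like the one below grows a box by a few pixels while staying inside the image. The name padded_box and the 10-pixel default are illustrative.

```python
def padded_box(x, y, w, h, img_w, img_h, pad=10):
    """Grow (x, y, w, h) by pad pixels per side, clamped to the image bounds.

    Returns corner coordinates (x0, y0, x1, y1) ready for array slicing.
    """
    x0 = max(x - pad, 0)
    y0 = max(y - pad, 0)
    x1 = min(x + w + pad, img_w)
    y1 = min(y + h + pad, img_h)
    return x0, y0, x1, y1


print(padded_box(5, 5, 20, 10, img_w=100, img_h=50))
```

Inside a loop over contours this would be used as x0, y0, x1, y1 = padded_box(x, y, w, h, orig.shape[1], orig.shape[0]) followed by cropped = orig[y0:y1, x0:x1].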


The results in the image above were achieved with minimum preprocessing and contour detection followed by text recognition
using Pytesseract. Obviously, the contours did not detect the text every time.

Still, text detection with OpenCV is tedious and requires a lot of playing around with the parameters. It also does
not generalize well. A better way of doing this is by using the EAST text detection model.

Contemporary Deep Learning Model – EAST


EAST, or Efficient and Accurate Scene Text Detector, is a deep learning model for detecting text in natural scene images. It is
pretty fast and accurate: the authors report an F-score of 0.7820 on the ICDAR 2015 dataset at 13.2 fps on 720p images.

The model consists of a Fully Convolutional Network and a Non-Maximum Suppression stage to predict words or text lines. The
model does not include intermediate steps like candidate proposal, text region formation, and word partition
that were involved in previous models, which allows for a simpler, optimized pipeline.
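
The Non-Maximum Suppression stage mentioned above can be illustrated with a short sketch. Note the simplification: the EAST paper uses a locality-aware variant that works on rotated boxes and quadrangles, while the code below is the standard greedy NMS on axis-aligned boxes, which is enough to show the principle. The function names and the 0.5 overlap threshold are my own choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with a better one
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 60), (12, 12, 112, 62), (200, 200, 300, 250)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

The first two boxes cover almost the same word, so only the higher-scoring one survives, while the third box is far away and is kept.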

You can have a look at the image below provided by the authors in their paper comparing the EAST model with other previous
models:


EAST has a U-shaped network. The first part of the network consists of convolutional layers pre-trained on the ImageNet dataset.
The next part is the feature merging branch, which concatenates the current feature map with the unpooled feature map from
the previous stage.

This is followed by convolutional layers to reduce computation and produce output feature maps. Finally, a convolutional
layer outputs a score map showing the presence of text and a geometry map, which is either a rotated box or a quadrangle
that covers the text. This can be visually understood from the image of the architecture that was included in the research paper:

I highly suggest you go through the paper yourself to get a good understanding of the EAST model.

OpenCV has included the EAST text detector model in version 3.4 onwards. This makes it super convenient to implement your
own text detector. The resulting localized text boxes can be passed through Tesseract OCR to extract the text and you will have
a complete end-to-end model for OCR.
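
To give a feel for what happens between the network output and Tesseract, here is a sketch that turns a score map and a geometry map into boxes. It follows the decoding commonly used with the pretrained EAST graph and OpenCV's dnn module, simplified to axis-aligned boxes and written on plain nested lists so it runs without any model. The map layout (four edge distances plus an angle per cell, one cell per 4 input pixels) is an assumption to check against the model you actually load, and decode_east is a hypothetical name.

```python
import math


def decode_east(score_map, geo_maps, score_thresh=0.5, stride=4):
    """Turn EAST-style output maps into boxes (x1, y1, x2, y2) with scores.

    Assumed layout: score_map[y][x] is the text probability of a cell;
    geo_maps[0..3][y][x] are the distances from the cell to the top, right,
    bottom and left box edges; geo_maps[4][y][x] is the angle in radians.
    """
    boxes, scores = [], []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score < score_thresh:
                continue
            top, right, bottom, left = (geo_maps[c][y][x] for c in range(4))
            angle = geo_maps[4][y][x]
            cos_a, sin_a = math.cos(angle), math.sin(angle)
            # position of the cell in input-image pixels
            off_x, off_y = x * stride, y * stride
            # bottom-right corner after applying the rotation
            end_x = off_x + cos_a * right + sin_a * bottom
            end_y = off_y - sin_a * right + cos_a * bottom
            w, h = left + right, top + bottom
            boxes.append((end_x - w, end_y - h, end_x, end_y))
            scores.append(score)
    return boxes, scores


# Toy 4x4 maps with a single confident cell at x=2, y=3 (made-up numbers)
H, W = 4, 4
score_map = [[0.0] * W for _ in range(H)]
geo_maps = [[[0.0] * W for _ in range(H)] for _ in range(5)]
score_map[3][2] = 0.9
for channel, value in enumerate([5.0, 10.0, 7.0, 6.0, 0.0]):
    geo_maps[channel][3][2] = value
print(decode_east(score_map, geo_maps))
```

In a full pipeline the boxes would then be de-duplicated with non-maximum suppression, scaled back to the original image size, and each region cropped and passed to pytesseract.image_to_string().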


Custom Model using TensorFlow Object Detection API for Text Detection


The final method is to build a custom text detector model using the TensorFlow Object Detection API. It is
an open-source framework used to build deep learning models for object detection tasks. To understand it in detail, I suggest
going through this detailed article first.

To build your custom text detector, you would obviously require a dataset of quite a few images, at least more than 100. Then
you need to annotate these images so that the model can know where the target object is and learn everything about it. Finally,
you can choose from one of the pre-trained models, depending on the trade-off between performance and speed, from
TensorFlow’s detection model zoo. You can refer to this comprehensive blog to build your custom model.

Now, training can require a fair amount of computation, but if you don’t have enough of it, don’t worry! You can use Google
Colaboratory for all your requirements! This article will teach you how to use it effectively.

Finally, if you want to go a step ahead and build a YOLO state-of-the-art text detector model, this article will be a stepping stone
to understanding all the nitty-gritty of it and you will be off to a great start!

End Notes
In this article, we covered the problems in OCR and the various approaches that can be used to solve the task. We also
discussed the various shortcomings in the approaches and why OCR is not as easy as it seems!

Have you worked with any OCR application before? What kind of OCR use cases do you plan on building after this? Let me
know your ideas and feedback below.


About the Author


Aniruddha Bhandari


4 thoughts on "Build your own Optical Character Recognition (OCR) System using Google’s
Tesseract and OpenCV"
Andrew Cameron Morris says:
May 16, 2020 at 7:29 pm

Back in 2018 I tried to use Tesseract, together with OpenCV, to read text from large tables. However, when the tables were very
dense the table box lines interfered a lot with the recognition, resulting in unacceptably low recognition accuracy. After quite some
effort I was able to detect each table, and then each cell within the table. I could then pass each cell separately to Tesseract for text
recognition. This resulted in much greater recognition accuracy. I was also able to reassemble the recognised text data into a
pandas DataFrame, which is vital if the table data needs to be automatically associated with the table row and/or column headers.
While the python tool I created to do this is proprietary, I may be able to help if anyone is interested in tips on how to go about
doing this.

san says:
May 19, 2020 at 2:39 pm

Hi Andrew, thanks for sharing your experience. I was using Tesseract to get OCR for math equations (+, x, /, power, etc.) but I am
facing issues with it. Can you suggest how I can approach this? I don't have any training data.

Ahmed ALLALI says:
June 12, 2020 at 2:34 pm

Hello Andrew, really nice to have your feedback, Thanks ! I'm working on text extraction from scanned administrative forms, it
contains handwritten and printed text and a bit complex tables. The problem with handwritten text extraction is that it relies
heavily on the form of the text, so it's crucial to estimate accurately the performance of this extraction. It would be really
helpful if I have a deeper information about your experience. Thank you in advance !

Shyam says:
June 28, 2020 at 5:52 pm

Hi Andrew, good to note you are keen to help. I need to develop a scanned bill document processing module which will be able to
extract free text and table data irrespective of the format in which the data is presented. My customer doesn't want to
retrain for every new format or template. Is there a way to achieve this? Please share your view.
Thanks,
Shyam.