ArabicOCR - Amazing OCR Library For Arabic PDF Documents - by Shekhar Khandelwal - Medium

ArabicOCR: amazing OCR library for Arabic PDF documents
Shekhar Khandelwal
4 min read · Dec 12, 2021
There are many OCR libraries out there, such as Tesseract, EasyOCR, and keras-ocr, to
name a few. All of them work quite well on English, but not all of them are as
accurate or as smooth on other languages such as Arabic.
In my recent work, I came across a problem where I first needed to identify
whether each PDF arriving in a stream was editable or non-editable. In either
case, we need to extract the whole PDF content for further data analytics.
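One possible way to tell the two kinds of PDF apart is to check whether any page carries an embedded text layer. The sketch below assumes PyMuPDF is installed; the helper names (`has_text_layer`, `extract_page_texts`, `is_editable_pdf`) and the `min_chars` threshold are hypothetical and are not taken from my project code.

```python
from typing import Iterable, List


def has_text_layer(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if any page has at least `min_chars` non-whitespace characters."""
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read the embedded text of every page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the pure helper works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def is_editable_pdf(pdf_path: str) -> bool:
    """Hypothetical helper: treat a PDF as editable when it has a text layer."""
    return has_text_layer(extract_page_texts(pdf_path))
```

A scanned PDF typically returns empty strings for every page, so `is_editable_pdf` would return `False` and the document would go down the OCR route. A scan that already contains a hidden OCR text layer would be classified as editable, so treat this as a first-pass heuristic only.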
For non-editable PDFs, I needed an OCR library that could extract Arabic content
from the PDF accurately. That's when I came across this amazing Python OCR library,
built specifically for the Arabic language, called ArabicOCR.
Official repository: ArabicOcr · PyPI
Sample tutorial: Google Colab
Now, if a PDF is non-editable, it usually means that an image has been
converted into PDF format. And in an industrial setup, you will usually get the
document as a PDF, not as a JPG or PNG.
Here is a sample pdf document.
The first step is to convert the document to PNG/JPG format.
For this, refer to this article, where I have explained another amazing Python
library that deals with PDF documents; we will do a lot of amazing things with
this library on PDF data.
PyMuPDF — amazing python library for pdf data — Shekhar Khandelwal — Medium
Official PyMuPDF documentation: PyMuPDF 1.19.2 documentation
First, let's import the PyMuPDF library and convert the PDF to an image.
import fitz  # PyMuPDF

pdf = "arabic_image.pdf"
doc = fitz.open(pdf)  # open document
for page in doc:  # iterate through the pages
    pix = page.get_pixmap()  # render page to an image
    pix.save("page-%i.png" % page.number)  # store image as a PNG
For this example, the PDF has only one page, so only one image is generated.
In general, the code above generates one image per page of the PDF, named
page-0.png, page-1.png, and so on.
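For multi-page PDFs it can help to wrap the conversion in a function that returns the list of files it wrote. The following is a minimal sketch, not the code I used: `page_image_name` and `render_pdf_pages` are hypothetical helpers, and the `zoom` factor of 2.0 (which scales the default rendering resolution) is an assumed value that you may want to tune for OCR quality.

```python
from pathlib import Path
from typing import List


def page_image_name(stem: str, page_number: int) -> str:
    """Build the PNG file name for a zero-based page number."""
    return f"{stem}-{page_number}.png"


def render_pdf_pages(pdf_path: str, out_dir: str = ".", zoom: float = 2.0) -> List[str]:
    """Render every page of the PDF to a PNG and return the written paths."""
    import fitz  # PyMuPDF; imported lazily so the naming helper works without it

    stem = Path(pdf_path).stem
    written = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # A Matrix(zoom, zoom) scales the page in both directions before rendering.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = Path(out_dir) / page_image_name(stem, page.number)
            pix.save(str(target))
            written.append(str(target))
    return written
```

Calling `render_pdf_pages("arabic_image.pdf")` would write `arabic_image-0.png` for a one-page document; note that this naming differs from the `page-0.png` used in the rest of this article.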
Here is the converted image of the pdf —
Now, let’s start with installing the ArabicOCR package -
!pip install ArabicOcr
Import the package in your program
from ArabicOcr import arabicocr
Use the code below to extract the Arabic text from the image file.
image_path='page-0.png'
out_image='out.jpg'
results=arabicocr.arabic_ocr(image_path,out_image)
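If the PDF produced several page images, the same call can be repeated per page. This is a small sketch under the assumption that the OCR function takes an input path and an output image path, as `arabicocr.arabic_ocr` does above; `ocr_pages` and the `-annotated.jpg` suffix are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Sequence


def ocr_pages(image_paths: Sequence[str], ocr_fn: Callable[[str, str], list]) -> Dict[str, list]:
    """Run ocr_fn(image_path, out_image) on each image and collect results per page."""
    results = {}
    for path in image_paths:
        # Derive an output name per page so annotated images do not overwrite each other.
        out_image = path.rsplit(".", 1)[0] + "-annotated.jpg"
        results[path] = ocr_fn(path, out_image)
    return results
```

With the package imported as shown earlier, usage would look like `ocr_pages(["page-0.png", "page-1.png"], arabicocr.arabic_ocr)`.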
In the console, you will see output something like this:
The result is a list of lists, where each entry contains both the extracted Arabic
text and its location.
print(results)
Let’s get the extracted text into a file for further processing of the data.
words = []
for i in range(len(results)):
    word = results[i][1]  # index 1 holds the recognized text
    words.append(word)
with open('file.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(words))
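Writing `str(words)` stores the Python list representation, brackets and quotes included. If the downstream step expects plain text, one alternative is to write one word per line. The helpers below (`save_words`, `load_words`) are hypothetical and only a sketch of that idea.

```python
from pathlib import Path
from typing import Iterable, List


def save_words(words: Iterable[str], path: str) -> None:
    """Write each recognized word on its own line as UTF-8 text."""
    Path(path).write_text("".join(word + "\n" for word in words), encoding="utf-8")


def load_words(path: str) -> List[str]:
    """Read the words back, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()
```

This keeps the file readable in any text editor and easy to load back, provided a recognized word never contains a line break itself.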
Similarly, we can get the locations of the text from results.
annotations = []
for i in range(len(results)):
    annotation = results[i][0]  # index 0 holds the location of the text
    annotations.append(annotation)
with open('annotations.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(annotations))
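For further processing, it may be more convenient to store text and location together as JSON. The sketch below assumes each location is a sequence of [x, y] corner points; that shape is an assumption on my part, so check it against your own `print(results)` output first. `box_to_rect` and `annotations_to_json` are hypothetical helper names.

```python
import json
from typing import Dict, Sequence


def box_to_rect(box: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Collapse corner points into an axis-aligned rectangle (assumes [x, y] pairs)."""
    xs = [float(x) for x, _ in box]
    ys = [float(y) for _, y in box]
    return {"left": min(xs), "top": min(ys), "right": max(xs), "bottom": max(ys)}


def annotations_to_json(results: list) -> str:
    """Serialize the location (index 0) and text (index 1) of each entry as JSON."""
    records = [{"text": entry[1], "rect": box_to_rect(entry[0])} for entry in results]
    # ensure_ascii=False keeps the Arabic text readable in the output.
    return json.dumps(records, ensure_ascii=False, indent=2)
```

The `float()` conversion is there so that numeric types other than plain Python numbers can still be serialized.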
Finally, the code also produces an output image in which every word in the
document is annotated.
You can use OpenCV to read the annotated image.
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('out.jpg', cv2.IMREAD_UNCHANGED)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
plt.figure(figsize=(10, 10))
plt.imshow(img)
plt.show()
GitHub link: shekharkhandelwal1983/ArabicOCR (github.com)
Thanks & Happy Learning!
