9/6/23, 4:53 PM ArabicOCR — amazing OCR library for Arabic pdf documents | by Shekhar Khandelwal | Medium

ArabicOCR: amazing OCR library for Arabic pdf documents
Shekhar Khandelwal · Dec 12, 2021

There are many OCR libraries out there, such as Tesseract, EasyOCR and keras-ocr, to
name a few. All of them work quite well on English. But not all of them are as
accurate or smooth on other languages such as Arabic.

In my recent work, I came across a problem where I first needed to identify
whether each incoming pdf was editable or non-editable. In either case, we
needed to extract the whole pdf content for further data analytics.
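
A minimal sketch of the editable versus non-editable check, assuming that "editable" means the pdf carries an embedded text layer. It relies on PyMuPDF's `page.get_text()`; the helper names and the `min_chars` threshold are hypothetical choices for illustration and would need tuning on real documents.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if the extracted page texts contain enough characters.

    min_chars is a hypothetical threshold, not a PyMuPDF setting.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def is_editable_pdf(pdf_path, min_chars=20):
    """Guess whether a pdf is editable by checking for embedded text."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        # Scanned (image-only) pages typically return an empty string.
        return has_text_layer([page.get_text() for page in doc], min_chars)
```

Calling `is_editable_pdf("arabic_image.pdf")` on a scanned document would be expected to return `False`, which is the case where OCR is needed.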

For non-editable pdfs, I needed an OCR library that could extract Arabic content
from the pdf accurately. That's when I came across this amazing Python OCR library,
built specifically for the Arabic language, called ArabicOCR.

Official repository: ArabicOcr · PyPI

Sample tutorial: Google Colab

A non-editable pdf usually means that an image has been converted to pdf format.
And in an industrial setup, you will usually get the document as a pdf, not as a
jpg or png.

Here is a sample pdf document.

https://khandelwal-shekhar.medium.com/arabicocr-amazing-ocr-library-for-arabic-pdf-documents-5d736e97904b 1/16
The first step is to convert the document to png/jpg format.

For this, refer to this article, where I explain another amazing Python library
for working with pdf documents. We will do a lot of amazing things with this
library on pdf data.

PyMuPDF — amazing python library for pdf data — Shekhar Khandelwal — Medium

Official PyMuPDF documentation: PyMuPDF Documentation (PyMuPDF 1.19.2 documentation)

First, let's import the PyMuPDF library and convert the pdf to an image.

import fitz  # PyMuPDF

pdf = "arabic_image.pdf"
doc = fitz.open(pdf)  # open document
for page in doc:  # iterate through the pages
    pix = page.get_pixmap()  # render page to an image
    pix.save("page-%i.png" % page.number)  # store image as a PNG (page numbers start at 0)

Since the pdf in this example has only one page, only one image is generated.
For a multi-page pdf, the code above generates one image per page.
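
With no arguments, `get_pixmap()` renders one pixel per pdf point, which works out to 72 dpi. A higher resolution may help OCR on small text. The sketch below shows one way to do that with `fitz.Matrix`; the helper names, the default of 200 dpi and the file prefix are assumptions for illustration, not part of the original workflow.

```python
def zoom_for_dpi(dpi):
    """PDF pages are measured in points (1/72 inch), so zoom = dpi / 72."""
    return dpi / 72.0


def render_pages(pdf_path, dpi=200, prefix="page"):
    """Render every page to '<prefix>-<n>.png' and return the file names."""
    import fitz  # PyMuPDF

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # same zoom horizontally and vertically
    paths = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)  # render at the requested scale
            path = "%s-%i.png" % (prefix, page.number)
            pix.save(path)
            paths.append(path)
    return paths
```

For a one-page pdf, `render_pages("arabic_image.pdf")` would write `page-0.png`, the same file name used in the rest of this article.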

Here is the converted image of the pdf:

Now, let's start by installing the ArabicOcr package:

!pip install ArabicOcr

Import the package in your program

from ArabicOcr import arabicocr

Pass the image file to the code below to extract the Arabic text.

image_path = 'page-0.png'  # image produced from the pdf above
out_image = 'out.jpg'  # the annotated copy of the image is written here

results = arabicocr.arabic_ocr(image_path, out_image)

In the console, you will see output something like this:

The result is a list of lists, each containing the extracted Arabic text along
with its location.

print(results)
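
To make that structure concrete, here is a small sketch that splits such a result into texts and locations. The sample data is made up for illustration: it only assumes what is described above (location at index 0, text at index 1), and the four-corner box format is an assumption rather than documented ArabicOcr output. `split_results` is a hypothetical helper.

```python
def split_results(results):
    """Split OCR results into a list of texts and a list of locations.

    Assumes each entry has the location at index 0 and the text at index 1.
    """
    texts = [entry[1] for entry in results]
    locations = [entry[0] for entry in results]
    return texts, locations


# Made-up sample in the assumed shape; not real ArabicOcr output.
sample_results = [
    [[[10, 12], [120, 12], [120, 40], [10, 40]], "مرحبا"],
    [[[130, 12], [220, 12], [220, 40], [130, 40]], "بالعالم"],
]

texts, locations = split_results(sample_results)
for text, location in zip(texts, locations):
    print(text, location)
```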

Let’s get the extracted text into a file for further processing of the data.

words = []
for i in range(len(results)):
    word = results[i][1]  # index 1 holds the recognized text
    words.append(word)
with open('file.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(words))  # writes the Python list representation
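
Writing `str(words)` stores the Python list representation, brackets and quotes included. If downstream processing prefers plain text, one alternative is to write one word per line, as sketched below. `save_words` and `load_words` are hypothetical helper names, the sample words are made up, and the sketch assumes no word contains a line break.

```python
import os
import tempfile


def save_words(words, path):
    """Write one word per line as UTF-8 text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words))


def load_words(path):
    """Read the words back as a list of strings."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


# Round trip with made-up sample words in a temporary directory.
sample_words = ["مرحبا", "بالعالم"]
with tempfile.TemporaryDirectory() as tmp_dir:
    words_path = os.path.join(tmp_dir, "words.txt")
    save_words(sample_words, words_path)
    round_trip = load_words(words_path)
```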

Similarly, we can get the locations of the text from results.

annotations = []
for i in range(len(results)):
    annotation = results[i][0]  # index 0 holds the location of the text
    annotations.append(annotation)
with open('annotations.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(annotations))  # writes the Python list representation
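
If the text and its location need to stay together, a single JSON file may be easier to work with than two separate text files. The sketch below assumes each location is a sequence of (x, y) corner points, which is an assumption about the output format. `results_to_records` is a hypothetical helper and the sample entry is made up.

```python
import json


def results_to_records(results):
    """Pair each recognized text with its location as a JSON-friendly dict."""
    records = []
    for entry in results:
        box, text = entry[0], entry[1]
        # Cast coordinates to float so NumPy numbers (if any) serialize cleanly.
        records.append({"text": text, "box": [[float(x), float(y)] for x, y in box]})
    return records


# Made-up sample entry in the assumed shape; not real ArabicOcr output.
sample = [[[[10, 12], [120, 12], [120, 40], [10, 40]], "مرحبا"]]

# ensure_ascii=False keeps the Arabic text readable instead of \u escapes.
payload = json.dumps(results_to_records(sample), ensure_ascii=False)
```

The resulting string can be written to a file opened with `encoding='utf-8'`, in the same way as the text files above.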

Finally, the code will also produce the resulting image with annotations of every
word in the document.

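You can use OpenCV to read the annotated image and matplotlib to display it.

import cv2
import matplotlib.pyplot as plt

img = cv2.imread('out.jpg')  # OpenCV loads images in BGR channel order
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert to RGB so matplotlib shows the right colors
plt.figure(figsize=(10, 10))
plt.imshow(img)
plt.show()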

GitHub link: shekharkhandelwal1983/ArabicOCR (github.com)

Thanks & Happy Learning!
