ArabicOCR - Amazing OCR Library For Arabic PDF Documents - by Shekhar Khandelwal - Medium
There are many OCR libraries out there, such as Tesseract, EasyOCR and keras-ocr, to
name a few. All of them work quite well on English text, but not all of them are as
accurate or as smooth on other languages such as Arabic.
In my recent work, I came across a problem where I first needed to identify
whether an incoming PDF is editable or non-editable. In either case, we need to
extract the whole PDF content for further data analytics.
For non-editable PDFs, I needed an OCR library that can extract Arabic content
from the PDF accurately. That’s when I came across this amazing Python OCR library
built specifically for the Arabic language, called ArabicOCR.
Now, a non-editable PDF usually means that an image has been converted into PDF
format. And in an industrial setup, you will usually receive the document as a PDF,
not as a JPG or PNG.
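One way to make the editable versus non-editable check concrete is to look for a text layer: a scanned page usually yields little or no text when it is read directly. The sketch below assumes PyMuPDF (imported as fitz) and its page.get_text() method. The function names and the 20-character threshold are my own illustrative choices, not part of any library.

```python
def classify_pages(page_texts, min_chars=20):
    """Label a PDF from the text found on each of its pages.

    min_chars is an assumed, illustrative threshold: a page counts as having
    a text layer when it yields at least that many non-space characters.
    """
    if not page_texts:
        return "empty"
    has_text = [len("".join(text.split())) >= min_chars for text in page_texts]
    if all(has_text):
        return "editable"
    if not any(has_text):
        return "non-editable"
    return "mixed"


def classify_pdf(path, min_chars=20):
    """Open a PDF with PyMuPDF and classify it (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so classify_pages works without it

    with fitz.open(path) as doc:
        return classify_pages([page.get_text() for page in doc], min_chars)
```

A purely scanned file should come back as "non-editable", while "mixed" signals that only some of the pages need OCR.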
https://khandelwal-shekhar.medium.com/arabicocr-amazing-ocr-library-for-arabic-pdf-documents-5d736e97904b 1/16
9/6/23, 4:53 PM ArabicOCR — amazing OCR library for Arabic pdf documents | by Shekhar Khandelwal | Medium
For handling the PDF itself, refer to this article, where I explain another amazing
Python library that deals with PDF documents; we will do a lot of amazing things
with this library on PDF data.
PyMuPDF — amazing python library for pdf data — Shekhar Khandelwal — Medium
First, let’s import the PyMuPDF library and convert the PDF to images, one per page.
import fitz  # PyMuPDF

pdf = "arabic_image.pdf"
doc = fitz.open(pdf)  # open document
for page in doc:  # iterate through the pages
    pix = page.get_pixmap()  # render page to an image
    pix.save("page-%i.png" % page.number)  # store image as a PNG (page.number starts at 0)
For this example, the PDF has only one page, so only one image is generated.
In general, the above code generates one image for every page in the PDF.
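By default, get_pixmap() renders at 72 dpi, which can be coarse for OCR on small print. Here is a sketch of rendering every page at a higher resolution through a zoom matrix. The 300 dpi target and the helper names are assumptions for illustration, not something the original workflow prescribes.

```python
def zoom_for_dpi(dpi, base_dpi=72):
    """Scale factor that takes PyMuPDF's 72 dpi default to the target dpi."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / base_dpi


def page_image_name(page_number, prefix="page"):
    """File name for a page image, matching the 'page-0.png' pattern."""
    return "%s-%i.png" % (prefix, page_number)


def render_pdf(path, dpi=300):
    """Render each page of a PDF to a PNG and return the file names."""
    import fitz  # PyMuPDF, imported lazily

    zoom = zoom_for_dpi(dpi)
    names = []
    with fitz.open(path) as doc:
        for page in doc:
            # A Matrix(zoom, zoom) scales the page in both directions.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            name = page_image_name(page.number)
            pix.save(name)
            names.append(name)
    return names
```

Larger images take longer to OCR, so the dpi value is a trade-off worth tuning on your own documents.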
Now, let’s start by installing the ArabicOCR package (on PyPI the package name appears to be ArabicOcr, i.e. pip install ArabicOcr).
With the image file in place, use the code below to extract the Arabic text.
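from ArabicOcr import arabicocr  # import path assumed from the package's published usage

image_path = 'page-0.png'  # page image rendered earlier
out_image = 'out.jpg'  # the annotated copy of the image is written here
results = arabicocr.arabic_ocr(image_path, out_image)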
The result is a list of lists, each containing a piece of extracted Arabic text as
well as its location in the image. Printing it to the console shows this structure:
print(results)
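For a multi-page PDF, the same call can be repeated for every page image. The sketch below takes the OCR function as a parameter so the loop can be tried without the library installed. Passing arabicocr.arabic_ocr is the intended use, and the out-N.jpg naming is an assumption of mine.

```python
def ocr_pages(image_paths, ocr_fn):
    """Run ocr_fn(image_path, out_image) on each page image.

    Returns a dict mapping each input image path to its OCR results.
    """
    all_results = {}
    for index, image_path in enumerate(image_paths):
        out_image = "out-%i.jpg" % index  # annotated copy, one per page
        all_results[image_path] = ocr_fn(image_path, out_image)
    return all_results
```

With the library available, a call such as ocr_pages(["page-0.png", "page-1.png"], arabicocr.arabic_ocr) would collect the results page by page.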
Let’s get the extracted text into a file for further processing of the data.
# collect the text of every detection (index 1 of each result)
words = []
for result in results:
    words.append(result[1])
with open('file.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(words))  # stores the list in Python list notation
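Writing str(words) stores the Python list notation, brackets and quotes included. If plain text is more convenient downstream, a variant that writes one recognized item per line might look like this. The sample results are made up for illustration and only mimic the [location, text] layout described earlier; the helper name and output path are hypothetical too.

```python
import os
import tempfile


def save_words(results, path):
    """Write the text part (index 1) of each result, one per line."""
    words = [str(result[1]) for result in results]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words))
    return words


# Made-up sample in the [location, text] layout; not real OCR output.
sample_results = [
    [[[10, 10], [90, 10], [90, 40], [10, 40]], "مرحبا"],
    [[[100, 10], [180, 10], [180, 40], [100, 40]], "بالعالم"],
]
out_path = os.path.join(tempfile.gettempdir(), "file_lines.txt")
saved = save_words(sample_results, out_path)
```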
# collect the location of every detection (index 0 of each result)
annotations = []
for result in results:
    annotations.append(result[0])
with open('annotations.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(str(annotations))  # stores the list in Python list notation
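To keep each word together with its location in a form other tools can parse, the results can also be written as JSON. This sketch assumes each location is a sequence of [x, y] corner points, which the article does not confirm. Coordinates are converted to plain floats so that NumPy numbers, if the library returns them, do not trip up the JSON encoder.

```python
import json


def results_to_records(results):
    """Turn [location, text, ...] results into JSON-friendly dicts."""
    records = []
    for result in results:
        location, text = result[0], result[1]
        # Assumed layout: location is a sequence of (x, y) corner points.
        points = [[float(x), float(y)] for x, y in location]
        records.append({"text": str(text), "points": points})
    return records


def save_records(results, path):
    """Write the records as UTF-8 JSON, keeping Arabic text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results_to_records(results), handle, ensure_ascii=False, indent=2)
```

Setting ensure_ascii=False keeps the Arabic characters as they are instead of escaping them, which makes the file easier to inspect by eye.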
Finally, the call also produces an output image (out.jpg here) in which every
detected word in the document is annotated.
import cv2
import matplotlib.pyplot as plt

img = cv2.imread('out.jpg', cv2.IMREAD_UNCHANGED)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR, matplotlib expects RGB
plt.figure(figsize=(10, 10))
plt.imshow(img)
plt.show()
Data Scientist with a major in Computer Vision. Love to blog and share knowledge with the data
community.
Shekhar Khandelwal