Department of Computer Science

Hitec University Taxila

PROJECT REPORT

Text Extraction from Images and binary Conversion from images

Group members:

Muhammad Khan Ahmed (21-cs-023)
Hammad Ali (21-cs-039)
Muhammad Hamza (21-cs-068)

Date of submission: 12/12/2023

Submitted to: Ms. Faiza Jahangir


Contents
1. Abstract:-
2. Introduction:-
3. Background:-
4. Past/related work:-
5. Project management:-
5.1 Project Scope and Objectives:
5.2 Team Building and Division of Tasks:
6. Methodology:-
7. Project result and analysis:-
8. Challenge faced:-
9. Conclusion:-

1. Abstract:-

Tesseract, a widely used OCR engine, is employed to recognize and extract text from images, taking into
account the challenges presented by the images' blurriness, low quality, and complexity. Additionally,
PIL, a powerful image processing library, is utilized to process images, adjust their parameters, and
optimize them for OCR operations. By combining these advanced technologies, the project aims to
achieve enhanced text extraction accuracy.

The PIL and Pytesseract libraries are used to load images and convert their content into strings.
Pytesseract uses OCR to perform this operation. The numeric values are then picked out of the extracted
string and displayed alongside their binary equivalents using Python's built-in bin function.

2. Introduction:-

Importance of the Problem: Extracting text from images is a critical task in fields like digital document
preservation, document indexing, and document search. The extracted text can be further processed for
various applications, such as data analysis, text mining, and sentiment analysis.

Difficulty of Solving the Problem: Solving this problem presents a unique challenge. The process involves
locating text within an image and transcribing it accurately into machine-readable form. This task can be
complex because of the variation in handwriting styles, fonts, and backgrounds in real-world images.
Noise, blurred text, and skewed images make the problem harder still.

The Power of ML: However, by utilizing machine learning algorithms and libraries, this problem can be
solved efficiently and accurately. A pre-trained OCR engine such as Tesseract, accessed from Python
through the Pytesseract wrapper, enables the extraction of text from images with a high degree of accuracy.

3. Background:-

The project, titled "Text Extraction from Images and binary Conversion from images", aims to develop an
ML-powered application capable of extracting text from images using Optical Character Recognition
(OCR). Additionally, it focuses on converting the extracted numeric characters into their binary
equivalents.

Here is a brief introduction to the libraries used in the project:

1. Pytesseract: Pytesseract is a Python wrapper for Google's Tesseract OCR engine. It enables users
to perform OCR directly from within Python scripts. By utilizing this library, the project aims to
accurately extract text from images.

2. PIL (Python Imaging Library): PIL is a powerful library that supports opening, manipulating, and
saving many different image file formats. In this project, it is used to load images for further
processing.

3. OCR: Optical Character Recognition is the technology that enables the extraction of text from
images. In this project, it is performed by the Tesseract engine, accessed through Pytesseract;
image processing libraries such as OpenCV can be used alongside it to prepare the images.

4. Conversion of Numeric to Binary: The project also focuses on converting the extracted numeric
characters into their binary equivalents. This conversion can be useful in applications such as
image encryption, which operate on data in binary form.
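
As a rough illustration of the conversion described in item 4 above, the sketch below picks the runs of digits out of a piece of text and pairs each with its binary form using Python's built-in bin function. The function name numbers_to_binary and the sample string are hypothetical, chosen for illustration only; the project's actual code may be organized differently.

```python
import re


def numbers_to_binary(text):
    """Return (value, binary string) pairs for each run of digits in text."""
    # \d+ matches one or more consecutive digits, e.g. "12" in "Room 12".
    values = [int(token) for token in re.findall(r"\d+", text)]
    # bin() returns a string with a "0b" prefix, e.g. bin(5) == "0b101".
    return [(value, bin(value)) for value in values]


# Hypothetical OCR output, used here in place of a real image.
sample = "Room 12, floor 5"
for value, binary in numbers_to_binary(sample):
    print(value, "->", binary)
```

Text that contains no digits simply yields an empty list, so the function can be applied to any OCR result without extra checks.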

4. Past/related work:-

• The project began with traditional OCR methods. These methods, such as Hidden
Markov Models (HMMs) and Neural Networks (NNs), have been widely used in the field of
Optical Character Recognition (OCR) for several decades.

• One major limitation of traditional OCR methods is their reliance on manually designed templates.
These templates are used to identify and extract specific pieces of information from the images,
such as names, addresses, and phone numbers. This manual approach can be time-consuming and
prone to errors.

• Tesseract can also benefit from the user's feedback and the incorporation of user-generated training data.
By continuously refining the model with additional training data, Tesseract can further enhance its
recognition accuracy over time.

• Although Tesseract OCR's recognition accuracy may not be consistently higher than that of other OCR
methods, it can be retrained on additional data, so its accuracy can improve over time and potentially
surpass that of other methods. For optimal performance, however, factors such as the quality of the
training data and the application of appropriate preprocessing steps should be considered.

5. Project management:-

To manage this project effectively, we followed a well-defined project management strategy that
involves breaking the project down into smaller tasks and assigning roles to the team members. This
approach ensures that all team members are involved in the project's execution and contribute to its
overall success.

Here is a detailed breakdown of our project management strategy:

5.1 Project Scope and Objectives: The first step in our project management strategy was to clearly
define the scope and objectives of the project. This included understanding the problem statement
and determining the project duration.
5.2 Team Building and Division of Tasks: To manage the project efficiently, we divided the
tasks among the team members. This approach ensures that all team members have a stake in the
project and contribute to its success.

6. Methodology:-

• We use the Tesseract OCR engine and PIL to extract text from images and convert it into the
desired binary format.
• The scope of the project is the implementation, configuration, and optimization of the Tesseract
OCR engine and other relevant technologies to achieve accurate and efficient text extraction.
• We first configure the Tesseract OCR engine by setting the language model, character whitelist,
and other engine parameters.
• We then extract the text from the images using the Tesseract OCR engine.
• Finally, we convert the numeric values in the extracted text into binary using Python's bin
function.
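
The configuration and extraction steps above might look roughly like the following sketch. It assumes that the pytesseract and Pillow packages and the Tesseract binary are installed. The helper names build_config and extract_text, the page segmentation mode 6, and the default language "eng" are illustrative assumptions, not the settings used in the project.

```python
def build_config(psm=6, whitelist=None):
    """Build a Tesseract config string (hypothetical helper)."""
    # --psm selects the page segmentation mode; 6 assumes one uniform text block.
    parts = ["--psm {}".format(psm)]
    if whitelist:
        # tessedit_char_whitelist restricts recognition to the listed characters.
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


def extract_text(image_path, lang="eng", psm=6, whitelist=None):
    """Run Tesseract on one image file and return the recognized text."""
    # Imported here so that build_config can be used without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm, whitelist)
        )
```

A call such as extract_text("scan.png", whitelist="0123456789"), where the file name is a placeholder, would restrict recognition to digits. Support for the whitelist option varies between Tesseract versions, so the result should be checked on the installed version.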

7. Project result and analysis:-

The system works best on computer-generated documents. After a thorough evaluation, we observed
that it does not work accurately on handwritten documents, because handwriting does not follow
a standard form. This is primarily because Tesseract OCR has not been trained on such documents
and its language model does not recognize the handwriting style. Overall, we have successfully
implemented a system that uses Tesseract OCR and PIL to extract text from images and convert it
into the desired binary format. The system's accuracy on handwritten documents could be further
improved by addressing the identified limitations and making the necessary adjustments to the
project's methodology, tools, and datasets.

8. Challenge faced:-

The main challenge was poor accuracy on complex, blurry, or low-quality images. This issue arises because
OCR engines are designed to work with clear, well-defined text, which makes it difficult to classify and
extract information from such images. In addition, different types of images require different modes, and
no single standard mode works for all of them.

To overcome this challenge, we can employ several techniques to enhance the quality of the images before
feeding them into the OCR engine. Some of these techniques include:

1. Image cleaning and noise reduction: Use image processing techniques like median filtering,
bilateral filtering, or adaptive thresholding to remove noise and unwanted artifacts from the
images.
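
To make the idea concrete, here is a minimal pure-Python sketch of a 3x3 median filter followed by binarization on a small grayscale grid. It is only an illustration: the function names, the threshold of 128, and the sample grid are assumptions, and a simple fixed threshold stands in for adaptive thresholding. A real pipeline would more likely use a library routine such as Pillow's ImageFilter.MedianFilter.

```python
from statistics import median


def median_filter(pixels):
    """Apply a 3x3 median filter to a 2D list of grayscale values (0-255)."""
    height, width = len(pixels), len(pixels[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipping it at the image borders.
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # For an even-sized window the median is the mean of the two
            # middle values; int() truncates it back to a whole number.
            row.append(int(median(window)))
        result.append(row)
    return result


def binarize(pixels, threshold=128):
    """Map each pixel to 0 (dark) or 255 (light) using a fixed threshold."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# A white 3x3 patch with one dark speck of noise in the centre.
noisy = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]
cleaned = binarize(median_filter(noisy))
print(cleaned)
```

The median filter replaces the isolated dark pixel with the value of its neighbours, which is why it suits speckle noise better than a plain average that would only smear the speck.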

9. Conclusion:-

In conclusion, computer-generated images achieved the best accuracy in this project, but this does not
mean that they represent the optimal quality for text extraction in all scenarios. Tesseract is
specifically designed for printed text, and its performance with handwritten text is not as robust.
Research should continue to focus on optimizing image quality for OCR operations and on developing
advanced techniques for enhancing the readability of handwritten text, to further improve text
extraction accuracy and effectiveness.
