
PDF Text Extraction

We are going to look at two Python libraries, PyPDF2 and PDFMiner.
Both are written specifically to work with PDF files. We will use them
in one project: splitting a 708-page PDF file into separate smaller
files, extracting the text, cleaning it, and then exporting it to
easily readable text files.
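The cleaning and exporting steps of such a project need nothing beyond the standard library. The sketch below shows one possible shape for them. The function names clean_text and export_text and the specific cleaning rules are illustrative assumptions, not steps taken from the original project.

```python
import re
from pathlib import Path


def clean_text(raw: str) -> str:
    """Tidy text extracted from a PDF page (illustrative rules only)."""
    text = raw.replace("\x0c", "\n")              # form feeds often mark page breaks
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line in a row
    return text.strip()


def export_text(text: str, out_dir: str, name: str) -> Path:
    """Write text to <out_dir>/<name>.txt as UTF-8 and return the path."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{name}.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

Rejoining hyphenated words this way also merges genuine compounds that happen to break at a line end, so the rule is worth checking against a sample of the real document before it is applied to all 708 pages.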
PyPDF2-
PyPDF2 is a pure-Python library built as a PDF toolkit. It is capable of:

• extracting document information (title, author, …)
• splitting documents page by page
• merging documents page by page
• cropping pages
• merging multiple pages into a single page
• encrypting and decrypting PDF files
• and more!
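As an illustration of the splitting capability, here is a minimal sketch that cuts one PDF into smaller files. It assumes a recent PyPDF2 release that exposes PdfReader and PdfWriter (older releases used different class names). The helper names page_ranges and split_pdf, the default chunk size, and the output file naming are all hypothetical choices.

```python
from pathlib import Path


def page_ranges(total_pages: int, pages_per_file: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) page index ranges covering the document."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


def split_pdf(source: str, out_dir: str, pages_per_file: int = 50) -> list[Path]:
    """Split `source` into smaller PDFs and return the paths written."""
    # Imported here so page_ranges() stays usable without PyPDF2 installed.
    # Assumption: a PyPDF2 release that provides PdfReader and PdfWriter.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(source)
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)

    written = []
    for start, stop in page_ranges(len(reader.pages), pages_per_file):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. pages_001-050.pdf
        target = folder / f"pages_{start + 1:03d}-{stop:03d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Keeping the range arithmetic in its own function makes it easy to check the chunk boundaries before any file is touched. For a 708-page document and 100 pages per file, page_ranges yields eight ranges, the last one covering pages 701 to 708.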

Because it is pure Python, it should run on any Python platform without
any dependencies on external libraries. It can also work entirely on
in-memory stream objects (StringIO in Python 2, io.BytesIO in Python 3)
rather than file streams, allowing for PDF manipulation in memory. It
is therefore a useful tool for websites that manage or manipulate PDFs.
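The in-memory workflow can be sketched as follows: a web handler receives PDF bytes, trims the document, and returns new bytes without writing anything to disk. This assumes Python 3 and a PyPDF2 release that provides PdfReader and PdfWriter. The function names looks_like_pdf and first_pages are hypothetical.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally begin with the marker %PDF-."""
    return data.startswith(b"%PDF-")


def first_pages(pdf_bytes: bytes, count: int) -> bytes:
    """Return a new PDF, as bytes, holding at most the first `count` pages."""
    if not looks_like_pdf(pdf_bytes):
        raise ValueError("input does not look like a PDF")

    # Assumption: a PyPDF2 release that provides PdfReader and PdfWriter.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(io.BytesIO(pdf_bytes))   # read from memory, not from a file
    writer = PdfWriter()
    for index in range(min(count, len(reader.pages))):
        writer.add_page(reader.pages[index])

    buffer = io.BytesIO()                       # write to memory as well
    writer.write(buffer)
    return buffer.getvalue()
```

The header check is only a first filter. A file can carry the right marker and still be malformed, so a real handler would also need to deal with parsing errors raised by the library.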
PDFMiner-
PDFMiner is a tool for extracting information from PDF documents.
Unlike other PDF-related tools, it focuses entirely on getting and
analyzing text data. PDFMiner allows one to obtain the exact location
of text on a page, as well as other information such as fonts or lines.
It includes a PDF converter that can transform PDF files into other
text formats (such as HTML). It has an extensible PDF parser that can
be used for purposes other than text analysis.
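To show what "exact location of text" looks like in practice, the sketch below collects each text box on a page together with its bounding box. It assumes the pdfminer.six fork, which provides the pdfminer.high_level module. The function names reading_order and text_boxes are hypothetical, and the sort is a simple heuristic that will not handle multi-column layouts well.

```python
def reading_order(boxes):
    """Sort (bbox, text) pairs top-to-bottom, then left-to-right.

    PDF coordinates start at the bottom-left corner of the page, so a
    larger y1 (the fourth bbox value) means the box sits higher up.
    """
    return sorted(boxes, key=lambda item: (-item[0][3], item[0][0]))


def text_boxes(path: str, page_index: int = 0):
    """Return (bbox, text) pairs for one page, where bbox = (x0, y0, x1, y1)."""
    # Assumption: pdfminer.six is installed; imported here so that
    # reading_order() can be used without it.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    boxes = []
    for layout in extract_pages(path, page_numbers=[page_index]):
        for element in layout:
            if isinstance(element, LTTextContainer):
                boxes.append((element.bbox, element.get_text().strip()))
    return reading_order(boxes)
```

Having the coordinates available is what makes position-based cleaning possible, for example dropping any box that sits in the top or bottom margin to remove running headers and footers.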
