Converting PDF to Text in C#

By Dan Letecky
Posted: 1 Dec 2005. Updated: 12 Dec 2005.

Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).

How to parse PDF files

While extending an intranet indexing solution built on the DotLucene full-text
search library, I decided to add support for PDF files. DotLucene can only handle
plain text, so the PDF files had to be converted first.

After hours of Googling I found a reasonable solution that is "pure" .NET: its only
dependencies are a few IKVM.NET assemblies. Before we get to the solution, let's
take a look at the other approaches I tried.

Using Adobe PDF IFilter

Using Adobe PDF IFilter requires:

1. Unreliable COM interop to handle the IFilter interface (the combination of
IFilter COM and Adobe PDF IFilter is especially troublesome), and
2. A separate installation of Adobe IFilter on the target system. This can be
painful if you need to distribute your indexing solution to someone else.

Read more about using IFilter in Microsoft Office Documents Parsing.

Using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses
primarily on creating PDFs rather than reading them, but some classes do let you
read a PDF, most notably PdfReader. Extracting the text from the hierarchy of
objects is not an easy task, though (PDF is not a simple format; the PDF Reference
itself is a 7 MB compressed PDF file). I was able to get to PdfArray, PdfBoolean,
PdfDictionary and other objects, but after some hours of trying to resolve
PdfIndirectReference I gave up and threw away the iTextSharp-based parser.

Finally: PDFBox

PDFBox is another Java PDF library. It also works out of the box with the original
Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox, built using IKVM.NET (just
download the PDFBox package; it's in the bin directory).

Using PDFBox in .NET requires adding references to:

• PDFBox-0.7.2.dll
• IKVM.GNU.Classpath.dll

and copying IKVM.Runtime.dll to the bin directory.
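
A missing assembly only shows up at run time as a "Could not load file or assembly" error, so it can help to check the output directory up front. The following is a minimal sketch, not part of PDFBox or IKVM.NET: the class name PdfBoxDependencyCheck and the method FindMissing are hypothetical, and the file names are the ones listed in this article for PDFBox 0.7.2 (other versions may need additional files).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfBoxDependencyCheck
{
    // File names taken from the article (PDFBox 0.7.2); this list is an
    // assumption and must be adjusted for other PDFBox versions.
    public static readonly string[] RequiredAssemblies =
    {
        "PDFBox-0.7.2.dll",
        "IKVM.GNU.Classpath.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the names of the required assemblies that are not present
    // in the given directory. An empty list means all files were found.
    public static List<string> FindMissing(string directory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredAssemblies)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

A typical call would be PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory) at application start, logging whatever comes back before the first PDF is parsed.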

Using PDFBox to parse PDFs is fairly easy:

// Requires: using org.pdfbox.pdmodel; using org.pdfbox.util;
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        return new PDFTextStripper().getText(doc);
    }
    finally
    {
        doc.close(); // release the underlying file
    }
}
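
For an indexing scenario like the one described above, the extraction step is usually applied to a whole folder. The sketch below shows one possible way to do that. The class PdfBatchConverter and its ConvertFolder method are hypothetical names, and the extraction function is passed in as a delegate (for example, the parseUsingPDFBox method above), so the sketch itself has no direct dependency on PDFBox.

```csharp
using System;
using System.IO;

public static class PdfBatchConverter
{
    // Converts every *.pdf file in sourceDir to a .txt file of the same
    // base name in targetDir and returns the number of files converted.
    // extractText is any PDF-to-text function supplied by the caller.
    public static int ConvertFolder(
        string sourceDir, string targetDir, Func<string, string> extractText)
    {
        Directory.CreateDirectory(targetDir);
        int converted = 0;
        foreach (string pdfPath in Directory.GetFiles(sourceDir, "*.pdf"))
        {
            string text = extractText(pdfPath);
            string txtName = Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
            File.WriteAllText(Path.Combine(targetDir, txtName), text);
            converted++;
        }
        return converted;
    }
}
```

Called as PdfBatchConverter.ConvertFolder(pdfDir, textDir, parseUsingPDFBox), it leaves plain-text files that a full-text indexer can pick up. Error handling for unreadable or password-protected PDFs is deliberately left out of the sketch.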

The size of the required assemblies adds up to almost 16 MB:

• IKVM.GNU.Classpath.dll (7 MB)
• IKVM.Runtime.dll (360 kB)
• PDFBox-0.7.2.dll (8 MB)

The speed is not bad: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7
seconds.
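
To reproduce a timing like this on your own documents, a small wrapper around System.Diagnostics.Stopwatch is enough. This is only a sketch: ExtractionTimer and TimeExtraction are hypothetical names, and the extraction function is again passed in as a delegate. Keep in mind that the first call in a process will likely include assembly loading and JIT compilation, so a second run may give a more representative figure.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs extractText on the given file, returns the extracted text,
    // and reports how long the call took through the elapsed parameter.
    public static string TimeExtraction(
        string filename, Func<string, string> extractText, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extractText(filename);
        watch.Stop();
        elapsed = watch.Elapsed;
        return text;
    }
}
```

For example, calling ExtractionTimer.TimeExtraction(path, parseUsingPDFBox, out elapsed) and printing elapsed.TotalSeconds gives a number comparable to the one quoted above.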

Related information

• See this article (with future updates) on DotLucene: PDF Documents Parsing.

License

This article has no explicit license attached to it but may contain usage terms in the
article text or the download files themselves. If in doubt please contact the author
via the discussion board below.

About the Author

Dan Letecky
Location: Czech Republic

My open-source ASP.NET 2.0 controls:

• DayPilot - Outlook-like calendar/scheduling control
• DayPilot MonthPicker - Light-weight month picker
• MenuPilot - Hover context menu

Comments (messages 1 to 25 of 95)

Re: pdf with password Member 3234403 8:11 7 Dec '08
wand = want


I need help martinbrout 12:47 24 Oct '08

Hi,

Trying to run your code, I'm getting this error at run time:

Could not load file or assembly 'bcprov-jdk14-132, Version=0.0.0.0, Culture=neutral,
PublicKeyToken=null' or one of its dependencies. The system cannot find the file
specified.

on the first line of the given code:

PDDocument doc = PDDocument.load(filename);

I've copied:

bcprov-jdk14-132.dll
FontBox-0.1.0-dev.dll
IKVM.GNU.Classpath.dll
IKVM.Runtime.dll
PDFBox-0.7.3.dll

from the PDFBox-0.7.3 bin directory to my project, but the problem persists.

Any suggestions?


Pdf to word conversion chint.99 11:44 22 Oct '08


Hi,

Can you tell me how to convert pdf to word.


Works, but only with 0.7.2 and only for local files, not URLs Member 3509080 12:58 10 Sep '08

I had copied the DLL files to the bin directory, imported the Classpath and PDFBox DLL
references, and put in the namespaces

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using org.pdfbox.util;
using org.pdfbox.pdmodel;

but it still was not working. It threw a System.IO.File exception on my input file.

Copyright 2005 by Dan Letecky. Everything else Copyright © CodeProject, 1999-2009.
Editor: Rinish Biju