Codeproject Pdf2textconvert

Converting PDF to Text in C#

By Dan Letecky
Posted: 1 Dec 2005 | Updated: 12 Dec 2005 | Views: 158,674 | Bookmarked: 154 times

Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).
How to parse PDF files
While extending the indexing solution for an intranet built using the DotLucene
full-text search library, I decided to add support for PDF files. DotLucene can only
handle plain text, however, so the PDF files had to be converted first.
After hours of Googling I found a reasonable solution that uses "pure" .NET: there
are no dependencies other than a few IKVM.NET assemblies. Before we get to the
solution, let's take a look at the other approaches I tried.
Using Adobe PDF IFilter
Using Adobe PDF IFilter requires:
1. Unreliable COM interop to handle the IFilter interface (the combination of
IFilter COM and Adobe PDF IFilter is especially troublesome), and
2. A separate installation of Adobe IFilter on the target system. This can be
painful if you need to distribute your indexing solution to someone else.
Read more about using IFilter in Microsoft Office Documents Parsing.
Using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It is primarily
focused on creating rather than reading PDFs, but some of its classes do allow you
to read a PDF, most notably PdfReader. Extracting the text from the hierarchy of
objects is not an easy task, though (PDF is not a simple format; the PDF Reference
alone is a 7 MB compressed PDF file). I was able to get to PdfArray, PdfBoolean,
PdfDictionary and other objects, but after some hours of trying to resolve
PdfIndirectReference I gave up and threw away the iTextSharp-based parser.
Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to use with the original Java
Lucene (see LucenePDFDocument).
Fortunately, there is a .NET version of PDFBox, built with IKVM.NET. Just download
the PDFBox package; the .NET assemblies are in its bin directory.
Using PDFBox in .NET requires adding references to:
• PDFBox-0.7.2.dll
• IKVM.GNU.Classpath.dll
and copying IKVM.Runtime.dll to the bin directory.
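Using PDFBox to parse PDFs is fairly easy:

using org.pdfbox.pdmodel; // PDDocument
using org.pdfbox.util;    // PDFTextStripper

private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        // Extract the text of all pages as a single string.
        return new PDFTextStripper().getText(doc);
    }
    finally { doc.close(); } // release the file even if parsing fails
}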
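Since the goal here is to hand plain text to an indexer, the method above could be wrapped in a small helper that writes the extracted text next to the original PDF. The following is only a sketch: the names PdfTextExport and ConvertToTextFile are hypothetical, and the extractor is passed in as a delegate so that parseUsingPDFBox (or any other function that maps a PDF path to text) can be plugged in.

```csharp
using System;
using System.IO;

public static class PdfTextExport
{
    // Runs 'extract' on the PDF and writes the result to a .txt file with
    // the same base name. Returns the path of the text file.
    // 'extract' is an assumption: any function taking a PDF path and
    // returning its text, such as parseUsingPDFBox from the article.
    public static string ConvertToTextFile(string pdfPath, Func<string, string> extract)
    {
        if (!File.Exists(pdfPath))
            throw new FileNotFoundException("PDF file not found.", pdfPath);

        string text = extract(pdfPath);

        // "report.pdf" becomes "report.txt" in the same directory.
        string txtPath = Path.ChangeExtension(pdfPath, ".txt");
        File.WriteAllText(txtPath, text);
        return txtPath;
    }
}
```

From inside the class that declares parseUsingPDFBox, a call might look like PdfTextExport.ConvertToTextFile(somePdfPath, parseUsingPDFBox), where somePdfPath is a placeholder for a local file path.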
The size of the required assemblies adds up to almost 16 MB:
• IKVM.GNU.Classpath.dll (7 MB)
• IKVM.Runtime.dll (360 kB)
• PDFBox-0.7.2.dll (8 MB)
The speed is not so bad: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7
seconds.
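To take a similar measurement on your own documents, something along these lines may help. It is a sketch, not the way the figure above was obtained: ExtractionTimer and Measure are hypothetical names, and the extractor is again passed in as a delegate (for example parseUsingPDFBox).

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs 'extract' once on the given PDF and reports how long it took.
    // 'extract' is an assumption: any function mapping a PDF path to text.
    public static string Measure(string pdfPath, Func<string, string> extract, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extract(pdfPath);
        watch.Stop();

        elapsed = watch.Elapsed;
        return text;
    }
}
```

The first call in a process will probably also include the one-time cost of loading the IKVM assemblies, so timing a second call separately may give a fairer picture of the parsing speed itself.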
Related information
• See this article (with future updates) on DotLucene: PDF Documents Parsing.
License
This article has no explicit license attached to it but may contain usage terms in the
article text or the download files themselves. If in doubt please contact the author
via the discussion board below.
About the Author
Dan Letecky
Location: Czech Republic
My open-source ASP.NET 2.0 controls:
• DayPilot - Outlook-like calendar/scheduling control
• DayPilot MonthPicker - Light-weight month picker
• MenuPilot - Hover context menu
Comments (messages 1 to 25 of 95)

Re: pdf with password (Member 3234403, 8:11 7 Dec '08)
wand = want

I need help (martinbrout, 12:47 24 Oct '08)
Hi,
Trying to run your code, I'm getting this error at run time:
Could not load file or assembly 'bcprov-jdk14-132, Version=0.0.0.0, Culture=neutral,
PublicKeyToken=null' or one of its dependencies. The system cannot find the file
specified.
on the first line of the given code:
PDDocument doc = PDDocument.load(filename);
I've copied:
bcprov-jdk14-132.dll
FontBox-0.1.0-dev.dll
IKVM.GNU.Classpath.dll
IKVM.Runtime.dll
PDFBox-0.7.3.dll
from the PDFBox-0.7.3 bin directory to my project, but the problem persists.
Any suggestions?

PDF to Word conversion (chint.99, 11:44 22 Oct '08)
Hi,
Can you tell me how to convert PDF to Word?

Works, but only with 0.7.2 and only for local files, not URLs (Member 3509080, 12:58 10 Sep '08)
I had copied the DLL files to the bin directory, imported the Classpath and PDFBox DLL
references, and put in the namespaces
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using org.pdfbox.util;
using org.pdfbox.pdmodel;
but it still was not working. It threw a System.IO.File exception on my input file.

Copyright 2005 by Dan Letecky. Everything else Copyright © CodeProject, 1999-2009.
Editor: Rinish Biju
