Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code)
While extending the indexing solution for an intranet built on the DotLucene
full-text search library, I decided to add support for PDF files. DotLucene can only
handle plain text, so the PDF files had to be converted first.
After hours of Googling I found a reasonable solution that uses "pure" .NET: the
only dependencies are a few IKVM.NET assemblies. Before we get to the solution,
let's take a look at the other approaches I tried.
Using Adobe PDF IFilter
Using Adobe PDF IFilter requires:
1. Unreliable COM interop to handle the IFilter interface (the combination of
IFilter COM and Adobe PDF IFilter is especially troublesome).
2. A separate installation of Adobe IFilter on the target system. This can be
painful if you need to distribute your indexing solution to someone else.
Using iTextSharp
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses
primarily on creating PDFs rather than reading them, but some of its classes can
read PDF files, notably PdfReader. Extracting the text from the hierarchy of
objects is not an easy task, however (PDF is not a simple format; the PDF Reference
is itself a 7 MB compressed PDF file). I was able to get to PdfArray, PdfBoolean,
PdfDictionary and other objects, but after some hours of trying to resolve
PdfIndirectReference I gave up and threw away the iTextSharp-based parser.
Finally: PDFBox
PDFBox is another Java PDF library. It is also ready to use with the original Java
Lucene (see LucenePDFDocument).
Fortunately, there is a .NET version of PDFBox, created using IKVM.NET (just
download the PDFBox package; the .NET assemblies are in its bin directory).
The required assemblies are:
• IKVM.GNU.Classpath.dll (7 MB)
• IKVM.Runtime.dll (360 kB)
• PDFBox-0.7.2.dll (8 MB)
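With these assemblies referenced, text extraction might look roughly like the sketch below. It assumes the IKVM-compiled PDFBox-0.7.2.dll exposes the same org.pdfbox API as the Java library (PDDocument.load, PDFTextStripper.getText, PDDocument.close). The class name PdfBoxTextExtractor, the method name ExtractText and the explicit file check are illustrative choices and not part of PDFBox.

```csharp
using System;
using System.IO;
using org.pdfbox.pdmodel;  // PDDocument, from PDFBox-0.7.2.dll
using org.pdfbox.util;     // PDFTextStripper, from PDFBox-0.7.2.dll

// Hypothetical helper class; not part of PDFBox.
public static class PdfBoxTextExtractor
{
    // Returns the plain text of the PDF stored at a local file path.
    public static string ExtractText(string path)
    {
        // Fail early with a .NET exception instead of a Java one.
        if (!File.Exists(path))
            throw new FileNotFoundException("PDF file not found.", path);

        PDDocument doc = PDDocument.load(path);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Release the resources held by the parsed document.
            doc.close();
        }
    }
}
```

A call such as Console.Write(PdfBoxTextExtractor.ExtractText("sample.pdf")); would then print the extracted text, where sample.pdf stands for any local PDF file. The returned string is plain text and can be handed to DotLucene for indexing.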
The speed is acceptable: parsing the U.S. Copyright Act PDF (1.4 MB) took about 7
seconds.
Related information
• See this article (with future updates) on DotLucene: PDF Documents Parsing.
License
This article has no explicit license attached to it but may contain usage terms in the
article text or the download files themselves. If in doubt please contact the author
via the discussion board below.
I've copied:
bcprov-jdk14-132.dll
FontBox-0.1.0-dev.dll
IKVM.GNU.Classpath.dll
IKVM.Runtime.dll
PDFBox-0.7.3.dll
from the PDFBox-0.7.3 bin directory to my project, but the problem persists.
Any suggestions?
Works, but only with 0.7.2 and only for local files, not URLs
Member 3509080, 12:58 10 Sep '08
I copied the DLL files to the bin directory, added references to the Classpath and
PDFBox DLLs, and added the namespaces:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using org.pdfbox.util;
using org.pdfbox.pdmodel;
but it still was not working. It threw a System.IO.File exception on my input file.