I've been working with the Lucene.NET library for over a year now and am constantly finding new ways to integrate it into websites and applications. Several of my customers use a customized version of the Desktop Search application. I have also integrated the Website Spider for Lucene I created into many websites. For some time now, I've been looking for an open source approach to extracting the raw text from PDF files. I searched all over the place and have finally found it!
PDFBox is an open source Java PDF library. The .NET version is available via IKVM.NET.
Download code (5.8 MB, including all PDFBox DLLs)
To convert a PDF to text, it's this simple:
using org.pdfbox.pdmodel;
using org.pdfbox.util;
...
// Load the PDF from disk, then extract its text into the form's text box.
PDDocument objDocument = PDDocument.load(strFileName);
PDFTextStripper objTextStripper = new PDFTextStripper();
txtText.Text = objTextStripper.getText(objDocument);
...
It seems to run pretty fast: I extracted the text from a 2.3 MB PDF in 5.2 seconds.
A couple of weeks ago, a colleague and I needed a class that could handle compressing and extracting files. It had to handle "zip" and "jar" files and needed to interface with a .NET application. There were several COM components available, but we wanted a managed code solution. We searched the web and found a few examples of developers "trying" to get it to work, but each example had a problem or two. Then we came across an article in MSDN Magazine: Using the Zip Classes in the J# Class Libraries to Compress Files and Data with C#. It was exactly what we were looking for, and we implemented it very easily. So I decided to take an hour or so to create a simple application that uses the zip class.
Download application
Zip Class - View the class code
Zip Class Implementation - View the form code-behind
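For readers who just want the general shape of the approach, here is a minimal sketch of calling the J# zip classes from C#. It is not the class from the download above: the class name ZipSketch and its two methods are hypothetical names made up for illustration, and it assumes the project references vjslib.dll (the J# class library) so that the java.io and java.util.zip namespaces are available.

```csharp
using System;
using java.io;        // J# class library (assumes a reference to vjslib.dll)
using java.util.zip;  // J# class library (assumes a reference to vjslib.dll)

// Hypothetical helper for illustration only; not the downloadable Zip class.
public class ZipSketch
{
    // Compress a single file into a new zip archive.
    public static void CompressFile(string sourcePath, string zipPath)
    {
        FileInputStream input = new FileInputStream(sourcePath);
        ZipOutputStream output = new ZipOutputStream(new FileOutputStream(zipPath));
        try
        {
            // Store the entry under the file name only, without its directory.
            ZipEntry entry = new ZipEntry(System.IO.Path.GetFileName(sourcePath));
            output.putNextEntry(entry);

            // J# maps Java's signed byte to sbyte, so the buffer is sbyte[].
            sbyte[] buffer = new sbyte[8192];
            int count;
            while ((count = input.read(buffer, 0, buffer.Length)) > 0)
            {
                output.write(buffer, 0, count);
            }
            output.closeEntry();
        }
        finally
        {
            // Closing the zip stream also writes the archive's central directory.
            output.close();
            input.close();
        }
    }

    // Print the name of every entry in an existing zip or jar archive.
    public static void ListEntries(string zipPath)
    {
        ZipFile archive = new ZipFile(zipPath);
        try
        {
            java.util.Enumeration entries = archive.entries();
            while (entries.hasMoreElements())
            {
                ZipEntry entry = (ZipEntry)entries.nextElement();
                Console.WriteLine(entry.getName());
            }
        }
        finally
        {
            archive.close();
        }
    }
}
```

Since a "jar" file uses the zip format, the same calls should work for both, e.g. ZipSketch.CompressFile("report.txt", "report.zip") followed by ZipSketch.ListEntries("report.zip"), where both file names are placeholders.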