KNOWLEDGE BASE ARTICLE

Searchable PDFs: what they are, how they differ from normal PDFs, and how to create them.

Text searchable PDF documents are fast becoming a preferred storage standard within the imaging industry. They certainly do bring with them some nice features but they also bring some potential shortcomings when compared to accurately indexed documents. Let's take a look at both the advantages and disadvantages that this document format offers.

All Hail The Text Searchable Format

For any organization that stores and retrieves large quantities of correspondence or text-heavy documents, the ability to search for any word or phrase within a document body is very attractive. Where you may know only the sender's name or perhaps an address on a piece of correspondence, a quick text search among thousands of documents returns a handful of candidates that can be further narrowed by document date. You can very quickly arrive at the document you were searching for. In fact, this form of search is attractive even if you don't have text-heavy documents!
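To make the idea concrete, here is a minimal sketch of that kind of search: match a phrase against the OCR text of each document, then narrow by date. The file names, dates and text are made-up sample data, and a real system would read the text from the PDFs rather than from a list.

```python
from datetime import date

# Hypothetical sample data: (file name, document date, text extracted by OCR).
DOCUMENTS = [
    ("letter_001.pdf", date(2008, 3, 14), "Dear Sir, regarding your invoice. Regards, J. Smith"),
    ("letter_002.pdf", date(2008, 6, 2), "Notice of address change to 12 High Street. J. Smith"),
    ("letter_003.pdf", date(2008, 6, 20), "Quarterly statement enclosed. Regards, A. Jones"),
]

def search(documents, phrase, start=None, end=None):
    """Return file names whose text contains phrase, optionally within a date range."""
    needle = phrase.lower()
    hits = []
    for name, doc_date, text in documents:
        if needle not in text.lower():
            continue  # phrase not found in the body text
        if start is not None and doc_date < start:
            continue  # document is older than the requested range
        if end is not None and doc_date > end:
            continue  # document is newer than the requested range
        hits.append(name)
    return hits

# Search by sender name only, then narrow the hits by document date.
print(search(DOCUMENTS, "j. smith"))
print(search(DOCUMENTS, "j. smith", start=date(2008, 6, 1)))
```

The first call returns both of J. Smith's letters; adding a start date leaves only the June one.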

Another nice feature of the open text format is the availability of automated document indexing options at very low or no cost. Even Microsoft Windows comes with the ability to search within documents to retrieve information and files. This means that all I need for a very basic document management system is my OS and a solution to OCR my documents and convert them to text searchable PDFs.

Match that with products like Google Desktop, Windows Live Search or similar and you can step it up a notch or two and add some retrieval speed. Or take it a step further and add an indexing solution such as the Google Mini and you have yourself an inexpensive but relatively impressive, low maintenance document retrieval system. OK, so it's not everything a document management system can be, but it is cheap, easy to implement, very low maintenance and requires very little human intervention (I can hear some of you staid, wise sages groaning at even the thought of a comparison).

I'll Stick With Manual Indexing Thanks

A big consideration when relying on text searchable PDFs is accuracy. Although there are some great OCR tools out there, you will never get 100% accuracy, and sometimes that is a requirement.
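A rough back-of-the-envelope sketch shows why this matters for searching. It assumes, purely for illustration, a 99% per-character accuracy and that character errors are independent of each other; real OCR errors tend to cluster, so treat the numbers as indicative only.

```python
def whole_term_accuracy(char_accuracy, term_length):
    """Chance that every character of a search term was recognized correctly,
    assuming each character is an independent trial (a simplification)."""
    return char_accuracy ** term_length

# Illustrative figure only: 99% per-character accuracy.
for length in (5, 10, 20):
    print(length, round(whole_term_accuracy(0.99, length), 3))
```

Under those assumptions a 10-character term is fully correct only about 90% of the time, so roughly one search in ten for that term would miss a document that actually contains it.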

There is of course the option to manually index all your documents. This is a must if you need to ensure data integrity and search accuracy. You end up with less data to search within but much more accurate results, and you can capture indexing data that would not be contained within a document's body text (such as document category or type).

Of course this brings with it an overhead; someone has to actually enter the data and the more indexing fields you want to work with, the greater the overhead. This will also mean you'll need indexing software and a database/warehouse to search and retrieve from. You end up with a great solution but at an increased cost in money, time and maintenance.
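As a small sketch of what the database side of manual indexing might look like, the example below uses SQLite from Python's standard library. The table name, the field names and the sample rows are all hypothetical; the point is that a field such as category can be searched exactly even though it never appears in the document's body text.

```python
import sqlite3

# In-memory database for illustration; table and field names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE doc_index (
        file_name TEXT PRIMARY KEY,
        sender    TEXT,
        doc_date  TEXT,   -- ISO 8601 text, so sorting as text sorts by date
        category  TEXT
    )
    """
)

# Each row stands for index data keyed in by a person.
conn.executemany(
    "INSERT INTO doc_index VALUES (?, ?, ?, ?)",
    [
        ("letter_001.pdf", "J. Smith", "2008-03-14", "Invoice query"),
        ("letter_002.pdf", "J. Smith", "2008-06-02", "Change of address"),
        ("letter_003.pdf", "A. Jones", "2008-06-20", "Statement"),
    ],
)

def find_by_category(connection, category):
    """Return file names in a category, oldest first."""
    rows = connection.execute(
        "SELECT file_name FROM doc_index WHERE category = ? ORDER BY doc_date",
        (category,),
    )
    return [row[0] for row in rows]

print(find_by_category(conn, "Change of address"))
```

Every extra column here is another value someone has to type for every document, which is exactly the overhead described above.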

A Third Solution?

There actually is a third solution: a semi-automated approach that builds indexing data using OCR, has a person verify that data, and then outputs it to a database or to document metadata such as PDF keywords. This doesn't suit every requirement or organization, but it can often be a nice compromise that brings some great benefits without a huge overhead or cost.

Let's look at a possible scenario...

First, we'll scan our documents using a batch scanning front end. Of course, I am going to suggest i2's scanning front end but there are others available to you. Now, when I set up my Template or Job I identify areas within my document that are likely (or certain) to contain the data I need to use for indexing. Where applicable I set the area as an OCR field so that my scanning software automatically extracts the data and asks me to confirm its correctness. Without too much pain and time I have a fast and semi-automated way of capturing my primary indexing data.
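The capture step can be sketched roughly as follows. This is not based on any real scanning product's API: the template layout, the zone coordinates, the `ocr_zone` stand-in and the simulated operator are all assumptions made up for illustration.

```python
# Hypothetical template: field name -> page region (left, top, right, bottom) in pixels.
TEMPLATE = {
    "sender": (50, 40, 400, 80),
    "invoice_number": (450, 40, 600, 80),
}

def ocr_zone(page, zone):
    """Stand-in for a real OCR engine: looks up canned text for a zone."""
    return page.get(zone, "")

def capture_index_fields(page, template, confirm):
    """OCR each template zone, then let an operator confirm or correct the result."""
    fields = {}
    for name, zone in template.items():
        guess = ocr_zone(page, zone)
        fields[name] = confirm(name, guess)
    return fields

# A fake scanned page: the OCR has misread "1" as "l" in the invoice number.
page = {(50, 40, 400, 80): "J. Smith", (450, 40, 600, 80): "INV-l042"}

def operator(name, guess):
    """Simulated operator: fixes the one known misread and accepts everything else."""
    corrections = {"INV-l042": "INV-1042"}
    return corrections.get(guess, guess)

print(capture_index_fields(page, TEMPLATE, operator))
```

In a real front end the `confirm` step would be a prompt on screen; passing it in as a function simply keeps the sketch runnable without one.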

Then we'll output our documents to text searchable PDF format and enable keyword metadata embedding. Keyword metadata embedding means the primary indexing data is added to the PDF as part of the document itself and not just passed off to an indexing database (of course, saving the data to a database would be another option).

What I end up with is a very flexible, data-rich file that contains both full text searchability and accurate indexing information, and that can be used in many different environments and with varying search requirements. My costs and my time overhead are also kept to a minimum.
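A PDF's document information dictionary has a Keywords entry that holds a single text string, so the confirmed index fields have to be packed into one value. The sketch below shows one possible layout; the "name=value; " convention is my own assumption and not a standard. Writing the string into the PDF itself would need a PDF library and is not shown here.

```python
def build_keywords(fields):
    """Join confirmed index fields into one keywords string."""
    return "; ".join(f"{name}={value}" for name, value in fields.items())

def parse_keywords(keywords):
    """Recover the index fields from a keywords string built by build_keywords."""
    items = (item for item in keywords.split("; ") if "=" in item)
    return dict(item.split("=", 1) for item in items)

# Hypothetical fields, as confirmed during scanning.
fields = {"sender": "J. Smith", "invoice_number": "INV-1042"}

keywords = build_keywords(fields)
print(keywords)
print(parse_keywords(keywords) == fields)
```

Note that this simple layout breaks if a value itself contains "; ", so a real implementation would need to escape or reject such values.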

This solution won't suit every situation (and all the options discussed above have their place) but it is an option that you may wish to consider as you plan your document scanning, indexing, storage and searching requirements.

Link to this article http://umango.com/KB?article=8