Scan to PDF


You can use either the Find toolbar or the Search PDF window to locate a word, series of words, or partial word in the active Adobe PDF document. The Find toolbar provides a basic set of options for searching for text in only the current PDF document. The Search PDF window searches more PDF areas than the Find toolbar, provides more advanced options, and lets you search for text in one or more PDF documents, an index of PDF files, or PDF files on the Internet (see Searching Adobe PDF documents on the Internet).

By default, both the Find toolbar and the Search PDF window search the text, layers, form fields, and digital signatures in the PDF document; both features also let you include bookmarks and comments in the search. By default, the Search PDF window also searches object data and image XIF (extended image file format) metadata. It searches document properties and XMP metadata by default, but only when searching multiple PDF documents or a PDF index, and it searches indexed structure tags, but only when searching a PDF index. In addition, the Search PDF window lets you include attachments in the search.

Note: Adobe PDF documents can have multiple layers. If the search results include an occurrence on a hidden layer, selecting that occurrence displays an alert that asks if you want to make that layer visible.


When you get a document that has been scanned rather than exported from the software that created it, such as MS Word, it's just an image (i.e. a picture). Remember, to a computer, a picture of the letter "A" is not the same as the text character "A," so when you try to text-search an image, you get no hits because there's no text to search. Typical scanned litigation documents are in the TIFF (image) format. There are also many software and hardware packages that scan paper directly into PDF. For now, I'm not going to address using Acrobat or other tools as the scanning software. For our purposes today, let's just say, "you've got those image files that you want to convert into something you can search."
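
To make the "picture of an A is not the text character A" point concrete, here is a minimal sketch. The 5x5 bitmap is a hand-made illustration (not real scanner output), and the helper name bitmap_to_bytes is made up for this example.

```python
# A 5x5 bitmap that a person reads as the letter "A" (1 = dark pixel).
# This grid is a hand-made illustration, not output from any scanner.
LETTER_A_BITMAP = [
    "01110",
    "10001",
    "11111",
    "10001",
    "10001",
]


def bitmap_to_bytes(rows):
    """Pack each row of '0'/'1' pixels into one byte, the way raw image data is stored."""
    return bytes(int(row, 2) for row in rows)


image_data = bitmap_to_bytes(LETTER_A_BITMAP)
text_data = "A".encode("ascii")  # the text character "A" is the single byte 0x41

print("text bytes: ", text_data.hex())
print("image bytes:", image_data.hex())

# Searching the pixel data for the character finds nothing,
# because the image holds pixel values, not character codes.
print("found 'A' in image:", text_data in image_data)
```

The search comes back empty for the same reason a text search on a scanned TIFF gets no hits: nothing in the file says "this is an A" until an OCR step adds that information.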

The unique thing about PDF is that you can have an exact image of the document, plus the text, plus all kinds of metadata ALL IN ONE FILE. This is a wonderful thing -- but I will expound on its wonderfulness later... With the "Paper Capture" tools in Acrobat, the software reads the picture, and figures out what the text is. So while you still see the "image," the software can also read the underlying text. OCR is not perfect, and it works best on first generation, laser printed images (just like your eyes do). In the past decade, however, OCR technology has gotten surprisingly accurate.
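
One way to picture "image, plus text, plus metadata, all in one file" is the rough model below. It is only a sketch under assumptions: the class names ScannedPage and Word, the box coordinates, and the find method are all invented for illustration and say nothing about how PDF or Acrobat actually store this data.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    box: tuple  # (left, top, right, bottom); hypothetical page coordinates


@dataclass
class ScannedPage:
    image: bytes                                   # the exact picture of the page
    words: list = field(default_factory=list)      # text recovered by OCR
    metadata: dict = field(default_factory=dict)   # e.g. title, author

    def find(self, query):
        """Return the boxes of words containing the query, ignoring case."""
        q = query.lower()
        return [w.box for w in self.words if q in w.text.lower()]


page = ScannedPage(
    image=b"\x00\x01\x02",  # placeholder bytes standing in for a scanned image
    words=[
        Word("Scanning", (10, 10, 80, 24)),
        Word("contract", (90, 10, 150, 24)),
    ],
    metadata={"title": "Example"},
)

# A partial-word search lands on the spot in the image where the word sits.
print(page.find("scan"))
```

The reader still looks at the image, while the search works against the word list and uses each word's box to know where on the picture to highlight.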

Scanning and OCR with Acrobat

I see that one area that concerns many people is how to use the OCR (Optical Character Recognition) abilities of Acrobat. Here's an overview, and I'll try to deal with other OCR issues very soon.

Acrobat can import scanned images (best in TIFF) or interface with any TWAIN driver to scanners and digital cameras. This imports only pixel data, so a text recognition step is needed to create searchable text and possibly reduce file size. Acrobat calls this OCR step paper capture.

    * File > Import to bring in image data
    * Tools > Capture to set up your capture preferences. Normal will convert image data to text. Both will keep the image visible and the text data hidden behind it, so find-operations do take you to the right location in the document.
    * Find Next Suspect (Ctrl-H) to review the words Acrobat thinks it may not have recognized correctly. You will see a magnified version of the word/pixels in question. In the actual document, the word Acrobat chose is highlighted. Accept it or type over it.
    * If good looks are really important, you may need to go over most words/lines and edit their font properties with the touchup-text tool, a very lengthy and tedious process. Probably retyping in a word processor would be faster.
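
The suspect review in the steps above can be sketched roughly as follows. This is not Acrobat's API: the RecognizedWord class, the confidence scores, and the 0.80 threshold are all assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    text: str
    confidence: float  # 0.0 to 1.0; assumed to come from the OCR engine


SUSPECT_THRESHOLD = 0.80  # arbitrary cutoff chosen for this sketch


def find_suspects(words, threshold=SUSPECT_THRESHOLD):
    """Return the indexes of words the engine was unsure about."""
    return [i for i, w in enumerate(words) if w.confidence < threshold]


def review(words, corrections):
    """Accept each suspect as-is, or type over it when a correction is given.

    `corrections` maps a word index to replacement text, standing in for
    what a reviewer would type.
    """
    result = [w.text for w in words]
    for i in find_suspects(words):
        if i in corrections:
            result[i] = corrections[i]
    return result


words = [
    RecognizedWord("Acrobat", 0.98),
    RecognizedWord("rnodern", 0.55),  # "m" misread as "rn"
    RecognizedWord("the", 0.70),      # uncertain, but actually correct
]

print(find_suspects(words))
print(review(words, {1: "modern"}))
```

Note that a suspect is only a word the engine doubted, not necessarily a wrong one, which is why the review step lets you accept it unchanged.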


Convert scanned pages to searchable Adobe PDF files that anyone with the free Adobe Reader can view, navigate, and print.

    •     Create reusable document-processing workflows tailored to different types of conversion projects.
    •     Accurately perform OCR, font, and page recognition.
    •     Automatically create intra-document links, including tables of contents, cross references, and indexes.
    •     Efficiently correct OCR text suspects with the new QuickFix tool.
    •     Use the new Zone tool to define areas of scanned pages to be treated as images, text, or even keywords.
    •     Decrease processing time with workload balancing and multi-processor support (Cluster Edition only).
    •     Create your own web interface with simple HTML pages using the Acrobat Capture SDK.

Bring your paper documents to life on the Web
Bridge the gap between your paper and digital workflows. Adobe® Acrobat® Capture® 3.0 is a professional production tool that teams with your scanner to convert volumes of paper documents into searchable Adobe Portable Document Format (PDF) files. Accurate OCR, advanced page and content recognition, and powerful cleanup tools let you turn all your important paper-based information into high-quality electronic documents ready for publication via the Web, intranets, extranets, CD-ROM, and more. Sophisticated productivity features streamline processing from start to finish, so you can get your jobs done more efficiently than ever.

When it's done, don't forget to File > Save the document. And there you have it. (At this point, I always like to do a little test by running a quick search on a word that I see on the first page. It just makes me feel better to know that it worked. I also have a continuing dialogue about what to do with the original TIFF file...)
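
The "quick search on a word from the first page" test amounts to something like the sketch below. It assumes you already have the text of each page as a string (how you extract it is outside this example), and the function name pages_containing and the sample page text are hypothetical.

```python
def pages_containing(pages, word):
    """Return 1-based page numbers whose text contains `word`, ignoring case."""
    needle = word.lower()
    return [n for n, text in enumerate(pages, start=1) if needle in text.lower()]


# Hypothetical page text as it might come back from a captured document.
captured = ["Deposition of Jane Doe", "Q. Please state your name."]

# The same document before capture: images only, so there is no text at all.
uncaptured = ["", ""]

print(pages_containing(captured, "deposition"))
print(pages_containing(uncaptured, "deposition"))
```

If a word you can plainly see on page one comes back with no hits, the capture step did not produce a text layer.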

As I said, if your image file is from a laser-printed copy, and it's a decent scan, the OCR accuracy is amazingly good. But it may have garbled some words, so if you want to get really fancy, go back to Document > Paper Capture and select "Find first OCR suspect" or "Find all OCR suspects." This identifies characters that the OCR engine had problems with, and gives you a chance to correct the text. You can fix the spelling if it's important to you -- say for a proper name or term. That way you can be sure that the search software will find it. Otherwise, for a common word, I'd just save time and let it slide.
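
If you wanted to automate the "fix the names, let the common words slide" rule, one possible approach is to compare each suspect against a short list of terms you care about. The sketch below uses Python's standard difflib module for the comparison; the watchlist, the 0.6 cutoff, and the function name suggest_fix are assumptions for illustration, and nothing here is a feature of Acrobat.

```python
import difflib

# Terms that matter for later searches; this watchlist is hypothetical.
IMPORTANT_TERMS = ["Smith", "Acrobat", "indemnification"]


def suggest_fix(suspect, important_terms=IMPORTANT_TERMS, cutoff=0.6):
    """Return the important term the suspect most resembles, or None."""
    matches = difflib.get_close_matches(suspect, important_terms, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for suspect in ["Srnith", "tbe"]:
    fix = suggest_fix(suspect)
    if fix is None:
        print(f"{suspect!r}: not close to anything important, let it slide")
    else:
        print(f"{suspect!r}: looks like {fix!r}, worth correcting")
```

A garbled proper name gets flagged because it sits close to a watchlist entry, while a mangled common word matches nothing and is skipped.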