Scanned PDF documents are fine for reading but offer little beyond that. You cannot select text on a scanned page, copy a fragment to the clipboard, or search the file. This tutorial explains how to turn a scanned PDF into a searchable document using the PDFium C# library and the Tesseract .NET OCR SDK. To learn how to create a PDF from scanned pages, please read this tutorial instead.

How it works

The idea is simple. We take the scanned pages of the original PDF, recognize them with an OCR (optical character recognition) library, and add an invisible layer containing all the recognized text to the PDF file, on top of the main visible layer with the scanned pages. Users can view and read the document as before, but they can also search the text, select it, copy the selection to the clipboard and so on.

How to do that

1. Enable required namespaces

To turn a scanned PDF into a searchable one, we need the following namespaces:

These are required to work with PDF documents.

These provide OCR capabilities.

And we also need some standard ones.

2. Initialize libraries

We need to initialize the PDFium library and the Tesseract OCR library.

This line initializes PDFium. The process has some nuances, because initialization is static. Read more about it here.

Then we need to initialize the OCR library:

To create an instance of the OcrApi class, we call the static Create() method. The OcrApi class implements the IDisposable interface, so we either need to call ocr.Dispose() or simply wrap the instance in a using statement. After creating the instance, we need to initialize it. The Init() method does this.

The method looks as follows:

We don’t need all of these parameters to convert a scanned PDF into a searchable one in our example. In fact, you can call Init without any parameters at all (see below). However, other tasks may require them, so here is a brief description of what can be passed to Init.

language

This parameter specifies the language or languages for OCR. You can recognize multi-language documents, but the more languages you include, the more memory the app consumes. More languages also mean lower OCR quality. In our case we just stick with English, which is also the default language of the OCR engine.

Note: the tessdata folder should contain data files for all languages you use in the OCR. You can download Tesseract language modules here.

dataPath

This is a path to the parent folder of the tessdata folder – the folder where the language data files are. This is either a full path, or a relative path. The path should end with a trailing backslash. For example, if the path to the tessdata folder is c:\MyApp\tessdata\ the path passed in the dataPath parameter should be c:\MyApp\.

If you don’t specify the parameter, the path defaults to the current folder of the app.

Note: if the current folder changes (for instance, when the user navigates to another folder in an Open or Save dialog), an omitted dataPath will point to the new location too! It is therefore good practice to specify the path explicitly in this parameter.
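One way to follow this advice is to derive the path from the location of the executable rather than from the current folder. The sketch below uses only standard .NET calls. The TessdataLocator class and its members are hypothetical names introduced for illustration, and the code assumes the tessdata folder is deployed next to the executable.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of either SDK.
public static class TessdataLocator
{
    // Returns the parent folder of "tessdata" with a trailing separator,
    // based on the application's base directory (not the current folder).
    public static string GetDataPath()
    {
        string baseDir = AppDomain.CurrentDomain.BaseDirectory;

        // Make sure the path ends with a directory separator.
        string separator = Path.DirectorySeparatorChar.ToString();
        if (!baseDir.EndsWith(separator, StringComparison.Ordinal))
            baseDir += separator;

        return baseDir;
    }

    // True if the language data folder is where we expect it to be.
    public static bool TessdataExists()
    {
        return Directory.Exists(Path.Combine(GetDataPath(), "tessdata"));
    }
}
```

The string returned by GetDataPath() can then be passed as dataPath, so a change of the current folder no longer affects where the OCR engine looks for its data.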

oem

This parameter specifies the OcrEngineMode. The available options are OEM_TESSERACT_ONLY for the fastest OCR, OEM_CUBE_ONLY for slower but more accurate recognition, OEM_TESSERACT_CUBE_COMBINED for the highest accuracy, and OEM_DEFAULT. The last one chooses the OCR mode from variables in the language-specific config or command-line configs or, if none of them is set, defaults to OEM_TESSERACT_ONLY.

Note: The tessdata folder should have the corresponding language files in order for the OCR modes to initialize. Language filenames for the OCR modes are:

*.traineddata – for the OEM_TESSERACT_ONLY mode;

*.cube.* – for the OEM_CUBE_ONLY mode;

*.tesseract_cube.* – for the OEM_TESSERACT_CUBE_COMBINED mode.

If the corresponding file is missing, initialization will fail.

configs

The array of configuration file names. The corresponding configuration files should be located in the configs or tessconfigs subfolder in the tessdata folder.

varsVec

The array of configuration variable names. This is an alternative way to configure Tesseract.

varsValues

The array of configuration variable values. The list of supported variables can be found here.

Variables passed this way take priority over configuration files. This allows you to override certain settings simply by passing the corresponding variables directly in the varsVec and varsValues parameters.
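Because names and values travel in two separate arrays, they must stay aligned index by index. The following sketch shows one way to build both arrays from a single dictionary so they cannot drift apart. OcrVariables and ToParallelArrays are hypothetical names, not part of the OCR SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the OCR SDK.
public static class OcrVariables
{
    // Splits name/value pairs into two arrays where names[i] belongs to values[i].
    public static void ToParallelArrays(
        IDictionary<string, string> variables,
        out string[] names,
        out string[] values)
    {
        names = new string[variables.Count];
        values = new string[variables.Count];

        int i = 0;
        foreach (KeyValuePair<string, string> pair in variables)
        {
            names[i] = pair.Key;
            values[i] = pair.Value;
            i++;
        }
    }
}
```

For example, a dictionary that maps the Tesseract variable tessedit_char_whitelist to 0123456789 produces two one-element arrays that could be passed as varsVec and varsValues.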

setOnlyNonDebugParams

Disable this flag for debugging purposes. Enable it for the final build.

For now, we only use one parameter and initialize the OCR as follows:

All other parameters are omitted and are set to their corresponding default values as described above.

So, here is the C# code we’ve got so far:

Once we are done with initialization, it is time to do some work.

3. OCR pages

To build a PDF from images we need a renderer. In our case we need one specific renderer called OcrPdfRenderer.

Here, we pass the static Create method the filename under which to save the recognized PDF (the first parameter) and the location of the language data files (the second parameter). Note that, unlike the initialization procedure above, this method needs the path to the tessdata folder itself, not to its parent folder.

The OcrPdfRenderer class implements IDisposable too, so don’t forget to call Dispose or use a using statement, as shown below.

Now that we have the renderer created, we pass pages of our scanned PDF file to it. For each page, we render it to a Bitmap, then we recognize it and add the recognized text to the page. Here is the fragment of code that does all of this:

Let’s walk through this code step by step.

This line starts a new document:

The next line loads our source PDF document, the one with the scanned images. The PdfDocument object must be disposed when we are done with it, hence the using statement again.

We want to create a bitmap of each page, so we calculate the required width and height of the bitmap in pixels from the dimensions of the PDF page in points. Each point is 1/72 of an inch, so we take the horizontal or vertical DPI of the image (96 in our example), multiply it by the corresponding dimension and divide by 72.
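The conversion described above can be sketched as a small standalone helper. PageMetrics and PointsToPixels are hypothetical names, not part of PDFium, and rounding to the nearest pixel is a choice made for this sketch.

```csharp
using System;

// Hypothetical helper illustrating the conversion; not part of PDFium.
public static class PageMetrics
{
    private const double PointsPerInch = 72.0;

    // Converts a length in PDF points to pixels at the given DPI.
    public static int PointsToPixels(double points, double dpi)
    {
        return (int)Math.Round(points * dpi / PointsPerInch);
    }
}
```

For a US Letter page of 612 x 792 points rendered at 96 DPI, this gives a bitmap of 816 x 1056 pixels.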

Our next line creates a new PdfBitmap using the dimensions we just computed. The last parameter tells the constructor to use true color mode.

Then we fill the entire bitmap with white and render the page to it:

Finally, the line that does all the work:

The method takes four parameters: the image to recognize, the debug configuration file we don’t currently need, the maximum timeout (zero means no timeout), and the renderer.

OcrPix takes a bitmap in the .NET format as a parameter, so we simply pass one using the bitmap.Image property.

That was quite a long step, but the resulting code is pretty short. Here is the entire program:
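The listing below is a sketch of how the steps described in this tutorial fit together. The namespaces, type names and method signatures are assumed from the Patagames Pdfium.Net SDK and Tesseract.Net SDK and may differ between versions, so check them against the SDKs you have installed. The file names scanned.pdf and searchable.pdf are placeholders, and the render flag is an assumption.

```csharp
// Sketch only: type and member names are assumed from the Patagames
// Pdfium.Net SDK and Tesseract.Net SDK; verify them for your versions.
using System.Drawing;
using Patagames.Ocr;
using Patagames.Ocr.Enums;
using Patagames.Pdf.Enums;
using Patagames.Pdf.Net;

class Program
{
    const int Dpi = 96; // rendering resolution used in this tutorial

    static void Main()
    {
        PdfCommon.Initialize(); // static PDFium initialization

        using (var api = OcrApi.Create())
        {
            api.Init(Languages.English); // all other parameters keep their defaults

            // Placeholder output name; the second argument is the tessdata folder itself.
            using (var renderer = OcrPdfRenderer.Create("searchable.pdf", "tessdata"))
            {
                renderer.BeginDocument("Searchable PDF");

                using (var doc = PdfDocument.Load("scanned.pdf"))
                {
                    foreach (var page in doc.Pages)
                    {
                        // Page size is in points (1/72 inch); convert to pixels.
                        int width = (int)(page.Width * Dpi / 72.0);
                        int height = (int)(page.Height * Dpi / 72.0);

                        using (var bitmap = new PdfBitmap(width, height, true))
                        {
                            // White background, then render the page on top of it.
                            bitmap.FillRect(0, 0, width, height, Color.White);
                            page.Render(bitmap, 0, 0, width, height,
                                PageRotate.Normal, RenderFlags.FPDF_LCD_TEXT);

                            // Recognize the page image and append it to the output PDF.
                            api.ProcessPage(
                                OcrPix.FromBitmap((Bitmap)bitmap.Image), null, 0, renderer);
                        }
                    }
                }

                renderer.EndDocument(); // finalizes the output PDF
            }
        }
    }
}
```

The nested using statements dispose the objects in reverse order of creation, so the source document is closed before the renderer finalizes the output.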

Final notes

The call to the EndDocument() method is required to finalize the PDF document. Also note that Tesseract OCR cannot reliably recognize symbols smaller than 20 pixels, so make sure the DPI of the scanned pages is high enough to provide at least that line height.
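To estimate what DPI that implies, the points-to-pixels relation can be inverted. The sketch below is a rough guide only: ScanQuality and MinimumDpi are hypothetical names, and treating the font size in points as the symbol height is an approximation, since actual glyphs are usually somewhat smaller than the nominal font size.

```csharp
using System;

// Hypothetical helper; the 20-pixel threshold comes from the note above.
public static class ScanQuality
{
    // Smallest DPI at which text of the given height (in points)
    // is at least minPixels tall.
    public static int MinimumDpi(double textHeightPoints, double minPixels = 20.0)
    {
        if (textHeightPoints <= 0)
            throw new ArgumentOutOfRangeException(nameof(textHeightPoints));

        // pixels = points * dpi / 72, solved for dpi.
        return (int)Math.Ceiling(minPixels * 72.0 / textHeightPoints);
    }
}
```

Under these assumptions, 10-point text needs at least 144 DPI and 12-point text at least 120 DPI. At 96 DPI, 10-point text comes out at roughly 13 pixels, so a higher rendering resolution may be worth trying for documents with small print.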