TLDR:

Extracting images and adding an OCR layer: PDF-XChange Viewer

Splitting scanned books: ScanTailor / ScanTailor Advanced

Merging the outputs: i2pdf

Creating hierarchical bookmarks/table of contents: JPdfBookmarks

All programs are free. The whole process takes around 1 hour of running time, with occasional checks.

Splitting scanned books

There are two problems with automating the splitting of scanned books in a single pass:

Automation is not always accurate

Making a scanned book comfortable to read takes more than just splitting pages

For everything related to scanned books, I strongly recommend using ScanTailor (or its fork ScanTailor Advanced). It has features such as:

Straighten skewed pages,

Select content to reduce the page size,

Increase/decrease margins (for taking notes, say),

Whiten the result for a better reading experience.

You must export the PDF pages to images to use it, then recombine the processed images into a PDF. The processed images can be very small in file size (as little as 6% of the original), yet excellent in quality.

To complete the task satisfactorily, I recommend using PDF-XChange Viewer for extracting images and adding the OCR layer, and i2pdf for merging the outputs. In my experience, you can set the JPG quality to the lowest and the result doesn’t look much different, but there is still a trade-off between the final output’s size and its image quality.
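One thing to watch when recombining: the images must go into the merger in page order, and a plain alphabetical sort puts page_10 before page_2. A minimal sketch of a natural sort, assuming output names that contain a page number (the file names below are hypothetical, not what any of the tools above is guaranteed to produce):

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'page_10' sorts after 'page_2'."""
    # re.split with a capturing group keeps the digit runs, so chunks
    # always alternate text, number, text, ... and compare cleanly.
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]

def in_page_order(file_names):
    """Return the file names sorted by the numbers embedded in them."""
    return sorted(file_names, key=natural_key)

# Hypothetical output names, for illustration only.
names = ["page_10.tif", "page_2.tif", "page_1.tif"]
print(in_page_order(names))  # ['page_1.tif', 'page_2.tif', 'page_10.tif']
```

If your tool already pads the numbers with zeros (page_0001, page_0002, ...), a plain sort is enough and you can skip this.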

Creating hierarchical bookmarks/table of contents

Use JPdfBookmarks.

Step 1: Prepare the table of contents

Save the TOC in a .txt file in this format, one bookmark per line, with one tab of indentation for each nesting level:

Chapter 1. The Beginning/23
	Para 1.1 Child of The Beginning/25,FitWidth,96
		Para 1.1.1 Child of Child of The Beginning/26,FitHeight,43
Chapter 2. The Continue/30,TopLeft,120,42
	Para 2.1 Child of The Beginning/32,FitPage

You can OCR the TOC and use regex to clean it up.
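As an illustration of the regex clean-up, here is a minimal sketch. It assumes the OCR'd lines look like "1.1 Child of The Beginning .... 25" (title, dot leaders, page number) and that the nesting depth can be read off the section numbering; your OCR output will likely need its own pattern. It also assumes child bookmarks are marked by tab indentation in the .txt file.

```python
import re

# Assumed OCR line shape: "<title> <dot leaders> <page>".
LINE = re.compile(r"^\s*(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$")
# Leading section number such as "1", "1.1" or "1.1.1".
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\b")

def to_bookmark_lines(ocr_text):
    """Turn OCR'd TOC text into tab-indented 'title/page' lines."""
    result = []
    for raw in ocr_text.splitlines():
        match = LINE.match(raw)
        if match is None:
            continue  # skip blank lines and lines without a trailing page number
        title = re.sub(r"\s+", " ", match["title"]).strip()
        numbering = NUMBERING.match(title)
        # "1" -> depth 0, "1.1" -> depth 1, "1.1.1" -> depth 2
        depth = numbering.group(1).count(".") if numbering else 0
        result.append("\t" * depth + f"{title}/{int(match['page'])}")
    return result

# Made-up sample input, for illustration only.
sample = """
1 The Beginning ........ 23
1.1 Child of The Beginning .... 25
1.1.1 Child of Child of The Beginning .. 26
2 The Continue ........ 30
"""
print("\n".join(to_bookmark_lines(sample)))
```

Titles without a leading number (for example "Preface") end up at the top level, so check the result by eye before loading it.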

Step 2: Load that TOC

Step 3: Prepare for step 4

This sounds dumb, but if you skip it you will get frustrated and have to start over. Expand all bookmarks (Ctrl + E), select all of them, then go to Tools → Apply Page Offset.

Step 4: Apply page offset

This step should be self-explanatory. Don’t forget to save.
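If you would rather shift the page numbers in the .txt file itself before loading it, the same idea can be sketched in a few lines. This assumes the "title/page,options" line format from Step 1 and that the offset is simply the difference between the printed page numbers and the PDF page positions (for example, 14 pages of front matter); the value 14 is invented for the example.

```python
import re

# "<anything>/<page>" optionally followed by ",FitWidth,96" and the like.
# The greedy head means the LAST "/<digits>" on the line is taken as the page.
ENTRY = re.compile(r"^(?P<head>.*)/(?P<page>\d+)(?P<rest>(?:,.*)?)$")

def shift_pages(lines, offset):
    """Add `offset` to the page number of every bookmark line."""
    shifted = []
    for line in lines:
        match = ENTRY.match(line)
        if match is None:
            shifted.append(line)  # leave lines we do not understand untouched
            continue
        new_page = int(match["page"]) + offset
        shifted.append(f"{match['head']}/{new_page}{match['rest']}")
    return shifted

toc = [
    "Chapter 1. The Beginning/23",
    "\tPara 1.1 Child of The Beginning/25,FitWidth,96",
]
print("\n".join(shift_pages(toc, 14)))
```

Leading tabs and the trailing view options are carried over unchanged, so the hierarchy survives the shift.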

That’s it. You are done. For more information, you can read the program’s manual. It also has a command-line mode and runs on Linux and macOS.

If there are non-Roman characters, be sure to use the same encoding when dumping and applying bookmarks.
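To see why the encoding matters, here is a small sketch that writes a bookmark file and reads it back. It assumes UTF-8 as the shared encoding (pick whatever you select in the program, as long as it is the same on both sides); the helper names and the sample title are made up for the example.

```python
import tempfile
from pathlib import Path

ENCODING = "utf-8"  # assumption: use the same value for dumping and applying

def save_bookmarks(path, lines, encoding=ENCODING):
    """Write bookmark lines to a text file with an explicit encoding."""
    Path(path).write_text("\n".join(lines) + "\n", encoding=encoding)

def load_bookmarks(path, encoding=ENCODING):
    """Read bookmark lines back with an explicit encoding."""
    return Path(path).read_text(encoding=encoding).splitlines()

with tempfile.TemporaryDirectory() as folder:
    toc_file = Path(folder) / "toc.txt"
    lines = ["Глава 1. Начало/23"]
    save_bookmarks(toc_file, lines)

    same = load_bookmarks(toc_file)                      # matching encoding
    garbled = load_bookmarks(toc_file, encoding="latin-1")  # mismatched encoding

    print(same == lines)     # True
    print(garbled == lines)  # False
```

The mismatched read does not raise an error, it just silently produces garbage titles, which is why it is easy to miss until you open the PDF.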

See also: How to OCR tables of contents?

Other resources

Book scanning – Wikipedia

How to Scan a Book (with Pictures) – wikiHow

Willus.com’s PDF Conversion Tips for e-readers

DIY Book Scanner

Tips for Scanning · scantailor/scantailor Wiki