From MobileRead

A quick summary of how to clean up files from TXT or other electronic sources. See Digitizing Paper Books to Ebooks to get a paper book into electronic form in the first place.

Overview

Making an eBook for portable devices is not necessarily as easy as it might first sound. There is more to it than just running an automatic converter to the format you need. The main effort is to first get the electronic document into a reasonable eBook form. This article describes a method used to create the eBooks that are posted here at MobileRead. If you want to write your own eBook, see Authoring.

The main task is to get the source into reasonable shape to begin with. Usually this means converting to an intermediate format that has reasonable support for the features that are needed. Often this intermediate format is MS Word or some other word processor format, since these support fonts, type styles, structure, spell checking, and other features that can aid in the preparation. Some tools use HTML as the intermediate format, which works as well since it too has structure and similar capabilities. In fact, the fix-up phase may require or benefit from multiple tools and formats.

The definition of reasonable shape depends somewhat on the desired outcome and on the capabilities of the final conversion tool used to get to that format. This article describes what some might call a worst case scenario, but you might be surprised at the state of some of the source electronic texts that are out there.

Steps

The steps below assume that you are starting with basic text format. Adjust as necessary if your starting point is different. Starting with OCR'd text is generally the worst case.

1. Go to the source file, say at the Internet Archive. Download the text file and a PDF. Or perhaps you are starting from Digitizing Paper Books to Ebooks.
2. Try saving the PDF as a text file and see whether it is better than the downloaded text file. Usually there's not much difference.
3. Paste the text file into an MS Word doc (or your favorite editor).
4. Remove all headers and footers.
5. Run a heap of 'Find and Replace' commands to overcome common mistakes (see Common mistakes).
6. Run Stingo's macro. See Hard returns for more info.
7. Check every instance of 'space"space' and correct them.
8. Check every instance of 'space'space' and correct them.
9. If you like curly quotes, do a find and replace of " with " and ' with ' (with smart quotes turned on, the replacements come out curly). Then check each instance where they occur after a dash of any sort. Also check each instance of space' (because of contractions like 'em, 'tis, 'twas, etc.). See Smart quotes for a fast way to do this with MS Word.
10. Now open both the PDF and the doc. Adjust the window sizes so that each takes up half the screen. Read them side by side. You will have to add dashes (OCR often misses them out) and italics. Focus mostly on these, and on obvious spelling errors. Also add any missing accents.
11. Now run a spell-check on the doc. Note all dubious cases and check them.
12. If the source was very poor, repeat step 10. I often check every instance of ' and ", because these often get missed out.
13. Get the chapter headings centered and in bold. Ditto the author and title.
14. Generate a TOC unless your conversion software can do it for you.
15. Insert any pictures.
16. Move any footnotes to the end of sections, the end of the book, or wherever you want them.
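The two "check every instance" steps for quote marks with a space on both sides can be made less tedious by listing the hits first. The following is a minimal Python sketch, not part of the original workflow; the function name `find_floating_quotes` is made up for illustration, and it assumes the text has been saved as a plain text file with straight quotes.

```python
import re

# A quote mark with a space on each side is ambiguous (opening or closing?)
# and usually points at an OCR slip. Straight quotes only (an assumption).
FLOATING_QUOTE = re.compile(" ([\"']) ")

def find_floating_quotes(text):
    """Return (line_number, column, line) for each quote surrounded by spaces.

    Line numbers and columns are 1-based; the column points at the quote.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FLOATING_QUOTE.finditer(line):
            hits.append((lineno, match.start(1) + 1, line))
    return hits

if __name__ == "__main__":
    sample = 'He said " hello\nfine "ok"\nit \' s'
    for lineno, column, line in find_floating_quotes(sample):
        print(f"line {lineno}, column {column}: {line}")
```

The script only reports locations. Deciding which way each quote should lean is still a manual job, as the steps say.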

Now the text should be ready for Book Designer, Calibre, or your favorite conversion software. Note that some conversion software can accept the TOC you create in MS Word, while some, like Book Designer, can generate it for you later.

The good news is that a conversion in Book Designer can now be done in less than 5 minutes. The bad news is that you will have spent many hours tidying up the original text.

Obviously, you can do this in stages, a few minutes at a time. Take notes, so that you know what you have already done.

You may want to provide specialized pages at the front, such as a Cover Page and a Title Page.

Common mistakes

These mistakes are most often caused by OCR.

find ' ?' replace with '?'.

find ' !' replace with '!'

repeat for every punctuation mark; many of them have an unnecessary space in front.

find 'hyphen space paragraph mark' replace with 'paragraph mark' to remove unnecessary hyphenation at the hard line breaks. It's best to check each one individually because some hyphens are meant to be there.

find 'paragraph mark " space' replace with 'paragraph mark" ' --this is a common OCR error.

check for common OCR errors, such as 'lie' for 'he'.

run a search for each of the numerals in turn. Zero and O are often confused, and 1, I and l are often mixed up. A double quote (") can appear as 4 or 66.

look for em dashes and deal with them as needed. These longer dashes can sometimes be handled automatically but often need manual fixing. ASCII text documents often use -- to indicate where an em dash belongs. See OCR villains for more common errors.
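For readers who prefer a script to a long run of manual Find and Replace commands, some of the fixes above can be sketched in Python. This is an illustration under stated assumptions, not the method used for the MobileRead uploads: the function names are invented, the punctuation set is a guess at "every punctuation mark", and removing spaces before periods will also close up spaced ellipses. Line-end hyphens are only reported, not changed, because some hyphens are meant to be there.

```python
import re

def fix_space_before_punctuation(text):
    """Delete spaces or tabs that sit directly before ? ! , ; : or a period."""
    return re.sub(r"[ \t]+([?!,;:.])", r"\1", text)

def find_line_end_hyphens(text):
    """List (line_number, line) for lines ending in a single hyphen.

    These are candidates for manual review; lines ending in a double
    hyphen are skipped because -- usually stands for an em dash.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if stripped.endswith("-") and not stripped.endswith("--"):
            hits.append((lineno, line))
    return hits

def double_hyphen_to_em_dash(text):
    """Replace exactly two hyphens with an em dash, leaving longer runs alone."""
    return re.sub(r"(?<!-)--(?!-)", "\u2014", text)

if __name__ == "__main__":
    print(fix_space_before_punctuation("Who is it ? Me !"))
    print(find_line_end_hyphens("a conver-\nsion of a well-\nknown text"))
    print(double_hyphen_to_em_dash("wait--what"))
```

Run the reporting function before any automatic replacement, so that the suspicious lines can still be compared with the PDF.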

Hard returns

Most plain text files contain a return sequence (CR/LF) at the end of each line. Typical reading software expects to wrap lines itself to the width of the screen and treats the return sequence as a paragraph boundary. If the screen is wider than a typical line you may not even notice the difference, but on mobile devices the line is usually longer than the screen width: the line wraps properly, but then ends at the hard return before reaching the screen edge a second time. The result is a really poor reading experience, with alternating long and short lines caused by the mix of hard returns and wrapped lines. The solution is to remove the hard returns from the source text file and to treat a blank line (or possibly an indented line) as marking the end of one paragraph and the beginning of the next.

There are a number of programs that can aid in making this conversion. One of them may be usable as a first step, converting the file before it is read into the editing program. Some editing programs can import these text files and convert them automatically as they are read in.
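The idea of removing hard returns while keeping blank lines as paragraph boundaries can be sketched in a few lines of Python. This is a minimal illustration, not one of the programs mentioned above; the function name is made up, it assumes paragraphs are separated by blank lines (indented-line paragraphs are not detected), and it does not rejoin words hyphenated at line ends.

```python
import re

def unwrap_hard_returns(text):
    """Join hard-wrapped lines into paragraphs separated by one blank line."""
    # Normalise CR/LF and bare CR line endings to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (whitespace-only) lines mark a paragraph boundary.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse the remaining line breaks and runs of spaces to one space.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return "\n\n".join(paragraphs)

if __name__ == "__main__":
    wrapped = (
        "It was a dark\r\nand stormy night.\r\n"
        "\r\n"
        "The rain fell\r\nin torrents.\r\n"
    )
    print(unwrap_hard_returns(wrapped))
```

Verse, tables, and other text that depends on its line breaks will be flattened by this approach, so such passages need to be protected or restored by hand.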

Smart quotes

MS Word has a smart quotes feature that can be turned on as an option. It can help to generate and fix curly quotes. To convert straight quotes to curly quotes, do a search and replace: search for " and replace with ". Because smart quotes are turned on, Word will pair them up for you. You will still need to check the file, since it is possible to fool the program, particularly when quotes cross paragraph boundaries.
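Outside of Word, a similar pairing can be approximated with a simple rule: a quote at the start of the text, or after whitespace, an opening bracket, or a dash, is treated as opening, and any other quote as closing. The Python sketch below is only a heuristic under that assumption and is not how Word's feature works internally. The function name and the short contraction list ('em, 'tis, 'twas) are illustrative choices, and the output still needs checking by eye.

```python
import re

# Contexts after which a quote is treated as an opening quote (an assumption).
OPENERS = r"(^|[\s(\[{\u2014\u201c-])"
# Leading-apostrophe contractions that must not become opening quotes.
CONTRACTIONS = r"'(?=(?:em|tis|twas)\b)"

def curl_quotes(text):
    """Convert straight quotes to curly quotes using a simple context rule."""
    # Handle known leading-apostrophe contractions first.
    text = re.sub(CONTRACTIONS, "\u2019", text, flags=re.IGNORECASE)
    # Double quotes: opening in an opener context, closing everywhere else.
    text = re.sub(OPENERS + '"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")
    # Single quotes: same rule; leftovers double as apostrophes.
    text = re.sub(OPENERS + "'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text

if __name__ == "__main__":
    print(curl_quotes('"Hello," she said. "It\'s late."'))
    print(curl_quotes("'Tis true, tell 'em"))
```

Contractions that are not in the list will come out with an opening quote instead of an apostrophe, which is the same kind of case the manual check after conversion is meant to catch.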

Converting the final document

MS Word documents are not particularly easy to convert directly to eBook formats, except for LIT, for which an add-on is available. People often convert DOC files into HTML and then use the resulting HTML file to convert to their final eBook format. Unfortunately, Word does not produce very clean HTML. If you have to use Word, be sure to use the filtered option. Other choices include reading your DOC file into another editor, either directly or after saving it as RTF, and then using that editor's features to save as HTML. The old Word 97 version creates better HTML than later versions.

AbiWord, a free download, can generally read DOC files and RTF files. It will generate better HTML than Word does.

Atlantis Word Processor, not free, can read DOC files and can generate ePub as a save option.

Scrivener is another commercial product that can generate ePub and MOBI files directly.

Jutoh can create ePub and MOBI as well.

LibreOffice and OpenOffice.org can also create filtered HTML from .doc, .docx, .rtf, and a variety of other formats. LO and OOo also have extensions that allow direct exporting to ebook formats, such as Writer2epub and eLAIX.
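If LibreOffice is installed, the DOC-to-HTML step can also be run without opening the program, using its headless command-line conversion. The Python sketch below only wraps that command. It assumes the LibreOffice executable is on the PATH under the name `soffice` (the name and location vary by platform), the function names are made up for illustration, and the HTML it produces may still need cleaning up.

```python
import subprocess
from pathlib import Path

def build_html_export_command(source, out_dir, soffice="soffice"):
    """Build the LibreOffice command line for a headless HTML export."""
    return [
        soffice,
        "--headless",             # run without the user interface
        "--convert-to", "html",   # target format
        "--outdir", str(out_dir),
        str(source),
    ]

def export_to_html(source, out_dir, soffice="soffice"):
    """Run the conversion and return the expected path of the HTML file."""
    subprocess.run(build_html_export_command(source, out_dir, soffice), check=True)
    return Path(out_dir) / (Path(source).stem + ".html")

if __name__ == "__main__":
    # Only print the command here; call export_to_html() to actually run it.
    print(" ".join(build_html_export_command("book.doc", "out")))
```

Keeping the command construction separate from the call makes it easy to inspect or adjust the command before running it on a whole folder of files.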

See Kindle HowTo: File Conversion to convert files to MOBI and other Kindle formats.

See ePub and ePub 3 for tools specific to that format.

See PDF for tools specific to that format.

There is a comprehensive list of eBook conversion tools available, but there are also some tools that are especially handy for fixing problems in the source file. This section lists some of these tools.

Tidy is one of several programs with tidy in their names. This one is often called HTML Tidy. It is a program to clean up HTML files.

E-Book Tidy can convert files to (and from) PalmDOC, TXT and HTML while fixing problems like hard returns. It is particularly useful in working with Gutenberg downloads.

ClipboardFusion - supercharge your clipboard. Allows editing and modifying the clipboard data before pasting it. Allows capturing data from web sites and other sources.