
Notes on book scanning

These notes describe my experience of trying to convert an out-of-print book from
a paper edition to a digital file.

I used a normal flatbed scanner, an old version of Photoshop, and Tesseract,
Google's free character recognition software. I also used an application called
TXTcollector to merge the many text files that Tesseract creates, Microsoft Word
for the spell checking and editing, and Microsoft Excel to help me make some
batch files.

After running a few tests, I settled on the process below as the best way I could
find to get accurate character recognition from Tesseract. The software is not
perfect - it seemed to have real problems with the letter "F" - but it certainly
works well enough when combined with a spell-check. If you are planning to scan a
large book, then run some tests of your own first to make sure your process is
going to work. The most vital thing to ensure is that your scans are of good
quality - if you can see dots, speckles, and shadows on your scanned pages when
you look at them in Photoshop, Tesseract will interpret these as characters, and
you will get garbage out.

The most time-consuming part of the whole process is the scanning itself.
High-volume book scanning is usually done using a high-resolution digital camera
mounted above a board where the book rests. But I was using a traditional flatbed
scanner.

Scanning a small hardback book of about 230 pages took the best part of a day.

I chose to scan pairs of facing pages at a time, and used Photoshop to import the
scans, usually doing about 10 pairs of pages at a time, then saving the scans as
uncompressed greyscale TIFF files. The scanner software had an option for OCR
(optical character recognition), so I went with its settings, which were
greyscale at 300dpi.

I set up the crop area in the scanner software to cut away everything except the
white part of the page - I kept a large white margin around the edges of the text
area, to make things easier later.

The software also had something called "gutter shadow reduction", which lessens
the dark line between the pairs of pages due to the shape of the book's spine. I
switched this on, and it seemed to reduce the problem. I did no image correction
at this time, no rotation, nothing, just saved the files one by one.

I named each TIFF file after its page numbers, like "004_5.tif", "006_7.tif",
"008_9.tif" etc. The leading zeros at the front of the filename make it easier
to sort the files.
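
As a small illustration of why the leading zeros help, here is a Python sketch.
The helper name pair_name and the three-digit padding are my own assumptions, as
is writing the right-hand page number in full once it reaches two digits; the
notes above only show single-digit examples.

```python
def pair_name(left_page, width=3):
    """Build a name such as '004_5.tif' for the pair starting at left_page.

    Assumption: the right-hand page number is written in full (e.g. '012_13.tif').
    """
    return f"{left_page:0{width}d}_{left_page + 1}.tif"


pages = (100, 4, 12)
padded = [pair_name(p) for p in pages]
unpadded = [f"{p}_{p + 1}.tif" for p in pages]

# With padding, a plain alphabetical sort matches page order.
print(sorted(padded))    # ['004_5.tif', '012_13.tif', '100_101.tif']
# Without padding, page 100 sorts before page 4.
print(sorted(unpadded))  # ['100_101.tif', '12_13.tif', '4_5.tif']
```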

Once I had my rough pairs of pages scanned, I wanted to split them into separate
pages. It took me a while to figure out a good way to do this - in the end I made
a Photoshop action which opened one of the TIFF files, changed the Canvas Size to
50%, and chose to keep the bottom half of the canvas. The action then rotated the
image 90 degrees clockwise, and saved the file with a new name.

This process will only crop the left-hand pages from each of the page-pair files,
but we will do the right-hand pages in a minute.

To run the automation, I went to File --> Automate, and chose all of my page pairs
(which were in a separate folder), and chose a new folder as the destination for
all of my split pages. I made sure that all file names would be replaced, and
chose the new names to be in the form "004_5_left.tif". This is easier than
renumbering - usually you won't be scanning from page 1 of a book, and it keeps
everything consistent - which is vital later when it comes to spell checking.

The automated action took about 5 minutes to split all the files. I now had a
load of left-hand pages cropped from my scans. So now I made a similar Photoshop
Action to the one described above, but took the upper 50% of the image, and
rotated 90 degrees clockwise, then saved. I then automated this action, but
called the files in the form "004_5_right.tif", and had them save into the same
folder as the left-side files.
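
The two actions can be mimicked in a few lines of Python. This is only a sketch
under assumptions: the scan is modelled as a plain list of pixel rows rather than
a real TIFF, and the helper names (rotate_clockwise, split_pair, page_names) are
made up for illustration. It is not what Photoshop does internally.

```python
def rotate_clockwise(rows):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(column) for column in zip(*rows[::-1])]


def split_pair(rows):
    """Return (left_page, right_page) from a sideways scan of two facing pages.

    Following the notes above, the bottom half of the scan holds the
    left-hand page and the top half holds the right-hand page.
    """
    middle = len(rows) // 2
    top, bottom = rows[:middle], rows[middle:]
    return rotate_clockwise(bottom), rotate_clockwise(top)


def page_names(pair_filename):
    """'004_5.tif' -> ('004_5_left.tif', '004_5_right.tif')."""
    stem = pair_filename.rsplit(".", 1)[0]
    return f"{stem}_left.tif", f"{stem}_right.tif"


# A toy 2x2 "scan": the top row belongs to the right page, the bottom to the left.
scan = [["R1", "R2"],
        ["L1", "L2"]]
left, right = split_pair(scan)
print(left, right, page_names("004_5.tif"))
```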

Now the really tedious part. I backed up all my single sheets into another
folder, then went back through every single TIFF file and recropped it to remove
as much junk as possible. This is absolutely essential, and really improves the
spell checking later on.

Any really wonky pages were rotated to be straight, any big blotches were cleaned
up with the clone stamp tool, and any leftover gutter shadows (of which there
were plenty) were removed using the Image --> Adjust --> Curves tool. By adding a
point in the middle of the curve, and dragging it down, you can force any light
grey shades to become white, without harming the black of the text. As I had to
do these processes more than 200 times, I made a few Actions to speed things up -
the decision-making still had to be done manually, but the Curves settings, and
the cropping and saving over, could be automated.

It is also important to check over the scans by eye to make sure that none of the
text has been accidentally chopped off, in case you have mis-scanned a page. I
also cropped off the page numbers, any illustrations, and anything else that
would have caused confusion to the OCR software. The page numbers aren't
necessary, as your filenames reflect the page numbers.
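
The effect of the Curves adjustment described above can be approximated
numerically. The sketch below is a simplification, not Photoshop's actual curve:
it assumes 8-bit greyscale values (0 is black, 255 is white) and a made-up white
point of 200. Anything at or above the white point becomes pure white, darker
values are stretched proportionally, and black stays black.

```python
def whiten(value, white_point=200):
    """Map one 8-bit grey value so that light greys become pure white.

    white_point is an assumed setting; values at or above it map to 255.
    """
    return min(255, value * 255 // white_point)


def clean_row(row, white_point=200):
    """Apply whiten() to every pixel value in a row."""
    return [whiten(value, white_point) for value in row]


# Black text (0) is untouched; a faint gutter shadow (230) turns white.
print(clean_row([0, 100, 200, 230, 255]))  # [0, 127, 255, 255, 255]
```

In practice the right white point depends on how dark the shadows are, so it is
worth trying it on a few of the worst pages first.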

Once I had a set of pristine, white scans with black text, numbered and saved as
TIFF files, it was time to look at Tesseract.

Tesseract has to be run from the command line; there isn't a graphical
interface. But if you've used things like unzipping tools or ftp from the
command line, it isn't any more confusing.

I was on a Windows PC, so used the .EXE files that have been precompiled. I tried
the Mac version, but couldn't get it to build correctly (I think I may have been
missing some X11 installs or something). Anyway, the Windows version works fine.
One part of the install requires delving into the documentation: in order for
Tesseract to work correctly, you must download the Windows executable:

tesseract-2.01.exe.tar.gz Windows executables (vc++6) for Tesseract 2.01

Unzip it using 7-Zip, or some other unzipper. Then you must also download:

tesseract-2.00.eng.tar.gz English language data for Tesseract (2.00 and up)

Unzip this, and you will find a folder called tessdata with a load of files in it.
These files are the training info that tells Tesseract how to read English
(assuming the book you are scanning is in English). There are several other
languages available.

The tessdata folder MUST be placed in the same folder as your tesseract.exe file,
otherwise you will get weird scary errors when you try to run the software.

To convert a TIFF file into text, the TIFF file must be uncompressed. Stick the
TIFF file in the same folder as tesseract.exe, open the command line (click on the
start menu, click Run, then type "cmd"), and go to the folder where you saved
tesseract, eg:

cd \tesseract

If your TIFF file is called myfile.tif then you could type:

tesseract myfile.tif mytextfile

After a few seconds of what appears to be nothing happening, you should find a
text file called mytextfile.txt in your tesseract folder. If it hasn't worked,
there will be an error message in a text file called tesseract.log.

So now you need to do this 200 or so times, for each page that you have scanned.
I automated it by making a batch file in the same folder - ie a text file (called
bat.bat) along the lines of:

tesseract c:\tesseract\1pages_clean\004_5_left.tif 004_5_left
tesseract c:\tesseract\1pages_clean\004_5_right.tif 004_5_right
tesseract c:\tesseract\1pages_clean\006_7_left.tif 006_7_left
tesseract c:\tesseract\1pages_clean\006_7_right.tif 006_7_right
tesseract c:\tesseract\1pages_clean\008_9_left.tif 008_9_left
tesseract c:\tesseract\1pages_clean\008_9_right.tif 008_9_right

And so forth.
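
A list of lines like this could also be generated with a short script instead of
a spreadsheet. The Python sketch below assumes the folder and file naming shown
above; the function name batch_lines is hypothetical, and the only Tesseract
usage it relies on is the "tesseract input output" form already shown.

```python
def batch_lines(tif_names, folder=r"c:\tesseract\1pages_clean"):
    """Return one 'tesseract <input file> <output base name>' line per TIFF."""
    lines = []
    for name in sorted(tif_names):
        base = name.rsplit(".", 1)[0]  # "004_5_left.tif" -> "004_5_left"
        lines.append(f"tesseract {folder}\\{name} {base}")
    return lines


example = batch_lines(["004_5_right.tif", "004_5_left.tif"])
print("\n".join(example))
```

Saving the joined lines into a text file ending in .bat gives the same kind of
batch file as the one above.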

This creates one text file per page of your book. To merge all the pages into
one, I used TXTcollector - which is available from here:
http://bluefive.pair.com/txtcollector.htm

I'm sure I should have used something clever in Python, but this app was simple
and small, and did the job. Make sure to tick the boxes "No Separator" and "No
filename", which prevent separators and file names from being added to your
combined text file.
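
For anyone who would rather do the merge in Python, a minimal sketch might look
like the following. It assumes the page files are UTF-8 text sitting in one
folder and that sorting by file name gives page order (which the zero-padded
names are meant to ensure). The function name merge_pages and the output name
combined.txt are made up.

```python
from pathlib import Path


def merge_pages(folder, output_name="combined.txt"):
    """Concatenate every .txt file in folder, in file-name order.

    Nothing is written between pages: no separator and no file names.
    """
    folder = Path(folder)
    output = folder / output_name
    # Collect the page list first, leaving out any earlier combined file.
    pages = sorted(
        (p for p in folder.glob("*.txt") if p.name != output_name),
        key=lambda p: p.name,
    )
    with output.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8"))
    return output
```

Calling merge_pages with the folder that holds the page text files would write
combined.txt alongside them.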

Now I opened the combined text file in Word. Word complained that the file was in
a funny format, but I chose Unicode (UTF-8), and it all seemed to be ok. I kept
the file in this format throughout, but I'm sure a .doc file would have been fine
too.

You should look through your pages by eye, and try to spot obvious things that
can be fixed using "Find and Replace". I found that Tesseract misread a full stop
followed by a closing quotation mark

."

and would instead write a closing curly bracket }

So these are easy to find and replace.
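
The same replacement could be scripted. The Python sketch below assumes the book
contains no genuine curly brackets, so every } is an OCR error; the names
REPLACEMENTS and apply_replacements are hypothetical, and the list holds only the
one substitution described above.

```python
# Each entry is (text produced by the OCR, text it should have been).
REPLACEMENTS = [
    ("}", '."'),
]


def apply_replacements(text, replacements=REPLACEMENTS):
    """Apply each find-and-replace pair to text, in order."""
    for wrong, right in replacements:
        text = text.replace(wrong, right)
    return text


print(apply_replacements('"I am leaving}'))  # "I am leaving."
```

Further pairs can be appended to the list as other systematic misreadings turn
up, though anything ambiguous is safer left to manual checking.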

It had problems with the letter F, frequently confusing it with H. Unfortunately
the spellings generated by Tesseract are quite "inhuman", so Word's spellchecker
needs more help than usual, but once you start using "Correct All" it goes
quickly: 200 pages probably took an hour or two at most.

Capital I and lowercase L were often confused with the number 1, and it was often
necessary for me to go back to the original book to verify what the correct
letters should be, particularly where the vocabulary in the book was unfamiliar.
I used Windows' Search tool to search within the individual text file pages that
Tesseract had created, to tell me which page to look at in the original book.
Your results, I'm sure, will vary, as each different printing method will yield
particular eccentricities in Tesseract's conversion abilities.
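
Both of these checks could be scripted as well. The Python sketch below is an
illustration under assumptions: the per-page text files are UTF-8 and sit in one
folder, and the function names are made up. suspect_words lists words that mix
letters with the digit 1 (a likely sign of a misread I or l), and find_pages
reports which page files contain a given string, so you know where to look in the
paper book.

```python
import re
from pathlib import Path

# A word containing at least one digit 1 and at least one letter.
SUSPECT = re.compile(r"\b(?=\w*1)(?=\w*[A-Za-z])\w+\b")


def suspect_words(text):
    """Return words that mix letters with the digit 1, in order of appearance."""
    return SUSPECT.findall(text)


def find_pages(folder, needle):
    """Return the names of the .txt page files in folder that contain needle."""
    hits = []
    for page in sorted(Path(folder).glob("*.txt"), key=lambda p: p.name):
        if needle in page.read_text(encoding="utf-8", errors="replace"):
            hits.append(page.name)
    return hits


print(suspect_words("He wi1l go in 1984, he11o"))  # ['wi1l', 'he11o']
```

Because the file names carry the page numbers, a hit such as 004_5_left.txt
points straight at page 4 of the original.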

Then I uploaded the resulting document to scribd.com.

There are loads of page breaks where there shouldn't be for onscreen reading, but
I think the overall result is quite readable.
