How To Scan Books To Text

Notes on book scanning.
These notes explain my experience of trying to convert an out-of-print book from
a paper edition to a digital file.
I used a normal flatbed scanner, an old version of Photoshop, and Google's free
character recognition software called Tesseract. I also used an application
called TXTcollector to merge the many text files that Tesseract creates, Microsoft
Word for the spell checking and editing, and Microsoft Excel to help me make some
batch files.
Having run a few tests, I describe below the best way I could find to
get accurate character recognition from Tesseract. The software is not perfect -
it seemed to have real problems with the letter "F" - but it certainly works well
enough when combined with a spell-check. If you are planning to scan a large
book, then run some tests of your own first to make sure your process is going to
work. The most vital thing to ensure is that your scans are of good quality - if
you can see dots, speckles, and shadows on your scanned pages when you look at
them in Photoshop, Tesseract will interpret these as incorrect characters, and you
will get garbage out.
The most time-consuming part of the whole process is the scanning itself. High
volume book scanning is usually done using a high resolution digital camera
mounted above a board where the book rests. But I was using a traditional flatbed
scanner.
Scanning a small hardback book of about 230 pages took the best part of a day.
I chose to scan pairs of facing pages at a time, and used Photoshop to import the
scans, usually doing about 10 pairs of pages at a time, then saving the scans as
uncompressed greyscale TIF files. The scanner software had an option for OCR
(optical character recognition), so I went with these settings, which were set to
greyscale 300dpi.
I set the cropped area up in the scanner software to try to crop everything except
the white part of the page - I kept a large white margin around the edges of the
text area, to make things easier later.
The software also had something called "gutter shadow reduction", which lessens
the dark line between the pairs of pages due to the shape of the book's spine. I
switched this on, and it seemed to reduce the problem. I did no image correction
at this time, no rotation, nothing, just saved the files one by one.
I gave each TIFF file page number names, like "004_5.tif", "006_7.tif",
"008_9.tif" etc. The leading zeros at the front of the filename make it easier
to sort the files.
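
The naming idea can be sketched in Python. This is only an illustration: the
function name pair_name and the three-digit padding width are assumptions
based on the example filenames, not part of the original workflow, and only the
first page number is padded because that is what the examples show.

```python
def pair_name(first_page: int, width: int = 3) -> str:
    """Name a scan of two facing pages, e.g. pages 4 and 5 -> '004_5.tif'."""
    # Only the first page number is zero-padded; it is the part that
    # decides the sort order of the files.
    return f"{first_page:0{width}d}_{first_page + 1}.tif"


# Names created in a jumbled order still sort into page order as plain text.
names = [pair_name(page) for page in (10, 4, 100, 8)]
print(sorted(names))
```

Without the leading zeros, "10_11.tif" would sort before "4_5.tif", because
file names are compared character by character.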
Once I had my rough pairs of pages scanned, I wanted to split them into separate
pages. It took me a while to figure out a good way to do this - in the end I made
a Photoshop action which opened one of the TIFF files, changed the Canvas Size to
50%, and chose to keep the bottom half of the canvas. It then rotated the image 90
degrees clockwise, and saved the file with a new name.
This process will only crop the left-hand pages from each of the page-pair files,
but we will do the right-hand pages in a minute.
To run the automation, I went to File --> Automate, and chose all of my page pairs
(which were in a separate folder), and chose a new folder as the destination for
all of my split pages. I made sure that all file names would be replaced, and
chose the new names to be in the form "004_5_left.tif". This is easier than
renumbering - usually you won't be scanning from page 1 of a book, and it keeps
everything consistent - which is vital later when it comes to spell checking.
The automated action took about 5 minutes to split all the files. I now had a
load of left-hand pages cropped from my scans. So now I made a similar Photoshop
Action to the one described above, but took the upper 50% of the image, and
rotated 90 degrees clockwise, then saved. I then automated this action, but
called the files in the form "004_5_right.tif", and had them save into the same
folder as the left-side files.
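
For anyone without Photoshop, the geometry of the split can be sketched in
Python. This is not the Photoshop action itself: half_boxes and split_names are
hypothetical helper names, and the sketch assumes, as in the scans described
here, that the left-hand page sits in the bottom half of the unrotated scan and
the right-hand page in the top half.

```python
def half_boxes(width: int, height: int):
    """Return (bottom_half, top_half) crop boxes for a scan of two pages.

    Each box is (left, upper, right, lower) in pixels, measured from the
    top-left corner of the image.
    """
    middle = height // 2
    top_half = (0, 0, width, middle)
    bottom_half = (0, middle, width, height)
    return bottom_half, top_half


def split_names(pair_filename: str):
    """Turn '004_5.tif' into ('004_5_left.tif', '004_5_right.tif')."""
    stem = pair_filename.rsplit(".", 1)[0]
    return f"{stem}_left.tif", f"{stem}_right.tif"


left_box, right_box = half_boxes(2000, 3000)
print(left_box, right_box)
print(split_names("004_5.tif"))
```

With an imaging library such as Pillow (an assumption, it is not used in these
notes), each box could be passed to Image.crop, followed by
rotate(-90, expand=True) for the 90 degree clockwise turn.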
Now the really tedious part. I backed up all my single sheets into another
folder, then went back through every single TIFF file, and recropped to remove as
much junk as possible. This is absolutely essential, and really improves the
spell checking later on. Any really wonky pages were rotated to be straight, any
big blotches were cleaned up with the clone-stamp tool, and any leftover gutter
shadows (of which there were plenty) were removed using the Image --> Adjust -->
Curves tool. By adding a point in the middle of the curve, and dragging it down,
you can force any light grey shades to become white, without harming the black of
the text. As I had to do these processes more than 200 times, I made a few
Actions to speed the process up - the decision-making still had to be done
manually, but the Curves settings, and the cropping and saving over could be
automated. It is also important to check over the scans with your eyes to make
sure that none of the text has been accidentally chopped off, in case you have
mis-scanned a page. I also cropped off the page numbers, any illustrations, and
anything else that would have caused confusion to the OCR software. The page
numbers aren't necessary, as your filenames reflect the page numbers.
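
The effect of the Curves adjustment (light greys pushed to white, black left
alone) can be approximated with a simple lookup table. This sketch is a plain
white-point stretch, not Photoshop's actual curve, and the name whiten_lookup
and the cut-off value of 200 are assumptions chosen for illustration.

```python
def whiten_lookup(white_point: int = 200) -> list:
    """Build a 256-entry table of grey levels (0 = black, 255 = white).

    Every level at or above white_point becomes pure white; darker levels
    are stretched linearly so that black stays black.
    """
    table = []
    for level in range(256):
        if level >= white_point:
            table.append(255)
        else:
            table.append(round(level * 255 / white_point))
    return table


table = whiten_lookup(200)
# A faint gutter shadow at level 210 becomes white; black text is untouched.
print(table[210], table[0])
```

A table like this could be applied to a greyscale image with an imaging
library (for example Pillow's Image.point, assuming Pillow is installed), but
the right cut-off still has to be judged by eye for each batch of scans.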
Once I had a set of pristine, white scans with black text, numbered and saved as
TIFF files, it was time to look at Tesseract.
Tesseract has to be run from the command line; there isn't a graphical
interface. But if you've used things like unzipping tools or ftp from the
command line, it isn't any more confusing.
I was on a Windows PC, so used the .EXE files that have been precompiled. I tried
the Mac version, but couldn't get it to build correctly (I think I may be missing
some X11 installs or something?). Anyway, the Windows version works fine. One part
of the install that requires delving into the documentation is that in order for
Tesseract to work correctly, you must download the Windows executable:
tesseract-2.01.exe.tar.gz Windows executables (vc++6) for Tesseract 2.01
Unzip it using 7-Zip, or some other unzipper, then you must also download:
tesseract-2.00.eng.tar.gz English language data for Tesseract (2.00 and up)
Unzip this, and you will find a folder called tessdata with a load of files in it.
These files are the training info that tells Tesseract how to read English
(assuming the book you are scanning is in English). There are several other
languages available.
The tessdata folder MUST be placed in the same folder as your tesseract.exe file,
otherwise you will get weird scary errors when you try to run the software.
To convert a TIFF file into text, the TIFF file must be uncompressed. Stick the
TIFF file in the same folder as tesseract.exe, open the command line (click on the
Start menu, click Run, then type "cmd"), and go to the folder where you saved
tesseract, e.g.:
cd \tesseract
If your TIFF file is called myfile.tif then you could type:
tesseract myfile.tif mytextfile
After a few seconds of what appears to be nothing happening, you should find a
text file called mytextfile.txt in your tesseract folder. If it hasn't worked,
there will be an error in a text file called tesseract.log
So now you need to do this 200 or so times, for each page that you have scanned.
I automated it by making a batch file in the same folder - i.e. a text file (called
bat.bat) along the lines of:
tesseract c:\tesseract\.pages_clean\004_5_left.tif 004_5_left
tesseract c:\tesseract\.pages_clean\004_5_right.tif 004_5_right
tesseract c:\tesseract\.pages_clean\006_7_left.tif 006_7_left
tesseract c:\tesseract\.pages_clean\006_7_right.tif 006_7_right
tesseract c:\tesseract\.pages_clean\008_9_left.tif 008_9_left
tesseract c:\tesseract\.pages_clean\008_9_right.tif 008_9_right
And so forth...
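
Typing 200 of these lines by hand is error-prone, so the batch file is worth
generating. Here is a minimal Python sketch of that step, as an alternative to
building the lines in a spreadsheet. The function name batch_lines is
hypothetical, and the folder is passed in as a parameter so that it can match
wherever your cleaned scans live.

```python
def batch_lines(page_files, scans_folder):
    """Build one 'tesseract <input.tif> <output name>' line per page file."""
    lines = []
    for name in sorted(page_files):
        stem = name.rsplit(".", 1)[0]  # '004_5_left.tif' -> '004_5_left'
        lines.append(f"tesseract {scans_folder}\\{stem}.tif {stem}")
    return lines


pages = ["006_7_left.tif", "004_5_right.tif", "004_5_left.tif"]
commands = batch_lines(pages, r"c:\tesseract\.pages_clean")
print("\n".join(commands))
```

The printed lines can be pasted into bat.bat, or written there directly with
open("bat.bat", "w").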
This creates one text file per page of your book. To merge all the pages into
one, I used TXTcollector - which is available from here:
http://bluefive.pair.com/txtcollector.htm
I'm sure I should have used something clever in Python, but this app was simple,
small and did the job. Make sure to tick the boxes "No Separator" and "No
filename", which prevent the file names from being added to your combined text
file.
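
For the curious, the "something clever in Python" could look roughly like the
sketch below. It is not what was used for this book, and merge_pages and the
file names in the demonstration are made up for illustration. It assumes the
page files are UTF-8 text and that their names sort into page order.

```python
import tempfile
from pathlib import Path


def merge_pages(pages_folder, output_file):
    """Append every .txt file in pages_folder to output_file, in name order."""
    output_path = Path(output_file)
    # List the pages first, skipping the combined file if it already exists.
    pages = sorted(
        page for page in Path(pages_folder).glob("*.txt")
        if page.resolve() != output_path.resolve()
    )
    with open(output_path, "w", encoding="utf-8") as combined:
        for page in pages:
            # No separator and no file name are written between pages.
            combined.write(page.read_text(encoding="utf-8"))
    return [page.name for page in pages]


# Small demonstration using throwaway files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "004_5_right.txt").write_text("page five\n", encoding="utf-8")
    Path(folder, "004_5_left.txt").write_text("page four\n", encoding="utf-8")
    book = Path(folder, "book.txt")
    merged_order = merge_pages(folder, book)
    merged_text = book.read_text(encoding="utf-8")

print(merged_order)
```

Because the files are joined in name order, this only works if the page files
keep the zero-padded naming used for the scans.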
Now I opened the combined text file in Word. Word complained that the file was in
a funny format, but I chose Unicode (UTF-8), and it all seemed to be ok. I kept it
in this format throughout, but I'm sure a .doc file would have been fine too.
You should look through your pages by eye, and try to spot obvious things that
can be fixed using "Find and Replace". I found that Tesseract misread a full stop
followed by a closing quotation mark
."
and would instead write a closing curly bracket }
So these are easy to find and replace.
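
The same kind of blanket replacement can be done outside Word. A minimal
Python sketch, where fix_known_misreads and the REPLACEMENTS table are
hypothetical names, and the only rule included is the curly bracket one
described above:

```python
def fix_known_misreads(text, replacements):
    """Apply each wrong -> right substitution to the whole text."""
    for wrong, right in replacements.items():
        text = text.replace(wrong, right)
    return text


# A closing curly bracket stands in for a full stop plus closing quote mark.
REPLACEMENTS = {"}": '."'}

sample = '"I am not sure} he said.'
fixed = fix_known_misreads(sample, REPLACEMENTS)
print(fixed)
```

This is only safe when the book itself contains no genuine curly brackets, so
check a few pages by eye before applying a rule to the whole text.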
It had problems with the letter F, frequently confusing it with H. Unfortunately
the spellings generated by Tesseract are quite "inhuman", so Word's spellchecker
needs more help than usual, but once you start using "Correct All", 200 pages
probably took an hour or two at most.
Capital I and lowercase l were often confused with the number 1, and it was often
necessary for me to go back to the original book to verify what the correct
letters should be, particularly if the vocabulary in the book is unfamiliar. I
used Windows' Search tool to search within the individual text file pages that
Tesseract had created, to tell me which page to look at in the original book.
Your results, I'm sure, will vary, as each
different printing method will yield particular eccentricities in Tesseract's
conversion abilities.
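
Finding which page a doubtful word came from can also be scripted. The sketch
below is an alternative to the Windows Search tool, not a description of it.
The name pages_containing and the demonstration files are made up, and it
assumes one UTF-8 text file per page, named after the page as described
earlier.

```python
import tempfile
from pathlib import Path


def pages_containing(word, pages_folder):
    """Return the names (without .txt) of the page files that contain word."""
    word = word.lower()
    hits = []
    for page in sorted(Path(pages_folder).glob("*.txt")):
        text = page.read_text(encoding="utf-8", errors="replace")
        if word in text.lower():
            hits.append(page.stem)
    return hits


# Small demonstration using throwaway files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "004_5_left.txt").write_text("the Hlagship sailed", encoding="utf-8")
    Path(folder, "004_5_right.txt").write_text("into harbour", encoding="utf-8")
    found = pages_containing("hlagship", folder)

print(found)
```

The file name then tells you which page of the paper book to open to check the
correct spelling.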
Then I uploaded the resulting document onto scribd.com.
There are loads of page breaks where there shouldn't be for onscreen reading, but
I think the overall result is quite readable.
