Notes on book scanning

These notes explain my experiences of trying to convert an out-of-print book from a paper edition to a digital file.

I used a normal flatbed scanner, an old version of Photoshop, and Google's free character recognition software called Tesseract. I also used an application called TXTcollector to merge the many text files that Tesseract creates, Microsoft Word for the spell checking and editing, and Microsoft Excel to help me make some batch files.

Having run a few tests, the process below is the best way I could find to get accurate character recognition from Tesseract. The software is not perfect - it seemed to have real problems with the letter "F" - but it certainly works well enough when combined with a spell-check. If you are planning to scan a large book, run some tests of your own first to make sure your process is going to work. The most vital thing is to ensure that your scans are of good quality - if you can see dots, speckles, and shadows on your scanned pages when you look at them in Photoshop, Tesseract will interpret these as incorrect characters, and you will get garbage out.

The most time-consuming part of the whole process is the scanning itself. High-volume book scanning is usually done using a high-resolution digital camera mounted above a board where the book rests. But I was using a traditional flatbed scanner.

Scanning a small hardback book of about 230 pages took the best part of a day. I chose to scan pairs of facing pages at a time, and used Photoshop to import the scans, usually doing about 10 pairs of pages at a time, then saving the scans as uncompressed greyscale TIFF files.
The scanner software had an option for OCR (optical character recognition), so I went with these settings, which were set to greyscale, 300 dpi.

I set up the crop area in the scanner software to cut away everything except the white part of the page - I kept a large white margin around the edges of the text area, to make things easier later.

The software also had something called "gutter shadow reduction", which lessens the dark line between the pairs of pages caused by the shape of the book's spine. I switched this on, and it seemed to reduce the problem. I did no image correction at this time, no rotation, nothing, just saved the files one by one.

I named each TIFF file after its page numbers, like "004_5.tif", "006_7.tif", "008_9.tif" etc. The leading zeros at the front of the filename make it easier to sort the files.

Once I had my rough pairs of pages scanned, I wanted to split them into separate pages. It took me a while to figure out a good way to do this - in the end I made a Photoshop action which opened one of the TIFF files, changed the Canvas Size to 50%, and chose to keep the bottom half of the canvas. It then rotated the image 90 degrees clockwise, and saved the file with a new name.

This process will only crop the left-hand pages from each of the page-pair files, but we will do the right-hand pages in a minute.

To run the automation, I went to File --> Automate, chose all of my page pairs (which were in a separate folder), and chose a new folder as the destination for all of my split pages. I made sure that all file names would be replaced, and chose the new names to be in the form "004_5_left.tif". This is easier than renumbering - usually you won't be scanning from page 1 of a book, and it keeps everything consistent - which is vital later when it comes to spell checking.

The automated action took about 5 minutes to split all the files. I now had a load of left-hand pages cropped from my scans. So now I made a similar Photoshop action to the one described above, but one that took the upper 50% of the image, rotated it 90 degrees clockwise, then saved. I then automated this action, but named the files in the form "004_5_right.tif", and had them saved into the same folder as the left-side files.

Now the really tedious part. I backed up all my single sheets into another folder, then went back through every single TIFF file, and recropped to remove as much junk as possible. This is absolutely essential, and really improves the spell checking later on. Any really wonky pages were rotated to be straight, any big blotches were cleaned up with the clone-stamp tool, and any leftover gutter shadows (of which there were plenty) were removed using the Image --> Adjust --> Curves tool. By adding a point in the middle of the curve, and dragging it down, you can force any light grey shades to become white, without harming the black of the text. As I had to do these processes more than 200 times, I made a few actions to speed the process up - the decision-making still had to be done manually, but the Curves settings, and the cropping and saving over, could be automated. It is also important to check over the scans with your eyes to make sure that none of the text has been accidentally chopped off, in case you have mis-scanned a page. I also cropped off the page numbers, any illustrations, and anything else that would have caused confusion to the OCR software.
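
Since every later step relies on the "_left" and "_right" file names, it may be worth confirming that the two split actions produced both halves of every page pair. The following is only a minimal sketch in Python, assuming the naming scheme described above; the function name missing_split_pages is hypothetical and not part of any tool mentioned in these notes.

```python
import re

# Page-pair scans are assumed to be named like "004_5.tif" (an assumption
# taken from the naming scheme in these notes).
PAIR_NAME = re.compile(r"^(\d+_\d+)\.tif$")


def missing_split_pages(pair_names, split_names):
    """Return the expected *_left.tif / *_right.tif names that are absent.

    pair_names  -- file names of the page-pair scans
    split_names -- file names found in the folder of split pages
    """
    present = set(split_names)
    missing = []
    for name in sorted(pair_names):
        match = PAIR_NAME.match(name)
        if not match:
            continue  # skip anything that is not a page-pair scan
        for side in ("left", "right"):
            expected = f"{match.group(1)}_{side}.tif"
            if expected not in present:
                missing.append(expected)
    return missing
```

The two lists could come from os.listdir() on the page-pair folder and the split-page folder; an empty result would mean every pair has both halves.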
The page numbers aren't necessary, as your filenames reflect the page numbers.

Once I had a set of pristine, white scans with black text, numbered and saved as TIFF files, it was time to look at Tesseract.

Tesseract has to be run from the command line; there isn't a graphical interface. But if you've used things like unzipping tools or ftp from the command line, it isn't any more confusing.

I was on a Windows PC, so I used the precompiled .EXE files. I tried the Mac version, but couldn't get it to build correctly (I think I may be missing some X11 installs or something?). Anyway, the Windows version works fine. One part of the install that requires delving into the documentation is that, in order for Tesseract to work correctly, you must download the Windows executable:

    tesseract-2.01.exe.tar.gz    Windows executables (vc++6) for Tesseract 2.01

Unzip it using 7-Zip, or some other unzipper. Then you must also download:

    tesseract-2.00.eng.tar.gz    English language data for Tesseract (2.00 and up)

Unzip this, and you will find a folder called tessdata with a load of files in it. These files are the training info that tells Tesseract how to read English (assuming the book you are scanning is in English). There are several other languages available.

The tessdata folder MUST be placed in the same folder as your tesseract.exe file, otherwise you will get weird scary errors when you try to run the software.

To convert a TIFF file into text, the TIFF file must be uncompressed. Stick the TIFF file in the same folder as tesseract.exe, open the command line (click on the Start menu, click Run, then type "cmd"), and go to the folder where you saved Tesseract, e.g.:

    cd \tesseract

If your TIFF file is called myfile.tif then you could type:

    tesseract myfile.tif mytextfile

After a few seconds of what appears to be nothing happening, you should find a text file called mytextfile.txt in your tesseract folder. If it hasn't worked, there will be an error in a text file called tesseract.log

So now you need to do this 200 or so times, once for each page that you have scanned. I automated it by making a batch file in the same folder - i.e. a text file (called bat.bat) along the lines of:

    tesseract c:\tesseract\1pages_clean\004_5_left.tif 004_5_left
    tesseract c:\tesseract\1pages_clean\004_5_right.tif 004_5_right
    tesseract c:\tesseract\1pages_clean\006_7_left.tif 006_7_left
    tesseract c:\tesseract\1pages_clean\006_7_right.tif 006_7_right
    tesseract c:\tesseract\1pages_clean\008_9_left.tif 008_9_left
    tesseract c:\tesseract\1pages_clean\008_9_right.tif 008_9_right

And so forth.

This creates one text file per page of your book. To merge all the pages into one, I used TXTcollector, which is available from here:

    http://bluefive.pair.com/txtcollector.htm

I'm sure I should have used something clever in Python, but this app was simple, small, and did the job. Make sure to tick the boxes "No Separator" and "No filename", which prevent the file names from being added to your combined text file.

Now I opened the combined text file in Word. Word complained that the file was in a funny format, but I chose Unicode (UTF-8), and it all seemed to be OK. I kept it in this format throughout, but I'm sure a .doc file would have been fine too.

You should look through your pages by eye, and try to spot obvious things that can be fixed using "Find and Replace". I found that Tesseract misread a full stop followed by a closing quotation mark:

    ."

and would instead write a closing curly bracket:

    }

So these are easy to find and replace.

It had problems with the letter F, frequently confusing it with H.
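
The batch file and the merging step described above could also be scripted rather than built by hand. What follows is only a sketch in Python, not the method I actually used: the function names tesseract_commands and merge_pages are hypothetical, the folder path is the one from the batch file example, and the page text files are assumed to be UTF-8.

```python
from pathlib import Path, PureWindowsPath


def tesseract_commands(tif_names, scan_dir=r"c:\tesseract\1pages_clean"):
    """Return one 'tesseract <input.tif> <output base>' line per TIFF name.

    scan_dir defaults to the folder used in the batch file example; it is
    an assumption and should be changed to wherever the clean scans live.
    """
    lines = []
    # Leading zeros in the names mean a plain sort gives page order,
    # with each "_left" page ahead of its "_right" partner.
    for name in sorted(tif_names):
        tif = PureWindowsPath(scan_dir) / name
        # Tesseract adds the .txt extension to the output base name itself.
        lines.append(f"tesseract {tif} {tif.stem}")
    return lines


def merge_pages(txt_dir, out_path):
    """Concatenate every .txt file in txt_dir into out_path, in name order.

    Nothing is written between pages, which mirrors ticking "No Separator"
    and "No filename". Returns the page file names in the order merged.
    """
    pages = sorted(Path(txt_dir).glob("*.txt"))
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            # UTF-8 is assumed here; adjust if your text files differ.
            out.write(page.read_text(encoding="utf-8"))
    return [page.name for page in pages]
```

Writing the returned command lines to bat.bat, one per line, would give a batch file like the one shown earlier. The combined file is best written outside the folder of page files, so that a second run does not merge it into itself.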
Unfortunately, the spellings generated by Tesseract are quite "inhuman", so Word's spellchecker needs more help than usual, but once I started using "Correct All", 200 pages probably took an hour or two at most.

Capital I and lowercase L were often confused with the number 1, and it was often necessary for me to go back to the original book to verify what the correct letters should be, particularly if the vocabulary in the book is unfamiliar. I used the Windows Search tool to search within the individual text file pages that


vperetokin5761 left a comment

Just a note, Tesseract isn't Google's, and Google has no involvement in it besides it being hosted on Google Code. See http://code.google.com/p/tesseract-oc... for the project's history.