These notes explain my experiences of trying to convert an out-of-print book from
a paper edition to a digital file. I used a normal flatbed scanner, an old version of Photoshop, and Google's free character recognition software called Tesseract. I also used an application called TXTcollector to merge the many text files that Tesseract creates, Microsoft Word for the spell checking and editing, and Microsoft Excel to help me make some batch files.

Having run a few tests, I describe below the best way I could find to get accurate character recognition from Tesseract. The software is not perfect - it seemed to have real problems with the letter "F" - but it certainly works well enough when combined with a spell-check. If you are planning to scan a large book, then run some tests of your own first to make sure your process is going to work. The most vital thing to ensure is that your scans are of good quality - if you can see dots, speckles, and shadows on your scanned pages when you look at them in Photoshop, Tesseract will interpret these as incorrect characters, and you will get garbage out.

The most time consuming part of the whole process is the scanning itself. High volume book scanning is usually done using a high resolution digital camera mounted above a board where the book rests, but I was using a traditional flatbed scanner. Scanning a small hardback book of about 230 pages took the best part of a day. I chose to scan pairs of facing pages at a time, and used Photoshop to import the scans, usually doing about 10 pairs of pages at a time, then saving the scans as uncompressed greyscale TIFF files. The scanner software had an option for OCR (optical character recognition), so I went with these settings, which were set to greyscale 300dpi. I set the cropped area up in the scanner software to try to crop everything except the white part of the page - I kept a large white margin around the edges of the text area, to make things easier later.
The software also had something called "gutter shadow reduction", which lessens the dark line between the pairs of pages due to the shape of the book's spine. I switched this on, and it seemed to reduce the problem. I did no image correction at this time, no rotation, nothing, just saved the files one by one. I gave each TIFF file page number names, like "004_5.tif", "006_7.tif", "008_9.tif" etc. The leading zeros at the front of the filename make it easier to sort the files.

Once I had my rough pairs of pages scanned, I wanted to split them into separate pages. It took me a while to figure out a good way to do this - in the end I made a Photoshop action which opened one of the TIFF files, changed the Canvas Size to 50%, and chose to keep the bottom half of the canvas. It then rotated the image 90 degrees clockwise, and saved the file with a new name. This process will only crop the left-hand pages from each of the page-pair files, but we will do the right-hand pages in a minute.

To run the automation, I went to File --> Automate, chose all of my page pairs (which were in a separate folder), and chose a new folder as the destination for all of my split pages. I made sure that all file names would be replaced, and chose the new names to be in the form "004_5_left.tif". This is easier than renumbering - usually you won't be scanning from page 1 of a book, and it keeps everything consistent - which is vital later when it comes to spell checking. The automated action took about 5 minutes to split all the files.

I now had a load of left-hand pages cropped from my scans. So now I made a similar Photoshop Action to the one described above, but took the upper 50% of the image, rotated 90 degrees clockwise, then saved. I then automated this action, but called the files in the form "004_5_right.tif", and had them save into the same folder as the left-side files. That left the really tedious part: cleaning up every single page.
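As an aside on the splitting step above: I did it with Photoshop Actions, but the naming scheme and the two half-page crops can be sketched in a few lines of Python. This is only an illustration under assumptions, not what I actually ran. The function names (pair_name, split_plan) are made up for the example, the crop boxes assume the scan is exactly two pages stacked top and bottom as in my scans, and the pattern for two-digit page numbers (e.g. "010_11.tif") is my guess at how the scheme continues.

```python
def pair_name(left_page):
    """Build a page-pair filename such as '004_5.tif'.

    The first page number is zero-padded to three digits so that the
    files sort in book order (assumed padding width).
    """
    return f"{left_page:03d}_{left_page + 1}.tif"


def split_plan(pair_filename, width, height):
    """Return [(output_name, crop_box), ...] for one page-pair scan.

    crop_box is (left, upper, right, lower) in pixels. Following the
    steps described above, the bottom half of the canvas is the
    left-hand page and the top half is the right-hand page. Each crop
    would still need rotating 90 degrees clockwise afterwards.
    """
    stem = pair_filename.rsplit(".", 1)[0]
    half = height // 2
    return [
        (f"{stem}_left.tif", (0, half, width, height)),  # bottom half
        (f"{stem}_right.tif", (0, 0, width, half)),      # top half
    ]
```

The boxes use the (left, upper, right, lower) convention that many imaging libraries accept for cropping, so the plan could be handed to whichever image tool you prefer.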
I backed up all my single sheets into another folder, then went back through every single TIFF file, and recropped to remove as much junk as possible. This is absolutely essential, and really improves the spell checking later on. Any really wonky pages were rotated to be straight, any big blotches were cleaned up with the clone-stamper tool, and any left over gutter shadows (of which there were plenty) were removed using the Image --> Adjust --> Curves tool. By adding a point in the middle of the curve, and dragging it down, you can force any light grey shades to become white, without harming the black of the text. As I had to do these processes more than 200 times, I made a few Actions to speed the process up - the decision-making still had to be done manually, but the Curves settings, and the cropping and saving over could be automated.

It is also important to check over the scans with your eyes to make sure that none of the text has been accidentally chopped off, in case you have mis-scanned a page. I also cropped off the page numbers, any illustrations, and anything else that would have caused confusion to the OCR software. The page numbers aren't necessary, as your filenames reflect the page numbers.

Once I had a set of pristine, white scans with black text, numbered and saved as TIFF files, it was time to look at Tesseract. Tesseract has to be run from the command line, as there isn't a graphical interface, but if you've used things like unzipping tools or ftp from the command line, it isn't any more confusing. I was on a Windows PC, so I used the .EXE files that have been precompiled. I tried the Mac version, but couldn't get it to build correctly (I think I may be missing some X11 installs or something?). Anyway, the Windows version works fine.

One part of the install that requires delving into the documentation is that in order for Tesseract to work correctly, you must download the Windows executable:

    tesseract-2.01.exe.tar.gz    Windows executables (vc++6) for Tesseract 2.01
Unzip it using 7-Zip, or some other unzipper. You must also download:

    tesseract-2.00.eng.tar.gz    English language data for Tesseract (2.00 and up)

Unzip this, and you will find a folder called tessdata with a load of files in it. These files are the training info that tells Tesseract how to read English (assuming the book you are scanning is in English). There are several other languages available. The tessdata folder MUST be placed in the same folder as your tesseract.exe file, otherwise you will get weird scary errors when you try to run the software.

To convert a TIFF file into text, the TIFF file must be uncompressed. Stick the TIFF file in the same folder as tesseract.exe, open the command line (click on the Start menu, click Run, then type "cmd"), and go to the folder where you saved Tesseract, eg:

    cd \tesseract

If your TIFF file is called myfile.tif then you could type:

    tesseract myfile.tif mytextfile

After a few seconds of what appears to be nothing happening, you should find a text file called mytextfile.txt in your tesseract folder. If it hasn't worked, there will be an error in a text file called tesseract.log

So now you need to do this 200 or so times, once for each page that you have scanned. I automated it by making a batch file in the same folder - ie a text file (called bat.bat) along the lines of:

    tesseract c:\tesseract\1pages_clean\004_5_left.tif 004_5_left
    tesseract c:\tesseract\1pages_clean\004_5_right.tif 004_5_right
    tesseract c:\tesseract\1pages_clean\006_7_left.tif 006_7_left
    tesseract c:\tesseract\1pages_clean\006_7_right.tif 006_7_right
    tesseract c:\tesseract\1pages_clean\008_9_left.tif 008_9_left
    tesseract c:\tesseract\1pages_clean\008_9_right.tif 008_9_right

And so forth. This creates one text file per page of your book. To merge all the pages into one, I used TXTcollector, which is available from here: http://bluefive.pair.com/txtcollector.htm

I'm sure I should have used something clever in Python, but this app was simple, small and did the job.
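For anyone who would rather try the Python route, here is a rough sketch of the two chores described above: writing one tesseract line per cleaned TIFF, and stitching the per-page text files together in page order. It is not what I actually ran (I built my batch file with Excel and merged with TXTcollector). The function names build_batch_lines and merge_pages are invented for the example, and it assumes the zero-padded "004_5_left" style of filename so that a plain alphabetical sort matches book order.

```python
from pathlib import Path


def build_batch_lines(tif_dir):
    """Return one 'tesseract <input.tif> <output_base>' line per TIFF.

    Files are taken in alphabetical order, which matches page order
    when the names are zero-padded ('left' also sorts before 'right').
    """
    lines = []
    for tif in sorted(Path(tif_dir).glob("*.tif")):
        # Same command form as the hand-written batch file:
        # input image first, then the output name without extension.
        lines.append(f'tesseract "{tif}" {tif.stem}')
    return lines


def merge_pages(txt_dir, out_file):
    """Concatenate every .txt page in txt_dir into out_file.

    No separators or filenames are written between pages. Returns the
    page filenames in the order they were merged.
    """
    out_path = Path(out_file).resolve()
    # Skip the combined file itself if it already sits in txt_dir.
    pages = sorted(
        p for p in Path(txt_dir).glob("*.txt") if p.resolve() != out_path
    )
    with open(out_path, "w", encoding="utf-8") as out:
        for page in pages:
            out.write(page.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return [p.name for p in pages]
```

Writing the result of build_batch_lines to a .bat file, one line each, would give something close to the batch file shown above. The errors="replace" setting is just a precaution in case a page contains odd bytes.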
"ake sure to tick the boxes &@o *eparator& and &@o filename&, that prevent the file names from being added to your combined text file. @o I opened the combined text file in #ord. #ord complained that the file as in a funny format, but I chose Dnicode :, and it all seemed to be ok. I kept in this format throughout, but I'm sure a .doc file ould have been fine too. Gou should look through your pages by eye, and try and spot obvious things that can be fixed using &'ind and 1eplace&. I found that Tesseract missread a fullstop folloed by a closing (uotation mark .& and ould instead rite a closing curly bracket H *o these are easy to find and replace. It had problems ith the letter ', fre(uently confusing it ith %. Dnfortunately the spellings generated by Tesseract are (uite &inhuman&, so #ord's spellchecker needs more help than usual, but once you start using &0orrect ?ll&, +-- pages probably took an hour or to at most. 0apital I and loercase I ere often confused ith number ., and it as often necessary for me to go back to the original book, to verify hat the correct letters should be, particularly if the vocabulary in the book is unfamiliar. I used #indo's *earch tool to search ithin the individual text file pages that Tesseract had created, to tell me hich page to look at in the original book, to verify hat some letters should be. Gour results I'm sure ill very, as each different printing method ill yield particular eccentricities in Tesseract's conversion abilities. Then I uploaded the resulting document onto scribd.com There are loads of page breaks here there shouldn't be for onscreen reading, but I think the overall result is (uite readable.