all of my split pages. I made sure that all file names would be replaced, and chose the new names to be in the form "004_5_left.tif". This is easier than renumbering - usually you won't be scanning from page 1 of a book, and it keeps everything consistent - which is vital later when it comes to spell checking. The automated action took about 5 minutes to split all the files.

I now had a load of left-hand pages cropped from my scans. So now I made a similar Photoshop Action to the one described above, but took the upper 50% of the image, rotated it 90 degrees clockwise, then saved. I then automated this action, but called the files in the form "004_5_right.tif", and had them save into the same folder as the left-side files.

Now the really tedious part. I backed up all my single sheets into another folder, then went back through every single TIFF file and recropped to remove as much junk as possible. This is absolutely essential, and really improves the spell checking later on. Any really wonky pages were rotated to be straight, any big blotches were cleaned up with the clone-stamp tool, and any leftover gutter shadows (of which there were plenty) were removed using the Image --> Adjust --> Curves tool. By adding a point in the middle of the curve and dragging it down, you can force any light grey shades to become white without harming the black of the text.

As I had to do these processes more than 200 times, I made a few Actions to speed the process up - the decision-making still had to be done manually, but the Curves settings, the cropping and the saving over could be automated. It is also important to check over the scans with your eyes to make sure that none of the text has been accidentally chopped off, in case you have mis-scanned a page. I also cropped off the page numbers, any illustrations, and anything else that would have caused confusion to the OCR software.
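For anyone who would rather script the bookkeeping, here is a minimal Python sketch of the naming scheme used for the split pages ("004_5_left.tif" and "004_5_right.tif"). It is only an illustration, not part of the Photoshop workflow described here: the helper names `split_names` and `reading_order` are made up for this example, and the pattern assumes every file follows the "firstpage_secondpage_side.tif" form.

```python
import re

# Assumed pattern, inferred from the example name "004_5_left.tif":
# <first page>_<second page>_<left|right>.tif
NAME_RE = re.compile(r"^(\d+)_(\d+)_(left|right)\.tif$")


def split_names(scan_stem):
    """Return the (left, right) file names for a double-page scan stem like '004_5'."""
    return (f"{scan_stem}_left.tif", f"{scan_stem}_right.tif")


def reading_order(names):
    """Sort split-page file names by first page number, left page before right."""
    def key(name):
        match = NAME_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        # False sorts before True, so "left" comes before "right".
        return (int(match.group(1)), match.group(3) == "right")
    return sorted(names, key=key)


if __name__ == "__main__":
    names = []
    for stem in ["006_7", "004_5"]:
        names.extend(split_names(stem))
    for name in reading_order(names):
        print(name)
```

Run as a script, this prints the four file names in reading order, starting with 004_5_left.tif and ending with 006_7_right.tif. Keeping the page numbers in the file name is what makes this kind of ordering possible once the printed page numbers have been cropped off.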
The page numbers aren't necessary, as your filenames reflect the page numbers.

Once I had a set of pristine, white scans with black text, numbered and saved as TIFF files, it was time to look at Tesseract.

Tesseract has to be run from the command line; there isn't a graphical interface... But if you've used things like unzipping tools or ftp from the command line, it isn't any more confusing.

I was on a Windows PC, so I used the .EXE files that have been precompiled. I tried the Mac version, but couldn't get it to build correctly (I think I may be missing some X11 installs or something?). Anyway, the Windows version works fine. One part of the install that requires delving into the documentation is that, in order for Tesseract to work correctly, you must download the Windows executable:

tesseract-2.01.exe.tar.gz Windows executables (vc++6) for Tesseract 2.01

Unzip it using 7-Zip, or some other unzipper. Then you must also download:

tesseract-2.00.eng.tar.gz English language data for Tesseract (2.00 and up)

Unzip this, and you will find a folder called tessdata with a load of files in it. These files are the training info that tells Tesseract how to read English (assuming the book you are scanning is in English). There are several other languages available.

The tessdata folder MUST be placed in the same folder as your tesseract.exe file, otherwise you will get weird scary errors when you try to run the software.

To convert a TIFF file into text, the TIFF file must be uncompressed. Stick the TIFF file in the same folder as tesseract.exe, open the command line (click on the