The Code4Lib Journal – Using ImageMagick to ...

http://journal.code4lib.org/articles/5385

Issue 14, 2011-07-25

ISSN 1940-5758

Using ImageMagick to Automatically Increase Legibility of Scanned Text Documents
The Law Library Digitization Project of the Rutgers University School of Law in Camden, New Jersey, developed a Perl script to use the open-source module PerlMagick to automatically adjust the brightness levels of digitized images from scanned microfiche. This script can be adapted by novice Perl programmers to manipulate large numbers of text and image files using commands available in PerlMagick and ImageMagick.

By Doreva Belfiore

Project Background
Since 1997, the Law Library of the Rutgers University Camden Campus has engaged in a project to provide open access to digital legal materials from the State of New Jersey, other U.S. states, and the United States federal government. Some of the documents are harvested from publicly accessible open data websites, and some come from the digitization of paper and microfiche volumes.

Problem
For the digitization of state session laws and state constitutional documents, the library uses a Mekel M565-200 microfiche scanner to scan individual images from microfiche frames and generate JPEG image files of each document page. The Mekel filmSCAN scanning software allows for the manipulation of image density and brightness at the fiche level, but not at the frame level for individual page images. This creates a problem of potentially wide variations in brightness among images scanned from the same fiche.

The library holds a collection of state constitutional law documents on microfiche which were produced by the Greenwood Press. Greenwood Press selectively inserted watermarks on random images, presumably by superimposing a plastic or glass film sheet over the printed page prior to filming. These large watermarks (see Figure 1) often render the resulting microfiche image dark and, frequently, illegible. In digitizing the microfiche images, there is no way to control the brightness of the darkened watermarked files while maintaining the brightness and legibility of the non-watermarked ones.

Due to the scale of the State Constitutions Digitization Project (estimated 350,000 image files), there would be no cost-effective way to individually brighten each watermarked file by hand using photo manipulation software such as Photoshop or GIMP. Instead, we needed to find a way to automatically brighten and correct image density for large numbers of scanned text images.


Figure 1 – Example of a Greenwood Press watermark.

Solution
ImageMagick is a free image manipulation program that can be run on various platforms (Windows, Macintosh, UNIX) and includes a Perl module, PerlMagick, for use in scripting. ImageMagick, from the command line, can generate a large amount of image metadata, from EXIF values to size, filetype, brightness levels, histograms and more. It can also be scripted to perform a wide variety of image manipulations automatically. These features make it an excellent choice for high volume image editing.

After the microfiche are scanned, the resulting JPG files are stored in one master folder and retained in their original format until all processing and quality control is complete and the images are uploaded to the Rutgers Camden Law Library website for public access. Copies of each fiche set are made specifically for manipulation via Perl scripts that assign pagination, embed descriptive Dublin Core and EXIF metadata into each image, and take preservation checksums of each image file. Before pagination and metadata assignment, this copy set is used for any manual or automatic image correction procedures.

Testing ImageMagick alone on the command line, we found that the MODULATE command could increase the brightness of files. Once we determined that ImageMagick could manipulate individual files successfully to increase legibility, we created a script that would work efficiently on a large number of files at the same time.

Using Perl and PerlMagick, we started with a clean reference file that we chose on the basis of legibility, and used the brightness level of that file as the example from which to evaluate other files in the directory. The script can be adjusted to use a different reference file for each fiche set, allowing the user to customize the baseline standard to the needs of each particular set of images.

Each original image (frame) on the microfiche is scanned at 400 DPI and is approximately 3804 pixels wide by 3193 pixels long, although the specific dimensions may vary among fiche sets. Because Greenwood Press microfiche do not have a standardized image grid, the person scanning the microfiche has to use his or her best judgment in centering the fiche for the camera. As the camera scans each frame of the fiche sequentially, occasionally black edges occur around the document pages when the frames shift during the scan process. In order to prevent the black edges from skewing the brightness values of our reference image or the other images, we decided to select a sample block for testing from the center of the image. Through trial and error, we found that the ImageMagick SHAVE command could crop a specific number of pixels from all edges of the image (500 pixels from each side formed an ideal 'slice'), leaving us with a rectangular block in the center from which to evaluate each file. The "slice" or "snapshot" is loaded temporarily into memory for evaluation, leaving the original file unchanged.

In addition to testing the center portion of each image, we also set the image to monochrome in order to increase text contrast and limit the mathematical comparisons to a smaller number of values (0 to 255). This was a useful choice for our state constitution text files, but might not be appropriate for photographic or other image files.

For this image brightening project, we gather and count the number of files in a directory that the user specifies as a command-line argument. We then open the selected reference file clean.jpg and take the value of the mean of image levels (see SHAVE script below). As this file is monochrome, color values run from 0 to 255. For the purposes of our project, a precision level of 2 decimal points was sufficient to demonstrate the brightening effects that we sought.

IM_autocompare.pl script – SHAVE function

    #Instantiates an image object in ImageMagick.
    $clean = Image::Magick->new();
    $rg = $clean->Read('clean.jpg');
    #PerlMagick returns an empty status on success and an error string on failure.
    $rg=~ /(\d+)/;
    print $1; # print the error number
    print LOGFILE $1; # log the error number
    die "$rg" if "$rg"; # quit only after the error has been printed and logged

    #Changes grayscale to monochrome to limit to shades of black
    $rg = $clean->Set(Monochrome=>'True');

    #This command shaves 500 pixels off the top and sides so that it evaluates the middle
    #of the image and not the edges to get an accurate brightness level.
    $rg = $clean->Shave(geometry=>'500x500');
    die "$rg" if "$rg";
    $rg = $clean->Set(page=>'0x0+0+0');
    $rg=~ /(\d+)/;
    print $1; # print the error number
    print LOGFILE $1; # log the error number
    die "$rg" if "$rg";

    #Gathers statistics about the clean file
    @rstats = $clean->Statistics();
    $rmean = $rstats[3];
    print "The mean of the good file is: $rmean \n";
    print LOGFILE "Mean of clean file is : $rmean \n";

We then open the images to be processed using PerlMagick and read each into memory. The SHAVE command crops out a "snapshot" of each image. The STATISTICS command then reads the snapshot and calculates its mean color levels, which are stored in a variable.

Using a graduated percentage scale, the script compares the mean value of the image file against the mean value of the "clean.jpg" reference file. If the current file is darker than the reference file by a difference of 25 to 49 points, the script issues a MODULATE command of 250, or approximately 25% brightening (see Modulate command below). If the file is darker by a difference of 50 to 74 points, the script issues a stronger MODULATE command of 500, or approximately 50% brightening. If the difference between the reference file and the current file is 75 to 90 points, the subsequent MODULATE command is issued for 750 or approximately 75% brightening. For any difference greater than 90 points, the maximum MODULATE level of 999 is issued. Finally, if the current file is brighter than the reference file or within 25 points of the reference file value (i.e., a difference in means of less than 25), the script reports that the file is acceptable and takes no action. This split scale of brightness can be adjusted for the needs of the specific files to be corrected. In all cases the script reports to the user what action is being taken and this report can be logged for troubleshooting and documentation.

IM_autocompare.pl script – Modulate command

    #5 Adjust files for brightness if needed
    #
    # Performing level adjustments based on scale of difference between
    # the mean level of the current file and the mean level of the
    # reference file. Modulate command adjusts brightness from 1 - 999 (highest).

    if (($diff >= 25) && ($diff < 50))
    {
        print "Adjusting brightness by 25% of file $document now ...\n";
        print LOGFILE "Adjusting brightness by 25% of file $document now ...\n";

        #Instantiates new ImageMagick object.
        $fixer = Image::Magick->new();
        $ff = $fixer->Read("$input/$document");
        $ff=~ /(\d+)/;
        print $1; # print the error number
        print LOGFILE $1; # log the error number
        die "$ff" if "$ff";

        #Adjusts brightness by 25% and writes over the file.
        $ff = $fixer->Modulate(brightness=>'250');
        $ff = $fixer->Write("$input/$document");
        $ff=~ /(\d+)/;
        print $1; # print the error number
        print LOGFILE $1; # log the error number
        die "$ff" if "$ff";
    }
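The graduated scale described above can be sketched as a small lookup function. This is a minimal illustration, not the project's own code: the function name modulate_level_for is hypothetical, and the band boundaries are made contiguous here (an assumption of the sketch) so that every difference falls into exactly one band. Note also that ImageMagick's documentation describes the Modulate brightness argument as a percentage of the current value, with 100 leaving the image unchanged, so 250, 500, 750 and 999 are probably best read as the script's four strength settings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map the difference between the reference mean and a file's mean to a
# Modulate brightness value. Returns undef when no adjustment is needed.
# The function name and the contiguous boundaries are assumptions of
# this sketch, chosen to follow the bands given in the prose.
sub modulate_level_for {
    my ($diff) = @_;
    return     if $diff < 25;     # brighter than, or within 25 points of, the reference
    return 250 if $diff < 50;     # "25%" strength
    return 500 if $diff < 75;     # "50%" strength
    return 750 if $diff <= 90;    # "75%" strength
    return 999;                   # maximum strength
}

# The two example files from the article, against the reference mean of 222.43.
my $reference_mean = 222.43;
for my $file_mean (147.017, 170.566) {
    my $diff  = $reference_mean - $file_mean;
    my $level = modulate_level_for($diff);
    printf "mean %.3f, difference %.3f, Modulate brightness %s\n",
        $file_mean, $diff, defined $level ? $level : 'none';
}
```

Keeping the bands in one function means the "split scale" can be retuned for a different fiche set without touching the code that reads and writes the images.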

As all manipulations take place on the fly in system memory, the brightening process is efficient and takes approximately 1-2 seconds per brightened file. At the end of the brightening step, the original scanned file is overwritten with the corrected image. This overwrite step is completely optional, and can be changed by adjusting the PerlMagick Write command to write to a temporary file that can be reviewed, edited or replaced at a later time, as in:

    $x = $y->Write("$tempfile");

We maintain our original scans on a separate server as a backup in case of problems or errors. In addition, for error handling, the program is set to quit, print, and log any error messages generated by PerlMagick.

Below are before and after examples to demonstrate the results of this script:

Figure 2 – "Clean" reference file
The "clean" reference value or mean brightness of the center of this "clean.jpg" is 222.43.

Example #1:
Figure 3a – Original example file
The mean of the brightness value of the center of this image = 147.017 (Difference = 75.413)
Figure 3b – Example file brightened by 75 percent

Example #2:
Figure 4a – Original example file
The mean of the brightness value of the center of this image = 170.566 (Difference = 51.864)
Figure 4b – Example file brightened by 50 percent

Examples of documents that have been improved by this script can be seen in the Rutgers Law Library's Nebraska Constitutional Documents Online collection: http://lawlibrary.rutgers.edu/stateconst/neconst/index.shtml

Known issues and limitations
At the present time, this script has problems evaluating an image that is composed of only one side of a double-page spread. Taking a center "snapshot" by using the SHAVE command will necessarily include a partially black side and will skew the mean value of the image. Looping would brighten the printed page side to the point of washed-out illegibility. For this reason, we did not choose to loop this script to automatically repeat the correction process. Instead, we run the script sequentially, printing results and errors to a logfile and evaluating the output file for legibility after each pass.

As a future enhancement to this script, we plan to edit the main script to recognize a highly skewed value and fire off a subroutine that will test the image a second time. By taking two vertical slices as "snapshots", one on each side of the horizontal page, the script will be able to detect an empty page, which shows as a pure black frame. Once detected, a flag can be set for the file to be marked as a single page and exempted from further brightness evaluation.
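The two-slice check proposed for single-page frames might look something like the following. This is only a sketch under stated assumptions: the function names classify_frame and slice_geometries, the cut-off of 20 for a "pure black" slice, and the reuse of the 500-pixel margin are all hypothetical and would need tuning against real scans. The PerlMagick calls in the optional section (Read, Get, Clone, Crop, Statistics) are only exercised when a file path is supplied.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cut-off: a slice whose mean level falls below this value is
# treated as an empty (pure black) half of the frame.
my $BLACK_CUTOFF = 20;

# Classify a frame from the mean levels of its left and right slices.
sub classify_frame {
    my ($left_mean, $right_mean, $cutoff) = @_;
    $cutoff = $BLACK_CUTOFF unless defined $cutoff;
    my $left_black  = $left_mean  < $cutoff;
    my $right_black = $right_mean < $cutoff;
    return 'blank'        if $left_black && $right_black;
    return 'single-right' if $left_black;     # only the right-hand page is printed
    return 'single-left'  if $right_black;    # only the left-hand page is printed
    return 'double';
}

# Build WxH+X+Y geometry strings for one vertical slice in each half of the
# frame, each inset by $margin pixels from the edges of its half.
sub slice_geometries {
    my ($width, $height, $margin) = @_;
    my $half    = int($width / 2);
    my $slice_w = $half - 2 * $margin;
    my $slice_h = $height - 2 * $margin;
    die "image too small for a margin of $margin pixels\n"
        if $slice_w <= 0 || $slice_h <= 0;
    return (
        sprintf('%dx%d+%d+%d', $slice_w, $slice_h, $margin, $margin),
        sprintf('%dx%d+%d+%d', $slice_w, $slice_h, $half + $margin, $margin),
    );
}

# Optional: classify a real file when a path is given (requires PerlMagick).
if (@ARGV) {
    require Image::Magick;
    my $image  = Image::Magick->new();
    my $status = $image->Read($ARGV[0]);
    die "$status" if "$status";
    my ($width, $height) = $image->Get('width', 'height');
    my @means;
    for my $geometry (slice_geometries($width, $height, 500)) {
        my $slice = $image->Clone();
        $status = $slice->Crop(geometry => $geometry);
        die "$status" if "$status";
        my @stats = $slice->Statistics();
        push @means, $stats[3];    # same index the article's script reads as the mean
    }
    print classify_frame(@means), "\n";
}
```

A file classified as 'single-left' or 'single-right' could then be flagged and skipped, or evaluated on its printed half only, rather than being brightened on the strength of a mean skewed by the black half.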

Another limitation of the script as currently written is that the gradients for image brightening are deliberately broad. We sought to simplify the brightening process into 4 "strengths" at 25%, 50%, 75%, and 99%, respectively. For our purposes, the percentage scale of image brightening using our method was found to be sufficient to produce legible text from scanned Greenwood Press microfiche when viewed via a standard web browser. The script could be improved to utilize better mathematical algorithms to measure the image brightness more tightly, and run multiple passes and checks on the images as they reach a desired brightness level. This script can be enhanced to increase brightness, contrast, sharpness, or other image levels on a more fine-grained scale, depending upon the needs and goals of the user and programmer. This script could also be extended to take advantage of other file manipulation commands offered by PerlMagick.

Future projects and other ideas
As the primary goal of the Rutgers Camden Law Library Digitization Project is to digitize and make available as many public domain legal documents as possible, our overriding interest is the provision of public access. In order to achieve the scale of production needed to meet our goals, an automated processing method was absolutely necessary. At the same time, the code involved in this script can be understood and maintained by someone with a fairly modest level of technical programming knowledge, which also makes it an accessible tool for institutions that do not have dedicated programming staff. We have found it to be highly useful, and look forward to improving it over time and applying it to more digitization projects.

In a related project using ImageMagick, we are testing the combination of Perl scripts running PerlMagick with CGI scripts presenting images to a non-expert user for evaluation. A negative user response to a prompted question sends the presented image file into a subroutine for further testing and checking, while a positive user response proceeds to the next image. In the future, combining these types of testing methods with an enhanced version of this image brightening script from PerlMagick will allow for faster turnaround time in quality control for the State Constitutions Digitization Project.

This use case is an example of how free and openly available tools can be used to enable the creation of large scale digital library collections at an affordable cost per image that is accessible to libraries and archives with very modest digitization budgets.

Acknowledgements
The author wishes to thank John Joergensen of the Rutgers University Camden School of Law for his mentoring and guidance.

About the Author
Doreva Belfiore (dorevabelfiore@gmail.com) received her Masters of Library and Information Science in 2011 from the Drexel University iSchool in Philadelphia, PA. She works as a Digital Library and Circulation intern at the Law Library of the Rutgers University School of Law in Camden, New Jersey.

Appendix
Script: IM_autocompare.pl

    #!/usr/bin/perl
    #
    # IMautocompare.pl
    #
    # Comparing brightness of a clean reference file
    # to then adjust levels of all files in a given directory
    # using PerlMagick.
    #
    # Doreva Belfiore
    # Rutgers University School of Law - Camden, NJ
    # Law Library

    #1 Use ImageMagick perl module
    use Image::Magick;

    #2 Set variables here
    my($document, $average);
    my($rf, $ff);
    my($rg, $fixer, $clean);
    $gdir = ("./reference");
    chdir ("$gdir");
    open (LOGFILE, ">>IM_autocompare.log");

    #3 Load the information about what a "clean", "good" or at least
    #"repaired" file looks like.
    #Here we are using clean.jpg, a very clean document file as a reference image.
    print "Loading reference information now. Please wait. \n";

    #Instantiates an image object in ImageMagick.
    $clean = Image::Magick->new();
    $rg = $clean->Read('clean.jpg');
    #PerlMagick returns an empty status on success and an error string on failure.
    $rg=~ /(\d+)/;
    print $1; # print the error number
    print LOGFILE $1; # log the error number
    die "$rg" if "$rg"; # quit only after the error has been printed and logged

    #Changes grayscale to monochrome to limit to shades of black
    $rg = $clean->Set(Monochrome=>'True');

    #This command shaves 500 pixels off the top and sides so that it evaluates
    #the middle of the image and not the edges to get an accurate brightness level.
    $rg = $clean->Shave(geometry=>'500x500');
    die "$rg" if "$rg";
    $rg = $clean->Set(page=>'0x0+0+0');
    $rg=~ /(\d+)/;
    print $1; # print the error number
    print LOGFILE $1; # log the error number
    die "$rg" if "$rg";

    #Gathers statistics about the clean file
    @rstats = $clean->Statistics();
    $rmean = $rstats[3];
    print "The mean of the good file is: $rmean \n";
    print LOGFILE "Mean of clean file is : $rmean \n";

    #4 Open the user-specified folder and get information about the files
    $input = $ARGV[0];
    chomp ($input);
    #$rdir = ("./scratch/alconst");
    $rdir = ("./test");
    chdir ("$rdir");
    print "Processing folder $rdir \n";
    opendir(DIR, "$input") or die "Cannot open directory $input: $!";
    @files = grep /\.jpg$/i, readdir DIR;
    closedir DIR;
    @files = sort @files;
    $number = @files;

    #Checks files found in the directory
    print "Found $number files in the directory $rdir \n";

    foreach $document (@files)
    {
        #Instantiates an image object with ImageMagick.
        $average = Image::Magick->new();
        print "Reading $document...\n";
        $rf = $average->Read("$input/$document");
        $rf=~ /(\d+)/;
        print $1; # print the error number
        print LOGFILE $1; # log the error number
        die "$rf" if "$rf";

        print "Transforming image...\n";

        #Changes greyscale to monochrome to limit to shades of black
        $rf = $average->Set(Monochrome=>'True');

        #This command shaves 500 pixels off the top and sides so that it evaluates the
        #middle of the image and not the edges to get an accurate brightness level.
        $rf = $average->Shave(geometry=>'500x500');
        die "$rf" if "$rf";
        $rf = $average->Set(page=>'0x0+0+0');
        $rf=~ /(\d+)/;
        print $1; # print the error number
        print LOGFILE $1; # log the error number
        die "$rf" if "$rf";

        #Gathers statistics about the current file
        @avstats = $average->Statistics();
        $amean = $avstats[3];
        $diff = ($rmean - $amean);
        print "The difference between $document and the reference file is $diff \n";
        print LOGFILE "The difference between $document and the reference file is $diff \n";

        #5 Adjust files for brightness if needed
        #
        # Performing level adjustments based on scale of difference between
        # the mean level of the current file and the mean level of the
        # reference file. Modulate command adjusts brightness from 1 - 999 (highest).
        # The bands below are contiguous, so no difference falls between two of them.

        if (($diff >= 25) && ($diff < 50))
        {
            print "Adjusting brightness by 25% of file $document now ...\n";
            print LOGFILE "Adjusting brightness by 25% of file $document now ...\n";

            #Instantiates new ImageMagick object.
            $fixer = Image::Magick->new();
            $ff = $fixer->Read("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";

            #Adjusts brightness by 25% and writes over the file.
            $ff = $fixer->Modulate(brightness=>'250');
            $ff = $fixer->Write("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";
        }
        elsif (($diff >= 50) && ($diff < 75))
        {
            print "Adjusting brightness by 50% of file $document now ...\n";
            print LOGFILE "Adjusting brightness by 50% of file $document now ...\n";

            $fixer = Image::Magick->new();
            $ff = $fixer->Read("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";

            #Adjusts brightness by 50% and writes over the file.
            $ff = $fixer->Modulate(brightness=>'500');
            $ff = $fixer->Write("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";
        }
        elsif (($diff >= 75) && ($diff <= 90))
        {
            print "Adjusting brightness by 75% of file $document now ...\n";
            print LOGFILE "Adjusting brightness by 75% of file $document now ...\n";

            $fixer = Image::Magick->new();
            $ff = $fixer->Read("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";

            #Adjusts brightness by 75% and writes over the file.
            $ff = $fixer->Modulate(brightness=>'750');
            $ff = $fixer->Write("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";
        }
        elsif ($diff > 90)
        {
            print "Adjusting brightness by 99% of file $document now ...\n";
            print LOGFILE "Adjusting brightness by 99% of file $document now ...\n";

            $fixer = Image::Magick->new();
            $ff = $fixer->Read("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";

            #Adjusts brightness by 99% and writes over the file.
            $ff = $fixer->Modulate(brightness=>'999');
            $ff = $fixer->Write("$input/$document");
            $ff=~ /(\d+)/;
            print $1; # print the error number
            print LOGFILE $1; # log the error number
            die "$ff" if "$ff";
        }
        else
        {
            print "Image fine. Skipping to next file. \n";
            print LOGFILE "Image fine. Skipping to next file. \n";
        }

        #6 Undefine variables
        undef $average;
        undef $fixer;
        undef $rf;
        undef $ff;
    } #end foreach loop

    undef $clean;
    undef $rg;
    close LOGFILE;
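The script above repeats the same status-handling lines after each PerlMagick call. One way to factor that out is a small helper along the following lines. This is a hedged sketch, not part of the original script: the names status_code and check_status and the wording of the log line are hypothetical. It relies only on the behaviour the script already depends on, namely that a PerlMagick method returns an empty status on success and a string with an embedded number on failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the numeric code embedded in a PerlMagick status string,
# or return undef when there is none.
sub status_code {
    my ($status) = @_;
    return unless defined $status && "$status" =~ /(\d+)/;
    return $1;
}

# Print and log a failed status, then quit. Returns 1 when the status is
# empty (success). $log_fh is an already opened filehandle, or undef to
# skip logging; $context is a short label for the step that was attempted.
sub check_status {
    my ($status, $log_fh, $context) = @_;
    return 1 unless defined $status && length "$status";
    my $code = status_code($status);
    my $line = sprintf "%s failed (code %s): %s\n",
        $context, defined $code ? $code : 'unknown', $status;
    print {$log_fh} $line if $log_fh;
    die $line;
}
```

With such a helper, each call in the script could be followed by a single line, for example check_status($rg, \*LOGFILE, 'Read clean.jpg'), which keeps the print, log and quit order in one place.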

This work is licensed under a Creative Commons Attribution 3.0 United States License.