
Progress in Camera-Based Document Image Analysis

David Doermann, Jian Liang, and Huiping Li Language and Media Processing Laboratory, Institute for Advanced Computer Studies University of Maryland College Park {doermann, lj, huiping}
Document image processing and understanding has been extensively studied over the past forty years. Work in the field covers many different areas including preprocessing, physical and logical layout analysis, optical and intelligent character recognition (OCR/ICR), graphics analysis, form processing, signature verification and writer identification, and has been applied in numerous domains, including office automation, forensics, and digital libraries. Some surveys include [6], [34], [37] and [44]. Traditionally, document images are scanned from pseudo binary hardcopy paper manuscripts with a flatbed, sheet-fed or mounted imaging device. Recently, however, we have seen an increased interest in adapting digital cameras to tasks related to document image analysis. Digital camcorders, digital cameras, PC-cams, and even cell phone cameras are becoming increasingly popular and they have shown their potential as an alternative imaging device. Although they can not replace scanners, they are small, light, easily integrated with various networks, and are more suitable for many document capturing tasks in less constrained environments. These advantages are leading to a natural extension of the document processing community where cameras are used to image hardcopy documents, or natural scenes containing textual content. The industry has sensed this direction, and is shifting some of the scanner-based OCR applications onto new platforms. For example, XEROX has a Desktop PC-cam OCR suite [35], based on their CamWorks project, aimed at replacing scanners with PC-cams in light workload environments. The DigitalDesk project turns the desktop into a digital working area through the use of cameras and projectors [46]. Other applications are being enabled as well, such as intelligent digital cameras to recognize and translate signs written in foreign languages ([1], [50] and [51]). 
Currently most research is focused on processing single images of text, but as technology advances, reading text from video will become an attainable goal. In this paper we provide a brief survey of recent work on camera-based document processing and analysis, including text detection, extraction, enhancement, recognition, and applications. Although the general trend is toward the ability to image hardcopy documents with a camera, we include literature and applications on scene

Abstract The increasing availability of high performance, low priced, portable digital imaging devices has created a tremendous opportunity for supplementing traditional scanning for document image acquisition. Digital cameras attached to cellular phones, PDAs, or as standalone still or video devices are highly mobile and easy to use; they can capture images of any kind of document including very thick books, historical pages too fragile to touch, and text in scenes; and they are much more versatile than desktop scanners. Should robust solutions to the analysis of documents captured with such devices become available, there is clearly a demand from many domains. Traditional scanner-based document analysis techniques provide us with a good reference and starting point, but they cannot be used directly on camera-captured images. Camera captured images can suffer from low resolution, blur, and perspective distortion, as well as complex layout and interaction of the content and background. In this paper we present a survey of application domains, technical challenges and solutions for recognizing documents captured by digital cameras. We begin by describing typical imaging devices and the imaging process. We discuss document analysis from a single camera-captured image as well as multiple frames and highlight some sample applications under development and feasible ideas for future development.


Traditional document image analysis has carved a niche out of the more general problem of computer vision because of its pseudo binary nature and the regularity of the patterns used in this visual language. In the early 1960s, optical character recognition was taken as one of the first clear applications of pattern recognition, and today, for some simple tasks with clean and well formed data, document analysis is viewed as a solved problem. Unfortunately, these simple tasks do not represent the most common needs of the users of document image analysis. The challenges of complex content and layout, noisy data and variations in font and style presentation keep the field active.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 2003 IEEE

and graphic text analysis in images and video, to illustrate the diversity of challenges.



2.1 Text and Documents

Work on the general problem of camera based text and document analysis can be categorized in a number of ways: by the type of device used, by the application, or simply by the type of text processed. A majority of the work on camera captured data has been done in the area of processing image and video text, from broadcast video or from still images, rather than on processing images of structured documents. Each problem has its unique challenges but all are directed toward the ultimate goal of providing cameras with text reading capabilities. The processing of image and video text is a specific application which seeks to recognize text appearing as part of, or embedded in, visual content. Text detection in video key frames is a topic that has received a great deal of attention as it provides a supplemental way of indexing the video. The community typically distinguishes between graphic text, which is superimposed on the image (such as subtitles, sports scores or movie credits) and scene text (such as on signs, buildings, vehicles, name tags, or even tee-shirts). Clearly the general goal is to provide the capability to capture information intended for visual human communication and use it for various navigation and indexing tasks. Reading image text has also been widely addressed in the WWW community, because of the desire to index text which appears in graphical images of many web pages [57]. The problem presents many of the same challenges as reading text in general scene images, but is primarily constrained to text which has been graphically overlaid on the image. The problem of imaging and processing structured documents, on the other hand, aims to replace the scanner with a more flexible camera device in appropriate situations. This problem has been less widely addressed.

Current consumer grade digital cameras are pushing the envelope to 6 mega-pixels, and resolutions of up to 3000x2000. Under ideal imaging conditions, this should be sufficient for capturing standard documents. Although most of the devices are still in the 2-3 mega-pixel range, it is predictable that in the next several years, devices will be affordable. For some applications, mass capture via video is appropriate. Current digital video cameras typically have much lower resolution (680x420) because they are designed primarily for low bandwidth presentation and are often highly compressed. The fact that they are not designed specifically for document image capture presents many interesting challenges. Ultimately, we hope to be able to perform various document analysis tasks directly on the device. Recently we have seen consumer and business applications which are intended to eventually run on PDAs with cameras or even on cellular phones. Many companies currently market compact-flash cameras that can be attached to pocket or tablet PCs for document capture. Nokia and other telecom companies have recently released camera phones that capture up to 640x480, and although not yet sufficient for capturing full documents, the resolution has been shown to be sufficient for scene text.

2.2.2 Advantages

Document analysis using cameras has a number of advantages over scanner-based input. Cameras are small, easy to carry and easy to use. They can be used in any environment, and can be used to image documents which are difficult to scan such as newspapers and books, or text which does not typically originate as hardcopy, such as text on buildings, vehicles or other objects moving in a scene. In general camera based systems are more flexible. A user study conducted by Newman [35] shows that desktop OCR using PC-cams is more productive than a scanner-based OCR for extracting text paragraphs from newspapers.
Fisher [8] investigated the possibility of replacing sheet-fed scanners used by soldiers in the battlefield, with digital cameras. They found that sheetfed scanners could not be used to capture thick books, for example, and they were bulky and difficult to maintain. After experimenting, they came to the conclusion that digital cameras were capable of capturing a whole A4 size document page at an equivalent 200 dpi resolution needed by OCR.

2.2.3 Challenges

State-of-the-art document analysis software can produce very good results from clean documents. However, the techniques assume high resolution, high quality document images with fairly simple structure (black text on a white background). Unfortunately, these



Cameras are not new to the world of document imaging. High resolution cameras are often used to image documents of bound volumes or to image fragile documents that can not or should not be handled. Typically, these cameras are high end devices and are set up in ideal circumstances with ideal lighting, and the document lying flat (possibly under glass) and orthogonal to the optical axis. The flexibility that is now becoming available with mobile imaging devices, introduces a new level of processing that must be addressed. High quality devices are being replaced by devices with focal lengths and physical designs meant to image scenes.


assumptions are not typically valid for camera based systems. The major challenges include:

Low resolution - Images taken with cameras usually have a low resolution. While most OCR engines are tuned to resolutions between 150 and 400 dpi, the same text in a video frame may be at or below 50 dpi, making even simple tasks such as segmentation difficult.

Uneven lighting - A camera has far less control of lighting conditions on the object than scanners do. Uneven lighting is common, due to both the physical environment (shadows, reflection, fluorescents) as well as uneven response from the devices. Further complications occur when trying to use artificial lights or to image reflective content. As Fisher [8] found, if on-camera flash is used, the center of the view is the brightest, and then lighting decays outward.

Perspective distortion - Perspective distortion can occur when the text plane is not parallel to the imaging plane. The effect is that characters farther away look smaller, are distorted, and parallel line assumptions no longer hold in the image. A view angle larger than 40 degrees can produce unacceptable distortion for character segmentation and recognition. For flatbed scanners, documents are perfectly aligned with the scanning surface, so translation and rotation are the primary challenges. However, this is rarely the case with nonfixed cameras.

Wide-angle lens distortion - As an imaged object gets closer to the image plane, lighting, focus and layout distortions often occur on the periphery. Since many focus-free and digital cameras come with wide-angle lenses, distortion can be a problem for close-in text, although it can be modeled as a polynomial radial distortion function.

Complex backgrounds - Often more of the scene is imaged than simply the intended text or document. The lack of a uniform background (even as simple as the background on a sign) can make segmentation especially difficult.

Zooming and focusing - Since many digital devices are designed to operate over a variety of distances, focus becomes a significant factor. Sharp edge response is required for the best character segmentation and recognition. At short distances and large apertures, even slight perspective changes can cause uneven focus.

Moving objects - The nature of mobile devices suggests that either the device or the target may be moving. The amount of light the CCD can accept is fixed, and at higher resolutions the amount each pixel gets is smaller, so it is harder to maintain an optimal shutter speed, resulting in motion blur. In the simplest case, motion blur can be modeled by a Point Spread Function (PSF) that is tuned to the direction of motion. A more complex model is needed when multiple directions are involved, or when objects are at different distances from the camera.

Intensity and color quantization - Ideally, each pixel in a CCD array should output the three color components, R, G, and B. In practice, however, different hardware can introduce visual distortions or quantize colors that are optimized (or at least designed) for capturing scene images, rather than documents.

Sensor noise - Dark noise and read-out noise are the two major sources of noise at the CCD stage in digital cameras. Additional noise can be generated in amplifiers. The higher the shutter speed, the smaller the aperture, the darker the scene, and the higher the temperature, the greater the noise.

Compression - Most images captured by digital cameras are compressed, either in hardware or software. It is possible to obtain uncompressed images at the cost of five- to ten-fold storage space. Compression artifacts are well documented and remain a challenge for document images for which it is necessary to preserve sharpness.

Lightweight algorithms - The ultimate goal will be to embed document analysis processing directly into the devices. In such cases, the system must provide computationally efficient algorithms which can operate with limited memory, processor and storage resources.

Camera Based Acquisition

One advantage of using a camera instead of a scanner is that it is possible to acquire images at some distance from the target. Zooming to the area of interest, however, introduces challenges for auto-focus and zoom. Furthermore, due to the low resolution, it is often not possible to capture all the text in one frame while keeping a reasonable font size. Thus an image mosaic technique is needed to put pieces of text images together as a large high-resolution image. In [31], Mirmehdi et al. proposed a simple approach to auto-zooming for general recognition problems. If the background around an object has low variance compared to the object, then the variance in the observation window could be used as an indicator of best zoom. In [52], Zandifar discussed auto-focusing problems in designing a text reading system for the visually impaired. They assumed that the best focus is achieved when edges are strongest in the image. The sum of the differences between neighboring pixels, the sum of gradient magnitude, or the sum of a Laplacian filter's output could be used as the overall edge strength measure. Focus is adjusted until the measurement is optimized. Mirmehdi [32] described a system that can automatically locate text documents in a zoom-out view, and control the camera to pan, tilt, and zoom in to get a closer look at the document. Assuming the document is directly facing the camera so that there is no perspective distortion, the system segments text lines and estimates the average text line height to determine the zoom for optimal OCR. The entire document is divided into several


pieces accordingly and the camera captures each piece after panning, tilting, and zooming. The small pieces are put together by mosaicing to obtain a complete document image which is sent to a commercial OCR package. Image registration and mosaicing are also used by Jung in [14] to put together long text strings that appear in multiple video frames into a panorama image. In the CamWorks project [32], mosaicing is used to put together the images of the upper and lower part of a document page. In [53], a desktop OCR system using the PC-cam is described where the camera is placed on the top of a desk pointing downwards but the camera captures only a small part of an A4 document. The user moves the document while monitoring the computer screen, until every part of the page appears in the sequence. During the capturing, frames are selected such that they are substantially different and yet successive ones overlap. This reduces the number of frames used in image registration, and reduces blur which can result from the combination of too many images. Based on the observation that words are abundant in text documents, they adopted a feature-based image registration method, where feature points are the lower right vertices of word bounding boxes. The overlapping parts of two registered images are blended to avoid any abrupt seam. Ultimately document mosaicing from handhelds is a desired capability.
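The edge-strength focus measures mentioned earlier in this section (the sum of gradient magnitude, or the sum of a Laplacian filter's output) can be illustrated with a minimal sketch. It assumes grayscale frames stored as 2-D NumPy arrays; the function names `gradient_focus`, `laplacian_focus` and `best_focus` are hypothetical and are not taken from [52].

```python
import numpy as np

def gradient_focus(img):
    """Sum of gradient magnitude over the image (central differences)."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)
    return float(np.sum(np.hypot(gx, gy)))

def laplacian_focus(img):
    """Sum of absolute 4-neighbour Laplacian responses over interior pixels."""
    img = np.asarray(img, dtype=float)
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return float(np.sum(np.abs(lap)))

def best_focus(frames, measure=laplacian_focus):
    """Index of the frame with the strongest edges under the given measure."""
    return max(range(len(frames)), key=lambda i: measure(frames[i]))
```

In an auto-focus loop, one frame would be captured per lens position and the position returned by `best_focus` kept. Which measure behaves best on real text is an empirical question that this sketch does not settle.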

captured content is a document page image, localization can be performed with global or adaptive thresholding. In some cases, the page edge is also used to help identify relevant content.



Processing Captured Images

When considering how to process captured images, we must once again consider the differences between processing images of text, and processing full documents imaged with cameras. Typically the tasks involved in processing single frames include detection and extraction, normalization, enhancement and recognition.


Text detection, extraction



Processing captured images requires some basic processing that may or may not be required for traditional document image analysis. The first is detection and localization of the text regions. Although segmentation can be viewed as a challenge for degraded documents, text or documents captured with a camera may be embedded in the scene, so text detection has become a widely addressed problem. For video, Zhong [56] noticed that DCT coefficients in compressed MPEG-1 videos contain certain texture information. Most other techniques applied to images or video keyframes can broadly be classified as edge ([11], [12], [19], [20], [33]), color ([16], [30], [36]), or texture based ([9], [21], [25], [28], [49], [55]). In some cases assumptions are made about the sign or document boundaries ([2], [5]). Text detection has been the primary focus of literature on the processing of text in WWW and video frame images so we will not further survey this topic. When it is assumed that the

As previously suggested, documents that are not frontal-parallel to the camera's image plane will undergo a perspective distortion. In general, if the document itself lies on a plane, then the projective transformation from the document plane to the image plane can be modeled by a 3-by-3 matrix in which eight coefficients are unknown and one is a normalization factor. The removal of perspective can be accomplished once the eight unknowns are found. Four pairs of corresponding points are enough to recover the eight degrees of freedom. However, for the purpose of OCR, the requirements can be relaxed. As Myers et al. [33] pointed out, OCR engines are capable of handling different x-to-y scales, and are not affected by x- or y-translations. If OCR engines are also able to handle different font sizes, then the z depth is irrelevant and four unknowns can be removed from the problem. Furthermore, the skew (or rotation) can be estimated by a traditional page analysis engine. This leaves only three critical parameters: two perspective foreshortenings, one along each axis, and a shearing. As described in [38], in a man-made environment where many 3-D orthogonal lines exist, the estimation of vanishing points provides a way to recover the perspective. In text documents, parallel text lines, column edge lines, and page boundary lines provide such orthogonal lines. Therefore, the estimation of two vanishing points, horizontal and vertical, respectively, is enough. In their study of removing perspective distortion [33], Myers et al. assume that cameras are placed such that vertical edges in scenes are still vertical and parallel in images. The vertical vanishing point where vertical edges intersect is therefore at infinity in the image plane, while the horizontal vanishing point is in the image plane.
They proceed by rotating each text line and observing the horizontal projection profile to find the top and base line, and observing the vertical projection profile to find the dominant vertical edge direction. From the three lines the foreshortening along horizontal axis and shearing along vertical axis are determined so that the original text line image is restored. In their work, text lines are restored independently. Clark and his colleagues studied more general cases. In [3], following the detection of rectangle frames that appear as quadrilaterals in an image, the image in each quadrilateral is mapped to a rectangle area while preserving the approximate width-to-height ratio. It is not


a completely correct restoration but an effective approximation for OCR purposes. In [1], [4] and [5] both horizontal and vertical vanishing points are estimated for a full removal of perspective.

4.2.2 Warping

In some cases, text will appear on curved surfaces. Although the general solution to the problem of recovering surface structure is very challenging, the typical layout of documents often allows some assumptions to be made. For example, warped pages are common when imaging books, but the fact that the page often bends on a cylinder parallel to the page allows for a cylindrical model to be used. Similarly, if we make the assumption that text is laid out horizontally as straight parallel lines on the page, we can use text line features to recover subtle changes in the page structure and warp the page back to a plane.
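The statement earlier in this section that four point correspondences determine the eight unknowns of the 3-by-3 projective transformation can be made concrete with a short sketch. It assumes the normalization factor is fixed by setting the last matrix entry to 1, which fails if that entry is actually zero, and that no three of the four points are collinear. The names `homography_from_points` and `apply_homography` are hypothetical, and this is not the procedure of any of the cited papers.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 matrix H, with H[2, 2] = 1, mapping four src points to dst."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), rearranged to be linear in h
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h21 x + h22 y + h23) / (h31 x + h32 y + 1)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map an (N, 2) array of points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]
```

For example, the four corners of a page that appears as a quadrilateral in the image can be mapped to the corners of an upright rectangle, and the resulting matrix then used to resample the page.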



The text extracted from scenes or documents may require enhancement in a number of ways if standard or commercial OCR is used. In particular, text should be mapped to binary (black on white), size should be projected to be equivalent to about 12pt 300dpi text, edges should be sharpened, and characters should be deblurred when possible. Unlike scanner based input where the quality of the image is primarily a function of document quality, the quality of camera text suffers from other external factors (described above) that need to be rectified. In scanner-based OCR systems, image resolution is high, so interpolation is usually unnecessary. A few touching or breaking strokes will not significantly affect OCR performance. In low-resolution images, however, character strokes may be only one pixel thick, and blended with surrounding backgrounds. Without enhancement, a simple binarization will completely remove many strokes. The tasks of enhancement are typically to increase spatial resolution and to increase the difference between text and background. Traditionally, brightness and contrast enhancement are two preliminary tools in image enhancement. Kuo et al [17] introduced a method for enhancing color and grayscale images and NTSC video frames by contrast stretching. The text in processed images has better visual quality. The problem of deblurring is basically concerned with deconvolution which is ill-posed in its basic form because zeros in the blurring PSF will magnify any noise in input to infinity. Many have worked on solutions to overcome this problem. In both [42] and [35] Tikhonov-Miller regularization is used so that the solution is regularized by a smoothness constraint. Ideally a smoothness

requirement should be weak near character stroke edges and strong in background areas. Instead of adaptively changing the smoothness parameter, they achieved a similar effect by testing the local variance and replacing the pixel with the local average if the local variance is low (i.e., in background). With this method character edges are preserved, and the noise in the background is suppressed where it is most noticeable. Interpolation is widely used before binarization. Under the assumption that the Nyquist criterion is met when a continuous object image is sampled by a CCD array, it is theoretically possible to reconstruct the original light field by sinc function interpolation [24]. A perfect high-resolution image can be obtained by resampling at the needed resolution. In practice, where noise is present, perfect reconstruction is meaningless. Bilinear interpolation has been found effective in many instances (e.g., [15], [24], [35] and [42]). Adaptive thresholding is also a major topic in text image binarization. It has been found that global thresholding is not ideal for camera-captured images due to lighting variation. Kamada [15] and Li [24] both used locally adaptive thresholding to extract text pixels from video frames. Jiang [13] proposed to interpolate the original text image by n times, then for each nxn block select the threshold that would result in a black-to-white ratio proportional to the original pixel's graylevel. Others, including [7], have also addressed the problem. In a well-known survey, Trier [43] compared 11 locally adaptive thresholding techniques and concluded that Niblack's method is the most effective. In both [42] and [48], Niblack's method is found to be the most effective for extracting text. In Taylor [42], their comparison showed that simple Niblack's locally adaptive thresholding with k=0 gave the best result, which is equivalent to thresholding at a local average.
Wolf reported their effort in suppressing the noisy output in pure background areas by a modified version of Sauvola's post-processing [48]. Later Taylor et al. enhanced text images by high frequency boosting followed by bilinear interpolation [42]. An alternate approach to compute locally adaptive thresholds is proposed in [40]. It is also possible to do binarization without thresholding. Wolf [47] obtained the prior distribution of 4x4 binary cliques in text images from training samples, and used a MAP estimator to binarize any 4x4 clique in input graylevel images. To overcome the difficulty of a large discrete search space (of size 2^16 = 65536), Wolf used a simulated annealing technique.
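Niblack's locally adaptive threshold, as commonly formulated, is T = m + k*s, where m and s are the mean and standard deviation of the graylevels in a window around each pixel. With k=0 it reduces to thresholding at the local average, as noted above. The following is a minimal sketch under stated assumptions: dark text on a lighter background, a square window of half-width `w`, and replicated image borders. The function names and the default window size are hypothetical, not values from [42] or [43].

```python
import numpy as np

def local_mean_std(img, w):
    """Mean and standard deviation in a (2w+1)x(2w+1) window, borders replicated."""
    img = np.asarray(img, dtype=float)
    p = np.pad(img, w, mode="edge")
    n = 2 * w + 1

    def box(a):
        # Summed-area table with a leading row and column of zeros.
        s = np.pad(a, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)
        return s[n:, n:] - s[:-n, n:] - s[n:, :-n] + s[:-n, :-n]

    mean = box(p) / n ** 2
    var = np.maximum(box(p * p) / n ** 2 - mean ** 2, 0.0)
    return mean, np.sqrt(var)

def niblack_binarize(img, w=7, k=0.0):
    """Boolean mask that is True where a pixel is darker than T = mean + k * std."""
    mean, std = local_mean_std(img, w)
    return np.asarray(img, dtype=float) < mean + k * std
```

Because the threshold follows the local mean, a dark stroke is still separated from its surroundings under a lighting gradient that would defeat a single global threshold. In flat background regions the output is sensitive to noise, which is the problem the post-processing mentioned above is meant to suppress.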



Character recognition has long been the fundamental problem in document analysis. Most systems optimized for scanned documents cannot be applied directly to text


acquired with cameras, although the prevailing approaches try to normalize and enhance text extracted from camera images so that it can be passed directly to COTS software. Kurakake [18] presented an approach that couples character segmentation and OCR in order to improve the performance of both. After text lines are segmented, one character is segmented at a time, from left to right, then verified by recognition. If the recognition score is high, the segmentation is accepted. Otherwise another segmentation position is tried. Similarly, Sato [39] coupled character segmentation and recognition. In their approach, all possible cutting positions are found first. The goodness of a combination of cuts is measured by the average character recognition score. A dynamic programming method is used to efficiently find the best segmentation. More recently, specialized classifiers have been trained to deal with scene text in WWW and video keyframes.

detectors. However, in practice, in order to deal with fade in/out and not to miss any small text, a text region detector should still be used at a certain interval. In [54], Zhang's goal was to detect score information appearing in sports video, while ignoring other text such as text in commercials. Zhang et al. found that team names, scores, and other related words usually appear in fixed positions on the screen, so they adopted a model based method. Initially a short sequence of video containing score information is used to train a model of the position and appearance of text.


Text Tracking

Multiframe Processing

Many of the same types of problems exist in video, but some additional processing is possible when multiple frames are available.


Frame Selection

As the first step of utilizing multiple images in document analysis, images that do not contain text should be eliminated. This step not only saves computation cost in downstream stages, but also reduces the number of false alarms in the text localization step: usually text localizers are designed towards the goal of not missing any text. In some sense, the appearance and disappearance of text in video is a special case of a scene change. In some applications text frame detection is therefore based on shot detection. This method has problems when caption text fades in or out since those situations are easily missed with most shot detectors. Another simple solution is to treat every frame as a potential text frame and apply text region detection to them ([19], [27], [28]). It has the shortcoming of wasting computing power and increasing false alarms. As a simple modification, Kim [16] selected frames at fixed intervals under the assumption that text must stop for a certain amount of time to be visible to the human eye. Gargi [9] proposed a text frame detection scheme based on the increase in the number of intra-coded blocks in MPEG P- and B- frames. Similarly, Kurakake [18] proposed their detection method based on the difference of intensity histograms of successive frames (maybe several frames ahead or behind). The position where histogram changes abruptly is assumed to be a candidate text frame. These methods do not rely on text region

In order to make full use of the temporal redundancy in video sequences in document analysis, an important idea is to improve the OCR performance by using all instances of the same text in different frames. This requires the tracking of the text over time. Li [24] studied the tracking of moving text in videos. Once one text object is found in a given frame (called the reference object), SSD-based (Sum of Squared Differences) matching is used to find the best matching block in the next frame. The trace of the text objects is used to rule out false matches that result in random movement. Text boxes are enlarged by a factor of 2 using bilinear interpolation, and then matched again based on SSD to get subpixel matching precision. Finally the matched images are averaged and binarized to extract text. In order to avoid losing track of the text, a recalibration is performed at a fixed interval. Edges in text objects are grouped to get a tighter bounding box of the text. The new text object is used as a new reference in the tracking process. Lienhart [28] used a different method to track text objects. A projection profile based signature is defined for each text box. In the vicinity of an original text box in a new frame, a best matching box is found based on signature distance. Similar to Li [25] a text detector is invoked every 5 frames to calibrate the tracking. Tracking is continued in the case of a few frame dropouts. Before enhancement, all text boxes are rescaled to a fixed height of 100 pixels. For low resolution video where text is small, text boxes are interpolated; for HDTV frames, text is downsampled. The rescaled text boxes are aligned at subpixel precision by SSD-based matching of pixels that have colors close to the dominant text color. In [48], Wolf performs text tracking by simply checking the overlap of detected text blocks between consecutive frames.
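The SSD-based matching step can be sketched as an exhaustive search over a small window around the previous position of the text box. This is a minimal illustration at integer-pixel precision only; the function name `ssd_match` and the `search` radius are hypothetical and are not parameters reported in [24].

```python
import numpy as np

def ssd_match(ref_block, frame, top_left, search=4):
    """Find the position in frame that minimizes the SSD against ref_block.

    top_left is the (row, col) of the block in the previous frame. Candidates
    within +/- search pixels that fit inside the frame are tried.
    Returns ((row, col), ssd) for the best candidate.
    """
    ref = np.asarray(ref_block, dtype=float)
    frame = np.asarray(frame, dtype=float)
    h, w = ref.shape
    r0, c0 = top_left
    best, best_pos = np.inf, tuple(top_left)
    for r in range(max(0, r0 - search), min(frame.shape[0] - h, r0 + search) + 1):
        for c in range(max(0, c0 - search), min(frame.shape[1] - w, c0 + search) + 1):
            ssd = np.sum((frame[r:r + h, c:c + w] - ref) ** 2)
            if ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos, float(best)
```

Subpixel precision, as described above, could be approximated by interpolating both the block and the frame by a factor of 2 before calling the same routine.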


Multiframe Enhancement

Another advantage of processing multiframe sequences is the ability to integrate over time to improve recognition.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 2003 IEEE

When multiple frames are available, the temporal redundancy is often exploited to enhance text quality through so-called super-resolution methods. The basic idea of super-resolution algorithms is to construct a clean, high resolution image from multiple observations, under the fundamental assumption that the constructed super-resolution image should reproduce the original low resolution images when appropriately smoothed, warped and down-sampled. The success of the method is determined by how accurately this image formation process can be modeled. Nearly all super-resolution methods require the registration of images to sub-pixel accuracy, which can be very challenging in practice. In some applications of video text enhancement, additional constraints make the registration manageable. For example, Sato [39] assumes that graphic text in TV news programs is static, so no spatial transformation (warping or translation) is necessary. After locating all text blocks in successive frames, they interpolate the text blocks by a factor of 4 and then integrate them with a temporal min/max operation. The rationale is that text pixels do not change over time while background pixels do, so the max operation (for normal text) or the min operation (for reverse text) picks the brightest or darkest background value along the time axis, respectively. Wolf [48] presents a similar scheme, also assuming that the text is static, but fuses the frames by averaging. Li [24] presents a scheme for enhancing moving text. After identifying the reference frame, he uses an image matching technique to track the corresponding text blocks in several consecutive frames. The tracked text blocks are registered to subpixel accuracy, which improves both registration accuracy and text resolution, and are then averaged to obtain a text block with a clean background and higher resolution.
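The temporal integration of static text blocks can be sketched as follows. The sketch assumes that the blocks are already located, co-registered and of equal size, and that "normal" text means dark text on a brighter background. The function name `integrate_text_blocks` and the synthetic data are hypothetical, and the interpolation step is left out.

```python
import numpy as np

def integrate_text_blocks(blocks, mode="max"):
    """Fuse co-registered text blocks of equal size along the time axis."""
    stack = np.stack([np.asarray(b, dtype=np.float64) for b in blocks])
    if mode == "max":    # assumed: dark text, brighter changing background
        return stack.max(axis=0)
    if mode == "min":    # assumed: bright (reverse) text, darker changing background
        return stack.min(axis=0)
    if mode == "mean":   # plain averaging, which suppresses zero-mean noise
        return stack.mean(axis=0)
    raise ValueError(f"unknown mode: {mode}")

# Synthetic example: static dark text over a background that changes every frame.
rng = np.random.default_rng(1)
text_mask = np.zeros((12, 40), dtype=bool)
text_mask[4:8, 5:35] = True
frames = []
for _ in range(10):
    frame = rng.uniform(60.0, 255.0, size=text_mask.shape)
    frame[text_mask] = 20.0
    frames.append(frame)

fused = integrate_text_blocks(frames, mode="max")
# Worst-case contrast between background and text, before and after fusion.
single_contrast = frames[0][~text_mask].min() - 20.0
fused_contrast = fused[~text_mask].min() - 20.0
print(single_contrast, fused_contrast)
```

The max operation leaves the static text pixels untouched and pushes every background pixel toward its brightest value over time, which widens the gap a later binarization step has to find.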
Lienhart [26] describes a scheme that handles moving text, but the enhancement is more recognition focused: the recognition results for multiple instances of the same character in successive frames are combined to compute the final output. All of the algorithms discussed above can only enhance graphic text, under the assumption that it is either static or undergoes purely translational motion, such as text scrolling up the screen, and the enhancement focuses on denoising. The enhancement of scene text is a much more complex problem and has been sparsely addressed because the motion is often unconstrained. Here the enhancement must include both denoising and deblurring, where the blur is caused by defocus or motion. Li [24] proposes a projection onto convex sets (POCS) based method to deblur scene text and improve readability. POCS requires the definition of closed convex constraint sets. An estimate of the high resolution image is defined as a point in the intersection of these constraint sets and is determined by successively projecting an initial guess onto the constraint sets. POCS-based methods offer the flexibility of space-varying processing and can simultaneously account for blurring due to motion and for sensor noise. However, the paper considers only linear space-invariant (LSI) blurring; the extension to more complex cases is not addressed.
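The idea of successive projections can be illustrated with a small one-dimensional sketch. It assumes a known LSI blur kernel and two kinds of closed convex sets: one data-consistency set per observed sample (the residual after re-blurring must stay within a tolerance `delta`) and an amplitude set (values in [0, 1]). The helper names, the kernel and the tolerance are hypothetical choices for illustration, not the constraint sets used in [24].

```python
import numpy as np

def blur_matrix(n, kernel):
    """Matrix form of an LSI blur with zero padding at the borders."""
    half = len(kernel) // 2
    H = np.zeros((n, n))
    for i in range(n):
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                H[i, j] = weight
    return H

def pocs_deblur(observed, H, delta=0.01, sweeps=200):
    """Cyclically project an initial guess onto the constraint sets."""
    x = observed.astype(np.float64).copy()    # initial guess: the blurred signal
    row_norms = np.sum(H ** 2, axis=1)
    for _ in range(sweeps):
        for i in range(len(observed)):
            # Data-consistency set C_i = {x : |observed[i] - H[i] @ x| <= delta}.
            residual = observed[i] - H[i] @ x
            if residual > delta:
                x += (residual - delta) / row_norms[i] * H[i]
            elif residual < -delta:
                x += (residual + delta) / row_norms[i] * H[i]
        # Amplitude set: intensities must lie in [0, 1].
        np.clip(x, 0.0, 1.0, out=x)
    return x

# Synthetic example: two "strokes" blurred by a 3-tap kernel, without noise.
kernel = np.array([0.25, 0.5, 0.25])
sharp = np.zeros(40)
sharp[8:12] = 1.0
sharp[20:23] = 1.0
H = blur_matrix(len(sharp), kernel)
observed = H @ sharp
restored = pocs_deblur(observed, H)
print(np.linalg.norm(observed - sharp), np.linalg.norm(restored - sharp))
```

Because the sharp signal lies in every constraint set, each projection can only move the estimate closer to it, so the second printed distance is smaller than the first. Further prior knowledge can be added as additional sets without changing the loop structure.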

Camera Based Applications

Over the past 20 years, there have been numerous applications of camera-based text recognition, such as reading license plates, book sorting, visual classification of magazines and books, reading freight train IDs, road sign recognition, detecting danger labels, and reading signs in warehouses. In addition to these types of signs, the ability to process signs using mobile, low cost hardware enables numerous other applications.

Sign detection and translation. The ability to detect and recognize text using PDAs or cellular phones has promise for both commercial and military applications. Watanabe [45] described a simple Japanese-to-English sign translation system. More sophisticated work by Yang et al. ([50], [51]) consists of three parts: sign detection, character recognition, and sign translation. The difficulty of sign detection is rooted in the concise nature of signs: a sign often comprises only a few words or characters, so techniques developed for detecting large text segments in documents do not fit sign processing well. With limited context, it is also more difficult to distinguish different meanings of the same word. Traditional knowledge-based machine translation works well with grammatical sentences, but has difficulty with the ungrammatical text found in signs. Instead, Yang et al. used statistical and example-based machine translation for sign translation.

Mobile text recognizer and speech generator for the visually impaired. Zandifar et al. [52] proposed using camera-based OCR techniques in a head-mounted smart video camera system to help the visually impaired. Their goal is to detect and recognize text in the environment and then convert the text to speech. The problems they confronted include the detection of text and the adjustment of the camera (such as zooming) so that clear focus can be achieved. Other challenging problems include multiple text orientations and text on curved surfaces, such as cans.
Cargo container and warehouse merchandise code reader. Lee et al. [20] presented a system used in ports to automatically read cargo container codes. A single graylevel image captured by a camera is provided for this task. Besides container codes, there may be other text on containers, and the uneven surface may make text look warped. Their text detection is based on vertical edges found in the image. At the verification stage, they used the domain knowledge that container codes are standardized in a fixed format: 4 letters followed by 7 digits, in one or two lines. It is not hard to imagine similar applications. For example, while barcodes are widely used, they have the disadvantage of not being readable by humans. The ability to capture and recognize text on packages is a useful tool for warehouse management. Similarly, in the package delivery industry, it would be helpful to recognize addresses and automatically route packages to the appropriate destination.

Document archiving. Because of their flexibility and independence from bulky computers, it would not be surprising to find digital cameras and camcorders being used as document digitizing and archiving devices in the future. A user can conveniently carry such a device anywhere and record interesting document pages instantly. All the processing can be deferred until a computer is accessible.
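Returning to the cargo container code reader, the format constraint used at its verification stage (4 letters followed by 7 digits, in one or two lines) can be expressed as a simple check on OCR output. The sketch below is only an illustration of that constraint. The function name, the handling of spaces and letter case, and the sample strings are assumptions, not details taken from [20].

```python
import re

# Assumed pattern for the stated format: 4 letters followed by 7 digits.
CONTAINER_CODE = re.compile(r"[A-Z]{4}[0-9]{7}")

def is_container_code(ocr_lines):
    """Check whether one or two OCR text lines together match the code format."""
    if not 1 <= len(ocr_lines) <= 2:
        return False
    # Spaces are dropped and lower-case letters accepted (an assumption).
    joined = "".join(ocr_lines).replace(" ", "").upper()
    return CONTAINER_CODE.fullmatch(joined) is not None

# Hypothetical OCR results for candidate text regions.
print(is_container_code(["ABCU 123456 7"]))     # code on a single line
print(is_container_code(["ABCU", "1234567"]))   # code split over two lines
print(is_container_code(["FRAGILE"]))           # other text on the container
```

A check of this kind lets the system discard detected text regions that cannot be container codes before any further processing.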

Grand Challenges

Clearly, the ability to capture and process documents with the same ease with which we feed a document into a scanner will revolutionize the way we capture and manage hardcopy documents. We will be able to effortlessly obtain and manage content anytime and anywhere, providing a pervasive environment for hardcopy content. There are a number of key areas that need to be fully addressed, however, before this becomes a reality.

Image Quality. One of the fundamental problems that will need to be addressed, primarily at the device level, is the quality of the images. After years of working with 300dpi imagery, where lighting has been optimized and device noise is reduced, we are faced with the challenge of dealing with low resolution and, in some cases, corrupted data. There is no doubt that cameras could be designed to reduce some of these problems, but in general, document analysis is not a driving force for these devices. Until document analysis is feasible with standard off-the-shelf devices that can be purchased at a reasonable cost, the demand will not be sufficient to significantly change the current situation.

Basic Algorithms. As previously stated, camera-based document analysis introduces many new requirements that are not common with scanner-acquired images, including dealing with perspective, motion blur, focus and uneven lighting. A great deal of progress is being made on these problems, but we need to continue to work on them.

Device Processing. Ultimately, our desire for mobility and on-demand processing will require that our document analysis algorithms be ported directly to the imaging devices. Although document capture and offline processing may be the current mode of operation, new applications will drive the demand to, for example, read text and follow up with an online search or database query. This type of real-time processing will require onboard OCR.

New Applications. Perhaps the most interesting thing we can do is to look forward to the new types of applications that will be enabled when we eventually do realize these goals. We will be able to capture, process and send information from restaurant menus or bus schedules through our cellular phones. We will be able to go into a library and copy, enhance and ultimately read the articles we need without going to a photocopy machine. We will be able to retrieve and reformat imaged documents automatically to adapt to any device. Currently our view of document analysis on mobile devices is one of replacing a scanner. Ultimately, PDAs, cellular phones and digital cameras will allow us to work in a way where hardcopy documents are seamlessly integrated into our environment.

Summary of text detection literature showing task type (C for caption text, S for scene text), test data, and results

| Author and year | Test data | Results | Comments |
| Hua 2001 | 90 clips of CNN news, 5 consecutive frames each. | 244 text boxes: 229 (94%) detected, 18 (7.3%) false alarms. | |
| Kim 1996 | 50 true color 384x288 frames. | 124 text lines: 107 (86%) detected. | |
| Kurakake 1997 | 100 video clips, 640x480, including running text. | All text frames detected. 4.9% false alarms. 92% of characters extracted. 82% recognition. | Topic frames found without OCR. |
| Kuwano 2000 | 10 hours of 8 news programs, 640x480. | Of 1,383 caption appearances: 1,314 (95%) were detected with 111 (7.8%) false alarms. 96% of characters extracted. 76% recognition. | Video structure is extracted based on caption appearances, without OCR. |
| Lee 1995 | 191 grayscale images, 512x512, of cargo containers. | 2,096 characters: 1,915 (91%) segmented, 217 (10%) false alarms. | Applied to cargo container code reading in ports. |
| Li 1999 | 45 text blocks from video frames (320x240). | 13 (29%) of original blocks, 36 (80%) of blocks enlarged by duplication, and 45 (100%) of blocks enlarged by interpolation have OCR output. 1,452 characters: 13%, 34% and 67% were recognized, respectively. | To show effect of interpolation on OCR. |
| Li 2000 | 500 keyframes (320x240) from 22 MPEG video clips, and 75 frames from TV. | 500 keyframes and 151 text frames: 133 (88%) detected, 81 (38%) false alarms. 153 text blocks in 75 TV frames: 142 (93%) detected, 14 (9.0%) false alarms. | |
| Lienhart 1996 | 8 video clips, 384x288, each frame is JPEG. | 86% of characters are contained in candidate text regions in the case of static text on static background. | In other cases, where text or background is non-stationary, results are in the high 90s. |
| Lienhart 2000 | 22 min videos, in JPEG format, 384x288. | Segmentation rate (OCR rate): 96% (76%) on credit sequences, 66% (65%) on commercials, and 99% (41%) on news. | |
| Lienhart 2002 | 23 video clips, total 10 min, 352x240 or 1920x1280, and 7 web pages stored as images. | Text box detection rate improved from 69.5% to 94.7% from tracking. | To show the effect of text tracking. |
| Messalodi 1999 | 100 grayscale images of book covers normalized to 512x512. | 91% text line recall, 54% precision. | |
| Miene 2001 | 34 MPEG1 (352x288) video clips of news. | 91% segmented, 81% recognized. | Characters are completely separated from background. |
| Mirmehdi 2001 | 2 sample images, 1440x960 after mosaicing. | 83-90% recognized (words) in one case. | Images were mosaiced from small pieces captured by a 640x480 video camera. |
| Newman 1999 | 6 users working on capturing and OCRing small text segments. | 6.5% CamWorks OCR error rate at 200dpi (0.6% flatbed scanner OCR error rate at 300dpi). | To show the feasibility of replacing scanners with cameras. |
| Ortacdag 1998 | 30 CD and book cover color images. | 25 (100%) text images extracted. | |
| Wu 1997 | 48 images from internet, library and scanner, including video frames, photographs, newspapers, ads, and checks. | Of 21,820 characters (4,406 words): 95% (93%) extracted, 91% (86%) cleaned-up. Of extracted: 84% (77%) recognized. | |
| Zhang 2002 | 3 baseball and 1 NBA video clips, all interlaced with commercials. | All text bounding areas were correctly learned from initial sample sequences. Of 1,134 caption keyframes, 1,130 (99.6%) detected, 22 (1.9%) false alarms. 92% OCR rate. | Target is score board text. Texts in commercials are false alarms. Dictionary has 187 words. |
| Zhong 2000 | 2,360 I-frames from 8 MPEG-1 video clips. | Of 3,206,936 8x8 compressed blocks: 141,680 labeled as text, 140,983 (99.5%) detected, 50,759 (26.5%) false alarms. | |




References

[1] P. Clark, M. Mirmehdi, Location and Recovery of Text on Oriented Surfaces, SPIE CDRR VII, pp. 267-277, 2000.
[2] P. Clark, M. Mirmehdi, Finding Text Regions Using Localised Measures, Proc. of the 11th BMVC, pp. 675-684, 2000.
[3] P. Clark, M. Mirmehdi, On the Recovery of Oriented Documents from Single Images, Technical Report CSTR-01004, Dept. of CS, Univ. of Bristol, Nov. 2001.
[4] P. Clark, M. Mirmehdi, Estimating the Orientation and Recovery of Text Planes in a Single Image, Proc. of the 12th BMVC, pp. 421-430, Sept. 2001.
[5] P. Clark, M. Mirmehdi, Recognizing Text in Real Scenes, IJDAR, Vol. 4, No. 4, pp. 243-257, 2002.
[6] D. Doermann, The Indexing and Retrieval of Document Images: A Survey, CVIU, Vol. 70, No. 3, pp. 287-298, 1998.
[7] E.Y. Du, C-I Chang, P.D. Thouin, Thresholding Video Images for Text Detection, Proc. of the 16th ICPR, Vol. 3, pp. 919-922, 2002.
[8] F. Fisher, Digital Camera for Document Acquisition, SDIUT, pp. 75-83, 2001.
[9] U. Gargi, S. Antani, R.R Kasturi, Indexing Text Events in Digital Video Databases, 14th ICPR, pp. 916-918, 1998.
[10] R. M. Haralick, Document Image Understanding: Geometric and Logical Layout, CVPR, pp. 385-390, 1994.
[11] Y. M. Y. Hassan, L. J. Karam, Morphological Text Extraction from Images, IEEE Trans. on Image Processing, Vol. 9, No. 11, pp. 1978-1983, Nov. 2000.
[12] A. K. Jain, B. Yu, Automatic Text Location in Images and Video Frames, PR, Vol. 31, No. 12, pp. 2055-2076, 1998.
[13] W.W.C. Jiang, Thresholding and Enhancement of Text Images for Character Recognition, ICASSP, Vol. 4, pp. 2395-2398, 1995.
[14] K. Jung, K. I. Kim, T. Kurata, M. Kourogi, J.-H. Han, Text Scanner with Text Detection Technology on Image Sequences, Proc. of the 16th ICPR, Vol. 3, pp. 473-476, 2002.
[15] H. Kamada, K. Fujimoto, High-speed, High-accuracy Binarization Method for Recognizing Text in Images of Low Spatial Resolutions, Proc. of the Fifth ICDAR, pp. 139-142, 1999.
[16] H.-K. Kim, Efficient Automatic Text Location Method and Content-Based Indexing and Structuring of Video Database, Journal of Visual Communication and Image Representation, Vol. 7, No. 4, pp. 336-344, Dec. 1996.
[17] S.-S. Kuo, M.V. Ranganath, Real Time Image Enhancement for Both Text and Color Photo Images, Proc. of the ICIP, Vol. 1, pp. 159-162, 1995.
[18] S. Kurakake, H. Kuwano, K. Odaka, Recognition and Visual Feature Matching of Text Region in Video for Conceptual Indexing, Storage and Retrieval for Image and Video Databases V, February 8-14, 1997, San Jose, CA, USA, SPIE Proc. Vol. 3022, pp. 368-379, 1997.
[19] H. Kuwano, Y. Taniguchi, H. Arai, M. Mori, S. Kurakake, H. Kojima, Telop-on-demand: Video Structuring and Retrieval Based on Text Recognition, IEEE ICME, pp. 759-762, New York, NY, July 2000.
[20] C.-M. Lee and A. Kankanhalli, Automatic Extraction of Characters in Complex Scene Images, IJPRAI, Vol. 9, No. 1, pp. 67-82, 1995.
[21] J. Li, R.M. Gray, Text and Picture Segmentation by the Distribution Analysis of Wavelet Coefficients, IEEE ICIP, pp. 790-794, 1998.
[22] H. Li, D. Doermann, A Video Text Detection System Based on Automated Training, ICPR, pp. 223-226, 2000.
[23] H. Li, O. Kia, D. Doermann, Text Enhancement in Digital Video, ACM Proc. of 8th CIKM, pp. 122-130, 1999.
[24] H. Li, D. Doermann, Text Enhancement in Digital Video Using Multiple Frame Integration, ACM Multimedia, Vol. 1, pp. 19-22, 1999.
[25] H. Li, D. Doermann, O. Kia, Automatic Text Detection and Tracking in Digital Video, IEEE TIP, Vol. 9, No. 1, pp. 147-167, Jan. 2000.
[26] R. Lienhart, F. Stuber, Automatic Text Recognition in Digital Videos, SPIE Proc. of 6th IVP, 1996, Technical Report TR-95-036.
[27] R. Lienhart, W. Effelsberg, Automatic Text Segmentation and Text Recognition for Video Indexing, ACM/Springer Multimedia Systems, Vol. 8, pp. 69-81, Jan. 2000, Technical Report TR-98-009, Praktische Informatik IV, University of Mannheim, May 1998.
[28] R. Lienhart, A. Wernicke, Localizing and Segmenting Text in Images and Videos, IEEE TCSVT, Vol. 12, No. 4, pp. 256-268, 2002.
[29] S. Messalodi, C. M. Modena, Automatic Identification and Skew Estimation of Text Lines in Real Scene Images, PR, Vol. 32, pp. 791-810, 1999.
[30] A. Miene, Th. Hermes and G. Ioannidis, Extracting Textual Inserts from Digital Videos, ICDAR, pp. 1079-1083, Sept. 2001.
[31] M. Mirmehdi, P.L. Palmer, J. Kittler, Towards Optimal Zoom for Automatic Target Recognition, Proc. of 10th SCIA, Vol. I, pp. 447-453, 1997.
[32] M. Mirmehdi, P. Clark, J. Lam, Extracting Low Resolution Text with an Active Camera for OCR, Proc. of the IX Spanish SPRIP, pp. 43-48, May 2001.
[33] G.K. Myers, R.C. Bolles, Q.-T. Luong, and J.A. Herson, Recognition of Text in 3-D Scenes, SDIUT 01, pp. 85-100, 2001.
[34] G. Nagy, Twenty Years of Document Image Analysis Research in PAMI, IEEE TPAMI, Vol. 22, No. 1, pp. 38-62, January 2000.
[35] W. Newman, C. Dance, A. Taylor, S. Taylor, M. Taylor, T. Aldhous, CamWorks: A Video-based Tool for Efficient Capture from Paper Source Documents, Proc. of the ICMCS, pp. 647-653, Jun. 1999.
[36] E. Ortacdag, B. Sankur, K. Sayood, A New Algorithm in Locating Text in Complex Color Images, IJDAR, 1998.
[37] R. Plamondon and S. Srihari, On-line and Off-line Handwriting Recognition: A Comprehensive Survey, IEEE TPAMI, Vol. 22, No. 1, pp. 63-84, January 2000.
[38] C. Rother, A New Approach for Vanishing Point Detection in Architectural Environments, Proc. 11th BMVC, pp. 382-391, 2000.
[39] T. Sato, T. Kanade, E.K. Hughes, M.A. Smith, Video OCR for Digital News Archive, IEEE Workshop on Content-Based Access of Image and Video Database, pp. 52-60, 1998.
[40] M. Seeger, C. Dance, Binarising Camera Images for OCR, Proc. of Sixth ICDAR, pp. 54-59, Sept. 2001.
[41] J.-C. Shim, C. Dorai, R. Bolle, Automatic Text Extraction From Video for Content-Based Annotation and Retrieval, Proc. of ICPR, pp. 618-620, 1998.
[42] M. J. Taylor and C. R. Dance, Enhancement of Document Images from Cameras, Proc. of IS&T/SPIE EIDR V, pp. 230-241, 1998.
[43] O.D. Trier, T. Taxt, Evaluation of Binarization Methods for Document Images, PAMI, Vol. 17, No. 3, pp. 312-315, 1995.
[44] A. Vinciarelli, A Survey on Off-Line Word Recognition, Pattern Recognition, Vol. 35, pp. 1433-1446, 2002.
[45] Y. Watanabe, Y. Okada, Y.-B. Kim, T. Takeda, Translation Camera, 14th ICPR, pp. 613-617, 1998.
[46] P. Wellner, Interacting with Paper on the DigitalDesk, Comm. ACM, Vol. 36, No. 7, pp. 86-96, 1993.
[47] C. Wolf, D. Doermann, Binarization of Low Quality Text Using a Markov Random Field Model, ICPR, pp. 160-163, 2002.
[48] C. Wolf, J.-M. Jolion, F. Chassaing, Text Localization, Enhancement and Binarization in Multimedia Documents, Proc. ICPR, Vol. 4, pp. 1037-1040, IEEE Computer Society, Aug. 2002.
[49] V. Wu, R. Manmatha, E.M. Riseman, Finding Text In Images, Proc. of the 2nd ACM ICDL, R. B. Allen and E. Rasmussen, eds., pp. 3-12, Jul. 1997.
[50] J. Yang, J. Gao, Y. Zhang, A. Waibel, Towards Automatic Sign Translation, Proc. of HLT, 2001.
[51] J. Yang, J. Gao, Y. Zhang, X. Chen, A. Waibel, An Automatic Sign Recognition and Translation System, Proc. of the Workshop on Perceptive User Interfaces (PUI), 2001.
[52] A. Zandifar, R. Duraiswami, A. Chahine, L. Davis, A Video Based Interface to Textual Information for the Visually Impaired, IEEE 4th ICMI, pp. 325-330, 2002.
[53] A. Zappala, A. Gee, M. Taylor, Document Mosaicing, Image and Vision Computing, Vol. 17, No. 8, pp. 585-595, 1999.
[54] D. Zhang, R.K. Rajendran, S.-F. Chang, General and Domain-Specific Techniques for Detecting and Recognizing Superimposed Text in Video, ICIP, pp. 593-596, 2002.
[55] Y. Zhong, K. Karu, A.K. Jain, Locating Text in Complex Color Images, ICDAR, pp. 146-149, 1995.
[56] Y. Zhong, H. Zhang, A.K. Jain, Automatic Caption Localization in Compressed Video, IEEE Trans. on PAMI, Vol. 22, No. 4, pp. 385-392, Apr. 2000.
[57] J. Zhou and D. Lopresti, Extracting Text from WWW Images, ICDAR, pp. 248-252, 1997.
