
Implementation of a Video Text Detection System

Jinsik Kim, Taehun Kim, Jiexi Lin
CS570 Artificial Intelligence, Team Foxtrot
KAIST, South Korea 305-701
jskim@ai.kaist.ac.kr, thkim@vivaldi.kaist.ac.kr, jesse@islab.kaist.ac.kr

Abstract
This is a term project report for the Artificial Intelligence course (CS570) in the Division of Computer Science at KAIST. In this report, we describe the implementation of a Video Text Detection System. We limit our scope to artificial text in video frames of news broadcasts, which has less noise and distinct text regions and is therefore well suited to detection. On top of existing edge detection techniques, we propose several heuristic approaches: long line removal, horizontal and vertical stroke detection, text area detection, bounding box detection, and text validation.

Keywords: OCR, Edge Detection, Text Localization

1. Introduction
For efficient indexing and retrieval in content-based multimedia databases, we need automatic extraction of descriptive features relevant to the subject material, such as images and video. Research has been done on low-level features such as color, texture, and shape. To offer a precise idea of the image content, we also need high-level features such as text and human faces [2]. Artificial text appearing in news broadcast video is usually closely related to the visual content and is a strong candidate for high-level semantic indexing and retrieval. It also has less noise and more distinct text regions than other video, making it especially appropriate for text detection research. Our approach to detecting text regions is based on existing edge detection techniques together with several heuristic methods: long line removal, stroke region detection, bounding box detection, and text validation. The remainder of this report is organized as follows. Section 2 gives an overview of existing text detection techniques. Section 3 describes the implementation of the system. Section 4 presents the experimental results. A conclusion is given in the last section.

2. Overview
In this section we take a look at existing text detection techniques. Text detection requires us to locate the regions where text appears. After detection, machine learning could be used to enhance accuracy, and the detected regions could then be passed on to recognition. In this report we focus only on text detection; future work may address text verification and recognition.

2.1 Basic ideas

There are four basic approaches to text detection: (1) edge-based detection, (2) area-based detection, (3) texture-based detection, and (4) continuous-frame detection. Edge-based detection assumes there is a difference in brightness or color between text and background, so the edges of the text can be found from these differences. Area-based detection treats the color of the text within a certain region as uniform. Texture-based detection treats text as a distinctive texture that can be segmented out using texture segmentation techniques. Continuous-frame detection compares consecutive frames to detect the appearance and disappearance of text. The first method, edge detection, is the most widely used and is appropriate for video text detection, so we choose an existing edge detection method as our first step.

2.2 Edge detection


Many edge detection methods are available, such as Sobel edge detection and Canny edge detection. We use the Canny method, with which we have obtained better results. Sobel edge detection uses the Sobel masks to calculate the gradient magnitude; the masks are shown in Figure1. With Sobel edges, we can easily obtain the distinct edges of an image in both the horizontal and vertical directions. For fast computation, we use the approximated gradient magnitude |G| = |Gx| + |Gy|.
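To illustrate the approximated gradient magnitude above, the following sketch (our own illustration, not the system's actual code) applies the 3x3 Sobel masks to a grayscale image array:

```python
import numpy as np

# 3x3 Sobel masks for horizontal (Gx) and vertical (Gy) gradients
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
SOBEL_Y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])

def sobel_magnitude(gray):
    """Approximated gradient magnitude |G| = |Gx| + |Gy| for a 2-D grayscale array.

    Border pixels are left at zero for simplicity."""
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.int32)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = gray[y - 1:y + 2, x - 1:x + 2]
            gx = int(np.sum(SOBEL_X * patch))
            gy = int(np.sum(SOBEL_Y * patch))
            out[y, x] = abs(gx) + abs(gy)
    return out
```

A pixel on a sharp brightness step gets a large magnitude, while pixels in flat regions stay at zero, which is exactly the property the edge-based text detection relies on.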

Figure1. Sobel masks

Canny edge detection has the following steps: filter out noise using a Gaussian filter; find the gradient magnitude using the Sobel masks; find the edge direction; thin along the edge direction; and use two thresholds (high_threshold, low_threshold) to eliminate streaking. In our system we use Canny edge detection because it offers better results than Sobel edge detection.

3. Implementation
In this section we describe the implementation of our system. The flow chart of the system is shown in Figure2.

Video Frame -> Canny Edge Det. -> Long Line Rem. -> Horizontal / Vertical Stroke Det. -> Text Area Det. -> Bounding Box Det. -> Validation -> Detected Text Region

Figure2. System flow chart

From the input video frame, we first tried to use the continuous-frame detection method to filter out the complex background. Due to problems that occurred during the conversion from 24-bit true color images to 256-color BMP images, we skipped this step and apply Canny edge detection directly to the input video frame. As the system flow chart shows, we propose several heuristic approaches: long line removal, horizontal and vertical stroke detection, text area detection, bounding box detection, and text validation.

3.1 Long Line Removal

After processing the image with Canny edge detection, we obtain a new image with the edges drawn in white, as shown in Figure3. As we can see, there are some long lines in the image that are obviously not text. We remove these long lines to obtain a simpler, cleaner image for further processing. The removal is based on counting how many times a line changes direction: if a line does not change its direction much, i.e., it is straight and long, we treat the edge as a long line and remove it. The image after long line removal is shown in Figure4.

Figure3. After Canny edge detection

Figure4. After long line removal
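The direction-change heuristic for long line removal can be sketched as follows. This is a minimal illustration of the idea, assuming edges have already been traced into chains of connected pixel coordinates; the function names and thresholds are our own, not the system's actual code:

```python
def is_long_line(chain, min_length=50, max_direction_changes=2):
    """Heuristic: a chain of edge points (x, y) is a 'long line' if it is
    long and the step direction between consecutive points rarely changes."""
    if len(chain) < min_length:
        return False
    changes = 0
    prev_dir = None
    for (x0, y0), (x1, y1) in zip(chain, chain[1:]):
        d = (x1 - x0, y1 - y0)  # step direction between consecutive edge points
        if prev_dir is not None and d != prev_dir:
            changes += 1
        prev_dir = d
    return changes <= max_direction_changes

def remove_long_lines(chains):
    """Keep only edge chains that are not long straight lines."""
    return [c for c in chains if not is_long_line(c)]
```

A long straight edge keeps the same step direction throughout and is removed, while character strokes, which are short or curve frequently, are retained.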

3.2 Stroke Detection


Before extracting the exact text region, we need to detect the approximate location of the text. The heuristic we propose is to treat each stroke as having a characteristic thickness. By detecting this thickness in two directions, horizontal and vertical, we can tell approximately which edges belong to text and which do not. Figure5 shows the results of the horizontal and vertical stroke detection.

Figure5. Stroke detection (horizontal and vertical)
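The stroke-thickness idea above can be sketched as a row scan over a binary edge map: two edge pixels separated by a small gap are taken as the two sides of a stroke. This is a hypothetical helper under our own assumptions (thickness bounds, edge-map representation), not the authors' code; the vertical case would be the same scan over columns:

```python
def stroke_candidates_horizontal(edges, min_thick=2, max_thick=10):
    """Scan each row of a binary edge map (list of rows of 0/1). Pairs of
    edge pixels whose separation falls within [min_thick, max_thick] are
    treated as the two sides of a character stroke. Returns a set of
    (row, col) pixels marked as stroke candidates."""
    candidates = set()
    for y, row in enumerate(edges):
        last_edge = None
        for x, v in enumerate(row):
            if v:  # edge pixel
                if last_edge is not None and min_thick <= x - last_edge <= max_thick:
                    # mark everything between the two edges as stroke interior
                    for xi in range(last_edge, x + 1):
                        candidates.add((y, xi))
                last_edge = x
    return candidates
```

Edge pairs that are too far apart (e.g. the two sides of a large background object) are rejected, so only stroke-width structures survive.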

3.5 Text Validation


Even after we obtain the bounding boxes of the text, some extra boxes remain due to image noise. We remove non-text boxes with two heuristics: deleting small boxes and retaining long ones. We assume characters appear in sentences, so a small bounding box that could contain only a single character carries no meaning and must be noise; we delete such small boxes. Likewise, if a box is too short compared with the other boxes, we also treat it as noise: we delete the short boxes and retain the long ones.
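The two validation heuristics can be sketched as a simple filter over candidate boxes. The thresholds below are our own illustrative choices, not values from the report:

```python
def validate_boxes(boxes, min_width=30, min_height=8, short_ratio=0.4):
    """Filter candidate text boxes (x, y, w, h).

    Heuristic 1: drop boxes too small to hold more than one character.
    Heuristic 2: drop boxes much shorter than the longest surviving box."""
    # Heuristic 1: small-box removal
    boxes = [b for b in boxes if b[2] >= min_width and b[3] >= min_height]
    if not boxes:
        return []
    # Heuristic 2: retain long boxes relative to the longest one
    longest = max(w for _, _, w, _ in boxes)
    return [b for b in boxes if b[2] >= short_ratio * longest]
```

Comparing each box against the longest one makes the second heuristic scale-independent, so it works across frames with differently sized captions.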

3.3 Text Area Detection


After approximately locating the text with the results of stroke detection, we need to mark the locations of the text. We mark locations with points, so one text string may have several corresponding points indicating its general location. For each point we select a detecting area of fixed size; one detecting area corresponds to one point. If a detecting area is hit by both horizontal stroke detection and vertical stroke detection, we draw a point to mark that detecting area. The points then show the areas where we believe the text is located. Figure6 shows the text areas detected for the same text processed by the stroke detection described in Section 3.2.
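The detecting-area rule above, requiring hits from both stroke directions, can be sketched as follows. The grid cell size and data representation are our own assumptions for illustration:

```python
def text_area_points(h_hits, v_hits, width, height, cell=16):
    """Divide the frame into fixed-size detecting areas (cell x cell).
    A cell yields a text-area point only if it contains hits from BOTH
    horizontal and vertical stroke detection (sets of (row, col) pixels)."""
    points = []
    for cy in range(0, height, cell):
        for cx in range(0, width, cell):
            cell_pixels = {(y, x)
                           for y in range(cy, min(cy + cell, height))
                           for x in range(cx, min(cx + cell, width))}
            if cell_pixels & h_hits and cell_pixels & v_hits:
                points.append((cy + cell // 2, cx + cell // 2))  # cell centre
    return points
```

Requiring both directions suppresses isolated horizontal or vertical clutter, since real characters contain strokes in both orientations.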

4. Experimental Results
We evaluate our system with two measures. The first is accuracy, the fraction of detected text regions that are correct. The second is completeness, the fraction of all correct regions actually captured by the detection. We tested our system on 30 different images, which is not a large sample set. Of these 30 images, we obtained perfect detection on more than 20, meaning both accuracy and completeness were 100%. The average accuracy over the 30 images is about 86% and the average completeness is about 98%. To obtain more evaluation results, we need larger sample sets and more tests on them. The interface of our system is shown in Figure8. Figure9 ~ Figure15 show each step of obtaining the final detected text region; Figure15 shows the final result.
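The two measures correspond to precision and recall over detected regions. A minimal sketch, using our own names rather than anything from the report:

```python
def accuracy_and_completeness(detected_correct, detected_total, ground_truth_total):
    """accuracy = correct detections / all detections (precision);
    completeness = correct detections / all ground-truth regions (recall)."""
    accuracy = detected_correct / detected_total
    completeness = detected_correct / ground_truth_total
    return accuracy, completeness
```

The two measures trade off against each other: aggressive box deletion in the validation step raises accuracy but risks lowering completeness.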

Figure6. Text area detection

3.4 Bounding Box Detection


Based on the result of text area detection, we can draw the approximate bounding box of a text region from the points marking the text area. We scan horizontally and vertically with a detection line: when no text-area point is hit by the detection line, meaning the text has ended, we close the bounding box. Figure7 shows the process of obtaining the bounding box: the image after Canny edge detection, the result after text area detection, and the resulting bounding box.

Figure7. Process of getting the bounding box
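The scan-line idea can be sketched as a vertical projection over the text-area points. For simplicity this hypothetical helper assumes points occur on every row of a text region, whereas the real system scans with a detection line over spaced grid points:

```python
def bounding_boxes(points):
    """Group text-area points (y, x) into boxes by scanning rows: consecutive
    rows containing points belong to one box; an empty row closes the box."""
    if not points:
        return []
    rows = sorted({y for y, _ in points})
    boxes, start = [], rows[0]
    for prev, cur in zip(rows, rows[1:]):
        if cur - prev > 1:          # gap in the vertical scan: close the box
            boxes.append(_box(points, start, prev))
            start = cur
    boxes.append(_box(points, start, rows[-1]))
    return boxes

def _box(points, y0, y1):
    """Tight bounding box (x_min, y_min, x_max, y_max) over a row band."""
    xs = [x for y, x in points if y0 <= y <= y1]
    return (min(xs), y0, max(xs), y1)
```

Each gap in the row projection starts a new box, so vertically separated caption lines naturally become separate bounding boxes.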

Figure8. System interface

Figure9. After Canny edge detection

Figure10. After long line removal

Figure11. After horizontal stroke detection

Figure12. After vertical stroke detection

Figure13. After text area detection

Figure14. After bounding box detection

Figure15. After text validation (final)

References
[1] Rainer Lienhart, "Video OCR: A Survey and Practitioner's Guide".
[2] Datong Chen, Jean-Marc Odobez, Hervé Bourlard, "Text Detection and Recognition in Images and Video Frames".
[3] Yan Hao, Zhang Yi, Hou Zeng-guang, Tan Min, "Automatic Text Detection in Video Frames Based on Bootstrap Artificial Neural Network and CED".
[4] Rainer Lienhart, Wolfgang Effelsberg, "Automatic Text Segmentation and Text Recognition for Video Indexing", Multimedia Systems 8 (2000), pp. 69-81.
[5] Jovanka Malobabi, Noel O'Connor, Noel Murphy, Sean, "Automatic Detection and Extraction of Artificial Text in Video".
[6] Datong Chen, Jean-Marc Odobez, Hervé Bourlard, "Text Detection and Recognition in Images and Video Frames", Pattern Recognition 37 (2004), pp. 595-608.
[7] Edward K. Wong, Minya Chen, "A New Robust Algorithm for Video Text Extraction", Pattern Recognition 36 (2003), pp. 1397-1406.
[8] Datong Chen, Jean-Marc Odobez, Jean-Philippe Thiran, "A Localization/Verification Scheme for Finding Text in Images and Video Frames Based on Contrast Independent Features and Machine Learning Methods", Signal Processing: Image Communication 19 (2004), pp. 205-217.
[9] Xiangrong Chen, Hongjiang Zhang, "Text Area Detection from Video Frames".
[10] Jie Xi, Xian-Sheng Hua, Xiang-Rong Chen, Liu Wenyin, Hong-Jiang Zhang, "A Video Text Detection and Recognition System".
[11] Laiyan Qing, Weiqiang Wang, Wen Gao, "Automatic Text Extraction and Recognition for Video Indexing and Retrieval".
[12] Bill Green, "Canny Edge Detection Tutorial", http://www.pages.drexel.edu/~weg22/can_tut.html
[13] Bill Green, "Sobel Edge Detection Tutorial", http://www.pages.drexel.edu/~weg22/edge.html

5. Conclusion
In this report, we described the implementation of a Video Text Detection System. We detect the text region using the Canny edge detection algorithm together with several heuristic approaches we propose: long line removal, horizontal and vertical stroke detection, text area detection, bounding box detection, and text validation. We tested our system on 30 sample video frames and obtained an average detection accuracy of 86% and an average completeness of 98%.

6. Acknowledgements
Special thanks go to our instructor, Prof. Jin Hyung Kim, who showed great passion and expert knowledge in teaching the course. By attending the course, we have grasped basic ideas and specific techniques in AI, which we believe will greatly benefit our future research.
