
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 4, July-August (2013), pp. 556-565
IAEME: Journal Impact Factor (2013): 6.1302 (Calculated by GISI)



Vilas Naik (1), Sagar Savalagi (2)

(1) Department of CSE, Basaveshwar Engineering College, Bagalkot, India
(2) Department of CSE, Basaveshwar Engineering College, Bagalkot, India

ABSTRACT

With the growing popularity of sites like YouTube, video recording and sharing have become widespread in the last several years. Unlike text documents, multimedia content is difficult to search and index. Hence content-based video retrieval systems are the need of the hour. Content-Based Video Retrieval (CBVR) is an active research discipline focused on computational strategies for finding relevant videos through multimodal content analysis, using visual, audio and text features to represent and index video. Recent research on CBVR has presented many solutions based on these features. The textual content of a video, in the form of embedded text and scene text, is quite helpful for indexing. The proposed work is a content-based video retrieval system based on textual cues. Text-based video retrieval enables search based on the textual information present in the video. Regions of textual information are identified within the frames of the video, and the video is then annotated with the textual content present in those frames. Traditionally, OCR is used to extract the text within the video. This also enables applications such as keyword-based search in multimedia databases. Video indexing and retrieval are carried out with the help of this text. Results show that the system is quite efficient, with an accuracy of around 90%. A textual query returns higher accuracy than visual queries, which supports the concept.

1. INTRODUCTION

With the development of various multimedia compression standards and significant increases in desktop computer performance and storage, the widespread exchange of multimedia information is becoming a reality. Video is arguably the most popular means of communication and entertainment. With this popularity comes an increase in the volume of video and an increased need for the ability to automatically sift through and search for relevant material stored in large video databases.
Even with the increase in hardware capabilities that makes video distribution possible, factors such as algorithm speed and storage costs are concerns that must still be addressed. Considering this, a first step should be an attempt to increase speed when using existing compression standards. Performing analysis in the compressed domain reduces the effort involved in decompression, and providing a means of abstracting the data keeps the storage cost of the resulting feature set low. Both of these problems are active areas of research. The aim of the proposed work is to develop a new detection algorithm that boosts the speed of search and in turn reduces the cost of storage.

Every day, both military and civilian equipment generates gigabytes of images. A huge amount of information is out there. However, it is impossible to access or make use of this information unless it is organized so as to allow efficient browsing, searching and retrieval. Image retrieval has been a very active research area since the 1970s, with the thrust coming from two major research communities, database management and computer vision. These two communities study image retrieval from different angles, one being text-based and the other visual-based. Many advances, such as data modelling, multidimensional indexing and query evaluation, have been made along this research direction. There exist two major difficulties, especially when the size of the image collection is large (tens or hundreds of thousands). One is the vast amount of labour required for manual image annotation. The other difficulty, which is more essential, results from the rich content of images and the subjectivity of human perception: for the same image content, different people may perceive it differently. The perception subjectivity and annotation impreciseness may cause unrecoverable mismatches in later retrieval processes. The proposed mechanism is a scheme aimed at alleviating these hurdles, with a new detection algorithm that boosts search speed and offers a retrieval system based on text.
The work proceeds in the following steps. Initially, frames are collected from the video clip. From these frames the text part is segmented. Character segmentation then isolates the individual characters. These characters are recognized by Optical Character Recognition (OCR). In order to increase the accuracy of identification, color features are additionally extracted from the video clip. These color features are combined with the text features and stored in the database. When the user enters a text query, it is matched against the stored characters and the matching videos are displayed.

2. RELATED WORK

Video retrieval is important in applications related to multimedia search engines, and recognizing text is a crucial task in such applications. Over the last decades researchers have proposed many different methods for video retrieval; some of the related work is summarized below.

An approach that enables search based on the textual information present in the video is introduced in [1]. In this method, regions of textual information are identified within the frames of the video. The video is then annotated with the textual content present in the images. An approach that enables matching at the image level, thereby avoiding OCR, is also addressed. Videos containing the query string are retrieved from a video database and sorted by relevance. Results are shown for video collections in English, Hindi and Telugu.

In [2] a method to automatically localize captions in JPEG compressed images and the I-frames of MPEG compressed videos is proposed. In this method, caption text regions are segmented from background images using their distinguishing texture characteristics. Unlike previously published methods, which fully decompress the video sequence before extracting the text regions, this method locates candidate caption text regions directly in the DCT compressed domain using the intensity variation information encoded in the DCT domain.
Therefore, only a very small amount of decoding is required.

The method in [3] is a news video retrieval solution that targets specific news videos based on their contents as described by overlay text. This approach uses overlay text, which conveys the direct meaning of the video, as a source of complementary information. The whole process is divided into two steps. First, metadata labels are built by detecting and extracting the overlay text. Second, these labels are used to index the news videos. The experiments are carried out on news videos from NDTV News and on a large data set of video images containing artificial text developed at the Image Processing Centre (IPC), a research facility at the National University of Sciences and Technology (NUST), Pakistan. The FFMPEG library is used to extract the frames from the news videos. An overlay scene is inserted on the video scene in the same way as overlay text, so a transition region is observed there as well.

In [4] the authors present three main components:
1. The integration of image and audio analysis results in identifying news segments.
2. Video OCR technology to detect text in frames, which provides a good source of textual information for story classification when transcripts and closed captions are not available.
3. Natural language processing (NLP) technologies, which are used to perform automated categorization of news stories based on the texts obtained from closed captions or the video OCR process.
Based on these video structure and content analysis technologies, two advanced video browsers are developed for home users: an intelligent highlight player and an HTML-based video browser.

An annotation-based indexing method that allows the user to retrieve video using textual annotations is proposed in [5]. It takes a text-based query and compares it with the tags used for indexing, and the event-based video is retrieved from a cricket video database. Experiments show that annotation-based event retrieval methods can potentially improve retrieval accuracy using different searching techniques, such as binary search or indexing, when the database is very large, so that video retrieval can be carried out efficiently with this type of retrieval system.

A technique has also been proposed to address problems in extracting text from a video and to design algorithms for each phase of text extraction using Java libraries and classes.
In this technique, the input video, which may be real-time or taken from a database, is first split into a stream of images using the Java Media Framework (JMF). Pre-processing algorithms are then applied to convert each image to gray scale and to remove disturbances (superimposed lines over the text, discontinuities and dots). These are followed by the algorithms for localization, segmentation and recognition, the last of which uses a neural network pattern matching technique. The performance of the approach is demonstrated by presenting experimental results for a set of static images.

"Improving Multimedia Retrieval with a Video OCR" presents a set of experiments with a video OCR system (VOCR) tailored for video information retrieval and establishes its importance in multimedia search in general and for some specific queries in particular. The method in [7] details the analysis of video frames that produces candidate text regions. The text regions are then binarized and sent to a commercial OCR, resulting in ASCII text that is finally used to create search indexes. The system is evaluated using the TRECVID data. The effectiveness of various textual sources for multimedia retrieval is evaluated by combining the VOCR outputs with automatic speech recognition (ASR) transcripts. For general search queries, the VOCR system coupled with ASR sources outperforms the other systems by a large margin. For search queries that involve named entities, especially people's names, the VOCR system even outperforms speech transcripts, demonstrating that source selection for particular query types is essential.

Another important consideration is the quality and complexity of the pictures containing text that are used for evaluation. Some methods consider large fonts in images, advertisements and video clips. The methods also have limitations: the method in [8] does not detect low-contrast text and small fonts, and the techniques in [9] use text with different complex motions.
The methods in [10] and [11] detect only caption text in news video clips. The proposed work extracts text from video frames by separating the text region from the background and employs conventional OCR for text recognition.

3. PROPOSED ALGORITHM FOR VIDEO RETRIEVAL

In this section, an overview and a detailed description of all the blocks of the proposed system are given.


3.1 Overview of the Approach

The proposed mechanism is a scheme that offers a video retrieval system based on embedded text. The method uses the information conveyed by embedded text to identify the videos to be retrieved from a collection in response to a text query. The mechanism matches the query against the text present in the video frames, using the features described in this section. First, frames are extracted from the video and the text part is segmented. Character segmentation extracts the characters, and character recognition recognizes them. Color features are extracted from the video scene. The color features, combined with the text features, are stored in the database. The user inputs a text query, which is matched against the stored characters, and the matching videos are displayed. The overall flow is shown in Figure 1.

Fig. 1 Proposed algorithm for video retrieval by a text query

3.2 The Text Query Based Video Retrieval Algorithm

The proposed algorithm is summarized in the following steps.
Step 1. Input a video and convert it into frames.
Step 2. Apply a median filter to each frame and perform Sobel edge detection to detect the edges of text regions in the frame. Then calculate the sum graph, i.e. the row and column sums of the binary image.
Step 3. Perform text region segmentation by applying a threshold computed as
Threshold = (sum(sum(B'))/prod(size(sum(B')))*50 + max(max(sum(B')))*30)/100
where B' = input image.
Step 4. Apply OCR to recognize the text characters in the frames. The recognized characters and the color features are stored in the database as text features. Normalize characters to size 32x32.
Step 5. Given a text query, extract its characters. Match them, in one direction, against the character set associated with each video. Calculate the total number of character matches for each video.
Step 6. Retrieve the videos with the highest number of matches.

3.3 Text region localization

As a first step, frames are extracted individually from the videos in the collection. Each video frame is converted into an image, because a video frame is stored in a compressed format and must be decoded before it can be processed as an image. The image is then converted to greyscale. A median filter is now applied to the image; the output of the median filter is shown in fig 4.2. The median filter considers each pixel in the image in turn and looks at its nearby neighbours to decide whether or not it is representative of its surroundings. Instead of simply replacing the pixel value with the mean of the neighbouring pixel values, it replaces it with the median of those values. The median is calculated by first sorting all the pixel values from the surrounding neighbourhood into numerical order and then replacing the pixel being considered with the middle pixel value. Next, the Sobel operator is used. It is an edge detection technique which, applied to the greyscale image, detects the edges of text regions.

3.3 Text detection and Segmentation

After the text region is localized, the text area is segmented for further recognition. The output of this step is a binary image where black text characters appear on a white background.
This stage extracts the actual text regions as follows. A median filter is applied again, this time to the edge-detected image, which gives a smooth image. The vertical and horizontal histograms are then taken. The horizontal and vertical histograms represent the column-wise and row-wise histograms respectively. These histograms represent the sum of differences of gray values between neighbouring pixels of the image, column-wise and row-wise.

First, the horizontal correction is calculated. To find the horizontal correction, the algorithm traverses each column of the image. In each column, the algorithm starts with the second pixel from the top and calculates the difference between the second and first pixels. If the difference exceeds a certain threshold, it is added to the total sum of differences. The algorithm then moves downwards to calculate the difference between the third and second pixels, and so on until the end of the column, accumulating the total sum of differences between neighbouring pixels. At the end, an array containing the column-wise sums is created. The same process is carried out to find the vertical correction; in this case, rows are processed instead of columns. A threshold value is then calculated from the normalized sum as shown below.

Threshold = (sum(sum(B'))/prod(size(sum(B')))*50 + max(max(sum(B')))*30)/100
where B' = input image.

The rows and columns that satisfy the threshold are retained; these give the rows and columns where text appears. The text block is then extracted, as shown in Fig. 2(d), and the resulting image is stored in a result folder. All regions are extracted separately. The sum graph is then computed on each region, and its maxima are used to extract the characters, which are normalized to size 32x32.
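The localization and segmentation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists rather than an image library, and the 3x3 median window, the |gx| + |gy| approximation of the Sobel gradient magnitude, the edge threshold of 128 and all function names are assumptions made for illustration. The threshold expression is read here as MATLAB code, in which sum(B') yields a vector of sums, so the threshold works out to 50% of the mean plus 30% of the maximum of a sum-graph profile.

```python
from statistics import median


def median_filter3(img):
    """3x3 median filter on a 2-D list; border pixels are left unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


def sobel_edges(img, edge_thresh=128):
    """Binary edge map: 1 where |gx| + |gy| exceeds edge_thresh (assumed value)."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            if abs(gx) + abs(gy) > edge_thresh:
                edges[y][x] = 1
    return edges


def sum_graph(binary):
    """Row sums and column sums of a binary image."""
    row_sums = [sum(row) for row in binary]
    col_sums = [sum(col) for col in zip(*binary)]
    return row_sums, col_sums


def profile_threshold(profile):
    """(mean * 50 + max * 30) / 100, following the threshold expression."""
    return (sum(profile) / len(profile) * 50 + max(profile) * 30) / 100


def text_band(profile):
    """First and last index whose sum exceeds the threshold, or None."""
    t = profile_threshold(profile)
    hits = [i for i, v in enumerate(profile) if v > t]
    return (hits[0], hits[-1]) if hits else None
```

Applying text_band to the row profile and then to the column profile of an edge map gives a candidate bounding box for a text block. A frame with several text lines would need the profile to be split into multiple bands, which this sketch does not attempt.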







Fig. 2 Overview of text detection and segmentation: (a) original frame, (b) gray scale image with noise reduction and edge detection, (c) feature vector graph when text is detected in the frame, (d) detected text.

3.4 Text Recognition with Optical Character Recognition (OCR)

This stage performs the actual recognition of the extracted characters by combining the various features extracted in previous stages to give the actual text. The output of the segmentation stage is given as input to this stage. Here Optical Character Recognition (OCR) takes an input image and recognizes the characters in it. When a text image is given as input to the OCR, the image undergoes four stages of processing: pre-processing, feature extraction, classification and post-processing. Of these four stages, feature extraction is the most important, since it is on the basis of the extracted features that the OCR is able to recognize characters. We have used template matching for feature extraction; this is one of the simplest approaches to pattern recognition.

Template matching: This process involves the use of a database of characters or templates. There exists a template for every possible input character. For recognition to occur, the current input character is compared to each template to find either an exact match or the template with the closest representation of the input character. If I(x, y) is the input character and Tn(x, y) is template n, then the matching function s(I, Tn) returns a value indicating how well template n matches the input character. The outputs generated by the OCR are ASCII characters, which are used as keywords for future indexing and retrieval.

Fig. 3(a) shows a region identified as a text block. It is separated out from the rest of the image and binarized. When this detected block is given as input to the OCR, the corresponding ASCII output is as shown in Fig. 3(c). It is observed that while the text extraction part of the system detects the text blocks accurately even against a complex background, the OCR recognizes about 90% of the text correctly. As seen in Fig. 3(d), some words were misrecognized due to the presence of noise. The mean and standard deviation of the R, G and B components of the frames are also extracted, and these color features are stored in the text database along with the text features.
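The template matching function s(I, Tn) and the color features can be sketched as follows. The paper does not specify the form of s, so the score used here, the fraction of pixels on which two equal-sized binary grids agree, is an assumption, as are the function names and the use of the population standard deviation. The toy 3x3 glyphs stand in for the 32x32 normalized characters.

```python
from statistics import mean, pstdev


def match_score(char, template):
    """Assumed s(I, Tn): fraction of pixels on which the two grids agree."""
    total = 0
    agree = 0
    for row_c, row_t in zip(char, template):
        for pc, pt in zip(row_c, row_t):
            total += 1
            agree += int(pc == pt)
    return agree / total


def recognize(char, templates):
    """Return (label, score) of the template that matches `char` best."""
    best = max(templates, key=lambda label: match_score(char, templates[label]))
    return best, match_score(char, templates[best])


def color_features(pixels):
    """Mean and standard deviation of R, G, B over a list of (r, g, b) tuples."""
    channels = list(zip(*pixels))
    means = [mean(c) for c in channels]
    stds = [pstdev(c) for c in channels]  # population std is an assumption
    return means + stds


# Hypothetical 3x3 templates; real templates would be 32x32.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

noisy_i = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]  # one pixel flipped by noise
label, score = recognize(noisy_i, TEMPLATES)
print(label, round(score, 3))  # I 0.889
```

A single noisy pixel lowers the score without changing the winning template. When noise flips enough pixels that another template scores higher, the character is misrecognized, which is the failure mode reported for Fig. 3(d).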







Fig. 3 (a) Frame containing text. (b) Original frame. (c) Text extraction using OCR. (d) Text recognition by OCR.

3.4 Text querying

A text query entered by a user is processed as shown in Fig. 4: the query text is extracted and recognized and then sent to the matching process, which is the next stage, as shown in fig 3.3. In the database, each video has its own character set, recognized by the OCR. The matching process has direct access to the database, as shown in fig 3.3. The character set associated with each video is stored in the database together with the color features (mean and standard deviation) extracted at the first level, during frame extraction. The process matches the query characters against each character set in one direction: it matches a character C, followed by R, and so on, comparing characters from the query with characters from the video text dataset. It then calculates the total number of character matches for each video and displays the names of the videos with the highest match counts, as shown in Fig. 5.

[Fig. 4 block labels: Query text recognition, Database, Matching process, Recognized text from video, Video names]

Fig 4. Block diagram of query processing
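The query matching stage can be sketched as below. The paper only states that characters are matched "in one direction" and that matches are totalled per video, so the left-to-right scan used here is one possible reading, not the authors' implementation. The function names, the case-insensitive comparison and the video file names are hypothetical.

```python
def ordered_match_count(query, text):
    """Count query characters found in `text` while scanning left to right only."""
    text = text.upper()
    pos = 0
    matches = 0
    for ch in query.upper():
        found = text.find(ch, pos)
        if found != -1:
            matches += 1
            pos = found + 1  # never look back: matching runs in one direction
    return matches


def rank_videos(query, index):
    """Return (video name, match count) pairs, highest count first."""
    scores = {name: ordered_match_count(query, text)
              for name, text in index.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index: video name -> text recognized by OCR in its frames.
index = {
    "match1.mpg": "CRICKET LIVE",
    "news.mpg": "BREAKING NEWS",
}

print(rank_videos("cricket", index))
# [('match1.mpg', 7), ('news.mpg', 3)]
```

Because unrelated text can still share several characters with the query, a raw count like this produces non-zero scores for irrelevant videos. This is one reason a secondary check, such as the color features, can help filter the ranked list.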



Fig 5. Result of query

4. EXPERIMENTAL RESULTS AND DISCUSSIONS

This section presents quantitative results on the performance of the text extraction system. Performance is measured in terms of true positives (TP), text regions identified correctly as text regions; false positives (FP), non-text regions identified as text regions; and false negatives (FN), text regions missed by the system. Using these basic definitions, recall and precision of retrieval are defined as follows:

Recall = TP/(TP+FN) and Precision = TP/(TP+FP)

While the above definitions are generic, different researchers use different units of text for calculating recall and precision. Wong and Chen consider the number of characters, while some other authors count the number of text boxes or text regions. Jain and Yu calculate recall and precision by considering either characters or blocks, depending on the type of image. This work adopts the second definition, in which text regions are the units for counting. The ground truth is obtained by manually marking the correct text regions. Recall and precision were calculated on a large number of text-rich images. For video processing, the system was tested on different types of MPEG videos such as news clips, sports clips and commercials. The videos contain both caption text and scene text of different fonts, colors and intensities. Table 1 shows the performance of our proposed method on four types of video. It is seen that our method has an overall average recall of 82% and precision of 87%. The method is able to detect text under a large number of different conditions, such as text with small fonts, low intensity, different colors and cluttered backgrounds, text from noisy video, news captions with horizontal scrolling, and frames with both caption text and scene text.

Table 1 Recall and precision of text block extraction

Video type      No. of text blocks   TP    FP   FN   Recall %   Precision %
SPORTS VIDEO    780                  624   60   24   80%        92%

Where TP= True positive, FP= False positive, FN= False negative
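As a worked check of the definitions above, the short sketch below evaluates them on the SPORTS VIDEO row of Table 1, using the values exactly as printed. The function names are illustrative. With the printed counts, precision comes out close to the reported 92%, while the reported recall of 80% corresponds to TP divided by the number of text blocks rather than to TP/(TP+FN) with the printed FN. The sketch only reproduces the arithmetic and does not settle which figure was intended.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# Values as printed in the SPORTS VIDEO row of Table 1.
blocks, tp, fp, fn = 780, 624, 60, 24

print(round(precision(tp, fp), 3))  # 0.912, close to the reported 92%
print(round(recall(tp, fn), 3))     # 0.963 when the printed FN is used
print(round(tp / blocks, 3))        # 0.8, which matches the reported 80%
```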


Table 2 Execution time for retrieval

Videos with different background   Text extraction           OCR      Retrieval   Total time in sec
Complex                            57 sec for 100 frames     20 sec   1.55 sec    1:08:55 sec
Plain                              23.78 sec for 60 frames   10 sec   1.20 sec    00:34:98 sec

The primary advantage of the proposed method is that it is very fast, since most of the computationally intensive algorithms are applied only to the regions of interest. Table 2 shows the processing time for different types of video clips on a 1.83 GHz Intel Core 2 Duo machine. As shown, the total time required by the algorithms, including retrieval, is 1:08:55 sec for a complex background and roughly half of that for a simple background. The average is taken over a number of different image sizes. Frames are processed at a rate of about 5.6 per second, OCR takes about 20 sec for a complex background and 10 sec for a simple one, and retrieval takes about 1.55 sec. The algorithm therefore requires little time for processing each frame and for retrieval.

5. CONCLUSION

The proposed work uses the textual content of a video as the basis for a retrieval system that extracts text from video, recognizes the text in the resulting images, and then matches the text in the database against the query text. Besides this matching, the system performs matching based on color features, so that irrelevant videos are not retrieved. The proposed work uses a median filter and the Sobel operator for text region localization, histograms for text segmentation, and OCR for recognition of embedded text in sports video. Results show significant efficiency in detection, with 80% recall and 92% precision for text regions. The time taken for retrieval is 1.55 sec for a complex background and 1.20 sec for a simple background. The system can be further improved by implementing a better OCR technique, with the goal of 100% accuracy in text recognition from videos, which would significantly improve the quality of the process.

REFERENCES

[1] C. V. Jawahar, Balakrishna Chennupati, Balamanohar Paluri and Nataraj Jammalamadaka, "Video Retrieval Based on Textual Queries", 2006.
[2] Yu Zhong, Hongjiang Zhang and Anil K. Jain, "Automatic Caption Localization in Compressed Video", IEEE Transactions on Pattern Analysis and Machine Intelligence, April 2000.
[3] Nilesh Bhojne, Pravinkumar Kamde and Dr. S. P. Algur, "News Video Indexing and Retrieval using Overlay Text", 2012.
[4] Wei Qi, Lie Gu, Hao Jiang, Xiang-Rong Chen and Hong-Jiang Zhang, "Integrating Visual, Audio and Text Analysis for News Video", 1998.
[5] Shi-Yong Neo, Jin Zhao, Min-Yen Kan and Tat-Seng Chua, "Video Retrieval using High Level Features: Exploiting Query Matching and Confidence-based Weighting", 1998.
[6] Pranali Kosamkar, Vikram Wathodkar and Rajendra Shinde, "Annotation Based Event Retrieval in Cricket Video", International Journal of Advances in Computing and Information Researches, April 2012.
[7] Jayshree Ghorpade, Raviraj Palvankar, Ajinkya Patankar and Snehal Rathi, "Extracting Text From Video", Signal & Image Processing: An International Journal (SIPIJ), June 2011.

[8] D. Xu and Shih-Fu Chang, "Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment", IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[9] H-K. Kim, "Efficient Automatic Text Location Method and Content-Based Indexing and Structuring of Video Database", Journal of Visual Communication and Image Representation, Dec 1996.
[10] H. Li, D. Doerman and O. Kia, "Automatic Text Detection and Tracking in Digital Video", IEEE Transactions on Image Processing, Jan. 2000.
[11] T. Sato, T. Kanade, E. Hughes and M. Smith, "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions", Multimedia Systems, Vol. 7, pp. 385-394, 1999.
[12] Vilas Naik, Prasanna Patil and Vishwanath Chikaraddi, "Action Event Retrieval from Cricket Video using Audio Energy Feature for Event Summarization", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 267-274, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[13] Vilas Naik, Vishwanath Chikaraddi and Prasanna Patil, "Query Clip Genre Recognition using Tree Pruning Technique for Video Retrieval", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 257-266, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[14] Vilas Naik and Raghavendra Havin, "Entropy Features Trained Support Vector Machine Based Logo Detection Method for Replay Detection and Extraction from Sports Videos", International Journal of Graphics and Multimedia (IJGM), Volume 4, Issue 1, 2013, pp. 20-30, ISSN Print: 0976-6448, ISSN Online: 0976-6456.