
World Applied Programming, Vol (2), Issue (5), May 2012.

308-314
Special section for proceeding of International E-Conference on Information Technology and Applications (IECITA) 2012

ISSN: 2222-2510
2011 WAP journal. www.waprogramming.com

Content Based Document Image Retrieval with Support Vectors Clustering

Maryam Habibi *
Kermanshah Branch,
Islamic Azad University,
Kermanshah, Iran
m.habibi60@gmail.com

Reza Azmi
Department of Computer Engineering,
Alzahra University,
Tehran, Iran.
azmi@alzahra.ac.ir

Abstract: This paper presents an approach to content-based document image retrieval. In the proposed
algorithm, a feature vector is extracted for each sub-word with the wavelet transform. Based on these
features, the sub-words are clustered with the support vector clustering (SVC) algorithm, and the resulting
clusters are used for keyword search in the content-based document retrieval problem. The retrieval
algorithm achieved a precision of 81.6% at a recall rate of 74.2% on a database of 20 pages.
Keywords: Document image retrieval, document feature vector, keyword spotting, wavelet transform,
support vector clustering
INTRODUCTION
Modern technology has made it possible to produce, process, transmit and store digital images efficiently.
Consequently, the amount of visual information is increasing at an accelerating rate in many diverse application areas.
A large share of this image data is related to text. For example, to achieve the paperless-office goal, many
hard copies are converted to digital versions and stored in document management systems. Document image retrieval
systems can be used in the many organizations that rely extensively on document image databases. To fully exploit
these databases, new content-based image retrieval techniques are required.

Previous Related Work


In recent years, there has been much interest in the research area of Document Image Retrieval (DIR) [1], [2]. For
example, Chen and Bloomberg [3] described a method for automatically selecting sentences and key phrases to create
a summary from an imaged document without any need to recognize the characters in each word. They built
word equivalence classes by using a rank blur hit-miss transform to compare word images, and used a statistical
classifier to determine the likelihood of each sentence being a summary sentence. The methods in [3,4,5,6,7] retrieve
document images by matching partial word images. The methods in [8,9,10,11,12,13,14,15,16] retrieve document
images by encoding them with character shape codes. Shijian Lu and Chew Lim Tan [17] present a document retrieval
technique that retrieves machine-printed Latin-based document images through word shape coding. Adopting the idea
of image annotation, they propose a word shape coding scheme that converts each word image into a word shape code
by using a few shape features. The works in [18,19] report a language identification technique based on a word shape
coding scheme, which converts each document image into an electronic document vector that captures the contents of
image documents efficiently. The works in [20,21] propose a fast and segmentation-free keyword spotting technique,
which is promising for facilitating better document image retrieval with more sophisticated IR models. Spitz and
Maghbouleh take a character shape coding approach to document image categorization [22]. Kise, Wuotang and
Matsumoto have presented a method of document image retrieval based on two-dimensional density distributions of
terms. Their method employs pseudo relevance feedback to expand an original query so as to improve retrieval
performance [23].
The remainder of this paper is organized as follows. Section 2 briefly presents the proposed system. Section 3
describes feature extraction with the wavelet transform. In Section 4, the proposed algorithm is discussed in two
parts, clustering and retrieval. Section 5 evaluates and compares the proposed algorithm, and finally Section 6
concludes the paper.


Maryam Habibi and Reza Azmi, World Applied Programming, Vol (2), No (5), May 2012.

Approach and Methods


The overall system structure is illustrated in Fig. 1. When a document image is presented to the system, it goes through
preprocessing, as in many document image processing systems. It is presumed that processes such as skew estimation
and correction [29], if applicable, and other image-quality-related processing are performed in the first module of the
system. Then all of the connected components in the image are found using a connected component analysis algorithm.
The representation of each connected component includes the coordinates and dimensions of its bounding box, among
other properties.

Figure 1: The block diagram of the key-word browsing system in image documents (blocks: user; key-word; convert
text to image; feature extraction; images of documents; preprocessing; extraction of connected components; feature
extraction; browsing feature vector; show result)

Because the proposed system uses only the bodies of sub-words, the dots of sub-words must be recognized and deleted.
In general, the height of a punctuation mark is less than that of a character. Furthermore, dots and straight lines often
have higher pixel densities than character objects. So, once the connected components have been retrieved, we can
apply filtering to remove objects whose width or height is smaller, or whose pixel density is higher, than pre-determined
thresholds. Most of the punctuation marks and noise are removed in this manner, and each remaining sub-word image
is taken as a data item for the next stage. In the proposed algorithm, the document image retrieval approach has 3 stages:
1-feature extraction
2-document clustering
3-key-word retrieval in document
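The connected-component filtering described in the preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of 8-connectivity, the list-of-lists binary image format, and the threshold parameters `min_height`, `min_width` and `max_density` are all assumptions made for the example.

```python
from collections import deque

def connected_components(image):
    """Find 8-connected groups of foreground (1) pixels in a binary image.

    Returns one record per component with its bounding box
    (top, left, height, width) and its pixel density inside that box.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right, pixels = r, r, c, c, 0
            while queue:
                y, x = queue.popleft()
                pixels += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            height, width = bottom - top + 1, right - left + 1
            components.append({
                "box": (top, left, height, width),
                "density": pixels / (height * width),
            })
    return components

def keep_sub_word_bodies(components, min_height, min_width, max_density):
    """Drop components that are too small or too dense (dots, punctuation, noise)."""
    return [
        comp for comp in components
        if comp["box"][2] >= min_height
        and comp["box"][3] >= min_width
        and comp["density"] <= max_density
    ]
```

In practice the thresholds would have to be tuned to the scan resolution and font size of the document collection.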
In the first stage we extract the features of the sub-words with the discrete wavelet transform (DWT) and save them.
In the next stage, using the extracted features, the sub-words are clustered with the SVC, SVCNN and SVISV
algorithms. In the SVC algorithm, the parameter q of the Gaussian kernel controls how spread out the images of the
data points are in feature space and therefore determines the size of the minimal sphere. Finding a small set of q values
for a given dataset is an important part of the SVC algorithm. We calculate the Gaussian kernel width with the
angle-decrement-based method in [24]. With this method, SVC can produce good clustering results with better
performance.
In the retrieval stage, after the user presents a query to the system, the query is segmented into its sub-words. Then the
images of the sub-words are obtained and their features are extracted with the wavelet transform, and the cluster of
each sub-word image is identified with the SVCNN algorithm. Finally, the documents containing the keyword are
retrieved for the user.
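For reference, the role of q can be read off the standard SVC formulation of Ben-Hur et al. [26]. The symbols below (the multipliers \beta_j, the soft-margin constant C, and the squared radius R^2) follow that formulation and are not defined elsewhere in this paper, so they should be taken as assumptions about the notation rather than as the authors' own equations.

```latex
% Gaussian kernel with width parameter q
K(x_i, x_j) = \exp\!\left(-q \,\lVert x_i - x_j \rVert^{2}\right)

% Dual problem solved for the Lagrange multipliers \beta_j
\max_{\beta}\; W = \sum_{j} \beta_j K(x_j, x_j) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \beta_j \le C, \qquad \sum_{j} \beta_j = 1

% Squared feature-space distance of a point x from the sphere centre
R^{2}(x) = K(x, x) - 2 \sum_{j} \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
```

In this formulation, increasing q makes the kernel more local, so the cluster contours fit the data more tightly and tend to split into more clusters.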
I. FEATURE EXTRACTION

The discrete wavelet transform can be applied to 2D signals such as images. When digital images are to be viewed or
processed at multiple resolutions, the discrete wavelet transform (DWT) is the mathematical tool of choice. In addition
to being an efficient, highly intuitive framework for the representation and storage of multiresolution images, the
DWT provides powerful insight into an image's spatial and frequency characteristics. The Fourier transform, on the
other hand, reveals only an image's frequency attributes.
Figure 2 shows the decomposition of an image into four lower-resolution (or lower-scale) components with the discrete
wavelet transform. The result of the wavelet transform is an approximation coefficient matrix and two detail coefficient
matrices, one at the first level and the other at the second level. The feature matrix of a sub-word image is large, and
this complicates the computation. To decrease the feature dimension, we use the energy, standard deviation and
entropy of the feature matrix. Therefore each data item has a feature vector with 8 dimensions, the eighth of which is
the number of the document that this sub-word belongs to.

Figure 2: Two basic decomposition steps for an image (A = approximation coefficients, H = horizontal detail
coefficients, V = vertical detail coefficients, D = diagonal detail coefficients)


x = [E_A, E_Details, St_A, St_Detail, Ent_A, [document number]]
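The feature computation described above can be illustrated with a small sketch. It is written under several assumptions that the text does not confirm: the Haar wavelet is used as the mother wavelet, energy is taken as the mean squared coefficient, standard deviation is the population standard deviation, and entropy is the Shannon entropy of the normalized squared coefficients. The function names are hypothetical, and the sketch does not reproduce the exact 8-dimensional vector used in the paper.

```python
import math

def haar_dwt2(image):
    """One level of the 2D Haar DWT for an image with even height and width.

    Returns (A, H, V, D): approximation, horizontal, vertical and diagonal
    detail coefficient matrices, each half the size of the input.
    """
    A, H, V, D = [], [], [], []
    for r in range(0, len(image), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(image[0]), 2):
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            row_a.append((p + q + s + t) / 2)  # scaled sum of the 2x2 block
            row_h.append((p + q - s - t) / 2)  # top row minus bottom row
            row_v.append((p - q + s - t) / 2)  # left column minus right column
            row_d.append((p - q - s + t) / 2)  # diagonal difference
        A.append(row_a)
        H.append(row_h)
        V.append(row_v)
        D.append(row_d)
    return A, H, V, D

def _flatten(matrix):
    return [value for row in matrix for value in row]

def energy(matrix):
    """Mean squared coefficient (assumed definition)."""
    values = _flatten(matrix)
    return sum(v * v for v in values) / len(values)

def std_dev(matrix):
    """Population standard deviation of the coefficients."""
    values = _flatten(matrix)
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def entropy(matrix):
    """Shannon entropy (bits) of the normalized squared coefficients (assumed definition)."""
    values = _flatten(matrix)
    total = sum(v * v for v in values)
    if total == 0:
        return 0.0
    probabilities = [v * v / total for v in values if v != 0]
    return -sum(p * math.log2(p) for p in probabilities)
```

A second decomposition level is obtained by calling `haar_dwt2` again on the returned approximation matrix; the three statistics of the chosen sub-bands, followed by the document number, then form the feature vector of a sub-word.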

II. PROPOSED ALGORITHM

We used support vector clustering to cluster documents. Support vector clustering (SVC) is a non-parametric clustering
algorithm that was proposed in 2001 [25],[26]. SVC is typically trained in batch learning mode.
Unlike parametric and hierarchical clustering algorithms, SVC uses Lagrange multipliers to obtain the support
vectors (SVs), which can be used to describe the contour of each cluster. However, as the training data grow
large, solving for the Lagrange multipliers becomes increasingly difficult. To surmount this, Ban and Abe
presented a Spatially Chunking Support Vector Clustering algorithm to speed up the learning process [27].
They segmented the original data set into several subsets. Each subset is trained by SVC, and the individual results
are then combined to obtain the global clustering.
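To make the role of the support vectors and Lagrange multipliers concrete, the sketch below evaluates the squared feature-space distance of a point from the sphere centre, following the standard SVC formulation of Ben-Hur et al. [26]: R^2(x) = K(x,x) - 2 sum_j beta_j K(x_j,x) + sum_ij beta_i beta_j K(x_i,x_j). It assumes the multipliers have already been found by a quadratic-programming solver, which is not shown; the function names and the hand-picked multipliers in the usage note are illustrative only.

```python
import math

def gaussian_kernel(x, y, q):
    """Gaussian kernel K(x, y) = exp(-q * ||x - y||^2)."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_radius(x, support_vectors, betas, q):
    """Squared feature-space distance from x to the centre of the enclosing sphere.

    support_vectors and betas are the trained SVs and their Lagrange
    multipliers (assumed to sum to 1).
    """
    cross_term = sum(
        beta * gaussian_kernel(sv, x, q)
        for sv, beta in zip(support_vectors, betas)
    )
    centre_term = sum(
        beta_i * beta_j * gaussian_kernel(sv_i, sv_j, q)
        for sv_i, beta_i in zip(support_vectors, betas)
        for sv_j, beta_j in zip(support_vectors, betas)
    )
    return gaussian_kernel(x, x, q) - 2 * cross_term + centre_term
```

For example, with support vectors `[(0.0, 0.0), (2.0, 0.0)]`, multipliers `[0.5, 0.5]` and `q = 1.0`, both support vectors lie at the same distance from the centre, while a distant point such as `(10.0, 0.0)` lies farther away and therefore outside the sphere.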

II.1. Using support vectors identifying support vectors for clustering

In this paper, to train on large data sets, we propose using an incremental learning mode algorithm called
SVISV [28]. Figure 3 depicts the SVISV algorithm in pseudocode.


Figure 3: The SVISV algorithm

III. KEYWORD RETRIEVAL IN DOCUMENT

In the retrieval stage, we use SVCNN. Figure 4 shows the architecture of SVCNN. This network is made up of four
layers: the input layer, the SVC layer, the calculation layer, and the output layer. The number of input weights of the
output layer depends on the number of clusters obtained after the training data have been processed by the SVISV
algorithm. In the input layer, the neurons labeled SVj represent the support vectors of each cluster found by the
SVISV algorithm. The corresponding weight of each SVj, obtained from training, is the Lagrange multiplier of that
SV. The neurons labeled "data" represent the same new data item, which takes part in the SVC calculation for every
cluster in the network. The second layer, the SVC layer, performs the SVC calculation on the new data together with
the original members of a cluster. This clustering results in a new radius of the sphere, which is recorded in the output
weight labeled Ri'. Another output weight of this layer is labeled Ri and represents the radius derived from the
original members of each cluster after training. The third layer, the calculation layer, computes the absolute difference
between Ri' and Ri, which come from the SVC layer. The fourth layer is the output layer, which reports whether the
new data belongs to one of the clusters built into the network. The decision criterion is the value of |Ri' - Ri|
compared to a given threshold T. If this value is greater than or equal to the threshold, it indicates that the sphere
in the feature space has been extended. According to the SVISV mechanism, this implies that the new data does not
belong to the cluster. On the other hand, we can conclude that the new data belongs to one of the existing clusters if
the value of |Ri' - Ri| is smaller than the threshold T.
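The decision rule of the calculation and output layers can be sketched as below. The sketch assumes that the new radius of each cluster (after re-running SVC with the new data item added) has already been computed elsewhere; the function and parameter names are hypothetical. The text does not say what happens when several clusters pass the threshold test, so picking the cluster with the smallest radius change is an assumption of this example.

```python
def assign_cluster(new_radii, trained_radii, threshold):
    """Return the index of the cluster the new data item belongs to, or None.

    new_radii[i]     : radius of cluster i after adding the new data item
    trained_radii[i] : radius of cluster i obtained from training
    A cluster accepts the item only if the radius changes by less than threshold.
    """
    best_index, best_change = None, None
    for index, (r_new, r_old) in enumerate(zip(new_radii, trained_radii)):
        change = abs(r_new - r_old)
        if change < threshold and (best_change is None or change < best_change):
            best_index, best_change = index, change
    return best_index
```

A return value of `None` corresponds to the case where the sphere of every cluster would have to be extended, that is, the new data item belongs to no existing cluster.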
The training methodology of the SVCNN is as follows. The SVISV algorithm has grouped the original data set
into n clusters, each of which consists only of SVs. We can regard each cluster as a new independent data set and then
apply SVC to re-cluster it. After clustering, every cluster acquires its own parameters, including the Lagrange
multiplier corresponding to each SV and the radius Ri.
In the retrieval stage, according to the SVCNN mechanism, once the cluster of the new data has been determined, we
find the documents whose data are in that cluster. If the new data does not belong to any of the clusters, we can
conclude that it does not exist in the current documents.
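The lookup from clusters to documents can be illustrated with a simple index. This is a sketch under assumptions: the data layout (pairs of cluster id and document number, the latter taken from the last component of each feature vector) and the function names are hypothetical, and combining the sub-words of a multi-sub-word query by intersecting their document sets is one plausible choice that the text does not spell out. It also does not check that the sub-words are adjacent on the page.

```python
def build_cluster_index(sub_word_records):
    """Map each cluster id to the set of document numbers that contain it.

    sub_word_records is an iterable of (cluster_id, document_number) pairs.
    """
    index = {}
    for cluster_id, document_number in sub_word_records:
        index.setdefault(cluster_id, set()).add(document_number)
    return index

def retrieve_documents(index, query_clusters):
    """Return documents containing every sub-word cluster of the query.

    query_clusters holds one cluster id per query sub-word; None means the
    sub-word matched no cluster, in which case nothing is retrieved.
    """
    if not query_clusters:
        return set()
    if any(c is None or c not in index for c in query_clusters):
        return set()
    result = set(index[query_clusters[0]])
    for cluster_id in query_clusters[1:]:
        result &= index[cluster_id]
    return result
```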


Figure 4: The architecture of SVCNN

IV. RESULTS

One of the common performance measures for retrieval algorithms is precision versus recall. Precision is the
proportion of correct retrievals (r) out of the total number of words retrieved (n). Recall is the ratio of the number
of correct retrievals (r) to the total number of instances of the keyword in the database (m).

Precision = r/n    (5)

Recall = r/m    (6)

We tested our algorithm on 20 pages from different papers in the Persian language with similar fonts. Table 1 and
Table 2 show the precision and recall for ten different words that have 1 and 2 sub-words, respectively. Table 3
shows the precision and recall of different algorithms.
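Equations (5) and (6) translate directly into code. The sketch below is illustrative; the function name and the choice to return 0.0 when a denominator is zero are assumptions, and the counts in the usage note are made up for the example rather than taken from the experiments.

```python
def precision_recall(correct, retrieved, relevant):
    """Compute precision r/n and recall r/m.

    correct   : number of correct retrievals (r)
    retrieved : total number of words retrieved (n)
    relevant  : total number of keyword instances in the database (m)
    """
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall
```

For instance, if a query returns 10 words of which 8 are correct, and the keyword occurs 16 times in the database, then precision is 8/10 = 0.8 and recall is 8/16 = 0.5.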

Table 1- Precision and Recall of 10 Persian sub-words with length 1

           Precision%   Recall%
Q1         100          96.78
Q2         98.88        99.66
Q3         81.76        60.39
Q4         68.60        53.93
Q5         40.00        61.62
Q6         100          86.80
Q7         45.32        58.66
Q8         100          86.78
Q9         100          79.66
Q10        90.52        73.46
Average    82.50        75.77


Table 2- Precision and Recall of 10 Persian sub-words with length 2

           Precision%   Recall%
Q1         87.62        97.33
Q2         82.00        100
Q3         100          66.00
Q4         38.38        39.67
Q5         92.40        90.69
Q6         100          43.00
Q7         68.59        75.33
Q8         62.92        70.67
Q9         85.51        100
Q10        90.00        43.00
Average    80.74        72.56

Table 3- Precision and Recall of some algorithms

Average (P-R)
Algorithm    Precision%    Recall%

90.33

65.13

87.16

60.63

74.91

94.36

76.92

87.74

32.38

58.12

48.52

93.75

N_gram(VTD,HTD) English
N_gram(VTD,HTD) Chinese
Primitive string and partial word image matching
Weighted Hausdorff distance
Pseudo relevance feedback
Word shape

V. DISCUSSION AND CONCLUSION

We have presented a new method of content-based document image retrieval with support vector clustering. In the
feature extraction stage, the discrete wavelet transform (DWT) provides good features of an image's spatial and
frequency characteristics. The reason for using a clustering algorithm is to decrease the number of data items
considered to the number of clusters. Document clustering is a fundamental operation used in unsupervised document
organization, automatic topic extraction, and information retrieval. It provides a structure for organizing documents
for efficient browsing, searching and information retrieval. The SVC and SVISV algorithms do not need the number of
clusters to be given in order to cluster a data set. In general, we need a feature of each cluster in order to allocate new
data to it, so we used the radius of the sphere for this purpose. Another advantage of this algorithm is that it is able to
find several consecutive sub-words, such as a word made up of 4 sub-words. This algorithm is also language
independent.
A defect of the SVC algorithm is that solving the Lagrange equations is slow, so the run time is long. SVC cannot
cluster large data sets, so we used SVISV to remedy this problem. The proposed algorithm also cannot recognize
different forms of a word: if the search word appears in another form, the algorithm cannot find it. The algorithm has
only been tested on documents with similar fonts. We propose that it be tested on different fonts. We also propose
using statistical models such as Markov models to find sequential sub-words.


REFERENCES
[1] Doermann, D. 1998. The indexing and retrieval of document images: A survey. Computer Vision and Image Understanding, 70(3):287-298.
[2] Mitra, M. and Chaudhuri, B. 2000. Information retrieval from documents: A survey. Information Retrieval, 2(2/3):141-163, August.
[3] Chen, F. R. and Bloomberg, D. S. 1998. Summarization of imaged documents without OCR. Computer Vision and Image Understanding, pages 307-319.
[4] Bloomberg, D. S. and Chen, F. R. 1996. Extraction of text-related features for condensing image documents. The International Society for Photo-Optical Instrumentation Engineering Document Recognition III, pages 72-88.
[5] Chen, F., Wilcox, R. L. and Bloomberg, D. 1993. Detecting and locating partially specified keywords in scanned images using Hidden Markov Models. International Conference on Document Analysis and Recognition, pages 133-138.
[6] Hull, J. and Cullen, J. 1997. Document image similarity and equivalence detection. 4th International Conference on Document Analysis and Recognition, pages 308-312.
[7] Hull, J. 1994. Document image matching and retrieval with multiple distortion invariant descriptors. IAPR Workshop on Document Analysis Systems, pages 383-399.
[8] Lu, Y. and Tan, C. L. 2004. Information retrieval in document image databases. 16(11):1398-410.
[9] Smeaton, F. and Spitz, A. L. 1997. Using Character Shape Coding for Information Retrieval. Proc. Fourth Intl Conf. Document Analysis and Recognition, pp. 974-97.
[10] Nakayama, T. 1994. Modeling content identification from document images. Fourth Conference on Applied Natural Language Processing, pages 22-27.
[11] Nakayama, T. 1996. Content-oriented categorization of document images. International Conference on Computational Linguistics, pages 818-823.
[12] Smeaton, A. F. and Spitz, A. L. 1997. Using character shape coding for information retrieval. 4th International Conference on Document Analysis and Recognition, pages 974-978.
[13] Spitz, A. L. and Maghbouleh, A. 1999. Text categorization using character shape codes. SPIE Symposium on Electronic Imaging Science and Technology, pages 174-181.
[14] Spitz, A. L. 1995. Using character shape codes for word spotting in document images. Shape, Structure and Pattern Recognition, pages 382-389.
[15] Tan, C. L., Huang, W., Yu, Z. and Xu, Y. 2002. Image document text retrieval without OCR. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):838-844.
[16] Tan, C. L., Huang, W., Yu, Z. and Xu, Y. 2003. Text retrieval from document images based on word shape analysis. Applied Intelligence, 18(3):257-270.
[17] Lu, S. and Tan, C. L. 2007. Retrieval of machine-printed Latin documents through word shape coding. The journal of the pattern recognition society (Elsevier).
[18] Lu, S. and Tan, C. L. 2006. Script and language identification in degraded and distorted document images. In 21st National Conference on Artificial Intelligence, pages 769-774.
[19] Lu, S. and Tan, C. L. Script and language identification in noisy and degraded document images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] Li, L., Lu, S. and Tan, C. L. 2004. A Fast Keyword-Spotting Technique. Department of Computer Science, National University of Singapore, Kent Ridge, Singapore 117543.
[21] Zhang, L. and Tan, C. L. 2004. A Word Image Coding Technique and its Applications in Information Retrieval from Imaged Documents. School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543.
[22] Spitz, A. L. and Maghbouleh, A. 1995. Text Categorization using Character Shape Codes. Document Recognition Technologies, Inc., 616 Ramona Street, Suite 20, Palo Alto, California 94301.
[23] Kise, K., Wuotang, Y. and Matsumoto, K. 2003. Document Image Retrieval Based on 2D Density Distributions of Terms with Pseudo Relevance Feedback. Proceedings of the Seventh International Conference on Document Analysis and Recognition, IEEE.
[24] Widyanto, M. R. and Hartono, H. 2008. Angle Decrement Based Gaussian Kernel Width Generator for Support Vector Clustering. Asian Journal of Information Technology, 7(8):388-393.
[25] Ben-Hur, A., Horn, D., Siegelmann, H. T. and Vapnik, V. 2000. A support vector clustering method. In Proceedings of the 15th International Conference on Pattern Recognition, vol. 2, pp. 724-727.
[26] Ben-Hur, A., Horn, D., Siegelmann, H. T. and Vapnik, V. 2001. Support vector clustering. Journal of Machine Learning Research, vol. 2, pp. 125-137.
[27] Ban, T. and Abe, S. 2004. Spatially chunking support vector clustering algorithm. In Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 1, pp. 25-29.
[28] Fahn, C. S. and Kuo, M. J. 2004. Support Vector Clustering Neural Network Used for Pattern Classification. Taipei 106, Taiwan, Republic of China.
[29] Azmi, R. and Kabir, E. 2001. A New Segmentation Technique for Omnifont Farsi Text. Pattern Recognition Letters, vol. 22, pp. 97-104.
[30] Akbari, M. and Azmi, R. 2005. Classification and retrieval document based on physical layout. In 11th conference of Iran Computer Association.
[31] Akbari, M. and Azmi, R. 2005. Indexing and retrieval information of imaged document database. In 14th conference of electronic engineering of Iran. Amir-Kabir univ.
[32] Lu, Sh., Li, L. and Tan, Ch. 2008. Document Image Retrieval through Word Shape Coding. IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 30, Issue 11, pages 1913-1918.
[33] Zhu, G. 2009. Logo Matching for Document Image Retrieval. In 10th International Conference on Document Analysis and Recognition, 2009. ICDAR '09.
