308-314
Special section for proceeding of International E-Conference on Information Technology and Applications (IECITA) 2012
ISSN: 2222-2510
2011 WAP journal. www.waprogramming.com
Reza Azmi
Kermanshah Branch,
Islamic Azad University,
Kermanshah, Iran
m.habibi60@gmail.com
Abstract: The goal of this paper is to present a suitable approach to content-based document image
retrieval. In the proposed algorithm, a feature vector is extracted for each sub-word using the wavelet transform.
Based on these features, the sub-words are clustered with the support vector clustering (SVC) algorithm, and the
resulting clusters are then used for keyword-based search in the content-based document retrieval problem. The retrieval
algorithm achieved a precision of 81.6% at a recall rate of 74.2% on a database of 20 pages.
Keywords: Document image retrieval, Document feature vector, Keyword spotting, Wavelet transform,
Support vector clustering
INTRODUCTION
Modern technology has made it possible to produce, process, transmit and store digital images efficiently.
Consequently, the amount of visual information is increasing at an accelerating rate in many diverse application areas.
A large share of this image data is related to text. For example, to achieve the goal of a paperless office, many
hard copies are converted to digital versions and stored in document management systems. Document image retrieval systems
can be used in many organizations that rely extensively on document image databases. To fully exploit them,
new content-based image retrieval techniques are required.
Maryam Habibi and Reza Azmi, World Applied Programming, Vol (2), No (5), May 2012.
[Figure: block diagram of the system. Blocks: Images of documents; Key-word; Preprocessing; Extraction of connected components; Feature extraction (two blocks); Show result]
FEATURE EXTRACTION
The discrete wavelet transform (DWT) can be applied to 2D signals such as images. When digital images are to be viewed or
processed at multiple resolutions, the DWT is the mathematical tool of choice. In addition
to being an efficient, highly intuitive framework for the representation and storage of multiresolution images, the
DWT provides powerful insight into an image's spatial and frequency characteristics. The Fourier transform, on the other
hand, reveals only an image's frequency attributes.
Figure 2 shows the decomposition of an image into four lower-resolution (lower-scale) components with the discrete
wavelet transform. The result of the wavelet transform is an approximation coefficient matrix and two detail coefficient
matrices, one from the first level and the other from the second level. This feature matrix of a sub-word image is large,
and its size complicates the computation. To reduce the feature dimension, we use the energy, standard deviation
and entropy of the feature matrix. Each data item therefore has a feature vector with 8 dimensions, of which the eighth
Figure 2: Two basic decomposition steps for an image (A = approximation coefficients)
(1)
(2)
(3)
(4)
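The feature extraction step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the Haar wavelet, the choice of the row-difference detail band, and the definitions of energy, standard deviation and entropy are not stated in the surviving text. The sketch also produces nine values (three statistics for each of three coefficient matrices), so it does not reproduce the 8-dimensional layout mentioned above. The function names `haar_dwt2`, `band_statistics` and `subword_features` are hypothetical.

```python
import math

def haar_dwt2(image):
    """One level of an orthonormal 2D Haar DWT (assumed wavelet).

    `image` is a list of equal-length rows with even height and width.
    Returns (approx, detail_h, detail_v, detail_d), each half the size.
    """
    approx, det_h, det_v, det_d = [], [], [], []
    for r in range(0, len(image), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(image[0]), 2):
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            row_a.append((p + q + s + t) / 2.0)  # local average
            row_h.append((p + q - s - t) / 2.0)  # difference between rows
            row_v.append((p - q + s - t) / 2.0)  # difference between columns
            row_d.append((p - q - s + t) / 2.0)  # diagonal difference
        approx.append(row_a)
        det_h.append(row_h)
        det_v.append(row_v)
        det_d.append(row_d)
    return approx, det_h, det_v, det_d

def band_statistics(band):
    """Energy, standard deviation and entropy of one coefficient matrix.

    Assumed definitions: energy is the mean squared coefficient, and the
    entropy is the Shannon entropy of the normalised squared coefficients.
    """
    values = [v for row in band for v in row]
    n = len(values)
    mean = sum(values) / n
    total = sum(v * v for v in values)
    energy = total / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    entropy = 0.0
    if total > 0:
        for v in values:
            p = v * v / total
            if p > 0:
                entropy -= p * math.log2(p)
    return energy, std, entropy

def subword_features(image):
    """Statistics of the level-2 approximation and one detail band per level."""
    a1, h1, _, _ = haar_dwt2(image)   # first decomposition level
    a2, h2, _, _ = haar_dwt2(a1)      # second decomposition level
    features = []
    for band in (a2, h1, h2):
        features.extend(band_statistics(band))
    return features
```

Reducing each coefficient matrix to three scalars keeps the feature vector short regardless of the size of the sub-word image, which is the motivation given in the text for using these statistics.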
II.
PROPOSED ALGORITHM
We used support vector clustering to cluster documents. Support vector clustering (SVC) is a non-parametric clustering
algorithm that was proposed in 2001 [25],[26]. SVC is typically trained in a batch learning mode.
Unlike parametric and hierarchical clustering algorithms, SVC uses Lagrange multipliers to obtain the support
vectors (SVs) that describe the contour of each cluster. However, as the training data grow,
solving for the Lagrange multipliers becomes increasingly difficult. To overcome this, Ban and Abe
presented a Spatially Chunking Support Vector Clustering algorithm to speed up the learning process [27].
They split the original data set into several subsets. Each subset is trained by SVC, and the individual results are then
combined to obtain the global clustering.
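For orientation, the SVC problem can be written as follows, in the notation of Ben-Hur et al. [26]. The symbols (the mapping $\Phi$, sphere centre $a$, radius $R$, slack variables $\xi_j$, multipliers $\beta_j$, soft-margin constant $C$ and Gaussian kernel width $q$) come from that formulation and are not defined in this paper, so the block is background rather than the exact variant used here.

```latex
% Smallest enclosing sphere in feature space (primal problem)
\min_{R,\,a,\,\xi}\; R^2 + C \sum_j \xi_j
\quad \text{s.t.} \quad \|\Phi(x_j) - a\|^2 \le R^2 + \xi_j, \qquad \xi_j \ge 0

% Dual problem in the Lagrange multipliers \beta_j
\max_{\beta}\; W = \sum_j \beta_j K(x_j, x_j) - \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \beta_j \le C, \qquad \sum_j \beta_j = 1

% Gaussian kernel
K(x_i, x_j) = \exp\!\left(-q \,\|x_i - x_j\|^2\right)

% Distance of a point x from the sphere centre a = \sum_j \beta_j \Phi(x_j)
R^2(x) = K(x, x) - 2 \sum_j \beta_j K(x_j, x) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)
```

Points with $0 < \beta_j < C$ are the support vectors, and the sphere radius is $R = R(x_j)$ for any such point. The dual has one variable per training point, which is why solving it becomes costly on large data sets and why chunking the data helps.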
III.
In the retrieval stage, we use SVCNN. Figure 4 shows the architecture of SVCNN. This network is made up of four layers:
the input layer, the SVC layer, the calculation layer, and the output layer. The number of input weightings of the output
layer depends on the number of clusters obtained after the training data have been processed by the SVISV algorithm. In the
input layer, the neurons labeled SV_j are the support vectors of each cluster found by the
SVISV algorithm. The corresponding weighting, labeled β_j and obtained from training, is the Lagrange multiplier of
each SV. The neurons labeled "data" all carry the same new data item, which takes part in the SVC calculation for every
cluster in the network. The second layer, the SVC layer, performs the SVC calculation on the new data
together with the original members of a cluster. This clustering yields a new radius of the sphere, which is recorded in
the output weighting labeled R'_i. Another output weighting of this layer, labeled R_i, represents the
radius derived from the original members of each cluster after training. The third layer, the calculation layer,
computes the absolute difference between R'_i and R_i, which come from the SVC layer. The fourth layer is the
output layer, which reports whether the new data belongs to one of the clusters built into the network. The decision
criterion is the value of |R'_i - R_i| compared to a given threshold T. If this value is greater than or equal to T, the
sphere in the feature space has been extended; according to the SVISV mechanism, this implies that the new data does not
belong to the cluster. Conversely, if |R'_i - R_i| is smaller than T, we conclude that the new data belongs to that
existing cluster.
The training methodology of the SVCNN is as follows. The SVISV algorithm groups the original data set
into n clusters, each of which consists only of SVs. We can regard each cluster as a new independent data set and then
apply SVC to re-cluster it. After clustering, every cluster acquires its own parameters, including the Lagrange
multiplier β_j corresponding to each SV and the radius R_i.
In the retrieval stage, following the SVCNN mechanism, once the cluster of the new data has been identified, we find the
documents whose data are in that cluster. If the new data does not belong to any cluster, we conclude that it
does not occur in the current documents.
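The membership test can be sketched roughly as below. This is a simplified stand-in, not the SVCNN itself: instead of re-running SVC on the cluster plus the new point, it keeps the trained sphere centre fixed and takes the new radius to be the larger of the trained radius and the feature-space distance of the new point from that centre. The Gaussian kernel, its width `q`, the dictionary layout of a cluster and all function names are assumptions for illustration.

```python
import math

def gaussian_kernel(x, y, q=1.0):
    """Gaussian kernel exp(-q * ||x - y||^2); q is an assumed width parameter."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))

def feature_space_distance(x, svs, betas, q=1.0):
    """Distance of x from the sphere centre a = sum_j beta_j * Phi(sv_j)."""
    cross = sum(b * gaussian_kernel(sv, x, q) for sv, b in zip(svs, betas))
    const = sum(
        bi * bj * gaussian_kernel(si, sj, q)
        for si, bi in zip(svs, betas)
        for sj, bj in zip(svs, betas)
    )
    r2 = gaussian_kernel(x, x, q) - 2.0 * cross + const
    return math.sqrt(max(r2, 0.0))  # guard against tiny negative round-off

def belongs_to_cluster(x, cluster, threshold, q=1.0):
    """Decision rule: accept x when the radius grows by less than the threshold.

    `cluster` is a dict with keys 'svs', 'betas' and 'radius' (assumed layout).
    The new radius is approximated with a fixed centre, not recomputed by SVC.
    """
    distance = feature_space_distance(x, cluster["svs"], cluster["betas"], q)
    new_radius = max(cluster["radius"], distance)
    return abs(new_radius - cluster["radius"]) < threshold

def retrieve(x, clusters, threshold, q=1.0):
    """Return the index of the first cluster that accepts x, or None."""
    for index, cluster in enumerate(clusters):
        if belongs_to_cluster(x, cluster, threshold, q):
            return index
    return None
```

With this approximation a point inside the trained sphere leaves the radius unchanged and is accepted, while a point far outside enlarges it by more than the threshold and is rejected, which mirrors the criterion described above.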
IV.
RESULTS
A common way to measure the performance of retrieval algorithms is precision versus recall. Precision is the
proportion of correct retrievals (r) out of the total number of
words retrieved (n). Recall is the ratio of the number of correct retrievals (r) to the total number of instances of
the keyword in the database (m).
Precision = r/n    (5)
Recall = r/m    (6)
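Equations (5) and (6) translate directly into code. The sketch below uses the same symbols r, n and m; the function name and the numbers in the usage note are hypothetical and are not taken from the experiments.

```python
def precision_recall(r, n, m):
    """Return (precision, recall) as fractions.

    r: number of correct retrievals
    n: total number of words retrieved
    m: total number of instances of the keyword in the database
    """
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be positive")
    if r < 0 or r > n or r > m:
        raise ValueError("r must satisfy 0 <= r <= min(n, m)")
    return r / n, r / m
```

For example, if a query retrieves n = 10 words of which r = 8 are correct, and the keyword occurs m = 16 times in the database, `precision_recall(8, 10, 16)` gives a precision of 0.8 and a recall of 0.5.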
We tested our algorithm on 20 pages from different papers in the Persian language with a similar font. Table 1 and Table 2
show the precision and recall obtained with this algorithm for ten different words that have 1 and 2 sub-words, respectively.
Table 3 shows the precision and recall of different algorithms.
Table 1: Precision and recall for ten query words with 1 sub-word
Query      Precision%   Recall%
Q1         100          96.78
Q2         98.88        99.66
Q3         81.76        60.39
Q4         68.60        53.93
Q5         40.00        61.62
Q6         100          86.80
Q7         45.32        58.66
Q8         100          86.78
Q9         100          79.66
Q10        90.52        73.46
Average    82.50        75.77
Table 2: Precision and recall for ten query words with 2 sub-words
Query      Precision%   Recall%
Q1         87.62        97.33
Q2         82.00        100
Q3         100          66.00
Q4         38.38        39.67
Q5         92.40        90.69
Q6         100          43.00
Q7         68.59        75.33
Q8         62.92        70.67
Q9         85.51        100
Q10        90.00        43.00
Average    80.74        72.56
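As a quick arithmetic check, the per-query values reported in Table 1 and Table 2 can be averaged. The figures below are copied from those tables; the interpretation that the abstract's 81.6% precision and 74.2% recall are the means over all twenty queries is an inference from the numbers, not something the text states.

```python
# Per-query values copied from Table 1 (1 sub-word) and Table 2 (2 sub-words).
TABLE1_PRECISION = [100, 98.88, 81.76, 68.60, 40.00, 100, 45.32, 100, 100, 90.52]
TABLE1_RECALL = [96.78, 99.66, 60.39, 53.93, 61.62, 86.80, 58.66, 86.78, 79.66, 73.46]
TABLE2_PRECISION = [87.62, 82.00, 100, 38.38, 92.40, 100, 68.59, 62.92, 85.51, 90.00]
TABLE2_RECALL = [97.33, 100, 66.00, 39.67, 90.69, 43.00, 75.33, 70.67, 100, 43.00]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Means over all twenty queries (both tables pooled).
overall_precision = mean(TABLE1_PRECISION + TABLE2_PRECISION)
overall_recall = mean(TABLE1_RECALL + TABLE2_RECALL)

print(round(mean(TABLE1_PRECISION), 2), round(mean(TABLE1_RECALL), 2))
print(round(mean(TABLE2_PRECISION), 2), round(mean(TABLE2_RECALL), 2))
print(round(overall_precision, 1), round(overall_recall, 1))
```

The per-table means agree with the reported Average rows to within 0.01, and the pooled means round to 81.6 and 74.2.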
Table 3: Precision and recall of different algorithms
Column headings: Precision%, Recall%
Values: 90.33, 65.13, 87.16, 60.63, 74.91, 94.36, 76.92, 87.74, 32.38, 58.12, 48.52, 93.75
Methods: N_gram (VTD, HTD), English; N_gram (VTD, HTD), Chinese; Primitive string and partial word image matching; Weighted Hausdorff distance; Pseudo relevance feedback; Word shape
V.
We have presented a new method of content-based document image retrieval with support vector clustering. In the
feature extraction stage, the discrete wavelet transform (DWT) gives a good description of an image's spatial and frequency
characteristics. The reason for using a clustering algorithm is to reduce the number of data items considered to the number of
clusters. Document clustering is a fundamental operation in unsupervised document organization, automatic topic
extraction, and information retrieval. It provides a structure for organizing documents for efficient browsing, searching
and information retrieval. The SVC and SVISV algorithms do not need the number of clusters to be specified in advance. In general,
each cluster needs a feature by which new data can be assigned to it, so we used the radius of the sphere to assign
new data to a cluster. Another advantage of this algorithm is that it can find several consecutive sub-words, such as
[...], which has 4 sub-words: [...], [...], [...] and [...]. The algorithm is also language independent.
A drawback of the SVC algorithm is that solving the Lagrange equations is slow, so the run time is long. SVC
cannot cluster large data sets, so we used SVISV to remedy this problem. The proposed
algorithm also cannot recognize different forms of a word (for example, if the query word is
[...], the algorithm cannot find
[...]). The algorithm has only been tested on documents with a similar font; we propose testing it
with different fonts. We also propose using statistical models such as Markov models to find sequential sub-words.
REFERENCES
[1] Doermann, D. 1998. The indexing and retrieval of document images: A survey. Computer Vision and Image Understanding, 70(3):287-298.
[2] Mitra, M. and Chaudhuri, B. 2000. Information retrieval from documents: A survey. Information Retrieval, 2(2/3):141-163, August.
[3] Chen, F. R. and Bloomberg, D. S. 1998. Summarization of imaged documents without OCR. Computer Vision and Image Understanding, pages 307-319.
[4] Bloomberg, D. S. and Chen, F. R. 1996. Extraction of text-related features for condensing image documents. The International Society for Photo-Optical Instrumentation Engineering Document Recognition III, pages 72-88.
[5] Chen, F., Wilcox, RL., and Bloomberg, D. 1993. Detecting and locating partially specified keywords in scanned images using Hidden Markov Models. International Conference on Document Analysis and Recognition, pages 133-138.
[6] Hull, J. and Cullen, J. 1997. Document image similarity and equivalence detection. 4th International Conference on Document Analysis and Recognition, pages 308-312.
[7] Hull, J. 1994. Document image matching and retrieval with multiple distortion invariant descriptors. IAPR Workshop on Document Analysis Systems, pages 383-399.
[8] Lu, Y. and Tan, C. L. 2004. Information retrieval in document image databases. 16(11):1398-1410.
[9] Smeaton, F. and Spitz, A. L. 1997. Using Character Shape Coding for Information Retrieval. Proc. Fourth Int'l Conf. Document Analysis and Recognition, pages 974-978.
[10] Nakayama, T. 1994. Modeling content identification from document images. Fourth Conference on Applied Natural Language Processing, pages 22-27.
[11] Nakayama, T. 1996. Content-oriented categorization of document images. International Conference on Computational Linguistics, pages 818-823.
[12] Smeaton, A. F. and Spitz, A. L. 1997. Using character shape coding for information retrieval. 4th International Conference on Document Analysis and Recognition, pages 974-978.
[13] Spitz, A. L. and Maghbouleh, A. 1999. Text categorization using character shape codes. SPIE Symposium on Electronic Imaging Science and Technology, pages 174-181.
[14] Spitz, A. L. 1995. Using character shape codes for word spotting in document images. Shape, Structure and Pattern Recognition, pages 382-389.
[15] Tan, C. L., Huang, W., Yu, Z. and Xu, Y. 2002. Image document text retrieval without OCR. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):838-844.
[16] Tan, C. L., Huang, W., Yu, Z. and Xu, Y. 2003. Text retrieval from document images based on word shape analysis. Applied Intelligence, 18(3):257-270.
[17] Lu, S. and Tan, C. L. 2007. Retrieval of machine-printed Latin documents through word shape coding. The journal of the pattern recognition society (Elsevier).
[18] Lu, S. and Tan, C. L. 2006. Script and language identification in degraded and distorted document images. In 21st National Conference on Artificial Intelligence, pages 769-774.
[19] Lu, S. and Tan, C. L. Script and language identification in noisy and degraded document images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] Li, L., Lu, S. and Tan, C. L. 2004. A Fast Keyword-Spotting Technique. Department of Computer Science, National University of Singapore, Kent Ridge, Singapore 117543.
[21] Zhang, L. and Tan, C. L. 2004. A Word Image Coding Technique and its Applications in Information Retrieval from Imaged Documents. School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543.
[22] Spitz, A. L. and Maghbouleh, A. 1995. Text Categorization using Character Shape Codes. Document Recognition Technologies, Inc., 616 Ramona Street, Suite 20, Palo Alto, California 94301.
[23] Kise, K., Wuotang, Y. and Matsumoto, K. 2003. Document Image Retrieval Based on 2D Density Distributions of Terms with Pseudo Relevance Feedback. Proceedings of the Seventh International Conference on Document Analysis and Recognition, IEEE.
[24] Widyanto, M. R. and Hartono, H. 2008. Angel Decrement Based Gaussian kernel with generator for support vector clustering. Asian Journal of Information Technology, 7(8):388-393.
[25] Ben-Hur, A., Horn, D., Siegelmann, H. T. and Vapnik, V. 2000. A support vector clustering method. In Proceedings of the 15th International Conference on Pattern Recognition, vol. 2, pages 724-727.
[26] Ben-Hur, A., Horn, D., Siegelmann, H. T. and Vapnik, V. 2001. Support vector clustering. Journal of Machine Learning Research, vol. 2, pages 125-137.
[27] Ban, T. and Abe, S. 2004. Spatially chunking support vector clustering algorithm. In Proceedings of the IEEE International Joint Conference on Neural Networks, vol. 1, pages 25-29.
[28] Fahn, C. S. and Kuo, M. J. 2004. Support Vector Clustering Neural Network Used for Pattern Classification. Taipei 106, Taiwan, Republic of China.
[29] Azmi, R. and Kabir, E. 2001. A New Segmentation Technique for Omnifont Farsi Text. Pattern Recognition Letters, vol. 22, pages 97-104.
[30] Akbari, M. and Azmi, R. 2005. Classification and retrieval document based on physical layout. In 11th conference of Iran Computer Association.
[31] Akbari, M. and Azmi, R. 2005. Indexing and retrieval information of imaged document database. In 14th conference of electronic engineering of Iran, Amir-Kabir Univ.
[32] Lu, Sh., Li, L. and Tan, Ch. 2008. Document Image Retrieval through Word Shape Coding. The Journal of Pattern Analysis and Machine Intelligence, IEEE Transactions, Volume 30, Issue 11, pages 1913-1918.
[33] Zhu, G. 2009. Logo Matching for Document Image Retrieval. In 10th International Conference on Document Analysis and Recognition, ICDAR '09.