In order to speed up the analysis, we have ignored image areas with no content and specified only a certain region of interest to be processed. Depending on the font size used, the regions have the dimensions of 1 900×600, 1 900×1 100, or 1 900×1 600 pixels.

Experimental Results
Our experiments have produced detection results that vary over a wide range (44 % to 87 %, see Table 1) under varying test settings. The calculated metrics indicate that the selected block-based CMFD methods have been able to detect traces of CM manipulations in scanned text images (to a certain extent and subject to specific threshold and parameter values). Figure 4 presents an example detection map that was produced from analyzing the document forgery of 12 pt font size with a block size of 15×15 pixels and Zernike moments as a feature representation (shown in Figure 3). Note that the map erroneously marks large blank spaces between text lines as copied. However, the presented results cannot be generalized to all possible experimental settings, and therefore should be taken with some caution. The final decision of whether a pixel is either authentic or forged is influenced by a combination of several parameters. In order to gain a complete picture of the detection performance of CMFD methods in the case of scanned document images, it is necessary to carry out a large number of experimental runs that cover all combinations of parameter values. At its core, this is a multidimensional optimization problem. The applied decision logic is, however, hidden in the adopted software framework, making the construction of a complete ROC curve impractical in our setup.

Nevertheless, our initial experiments support our conjecture that a block-based CMFD approach, in its current design and in the considered application scenario, is prone to many false positive matches due to inter- and intra-glyph similarity and the presence of smooth background areas. To obtain reliable results, one should first inspect the document content and carefully choose threshold parameters. In order to alleviate this need for human interaction, in the next section we propose some modifications of a prototypical CMFD approach based on OCR technology.
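For reference, per-pixel metrics of the kind reported above can be computed directly from a binary detection map and its ground truth. The following sketch is our own illustration; the function name and the flat 0/1 list representation are assumptions, not part of the adopted software framework:

```python
def detection_metrics(detected, truth):
    """Score a binary detection map against its ground truth.

    detected, truth: flat sequences of 0/1 pixel labels,
    where 1 means "marked as copied/forged".
    Returns (TPR, FPR, Acc), each in percent.
    """
    tp = fp = tn = fn = 0
    for d, t in zip(detected, truth):
        if d and t:
            tp += 1            # forged pixel correctly flagged
        elif d and not t:
            fp += 1            # authentic pixel wrongly flagged
        elif t:
            fn += 1            # forged pixel missed
        else:
            tn += 1            # authentic pixel correctly passed
    tpr = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else 0.0
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, acc
```

Sweeping a method's decision threshold and recording the resulting (FPR, TPR) pairs from such counts is precisely the ROC construction that the adopted software framework did not permit in our setup.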
Figure 5. Analysis framework for copy-move forgery detection in scanned text images
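A defining element of the framework depicted in Figure 5 is that feature vectors are matched only within classes of identically recognized glyphs (same character, font size, and case). A minimal sketch of this restriction, with hypothetical names and a data layout of our own choosing:

```python
import math
from itertools import combinations

def match_within_classes(glyph_features, threshold):
    """Match glyph feature vectors only inside one character class.

    glyph_features maps a class label, e.g. ("e", 12, "lower") for
    character, font size, and case, to a list of (glyph_id, vector)
    pairs. Returns the id pairs whose Euclidean distance does not
    exceed the threshold; glyphs from different classes are never
    compared, however similar their feature vectors may be.
    """
    matches = []
    for members in glyph_features.values():
        for (id_a, vec_a), (id_b, vec_b) in combinations(members, 2):
            if math.dist(vec_a, vec_b) <= threshold:
                matches.append((id_a, id_b))
    return matches
```

Restricting comparisons to within-class pairs both prunes false positives between visually similar but distinct glyphs and shrinks the otherwise quadratic matching cost.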
To build a fine-tuned approach for CMFD in digital document images, some combination of the mentioned techniques must be integrated into a coherent processing chain. Figure 5 illustrates our analysis framework for CMFD in scanned text images. It is presented in a pipeline manner, meaning that each stage passes its results on to the next one and depends on the success of the preceding steps. The proposed framework combines the leading streams of research in digital image forensics and pattern recognition: image source identification, forgery detection, and OCR technology. OCR serves as a preceding step to the common workflow of CMFD, as character recognition results are employed as input parameters in the typical processing steps of original CMFD methods. This is intended to solve the problem of intra-glyph similarity as well as to speed up the detection process by constraining the search space. We now discuss particular steps of the framework in greater detail.

Preprocessing. A scanned text image first has to be preprocessed and transformed into a form that can be analyzed more effectively by the corresponding OCR and CMFD methods. In general, OCR schemes apply a broad set of preprocessing techniques (e. g., skew detection/correction, image enhancement, noise removal, edge detection, binarization, morphological operations, etc.) to improve the image quality of a digitally converted document and thus to increase OCR recognition performance [1]. Some of these operations may cause a loss or a change of image pixel information. Due to other required preprocessing operations, a CMFD method needs to operate on a separate image that is preprocessed in line with a selected feature representation (e. g., converted to a grayscale image or represented as the integral image). This must still preserve the original image data in order to guarantee the reliable detection of copied image snippets. The fact that OCR schemes usually run on binarized versions of images further justifies the separation of the preprocessing steps between OCR and CMFD.

Optical Character Recognition. This overarching process is composed of a number of distinct steps, which together resemble a typical workflow of the majority of known OCR systems. The main rationale for the incorporation of OCR techniques in the analysis framework is threefold. First, OCR schemes are purposefully designed to automatically analyze the structure of a digital version of the document and to segment it into regions free of text (i. e., background) and connected components (e. g., at the level of text lines, words, or individual glyphs). As we have seen in our initial experiments, the existing block-based methods, which were originally proposed to deal with other kinds of images (and therefore do not differentiate between text and background areas), have incorrectly identified blank space between text lines as forged in certain instances. We believe that these two kinds of image areas should be considered separately in the detection analysis, in order to decrease both false detections and computational overhead. Secondly, CMFD methods may benefit significantly from character recognition techniques in that the font sizes in which symbols are typeset in the text image can be estimated and used to determine optimal block sizes. Lastly, OCR is certainly one of the most promising ways to handle the critical issues of intra-glyph similarity and false positives, as it recognizes specific text characters and permits an adjustment of the feature matching process of CMFD methods, such that feature vectors extracted only from similar characters are matched against each other.

With preprocessing as the first step, OCR proceeds with document structure segmentation, which is intended first to delineate text lines and background areas in the preprocessed document image and then to isolate glyphs from each other [25]. Thus, sizes and positions of glyphs, as well as of identified text-free areas, are obtained at this stage and passed to the next processing steps. Under the generic term character recognition, we subsume several prototypical subprocesses of core OCR algorithms: feature extraction, matching, classification, and error reduction [14, 25]. As its name suggests, this stage is primarily responsible for the conversion of the scanned document image into encoded text. In this regard, research and practice in the field of OCR offer a range of established approaches and advanced methods, the detailed discussion of which is not a concern here. Character recognition is followed by a glyph clustering routine that discriminates the various font sizes used in the scanned document and assigns recognized characters to symbolic classes of the same font size and case. To accomplish this task in document analysis, OCR algorithms usually extract features from global properties of the text image (e. g., character size, text density, etc.) [27]. Following this sequence of steps, the OCR method delivers clusters of recognized characters of the same case and font size, along with their locations (i. e., pixel coordinates) in the image and their dimensions.
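To make the expected output of this stage concrete, the sketch below extracts glyph bounding boxes as connected components of ink pixels and groups them by height as a crude proxy for font-size clustering. All names, the list-of-rows image format, and the simplistic grouping rule are our own assumptions; a production OCR engine performs far more elaborate layout analysis:

```python
from collections import deque

def glyph_boxes(img):
    """Bounding boxes of connected ink components (glyph candidates).

    img: binary image as a list of rows of 0/1 (1 = ink).
    Returns boxes as (x0, y0, x1, y1), inclusive coordinates.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and not seen[y][x]:
                # BFS over 8-connected ink pixels of one component
                q = deque([(y, x)])
                seen[y][x] = True
                y0 = y1 = y
                x0 = x1 = x
                while q:
                    cy, cx = q.popleft()
                    y0, y1 = min(y0, cy), max(y1, cy)
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes

def cluster_by_height(boxes, tol=2):
    """Group boxes whose heights differ by at most `tol` pixels,
    a crude stand-in for clustering glyphs by font size."""
    clusters = []
    for box in sorted(boxes, key=lambda b: b[3] - b[1]):
        height = box[3] - box[1] + 1
        if clusters and height - clusters[-1]["height"] <= tol:
            clusters[-1]["boxes"].append(box)
        else:
            clusters.append({"height": height, "boxes": [box]})
    return clusters
```

The box coordinates and cluster assignments correspond to the glyph positions, dimensions, and font-size classes that the OCR stage hands to the subsequent CMFD steps.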
Copy-Move Forgery Detection. In essence, the CMFD workflow presented in our analysis framework is similar to those originally proposed for the forensic inspection of digital images of natural scenes. However, we have reconsidered certain aspects of the original CMFD pipeline in order to tailor it to the purpose of forensic document analysis. In the context of our framework, block-based CMFD methods rely on the output of the preceding OCR process. They operate on the input image of scanned text, preprocessed in line with a chosen feature representation, and continue with image localization (i. e., block tiling) and feature extraction. Document font sizes, identified in the OCR stage, play a leading role in specifying the respective block sizes. However, the best range of block sizes for a given font size remains an open question. We assume that this critical decision is subject to a trade-off between multiple factors, and therefore encourage further research in this direction. Thus, the analysis framework envisions determining the best block sizes for image fragmentation and feature extraction with respect to the base font sizes used for typesetting the document content.

After the bounding boxes around recognized glyphs are fragmented into blocks of pixels, and feature statistics are computed for each of these blocks, matching of feature vectors takes place. From a pragmatic perspective, this process usually requires a substantial amount of time and computational resources, depending on the image size, the number of features, and the chosen matching method. Motivated by the need to speed up matching and to resolve the issue of false positives caused by intra-glyph similarity, we propose to compare and find similar feature vectors within classes defined by the same character (of the same font size and case). Due to visual distortions, applying scaling operations to copied glyphs is rather unlikely in real life, justifying the suggested matching routine. One can argue that matching feature descriptors only across identically recognized characters may lead to missed forgery detections because of potential errors in OCR. If either a genuine or a duplicated character is wrongly recognized due to its visual resemblance to other characters and is consequently clustered into another class than its counterpart, the proposed matching procedure will fail to detect these forged glyphs. An alternative is to match extracted feature statistics across all glyphs of the same font and size (i. e., without paying attention to syntactical content), and to check in the subsequent post-processing stage whether matched feature vectors correspond to similar-looking characters. Even though we do not consider this approach completely inappropriate, we adhere to the less time- and resource-consuming one, as modern OCR systems have advanced to a level of high recognition accuracy.

The key goal of the post-processing step is to handle outliers and prune false positive matches. In this respect, the translation vector analysis traditionally used in many of the existing CMFD methods can be applied to pairs of matched feature vectors, achieving greater detection accuracy. In the case of document images, post-processing can also be augmented with syntactical models and statistical techniques from natural language processing, allowing the examination of syntactic and semantic grammars and linguistic structures.

Background Analysis. With regard to forgery detection in text-free image areas, the analysis framework assumes the use of scanner and paper forensics techniques instead of the conventional CMFD approaches. As evident from the empirical findings, feature sets are generally sensitive to white background areas, thereby making CMFD techniques unsuitable for application in this context. Recent research in scanner and paper forensics, on the other hand, offers many potential ways to analyze the background of printed and scanned documents for the presence of inconsistencies (e. g., in pattern noise, color degradation, etc.). Additionally, if a questionable image has been saved in a compressed format, JPEG compatibility analysis [8, 12] may reveal that the background area was copied and used as an overlay to conceal some undesirable image contents.

Decision Fusion. Both the CM forgery detection process and the background analysis produce a binary detection map as output, depicting identified image duplicates. As the framework provides for the separate analysis of content and text-free image areas, these binary masks have to be combined in the decision fusion step in order to construct a complete detection map. This map is employed in the subsequent comparison against a corresponding ground truth map and the ultimate evaluation of detection accuracy. However, it should be noted that the generation of detection maps and their fusion are associated with partial information loss, in the sense that detection maps do not allow us to control for the correct relation between a copied glyph and its original counterpart. Detection results may be misleading when one glyph has been correctly identified as cloned, while its counterpart, being marked in the detection map as forged, is, in fact, not the one originally used by the forger but just a similar glyph.

To sum up, the presented analysis framework for CMFD in text images can serve as a design starting point for tackling the problem of detecting CM counterfeiting operations in digitalized documents. In addition to integrating OCR techniques into the common CMFD pipeline, it promotes several other ideas with respect to different processing steps in order to optimize the detection results. The framework envisions a separate examination of content and background image areas, block size identification in relation to font sizes, and feature matching within classes of the same recognized character, font size, and case. It should be mentioned that the analysis framework presented here should not be considered comprehensive, as there are other relevant methods available that may enhance the detection performance in this application scenario.

Results of a First Tailored Approach
To get an initial idea of the potential advantages of applying OCR principles to the problem of CMFD in scanned text images, we have conducted a series of preliminary tests with a document forgery of 12 pt font size. Instead of starting by partitioning the input image into overlapping blocks of pixels, we have first performed a connected component analysis in order to identify the positions and dimensions of the rectangular boxes that bound glyphs or ligatures. Next, we have calculated logarithms of Hu moments of the pixels inside each bounding box as a glyph's feature representation. Although the recognition of individual characters has not been implemented in our practical setup, we have obtained decent detection results in the basic CM and compression scenarios. As a threshold for determining whether an individual glyph has been duplicated or not, we have measured the Euclidean distance between the computed feature vectors of
each identified glyph and of its nearest neighbor. In the simplest case of a plain CM forgery, the Euclidean distance between the feature vectors of two copied glyphs was always zero. This resulted in a correct detection of all forged glyphs without any false positive matches.

The second batch of experiments was aimed at testing a more realistic CM document forgery. For this, we have applied JPEG compression with varying quality factors to the whole falsified document. This operation corresponds to the last (post-compression) step of the image processing chain presented in Figure 2. (The optional pre-compression at step 2 has been omitted in our experiments; scanned images of the genuine documents had initially been saved in lossless TIFF format.) The quality factors of the post-compression varied between 100 and 80 in steps of 5. As JPEG compression introduces a common global disturbance [7], we have adjusted the threshold for the Euclidean distance to 0.1. This value was determined through a set of trials conducted with varying thresholds, in search of a good trade-off between the number of true and false positives. Computing feature vectors per bounding box has enabled us to analyze the detection performance at the level of individual glyphs instead of the original pixel-based approach. Table 2 presents the detection results for JPEG post-compression with different quality factors. We have obtained the highest detection accuracy of 42 glyphs out of 50 originally copied (i. e., 84 %) at the quality factor of 100. As expected, the detection accuracy decreases with a lower quality factor, whereas the false positive rate remains relatively stable.

Table 2. Performance results at the glyph level (in %)

JPEG quality factor      TPR     FPR     Acc
100                      84.0    4.7     89.7
95                       76.0    5.3     85.4
90                       54.0    5.1     74.5
85                       42.0    4.7     68.7
80                       36.0    5.1     65.5
No compression          100.0    0.0    100.0
(for comparison)

These initial results have demonstrated the general feasibility of enhancing the performance of existing CMFD approaches to document forensic analysis by incorporating techniques and principles of OCR. Clearly, there is much room for improvement and further research that addresses all aspects of the conceptual analysis framework.

Concluding Remarks
The main contribution of this work is fourfold. First, we have raised and sought to draw academic attention to the practical problem of copy-move forgery detection in the so-far unexplored domain of scanned document images. To date, most CMFD methods proposed in the literature have been validated only on digital or scanned images of natural scenes, whereas digitalized versions of documents have been left out of researchers' consideration. Secondly, we have empirically studied the detection performance of block-based methods when applied to scanned images of black text on a white background. Our results have demonstrated that the existing CMFD methods, originally designed for the forensic analysis of natural images, have significant shortcomings in this case and yield many false positive matches. Third, we have adapted the common processing pipeline for CMFD and elaborated an integrated analysis framework for detecting CM tampering in text images. This theoretical framework offers a systematic way to address the critical issue of intra-glyph similarity, intrinsic to any digital document image. In essence, it integrates the two leading streams of research in digital image forensics (image source identification and forgery detection) with optical character recognition techniques, making use of all available information about the document structure and the positions and sizes of individual characters. Fourth, we have presented first promising evidence for the effectiveness of CM forgery detection in text images using a preliminary instantiation of the proposed analysis framework.

More generally, we consider this study as the first step toward addressing the problem of copy-move tampering in digitalized documents. The presented analysis framework is intended to provide a starting point and stimulate further thoughts regarding the design of a tailored approach for CM forgery detection in text images. The next crucial task is to implement a more comprehensive CMFD-T (T for text) toolbox and evaluate its detection performance in practical settings, with different feature representations and post-processing operations on the text image. This will serve to further validate the overall feasibility of the proposed framework. In addition, we see a need for future research examining the relation between font and block sizes as factors jointly influencing detection performance. We also encourage further research on the development of new (or the adjustment of existing) scanner and paper forensics techniques in order to enable a reliable detection of copied background areas.

Acknowledgments
Part of this research and the associated conference presentation have been supported by Archimedes Privatstiftung, Innsbruck.

References
[1] Yasser Alginahi, Preprocessing Techniques in Character Recognition, in: Character Recognition, ed. by Minoru Mori, InTech, pp. 1-20. (2010).
[2] Irene Amerini, Lamberto Ballan, Roberto Caldelli, Alberto Del Bimbo, and Giuseppe Serra, A SIFT-Based Forensic Method for Copy-Move Attack Detection and Transformation Recovery, IEEE Transactions on Information Forensics and Security, 6(3), pp. 1099-1110. (2011).
[3] Irene Amerini, Roberto Caldelli, Alberto Del Bimbo, Andrea Di Fuccia, Luigi Saravo, and Anna Paola Rizzo, Copy-move forgery detection from printed images, Proc. SPIE 9028, Media Watermarking, Security, and Forensics, pp. 1-10. (2014).
[4] M. K. Bashar, K. Noda, N. Ohnishi, and K. Mori, Exploring Duplicated Regions in Natural Images, IEEE Transactions on Image Processing, accepted for publication. (2010).
[5] Joost van Beusekom, Faisal Shafait, and Thomas M. Breuel, Document Inspection Using Text-Line Alignment, Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 263-270. (2010).
[6] Richard G. Casey and Eric Lecolinet, A Survey of Methods and
Strategies in Character Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7), pp. 690-706. (1996).
[7] Vincent Christlein, Christian Riess, Johannes Jordan, Corinna Riess, and Elli Angelopoulou, An Evaluation of Popular Copy-Move Forgery Detection Approaches, IEEE Transactions on Information Forensics and Security, 7(6), pp. 1841-1854. (2012).
[8] Jessica Fridrich, Miroslav Goljan, and Rui Du, Steganalysis based on JPEG compatibility, Proc. SPIE 4518, Multimedia Systems and Applications IV, pp. 275-280. (2001).
[9] Jessica Fridrich, David Soukal, and Jan Lukas, Detection of Copy-Move Forgery in Digital Images, Proceedings of Digital Forensic Research Workshop. (2003).
[10] Thomas Gloe, Matthias Kirchner, Antje Winkler, and Rainer Böhme, Can we trust digital image forensics?, Proceedings of the 15th ACM International Conference on Multimedia, pp. 78-86. (2007).
[11] Andrew D. Gross, Daniel G. Neely, and Juergen Sidgman, When Paper Meets the Paperless World, The CPA Journal, 85. (2015).
[12] Nicholas Zhong-Yang Ho and Ee-Chien Chang, Residual Information of Redacted Images Hidden in the Compression Artifacts, in: Information Hiding (10th International Workshop), ed. by Kaushal Solanki, Kenneth Sullivan, and Upamanyu Madhow, vol. 5284, Lecture Notes in Computer Science (LNCS), Berlin Heidelberg: Springer-Verlag, pp. 87-101. (2008).
[13] Hailing Huang, Weiqiang Guo, and Yu Zhang, Detection of Copy-Move Forgery in Digital Images Using SIFT Algorithm, Proceedings of the Pacific-Asia Workshop on Computational Intelligence and Industrial Application, 2, pp. 272-276. (2008).
[14] S. Impedovo, L. Ottaviano, and S. Occhinegro, Optical Character Recognition: A Survey, Int. J. Pattern Recog. Artif. Intell., 5(1-2), pp. 853-862. (1991).
[15] Eric Kee and Hany Farid, Printer Profiling for Forensics and Ballistics, Proceedings of the 10th ACM Workshop on Multimedia and Security, pp. 3-10. (2008).
[16] Nitin Khanna, George T. C. Chiu, Jan P. Allebach, and Edward J. Delp, Scanner Identification with Extension to Forgery Detection, Proc. SPIE Electronic Imaging, 6819. (2008).
[17] Nitin Khanna and Edward J. Delp, Intrinsic Signatures for Scanned Documents Forensics: Effect of Font Shape and Size, IEEE International Symposium on Circuits and Systems (ISCAS), pp. 3060-3063. (2010).
[18] Xunyu Pan and Siwei Lyu, Region Duplication Detection Using Image Feature Matching, IEEE Transactions on Information Forensics and Security, 5(4), pp. 857-867. (2010).
[19] Alin C. Popescu and Hany Farid, Exposing Digital Forgeries by Detecting Duplicated Image Regions, Department of Computer Science, Dartmouth College, Tech. Rep. TR2004-515. (2004).
[20] Seung-Jin Ryu, Matthias Kirchner, Min-Jeong Lee, and Heung-Kyu Lee, Rotation Invariant Localization of Duplicated Image Regions Based on Zernike Moments, IEEE Transactions on Information Forensics and Security, 8(8), pp. 1355-1370. (2013).
[21] Seung-Jin Ryu, Min-Jeong Lee, and Heung-Kyu Lee, Detection of Copy-Rotate-Move Forgery using Zernike Moments, Proceedings of Information Hiding Conference, pp. 51-65. (2010).
[22] Husrev Taha Sencar and Nasir Memon, Overview of State-of-the-Art in Digital Image Forensics, Algorithms, Architectures and Information Systems Security, 3, pp. 325-348. (2008).
[23] Shize Shang, Nasir Memon, and Xiangwei Kong, Detecting documents forged by printing and copying, EURASIP Journal on Advances in Signal Processing, 2014(1), pp. 1-13. (2014).
[24] Ewerton Silva, Tiago Carvalho, Anselmo Ferreira, and Anderson Rocha, Going deeper into copy-move forgery detection: Exploring image telltales via multi-scale analysis and voting processes, J. Vis. Commun. Image R., 29, pp. 16-32. (2015).
[25] Øivind Due Trier, Anil K. Jain, and Torfinn Taxt, Feature extraction methods for character recognition: A survey, Pattern Recognition, 29(4), pp. 641-662. (1996).
[26] Junwen Wang, Guangjie Liu, Zhan Zhang, Yuewei Dai, and Zhiquan Wang, Fast and Robust Forensics for Image Region-Duplication Forgery, Acta Automatica Sinica, 35(12), pp. 1488-1495. (2009).
[27] Abdelwahab Zramdini and Rolf Ingold, Optical Font Recognition Using Typographical Features, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), pp. 877-882. (1998).

Author Biography
Svetlana Abramova is a PhD student in the Security and Privacy Lab at the Institute of Computer Science, Universität Innsbruck, Austria. She received her BSc ('10) and MSc ('12) degrees in Information Systems from the National Research University Higher School of Economics, Russia, and the University of Münster, Germany. Her research interests focus on multimedia security, and digital image forensics in particular, as well as on virtual currencies.

Rainer Böhme is Professor of Security and Privacy at the Institute of Computer Science, Universität Innsbruck, Austria. A common thread in his scientific work is the interdisciplinary approach to solving exigent problems in information security and privacy, specifically concerning cyber risk, digital forensics, cyber crime, and crypto finance. Prior affiliations in his academic career include TU Dresden and Westfälische Wilhelms-Universität Münster (both in Germany) as well as the International Computer Science Institute in Berkeley, California.