
2017 3rd International Conference on Science in Information Technology (ICSITech)

Physical Document Validation With Perceptual Hash

Prasetyo Adi Wibowo Putro


Sekolah Tinggi Sandi Negara
National Crypto Institute
Jakarta, Indonesia
prasetyo.adi@stsn-nci.ac.id

Abstract—Electronic validation is needed not only for electronic documents. For specific needs, physical documents must also be validated electronically. The existing problem is that a physical document will always have a different hash value each time it is digitized. This research reviews whether a perceptual hash can be used for electronic validation of physical documents. The conclusion of this study is that a perceptual hash can be used and can detect all modifications that occur in the main information area of the document.

Keywords—electronic validation; perceptual hash; physical document

I. INTRODUCTION

Validation of data authenticity is a process that we perform in every information transaction. According to the Oxford dictionary, validation can be interpreted as "the action of checking or proving the validity or accuracy of something." Document validation used to be done by viewing two documents and comparing them visually. With this method, the accuracy of validation depends on the ability of the person performing the visual validation. Other factors that affect the accuracy of the result are the objectivity of the evidence and the quality of the data being validated. This is why validating physical paper documents often takes longer: the documents are frequently not properly maintained.

Developments in information technology offer electronic document solutions. Validation of electronic documents is faster than validation of physical documents and carries only a small possibility of losing the evidence used for validation. The accuracy of electronic document validation is also very high because it involves cryptographic hash function algorithms and digital signatures. With a hash function, the slightest difference in the document will be detected.

Hash functions are functions that produce a fingerprint of the input data. Such a function produces a value that is, for practical purposes, unique to each input string, and the value changes if the input data changes. The hash function produces digital evidence from digital data, so it works well only if the document to be validated has already been digitized.

Although electronic business is increasing rapidly, not all administrative processes run fully electronically yet. Some activities still involve physical documents with visual validation. Some institutions currently provide validation of both physical and electronic documents. The physical document cannot be validated electronically because electronic validation mechanisms need hash functions for digitized physical documents. Because physically digitized documents will generate different hash values, it is not possible to validate physical documents using such hash values.

This study attempts to determine whether a perceptual hash can be used for electronic validation of physical documents. The research validates ten documents with six types of modification. As a tool, we developed a Java application using one of the published perceptual hash algorithms [1].

II. PERCEPTUAL HASH

A perceptual hash is a fingerprint of a multimedia file derived from various features of its content. A perceptual hash treats two fingerprints as similar if both files have similar features. This differs from a cryptographic hash function, which relies on the avalanche effect: small changes in the input lead to drastic changes in the output. As shown in Fig. 1, a perceptual image hashing system consists of four stages: the Transformation stage, the Feature Extraction stage, the Quantization stage, and the Compression and Encryption stage [2].

Fig. 1. Perceptual Hash Stages

Step 1 Transformation Stage. The transformation stage performs a spatial transformation of the input image file involving the Discrete Cosine Transform (DCT) or Discrete Wavelet Transform (DWT). Examples of such transformations include color transformation, smoothing, affine transformations, and frequency transformations. The principal aim of these transformations is to make all extracted features depend upon the image pixel values or their frequency coefficients in the frequency space. Conducting DWT in perceptual hashing schemes will take just the LL (low-low)

978-1-5090-5864-8/17/$31.00 ©2017 IEEE



subband into the process because it is a coarse version of the original image and contains all of the perceptual information.

Step 2 Feature Extraction Stage. In this stage, the perceptual hashing algorithm extracts image features from the transformed image to generate a feature vector of L features, where L << M x N. At this stage we obtain L x p floats, because there are L features and each feature can contain p elements of type float. However, it is still an open question which mappings from DCT or DWT coefficients keep the essential information about an image for hashing and/or mark embedding purposes. Some research adds a further feature selection at this stage to keep only the most pertinent features [1]. The selected features can be presented as an intermediate hash vector of K x p floats, where K < L. These additions statistically make the algorithm more resistant to specific allowed manipulations such as the addition of noise, JPEG compression, and filtering.

Modifying the Feature Extraction stage is considered necessary because the visual features are usually publicly known and can therefore be modified. This might threaten security, but in this paper we need this property, under which the hash value could be adjusted maliciously to match that of another image, including an image with minor modifications.

Step 3 Quantization Stage. In the quantization stage, we obtain a quantized intermediate perceptual hash vector that contains K x p byte components. Uniform quantization is applied to quantize each component of the continuous perceptual hash vector. Uniform quantization differs from adaptive quantization: uniform quantization is based on the interval length of the hash values, while adaptive quantization partitions based on the probability density function (pdf) of the hash values.

Step 4 Compression and Encryption Stage. The compression and encryption stage is the final step of a perceptual hashing system. This stage guarantees both the security of the system and the fixed length of the final perceptual hash. The binary intermediate perceptual hash vector from the previous stage is compressed and encrypted into a short perceptual hash of fixed size l bytes, where l << K x p. This process yields a group of bytes that allows image verification and authentication with the perceptual hash. The compression and encryption stage can be realized with cryptographic hash functions such as the SHA series, which generate a final hash of fixed size (160 bits in the case of SHA-1).

After that, we compute the average value of the AC coefficients in our 1/16th image. We can do this simply by summing all values of our image, except the first, then dividing the result by the size of the image. We ignore the first (or DC) coefficient during this calculation because it would most likely distort our average.

Perceptual hashes make it possible to compare the similarity of two images quickly. Although this hash technique will not detect similarity under large structural changes, it proves useful for applications such as reverse image search and other approximate comparisons where the hashes of images in a database can be computed as an offline process [3]. The inability to detect similarity under large structural changes is the main reason for using a perceptual hash to validate physical documents.

Perceptual hash functions fall into two categories: unkeyed perceptual hash functions and keyed perceptual hash functions. An unkeyed perceptual hash function generates a hash value from an arbitrary string input. A keyed perceptual hash function generates a hash value h from an arbitrary string input and a secret key [2].

III. DESCRIPTION

In recent years, there has been a growing body of research on perceptual image hashing. Most existing studies focus on the feature extraction stage, because extracting a set of robust features that resist, and remain relatively constant under, content-preserving manipulations while still detecting content-changing manipulations is regarded as the most important goal of a perceptual image hashing system [3]. A different line of research, by Zauner, compares perceptual hash methods [1].

Zauner compares the results of four content identification functions using a single hash creation function. This study uses a Java class created by Zauner and implements it in a simple application as a research tool. The application outputs hash values in binary and hexadecimal form. We use the hexadecimal value to compare the original with the modified version, and the binary value to measure how far they differ. The simple application built on Zauner's Java class is shown in Fig. 2.

Fig. 2. Simple Application of Zauner java class

To prove that a perceptual hash can be used for document validation, the validation process must be performed on original documents and modified documents. The expected result is that the original document will have the same perceptual hash value, while a modified document will result in a different perceptual hash value. To be processed with a perceptual hash algorithm, the physical document must be scanned into an image.
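The stages described in Section II can be illustrated with a minimal sketch of a DCT-based perceptual hash. It is not Zauner's Java class. It assumes the scanned document has already been reduced to a 32 x 32 grayscale matrix, keeps an 8 x 8 block of low-frequency DCT coefficients, and sets one bit per coefficient by comparing it with the mean of the AC coefficients. The class name, method names, and both sizes are hypothetical choices for illustration, and the compression and encryption stage is left out.

```java
final class DctHashSketch {
    static final int SIZE = 32; // side of the downscaled grayscale image (assumed)
    static final int LOW = 8;   // side of the low-frequency block that is kept (assumed)

    /** Transformation and feature extraction: low-frequency 2D DCT-II coefficients (unnormalized). */
    static double[][] lowFrequencyDct(double[][] gray) {
        double[][] coeff = new double[LOW][LOW];
        for (int u = 0; u < LOW; u++) {
            for (int v = 0; v < LOW; v++) {
                double sum = 0.0;
                for (int x = 0; x < SIZE; x++) {
                    for (int y = 0; y < SIZE; y++) {
                        sum += gray[x][y]
                                * Math.cos((2 * x + 1) * u * Math.PI / (2.0 * SIZE))
                                * Math.cos((2 * y + 1) * v * Math.PI / (2.0 * SIZE));
                    }
                }
                coeff[u][v] = sum;
            }
        }
        return coeff;
    }

    /** Quantization: one bit per coefficient, set when it exceeds the mean of the AC coefficients. */
    static long hash(double[][] gray) {
        double[][] coeff = lowFrequencyDct(gray);
        double total = 0.0;
        for (double[] row : coeff) {
            for (double c : row) {
                total += c;
            }
        }
        // The DC coefficient coeff[0][0] is left out of the average, as the text describes.
        double mean = (total - coeff[0][0]) / (LOW * LOW - 1);
        long bits = 0L;
        for (int u = 0; u < LOW; u++) {
            for (int v = 0; v < LOW; v++) {
                bits = (bits << 1) | (coeff[u][v] > mean ? 1L : 0L);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Synthetic gradient standing in for a scanned, downscaled document image.
        double[][] img = new double[SIZE][SIZE];
        for (int x = 0; x < SIZE; x++) {
            for (int y = 0; y < SIZE; y++) {
                img[x][y] = 4 * x + y;
            }
        }
        System.out.println(Long.toHexString(hash(img)));
    }
}
```

Because each bit depends only on how a coefficient compares with the average, two scans of the same page can yield the same hash even when their pixel values are not identical, which is the property the validation experiment relies on.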


The study was conducted in two phases. The first phase compares the hash of the original image with the hashes of its modified versions. This phase examines 10 images with 6 types of modification. The modifications used are:

1. Content wiping as much as 5%,
2. Content wiping as much as 25%,
3. Cropping,
4. Adding content of text as much as 5%,
5. Adding content of text as much as 25%, and
6. Combination of content adding and content wiping.

As the object of research, we use JPEG images that contain five kinds of information: picture, text document, text document with chart, text document with picture, and certificate. These kinds of information were selected because the purpose of this study is to validate physical documents, and in daily practice these five kinds of documents are still used physically.

Each image studied is stored as a JPEG file and subjected to six kinds of modification. The modifications were made in Adobe Photoshop CC 2017 using simple existing features such as the marquee tool, eraser tool, paint bucket tool, horizontal type tool, the Clear menu command, and crop. The results are also stored as JPEG files.

The second phase of the study focuses on the types of modification that still produce the same hash value. The modification is repeated at several different locations in the image, such as empty space, over text data, and above graphical data.

TABLE I. LIST OF IMAGES

No | Sample Name | Information Type
1 | Image 1 | Picture
2 | Image 2 | Picture
3 | Image 3 | Text
4 | Image 4 | Text
5 | Image 5 | Text with charts
6 | Image 6 | Text with charts
7 | Image 7 | Text with picture
8 | Image 8 | Text with picture
9 | Image 9 | Certificate
10 | Image 10 | Certificate

Fig. 4. Image1.jpg

Fig. 5. Image2.jpg

Fig. 3. Research Framework (stages shown: Existing Java Class; Develop Java Application; Phase 1, testing 6 modifications; Phase 2, testing 1 modification at 8 different places)

The first phase reviews whether there are types of modification that still give the same hash value as the original image, while the second phase determines whether the similarity value is influenced by the location of the modification. The research framework is shown in Fig. 3.

Fig. 6. Image3.jpg
IV. RESULT
The first phase of the research was conducted by selecting 10 pictures covering five types of information: graphics, text documents, text with charts, text with pictures, and certificates. The ten images studied are described in Table 1, and the images themselves are shown in Fig. 4 through Fig. 13.


Fig. 7. Image4.jpg

Fig. 8. Image5.jpg

Fig. 9. Image6.jpg

Fig. 10. Image7.jpg

Fig. 11. Image8.jpg

Fig. 12. Image9.jpg

Fig. 13. Image10.jpg

By calculating the hash values, we found that seven of the ten surveyed images show success. For those seven images, every modification has a hash value different from that of the original image. Three images have the same hash value as one of their modified versions. All three have the same hash value for the modification "adding content of text as much as 5%". Those three images are image 1 and image 2, which contain graphic information, and image 3, which contains text information. The hash values of the ten images and their modifications can be seen in Table 3.

By observing the content of Table 1, we find that image 1 and image 2 have the same type of information, namely graphical. Although image 3 has a different type of information, for the next stage of the research we select image 1, as shown in Fig. 4, and add text to it in several different places.


Fig. 14. Image1.jpg with adding content of text as much as 5%

Observing the contents of Image 1, as seen in Fig. 4, we find content such as graphical text, images, and some empty space. Therefore, the text addition was applied above the text, above the picture, in the empty space at the top, in the middle of the image, and in the empty space at the bottom. Fig. 14 shows text added in the middle of image 1.

TABLE II. HASH VALUE TEXT ADDING MODIFICATION

No | Modification | Hash Value
1 | Original | 97EA50549ABA
2 | Modification from Phase One | 97EA50549ABA
4 | Text above graphical data | 97EA F5F69B94
5 | Text over text data | 97EA F5F69B94
6 | Text adding at empty space in the middle | 97EA F5F69BB5
7 | Text adding at empty space footer | 97EA F5F69B94
8 | Text adding at empty space header | 97EA F5F69B94
9 | Text adding at all empty space | 97EA F5F69BB5

The Phase 2 results show dissimilar hash values for all of the 5% text addition modifications made in this phase. Image 1 underwent seven modifications: one in phase one and six in phase two. In Table 2, only modification number 2 has the same hash value as the original image. Modification number 2 is the 5% text addition placed on the middle left of the image. When the six similar modifications are made in other parts of the image, the resulting hash value is different.

TABLE III. HASH VALUE OF IMAGE AND ITS MODIFICATIONS

Image | Original | 5% wipe | 25% wipe | Cropping | 5% Text Adding | 25% Text Adding | Adding and Wiping
Image 1 | 97EA 5054 9ABA | 97C4 5054 9A9A | 197EA 4044 9A9A | 1B6C 85AD 49B92 | 97EA 5054 9ABA | 97C4 5054 9A9A | 97C4 5054 9A9A
Image 2 | 77A6 9C6E 1AF9 | 3B26 9C7E 1AF9 | 372E D47F 38FD | DBE7 BD5E 5B69 | 77A6 9C6E 1AF9 | 7636 9C6E 1AF9 | 3736 9C7E 1AFB
Image 3 | 1A54 2C8F 18B92 | 1A54 2CAF 18913 | 1B76 2C97 18992 | F7C2 7231 8842 | 1A74 2C8F 18903 | 1A74 0C8F 18B03 | 1E34 0C97 18983
Image 4 | 1870 03065 AB1C | 1A70 03065 AB16 | A7ED 2625 A916 | A7ED 2625 A916 | 18700 3065 AB1C | 18700 3065 AA1C | 18700 34A5 A21C
Image 5 | 1A37 D9A4 50087 | 1A37 D9AC 50087 | 1A77 D9A4 58146 | 1AB5 2A2A 54A85 | 1A37 59AC 50007 | 1A26 59AC 50087 | 1A37 D9A4 50102
Image 6 | 1E3C 0B26 58286 | 1E3C 0302 58286 | 1E7C0 38B1 A244 | F1F4 3850 2840 | 1A36 0306 48386 | 1A36 0306 58386 | 1E30 03025 02C4
Image 7 | F320 6251 E953 | F320 62D1 AB52 | F360 62D1 EB52 | B5B8 4B14 2051 | F330 22D1 C953 | F330 22D1 C953 | D330 40B1 E2D9
Image 8 | F0A3 F5F6 9B94 | F5A1 F5F6 9BB4 | F1A1 F5B6 9FB4 | F44A E5D3 12B9 | F8A3 F5F6 9B94 | F9E1 F7B2 9F94 | F1A3 F5F6 9BB4
Image 9 | 5169 D7FB C7F5 | 5148 F77B C7F5 | 5378 F67B C7E5 | 50C1 E75A 868D | 5149 D7FB C7F5 | 5549 D7FB C7F5 | 55C9 D7B9 87F5
Image 10 | 1D7A 0B16 49A07 | 1D640 23E4 BE07 | 1F700 33E6 3E03 | A2E2 6246 DB8A | 1D72 0A16 41A03 | FF24 0164 1A03 | 1F7A 03166 1A03
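The phase-one check, whether any modification leaves the hash unchanged, can be sketched as a simple equality comparison. The example below uses the Image 2 row of Table III and assumes that the three fragments printed in each cell join into one hexadecimal hash string. The class and method names are hypothetical and are not part of the application used in the study.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CollisionCheck {
    /** Returns the names of the modifications whose hash equals the original hash. */
    static List<String> undetected(String originalHash, Map<String, String> modifiedHashes) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : modifiedHashes.entrySet()) {
            if (entry.getValue().equalsIgnoreCase(originalHash)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Image 2 row of Table III, with the fragments of each cell joined (an assumption). */
    static Map<String, String> image2Row() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("5% wipe", "3B269C7E1AF9");
        row.put("25% wipe", "372ED47F38FD");
        row.put("Cropping", "DBE7BD5E5B69");
        row.put("5% Text Adding", "77A69C6E1AF9");
        row.put("25% Text Adding", "76369C6E1AF9");
        row.put("Adding and Wiping", "37369C7E1AFB");
        return row;
    }

    public static void main(String[] args) {
        // Original hash of Image 2 from Table III.
        System.out.println(undetected("77A69C6E1AF9", image2Row())); // prints [5% Text Adding]
    }
}
```

For Image 2, only the 5% text addition goes undetected, which matches the result reported for that image.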

V. CONCLUSIONS

Electronic validation of physical documents can be done using a perceptual hash algorithm. Based on this study, a perceptual hash can detect changes in the main information area of physical documents.

The recommended validation mechanism, as used in this study, is to digitize the document into an image and focus the hash value calculation on the areas that hold information. Consequently, changes that occur in other areas, which hold no main information, inevitably go unchecked.

Further application development can determine the areas of key information for each document and provide a tolerance for changes if needed. The tolerance can be calculated from the binary hash value.

Additionally, in the future we plan to compare the results of this research with the following studies: [6, 7, 8].
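One way to derive such a tolerance from the binary hash value is to count the bits in which two hashes differ (the Hamming distance) and accept a document when the count stays under a chosen limit. The sketch below is an illustration under assumptions: the class and method names are hypothetical, the hashes are taken from the Image 1 row of Table III with the fragments of each cell joined, and the limit passed to `withinTolerance` is an arbitrary example value, not one proposed by this study.

```java
import java.math.BigInteger;

final class HashTolerance {
    /** Number of differing bits between two hashes given as hexadecimal strings. */
    static int hammingDistance(String hexA, String hexB) {
        BigInteger a = new BigInteger(hexA, 16);
        BigInteger b = new BigInteger(hexB, 16);
        return a.xor(b).bitCount();
    }

    /** True when the two hashes differ in at most maxDifferingBits bits. */
    static boolean withinTolerance(String hexA, String hexB, int maxDifferingBits) {
        return hammingDistance(hexA, hexB) <= maxDifferingBits;
    }

    public static void main(String[] args) {
        String original = "97EA50549ABA"; // Image 1, original
        String wiped = "97C450549A9A";    // Image 1, 5% wipe
        System.out.println(hammingDistance(original, wiped));    // prints 5
        System.out.println(withinTolerance(original, wiped, 2)); // prints false
    }
}
```

A limit of zero reproduces the strict equality check used in this study, while a larger limit would accept small differences between scans at the cost of missing small modifications.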


REFERENCES

[1] C. Zauner, "Implementation and benchmarking of perceptual image hash functions," University of Applied Sciences Hagenberg, Hagenberg.
[2] Hadmi, A. Puech, W.A.E. Said, B.A. Ouahman and Abdellah, "A robust and secure perceptual hashing system based on a quantization step analysis," Signal Processing: Image Communication, vol. 28, no. 8, pp. 929-948, 2013.
[3] Jie and Zeng, "A Novel Block-DCT and PCA Based Image Perceptual Hashing Algorithm," Watermarking, vol. 10, no. 1, pp. 399-403, 2013.
[4] Hadmi, A. Puech, W. Said, B.A.E. Ouahman and A. Ait, "Perceptual image hashing," Watermarking, vol. 2, pp. 17-42, 2012.
[5] V. Monga, "Perceptually Based Methods for Robust Image Hashing," University of Texas, Austin, 2005.
[6] V. Monga and B.L. Evans, "Perceptual image hashing via feature points: performance evaluation and tradeoffs," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3452-3465, 2006.
[7] S.S. Kozat, R. Venkatesan, and M.K. Mihçak, "Robust perceptual image hashing via matrix invariants," in Image Processing, 2004. ICIP'04. 2004 International Conference on, vol. 5, pp. 3443-3446, IEEE, 2004.
[8] M.K. Mıhçak and R. Venkatesan, "New iterative geometric methods for robust perceptual image hashing," in ACM Workshop on Digital Rights Management, pp. 13-21, Springer, Berlin, Heidelberg, 2001.
