Professional Documents
Culture Documents
Francisco Vega∗ , José Medina† , Daniel Mendoza∗ , Vı́ctor Saquicela∗ , and Mauricio Espinoza∗
∗ ComputerScience Department, University of Cuenca, Ecuador
{francisco.vegaz, daniel.mendoza, victor.saquicela, mauricio.espinoza}@ucuenca.edu.ec
† Department of Electrical, Electronic Engineering and Telecommunications, University of Cuenca, Ecuador
jose.medina@ucuenca.edu.ec
Abstract—This paper proposes a general framework that al- the videos can suffer the adhesion of banners or logos within
lows to identify a video in real time using perceptual image their original content.
hashing algorithms. In order to evaluate the versatility and In order to deal with these problems, many studies
performance of the framework, it was coupled for a use case have focused their attention on the development and ap-
about ads tv monitoring. Four Perceptual Image Hashing (PIH) plication of Perceptual Hashing techniques [1]–[4], which
algorithms were subject to a benchmarking process to identify allow the identification of multimedia contents for obtaining
the best one for the use case. This process was focused on short binary strings of robust characteristics present in the
analyze differences in terms of discriminability (D), robustness multimedia data [1]. These techniques are also known as
(R), time processing (Tp) and efficiency (E). A truth table fingerprinting or content-based media identification. The
was used to obtain information about discriminability and application of these techniques is much faster and more effi-
robustness, while processing time was directly measured. An cient than performing a direct comparison of the multimedia
efficiency metric based on time processing and identification content, especially in the case of videos, since this type of
capacity was proposed. In general terms, DHASH and PHASH multimedia objects is composed by a sequence of images
algorithms have higher identification capacities than AHASH (frames) and generally an audio track.
and WHASH in order to identify a video using only one Most authors agree that Perceptual Hashing based al-
frame. Moreover, a progressive decrease in robustness with the
gorithms must ensure the following properties: robustness
increment of the Hamming distance is observed in all cases.
to support distortions in the content, fast extraction to
generate a hash representation from the multimedia con-
However, for a particular case of tv monitoring where speed is
tent, discrimination to avoid collisions between hash values,
critical, the processing time becomes the most discriminatory
fast searching to retrieve an element from a database, and
parameter for the selection of the algorithm. So, for this case,
efficiency to identify the required item. Considering these
a particular type of PIH (Average Hash) is highlighted as
criteria, several algorithms have been developed that using
the most efficient one among other techniques, reaching an
a measure of similarity allow to compare the hash represen-
accuracy of 100% and frame rates on processing average of
tation of a multimedia object with the hash representation
108 fps with a Hamming Distance of 1. At the end, the proposed of multimedia objects previously extracted and stored in a
framework has remarkable identification skills, and presents database. The aim of these algorithms is to establish the
an efficient search. Furthermore, presents the steps to select the “equality” of the compared objects.
best algorithm and its more adecuate paremeters, according In a general way, the identification of videos based on
to the requirements of each particular case. Perceptual Hashing techniques can be classified into two
main groups:
1. Introduction • Based on audio fingerprinting [5]–[7]. This type
of algorithm allows identifying a song or audio in
During the last years, the proliferation of multimedia
general from a video, by extracting particular char-
content has grown enormously, mainly due to costs savings
acteristics (fingerprints) of a fragment of the audio
and ease of use of new tools for generating, producing,
track, whether or not to a noisy environment.
storing, distributing, and delivering digital content. However,
• Based on visual characteristics. These algorithms
this has complicated the management of multimedia files,
are based on the extraction of particularities of the
especially in tasks such as: control of copyright infringe-
multimedia object from its visual characteristics.
ments, search for content, or custom filtering. From all
Several algorithms of this type have been proposed
types of multimedia objects, videos and particularly the
in the literature, using techniques of diverse nature to
process used for their identification, is still one of the major
obtain the required representation. According to [8]
research challenges. A reason is the amount of alterations
these algorithms can be classified into three groups:
and variations that can suffer these multimedia objects in
their properties, such as resolution, format or codecs. Also, 1) Frame-by-frame video hashing [1], [3], [9].
This group applies in an individual way algo- Image Hashing algorithms is presented. Section 4 intro-
rithms based on Perceptual Image Hashing duces possible applications of the proposed framework. The
to every frame of the video, finally to com- benchmarking process used to compare different Perceptual
bine them in an alone representation (hash) Image Hashing algorithms is described in Section 2. The re-
of every video. sults of benchmarking process are presented and interpreted
2) Key frame based video hashing [10], [11]. in Section 2. The description of use case implemented to
These algorithms identify key frames along evaluate the proposed framework is presented and discussed
the video, then to obtain from them the same in the Section 2. The paper ends in Section 8 with the
particular hash as described in 1). conclusions and the suggestions for future works.
3) Spatio-temporal video hashing [12], [13].
This technique uses a normalized vector rep- 2. Perceptual Image Hashing Algorithms
resentation of the video that includes tem-
poral samplings and spatial resizing of the An image comparison algorithm of this type is based on
content of the whole video. the extraction of characteristics derived from the content of
an image to form a representation in short binary strings,
The main problems found in these approaches are: i) the sufficiently flexible and robust enough to effectively dis-
audio lack in some videos, which prevents from executing criminate similarities and differences between two images
techniques based on audio fingerprint, ii) the degradation subject to alterations of diverse nature. This representation
that suffer the algorithms based on visual characteristics is obtained through hash functions, which are mathematical
(like 1 and 2) in case of frame dropping either intentional tools that allow the coupling of data of different sizes into
or by codec conversion, iii) the time of response both in the a single fixed-sized representation. The required flexibility
process of feature extraction and in the matching process, is obtained by the use of a mathematical function known as
and iv) the excessive use of computational resources and the Hamming Distance (Hd), which returns a result that denotes
reduction of response speed as new videos are included to how many elements are different between two binary strings.
the database. This means that while closer to 0 is this value the greater
In our opinion, these problems can be mitigated, ana- the similarity between them. It should be considered that
lyzing in depth the relation between the used techniques the hash representations obtained with these algorithms are
for storing the fingerprints and the used perceptual hash not unique, although they are different.
algorithms. This type of analysis has relevancy great due The algorithms used in this study, were selected due to
that the success of these algorithms is directly linked to their ease of implementation and can be obtained free from
the time of fingerprint extraction, the time of the search the Python library called “Image Hash”1 . All these methods
process, and to the efficiency of the used algorithm. All obtain a fingerprint of 64-bit as a result. Generally, these
these characteristics have direct relation with the storage values are presented and stored as hexadecimal values. The
method used to enable the identification of videos. following is a brief description of the algorithms selected in
Considering the above mentioned and emphasizing the this work, describing the steps used to execute them:
challenge that supposes at present the identification of a
video, the following question arises: is it possible to de- 2.1. Average Hashing (AHASH).
termine an approach that allows identifying a video in real
time and that in addition consumes few resources?. This As described in [14]the main idea of AHASH algorithms
work helps to solve this question, by means of the definition is to obtain the mean value of all the low frequencies of an
of a framework that enable the identification of videos, image and then use it in the construction of the hash value.
using open source perceptual image hashing algorithms, The steps to follow are: (1) compression of the image to a
emphasizing in the reduction of the processing time and the size of 8x8 pixels; (2) grayscale conversion; (3) obtaining
consumption of computational resources, without neglecting the mean value of the image intensities; (4) comparison the
the efficiency. 64 intensities of the image one by one with the mean value,
The main contributions of this paper can be summarized one value of 1 is applied if the intensity is greater and 0
as follows: i) to propose a framework that allows the robust otherwise, and (5) building a 64 bits hash representation
identification of videos in real time using techniques of with the bits obtained.
Perceptual Image Hashing, ii) to evaluate the importance
of the interaction between the process of extraction and 2.2. Difference Hashing (DHASH).
search of fingerprints, and iii) to evaluate the behavior
of the proposed framework in a case study for television According to [15], this technique is defined for obtaining
monitoring of advertising. a fingerprint (hash) from the number of different pixels and
The rest of this paper is structured as follows. The its gradient. The steps for this case are: (1) conversion of
conceptual basis of Perceptual Image Hashing techniques the image to a grayscale, (2) reduction of the image to a
and the definition of the algorithms used in this work are common size (here, 9x8 pixels), (3) comparison of each
shown in Section 2. In Section 3, the proposed framework
to support video identification in real time using Perceptual 1. https://pypi.python.org/pypi/ImageHash
intensity with the adjacent values for each row –if the value
is greater, a 1 is applied, otherwise 0–, and (4) obtain the
representation with binary values.
Indexes Sublayer It is a data dictionary that allows having more than one
column associated with an index. Each column presents
In order to optimize the search process, given the large a hexadecimal value of 64 bits. Where, the first 32 bits
number of fingerprint elements that can be stored in the represent the video Id and the remaining bits represent the
database, it is necessary to manage indexes and a look up frame number of the video associated with the fingerprint.
table directly on RAM, as shown in Figure 2. However, for For specific scenarios, it is possible to use only the video
large data sets it is recommend to use NoSql indexing tools. Id, since, in the case of large data sets could be used few
Indexes are no more than the same fingerprints obtained fingerprints of a shot to populate the database. A shot is
in the data population process. A Bk-Tree allows to keep the name that receives a fragment of a video scene, where
in RAM the data in any moment. This approach can be several contiguous frames maintain a similar sequence.
very useful for small or medium scale databases. A Bk-Tree An ambiguity can occur in two cases: (1) when a frame
is a structure based on discrete metrics proposed by [18]. is very similar to another contiguous frame within the same
Originally it emerged as a data structure that allows to video and, (2) when the same image appears in more than
identify the number of changes necessary to convert one one video. For example, in commercials about car models
word to another, this with the purpose of applying it in an belonging to a same brand are usual to place at the end of
orthographic corrector. ad, some frames with the same logo. Ambiguities can be
In the proposed framework, the distance chosen between handled differently according to the use case. Thus, if the
the hexadecimal hash values is the Hamming distance. intention is to use the framework to identify a commercial
from a screen capture, in the first case, the only commercial recognition algorithms to identify objects and people
that is repeated can be returned as response. While, in the within the video and semantic-based techniques to
second case a brand can be returned as response and all formally annotate objects.
commercials corresponding to that brand can be listed, or • Parental Control: The authors [30], [31], highlight
a new screen capture can be requested to the device until the importance of this set of applications to filter
there is no ambiguity automatically videos not suitable for minors in real
time, according to the characteristics and the content
Database Sublayer of each video.
• Live Channel Detection: It identifies a television
Depending on the amount of data managed, a relational channel that is being watched by a user in real time.
or NoSQL database can be used. The main goal of this It can be used by broadcasters to organize contests,
component is to store more detailed information about the surveys, etc. In [32], [33], are enhaced some steps
video, without consuming the RAM of the equipment. The to deal with the job of video identification.
database structure clearly depends on the intended use, for • Advertising Avoidance: These systems allow identi-
example, in the case of television monitoring it is necessary fying a commercial and blocking their presentation
to add one or several tables that allow registering informa- to users who do not wish to observe them. In ad-
tion such as channel, date, or detection time. dition, it could be used to customize the commer-
Possible applications that can take advantage of the cials sent to each user during the presentation of
existence of a video identification algorithm that works in the advertising strip. In [34], [35], the advertising
real time is described below. avoidance is treated from the point of view economic
and statistical.
• Audience Measurement: The paper [36], presents
4. Possible Applications
the challenges of audience measurement and how the
applications can help to identifying automatically the
The need for a video identification algorithm that re-
videos observed by a user and from there generating
sponds in real time is mainly due to the great demand
statistics such as the observation time before the user
for novel and various applications that are currently being
decides to change channels.
developed, for example:
• Personalized TV Recommender: These applications
• Second Screen Synchronizing: There consists of use techniques of extraction and identification of
identifying a particular video reproduced by a main videos to generate personalized recommendations of
device and offering a set of exclusive services programming according to the likes of each viewer
through the use of a companion device, known as and recommendatios of tv viewers [37], [38].
second screen device [20]. Its applicability ranges • Content Based Video Retrieval: This type of applica-
as presented in [21], [22], from social messaging tions allows you to obtain videos similar to an image
and offering audio for films in different languages, or sequence of images within a database. The works
providing narratives for blind people; to present done by [39]–[41], deal with video indexing and
additional cameras at different angles of a sporting retrieval.
event. • Video Database Organization: The applications in
• TV Monitoring: Continuing the analysis of [23]– this group allow you to organize and categorize
[25], is denoted that the interest to develop and videos according to their content. The great amount
improved different techniques that allows identifying of the media produced currently make this job very
any TV content transmitted by a Broadcaster at any complicated and tedious to be done manually. In
time and in real time, is currently latent. Once the [42], are showed a solution to deal with this in an
content is identified, this type of techniques allows automatically manner.
the display of detailed information about the televi-
sion event, such as: the precise time and date of 5. Benchmarking Process
detection, source or source of the video, type of
video, among others. Its benefits are so important This section describes the benchmarking process that
that at present this activity demands intense research was carried out with the objective of comparing the four
in order to meet the broad demand for needs such different Perceptual Image Hashing algorithms described in
as: publicity monitoring or political advertising. Section 2. Two factors were considered in the evaluation:
• Extended Information for Ads: The authors [24], processing time (extraction and search) and identification
[26], [27], proposes different systems that uses a capacity. The process was divided into three steps: (1) to
mobile device to obtain more information about a determine the relative number of successes and failures
brand or product presented on television, in adver- in the identification of videos under different alterations
tising banners, videos on the web, etc. (considering variations in quality and source of the video),
• Video Semantic Enrichment: The studies done by (2) to evaluate the performance of the PIH algorithm (iden-
[26], [28], [29], present solutions to combine pattern tified in section two) (3) to evaluate the performance of
the framework considering computational conditions of low TABLE 2: Identification Processing Time
performance (1 processor core). Method Hamming Distance
0 1 2 3 4 5 6
The machine used to run the evaluation is a PC with a AHASH
1.09b ,5.57b 21.25b 67.59b 163.65b 329.81b 635.42b
2.7Ghz i7 processor, 1600Mhz 8GB RAM, 500GB SATA (0.0091)a
hard drive and a 64bit Ubuntu 14.04LTS operating system. DHASH
1.01b 7.91b 43.01b 162.52b 450.96b 971.01b 1786.99b
(0.0099)a
The dataset used to populate the test database consists of PHASH
1.24b 1.23b 68.33b 68.97b 795.62b 806.65b 3104.11b
165 automobile commercial videos obtained from YouTube (0.0808)a
WHASH
–a total of 246145 frames are used–. The videos are in MP4 1.16b 1.94b 23.76b 26.91b 154.65b 166.56b 530.81b
(0.0737)a
format and encoded with H.264. The characteristics of the a
Fingerprinting Extraction Time (Te): average in seconds per frame.
videos are variable, with resolutions ranging from 640x480 b
Searching Process Time (Ts): duration in seconds for the total number of
to 1280x1080 pixels; on the other hand, the frame rate goes samples.