
OPENING

The urgency of the thesis

The rapid development of digital image capture devices has led to an explosion of multimedia data. Content with wide appeal, such as movies, TV programs and sports broadcasts, attracts a large number of viewers, so advertising objects are integrated and inserted directly into the content of these videos. This is the fastest and most common way to advertise products, trademarks and brand names (ad objects, for short) to large numbers of customers and consumers.

Inserting a new ad object into a video, or replacing an existing ad object with a different one in order to reuse existing videos, is mostly done manually. However, with the explosion in the number of videos available both online and offline, analyzing and processing all video content manually is very expensive and not feasible. This motivates the thesis to study techniques for automatic video post-processing in commercial applications, such as analyzing advertising content in video or integrating and replacing ad objects in videos with existing content.

The objective of the thesis

The thesis aims to study and improve techniques for the key problems in video post-processing and video understanding, applied to detecting and replacing advertising objects in video with high speed and accuracy. The problems addressed are: object detection in video, including detecting the object and recognizing its shape; object replacement in video, including segmenting and extracting the displayed region of the object; and video completion after the detected object has been replaced with the selected object.

The object and scope of study

The objects of study are object detection models for video; indexing techniques for very large sets of high-dimensional vectors and approximate nearest neighbor (ANN) search techniques applied to object shape recognition; and models for object segmentation, video completion and video reconstruction after object removal or replacement.

The scope of the study is limited to sports videos, advertisements and movies that have been split into sequences of consecutive frames. Ad objects are static, two-dimensional image objects. Object instances are not too small: each dimension is between 20 px and 400 px.

The method and content of study

The methodology combines theoretical and experimental research, including analysis, comparison, synthesis and evaluation of results based on experiments.

The research content focuses on improving the object detection model for video based on deep convolutional neural networks (DCNN); improving indexing techniques based on product quantization (PQ), together with search and sample-matching techniques applied to object shape recognition; and improving the object instance segmentation technique and the completion of blank or damaged regions in video based on DCNN.

The contribution of the thesis

(i) Improving the practical performance of the object detection model in video with a DCNN-based approach.

(ii) Improving the feature vector indexing technique based on quantization of vector clusters (PSVQ), which raises the encoding quality of large vector data sets.

(iii) Improving the RBPconv image inpainting model, applied to completing and reconstructing the blank regions generated in the video after the object is replaced.
CHAPTER 1. THE OVERVIEW OF OBJECT REPLACEMENT IN VIDEOS

1.1 Object detection

In traditional methods, object detection is divided into two independent phases: extracting and describing raw features, and representing these features. Raw feature extraction detects features that are highly invariant under certain geometric transformations. Because this model focuses only on morphology, its features are chosen subjectively for all object classes, so it generalizes poorly, in addition to being complex and computationally very expensive.

In DCNN-based object detection, all of these previously separate steps are carried out by the layers of a single neural network. DCNN-based detectors fall into two types: two-stage models and one-stage models. A two-stage model first proposes candidate object regions. The DCNN then extracts features from the region proposals, and finally classification and regression determine the class and the bounding box of the object. This approach is relatively accurate, but it is relatively slow even for a single image.

To overcome this drawback, one-stage models do not use region proposals; they rely on a single DCNN that maps pixels directly to bounding box coordinates and class probabilities. Typical models in this group are YOLO and SSD, which achieve fast execution times in object detection, but because they rely only on high-level features their accuracy is not high.

1.2 Object shape identification

To identify the shape of an object from the feature vector selected during object detection, many techniques for indexing high-dimensional data sets have been studied. Effective techniques include hashing, clustering, space partitioning and product quantization. Among these, search methods based on product quantization give the best results on large numeric vector data sets. The thesis therefore studies, improves and develops this technique in detail for the matching problem of finding the object shape in a given set of shapes based on feature vectors.

Because the data set can contain millions of records and each vector has thousands of dimensions, matching time is a key issue for video processing applications that must run in real time. To reduce sample matching time, the feature vector sets are indexed and encoded with the PQ technique, which also reduces storage space. Fast ANN search methods then operate on the encoded space to find the approximate nearest object.

1.3 Object replacement and video completion

After an object is detected in the video, its displayed region needs to be extracted and removed from the video. Similarly, the region of the replacement object is extracted from the target image and inserted into the source region that was removed from the video. This editing leaves regions damaged by incomplete overlap, which must be completed during post-processing. Video inpainting is a suitable technique for reconstructing and completing these regions. Many video inpainting studies aim to restore the damaged image to approximately the original, and they follow two main approaches: sampling-based and CNN-based.

In sampling-based approaches, the lost part of the image is restored by growing the completed region from the outer boundary toward the center, searching for suitable patches and stitching them together. Their biggest drawback is that they cannot handle cases where no matching content for the lost part can be found in the data.

Studies that use CNNs to complete blank regions usually build on an encoder-decoder architecture, which learns the contextual features of an image and uses them to complete it. The resulting image is often more realistic than with the sampling-based approach.

Conclusion of Chapter 1

This chapter gave an overview of object detection models, techniques for recognizing object shape in video from feature vector data sets, and models for completing damaged regions in video. By evaluating the advantages and disadvantages of previous studies, the thesis identified a suitable research direction for the problem of detecting and replacing objects in video.

CHAPTER 2. OBJECT DETECTION AND IDENTIFICATION IN VIDEO

This chapter introduces an improved real-time object detection model with high accuracy. Detection speed (more than 30 frames per second) is the key factor in choosing the model. The thesis therefore improves the YOLO model to suit ad objects, increasing accuracy while maintaining real-time speed. The PSVQ technique is then combined with a hierarchical clustering tree to find objects of the same shape in an available feature set, based on the selected feature vectors.

2.1 Object detection in video

2.1.1 Some improvements in YOLO-Adv model

2.1.1.1 Improvement of the loss function

To reduce the effect of object size and tilt on the model, the way the loss is computed for the width and height of the bounding box has been improved. Relative measures are used in place of the absolute measures in the loss function of the original YOLO network (the term marked *). The improved loss function is defined as follows:

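$$
\begin{aligned}
Loss_{new} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ \left( \frac{w_i - \hat{w}_i}{\hat{w}_i} \right)^2 + \left( \frac{h_i - \hat{h}_i}{\hat{h}_i} \right)^2 \right] \qquad (*) \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

In particular, in the new loss function the terms $\frac{w_i - \hat{w}_i}{\hat{w}_i}$ and $\frac{h_i - \hat{h}_i}{\hat{h}_i}$ are used in place of $\sqrt{w_i} - \sqrt{\hat{w}_i}$ and $\sqrt{h_i} - \sqrt{\hat{h}_i}$.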

This improvement increases the accuracy of bounding box detection, reduces the influence of changes in object size and tilt across frames, and at the same time speeds up convergence during training.
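The difference between the two size terms can be illustrated with a small numerical sketch. The function names and the sample box sizes below are illustrative assumptions, not part of the thesis; the hatted values are simply passed as `w_hat` and `h_hat`, as in the formula. The sketch applies the same 10% size error to a small box and to a box ten times larger.

```python
import math

def sqrt_size_term(w, h, w_hat, h_hat):
    """Size term of the original YOLO loss: squared differences of square roots."""
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2 + (math.sqrt(h) - math.sqrt(h_hat)) ** 2

def relative_size_term(w, h, w_hat, h_hat):
    """Relative size term: squared differences divided by the hatted size."""
    return ((w - w_hat) / w_hat) ** 2 + ((h - h_hat) / h_hat) ** 2

# Hypothetical boxes (w, h, w_hat, h_hat) with the same 10% error at two scales.
small = (22.0, 22.0, 20.0, 20.0)
large = (220.0, 220.0, 200.0, 200.0)

for name, box in (("small", small), ("large", large)):
    print(name, sqrt_size_term(*box), relative_size_term(*box))
```

Under these assumptions the relative term gives the same penalty at both scales, while the square-root term grows with the box size, which is consistent with the stated goal of reducing the effect of object size.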

2.1.1.2 Improvement in network architecture

In a DCNN, local features appear in the lower layers. To exploit them, a multi-level feature fusion strategy is used within the Darknet-53 network architecture. Under this strategy, the feature map produced by the Residual 8x256 block is further convolved with 3x3x256 and 1x1x64 filters, and the ReShape/2 operator then rearranges the feature map so that it matches the feature maps of the later layers. Finally, features from the different levels are fused to enrich the local features.
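One plausible reading of the ReShape/2 step is a space-to-depth rearrangement, which halves the spatial resolution and multiplies the channel count by four (so 64 channels become 256). The sketch below shows that reading in NumPy. The function names, the feature map sizes and the use of channel concatenation for the final fusion are assumptions made for illustration; the thesis text does not specify them.

```python
import numpy as np

def reshape_by_2(feature_map):
    """Rearrange an (H, W, C) map into (H/2, W/2, 4C): each 2x2 spatial block moves into channels."""
    h, w, c = feature_map.shape
    assert h % 2 == 0 and w % 2 == 0, "spatial size must be even"
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # -> (H/2, W/2, 2, 2, C)
    return blocks.reshape(h // 2, w // 2, 4 * c)

def fuse(shallow, deep):
    """Concatenate the rearranged shallow map with a deeper map along the channel axis."""
    return np.concatenate([reshape_by_2(shallow), deep], axis=-1)

# Hypothetical sizes: a 64-channel shallow map and a deeper map at half the resolution.
shallow = np.random.rand(52, 52, 64).astype(np.float32)
deep = np.random.rand(26, 26, 512).astype(np.float32)
fused = fuse(shallow, deep)
print(fused.shape)  # (26, 26, 768)
```

No values are discarded by this rearrangement, so the local detail of the shallow layer stays available to the later layers at their own resolution.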

2.1.2 Estimation and assessment of the object detection model

2.1.2.1 Test data and installation environment

The flickrlogos-47 data set is used to train and test the YOLO-Adv model.

Object detection in video is run on a GPU server with an Nvidia Tesla K80 GPU (24 GB of video memory) and 64 GB of main memory, running Ubuntu 14.

Figure 2.1 Average loss value during training

2.1.2.2 Testing result

Training phase estimation

Figure 2.1 plots the average loss of three models: YOLO-Adv, YOLO-Loss and the original YOLOv3. YOLO-Loss improves only the loss function, while YOLO-Adv improves both the loss function and the network architecture. All three models are trained on the flickrlogos-47 data set. The average loss of all three models decreases rapidly over the first 5000 iterations and stabilizes at a very small value after about 15000 iterations. YOLO-Adv, however, shows the fastest initial decrease in average loss and is the first of the three models to reach its minimum, followed by YOLO-Loss and YOLOv3. This indicates that the improved loss function and network architecture make the model more stable, less sensitive to ad object size and tilt, and well suited to the chosen training data set.

(a) YOLOv3 (b) YOLO-Loss (c) YOLO-Adv

Figure 2.2 IoU value during training


Figure 2.2 compares the IoU of the three models, which measures how accurately the bounding boxes are located. The average IoU of all three models increases steadily and stays in the range [0.7, 1.0], which indicates that all three models locate bounding boxes accurately. However, the IoU of YOLO-Adv rises fastest, that is, it trains fastest of the three models. Its IoU also stabilizes at the highest level, which means that its object detection accuracy is the highest.
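For reference, IoU (intersection over union) between a predicted box and a ground-truth box can be computed as in the minimal sketch below. The corner-based box format `(x_min, y_min, x_max, y_max)` and the function name are assumptions chosen for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes that overlap on half of their area: 50 / 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A value of 1.0 means the two boxes coincide and 0.0 means they do not overlap, so an average IoU between 0.7 and 1.0 corresponds to closely aligned boxes.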

Testing phase estimation

The classification accuracy of the YOLO-Adv model is next compared with that of the YOLOv3 and YOLO-Loss models on the Flickrlogos-47 training data set, with the threshold set to 0.5 and mAP as the measure.

The mAP comparison shows that YOLO-Adv gives the highest detection accuracy, with an mAP of 80.2 (Table 2.1), compared with 77.4 for YOLO-Loss and 74.0 for YOLOv3. In addition, at an average of 0.028 s per frame, the YOLO-Adv model runs in real time, processing about 35 frames per second.

Table 2.1 Efficiency on the Flickrlogos-47 data set

Model        mAP     s/Img
YOLOv3       74.0    0.038
YOLO-Loss    77.4    0.032
YOLO-Adv     80.2    0.028
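The per-image times in the table convert to frame rates by taking the reciprocal. The short sketch below does this arithmetic for the three models and compares each result with the 30 frames per second target named at the start of the chapter; the variable names are illustrative.

```python
# Seconds per image, copied from the table above.
seconds_per_image = {"YOLOv3": 0.038, "YOLO-Loss": 0.032, "YOLO-Adv": 0.028}
REAL_TIME_FPS = 30  # target stated at the start of the chapter

fps = {model: 1.0 / s for model, s in seconds_per_image.items()}
for model, value in fps.items():
    status = "above" if value > REAL_TIME_FPS else "below"
    print(f"{model}: {value:.1f} frames per second ({status} {REAL_TIME_FPS})")
```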

2.2 Object shape recognition

The main task of the shape recognition phase is to determine the exact shape of the object found in the previous phase. To do this, the thesis uses a very large set of shape feature vectors extracted by the YOLO-Adv model. This data set is indexed and encoded, and identification is performed by matching the feature vector of the query object against the vectors in the data set.

2.2.1 The PSVQ model

Let X be the selected set of feature vectors, and let x ∈ X denote a vector (data point) in X. The symbol x^(j) ∈ R^(D/m) denotes the j-th sub-vector of x, with j = 1, 2, ..., m.

The original space of X is first divided into m distinct subspaces, each of dimension D/m. In PQ, the correlation of the data between subspaces is not considered, which leads to redundant codewords. To address this, PSVQ pools h adjacent subspaces out of the m subspaces and then applies vector quantization to the pooled subspaces. Specifically, combining h (1 ≤ h ≤ m) adjacent subspaces creates _m = m / h groups, and quantization is performed separately on each of these _m groups, with one quantizer per group. Each group now has _K = h × K cluster centers, so several subspaces share the same codebook. This produces a finer decomposition of the original data without increasing the number of codewords (in total _m × _K = m × K codewords).

Thus, for a feature vector data set X of n points in R^D, applying this quantization to every data point in X with the codebooks {C_i^*} obtained during training yields a code set Q containing one code for each of the n elements of X. Each element of Q is a vector of size m with entries in the range [0, _K − 1]. Q therefore consists of n × m integer entries, so it needs many times less memory than the real-valued data set X.
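The pooling and encoding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation (which is written in C/C++): it uses a plain Lloyd's K-means written in NumPy, and the function names, the random data and the small values of m, h and K are all chosen only for the example.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for every point (squared Euclidean distance)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means; empty clusters keep their previous center."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_center(points, centers)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers

def train_psvq(X, m, h, K):
    """Train one codebook of h*K codewords for each group of h adjacent subspaces."""
    n, D = X.shape
    assert D % m == 0 and m % h == 0
    sub = X.reshape(n, m, D // m)  # sub[:, j] holds the j-th sub-vector of every point
    codebooks = []
    for g in range(m // h):
        pooled = sub[:, g * h:(g + 1) * h].reshape(-1, D // m)  # pool h subspaces
        codebooks.append(kmeans(pooled, h * K, seed=g))
    return codebooks

def encode_psvq(X, codebooks, m, h):
    """Encode every point as m integer codes in [0, h*K - 1]."""
    n, D = X.shape
    sub = X.reshape(n, m, D // m)
    codes = np.empty((n, m), dtype=np.int32)
    for j in range(m):
        codes[:, j] = nearest_center(sub[:, j], codebooks[j // h])  # shared codebook
    return codes

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))  # hypothetical data: n = 200, D = 16
m, h, K = 4, 2, 4
codebooks = train_psvq(X, m, h, K)
codes = encode_psvq(X, codebooks, m, h)
print(codes.shape, len(codebooks), codebooks[0].shape)  # (200, 4) 2 (8, 4)
```

In this toy setting there are m / h = 2 codebooks of h × K = 8 codewords each, so the total of 16 codewords equals m × K, and each point is stored as 4 small integers instead of 16 real values.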

2.2.2 ANN search based on a hierarchical clustering tree

ANN search uses a complete hierarchical clustering tree together with the data set encoded by the PSVQ method, and has two phases: an offline phase, which prepares the data and builds the search tree, and an online phase, which searches by traversing the tree. In the offline phase, the set L (of size m × _K) is the codebook set of X built with the PSVQ method. Each quantization code is formed by quantizing one data point x ∈ X with the codebooks {C_h^*}. L is stored as a lookup structure for computing the distance between the query vector r and the data points in X, with q_i^(*)(x) = Q_(ID of x), that is, the code of x is the entry of Q indexed by the ID of x.
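One common way to use a lookup structure of size m × _K is asymmetric distance computation: for a given query, the squared distance from each query sub-vector to each codeword is tabulated once, and the distance to any encoded point is then a sum of m table entries. The sketch below illustrates this idea; whether the thesis uses exactly this scheme is an assumption, and the function names, codebooks and codes are random stand-ins created for the example.

```python
import numpy as np

def build_lookup(query, codebooks, m, h):
    """table[j, k] = squared distance from the j-th query sub-vector to codeword k
    of the codebook shared by the group that contains subspace j."""
    sub_q = query.reshape(m, -1)
    table = np.empty((m, codebooks[0].shape[0]))
    for j in range(m):
        table[j] = ((codebooks[j // h] - sub_q[j]) ** 2).sum(axis=1)
    return table

def approx_distances(table, codes):
    """Approximate squared distance to every encoded point: sum of m table entries."""
    m = table.shape[0]
    return table[np.arange(m), codes].sum(axis=1)

rng = np.random.default_rng(0)
m, h, K, d_sub = 4, 2, 4, 3  # hypothetical sizes
codebooks = [rng.normal(size=(h * K, d_sub)) for _ in range(m // h)]
codes = rng.integers(0, h * K, size=(5, m))  # stand-in for rows of Q
query = rng.normal(size=m * d_sub)

table = build_lookup(query, codebooks, m, h)
approx = approx_distances(table, codes)
print(table.shape, approx.shape)  # (4, 8) (5,)
```

With this layout the cost per data point is m table lookups, independent of the original dimension D.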

Alongside the lookup set L, the original feature vector data set is also used to build a complete hierarchical clustering tree that represents all data points in the original space. Tree construction starts with a root node corresponding to the entire data set. A clustering algorithm (K-means) then splits the data at this node into K sub-clusters; each sub-cluster becomes a node, represented by its cluster center (codeword). This process is repeated recursively until the number of points in a sub-cluster is small enough, at which point the sub-cluster is treated as a leaf node.

The online phase finds a vector in the feature vector data set that approximates the query vector r in R^D. In essence, this search is a traversal of the pre-built hierarchical clustering tree. Starting from the root node, the child node whose center is closest to r is selected for the next step. The traversal is repeated recursively until the most appropriate leaf node is found.
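The offline tree construction and the online greedy descent can be sketched together as below. This is an illustrative sketch under assumptions: the branching factor, the leaf size, the random data and all function names are hypothetical, the K-means is a plain NumPy version, and candidates inside the leaf are ranked with exact distances where the thesis would use the PSVQ codes.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center for every point (squared Euclidean distance)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means; returns the centers and the final assignment."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_center(points, centers)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers, nearest_center(points, centers)

def build_tree(points, ids, k=4, leaf_size=20, seed=0):
    """Offline phase: recursively split the points listed in `ids` into k sub-clusters."""
    if len(ids) <= max(leaf_size, k):
        return {"ids": ids}  # small enough: leaf node
    centers, labels = kmeans(points[ids], k, seed=seed)
    if np.bincount(labels, minlength=k).max() == len(ids):
        return {"ids": ids}  # degenerate split: stop here
    kept = [c for c in range(k) if np.any(labels == c)]
    children = [build_tree(points, ids[labels == c], k, leaf_size, seed + 1) for c in kept]
    return {"centers": centers[kept], "children": children}

def search_leaf(tree, query):
    """Online phase: follow the closest child center until a leaf is reached."""
    node = tree
    while "children" in node:
        best_child = int(nearest_center(query[None, :], node["centers"])[0])
        node = node["children"][best_child]
    return node["ids"]

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 8))  # hypothetical feature vectors
tree = build_tree(points, np.arange(500))
query = points[42]
leaf_ids = search_leaf(tree, query)
best = leaf_ids[np.argmin(((points[leaf_ids] - query) ** 2).sum(axis=1))]
print(len(leaf_ids), best)
```

Only one branch is followed at each level, so the number of distance computations grows with the depth of the tree rather than with the size of the data set; the price is that the true nearest neighbor may sit in a different leaf, which is why the result is approximate.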

2.2.3 Estimation and assessment

2.2.3.1 Estimation and assessment of PSVQ technique

Data set and hardware configuration

The experiments use high-dimensional data sets, namely ANN_GIST1M and VGG. The algorithm is implemented in C/C++, and the experiments are run on a computer with a standard hardware configuration: 16 GB of RAM and an Intel Core i7 (dual-core) 2.1 GHz processor, running Windows 7.

(a) 1M 960D ANN_GIST (b) 500K 4096D VGG

Figure 2.3 PSVQ encoding quality

Estimation, assessment of encoding quality

The encoding quality of PSVQ with h = 2, 4, 8 is compared with PQ-based methods that optimize the quantizers, namely PQ and ck-means. The results in Figure 2.3 show that the proposed model with h = 8 performs better than the other methods on both the GIST and VGG test data sets. This result indicates that encoding quality is best when the correlation of the data is considered across the largest number of subspaces.

2.2.3.2 Estimation and assessment of ANN search speed

To demonstrate the effectiveness of the algorithm in the shape search phase, the thesis ran a number of experiments on different standard data sets to find the optimal parameters for building the quantizer. The resulting values are d = 48, k = 256 on the ANN_GIST set of 960-dimensional vectors and d = 64, k = 128 on the VGG set of 4096-dimensional vectors.

(a) 1M 960D ANN_GIST (b) 500K 4096D VGG

Figure 2.4 ANN search speed on the feature sets


The search performance of the proposed method is compared with several other ANN search methods: Randomized KD-trees, Randomized K-medoids, K-means tree, POC-trees and EPQ. At a search accuracy above 80% on the GIST data set (Figure 2.4.a), the proposed method is faster than all the other methods. On average, it is about 2 times faster than EPQ, the second-fastest method, and about 7 times faster than the FLANN library search method (FLANN-RC-8trees). Even at a search accuracy above 90%, the proposed method remains faster than the other methods.

Results are similar on the VGG data set, whose feature vectors have a very large number of dimensions (4096) (Figure 2.4.b): the proposed method is again the fastest, about 1.3 to 2.0 times faster than EPQ, which is the best of the remaining methods, and many times faster than FLANN library techniques such as flann-kmeans-1tree.

Conclusion of Chapter 2

In this chapter, the problem of object detection in video is addressed with the improved YOLO-Adv model and the improved PSVQ technique.

The advantage of the YOLO-Adv model is that it maintains real-time speed while increasing accuracy, and it is well suited to the ad object data set.

The improved PSVQ technique is used to index the feature data sets. Shape recognition for the query object is performed with a hierarchical clustering tree over the data set indexed and encoded with the PSVQ technique. The experimental results show that the proposed model outperforms other models in ANN search.
