Opening The Urgency of The Thesis

OPENING
The urgency of the thesis
Nowadays, the rapid development of digital image capture devices has led to an explosion of
multimedia data. Multimedia with attractive content, such as movies, TV programs, and sports
videos, draws a large number of viewers. Advertising objects are integrated and directly
inserted into the content of these videos. This is the fastest and most common way to advertise
products, trademarks, and brand names (ad objects, for short) to large numbers of customers
and consumers.
The process of inserting a new ad object into a video, or replacing an existing ad object with a
different one in order to reuse existing videos, is mostly done manually. However, with the
explosion in the number of videos available both online and offline, analyzing and processing
all video content manually is very expensive and not feasible. This has motivated the thesis to
research techniques that support automatic video post-processing for commercial applications,
such as analyzing advertising content in video or integrating and replacing ad objects in videos
with existing content.
The objective of the thesis
The thesis focuses on researching and improving techniques for the key problems in video
post-processing and video understanding, applied to detecting and replacing advertising
objects in video, with the aim of achieving high performance in both speed and accuracy. The
problems addressed are: object detection in video, including detecting the object and
recognizing its shape; object replacement in video, including segmenting and extracting the
displayed region of the object; and video completion after the detected object has been
replaced with the selected object.
The object and scope of study
The objects of study are: object detection models for video; techniques for indexing very large
sets of high-dimensional vectors and approximate nearest neighbor (ANN) search techniques,
applied to object shape recognition; and models for object segmentation, completion, and
reconstruction of videos after object removal or replacement.
The scope of the study is sports videos, advertisements and movies that have been divided into
a series of consecutive frames. Ad objects are static, two-dimensional image objects. Object
instances are not too small: their size is limited to between 20px and 400px in each dimension.
The method and content of study
The methodology of the thesis combines theoretical and experimental research, including
analysis, comparison, synthesis and evaluation of results based on experiments.
The research content focuses on: improving the object detection model for video based on deep
convolutional neural networks (DCNN); improving indexing techniques based on product
quantization (PQ), together with the search and sample matching techniques applied to the
object shape recognition problem; and improving the object instance segmentation technique
and the DCNN-based completion of missing or damaged regions in video.
The contribution of the thesis
(i) Improving the practical performance of the object detection model in video with a
DCNN-based approach.
(ii) Improving the feature vector indexing technique based on quantization of vector clusters
(PSVQ), which improves the encoding quality for large vector data sets.
(iii) Improving the RBPconv image inpainting model, applied to completing and reconstructing
the blank regions created in the video after the object is replaced.
CHAPTER 1. THE OVERVIEW OF OBJECT REPLACEMENT IN VIDEOS
1.1 Detect object
In traditional methods, object detection is divided into two independent phases: extracting and
describing raw features, and representing these features. Raw feature extraction is the process
of detecting features that are highly invariant to certain geometric transformations. Because
this model focuses only on morphology, its limitation is that the features are chosen
subjectively for all object classes, so it does not generalize well, in addition to being complex
and having very high computational cost.
With DCNN-based object detection, all of the previously separate steps are carried out by the
layers of a single neural network. DCNN-based object detection models fall into two types:
two-stage and one-stage. In a two-stage model, candidate object regions (region proposals) are
generated first. Next, the DCNN is used to extract features from the region proposals, and
finally classification and regression are performed to determine the class and the bounding box
of the object. The advantage of this method is its relatively high accuracy, but processing is
relatively slow, even for a single image.
To overcome the disadvantages of the two-stage model, the one-stage model does not use
region proposals but relies on a single DCNN that maps pixels directly to bounding box
coordinates and class probabilities. Typical models in this group are YOLO and SSD, which
achieve fast execution times in object detection; their disadvantage is that they rely only on
high-level features, so their accuracy is not high.
1.2. Identify the shape of object
To identify the shape of an object from the feature vector selected during object detection,
many techniques for indexing high-dimensional data sets have been studied. Effective
techniques include hash-based methods, clustering, spatial partitioning, and product
quantization. Among these, search based on quantization gives the best results on large
numerical vector data sets. Therefore, the thesis studies this technique in detail, and improves
and develops it for the matching problem of finding the object shape in a given set of shapes
based on feature vectors.
Because the data set can reach millions of records and each vector has a large number of
dimensions (thousands), matching time is an important issue for video processing applications
that must run in real time. To reduce sample matching time, the feature vector set is indexed
and encoded using the PQ technique, which reduces the storage space. Fast ANN search
methods then operate on the encoded space to find the approximately closest object.
1.3. Replace and complete Videos
After the object in the video is detected, the display region of the object needs to be extracted
and removed from the video. Similarly, the region of the replacement object is extracted from
the target image and inserted into the region that was removed from the video. This editing
leaves areas that are damaged because the overlap is incomplete, and they need to be completed
during post-processing. Video inpainting is a suitable technique for reconstructing and
completing these areas. Many video inpainting studies aim to restore the damaged image to
approximately the original; they mainly follow two approaches: sampling-based and
CNN-based.
In sampling-based approaches, the lost part of the image is restored by growing the completed
region from the outer edge toward the center of the hole, searching for suitable patches and
stitching them together. Their biggest drawback is that they cannot handle cases where no
suitable match for the lost part can be found in the data.
Studies that use CNNs to complete the blank area usually adopt an encoder-decoder network as
the basic architecture, which learns the contextual features of an image and uses them to
complete the image. The resulting image is often more realistic than with the sampling-based
approach.
Conclusion chapter 1:
This chapter gave an overview of object detection models, techniques for recognizing object
shapes in video based on feature vector data sets, and models for completing damaged regions
in video. By evaluating the advantages and disadvantages of previous studies, the thesis has
identified a suitable research direction for the problem of detecting and replacing objects in
video.
CHAPTER 2. OBJECT DETECTION AND IDENTIFICATION IN VIDEO
This chapter introduces an improved real-time object detection model with high accuracy.
Detection speed (> 30 frames per second) is the key factor in choosing the model. Therefore,
the thesis focuses on improving the YOLO model to suit ad objects, increasing accuracy while
maintaining real-time speed. The PSVQ technique is then combined with a hierarchical
clustering tree to find objects of the same shape in the available feature set, based on the
selected feature vectors.
2.1 Object detection in video
2.1.1 Some improvements in YOLO-Adv model
2.1.1.1 Improvement of loss Function
To reduce the effect of object size and inclination on the model, the way the loss is computed
for the width and height of the bounding box has been improved. Relative measures are used
instead of the absolute measures in the width and height term of the original YOLO loss
function (the term marked *). The improved loss function is defined by the following formula:
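$$
\begin{aligned}
Loss_{new} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left( \frac{w_i - \hat{w}_i}{\hat{w}_i} \right)^2 + \left( \frac{h_i - \hat{h}_i}{\hat{h}_i} \right)^2 \right] \qquad (*) \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$
In particular, in the new loss function, the relative terms $\frac{w_i - \hat{w}_i}{\hat{w}_i}$ and $\frac{h_i - \hat{h}_i}{\hat{h}_i}$
are used to replace $\sqrt{w_i} - \sqrt{\hat{w}_i}$ and $\sqrt{h_i} - \sqrt{\hat{h}_i}$.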
This improvement increases the accuracy of bounding box detection, reduces the impact of
changes in the size and inclination of objects across frames, and at the same time speeds up
convergence during the training phase.
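To make the difference between the two width/height terms concrete, the following minimal sketch compares them for one small and one large box that have the same 10% size error. The function names and the box sizes are illustrative assumptions, not taken from the thesis; only the two formulas come from the text above.

```python
import math


def wh_term_sqrt(w, h, w_hat, h_hat):
    """Width/height term in the square-root form of the original YOLO loss."""
    return (math.sqrt(w) - math.sqrt(w_hat)) ** 2 + (math.sqrt(h) - math.sqrt(h_hat)) ** 2


def wh_term_relative(w, h, w_hat, h_hat):
    """Width/height term in the relative form used by the improved loss."""
    return ((w - w_hat) / w_hat) ** 2 + ((h - h_hat) / h_hat) ** 2


# Hypothetical boxes (w, h, w_hat, h_hat) in pixels, both 10% off in each dimension.
small_box = (22.0, 22.0, 20.0, 20.0)
large_box = (440.0, 440.0, 400.0, 400.0)

sqrt_small = wh_term_sqrt(*small_box)
sqrt_large = wh_term_sqrt(*large_box)
rel_small = wh_term_relative(*small_box)
rel_large = wh_term_relative(*large_box)

print("sqrt form:    ", sqrt_small, sqrt_large)
print("relative form:", rel_small, rel_large)
```

In this example the square-root form still penalizes the large box far more than the small one, while the relative form gives both boxes the same penalty, which is the size-independence the improved term is meant to provide.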
2.1.1.2 Improvement in network architecture
With the DCNN, local characteristics appear in the lower layers. To use these local features, a
multi-level feature fusion strategy is used within the Darknet-53 network architecture. In this
strategy, the feature map produced by the Residual 8x256 block is further convolved with
3x3x256 and 1x1x64 filters, and the ReShape / 2 operator is then used to reshape the feature
map so that it matches the feature maps of the later layers. Finally, features at different levels
are fused in order to enrich the local features.
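The text does not define the ReShape / 2 operator. One plausible reading, assumed here and not confirmed by the source, is a space-to-depth rearrangement that halves the spatial resolution by moving each 2x2 block of positions into the channel dimension. The sketch below shows that operation on a tiny feature map stored as nested lists; the function name and the channel ordering are arbitrary choices for illustration.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C map into (H/block) x (W/block) x (C*block*block).

    Assumption: this is only one possible interpretation of "ReShape / 2".
    """
    height, width = len(fmap), len(fmap[0])
    assert height % block == 0 and width % block == 0
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            cell = []
            # Concatenate the channels of the block x block neighbourhood.
            for di in range(block):
                for dj in range(block):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# A 4 x 4 map with one channel; the value encodes (row, column) as 10*row + column.
fmap = [[[10 * r + c] for c in range(4)] for r in range(4)]
out = space_to_depth(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # spatial size halves, channels x4
print(out[0][0])
```

Under this reading, a 64-channel map would become a 256-channel map at half the resolution, which would let it be concatenated with a deeper 256-channel feature map of the same spatial size.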
2.1.2 Estimation and assessment of the object detection model
2.1.2.1 Test data and installation environment
The FlickrLogos-47 data set is used for training and testing the YOLO-Adv model.
Object detection in video is run on a GPU server with an Nvidia Tesla K80 GPU (24GB of
video memory), running Ubuntu 14 with 64GB of main memory.
Pic 2.1 Average loss function value during training
2.1.2.2 Testing result
Training phase estimation
Pic 2.1 shows the average loss function value of three models: YOLO-Adv, YOLO-Loss and
the original YOLOv3. YOLO-Loss is a model that improves only the loss function, while
YOLO-Adv improves both the loss function and the network architecture. All three models are
trained on the FlickrLogos-47 data set. The results show that the average loss of all three
models decreases rapidly in the first 5000 iterations and finally stabilizes at a very small value
after about 15000 iterations. However, the YOLO-Adv model shows the fastest decrease in
average loss at the beginning and is the first of the three models to reach its smallest value,
followed by YOLO-Loss and YOLOv3. This indicates that the improved loss function and
network architecture make the model highly stable, less affected by ad object size and
inclination, and well suited to the selected training data set.
(a) YOLOv3 (b) YOLO-Loss (c) YOLO-Adv
Pic 2.2. IoU value during training

The IoU comparison, which reflects the accuracy of bounding box localization, is illustrated in
Pic 2.2. The average IoU of all three models tends to increase steadily and then remains stable
in the range [0.7, 1.0]. This shows that all three models locate the bounding box with high
accuracy. However, the IoU of the YOLO-Adv model increases the fastest, i.e. it trains the
fastest of the three models. In addition, the IoU of YOLO-Adv stays stable at the highest level,
which means that its object detection accuracy is the highest.
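For reference, IoU (intersection over union) between a predicted and a ground-truth box can be computed as in the minimal sketch below. The corner-based box format (x_min, y_min, x_max, y_max) and the function name are assumptions made for illustration; the thesis does not state which box representation it uses.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 10 x 10 boxes that overlap on half of their area.
half_overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(half_overlap)
```

A value of 1.0 means the two boxes coincide and 0.0 means they do not overlap, so the reported range [0.7, 1.0] corresponds to predictions that cover most of the ground-truth box.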
Testing phase estimation
The classification accuracy of the YOLO-Adv model is then compared with that of the
YOLOv3 and YOLO-Loss models on the FlickrLogos-47 training data set with a threshold of
0.5, using the mAP measure.
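As a reminder of how mAP is obtained, the sketch below computes average precision (AP) for one class from detections ranked by confidence and then averages AP over classes. The source does not say which AP variant is used; this sketch assumes all-point interpolation, and it assumes each detection has already been marked as a true or false positive by matching it to the ground truth at the chosen threshold. All names and the sample data are hypothetical.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP for one class; detections must be sorted by descending confidence."""
    true_positives = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Make precision non-increasing in recall (interpolated precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class: list of (is_true_positive, num_ground_truth), one entry per class."""
    return sum(average_precision(hits, n) for hits, n in per_class) / len(per_class)


# Hypothetical example with two classes and two ground-truth objects each.
class_a = ([True, False, True], 2)
class_b = ([True, True], 2)
ap_a = average_precision(*class_a)
map_value = mean_average_precision([class_a, class_b])
print(ap_a, map_value)
```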
Comparing the detection results on the mAP measure shows that YOLO-Adv gives the highest
detection accuracy, with mAP reaching 80.2 (Table 2.1), compared with 77.4 and 74.0 for
YOLO-Loss and YOLOv3, respectively. In addition, with an average processing time of 0.028s
per frame, the YOLO-Adv model achieves real-time speed, processing about 35 frames per
second.
Table 2.1. Efficiency on the FlickrLogos-47 data set

Model        mAP     s/image
YOLOv3       74.0    0.038
YOLO-Loss    77.4    0.032
YOLO-Adv     80.2    0.028
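The per-image times in the table above convert to frame rates by taking the reciprocal. The short sketch below does this arithmetic and checks each model against the 30 frames per second target stated at the start of the chapter; the variable names are illustrative.

```python
# Seconds per image, copied from the table above.
seconds_per_image = {"YOLOv3": 0.038, "YOLO-Loss": 0.032, "YOLO-Adv": 0.028}
REAL_TIME_FPS = 30.0  # real-time target stated at the start of the chapter

frames_per_second = {name: 1.0 / s for name, s in seconds_per_image.items()}
for name, fps in frames_per_second.items():
    status = "real-time" if fps > REAL_TIME_FPS else "below target"
    print(f"{name}: {fps:.1f} fps ({status})")
```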
2.2 Recognise the shape of object
The main task of the object shape recognition phase is to determine the exact shape of the
object found in the previous phase. To perform this task, the thesis uses a very large set of
shape feature vectors extracted by the YOLO-Adv model. This data set is indexed and encoded,
and identification is performed by matching the feature vector of the query object against the
vectors in the data set.
2.2.1 Model PSVQ
X is the selected set of feature vectors. The symbol $x \in X$ denotes a vector (data point) in X,
and $x^{(j)} \in \mathbb{R}^{D/m}$ is the j-th sub-vector of x, with $j = 1, 2, \ldots, m$.
The original space of X is first divided into m distinct subspaces, each of dimension D/m. In
this decomposition the correlation of the data between subspaces is not considered, which
leads to redundant codewords. To address this, PSVQ pools h adjacent subspaces out of the m
subspaces and then applies vector quantization to the pooled subspaces. Specifically,
combining h ($1 \le h \le m$) adjacent subspaces creates $\bar{m} = m/h$ groups, and a separate
quantizer is trained on each of these $\bar{m}$ groups, giving $\bar{m}$ sub-quantizers. Each subspace
now has $\bar{K} = h \times K$ cluster centers, and several subspaces share the same codebook. This
produces a finer decomposition of the original data without increasing the number of
codewords ($\bar{m} \times \bar{K} = m \times K$ codewords in total).
Thus, for a feature vector data set X consisting of n points in $\mathbb{R}^{D}$, applying the above
quantization to all data points in X with the codebooks $\{C_i^*\}$ obtained during training yields
a code set Q consisting of n codes, one for each element of X. Each element of Q is a vector of
size m whose entries take values in the range $[0, \bar{K}-1]$. Q therefore holds n x m integer
entries, so it requires many times less memory than the real-valued data set X.
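The pooling idea described above can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the thesis implementation: the helper names, the plain K-means routine, the random data and the parameter values (D = 8, m = 4, h = 2, K = 4) are all hypothetical. It trains m/h shared codebooks with h x K codewords each on the pooled sub-vectors, then encodes every vector as m integer codes.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's K-means returning k centers (toy version)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for c, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def split(x, m):
    """Split x into m sub-vectors of equal length."""
    step = len(x) // m
    return [x[j * step:(j + 1) * step] for j in range(m)]


def train_psvq(data, m, h, K):
    """Train m/h codebooks; each is shared by h adjacent subspaces and has h*K codewords."""
    assert m % h == 0 and len(data[0]) % m == 0
    codebooks = []
    for g in range(m // h):
        pooled = [sub for x in data for sub in split(x, m)[g * h:(g + 1) * h]]
        codebooks.append(kmeans(pooled, h * K, seed=g))
    return codebooks


def encode(x, codebooks, m, h):
    """Encode x as m integer codes, each in the range [0, h*K - 1]."""
    return [nearest(sub, codebooks[j // h]) for j, sub in enumerate(split(x, m))]


rng = random.Random(42)
D, m, h, K, n = 8, 4, 2, 4, 200
data = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(n)]
codebooks = train_psvq(data, m, h, K)
codes = [encode(x, codebooks, m, h) for x in data]
print(len(codebooks), len(codebooks[0]), codes[0])
```

Setting h = 1 in this sketch reduces it to ordinary product quantization with one K-codeword codebook per subspace, while the total number of codewords stays at m x K for every valid h.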
2.2.2 Searching for ANN based on a hierarchical cluster tree
The ANN search is based on a complete hierarchical clustering tree combined with a data set
encoded by the PSVQ method, and consists of two phases: an offline phase (data preparation
and construction of the search tree) and an online phase (tree traversal). In the offline phase,
the set L (of size $m \times \bar{K}$) is the codebook set of X built with the PSVQ method. Each code
in L is formed by quantizing one data point $x \in X$ with the codebooks $\{C_h^*\}$. The codebook
set L is stored as a lookup table for computing the distance between the query vector r and the
data points in X, with $q_i^{(*)}(x) = Q_{\mathrm{ID}(x)}$.
Along with building the lookup set L, the original feature vector data set is also used to create a
complete hierarchical clustering tree that represents all data points in the original space. Tree
construction starts by creating the root node, which corresponds to the entire data set. Next, a
clustering algorithm (K-means) is applied to split the data at this node into K sub-clusters; each
sub-cluster becomes a node represented by its cluster center (codeword). This process is
repeated recursively until the number of points in a sub-cluster is small enough, at which point
the sub-cluster becomes a leaf node.
The online phase finds, in the feature vector data set, an approximate nearest vector to the
query vector r in $\mathbb{R}^{D}$. In essence, this search is a traversal of the pre-built hierarchical
clustering tree. Starting from the root node, the child node whose center is closest to r is
selected to visit next. This step is repeated recursively until the most appropriate leaf node is
found.
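The offline tree construction and the online greedy descent can be sketched as follows. This is a simplified illustration under stated assumptions: the names, the toy K-means, the branching factor and the leaf size are hypothetical, and distances inside the leaf are computed exactly on the raw vectors, whereas the text describes using the PSVQ lookup table for that step.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_groups(points, k, iters=10, seed=0):
    """Toy K-means returning (center, members) pairs for the non-empty clusters."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[best].append(p)
        for c, group in enumerate(groups):
            if group:
                centers[c] = [sum(col) / len(group) for col in zip(*group)]
    return [(centers[c], group) for c, group in enumerate(groups) if group]


def build_tree(points, k=4, leaf_size=8, seed=0):
    """Offline phase: recursively cluster until a node holds few enough points."""
    node = {"points": points, "children": []}
    if len(points) > leaf_size:
        parts = kmeans_groups(points, k, seed=seed)
        if len(parts) > 1:  # stop if clustering could not split the node
            node["children"] = [
                (center, build_tree(group, k, leaf_size, seed + 1))
                for center, group in parts
            ]
    return node


def search(node, query):
    """Online phase: follow the closest center down to a leaf, then scan the leaf."""
    while node["children"]:
        _, node = min(node["children"], key=lambda child: sq_dist(query, child[0]))
    return min(node["points"], key=lambda p: sq_dist(query, p))


rng = random.Random(7)
data = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(300)]
tree = build_tree(data)
query = [0.1, -0.2, 0.3, 0.0]
found = search(tree, query)
exact = min(data, key=lambda p: sq_dist(query, p))
print(sq_dist(query, found), sq_dist(query, exact))
```

Because the descent follows only one branch at each level, the returned point is an approximate nearest neighbor: it is never closer than the exact one found by a full scan, and it can be farther when the true neighbor lies in a branch that was not visited.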
2.2.3 Estimation and assessment
2.2.3.1 Estimation and assessment of PSVQ technique
Data set and hardware configuration
The experiments use data sets with large numbers of dimensions, namely ANN_GIST1M and
VGG. The algorithm is implemented in C/C++, and the experiments are run on a computer
with a standard hardware configuration: 16GB RAM, Intel Core i7 (dual-core) 2.1 GHz,
running Windows 7.
(a) 1M 960D ANN_GIST (b) 500K 4096D VGG
Pic 2.3 Encoding quality of PSVQ
Estimation, assessment of encoding quality
The encoding quality of PSVQ with h = 2, 4, 8 is compared with methods based on the PQ
approach and its codebook optimization mechanisms: PQ and ck-means. The comparison in
Pic 2.3 shows that the proposed model with h = 8 performs better than the other methods on
both the GIST and VGG test data sets. This result indicates that, at this setting, the correlation
of the data across subspaces is taken into account to the greatest extent.
2.2.3.2 Estimation and assessment of ANN search speed
To demonstrate the effectiveness of the algorithm in the object shape search phase, the thesis
ran a number of experiments on different standard data sets to find the optimal parameters for
building the quantizer. The resulting values are d = 48, k = 256 on the ANN_GIST set of
960-dimensional vectors, and d = 64, k = 128 on the VGG set of 4096-dimensional vectors.
(a) 1M 960D ANN_GIST (b) 500K 4096D VGG
Pic 2.4 ANN search speed on the feature sets

The search performance of the proposed method is compared with several other ANN search
methods: Randomized KD-trees, Randomized K-medoids, K-means tree, POC-trees and EPQ.
At a search accuracy above 80% on the GIST data set (Pic 2.4.a), the search speed of the
proposed method is superior to that of the remaining methods. On average, the proposed
method is about 2 times faster than EPQ, the second-fastest method, and about 7 times faster
than the FLANN library search method (FLANN-RC-8trees). In particular, at a search
accuracy above 90%, the proposed method still gives a superior search speed compared with
the other methods.
The results are similar on the VGG data set, which has a very large number of dimensions
(4096-dimensional feature vectors, Pic 2.4.b): the proposed method is about 1.3 to 2.0 times
faster than EPQ, the best of the remaining methods, and many times faster than FLANN library
techniques such as flann-kmeans-1tree.
Conclusion chapter 2
In this chapter, the problem of object detection in video is solved with the improved
YOLO-Adv model and the improved PSVQ technique.
The advantage of the YOLO-Adv model is that real-time processing speed is maintained while
accuracy is increased, which suits the ad object data set. The PSVQ technique is improved for
indexing feature data sets. Shape recognition for the query object is performed with the
hierarchical clustering tree on the data set indexed and encoded with the PSVQ technique. The
experimental results show that the proposed model outperforms other models in the field of
ANN search.
