
Author’s Accepted Manuscript

A Scene Change Detection Framework for Multi-Temporal Very High Resolution Remote Sensing Images

Chen Wu, Lefei Zhang, Liangpei Zhang

www.elsevier.com/locate/sigpro

PII: S0165-1684(15)00322-9
DOI: http://dx.doi.org/10.1016/j.sigpro.2015.09.020
Reference: SIGPRO5920
To appear in: Signal Processing
Received date: 23 June 2015
Revised date: 24 August 2015
Accepted date: 15 September 2015
Cite this article as: Chen Wu, Lefei Zhang and Liangpei Zhang, A Scene Change
Detection Framework for Multi-Temporal Very High Resolution Remote
Sensing Images, Signal Processing,
http://dx.doi.org/10.1016/j.sigpro.2015.09.020
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
A Scene Change Detection Framework for
Multi-Temporal Very High Resolution Remote
Sensing Images
Chen Wu 1, Lefei Zhang 2*, and Liangpei Zhang 3

1. International School of Software, Wuhan University, Wuhan, P.R. China

2. School of Computer, Wuhan University, Wuhan, P.R. China

3. State Key Laboratory of Information Engineering in Surveying, Mapping and

Remote Sensing, Wuhan University, Wuhan, P.R. China

E-mail: zhanglefei@whu.edu.cn

Abstract

The technology of computer vision and image processing has attracted more and more attention in recent years, and has been applied in many research areas such as

remote sensing image analysis. Change detection with multi-temporal remote sensing

images is very important for the dynamic analysis of landscape variations. The

abundant spatial information offered by very high resolution (VHR) images makes it

possible to identify the semantic classes of image scenes. Compared with the

traditional approaches, scene change detection can provide a new point of view for the

semantic interpretation of land-use transitions. In this paper, for the first time, we

explore a scene change detection framework for VHR images, with a

bag-of-visual-words (BOVW) model and classification-based methods. Image scenes

are represented by word frequencies computed with three kinds of multi-temporal learned dictionary, i.e., the separate dictionary, the stacked dictionary, and the union dictionary.

Three features (multispectral raw pixel; mean and standard deviation; and SIFT) and

their combinations were tested in scene change detection. Post-classification and

compound classification were evaluated for their performances in the “from-to”

change results. Two multi-temporal scene datasets were used to quantitatively

evaluate the proposed scene change detection approach. The results indicate that the

proposed scene change detection framework can obtain a satisfactory accuracy and

can effectively analyze land-use changes, from a semantic point of view.

Keywords: Scene change detection, VHR image, remote sensing, BOVW,

post-classification, compound classification.

1. Introduction

The technology of computer vision and image processing has attracted more and more attention in recent years [1-7], and has been applied in many research areas such as

remote sensing image analysis [8], [9]. Dynamic analysis of landscape variations is

very important for understanding the interactions between human development and

natural phenomena [10]. Since the advent of remote sensing technology, it has been a

major source of information for change detection studies to detect, identify, and

analyze the changes in landscape conditions, because of its broad view, high temporal

frequency, and consistent observations [11-14]. Change detection with multi-temporal

remote sensing images has proven effective in many applications [15], [16].

In recent years, numerous studies have addressed the automatic and accurate

detection of changes from multi-temporal images [17]. The main methods can be

categorized into the following groups: 1) image difference [18]; 2) image

transformation [19]; and 3) classification-based methods [20]. Since

classification-based change detection methods can indicate the detailed “from-to”

change types for further analysis, they are very convenient for practical applications

[21]. Nowadays, the new remote sensing technology with a higher resolution has

brought about a revolution in the analysis of multi-temporal datasets [22]. In addition

to the detailed characterization and monitoring of landscape changes with

object-oriented methods [23], very high resolution (VHR) imagery also provides

unique opportunities for the semantic interpretation of land-use transitions.

Meanwhile, classifying image scenes into semantic classes has attracted increasing

attention in the pattern recognition field [24-26]. Due to the abundant and detailed

spatial information provided by VHR imagery, the remote sensing scenes can be

characterized as semantic land-use classes based on the spatial and structural patterns

of the landscapes encoded in the imagery [27]. A great deal of research has focused on

remote sensing scene classification [28]. The bag-of-visual-words (BOVW) model

has been explored as an effective and robust feature encoding approach to obtain the

statistical characteristics of thematic objects for remote sensing scenes [29]. It can

provide a mid-level representation to bridge the huge semantic gap between the

extracted low-level features in the image and the high-level concepts of human beings

[30]. The combination of multiple features, such as texture features and SIFT feature

descriptors, can improve the performance of scene classification [31], [32]. Topic

models derived from text categorization, such as probabilistic latent semantic analysis

(pLSA) and latent Dirichlet allocation (LDA), have been proposed to reduce the

dimension and increase the accuracy [33], [34].

With the increasing availability of high-resolution images covering the Earth's surface, it is now possible to develop change detection methods at the scene scale. This can provide a new point of view to analyze and monitor urban development, such as the expansion of industrial and residential areas. The problem is that the changes of

thematic landscapes inside a scene may not lead to the variation of the scene classes.

The traditional change detection approaches, such as the pixel-wise and object-wise

methods, can only identify the changes of the thematic objects, such as the growth of

vegetation and the appearance of new buildings. However, such variations inside a scene do not necessarily modify the land-use category, e.g., from a residential area to an industrial area. This illustrates the semantic gap between landscape changes and scene changes.

It is therefore necessary and significant to explore scene change detection, which

takes into consideration the semantic comprehension of the land-use variations.

However, to the best of our knowledge, no previous studies have focused on this

topic. Therefore, in this paper, we explore a scene change detection framework for

multi-temporal VHR remote sensing imagery, and evaluate its performance in the

application of urban development management. The scene classification method is

based on a BOVW model with three different features, i.e., the multispectral raw pixel;

the mean and standard deviation; and the SIFT feature descriptor. Three kinds of

dictionary, i.e., the separate dictionary, the stacked dictionary, and the union dictionary,

are proposed to make use of the multi-temporal information to improve the

performance. Post-classification and compound classification approaches for the

scenes are addressed to extract the semantic land-use change information. Finally, the

urban development is mapped and analyzed with the land-use transition interpretation.

The remainder of this paper is organized as follows. Section 2 introduces scene

classification with the BOVW approach. Section 3 explores the change detection

process for multi-temporal scene data. Experiments with two multi-temporal VHR

imagery datasets and a discussion are presented in Section 4. Finally, the conclusion

of this paper is drawn in Section 5.

2. Scene Classification with BOVW

In the BOVW approach, a scene can be identified by the thematic objects it

contains. The idea of BOVW derives from text analysis, where the documents can be

classified according to the frequencies of the words contained therein, regardless of

the word order [35]. For scene classification, BOVW aims at representing the image

scene with the statistical frequency of the visual words as its characteristic feature for

the following classification process, as shown in Fig. 1. The procedure of BOVW

scene classification can be described as follows: 1) randomly select image patches

from the scene images in the scene dataset; 2) learn a dictionary from the feature sets

of the selected patches; 3) for each scene, extract the features of the patches in the

image as visual words; 4) encode the features of the visual words with the learned

dictionary; 5) pool the feature codes to obtain the word frequency representation as

the characteristic for the image scene; 6) repeat steps 3–5 to represent all the scenes

with the word frequency; and 7) classify the image scenes into semantic scene classes.

Next, we introduce the strategies of the framework used in this paper in detail.

Fig. 1. Flowchart of scene classification with BOVW.

2.1 Feature Extraction

The features of the patch should be representative and distinguishable to describe

the different thematic objects for the visual words. In this paper, we evaluate the

performance of scene change detection with three descriptive features, which are the

raw pixel vector (RAW), the mean and standard deviation (MSTD), and the dense

SIFT feature (SIFT).

The raw pixel vector is widely used as a low-level descriptor for image patches [24].

For a multi-band image patch with the size of m×n and b bands, it is represented as a column vector $x \in \mathbb{R}^L$, where $L = m \times n \times b$. The image patch can be reshaped row-wise or column-wise, and filled into the raw pixel vector band by band.

The MSTD is a simple descriptor for an image patch to represent its characteristics.

The mean value and standard deviation of each band are calculated and listed in a column vector $x \in \mathbb{R}^L$, where $L = 2 \times b$.
The final descriptor we use is the dense SIFT descriptor. The SIFT descriptor is

invariant to translation, rotation, and scaling in the image domain and is robust with

regard to illumination variation [36]. Previous work has proved that in the application

of scene classification, the dense SIFT descriptor is more effective than sparse interest

points [24], [27]. In this paper, we use the first principal component of PCA as the

input of the dense SIFT descriptor, and obtain the descriptor vector $x \in \mathbb{R}^L$, where $L = 128$. We use the dense SIFT implementation provided by Lazebnik et al. [37] to

extract the features.

Different features can also be combined for a better classification performance. In

this paper, we use a simple approach to combine multiple features by stacking the

word frequencies of the different features into a higher-dimensional representation

vector. It is worth noting that we stack the word frequencies instead of the original

features directly, since the word frequencies of the different features are at the same

scale, and we do not have the problem of feature normalization.
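To make these descriptors concrete, the following minimal sketch (our own illustration, not the authors' code) computes the RAW and MSTD vectors for a single multi-band patch and combines the word frequencies of different features by stacking; the function names and array layout are assumptions.

```python
import numpy as np

def raw_feature(patch):
    # RAW: reshape an m x n x b patch into a length m*n*b vector, band by band
    return np.transpose(patch, (2, 0, 1)).reshape(-1).astype(np.float64)

def mstd_feature(patch):
    # MSTD: mean and standard deviation of each band -> length 2*b vector
    pixels = patch.reshape(-1, patch.shape[2])
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])

def combine_word_frequencies(freqs):
    # combine features by stacking their (already normalized) word frequencies,
    # e.g. RAW + SIFT, into one higher-dimensional scene representation
    return np.concatenate(freqs)

patch = np.random.rand(10, 10, 4)   # a 10x10 patch with 4 spectral bands
x_raw = raw_feature(patch)          # length 400
x_mstd = mstd_feature(patch)        # length 8
```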

2.2 Dictionary Learning

The dictionary must be comprehensive and representative of the visual words. The

learned dictionary is then used to quantize the extracted features by assigning the

labels of the visual words.

One way to prepare the source data for dictionary learning is to use all the features

extracted from all the scenes in the learning process. However, when the volume of

the input scene data is large, the dictionary learning will be time-consuming.

Therefore, in this paper, we randomly select a large number of sample patches from the scene images in the dataset to generate the input data matrix $X = [x_1, x_2, \ldots, x_M]$. The number of samples is usually very large, e.g., $M = 100000$.

In order to reduce the redundancy of the input data, zero-phase component analysis (ZCA) whitening is applied to remove the correlations between the different dimensions [49].

K-means clustering is an effective feature representation learning method and can

be quickly implemented at a large scale [38]. This approach has also been applied in

many studies of scene classification [33]. The whitened input dataset is quantized into

K clusters by the K-means algorithm, and their centers are the elements of the learned

dictionary $D = [d_1, d_2, \ldots, d_K]$.
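A minimal sketch of this step, assuming scikit-learn's KMeans and a standard eigenvalue-based ZCA whitening; the regularization constant and function names are our assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def zca_whitening(X, eps=1e-5):
    # X: (M, L) matrix of sampled patch features, one feature per row
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # (L, L) whitening matrix
    return (X - mu) @ W, W, mu

def learn_dictionary(X, K=1000, seed=0):
    # whiten the sampled features, then cluster them into K visual words;
    # the cluster centers d_1, ..., d_K form the dictionary D
    Xw, W, mu = zca_whitening(X)
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(Xw)
    return km.cluster_centers_, W, mu

# e.g. with M = 100000 randomly sampled patch features:
# D, W, mu = learn_dictionary(X, K=1000)
```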

2.3 Feature Encoding and Pooling

After obtaining the dictionary, the training and testing scenes can be represented

with the encoded features as $P_{\mathrm{train}} = [p_1^{\mathrm{train}}, p_2^{\mathrm{train}}, \ldots, p_T^{\mathrm{train}}]$ and $P_{\mathrm{test}} = [p_1^{\mathrm{test}}, p_2^{\mathrm{test}}, \ldots, p_N^{\mathrm{test}}]$, where T and N are the numbers of training and testing

scenes. Dividing the image patches with a non-overlapping grid is a traditional

approach. However, Bosch et al. [24] proved that overlapping patches can lead to a

better performance. Therefore, in this paper, the image patches used for the feature

extraction are square windows with a size of m×n, and they cover the whole image

scene with a step of s moving in the horizontal and vertical directions. The ZCA

whitening matrix obtained in the dictionary learning should be used before the

encoding to preprocess the features. The whitened feature of each patch is then

quantized to the visual word that has the greatest similarity in the dictionary.

After feature encoding, the pooling process should be applied to represent the scene

with the encoded words. In this paper, we compute the histogram of the labeled words in the scene and normalize it to obtain the word frequency, since it is very robust in

practical applications [35].
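The encoding and pooling of one scene could be sketched as follows (our illustration; the hard-assignment nearest-word quantization and the helper names are assumptions consistent with the description above).

```python
import numpy as np
from scipy.spatial.distance import cdist

def scene_patch_features(scene, extract, size=10, step=5):
    # overlapping square windows covering the scene, moved with step s in both directions;
    # "extract" is a feature function such as the RAW or MSTD descriptor
    feats = [extract(scene[i:i + size, j:j + size, :])
             for i in range(0, scene.shape[0] - size + 1, step)
             for j in range(0, scene.shape[1] - size + 1, step)]
    return np.asarray(feats)

def word_frequency(features, D, W, mu):
    # whiten with the ZCA matrix from dictionary learning, assign each patch to
    # its most similar visual word, and pool into a normalized histogram
    Xw = (features - mu) @ W
    labels = cdist(Xw, D).argmin(axis=1)
    hist = np.bincount(labels, minlength=D.shape[0]).astype(np.float64)
    return hist / hist.sum()
```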

2.4 Classification

The dictionary representation is the input data for the scene classification. A

dictionary will usually consist of hundreds of words, and thus the representation for

the scene is very high-dimensional. Since the number of training samples is

usually much smaller than the data dimension, we choose support vector machine

(SVM) as the classifier in this paper [39]. LIBSVM was applied with a linear kernel

to classify the scenes in our experiments [40].
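The paper uses LIBSVM with a linear kernel; a roughly equivalent sketch with scikit-learn (whose SVC is backed by LIBSVM) is given below on synthetic word frequencies, since the real ones come from the pooling step. The sample sizes are taken from the Hanyang experiment; everything else is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
K = 1000                                   # dictionary size
P_train = rng.random((157, K))             # word frequencies of training scenes
P_train /= P_train.sum(axis=1, keepdims=True)
y_train = rng.integers(1, 9, size=157)     # eight scene classes, labels 1..8
P_test = rng.random((1066, K))             # word frequencies of test scenes
P_test /= P_test.sum(axis=1, keepdims=True)

clf = SVC(kernel="linear", C=1.0)          # linear kernel, as in the experiments
clf.fit(P_train, y_train)
y_pred = clf.predict(P_test)
```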

3. Scene Change Detection

Since the scene change detection aims to analyze the land-use transition from a

semantic point of view, supervised information provided by interpreters is essential

for the semantic definition of the different scene classes. Inspired by traditional

change detection, we use a classification-based method to determine the scene

changes. In this paper, post-classification (comparing the class maps after separate

classification) and compound classification (classifying the stacked multi-temporal

scene) are proposed and evaluated for the scene change detection. In addition, as the

temporal information is important in multi-temporal image analysis, we also attempt

to add the temporal information into the learned dictionary.

3.1 Post-Classification and Compound Classification

Post-classification and compound classification are the two most commonly used

classification-based change detection methods in land-cover/land-use change mapping

and ecosystem monitoring [21], [41]. They can provide the “from-to” change type

information for the following detailed analysis. For the proposed scene change

detection framework, we also employ these two approaches to map the changed

scenes and to identify the change types, as shown in Fig. 2.

Fig. 2. Scene change detection of (a) post-classification, and (b) compound classification.

Fig. 2 (a) shows that, in post-classification, the multi-temporal scenes are classified

separately as $L^{t1} = [l_1^{t1}, l_2^{t1}, \ldots, l_N^{t1}]$ and $L^{t2} = [l_1^{t2}, l_2^{t2}, \ldots, l_N^{t2}]$. The scene labels after scene classification are compared to identify the change conditions of the corresponding scenes at the same location. The compound classification classifies the multi-temporal scene pair to be $L^{\mathrm{from\text{-}to}} = [l_1^{\mathrm{from\text{-}to}}, l_2^{\mathrm{from\text{-}to}}, \ldots, l_N^{\mathrm{from\text{-}to}}]$ at one time, with

the stacked word frequency, as shown in Fig. 2 (b). Theoretically, the accuracy of

post-classification is limited since the misclassified scenes at each time point will lead

to errors in the change detection result, no matter whether or not the corresponding

scenes at another time are correctly classified. Although the compound classification

approach can solve this problem with one-pass classification, the direct selection of

training samples for all the “from-to” scene classes is too complex. Therefore, we

generate the simulated training samples for the compound classification with separate

training samples in the multi-temporal dataset.
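A minimal sketch of the two strategies (our illustration, with assumed helper names): post-classification simply compares the per-date scene labels, while the simulated training set for compound classification is generated by pairing the separate training samples of the two dates.

```python
import numpy as np
from itertools import product

def post_classification_change(labels_t1, labels_t2):
    # compare the per-scene labels from two separate classifications;
    # the "from-to" type of scene i is the pair (labels_t1[i], labels_t2[i])
    changed = np.asarray(labels_t1) != np.asarray(labels_t2)
    from_to = list(zip(labels_t1, labels_t2))
    return changed, from_to

def simulated_compound_training(freqs_t1, labels_t1, freqs_t2, labels_t2):
    # traverse all pairs of separate training samples and stack their word
    # frequencies to form simulated "from-to" training samples
    X, y = [], []
    for (x1, l1), (x2, l2) in product(zip(freqs_t1, labels_t1),
                                      zip(freqs_t2, labels_t2)):
        X.append(np.concatenate([x1, x2]))
        y.append((l1, l2))
    return np.asarray(X), y
```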

3.2 Multi-Temporal Dictionary Learning

A multi-temporal scene dataset covers the same study site at different times.

Therefore, rather than processing each date in a totally separate manner, it may be useful to exploit the multi-temporal information to improve the performance of the temporal analysis. The dictionary is the

reference for the scene classification, and thus it may be a good idea to add the

temporal information into the dictionary learning process. In order to explore the

different possibilities, we propose three approaches to build the dictionary, as shown

in Fig. 3.

Fig. 3. Multi-temporal dictionary learning approaches of (a) separate dictionary, (b) stacked dictionary, and (c) union dictionary.

1) Separate dictionary (SP): The multi-temporal scene datasets are processed

separately to build two dictionaries. The image scenes are then encoded and

represented by their own dictionaries. This is the basic approach for the

multi-temporal BOVW method, as shown in Fig. 3 (a).

2) Stacked dictionary (ST): Two sub-dictionaries are learned by their own image

scene dataset and stacked into one dictionary. The stacked dictionary is then used to

represent the image scene in different scene datasets, as shown in Fig. 3 (b).

3) Union dictionary (UN): As Fig. 3 (c) shows, sample patches from the

multi-temporal image scene datasets are randomly selected and gathered into one

training sample set. The dictionary is then learned from the set with multi-temporal

patches and is used to encode the image scenes acquired at different times. The

precondition for both the stacked dictionary and union dictionary is that the training

samples must be transformed into one scale; otherwise, the dictionary will be

non-uniform.
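The three dictionary types could be built as follows (a sketch under the assumption that the per-date patch features X1 and X2 have already been transformed to a common scale, e.g. whitened, as required for the stacked and union dictionaries; names and sizes are our assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary(X, K=1000, seed=0):
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(X).cluster_centers_

def separate_dictionaries(X1, X2, K=1000):
    # SP: one dictionary per date, each used only for its own scenes
    return kmeans_dictionary(X1, K), kmeans_dictionary(X2, K)

def stacked_dictionary(X1, X2, K=1000):
    # ST: two sub-dictionaries learned separately, stacked into one 2K-word dictionary
    return np.vstack([kmeans_dictionary(X1, K), kmeans_dictionary(X2, K)])

def union_dictionary(X1, X2, K=1000):
    # UN: patches from both dates gathered into one training set before clustering
    return kmeans_dictionary(np.vstack([X1, X2]), K)
```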

3.3 Procedure

The procedure of scene change detection can be summarized as follows: 1) A large

number of patches are randomly selected from the multi-temporal datasets of scene

images as training samples; 2) The dictionary is learned according to the three

multi-temporal dictionary learning approaches; 3) The training scenes and test scenes

are encoded with the learned dictionary and represented as word frequencies with

feature pooling; 4) Post-classification and compound classification are employed to

obtain the change information.
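The four steps can be tied together as in the following self-contained toy run (synthetic data, the union dictionary, the RAW feature, and post-classification; all sizes are reduced and ZCA whitening is omitted, so this is only a sketch of the workflow, not the authors' implementation).

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
scenes_t1 = rng.random((60, 150, 150, 4))    # toy scenes for time 1
scenes_t2 = rng.random((60, 150, 150, 4))    # toy scenes for time 2
labels_t1 = rng.integers(0, 3, 60)           # toy reference scene labels
labels_t2 = rng.integers(0, 3, 60)

def sample_patches(scenes, n, size=10):
    # step 1: randomly select patches (RAW feature) from the scene dataset
    out = []
    for _ in range(n):
        s = scenes[rng.integers(len(scenes))]
        i, j = rng.integers(0, s.shape[0] - size, 2)
        out.append(s[i:i + size, j:j + size, :].reshape(-1))
    return np.asarray(out)

def word_frequency(scene, D, size=10, step=5):
    # steps 3-5: overlapping patches -> nearest visual word -> normalized histogram
    feats = [scene[i:i + size, j:j + size, :].reshape(-1)
             for i in range(0, scene.shape[0] - size + 1, step)
             for j in range(0, scene.shape[1] - size + 1, step)]
    labels = cdist(np.asarray(feats), D).argmin(axis=1)
    hist = np.bincount(labels, minlength=len(D)).astype(float)
    return hist / hist.sum()

# step 2: union dictionary learned from patches of both dates
X = np.vstack([sample_patches(scenes_t1, 2000), sample_patches(scenes_t2, 2000)])
D = KMeans(n_clusters=50, random_state=0, n_init=10).fit(X).cluster_centers_

F1 = np.array([word_frequency(s, D) for s in scenes_t1])
F2 = np.array([word_frequency(s, D) for s in scenes_t2])

# step 4: post-classification -- classify each date separately, then compare labels
train = rng.choice(60, 20, replace=False)
clf1 = SVC(kernel="linear").fit(F1[train], labels_t1[train])
clf2 = SVC(kernel="linear").fit(F2[train], labels_t2[train])
from_to = list(zip(clf1.predict(F1), clf2.predict(F2)))   # "from-to" scene change types
```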

4. Experiments and Discussion

4.1 Hanyang Dataset

In order to evaluate the performance of the scene change detection, we used two

VHR images covering the area of Hanyang in the city of Wuhan, China (Fig. 4). The

multi-temporal images were acquired by the IKONOS sensor on February 11th, 2002,

and June 24th, 2009, respectively. The spatial resolution was 1 m after the fusion of

the panchromatic and multispectral images by the Gram-Schmidt (GS) algorithm in ENVI. The image size was

7200×6000 with four bands, which were the red, green, blue, and near-infrared (NIR)

bands. Georeference and accurate co-registration were performed to guarantee that the

corresponding scenes cover the same area.

Fig. 4. Color images of the Hanyang area in the city of Wuhan, acquired on (a) February 11th, 2002, and (b) June 24th, 2009.

With this data, we obtained the scenes from the large image by a non-overlapping

grid with a cell size of 150×150. In this way, we generated 1920 test scenes (48×40)

for each time point. Eight classes were selected from the study scene by visual

interpretation, as shown in Fig. 5. In the study area, all the small villages represented

as sparse houses had been changed into other classes over the interval, and so the third

scene class did not exist in 2009. Its example in Fig. 5 (c) is therefore left blank.
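A minimal sketch of the non-overlapping grid described above (our illustration; img_2002 is a placeholder for the co-registered 7200×6000×4 image).

```python
import numpy as np

def tile_scenes(image, cell=150):
    # split a co-registered (H, W, B) image into non-overlapping cell x cell scenes;
    # for a 7200 x 6000 image this yields 48 x 40 = 1920 scenes
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    return [image[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell, :]
            for r in range(rows) for c in range(cols)]

# scenes_2002 = tile_scenes(img_2002)   # list of 1920 scenes, each 150 x 150 x 4
```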
Fig. 5. Examples of training scenes acquired at different times belonging to: (a) parking, (b) water, (c) sparse houses, (d) dense houses, (e) residential area, (f) idle area, (g) vegetation area, and (h) industrial area.

The reference maps of the whole images are shown in Fig. 6. A total of 55.52% and

74.01% of the scenes for the images in 2002 and 2009, respectively, are labeled. The

undefined scenes are mostly a mixture of several classes of scene, and none of them

can be defined as the main land-use class. In order to choose representative and

typical scenes for training, we extracted the training scene samples separately from

the whole study scene, instead of the subset of test samples, which may be easier to

implement in practical applications. The numbers of training and test samples, and the

reference percentages are shown in Table I. In total, we selected 1066 and 1421

scenes in 2002 and 2009, respectively, for testing, and 963 common scenes for the

multi-temporal change detection.

Fig. 6. Reference map for the multi-temporal images acquired in (a) 2002 and (b) 2009 with eight classes (0-undefined, 1-parking, 2-water, 3-sparse house, 4-dense house, 5-residential, 6-idle, 7-farmland, 8-industrial).

In this paper, in order to obtain robust results with satisfactory accuracies, we chose

the number of training patches for the dictionary learning as 100000, the patch size as

10×10, the step for the patches as 5, and the size of the dictionary as 1000. Since

the dictionary is learned by randomly selected patches, we ran each case 10 times and

computed the statistical values. Different parameters of the algorithm will lead to

different accuracies and runtimes. However, the objective of this paper is the

exploration of a scene change detection framework for multi-temporal VHR images.

Therefore, these parameters were determined for a balance between accuracy,

complexity, and speed.

Table I
(a) Number of training and test samples, and the reference percentage for each scene class in 2002.
Code 0 1 2 3 4 5 6 7 8 Total
Training - 9 20 24 19 16 24 21 24 157
Test 854 11 252 139 69 41 206 226 122 1920
Per. % 44.48 0.57 13.13 7.24 3.59 2.14 10.73 11.77 6.35 100.00
(b) Number of training and test samples, and the reference percentage for each scene class in 2009.
Code 0 1 2 3 4 5 6 7 8 Total
Training - 11 30 0 29 34 20 15 43 182
Test 499 20 275 0 77 258 126 228 437 1920
Per. % 25.99 1.04 14.32 0.00 4.01 13.44 6.56 11.88 22.76 100.00

Firstly, we evaluated the performance of the classical K-means approach and the

simplified sparse approach (Sparse) proposed by Cheriyadat [27]. The quantitative

assessment was made by overall accuracy (OA) with the reference test samples. The

performance of the change detection result was evaluated according to the OA of the

“from-to” change map. The statistical plots for K-means and Sparse with the three

dictionary learning approaches and the post-classification change detection method

are shown in Fig. 7. The feature in this experiment is the RAW feature. For the

statistical box, the whiskers illustrate the maximum and minimum OAs, the white

square shows the mean value, and the horizontal lines of the box indicate the 25%, 50%, and 75% quantiles of the OA. Fig. 7 illustrates that the K-means approach

outperforms the Sparse approach in both one-time scene classification and the

“from-to” change detection. In the comparison of the three dictionary learning

approaches, they all obtained similar results, and the stacked and union dictionaries

obtained a slightly better performance than the separate dictionary in the “from-to”

result.

Fig. 7. OA for the results with K-means and Sparse in the dataset of (a) 2002, (b) 2009, and (c) the “from-to” change detection result. For the dictionary learning approaches, the separate approach is red, the stacked approach is green, and the union approach is blue.

The performances of different features and their combinations were evaluated with

the K-means approach and post-classification. The quantitative assessment is shown

in Fig. 8, where “R” means RAW, “M” means MSTD, “S” means SIFT, and “+”

means combination. This shows that in the 2002 dataset, the combined features

obtained better performances, except for the combination of MSTD and SIFT. In the

2009 dataset, the combination of RAW and SIFT got the highest OA, and the RAW

feature also obtained good results. In the final change detection result, the

combination of RAW and SIFT got the best results, and only the MSTD and SIFT

combination did not get a better accuracy than the single features. The reason for this

may be that the MSTD feature did not provide enough information and also obtained

the lowest accuracy. However, its combination outperforms the single feature of

MSTD. The evaluation illustrates that the combination of different features results in

better performances, since the different features carry complementary information.

However, the inclusion of MSTD in the combination of RAW and SIFT reduces the

accuracy, which is because the redundancy of information also negatively affects the

classification performance in certain cases. Comparatively, the union dictionary has a

slightly higher mean OA than the others, which means that the temporal information

can enrich the diversity of words in the dictionary.


Fig. 8. Mean and standard deviation of OA with different features and their combinations in the datasets of (a) 2002, (b) 2009, and (c) the “from-to” change detection result.

Fig. 9. OA of each classification with different features and their combinations.

Since the combination of different features is achieved by stacking the word frequencies, Fig. 9

shows the OA of the different features and their combinations in 10 classification runs

with the union dictionary. It can be observed that the combination of different features

clearly improves the performance in each classification, compared with the single

features.

Fig. 10. OA of post-classification and compound classification for the “from-to” change detection result.

The performances of post-classification and compound classification based on the

same representation of RAW+SIFT are shown in Fig. 10. The training samples for

compound classification are generated by the traversal of all possible combinations of

separate training samples for the multi-temporal datasets. The dashed line with

triangular labels indicates the OA of compound classification, and the solid line with

square labels indicates that of post-classification. Fig. 10 illustrates that, in most cases,

compound classification slightly outperforms post-classification. However, under the

same conditions, for the training and classification process after representation,

compound classification costs 1885.7 s, while post-classification only costs 5.2 s,

since the large number of classes and samples makes it more difficult to find the best

classification model. Therefore, we recommend post-classification in practical

applications, because of its efficiency and effectiveness.

Since the traversal of training samples is too time-consuming, we wanted to

evaluate the performance of a random combination of training samples. We therefore

randomly combined training samples, with numbers ranging from 50 to 400, using the representation of RAW+SIFT and the union dictionary, which can obtain the highest OA for the “from-to” change detection result. We then repeated each case 10 times.

The statistical plot is shown in Fig. 11, where the dashed line is the OA obtained with

the same representation and sample traversal. It can be found that by increasing the

number of random training samples, the mean OA approaches that of sample traversal,

with a smaller variance. This illustrates that increasing the number of random training

samples improves the performance, but this does not obtain better results than sample

traversal. Overall, post-classification is a more useful approach than compound

classification with generated training samples.
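A sketch of the random pairing used here (our illustration; the word frequencies and labels are assumed to be NumPy arrays holding the separate training sets of the two dates).

```python
import numpy as np

def random_compound_training(freqs_t1, labels_t1, freqs_t2, labels_t2, n, seed=0):
    # randomly pair n training scenes from the two dates (instead of the full
    # traversal) and stack their word frequencies as simulated "from-to" samples
    rng = np.random.default_rng(seed)
    i1 = rng.integers(0, len(freqs_t1), n)
    i2 = rng.integers(0, len(freqs_t2), n)
    X = np.concatenate([freqs_t1[i1], freqs_t2[i2]], axis=1)
    y = list(zip(labels_t1[i1], labels_t2[i2]))
    return X, y

# e.g. X, y = random_compound_training(F1_train, y1_train, F2_train, y2_train, n=200)
```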

Fig. 11. The OA with different numbers of random training samples.

Fig. 12 shows the scene maps in 2002 and 2009 with a high “from-to” change

detection accuracy, obtained by the RAW+SIFT features, the K-means approach, the

union dictionary, and post-classification. The OA of each map is 85.65% and 84.03%,

and the “from-to” change detection accuracy is 76.43%. It can be observed that the

small villages represented as sparse houses disappeared in 2009. Instead of sparse

houses, industrial areas and residential areas expanded very quickly. Most idle areas

in 2002 were transformed into other land-use classes. It is worth noting that the scene

change detection can reveal the detailed land-use transitions. For example, after seven years, the west side of Hanyang had developed into a residential area, and the east side had become an industrial area, while the traditional approaches would just regard both of them as

impervious surfaces.

Fig. 12. Scene maps for multi-temporal images acquired in (a) 2002 and (b) 2009 with eight classes (1-parking, 2-water, 3-sparse house, 4-dense house, 5-residential, 6-idle, 7-farmland, 8-industrial). The OAs are 85.65% and 84.03%, and 76.43% for the “from-to” change detection.

Table II
Statistics of the scene classes in 2002 and 2009
Class 2002 2009 +/−
1. Parking 9 13 +4
2. Water 292 312 +20
3. Sparse house 447 0 −447
4. Dense house 55 132 +77
5. Residential 201 507 +306
6. Idle 354 178 −176
7. Vegetation 249 267 +18
8. Industrial 313 511 +198

The statistics of the scenes in 2002 and 2009 are shown in Table II. Here, it can be

clearly found that, in 2002, there were many land-use areas with the type of sparse

houses and idle areas, since the urbanization of the Hanyang area had just started at

this time. By 2009, many of these land-use areas had been changed into residential

areas and industrial areas, because of the city expansion, which is shown by the
increasing number of these two scene classes.

Fig. 13. “From-to” change map: (a) change from sparse houses, (b) change to residential area, and (c) change to industrial area. The scenes include parking (1-white), water (2-blue), sparse houses (3-cyan), dense houses (4-purple), residential area (5-brown), idle area (6-yellow), farmland (7-green), and industrial area (8-red).

The “from-to” change maps for the transition of the three important scene classes

are shown in Fig. 13. The color in the change map indicates the different scene classes.

Fig. 13 (a) shows that sparse houses changed into other land-use classes. This

illustrates that most sparse house areas changed to residential areas and vegetation. It

can also be found that residential areas appeared in the top-left part of the study area in Fig. 13 (b). Comparatively, industrial areas expanded to the bottom-right part and

covered many idle areas and vegetation areas, shown in Fig. 13 (c). Fig. 13 shows the

“from-to” land-use transition map of the three most important scene classes in the

urban expansion. It can be seen that the scene change detection is able to provide

semantic interpretation of the land-use transition for urban management, which cannot

be achieved by the traditional approaches.

4.2 Hankou Dataset

We also chose another multi-temporal dataset from the Hankou area of Wuhan to
further evaluate the proposed approach. The source images of the multi-temporal

dataset were obtained on August 11th, 2002, and May 11th, 2008. They were also

acquired by IKONOS and pansharpened to be VHR with a 1-m resolution and four

bands. Georeference and accurate co-registration were performed as pre-processing.

Since the source images covering Hankou are too complex to label all the scenes, we

extracted multi-temporal scene pairs from the VHR images for the quantitative

assessment. The size of each scene was 150×150. Seven classes of land-use scene were

selected from the source images, which are shown in Fig. 14. As discussed in the

Hanyang experiment, the training samples were selected separately from each source

image. A total of 411 scene pairs were extracted for the test. A total of 104 and 92

training samples were selected in 2002 and 2008, respectively, for classification. The numbers of

training and test samples of the different classes are shown in Table III.

Fig. 14. Examples of multi-temporal scene pairs acquired in 2002 and 2008 belonging to: (a) water (1), (b) dense houses (2), (c) vegetation area (3), (d) agricultural area (4), (e) residential area (5), (f) idle area (6), and (g) industrial area (7).

The parameters, such as patch size, dictionary number, etc., were set the same as

those in the Hanyang experiment. The training patches for the dictionary learning

were randomly selected from the multi-temporal dataset of the scene pairs, instead of

the source images.

Table III
(a) Number of training and test samples, and percentage of test for each scene class in 2002.
Code 1 2 3 4 5 6 7 Total
Training 16 17 14 15 15 10 17 104
Test 125 61 66 34 60 25 40 411
Per. % 30.42 14.84 16.06 8.27 14.60 6.08 9.73 100.00
(b) Number of training and test samples, and percentage of test for each scene class in 2008.
Code 1 2 3 4 5 6 7 Total
Training 13 14 11 9 20 11 14 92
Test 87 68 30 16 119 33 58 411
Per. % 21.17 16.55 7.30 3.89 28.95 8.03 14.11 100.00

The mean and standard deviation of the OA with the different features and their

combinations with the three dictionaries is shown in Fig. 15. It can be observed that in

the 2002 dataset, RAW+SIFT obtained the best results, and the single RAW feature

also performed well. In the 2008 dataset, the combination of features outperformed

the single features in all cases. The post-classification results are therefore affected by both classification results, and RAW+SIFT obtained higher accuracies

than the other features. Fig. 15 indicates that RAW+SIFT is a very effective

combination of features for scene change detection.


Fig. 15. Mean and standard deviation of OA with different features and their combinations in the dataset of (a) 2002, (b) 2008, and (c) the “from-to” change detection result.

Fig. 16 shows the OA of different single features and their combination in the 2002

and 2008 datasets, and the post-classification result. The dictionary type was the

union dictionary. The combination of RAW and SIFT leads to an obvious

improvement over the separate features in every case. The other combinations result
in only a limited increase in accuracy. This is because the combination of features can

be both complementary and redundant. In summary, RAW+SIFT was found to be the

best combination in our experiments.


Fig. 16. OA of each classification with different features and their combinations.

Fig. 17. OA of post-classification and compound classification for the “from-to” change detection result.

We then evaluated the performances with the RAW and SIFT combination for

post-classification and compound classification, as shown in Fig. 17. The training

samples were generated by traversal for the compound classification. It can be seen

that, in most cases, the compound classification got a lower accuracy than the

post-classification. Considering the complexity of the compound classification,

post-classification is more effective in scene change detection.

Fig. 18 shows the OA statistics of the “from-to” change detection results with different numbers of random training samples. This illustrates that increasing the number of

random training samples leads to higher accuracies, approaching that of the sample

traversal. The generation of training samples for compound classification is not a

good approach for change detection. However, the direct selection of the “from-to”

samples from the dataset is also very difficult to achieve, because of the large number of possible types and the rareness of some samples. Thus, compound classification

may not be so practical in real applications, and needs further research.

Fig. 18. The OA with different numbers of random training samples.

5. Conclusion

In this paper, we have explored a scene change detection framework to monitor

land-use transitions from a semantic point of view. Firstly, the features of random

patches are extracted to learn a dictionary. Three dictionary learning approaches based

on multi-temporal patches are addressed: the separate dictionary, the stacked

dictionary, and the union dictionary. The image patches are then represented by word

frequency with the learned dictionary, according to the BOVW model. Finally,

post-classification and compound classification are applied to detect the “from-to”

changes of the land-use classes.

The experiments show that the proposed framework is effective in analyzing the semantic changes of remote sensing image scenes, and has practical significance in the study of city development. A combination of features can clearly improve the

performance, and RAW+SIFT is the best combination for scene identification in these

cases. K-means is more effective than the simple Sparse approach in dictionary

learning, and dictionary learning with multi-temporal information can improve the

performance, since it can enrich the diversity of the dictionary. The experimental

results also illustrate that, although a compound classification may slightly increase

the accuracy over that of post-classification, it is too time-consuming, and thus

post-classification is recommended in practical applications. In summary, the

proposed scene change detection framework has great potential in land-use

monitoring and city management, since the traditional change detection approaches

cannot discriminate the semantic land-use variations. The improvement of the scene change detection framework will be the focus of our future work.

Acknowledgements

This work was supported in part by the National Basic Research Program of China

(973 Program) under Grant 2012CB719905, and by the National Natural Science Foundation

of China under Grants 61471274 and 41431175.

References
[1] C. Xu, D. Tao, and C. Xu, "A Survey on Multi-view Learning," ArXiv e-prints, arXiv:1304.5634, Apr. 2013. Available: http://adsabs.harvard.edu/abs/2013arXiv1304.5634X
[2] W. F. Liu, H. L. Liu, D. P. Tao, Y. J. Wang, and K. Lu, "Multiview Hessian
regularized logistic regression for action recognition," Signal Processing, vol. 110, pp.
101-107, May 2015.
[3] D. C. Tao, X. L. Li, X. D. Wu, and S. J. Maybank, "Geometric Mean for
Subspace Selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, pp. 260-274,
Feb 2009.
[4] D. C. Tao, X. Tang, X. L. Li, and X. D. Wu, "Asymmetric bagging and random
subspace for support vector machines-based relevance feedback in image retrieval,"
IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 1088-1099, Jul 2006.
[5] D. C. Tao, X. L. Li, X. D. Wu, and S. J. Maybank, "General tensor discriminant
analysis and Gabor features for gait recognition," IEEE Trans. Pattern Anal. Mach.
Intell., vol. 29, pp. 1700-1715, Oct 2007.
[6] L. Weifeng and T. Dacheng, "Multiview Hessian Regularization for Image
Annotation," IEEE Trans. Image Process., vol. 22, pp. 2676-2687, 2013.
[7] W. Liu, D. Tao, J. Cheng, and Y. Tang, "Multiview Hessian discriminative sparse
coding for image annotation," Computer Vision and Image Understanding, vol. 118,
pp. 50-60, Jan. 2014.
[8] D. Bo and Z. Liangpei, "A Discriminative Metric Learning Based Anomaly
Detection Method," IEEE Trans. Geosci. Remote Sens., vol. 52, pp. 6844-6857, 2014.
[9] B. Du, L. Zhang, D. Tao, and D. Zhang, "Unsupervised transfer learning for
target detection from hyperspectral images," Neurocomputing, vol. 120, pp. 72-82,
Nov. 2013.
[10] D. Lu, P. Mausel, E. Brondizio, and E. Moran, "Change detection techniques," Int.
J. Remote Sens., vol. 25, pp. 2365-2401, Jun 2004.
[11] A. Singh, "Review Article Digital change detection techniques using
remotely-sensed data," Int. J. Remote Sens., vol. 10, pp. 989-1003, Jul 1989.
[12] B. Du, Y. Zhang, L. Zhang, and L. Zhang, "A hypothesis independent subpixel
target detector for hyperspectral images," Signal Processing, vol. 110, pp. 244-249, May 2015.
[13] B. Du and L. Zhang, "Target detection based on a dynamic subspace," Pattern
Recognit., vol. 47, pp. 344-358, Jan. 2014.
[14] B. Du and L. Zhang, "Random-Selection-Based Anomaly Detector for
Hyperspectral Imagery," IEEE Trans. Geosci. Remote Sens., vol. 49, pp. 1578-1589,
May 2011.
[15] R. E. Kennedy, P. A. Townsend, J. E. Gross, W. B. Cohen, P. Bolstad, Y. Q. Wang,
and P. Adams, "Remote sensing change detection tools for natural resource managers:
Understanding concepts and tradeoffs in the design of landscape monitoring projects,"

Remote Sens. Environ., vol. 113, pp. 1382-1396, Jul 2009.
[16] M. J. Carlotto, "Detection and analysis of change in remotely sensed imagery
with application to wide area surveillance," IEEE Trans. Image Process., vol. 6, pp.
189-202, 1997.
[17] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, "Image change detection
algorithms: a systematic survey," IEEE Trans. Image Process., vol. 14, pp. 294-307,
Mar 2005.
[18] F. Bovolo and L. Bruzzone, "A Theoretical Framework for Unsupervised Change
Detection Based on Change Vector Analysis in the Polar Domain," IEEE Trans.
Geosci. Remote Sens., vol. 45, pp. 218-236, Jan 2007.
[19] C. Wu, B. Du, and L. Zhang, "Slow Feature Analysis for Change Detection in
Multispectral Imagery," IEEE Trans. Geosci. Remote Sens., vol. 52, pp. 2858-2874,
May. 2014.
[20] B. Demir, F. Bovolo, and L. Bruzzone, "Detection of Land-Cover Transitions in
Multitemporal Remote Sensing Images With Active-Learning-Based Compound
Classification," IEEE Trans. Geosci. Remote Sens., vol. 50, pp. 1930-1941, 2012.
[21] G. Xian, C. Homer, and J. Fry, "Updating the 2001 National Land Cover Database
land cover classification to 2006 by using Landsat imagery change detection
methods," Remote Sens. Environ., vol. 113, pp. 1133-1147, Jun 2009.
[22] M. Hussain, D. Chen, A. Cheng, H. Wei, and D. Stanley, "Change detection from
remotely sensed images: From pixel-based to object-based approaches," ISPRS J.
Photogramm., vol. 80, pp. 91-106, Jun. 2013.
[23] G. Chen, G. J. Hay, L. M. T. Carvalho, and M. A. Wulder, "Object-based change
detection," Int. J. Remote Sens., vol. 33, pp. 4434-4457, Jul. 2012.
[24] A. Bosch, A. Zisserman, and X. Muñoz, "Scene Classification Via pLSA," in
Computer Vision – ECCV 2006. vol. 3954, A. Leonardis, et al., Eds., ed: Springer
Berlin Heidelberg, 2006, pp. 517-530.
[25] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene
classification," Pattern Recognit., vol. 37, pp. 1757-1771, Sep. 2004.
[26] A. Bosch, A. Zisserman, and X. Muoz, "Scene Classification Using a Hybrid
Generative/Discriminative Approach," IEEE Trans. Pattern Anal. Mach. Intell., vol.
30, pp. 712-727, 2008.
[27] A. M. Cheriyadat, "Unsupervised Feature Learning for Aerial Scene
Classification," IEEE Trans. Geosci. Remote Sens., vol. 52, pp. 439-451, 2014.
[28] X. Zheng, X. Sun, K. Fu, and H. Wang, "Automatic Annotation of Satellite
Images via Multifeature Joint Sparse Coding With Spatial Relation Constraint," IEEE
Geosci. Remote Sens. Lett., vol. 10, pp. 652-656, 2013.
[29] W. Shao, W. Yang, G.-S. Xia, and G. Liu, "A Hierarchical Scheme of Multiple
Feature Fusion for High-Resolution Satellite Scene Categorization," in Computer
Vision Systems. vol. 7963, M. Chen, et al., Eds., ed: Springer Berlin Heidelberg, 2013,
pp. 324-333.
[30] X. Sheng, F. Tao, L. Deren, and W. Shiwei, "Object Classification of Aerial
Images With Bag-of-Visual Words," IEEE Geosci. Remote Sens. Lett., vol. 7, pp.
366-370, 2010.

[31] G. Sheng, W. Yang, T. Xu, and H. Sun, "High-resolution satellite scene
classification using a sparse coding based multiple feature combination," Int. J.
Remote Sens., vol. 33, pp. 2395-2412, 2012.
[32] L. Wang, L. Hongliang, and L. Guanghui, "Automatic Annotation of
Multispectral Satellite Images Using Author-Topic Model," IEEE Geosci. Remote
Sens. Lett., vol. 9, pp. 634-638, 2012.
[33] B. Zhao, Y. Zhong, and L. Zhang, "Scene classification via latent Dirichlet
allocation using a hybrid generative/discriminative strategy for high spatial resolution
remote sensing imagery," Remote Sensing Letters, vol. 4, pp. 1204-1213, 2013.
[34] M. Lienou, H. Maitre, and M. Datcu, "Semantic Annotation of Satellite Images
Using Latent Dirichlet Allocation," IEEE Geosci. Remote Sens. Lett., vol. 7, pp. 28-32,
2010.
[35] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for
land-use classification," presented at the Proceedings of the 18th SIGSPATIAL
International Conference on Advances in Geographic Information Systems, San Jose,
California, 2010.
[36] T. Lindeberg, "Scale Invariant Feature Transform," Scholarpedia, vol. 7, p. 10491,
2012.
[37] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial
Pyramid Matching for Recognizing Natural Scene Categories," in 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp.
2169-2178.
[38] A. Coates and A. Y. Ng, "Learning feature representations with k-means," in
Neural Networks: Tricks of the Trade, ed: Springer, 2012, pp. 561-580.
[39] L. Zhang, L. Zhang, D. Tao, and X. Huang, "On Combining Multiple Features for
Hyperspectral Remote Sensing Image Classification," IEEE Trans. Geosci. Remote
Sens., vol. 50, pp. 879-893, Mar 2012.
[40] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines,"
ACM Trans. Intell. Syst. Technol., vol. 2, pp. 1-27, 2011.
[41] X. Chen, J. Chen, Y. Shi, and Y. Yamaguchi, "An automated approach for
updating land cover maps based on integrated change detection and classification
methods," ISPRS J. Photogramm., vol. 71, pp. 86-95, Jul. 2012.

