
Author’s Accepted Manuscript

A Scene Change Detection Framework for Multi-Temporal Very High Resolution Remote Sensing Images

Chen Wu, Lefei Zhang, Liangpei Zhang

www.elsevier.com/locate/sigpro

PII: S0165-1684(15)00322-9
DOI: http://dx.doi.org/10.1016/j.sigpro.2015.09.020
Reference: SIGPRO5920
To appear in: Signal Processing
Received date: 23 June 2015
Revised date: 24 August 2015
Accepted date: 15 September 2015
Cite this article as: Chen Wu, Lefei Zhang and Liangpei Zhang, A Scene Change
Detection Framework for Multi-Temporal Very High Resolution Remote
Sensing Images, Signal Processing,
http://dx.doi.org/10.1016/j.sigpro.2015.09.020
This is a PDF file of an unedited manuscript that has been accepted for
publication. As a service to our customers we are providing this early version of
the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting galley proof before it is published in its final citable form.
Please note that during the production process errors may be discovered which
could affect the content, and all legal disclaimers that apply to the journal pertain.
A Scene Change Detection Framework for
Multi-Temporal Very High Resolution Remote
Sensing Images
Chen Wu 1, Lefei Zhang 2*, and Liangpei Zhang 3

1. International School of Software, Wuhan University, Wuhan, P.R. China

2. School of Computer, Wuhan University, Wuhan, P.R. China

3. State Key Laboratory of Information Engineering in Surveying, Mapping and

Remote Sensing, Wuhan University, Wuhan, P.R. China

E-mail: zhanglefei@whu.edu.cn

Abstract

The technology of computer vision and image processing has attracted more and more attention in recent years, and has been applied in many research areas such as

remote sensing image analysis. Change detection with multi-temporal remote sensing

images is very important for the dynamic analysis of landscape variations. The

abundant spatial information offered by very high resolution (VHR) images makes it

possible to identify the semantic classes of image scenes. Compared with the

traditional approaches, scene change detection can provide a new point of view for the

semantic interpretation of land-use transitions. In this paper, for the first time, we

explore a scene change detection framework for VHR images, with a

bag-of-visual-words (BOVW) model and classification-based methods. Image scenes

are represented by word frequencies computed with three kinds of multi-temporal learned dictionary, i.e., the separate dictionary, the stacked dictionary, and the union dictionary.

Three features (multispectral raw pixel; mean and standard deviation; and SIFT) and

their combinations were tested in scene change detection. Post-classification and

compound classification were evaluated for their performances in the “from-to”

change results. Two multi-temporal scene datasets were used to quantitatively

evaluate the proposed scene change detection approach. The results indicate that the

proposed scene change detection framework can obtain a satisfactory accuracy and

can effectively analyze land-use changes, from a semantic point of view.

Keywords: Scene change detection, VHR image, remote sensing, BOVW,

post-classification, compound classification.

1. Introduction

The technology of computer vision and image processing has attracted more and more attention in recent years [1-7], and has been applied in many research areas such as

remote sensing image analysis [8], [9]. Dynamic analysis of landscape variations is

very important for understanding the interactions between human development and

natural phenomena [10]. Since the advent of remote sensing technology, it has been a

major source of information for change detection studies to detect, identify, and

analyze the changes in landscape conditions, because of its broad view, high temporal

frequency, and consistent observations [11-14]. Change detection with multi-temporal

remote sensing images has proven effective in many applications [15], [16].

In recent years, numerous studies have addressed the automatic and accurate

detection of changes from multi-temporal images [17]. The main methods can be

categorized into the following groups: 1) image difference [18]; 2) image

transformation [19]; and 3) classification-based methods [20]. Since

classification-based change detection methods can indicate the detailed “from-to”

change types for further analysis, they are very convenient for practical applications

[21]. Nowadays, the new remote sensing technology with a higher resolution has

brought about a revolution in the analysis of multi-temporal datasets [22]. In addition

to the detailed characterization and monitoring of landscape changes with

object-oriented methods [23], very high resolution (VHR) imagery also provides

unique opportunities for the semantic interpretation of land-use transitions.

Meanwhile, classifying image scenes into semantic classes has attracted increasing

attention in the pattern recognition field [24-26]. Due to the abundant and detailed

spatial information provided by VHR imagery, the remote sensing scenes can be

characterized as semantic land-use classes based on the spatial and structural patterns

of the landscapes encoded in the imagery [27]. A great deal of research has focused on

remote sensing scene classification [28]. The bag-of-visual-words (BOVW) model

has been explored as an effective and robust feature encoding approach to obtain the

statistical characteristics of thematic objects for remote sensing scenes [29]. It can

provide a mid-level representation to bridge the huge semantic gap between the

extracted low-level features in the image and the high-level concepts of human beings

[30]. The combination of multiple features, such as texture features and SIFT feature

descriptors, can improve the performance of scene classification [31], [32]. Topic

models derived from text categorization, such as probabilistic latent semantic analysis

(pLSA) and latent Dirichlet allocation (LDA), have been proposed to reduce the

dimension and increase the accuracy [33], [34].

With the increasing availability of high-resolution images covering the Earth's surface, it is now possible to develop change detection methods at the scene scale. This can provide a new point of view to analyze and monitor urban development, such as the expansion of industrial and residential areas. The problem is that the changes of

thematic landscapes inside a scene may not lead to the variation of the scene classes.

The traditional change detection approaches, such as the pixel-wise and object-wise

methods, can only identify the changes of the thematic objects, such as the growth of

vegetation and the appearance of new buildings. However, such variations inside a scene do not necessarily modify the land-use category, e.g., from a residential area to an industrial area. This illustrates the semantic gap between landscape changes and scene changes.

It is therefore necessary and significant to explore scene change detection, which

takes into consideration the semantic comprehension of the land-use variations.

However, to the best of our knowledge, no previous studies have focused on this

topic. Therefore, in this paper, we explore a scene change detection framework for

multi-temporal VHR remote sensing imagery, and evaluate its performance in the

application of urban development management. The scene classification method is

based on a BOVW model with three different features, i.e., the multispectral raw pixel;

the mean and standard deviation; and the SIFT feature descriptor. Three kinds of

dictionary, i.e., the separate dictionary, the stacked dictionary, and the union dictionary,

are proposed to make use of the multi-temporal information to improve the

performance. Post-classification and compound classification approaches for the

scenes are addressed to extract the semantic land-use change information. Finally, the

urban development is mapped and analyzed with the land-use transition interpretation.

The remainder of this paper is organized as follows. Section 2 introduces scene

classification with the BOVW approach. Section 3 explores the change detection

process for multi-temporal scene data. Experiments with two multi-temporal VHR

imagery datasets and a discussion are presented in Section 4. Finally, the conclusion

of this paper is drawn in Section 5.

2. Scene Classification with BOVW

In the BOVW approach, a scene can be identified by the thematic objects it

contains. The idea of BOVW derives from text analysis, where the documents can be

classified according to the frequencies of the words contained therein, regardless of

the word order [35]. For scene classification, BOVW aims at representing the image

scene with the statistical frequency of the visual words as its characteristic feature for

the following classification process, as shown in Fig. 1. The procedure of BOVW

scene classification can be described as follows: 1) randomly select image patches

from the scene images in the scene dataset; 2) learn a dictionary from the feature sets

of the selected patches; 3) for each scene, extract the features of the patches in the

image as visual words; 4) encode the features of the visual words with the learned

dictionary; 5) pool the feature codes to obtain the word frequency representation as

the characteristic for the image scene; 6) repeat steps 3–5 to represent all the scenes

with the word frequency; and 7) classify the image scenes into semantic scene classes.

Next, we introduce the strategies of the framework used in this paper in detail.

Fig. 1. Flowchart of scene classification with BOVW.

2.1 Feature Extraction

The features of the patch should be representative and distinguishable to describe

the different thematic objects for the visual words. In this paper, we evaluate the

performance of scene change detection with three descriptive features, which are the

raw pixel vector (RAW), the mean and standard deviation (MSTD), and the dense

SIFT feature (SIFT).

The raw pixel vector is widely used as a low-level descriptor for image patches [24].

For a multi-band image patch with the size of m×n and b bands, it is represented as a column vector $x \in \mathbb{R}^L$, where $L = m \times n \times b$. The image patch can be reshaped row-wise or column-wise, and filled into the raw pixel vector band by band.

The MSTD is a simple descriptor for an image patch to represent its characteristics.

The mean value and standard deviation of each band are calculated and listed in a column vector $x \in \mathbb{R}^L$, where $L = 2 \times b$.
The final descriptor we use is the dense SIFT descriptor. The SIFT descriptor is

invariant to translation, rotation, and scaling in the image domain and is robust with

regard to illumination variation [36]. Previous work has proved that in the application

of scene classification, the dense SIFT descriptor is more effective than sparse interest

points [24], [27]. In this paper, we use the first principal component of PCA as the

input of the dense SIFT descriptor, and obtain the descriptor vector $x \in \mathbb{R}^L$, where $L = 128$. We use the dense SIFT implementation provided by Lazebnik et al. [37] to

extract the features.

Different features can also be combined for a better classification performance. In

this paper, we use a simple approach to combine multiple features by stacking the

word frequencies of the different features into a higher-dimensional representation

vector. It is worth noting that we stack the word frequencies instead of the original

features directly, since the word frequencies of the different features are at the same

scale, and we do not have the problem of feature normalization.
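To make these descriptors concrete, the following minimal sketch (our own illustration, not the authors' code) computes the RAW and MSTD vectors for a single multi-band patch and combines the word frequencies of different features by stacking; the function names and array layout are assumptions.

```python
import numpy as np

def raw_feature(patch):
    # RAW: reshape an m x n x b patch into a length m*n*b vector, band by band
    return np.transpose(patch, (2, 0, 1)).reshape(-1).astype(np.float64)

def mstd_feature(patch):
    # MSTD: mean and standard deviation of each band -> length 2*b vector
    pixels = patch.reshape(-1, patch.shape[2])
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])

def combine_word_frequencies(freqs):
    # combine features by stacking their (already normalized) word frequencies,
    # e.g. RAW + SIFT, into one higher-dimensional scene representation
    return np.concatenate(freqs)

patch = np.random.rand(10, 10, 4)   # a 10x10 patch with 4 spectral bands
x_raw = raw_feature(patch)          # length 400
x_mstd = mstd_feature(patch)        # length 8
```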

2.2 Dictionary Learning

The dictionary must be comprehensive and representative of the visual words. The

learned dictionary is then used to quantize the extracted features by assigning the

labels of the visual words.

One way to prepare the source data for dictionary learning is to use all the features

extracted from all the scenes in the learning process. However, when the volume of

the input scene data is large, the dictionary learning will be time-consuming.

Therefore, in this paper, we randomly select a large number of sample patches from the scene images in the dataset to generate the input data matrix $X = [x_1, x_2, \ldots, x_M]$. The number of samples is usually very large, e.g., $M = 100000$.

In order to reduce the redundancy of the input data, zero-phase component analysis (ZCA) whitening is applied to remove the correlations between the different dimensions [49].

K-means clustering is an effective feature representation learning method and can

be quickly implemented at a large scale [38]. This approach has also been applied in

many studies of scene classification [33]. The whitened input dataset is quantized into

K clusters by the K-means algorithm, and their centers are the elements of the learned

dictionary $D = [d_1, d_2, \ldots, d_K]$.
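A minimal sketch of this step, assuming scikit-learn's KMeans and a standard eigenvalue-based ZCA whitening; the regularization constant and function names are our assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def zca_whitening(X, eps=1e-5):
    # X: (M, L) matrix of sampled patch features, one feature per row
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T   # (L, L) whitening matrix
    return (X - mu) @ W, W, mu

def learn_dictionary(X, K=1000, seed=0):
    # whiten the sampled features, then cluster them into K visual words;
    # the cluster centers d_1, ..., d_K form the dictionary D
    Xw, W, mu = zca_whitening(X)
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(Xw)
    return km.cluster_centers_, W, mu

# e.g. with M = 100000 randomly sampled patch features:
# D, W, mu = learn_dictionary(X, K=1000)
```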

2.3 Feature Encoding and Pooling

After obtaining the dictionary, the training and testing scenes can be represented

with the encoded features as $P_{\mathrm{train}} = [p_1^{\mathrm{train}}, p_2^{\mathrm{train}}, \ldots, p_T^{\mathrm{train}}]$ and $P_{\mathrm{test}} = [p_1^{\mathrm{test}}, p_2^{\mathrm{test}}, \ldots, p_N^{\mathrm{test}}]$, where T and N are the numbers of training and testing

scenes. Dividing the image patches with a non-overlapping grid is a traditional

approach. However, Bosch et al. [24] proved that overlapping patches can lead to a

better performance. Therefore, in this paper, the image patches used for the feature

extraction are square windows with a size of m×n, and they cover the whole image

scene with a step of s moving in the horizontal and vertical directions. The ZCA

whitening matrix obtained in the dictionary learning should be used before the

encoding to preprocess the features. The whitened feature of each patch is then

quantized to the visual word that has the greatest similarity in the dictionary.

After feature encoding, the pooling process should be applied to represent the scene

with the encoded words. In this paper, we compute the histogram of the labeled words in the scene and normalize it to obtain the word frequency, since it is very robust in

practical applications [35].
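The encoding and pooling of one scene could be sketched as follows (our illustration; the hard-assignment nearest-word quantization and the helper names are assumptions consistent with the description above).

```python
import numpy as np
from scipy.spatial.distance import cdist

def scene_patch_features(scene, extract, size=10, step=5):
    # overlapping square windows covering the scene, moved with step s in both directions;
    # "extract" is a feature function such as the RAW or MSTD descriptor
    feats = [extract(scene[i:i + size, j:j + size, :])
             for i in range(0, scene.shape[0] - size + 1, step)
             for j in range(0, scene.shape[1] - size + 1, step)]
    return np.asarray(feats)

def word_frequency(features, D, W, mu):
    # whiten with the ZCA matrix from dictionary learning, assign each patch to
    # its most similar visual word, and pool into a normalized histogram
    Xw = (features - mu) @ W
    labels = cdist(Xw, D).argmin(axis=1)
    hist = np.bincount(labels, minlength=D.shape[0]).astype(np.float64)
    return hist / hist.sum()
```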

2.4 Classification

The dictionary representation is the input data for the scene classification. A

dictionary will usually consist of hundreds of words, and thus the representation for

the scene is very high-dimensional. Since the number of training samples is

usually much smaller than the data dimension, we choose support vector machine

(SVM) as the classifier in this paper [39]. LIBSVM was applied with a linear kernel

to classify the scenes in our experiments [40].
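The paper uses LIBSVM with a linear kernel; a roughly equivalent sketch with scikit-learn (whose SVC is backed by LIBSVM) is given below on synthetic word frequencies, since the real ones come from the pooling step. The sample sizes are taken from the Hanyang experiment; everything else is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
K = 1000                                   # dictionary size
P_train = rng.random((157, K))             # word frequencies of training scenes
P_train /= P_train.sum(axis=1, keepdims=True)
y_train = rng.integers(1, 9, size=157)     # eight scene classes, labels 1..8
P_test = rng.random((1066, K))             # word frequencies of test scenes
P_test /= P_test.sum(axis=1, keepdims=True)

clf = SVC(kernel="linear", C=1.0)          # linear kernel, as in the experiments
clf.fit(P_train, y_train)
y_pred = clf.predict(P_test)
```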

3. Scene Change Detection

Since the scene change detection aims to analyze the land-use transition from a

semantic point of view, supervised information provided by interpreters is essential

for the semantic definition of the different scene classes. Inspired by traditional

change detection, we use a classification-based method to determine the scene

changes. In this paper, post-classification (comparing the class maps after separate

classification) and compound classification (classifying the stacked multi-temporal

scene) are proposed and evaluated for the scene change detection. In addition, as the

temporal information is important in multi-temporal image analysis, we also attempt

to add the temporal information into the learned dictionary.

3.1 Post-Classification and Compound Classification

Post-classification and compound classification are the two most commonly used

classification-based change detection methods in land-cover/land-use change mapping

and ecosystem monitoring [21], [41]. They can provide the “from-to” change type

information for the following detailed analysis. For the proposed scene change

detection framework, we also employ these two approaches to map the changed

scenes and to identify the change types, as shown in Fig. 2.

Fig. 2. Scene change detection of (a) post-classification, and (b) compound classification.

Fig. 2 (a) shows that, in post-classification, the multi-temporal scenes are classified

separately as $L^{t1} = [l_1^{t1}, l_2^{t1}, \ldots, l_N^{t1}]$ and $L^{t2} = [l_1^{t2}, l_2^{t2}, \ldots, l_N^{t2}]$. The scene labels after scene classification are compared to identify the change conditions of the corresponding scenes at the same location. The compound classification classifies the multi-temporal scene pair to be $L^{\mathrm{from\text{-}to}} = [l_1^{\mathrm{from\text{-}to}}, l_2^{\mathrm{from\text{-}to}}, \ldots, l_N^{\mathrm{from\text{-}to}}]$ at one time, with

the stacked word frequency, as shown in Fig. 2 (b). Theoretically, the accuracy of

post-classification is limited since the misclassified scenes at each time point will lead

to errors in the change detection result, no matter whether or not the corresponding

scenes at another time are correctly classified. Although the compound classification

approach can solve this problem with one-pass classification, the direct selection of

training samples for all the “from-to” scene classes is too complex. Therefore, we

generate the simulated training samples for the compound classification with separate

training samples in the multi-temporal dataset.
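A minimal sketch of the two strategies (our illustration, with assumed helper names): post-classification simply compares the per-date scene labels, while the simulated training set for compound classification is generated by pairing the separate training samples of the two dates.

```python
import numpy as np
from itertools import product

def post_classification_change(labels_t1, labels_t2):
    # compare the per-scene labels from two separate classifications;
    # the "from-to" type of scene i is the pair (labels_t1[i], labels_t2[i])
    changed = np.asarray(labels_t1) != np.asarray(labels_t2)
    from_to = list(zip(labels_t1, labels_t2))
    return changed, from_to

def simulated_compound_training(freqs_t1, labels_t1, freqs_t2, labels_t2):
    # traverse all pairs of separate training samples and stack their word
    # frequencies to form simulated "from-to" training samples
    X, y = [], []
    for (x1, l1), (x2, l2) in product(zip(freqs_t1, labels_t1),
                                      zip(freqs_t2, labels_t2)):
        X.append(np.concatenate([x1, x2]))
        y.append((l1, l2))
    return np.asarray(X), y
```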

3.2 Multi-Temporal Dictionary Learning

A multi-temporal scene dataset covers the same study site at different times.

Therefore, rather than processing each date in a totally separate manner, it may be useful to exploit the multi-temporal information to improve the performance of the temporal analysis. The dictionary is the

reference for the scene classification, and thus it may be a good idea to add the

temporal information into the dictionary learning process. In order to explore the

different possibilities, we propose three approaches to build the dictionary, as shown

in Fig. 3.

Fig. 3. Multi-temporal dictionary learning approaches of (a) separate dictionary, (b) stacked dictionary, and (c) union dictionary.

1) Separate dictionary (SP): The multi-temporal scene datasets are processed

separately to build two dictionaries. The image scenes are then encoded and

represented by their own dictionaries. This is the basic approach for the

multi-temporal BOVW method, as shown in Fig. 3 (a).

2) Stacked dictionary (ST): Two sub-dictionaries are learned by their own image

scene dataset and stacked into one dictionary. The stacked dictionary is then used to

represent the image scene in different scene datasets, as shown in Fig. 3 (b).

3) Union dictionary (UN): As Fig. 3 (c) shows, sample patches from the

multi-temporal image scene datasets are randomly selected and gathered into one

training sample set. The dictionary is then learned from the set with multi-temporal

patches and is used to encode the image scenes acquired at different times. The

precondition for both the stacked dictionary and union dictionary is that the training

samples must be transformed into one scale; otherwise, the dictionary will be

non-uniform.
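The three dictionary types could be built as follows (a sketch under the assumption that the per-date patch features X1 and X2 have already been transformed to a common scale, e.g. whitened, as required for the stacked and union dictionaries; names and sizes are our assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary(X, K=1000, seed=0):
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(X).cluster_centers_

def separate_dictionaries(X1, X2, K=1000):
    # SP: one dictionary per date, each used only for its own scenes
    return kmeans_dictionary(X1, K), kmeans_dictionary(X2, K)

def stacked_dictionary(X1, X2, K=1000):
    # ST: two sub-dictionaries learned separately, stacked into one 2K-word dictionary
    return np.vstack([kmeans_dictionary(X1, K), kmeans_dictionary(X2, K)])

def union_dictionary(X1, X2, K=1000):
    # UN: patches from both dates gathered into one training set before clustering
    return kmeans_dictionary(np.vstack([X1, X2]), K)
```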

3.3 Procedure

The procedure of scene change detection can be summarized as follows: 1) A large

number of patches are randomly selected from the multi-temporal datasets of scene

images as training samples; 2) The dictionary is learned according to the three

multi-temporal dictionary learning approaches; 3) The training scenes and test scenes

are encoded with the learned dictionary and represented as word frequencies with

feature pooling; 4) Post-classification and compound classification are employed to

obtain the change information.
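The four steps can be tied together as in the following self-contained toy run (synthetic data, the union dictionary, the RAW feature, and post-classification; all sizes are reduced and ZCA whitening is omitted, so this is only a sketch of the workflow, not the authors' implementation).

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
scenes_t1 = rng.random((60, 150, 150, 4))    # toy scenes for time 1
scenes_t2 = rng.random((60, 150, 150, 4))    # toy scenes for time 2
labels_t1 = rng.integers(0, 3, 60)           # toy reference scene labels
labels_t2 = rng.integers(0, 3, 60)

def sample_patches(scenes, n, size=10):
    # step 1: randomly select patches (RAW feature) from the scene dataset
    out = []
    for _ in range(n):
        s = scenes[rng.integers(len(scenes))]
        i, j = rng.integers(0, s.shape[0] - size, 2)
        out.append(s[i:i + size, j:j + size, :].reshape(-1))
    return np.asarray(out)

def word_frequency(scene, D, size=10, step=5):
    # steps 3-5: overlapping patches -> nearest visual word -> normalized histogram
    feats = [scene[i:i + size, j:j + size, :].reshape(-1)
             for i in range(0, scene.shape[0] - size + 1, step)
             for j in range(0, scene.shape[1] - size + 1, step)]
    labels = cdist(np.asarray(feats), D).argmin(axis=1)
    hist = np.bincount(labels, minlength=len(D)).astype(float)
    return hist / hist.sum()

# step 2: union dictionary learned from patches of both dates
X = np.vstack([sample_patches(scenes_t1, 2000), sample_patches(scenes_t2, 2000)])
D = KMeans(n_clusters=50, random_state=0, n_init=10).fit(X).cluster_centers_

F1 = np.array([word_frequency(s, D) for s in scenes_t1])
F2 = np.array([word_frequency(s, D) for s in scenes_t2])

# step 4: post-classification -- classify each date separately, then compare labels
train = rng.choice(60, 20, replace=False)
clf1 = SVC(kernel="linear").fit(F1[train], labels_t1[train])
clf2 = SVC(kernel="linear").fit(F2[train], labels_t2[train])
from_to = list(zip(clf1.predict(F1), clf2.predict(F2)))   # "from-to" scene change types
```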

4. Experiments and Discussion

4.1 Hanyang Dataset

In order to evaluate the performance of the scene change detection, we used two

VHR images covering the area of Hanyang in the city of Wuhan, China (Fig. 4). The

multi-temporal images were acquired by the IKONOS sensor on February 11th, 2002,

and June 24th, 2009, respectively. The spatial resolution was 1 m after the fusion of

the panchromatic and multispectral images by the Gram-Schmidt (GS) algorithm in ENVI. The image size was

7200×6000 with four bands, which were the red, green, blue, and near-infrared (NIR)

bands. Georeference and accurate co-registration were performed to guarantee that the

corresponding scenes cover the same area.

Fig. 4. Color images of the Hanyang area in the city of Wuhan, acquired on (a) February 11th, 2002, and (b) June 24th, 2009.

With this data, we obtained the scenes from the large image by a non-overlapping

grid with a cell size of 150×150. In this way, we generated 1920 test scenes (48×40)

for each time point. Eight classes were selected from the study scene by visual

interpretation, as shown in Fig. 5. In the study area, all the small villages represented

as sparse houses had been changed into other classes over the interval, and so the third

scene class did not exist in 2009. Its example in Fig. 5 (c) is therefore left blank.
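A minimal sketch of the non-overlapping grid described above (our illustration; img_2002 is a placeholder for the co-registered 7200×6000×4 image).

```python
import numpy as np

def tile_scenes(image, cell=150):
    # split a co-registered (H, W, B) image into non-overlapping cell x cell scenes;
    # for a 7200 x 6000 image this yields 48 x 40 = 1920 scenes
    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    return [image[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell, :]
            for r in range(rows) for c in range(cols)]

# scenes_2002 = tile_scenes(img_2002)   # list of 1920 scenes, each 150 x 150 x 4
```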
Fig. 5. Examples of training scenes acquired at different times belonging to: (a) parking, (b) water, (c) sparse houses, (d) dense houses, (e) residential area, (f) idle area, (g) vegetation area, and (h) industrial area.

The reference maps of the whole images are shown in Fig. 6. A total of 55.52% and

74.01% of the scenes for the images in 2002 and 2009, respectively, are labeled. The

undefined scenes are mostly a mixture of several classes of scene, and none of them

can be defined as the main land-use class. In order to choose representative and

typical scenes for training, we extracted the training scene samples separately from

the whole study scene, instead of the subset of test samples, which may be easier to

implement in practical applications. The numbers of training and test samples, and the

reference percentages are shown in Table I. In total, we selected 1066 and 1421

scenes in 2002 and 2009, respectively, for testing, and 963 common scenes for the

multi-temporal change detection.

Fig. 6. Reference map for the multi-temporal images acquired in (a) 2002 and (b) 2009 with eight classes (0-undefined, 1-parking, 2-water, 3-sparse house, 4-dense house, 5-residential, 6-idle, 7-farmland, 8-industrial).

In this paper, in order to obtain robust results with satisfactory accuracies, we chose

the number of training patches for the dictionary learning as 100000, the patch size as

10×10, the step for the patches as 5, and the size of the dictionary as 1000. Since

the dictionary is learned by randomly selected patches, we ran each case 10 times and

computed the statistical values. Different parameters of the algorithm will lead to

different accuracies and runtimes. However, the objective of this paper is the

exploration of a scene change detection framework for multi-temporal VHR images.

Therefore, these parameters were determined for a balance between accuracy,

complexity, and speed.

Table I
(a) Number of training and test samples, and the reference percentage for each scene class in 2002.
Code 0 1 2 3 4 5 6 7 8 Total
Training - 9 20 24 19 16 24 21 24 157
Test 854 11 252 139 69 41 206 226 122 1920
Per. % 44.48 0.57 13.13 7.24 3.59 2.14 10.73 11.77 6.35 100.00
(b) Number of training and test samples, and the reference percentage for each scene class in 2009.
Code 0 1 2 3 4 5 6 7 8 Total
Training - 11 30 0 29 34 20 15 43 182
Test 499 20 275 0 77 258 126 228 437 1920
Per. % 25.99 1.04 14.32 0.00 4.01 13.44 6.56 11.88 22.76 100.00

Firstly, we evaluated the performance of the classical K-means approach and the

simplified sparse approach (Sparse) proposed by Cheriyadat [27]. The quantitative

assessment was made by overall accuracy (OA) with the reference test samples. The

performance of the change detection result was evaluated according to the OA of the

“from-to” change map. The statistical plots for K-means and Sparse with the three

dictionary learning approaches and the post-classification change detection method

are shown in Fig. 7. The feature in this experiment is the RAW feature. For the

statistical box, the whiskers illustrate the maximum and minimum OAs, the white

square shows the mean value, and the horizontal lines of the box indicate the 25%, 50%, and 75% quantiles of the OA. Fig. 7 illustrates that the K-means approach

outperforms the Sparse approach in both one-time scene classification and the

“from-to” change detection. In the comparison of the three dictionary learning

approaches, they all obtained similar results, and the stacked and union dictionaries

obtained a slightly better performance than the separate dictionary in the “from-to”

result.

Fig. 7. OA for the results with K-means and Sparse in the dataset of (a) 2002, (b) 2009, and (c) the “from-to” change detection result. For the dictionary learning approaches, the separate approach is red, the stacked approach is green, and the union approach is blue.

The performances of different features and their combinations were evaluated with

the K-means approach and post-classification. The quantitative assessment is shown

in Fig. 8, where “R” means RAW, “M” means MSTD, “S” means SIFT, and “+”

means combination. This shows that in the 2002 dataset, the combined features

obtained better performances, except for the combination of MSTD and SIFT. In the

2009 dataset, the combination of RAW and SIFT got the highest OA, and the RAW

feature also obtained good results. In the final change detection result, the

combination of RAW and SIFT got the best results, and only the MSTD and SIFT

combination did not get a better accuracy than the single features. The reason for this

may be that the MSTD feature did not provide enough information and also obtained

the lowest accuracy. However, its combination outperforms the single feature of

MSTD. The evaluation illustrates that the combination of different features results in

better performances, since the different features carry complementary information.

However, the inclusion of MSTD in the combination of RAW and SIFT reduces the

accuracy, which is because the redundancy of information also negatively affects the

classification performance in certain cases. Comparatively, the union dictionary has a

slightly higher mean OA than the others, which means that the temporal information

can enrich the diversity of words in the dictionary.


Fig. 8. Mean and standard deviation of OA with different features and their combinations in the datasets of (a) 2002, (b) 2009, and (c) the “from-to” change detection result.

Fig. 9. OA of each classification with different features and their combinations.

Since the combination of different features is achieved by stacking the word frequencies, Fig. 9

shows the OA of the different features and their combinations in 10 classification runs

with the union dictionary. It can be observed that the combination of different features

clearly improves the performance in each classification, compared with the single

features.

Fig. 10. OA of post-classification and compound classification for the “from-to” change detection result.

The performances of post-classification and compound classification based on the

same representation of RAW+SIFT are shown in Fig. 10. The training samples for

compound classification are generated by the traversal of all possible combinations of

separate training samples for the multi-temporal datasets. The dashed line with

triangular labels indicates the OA of compound classification, and the solid line with

square labels indicates that of post-classification. Fig. 10 illustrates that, in most cases,

compound classification slightly outperforms post-classification. However, under the

same conditions, for the training and classification process after representation,

compound classification costs 1885.7 s, while post-classification only costs 5.2 s,

since the large number of classes and samples makes it more difficult to find the best

classification model. Therefore, we recommend post-classification in practical

applications, because of its efficiency and effectiveness.

Since the traversal of training samples is too time-consuming, we wanted to

evaluate the performance of a random combination of training samples. We therefore

randomly combined training samples, with numbers ranging from 50 to 400, using the representation of RAW+SIFT and the union dictionary, which can obtain the highest OA for the “from-to” change detection result. We then repeated each case 10 times.

The statistical plot is shown in Fig. 11, where the dashed line is the OA obtained with

the same representation and sample traversal. It can be found that by increasing the

number of random training samples, the mean OA approaches that of sample traversal,

with a smaller variance. This illustrates that increasing the number of random training

samples improves the performance, but this does not obtain better results than sample

traversal. Overall, post-classification is a more useful approach than compound

classification with generated training samples.
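A sketch of the random pairing used here (our illustration; the word frequencies and labels are assumed to be NumPy arrays holding the separate training sets of the two dates).

```python
import numpy as np

def random_compound_training(freqs_t1, labels_t1, freqs_t2, labels_t2, n, seed=0):
    # randomly pair n training scenes from the two dates (instead of the full
    # traversal) and stack their word frequencies as simulated "from-to" samples
    rng = np.random.default_rng(seed)
    i1 = rng.integers(0, len(freqs_t1), n)
    i2 = rng.integers(0, len(freqs_t2), n)
    X = np.concatenate([freqs_t1[i1], freqs_t2[i2]], axis=1)
    y = list(zip(labels_t1[i1], labels_t2[i2]))
    return X, y

# e.g. X, y = random_compound_training(F1_train, y1_train, F2_train, y2_train, n=200)
```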

Fig. 11. The OA with different numbers of random training samples.

Fig. 12 shows the scene maps in 2002 and 2009 with a high “from-to” change

detection accuracy, obtained by the RAW+SIFT features, the K-means approach, the

union dictionary, and post-classification. The OA of each map is 85.65% and 84.03%,

and the “from-to” change detection accuracy is 76.43%. It can be observed that the

small villages represented as sparse houses disappeared in 2009. Instead of sparse

houses, industrial areas and residential areas expanded very quickly. Most idle areas

in 2002 were transformed into other land-use classes. It is worth noting that the scene

change detection can reveal the detailed land-use transitions. For example, after seven years, the west side of Hanyang had developed into a residential area, and the east side had become an industrial area, while the traditional approaches would just regard both of them as

impervious surfaces.

Fig. 12. Scene maps for multi-temporal images acquired in (a) 2002 and (b) 2009 with eight classes (1-parking, 2-water, 3-sparse house, 4-dense house, 5-residential, 6-idle, 7-farmland, 8-industrial). The OAs are 85.65% and 84.03%, and 76.43% for the “from-to” change detection.

Table II
Statistics of the scene classes in 2002 and 2009
Class 2002 2009 +/−
1. Parking 9 13 +4
2. Water 292 312 +20
3. Sparse house 447 0 −447
4. Dense house 55 132 +77
5. Residential 201 507 +306
6. Idle 354 178 −176
7. Vegetation 249 267 +18
8. Industrial 313 511 +198

The statistics of the scenes in 2002 and 2009 are shown in Table II. Here, it can be

clearly found that, in 2002, there were many land-use areas with the type of sparse

houses and idle areas, since the urbanization of the Hanyang area had just started at

this time. By 2009, many of these land-use areas had been changed into residential

areas and industrial areas, because of the city expansion, which is shown by the
increasing number of these two scene classes.

Fig. 13. “From-to” change map: (a) change from sparse houses, (b) change to residential area, and (c) change to industrial area. The scenes include parking (1-white), water (2-blue), sparse houses (3-cyan), dense houses (4-purple), residential area (5-brown), idle area (6-yellow), farmland (7-green), and industrial area (8-red).

The “from-to” change maps for the transition of the three important scene classes

are shown in Fig. 13. The color in the change map indicates the different scene classes.

Fig. 13 (a) shows that sparse houses changed into other land-use classes. This

illustrates that most sparse house areas changed to residential areas and vegetation. It

can also be found that residential areas appeared in the top-left part of the study area in Fig. 13 (b). Comparatively, industrial areas expanded to the bottom-right part and

covered many idle areas and vegetation areas, shown in Fig. 13 (c). Fig. 13 shows the

“from-to” land-use transition map of the three most important scene classes in the

urban expansion. It can be seen that the scene change detection is able to provide

semantic interpretation of the land-use transition for urban management, which cannot

be achieved by the traditional approaches.

4.2 Hankou Dataset

We also chose another multi-temporal dataset from the Hankou area of Wuhan to
further evaluate the proposed approach. The source images of the multi-temporal

dataset were obtained on August 11th, 2002, and May 11th, 2008. They were also

acquired by IKONOS and pansharpened to be VHR with a 1-m resolution and four

bands. Georeference and accurate co-registration were performed as pre-processing.

Since the source images covering Hankou are too complex to label all the scenes, we

extracted multi-temporal scene pairs from the VHR images for the quantitative

assessment. The size of each scene was 150×150. Seven classes of land-use scene were

selected from the source images, which are shown in Fig. 14. As discussed in the

Hanyang experiment, the training samples were selected separately from each source

image. A total of 411 scene pairs were extracted for the test. A total of 104 and 92

training samples were selected in 2002 and 2008, respectively, for classification. The numbers of

training and test samples of the different classes are shown in Table III.

Fig. 14. Examples of multi-temporal scene pairs acquired in 2002 and 2008 belonging to: (a) water (1), (b) dense houses (2), (c) vegetation area (3), (d) agricultural area (4), (e) residential area (5), (f) idle area (6), and (g) industrial area (7).

The parameters, such as patch size, dictionary number, etc., were set the same as

those in the Hanyang experiment. The training patches for the dictionary learning

were randomly selected from the multi-temporal dataset of the scene pairs, instead of

the source images.

Table III
(a) Number of training and test samples, and percentage of test for each scene class in 2002.
Code 1 2 3 4 5 6 7 Total
Training 16 17 14 15 15 10 17 104
Test 125 61 66 34 60 25 40 411
Per. % 30.42 14.84 16.06 8.27 14.60 6.08 9.73 100.00
(b) Number of training and test samples, and percentage of test for each scene class in 2008.
Code 1 2 3 4 5 6 7 Total
Training 13 14 11 9 20 11 14 92
Test 87 68 30 16 119 33 58 411
Per. % 21.17 16.55 7.30 3.89 28.95 8.03 14.11 100.00

The mean and standard deviation of the OA with the different features and their

combinations with the three dictionaries is shown in Fig. 15. It can be observed that in

the 2002 dataset, RAW+SIFT obtained the best results, and the single RAW feature

also performed well. In the 2008 dataset, the combination of features outperformed

the single features in all cases. The post-classification results are therefore affected by both classification results, and RAW+SIFT obtained higher accuracies

than the other features. Fig. 15 indicates that RAW+SIFT is a very effective

combination of features for scene change detection.


Fig. 15. Mean and standard deviation of OA with different features and their combinations in the dataset of (a) 2002, (b) 2008, and (c) the “from-to” change detection result.

Fig. 16 shows the OA of different single features and their combination in the 2002

and 2008 datasets, and the post-classification result. The dictionary type was the

union dictionary. The combination of RAW and SIFT leads to an obvious

improvement over the separate features in every case. The other combinations result
in only a limited increase in accuracy. This is because the combination of features can

be both complementary and redundant. In summary, RAW+SIFT was found to be the

best combination in our experiments.


Fig. 16. OA of each classification with different features and their combinations.

Fig. 17. OA of post-classification and compound classification for the “from-to” change detection result.

We then evaluated the performances with the RAW and SIFT combination for

post-classification and compound classification, as shown in Fig. 17. The training

samples were generated by traversal for the compound classification. It can be seen

that, in most cases, the compound classification got a lower accuracy than the

post-classification. Considering the complexity of the compound classification,

post-classification is more effective in scene change detection.

Fig. 18 shows the OA statistics of the “from-to” change detection results with different numbers of random training samples. This illustrates that increasing the number of

random training samples leads to higher accuracies, approaching that of the sample

traversal. The generation of training samples for compound classification is not a

good approach for change detection. However, the direct selection of the “from-to”

samples from the dataset is also very difficult to achieve, because of the large number of possible types and the rareness of some samples. Thus, compound classification

may not be so practical in real applications, and needs further research.

Fig. 18. The OA with different numbers of random training samples.

5. Conclusion

In this paper, we have explored a scene change detection framework to monitor

land-use transitions from a semantic point of view. Firstly, the features of random

patches are extracted to learn a dictionary. Three dictionary learning approaches based

on multi-temporal patches are addressed: the separate dictionary, the stacked

dictionary, and the union dictionary. The image patches are then represented by word

frequency with the learned dictionary, according to the BOVW model. Finally,

post-classification and compound classification are applied to detect the “from-to”

changes of the land-use classes.

The experiments show that the proposed framework is effective in analyzing the semantic changes of remote sensing image scenes, and has practical significance in the study of city development. A combination of features can clearly improve the

performance, and RAW+SIFT is the best combination for scene identification in these

cases. K-means is more effective than the simple Sparse approach in dictionary

learning, and dictionary learning with multi-temporal information can improve the

performance, since it can enrich the diversity of the dictionary. The experimental

results also illustrate that, although a compound classification may slightly increase

the accuracy over that of post-classification, it is too time-consuming, and thus

post-classification is recommended in practical applications. In summary, the

proposed scene change detection framework has great potential in land-use

monitoring and city management, since the traditional change detection approaches

cannot discriminate the semantic land-use variations. The improvement of the scene change detection framework will be the focus of our future work.

Acknowledgements

This work was supported in part by the National Basic Research Program of China

(973 Program) under Grant 2012CB719905, and by the National Natural Science Foundation

of China under Grants 61471274 and 41431175.

References
[1] C. Xu, D. Tao, and C. Xu, "A Survey on Multi-view Learning," ArXiv e-prints, arXiv:1304.5634, Apr. 2013. Available: http://adsabs.harvard.edu/abs/2013arXiv1304.5634X
[2] W. F. Liu, H. L. Liu, D. P. Tao, Y. J. Wang, and K. Lu, "Multiview Hessian
regularized logistic regression for action recognition," Signal Processing, vol. 110, pp.
101-107, May 2015.
[3] D. C. Tao, X. L. Li, X. D. Wu, and S. J. Maybank, "Geometric Mean for
Subspace Selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, pp. 260-274,
Feb 2009.
[4] D. C. Tao, X. Tang, X. L. Li, and X. D. Wu, "Asymmetric bagging and random
subspace for support vector machines-based relevance feedback in image retrieval,"
IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, pp. 1088-1099, Jul 2006.
[5] D. C. Tao, X. L. Li, X. D. Wu, and S. J. Maybank, "General tensor discriminant
analysis and Gabor features for gait recognition," IEEE Trans. Pattern Anal. Mach.
Intell., vol. 29, pp. 1700-1715, Oct 2007.
[6] L. Weifeng and T. Dacheng, "Multiview Hessian Regularization for Image
Annotation," IEEE Trans. Image Process., vol. 22, pp. 2676-2687, 2013.
[7] W. Liu, D. Tao, J. Cheng, and Y. Tang, "Multiview Hessian discriminative sparse
coding for image annotation," Computer Vision and Image Understanding, vol. 118,
pp. 50-60, Jan. 2014.
[8] D. Bo and Z. Liangpei, "A Discriminative Metric Learning Based Anomaly
Detection Method," IEEE Trans. Geosci. Remote Sens., vol. 52, pp. 6844-6857, 2014.
[9] B. Du, L. Zhang, D. Tao, and D. Zhang, "Unsupervised transfer learning for
target detection from hyperspectral images," Neurocomputing, vol. 120, pp. 72-82,
Nov. 2013.
[10] D. Lu, P. Mausel, E. Brondizio, and E. Moran, "Change detection techniques," Int.
J. Remote Sens., vol. 25, pp. 2365-2401, Jun 2004.
[11] A. Singh, "Review Article Digital change detection techniques using
remotely-sensed data," Int. J. Remote Sens., vol. 10, pp. 989-1003, Jul 1989.
[12] B. Du, Y. Zhang, L. Zhang, and L. Zhang, "A hypothesis independent subpixel
target detector for hyperspectral images," Signal Processing, vol. 110, pp. 244-249, May 2015.
[13] B. Du and L. Zhang, "Target detection based on a dynamic subspace," Pattern
Recognit., vol. 47, pp. 344-358, Jan. 2014.
[14] B. Du and L. Zhang, "Random-Selection-Based Anomaly Detector for
Hyperspectral Imagery," IEEE Trans. Geosci. Remote Sens., vol. 49, pp. 1578-1589,
May 2011.
[15] R. E. Kennedy, P. A. Townsend, J. E. Gross, W. B. Cohen, P. Bolstad, Y. Q. Wang,
and P. Adams, "Remote sensing change detection tools for natural resource managers:
Understanding concepts and tradeoffs in the design of landscape monitoring projects,"

Remote Sens. Environ., vol. 113, pp. 1382-1396, Jul 2009.
[16] M. J. Carlotto, "Detection and analysis of change in remotely sensed imagery
with application to wide area surveillance," IEEE Trans. Image Process., vol. 6, pp.
189-202, 1997.
[17] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, "Image change detection
algorithms: a systematic survey," IEEE Trans. Image Process., vol. 14, pp. 294-307,
Mar 2005.
[18] F. Bovolo and L. Bruzzone, "A Theoretical Framework for Unsupervised Change
Detection Based on Change Vector Analysis in the Polar Domain," IEEE Trans.
Geosci. Remote Sens., vol. 45, pp. 218-236, Jan 2007.
[19] C. Wu, B. Du, and L. Zhang, "Slow Feature Analysis for Change Detection in
Multispectral Imagery," IEEE Trans. Geosci. Remote Sens., vol. 52, pp. 2858-2874,
May. 2014.
[20] B. Demir, F. Bovolo, and L. Bruzzone, "Detection of Land-Cover Transitions in
Multitemporal Remote Sensing Images With Active-Learning-Based Compound
Classification," IEEE Trans. Geosci. Remote Sens., vol. 50, pp. 1930-1941, 2012.
[21] G. Xian, C. Homer, and J. Fry, "Updating the 2001 National Land Cover Database
land cover classification to 2006 by using Landsat imagery change detection
methods," Remote Sens. Environ., vol. 113, pp. 1133-1147, Jun 2009.
[22] M. Hussain, D. Chen, A. Cheng, H. Wei, and D. Stanley, "Change detection from
remotely sensed images: From pixel-based to object-based approaches," ISPRS J.
Photogramm., vol. 80, pp. 91-106, Jun. 2013.
[23] G. Chen, G. J. Hay, L. M. T. Carvalho, and M. A. Wulder, "Object-based change
detection," Int. J. Remote Sens., vol. 33, pp. 4434-4457, Jul. 2012.
[24] A. Bosch, A. Zisserman, and X. Muñoz, "Scene Classification Via pLSA," in
Computer Vision – ECCV 2006. vol. 3954, A. Leonardis, et al., Eds., ed: Springer
Berlin Heidelberg, 2006, pp. 517-530.
[25] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene
classification," Pattern Recognit., vol. 37, pp. 1757-1771, Sep. 2004.
[26] A. Bosch, A. Zisserman, and X. Muoz, "Scene Classification Using a Hybrid
Generative/Discriminative Approach," IEEE Trans. Pattern Anal. Mach. Intell., vol.
30, pp. 712-727, 2008.
[27] A. M. Cheriyadat, "Unsupervised Feature Learning for Aerial Scene
Classification," IEEE Trans. Geosci. Remote Sens., vol. 52, pp. 439-451, 2014.
[28] X. Zheng, X. Sun, K. Fu, and H. Wang, "Automatic Annotation of Satellite
Images via Multifeature Joint Sparse Coding With Spatial Relation Constraint," IEEE
Geosci. Remote Sens. Lett., vol. 10, pp. 652-656, 2013.
[29] W. Shao, W. Yang, G.-S. Xia, and G. Liu, "A Hierarchical Scheme of Multiple
Feature Fusion for High-Resolution Satellite Scene Categorization," in Computer
Vision Systems. vol. 7963, M. Chen, et al., Eds., ed: Springer Berlin Heidelberg, 2013,
pp. 324-333.
[30] X. Sheng, F. Tao, L. Deren, and W. Shiwei, "Object Classification of Aerial
Images With Bag-of-Visual Words," IEEE Geosci. Remote Sens. Lett., vol. 7, pp.
366-370, 2010.

[31] G. Sheng, W. Yang, T. Xu, and H. Sun, "High-resolution satellite scene
classification using a sparse coding based multiple feature combination," Int. J.
Remote Sens., vol. 33, pp. 2395-2412, 2012.
[32] L. Wang, L. Hongliang, and L. Guanghui, "Automatic Annotation of
Multispectral Satellite Images Using Author-Topic Model," IEEE Geosci. Remote
Sens. Lett., vol. 9, pp. 634-638, 2012.
[33] B. Zhao, Y. Zhong, and L. Zhang, "Scene classification via latent Dirichlet
allocation using a hybrid generative/discriminative strategy for high spatial resolution
remote sensing imagery," Remote Sensing Letters, vol. 4, pp. 1204-1213, 2013.
[34] M. Lienou, H. Maitre, and M. Datcu, "Semantic Annotation of Satellite Images
Using Latent Dirichlet Allocation," IEEE Geosci. Remote Sens. Lett., vol. 7, pp. 28-32,
2010.
[35] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for
land-use classification," presented at the Proceedings of the 18th SIGSPATIAL
International Conference on Advances in Geographic Information Systems, San Jose,
California, 2010.
[36] T. Lindeberg, "Scale Invariant Feature Transform," Scholarpedia, vol. 7, p. 10491,
2012.
[37] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial
Pyramid Matching for Recognizing Natural Scene Categories," in 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp.
2169-2178.
[38] A. Coates and A. Y. Ng, "Learning feature representations with k-means," in
Neural Networks: Tricks of the Trade, ed: Springer, 2012, pp. 561-580.
[39] L. Zhang, L. Zhang, D. Tao, and X. Huang, "On Combining Multiple Features for
Hyperspectral Remote Sensing Image Classification," IEEE Trans. Geosci. Remote
Sens., vol. 50, pp. 879-893, Mar 2012.
[40] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines,"
ACM Trans. Intell. Syst. Technol., vol. 2, pp. 1-27, 2011.
[41] X. Chen, J. Chen, Y. Shi, and Y. Yamaguchi, "An automated approach for
updating land cover maps based on integrated change detection and classification
methods," ISPRS J. Photogramm., vol. 71, pp. 86-95, Jul. 2012.

