INTERNATIONAL JOURNAL OF REMOTE SENSING
2020, VOL. 41, NO. 16, 6248–6287
https://doi.org/10.1080/01431161.2020.1736732

Feature extraction for hyperspectral image classification: a review

Brajesh Kumar^a, Onkar Dikshit^b, Ashwani Gupta^a and Manoj Kumar Singh^a

^a Department of Computer Science & Information Technology, MJP Rohilkhand University, Bareilly, India; ^b Department of Civil Engineering, Indian Institute of Technology Kanpur, Kanpur, India

ABSTRACT
Hyperspectral image sensors capture surface reflectance over a range of wavelengths. The fine spectral information is recorded in terms of hundreds of bands. Hyperspectral image classification has attracted great interest among researchers in the remote sensing community. High dimensionality provides rich spectral information for the classification process, but due to dense sampling, some of the bands may contain redundant information. Sometimes, spectral information alone may not be sufficient to obtain the desired accuracy of results. Therefore, spatial and spectral information are often integrated for better accuracy. However, unlike spectral information, spatial information is not directly available with the image; additional effort is needed to extract it. Feature extraction is an important step in a classification framework. It has the following major objectives: redundancy reduction, dimensionality reduction (usually but not always), enhancing discriminative information, and modelling of spatial features. The spectral feature extraction process transforms the original data to a new space of a different dimension, enhancing the class separability without significant loss of information. Various mathematical techniques are applied for modelling spatial features based on pixel spatial neighbourhood relations. In this paper, a review of the major feature extraction techniques is presented. Experimental results are presented for two benchmark hyperspectral images to evaluate different feature extraction techniques for various parameters.

ARTICLE HISTORY
Received 10 November 2019; Accepted 20 February 2020

1. Introduction
Hyperspectral images are captured in hundreds of fine, narrow, contiguous spectral bands (Goetz 2009) in the visible to infrared regions of the electromagnetic spectrum. Each pixel in a hyperspectral image is represented by a vector whose size is equal to the number of spectral bands. The pixels have very detailed spectral signatures, as each component of the vector is a measurement corresponding to a specific wavelength. The contiguous acquisition makes it possible to derive a radiance spectrum for each pixel in the image. The rich spectral information helps to discriminate surface features and objects better than traditional imaging systems (Li et al. 2011a). However, due to the fine spectral distance, these


bands are highly correlated and provide redundant information (Jia and Richards 1994)
for a particular application. Since hyperspectral sensors are not designed for specific
applications, some bands that are useful in a given application may not reveal important
information for some other applications (Yin, Wang, and Hu 2012). Therefore, extracting
application specific information is crucial for the effective use of hyperspectral images.
Hyperspectral image classification is an important tool for the quantitative analysis of image data, with applications in a wide range of areas including environmental studies, agricultural monitoring, defence, urban planning, and weather forecasting. High dimensionality poses some challenges, including the curse of dimensionality (Hughes 1968) and increased computational cost, that make supervised classification a challenging task. The redundant and noisy bands not only put unnecessary computational load but also affect the classification accuracy. An increased number of features provides more information to the classifier, but the number of training samples required for a reasonable estimation of the statistical behaviour of the data increases exponentially as dimensionality becomes higher (Landgrebe 2003). Maintaining a reasonably good ratio of available training samples to the number of features is important for the good performance of any classifier. However, usually only a limited number of training samples is available (Huang and Kuo 2010), as the sampling and collection of reliable training samples is itself a complex and expensive task. In the absence of a sufficient number of training pixels, classification accuracies tend to be poor. A widely accepted approach to mitigate dimensionality related issues is to represent the data with a reduced number of features. Feature reduction is a crucial preprocessing step for effective hyperspectral image classification.
Most dimensionality reduction techniques can be divided into two major categories: feature selection and spectral feature extraction. Feature selection algorithms mainly aim to reduce spectral redundancy and optimally select a subset of the original spectral bands, discarding the others (Serpico and Moser 2007). Various measures are used as selection criteria to distinguish the features, including spectral distance (Backer et al. 2005; Ifarraguerri and Prairie 2004), variance (Chang et al. 1999), mutual information (Martinez-Uso et al. 2007; Estevez et al. 2009), spectral angle mapper (Keshava 2004), and spectral divergence (Chang and Wang 2016). But separability-based feature selection methods are computationally very expensive (Landgrebe 2003), and it is difficult to determine the optimal number of features to select. The second approach, feature extraction, either enhances relevant bands by arithmetic operations or projects the data onto a new feature space preserving the discriminative information (Yin, Wang, and Hu 2012). The new feature space is usually of lower dimension, but sometimes it may have the same or higher dimension than the original one. Although the original form of the data is lost and its physical interpretation becomes difficult, spectral feature extraction is the more effective of the two paradigms (Serpico and Moser 2007). With recent developments, feature extraction does not necessarily generate a smaller number of features. Some modern techniques, such as Tensor Principal Component Analysis and some deep learning based methods, generate more than the original number of features. With the advent of such techniques and the availability of high end computing resources, high dimensionality is no longer always a curse, although the availability of a good number of training samples is still a point of concern. Modern sensors provide fine spatial resolution, enabling the availability of information on smaller spatial structures, such as edges, texture, shape, and size. The additional information on spatial structures helps to better

discriminate the objects and land cover features. Extraction of information on spatial
structures from the image data is crucial for classification accuracy.
Feature extraction/selection techniques have been reviewed by researchers from time to time. Jia, Kuo, and Crawford (2013) provided a comprehensive review of feature extraction as well as feature selection methods with a focus on dimensionality reduction and feature mining. The work is concerned with the extraction of spectral information only. Li et al. (2018) reviewed dimensionality reduction techniques in a specific domain, based on discriminant analysis. The authors discussed and analyzed mainly linear discriminant analysis (LDA), sparse graph-based discriminant analysis (SGDA), and their various extensions. The article is concerned with spectral features and the dimensionality reduction aspect only. Modern deep learning based methods are not covered, and there is no discussion of spatial or spectral–spatial feature extraction in these articles. A good review of feature selection techniques is presented by Sun and Du (2019); it covers band selection methods exclusively. Both conventional and modern approaches are nicely presented in the article.
In this work, major feature extraction techniques are reviewed and analyzed with their strengths and limitations. The focus of the work is on the extraction of both spectral and spatial information. It is a comprehensive review that includes most of the major spectral, spatial, and spectral–spatial feature extraction techniques. It attempts to cover a range of methods, from conventional to advanced ones, including deep learning techniques.

2. Spectral feature extraction


Feature extraction produces new features by combining original bands. The new features preserve most of the important information. Feature extraction methods can be divided into four categories: knowledge-based, statistical, wavelet-based, and deep learning based. Knowledge-based methods enhance specific characteristics of the relevant bands (Benediktsson and Ghamisi 2016) to distinguish the objects or surface features of interest. Statistical feature extraction transforms the high dimensional data into some other, lower dimensional feature space, reducing redundancy and enhancing class separability. The success of this type of method depends on its ability to transform features without significant loss of information. Wavelet-based feature extraction relies on the fact that the wavelet transform decomposes the signal into constituent wavelets of different scales and positions.

2.1. Knowledge-based feature extraction


Knowledge-based feature extraction methods enhance specific characteristics of spectral
bands by performing arithmetic operations on the relevant original bands. These techni-
ques are straightforward and determine features using spectral knowledge of the classes.
The spectral knowledge can be represented in terms of some physical indicators such as
normalized difference vegetation index (NDVI), normalized difference water index (NDWI),
soil adjusted vegetation index (SAVI), etc. Haboudane et al. (2004) compared some
vegetation indices including NDVI and SAVI using hyperspectral data and modified
some of them to predict green leaf area index. Onoyama et al. (2014) used equation
‘greenNDVI-NDVI’ to estimate nitrogen contents of rice plants from hyperspectral images.

Dutta et al. (2015) quantified soil constituents from airborne hyperspectral data using
NDVI and lasso regression algorithm. Garcia-Salgado and Ponomaryov (2016) converted
the hyperspectral image to NDVI representation and computed texture features for
classification. Although knowledge-based features have a direct relation to physical parameters, application specific expertise is required to derive such features. The purpose of this category of feature extraction is not dimensionality reduction at all. Rather, these techniques are used to obtain additional and different kinds of information from spectral knowledge to discriminate the land cover types/objects.
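As a concrete illustration of such an index, the sketch below computes NDVI from a hyperspectral cube; the band indices standing in for the red and near-infrared wavelengths are assumptions that depend on the sensor's band-to-wavelength calibration.

```python
# Hedged sketch: NDVI as a knowledge-based feature from a hyperspectral cube.
# `red_band` and `nir_band` are hypothetical indices; map them to the actual
# wavelengths of the sensor in use.
import numpy as np

def ndvi(cube: np.ndarray, red_band: int = 29, nir_band: int = 50) -> np.ndarray:
    red = cube[:, :, red_band].astype(np.float64)
    nir = cube[:, :, nir_band].astype(np.float64)
    return (nir - red) / (nir + red + 1e-12)   # NDVI = (NIR - Red) / (NIR + Red)
```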

2.2. Statistical feature extraction


Statistical feature extraction is the process of transforming high dimensional data to a lower dimensional space, enhancing class separability without significant loss of information. A new set of features is generated by the re-distribution of the underlying information. New features are linear/non-linear combinations of the original bands (Backer et al. 2005). These methods can be supervised or unsupervised. Supervised methods require prior knowledge in the form of labelled samples to determine some metric that separates data points into different classes (Zhao and Du 2016). Unsupervised techniques are used when training data are not available. These methods are usually faster than their supervised counterparts (Benediktsson and Ghamisi 2016) and are better suited to time-constrained applications. Let us consider a B-dimensional hyperspectral image $X \in \mathbb{R}^{M \times N \times B}$ having P pixels, where $P = M \times N$.

2.2.1. Unsupervised feature extraction


Unsupervised feature extraction methods do not require prior information or training data. Projection pursuit (PP) is an unsupervised mechanism that transforms high dimensional data into a lower dimensional space, retaining most of the information in a new set of orthogonal variables. PP was first introduced by Friedman and Tukey (1974) for multivariate data analysis. To transform a hyperspectral image I, it is first rearranged into a 2-D matrix $X = (x_1, x_2, \ldots, x_P) \in \mathbb{R}^{B \times P}$, where each pixel $x_i$ is a vector with B elements. The general equation of transformation is

$$Y = \Omega^T X \quad (1)$$

where $Y \in \mathbb{R}^{D \times P}$ is the transformed data and $D \leq B$ is the reduced number of bands. The transformation matrix Ω can be computed by optimizing a function called the projection index, which is a real valued function of Y

$$f(Y) = f(\Omega^T X) \quad (2)$$

where $f(\cdot)$ depends on the application. Ifarraguerri and Chang (2000) used a projection
index based on information divergence for hyperspectral image classification. Chiang,
Chang, and Ginsberg (2001) proposed projection indices based on moments for unsu-
pervised target detection. Principal component analysis (PCA) (Joliffe 2002) is a widely
used unsupervised technique that can be considered as a type of PP that uses variance as
a projection index. It generates features called principal components minimizing the

correlations among them. Mathematically, PCA is performed by eigenvalue decomposition of the covariance matrix of X. The $B \times B$ covariance matrix $\Sigma_X$ of X is given as

$$\Sigma_X = \frac{1}{P}\sum_{i=1}^{P}(x_i - \bar{x})(x_i - \bar{x})^T, \qquad \bar{x} = \frac{1}{P}\sum_{i=1}^{P} x_i \quad (3)$$

where $\bar{x}$ is the sample mean. The $B \times D$ transformation matrix Ω is obtained from the eigenvectors of $\Sigma_X$ corresponding to the top D non-zero eigenvalues by solving the eigenvalue equation

$$\Sigma_X V = \Lambda V, \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_B) \quad (4)$$

where $\Lambda \geq 0$ and V are the matrices of eigenvalues and corresponding eigenvectors, respectively. The work by Rellier et al. (2004), Neher and Srivastava (2005), Pu et al.
(2014), and Xia et al. (2015) are some of the many examples that used PCA for feature
extraction from hyperspectral images. Fauvel et al. (2008), Velasco-Forero and Angulo
(2013), Lv et al. (2014), and Tan et al. (2014) used the first few principal components to
perform morphological operations.
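A minimal sketch of PCA-based spectral feature extraction corresponding to Equations (1)-(4) is given below, using scikit-learn; the cube shape and component count are illustrative assumptions.

```python
# Hedged sketch: PCA spectral features from a hyperspectral cube (M, N, B).
import numpy as np
from sklearn.decomposition import PCA

def pca_features(cube: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Flatten the cube to (P, B), project onto the top-D principal
    components, and restore the spatial layout as (M, N, D)."""
    M, N, B = cube.shape
    X = cube.reshape(-1, B).astype(np.float64)   # P x B, one row per pixel
    pca = PCA(n_components=n_components)         # eigendecomposition of the covariance
    Y = pca.fit_transform(X)                     # Y = (X - mean) @ Omega, per Eq. (1)
    return Y.reshape(M, N, n_components)

# Example: cube = np.random.rand(145, 145, 200); feats = pca_features(cube, 10)
```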
PCA assumes that informative variables have large variance and captures only second-order statistics of the data. Higher-order principal components are supposed to possess steadily lower signal-to-noise ratio. However, sometimes image quality does not steadily decrease with increasing component number (Green et al. 1988). The principal components based on variance sometimes do not adequately represent the image quality (Chang and Du 1999). To overcome these limitations of PCA, a technique named maximum noise fraction (MNF) was developed by Green et al. (1988) that maximizes the signal-to-noise ratio instead of variance. The principal components generated by MNF are arranged according to decreasing image quality instead of variance. Assuming an additive model, the observed image Z can be written in terms of the uncorrelated image X and noise content $W = (w_1, w_2, \ldots, w_P) \in \mathbb{R}^{B \times P}$ as

$$Z = X + W \quad (5)$$

so that

$$\Sigma_Z = \Sigma_X + \Sigma_W \quad (6)$$

The transformation matrix can be obtained by solving the following eigenvalue equation

$$\Sigma_W \Sigma_Z^{-1} V = \Lambda V \quad (7)$$

Huang and Zhang (2010), Rasti, Ulfarsson, and Sveinsson (2010), and Dopido et al. (2012)
explored the capabilities of MNF as a feature extraction technique for hyperspectral
images and established its superiority over PCA.
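A hedged sketch of the MNF transform of Equations (5)-(7) is shown below. The noise covariance is estimated from differences between horizontally adjacent pixels, which is one common heuristic rather than necessarily the estimator of Green et al. (1988).

```python
# Hedged sketch: MNF via a generalized eigenproblem; all sizes illustrative.
import numpy as np
from scipy.linalg import eigh

def mnf_features(cube: np.ndarray, n_components: int = 10) -> np.ndarray:
    M, N, B = cube.shape
    Z = cube.reshape(-1, B).astype(np.float64)
    Zc = Z - Z.mean(axis=0)
    sigma_z = np.cov(Zc, rowvar=False)            # covariance of the observed data, Eq. (6)
    # Assumed noise estimate: differences between horizontally adjacent pixels
    diff = (cube[:, 1:, :] - cube[:, :-1, :]).reshape(-1, B)
    sigma_w = np.cov(diff, rowvar=False) / 2.0
    # Solve Sigma_W v = lambda Sigma_Z v (Eq. (7)); ascending lambda means
    # the first eigenvectors carry the highest signal-to-noise ratio.
    eigvals, eigvecs = eigh(sigma_w, sigma_z)
    V = eigvecs[:, :n_components]                 # components ordered by image quality
    return (Zc @ V).reshape(M, N, n_components)
```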
Both PCA and MNF, being based on second-order statistics, cannot characterize subtle material substances, because for such substances sufficient samples are not available to constitute reliable statistics (Wang and Chang 2016). Accurate determination of the covariance matrix also depends on the availability of adequate samples. Under such circumstances, independent component analysis (ICA) (Hyvärinen, Karhunen, and Oja 2001) works better. ICA is related to PCA; it intends to find a linear decomposition of the original data into statistically independent latent variables known as independent components (Mura et al. 2011). ICA assumes a linear and independent mixture of data variables and attempts to separate the original data into mutually independent non-Gaussian components with the help of an unmixing matrix. If the pixels $x_1, x_2, \ldots, x_P$ are mixtures of the random variables $y_1, y_2, \ldots, y_P$, the mixing model can be written as

$$X = AY \quad (8)$$

where $Y = [y_1, y_2, \ldots, y_P]^T$ is the unknown source matrix and A is the unknown mixing matrix. The independent components can be obtained by using an unmixing matrix W, the inverse of A, as the transformation matrix. ICA requires some preprocessing before unmixing, such as centring of the data points and whitening. The whitening is done to restrict the search for the unmixing matrix to the orthogonal space only. Mura
et al. (2011) used independent components to create extended attribute profiles for
hyperspectral images. Falco, Benediktsson, and Bruzzone (2014) investigated different
ICA algorithms to extract class discriminant information. Xia et al. (2016) used ICA and
edge preserving filtering for hyperspectral image classification.
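A minimal sketch using scikit-learn's FastICA is given below; centring and whitening are handled internally by the implementation, and the number of components is an assumption.

```python
# Hedged sketch: independent components as spectral features, per Eq. (8).
import numpy as np
from sklearn.decomposition import FastICA

def ica_features(cube: np.ndarray, n_components: int = 10, seed: int = 0) -> np.ndarray:
    M, N, B = cube.shape
    X = cube.reshape(-1, B).astype(np.float64)
    # FastICA centres and whitens the data internally before unmixing
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=500)
    Y = ica.fit_transform(X)          # rows: pixels, columns: independent components
    return Y.reshape(M, N, n_components)
```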
A major limitation of these methods is that they are not efficient for modelling the nonlinear relationships in complex data. A kernel trick is sometimes used to adapt linear methods for nonlinear feature extraction. Kernel-based variants of PCA, ICA, and MNF have been developed, known as KPCA (Schölkopf, Smola, and Müller 1998), KICA (Bach and Jordan 2003), and KMNF (Nielsen 2011), respectively. The success of kernel-based methods depends on the choice of the kernel function and efficient estimation of the parameters. Manifold learning techniques are better suited to complex data with nonlinear structure. Bachmann, Ainsworth, and Fusina (2005, 2006) developed dimensionality reduction methods based on isometric mapping (ISOMAP). However, ISOMAP is computationally complex, and some solutions are required to alleviate the memory and CPU time requirements of these methods. Ma and Crawford (2010) used local manifold learning for dimensionality reduction of hyperspectral images.

2.2.2. Supervised feature extraction


Supervised feature extraction relies on the prior knowledge provided by labelled samples. These methods can further be divided into parametric and non-parametric categories. Parametric methods rely on the estimation of a fixed set of class level parameters and often make strong assumptions about the data distribution. Non-parametric methods make no such assumptions. Local Fisher's discriminant analysis (LFDA) (Duda, Hart, and Stork 2001) is a classical parametric feature extraction technique.
It is based on the optimization criterion defined by

$$J = \mathrm{tr}(S_w^{-1} S_b) \quad (9)$$

where $S_w$ is the within-class scatter matrix, $S_b$ is the between-class scatter matrix, and tr(·) is the trace of the matrix. The matrices $S_w$ and $S_b$ (Lee and Landgrebe 1993) are defined using training data for K classes as follows

$$S_w = \sum_{i=1}^{K} p_i \Sigma_i \quad (10)$$

$$S_b = \sum_{i=1}^{K} p_i (m_i - m_0)(m_i - m_0)^T, \qquad m_0 = \sum_{i=1}^{K} p_i m_i \quad (11)$$

where $m_i$, $p_i$, and $\Sigma_i$ are the mean vector, prior probability, and covariance matrix of class i, respectively. The transformation matrix Ω is obtained by solving the following eigenvalue equation

$$(S_w^{-1} S_b - \Lambda I)\Omega = 0 \quad (12)$$
LFDA has shown good performance for different types of images, but it has limited success with high dimensional images (Huang and Kuo 2010) due to its inherent limitations. LFDA assumes a normal-like distribution of the classes, which may not be the case with real images. For K-class data, it produces at most K − 1 features, as the rank of the between-class scatter matrix is at most K − 1. Such a low number of features is not always optimal (Kuo and Landgrebe 2004). Moreover, the within-class scatter matrix is often singular for high dimensional images (Yang, Yu, and Kuo 2010). Non-parametric discriminant analysis (NDA) was introduced by Fukunaga and Mantock (1983), defining a new non-parametric between-class scatter matrix to overcome some limitations of LFDA. But NDA has the same singularity problem. NDA was improved by the non-parametric weighted feature extraction (NWFE) (Kuo and Landgrebe 2004) method. NWFE defines new within-class and between-class scatter matrices and computes weighted means. NWFE is a more successful method for hyperspectral imagery; however, it faces issues related to computation time. Yang, Yu, and Kuo (2010) developed a method known as cosine-based nonparametric feature extraction (CNFE), which uses a cosine distance based weight function for the scatter matrices. CNFE employs a regularization technique to handle the singularity problem. A method known as decision boundary feature extraction (DBFE) (Lee and Landgrebe 1993) was developed specifically for hyperspectral images. It uses decision boundaries instead of class means and covariance matrices to derive feature vectors. DBFE is computationally intensive and requires a good number of quality training samples for determining efficient decision boundaries. It is an efficient feature extraction technique, but with limited training samples its performance may degrade.
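The criterion of Equations (9)-(12) can be sketched as follows; the small ridge term added to $S_w$ is our assumption, made to sidestep the singularity problem noted above.

```python
# Hedged sketch: scatter-matrix discriminant transform per Eqs. (9)-(12).
import numpy as np

def lda_transform(X: np.ndarray, y: np.ndarray, n_components: int) -> np.ndarray:
    """X: (n_samples, B) labelled pixels; y: (n_samples,) class ids.
    Returns the B x D transformation matrix Omega."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    B = X.shape[1]
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    m0 = priors @ means                                   # weighted overall mean, Eq. (11)
    Sw = np.zeros((B, B))
    Sb = np.zeros((B, B))
    for c, p, m in zip(classes, priors, means):
        Xc = X[y == c] - m
        Sw += p * (Xc.T @ Xc) / max(len(Xc) - 1, 1)       # Eq. (10)
        d = (m - m0)[:, None]
        Sb += p * (d @ d.T)                               # Eq. (11)
    Sw += 1e-6 * np.trace(Sw) * np.eye(B)                 # assumed ridge: S_w is often singular
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))   # Eq. (12)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real
```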
The kernel trick can also be used to extend linear supervised methods to nonlinear feature extraction. LFDA was extended by Baudat and Anouar (2000) to generalized discriminant analysis (GDA) using a kernel function. GDA performs non-linear analysis, but like LFDA it produces only K − 1 features. Kuo, Li, and Yang (2009) proposed the kernel NWFE (KNWFE) method and showed that NWFE is a special case of KNWFE with a linear kernel. KNWFE works efficiently if a good number of training samples is available. Li et al. (2011b) effectively used another kernel version of LFDA, named kernel local Fisher discriminant analysis (KLFDA) (Sugiyama 2007), for hyperspectral feature extraction. KLFDA is a combination of locality preserving projection and kernel discriminant analysis.

2.3. Wavelet-based feature extraction


Wavelet transforms are widely used in signal analysis. Wavelets have the ability to
separate fine-scale and large-scale details of the signal, preserving the energy and spatio-
geometrical information at different scales (Shankar, Meher, and Ghosh 2011). Several
methods based on wavelet transforms have been developed for feature extraction from

hyperspectral data. The fundamental operator is the mother wavelet, which is a function ψ(t) that satisfies the following requirement (Bruce, Koger, and Li 2002)

$$\int_{-\infty}^{+\infty} \frac{|F(\psi(t))|^2}{|\omega|}\, d\omega < \infty \quad (13)$$

Most of these methods use the discrete wavelet transform (DWT). Usually, the 1-D DWT is used for spectral feature extraction (Kumar and Dikshit 2015a). The 1-D DWT can be applied to a hyperspectral signal f(·) of length B as follows

$$W_\psi(i, j) = \sum_{t=0}^{B-1} f(t)\, \psi_{i,j}(t) \quad (14)$$

where $\psi_{i,j}(t) = 2^{i/2}\, \psi(2^i t - j)$, $2^i$ is the scale parameter, and j is the translation parameter. It decomposes the signal into approximation (L) and detail (H) coefficients. Bruce, Koger, and Li (2002) used the 1-D DWT iteratively and showed that wavelet decomposition can be performed up to $\log_2(B)$ levels without losing significant information. The wavelet coefficients are used as spectral features. The authors investigated a number of mother wavelets and found that lower order wavelets are better for hyperspectral image feature extraction. They also established that larger scale wavelet coefficients are more useful. Kaewpijit, Moigne, and El-Ghazawi (2003) decomposed the hyperspectral signal with the 1-D DWT and showed that the shape of the spectral signature is still recognizable with reduced dimensionality. Li (2004) tested various wavelet types to extract features for linear unmixing of hyperspectral signals. The results showed that the Haar wavelet performs better for such applications.
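A brief sketch of pixelwise 1-D DWT spectral features using the PyWavelets library follows; the wavelet family and decomposition level are illustrative choices.

```python
# Hedged sketch: 1-D DWT spectral features, per Eq. (14).
import numpy as np
import pywt

def dwt_spectral_features(cube: np.ndarray, wavelet: str = "db2", level: int = 3) -> np.ndarray:
    M, N, B = cube.shape
    X = cube.reshape(-1, B)
    coeffs = pywt.wavedec(X, wavelet=wavelet, level=level, axis=1)
    feats = np.concatenate(coeffs, axis=1)   # approximation + detail coefficients
    return feats.reshape(M, N, -1)
```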

2.4. Deep feature extraction


Some of the conventional feature extraction techniques have performed well in the spectral domain. However, their ability to deal with the nonlinear nature of hyperspectral data is limited. In recent times, deep learning architectures have achieved remarkable success in dealing with nonlinear input data. Deep learning techniques employ a hierarchical learning framework to extract high level features with very deep neural networks. Typically, neural networks deeper than three layers are considered deep networks. Deep models progressively learn abstract and complex features from lower ones at higher layers, which are typically invariant to local changes of the input data. Autoencoder (Chen et al. 2014) is one of
typically invariant to local changes of input data. Autoencoder (Chen et al. 2014) is one of
the basic approaches for learning deep features in hierarchical way. As shown in Figure 1,
autoencoder consists of an input layer, one hidden layer, and one output layer also known
as reconstruction layer. Both input and reconstruction layers consist of same number of
units N. The hidden layer is smaller than other two layers consisting of h units
ðh < N often h
NÞ. Between each pair of layers, an activation function is employed,
which is conventionally a non-linear function. During the operation, contents of input
layer x 2 RN are mapped to a latent representation v 2 Rh at hidden layer,

v ¼ gðWx þ βÞ (15)

Figure 1. Autoencoder.

where W is the input-to-hidden weight matrix, β is the bias vector of the hidden layer, and g(·) is the activation function. The latent representation v is then used to reconstruct an approximation $\hat{x} \in \mathbb{R}^N$ by the reverse mapping,

$$\hat{x} = g(\Theta v + \gamma) \quad (16)$$

where Θ is the hidden-to-output weight matrix and γ is the bias vector of the output layer. The purpose of training is to minimize the reconstruction error $J(x, \hat{x})$ between x and $\hat{x}$. $J(x, \hat{x})$ is usually the squared error cost, but it can also be computed in many other ways. The latent representation is a compressed form of the original input, as h ≪ N. If the reconstruction error is within a threshold, the latent representation can be used as a reduced set of features. To minimize the error, a number of autoencoders are stacked as shown in Figure 2. The hidden layer of one autoencoder is the input to the next one. This arrangement is known as a stacked autoencoder (SAE) that can progressively generate deep features. The multilayer network so formed determines its parameters by layerwise greedy learning (Lv, Han, and Qiu 2017). The parameters are fine-tuned with backpropagation. If label information is available at the topmost layer, it is supervised learning; otherwise, it becomes an unsupervised learning process. For
pixelwise spectral features, the pixel vector is fed to the SAE as shown in Figure 3. Chen et al. (2014) used an SAE with five layers to generate deep features for hyperspectral images and classified the images with logistic regression. Sun et al. (2017) obtained discriminative deep features using an SAE and designed a semi-supervised approach for training the encoders. The authors also suggested a mean pooling scheme for fusing spectral and spatial information. A different approach was proposed by Zhou et al. (2019) with some optimization criteria to learn discriminative features using an SAE. The authors applied a local Fisher discriminant regularization on the hidden layers of the SAE. It helps to improve within-class and between-class diversity.
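A minimal PyTorch sketch of the autoencoder of Equations (15)-(16) follows; the layer sizes, sigmoid activation, and optimizer settings are assumptions for illustration.

```python
# Hedged sketch: a single autoencoder trained on B-band pixel vectors.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_bands: int, n_hidden: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bands, n_hidden), nn.Sigmoid())  # v = g(Wx + beta)
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_bands), nn.Sigmoid())  # x_hat = g(Theta v + gamma)

    def forward(self, x):
        v = self.encoder(x)
        return self.decoder(v), v

# Training sketch: minimize the reconstruction error J(x, x_hat)
model = Autoencoder(n_bands=200, n_hidden=60)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(256, 200)                     # a batch of pixel vectors (dummy data)
for _ in range(10):
    x_hat, v = model(x)
    loss = nn.functional.mse_loss(x_hat, x)  # squared-error cost
    opt.zero_grad(); loss.backward(); opt.step()
# After training, the latent representation v serves as the reduced feature set.
```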
Autoencoders or SAEs rely on smaller hidden layers. However, larger hidden layers may also provide useful features. A different type of encoder, known as the sparse autoencoder (Tao et al. 2015), allows larger hidden layers, i.e. h > B. These encoders impose a sparsity condition on the hidden units, intending to keep most of the neurons inactive.

Figure 2. Stacked autoencoder.

Figure 3. Spectral feature extraction by SAE.

Given a training set $T = \{x_1, x_2, \ldots, x_Q\}$, the training of the sparse encoder aims to find optimal parameters by minimizing the cost function given by

$$J_{sparse} = \frac{1}{Q}\sum_{i=1}^{Q} J(x_i, \hat{x}_i) + \beta \sum_{j=1}^{s_l} KL(\rho \,\|\, \hat{\rho}_j) \quad (17)$$

where Q is the number of training pixels, $s_l$ is the number of units in the lth hidden layer, and $KL(\rho \,\|\, \hat{\rho})$ is the Kullback-Leibler (KL) divergence, defined as follows:

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1 - \rho)\log\frac{(1 - \rho)}{(1 - \hat{\rho}_j)} \quad (18)$$

where ρ is the sparsity parameter, close to 0, and $\hat{\rho}_j$ is the average activation of the jth hidden unit. Multiple sparse autoencoders are stacked to form the stacked sparse autoencoder
(SSAE). Kang et al. (2018) fused spectral and Gabor features and used an SSAE for deep feature learning. The denoising autoencoder (DAE) (Xing, Ma, and Yang 2016) is another type of encoder developed from the autoencoder. The DAE first modifies the original data x to x′ by setting some elements of x to zero or adding some Gaussian noise to x. The modified data x′ is used as input, and the DAE aims to reconstruct x at the output. Similar to the SAE and SSAE, a stacked DAE can be formed, which performs better for noisy data. Hao et al. (2018) employed a stacked DAE to encode pixelwise spectral values for hyperspectral images. The deep features so obtained were fused with other features to get good classification accuracy. Lan et al. (2019) applied the k-sparse method for sparsity and introduced a k-sparse denoising autoencoder. In addition, the k-sparsity based method also uses a dropout function at the hidden layer to prevent overfitting.
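The sparsity penalty of Equations (17)-(18) can be sketched as a small addition to the autoencoder training loop above; the values of ρ and β are illustrative.

```python
# Hedged sketch: KL sparsity penalty for a sparse autoencoder, per Eqs. (17)-(18).
import torch

def kl_sparsity(v: torch.Tensor, rho: float = 0.05) -> torch.Tensor:
    """v: hidden activations in (0, 1), shape (batch, h).
    Returns sum_j KL(rho || rho_hat_j)."""
    rho_hat = v.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # average activation of each hidden unit
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# Inside the training loop: loss = mse_loss(x_hat, x) + beta * kl_sparsity(v)
```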
Restricted Boltzmann Machine (RBM) (Chen, Zhao, and Jia 2015) is a generative stochastic neural network model that consists of two fully connected layers, a visible (input) layer and a hidden layer. Contrary to the conventional Boltzmann Machine, there are no connections among the units of the same layer in an RBM. The data is fed through the visible units and a latent representation is produced at the hidden layer. The representation is fed back and the input is reconstructed at the visible layer. Each unit has an activation probability and a state. RBMs are trained in unsupervised mode. The training of an RBM involves adjusting the weights and biases until the reconstruction error is acceptably small. The network
provides a probability for an input with the help of an energy function E(·). Both the activation probability and the state are used during training. The joint probability distribution of the units can be expressed as

$$p(v_r, h_r; \theta) = \frac{1}{Z(\theta)} \exp(-E(v_r, h_r, \theta)) \quad (19)$$

$$Z(\theta) = \sum_{v_r}\sum_{h_r} \exp(-E(v_r, h_r, \theta)) \quad (20)$$

where $v_r$ is the vector of visible units, $h_r$ is the vector of hidden units, $Z(\theta)$ is a normalizing constant, $\theta = \{W, \beta, \gamma\}$, W is the weight matrix between the visible and hidden layers, β is the vector of biases for the visible units, and γ is the vector of biases for the hidden units. The energy function is given by

$$E(v_r, h_r, \theta) = -(\beta^T v_r) - (\gamma^T h_r) - (v_r^T W h_r) \quad (21)$$

If the reconstruction error is within the threshold, the contents of the hidden units represent the desired features. A network with multiple RBM layers can be designed that is trained layerwise; the output of one trained layer is the input to the next layer in the network. This kind of learning system is known as a deep belief network (DBN). The DBN learns deep features hierarchically, reducing the error.
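A compact numpy sketch of one contrastive-divergence (CD-1) training step for the RBM of Equations (19)-(21) follows; CD-1 is a standard trainer we assume here, and all sizes and rates are illustrative.

```python
# Hedged sketch: one CD-1 update for a binary RBM; didactic, not tuned.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 200, 64, 0.01
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
beta = np.zeros(n_visible)   # visible biases
gamma = np.zeros(n_hidden)   # hidden biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0: np.ndarray) -> float:
    """v0: (batch, n_visible) binary data; performs one CD-1 parameter update."""
    global W, beta, gamma
    p_h0 = sigmoid(v0 @ W + gamma)                      # activation probabilities
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden states
    p_v1 = sigmoid(h0 @ W.T + beta)                     # reconstruction at visible layer
    p_h1 = sigmoid(p_v1 @ W + gamma)
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    beta += lr * (v0 - p_v1).mean(axis=0)
    gamma += lr * (p_h0 - p_h1).mean(axis=0)
    return float(np.mean((v0 - p_v1) ** 2))             # reconstruction error
```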

Convolutional neural network (CNN) is a special deep learning model inspired by the human vision system. It can be trained using a supervised or unsupervised learning approach. Unsupervised training can be performed with the help of greedy layerwise pretraining (Romero, Gatta, and Camps-Valls 2016). The supervised training process involves backpropagation. A CNN model typically consists of a group of convolutional layers, pooling layers, and fully connected layers, as shown in Figure 4. Local connections and shared weights are two important aspects of CNNs that provide better generalization. The
convolutional layer performs convolution of the input with kernels or filters. The convolution in a CNN can be defined as

$$v^l = g(K^l \star v^{l-1} + b^l) \quad (22)$$

where $v^l$ is the output feature map, $K^l$ is the filter, and $b^l$ is a bias parameter at the lth layer. The output feature map $v^{l-1}$ of layer (l − 1) becomes the input to layer l, '⋆' is a convolution operator, and g(·) is a non-linearity function, also known as the activation function. ReLU (Krizhevsky, Sutskever, and Hinton 2012) is one of the most popular activation functions due to its fast convergence. The output of a neuron at the ith position in the lth layer and mth feature map is

$$v_i^{l,m} = g\left(b^{l,m} + \sum_{p}\sum_{s=0}^{S_l - 1} k_s^{l,m,p}\, v_{i+s}^{(l-1),p}\right) \quad (23)$$

where p is the index of the feature map in the (l − 1)th layer, $b^{l,m}$ is the bias of the mth feature map in the lth layer, $S_l$ is the kernel size in the lth layer, and $k_s^{l,m,p}$ is the kernel value at position s in the pth feature
map. Usually, feature maps contain redundant information; therefore, convolution is often followed by a pooling operation. Pooling reduces the resolution of the feature maps, in turn reducing the number of parameters and the computation time. Pooling is performed over non-overlapping subregions of the feature map. The pooling operation can be defined as

$$v^l = f(v^{l-1} + b^l) \quad (24)$$

where f(·) is a function that performs the subsampling operation. The output of a pooling layer is smaller than its input. The purpose of pooling is to make the features robust and abstract.

Figure 4. Architecture of CNN: input image, alternating convolution and pooling layers, and fully connected (FC) layers producing the output.



Several convolutional and pooling layers can be stacked together to make a deep CNN. After several layers of convolution and pooling in one direction (1-D convolution), a set of spectral features is obtained. The last convolutional/pooling layer in the hierarchy is followed by fully connected layers. There can be one or more fully connected layers. The purpose of a fully connected layer is to produce one feature vector; it generates deeper and more abstract features. The operation of a fully connected layer is given as

$$v^{l+1} = h(W^l v^l + b^l) \quad (25)$$

where $W^l$ is the weight matrix and h(·) is the activation function of the fully connected layer.
To obtain spectral features, the original pixel vector is fed to the first convolution layer
and features are obtained through the last layer in the architecture. As shown in Figure 5,
the pixel vector is used as input to CNN for spectral feature extraction. Chen et al. (2016)
demonstrated the utility of CNN for extracting spectral features.
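A sketch of such a 1-D CNN spectral feature extractor in PyTorch is given below; the kernel sizes, channel counts, and output dimension are assumptions.

```python
# Hedged sketch: 1-D CNN spectral features, in the spirit of Eqs. (22)-(25).
import torch
import torch.nn as nn

class SpectralCNN(nn.Module):
    def __init__(self, n_bands: int, n_features: int = 100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 20, kernel_size=11), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(20, 20, kernel_size=11), nn.ReLU(), nn.MaxPool1d(2),
        )
        with torch.no_grad():                       # infer flattened size via a dummy pass
            n_flat = self.conv(torch.zeros(1, 1, n_bands)).numel()
        self.fc = nn.Linear(n_flat, n_features)     # produces one deep feature vector

    def forward(self, x):                           # x: (batch, n_bands) pixel vectors
        z = self.conv(x.unsqueeze(1))
        return self.fc(z.flatten(1))

features = SpectralCNN(n_bands=200)(torch.rand(8, 200))   # (8, 100) spectral features
```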
Various classification frameworks based on different deep learning approaches have been developed in recent years, exhibiting impressive performance. However, finding the appropriate size and number of hidden units for specific problems is a major issue with deep learning approaches. The major spectral feature extraction techniques are summarized in Table 1.

Figure 5. Spectral feature extraction by CNN.


Table 1. Summary of the spectral feature extraction techniques.

- Projection pursuit (unsupervised). Characteristics: optimizes the projection index. Strengths: preserves maximum information in a smaller feature set. Limitations: vulnerable to finding local maxima and may not fully exploit high dimensional data; the choice of projection index is crucial.
- PCA (unsupervised). Characteristics: minimizes the correlation. Strengths: preserves most of the information in the first 2–3 principal components. Limitations: makes an assumption on large variance.
- MNF (unsupervised). Characteristics: maximizes the signal-to-noise ratio. Strengths: better represents the image quality. Limitations: not efficient for subtle material substances.
- ICA (unsupervised). Characteristics: maximizes the contrast information. Strengths: good for extracting features of subtle objects also. Limitations: some preprocessing is required before application of the actual algorithm.
- DWT (unsupervised). Characteristics: separates fine-scale and large-scale details. Strengths: preserves the shape of the spectral signature. Limitations: higher computation time.
- LFDA (supervised, parametric). Characteristics: optimization criteria based on between-class and within-class scatter matrices. Strengths: provides good class separability. Limitations: assumes a normal-like distribution; not good for classes with similar means; at most K − 1 features for K-class data; singularity problem.
- NDA (supervised, non-parametric). Characteristics: defines a new non-parametric between-class scatter matrix. Strengths: non-parametric forms of scatter matrices. Limitations: despite the non-parametric scatter matrices, some parameters are still involved that are empirically decided; singularity problem.
- NWFE (supervised, non-parametric). Characteristics: defines new between-class and within-class scatter matrices with weighted means. Strengths: performs well for data with unequal covariance; no significant issues related to parameter estimation. Limitations: high computation times; singularity problem.
- CNFE (supervised, non-parametric). Characteristics: uses cosine distance based weighted scatter matrices. Strengths: better handles the singularity problem. Limitations: high computation times.
- DBFE (supervised, non-parametric). Characteristics: uses decision boundaries to derive feature vectors. Strengths: efficiently treats the outliers; directly focuses on classification accuracy. Limitations: requires a good number of quality training samples; high computation times.
- Deep features (supervised/unsupervised). Characteristics: inspired by biological systems. Strengths: good for non-linear data. Limitations: no defined criteria for choosing the appropriate number and size of hidden layers; needs high computing resources.

3. Spatial feature extraction

Spectral information alone is not good enough to achieve the desired classification accuracy. Therefore, spatial information is often integrated with spectral information for better results. To extract pixel-wise spatial features, the other pixels within the neighbourhood of a pixel are also considered. The pixel neighbourhood can be defined with the help of a pixel-centric window or kernel. The kernels may have different shapes, such as square or disk, of different sizes. Various techniques have been developed over the years for extracting spatial features.

3.1. Morphological operators


Mathematical morphology is one of the most successful techniques widely used for extracting shape based spatial features. The image is processed with the help of a structuring element, whose size and shape are carefully selected (Kumar and Dikshit 2017). In mathematical morphology, two fundamental operators are defined: erosion and dilation (Fauvel et al. 2008). The erosion operator finds where the structuring element fits in the image. On the other hand, dilation identifies the objects that the structuring element hits in the image. For a given pixel $x_k$ of image I, erosion is performed with the structuring element S centred at the given pixel as follows

$$\varepsilon(x_k) = \min_{(i,j) \in S} I(i,j) \quad (26)$$

The dual operator, dilation, is defined as

$$\delta(x_k) = \max_{(i,j) \in S} I(i,j) \quad (27)$$

Other morphological operators, such as opening and closing, are combinations of these two operators. The morphological opening of a pixel $x_k$ is carried out by erosion followed by dilation as follows

$$\phi(x_k) = \delta(\varepsilon(x_k)) \quad (28)$$

For the closing operation, dilation is followed by erosion, as given here

$$\psi(x_k) = \varepsilon(\delta(x_k)) \quad (29)$$

With the application of opening and closing, elements smaller than the structuring element are deleted while bigger ones are retained. This approach helps to determine the shape and size of the objects in the image. However, sometimes these operators may introduce false objects in the image. This problem can be avoided with the help of reconstruction. Opening and closing by reconstruction ensure that an object smaller than the structuring element is completely removed, whereas bigger objects remain completely intact. In order to identify different types of objects, a series of structuring elements of different sizes is used, which leads to the concept of the morphological profile. The opening profile $\Phi_n(x_k)$ at pixel $x_k$ using n structuring elements is an n-dimensional vector defined as

$$\Phi_n(x_k) = \{\phi^{(n)}(x_k), \phi^{(n-1)}(x_k), \ldots, \phi^{(1)}(x_k)\} \quad (30)$$

where $\phi^{(j)}(\cdot)$ is opening by reconstruction with the jth structuring element and $1 \leq j \leq n$. Similarly, the closing profile $\Psi_n(x_k)$ at pixel $x_k$ is also an n-dimensional vector given by

$$\Psi_n(x_k) = \{\psi^{(1)}(x_k), \psi^{(2)}(x_k), \ldots, \psi^{(n)}(x_k)\} \quad (31)$$

where $\psi^{(j)}(\cdot)$ is closing by reconstruction with the structuring element of the jth size. The morphological profile $MP(x_k)$ at pixel $x_k$ is defined by collating $\Phi_n(x_k)$ and $\Psi_n(x_k)$ as follows

$$MP(x_k) = \{\Phi_n(x_k), \Psi_n(x_k)\} = \{\phi^{(n)}(x_k), \ldots, \phi^{(2)}(x_k), x_k, \psi^{(2)}(x_k), \ldots, \psi^{(n)}(x_k)\} \quad (32)$$

$MP(\cdot)$ is a (2n − 1)-dimensional vector, as $\phi^{(1)}(x_k) = \psi^{(1)}(x_k) = x_k$. The morphological profile given by the above equation is defined for a one-band image. For multi-band images, the morphological profiles are created for each band separately and concatenated to form the extended morphological profile. The extended morphological profile $EMP(x_k)$ for a pixel $x_k$ in an m-band image is thus an m(2n − 1)-dimensional vector formed as follows

$$EMP(x_k) = \{MP_1(x_k), MP_2(x_k), \ldots, MP_m(x_k)\} \quad (33)$$

where $MP_j(\cdot)$ is the morphological profile built on the jth band image and $1 \leq j \leq m$.
Morphological profiles are good for representing multi-scale variability but are not sufficient for modelling other geometrical properties of the structures (Mura et al. 2011). To overcome these limitations, attribute filters are used instead of reconstruction based morphological operators. Attribute filters operate on spatially connected pixels using criteria based on different attributes such as area, standard deviation, and volume. Analogous to extended morphological profiles, extended attribute profiles can be created, which model geometrical properties of the objects.
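A hedged scikit-image sketch of building an extended morphological profile per Equations (26)-(33) follows; the disk-shaped structuring elements and their radii are assumptions, and the base images would typically be the first few principal components.

```python
# Hedged sketch: extended morphological profile (EMP) with reconstruction.
import numpy as np
from skimage.morphology import disk, erosion, dilation, reconstruction

def open_by_reconstruction(img, se):
    return reconstruction(erosion(img, se), img, method="dilation")

def close_by_reconstruction(img, se):
    return reconstruction(dilation(img, se), img, method="erosion")

def emp(bands: np.ndarray, radii=(2, 4, 6)) -> np.ndarray:
    """bands: (M, N, m) base images (e.g. first PCs); returns the EMP stack."""
    profiles = []
    for k in range(bands.shape[2]):
        img = bands[:, :, k]
        ops = [open_by_reconstruction(img, disk(r)) for r in reversed(radii)]
        cls = [close_by_reconstruction(img, disk(r)) for r in radii]
        profiles.extend(ops + [img] + cls)   # MP(x_k) per Eq. (32), per band
    return np.stack(profiles, axis=2)
```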

3.2. Texture features


Texture is a fundamental property that provides important spatial information for image
processing. Some well known texture modelling techniques have been successfully
extended to the hyperspectral image domain.

3.2.1. Gray-level co-occurrence matrix


Gray-level co-occurrence matrix (GLCM) (Haralick, Shanmugam, and Dinstein 1973) is one of the most widely used techniques for computing second order texture measures. The co-occurrence matrix provides information about the relative positions of a pixel and its neighbouring pixels in the image. The element $G_{i,j}$ of the co-occurrence matrix records the co-occurrence of gray levels (i, j) within a moving window. For an $N \times M$ image H, $G_{i,j}$ is determined as

$$G_{i,j} = \sum_{x=1}^{N}\sum_{y=1}^{M} \begin{cases} 1, & \text{if } H(x,y) = i \text{ and } H(x + \Delta_x, y + \Delta_y) = j \\ 0, & \text{otherwise} \end{cases} \quad (34)$$

where the offset $(\Delta_x, \Delta_y)$ is the distance between pixel (x, y) and its neighbour. The co-occurrence matrix G is sensitive to rotation: for a rotated image, a different G may be obtained depending on the angle of rotation. In order to alleviate the variance to rotation, G is calculated for different offsets at different angles and averaged.
statistical features can be extracted using the GLCM. Among these, major texture features
are: contrast, correlation, variance, entropy, angular second moment, inverse difference
moment, sum average, sum variance, sum entropy, difference variance, difference
entropy, information measures of correlation, and maximal correlation coefficient.
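A short scikit-image sketch of GLCM texture measures per Equation (34) follows; the quantization level, offsets, and the chosen subset of properties are assumptions.

```python
# Hedged sketch: GLCM texture properties for one 2-D band or component.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(band: np.ndarray, levels: int = 32) -> dict:
    """band: a 2-D gray-scale image (e.g. one principal component)."""
    edges = np.linspace(band.min(), band.max(), levels)
    q = np.clip(np.digitize(band, edges) - 1, 0, levels - 1).astype(np.uint8)
    # Compute G for four angles and average to reduce sensitivity to rotation
    G = graycomatrix(q, distances=[1],
                     angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                     levels=levels, symmetric=True, normed=True)
    return {p: graycoprops(G, p).mean()
            for p in ("contrast", "correlation", "energy", "homogeneity")}
```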

3.2.2. Gabor filters


Gabor filters (Shi and Healey 2003) are a well known technique for texture analysis in different applications. A filter bank is used to extract texture information at different scales and orientations. Filters are applied both on individual bands and on combinations of bands. In the spatial domain, a Gabor filter can be defined as

$$g(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) \cos\left(2\pi F (x\cos\theta + y\sin\theta)\right) \quad (35)$$

where σ is the normalization scale, (x, y) are the spatial variables, θ is the orientation, and F is the
central frequency of the scale. By using different scales and orientations, a set of filters or filter bank is prepared. Gabor filters provide two types of features, namely unichrome and opponent. Unichrome features are computed from a single band, while opponent Gabor features combine spatial information across multiple bands. Shi and Healey (2003) combined both unichrome and opponent Gabor features to extract texture information from hyperspectral images. The combined set contains a large number of features; therefore, the authors first used spectral binning and PCA for dimensionality reduction before extracting Gabor features. The unichrome features were computed band by band, and a subset of features was optimally selected for the purpose of classification. A simplified process was proposed by Rajadell, García-Sevilla, and Pla (2013) to reduce the computational effort of feature extraction. However, they also used a reduced number of bands and applied the filter bank over the whole image.
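A small sketch of a 2-D Gabor filter bank producing unichrome-style features with scikit-image follows; the frequencies and orientations are illustrative.

```python
# Hedged sketch: 2-D Gabor filter bank responses per band, per Eq. (35).
import numpy as np
from skimage.filters import gabor

def gabor_bank(band: np.ndarray, frequencies=(0.1, 0.2, 0.4), n_orientations: int = 4):
    """band: 2-D image; returns stacked magnitude responses (unichrome features)."""
    responses = []
    for f in frequencies:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            real, imag = gabor(band, frequency=f, theta=theta)
            responses.append(np.hypot(real, imag))   # magnitude of the complex response
    return np.stack(responses, axis=2)
```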

3.2.3. Two-dimensional DWT


As discussed in Section 2.3, the 1-D DWT is used for spectral feature extraction. The 1-D DWT can be extended to a 2-D DWT by applying it along the spatial directions, in which case it extracts texture information in terms of scale and wavelet coefficients. First, the 1-D DWT is applied along the rows of the image, generating two subband images L and H. Then it is applied along the columns of L and H, which generates four subband images LL, LH, HL, and HH. This process can be executed iteratively up to some maximum scale, band by band, on the hyperspectral
image. Gormus, Canagarajah, and Achim (2012) empirically decomposed the hyper-
spectral signal with 2-D DWT to get spatial information and fused it with spectral
information to form a feature set. The empirical decomposition was performed for
each band. The joint feature set was fed into SVM to classify the image. In another
study (Quesada-Barriuso, Arguello, and Heras 2014), 2-D DWT was used along spatial
dimensions on each spectral band of the image. The wavelet features were stacked
together with other features and classified with SVM. Kumar and Dikshit (2015a) also
used 2-D DWT to capture texture features from hyperspectral image and stacked them
with spectral features for classification.
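A brief PyWavelets sketch of the band-wise 2-D decomposition described above follows; the choice of the Haar wavelet mirrors the spectral-domain findings reported earlier but is still an assumption here.

```python
# Hedged sketch: one level of 2-D DWT texture decomposition for one band.
import pywt

def dwt2_subbands(band, wavelet: str = "haar"):
    """band: a 2-D image (one spectral band); returns the four subband images."""
    LL, (LH, HL, HH) = pywt.dwt2(band, wavelet)   # rows first, then columns
    return LL, LH, HL, HH

# Deeper scales, as described above: coeffs = pywt.wavedec2(band, "haar", level=3)
```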

3.2.4. Local binary pattern


Local binary pattern (LBP) (Ojala, Pietikäinen, and Mäenpää 2002) is a gray-scale invariant texture descriptor that measures binary patterns in a circular neighbourhood. The LBP code of a pixel is determined by thresholding its neighbours as follows

$$LBP_{n,r} = \sum_{i=0}^{n-1} s(g_i - g_c)\, 2^i, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \quad (36)$$

where $g_c$ and $g_i$ are the gray level intensity values of the centre pixel and its ith neighbour, and n is the number of neighbours represented on a circle of radius r. The $LBP_{n,r}$ operator is gray-scale invariant but rotation variant. A few 3 × 3 patterns (with n = 8, r = 1) are present in most of the observed textures. These patterns are called uniform patterns. A uniformity measure U was introduced by Ojala, Pietikäinen, and Mäenpää (2002) to formally define uniform patterns. The value of U indicates the number of spatial transitions in the pattern. Only those patterns are considered uniform which have a U value of at most 2. An operator $LBP_{n,r}^{riu2}$ for gray-scale and rotation invariant uniform patterns is described as

$$LBP_{n,r}^{riu2} = \begin{cases} \sum_{i=0}^{n-1} s(g_i - g_c), & \text{if } U(LBP_{n,r}) \leq 2 \\ n + 1, & \text{otherwise} \end{cases} \quad (37)$$

where

$$U(LBP_{n,r}) = |s(g_{n-1} - g_c) - s(g_0 - g_c)| + \sum_{i=1}^{n-1} |s(g_i - g_c) - s(g_{i-1} - g_c)| \quad (38)$$

The superscript 'riu2' indicates a rotation invariant uniform pattern with $U(LBP_{n,r}) \leq 2$. For hyperspectral images, the texture patterns can be computed for each spectral band.
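A short sketch of the rotation invariant uniform operator using scikit-image follows; histogramming the codes over the image (or a window) as the descriptor is common practice and an assumption here.

```python
# Hedged sketch: rotation invariant uniform LBP per Eqs. (36)-(38).
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_features(band: np.ndarray, n: int = 8, r: int = 1) -> np.ndarray:
    """band: 2-D integer gray-scale image; returns a normalized code histogram."""
    codes = local_binary_pattern(band, P=n, R=r, method="uniform")  # riu2-style codes
    hist, _ = np.histogram(codes, bins=np.arange(n + 3), density=True)
    return hist
```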

3.2.5. Mathematical moments


Mathematical moments are widely used as image descriptors in many image processing tasks such as shape analysis, texture recognition, image segmentation, scene matching, and computer vision. Many kinds of moments and their functions are used across applications; geometric, Zernike, Legendre, and Gaussian-Hermite moments are well known. Hu (1962) first introduced the geometric moment and its scale, rotation, and translation invariants. Geometric moments are simple and carry explicit geometric information. Zernike and Legendre moments are orthogonal moments that are able to represent an image with minimal information redundancy. The geometric moment and its invariants are widely used for image analysis as they are easy to implement and consist of geometric information. The (p + q)th order geometric moments of a function f(x, y) are defined as

$$m_{p,q} = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y)\, x^p y^q\, dx\, dy \quad (39)$$

where $p, q = 0, 1, \ldots, \infty$. For a 2-D image H of size $M \times N$, its discrete version is used, as given below

$$m_{p,q} = \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} i^p j^q H(i,j) \quad (40)$$

Kumar and Dikshit (2015a) implemented geometric moments for hyperspectral imagery.
The authors applied PCA to the input hyperspectral image and computed moments from
the first principal component. Mirzapour and Ghassemian (2016) compared geometric,
Zernike, and Legendre moments for hyperspectral images. They used several principal
components for moment computation. The authors showed that geometric moments are better for images of agricultural areas, while Zernike and Legendre moments are better for urban datasets.
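A minimal numpy sketch of the discrete geometric moments of Equation (40) follows; computing them over a whole band rather than local windows is an illustrative simplification.

```python
# Hedged sketch: discrete geometric moments m_pq of a 2-D image, per Eq. (40).
import numpy as np

def geometric_moment(H: np.ndarray, p: int, q: int) -> float:
    M, N = H.shape
    i = np.arange(M, dtype=np.float64)[:, None] ** p
    j = np.arange(N, dtype=np.float64)[None, :] ** q
    return float((i * j * H).sum())     # m_pq = sum_i sum_j i^p j^q H(i, j)

# Example: low-order moments as a small feature vector
# feats = [geometric_moment(img, p, q) for p in range(3) for q in range(3)]
```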

3.3. Deep learning for spatial features


Deep learning techniques that are used for spectral feature extraction can also be
used for extracting spatial information with a different arrangement/approach. Like
other techniques discussed in this section, these methods do not represent a pixel by
a spectral vector. Instead, an image patch is used to represent a pixel with the help
of a window or kernel. Figure 6 shows the strategy for extracting spatial features
with the help of autoencoders. Usually, the image is reduced in dimensionality before processing. As shown in the figure, an image patch representing the pixel neighbourhood is fed to the autoencoder/SAE, which provides the feature for a particular pixel.
The process is repeated for all the pixels in the image. Chen et al. (2014) reduced the image bands with the help of PCA; image patches were then taken from the reduced dataset as input to an SAE to get spatial features. A similar approach was followed by Lan et al. (2019), where image patches were formed after reducing the spectral dimension of the original image, to be used as input to an SAE for generating spatial features. Tao et al. (2015) used multiple kernels of different sizes to extract spatial information from hyperspectral images using an SSAE. Features obtained from different kernels were stacked to form a feature vector.
As discussed in Section 2.4, a 1-D CNN deals with spectral information only. But it can be extended to a 2-D CNN, which is able to extract spatial features in a hierarchical manner by using 2-D kernels. Equation (23) can be extended (Chen et al. 2016) so that the value of a neuron becomes

Figure 6. Spatial dominated feature extraction by SAE.

$$v_{i,j}^{l,m} = g\left(b^{l,m} + \sum_{p}\sum_{s=0}^{S_l - 1}\sum_{t=0}^{T_l - 1} w_{s,t}^{l,m,p}\, v_{(i+s),(j+t)}^{(l-1),p}\right) \quad (41)$$

where $S_l \times T_l$ is the size of the kernel at the lth layer and $w_{s,t}^{l,m,p}$ is the weight value at position (s, t) in the pth feature map. The 2-D CNN is usually not used with the raw hyperspectral image.
Instead, the original image is reduced, as shown in Figure 7. This is the major drawback of 2-D CNN based schemes, as the original spatial correlation is lost in the process. Chen et al. (2016) applied PCA to the original image and took only the first principal component, which was used to extract layerwise deep spatial features using a CNN with 4 × 4 and 5 × 5 kernels at different layers. Yang, Zhao, and Chan (2017) averaged all the spectral bands to generate a single band image. The spatial features were produced from the single band by using a CNN with a comparatively large kernel of size 21 × 21. Another PCA based approach was proposed by Song et al. (2018), where a few of the most informative principal components were taken as input to the CNN. The deep spatial features were learned with the help of 23 × 23, 25 × 25, and 27 × 27 kernels for different images. The major spatial feature extraction techniques are summarized in Table 2.
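A PyTorch sketch of patch-based 2-D spatial feature extraction in the spirit of Equation (41) follows; the patch size, channel counts, and feature dimension are assumptions.

```python
# Hedged sketch: 2-D CNN spatial features from pixel-centred patches of a
# band-reduced image (e.g. the first principal component).
import torch
import torch.nn as nn

class SpatialCNN(nn.Module):
    def __init__(self, in_channels: int = 1, n_features: int = 64, patch: int = 21):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        with torch.no_grad():                       # infer flattened size via a dummy pass
            n_flat = self.conv(torch.zeros(1, in_channels, patch, patch)).numel()
        self.fc = nn.Linear(n_flat, n_features)

    def forward(self, patches):                     # (batch, C, patch, patch)
        return self.fc(self.conv(patches).flatten(1))

feats = SpatialCNN()(torch.rand(8, 1, 21, 21))      # (8, 64) spatial features
```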

4. Spectral–spatial features
When separate steps are used for the extraction of spectral and spatial features, the joint spectral–spatial correlation is ignored. Spectral–spatial feature extraction techniques extract spectral and spatial features simultaneously. These techniques work on all three dimensions of the raw hyperspectral image cube without changing the original shape; thus, the original spectral–spatial correlation is retained. Mostly, these techniques are extensions of conventional 2-D techniques into 3-D space.

Figure 7. Spatial feature extraction by CNN.

Table 2. Summary of the spatial feature extraction techniques.

- Morphological profiles (unsupervised). Characteristics: depend on scale/shape based information. Strengths: ability to exploit geometrical properties of the objects. Limitations: makes assumptions about the shape and size of the structuring element.
- GLCM (unsupervised). Characteristics: utilizes relative positions of the neighbouring pixels. Strengths: effective determination of spatial variability. Limitations: non-adaptive pixel neighbourhood; needs modelling of a large number of correlations.
- Gabor filters (unsupervised). Characteristics: based on the human visual system. Strengths: ability to exploit scale and orientation of physical structures in the image. Limitations: a large number of features are generated, with possible redundancy.
- 2-D DWT (unsupervised). Characteristics: decomposes the signal at different scales, frequencies, and orientations. Strengths: ability to capture statistical and geometrical structures. Limitations: high computation time.
- LBP (unsupervised). Characteristics: based on gray-level differences between a pixel and its neighbours. Strengths: computational simplicity and robustness to scale and rotation variance. Limitations: non-adaptive pixel neighbourhood.
- Moments (unsupervised). Characteristics: based on the theory of algebraic invariants. Strengths: ability to represent global features and to generate a smaller feature set; invariance to scale, rotation, and translation. Limitations: non-adaptive pixel neighbourhood.
- Deep learning (unsupervised/supervised). Characteristics: inspired by biological systems. Strengths: good for non-linear data. Limitations: no defined criteria for choosing the appropriate number and size of hidden layers; needs high end computing resources.

4.1. Tensor-based approaches


In recent years, the tensor-based methods have been successfully used in may applica-
tions including image processing, remote sensing, target detection, and video processing,
etc. The tensor representation of a hyperspectral image preserve the spatial dimensions as
well as band continuity. Therefore, tensor representation can be used for modelling
spectral–spatial features jointly. The most of the tensor representation based techniques
are the extension of conventional vector based approaches.
PCA is one of the most widely used techniques and needs the data in vector form. Therefore, the image is vectorized in order to apply PCA on hyperspectral imagery, ignoring the spatial correlation of the pixels. A variant of PCA-based tensor decomposition known as Tensor PCA (Velasco-Forero and Angulo 2013; Ren et al. 2017) has been developed to overcome this limitation. The conventional PCA can be considered a particular case of Tensor PCA. It considers the hyperspectral image as a third-order tensor. A typical hyperspectral image is converted to tensor representation before applying Tensor PCA. Each pixel $i$ in the image is represented as a tensor vector $x_{tv}^{i}$. As given in Equation (3), the tensor covariance matrix can be obtained as follows:

$$\Sigma_{tv} = \frac{1}{n} \sum_{i=1}^{n} \left(x_{tv}^{i} - \bar{x}_{tv}\right)\left(x_{tv}^{i} - \bar{x}_{tv}\right)^{T} \tag{42}$$

where $\bar{x}_{tv} = \frac{1}{n} \sum_{i=1}^{n} x_{tv}^{i}$ and $n$ is the number of image elements. In an analogy to PCA, the transformation tensor matrix can be obtained by solving the following equation:

$$\Sigma_{tv} V_{tv} = \Lambda_{tv} V_{tv} \tag{43}$$

where $V_{tv}$ is an orthonormal tensor matrix and $\Lambda_{tv}$ is a diagonal tensor matrix. A fast version of Tensor PCA can be implemented in the Fourier domain.
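To make the covariance–eigendecomposition step concrete, the following minimal NumPy sketch implements the conventional-PCA special case of Equations (42)–(43) on a hyperspectral cube. The function name `pca_covariance_transform`, the toy cube, and the choice of $k$ are illustrative assumptions; the full Tensor PCA of Velasco-Forero and Angulo (2013) additionally keeps the tensor structure rather than vectorizing the pixels.

```python
import numpy as np

def pca_covariance_transform(cube, k):
    """Sketch of the PCA special case of Equations (42)-(43): covariance
    eigendecomposition of the pixel vectors of an H x W x B cube.
    (Tensor PCA keeps the cube structure; this vectorized variant is the
    conventional limiting case described in the text.)"""
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)    # n pixels, B bands
    x_bar = x.mean(axis=0)                        # mean spectrum
    xc = x - x_bar
    sigma = xc.T @ xc / x.shape[0]                # Equation (42), B x B
    eigval, eigvec = np.linalg.eigh(sigma)        # Equation (43)
    order = np.argsort(eigval)[::-1]              # sort by descending variance
    v = eigvec[:, order[:k]]                      # top-k eigenvectors
    return (xc @ v).reshape(h, w, k)              # reduced feature cube

features = pca_covariance_transform(np.random.rand(64, 64, 103), k=6)
print(features.shape)                             # (64, 64, 6)
```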
Zhang et al. (2013) developed a tensor representation based supervised feature extraction technique called tensor discriminative locality alignment (TDLA). The tensor representation preserves the original spectral and spatial constraints of the pixels and their neighbours. TDLA finds a multilinear transformation to reduce the original high-order feature space to a lower-order feature space. The output feature space preserves the discriminability of the classes. It employs an optimization algorithm that reduces the distance between pixels of the same class while increasing the distance between pixels of different classes.
Another tensor-based feature extraction method called local tensor discriminant analysis (LTDA) was proposed by Nie et al. (2009) to overcome the limitations of conventional LDA, which assumes a Gaussian-like class distribution. LTDA was used by Zhong et al. (2015) for reducing the redundancy of the spectral–spatial feature set of hyperspectral images. The authors demonstrated that features are better represented in tensor format.

4.2. Three dimensional Gabor filters


Gabor filters are a well-known technique for texture analysis in different applications. The 3-D Gabor filters (Bau, Sarkar, and Healey 2010) can capture specific scale-, orientation-, and wavelength-dependent properties. A 3-D Gabor filter in the frequency domain with centre frequency $(F_x, F_y, F_\lambda)$ is defined as

$$g(x, y, \lambda) = S \cdot \exp\left(-\left[\left(\frac{x'}{\sigma_x}\right)^{2} + \left(\frac{y'}{\sigma_y}\right)^{2} + \left(\frac{\lambda'}{\sigma_\lambda}\right)^{2}\right]\right) \cdot \exp\left(j 2\pi \left(x F_x + y F_y + \lambda F_\lambda\right)\right) \tag{44}$$

where $S$ is a normalization scale, $(x, y)$ and $\lambda$ are the spatial and wavelength variables respectively, $(\sigma_x, \sigma_y, \sigma_\lambda)$ defines the width of the Gaussian envelope along the three axes, $[x', y', \lambda']^{T} = R \cdot [x, y, \lambda]^{T}$, and $R$ is a rotation matrix for the transformation. The amplitude of the centre frequency is given by

$$F = \sqrt{F_x^{2} + F_y^{2} + F_\lambda^{2}} \tag{45}$$

The components of the centre frequency can be represented as

$$F_x = F \sin\alpha \cos\theta, \quad F_y = F \sin\alpha \sin\theta, \quad F_\lambda = F \cos\alpha \tag{46}$$

where $0 \le \alpha \le \pi$ and $0 \le \theta \le \pi$ represent the orientations of the wave vector. Different filters can be constructed by varying $F$, $\alpha$, and $\theta$. Gabor responses at different frequencies and orientations contain information about signal variances in the joint spectral–spatial domain.
For hyperspectral images, Gabor filters generate a large number of features. The features involve some redundancy and, therefore, filtering is usually followed by feature selection to identify the best subset. Bau, Sarkar, and Healey (2010) used a set of 3-D Gabor filters for modelling spectral–spatial information. They developed a selection procedure to select the most significant features from the generated set. Shen and Jia (2011) designed a set of complex Gabor filters based on different orientations and frequencies. A selection and fusion method was developed to reduce the redundant content and generate an optimized feature set. Considering the huge number of features generated by Gabor filters, Shen et al. (2013) proposed a symmetrical uncertainty and approximate Markov blanket based supervised selection method for Gabor features.
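As an illustration of Equations (44)–(46), the sketch below builds one 3-D Gabor filter in NumPy and applies it to a toy cube with SciPy. For brevity the Gaussian envelope is kept axis-aligned ($R$ = identity) and isotropic, and the filter size, frequency, and orientation values are arbitrary choices rather than settings from the cited works.

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_3d(size, f, alpha, theta, sigma=2.0):
    """Sketch of the 3-D Gabor filter of Equations (44)-(46); the envelope
    is axis-aligned and isotropic (sigma_x = sigma_y = sigma_lambda)."""
    fx = f * np.sin(alpha) * np.cos(theta)        # Equation (46)
    fy = f * np.sin(alpha) * np.sin(theta)
    fl = f * np.cos(alpha)
    r = np.arange(size) - size // 2
    x, y, lam = np.meshgrid(r, r, r, indexing='ij')
    envelope = np.exp(-((x / sigma) ** 2 + (y / sigma) ** 2 + (lam / sigma) ** 2))
    carrier = np.exp(2j * np.pi * (x * fx + y * fy + lam * fl))
    return envelope * carrier                     # Equation (44) with S = 1

# Magnitude response of one filter on a toy cube (rows, cols, bands)
cube = np.random.rand(32, 32, 16)
g = gabor_3d(size=7, f=0.25, alpha=np.pi / 4, theta=0.0)
response = np.abs(convolve(cube, g.real) + 1j * convolve(cube, g.imag))
print(response.shape)                             # (32, 32, 16)
```

Varying `f`, `alpha`, and `theta` over a grid yields the filter bank whose responses form the joint spectral–spatial feature set described above.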

4.3. Three dimensional DWT


The 2-D DWT can be extended to 3-D DWT by applying the wavelet decomposition along the spatial as well as the wavelength dimensions of the image, resulting in eight subband images LLL, LLH, LHL, LHH, HLL, HLH, HHL, and HHH. It exploits the correlation along the spatial and wavelength axes. The 3-D DWT produces features that incorporate both spectral and spatial properties. The wavelet-based approach may be time consuming and sometimes produces a large number of features. To overcome these limitations, dimensionality reduction can be performed before extracting texture information. Qian, Ye, and Zhou (2012) decomposed the hyperspectral image at different scales, orientations, and frequencies with the help of 3-D wavelets. They also developed a sparse logistic regression model for simultaneously selecting the most discriminative features and performing classification. Ye et al. (2014) extracted 3-D DWT coefficients for acquiring spectral–spatial information for the purpose of classification. The wavelet-coefficient correlation matrix was used to divide the high-dimensional space into subspaces. The statistical dependence between subspaces is reduced and class separation is increased. The concept of subspaces helps to reduce the required number of training samples. The method performed better for images containing noise.

4.4. Three dimensional GLCM


Tsai and Lai (2013) extended the GLCM and some statistical measures to a 3-D tensor field for hyperspectral images. The 3-D GLCM is given as

$$G(i, j, k) = \sum_{x=1}^{N} \sum_{y=1}^{M} \sum_{z=1}^{B} \begin{cases} 1, & \text{if } H(x, y, z) = i \,\wedge\, H(x + \Delta x_1, y + \Delta y_1, z + \Delta z_1) = j \,\wedge\, H(x + \Delta x_2, y + \Delta y_2, z + \Delta z_2) = k \\ 0, & \text{otherwise} \end{cases} \tag{47}$$

where $(\Delta x_1, \Delta y_1, \Delta z_1)$ and $(\Delta x_2, \Delta y_2, \Delta z_2)$ are offsets. The elements of the GLCM are converted to probability form by dividing each element by the sum of $G$ to produce a normalized co-occurrence matrix $Q$:

$$Q(i, j, k) = \frac{G(i, j, k)}{\sum_i \sum_j \sum_k G(i, j, k)} \tag{48}$$

The statistical features can be determined from $Q$. Some of the most useful features, such as contrast ($C$), angular second moment ($A$), and entropy ($E$), are computed as follows:

$$C = \sum_i \sum_j \sum_k Q(i, j, k)\left[(i - j)^{2} + (j - k)^{2} + (i - k)^{2}\right] \tag{49}$$

$$A = \sum_i \sum_j \sum_k Q^{2}(i, j, k) \tag{50}$$

$$E = \sum_i \sum_j \sum_k Q^{2}(i, j, k)\left[-\ln Q(i, j, k)\right] \tag{51}$$

The appropriate kernel size is important for texture analysis. A semi-variance and spectral separability based method was developed by Tsai and Lai (2013) for choosing an appropriate kernel size for volumetric images. All such features are kept collectively to form a texture feature set.
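A brute-force sketch of Equations (47)–(51) is given below; the gray-level quantization, the two offsets, and the toy cube are illustrative assumptions, and an efficient implementation would vectorize the counting loop.

```python
import numpy as np

def glcm_3d(cube, levels, off1, off2):
    """Sketch of the 3-D GLCM of Equation (47) for a quantized cube H,
    counting voxel triples related by displacement vectors off1, off2."""
    n, m, b = cube.shape
    g = np.zeros((levels, levels, levels))
    for x in range(n):
        for y in range(m):
            for z in range(b):
                x1, y1, z1 = x + off1[0], y + off1[1], z + off1[2]
                x2, y2, z2 = x + off2[0], y + off2[1], z + off2[2]
                if 0 <= x1 < n and 0 <= y1 < m and 0 <= z1 < b and \
                   0 <= x2 < n and 0 <= y2 < m and 0 <= z2 < b:
                    g[cube[x, y, z], cube[x1, y1, z1], cube[x2, y2, z2]] += 1
    return g

levels = 8
cube = (np.random.rand(16, 16, 8) * levels).astype(int)   # quantized toy cube
g = glcm_3d(cube, levels, off1=(1, 0, 0), off2=(0, 0, 1))
q = g / g.sum()                                            # Equation (48)
i, j, k = np.indices(q.shape)
contrast = (q * ((i - j) ** 2 + (j - k) ** 2 + (i - k) ** 2)).sum()  # Eq. (49)
asm = (q ** 2).sum()                                                 # Eq. (50)
entropy = -(q[q > 0] ** 2 * np.log(q[q > 0])).sum()                  # Eq. (51)
print(contrast, asm, entropy)
```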

4.5. Three dimensional morphological profile


An extension of the conventional 2-D morphological processing was proposed by Hou, Huang, and Jiao (2015) to build 3-D morphological profiles for hyperspectral image analysis. The 3-D morphological processing uses a set of 3-D structuring elements. In an analogy to 2-D morphology, the basic 3-D morphological operations erosion ($\varepsilon^{3D}$) and dilation ($\delta^{3D}$) are performed with image sub-cubes. For an image $x_k$, the 3-D erosion is performed as follows:

$$\varepsilon^{3D}(x_k) = \min_{(i, j, k)}\left(X(i, j, k) \in Q\right) \tag{52}$$

where $Q$ is a 3-D structuring element centred at $(i, j, k)$. Similarly, the 3-D dilation is performed as follows:

$$\delta^{3D}(x_k) = \max_{(i, j, k)}\left(X(i, j, k) \in Q\right) \tag{53}$$

The 3-D opening, 3-D closing, and 3-D profiles are also derived from the general morphological operations as defined in Section 3.1. The 3-D operators better suit hyperspectral data as the cubical nature of the data is preserved.
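The basic 3-D operations of Equations (52)–(53) map directly onto SciPy's grey-scale morphology, as in the sketch below; the cubic structuring element and the profile scales are illustrative choices, not those of Hou, Huang, and Jiao (2015).

```python
import numpy as np
from scipy import ndimage

# Sketch of the 3-D morphological operations of Equations (52)-(53) using
# a cubic 3-D structuring element Q on a toy cube (rows, cols, bands).
cube = np.random.rand(32, 32, 16)
q = np.ones((3, 3, 3))                               # 3-D structuring element

eroded = ndimage.grey_erosion(cube, footprint=q)     # Equation (52)
dilated = ndimage.grey_dilation(cube, footprint=q)   # Equation (53)
opened = ndimage.grey_opening(cube, footprint=q)     # erosion then dilation
closed = ndimage.grey_closing(cube, footprint=q)     # dilation then erosion

# A simple 3-D morphological profile: openings at growing scales
profile = [ndimage.grey_opening(cube, footprint=np.ones((s, s, s)))
           for s in (3, 5, 7)]
print(len(profile), profile[0].shape)
```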

4.6. Three dimensional LBP


Although the 2-D LBP model can be used to extract local features from hyperspectral imagery, it cannot fully exploit the spectral–spatial characteristics. However, the 2-D LBP model can easily be extended to a 3-D LBP model by taking voxels in place of pixels, as proposed by Jia et al. (2017). As discussed in Section 3.2.4, 2-D LBP uses a circular neighbourhood. In the 3-D version of LBP, the neighbourhood is represented by a sphere. All the calculations for 3-D LBP are similar to LBP. The local texture of the cubical data is provided by a number $3DLBP_{n,r}$ determined as follows:

$$3DLBP_{n,r} = \sum_{i=0}^{n-1} s(g_i - g_c)\, 2^{i} \tag{54}$$

where $g_c$ is the gray-level value of the centre voxel and $g_i$ corresponds to the $i$th voxel gray level, $r$ is the radius of the sphere, and $n$ is the number of voxels in the neighbourhood. The local

continuity of the surface can be described by another number $3DLBP2_{n,r}$, representing the number of neighbouring voxels having a gray-level value larger than that of the central voxel, determined as follows:

$$3DLBP2_{n,r} = \begin{cases} \sum_{i=0}^{n-1} s(g_i - g_c), & \text{if } V(3DLBP2_{n,r}) \le n \\ n + 1, & \text{otherwise} \end{cases} \tag{55}$$

where

$$V(3DLBP2_{n,r}) = \left| s(g_j - g_c) - s(g_k - g_c) \right| \tag{56}$$

and $g_j$ and $g_k$ are adjacent voxels. The 3-D LBP has exhibited better performance than the 2-D LBP for hyperspectral images.
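The following sketch evaluates Equation (54) at a single voxel; to keep it short, the spherical neighbourhood of radius $r = 1$ is approximated by the six face-adjacent voxels, and the neighbour ordering is an arbitrary fixed convention rather than the one used by Jia et al. (2017).

```python
import numpy as np

def s(x):
    """LBP thresholding step: 1 if the difference is non-negative, else 0."""
    return 1 if x >= 0 else 0

def lbp_3d_code(cube, cx, cy, cz):
    """Sketch of Equation (54) at one voxel, with the r = 1 'sphere'
    approximated by the six face-adjacent voxels (n = 6)."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    gc = cube[cx, cy, cz]                       # centre voxel gray level
    code = 0
    for i, (dx, dy, dz) in enumerate(offsets):
        gi = cube[cx + dx, cy + dy, cz + dz]
        code += s(gi - gc) << i                 # s(g_i - g_c) * 2^i
    return code

cube = (np.random.rand(8, 8, 8) * 255).astype(int)
print(lbp_3d_code(cube, 4, 4, 4))               # value in [0, 63] for n = 6
```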

4.7. Deep spectral–spatial features


In recent years, different deep learning techniques, from autoencoders to CNNs, have successfully been used for joint spectral–spatial feature extraction. Figure 8 shows the approach for spectral–spatial feature extraction using SAEs. Spectral information from both the pixel itself and its neighbouring region is provided as collective input to the SAE to obtain joint features. Chen et al. (2014) used SAE to generate spectral–spatial features. The authors concatenated original spectral values from the pixel and its neighbouring region to form a hybrid input vector for the deep network, as shown in Figure 8. Chen, Zhao, and Jia (2015) developed a hybrid feature extraction framework using DBN and PCA. The original dimensionality was reduced with PCA and then a 3-D kernel was used to extract spatial features. Both spectral and spatial features were concatenated to generate deep spectral–spatial features from the DBN.
Figure 8. Spectral–spatial feature extraction by SAE.
A similar approach was used by Lan et al. (2019) with DAE. The spatial component was prepared from a PCA-reduced image with the help of a cubical kernel. Similarly, an unsupervised feature learning approach was developed by Tao et al. (2015) with SSAE, who proposed using multiple kernels in a multiscale manner for the spatial component. They used a reduced spectral dimension for computational efficiency. Kang et al. (2018) used SSAE to generate deep spectral–spatial features with the help of Gabor filters. The Gabor features were fused with spectral values and provided as input to the stacked sparse autoencoder. After training the deep network, the deep features were captured.

The spectral–spatial feature extraction frameworks based on different types of autoencoders can produce joint deep features. However, these methods still use vectorization and thus do not preserve the original correlation of the data.
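A minimal sketch of assembling the hybrid SAE input of Figure 8 is shown below; the window size, the single-band stand-in for the reduced image, and the function name are assumptions for illustration, and border pixels would need padding in practice.

```python
import numpy as np

def hybrid_sae_input(cube, reduced, row, col, w=5):
    """Sketch of the hybrid input vector of Figure 8: the pixel's full
    spectrum concatenated with its flattened w x w neighbourhood taken
    from a dimensionality-reduced image (e.g. the first PCA component),
    in the spirit of Chen et al. (2014). Border handling is omitted."""
    r = w // 2
    spectral = cube[row, col, :]                      # pixel spectrum
    spatial = reduced[row - r:row + r + 1,
                      col - r:col + r + 1].ravel()    # neighbouring region
    return np.concatenate([spectral, spatial])        # SAE input vector

cube = np.random.rand(64, 64, 103)
reduced = cube.mean(axis=2)          # stand-in for a 1-band reduced image
x = hybrid_sae_input(cube, reduced, row=10, col=10)
print(x.shape)                       # (103 + 25,) for w = 5
```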
A 3-D CNN can better deal with joint spectral–spatial information. It uses a 3-D kernel for convolution. The 2-D convolution formulation is extended to obtain the value of a neuron at position $(i, j, k)$ in a 3-D CNN:

$$v_{l,m}^{i,j,k} = g\left(\beta_{l,m} + \sum_{p} \sum_{s=0}^{S_l - 1} \sum_{t=0}^{T_l - 1} \sum_{u=0}^{U_l - 1} w_{l,m,p}^{s,t,u}\, v_{(l-1),p}^{(i+s),(j+t),(k+u)}\right) \tag{57}$$

where $S_l \times T_l \times U_l$ is the kernel size at the $l$th layer. The 3-D CNN can be used with the raw image cube, without any dimensionality reduction, to extract joint spectral–spatial features for classification. For this purpose, the image is divided into as many blocks/patches as there are pixels in the image. Each patch corresponds to one particular pixel at the centre of that patch. The image patches are used as input to the CNN for extracting pixelwise features, as shown in Figure 9. Chen et al. (2016) demonstrated the use of 3-D CNN for spectral–spatial feature computation with the help of $27 \times 27 \times B$ patches. A classification framework based on 3-D CNN was developed by Chen et al. (2018), where spectral and spatial features are jointly exploited. The authors tested different block sizes and found that larger blocks provide better results.
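The per-pixel patch formulation of Equation (57) can be sketched with PyTorch's `Conv3d`, as below; the layer widths, kernel sizes, and patch size are illustrative assumptions and do not reproduce the architectures of Chen et al. (2016, 2018).

```python
import torch
import torch.nn as nn

class Small3DCNN(nn.Module):
    """Sketch of a per-pixel 3-D CNN in the spirit of Equation (57):
    each input is a 1-channel p x p x B patch centred on a pixel.
    Layer sizes here are illustrative, not those of the cited works."""
    def __init__(self, bands, n_classes, p=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        with torch.no_grad():                       # infer flattened size
            n = self.features(torch.zeros(1, 1, p, p, bands)).numel()
        self.classifier = nn.Linear(n, n_classes)   # Softmax via the loss

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One batch of four 9 x 9 x 103 patches (103 bands as in Pavia University)
net = Small3DCNN(bands=103, n_classes=9, p=9)
logits = net(torch.randn(4, 1, 9, 9, 103))
print(logits.shape)                                 # torch.Size([4, 9])
```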

Figure 9. Spectral–spatial feature extraction by CNN.



5. Experiments
Experiments are carried out for some well-known techniques presented in the preceding sections. Although experimental results for individual techniques are reported in various publications, a comparative study is performed here.

5.1. Hyperspectral datasets


Two hyperspectral images representing different kinds of scenes captured by different sensors are chosen for the experiments. A brief description of the datasets is given as follows.

5.1.1. Pavia University

The Pavia University image was captured by the ROSIS-03 sensor during a flight over the campus of the School of Engineering, University of Pavia, Italy. It is a 610 × 340 pixels image with a pixel size of 1.3 m. It represents a semi-urban scene with nine information classes, namely Asphalt, Meadows, Gravel, Tree, Metal sheets, Soil, Bitumen, Bricks, and Shadow. The image has 103 spectral bands in the wavelength range 0.43–0.86 μm after removal of the noisy bands. The false colour composite (FCC) and ground reference are shown in Figure 10.

5.1.2. Salinas
Salinas is a 512 × 217 pixels image captured by the AVIRIS sensor over Salinas Valley, USA in 1998. The pixel size is 3.7 m. It represents agricultural land mainly consisting of soil, vegetation, and vineyard fields. Originally, there were 224 spectral bands, out of which 20 bands are removed. There are 16 information classes: Broccoli green weeds 1, Broccoli green weeds 2, Fallow, Fallow rough plough, Fallow smooth, Stubble, Celery, Grapes untrained, Soil vineyard develop, Corn senesced green weeds, Lettuce romaine 4 wk, Lettuce romaine 5 wk, Lettuce romaine 6 wk, Lettuce romaine 7 wk, Vineyard untrained, and Vineyard vertical trellis. Figure 11 shows the FCC and ground reference for Salinas.

5.2. Experimental results


Experiments are carried out to evaluate the impact of feature extraction on the classification accuracy. The random forest (RF), one of the most successful supervised classifiers, is used in most cases, except for the CNN-based method, which uses Softmax at the classification layer. The details of the test and training samples are provided in Table 3. The training pixels are randomly chosen with the help of the ground reference map. The same training set is used in all the experiments. The remaining ground reference pixels form the test set. In the case of the conventional techniques, the feature set is formed by the top few features that correspond to about 99% of the variance. The number of hidden units is kept as 60 for both SAE and SSAE and 30 for DBN. The number of epochs is set to 10,000. Other parameters are taken as given in the references (Chen et al. 2014; Tao et al. 2015; Chen, Zhao, and Jia 2015). There are three convolutional and three pooling layers in the CNN architecture. The size of the feature set in the case of CNN is 9 and 16 for the Pavia University and Salinas images, respectively. The learning rate is set to 0.01 and the number of epochs for CNN is 300.

Figure 10. Pavia University: (a) FCC, (b) Ground reference.

The classification accuracy is measured in terms of overall accuracy (OA) and the kappa (κ) coefficient. Both global and classwise accuracies are reported. All the results are averages of 10 trials.
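For reference, this evaluation protocol can be sketched with scikit-learn as follows; the random toy features stand in for any of the extracted feature sets, and the forest size is an arbitrary choice rather than the setting used in these experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Sketch of the evaluation protocol: train RF on the extracted features of
# the labelled training pixels, then report OA and kappa on the test pixels.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 27)), rng.integers(0, 9, 500)    # toy data
X_test, y_test = rng.random((2000, 27)), rng.integers(0, 9, 2000)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('OA   :', accuracy_score(y_test, y_pred))       # overall accuracy
print('kappa:', cohen_kappa_score(y_test, y_pred))    # kappa coefficient
```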

5.2.1. Accuracy analysis on spectral feature extraction techniques


The classification results obtained using conventional spectral feature extraction techniques for Pavia University are reported in Table 4. The number of features is given within brackets in the table. It is observed that feature extraction techniques generate a smaller feature set as compared to the raw image. Some feature extraction techniques such as PCA, MNF, DWT, and ICA generate a very small feature set. Despite the smaller feature sets, most of the spectral feature extraction techniques give comparable or better classification accuracies than the original raw image. DBFE has produced the best results among all conventional techniques used in the experiments. The accuracy of most of the classes is improved. The Asphalt, Gravel, and Meadows classes observed an improvement of more than 5% in the κ-coefficient. Table 5 provides the results for some major deep learning techniques. It can be observed from the results that deep learning methods provide better accuracy.

Tables 6 and 7 provide the classification results for Salinas. The classification accuracy obtained using the raw image and the extracted features is comparable. As for Pavia University, DBFE and CNN yield the best results in their respective categories for Salinas also. Overall, it can be concluded that the discriminating capability of the data can be improved with feature extraction techniques.

Figure 11. Salinas: (a) FCC, (b) Ground reference.

Table 3. The information classes and corresponding number of training and test pixels for hyper-
spectral datasets.
Pavia University Salinas
Class Class name Train Test Class Class name Train Test
Class 1 Asphalt 548 6304 Class 1 Broccoli green weeds 1 200 1809
Class 2 Meadows 540 18,146 Class 2 Broccoli green weeds 2 200 3526
Class 3 Gravel 392 1815 Class 3 Fallow 200 1776
Class 4 Tree 524 2912 Class 4 Fallow rough plough 200 1194
Class 5 Metal sheets 292 1113 Class 5 Fallow smooth 200 2478
Class 6 Soil 532 4572 Class 6 Stubble 200 3759
Class 7 Bitumen 375 981 Class 7 Celery 200 3559
Class 8 Bricks 514 3364 Class 8 Grapes untrained 200 11,071
Class 9 Shadow 231 795 Class 9 Soil vineyard develop 200 6003
Class 10 Corn senesced green weeds 200 3078
Class 11 Lettuce romaine 4 wk 200 868
Class 12 Lettuce romaine 5 wk 200 1727
Class 13 Lettuce romaine 6 wk 200 716
Class 14 Lettuce romaine 7 wk 200 870
Class 15 Vineyard untrained 200 7068
Class 16 Vineyard vertical trellis 200 1607

Table 4. Classwise and global classification accuracies obtained using major conventional spectral
feature extraction techniques for Pavia University. The number of features is given within brackets.
Class Raw (103) PCA (6) MNF (5) DWT (7) ICA (15) NWFE (40) CNFE (11) DBFE (27)
Class 1 0.8050 0.8394 0.8255 0.8134 0.8357 0.7593 0.7579 0.9155
Class 2 0.7185 0.6692 0.6673 0.6619 0.6525 0.5606 0.7691 0.8683
Class 3 0.7612 0.7254 0.7226 0.7374 0.6762 0.6986 0.7774 0.8144
Class 4 0.9627 0.9429 0.9287 0.9469 0.9504 0.8301 0.8619 0.9461
Class 5 0.9861 0.9776 0.9981 0.9907 0.9981 0.9870 0.9946 1.0000
Class 6 0.8263 0.8235 0.7887 0.7555 0.7927 0.5093 0.6721 0.9232
Class 7 0.8849 0.8751 0.8663 0.8784 0.8470 0.8394 0.9046 0.8950
Class 8 0.8428 0.8176 0.7746 0.8069 0.8156 0.7433 0.8937 0.8710
Class 9 1.0000 0.9917 0.9986 1.0000 0.9945 0.9807 0.9798 1.0000
Global κ 0.8041 0.7838 0.7717 0.7691 0.7702 0.6710 0.7269 0.8968
OA (%) 85.16 83.49 82.65 82.46 82.81 74.92 79.21 92.34

Table 5. Classwise and global classification accuracies obtained using major deep learning spectral
feature extraction techniques for Pavia University. The number of features is given within brackets.
Class SAE (60) SSAE (60) DBN (30) CNN (9)
Class 1 0.9146 0.9188 0.9276 0.9346
Class 2 0.9524 0.9465 0.9432 0.9518
Class 3 0.8812 0.8937 0.9128 0.9124
Class 4 0.9627 0.9598 0.9584 0.9457
Class 5 0.9894 0.9914 0.9926 0.9842
Class 6 0.9063 0.9126 0.9077 0.9014
Class 7 0.9149 0.9189 0.9209 0.9165
Class 8 0.8928 0.8876 0.9143 0.9248
Class 9 0.9924 0.9925 0.9915 0.9965
Global κ 0.9236 0.9284 0.9316 0.9368
OA (%) 93.16 93.48 94.31 94.62

Table 6. Classwise and global classification accuracies obtained using major conventional spectral
feature extraction techniques for Salinas. The number of features is given within brackets.
Class Raw (204) PCA (5) MNF (15) DWT (9) ICA (20) NWFE (70) CNFE (30) DBFE (143)
Class 1 0.9943 0.9800 0.9875 0.9869 0.9830 0.9920 0.9807 0.9943
Class 2 0.9924 0.9957 0.9960 0.9942 0.9903 0.9979 0.9961 0.9915
Class 3 0.9890 0.9715 0.9953 0.9942 0.9740 0.9581 0.9751 0.9785
Class 4 0.9941 0.9966 0.9992 0.9983 0.9949 0.9975 0.9975 0.9864
Class 5 0.9819 0.9852 0.9704 0.9763 0.9949 0.9573 0.9819 0.9683
Class 6 0.9971 0.9991 0.9983 0.9960 0.9994 0.9977 0.9980 0.9977
Class 7 0.9909 0.9978 0.9956 0.9949 0.9956 0.9896 0.9946 0.9946
Class 8 0.6342 0.6588 0.6545 0.6708 0.6302 0.6563 0.6088 0.6592
Class 9 0.9900 0.9943 0.9917 0.9793 0.9964 0.9386 0.9876 0.9592
Class 10 0.9408 0.9146 0.8811 0.9016 0.9156 0.9072 0.8860 0.9467
Class 11 0.9851 0.9907 0.9839 0.9770 0.9747 0.9595 0.9841 0.9689
Class 12 0.9959 0.9964 0.9934 0.9991 0.9992 0.9881 0.9987 0.9994
Class 13 0.9795 0.9931 0.9973 0.9931 0.9973 0.9835 0.9946 0.9778
Class 14 0.9783 0.9473 0.9540 0.9526 0.9424 0.9621 0.9465 0.9817
Class 15 0.6559 0.6718 0.6796 0.6495 0.6469 0.6525 0.6783 0.7103
Class 16 0.9873 0.9885 0.9898 0.9955 0.9962 0.9968 0.9943 0.9878
Global κ 0.8741 0.8793 0.8777 0.8767 0.8710 0.8672 0.8676 0.8815
OA (%) 88.72 89.18 89.04 88.96 88.45 88.11 88.13 90.16

5.2.2. Analysis on the size of feature set


Experiments are performed to analyze the impact of the size of the feature set on the classification accuracy. Figure 12 shows the plots of the κ-coefficient against the number of features. It is observed from the figure that initially the classification accuracy increases

Table 7. Classwise and global classification accuracies obtained using major deep learning spectral
feature extraction techniques for Salinas. The number of features is given within brackets.
Class SAE (60) SSAE (60) DBN (30) CNN (16)
Class 1 0.9825 0.9786 0.9924 0.9942
Class 2 0.9846 0.9924 0.9944 0.9931
Class 3 0.9632 0.9648 0.9665 0.9736
Class 4 0.9749 0.9752 0.9732 0.9847
Class 5 0.9858 0.9844 0.9789 0.9729
Class 6 0.9847 0.9769 0.9816 0.9834
Class 7 0.9904 0.9852 0.9811 0.9863
Class 8 0.9066 0.9264 0.9168 0.9166
Class 9 0.9842 0.9676 0.9748 0.9684
Class 10 0.9272 0.9128 0.9224 0.9365
Class 11 0.9314 0.9345 0.9126 0.9325
Class 12 0.9863 0.9856 0.9764 0.9844
Class 13 0.9844 0.9768 0.9602 0.9808
Class 14 0.9285 0.9225 0.9384 0.9215
Class 15 0.8224 0.8428 0.8627 0.8522
Class 16 0.9076 0.9062 0.9118 0.9108
Global κ 0.9218 0.9284 0.9329 0.9414
OA (%) 92.94 93.12 93.84 94.85

sharply with the increase in the number of features. Beyond a certain point, the accuracy stabilizes, with no significant improvement as the number of features increases. In some cases the accuracy drops when a larger feature set is used. It can be established that feature extraction techniques keep most of the important information in a few top features having higher eigenvalues. This characteristic helps to reduce the dimensionality while retaining the important information.
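The variance-based truncation described above, which was also used to size the feature sets in these experiments, can be sketched as follows; the toy pixel matrix and the 99% threshold mirror the experimental setup, while the two-pass use of scikit-learn's PCA is just one convenient way to apply the criterion.

```python
import numpy as np
from sklearn.decomposition import PCA

# Keep the smallest number of components whose cumulative explained
# variance reaches ~99%, as in the experimental setup of Section 5.2.
X = np.random.rand(5000, 103)                 # pixels x bands (toy data)
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.99) + 1)       # smallest k with cum >= 0.99
features = PCA(n_components=k).fit_transform(X)
print(k, features.shape)
# Equivalently, PCA(n_components=0.99) applies the same criterion directly.
```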

5.2.3. Accuracy analysis of spectral–spatial feature extraction techniques


Usually spatial features are not used without spectral features. Spatial features are either concatenated with spectral features or extracted jointly. In this section, the performance of some major spectral–spatial feature extraction techniques is discussed. In recent times, emphasis has been given to techniques that do not modify the cubical form of the data. Therefore, mostly 3-D techniques are chosen here for performance evaluation. For SAE, the approach presented by Chen et al. (2014) is followed to generate the features and classification is performed using RF. The architecture of the CNN consists of three alternating convolution and pooling layers. The last pooling layer is followed by flattening and fully connected layers. Finally, classification is performed by Softmax. The results are reported in Table 8 for both images. As the 3-D techniques consider both the spatial and wavelength dimensions of the image, the feature set generated by these techniques contains joint spectral and spatial information. It is observed from the table that better results can be obtained by integrating spectral and spatial information. Figures 13 and 14 show classification maps for Pavia University and Salinas, respectively. The maps are generated using the raw image and some major spectral and spectral–spatial feature extraction techniques. It can be observed that better classification maps are generated by the spectral–spatial feature extraction techniques.

Figure 12. Global κ versus the number of features: (a) Pavia University, (b) Salinas. Curves are shown for CNFE, CNN, DBFE, DWT, ICA, MNF, NWFE, PCA, and SAE over 2 to 100 features.

Table 8. Classification accuracies for major spectral–spatial feature extraction techniques.

Image | Metric | 3-D GLCM | 3-D DWT | 3-D Gabor | 3-D EMP | SAE | 3-D CNN | 3-D LBP
Pavia University | κ | 0.9376 | 0.9235 | 0.9682 | 0.9724 | 0.9798 | 0.9685 | 0.9677
Pavia University | OA (%) | 94.94 | 94.75 | 97.28 | 98.47 | 98.42 | 97.64 | 97.48
Salinas | κ | 0.9456 | 0.9576 | 0.9635 | 0.9782 | 0.9725 | 0.9624 | 0.9623
Salinas | OA (%) | 95.62 | 97.16 | 97.81 | 98.44 | 97.84 | 97.36 | 97.14

Figure 13. Classification maps for Pavia University: (a) Raw, (b) DBFE, (c) SAE, (d) CNN, (e) 3-D EMP, (f) 3-D CNN.

Figure 14. Classification maps for Salinas: (a) Raw, (b) DBFE, (c) SAE, (d) CNN, (e) 3-D EMP, (f) 3-D CNN.

6. Conclusion
In this paper, a review of the feature extraction techniques used in the classification of hyperspectral images is presented. Different approaches for spectral, spatial, and spectral–spatial feature extraction are reviewed and their strengths and weaknesses are discussed. The experiments are carried out on two different types of hyperspectral images. The results show that dimensionality-related issues can be managed well using feature extraction without significantly compromising classification accuracy. Supervised techniques provide better accuracy than their unsupervised counterparts. In the absence of training data, unsupervised feature extraction can provide acceptable solutions. It is also observed from the results that spatial features provide complementary information that can help to improve classification accuracy. Recently emerged deep learning techniques have shown promising performance. Deep learning methods learn features in a hierarchical manner with the help of a complex layered architecture. However, they need a good number of training pixels. The 3-D techniques are a better choice as they suit the cubical form of the data.

Acknowledgements
The authors would like to thank Prof. Paolo Gamba of University of Pavia, Italy for providing ROSIS
dataset. This work is supported by TEQIP-III project funded by World Bank, NPIU, and MHRD, Govt. of
India under grant number TEQIP3/MRPSG/01.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by the TEQIP III [TEQIP3/MRPSG/01].

ORCID
Brajesh Kumar http://orcid.org/0000-0001-8100-7287
Onkar Dikshit http://orcid.org/0000-0003-3213-8218
Ashwani Gupta http://orcid.org/0000-0002-2199-8346
Manoj Kumar Singh http://orcid.org/0000-0003-3119-1244

References
Bach, F. R., and M. I. Jordan. 2003. “Kernel Independent Component Analysis.” Journal of Machine
Learning Research 3 (1): 1–48.
Bachmann, C. M., T. L. Ainsworth, and R. A. Fusina. 2005. “Exploiting Manifold Geometry in
Hyperspectral Imagery.” IEEE Transactions on Geoscience and Remote Sensing 43 (3): 441–454.
doi:10.1109/TGRS.2004.842292.
Bachmann, C. M., T. L. Ainsworth, and R. A. Fusina. 2006. “Improved Manifold Coordinate
Representations of Large-Scale Hyperspectral Scenes.” IEEE Transactions on Geoscience and
Remote Sensing 44 (10): 2786–2803. doi:10.1109/TGRS.2006.881801.
Backer, S. D., P. Kempeneers, W. Debruyn, and W. Scheunders. 2005. “A Band Selection Technique for
Spectral Classification.” IEEE Geoscience and Remote Sensing Letters 2 (3): 319–323. doi:10.1109/
LGRS.2005.848511.
Bau, T. C., S. Sarkar, and G. Healey. 2010. “Hyperspectral Region Classification Using a
Three-Dimensional Gabor Filterbank.” IEEE Transactions on Geoscience and Remote Sensing 48
(9): 3457–3464. doi:10.1109/TGRS.2010.2046494.
Baudat, G., and F. Anouar. 2000. “Generalized Discriminant Analysis Using a Kernel Approach.”
Neural Computing 12 (1): 2385–2404. doi:10.1162/089976600300014980.
Benediktsson, J. A., and P. Ghamisi. 2016. Spectral-Spatial Classification of Hyperspectral Remote
Sensing Images. London: Artech House.

Bruce, L. M., C. H. Koger, and J. Li. 2002. “Dimensionality Reduction of Hyperspectral Data Using
Discrete Wavelet Transform Feature Extraction.” IEEE Transactions on Geoscience and Remote
Sensing 40 (10): 2331–2338. doi:10.1109/TGRS.2002.804721.
Chang, C.-I., and Q. Du. 1999. “Interference and Noise-adjusted Principal Components Analysis.” IEEE
Transactions on Geoscience and Remote Sensing 37 (5): 2387–2396. doi:10.1109/36.789637.
Chang, C.-I., Q. Du, T.-L. Sun, and M. Althouse. 1999. “A Joint Band Prioritization and
Band-decorrelation Approach to Band Selection for Hyperspectral Image Classification.” IEEE
Transactions on Geoscience and Remote Sensing 37 (6): 2631–2641. doi:10.1109/36.803411.
Chang, C.-I., and S. Wang. 2006. “Constrained Band Selection for Hyperspectral Imagery.” IEEE
Transactions on Geoscience and Remote Sensing 44 (6): 1575–1585. doi:10.1109/
TGRS.2006.864389.
Chen, C., F. Jiang, C. Yang, S. Rho, W. Shen, S. Liu, and Z. Liu. 2018. “Hyperspectral Classification Based
on Spectralspatial Convolutional Neural Networks.” Engineering Applications of Artificial
Intelligence 68 (1): 165–171. doi:10.1016/j.engappai.2017.10.015.
Chen, Y., H. Jiang, C. Li, X. Jia, and P. Ghamisi. 2016. “Deep Feature Extraction and Classification of
Hyperspectral Images Based on Convolutional Neural Networks.” IEEE Transactions on Geoscience
and Remote Sensing 54 (10): 6232–16241. doi:10.1109/TGRS.2016.2584107.
Chen, Y., Z. Lin, X. Zhao, G. Wang, and Y. Gu. 2014. “Deep Learning-based Classification of
Hyperspectral Data.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 7 (6): 2094–2107. doi:10.1109/JSTARS.2014.2329330.
Chen, Y., X. Zhao, and X. Jia. 2015. “Spectral-Spatial Classification of Hyperspectral Data Based on
Deep Belief Network.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 8 (6): 2321–2332. doi:10.1109/JSTARS.2015.2388577.
Chiang, -S.-S., C.-I. Chang, and I. W. Ginsberg. 2001. “Unsupervised Target Detection in Hyperspectral
Images Using Projection Pursuit.” IEEE Transactions on Geoscience and Remote Sensing 39 (7):
1380–1391. doi:10.1109/36.934071.
Dopido, I., A. Villa, A. Plaza, and P. Gamba. 2012. “A Quantitative and Comparative Assessment of
Unmixing-Based Feature Extraction Techniques for Hyperspectral Image Classification.” IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5 (2): 421–435.
doi:10.1109/JSTARS.2011.2176721.
Duda, R. O., P. E. Hart, and D. G. Stock. 2001. Pattern Classification. 2nd ed. New York: Wiley.
Dutta, D., A. E. Goodwell, P. Kumar, J. E. Garvey, R. G. Darmody, D. P. Berretta, and J. A. Greenberg.
2015. “On the Feasibility of Characterizing Soil Properties from AVIRIS Data.” IEEE Transactions on
Geoscience and Remote Sensing 53 (9): 5133–5147. doi:10.1109/TGRS.36.
Estevez, P. A., T. Michel, C. A. Perez, and J. M. Zurada. 2009. “Normalized Mutual Information
Feature Selection.” IEEE Transactions on Neural Networks 20 (2): 189–201. doi:10.1109/
TNN.2008.2005601.
Falco, N., J. A. Benediktsson, and L. Bruzzone. 2014. “A Study on the Effectiveness of Different
Independent Component Analysis Algorithms for Hyperspectral Image Classification.” IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6): 2183–2199.
doi:10.1109/JSTARS.2014.2329792.
Fauvel, M., J. A. Benediktsson, J. Chanussot, and J. R. Sveinsson. 2008. “Spectral and Spatial
Classification of Hyperspectral Data Using SVMs and Morphological Profiles.” IEEE Transactions
on Geoscience and Remote Sensing 46 (11): 3804–3814. doi:10.1109/TGRS.2008.922034.
Friedman, J. H., and J. W. Tukey. 1974. “A Projection Pursuit Algorithm for Exploratory Data Analysis.”
IEEE Transactions on Computers C-23: 881–889. doi:10.1109/T-C.1974.224051.
Fukunaga, K., and M. Mantock. 1983. “Nonparametric Discriminant Analysis.” IEEE Transactions on
Pattern Analysis and Machine Intelligence 5 (6): 671–678. doi:10.1109/TPAMI.1983.4767461.
Garcia-Salgado, B. P., and V. Ponomaryov. 2016. “Feature Extraction Scheme for a Textural
Hyperspectral Image Classification Using Gray-scaled HSV and NDVI Image Features Vectors
Fusion.” Proceedings of International Conference on Electronics, Communications and Computers
(CONIELECOMP), 186–191. Cholula: Mexico.
Goetz, A. F. H. 2009. “Three Decades of Hyperspectral Remote Sensing of the Earth: A Personal View.”
Remote Sensing of Environment 113: S5–S16. doi:10.1016/j.rse.2007.12.014.

Gormus, E. T., N. Canagarajah, and A. Achim. 2012. “Dimensionality Reduction of Hyperspectral


Images Using Empirical Mode Decompositions and Wavelets.” IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing 5 (6): 1821–1830. doi:10.1109/
JSTARS.2012.2203587.
Green, A. A., N. Berman, P. Switzer, and M. D. Craig. 1988. “A Transformation for Ordering
Multispectral Data in Terms of Image Quality with Implications for Noise Removal.” IEEE
Transactions on Geoscience and Remote Sensing 26 (1): 65–74. doi:10.1109/36.3001.
Haboudane, D., J. R. Miller, E. Pattey, P. J. Zarco-Tejada, and I. B. Strachan. 2004. “Hyperspectral
Vegetation Indices and Novel Algorithms for Predicting Green LAI of Crop Canopies: Modeling
and Validation in the Context of Precision Agriculture.” Remote Sensing of Environment 90 (1):
337–352. doi:10.1016/j.rse.2003.12.013.
Hao, S., W. Wang, Y. Ye, T. Nie, and L. Bruzzone. 2018. “Two-Stream Deep Architecture for
Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 56 (4):
2349–2361. doi:10.1109/TGRS.2017.2778343.
Haralick, R. M., K. Shanmugan, and I. Dinstein. 1973. “Textural Features for Image Classification.”
IEEE Transactions on Systems Man and Cybernetics 3 (6): 610–621. doi:10.1109/
TSMC.1973.4309314.
Hou, B., T. Huang, and L. Jiao. 2015. “Spectral-Spatial Classification of Hyperspectral Data Using
3-D Morphological Profile.” IEEE Geoscience and Remote Sensing Letters 12 (11): 2364–2368.
doi:10.1109/LGRS.2015.2476498.
Hu, M. K. 1962. “Visual Pattern Recognition by Moment Invariants.” IRE Transanctions on Information
Theory 8 (1): 179–187. doi:10.1109/TIT.1962.1057692.
Huang, H., and B. Kuo. 2010. “Double Nearest Proportion Feature Extraction for Hyperspectral-image
Classification.” IEEE Geoscience and Remote Sensing Letters 48 (11): 4034–4038.
Huang, X., and L. Zhang. 2010. “Object-oriented Subspace Analysis for Airborne Hyperspectral
Remote Sensing Imagery.” Neurocomputing 73 (4–6): 927–936. doi:10.1016/j.
neucom.2009.09.011.
Hughes, G. F. 1968. “On the Mean Accuracy of Statistical Pattern Recognizers.” IEEE Transaction on
Information Theory 14 (1): 55–63. doi:10.1109/TIT.1968.1054102.
Hyverinen, A., J. Karhunen, and E. Oja. 2001. Independent Component Analysis. New York: Wiley.
Ifarraguerri, A., and C. I. Chang. 2000. “Unsupervised Hyperspectral Image Analysis with Projection
Pursuit.” IEEE Transactions on Geoscience and Remote Sensing 38 (6): 2529–2538. doi:10.1109/
36.885200.
Ifarraguerri, A., and M. W. Prairie. 2004. “Visual Method for Spectral Band Selection.” IEEE Geoscience
and Remote Sensing Letters 1 (2): 101–106. doi:10.1109/LGRS.2003.822879.
Jia, S., J. Hu, J. Zhu, X. Jia, and Q. Li. 2017. “Three-Dimensional Local Binary Patterns for Hyperspectral
Imagery Classification.” IEEE Transactions on Geoscience and Remote Sensing 55 (4): 2399–2413.
doi:10.1109/TGRS.2016.2642951.
Jia, X., B.-C. Kuo, and M. M. Crawford. 2013. “Feature Mining for Hyperspectral Image Classification.” Proceedings of the IEEE 101 (3): 676–697.
Jia, X., and J. A. Richards. 1994. “Efficient Maximum Likelihood Classification for Imaging
Spectrometer Data Sets.” IEEE Transactions on Geoscience and Remote Sensing 32 (2): 274–281.
doi:10.1109/36.295042.
Joliffe, I. 2002. Principal Component Analysis. New York: Springer-Verlag.
Kaewpijit, S., J. L. Moigne, and T. El-Ghazawi. 2003. “Automatic Reduction of Hyperspectral Imagery
Using Wavelet Spectral Analysis.” IEEE Transactions on Geoscience and Remote Sensing 41 (4):
863–871. doi:10.1109/TGRS.2003.810712.
Kang, X., C. Li, X. Li, and H. Lin. 2018. “Classification of Hyperspectral Images by Gabor Filtering Based
Deep Network.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
11 (4): 1166–1178. doi:10.1109/JSTARS.2017.2767185.

Keshava, N. 2004. “Distance Metrics and Band Selection in Hyperspectral Processing with
Applications to Material Identification and Spectral Libraries.” IEEE Transactions on Geoscience
and Remote Sensing 42 (7): 1552–1565. doi:10.1109/TGRS.2004.830549.
Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. “ImageNet Classification with Deep Convolutional
Neural Networks.” Proceedings of Advanced Neural Information Processing Systems (NIPS),
1097–1105, Lake Tahoe, Nevada, USA.
Kumar, B., and O. Dikshit. 2015a. “Integrating Spectral and Textural Features for Urban Land Cover
Classification with Hyperspectral Data.” Proceedings of Joint Urban Remote Sensing Event (JURSE),
1–4. Lausanne, Switzerland.
Kumar, B., and O. Dikshit. 2017. “Hyperspectral Image Classification Based on Morphological Profiles
and Decision Fusion.” International Journal of Remote Sensing 38 (20): 5830–5854. doi:10.1080/
01431161.2017.1348636.
Kuo, B.-C., and D. A. Landgrebe. 2004. “Nonparametric Weighted Feature Extraction for
Classification.” IEEE Transactions on Geoscience and Remote Sensing 42 (5): 1096–1105.
doi:10.1109/TGRS.2004.825578.
Kuo, B.-C., C.-H. Li, and J.-M. Yang. 2009. “Kernel Nonparametric Weighted Feature Extraction for
Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 47 (4):
1139–1155. doi:10.1109/TGRS.2008.2008308.
Lan, R., Z. Li, Z. Liu, T. Gu, and X. Luo. 2019. “Hyperspectral Image Classification Using K-sparse
Denoising Autoencoder and Spectral-restricted Spatial Characteristics.” Applied Soft Computing
74: 693–708. doi:10.1016/j.asoc.2018.08.049.
Landgrebe, D. A. 2003. Signal Theory Methods in Multispectral Remote Sensing. New York: Wiley.
Lee, C., and D. A. Landgrebe. 1993. “Decision Boundary Feature Extraction for Nonparametric
Classification.” IEEE Transactions on Systems, Man, and Cybernetics 23 (2): 433–444. doi:10.1109/
21.229456.
Li, J. 2004. “Wavelet-Based Feature Extraction for Improved Endmember Abundance Estimation in
Linear Unmixing of Hyperspectral Signals.” IEEE Transactions on Geoscience and Remote Sensing 42
(3): 644–649. doi:10.1109/TGRS.2003.822750.
Li, S., H. Wu, D. Wan, and J. Zhu. 2011a. “An Effective Feature Selection Method for Hyperspectral
Image Classification Based on Genetic Algorithm and Support Vector Machine.” Knowledge-Based
Systems 24 (1): 40–48. doi:10.1016/j.knosys.2010.07.003.
Li, W., S. Prasad, J. E. Fowler, and L. M. Bruce. 2011b. “Locality-preserving Discriminant Analysis in
Kernel-induced Feature Spaces for Hyperspectral Image Classification.” IEEE Geoscience and
Remote Sensing Letters 8 (5): 895–898. doi:10.1109/LGRS.2011.2128854.
Li, W., F. Feng, H. Li, and Q. Du. 2018. “Discriminant Analysis-Based Dimension Reduction for
Hyperspectral Image Classification: A Survey of the Most Recent Advances and an Experimental
Comparison of Different Techniques.” IEEE Geoscience and Remote Sensing Magazine 15–34.
doi:10.1109/MGRS.2018.2793873.
Lv, F., M. Han, and T. Qiu. 2017. “Remote Sensing Image Classification Based on Ensemble Extreme
Learning Machine with Stacked Autoencoder.” IEEE Access 5: 9021–9031. doi:10.1109/
ACCESS.2017.2706363.
Lv, Z. Y., P. Zhang, J. A. Benediktsson, and W. Z. Shi. 2014. “Morphological Profiles Based on
Differently Shaped Structuring Elements for Classification of Images with Very High Spatial
Resolution.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7
(12): 4644–4652. doi:10.1109/JSTARS.2014.2328618.
Ma, L., and M. M. Crawford. 2010. “Local Manifold Learning Based K-Nearest Neighbor for
Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 48
(11): 4099–4109.
Martinez-Uso, A., F. Pla, J. M. Sotoca, and P. Garcia-Sevilla. 2007. “Clustering-based Hyperspectral
Band Selection Using Information Measures.” IEEE Transactions on Geoscience and Remote Sensing
45 (12): 4158–4171. doi:10.1109/TGRS.2007.904951.
Mirzapour, F., and H. Ghassemian. 2016. “Moment-based Feature Extraction from High Spatial
Resolution Hyperspectral Images.” International Journal of Remote Sensing 37 (6): 1349–1361.
doi:10.1080/2150704X.2016.1151568.

Mura, M. D., A. Villa, J. A. Benediktsson, J. Chanussot, and L. Bruzzone. 2011. “Classification of Hyperspectral Images by Using Extended Morphological Attribute Profiles and Independent Component Analysis.” IEEE Geoscience and Remote Sensing Letters 8 (3): 542–546. doi:10.1109/LGRS.2010.2091253.
Neher, R., and A. Srivastava. 2005. “A Bayesian MRF Framework for Labeling Terrain Using
Hyperspectral Imaging.” IEEE Transactions on Geoscience and Remote Sensing 43 (6): 1363–1374.
doi:10.1109/TGRS.2005.846865.
Nie, F., S. Xiang, Y. Song, and C. Zhang. 2009. “Extracting the Optimal Dimensionality for Local
Tensor Discriminant Analysis.” Pattern Recognition 42 (1): 105–114. doi:10.1016/j.
patcog.2008.03.012.
Nielsen, A. A. 2011. “Kernel Maximum Autocorrelation Factor and Minimum Noise Fraction
Transformations.” IEEE Transactions on Image Processing 20 (3): 612–624. doi:10.1109/
TIP.2010.2076296.
Ojala, T., M. Pietikäinen, and T. Mäenpää. 2002. “Multiresolution Gray-scale and Rotation Invariant
Texture Classification with Local Binary Patterns.” IEEE Transactions on Pattern Analysis and
Machine Intelligence 24 (7): 971–987. doi:10.1109/TPAMI.2002.1017623.
Onoyama, H., C. Ryu, M. Suguri, and M. Iida. 2014. “Integrate Growing Temperature to Estimate the
Nitrogen Content of Rice Plants at the Heading Stage Using Hyperspectral Imagery.” IEEE Journal
of Selected Topics in Applied Earth Observations and Remote Sensing 7 (4): 2506–2515. doi:10.1109/
JSTARS.2014.2329474.
Pu, H., Z. Chen, B. Wang, and G.-M. Jiang. 2014. “A Novel Spatial-Spectral Similarity Measure for
Dimensionality Reduction and Classification of Hyperspectral Imagery.” IEEE Transactions on
Geoscience and Remote Sensing 52 (11): 7008–7022. doi:10.1109/TGRS.2014.2306687.
Qian, Y., M. Ye, and J. Zhou. 2012. “Hyperspectral Image Classification Based on Structured Sparse
Logistic Regression and Three-Dimensional Wavelet Texture Features.” IEEE Transactions on
Geoscience and Remote Sensing 51 (4): 2276–2291. doi:10.1109/TGRS.2012.2209657.
Quesada-Barriuso, P., F. Arguello, and D. B. Heras. 2014. “Spectral-Spatial Classification of
Hyperspectral Images Using Wavelets and Extended Morphological Profiles.” IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 7 (4): 1177–1185. doi:10.1109/
JSTARS.4609443.
Rajadell, O., P. Garca-Sevilla, and F. Pla. 2013. “Spectral-Spatial Pixel Characterization Using Gabor
Filters for Hyperspectral Image Classification.” IEEE Geoscience and Remote Sensing Letters 10 (4):
860–864. doi:10.1109/LGRS.2012.2226426.
Rasti, B., M. O. Ulfarsson, and J. R. Sveinsson. 2010. “Hyperspectral Feature Extraction Using Total
Variation Component Analysis.” IEEE Transactions on Geoscience and Remote Sensing 54 (12):
6976–6985. doi:10.1109/TGRS.2016.2593463.
Rellier, G., X. Descombes, F. Falzon, and J. Zerubi. 2004. “Texture Feature Analysis Using a
Gauss-Markov Model in Hyperspectral Image Classification.” IEEE Transactions on Geoscience
and Remote Sensing 42 (7): 1543–1551. doi:10.1109/TGRS.2004.830170.
Ren, Y., L. Liao, S. J. Maybank, Y. Zhang, and X. Liu. 2017. “Hyperspectral Image Spectral-Spatial
Feature Extraction via Tensor Principal Component Analysis.” IEEE Geoscience and Remote Sensing
Letters 14 (19): 1431–1435. doi:10.1109/LGRS.2017.2686878.
Romero, A., C. Gatta, and G. Camps-Valls. 2016. “Unsupervised Deep Feature Extraction for Remote
Sensing Image Classification.” IEEE Transactions on Geoscience and Remote Sensing 54 (3):
1349–1362. doi:10.1109/TGRS.2015.2478379.
Scholkopf, B., A. Smola, and K.-R. Mller. 1998. “Nonlinear Component Analysis as a Kernel Eigenvalue
Problem.” Neural Computing 10 (1): 1299–1319. doi:10.1162/089976698300017467.
Serpico, S. B., and G. Moser. 2007. “Extraction of Spectral Channels from Hyperspectral Images for
Classification Purposes.” IEEE Transactions on Geoscience and Remote Sensing 45 (2): 484–495.
doi:10.1109/TGRS.2006.886177.
Shankar, B. U., S. K. Meher, and A. Ghosh. 2011. “Wavelet-fuzzy Hybridization: Feature-extraction and
Land-cover Classification of Remote Sensing Images.” Applied Soft Computing 11: 2999–3011.
doi:10.1016/j.asoc.2010.11.024.

Shen, L., and S. Jia. 2011. “Three-Dimensional Gabor Wavelets for Pixel-Based Hyperspectral Imagery
Classification.” IEEE Transactions on Geoscience and Remote Sensing 49 (12): 5039–5046.
doi:10.1109/TGRS.2011.2157166.
Shen, L., Z. Zhu, S. Jia, J. Zhu, and Y. Sun. 2013. “Discriminative Gabor Feature Selection for
Hyperspectral Image Classification.” IEEE Geoscience and Remote Sensing Letters 10 (1): 29–33.
doi:10.1109/LGRS.2012.2191761.
Shi, M., and G. Healey. 2003. “Hyperspectral Texture Recognition Using a Multiscale Opponent
Representation.” IEEE Transactions on Geoscience and Remote Sensing 41 (5): 1090–1095.
doi:10.1109/TGRS.2003.811076.
Song, W., S. Li, L. Fang, and T. Lu. 2018. “Hyperspectral Image Classification With Deep Feature
Fusion Network.” IEEE Transactions on Geoscience and Remote Sensing 56 (6): 3173–3184.
doi:10.1109/TGRS.2018.2794326.
Sugiyama, M. 2007. “Dimensionality Reduction of Multimodal Labeled Data by Local Fisher
Discriminant Analysis.” Journal of Machine Learning and Research 8: 1027–1061.
Sun, W., and Q. Du. 2019. “Hyperspectral Band Selection: A Review.” IEEE Geoscience and Remote
Sensing Magazine 118–139. doi:10.1109/MGRS.2019.2911100.
Sun, X., F. Zhou, J. Dong, F. Gao, Q. Mu, and X. Wang. 2017. “Encoding Spectral and Spatial Context
Information for Hyperspectral Image Classification.” IEEE Geoscience and Remote Sensing Letters 14
(12): 2250–2254. doi:10.1109/LGRS.2017.2759168.
Tan, K., E. Li, Q. Du, and P. Du. 2014. “Hyperspectral Image Classification Using Band Selection and
Morphological Profiles.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing 7 (1): 40–48. doi:10.1109/JSTARS.2013.2265697.
Tao, C., H. Pan, Y. Li, and Z. Zou. 2015. “Unsupervised Spectral-Spatial Feature Learning with Stacked
Sparse Autoencoder for Hyperspectral Imagery Classification.” IEEE Geoscience and Remote
Sensing Letters 12 (12): 2438–2442. doi:10.1109/LGRS.2015.2482520.
Tsai, F., and J.-S. Lai. 2013. “Feature Extraction of Hyperspectral Image Cubes Using
Three-Dimensional Gray-Level Cooccurrence.” IEEE Transactions on Geoscience and Remote
Sensing 51 (6): 3504–3513. doi:10.1109/TGRS.2012.2223704.
Velasco-Forero, S., and J. Angulo. 2013. “Classification of Hyperspectral Images by Tensor Modeling
and Additive Morphological Decomposition.” Pattern Recognition 46 (1): 566–577. doi:10.1016/j.
patcog.2012.08.011.
Wang, J., and C.-I. Chang. 2016. “Independent Component Analysis-based Dimensionality Reduction
with Applications in Hyperspectral Image Analysis.” IEEE Transactions on Geoscience and Remote
Sensing 44 (6): 1586–1600. doi:10.1109/TGRS.2005.863297.
Xia, J., L. Bombrun, T. Adal, Y. Berthoumieu, and C. Germain. 2016. “Spectral-Spatial Classification of
Hyperspectral Images Using ICA and Edge-Preserving Filter via an Ensemble Strategy.” IEEE
Transactions on Geoscience and Remote Sensing 54 (8): 4971–4982. doi:10.1109/
TGRS.2016.2553842.
Xia, J., J. Chanussot, P. Du, and X. He. 2015. “Spectral-Spatial Classification for Hyperspectral Data
Using Rotation Forests with Local Feature Extraction and Markov Random Fields.” IEEE
Transactions on Geoscience and Remote Sensing 53 (5): 2532–2546. doi:10.1109/
TGRS.2014.2361618.
Xing, C., L. Ma, and X. Yang. 2016. “Stacked Denoise Autoencoder Based Feature Extraction and
Classification for Hyperspectral Images.” Journal of Sensors 2016: 1–10.
Yang, J., Y.-Q. Zhao, and C.-W. J. Chan. 2017. “Learning and Transferring Deep Joint Spectral-Spatial
Features for Hyperspectral Classification.” IEEE Geoscience and Remote Sensing Letters 55 (8):
4729–4742. doi:10.1109/TGRS.2017.2698503.
Yang, J.-M., P.-T. Yu, and B.-C. Kuo. 2010. “A Nonparametric Feature Extraction and Its Application to
Nearest Neighbor Classification for Hyperspectral Image Data.” IEEE Transactions on Geoscience
and Remote Sensing 48 (3): 1279–1293. doi:10.1109/TGRS.2009.2031812.
Ye, Z., S. Prasad, W. Li, J. E. Fowler, and M. He. 2014. “Classification Based on 3-D DWT and Decision
Fusion for Hyperspectral Image Analysis.” IEEE Geoscience and Remote Sensing Letters 11 (1):
173–177. doi:10.1109/LGRS.2013.2251316.

Yin, J., Y. Wang, and J. Hu. 2012. “A New Dimensionality Reduction Algorithm for Hyperspectral
Image Using Evolutionary Strategy.” IEEE Transactions on Industrial Informatics 8 (4): 935–943.
doi:10.1109/TII.2012.2205397.
Zhang, L., L. Zhang, D. Tao, and X. Huang. 2013. “Tensor Discriminative Locality Alignment for
Hyperspectral Image SpectralSpatial Feature Extraction.” IEEE Transactions on Geoscience and
Remote Sensing 51 (1): 242–255. doi:10.1109/TGRS.2012.2197860.
Zhao, W., and S. Du. 2016. “Spectral-spatial Feature Extraction for Hyperspectral Image Classification:
A Dimension Reduction and Deep Learning Approach.” IEEE Transactions on Geoscience and
Remote Sensing 54 (8): 4544–4554. doi:10.1109/TGRS.2016.2543748.
Zhong, Z., B. Fan, J. Duan, L. Wang, K. Ding, S. Xiang, and C. Pan. 2015. “Discriminant Tensor
Spectral-Spatial Feature Extraction for Hyperspectral Image Classification.” IEEE Geoscience and
Remote Sensing Letters 12 (5): 1028–1032. doi:10.1109/LGRS.2014.2375188.
Zhou, P., J. Han, G. Cheng, and B. Zhang. 2019. “Learning Compact and Discriminative Stacked
Autoencoder for Hyperspectral Image Classification.” IEEE Transactions on Geoscience and Remote
Sensing 57 (7): 4823–4833. doi:10.1109/TGRS.36.
