
5324 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 31, NO. 12, DECEMBER 2020

Hybrid Deep Learning-Gaussian Process Network for Pedestrian Lane Detection in Unstructured Scenes

Thi Nhat Anh Nguyen, Son Lam Phung, Senior Member, IEEE, and Abdesselam Bouzerdoum, Senior Member, IEEE

Abstract: Pedestrian lane detection is an important task in many assistive and autonomous navigation systems. This article presents a new approach for pedestrian lane detection in unstructured environments, where the pedestrian lanes can have arbitrary surfaces with no painted markers. In this approach, a hybrid deep learning-Gaussian process (DL-GP) network is proposed to segment a scene image into lane and background regions. The network combines a compact convolutional encoder–decoder net and a powerful nonparametric hierarchical GP classifier. The resulting network, with a smaller number of trainable parameters, helps mitigate the overfitting problem while maintaining the modeling power. In addition to the segmentation output for each test image, the network also generates a map of uncertainty, a measure that is negatively correlated with the confidence level with which we can trust the segmentation. This measure is important for pedestrian lane-detection applications, since its prediction affects the safety of its users. We also introduce a new data set of 5000 images for training and evaluating pedestrian lane-detection algorithms. This data set is expected to facilitate research in pedestrian lane detection, especially the application of DL in this area. Evaluated on this data set, the proposed network shows significant performance improvements compared with several existing methods.

Index Terms: Assistive and autonomous navigation, benchmark data set, deep learning (DL), Gaussian process (GP) classifier, pedestrian lane detection.

Manuscript received April 7, 2019; revised September 26, 2019 and December 21, 2019; accepted January 1, 2020. Date of publication February 13, 2020; date of current version December 1, 2020. This work was supported by a grant from the Australian Research Council (ARC). (Corresponding author: Thi Nhat Anh Nguyen.)

Thi Nhat Anh Nguyen and Son Lam Phung are with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: ngt.nhatanh@gmail.com; phung@uow.edu.au).

Abdesselam Bouzerdoum is with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia, and also with the College of Science and Engineering, Hamad Bin Khalifa University, Doha 7675, Qatar (e-mail: a.bouzerdoum@uow.edu.au).

This article has supplementary downloadable material available at https://ieeexplore.ieee.org, provided by the authors.

Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2020.2966246

2162-237X © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

AUTOMATIC detection of the pedestrian lane in a scene is an important component in many assistive and autonomous navigation systems. It assists vision-impaired people in finding the walkable path and maintaining their balance while walking, a difficult task that is currently performed mostly using a white cane or a guide dog [1]. It also allows a smart wheelchair, with little guidance from its disabled user, to traverse a pedestrian lane [2]. Pedestrian lane detection is also useful for autonomous vehicles to avoid pedestrians or off-limit regions in a scene [3]. In addition, it complements other features of electronic navigation devices such as obstacle detection [4], [5] and GPS-based guidance [6].

The existing methods proposed for pedestrian lane detection are mostly designed for detecting pedestrian lanes painted with white markers [7]–[10]. This article addresses this gap by focusing on the camera-based detection of pedestrian lanes in unstructured environments, where the pedestrian lanes can have arbitrary surfaces with no painted markers. The scenes containing the pedestrian lane are under varying lighting conditions and could be indoor or outdoor.

Existing algorithms for unmarked lane detection mostly rely on hand-engineered features. These methods either use the color- and texture-based features of the lane surfaces to differentiate the lane pixels from the background [11]–[13], or use the edge features to locate the lane boundaries [14]–[16]. In general, these methods are sensitive to scene variations, which cannot be easily captured by such model-based systems.

Recently, lane-detection methods that use deep neural networks (DNNs) for automatic feature learning have been proposed [17]–[19]. These methods yielded promising performance in terms of accuracy and processing time. However, they were mostly designed for vehicle road lane detection. To the best of our knowledge, there are few publicly available methods based on DNNs for unmarked pedestrian lane detection, which is generally a more challenging problem than road lane detection, because the appearances, surfaces, and shapes of pedestrian lanes often vary more significantly than those of vehicle lanes.

There are two major challenges in adopting the DNN methods for pedestrian lane detection. First, training a typical DNN often requires a large volume of data, especially for complex problems. However, the data sets of labeled images for training a pedestrian lane-detection system are generally small compared with the training sets in other computer vision tasks. The largest publicly available data set for pedestrian lane detection was introduced in [16] with only 2000 images. Since typical DNNs usually have a huge number of parameters to model complex problems, they are highly prone to overfitting if the amount of available data is small. Second, for the safety of the users (vision-impaired persons, for example), the pedestrian lane-detection system must generate not only an accurate segmentation of the lane but also a confidence measure with which we can trust its predictive output. Ideally, the system should give a full-resolution confidence map, so that the user can decide which parts of the scene to avoid
(the parts with low confidence levels). A typical DNN does not naturally produce such a confidence measurement for its prediction.

To address the above two challenges, this article presents a new approach for pedestrian lane detection using a hybrid deep learning-Gaussian process (DL-GP) architecture. In this approach, we cast pedestrian lane detection in unstructured environments as a segmentation problem where a scene image is segmented into pedestrian lane and background regions. The contributions of the article can be highlighted as follows.

1) We propose a hybrid architecture for pedestrian lane detection that combines a compact convolutional encoder–decoder (E-D) network and a hierarchical GP (HGP) classifier. Unlike the existing lane-detection approaches that use very deep convolutional networks, the proposed architecture combines a compact network having a smaller number of parameters with a powerful nonparametric GP classifier. This strategy helps mitigate the overfitting problem while maintaining the modeling power. The proposed architecture can be trained in an end-to-end manner. An additional benefit of using a GP classifier in our architecture is that, besides the segmentation output for each test image, the classifier also generates a map of well-calibrated uncertainty, a parameter that is negatively correlated with the confidence with which we can trust the segmentation. In a typical DNN classifier, predictive probabilities obtained by a softmax layer are often erroneously interpreted as model confidence. In fact, since softmax uses point estimation without considering model uncertainty, a model can be uncertain in its predictions even with a high softmax output. As illustrated in Fig. 1, passing a point estimate of a function [see Fig. 1(a)] through a softmax layer results in extrapolations [see Fig. 1(b)] with unjustified high confidence for points far from the training data. For example, an input x = 6 is classified as class 1 with a probability of 1. In contrast, a GP classifier [see Fig. 1(c)] gives a full probabilistic prediction with a predictive mean and an uncertainty level for each test input.

2) The main limitation of the GP is its high computational cost. Existing methods to reduce this cost can be categorized into global or local approximation approaches. Global approximations use only sparse information from the training data and normally cannot account for nonstationarity and locality in complex data sets. Local approximations fit a separate GP for each subregion of the input space and are prone to overfitting. In [20], we proposed a GP approximation method that uses a hierarchical structure to combine global and local information from the data set. This method has been shown to improve the predictive performance over methods that use only global or only local information. However, that work mainly focuses on multiclass classification. A model specifically designed for binary classification will lead to a further reduction in the computational cost. This article will discuss the HGP model for binary classification in detail. We will then formulate a single loss function and present an algorithm for end-to-end training of the hybrid DL-GP architecture for pedestrian lane detection.

3) We create a new data set with manually annotated ground truth for the objective evaluation of algorithms for pedestrian lane detection. This data set consists of 5000 images collected from realistic indoor and outdoor scenes, with various shapes, textures, and surface colors. It is the largest data set for pedestrian lane detection in the literature; it is extended from the data set previously introduced in [16]. It is expected to facilitate research in pedestrian lane detection, especially the application of DL in this area. The data set is available at slp-lab.com/phung/plvp2.html.

Fig. 1. Conceptual comparison of a softmax layer of a deep network versus a GP classifier, using an example in binary classification. A softmax layer gives only a point prediction for each test input. In comparison, a GP classifier gives a probabilistic prediction with a predictive mean and an uncertainty level for each test input: the test inputs far from the training data have higher uncertainty. In Fig. 1(c), the uncertainty interval (the gray area) is plotted with the predictive 95% confidence bounds. (a) Input to a softmax layer as a function of the network input x. (b) Output of the softmax layer as a function of the network input x. (c) Predictive output of GP classifier as a function of its input x.

The rest of the article is organized as follows. Section II reviews the related work. Section III presents the proposed hybrid DL-GP architecture for pedestrian lane detection. Section IV presents the experiments and analysis. Finally, Section V concludes the article.

II. RELATED WORK

In this section, we review the related work for unmarked lane detection, including the traditional and DL-based methods.

A. Traditional Methods for Unmarked Lane Detection

Representative traditional methods based on hand-engineered features for detecting pedestrian lanes in unstructured scenes are listed in Table I. They can be divided into two categories: 1) lane segmentation and 2) lane-border detection. In the lane-segmentation approach, color models, which are built through offline training, are used to differentiate the lane pixels from the background [11], [12], [21], [22]. These methods use different color spaces and classifiers. Crisman and Thorpe [11] represent the on-road and off-road classes with Gaussian color models in the red–green–blue (RGB) color space. Tan et al. [12] use color histograms in the RGB space. Ramstrom and Christensen [22] construct Gaussian mixture models from the UV, normalized red and green, and luminance components. Sotelo et al. [21] employ the hue–saturation–intensity (HSI) color space, and classify pixels by thresholding their chromatic distance to the
color models. In general, the above methods using offline trained models do not cope well with the variations in lane appearance, lane surfaces, and illumination conditions.

TABLE I
REPRESENTATIVE TRADITIONAL METHODS FOR UNMARKED LANE DETECTION. THE MARK “-” MEANS THAT THE TECHNIQUE IS NOT USED

To address this problem, several methods choose to build the lane model directly from the sample regions in the input image [23]–[26]. There have been different ways to obtain sample lane regions. In [26] and [27], small random areas are selected at the bottom and in the middle of the input image. In [25], the sample lane region is initialized as a trapezoid at the bottom and center of the image, and then refined using the vanishing point. In [23], the sample lane region is formed from the candidate lane boundaries that are detected using the vanishing point and the assumption about the lane width. The performance of the above methods highly depends on the quality of the sample lane regions, which in turn relies on prior knowledge about the lane.

In the lane-border-detection approach, the lane borders are detected using the vanishing point [14], [15] or the templates of lane boundaries [28]. In [14], the two lane borders are found among the edges pointing to the vanishing point using an objective function that measures the color and texture differences between the lane and background regions. This method requires that the color and texture of the lane region are homogeneous and differ significantly from those of the background regions. In [15], the lane borders are also found from the edges directed toward the vanishing point; the edges are ranked using texture orientation and color features. In another method, the lane boundaries are found from the edges of the homogeneous color regions by matching with the lane templates [28]. The above methods for lane-border detection are sensitive to background edges. To overcome this problem, Chang et al. [29] propose combining lane-border detection and lane segmentation. In this method, lane borders are detected using the vanishing point, and the lane region is segmented using the color model learned from a homogeneous region at the middle bottom of the input image. In [16], the sample lane region is first identified using the vanishing point method. The lane region is then segmented using both the matching scores between the edges of the homogeneous color regions and lane templates, and between the colors of these regions and the color model learned from the sample lane region.

B. DL-Based Methods for Unmarked Lane Detection

Convolutional neural networks (CNNs) were originally proposed for entire-image classification, but they are making progress in structured prediction problems such as object detection and semantic segmentation. Inspired by the progress of CNNs in semantic segmentation, a few road lane-detection methods using CNNs for automatic feature learning have been proposed recently [18], [19], [30]. Mendes et al. [19] design and train a CNN for image patch classification and then convert it into a network for road lane segmentation. During training, each image is divided into 4 × 4 regions, and a patch centered at each region is extracted. The CNN is trained to classify the patches into road or nonroad classes that are attributed to the corresponding 4 × 4 regions. For inference, the CNN is converted into a fully convolutional network by turning the fully connected (FC) layers into convolutional layers. In this way, the network can receive an entire image (instead of a patch) as input, and output a class for every image region. However, the classification net uses subsampling to reduce the number of features and the computational complexity. This also coarsens the segmentation output of the final network and reduces its size compared with the input image.

To address this problem, newer DL methods for semantic segmentation learn to decode low-resolution image representations to pixelwise predictions [30]–[32]. These methods typically employ the E-D architecture that comprises two parts: encoder and decoder networks. The encoder network acts as a feature extractor that transforms an input image into a low-resolution feature representation. The encoder network typically resembles the VGG16 classification network [33], which has 13 convolutional layers and three FC layers. The decoder network is responsible for decoding or mapping the low-resolution feature representation to a probability map of the same size as the input image.

A decoder network typically comprises several decoders, each of which increases the size of its input feature map by a factor of 2. In the fully convolutional network [31], each decoder learns to upsample its input feature map through deconvolution with a 2 × 2 stride. The upsampled feature map is combined with the corresponding encoder feature map to produce the input to the next decoder. In the deconvolutional network [31], each decoder performs two main operations: unpooling and deconvolution. Unpooling performs nonlinear upsampling, a reverse operation of pooling in the encoder network. It uses the locations of the maximum activations selected during pooling to place each activation back at its original location. Deconvolution with a 1 × 1 stride is then used to densify the sparse feature map obtained by unpooling. In SegNet [30], a convolution layer is used in the place of deconvolution in a similar decoder network. In Bayesian SegNet, Kendall et al. [34] extend SegNet to a Bayesian network, which can produce a probabilistic segmentation output. This is done by adding dropout layers to the network at both training and testing phases. Dropout is used at test time to sample the network with randomly dropped out units and, thereby, obtain the samples of the posterior distribution of softmax class
probabilities for each test image. The mean of these samples is used for segmentation prediction, and the variance is used as model uncertainty. This method requires that each test image is fed to the network many times, which can be very slow.

There have been attempts to apply the above deep E-D architectures to road lane segmentation and a related problem, road scene segmentation. For example, Nugroho and Riasetiawan [18] adapt the deconvolutional network for road lane segmentation without using the FC layers. In [30] and [34], SegNet and Bayesian SegNet are used for road scene segmentation on the CamVid road scenes data set [35]. These methods yield promising results for the road lane-segmentation and road scene-segmentation problems; however, we are not aware of any publicly available methods based on DNNs for unmarked pedestrian lane detection.

C. DL for Intelligent Transportation Systems (ITS)

DL has been widely used to solve many ITS problems in recent years. Table II summarizes the representative DL-based methods for ITS. While some of the works purely use DNNs as an instrument for solving the ITS problems, many of them advance the state of the art in DL before applying the approaches to the ITS problems. While developing a CNN for traffic sign recognition, Jin et al. [36] also propose a hinge loss stochastic gradient descent method, whose cost function is similar to that of the support vector machine (SVM), to train the CNNs more quickly. Bejani and Ghatee [37] propose a system for classifying driving styles using a CNN on acceleration data collected by smartphone sensors. To tackle overfitting, they propose two adaptive regularization schemes called adaptive dropout and adaptive weight decay. In our proposed method, we take another approach to avoid overfitting by using a compact DNN combined with a nonparametric GP classifier. However, as shown in our experiments in Appendix B (Supplementary Material), it is also beneficial to apply the adaptive weight decay regularization scheme proposed in [37] to our network.

TABLE II
REPRESENTATIVE DL-BASED METHODS FOR ITS

Pashaei et al. [38] propose a system for the classification of accident images, which uses a CNN for feature extraction and a mixture of extreme learning machines (ELMs) for classification. This system and our proposed architecture are similar in that they both try to combine a CNN with a new classifier. However, unlike our method, which uses a single-stage end-to-end training process, the method in [38] requires two separate training stages. In the first training stage, the traditional CNN using a multilayer perceptron (MLP) classifier in the last layer is trained. In the second stage, the features extracted from the last pooling layer of the CNN are used as the input to train the mixture of ELM classifiers. The main purpose of replacing the MLP in the original CNN with the mixture of ELM classifiers is to improve prediction speed; however, the predictive performance of the new system is inferior to that of the original CNN, as reported in [38]. This is because the mixture of ELM classifiers is trained separately from the CNN, and ELMs themselves are not strong classifiers. In contrast, since our proposed hybrid network employs a one-stage end-to-end training process and since the GP classifier is a powerful classifier, it leads to improved predictive performance over the original CNN.

Zadeh et al. [39] propose a warning system for vulnerable road users, which classifies the collision risk into low, medium, or high. It uses a fuzzy neural network for risk estimation based on various types of data from their smartphone sensors such as vehicle acceleration, GPS location, and weather condition. The problem addressed in [39] is related and complementary to the pedestrian lane-detection problem addressed by our proposed method. The fuzzy inference method used in [39] is a promising direction for computing the risk for blind people or robots on pedestrian lanes, which could be used in conjunction with the uncertainty produced by our network to provide extra safety assurance for the users.

III. PROPOSED HYBRID DL-GP ARCHITECTURE

We propose a hybrid DL-GP architecture that consists of a convolutional E-D network for generating multidimensional features for each pixel in the input image and an HGP module for pixelwise classification. The block diagram of the proposed architecture is shown in Fig. 2.

Fig. 2. Structure of the hybrid DL-GP architecture.

A. Convolutional E-D Network

The convolutional E-D network comprises two parts: encoder and decoder networks. The encoder network transforms an input image into a low-resolution feature representation. The decoder network maps the low-resolution feature representation to a stack of feature maps, each of which has the same resolution as the input image.

1) Encoder Architecture: The encoder network extracts 2-D image features at increasing dyadic scales by alternating convolutional layers and max-pooling layers. Here, we organize these layers into a series of encoder units. Each encoder unit comprises two convolutional layers, followed by a max-pooling layer. The structure of an example encoder network with four encoder units is shown in the top part of Fig. 3. The two convolutional layers within an encoder unit have the same number of output channels, which is 64 channels for the first encoder unit. The spatial size of the feature maps is halved after each encoder unit. The number of channels is
doubled at the second and third encoder units compared with their immediately previous unit. However, we experimentally found that increasing the number of channels beyond 256 does not improve the performance for the pedestrian lane-detection problem. Therefore, for the later encoder units, the number of channels remains unchanged, i.e., the number of channels at each layer stops increasing after it reaches 256. This strategy also aligns with our original plan of keeping the network compact to avoid overfitting. Each convolutional layer is immediately followed by a batch normalization layer and a nonlinear activation layer (ReLU layer); these layers have been omitted in Fig. 3 for conciseness.

Fig. 3. Example E-D network with four encoder/decoder units. The convolutional layers are denoted as “Conv kernel size-number of channels.” Max-pooling layers are denoted as “Pool kernel size.” Upsampling layers are denoted as “Up kernel size.” Each Conv layer is immediately followed by a batch normalization layer and an ReLU layer.

A max-pooling layer applies a 2 × 2 filter to each of its input feature maps using a stride of 2 pixels, and outputs the maximum value in every subregion that the filter captures. A pooling layer effectively reduces the size of its input feature maps by half while still retaining the important information. This layer has two main purposes. First, it increases invariance against small spatial shifts. Second, it reduces the number of parameters and computation in the network and, thereby, controls overfitting. However, spatial information within a pooling window is lost during pooling. This is not beneficial to semantic segmentation, where precise boundary delineation is required. Therefore, it is necessary to store boundary information from the feature maps before subsampling. Storing all the encoder feature maps is a possible solution, but it requires a huge amount of memory. Here, employing the solution proposed in [30], we store only the max-pooling indices, i.e., the locations of the maximum value in each pooling window. These stored indices will be used by the decoder network to upsample the feature maps.

Note that, in the proposed network, each encoder unit consists of two convolutional layers before a pooling layer. Our design choice is explained as follows. Since pooling leads to information loss, we can have only a few pooling layers before losing too much information. For example, for a 256 × 256 input image, typically, the maximum number of pooling layers (and hence the maximum number of encoder units) is 5. As a result, having multiple convolutional layers before a pooling layer allows us to build up better representations of the data without quickly losing spatial information. However, since our goal is to build a compact network, the number of convolutional layers should be kept small. From experimentation, we find that using two convolutional layers per encoder unit gives comparable performance with other designs that use more convolutional layers per encoder unit, while it keeps the number of trained parameters small.

2) Decoder Architecture: The decoder network maps the low-resolution feature maps generated by the encoder network to higher resolution feature maps at increasing dyadic scales until the output feature maps have the same spatial size as the input image. This is done by using the upsampling layers and the convolutional layers. An example decoder network is shown in the bottom part of Fig. 3. We organize the layers of the decoder network into a series of decoder units, each of which comprises an upsampling layer followed by two convolutional layers. The structure of the decoder network is a reflection of that of the encoder network. The number of decoder units is equal to the number of encoder units in the network. The first decoder unit corresponds to the last encoder unit, and so on.

The number and the spatial size of the output feature maps of each decoder unit are equal to those of the input feature maps of the corresponding encoder unit. An exception is the last decoder unit, in which the number of output feature maps is not equal to the number of channels of the input image (which is 3) but is kept unchanged from that of its input feature maps (which is 64). This is because the output of the final decoder unit will be fed to the GP classifier for segmentation, and a large number of features at each spatial location are required for good classification performance.

a) Upsampling layer: The upsampling layer performs the reverse operation of pooling: upsampling its input feature maps and, thereby, doubling their spatial size. During pooling, the max-pooling indices, i.e., the locations of the maximum value in each pooling window, are stored. An upsampling layer uses the memorized max-pooling indices from the corresponding max-pooling layer to place each element of the input feature maps back at its original location in the upsampled output feature maps. The upsampling technique is illustrated in Fig. 4. Note that the number of feature maps does not change after each upsampling layer.

Fig. 4. Illustration of the operation of an upsampling layer.

b) Convolutional layer: An upsampling layer produces enlarged but sparse feature maps. These maps are fed to the two subsequent convolutional layers to be densified.

B. HGP Classifier

We develop an HGP classifier for binary classification that can be placed on the top of the E-D network to classify each pixel in the image into lane or background. For each input
image, the E-D network extracts 64 full-resolution feature maps, which are then rearranged into an array of feature vectors; each vector consists of 64 features corresponding to a pixel. HGP takes a feature vector as input and classifies the corresponding pixel into lane or background.

Let $\mathbf{x}$ denote an extracted feature vector (row vector of $D$ features; $D = 64$ in this application), and $y$ denote the corresponding class label: $y = 1$ for lane pixel and $y = 0$ for background. Let $N$ denote the total number of pixels from all the training images, and $\mathbf{X} = [(\mathbf{x}_1)^T, \ldots, (\mathbf{x}_N)^T]^T$ and $\mathbf{y} = [y_1, \ldots, y_N]^T$ denote the collection of all the feature vectors and the corresponding labels of all the $N$ training pixels. $\mathbf{X}$ is a matrix with the dimension of $N \times D$ and $\mathbf{y}$ is a column vector of size $N \times 1$. We adopt the convention that lowercase italic letters denote scalar variables or functions, uppercase italic letters denote scalar constants, lowercase bold letters denote vectors, and uppercase bold letters denote matrices. The operators $\mathrm{row}(\mathbf{A})$ and $\mathrm{col}(\mathbf{A})$ return the number of rows and columns of a matrix $\mathbf{A}$, respectively.

1) Background on GP Classifiers: GP models are powerful tools for Bayesian classification [42]–[46]. They have two most desirable properties. First, since a GP is a nonparametric model, it has only a few hyperparameters that need to be learned, and thus, overfitting can be avoided. Second, GPs give probabilistic predictions with well-calibrated uncertainties.

An HGP classifier is proposed to combine the advantages of sparse approximation and MoGPE by exploiting both the global information and local information from the training data through a two-layer hierarchical structure. In the upper layer, a sparse GP, hereafter called the global GP, is used to coarsely model the entire data set. In the lower layer, a gating network divides the input space into $T$ regions; and within each region, a specific local GP expert is used for finer modeling. All the local GPs have a common mean function $m(\mathbf{x})$, which encodes information from the global GP. In this way, information is shared between the two layers as well as among the local experts to avoid overfitting.

Herein, we use subscript 0 for the global unit and subscript $k$ for the $k$th local expert ($k = 1, \ldots, T$).

3) Details of the GPs: The global GP is associated with a latent function $f_0(\mathbf{x})$, a zero mean function, and a covariance function $\kappa_0(\mathbf{x}, \mathbf{x}')$: $f_0(\mathbf{x}) \sim \mathcal{GP}(0, \kappa_0(\mathbf{x}, \mathbf{x}'))$. The $k$th local GP is associated with a latent function $f_k(\mathbf{x})$, the mean function $m(\mathbf{x})$, and a covariance function $\kappa_k(\mathbf{x}, \mathbf{x}')$: $f_k(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), \kappa_k(\mathbf{x}, \mathbf{x}'))$, for $k = 1, \ldots, T$. Each GP in the model is a sparse GP in which the training latent variables (i.e., the latent variables placed at the training inputs) are summarized by a set of $M$ inducing points. Each inducing point is comprised of an inducing input, which is a point from the input space $\mathcal{X}$, and its corresponding latent variable.
Let X denote the feature space. In GP classification, We introduce the following notations for the kth GP
we assume that there is an underlying latent function f (x): (k = 0, . . . , T ): f kn denotes f k (xn ), fk is the vector
X −→ R that is distributed according to a GP: f (x) ∼ (size N × 1) of all training latent variables, gk is the
GP(m(x), κ(x, x )). Here, m(x) and κ(x, x ) denote the mean vector (size M × 1) of the inducing variables, Uk is the
function and the covariance function that characterize the GP. matrix (size M × D) formed by the M inducing inputs, θk is
For each input x, the value of f (x) is a latent variable. the set of hyperparameters of the covariance function κk (x, x ),
(k)
For any given set of input points, the GP places a prior and KAB is the covariance matrix [size row(A) × row(B)]
multivariate normal distribution on the corresponding latent formed by evaluating κk (x, x ) at all pairs of points (x, x ),
variables. The observed output y, given the latent variable where x is in A and x is in B.
f (x), is then distributed according to a non-Gaussian like- a) Global GP prior: The global GP places a joint prior
lihood: p(y| f (x)) = h( f (x)). The objects of interest are the distribution on its latent variables g0 and f0 , which results in
posterior of the latent variables placed at the training inputs the following distributions:
and the predictive distribution p(y ∗ |y) for the class label y ∗  
at a test point x∗ ; they can be estimated using approximate p(g0 ) = N 0, K(0)U0 U0 (1)
inference methods such as the variational inference.  (0)  (0) −1 (0)
p(f0 |g0 ) = N KXU0 KU0 U0 g0 , KXX
2) Rationale Behind HGP: A limitation of GP is its high  (0) −1 (0) 
computational cost, mainly due to the inversion of the kernel − K(0)
XU0 KU0 U0 KU0 X. . (2)
matrix, which is O(N 3 ) in the training time, where N is
the number of training samples. Many approximation methods b) Local GP prior: To encode information from the
for GP have been proposed to overcome this limitation. They global unit, we set the prior mean of the local GPs to be
(0) (0)
can be classified into global and local approaches. Global GP the mean of p( f 0 (x)|g0 ), which is m(x) = KxU0 [KU0 U0 ]−1 g0 ,
approximation methods (also known as sparse GP methods) according to (2). This imposes a complex dependence between
try to summarize all the training data using a set of M N gk and g0 , making it difficult for model inference. To remove
inducing points to reduce the computational cost [47]–[50]. this dependence, we introduce new latent variables hk =
The sparse GPs work well for simple data sets, but they gk −m(Uk ). This results in the following distributions:
cannot deal with locality and nonstationarity in complex  (k) 
data sets. p(hk ) = N 0, KUk Uk (3)
Local approaches attempt to overcome the limitation of  (k)  (k) −1 (k)
p(fk |hk , g0 ) = N KXUk KUk Uk hk + KXU0 K−1
U0 U0 g0 KXX
global methods by employing a mixture of GP experts   
(k) (k) −1 (k)
(MoGPE). In MoGPE, a gating network divides the input space − KXUk KUk Uk KUk X . (4)
into regions within which a specific GP expert is responsible
for making predictions [51]–[54]. In this way, the locality and 4) Likelihood Function: In the upper layer, the observed
nonstationarity in the data are addressed. The computational pixel labels are related to latent variables according to the
cost is also reduced as the inversion of a large kernel matrix following likelihood:
is replaced by that of several smaller matrices. However, since p(yn | f 0n ) = B(yn |φ( f 0n )) = φ( f0n ) yn (1 − φ( f0n ))1−yn .
each GP expert is trained independently using only the local
data assigned to it, without considering the global information Here, φ(z) is the probit function that maps the latent variables
(the correlations between the clusters), the experts are likely into the unit interval [0, 1], and B denotes a Bernoulli distrib-
to overfit the local training data. ution. The above equation implies that the class membership

probability $p(y_n = 1 \mid f_{0n})$ and $p(y_n = 0 \mid f_{0n})$ can be calculated as $\phi(f_{0n})$ and $1 - \phi(f_{0n})$, respectively.

In the lower layer, for each observation $(\mathbf{x}_n, y_n)$, a latent variable $z_n$ indicates the expert to which the observation belongs. The likelihood in the lower layer (i.e., the conditional distribution of the label given all the corresponding latent variables from all the local experts) is

$$p(y_n \mid f_1(\mathbf{x}_n), \ldots, f_T(\mathbf{x}_n)) = p(y_n \mid f_{z_n}(\mathbf{x}_n)) = \mathcal{B}(y_n \mid \phi(f_{z_n}(\mathbf{x}_n))).$$

5) Gating Network Prior: Expert indicators $z_n$ are specified by a gating network based on the inputs. In our model, the prior over $z_n$ is defined as

$$p(z_n = k) = \frac{\mathcal{N}(\mathbf{x}_n \mid \mathbf{m}_k, \mathbf{V})}{\sum_{j=1}^{T} \mathcal{N}(\mathbf{x}_n \mid \mathbf{m}_j, \mathbf{V})} \tag{5}$$

where $\mathbf{m}_k = \frac{1}{M}\sum_{m=1}^{M} \mathbf{u}^{(k)}_m$ and $\mathbf{V} = \mathrm{diag}(v_1, \ldots, v_D)$ with $v_d = \frac{1}{T(M-1)} \sum_{k=1}^{T} \sum_{m=1}^{M} \big(u^{(k)}_{md} - m_{kd}\big)^2$; $\mathbf{u}^{(k)}_m$ denotes the $m$th inducing input of the $k$th expert. Here, $\mathbf{m}_k$ is a vector of size $1 \times D$ and $\mathbf{V}$ is a diagonal matrix of size $D \times D$. The rationale behind this prior is that when $\mathbf{x}_n$ is closer to $\mathbf{m}_k$ (the centroid of $\mathbf{U}_k$), its output can be better predicted by expert $k$, and thus, it is more probable to be assigned to that expert.

The probabilistic graphical model that summarizes the dependences among the variables for HGP is given in Fig. 5.

Fig. 5. Probabilistic graphical model for HGP.

6) Learning for HGP: HGP is learned using a variational inference algorithm, in which the variational distributions of the inducing latent variables are explicitly represented as

$$q(\mathbf{g}_0) \triangleq \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0) \quad \text{and} \quad q(\mathbf{h}_k) \triangleq \mathcal{N}(\mathbf{m}_k, \mathbf{S}_k) \tag{6}$$

where $\mathbf{m}_k$ is a column vector of size $M$ and $\mathbf{S}_k$ is a square matrix of size $M \times M$. Following steps similar to those in [20], the following lower bound of the marginal likelihood is derived:

$$\ln p(\mathbf{y}) \ge \sum_{n=1}^{N} \sum_{k=1}^{T} q(z_n = k)\, \mathbb{E}_{q(f_{kn})}[\ln p(y_n \mid f_{kn})] + \sum_{n=1}^{N} \mathbb{E}_{q(f_{0n})}[\ln p(y_n \mid f_{0n})] - \mathrm{KL}(q(\mathbf{h}) \,\|\, p(\mathbf{h})) - \mathrm{KL}(q(\mathbf{g}_0) \,\|\, p(\mathbf{g}_0)) - \mathrm{KL}(q(\mathbf{z}) \,\|\, p(\mathbf{z})) \tag{7}$$

where KL denotes the Kullback-Leibler divergence, $\mathbf{z}$ denotes the set of all $z_n$; $q(f_{kn})$ is defined as

$$q(f_{0n}) \triangleq \int p(f_{0n} \mid \mathbf{g}_0)\, q(\mathbf{g}_0)\, d\mathbf{g}_0 \quad \text{and} \tag{8}$$

$$q(f_{kn}) \triangleq \iint p(f_{kn} \mid \mathbf{h}_k, \mathbf{g}_0)\, q(\mathbf{h}_k)\, q(\mathbf{g}_0)\, d\mathbf{h}_k\, d\mathbf{g}_0 \tag{9}$$

for $k = 1, \ldots, T$. Using (2) and (4), $q(f_{kn})$ can be easily computed in terms of $\mathbf{m}_0$, $\mathbf{S}_0$, $\mathbf{m}_k$, and $\mathbf{S}_k$ (see [55, Eqs. (38) and (39)]). The expectation terms $\mathbb{E}_{q(f_{kn})}[\ln p(y_n \mid f_{kn})]$ in (7) can then be evaluated using the Gauss–Hermite quadrature method [56]. The remaining KL-divergence terms in (7) are analytically tractable (see [55, Appendix B]). The bound in (7) can be considered as containing two separate components. First, the generative loss component involving the first two terms measures how accurately the model reconstructs the observed labels. Second, the latent loss component involving the last three KL-divergences measures how closely the posteriors of the latent variables match their priors.

Learning is performed by maximizing the bound (7) with respect to $q(\mathbf{z})$, $\mathbf{m}_k$, $\mathbf{S}_k$, $\theta_k$, and $\mathbf{U}_k$. Let $\gamma$ be the vector formed by all the parameters to be optimized, except those of $q(\mathbf{z})$. Due to the complex dependence between $\mathbf{z}$ and $\mathbf{U}_k$, we are not able to optimize $\mathbf{z}$ and $\mathbf{U}_k$ jointly. Instead, we employ a variational expectation-maximization (EM) algorithm to optimize the bound (7) by alternating between the following two steps.

1) E-Step: Fix $\gamma$, and maximize the bound with respect to $q(\mathbf{z})$.
2) M-Step: Fix $q(\mathbf{z})$, and maximize the bound with respect to $\gamma$ using gradient-based optimization.

The first step estimates the output of the gating network $q(\mathbf{z})$. As shown in [20], it can be approximated as follows:

$$q(z_n = k) = \begin{cases} 1, & \text{if } \arg\max_i \mathcal{N}(\mathbf{x}_n \mid \mathbf{m}_i, \mathbf{V}) = k \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

a) Convergence: It has been shown in [20] that both the E and M steps of the above EM algorithm only increase or at least maintain the value of the objective function (7), which is upper-bounded by the maximum value of the log marginal likelihood $\ln p(\mathbf{y})$, and the algorithm will eventually converge to a (local or global) maximum.

b) Complexity: Suppose that all the GPs have the same number of inducing points $M$. Most of the computational cost for the above EM algorithm arises from computing the expected likelihood terms $\mathbb{E}_{q(f_{kn})}[\ln p(y_n \mid f_{kn})]$ in (7), which in turn requires the computation of $q(f_{kn})$ for $k = 1, \ldots, T$ and $n = 1, \ldots, N$. Computing each $q(f_{kn})$ according to [55, Eqs. (38) and (39)] has the complexity of $O(M^2)$. Hence, the overall time complexity of the algorithm appears to be $O(N M^2 T)$. However, with the variational posterior of $z_n$ given by (10), we can see that, for each given point $\mathbf{x}_n$, there is only one expert $k$ such that $q(z_n = k)$ is nonzero. Therefore, for each $n \in \{1, \ldots, N\}$, only one term $q(f_{kn})$ is needed for the computation of the objective function (7). As a result, the time complexity of the algorithm is actually reduced to $O(N M^2)$.

7) Stochastic Optimization: The computational complexity of $O(N M^2)$ in training time is still prohibitive for this application, considering that $N$ is the total number of image pixels in the training set. The beauty of this method is that due to the special form of the bound derived in (7), we can further reduce the training cost using mini-batch training, i.e., using only a mini-batch of data in each iteration. In particular, since the bound (7) includes the sum over data points, at each training iteration $t$, it can be approximated with

$$\mathcal{L}(S^{(t)}) = \frac{N}{B} \sum_{\mathbf{x}_i \in S^{(t)}} \lambda_i - \mathrm{KL}(q(\mathbf{h}) \,\|\, p(\mathbf{h})) - \mathrm{KL}(q(\mathbf{g}_0) \,\|\, p(\mathbf{g}_0)) - \mathrm{KL}(q(\mathbf{z}) \,\|\, p(\mathbf{z})) \tag{11}$$

where $S^{(t)}$ denotes the set of feature vectors corresponding to the mini-batch of $B$ pixels used in iteration $t$, and

$$\lambda_n = \sum_{k=1}^{T} q(z_n = k)\, \mathbb{E}_{q(f_{kn})}[\ln p(y_n \mid f_{kn})] + \mathbb{E}_{q(f_{0n})}[\ln p(y_n \mid f_{0n})].$$

This effectively reduces the training time to $O(B M^2)$.
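As a rough illustration of why the mini-batch objective in (11) is a reasonable stand-in for the full bound (7), the sketch below rescales the per-pixel terms $\lambda_i$ of a mini-batch by $N/B$ and checks, on toy numbers, that averaging the estimate over every possible mini-batch recovers the full-data value. The function names, the toy $\lambda$ values, and the single scalar `kl_total` standing in for the three KL terms are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
import itertools
import random


def full_bound(lambdas, kl_total):
    """Full-data objective: sum of all per-pixel terms minus the KL terms."""
    return sum(lambdas) - kl_total


def minibatch_bound(lambdas, batch_idx, kl_total):
    """Mini-batch estimate in the style of (11).

    The batch sum is rescaled by N/B so that it estimates the full sum;
    kl_total is a hypothetical scalar standing in for the three KL terms.
    """
    n, b = len(lambdas), len(batch_idx)
    return (n / b) * sum(lambdas[i] for i in batch_idx) - kl_total


# Toy per-pixel terms (expected log-likelihoods are non-positive).
random.seed(0)
lambdas = [-random.random() for _ in range(6)]
kl_total = 0.3

full = full_bound(lambdas, kl_total)

# Average the estimate over every mini-batch of size B = 2.
estimates = [
    minibatch_bound(lambdas, idx, kl_total)
    for idx in itertools.combinations(range(len(lambdas)), 2)
]
mean_estimate = sum(estimates) / len(estimates)
```

On these toy values `mean_estimate` matches `full` up to floating-point error, while each individual estimate touches only $B$ of the $N$ terms, which is where the per-iteration saving comes from.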

Fig. 6. HGP classification model for pixelwise lane segmentation. Arrows indicate the dataflow at test time.

We note that the computational complexity is independent of the number of experts $T$. Therefore, we can freely select the value of $T$ to fit the data. For that, an autoselection mechanism for $T$ is implemented as follows. We first initialize $T = T_0$, where $T_0$ is larger than the expected number of experts and is typically smaller than 10. After each optimization iteration, if an expert $k$ does not have any training pixel assigned to it (i.e., there is no $n$ with $q(z_n = k) = 1$), this expert will be removed.

8) Predicting With HGP: Consider a test pixel with the feature vector $\mathbf{x}^*$. Let $f^*_k$ denote $f_k(\mathbf{x}^*)$. The predictive distribution for $f^*_k$ can be approximated as

$$p(f^*_k \mid \mathbf{y}) \approx \iint p(f^*_k \mid \mathbf{h}_k, \mathbf{g}_0)\, q(\mathbf{h}_k)\, q(\mathbf{g}_0)\, d\mathbf{h}_k\, d\mathbf{g}_0 = \mathcal{N}(\mu^*_k, \sigma^{*2}_k)$$

where

$$\mu^*_k = (\mathbf{a}^*_k)^T \mathbf{m}_k + (\mathbf{a}^*_0)^T \mathbf{m}_0$$

$$\sigma^{*2}_k = \kappa_k(\mathbf{x}^*, \mathbf{x}^*) + (\mathbf{a}^*_k)^T \big(\mathbf{S}_k - \mathbf{K}^{(k)}_{\mathbf{U}_k\mathbf{U}_k}\big)\, \mathbf{a}^*_k + (\mathbf{a}^*_0)^T \mathbf{S}_0\, \mathbf{a}^*_0 \tag{12}$$

with $\mathbf{a}^*_k = [\mathbf{K}^{(k)}_{\mathbf{U}_k\mathbf{U}_k}]^{-1} \mathbf{K}^{(k)}_{\mathbf{U}_k\mathbf{x}^*}$. It can be observed that the last terms of $\mu^*_k$ and $\sigma^{*2}_k$ in (12) are actually the posterior mean and variance of $m(\mathbf{x}^*)$, which represent the information passed from the upper layer to the lower layer: $q(m(\mathbf{x}^*)) = \mathcal{N}(\bar{m}(\mathbf{x}^*), \bar{v}(\mathbf{x}^*)) = \mathcal{N}\big((\mathbf{a}^*_0)^T \mathbf{m}_0, (\mathbf{a}^*_0)^T \mathbf{S}_0\, \mathbf{a}^*_0\big)$.

The prediction at $\mathbf{x}^*$ by expert $k$ is then given by

$$p(y^* \mid \mathbf{x}^*, \mathbf{y}, z^* = k) = \int p(y^* \mid f^*_k)\, p(f^*_k \mid \mathbf{y})\, df^*_k \tag{13}$$

which can be computed by the Gauss–Hermite quadrature method giving two outputs: the predictive mean (or the lane probability) and the predictive variance (or the uncertainty). The final prediction by HGP at $\mathbf{x}^*$ can be computed as the weighted average of the predictions from the $T$ GP experts

$$p(y^* \mid \mathbf{x}^*, \mathbf{y}) = \sum_{k=1}^{T} p(z^* = k \mid \mathbf{x}^*, \mathbf{y})\, p(y^* \mid \mathbf{x}^*, \mathbf{y}, z^* = k) \approx \sum_{k=1}^{T} q(z^* = k)\, p(y^* \mid \mathbf{x}^*, \mathbf{y}, z^* = k) \tag{14}$$

where $q(z^* = k)$ is the output of the gating network, which is calculated according to (10).

The computational cost of predicting the label for a single test pixel by HGP is dominated by the computation of $\sigma^{*2}_k$ in (12), which has the time complexity of $O(M^2)$.

Fig. 6 illustrates the input and output of the HGP, and the dataflow among its main components including the global GP, the local GP experts, and the gating network for a single test image.

C. Training and Testing for the Hybrid DL-GP Architecture

When HGP is used as the classifier in the proposed hybrid DL-GP architecture, we refer to the resulting network as the DL-HGP network.

1) Training: The end-to-end training procedure for the DL-HGP network is presented in Algorithm 1.

Algorithm 1 End-to-End Training Algorithm for the Hybrid DL-HGP Network
1: Initialize all the parameters (denoted as $\delta$) of the E-D network.
2: Initialize all the parameters of HGP, i.e., $\gamma$ and $q(\mathbf{z})$.
3: repeat
4:   Sample a mini-batch of $B_I$ images randomly from the training set.
5:   Pass the mini-batch through the E-D network to generate a set of 64 full-resolution feature maps for each image in the mini-batch.
6:   Rearrange the feature maps to form a set of feature vectors denoted as $S^{(t)}$. Each vector in $S^{(t)}$ has 64 features corresponding to a pixel in the mini-batch.
7:   Update $q(z_n)$ according to Eq. (10) for each feature vector $\mathbf{x}_n$ in $S^{(t)}$.
8:   Calculate the partial derivatives of $\mathcal{L}(S^{(t)})$ (Eq. (11)) w.r.t. all the GP variables in $\gamma$ and all the parameters $\delta$ of the E-D network using backpropagation.
9:   Update the current estimate of $\gamma$ and $\delta$ using the calculated partial derivatives.
10: until convergence.

As discussed in Section III-B6, the HGP classifier alone will converge to a local or global maximum. However, this does not guarantee the convergence of the training algorithm for the hybrid DL-HGP network. In fact, it is also well known that even the traditional neural networks are not guaranteed to converge to a global optimum. Nevertheless, throughout our experiments, the training algorithm for DL-HGP always converges to a meaningful state, in which the network gives good predictions on a validation set. This convergence state can be detected when both the network parameters and the objective function do not change significantly anymore (i.e., their changes are below a certain threshold).
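The stopping rule described above (both the parameters and the objective changing by less than a threshold) could be checked along the following lines. This is only a minimal sketch: the function name, the flat list of parameters, and the tolerance values are hypothetical, since the article does not state which thresholds were used.

```python
def has_converged(prev_params, params, prev_objective, objective,
                  param_tol=1e-4, objective_tol=1e-4):
    """Return True when neither the parameters nor the objective moved much.

    Both tolerances are illustrative assumptions, not values from the article.
    """
    largest_param_change = max(
        abs(new - old) for old, new in zip(prev_params, params)
    )
    objective_change = abs(objective - prev_objective)
    return largest_param_change < param_tol and objective_change < objective_tol


# Toy usage: gradient ascent on the concave objective -(w - 3)^2.
w, objective = 0.0, -9.0
for iteration in range(1000):
    new_w = w + 0.1 * (-2.0 * (w - 3.0))   # one gradient-ascent step
    new_objective = -(new_w - 3.0) ** 2
    done = has_converged([w], [new_w], objective, new_objective)
    w, objective = new_w, new_objective
    if done:
        break
```

Requiring both conditions guards against stopping on a plateau of the objective while the parameters are still drifting; in practice the check would be applied to $\gamma$, $\delta$, and the value of (11) averaged over several mini-batches, which is again an assumption rather than something the article specifies.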


Algorithm 2 Lane-Segmentation Algorithm for a Test Image I
1: Pass the image through the E-D network to generate 64 full-resolution feature maps.
2: Rearrange the feature maps to form a set of feature vectors $S^*$. Each vector in $S^*$ consists of 64 features corresponding to a pixel in the image.
3: for each $\mathbf{x}^* \in S^*$ do
4:   Find the GP expert $k$ that is responsible for the prediction at $\mathbf{x}^*$: $k = \arg\max_i \mathcal{N}(\mathbf{x}^* \mid \mathbf{m}_i, \mathbf{V})$.
5:   Compute $\mu^*_k$ and $\sigma^{*2}_k$ using the formulas in (12).
6:   Estimate the prediction at $\mathbf{x}^*$ by the expert $k$ according to (13) using the Gauss–Hermite quadrature, which gives the lane probability and the classification uncertainty for the corresponding pixel.
7:   The lane probability $p(y^* = 1 \mid \mathbf{x}^*, \mathbf{y})$ produced in the previous step is compared to 0.5 to give the pixel label: label $= 1$ if $p(y^* = 1 \mid \mathbf{x}^*, \mathbf{y}) > 0.5$, and label $= 0$ otherwise.
8: end for
9: The set of labels for all the pixels in the test image is rearranged into a 2-D segmentation mask, which has the same size as the input image, as shown in Fig. 6. Similarly, the set of classification uncertainties for all the image pixels is rearranged into a 2-D uncertainty map.

The time complexity for training the DL-HGP network is the sum of the time complexity for learning the HGP and for learning the parameters of the E-D network. As previously mentioned, the training time complexity of the HGP alone is $O(B M^2)$, where $B$ is the number of pixels in each mini-batch. The time complexity of the E-D network is dominated by that of all the convolutional layers, which is $O\big(B_I \sum_{l=1}^{L} n_{l-1} s_l^2 n_l m_l^2\big)$, for each mini-batch of $B_I$ images. Here, $l$ is the index of a convolutional layer and $L$ is the number of convolutional layers. $n_l$ is the number of output channels (number of filters) of the $l$th layer and $n_{l-1}$ is the number of input channels of the $l$th layer. $s_l$ is the spatial size of the filters and $m_l$ is the spatial size of the output feature maps. The overall complexity of the training algorithm for DL-HGP is $O(B_I W H M^2) + O\big(B_I \sum_{l=1}^{L} n_{l-1} s_l^2 n_l m_l^2\big)$, where $W \times H$ is the spatial size of the training images.

2) Testing: The steps for producing the lane-segmentation result for a new test image using the hybrid DL-HGP network are summarized in Algorithm 2.

We note that the predictive lane probability $p(y^* = 1 \mid \mathbf{x}^*, \mathbf{y})$ and the background probability $p(y^* = 0 \mid \mathbf{x}^*, \mathbf{y})$ sum to 1. Hence, it is reasonable to assign the pixel label to 1 (i.e., lane) if the probability $p(y^* = 1 \mid \mathbf{x}^*, \mathbf{y})$ is larger than 0.5, as shown in Step 7 of Algorithm 2.

Since the testing time complexity of the HGP is $O(M^2)$ for a single pixel, the testing time complexity for the DL-HGP network for a single test image is $O(W H M^2) + O\big(\sum_{l=1}^{L} n_{l-1} s_l^2 n_l m_l^2\big)$.

IV. EXPERIMENTS AND ANALYSIS

In this section, we first describe the image data set, performance measures, and experimental setup (see Section IV-A). We then present the configuration search for the proposed network (see Section IV-B) and the comparative experimental results of different methods for pedestrian lane detection (see Section IV-C).

A. Image Data Set and Experimental Methods

1) Image Data Set: We created an image data set for pedestrian lane detection. This data set, named PLVP2, consists of 5000 images with their corresponding ground-truth lane-segmentation masks. Each ground-truth mask is a manually segmented binary image, where each pixel is labeled as lane or background. Annotated vanishing points are also included for each image; this information can be used to evaluate the lane-detection methods based on vanishing point detection. The images were taken from realistic indoor and outdoor scenes, at different times of day, and in different weather conditions. The images contain unmarked pedestrian lanes with various shapes, colors, textures, and surface structures (pavement, brick, concrete, or soil). In many images, lane regions are exposed to extreme lighting conditions (e.g., very low or high illumination, or strong shadow). Fig. 7 shows some example images and the corresponding ground truth from the data set. Statistics regarding the lane surfaces and the lighting conditions are given in Table III.

TABLE III
STATISTICS OF THE PLVP2 DATA SET

2) Performance Measures: To evaluate pedestrian lane detection for each test image, the predicted pedestrian lane-segmentation mask and the ground-truth mask are compared pixelwise to compute four evaluation measures: accuracy, recall, precision, and F-measure. Each of these four measures for individual images is then averaged over the whole test set to get the overall evaluation measures. Accuracy is the percentage of the image pixels that are correctly classified. Recall, Precision, and F-measure were proposed in [57] for evaluating the lane-detection algorithms, and they have been widely used since then. Recall is the percentage of the ground-truth lane pixels that are detected correctly. Precision is the percentage of the machine-detected lane pixels that are considered to be correct. F-measure is the harmonic mean of precision and recall: F-measure = 2 × Recall × Precision/(Recall + Precision).

3) Experimental Setup: To compare the performance measures among different lane-detection methods on the PLVP2 data set, we use fivefold cross-validation. In this data set, images collected from multiple sources are given randomized image numbers. The data set is then divided into five partitions of equal sizes. For each fold, four partitions are used for training and validation (called the training–validation set), and the remaining partition is used for testing. Different folds use different test partitions: the first fold uses the first 1000 images, the second fold uses the next 1000 images, and so on. For the experiments comparing the performances of different lane-detection methods, the performance measures evaluated on the test sets are averaged over the five folds.


Fig. 7. Examples from the PLVP2 data set. First rows: pedestrian lane images. Second rows: ground-truth-segmentation masks.

TABLE IV
PERFORMANCE OF DIFFERENT E-D NETWORK CONFIGURATIONS FOR LANE DETECTION ON THE FOLD-1 PLVP2 VALIDATION SET. THE LISTED NUMBER OF PARAMETERS COUNTS ALL THE PARAMETERS OF THE E-D NETWORK AND THE GP CLASSIFIER

For the hyperparameter tuning (model selection) in each individual method, a validation set separated from the test set must be used to avoid bias. Therefore, we further randomly divide each training–validation set into a training set and a validation set using the ratio of 4 : 1.

Images in the data set are captured in both landscape (2113 images) and portrait orientations (the remaining images). The image width and height range from 1224 to 1632 pixels. To reduce the computational cost, the images are resized to 306 × 306 pixels. We experimentally found that image data augmentation does not improve the performance of the tested methods in this pedestrian lane-detection problem. Therefore, it is not used in any of the experiments subsequently reported in this section.

For training of the tested methods, the Adam optimizer is used with the learning rate of 0.001, and the exponential decay rates $\beta_1$ and $\beta_2$ of 0.9 and 0.999. Mini-batch training is used with a batch size of $B_I = 3$ images. The maximum number of training iterations is 30 000.

Each experiment is performed on a system that has an Intel Core i7-7700 CPU at 3.60 GHz, 64 GB of RAM, and a GTX1080TI GPU with 11-GB memory.

4) System Implementation: The proposed hybrid network is implemented entirely in Python using TensorFlow and the GPflow library for the HGP modules. The GPflow library [58] is able to back-propagate through all the GP components, allowing us to train the network in an end-to-end manner.

We note that our HGP classifier treats the inducing inputs as variational parameters, which are tuned to minimize the divergence between the variational distribution and the exact GP posterior distribution of the latent variables. The more inducing points the model uses, the better it approximates the exact nonparametric GP model, as pointed out in [59]. Therefore, we can safely increase the number of inducing inputs $M$ and optimize them without concern about overfitting. However, since the training time for the GP classifier is directly proportional to the square of $M$, we prefer to use a small $M$. In all our experiments, $M$ is fixed to 50.

TABLE V
PERFORMANCE OF THE DL-HGP NETWORK USING DIFFERENT VALUES FOR $T_0$ ON THE FOLD-1 PLVP2 VALIDATION SET

B. Configuration Search

In this section, we evaluate different variations of the proposed network to find the best configuration in terms of the number of encoder/decoder units, the filter size, and the number of GP experts. These are collectively called the hyperparameter set.

For each of the five folds described in Section IV-A3, we train different network variations using the training set, analyze their performance on the validation set, and select the network variation with the best performance on the validation set. We emphasize that this entire model selection process must be repeated for each of the five folds. However, since we found the results of the model selection for the different folds are very similar, in this section, we will only present the model selection result of the first fold.

1) E-D Network Configuration Search: We first fix the number of GP experts to 9 and search for the number of encoder/decoder units and the filter size. The number of encoder/decoder units is selected from 2 to 6, and the filter size varies among 3 × 3, 5 × 5, and 7 × 7. The performance of the different E-D network configurations on the validation set is


TABLE VI
PERFORMANCE OF DIFFERENT LANE-DETECTION METHODS ON THE PLVP2 DATA SET USING FIVEFOLD CROSS-VALIDATION; THE BEST PERFORMANCE MEASURES ARE GIVEN IN BOLD

listed in Table IV. We observe that the network achieves higher accuracy and F-measure by using more encoder/decoder units, up to four to five units. The performance drops when the sixth encoder/decoder unit is added. This drop in performance can be explained by the fact that so much spatial information is lost after six pooling layers of the encoder network that it cannot be effectively recovered by the decoder network. The configurations C4K5 and C5K5, which have the filter size of 5 × 5 and use four and five encoder/decoder units, respectively, achieve the best performances in terms of both accuracy and F-measure. The configuration C4K5 is chosen for our network, because it has fewer trainable parameters (and, hence, it is less prone to overfitting).

2) Initial Number of Expert Search: Next, we fix the configuration C4K5 for the E-D network and vary the initial number of GP experts $T_0$ among 3, 6, 9, and 12. The performance of the network using different values for $T_0$ on the PLVP2 validation set is presented in Table V. The network achieves the best performance at $T_0 = 9$, where the number of experts $T$ at convergence is 9. When $T_0$ increases to 12 or 15, the network also converges at $T = 9$ and achieves the performance that is very close to when $T_0 = 9$. This result shows that the proposed autoselection mechanism for $T$ is effective as long as the initial number of experts $T_0$ is larger than or equal to the expected $T$. For the remaining experiments, we fix $T_0 = 9$.

C. Comparison With Other Lane-Detection Methods

1) Quantitative Comparison: In this experiment, we use fivefold cross-validation to compare the performances of different lane-detection methods on the PLVP2 data set. The proposed DL-HGP network is compared with the traditional methods that use hand-engineered features and the DL-based methods for unmarked lane detection. Two representative and relevant traditional methods are included in this experiment.

1) Edge-Based Method [15]: This method detects the lane boundaries from the edges pointing to the vanishing points. We use the MATLAB code provided by Kong et al. [15].
2) Border-Detection + Segmentation [16]: This method combines lane-border detection and lane segmentation. We use the MATLAB code provided by Phung et al. [16].

Four state-of-the-art DL-based methods for road scene segmentation are evaluated in this experiment.

1) DeepLabv3+ [60]: This method also uses an E-D structure. The encoder network can extract features at an arbitrary resolution by applying atrous convolution. It also uses atrous Spatial Pyramid Pooling with multiple atrous rates to capture multiscale context. Normally, the encoder will generate feature maps with an output stride of 8 or 16 (i.e., the size of the feature maps is 1/8 or 1/16 of that of the input image). A simple decoder module recovers the object boundaries. We find that the encoder output stride of 8 performs the best for this data set. Therefore, the result of DeepLabv3+ with an output stride of 8 is reported.
2) Fully Convolutional DenseNets [61]: This method extends DenseNets [62] to deal with the problem of semantic segmentation. It uses a downsampling–upsampling style E-D network. Each stage (between the pooling layers) uses a dense block, in which each layer takes a concatenation of feature maps from all the preceding layers as input. It also concatenates skip connections from the encoder to the decoder. We report the results for the Fully Convolutional DenseNets with 56 layers (FC-DenseNet56) and with 103 layers (FC-DenseNet103).
3) SegNet [30]: An E-D network is followed by a linear classifier (an FC layer and a Softmax layer), which generates a class label for each image pixel. The encoder network of SegNet resembles the first part of the VGG16 classification network [33] with 13 convolutional layers.
4) Bayesian SegNet [34]: It has dropout layers added to the SegNet architecture at both training and testing phases to produce a probabilistic segmentation output.

We also experiment with the variants of SegNet, Bayesian SegNet, and the proposed network (C4K5 + HGP).

1) SegNet-Basic [30]: This is a smaller version of SegNet, where each of the encoder and decoder networks has four convolutional layers with filter size 7 × 7. Each convolutional layer is followed by a pooling or upsampling layer.
2) Bayesian SegNet-Basic [34]: Dropout layers are added to SegNet-Basic to produce probabilistic segmentation.
3) SegNet + HGP, Bayesian SegNet + HGP, SegNet-Basic + HGP, and Bayesian SegNet-Basic + HGP: These are, respectively, the variants of SegNet, Bayesian


Fig. 8. Pedestrian lane segmentation using different algorithms. Column 1: input images. Column 2: output of the Border-detection + segmentation method
[16]. Column 3: output of SegNet [30]. Column 4: output of Bayesian SegNet [34]. Column 5: output of the proposed DL-HGP network. See the electronic
color image.

SegNet, SegNet-Basic, and Bayesian SegNet-Basic, with the higher accuracy between the two evaluated
where the linear classifier is replaced by the HGP traditional methods, by 2.12% in accuracy and 1.67% in
classifier. F-measure, while being approximately 173 times faster
4) C4K5 + Linear-Classifier: This is a variant of the than Border-detection + segmentation for inference.
proposed network, where HGP is replaced by a linear 2) For each of the DL networks in the third group, replacing
classifier. its linear classifier by the HGP classifier in the fourth
All the DL-based methods in our experiments are imple- group improves the performance. Especially, we find
mented in Python using Tensorflow. For Bayesian SegNet and that combining the HGP classifier with a compact E-D
its variants, each test image is passed through the network network like SegNet-Basic and C4K5 provides more
30 times to obtain 30 samples of the posterior distribution of significant increases in accuracy (2.29% and 1.21%
the segmentation output. improvement) and F-measure (1.50% and 1.67%
The performance of different methods on the PLVP2 data set improvement). This result shows that using a powerful
using fivefold cross-validation is presented in Table VI. Here, GP classifier can help mitigate the need for a complex
the methods are arranged into four groups. The first group con- E-D network in pedestrian lane detection.
sists of traditional methods that use hand-engineered features. 3) The compact E-D network C4K5 has fewer than half
The second group includes the variants of DeepLabv3+ and the number of trained parameters of the E-D network
Fully Convolutional DenseNets. The third group comprises of SegNet: 13 017 472 versus 29 480 064 parameters.
the variants of SegNet, Bayesian SegNet, and the proposed However, the proposed network combining C4K5 with
network that use the linear classifier. The last group involves HGP is more accurate than SegNet + HGP. It also gives
the variants of SegNet, Bayesian SegNet, and the proposed the best performance among the tested methods in terms
network that use the HGP classifier. The following can be of accuracy, F-measure, and recall. It outperforms the
observed from the results in Table VI. state-of-the-art methods SegNet and Bayesian SegNet by
1) The DL methods outperform the traditional methods 1.21% and 0.82% in terms of accuracy, and 1.56% and
that use the hand-engineered features for unmarked 1.19% in terms of F-measure. This is because pairing
lane detection, and in general, they take a much shorter the compact E-D network C4K5 with the powerful HGP
time for inference. SegNet-Basic, which has the second classifier gives a similar modeling power to the larger
lowest accuracy among the DL methods, outperforms networks while mitigating the overfitting problem,
Border-detection + segmentation, which is the one especially for small data sets. We can also expect that

the proposed network can perform better than the above larger networks for new test samples that are quite different from those seen in the PLVP2 data set.
4) In Bayesian SegNet-Basic and Bayesian SegNet, adding the dropout layers for probabilistic segmentation has a positive side effect: It results in a slightly improved accuracy of these two networks over SegNet-Basic and SegNet. This improvement comes at the cost of much slower inference: Inference in Bayesian SegNet-Basic and Bayesian SegNet requires 0.484 and 0.995 s/test image, respectively. On the other hand, combining the dropout layers into the networks that use the HGP classifier does not seem to improve the performance of these networks: Bayesian SegNet-Basic + HGP classifier versus SegNet-Basic + HGP classifier, and Bayesian SegNet + HGP classifier versus SegNet + HGP classifier.
5) The Fully Convolutional DenseNets model (FC-DenseNet103) performs well, with results close to those of the proposed network. Hence, we believe that it would be beneficial to combine an E-D architecture based on dense blocks with an HGP classifier for the pedestrian lane-detection problem. We leave it for future exploration.

Fig. 9. Pedestrian lane-detection results of Bayesian SegNet and DL-HGP. Column 1: input images. Columns 2 and 3: detected lanes and the uncertainty maps by Bayesian SegNet. Columns 4 and 5: detected lanes and the uncertainty maps by DL-HGP. A brighter intensity in the uncertainty maps represents a higher uncertainty level. See the electronic color image.

2) Visual Comparison on Lane-Detection Results: Fig. 8 shows the visual comparative results of different methods for pedestrian lane detection. The compared methods include the Border-detection + segmentation method [16], SegNet [30], Bayesian SegNet [34], and DL-HGP. The Border-detection + segmentation method [16] is susceptible to background edges (as seen in images 1, 4, 5, and 7). As a result, its performance is poor for indoor scenes in which there exist many strong structured edges (image 7). It also does not perform well in a scene with extreme lighting conditions such as strong shadow (image 2) or when the colors of the lane and background region are similar (images 3 and 6). The three remaining (DL-based) methods are less susceptible to the above problems. In particular, they give much better results for indoor scenes (image 7). Among these methods, DL-HGP seems to be the one that is most robust to the background edges (as seen in images 1 and 5), and to the extreme lighting conditions (as seen in images 2, 3, and 5). There are three possible reasons for DL-HGP to be able to address these conditions better than the two other DL methods.
1) While the linear classifier at the end of a traditional DNN requires the network to extract linearly separable features, the HGP classifier does not. The HGP can classify well using nonlinearly separable features.
2) Since HGP does not require linearly separable features, we can afford to use fewer network layers in DL-HGP (16 convolutional and four pooling layers) than in SegNet and Bayesian SegNet (26 convolutional and five pooling layers). Using fewer pooling layers means that the resolution of the final encoder feature maps is not reduced as much. Therefore, more edge and boundary information still remains on the final encoder feature maps. This helps DL-HGP to detect the lane boundary better.
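As a rough illustration of reason 2, the following sketch computes the spatial size of the final encoder feature map after four versus five pooling stages. It assumes that every pooling layer halves each spatial dimension (2 × 2 pooling with stride 2) and that the convolutions preserve spatial size. The input resolution of 384 × 512 and the helper name `encoder_output_size` are hypothetical and are not taken from the article.

```python
def encoder_output_size(height, width, num_pooling_layers, stride=2):
    """Return (height, width) of the feature map after repeated pooling.

    Assumption: each pooling layer downsamples by `stride` in both
    dimensions, and convolutions keep the spatial size unchanged.
    """
    for _ in range(num_pooling_layers):
        height //= stride
        width //= stride
    return height, width


# Hypothetical input resolution, chosen only so that the divisions are exact.
INPUT_H, INPUT_W = 384, 512

# Four pooling layers (as stated for DL-HGP) versus five (as stated for SegNet).
compact = encoder_output_size(INPUT_H, INPUT_W, num_pooling_layers=4)
deeper = encoder_output_size(INPUT_H, INPUT_W, num_pooling_layers=5)

print("four pooling layers:", compact)  # (24, 32)
print("five pooling layers:", deeper)   # (12, 16)

# Ratio of spatial positions retained in the final encoder feature map.
print(compact[0] * compact[1] / (deeper[0] * deeper[1]))  # 4.0
```

Under these assumptions, the encoder with four pooling layers keeps four times as many spatial positions in its final feature map as the encoder with five, which is in line with the argument that more boundary detail survives.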

3) HGP is a generative model, which is well known for generalizing better and for being more robust to noise than a discriminative model (e.g., the linear classifier and the SVM), especially for small data sets [63]. The compact E-D network with a smaller number of parameters in the proposed DL-HGP architecture is also less likely to overfit training data and is more robust to noise.
3) Visual Comparison on Uncertainty Maps: Fig. 9 shows the visual comparative results for the uncertainty maps generated by Bayesian SegNet and by the proposed DL-HGP. It can be seen that DL-HGP is generally more certain about its prediction than Bayesian SegNet. The areas of high uncertainty generated by DL-HGP are mostly located near the lane boundaries. On the other hand, the areas of high uncertainty produced by Bayesian SegNet appear more at random locations, which can be seen clearly in images 1, 2, 3, 4, and 7. The uncertainty maps by DL-HGP can be more useful for the pedestrian lane-detection application, since they can be used to locate the lane boundaries and to warn the blind users about the areas of high uncertainty near the lane boundary to keep them within the safe pedestrian lane areas.

V. CONCLUSION

This article presents a method for pedestrian lane detection in unstructured environments using a hybrid DL-GP architecture to segment scene images into the pedestrian lane and background regions. The proposed hybrid network, which can be trained in an end-to-end manner, combines a compact convolutional E-D network with a powerful nonparametric HGP classifier to mitigate the overfitting problem while maintaining its modeling power. In addition to the segmentation output for each test image, the network also generates a map of well-calibrated uncertainty. Last but not least, a new data set of 5000 images for training and evaluating pedestrian lane-detection algorithms is introduced. It is expected to facilitate research in pedestrian lane detection, especially the application of DL in this area.

REFERENCES

[1] A. J. Jackson and J. S. Wolffsohn, Low Vision Manual. Amsterdam, The Netherlands: Elsevier, 2007.
[2] R. C. Simpson, “How many people would benefit from a smart wheelchair?” J. Rehabil. Res. Develop., vol. 45, no. 1, pp. 53–72, Dec. 2008.
[3] J. Kim and H. Shin, Algorithm and SoC Design for Automotive Vision Systems. Amsterdam, The Netherlands: Springer, 2014.
[4] I. Ulrich and J. Borenstein, “The GuideCane-applying mobile robot technologies to assist the visually impaired,” IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 31, no. 2, pp. 131–136, Mar. 2001.
[5] S. Shoval, J. Borenstein, and Y. Koren, “Auditory guidance with the Navbelt—A computerized travel aid for the blind,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 3, pp. 459–467, Mar. 1998.
[6] HumanWare. (2015). BrailleNote GPS. [Online]. Available: https://store.humanware.com/hus/braillenote-gps-software-and-receiver-package.html
[7] S. Se and M. Brady, “Road feature detection and estimation,” Mach. Vis. Appl., vol. 14, no. 3, pp. 157–165, Jul. 2003.
[8] M. S. Uddin and T. Shioyama, “Bipolarity and projective invariant-based zebra-crossing detection for the visually impaired,” in Proc. CVPRW, 2005, pp. 22–30.
[9] V. Ivanchenko, J. Coughlan, and S. Huiying, “Detecting and locating crosswalks using a camera phone,” in Proc. CVPRW, 2008, pp. 1–8.
[10] M. C. Le, S. L. Phung, and A. Bouzerdoum, “Pedestrian lane detection for assistive navigation of blind people,” in Proc. ICPR, 2012, pp. 2594–2597.
[11] J. Crisman and C. Thorpe, “SCARF: A color vision system that tracks roads and intersections,” IEEE Trans. Robot. Automat., vol. 9, no. 1, pp. 49–58, Feb. 1993.
[12] C. Tan, H. Tsai, T. Chang, and M. Shneier, “Color model-based real-time learning for road following,” in Proc. IEEE Conf. Intell. Transp. Syst., Sep. 2006, pp. 939–944.
[13] J. M. Alvarez, T. Gevers, and A. M. Lopez, “Vision-based road detection using road models,” in Proc. ICIP, 2009, pp. 2073–2076.
[14] C. Rasmussen, “Texture-based vanishing point voting for road shape estimation,” in Proc. BMVC, 2004, pp. 470–477.
[15] H. Kong, J.-Y. Audibert, and J. Ponce, “General road detection from a single image,” IEEE Trans. Image Process., vol. 19, no. 8, pp. 2211–2220, Aug. 2010.
[16] S. L. Phung, M. C. Le, and A. Bouzerdoum, “Pedestrian lane detection in unstructured scenes for assistive navigation,” Comput. Vis. Image Understand., vol. 149, pp. 186–196, Aug. 2016.
[17] J. Kim and M. Lee, “Robust lane detection based on convolutional neural network and random sample consensus,” in Proc. ICONIP, 2014, pp. 454–461.
[18] D. P. A. Nugroho and M. Riasetiawan, “Road lane segmentation using deconvolutional neural network,” in Proc. Int. Conf. Soft Comput. Data Sci., 2017, pp. 13–22.
[19] C. C. T. Mendes, V. Fremont, and D. F. Wolf, “Exploiting fully convolutional neural networks for fast road detection,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 3174–3179.
[20] T. N. A. Nguyen, A. Bouzerdoum, and S. L. Phung, “A scalable hierarchical Gaussian process classifier,” IEEE Trans. Signal Process., vol. 67, no. 11, pp. 3042–3057, Jun. 2019.
[21] M. A. Sotelo, F. J. Rodriguez, L. Magdalena, L. M. Bergasa, and L. Boquete, “A color vision-based lane tracking system for autonomous driving on unmarked roads,” Auto. Robots, vol. 16, no. 1, pp. 95–116, Jan. 2004.
[22] O. Ramstrom and H. Christensen, “A method for following unmarked roads,” in Proc. IEEE Intell. Vehicles Symp., 2005, pp. 650–655.
[23] Y. He, H. Wang, and B. Zhang, “Color-based road detection in urban traffic scenes,” IEEE Trans. Intell. Transp. Syst., vol. 5, no. 4, pp. 309–318, Dec. 2004.
[24] C. Oh, J. Son, and K. Sohn, “Illumination robust road detection using geometric information,” in Proc. 15th Int. IEEE Conf. Intell. Transp. Syst., Sep. 2012, pp. 1566–1571.
[25] O. Miksik, P. Petyovsky, L. Zalud, and P. Jura, “Robust detection of shady and highlighted roads for monocular camera based navigation of UGV,” in Proc. ICRA, 2011, pp. 64–71.
[26] J. M. Á. Alvarez and A. M. Lopez, “Road detection based on illuminant invariance,” IEEE Trans. Intell. Transp. Syst., vol. 12, no. 1, pp. 184–193, Mar. 2011.
[27] J. M. Alvarez, T. Gevers, Y. LeCun, and A. M. Lopez, “Road scene segmentation from a single image,” in Proc. ECCV, 2012, pp. 376–389.
[28] J. D. Crisman and C. E. Thorpe, “UNSCARF—A color vision system for the detection of unstructured roads,” in Proc. ICRA, 1991, pp. 2496–2501.
[29] C.-K. Chang, C. Siagian, and L. Itti, “Mobile robot monocular vision navigation based on road region and boundary estimation,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 1043–1050.
[30] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[31] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, Jun. 2015, pp. 3431–3440.
[32] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proc. ICCV, Dec. 2015, pp. 1520–1528.
[33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
[34] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” in Proc. BMVC, 2017.
[35] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognit. Lett., vol. 30, no. 2, pp. 88–97, Jan. 2009.
[36] J. Jin, K. Fu, and C. Zhang, “Traffic sign recognition with hinge loss trained convolutional neural networks,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 5, pp. 1991–2000, Oct. 2014.
[37] M. M. Bejani and M. Ghatee, “Convolutional neural network with adaptive regularization to classify driving styles on smartphones,” IEEE Trans. Intell. Transp. Syst., early access, doi: 10.1109/TITS.2019.2896672.
[38] A. Pashaei, M. Ghatee, and H. Sajedi, “Convolution neural network joint with mixture of extreme learning machines for feature extraction and classification of accident images,” J. Real-Time Image Process., vol. 16, pp. 1–16, Feb. 2019.
[39] R. Bastani Zadeh, M. Ghatee, and H. R. Eftekhari, “Three-phases smartphone-based warning system to protect vulnerable road users under fuzzy conditions,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 7, pp. 2086–2098, Jul. 2018.
[40] Y. Lv, Y. Duan, W. Kang, Z. Li, and F. Wang, “Traffic flow prediction with big data: A deep learning approach,” IEEE Trans. Intell. Transp. Syst., vol. 16, no. 2, pp. 865–873, Apr. 2014.
[41] L. Zhang, F. Yang, Y. Daniel Zhang, and Y. J. Zhu, “Road crack detection using deep convolutional neural network,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 3708–3712.
[42] M. Kuss and C. E. Rasmussen, “Assessing approximate inference for binary Gaussian process classification,” J. Mach. Learn. Res., vol. 6, pp. 1679–1704, Oct. 2005.
[43] Y. Altun, T. Hofmann, and A. J. Smola, “Gaussian process classification for segmenting and annotating sequences,” in Proc. ICML, 2004, pp. 4–11.
[44] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, “Active learning with Gaussian processes for object categorization,” in Proc. ICCV, 2007, pp. 1–8.
[45] H. Nickisch and C. E. Rasmussen, “Approximations for binary Gaussian process classification,” J. Mach. Learn. Res., vol. 9, no. 10, pp. 2035–2078, 2008.
[46] T. P. N. A. Centeno and N. D. Lawrence, “Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis,” J. Mach. Learn. Res., vol. 7, pp. 455–491, Feb. 2006.
[47] N. Lawrence, M. Seeger, and R. Herbrich, “Fast sparse Gaussian process methods: The informative vector machine,” in Proc. NIPS, 2003, pp. 625–632.
[48] A. Naish-Guzman and S. Holden, “The generalized FITC approximation,” in Proc. NIPS, 2007, pp. 1057–1064.
[49] J. Hensman, A. Matthews, and Z. Ghahramani, “Scalable variational Gaussian process classification,” in Proc. AISTATS, 2015, pp. 351–360.
[50] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani, “MCMC for variationally sparse Gaussian processes,” in Proc. NIPS, 2015, pp. 1648–1656.
[51] C. E. Rasmussen and Z. Ghahramani, “Infinite mixtures of Gaussian process experts,” in Proc. NIPS, 2002, pp. 881–888.
[52] V. Tresp, “Mixtures of Gaussian processes,” in Proc. NIPS, 2000, pp. 654–660.
[53] T. N. A. Nguyen, A. Bouzerdoum, and S. L. Phung, “Variational inference for infinite mixtures of sparse Gaussian processes through KL-correction,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 2579–2583.
[54] J. Shi, R. Murray-Smith, and D. Titterington, “Bayesian regression and classification using mixtures of Gaussian processes,” Int. J. Adapt. Control Signal Process., vol. 17, no. 2, pp. 149–161, Mar. 2003.
[55] T. N. A. Nguyen, A. Bouzerdoum, and S. L. Phung, “Stochastic variational hierarchical mixture of sparse Gaussian processes for regression,” Mach. Learn., vol. 107, no. 12, pp. 1947–1986, Dec. 2018.
[56] M. Girolami and S. Rogers, “Variational Bayesian multinomial probit regression with Gaussian process priors,” Neural Comput., vol. 18, no. 8, pp. 1790–1817, Aug. 2006.
[57] J. Fritsch, T. Kuhnl, and A. Geiger, “A new performance measure and evaluation benchmark for road detection algorithms,” in Proc. 16th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2013, pp. 1693–1700.
[58] A. G. D. G. Matthews et al., “GPflow: A Gaussian process library using TensorFlow,” J. Mach. Learn. Res., vol. 18, no. 40, pp. 1–6, 2017.
[59] M. K. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in Proc. AISTATS, 2009, pp. 567–574.
[60] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proc. ECCV, 2018, pp. 801–818.
[61] S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers Tiramisu: Fully convolutional DenseNets for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 11–19.
[62] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in Proc. CVPR, 2017, vol. 1, no. 2, p. 3.
[63] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” in Proc. NIPS, 2002, pp. 841–848.

Thi Nhat Anh Nguyen received the B.Eng. degree (Hons.) and the M.Eng. degree from Nanyang Technological University, Singapore, in 2007 and 2012, respectively, and the Ph.D. degree from the University of Wollongong, Wollongong, NSW, Australia, in 2019, all in computer engineering.
Her research interests include machine learning, pattern recognition, computer vision, and image processing.

Son Lam Phung (Senior Member, IEEE) received the B.Eng. degree (Hons.) and the Ph.D. degree from Edith Cowan University, Perth, WA, Australia, in 1999 and 2003, respectively, both in computer engineering.
He is currently an Associate Professor with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW, Australia. His general research interests are in the areas of image and signal processing, neural networks, pattern recognition, and machine learning.
Dr. Phung received the University and Faculty Medals in 2000.

Abdesselam Bouzerdoum (Senior Member, IEEE) received the M.S.E.E. and Ph.D. degrees in electrical engineering from the University of Washington, Seattle, WA, USA, in 1986 and 1991, respectively.
In 1991, he joined The University of Adelaide, Adelaide, SA, Australia, where he was a Research Associate and then a Senior Assistant Professor. In 1998, he joined Edith Cowan University, Perth, WA, Australia, as an Associate Professor. In 2004, he was appointed as a Professor of computer engineering and the Head of the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW, Australia, where he served as an Associate Dean Researcher for the Faculty of Informatics from 2007 to 2013. He is currently serving as the Head of the Information and Computing Technology Division, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar, and a Senior Professor of computer engineering with the University of Wollongong. He held several visiting professor appointments at the Institut Galilée, Université Paris-13, Villetaneuse, France; LAAS/CNRS, Toulouse, France; Institut Femto-st, Besancon, France; Villanova University, Villanova, PA, USA; and The Hong Kong University of Science and Technology, Hong Kong. He has published over 360 technical articles and graduated 50 Ph.D. and research master’s students, and supervised over 60 honors theses. His research interests include radar imaging and signal processing, image processing, vision, machine learning, and pattern recognition.
Dr. Bouzerdoum was a member of the Australian Research Council (ARC) College of Experts from 2009 to 2011. He received the Eureka Prize for Outstanding Science in Support of Defense or National Security in 2011, the Chester Sall Award of the IEEE TRANSACTIONS ON CONSUMER ELECTRONICS in 2005, and the Distinguished Researcher Award, Chercheur de Haut Niveau, from the French Ministry in 2001. He was the Deputy Chair of the Engineering, Mathematics and Informatics Panel from 2010 to 2011. He has served as an Associate Editor for five international journals, including the IEEE TRANSACTIONS ON IMAGE PROCESSING and the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS from 1999 to 2006.