∗ HRL Laboratories, LLC, Malibu, CA 90265.
† Imaging and Data Science Laboratory, University of Virginia, Charlottesville, VA 22908.
Fig. 1. A visualization of the Radon transform and distribution slices. Panel (a) shows the distribution I, the direction θ (red arrow), the integration hyperplanes H(t, θ) (shown as orange lines for d = 2), and the slices/projections RI(·, θ) for four different θs. Panel (b) shows the full sinogram RI (i.e., the Radon transform), where the dotted blue lines indicate the slices shown in Panel (a).
The cross-entropy loss $-\frac{1}{N}\sum_{k=1}^{N} y_k \cdot \log(f_\theta(x_k))$ can be viewed as an estimate of −Ex∼pX [h(x) · log(fθ(x))], and minimizing it is therefore equivalent to minimizing the KL-divergence between pY and pfθ.

Next, we consider the formulations for the standard Radon transform and its generalized version, and demonstrate a connection between this transformation and the statistical learning concepts reviewed above.

A. Radon transform

The standard Radon transform, R, maps a distribution pX to the infinite set of its integrals over the hyperplanes of Rd and is defined as

R p_X(t, \theta) := \int_X p_X(x)\, \delta(t - x \cdot \theta)\, dx,    (5)

for all θ ∈ Sd−1 and all t ∈ R, where Sd−1 is the unit sphere in Rd and δ is the one-dimensional Dirac delta function. Each hyperplane can be written as

H(t, \theta) = \{ x \in \mathbb{R}^d \mid x \cdot \theta = t \},    (6)

which can alternatively be thought of as the level set of the function g(x, θ) = x · θ at the value t. For a fixed θ, the integrals over all hyperplanes orthogonal to θ define a continuous function RpX(·, θ) : R → R, which is a projection/slice of pX. We note that the Radon transform is more broadly defined as a linear operator R : L1(Rd) → L1(R × Sd−1), where L1(X) := {I : X → R | ∫_X |I(x)| dx < ∞}. Figure 1 provides a visual representation of the Radon transform, the integration hyperplanes H(t, θ) (i.e., lines for d = 2), and the slices RpX(·, θ).

The Radon transform is an invertible linear transformation (i.e., a linear bijection). Its inverse, denoted by R−1, is defined as

p_X(x) = R^{-1}\big( R p_X(t, \theta) \big) = \int_{\mathbb{S}^{d-1}} \big( R p_X(\cdot, \theta) * \eta(\cdot) \big) \circ (x \cdot \theta)\, d\theta,    (7)

where η(·) is a one-dimensional high-pass filter with corresponding Fourier transform Fη(ω) ≈ c|ω|^{d−1} (it appears due to the Fourier slice theorem; see the supplementary material) and '∗' denotes the convolution operation. The above definition of the inverse Radon transform is also known as the filtered back-projection method, which is extensively used for image reconstruction in the biomedical imaging community. Intuitively, each one-dimensional projection/slice RpX(·, θ) is first filtered via a high-pass filter and then smeared back into X along H(·, θ) to approximate pX; the filtered summation of all smeared approximations then reconstructs pX. Note that acquiring an infinite number of projections is not feasible in practice, so the integration in the filtered back-projection formulation is replaced with a finite summation over projections.
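To make the filtered back-projection recipe above concrete, here is a minimal NumPy sketch for d = 2, written against Equation (7). The function name, the discrete ramp filter, and the π/K normalization are illustrative assumptions rather than the exact discretization used by tomography software.

```python
import numpy as np

def filtered_backprojection(sinogram, t, angles, points):
    """Approximate p_X at `points` from finitely many slices (cf. Eq. (7)).

    sinogram : (K, T) array; sinogram[k] is R p_X(., theta_k) sampled at the
               equally spaced offsets `t`
    angles   : (K,) array of angles a_k, with theta_k = (cos a_k, sin a_k)
    points   : (M, 2) array of locations x at which to evaluate p_X
    """
    K, T = sinogram.shape
    freqs = np.fft.fftfreq(T, d=t[1] - t[0])
    ramp = np.abs(freqs)                      # |omega|^{d-1} high-pass filter for d = 2
    recon = np.zeros(len(points))
    for k, a in enumerate(angles):
        # filter the slice with eta in the Fourier domain ...
        filtered = np.fft.ifft(np.fft.fft(sinogram[k]) * ramp).real
        # ... then smear it back along H(., theta_k) by evaluating it at x . theta_k
        proj = points @ np.array([np.cos(a), np.sin(a)])
        recon += np.interp(proj, t, filtered)
    return recon * np.pi / K                  # finite sum over K angles replaces the integral
```

The finite sum over K angles stands in for the integral over Sd−1, as noted in the text.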
1) Radon transform of empirical PDFs: In most machine learning applications one does not have direct access to the actual distribution of the data, but only to its samples, xn ∼ pX. In such scenarios the empirical distribution of the data is used as an approximation for pX:

p_X(x) \approx \hat{p}_X(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n),    (8)

where δ is the Dirac delta function in Rd. Then, it is straightforward to show that the Radon transform of p̂X is

R \hat{p}_X(t, \theta) = \frac{1}{N} \sum_{n=1}^{N} \delta(t - x_n \cdot \theta).    (9)

See the supplementary material for a detailed derivation of Equation (9). Given the high-dimensional nature of estimating the density pX in Rd, one requires a large number of samples. The projections/slices of pX, RpX(·, θ), however, are one-dimensional, and therefore it may not be critical to have a large number of samples to estimate these one-dimensional densities.
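As a concrete reading of Equations (8) and (9), the hypothetical helper below approximates one slice Rp̂X(·, θ) by projecting the samples onto θ and histogramming the resulting scalars (the normalized histogram stands in for the sum of Dirac deltas). The bin count, range, and Gaussian toy data are arbitrary choices.

```python
import numpy as np

def empirical_radon_slice(samples, theta, bins=64, t_range=(-3.0, 3.0)):
    """Estimate the slice R p_hat_X(., theta) of Eq. (9) from samples (N x d)."""
    theta = theta / np.linalg.norm(theta)     # theta must lie on the unit sphere
    t = samples @ theta                       # one-dimensional projections x_n . theta
    hist, edges = np.histogram(t, bins=bins, range=t_range, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, hist

# Example: slice of a 2-D Gaussian along theta = (cos(pi/4), sin(pi/4)).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
t, slice_vals = empirical_radon_slice(X, np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)]))
```

Only N scalars are binned per direction, which is the point made above: each slice is a one-dimensional density and is cheap to estimate.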
Fig. 3. Curve integrals for the half-moon dataset for a random linear projection, which is equivalent to a slice of the linear Radon transform (a); for a one-layer perceptron with random initialization, which is isomorphic to the linear projection (b); for a two-layer perceptron with random initialization (c); and for a trained multi-layer perceptron (d). The weights of all perceptrons are normalized so that ‖θij‖ = 1.
Fig. 4. Level curves of nodes introduced by different activation functions. Parameters θ1 ∈ R2, Θ1 ∈ R50×2, θ12 ∈ R50 are randomly initialized (with the same seed) for the first three column demonstrations. Parameters for the last column, Θ1 ∈ R50×2, Θ2 ∈ R100×50, θ13 ∈ R100, are optimized by minimizing a misclassification loss.

Figure 3, columns (c) and (d), demonstrates the level sets and the line integrals of pX(x) using a multi-layer perceptron as gθ. Column (c) is initialized randomly, and column (d) shows gθ after the network parameters are trained in a supervised classification setting to discriminate the modes of the half-moon distribution. It can be seen that after training the level sets H(t, θ) only traverse a single mode of the half-moon distribution, which indicates that samples from different modes are not projected onto the same point (i.e., the distribution is not integrated across different modes). The facility with which neural networks generate highly nonlinear functions, even with relatively few parameters, also becomes readily apparent (below we compare these to other polynomials). We note that with just one layer, NNs can form nonlinear decision boundaries, as the superposition of surfaces formed by σ(θ11 · x) + σ(θ21 · x) + · · · can add curvature to the resulting surface. Note also that, generally speaking, the integration streamlines (hypersurfaces for higher-dimensional data) can become more curved (nonlinear) as the number of layers increases.
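The observation that a superposition of one-layer surfaces bends the level sets can be checked directly. The sketch below, with an arbitrary seed, grid, and three sigmoid units, evaluates a single unit σ(θ1 · x), whose level lines are straight, and the sum of the three units, whose level lines are curved.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 2))
Theta /= np.linalg.norm(Theta, axis=1, keepdims=True)   # ||theta_i|| = 1, as in Fig. 3

xs = np.linspace(-2.0, 2.0, 300)
X, Y = np.meshgrid(xs, xs)
grid = np.stack([X.ravel(), Y.ravel()], axis=1)          # 2-D evaluation points

g_single = sigmoid(grid @ Theta[0]).reshape(X.shape)          # straight level lines
g_sum = sigmoid(grid @ Theta.T).sum(axis=1).reshape(X.shape)  # curved level lines
# e.g. matplotlib's plt.contour(X, Y, g_sum) would draw level curves akin to Fig. 3(c)
```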
Activation functions

It has been noted recently that NNs (e.g. convolutional neural networks) can at times work better when σ is chosen to be the rectified linear unit (ReLU) function, as opposed to the sigmoid option [10], [11], [12]. The experience has encouraged others to try different activation functions such as the 'leaky'-ReLU [13]. While a theory describing which type of activation function should be used with which type of learning problem has yet to emerge, the interpretation of NNs as nonlinear projections can help highlight the differences

the final model. Finally, note that both the ReLU and the leaky-ReLU activation functions contain non-differentiable points, which are also imparted on the surface function (hence the sharp 'kinks' that appear along the iso-surface lines).
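One way to see the 'kinks' numerically is to compare one-sided slopes of a randomly initialized single-hidden-layer surface across the hyperplane θ1 · x = 0 of its first unit. In the sketch below (arbitrary sizes, seed, and step), ReLU and leaky-ReLU produce a finite jump in slope while the sigmoid does not.

```python
import numpy as np

def surface(x, Theta1, theta2, act):
    """g(x) = theta2 . act(Theta1 x) for a single hidden layer."""
    return act(x @ Theta1.T) @ theta2

relu = lambda z: np.maximum(z, 0.0)
leaky = lambda z: np.where(z > 0, z, 0.01 * z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(5, 2))
Theta1 /= np.linalg.norm(Theta1, axis=1, keepdims=True)
theta2 = rng.normal(size=5)

# Project a point onto the first unit's hyperplane theta_1 . x = 0, then compare
# the surface's slope just to either side of it along theta_1.
x0 = np.array([0.3, -0.2])
x0 -= (x0 @ Theta1[0]) * Theta1[0]
eps = 1e-4
for name, act in [("relu", relu), ("leaky-relu", leaky), ("sigmoid", sigmoid)]:
    right = (surface(x0 + eps * Theta1[0], Theta1, theta2, act)
             - surface(x0, Theta1, theta2, act)) / eps
    left = (surface(x0, Theta1, theta2, act)
            - surface(x0 - eps * Theta1[0], Theta1, theta2, act)) / eps
    print(name, "slope jump across theta_1 . x = 0:", right - left)
```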
Pooling

Pooling operations (e.g. average or maximum sampling) are typically used in large neural networks, especially of the CNN kind. The reasons are often practical, as subsampling can be used as a way to control and reduce the dimension (number of parameters) of the model. Another often-stated reason is that pooling can add a certain amount of invariance (e.g. to translation) to the NN model. In the case of average pooling, it is clear that the operation can also be written as a linear operator Θk in equation (12), where the pooling can be performed by replacing a particular row of Θk by the desired linear combination of two rows of Θk, for example. 'Max'-pooling, on the other hand, selects the maximum surface value (perceptron response) over all surfaces (each generated by a different perceptron) in a given layer. Figure 5 shows a graphical description of the concept, though it should be noted that, as defined above, the integration lines are not being added, but rather the underlying 'level' surfaces.
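The distinction drawn above fits in a few lines: average pooling is a linear operator (a matrix P whose rows form linear combinations of units, and which can therefore be absorbed into Θk), whereas 'max'-pooling keeps the largest perceptron response and is not linear. The layer sizes and pairing scheme below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))                   # 8 sample points in R^2
Theta1 = rng.normal(size=(6, 2))              # 6 perceptrons in the layer
h = np.maximum(x @ Theta1.T, 0.0)             # (8, 6) perceptron responses ("surfaces")

# Average pooling over pairs of adjacent units, written as a linear operator P:
# each row of P averages two units, so the pooled layer is still a linear map of h.
P = np.zeros((3, 6))
for i in range(3):
    P[i, 2 * i] = P[i, 2 * i + 1] = 0.5
avg_pooled = h @ P.T                          # (8, 3)

# 'Max'-pooling instead selects, per pooled unit, the largest perceptron response
# (the maximum over the corresponding level surfaces); this selection is not linear.
max_pooled = h.reshape(8, 3, 2).max(axis=2)   # (8, 3)
```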
Fig. 5. Demonstration of the max-pooling operation. The level surfaces corresponding to the perceptron outputs for a given input sample x are selected for maximum response (see text for more details).

Adversarial examples

It has often been noted that highly flexible nonlinear learning systems such as CNNs can be 'brittle' in the sense that a seemingly small perturbation of the input data can cause the learning system to produce confident, but erroneous, outputs. Such perturbed examples are often termed adversarial examples. Figure 7 utilizes the integral-geometric perspective described above to provide a visualization of how neural networks (as well as other classification systems) can be fooled by small perturbations. To find the minimum displacement that could cause misclassification, using the blue point as the starting point x0, we perform gradient ascent, xn+1 = xn + γ∇g(xn, θ), until we reach the other side of the decision boundary (indicated by the orange point). We limit the magnitude of the displacement to be small enough that the two points belong to the same distribution. However, once integrated along the iso-surfaces corresponding to the NN drawn in the figure, due to the uneven curvature of the corresponding surface, the two points are projected onto opposite ends of the output node, thus fooling the classifier into making a confident, but erroneous, prediction.
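A minimal sketch of the gradient-ascent construction xn+1 = xn + γ∇g(xn, θ) described above, using a small randomly initialized leaky-ReLU surface. The step size, decision threshold, and displacement budget are hypothetical stand-ins for the quantities behind Figure 7.

```python
import numpy as np

def g(x, Theta1, theta2):
    """Output surface g(x) = theta2 . sigma(Theta1 x) with leaky-ReLU sigma."""
    z = Theta1 @ x
    return theta2 @ np.where(z > 0, z, 0.01 * z)

def grad_g(x, Theta1, theta2):
    """Gradient of g with respect to the input x (defined almost everywhere)."""
    z = Theta1 @ x
    return Theta1.T @ (theta2 * np.where(z > 0, 1.0, 0.01))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(20, 2))
theta2 = rng.normal(size=20)

# Gradient ascent from x0, stopped once the surface value crosses a (hypothetical)
# decision threshold or the displacement exceeds a small budget.
x = x0 = np.array([0.5, -0.5])
gamma, budget = 0.05, 0.3
threshold = g(x0, Theta1, theta2) + 0.5
for _ in range(200):
    if g(x, Theta1, theta2) >= threshold or np.linalg.norm(x - x0) >= budget:
        break
    x = x + gamma * grad_g(x, Theta1, theta2)
```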
Fig. 6. The level curves (i.e. hyperplanes and hypersurfaces) H(·, θ), with optimally discriminant θ, for different defining functions, namely linear (i.e., the standard Radon transform), circular, a homogeneous polynomial of degree 5, and a multi-layer perceptron with leaky-ReLU activations.

SUMMARY AND FUTURE DIRECTIONS

In this note we explored links between Radon transforms and Artificial Neural Networks to study the properties of the latter as an operator on the probability density function pX associated with a given learning problem. More specifically, it can be shown that the probability density function associated with any output node of a neural network (or a single perceptron within a hierarchical NN) can be interpreted as a particular hyperplane integration operation over the distribution pX. This interpretation has natural connections with the N-dimensional generalized Radon transforms, which similarly propose that a high-dimensional PDF can be represented by integrals defined over linear or nonlinear hyperplanes in X. The projections can be linear (in the case of simple linear logistic regression) or nonlinear, in which case the projections are computed over
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," Neural Networks, vol. 2, no. 1, pp. 53–58, 1989.
[3] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[4] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[5] D. T. Gillespie, "A theorem for physicists in the theory of random variables," American Journal of Physics, vol. 51, no. 6, pp. 520–533, 1983.
[6] J. Radon, "Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten," Berichte Sächsische Akademie der Wissenschaften, Math.-Phys. Klasse, vol. 69, p. 262, 1917.
[7] L. Ehrenpreis, The Universality of the Radon Transform. Oxford University Press on Demand, 2003.
[8] P. Kuchment, "Generalized transforms of Radon type and their applications," in Proceedings of Symposia in Applied Mathematics, vol. 63, 2006, p. 67.
[9] G. Uhlmann, Inside Out: Inverse Problems and Applications. Cambridge University Press, 2003, vol. 47.
[10] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[11] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[13] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.