MachineLearningSlides PartOne
Machine Learning for
Physicists
University of Erlangen-Nuremberg
& Max Planck Institute for the
Science of Light
Florian Marquardt
Florian.Marquardt@fau.de
http://machine-learning-for-physicists.org
[Figure: drawing of neurons by Ramón y Cajal, ~1900]
[Figure: from biological neurons to an artificial neural network: an input layer connected to an output layer; example task: recognizing a "light bulb" from training images]
Recognize images
Describe images in sentences
Colorize images
Translate languages (French to Spanish, etc.)
Answer questions about a brief text
Play video games & board games at superhuman level
(in physics:)
predict properties of materials
classify phases of matter
represent quantum wave functions
Playing Atari video games
(DeepMind team, 2013)
neural network observes screen and
figures out a strategy to win, on its own
e.g. “Breakout”
• Exploiting translational invariance in image processing (convolutional networks)
• Unsupervised learning of essential features (autoencoders)
• Learning temporal data, e.g. sentences (recurrent networks)
• Learning a probability distribution (Boltzmann machine)
• Learning from rewards (reinforcement learning)
• Further tricks and concepts
• Modern applications to physics and science

Keras: package for Python
http://machine-learning-for-physicists.org
Very brief history of artificial neural networks
"Perceptrons" (50s/60s)
"Recurrent networks", "Convolutional networks" (80s/90s)
Deep nets for image recognition beat the competition (2012)
recommend:
online book by Nielsen ("Neural Networks and Deep
Learning") at https://neuralnetworksanddeeplearning.com
Software –
here: python & keras / tensorflow [see later]
A neural network
= a nonlinear function (of many variables) that
depends on many parameters
A neural network
[Figure: input layer → hidden layers → output layer]
output of a neuron = nonlinear function of weighted sum of inputs

weighted sum: $z = \sum_j w_j y_j + b$   ($b$ = offset, "bias")
output value: $f(z)$
weights: $w_1, w_2, w_3, \ldots$
input values: $y_1, y_2, y_3, \ldots$

Typical choices for the nonlinearity $f(z)$: sigmoid, reLU (rectified linear unit)
A neural network
input layer
The values of input layer neurons are fed
into the network from the outside
[Figure: example input values fed into the input layer (e.g. 0, 0.1, 0.4, 0.2, 1.3, -1, 0, 0) and the resulting output values (e.g. 0.5, 0.1, 0.7)]
$z_j = \sum_k w_{jk}\, y_k^{\rm in} + b_j$   (j = output neuron, k = input neuron)

in matrix/vector notation:
$z = w\, y^{\rm in} + b$

elementwise nonlinear function:
$y_j^{\rm out} = f(z_j)$
from numpy import *   # slide-style import, provides dot, sin, linspace, ...

for j in range(N):    # a simple loop
    do_step(j)

u=dot(M,v)            # matrix-vector product

def f(x):             # defining a function
    return(sin(x)/x)

x=linspace(-5,5,N)    # N equally spaced points between -5 and 5

Python
A few lines of "python"!
in matrix/vector notation: $z = w\, y^{\rm in} + b$ — a few lines of "python"!
A first example: input layer with N0=3 neurons, output layer with N1=2 neurons.
Set up random weights and biases, choose input values, and apply the network!
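A minimal numpy sketch of this step (the particular input values and the sigmoid choice are illustrative assumptions, not the lecture code):

import numpy as np

N0, N1 = 3, 2                                  # number of input / output neurons
w = np.random.uniform(-1, 1, size=(N0, N1))    # random weights
b = np.random.uniform(-1, 1, size=N1)          # random biases

y_in = np.array([0.2, -0.5, 1.0])              # some input values

z = np.dot(y_in, w) + b                        # weighted sums, one per output neuron
y_out = 1/(1 + np.exp(-z))                     # sigmoid nonlinearity
print(y_out)                                   # the N1=2 output values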
A basic network (without hidden layer)

$z = w_1 y_1 + w_2 y_2 + b$, output $f(z)$

[Figure: in the $(y_1, y_2)$ plane, the line $z = 0$ separates the regions $z > 0$ and $z < 0$; the output $f(z)$ changes smoothly across this line]
Processing batches: many samples in parallel

one sample: vector y of size Nin
many samples: matrix y of size Nsamples x Nin (one sample per row)

Apply matrix/vector operations to operate on all samples simultaneously! Avoid loops! (slow)

Note on broadcasting: with a weight matrix of size Nin x Nout and a bias vector b of size Nout, Python interprets M = A + b by adding b to every row of A, so a batch of shape Nsamples x Nin becomes Nsamples x Nout.
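A short numpy sketch of the batch version (sizes chosen only for illustration):

import numpy as np

Nin, Nout, Nsamples = 3, 2, 5

w = np.random.uniform(-1, 1, size=(Nin, Nout))
b = np.random.uniform(-1, 1, size=Nout)

y = np.random.uniform(-1, 1, size=(Nsamples, Nin))  # one sample per row

z = np.dot(y, w) + b         # broadcasting adds b to every row
y_out = 1/(1 + np.exp(-z))   # shape Nsamples x Nout, no explicit loop over samples

print(y_out.shape)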
We can create complicated functions...
...but can we create arbitrary functions?
[Figure: example of a network output $y_{\rm out}(y_1, y_2)$ over the $(y_1, y_2)$ plane]
Approximating an arbitrary nonlinear function

Target: some function $F(y)$ of a single variable $y$.

Build a small network with one input $y$, two hidden neurons $y_1, y_2$, and one output:

$y_{\rm out} = F_1\, f\big(w\,(y - Y_1)\big) + F_2\, f\big(w\,(y - Y_2)\big)$
($f$ = sigmoid = smooth step)

Both hidden neurons receive the same input weight $w$; use the biases $b_1 = -w Y_1$ and $b_2 = -w Y_2$ to place the two smooth steps, of heights $F_1$ and $F_2$, at the positions $Y_1$ and $Y_2$.

[Figure: generalization to many steps of heights $F_1, F_2, F_3, F_4$ at positions $y_1, y_2, y_3, y_4$, approximating an arbitrary function of $y$]
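A small numpy sketch of this construction (the target function, the step positions and the value of $w$ are illustrative choices, not from the slides):

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

F = lambda y: np.sin(3*y)                  # target function (illustrative choice)

w = 50.0                                   # large weight -> sharp smooth steps
Y = np.linspace(-1, 1, 20)                 # step positions Y_k
heights = np.diff(F(Y), prepend=0.0)       # step heights F_k (differences of the target)

y = np.linspace(-1, 1, 200)
y_out = sum(Fk*sigmoid(w*(y - Yk)) for Fk, Yk in zip(heights, Y))

print(np.max(np.abs(y_out - F(y))))        # rough approximation error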
Approximating an arbitrary 2D nonlinear function

First step: create a quarter-space "step function" that equals 1 in a quarter of the $(y_1, y_2)$ plane and 0 elsewhere.
Trick: "AND" operation in a neural network

An output neuron $y_{\rm out}$ fed by both inputs with a large weight $w$ acts as an approximate AND:

y1 \ y2:   0   1
   0:      0   0
   1:      0   1
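A tiny numpy sketch of this trick (the particular weight and bias values are an assumed choice that realizes the AND behaviour):

import numpy as np

def sigmoid(z):
    return 1/(1 + np.exp(-z))

w = 20.0            # large weight
b = -1.5*w          # bias placed so that only y1=y2=1 lifts z above zero (assumed choice)

for y1 in (0, 1):
    for y2 in (0, 1):
        z = w*y1 + w*y2 + b
        print(y1, y2, round(sigmoid(z), 3))   # approximately y1 AND y2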
Homework
Figure out how to implement the following operations using a
neural network:
OR
XOR (gives 1 only if inputs are different, i.e. for 10 and 01)
Approximating an arbitrary 2D nonlinear function

Combine a step in $y_1$ and a step in $y_2$ with the AND trick:
$f\big(w\,(y_1 - \bar y_1)\big)$ AND $f\big(w\,(y_2 - \bar y_2)\big)$

[Figure: the result is a quarter-space step, equal to 1 for $y_1 > \bar y_1$ and $y_2 > \bar y_2$, and 0 elsewhere]
Approximating an arbitrary 2D nonlinear function

[Figure: summing quarter-space steps with heights $F_1, F_2, F_3, \ldots$ builds up an arbitrary function of $(y_1, y_2)$: inputs $y_1, y_2$, two hidden layers, one output]
Universality of neural networks
Implement them on the computer and play around...
Homework
Extra * bonus version:
We have indicated how to approximate arbitrary
functions in 2D using 2 hidden layers (with our AND
construction, and summing up in the end)
A complicated nonlinear function, from input layer to output layer, that depends on all the weights and biases:
$y^{\rm out} = F_w(y^{\rm in})$

Note: When we write "w" as a subscript of F, we mean all the weights and also the biases.
Note: When we write "$y^{\rm out}$", we mean the whole vector of output values.
How to choose the weights (and biases)?

[Figure: training examples = known data points, as a function of $y^{\rm in}$]

Challenge: curve fitting with a million parameters!
Stochastic Gradient Descent
Neural Net Structure
A neural network (recap): a complicated nonlinear function from the input layer to the output layer that depends on all the weights and biases,
$y^{\rm out} = F_w(y^{\rm in})$
("w" as a subscript of F stands for all the weights and also the biases; "$y^{\rm out}$" is the whole vector of output values).
We have: $y^{\rm out} = F_w(y^{\rm in})$ (the neural network; $w$ here also stands for the biases)

We would like: $y^{\rm out} \approx F(y^{\rm in})$, the desired "target" function.

Cost function measures the deviation:
$C(w) = \tfrac{1}{2}\,\big\langle\, \| F_w(y^{\rm in}) - F(y^{\rm in}) \|^2 \,\big\rangle$
(vector norm, average over all samples)

Approximate version, for N samples:
$C(w) \approx \dfrac{1}{2N} \sum_{s=1}^{N} \| F_w(y^{(s)}) - F(y^{(s)}) \|^2$
(s = index of sample)
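A short numpy sketch of this batch-averaged cost (the random arrays stand in for actual network outputs and targets):

import numpy as np

def cost(y_net, y_target):
    # quadratic cost: (1/2N) * sum over samples of the squared deviation (vector norm)
    return 0.5*np.mean(np.sum((y_net - y_target)**2, axis=1))

# illustrative usage with placeholder arrays of shape (Nsamples, Nout)
y_net = np.random.rand(10, 3)
y_target = np.random.rand(10, 3)
print(cost(y_net, y_target))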
[Figure: the network output $y^{\rm out} = F_w(y^{\rm in})$ as a curve, together with the training examples (known data points), plotted against $y^{\rm in}$]

Minimizing C for this case: "least-squares fitting"!
Method: "Sliding down the hill" ("gradient descent")

$\dot{w} \sim -\nabla_w C(w)$

[Figure: the cost landscape $C(w)$]

True gradient of C: for sufficiently small steps, the sum over many steps approximates the true gradient (because it is an additional average).
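A minimal sketch of one stochastic-gradient-descent update (eta, the flattened weight vector and the gradient function are placeholders for illustration):

import numpy as np

eta = 0.1                          # learning rate (stepsize)

def grad_C(w, batch):
    # placeholder: in practice this gradient is obtained from backpropagation
    return np.zeros_like(w)

w = np.random.randn(100)           # all weights/biases, flattened for illustration
batch = None                       # stand-in for a batch of training samples

w = w - eta*grad_C(w, batch)       # one (stochastic) gradient-descent step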
$\dfrac{\partial C(w)}{\partial w_*} = ?$   where $w_*$ is some weight (or bias), somewhere in the net

(image: Wikimedia)
Small network: calculate the derivative of the cost function "by hand"

OUTPUT: $f(z)$ with $z = w_1 y_1 + w_2 y_2 + b$; INPUT: $y_1, y_2$

cost: $C(w) = \tfrac{1}{2}\big\langle \big(f(z) - F(y_1, y_2)\big)^2 \big\rangle$   (network output vs. desired output)

$\dfrac{\partial C}{\partial w_1} = \Big\langle \big(f(z) - F\big)\, f'(z)\, \dfrac{\partial z}{\partial w_1} \Big\rangle, \qquad \dfrac{\partial z}{\partial w_1} = y_1$
Backpropagation

Notation:
$y_j^{(n)}$ — value of neuron j in layer n
$z_j^{(n)}$ — input value for "y = f(z)"
$w_{jk}^{n,n-1}$ — weight (neuron k in layer n-1 feeding into neuron j in layer n)

We have: $C(w) = \big\langle C(w, y^{\rm in}) \big\rangle$, the average over the cost values for individual inputs.

We get (n = output layer, $w_*$ = some weight or bias, somewhere in the net):
$\dfrac{\partial C(w, y^{\rm in})}{\partial w_*} = \sum_j \big(y_j^{(n)} - F_j(y^{\rm in})\big)\, \dfrac{\partial y_j^{(n)}}{\partial w_*} = \sum_j \big(y_j^{(n)} - F_j(y^{\rm in})\big)\, f'(z_j^{(n)})\, \dfrac{\partial z_j^{(n)}}{\partial w_*}$

(we used $y_j^{(n)} = f(z_j^{(n)})$)
Backpropagation: apply the chain rule repeatedly

We want: the change of neuron j in layer n due to a change of some arbitrary weight $w_*$:
$\dfrac{\partial z_j^{(n)}}{\partial w_*} = \sum_k \dfrac{\partial z_j^{(n)}}{\partial y_k^{(n-1)}}\, \dfrac{\partial y_k^{(n-1)}}{\partial w_*} = \sum_k w_{jk}^{n,n-1}\, f'(z_k^{(n-1)})\, \dfrac{\partial z_k^{(n-1)}}{\partial w_*}$

This defines the matrix that propagates the derivative back from layer n to layer n-1:
$M_{jk}^{(n,n-1)} = w_{jk}^{(n,n-1)}\, f'(z_k^{(n-1)})$
Backpropagation

delta=(y_layer[-1]-y_target)*df_layer[-1]
dw_layer[-1]=dot(transpose(y_layer[-2]),delta)/batchsize
db_layer[-1]=delta.sum(0)/batchsize
for j in range(NumLayers-1):
    delta=backward_step(delta,Weights[-1-j],df_layer[-2-j])
    dw_layer[-2-j]=dot(transpose(y_layer[-3-j]),delta)/batchsize
    db_layer[-2-j]=delta.sum(0)/batchsize
Homework

For the example case (learning a 2D function; see code on the website), try out the effects of:
- Value of the stepsize eta
- Layout of the network (number of neurons and number of layers)
- Initialization of the weights

How do these things affect the speed of learning and the final quality (final value of the cost function)?

Try them out also for other test functions (other than in the example).
Stochastic Gradient Descent
Neural Net Structure
Backpropagation: the principle

$\dfrac{\partial C}{\partial w_*} = ?$

Follow one path through the network, from the cost C down to the weight $w_*$: the first factor is $y^{\rm out} - F(y^{\rm in})$ (from the cost), then each layer along the path contributes a factor $f'(z)$ and a weight $w$, until the last $f'(z)$ is multiplied by the neuron value $y$ feeding into $w_*$.

...and now: sum over ALL possible paths!

Efficient implementation: repeated matrix/vector multiplication.

(omitting indices; should be clear from the figure)
Backpropagation summary

[Figure: propagation through the layers, analogous to time evolution]

...or as a matrix product:
$\psi(t) = \hat{U}(t)\,\psi(0) = \hat{U}_1 \hat{U}_2 \hat{U}_3 \ldots \psi(0)$
Backpropagation: the code (only 30 lines of code!)

def net_f_df(z): # calculate f(z) and f'(z)
    val=1/(1+exp(-z))
    return(val,exp(-z)*(val**2)) # return both f and f'

def forward_step(y,w,b): # calculate values in next layer
    z=dot(y,w)+b # w=weights, b=bias vector for next layer
    return(net_f_df(z)) # apply nonlinearity

def apply_net(y_in): # one forward pass through the whole network
    global y_layer, df_layer, Weights, Biases, NumLayers
    y=y_in # start with input values
    y_layer[0]=y
    for j in range(NumLayers): # loop through all layers
        # j=0 corresponds to the first layer above input
        y,df=forward_step(y,Weights[j],Biases[j])
        df_layer[j]=df # store f'(z)
        y_layer[j+1]=y # store f(z)
    return(y)

def backward_step(delta,w,df):
    # delta at layer N, of batchsize x layersize(N)
    # w [layersize(N-1) x layersize(N) matrix]
    # df = df/dz at layer N-1, of batchsize x layersize(N-1)
    return( dot(delta,transpose(w))*df )

def backprop(y_target): # one backward pass
    global y_layer, df_layer, Weights, Biases, NumLayers
    global dw_layer, db_layer # dCost/dw and dCost/db (w,b=weights,biases)
    global batchsize
    delta=(y_layer[-1]-y_target)*df_layer[-1]
    dw_layer[-1]=dot(transpose(y_layer[-2]),delta)/batchsize
    db_layer[-1]=delta.sum(0)/batchsize
    for j in range(NumLayers-1):
        delta=backward_step(delta,Weights[-1-j],df_layer[-2-j])
        dw_layer[-2-j]=dot(transpose(y_layer[-3-j]),delta)/batchsize
        db_layer[-2-j]=delta.sum(0)/batchsize
Neural networks: the ingredients
General purpose algorithm: feedforward & backpropagation
(implement once, use often)
Problem-specific:
Choose network layout (number of layers, number of neurons
in each layer, type of nonlinear functions, maybe specialized
structures of the weights) “Hyperparameters”
Generate training (& validation & test) samples: load from big
databases (that have to be compiled from the internet or by
hand!) or produce by software
Example: Learning a 2D function
see notebook (on website): MultiLayerBackProp

[Figure: target function over the $(y_0, y_1)$ plane, evaluated at sample points]

def make_batch():
    inputs=random.uniform(low=-0.5,high=+0.5,size=[batchsize,2])
    targets=zeros([batchsize,1]) # must have right dimensions
    targets[:,0]=myFunc(inputs[:,0],inputs[:,1])
    return(inputs,targets)
eta=.1
batchsize=1000
batches=2000
costs=zeros(batches)

for k in range(batches):
    y_in,y_target=make_batch()
    costs[k]=train_net(y_in,y_target,eta)
Example: Learning a 2D image
see notebook (on website): MultiLayer_ImageCompression

[Figure: target image as a function of $(y_0, y_1)$]

Network layers: 2,150,150,100,1 neurons
(after about 2 min of training, ~4 million samples)

Reminder: ReLU (rectified linear unit)
$f(z) = \begin{cases} z & \text{for } z > 0 \\ 0 & \text{for } z \le 0 \end{cases}$, with $z = w\, y + b$

[Figure: in the $(y_0, y_1)$ plane, a ReLU neuron is zero on one side of the line $z = 0$ and grows linearly on the other]
à la Franz Marc?
Try to understand how the network operates!
Switching on only a single neuron of the last hidden layer
Image shows the results of switching on individually each of the 100 neurons.
Weights from last hidden layer to output
deleted first 50 weights deleted last 50 weights kept only 10 out of 100
Influence of the learning rate (stepsize)
[Figure: cost vs. batch number for different learning rates]

Randomness (initial weights, learning samples): learning is a stochastic, nonlinear process!
[Figure: cost vs. batch number for several random initializations]

Influence of batch size / learning rate
[Figure: cost vs. batch number for different batch sizes]
Influence of batch size / learning rate

$C(w - \eta \nabla_w C) \;\approx\; C(w) \;-\; \eta\,(\nabla_w C)\cdot(\nabla_w C) \;+\; \ldots$

The new weights decrease C, since $(\nabla_w C)\cdot(\nabla_w C)$ is always > 0; the remaining terms are of higher order in $\eta$.

Potential problems:
- step too large: the higher-order terms become important [will not be a problem near the minimum of C]
- the approximation of C is bad [small batch size: the approximation of C fluctuates]

Sufficiently small learning rate: multiple training steps (batches) add up, and their average is like a larger batch.
Programming a general multilayer neural network & backpropagation was not so hard (once you know it!)

Could now go on to image recognition etc. with the same program!
But: want more flexibility and added
features!
For example:
• Arbitrary nonlinear functions for each layer
• Adaptive learning rate
• More advanced layer structures (such as convolutional networks)
• etc.
Keras
• Convenient neural network package for python
• Set up and training of a network in a few lines
• Based on underlying neural network / symbolic
differentiation package [which also provides run-
time compilation to CPU and GPU]: either ‘theano’
or ‘tensorflow’ [User does not care]
Keras
From the website keras.io
“Keras is a high-level neural networks API, written in Python
and capable of running on top of either TensorFlow or Theano. It
was developed with a focus on enabling fast experimentation.
Being able to go from idea to result with the least possible delay
is key to doing good research.”
from keras import *
from keras.models import Sequential
from keras.layers import Dense
Defining a network
layers with 2,150,150,100,1 neurons
net=Sequential()
net.add(Dense(150, input_shape=(2,), activation='relu'))
net.add(Dense(150, activation='relu'))
net.add(Dense(100, activation='relu'))
net.add(Dense(1, activation='relu'))
"Sequential": the usual neural network, with several layers
"Dense": "fully connected layer" (all weights there)
input_shape: number of input neurons
'relu': rectified linear unit
'Compiling' the network
(SGD = stochastic gradient descent, 'loss' = cost, lr = learning rate = stepsize)

net.compile(loss='mean_squared_error',
            optimizer=optimizers.SGD(lr=0.1),
            metrics=['accuracy'])
Training the network

batchsize=20
batches=200
costs=zeros(batches)

for k in range(batches):
    y_in,y_target=make_batch()
    costs[k]=net.train_on_batch(y_in,y_target)[0]

y_out=net.predict_on_batch(y_in)
Will learn:
- distinguish categories
- “softmax” nonlinearity for probability distributions
- “categorical cross-entropy” cost function
- training/validation/test data
- “overfitting” and some solutions
output: category (classification), "one-hot encoding":
categories: 0 1 2 3 4 5 6 7 8 9
target:     0 0 0 0 0 0 1 0 0 0

input: instead of coordinates (x, y), now all pixel values (val1, val2, val3, val4, ...)

output: probabilities for the categories j (select the largest), e.g.
.1 0 0 0 .1 .1 .7 0 0 0
$f_j(z_1, z_2, \ldots) = \dfrac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}$
(multi-variable generalization of the sigmoid)
"Softmax" activation function

$f_j(z_1, z_2, \ldots) = \dfrac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}$

in keras:
net.add(Dense(10,activation='softmax'))
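A short numpy sketch of the softmax itself (subtracting the maximum for numerical stability is an added standard trick, not from the slides):

import numpy as np

def softmax(z):
    # softmax over the last axis; subtracting the max avoids overflow in exp
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e/np.sum(e, axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 0.5])
print(softmax(z), softmax(z).sum())   # probabilities, summing to 1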
Entropy

Derivative of the quadratic cost with a softmax output:
$\dfrac{\partial}{\partial w} \sum_j \big(f_j(z) - y_j^{\rm target}\big)^2 = 2 \sum_j \big(f_j(z) - y_j^{\rm target}\big)\, \dfrac{\partial f_j(z)}{\partial w}$
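The "categorical cross-entropy" cost mentioned above can be sketched in numpy as follows ($C = -\sum_j y_j^{\rm target}\ln f_j$ is the standard definition; this block is an illustration, not the lecture's code):

import numpy as np

def categorical_crossentropy(f, y_target):
    # C = -sum_j y_target_j * ln(f_j), averaged over a batch
    # f: softmax outputs, y_target: one-hot targets, both (Nsamples, Ncategories)
    return -np.mean(np.sum(y_target*np.log(f + 1e-12), axis=1))

f = np.array([[.1, 0, 0, 0, .1, .1, .7, 0, 0, 0]])   # network output (probabilities)
y = np.zeros((1, 10)); y[0, 6] = 1                    # one-hot target: category 6
print(categorical_crossentropy(f, y))                 # approximately -ln(0.7)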
in keras:
history=net.fit(training_inputs,
training_results,batch_size=100,epochs=30)
Validation set (never used for training, but used during training for assessing accuracy!) — 10000 images
Test set (never used during training, only later to test the fully trained net)
Training set (used for training)
Error rate (handwritten digit recognition):

2 layer (300 hidden): 4.7%
2 layer (800 hidden): 1.6%
2 layer (300 hidden), with image preprocessing (deskewing): 1.6%
2 layer (800 hidden), distorted images: 0.7%
6 layers (784/2500/2000/1500/1000/500/10), distorted images: 0.35%
conv. net "LeNet-1": 1.1%
committee of 35 conv. nets, with distorted images: 0.23%
Homework
Explore how well the network can do if you add noise
to the images (or you occlude parts of them!)
Note: Either use the existing net, or train it explicitly on such noisy/occluded images!
[Figure: a "circle" and a "square" should be recognized regardless of where they appear in the image]

Convolutional Networks
Exploit translational invariance!
State of the art for image recognition

Image filtering: how to blur...
[Figure: a K x K smoothing kernel applied to an original pixel gives a smeared-out resulting pattern]

Image filtering: how to obtain contours...
[Figure: a kernel with positive (+) and negative (−) entries acts as an (approx.) derivative; it will give zero in a homogeneous area (gray = 0)]

Alternative view: scan the kernel over the original (source) image (and calculate a linear, weighted superposition of the original pixel values).
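A small numpy sketch of this "scan the kernel" view (the 3x3 kernels are illustrative examples of a smoothing filter and an approximate derivative filter):

import numpy as np

def convolve2d(image, kernel):
    # scan the kernel over the image ('valid' region only, no padding)
    K = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+K, j:j+K]*kernel)
    return out

image = np.random.rand(28, 28)
smooth = np.ones((3, 3))/9.0                                # blurring kernel
deriv = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]])/2.0    # approx. d/dx kernel

print(convolve2d(image, smooth).shape, convolve2d(image, deriv).shape)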
"Fully connected (dense) layer" vs. "Convolutional layer"
[Figure (simplified picture): in a convolutional layer, the same few weights $w_1, w_2, w_3$ are reused at every position; several 'channels']
in keras: 2D convolutional layer, then reduce the image size (pooling):
net.add(AveragePooling2D(pool_size=8))

Enlarging the image size (again), in keras:
net.add(UpSampling2D(size=8))
[Figure: a typical convolutional network: input → conv → subsampling → conv → subsampling → dense → dense → output]

"Channels": a convolution with a K x K kernel maps an MxM image onto another MxM image; several kernels produce several channels.

[Figure: example architecture: 28x28 input → conv → 7 x (28x28) (7 channels) → subsampling /4 → 7 x (7x7) → dense (softmax) → output]
# initialize the convolutional network
def init_net_conv_simple():
    global net, M
    net = Sequential()
    net.add(Conv2D(input_shape=(M,M,1), filters=7,
        kernel_size=[5,5],activation='relu',padding='same'))
    net.add(AveragePooling2D(pool_size=4))
    net.add(Flatten())   # needed for transition to dense layer!
    net.add(Dense(10, activation='softmax'))
    net.compile(loss='categorical_crossentropy',
        optimizer=optimizers.SGD(lr=1.0),
        metrics=['categorical_accuracy'])
[Figure: accuracy vs. epoch]

The convolutional filters: believed to be a good approximation to the first stage of image processing in the visual cortex.
Handwritten digit recognition with a convolutional net

[Figure: architecture: 28x28 input → conv → 8x(28x28) → subsampling /2 → 8x(14x14) → conv → 8x(14x14) → subsampling /2 → 8x(7x7) → dense (softmax) → output]

Does not learn at all! Gets 90% wrong!
[Figure: accuracy on validation data vs. epoch]

Same net, with adaptive learning rate (see later; here: the 'adam' method):
[Figure: accuracy on training data and on validation data vs. epoch]
Homework
[Figure: autoencoder: encoder → bottleneck → decoder]

- Goal: reproduce the input (image) at the output
- An example of unsupervised learning (no need for 'correct results' / labeling of data!)
- Challenge: feed the information through some small intermediate layer ('bottleneck')
- This can only work well if the network learns to extract the crucial features of the class of input images
- a form of data compression (adapted to the typical inputs)
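A minimal Keras sketch of such an autoencoder for flattened 28x28 images (the layer sizes and the 'adam' optimizer are illustrative assumptions, not the lecture's exact network):

from keras.models import Sequential
from keras.layers import Dense

net = Sequential()
net.add(Dense(100, input_shape=(784,), activation='relu'))  # encoder
net.add(Dense(20, activation='relu'))                        # small bottleneck layer
net.add(Dense(100, activation='relu'))                       # decoder
net.add(Dense(784, activation='sigmoid'))                    # reconstruct the input image
net.compile(loss='mean_squared_error', optimizer='adam')

# training: the target equals the input (unsupervised)
# net.fit(images, images, batch_size=100, epochs=10)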
Still: need a lot of training examples. Here: generate those examples algorithmically.

[Figure: autoencoder architecture: 32x32 input → conv → subsampling /4 → conv → upsampling x4 → conv → 32x32 output; 20 channels in all intermediate steps]

[Figure: cost function for a single test image; network output]
"denoising autoencoder"

Stacking autoencoders
[Figure: train one layer at a time while keeping the earlier layers fixed (and so on, for more and more layers); output = input; training the autoencoder = "pretraining"]

Sparse autoencoder: force most neurons in the inner layer to be zero (or close to some average value) most of the time, by adding a modification to the cost function.
[Figure: a purely linear autoencoder: (dense) linear encoder and (dense) linear decoder, no f(z)!]

Claim: Diagonalize this (hermitean) matrix, and keep the M eigenvectors with the largest eigenvalues. These form the desired set of "v"!

[Figure: an example in a 2D Hilbert space: the two eigenvectors of $\hat\rho$]

vals,vecs=linalg.eig(rho)
# get eigenvalues and eigenvectors (already sorted, largest first)

plt.imshow(reshape(-vecs[:,0],[28,28]),
    origin='lower',cmap='binary',interpolation='nearest')
# display the 28x28 image belonging to the largest eigenvector
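A short numpy sketch of how the matrix rho used above could be obtained from data and then diagonalized (the centering step and the placeholder 'images' array are assumptions; the slide itself only shows the diagonalization):

import numpy as np

# images: array of shape (Nsamples, 784), e.g. flattened 28x28 pictures
images = np.random.rand(1000, 784)          # placeholder data

X = images - images.mean(axis=0)            # center the data
rho = np.dot(X.T, X)/X.shape[0]             # (hermitean) covariance matrix, 784x784

vals, vecs = np.linalg.eigh(rho)            # eigh: real symmetric, ascending order
vals, vecs = vals[::-1], vecs[:, ::-1]      # re-sort: largest eigenvalue first

M = 6
components = vecs[:, :M]                    # keep the M leading eigenvectors ("v")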
The first 6 PCA components (eigenvectors)
The first 100 sum up to more than 90% of the total sum
Visualizing high-dimensional data

Neuron values in some intermediate layer represent the input data in some interesting way, but they are hard to visualize! [there are typically more than 2 neurons in such a layer]

[Figure: the data reduced to two components (component 1 vs. component 2): some trends visible, but not well separated!]
MNIST sample images, reduced to 2D, using “t-SNE”
[using python program by Laurens van der Maaten]
Different colors = differently labeled images (diff. digits)
Well-defined
clusters!
(starts from 50 PCA components for each image; t-SNE takes about 10min)
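A hedged sketch of producing such a map with the scikit-learn implementation (the slides instead use the python program by Laurens van der Maaten; the data loading and the perplexity value here are placeholders):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(2000, 784)                   # placeholder for flattened MNIST images

X50 = PCA(n_components=50).fit_transform(X)     # start from 50 PCA components per image
Y = TSNE(n_components=2, perplexity=30).fit_transform(X50)  # 2D map coordinates

# Y[:,0], Y[:,1] can now be scatter-plotted, colored by the digit labels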
Basic idea of dimensionality reduction: reproduce distances in the higher-dimensional space inside the lower-dimensional "map", as closely as possible.

[Figure: points $x_1, x_2, x_3$ in n-dim. space and their images $y_1, y_2, y_3$ in the 2-dim. map; pairs that are close should stay close]

Usually not perfectly possible: remember the map-maker's dilemma!

Move the map points downhill in a suitable cost function:
$\dot{y}_j = -\dfrac{\partial C}{\partial y_j}$   (in the 2-dim. map)
"Stochastic neighbor embedding" (SNE): define "probability distributions" that depend not only on the distance but also include some normalization.

$p_{ij}$ — probability to pick a pair of points (i, j); defined to be larger if they are close neighbors [in the high-dim. space]
$q_{ij}$ — similar, for the low-dim. space

$\sum_{i \neq j} p_{ij} = 1, \qquad \sum_{i \neq j} q_{ij} = 1$
Want the q-distribution to be a close approximation of the p-distribution: compare the probability distribution P in the high-dimensional space with Q in the low-dimensional space,

$C = \mathrm{KL}(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$

The "Kullback-Leibler divergence": a way of comparing probability distributions ($p_{ii}$ and $q_{ii}$ are set to zero).
Choices made for t-SNE
[for heuristics behind this see Hinton & v.d.Maaten 2008]

In the high-dim. space (Gaussian distributions):
$p_{j|i} = \dfrac{\exp\!\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{k\neq i} \exp\!\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)}$
where $\sigma_i$ is the variance of the Gaussian centered on datapoint $x_i$; symmetrize by setting $p_{ij} = \dfrac{p_{j|i} + p_{i|j}}{2n}$.

In the low-dim. space (Cauchy distribution = Student-t distribution with a single degree of freedom, i.e. heavy-tailed):
$q_{ij} = \dfrac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}{\sum_{k\neq l} \big(1 + \|y_k - y_l\|^2\big)^{-1}}$
The heavy tails of the Student-t distribution allow moderately dissimilar datapoints to be modeled by comparatively larger distances in the map: points in the low-dim. space are allowed to spread out at intermediate distances, which gives large clusters more room. It is also faster to evaluate than a Gaussian, since it does not involve an exponential.
The t-SNE "force" (gradient of the Kullback-Leibler divergence between P and the Student-t based Q):

$\dfrac{\delta C}{\delta y_i} = 4 \sum_j (p_{ij} - q_{ij})\,(y_i - y_j)\,\big(1 + \|y_i - y_j\|^2\big)^{-1}$

spring-like at short distances, "gravity"-like at larger distances; the sign (attraction vs. repulsion) depends on the match between the low- and high-dim. distributions.
An example application from biophysics
“T-SNE visualization of large-scale neural recordings”
George Dimitriadis, Joana Neto, Adam Kampff
(example by Andrej Karpathy)
[t-SNE applied to a 4096-dim. representation]
http://cs.stanford.edu/people/karpathy/cnnembed/

Take neurons out of a multi-layer convolutional network that classifies images, and represent them using t-SNE.
paperscape.org
The whole arXiv preprint server (1.3 million articles), represented as a 2D map.
Paperscape uses a simple physical model (similar to
t-SNE, but more physical).
Between any two papers there are two forces:
- repulsion (anti-gravity inverse-distance force)
- attraction of any paper to all its references by a
linear spring
- also avoid overlap (circle sizes represent number
of citations to that paper)
Every morning, after new papers are announced,
the map of all 1.3 million papers on the arXiv is
re-calculated (takes 3-4 hours)
The quantum “continent”
[colors represent arXiv
categories]
Artistic representation by Roberto Salazar and
Sebastian Pizarro: “Quantum Earth”
Optimized gradient descent algorithms