As suggested by reviewer 3, we hand-labelled corresponding points: the videos, frames, and source point were selected at random, and we simply clicked on the corresponding location. In this way we labelled 114 cross-video correspondences from our set of 58 training videos and 50 cross-video correspondences from a held-out set of 4 testing videos (although, since no cross-video labels were used during training, these are all, in a sense, test cases). We compared
our learned feature to two baselines: SIFT and AlexNet. For the SIFT baseline, we extracted a SIFT descriptor at a
dense grid of pixel locations. For AlexNet, we used the weights trained on ImageNet as included in caffe, and chopped
off the final classification layer, using the second-to-last, 4096-dimensional representation known as fc7. Note that AlexNet provides state-of-the-art results for place recognition (http://www.roboticsproceedings.org/rss11/p22.html).
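To make the dense-SIFT baseline concrete, the following is a minimal sketch of how a descriptor grid can be extracted with OpenCV; the grid step, keypoint size, and function name are illustrative placeholders rather than the exact settings we used.

    import cv2

    def dense_sift_descriptors(image_bgr, step=8, size=16):
        # Place SIFT keypoints on a regular pixel grid and compute a
        # descriptor at each location; `step` and `size` are illustrative.
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        sift = cv2.SIFT_create()  # cv2.xfeatures2d.SIFT_create() on older builds
        h, w = gray.shape
        grid = [cv2.KeyPoint(float(x), float(y), size)
                for y in range(size, h - size, step)
                for x in range(size, w - size, step)]
        grid, descriptors = sift.compute(gray, grid)
        return grid, descriptors  # descriptors: (num_grid_points, 128)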
To score the various representations, we extracted the feature at the source point in the source frame, and then
computed a dense representation of the target frame. We then computed the number of pixels in the target frame that
were closer in the descriptor space to the source descriptor than the manually-labelled corresponding point.
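As a concrete illustration of this scoring procedure, the sketch below computes, for one labelled pair, the fraction of target-frame grid points whose descriptor is closer to the source descriptor than the hand-labelled point's descriptor is; the helper and its argument names are ours, not from a released implementation.

    import numpy as np

    def fraction_closer_than_labelled(source_desc, target_descs, gt_index, metric="l2"):
        # source_desc: (D,) descriptor at the source point.
        # target_descs: (N, D) descriptors on the dense grid of the target frame.
        # gt_index: row of the manually-labelled corresponding point.
        if metric == "cosine":
            a = source_desc / np.linalg.norm(source_desc)
            b = target_descs / np.linalg.norm(target_descs, axis=1, keepdims=True)
            dists = 1.0 - b.dot(a)
        else:  # L2
            dists = np.linalg.norm(target_descs - source_desc, axis=1)
        # Fraction of grid points that beat the labelled correspondence.
        return float(np.mean(dists < dists[gt_index]))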

Figure 1: Example correspondences from the four held-out test videos, showing the source point on the left and the manually-labelled target on the right.


Figure 2: Quantitative results on manually-labelled cross-video correspondences. We compute a feature representation of the source point (from the left images of figure 1) and a feature representation on a dense grid in the target image (right images of figure 1). We then sort the pixels by distance from the source and compute what percentage of pixels are closer to the source than the manually-labelled ‘ground truth’ correspondence is. These graphs show the percentage of frames for which the ground-truth label is within the top X% of pixels, where X is the x-axis value. Note that our feature representation never leads to more than 22% of pixels being mapped closer to the source than the labelled target, even on held-out test videos. We used the caffe implementation of AlexNet and the OpenCV implementation of SIFT. We used a cosine distance with the AlexNet descriptor and an L2 distance with the SIFT descriptor, as these performed best on our metric. We also evaluated SIFT at a number of different scales and report the best result above.
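For reference, the per-frame fractions described above can be turned into the cumulative curves of figure 2 with something like the following sketch (the function and variable names are illustrative):

    import numpy as np

    def cumulative_accuracy_curve(per_frame_fractions, xs=None):
        # per_frame_fractions: one value per labelled correspondence, the
        # fraction of target pixels closer to the source than the labelled point.
        fracs = np.asarray(per_frame_fractions)
        if xs is None:
            xs = np.linspace(0.0, 1.0, 101)  # thresholds X on the x-axis
        # y(X) = percentage of frames whose labelled point is within the top X% of pixels.
        ys = np.array([100.0 * np.mean(fracs <= x) for x in xs])
        return xs, ys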

(a) Our feature    (b) SIFT    (c) AlexNet

Figure 3: These images visualize our metric for the first image in figure 1. The ‘ground truth’ correspondence is denoted by a green pixel; red pixels have a feature representation that is closer to the source representation than the labelled correspondence, black pixels are farther, and gray pixels are not considered because they do not provide enough context for all methods. Thus the x-axis in figure 2 corresponds to the percentage of red pixels in a given frame. Note that with our feature, not only are few pixels nearer to the source than the labelled point, but the pixels that are nearer also lie close to it in 2D. Also note the different failure modes of SIFT and AlexNet: SIFT identifies many spurious correspondences scattered in small blobs across the image, whereas AlexNet identifies spurious correspondences in large blobs. This is likely because AlexNet was trained for translation invariance with coarse input granularity, in contrast to our method, which is much more spatially discriminative.
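The red/black coding described in this caption can be reproduced, for a single frame, roughly as follows (a sketch only; the gray border handling is omitted and all names are ours):

    import numpy as np

    def colour_code_metric(dist_map, gt_yx):
        # dist_map: (H, W) descriptor distances from each target pixel to the source.
        # gt_yx: (row, col) of the hand-labelled correspondence.
        closer = dist_map < dist_map[gt_yx]
        vis = np.zeros(dist_map.shape + (3,), dtype=np.uint8)
        vis[closer] = (0, 0, 255)  # red (BGR): closer to the source than the labelled point
        vis[gt_yx] = (0, 255, 0)   # green: the 'ground truth' pixel itself
        return vis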


Figure 4: Application of our learned feature representation to a new person, who was not present in the training data.
