
The magazine of the algorithm community

A publication by RSIP Vision

July 2019
Women in Computer Vision: Jelena Frtunikj - BMW
Autonomous Driving Applications: MSC Software • AutoX
Research: Deep Image Reconstruction from Human Brain Activity
Upcoming Events
Project Management: Time Estimation for Your Projects
Exclusive Interview: Andrew Fitzgibbon - Microsoft
Computer Vision Project: Nuclei Detection and Segmentation
Computer Vision Tools: Object Tracking in Python Using OpenCV - with codes!
Contents

03 Editorial by Ralph Anzarouth
04 Nuclei Detection and Segmentation – Project by RSIP Vision, by Dorin Yael
06 Project Management Tip – Time Estimation for Your Projects
08 Research Paper Review – Deep Image Reconstruction from Human Brain Activity, by Amnon Geifman
14 Best of CVPR 2019 – Our articles from Long Beach, CA
20 Guest – Andrew Fitzgibbon (Microsoft)
24 Autonomous Driving – MSC Software
32 Women in Computer Vision – Jelena Frtunikj (BMW)
38 Computer Vision Tools – Object Tracking in Python Using OpenCV, by Amnon Geifman
42 Autonomous Driving App – AutoX with Jianxiong Xiao
47 Computer Vision Events – Upcoming Events July–Sept 2019
Welcome

Dear reader,
CVPR 2019 is just behind us, and we can almost still hear the voices we heard in Long Beach. It was a huge conference with excellent scientific content, as confirmed by those who were there and by those who read about it in our CVPR Daily. For everyone, this magazine includes a 24-page BEST OF CVPR section, starting on page 14: be sure not to miss it!
This issue of Computer Vision News includes several articles dedicated to the autonomous driving area, one of our most popular subjects. You'll find some in the CVPR section and one more on page 42, where AutoX CEO and founder Jianxiong Xiao (a.k.a. Professor X) kindly shares with us how computer vision helps them build a full AI platform for applications such as logistics delivery services, airport transportation and city driving in smart, 5G-connected cities. You will find more articles related to autonomous driving on pages 24 and 32.

Finally, here you'll find the full recording (with all the slides) of the webinar hosted just a few days ago by Moshe Safran and Miki Haimovich: "Project Management for AI Implementation in Medical Devices". Stay tuned and don't miss our upcoming webinars on medical imaging and more.

Enjoy reading!

Ralph Anzarouth
Editor, Computer Vision News
Marketing Manager, RSIP Vision

Did you subscribe to Computer Vision News? It's free, click here!

Computer Vision News
Editor: Ralph Anzarouth
Engineering Editor: Amnon Geifman
Publisher: RSIP Vision
Contact us · Give us feedback · Free subscription · Read previous magazines
Copyright: RSIP Vision. All rights reserved. Unauthorized reproduction is strictly forbidden.
Follow us:
Nuclei Detection and Segmentation

by Dorin Yael

“… a smart solution was needed!”


Every month, Computer Vision News reviews a successful project. Our main purpose is to show how diverse image processing techniques contribute to solving technical challenges under real-world constraints. This month we review the challenges and solutions in a Medical Imaging project by RSIP Vision: Challenges in Nuclei Detection and Segmentation.
Many diagnostic and research processes require an accurate evaluation of the types, shapes and numbers of cells in a specific tissue area. Thus, our goal in this project was to detect and segment single cell nuclei in slices of tissue. To achieve this goal, we had to overcome multiple challenges: the very high variability between the properties of nuclei belonging to different cell types, including shapes, sizes, color intensities and textures, which vary within and between tissues. In addition, cells are sometimes clustered and need an expert eye to separate them into single cells with definite borders. Another difficulty is due to occlusion, when one cell is partially hidden by another cell.

All this makes it very challenging to build a single process which detects them all in every single slide: there is no single set of parameters which covers them all. Therefore, a smart solution was needed. We separated the project into two steps: the first step focused on detecting the cells, while the second step focused on segmenting the cells detected during the first step. To do this, we used deep learning techniques which specifically adapted U-Net with different ground truths (a sketch of this kind of network follows below).

RSIP Vision also performed the human annotation for this project: this was a difficult task due to the large amounts and high density of nuclei in every single slice. Observations are very subjective as well: specifically in clusters of cells, inter- and intra-observer variability is particularly high. Of course, the accuracy of the model depends on the accuracy, number and richness of the annotations. For this project we annotated a few thousand nuclei using different approaches.

The first step of processing in our solution was cell detection. Once the nuclei are detected, the cell segmentation task starts, with the goal of segmenting each cell separately (instance segmentation). Once this is done automatically, accurately and fast, the output can be used both for diagnostic and for research purposes. The standardization that we were able to apply with our solution solves the problem of variability in the observations. The fact that this is done automatically reduces the human time needed and makes it possible to answer more questions. If a physician needs to make a diagnosis based on cell detection, classification or count, they will be able to do it faster and better, since it will be independent of individual observations.
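As a rough illustration of the two-step idea (our sketch in PyTorch, not RSIP Vision's actual code, which is not public), here is a minimal U-Net-style encoder-decoder. In this setup, one instance would be trained with detection ground truth (e.g. nuclei center maps) and a second with per-cell segmentation masks:

```python
# Minimal U-Net-style network (illustrative only): an encoder-decoder with a
# skip connection. The same architecture can serve detection or segmentation
# depending on the ground truth it is trained with.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, out_channels=1):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 16), conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = conv_block(32, 16)            # 32 = 16 upsampled + 16 skip
        self.head = nn.Conv2d(16, out_channels, 1)

    def forward(self, x):
        s = self.enc1(x)                         # full-resolution skip features
        x = self.enc2(self.pool(s))              # downsampled bottleneck
        x = torch.cat([self.up(x), s], dim=1)    # skip connection
        return self.head(self.dec(x))

detector = TinyUNet(out_channels=1)              # stage 1: nuclei detection map
segmenter = TinyUNet(out_channels=1)             # stage 2: per-cell masks
out = detector(torch.randn(1, 3, 256, 256))      # -> (1, 1, 256, 256) logits
```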

Another complexity of this task emerges from the multitude of techniques for tissue staining. One of the most advanced techniques used in the study of the cancer microenvironment is multiplex, which allows dyeing the same tissue with different specific fluorescents. The stained slice is photographed multiple times using different wavelengths and the different images are overlaid to produce a merged picture. This allows a classification of different cell types within the same slide, while conserving its spatial structure. Automatic analysis of these images opens new opportunities to shed more light on multiple open research questions.

The main aspect of our solution was the decision to separate the project into two parts, enabling us to be specifically precise in each one of them. The other valuable choice that we made was the use of non-classic computer vision algorithms and techniques, in particular deep learning. In contrast to classic algorithms, which require a detailed definition of conditions and heuristics specific to the tissue, the staining technique and even the microscope resolution, deep learning is independent of all of these and can fit different data types. Thanks to deep learning, when you train your model on many examples it learns by itself to identify and classify the different cells emerging from different tissues and dyed by different staining techniques.

Read about more projects in medical segmentation

Take us along for your next Deep Learning project! Request a call here
Project Management Tip
Time Estimation for Your Projects


RSIP Vision's CEO Ron Soferman has launched a series of lectures to provide a robust yet simple overview of how to ensure that computer vision projects respect goals, budget and deadlines. This month he tells us about another aspect of Project Management: Time Estimation for Your Projects. Read more tips by Ron Soferman about Project Management in Computer Vision.

"…you will certainly obtain a reliable time estimation, which you can present to the other stakeholders in the organization!"
As a project manager, you frequently have to estimate the development time needed for a specific solution, whether you initiate the idea or you answer a specific need of the organization to solve a problem. Time estimation is a very tricky task, since the research part is always a big unknown: you don't know the solution yet and, as a result, you can't know for sure how long everything will take. Still, the organization expects you, the project manager, to provide a reliable estimation of the investment that is needed to bring about the specific solution that is requested.

A reliable time estimate is essential information for the organization in order to decide about a specific project. It starts with your management, which needs to decide whether to launch the project. Other stakeholders in the organization might need to know and understand the timetable in view of entering a new market, training the sales team on a new feature and preparing the human resources aspect of the project.

The first step will be to consult with your team and get their input about the length of time that will be needed to solve the specific problems. There might be several unknowns as to the success of a specific approach – so the answer might be a range rather than one definitive time length (one conventional way to combine such ranges is sketched below).

However, consulting with the research team has a drawback: often, their sense of time is biased by their belief in their own ability to solve the task. As soon as they find the solution, their assessment of the time will be shorter. You will be well advised to look back at previous projects performed by this team: browsing the past logs of the development team, you will get a broader insight into the different stages to anticipate for a complete solution.

The next step will be to inspect the different threats that might appear during the project and try to mitigate the risks with specific solutions.

After this first estimation, you must review and analyze this work plan in terms of the project and product life cycle for your company. It may start with a first prototype, but then you have to deliver fully working software, including a GUI that will be checked and approved by the marketing team, and also take into account the productization, which includes all the testing and the fulfillment of regulatory compliance requirements (like FDA and others, when applicable).

Taking all those aspects into consideration, my professional experience says that you will certainly obtain a reliable time estimation, which you can present to the other stakeholders in the organization.

More articles on Project Management

"You will be well advised to look back at previous projects performed by this team…"
Research Paper Review
by Amnon Geifman
Every month, Computer Vision News reviews a
research paper from our field. This month we
have chosen Deep Image Reconstruction from
Human Brain Activity. The authors (Guohua
Shen, Tomoyasu Horikawa, Kei Majima and
Yukiyasu Kamitani) published their article here.

…when will it be possible to read minds?


One of the most interesting questions today is: when will it be possible to read minds? The recent success of deep neural nets makes it possible to perform new and fascinating tasks. One example of such a task is the reconstruction of images from brain signals. Today, many vision groups and brain science groups are devoting their efforts to developing DNN-based methods to interpret fMRI scans and generate images from brain activity measurements.

The impact of succeeding in such a task is huge, as it might help to understand the structure of the brain, assist the blind, and more. In this article, we will describe one of the best methods to perform this task, review the method and show some of its results.

The framework of the task is simple: subjects are placed in an fMRI machine, a sequence of images is shown to them, and the fMRI signal is recorded. The images are divided into three classes: natural images, artificial shapes, and alphabetical letters.

The fMRI data is represented in a 4-dimensional structure: one dimension is for time, and the other three dimensions are a structural representation of the activity of the brain, represented in voxels. Such a dataset contains a set of training images and a set of test images. While this paper performs reconstruction of the images from the brain, classification of the images into classes is also possible.

The authors use DNN visual features of a given image based on the VGG-19 network. These features are meant to approximate the hierarchical neural representation of the human visual system. Then, the signal from the fMRI is translated to fit these visual features. At test time, given the translated signal, an optimization is performed to find the image whose DNN features are best correlated with the fMRI signal. In this way, a reconstruction of the image is generated. The figure on the next page demonstrates the main idea of the method; we now dive into the details of the paper.

Method

DNN architecture for visual features representation

In order to represent an image in an informative way, the authors used a pre-trained classification model with the VGG-19 architecture. This architecture was trained to classify the ImageNet dataset with 1,000 object categories. A trained classification model (such as VGG-19) gives a representation of an image in a hierarchical manner, which is invariant to shifts and rotations of the objects. In turn, this produces an invariance of the model to such transformations at inference time.

The VGG-19 model consists of 16 convolutional building blocks and 3 fully connected layers. The outputs of each layer are taken immediately after the weight multiplication (without rectification) and concatenated into a vector. This vector is called the visual feature vector, and it will later be used in the training and reconstruction process.
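A minimal sketch of this feature extraction in PyTorch (our illustration, not the authors' code). It hooks the conv and fully connected layers of torchvision's VGG-19 so that their pre-activation outputs – the values before the ReLU – are captured and concatenated; `pretrained=True` downloads ImageNet weights (newer torchvision versions use a `weights=` argument instead):

```python
# Extract pre-activation VGG-19 outputs and concatenate them into a single
# "visual feature vector", as described above.
import torch
import torchvision.models as models

vgg19 = models.vgg19(pretrained=True).eval()

features = []
def save_output(module, inputs, output):
    # flatten each layer's pre-activation output into a 1-D vector
    features.append(output.detach().flatten())

# Conv2d/Linear outputs are produced before the ReLU modules that follow
# them, so hooking these layers yields the "without rectification" values.
for module in list(vgg19.features) + list(vgg19.classifier):
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        module.register_forward_hook(save_output)

image = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed image
with torch.no_grad():
    vgg19(image)                           # hooks fill `features` (19 layers)
visual_feature_vector = torch.cat(features)
```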

fMRI decoding
The authors construct a decoding model that predicts the visual feature vector from the brain signal using sparse linear regression (SLR). The logic is that SLR can automatically select the most important features for prediction. In this way, they also cope with the high dimensionality of the fMRI signal.

The fMRI input is a sample $X = (x_1, x_2, \dots, x_d)$, where $x_i$ is a scalar value specifying the signal amplitude of the i'th voxel in the brain. Using the measurements of d voxels, the regression function is represented as:

$$y(x) = \sum_i w_i x_i + w_0$$

where $w_i$ are learnable weights and $w_0$ is the bias term.
Given an input image, a forward pass through the DNN (described in the previous section) produces the visual feature vector with entries $t_1, \dots, t_L$, where $t_l$ is the l'th entry of this vector. Then, each $t_l$ is fitted using the regression function above. The objective of the regression is to maximize the likelihood of $t_l$ given the voxel amplitudes X, the weights W, and a fixed parameter $\beta$. This likelihood function has the form:

$$P(t_l \mid X, W, \beta) = \prod_{n=1,\dots,N} \left(\frac{\beta}{2\pi}\right)^{0.5} \exp\left\{-\frac{\beta}{2}\left(t_l^{(n)} - y_n(x)\right)^2\right\}$$

where N is the number of samples and $t_l^{(n)}$ is the n'th sample of the $t_l$ entry. In order to get a sparse estimator, the authors used Bayesian parameter estimation and adopted the automatic relevance determination prior.
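Conceptually, the decoding step amounts to fitting one sparse Bayesian linear regressor per feature-vector entry. A toy sketch with scikit-learn, whose ARDRegression implements the automatic relevance determination prior mentioned above (the data shapes and synthetic signal here are made up):

```python
# One sparse Bayesian regressor maps voxel amplitudes X to a single
# visual-feature entry t_l; in the paper this is repeated for every entry.
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
N, d = 100, 50                          # hypothetical: samples x voxels
X = rng.standard_normal((N, d))         # stand-in fMRI voxel amplitudes
# synthetic target depending on only 3 voxels, so sparsity helps
t_l = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)

decoder = ARDRegression()               # sparse Bayesian linear regression
decoder.fit(X, t_l)                     # learns weights w_i and bias w_0
print(decoder.predict(X[:3]))           # decoded feature entries for 3 samples
```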

Image reconstruction from DNN layer



Remember that, at test time, we have neither the input image nor the visual feature vector. What we do have is the feature vector generated from the brain signal, which is supposed to be a good approximation of the visual feature vector (the one generated from the image). To reconstruct the image, the authors use optimization in the image space, aiming to find the image whose DNN features are closest to the fMRI-derived feature vector. To do so, they formulated an optimization problem of the form:

$$v^* = \arg\min_v \left\{ \frac{1}{2} \sum_l \left(\Phi_l(v) - y_l\right)^2 \right\}$$

where v is a vector whose elements are the pixel values of an image (of size 224x224x3), $v^*$ is the reconstructed image, $\Phi_l$ is the l'th visual feature vector entry, and $y_l$ is the corresponding translated signal generated from the fMRI signal. The goal of this optimization is to find the best image to fit the brain signal.
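The optimization itself can be run as plain gradient descent in pixel space. A hedged sketch (our illustration; `extract_features` stands in for a differentiable version of the VGG-19 feature extractor above, and `y` for the decoded feature vector):

```python
# Gradient descent in image space: push the image v so that its features
# match the feature vector y decoded from the fMRI signal.
import torch

y = torch.randn(1000)                          # stand-in decoded feature vector
v = torch.zeros(1, 3, 224, 224, requires_grad=True)

def extract_features(img):
    # placeholder: a real implementation would return the concatenated,
    # differentiable VGG-19 features of img
    return img.flatten()[:1000]

optimizer = torch.optim.Adam([v], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    loss = 0.5 * ((extract_features(v) - y) ** 2).sum()
    loss.backward()
    optimizer.step()                           # v gradually fits the brain signal

reconstruction = v.detach()
```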

DGN constraint
In order to improve the 'naturalness' of the reconstructed images, the authors use a generative adversarial network (GAN). Using a pre-trained deep generator network (DGN), denoted by G(z), they modified the optimization objective to be of the form:

$$z^* = \arg\min_z \left\{ \frac{1}{2} \sum_l \left(\Phi_l(G(z)) - y_l\right)^2 \right\}$$

Once the minimizer $z^*$ is found, the final image is taken to be $v^* = G(z^*)$. Since the GAN is trained to produce natural-looking images, this optimization generates more plausible results and improves on the appearance of the previous optimization.

Results
In the field of deep image reconstruction, the results are the most important aspect of the model. While there are some quantitative measures of the results, we will let you examine the visual results produced by the method described above. We show here results from the paper on three different tasks: reconstruction of shapes and letters, reconstruction of natural images, and reconstruction of images from imagination. We start with the more straightforward task, reconstruction of shapes and letters. You can see the results in the figure below:

… these reconstructions are relatively clear!

As you can see, these reconstructions are relatively clear: you can see the edges of the shapes and read the word NEURON from the reconstructed images. These results are the first of their kind that make it possible to understand the content of the images. Note that the images above come from a relatively simple distribution, hence nice results are expected.

The next set of results is the natural image reconstructions. Note that this is a much harder task, since the images have a much more complex distribution. It is analogous to a classification network that achieves great results on MNIST and

less accurate results on ImageNet. However, these results are also interpretable and sometimes make it possible to identify the objects in the image. You can appreciate the results in the following figure:

In the figure above, the 'number of iterations' axis refers to the number of iterations in the optimization scheme. Although the reconstructions are not entirely identical to the source images, it can be seen that they capture the main object and some of its details, which gives some abstract representation of the image.

The last set of results is the reconstruction from imagination. In this task, the subjects were instructed to imagine an image that they saw in the experiment. The reconstruction, in this case, was done based on the imagined image. This, of course, is a much more challenging task, as the signal-to-noise ratio is dramatically reduced. In the following figure you can see the results of this experiment:

Conclusion
Image reconstruction from brain signals is a fascinating task that several teams around the globe are working hard to solve. The low signal-to-noise ratio, as well as the high dimensionality of both the fMRI signal and the image, are the greatest challenges in this field. The results presented above can be considered very good for this field. As you can see, there is still a lot of research to do on this task and many methods to try, which will hopefully bring us to better solutions for the problem. There is no doubt that the ability to reconstruct images from the brain will have a tremendous influence on humanity and on science in general.
BEST OF CVPR 2019
24 PAGES!
AutoAugment: Learning Augmentation Strategies from Data

Ekin Dogus Cubuk is a research scientist at Google Brain. Before this, he got his PhD in physics, where he used to apply machine learning to physical systems. He did the residency at Google Brain, which is for people who have an interest in deep learning but are not experts. At Google Brain, he has been working on AutoML applications, mainly for data augmentation. He is a first-timer at CVPR and speaks to us ahead of his first oral presentation.

Dogus finds that data augmentation is an underutilized tool in deep learning. Although there are many papers that come up with new data augmentation operations – like mixup, Cutout, geometric operations – it's not clear how you would combine them to get the optimum result. It's interesting to ask how far you can push the impact of data augmentation on these models. In this paper, they tried to combine the already existing operations and get as good a result as possible on the test set.

Even as he was doing his PhD, Dogus says it was clear that machine learning had many interesting directions. He says if you look at the research divisions now, all the physics PhDs are using their physics skills in machine learning. They are also interested in using machine learning to study physics, and there's still an active research area at Google Brain and Google Accelerated Science.

"A lot of the work is focused on architectural improvements, rather than processing data!"

During the process, they actually found out some more fundamental and important things that should be done for data augmentation. One of them is variability: diversity in your data augmentation policy, for example. Most of the time when people apply operations, they either apply them to every mini-batch and every image, or they don't apply them at all. Cutout is very helpful on some data sets, but the way you use it is to apply it to every single image and every single mini-batch. They found that instead of having one strategy, having hundreds of strategies and choosing one of them randomly for each image in each mini-batch actually gives a huge improvement. If you were to just do that and not use any of AutoAugment's search capability, you would already get a big improvement.
He explains: "Accuracy with deep learning models is important, and data augmentation is an easy but impactful way of improving the accuracy. A lot of the work is focused on architectural improvements, rather than processing data. Since experts already have an idea of what symmetries are, and it's clear how they can utilise those symmetries to augment the data, it would be good to increase the accuracy. Something else we have been seeing recently, that we didn't necessarily think about when we were writing the paper, is that it seems that increased data augmentation helps a lot with robustness. Recently there was an ICML workshop here on the robustness of machine learning models to common corruptions and noise. What several people found is that AutoAugment leads to the best robustness, although it was actually only trained for increased validation accuracy."

In terms of next steps, Dogus says that they want to apply this idea to other domains: not just image classification, but video and object detection, for example. They recently had a paper, SpecAugment, about applying AutoAugment to speech to benefit speech recognition, but they realised that before they do that, there are other improvements to be had just by adding some new operations. In general, they are trying to make the method faster so that other researchers can apply it quickly to other domains. Since their work, several other groups have used their search space and augmentation strategy, but with a different optimizer, reaching equally good results, which is encouraging. They had used a reinforcement learning approach to optimize their augmentation policy, which can sometimes be hard to get to work. Since their work, other people have reproduced it with Bayesian optimization, with population-based training or with a few other methods, and Dogus is encouraged by how easy it is to get the same results with different optimizers.
AdaGraph: Unifying Predictive and Continuous Domain Adaptation Through Graphs
Massimiliano Mancini is a PhD student at Sapienza University of Rome, in collaboration with Fondazione Bruno Kessler (FBK) and Istituto Italiano di Tecnologia (IIT). His advisors are Barbara Caputo from IIT and Politecnico di Torino, and Elisa Ricci from the University of Trento and FBK. Samuel Rota Bulò from Mapillary is a third unofficial advisor. Massimiliano spoke to us ahead of his oral and poster.

This work tackles predictive domain adaptation. In standard domain adaptation, what you have is a source domain where you have a lot of labels for the task that you want to address, then another domain, which is unlabelled, which is actually the domain where you want to apply your model. For instance, you have a lot of images collected in daylight, so they are clean and labelled, and then you want to apply your model at night, but you have only unlabelled images for the night situation. You must figure out how to go from daylight to night without any label on the night. You need the target data for doing this task, because if you don't see anything you cannot forecast what night will look like. This work tries to predict what the night, or a situation, will look like. If you are given a lot of domains – one is labelled, a lot are not labelled – for each of them, you are given an attribute. For example, you see a front view of a car, and then you see it from the side, and then someone asks you to produce a model that will recognise cars if you see them from the back. Since you are seeing a lot of different domains, either labelled or unlabelled, you may try to figure out what the model for the rear of the car will look like. This is something that we do even as humans. If someone tells you that it will rain tonight, you might go out with an umbrella because you adapt yourself to the weather conditions. If they don't tell you it's going to rain, you might go out with nothing. As humans, we adapt to what we know, and if we have understood how the weather changes with respect to some attributes, we can try to adapt to that. We hope that our algorithms are able to do that too.

Massimiliano explains: "To solve this task you must relate what the parameters of a domain are with respect to the attribute, or the metadata, in our case. We have a first phase where we train parameters which are attribute or domain specific, so they are specific for a certain condition. While doing that, we initialise what we call a graph. We have a node for each of the domains and we connect the nodes of the graph with
the edges, where the weight of the edge basically says how much two domains are related. If I tell you I see a car from a side view and from a rear-side view, a rear side and a side are closer than a rear and a front, so we relate them in this way. When we get the target attribute, at that point we initialize a virtual node – a fake node – because we didn't have any data for that. No parameters are there, but if you assume that similar metadata, so similar domains in our graph – similar nodes – require similar parameters, we can just propagate the parameters of nearby nodes and obtain our model for the target domain, which we never see."

He adds that there is another problem with that. Like before, if someone tells you that it's going to rain, you go out with an umbrella, but if they don't tell you that and you go out and it starts raining, then you must figure out how to react. Since obviously there will be nothing supervised – and so the prediction of the target model can of course be wrong, and also some metadata may be received which are not representative of the domain – this method unifies this kind of prediction with continuous domain adaptation. As the target data is received, the model is continuously updated. This is possible because the model is based on batch normalization. The different domains have different batch normalization statistics and scale and bias parameters. For each domain, there are different statistics, and so for the target you basically predict just the statistics of the domains and the scale and the bias. Once you have those, at test time when you receive the target data, you use it to update the statistics, because that can be done easily. Then you devise an unsupervised loss, which is just an entropy loss, to update the scale and the bias for the domain-specific parameters.
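A toy numerical sketch of this propagation step (our illustration, not the AdaGraph code): each seen domain node carries batch-norm statistics, and the unseen target's statistics are initialized as a similarity-weighted average of its neighbours in the graph:

```python
# Propagate batch-norm statistics from seen domain nodes to an unseen one.
import numpy as np

# hypothetical per-domain BN statistics (mean, var) for three seen domains
domain_stats = {
    "front":     (np.array([0.1, 0.3]), np.array([1.0, 0.9])),
    "side":      (np.array([0.2, 0.1]), np.array([0.8, 1.1])),
    "rear-side": (np.array([0.4, 0.0]), np.array([0.9, 1.0])),
}
# hypothetical edge weights: metadata similarity of each seen domain
# to the target domain ("rear")
similarity = {"front": 0.1, "side": 0.3, "rear-side": 0.6}

total = sum(similarity.values())
target_mean = sum(similarity[d] * domain_stats[d][0] for d in domain_stats) / total
target_var  = sum(similarity[d] * domain_stats[d][1] for d in domain_stats) / total
print(target_mean, target_var)   # initial BN statistics for the unseen domain
```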
Massimiliano tells us that they were restricted in terms of the data they had to test on. They tested on the data set used in the literature for this problem, CompCars, with different cars, different viewpoints and different years of production; also on another data set which depicted portraits collected over 100 years in different regions of America. This is not a real application of this problem. He says they would like to have a model, for example, in autonomous driving. He explains: "I tell you it's evening, it starts getting darker, so it adapts your model for this kind of light. Obviously, one can say that if I have a huge amount of data which is balanced among all the possible conditions… Our algorithm does not do it. Nowadays, the data sets are unbalanced, and we must specialize our systems."

Massimiliano says they assume the metadata is representative of the domain shift, which may not be true, so they must weight them. He says a next step for this work would be to understand which metadata are important for the domain shift and which are not. He also thinks they could go from the metadata to a more abstract representation, because metadata is good if you have it and if you can quantify the shift, but if someone tells you it's dark or it's darker, you can understand what the environment will look like, so a description rather than metadata would be helpful.

Finally, Massimiliano is excited to tell us that he is very proud of his heritage: "I come from a very small village in the middle of Italy, in Umbria. I think we are 30 people in the centre of the village. It was a long journey to get here and I'm very happy that, at the end, even starting from there, I can represent my small village here."

"Our graph is able to estimate target model


parameters with pretty good results in our
experiments. Receiving the target data and
using them to refine our model, allows to fill the
remaining gap with the upper bound, a
standard domain adaptation algorithm with
target data available beforehand."
Orals
Top: Jiwoon Ahn presenting Weakly Supervised Learning of Instance Segmentation With Inter-Pixel Relations.
Below: Tatiana Tommasi presenting Domain Generalization by Solving Jigsaw Puzzles. Tatiana explained how jigsaw puzzles help not only kids but also machines to generalize better!
Guest: Andrew Fitzgibbon
Andrew Fitzgibbon leads the All Data AI theme at Microsoft Research in Cambridge and has worked for Microsoft for the past 15 years. He also teaches at Cambridge University and has a PhD student there.

Andrew, do you feel more of a teacher, or a business technologist?

I really enjoy trying to be clear about things, so teaching is a chance to try to very clearly explain something that's happening. I like teaching Fourier transforms to undergraduates because it's fun to keep one's hand on the mathematics. The fantastic thing about students, especially undergraduates, is they have no idea what's difficult and what's easy. They just try to understand, and if they don't understand, they tell you. They spot bugs in your reasoning, in your descriptions, very, very quickly.

Were you like this when you were in their place?

Good question.

That's what I'm here for!

I notice when something is a bit wrong, and in fact, it's almost a way I generate research ideas. I find things that are weird or annoying.

Is there anything that your generation should learn from the current generation of students?

Yes. Oh, that's a great question.

Thank you, that's the second time that you've told me that. I will start to believe it!

I like that they're self-confident.

Well, they were trained to be more self-confident than we were. We were trained to accept discipline and authority, while they're actually used to challenging it.

Yes. Which is obviously very useful. One thing that's good about being old, because I'm just over 50, is I've just got better at predicting what an algorithm will do without having to run it.

So, it's better intuition?

Is it intuition? I don't know. It's definitely to do with experience. You've just seen enough things tried in different ways. I've definitely learned to try to prove myself wrong.

I think it was Oscar Wilde who said: experience is the name that people give to their failures. Maybe the fact that we did something many times leads us to know where it will lead.

That's right, and easier to predict when things will fail, or to predict how to break things, which is just as important.
Can you tell me something about Microsoft that we, the community, don't know?

Something that I'm really proud of at Microsoft is the way in which we bring cutting-edge computer vision to real-world products. We did that with Kinect. No one knew how to do body tracking before Kinect came out and it just got done. With HoloLens, I'm super proud of something that I didn't work on. The head tracking in HoloLens v1 done by my colleagues is just an amazing piece of engineering. It's kind of like you had a moonshot and then you ended up on Mars.

What is particularly revolutionary about it?

If you ask George Klein, who was responsible for large proportions of the middle of it, he'll say nothing. Just good engineering. Just we put it together. Sometimes in the research world, and I think young people are maybe better at this now, people say "It's just engineering", as if just engineering is not something to be proud of. I think the distinction between research and engineering is much closer than we think. If we think about how deep networks suddenly became popular, to a large extent, it's about the engineering that Krizhevsky did in order to make the networks run on the GPU. In some sense, an engineering advance has given us this amazing new tool and capability. I have a title of researcher, but I really think of myself as an engineer. My group at Microsoft, and across Microsoft, more and more we're trying to remove the boundaries between roles that are called researcher and roles that are called engineer.

What is the most dramatic boundary that you've been able to break?

In head tracking on HoloLens, we all knew how to do SLAM, we all knew how to do real-time SLAM, and we all knew how to do structure from motion, but no one knew how to do it for hours and hours and hours. All the demos that we would do as academics were on reasonably short sequences, and no one knew how to do it in incredibly messy environments. The very messy environments that we deal with most often are offices.

Why do you think it's Microsoft that came out with this and not another company? What is special about Microsoft that made it the company who would bring it out?

That is a good question.

Thank you. That is the third time. If I get to 10, I'll win a prize.

There are a few things. When we were working on Kinect, we had this visionary guy, Alex Kipman, and he phoned up the lab in Cambridge and said, "I've got this brilliant idea. We're going to do a machine learning system which does body tracking and we're going to stand in front of the computer and we're going to play." I said, "It's lucky you came to me because I am a world expert in body tracking, and I can tell you that what you're saying is impossible. It's not going to happen. You need to do it in 10 per cent of an Xbox, which is a 2003 era piece of hardware, and you need to do it real-time, and it needs to run for hours, and it's absolutely impossible." Kipman said, "Oh, it's funny you say that, because look, I've done it." He produced us a demo which was doing a fantastically good job of body tracking if you started with your arms out in a T-shape. If you started in the right pose, then it would do a fantastic job for about 20 seconds or about a minute.
Which was enough to impress you?

Which was enough to impress us, but also, we knew that could never succeed. There's a compounding probability of failure, because the system did the thing that all computer vision engineers love, which is: if I knew what I was doing 30 milliseconds ago, then surely I can use that information to learn what I'm doing now. What we did was we didn't use that, and that was the crazy thing.

I now have a question for the engineer that you feel you are. What is the next boundary that you would love to break?

I would love us to be able to train much smaller neural networks, because I feel very uncomfortable whenever I see two floating-point numbers getting multiplied together. If you make them 16-bit floats, it doesn't make me much more comfortable.

Well, that is today considered acceptable.

Yes, but someday we're going to have to do it with much less compute. Obviously, today the HoloLens is a head-mounted device. We want to think about the number of milliwatts it takes to compute an answer to some question. It's happening. One of my colleagues, Manik Varma in India, has a VGG-net kind of architecture which is one kilobyte in size, whereas you think about neural networks as millions of parameters.

In the medical world, there are diseases which are so rare that you never have enough samples to put them into a large state-of-the-art neural network.

Absolutely, and that's one of the focuses of my research group, All Data AI. What that means is big data and small data. You might have your example of a medical scenario where you have two training examples of this important disease, and maybe you have millions of examples of other situations. How do you train using tiny amounts of data? We know the answer to that is essentially you use Bayesian methods. Bayesian methods are great when you lack training data. Sometimes you see a world where there's data everywhere. It's like with the phrase "Water everywhere and not a drop to drink". You have data everywhere, but only two training examples. There's a whole question there about what we can do with semi-supervised learning and with unsupervised learning, and I think our strength in variational inference almost looks like it's being overshadowed by this deep learning revolution, but it's crucially important for real-world problems.

Now a question for the human behind the engineer. What boundary would you like to break in this community in order to work better together?

Someday we're going to have to figure
out how to do these conferences without everybody travelling to the same place.

Do you mean that human science will break time and space?

Well, the question is whether we can live in a VR world or an augmented-reality world, where the experience of travelling there is as good as the experience of encountering it at home. Here's my vision for how it works with the time zones, because we're not going to change the time zones. The idea is we might run local models of the conference. Maybe in England we would have a hotel somewhere and 200 people would go to that place. Then they would all switch to California time. Breakfast would be served at 5pm. People would attend talks until midnight.

Let's see it from another point of view: a significant share of the people that I love the most in this world are here at this conference and I meet them once a year. Now you are telling me, as an engineer, you would like to find a solution so that I will never meet them again! What do you think I will say about that?

I tell you this, nothing is stopping you meeting them. You should meet them more! You can't shake their hand. We don't have a good solution for that. I definitely want to be able to sit here with a glass of beer and you sit there with a glass of beer, but you're 2,000 miles away. I was at a workshop yesterday, Computer Vision for VR and AR, and a few people – Facebook, Christian Theobalt, Yaser Sheikh – were showing us a vision of the future where the process of me talking to you will visually be exactly the same as being together. I'll have a full 3D view. Perfect lighting. I think that is a boundary. I don't know where the good-enough boundary is, but once we go over it and once it's good enough, I think it will be amazing. We will spend more time together because it will be fun to phone you up and spend an hour with you some evening.

Okay, and you think that with this solution you will be able to convince me?

I don't know what we're going to do about the shaking hands. I wonder if we will generate some alternative. One of the things that I was surprised by in the HoloLens was when we put hand tracking in the device, I thought it would be really important to have haptic feedback when you touch things, but it turns out that just watching the virtual finger touch the virtual blob, and the virtual slider move left and right, maybe with a little bit of audio, suddenly feels like haptic. It feels like I'm touching it.

It's almost the real thing.

It's almost the real thing. Clearly shaking hands, we're not going to get to the real thing.

But we'll get to a proxy which is almost as good.

We might invent a proxy which is we put our hands in the virtual air and we kind of ding them, and if they make a little noise when the fingers touch, then maybe that'll feel fine. I don't know. We'd have to try it. I would not rule it out immediately. I would say that might feel cool. Ding!
Expo: MSC Software
Edward Schwalb is a Lead Scientist at MSC Software, who are part of the CVPR 2019 Expo.

MSC Software has a product called Virtual Test Drive, a simulation product that provides simulation of virtual worlds that match real worlds. In order to test the software of advanced driving assistance and self-driving cars, they need a lot of virtual miles driven, because it is not practical to test these vehicles in the physical world.

Edward tells us that they provide hardware-in-the-loop, software-in-the-loop, model-in-the-loop, vehicle-in-the-loop, and driver-in-the-loop simulation with different levels of realism. They can record a realistic scenario and generate variations of that scenario.

One of the challenges in the industry is that people drive about 100 million miles between fatal crashes. That means having to test for billions of miles, and you cannot do that without simulation. This product allows you to mimic real life to make it as realistic as possible, including simulating traffic. There are sensor models and driver models. It can also reshape the distribution of scenarios so that rare events happen more frequently.

Edward says his research interest has to do with quantifying probabilistic models for the density of different crashes and adverse events. He's specifically interested in people who want to look at the validation of the car more so than the algorithms for driving the car, because he would like to quantitatively analyse the safety of the vehicle on the road. He would like to use simulation to extend the physical testing and achieve that statistical simulation of billions of miles.
He explains how MSC Software addresses these challenges: "We need to create a population of scenarios that represents the operational design domain of the vehicles. There are the safety standards ISO 26262 and SOTIF that specifically have requirements. SOTIF, for example, categorises all the scenarios between known safe, known unsafe, unknown safe, and unknown unsafe. What we want to do is increase the number of known safe scenarios at the expense of the unknown unsafe scenarios. We want to do probabilistic interpretation based on the scenario population, based on the behaviour of the car. As a scientist, I would like to build AI that does this."

He says another challenge is if you have an AI sensor that is 99 per cent accurate – which is very generous – and you have to drive a billion miles, and every mile has multiple frames. For example, at 10 frames per second and 100 seconds per mile, that's 10^12 frames on a single camera. With multiple cameras, you can easily have 10^14 or more frames, and you want to have one fatal crash for every 10^12 frames. How do you use an algorithm that has an error rate of 10^-2 and build a system that is reliable to 10^-14? How do you build a system which is 10 billion times safer than the components?

Edward tells us that MSC Software do this better than anybody else. They can generate scenarios and quantify the probability density that certain adverse events happen, such as crashes or going over a cliff. He concludes by saying that it is very important to be able to go from the physical road, to the digital twin of the physical road, to millions of simulations with a population of scenarios, and to take the result back to the physical world and make it credible.
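As a sanity check on the orders of magnitude in Edward's frame-count arithmetic above (our arithmetic, following his figures):

$$10\,\tfrac{\text{frames}}{\text{s}} \times 100\,\tfrac{\text{s}}{\text{mile}} = 10^{3}\,\tfrac{\text{frames}}{\text{mile}}, \qquad 10^{3}\,\tfrac{\text{frames}}{\text{mile}} \times 10^{9}\,\text{miles} = 10^{12}\,\text{frames}$$

so a single camera already produces 10^12 frames over a billion miles, and adding more cameras and sensor channels quickly pushes the total toward 10^14.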
Data Augmentation Using Learned Transformations for One-Shot Medical Image Segmentation
Amy Zhao is a graduate student at MIT in the final year of her PhD. She spoke to us ahead of her oral and poster: Data Augmentation Using Learned Transformations for One-Shot Medical Image Segmentation, which explores how to do smarter and better data augmentation to help improve medical image segmentation when you don't have a lot of labelled data.

Amy says that deep learning has been demonstrated to be powerful for so many image tasks, but it's very hard to do on medical data. This work explores segmentation when you only have one labelled example. She says this is a pretty common scenario. It's easy to get hundreds of unlabelled examples if you need to – MRI scans from patients, for example – but how do we leverage the information in these unlabelled scans to help us with segmenting any scan?

These unlabelled scans have a lot of information in them. They show you the anatomical variations in your population. They'll also have intensity variations, because they were taken with different scanners and they're of different patients. This method can learn to mimic all of these variations and synthesise examples that look like these unlabelled scans. These examples can then be used to train a segmentation CNN to leverage the power of deep learning.

Amy tells us that there's a more common way of doing what they're trying to do: "Data augmentation. Most people who are familiar with computer vision know how to use it to some extent. For medical imaging, it's common to do data augmentation using random smooth flow fields. These are fairly easy to code, but they don't always produce the best results and you need to hand-tune the parameters. The challenge here is that medical imaging is a pretty complex space. These brain MRIs have very complex variations. It's difficult for people to write functions that mimic these variations. We are trying to learn to do this instead of trying to do it by hand."
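For reference, the conventional baseline Amy describes looks roughly like this (a hedged sketch; `alpha` and `sigma` are exactly the hand-tuned parameters she says are the drawback):

```python
# Augment a 2-D scan with a random smooth flow (displacement) field:
# random per-pixel displacements, Gaussian-smoothed for spatial coherence.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_elastic_augment(image, alpha=10.0, sigma=4.0, seed=None):
    rng = np.random.default_rng(seed)
    # random displacement per pixel, smoothed to make the field coherent
    dx = gaussian_filter(rng.standard_normal(image.shape), sigma) * alpha
    dy = gaussian_filter(rng.standard_normal(image.shape), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(image.shape[0]),
                         np.arange(image.shape[1]), indexing="ij")
    coords = [ys + dy, xs + dx]
    # bilinear resampling of the image along the warped coordinates
    return map_coordinates(image, coords, order=1, mode="reflect")

augmented = random_elastic_augment(np.random.rand(128, 128), seed=0)
```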

To solve this, they are leveraging some recent work in medical imaging: Adrian V. Dalca and Guha Balakrishnan's VoxelMorph. The idea is that if you have a method that can compute the transformation from your labelled scan to an unlabelled scan, that allows you to compute a set of transformations. Once you have the set of transformations, you apply them back to the labelled scan and use that to synthesise more labelled examples.

Co-author Adrian Dalca comments that "This work shows how leveraging unlabeled data and machine learning can lead to tangible and practical improvements. This is important, since the next step is to use this algorithm in practical clinical analyses."

Co-author Guha Balakrishnan adds that "The benefit of the method is that it is interpretable and a simple idea that can extend to various other data, both in and out of medical imaging."

Thinking about how to develop this work, Amy points out that they've demonstrated it on a fairly limited range of scans so far. They just look at T1 MRIs and show what happens if you assume that you have one labelled scan and a hundred unlabelled scans. What happens if you have multiple labelled scans available, or if you apply this approach to different kinds of MRIs or CTs? They think it would work, but don't know how well, and are really interested to find out.
Online High Rank Matrix Completion
Madeleine Udell is Assistant Professor of Operations Research and Information Engineering at Cornell University. She spoke to us ahead of her oral and poster, which she presented alongside Postdoctoral Associate Jicong Fan.

Madeleine tells us that this work starts out with the problem of finding missing entries in data vectors. A classic, well-understood approach to filling in missing entries in data sets is low-rank matrix completion. That assumes that all of the vectors lie in a low-dimensional subspace. If, for example, the vectors lie in a two-dimensional subspace, then knowing any two of their coordinates allows you to figure out all of the rest once you've learned the subspace. This work looks at the question: what happens if you want to fill in missing entries, but your vectors don't live in a two-dimensional subspace, they live in a low-dimensional manifold?

She asks us to imagine a one-dimensional manifold – just one curve – and imagine it in a 3D space. If the curve does anything remotely interesting, then the span of all the vectors on the curve is the complete 3D space. It's not low rank; it's full dimensional. You cannot use normal low-rank techniques to fill in the missing data, but there's a clear structure: you're on a one-dimensional manifold. It's like a line. It's very easy to see this curve. The question is, how can we see curves like this in higher-dimensional space?

Madeleine explains: "The way that we solve it is we take these vectors and we blow them up. We map them through a polynomial feature map. We take each of the points and we compute many, many polynomials in the coordinates that we have observed. Now our data lives in a much higher-dimensional space, and in that much higher-dimensional space, it is low dimensional. So, it actually lives on a low-dimensional subspace in this higher-dimensional space. The dimension of that subspace is probably larger than the ambient dimension of the original space, but it's still low-dimensional relative to the ambient dimension of this blown-up space. The most important trick is the fact that after blowing up using this feature map, the resulting set of points is in a low-dimensional subspace, so that we can use ideas from low-rank matrix completion."
She points out that there was some really nice work done by Laura Balzano, Rebecca Willett, Rob Nowak and Greg Ongie. They essentially had the same idea, but it's too slow, and you can't scale it to the kinds of sizes of data sets that are interesting in computer vision. To improve on that, this method adds a couple of critical components. One component that's really important is that they don't represent these vectors in the high-dimensional space explicitly. Instead they use a kernel map, because writing them down in the high-dimensional space would be too large. Then, once they've kernelized, they're able to rewrite the objective and all the penalties in a way that decomposes over the data points, and that allows them to train it online.

One of the most interesting things about the online training is that it means they can adapt to a changing manifold. Imagine you're watching someone run, or they're running and they transition to walking and then they transition to jumping jacks. Their motion is quite complex. You might imagine that all of human motion lives in some low-dimensional manifold, in the space of all possible coordinates of every point in the body, but it's a much lower-dimensional manifold if you just look at jumping jacks or if you just look at running. By learning online, they're able to learn this transition and figure out, for example, if they've started jumping jacks, the jumping jacks manifold, or if they've started running, the running manifold. Because of this they can learn much lower-dimensional manifolds, learn them much faster, and get better accuracy than offline methods. Normally online methods don't perform as well as offline methods, because they use the information less efficiently, but for them, the online method performs better because they can adapt to quick changes in the kind of motion.

How can this be applied practically in the real world? Madeleine explains: "Some of the things that we've considered are using it to adapt to
missing sensor measurements, like in this case of motion detection. I actually think it might be useful even for motion tracking – I'm wearing a Fitbit right now. We've also thought about cases where sensor measurements are corrupted, possibly in chemical system monitoring. It's not clear, but I think that maybe the broader picture is that any sufficiently complex phenomenon is probably not exactly a low-dimensional subspace. That depends on the coordinates in which you've measured it matching something important about the structure of the phenomenon. It's not clear why that should be true for sufficiently complex phenomena; especially as you collect more and more data, you should be able to see that it's not actually a low-dimensional subspace. It's not flat. Our method allows you to adapt. As you get more and more data, then you're allowed to start looking for this non-linear structure."

Thinking about next steps for this work, Madeleine says she is interested in the deep structure. They know that there are ways that manifolds can wrap back on each other that would confuse this method in high dimensions, but one thing they don't yet understand very well is when this mapping succeeds and when it fails.

Also, when thinking about a low-dimensional manifold, a natural thing to think about is the internal coordinates within that manifold. This method doesn't give you those. What this method gives you is a completion of the points in the original space, but it can't tell you, for example, how close together two points are in the natural coordinates of the manifold. For Madeleine, that would be extremely valuable to know, so it would be her next target for this work.

Jicong Fan
Oral Presentation
Top: Aliaksandr Siarohin presenting his paper Animating Arbitrary Objects via Deep Motion Transfer.
Below: Supplementary video for this paper. Click to watch!
Women in Computer Vision: Jelena Frtunikj - BMW
Jelena Frtunikj is an expert in automated driving. Currently, she works at BMW in Munich, Germany as a researcher and engineer developing safe (ISO 26262 & SOTIF) End2End features of highly and fully automated vehicles using sensor fusion and machine learning algorithms. At CVPR 2019 she organized the workshop Dependable Deep Detectors: Verification and Certification of Neural Networks.

Read more interviews with women scientists

Thanks a lot for the invitation. I was very surprised!

Why were you surprised?

I don't know how someone could discover me for CVPR Daily. Maybe you can tell me how! [laughs]

We want to learn about you. So tell us about your work.

I work at BMW on topics of autonomous driving. I work on a topic where we combine machine learning and safety, which means I work on an algorithm which does perception. I bring my safety knowledge, because that's the background that I have and that I got during my PhD, in order to understand how to make deep neural networks safe.

It means that if autonomous cars do not crash into one another in the future, we will owe it to you.

Exactly… [laughs] Actually, that's my goal. I wanted to tackle the topic of safety more strongly. It's not that we are just developing a prototype. We are developing something that will go to the market, and we have to make sure that it's safe. Since I also have a background in computer science, it was not difficult for me to do. I had learned about neural networks and so on before.

So you were the right person, at the right moment, at the right place. How long have you been with BMW?

On the first of July, it will be exactly three years.

Congratulations!

Thank you.

Tell us something about BMW that we don't know.

Many people think that BMW is a very traditional company. Many people think that only Germans work there, and that it's not so international. Actually, that's completely not the case for the autonomous driving department. We are so international. In my team, we are ten people. I'm actually not German; I am from Macedonia. We have an Iranian. We have a guy from India and another one from China. We have a Spanish guy. Of course, we have

some Germans. A lot of people think there are many older people because it's very traditional. Especially in autonomous driving, the people who have the knowledge are quite young. We have a very young population.

The last thing that we heard about BMW was the ad about the retirement of the Mercedes boss. Did you see it?

When he took the i8? Yeah, of course I saw it.

What did you think about it?

I liked it a lot. As far as I know, to show mutual respect between both companies, they do these kinds of advertisements often. When BMW was celebrating 100 years a few years back, they also did some kind of advertisement.

So there is some kind of dynamic discussion going on.

Exactly, it's a public discussion.

I'm very proud of interviewing someone from Macedonia for the first time. Tell us something about Macedonia that we don't know.

Macedonia is a very small country. Many people tell me that they often confuse it with Barcelona or Estonia. I don't know why. Maybe it's how I pronounce it. Yes, there is a country called Macedonia. It's a very, very small country north of Greece and south of Serbia. East is Bulgaria, and west is Albania… less than 2 million people. It's a beautiful country nature-wise.

Are you from Skopje?

Yes!

You are a city person.

Yeah, exactly. I moved to Germany for my studies and chose Munich to do that.

And you do not regret it?

No, not at all.

Do you have a future in Macedonia?

From my perspective, no, unfortunately.

What would Macedonia need in order to be more attractive for young scientists?

The problem is that we don't belong to the EU, so we don't have a lot of funding projects. Also, we basically have zero industry. We have a lot of software companies, but those companies mainly do outsourced work for German or Swedish companies. It's not like you can perform any kind of research there. That's what is missing. Macedonia gave me the basis, also from university, because I finished it there. It has a really good education system. I always think about how I can give back. Until now, I haven't found the time. It has been a wish that I hope to accomplish in the next few years. Maybe I can start a project with some of the university professors. Last year, I also went to visit them. They invited me for a talk, but a talk is one thing. You just give it and go out. Then I go
back to Germany. Doing something more is, for me, something that I have to do. I will hopefully do it at some point.

I'm sure that you will. When did you discover that you had a passion for science?

It was quite easy. My father is an electrical engineer. He repairs, well, repaired; actually, he is going to retire this year. Sometimes he would bring the medical devices home and continue working. At home, we are three women: my sister, me, and my mother. He sometimes needed one of us to help measure. Sometimes two hands are not enough. He often asked me and my sister as well. I was the one who wanted to do it. He always explained to me how the whole machine works, which are the sensors, how do they measure the sugar in the blood. I knew since I was in fourth or fifth grade that I would study electrical engineering and then computer science.

So you have one sister and your mom. Who is the most scientific of the three?

Me.

What would you have done if you were not a scientist?

Honestly, I don't know. It was so clear. I was really good at math and physics. I was bad at languages, basically. It was very obvious. My sister wanted to study medicine, but then there were discussions in the family about what is the benefit of it or not. Then she decided also to do electrical engineering.

So if we need a medical doctor, we need to go to another family, not to yours.

Definitely… It's funny. In Germany, my passport says doctor. On an airplane, they say, "Ah, a doctor." I'm not a doctor. I cannot help you if something goes wrong on an airplane. This actually happened once. On the airplane, someone didn't feel well. It was a long flight overseas. Then they said, "Can all the doctors please come to the stewardess?" It says so officially, but I am not! [both laugh]

Regarding the move to BMW, did it come by chance?

It came by chance. I was doing my PhD in the area of safety platforms for future autonomous systems. I knew I wanted to bring what I had done in my PhD to a real product. In a PhD, you do a prototype. It works in some cases. In a conference, people present what works or doesn't work. I wanted to do

it for a real product. At that point in time, there was a colleague at the same university who moved to BMW. He asked me, "Jelena, we are looking for people. You are perfect. You have safety. You have computer science. Do you want to apply?" I sent him my CV. It was really very spontaneous. I didn't look for anything else. It looked quite challenging because BMW wanted to implement autonomous driving, but I can help!

From what you say, it seems like you followed your career because it was obvious. I'm sure this is not the reality. I'm sure there were times when it was not so obvious. What were those times?

I understand your point, but it's actually my character. In whatever I do in my life, I have some kind of vision of what I want to do. For example, I knew from the beginning of my studies that I wanted to go and study abroad. In Macedonia, the salary is very low. They cannot sponsor, let's say, going somewhere. I knew I had some time. I searched the internet on how I could get scholarships and how I could get some experience in between. I had everything planned. I didn't know if everything would work out, but I knew I had to get good grades. I knew I had to get good internships, some extracurricular stuff. It's not that easy! I know it sounds really obvious and very easy, but it wasn't. I really work hard at whatever I want to achieve. There is a plan. Even now, if I think about what I want to do, I know what I have to learn as a basis, and what I have to achieve before I can get there.

What is the most precious lesson that you received from your teachers?

[laughs] I don't know if it sounds very cliché, but from my math teachers: discipline and hard work. That's also what my father told me every day. He also taught me that there are no easy ways in life. I can tell a story that basically no one knows except my family and my boyfriend.

So let's reveal it!

I finished elementary school, which is

eight years in Macedonia. Then I went to high school. It was not a problem; I passed all the exams. I was in a class which, on the first day, didn't look like it had nice people in it. I didn't know anyone. I knew some people in the other classes. It looked like all nerds. I came back from school. I went to my dad and said, "Please can you find any solution to move me to a different class?" He asked, "Why?" I said, "You know, all of them are nerds. It doesn't look nice. I don't want to be there." He said, "You know what, you can't get everything easily in life. Either you stay in that class or you don't go to school anymore. That's my decision. I won't help you." That's what he told me when I was 14.

And you didn't take advantage of this to quit school?

[laughs] No, that was not an option. I always wanted to improve myself and learn new stuff.

So one more time, you tell me that you took the obvious choice.

Again, yeah! [both laugh]

Read more interviews with women scientists

Top: Cecilia Xuaner Zhang presenting her poster Zoom to Learn, Learn to Zoom. Hers was one of the most crowded posters in a very crowded Tuesday afternoon poster session!
Below: runner-up best paper award for "SizeNet: Weakly Supervised Learning of Visual Size and Fit in Fashion Images" in the Understanding Subjective Attributes of Data (Focus on Fashion and Subjective Search, FFSS-USAD) workshop. Author Nour Karessli (left, Zalando SE) is greeted by Diane Larlus (Naver Labs Europe) and Nicu Sebe (University of Trento).

Object Tracking in Python Using OpenCV


by Amnon Geifman
“…tracking and detection are not
the same tasks!”
In this article we will explain and implement the object tracking algorithms in the OpenCV library. In many computer vision applications, we are given a video that contains an object, and we want to track the object's movement across different frames. The problem is usually solved using a bounding box that contains the object. The bounding box is represented by four numbers: in our case, the x and y coordinates of the upper-left corner of the rectangle, plus its width and height. Some of you may ask why we don't just run YOLO or Mask R-CNN on each frame to get the object's location. The answer is that tracking and detection are not the same task. Detection is a much harder task which requires many more resources; hence tracking is much faster than detection. Moreover, tracking usually preserves the identity and continuity of the object's location, whereas applying detection on each frame separately might not consider the relationship between the different frames.
When choosing an object tracker, we need to consider its three major properties:

Speed: in some applications, we want our tracker to be able to process hundreds of frames per second (FPS). In other applications, the tracking is done offline, so speed is less important.

Accuracy: of course, we want our tracker to be accurate. However, in general, there is a trade-off between accuracy and running time: a fast tracker might be inaccurate and vice versa.

Occlusion robustness: when tracking a moving object, occlusion by other objects is a common phenomenon. In some applications, we want our tracker to cope with such occlusions without drifting to track the occluding object.
Luckily enough, the OpenCV library contains a tracking API with eight tracking algorithms: Boosting, MIL, kernelized correlation filters (KCF), discriminative correlation filter (CSRT), median flow, TLD, MOSSE and GOTURN. Each of these trackers offers a different trade-off between speed and accuracy; in this article we will use the KCF tracker, since it is fast, relatively accurate and good at failure detection.
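As an aside, and as a sketch of my own rather than code from the article, switching between these trackers can be wrapped in a small helper. The constructor naming below assumes the OpenCV 3.x contrib API; in OpenCV 4.5+ several of these constructors moved to the cv2.legacy module:

import cv2

def create_tracker(name="KCF"):
    # Valid names include: "Boosting", "MIL", "KCF", "CSRT", "MedianFlow",
    # "TLD", "MOSSE" and "GOTURN". The constructor is looked up lazily,
    # so a tracker missing from your OpenCV build fails only when requested.
    factory = getattr(cv2, "Tracker%s_create" % name)
    return factory()

tracker = create_tracker("KCF")  # equivalent to cv2.TrackerKCF_create()

This makes it easy to benchmark several trackers on the same video by changing a single string.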

Tracking with Kernelized Correlation Filters (KCF)


In tracking tasks, our goal is to find the object in the current frame, given its
location, shape and structure in the previous frames. The main idea of the KCF
Object Tracking in Python Using OpenCV 39
Computer Vision News

algorithm is to build a correlation filter (i.e. kernel) such that a convolution with
the input image gives the desired response. This desired response usually has a
Gaussian shape centered around the object and decreasing with the distance. To
calculate the optimal filter, the algorithm uses translated instances of the object
from the previous frames. The maximal filter response is taken to be the object
location.
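To make the "desired response" concrete, here is a tiny sketch of my own (with an assumed patch size and bandwidth, not values from the article) of the Gaussian-shaped target map such a filter is regressed against:

import numpy as np

def gaussian_response(height, width, sigma=2.0):
    # Gaussian peak centered on the target, decaying with distance
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

y = gaussian_response(64, 64)  # label map for a 64x64 patch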
In order to explain how the kernel trick works (like in SVM), we first start with a linear correlation filter. The optimal linear filter $w$ is found by solving a least squares problem of the form:

$$\min_{w} \; \lVert Xw - y \rVert^{2} + \lambda \lVert w \rVert^{2}$$

Here, $X$ is a circulant matrix containing all the possible cyclic image shifts, $\lambda$ is a regularization coefficient and $y$ is the response that we expect to receive. The advantage of this formulation is that, given such a circulant matrix $X$, we can find the optimal weights $w^{*}$ in the Fourier domain using a closed-form solution.

Like in SVM, the kernel trick allows us to perform a non-linear regression by mapping the input using a non-linear mapping $\phi$. In this way, the weights have the form $w = \sum_i \alpha_i \, \phi(x_i)$ and the minimization problem becomes:

$$\min_{\alpha} \; \lVert K\alpha - y \rVert^{2} + \lambda \, \alpha^{T} K \alpha$$

Here the matrix $K$ is the kernel matrix with entries $k_{ij} = \phi(x_i)^{T} \phi(x_j)$, so we can solve in closed form in the Fourier domain as well. In this method, an RBF (Gaussian) kernel is usually used.
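The article does not spell those closed-form solutions out, but for reference, in the original KCF paper (Henriques et al., TPAMI 2015) they take the following element-wise form in the Fourier domain, where hats denote the DFT, $^{*}$ complex conjugation, $\odot$ and the fractions are element-wise, $x$ is the image patch and $k^{xx}$ its kernel autocorrelation:

$$\hat{w} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda}, \qquad \hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}$$

This is why KCF is so fast: training and detection reduce to a few FFTs and element-wise operations per frame.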

Implementation
Before we dive into the code, note that some of the trackers described above are not available in older versions of OpenCV, so make sure you have pip-installed the latest version. Moreover, make sure you have opencv-contrib-python in your environment; we need it to create the tracking object.
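As a quick sanity check (a one-liner of my own, not from the original article), you can print the installed version before going further:

import cv2
print(cv2.__version__)  # the tracking API below requires the contrib build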
We begin by importing the cv2 and sys libraries:

import cv2
import sys

Next, we define the tracker object. As mentioned above, we have eight possible trackers to choose from, and here we use KCF. For the other trackers, you can take a look at the OpenCV documentation. Defining the KCF tracker is as simple as:

tracker = cv2.TrackerKCF_create()

Now we are ready to preprocess the video. Essentially, we open the video and read its first frame; in this example, I'll also show you how to rotate each frame in case your video is not aligned. This can all be done with the following code:

video = cv2.VideoCapture("IMG_1003.MOV")  # open the video file
flag, frame = video.read()                # read the first frame

# build a rotation matrix around the frame center and rotate by 270 degrees
(h, w) = frame.shape[:2]
center = (w / 2, h / 2)
M = cv2.getRotationMatrix2D(center, 270, 1)
frame = cv2.warpAffine(frame, M, (h, w))

In the above code, the first line loads the video from a file on my computer. The second line is a standard extraction of a frame from a video; the flag argument specifies whether the frame was extracted successfully. The last four lines just rotate the given frame by 270 degrees.
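As a side note, for rotations that are exact multiples of 90 degrees, recent OpenCV versions also offer cv2.rotate, which avoids building a rotation matrix; assuming the 270-degree counter-clockwise rotation above, the equivalent call should be:

frame = cv2.rotate(frame, cv2.ROTATE_90_CLOCKWISE)  # 270 deg CCW == 90 deg CW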

At this stage we are ready to begin. In this code, the region of interest, i.e. the initial bounding box, will be defined manually. It would not be very hard to use some pre-trained network to perform the initial detection; for example, a YOLO network can extract a good bounding box for the object. However, we keep that for future articles, in order to explain detection methods in detail. The ROI selector of cv2 opens the frame in a dialog box and allows us to mark a rectangle over the object. The output is a tuple of four numbers: the x and y coordinates of the top-left corner, followed by the width and height of the box. We use the ROI selector as:

box = cv2.selectROI(frame, False)

The last thing we need to do is initialize the tracker with the box we defined. Then we iterate through the frames and, for each frame, update the tracker. If the tracking succeeds, we draw the box on the frame and show it; otherwise we just show the frame. At the end, we add an option to break out of the while loop when Esc is pressed or no more frames are available. The code looks like this:

flag = tracker.init(frame, box)

while True:
    flag, frame = video.read()
    if not flag:
        break                                # no more frames to read

    # apply the same 270-degree rotation as above, so every frame
    # matches the orientation the tracker was initialized with
    frame = cv2.warpAffine(frame, M, (h, w))

    flag, box = tracker.update(frame)        # track the object in this frame

    if flag:
        # tracking succeeded: draw the bounding box
        p1 = (int(box[0]), int(box[1]))
        p2 = (int(box[0] + box[2]), int(box[1] + box[3]))
        cv2.rectangle(frame, p1, p2, (0, 0, 255), 2, 1)

    cv2.imshow("Tracking", frame)

    k = cv2.waitKey(1) & 0xff
    if k == 27:                              # Esc pressed
        break

This is actually it. In my code, I added a few more lines to measure the number of frames per second and to make the visualization nicer, but the code above is enough to track any object!
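Here is a minimal sketch of such an FPS measurement (my own illustration, not the article's exact code); it would wrap the tracker.update call inside the loop, timing it with OpenCV's tick counter and drawing the result on the frame:

# Inside the while loop, replacing the plain tracker.update call:
timer = cv2.getTickCount()
flag, box = tracker.update(frame)
fps = cv2.getTickFrequency() / (cv2.getTickCount() - timer)
cv2.putText(frame, "FPS: %.0f" % fps, (20, 40),
            cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0), 2)

Let's see some examples: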

Results

We now demonstrate the performance of the tracker. To this end, we track the phrase FOCUS ON ME in the following video. You can see the results below.

Conclusion

In this article we have seen how to use tracking algorithms in OpenCV. Specifically, we explained and implemented the KCF tracker, which gave us the speed and accuracy we wanted. As mentioned above, there are seven more trackers implemented in OpenCV, each with different properties and different robustness. You can use our code as is and just replace the line that defines the tracker; this lets you try other trackers depending on your application. Enjoy!
Autonomous Driving: AutoX

With Jianxiong Xiao


AutoX is a self-driving car company focused on developing self-driving technology. Their core product is an AI platform that can be installed on different kinds of vehicles for applications such as logistics delivery services, airport transportation, and city driving in smart, 5G-connected cities. We talked about it with Jianxiong Xiao, CEO and founder of AutoX.

Xiao tells us that there are many automakers around the world making cars, and there are many sensor companies making LiDAR and radar cameras, but what is missing is a good AI platform.

The self-driving AI platform has many components. The first is high-definition 3D mapping, or HD mapping. Next is localisation: localising where the vehicle is in relation to the HD map. Then comes perception, including object detection, object recognition and object tracking, as well as object prediction, to predict what an object is going to do next.

Then there is a decision-making component called behaviour planning. Given the road condition and the traffic situation, the car has to make a smart decision about whether it's going to stop, go, turn left, or turn right. Next comes motion and speed planning: after it has made that decision, it has to plan the trajectory and exact speed that it is going to drive.

"…they are the only company in the world with AI sophisticated and smart enough to deal with the very dense and challenging traffic there"

The next step is about controlling the vehicle to drive autonomously. The control panel has two parts: the algorithm side and the hardware side. The vehicle control unit has to send an electrical signal to the vehicle in order to control the steering and the throttle.

This is a full-stack solution, from the very beginning to the end, for self-driving cars. It is backed up by a data infrastructure built by gathering a huge amount of data from global testing fleets. Before AutoX cars hit the road, they are tested heavily through digital simulation.

Jianxiong emphasises how important it is that they make no mistakes. Being a self-driving car, they require 100 per cent accuracy. Not a single compromise is allowed. He is very proud that they have put together such a robust and reliable system.

The solution covers many aspects of computer vision. Jianxiong explains: "For example, for perception, obviously we need object detection. We need semantic segmentation. Not only do we detect the object, but we're segmenting out at instance-level. We need to do this at a very fast speed. Unlike other applications where you can wait one second to get a picture, for us, we need real-time performance. When the image comes in, we need to finish all the computation immediately. We use a convolutional neural network to do that."

To support object tracking, they have a 3D convolutional neural network running in the physical space using a LiDAR signal. They combine the LiDAR and camera together in another neural network. For object prediction, because it's a time sequence signal, they use a recurrent neural network.

In three years, AutoX has grown to the point that they currently have a large testing fleet of self-driving cars in the US and China. Jianxiong claims that they are the only company in the world with AI sophisticated and smart enough to deal with the very dense and challenging traffic there. They are doing autonomous testing downtown in Shenzhen, the city with the highest population density in China, where they are piloting a fleet of RoboTaxis.

In terms of next steps, Jianxiong tells us: "The next step is to further improve our technology. To polish it. By the end of the year, we will have more than 100 self-driving cars testing. We're also working with a lot of car manufacturers. For example, we're working with Dongfeng Motor, the second-largest Chinese car manufacturer. We're also working with Shanghai Motor. They are the largest Chinese car manufacturer. We are working closely together to provide a self-driving fleet to really push the technology to commercialisation and production."

AutoX are currently hiring for various positions. See their website for more details.

"By the end of the year, we will have more than 100 self-driving cars testing!"
Upcoming Events

MIDL - Intern. Conf. on Medical Imaging with Deep Learning
London, UK, Jul 8-10: Website and Registration

IEEE ICCI*CC Int. Conf. on Cognitive Informatics & Computing
Milano, Italy, Jul 23-25: Website and Registration

MIUA - Medical Image Understanding and Analysis
Liverpool, UK, Jul 24-26: Website and Registration

Medical Augmented Reality Summer School
Zurich, Switzerland, Aug 5-16: Website and Registration

CMBBE - Computer Methods in Biomechanics and Biomedical Eng.
New York City, NY, Aug 14-16: Website and Registration

Federated Conference on Computer Science and Info Systems
Leipzig, Germany, Sept 1-4: Website and Registration

Int. Conference on Computer Analysis of Images and Patterns
Salerno, Italy, Sept 2-5: Website and Registration

BMVC - British Machine Vision Conference
Cardiff, UK, Sept 9-12: Website and Registration

Int. Conf. on Machine Learning, Optimization, and Data Science
Siena, Italy, Sept 10-13: Website and Registration

Intelligent Health 2019
Basel, Switzerland, Sept 11-12: Website and Registration

AI for Business Summit 2019
Sydney, Australia, Sept 17-19: Website and Registration

RE•WORK - Deep Learning Summit
London, UK, Sept 19-20: Website and Registration

The AI Summit 2019
S.Francisco, CA, Sept 25-26: Website and Registration

Did we forget an important event? Tell us: editor@ComputerVision.News

FREE SUBSCRIPTION

Dear reader,
Do you enjoy reading Computer Vision News? Would you like to receive it for free in your mailbox? You will fill the Subscription Form in less than 1 minute: Subscription Form (click here, it's free). Join thousands of AI professionals and receive all issues of Computer Vision News as soon as we publish them. You can also read Computer Vision News in PDF format (though this online view is way better) and visit our archive to find new and old issues as well. We hate SPAM and promise to keep your email address safe, always.

FEEDBACK

Dear reader,
If you like Computer Vision News (and also if you don't like it) we would love to hear from you: Give us feedback, please (click here). It will take you only 2 minutes. Please tell us and we will do our best to improve. Thank you!
IMPROVE YOUR VISION WITH
The only magazine covering all the fields of the computer vision and image processing industry

SUBSCRIBE
CLICK HERE, IT'S FREE

A PUBLICATION BY
