Computer Vision News
A publication by RSIP Vision
July 2019
Women in Computer Vision:
Jelena Frtunikj - BMW
Autonomous Driving Applications:
MSC Software • AutoX
Upcoming Events
Project Management:
Time Estimation for Your Projects
Exclusive Interview:
Andrew Fitzgibbon - Microsoft
Computer Vision Project:
Nuclei Detection and Segmentation
Computer Vision Tools:
Object Tracking in Python Using OpenCV - with codes!
Contents

Tool: Object Tracking with OpenCV
Project Management Tip: Projects Time Estimation
Guest: Andrew Fitzgibbon
Dear reader,
CVPR 2019 is just behind us, and we can
almost still hear the voices from CVPR that we
heard in Long Beach. It was a huge conference
with excellent scientific content, as confirmed
by those who were there and who read about
it in our CVPR Daily. For everyone, this
magazine includes a 24-page BEST OF CVPR
section, starting on page 14: be sure not to miss it!

This issue of Computer Vision News includes several articles dedicated to the autonomous driving area, one of our most popular subjects. You'll find some in the CVPR section and one more on page 42, where AutoX CEO and founder Jianxiong Xiao (a.k.a. Professor X) kindly shares with us how computer vision helps them build a full AI platform for applications such as logistics delivery services, airport transportation and city driving in smart, 5G-connected cities. You will find more articles related to autonomous driving on pages 24 and 32.

Finally, here you'll find the full recording (with all the slides) of the webinar hosted just a few days ago by Moshe Safran and Miki Haimovich: "Project Management for AI Implementation in Medical Devices". Stay tuned and don't miss our upcoming webinars on medical imaging and more.

Enjoy reading!

Did you subscribe to Computer Vision News? It's free, click here!

Computer Vision News
Editor: Ralph Anzarouth
Engineering Editor: Amnon Geifman
Publisher: RSIP Vision

Contact us • Give us feedback • Free subscription • Read previous magazines

Copyright: RSIP Vision. All rights reserved. Unauthorized reproduction is strictly forbidden.
by Dorin Yael
RSIP Vision also performed the evaluation of the types, shapes and numbers of cells in a specific tissue area. Thus, our goal in this project was to detect and segment single cell nuclei in slices of tissue. To achieve this goal, we had to overcome multiple challenges: the very high variability between the properties of nuclei belonging to different cell types, including shapes, sizes, color intensities and textures, which vary within and between tissues, and the high density of nuclei in every single slice. In addition, sometimes cells are clustered and need an expert eye to separate them into single cells with definite borders. Another difficulty is due to occlusion, when one cell is partially hidden by another cell. Observations are very subjective as well: specifically in clusters of cells, inter- and intra-observer variability is particularly high.

All of these make it very challenging to build a single process which detects them all in every single slide; there is no single set of parameters which covers them all. Therefore, a smart solution was needed. We separated the project into two steps: the first step focused on detecting the cells, while the second focused on the segmentation of the cells detected during the first step. To do this, we used deep learning techniques which specifically adapted U-Net with different ground truths. Of course, the accuracy of the model depends on the accuracy, number and richness of the annotations: for this project we annotated a few thousand nuclei using different approaches.

The first step of processing in our solution was cell detection. Once the nuclei are detected, the cell segmentation task starts, with the goal of segmenting each cell separately (instance segmentation). Once this is done automatically, accurately and fast, the output can be used both for diagnostic and for research purposes. The standardization that we were able to apply with our solution solves the problem of variability in the observations. The fact that this is done automatically reduces the human time needed and enables more questions to be answered. If a physician needs to make a diagnosis based on cell detection, classification or count, they will be able to do it faster and better, since it will be independent from individual human observations.
A project by RSIP Vision
its spatial structure. Using automatic analysis of these images opens new opportunities to shed more light on multiple open research questions.

The main aspect of our solution was the decision to separate the project in two parts, enabling us to be specifically precise in each one of them. The other valuable choice that we made was the … emerging from different tissues and dyed by different staining techniques.

Read about more projects in medical segmentation.

Take us along for your next Deep Learning project! Request a call here.
Project Management Tip
The next step will be to inspect the different threats that might appear during the project and try to mitigate the risks with specific solutions.

After this first estimation, you must review and analyze this work plan in terms of the project and product life cycle for your company. It may start with a first prototype, but then you have to deliver fully working software, including a GUI that will be checked and approved by the marketing team, and you also have to take into account the productization, which includes all the testing and the fulfillment of regulatory compliance requirements (like the FDA's and others, when applicable). Taking all those aspects into consideration, my professional experience says that you will certainly obtain a reliable time estimation, which you can present to the other stakeholders in the organization.

More articles on Project Management
by Amnon Geifman
Every month, Computer Vision News reviews a
research paper from our field. This month we
have chosen Deep Image Reconstruction from
Human Brain Activity. The authors (Guohua
Shen, Tomoyasu Horikawa, Kei Majima and
Yukiyasu Kamitani) published their article here.
The framework of the task is simple: subjects are placed in an fMRI machine, a sequence of images is shown to them, and the fMRI signal is recorded. The images are divided into three classes: natural images, artificial shapes, and alphabetical letters.
The fMRI data is represented in a 4-dimensional structure: one dimension is for
the time and the other three dimensions are a structural representation of the
activity of the brain, represented in voxels. The dataset contains a set of training images and a set of test images. While this paper performs reconstruction of the images from brain activity, classification of the images into classes is also possible.
The authors use DNN visual features of a given image based on the VGG-19
network. These features are meant to approximate the hierarchical neural
representation of the human visual system. Then, the signal from the fMRI is
translated to fit into these visual features. At test time, given the translated
signal, an optimization is performed to find the image whose DNN features are best correlated with the fMRI signal. In this way, a reconstruction of the
image is generated. The figure in the next page demonstrates the main idea of
the method; we now dive into the details of the paper.
Method
In order to represent an image in an informative way, the authors used a pre-trained classification model with the VGG-19 architecture. This architecture was trained to classify the ImageNet dataset, with 1000 object categories. A trained classification model (such as VGG-19) gives a hierarchical representation of an image which is invariant to shifts and rotations of the objects. In turn, this makes the model invariant to such transformations at inference time.
The VGG-19 model consists of 16 convolutional building blocks and 3 fully connected layers. The output of each layer was taken immediately after the weight multiplication (without rectification) and concatenated into a vector. This vector, named the visual feature vector, will later be used in the training and reconstruction process.
fMRI decoding
The authors construct a decoding model that predicts the visual feature vectors from the brain signal using sparse linear regression (SLR). The logic is
that the SLR can automatically select the most important features for
prediction. In this way, they also cope with the high dimensionality of the
fMRI signal.
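To show how sparsity picks out the informative voxels, here is a toy stand-in for that decoding step (our own sketch with synthetic data: a plain Lasso solved by ISTA replaces the Bayesian SLR used in the paper):

```python
import numpy as np

# Sparse regression from "voxel amplitudes" to one DNN feature value.
rng = np.random.default_rng(0)
n_samples, n_voxels = 200, 50
X = rng.standard_normal((n_samples, n_voxels))   # voxel amplitudes per trial
w_true = np.zeros(n_voxels)
w_true[:5] = 1.0                                 # only 5 voxels are informative
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

lam = 5.0                                        # L1 strength
step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1/L step size for ISTA
w = np.zeros(n_voxels)
for _ in range(500):
    w = w - step * (X.T @ (X @ w - y))           # gradient step on 0.5*||Xw-y||^2
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

print(np.flatnonzero(np.abs(w) > 0.5))           # indices of selected voxels
```

The soft-thresholding drives the weights of uninformative voxels to zero, which is exactly the automatic feature selection described above.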
Remember that, during test time, we don't have the input image nor the visual
feature vector. What we do have is the feature vector generated from the brain
signal, which is supposed to be a good approximation of the visual feature
vector (that was generated from the image). To reconstruct the image, the
authors use optimization in the image space that aims at finding the image, the
DNN features of which are the closest to the fMRI signal feature vector. In order
to do so, the authors formulated the optimization problem of the form:
$$v^{*} = \operatorname*{arg\,min}_{v}\; \frac{1}{2} \sum_{l} \left\| \Phi_{l}(v) - y_{l} \right\|^{2}$$
where v is a vector whose elements are the pixel values of an image (of size 224×224×3), v* is the reconstructed image, Φ_l is the l-th visual feature mapping, and y_l is the corresponding translated signal generated from the fMRI signal. The goal of this optimization is to find the image that best fits the brain signal.
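As a toy illustration of this image-space optimization (our own sketch: a random linear map stands in for the DNN feature extractor Φ, and plain gradient descent stands in for the paper's optimizer):

```python
import numpy as np

# Reconstruction by optimization in "image space": find v whose features
# match the target feature vector y decoded from the fMRI signal.
rng = np.random.default_rng(0)
d_pix, d_feat = 64, 16
Phi = rng.standard_normal((d_feat, d_pix))  # stand-in feature extractor
y = Phi @ rng.standard_normal(d_pix)        # "decoded" target features

v = np.zeros(d_pix)                         # start from a blank image
lr = 0.01
for _ in range(500):
    grad = Phi.T @ (Phi @ v - y)            # gradient of 0.5*||Phi v - y||^2
    v -= lr * grad

print(np.linalg.norm(Phi @ v - y))          # feature residual shrinks toward 0
```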
DGN constraint
In order to improve the 'naturalness' of the reconstructed images, the authors
use a generative adversarial network (GAN). Using a pre-trained deep generator
network (DGN), denoted by G(z), they modified the optimization objective to be
of the form:
$$z^{*} = \operatorname*{arg\,min}_{z}\; \frac{1}{2} \sum_{l} \left\| \Phi_{l}(G(z)) - y_{l} \right\|^{2}$$
Once the minimizer z* of this objective is found, the final image is taken to be v* = G(z*). Since the GAN is trained to produce natural-looking images, this optimization generates more plausible results and improves the appearance of the previous reconstruction.
Results
In the field of deep image reconstruction, the results are the most important
aspect of the model. While there are some quantitative measures of the results, we will let you examine the visual results produced by the method described above. We show here results from the paper on three different tasks: reconstruction of shapes and letters, reconstruction of natural images, and reconstruction of images from imagination. We start with the most straightforward task, reconstruction of shapes and letters. You can see the results in the figure below:
As you can see, these reconstructions are relatively clear: you can see the edges of the shapes and read the word NEURON from the reconstructed images. These results are the first of their kind to make the content of the images understandable. Note that the images above come from a relatively simple distribution, hence good results are expected.
The next set of results is the natural image reconstructions. Note that this is a much harder task, since the images have a much more complex distribution. It is analogous to a classification network that achieves great results on MNIST and
less accurate results on ImageNet. However, these results are also interpretable and sometimes make it possible to identify the objects in the image. You can appreciate the results in the following figure:
In the figure above, the iteration axis refers to the number of iterations in the optimization scheme. Although the reconstructions are not entirely identical to the source images, it can be seen that they capture the main object and some of its details, giving an abstract representation of the image.
The last set of results is the reconstruction from imagination. In this task, the
subjects were instructed to imagine an image that they saw in the experiment.
The reconstruction, in this case, was done based on the imagined image. This is of course a much more challenging task, as the signal-to-noise ratio is dramatically lower. In the following figure you can see the results of this experiment:
Conclusion
Image reconstruction from brain signals is a fascinating task that several teams around the globe are working hard to solve. The low signal-to-noise ratio, as well as the high dimensionality of both the fMRI signal and the image, are the greatest challenges in this field. The results presented above can be considered very good for this field. As you can see, there is still a lot of research to do on this task and many methods to try, which will hopefully bring us to better solutions to the problem. There is no doubt that the ability to reconstruct images from the brain would have a tremendous influence on humanity and on science in general.
BEST OF CVPR 2019
24 PAGES !!!
AutoAugment: Learning Augmentation Strategies from Data
the edges, where the weight of the edge basically says how much two domains are related. If I tell you I see a car from a side view and from a rear-side view, a rear side and a side are closer than a rear and a front, so we relate them in this way. When we get the target attribute, at that point we initialize a virtual node – a fake node – because we didn't have any data for that. No parameters are there, but if you assume that similar metadata – so similar domains in our graph, similar nodes – require similar parameters, we can just propagate the parameters of nearby nodes and obtain our model for the target domain which we never see."

He adds that there is another problem with that. Like before, if someone tells you that it's going to rain, you go out with an umbrella, but if they don't tell you that and you go out and it starts raining, then you must figure out how to react. Since obviously there will be nothing supervised – and so the prediction of the target model can of course be wrong, and also some metadata may be received which are not representative of the domain – this method unifies this kind of prediction with continuous domain adaptation. As the target data is received, the model is continuously updated. This is possible because the model is based on batch normalization. The different domains have different batch normalization statistics and scale and bias parameters. For each domain, there are different statistics, and so for the target you basically predict just the statistics of the domains and the scale and the bias. Once you have those, at test time when you receive the target data, you use it to update the statistics, because it can be done easily. Then you devise an unsupervised loss, which is just an entropy loss, to update the scale and the bias for the domain-specific parameters.
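The adaptation step Massimiliano describes could be sketched like this (our own toy model and names, not the authors' code): run an unlabeled target batch through the network in train mode so the BatchNorm statistics update, and minimize an entropy loss to refine the BN scale and bias.

```python
import torch
import torch.nn as nn

# Toy sketch: BN running stats update on unlabeled target data, and an
# entropy loss refines the BN scale (weight) and bias parameters.
model = nn.Sequential(nn.Linear(16, 32), nn.BatchNorm1d(32),
                      nn.ReLU(), nn.Linear(32, 4))
bn = model[1]
opt = torch.optim.SGD([bn.weight, bn.bias], lr=0.01)  # adapt only scale/bias

target_batch = torch.randn(64, 16)   # unlabeled target-domain batch
model.train()                        # train mode: BN statistics are updated
probs = model(target_batch).softmax(dim=1)
entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

opt.zero_grad()
entropy.backward()                   # unsupervised loss, no labels needed
opt.step()
```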
Massimiliano tells us that they were restricted in terms of the data they had to test on. They tested on the data set used in the literature for this problem, CompCars, with different cars, different viewpoints and different years of production, and also on another data set which depicted portraits collected over 100 years in different regions of America. This is not a real application of this problem. He says they would like to have a model, for example, in autonomous driving. He explains: "I tell you it's evening, it starts getting darker, so it adapts your model for this kind of light. Obviously, one can say that if I have a huge amount of data which are balanced among all the possible conditions. Our algorithm does not do it. Nowadays, the data sets are unbalanced, and we must specialize our systems."

Massimiliano says they assume the metadata is representative of the domain shift, which cannot be true, so they must weight them. He says a next step for this work would be to understand which metadata are important for the domain shift and which are not. He also thinks they could go from the metadata to a more abstract representation, because metadata is good if you have it and if you can quantify the shift, but if someone tells you it's dark or it's darker, you can understand what the environment will look like, so a description rather than metadata would be helpful.

Finally, Massimiliano is excited to tell us that he is very proud of his heritage: "I come from a very small village in the middle of Italy, in Umbria. I think we are 30 people in the centre of the village. It was a long journey to get here and I'm very happy that at the end, even starting from there, I can represent my small village here."
To solve this, they are leveraging some recent work in medical imaging: Adrian V. Dalca and Guha Balakrishnan's VoxelMorph. The idea is that if you have a method that can compute the transformation from your labelled scan to an unlabelled scan, that allows you to compute a set of transformations. Once you have the set of transformations, you apply them back to the labelled scan and use that to synthesise more labelled examples.

Co-author Adrian Dalca comments that "This work shows how leveraging unlabeled data and machine learning can lead to tangible and practical improvements. This is important, since the next step is to use this algorithm in practical clinical analyses."

Co-author Guha Balakrishnan adds that "The benefit of the method is that it is interpretable and a simple idea that can extend to various other data, both in and out of medical imaging."

Thinking about how to develop this work, Amy points out that they've demonstrated it on a fairly limited range of scans so far. They just look at T1 MRIs and show what happens if you assume that you have one labelled scan and a hundred unlabelled scans. What happens if you have multiple labelled scans available, or if you apply this approach to different kinds of MRIs or CTs? They think it would work, but don't know how well, and are really interested to find out.
Top: Cecilia Xuaner Zhang presenting her poster Zoom to Learn, Learn to
Zoom. Hers was one of the most crowded posters in a very crowded
Tuesday afternoon poster session!
Below: runner-up best paper award for "SizeNet: Weakly Supervised Learning of Visual Size and Fit in Fashion Images" at the Understanding Subjective Attributes of Data, Focus on Fashion and Subjective Search (FFSS-USAD) workshop. Author Nour Karessli (left, Zalando SE) is greeted by Diane Larlus (Naver Labs Europe) and Nicu Sebe (University of Trento).
Tool: Object Tracking in Python Using OpenCV

When choosing a tracking algorithm, we care about the following properties:
Speed: in some applications, we want our tracker to process hundreds of frames per second (FPS). In other applications, the tracking is done offline, so speed is less important.

Accuracy: of course, we want our tracker to be accurate. However, there is generally a trade-off between accuracy and running time: a fast tracker might be inaccurate, and vice versa.

Occlusion robustness: when tracking a moving object, occlusion by other objects is a common phenomenon. In some applications, we want our tracker to cope with such occlusions without switching to track the occluding object.
Luckily, the OpenCV library contains a tracking API with 8 tracking algorithms: Boosting, MIL, kernelized correlation filters (KCF), discriminative correlation filter (CSRT), median flow, TLD, MOSSE and GOTURN. Each of these trackers has a different speed-to-accuracy ratio; in this article we will use the KCF tracker, since it is fast, relatively accurate and good at failure detection.
The main idea of the algorithm is to build a correlation filter (i.e. kernel) such that its convolution with the input image gives the desired response. This desired response usually has a Gaussian shape, centered on the object and decreasing with distance. To calculate the optimal filter, the algorithm uses translated instances of the object from previous frames. The location of the maximal filter response is taken as the object location.
In order to explain how the kernel trick works (as in SVM), we first start with a linear correlation filter. The optimal linear filter w is found by solving the following regularized least squares problem:
$$\min_{w}\; \left\| Xw - y \right\|^{2} + \lambda \left\| w \right\|^{2}$$
Here, X is a circulant matrix containing all the possible cyclic image shifts, λ is a
regularization coefficient term and y is the response that we expect to receive.
The advantage of this formulation is that, given such a circulant matrix X, we can find the optimal weights w* in the Fourier domain using a closed-form solution.
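A 1-D numerical sketch of that closed form (our own illustration; sign and conjugation conventions vary with the shift direction, the formula below matches rows built with np.roll): the circulant structure diagonalizes under the DFT, so a few FFTs replace the matrix solve.

```python
import numpy as np

# Ridge regression over all cyclic shifts of x, solved in the Fourier domain.
rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)                   # base sample (a 1-D "image")
dist = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (dist / 2.0) ** 2)         # desired Gaussian-shaped response
lam = 1e-2                                   # regularization coefficient

xf, yf = np.fft.fft(x), np.fft.fft(y)
wf = xf * yf / (xf * np.conj(xf) + lam)      # closed form for rows = roll(x, i)
w = np.real(np.fft.ifft(wf))

# Check against the explicit circulant ridge solution:
X = np.stack([np.roll(x, i) for i in range(n)])
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
print(np.allclose(w, w_direct))
```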
Like in SVM, the kernel trick allows us to perform a non-linear regression by mapping the input with a non-linear mapping φ. In this way, the weights have the form w = Σ_i α_i φ(x_i) and the minimization problem becomes:
$$\min_{\alpha}\; \left\| K\alpha - y \right\|^{2} + \lambda\, \alpha^{T} K \alpha$$
Here K is the kernel matrix with entries k_ij = φ(x_i)ᵀφ(x_j), so we can solve in closed form in the Fourier domain as well. In this method, an RBF (Gaussian) kernel is usually used.
Implementation
Before we dive into the code, note that some of the trackers described above are not available in old versions of OpenCV, so make sure that you have pip-installed the latest version of OpenCV. Moreover, make sure you have opencv-contrib-python in your environment; we need it to create the tracking object.
We begin by importing the cv2 and sys libraries:
import cv2
import sys
Next, we define the tracker object. As mentioned above, we have eight possible trackers to choose from, and we use KCF. For the other trackers, take a look at the OpenCV documentation. Defining the KCF tracker is as simple as:
tracker = cv2.TrackerKCF_create()
Now we are ready to preprocess the video. Essentially, we open the video and
read the first frame in the video; in this example, I'll also show you how to rotate
each frame in case your video is not aligned. This can all be done in the following
code:
video = cv2.VideoCapture("IMG_1003.MOV")
flag, frame = video.read()

(h, w) = frame.shape[:2]
center = (w / 2, h / 2)
M = cv2.getRotationMatrix2D(center, 270, 1)
frame = cv2.warpAffine(frame, M, (h, w))
In the above code, the first line loads the video from a file on my computer. The second line is a standard extraction of a frame from a video; the flag argument specifies whether the frame was extracted successfully. The last four lines just rotate the given frame by 270 degrees.
At this stage we are ready to begin. In this code, the region of interest, i.e. the
initial bounding box, will be defined manually. It is not very hard to use some
pre-trained network to perform the initial detection. For example, YOLO
network can extract a good bounding box for the object; however, we keep that for future articles, in order to explain detection methods in detail. The ROI selector of cv2 opens the frame in a dialog box and allows us to mark a rectangle over the object. The output is a tuple of four numbers: the x and y coordinates of the top-left corner, followed by the width and height of the selection. We use the ROI selector as follows:
The last thing we still need to do is use the box that we defined. We will iterate through the frames, and for each frame we update the tracker. If the tracking succeeds, we draw the box on the frame and show it; otherwise, we just show the frame. At the end, we add an option to break the loop when Esc is pressed or no more frames are available. The code looks like this:
This is actually it. In my code, I added a few additional lines of code to measure
the number of frames per second and to make the visualization nicer, but this
code is enough to track any object! Let's see some examples:
Results

We now demonstrate the performance of the tracker. To this end, we track the phrase FOCUS ON ME in the following video. You can see the results below.
Conclusion
In this article we have seen how to use tracking algorithms in OpenCV. Specifically, we explained and implemented the KCF tracker, which gave us the speed and accuracy we wanted. As mentioned above, there are seven more trackers implemented in OpenCV, each with different properties and different robustness. You can use our code as is and just replace the line that defines the tracker; this lets you try other trackers depending on your application. Enjoy!
Autonomous Driving: AutoX
Xiao tells us that there are many automakers around the world making cars, and there are many sensor companies making Lidar and radar cameras, but what is missing is a good AI platform.

The self-driving AI platform has many components. The first is high-definition 3D (HD) mapping. Next is localisation: localising where the vehicle is in relation to the HD map. Then comes perception, including object detection, object recognition and object tracking, as well as object prediction, to predict what an object is going to do next.

Then there is a decision-making component called behaviour planning. Given the road condition and the traffic situation, the car has to make a smart decision about whether it's going to stop, go, turn left, or turn right. Next comes motion and speed planning: after it has made that decision, it has to plan the trajectory and exact speed at which it is going to drive.
The next step is about controlling the vehicle to drive autonomously. The control part has two sides – the algorithm side and the hardware side. The vehicle control unit has to send an electrical signal to the vehicle in order to control the steering and the throttle.

This is a full-stack solution, from the very beginning to the end, for self-driving cars. It is backed up by a data infrastructure built by gathering a huge amount of data from global testing fleets. Before AutoX cars hit the road, they are tested heavily through digital simulation.

Jianxiong emphasises how important it is that they make no mistakes. Being a self-driving car company, they require 100 per cent accuracy. Not a single compromise is allowed. He is very proud that they have put together such a robust and reliable system.

The solution covers many aspects of computer vision. Jianxiong explains: "For example, for perception, obviously we need object detection. We need semantic segmentation. Not only do we detect the object, but we're segmenting it out at instance level. We need to do this at a very fast speed. Unlike other applications where you can wait one second to get a picture, for us, we need real-time performance. When the image comes in, we need to finish all the computation immediately. We use a convolutional neural network to do that."
To support object tracking, they have a 3D convolutional neural network running in the physical space using a LiDAR signal. They combine the LiDAR and camera together in another neural network. For object prediction, because it's a time sequence signal, they use a recurrent neural network.

In three years, AutoX has grown to the point that they currently have a large testing fleet of self-driving cars in the US and China. Jianxiong claims that they are the only company in the world with AI sophisticated and smart enough to deal with the very dense and challenging traffic there. They are doing autonomous testing downtown in Shenzhen, the city with the highest population density in China, where they are piloting a fleet of RoboTaxis.
In terms of next steps, Jianxiong tells us: "The next step is to further improve our technology. To polish it. By the end of the year, we will have more than 100 self-driving cars testing. We're also working with a lot of car manufacturers. For example, we're working with Dongfeng Motor, the second-largest Chinese car manufacturer. We're also working with Shanghai Motor. They are the largest Chinese car manufacturer. They are both working closely together for us to provide a self-driving fleet to really push the technology to commercialisation and production."

AutoX are currently hiring for various positions. See their website for more details.
FEEDBACK

Dear reader,

If you like Computer Vision News (and also if you don't like it) we would love to hear from you: give us feedback, please (click here). It will take you only 2 minutes. Please tell us and we will do our best to improve. Thank you!

We hate SPAM and promise to keep your email address safe, always.
IMPROVE YOUR VISION WITH Computer Vision News
SUBSCRIBE — CLICK HERE, IT'S FREE
A PUBLICATION BY RSIP VISION