
WEEK 11 - 12: LEARNING COMPUTER VISION

Lesson XI-XII: Technologies :

A. Face detection : Haar, HOG, MTCNN, Mobilenet

B. Face recognition : CNN, Facenet

C. Object recognition : alexnet, inceptionnet, resnet

D. Transfer learning : re-training a big neural network with few resources on a new topic

E. Image segmentation : rcnn

F. GAN

G. Hardware for computer vision : what to choose, GPU is important

H. UI apps integrating vision : ownphotos


Image segmentation for autonomous driving

Computer vision has advanced a lot in recent years. These are the topics I will cover here :

Technologies :

● Face detection : Haar, HOG, MTCNN, Mobilenet


● Face recognition : CNN, Facenet

● Object recognition : alexnet, inceptionnet, resnet

● Transfer learning : re-training a big neural network with few resources on a new topic

● Image segmentation : rcnn

● GAN

● Hardware for computer vision : what to choose, GPU is important

● UI apps integrating vision : ownphotos

Applications :

● personal photos organization

● autonomous cars

● autonomous drones
● solving captcha / OCR

● filtering pictures for a picture based website/app

● automatically tagging pictures for an app

● extracting information from videos (TV shows, movies)

● visual question answering

● art

People to follow :

● important deep learning founders : Andrew Ng, Yann LeCun, Yoshua Bengio, Geoffrey Hinton

● Adam Geitgey https://medium.com/@ageitgey has a lot of interesting articles on vision, such as https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78 with a full face detection/alignment/recognition pipeline

Courses :

● deep learning @ coursera

● machine learning @ coursera

Related fields :

● deep reinforcement learning : see PPO and DQN with a CNN as input layer

● interaction with nlp : lstm 2 cnn

Face detection
Face detection is about placing boxes around faces

Face detection is the task of locating faces in an image. There are several algorithms to do that. https://github.com/nodefluxio/face-detector-benchmark provides a benchmark of the speed of these methods, with easy-to-reuse implementation code.
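Since a detector outputs boxes, comparing two detectors usually comes down to comparing boxes. A common way to do that is intersection over union (IoU). The sketch below is only an illustration: the `(x1, y1, x2, y2)` box format and the function name are assumptions, not something taken from the benchmark above.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (0 if the boxes do not overlap).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


detected = (0, 0, 2, 2)      # hypothetical detector output
ground_truth = (1, 1, 3, 3)  # hypothetical annotated face
print(iou(detected, ground_truth))
```

A value of 1.0 means the boxes match exactly, and 0.0 means they do not overlap at all.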

Haar classifiers

haar features
Haar classifiers are the oldest of these computer vision methods and have been present in OpenCV since the early 2000s. They were introduced in this paper: http://wearables.cc.gatech.edu/paper_of_week/viola01rapid.pdf. It is a machine learning model with features chosen specifically for object detection. Haar classifiers are fast but have low accuracy.

See a longer explanation and an example of how to use it at https://docs.opencv.org/3.4.3/d7/d8b/tutorial_py_face_detection.html
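To give an idea of what a Haar-like feature is, here is a minimal sketch in plain Python. It computes one two-rectangle feature (left half minus right half) using the integral image trick from the paper. The function names and the toy 4x4 image are assumptions made for illustration; a real classifier combines many such features in a trained cascade.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table for a 2D list of pixel values."""
    h, w = len(img), len(img[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Sum of the pixels in a rectangle, using only four table lookups."""
    bottom, right = top + height, left + width
    return table[bottom][right] - table[top][right] - table[bottom][left] + table[top][left]


def two_rect_feature(table, top, left, height, width):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    return rect_sum(table, top, left, height, half) - rect_sum(table, top, left + half, height, half)


# Toy image: bright left half, dark right half.
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
table = integral_image(image)
feature = two_rect_feature(table, 0, 0, 4, 4)
print(feature)  # 64
```

A large value means the window contains a strong vertical edge, which is the kind of simple pattern these features respond to.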

HOG : Histogram of Oriented Gradients


Histogram of oriented gradients

HOG is a newer method to generate features for object detection: it has been in use since 2005. It is based on computing gradients on the pixels of your images. These features are then fed to a machine learning algorithm, for example an SVM. It has better precision than Haar classifiers.


An implementation is available in dlib, which is used by the face_recognition library (https://github.com/ageitgey/face_recognition).
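The idea of a histogram of oriented gradients can be sketched for a single cell in a few lines. This is a simplified illustration, not dlib's implementation: the 9 bins over 0 to 180 degrees are an assumption borrowed from common HOG setups, and real HOG also interpolates between bins and normalizes over blocks of cells.

```python
import math


def cell_histogram(cell, bins=9):
    """Unsigned gradient orientation histogram (0 to 180 degrees) for one cell."""
    height, width = len(cell), len(cell[0])
    histogram = [0.0] * bins
    bin_width = 180.0 / bins
    # Skip the border pixels so that both neighbours always exist.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by gradient strength.
            histogram[int(angle // bin_width) % bins] += magnitude
    return histogram


# Toy cell containing a vertical edge (dark on the left, bright on the right).
cell = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
histogram = cell_histogram(cell)
print(histogram)
```

Here all the gradient energy lands in the first bin, because every gradient points horizontally. The histograms of all cells, concatenated, form the feature vector given to the SVM.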

MTCNN

A newer method using a variation on CNNs to detect faces. It has better precision but is a bit slower. See https://kpzhang93.github.io/MTCNN_face_detection_alignment/index.html

MobileNet

The best and fastest of these methods for face detection at the time of writing. It is based on the general MobileNet architecture. See https://arxiv.org/abs/1704.04861

Object detection
Object detection on many kinds of objects

Object detection can be achieved using methods similar to those used for face detection.

Here are 2 articles presenting recent methods to achieve it. These methods sometimes provide the class of the objects too (achieving object recognition) :

● https://towardsdatascience.com/review-r-fcn-positive-sensitive-score-maps-object-detection-91cd2389345c : R-FCN

● https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e : a comparison of R-CNN, Fast R-CNN, Faster R-CNN and YOLO

Convolutional neural networks


Recent progress in deep learning has produced new architectures that have achieved a lot of success.

Neural networks using many convolution layers are one of them. A convolution layer takes advantage of the 2D structure of an image to generate useful information in the next layer of the neural network. See https://towardsdatascience.com/intuitively-understanding-convolutions-for-deep-learning-1f6f42faee1 for a detailed explanation of what a convolution is.
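As a rough illustration of what a convolution layer computes, here is a minimal sketch in plain Python, with no padding and a stride of 1. The kernel values are hand-picked for the example; in a real CNN they are learned during training, and frameworks run this operation on many channels at once.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    output = []
    for y in range(image_h - kernel_h + 1):
        row = []
        for x in range(image_w - kernel_w + 1):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy image with a vertical edge between the 2nd and 3rd columns.
image = [
    [0, 0, 5, 5],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, edge_kernel)
print(feature_map)  # [[0, 10, 0], [0, 10, 0]]
```

The output is high only where the edge is, which shows how a small kernel turns the 2D structure of the image into a useful feature map for the next layer.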

A convolution layer

Object recognition
Object recognition is the general problem of classifying objects into categories (such as cat, dog, …)


Deep neural networks based on convolutions have been used to achieve great results on this task.

The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) is a competition run on ImageNet (http://www.image-net.org/, a database of many images tagged with the objects they contain, such as cat, dog, ...).

The most successful neural networks have been using more and more layers.

The ResNet architecture is the best at classifying objects to date.
Resnet architecture

Training it properly requires millions of images and takes a lot of time, even with tens of expensive GPUs.

That’s the reason why methods that don’t require retraining on such big datasets every time are very useful. Transfer learning and embeddings are such methods.

Pretrained models for resnet are available in

https://github.com/tensorflow/tensor2tensor#image-classification

Face recognition
Face recognition is about figuring out whose face it is.

Historic methods
The historic way to solve this task has been either to apply feature engineering with standard machine learning (for example an SVM) or to apply deep learning methods for object recognition.

The problem with these approaches is that they require a lot of data for each person. In practice, that data is not always available.

Facenet

FaceNet was introduced by Google researchers in 2015: https://arxiv.org/abs/1503.03832. It proposes a method to recognize faces without having a lot of face samples for each person.

It works by taking a dataset of pictures of a large number of faces (such as http://vis-www.cs.umass.edu/lfw/).

Then an existing computer vision architecture such as Inception (or ResNet) is taken, and the last layer of this object recognition network is replaced with a layer that computes a face embedding.

For each person in the dataset, triplets of faces (a positive sample, a second positive sample of the same person, and a negative sample of another person) are selected using heuristics and fed to the neural network. That produces 3 embeddings. On these 3 embeddings the triplet loss is computed, which minimizes the distance between the positive sample and any other positive sample, and maximizes the distance between the positive sample and any negative sample.
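The triplet loss itself is short enough to sketch in plain Python. In the FaceNet paper the first positive sample is called the anchor, and the loss is the squared distance to the positive, minus the squared distance to the negative, plus a margin, clipped at zero. The 2-number embeddings and the margin of 0.2 below are toy values chosen for illustration; real embeddings come out of the network.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive, by at least the margin."""
    return max(squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]         # a face of person A
positive = [0.1, 0.0]       # another face of person A
easy_negative = [1.0, 0.0]  # a face of person B, already far away
hard_negative = [0.2, 0.0]  # a face of person B, too close to the anchor

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

Only the hard triplet produces a non-zero loss, which is why the triplets are picked with heuristics rather than at random: easy triplets teach the network nothing.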

Triplet loss
The end result is that each face (even faces not present in the original training set) can now be represented as an embedding (a vector of 128 numbers) that is far from the embeddings of other people's faces.
These embeddings can then be used with any machine learning model (even simple ones such as k-NN) to recognize people.
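As a sketch of how simple that last step can be, here is a 1-nearest-neighbour lookup in plain Python. The 3-number embeddings, the names and the distance threshold of 0.6 are all made up for the example; real embeddings have 128 numbers and come from the trained network.

```python
import math


def nearest_identity(query, known_faces, threshold=0.6):
    """Return the name of the closest known embedding, or None if it is too far away."""
    best_name, best_distance = None, float("inf")
    for name, embedding in known_faces.items():
        distance = math.dist(query, embedding)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= threshold else None


# One reference embedding per person is enough for this lookup.
known_faces = {
    "alice": [0.1, 0.9, 0.0],
    "bob": [0.8, 0.1, 0.3],
}
match = nearest_identity([0.15, 0.85, 0.05], known_faces)
stranger = nearest_identity([5.0, 5.0, 5.0], known_faces)
print(match, stranger)
```

The threshold is what lets the lookup answer "nobody I know" instead of always returning the closest person.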

What is very interesting about FaceNet and face embeddings is that they let you recognize people with only a few pictures of them, or even a single one.

See this library implementing it :

https://github.com/ageitgey/face_recognition

This is a TensorFlow implementation of it :

https://github.com/davidsandberg/facenet

This is a cool application of the ideas behind this face recognition pipeline, used instead to recognize bear faces :

https://hypraptive.github.io/2017/01/21/facenet-for-bears.html
Transfer learning

Quickly retrain an accurate neural network on a custom dataset

Training a very deep neural network such as ResNet is very computation intensive (several weeks of training on multiple GPUs) and requires a lot of data. To remedy that, we already talked about computing generic embeddings for faces. Another way is to take an existing network and retrain only a few of its layers on another dataset.

Here is a tutorial for it : codelab tutorial. It has you retrain an Inception model to recognize classes of flowers that are unknown to it.

https://medium.com/@14prakash/transfer-learning-using-keras-d804b2e04ef8 presents good guidelines on which layers to retrain when doing transfer learning.
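The principle of retraining only the last layer can be sketched without any deep learning library. In the toy example below, `frozen_features` is a hypothetical stand-in for the pretrained layers (its weights never change), and only a small logistic "head" is trained on the new dataset. The data, learning rate and number of epochs are made up for illustration.

```python
import math


def frozen_features(x):
    """Stand-in for the pretrained layers: fixed, never updated during training."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def head_output(features, weights, bias):
    """The only trainable part: one logistic layer on top of the frozen features."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def train_head(samples, labels, epochs=500, lr=0.1):
    """Gradient descent on the head only; the feature extractor is computed once."""
    features = [frozen_features(x) for x in samples]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            error = head_output(f, weights, bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    return 1 if head_output(frozen_features(x), weights, bias) >= 0.5 else 0


# Tiny "new dataset" with two classes.
samples = [[2.0, 1.0], [3.0, 0.5], [1.0, 2.0], [0.5, 3.0]]
labels = [1, 1, 0, 0]
weights, bias = train_head(samples, labels)
print([predict(x, weights, bias) for x in samples])
```

Because the frozen part is never updated, its output can be computed once and cached, which is a large part of why transfer learning is so much cheaper than training from scratch.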

Image segmentation
Image segmentation for autonomous driving

Image segmentation is an impressive new task that has become possible in recent years. It consists of assigning a class to every pixel of an image.

This task is related to object detection. One algorithm to achieve it is Mask R-CNN; see this article for more details: https://medium.com/@jonathan_hui/image-segmentation-with-mask-r-cnn-ebe6d793272
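To make "a class for every pixel" concrete, here is a minimal sketch of the very last step of a segmentation model: turning per-pixel class scores into a label mask. The scores and class names are made up, and this is not how Mask R-CNN works internally (it predicts one mask per detected object); it only shows what the output of segmentation looks like.

```python
def segment(scores, class_names):
    """Pick the highest scoring class for every pixel."""
    mask = []
    for row in scores:
        mask_row = []
        for pixel_scores in row:
            best = max(range(len(pixel_scores)), key=lambda c: pixel_scores[c])
            mask_row.append(class_names[best])
        mask.append(mask_row)
    return mask


class_names = ["road", "car", "sky"]
# Hypothetical network output for a 2x2 image: one score per class, per pixel.
scores = [
    [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7]],
    [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]],
]
mask = segment(scores, class_names)
print(mask)  # [['sky', 'sky'], ['road', 'car']]
```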

GAN

Large scale GAN

Generative Adversarial Networks, introduced by Ian Goodfellow, are a neural network architecture in 2 parts : a discriminator and a generator.

● The discriminator detects whether a picture is a real image of a class or a generated one. It is trained alongside the generator.

● The generator produces an image for a given class

The weights of the generator are adapted during learning in order to produce images the discriminator cannot distinguish from real images of that class.
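The two opposing objectives can be written down in a few lines. In this sketch, `d_real` and `d_fake` are the discriminator's estimated probabilities that a real image and a generated image are real; the values used are made up, and the generator loss is the commonly used "non-saturating" variant rather than the only possible choice.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when real images score close to 1 and generated images close to 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Low when the discriminator believes the generated image is real."""
    return -math.log(d_fake)


# A confused discriminator outputs 0.5 for everything.
confused_d_loss = discriminator_loss(0.5, 0.5)
# The generator is better off when its image fools the discriminator.
fooling_g_loss = generator_loss(0.9)
failing_g_loss = generator_loss(0.1)
print(confused_d_loss, fooling_g_loss, failing_g_loss)
```

Training alternates between lowering the first loss by updating the discriminator and lowering the second by updating the generator, so each network improves against the other.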

Here is an example of images produced by the largest GAN yet

https://arxiv.org/abs/1809.11096

See an implementation of GAN in keras at

https://github.com/eriklindernoren/Keras-GAN

Hardware for computer vision


Training big models requires a lot of resources. There are two ways to get them. The first is to use cloud services, such as Google Cloud or AWS. The second is to build a computer with a GPU yourself.

With as little as $1000, it’s possible to build a decent machine to train deep learning models.

Read about this in more detail at https://hypraptive.github.io/2017/02/13/dl-computer-build.html

Vision in UI
Face dashboard of ownphotos
Ownphotos is an amazing UI that lets you import your photos and automatically computes face embeddings, performs object recognition and recognizes faces.

It uses :

● Face recognition: face_recognition

● Object detection: densecap, places365
