
Machine Learning Introduction:

The objective of this briefing is to present an overview of the machine learning techniques
currently in use or in consideration at statistical agencies worldwide. Section I outlines the main
reason why statistical agencies should start exploring the use of machine learning techniques.
Section II outlines what machine learning is, by comparing a well-known statistical technique
(logistic regression) with a (non-statistical) machine learning counterpart (support vector
machines). Sections III, IV, and V discuss current research or applications of machine learning
techniques within the field of official statistics in the areas of automatic coding, editing and
imputation, and record linkage, respectively. The material presented in this paper is the result of
a literature review, of direct contacts with authors during conferences, and more importantly of
an international call for input that was distributed on July 18, 2014 to participants from the 2014
MSIS Meeting, participants from the 2014 Work Session on Statistical Data Editing, and
members of the Modernization Committee on Production and Methods. Section VI contains a list
of machine learning applications in official statistics outside of the three areas mentioned above.

What Machine Learning means: In the statistical context, Machine Learning is defined as an application
of artificial intelligence where available information is used, through algorithms, to process or
assist the processing of statistical data. While Machine Learning involves concepts of
automation, it requires human guidance. Machine Learning involves a high level of
generalization in order to obtain a system that performs well on yet unseen data instances.

Why should statistical agencies consider machine learning?

Machine learning is a relatively new discipline within Computer Science that provides a
collection of data analysis techniques. Some of these techniques are based on well established
statistical methods (e.g. logistic regression and principal component analysis) while many others
are not. Most statistical techniques follow the paradigm of determining a particular probabilistic
model that best describes observed data among a class of related models. Similarly, most
machine learning techniques are designed to find models that best fit data (i.e. they solve certain
optimization problems), except that these machine learning models are no longer restricted to
probabilistic ones. Therefore, an advantage of machine learning techniques over statistical ones
is that the latter require underlying probabilistic models while the former do not. Even though
some machine learning techniques do use probabilistic models, classical statistical techniques
are often too restrictive for the coming Big Data era, because data sources are increasingly
complex and multi-faceted. Prescribing probabilistic models relating variables from disparate
data sources that are plausible and amenable to statistical analysis might be extremely difficult if
not impossible. Machine learning might be able to provide a broader class of more flexible
alternative analysis methods better suited to modern sources of data. It is imperative for
statistical agencies to explore the possible use of machine learning techniques to determine
whether their future needs might be better met with such techniques than with traditional ones.

Classes of Machine Learning:

There are two main classes of machine learning techniques: supervised machine learning and
unsupervised machine learning.

A. Logistic regression (statistics) vs Support vector machines (machine learning)

Logistic regression, when used for prediction purposes, is an example of supervised machine
learning. In logistic regression, the values of a binary response variable (with values 0 or 1, say)
as well as a number of predictor variables (covariates) are observed for a number of observation
units. These are called training data in machine learning terminology. The main hypotheses are
that the response variable follows a Bernoulli distribution (a class of probabilistic models), and
the link between the response and predictor variables is the relation that the logarithm of the
posterior odds of the response is a linear function of the predictors. The response variables of the
units are assumed to be independent of each other, and the method of maximum likelihood is
applied to their joint probability distribution to find the optimal values for the coefficients (these
parameterise the aforementioned joint distribution) in this linear function. The particular model
with these optimal coefficient values is called the “fitted model,” and can be used to “predict”
the value of the response variable for a new unit (or, “classify” the new unit as 0 or 1) for which
only the predictor values are known. Support Vector Machines (SVM) are an example of a
non-statistical supervised machine learning technique; they have the same goal as the logistic
regression classifier just described: given training data, find the best-fitting SVM model, and
then use the fitted SVM model to classify new units. The difference is that the underlying models
for SVM are the collection of hyperplanes in the space of the predictor variables. The
optimization problem that needs to be solved is finding the hyperplane that best separates, in the
predictor space, the units with response value 0 from those with response value 1. The logistic
regression optimization problem comes from probability theory whereas that of SVM comes
from geometry.
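
To make the comparison concrete, the following sketch (not part of the original briefing) fits both classifiers to simulated training data using scikit-learn; the dataset, package, and parameter choices are illustrative assumptions only.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Simulated training data: binary response (0/1) and a few predictors.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Statistical approach: fit a logistic regression model (maximum likelihood).
logit = LogisticRegression().fit(X_train, y_train)

# Machine learning approach: fit a linear SVM (best separating hyperplane).
svm = SVC(kernel="linear").fit(X_train, y_train)

# Both fitted models classify new units by predicting 0 or 1.
print("Logistic regression accuracy:", logit.score(X_test, y_test))
print("SVM accuracy:", svm.score(X_test, y_test))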

Other supervised machine learning techniques mentioned later in this briefing include decision
trees, neural networks, and Bayesian networks.

B. Principal component analysis (statistics) vs Cluster analysis (machine learning)

The main example of an unsupervised machine learning technique that comes from classical
statistics is principal component analysis, which seeks to “summarize” a set of data points in
high-dimensional space by finding orthogonal one-dimensional subspaces along which most of
the variation in the data points is captured. The term “unsupervised” simply refers to the fact that
there is no longer a response variable in the current setting.
Cluster analysis and association analysis are examples of non-statistical unsupervised machine
learning techniques. The former seeks to determine inherent grouping structure in given data,
whereas the latter seeks to determine co-occurrence patterns of items.
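
As a hedged illustration (not from the original text), the sketch below applies both unsupervised techniques to the same unlabeled data with scikit-learn; the data and the number of clusters are assumptions made only for demonstration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 unlabeled points in 5-dimensional space

# Statistical technique: summarize the data along two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Variance explained:", pca.explained_variance_ratio_)

# Machine learning technique: find inherent grouping structure (3 clusters assumed).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels of the first 10 points:", kmeans.labels_[:10])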

Automatic Coding:

1. Automatic coding via Bayesian classifier (Germany)

In a poster session at Statistics Canada’s 2014 International Methodology Symposium,
Bethmann et al. (of the Institut für Arbeitsmarkt- und Berufsforschung) reported on research on
applying two types of probabilistic supervised machine learning algorithms -- Naïve Bayes (NB)
and conjugate Bayesian analysis based on multinomial distributions (BMN) -- for automatic
occupation coding for German panel surveys. The authors used a large volume (approximately
300,000) of manually coded occupation text strings from recent surveys as training data. The rate
of agreement between automatic coding and manual coding was used as a metric to evaluate the
algorithms. Although both methods exhibited good agreement rates by common machine
learning standards, the authors cautioned that they might not be sufficiently satisfactory given the
considerably higher accuracy requirements of occupation coding in production settings (the
authors suggested a minimum agreement rate of 95%). On the other hand, the authors pointed
out that when the target variable was changed to “social-economic status” or “occupational
prestige” (more precisely, ISEI-08 and SIOPS-08 scores, both derived from occupation codes),
both methods yielded dramatically improved results. The authors concluded that the current
versions of their methods may be sufficient for production of socio-economic status or
occupational prestige predictions, but further improvements are required for production of
reliable occupation coding. Possibilities for improvement include the addition of a preprocessing
step (to “clean up” input text strings, thereby reducing noise in the training data), incorporation of a
certain distance measure in the existing models, as well as different machine learning methods
altogether (such as random forests or support vector machines). The authors project that their
methods will be ready for release as an open-source R package in several years.
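
The authors’ implementation is not publicly available; purely as an illustrative sketch of the general approach (and not their method), a Naive Bayes text classifier for occupation coding could be set up with scikit-learn as follows. The occupation strings and codes below are invented placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data: manually coded occupation texts (invented examples).
texts = ["software developer", "primary school teacher", "truck driver", "web programmer"]
codes = ["2512", "2341", "8332", "2513"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, codes)

# Automatic coding of a new, previously uncoded response.
print(model.predict(["python developer"]))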

2. Automatic occupation coding via CASCOT (United Kingdom).


The Computer-Assisted Structured Coding Tool (CASCOT) is an automatic occupation coding software tool
developed by the Institute for Employment Research at the University of Warwick, a partner in
the EurOccupations project. The objective of the project is to construct a publicly available
database of the most frequent occupations to facilitate multi-country data collection. Since 2009,
CASCOT has been able to perform automated coding into the ISCO’08 classification of
occupational texts in any of the seven languages of the eight EurOccupations partner countries.
CASCOT is available for online use for free and a desktop version is available for purchase
should high-volume processing be required. However, CASCOT’s underlying methodology has
not been published.

3. Automatic coding via open-source indexing utility (Ireland)

The Central Statistics Office of Ireland has reported they are developing an automatic coding
system for Classification of Individual Consumption by Purpose (COICOP) assignment for their
Household Budget Survey, using previously coded records as training data. Their method is
based on the open-source indexing and searching tool Apache Lucene (http://lucene.apache.org).
4. Automatic coding of census variables via Support Vector Machines (New Zealand)

Statistics New Zealand investigated the potential of using Support Vector Machines (SVM) to
improve coding of item responses in their Census. They applied SVM to code the variables
Occupation and Post-school Qualification, using two disjoint sets of observations, each of size
10,000, from Census 2013 data for training and testing. They reported a 50% correctness rate on
testing data for both variables, and concluded that further investigations would be necessary to
further evaluate SVM as an automatic coding methodology.

About Python:

The Python language has a substantial body of documentation, much of it contributed by various
authors. The markup used for the Python documentation is reStructuredText, developed by
the docutils project, amended by custom directives and using a toolset named Sphinx to
post-process the HTML output.
This document describes the style guide for our documentation, as well as the custom
reStructuredText markup introduced by Sphinx to support Python documentation and how it should be used.

The documentation in HTML, PDF or EPUB format is generated from text files written using
the reStructuredText format and contained in the CPython Git repository.

Introduction

Python’s documentation has long been considered to be good for a free programming language.
There are a number of reasons for this, the most important being the early commitment of
Python’s creator, Guido van Rossum, to providing documentation on the language and its
libraries, and the continuing involvement of the user community in providing assistance for
creating and maintaining documentation.

The involvement of the community takes many forms, from authoring to bug reports to just plain
complaining when the documentation could be more complete or easier to use.

This document is aimed at authors and potential authors of documentation for Python. More
specifically, it is for people contributing to the standard documentation and developing
additional documents using the same tools as the standard documents. This guide will be less
useful for authors using the Python documentation tools for topics other than Python, and less
useful still for authors not using the tools at all.

If your interest is in contributing to the Python documentation, but you don’t have the time or
inclination to learn reStructuredText and the markup structures documented here, there’s a
welcoming place for you among the Python contributors as well. Any time you feel that you can
clarify existing documentation or provide documentation that’s missing, the existing
documentation team will gladly work with you to integrate your text, dealing with the markup
for you. Please don’t let the material in this document stand between the documentation and your
desire to help out!

Style guide:
Use of whitespace

All reST files use an indentation of 3 spaces; no tabs are allowed. The maximum line length is 80
characters for normal text, but tables, deeply indented code samples and long links may extend
beyond that. Code example bodies should use normal Python 4-space indentation.

Make generous use of blank lines where applicable; they help group things together.

A sentence-ending period may be followed by one or two spaces; while reST ignores the second
space, it is customarily put in by some users, for example to aid Emacs’ auto-fill mode.

Footnotes

Footnotes are generally discouraged, though they may be used when they are the best way to
present specific information. When a footnote reference is added at the end of the sentence, it
should follow the sentence-ending punctuation. The reST markup should appear something like
this:

This sentence has a footnote reference. [#]_ This is the next sentence.

Footnotes should be gathered at the end of a file, or if the file is very long, at the end of a section.
The docutils will automatically create backlinks to the footnote reference.

Footnotes may appear in the middle of sentences where appropriate.

Capitalization

Sentence case

Sentence case is a set of capitalization rules used in English sentences: the first word is always
capitalized and other words are only capitalized if there is a specific rule requiring it.
In the Python documentation, the use of sentence case in section titles is preferable, but
consistency within a unit is more important than following this rule. If you add a section to a
chapter where most sections are in title case, you can either convert all titles to sentence case or
use the dominant style in the new section title.

Sentences that start with a word for which specific rules require starting it with a lower case
letter should be avoided.

Many special names are used in the Python documentation, including the names of operating
systems, programming languages, standards bodies, and the like. Most of these entities are not
assigned any special markup, but the preferred spellings are given here to aid authors in
maintaining the consistency of presentation in the Python documentation.

Other terms and words deserve special mention as well; these conventions should be used to
ensure consistency throughout the documentation:

CPU

For “central processing unit.” Many style guides say this should be spelled out on the first use
(and if you must use it, do so!). For the Python documentation, this abbreviation should be
avoided since there’s no reasonable way to predict which occurrence will be the first seen by the
reader. It is better to use the word “processor” instead.

POSIX
The name assigned to a particular group of standards. This is always uppercase.

Python

The name of our favorite programming language is always capitalized.

reST
For “reStructuredText,” an easy-to-read, plaintext markup syntax used to produce Python
documentation. When spelled out, it is always one word and both forms start with a lowercase
‘r’.

Unicode
The name of a character coding system. This is always written capitalized.

Unix
The name of the operating system developed at AT&T Bell Labs in the early 1970s.

Affirmative Tone

The documentation focuses on affirmatively stating what the language does and how to use it
effectively.

Except for certain security or segfault risks, the docs should avoid wording along the lines of
“feature x is dangerous” or “experts only”. These kinds of value judgments belong in external
blogs and wikis, not in the core documentation.

Bad example (creating worry in the mind of a reader):

Warning: failing to explicitly close a file could result in lost data or excessive resource
consumption. Never rely on reference counting to automatically close a file.

Good example (establishing confident knowledge in the effective use of the language):

A best practice for using files is to use a try/finally pair to explicitly close a file after it is used.
Alternatively, using a with-statement can achieve the same effect. This assures that files are
flushed and file descriptor resources are released in a timely manner.
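
A minimal sketch of the two patterns described in the good example (the file name is a placeholder):

# Pattern 1: try/finally ensures the file is closed even if an error occurs.
f = open("data.txt")  # "data.txt" is a placeholder file name
try:
    contents = f.read()
finally:
    f.close()

# Pattern 2: the with-statement closes the file automatically on exit.
with open("data.txt") as f:
    contents = f.read()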

Economy of Expression

More documentation is not necessarily better documentation. Err on the side of being succinct. It
is an unfortunate fact that making documentation longer can be an impediment to understanding
and can result in even more ways to misread or misinterpret the text. Long descriptions full of
corner cases and caveats can create the impression that a function is more complex or harder to
use than it actually is.

Security Considerations (and Other Concerns)

Some modules provided with Python are inherently exposed to security issues (e.g. shell
injection vulnerabilities) due to the purpose of the module (e.g. ​ssl​). Littering the documentation
of these modules with red warning boxes for problems that are due to the task at hand, rather
than specifically to Python’s support for that task, doesn’t make for a good reading experience.

Instead, these security concerns should be gathered into a dedicated “Security Considerations”
section within the module’s documentation, and cross-referenced from the documentation of
affected interfaces with a note similar to "Please refer to the :ref:`security-considerations`
section for important information on how to avoid common mistakes.".

Similarly, if there is a common error that affects many interfaces in a module (e.g. OS level pipe
buffers filling up and stalling child processes), these can be documented in a “Common Errors”
section and cross-referenced rather than repeated for every affected interface.

Code Examples

Short code examples can be a useful adjunct to understanding. Readers can often grasp a simple
example more quickly than they can digest a formal description in prose.

People learn faster with concrete, motivating examples that match the context of a typical use
case. For instance, the str.rpartition() method is better demonstrated with an example splitting the
domain from a URL than it would be with an example of removing the last word from a line of
Monty Python dialog.
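
One possible example in that spirit (the URL is an arbitrary illustration, and here the last path component rather than the domain is split off):

url = "https://docs.python.org/3/library/stdtypes.html"
head, sep, tail = url.rpartition("/")
print(head)   # https://docs.python.org/3/library
print(tail)   # stdtypes.html
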
The ellipsis for the ​sys.ps2​ secondary interpreter prompt should only be used sparingly, where it
is necessary to clearly differentiate between input lines and output lines. Besides contributing
visual clutter, it makes it difficult for readers to cut-and-paste examples so they can experiment
with variations.

Code Equivalents

Giving pure Python code equivalents (or approximate equivalents) can be a useful adjunct to a
prose description. A documenter should carefully weigh whether the code equivalent adds value.

A good example is the code equivalent for all(). The short 4-line code equivalent is easily
digested; it re-emphasizes the early-out behavior; and it clarifies the handling of the corner case
where the iterable is empty. In addition, it serves as a model for people wanting to implement a
commonly requested alternative where all() would return the specific object evaluating to False
whenever the function terminates early.
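
For reference, the code equivalent in question is essentially the following (as given in the library documentation for all()):

def all(iterable):
    for element in iterable:
        if not element:
            return False
    return True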

A more questionable example is the code for itertools.groupby(). Its code equivalent borders on
being too complex to be a quick aid to understanding. Despite its complexity, the code
equivalent was kept because it serves as a model for alternative implementations and because the
operation of the “grouper” is more easily shown in code than in English prose.

An example of when not to use a code equivalent is the oct() function. The exact steps in
converting a number to octal don’t add value for a user trying to learn what the function does.

Audience

The tone of the tutorial (and all the docs) needs to be respectful of the reader’s intelligence.
Don’t presume that the readers are stupid. Lay out the relevant information, show motivating use
cases, provide glossary links, and do your best to connect-the-dots, but don’t talk down to them
or waste their time.
The tutorial is meant for newcomers, many of whom will be using the tutorial to evaluate the
language as a whole. The experience needs to be positive and not leave the reader with worries
that something bad will happen if they make a misstep. The tutorial serves as guide for
intelligent and curious readers, saving details for the how-to guides and other sources.

Be careful accepting requests for documentation changes from the rare but vocal category of
reader who is looking for vindication for one of their programming errors (“I made a mistake,
therefore the docs must be wrong …”). Typically, the documentation wasn’t consulted until after
the error was made. It is unfortunate, but typically no documentation edit would have saved the
user from making false assumptions about the language (“I was surprised by …”).

Open cv introduction:

OpenCV was started at Intel in 1999 by Gary Bradsky and the first release came out in 2000.
Vadim Pisarevsky joined Gary Bradsky to manage Intel’s Russian software OpenCV team. In
2005, OpenCV was used on Stanley, the vehicle that won the 2005 DARPA Grand Challenge. Later
its active development continued under the support of Willow Garage, with Gary Bradsky and
Vadim Pisarevsky leading the project. Right now, OpenCV supports a lot of algorithms related to
Computer Vision and Machine Learning and it is expanding day-by-day. Currently OpenCV
supports a wide variety of programming languages like C++, Python, Java etc and is available on
different platforms including Windows, Linux, OS X, Android, iOS, etc. Interfaces based
on CUDA and OpenCL are also under active development for high-speed GPU operations.
OpenCV-Python is the Python API of OpenCV. It combines the best qualities of OpenCV C++
API and the Python language. Python is a general purpose programming language
started by Guido van Rossum, which became very popular in a short time mainly because of its
simplicity and code readability. It enables the programmer to express ideas in fewer lines of
code without reducing readability. Compared to other languages like C/C++, Python is
slower. But another important feature of Python is that it can be easily extended with C/C++.
This feature helps us to write computationally intensive code in C/C++ and create a Python
wrapper for it so that we can use these wrappers as Python modules. This gives us two
advantages: first, our code is as fast as the original C/C++ code (since it is the actual C++ code
working in the background) and second, it is very easy to code in Python. This is how
OpenCV-Python works: it is a Python wrapper around the original C++ implementation. The
support of Numpy makes the task even easier. Numpy is a highly optimized library for
numerical operations with a MATLAB-style syntax. All the OpenCV array structures are
converted to and from Numpy arrays. So whatever operations you can do in Numpy, you can
combine with OpenCV, which increases the number of weapons in your arsenal. Besides that,
several other libraries that support Numpy, such as SciPy and Matplotlib, can be used with it. So
OpenCV-Python is an appropriate tool for fast prototyping of computer vision problems.

Since OpenCV is an open source initiative, all are welcome to make contributions to this library.
And it is the same for this tutorial also. So, if you find any mistake in this tutorial (whether it be a
small spelling mistake or a big error in code or concepts, whatever), feel free to correct it. That will be
a good task for newcomers who begin to contribute to open source projects. Just fork OpenCV on
GitHub, make the necessary corrections and send a pull request to OpenCV. OpenCV developers will
check your pull request, give you important feedback and, once it passes the approval of the
reviewer, it will be merged into OpenCV. Then you become an open source contributor. Similar is
the case with other tutorials, documentation, etc. As new modules are added to OpenCV-Python,
this tutorial will have to be expanded. So those who know about a particular algorithm can write
up a tutorial which includes a basic theory of the algorithm and code showing basic usage of
the algorithm and submit it to OpenCV. Remember, together we can make this project a great
success!

Contributors

Below is the list of contributors who submitted tutorials to OpenCV-Python.

1. Alexander Mordvintsev (GSoC-2013 mentor)

2. Abid Rahman K. (GSoC-2013 intern)

Additional Resources
1. A Quick guide to Python - A Byte of Python

2. Basic Numpy Tutorials

3. Numpy Examples List

4. OpenCV Documentation

5. OpenCV Forum

Install OpenCV-Python in Windows

Goals In this tutorial

We will learn to set up OpenCV-Python on your Windows system. The steps below were tested on a
Windows 7 64-bit machine with Visual Studio 2010 and Visual Studio 2012. The screenshots
show VS2012.

Installing OpenCV from prebuilt binaries

1. The Python packages below are to be downloaded and installed to their default locations.

1.1. Python-2.7.x.

1.2. Numpy.

1.3. Matplotlib (Matplotlib is optional, but recommended since we use it a lot in our tutorials).

2. Install all packages into their default locations. Python will be installed to C:/Python27/.

3. After installation, open Python IDLE. Enter import numpy and make sure Numpy is working
fine.

4. Download the latest OpenCV release from the SourceForge site and double-click to extract it.
5. Go to the opencv/build/python/2.7 folder.

6. Copy cv2.pyd to C:/Python27/lib/site-packages.

7. Open Python IDLE and type the following code in the Python terminal.

>>> import cv2

>>> print(cv2.__version__)

If the results are printed out without any errors, congratulations! You have installed
OpenCV-Python successfully.

Download and install necessary Python packages to their default locations

1. Python 3.6.8.x

2. Numpy

3. Matplotlib (Matplotlib is optional, but recommended since we use it a lot in our tutorials.)

Make sure Python and Numpy are working fine.

4. Download OpenCV source. It can be from Sourceforge (for official release version) or from
Github (for latest source).

5. Extract it to a folder, opencv and create a new folder build in it.

6. Open CMake-gui (Start > All Programs > CMake-gui)

7. Fill the fields as follows (see the image below):

7.1. Click on Browse Source... and locate the opencv folder.

7.2. Click on Browse Build... and locate the build folder we created.
7.3. Click on Configure.

7.4. It will open a new window to select the compiler. Choose appropriate compiler
(here, Visual Studio 11) and click Finish.

7.5. Wait until analysis is finished.

8. You will see all the fields are marked in red. Click on the WITH field to expand it. It decides
what extra features you need. So mark appropriate fields. See the below image:

9. Now click on BUILD field to expand it. First few fields configure the build method. See the
below image:

10. Remaining fields specify what modules are to be built. Since GPU modules are not yet
supported by OpenCV-Python, you can skip them entirely to save time (but if you work with
them, keep them enabled). See the image below:

11. Now click on ENABLE field to expand it. Make sure ENABLE_SOLUTION_FOLDERS is
unchecked (Solution folders are not supported by Visual Studio Express edition). See the image
below:

12. Also make sure that in the PYTHON field, everything is filled. (Ignore
PYTHON_DEBUG_LIBRARY). See image below:

13. Finally click the Generate button.

14. Now go to our opencv/build folder. There you will find OpenCV.sln file. Open it with Visual
Studio.

15. Check build mode as Release instead of Debug.

16. In the solution explorer, right-click on the Solution (or ALL_BUILD) and build it. It will
take some time to finish.
17. Again, right-click on INSTALL and build it. Now OpenCV-Python will be installed.

18. Open Python IDLE and enter import cv2. If no error, it is installed correctly

Using OpenCV Read an image

Use the function cv2.imread() to read an image. The image should be in the working directory or
a full path to the image should be given. The second argument is a flag which specifies the way
the image should be read.

cv2.IMREAD_COLOR : Loads a color image. Any transparency of the image will be neglected. It is
the default flag.

cv2.IMREAD_GRAYSCALE : Loads the image in grayscale mode.

cv2.IMREAD_UNCHANGED : Loads the image as such, including the alpha channel.

See the code below:

import numpy as np
import cv2
# Load a color image in grayscale
img = cv2.imread('messi5.jpg',0)

Warning: Even if the image path is wrong, it won’t throw any error, but print(img) will give you
None.

Display an image

Use the function cv2.imshow() to display an image in a window. The window
automatically fits to the image size. The first argument is a window name, which is a string. The
second argument is our image. You can create as many windows as you wish, but with different
window names.

cv2.imshow('image',img)
cv2.waitKey(0)

cv2.waitKey() is a keyboard binding function. Its argument is the time in milliseconds; the function
waits for a key event for that duration, and if 0 is passed, it waits indefinitely for a keystroke.

cv2.destroyAllWindows()

cv2.destroyAllWindows() simply destroys all the windows we created.

Write an image

Use the function cv2.imwrite() to save an image. The first argument is the file name, the second
argument is the image you want to save.

cv2.imwrite('messigray.png',img)

This will save the image in PNG format in the working directory.

Sum it up

The program below loads an image in grayscale, displays it, saves the image and exits if you press ‘s’,
or simply exits without saving if you press the ESC key.

import numpy as np
import cv2

img = cv2.imread('messi5.jpg',0)
cv2.imshow('image',img)
k = cv2.waitKey(0)
if k == 27:          # wait for ESC key to exit
    cv2.destroyAllWindows()
elif k == ord('s'):  # wait for 's' key to save and exit
    cv2.imwrite('messigray.png',img)
    cv2.destroyAllWindows()

Using Matplotlib

Matplotlib is a plotting library for Python which gives you a wide variety of plotting methods. You
will see them in coming articles. Here, you will learn how to display an image with Matplotlib. You
can zoom images, save them, etc. using Matplotlib.

import numpy as np
import cv2
from matplotlib import pyplot as plt

img = cv2.imread('messi5.jpg',0)
plt.imshow(img, cmap = 'gray', interpolation = 'bicubic')
plt.xticks([]), plt.yticks([]) # to hide tick values on X and Y axis
plt.show()

Drawing Rectangle

To draw a rectangle, you need the top-left corner and bottom-right corner of the rectangle. This time
we will draw a green rectangle at the top-right corner of the image.

img = cv2.rectangle(img,(384,0),(510,128),(0,255,0),3)

Adding Text to Images:

To put text in images, you need to specify the following things:

Text data that you want to write

Position coordinates of where you want to put it (i.e. the bottom-left corner where the data starts)

Font type (check the cv2.putText() docs for supported fonts)

Font scale (specifies the size of the font)

Regular things like color, thickness, lineType, etc. For a better look, lineType = cv2.LINE_AA is
recommended.

We will write OpenCV on our image in white color.

font = cv2.FONT_HERSHEY_SIMPLEX
cv2.putText(img,'OpenCV',(10,500), font, 4,(255,255,255),2,cv2.LINE_AA)

Result

So it is time to see the final result of our drawing. As you studied in previous articles,
display the image to see it.

Object Tracking

Now that we know how to convert a BGR image to HSV, we can use this to extract a colored object. In HSV, it
is easier to represent a color than in RGB color-space. In our application, we will try to extract a blue
colored object. So here is the method:

Take each frame of the video

Convert the frame from BGR to HSV color-space

Threshold the HSV image for a range of blue color

Extract the blue object alone; then we can do whatever we want with that image.

Below is the code, which is commented in detail:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)

while(1):
    # Take each frame
    _, frame = cap.read()

    # Convert BGR to HSV
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # define range of blue color in HSV
    lower_blue = np.array([110,50,50])
    upper_blue = np.array([130,255,255])

    # Threshold the HSV image to get only blue colors
    mask = cv2.inRange(hsv, lower_blue, upper_blue)

    # Bitwise-AND mask and original image
    res = cv2.bitwise_and(frame,frame, mask= mask)

    cv2.imshow('frame',frame)
    cv2.imshow('mask',mask)
    cv2.imshow('res',res)
    k = cv2.waitKey(5) & 0xFF
    if k == 27:
        break

cv2.destroyAllWindows()

Numpy:

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array
objects and a collection of routines for processing those arrays. Using NumPy, mathematical and
logical operations on arrays can be performed. This tutorial explains the basics of NumPy such
as its architecture and environment. It also discusses the various array functions, types of
indexing, etc. An introduction to Matplotlib is also provided. All this is explained with the help
of examples for better understanding.

NumPy is a Python package. It stands for 'Numerical Python'. It is a library consisting of
multidimensional array objects and a collection of routines for processing those arrays.

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray,
was also developed, having some additional functionalities. In 2005, Travis Oliphant created the
NumPy package by incorporating the features of Numarray into the Numeric package. There are
many contributors to this open source project.
Operations using NumPy
Using NumPy, a developer can perform the following operations −
● Mathematical and logical operations on arrays.
● Fourier transforms and routines for shape manipulation.
● Operations related to linear algebra. NumPy has in-built functions for linear algebra and
random number generation.
NumPy – A Replacement for MATLAB
NumPy is often used along with packages like SciPy (Scientific Python)
and Matplotlib (plotting library). This combination is widely used as a replacement for
MATLAB, a popular platform for technical computing. However, the Python alternative to MATLAB
is now seen as a more modern and complete programming language.
It is also open source, which is an added advantage of NumPy.
The standard Python distribution doesn't come bundled with the NumPy module. A lightweight
alternative is to install NumPy using the popular Python package installer, pip.

pip install numpy

The best way to enable NumPy is to use an installable binary package specific to your operating
system. These binaries contain full SciPy stack (inclusive of NumPy, SciPy, matplotlib, IPython,
SymPy and nose packages along with core Python).
Building from Source
Core Python (2.6.x, 2.7.x and 3.2.x onwards) must be installed with distutils, and the zlib module
should be enabled.
A GNU gcc (4.2 and above) C compiler must be available.
To install NumPy, run the following command.

python setup.py install


To test whether NumPy module is properly installed, try to import it from Python prompt.

import numpy
If it is not installed, the following error message will be displayed.
Traceback (most recent call last):

File "<pyshell#0>", line 1, in <module>


import numpy
ImportError: No module named 'numpy'
Alternatively, the NumPy package can be imported using the following syntax −

import numpy as np
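
As a quick, hedged illustration of the operations listed above (the array values are arbitrary):

import numpy as np

# Create a 2-D array and perform mathematical and logical operations on it.
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)         # (2, 3)
print(a * 2)           # element-wise multiplication
print(a.sum(axis=0))   # column sums: [5 7 9]
print(a > 3)           # element-wise logical comparison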

Pandas:
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use
data structures and data analysis tools for the Python programming language. Python with
Pandas is used in a wide range of fields including academic and commercial domains including
finance, economics, Statistics, analytics, etc. In this tutorial, we will learn the various features of
Python Pandas and how to use them in practice.
Pandas is an open-source Python library providing high-performance data manipulation and
analysis tools using its powerful data structures. The name Pandas is derived from "panel
data", an econometrics term for multidimensional structured data sets.
In 2008, developer Wes McKinney started developing pandas when in need of a high-performance,
flexible tool for the analysis of data.
Prior to Pandas, Python was mostly used for data munging and preparation. It contributed very little
towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of the
data — load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc.
Key Features of Pandas

● Fast and efficient DataFrame object with default and customized indexing.
● Tools for loading data into in-memory data objects from different file formats.
● Data alignment and integrated handling of missing data.
● Reshaping and pivoting of data sets.
● Label-based slicing, indexing and subsetting of large data sets.
● Columns from a data structure can be deleted or inserted.
● Group by data for aggregation and transformations.
● High performance merging and joining of data.
● Time Series functionality.
The standard Python distribution doesn't come bundled with the Pandas module. A
lightweight alternative is to install Pandas using the popular Python package
installer, pip.

pip install pandas


If you install the Anaconda Python package, Pandas will be installed by default. Pandas is also
included in the following Python distributions −
Windows
Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack. It is
also available for Linux and Mac.
Canopy (https://www.enthought.com/products/canopy/) is available as a free as well as
commercial distribution with the full SciPy stack for Windows, Linux and Mac.
Python(x,y) is a free Python distribution with the SciPy stack and Spyder IDE for Windows OS.
(Downloadable from http://python-xy.github.io/)
By now, we have learnt about the three Pandas data structures and how to create them.
We will mainly focus on DataFrame objects because of their importance in
real-time data processing, and also discuss a few other data structures.
Series Basic Functionality
S.No.  Attribute or Method  Description

1  axes  Returns a list of the row axis labels.

2  dtype  Returns the dtype of the object.

3  empty  Returns True if the series is empty.

4  ndim  Returns the number of dimensions of the underlying data, by definition 1.

5  size  Returns the number of elements in the underlying data.

6  values  Returns the Series as an ndarray.

7  head()  Returns the first n rows.

8  tail()  Returns the last n rows.

Let us now create a Series and see all the above tabulated attributes operation.
Example

import pandas as pd
import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))

print(s)

Its ​output​ is as follows −

0 0.967853
1 -0.148368
2 -1.395906
3 -1.758394
dtype: float64
axes
Returns the list of the labels of the series.

import pandas as pd

import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))

print("The axes are:")

print(s.axes)

Its ​output​ is as follows −

The axes are:


[RangeIndex(start=0, stop=4, step=1)]
The above result is a compact format of a list of values from 0 to 3, i.e., [0,1,2,3].
empty
Returns the Boolean value saying whether the Object is empty or not. True indicates that the
object is empty.
import pandas as pd

import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))

print("Is the Object empty?")

print(s.empty)

Its ​output​ is as follows −

Is the Object empty?


False
ndim
Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so
it returns 1.

import pandas as pd

import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))

print(s)

print("The dimensions of the object:")

print(s.ndim)

Its ​output​ is as follows −

0 0.175898
1 0.166197
2 -0.609712
3 -1.377000
dtype: float64

The dimensions of the object:


1
size
Returns the size (length) of the series.

import pandas as pd

import numpy as np

#Create a series with 2 random numbers

s = pd.Series(np.random.randn(2))

print(s)

print("The size of the object:")

print(s.size)

Its ​output​ is as follows −

0 3.078058
1 -1.207803
dtype: float64

The size of the object:


2
values
Returns the actual data in the series as an array.

import pandas as pd

import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))
print(s)

print("The actual data series is:")

print(s.values)

Its ​output​ is as follows −

0 1.787373
1 -0.605159
2 0.180477
3 -0.140922
dtype: float64

The actual data series is:


[ 1.78737302 -0.60515881 0.18047664 -0.1409218 ]
Head & Tail
To view a small sample of a Series or the DataFrame object, use the head() and the tail()
methods.
head() returns the first n rows (observe the index values). The default number of elements to
display is five, but you may pass a custom number.

import pandas as pd

import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))

print("The original series is:")

print(s)

print("The first two rows of the data series:")

print(s.head(2))
Its ​output​ is as follows −

The original series is:


0 0.720876
1 -0.765898
2 0.479221
3 -0.139547
dtype: float64

The first two rows of the data series:


0 0.720876
1 -0.765898
dtype: float64
tail() returns the last n rows (observe the index values). The default number of elements to
display is five, but you may pass a custom number.

import pandas as pd

import numpy as np

#Create a series with 4 random numbers

s = pd.Series(np.random.randn(4))

print("The original series is:")

print(s)

print("The last two rows of the data series:")

print(s.tail(2))

Its ​output​ is as follows −

The original series is:


0 -0.655091
1 -0.881407
2 -0.608592
3 -2.341413
dtype: float64

The last two rows of the data series:


2 -0.608592
3 -2.341413
dtype: float64
DataFrame Basic Functionality
Let us now understand what DataFrame basic functionality is. The important attributes and methods
that help with DataFrame basic functionality are largely analogous to those of Series, for example
T (transpose), axes, dtypes, empty, ndim, shape, size, values, head() and tail(); a brief sketch follows.
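
A brief, hedged sketch of these attributes on a small DataFrame (the column names and values are arbitrary):

import pandas as pd

# Small example DataFrame with arbitrary data.
df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'], 'Age': [25, 26, 25]})

print(df.shape)    # (3, 2): number of rows and columns
print(df.ndim)     # 2: a DataFrame is a 2-D structure
print(df.dtypes)   # dtype of each column
print(df.T)        # transpose: rows and columns swapped
print(df.head(2))  # first two rows
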
TensorFlow:

Deep learning is a subfield of machine learning that is a set of algorithms that is inspired by the
structure and function of the brain.

TensorFlow is the second machine learning framework that Google created and used to design,
build, and train deep learning models. You can use the TensorFlow library to do numerical
computations, which in itself doesn’t seem all too special, but these computations are done with
data flow graphs. In these graphs, nodes represent mathematical operations, while the edges
represent the data, usually multidimensional data arrays or tensors, that are
communicated along these edges.

You see? The name “TensorFlow” is derived from the operations which neural networks perform
on multidimensional data arrays or tensors! It’s literally a flow of tensors. For now, this is all you
need to know about tensors, but you’ll go deeper into this in the next sections!

Today’s TensorFlow tutorial for beginners will introduce you to performing deep learning in an
interactive way:

● You’ll first learn more about ​tensors​;

● Then, you’ll briefly go over some of the ways that you can install
TensorFlow on your system so that you’re able to get started and load data in your
workspace;

● After this, you’ll go over some of the ​TensorFlow basics​: you’ll see how you can easily
get started with simple computations.

● After this, you get started on the real work: you’ll load in data on Belgian traffic signs
and explore it with simple statistics and plotting.

● In your exploration, you’ll see that there is a need to ​manipulate your data​ in such a way
that you can feed it to your model. That’s why you’ll take the time to rescale your images
and convert them to grayscale.

● Next, you can finally get started on ​your neural network model​! You’ll build up your
model layer per layer;
● Once the architecture is set up, you can use it to ​train your model interactively​ and to
eventually also ​evaluate​ it by feeding some test data to it.

● Lastly, you’ll get some pointers for ​further​ improvements that you can do to the model
you just constructed and how you can continue your learning with TensorFlow.

Download the notebook of this tutorial ​here​.

Also, you could be interested in a course on ​Deep Learning in Python​, DataCamp's ​Keras
tutorial​ or the ​keras with R tutorial​.

Introducing Tensors

To understand tensors well, it’s good to have some working knowledge of linear algebra and
vector calculus. You already read in the introduction that tensors are implemented in TensorFlow
as multidimensional data arrays, but some more introduction is maybe needed in order to
completely grasp tensors and their use in machine learning.

Plane Vectors

Before you go into plane vectors, it’s a good idea to shortly revise the concept of “vectors”;
Vectors are special types of matrices, which are rectangular arrays of numbers. Because vectors
are ordered collections of numbers, they are often seen as column matrices: they have just one
column and a certain number of rows. In other terms, you could also consider vectors as scalar
magnitudes that have been given a direction.

Remember​: an example of a scalar is “5 meters” or “60 m/sec”, while a vector is, for example,
“5 meters north” or “60 m/sec East”. The difference between these two is obviously that the
vector has a direction. Nevertheless, these examples that you have seen up until now might seem
far off from the vectors that you might encounter when you’re working with machine learning
problems. This is normal; The length of a mathematical vector is a pure number: it is absolute.
The direction, on the other hand, is relative: it is measured relative to some reference direction
and has units of radians or degrees. You usually assume that the direction is positive and in
counterclockwise rotation from the reference direction.
Visually, you can represent vectors as arrows. This
means that you can consider vectors also as arrows that have direction and length. The direction
is indicated by the arrow’s head, while the length is indicated by the length of the arrow.

So what about plane vectors then?

Plane vectors are the most straightforward setup of tensors. They are much like regular vectors as
you have seen above, with the sole difference that they find themselves in a vector space. To
understand this better, let’s start with an example: you have a vector that is ​2 X 1.​ This means
that the vector belongs to the set of real numbers that come paired two at a time. Or, stated
differently, they are part of two-space. In such cases, you can represent vectors on the
coordinate ​(x,y)​ plane with arrows or rays.

Working from this coordinate plane in a standard position where vectors have their endpoint at
the origin ​(0,0),​ you can derive the ​x​ coordinate by looking at the first row of the vector, while
you’ll find the ​y​ coordinate in the second row. Of course, this standard position doesn’t always
need to be maintained: vectors can move parallel to themselves in the plane without experiencing
changes.

Note​ that similarly, for vectors that are of size ​3 X 1,​ you talk about the three-space. You can
represent the vector as a three-dimensional figure with arrows pointing to positions in the vector
space: they are drawn on the standard x, y and z axes.

It’s nice to have these vectors and to represent them on the coordinate plane, but in essence, you
have these vectors so that you can perform operations on them and one thing that can help you in
doing this is by expressing your vectors as bases or unit vectors.
Unit vectors are vectors with a magnitude of one. You’ll often recognize the unit vector by a
lowercase letter with a circumflex, or “hat”. Unit vectors will come in convenient if you want to
express a 2-D or 3-D vector as a sum of two or three orthogonal components, such as the x− and
y−axes, or the z−axis.

And when you are talking about expressing one vector, for example, as sums of components,
you’ll see that you’re talking about component vectors, which are two or more vectors whose
sum is that given vector.

Tip​: watch ​this video​, which explains what tensors are with the help of simple household
objects!

Tensors

Next to plane vectors, covectors and linear operators are two other cases; all three
have one thing in common: they are specific cases of tensors. You still remember how a
vector was characterized in the previous section as scalar magnitudes that have been given a
direction. A tensor, then, is the mathematical representation of a physical entity that may be
characterized by magnitude and ​multiple​ directions.

And, just like you represent a scalar with a single number and a vector with a sequence of three
numbers in a 3-dimensional space, for example, a tensor can be represented by an array of 3^R
numbers in a 3-dimensional space.

The “R” in this notation represents the rank of the tensor: this means that in a 3-dimensional
space, a second-rank tensor can be represented by 3 to the power of 2 or 9 numbers. In an
N-dimensional space, scalars will still require only one number, while vectors will require N
numbers, and tensors will require N^R numbers. This explains why you often hear that scalars
are tensors of rank 0: since they have no direction, you can represent them with one number.

With this in mind, it’s relatively easy to recognize scalars, vectors, and tensors and to set them
apart: scalars can be represented by a single number, vectors by an ordered set of numbers, and
tensors by an array of numbers.

What makes tensors so unique is the combination of components and basis vectors: basis vectors
transform one way between reference frames and the components transform in just such a way as
to keep the combination between components and basis vectors the same.
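
As a hedged illustration of this counting argument (NumPy arrays are used here only as containers for the components; the values are arbitrary):

import numpy as np

# In 3-dimensional space (N = 3):
scalar = np.array(5.0)                   # rank 0: 3**0 = 1 number
vector = np.array([1.0, 2.0, 3.0])       # rank 1: 3**1 = 3 numbers
tensor2 = np.arange(9.0).reshape(3, 3)   # rank 2: 3**2 = 9 numbers

for t in (scalar, vector, tensor2):
    print(t.ndim, t.size)                # rank (ndim) and number of components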

Installing TensorFlow

Now that you know more about TensorFlow, it’s time to get started and install the library. Here,
it’s good to know that TensorFlow provides APIs for Python, C++, Haskell, Java, Go, Rust, and
there’s also a third-party package for R called ​tensorflow​.
Tip​: if you want to know more about deep learning packages in R, consider checking out
DataCamp’s ​keras: Deep Learning in R Tutorial​.

In this tutorial, you will download a version of TensorFlow that will enable you to write the code
for your deep learning project in Python. On the ​TensorFlow installation webpage​, you’ll see
some of the most common ways and latest instructions to install TensorFlow
using virtualenv, pip or Docker, and lastly, there are also some other ways of installing
TensorFlow on your personal computer.

Note: You can also install TensorFlow with Conda if you’re working on Windows. However,
since the installation of TensorFlow is community supported, it’s best to check the ​official
installation instructions​.

Now that you have gone through the installation process, it’s time to double check that you have
installed TensorFlow correctly by importing it into your workspace under the alias ​tf​:

import tensorflow as tf

Note that the alias that you used in the line of code above is sort of a convention; it’s used to
ensure that you remain consistent with other developers that are using TensorFlow in data
science projects on the one hand, and with open-source TensorFlow projects on the other hand.
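
As a small, hedged sketch of the simple computations mentioned earlier (this assumes a TensorFlow 2.x installation with eager execution; under TensorFlow 1.x the result would instead be evaluated inside a session):

import tensorflow as tf

# Two constant tensors (nodes in the data flow graph).
x1 = tf.constant([1, 2, 3, 4])
x2 = tf.constant([5, 6, 7, 8])

# An operation node: element-wise multiplication of the two tensors.
result = tf.multiply(x1, x2)

print(result)  # tf.Tensor([ 5 12 21 32], shape=(4,), dtype=int32)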

Keras:
Two of the top numerical platforms in Python that provide the basis for Deep Learning research
and development are Theano and TensorFlow.

Both are very powerful libraries, but both can be difficult to use directly for creating deep
learning models.

In this post, you will discover the Keras Python library that provides a clean and convenient way
to create a range of deep learning models on top of Theano or TensorFlow.

Keras is a minimalist Python library for deep learning that can run on top of Theano or
TensorFlow.

It was developed to make implementing deep learning models as fast and easy as possible for
research and development.

It runs on Python 2.7 or 3.5 and can seamlessly execute on GPUs and CPUs given the underlying
frameworks. It is released under the permissive MIT license.
Keras was developed and is maintained by François Chollet, a Google engineer, using four guiding
principles:

Modularity: A model can be understood as a sequence or a graph alone. All the concerns of a
deep learning model are discrete components that can be combined in arbitrary ways.
Minimalism: The library provides just enough to achieve an outcome, with no frills, maximizing
readability.
Extensibility: New components are intentionally easy to add and use within the framework,
intended for researchers to trial and explore new ideas.
Python: No separate model files with custom file formats. Everything is native Python.

Keras is relatively straightforward to install if you already have a working Python and SciPy
environment.

You must also have an installation of Theano or TensorFlow on your system already.

You can see installation instructions for both platforms here:

Installation instructions for Theano


Installation instructions for TensorFlow
Keras can be installed easily using PyPI, as follows:

sudo pip install keras


At the time of writing, the most recent version of Keras is version 1.1.0. You can check your
version of Keras on the command line using the following snippet:

python -c "import keras; print(keras.__version__)"


Running the above script you will see:

1.1.0
You can upgrade your installation of Keras using the same method:

sudo pip install --upgrade keras
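
As a brief, hedged sketch of what building a model layer by layer looks like in Keras (the data, layer sizes, and parameters are illustrative assumptions; the keyword arguments follow the Keras 2-style API, and older 1.x releases used nb_epoch instead of epochs):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Toy data: 100 samples with 8 features and binary labels (arbitrary values).
X = np.random.rand(100, 8)
y = np.random.randint(2, size=100)

# A minimal feed-forward network built layer by layer.
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=10, verbose=0)
print(model.evaluate(X, y, verbose=0))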

Sklearn:

In general, a learning problem considers a set of n ​samples​ of data and then tries to predict
properties of unknown data. If each sample is more than a single number and, for instance, a
multi-dimensional entry (aka ​multivariate​ data), it is said to have several attributes or ​features​.

Learning problems fall into a few categories:

supervised learning​, in which the data comes with additional attributes that we want to predict
(Click here to go to the scikit-learn supervised learning page). This problem can be either:

classification​: samples belong to two or more classes and we want to learn from already labeled
data how to predict the class of unlabeled data. An example of a classification problem would be
handwritten digit recognition, in which the aim is to assign each input vector to one of a finite
number of discrete categories. Another way to think of classification is as a discrete (as opposed
to continuous) form of supervised learning where one has a limited number of categories and for
each of the n samples provided, one is to try to label them with the correct category or class.

regression​: if the desired output consists of one or more continuous variables, then the task is
called ​regression​. An example of a regression problem would be the prediction of the length of a
salmon as a function of its age and weight.

unsupervised learning​, in which the training data consists of a set of input vectors x without any
corresponding target values. The goal in such problems may be to discover groups of similar
examples within the data, where it is called ​clustering​, or to determine the distribution of data
within the input space, known as ​density estimation​, or to project the data from a
high-dimensional space down to two or three dimensions for the purpose of ​visualization​ (​Click
here​ to go to the Scikit-Learn unsupervised learning page).

scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for
classification and the boston house prices dataset for regression.

In the following, we start a Python interpreter from our shell and then load
the iris and digits datasets. Our notational convention is that $ denotes the shell prompt
while >>> denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data.
This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of a
supervised problem, one or more response variables are stored in the .target member. More
details on the different datasets can be found in the ​dedicated section​.

For instance, in the case of the digits dataset, digits.data gives access to the features that can be
used to classify the digits samples:
>>> print(digits.data)
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding to
each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
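
As a short, hedged continuation (not part of the original text), one could fit a classifier to these digits as follows; the SVC parameters are illustrative, and the exact predicted value may differ:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
>>> clf.fit(digits.data[:-1], digits.target[:-1])  # learn from all but the last image
>>> clf.predict(digits.data[-1:])                  # predict the digit of the last image
array([8])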
