
Machine Learning Python Packages: Core Data Handling Libraries:
1. Numpy

Python has a strong set of data types and data structures. Yet it wasn't designed for Machine Learning per se. Enter numpy (pronounced num-pie). Numpy is a data handling library, particularly one which allows us to handle large multi-dimensional arrays along with a huge collection of mathematical operations. The following is a quick snippet of numpy in action.
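A minimal sketch, assuming nothing beyond a standard numpy install: build a small array and apply a few vectorized operations.

    import numpy as np

    # create a 2x3 array
    arr = np.array([[1, 2, 3], [4, 5, 6]])

    print(arr.shape)         # (2, 3)
    print(arr.T)             # transpose
    print(arr * 2)           # element-wise multiplication, no explicit loops
    print(arr.mean(axis=0))  # column-wise mean: [2.5 3.5 4.5]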

Numpy isn't known only for its capability to handle multi-dimensional data; it is also known for its speed of execution and vectorization capabilities. It provides MATLAB-style functionality and hence requires some learning before you can get comfortable with it. It is also a core dependency for other widely used libraries like pandas, matplotlib and so on.

Advantages

Numpy isn't just a library, it is "the library" when it comes to handling multi-dimensional data. The following are some of the go-to features that make it special:

 Matrix (and multi-dimensional array) manipulation capabilities like transpose, reshape, etc.
 Highly efficient data structures which boost performance and handle garbage collection with ease.
 Capability to vectorize operations, which again improves performance and parallelization capabilities.

Downsides

The major downsides of numpy are:

 Dependency on non-Python components: because numpy relies on Cython and other C/C++ libraries, setting it up can be a pain
 Its high performance comes at a cost. The data types are native to hardware and not python, thus incurring an overhead when numpy objects have to be transformed back to python equivalents and vice-versa.

2. Pandas

Think of relational data, think pandas. Yes, pandas is a python library that provides flexible and expressive data structures (like dataframes and series) for data manipulation. Built on top of numpy, pandas is nearly as fast and yet much easier to use.

Pandas provides capabilities to read and write data from different sources like CSVs, Excel, SQL databases, HDF5 and many more. It provides functionality to add, update and delete columns, combine or split dataframes/series, handle datetime objects, impute null/missing values, handle time series data, convert to and from numpy objects and so on. If you are working on a real-world Machine Learning use case, chances are you will need pandas sooner rather than later. Like numpy, pandas is also an important component of the SciPy or Scientific Python stack.
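A minimal sketch of typical pandas operations; the column names here are made up for illustration, and the in-memory dataframe stands in for a pd.read_csv call:

    import pandas as pd

    # small dataframe built in memory (stand-in for reading a CSV)
    df = pd.DataFrame({'name': ['a', 'b', 'c'],
                       'score': [90.0, None, 78.0]})

    df['score'] = df['score'].fillna(df['score'].mean())  # impute missing values
    df['passed'] = df['score'] >= 80                      # add a derived column
    print(df.describe())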

Advantages
 Extremely easy to use, with a small learning curve for handling tabular data.
 Amazing set of utilities to load, transform and write data to multiple formats.
 Compatible with underlying numpy objects and the go-to choice for most Machine Learning libraries like scikit-learn, etc.
 Capability to prepare plots/visualizations out of the box (utilizes matplotlib to prepare different visualizations under the hood).

Downsides

 The ease of use comes at the cost of higher memory utilization. Pandas creates far too many additional objects to provide quick access and ease of manipulation.
 Inability to utilize distributed infrastructure. Though pandas can work with formats like HDF5, it cannot utilize a distributed system architecture to improve performance.

3. Scipy

Pronounced Sigh-Pie, this is one of the most important python libraries of all time. Scipy is a scientific computing library for python. It is also built on top of numpy and is a part of the SciPy stack.

This is yet another behind-the-scenes library which does a whole lot of heavy lifting. It provides modules/algorithms for linear algebra, integration, image processing, optimization, clustering, sparse matrix manipulation and many more.
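A minimal sketch touching two of these modules, numerical integration and optimization:

    from scipy import integrate, optimize

    # integrate t^2 over [0, 1]; the exact answer is 1/3
    area, err = integrate.quad(lambda t: t ** 2, 0, 1)

    # find the minimum of (t - 2)^2, which lies at t = 2
    result = optimize.minimize_scalar(lambda t: (t - 2) ** 2)

    print(area, result.x)
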
4. Matplotlib

Another component of the SciPy stack, matplotlib is essentially a visualization library. It works seamlessly with numpy objects (and its high-level derivatives like pandas). Matplotlib provides a MATLAB-like plotting environment to prepare high-quality figures/charts for publications, notebooks, web applications and so on.

Matplotlib is a highly customizable low-level library that provides a whole lot of controls and knobs to prepare any type of visualization/figure. Given its low-level nature, it requires a bit of getting used to, along with plenty of code to get stuff done. Its well-documented and extensible design has allowed a whole list of high-level visualization libraries to be built on top of it, some of which we will discuss in the coming sections.
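A minimal sketch of the typical pyplot workflow, assuming an interactive backend is available:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 100)
    plt.plot(x, np.sin(x), label='sin(x)')  # line plot from numpy arrays
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.show()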

Advantages

 Extremely expressive and precise syntax to generate highly customizable plots
 Can be easily used inline with Jupyter notebooks

Downsides

 Heavy reliance on numpy and other SciPy stack libraries
 Huge learning curve: it requires quite a bit of understanding and practice to use matplotlib.

Machine Learning Core Libraries:
5. Scikit-Learn

Designed as an extension to the SciPy library, scikit-learn has become the de-facto standard for many machine learning tasks. Developed as part of a Google Summer of Code project, it has now become a widely contributed open-source project with over 1000 contributors.

Scikit-learn provides a simple yet powerful fit-transform and predict paradigm to learn from
data, transform the data and finally predict. Using this interface, it provides capabilities to
prepare classification, regression, clustering and ensemble models. It also provides a
multitude of utilities for preprocessing, metrics, model evaluation techniques, etc.
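A minimal sketch of the fit/predict paradigm on a bundled dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = LogisticRegression(max_iter=200)  # a classical ML estimator
    clf.fit(X_train, y_train)               # learn from the training data
    print(accuracy_score(y_test, clf.predict(X_test)))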

Advantages

 The go-to package that has it all for classical Machine Learning algorithms
 Consistent and easy to understand interface of fit and transform
 Capability to prepare pipelines, which not only helps with rapid prototyping but also with quick and reliable deployments (see the sketch after this list)
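A minimal pipeline sketch, chaining a scaler and a classifier into a single estimator:

    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    pipe = make_pipeline(StandardScaler(), LogisticRegression())  # scale, then classify
    pipe.fit(X, y)           # fits both steps in one call
    print(pipe.score(X, y))  # the whole pipeline acts as one model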

Downsides

 Inability to utilize categorical data out of the box, even for algorithms that support such data types (packages in R have such capabilities)
 Heavy reliance on the SciPy stack

6. Statsmodels

As the name suggests, this library adds statistical tools/algorithms in the form of classes and
functions to the python world. Built on top of numpy and scipy, Statsmodels provides an
extensive list of capabilities in the form of regression models, time series analysis,
autoregression and so on.

Statsmodels also provides a detailed list of result statistics (even beyond what scikit-learn provides). It integrates nicely with pandas and matplotlib and is thus an important part of any Data Scientist's toolbox. For people who are familiar and comfortable with the R style of programming, Statsmodels also provides an R-like formula interface using patsy.
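A minimal sketch of the formula interface; the tiny dataset here is made up for illustration:

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({'y': [1.0, 2.1, 2.9, 4.2, 5.1],
                       'x': [1, 2, 3, 4, 5]})

    # R-style formula: regress y on x (patsy parses the formula string)
    model = smf.ols('y ~ x', data=df).fit()
    print(model.summary())  # detailed result statistics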

Advantages

 Plugs the gap for regression and time-series algorithms in the python ecosystem
 Analogous to certain R packages, hence a smaller learning curve
 Huge list of algorithms and utilities to handle regression and time series use-cases

Downsides

 Not as well documented with examples as sklearn
 Certain algorithms are buggy, with little to no explanation of their parameters

Deep Learning Libraries:
13. Tensorflow

Probably one of the most popular GitHub repositories and one of the most widely used libraries for both research and production environments, Tensorflow is a symbolic math library which allows differentiable programming, a core concept for many Machine Learning tasks.

Tensors are the core concept of this library: generic mathematical objects that represent vectors, scalars, multi-dimensional arrays, etc.

It supports a range of ML tasks but is primarily utilized for developing deep neural networks. It is utilized by Google (who also developed it) and a number of technology giants for developing and productionizing neural networks. Tensorflow has capabilities to not just utilize multi-GPU stacks but also work with specialized TPUs or Tensor Processing Units. It has now evolved into a complete environment of its own, with modules to handle core functionality, debugging, visualization, serving, etc.
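A minimal sketch of differentiable programming with tensors, assuming TensorFlow 2.x with eager execution:

    import tensorflow as tf

    x = tf.Variable(3.0)  # a scalar tensor
    with tf.GradientTape() as tape:
        y = x ** 2        # operations are recorded for differentiation

    print(tape.gradient(y, x))  # dy/dx at x = 3 -> 6.0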

Advantages

 Industry-grade package with huge community support, frequent bug fixes and improvements at regular intervals
 Capability to work with a diverse set of hardware like mobile platforms, web, CPUs and GPUs
 Scales to handle huge workloads out of the box
 Well documented features with tons of tutorials and examples

Downsides

 Low-level interface makes it difficult to get started; huge learning curve
 Computation graphs are not easy to get used to (though this has been largely addressed with eager execution in version 2.0)

14. Theano

Let’s just start by saying that Theano is to deep learning what numpy is to machine learning.
Theano (now a deprecated project) was one of the first libraries to provide capabilities to
manipulate multi-dimensional arrays. It predates Tensorflow and hence isn’t as performant or
expressive. Theano has capabilities to utilize GPUs transparently. It is tightly integrated with numpy and provides symbolic differentiation syntax along with various optimizations to handle small and large numbers. Before the advent of newer libraries, Theano was the de-facto building block for working with neural networks. Theano was actively developed and maintained by the Montreal Institute for Learning Algorithms (MILA) at the University of Montreal until 2017.
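A minimal sketch of Theano's symbolic differentiation (runnable only on legacy installs, since the project is deprecated):

    import theano
    import theano.tensor as T

    x = T.dscalar('x')            # symbolic scalar variable
    y = x ** 2
    dy = T.grad(y, x)             # symbolic derivative

    f = theano.function([x], dy)  # compile the expression graph
    print(f(3.0))                 # 6.0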

Advantages

 Ease of understanding due to its tight coupling with numpy
 Capability to utilize GPUs transparently
 Being one of the first deep learning libraries, it has a huge community for help and support with issues

Downsides

 Once the workhorse for deep learning use-cases, it is now a deprecated project that will not be developed further
 Its low-level APIs often presented a steep learning curve

15.  PyTorch

PyTorch is a result of research and development at Facebook's artificial intelligence group. The current-day PyTorch is a merged project between pytorch and caffe2. PyTorch is a python-first deep learning framework, unlike some of the other well-known ones which are written in C/C++ and have bindings/wrappers for python. This python-first strategy allows PyTorch to have numpy-like syntax and the capability to work seamlessly with similar libraries and their data structures.

It supports dynamic graphs and eager execution (it was the only major framework to do so until Tensorflow 2.0). Similar to other frameworks in this space, PyTorch can also leverage GPUs and acceleration libraries like Intel MKL. It also claims to have minimal overhead and hence is supposedly faster than the rest.
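A minimal sketch of dynamic graphs and autograd, where plain Python control flow takes part in the graph:

    import torch

    x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

    # the graph is built on the fly, so ordinary if-statements work
    if x.sum() > 0:
        y = (x ** 2).sum()
    else:
        y = x.abs().sum()

    y.backward()   # backpropagate through whichever branch actually ran
    print(x.grad)  # tensor([2., -4., 6.])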

Advantages

 One of the fastest deep learning frameworks
 Capability to handle dynamic graphs, as opposed to the static ones used by most counterparts
 Pythonic implementation helps in seamless integration with python objects and numpy-like syntax

Downsides

 Still gaining ground and support, and thus lags in terms of material (tutorials, examples, etc.) to learn from.
 Limited visualization and debugging capabilities compared to the complete suite that tensorboard provides for tensorflow.
