Essential Python Libraries
CS158-1
NumPy
NumPy, short for Numerical Python, has
long been a cornerstone of numerical
computing in Python. It provides the data
structures, algorithms, and library glue
needed for most scientific applications
involving numerical data in Python.
NumPy contains, among other things:
• A fast and efficient multidimensional array object, ndarray
• Functions for performing element-wise
computations with arrays or mathematical
operations between arrays
• Tools for reading and writing array-based datasets
to disk
• Linear algebra operations, Fourier transform, and
random number generation
• A mature C API to enable Python extensions and
native C or C++ code to access NumPy’s data
structures and computational facilities
NumPy Advantages
• NumPy arrays are more efficient for storing
and manipulating data than the other built-in
Python data structures.
• Also, libraries written in a lower-level
language, such as C or Fortran, can operate
on the data stored in a NumPy array without
copying data into some other memory
representation.
• Many numerical computing tools for Python
either assume NumPy arrays as a primary
data structure or else target seamless
interoperability with NumPy.
Pandas
• Pandas provides high-level data structures and
functions designed to make working with
structured or tabular data fast, easy, and
expressive.
• Pandas blends the high-performance, array-
computing ideas of NumPy with the flexible data
manipulation capabilities of spreadsheets and
relational databases (such as SQL).
• It provides sophisticated indexing functionality to
make it easy to reshape, slice and dice, perform
aggregations, and select subsets of data.
Matplotlib
matplotlib is the most popular Python library for
producing plots and other two-dimensional data
visualizations.
SciPy
§ SciPy is a collection of packages
addressing a number of
different standard problem
domains in scientific
computing.
§ Together NumPy and SciPy form
a reasonably complete and
mature computational
foundation for many traditional
scientific computing
applications.
scikit-learn
scikit-learn has become the premier general-purpose machine
learning toolkit for Python programmers. In just seven years,
it has had over 1,500 contributors from around the world.
scikit-learn
• Classification: SVM, nearest neighbors, random forest,
logistic regression, etc.
• Regression: Lasso, ridge regression, etc.
• Clustering: k-means, spectral clustering, etc.
• Dimensionality reduction: PCA, feature selection,
matrix factorization, etc.
• Model selection: Grid search, cross-validation, metrics
• Preprocessing: Feature extraction, normalization
statsmodels
statsmodels is a statistical analysis package that was seeded
by work from Stanford University statistics professor Jonathan
Taylor, who implemented a number of regression analysis models
popular in the R programming language.
statsmodels
Compared with scikit-learn, statsmodels contains algorithms for
classical (primarily frequentist) statistics and econometrics. This
includes such submodules as:
• Regression models: Linear regression, generalized linear models,
robust linear models, linear mixed effects models, etc.
• Analysis of variance (ANOVA)
• Time series analysis: AR, ARMA, ARIMA, VAR, and other models
• Nonparametric methods: Kernel density estimation, kernel
regression
• Visualization of statistical model results
IPython
IPython is designed from the ground up to maximize
your productivity in both interactive computing and
software development. It encourages an execute-explore
workflow instead of the typical edit-compile-run
workflow of many other programming languages.
It also provides easy access to your operating system’s
shell and filesystem.
NumPy Basics
One of the reasons NumPy is so important for numerical
computations in Python is that it is designed for
efficiency on large arrays of data.
• NumPy internally stores data in a contiguous block of
memory, independent of other built-in Python
objects. NumPy’s library of algorithms written in the C
language can operate on this memory without any
type checking or other overhead. NumPy arrays also
use much less memory than built-in Python
sequences.
• NumPy operations perform complex computations on
entire arrays without the need for Python for loops.
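The second bullet can be illustrated with a small sketch. The array size and variable names are chosen only for illustration; the point is that one array expression replaces an explicit Python loop.

```python
import numpy as np

# The same 100,000 integers, once as an ndarray and once as a Python list.
arr = np.arange(100_000)
lst = list(range(100_000))

# Vectorized: the multiplication is applied to the whole array in compiled code.
arr_doubled = arr * 2

# Pure-Python equivalent: an explicit loop over every element.
lst_doubled = [x * 2 for x in lst]
```

Both versions produce the same values; on large inputs the array version is typically much faster, which can be checked with IPython's %timeit.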
The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array
object, or ndarray, which is a fast, flexible container for large
datasets in Python. Arrays enable you to perform mathematical
operations on whole blocks of data using similar syntax to the
equivalent operations between scalar elements.
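As a minimal sketch of "scalar-like syntax on whole blocks of data" (the sample values are arbitrary), the same operators used on single numbers apply to every element of an array:

```python
import numpy as np

data = np.array([[1.5, -0.1, 3.0],
                 [0.0, -3.0, 6.5]])

# Multiply every element by 10, exactly as one would write for a scalar.
scaled = data * 10

# Add the array to itself element by element.
summed = data + data
```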
Creating ndarrays
The easiest way to create an array is to use the array function. This
accepts any sequence-like object (including other arrays) and produces
a new NumPy array containing the passed data. For example, a list is a
good candidate for conversion:
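A short sketch of such a conversion, with made-up sample lists: a flat list becomes a 1-D array, and a list of equal-length lists becomes a 2-D array.

```python
import numpy as np

# A flat list becomes a one-dimensional array.
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

# A list of equal-length lists becomes a two-dimensional array.
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

# np.array infers a suitable dtype unless one is given explicitly.
arr1_dtype = arr1.dtype   # float64, because 7.5 is a float
arr2_shape = arr2.shape   # (2, 4)
```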
Array creation functions
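The slide's table of functions is not reproduced here. As a hedged sample, these are a few commonly used NumPy array creation functions; the selection is illustrative, not necessarily the list the slide showed.

```python
import numpy as np

zeros = np.zeros(5)            # five 0.0 values
ones = np.ones((2, 3))         # 2 x 3 array of 1.0 values
counting = np.arange(10)       # integers 0 through 9
identity = np.eye(3)           # 3 x 3 identity matrix
filled = np.full((2, 2), 7)    # 2 x 2 array filled with 7
```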
Data Types for ndarrays
The data type or dtype is a special object containing the information (or
metadata, data about data) the ndarray needs to interpret a chunk of
memory as a particular type of data:

In [33]: arr1 = np.array([1, 2, 3], dtype=np.float64)
In [34]: arr2 = np.array([1, 2, 3], dtype=np.int32)
In [35]: arr1.dtype
Out[35]: dtype('float64')
Dimensions in Arrays
0-D arrays, or scalars, are the elements in an array. Each
value in an array is a 0-D array.

An array that has 0-D arrays as its elements is called a
uni-dimensional or 1-D array.

An array that has 1-D arrays as its elements is called a 2-D array.

An array that has 2-D arrays (matrices) as its elements is
called a 3-D array. These are often used to represent a
third-order tensor.

NumPy arrays provide the ndim attribute, which returns an
integer that tells us how many dimensions the array has.
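A small sketch of the four cases, using arbitrary sample values, with ndim reporting the number of dimensions of each:

```python
import numpy as np

a = np.array(42)                          # 0-D: a single scalar value
b = np.array([1, 2, 3])                   # 1-D: elements are 0-D
c = np.array([[1, 2, 3], [4, 5, 6]])      # 2-D: elements are 1-D arrays
d = np.array([[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]])          # 3-D: elements are 2-D arrays

dims = [x.ndim for x in (a, b, c, d)]     # [0, 1, 2, 3]
```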
NumPy Array Indexing
Array indexing is the same as accessing an array element.

You can access an array element by referring to its index number.

The indexes in NumPy arrays start with 0, meaning that the first
element has index 0, the second has index 1, and so on.
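A brief sketch of zero-based indexing on sample data. The 2-D lookup and the negative index are small additions beyond the slide text, included because they follow the same rule.

```python
import numpy as np

arr = np.array([10, 20, 30, 40])
first = arr[0]     # 10: the first element has index 0
second = arr[1]    # 20: the second element has index 1
last = arr[-1]     # 40: negative indexes count from the end

# For a 2-D array, give one index per dimension: [row, column].
grid = np.array([[1, 2, 3], [4, 5, 6]])
item = grid[1, 2]  # 6: second row, third column
```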
NumPy Array Slicing
Slicing in Python means taking elements from one given index
to another given index.

We pass a slice instead of an index, like this: [1:3].

We can also define the step, like this: [start:end:step].

If we don't pass start, it is considered 0.

If we don't pass end, it is considered the length of the array
in that dimension.

If we don't pass step, it is considered 1.
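The defaults above can be sketched on a sample 1-D array. Note that the element at the end index is not included in the result.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

middle = arr[1:3]        # [2, 3]: index 1 up to, but not including, index 3
from_start = arr[:3]     # [1, 2, 3]: start defaults to 0
to_end = arr[4:]         # [5, 6, 7]: end defaults to the array length
every_other = arr[::2]   # [1, 3, 5, 7]: step of 2 over the whole array
stepped = arr[1:6:2]     # [2, 4, 6]: start, end, and step all given
```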


NumPy Array Shape
The shape of an array is the number of elements in each dimension.
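As a quick illustration on a sample array, the shape attribute returns a tuple with one entry per dimension:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

# Two rows (first dimension) and four columns (second dimension).
shape = arr.shape   # (2, 4)
```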
NumPy Array Reshaping
Reshaping means changing the shape of an array.

By reshaping we can add or remove dimensions or change the
number of elements in each dimension.
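A minimal sketch of reshaping with reshape(), using twelve sample values. The total number of elements must stay the same across shapes.

```python
import numpy as np

arr = np.arange(1, 13)            # 12 elements: 1 through 12

matrix = arr.reshape(4, 3)        # 1-D to 2-D: 4 rows of 3
cube = arr.reshape(2, 3, 2)       # 1-D to 3-D: 2 blocks of 3 x 2
flat = matrix.reshape(-1)         # back to 1-D; -1 lets NumPy infer the length
```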
NumPy Array Iterating
Iterating means going through elements one by one.

As we deal with multi-dimensional arrays in NumPy, we can do
this using a basic Python for loop.

If we iterate on a 1-D array, it will go through each element
one by one.
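A short sketch with sample data: a for loop over a 1-D array yields its elements, while a loop over a 2-D array yields its rows, so reaching each scalar takes a nested loop.

```python
import numpy as np

# Iterating a 1-D array visits each element in turn.
arr = np.array([1, 2, 3])
collected = []
for x in arr:
    collected.append(int(x))

# Iterating a 2-D array visits each row (a 1-D array) in turn.
grid = np.array([[1, 2, 3], [4, 5, 6]])
rows = []
for row in grid:
    rows.append(row.tolist())

# A nested loop reaches every scalar of the 2-D array.
scalars = []
for row in grid:
    for x in row:
        scalars.append(int(x))
```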
NumPy Joining Array
Joining means putting the contents of two or more arrays in a
single array.

In SQL we join tables based on a key, whereas in NumPy we join
arrays by axes.

We pass a sequence of arrays that we want to join to the
concatenate() function, along with the axis. If axis is not
explicitly passed, it is taken as 0.
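A minimal sketch of concatenate() on sample arrays, first with the default axis and then joining two 2-D arrays along axis 1:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# No axis given, so axis 0 is used: the arrays are joined end to end.
joined = np.concatenate((a, b))

m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6], [7, 8]])

# axis=1 joins along columns, so each row gets longer.
side_by_side = np.concatenate((m1, m2), axis=1)
```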
NumPy Splitting Array
Splitting is the reverse operation of joining.

Joining merges multiple arrays into one, and splitting breaks
one array into multiple.

We use array_split() for splitting arrays; we pass it the
array we want to split and the number of splits.
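A brief sketch of array_split() on a sample array. When the elements do not divide evenly, array_split() makes the earlier pieces one element longer rather than raising an error.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# Six elements into three equal pieces.
thirds = np.array_split(arr, 3)

# Six elements into four pieces: sizes 2, 2, 1, 1.
quarters = np.array_split(arr, 4)

thirds_as_lists = [part.tolist() for part in thirds]
quarters_as_lists = [part.tolist() for part in quarters]
```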
NumPy Searching Arrays
You can search an array for a certain value, and return the
indexes that get a match.

To search an array, use the where() function.
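A minimal sketch of searching with np.where() on sample data. Called with a condition only, it returns a tuple holding one index array per dimension.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])

# Indexes where the value equals 4.
matches = np.where(arr == 4)
match_indexes = matches[0]             # the index array for the only dimension

# Any boolean condition works, for example even values.
even_indexes = np.where(arr % 2 == 0)[0]
```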
NumPy Sorting Arrays
Sorting means putting elements in an ordered sequence.

An ordered sequence is any sequence that has an order
corresponding to its elements, such as numeric or alphabetical,
ascending or descending.

NumPy has a function called sort() that returns a sorted copy
of a specified array. The ndarray object also has a sort()
method, which sorts the array in place.
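A short sketch contrasting the two ways to sort, on sample data: np.sort() returns a sorted copy and leaves the original untouched, while the ndarray method sort() reorders the array in place.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

# np.sort returns a new sorted array; arr itself is unchanged.
sorted_copy = np.sort(arr)
original_after_copy = arr.tolist()

# The ndarray method sorts in place and returns None.
arr.sort()

# Strings are sorted alphabetically.
fruits = np.sort(np.array(["banana", "cherry", "apple"]))
```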
NumPy Filter Array
Getting some elements out of an existing array and creating a
new array out of them is called filtering.

In NumPy, you filter an array using a boolean index list.
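A minimal sketch of filtering on sample values: elements whose position holds True are kept. The mask can be written by hand or, more commonly, built from a comparison.

```python
import numpy as np

arr = np.array([41, 42, 43, 44])

# Hand-written boolean index list: keep positions 0 and 2.
mask = [True, False, True, False]
picked = arr[mask]

# A comparison produces the boolean array directly.
above_42 = arr[arr > 42]
```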
Introduction to pandas Data Structures
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and
manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions
based on statistical theories.

Pandas can clean messy data sets, and make them readable and
relevant.

Relevant data is very important in data science.
Series
A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its
index.
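A brief sketch of a Series with an explicit index. The values and labels are sample data chosen for illustration.

```python
import pandas as pd

# Values plus an associated array of labels (the index).
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

by_label = s["a"]            # look up a value by its label
labels = list(s.index)       # the index holds the labels
positives = s[s > 0]         # boolean filtering keeps the labels attached
```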
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered
collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.).
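As a minimal sketch with made-up data, a DataFrame can be built from a dict of equal-length lists. Each key becomes a column, and each column can hold a different type.

```python
import pandas as pd

# Hypothetical sample data: one string, one integer, one boolean column.
data = {
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 45, 27],
    "member": [True, False, True],
}
df = pd.DataFrame(data)

columns = list(df.columns)   # column order follows the dict
ages = df["age"]             # selecting one column gives a Series
n_rows, n_cols = df.shape
```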
Pandas Read CSV
A simple way to store big data sets is to use CSV files
(comma-separated values files).

CSV files contain plain text and are a well-known format that
can be read by almost any tool, including Pandas.
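A short sketch of reading CSV with pd.read_csv(). To stay self-contained, it reads from an in-memory string with made-up columns; in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line holds the column names.
csv_text = "name,duration,pulse\nAnn,60,110\nBob,45,117\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

pulses = df["pulse"].tolist()
```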
Pandas Read JSON

• Big data sets are often stored or extracted as JSON.
• JSON is plain text in the format of an object. It is widely used
in programming and can be read by Pandas.
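A minimal sketch of reading JSON with pd.read_json(). The records are made up and read from an in-memory string so the example is self-contained; a file path works the same way.

```python
import io
import pandas as pd

# Hypothetical JSON content: a list of records with the same keys.
json_text = '[{"name": "Ann", "duration": 60}, {"name": "Bob", "duration": 45}]'

# Each record becomes a row; each key becomes a column.
df = pd.read_json(io.StringIO(json_text))

durations = df["duration"].tolist()
```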
Pandas - Analyzing DataFrames

• One of the most used methods for getting a quick overview of the
DataFrame is the head() method.
• The head() method returns the headers and a specified number of
rows, starting from the top.
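A quick sketch of head() on a small made-up DataFrame. Called without an argument it returns the first five rows; passing a number changes how many.

```python
import pandas as pd

# Hypothetical ten-row table of numbers and their squares.
df = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

top_default = df.head()    # first 5 rows
top_three = df.head(3)     # first 3 rows
```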
