Module Handbook
Data Science with Python - 1
Objectives
At the end of this module, you should be able to understand:
▪ Mathematical Computation with NumPy
▪ Getting started with Pandas
▪ Data Import/Export with Pandas
▪ Working with Data Frames – selection, filtering, manipulation and statistical analysis
▪ Data Aggregation with Pandas
▪ Data Cleaning with Pandas
www.bootup.ai
Numpy
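NumPy is the fundamental package for scientific computing with Python. Among other things, it
provides an N-dimensional array object, broadcasting functions, and linear algebra and random
number capabilities.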
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container
of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily
integrate with a wide variety of databases.
Broadcasting
Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when
performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want
to use the smaller array multiple times to perform some operation on the larger array.
Broadcasting two arrays together follows these rules:
• If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both
shapes have the same length.
• The two arrays are said to be compatible in a dimension if they have the same size in the dimension,
or if one of the arrays has size 1 in that dimension.
• The arrays can be broadcast together if they are compatible in all dimensions.
• After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of
shapes of the two input arrays.
• In any dimension where one array had size 1 and the other array had size greater than 1, the first
array behaves as if it were copied along that dimension.
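The rules above can be sketched with a small example. The array names and values below are illustrative assumptions, not taken from the original module.

```python
import numpy as np

# A (4, 3) matrix and a (3,) vector: the ranks differ, so the vector's shape
# is treated as (1, 3) and then stretched along the first dimension.
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])
v = np.array([1, 0, 1])

y = x + v  # v is added to every row of x
print(y.shape)
print(y)

# Shapes (4, 1) and (3,) are also compatible: the result has shape (4, 3).
col = np.array([[10], [20], [30], [40]])
z = col + v
print(z.shape)

# Shapes (4, 3) and (4,) are not compatible: the trailing sizes are 3 and 4,
# and neither of them is 1.
try:
    x + np.array([1, 2, 3, 4])
except ValueError as err:
    print("cannot broadcast:", err)
```

In both successful cases the result takes the elementwise maximum of the two input shapes, as the fourth rule states.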
In general, any numerical data that is stored in an array-like container can be converted to an ndarray
through use of the array() function. The most obvious examples are sequence types like lists and tuples.
# Import modules
import numpy as np
# Create an array
civilian_birth = np.array([4352, 233, 3245, 256, 2394])
civilian_birth
#output
array([4352, 233, 3245, 256, 2394])
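Create special arrays
# One-dimensional array of ten zeros
np.zeros(10)
# Two-dimensional array of zeros with 3 rows and 6 columns
np.zeros((3, 6))
# One-dimensional array of ten ones
np.ones(10)
Reshaping arrays
# Example array with ten elements (reshape needs the same total number of elements)
arr = np.arange(10)
print(arr.shape)
print(arr.reshape(5, 2))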
Stack arrays
a = np.array([0, 1])
b = np.array([2, 3])
ab = np.stack((a, b)).T
print(ab)
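# or, as an assumed equivalent (the original alternative is cut off at this point):
ab = np.column_stack((a, b))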
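import numpy as np
# Draw one sample from the standard normal distribution (mean 0, standard deviation 1)
np.random.normal()
# Draw four samples from the standard normal distribution
np.random.normal(size=4)
# Draw four samples from the uniform distribution on [0, 1)
np.random.uniform(size=4)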
Linear Algebra with Numpy
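import numpy as np
# Create vector (illustrative values; the original line is missing from this handbook)
vector = np.array([1, 2, 3])
# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])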
Flatten A Matrix
import numpy as np
# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
# Flatten matrix into a one-dimensional array
matrix.flatten()
a=np.array([[1,2],[3,4]])
b=np.array([[11,12],[13,14]])
# Matrix product of a and b
np.dot(a,b)
#output
array([[37, 40],
       [85, 92]])
# Inner product: each row of a is paired with each row of b
np.inner(a,b)
#output
array([[35, 41],
       [81, 95]])
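The two results differ because np.dot multiplies rows of a by columns of b, whereas for two-dimensional inputs np.inner multiplies rows of a by rows of b. A minimal check using the same arrays (the variable names dot_ab and inner_ab are introduced here for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[11, 12], [13, 14]])

# Matrix product: rows of a times columns of b (same as a @ b)
dot_ab = np.dot(a, b)

# Inner product: rows of a times rows of b, which for 2-D arrays is a @ b.T
inner_ab = np.inner(a, b)

print(np.array_equal(dot_ab, a @ b))
print(np.array_equal(inner_ab, a @ b.T))
```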
Linear Algebra with matrices
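Eigenvalues and Eigenvectors
A = np.array([[1,2],[3,4]])
# Returns the eigenvalues and a matrix whose columns are the matching eigenvectors
np.linalg.eig(A)
#output: the eigenvalues are approximately -0.3723 and 5.3723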
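As a sanity check on the eigen-decomposition, each eigenpair should satisfy A v = λ v. A minimal sketch, assuming the same matrix A as above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

# Column i of eigenvectors belongs to eigenvalues[i]
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(eigenvalues[i], np.allclose(A @ v, eigenvalues[i] * v))
```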
import numpy as np
# Solve the linear system A x = b
A = np.array([[4,5],[6,-3]])
b = np.array([23, 3])
x = np.linalg.solve(A,b)
#output: value of x
array([2., 3.])
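A quick way to confirm a solution is to substitute it back into the system. A minimal sketch with the same A and b:

```python
import numpy as np

A = np.array([[4, 5], [6, -3]])
b = np.array([23, 3])

x = np.linalg.solve(A, b)

# Substituting x back into the system should reproduce b
# (allclose tolerates floating-point rounding)
print(np.allclose(A @ x, b))
```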
Pandas
pandas is a Python package providing fast, flexible, and expressive data structures designed to make
working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental
high-level building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and flexible open source data
analysis / manipulation tool available in any language. It is already well on its way toward this goal.
The two primary data structures of pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statistics,
social science, and many areas of engineering. For R users, DataFrame provides
everything that R’s data.frame provides and much more. pandas is built on top of
NumPy and is intended to integrate well within a scientific computing environment
with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point
data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional
objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user
can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in
computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for
both aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures
into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining data sets
• Flexible reshaping and pivoting of data sets
• Hierarchical labeling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving /
loading data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window
statistics, moving window linear regressions, date shifting and lagging, etc.
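Two of the points above, automatic data alignment and handling of missing data, can be sketched in a few lines. The Series names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Addition aligns on index labels; a label present in only one Series gives NaN
total = s1 + s2
print(total)

# Reductions such as sum() skip NaN by default
print(total.sum())
```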
Many of these principles are here to address the shortcomings frequently experienced using other
languages / scientific research environments. For data scientists, working with data is typically divided
into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of
the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these
tasks.
Pandas Series
A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python
objects, etc.).
import pandas as pd
import numpy as np
# Create a Series from a NumPy array (default integer index 0 to 3)
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
# Create a Series with explicit labels and look up a value by its label
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(s['a'])
Pandas DataFrame
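A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and
columns.
import pandas as pd
import numpy as np
# Create a dataframe from a dictionary of columns
# (only the columns that survive in this handbook are shown)
raw_data ={'first_name':['Jason','Molly','Tina','Jake','Amy'],
'last_name':['Miller','Jacobson',".",'Milner','Cooze']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name'])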
Data Import and Export using Pandas
Loading A CSV Into pandas
import pandas as pd
# Load a csv
df = pd.read_csv('https://raw.githubusercontent.com/anshupandey/datasets/master/datawh.csv')
# Load a csv with no headers
df = pd.read_csv('https://raw.githubusercontent.com/anshupandey/datasets/master/datanh.csv', header=None)
Export DataFrame
# The target directory must already exist
df.to_csv('pandas_dataframe_importing_csv/example.csv')
# Load a JSON file from a URL
url = 'https://raw.githubusercontent.com/anshupandey/datasets/master/server-metrics.json'
# Load the JSON file into a data frame
df = pd.read_json(url, orient='columns')
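The same calls can be tried without network access by reading from an in-memory buffer. This is a minimal sketch; the CSV text and the names csv_text, df_no_header and round_trip are assumptions made for illustration.

```python
import io
import pandas as pd

csv_text = "name,score\nJason,4\nMolly,24\n"

# read_csv accepts file-like objects as well as paths and URLs
df = pd.read_csv(io.StringIO(csv_text))

# header=None treats the first line as data and numbers the columns 0, 1, ...
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# to_csv without a path returns the CSV as a string;
# index=False leaves the row index out of the output
round_trip = df.to_csv(index=False)
print(round_trip)
```

Passing index=False is a common choice when exporting, since otherwise the row index is written as an extra unnamed column.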
Descriptive Statistics For pandas Dataframe
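import pandas as pd
# df: a DataFrame with numeric columns 'age', 'preTestScore' and 'postTestScore'
# (the code that builds it is truncated in this handbook)
# Summary statistics for preTestScore
df['preTestScore'].describe()
# Sum of all ages
df['age'].sum()
# Mean preTestScore
df['preTestScore'].mean()
# Number of non-missing preTestScore values
df['preTestScore'].count()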
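df['preTestScore'].median()  # median
df['preTestScore'].var()     # sample variance
df['preTestScore'].std()     # sample standard deviation
df['preTestScore'].skew()    # skewness
df['preTestScore'].kurt()    # kurtosis
# Correlation matrix of the numeric columns (numeric_only requires pandas 1.5 or later)
df.corr(numeric_only=True)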
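Because the DataFrame definition is not shown in full, here is a self-contained sketch of the same statistics. All values are hypothetical and chosen only so the calls can be run.

```python
import pandas as pd

# Hypothetical data; the original values are not given in the handbook
df = pd.DataFrame({
    'age': [42, 52, 36, 24, 73],
    'preTestScore': [4, 24, 31, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70],
})

print(df['preTestScore'].describe())  # count, mean, std, min, quartiles, max

total_age = df['age'].sum()
mean_pre = df['preTestScore'].mean()
median_pre = df['preTestScore'].median()

# Pairwise correlation between all columns (all are numeric here)
corr = df.corr()
print(corr)
```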
Data Manipulation with Pandas
Dropping Rows & Columns in Dataframe
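import pandas as pd
# df: a DataFrame with 'name' and 'reports' columns whose index labels include
# 'Cochice', 'Pima', 'Maricopa' and 'Yuma' (its construction is truncated in this handbook)
# Drop rows by index label
df.drop(['Cochice', 'Pima'])
# Drop a column
df.drop('reports', axis=1)
# Keep only the rows where name is not 'Tina'
df[df.name != 'Tina']
# Drop a row by position
df.drop(df.index[2])
# Drop several rows by position
df.drop(df.index[[2,3]])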
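A self-contained version of these operations is sketched below. The 'reports' values and the 'Santa Cruz' index label are assumptions filled in for illustration. Note that drop returns a new DataFrame and leaves df unchanged.

```python
import pandas as pd

# Hypothetical data: 'reports' values and the 'Santa Cruz' label are assumed
df = pd.DataFrame(
    {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
     'reports': [4, 24, 31, 2, 3]},
    index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

by_label = df.drop(['Cochice', 'Pima'])   # drop rows by index label
no_reports = df.drop('reports', axis=1)   # drop a column
no_tina = df[df.name != 'Tina']           # filter rows with a boolean mask
by_position = df.drop(df.index[[2, 3]])   # drop the third and fourth rows

print(by_position)
print(df.shape)  # the original DataFrame still has every row and column
```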
Join And Merge Pandas Dataframe
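import pandas as pd
# First dataframe
# (definition truncated in this handbook; only the fragments 'Atiches']} and 'last_name']) survive)
# Second dataframe
# (definition truncated in this handbook; only the fragments 'Btisan']} and 'last_name']) survive)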
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10',
'11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
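Since the first two DataFrames are not recoverable from the handbook, the merge step is sketched here with a hypothetical table df_a (its names are invented for illustration) joined to the test data above on subject_id.

```python
import pandas as pd

# Hypothetical people table; the names are assumptions
df_a = pd.DataFrame({
    'subject_id': ['1', '2', '3', '4', '5'],
    'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']})

# Test results, as in raw_data above
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
    'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
df_n = pd.DataFrame(raw_data)

# Inner join (the default): keep only subject_ids present in both tables
inner = pd.merge(df_a, df_n, on='subject_id')

# Left join: keep every row of df_n; unmatched rows get NaN for first_name
left = pd.merge(df_n, df_a, on='subject_id', how='left')

print(inner)
print(left)
```

The how parameter also accepts 'right' and 'outer' for the remaining join types.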
Find Unique Values In Pandas Dataframes
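# df: a DataFrame with 'trucks' and 'aircrafts' columns, among others
# (its construction is truncated in this handbook)
# Unique values via a Python set (order is not guaranteed)
list(set(df.trucks))
# Unique values in order of first appearance
list(df['trucks'].unique())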
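A runnable sketch of the two approaches, using made-up truck names since the original data is not shown:

```python
import pandas as pd

# Hypothetical values for the 'trucks' column
df = pd.DataFrame({'trucks': ['MAZ-7310', 'MAZ-7310', 'ZIS-150', 'Tatra 810', 'ZIS-150']})

# unique() keeps the order of first appearance
unique_ordered = list(df['trucks'].unique())

# A set has no defined order, so sort it when a stable result is needed
unique_sorted = sorted(set(df['trucks']))

# value_counts() also reports how often each value occurs
counts = df['trucks'].value_counts()

print(unique_ordered)
print(unique_sorted)
print(counts)
```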
Data Aggregation with Pandas
Groupby operation
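import pandas as pd
# The 'regiment', 'company' and 'preTestScore' entries of raw_data are truncated
# in this handbook; only 'name' and 'postTestScore' survive
raw_data = {
'name': ['Miller','Jacobson','Ali','Milner','Cooze','Jacon',
'Ryaner','Sone','Sloan','Piger','Riani', 'Ali'],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70,
62, 70]}
# Create dataframe (columns missing from raw_data come out as NaN)
df = pd.DataFrame(raw_data, columns = ['regiment', 'company',
'name', 'preTestScore', 'postTestScore'])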
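# Group preTestScore by regiment and list the (group name, values) pairs
list(df['preTestScore'].groupby(df['regiment']))
# Descriptive statistics per regiment
df['preTestScore'].groupby(df['regiment']).describe()
# Mean preTestScore per regiment and company
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
# hierarchical indexing: unstack moves the inner index level (company) into columns
df['preTestScore'].groupby([df['regiment'],
df['company']]).mean().unstack()
# Mean of every numeric column per group (numeric_only skips the 'name' column)
df.groupby(['regiment', 'company']).mean(numeric_only=True)
# Number of rows in each group
df.groupby(['regiment', 'company']).size()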
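The split-apply-combine pattern behind these calls can be sketched on a small table. The regiment names, companies and scores below are assumptions, since the original columns are truncated.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'company': ['1st', '2nd', '1st', '2nd', '1st', '2nd'],
    'preTestScore': [4, 24, 31, 2, 3, 5],
})

# Split by regiment, apply mean, combine into a Series indexed by regiment
by_regiment = df.groupby('regiment')['preTestScore'].mean()

# Two grouping keys give a hierarchical (MultiIndex) result
by_both = df.groupby(['regiment', 'company'])['preTestScore'].mean()

# unstack() pivots the inner level (company) into columns
wide = by_both.unstack()

# size() counts the rows in each group
sizes = df.groupby(['regiment', 'company']).size()

print(by_regiment)
print(wide)
```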
Data Cleaning with Pandas
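import pandas as pd
# df: a DataFrame with 'first_name' and 'last_name' columns, among others
# (its construction is truncated in this handbook)
# Flag rows that repeat an earlier row
df.duplicated()
# Drop fully duplicated rows
df.drop_duplicates()
# Drop duplicates in the first_name column, but keep the last observation in each duplicated set
df.drop_duplicates(['first_name'], keep='last')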
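A small runnable illustration of the three calls, with hypothetical rows chosen so that one row is a full duplicate and one first name repeats with a different last name:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'first_name': ['Jason', 'Jason', 'Tina', 'Jake', 'Amy', 'Amy'],
    'last_name': ['Miller', 'Miller', 'Ali', 'Milner', 'Cooze', 'Smith'],
})

# True for each row that exactly repeats an earlier row
dup_mask = df.duplicated()

# Remove exact duplicates, keeping the first occurrence
deduped = df.drop_duplicates()

# Compare on first_name only and keep the last row of each duplicated set
last_by_name = df.drop_duplicates(['first_name'], keep='last')

print(dup_mask.tolist())
print(last_by_name)
```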
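import numpy as np
# Drop every row that contains a missing value
df_no_missing = df.dropna()
# Add a column that is entirely NaN, then drop columns where all values are missing
df['location'] = np.nan
df.dropna(axis=1, how='all')
# Keep only rows with at least 5 non-missing values
df.dropna(thresh=5)
# Replace every missing value with 0
df.fillna(0)
# Fill missing preTestScore values with the column mean and assign the result back
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())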
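The effect of each call is easier to see on a tiny table. The data below is hypothetical, and the result is assigned to new variables rather than modified in place, which avoids chained-assignment warnings in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    'first_name': ['Jason', np.nan, 'Tina', 'Jake'],
    'preTestScore': [4.0, np.nan, np.nan, 2.0],
})

# Rows with no missing values at all
no_missing = df.dropna()

# A column that is entirely NaN is removed by how='all' on axis=1
df['location'] = np.nan
no_empty_cols = df.dropna(axis=1, how='all')

# thresh=2 keeps rows that have at least two non-missing values
at_least_two = df.dropna(thresh=2)

# Replace missing scores with the mean of the observed scores
filled = df['preTestScore'].fillna(df['preTestScore'].mean())

print(no_empty_cols)
print(filled.tolist())
```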
Summary
This module covered the following topics:
▪ Mathematical Computation with NumPy
▪ Getting started with Pandas
▪ Data Import/Export with Pandas
▪ Working with Data Frames – selection, filtering, manipulation and statistical analysis
▪ Data Aggregation with Pandas
▪ Data Cleaning with Pandas
© Copyright BootUP 2019
INDONESIA
Noble House, 30th Floor
Jl. Dr.Ide Anak Agung Gde Agung
Mega Kuningan, Jakarta 12950
+62 811 992 1500
info@bootup.ai

UNITED STATES
68 Willow Rd
Menlo Park, CA 94025
Silicon Valley, USA
+1 800 493 1945
info@bootupventures.com

www.bootup.ai