
Data Science with Python

Skipped: apply function

Numpy module

numpy.random.randint    numpy.random.randint(1, 1000, 100)    It will create an array of 100 numbers between 1 and 1000

print(pandas.Series(numpy.random.randint(1, 1000, 100)))    It will create a Series

pd.DataFrame(np.random.randint(10, 100, (4, 2)), index=d1, columns=list('ab'))    It will create a DataFrame containing numbers between 10 and 100 of size (rows, columns)

np.sum()    numpy.sum(s)    It will give the sum of the Series/DataFrame and we can set the axis

np.min()/np.max()    df['Mugdho'] = np.min(df)  (also np.max(df), np.mean(df))    By default it will find the min/max in a column and we can change the axis (note: np.count does not exist; use df.count() instead)

np.array([], dtype)    np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)    array([-3., -1., 1., 3., 5., 7.])

np.set_printoptions()    np.set_printoptions(linewidth=200)

np.where()    df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)

np.array()    We cannot apply astype on a list, so we use np.array(): np.array(training_sentences).astype(str)
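As a quick check, the calls above can be combined into one runnable sketch (the seed and the column names 'a', 'b', 'big_a' are mine, added only for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed so the random draws are repeatable

# 100 random integers between 1 and 999 (the high bound is exclusive), wrapped in a Series
s = pd.Series(np.random.randint(1, 1000, 100))

# a 4x2 DataFrame of random integers between 10 and 99
df = pd.DataFrame(np.random.randint(10, 100, (4, 2)), columns=list('ab'))

# np.where builds a new column from a condition
df['big_a'] = np.where(df['a'] > 50, 1, 0)

# sums: per column by default; axis=1 sums per row
col_sums = df[['a', 'b']].sum()
row_sums = df[['a', 'b']].sum(axis=1)
```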

Learning outcomes
In a Series, data has labels; values can be retrieved using .loc[] for rows, and [] retrieves columns.
A Series can be created from a single list with a default index using pandas.Series()
A Series can be created from a single list plus an index list using pandas.Series()
A Series can be created from a dictionary, and its index can be modified using pandas.Series()
We can convert a (1, n) shape DataFrame to a Series using df.iloc[0]
We can create a DataFrame and a Series using numpy.random.randint(low, high, size)
Find the index of a Series using .index
Finding/adding something in a Series by position using .iloc[]
Summing a Series/DataFrame using np.sum(); the axis can be set
Adding a value to a Series using += / -=
Appending two Series using .append() (removed in pandas 2.0; use pd.concat() instead)
We can create a DataFrame by combining multiple Series, or directly using pandas.DataFrame(, index=[ ], columns=[ ])
We can create a DataFrame from a dictionary using pd.DataFrame.from_dict(train_data)
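A minimal runnable sketch of the creation paths listed above (the values are placeholders):

```python
import pandas as pd

# from a plain list (default integer index 0, 1, 2)
s1 = pd.Series(['Tiger', 'Bear', 'Moose'])

# from a list plus an explicit index
s2 = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])

# from a dictionary (keys become the index)
s3 = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})

# a DataFrame built directly
df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['a', 'b'])

# a (1, n)-shaped row collapses to a Series with .iloc[0]
row = df.iloc[0]
```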

We can rename a column/row label using .rename(). *** We must set inplace=True or it will return a copy.
We can change the values in a column using .replace()
We can iterate over the rows of a DataFrame using for index, row in df.iterrows():
We can create a new column by applying a condition: df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
We can append two Series, but we have to merge two DataFrames. Add a new row to a Series using .loc[], but we cannot add a new column, as that would turn it into a DataFrame. We can add a new column to a DataFrame using [].
We can add a row to a DataFrame by merging datasets, and add a column using [] and passing a pandas.Series to it. This matters because the added column might not have a value for every index; NaN is added by default for the missing values.
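The NaN behaviour described above can be seen with a tiny frame (the store names and costs are made up):

```python
import pandas as pd

df = pd.DataFrame({'Cost': [22.5, 2.5, 5.0]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# adding a column from a Series keyed by index:
# labels missing from the Series get NaN in the new column
df['Date'] = pd.Series({'Store 2': 'mid-May'})
```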
We can work with a portion of a DataFrame using df.sample(frac=0.1, random_state=10)
We can use .loc[] on rows and columns together in a DataFrame, and give multiple inputs for both, like df.loc[:, ['Mugdho', 'Snigdho']]. We can select all rows/columns for a given row/column. .loc[] is needed when selecting by row label; otherwise we may use plain []
We can interchange rows and columns using .T
We can drop a row from the DataFrame using .drop(); by setting axis=1 we can drop columns
We can drop any row that contains even a single NaN value using .dropna(); by setting axis=1 we can drop columns instead
We can store a copy of a DataFrame in another variable using .copy()
We can get the number of rows and columns using .shape
We can get the rank using .rank()
We can get the list of columns & index using .columns/.index, and get a particular value using the [] operator
We can reset the index to the default using .reset_index()
We can set the index using set_index(), and set a multi-level index using .set_index([' ', ' '])
We can sort the index using .sort_index()
We can sort by the values in a particular column using df.sort_values(by=[' '], inplace=True)
We can find the unique values in a column using .unique()
We can keep our desired columns using the slicing operation with the .loc[] operator, or pick particular columns using the [[]] operator: energy[['Unnamed: 2', 'Petajoules', 'Gigajoules', '%']]
We can use .fillna(method='ffill'/'bfill') to automatically fill missing values (newer pandas prefers .ffill()/.bfill())
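A short sketch of forward/backward filling; note that newer pandas prefers the .ffill()/.bfill() methods over the deprecated method= keyword (toy values):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # each NaN takes the previous value
backward = s.bfill()  # each NaN takes the next value
```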
We can merge two DataFrames using pd.merge() with inner/outer/left/right: based on the index by setting left_index/right_index=True, or by setting left_on/right_on=' '. We can merge on more than one key; for example, sometimes the first names match but the last names don't, so we pass a list to left_on/right_on.
We can read a CSV file using pd.read_csv(' '). After that we can set the index column and may skip rows/columns
We can read an Excel file using pd.read_excel(' '). After that we can set the index column and may skip rows/columns
We can do boolean operations using == / != / & / |, e.g. df['SUMLEV'] == 50
We can filter a DataFrame using boolean masks, like df[df['SUMLEV'] == 50], and find the length using len(df[df['SUMLEV'] == 50])
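A runnable sketch of boolean masking, with a made-up SUMLEV column in the spirit of the census examples later in these notes:

```python
import pandas as pd

df = pd.DataFrame({'SUMLEV': [40, 50, 50], 'POP': [100, 20, 30]})

mask = df['SUMLEV'] == 50      # a Series of True/False
counties = df[mask]            # keep only the matching rows
# combine conditions with & (and) or | (or); each side needs parentheses
both = df[(df['SUMLEV'] == 50) & (df['POP'] > 25)]
```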
np.min()/np.max(): by default it will find the min/max in a column, and we can change the axis
We can summarize the whole DataFrame using .groupby(' ').count()/sum()/mean()/min()/max()
We can iterate over a groupby using the group, frame pattern: group is the name of the group and frame is the associated DataFrame
We can combine the aggregate() operation with groupby to summarize a selected portion of a DataFrame: (df.groupby(' ')[' '].aggregate([np.sum, np.average]))

We can find how many rows are in a single group using .groupby(' ').count()
We can set a categorical variable using pd.CategoricalDtype()
We can sort using .sort_values()
We can set bins on a column using pd.cut()
We can subtract between the rows or columns of a DataFrame using .diff()
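The last item, .diff(), in a runnable sketch (toy values):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 3, 6], 'b': [2, 2, 5]})

row_diff = df.diff()        # difference with the previous row (first row is NaN)
col_diff = df.diff(axis=1)  # difference with the previous column (first column is NaN)
```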

Pandas module
Creating series
pandas.Series(['Tiger', 'Bear', 'Moose'])
pandas.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
pandas.Series({'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'})

Modifying an existing series with a new index
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
pandas.Series(sports, index=['Golf', 'Football', 'Archery'])

Output:
Golf        Scotland
Football    NaN
Archery     Bhutan
dtype: object

Look, we have added a new index that was not previously in the series, so a NaN is created.
Converting a (1, n) shape dataframe to a series
df.iloc[0]

Adding a row to an existing series
sports['cricket'] = 'Bangladesh'
Appending two series: .append()
original_sports = pandas.Series({'Archery': 'Bhutan',
                                 'Golf': 'Scotland',
                                 'Sumo': 'Japan',
                                 'Taekwondo': 'South Korea'})
cricket_loving_countries = pandas.Series(['Australia',
                                          'Barbados',
                                          'Pakistan',
                                          'England'],
                                         index=['Cricket',
                                                'Cricket',
                                                'Cricket',
                                                'Cricket'])
all_countries = original_sports.append(cricket_loving_countries)
all_countries

Output:
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object
.index      s.index                      It will print the index
.loc[]      s.loc['Cricket']             It will find 'Cricket'. It can return a single value or multiple values.
.loc[]      s.loc['Animal'] = 'Bears'    If the label is missing, it will be added to the series
np.sum()    np.sum(s)                    It will sum the series
                                         ***Check whether it works on a DataFrame
len()       len(s)                       It will find the length of the series (or use .shape)
+= / -=     s += 2                       It will add/subtract the value to/from every element
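One caveat for the .append() rows above: Series.append was deprecated and then removed in pandas 2.0. A sketch of the replacement, pd.concat(), on a shortened version of the same data:

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'Barbados'], index=['Cricket', 'Cricket'])

# pd.concat([...]) stacks the Series just like .append() used to,
# keeping the (possibly duplicated) index labels
all_countries = pd.concat([original_sports, cricket])
```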

Pandas Module
DataFrame
Creating a DataFrame

First we have to create multiple series; then, combining the series, we get a DataFrame. A DataFrame has column labels; by default the label is 0. We can change the label using .rename(), or we can add column names the same way we add index names.

***This method of setting column labels is time consuming; we are better off using the second method

import pandas
purchase_1 = pandas.Series({'Name': 'Chris',
                            'Item Purchased': 'Dog Food',
                            'Cost': 22.50})
purchase_2 = pandas.Series({'Name': 'Kevyn',
                            'Item Purchased': 'Kitty Litter',
                            'Cost': 2.50})
purchase_3 = pandas.Series({'Name': 'Vinod',
                            'Item Purchased': 'Bird Seed',
                            'Cost': 5.00})
df = pandas.DataFrame([purchase_1, purchase_2, purchase_3],
                      index=['Store 1', 'Store 1', 'Store 2'])
df

********* Like this *********
import pandas as pd
purchase_1 = pd.Series(['Chris', 'Dog Food', 22.50])
purchase_2 = pd.Series(['Kevyn', 'Kitty Litter', 2.50])
purchase_3 = pd.Series(['Vinod', 'Bird Seed', 5.00])
df = pd.DataFrame([purchase_1, purchase_2, purchase_3],
                  index=['Store 1', 'Store 1', 'Store 2'],
                  columns=['Name', 'Items', 'Cost'])
df

Renaming the default column label afterwards:
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+',
                   'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent',
                         'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df

We can set the column name the same way we set the index name:
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+',
                   'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent',
                         'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=['Grades'])
df
Changing the name of a column: .rename()
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
df.head()

In the dictionary, the key (before the :) is the name to be changed and the value (after the :) is the new name.
Changing the index name using .rename()
for col in df.index:
    if col[0] == 'S':
        df.rename(index={col: 'Gold' + col[4:]}, inplace=True)
df
Replacing data in a column of a dataframe
df.replace(to_replace=['ham', 'spam'], value=['1', '0'], inplace=True)

We can iterate over the rows of a data frame
for index, row in df.iterrows():
    ...

***Here index is the index of the dataframe and row holds the values of that row. We can access a particular value using row[column_name].
Changing the column values using replace
energy['Country'] = energy['Country'].replace({'China, Hong Kong Special Administrative Region': 'Hong Kong',
                                               'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                                               'Republic of Korea': 'South Korea',
                                               'United States of America': 'United States',
                                               'Iran (Islamic Republic of)': 'Iran'})

***Easy way:
df.replace(to_replace=['ham', 'spam'], value=['1', '0'], inplace=True)
df
In this way we can replace multiple values easily.
**It can show an error like "cannot compare"; then just reload the dataframe.
Scales

Temperature and compass axes are good examples of an interval scale, because there is never an absence of value. In an ordinal scale, e.g. 4% of students get A+ and 3% get A, so an ordinal scale is never evenly spaced.

Setting a categorical variable
t = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                        ordered=True)
It will store the grades as a categorical variable. Here we must set the order, which increases in priority from left to right.
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+',
'C', 'C-', 'D+', 'D'],
index=['excellent', 'excellent', 'excellent',
'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor',
'poor'],columns = ['Grades'],dtype = t)
df
Sorting by values
df.sort_values(by=['Grades'], inplace=True)
df
It always sorts in ascending order, meaning low to high.
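A runnable sketch of ordered categories and sorting, using a shortened grade list:

```python
import pandas as pd

# ordered=True means D < C < B < A for comparisons and sorting
t = pd.CategoricalDtype(categories=['D', 'C', 'B', 'A'], ordered=True)
df = pd.DataFrame({'Grades': pd.Series(['A', 'C', 'B', 'D'], dtype=t)})

df.sort_values(by=['Grades'], inplace=True)  # ascending: D, C, B, A
```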

Inserting an additional column
df['Location'] = None
df['country'] = df.index
df['Date'] = ['December 1', 'January 1', 'mid-May']
df['Delivered'] = True
df

df['Date'] = pd.Series({'Store 1': 'December 1', 'Store 2': 'mid-May'})
df['Delivered'] = pd.Series({'Store 1': 'True', 'Store 2': 'False'})
df
***This method adds a NaN value where a value is not provided
Inserting an additional row in a DataFrame: use merging of DataFrames
Merging datasets
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liaison'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df.head())
print()
print(student_df.head())
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
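A tiny runnable comparison of the how= options above, on made-up two-row frames:

```python
import pandas as pd

staff = pd.DataFrame({'Role': ['HR', 'Grader']}, index=['Kelly', 'James'])
students = pd.DataFrame({'School': ['Business', 'Law']}, index=['James', 'Mike'])

# outer keeps every index label from both frames, filling gaps with NaN
outer = pd.merge(staff, students, how='outer', left_index=True, right_index=True)
# inner keeps only labels present in both frames
inner = pd.merge(staff, students, how='inner', left_index=True, right_index=True)
# left keeps every label from the left frame
left = pd.merge(staff, students, how='left', left_index=True, right_index=True)
```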

Merging issues: if the same variable has two different values in the two DataFrames

staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR', 'Location': 'State Street'},
                         {'Name': 'Sally', 'Role': 'Course liaison', 'Location': 'Washington Avenue'},
                         {'Name': 'James', 'Role': 'Grader', 'Location': 'Washington Avenue'}])
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business', 'Location': '1024 Billiard Avenue'},
                           {'Name': 'Mike', 'School': 'Law', 'Location': 'Fraternity House #22'},
                           {'Name': 'Sally', 'School': 'Engineering', 'Location': '512 Wilson Crescent'}])
pd.merge(staff_df, student_df, how='outer', left_on='Name', right_on='Name')

Then, merging the same DataFrames on 'Location' instead:

pd.merge(staff_df, student_df, how='outer', left_on='Location', right_on='Location')
We can merge based on two parameters

staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins', 'Role': 'Director of HR'},
                         {'First Name': 'Sally', 'Last Name': 'Brooks', 'Role': 'Course liaison'},
                         {'First Name': 'James', 'Last Name': 'Wilde', 'Role': 'Grader'}])
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond', 'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith', 'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks', 'School': 'Engineering'}])
staff_df
student_df
pd.merge(staff_df, student_df, how='inner', left_on=['First Name', 'Last Name'], right_on=['First Name', 'Last Name'])

.columns/.index
df.index
df.columns

.loc[row, column]
df.loc['Store 2']
Output:
Name              Vinod
Item Purchased    Bird Seed
Cost              5
Name: Store 2, dtype: object

.loc[] can only be used when we have a row label in []; otherwise we may use plain []
df.loc[[('Michigan', 'Washtenaw County'),
        ('Michigan', 'Wayne County')]]
****As Washtenaw County is in Michigan, we always use this (state, county) format; otherwise it will show an error

df.loc[:, ['Name', 'Cost']]
As the first part indicates rows, all rows will be printed (and vice versa).
Output:
         Name   Cost
Store 1  Chris  22.5
Store 1  Kevyn  2.5
Store 2  Vinod  5.0

df.loc['Store 1', 'Cost']
Output:
Store 1    22.5
Store 1    2.5
Name: Cost, dtype: float64
[]  printing a single column
df['Item Purchased']    It will print a single column from the data frame

Updating columns using +=
costs = df['Cost']
costs += 2
costs
Output:
Store 1    24.5
Store 1    4.5
Store 2    7.0
Name: Cost, dtype: float64

***But it will update the main series (similar to other cases), so it is better to make a copy.
.T         df.T                  It transposes the dataframe
.drop()    df.drop('Store 1')    If we set inplace=True it will permanently delete the item
                                 **We can set axis=0 / axis=1
Output:
         Name   Item Purchased  Cost
Store 2  Vinod  Bird Seed       5.0

.dropna()  only_gold = only_gold.dropna()    It will drop every row that contains a NaN value
                                 If we set inplace=True it will permanently delete the item
                                 **If we set axis=1 we can drop columns instead

.copy()         copy_df = df.copy()                       It will create a copy. It is good practice to create
                copy_df = copy_df.drop('Store 1')         a copy, otherwise it may affect the main data
                copy_df
.shape          df.shape                                  It will print the number of rows and columns of a table
.rank()         df.rank()                                 It will print the rank
.reset_index()  df = df.reset_index()                     It will reset the index back to the default [0, 1, ...]
set_index()     df = df.set_index('Gold')                 It will set a new index
                df = df.set_index(['STNAME', 'CTYNAME'])  Multiple levels of index can be set
                df.head()
sort_index()    df = df.sort_index()                      It will sort the index
.unique()       df['REGION'].unique()                     It will find the unique values in the REGION column
.diff()         df.diff(periods=1, axis=0)                We can set periods and axis

We can keep our desired number of columns, just using the slicing method:
df = df.loc[:, ['STNAME',
                'CTYNAME',
                'BIRTHS2010',
                'BIRTHS2011',
                'BIRTHS2012',
                'BIRTHS2013',
                'BIRTHS2014',
                'BIRTHS2015',
                'POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
df
.fillna(method='ffill'/'bfill')
df = df.fillna(method='ffill')    It will fill the missing values
df
Groupby x = df.groupby('STNAME').count()
x = df.groupby('STNAME').sum()
x = df.groupby('STNAME').max()
x = df.groupby('STNAME').min()
x = df.groupby('STNAME').mean()
x.head()

Advanced groupby using the .aggregate() option
(df.groupby('STNAME')
 [['POPESTIMATE2010', 'POPESTIMATE2011']]
 .aggregate([np.sum, np.max, np.min, np.average, np.size, np.std])).head()

(df.groupby('STNAME')['POPESTIMATE2010'].aggregate([np.sum,np.average])).head()
If we want to iterate over a dataframe created by groupby:
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))

def fun(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

for group, frame in df.groupby(fun):
    print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
Setting bins
df = pd.read_csv('census.csv')
df = df[df['SUMLEV'] == 50]
df = df.set_index('STNAME')
df = df.groupby('STNAME')['CENSUS2010POP'].agg([np.average])
pd.cut(df['average'], 10)    # Here 10 indicates the number of bins

Look, Alabama and Alaska are in the same bin.

Finding df = df.set_index('STNAME')
how many
are there in def fun(item):
a group if item[0]<'M':
return 0
if item[0]<'Q':
return 1
return 2

for group, frame in df.groupby(fun):


print('There are ' + str(len(frame)) + ' records in group ' + str(group) + ' for processing.')
Finding df = df.loc[:,
max/min/m ['POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE201
ean/ 4','POPESTIMATE2015']]
sum/count df
from x =pd.DataFrame(np.min(df,axis = 1)) #may be:: df[‘min’] = np.min(df,axis = 1)
number of x.rename(columns = {0:'Min'},inplace = True)
row #x.set_index(['STNAME','CITYNAME'],inplace=True)# already index is set
y =pd.DataFrame(np.max(df,axis = 1)) #may be:: df[‘max’] = np.max(df,axis = 1)
y.rename(columns = {0:'Max'},inplace = True )
#y.set_index(['STNAME','CITYNAME'],inplace=True)# already index is set
df_with_min = pd.merge(df,x,left_index = True,right_index = True)
pd.merge(df_with_min,y,left_index = True,right_index = True)
Creating a Pivot table
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
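The cars file behind the line above is not included, so here is a self-contained sketch with made-up YEAR/Make/(kW) data (newer pandas prefers the string 'mean' over np.mean as aggfunc):

```python
import pandas as pd

df = pd.DataFrame({'YEAR': [2015, 2015, 2016, 2016],
                   'Make': ['Tesla', 'BMW', 'Tesla', 'BMW'],
                   '(kW)': [280, 125, 386, 125]})

# rows become YEAR, columns become Make, cells hold the mean (kW)
pivot = df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc='mean')
```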

Dealing with CSV & Excel files

.read_csv
df = pd.read_csv('olympics.csv')    It will read the csv file
df.head()

.read_excel
df = pd.read_excel('olympics.xlsx')
df.head()

modified (.read_csv)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1, names=['Status', 'Message'])
df.head()
It will set the first column as the index, skip the first row, and use names=[ ] as the column headers.
If the csv file does not have a header, names=[ ] will create the column headers.
Setting different parameters to query the data set
Boolean operators: < > == != & |
df['Gold'] > 0    It will print True/False
using the &/|/==/!= options:
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]
df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]
df[df['Gold'] > 0]
df = df[df['Gold'] == 50]
Idiomatic statement
***It helps to combine multiple statements into a single expression
(df.where(df['SUMLEV'] == 50)
   .dropna()
   .set_index(['STNAME', 'CTYNAME'])
   .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))

The non-idiomatic equivalent:
df = df[df['SUMLEV'] == 50]
df.set_index(['STNAME', 'CTYNAME'])
df.rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'})
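A self-contained sketch of the chained (idiomatic) form, on a made-up miniature census frame:

```python
import pandas as pd

df = pd.DataFrame({'SUMLEV': [40, 50, 50],
                   'STNAME': ['Alabama', 'Alabama', 'Alaska'],
                   'CTYNAME': ['X', 'Y', 'Z'],
                   'ESTIMATESBASE2010': [1, 2, 3]})

# where() blanks non-matching rows to NaN, dropna() removes them,
# then the index and column names are set in the same expression
result = (df.where(df['SUMLEV'] == 50)
            .dropna()
            .set_index(['STNAME', 'CTYNAME'])
            .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
```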
