5th LESSON (ANKUR - PROSCHOOL) - PANDAS

PANDAS LIBRARY
To store data from an external source such as an Excel workbook or a database, we need a data
structure that can hold different data types. It is also desirable to be able to refer to rows and
columns in the data by custom labels rather than numbered indexes.
The pandas library offers data structures designed with this in mind: the Series and the
DataFrame. Series are 1-dimensional labeled arrays similar to numpy's ndarrays, while
DataFrames are labeled 2-dimensional structures that essentially function as spreadsheet
tables.
The name pandas is derived from the term "panel data", an econometrics term for
multidimensional data. Using pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data:
load,
prepare,
manipulate,
model, and
analyze.
DATASTRUCTURES IN PANDAS
Series -------- 1 DIMENSION -------- 1D labeled homogeneous array, size immutable.
DataFrame ----- 2 DIMENSION -------- General 2D labeled, size-mutable tabular structure with
potentially heterogeneously typed columns
Panel --------- 3 DIMENSION -------- General 3D labeled, size-mutable array (deprecated, and
removed from pandas in version 0.25)
All pandas data structures are value mutable (their values can be changed), and all except Series
are size mutable. Series is size immutable.
Pandas Series
Series are very similar to ndarrays: the main difference between them is that with series, you can
provide custom index labels and then operations you perform on series automatically align the
data based on the labels.
To create a new series, first load the numpy and pandas libraries (pandas is preinstalled with the
Anaconda Python distribution.)
In [2]: import numpy as np

import pandas as pd
*Note: It is common practice to import pandas with the shorthand "pd".
In [3]: s = pd.Series()
print (s)
Series([], dtype: float64)
In [4]: data = np.array(['a','b','c','d'])

s = pd.Series(data)
print (s)
0 a
1 b
2 c
3 d
dtype: object
In [5]: data = {'a' : 0., 'b' : 1., 'c' : 2.}

s = pd.Series(data)
print (s)
a 0.0
b 1.0
c 2.0
dtype: float64
In [6]: s = pd.Series(5, index=[0, 1, 2, 3])

print (s)
0 5
1 5
2 5
3 5
dtype: int64
Define a new series by passing a collection of homogeneous data like ndarray or list, along with
a list of associated indexes to pd.Series():
In [7]: my_series = pd.Series( data = [2,3,5,4], # Data

index= ['a', 'b', 'c', 'd']) # Indexes
my_series
Out[7]: a 2
b 3
c 5
d 4
dtype: int64
You can also create a series from a dictionary, in which case the dictionary keys act as the labels
and the values act as the data:
In [8]: my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}
my_series2 = pd.Series(my_dict)
my_series2
Out[8]: a 5
b 4
c 8
x 2
dtype: int64
Similar to a dictionary, you can ACCESS ITEMS in a series by the labels:
In [9]: my_series = pd.Series( data = [2,3,5,4], # Data

index= ['a', 'b', 'c', 'd']) # Indexes
my_series
my_series["a"]
Out[9]: 2
Numeric indexing also works (recent pandas versions prefer my_series.iloc[0] for selecting by position):
In [10]: my_series[0]
Out[10]: 2
If you take a slice of a series, you get both the values and the labels contained in the slice:
In [11]: my_series[1:3]
Out[11]: b 3
c 5
dtype: int64
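The difference between slicing by position and slicing by label can be sketched as follows. This is a minimal illustration that rebuilds the same my_series as above; the variable names by_position and by_label are made up for the example.

```python
import pandas as pd

my_series = pd.Series(data=[2, 3, 5, 4], index=['a', 'b', 'c', 'd'])

# Positional slice: like a Python list, the end position (3) is excluded
by_position = my_series.iloc[1:3]

# Label slice: both end points ('b' and 'c') are included
by_label = my_series.loc['b':'c']

# Both selections contain the rows labelled 'b' and 'c'
print(by_position.equals(by_label))
```

Using .iloc and .loc explicitly makes it clear whether a number is meant as a position or as a label.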
OPERATIONS performed on two series align by label:
In [12]: my_series + my_series
Out[12]: a 4
b 6
c 10
d 8
dtype: int64
If you perform an operation with two series that have different labels, the unmatched labels will
return a value of NaN (not a number):
In [13]: my_series + my_series2 # Missing values convert the int datatype to float
Out[13]: a 7.0
b 7.0
c 13.0
d NaN
x NaN
dtype: float64
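If NaN is not wanted for the unmatched labels, one option is the add() method with fill_value. The sketch below assumes the same two series as above, renamed s1 and s2 for brevity; treating a missing label as 0 is only one possible choice and depends on what the data means.

```python
import pandas as pd

s1 = pd.Series([2, 3, 5, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series({'x': 2, 'a': 5, 'b': 4, 'c': 8})

# Plain "+" leaves NaN wherever a label is missing from one side
plain_sum = s1 + s2

# add() with fill_value=0 treats a missing label as 0 instead
filled_sum = s1.add(s2, fill_value=0)

print(filled_sum)
```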
DataFrame Creation and Indexing

A DataFrame is a 2D table with labeled columns that can each hold different types of data.
DataFrames are essentially a Python implementation of the types of tables you'd see in an Excel
workbook or SQL database. DataFrames are the de facto standard data structure for working with
tabular data in Python.
You can create a DataFrame out of a variety of data sources like dictionaries, 2D numpy arrays and
series using the pd.DataFrame() function. Dictionaries provide an intuitive way to create
DataFrames: when passed to pd.DataFrame(), a dictionary's keys become column labels and the
values become the columns themselves:
In [15]: df = pd.DataFrame()
print (df)
Empty DataFrame
Columns: []
Index: []
In [16]: data =[['Alex',10],['Bob',12],['Clarke',13]]
In [17]: df=pd.DataFrame(data,columns=['Name','Age'])
print (df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
In [18]: d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),

'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Adding a Column to a DataFrame

In [19]: df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df )
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
Deleting a Column from a dataframe

In [20]: del df['one']
print (df )
two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN
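Besides del, pandas offers drop() and pop() for removing columns. The following is a small sketch on a made-up DataFrame (the column values are illustrative only) showing how the two differ.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# drop() returns a new DataFrame and leaves the original untouched
without_one = df.drop(columns=['one'])

# pop() removes the column in place and returns it as a Series
three = df.pop('three')

print(list(without_one.columns))  # columns of the copy
print(list(df.columns))           # columns left in the original
```

drop() is convenient when the original DataFrame is still needed, while del and pop() change it directly.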
In [21]: # Create a dictionary with some different data types as values

my_dict = {"name" : ["Joe","Bob","Frans"],
"age" : np.array([10,15,20]),
"weight" : (75,123,239),
"height" : pd.Series([4.5, 5, 6.1],
index=["Joe","Bob","Frans"]),
"siblings" : 1,
"gender" : "M"}
df = pd.DataFrame(my_dict) # Convert the dict to DataFrame
df # Show the DataFrame
Out[21]:
age gender height name siblings weight
Joe 10 M 4.5 Joe 1 75
Bob 15 M 5.0 Bob 1 123
Frans 20 M 6.1 Frans 1 239
Notice that values in the dictionary you use to make a DataFrame can be a variety of sequence
objects, including lists, ndarrays, tuples and series. If you pass in singular values like a single
number or string, that value is duplicated for every row in the DataFrame (in this case gender is
set to "M" for all records and siblings is set to 1.).
Also note that in the DataFrame above, the rows were automatically given indexes that align with
the indexes of the series we passed in for the "height" column. If we did not use a series with
index labels to create our DataFrame, it would be given numeric row index labels by default:
In [22]: my_dict2 = {"name" : ["Joe","Bob","Frans"],

"age" : np.array([10,15,20]),
"weight" : (75,123,239),
"height" :[4.5, 5, 6.1],
"siblings" : 1,
"gender" : "M"}
df2 = pd.DataFrame(my_dict2) # Convert the dict to DataFrame
df2 # Show the DataFrame
Out[22]:
age gender height name siblings weight
0 10 M 4.5 Joe 1 75
1 15 M 5.0 Bob 1 123
2 20 M 6.1 Frans 1 239
You can provide custom row labels when creating a DataFrame by adding the index argument:
In [23]: df2 = pd.DataFrame(my_dict2,

index = my_dict["name"] )
df2
Out[23]:
age gender height name siblings weight
Joe 10 M 4.5 Joe 1 75
Bob 15 M 5.0 Bob 1 123
Frans 20 M 6.1 Frans 1 239
A DataFrame behaves like a dictionary of Series objects that each have the same length and
indexes. This means we can get, add and delete columns in a DataFrame the same way we
would when dealing with a dictionary:
In [24]: # Get a column by name

df2["weight"]
Out[24]: Joe 75
Bob 123
Frans 239
Name: weight, dtype: int64
Alternatively, you can get a column by label using "dot" notation:
In [25]: df2.weight
Out[25]: Joe 75
Bob 123
Frans 239
Name: weight, dtype: int64
In [26]: # Delete a column

del df2['name']
In [27]: # Add a new column

df2["IQ"] = [130, 105, 115]
df2
Out[27]:
age gender height siblings weight IQ
Joe 10 M 4.5 1 75 130
Bob 15 M 5.0 1 123 105
Frans 20 M 6.1 1 239 115
Inserting a single value into a DataFrame causes it to be repeated for all the rows:
In [28]: df2["Married"] = False

df2
Out[28]:
age gender height siblings weight IQ Married
Joe 10 M 4.5 1 75 130 False
Bob 15 M 5.0 1 123 105 False
Frans 20 M 6.1 1 239 115 False
When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be
filled with NaN:
In [29]: df2["College"] = pd.Series(["Harvard"], index=["Frans"])

df2
Out[29]:
age gender height siblings weight IQ Married College
Joe 10 M 4.5 1 75 130 False NaN
Bob 15 M 5.0 1 123 105 False NaN
Frans 20 M 6.1 1 239 115 False Harvard
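The contrast between assigning a list and assigning a Series can be sketched as below. The DataFrame and the column names by_position and by_label are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [10, 15, 20]}, index=['Joe', 'Bob', 'Frans'])

# A list is assigned by position: the first value goes to the first row
df['by_position'] = [1, 2, 3]

# A Series is assigned by label, whatever order its index is in;
# 'Bob' has no matching label, so that row gets NaN
df['by_label'] = pd.Series([3, 1], index=['Frans', 'Joe'])

print(df)
```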
You can select rows, columns, or both by label with df.loc[row, column]:
In [30]: df2.loc["Joe"] # Select row "Joe"
Out[30]: age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False
College NaN
Name: Joe, dtype: object
In [31]: df2.loc["Joe","IQ"] # Select row "Joe" and column "IQ"
Out[31]: 130
In [32]: df2.loc["Joe":"Bob" , "IQ":"College"] # Slice by label
Out[32]:
IQ Married College
Joe 130 False NaN
Bob 105 False NaN
Select rows or columns by numeric index with df.iloc[row, column]:
In [33]: df2.iloc[0] # Get row 0
Out[33]: age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False
College NaN
Name: Joe, dtype: object
In [34]: df2.iloc[0, 5] # Get row 0, column 5
Out[34]: 130
In [35]: df2.iloc[0:2, 5:8] # Slice by numeric row and column index
Out[35]:
IQ Married College
Joe 130 False NaN
Bob 105 False NaN
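One detail worth noting is that label slices with .loc include the end label, while position slices with .iloc exclude the end position. A minimal sketch on a cut-down, made-up version of the table above:

```python
import pandas as pd

df = pd.DataFrame({'age': [10, 15, 20], 'IQ': [130, 105, 115]},
                  index=['Joe', 'Bob', 'Frans'])

# Label slice: the end label 'Bob' is included, giving 2 rows
by_label = df.loc['Joe':'Bob', ['IQ']]

# Position slice: the end position 1 is excluded, giving 1 row
by_position = df.iloc[0:1, [1]]

# .loc can also be used to assign to a single cell
df.loc['Bob', 'IQ'] = 110

print(by_label.shape, by_position.shape)
```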
In [36]: df=pd.read_excel("SAMPLEPANDAS.xlsx")
In [39]: df
Out[39]:
Month Values
0 2018-01-01 201
1 2018-02-01 107
2 2018-03-01 483
3 2018-04-01 240
4 2018-05-01 356
5 2018-06-01 369
6 2018-07-01 266
7 2018-08-01 308
8 2018-09-01 453
9 2018-10-01 395
10 2018-11-01 487
11 2018-12-01 403
12 2019-01-01 478
13 2019-02-01 112
14 2019-03-01 262
15 2019-04-01 283
16 2019-05-01 444
17 2019-06-01 233
18 2019-07-01 324
19 2019-08-01 305
20 2019-09-01 299
21 2019-10-01 294
22 2019-11-01 468
23 2019-12-01 321
HEAD & TAIL

In [40]: df.head()
Out[40]:
Month Values
0 2018-01-01 201
1 2018-02-01 107
2 2018-03-01 483
3 2018-04-01 240
4 2018-05-01 356
In [41]: df.tail()
Out[41]:
Month Values
19 2019-08-01 305
20 2019-09-01 299
21 2019-10-01 294
22 2019-11-01 468
23 2019-12-01 321
TREATMENT OF NAN VALUES

In [42]: #Creating a DataFrame without Excel
# a DataFrame can be created directly in Python:
# pd.DataFrame({'A':[elem1, elem2, elem3...etc], 'B':[elem1, elem2, elem3...etc], 'C':[elem1, elem2, elem3...etc]})
# all columns should have the same number of values; if any value is blank, we need to specify it
# as np.nan (if np gives an error, we need to import numpy)
import pandas as pd
import numpy as np
df=pd.DataFrame({'A':[1,2,np.nan,4,7],'B':[np.nan,3.5,4,5,8], 'C':[3,np.nan,2.5,6,9]})
In [43]: df
Out[43]:
A B C
0 1.0 NaN 3.0
1 2.0 3.5 NaN
2 NaN 4.0 2.5
3 4.0 5.0 6.0
4 7.0 8.0 9.0
In [44]: #to remove the rows that contain NaN values, use df.dropna(), where df is the DataFrame
df.dropna()
Out[44]:
A B C
3 4.0 5.0 6.0
4 7.0 8.0 9.0
In [45]: #we can also assign df.dropna() to a new dataframe, say df2
df2=df.dropna()
df2
Out[45]:
A B C
3 4.0 5.0 6.0
4 7.0 8.0 9.0
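By default dropna() removes every row that has at least one NaN. The sketch below, which rebuilds the same DataFrame, shows a few of its standard options for being more selective; the result variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 7],
                   'B': [np.nan, 3.5, 4, 5, 8],
                   'C': [3, np.nan, 2.5, 6, 9]})

# Drop a row only if column 'A' is missing
rows_with_a = df.dropna(subset=['A'])

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Drop columns (axis=1) that contain any NaN
complete_columns = df.dropna(axis=1)

print(len(rows_with_a), len(at_least_two), complete_columns.shape)
```

In this data every column has one NaN, so axis=1 leaves no columns at all, which is a reminder to check what dropna() removes before relying on it.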
FILLING UP NAN VALUES
In [46]: #we can fill up the NaN values using fillna function as below:
df.fillna(value=0)
Out[46]:
A B C
0 1.0 0.0 3.0
1 2.0 3.5 0.0
2 0.0 4.0 2.5
3 4.0 5.0 6.0
4 7.0 8.0 9.0
In [47]: df["NewValue"]=pd.Series(data=['Ankur','Sondeep','Poornima', 'Divya',

'Vivek'])
df
Out[47]:
A B C NewValue
0 1.0 NaN 3.0 Ankur
1 2.0 3.5 NaN Sondeep
2 NaN 4.0 2.5 Poornima
3 4.0 5.0 6.0 Divya
4 7.0 8.0 9.0 Vivek
In [48]: #for replacing the null values with mean for an entire column
df['A']=df['A'].fillna(value=df['A'].mean())
df
Out[48]:
A B C NewValue
0 1.0 NaN 3.0 Ankur
1 2.0 3.5 NaN Sondeep
2 3.5 4.0 2.5 Poornima
3 4.0 5.0 6.0 Divya
4 7.0 8.0 9.0 Vivek
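In [49]: #for replacing the null values with the column mean for all the columns
#(numeric_only=True skips the text column; recent pandas versions raise an error without it)
df3=df.fillna(value=df.mean(numeric_only=True))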
df3
Out[49]:
A B C NewValue
0 1.0 5.125 3.000 Ankur
1 2.0 3.500 5.125 Sondeep
2 3.5 4.000 2.500 Poornima
3 4.0 5.000 6.000 Divya
4 7.0 8.000 9.000 Vivek
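How fillna() pairs each column with its own mean can be sketched as follows. This is a minimal illustration on a reduced copy of the data; the names column_means and filled are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 7],
                   'B': [np.nan, 3.5, 4, 5, 8],
                   'NewValue': ['Ankur', 'Sondeep', 'Poornima', 'Divya', 'Vivek']})

# numeric_only=True restricts the mean to numeric columns, so the
# string column is ignored instead of raising an error
column_means = df.mean(numeric_only=True)

# fillna() with a Series fills each column using the value under its own label
filled = df.fillna(value=column_means)

print(filled)
```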
USING FOR TO FILL NAN VALUES

In [50]: #for replacing the Nan Values with Mean of Columns for all the columns in a DataFrame
for x in df.columns:
    df[x]=df[x].fillna(value=df[x].mean())
---------------------------------------------------------------------------
ValueError: could not convert string to float: 'AnkurSondeepPoornimaDivyaVivek'
During handling of the above exception, another exception occurred:
ValueError: complex() arg is a malformed string
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-50-d8f394443f45> in <module>()
1 #for replacing the Nan Values with Mean of Columns for all the columns in a DataFrame
2 for x in df.columns:
----> 3 df[x]=df[x].fillna(value=df[x].mean())
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in nanmean(values, axis, skipna)
355 count = _get_counts(mask, axis, dtype=dtype_count)
--> 356 the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in _ensure_numeric(x)
824 raise TypeError('Could not convert {value!s} to numeric'
--> 825 .format(value=x))
TypeError: Could not convert AnkurSondeepPoornimaDivyaVivek to numeric
In [51]: df
Out[51]:
A B C NewValue
0 1.0 5.125 3.000 Ankur
1 2.0 3.500 5.125 Sondeep
2 3.5 4.000 2.500 Poornima
3 4.0 5.000 6.000 Divya
4 7.0 8.000 9.000 Vivek
In [52]: #using a for loop to apply the mean of the column to the null values (only for float columns)
for x in df.columns:
    if df[x].dtype.kind=="f":
        df[x]=df[x].fillna(value=df[x].mean())
In [53]: df
Out[53]:
A B C NewValue
0 1.0 5.125 3.000 Ankur
1 2.0 3.500 5.125 Sondeep
2 3.5 4.000 2.500 Poornima
3 4.0 5.000 6.000 Divya
4 7.0 8.000 9.000 Vivek
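The check dtype.kind == "f" only matches float columns. As a possible alternative, select_dtypes can pick out all numeric columns at once without a loop. This is a sketch on made-up data (the 'count' column and the missing name are assumptions added for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 7],
                   'count': [1, 2, 3, 4, 5],
                   'NewValue': ['Ankur', 'Sondeep', None, 'Divya', 'Vivek']})

# select_dtypes(include='number') picks integer and float columns alike
numeric_cols = df.select_dtypes(include='number').columns

# Fill each numeric column with its own mean; other columns are left alone
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print(list(numeric_cols))
```

The missing value in the text column stays missing, because a mean has no meaning for strings.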
In [54]: #to pull out unique values within a column, use dataframename['columnname'].unique()
df['A'].unique()
Out[54]: array([1. , 2. , 3.5, 4. , 7. ])
In [55]: df.drop_duplicates()
Out[55]:
A B C NewValue
0 1.0 5.125 3.000 Ankur
1 2.0 3.500 5.125 Sondeep
2 3.5 4.000 2.500 Poornima
3 4.0 5.000 6.000 Divya
4 7.0 8.000 9.000 Vivek
In [56]: #specifying conditions to filter rows, e.g. column A values > 1 and column B values < 5
#for an AND condition use '&' and for an OR condition use '|'
newdf=df[(df['A']>1) & (df['B']<5)]
newdf
Out[56]:
A B C NewValue
1 2.0 3.5 5.125 Sondeep
2 3.5 4.0 2.500 Poornima
In [57]: #checking for NULL values inside a filter condition with isnull()

newdf1=df[(df['A']>3) | (df['B']==8) | (df['C'].isnull())]
newdf1
Out[57]:
A B C NewValue
2 3.5 4.0 2.5 Poornima
3 4.0 5.0 6.0 Divya
4 7.0 8.0 9.0 Vivek
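A few more filtering building blocks can be sketched as below, on a made-up two-column DataFrame. The ~ operator and isin() are standard pandas features; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 7],
                   'B': [np.nan, 3.5, 4, 5, 8]})

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators such as > and <
both = df[(df['A'] > 1) & (df['B'] < 5)]

# ~ negates a condition: keep rows where B is NOT missing
b_present = df[~df['B'].isnull()]

# isin() tests membership in a list of allowed values
picked = df[df['A'].isin([1, 7])]

print(len(both), len(b_present), len(picked))
```

Note that comparisons with NaN are always False, so rows with a missing value never pass a condition such as df['A'] > 1.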
SORTING
In [58]: df.sort_values('B') #sort by column B (ascending by default)
df.sort_values(by='B') #the same call, with the column passed by keyword
Out[58]:
A B C NewValue
1 2.0 3.500 5.125 Sondeep
2 3.5 4.000 2.500 Poornima
3 4.0 5.000 6.000 Divya
0 1.0 5.125 3.000 Ankur
4 7.0 8.000 9.000 Vivek
In [59]: #sorting based on a column in descending order

df.sort_values('B', ascending=False)
Out[59]:
A B C NewValue
4 7.0 8.000 9.000 Vivek
0 1.0 5.125 3.000 Ankur
3 4.0 5.000 6.000 Divya
2 3.5 4.000 2.500 Poornima
1 2.0 3.500 5.125 Sondeep
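Sorting by more than one column is where a list for ascending becomes useful: one True/False per column. A minimal sketch on made-up data (the 'team' and 'score' columns are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                   'score': [3, 1, 2, 4]})

# Sort by team A-Z, then by score from highest to lowest within each team
ordered = df.sort_values(by=['team', 'score'], ascending=[True, False])

# sort_values() returns a new DataFrame; reset_index(drop=True) renumbers the rows
ordered = ordered.reset_index(drop=True)

print(ordered)
```

The original df keeps its order, so the sorted result has to be assigned to a variable to be kept.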
DESCRIBE
In [60]: d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack','Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
x=pd.DataFrame(d)
In [61]: x.describe()
Out[61]:
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
In [62]: x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
Age 12 non-null int64
Name 12 non-null object
Rating 12 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes
In [63]: x.columns
Out[63]: Index(['Age', 'Name', 'Rating'], dtype='object')
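describe() skips the text column by default. The sketch below, on a shortened and made-up version of the table, shows the include='all' option and how single statistics can be read directly.

```python
import pandas as pd

x = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky', 'Vin'],
                  'Age': [25, 26, 25, 23],
                  'Rating': [4.23, 3.24, 3.98, 2.56]})

# By default describe() summarises only the numeric columns
numeric_summary = x.describe()

# include='all' adds text columns (count, unique, top, freq)
full_summary = x.describe(include='all')

# Individual statistics are also available as methods
print(x['Age'].mean(), x['Age'].median(), x['Age'].max())
```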
Basic Data Visualizations

In [68]: import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()
Out[68]: <matplotlib.axes._subplots.AxesSubplot at 0x192621c15f8>

In [69]: import numpy as np
df=pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.plot.hist(bins=10)
Out[69]: <matplotlib.axes._subplots.AxesSubplot at 0x19262248ef0>
In [70]: import numpy as np
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
Out[70]: <matplotlib.axes._subplots.AxesSubplot at 0x19262211a90>
JOINING DATAFRAMES
In [71]: df=pd.DataFrame({'A':[1,2,3,4,5], 'B':[6,7,8,9,10], 'C': [11,12,13,14,15] })
df1=pd.DataFrame({'A':[100,101,102,103,104], 'B': [105,106,107,108,109], 'C':[110,111,112,113,114] })
In [72]: #How to concatenate: DataFrames with the same columns stack cleanly; missing columns are filled with NaN
pd.concat([df, df1])
Out[72]:
A B C
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
0 100 105 110
1 101 106 111
2 102 107 112
3 103 108 113
4 104 109 114
In [73]: df2=pd.DataFrame({'A':[10,20,30,40], 'B':[25,35,45,55]})
In [74]: pdconcat=pd.concat([df,df1,df2])
pdconcat
Out[74]: A B C
0 1 6 11.0
1 2 7 12.0
2 3 8 13.0
3 4 9 14.0
4 5 10 15.0
0 100 105 110.0
1 101 106 111.0
2 102 107 112.0
3 103 108 113.0
4 104 109 114.0
0 10 25 NaN
1 20 35 NaN
2 30 45 NaN
3 40 55 NaN
In [75]: print(pdconcat['A'].dtype.kind)
print(pdconcat['B'].dtype.kind)
print(pdconcat['C'].dtype.kind)
i
i
f
In [76]: #In case the user wants to concatenate the data horizontally, use "axis=1"; the default value is "axis=0"
pdconcat1=pd.concat([df,df1,df2], axis=1)
pdconcat1
Out[76]: A B C A B C A B
0 1 6 11 100 105 110 10.0 25.0
1 2 7 12 101 106 111 20.0 35.0
2 3 8 13 102 107 112 30.0 45.0
3 4 9 14 103 108 113 40.0 55.0
4 5 10 15 104 109 114 NaN NaN
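In the vertical results above, the row labels 0 to 4 repeat for every source table. Two standard options of pd.concat deal with this, sketched here on small made-up DataFrames.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df1 = pd.DataFrame({'A': [100, 101], 'B': [102, 103]})

# By default the original row labels are kept, so 0 and 1 appear twice
stacked = pd.concat([df, df1])

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([df, df1], ignore_index=True)

# keys= adds an outer index level recording the source of each row
labelled = pd.concat([df, df1], keys=['first', 'second'])

print(list(stacked.index), list(renumbered.index))
```

With keys, labelled.loc['second'] returns just the rows that came from df1.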
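In [77]: #Merging two DataFrames

#Note: there is no comma between "Divya" and "Saroj", so Python joins the two adjacent
#strings into "DivyaSaroj"; the list therefore has 10 names, matching the 10 roll numbers
T1=pd.DataFrame({'RollNo.':[1,2,3,4,5,6,7,8,9,10], 'StudentName':["Akash", "Vivek", "Sondeep", "Pranav", "Purnima", "Divya" "Saroj", "Nimisha", "ajay", "Manju", "Mr. X"]})
T1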
Out[77]:
RollNo. StudentName
0 1 Akash
1 2 Vivek
2 3 Sondeep
3 4 Pranav
4 5 Purnima
5 6 DivyaSaroj
6 7 Nimisha
7 8 ajay
8 9 Manju
9 10 Mr. X
In [78]: T2=pd.DataFrame({'RollNo.':[1,2,3,4,5,6,7,8,9,10], 'City':["Shimla", "Gurgaon", "Delhi", "Amritsar", "Jalandhar", "Gandhinagar", "Almora", "Ooty", "Mumbai", "Chennai"]})
T2
Out[78]:
City RollNo.
0 Shimla 1
1 Gurgaon 2
2 Delhi 3
3 Amritsar 4
4 Jalandhar 5
5 Gandhinagar 6
6 Almora 7
7 Ooty 8
8 Mumbai 9
9 Chennai 10
In [79]: #there are two ways of joining tables: (1) merging; (2) using joins
#In the above two tables (T1 & T2), RollNo. is the common key, and both tables contain the same number of rows
#MERGING!!!
pd.merge(T1,T2, how='inner', on='RollNo.')
#T3=pd.DataFrame({'RollNo.':[1,3,5,7,9,11,13,15,17,19], 'PinCode':["171001", "122002", "110005", "183005", "181001", "168754", "987654", "547645", "654789", "123456"]})
Out[79]:
RollNo. StudentName City
0 1 Akash Shimla
1 2 Vivek Gurgaon
2 3 Sondeep Delhi
3 4 Pranav Amritsar
4 5 Purnima Jalandhar
5 6 DivyaSaroj Gandhinagar
6 7 Nimisha Almora
7 8 ajay Ooty
8 9 Manju Mumbai
9 10 Mr. X Chennai
In [80]: T3=pd.DataFrame({'RollNo.':[1,3,5,7,9,11,13,15,17,19], 'PinCode':["171001", "122002", "110005", "183005", "181001", "168754", "987654", "547645", "654789", "123456"]})
T3
Out[80]:
PinCode RollNo.
0 171001 1
1 122002 3
2 110005 5
3 183005 7
4 181001 9
5 168754 11
6 987654 13
7 547645 15
8 654789 17
9 123456 19
In [81]: #Inner Join using Merge

pd.merge(T1,T3, how='inner', on='RollNo.')
Out[81]:
RollNo. StudentName PinCode
0 1 Akash 171001
1 3 Sondeep 122002
2 5 Purnima 110005
3 7 Nimisha 183005
4 9 Manju 181001
In [82]: #Left Join using Merge

pd.merge(T1,T3, how='left', on='RollNo.')
Out[82]:
RollNo. StudentName PinCode
0 1 Akash 171001
1 2 Vivek NaN
2 3 Sondeep 122002
3 4 Pranav NaN
4 5 Purnima 110005
5 6 DivyaSaroj NaN
6 7 Nimisha 183005
7 8 ajay NaN
8 9 Manju 181001
9 10 Mr. X NaN
In [83]: #right join using merge

pd.merge(T1,T3, how='right', on='RollNo.')
Out[83]:
RollNo. StudentName PinCode
0 1 Akash 171001
1 3 Sondeep 122002
2 5 Purnima 110005
3 7 Nimisha 183005
4 9 Manju 181001
5 11 NaN 168754
6 13 NaN 987654
7 15 NaN 547645
8 17 NaN 654789
9 19 NaN 123456
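The inner, left and right joins above each drop some keys. An outer join keeps all of them, and the indicator option shows where each row came from. A minimal sketch on shortened, made-up versions of T1 and T3:

```python
import pandas as pd

T1 = pd.DataFrame({'RollNo.': [1, 2, 3],
                   'StudentName': ['Akash', 'Vivek', 'Sondeep']})
T3 = pd.DataFrame({'RollNo.': [1, 3, 11],
                   'PinCode': ['171001', '122002', '168754']})

# how='outer' keeps every key from both tables;
# indicator=True adds a "_merge" column saying where each row came from
outer = pd.merge(T1, T3, how='outer', on='RollNo.', indicator=True)

print(outer)
```

The _merge column takes the values 'both', 'left_only' or 'right_only', which makes it easy to find keys that failed to match.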
In [84]: #Executing the same operations using the join method

#This can be done after setting the index to the common column in both tables; for this we use the set_index method
T2.set_index('RollNo.').join(T3.set_index('RollNo.'), how='inner')
Out[84]:
City PinCode
RollNo.
1 Shimla 171001
3 Delhi 122002
5 Jalandhar 110005
7 Almora 183005
9 Mumbai 181001
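The relationship between join() and merge() can be sketched as follows: join() matches on the index, which merge() can also do when asked. The tables are shortened, made-up versions of T2 and T3, and the names left and right are illustrative.

```python
import pandas as pd

T2 = pd.DataFrame({'RollNo.': [1, 2, 3],
                   'City': ['Shimla', 'Gurgaon', 'Delhi']})
T3 = pd.DataFrame({'RollNo.': [1, 3, 11],
                   'PinCode': ['171001', '122002', '168754']})

left = T2.set_index('RollNo.')
right = T3.set_index('RollNo.')

# join() matches on the index of both tables
joined = left.join(right, how='inner')

# A comparable result with merge(), telling it to use the index on both sides
merged = pd.merge(left, right, how='inner', left_index=True, right_index=True)

print(joined)
print(merged)
```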
LOADING INBUILT DATASETS IN PYTHON

In [86]: from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names
print(column_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [87]: df=pd.DataFrame(data, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal width'])
In [88]: df
Out[88]:
Sepal Length Sepal Width Petal Length Petal width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
8 4.4 2.9 1.4 0.2
9 4.9 3.1 1.5 0.1
10 5.4 3.7 1.5 0.2
11 4.8 3.4 1.6 0.2
12 4.8 3.0 1.4 0.1
13 4.3 3.0 1.1 0.1
14 5.8 4.0 1.2 0.2
15 5.7 4.4 1.5 0.4
16 5.4 3.9 1.3 0.4
17 5.1 3.5 1.4 0.3
18 5.7 3.8 1.7 0.3
19 5.1 3.8 1.5 0.3
20 5.4 3.4 1.7 0.2
21 5.1 3.7 1.5 0.4
22 4.6 3.6 1.0 0.2
23 5.1 3.3 1.7 0.5
24 4.8 3.4 1.9 0.2
25 5.0 3.0 1.6 0.2
26 5.0 3.4 1.6 0.4
27 5.2 3.5 1.5 0.2
28 5.2 3.4 1.4 0.2
29 4.7 3.2 1.6 0.2
... ... ... ... ...
120 6.9 3.2 5.7 2.3
121 5.6 2.8 4.9 2.0
122 7.7 2.8 6.7 2.0
123 6.3 2.7 4.9 1.8
124 6.7 3.3 5.7 2.1
125 7.2 3.2 6.0 1.8
126 6.2 2.8 4.8 1.8
127 6.1 3.0 4.9 1.8
128 6.4 2.8 5.6 2.1
129 7.2 3.0 5.8 1.6
130 7.4 2.8 6.1 1.9
131 7.9 3.8 6.4 2.0
132 6.4 2.8 5.6 2.2
133 6.3 2.8 5.1 1.5
134 6.1 2.6 5.6 1.4
135 7.7 3.0 6.1 2.3
136 6.3 3.4 5.6 2.4
137 6.4 3.1 5.5 1.8
138 6.0 3.0 4.8 1.8
139 6.9 3.1 5.4 2.1
140 6.7 3.1 5.6 2.4
141 6.9 3.1 5.1 2.3
142 5.8 2.7 5.1 1.9
143 6.8 3.2 5.9 2.3
144 6.7 3.3 5.7 2.5
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
150 rows × 4 columns
In [89]: #Alternatively
import pandas as pd
a = pd.DataFrame(iris.data, columns=iris.feature_names)
In [91]: a
Out[91]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
8 4.4 2.9 1.4 0.2
9 4.9 3.1 1.5 0.1
10 5.4 3.7 1.5 0.2
11 4.8 3.4 1.6 0.2
12 4.8 3.0 1.4 0.1
13 4.3 3.0 1.1 0.1
14 5.8 4.0 1.2 0.2
15 5.7 4.4 1.5 0.4
16 5.4 3.9 1.3 0.4
17 5.1 3.5 1.4 0.3
18 5.7 3.8 1.7 0.3
19 5.1 3.8 1.5 0.3
20 5.4 3.4 1.7 0.2
21 5.1 3.7 1.5 0.4
22 4.6 3.6 1.0 0.2
23 5.1 3.3 1.7 0.5
24 4.8 3.4 1.9 0.2
25 5.0 3.0 1.6 0.2
26 5.0 3.4 1.6 0.4
27 5.2 3.5 1.5 0.2
28 5.2 3.4 1.4 0.2
29 4.7 3.2 1.6 0.2
... ... ... ... ...
120 6.9 3.2 5.7 2.3
121 5.6 2.8 4.9 2.0
122 7.7 2.8 6.7 2.0
123 6.3 2.7 4.9 1.8
124 6.7 3.3 5.7 2.1
125 7.2 3.2 6.0 1.8
126 6.2 2.8 4.8 1.8
127 6.1 3.0 4.9 1.8
128 6.4 2.8 5.6 2.1
129 7.2 3.0 5.8 1.6
130 7.4 2.8 6.1 1.9
131 7.9 3.8 6.4 2.0
132 6.4 2.8 5.6 2.2
133 6.3 2.8 5.1 1.5
134 6.1 2.6 5.6 1.4
135 7.7 3.0 6.1 2.3
136 6.3 3.4 5.6 2.4
137 6.4 3.1 5.5 1.8
138 6.0 3.0 4.8 1.8
139 6.9 3.1 5.4 2.1
140 6.7 3.1 5.6 2.4
141 6.9 3.1 5.1 2.3
142 5.8 2.7 5.1 1.9
143 6.8 3.2 5.9 2.3
144 6.7 3.3 5.7 2.5
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
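In [92]: #Note: load_boston was removed from scikit-learn in version 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston=load_boston()
data=boston.data
column_names = boston.feature_names
a = pd.DataFrame(boston.data, columns=boston.feature_names)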
In [93]: a
Out[93]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386
10 0.22489 12.5 7.87 0.0 0.524 6.377 94.3 6.3467 5.0 311.0 15.2 392
11 0.11747 12.5 7.87 0.0 0.524 6.009 82.9 6.2267 5.0 311.0 15.2 396
12 0.09378 12.5 7.87 0.0 0.524 5.889 39.0 5.4509 5.0 311.0 15.2 390
13 0.62976 0.0 8.14 0.0 0.538 5.949 61.8 4.7075 4.0 307.0 21.0 396
14 0.63796 0.0 8.14 0.0 0.538 6.096 84.5 4.4619 4.0 307.0 21.0 380
15 0.62739 0.0 8.14 0.0 0.538 5.834 56.5 4.4986 4.0 307.0 21.0 395
16 1.05393 0.0 8.14 0.0 0.538 5.935 29.3 4.4986 4.0 307.0 21.0 386
17 0.78420 0.0 8.14 0.0 0.538 5.990 81.7 4.2579 4.0 307.0 21.0 386
18 0.80271 0.0 8.14 0.0 0.538 5.456 36.6 3.7965 4.0 307.0 21.0 288
19 0.72580 0.0 8.14 0.0 0.538 5.727 69.5 3.7965 4.0 307.0 21.0 390
20 1.25179 0.0 8.14 0.0 0.538 5.570 98.1 3.7979 4.0 307.0 21.0 376
21 0.85204 0.0 8.14 0.0 0.538 5.965 89.2 4.0123 4.0 307.0 21.0 392
22 1.23247 0.0 8.14 0.0 0.538 6.142 91.7 3.9769 4.0 307.0 21.0 396
23 0.98843 0.0 8.14 0.0 0.538 5.813 100.0 4.0952 4.0 307.0 21.0 394
24 0.75026 0.0 8.14 0.0 0.538 5.924 94.1 4.3996 4.0 307.0 21.0 394
25 0.84054 0.0 8.14 0.0 0.538 5.599 85.7 4.4546 4.0 307.0 21.0 303
26 0.67191 0.0 8.14 0.0 0.538 5.813 90.3 4.6820 4.0 307.0 21.0 376
27 0.95577 0.0 8.14 0.0 0.538 6.047 88.8 4.4534 4.0 307.0 21.0 306
28 0.77299 0.0 8.14 0.0 0.538 6.495 94.4 4.4547 4.0 307.0 21.0 387
29 1.00245 0.0 8.14 0.0 0.538 6.674 87.3 4.2390 4.0 307.0 21.0 380
... ... ... ... ... ... ... ... ... ... ... ... ...
476 4.87141 0.0 18.10 0.0 0.614 6.484 93.6 2.3053 24.0 666.0 20.2 396
477 15.02340 0.0 18.10 0.0 0.614 5.304 97.3 2.1007 24.0 666.0 20.2 349
478 10.23300 0.0 18.10 0.0 0.614 6.185 96.7 2.1705 24.0 666.0 20.2 379
479 14.33370 0.0 18.10 0.0 0.614 6.229 88.0 1.9512 24.0 666.0 20.2 383
480 5.82401 0.0 18.10 0.0 0.532 6.242 64.7 3.4242 24.0 666.0 20.2 396
481 5.70818 0.0 18.10 0.0 0.532 6.750 74.9 3.3317 24.0 666.0 20.2 393
482 5.73116 0.0 18.10 0.0 0.532 7.061 77.0 3.4106 24.0 666.0 20.2 395
483 2.81838 0.0 18.10 0.0 0.532 5.762 40.3 4.0983 24.0 666.0 20.2 392
484 2.37857 0.0 18.10 0.0 0.583 5.871 41.9 3.7240 24.0 666.0 20.2 370
485 3.67367 0.0 18.10 0.0 0.583 6.312 51.9 3.9917 24.0 666.0 20.2 388
486 5.69175 0.0 18.10 0.0 0.583 6.114 79.8 3.5459 24.0 666.0 20.2 392
487 4.83567 0.0 18.10 0.0 0.583 5.905 53.2 3.1523 24.0 666.0 20.2 388
488 0.15086 0.0 27.74 0.0 0.609 5.454 92.7 1.8209 4.0 711.0 20.1 395
489 0.18337 0.0 27.74 0.0 0.609 5.414 98.3 1.7554 4.0 711.0 20.1 344
490 0.20746 0.0 27.74 0.0 0.609 5.093 98.0 1.8226 4.0 711.0 20.1 318
491 0.10574 0.0 27.74 0.0 0.609 5.983 98.8 1.8681 4.0 711.0 20.1 390
492 0.11132 0.0 27.74 0.0 0.609 5.983 83.5 2.1099 4.0 711.0 20.1 396
493 0.17331 0.0 9.69 0.0 0.585 5.707 54.0 2.3817 6.0 391.0 19.2 396
494 0.27957 0.0 9.69 0.0 0.585 5.926 42.6 2.3817 6.0 391.0 19.2 396
495 0.17899 0.0 9.69 0.0 0.585 5.670 28.8 2.7986 6.0 391.0 19.2 393
496 0.28960 0.0 9.69 0.0 0.585 5.390 72.9 2.7986 6.0 391.0 19.2 396
497 0.26838 0.0 9.69 0.0 0.585 5.794 70.6 2.8927 6.0 391.0 19.2 396
498 0.23912 0.0 9.69 0.0 0.585 6.019 65.3 2.4091 6.0 391.0 19.2 396
499 0.17783 0.0 9.69 0.0 0.585 5.569 73.5 2.3999 6.0 391.0 19.2 395
500 0.22438 0.0 9.69 0.0 0.585 6.027 79.7 2.4982 6.0 391.0 19.2 396
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396
OTHER DATA SETS:
load_diabetes
load_digits
Exploring DataFrames
Exploring data is an important first step in most data analyses. DataFrames come with a variety
of functions to help you explore and summarize the data they contain.
First, let's load a data set to explore: the mtcars data set. The mtcars data set comes with the
ggplot library, a port of a popular R plotting library called ggplot2. ggplot does not come with
Anaconda, but you can install it by opening a console (cmd.exe) and running "pip install ggplot".
Here the data is read from a CSV file instead:
In [135]: mtcars=pd.read_csv('mtcars.csv')
In [141]: #mtcars=mtcars.rename(index=str, columns={'Unnamed: 0':'Name'})

mtcars
Out[141]:
Name mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Notice that mtcars is loaded as a DataFrame. We can check the dimensions and size of a
DataFrame with df.shape:
In [142]: mtcars.shape # Check dimensions
Out[142]: (32, 12)
The output shows that mtcars has 32 rows and 12 columns.
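As a minimal sketch of how the shape tuple can be used, the example below builds a small, hypothetical three-row sample (not the real mtcars file) and unpacks its dimensions. The name `cars` and its values are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample with a few mtcars-style columns
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

n_rows, n_cols = cars.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(cars))              # len() of a DataFrame is its number of rows
print(cars.size)              # size is the total number of cells (rows * columns)
```

Note that shape is an attribute, not a method, so it is written without parentheses.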
We can check the first n rows of the data with the df.head() function:
In [143]: mtcars.head(6) # Check the first 6 rows
Out[143]: Name mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Similarly, we can check the last few rows with df.tail():
In [144]: mtcars.tail(6) # Check the last 6 rows
Out[144]: Name mpg cyl disp hp drat wt qsec vs am gear carb
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
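The following sketch illustrates how head() and tail() behave with different arguments. It uses a hypothetical ten-row DataFrame named `df` rather than mtcars, so the values are made up for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows, x = 0..9
df = pd.DataFrame({"x": range(10)})

first = df.head()              # with no argument, head() returns the first 5 rows
last = df.tail(3)              # the last 3 rows
all_but_last_two = df.head(-2) # a negative n drops that many rows from the end

print(first)
print(last)
print(all_but_last_two)
```

Both functions return a new DataFrame and leave the original unchanged.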
With large data sets, head() and tail() are useful for getting a sense of what the data looks like
without printing hundreds or thousands of rows to the screen.
Since each row describes a different car, let's set the row indexes equal to the car names. You can
access and assign new row indexes with df.index:
In [146]: print(mtcars.index, "\n") # Print original indexes

mtcars.index = mtcars["Name"] # Set index to car name
del mtcars["Name"] # Delete name column
print(mtcars.index) # Print new indexes
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
       '25', '26', '27', '28', '29', '30', '31'],
      dtype='object')

Index(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
       'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D', 'Merc 230',
       'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL', 'Merc 450SLC',
       'Cadillac Fleetwood', 'Lincoln Continental', 'Chrysler Imperial',
       'Fiat 128', 'Honda Civic', 'Toyota Corolla', 'Toyota Corona',
       'Dodge Challenger', 'AMC Javelin', 'Camaro Z28', 'Pontiac Firebird',
       'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa', 'Ford Pantera L',
       'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
      dtype='object', name='Name')
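An alternative to assigning df.index and deleting the column by hand is df.set_index(), which does both in one step. The sketch below uses a small hypothetical sample named `cars`, not the real mtcars file.

```python
import pandas as pd

# Hypothetical three-row sample with a few mtcars-style columns
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

# set_index moves the "Name" column into the row index and returns a new DataFrame
indexed = cars.set_index("Name")
print(indexed.index)

# Rows can now be looked up by car name
print(indexed.loc["Datsun 710", "mpg"])

# reset_index turns the index back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))
```

Because set_index() returns a new DataFrame by default, the original `cars` keeps its "Name" column.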
You can access the column labels with df.columns:
In [147]: mtcars.columns
Out[147]: Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear',
       'carb'],
      dtype='object')
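Column labels can also be changed with df.rename(). The sketch below assumes a CSV whose first header cell is blank, which pandas labels 'Unnamed: 0' when reading; the CSV text here is a made-up two-row sample held in memory, not the real mtcars file.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first header cell is empty
csv_text = ",mpg,cyl\nMazda RX4,21.0,6\nDatsun 710,22.8,4\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))      # the blank header is read as 'Unnamed: 0'

# rename returns a new DataFrame with the column relabelled
renamed = raw.rename(columns={"Unnamed: 0": "Name"})
print(list(renamed.columns))
```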
STATISTICAL SUMMARY
Use df.describe() to get a quick statistical summary of your data set. For numeric columns, the
summary includes the count, mean, standard deviation, min, max and the 25th, 50th (median) and
75th percentiles:
In [148]: mtcars.iloc[:,:6].describe() # Summarize the first 6 columns
Out[148]:
mpg cyl disp hp drat wt
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000
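The percentiles reported by describe() can be changed with its percentiles argument. The sketch below uses a small hypothetical sample named `cars`; it also shows that non-numeric columns are left out of the summary by default.

```python
import pandas as pd

# Hypothetical four-row sample with one text column and two numeric columns
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Datsun 710", "Valiant", "Duster 360"],
    "mpg": [21.0, 22.8, 18.1, 14.3],
    "cyl": [6, 4, 6, 8],
})

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles
summary = cars.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Individual statistics can be read from the result by row and column label
print(summary.loc["50%", "mpg"])
```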
In [149]: import numpy as np

np.mean(mtcars, axis=0) # Get the mean of each column
Out[149]: mpg 20.090625

cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64
In [150]: np.sum(mtcars, axis=0) # Get the sum of each column
Out[150]: mpg 642.900

cyl 198.000
disp 7383.100
hp 4694.000
drat 115.090
wt 102.952
qsec 571.160
vs 14.000
am 13.000
gear 118.000
carb 90.000
dtype: float64
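The same column statistics are also available as DataFrame methods, which avoids the need for numpy. The sketch below uses a small hypothetical sample named `cars`; the numeric_only=True argument is used here so that the text column is skipped.

```python
import pandas as pd

# Hypothetical three-row sample with a few mtcars-style columns
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

means = cars.mean(numeric_only=True)   # mean of each numeric column
sums = cars.sum(numeric_only=True)     # sum of each numeric column
print(means)
print(sums)

# axis=1 computes the statistic across columns, giving one value per row
row_means = cars[["mpg", "cyl"]].mean(axis=1)
print(row_means)
```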
