
PANDAS LIBRARY

To store data from an external source like an Excel workbook or database, we need a data
structure that can hold different data types. It is also desirable to be able to refer to rows and
columns in the data by custom labels rather than numbered indexes.

The pandas library offers data structures designed with this in mind: the Series and the
DataFrame. Series are 1-dimensional labeled arrays similar to numpy's ndarrays, while
DataFrames are labeled 2-dimensional structures that essentially function as spreadsheet
tables.

The name pandas is derived from "panel data", an econometrics term for
multidimensional data. Using pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data:

load,

prepare,

manipulate,

model, and

analyze.

DATA STRUCTURES IN PANDAS
Series -------- 1 DIMENSION -------- 1D labeled homogeneous array, size immutable.

DataFrame ----- 2 DIMENSION -------- General 2D labeled, size-mutable tabular structure with
potentially heterogeneously typed columns

Panel --------- 3 DIMENSION -------- General 3D labeled, size-mutable array

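All pandas data structures are value-mutable: their contents can be changed. DataFrame and Panel
are also size-mutable, while a Series is described here as size-immutable, although assigning to a
new label does enlarge a Series in place. Note that Panel was deprecated in pandas 0.20 and
removed in pandas 0.25, so current versions offer only Series and DataFrame.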
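
To make the two terms concrete, here is a minimal sketch of a value mutation and a size mutation. The variable names and numbers are illustrative only and do not come from the examples in this notebook.

```python
import pandas as pd

s = pd.Series([2, 3, 5], index=["a", "b", "c"])
s["a"] = 20  # value mutation: overwrite an existing element in place

df = pd.DataFrame({"x": [1, 2, 3]})
df["y"] = df["x"] * 10  # size mutation: the frame grows by one column

print(s)
print(df.shape)
```

After these statements the series still has three elements, while the DataFrame has gone from one column to two.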

Pandas Series
Series are very similar to ndarrays: the main difference between them is that with series, you can
provide custom index labels and then operations you perform on series automatically align the
data based on the labels.

To create a new series, first load the numpy and pandas libraries (pandas is preinstalled with the
Anaconda Python distribution).

In [2]: import numpy as np
import pandas as pd

*Note: It is common practice to import pandas with the shorthand "pd".

In [3]: s = pd.Series()
print (s)

Series([], dtype: float64)

In [4]: data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

0 a
1 b
2 c
3 d
dtype: object

In [5]: data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

a 0.0
b 1.0
c 2.0
dtype: float64

In [6]: s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

0 5
1 5
2 5
3 5
dtype: int64

Define a new series by passing a collection of homogeneous data like ndarray or list, along with
a list of associated indexes to pd.Series():

In [7]: my_series = pd.Series(data=[2, 3, 5, 4],          # Data
                      index=['a', 'b', 'c', 'd'])  # Indexes
my_series

Out[7]: a 2
b 3
c 5
d 4
dtype: int64

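You can also create a series from a dictionary, in which case the dictionary keys act as the labels
and the values act as the data. The output shown here is sorted by label, as older pandas versions
did; pandas 0.23 and later (on Python 3.6+) keep the dictionary's insertion order instead: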

In [8]: my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}
my_series2 = pd.Series(my_dict)
my_series2

Out[8]: a 5
b 4
c 8
x 2
dtype: int64

Similar to a dictionary, you can ACCESS ITEMS in a series by the labels:

In [9]: my_series = pd.Series(data=[2, 3, 5, 4],          # Data
                      index=['a', 'b', 'c', 'd'])  # Indexes
my_series
my_series["a"]

Out[9]: 2

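Positional (numeric) indexing also works. Recent pandas versions deprecate integer positions
inside plain square brackets on a label-indexed series, so .iloc is the safer spelling:

In [10]: my_series.iloc[0]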

Out[10]: 2

If you take a slice of a series, you get both the values and the labels contained in the slice:

In [11]: my_series[1:3]

Out[11]: b 3
c 5
dtype: int64
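
The label, position and slice lookups above can also be written with the explicit indexers .loc (labels) and .iloc (positions). The sketch below rebuilds my_series so that it runs on its own; the result variable names are illustrative.

```python
import pandas as pd

my_series = pd.Series(data=[2, 3, 5, 4], index=["a", "b", "c", "d"])

by_label = my_series.loc["a"]          # label-based lookup
by_position = my_series.iloc[0]        # position-based lookup (first element)
label_slice = my_series.loc["b":"c"]   # label slices include the end label
pos_slice = my_series.iloc[1:3]        # positional slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Both slices return the elements labeled "b" and "c"; the difference is only in how the end point is interpreted.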

OPERATIONS performed on two series align by label:

In [12]: my_series + my_series

Out[12]: a 4
b 6
c 10
d 8
dtype: int64

If you perform an operation with two series that have different labels, the unmatched labels will
return a value of NaN (not a number).

In [13]: my_series + my_series2 # Missing values convert the int dtype to float

Out[13]: a 7.0
b 7.0
c 13.0
d NaN
x NaN
dtype: float64
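
If NaN is not the desired result for unmatched labels, one option is the Series.add method with its fill_value parameter, which substitutes a default for a label that is missing from one operand. This is a sketch of an alternative, not something the notebook itself uses; treating a missing value as 0 is an assumption that only makes sense for some data.

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
my_series2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

# Treat a label missing from one operand as 0 instead of producing NaN
total = my_series.add(my_series2, fill_value=0)
print(total)
```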

DataFrame Creation and Indexing


A DataFrame is a 2D table with labeled columns that can each hold different types of data.
DataFrames are essentially a Python implementation of the types of tables you'd see in an Excel
workbook or SQL database. DataFrames are the de facto standard data structure for working with
tabular data in Python.

You can create a DataFrame out of a variety of data sources like dictionaries, 2D numpy arrays and
series using the pd.DataFrame() function. Dictionaries provide an intuitive way to create
DataFrames: when passed to pd.DataFrame(), a dictionary's keys become column labels and the
values become the columns themselves:

In [15]: df = pd.DataFrame()
print (df)

Empty DataFrame
Columns: []
Index: []

In [16]: data =[['Alex',10],['Bob',12],['Clarke',13]]

In [17]: df=pd.DataFrame(data,columns=['Name','Age'])
print (df)

Name Age
0 Alex 10
1 Bob 12
2 Clarke 13

In [18]: d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4

Adding a Column to a DataFrame


In [19]: df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df )

one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN

Deleting a Column from a dataframe


In [20]: del df['one']
print (df )

two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN
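
Besides del, pandas offers two other ways to remove a column: DataFrame.drop, which returns a new frame, and DataFrame.pop, which removes the column in place and hands it back. The frame below is a small made-up example rather than the df used above.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

trimmed = df.drop(columns=["one"])  # returns a new frame; df itself is unchanged
popped = df.pop("two")              # removes "two" from df in place and returns it as a Series

print(list(trimmed.columns))
print(list(df.columns))
print(popped.tolist())
```

drop is the usual choice when the original frame should be kept intact; pop is convenient when the removed column is needed afterwards.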

In [21]: # Create a dictionary with some different data types as values
my_dict = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" : pd.Series([4.5, 5, 6.1],
                                index=["Joe","Bob","Frans"]),
           "siblings" : 1,
           "gender" : "M"}
df = pd.DataFrame(my_dict) # Convert the dict to DataFrame
df # Show the DataFrame

Out[21]:
age gender height name siblings weight

Joe 10 M 4.5 Joe 1 75

Bob 15 M 5.0 Bob 1 123

Frans 20 M 6.1 Frans 1 239

Notice that values in the dictionary you use to make a DataFrame can be a variety of sequence
objects, including lists, ndarrays, tuples and series. If you pass in a scalar such as a single
number or string, that value is duplicated for every row in the DataFrame (in this case gender is
set to "M" for all records and siblings is set to 1).

Also note that in the DataFrame above, the rows were automatically given indexes that align with
the indexes of the series we passed in for the "height" column. If we did not use a series with
index labels to create our DataFrame, it would be given numeric row index labels by default:

In [22]: my_dict2 = {"name" : ["Joe","Bob","Frans"],
            "age" : np.array([10,15,20]),
            "weight" : (75,123,239),
            "height" :[4.5, 5, 6.1],
            "siblings" : 1,
            "gender" : "M"}
df2 = pd.DataFrame(my_dict2) # Convert the dict to DataFrame
df2 # Show the DataFrame

Out[22]:
age gender height name siblings weight

0 10 M 4.5 Joe 1 75

1 15 M 5.0 Bob 1 123

2 20 M 6.1 Frans 1 239

You can provide custom row labels when creating a DataFrame by adding the index argument:

In [23]: df2 = pd.DataFrame(my_dict2,
                   index = my_dict["name"] )
df2

Out[23]:
age gender height name siblings weight

Joe 10 M 4.5 Joe 1 75

Bob 15 M 5.0 Bob 1 123

Frans 20 M 6.1 Frans 1 239

A DataFrame behaves like a dictionary of Series objects that each have the same length and
indexes. This means we can get, add and delete columns in a DataFrame the same way we
would when dealing with a dictionary:

In [24]: # Get a column by name
df2["weight"]

Out[24]: Joe 75
Bob 123
Frans 239
Name: weight, dtype: int64

Alternatively, you can get a column by label using "dot" notation:

In [25]: df2.weight

Out[25]: Joe 75
Bob 123
Frans 239
Name: weight, dtype: int64

In [26]: # Delete a column
del df2['name']

In [27]: # Add a new column
df2["IQ"] = [130, 105, 115]
df2

Out[27]:
age gender height siblings weight IQ

Joe 10 M 4.5 1 75 130

Bob 15 M 5.0 1 123 105

Frans 20 M 6.1 1 239 115

Inserting a single value into a DataFrame column broadcasts it to all the rows:

In [28]: df2["Married"] = False
df2

Out[28]:
age gender height siblings weight IQ Married

Joe 10 M 4.5 1 75 130 False

Bob 15 M 5.0 1 123 105 False

Frans 20 M 6.1 1 239 115 False

When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be
filled with NaN:

In [29]: df2["College"] = pd.Series(["Harvard"], index=["Frans"])
df2

Out[29]:
age gender height siblings weight IQ Married College

Joe 10 M 4.5 1 75 130 False NaN

Bob 15 M 5.0 1 123 105 False NaN

Frans 20 M 6.1 1 239 115 False Harvard

You can select rows, columns, or both by label with df.loc[row, column]:

In [30]: df2.loc["Joe"] # Select row "Joe"

Out[30]: age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False
College NaN
Name: Joe, dtype: object

In [31]: df2.loc["Joe","IQ"] # Select row "Joe" and column "IQ"

Out[31]: 130

In [32]: df2.loc["Joe":"Bob" , "IQ":"College"] # Slice by label

Out[32]:
IQ Married College

Joe 130 False NaN

Bob 105 False NaN

Select rows or columns by numeric index with df.iloc[row, column]:

In [33]: df2.iloc[0] # Get row 0

Out[33]: age 10
gender M
height 4.5
siblings 1
weight 75
IQ 130
Married False

College NaN
Name: Joe, dtype: object

In [34]: df2.iloc[0, 5] # Get row 0, column 5

Out[34]: 130

In [35]: df2.iloc[0:2, 5:8] # Slice by numeric row and column index

Out[35]:
IQ Married College

Joe 130 False NaN

Bob 105 False NaN
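
The label slice in In [32] and the positional slice in In [35] return the same rows even though one ends at "Bob" and the other at position 2. The reason is that .loc slices include the end label while .iloc slices exclude the end position. A minimal sketch on a reduced copy of the table (only two of its columns are reproduced here):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [10, 15, 20], "IQ": [130, 105, 115]},
    index=["Joe", "Bob", "Frans"],
)

by_label = df.loc["Joe":"Bob"]   # label slice: both endpoints included
by_position = df.iloc[0:2]       # positional slice: end position excluded

print(by_label)
print(by_position)
```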

In [36]: df=pd.read_excel("SAMPLEPANDAS.xlsx")

In [39]: df

Out[39]:
Month Values

0 2018-01-01 201

1 2018-02-01 107

2 2018-03-01 483

3 2018-04-01 240

4 2018-05-01 356

5 2018-06-01 369

6 2018-07-01 266

7 2018-08-01 308

8 2018-09-01 453

9 2018-10-01 395

10 2018-11-01 487

11 2018-12-01 403

12 2019-01-01 478

13 2019-02-01 112

14 2019-03-01 262

15 2019-04-01 283

16 2019-05-01 444

17 2019-06-01 233

18 2019-07-01 324

19 2019-08-01 305

20 2019-09-01 299

21 2019-10-01 294

22 2019-11-01 468

23 2019-12-01 321
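
SAMPLEPANDAS.xlsx is not available outside the original environment, so the following sketch swaps in a different loader: it reads a small in-memory CSV with pd.read_csv instead of pd.read_excel. The three rows are copied from the top of the output above; the csv_text and sample names are made up for illustration.

```python
import io
import pandas as pd

# First three rows of the table above, written out as CSV text
csv_text = "Month,Values\n2018-01-01,201\n2018-02-01,107\n2018-03-01,483\n"

# parse_dates converts the Month column from strings to datetime values
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Month"])
print(sample)
print(sample.dtypes)
```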

HEAD & TAIL


In [40]: df.head()

Out[40]:
Month Values

0 2018-01-01 201

1 2018-02-01 107

2 2018-03-01 483

3 2018-04-01 240

4 2018-05-01 356

In [41]: df.tail()

Out[41]:
Month Values

19 2019-08-01 305

20 2019-09-01 299

21 2019-10-01 294

22 2019-11-01 468

23 2019-12-01 321
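
Both head and tail return five rows by default and accept an optional row count. The sketch below uses a stand-in table with 24 monthly rows; its Values column is just a counter, not the data loaded from the Excel file.

```python
import pandas as pd

# Stand-in for the monthly table above; the values are placeholders
df = pd.DataFrame({
    "Month": pd.date_range("2018-01-01", periods=24, freq="MS"),
    "Values": range(24),
})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```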

TREATMENT OF NAN VALUES


In [42]: # Creating a DataFrame without Excel: it can be built directly in Python, e.g.
# pd.DataFrame({'A': [elem1, elem2, ...], 'B': [elem1, elem2, ...], 'C': [elem1, elem2, ...]})
# All columns must have the same number of values; mark any blank value explicitly
# as np.nan (this requires importing numpy).
import pandas as pd
import numpy as np

df=pd.DataFrame({'A':[1,2,np.nan,4,7],'B':[np.nan,3.5,4,5,8], 'C':[3,np.nan,2.5,6,9]})

In [43]: df

Out[43]:
A B C

0 1.0 NaN 3.0

1 2.0 3.5 NaN

2 NaN 4.0 2.5

3 4.0 5.0 6.0

4 7.0 8.0 9.0

In [44]: # To remove every row that contains a NaN value, use df.dropna(), where df is the DataFrame
df.dropna()

Out[44]:
A B C

3 4.0 5.0 6.0

4 7.0 8.0 9.0

In [45]: #we can also assign df.dropna() to a new dataframe, say df2
df2=df.dropna()
df2

Out[45]:
A B C

3 4.0 5.0 6.0

4 7.0 8.0 9.0
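
By default dropna removes every row that has at least one NaN, which here discards three of the five rows. Its subset, thresh and axis parameters give finer control. A sketch on the same data (the result variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4, 7],
                   "B": [np.nan, 3.5, 4, 5, 8],
                   "C": [3, np.nan, 2.5, 6, 9]})

rows_with_A = df.dropna(subset=["A"])  # drop rows only where column A is missing
at_least_two = df.dropna(thresh=2)     # keep rows with at least 2 non-missing values
no_nan_cols = df.dropna(axis=1)        # drop columns that contain any NaN

print(rows_with_A)
print(at_least_two)
print(no_nan_cols.shape)
```

None of these calls modifies df; each returns a new DataFrame, just as in the cell above.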

FILLING UP NAN VALUES
In [46]: #we can fill up the NaN values using fillna function as below:

df.fillna(value=0)

Out[46]:
A B C

0 1.0 0.0 3.0

1 2.0 3.5 0.0

2 0.0 4.0 2.5

3 4.0 5.0 6.0

4 7.0 8.0 9.0

In [47]: df["NewValue"]=pd.Series(data=['Ankur','Sondeep','Poornima', 'Divya', 'Vivek'])
df

Out[47]:
A B C NewValue

0 1.0 NaN 3.0 Ankur

1 2.0 3.5 NaN Sondeep

2 NaN 4.0 2.5 Poornima

3 4.0 5.0 6.0 Divya

4 7.0 8.0 9.0 Vivek

In [48]: #for replacing the null values with mean for an entire column
df['A']=df['A'].fillna(value=df['A'].mean())
df

Out[48]:
A B C NewValue

0 1.0 NaN 3.0 Ankur

1 2.0 3.5 NaN Sondeep

2 3.5 4.0 2.5 Poornima

3 4.0 5.0 6.0 Divya

4 7.0 8.0 9.0 Vivek

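In [49]: # Replace the null values with the column mean for all the columns.
# numeric_only=True skips the string column; pandas 2.0 and later raise a TypeError without it.
df3=df.fillna(value=df.mean(numeric_only=True))
df3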

Out[49]:
A B C NewValue

0 1.0 5.125 3.000 Ankur

1 2.0 3.500 5.125 Sondeep

2 3.5 4.000 2.500 Poornima

3 4.0 5.000 6.000 Divya

4 7.0 8.000 9.000 Vivek

USING A FOR LOOP TO FILL NAN VALUES


In [50]: #for replacing the Nan Values with Mean of Columns for all the columns in a DataFrame
for x in df.columns:
    df[x]=df[x].fillna(value=df[x].mean())

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-d8f394443f45> in <module>()
      1 #for replacing the Nan Values with Mean of Columns for all the columns in a DataFrame
      2 for x in df.columns:
----> 3     df[x]=df[x].fillna(value=df[x].mean())

[pandas-internal frames omitted]

~\Anaconda3\lib\site-packages\pandas\core\nanops.py in _ensure_numeric(x)
    823             except Exception:
    824                 raise TypeError('Could not convert {value!s} to numeric'
--> 825                                 .format(value=x))
    826     return x
    827

TypeError: Could not convert AnkurSondeepPoornimaDivyaVivek to numeric

The loop fails when it reaches the string column NewValue, whose mean cannot be computed.
Columns A, B and C were processed before the error, so their NaN values have already been filled:

In [51]: df

Out[51]:
A B C NewValue

0 1.0 5.125 3.000 Ankur

1 2.0 3.500 5.125 Sondeep

2 3.5 4.000 2.500 Poornima

3 4.0 5.000 6.000 Divya

4 7.0 8.000 9.000 Vivek

In [52]: # Use a for loop to fill NaN values with the column mean, for float columns only
# (dtype.kind == "f"), which skips the string column
for x in df.columns:
    if df[x].dtype.kind=="f":
        df[x]=df[x].fillna(value=df[x].mean())

In [53]: df

Out[53]:
A B C NewValue

0 1.0 5.125 3.000 Ankur

1 2.0 3.500 5.125 Sondeep

2 3.5 4.000 2.500 Poornima

3 4.0 5.000 6.000 Divya

4 7.0 8.000 9.000 Vivek
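
The same result can be reached without an explicit loop by selecting the numeric columns first. This is an alternative sketch, not the notebook's method; it rebuilds the frame from In [42] and In [47] so that it runs on its own, and num_cols is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4, 7],
                   "B": [np.nan, 3.5, 4, 5, 8],
                   "C": [3, np.nan, 2.5, 6, 9],
                   "NewValue": ["Ankur", "Sondeep", "Poornima", "Divya", "Vivek"]})

# Pick the numeric columns, then fill each one's NaN with that column's mean
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

print(df)
```

Unlike the dtype.kind == "f" check, select_dtypes(include="number") also covers integer columns, although integer columns cannot hold NaN in the first place.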

In [54]: # To pull out the unique values within a column, use dataframename['columnname'].unique()
df['A'].unique()

Out[54]: array([1. , 2. , 3.5, 4. , 7. ])

In [55]: df.drop_duplicates()

Out[55]:
A B C NewValue

0 1.0 5.125 3.000 Ankur

1 2.0 3.500 5.125 Sondeep

2 3.5 4.000 2.500 Poornima

3 4.0 5.000 6.000 Divya

4 7.0 8.000 9.000 Vivek
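
The frame above has no repeated rows, so drop_duplicates returns it unchanged. The sketch below uses hypothetical data that does contain duplicates, and also shows the subset and keep parameters.

```python
import pandas as pd

# Hypothetical data in which row 2 repeats row 0
people = pd.DataFrame({"name": ["Asha", "Ravi", "Asha", "Ravi"],
                       "city": ["Pune", "Delhi", "Pune", "Agra"]})

unique_rows = people.drop_duplicates()                 # compare whole rows
unique_names = people.drop_duplicates(subset="name")   # compare on "name" only, keep first match
last_names = people.drop_duplicates(subset="name", keep="last")  # keep last match instead

print(unique_rows)
print(unique_names)
print(last_names)
```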

In [56]: # Filter rows with a condition, e.g. column A values > 1 and column B values < 5.
# Use '&' for an AND condition and '|' for an OR condition; wrap each comparison in parentheses.
newdf=df[(df['A']>1) & (df['B']<5)]
newdf

Out[56]:
A B C NewValue

1 2.0 3.5 5.125 Sondeep

2 3.5 4.0 2.500 Poornima

In [57]: # Include rows where C is null, combined with other conditions using OR
newdf1=df[(df['A']>3) | (df['B']==8) | (df['C'].isnull())]
newdf1

Out[57]:
A B C NewValue

2 3.5 4.0 2.5 Poornima

3 4.0 5.0 6.0 Divya

4 7.0 8.0 9.0 Vivek
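
The filters above combine boolean masks with & and |. A mask can also be negated with ~, and Series.isin tests membership in a list of values. The sketch below uses the A and B columns as they stand after the NaN values were filled; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.5, 4.0, 7.0],
                   "B": [5.125, 3.5, 4.0, 5.0, 8.0]})

both = df[(df["A"] > 1) & (df["B"] < 5)]      # AND
either = df[(df["A"] > 3) | (df["B"] == 8)]   # OR
negated = df[~(df["A"] > 3)]                  # NOT
in_set = df[df["A"].isin([1.0, 7.0])]         # membership test

print(both)
print(either)
print(negated)
print(in_set)
```

The parentheses around each comparison are required because & and | bind more tightly than > and < in Python.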

SORTING
In [58]: df.sort_values('B')     # positional form
df.sort_values(by='B')  # equivalent keyword form; only this last result is displayed

Out[58]:
A B C NewValue

1 2.0 3.500 5.125 Sondeep

2 3.5 4.000 2.500 Poornima

3 4.0 5.000 6.000 Divya

0 1.0 5.125 3.000 Ankur

4 7.0 8.000 9.000 Vivek

In [59]: # Sort on a column in descending order
df.sort_values('B', ascending=False)

Out[59]:
A B C NewValue

4 7.0 8.000 9.000 Vivek

0 1.0 5.125 3.000 Ankur

3 4.0 5.000 6.000 Divya

2 3.5 4.000 2.500 Poornima

1 2.0 3.500 5.125 Sondeep
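
sort_values also accepts a list of columns, with one ascending flag per column. The sketch below sorts a small hypothetical table by team and then by points within each team.

```python
import pandas as pd

# Hypothetical data for illustration
scores = pd.DataFrame({"team": ["x", "y", "x", "y"],
                       "points": [3, 9, 7, 1]})

# Sort by team ascending, then by points descending within each team
ordered = scores.sort_values(by=["team", "points"], ascending=[True, False])
print(ordered)
```

As in the cells above, sort_values returns a new DataFrame and leaves the original order of scores untouched.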

DESCRIBE
In [60]: d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
                     'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
x=pd.DataFrame(d)

In [61]: x.describe()

Out[61]:
Age Rating

count 12.000000 12.000000

mean 31.833333 3.743333

std 9.232682 0.661628

min 23.000000 2.560000

25% 25.000000 3.230000

50% 29.500000 3.790000

75% 35.500000 4.132500

max 51.000000 4.800000

In [62]: x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
Age 12 non-null int64
Name 12 non-null object
Rating 12 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes

In [63]: x.columns

Out[63]: Index(['Age', 'Name', 'Rating'], dtype='object')
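
By default describe summarizes only the numeric columns, which is why Name is absent from Out[61]. Passing include="all" adds the text columns, reported with count, unique, top and freq. A sketch on a shortened, made-up version of the table:

```python
import pandas as pd

# Shortened, made-up version of the table above
x = pd.DataFrame({"Name": ["Tom", "James", "Ricky", "Tom"],
                  "Age": [25, 26, 25, 23]})

full = x.describe(include="all")  # numeric and non-numeric columns together
print(full)
```

Statistics that do not apply to a column, such as the mean of Name, are shown as NaN.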

Basic Data Visualizations


In [68]: import pandas as pd
import numpy as np

df=pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()

Out[68]: <matplotlib.axes._subplots.AxesSubplot at 0x192621c15f8>

In [69]: import pandas as pd
import numpy as np
df=pd.DataFrame({'a':np.random.randn(1000)+1,'b':np.random.randn(1000),
                 'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df.plot.hist(bins=10)

Out[69]: <matplotlib.axes._subplots.AxesSubplot at 0x19262248ef0>

In [70]: import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D',
'E'])
df.plot.box()

Out[70]: <matplotlib.axes._subplots.AxesSubplot at 0x19262211a90>

JOINING DATAFRAMES
In [71]: df=pd.DataFrame({'A':[1,2,3,4,5], 'B':[6,7,8,9,10], 'C':[11,12,13,14,15]})
df1=pd.DataFrame({'A':[100,101,102,103,104], 'B':[105,106,107,108,109], 'C':[110,111,112,113,114]})

In [72]: # Concatenate: pd.concat stacks the frames row-wise. Here both frames have the
# same columns and the same number of rows, although neither is required.

pd.concat([df, df1])

Out[72]:
A B C

0 1 6 11

1 2 7 12

2 3 8 13

3 4 9 14

4 5 10 15

0 100 105 110

1 101 106 111

2 102 107 112

3 103 108 113

4 104 109 114

In [73]: df2=pd.DataFrame({'A':[10,20,30,40], 'B':[25,35,45,55]})

In [74]: pdconcat=pd.concat([df,df1,df2])
pdconcat

Out[74]: A B C

0 1 6 11.0

1 2 7 12.0

2 3 8 13.0

3 4 9 14.0

4 5 10 15.0

0 100 105 110.0

1 101 106 111.0

2 102 107 112.0

3 103 108 113.0

4 104 109 114.0

0 10 25 NaN

1 20 35 NaN

2 30 45 NaN

3 40 55 NaN
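
In the outputs above each frame keeps its own row labels, so labels such as 0 appear more than once in the result. Passing ignore_index=True to pd.concat renumbers the rows instead. A sketch with two-row versions of df and df1:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [6, 7]})
df1 = pd.DataFrame({"A": [100, 101], "B": [105, 106]})

stacked = pd.concat([df, df1])                         # keeps each frame's labels: 0, 1, 0, 1
renumbered = pd.concat([df, df1], ignore_index=True)   # fresh labels: 0, 1, 2, 3

print(stacked)
print(renumbered)
```

With repeated labels, stacked.loc[0] returns two rows rather than one, which is usually a reason to renumber.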

In [75]: print(pdconcat['A'].dtype.kind)
print(pdconcat['B'].dtype.kind)
print(pdconcat['C'].dtype.kind)

i
i
f

In [76]: # To concatenate the data horizontally, use axis=1; the default is axis=0
pdconcat1=pd.concat([df,df1,df2], axis=1)
pdconcat1

Out[76]: A B C A B C A B

0 1 6 11 100 105 110 10.0 25.0

1 2 7 12 101 106 111 20.0 35.0

2 3 8 13 102 107 112 30.0 45.0

3 4 9 14 103 108 113 40.0 55.0

4 5 10 15 104 109 114 NaN NaN

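In [77]: # Merging two DataFrames
# Note: there is no comma between "Divya" and "Saroj", so Python joins the two string
# literals into "DivyaSaroj"; the list therefore holds 10 names, matching the 10 roll numbers.
T1=pd.DataFrame({'RollNo.':[1,2,3,4,5,6,7,8,9,10],
                 'StudentName':["Akash", "Vivek", "Sondeep", "Pranav", "Purnima", "Divya" "Saroj",
                                "Nimisha", "ajay", "Manju", "Mr. X"]})
T1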

Out[77]:
RollNo. StudentName

0 1 Akash

1 2 Vivek

2 3 Sondeep

3 4 Pranav

4 5 Purnima

5 6 DivyaSaroj

6 7 Nimisha

7 8 ajay

8 9 Manju

9 10 Mr. X

In [78]: T2=pd.DataFrame({'RollNo.':[1,2,3,4,5,6,7,8,9,10],
                 'City':["Shimla", "Gurgaon", "Delhi", "Amritsar", "Jalandhar", "Gandhinagar",
                         "Almora", "Ooty", "Mumbai", "Chennai"]})
T2

Out[78]:
City RollNo.

0 Shimla 1

1 Gurgaon 2

2 Delhi 3

3 Amritsar 4

4 Jalandhar 5

5 Gandhinagar 6

6 Almora 7

7 Ooty 8

8 Mumbai 9

9 Chennai 10

In [79]: # There are two ways of joining tables: (1) merging; (2) using joins.
# In the two tables above (T1 and T2), RollNo. is the common key, and both tables
# contain the same number of rows.

# Merging
pd.merge(T1,T2, how='inner', on='RollNo.')

Out[79]:
RollNo. StudentName City

0 1 Akash Shimla

1 2 Vivek Gurgaon

2 3 Sondeep Delhi

3 4 Pranav Amritsar

4 5 Purnima Jalandhar

5 6 DivyaSaroj Gandhinagar

6 7 Nimisha Almora

7 8 ajay Ooty

8 9 Manju Mumbai

9 10 Mr. X Chennai

In [80]: T3=pd.DataFrame({'RollNo.':[1,3,5,7,9,11,13,15,17,19],
                 'PinCode':["171001", "122002", "110005", "183005", "181001",
                            "168754", "987654", "547645", "654789", "123456"]})
T3

Out[80]:
PinCode RollNo.

0 171001 1

1 122002 3

2 110005 5

3 183005 7

4 181001 9

5 168754 11

6 987654 13

7 547645 15

8 654789 17

9 123456 19

In [81]: # Inner join using merge
pd.merge(T1,T3, how='inner', on='RollNo.')

Out[81]:
RollNo. StudentName PinCode

0 1 Akash 171001

1 3 Sondeep 122002

2 5 Purnima 110005

3 7 Nimisha 183005

4 9 Manju 181001

In [82]: # Left join using merge
pd.merge(T1,T3, how='left', on='RollNo.')

Out[82]:
RollNo. StudentName PinCode

0 1 Akash 171001

1 2 Vivek NaN

2 3 Sondeep 122002

3 4 Pranav NaN

4 5 Purnima 110005

5 6 DivyaSaroj NaN

6 7 Nimisha 183005

7 8 ajay NaN

8 9 Manju 181001

9 10 Mr. X NaN

In [83]: # Right join using merge
pd.merge(T1,T3, how='right', on='RollNo.')

Out[83]:
RollNo. StudentName PinCode

0 1 Akash 171001

1 3 Sondeep 122002

2 5 Purnima 110005

3 7 Nimisha 183005

4 9 Manju 181001

5 11 NaN 168754

6 13 NaN 987654

7 15 NaN 547645

8 17 NaN 654789

9 19 NaN 123456
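
The inner, left and right joins above each discard keys from at least one side. An outer join keeps every key from both tables, and the indicator=True option of pd.merge adds a _merge column recording where each row came from. A sketch on three-row excerpts of T1 and T3:

```python
import pandas as pd

# Three-row excerpts of the tables above
T1 = pd.DataFrame({"RollNo.": [1, 2, 3], "StudentName": ["Akash", "Vivek", "Sondeep"]})
T3 = pd.DataFrame({"RollNo.": [1, 3, 5], "PinCode": ["171001", "122002", "110005"]})

# how="outer" keeps keys from both tables; indicator=True adds a "_merge" column
merged = pd.merge(T1, T3, how="outer", on="RollNo.", indicator=True)
print(merged)
```

The _merge column takes the values "both", "left_only" and "right_only", which makes unmatched keys easy to spot.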

In [84]: # The same operations can be performed with the join method.
# join matches rows on the index, so first use set_index to make the common
# column the index of both tables.
T2.set_index('RollNo.').join(T3.set_index('RollNo.'), how='inner')

Out[84]:

City PinCode

RollNo.

1 Shimla 171001

3 Delhi 122002

5 Jalandhar 110005

7 Almora 183005

9 Mumbai 181001

LOADING BUILT-IN DATASETS FROM SCIKIT-LEARN


In [86]: from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names
print(column_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

In [87]: df=pd.DataFrame(data, columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal width'])

In [88]: df

Out[88]:
Sepal Length Sepal Width Petal Length Petal width

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

5 5.4 3.9 1.7 0.4

6 4.6 3.4 1.4 0.3

7 5.0 3.4 1.5 0.2

8 4.4 2.9 1.4 0.2

9 4.9 3.1 1.5 0.1

10 5.4 3.7 1.5 0.2

11 4.8 3.4 1.6 0.2

12 4.8 3.0 1.4 0.1

13 4.3 3.0 1.1 0.1

14 5.8 4.0 1.2 0.2

15 5.7 4.4 1.5 0.4

16 5.4 3.9 1.3 0.4

17 5.1 3.5 1.4 0.3

18 5.7 3.8 1.7 0.3

19 5.1 3.8 1.5 0.3

20 5.4 3.4 1.7 0.2

21 5.1 3.7 1.5 0.4

22 4.6 3.6 1.0 0.2

23 5.1 3.3 1.7 0.5

24 4.8 3.4 1.9 0.2

25 5.0 3.0 1.6 0.2

26 5.0 3.4 1.6 0.4

27 5.2 3.5 1.5 0.2

28 5.2 3.4 1.4 0.2

29 4.7 3.2 1.6 0.2

... ... ... ... ...

120 6.9 3.2 5.7 2.3

121 5.6 2.8 4.9 2.0

122 7.7 2.8 6.7 2.0

123 6.3 2.7 4.9 1.8

124 6.7 3.3 5.7 2.1

125 7.2 3.2 6.0 1.8

126 6.2 2.8 4.8 1.8

127 6.1 3.0 4.9 1.8

128 6.4 2.8 5.6 2.1

129 7.2 3.0 5.8 1.6

130 7.4 2.8 6.1 1.9

131 7.9 3.8 6.4 2.0

132 6.4 2.8 5.6 2.2

133 6.3 2.8 5.1 1.5

134 6.1 2.6 5.6 1.4

135 7.7 3.0 6.1 2.3

136 6.3 3.4 5.6 2.4

137 6.4 3.1 5.5 1.8

138 6.0 3.0 4.8 1.8

139 6.9 3.1 5.4 2.1

140 6.7 3.1 5.6 2.4

141 6.9 3.1 5.1 2.3

142 5.8 2.7 5.1 1.9

143 6.8 3.2 5.9 2.3

144 6.7 3.3 5.7 2.5

145 6.7 3.0 5.2 2.3

146 6.3 2.5 5.0 1.9

147 6.5 3.0 5.2 2.0

148 6.2 3.4 5.4 2.3

149 5.9 3.0 5.1 1.8

150 rows × 4 columns

In [89]: #Alternatively
import pandas as pd
a = pd.DataFrame(iris.data, columns=iris.feature_names)

In [91]: a

Out[91]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

5 5.4 3.9 1.7 0.4

6 4.6 3.4 1.4 0.3

7 5.0 3.4 1.5 0.2

8 4.4 2.9 1.4 0.2

9 4.9 3.1 1.5 0.1

10 5.4 3.7 1.5 0.2

11 4.8 3.4 1.6 0.2

12 4.8 3.0 1.4 0.1

13 4.3 3.0 1.1 0.1

14 5.8 4.0 1.2 0.2

15 5.7 4.4 1.5 0.4

16 5.4 3.9 1.3 0.4

17 5.1 3.5 1.4 0.3

18 5.7 3.8 1.7 0.3

19 5.1 3.8 1.5 0.3

20 5.4 3.4 1.7 0.2

21 5.1 3.7 1.5 0.4

22 4.6 3.6 1.0 0.2

23 5.1 3.3 1.7 0.5

24 4.8 3.4 1.9 0.2

25 5.0 3.0 1.6 0.2

26 5.0 3.4 1.6 0.4

27 5.2 3.5 1.5 0.2

28 5.2 3.4 1.4 0.2

29 4.7 3.2 1.6 0.2

... ... ... ... ...

120 6.9 3.2 5.7 2.3

121 5.6 2.8 4.9 2.0

122 7.7 2.8 6.7 2.0

123 6.3 2.7 4.9 1.8

124 6.7 3.3 5.7 2.1

125 7.2 3.2 6.0 1.8

126 6.2 2.8 4.8 1.8

127 6.1 3.0 4.9 1.8

128 6.4 2.8 5.6 2.1

129 7.2 3.0 5.8 1.6

130 7.4 2.8 6.1 1.9

131 7.9 3.8 6.4 2.0

132 6.4 2.8 5.6 2.2

133 6.3 2.8 5.1 1.5

134 6.1 2.6 5.6 1.4

135 7.7 3.0 6.1 2.3

136 6.3 3.4 5.6 2.4

137 6.4 3.1 5.5 1.8

138 6.0 3.0 4.8 1.8

139 6.9 3.1 5.4 2.1

140 6.7 3.1 5.6 2.4

141 6.9 3.1 5.1 2.3

142 5.8 2.7 5.1 1.9

143 6.8 3.2 5.9 2.3

144 6.7 3.3 5.7 2.5

145 6.7 3.0 5.2 2.3

146 6.3 2.5 5.0 1.9

147 6.5 3.0 5.2 2.0

148 6.2 3.4 5.4 2.3

149 5.9 3.0 5.1 1.8

150 rows × 4 columns

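In [92]: # Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older scikit-learn versions.
from sklearn.datasets import load_boston
boston=load_boston()
data=boston.data
column_names = boston.feature_names
a = pd.DataFrame(boston.data, columns=boston.feature_names)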

In [93]: a

Out[93]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO

0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396

1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396

2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392

3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394

4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396

5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394

6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395

7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396

8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386

9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386

10 0.22489 12.5 7.87 0.0 0.524 6.377 94.3 6.3467 5.0 311.0 15.2 392

11 0.11747 12.5 7.87 0.0 0.524 6.009 82.9 6.2267 5.0 311.0 15.2 396

12 0.09378 12.5 7.87 0.0 0.524 5.889 39.0 5.4509 5.0 311.0 15.2 390

13 0.62976 0.0 8.14 0.0 0.538 5.949 61.8 4.7075 4.0 307.0 21.0 396

14 0.63796 0.0 8.14 0.0 0.538 6.096 84.5 4.4619 4.0 307.0 21.0 380

15 0.62739 0.0 8.14 0.0 0.538 5.834 56.5 4.4986 4.0 307.0 21.0 395

16 1.05393 0.0 8.14 0.0 0.538 5.935 29.3 4.4986 4.0 307.0 21.0 386

17 0.78420 0.0 8.14 0.0 0.538 5.990 81.7 4.2579 4.0 307.0 21.0 386

18 0.80271 0.0 8.14 0.0 0.538 5.456 36.6 3.7965 4.0 307.0 21.0 288

19 0.72580 0.0 8.14 0.0 0.538 5.727 69.5 3.7965 4.0 307.0 21.0 390

20 1.25179 0.0 8.14 0.0 0.538 5.570 98.1 3.7979 4.0 307.0 21.0 376

21 0.85204 0.0 8.14 0.0 0.538 5.965 89.2 4.0123 4.0 307.0 21.0 392

22 1.23247 0.0 8.14 0.0 0.538 6.142 91.7 3.9769 4.0 307.0 21.0 396

23 0.98843 0.0 8.14 0.0 0.538 5.813 100.0 4.0952 4.0 307.0 21.0 394

24 0.75026 0.0 8.14 0.0 0.538 5.924 94.1 4.3996 4.0 307.0 21.0 394

25 0.84054 0.0 8.14 0.0 0.538 5.599 85.7 4.4546 4.0 307.0 21.0 303

26 0.67191 0.0 8.14 0.0 0.538 5.813 90.3 4.6820 4.0 307.0 21.0 376

27 0.95577 0.0 8.14 0.0 0.538 6.047 88.8 4.4534 4.0 307.0 21.0 306

28 0.77299 0.0 8.14 0.0 0.538 6.495 94.4 4.4547 4.0 307.0 21.0 387

29 1.00245 0.0 8.14 0.0 0.538 6.674 87.3 4.2390 4.0 307.0 21.0 380

... ... ... ... ... ... ... ... ... ... ... ... ...

476 4.87141 0.0 18.10 0.0 0.614 6.484 93.6 2.3053 24.0 666.0 20.2 396

477 15.02340 0.0 18.10 0.0 0.614 5.304 97.3 2.1007 24.0 666.0 20.2 349

478 10.23300 0.0 18.10 0.0 0.614 6.185 96.7 2.1705 24.0 666.0 20.2 379

479 14.33370 0.0 18.10 0.0 0.614 6.229 88.0 1.9512 24.0 666.0 20.2 383

480 5.82401 0.0 18.10 0.0 0.532 6.242 64.7 3.4242 24.0 666.0 20.2 396

481 5.70818 0.0 18.10 0.0 0.532 6.750 74.9 3.3317 24.0 666.0 20.2 393

482 5.73116 0.0 18.10 0.0 0.532 7.061 77.0 3.4106 24.0 666.0 20.2 395

483 2.81838 0.0 18.10 0.0 0.532 5.762 40.3 4.0983 24.0 666.0 20.2 392

484 2.37857 0.0 18.10 0.0 0.583 5.871 41.9 3.7240 24.0 666.0 20.2 370

485 3.67367 0.0 18.10 0.0 0.583 6.312 51.9 3.9917 24.0 666.0 20.2 388

486 5.69175 0.0 18.10 0.0 0.583 6.114 79.8 3.5459 24.0 666.0 20.2 392

487 4.83567 0.0 18.10 0.0 0.583 5.905 53.2 3.1523 24.0 666.0 20.2 388

488 0.15086 0.0 27.74 0.0 0.609 5.454 92.7 1.8209 4.0 711.0 20.1 395

489 0.18337 0.0 27.74 0.0 0.609 5.414 98.3 1.7554 4.0 711.0 20.1 344

490 0.20746 0.0 27.74 0.0 0.609 5.093 98.0 1.8226 4.0 711.0 20.1 318

491 0.10574 0.0 27.74 0.0 0.609 5.983 98.8 1.8681 4.0 711.0 20.1 390

492 0.11132 0.0 27.74 0.0 0.609 5.983 83.5 2.1099 4.0 711.0 20.1 396

493 0.17331 0.0 9.69 0.0 0.585 5.707 54.0 2.3817 6.0 391.0 19.2 396

494 0.27957 0.0 9.69 0.0 0.585 5.926 42.6 2.3817 6.0 391.0 19.2 396

495 0.17899 0.0 9.69 0.0 0.585 5.670 28.8 2.7986 6.0 391.0 19.2 393

496 0.28960 0.0 9.69 0.0 0.585 5.390 72.9 2.7986 6.0 391.0 19.2 396

497 0.26838 0.0 9.69 0.0 0.585 5.794 70.6 2.8927 6.0 391.0 19.2 396

498 0.23912 0.0 9.69 0.0 0.585 6.019 65.3 2.4091 6.0 391.0 19.2 396

499 0.17783 0.0 9.69 0.0 0.585 5.569 73.5 2.3999 6.0 391.0 19.2 395

500 0.22438 0.0 9.69 0.0 0.585 6.027 79.7 2.4982 6.0 391.0 19.2 396

501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391

502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396

503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396

504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393

505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396

506 rows × 13 columns

OTHER DATA SETS:

load_diabetes

load_digits
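
The other loaders can be turned into DataFrames in the same way. The following is a minimal sketch using load_diabetes, assuming scikit-learn is installed; the variable names diabetes and df and the added "target" column are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# load_diabetes returns a Bunch holding the feature matrix,
# the feature names and the target vector.
diabetes = load_diabetes()

# Build a labeled table from the raw array, as was done for the Boston data.
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Optionally keep the target next to the features (illustrative column name).
df["target"] = diabetes.target

print(df.shape)   # (442, 11): 10 features plus the target column
print(df.head())
```

load_digits follows the same pattern, with one column per pixel.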

Exploring DataFrames
Exploring data is an important first step in most data analyses. DataFrames come with a variety
of functions to help you explore and summarize the data they contain.

First, let's load a data set to explore: the mtcars data set. The mtcars data set comes with the
ggplot library, a Python port of the popular R plotting library ggplot2. ggplot does not come with
Anaconda, but you can install it by opening a console (cmd.exe) and running "pip install ggplot".
Here the data is read from a local CSV file, mtcars.csv:

In [135]: mtcars=pd.read_csv('mtcars.csv')

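In [141]: # If the car names load under the label 'Unnamed: 0', uncomment the next
# line to rename that column to 'Name' (index=str also turns the row labels into strings).
#mtcars=mtcars.rename(index=str, columns={'Unnamed: 0':'Name'})

mtcars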

Out[141]:
Name mpg cyl disp hp drat wt qsec vs am gear carb

0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4

7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2

8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4

11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3

12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3

13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3

14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4

15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4

17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2

22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4

24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1

26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4

29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Notice that mtcars is loaded as a DataFrame. We can check the dimensions and size of a
DataFrame with df.shape:

In [142]: mtcars.shape # Check dimensions

Out[142]: (32, 12)

The output shows that mtcars has 32 rows and 12 columns.
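
As a small illustration of how shape relates to the other size checks, here is a sketch on a three-row subset copied from the table above. The name cars and the choice of columns are assumptions made for the example.

```python
import pandas as pd

# Three rows copied from the mtcars table (a subset, for illustration only).
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
    "hp": [110, 93, 105],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple.
n_rows, n_cols = cars.shape
print(n_rows, n_cols)   # 3 4

print(len(cars))        # number of rows only: 3
print(cars.size)        # total number of cells, rows * columns: 12
```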

We can check the first n rows of the data with the df.head() function:

In [143]: mtcars.head(6) # Check the first 6 rows

Out[143]:
Name mpg cyl disp hp drat wt qsec vs am gear carb

0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

Similarly, we can check the last few rows with df.tail():

In [144]: mtcars.tail(6) # Check the last 6 rows

Out[144]: Name mpg cyl disp hp drat wt qsec vs am gear carb

26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2

27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4

29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6

30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8

31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
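
Both methods also work without an argument. The sketch below uses the first eight cars from the table above (names and mpg only) to show the default of five rows and the effect of a negative argument. The variable names are illustrative.

```python
import pandas as pd

# First eight cars from the mtcars table, names and mpg only.
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive",
             "Hornet Sportabout", "Valiant", "Duster 360", "Merc 240D"],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4],
})

first = cars.head()       # no argument: the first 5 rows
last = cars.tail(2)       # the last 2 rows
all_but_last = cars.head(-2)   # negative n: every row except the last 2

print(first)
print(last)
print(len(all_but_last))  # 6
```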

With large data sets, head() and tail() are useful to get a sense of what the data looks like without
printing hundreds or thousands of rows to the screen. Since each row describes a different car,
let's set the row index to the car name. You can access and assign new row labels with
df.index:

In [146]: print(mtcars.index, "\n") # Print original indexes


mtcars.index = mtcars["Name"] # Set index to car name
del mtcars["Name"] # Delete name column
print(mtcars.index) # Print new indexes

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
'25', '26', '27', '28', '29', '30', '31'],
dtype='object')

Index(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D', 'Merc 230',
'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL', 'Merc 450SLC',
'Cadillac Fleetwood', 'Lincoln Continental', 'Chrysler Imperial',
'Fiat 128', 'Honda Civic', 'Toyota Corolla', 'Toyota Corona',
'Dodge Challenger', 'AMC Javelin', 'Camaro Z28', 'Pontiac Firebird',
'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa', 'Ford Pantera L',
'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
dtype='object', name='Name')
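
The assign-then-delete steps can also be written as a single call to set_index. This is an alternative to the approach used above, sketched on a three-row subset of the table; the names cars and by_name are illustrative.

```python
import pandas as pd

# Three rows copied from the mtcars table (a subset, for illustration only).
cars = pd.DataFrame({
    "Name": ["Mazda RX4", "Datsun 710", "Valiant"],
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

# set_index moves the column into the index and drops it from the columns.
# It returns a new DataFrame; the original is left unchanged.
by_name = cars.set_index("Name")

print(by_name.index)
print(by_name.loc["Datsun 710", "mpg"])   # label-based lookup by car name: 22.8

# reset_index reverses the operation and restores the 'Name' column.
print(by_name.reset_index().columns.tolist())   # ['Name', 'mpg', 'cyl']
```

With the car name as the index, rows can be selected by label with .loc rather than by position.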

You can access the column labels with df.columns:

In [147]: mtcars.columns

Out[147]: Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear',
'carb'],
dtype='object')

STATISTICAL SUMMARY
Use the df.describe() command to get a quick statistical summary of your data set. For numeric
columns, the summary includes the count, mean, standard deviation, min, max, and the 25th, 50th
(median) and 75th percentiles:

In [148]: mtcars.iloc[:,:6].describe() # Summarize the first 6 columns

Out[148]:
mpg cyl disp hp drat wt

count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000

mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250

std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457

min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000

25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250

50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000

75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000

max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000
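
The result of describe() is itself a DataFrame, so individual statistics can be looked up by label. The sketch below uses the first six cars from the table above. The variable names and the choice of the 10th and 90th percentiles are illustrative assumptions.

```python
import pandas as pd

# First six cars from the mtcars table; three numeric columns.
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
    "cyl": [6, 6, 4, 6, 8, 6],
    "hp": [110, 110, 93, 110, 175, 105],
})

summary = cars.describe()   # a DataFrame indexed by statistic name
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

mean_mpg = summary.loc["mean", "mpg"]    # about 20.5 for these six rows
median_mpg = summary.loc["50%", "mpg"]   # the 50% row is the median: 21.0
print(mean_mpg, median_mpg)

# Other percentiles can be requested explicitly; the median is always kept.
custom = cars.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```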

In [149]: import numpy as np


np.mean(mtcars, axis=0) # Get the mean of each column

Out[149]: mpg 20.090625
cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64

In [150]: np.sum(mtcars, axis=0) # Get the sum of each column

Out[150]: mpg 642.900
cyl 198.000
disp 7383.100
hp 4694.000
drat 115.090
wt 102.952
qsec 571.160
vs 14.000
am 13.000
gear 118.000
carb 90.000
dtype: float64
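
DataFrames also provide mean() and sum() as methods, which compute the same column-wise results without going through NumPy. The sketch below uses the first six cars from the table above; the variable names are illustrative, and the row-wise sum is shown only to demonstrate the axis argument.

```python
import numpy as np
import pandas as pd

# First six cars from the mtcars table; three numeric columns.
cars = pd.DataFrame({
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
    "cyl": [6, 6, 4, 6, 8, 6],
    "hp": [110, 110, 93, 110, 175, 105],
})

col_means = cars.mean()       # axis=0 by default: one value per column
col_sums = cars.sum()
row_sums = cars.sum(axis=1)   # axis=1 aggregates across columns: one value per row

print(col_means)
print(col_sums)

# The method results match NumPy applied to the underlying array.
same = np.allclose(col_means.to_numpy(), np.mean(cars.to_numpy(), axis=0))
print(same)   # True
```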
