To store data from an external source like an Excel workbook or a database, we need a data
structure that can hold different data types. It is also desirable to be able to refer to rows and
columns in the data by custom labels rather than numbered indexes.
The pandas library offers data structures designed with this in mind: the Series and the
DataFrame. Series are 1-dimensional labeled arrays similar to NumPy's ndarrays, while
DataFrames are labeled 2-dimensional structures that essentially function as spreadsheet
tables.
The name pandas is derived from "panel data", an econometrics term for
multidimensional data. Using pandas, we can accomplish five typical steps in the processing and
analysis of data, regardless of the origin of the data:
load,
prepare,
manipulate,
model, and
analyze.
DATA STRUCTURES IN PANDAS
Series: 1 dimension. A 1D labeled homogeneous array, size immutable.
DataFrame: 2 dimensions. A general 2D labeled, size-mutable tabular structure with
potentially heterogeneously typed columns.
All pandas data structures are value mutable (their contents can be changed), and all except
Series are size mutable. A Series is size immutable.
Pandas Series
Series are very similar to ndarrays: the main difference between them is that with a Series, you can
provide custom index labels, and operations you perform on Series automatically align the
data based on those labels.
To create a new Series, first load the numpy and pandas libraries (pandas is preinstalled with the
Anaconda Python distribution).
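The import cell itself is not shown here. A minimal sketch of that setup, assuming the conventional aliases np and pd, followed by two simple ways to build a Series (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# From a NumPy array: pandas assigns default integer labels 0..n-1.
letters = pd.Series(np.array(["a", "b", "c", "d"]))

# From a scalar: the value is repeated once per index label.
fives = pd.Series(5, index=[0, 1, 2, 3])

print(letters)
print(fives)
```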
In [3]: s = pd.Series()
        print(s)
0    a
1    b
2    c
3    d
dtype: object
a    0.0
b    1.0
c    2.0
dtype: float64
0    5
1    5
2    5
3    5
dtype: int64
Define a new Series by passing a collection of homogeneous data, such as an ndarray or a list, along with
a list of associated index labels to pd.Series():
Out[7]: a    2
        b    3
        c    5
        d    4
        dtype: int64
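The cell that produced Out[7] is not shown. A sketch that reproduces that output, assuming the values were passed as a plain list:

```python
import pandas as pd

# Values and labels are paired by position: "a" -> 2, "b" -> 3, and so on.
my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
print(my_series)
```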
You can also create a series from a dictionary, in which case the dictionary keys act as the labels
and the values act as the data:
In [8]: my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}
        my_series2 = pd.Series(my_dict)
        my_series2
Out[8]: a    5
        b    4
        c    8
        x    2
        dtype: int64
Out[9]: 2
In [10]: my_series[0]
Out[10]: 2
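Both lookups above return the same element, once by label and once by position. Recent pandas releases warn when a plain integer is used positionally on a Series with non-integer labels, so a more explicit sketch uses .loc and .iloc (my_series is rebuilt here so the block runs on its own):

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])

by_label = my_series.loc["a"]      # look up by index label
by_position = my_series.iloc[0]    # look up by integer position

print(by_label, by_position)
```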
If you take a slice of a Series, you get both the values and the labels contained in the slice:
In [11]: my_series[1:3]
Out[11]: b    3
         c    5
         dtype: int64
Out[12]: a     4
         b     6
         c    10
         d     8
         dtype: int64
If you perform an operation on two Series that have different labels, the unmatched labels will
return a value of NaN (not a number).
Out[13]: a     7.0
         b     7.0
         c    13.0
         d     NaN
         x     NaN
         dtype: float64
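The input cell for Out[13] is not shown; adding the two Series defined earlier gives the same values. The sketch below rebuilds them and also shows Series.add with fill_value, an optional variation that treats a missing label as 0 instead of producing NaN:

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=["a", "b", "c", "d"])
my_series2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

# Labels present in both operands are added; "d" and "x" each appear
# in only one operand, so their results are NaN.
total = my_series + my_series2
print(total)

# With fill_value=0, a label missing from one side counts as 0.
filled = my_series.add(my_series2, fill_value=0)
print(filled)
```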
Out[14]: a     4
         b     6
         c    10
         d     8
         dtype: int64
You can create a DataFrame out of a variety of data sources, such as dictionaries, 2D numpy arrays and
Series, using the pd.DataFrame() function. Dictionaries provide an intuitive way to create
DataFrames: when passed to pd.DataFrame(), a dictionary's keys become column labels and the
values become the columns themselves:
In [15]: df = pd.DataFrame()
         print(df)
Empty DataFrame
Columns: []
Index: []
In [17]: df = pd.DataFrame(data, columns=['Name', 'Age'])
         print(df)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
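The cell that defines data is not shown. A sketch consistent with the printed output, assuming data is a list of [name, age] rows:

```python
import pandas as pd

# Each inner list is one row; columns= names the two fields (assumed layout).
data = [["Alex", 10], ["Bob", 12], ["Clarke", 13]]
df = pd.DataFrame(data, columns=["Name", "Age"])
print(df)
```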
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Out[21]:
age gender height name siblings weight
Notice that the values in the dictionary you use to make a DataFrame can be a variety of sequence
objects, including lists, ndarrays, tuples and Series. If you pass in a singular value like a single
number or string, that value is duplicated for every row in the DataFrame (in this case gender is
set to "M" for all records and siblings is set to 1).
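The cell that builds this DataFrame is not shown. The sketch below is one way to build a frame of the same shape; only Joe's row and the three weights appear in the outputs of this document, so the other ages and heights are illustrative placeholders:

```python
import numpy as np
import pandas as pd

my_df = pd.DataFrame({
    "name": ["Joe", "Bob", "Frans"],                      # list
    "age": np.array([10, 15, 20]),                        # ndarray (15 and 20 are placeholders)
    "weight": (75, 123, 239),                             # tuple
    "height": pd.Series([4.5, 5.0, 6.1],
                        index=["Joe", "Bob", "Frans"]),   # Series (5.0 and 6.1 are placeholders)
    "siblings": 1,                                        # scalar, repeated on every row
    "gender": "M",                                        # scalar, repeated on every row
})
print(my_df)
```

Because one of the values is a Series with its own labels, the row index of the result is taken from that Series.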
Also note that in the DataFrame above, the rows were automatically given indexes that align with
the indexes of the series we passed in for the "height" column. If we did not use a series with
index labels to create our DataFrame, it would be given numeric row index labels by default:
Out[22]:
age gender height name siblings weight
0 10 M 4.5 Joe 1 75
You can provide custom row labels when creating a DataFrame by adding the index argument:
Out[23]:
age gender height name siblings weight
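A minimal sketch of the index argument, using a hypothetical two-column frame (the ages for Bob and Frans are made up):

```python
import pandas as pd

# index= supplies the row labels in the same order as the data.
df2 = pd.DataFrame(
    {"age": [10, 15, 20], "weight": [75, 123, 239]},
    index=["Joe", "Bob", "Frans"],
)
print(df2)
```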
A DataFrame behaves like a dictionary of Series objects that each have the same length and
indexes. This means we can get, add and delete columns in a DataFrame the same way we
would when dealing with a dictionary:
Out[24]: Joe       75
         Bob      123
         Frans    239
         Name: weight, dtype: int64
In [25]: df2.weight
Out[25]: Joe       75
         Bob      123
         Frans    239
         Name: weight, dtype: int64
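The cells that get, delete and add columns are only partly shown. A sketch of the dictionary-style operations, assuming a small stand-in for df2 (values not visible in the outputs of this document are made up):

```python
import pandas as pd

df2 = pd.DataFrame(
    {"age": [10, 15, 20], "name": ["Joe", "Bob", "Frans"], "weight": [75, 123, 239]},
    index=["Joe", "Bob", "Frans"],
)

weights = df2["weight"]          # get a column, as with dict[key]
del df2["name"]                  # delete a column, as with del dict[key]
df2["IQ"] = [130, 105, 115]      # add a column (105 and 115 are made up)
df2["Married"] = False           # a scalar is repeated on every row

print(df2)
```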
Out[27]:
age gender height siblings weight IQ
Out[28]:
age gender height siblings weight IQ Married
When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be
filled with NaN:
Out[29]:
age gender height siblings weight IQ Married College
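A sketch of that index matching, assuming a small stand-in for df2 and hypothetical college names:

```python
import pandas as pd

df2 = pd.DataFrame({"age": [10, 15, 20]}, index=["Joe", "Bob", "Frans"])

# The Series has no entry labelled "Joe", so that row receives NaN.
# The order of the Series does not matter; rows are matched by label.
df2["College"] = pd.Series(["College A", "College B"], index=["Frans", "Bob"])
print(df2)
```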
You can select rows and columns by label with df.loc[row, column]:
In [30]: df2.loc["Joe"]  # Select row "Joe"
Out[30]: age            10
         gender          M
         height        4.5
         siblings        1
         weight         75
         IQ            130
         Married     False
         College       NaN
         Name: Joe, dtype: object
Out[31]: 130
Out[32]:
IQ Married College
Out[33]: age            10
         gender          M
         height        4.5
         siblings        1
         weight         75
         IQ            130
         Married     False
         College       NaN
         Name: Joe, dtype: object
Out[34]: 130
Out[35]:
IQ Married College
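Most of the selection cells are not shown. Out[33] to Out[35] repeat Out[30] to Out[32], which suggests the same selections were made a second time by position. A sketch of both styles, assuming a small stand-in for df2 (values other than Joe's are made up):

```python
import pandas as pd

df2 = pd.DataFrame(
    {"age": [10, 15, 20], "IQ": [130, 105, 115], "Married": [False, False, True]},
    index=["Joe", "Bob", "Frans"],
)

row = df2.loc["Joe"]                              # one row by label
cell = df2.loc["Joe", "IQ"]                       # one cell by row and column label
block = df2.loc[["Joe", "Bob"], "IQ":"Married"]   # label slices include both endpoints

same_row = df2.iloc[0]                            # first row by integer position
same_cell = df2.iloc[0, 1]                        # row 0, column 1 by position

print(cell, same_cell)
```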
In [36]: df=pd.read_excel("SAMPLEPANDAS.xlsx")
In [39]: df
Out[39]:
Month Values
0 2018-01-01 201
1 2018-02-01 107
2 2018-03-01 483
3 2018-04-01 240
4 2018-05-01 356
5 2018-06-01 369
6 2018-07-01 266
7 2018-08-01 308
8 2018-09-01 453
9 2018-10-01 395
10 2018-11-01 487
11 2018-12-01 403
12 2019-01-01 478
13 2019-02-01 112
14 2019-03-01 262
15 2019-04-01 283
16 2019-05-01 444
17 2019-06-01 233
18 2019-07-01 324
19 2019-08-01 305
20 2019-09-01 299
21 2019-10-01 294
22 2019-11-01 468
23 2019-12-01 321
Out[40]:
Month Values
0 2018-01-01 201
1 2018-02-01 107
2 2018-03-01 483
3 2018-04-01 240
4 2018-05-01 356
In [41]: df.tail()
Out[41]:
Month Values
19 2019-08-01 305
20 2019-09-01 299
21 2019-10-01 294
22 2019-11-01 468
23 2019-12-01 321
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 7],
                   'B': [np.nan, 3.5, 4, 5, 8],
                   'C': [3, np.nan, 2.5, 6, 9]})
In [43]: df
Out[43]:
A B C
In [44]: # To remove rows that contain NaN values, call df.dropna(), where df is the DataFrame
         df.dropna()
Out[44]:
A B C
In [45]: # df.dropna() returns a new DataFrame, which we can assign to a name, say df2
         df2 = df.dropna()
         df2
Out[45]:
A B C
FILLING UP NaN VALUES
In [46]: # We can fill the NaN values using the fillna function, as below:
         df.fillna(value=0)
Out[46]:
A B C
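The table bodies for these outputs did not survive. The sketch below rebuilds the same df and shows what dropna() and fillna() return; note that both return new objects and leave df unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4, 7],
                   "B": [np.nan, 3.5, 4, 5, 8],
                   "C": [3, np.nan, 2.5, 6, 9]})

dropped = df.dropna()          # keeps only rows with no NaN in any column
filled = df.fillna(value=0)    # replaces every NaN with 0

print(dropped)
print(filled)
print(df.isna().sum().sum())   # df itself still holds its NaN values
```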
Out[47]:
A B C NewValue
In [48]: # Replace the null values in a single column with that column's mean
         df['A'] = df['A'].fillna(value=df['A'].mean())
         df
Out[48]:
A B C NewValue
In [49]: # Replace the null values in every numeric column with that column's mean;
         # numeric_only=True skips text columns such as NewValue
         df3 = df.fillna(value=df.mean(numeric_only=True))
         df3
Out[49]:
A B C NewValue
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in _ensure_numeric(x)
    818         try:
--> 819             x = float(x)
    820         except Exception:
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in nanmean(values, axis, skipna)
    355     count = _get_counts(mask, axis, dtype=dtype_count)
--> 356     the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    357
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in _ensure_numeric(x)
    824             raise TypeError('Could not convert {value!s} to numeric'
--> 825                             .format(value=x))
    826     return x
Traceback (most recent call last)
<ipython-input-50-d8f394443f45> in <module>()
      1 #for replacing the Nan Values with Mean of Columns for all the columns in a DataFrame
      2 for x in df.columns:
----> 3     df[x]=df[x].fillna(value=df[x].mean())
~\Anaconda3\lib\site-packages\pandas\core\generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
   7313                                       skipna=skipna)
   7314         return self._reduce(f, name, axis=axis, skipna=skipna,
-> 7315                             numeric_only=numeric_only)
   7316
   7317     return set_function_name(stat_func, name, cls)
~\Anaconda3\lib\site-packages\pandas\core\series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   2575                                           'numeric_only.'.format(name))
   2576             with np.errstate(all='ignore'):
-> 2577                 return op(delegate, skipna=skipna, **kwds)
   2578
   2579         return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
    132                 except ValueError as e:
    133                     # we want to transform an object array
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in nanmean(values, axis, skipna)
    354         dtype_count = dtype
    355     count = _get_counts(mask, axis, dtype=dtype_count)
--> 356     the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))
    357
    358     if axis is not None and getattr(the_sum, 'ndim', False):
~\Anaconda3\lib\site-packages\pandas\core\nanops.py in _ensure_numeric(x)
    823         except Exception:
    824             raise TypeError('Could not convert {value!s} to numeric'
--> 825                             .format(value=x))
    826     return x
    827
In [51]: df
Out[51]:
A B C NewValue
In [52]: # Use a for loop to fill null values with the column mean
         # (float columns only, so text columns are skipped)
         for x in df.columns:
             if df[x].dtype.kind == "f":
                 df[x] = df[x].fillna(value=df[x].mean())
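The loop above avoids the error by skipping columns that are not floats. An alternative sketch, not part of the original notebook, selects the numeric columns with select_dtypes and fills them in one step; the contents of the NewValue column here are made up, since the original values are not shown:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan, 4, 7],
                   "B": [np.nan, 3.5, 4, 5, 8],
                   "NewValue": ["p", "q", "r", "s", "t"]})   # hypothetical text column

# Pick the numeric columns, then fill each one with its own mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df)
```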
In [53]: df
Out[53]:
A B C NewValue
In [55]: df.drop_duplicates()
Out[55]:
A B C NewValue
In [56]: # Filter rows by condition, e.g. column A values > 1 and column B values < 5
         # Use '&' for an AND condition and '|' for an OR condition
         newdf = df[(df['A'] > 1) & (df['B'] < 5)]
         newdf
Out[56]:
A B C NewValue
Out[57]:
A B C NewValue
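The filtered tables did not survive. A small sketch of both operators on a hypothetical two-column frame; the parentheses around each comparison are needed because & and | bind more tightly than > and <:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.5, 4.0, 7.0],
                   "B": [5.125, 3.5, 4.0, 5.0, 8.0]})

both = df[(df["A"] > 1) & (df["B"] < 5)]      # AND: rows meeting both conditions
either = df[(df["A"] > 1) | (df["B"] < 5)]    # OR: rows meeting at least one condition

print(both)
print(either)
```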
SORTING
In [58]: df.sort_values('B')
df.sort_values(by='B')
Out[58]:
A B C NewValue
Out[59]:
A B C NewValue
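The sorted tables did not survive. A sketch of sort_values on a hypothetical frame, showing the default ascending order and the descending variant:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [8.0, 3.5, 5.0]})

ascending = df.sort_values(by="B")                     # smallest B first (default)
descending = df.sort_values(by="B", ascending=False)   # largest B first

print(ascending)
print(descending)
```

sort_values returns a new DataFrame and keeps the original row labels, so the index shows where each row came from.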
DESCRIBE
In [60]: d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                                 'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
              'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
              'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}
         x = pd.DataFrame(d)
In [61]: x.describe()
Out[61]:
Age Rating
In [62]: x.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
Age 12 non-null int64
Name 12 non-null object
Rating 12 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 368.0+ bytes
In [63]: x.columns
df=pd.DataFrame(np.random.rand(10,4),columns=['a','b','c','d'])
df.plot.bar()
In [70]: import pandas as pd
         import numpy as np
         df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
         df.plot.box()
JOINING DATAFRAMES
In [71]: df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10], 'C': [11, 12, 13, 14, 15]})
         df1 = pd.DataFrame({'A': [100, 101, 102, 103, 104], 'B': [105, 106, 107, 108, 109], 'C': [110, 111, 112, 113, 114]})
In [72]: # How to concatenate: frames with the same columns stack cleanly;
         # a column missing from one frame is filled with NaN
         pd.concat([df, df1])
Out[72]:
A B C
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
In [74]: pdconcat=pd.concat([df,df1,df2])
pdconcat
Out[74]: A B C
0 1 6 11.0
1 2 7 12.0
2 3 8 13.0
3 4 9 14.0
4 5 10 15.0
0 10 25 NaN
1 20 35 NaN
2 30 45 NaN
3 40 55 NaN
In [75]: print(pdconcat['A'].dtype.kind)
print(pdconcat['B'].dtype.kind)
print(pdconcat['C'].dtype.kind)
i
i
f
In [76]: # To concatenate the data horizontally, use axis=1; the default value is axis=0
         pdconcat1 = pd.concat([df, df1, df2], axis=1)
         pdconcat1
Out[76]: A B C A B C A B
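The cell that defines the second df2 is not shown. The sketch below reads its values off Out[74] and compares the two concatenation directions; ignore_index=True is an optional extra that renumbers the rows:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10], "C": [11, 12, 13, 14, 15]})
# Values read off Out[74]; the defining cell is not shown.
df2 = pd.DataFrame({"A": [10, 20, 30, 40], "B": [25, 35, 45, 55]})

stacked = pd.concat([df, df2])                         # axis=0: rows appended, original labels kept
renumbered = pd.concat([df, df2], ignore_index=True)   # axis=0 with a fresh 0..n-1 index
side_by_side = pd.concat([df, df2], axis=1)            # axis=1: columns placed next to each other

print(stacked)
print(side_by_side)
```

Column C becomes float in the stacked result because df2 has no C column and the gaps are filled with NaN, which matches the dtype kinds printed in In [75].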
Out[77]:
RollNo. StudentName
0 1 Akash
1 2 Vivek
2 3 Sondeep
3 4 Pranav
4 5 Purnima
5 6 DivyaSaroj
6 7 Nimisha
7 8 ajay
8 9 Manju
9 10 Mr. X
urgaon", "Delhi", "Amritsar", "Jalandhar", "Gandhinagar", "Almora", "Ooty", "Mumbai", "Chennai"]})
T2
Out[78]:
City RollNo.
0 Shimla 1
1 Gurgaon 2
2 Delhi 3
3 Amritsar 4
4 Jalandhar 5
5 Gandhinagar 6
6 Almora 7
7 Ooty 8
8 Mumbai 9
9 Chennai 10
In [79]: # There are two ways to join tables: (1) merging; (2) using joins
         # In the two tables above (T1 and T2), RollNo. is the common key, and both tables contain the same number of rows
         # Merging
         pd.merge(T1, T2, how='inner', on='RollNo.')
         #T3=pd.DataFrame({'RollNo.':[1,3,5,7,9,11,13,15,17,19], 'PinCode':["171001", "122002", "110005", "183005", "181001", "168754", "987654", "547645", "654789", "123456"]})
Out[79]:
RollNo. StudentName City
0 1 Akash Shimla
1 2 Vivek Gurgaon
2 3 Sondeep Delhi
3 4 Pranav Amritsar
4 5 Purnima Jalandhar
5 6 DivyaSaroj Gandhinagar
6 7 Nimisha Almora
7 8 ajay Ooty
8 9 Manju Mumbai
9 10 Mr. X Chennai
Out[80]:
PinCode RollNo.
0 171001 1
1 122002 3
2 110005 5
3 183005 7
4 181001 9
5 168754 11
6 987654 13
7 547645 15
8 654789 17
9 123456 19
Out[81]:
RollNo. StudentName PinCode
0 1 Akash 171001
1 3 Sondeep 122002
2 5 Purnima 110005
3 7 Nimisha 183005
4 9 Manju 181001
Out[82]:
RollNo. StudentName PinCode
0 1 Akash 171001
1 2 Vivek NaN
2 3 Sondeep 122002
3 4 Pranav NaN
4 5 Purnima 110005
5 6 DivyaSaroj NaN
6 7 Nimisha 183005
7 8 ajay NaN
8 9 Manju 181001
9 10 Mr. X NaN
Out[83]:
RollNo. StudentName PinCode
0 1 Akash 171001
1 3 Sondeep 122002
2 5 Purnima 110005
3 7 Nimisha 183005
4 9 Manju 181001
5 11 NaN 168754
6 13 NaN 987654
7 15 NaN 547645
8 17 NaN 654789
9 19 NaN 123456
Out[84]:
City PinCode
RollNo.
1 Shimla 171001
3 Delhi 122002
5 Jalandhar 110005
7 Almora 183005
9 Mumbai 181001
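The cells behind Out[81] to Out[84] are not shown. Their row counts and NaN patterns are consistent with inner, left and right merges of T1 with T3 and an index-based join, so the sketch below assumes those calls. T1 and T3 are rebuilt from the outputs and the commented-out line above:

```python
import pandas as pd

T1 = pd.DataFrame({
    "RollNo.": list(range(1, 11)),
    "StudentName": ["Akash", "Vivek", "Sondeep", "Pranav", "Purnima",
                    "DivyaSaroj", "Nimisha", "ajay", "Manju", "Mr. X"],
})
T3 = pd.DataFrame({
    "RollNo.": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
    "PinCode": ["171001", "122002", "110005", "183005", "181001",
                "168754", "987654", "547645", "654789", "123456"],
})

inner = pd.merge(T1, T3, how="inner", on="RollNo.")   # keys present in both tables
left = pd.merge(T1, T3, how="left", on="RollNo.")     # every row of T1
right = pd.merge(T1, T3, how="right", on="RollNo.")   # every row of T3
outer = pd.merge(T1, T3, how="outer", on="RollNo.")   # every key from either table

# DataFrame.join matches on the index, so the key is moved into the index first.
joined = T1.set_index("RollNo.").join(T3.set_index("RollNo."), how="inner")

print(inner)
print(joined)
```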
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [88]: df
Out[88]:
Sepal Length Sepal Width Petal Length Petal width
In [89]: #Alternatively
import pandas as pd
a = pd.DataFrame(iris.data, columns=iris.feature_names)
In [91]: a
Out[91]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
In [93]: a
Out[93]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396
5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3.0 222.0 18.7 394
6 0.08829 12.5 7.87 0.0 0.524 6.012 66.6 5.5605 5.0 311.0 15.2 395
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386
9 0.17004 12.5 7.87 0.0 0.524 6.004 85.9 6.5921 5.0 311.0 15.2 386
10 0.22489 12.5 7.87 0.0 0.524 6.377 94.3 6.3467 5.0 311.0 15.2 392
11 0.11747 12.5 7.87 0.0 0.524 6.009 82.9 6.2267 5.0 311.0 15.2 396
12 0.09378 12.5 7.87 0.0 0.524 5.889 39.0 5.4509 5.0 311.0 15.2 390
13 0.62976 0.0 8.14 0.0 0.538 5.949 61.8 4.7075 4.0 307.0 21.0 396
14 0.63796 0.0 8.14 0.0 0.538 6.096 84.5 4.4619 4.0 307.0 21.0 380
15 0.62739 0.0 8.14 0.0 0.538 5.834 56.5 4.4986 4.0 307.0 21.0 395
16 1.05393 0.0 8.14 0.0 0.538 5.935 29.3 4.4986 4.0 307.0 21.0 386
17 0.78420 0.0 8.14 0.0 0.538 5.990 81.7 4.2579 4.0 307.0 21.0 386
18 0.80271 0.0 8.14 0.0 0.538 5.456 36.6 3.7965 4.0 307.0 21.0 288
19 0.72580 0.0 8.14 0.0 0.538 5.727 69.5 3.7965 4.0 307.0 21.0 390
20 1.25179 0.0 8.14 0.0 0.538 5.570 98.1 3.7979 4.0 307.0 21.0 376
21 0.85204 0.0 8.14 0.0 0.538 5.965 89.2 4.0123 4.0 307.0 21.0 392
22 1.23247 0.0 8.14 0.0 0.538 6.142 91.7 3.9769 4.0 307.0 21.0 396
23 0.98843 0.0 8.14 0.0 0.538 5.813 100.0 4.0952 4.0 307.0 21.0 394
24 0.75026 0.0 8.14 0.0 0.538 5.924 94.1 4.3996 4.0 307.0 21.0 394
25 0.84054 0.0 8.14 0.0 0.538 5.599 85.7 4.4546 4.0 307.0 21.0 303
26 0.67191 0.0 8.14 0.0 0.538 5.813 90.3 4.6820 4.0 307.0 21.0 376
27 0.95577 0.0 8.14 0.0 0.538 6.047 88.8 4.4534 4.0 307.0 21.0 306
28 0.77299 0.0 8.14 0.0 0.538 6.495 94.4 4.4547 4.0 307.0 21.0 387
29 1.00245 0.0 8.14 0.0 0.538 6.674 87.3 4.2390 4.0 307.0 21.0 380
... ... ... ... ... ... ... ... ... ... ... ... ...
476 4.87141 0.0 18.10 0.0 0.614 6.484 93.6 2.3053 24.0 666.0 20.2 396
477 15.02340 0.0 18.10 0.0 0.614 5.304 97.3 2.1007 24.0 666.0 20.2 349
478 10.23300 0.0 18.10 0.0 0.614 6.185 96.7 2.1705 24.0 666.0 20.2 379
479 14.33370 0.0 18.10 0.0 0.614 6.229 88.0 1.9512 24.0 666.0 20.2 383
480 5.82401 0.0 18.10 0.0 0.532 6.242 64.7 3.4242 24.0 666.0 20.2 396
481 5.70818 0.0 18.10 0.0 0.532 6.750 74.9 3.3317 24.0 666.0 20.2 393
482 5.73116 0.0 18.10 0.0 0.532 7.061 77.0 3.4106 24.0 666.0 20.2 395
483 2.81838 0.0 18.10 0.0 0.532 5.762 40.3 4.0983 24.0 666.0 20.2 392
484 2.37857 0.0 18.10 0.0 0.583 5.871 41.9 3.7240 24.0 666.0 20.2 370
485 3.67367 0.0 18.10 0.0 0.583 6.312 51.9 3.9917 24.0 666.0 20.2 388
486 5.69175 0.0 18.10 0.0 0.583 6.114 79.8 3.5459 24.0 666.0 20.2 392
487 4.83567 0.0 18.10 0.0 0.583 5.905 53.2 3.1523 24.0 666.0 20.2 388
488 0.15086 0.0 27.74 0.0 0.609 5.454 92.7 1.8209 4.0 711.0 20.1 395
489 0.18337 0.0 27.74 0.0 0.609 5.414 98.3 1.7554 4.0 711.0 20.1 344
490 0.20746 0.0 27.74 0.0 0.609 5.093 98.0 1.8226 4.0 711.0 20.1 318
491 0.10574 0.0 27.74 0.0 0.609 5.983 98.8 1.8681 4.0 711.0 20.1 390
492 0.11132 0.0 27.74 0.0 0.609 5.983 83.5 2.1099 4.0 711.0 20.1 396
493 0.17331 0.0 9.69 0.0 0.585 5.707 54.0 2.3817 6.0 391.0 19.2 396
494 0.27957 0.0 9.69 0.0 0.585 5.926 42.6 2.3817 6.0 391.0 19.2 396
495 0.17899 0.0 9.69 0.0 0.585 5.670 28.8 2.7986 6.0 391.0 19.2 393
496 0.28960 0.0 9.69 0.0 0.585 5.390 72.9 2.7986 6.0 391.0 19.2 396
497 0.26838 0.0 9.69 0.0 0.585 5.794 70.6 2.8927 6.0 391.0 19.2 396
498 0.23912 0.0 9.69 0.0 0.585 6.019 65.3 2.4091 6.0 391.0 19.2 396
499 0.17783 0.0 9.69 0.0 0.585 5.569 73.5 2.3999 6.0 391.0 19.2 395
500 0.22438 0.0 9.69 0.0 0.585 6.027 79.7 2.4982 6.0 391.0 19.2 396
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396
load_diabetes
load_digits
Exploring DataFrames
Exploring data is an important first step in most data analyses. DataFrames come with a variety
of functions to help you explore and summarize the data they contain.
First, let's load a data set to explore: the mtcars data set. The mtcars data set comes with the
ggplot library, a port of a popular R plotting library called ggplot2. ggplot does not come with
Anaconda, but you can install it by opening a console (cmd.exe) and running "pip install ggplot".
In [135]: mtcars=pd.read_csv('mtcars.csv')
Out[141]:
Name mpg cyl disp hp drat wt qsec vs am gear carb
Notice that mtcars is loaded as a DataFrame. We can check the dimensions and size of a
DataFrame with df.shape:
We can check the first n rows of the data with the df.head() function:
Out[143]:
Name mpg cyl disp hp drat wt qsec vs am gear carb
Out[144]: Name mpg cyl disp hp drat wt qsec vs am gear carb
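The cells and table bodies for shape, head() and tail() did not survive. A sketch on a tiny stand-in frame; the three rows are illustrative and are not read from mtcars.csv:

```python
import pandas as pd

cars = pd.DataFrame({
    "Name": ["Car A", "Car B", "Car C"],   # hypothetical rows
    "mpg": [21.0, 22.8, 18.7],
    "cyl": [6, 4, 8],
})

print(cars.shape)     # (number of rows, number of columns)
print(cars.head(2))   # first 2 rows; head() with no argument shows 5
print(cars.tail(1))   # last row
```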
With large data sets, head() and tail() are useful to get a sense of what the data looks like without
printing hundreds or thousands of rows to the screen. Since each row specifies a different car,
let's set the row indexes equal to the car names. You can access and assign new row indexes with
df.index:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
       '25', '26', '27', '28', '29', '30', '31'],
      dtype='object')
       'Dodge Challenger', 'AMC Javelin', 'Camaro Z28', 'Pontiac Firebird',
       'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa', 'Ford Pantera L',
       'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
      dtype='object', name='Name')
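The cell that reassigns the index is not shown. The outputs suggest the Name column was moved into the index and then removed from the columns; a sketch of that step on a hypothetical two-row frame, with set_index as a one-step alternative:

```python
import pandas as pd

cars = pd.DataFrame({"Name": ["Car A", "Car B"], "mpg": [21.0, 22.8]})

cars.index = cars["Name"]   # assign new row labels from the Name column
del cars["Name"]            # the names now live in the index, so drop the column
print(cars.index)

# set_index performs both steps at once and returns a new DataFrame.
cars2 = pd.DataFrame({"Name": ["Car A", "Car B"], "mpg": [21.0, 22.8]}).set_index("Name")
print(cars2)
```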
In [147]: mtcars.columns
Out[147]: Index(['mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear',
                 'carb'],
                dtype='object')
STATISTICAL SUMMARY
Use the df.describe() command to get a quick statistical summary of your data set. For numeric columns, the summary
includes the count, mean, standard deviation, min, max and the 25th, 50th (median) and 75th percentiles:
Out[148]:
mpg cyl disp hp drat wt
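The summary table itself did not survive. A sketch of describe() on a small frame built from the first five Age and Rating values used earlier in this document:

```python
import pandas as pd

x = pd.DataFrame({"Age": [25, 26, 25, 23, 30],
                  "Rating": [4.23, 3.24, 3.98, 2.56, 3.20]})

summary = x.describe()
print(summary)

# The median is the row labelled "50%".
print(summary.loc["50%", "Age"])
```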