Py FM Analytics 05

Data Structures in Python
In the last session, we introduced the primary data structure in Python: lists. We
have seen that a list lets you store multiple values under a single variable. We
have also briefly seen that this lets us iterate over those values using for loops. As
a result, we can apply the same operations to many values using a small set of commands. The
list data structure therefore keeps the code short, and the other data structures in this
session build on it.
Today, we will see three other data structures that build on lists and allow you to do the
same operations even more efficiently. These data structures are
1. Dictionaries
2. Arrays
3. Data Frames
Dictionaries
Just like a list, a dictionary is a collection of many values. The main difference between lists and
dictionaries is the way indexing works. Recall that lists are ordered sequences of
values. This means that the list [1, 2] is not the same as the list [2, 1].
Because they are ordered sequences, elements of a list are indexed with integers: 0, 1, 2, 3, etc.
from the front and -1, -2, -3, etc. from the back.
In contrast, an index in a dictionary can be any hashable value, for example a string or a number.
An index in a dictionary is called a key.
Within a dictionary, each key is associated with a value, creating a key-value pair.
Ok. We are ready to create our first dictionary.
In [2]: firstDict = {'name': 'Sunil', 'age': 38, 'Address': 'RNo. 9'}
In [3]: firstDict['Address']
'RNo. 9'
In [4]: 'The ' + str(firstDict['age']) + ' year old professor, ' + firstDict['name'] + ' has office in ' + firstDict['Address']
'The 38 year old professor, Sunil has office in RNo. 9'
In [5]: secondDict = {'age': 38, 'name': 'Sunil', 'Address': 'RNo. 9'}
firstDict == secondDict
True
In [6]: a = [1,2,3]
b = [3,2,1]
a==b
False
In [7]: secondDict['specialization']
---------------------------------------------------------------------------
Traceback (most recent call last)
at line 1 in <module>
KeyError: 'specialization'
The KeyError when working with dictionaries is similar to IndexError when working with lists.
KeyError means that you have used a non-existing key as an index in a dictionary. IndexError in
the same way means you used a non-existing index in a list.
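A KeyError can be avoided by checking for the key first, by using the dictionary's get() method, or by catching the error. The sketch below reuses secondDict from above; the default string 'not recorded' is a placeholder chosen only for this illustration.

```python
# Same contents as secondDict in the cells above
secondDict = {'age': 38, 'name': 'Sunil', 'Address': 'RNo. 9'}

# Option 1: test membership with `in` before indexing
hasSpecialization = 'specialization' in secondDict
print(hasSpecialization)

# Option 2: get() returns a default value when the key is missing
# ('not recorded' is a placeholder chosen for this example)
specialization = secondDict.get('specialization', 'not recorded')
print(specialization)

# Option 3: catch the KeyError explicitly
try:
    secondDict['specialization']
except KeyError as err:
    missingKey = err.args[0]  # the key that was not found
    print('Missing key:', missingKey)
```

Which option fits best depends on whether a missing key is expected (get() or `in`) or a genuine mistake (let the KeyError surface).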
Lists and Dictionaries

Lists and dictionaries are connected in many ways:
1. You can have a dictionary of lists.

2. You can convert certain kinds of lists to dictionaries.
3. You can convert dictionaries to lists and run for loops on elements of dictionaries.
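The three connections listed above can be sketched as follows. The city names and temperature readings are invented for illustration and are not part of the session's data.

```python
# 1. A dictionary of lists: each key maps to a list of values
temps = {'Delhi': [31, 33, 35], 'Pune': [27, 28, 26]}

# 2. Two parallel lists converted to a dictionary with zip()
cities = ['Delhi', 'Pune']
readings = [[31, 33, 35], [27, 28, 26]]
tempsFromLists = dict(zip(cities, readings))
print(tempsFromLists == temps)

# 3. A dictionary converted to lists, and a for loop over its items
cityList = list(temps.keys())
readingList = list(temps.values())
maxTemps = {}
for city, values in temps.items():
    maxTemps[city] = max(values)  # highest reading for each city
print(maxTemps)
```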
In class exercise:
The following table shows the marks of five students in five subjects:
Name     English  Hindi  Mathematics  Science  Social
Anand    65       78     85           58       72
Bhanu    83       64     74           94       65
Chetna   47       84     74           59       82
Durga    57       59     95           78       49
Eshwar   78       65     84           68       65
Exercise 1: Get this data into your Python environment using lists and using dictionaries.
In [9]: ## your code here
Exercise 2: Now create lists that can be converted to a dictionary that stores the above table.
In [10]: ## your code here
Exercise 3: Iterate over the dictionary above to calculate average scores.
In [11]: ## your code here
Arrays
Notice that we have to iterate over the lists to make even simple calculations. This is inefficient.
Python loops are slow because they process one element at a time and every step is handled by
the interpreter. There are more efficient ways of doing mathematical operations on multiple
values, and we do not need to implement them ourselves, because these methods are already
written as functions in various modules. The most widely used module for mathematical operations
on arrays is numpy. The fundamental datatype in numpy is the ndarray, or n-dimensional array.
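As a rough illustration of the difference, the sketch below computes an average twice: once with a for loop over a list and once with a single numpy call. The price values are invented for this example.

```python
import numpy as np

prices = [10.0, 12.5, 11.0, 14.5]  # made-up values

# Loop version: accumulate the total one element at a time
total = 0.0
for p in prices:
    total += p
loopMean = total / len(prices)

# Array version: one call, no explicit loop
priceArray = np.array(prices)
arrayMean = priceArray.mean()

# Arithmetic on an array applies to every element at once
doubled = priceArray * 2

print(loopMean, arrayMean)
print(doubled)
```

Both versions give the same mean; the array version simply hands the looping over to numpy.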
NdArray
The numpy module is based on one main object: ndarray. It stands for N-dimensional array. It is a
1. multidimensional (1-d is a vector, 2-d is a matrix, n-d is what?)
2. homogeneous array (all the elements are of the same type)
3. with a pre-determined number of dimensions.
Let's code!!!
In [12]: import numpy as np
In [13]: np.array([[[1,2],[3,4]],[[5,6],[7,8]]]) # jump ahead and witness an nd-array
array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])
In [15]: ## let's break it down
firstArray = np.array([1,2,3]) # you are passing a list of numbers to the function a

display(firstArray)
array([1, 2, 3])
In [16]: type(firstArray)
numpy.ndarray
In [17]: firstArray.dtype # this gives the type of the data within the ndarray it is int in t
dtype('int64')
In [19]: # let's create two more arrays for comparison
secondArray = np.array([[1.5,2.5],[3.5,4.5]])
display(secondArray)
secondArray.dtype
dtype('float64')
In [20]: modFirstArray = np.array([1,2,3,4.1]) # there are 3 ints and 1 float
display(modFirstArray)
modFirstArray.dtype
dtype('float64')
In [25]: # Attributes of arrays
# number of dimension, ndim, is the 1 for a vector, 2 for a matrix and can be any va
nDim1 = firstArray.ndim
nDim2 = secondArray.ndim
print('First array has {0} dimensions and second array has {1} dimensions'.format(nDim1, nDim2))
First array has 1 dimensions and second array has 2 dimensions
In [26]: # size of the array is the number of elements in the array
size1 = firstArray.size
size2 = secondArray.size
print('First array has {0} size and second array has {1} size'.format(size1,size2))
First array has 3 size and second array has 4 size
In [28]: # shape of the array is the size along each dimension
# it looks like a list of sizes, but it is actually a tuple
shape1 = firstArray.shape
shape2 = secondArray.shape
print('First array has {0} shape and second array has {1} shape'.format(shape1, shape2))
First array has (3,) shape and second array has (2, 2) shape
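The same three attributes can be read off the three-dimensional array created in In [13]. This is a small sketch; the variable name thirdArray is introduced here only for illustration.

```python
import numpy as np

# The 3-d array from In [13]; the name thirdArray is hypothetical
thirdArray = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(thirdArray.ndim)   # number of dimensions
print(thirdArray.size)   # total number of elements
print(thirdArray.shape)  # size along each dimension, as a tuple
```

Note that size is the product of the entries in shape.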
Exercise 4: Repeat exercise 3 with arrays and without loops.

In [29]: # your code here
Data Frames
The final data structure we will study is the most important one - Data frames. The DataFrame is
the fundamental datatype in the pandas module. You can understand a DataFrame as building
on two data structures discussed above: arrays and dictionaries.
From a single array, we arrive at a pandas Series. A dictionary of arrays can be converted into a
DataFrame.
In [31]: import pandas as pd
import numpy as np
firstSeries = pd.Series([1,4,9,16]) # the Series() function in pandas module can be

display(firstSeries)
0     1
1     4
2     9
3    16
dtype: int64
In [33]: # compare this to firstArray
display(firstArray)
array([1, 2, 3])
In [36]: # notice that Series comes with an explicit index
# as a result a Series has two attributes - index and values
display(firstSeries.index)
RangeIndex(start=0, stop=4, step=1)
In [37]: display(firstSeries.values)
array([ 1, 4, 9, 16])
Notice that the values attribute of the Series is actually a numpy array. Essentially, a pandas Series
improves on a numpy array in two ways:
1. You can define the index whichever way you want.
2. (To be revisited in forthcoming sessions) A Series can hold heterogeneous datatypes.
In [39]: # we can give our own index. we will see this with the second series
textSeries = pd.Series(['Alice', 'Bob', 'Connor', 'Dana'], index = ['a','b','c','d'])

display(textSeries)
a     Alice
b       Bob
c    Connor
d      Dana
dtype: object
Exercise 5: Build the above Series using dictionaries.
In [40]: # your code here
Dataframe
It is an ordered collection of columns, each of which can contain values of a different type.
There are two index arrays:
1. Index associated with lines or rows. This is similar to the index in series.
2. Array of labels associated with each column.
You can consider a dataframe a dictionary of series. The key to each series is the column name.
The values are the series that make up each column. A dataframe holds rectangular data with
index and column names.
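The "dictionary of series" view can be sketched as below. The names are reused from textSeries above; the ages are invented for illustration, and the variable names are assumptions of this example.

```python
import pandas as pd

# Two Series sharing the same index (ages are made-up values)
names = pd.Series(['Alice', 'Bob', 'Connor', 'Dana'], index=['a', 'b', 'c', 'd'])
ages = pd.Series([31, 45, 27, 38], index=['a', 'b', 'c', 'd'])

# The dictionary keys become the column names
people = pd.DataFrame({'name': names, 'age': ages})
print(people)

# The two index arrays: row labels and column labels
print(list(people.index))
print(list(people.columns))

# Indexing with a column name returns that column as a Series
ageColumn = people['age']
print(type(ageColumn))
```

Indexing the DataFrame by a column name behaves like indexing a dictionary by a key, which is why the dictionary analogy is a useful way to think about it.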
Exercise 6: Build a DataFrame that best holds the marks of the five students.
In [41]: # Your code here
Key Takeaways
1. Python grows as new modules build on old modules to increase efficiency and
effectiveness.
2. Lists, dictionaries, arrays, Series and DataFrame all build on the previous datatypes and help
store more complex datasets while making operations on them more efficient.
3. Lists make iteration easy. Storing multiple values under a single variable is a boon.
4. Dictionaries help make indexing more meaningful and flexible.
5. Arrays make iteration unnecessary for mathematical operations.
6. Series and DataFrame take the best from all three worlds and help store complex datasets
and work on them.
We will compare each of these datatypes on how they enable working with data and where each
is useful.
