
Data Structures in Python

In the last session, we introduced the primary data structure in Python: lists. We saw that a list
lets you store multiple values under a single variable, and that we can then iterate over those
values with a for loop. As a result, we can apply the same operations to many values using a small
set of commands. The list data structure keeps related values together and reduces the length of
the code. Much of the convenience of Python rests on this data structure.

Today, we will see three other data structures that build on lists and allow you to do the
same operations even more efficiently. These data structures are:

1. Dictionaries
2. Arrays
3. Data Frames

Dictionaries
Just like a list, a dictionary is a collection of many values. The main difference between lists and
dictionaries is the way indexing works. Recall that lists are ordered sequences of
values. This means that the list [1, 2] is not the same as the list [2, 1].
Because they are ordered sequences, elements of a list are indexed with integers: 0, 1, 2, etc.
counting from the start, and -1, -2, -3, etc. counting from the end.

In contrast, an index in a dictionary can be of many types, for example a string, a number, or a
tuple (any hashable object). An index in a dictionary is called a key.
Within a dictionary, each key is associated with a value, creating a key-value pair.

Ok. We are ready to create our first dictionary.

In [2]: firstDict = {'name': 'Sunil', 'age': 38, 'Address': 'RNo. 9'}

In [3]: firstDict['Address']

'RNo. 9'

In [4]: 'The ' + str(firstDict['age']) + ' year old professor, ' + firstDict['name'] + ' has office in ' + firstDict['Address']

'The 38 year old professor, Sunil has office in RNo. 9'

In [5]: secondDict = {'age': 38, 'name': 'Sunil', 'Address': 'RNo. 9'}

firstDict == secondDict

True

In [6]: a = [1,2,3]

b = [3,2,1]

a==b

False

In [7]: secondDict['specialization']

---------------------------------------------------------------------------

Traceback (most recent call last)

at line 1 in <module>

KeyError: 'specialization'
The KeyError when working with dictionaries is similar to IndexError when working with lists.
KeyError means that you have used a non-existing key as an index in a dictionary. IndexError in
the same way means you used a non-existing index in a list.
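
A minimal sketch of two common ways to avoid a KeyError, using the same contents as secondDict above. The names hasSpecialization and specialization, and the fallback text 'not recorded', are made up for this illustration.

```python
secondDict = {'age': 38, 'name': 'Sunil', 'Address': 'RNo. 9'}

# Option 1: test for the key with the `in` operator before indexing
hasSpecialization = 'specialization' in secondDict

# Option 2: dict.get() returns a default value instead of raising KeyError
specialization = secondDict.get('specialization', 'not recorded')

print(hasSpecialization)
print(specialization)
```

The `in` test is handy when a missing key needs separate handling; `get()` is shorter when a default value is good enough.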

Lists and Dictionaries


Lists and dictionaries are connected in many ways:

1. You can have a dictionary of lists.
2. You can convert certain kinds of lists to dictionaries.
3. You can convert dictionaries to lists and run for loops on elements of dictionaries.
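
The three connections listed above can be sketched as follows. The city names and codes are hypothetical data chosen only for illustration, and the variable names are not from the session.

```python
# Hypothetical data, chosen only for illustration
cities = ['Delhi', 'Mumbai', 'Chennai']
codes = ['DEL', 'BOM', 'MAA']

# 1. A dictionary of lists: each key maps to a whole list
travel = {'cities': cities, 'codes': codes}

# 2. Two lists of equal length converted to a dictionary with zip()
codeOf = dict(zip(cities, codes))

# 3. A dictionary converted to lists, and a for loop over its key-value pairs
keys = list(codeOf.keys())
pairs = list(codeOf.items())
for city, code in codeOf.items():
    print(city, code)
```

Here zip() pairs the first city with the first code, the second with the second, and so on, which is why the two lists need to line up.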

In-class exercise:
The following table shows the marks of five students in five subjects:

Name     English   Hindi   Mathematics   Science   Social
Anand    65        78      85            58        72
Bhanu    83        64      74            94        65
Chetna   47        84      74            59        82
Durga    57        59      95            78        49
Eshwar   78        65      84            68        65

Exercise 1: Get this data into your Python environment using lists and using dictionaries.

In [9]: ## your code here

Exercise 2: Now create lists that can be converted to a dictionary that stores the above table.

In [10]: ## your code here

Exercise 3: Iterate over the dictionary above to calculate average scores.

In [11]: ## your code here

Arrays
Notice that we have to iterate over the lists to make even simple calculations. For large amounts
of data this is slow: a Python loop handles one value at a time, and the interpreter repeats its
bookkeeping at every step. There are more efficient ways of doing mathematical operations on many
values at once. We do not need to implement them ourselves, because they are already
written as functions in various modules. The most widely used module for numerical operations is
numpy. The fundamental datatype in numpy is the ndarray or the n-dimensional array.
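
As a rough sketch of the difference, the block below computes the same average twice: once with an explicit loop over a list and once with a single numpy call. The values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
values = [10, 20, 30, 40]

# Loop version: add up the values one at a time
total = 0
for v in values:
    total += v
loopMean = total / len(values)

# numpy version: one method call, no explicit loop
arrayMean = np.array(values).mean()

# Arithmetic also applies to every element at once
doubled = np.array(values) * 2

print(loopMean, arrayMean)
print(doubled)
```

Both versions give the same answer. The numpy version is shorter, and on large arrays it is typically much faster because the looping happens inside compiled code.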

NdArray
The numpy module is based on one main object: the ndarray, which stands for N-dimensional array.
It is a

1. multidimensional (1-d is a vector, 2-d is a matrix, n-d is what?)
2. homogeneous array (all the elements are of the same type)
3. with a pre-determined number of dimensions.

Let's code!!!

In [12]: import numpy as np

In [13]: np.array([[[1,2],[3,4]],[[5,6],[7,8]]]) # jump ahead and witness an nd-array

array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])

In [15]: ## let's break it down

firstArray = np.array([1,2,3]) # pass a list of numbers to np.array()

display(firstArray)

array([1, 2, 3])

In [16]: type(firstArray)

numpy.ndarray

In [17]: firstArray.dtype # this gives the type of the data within the ndarray; here it is int64

dtype('int64')

In [19]: # let's create two more arrays for comparison

secondArray = np.array([[1.5,2.5],[3.5,4.5]])

display(secondArray)

secondArray.dtype

dtype('float64')

In [20]: modFirstArray = np.array([1,2,3,4.1]) # there are 3 ints and 1 float

display(modFirstArray)

modFirstArray.dtype

dtype('float64')

In [25]: # Attributes of arrays

# number of dimensions, ndim, is 1 for a vector, 2 for a matrix, and so on

nDim1 = firstArray.ndim

nDim2 = secondArray.ndim

print('First array has {0} dimensions and second array has {1} dimensions'.format(nDim1, nDim2))

First array has 1 dimensions and second array has 2 dimensions

In [26]: # size of the array is the number of elements in the array

size1 = firstArray.size

size2 = secondArray.size

print('First array has {0} size and second array has {1} size'.format(size1,size2))

First array has 3 size and second array has 4 size

In [28]: # shape of the array is the size along each dimension

# it is a sequence of sizes, stored as a tuple

shape1 = firstArray.shape

shape2 = secondArray.shape

print('First array has {0} shape and second array has {1} shape'.format(shape1, shape2))

First array has (3,) shape and second array has (2, 2) shape
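
The three attributes are related, and the sketch below checks that relation on the 3-d array created at the start of this section. The name thirdArray is introduced here only for illustration.

```python
import numpy as np

# The 3-d array from the start of this section, stored under a new name
thirdArray = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(thirdArray.ndim)   # number of dimensions
print(thirdArray.shape)  # size along each dimension, as a tuple
print(thirdArray.size)   # total number of elements

# ndim is the length of shape; size is the product of the entries of shape
assert thirdArray.ndim == len(thirdArray.shape)
assert thirdArray.size == np.prod(thirdArray.shape)
```

So shape carries the most information: ndim and size can both be worked out from it.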

Exercise 4: Repeat exercise 3 with arrays and without loops.


In [29]: # your code here

Data Frames
The final data structure we will study is the most important one: the DataFrame. The DataFrame is
the fundamental datatype in the pandas module. You can understand a DataFrame as building
on two data structures discussed above: arrays and dictionaries.

From a single array, we arrive at the pandas Series. A dictionary of arrays can be converted into a
DataFrame.

In [31]: import pandas as pd

import numpy as np

firstSeries = pd.Series([1,4,9,16]) # pd.Series() builds a Series, here from a list of numbers

display(firstSeries)

0     1
1     4
2     9
3    16
dtype: int64

In [33]: # compare this to firstArray

display(firstArray)

array([1, 2, 3])

In [36]: # notice that a Series comes with an explicit index

# as a result, a Series has two attributes - index and values

display(firstSeries.index)

RangeIndex(start=0, stop=4, step=1)

In [37]: display(firstSeries.values)

array([ 1, 4, 9, 16])
Notice that the values attribute of the Series is actually a numpy array. Essentially, a pandas Series
improves on the numpy array in two ways:

1. You can define the index in whichever way you want.
2. (To be revisited in forthcoming sessions) A Series can hold heterogeneous datatypes.

In [39]: # we can give our own index. we will see this with the second series

textSeries = pd.Series(['Alice', 'Bob', 'Connor', 'Dana'], index = ['a','b','c','d'])

display(textSeries)

a     Alice
b       Bob
c    Connor
d      Dana
dtype: object
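
A short sketch of what a custom index makes possible: values can be looked up by label, much like dictionary keys, while positional access stays available. It rebuilds the same textSeries so that it runs on its own.

```python
import pandas as pd

textSeries = pd.Series(['Alice', 'Bob', 'Connor', 'Dana'], index=['a', 'b', 'c', 'd'])

# Look up a value by its label, much like a dictionary key
print(textSeries['b'])

# Positional access is still available through .iloc
print(textSeries.iloc[1])

# Labels can be tested with `in`, again like dictionary keys
print('c' in textSeries)
```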
Exercise 5: Build the above Series using dictionaries.

In [40]: # your code here

DataFrame
A DataFrame is an ordered collection of columns, each of which can hold values of a different type.
It has two index arrays:
1. The index associated with the rows. This is similar to the index in a Series.
2. The array of labels associated with the columns.

You can consider a DataFrame a dictionary of Series. The key to each Series is the column name.
The values are the Series that make up each column. A DataFrame holds rectangular data with
an index and column names.
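
The "dictionary of Series" view can be sketched directly. The names, ages and variable names below are hypothetical data chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
names = pd.Series(['Alice', 'Bob', 'Connor'], index=['a', 'b', 'c'])
ages = pd.Series([30, 25, 41], index=['a', 'b', 'c'])

# Each dictionary key becomes a column name; the shared index labels the rows
people = pd.DataFrame({'name': names, 'age': ages})

print(people)
print(people.index)    # row labels
print(people.columns)  # column labels
print(people['age'])   # a single column comes back as a Series
```

Indexing the DataFrame with a column name works like indexing a dictionary with a key, and what comes back is the Series stored under that name.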

Exercise 6: Build a DataFrame that best holds the marks of the five students.

In [41]: # Your code here

Key Takeaways
1. Python grows as new modules build on old modules to increase efficiency and
effectiveness.
2. Lists, dictionaries, arrays, Series and DataFrames each build on the previous datatypes and help
store more complex datasets while making operations on them more efficient.
3. Lists make iteration easy. Storing multiple values under a single variable is a boon.
4. Dictionaries help make indexing more meaningful and flexible.
5. Arrays make iteration unnecessary for mathematical operations.
6. Series and DataFrame take the best from all three worlds and help store complex datasets
and work on them.

We will compare each of these datatypes on how they enable working with data and where each
is useful.
