
NumPy

One of the reasons that the Python language is extremely popular is that it makes writing
programs easy. Because Python is a high-level language, you don't have to worry about things
like allocating memory on your computer or choosing how certain operations are done by your
computer's processor. In contrast, when you use low-level languages like C, you define exactly
how memory will be managed and how the processor will execute your instructions. This means
that coding in a low-level language takes longer; however, you have more ability to optimize
your code to run faster.

Language Type    Example    Time taken to write program    Control over program performance
High-Level       Python     Low                            Low
Low-Level        C          High                           High

When choosing between a high- and low-level language, you make a trade-off between being
able to work quickly, and creating programs that run quickly and efficiently. Luckily, there are
two Python libraries that give us the best of both worlds: NumPy and pandas. Together, pandas
and NumPy provide a powerful toolset for working with data in Python because they allow us to
write code quickly without sacrificing performance.

Lists vs NumPy

While lists of lists are sufficient for working with small data sets, they aren't very good for working
with larger data sets. The NumPy library solves this problem.
Let's look at an example where we have two columns of data. Each row contains two numbers we
wish to add together. Using just Python, we could use a list of lists structure to store our data,
and use for loops to iterate over that data:
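
A minimal sketch of that pure-Python approach, assuming a small made-up data set named my_numbers (the name and the values are illustrative and do not come from the original data):

```python
# Hypothetical data: each row holds two numbers we want to add together.
my_numbers = [
    [6, 5],
    [1, 3],
    [5, 6],
    [1, 4],
    [3, 7],
    [5, 8],
    [3, 5],
    [8, 4],
]

sums = []
for row in my_numbers:
    # Each pass through the loop adds one pair of numbers.
    row_sum = row[0] + row[1]
    sums.append(row_sum)

print(sums)
```

The loop body runs once per row, so eight rows mean eight separate additions.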

In each iteration of our loop, the Python interpreter turns our code into bytecode, and the
bytecode asks our computer's processor to add the two numbers together:

Our computer would take eight processor cycles to process the eight rows of our data.
The NumPy library takes advantage of a processor feature called Single Instruction Multiple
Data (SIMD) to process data faster. SIMD allows a processor to perform the same operation, on
multiple data points, in a single processor cycle:

As a result, the NumPy version of our code would only take two processor cycles — a four times
speed-up! This concept of replacing for loops with operations applied to multiple data points at
once is called vectorization.

The core data structure in NumPy that makes vectorization possible is the ndarray, or
n-dimensional array. In programming, an array describes a collection of elements, similar to a list.
The word n-dimensional refers to the fact that ndarrays can have one or more dimensions. We'll
start by first working with one-dimensional (1D) ndarrays.

To use the NumPy library, we first need to import it into our Python environment. NumPy is
commonly imported using the alias np:
import numpy as np

Then, we can directly convert a list to an ndarray using the numpy.array() constructor. To create
a 1D ndarray, we can pass in a single list:

data_ndarray = np.array([5, 10, 15, 20])

print(type(data_ndarray))

O/P: <class 'numpy.ndarray'>

We used the syntax np.array() instead of numpy.array() because of our import numpy as np code.
When we introduce new syntax, we'll always use the full name to describe it, and you'll need to
substitute in the shorthand as appropriate.

Example:
In the last screen, we practiced creating a 1D ndarray. However, ndarrays can also be
two-dimensional. As we learn about two-dimensional (2D) ndarrays, we'll analyze taxi trip data
released by the city of New York.


.shape

It's often useful to know the number of rows and columns in an ndarray. When we can't easily
print the entire ndarray, we can use the ndarray.shape attribute instead:

data_ndarray = np.array([[5, 10, 15], [20, 25, 30]])

print(data_ndarray.shape)

O/P: (2, 3)
We can select rows in ndarrays very similarly to lists of lists. In reality, what we're seeing is a
kind of shortcut. For any 2D array, the full syntax for selecting data is:

ndarray[row_index, column_index]

# or, if you want to select all columns for a given set of rows
ndarray[row_index]

Here, row_index defines the location along the row axis and column_index defines the location
along the column axis.

Like lists, array slicing is from the first specified index up to — but not including — the second
specified index. For example, to select the items at index 1, 2, and 3, we'd need to use the
slice [1:4].

This is how we select a row, a range of rows, and a single item from a 2D ndarray:


row_0 = taxi[0]

rows_391_to_500 = taxi[391:501]

row_21_column_5 = taxi[21,5]
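
To see these selections on something small enough to print, here is a sketch that assumes a made-up array named small with four rows and three columns (a stand-in for taxi, not part of the taxi data):

```python
import numpy as np

# Hypothetical array used in place of the taxi data.
small = np.array([[ 0,  1,  2],
                  [10, 11, 12],
                  [20, 21, 22],
                  [30, 31, 32]])

first_row = small[0]          # a single row, returned as a 1D ndarray
rows_1_to_2 = small[1:3]      # rows at index 1 and 2 (index 3 is excluded)
row_2_column_1 = small[2, 1]  # one item: row index 2, column index 1

print(first_row)
print(rows_1_to_2)
print(row_2_column_1)
```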

How to select one or more columns of data:

With a list of lists, we need to use a for loop to extract specific column(s) and append them back
to a new list. With ndarrays, the process is much simpler. We again use single brackets with
comma-separated row and column locations, but we use a colon (:) for the row locations, which
gives us all of the rows.

If we want to select a partial 1D slice of a row or column, we can combine a single value for one
dimension with a slice for the other dimension:

Example:

columns_1_4_7 = taxi[:,[1,4,7]]

row_99_columns_5_to_8 = taxi[99,5:9]

rows_100_to_200_column_14 = taxi[100:201,14]
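
The same column selections can be tried on a small array. This is a sketch that assumes a made-up array named small with three rows and four columns, not the taxi data:

```python
import numpy as np

# Hypothetical array used in place of the taxi data.
small = np.array([[ 0,  1,  2,  3],
                  [10, 11, 12, 13],
                  [20, 21, 22, 23]])

column_1 = small[:, 1]                 # all rows, column index 1 (1D result)
columns_0_and_3 = small[:, [0, 3]]     # all rows, columns 0 and 3 (2D result)
row_1_columns_1_to_2 = small[1, 1:3]   # one row, a slice of columns (1D result)

print(column_1)
print(columns_0_and_3)
print(row_1_columns_1_to_2)
```

Note that selecting a single column returns a 1D ndarray, while selecting a list of columns returns a 2D ndarray.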
Vectors:

Earlier, we only talked about how vectorized operations make our code faster; however,
vectorized operations also make our code simpler to write and read. Here's how we would perform
the same task as the for loop above with vectorized operations:
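
A minimal sketch of the vectorized version, assuming a small made-up two-column ndarray named my_numbers (the data is illustrative; the names col1 and col2 match the observations that follow):

```python
import numpy as np

# Hypothetical two-column data.
my_numbers = np.array([[6, 5],
                       [1, 3],
                       [5, 6],
                       [1, 4]])

# Select each column as a 1D ndarray.
col1 = my_numbers[:, 0]
col2 = my_numbers[:, 1]

# Add the two columns element by element, with no explicit loop.
sums = col1 + col2
print(sums)
```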

Here are some key observations about this code:

• When we selected each column, we used the syntax ndarray[:,c] where c is the column
  index we wanted to select. As we saw earlier, the colon selects all rows.
• To add the two 1D ndarrays, col1 and col2, we simply use the addition operator (+)
  between them.

The result of adding two 1D ndarrays is a 1D ndarray of the same shape (or dimensions) as the
original. In this context, ndarrays can also be called vectors, a term taken from a branch of
mathematics called linear algebra. What we just did, adding two vectors together, is
called vector addition.

Example:

fare_amount = taxi[:,9]

fees_amount = taxi[:,10]

fare_and_fees = fare_amount + fees_amount

We used vector addition to add two columns, or vectors, together. We can actually use any of the
standard Python numeric operators with vectors, including:

• vector_a + vector_b: Addition
• vector_a - vector_b: Subtraction
• vector_a * vector_b: Multiplication (element by element; this is unrelated to the vector
  multiplication used in linear algebra)
• vector_a / vector_b: Division

When we perform these operations on two 1D vectors, both vectors must have the same shape.
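
As a quick illustration of these operators, here is a sketch using two made-up vectors of the same shape (the values are hypothetical):

```python
import numpy as np

# Two hypothetical vectors with the same shape, (3,).
vector_a = np.array([10.0, 20.0, 30.0])
vector_b = np.array([2.0, 4.0, 5.0])

added = vector_a + vector_b       # elements: 12, 24, 35
subtracted = vector_a - vector_b  # elements: 8, 16, 25
multiplied = vector_a * vector_b  # elements: 20, 80, 150
divided = vector_a / vector_b     # elements: 5, 5, 6

print(added, subtracted, multiplied, divided)
```

Each operation works element by element. If the two 1D vectors had different lengths, such as 3 and 2, NumPy would raise a ValueError instead of returning a result.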
Let's look at another example from our taxi data set. Here are the first five rows of two of the
columns in the data set:

trip_distance    trip_length
21.00            2037.0
16.29            1520.0
12.70            1462.0
8.70             1210.0
5.56             759.0

Let's use these columns to calculate the average travel speed of each trip in miles per hour. The
formula for calculating miles per hour is:

miles per hour = distance in miles ÷ length in hours

The trip_distance column is already expressed in miles, but trip_length is expressed in seconds.
First, we'll convert trip_length into hours by dividing by 3600. Then we'll perform vector
division again to calculate the miles per hour:

trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

trip_mph = trip_distance_miles / trip_length_hours
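
The same calculation can be tried without the full data set. This sketch types in the five rows from the table above by hand instead of selecting them from taxi:

```python
import numpy as np

# The five rows shown in the table above, entered by hand.
trip_distance_miles = np.array([21.00, 16.29, 12.70, 8.70, 5.56])
trip_length_seconds = np.array([2037.0, 1520.0, 1462.0, 1210.0, 759.0])

trip_length_hours = trip_length_seconds / 3600  # 3600 seconds is one hour
trip_mph = trip_distance_miles / trip_length_hours

print(trip_mph.round(1))
```

The first trip, for example, covers 21 miles in 2037 seconds, which works out to roughly 37 miles per hour.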

A few NumPy ndarray methods:

NumPy ndarrays have methods for many different calculations. A few key methods are:

• ndarray.min() to calculate the minimum value
• ndarray.max() to calculate the maximum value
• ndarray.mean() to calculate the mean or average value
• ndarray.sum() to calculate the sum of the values

You can see the full list of ndarray methods in the NumPy ndarray documentation:

https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation

It's important to get comfortable with the documentation because it's not possible to remember
the syntax for every variation of every data science library. However, if you remember what is
possible and can read the documentation, you'll always be able to refamiliarize yourself with it.

Whenever you see the syntax ndarray.method_name(), substitute ndarray with the name of your
ndarray (in this case, trip_mph), as shown below.

Calculate the minimum, maximum, and mean (average) speed from the trip_mph ndarray:

mph_min = trip_mph.min()

mph_max = trip_mph.max()

mph_mean = trip_mph.mean()
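
For a version with values small enough to check by hand, here is a sketch that assumes a made-up array named speeds:

```python
import numpy as np

# Hypothetical speeds in miles per hour.
speeds = np.array([10.0, 20.0, 30.0, 40.0])

speed_min = speeds.min()    # smallest value: 10.0
speed_max = speeds.max()    # largest value: 40.0
speed_mean = speeds.mean()  # average: 25.0
speed_sum = speeds.sum()    # total: 100.0

print(speed_min, speed_max, speed_mean, speed_sum)
```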

Functions vs Methods:

Let's review the difference between methods and functions. Functions act as stand-alone
segments of code that usually take an input, perform some processing, and return some output.
For example, we can use the len() function to calculate the length of a list or the number of
characters in a string.

my_list = [21,14,91]

print(len(my_list))

O/p: 3

my_string = 'Dataquest'

print(len(my_string))

O/p: 9

In contrast, methods are special functions that belong to a specific type of object. This means
that, for instance, when we work with list objects, there are special functions or methods that can
only be used with lists. For example, we can use the list.append() method to add an item to the
end of a list. If we try to use that method on a string, we will get an error:
my_string.append(' is the best!')

O/p:

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

AttributeError: 'str' object has no attribute 'append'

In NumPy, sometimes there are operations that are implemented as both methods and functions,
which can be confusing. Let's look at some examples:

Calculation                                       Function Representation    Method Representation
Calculate the minimum value of trip_mph           np.min(trip_mph)           trip_mph.min()
Calculate the maximum value of trip_mph           np.max(trip_mph)           trip_mph.max()
Calculate the mean average value of trip_mph      np.mean(trip_mph)          trip_mph.mean()
Calculate the median average value of trip_mph    np.median(trip_mph)        There is no ndarray median method

To remember the right terminology, anything that starts with np (e.g. np.mean() ) is a function and
anything expressed with an object (or variable) name first (e.g. trip_mph.mean() ) is a method.
When both exist, it's up to you to decide which to use, but it's much more common to use the method
approach.
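
The equivalence between the two forms can be checked directly. This sketch assumes a made-up array named speeds in place of trip_mph:

```python
import numpy as np

speeds = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical values

# The function form and the method form return the same result.
same_min = np.min(speeds) == speeds.min()
same_max = np.max(speeds) == speeds.max()
same_mean = np.mean(speeds) == speeds.mean()

# The median is only available as a function.
median_speed = np.median(speeds)               # 25.0
has_median_method = hasattr(speeds, "median")  # False

print(same_min, same_max, same_mean, median_speed, has_median_method)
```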

axis parameter:

Next, we'll calculate statistics for 2D ndarrays. If we use the ndarray.max() method on a 2D
ndarray without any additional parameters, it will return a single value, just like with a 1D array:

But what if we wanted to find the maximum value of each row? We'd need to use
the axis parameter and specify a value of 1 to indicate we want to calculate the maximum value
for each row.

If we want to find the maximum value of each column, we'd use an axis value of 0:
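
The three cases can be compared side by side. This sketch assumes a made-up array named data with two rows and three columns:

```python
import numpy as np

# Hypothetical array with two rows and three columns.
data = np.array([[1, 9, 3],
                 [7, 2, 8]])

overall_max = data.max()       # single value for the whole array: 9
row_max = data.max(axis=1)     # one value per row: 9 and 8
column_max = data.max(axis=0)  # one value per column: 7, 9 and 8

print(overall_max)
print(row_max)
print(column_max)
```

The result for axis=1 has one entry per row, and the result for axis=0 has one entry per column.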
Example:
Let's use what we've learned to check the data in our taxi data set. To remind ourselves of what
the data looks like, let's look at the first five rows of the columns with indexes 9 through 13:

You may have noticed that the sum of the first four values in each row should equal the last
value, total_amount:

Write code to check these values. We'll only review the first five rows in taxi so that we can
more easily verify the results.
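
One possible way to write this check is sketched below. It assumes a made-up five-column array named first_five that stands in for columns 9 through 13 of the first five rows; the values are invented for illustration and are not the real taxi values:

```python
import numpy as np

# Hypothetical stand-in for taxi[:5, 9:14]: four component amounts
# followed by total_amount. These are not the real taxi values.
first_five = np.array([[10.0, 0.5, 1.0, 2.0, 13.5],
                       [20.0, 0.5, 0.0, 4.0, 24.5],
                       [ 8.0, 1.0, 0.0, 0.0,  9.0],
                       [15.0, 0.5, 5.0, 3.0, 23.5],
                       [30.0, 1.0, 0.0, 6.0, 37.0]])

components = first_five[:, :4]           # the first four columns
component_sums = components.sum(axis=1)  # one sum per row
totals = first_five[:, 4]                # the last column

# Compare with a tolerance, since these are floating-point values.
matches = np.isclose(component_sums, totals)
print(matches)
```

With the real data, the same idea would use taxi[:5, 9:13].sum(axis=1) for the component sums and taxi[:5, 13] for total_amount.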
Boolean Indexing:

In the previous mission, we learned how to use NumPy and vectorized operations to analyze taxi
trip data from the city of New York. We learned that NumPy makes it quick and easy to select
data, and includes a number of functions and methods that make it easy to calculate statistics
across the different axes (or dimensions).

However, what if we also wanted to find out how many trips were taken in each month? Or
which airport is the busiest? For this, we will learn a new technique: Boolean Indexing. Before
we dive into what boolean indexing is and how it can help us, let's refamiliarize ourselves with
our data.
Here are the first 5 rows of our data with column labels:

There is a simpler way to import the data, one that is much quicker and more efficient than the
method we used before.

Now that we understand NumPy a little better, let's learn how to use the numpy.genfromtxt()
function to read files into NumPy ndarrays. Here is the simplified syntax for the function, and an
explanation of the two parameters:

np.genfromtxt(filename, delimiter=None)

• filename: A positional argument, usually a string representing the path to the text file to
  be read.
• delimiter: A named argument, specifying the string used to separate each value.

In this case, because we have a CSV file, the delimiter is a comma. Here's how we'd read
in a file named data.csv:

data = np.genfromtxt('data.csv', delimiter=',')

Let's read our nyc_taxis.csv file into NumPy next:

import numpy as np

taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')

taxi_shape = taxi.shape

Earlier Method:

import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

taxi = np.array(converted_taxi_list)

You may not have noticed in the last mission that we converted all the values to floats before we
converted the list of lists to an ndarray. That's because NumPy ndarrays can contain only one
datatype.

We didn't have to complete this step in the last exercise, because numpy.genfromtxt() converts
the values to floats as it reads the file (its dtype parameter defaults to float).
We can use the ndarray.dtype attribute to see the internal datatype that has been used.

print(taxi.dtype)
O/p: float64

NumPy used the float64 type, which allows most of the values from our CSV to be read.
You can think of NumPy's float64 type as being identical to Python's float type (the "64" refers
to the number of bits used to store the underlying value).
If we review the results from the last exercise, we can see that taxi contains almost all numbers
except for a value that we haven't seen before: nan.
NaN is an acronym for Not a Number. It literally means that the value cannot be stored as a
number. It is similar to (and often referred to as) a null value, like Python's None constant.

NaN is most commonly seen when a value is missing, but in this case, we have NaN values
because the first line from our CSV file contains the names of each column. NumPy is unable to
convert string values like pickup_year into the float64 data type.

For now, we need to remove this header row from our ndarray. We can do this the same way we
would if our data was stored in a list of lists:

taxi = taxi[1:]

Alternatively, we can pass an additional parameter, skip_header, to the numpy.genfromtxt()
function. The skip_header parameter accepts an integer, the number of rows from the start of
the file to skip. Note that because this integer is the number of rows and not an index,
skipping the first row requires a value of 1, not 0.

Example:

taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)

taxi_shape = taxi.shape
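
The effect of skip_header can be seen without the real file. This sketch writes a tiny made-up CSV to a temporary file (the file contents are hypothetical) and reads it both ways:

```python
import os
import tempfile

import numpy as np

# Write a tiny, hypothetical CSV file with a header row.
csv_text = "pickup_year,pickup_month\n2016,1\n2016,2\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(csv_text)
    path = f.name

# Without skip_header, the header row is read as a row of nan values.
with_header = np.genfromtxt(path, delimiter=",")

# With skip_header=1, the first line of the file is skipped.
without_header = np.genfromtxt(path, delimiter=",", skip_header=1)

os.remove(path)  # clean up the temporary file

print(with_header.shape)     # (3, 2)
print(without_header.shape)  # (2, 2)
```

The array read with skip_header=1 has one row fewer and contains no nan values.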
Boolean Arrays:

A boolean array, as the name suggests, is an array of boolean values. Boolean arrays are
sometimes called boolean vectors or boolean masks.

You may recall that the boolean (or bool) type is a built-in Python type that can be one of two
unique values:

• True
• False

You may also remember that we've used boolean values when working with Python comparison
operators like == (equal), > (greater than), < (less than), and != (not equal).
When we explored vector math in the first mission, we learned that an operation between an
ndarray and a single value results in a new ndarray:

print(np.array([2,4,6,8]) + 10)

O/p: [12 14 16 18]

The + 10 operation is applied to each value in the array.

Now, let's look at what happens when we perform a boolean operation between an ndarray and a
single value:

print(np.array([2,4,6,8]) < 5)

O/p: [ True True False False]

Example:

a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])

c = np.array([80.0, 103.4, 96.9, 200.3])

a_bool = a < 3

b_bool = b == "blue"

c_bool = c > 100

O/P:
a_bool: [ True  True False False False]
b_bool: [ True  True False  True]
c_bool: [False  True False  True]

Boolean Indexing:

In the last screen, we learned how to create boolean arrays using vectorized boolean operations.
Next, we'll learn how to index (or select) using boolean arrays, known as boolean indexing.
Let's use one of the examples from the previous screen.
To index using our new boolean array, we simply insert it in the square brackets, just like we
would do with our other selection techniques:

The boolean array acts as a filter, so that the values corresponding to True become part of the
result and the values corresponding to False are removed.
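
A minimal sketch of this filtering, reusing the array c and the comparison c > 100 from the earlier example:

```python
import numpy as np

c = np.array([80.0, 103.4, 96.9, 200.3])
c_bool = c > 100  # False, True, False, True

# Only the values where c_bool is True are kept.
result = c[c_bool]
print(result)
```

The result contains the two values greater than 100, in their original order.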

Let's use boolean indexing to confirm the number of taxi rides in our data set from the month of
January. First, let's select just the pickup_month column, which is the second column in the
ndarray:

pickup_month = taxi[:,1]

Next, we use a boolean operation to make a boolean array, where the value 1 corresponds to
January:

january_bool = pickup_month == 1

Then we use the new boolean array to select only the items from pickup_month that have a value
of 1:

january = pickup_month[january_bool]

Finally, we use the .shape attribute to find out how many items are in our january ndarray, which
is equal to the number of taxi rides from the month of January. We'll use [0] to extract the value
from the tuple returned by .shape:
january_rides = january.shape[0]

print(january_rides)

O/p: 13481

Example:

pickup_month = taxi[:,1]

january_bool = pickup_month == 1

january = pickup_month[january_bool]

january_rides = january.shape[0]

february_bool = pickup_month == 2

february = pickup_month[february_bool]

february_rides = february.shape[0]

Boolean Indexing:

When working with 2D ndarrays, you can use boolean indexing in combination with any of the
indexing methods we learned in the previous mission. The only limitation is that the boolean
array must have the same length as the dimension you're indexing. Let's look at some examples:
Because a boolean array contains no information about how it was created, we can use a boolean
array made from just one column of our array to index the whole array.
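
A sketch of these combinations, assuming a made-up array named arr with four rows and three columns (not the taxi data):

```python
import numpy as np

# Hypothetical array with four rows and three columns.
arr = np.array([[ 1,  2,  3],
                [ 4,  5,  6],
                [ 7,  8,  9],
                [10, 11, 12]])

# Built from one column, so it has one value per row (length 4).
bool_rows = arr[:, 0] > 5  # False, False, True, True

# One value per column (length 3).
bool_columns = np.array([True, False, True])

selected_rows = arr[bool_rows]           # rows at index 2 and 3
selected_columns = arr[:, bool_columns]  # columns at index 0 and 2
rows_one_column = arr[bool_rows, 1]      # column 1 of the selected rows

print(selected_rows)
print(selected_columns)
print(rows_one_column)
```

Using bool_columns as a row index, as in arr[bool_columns], would raise an IndexError, because its length (3) does not match the number of rows (4).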
Let's use what we've learned to analyze the average speed of trips. In the previous mission, we
calculated the maximum trip speed to be 82,000 mph, which we know is definitely not accurate.
Let's verify if there are any issues with the data. Recall that we calculated the average travel
speed as follows:

# calculate the average speed

trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

Next, we'll check for trips with an average speed greater than 20,000 mph:

# create a boolean array for trips with average

# speeds greater than 20,000 mph

trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for

# those trips, and the pickup_location_code,

# dropoff_location_code, trip_distance, and

# trip_length columns

trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)
O/p:

[[ 2 2 23 1]

[ 2 2 19.6 1]

[ 2 2 16.7 2]

[ 3 3 17.8 2]

[ 2 2 17.2 2]

[ 3 3 16.9 3]

[ 2 2 27.1 4]]

We can see from the last column that all of these are very short rides: every trip_length value is 4
seconds or less, which does not reconcile with the trip distances, all of which are more than 16 miles.

Let's use this technique to examine the rows that have the highest values for the tip_amount column:

tip_amount = taxi[:,12]

tip_bool = tip_amount > 50

top_tips = taxi[tip_bool, 5:14]

So far, we've learned how to retrieve data from ndarrays. Next, we'll use the same indexing techniques
we've already learned to modify values within an ndarray. The syntax we'll use (in pseudocode) is:

ndarray[location_of_values] = new_value
Let's take a look at what that looks like in actual code. With our 1D array, we can specify one specific
index location:

a = np.array(['red','blue','black','blue','purple'])

a[0] = 'orange'

print(a)

O/p: ['orange' 'blue' 'black' 'blue' 'purple']

a[3:] = 'pink'

print(a)

O/p: ['orange' 'blue' 'black' 'pink' 'pink']

With a 2D ndarray, just like with a 1D ndarray, we can assign one specific index location:

ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])

ones[1,2] = 99

print(ones)

O/p:

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]

We can also assign a whole row:

ones[0] = 42

print(ones)

O/p:

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]

...or a whole column:

ones[:,2] = 0

print(ones)

O/p:

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]
Boolean arrays become very powerful when we use them for assignment. Let's look at an example:

a2 = np.array([1, 2, 3, 4, 5])

a2_bool = a2 > 2

a2[a2_bool] = 99

print(a2)

O/p: [ 1 2 99 99 99]

# this creates a copy of our taxi ndarray

taxi_copy = taxi.copy()

total_amount = taxi_copy[:,13]

total_amount[total_amount < 0] = 0

2D Array:

Next, we'll look at an example of assignment using a boolean array with two dimensions:
The b > 4 boolean operation produces a 2D boolean array which then controls the values that the assignment
applies to.
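
A minimal sketch of this kind of assignment, assuming a small made-up array named b (the values are illustrative):

```python
import numpy as np

# Hypothetical array named b.
b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

b_bool = b > 4  # 2D boolean array with the same shape as b
b[b_bool] = 99  # every value greater than 4 becomes 99
print(b)
```

Only the five values greater than 4 are replaced; the rest of b is left as it was.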

We can also use a 1D boolean array to perform assignment on a 2D array:

The c[:,1] > 2 boolean operation compares just one column's values and produces a 1D boolean array. We
then use that boolean array as the row index for assignment, and 1 as the column index to specify the second
column. Our boolean array is only applied to the second column, while all other values remain
unchanged.
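
A sketch of that assignment, assuming a small made-up array named c and an assigned value of 99 (both are illustrative):

```python
import numpy as np

# Hypothetical array named c.
c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

c_bool = c[:, 1] > 2  # compares column index 1: False, True, True
c[c_bool, 1] = 99     # assigns only in column index 1 of the matching rows
print(c)
```

The first and third columns are untouched, and so is the first row, where the comparison was False.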

The pseudocode syntax for this code is as follows, first using an intermediate variable:

bool = array[:, column_for_comparison] == value_for_comparison

array[bool, column_for_assignment] = new_value

and then all in one line:

array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value


Example:

We have created a new copy of our taxi dataset, taxi_modified with an additional column containing the
value 0 for every row.

In our new column at index 15, assign the value 1 if the pickup_location_code (column index 5) corresponds
to an airport location, leaving the value as 0 otherwise by performing these three operations:

For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column
index 15.

For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to
column index 15.

For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to
column index 15.

# create a new column filled with `0`.

zeros = np.zeros([taxi.shape[0], 1])

taxi_modified = np.concatenate([taxi, zeros], axis=1)

print(taxi_modified)

taxi_modified[taxi_modified[:, 5] == 2, 15] = 1

taxi_modified[taxi_modified[:, 5] == 3, 15] = 1

taxi_modified[taxi_modified[:, 5] == 5, 15] = 1
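
The three assignments can also be combined into one. The sketch below is an alternative that the solution above does not use: it relies on numpy.isin() and assumes a small made-up array named trips, where column 0 stands in for pickup_location_code and column 1 for the new flag column:

```python
import numpy as np

# Hypothetical data: column 0 stands in for pickup_location_code,
# column 1 is the new flag column, initially 0.
trips = np.array([[2.0, 0.0],
                  [4.0, 0.0],
                  [3.0, 0.0],
                  [5.0, 0.0],
                  [6.0, 0.0]])

airport_codes = [2, 3, 5]  # JFK, LaGuardia, Newark

# np.isin returns True where the value matches any of the codes.
is_airport = np.isin(trips[:, 0], airport_codes)
trips[is_airport, 1] = 1
print(trips[:, 1])
```

The flag column ends up as 1 for the rows with codes 2, 3 and 5, and stays 0 for the others, which is the same outcome the three separate assignments produce.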

Boolean Indexing

Challenge 1:

In this challenge, we want to figure out which airport is the most popular destination in our data set. To
do that, we'll use boolean indexing to create three filtered arrays and then look at how many rows are in
each array.
To complete this task, we'll need to check if the dropoff_location_code column (column index 6) is equal
to one of the following values:

2: JFK Airport

3: LaGuardia Airport

5: Newark Airport.

Instructions:

1) Using the original taxi ndarray, calculate how many trips had JFK Airport as their destination:
• Use boolean indexing to select only the rows where the dropoff_location_code column (column
  index 6) has a value that corresponds to JFK. Assign the result to jfk.
• Calculate how many rows are in the new jfk array and assign the result to jfk_count.

2) Calculate how many trips from taxi had LaGuardia Airport as their destination:
• Use boolean indexing to select only the rows where the dropoff_location_code column (column
  index 6) has a value that corresponds to LaGuardia. Assign the result to laguardia.
• Calculate how many rows are in the new laguardia array. Assign the result to laguardia_count.

3) Calculate how many trips from taxi had Newark Airport as their destination:
• Select only the rows where the dropoff_location_code column has a value that corresponds to
  Newark, and assign the result to newark.
• Calculate how many rows are in the new newark array and assign the result to newark_count.

4) After you have run your code, inspect the values for jfk_count, laguardia_count, and
newark_count and see which airport has the most dropoffs.

Solution:

jfk = taxi[taxi[:,6] == 2]

jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:,6] == 5]

newark_count = newark.shape[0]

Challenge 2:

Our calculations in the previous challenge show that LaGuardia is the most common airport for dropoffs
in our data set.

Our second and final challenge involves removing potentially bad data from our data set, and then
calculating some descriptive statistics on the remaining "clean" data.

We'll start by using boolean indexing to remove any rows that have an average speed for the trip greater
than 100 mph (160 kph) which should remove the questionable data we have worked with over the past
two missions. Then, we'll use array methods to calculate the mean for specific columns of the remaining
data. The columns we're interested in are:

trip_distance, at column index 7

trip_length, at column index 8

total_amount, at column index 13

Instructions:

The trip_mph ndarray has been provided for you:

trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

1) Create a new ndarray, cleaned_taxi, containing only rows for which the values of trip_mph
are less than 100.
2) Calculate the mean of the trip_distance column of cleaned_taxi. Assign the result to
mean_distance.
3) Calculate the mean of the trip_length column of cleaned_taxi. Assign the result to
mean_length.
4) Calculate the mean of the total_amount column of cleaned_taxi. Assign the result to
mean_total_amount.

Solution:

trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

cleaned_taxi = taxi[trip_mph<100]

mean_distance = sum(cleaned_taxi[:,7])/cleaned_taxi.shape[0]

mean_length = sum(cleaned_taxi[:,8])/cleaned_taxi.shape[0]

mean_total_amount = sum(cleaned_taxi[:,13])/cleaned_taxi.shape[0]

print(mean_distance,mean_length,mean_total_amount)

OR

trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

cleaned_taxi = taxi[trip_mph < 100]

mean_distance = cleaned_taxi[:,7].mean()

mean_length = cleaned_taxi[:,8].mean()

mean_total_amount = cleaned_taxi[:,13].mean()
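
To see the filtering step work on values that can be checked by hand, here is a sketch that assumes a made-up three-column array named trips (distance in miles, length in seconds, total amount) in place of the taxi columns 7, 8 and 13:

```python
import numpy as np

# Hypothetical trips: columns are distance (miles), length (seconds),
# and total amount. The second row has an implausible speed.
trips = np.array([[10.0, 1800.0, 30.0],
                  [20.0,    2.0, 50.0],
                  [ 5.0,  900.0, 20.0]])

trip_mph = trips[:, 0] / (trips[:, 1] / 3600)  # 20, 36000 and 20 mph
cleaned = trips[trip_mph < 100]                # drops the second row

mean_distance = cleaned[:, 0].mean()      # 7.5
mean_length = cleaned[:, 1].mean()        # 1350.0
mean_total_amount = cleaned[:, 2].mean()  # 25.0

print(mean_distance, mean_length, mean_total_amount)
```

The boolean array is built from trip_mph, but it filters whole rows of trips, so every column of the implausible trip is removed before the means are calculated.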
