
CS3352 FOUNDATIONS OF DATA SCIENCE

II YEAR / III SEMESTER B.Tech.- INFORMATION TECHNOLOGY

UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING

COMPILED BY,

Prof.M.KARTHIKEYAN, M.E., HoD / IT

VERIFIED BY

HOD PRINCIPAL CEO/CORRESPONDENT

DEPARTMENT OF INFORMATION TECHNOLOGY

SENGUNTHAR COLLEGE OF ENGINEERING, TIRUCHENGODE – 637 205.


UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING

 Basics of NumPy arrays
 Aggregations
 Computations on arrays
 Comparisons, masks, and boolean logic
 Fancy indexing
 Structured arrays
 Data manipulation with Pandas
 Data indexing and selection
 Operating on data – missing data
 Hierarchical indexing – combining datasets – aggregation and grouping
 Pivot tables

LIST OF IMPORTANT QUESTIONS
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
PART – A

1. Write a short note on the importance of Data Wrangling.
2. Write a short note on NumPy.
3. List the operations which are done using NumPy.
4. Write short notes on the aggregate() function.
5. List the operations which are done using Pandas.
6. Write a short note on Data Manipulation using Pandas.
7. How can Pandas detect missing data?
8. How do you treat missing data in Python?
9. How to use Hierarchical Indexes with Pandas?
10. List some of the Aggregation functions in Pandas.
11. What is grouping in Pandas?
12. What is the use of a pivot table in Python?

PART – B
1. Explain in detail about Data Wrangling in Python.
2. Explain the two main ways to carry out boolean masking.
3. Explain in detail about Aggregation in Pandas.
4. Explain the Pandas DataFrame transform() function.
5. Explain in detail about the pivot table using Python.


PART – A

1. Write a short note on the importance of Data Wrangling.


Data wrangling is a very important step. The following example illustrates its importance: a
book-selling website wants to show the top-selling books of different domains according to
user preference. For example, when a new user searches for motivational books, the site
wants to show the motivational books which sell the most or have the highest ratings.

2. Write a short note on NumPy.


One of the most fundamental packages in Python, NumPy is a general-purpose array-processing
package. It provides high-performance multidimensional array objects and tools to work with
the arrays. NumPy is an efficient container for generic multi-dimensional data.
3. List the operations which are done using NumPy.
1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic slicing and advanced indexing in NumPy
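
A minimal sketch of a few of these operations (reshape, basic slicing, broadcasting):

```python
import numpy as np

# Create a 1-D array and reshape it into a 2x3 matrix
arr = np.arange(6).reshape(2, 3)

# Basic slicing: take the first row
first_row = arr[0]

# Broadcasting: add a scalar to every element
shifted = arr + 10
```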

4. Write short notes on the aggregate() function.


The aggregate() method allows us to apply a function, or a list of function names, to be
executed along one of the axes of the DataFrame; the default is axis 0, which is the index
(row) axis. Note: the agg() method is an alias of the aggregate() method.
Syntax: dataframe.aggregate(func, axis, args, kwargs)
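
A minimal sketch of aggregate() in action (toy data, two named aggregations along the default axis):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Apply two aggregations along the default index (row) axis;
# the result has one row per aggregation name
result = df.aggregate(["sum", "min"])
```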

5. List the operations which are done using Pandas.


1. Indexing, manipulating, renaming, sorting, merging data frame
2. Update, Add, Delete columns from a data frame
3. Impute missing values, handle missing data or NaNs

4. Plot data with a histogram or box plot

6. Write a short note on Data Manipulation using Pandas.


 Dropping columns in the data.
 Dropping rows in the data.
 Renaming a column in the dataset.
 Select columns with specific data types.
 Slicing the dataset.
7. How can Pandas detect missing data?
To check for missing values in a Pandas DataFrame, we use the functions isnull()
and notnull(). Both functions help in checking whether a value is NaN or not. These
functions can also be used on a Pandas Series in order to find null values in a series.
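
A small sketch of both checks on a toy Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# isnull() flags NaN values; notnull() is its inverse
missing = s.isnull()
present = s.notnull()
```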
8. How do you treat missing data in Python?
It is time to see the different methods to handle them.
1. Drop rows or columns that have a missing value.
2. Drop rows or columns that only have missing values.
3. Drop rows or columns based on a threshold value.
4. Drop based on a particular subset of columns.
5. Fill with a constant value.
6. Fill with an aggregated value.
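
The options above can be sketched with dropna() and fillna() on a toy frame (the column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0]})

dropped = df.dropna()               # 1. drop rows with any missing value
dropped_all = df.dropna(how="all")  # 2. drop rows where every value is missing
thresh = df.dropna(thresh=2)        # 3. keep rows with at least 2 non-NaN values
subset = df.dropna(subset=["a"])    # 4. drop rows where column 'a' is missing
filled_const = df.fillna(0)         # 5. fill with a constant value
filled_mean = df.fillna(df.mean())  # 6. fill with an aggregated value (column means)
```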
9. How to use Hierarchical Indexes with Pandas?
#importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')

print(df.head())
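
The snippet above only loads the data; the hierarchical index itself is built by passing a list of columns to set_index(). A self-contained sketch (the region/state columns and values here are illustrative, not guaranteed to match the csv):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "state": ["NY", "PA", "CA", "WA"],
    "count": [10, 20, 30, 40]})

# Passing a list of columns to set_index() builds a MultiIndex
hier = df.set_index(["region", "state"])

# Rows can be selected by the outer level...
east = hier.loc["East"]

# ...or by an (outer, inner) tuple
ny_count = hier.loc[("East", "NY"), "count"]
```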

10. List some of the Aggregation functions in Pandas.
Pandas provide us with a variety of aggregate functions. These functions help to perform
various activities on the datasets. The functions are:
 .count(): This gives a count of the data in a column.
 .sum(): This gives the sum of data in a column.
 .min() and .max(): These help to find the minimum and maximum values in a
column, respectively.
 .mean() and .median(): These help to find the mean and median of the values in a
column, respectively.
11. What is grouping in Pandas?
Pandas groupby is used for grouping the data according to categories
and applying a function to each category. It also helps to aggregate data efficiently.
The Pandas DataFrame.groupby() function is used to split the data into groups based on
some criteria.
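
A minimal groupby sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "A", "B"],
                   "points": [10, 20, 30]})

# Split rows into groups by 'team', then aggregate each group
totals = df.groupby("team")["points"].sum()
```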

12. What is the use of a pivot table in Python?


The Pandas pivot_table() function provides a familiar interface to create Excel-style
pivot tables. The function requires at a minimum either the index= or columns=
parameters to specify how to split the data. The function can calculate one or multiple
aggregation methods, including using custom functions.

PART – B
1. Explain in detail about Data Wrangling in Python.
Data wrangling involves processing the data in various formats, like merging, grouping,
concatenating, etc., for the purpose of analysing the data or getting it ready to be used with
another set of data. Python has built-in features to apply these wrangling methods to various
data sets to achieve the analytical goal. In this chapter we will look at a few examples
describing these methods.

Merging Data

The Pandas library in Python provides a single function, merge, as the entry point for all
standard database join operations between DataFrame objects −

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True)
Let us now create two different DataFrames and perform the merging operations on it.

# import the pandas library


import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)
print(right)

Its output is as follows −

Name id subject_id
0 Alex 1 sub1
1 Amy 2 sub2
2 Allen 3 sub4
3 Alice 4 sub6
4 Ayoung 5 sub5

Name id subject_id
0 Billy 1 sub2
1 Brian 2 sub4
2 Bran 3 sub3
3 Bryce 4 sub6
4 Betty 5 sub5
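
The example above builds the two frames but stops short of joining them. As a minimal sketch, an inner merge on the shared id column looks like this (pandas suffixes the overlapping Name and subject_id columns with _x and _y):

```python
import pandas as pd

left = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join on 'id'; every id appears in both frames, so all
# five rows survive, with _x/_y suffixes on overlapping columns
merged = pd.merge(left, right, on='id')
```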

Grouping Data

Grouping data sets is a frequent need in data analysis where we need the result in terms of
various groups present in the data set. Pandas has in-built methods which can roll the data
into various groups.
In the below example we group the data by year and then get the result for a specific year.

# import the pandas library


import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',


'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print(grouped.get_group(2014))

Its output is as follows −

Points Rank Team Year


0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014

Concatenating Data

Pandas provides various facilities for easily combining together Series, DataFrame,
and (in older versions) Panel objects. In the below example the concat function performs
concatenation operations along an axis. Let us create different objects and do concatenation.

import pandas as pd
one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
two = pd.DataFrame({

'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
print(pd.concat([one,two]))

Its output is as follows −

Marks_scored Name subject_id


1 98 Alex sub1
2 90 Amy sub2
3 87 Allen sub4
4 69 Alice sub6
5 78 Ayoung sub5
1 89 Billy sub2
2 80 Brian sub4
3 79 Bran sub3
4 97 Bryce sub6
5 88 Betty sub5
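
Two commonly used concat options are worth noting alongside this example; a small sketch on toy frames (not the ones above):

```python
import pandas as pd

one = pd.DataFrame({'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Marks_scored': [89, 80]}, index=[1, 2])

# keys= labels each source frame, producing a hierarchical index
labelled = pd.concat([one, two], keys=['x', 'y'])

# ignore_index= discards the original indexes and renumbers the rows
renumbered = pd.concat([one, two], ignore_index=True)
```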

2. Explain the two main ways to carry out boolean masking.

The NumPy library in Python is a popular library for working with arrays. Boolean masking,
also called boolean indexing, is a feature in Python NumPy that allows for the filtering of
values in numpy arrays.

There are two main ways to carry out boolean masking:

 Method one: Returning the result array.

 Method two: Returning a boolean array.

Method one: Returning the result array

The first method returns an array with the required results. In this method, we pass a condition
in the indexing brackets, [ ], of an array. The condition can be any comparison, like arr > 5, for
the array arr.

Syntax
arr[arr > 5]

Parameter values

 arr: This is the array that we are querying.
 The condition arr > 5 is the criterion with which values in the arr array will be filtered.

Return value

This method returns a NumPy array, ndarray, with values that satisfy the given condition. The
line in the example given above will return all the values in arr that are greater than 5.

Example

Let's try out this method in the following example:

# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.arange(15)
# Printing our array to observe
print(arr)
# Using boolean masking to filter elements greater than or equal to 8
print(arr[arr >= 8])
# Using boolean masking to filter elements equal to 12
print(arr[arr == 12])

Method two: Returning a boolean array

The second method returns a boolean array that has the same size as the array it represents.
A boolean array only contains the boolean values of either True or False. This boolean array is
also called a mask array, or simply a mask. We'll discuss boolean arrays in more detail in the
"Return value" section.

Syntax
The code snippet given below shows us how to use this method:
mask = arr > 5
Return value

The line in the code snippet given above will:

 Return an array with the same size and dimensions as arr. This array will only contain the
values True and False. All the True values represent elements in the same position
in arr that satisfy our condition, and all the False values represent elements in the same
position in arr that do not satisfy our condition.
 Store this boolean array in the variable mask.

The mask array can be passed in the index brackets of arr to return the values that satisfy our
condition. We will see how this works in our coding example.

Example

Let's try out this method in the following example:

# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.array([[ 0, 9, 0],
[ 0, 7, 8],
[ 6, 0, 1]])
# Printing our array to observe
print(arr)
# Creating a mask array
mask = arr > 5
# Printing the mask array
print(mask)
# Printing the filtered array using both methods
print(arr[mask])
print(arr[arr > 5])

3. Explain in detail about Aggregation in Pandas.

Pandas provide us with a variety of aggregate functions. These functions help to perform
various activities on the datasets. The functions are:
 .count(): This gives a count of the data in a column.
 .sum(): This gives the sum of data in a column.
 .min() and .max(): These help to find the minimum and maximum values in a
column, respectively.
 .mean() and .median(): These help to find the mean and median of the values in a
column, respectively.

Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)

Parameters:

func: It refers to a callable, string, dictionary, or list of strings/callables.

It is used for aggregating the data. If a function, it must either work when passed a
DataFrame or when passed to DataFrame.apply(). A dict can be passed if the keys are the
column names.

axis (default 0): 0 or 'index', 1 or 'columns'.

0 or 'index': apply the function to each column.

1 or 'columns': apply the function to each row.

*args: It is a positional argument that is to be passed to func.

**kwargs: It is a keyword argument that is to be passed to the func.

Returns:

It returns a scalar, Series, or DataFrame.

scalar: returned when Series.agg is called with a single function.

Series: returned when DataFrame.agg is called with a single function.

DataFrame: returned when DataFrame.agg is called with several functions.

Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([[1, 5, 7], [10, 12, 15], [18, 21, 24],
                     [np.nan, np.nan, np.nan]], columns=['X', 'Y', 'Z'])
info.agg(['sum', 'min'])
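
Beyond a list of functions, agg() also accepts a dict to apply a different function per column; a small sketch reusing the same frame:

```python
import pandas as pd
import numpy as np

info = pd.DataFrame([[1, 5, 7], [10, 12, 15], [18, 21, 24],
                     [np.nan, np.nan, np.nan]], columns=['X', 'Y', 'Z'])

# A dict keyed by column name applies a different function to each column
per_column = info.agg({'X': 'sum', 'Y': 'min'})
```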

4. Explain the Pandas DataFrame transform() function.

The transform() function is used to call a function (func) on self, producing a DataFrame
with transformed values that has the same axis length as self.
Syntax:
DataFrame.transform(self, func, axis=0, *args, **kwargs)
Parameters:
func (function, str, list, or dict; required): Function to use for transforming the data.
If a function, it must either work when passed a DataFrame or when passed to
DataFrame.apply. Accepted combinations are:
 function
 string function name
 list of functions and/or function names, e.g. [np.exp, 'sqrt']
 dict of axis labels -> functions, function names, or a list of such

axis ({0 or 'index', 1 or 'columns'}; default 0): If 0 or 'index': apply the function to
each column. If 1 or 'columns': apply the function to each row.

*args (optional): Positional arguments to pass to func.

**kwargs (optional): Keyword arguments to pass to func.

Returns: DataFrame
A DataFrame that must have the same length as self.
Raises: ValueError - If the returned DataFrame has a different length than self.
Examples:
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'X': range(4), 'Y': range(2, 6)})
df
Out[2]:

   X  Y
0  0  2
1  1  3
2  2  4
3  3  5
In [3]:
df.transform(lambda x: x + 1)
Out[3]:
   X  Y
0  1  3
1  2  4
2  3  5
3  4  6
Even though the resulting DataFrame must have the same length as the input DataFrame,
it is possible to provide several input functions:
In [4]:
s = pd.Series(range(4))
s
Out[4]:
0 0
1 1
2 2
3 3
dtype: int64
In [5]:
s.transform([np.sqrt, np.exp])
Out[5]:
       sqrt        exp
0  0.000000   1.000000
1  1.000000   2.718282
2  1.414214   7.389056
3  1.732051  20.085537
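
transform() is also commonly paired with groupby(), where its same-length guarantee broadcasts a per-group statistic back onto every row; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B'],
                   'points': [10, 20, 30]})

# Each row receives its own group's mean; unlike groupby().mean(),
# the result has the same length as the input frame
df['team_mean'] = df.groupby('team')['points'].transform('mean')
```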

5. Explain in detail about the pivot table using Python.

Most people likely have experience with pivot tables in Excel. Pandas provides a similar function
called (appropriately enough) pivot_table . While it is exceedingly useful, I frequently find myself
struggling to remember how to use the syntax to format the output for my needs. This article will
focus on explaining the pandas pivot_table function and how to use it for your data analysis.

As an added bonus, I’ve created a simple cheat sheet that summarizes the pivot_table. We can
find it at the end of this post and I hope it serves as a useful reference. Let me know if it
is helpful.

The Data

One of the challenges with using the pandas pivot_table is making sure we understand our data
and what questions we are trying to answer with the pivot table. It is a seemingly simple function
but can produce very powerful analysis very quickly.

In this scenario, I’m going to be tracking a sales pipeline (also called funnel). The basic problem
is that some sales cycles are very long (think “enterprise software”, capital equipment, etc.) and
management wants to understand it in more detail throughout the year.

Typical questions include:

 How much revenue is in the pipeline?


 What products are in the pipeline?
 Who has what products at what stage?
 How likely are we to close deals by year end?

Many companies will have CRM tools or other software that sales uses to track the process.
While they may have useful tools for analyzing the data, inevitably someone will export the data
to Excel and use a PivotTable to summarize the data.

Using a pandas pivot table can be a good alternative because it is:

 Quicker (once it is set up)
 Self-documenting (look at the code and we know what it does)
 Easy to use to generate a report or email
 More flexible because we can define custom aggregation functions

Read in the data

Let’s set up our environment first.

If we want to follow along, we can download the Excel file.

import pandas as pd

import numpy as np

Version Warning
The pivot_table API has changed over time so please make sure we have a recent version of
pandas ( > 0.15) installed for this example to work. This example also uses the category data
type which requires a recent version as well.
Read in our sales funnel data into our DataFrame

df = pd.read_excel("../in/sales-funnel.xlsx")

df.head()

For convenience's sake, let's define the status column as a category and set the order we want
to view.

This isn't strictly required but helps us keep the order we want as we work through analyzing
the data.
df["Status"] = df["Status"].astype("category")

# set_categories no longer supports inplace=True in recent pandas, so assign the result
df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])

Pivot the data

As we build up the pivot table, I think it’s easiest to take it one step at a time. Add items and
check each step to verify we are getting the results we expect. Don’t be afraid to play with the
order and the variables to see what presentation makes the most sense for our needs.

The simplest pivot table must have a dataframe and an index . In this case, let’s use the Name
as our index.

pd.pivot_table(df,index=["Name"])

We can have multiple indexes as well. In fact, most of the pivot_table args can take multiple
values via a list.

pd.pivot_table(df,index=["Name","Rep","Manager"])

This is interesting but not particularly useful. What we probably want to do is look at this by
Manager and Rep. It’s easy enough to do by changing the index .

pd.pivot_table(df,index=["Manager","Rep"])

We can see that the pivot table is smart enough to start aggregating the data and summarizing
it by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table
can do for us.

For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by
explicitly defining the columns we care about using the values field.

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])

The price column automatically averages the data but we can do a count or a sum. Adding them
is simple using aggfunc and np.sum .

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)

aggfunc can take a list of functions. Let’s try a mean using the numpy mean function and len to
get a count.

pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
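
Since the sales-funnel spreadsheet may not be to hand, here is a self-contained sketch of the same kind of call on toy data (the column names mirror the article's; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Manager': ['Debra', 'Debra', 'Fred'],
    'Rep': ['Craig', 'Craig', 'Wendy'],
    'Price': [30000, 10000, 35000]})

# Sum of Price per Manager/Rep pair; the string "sum" works the
# same way as passing np.sum as aggfunc
table = pd.pivot_table(df, index=['Manager', 'Rep'],
                       values=['Price'], aggfunc='sum')
```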
