UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
LIST OF IMPORTANT QUESTIONS
UNIT IV
PYTHON LIBRARIES FOR DATA WRANGLING
PART – A
PART – B
1. Explain in detail about Data Wrangling in Python.
2. Explain the two main ways to carry out boolean masking.
3. Explain in detail about Aggregation in Pandas.
4. Pandas DataFrame - transform() function
5. Explain in detail about the pivot table using python.
PART – A
4. Plot data with histogram or box plot
print(df.head())
10. List some of the Aggregation functions in Pandas.
Pandas provides a variety of aggregate functions that help perform various operations on datasets:
.count(): gives a count of the data in a column.
.sum(): gives the sum of the data in a column.
.min() and .max(): find the minimum and maximum value in a column, respectively.
.mean() and .median(): find the mean and median of the values in a column, respectively.
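A quick sketch of these functions applied to a single column of values (the marks here are invented for illustration):

```python
import pandas as pd

# A small made-up column of marks
marks = pd.Series([98, 90, 87, 69, 78])

print(marks.count())             # number of values: 5
print(marks.sum())               # total: 422
print(marks.min(), marks.max())  # 69 98
print(marks.mean())              # 84.4
print(marks.median())            # 87.0
```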
11. What is grouping in pandas?
Pandas groupby is used for grouping data into categories and applying a function to each category. It also helps to aggregate data efficiently. The DataFrame.groupby() function splits the data into groups based on some criteria.
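A minimal groupby sketch (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Team':   ['A', 'A', 'B', 'B'],
    'Points': [10, 20, 30, 40],
})

# Split the rows into groups by Team, then aggregate each group
grouped = df.groupby('Team')
print(grouped['Points'].sum())
# Team A -> 30, Team B -> 70
```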
PART – B
1. Explain in detail about Data Wrangling in Python.
Data wrangling involves processing the data in various formats like - merging, grouping,
concatenating etc. for the purpose of analysing or getting them ready to be used with another
set of data. Python has built-in features to apply these wrangling methods to various data sets
to achieve the analytical goal. In this chapter we will look at few examples describing these
methods.
Merging Data
The Pandas library in Python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects. Consider the two DataFrames below:

Left:
   Name    id  subject_id
0  Alex     1  sub1
1  Amy      2  sub2
2  Allen    3  sub4
3  Alice    4  sub6
4  Ayoung   5  sub5

Right:
   Name    id  subject_id
0  Billy    1  sub2
1  Brian    2  sub4
2  Bran     3  sub3
3  Bryce    4  sub6
4  Betty    5  sub5
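A sketch of joining the two frames shown above on their common id column (the frames are recreated inline so the snippet is self-contained):

```python
import pandas as pd

left = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'id': [1, 2, 3, 4, 5],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'id': [1, 2, 3, 4, 5],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# Inner join on the shared key column 'id'
merged = pd.merge(left, right, on='id')
print(merged)
```

Since both frames have Name and subject_id columns, pandas suffixes them as Name_x/Name_y and subject_id_x/subject_id_y in the result.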
Grouping Data
Grouping data sets is a frequent need in data analysis, where we need results in terms of the various groups present in the data set. Pandas has built-in methods which can roll the data into various groups.
In the example below we group the data by year and then get the result for a specific year.
grouped = df.groupby('Year')
print(grouped.get_group(2014))
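The snippet above assumes an existing DataFrame df with a Year column; a self-contained version (the team names and points are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Team':   ['Riders', 'Riders', 'Devils', 'Devils'],
    'Year':   [2014, 2015, 2014, 2015],
    'Points': [876, 789, 863, 673],
})

# Group by Year, then pull out only the rows for 2014
grouped = df.groupby('Year')
print(grouped.get_group(2014))
```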
Concatenating Data
Pandas provides various facilities for easily combining Series, DataFrame, and Panel objects. In the example below the concat function performs concatenation operations along an axis. Let us create different objects and concatenate them.
import pandas as pd
one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
    'Marks_scored': [98, 90, 87, 69, 78]},
    index=[1, 2, 3, 4, 5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
    'Marks_scored': [89, 80, 79, 97, 88]},
    index=[1, 2, 3, 4, 5])
print(pd.concat([one, two]))
2. Explain the two main ways to carry out boolean masking.
The NumPy library in Python is a popular library for working with arrays. Boolean masking,
also called boolean indexing, is a feature in Python NumPy that allows for the filtering of
values in numpy arrays.
The first method returns an array with the required results. In this method, we pass a condition
in the indexing brackets, [ ], of an array. The condition can be any comparison, like arr > 5, for
the array arr.
Syntax
arr[arr > 5]
Return value
This method returns a NumPy array, ndarray, with values that satisfy the given condition. The
line in the example given above will return all the values in arr that are greater than 5.
Example
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.arange(15)
# Printing our array to observe
print(arr)
# Using boolean masking to filter elements greater than or equal to 8
print(arr[arr >= 8])
# Using boolean masking to filter elements equal to 12
print(arr[arr == 12])
The second method returns a boolean array that has the same size as the array it represents.
A boolean array only contains the boolean values of either True or False. This boolean array is
also called a mask array, or simply a mask. We'll discuss boolean arrays in more detail in the
"Return value" section.
Syntax
The code snippet given below shows us how to use this method:
mask = arr > 5
Return value
This method returns an array with the same size and dimensions as arr, containing only the values True and False. The True values represent elements in the same position in arr that satisfy our condition, and the False values represent elements that do not. We store this boolean array in a variable, mask. The mask array can then be passed in the index brackets of arr to return the values that satisfy our condition. We will see how this works in the coding example.
Example
# importing NumPy
import numpy as np
# Creating a NumPy array
arr = np.array([[ 0, 9, 0],
[ 0, 7, 8],
[ 6, 0, 1]])
# Printing our array to observe
print(arr)
# Creating a mask array
mask = arr > 5
# Printing the mask array
print(mask)
# Printing the filtered array using both methods
print(arr[mask])
print(arr[arr > 5])
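Masks built this way can also be combined element-wise with & (and), | (or), and ~ (not); note the parentheses, which are needed because these operators bind more tightly than comparisons:

```python
import numpy as np

arr = np.arange(15)

# Elements greater than 5 AND even
print(arr[(arr > 5) & (arr % 2 == 0)])   # [ 6  8 10 12 14]

# Elements less than 3 OR greater than 12
print(arr[(arr < 3) | (arr > 12)])       # [ 0  1  2 13 14]
```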
3. Explain in detail about Aggregation in Pandas.
Pandas provides a variety of aggregate functions that help perform various operations on datasets:
.count(): gives a count of the data in a column.
.sum(): gives the sum of the data in a column.
.min() and .max(): find the minimum and maximum value in a column, respectively.
.mean() and .median(): find the mean and median of the values in a column, respectively.
Parameters:
func: the function (or function name, list, or dict of these) used for aggregating the data. A function must work either when passed a DataFrame or when passed to DataFrame.apply(). For a DataFrame, a dict can be passed if the keys are the column names.
Returns:
scalar: when Series.agg is called with a single function.
Series: when DataFrame.agg is called with a single function.
DataFrame: when DataFrame.agg is called with several functions.
Example:
import pandas as pd
import numpy as np
info = pd.DataFrame([[1, 5, 7], [10, 12, 15], [18, 21, 24],
                     [np.nan, np.nan, np.nan]], columns=['X', 'Y', 'Z'])
info.agg(['sum', 'min'])
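As the Parameters note above describes, agg also accepts a dict keyed by column name, applying a different aggregation to each column; a sketch using the same info frame:

```python
import pandas as pd
import numpy as np

info = pd.DataFrame([[1, 5, 7], [10, 12, 15], [18, 21, 24],
                     [np.nan, np.nan, np.nan]], columns=['X', 'Y', 'Z'])

# Different aggregation per column via a dict (NaN rows are skipped)
result = info.agg({'X': 'sum', 'Y': 'min', 'Z': 'max'})
print(result)
# X -> 29.0, Y -> 5.0, Z -> 24.0
```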
4. Pandas DataFrame - transform() function
The transform() function is used to call a function (func) on self, producing a DataFrame with transformed values that has the same axis length as self.
Syntax:
DataFrame.transform(self, func, axis=0, *args, **kwargs)
Parameters:
func (Required; function, str, list or dict): Function to use for transforming the data. If a function, it must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
a function
a string function name
a list of functions and/or function names, e.g. [np.exp, 'sqrt']
a dict of axis labels -> functions, function names or list of such.
Returns: DataFrame
A DataFrame that must have the same length as self.
Raises: ValueError - If the returned DataFrame has a different length than self.
Examples:
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'X': range(4), 'Y': range(2, 6)})
df
Out[2]:
X Y
0 0 2
1 1 3
2 2 4
3 3 5
In [3]:
df.transform(lambda x: x + 1)
Out[3]:
X Y
0 1 3
1 2 4
2 3 5
3 4 6
Even though the resulting DataFrame must have the same length as the input DataFrame,
it is possible to provide several input functions:
In [4]:
s = pd.Series(range(4))
s
Out[4]:
0 0
1 1
2 2
3 3
dtype: int64
In [5]:
s.transform([np.sqrt, np.exp])
Out[5]:
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
3 1.732051 20.085537
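The parameter table also allows a dict of axis labels -> functions, which transforms each column differently; a sketch on the same df from In [2]:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': range(4), 'Y': range(2, 6)})

# Apply a different transformation to each column
res = df.transform({'X': lambda x: x + 1, 'Y': np.sqrt})
print(res)
```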
5. Explain in detail about the pivot table using python.
Most people likely have experience with pivot tables in Excel. Pandas provides a similar function called (appropriately enough) pivot_table. While it is exceedingly useful, it is easy to forget the syntax needed to format the output for a particular need. This section explains the pandas pivot_table function and how to use it for data analysis. As an added bonus, a simple cheat sheet summarizing pivot_table can be found at the end of this section.
The Data
One of the challenges with using the pandas pivot_table is making sure we understand our data and what questions we are trying to answer with the pivot table. It is a seemingly simple function but can produce very powerful analysis very quickly.
In this scenario, I’m going to be tracking a sales pipeline (also called funnel). The basic problem
is that some sales cycles are very long (think “enterprise software”, capital equipment, etc.) and
management wants to understand it in more detail throughout the year.
Many companies will have CRM tools or other software that sales uses to track the process.
While they may have useful tools for analyzing the data, inevitably someone will export the data
to Excel and use a PivotTable to summarize the data.
Advantages of building the pivot table in pandas:
It is self-documenting (look at the code and you know what it does).
It is easy to use to generate a report or email.
It is more flexible because you can define custom aggregation functions.
import pandas as pd
import numpy as np
Version Warning
The pivot_table API has changed over time, so please make sure you have a recent version of pandas (> 0.15) installed for this example to work. This example also uses the category data type, which requires a recent version as well.
Read in our sales funnel data into our DataFrame
df = pd.read_excel("../in/sales-funnel.xlsx")
df.head()
For convenience's sake, let's define the Status column as a category and set the order in which we want to view it.
This isn't strictly required but helps us keep the order we want as we work through analyzing the data.
df["Status"] = df["Status"].astype("category")
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)
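Note that in recent pandas releases the inplace argument to cat.set_categories has been removed, so reassignment is the safer spelling. A self-contained sketch (with a made-up Status column):

```python
import pandas as pd

df = pd.DataFrame({"Status": ["pending", "won", "declined"]})
df["Status"] = df["Status"].astype("category")

# Reassignment instead of inplace=True (removed in pandas 2.x)
df["Status"] = df["Status"].cat.set_categories(
    ["won", "pending", "presented", "declined"])
print(df["Status"].cat.categories.tolist())
# ['won', 'pending', 'presented', 'declined']
```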
As we build up the pivot table, I think it’s easiest to take it one step at a time. Add items and
check each step to verify we are getting the results we expect. Don’t be afraid to play with the
order and the variables to see what presentation makes the most sense for our needs.
The simplest pivot table must have a dataframe and an index. In this case, let's use the Name column as our index.
pd.pivot_table(df,index=["Name"])
We can have multiple indexes as well. In fact, most of the pivot_table args can take multiple
values via a list.
pd.pivot_table(df,index=["Name","Rep","Manager"])
This is interesting but not particularly useful. What we probably want to do is look at this by Manager and Rep. It's easy enough to do by changing the index.
pd.pivot_table(df,index=["Manager","Rep"])
We can see that the pivot table is smart enough to start aggregating the data and summarizing
it by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table
can do for us.
For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by explicitly defining the columns we care about using the values field.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
The Price column is automatically averaged, but we can do a count or a sum instead. Adding them is simple using aggfunc and np.sum.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
aggfunc can take a list of functions. Let’s try a mean using the numpy mean function and len to
get a count.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
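Since the sales-funnel spreadsheet may not be at hand, a self-contained sketch with invented data shows the same aggfunc pattern (here using the string "sum", which pandas accepts in place of np.sum):

```python
import pandas as pd

# Made-up sales data standing in for the sales-funnel file
sales = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred"],
    "Rep":     ["Craig", "Craig", "Dan", "Dan"],
    "Price":   [10000, 5000, 35000, 65000],
})

# Sum of Price per Manager/Rep pair
pivot = pd.pivot_table(sales, index=["Manager", "Rep"],
                       values=["Price"], aggfunc="sum")
print(pivot)
```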