1
CSIT 553 EXPLORATORY DATA
ANALYSIS AND VISUALIZATION
Instructor: Jiayin Wang
Class 5: Data Aggregation & Time Series
Outline 2
▸ Introduction to Data Aggregation and Group Operation
▸ GroupBy Mechanics
▸ Data Aggregation
▸ Apply: General split-apply-combine
▸ Pivot Tables and Crosstab
▸ Time Series
Why Data Aggregation and Group Operation? 3
▸ Sometimes we want to analyze data by groups. E.g., group students by
▸ Majors
▸ GPAs
▸ Years
▸ Calculate statistics based on these groups
▸ Perform our own type of transformation
▸ Create visualizations for each group
Techniques in Data Aggregation and Group Operation 4
▸ Split data into pieces using one or more keys
▸ Calculate group summary statistics (count, mean, standard deviation,
user-defined functions, etc.)
▸ Compute pivot tables and cross-tabulations
Split-Apply-Combine 6
▸ Coined by Hadley Wickham, 2011
▸ Split: split into groups based on one or more keys
▸ Apply: apply one function to each group, producing a new value
▸ Combine: results are combined into a result
Split-Apply-Combine 7
Example: Splitting 8
Sources: Hadley Wickham, 2011
Example: Apply (Counting) & Combine 9
Sources: Hadley Wickham, 2011
GroupBy in pandas 10
▸ Method: <Series/DataFrame>.groupby(<keys>)
▸ <keys> is the group key (one or more columns)
▸ Create a GroupBy object
▸ The groupby method doesn’t compute anything until you apply an aggregation
operation to the groups
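A minimal sketch of the lazy behaviour described above. The small tips-like table and its values are invented for illustration; only the groupby call itself comes from the slide.

```python
import pandas as pd

# Hypothetical miniature of a restaurant-tips table (values invented).
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes", "No"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# groupby only records how to split the rows; no statistics are computed yet.
grouped = tips.groupby("day")
print(type(grouped).__name__)

# The computation happens once an aggregation is applied to the groups.
mean_bill = grouped["total_bill"].mean()
print(mean_bill)
```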
GroupBy in pandas 11
▸ R
Iterating Over Groups 12
▸ The GroupBy object supports iteration, generating a sequence of 2-tuples
containing the group name along with the chunk of data
for name, group in <DataFrame>.groupby('day'):
    print(name)
    print(group)
Group by
multiple keys
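Iteration with one key and with multiple keys could look like the sketch below. The table is a hypothetical stand-in, and the dictionary named sizes is only there to make the result easy to inspect.

```python
import pandas as pd

# Hypothetical data (values invented for illustration).
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "Yes", "No"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# One key: each group name is a single value such as "Fri".
for name, group in tips.groupby("day"):
    print(name, len(group))

# Several keys: each group name is a tuple of key values.
sizes = {}
for (day, smoker), group in tips.groupby(["day", "smoker"]):
    sizes[(day, smoker)] = len(group)
print(sizes)
```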
Data Aggregation 13
▸ Refers to any process in which data is gathered and expressed in
summary form
▸ Examples: count(), sum(), mean(), median()
▸ Can aggregate a slice of the dataset: use square brackets with column names
▸ Can define your own aggregation functions and pass them to agg function
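The points above can be sketched as follows, assuming a small invented table. The function peak_to_peak is a hypothetical user-defined aggregation, not one named on the slide.

```python
import pandas as pd

# Hypothetical data (values invented for illustration).
tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
    "tip": [1.0, 3.0, 4.0, 10.0],
})

# Aggregate a slice of the dataset: select one column with square brackets.
tip_sum = tips.groupby("day")["tip"].sum()

def peak_to_peak(values):
    """User-defined aggregation: range (max - min) of a group's values."""
    return values.max() - values.min()

# agg accepts built-in names and user-defined functions, alone or in a list.
summary = tips.groupby("day")["total_bill"].agg(["mean", peak_to_peak])
print(tip_sum)
print(summary)
```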
Groupby Aggregation Operations 14
Data Aggregation Examples 15
Data Aggregation Examples 16
▸ Can also apply .describe() to the groups
Exercise 1 17
▸ Consider the DataFrame tips with columns: total_bill, tip, smoker, day, time,
size, tip_pct
▸ 1. Create a Series of the average tip percentage for each size on each day.
▸ 2. Create a DataFrame of the sum of the total bill for each time on each day.
Exercise 1 Solution 18
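One possible solution, sketched on a hypothetical four-row stand-in for tips (the original solution is not reproduced here, and the values are invented). Column names follow the exercise statement.

```python
import pandas as pd

# Hypothetical stand-in for the tips table; only the column names are from the exercise.
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 50.0],
    "tip": [1.0, 4.0, 4.0, 10.0],
    "smoker": ["No", "Yes", "No", "No"],
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "time": ["Lunch", "Dinner", "Dinner", "Dinner"],
    "size": [2, 2, 3, 3],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# 1. Series: mean tip percentage for each size on each day.
avg_tip_pct = tips.groupby(["day", "size"])["tip_pct"].mean()

# 2. DataFrame: double brackets keep the result two-dimensional.
bill_sum = tips.groupby(["day", "time"])[["total_bill"]].sum()

print(avg_tip_pct)
print(bill_sum)
```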
Apply: General split-apply-combine 20
▸ During apply, functions are invoked on each group (piece)
▸ Then, all groups (pieces) are concatenated together
▸ You can pass your own function by .apply(<function>)
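A short sketch of apply with a user-defined function. The helper top and the sample values are hypothetical; the value columns are selected before apply so that the grouping key is not passed into each piece.

```python
import pandas as pd

# Hypothetical data (values invented for illustration).
tips = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "total_bill": [10.0, 20.0, 30.0, 10.0, 20.0, 40.0],
    "tip_pct": [0.10, 0.25, 0.15, 0.30, 0.05, 0.20],
})

def top(group, n=2, column="tip_pct"):
    """Return the n rows of a group with the largest values in column."""
    return group.sort_values(column, ascending=False).head(n)

# top is invoked on each piece; the pieces are then concatenated together.
top_tips = tips.groupby("smoker")[["total_bill", "tip_pct"]].apply(top)
print(top_tips)
```

The result carries a two-level row index: the group key (smoker) followed by the original row label.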
Apply: General split-apply-combine 21
Pivot Tables 23
▸ A data summarization tool widely used in spreadsheet programs
▸ Aggregate a table of data by one or more keys with some keys along the rows
(index), and some along the columns (columns)
▸ A combination of groupby operation and reshape operation utilizing
hierarchical indexing
▸ pandas provides a top-level pandas.pivot_table function
▸ DataFrame has a pivot_table method
Pivot Tables: Simple Syntax 24
▸ Aggregate the table by one or more keys (columns)
<DataFrame>.pivot_table(index=<column(s)>)
▸ Creates a new DataFrame with the configured index; by
default, the values are the group means
▸ For example, list the average values of each day
Pivot Tables: More Examples 25
▸ List the average values of each day for both smokers and
non-smokers.
Two levels of index:
day & smoker
Average value of
aggregated data
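The two pivot tables described above (by day, then by day and smoker) might be produced as in this sketch. The data is invented, and values is passed explicitly here as a precaution, since non-numeric columns cannot be averaged.

```python
import pandas as pd

# Hypothetical data (values invented for illustration).
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes"],
    "total_bill": [10.0, 30.0, 20.0, 40.0],
    "tip": [1.0, 3.0, 2.0, 6.0],
})

# One index key: one row per day, values are per-day means (the default).
by_day = tips.pivot_table(values=["total_bill", "tip"], index="day")

# Two index keys: a hierarchical (day, smoker) row index.
by_day_smoker = tips.pivot_table(values=["total_bill", "tip"],
                                 index=["day", "smoker"])
print(by_day)
print(by_day_smoker)
```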
Pivot Table Options 26
Pivot Tables: Examples with More Parameters 27
Pivot Tables: More Examples 28
Number of rows in each group
Can also set as:
len, [Link], [Link], [Link], [Link]
Or:
'count', 'mean', 'sum', 'max', 'min'
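A sketch of a pivot table with more parameters: columns, a string aggfunc, margins, and fill_value. The data is hypothetical and chosen so that one combination (Sat, Yes) is empty.

```python
import pandas as pd

# Hypothetical data (values invented); no smoker party on Saturday.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "No", "No"],
    "tip": [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Count rows per (day, smoker) cell; empty cells become 0,
# and margins=True adds the "All" row and column of totals.
counts = tips.pivot_table(values="tip", index="day", columns="smoker",
                          aggfunc="count", margins=True, fill_value=0)
print(counts)
```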
Pivot Tables: Cheat Sheet 29
Sources: [Link]
Cross Table 30
▸ A special case of a pivot table to compute group frequencies
▸ Similar to pivot_table with aggfunc='count'
Index Column
Cross Table: More Example 31
▸ List the number of smokers of each day at different times. Include the
row/column subtotals and the grand total as well.
Equivalent to
tips.pivot_table('total_bill', index=['time', 'day'],
                 columns=['smoker'], aggfunc='count',
                 margins=True, fill_value=0)
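The cross table itself could be built with pandas.crosstab, as in this sketch on invented data. The first argument lists the row keys and the second gives the column key.

```python
import pandas as pd

# Hypothetical data (values invented for illustration).
tips = pd.DataFrame({
    "time": ["Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes", "Yes"],
})

# Frequencies of smoker within each (time, day) pair, plus subtotals.
table = pd.crosstab([tips["time"], tips["day"]], tips["smoker"], margins=True)
print(table)
```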
Outline 32
▸ Introduction to Data Aggregation and Group Operation
▸ Time Series
▸ Intro to Date & Time
▸ Date & Time in Python
▸ Date & Time in pandas
▸ Date Ranges, Frequencies, and Shifting
▸ Time Zone
▸ Resampling
▸ Window Functions
Dates and Times 33
▸ To a computer, time is the number of seconds elapsed since the Unix epoch (1 Jan.
1970 00:00:00 UTC)
▸ Usually broken down into years, months, days, hours, minutes, seconds, etc.
▸ There are lots of formats to express time:
▸ '2018-11-28' vs. '11/28/2018'
▸ 12-hour clock vs. 24-hour clock
▸ Time zones
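As a small illustration of the epoch and of two date formats, the sketch below uses only the standard library. The variable names are arbitrary.

```python
from datetime import datetime, timezone

# Zero seconds after the Unix epoch, expressed in UTC.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch)  # 1970-01-01 00:00:00+00:00

# 86400 seconds (one day) later, written in two common formats.
one_day_later = datetime.fromtimestamp(86400, tz=timezone.utc)
iso_style = one_day_later.strftime("%Y-%m-%d")
us_style = one_day_later.strftime("%m/%d/%Y")
print(iso_style, "vs.", us_style)
```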
Time Series 34
▸ Anything that is observed or measured at many points in time forms a time
series
▸ Important form of structured data in many fields, such as finance, ecology, and
economics
▸ Time series data may be referenced by:
▸ Timestamps, specific instants in time
▸ Fixed periods, such as the full year 2018
▸ Intervals of time, indicated by a start and end timestamp
Why Time Series? 35
▸ To identify trends, cycles, and seasonal variances to aid in the forecasting of a
future event
Sources: [Link]
Date And Time in Python 37
▸ Main module for date and time data: datetime
from datetime import datetime
▸ Stores both the date and time down to the microseconds
▸ datetime.now() returns the current date and time
▸ Can access the year, month, day, hour, etc. as attributes
Date And Time in Python 38
▸ Data types in the datetime module
▸ timedelta represents the difference between two datetime objects
▸ Can get the difference in days and seconds
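A brief sketch of datetime attributes and timedelta arithmetic. The two dates are arbitrary examples.

```python
from datetime import datetime, timedelta

# Current date and time, with individual fields available as attributes.
now = datetime.now()
print(now.year, now.month, now.day, now.hour)

# Subtracting two datetime objects yields a timedelta.
start = datetime(2018, 11, 28, 8, 30)
end = datetime(2018, 12, 1, 20, 45)
delta = end - start
print(delta.days, delta.seconds)  # 3 44100

# A timedelta can also be added to a datetime.
later = start + timedelta(days=7)
print(later)  # 2018-12-05 08:30:00
```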
Converting Datetime to String 39
▸ str(<datetime>) converts datetime to a string as “YYYY-MM-DD hh:mm:ss”
▸ .strftime(<format>) converts a datetime to a string in a specified format
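Both conversions, sketched on an arbitrary timestamp:

```python
from datetime import datetime

stamp = datetime(2018, 11, 28, 14, 5, 9)

# str() uses the default "YYYY-MM-DD hh:mm:ss" layout.
as_text = str(stamp)
print(as_text)  # 2018-11-28 14:05:09

# strftime takes format codes such as %m (month), %d (day), %Y (year).
us_style = stamp.strftime("%m/%d/%Y")
hour_minute = stamp.strftime("%H:%M")
print(us_style, hour_minute)  # 11/28/2018 14:05
```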
Converting String to Datetime 40
▸ [Link]: for known format
▸ [Link]: for unknown format
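The function names on this slide did not survive extraction. The sketch below assumes two common choices: datetime.strptime when the format is known, and the third-party dateutil parser when it is not. These may not be the exact calls the slide showed.

```python
from datetime import datetime
from dateutil.parser import parse  # third-party package (installed with pandas)

# Known format: spell it out with format codes.
known = datetime.strptime("11/28/2018", "%m/%d/%Y")
print(known)  # 2018-11-28 00:00:00

# Unknown format: let the parser infer the layout.
guessed = parse("2018-11-28 10:45")
print(guessed)  # 2018-11-28 10:45:00
```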
Datetime in pandas 42
▸ pd.to_datetime(<arg>) function: converts an entire column to datetime
▸ arg can be a list, tuple, 1-d array, Series, or DataFrame
▸ Datetimes are stored in NumPy's datetime64 format
▸ NaT indicates a missing time value (similar to NaN)
▸ [Link] is the pandas equivalent of Python’s [Link]
▸ Can use time as the index
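A minimal sketch of these points, using invented strings and values. It also shows pd.Timestamp, the pandas scalar type that corresponds to Python's datetime.datetime.

```python
import pandas as pd

# Convert a whole column of strings; the missing entry becomes NaT.
raw = pd.Series(["2018-05-01", "2018-06-01", None])
dates = pd.to_datetime(raw)
print(dates)

# Individual elements are pandas Timestamp objects.
first = dates.iloc[0]
print(type(first).__name__, first.year)

# Datetimes can serve as the index of a Series (values are hypothetical).
prices = pd.Series([100, 110],
                   index=pd.to_datetime(["2018-05-01", "2018-06-01"]))
print(prices)
```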
Datetime in pandas Examples 1 43
▸ Data: Average single home prices in New Jersey from 05/01/2018 to
10/01/2018 (source: Zillow Research)
Datetime in pandas Examples 2 44
Indexing, Selection, Subsetting 45
▸ Slicing also works
▸ Can select one date, or only a year
(or a year and a month)
▸ Can select a range of datetime
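The selections listed above could be written as follows, assuming a hypothetical Series with a sorted datetime index. The name ts and the values are invented.

```python
import pandas as pd

# Hypothetical monthly series.
ts = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.to_datetime(["2017-11-01", "2017-12-01", "2018-01-01",
                          "2018-02-01", "2018-03-01", "2018-04-01"]),
)

one_date = ts.loc["2018-01-01"]              # a single date
one_year = ts.loc["2018"]                    # every entry in a year
one_month = ts.loc["2017-12"]                # every entry in a year and month
window = ts.loc["2017-12-01":"2018-02-01"]   # a range; both ends included

print(one_date)
print(one_year)
print(window)
```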
Exercise 2 46
▸ The following two lists indicate the single home prices in New York City from
2016 to 2018.
time = ['2016-06', '2016-12', '2017-03',
        '2017-06', '2017-09', '2018-03']
prices = [379100, 388000, 393500, 403100, 409700, 423500]
▸ 1. Create a Series with the times as a datetime index and the prices as
values.
▸ 2. Write the command to list the prices in 2017.
Exercise 2 Solution 47
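One possible solution, sketched with the two lists from the exercise. The names home_prices and prices_2017 are hypothetical.

```python
import pandas as pd

time = ['2016-06', '2016-12', '2017-03',
        '2017-06', '2017-09', '2018-03']
prices = [379100, 388000, 393500, 403100, 409700, 423500]

# 1. Series with a datetime index and the prices as values.
home_prices = pd.Series(prices, index=pd.to_datetime(time))

# 2. Select the prices in 2017 by passing the year as a string.
prices_2017 = home_prices.loc["2017"]
print(prices_2017)
```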
Generating Date Ranges and Frequency 49
▸ pandas.date_range can generate DatetimeIndex with a time range and
frequency
▸ Range:
▸ by setting the start/end date time
▸ by setting the periods
▸ Frequency defines how the range is divided. Can be set as:
▸ year, month, week, day, hours, etc…
▸ and combinations of them (such as 1h30min)
Generating Date Ranges and Frequency Example 50
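A sketch of the three ways to define a range, with arbitrary dates. The combined frequency is written as "90min", which is the same span as 1h30min.

```python
import pandas as pd

# Range from start and end; the default frequency is daily.
by_bounds = pd.date_range(start="2019-01-01", end="2019-01-05")

# Range from a start and a number of periods; "W" is weekly (anchored on Sundays).
by_periods = pd.date_range(start="2019-01-01", periods=4, freq="W")

# A combined frequency: every 90 minutes.
combined = pd.date_range(start="2019-01-01 09:00", periods=3, freq="90min")

print(by_bounds)
print(by_periods)
print(combined)
```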
Date Frequencies 51
Exercise 3 52
▸ Create a DatetimeIndex which shows all the business days from 01/22/2019
to 01/31/2019.
Exercise 3 Solution 53
▸ Create a DatetimeIndex which shows all the business days from 01/22/2019 to
01/31/2019.
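One possible solution uses the business-day frequency "B". Note that "B" skips weekends only; it does not account for public holidays.

```python
import pandas as pd

# Business days (Monday to Friday) between the two dates, inclusive.
business_days = pd.date_range(start="2019-01-22", end="2019-01-31", freq="B")
print(business_days)
```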
Date Shifting 54
▸ Shifting refers to moving data backward or forward through time
▸ Both Series and DataFrame have a shift method for naive shifts, leaving the
index unmodified:
▸ Shifting by Time:
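Both kinds of shift, sketched on a hypothetical three-day series:

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range("2019-01-01", periods=3, freq="D"))

# Naive shift: values move one step forward, the index stays unchanged,
# and the vacated first position becomes NaN.
naive = ts.shift(1)

# Shifting by time: passing freq moves the index instead of the values.
by_time = ts.shift(1, freq="D")

print(naive)
print(by_time)
```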
Time Zone Handling 56
▸ We need to handle times in different time zones
▸ Current international standard is the coordinated universal time (UTC)
(successor to Greenwich Mean Time)
▸ All other time zones are expressed as offsets from UTC (currently ranging from UTC-12 to UTC+14)
▸ For example, New York time: UTC-5 (UTC-4 in daylight saving time)
▸ Time zone in Python: using pytz library
Time Zone in Time Series 57
▸ By default, time series in pandas are time zone naive (no time zone setting).
▸ Can localize a time zone by the tz_localize method
▸ Can set the time zone in generating a date range by setting tz
▸ Can convert from one time zone to another by tz_convert()
▸ Operations that combine two different time zones produce a result in UTC
Time Zone in Time Series Example 58
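A minimal sketch of the methods listed above. The series and the two zone names are chosen for illustration.

```python
import pandas as pd

# Time zone naive by default.
ts = pd.Series([1.0, 2.0],
               index=pd.date_range("2019-01-15 09:00", periods=2, freq="D"))
print(ts.index.tz)  # None

# Attach a time zone, then convert to another one.
ny = ts.tz_localize("America/New_York")
berlin = ny.tz_convert("Europe/Berlin")
print(berlin.index[0])  # 09:00 in New York is 15:00 in Berlin

# A range can also be created directly with a time zone.
aware = pd.date_range("2019-01-15 09:00", periods=2, freq="D", tz="UTC")

# Combining series from different zones gives a result indexed in UTC.
combined = ny + berlin
print(combined.index.tz)
```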
Resampling 60
▸ Refers to the process of converting a time series from one frequency to another
▸ Downsampling: higher frequency to lower frequency
▸ Upsampling: lower frequency to higher frequency
▸ Other: neither of the above, e.g., every Wednesday to every Friday
▸ In pandas, call resample to group the data and then call an aggregation
function
▸ e.g.,
<Series/DataFrame>.resample('D').mean()
Resampling Example 61
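A small sketch of resampling to daily means, assuming a hypothetical series sampled every 12 hours:

```python
import pandas as pd

# Six observations, 12 hours apart, starting at midnight on 2019-01-01.
index = pd.date_range("2019-01-01", periods=6, freq=pd.Timedelta(hours=12))
ts = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], index=index)

# Group the observations by calendar day, then average each group.
daily = ts.resample("D").mean()
print(daily)
```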
Downsampling 62
▸ The desired frequency defines bin edges to slice the time series into intervals
to aggregate
▸ Each interval is half-open (only one side is included) so that each data point
belongs to exactly one interval
▸ While downsampling data, think about
▸ Which side of each interval is closed
▸ How to label each aggregation bin
(the start or the end of the interval)
Downsampling Example 63
By default, closed and label are both 'left' for most frequencies (month-end, quarter-end, year-end, and weekly frequencies default to 'right')
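The effect of closed and label can be seen on a hypothetical one-minute series downsampled to five-minute sums:

```python
import pandas as pd

# Twelve one-minute observations with values 0..11.
ts = pd.Series(range(12),
               index=pd.date_range("2019-01-01 00:00", periods=12, freq="min"))

# Defaults for "5min": bins like [00:00, 00:05), labelled by their left edge.
left = ts.resample("5min").sum()

# Right-closed bins like (00:00, 00:05], labelled by their right edge.
right = ts.resample("5min", closed="right", label="right").sum()

print(left)
print(right)
```

With right-closed bins, the 00:00 observation falls into its own bin (23:55, 00:00], which is why the second result has one more row.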
Upsampling 64
▸ No aggregation is needed in upsampling
▸ Just need to consider how to fill the missing values that appear in the gaps
Filled by NaN
Upsampling (Cont.) 65
Filled forward
Filled a certain number of periods forward
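The three fill options named above might look like this, assuming a hypothetical weekly series upsampled to daily frequency:

```python
import pandas as pd

# Two weekly observations (Wednesdays 2019-01-02 and 2019-01-09).
weekly = pd.Series([1.0, 2.0],
                   index=pd.date_range("2019-01-02", periods=2, freq="W-WED"))

# Gaps left as NaN.
daily_nan = weekly.resample("D").asfreq()

# Gaps filled forward with the last known value.
daily_ffill = weekly.resample("D").ffill()

# Filled forward for at most two periods; the rest stays NaN.
daily_limited = weekly.resample("D").ffill(limit=2)

print(daily_nan)
print(daily_ffill)
print(daily_limited)
```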
Window Functions 67
▸ Apply statistics over a sliding window of time
▸ The basic idea is to apply functions over a window of time, get the results, and
then slide the window ahead, and continue
▸ Used for smoothing noisy or missing data
▸ The rolling operator is called on a Series/DataFrame with a window (a time
span or a number of periods), followed by an aggregation function
▸ Result is set to the right edge of the window (can be changed by
center=True)
Window Functions Example 68
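A minimal sketch of rolling means on a hypothetical five-day series, showing a trailing window, a centered window, and a time-based window:

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
               index=pd.date_range("2019-01-01", periods=5, freq="D"))

# Window of 3 periods; each result sits at the right edge of its window.
trailing = ts.rolling(window=3).mean()

# center=True places each result at the middle of its window instead.
centered = ts.rolling(window=3, center=True).mean()

# Window given as a span of time: the last 2 days up to each timestamp.
by_time = ts.rolling("2D").mean()

print(trailing)
print(centered)
print(by_time)
```

With a fixed number of periods, positions that lack a full window are NaN; the time-based window starts producing values from the first observation.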