
Time Series Analysis and

Predictive Modeling
Unit 5

By
Dr. G. Sunitha
Professor
Department of AI & ML

School of Computing

Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102


Time Series
• A time series is a sequence of measurements from a system that varies in time.

• It is a sequence of data points that occur in successive order over some period of time.

• Time series analysis can be useful to see how a given asset, security, or economic variable changes over time.

• In particular, a time series allows one to see what factors influence certain variables from period to period.

mj-clean Dataset

• mj-clean.csv Dataset - price, quantity, quality, and location of cannabis transactions.

• Each transaction is an event in time, so we could treat this dataset as a time series. But the events are not equally spaced in time; the number of transactions reported each day varies from 0 to several hundred. Many methods used to analyze time series require the measurements to be equally spaced.

mj-clean Dataset
S.No.  Attribute    Description
1      city         A string representing the name of the city where the transaction occurred.
2      state        Two-letter state abbreviation indicating the U.S. state in which the transaction took place.
3      price        Price paid in dollars.
4      amount       Quantity purchased in grams.
5      quality      High, medium, or low quality, as reported by the purchaser.
6      date         Date of report, presumed to be shortly after date of purchase.
7      ppg          Price per gram in dollars.
8      state.name   String state name.
9      lat          Approximate latitude of the location of the transaction.
10     lon          Approximate longitude of the location of the transaction.
Importing and Cleaning
# importing pandas
import pandas as pd

# reading csv file and creating a DataFrame
# parse_dates=[5] interprets the values in column 5 as dates and converts
# them to NumPy datetime64 objects
dataset = pd.read_csv('D://mj-clean.csv', parse_dates=[5])

# displaying first 5 records
dataset.head()

Output

Importing and Cleaning . . .

• The events in this dataset are not equally spaced in time; the number of
transactions reported each day varies from 0 to several hundred. Many methods
used to analyze time series require the measurements to be equally spaced.

import numpy as np

def GroupByDay(transactions):
    grouped = transactions[['date', 'ppg']].groupby('date')
    daily = grouped.aggregate(np.mean)
    daily['date'] = daily.index
    start = daily.date[0]
    # note: recent pandas versions reject dividing by a 'Y' timedelta;
    # (daily.date - start).dt.days / 365.25 is a version-safe alternative
    one_year = np.timedelta64(1, 'Y')
    daily['years'] = (daily.date - start) / one_year
    return daily

Importing and Cleaning . . .
• transactions[['date', 'ppg']].groupby('date'): This line selects the 'date' and
'ppg' columns from the 'transactions' DataFrame and groups the data by the 'date'
column. It creates a grouped object.
• grouped.aggregate(np.mean): This line applies the mean function to the 'ppg' values for each group of transactions on the same date. The result is a DataFrame with the aggregated values.

Importing and Cleaning . . .
• daily['date'] = daily.index: This line creates a new column 'date' in the 'daily'
DataFrame and sets its values to be the index of the 'daily' DataFrame, which
represents the dates.
• start = daily.date[0]: This line assigns the first date in the 'daily' DataFrame
to the variable 'start'.
• one_year = np.timedelta64(1, 'Y'): This line creates a numpy timedelta
object representing one year.
• daily['years'] = (daily.date - start) / one_year: This line calculates the
number of years from the 'start' date for each date in the 'daily' DataFrame and
adds a new column 'years' with these values.
• return daily: The function returns the modified 'daily' DataFrame.

In summary, the function groups transactions by day, calculates the mean of the 'ppg' values for each day, and adds columns for the date and the number of years from the start date.
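The steps above can be checked on toy data. Everything below is an assumption for illustration: the dates and ppg values are made up, and days / 365.25 stands in for the 'Y' timedelta division, which recent pandas versions no longer accept.

```python
import pandas as pd

# Toy transactions: two reports on Jan 1, one on Jan 2 (assumed values)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-02']),
    'ppg':  [10.0, 12.0, 8.0],
})

# Group by day and take the mean ppg, as in GroupByDay
daily = toy.groupby('date')[['ppg']].mean()
daily['date'] = daily.index

# Elapsed years since the first date
start = daily['date'].iloc[0]
daily['years'] = (daily['date'] - start).dt.days / 365.25

print(daily['ppg'].tolist())  # [11.0, 8.0] -- the two Jan 1 reports averaged
```

The two transactions on the same date collapse into one row, which is exactly how the function turns unevenly spaced events into one measurement per day.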

Importing and Cleaning . . .

• Output of GroupByDay Function

Importing and Cleaning . . .

def GroupByQualityAndDay(transactions):
    groups = transactions.groupby('quality')
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)
    return dailies

Importing and Cleaning . . .

• GroupByQualityAndDay() groups transactions by 'quality' and then applies the GroupByDay() function to each group.

• It returns a dictionary where each key is a quality level and the associated value is a DataFrame containing the daily aggregated information for that quality level, with columns ppg, date, and years.

# calling function to find daily aggregated information for each quality level
d = GroupByQualityAndDay(dataset)

Plotting
# plotting using matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create subplots
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(15, 15), sharex=True)

# Iterate over quality levels and create scatter plots
for i, (quality, daily_data) in enumerate(d.items()):
    sns.scatterplot(x='date', y='ppg', data=daily_data, ax=axes[i], color='blue')
    axes[i].set_title(f'Scatter Plot - {quality}')
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel('Price per Gram')
    axes[i].set_ylim(0, 20)  # Set y-axis limits

# show plot
plt.show()
Plotting . . .

Plotting . . .

• One apparent feature in these plots is a gap around November 2013. It is possible
that data collection was not active during this time, or the data might not be
available. We will consider ways to deal with this missing data later.
• Visually, it looks like the price of high quality cannabis is declining during this
period, and the price of medium quality is increasing. The price of low quality
might also be increasing, but it is harder to tell, since it seems to be more volatile.

Moving Averages

• Most time series analysis is based on the modeling assumption that the observed
series is the sum of three components:
– Trend
A smooth function that captures persistent changes
– Seasonality
Periodic variation, possibly including daily, weekly, monthly, or yearly cycles
– Noise
Random variation around the long-term trend

• Regression is one way to extract the trend from a series.
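The three-component model can be illustrated on synthetic data. Every number below is made up for the sketch: a series is built as trend + seasonality + noise, and a least-squares line fit, as the regression bullet suggests, recovers the trend.

```python
import numpy as np
import pandas as pd

t = np.arange(365)                              # one year of daily samples
rng = np.random.default_rng(0)

trend = 10.0 - 0.005 * t                        # persistent linear decline
seasonality = 1.5 * np.sin(2 * np.pi * t / 7)   # weekly cycle
noise = rng.normal(0.0, 0.5, size=t.size)       # random variation

observed = trend + seasonality + noise
series = pd.Series(observed,
                   index=pd.date_range('2014-01-01', periods=t.size, freq='D'))

# Fitting a straight line recovers the trend component; the seasonal and
# noise components largely average out over a full year
slope, intercept = np.polyfit(t, observed, deg=1)
print(slope)  # close to the true slope of -0.005
```

The fitted slope lands near the true -0.005 per day because the weekly cycle and the zero-mean noise contribute almost nothing to a straight-line fit over many periods.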

Moving Averages . . .

• But if the trend is not a simple function, a good alternative is a moving average.
• A moving average divides the series into overlapping regions, called windows,
and computes the average of the values in each window.
• One of the simplest moving averages is the rolling mean, which computes the
mean of the values in each window. For example, if the window size is 3, the
rolling mean computes the mean of values 0 through 2, 1 through 3, 2 through
4, etc.
• pandas provides the rolling() method, which takes a window size; chaining .mean() computes the rolling mean and returns a new Series.
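The window-of-3 example above can be sketched directly (the series values are made up for illustration):

```python
import pandas as pd

# Toy series (assumed values, not from the dataset)
s = pd.Series([0, 1, 2, 3, 4])

# Rolling mean with window size 3: each output is the mean of the
# current value and the two before it
r = s.rolling(window=3).mean()
print(r.tolist())  # [nan, nan, 1.0, 2.0, 3.0]
```

The first two outputs are NaN because a full window of 3 values is not yet available; the first real output is the mean of values 0 through 2, the next of values 1 through 3, and so on.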

Moving Averages . . .
import matplotlib.pyplot as plt
import seaborn as sns

# Create subplots
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 7), sharex=True)

# Iterate over quality levels and create rolling mean plots
for i, (quality, daily_data) in enumerate(d.items()):
    # Adjust the window size as needed
    rolling_mean = daily_data['ppg'].rolling(window=30).mean()
    axes[i].plot(daily_data['date'], rolling_mean,
                 label=f'{quality} - Rolling Mean', color='blue')
    axes[i].set_title(f'Rolling Mean - {quality}')
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel('Rolling Mean (ppg)')
    axes[i].legend()

# show plot
plt.tight_layout()
plt.show()
Moving Averages . . .

Moving Averages . . .

• The rolling mean seems to do a good job of smoothing out the noise and
extracting the trend. The first 29 values are NaN, and wherever there’s a
missing value, it’s followed by another 29 NaNs. There are ways to fill in these
gaps, but they are a minor nuisance.
• An alternative is the Exponentially-Weighted Moving Average (EWMA), which
has two advantages.
– It computes a weighted average where the most recent value has the
highest weight and the weights for previous values drop off exponentially.
– The pandas implementation of EWMA handles missing values better.
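The difference in missing-value handling can be sketched on a toy series with a gap (the values below are made up, and span=2 is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Toy series with one missing observation (assumed values)
s = pd.Series([10.0, 11.0, np.nan, 12.0, 13.0])

# The rolling mean produces NaN for every window that touches the gap
rolled = s.rolling(window=2).mean()

# EWMA reweights the surviving observations instead, so the smoothed
# series keeps going across the gap
ewma = s.ewm(span=2, adjust=False).mean()

print(rolled.isna().sum())  # the gap knocks out windows around position 2
print(ewma.isna().sum())    # 0
```

This is the behavior the bullet points describe: a single missing day blanks out every rolling window that includes it, while the EWMA carries its running weighted average through the gap.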

Moving Averages . . .
import matplotlib.pyplot as plt
import seaborn as sns

# Create subplots
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 7), sharex=True)

# Iterate over quality levels and create EWMA plots
for i, (quality, daily_data) in enumerate(d.items()):
    # Calculate EWMA with a specified span (adjust as needed)
    ewma = daily_data['ppg'].ewm(span=7, adjust=False).mean()
    axes[i].plot(daily_data['date'], ewma, label=f'{quality} - EWMA', color='blue')
    axes[i].set_title(f'EWMA - {quality}')
    axes[i].set_xlabel('Date')
    axes[i].set_ylabel('EWMA (ppg)')
    axes[i].legend()

# show plot
plt.tight_layout()
plt.show()
Moving Averages . . .
