
P- 338– OIL PRICE PREDICTION

Presented By- Group 4


1. Ms. Vaishali Ravi Mukherjee
2. Mr. Yash Kumar Roy
3. Ms. Kashish Agarwal
4. Mr. Aniket Mahesh Prabhale
5. Mohammed Gouse Qureshi
6. Ms. Ayushi Srivastava
CONTENTS
• Business Objective
• Introduction
• Project Architecture
• Data Collection & Details
• Exploratory Data Analysis
• Modelling
• Evaluation
• Deployment
ACKNOWLEDGEMENT

We would like to express our deepest gratitude to Mr. Kartik Muskula and Ms. Anavadhya for
their invaluable patience and feedback. Thank you for entrusting us with this opportunity.
We could not have undertaken this journey without your guidance and expertise.
Business Objective

Oil prices can move sharply on a single market event; they are rarely driven by real-time fundamentals alone and are instead shaped by externalities, which makes forecasting them especially challenging. Because oil prices strongly affect the wider economy, our model aims to uncover patterns in historical prices and help customers and businesses make informed decisions.
Introduction
The volatility and complexity of global oil markets make predicting oil prices a
challenging yet crucial task for various stakeholders, including investors, policymakers,
and industry professionals. In recent years, data science has emerged as a powerful tool
to analyze historical trends, identify patterns, and build predictive models to forecast
future oil prices.
This data science project aims to leverage advanced analytics and machine learning
techniques to develop an accurate and reliable model for predicting oil prices. The
project involves collecting and preprocessing a diverse set of data that influence the
energy market.
PROJECT ARCHITECTURE / PROJECT FLOW

Business Understanding → Data Collection → Data Preparation → Exploratory Data Analysis & Visualization → Model Evaluation → Model Deployment
DATA COLLECTION

• We have taken the data from the site www.eia.gov
• Date is the independent variable and COSP the dependent variable
• Shape of dataset: (456, 2)
• Data types: Date datetime64[ns], COSP float64

Describe (COSP):

count    456.000000
mean      46.886338
std       29.567799
min       11.350000
25%       20.085000
50%       38.170000
75%       70.375000
max      133.880000
DATASET

The range of COSP varies from 11.35 $/barrel to 133.88 $/barrel.


OUTLIERS

Plots: density plot and box plot of COSP.

Observation:
There are no outliers above the upper whisker.
There are no outliers below the lower whisker.
The density plot shows the data is right-skewed, indicating some extreme values that do not follow the overall trend of the data.

Some extreme values noted:


Date        Crude Oil Price ($/barrel)
1998-12-25   11.00
2008-07-04  142.52
2014-06-20  107.23
2020-04-24    3.32
• Deleting the extreme values could bias the dataset
• Keeping the extreme values could cause overfitting and hurt model performance and accuracy
• A decision is needed on how to handle these extreme values
EDA AND VISUALIZATION

Observations:

1. There is an increasing trend.
2. The variance is not constant. Hence, the time series is NOT stationary.
3. The distribution is skewed. We need to check whether this improves with transformation.
Heatmap

Observations:
1. From 1986 to 1998 there is not much variance, but after that there is a lot of variance with some extreme highs and lows.
2. Post 1998, there is an increasing trend in crude oil price.
3. On 15th December 1998 the minimum price was $11.35/barrel; on 15th June 2008 the maximum price was $133.88/barrel.
4. Some peak values are observed in 2014.
5. In April 2020 there are dips in the price of crude oil.
Final observation: Since 1998 crude oil prices have risen, with peak values observed in 2008, 2013, 2014 and 2022.
LINE PLOT

Observations:

The line plot also confirms that the increase in price is significantly high around 2005 and 2006.
Lag Plot (Monthly)

A lag plot is a scatter plot of a time series (signal) against a lagged version of itself. It is normally used to check for autocorrelation: if a pattern exists in the plot, like the one above, the series is autocorrelated; if there is no pattern, the series is likely random white noise.
Note: In our time series, the data points cluster along a diagonal from the bottom-left to the top-right of the plot, suggesting a positive correlation.
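The diagonal clustering described above corresponds to strong lag-1 autocorrelation, which can be checked numerically. A minimal sketch, using a random walk as a stand-in for the price series (an assumption; like prices, each value stays close to the previous one):

```python
import numpy as np
import pandas as pd

# Random walk standing in for the price series: strongly autocorrelated,
# so its lag plot hugs the bottom-left-to-top-right diagonal.
np.random.seed(42)
s = pd.Series(np.cumsum(np.random.randn(300)))

# Correlation between the series and its lag-1 version: close to +1 here.
lag1_corr = s.autocorr(lag=1)
```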

ACF Plot/ Correlogram

The ACF plot quantifies the strength and type of relationship between a time series (signal) and its lagged versions.
Observation: There is statistically significant correlation between the time series and its lagged versions, so this dataset is well suited to time series forecasting.
SPLITTING THE RAW DATA INTO TRAIN AND TEST SETS:

We use a sequential split because the order of observations must remain intact in a time series dataset used for forecasting.

We have 38 years of data in total; the last 2 years are used as the test set and the remaining 36 years as the training set. The final model is trained on the entire dataset before making predictions.
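The sequential split above can be sketched as follows (the DataFrame here is a hypothetical stand-in for the 456-row monthly COSP dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the 456-row monthly COSP dataset.
dates = pd.date_range("1986-01-31", periods=456, freq="M")
df = pd.DataFrame({"Date": dates, "COSP": np.random.rand(456) * 100})

# Sequential split: no shuffling, last 24 months (2 years) become the test set.
train, test = df.iloc[:-24], df.iloc[-24:]
```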
TIME SERIES DECOMPOSITION
• Trend: a trend exists when there is a long-term increase or decrease in the time series data.
• Seasonality: a seasonal pattern is observed when a time series is affected by seasonal factors occurring yearly, monthly, daily, etc. Seasonality refers to a known frequency, e.g. quarterly = 4, monthly = 12.
To extract the trend, seasonality and error we used the decompose() and forecast::stl() functions to split the time series into its seasonal, trend and error components.

Observation:
1. Actual data
2. Trend: there is an increasing trend
3. Seasonality: there is seasonality present
4. Resid: the residual is what remains after removing the two major components (trend and seasonality)
TEST FOR STATIONARITY
Assumptions of the ARIMA model

∙ Data should be stationary: a time series is stationary if its statistical properties (mean, variance, autocorrelation, etc.) are constant over time, implying it has no trend or seasonal effect. If a trend is present and stationarity is not evident, many of the computations throughout the process cannot produce the intended results. Statistical modelling methods assume or require the series to be stationary to be effective.

∙ Data should be univariate: ARIMA works on a single variable (autoregression is regression on past values).

Test to check for stationarity:

• Augmented Dickey-Fuller test (ADF test)
• Null hypothesis (H0): the series is not stationary (p-value > 0.05)
• Alternate hypothesis (H1): the series is stationary (p-value <= 0.05)
• A p-value of less than 0.05 in adf.test() indicates that the time series is stationary.
Observation:
We can see that the p-value is 0.315. Since p-value > 0.05, we fail to reject the null hypothesis.

Conclusion: Our time series is Non-stationary. We need to make our non-Stationary time series to
Stationary before we begin model building.
Now we will apply transformations to the non-stationary time series and check with ADF test if the
time series has become stationary.
The stationary time series will help to analyze the data further and understand the variance in the
data.
❖ Square root Transformation
❖Log Transformation:
❖ Differencing :
Here we compute the differences between consecutive observations in the time series. If Yt denotes the value of the series Y at period t, the first-order difference of Y at period t is Y't = Yt − Yt−1. Differencing is done to get rid of the varying mean.
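First-order differencing as defined above is one call in pandas (the values here are illustrative):

```python
import pandas as pd

# First-order differencing Y't = Yt - Y(t-1) on illustrative values.
y = pd.Series([10.0, 12.0, 15.0, 11.0])
dy = y.diff().dropna()
# dy is [2.0, 3.0, -4.0]
```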
Observation:

We can see that the p-value is 6.59e-18. Since p-value < 0.05, we reject the null hypothesis.
Conclusion: Our time series is Stationary. It means it does not have any Trend and Seasonality. The
data does not depend on the time when it is captured. We can use this data for Model Building.
Model based methods (monthly data)

Model based methods (Models) RMSE_Values


RMSE_Linear 14.116137
RMSE_Mult_Add_Sea 15.906114
RMSE_Exp 16.074304
RMSE_Add_Sea_Quad 20.508255
RMSE_Quad 20.707724
RMSE_Add_Sea 43.303265
RMSE_Mult_Sea 51.365315

Observation: The linear model performed best on the raw data, followed by ordinary least squares with multiplicative-additive seasonality.
DATA DRIVEN MODELS (MONTHLY DATA)

Data Driven Models RMSE_Values


Triple_Exp_Mul 11.766312
Simple_Exp 13.109679
Triple_Exp_Mul_Add 13.211462
Triple_Exp_Add 13.259708
Double_Exp 13.618835

Triple_Exp_Add_Mul 88.349556

Observation: The Holt-Winters triple exponential smoothing model with multiplicative trend and seasonality performed best on the raw data, followed by the simple exponential smoothing model.
ARIMA MODEL (MONTHLY DATA)

RMSE_ARIMA = 21.84
SARIMA MODEL (MONTHLY DATA)

RMSE_SARIMA = 22.43
PROPHET MODEL (MONTHLY DATA)

• Prophet is open-source software released by Facebook. It is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.

m = Prophet()
m.fit(train)
future = m.make_future_dataframe(24, freq='M')  # 'MS' for month start, 'H' for hourly
forecast = m.predict(future)

rec no  ds        yhat       yhat_lower  yhat_upper
451     31/07/23  20.857337   3.272883   38.198375
452     31/08/23  48.247309  31.433898   66.395961
453     30/09/23  57.715880  40.133683   73.896220
454     31/10/23  67.195082  50.459086   82.728286
455     30/11/23  27.835174  10.847369   45.154253

RMSE_Prophet = 47.93
PROPHET MODEL CONTD
LSTM MODEL (MONTHLY DATA)
• The Long Short-Term Memory network, or LSTM network, is a recurrent neural network trained
using Backpropagation Through Time that overcomes the vanishing gradient problem. It can be
applied to time series forecasting.
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import LSTM, Dense

# define generator: input window of 12 months, one feature
n_input = 12
n_features = 1
generator = TimeseriesGenerator(scaled_train, scaled_train,
                                length=n_input, batch_size=1)

# define model
model = Sequential()
model.add(LSTM(100, activation='relu',
               input_shape=(n_input, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
LSTM MODEL (MONTHLY DATA)
RMSE_LSTM = 5.45
SILVERKITE MODEL (MONTHLY DATA)
Silverkite is a forecasting algorithm developed by LinkedIn. It supports different kinds of growth,
interactions, and fitting algorithms.
# specify dataset information
metadata = MetadataParam(
    time_col="ds",  # name of the time column
    value_col="y",  # name of the value column
    freq="M",       # "H" for hourly, "D" for daily, "M" for monthly, etc.
    train_end_date=pd.to_datetime('2021-06-30'))

forecaster = Forecaster()  # creates forecasts and stores the result
result = forecaster.run_forecast_config(  # result is also stored as `forecaster.forecast_result`
    df=df,
    config=ForecastConfig(
        model_template=ModelTemplateEnum.SILVERKITE.name,
        forecast_horizon=24,  # forecasts 2 years ahead
        coverage=0.95,        # 95% prediction intervals
        metadata_param=metadata))

RMSE_Silverkite = 9.93
SILVERKITE MODEL (CONTD)
ALL MODELS WITH RMSE VALUES
Sr No  Models               RMSE_Values
1      RMSE_LSTM              5.44
2      RMSE_SILVERKITE        9.93
3      Triple_Exp_Mul        11.76
4      Simple_Exp            13.11
5      Triple_Exp_Mul_Add    13.21
6      Triple_Exp_Add        13.26
7      Double_Exp            13.62
8      RMSE_Linear           14.12
9      RMSE_Mult_Add_Sea     15.91
10     RMSE_Exp              16.07
11     RMSE_Add_Sea_Quad     20.51
12     RMSE_Quad             20.71
13     RMSE_ARIMA            21.84
14     RMSE_SARIMA           22.43
15     RMSE_Add_Sea          43.31
16     RMSE_PROPHET          47.93
17     RMSE_Mult_Sea         51.37
18     Triple_Exp_Add_Mul    88.35

Error evaluation: Root Mean Square Error (RMSE)

Observation: The LSTM model performed best on the raw data, followed by Silverkite and the Holt-Winters triple exponential smoothing model with multiplicative trend and seasonality.
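The metric used to rank all the models above can be sketched in a few lines (the arrays are illustrative, not project data):

```python
import numpy as np

# Root Mean Square Error: sqrt of the mean squared difference
# between actual and predicted values.
def rmse(actual, predicted):
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))
```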
MODEL DEPLOYMENT
• We used the LSTM model for deployment.
Final Model Graph
MODEL DEPLOYMENT (CONTD)

Streamlit command to run the App:
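The screenshot of the command is not reproduced here; a Streamlit app is typically launched as below (the script name app.py is an assumption):

```shell
streamlit run app.py
```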


MODEL DEPLOYMENT (CONTD)
Sample Page:
MODEL DEPLOYMENT (CONTD)
Future Prediction:
CHALLENGES

• Limited Predictive Power: Date alone does not provide sufficient information to predict oil
prices with a high degree of accuracy. While historical price trends may exhibit certain patterns
over time, these patterns may not necessarily persist in the future. The lack of contextual
information limits the model's predictive power and makes it susceptible to random
fluctuations or noise in the data.
• The ARIMA model source code was taking a long time to execute, so we used Google Colaboratory to develop the code.
Thank you
