EDA Document

Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.arima_model import ARIMA  # legacy API, removed in statsmodels 0.13
from datetime import date
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from pmdarima import auto_arima
from datetime import timedelta
import investpy
import scipy.stats as stats
from statistics import stdev
from pandas.plotting import lag_plot
import math
Pandas: pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical, real-world
data analysis in Python.
Numpy: NumPy is a Python library for working with arrays. It also provides functions for
linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by Travis
Oliphant. NumPy stands for Numerical Python.
Matplotlib: Matplotlib is a Python visualization library for 2D plots of arrays. One of the
greatest benefits of visualization is that it presents large amounts of data in easily
digestible visuals. Matplotlib supports several plot types, such as line, bar, scatter, and
histogram plots.
Sklearn: scikit-learn (imported as sklearn) is a widely used library for machine learning in
Python. It contains efficient tools for machine learning and statistical modeling, including
classification, regression, clustering, and dimensionality reduction.
Statsmodels: statsmodels allows users to explore data, perform statistical tests, and
estimate statistical models. It complements SciPy's stats module and is part of the Python
scientific stack for data science, statistics, and data analysis.
Datetime: Dates and datetimes are objects in Python, so when you manipulate them you are
manipulating objects, not strings or timestamps. A datetime is a combination of a date and a
time, with the attributes year, month, day, hour, minute, second, microsecond, and tzinfo.
Adfuller: The adfuller function runs the Augmented Dickey-Fuller (ADF) test and returns a
tuple of statistics such as the test statistic, the p-value, the number of lags used, the
number of observations used for the ADF regression, and a dictionary of critical values.
ARIMA: ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average.
It is a class of models that captures a suite of different standard temporal structures in
time series data. This document develops an ARIMA model for time series forecasting in
Python.
Investpy: investpy is a Python package to retrieve data from Investing.com, which
provides data retrieval from up to 39952 stocks, 82221 funds, 11403 ETFs, 2029 currency
crosses, 7797 indices, 688 bonds, 66 commodities, 250 certificates, and 4697
cryptocurrencies.
Seasonal decompose: The statsmodels library provides an implementation of the naive, or
classical, decomposition method in a function called seasonal_decompose(). It requires that
you specify whether the model is additive or multiplicative. The seasonal_decompose()
function returns a result object.
Scipy: SciPy is an open-source Python library used for solving mathematical,
scientific, engineering, and technical problems. It allows users to manipulate and
visualize data using a wide range of high-level Python commands. SciPy is built on the
Python NumPy extension. SciPy is pronounced "Sigh Pie."
Pmdarima: pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a
statistical library designed to fill the void in Python's time series analysis capabilities.
It includes a collection of statistical tests of stationarity and seasonality, and time
series utilities such as differencing and inverse differencing.
Stdev: The statistics module in Python provides a function called stdev(), which calculates
the standard deviation. stdev() calculates the standard deviation from a sample of data,
rather than from an entire population. It quantifies the spread, or variation, of a set of
data values.
Lag plot: A lag plot checks whether a data set or time series is random or not. Random
data should not exhibit any identifiable structure in the lag plot. Non-random structure in the lag
plot indicates that the underlying data are not random.
Math: The math module is a standard module in Python and is always available. To use
mathematical functions under this module, you have to import the module using import math . It
gives access to the underlying C library functions.
Load the data
import mysql.connector
mydb = mysql.connector.connect(
    host="localhost",
    user="root",
    password="*****",
    database="sbi_life"
)
mycursor = mydb.cursor()
mycursor.execute("SELECT Dates, Open FROM sbi")
myresult = mycursor.fetchall()
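The rows returned by fetchall() are plain tuples, so they have to be loaded into a DataFrame before the pandas steps that follow can be applied. The sketch below shows one way this could look. The sample rows and the column names 'Date' and 'Price' are assumptions made for illustration and are not taken from the actual database.

```python
import pandas as pd

# Hypothetical sample of what mycursor.fetchall() might return:
# one (date string, opening price) tuple per row.
sample_rows = [
    ("2021-01-04", 35.10),
    ("2021-01-05", 35.25),
    ("2021-01-07", 35.40),
]

# Column names are assumptions chosen to match the later steps.
sample_df = pd.DataFrame(sample_rows, columns=["Date", "Price"])
sample_df["Date"] = pd.to_datetime(sample_df["Date"])  # parse strings into timestamps
sample_df = sample_df.set_index("Date")                # use the date as the index

print(sample_df)
```

Parsing the dates into timestamps matters here because the later reindexing and date-based slicing rely on a datetime index.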
df = df.drop(['High','Low','Close','Change Pct'], axis = 1)
Here we drop the columns that are not used.
df.set_index('Date', inplace = True)

Filling Missing dates
Define the range of dates between the earliest date available in the data set and yesterday:
date_range = pd.date_range(start=df.index.min(), end=date.today() - timedelta(days=1), freq="D")
Reindexing to this continuous daily range adds the missing dates, which improves the
prediction accuracy and overall usability of the model:
df1 = df.reindex(date_range)
Handle the missing prices by imputing a value for each new date. Forward fill carries the
last known price forward, so the series has no gaps:
df1['Price'] = df1['Price'].ffill()
print(df1)
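The effect of reindexing and forward filling is easiest to see on a tiny series. The following is a minimal sketch on made-up prices (the dates and values are illustrative assumptions), not the SBI data itself.

```python
import pandas as pd

# Toy series with a gap: 2021-01-03 and 2021-01-04 are missing.
prices = pd.DataFrame(
    {"Price": [35.0, 35.5, 36.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
)

# Continuous daily index from the first to the last available date.
full_range = pd.date_range(start=prices.index.min(), end=prices.index.max(), freq="D")

# Reindexing inserts the missing dates with NaN prices ...
filled = prices.reindex(full_range)
# ... and forward fill copies the last known price into each gap.
filled["Price"] = filled["Price"].ffill()

print(filled)
```

Forward fill is a reasonable default for prices on non-trading days, because the last quoted price remains the best known value until the next quote.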
Mean, median, mode, kurtosis, standard deviation, skewness
print("Mean=",np.mean(df1["Price"]))
np.mean computes the arithmetic mean and accepts optional arguments that control its input
and output.
Mean = 35.51
print("Median=",np.median(df1["Price"]))
np.median finds the middle value of the sorted array.
Median = 35.572
print(stats.mode(df1["Price"]))
stats.mode returns the modal (most common) value in the passed array.
Mode = 39.935
print("Standard Deviation=",stdev(list(df1['Price'])))
The standard deviation is the square root of the average of the squared deviations from the
mean. stdev() uses the sample form, which divides by n - 1.
Standard deviation = 3.074
print("Skewness:\n",df1.skew(axis=0))
Skewness
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive, negative, or undefined.
Price -0.276881
print("Kurtosis:\n",df1.kurt(axis = 0))
The pandas DataFrame method kurt() (alias kurtosis()) computes the kurtosis of the values
along a specific axis (that is, per row or per column). pandas reports excess kurtosis, so a
normal distribution has a value of 0.
Price -1.295515
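To make the definitions above concrete, here is a small sketch that computes the same statistics on a short made-up list of values. The numbers are illustrative assumptions and have no relation to the SBI prices.

```python
import pandas as pd
from statistics import stdev

# Small illustrative sample (not the SBI data).
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="Price")

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],          # first of the most common values
    "sample_std": stdev(values.tolist()),   # divides by n - 1
    "skewness": values.skew(),
    "excess_kurtosis": values.kurt(),       # 0 for a normal distribution
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The sample standard deviation from statistics.stdev matches pandas' own std(), which also divides by n - 1 by default.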
• Plot close price

This plot shows the closing price for each date.
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Date')
plt.ylabel('Price')
plt.plot(df1['Price'])
plt.title('SBI Bond price')
plt.show()
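• Q-Q plot
The quantile-quantile (Q-Q) plot is a graphical technique for determining whether two data
sets come from populations with a common distribution. Here the prices are compared with the
quantiles of a normal distribution, which is the default for stats.probplot.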
plt.figure(figsize=(15, 10))
stats.probplot(df1["Price"], plot=plt)
plt.title("Normal Q-Q plot")
plt.ylabel("SBI Bond Prices")
plt.show()
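Besides drawing the plot, stats.probplot also returns the fitted line and its correlation coefficient r, which gives a rough numeric reading of how straight the Q-Q plot is. The sketch below compares a normal and a uniform sample. Both samples are synthetic assumptions generated for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(loc=35.0, scale=3.0, size=500),
    "uniform": rng.uniform(low=30.0, high=40.0, size=500),
}

r_values = {}
for label, sample in samples.items():
    # Without plot=..., probplot only returns the quantile pairs and the fitted line.
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")
    r_values[label] = r
    print(f"{label}: r = {r:.4f}")
```

An r close to 1 means the points lie close to a straight line, which is what normally distributed data would produce. Lower values point to heavier or lighter tails than the normal distribution.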
• Histogram plot
A histogram represents data that has been grouped into ranges. It is an accurate method for
the graphical representation of the distribution of numerical data. It is a type of bar plot
where the X-axis represents the bin ranges and the Y-axis gives the frequency.
mu = sum(list(df1['Price']))/len(df1['Price'])
sigma = stdev(list(df1['Price']))
n, bins, patches = plt.hist(df1['Price'], bins=30, facecolor='#2ab0ff', edgecolor='#e0e0e0',
                            linewidth=0.5, alpha=0.7)
n = n.astype('int')
# color each bar according to its height
for i in range(len(patches)):
    patches[i].set_facecolor(plt.cm.viridis(n[i]/max(n)))
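The relationship between bin ranges and frequencies can be checked without plotting. This is a minimal sketch using np.histogram on a made-up list of values (an illustrative assumption, not the price data).

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Four equal-width bins between the minimum and maximum value.
counts, bin_edges = np.histogram(values, bins=4)

# Each bin is half-open [left, right), except the last one, which includes its right edge.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

plt.hist uses the same binning, so the counts printed here are the bar heights a histogram of these values would show.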
• Seasonality plot
A seasonal plot is very similar to a time plot, except that the data is plotted against the
individual seasons. The definition of the season is up to the analyst; in our particular
case, the season is simply the month. The following code decomposes the series into trend,
seasonal, and residual components using a multiplicative model and plots them.
df2 = df1.copy()
plt.rcParams.update({'figure.figsize': (10,10)})
decompose_result_mult = seasonal_decompose(df2["Price"], model="multiplicative")
decompose_result_mult.plot()
• Autocorrelation for lag 3
Autocorrelation measures the degree of similarity between a time series and a lagged copy of
itself, which helps reveal periodic components embedded in the data. For a series x(t), the
lag plot below plots x(t) against x(t+3).
lag_plot(df1['Price'], lag=3)
plt.title('SBI Bond Price - Autocorrelation plot with lag = 3')
plt.show()
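The lag plot has a numeric counterpart: the correlation between x(t) and x(t+3). The following sketch computes it by hand and compares the result with pandas' Series.autocorr. The random walk and white noise series are synthetic assumptions used only to show the contrast.

```python
import numpy as np
import pandas as pd

def lag_correlation(series, lag):
    """Pearson correlation between x(t) and x(t + lag)."""
    x = series.to_numpy(dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

rng = np.random.default_rng(42)
random_walk = pd.Series(np.cumsum(rng.normal(size=300)))  # strongly autocorrelated
white_noise = pd.Series(rng.normal(size=300))             # no structure expected

for name, s in [("random walk", random_walk), ("white noise", white_noise)]:
    print(f"{name}: manual = {lag_correlation(s, 3):.4f}, pandas = {s.autocorr(lag=3):.4f}")
```

A value near 1 corresponds to a lag plot whose points hug the diagonal, while a value near 0 corresponds to a shapeless cloud.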
• Scatter plot
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to
display values for typically two variables for a set of data.
df_open = df1['Price']
df_open.plot(style='k,')
plt.title('Scatter plot of closing price')
plt.show()
• Train and test data

Train/test is a method for measuring the accuracy of a model. The data set is split into two
sets: a training set and a testing set. Data from the earliest date up to February 2021 is
used for training, and data from March 2021 up to today is used for testing.
You train the model using the training set.
You test the model using the testing set.
train_data = df1.loc[:'2021-02-28',:]
test_data = df1.loc['2021-03-01':, :]
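Label-based slicing on a datetime index includes both end points, which is why the two slices above do not overlap. A small sketch on made-up data (the dates and values are illustrative assumptions) confirms that every row lands in exactly one set.

```python
import pandas as pd

# Ten consecutive days around the split date.
index = pd.date_range("2021-02-24", "2021-03-05", freq="D")
prices = pd.DataFrame({"Price": range(len(index))}, index=index)

split_date = "2021-02-28"
train = prices.loc[:split_date]                                      # up to and including the split date
test = prices.loc[pd.Timestamp(split_date) + pd.Timedelta(days=1):]  # from the following day

print(len(train), len(test))
print(train.index.max(), test.index.min())
```

For time series the split has to be chronological like this, since shuffling rows would let the model see the future during training.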
• Train/test plot
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Date')
plt.ylabel('Price')
plt.plot(train_data, 'red', label='Train data')
plt.plot(test_data, 'blue', label='Test data')
plt.legend()  # train, test data
plt.show()
## Model Part ##
• AutoARIMA for best fit

model_autoARIMA = auto_arima(train_data, start_p=0, start_q=0,
                             test='adf',        # use the ADF test to find the optimal 'd'
                             max_p=5, max_q=5,  # maximum p and q
                             m=1,               # frequency of the series
                             d=None,            # let the model determine 'd'
                             seasonal=False,    # no seasonality
                             start_P=0,
                             D=0,
                             trace=True,
                             error_action='ignore',
                             suppress_warnings=True,
                             stepwise=True)
The main auto_arima parameters are described below. The names are given in the dotted R
style; pmdarima spells them with underscores (for example max_p, max_q, max_order), and
defaults can differ between the two implementations.
d: Order of first-differencing. If missing, a value is chosen based on test.
D: Order of seasonal-differencing. If missing, a value is chosen based on seasonal.test.
max.p: Maximum value of p.
max.q: Maximum value of q.
max.P: Maximum value of P.
max.Q: Maximum value of Q.
max.order: Maximum value of p+q+P+Q if model selection is not stepwise.
max.d: Maximum number of non-seasonal differences.
max.D: Maximum number of seasonal differences.
stationary: If TRUE, restricts the search to stationary models.
seasonal: If FALSE, restricts the search to non-seasonal models.
ic: Information criterion to be used in model selection.
stepwise: If TRUE, does stepwise selection (faster). Otherwise, it searches over all models.
Non-stepwise selection can be very slow, especially for seasonal models.
trace: If TRUE, the list of ARIMA models considered is reported.
seasonal.test: Determines which method is used to select the number of seasonal differences.
The default method is to use a measure of seasonal strength computed from an STL
decomposition. Other possibilities involve seasonal unit root tests.
# auto_arima diagnostic plot

model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()
• Model building
model = ARIMA(train_data, order=(1,1,0))  # best-fit order from auto_arima
fitted = model.fit(disp=-1)               # disp=-1 suppresses convergence output
print(fitted.summary())
• Forecast
fc, se, conf = fitted.forecast(len(test_data), alpha=0.05)  # 95% confidence
fc_series = pd.Series(fc, index=test_data.index)
lower_series = pd.Series(conf[:, 0], index=test_data.index)
upper_series = pd.Series(conf[:, 1], index=test_data.index)

plt.figure(figsize=(12,5), dpi=100)
plt.plot(train_data, label='training')
plt.plot(test_data, color = 'blue', label='Actual BOND Price')
plt.plot(fc_series, color = 'orange',label='Predicted BOND Price')
plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.05)
plt.title('SBI Bond Price Prediction')
plt.xlabel('Time')
plt.ylabel('SBI Bond Price')
plt.legend(loc='upper left', fontsize=8)
plt.show()
ARIMA_acc = mean_absolute_error(test_data.Price,fc_series)
mse = mean_squared_error(test_data, fc)

print('MSE: '+str(mse))
MSE: 0.03231212962843829
mae = mean_absolute_error(test_data, fc)
print('MAE: '+str(mae))
MAE: 0.1576871822323348
rmse = math.sqrt(mean_squared_error(test_data, fc))
print('RMSE: '+str(rmse))
RMSE: 0.1797557499175987
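The three error measures are related: MSE averages the squared errors, MAE averages the absolute errors, and RMSE is the square root of MSE (the reported RMSE above is indeed the square root of the reported MSE). A minimal sketch with hand-written helpers makes the arithmetic explicit. The helper names and the sample numbers are illustrative assumptions, and in practice the sklearn functions used above do the same job.

```python
import math

def mse_score(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae_score(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up actual and forecast values.
actual = [10.0, 10.2, 10.4, 10.1]
predicted = [10.1, 10.1, 10.5, 10.3]

mse_value = mse_score(actual, predicted)
mae_value = mae_score(actual, predicted)
rmse_value = math.sqrt(mse_value)

print(f"MSE: {mse_value:.4f}  MAE: {mae_value:.4f}  RMSE: {rmse_value:.4f}")
```

RMSE and MAE are in the same units as the price, which makes them easier to interpret than MSE. RMSE penalizes large errors more heavily than MAE.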
import pickle
Pickle: "Pickling" is the process whereby a Python object hierarchy is converted into a byte
stream, and "unpickling" is the inverse operation, whereby a byte stream (from a binary file
or bytes-like object) is converted back into an object hierarchy. Pickle is primarily used
for serializing and deserializing Python objects in order to store them in a file or
database, maintain program state across sessions, or transport data over the network.
filename = "bond.pkl"
with open(filename, 'wb') as f:
    pickle.dump(fitted, f)  # save the fitted ARIMA model
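Saving is only half of the workflow, since the model has to be loaded again before it can be reused. The sketch below shows a full round trip. A plain dictionary stands in for the fitted model, and the temporary directory is an assumption made so the example does not touch the real bond.pkl.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (a hypothetical object for illustration).
model_stand_in = {"order": (1, 1, 0), "coefficients": [0.12]}

path = os.path.join(tempfile.mkdtemp(), "bond.pkl")

# Serialize the object to disk ...
with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# ... and restore it later, for example in another session.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Using with blocks closes the file even if an error occurs. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.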
Advantages of investing in SBI (Government) Bonds
The following are the advantages of investing in SBI bonds.
• Risk-Free
SBI bonds promise assured returns and stability of funds to investors, and they have long
been regarded as an example of a risk-free security. They are therefore suitable for
investors looking for a risk-free investment.
• Returns
The returns from SBI bonds are generally as good as those from bank deposits. There is also a
guarantee of principal along with fixed interest. Unlike bank deposits, these bonds are
available for longer durations.
One can use Scripbox's returns calculator to estimate the returns.
• Liquidity
One can buy and sell SBI bonds like equity instruments. The liquidity of these bonds is
comparable to that offered by banks and financial institutions.
• Portfolio Diversification
Investing in SBI bonds helps diversify the investor's portfolio. It mitigates the risk of
the overall portfolio, since SBI bonds are risk-free investments.
• Regular Income
As per RBI guidelines, the interest accrued on government bonds is disbursed to bondholders
every six months. This gives bondholders an opportunity to earn regular income by investing
their idle funds.
Disadvantages of investing in SBI (Government) Bonds

The following are the disadvantages of investing in SBI bonds.
• Low Returns
The yield or interest earned on government bonds is relatively low in comparison with other
investment options such as equity, real estate, and corporate bonds.
• Interest Rate Risk

SBI bonds are long-term investments with maturities ranging from 5 to 40 years, so a bond
might lose value over this period. If inflation rises, the bond's fixed interest rate
becomes less attractive. The longer the bond period, the greater the market risk and the
interest rate risk. The investor may also be left holding an investment that pays less than
the market rate.
• Long Maturity Periods

Long maturity periods keep your assets locked in until the bond matures, which limits the
liquidity needed to pursue better financial gains.
• Yield
SBI bonds, like bonds in general, have considerably lower yields than company stocks and
other competing asset classes. However, this lower return on investment is partly balanced
by the lower risk of the bond market.
