
Assignment- Data Science using R

by

Vegulla Manikantha

FMS/MBA/21-23/041

Submitted to:
Prof. Dr. Bhagirathi Nayak, Professor, FMS, Sri Sri University

Sri Sri University

CUTTACK-754006

Date:

22nd May, 2023


MULTI-VARIABLE CORRELATION:
The data pertains to finance: companies and their returns.
The dataset consists of 15 variables with 506 observations in total. The dataset is then
modified by keeping 14 variables, namely Symbol, Name, Sector, Price,
Price/Earnings, Dividend Yield, Earnings/Share, 52 Weeks High, 52 Weeks Low,
Market Cap, EBITDA, Price/Sales and Price/Book.
This is done to ensure that only quantitative data is chosen, as the method
intended to be applied for the analysis is Pearson correlation.

1. The data set is imported into R-Studio via the following steps: File > Import
Dataset > From Excel > Browse (TJ) > Import.
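For reference, the same import can also be reproduced in code. The file name below is only a placeholder, since the actual file is chosen through the Browse dialog:

    library(readxl)                           # package behind the "From Excel" import option
    TJ <- read_excel("financial_data.xlsx")   # placeholder path for the Excel file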

2. The column titled "SEC Fillings" is removed using the subset() function. This
function allows us to create a subset of the original dataset consisting only of
the variables of our choice. As shown in the screenshot, we mention our dataset (TJ), then
select the variable to be eliminated [select = -c(SEC Fillings)]. This subset is
then assigned to a variable named "input".
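A minimal sketch of this step (the backticks are needed because the column name contains a space):

    input <- subset(TJ, select = -c(`SEC Fillings`))   # drop the SEC Fillings column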

3. The correlation matrix, along with the respective significance values (p-values),
is constructed for the entire dataset. The rcorr() function is obtained from
the 'Hmisc' package and is used to derive both the correlation matrix and the p-value
matrix of the concerned dataset simultaneously. Since rcorr() only works with
matrices, the dataset (input) is first converted to a matrix using the as.matrix()
function. The output is assigned to a variable named "result" (a sketch of the
call is given after this step).
The correlation coefficient allows us to gauge two aspects:
i) The strength of the relationship (values close to -1 or +1 indicate strong
correlation; values close to 0 indicate weak or no correlation).
ii) The direction of the relationship (positive correlation: as one variable
increases, the other variable correspondingly increases, and vice-versa;
negative correlation: as one variable increases, the other variable
correspondingly decreases, and vice-versa).
The p-value is taken in order to know whether there exists any significant
relationship between the two variables with respect to the population, because a
correlation coefficient different from 0 in the sample does not mean that
the correlation is significantly different from 0 in the population.
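A sketch of this step is given below; the explicit restriction to numeric columns is an assumption added here, since rcorr() cannot handle text columns such as Symbol, Name or Sector:

    library(Hmisc)

    num_input <- input[sapply(input, is.numeric)]   # keep only the numeric variables
    result <- rcorr(as.matrix(num_input))           # correlation and p-value matrices in one call

    result$r   # matrix of Pearson correlation coefficients
    result$P   # matrix of corresponding p-values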

4. For the purpose of this analysis, the objective is to examine the correlation
between the price of the stock and the following variables respectively, namely
Price/Earnings, Dividend Yield and Market Cap.
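One way to obtain this, sketched here with column names that are only assumptions (they depend on how the Excel headers were imported), is to run rcorr() on just these columns and keep the output in "result1", which is referred to below:

    vars <- c("Price", "Price/Earnings", "Dividend Yield", "Market Cap")   # assumed column names
    result1 <- rcorr(as.matrix(input[, vars]))
    result1$r   # correlations of Price with the three chosen variables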

The meaning of the terms in the plotting call is as follows: 'result1$r' denotes that the plot
is constructed from the correlation coefficients of the variables. 'type' denotes the layout
of the graph. 'order' denotes that the variables will be arranged on the basis of
their respective correlation coefficients. 'p.mat' supplies the p-value matrix so that
significance is reflected in the plot as well. 'sig.level' sets the level of significance
(in this case, 0.01). The 'insig' argument specifies the action to be taken for
correlations whose p-value is greater than the significance level (in this case, we
enter "blank", indicating that such points will be left blank).
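These arguments match the corrplot() function from the 'corrplot' package; a sketch of the call under that assumption (the exact 'type' and 'order' values shown here are illustrative) is:

    library(corrplot)

    corrplot(result1$r,            # plot the correlation coefficients
             type = "upper",       # layout of the graph
             order = "hclust",     # arrange variables by their correlations
             p.mat = result1$P,    # include the p-values
             sig.level = 0.01,     # level of significance
             insig = "blank")      # leave insignificant correlations blank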

TIME-SERIES FORECASTING:
A time-series dataset consists of a set of observations about a single
phenomenon recorded over multiple time intervals with a constant time-period
difference (daily, weekly, monthly, quarterly, annually, etc.). For
this analysis, a dataset consisting of the GDP of a country in each quarter of
every year (from the 1st quarter of 1959 to the 1st quarter of 2001) is imported into
R-Studio, consisting of 169 observations and 2 variables in total. The imported
data is of class data frame (tabular format).
This dataset has to be converted into a time series, which is depicted in the
following screenshot. The syntax shows that the GDP variable is converted into a
time series, with the start and end dates taken as the minimum and maximum of the
corresponding GDP dates respectively, and with the frequency of observation set to
quarterly (frequency = 4). The time series is assigned to a variable named
"gdptime". A plot of the time series is constructed for visualisation.
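A sketch of the conversion, assuming the imported data frame is named gdp and its GDP column is named GDP, is:

    gdptime <- ts(gdp$GDP, start = c(1959, 1), end = c(2001, 1), frequency = 4)   # quarterly series
    plot(gdptime)                                                                 # visualise the time series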

The model used for this analysis is the ARIMA model. ARIMA stands for
"Autoregressive Integrated Moving Average", and the model requires the
time-series data to satisfy the following conditions:
i) The time-series data must be stationary [i.e. the lagged values of the
variable (GDP) should have the same mean, variance and covariance].
ii) The time-series data must be free of autocorrelation [i.e. the values of GDP
must not be correlated with the lagged values of the same variable (GDP)].
In order to check these conditions, we conduct the following tests:
• Augmented Dickey-Fuller (adf) test: if the p-value is less than 0.05, then the
dataset is stationary.
• Auto- and cross-covariance and correlation function (acf) plot: the spikes of the
acf plot should stay within the dashed blue confidence bounds for the dataset to be
free of autocorrelation and non-stationarity.
From (Screenshot 3), we see that the p-value of the adf test is 0.99, thereby
indicating non-stationarity, and the acf plot also shows signs of autocorrelation.
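These checks can be reproduced as follows (adf.test() comes from the 'tseries' package; its use here is an assumption about the exact function applied in the screenshot):

    library(tseries)

    adf.test(gdptime)   # a p-value above 0.05 indicates non-stationarity
    acf(gdptime)        # spikes beyond the dashed blue bounds indicate autocorrelation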

At this juncture, we use the auto.arima() function to combat the non-stationarity
and autocorrelation issues and also construct the ARIMA model simultaneously.
An ARIMA model can be expressed in the format (p,d,q), wherein:
(p) stands for autoregression, i.e. a model which uses the lagged observations of
the variable (GDP) as inputs to a regression equation to predict the value at the
next time step:
GDP(t) = c + φ1·GDP(t-1) + φ2·GDP(t-2) + ... + φp·GDP(t-p) + ε(t)
where the φ terms are the autoregressive coefficients and ε(t) is the error term.

(d) stands for the order of differencing, i.e. the number of times the
time-series data has to be differenced in order to eliminate non-stationarity.
Differencing involves subtracting the immediately preceding (lag) value from each
value, e.g. GDP'(t) = GDP(t) - GDP(t-1).
(q) stands for the moving average, i.e. it represents the error of the model as a
combination of lagged error terms. This essentially means that the moving average
performs the same function as autoregression, but takes into account the residuals
affecting the variable instead of the lagged values of the variable itself.
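As an illustration of the differencing described above (not part of the original workflow, since auto.arima() handles it internally):

    gdp_diff <- diff(gdptime, differences = 1)   # subtract each value's immediate predecessor
    adf.test(gdp_diff)                           # re-check stationarity after differencing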

As depicted in the screenshot, the auto.arima() function is applied with the
following syntax: we first mention the dataset to be used for modelling, then the
information criterion (ic), for which Akaike's Information Criterion (aic) has been
chosen; it compares the quality of a set of statistical models and chooses the best
one (based on log-likelihood), and the lower the aic, the better the model. At
first, auto.arima() will use approximations to speed up the calculations; it should
be noted that the AIC values are computed without approximation, so the values
produced by trace = TRUE will match the final value reported. The best model
suggested is ARIMA(0,2,2), indicating 2 degrees of differencing and 2 moving-average
terms. The screenshot depicts the coefficients of the moving averages, which are
incorporated in the model.
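A sketch of the call (the exact arguments in the screenshot are not reproduced here, so these are assumptions):

    library(forecast)

    gdpmodel <- auto.arima(gdptime, ic = "aic", trace = TRUE)   # print each candidate model as it is tried
    summary(gdpmodel)                                           # coefficients of the chosen ARIMA(0,2,2)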

In order to check the stationarity of the model, we once again perform the adf and
acf tests, this time on the residuals of the model (due to the presence of the
moving-average terms). The screenshot depicts that the p-value is 0.01 and that the
peaks of the acf graph have settled within the confidence bounds. This shows that
stationarity is satisfied and autocorrelation is eliminated.
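A sketch of these residual checks:

    adf.test(residuals(gdpmodel))   # a small p-value (0.05 or less) indicates stationarity
    acf(residuals(gdpmodel))        # spikes within the bounds indicate no remaining autocorrelation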

Thus, auto.arima() has performed these functions in one step, thereby saving
ample time. The output of auto.arima() is assigned to a variable named
"gdpmodel". Now that the time series is stationary, it is fit to be used
for forecasting. The forecasting is done using the forecast() function, wherein
we mention the model to be used for the forecast (gdpmodel), followed by
the confidence level (95% in this case) and the forecast period (here we are
forecasting for 10 years, with 4 quarters in each respective year). The screenshot
depicts the model and the visualisation as well.
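A sketch of the forecasting step (10 years at 4 quarters per year gives h = 40):

    gdpforecast <- forecast(gdpmodel, level = 95, h = 40)   # 95% confidence level, 40 quarters ahead
    plot(gdpforecast)                                       # visualise the forecast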

Finally, we check whether the forecast of the model is reliable. For this, the
Ljung-Box test is used, which examines the autocorrelations of the residuals and
determines whether the model is fit for giving a correct forecast. In this case,
the test is performed for 25 lag values. The p-value is 0.429, thereby indicating
the absence of autocorrelation and verifying the validity of the model.
(Screenshot) gives the depiction of the test and its result.
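A sketch of the test, using the Box.test() function from base R (stats package):

    Box.test(residuals(gdpmodel), lag = 25, type = "Ljung-Box")   # a large p-value indicates no residual autocorrelation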

Thus, the time-series forecasting is performed.

