DA-EBC4263 Final-Assignment-1-1

Data Analytics
Michelle van Wamelen, i6155473 and Romen van den Boom, i6162224
19/10/2020
Table of Contents
1. Introduction
2. Literature review
3. Data
4. Methodology
5. Hypotheses
6. R and discussion of the results
6.1 Programming, Creating Variables and running tests in R
Verifying the assumptions
6.2 Discussion of the results
7. Implication and conclusions
7.1 Limitations and Future Research
8. Reference List
Final Assignment DA Michelle van Wamelen and Romen van den Boom
1. Introduction
Putting your money in a savings account might not be the smartest thing to do nowadays.
The interest rate on savings is approaching negative territory, so more and more people
are turning to investing to generate their own returns. This creates a growing number of
beginner investors in the investment market. Lacking knowledge, a large proportion of
these investors invest passively through mutual funds or robo-advisors. The investor
states the amount of risk they are willing to take on and the amount they want to invest,
and the robo-advisor takes care of the rest. In light of these recent developments it is
important to go back to the basics by validating known drivers of stock returns with recent
data, in order to answer the research question: “What is driving investment behaviour in
the pharmaceutical industry?”
2. Literature review
In 1952, Markowitz introduced ‘Modern Portfolio Theory’, which marked the beginning of
extensive academic coverage of investment portfolio theory. In his theory, Markowitz (1952)
explained that when investors have the choice between two portfolios with the same return
they will always go for the portfolio with the least amount of risk, meaning that investors
are risk averse.
This Modern Portfolio Theory was later used by Sharpe (1964) and Lintner (1965) to
develop the Capital Asset Pricing Model (CAPM). The CAPM formula looks as follows:
CAPM: $r_s = r_f + \beta \cdot R_{pm}$
where rs stands for the expected return of the given stock, rf stands for the risk-free rate,
beta stands for the riskiness of the stock relative to the market (systematic risk) and
Rpm is the market risk premium. The market risk premium is calculated by subtracting the
risk-free rate from the return on the market. All in all, the CAPM uses the time value of
money (the risk-free rate) and the risk of the stock in comparison to the market to calculate
an expected return, which can be compared with the actual return to see whether a stock is
fairly valued.
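As a quick numerical illustration of the CAPM formula, the R sketch below computes an
expected return. The function name capm_expected_return and all input values are
hypothetical and chosen only for illustration; they are not estimates from our data.

```r
# Hypothetical helper: CAPM expected return of a stock.
# rf       = risk-free rate
# beta     = systematic risk of the stock relative to the market
# r_market = expected return on the market
capm_expected_return <- function(rf, beta, r_market) {
  market_risk_premium <- r_market - rf  # Rpm: market return minus risk-free rate
  rf + beta * market_risk_premium
}

# Illustrative inputs (assumed values, not taken from our sample)
expected <- capm_expected_return(rf = 0.02, beta = 1.5, r_market = 0.08)
print(expected)
```

If the stock's actual return differs from this expected return, the difference is the alpha
discussed in the next paragraph.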
When it turned out that the explanatory power of the CAPM in terms of expected return
was decreasing, Fama and French (1993) came up with their three-factor model. To reduce
the increasing alpha of the CAPM, which is the difference between the actual return and
the expected return, Fama and French (1993) expanded the CAPM by introducing a size
and a book-to-market variable. They did this because they noticed that, over time,
companies with a small market capitalization were outperforming companies with a big
market capitalization. The same held for high book-to-market stocks in comparison to low
book-to-market stocks (value stocks were outperforming growth stocks). The Fama and
French (1993) three-factor model looks as follows:
FF three-factor model: $r_s = r_f + \beta \cdot R_{pm} + \beta_{SMB} \cdot SMB + \beta_{HML} \cdot HML$
This formula includes the risk-free rate, the company beta and the market risk premium.
In addition, it adds a size beta, which is the loading on the SMB (‘small minus big’) factor,
and a book-to-market beta, which is the loading on the HML (‘high minus low’) factor.
Then in 1997, Carhart improved the Fama and French three-factor model even further by
including a fourth factor in the model, namely momentum. Carhart (1997) observed, by
looking at historical stock prices, that a company’s stock return at t = -1 (year) has
significant predictive value for future stock returns (for example those at t = 0). By
including this fourth factor in the model he was able to explain more of the actual return
of a company’s stock. The Fama and French & Carhart four-factor model looks as follows:
FFC four-factor model: $r_s = r_f + \beta \cdot R_{pm} + \beta_{SMB} \cdot SMB + \beta_{HML} \cdot HML + \beta_{PR1YR} \cdot PR1YR$
The only thing that changed in the four-factor model is the additional PR1YR beta that
accounts for the momentum effect (the stock’s return over the prior year, that is the
stock’s return at t = -1). Even though all of these theories talk about portfolio effects, it might still be
interesting to look at the effect of the different predictor variables on the expected stock
return of individual companies (as a dependent variable).
The following section summarizes a study conducted in Iran on the stock returns of
pharmaceutical companies listed on the Tehran Stock Exchange. Kebriaee-zadeh et al.
(2013) examined the relationship between fundamental variables and the stock returns of
these companies. The fundamental variables are the following: current ratio, working
capital to total assets, market share, medical care inflation, operating cycle, debt-to-equity
ratio, net profit margin, exchange rate and company size. The researchers found that 80%
of the total change in stock return can be explained by the nine above-mentioned
fundamental variables. The variables that have a significant effect on the return of the
stocks are the debt-to-equity ratio, working capital to total assets, current ratio, net profit
margin, operating cycle, market share, inflation rate of medicinal product prices, total
assets and exchange rate (Kebriaee-zadeh et al., 2013).
3. Data
For this research we are using two databases: Compustat and CRSP. Compustat provides us
with quarterly and general company financial data. This database tracks the financial
data of the companies over the period of 2013 to 2018. The data is panel data, as it covers
multiple cross-sectional units (companies) over time. The following variables are given
within the dataset:
· Gvkey, Global company Key which is a unique 6 digit number for a given company
· Datadate, the date of the data on a quarterly basis
· Fyearq, the fiscal year of the data
· Fqtr, fiscal quarter within that fiscal year
· Tic, is the ticker which is derived from the company name
· Cusip, is a company identifier
· Conm, is the name of a company
· Curcdq, the currency the companies operate in
· Current assets total – Q, the total current assets in a quarter
· Assets Total – Q, the total assets in a quarter
· Common - Ordinary Equity Total – Q, the total common and ordinary equity per quarter
· Cash – Q, the amount of cash per quarter
· Long term debt total – Q, the total amount of long term debt per quarter
· Current liabilities total – Q, the total amount of current liabilities per quarter
· Liabilities total – Q, the total amount of liabilities in a quarter
· Revenue total – Q, the total revenue in a quarter
· Sales total – Q, total sales in a quarter
· Working Capital – Q, the amount of working capital in a quarter, which is the difference
between the current assets and current liabilities
· R&D expense – Q, the amount of money a company invests in Research and Development in
a quarter
· SG&A expense – Q, Selling, General and Administrative expenses in a quarter
· Capital expenditure – Y, the amount spent on capital on a yearly basis. Capital expenditure
is money used by a company to acquire, upgrade or maintain physical assets.
· Revenue total – Y, the total revenue on a yearly basis, which is the total amount of
products sold times the price of one product
· Sales total – Y, the total sales, which are the same as the total revenue of a company
· R&D expenditure – Y, the investment in Research and Development on a yearly basis
· SG&A expense – Y, Selling, General and Administrative expenses on a yearly basis
· Sic, Standard Industrial Classification Code
· State, the state within the USA in which the company is operating
CRSP is a database that provides monthly stock price data for company securities. The
CRSP data tracks the stock price data over the years 2013-2018. This data is also panel
data, as it covers multiple cross-sectional units over time. The following variables are
given within the dataset:
· Gvkey, the global company key, this is an identifier
· Fyear, the fiscal year of the row data
· Fyend, the fiscal year end, that is the month in which the fiscal year ends
· Date, the date of that particular return
· Comnam, the name of the company
· Sic, the standard industrial classification code
· Ret, the monthly change in stock price including dividends
· Vwretd, the monthly returns on a value-weighted index
· Sprtrn, the monthly returns on the S&P500 index
4. Methodology
The data used in this research comes from two different databases, namely Compustat
and CRSP. Both are secondary data sets and pre-existing structured archives, which means
that the datasets were already created and the data within them has not been collected by
us. In order to have one consistent data set to work with in R, the two data sets from
Compustat and CRSP have to be linked. They can be linked using common identifiers; in
this research we used company name and date to link the two datasets with each other.
However, the data coverage of the two databases is not the same. A company in Compustat
might be covered for five years and only be covered in CRSP for four years. Therefore, we
lose observations when linking. The linked data could be influenced by extreme outliers,
which in turn could influence the outcomes of our analysis. Winsorizing limits this
influence: extreme values are not dropped but replaced by the value at a chosen percentile.
However, winsorizing can mask critical observations, so it should be used with care.
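To make the idea of winsorizing concrete, the base-R sketch below caps the values of a
made-up vector at its 1st and 99th percentiles. The helper name winsorize and the data are
hypothetical; in our own analysis we use the winsor() function from the psych package, for
which this helper is only a simplified stand-in.

```r
# Hypothetical helper that mimics two-sided winsorizing at a given trim level
winsorize <- function(x, trim = 0.01) {
  bounds <- unname(quantile(x, probs = c(trim, 1 - trim), na.rm = TRUE))
  pmin(pmax(x, bounds[1]), bounds[2])  # cap values that fall outside the bounds
}

# Made-up data: 98 ordinary values plus one extreme value on each side
x <- c(-50, 1:98, 500)
x_w <- winsorize(x, trim = 0.01)

print(length(x_w) == length(x))  # no observation is dropped
print(range(x))                  # original extremes
print(range(x_w))                # extremes pulled in towards the percentile bounds
```

The number of observations stays the same; only the two extreme values are replaced.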
Within this research we are making use of panel data, which is data across time and
cross-sectional units. Time series data is one unit observed over time, like the stock of
Apple observed over the period of 2010 to 2019. Cross-sectional data is multiple units
observed in one time period, for example multiple listed companies in the year 2011. Panel
data is thus multiple units observed over multiple periods, for example 10 different
listed companies and their stock prices, average assets and total revenue over the period
2010-2020. Panel data comes in two formats: balanced panel data and unbalanced panel
data. With balanced panel data all the individuals, or companies, are observed in all time
periods. Therefore, the number of observations equals N times T, where N stands for the
number of companies and T for the
number of time periods. Unbalanced panel data is where at least one individual is not
observed for every time period. Therefore, the total number of observations will be lower
than N times T. In our research we are making use of unbalanced panel data because
observations were lost when linking Compustat and CRSP. We cannot simply keep only the
companies for which we have observations in all time periods, as this would lead to a
survivorship bias.
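The difference between a balanced and an unbalanced panel can be checked with a few lines
of base R. The sketch below uses a made-up panel with hypothetical company names; it
assumes there are no duplicate company-year rows.

```r
# Made-up panel: company C is not observed in 2015
panel <- data.frame(
  company = c("A", "A", "A", "B", "B", "B", "C", "C"),
  year    = c(2013, 2014, 2015, 2013, 2014, 2015, 2013, 2014)
)

n_companies <- length(unique(panel$company))  # N
n_periods   <- length(unique(panel$year))     # T

# Without duplicate rows, a panel is balanced when the number of rows equals N times T
is_balanced <- nrow(panel) == n_companies * n_periods

# Observations per company reveal which company has gaps
obs_per_company <- table(panel$company)

print(is_balanced)
print(obs_per_company)
```

Dropping company C to obtain a balanced panel would be exactly the kind of selection that
causes survivorship bias.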
When using panel data we can run two types of models: the fixed-effects model and the
random-effects model. With the fixed-effects model the focus is on the variation within
the cross-sectional units (over time) and not on the variation between units. The
random-effects model, on the other hand, includes both within- and between-entity
effects. This model assumes that the variation across units is random and uncorrelated
with the regressors. To determine which model is better, a Hausman test can be used. The
test looks at whether the unit-specific effects are correlated with the regressors in the
model. The null hypothesis is that no such correlation exists, or in other words that the
random-effects model is the preferred one. The alternative hypothesis is that such
correlation exists and thus that the fixed-effects model is the preferred model.
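To illustrate what "variation within the cross-sectional units" means, the base-R sketch
below simulates a small panel and estimates the slope in three ways. All names and numbers
are hypothetical and the data is simulated; it is not our actual model, which is estimated
with plm() in Section 6.1.

```r
set.seed(1)
n_firms   <- 5
n_periods <- 8
firm  <- rep(seq_len(n_firms), each = n_periods)
alpha <- rep(rnorm(n_firms, sd = 2), each = n_periods)  # firm-specific effect
x <- alpha + rnorm(n_firms * n_periods)                 # regressor correlated with the effect
y <- alpha + 0.5 * x + rnorm(n_firms * n_periods, sd = 0.1)

# Within transformation: subtract each firm's own mean from x and y
x_within <- x - ave(x, firm)
y_within <- y - ave(y, firm)

beta_within <- unname(coef(lm(y_within ~ x_within - 1)))        # fixed-effects slope
beta_lsdv   <- unname(coef(lm(y ~ x + factor(firm)))["x"])      # slope with firm dummies
beta_pooled <- unname(coef(lm(y ~ x))["x"])                     # ignores firm effects

print(c(within = beta_within, lsdv = beta_lsdv, pooled = beta_pooled))
```

The within estimate and the dummy-variable estimate coincide, which is why the
fixed-effects model removes all time-invariant company characteristics. The pooled
estimate ignores the firm effects, which is the situation the Hausman test is designed to
detect.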
5. Hypotheses
To study the following question: “What is driving investment behaviour in the
pharmaceutical industry?” we are testing the following hypotheses:
Hypothesis 1: H0: The R&D median does not have an influence on the investment
behaviour in the pharmaceutical industry
Ha: The R&D median does have an influence on the investment behaviour
in the pharmaceutical industry
Hypothesis 2: H0: The Total Revenue per Quarter does not have an influence on the
investment behaviour in the pharmaceutical industry
Ha: The Total Revenue per Quarter does have an influence on the
investment behaviour in the pharmaceutical industry
Hypothesis 3: H0: Retlag12, which is the lag variable for the Stock Return, does not
have an influence on the investment behaviour in the pharmaceutical
industry
Ha: Retlag12 does have an influence on the investment behaviour in the
pharmaceutical industry
Hypothesis 4: H0: The Research and Development expense per quarter does not have
an influence on the investment behaviour in the pharmaceutical industry
Ha: The Research and Development expense per quarter does have an
influence on the investment behaviour in the pharmaceutical industry
Hypothesis 5: H0: The Current Ratio does not have an influence on the investment
behaviour in the pharmaceutical industry
Ha: The Current Ratio does have an influence on the investment
behaviour in the pharmaceutical industry
Hypothesis 6: H0: The RD.D, which is the lag variable for Research and Development,
does not have an influence on the investment behaviour in the
pharmaceutical industry
Ha: The RD.D does have an influence on the investment behaviour in the
pharmaceutical industry
Hypothesis 7: H0: The total amount of Cash per quarter does not have an influence on
the investment behaviour in the pharmaceutical industry
Ha: The total amount of Cash per quarter does have an influence on the
investment behaviour in the pharmaceutical industry
Hypothesis 8: H0: The total amount of Assets per quarter does not have an influence
on the investment behaviour in the pharmaceutical industry
Ha: The total amount of Assets per quarter does have an influence on the
investment behaviour in the pharmaceutical industry
6. R and discussion of the results

6.1 Programming, Creating Variables and running tests in R
# Let's first make all of my packages active
library(tidyverse)
library(summarytools)
library(sjPlot)
library(afex)
library(emmeans)
library(psych)
library(car)
library(ggplot2)
library(readxl)
library(gvlma)
library(lmtest)
library(gridExtra)
library(olsrr)
library(interactions)
library(dplyr)
library(anytime)
library(doBy)
library(foreign)
library(plm)
library(stargazer)
# Load the cstat and crsp data in

cstat <- read.csv("data/Compustat data (Q)-1.csv")
crsp <- read.csv("data/crsp-3.csv")
# /cstat keep only the distinct values

cstat <- cstat %>%
distinct()
# /crsp keep only the distinct values

crsp <- crsp %>%
distinct()
# /cstat change datadate into date format

cstat$datadate<-as.character(cstat$datadate)
cstat$datadate<-as.Date(cstat$datadate, format='%Y%m%d')
# /crsp change date into date format

crsp$date<-as.Date(crsp$date, format='%d%b%Y')
# /cstat Let's do some data cleaning only selecting one industry

cstat$sic<-as.character(cstat$sic)
cstat <- subset(cstat, cstat$sic=='2834')
# /crsp Let's do some data cleaning only selecting one industry

crsp$sic<-as.character(crsp$sic)
crsp <- subset(crsp, crsp$sic=='2834')
# /cstat Transform the grouping variables into the same name

cstat$date<-cstat$datadate
cstat<-cstat %>%
dplyr::select(-'datadate')
# /crsp Transform the grouping variables into the same name
crsp$conm<-crsp$comnam
crsp<-crsp %>%
dplyr::select(-'comnam')
# Finding out the mean, median, min, max and sd for R&D expense per year
myFun <- function(x) {
c(min = min(x, na.rm=TRUE), max = max(x, na.rm=TRUE),
mean = mean(x, na.rm=TRUE), median = median(x, na.rm=TRUE),
std = sd(x, na.rm=TRUE))
}
tapply(cstat$R.D.expense...Q, cstat$fyearq, myFun)
## $`2012`
## min max mean median std
## 0.00000 271.47000 36.50712 1.88000 83.72066
##
## $`2013`
## 0.0000 5611.0500 110.3958 2.8810 434.7981
##
## $`2014`
## -4.0000 5032.7020 106.0949 2.7760 424.2188
##
## $`2015`
## 0.0000 4807.4040 114.8821 3.4040 436.0098
##
## $`2016`
## -0.4000 5940.8760 125.1168 4.0600 486.6773
##
## $`2017`
## 0.0000 5847.0000 139.8214 4.0760 527.0769
##
## $`2018`
## 0.0000 6641.0000 152.5450 4.9415 554.2400
##
## $`2019`
## 0.000000 9.810000 2.993133 1.258000 3.711197
# /cstat create R.DMedian with value 1 if the R&D expense is at or above the
# median of its fiscal year and 0 if it is below that median (the yearly
# medians are the ones listed in the output above, e.g. 1.88 for 2012)
cstat <- cstat %>%
  group_by(fyearq) %>% # compare each observation with the median of its own fiscal year
  mutate(R.DMedian = ifelse(R.D.expense...Q < median(R.D.expense...Q, na.rm = TRUE), 0, 1)) %>%
  ungroup()
# /cstat create lag variable for R&D to see whether volatility in R&D influences
# investment behavior
cstat <- cstat %>%
  group_by(cusip) %>% # group by cusip so that lags never use another company's data
  arrange(date, .by_group = TRUE) %>% # sort by quarter within each company before lagging
  mutate(RDlag1 = dplyr::lag(R.D.expense...Q, n = 1, default = NA)) %>% # R&D at t=-1
  mutate(RD.D = R.D.expense...Q - RDlag1) %>% # difference in R&D between the current quarter and the quarter at t=-1
  ungroup() # ungroup to avoid coming into trouble later
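The base-R sketch below, with made-up numbers for two hypothetical companies, shows why
the lag has to be computed within each company. It uses ave() as a stand-in for the
group_by() and lag() calls in our own code.

```r
# Made-up quarterly R&D figures for two hypothetical companies
toy <- data.frame(
  company = c("A", "A", "A", "B", "B", "B"),
  rd      = c(10, 12, 15, 100, 90, 95)
)

# Lag by one row without grouping: B's first row wrongly receives A's last value
toy$lag_naive <- c(NA, head(toy$rd, -1))

# Lag within each company: the first row of every company is NA
toy$lag_grouped <- ave(toy$rd, toy$company,
                       FUN = function(v) c(NA, head(v, -1)))

print(toy)
```

Without grouping, the first observation of company B would be compared with the last
observation of company A, which would distort the RD.D variable.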
# /crsp create lag variables for the stock return (momentum)

crsp <- crsp %>%
  group_by(gvkey) %>% # group by gvkey so that lags never use another company's data
  arrange(date, .by_group = TRUE) %>% # sort by month within each company before lagging
  mutate(retlag1 = dplyr::lag(ret, n = 1, default = NA)) %>% # return at t=-1 (months)
  mutate(retlag6 = dplyr::lag(ret, n = 6, default = NA)) %>% # return at t=-6
  mutate(retlag12 = dplyr::lag(ret, n = 12, default = NA)) %>% # return at t=-12
  ungroup()
# /cstat Create Current ratio

cstat <- cstat %>%
  group_by(cusip) %>%
  mutate(Current.ratio = Current.Assets.Total...Q / Current.liabilities.total...Q) %>% # current ratio = current assets / current liabilities
  ungroup()
# /crsp Create abnormal return

crsp <- crsp %>%
  group_by(gvkey) %>%
  mutate(ABNRET = ret - vwretd) %>% # abnormal return = return - value-weighted market return
  ungroup()
# Create another dataset where cstat and crsp are linked by company name (conm)
# and date
link <- left_join(cstat, crsp, by = c("conm", "date")) # link by conm and date (names from the x dataset)
# /link delete all the rows where gvkey.y and cash are na
link <- link[!is.na(link$gvkey.y), ]
link <- link[!is.na(link$Cash...Q), ]
# /link transform R.DMedian into a factor variable

link$R.DMedian<-as.factor(link$R.DMedian)
# /link winsorize the variables to limit the influence of outliers (to overcome
# errors in the end)
link$ABNRET.winsor <- winsor(link$ABNRET, 0.01)
link$Revenue.total...Q.winsor <- winsor(link$Revenue.total...Q, 0.01)
link$retlag12.winsor <- winsor(link$retlag12, 0.01)
link$R.D.expense...Q.winsor <- winsor(link$R.D.expense...Q, 0.01)
link$Current.ratio.winsor <- winsor(link$Current.ratio, 0.01)
link$RD.D.winsor <- winsor(link$RD.D, 0.01)
link$Cash...Q.winsor <- winsor(link$Cash...Q, 0.01)
link$Assets.Total...Q.winsor <- winsor(link$Assets.Total...Q, 0.01)
# /link create panel data

link.p <- pdata.frame(link, index=c("conm","date"))
# /link.p Running a multiple regression with the chosen independent variables

lm.p <- lm(ABNRET ~ R.DMedian + Revenue.total...Q + retlag12 + R.D.expense...Q +
             Current.ratio + RD.D + Cash...Q + Assets.Total...Q, data = link.p)
summary(lm.p)
##
## Call:
## lm(formula = ABNRET ~ R.DMedian + Revenue.total...Q + retlag12 +
## R.D.expense...Q + Current.ratio + RD.D + Cash...Q + Assets.Total...Q,
## data = link.p)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81680 -0.09663 -0.00294 0.07790 2.39384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.027e-02 1.210e-02 -2.501 0.0125 *
## R.DMedian1 9.693e-03 1.310e-02 0.740 0.4596
## Revenue.total...Q 1.363e-06 9.393e-06 0.145 0.8846
## retlag12 -1.245e-02 2.446e-02 -0.509 0.6110
## R.D.expense...Q 6.347e-06 4.880e-05 0.130 0.8965
## Current.ratio 1.826e-03 1.040e-03 1.755 0.0796 .
## RD.D -7.107e-06 5.286e-05 -0.134 0.8931
## Cash...Q 9.226e-07 6.157e-06 0.150 0.8809
## Assets.Total...Q -1.637e-07 5.923e-07 -0.276 0.7823
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1981 on 1154 degrees of freedom
## (366 observations deleted due to missingness)
## Multiple R-squared: 0.003826, Adjusted R-squared: -0.00308
## F-statistic: 0.554 on 8 and 1154 DF, p-value: 0.8159
Verifying the assumptions

1. Linearity of data
plot(lm.p, 1)
What we want to see in the plot is a red line which is approximately horizontal at zero. In
our lm.p model there seems to be no pattern in the residual plot. This suggests that we can
assume a linear relationship between the predictors and the outcome variable, which means
that this assumption is met.
2. Test to see if the residuals are normally distributed
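# freq = FALSE puts the histogram on the density scale so the density lines are comparable
hist(lm.p[["residuals"]], freq = FALSE, main = "Histogram of the residuals", xlab = "Residuals")
lines(density(lm.p[["residuals"]]), col = "blue", lwd = 2) # add a density estimate with defaults
lines(density(lm.p[["residuals"]], adjust = 2), lty = "dotted", col = "darkgreen", lwd = 2) # smoother density estimate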
We also have to check whether the errors are normally distributed. A common rule of thumb
is that with a sample size larger than 100 this assumption matters less, as long as the
other assumptions hold. Since the linked sample has 1529 rows, of which 1163 are used in
the regression, we technically do not have to look at the histogram, but we do so anyway.
The histogram shows the distribution of the residuals, and judging from the bars we can
assume that the residuals are approximately normally distributed.
3. Test if the residuals are homoscedastic (the homogeneity of variance)

plot(lm.p, 3)
If there is no heteroscedasticity at all, you should see a random, equal spread of points
throughout the range of the X axis and a flat red line. This is roughly the case for our
‘scale-location’ graph. That is why we conclude, judging from the graph, that there is no
heteroscedasticity.
4. Test if the mean of the residuals is 0

mean(lm.p[["residuals"]])
## [1] -1.665597e-18
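We can check whether the mean of the residuals is zero with a simple mean function. The
mean is -1.665597e-18, which is zero up to floating-point precision. This is expected,
since the residuals of an OLS regression with an intercept average zero by construction.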
5. Independence of Observation
However, the normal multiple regression model does not seem to pass the last assumption,
since the error terms of a given company are likely to be correlated over time (our data
has both a cross-sectional and a time series dimension). That is why we have to run a
fixed effects model and a random effects model.
# /link.p fixed effects model
femod <- plm(ABNRET ~ R.DMedian + Revenue.total...Q + retlag12 + R.D.expense...Q +
               Current.ratio + RD.D + Cash...Q + Assets.Total...Q,
             data = link.p, model = "within")
# /link.p random effects model

remod <- plm(ABNRET ~ R.DMedian + Revenue.total...Q + retlag12 + R.D.expense...Q +
               Current.ratio + RD.D + Cash...Q + Assets.Total...Q,
             data = link.p, model = "random")
stargazer(femod, remod, type = "text")
##
## ================================================
## Dependent variable:
## ------------------------------
## ABNRET
## (1) (2)
## ------------------------------------------------
## R.DMedian1 -0.014 0.009
## (0.027) (0.013)
##
## Revenue.total...Q 0.00001 0.00000
## (0.00003) (0.00001)
##
## retlag12 -0.037 -0.013
## (0.026) (0.024)
##
## R.D.expense...Q -0.00003 0.00001
## (0.0001) (0.00005)
##
## Current.ratio 0.0001 0.002*
## (0.002) (0.001)
##
## RD.D 0.00001 -0.00001
## (0.0001) (0.0001)
##
## Cash...Q 0.00000 0.00000
## (0.00001) (0.00001)
##
## Assets.Total...Q 0.00000 -0.00000
## (0.00000) (0.00000)
##
## Constant -0.030**
## (0.012)
##
## ------------------------------------------------
## Observations 1,163 1,163
## R2 0.002 0.004
## Adjusted R2 -0.137 -0.003
## F Statistic 0.318 (df = 8; 1019) 4.289
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# /link.p run a Hausman test to see which one to use

phtest(femod, remod)
##
## Hausman Test
##
## data: ABNRET ~ R.DMedian + Revenue.total...Q + retlag12 + R.D.expense...Q
+ ...
## chisq = 13.563, df = 8, p-value = 0.09389
## alternative hypothesis: one model is inconsistent
In the fixed effects model we focus only on the variation within the companies, and the
time-invariant characteristics are removed.
One thing that immediately stands out from the fixed effects model is that it does not have
an intercept.
Furthermore, we can see that the F-value is very low. Since the F-value measures the
significance of the overall model, we can say that this model is not significant as a whole,
which means it is not able to explain much of the variance in abnormal returns through the
independent variables that we have chosen.
We can also see that none of our independent variables is significant at any conventional
significance level. On top of this, the R^2 is very low and the Adjusted R^2 is negative
(the penalty for the number of estimated parameters, which in this model includes the
company-specific effects, outweighs the small R^2).
In the random effects model we also include the between-company effects and we assume
that the variation across companies is random and uncorrelated with the regressors.
In the random effects model each company has a different intercept, drawn from a normal
distribution (that is why there is an intercept in this random effects model).
When we take a look at the random effects model, we see that the reported test statistic is
higher (4.289 versus 0.318), although it carries no significance stars in the table, so this
model is not significant as a whole either. The R^2 and the Adjusted R^2 have also gone up
in comparison to the fixed effects model, which hints at a better fit; however, we have to
test this formally later.
Again, we see that almost all of our coefficients are not significant at the conventional
levels, except for a positive current ratio coefficient (0.002, which is significant at the
10% level) and a negative constant (-0.03, which is significant at the 5% level).
We have to run a Hausman test in order to see which model we should use. Since the
p-value of the Hausman test (0.094) is significant at the 10% level, although not at the 5%
level, we reject the null hypothesis and assume that the company-specific effects are
correlated with the regressors, which is why we should go for the fixed effects model.
Since neither of our models gave the anticipated outcome (almost all of the independent
variables were insignificant and several had negative coefficients), we winsorize our
variables to limit the influence of outliers in an attempt to improve the fit of our model.
# /link.p fixed effects model winsorized
femodw <- plm(ABNRET.winsor ~ R.DMedian + Revenue.total...Q.winsor + retlag12.winsor +
                R.D.expense...Q.winsor + Current.ratio.winsor + RD.D.winsor +
                Cash...Q.winsor + Assets.Total...Q.winsor,
              data = link.p, model = "within")
# /link.p random effects model winsorized
remodw <- plm(ABNRET.winsor ~ R.DMedian + Revenue.total...Q.winsor + retlag12.winsor +
                R.D.expense...Q.winsor + Current.ratio.winsor + RD.D.winsor +
                Cash...Q.winsor + Assets.Total...Q.winsor,
              data = link.p, model = "random")
stargazer(femod, femodw, remod, remodw, type = "text")
##
## ==========================================================================
## Dependent variable:
## -----------------------------------------------
## ABNRET ABNRET.winsor ABNRET ABNRET.winsor
## (1) (2) (3) (4)
## --------------------------------------------------------------------------
## R.DMedian1 -0.014 -0.017 0.009 0.013
## (0.027) (0.024) (0.013) (0.012)
##
## Revenue.total...Q 0.00001 0.00000
## (0.00003) (0.00001)
##
## retlag12 -0.037 -0.013
## (0.026) (0.024)
##
## R.D.expense...Q -0.00003 0.00001
## (0.0001) (0.00005)
##
## Current.ratio 0.0001 0.002*
## (0.002) (0.001)
##
## RD.D 0.00001 -0.00001
## (0.0001) (0.0001)
##
## Cash...Q 0.00000 0.00000
## (0.00001) (0.00001)
##
## Assets.Total...Q 0.00000 -0.00000
## (0.00000) (0.00000)
##
## Revenue.total...Q.winsor 0.00001 0.00000
## (0.00003) (0.00001)
##
## retlag12.winsor -0.055* -0.016
## (0.029) (0.028)
##
## R.D.expense...Q.winsor -0.00004 0.00001
## (0.0001) (0.00005)
##
## Current.ratio.winsor 0.0003 0.002**
## (0.002) (0.001)
##
## RD.D.winsor 0.00000 -0.00002
## (0.0001) (0.0001)
##
## Cash...Q.winsor 0.00000 0.00000
## (0.00001) (0.00001)
##
## Assets.Total...Q.winsor 0.00000 -0.00000
## (0.00000) (0.00000)
##
## Constant -0.030** -0.035***
## (0.012) (0.011)
##
## --------------------------------------------------------------------------
## Observations 1,163 1,163 1,163 1,163
## R2 0.002 0.004 0.004 0.006
## Adjusted R2 -0.137 -0.135 -0.003 -0.001
## F Statistic (df = 8; 1019) 0.318 0.552 4.289 7.219
## ==========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# /link.p run a Hausman test to see which one to use

phtest(femodw, remodw)
##
## Hausman Test
##
## data: ABNRET.winsor ~ R.DMedian + Revenue.total...Q.winsor + retlag12.win
sor + ...
## chisq = 26.724, df = 8, p-value = 0.000789
## alternative hypothesis: one model is inconsistent
Winsorizing does seem to change our fixed effects and random effects models slightly. In
the fixed effects model the value of R^2 doubled, the Adjusted R^2 only changed by 0.002
and the F-statistic went up by 0.234, which is an improvement but still points to a poor fit
of the model. The coefficients seem mostly unaffected, apart from the fact that the return
lag t=-12 coefficient is now significant at the 10% level (however, it has a minus sign,
which points towards negative momentum, which we did not expect).
The random effects model also seems to have improved slightly, since the R^2 and the
Adjusted R^2 both went up by 0.002 and the F-statistic increased by 2.93. However, the
same caveat applies to the random effects model: it has improved but it still points
towards a poor fit. The coefficients in the random effects model are largely unaffected by
the winsorizing, apart from the current ratio coefficient, which is now significant at the
5% level, and the slightly lower constant (a decrease of 0.005).
All in all, we can say that winsorizing did improve our models slightly, however it was not
good enough to improve the fit to such a level that we could say that our model would
explain a lot of the abnormal return of a company.
When we run the Hausman test again to see which model we should use, we get a p-value
of 0.000789, which means it is significant at the 1% level. This, in turn, means that we
assume that the company-specific effects are correlated with the regressors and that we
should go for the fixed effects model.
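The p-values of both Hausman tests can be reproduced from the chi-square statistics and
degrees of freedom reported by phtest(). The sketch below only uses base R; the variable
names and the significance level alpha are our own illustrative choices.

```r
# Chi-square statistics and degrees of freedom as reported by phtest() above
p_raw    <- pchisq(13.563, df = 8, lower.tail = FALSE)  # non-winsorized models
p_winsor <- pchisq(26.724, df = 8, lower.tail = FALSE)  # winsorized models

print(c(raw = p_raw, winsorized = p_winsor))

# Decision at a chosen significance level (alpha is an assumption for illustration)
alpha <- 0.10
print(c(raw = p_raw < alpha, winsorized = p_winsor < alpha))
```

Both tests reject the random effects model at the 10% level, but only the test on the
winsorized models does so at the 1% level.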
6.2 Discussion of the results

A summary of our results regarding the betas of our independent variables is shown in the
regression tables in Section 6.1. Here, we can see that, without winsorizing, only the
current ratio coefficient and the constant are significantly related to abnormal returns,
and only in the random effects model, which the Hausman test rejects in favour of the
fixed effects model. After winsorizing, however, the coefficient of the return lagged by one
year (retlag12) is significant in the fixed effects model, which is again the model to use
according to the Hausman test. On top of this, the constant in the random effects model
became slightly more negative after winsorizing.
• It is interesting to note that the social performance feedback of investing more or
less money into R&D than competitors does not seem to influence ABNRET, but this
might be because R&D efficiency (quality) is regarded as more important than
investing a lot of money (quantity) when it comes to R&D.
• The quarterly revenue also does not seem to influence ABNRET and we think that
this might be because income is the more important measure here.
• The historical performance feedback of the return over the prior year does seem to
have an influence on ABNRET. This influence is negative, however, which might hint
towards an overall negative momentum in the pharmaceutical industry.
• The R&D expense of individual companies, on the other hand, does not seem to have
an influence on ABNRET, which again might hint towards the importance of R&D
efficiency and not quantity.
• The current ratio does seem significant, but only in the random effects model which
is inferior to the fixed effects model. The positive coefficient does seem logical, since
current ratio was found to be significant in prior studies as well.
• The volatility of R&D seems to be insignificant, but this might be due to the fact that
a monthly difference is not deemed important by investors. Instead they might look
at yearly differences in R&D spending.
• The cash that a company has is also insignificant, which might be caused by the
positive side of ‘having cash to invest’ and the negative side of ‘not doing anything
with idle cash’ cancelling each other out.
• Lastly, the size of a company proxied by total assets also seems insignificant, which
is strange considering that size had a significant coefficient in most of the earlier
studies.
7. Implication and conclusions
This research set out to answer the question: “What is driving investment behaviour in the
pharmaceutical industry?” We specifically focused our efforts on explaining the
abnormal returns of the companies in our linked dataset of Compustat and Crsp over the
period from 2013 up to and including 2018. Through our data analysis we came to the
conclusion that almost none of our alternative hypotheses were supported, that is, most of
the coefficients were found to be insignificant. The exceptions are the current ratio in
our random effects models and the return lag at t = -12 in our winsorized fixed effects
model. However, even these significant coefficients had only a very small influence on the
abnormal returns of the companies in our dataset. For this reason, along with the
relatively low R^2, Adjusted R^2 and F-statistic, we conclude that our model is not able
to explain much of the abnormal returns of the companies in our linked dataset.
7.1 Limitations and future research

One of the clear limitations of our research is that we only have basic data which is freely
available on Compustat and Crsp. On top of this, we were missing some fundamental
variables like net income, shares outstanding and share price, which could have been used
to verify the already existing models as well as to give our models a better fit. This is
a significant weakness in our study. Furthermore, we had to work with unbalanced panel
data after linking the Compustat and Crsp datasets, since some of the observations were
lost in this process. Further research could include a longer time frame (10 years instead
of 5 years) and datasets which are more complete, including key variables such as share
price, shares outstanding and net income. With these more complete datasets, hopefully
more of the real drivers behind investment behaviour in the pharmaceutical industry could
be revealed.
All in all, we can conclude from our models that basic variables such as revenue or quarterly
R&D expenses are not able to explain or predict abnormal returns of individual companies.
Estimating these abnormal returns therefore requires a deeper understanding of
investments, and special variables such as the market beta and the market-to-book ratio
should be included in the analysis. Consequently, for beginning investors, it might be
smart to rely on robo-advisors or mutual funds in order to get a first feeling for
investments before taking the plunge into the deep and unknown waters of investing.
8. Reference List
Carhart, M. M. (1997, March). On Persistence in Mutual Fund Performance. The Journal of
Finance. https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1540-6261.1997.tb03808.x
Fama, E., & French, K. (1993). Common risk factors in the returns on stocks and bonds.
Journal of Financial Economics, 33(1), 3-56.
Kebriaee-zadeh, A., Zartab, S., Farshad Fatemi, S., & Radmanesh, R. (2013). Fundamentals
and Stock Return in Pharmaceutical Companies: a Panel Data Model of Iranian Industry.
Iranian Journal of Pharmaceutical Sciences.
http://www.ijps.ir/article_4845_85291727ff05ac7169db255ad7041e4b.pdf
Lintner, J. (1965, December). Security Prices, Risk, and Maximal Gains From Diversification. The
Journal of Finance.
https://www.jstor.org/stable/2977249?casa_token=yeVEuPYfUrgAAAAA%3Agykkjdtfa-
kwk7FfY_O_SVlDQ6t0Uxenx9TfrWOyRGocrJKXdhu2BuoVWAxQttoCZSZW_2ioz1_
rBB_pQ2Gagour3qU9Nm8WZV0kf4dZyTwmUbLQIg&seq=1
Markowitz, H. (1952, March). Portfolio Selection. The Journal of Finance.
http://www.finance.martinsewell.com/capm/Markowitz1952.pdf
Sharpe, W. F. (1964, September). Capital Asset Prices: A Theory of Market Equilibrium
under Conditions of Risk. The Journal of Finance.
https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.1964.tb02865.x