HRSTA82/002/0/2019

Department of Statistics
HRSTA82: Honours Research Project in Statistics

The research onion ring (Saunders, Lewis & Thornhill, 2016:124)

Assignment 2, 2019

REMEMBER: ALL YOUR ASSIGNMENTS MUST BE TYPED IN LATEX, NOT WORD

ASSIGNMENT 02
Unique Nr.: 813179
Fixed closing date: 28 May 2019

QUESTION 1
Information on the performance on the stock exchange of the shares of the 30 largest chemical engineering
companies was collected. The price-to-earnings ratio (abbreviated RPE), i.e. the price of the share divided
by the earnings for that share for the past year, is a measure of a company’s growth. This ratio indicates
the amount investors are willing to pay for the stock per rand of the current earnings of the company. RPE
is usually expected to be high for growing companies and low for mature or troubled firms.
Company investment managers generally prefer high RPE ratios since high ratios make it possible to
raise substantial amounts of capital for a small number of shares. Investors also consider high RPE ratios
to be an important factor in the evaluation of stocks for possible sale and/or purchase.
It is therefore important to investigate which variables influence the level of the RPE ratio. For example,
one would expect that an increase in the debt-to-equity ratio (DE) would result in a decrease in RPE,
since DE indicates the extent to which management uses borrowed funds to operate the company. On
the other hand, an increase in the net profit margin of the company would result in an increase in RPE.
Similarly, an increase in the proportion of earnings paid out to stockholders would be expected to result
in an increase in RPE.
A multiple regression analysis was performed with RPE as dependent variable. A brief description of the
variables, as well as the variable names used in the computer output, are given in the table below.
Variable name    Description of variable
RPE              Price-to-earnings ratio
ROR5             % Rate of return on total capital averaged over the past 5 years
DE               Debt-to-equity ratio of investment capital for the past year
SALESGR5         % Annual compound growth rate of sales in the most recent 5 years compared with the previous 5 years
EPS5             % Annual compound growth in earnings per share computed from the most recent five years compared to the previous five years
NPM1             % Net profit margin for the past year
PAYOUTR1         Annual dividend paid out divided by the latest 12-month earnings per share

NPM1 is computed by dividing the net profits by the net sales for the past year expressed as a percentage.
PAYOUTR1 represents the proportion of earnings paid out to stockholders instead of being retained to
operate and expand the company.
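
The ratio definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the figures for the single company are hypothetical and are not taken from Exhibit 1.1.

```python
def price_to_earnings(price_per_share, earnings_per_share):
    """RPE: share price divided by the earnings per share for the past year."""
    return price_per_share / earnings_per_share


def net_profit_margin(net_profit, net_sales):
    """NPM1: net profit divided by net sales, expressed as a percentage."""
    return 100.0 * net_profit / net_sales


def payout_ratio(annual_dividend, earnings_per_share):
    """PAYOUTR1: annual dividend divided by the latest 12-month earnings per share."""
    return annual_dividend / earnings_per_share


# Hypothetical company (values invented for illustration only).
rpe = price_to_earnings(price_per_share=120.0, earnings_per_share=8.0)
npm1 = net_profit_margin(net_profit=45.0, net_sales=500.0)
payoutr1 = payout_ratio(annual_dividend=2.0, earnings_per_share=8.0)

print(rpe, npm1, payoutr1)
```

With these invented inputs, investors pay 15 rand per rand of current earnings, the net profit margin is 9%, and a quarter of earnings is paid out to stockholders.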

The data set is shown in Exhibit 1.1. Some descriptive statistics and the correlation matrix for all the
variables are given in Exhibit 1.2.

The assumption of normality was tested. It was found that the variables DE and RPE were normally
distributed.

An all subsets regression analysis was performed on all independent variables with RPE as dependent
variable. These results appear in Exhibit 1.3. In Exhibit 1.4 the output of a stepwise regression analysis
is given.

(a) Discuss the correlation matrix given in Exhibit 1.2. Specifically comment on:

(i) the possible existence of multicollinearity between the predictor variables and (20)

(ii) likely predictors for RPE. (10)

(b) Which regression model would you select from the all subsets regression analysis given in Exhibit
1.3? Justify your answer. (5)

(c) Use the stepwise regression analysis results given in Exhibit 1.4 to answer the following questions.

(i) Write down the final model of the multiple regression analysis. (2)

(ii) Evaluate the model fit of the final model. (3)

(iii) Comment on the statistical significance of the parameter estimates. How do these results compare
with the conclusions drawn in (a)? (7)

(iv) Are there signs of multicollinearity in the data? Justify. (3)

(v) Are there outliers or influential observations in this data set? Justify. (5)

(d) Can this regression model be used to make predictions on the performance of shares of mining
companies? Justify. (5)
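
As a rough sketch of how multicollinearity between predictors might be screened, the Python snippet below flags large pairwise correlations and computes a variance inflation factor, VIF = 1 / (1 - R²), from the R² obtained when one predictor is regressed on the others. The correlation values, the R² of 0.75 and the 0.8 cut-off are all assumptions chosen for illustration; they are not the values in Exhibit 1.2 or Exhibit 1.4, and the cut-off is only a common rule of thumb.

```python
def flag_high_correlations(names, corr, threshold=0.8):
    """Return predictor pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) > threshold:
                flagged.append((names[i], names[j], corr[i][j]))
    return flagged


def variance_inflation_factor(r_squared):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical correlation matrix for three predictors (NOT the values in Exhibit 1.2).
names = ["ROR5", "NPM1", "DE"]
corr = [
    [1.00, 0.85, -0.30],
    [0.85, 1.00, -0.20],
    [-0.30, -0.20, 1.00],
]

pairs = flag_high_correlations(names, corr)
vif = variance_inflation_factor(0.75)  # hypothetical R_j^2 of 0.75

print(pairs, vif)
```

Pairwise correlations can miss multicollinearity that involves three or more predictors at once, which is why the VIF (or its reciprocal, the tolerance) is usually reported alongside the correlation matrix.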

Exhibit 1.1: Stocks data set



Exhibit 1.2: Descriptive statistics and correlations of the data



Exhibit 1.3: All possible subsets regression analysis



Exhibit 1.4: Stepwise regression analysis



[60]

QUESTION 2

Consider the following dataset of measurements on 38 automobiles (1978-79 models). The following
variables are contained in this dataset (see the R dataset Cars.csv):

(1) Country: Nationality of manufacturer (e.g. U.S., Japan)

(2) Car: Car name (Make and model)

(3) Weight: Weight of the car

(4) MPG: Miles per gallon, a measure of gas mileage

(5) Drive_Ratio: Drive ratio of the automobile

(6) Horsepower: Horsepower

(7) Displacement: Displacement of the car (in cubic inches)

(8) Cylinder: Number of cylinders

Exhibit 2.1 shows the original dataset. A hierarchical cluster analysis using the Euclidean distance
measure and the average linkage method was performed on the numerical variables in this dataset, with
the car make as the identifying variable; the results are shown in Exhibit 2.2. Exhibit 2.3 shows
the results of a k-means cluster analysis with k = 3. Use this information to answer the following
questions.

(a) Using the information in Exhibit 2.2:

(i) Calculate the percentage change in the clustering criterion for solutions of 1 to 10 clusters. (15)

(ii) Using this or any other criterion, determine what would be an optimum number of clusters for this
data. Justify your answer. (5)

(iii) Finally, using any information available to you, indicate which observations fall into each cluster,
based on your answer for optimum number of clusters in (ii) above. (7)
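
The percentage change asked for in (a)(i) can be sketched as follows, assuming the clustering criterion is a coefficient reported for each number of clusters (for example, an agglomeration coefficient). The coefficient values below are hypothetical and are not taken from Exhibit 2.2; the function name is likewise only illustrative.

```python
def percentage_changes(coefficients):
    """Percentage change in the clustering criterion when moving from k to k-1 clusters.

    `coefficients` maps a number of clusters k to the criterion value
    of the k-cluster solution.
    """
    changes = {}
    for k in sorted(coefficients, reverse=True):
        if k - 1 in coefficients:
            current, merged = coefficients[k], coefficients[k - 1]
            changes[k] = 100.0 * (merged - current) / current
    return changes


# Hypothetical criterion values (NOT taken from Exhibit 2.2).
coefficients = {5: 10.0, 4: 12.0, 3: 15.0, 2: 30.0, 1: 45.0}
changes = percentage_changes(coefficients)
for k, pct in changes.items():
    print(f"{k} -> {k - 1} clusters: {pct:.1f}% change")

# The k whose merge into k-1 clusters produces the largest relative jump.
largest_jump = max(changes, key=changes.get)
print(largest_jump)
```

One common heuristic is to stop just before the merge that produces the largest relative jump. In this invented example that jump occurs when going from 3 to 2 clusters, which would point to a 3-cluster solution.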

(b) Using the information in Exhibits 2.2 and 2.3:

(i) Compare the results of the hierarchical cluster solution to the k-means solution. (10)

(ii) Determine the appropriateness of the 3 cluster solution. (3)

(iii) By considering the cluster centres, indicate how the clusters differ from each other (i.e. what are
the distinguishing features of each cluster). Finally try to attach a summary or name to each
cluster. (10)

(iv) Is there some association between a vehicle’s country of origin and its segmentation? Explain. (5)

Exhibit 2.1: Cars data set



Exhibit 2.2: Hierarchical cluster analysis



Exhibit 2.3: K-means cluster analysis



[55]

QUESTION 3

The table below shows data for four people, indicating their Systolic blood pressure, an indication of the
amount of tobacco they consume daily, and whether they suffer from coronary heart disease.
Table 3.1: Coronary heart disease
Case   Systolic Blood Pressure   Tobacco   CHD
1      114                       0         0
2      130                       0.08      1
3      158                       16        0
4      154                       2.4       1
Use this data to determine (without the use of a statistical computer package):

(a) The Euclidean distance matrices at each step of a hierarchical cluster analysis on Systolic blood
pressure and tobacco usage using the single linkage method. Clearly show each distance matrix at
every step of the algorithm. (15)

(b) The icicle plot for this analysis. (5)

(c) Is there a relation between coronary heart disease and the segmentation of a person? (5)

[25]
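
For reference, the single linkage procedure named in (a) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the four points are invented (they are not the cases in Table 3.1), the function names are hypothetical, and the sketch records only the merge history rather than printing a full distance matrix at each step.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(points):
    """Agglomerate clusters using single linkage; return the merge history.

    `points` maps a case label to its coordinates. Each history entry is
    (cluster_a, cluster_b, distance), with clusters as sorted tuples of labels.
    """
    clusters = [(label,) for label in sorted(points)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Illustrative points only (NOT the cases in Table 3.1).
points = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (10.0, 0.0), "D": (10.0, 1.0)}
history = single_linkage(points)
for a, b, d in history:
    print(a, "+", b, "at distance", round(d, 3))
```

At every step the two clusters with the smallest minimum inter-member distance are merged, so the distance matrix shrinks by one row and one column per step until a single cluster remains.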

QUESTION 4

Data on house prices in several suburbs of Boston in the United States of America was collected, together
with other variables that might influence house prices (see the R dataset Boston.csv).
The data frame contains the following variables:

Variable   Description of the variable
AGE        proportion of owner-occupied units built prior to 1940
CRIM       per capita crime rate by town
ZN         proportion of residential land zoned for lots over 25 000 sq. ft
INDUS      proportion of non-retail business acres per town
CHAS       Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX        nitric oxides concentration (parts per 10 million)
RM         average number of rooms per dwelling
DIS        weighted distances to five Boston employment centres
RAD        index of accessibility to radial highways
TAX        full-value property-tax rate per $10 000
PTRATIO    pupil-teacher ratio by town
BLACK      1 000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
LSTAT      % lower status of the population
MEDV       Median value of owner-occupied homes in $1 000’s
A Multivariate Analysis of Variance (MANOVA) was conducted to determine the effect of the age of a
suburb and the proximity of the suburb to the Charles River on:

• CRIM

• ZN

• INDUS

• NOX

• RM

• DIS

• TAX

• PTRATIO

• BLACK

• LSTAT

• MEDV

In order to do so, the variable AGE was recoded into an ordinal variable, with categories “New”,
“Moderate” and “Old”. The variables were related to classification of houses as historical sites for the
future. This new variable was named AGECAT. An extract from the data is given in Exhibit 4.1 and
Exhibit 4.2 shows P-P plots. The results from the MANOVA analysis are in Exhibit 4.3. You are
required to write a detailed research report following Hair’s Six Stage Model Building (and also refer to
the supplementary notes) on these results. The report should pay attention not only to the results shown
below, but also to recommendations as to how these results could have been improved upon (with regard
to the testing and meeting of assumptions, for example). Should the analysis prove to not have met all
of the assumptions, you are still required to interpret the results, but suggest which remedies could have
been applied to rectify the violation of assumptions.
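
The recoding of AGE into AGECAT described above could look roughly like the Python sketch below. The cut-off values and the function name are hypothetical, since the cut-offs actually used are not stated here, and the sketch assumes AGE is recorded on a 0 to 100 scale.

```python
def age_category(age, new_max=35.0, moderate_max=70.0):
    """Recode AGE (share of units built before 1940) into an ordinal label.

    The cut-offs are hypothetical; the ones used for AGECAT are not stated.
    """
    if age <= new_max:
        return "New"
    if age <= moderate_max:
        return "Moderate"
    return "Old"


# Hypothetical AGE values for illustration.
ages = [12.5, 54.2, 96.1]
agecat = [age_category(a) for a in ages]
print(agecat)
```

Whatever cut-offs are chosen, they determine the group sizes in the MANOVA, so very unequal groups are worth noting when the assumptions are discussed.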
Exhibit 4.1: Boston data set extract

Exhibit 4.2: P-P Plots

[P-P plots, one panel per variable: CRIM, ZN, INDUS, NOX, RM, DIS, TAX, PTRATIO, BLACK, LSTAT, MEDV]

Exhibit 4.3: MANOVA analysis



[60]

[200]

NOTE: PRESENT WORK IN POINT FORM, NO ESSAYS

HINT: “MATERIAL IN A LINE SHOULD BE WORTH A MARK”
