Department of Statistics
HRSTA82: Honours Research Project in Statistics
Assignment 2, 2019
ASSIGNMENT 02
Unique Nr.: 813179
Fixed closing date: 28 May 2019
QUESTION 1
Information on the performance on the stock exchange of the shares of the 30 largest chemical engineering
companies was collected. The price-to-earnings ratio (abbreviated RPE), i.e. the price of the share divided
by the earnings for that share for the past year, is a measure of a company’s growth. This ratio indicates
the amount investors are willing to pay for the stock per rand of the current earnings of the company. RPE
is usually expected to be high for growing companies and low for mature or troubled firms.
Company investment managers generally prefer high RPE ratios since high ratios make it possible to
raise substantial amounts of capital for a small number of shares. Investors also consider high RPE ratios
to be an important factor in the evaluation of stocks for possible sale and/or purchase.
It is therefore important to investigate which variables influence the level of the RPE ratio. For example,
one would expect that an increase in the debt-to-equity ratio (DE) would result in a decrease in RPE,
since DE indicates the extent to which management uses borrowed funds to operate the company. On
the other hand, an increase in the net profit margin of the company would be expected to result in an
increase in RPE. Similarly, an increase in the proportion of earnings paid out to stockholders would be
expected to result in an increase in RPE.
A multiple regression analysis was performed with RPE as the dependent variable. A brief description of
the variables, together with the variable names used in the computer output, is given in the table below.
Description of variable                               Variable name used in computer output
Price-to-earnings ratio                               RPE
% Rate of return on total capital averaged over the   ROR5
  past 5 years
Debt-to-equity ratio of investment capital for the    DE
  past year
% Annual compound growth rate of sales in the most    SALESGR5
  recent 5 years compared with the previous 5 years
% Annual compound growth in earnings per share        EPS5
  computed from the most recent five years compared
  to the previous five years
% Net profit margin for the past year                 NPM1
Annual dividend paid out divided by the latest        PAYOUTR1
  12-month earnings per share
NPM1 is computed by dividing the net profits by the net sales for the past year expressed as a percentage.
PAYOUTR1 represents the proportion of earnings paid out to stockholders instead of being retained to
operate and expand the company.
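The ratio definitions above can be illustrated with a minimal Python sketch. The function names and the input figures are hypothetical and are not taken from Exhibit 1.1; only the formulas follow the descriptions in the text.

```python
def price_to_earnings(share_price, earnings_per_share):
    """RPE: share price divided by the past year's earnings for that share."""
    return share_price / earnings_per_share


def net_profit_margin(net_profit, net_sales):
    """NPM1: net profit divided by net sales, expressed as a percentage."""
    return 100.0 * net_profit / net_sales


def payout_ratio(annual_dividend, eps_last_12_months):
    """PAYOUTR1: annual dividend divided by the latest 12-month earnings per share."""
    return annual_dividend / eps_last_12_months


# Hypothetical figures for a single company (illustration only).
print(price_to_earnings(90.0, 7.5))    # 12.0
print(net_profit_margin(12.0, 150.0))  # 8.0
print(payout_ratio(3.0, 7.5))          # 0.4
```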
The data set is shown in Exhibit 1.1. Some descriptive statistics and the correlation matrix for all the
variables are given in Exhibit 1.2.
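As a rough illustration of how a correlation matrix such as the one in Exhibit 1.2 is obtained, the sketch below computes Pearson correlations in plain Python. The helper names and the sample values are assumptions made for illustration; they do not reproduce the assignment data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for three of the variables (illustration only).
sample = {
    "RPE": [9.0, 12.0, 15.0, 11.0, 8.0],
    "DE": [0.9, 0.5, 0.2, 0.6, 1.0],
    "NPM1": [4.0, 6.5, 9.0, 6.0, 3.5],
}
for name, row in correlation_matrix(sample).items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Large absolute correlations between two predictors are one informal warning sign of multicollinearity, although they do not detect dependence among three or more predictors.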
The assumption of normality was tested. It was found that the variables DE and RPE were normally
distributed.
An all subsets regression analysis was performed on all independent variables with RPE as dependent
variable. These results appear in Exhibit 1.3. In Exhibit 1.4 the output of a stepwise regression analysis
is given.
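To give a sense of the size of an all subsets search, the following sketch enumerates every non-empty subset of the six predictors and shows the standard adjusted R-squared formula, one criterion that is often used to compare such models. Which criteria Exhibit 1.3 actually reports is not stated here, so the use of adjusted R-squared is an assumption; the sample size of 30 comes from the problem description.

```python
from itertools import combinations

PREDICTORS = ["ROR5", "DE", "SALESGR5", "EPS5", "NPM1", "PAYOUTR1"]


def all_subsets(predictors):
    """Yield every non-empty subset of the predictors, smallest subsets first."""
    for size in range(1, len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors predictors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


subsets = list(all_subsets(PREDICTORS))
print(len(subsets))  # 63 candidate models, i.e. 2**6 - 1

# Hypothetical R-squared of 0.50 for a one-predictor model fitted to 30 companies.
print(round(adjusted_r_squared(0.50, 30, 1), 4))
```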
(i) the possible existence of multicollinearity between the predictor variables and (20)
(b) Which regression model would you select from the all subsets regression analysis given in Exhibit
1.3? Justify your answer. (5)
(c) Use the stepwise regression analysis results given in Exhibit 1.4 to answer the following questions.
(i) Write down the final model of the multivariate regression analysis. (2)
(iii) Comment on the statistical significance of the parameter estimates. How do these results compare
with the conclusions drawn in (a)? (7)
(v) Are there outliers or influential observations in this data set? Justify. (5)
(d) Can this regression model be used to make predictions on the performance of shares of mining
companies? Justify. (5)
[60]
QUESTION 2
Consider the following dataset of measurements on 38 automobiles from the 1978-79 model year. The
following variables are contained in this dataset (note: R dataset Cars.csv):
Exhibit 2.1 shows the original dataset. A hierarchical cluster analysis using the Euclidean distance
measure and the average linkage method was performed on the numerical values in this dataset, with
the car make as the identifying variable; the results are shown in Exhibit 2.2. Exhibit 2.3 shows
the results of a k-means cluster analysis with k = 3. Use the information to answer the following
questions.
(i) Calculate the percentage change in the clustering criterion for solutions of 1 to 10 clusters. (15)
(ii) Using this or any other criterion, determine what would be an optimum number of clusters for this
data. Justify your answer. (5)
(iii) Finally, using any information available to you, indicate which observations fall into each cluster,
based on your answer for optimum number of clusters in (ii) above. (7)
(i) Compare the results of the hierarchical cluster solution to the k-means solution. (10)
(iii) By considering the cluster centres, indicate how the clusters differ from each other (i.e. what are
the distinguishing features of each cluster). Finally try to attach a summary or name to each
cluster. (10)
(iv) Is there some association between a vehicle’s country of origin and its segmentation? Explain. (5)
[55]
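The percentage change in the clustering criterion mentioned in Question 2 can be sketched as follows. This is a minimal illustration under one common convention (change relative to the solution with one cluster fewer); the criterion values are hypothetical and are not read from Exhibit 2.2 or 2.3.

```python
def percentage_changes(criterion):
    """Percentage change between consecutive cluster solutions.

    criterion[i] is the clustering criterion for the (i + 1)-cluster solution.
    The result has one entry per step from k to k + 1 clusters.
    """
    changes = []
    for previous, current in zip(criterion, criterion[1:]):
        changes.append(100.0 * (current - previous) / previous)
    return changes


# Hypothetical within-cluster sums of squares for 1 to 4 clusters.
wss = [1000.0, 600.0, 450.0, 420.0]
for k, change in enumerate(percentage_changes(wss), start=2):
    print(f"{k - 1} -> {k} clusters: {change:.1f}%")
```

A sharp drop in the size of the change from one step to the next is often read as a hint that adding further clusters gains little.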
QUESTION 3
The table below shows data for four people, indicating their systolic blood pressure, a measure of the
amount of tobacco they consume daily, and whether they suffer from coronary heart disease (CHD).
Table 3.1: Coronary heart disease
Case Systolic Blood Pressure Tobacco CHD
1 114 0 0
2 130 0.08 1
3 158 16 0
4 154 2.4 1
Use this data to determine (without the use of a statistical computer package):
(a) The Euclidean distance matrices at each step of a hierarchical cluster analysis on Systolic blood
pressure and tobacco usage using the single linkage method. Clearly show each distance matrix at
every step of the algorithm. (15)
(c) Is there a relation between coronary heart disease and the segmentation of a person? (5)
[25]
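Although Question 3 must be worked by hand, a hand calculation can be checked against a short script. The sketch below applies single linkage with Euclidean distance to the two variables in Table 3.1, assuming the raw, unstandardised values are used; the function name and structure are illustrative only.

```python
from math import dist  # available from Python 3.8

# Table 3.1: case -> (systolic blood pressure, tobacco)
cases = {1: (114, 0.0), 2: (130, 0.08), 3: (158, 16.0), 4: (154, 2.4)}


def single_linkage(points):
    """Return the merges as (cluster_a, cluster_b, distance), in merge order."""
    clusters = [frozenset([label]) for label in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


for left, right, distance in single_linkage(cases):
    print(left, right, round(distance, 3))
```

The script reports only the merge order and merge distances; the full distance matrix at each step still has to be written out as the question requires.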
QUESTION 4
Data on house prices in several suburbs of Boston in the United States of America were collected, together
with other variables that might influence house prices (note: R dataset Boston.csv).
The data frame contains the following variables:
• CRIM
• ZN
• INDUS
• NOX
• RM
• DIS
• TAX
• PTRATIO
• B
• LSTAT
• MEDV
In order to do so, the variable AGE was recoded into an ordinal variable, with categories “New”,
“Moderate” and “Old”. The variables were related to classification of houses as historical sites for the
future. This new variable was named AGECAT. An extract from the data is given in Exhibit 4.1 and
Exhibit 4.2 shows P-P plots. The results from the MANOVA analysis are in Exhibit 4.3. You are
required to write a detailed research report following Hair’s Six Stage Model Building (and also refer to
the supplementary notes) on these results. The report should pay attention not only to the results shown
below, but also to recommendations as to how these results could have been improved upon (with regard
to the testing and meeting of assumptions, for example). Should the analysis prove to not have met all
of the assumptions, you are still required to interpret the results, but suggest which remedies could have
been applied to rectify the violation of assumptions.
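The recoding of AGE into the ordinal variable AGECAT can be sketched as below. The cut points of 35 and 70 are purely hypothetical, since the thresholds actually used are not given; only the category labels come from the text.

```python
def age_category(age, new_max=35.0, moderate_max=70.0):
    """Map a numeric AGE value to an AGECAT label.

    new_max and moderate_max are hypothetical cut points chosen for illustration.
    """
    if age <= new_max:
        return "New"
    if age <= moderate_max:
        return "Moderate"
    return "Old"


# Hypothetical AGE values (illustration only).
ages = [12.5, 48.0, 91.3]
print([age_category(a) for a in ages])  # ['New', 'Moderate', 'Old']
```

AGECAT would then serve as the grouping factor in the MANOVA, so the choice of cut points also determines the group sizes.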
Exhibit 4.1: Boston data set extract
[Data extract; recoverable column headings: CRIM, ZN, INDUS, NOX, RM, DIS, LSTAT, MEDV]
[60]
[200]