# MULTIVARIATE ANALYSIS EXAM 2

Dan Sewell

Question 1:

a.

Mortality rates per 100,000 were calculated in SAS (see code in the Appendix). The first ten observations are shown below; the full data set can be found in the exhaustive SAS output (see Appendix).

| Obs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| newcopd | 64.2845 | 15.8351 | 21.5268 | 34.7171 | 47.6086 | 22.2840 | 29.5735 | 42.5145 | 31.7778 | 33.0126 |
| newcvd | 336.711 | 160.462 | 206.496 | 161.644 | 289.430 | 370.108 | 242.242 | 427.039 | 216.549 | 328.191 |
| newpneu | 32.4186 | 16.8907 | 21.3919 | 12.1879 | 19.0434 | 19.8618 | 29.4285 | 31.8858 | 28.7582 | 31.1527 |
| newresp | 97.2557 | 32.7258 | 42.9862 | 47.3974 | 66.9543 | 42.1459 | 59.1470 | 74.8212 | 60.8236 | 64.5372 |
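As a rough illustration of the rate calculation (the original work was done in SAS; the function name and the counts below are hypothetical and not taken from the data set), a mortality rate per 100,000 is the death count scaled by the population:

```python
def rate_per_100k(deaths: int, population: int) -> float:
    """Return deaths per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so whole-number rates come out exact.
    return 100_000 * deaths / population


# Hypothetical counts for illustration only (not from the exam data).
example_rate = rate_per_100k(deaths=150, population=500_000)
print(f"{example_rate:.4f} deaths per 100,000")
```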

b. Monthly average temperature and ozone values for April through September were computed in Microsoft Excel (=average(…)). The resulting data set was then imported into SAS. The first ten observations are below; the entire data set can be found in the exhaustive SAS output.

| Obs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| TempApril | 51.4000 | 56.7667 | 65.3667 | 72.0000 | 58.0667 | 72.0333 | 49.4000 | 46.1333 | 62.2333 | 49.8000 |
| O3April | -2.59742 | -5.77401 | -4.00079 | 8.26811 | 2.01369 | 4.48913 | 6.72094 | 2.51458 | 0.53066 | 1.25153 |
| TempMay | 61.6129 | 67.4839 | 69.1290 | 76.6452 | 66.6452 | 74.7419 | 58.4194 | 59.8387 | 65.9677 | 61.8710 |
| O3May | 2.9776 | 3.3453 | 2.8535 | 2.1398 | 10.1284 | 13.4318 | 2.6733 | 9.0380 | 2.7441 | 6.0385 |
| TempJune | 70.0667 | 74.9333 | 75.1667 | 81.9667 | 73.9000 | 80.5000 | 71.2000 | 68.5667 | 73.6333 | 70.5667 |
| O3June | 9.3160 | 6.7535 | -1.9897 | -5.5971 | 14.0376 | 0.1609 | 7.3469 | 13.3186 | 4.2031 | 7.6226 |
| TempJuly | 76.0968 | 83.2258 | 79.3871 | 83.4516 | 80.2258 | 81.8710 | 76.0323 | 74.5806 | 79.0645 | 78.5806 |
| O3July | 10.1753 | 13.4359 | -0.0305 | -11.3581 | 13.9645 | -3.6472 | 7.7088 | 15.9207 | 3.6292 | 7.0373 |
| TempAug | 68.4516 | 79.9032 | 82.1290 | 88.5161 | 77.8710 | 85.2903 | 71.6452 | 68.0645 | 79.6774 | 70.5484 |
| O3Aug | -1.3167 | 8.1736 | 13.6344 | 9.2746 | 9.5357 | 8.2088 | 3.5155 | 1.0480 | 12.7265 | -0.5158 |
| TempSep | 63.7333 | 70.2667 | 73.3000 | 81.6667 | 77.2333 | 76.3000 | 67.3667 | 64.4667 | 69.3667 | 63.6667 |
| O3Sep | -5.4473 | -6.5235 | 2.2994 | 11.6430 | 16.2752 | 3.4473 | -3.5027 | 0.3954 | -6.9343 | 0.0429 |
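The averaging step was done in Excel; a minimal Python sketch of the same idea is given here for comparison. The function name, the `(month, value)` input layout, and the sample readings are assumptions made for illustration, not the layout of the original workbook.

```python
from collections import defaultdict
from statistics import mean


def monthly_means(daily):
    """Average daily readings by month.

    daily: iterable of (month_name, value) pairs (hypothetical layout).
    Returns a dict mapping each month name to the mean of its readings.
    """
    by_month = defaultdict(list)
    for month, value in daily:
        by_month[month].append(value)
    return {month: mean(values) for month, values in by_month.items()}


# Made-up daily temperatures, only to show the call.
readings = [("April", 50.0), ("April", 52.0), ("May", 60.0)]
print(monthly_means(readings))
```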

Question 2:

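Are mortalities related to coastline? To answer this question, I ran MANOVA with the following model:

$$
\begin{bmatrix} \text{newcopd}_i & \text{newcvd}_i & \text{newpneu}_i & \text{newresp}_i \end{bmatrix}
=
\begin{bmatrix} 1 & \text{coast}_i \end{bmatrix}
\begin{bmatrix}
\beta_{01} & \beta_{02} & \beta_{03} & \beta_{04} \\
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14}
\end{bmatrix}
+
\begin{bmatrix} \varepsilon_{i1} & \varepsilon_{i2} & \varepsilon_{i3} & \varepsilon_{i4} \end{bmatrix},
\quad i = 1, \dots, 70.
$$

I used Wilks' Lambda to decide whether to reject the null hypothesis that coastline has no effect (i.e. $H_0\colon \beta_1 = 0$, where $\beta_1 = (\beta_{11}, \beta_{12}, \beta_{13}, \beta_{14})$). Less relevant to the question, but also tested, was the null hypothesis that the intercept is zero (i.e. $H_0\colon \beta_0 = 0$, where $\beta_0 = (\beta_{01}, \beta_{02}, \beta_{03}, \beta_{04})$). The p-values from Wilks' Lambda for these two tests are 0.003 and less than 0.001, respectively. This implies that mortality does differ by geography (specifically, by whether a city lies in a region next to the coast). Since the mortality rates differ by region, it is important to look at the mean mortality rates for each of the two locations.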
| coast | N Obs | Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| 0 | 32 | newcopd | 32 | 46.9256059 | 10.4654806 | 24.8226145 | 74.4175238 |
| | | newcvd | 32 | 342.2685421 | 104.0352160 | 170.4295948 | 706.4312769 |
| | | newpneu | 32 | 26.9604181 | 6.1229006 | 14.3832579 | 39.9480052 |
| | | newresp | 32 | 74.3137923 | 13.2431913 | 45.9742789 | 99.4803952 |
| 1 | 38 | newcopd | 38 | 36.8365308 | 11.7025433 | 15.8350620 | 77.1583167 |
| | | newcvd | 38 | 304.6484248 | 117.8645380 | 160.4619615 | 738.6820742 |
| | | newpneu | 38 | 20.5756272 | 8.0217809 | 7.3123619 | 38.0844329 |
| | | newresp | 38 | 57.6683982 | 15.7261493 | 32.7257948 | 110.2571727 |

We can see that for each cause of death, the mean mortality rate is higher for cities that are not along a coastline. I conclude that the rate of Chronic Obstructive Pulmonary Disease deaths, cardiovascular deaths, pneumonia deaths, and respiratory deaths is higher in cities away from the coastline.

Diagnostic checks were run for normality and for equal covariance matrices. A chi-squared test of the homogeneity of the covariance matrices fails to reject the null hypothesis (equal covariance matrices) at the 0.05 level. However, the Henze-Zirkler test applied to the residuals indicated that they are not multivariate normal.

Question 3:

We wish to better see the underlying variation structure in the monthly averages of temperature and ozone. To this end, I perform principal component analysis on both temperature and ozone. First, for monthly average temperature, I find that 96.04% of the variation in the data is explained by the first two PCs. Further evidence for using just the first two PCs comes from the following scree plot, whose elbow is at 2:

[Scree plot of eigenvalues for monthly average temperature: eigenvalue (vertical axis, 0 to 6) against component number (horizontal axis, 1 to 6). The first eigenvalue lies between 5 and 6, the second between 0 and 1, and the remaining four are at or near 0.]
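The 96.04% figure can be cross-checked against the correlations between the monthly temperatures and the first two PCs reported in this section. The sketch below assumes the PCA was run on the correlation matrix (the source does not say), in which case the sum of squared variable-PC correlations for a component equals its eigenvalue and the six standardized variables have total variance 6. The function name is hypothetical.

```python
# Correlations of each monthly temperature (April to September) with
# PC1 and PC2, copied from the table in this section.
pc1 = [0.928, 0.987, 0.952, 0.809, 0.957, 0.949]
pc2 = [-0.353, -0.0183, 0.256, 0.562, 0.147, -0.092]


def implied_eigenvalue(correlations):
    """Sum of squared variable-PC correlations.

    Assumption: PCA on the correlation matrix, where this sum equals
    the eigenvalue of the component.
    """
    return sum(r * r for r in correlations)


n_vars = len(pc1)  # total variance of six standardized variables
share = (implied_eigenvalue(pc1) + implied_eigenvalue(pc2)) / n_vars
print(f"PC1 eigenvalue ~ {implied_eigenvalue(pc1):.2f}")
print(f"PC2 eigenvalue ~ {implied_eigenvalue(pc2):.2f}")
print(f"share explained by the first two PCs ~ {share:.1%}")
```

With the rounded correlations the share comes out near 96%, in line with the reported figure. A small gap is expected from rounding; a larger one would suggest the PCA was run on the covariance matrix instead.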

The following table shows how each monthly average temperature is correlated with the first two PCs.

| PC | APRIL | MAY | JUNE | JULY | AUG | SEPT |
|---|---|---|---|---|---|---|
| 1 | 0.928 | 0.987 | 0.952 | 0.809 | 0.957 | 0.949 |
| 2 | -0.353 | -0.0183 | 0.256 | 0.562 | 0.147 | -0.092 |

Second, for average monthly ozone, I find that I should choose either 3 or 4 PCs by looking at the scree plot below:
[Scree plot of eigenvalues for monthly average ozone: eigenvalue (vertical axis, 0.0 to 2.5) against component number (horizontal axis, 1 to 6). The first two eigenvalues lie just above 2.0, the third just above 1.0, and the last three below 0.5.]

The first 3 PCs account for 88.24% of the variance in average monthly ozone, and the first 4 PCs account for 94.38%. The choice is rather subjective, but I will use 4 PCs in the analyses that follow. Correlations between each monthly ozone variable and each of the first four PCs are given below.

| PC | APRIL | MAY | JUNE | JULY | AUG | SEP |
|---|---|---|---|---|---|---|
| Y1 | 0.24531 | 0.246509 | 0.919129 | 0.972862 | 0.206886 | 0.29725 |
| Y2 | 0.44967 | 0.592193 | 0.026071 | 0.0257 | 0.847247 | 0.77709 |
| Y3 | 0.781158 | 0.603777 | 0.177878 | 0.024001 | 0.4416 | 0.109002 |
| Y4 | 0.27142 | 0.14753 | 0.122663 | 0.020533 | 0.20632 | 0.536598 |

The data set with the PC scores from all 6 PCs is attached in the exhaustive SAS output. Note that PCTi refers to the ith PC for temperature, and PCOi refers to the ith PC for ozone, i = 1, ..., 6.

Question 4:

In order to understand the relationships and associations between mortality and monthly temperatures, monthly ozone, and geography (coastline or no coastline), I set up a regression model with four response variables (the four mortality rates) and 13 covariates (14 including the intercept). The following are the regression coefficients for each of the four mortality variables:

For Chronic Obstructive Pulmonary Disease:
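| Intercept | TempApril | TempMay | TempJune | TempJuly | TempAug | TempSep | O3April | O3May | O3June | O3July | O3Aug | O3Sep | coast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20.28952 | -0.19539 | -0.48570 | -0.16807 | 1.12896 | -1.52321 | 1.61922 | 0.15609 | -0.03675 | -0.31666 | -0.16258 | -0.15963 | 0.29670 | -12.50162 |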

For cardiovascular deaths:
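| Intercept | TempApril | TempMay | TempJune | TempJuly | TempAug | TempSep | O3April | O3May | O3June | O3July | O3Aug | O3Sep | coast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -193.94372 | 10.72090 | -39.54714 | 26.74436 | 22.51034 | -24.88893 | 8.86207 | -5.95457 | 14.62094 | -9.52123 | -0.69330 | -1.42830 | -0.95663 | 39.53147 |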

For pneumonia deaths:
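| Intercept | TempApril | TempMay | TempJune | TempJuly | TempAug | TempSep | O3April | O3May | O3June | O3July | O3Aug | O3Sep | coast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -193.94372 | 10.72090 | -39.54714 | 26.74436 | 22.51034 | -24.88893 | 8.86207 | -5.95457 | 14.62094 | -9.52123 | -0.69330 | -1.42830 | -0.95663 | 39.53147 |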

For respiratory deaths:
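| Intercept | TempApril | TempMay | TempJune | TempJuly | TempAug | TempSep | O3April | O3May | O3June | O3July | O3Aug | O3Sep | coast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40.02573 | 0.12833 | -1.50822 | 2.27612 | 1.33574 | -2.91614 | 1.12797 | 0.32036 | -0.17657 | -0.85262 | 0.00097820 | 0.11777 | -0.00352 | -13.97280 |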

A test for the overall significance of the model was conducted. Testing $H_0\colon B = 0$ gave a p-value of 0.0041. At a significance level of 0.05, we reject the null hypothesis, concluding that at least one of the 13 covariates is significant. In other words, at least one monthly temperature average, one monthly ozone average, or geography has an effect on mortality rates. This is not surprising, since we found earlier that geography plays a role in mortality rates; in particular, cities by the coast had lower mortality rates for all four causes of death.

It is important to identify which of these covariates play a role in mortality rates, so a separate test is conducted for each. For each test, Wilks' Lambda is used to obtain a p-value for the null hypothesis that the covariate has no effect. The results are tabulated below.

| Covariate | p-value | Does it affect mortality rates? |
|---|---|---|
| Mean Temperature for April | 0.7441 | No |
| Mean Temperature for May | 0.1777 | No |
| Mean Temperature for June | 0.0549 | No |
| Mean Temperature for July | 0.3901 | No |
| Mean Temperature for Aug | 0.5209 | No |
| Mean Temperature for Sept | 0.3019 | No |
| Mean Ozone for April | 0.4409 | No |
| Mean Ozone for May | 0.2785 | No |
| Mean Ozone for June | 0.4956 | No |
| Mean Ozone for July | 0.9295 | No |
| Mean Ozone for Aug | 0.8481 | No |
| Mean Ozone for Sept | 0.4615 | No |
| coast | 0.0346 | Yes |
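The Yes/No column follows mechanically from comparing each p-value with the 0.05 level. A minimal sketch of that rule, using the p-values from the table (the short variable names are borrowed from the coefficient tables; the constant and list names are hypothetical):

```python
ALPHA = 0.05  # significance level used throughout the write-up

# Wilks' Lambda p-values for the individual covariate tests.
p_values = {
    "TempApril": 0.7441, "TempMay": 0.1777, "TempJune": 0.0549,
    "TempJuly": 0.3901, "TempAug": 0.5209, "TempSep": 0.3019,
    "O3April": 0.4409, "O3May": 0.2785, "O3June": 0.4956,
    "O3July": 0.9295, "O3Aug": 0.8481, "O3Sep": 0.4615,
    "coast": 0.0346,
}

# A covariate is flagged only when its p-value falls below ALPHA.
significant = [name for name, p in p_values.items() if p < ALPHA]
print(significant)
```

Note that the June temperature (p = 0.0549) falls just short of the cutoff, so only coast is flagged.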

Next, we use only the PCs (specifically PCT1, PCT2, PCO1, PCO2, PCO3, and PCO4) as covariates, along with coast, to construct much the same model. The following are the regression coefficients for this model.

For Chronic Obstructive Pulmonary Disease:
| Intercept | PCT1 | PCT2 | PCO1 | PCO2 | PCO3 | PCO4 | coast |
|---|---|---|---|---|---|---|---|
| 48.11297 | 0.13659 | -0.02778 | -0.01593 | -0.26565 | 0.41358 | 0.69546 | -12.27632 |

For cardiovascular deaths:
| Intercept | PCT1 | PCT2 | PCO1 | PCO2 | PCO3 | PCO4 | coast |
|---|---|---|---|---|---|---|---|
| 328.24533 | 0.88349 | 2.34210 | 0.03544 | -4.82485 | -0.93247 | 2.95725 | -11.78788 |

For pneumonia deaths:
| Intercept | PCT1 | PCT2 | PCO1 | PCO2 | PCO3 | PCO4 | coast |
|---|---|---|---|---|---|---|---|
| 24.82633 | 0.02900 | 0.27819 | -0.04426 | -0.38240 | -0.36996 | -0.24137 | -2.45358 |

For respiratory deaths:
| Intercept | PCT1 | PCT2 | PCO1 | PCO2 | PCO3 | PCO4 | coast |
|---|---|---|---|---|---|---|---|
| 73.31611 | 0.16035 | 0.26899 | -0.06609 | -0.65850 | 0.05854 | 0.46664 | -14.80757 |

Again using Wilks' Lambda, we test the null hypothesis that none of the covariates has any effect on mortality rates. The test yields a p-value of 0.008, so we reject the null hypothesis and conclude that at least one of the covariates affects mortality rates. The results of the individual covariate tests are given in the same fashion as for the first regression model.

| Covariate | p-value | Does it affect mortality rates? |
|---|---|---|
| PCT1 | 0.5959 | No |
| PCT2 | 0.6509 | No |
| PCO1 | 0.9490 | No |
| PCO2 | 0.0610 | No |
| PCO3 | 0.0211 | Yes |
| PCO4 | 0.1783 | No |
| Coast | 0.0220 | Yes |

Finally, one last regression model is fit with only two covariates, PCO3 and coast, to check whether we can really say that ozone has an effect on the mortality rates. This model gives the following regression coefficients:

For Chronic Obstructive Pulmonary Disease:
| Intercept | PCO3 | coast |
|---|---|---|
| 47.53327 | 0.33993 | -11.20846 |

For cardiovascular deaths:
| Intercept | PCO3 | coast |
|---|---|---|
| 340.00412 | -1.26673 | -33.44881 |

For pneumonia deaths:
| Intercept | PCO3 | coast |
|---|---|---|
| 26.28443 | -0.37815 | -5.13954 |

For respiratory deaths:
| Intercept | PCO3 | coast |
|---|---|---|
| 74.27432 | -0.02208 | -16.57268 |
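To make the size of the coast effect concrete, the two-covariate coefficients can be turned into fitted rates. This sketch assumes coast is coded 0 for inland and 1 for coastal cities (consistent with the group means table in Question 2) and evaluates the model at PCO3 = 0, an illustrative value chosen here rather than one taken from the data. The dictionary and function names are hypothetical.

```python
# (Intercept, PCO3, coast) coefficients from the two-covariate model.
coefficients = {
    "COPD": (47.53327, 0.33993, -11.20846),
    "cardiovascular": (340.00412, -1.26673, -33.44881),
    "pneumonia": (26.28443, -0.37815, -5.13954),
    "respiratory": (74.27432, -0.02208, -16.57268),
}


def fitted_rate(cause, pco3, coast):
    """Fitted mortality rate per 100,000 for one cause of death."""
    intercept, b_pco3, b_coast = coefficients[cause]
    return intercept + b_pco3 * pco3 + b_coast * coast


for cause in coefficients:
    inland = fitted_rate(cause, pco3=0.0, coast=0)
    coastal = fitted_rate(cause, pco3=0.0, coast=1)
    print(f"{cause}: inland {inland:.2f}, coastal {coastal:.2f}")
```

Under these assumptions every fitted coastal rate is lower than the inland one, which matches the direction of the group means reported in Question 2.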

The overall model is significant: Wilks' Lambda gave a p-value of less than 0.0001. Individual tests show that both covariates are significant, with a p-value of 0.0166 for PCO3 and 0.0006 for coast. This indicates that the four mortality rates are affected by both geography and monthly average ozone (through PCO3). Diagnostics were run to check whether the residuals were multivariate normal. The Henze-Zirkler test indicated that they were not.
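For reference, the Wilks' Lambda statistic used in the tests above has the following standard form. The symbols $E$ and $H$ are notation introduced here, not taken from the SAS output: $E$ is the error sums-of-squares-and-cross-products matrix of the fitted model, and $H$ is the corresponding matrix for the hypothesis being tested.

$$
\Lambda = \frac{\det(E)}{\det(E + H)}, \qquad 0 \le \Lambda \le 1 .
$$

Values of $\Lambda$ near 0 indicate that the tested terms account for much of the variation in the four mortality rates, so small values count as evidence against the null hypothesis; the reported p-values come from an F approximation to the distribution of $\Lambda$.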

Question 5:

In order to see which cities have similar average monthly temperatures, I performed both hierarchical and non-hierarchical clustering. First, I performed hierarchical clustering based on average linkage. The problem then lies in how many clusters to choose. The plot of the first two PCs does not give any clear idea of how many clusters to choose, so I looked at both the Cubic Clustering Criterion (CCC) and the pseudo Hotelling's T² statistic. On the plot of CCC vs. number of clusters, there appear to be peaks at 11 clusters and at 7 clusters. The pseudo T² also indicates 11 clusters, since it jumps from 4.9 (at 11 clusters) to 41.7 (at 10 clusters). As a final check, I used a non-hierarchical method, using Beale's F-type statistic to decide between 7 and 11 clusters. The Beale's F-type statistic for this comparison is 16.9, which leads us to conclude that 11 clusters is better. The statistic is computed and compared to the F value using the R code in the appendix. See the SAS output for a plot of the first 2 temperature PCs by cluster, and for a list of cities sorted by cluster. So I can say that all of our cities fall into one of 11 groups based on their average monthly temperature.

Since parts a-f are all clustered in the same manner, I summarize all of the results in the table below.

| Cities clustered by . . . | # of clusters indicated by CCC | # of clusters indicated by Pseudo T² | # indicated from Beale's F-type statistic | Decided number of clusters |
|---|---|---|---|---|
| Monthly avg. Temperatures | 7, 11 | 11 | 11 | 11 |
| Monthly avg. Ozone | 3, 5, 7, 9, 11 | 3, 7, 9, 11 | 11 | 11 |
| Monthly avg. Temps. and Ozone | 4, 6, 10 | 4, 10 | 10 | 10 |
| First 2 Temperature PCs | 6, 11, 13 | SAS did not calculate | 13 | 13 |
| First 4 Ozone PCs | 3, 7, 11 | 3, 7, 11, 15 | 15 | 15 |
| First 2 Temp. PCs and first 4 ozone PCs | 4, 6, 10 | 4, 7, 10, 12 | 12 | 12 |

From these data, the following conclusions can be made. Each of our cities can be put into one of 11 groups, where the cities within a group have similar monthly average temperatures. Each city can also be put into one of 11 groups, where the cities within a group have similar monthly average ozone. They can also be put into one of 10 groups, where the cities within a group have similar monthly average temperatures and ozone. If we use the first few PCs from temperature and ozone, which explain most of the variance of the original variables, we can put each city into one of 13 groups, where the cities within a group have similar average monthly temperatures, or 15 groups, where the cities within a group have similar average monthly ozone, or 12 groups, where the cities within a group have similar average monthly temperatures and ozone. To find which city belongs to which group for each of these 6 grouping methods, see the exhaustive SAS output.
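The R code for Beale's F-type statistic is in the appendix and is not reproduced here. As a hedged sketch, the function below implements the form in which the statistic is commonly given, which is an assumption about what that R code computes. Here `w1` and `w2` are the within-cluster sums of squared distances to the cluster centroids for the smaller and larger solutions; the values passed in are made-up placeholders, while n = 70 cities and p = 6 monthly temperatures follow the text.

```python
def beale_f(w1, w2, c1, c2, n, p):
    """Beale's F-type statistic for c1 clusters versus c2 > c1 clusters.

    w1, w2: within-cluster sums of squared distances to centroids for
            the c1- and c2-cluster solutions
    n: number of observations, p: number of variables
    Returns (F, df1, df2); a large F favors the c2-cluster solution.
    """
    if not (0 < c1 < c2 < n):
        raise ValueError("need 0 < c1 < c2 < n")
    numerator = (w1 - w2) / w2
    denominator = ((n - c1) / (n - c2)) * (c2 / c1) ** (2.0 / p) - 1.0
    return numerator / denominator, p * (c2 - c1), p * (n - c2)


# w1 and w2 are hypothetical; they are not taken from the SAS output.
f_stat, df1, df2 = beale_f(w1=200.0, w2=100.0, c1=7, c2=11, n=70, p=6)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

The statistic is then compared with an F critical value on the returned degrees of freedom, as the text describes.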

Question 6:

It is of interest to find the correlation between temperature and ozone. To learn more about how the two are related, I used canonical correlation analysis. First, I correlated the set of all monthly average temperatures with the set of all monthly average ozone variables. Second, I correlated the first two PCs of monthly average temperature with the first four PCs of monthly average ozone. The SAS output contains the canonical correlation coefficients, the correlation matrices for all four sets of variables, and the correlation matrices for each pair of sets (i.e. for temperatures and ozone, and for the PCs of temperatures and ozone). For both analyses, I report results based on the first pair of canonical variables, since they explain most of the correlation.

For the first case, I find that the canonical correlation between the set of all monthly average temperatures and the set of all monthly average ozone variables is 0.965. This implies a very high correlation between temperature and ozone: as the temperature canonical variable (Temp1) increases, we can feel quite certain that the ozone canonical variable (Oz1) will also increase. Furthermore, we can see which months of temperature are more correlated with the average monthly ozone, and vice versa. First, temperatures for April and August, followed by May, have the strongest correlation with the canonical variable Temp1. They also have the strongest (positive) correlation with the canonical variable Oz1. We may conclude that the average temperatures for April, August, and May have the strongest correlation with the monthly average ozone from April to September. Second, looking at the monthly average ozone variables, we see that the average ozone in June and July has the strongest (negative) correlation with the canonical variable Oz1. These two months also have the strongest (negative) correlation with the canonical variable Temp1. We may conclude that the average ozone during June and July has the strongest correlation with the average monthly temperatures from April to September.

For the second case, I find that the canonical correlation between the first two PCs of monthly average temperature and the first four PCs of monthly average ozone is 0.823. This implies a strong positive correlation between the PC temperature canonical variable (PCTemp1) and the PC ozone canonical variable (PCOz1). As one would expect, the first PC for temperature has the highest (negative) correlation with PCTemp1, and the first two PCs for ozone have the strongest correlation with PCOz1 (positive for PCO1, negative for PCO2). Similarly, the first PC for temperature has the strongest (negative) correlation with PCOz1, and the first two PCs for ozone have the strongest correlation with PCTemp1 (positive for PCO1, negative for PCO2). This implies that the first linear combination of average monthly temperatures (PCT1) has the strongest correlation with the average monthly ozone from April to September, and the first two linear combinations of average monthly ozone (PCO1 and PCO2) have the strongest correlation with average monthly temperatures from April to September.

Below are two plots of the canonical variable scores, first using the monthly means, and second using the PCs.
[Plot of Temp1*Oz1: scatter of canonical variable scores, Temp1 (vertical axis, -2.0 to 2.0) against Oz1 (horizontal axis, -2.0 to 2.0). Legend: A = 1 obs, B = 2 obs, etc.]

[Plot of PCTemp1*PCOz1: scatter of canonical variable scores, PCTemp1 (vertical axis, -3 to 2) against PCOz1 (horizontal axis, -2.5 to 2.0). Legend: A = 1 obs, B = 2 obs, etc.]

The plots illustrate that the canonical correlation was stronger when using the monthly means of temperature and ozone than when using the PCs. The canonical variable scores can be seen in the full SAS output.
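One way to put the two canonical correlations on a common footing is to square them, since the squared canonical correlation is the proportion of variance that the two canonical variables in a pair share. A small sketch using the two values reported above (the dictionary names are hypothetical):

```python
# First canonical correlations reported for the two analyses.
canonical_correlations = {
    "monthly means": 0.965,
    "principal components": 0.823,
}

# Squared canonical correlation = variance shared by the paired
# canonical variables.
shared_variance = {name: r ** 2 for name, r in canonical_correlations.items()}

for name, r in canonical_correlations.items():
    print(f"{name}: r = {r:.3f}, r^2 = {shared_variance[name]:.3f}")
```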


Question 7:

We are now interested in classifying a region as coastal or not coastal using various sets of variables. The end result of the following analysis is that, given the monthly averages of temperature and ozone for a region, we will be able to predict whether it is coastal or not. Three methods are applied to six different sets of variables, and the misclassification rates determine which rule and which set of variables are most effective for deciding whether a region is coastal. Cross-validation is used to calculate the misclassification rates. The following table summarizes the results, which can be found in detail in the SAS output. For each method and each variable set, the SAS output also contains a complete listing of the cities and their classification or misclassification.

| VARIABLES | LINEAR DISCRIMINANT FUNCTION | K NEAREST NEIGHBOR (K=5) | LOGISTIC REGRESSION |
|---|---|---|---|
| TempApril to TempSep | 13/70 | 12/70 | 7/70 |
| O3April to O3Sep | 21/70 | 18/70 | 17/70 |
| TempApril to TempSep and O3April to O3Sep | 10/70 | 9/70 | 0/70 |
| PCT1 and PCT2 | 14/70 | 20/70 | 14/70 |
| PCO1 to PCO4 | 21/70 | 18/70 | 22/70 |
| PCT1, PCT2 and PCO1 to PCO4 | 15/70 | 18/70 | 12/70 |

As is clearly seen, our best way to correctly classify a region as being coastal or not coastal is using all of our monthly means (our average monthly temperatures and our average monthly ozone), and using logistic regression. In this manner, we have the highest probability of correctly classifying a region as coastal or not coastal.
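The misclassification counts come from SAS. For readers who want to see the mechanics, the sketch below shows a leave-one-out form of cross-validation with a k nearest neighbor rule (k = 5). That leave-one-out is the form SAS applied here is an assumption, since the text only says cross-validation, and the function names and the twelve data points are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=5):
    """Majority vote among the k training points nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_misclassified(data, k=5):
    """Count leave-one-out errors: predict each point from all the others."""
    errors = 0
    for i, (features, label) in enumerate(data):
        others = data[:i] + data[i + 1:]
        if knn_predict(others, features, k) != label:
            errors += 1
    return errors


# Made-up (temperature, ozone) pairs with a 0/1 coast label; not exam data.
cities = [
    ((50, 2), 0), ((51, 3), 0), ((49, 1), 0),
    ((52, 2), 0), ((50, 4), 0), ((48, 3), 0),
    ((70, 10), 1), ((71, 11), 1), ((69, 9), 1),
    ((72, 10), 1), ((70, 12), 1), ((68, 11), 1),
]
errors = loo_misclassified(cities, k=5)
print(f"misclassified {errors} of {len(cities)}")
```

Because each city is classified by a rule built without it, the resulting error count is a less optimistic estimate than one obtained by reclassifying the training data.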