
BIRLA INSTITUTE OF TECHNOLOGY

MESRA

ASSIGNMENT

MULTIVARIATE DATA ANALYSIS

Submitted to
Dr. Supriyo Roy

Submitted by-
Rishav Raman Tiwari MBA/10021/18
Q.1. A. DEFINE ANOVA (BOTH ONE WAY AND TWO WAY).
Answer 1A :

Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed aggregate variability
found inside a data set into two parts: systematic factors and random factors. The systematic factors have a
statistical influence on the given data set, while the random factors do not. Analysts use the ANOVA test to
determine the influence that independent variables have on the dependent variable in a regression study.

The ANOVA test allows a comparison of more than two groups at the same time to determine whether a
relationship exists between them. The result of the ANOVA formula, the F statistic (also called the F-ratio),
allows for the analysis of multiple groups of data to determine the variability between samples and within
samples.

If no real difference exists between the tested groups, which is called the null hypothesis, the result of the
ANOVA's F-ratio statistic will be close to 1. Fluctuations in its sampling will likely follow the Fisher F
distribution. This is actually a group of distribution functions, with two characteristic numbers, called the
numerator degrees of freedom and the denominator degrees of freedom.

Assumptions of the ANOVA

1. Observations are randomly and independently selected from their respective populations.
2. The shape of the population distributions is normal.
3. These normal populations have identical variances.

ANOVA is reasonably robust to moderate departures from normality and to unequal variances. Problems occur when
heterogeneity of variances is combined with unequal sample sizes. It is therefore worthwhile to design an
experiment in which the samples drawn from the populations are equal in size.
Analysis of real life data

One way
Null Hypothesis H0: Mean mileage is the same for cars with 4, 6 and 8 cylinders.

The following analysis was carried out on the cars data set.


Descriptives
Mileage

                                                        95% Confidence Interval for Mean
Cylinder   N     Mean       Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
4          394   2.0108E4   8245.90896       4.15423E2    19291.5044    20924.9626    266.00    50387.00
6          310   1.9564E4   7895.61585       4.48441E2    18681.1579    20445.9259    636.00    41566.00
8          100   1.9575E4   8933.50198       8.93350E2    17802.7294    21347.9306    583.00    42691.00
Total      804   1.9832E4   8196.31971       2.89062E2    19264.5279    20399.3402    266.00    50387.00
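The Std. Error column in the Descriptives table follows from the standard deviation and group size as SE = SD / sqrt(N). The sketch below recomputes it from the tabulated values as a consistency check; the variable and function names are illustrative and are not part of the SPSS output.

```python
import math

# (N, standard deviation, reported standard error) copied from the Descriptives table
rows = {
    "4": (394, 8245.90896, 415.423),
    "6": (310, 7895.61585, 448.441),
    "8": (100, 8933.50198, 893.350),
    "Total": (804, 8196.31971, 289.062),
}


def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)


computed = {label: standard_error(sd, n) for label, (n, sd, _) in rows.items()}

for label, (n, sd, reported) in rows.items():
    print(f"{label:>5}: computed SE = {computed[label]:.3f}, reported SE = {reported:.3f}")
```

The computed and reported values should agree up to rounding, which confirms that the columns of the table were read in the right order.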

ANOVA
Mileage

Source           Sum of Squares   df    Mean Square   F      Sig.
Between Groups   5.899E7          2     2.950E7       .438   .045
Within Groups    5.389E10         801   6.727E7
Total            5.395E10         803

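The ANOVA table reports a significance value of .045 (i.e., p = .045), which is below 0.05. Taken at face value,
this indicates a statistically significant difference in mean mileage between the cylinder groups, and the null
hypothesis would be rejected. However, the reported F of .438 is below 1, and with 2 and 801 degrees of freedom
an F this small corresponds to a p-value of about .65, not .045. The three group means in the Descriptives table
(about 20,108, 19,564 and 19,575) are also very close relative to standard deviations of about 8,000. The Sig.
entry should therefore be rechecked against the SPSS output; on the F value as reported, the null hypothesis is
not rejected.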
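The F ratio and its p-value can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The sketch below does this with the standard library only. It relies on a closed form for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2, as they do here; the function name is an illustrative assumption, not an SPSS or library routine.

```python
# Values copied from the one-way ANOVA table
ss_between, df_between = 5.899e7, 2
ss_within, df_within = 5.389e10, 801

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within


def f_upper_tail_numerator_df_2(f, df_denominator):
    """P(F > f) for an F(2, df_denominator) variable.

    This closed form is valid only when the numerator df is exactly 2.
    """
    return (1.0 + 2.0 * f / df_denominator) ** (-df_denominator / 2.0)


p_value = f_upper_tail_numerator_df_2(f_ratio, df_within)
print(f"F = {f_ratio:.3f}, p = {p_value:.3f}")
```

With the table's figures the F ratio comes out below 1 and the p-value well above 0.05, so the F and Sig. entries of the table are worth comparing against the original SPSS output.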

Two way
A two-way ANOVA was conducted to analyze the effect of make and type of car on the price of the car.

Tests of Between-Subjects Effects
Dependent Variable: Price

Source            Type III Sum of Squares   df    Mean Square   F         Sig.
Corrected Model   6.417E10 (a)              14    4.583E9       253.005   .000
Intercept         2.164E11                  1     2.164E11      1.194E4   .000
Make              2.076E10                  5     4.153E9       229.230   .000
Type              1.205E10                  4     3.012E9       166.244   .000
Make * Type       4.590E9                   5     9.181E8       50.679    .000
Error             1.429E10                  789   1.812E7
Total             4.447E11                  804
Corrected Total   7.846E10                  803
a. R Squared = .818 (Adjusted R Squared = .815)

These rows tell us whether our independent variables (the "Make" and "Type" rows) and their interaction
(the "Make * Type" row) have a statistically significant effect on the dependent variable, "Price".
All three have Sig. = .000, so make, type and their interaction each have a significant effect on the price of a car.
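The F ratios and the R Squared footnote of the two-way table can be reproduced from its sums of squares. The sketch below assumes the usual definitions: each F is the effect mean square divided by the error mean square, R Squared is 1 minus SS(Error) / SS(Corrected Total), and the adjusted value uses the corresponding mean squares. Variable names are illustrative.

```python
# Type III sums of squares and df copied from the Tests of Between-Subjects Effects table
effects = {
    "Make": (2.076e10, 5),
    "Type": (1.205e10, 4),
    "Make * Type": (4.590e9, 5),
}
ss_error, df_error = 1.429e10, 789
ss_corrected_total, df_corrected_total = 7.846e10, 803

ms_error = ss_error / df_error

# F ratio for each effect: mean square of the effect over mean square of the error
f_ratios = {name: (ss / df) / ms_error for name, (ss, df) in effects.items()}

r_squared = 1.0 - ss_error / ss_corrected_total
adjusted_r_squared = 1.0 - ms_error / (ss_corrected_total / df_corrected_total)

for name, f in f_ratios.items():
    print(f"{name:<12} F = {f:.2f}")
print(f"R squared = {r_squared:.3f}, adjusted R squared = {adjusted_r_squared:.3f}")
```

Because the table entries are rounded to four significant figures, the recomputed F ratios match the reported ones only approximately.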
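Managerial significance
The two-way ANOVA shows that make and type of car, and their interaction, have a statistically significant effect
on price, so manufacturers should take both into account when designing and pricing a car. The one-way ANOVA gives
weaker support for an effect of cylinder count on mileage: the reported significance of .045 rests on an F ratio
of only .438, so that result should be confirmed before it is used in decisions.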

Answer 2 MANOVA
Multivariate analysis of variance (MANOVA) is simply an ANOVA with several dependent variables. That is
to say, ANOVA tests for the difference in means between two or more groups, while MANOVA tests for the
difference in two or more vectors of means.
Univariate case

One-way ANOVA investigates the effects of a categorical variable (the groups, i.e. independent variables) on a
continuous outcome (the dependent variable). In one-way ANOVA, we have m random variables x1, …, xm (also
called groups or treatments). For each group we have a sample, where we denote the jth group sample as
$\{x_{1j}, \dots, x_{n_j j}\}$. Group j is said to have nj subjects in its sample. We also define
$n = \sum_{j=1}^{m} n_j$.

Our objective is to test the null hypothesis H0: μ1 = μ2 = ⋯ = μm.

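We use the following definitions for the total (T), between groups (B) and within groups (W) sum of squares
(SS), degrees of freedom (df) and mean square (MS), where $\bar{x}_j$ is the mean of group j and $\bar{x}$ is the
grand mean of all n observations:

$$SS_T = \sum_{j=1}^{m}\sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2, \qquad SS_B = \sum_{j=1}^{m} n_j (\bar{x}_j - \bar{x})^2, \qquad SS_W = \sum_{j=1}^{m}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$

$$df_T = n - 1, \qquad df_B = m - 1, \qquad df_W = n - m$$

$$MS_T = SS_T / df_T, \qquad MS_B = SS_B / df_B, \qquad MS_W = SS_W / df_W$$

The test statistic F is defined as follows and has an F distribution with dfB, dfW degrees of freedom:

$$F = \frac{MS_B}{MS_W}$$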

We reject the null hypothesis if F > Fcrit.
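The between-groups and within-groups quantities described above can be computed directly from raw samples. The following is a minimal sketch using the textbook one-way ANOVA definitions; the function name and the small data set are made up for illustration and are not taken from the cars file.

```python
def one_way_anova(groups):
    """Return (SS_B, SS_W, df_B, df_W, F) for a list of samples, one list per group."""
    n = sum(len(g) for g in groups)  # total number of observations
    m = len(groups)                  # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups SS: group size times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    # Within-groups SS: squared distance of each observation from its own group mean
    ss_within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

    df_between, df_within = m - 1, n - m
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


# Made-up illustrative data
sample_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
ss_b, ss_w, df_b, df_w, f_stat = one_way_anova(sample_groups)
print(f"SS_B = {ss_b:.2f}, SS_W = {ss_w:.2f}, df = ({df_b}, {df_w}), F = {f_stat:.2f}")
```

The resulting F is then compared with the critical value Fcrit for (dfB, dfW) degrees of freedom at the chosen significance level.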

Multivariate case

MANOVA also investigates the effects of a categorical variable (the groups, i.e. independent variables) on a
continuous outcome, but in this case the outcome is represented by a vector of dependent variables.

We could simply perform multiple ANOVAs, one for each dependent variable, but this would have two
disadvantages: it would introduce additional experiment-wise error and it would not account for the correlations
between the dependent variables. It is therefore possible that MANOVA shows a significant difference between
the means while the individual ANOVAs do not.

MANOVA can also be used in place of ANOVA with repeated measures, in which case no sphericity
assumption needs to be met. In this case, you treat the repeated levels as dependent
variables.

Definition 1: In one-way MANOVA, we have m random vectors X1, …, Xm (representing groups or treatments). Each Xj
is a k × 1 column vector of form

$$X_j = [x_{j1}, x_{j2}, \dots, x_{jk}]^T$$

where each xjp is a random variable.

For each random vector Xj we collect a sample $\{X_{1j}, \dots, X_{n_j j}\}$ of size nj. We also define
$n = \sum_{j=1}^{m} n_j$. Each sample Xij is a k × 1 vector of form

$$X_{ij} = [x_{ij1}, x_{ij2}, \dots, x_{ijk}]^T$$

where each xijp is a data element (not a random variable), where index i refers to the subject in the
experiment (1 ≤ i ≤ nj), index j refers to the group (1 ≤ j ≤ m) and index p refers to the position (i.e. dependent
variable) within the random vector (1 ≤ p ≤ k).

Our objective is to test the null hypothesis H0: μ1 = μ2 = ⋯ = μm, where the μj are vectors

$$\mu_j = [\mu_{j1}, \mu_{j2}, \dots, \mu_{jk}]^T$$

and so the null hypothesis is equivalent to H0: μ1p = μ2p = ⋯ = μmp for all p such that 1 ≤ p ≤ k. The alternative
hypothesis is therefore H1: μr ≠ μj for some r, j such that 1 ≤ r, j ≤ m, or equivalently, μrp ≠ μjp for some r, j,
p such that 1 ≤ r, j ≤ m and 1 ≤ p ≤ k.

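Now we define the various means as in the univariate case, except that now these means become k × 1 vectors.
The total (or grand) mean vector is the column vector

$$\bar{X} = [\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k]^T$$

where

$$\bar{x}_p = \frac{1}{n} \sum_{j=1}^{m} \sum_{i=1}^{n_j} x_{ijp}$$

The sample group mean vector for group j is a column vector

$$\bar{X}_j = [\bar{x}_{j1}, \bar{x}_{j2}, \dots, \bar{x}_{jk}]^T$$

where

$$\bar{x}_{jp} = \frac{1}{n_j} \sum_{i=1}^{n_j} x_{ijp}$$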
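To make the mean vectors concrete, the sketch below computes Wilks' Lambda for a one-way design with two dependent variables. It assumes the standard definition Lambda = det(W) / det(T), where W is the pooled within-groups sum of squares and cross products (SSCP) matrix around the group mean vectors and T is the total SSCP matrix around the grand mean vector. All function names and the tiny data sets are illustrative and are not taken from the cars file.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length observation vectors."""
    n = len(rows)
    return [sum(r[p] for r in rows) / n for p in range(len(rows[0]))]


def sscp(rows, center):
    """Sum of squares and cross products matrix of rows around a center vector."""
    k = len(center)
    matrix = [[0.0] * k for _ in range(k)]
    for r in rows:
        d = [r[p] - center[p] for p in range(k)]
        for a in range(k):
            for b in range(k):
                matrix[a][b] += d[a] * d[b]
    return matrix


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda_two_dv(groups):
    """Wilks' Lambda = det(W) / det(T) for groups of 2-element observation vectors."""
    all_rows = [r for g in groups for r in g]
    total = sscp(all_rows, mean_vector(all_rows))  # around the grand mean vector
    within = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        w = sscp(g, mean_vector(g))                # around the group mean vector
        for a in range(2):
            for b in range(2):
                within[a][b] += w[a][b]
    return det2(within) / det2(total)


# Made-up data: identical groups versus clearly separated groups
identical = [[(1, 2), (2, 1), (3, 4)], [(1, 2), (2, 1), (3, 4)]]
separated = [[(1, 2), (2, 1), (3, 4)], [(11, 12), (12, 11), (13, 14)]]
lambda_identical = wilks_lambda_two_dv(identical)
lambda_separated = wilks_lambda_two_dv(separated)
print(lambda_identical, lambda_separated)
```

Values of Lambda near 1 mean the group mean vectors are close to the grand mean vector; values near 0 mean most of the total variation lies between groups.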

Multivariate Tests (c)

Effect            Statistic            Value    F             Hypothesis df   Error df   Sig.
Intercept         Pillai's Trace       .941     6.369E3 (a)   2.000           792.000    .000
                  Wilks' Lambda        .059     6.369E3 (a)   2.000           792.000    .000
                  Hotelling's Trace    16.082   6.369E3 (a)   2.000           792.000    .000
                  Roy's Largest Root   16.082   6.369E3 (a)   2.000           792.000    .000
Cylinder          Pillai's Trace       .469     121.327       4.000           1.586E3    .000
                  Wilks' Lambda        .532     1.468E2 (a)   4.000           1.584E3    .000
                  Hotelling's Trace    .878     173.543       4.000           1.582E3    .000
                  Roy's Largest Root   .876     3.473E2 (b)   2.000           793.000    .000
Type              Pillai's Trace       .399     49.389        8.000           1.586E3    .000
                  Wilks' Lambda        .601     57.341 (a)    8.000           1.584E3    .000
                  Hotelling's Trace    .663     65.533        8.000           1.582E3    .000
                  Roy's Largest Root   .662     1.313E2 (b)   4.000           793.000    .000
Cylinder * Type   Pillai's Trace       .010     1.025         8.000           1.586E3    .415
                  Wilks' Lambda        .990     1.024 (a)     8.000           1.584E3    .415
                  Hotelling's Trace    .010     1.023         8.000           1.582E3    .416
                  Roy's Largest Root   .007     1.470 (b)     4.000           793.000    .209
a. Exact statistic
b. The statistic is an upper bound on F that yields a lower bound on the significance level.
c. Design: Intercept + Cylinder + Type + Cylinder * Type
The Multivariate Tests table is where we find the result of the MANOVA, which here has two factors (Cylinder and
Type) and two dependent variables (price and mileage). For each effect, look at the Wilks' Lambda row and its
"Sig." column. Cylinder has Wilks' Lambda = .532 and Type has Wilks' Lambda = .601, each with a "Sig." value of
.000, which means p < .0005. We can therefore conclude that price and mileage, taken together, differ
significantly across cylinder groups and across types of car (p < .0005). The Cylinder * Type interaction is not
significant (Wilks' Lambda = .990, p = .415).

Managerial Significance

The analysis above indicates that cylinder count and type of car are important determinants of a car's price and
mileage, so manufacturers should keep these two factors in mind when designing a car.
Q3. What is the general form of multiple regression?

The general form of the equation for linear regression is:

y=B*x+A

Where y is the dependent variable, x is the independent variable, and A and B are coefficients dictating the
equation. The difference between the equation for linear regression and the equation for multiple regression is
that the equation for multiple regression must be able to handle multiple inputs, instead of only the one input of
linear regression. To account for this change, the equation for multiple regression takes the form:

y = B_1 * x_1 + B_2 * x_2 + … + B_n * x_n + A

In this equation, the subscripts denote the different independent variables: x_1 is the value of the first
independent variable, x_2 is the value of the second, and so on up to the last independent variable, x_n. The
model allows any number, n, of independent variables, with one term added for each. The B coefficients carry the
same subscripts, indicating which independent variable each coefficient is linked to. A, as before, is a constant
giving the value of the dependent variable, y, when all of the independent variables, the xs, are zero.
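The general form can be written as a short function that adds the constant A to the sum of each coefficient times its input. This is a minimal sketch; the function name and the numbers are illustrative only.

```python
def predict(intercept, coefficients, inputs):
    """Evaluate y = A + B_1 * x_1 + B_2 * x_2 + ... + B_n * x_n."""
    if len(coefficients) != len(inputs):
        raise ValueError("need exactly one input value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, inputs))


# Illustrative numbers: A = 1.0, B = (2.0, -0.5, 3.0), x = (4.0, 2.0, 1.0)
y = predict(1.0, [2.0, -0.5, 3.0], [4.0, 2.0, 1.0])

# When every independent variable is zero, the prediction reduces to the constant A
y_at_zero = predict(1.0, [2.0, -0.5, 3.0], [0.0, 0.0, 0.0])
print(y, y_at_zero)
```

With a single coefficient and a single input the same function reproduces the simple linear form y = B*x + A.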

Coefficients (a)

                   Unstandardized Coefficients   Standardized Coefficients
Model              B           Std. Error        Beta                        t        Sig.
1   (Constant)     3026.463    1770.855                                      1.709    .088
    Cylinder       3931.519    694.300           .552                        5.663    .000
    liter          -909.962    881.508           -.102                       -1.032   .002
    Doors          -1528.710   325.899           -.131                       -4.691   .000
    Cruise         6176.824    668.848           .270                        9.235    .300
    Sound          -1900.839   581.237           -.090                       -3.270   .001
    Leather        3318.087    607.825           .150                        5.459    .000
a. Dependent Variable: Price
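Each t value in the Coefficients table is the unstandardized coefficient B divided by its standard error. The sketch below recomputes t from the tabulated B and Std. Error as a consistency check; the dictionary layout is illustrative.

```python
# (B, standard error, reported t) copied from the Coefficients table
coefficient_rows = {
    "(Constant)": (3026.463, 1770.855, 1.709),
    "Cylinder": (3931.519, 694.300, 5.663),
    "liter": (-909.962, 881.508, -1.032),
    "Doors": (-1528.710, 325.899, -4.691),
    "Cruise": (6176.824, 668.848, 9.235),
    "Sound": (-1900.839, 581.237, -3.270),
    "Leather": (3318.087, 607.825, 5.459),
}

# t statistic for each term: coefficient divided by its standard error
t_values = {name: b / se for name, (b, se, _) in coefficient_rows.items()}

for name, (_, _, reported_t) in coefficient_rows.items():
    print(f"{name:<11} computed t = {t_values[name]:7.3f}, reported t = {reported_t:7.3f}")
```

The recomputed t values match the table. Since a |t| close to 1 normally goes with a large p-value and a |t| above 9 with a p-value near zero, the Sig. entries for liter (.002) and Cruise (.300) look inconsistent with their t values and are worth rechecking against the SPSS output.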


Variables Entered/Removed (b)

Model   Variables Entered                                    Variables Removed   Method
1       Leather, Doors, Cylinder, Sound, Cruise, liter (a)   .                   Enter
a. All requested variables entered.
b. Dependent Variable: Price
Analysis

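Using the unstandardized coefficients (column B), the fitted equation for this model is

Price = 3026.463 + 3931.519 * Cylinder - 909.962 * liter - 1528.710 * Doors + 6176.824 * Cruise - 1900.839 * Sound + 3318.087 * Leather

The standardized Beta values (.552, -.102, -.131, .270, -.090, .150) have no constant term and are used only to
compare the relative strength of the predictors, not to predict price. Note that the Sig. values reported for
liter (.002, with t = -1.032) and Cruise (.300, with t = 9.235) do not match their t statistics and should be
rechecked against the SPSS output.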
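A fitted regression equation is applied by plugging a car's attributes into the unstandardized B column of the Coefficients table. The sketch below does this for a hypothetical car; the attribute values, and the reading of Cruise, Sound and Leather as 0/1 indicators, are assumptions made for illustration.

```python
# Unstandardized coefficients (column B) copied from the Coefficients table
constant = 3026.463
b_coefficients = {
    "Cylinder": 3931.519,
    "liter": -909.962,
    "Doors": -1528.710,
    "Cruise": 6176.824,
    "Sound": -1900.839,
    "Leather": 3318.087,
}


def predicted_price(car):
    """Constant plus the sum of B times the car's value for each predictor."""
    return constant + sum(b * car[name] for name, b in b_coefficients.items())


# Hypothetical car: 6 cylinders, 3.0 litre engine, 4 doors, all three options present
hypothetical_car = {"Cylinder": 6, "liter": 3.0, "Doors": 4, "Cruise": 1, "Sound": 1, "Leather": 1}
price = predicted_price(hypothetical_car)
print(f"Predicted price: {price:.2f}")
```

The standardized Beta column is not used here because it has no constant term and is expressed in standard deviation units rather than in price units.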
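Managerial significance
Cylinder has the largest standardized coefficient (Beta = .552), so the price of a car depends most strongly on
the number of cylinders, and manufacturers should give it the most attention. Engine size (liter), number of
doors, the sound system and interior quality (leather) are also associated with price and with what attracts
customers.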

Q4. Factor analysis
a. Factor – The initial number of factors is the same as the number of variables used in the factor analysis.
However, not all 12 factors will be retained. In this example, only the first three factors will be retained (as we
requested).
b. Initial Eigenvalues – Eigenvalues are the variances of the factors. Because we conducted our factor analysis
on the correlation matrix, the variables are standardized, which means that each variable has a variance of 1,
and the total variance is equal to the number of variables used in the analysis, in this case, 12.
c. Total – This column contains the eigenvalues. The first factor will always account for the most variance (and
hence have the highest eigenvalue), and the next factor will account for as much of the leftover variance as it
can, and so on. Hence, each successive factor will account for less and less variance.
d. % of Variance – This column contains the percent of total variance accounted for by each factor.
e. Cumulative % – This column contains the cumulative percentage of variance accounted for by the current
and all preceding factors. For example, the third row shows a value of 68.313. This means that the first three
factors together account for 68.313% of the total variance.
f. Extraction Sums of Squared Loadings – The number of rows in this panel of the table corresponds to the
number of factors retained. In this example, we requested that three factors be retained, so there are three rows,
one for each retained factor. The values in this panel of the table are calculated in the same way as the values in
the left panel, except that here the values are based on the common variance. The values in this panel of the
table will always be lower than the values in the left panel of the table, because they are based on the common
variance, which is always smaller than the total variance.

The scree plot graphs the eigenvalue against the factor number. You can see these values in the first two
columns of the table immediately above. From the third factor on, you can see that the line is almost flat,
meaning that each successive factor accounts for smaller and smaller amounts of the total variance.
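The "% of Variance" and "Cumulative %" columns described above follow directly from the eigenvalues. The sketch below shows the arithmetic; the twelve eigenvalues are made up for illustration (they sum to 12, as they would for 12 standardized variables) and are not the values from the M255.sav output.

```python
def variance_explained(eigenvalues):
    """Return (% of variance, cumulative %) lists for a list of eigenvalues."""
    total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
    percents = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for pct in percents:
        running += pct
        cumulative.append(running)
    return percents, cumulative


# Made-up eigenvalues for 12 variables, in decreasing order
example_eigenvalues = [4.2, 2.4, 1.6, 0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.1]
percents, cumulative = variance_explained(example_eigenvalues)

for number, (ev, pct, cum) in enumerate(zip(example_eigenvalues, percents, cumulative), start=1):
    print(f"Factor {number:>2}: eigenvalue = {ev:.2f}, % of variance = {pct:6.2f}, cumulative % = {cum:6.2f}")
```

Plotting the eigenvalues against the factor number gives the scree plot; the point where the curve flattens suggests how many factors to retain.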
Q5. Structural equation modeling
Structural equation modeling (SEM) is a multivariate statistical analysis technique that is used to analyze
structural relationships. It combines factor analysis and multiple regression analysis, and it is
used to analyze the structural relationship between measured variables and latent constructs. Researchers prefer
this method because it estimates multiple, interrelated dependence relationships in a single analysis. Two types
of variables are used in this analysis: endogenous variables and exogenous variables. Endogenous
variables are equivalent to dependent variables, and exogenous variables are equivalent to independent variables.
Theory:
This can be thought of as a set of relationships providing consistency and comprehensive explanations of the
actual phenomena.  There are two types of models:
Measurement model: The measurement model represents the theory that specifies how measured variables
come together to represent the theory.
Structural model: Represents the theory that shows how constructs are related to other constructs.
Structural equation modeling is also called causal modeling because it tests the proposed causal relationships.
The following assumptions are made:
Multivariate normal distribution: The maximum likelihood method is used, and it assumes a multivariate
normal distribution. Small departures from multivariate normality can lead to a large difference in the chi-square
test.
Linearity: A linear relationship is assumed between endogenous and exogenous variables.
Outliers: Data should be free of outliers, because outliers affect the model significance.
Sequence: There should be a cause and effect relationship between endogenous and exogenous variables, and a
cause has to occur before its effect.
Non-spurious relationship: The observed covariance must be genuine.
Model identification: There must be at least as many equations as estimated parameters; that is, the model
should be over-identified or exactly identified. Under-identified models are not considered.
Sample size: Most researchers prefer a sample of 200 to 400 cases with 10 to 15 indicators. As a rule of
thumb, that is 10 to 20 times as many cases as variables.
Uncorrelated error terms: Error terms are assumed to be uncorrelated with the error terms of other variables.
Data: Interval data is used.
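The model identification assumption can be checked with a simple count. With p observed variables, the covariance matrix supplies p(p + 1)/2 distinct variances and covariances, and the model's degrees of freedom are that count minus the number of freely estimated parameters. The sketch below applies this counting rule; the function names and the example numbers are illustrative, and a non-negative count is a necessary condition for identification, not a sufficient one.

```python
def sem_degrees_of_freedom(n_observed_variables, n_free_parameters):
    """Distinct variances and covariances minus freely estimated parameters."""
    p = n_observed_variables
    n_moments = p * (p + 1) // 2
    return n_moments - n_free_parameters


def identification_status(degrees_of_freedom):
    """Classify a model by the sign of its degrees of freedom (counting rule only)."""
    if degrees_of_freedom < 0:
        return "under-identified"
    if degrees_of_freedom == 0:
        return "just-identified"
    return "over-identified"


# Hypothetical model: 6 observed indicators and 13 free parameters
df = sem_degrees_of_freedom(6, 13)
status = identification_status(df)
print(df, status)
```

A model that fails this count cannot be estimated uniquely; one that passes may still need further checks, such as the order and rank conditions.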
Steps:
Defining individual constructs: The first step is to define the constructs theoretically.  Conduct a pretest to
evaluate the item.  A confirmatory test of the measurement model is conducted using CFA.
Developing the overall measurement model: The measurement model is usually drawn as a path diagram. Path
analysis is a set of relationships between exogenous and endogenous variables, shown by the use of
arrows. The measurement model follows the assumption of unidimensionality. Measurement theory is based on
the idea that latent constructs cause the measured variables and that the error terms of the measured variables
are uncorrelated. In a measurement model, an arrow is therefore drawn from the construct to each of its measured
variables.
Design the study to produce the empirical results: In this step, the researcher must specify the model.  The
researcher should design the study to minimize the likelihood of an identification problem.  Order condition and
rank condition methods are used to minimize the identification problem.
Assessing the measurement model validity: Assessing the measurement model is also called CFA.  In CFA, a
researcher compares the theoretical measurement against the reality model.  The result of the CFA must be
associated with the constructs’ validity.
Specifying the structural model: In this step, structural paths are drawn between constructs.  In the structural
model, no arrow can enter an exogenous construct.  A single-headed arrow is used to represent a hypothesized
structural relationship between one construct and another.  This shows the cause and effect relationship.  Each
hypothesized relationship uses one degree of freedom.  The model can be recursive or non-recursive.
Examine the structural model validity: In the last step, a researcher examines the structural model validity. A
model is considered a good fit if the chi-square test is not significant, and at least one goodness-of-fit
index (like CFI, GFI, TLI, AGFI, etc.) and one badness-of-fit index (like RMR, RMSEA, SRMR, etc.) meet the
predetermined criteria.

C. HIGHLIGHT MANAGERIAL SIGNIFICANCE.


Answer 5C:
Imagine if you wanted to better understand which consumer perceptions are most strongly associated with
Liking, Purchase Interest or Satisfaction in your product or service category, and also see if there are latent
segments (clusters) of consumers with different perceptions of the category or features they are seeking. Though
not a simple modeling task, SEM would be appropriate for these objectives, and the images of brands could also
be mapped to help us understand how the dimensions underlying brand perceptions distinguish the brands.
SEM can be used for simpler jobs, such as the example below from a consumer survey on a men's personal care
category. This illustration is a simplified and cloaked version of the full model, which included many more
attributes as well as exogenous variables such as age. I should note that a great deal of output besides a path
diagram needs to be checked carefully!

Fig 5.1

• In the path diagram above, ovals represent factors, also known as latent variables, unobserved variables
or unmeasured variables in SEM lingo. These are theoretical concepts which can be inferred but not directly
measured.
• Rectangles are used to represent attributes, also called measured variables, observed variables, or
manifest variables. In this example, the Traditional factor is represented by, or measured by, the
attributes Prestigious, Big Brand and Reliable.
• Single-headed arrows pointing from one latent variable to another depict hypothetical causal
relationships, for instance the impact of Traditional on Brand Equity, the dependent variable in this
analysis. These can be likened to regression coefficients. The single-headed arrows running from the latent
variables to the attributes are equivalent to loadings in Factor Analysis.
• The double-headed arrow is the correlation between the latent exogenous (independent) variables in this
example.
• The numbers adjacent to the arrows are the regression coefficients, correlation coefficients and factor
loadings. In SEM, regression coefficients are normally smaller than correlations and loadings, as they are
here.
• To reduce clutter, I've omitted error and residual terms, which are similar to unique factors in Factor
Analysis and residual terms in regression.
The brands rated in this survey were also mapped in a scatterplot according to their factor scores in the full
model. This is not shown for reasons of confidentiality and space.

Structural Equation Mixture Modeling (SEMM) was employed to see if there might be hidden segments of
consumers with very different needs lurking in the data. (I call this "Driver Segmentation.") It was concluded
that two driver models were needed, which partly reflected price tier, but that the same factors (latent variables)
could be used for these two segments. This is not always the case and there are times when entirely different
models are necessary.

Merely assuming that a one-size-fits-all model is sufficient can lead to very bad decisions, but running many
models on pre-determined subgroups is also ill-advised unless we have sound theoretical and empirical grounds
to believe these subgroups are truly distinct with regard to their needs or perceptions. Mixture Modeling is very
tricky but worth the effort when done competently. Sometimes we conclude that one overall model is sufficient
- negative findings are also important.

Data used
1. FOR FACTOR ANALYSIS

M255.sav

2. FOR ANOVA MANOVA AND MULTIPLE REGRESSION

cars (2).xlsx

Software used
SPSS for all analysis
