Evidence-based Management
Quantitative Data Analysis, part III
• Nominal: Numbers are just placeholders for which group you’re in:
• For example: (1) = Our section; (2) = Dr. Veselovsky’s section; (3) = Dr. Jelley’s section
• Ordinal: Numbers indicate rank ordering, but not actual distance:
• For example: (1) = D grade; (2) = C grade; (3) = B grade; (4) = A grade.
• Interval: Numbers indicate rank *and* there’s equal distance between each number:
• However, there’s no fixed or meaningful “starting point” on the scale (i.e., 0). It’s arbitrary.
• For example: Temperature. 30 degrees Celsius is 15 degrees hotter than 15 degrees
Celsius…but it is not “twice as hot”.
• Ratio: An interval scale that *also* has a meaningful starting point (i.e., 0):
• For example: Weight. Someone who weighs 240 lbs *is* twice as heavy as someone who
weighs 120 lbs.
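The interval-versus-ratio distinction can be made concrete with a little arithmetic (the numbers below are illustrative, not from the lecture):

```python
# Ratio scale: weight has a true zero, so ratios are meaningful.
weight_a, weight_b = 240, 120   # lbs
ratio = weight_a / weight_b     # 2.0 -> "twice as heavy" is a valid claim

# Interval scale: Celsius has an arbitrary zero, so ratios are NOT meaningful.
# Converting to Kelvin (a ratio scale with a true zero) shows why
# 30 C is not "twice as hot" as 15 C.
c_hot, c_cold = 30.0, 15.0
k_hot, k_cold = c_hot + 273.15, c_cold + 273.15
kelvin_ratio = k_hot / k_cold   # ~1.05, nowhere near 2.0
```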
                           Continuous IV (Predictor)    Categorical IV (Predictor)
Categorical DV (Outcome)   Logistic Regressions         Chi-Squares
Summary: The Caveat of Data Scaling
• When we conduct inferential analyses, we’re trying to use sample results to make
conclusions about our larger, unstudied population:
• There are several different analytic strategies that we can use for this.
• Remember from earlier this semester that there are four general types of data scaling
(nominal, ordinal, interval and ratio), which we can group into two general categories:
• Categorical variables (nominal or ordinal): Treated as membership in a distinct group.
• Continuous variables (interval or ratio): Treated as a magnitude or amount of something.
• The type of scaling you use for your independent and dependent variables will
determine what kind of statistical analysis you can run.
Part II: Correlation and Regression
When the IV and DV are both Continuous…
Independent Variable: “Work Shift Timing”
• The absolute value of the correlation gives us the size of the effect:
• ±.1 = small effect / ±.3 = medium effect / ±.5 = large effect.
• The squared correlation (r2) can range from 0.00 (no variance in the outcome
explained) to 1.00 (all of the variance in the outcome explained).
• For example: r = 0.50 → r2 = 0.25; r = −0.50 → r2 = 0.25.
• In both cases, 25 percent of the variance in the outcome is explained.
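A minimal sketch of computing r and r2 by hand (the data values are invented for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance of X and Y scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]       # e.g., time of day (coded)
y = [2, 1, 4, 3, 5]       # e.g., number of adventures
r = pearson_r(x, y)       # ~0.8 -> a large effect
r_squared = r ** 2        # ~0.64 -> 64% of variance explained
```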
[Figure: Venn diagram of “Time of Day” and “Number of Adventures”; the overlap represents the proportion of variance in number of adventures predicted by time of day (r2).]
Adding the Idea of “Regression”
• A regression equation goes one step further:
• It allows us to predict future scores on our outcome using known scores on our predictor.
Y = B0 + B1(X) + e
• B0 = Intercept; B1 = Slope.
• The ‘e’ in the equation acknowledges that these guesses aren’t perfect:
• Represents all aspects of ‘Y’ that are not predicted by ‘X’.
• ‘e’ is assumed to be random.
What Does That Line Mean Though?!
• This line is calculated to minimize the average amount of distance between data
points and their “predicted score” on the line:
• B0 (Intercept): The value of ‘Y’ where the line crosses the Y-axis (i.e., the predicted ‘Y’ when ‘X’ = 0).
• B1 (Slope): The increase in ‘Y’ for each one unit increase in ‘X’.
• This line provides the best linear (i.e., straight line) prediction about ‘Y’ using ‘X’.
[Figure: scatterplot of Weight (kg) against Height (cm), with the best-fitting regression line.]
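The least-squares estimates described above can be computed directly: the slope is the covariance of X and Y divided by the variance of X, and the intercept anchors the line at the means. A minimal Python sketch (the height and weight values are invented for illustration):

```python
def fit_line(xs, ys):
    """Least-squares line: B1 = cov(X, Y) / var(X); B0 = mean(Y) - B1 * mean(X)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    b1 = cov / var_x
    b0 = my - b1 * mx
    return b0, b1

heights = [150, 160, 170, 180, 190]   # cm (illustrative)
weights = [55, 62, 69, 76, 83]        # kg (perfectly linear here, for clarity)
b0, b1 = fit_line(heights, weights)   # slope ~0.7: +0.7 kg per extra cm
predicted = b0 + b1 * 175             # predicted weight for someone 175 cm tall
```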
Interpreting the Slope of the Regression
• Researchers are usually most interested in the slope of the regression equation:
• The slope tells us how much ‘Y’ should change by after a one unit increase in ‘X’:
NumAdventures = 2 + 1.3(TimeDay) + e
• We can also square a correlation to calculate the coefficient of determination (i.e., the
percentage of variance in our outcome that is predicted by our predictor):
• Can range from 0.00 (no prediction) to 1.00 (perfect prediction).
• A regression allows us to predict future scores on our outcome using known scores on
our predictor using algebra:
• B0 (Intercept): The value of ‘Y’ where the line crosses the Y-axis (i.e., the predicted ‘Y’ when ‘X’ = 0).
• B1 (Slope): The increase in ‘Y’ for each one unit increase in ‘X’.
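Plugging values into the example equation above (dropping the random error term ‘e’, which is zero on average) shows how the slope drives the prediction:

```python
def predicted_adventures(time_day):
    # NumAdventures = 2 + 1.3 * TimeDay (error term omitted for a point prediction)
    return 2 + 1.3 * time_day

# Each one-unit increase in TimeDay raises the prediction by the slope, 1.3:
delta = predicted_adventures(3) - predicted_adventures(2)   # ~1.3
```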
Part III: Analyses of Variance
When the IV is Categorical and the DV is Continuous…
Independent Variable: “Work Shift Timing”
• To examine these relations, we need to move past the idea of looking at how scores
on each variable covary:
• Instead, we should look at the mean score on the outcome variable for members of
each group we’re examining.
F = Between-group variance / Within-group variance
• The F-value will equal 1.00 if the between-groups variability equals the average
variability within groups:
• The F-value gets larger as more variance is accounted for between groups relative to
within groups.
Some Different Types of t-Tests and ANOVAs
• The single sample t-test: Compare one group’s average score to a pre-determined
standard score.
• The independent groups t-test: Compare one group’s average score to a second
group’s average score.
• The repeated measures t-test: Compare one group’s average score to the same
group’s average score at a later point in time.
• The one-way ANOVA: Compare the average scores of two or more groups (one
independent variable only).
• The factorial ANOVA: Compare the average scores of two or more groups (two or
more independent variables):
• Notated like: 2x3 factorial ANOVA for a situation where one independent variable has
two groups, and the other independent variable has three groups.
• The multivariate ANOVA (MANOVA): An ANOVA with multiple dependent variables.
Summary: Analyses of Variance
• t-tests and ANOVAs examine variable relationships by looking at average group
differences:
• How much do members from “X group 1” differ from members of “X group 2” on “Y”?
• We need our independent variable to be categorical, and our outcome to be continuous.
• In this calculation, we compare the difference in “Y” we found between the groups to
the average difference that we’d expect within any given group.
Part IV: Chi-Squares
When the IV and DV are both Categorical…
• That is, we should look at the frequency of observations that fall into each possible
category:
• In this example, we would consider which participants started their shift in the day or
night, and which participants went on adventures versus didn’t.
The Chi-Square Statistic and Frequencies
• The chi-square statistic uses this idea of frequencies:
• We ask whether the frequencies of participants we observe within each level of the
outcome match what we would expect those frequencies to be due to chance.
Adventures Cross-Tab (Observed Frequencies)

         Adventure   None   Total
Night        28       48      76
Day          10      114     124
Total        38      162     200
Uhh…Expected Frequencies?
• We can apply the logic of deviation to these data by calculating a model statistic that
assumes participants fall into each of the four categories in our example randomly:
Eij = (Row Totali × Column Totalj) / N

• Eij = the expected count for the cell in row i and column j.
• N = the total sample size.
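Applying this formula to the cross-tab above gives the count expected in each cell by chance; the standard chi-square statistic then sums the squared deviations of observed counts from expected counts, each scaled by the expected count. A sketch in Python:

```python
# Observed counts from the Adventures cross-tab
observed = {("Night", "Adventure"): 28, ("Night", "None"): 48,
            ("Day", "Adventure"): 10, ("Day", "None"): 114}
row_totals = {"Night": 76, "Day": 124}
col_totals = {"Adventure": 38, "None": 162}
n = 200

# E_ij = (row total_i * column total_j) / N for each cell
expected = {(r, c): row_totals[r] * col_totals[c] / n
            for r in row_totals for c in col_totals}
# e.g., expected[("Night", "Adventure")] = 76 * 38 / 200 = 14.44

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum((observed[cell] - expected[cell]) ** 2 / expected[cell]
                 for cell in observed)
```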
Summary: Chi-Squares
• The chi-square statistic tests whether the frequencies of participants in each possible
group match what we would expect those frequencies to be due to chance:
• This analysis is used when both the independent and dependent variable are categorical.
• But in some cases, your study data may reverse this pattern:
• That is, the predictor is a continuous score…but the outcome is a categorical indicator
of group membership.
Part V: Logistic Regressions
When the IV is Continuous and the DV is Categorical…
Making a Regression “Logistic”
• In logistic regressions, we don’t predict the value on “Y” for a given value of “X”:
• Instead we predict the probability of “Y” occurring for a given value of “X”.
• Remember that, because the outcome is categorical, the most relevant numeric score
associated with it is its probability.
P(Y) = 1 / (1 + e−(B0 + B1(X)))

• P(Y) = the probability of “Y” occurring.
• “e” = the base of natural logarithms (don’t worry if you don’t understand this part).
• The linear regression equation, B0 + B1(X), acts as an exponent on “e”.
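A minimal sketch of this equation in Python (the coefficient values below are illustrative, not from the lecture):

```python
import math

def p_y(x, b0, b1):
    """Logistic function: probability of Y occurring for a given value of X."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

# With b0 = 0 and b1 = 1 (illustrative values), X = 0 gives a 50/50 probability:
p = p_y(0, b0=0, b1=1)   # 0.5
# Larger X pushes the probability toward 1; smaller X pushes it toward 0,
# but P(Y) never leaves the (0, 1) range -- which is why this curve suits
# categorical outcomes better than a straight line does.
```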
Why Can’t I Use Regular Regression?
• In normal, linear approaches to regressions, we assume that the relationship
between variables is…well, linear:
• This assumption must be met for the estimates from our regression to be accurate.
• Categorical outcomes do not allow for these kinds of linear relations, and can’t be
estimated using analyses that require linearity:
• This is one example of “parametric” (data conform to assumptions like linearity and
normality) versus “non-parametric” analyses.
Summary: Logistic Regressions
• In logistic regressions, we predict the probability of our outcome occurring for each possible
value of our predictor:
• This analysis is used whenever our outcome is categorical, and our predictor is continuous.