
BUS 2880: Research and

Evidence-based Management
Quantitative Data Analysis, part III

Scott Cassidy, Ph.D.


March 16, 2023
The Agenda for Today…
• The Caveat of Data Scaling
• Correlation and (Linear) Regression
• t-Tests and Analyses of Variance (ANOVAs)
• Chi-Squares
• Logistic Regressions
[Diagram: today's running example. "Work Shift Time" (independent variable) → "Shenanigans!" (dependent variable)]

Part I: The Caveat of Data Scaling
Why Some Types of Data Require Different Analyses…
Sample Statistics versus Population Parameters
• In (most) research, we’re not particularly interested in our sample-level data:
• Instead, we’re interested in whether our sample results would generalize to a larger
population of possible participants.

• We can’t generally collect data on every participant in our population of interest.

• As a result, most research is “inferential” in nature…


The Idea of *Inferential* Statistics
• Inferential statistics go beyond just looking at the available sample data:
• These statistics make inferences (i.e., best guesses) about characteristics of a population.

• There are a few variations on this idea:


• Estimation: Make your best guess about the ‘true score’ of the population parameter.
• Hypothesis testing: Determine if it is reasonable to believe that the value of the
population parameter is “X” (a given value, often 0.00).
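
To make these two ideas concrete, here is a minimal Python sketch (using an entirely hypothetical sample of adventure counts) of estimating a population mean with a confidence interval, and then testing whether that mean could plausibly be 0.00:

import numpy as np
from scipy import stats

# Hypothetical sample: number of wacky adventures for 10 participants.
sample = np.array([2, 4, 3, 5, 1, 4, 3, 2, 5, 4])

# Estimation: the sample mean is our best guess at the population mean,
# and a 95% confidence interval expresses the uncertainty around that guess.
mean = sample.mean()
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1,
                                   loc=mean, scale=stats.sem(sample))
print(f"Estimate: {mean:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")

# Hypothesis testing: is it reasonable to believe the population mean is 0.00?
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")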
Interpreting Inferential Analysis Data
• There are a few things to focus on here…

• Check the correctness and cleanliness of the data.

• Determine the direction of the relationship between your variables.

• Determine the strength of the relationship between your variables.

• Assess the statistical significance of the relationship between your variables.


The Types of Quantitative Data We Deal With
• There are also different ways the variables we analyze can be scaled…

• Nominal: Numbers are just placeholders for which group you’re in:
• For example: (1) = Our section; (2) = Dr. Veselovsky’s section; (3) = Dr. Jelley’s section
• Ordinal: Numbers indicate rank ordering, but not actual distance:
• For example: (1) = D grade; (2) = C grade; (3) = B grade; (4) = A grade.
• Interval: Numbers indicate rank *and* there’s equal distance between each number:
• However, there’s no fixed or meaningful “starting point” on the scale (i.e., 0). It’s arbitrary.
• For example: Temperature. 30 degrees Celsius is 15 degrees hotter than 15 degrees
Celsius…but it is not “twice as hot”.
• Ratio: An interval scale that *also* has a meaningful starting point (i.e., 0):
• For example: Weight. Someone who weighs 240lbs *is* twice as heavy as someone who weighs 120lbs.
                           Continuous IV (Predictor)   Categorical IV (Predictor)
Continuous DV (Outcome)    (Linear) Regressions        Analyses of Variance (ANOVAs)
Categorical DV (Outcome)   Logistic Regressions        Chi-Squares
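
The table above is essentially a lookup: the scaling of the independent and dependent variables picks the analysis. A minimal Python sketch of that decision rule (the dictionary and function names are just illustrative, not standard terminology):

# A minimal sketch of the table as a lookup rule.
ANALYSIS_FOR = {
    ("continuous", "continuous"): "(Linear) Regression",
    ("categorical", "continuous"): "Analysis of Variance (ANOVA)",
    ("continuous", "categorical"): "Logistic Regression",
    ("categorical", "categorical"): "Chi-Square",
}

def pick_analysis(iv_scaling: str, dv_scaling: str) -> str:
    """Return the analysis implied by the (IV scaling, DV scaling) pair."""
    return ANALYSIS_FOR[(iv_scaling, dv_scaling)]

print(pick_analysis("categorical", "continuous"))  # Analysis of Variance (ANOVA)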
Summary: The Caveat of Data Scaling
• When we conduct inferential analyses, we’re trying to use sample results to make
conclusions about our larger, unstudied population:
• There are several different analytic strategies that we can use for this.

• Remember from earlier this semester that there are four general types of data scaling
(nominal, ordinal, interval and ratio), which we can group into two general categories:
• Categorical variables (nominal or ordinal): Treated as membership in a distinct group.
• Continuous variables (interval or ratio): Treated as a magnitude or amount of something.

• The type of scaling you use for your independent and dependent variables will
determine what kind of statistical analysis you can run.
Part II: Correlation and Regression
When the IV and DV are both Continuous…
[Diagram: "Work Shift Time" (IV: time the shift started, using the 24-hour 'military clock') → "Shenanigans!" (DV: number of wacky adventures embarked upon)]

[Diagram: the same model with its measured variables. "Start Time (24hr)" → "Number of Adventures"]


The (Pearson) Correlation
• The Pearson correlation estimates the degree of linear association between two
numeric (i.e., interval-level or ratio-level) variables:
• Can range from -1.00 (i.e., a perfect negative correlation) to 1.00 (i.e., a perfect
positive correlation).
• 0.00 is thought to represent no relation. At least, no linear relation…

• The absolute value of the correlation gives us the size of the effect:
• ±.1 = small effect / ±.3 = medium effect / ±.5 = large effect.

• This can only be used with interval-level or ratio-level variables…


The Coefficient of Determination
• We can also square a correlation to tell us the percentage of variance in our
dependent variable (i.e., outcome) that is predicted by our independent variable
(i.e., predictor).

• This can range from 0.00 (no variance explained) to 1.00 (all of the variance in the
outcome explained).

r = 0.50  →  r2 = 0.25
r = -0.50  →  r2 = 0.25

…25 percent of the variance in the outcome is explained.

[Diagram: overlapping circles for "Time of Day" and "Number of Adventures"; the overlap is the proportion of variance in number of adventures predicted by time of day (r2)]
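
As a quick illustration, here is a minimal Python sketch (with made-up shift-start and adventure-count data) that computes the Pearson correlation and squares it to get the coefficient of determination:

import numpy as np
from scipy import stats

# Hypothetical data: shift start time (24-hour clock) and adventure counts.
start_time = np.array([8, 10, 12, 14, 16, 18, 20, 22])
adventures = np.array([1, 1, 2, 2, 3, 4, 4, 6])

r, p_value = stats.pearsonr(start_time, adventures)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.2f} (p = {p_value:.4f})")
print(f"r^2 = {r_squared:.2f}, i.e., {r_squared:.0%} of the variance explained")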
Adding the Idea of “Regression”
• A regression equation goes one step further:
• It allows us to predict future scores on our outcome using known scores on our predictor.

Y = B0 + B1(X) + e

where B0 = the intercept, B1 = the slope, and e = random error.

• The ‘e’ in the equation acknowledges that these guesses aren’t perfect:
• Represents all aspects of ‘Y’ that are not predicted by ‘X’.
• ‘e’ is assumed to be random.
What Does That Line Mean Though?!
• This line is calculated to minimize the average amount of distance between data
points and their “predicted score” on the line:
• B0 (Intercept): The value of ‘Y’ where the line crosses the Y-axis (i.e., the predicted ‘Y’ when ‘X’ is zero).
• B1 (Slope): The increase in ‘Y’ for each one unit increase in ‘X’.

• This line provides the best linear (i.e., straight line) prediction about ‘Y’ using ‘X’.
[Scatterplot: Height (cm) on the X-axis vs. Weight (kg) on the Y-axis, with the fitted regression line]
Interpreting the Slope of the Regression
• Researchers are usually most interested in the slope of the regression equation:
• The slope tells us how much ‘Y’ should change by after a one unit increase in ‘X’:

NumAdventures = 2 + 1.3(TimeDay) + e

“For every one-hour increase in time of day, we can expect people to go on an additional 1.3 wacky adventures (on average).”
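
A minimal sketch of fitting and using such a line in Python, again with hypothetical data (the coefficients it recovers will not exactly match the 2 and 1.3 above):

import numpy as np
from scipy import stats

# Hypothetical data: shift start time (hours) and adventure counts.
time_day = np.array([8, 10, 12, 14, 16, 18, 20, 22])
adventures = np.array([1, 1, 2, 2, 3, 4, 4, 6])

# Fit the bivariate linear regression Y = B0 + B1(X) + e.
fit = stats.linregress(time_day, adventures)
print(f"Intercept (B0) = {fit.intercept:.2f}, slope (B1) = {fit.slope:.2f}")

# Use the fitted line to predict the outcome for a new predictor score.
new_time = 17
predicted = fit.intercept + fit.slope * new_time
print(f"Predicted adventures for a shift starting at {new_time}:00 = {predicted:.1f}")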
Some Different Types of Correlations/Regressions
• The Pearson correlation: Compare the covariance between two numeric variables:
• The multivariate correlation: Compare the covariance of more than two variables.
• The canonical correlation: Situations where you have multiple dependent variables in
your correlation analysis.
• The Spearman correlation: Compare the covariance between the rankings (not
scores) of ordinal variables.
• The bivariate (linear) regression: Predict scores on one numeric outcome variable
using known scores on one numeric predictor variable.
• The multivariate (linear) regression: Predict scores on one numeric outcome
variable using known scores on more than one numeric predictor variable.
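
For the multiple-predictor case in the last bullet, a minimal sketch using ordinary least squares in NumPy (the second predictor, shift length, is invented purely for illustration):

import numpy as np

# Hypothetical data: predict adventures from shift start time AND a second,
# invented predictor (shift length in hours).
start_time  = np.array([8, 10, 12, 14, 16, 18, 20, 22])
shift_hours = np.array([8,  8,  6,  6,  8, 10,  4,  8])
adventures  = np.array([1,  1,  2,  2,  3,  4,  4,  6])

# Design matrix: a column of 1s for the intercept, then both predictors.
X = np.column_stack([np.ones_like(start_time), start_time, shift_hours])

# Ordinary least squares solves for B0 (intercept), B1, and B2 at once.
coefs, *_ = np.linalg.lstsq(X, adventures, rcond=None)
b0, b1, b2 = coefs
print(f"B0 = {b0:.2f}, B1 (start time) = {b1:.2f}, B2 (shift length) = {b2:.2f}")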
Summary: Correlation and Regression
• A correlation estimates the degree of linear association between two numeric variables:
• Can range from -1.00 (negative correlation) to 1.00 (positive correlation).
• Can only be used with interval or ratio variables.

• We can also square a correlation to calculate the coefficient of determination (i.e., the
percentage of variance in our outcome that is predicted by our predictor):
• Can range from 0.00 (no prediction) to 1.00 (perfect prediction).

• A regression allows us to predict future scores on our outcome using known scores on
our predictor using algebra:
• B0 (Intercept): The value of ‘Y’ where the line crosses the Y-axis (i.e., the predicted ‘Y’ when ‘X’ is zero).
• B1 (Slope): The increase in ‘Y’ for each one unit increase in ‘X’.
Part III: Analyses of Variance
When the IV is Categorical and the DV is Continuous…
[Diagram: "Work Shift Time" (IV: whether the shift started in the day or the evening) → "Shenanigans!" (DV: number of wacky adventures embarked upon)]

[Diagram: the same model with its measured variables. "Day vs. Night" → "Number of Adventures"]


Moving on to the Idea of “Mean Difference”
• Correlation and linear regression can’t handle designs with categorical (i.e., ordinal or nominal) variables.

• To examine these relations, we need to move past the idea of looking at how scores
on each variable covary:
• Instead, we should look at the mean score on the outcome variable for members of
each group we’re examining.

• To do this, we will examine the data in terms of mean group differences:


• That is, how much do members from “X group 1” differ from members of “X group 2”
on “Y”?
Getting Things Done with a t-Test
• To accomplish this goal, we can (often) use a t-test:
• We calculate the observed difference in ‘Y’ between two groups.
• We then divide this by the difference we may expect due to random sampling.

t = (Mgroup1 – Mgroup2) / (SDpooled / √n)

where Mgroup1 – Mgroup2 = the mean difference between groups, SDpooled = the pooled standard deviation, and √n = the square root of the number of people per group.
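
A minimal sketch of an independent groups t-test in Python, on hypothetical day-shift versus night-shift adventure counts:

import numpy as np
from scipy import stats

# Hypothetical adventure counts for day-shift vs. night-shift workers.
day_shift   = np.array([1, 2, 1, 3, 2, 2, 1, 2])
night_shift = np.array([3, 4, 5, 3, 4, 6, 4, 5])

# Independent groups t-test: the mean difference scaled by sampling variability.
t_stat, p_value = stats.ttest_ind(day_shift, night_shift)
print(f"Day M = {day_shift.mean():.2f}, Night M = {night_shift.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")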
What Does That Even Do?!
• In a t-test, we essentially compare the difference in “Y” we found between the
groups to the average difference that we’d expect within any given group.

• It works like a “signal-to-noise” ratio:


• Between-groups variance: Signal (“good” variance).
• Within-groups variance: Noise (“bad” variance).
What If I Have More Groups?
• Just relax! We can easily accommodate more than two groups in our analysis by
extending the t-test logic a bit!

• Remember that t-tests essentially function as “signal-to-noise” ratios, where we look at between-group versus within-group variability in outcome scores:
• We could use the same idea to look at between-group versus within-group variability
across even more groups!
• This is the basis for the Analysis of Variance (i.e., ANOVA).
Introducing the ANOVA F-Test
• ANOVAs use the F-statistic instead of the t-statistic:
• It looks at how much of the differences we see in outcome scores may be explained by
between-group vs. within-group variance:

F = Between-group variance / Within-group variance

• The F-value will equal 1.00 if the between-groups variability equals the average
variability within groups:
• The F-value gets larger as more variance is accounted for between groups relative to
within groups.
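
A minimal one-way ANOVA sketch in Python, with hypothetical adventure counts for three shift groups:

import numpy as np
from scipy import stats

# Hypothetical adventure counts for three shift groups.
morning = np.array([1, 2, 1, 2, 3])
evening = np.array([3, 4, 3, 5, 4])
night   = np.array([5, 6, 4, 6, 5])

# One-way ANOVA: between-group variance relative to within-group variance.
f_stat, p_value = stats.f_oneway(morning, evening, night)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")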
Some Different Types of t-Tests and ANOVAs
• The single sample t-test: Compare one group’s average score to a pre-determined
standard score.
• The independent groups t-test: Compare one group’s average score to a second
group’s average score.
• The repeated measures t-test: Compare one group’s average score to the same
group’s average score at a later point in time.
• The one-way ANOVA: Compare the average scores of two or more groups (one
independent variable only).
• The factorial ANOVA: Compare the average scores of two or more groups (two or
more independent variables):
• Notated like: 2x3 factorial ANOVA for a situation where one independent variable has two groups, and the other independent variable has three groups (see the sketch after this list).
• The multivariate ANOVA (MANOVA): ANOVAs with multiple dependent variables.
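
Here is a minimal sketch of the factorial ANOVA mentioned above, using the statsmodels library on a hypothetical 2x2 design (shift timing by work location; both factors and all the data are invented):

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical 2x2 factorial design: shift timing by work location.
df = pd.DataFrame({
    "shift":      ["day"] * 4 + ["night"] * 4,
    "location":   ["office", "office", "remote", "remote"] * 2,
    "adventures": [1, 2, 2, 3, 4, 5, 3, 4],
})

# Fit a model with both main effects and their interaction, then read the
# F-test for each effect off the ANOVA table.
model = smf.ols("adventures ~ C(shift) * C(location)", data=df).fit()
print(anova_lm(model, typ=2))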
Summary: Analyses of Variance
• t-tests and ANOVAs examine variable relationships by looking at average group
differences:
• How much do members from “X group 1” differ from members of “X group 2” on “Y”?
• We need our independent variable to be categorical, and our outcome to be continuous.

• In this calculation, we compare the difference in “Y” we found between the groups to
the average difference that we’d expect within any given group.

• This works like a “signal-to-noise” ratio:


• Between-groups variance: Signal (“good” variance).
• Within-groups variance: Noise (“bad” variance).
Part IV: Chi-Squares
When the IV and DV are both Categorical…
[Diagram: "Work Shift Time" (IV: whether the shift started in the day or the evening) → "Shenanigans!" (DV: whether the participant had a wacky adventure)]

[Diagram: the same model with its measured variables. "Day vs. Night" → "Had Adventure vs. Did Not Have"]


Getting into Frequency Data
• When our outcomes are categorical, we should look at the number of observations
within each level of the (categorical) outcome variable.

• That is, we should look at the frequency of observations that fall into each category
that is possible:
• In this example, we would consider whether participants started their shift in the day or at night, and whether they went on adventures or not.
The Chi-Square Statistic and Frequencies
• The chi-square statistic uses this idea of frequencies:
• We ask whether the frequencies of participants we observe within each level of the
outcome matches what we would expect those frequencies to be due to chance.

• We then compare the observed distribution of participants to a null distribution calculated using the expected weighted average.
Do participants who work at night also go on wacky adventures more than expected?

Adventures Cross-Tab (Observed)
           Adventure   None   Total
Night         28        48      76
Day           10       114     124
Total         38       162     200
Uhh…Expected Frequencies?
• We can apply the logic of deviation to these data by calculating a model statistic that
assumes participants fall into each of the four categories in our example randomly:

Eij = (Row_Totali × Column_Totalj) / N

where Row_Totali = the total count in that row, Column_Totalj = the total count in that column, N = the total sample size, and Eij = the expected count for that cell.

Adventures Cross-Tab
           Adventure   None   Total
Night         28        48      76
Day           10       114     124
Total         38       162     200

Let’s Determine Our Expected Counts...

Night_Adventure: 76 × 38 / 200 = 14.44
Day_Adventure: 124 × 38 / 200 = 23.56
Night_NoAdventure: 76 × 162 / 200 = 61.56
Day_NoAdventure: 124 × 162 / 200 = 100.44
Expected Frequencies
           Adventure   None    Total
Night        14.44     61.56     76
Day          23.56    100.44    124
Total        38       162       200

Observed Frequencies
           Adventure   None   Total
Night         28        48      76
Day           10       114     124
Total         38       162     200
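
A minimal Python sketch that reproduces these expected counts from the observed cross-tab and then runs the chi-square test (scipy's continuity correction is turned off so the statistic matches the plain formula above):

import numpy as np
from scipy import stats

# Observed frequencies from the cross-tab above
# (rows: Night, Day; columns: Adventure, None).
observed = np.array([[28, 48],
                     [10, 114]])

# Expected counts under independence: Eij = Row_Total_i * Column_Total_j / N.
row_totals = observed.sum(axis=1)   # [76, 124]
col_totals = observed.sum(axis=0)   # [38, 162]
expected = np.outer(row_totals, col_totals) / observed.sum()
print(expected)  # [[14.44, 61.56], [23.56, 100.44]]

# scipy computes the same expected table plus the chi-square test itself.
chi2, p_value, dof, scipy_expected = stats.chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")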
Summary: Chi-Squares
• The chi-square statistic looks at whether the frequencies of participants in each possible
group matches what we would expect those frequencies to be due to chance:
• This analysis is used when both the independent and dependent variable are categorical.

• To do this, there are a few steps:


• Create an “expected” distribution of frequencies based on the weighted average of
participants across each group.
• Compare this expected distribution to your actual observed distribution from your sample.
• Determine whether or not there is a significant difference between the observed and
expected distributions.
Part V: Logistic Regressions
When the IV is Continuous and the DV is Categorical…
[Diagram: "Work Shift Time" (IV: time the shift started, using the 24-hour 'military clock') → "Shenanigans!" (DV: whether the participant had a wacky adventure)]

[Diagram: the same model with its measured variables. "Start Time (24hr)" → "Had Adventure vs. Did Not Have"]


What Does This Analysis Do?
• In t-tests and ANOVAs, we have a categorical predictor (i.e., representing group
membership) and a continuous outcome (i.e., representing scores):
• We thus examine the mean difference in scores between groups.

• But in some cases, you may have study data that estimates the reverse:
• That is, the predictor is a continuous score…but the outcome is a categorical indicator
of group membership.
Making a Regression “Logistic”
• In logistic regressions, we don’t predict the value on “Y” for a given value of “X”:
• Instead we predict the probability of “Y” occurring for a given value of “X”.
• Remember that, because the outcome is categorical, the most relevant numeric score
associated with it is its probability.

P(Y) = 1 / (1 + e^–(b0 + b1(X)))

where P(Y) = the probability of “Y” occurring, “e” = the base of natural logarithms (don’t worry if you don’t understand this part), and the linear regression equation (b0 + b1(X)) acts as an exponent on “e”.
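
A minimal sketch of this equation in Python, using made-up coefficients (b0 = -6.0, b1 = 0.4) purely to show how the predicted probability changes with shift start time:

import numpy as np

# Made-up coefficients for the logistic equation above (purely illustrative).
b0, b1 = -6.0, 0.4

def p_adventure(start_time):
    """P(Y) = 1 / (1 + e^-(b0 + b1*X)): probability of an adventure."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * start_time)))

for hour in [8, 12, 16, 20]:
    print(f"Shift starting at {hour}:00 -> P(adventure) = {p_adventure(hour):.2f}")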
Why Can’t I Use Regular Regression?
• In normal, linear approaches to regressions, we assume that the relationship
between variables is…well, linear:
• This assumption must be met for the estimates from our regression to be accurate.

• Categorical outcomes do not allow for these kinds of linear relations, and can’t be
estimated using analyses that require linearity:
• This is one example of “parametric” (data conform to assumptions like linearity and
normality) versus “non-parametric” analyses.
Summary: Logistic Regressions
• In logistic regressions, we predict the probability of our outcome occurring for each possible
value of our predictor:
• This analysis is used whenever our outcome is categorical, and our predictor is continuous.

• This is different from the (linear) regression we covered earlier:


• Categorical outcomes do not allow for these kinds of linear relations, and can’t be estimated
using analyses that require linearity.
