An Introduction to Statistical Analysis
How to approach data analysis
1. What is your hypothesis?
1a. Are you interested in association or prediction?
2. What type of data do you have?
3. Does your data violate the assumptions of the
parametric test you have chosen?
3a. If no, proceed with the parametric test
3b. If yes, find an appropriate nonparametric alternative
4. Interpret the results: reject or fail to reject the null
hypothesis
Types of data and variables
• Data is composed of variables, which may take
several forms
• Data types have a huge impact on how the data can
be analyzed
Types of data and variables
• Categorical variables – describe membership into a
category or group
• Alive or dead
• Small, medium, large
• Red, green, blue
Types of data and variables
• Categorical variables can be either:
• Ordinal – have an order
• e.g. life stage, pain scale (like when you're at the ER)
• Nominal – have no order
• e.g. sex chromosome genotype (XX or XY)
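The ordinal/nominal distinction matters when categories are turned into numbers for analysis. The following is a minimal Python sketch, not part of the original slides: the life-stage names and helper functions are hypothetical and chosen only for illustration. Ordinal categories are mapped to ranks that preserve their order, while nominal categories get indicator columns that imply no order.

```python
# Hypothetical category order, for illustration only.
LIFE_STAGE_ORDER = ["egg", "larva", "pupa", "adult"]  # ordinal: order matters


def ordinal_codes(values, order):
    """Map each ordinal category to its rank (0 = lowest) in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]


def one_hot(values):
    """Encode nominal categories as 0/1 indicator columns (no order implied)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Ordinal example: life stage
stages = ["larva", "adult", "egg"]
stage_codes = ordinal_codes(stages, LIFE_STAGE_ORDER)

# Nominal example: sex chromosome genotype
genotypes = ["XX", "XY", "XX"]
genotype_levels, genotype_matrix = one_hot(genotypes)
```

Sorting the nominal levels alphabetically is only a convenience for a stable column order; it carries no meaning.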
Types of data and variables
• Numeric (numerical) variables – quantitative; they are
numbers and have magnitude
• Counts
• Mass (g)
• Lengths (mm)
• Percentages
Types of data and variables
• Continuous numerical data can take the form of
any real number within some range
• 1, 1.358, 6.2, 4.5
Types of data and variables
• Discrete numerical data are whole numbers counted in
indivisible units (you can't have a fractional part)
• Number of amino acids in a protein
• Number of bones in the hand
Associations vs variable
prediction
• Some statistical models test associations between
variables
• Does A occur with B?
• Contingency analysis
• Correlation
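As a rough illustration of the two association measures named above, here is a Python sketch using only the standard library. The data are made up, and the function names are illustrative rather than taken from the slides. It computes the Pearson correlation coefficient for two numeric variables and the chi-square statistic for a contingency table of counts; it does not compute P-values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up numeric data: correlation between length and mass
length_mm = [10.0, 12.0, 14.0, 16.0, 18.0]
mass_g = [2.1, 2.9, 4.2, 4.8, 6.0]
r = pearson_r(length_mm, mass_g)

# Made-up 2x2 table of counts: does A occur with B?
counts = [[30, 10],
          [10, 30]]
chi2 = chi_square_statistic(counts)
```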
Associations vs variable
prediction
• Other models test for predictive effects
• Does X predict Y?
• e.g. ANOVA, linear regression, generalized linear models (GLM)
• Often modeled as a straight line
• y = mx + b
• In statistics this is often expressed as:
• Y = β0 + β1X1 + e
• Y is the dependent variable
• β0 is the intercept
• β1 is the slope for the first independent variable
• X1 is the first independent variable
• e is the error
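To make the terms of Y = β0 + β1X1 + e concrete, the sketch below estimates the intercept and slope by ordinary least squares for a single independent variable. It is a minimal Python illustration with made-up data; the names fit_line, b0, and b1 are assumptions of this example, with b0 and b1 standing for the estimates of β0 and β1. The residuals are the estimated errors e.

```python
def fit_line(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1


# Made-up data for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_line(x, y)

# Residuals: observed Y minus the fitted straight line
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

For these values the fit gives a slope of about 1.97 and an intercept of about 1.09, and the residuals sum to zero up to rounding, as they always do when an intercept is fitted.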
Parametric vs. Nonparametric
• Parametric statistics assume that sample data
come from a population that follows a specific
probability distribution
• Parametric tests are typically more powerful statistically
• The most common distributional assumption is the
normal distribution
• Most well-known statistical methods are
parametric
• ANOVA
• Linear regression
Parametric vs. Nonparametric
• Nonparametric models make no, or very few,
assumptions about the probability distributions of
the variables being assessed
• However, nonparametric models are generally less powerful
• Generally, we try to use a parametric model but
then move to a nonparametric approach if the
assumptions are violated
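One common example of moving from a parametric to a nonparametric method is replacing the Pearson correlation with the Spearman rank correlation, which is the Pearson correlation computed on ranks. The slides do not name this pair, so treat it as an illustrative choice. The Python sketch below uses made-up data that are monotonic but not linear; the function names are assumptions of this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up data: y increases with x, but not along a straight line
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

r = pearson_r(x, y)       # below 1 because the relationship is curved
rho = spearman_rho(x, y)  # 1 because the ranks agree perfectly
```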
Statistical assumptions
• Statistical assumptions are conditions the data must
satisfy; violating them may render a given statistical
test invalid
• All statistical models have assumptions
• Common assumptions in parametric models
include:
• Independence of observations from each other
• Independence of error from confounding effects
• Linear relationships between variables
• Normality of observations and errors
How do I know if my data
violates assumptions?
• Data exploration - happens before you fit the model.
We covered this in our last lab.
• Model diagnostics - verify that assumptions are met
after fitting the model
• For many linear model functions in R, calling plot() on the
fitted model produces diagnostic plots automatically
• Ideally, you should do this before you calculate P-values
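The idea behind a residual diagnostic can be sketched without any plotting. The Python example below is an illustration only and is not the R plot() workflow mentioned above: it fits a straight line to deliberately curved, made-up data and inspects the signs of the residuals in order of x. A systematic run of signs, rather than a random scatter, suggests the linearity assumption is violated. The function names are assumptions of this example.

```python
def fit_line(x, y):
    """Return least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fitted_and_residuals(x, y):
    """Fit a straight line, then return (fitted values, residuals)."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid


# Made-up data that are deliberately curved (y = x squared)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]

fitted, resid = fitted_and_residuals(x, y)

# Residual signs in order of x: positive at both ends and negative in the
# middle is the signature of a curve fitted with a straight line.
signs = "".join("+" if r > 0 else "-" for r in resid)
```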
Transforming data to satisfy
assumptions
• Data can be transformed in an attempt to satisfy
assumptions
• It does not always work!
• You have to reevaluate the transformed data to see if
the transformation made an improvement
Transforming data to satisfy
assumptions
• The panel on the left shows a curvilinear
relationship
• The panel on the right shows the log-transformed
data, which now show a linear relationship
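The effect described for the two panels can be reproduced numerically. The Python sketch below uses made-up exponential data, not the data behind the original figure, and an illustrative pearson_r helper. It compares how closely the raw and the log-transformed response follow a straight line in x, assuming the natural log.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up curvilinear data: y grows exponentially with x
x = [1, 2, 3, 4, 5, 6]
y = [2.0 * math.exp(0.8 * xi) for xi in x]

# Linear association before and after the log transformation
r_raw = pearson_r(x, y)
r_log = pearson_r(x, [math.log(v) for v in y])
```

Here r_log is 1 up to rounding because log(y) is exactly linear in x, while r_raw is noticeably lower. With real data the improvement is rarely this clean, so the transformed data still need to be re-checked.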
Transforming data to satisfy
assumptions
• Some common transformations
• Log transform
• Useful for reducing right skew so the data are closer to normal
• If your data contain zeros, you will have to add 1 to all the
values before transforming, because the log of zero is undefined.
• Arcsine square root transformation
• Useful for proportional data
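Both transformations are short to write out. The Python sketch below is a minimal illustration with made-up values; the function names are assumptions of this example, and the natural log is assumed because the slides do not specify a base.

```python
import math


def log_plus_one(values):
    """Log transform that tolerates zeros by adding 1 to every value first."""
    return [math.log(v + 1) for v in values]


def arcsine_sqrt(proportions):
    """Arcsine square root transform for proportions in [0, 1] (radians)."""
    for p in proportions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    return [math.asin(math.sqrt(p)) for p in proportions]


# Made-up counts that include a zero
counts = [0, 1, 9, 99]
logged = log_plus_one(counts)

# Made-up proportions
props = [0.0, 0.25, 0.5, 1.0]
transformed = arcsine_sqrt(props)
```

Percentages must be divided by 100 before being passed to arcsine_sqrt, which maps proportions onto the range 0 to π/2.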
Univariate vs. multivariate
• Univariate statistics involves the observation and
analysis of one dependent variable (y)
• Multivariate statistics involves the simultaneous
observation and analysis of more than one dependent
variable (Y = [y1, y2, y3, …, yi])
• We are unlikely to explore multivariate methods in much depth
• Perhaps MANOVA and a few ordination methods