Sinharay S. Definition of Statistical Inference

STATISTICS
Contents
An Overview of Statistics in Education
Analysis and Interpretation of Multivariate Data
Analysis of Covariance
Analysis of Extreme Values in Education
Analysis of Variance
Bayesian Statistical Analysis
Bootstrap Method
Canonical Correlation
Categorical Data Analysis
Causal Inference
Cluster Analysis: Overview
Cognitive Psychology and Educational Statistics
Computational Statistics
Continuous Probability Distributions
Correspondence Analysis
Data Mining
Decision Theory
Design of Experiments
Discrete Probability Distributions
Discrimination and Classification
Empirical Bayes Methods
Evaluation Research
Exploratory Data Analysis
Factor Analysis: An Overview and Some Contemporary Advances
Generalized Linear Mixed Models
Generalized Linear Models
Generating Random Numbers
Goodness-of-Fit Testing
Graphical Models
Growth Modeling
Hierarchical Linear Models
Hypothesis Testing and Confidence Intervals
Instrumental Variables
Jackknife Methods
Large-sample Statistical Methods
Latent Class Models
Markov Chain Monte Carlo
Matrix Algebra
Measure of Association
Measures of Central Tendency
Measures of Dispersion, Skewness and Kurtosis
Meta Analysis
Missing Data
Model Selection
Monte Carlo Methods
Multidimensional Scaling
Multiple Comparisons
Multivariate Analysis of Variance
Multivariate Linear Regression
Multivariate Normal Distribution

Nonlinear Regression Analysis
Nonparametric Statistical Methods
Observational Studies
Order Statistics
Point Estimation Methods with Applications to Item Response Theory Models
Principal Components Analysis
Probability Theory
Recursive Partitioning
Sampling
Sequential Probability Ratio Test
Sequential Testing
Signal Detection Theory
Small Area Estimation
Statistical Analysis of Functional Data
Statistical Inequalities
Statistical Paradoxes
Statistical Power Analysis
Statistical Significance Versus Effect Size
Stochastic Processes
Structural Equation Models
Survival Data Analysis
The Normal Distribution and its Applications
Time Series Analysis
Univariate Linear Regression
Value-Added Models
An Overview of Statistics in Education

S Sinharay, ETS, Princeton, NJ, USA
© 2010 Elsevier Ltd. All rights reserved.
Introduction
This article intends to provide an overview of the application of
statistics to the field of education. Statistics is a vast field, with new
topics such as proteomics, ensemble sampling, and statistics in
ophthalmology being introduced every now and then. The Statistics section
of the encyclopedia did not attempt to provide a summary of all possible
topics under the subject. Instead, this section focuses on the topics in
statistics that have found applications in the field of education. Most
applications of statistics to education are found in the area of
educational measurement for the simple reason that statistics, the science
that deals with quantitative analysis of data, is inherently related to
measurement. (Several such applications, such as automated scoring,
differential item functioning, generalizability theory, and item response
theory (IRT), are covered in the educational measurement section of the
encyclopedia and will not be repeated here.) Hence, one could argue that
some of the topics included in the statistics section, or that several
examples included in the articles in this section, deal mostly with
educational measurement and should have belonged in the Educational
measurement section of this encyclopedia. In addition, some of the topics
may overlap to a small extent with other topics in this section (e.g.,
generalized linear models are tools in categorical data analysis, but this
section has two separate articles on the two topics). In any case, the
number of applications of statistics to education is on the rise. This is
because of (1) increases in computing power that have led people to ask
questions that could not have been answered in a timely manner 20 years
ago, and (2) increases in the number of educational tests, partially due
to the No Child Left Behind (NCLB) Act of 2001 in the USA, which requires
annual testing in the schools and produces a lot of data. Practitioners in
education should find this section, together with the Educational
measurement section, helpful, as these two provide a comprehensive
overview of the statistical methods used in education.
Next, several topics in statistics that are relevant to the field of
education are described in brief along with, for some topics, examples of
actual applications to the field of education. All of these topics and
many more topics are covered in one or more articles in the statistics
section.
Exploratory Data Analysis
After data are collected, often the first step in analyzing the data is
exploratory data analysis (EDA), which consists of looking at data to see
what they seem to say (Tukey, 1977) while relying on simple arithmetic and
easy-to-draw pictures or plots. The techniques used in EDA include the
following:
Plotting the data in bar charts, pie charts, histograms, Youden plots, etc.,
Plotting simple statistics in plots such as mean plots, standard deviation (SD) plots, box plots, etc., and
Positioning such plots so as to extract the maximum information possible from them.
Consider Figure 1, which shows responses of 325 examinees to 15 items
regarding mixed-number subtraction (Tatsuoka, 1984). An example item is
$4\frac{5}{7} - 1\frac{4}{7}$. The items, sorted according to decreasing
proportion correct (i.e., increasing difficulty), are shown along the
x-axis in the figure; the examinees, sorted according to increasing raw
scores, are shown along the y-axis. A short black horizontal line for an
examinee and an item indicates a correct response. Some patterns are
immediately visible from the figure. For example, several examinees (24
out of 325) at the top answer all items correctly (clear from the top of
the plot being completely black). The examinees with the lowest scores
could answer only two items correctly. Interestingly, these items
($\frac{3}{4} - \frac{3}{4}$ and $3\frac{7}{8} - 2$) are not the two
easiest items and can be solved without any knowledge of mixed-number
subtraction. Further, the lower half of the examinees rarely answered any
difficult items correctly. Wainer (2000, 2005) provided several
applications of EDA to education.
Figure 1 A plot of the responses of 325 examinees to 15 mixed-number
subtraction items (x-axis: Items; y-axis: Examinees).
Simple Summary Measures
It is often important to examine simple summary measures such as the mean
and standard deviation (SD) of numerical information. For example, the
Digest of Education Statistics, an annual publication of the National
Center for Education Statistics (NCES) in the United States of America,
includes the number of schools and colleges, teachers, enrolments, and
graduates, in addition to other information on education. Further, the
2007 Digest reports that the average salary for teachers in 2005–06 was
US$49,109, about 1% higher than in 1995–96, after adjustment for
inflation, and that the average reading scores at ages 9 and 13 were
higher in 2004 than in 1971 and the average score for 17-year-olds in 2004
was similar to that in 1971. Measures of central tendency and measures of
dispersion, skewness, and kurtosis are discussed in the statistics section
of the encyclopedia.
Measures of Association
In education, it is often of interest to examine the amount of association
between a group of variables. For example, test administrators
administering several tests simultaneously to students would like scores
on different tests (e.g., in reading, writing, mathematics, and science)
not to correlate highly with each other. High correlations between such
scores may raise questions about the need for so many tests. Choice of the
appropriate measures of association depends on whether the variables of
interest are continuous, discrete ordinal, or discrete nominal. For
example, if the scores on the abovementioned tests are given on a scale of
1 to 100 in 1-point increments, a correlation coefficient may be the
appropriate measure of association. On the other hand, if scores are given
as 0 or 1 (where a score of 1 means that a student is good on a subject
and 0 otherwise), an odds ratio or Kendall's tau (see, e.g., Agresti,
2002) may be the appropriate measure. It is often of interest to examine
the association between two groups of variables,
$X = (X_1, X_2, \ldots, X_k)$ and $Y = (Y_1, Y_2, \ldots, Y_l)$. Canonical
correlation analysis attempts to find linear combinations of the two
groups with high correlations.
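The two cases described above (scores on a 1 to 100 scale versus 0/1 scores) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the score lists are hypothetical and are not taken from any study cited in this article.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


def odds_ratio(x, y):
    """Odds ratio from the 2x2 table of two 0/1 variables."""
    pairs = list(zip(x, y))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    if n10 * n01 == 0:
        raise ValueError("odds ratio is undefined when an off-diagonal cell is 0")
    return (n11 * n00) / (n10 * n01)


# Hypothetical scores on a 1-100 scale for six students.
reading = [62, 75, 81, 58, 90, 70]
mathematics = [60, 70, 85, 55, 88, 74]
print("correlation:", pearson_r(reading, mathematics))

# Hypothetical 0/1 ratings ("good" = 1) of eight students on two subjects.
good_reading = [1, 1, 1, 0, 0, 0, 1, 0]
good_writing = [1, 1, 0, 0, 0, 1, 1, 0]
print("odds ratio:", odds_ratio(good_reading, good_writing))
```

A correlation near 1 (or an odds ratio far above 1) would suggest that the two tests carry largely overlapping information, which is the situation the text says may raise questions about the need for both tests.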
Probability Theory
Providing a mathematical description of our beliefs about the systematic
properties of a random phenomenon is the first step in several statistical
analyses. Usually this is accomplished with the help of probability
theory, the branch of mathematics that describes the pattern of chance
outcomes. A key idea in probability theory is that of a random variable,
which is a variable whose value is a numerical outcome of a random
phenomenon, and its distribution. There are several types of random
variables, and the articles in the statistics section on discrete and
continuous probability distributions provide detailed descriptions of
them. A knowledge of random variables that follow a normal distribution is
essential in several applications of statistics to education (e.g., the
distribution plays a key role in topics such as factor analysis, analysis
of variance, linear regression, and multilevel models); hence two articles
in the statistics section are devoted to these variables. Stochastic
processes are sequences of random variables and are often of interest in
probability theory (e.g., the path traced by a molecule as it travels in a
liquid or a gas can be modeled using a stochastic process). Order
statistics is another area of some importance in probability theory. For
example, the ordered scores on a test of a sample of students represent
the order statistics for the test score variable for the sample.
Statistical Inference
Statistical inference consists in the use of statistics to draw
conclusions about some unknown aspect of a population based on a random
sample from that population. Some preliminary conclusions may be drawn by
the use of EDA or by the computation of summary statistics as well, but
formal statistical inference uses calculations based on probability theory
to substantiate those conclusions. Statistical inference can be divided
into two areas: estimation and hypothesis testing. In estimation, the goal
is to describe an unknown aspect of a population, for example, the average
scholastic aptitude test (SAT) writing score of all examinees in the State
of California in the USA. Estimation can be of two types, point estimation
and interval estimation, depending on the goal of the application. The
goal of hypothesis testing is to decide which of two complementary
statements about a population is true. Two such complementary statements
may be: (1) the students of California score higher on an average on SAT
writing than the students of Texas, and (2) the students of California
score lower on an average on SAT writing than the students of Texas. Point
estimation is discussed in the statistics section of the encyclopedia.
Details on interval estimation, hypothesis testing, and power analysis,
which plays a key role in hypothesis testing, are also discussed in the
statistics section of the encyclopedia. Often, an investigator has to
perform several hypothesis tests simultaneously. For example, one may want
to compare the SAT critical reading scores of several pairs of schools
belonging to a geographical region. The article on multiple comparisons in
the statistics section of the encyclopedia discusses how to handle such a
situation in an appropriate manner.
Principal-Components Analysis
Principal-component analysis (PCA) is often used to reduce
multidimensional data sets to a lower number of dimensions for analysis.
PCA retains those characteristics of the data set that contribute most to
its variance, by keeping lower-order principal components (the ones that
explain a large part of the variance present in the data) and ignoring
higher-order ones (that do not explain much of the variance present in the
data). Such low-order components often contain the most important aspects
of the data. For example, in the National Assessment of Educational
Progress (NAEP), a large-scale educational survey conducted in the USA
(see, e.g., von Davier et al., 2007), values of several hundred background
variables are collected for each examinee. As it is important to use the
information contained in these background variables in the statistical
model used in NAEP, several principal components are computed from the
original background variables and are used, instead of the original
variable values, in the final NAEP statistical model (see, e.g., Jenkins
et al., 2001: 377–378).
Factor Analysis
The beginning of factor analysis lies in the early attempts of Karl
Pearson, Charles Spearman, and others to define and measure intelligence;
hence factor analysis has been a popular tool in education. When several
tests are administered to a group of examinees, one aspect of validation
may involve determining whether there are a few underlying abilities or
skill variables that govern the examinees' performances on the tests.
Factor analysis, which is often used in such situations, attempts to
describe the covariance among several variables in terms of a few
underlying, but unobservable, random variables called factors. For
example, correlations from the group of test scores in classics, French,
English, mathematics, and music collected by Spearman suggested an
underlying intelligence factor (Johnson and Wichern, 1998). Factor
analysis can be considered to be an extension of PCA: both of these
techniques approximate the covariance matrix among the variables. A factor
analysis is exploratory if the investigator does not have a hypothesis
about the number of factors measured by the tests, and confirmatory if the
investigator has such hypotheses and conducts statistical tests of them.
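The idea of keeping the principal components that explain most of the variance, and of approximating the covariance matrix with them, can be sketched as follows. This is a minimal illustration under stated assumptions: the test scores are hypothetical, the function names are invented for this example, and power iteration is used only as one simple way to obtain the first principal component (it assumes the starting vector is not orthogonal to that component, which holds for positively correlated scores). It is not a description of the procedure used in NAEP.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (a list of lists)."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_principal_component(cov, iterations=200):
    """Largest eigenvalue and unit eigenvector of `cov` by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # assumed start: equal weight on every variable
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(p))
    return eigenvalue, v


# Hypothetical scores of six examinees on three tests.
scores = [
    [52.0, 48.0, 55.0],
    [61.0, 59.0, 64.0],
    [70.0, 73.0, 68.0],
    [45.0, 50.0, 47.0],
    [83.0, 80.0, 86.0],
    [66.0, 62.0, 69.0],
]
cov = covariance_matrix(scores)
eigenvalue, loadings = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
print("loadings of the first component:", loadings)
print("share of variance explained:", eigenvalue / total_variance)
```

When the share of variance explained by the first component is close to 1, a single composite score summarizes the tests almost as well as the original variables, which is the rationale for replacing many background variables with a few components.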
Structural-Equation Modeling
Structural-equation modeling is an extension of factor analysis and is a
methodology designed primarily to test substantive theory from empirical
data. For example, a theory may suggest that certain mental traits do not
affect other traits and that certain variables do not load on certain
factors, and structural-equation modeling can be used to test the theory.
(A mental trait is a habitual pattern of behavior, thought, and emotion.)
A structural-equation model (SEM) is a system of linear equations among
several unobservable variables (constructs) and observed variables. An SEM
is composed of two parts: a structural part, linking the constructs to
each other (usually, this part expresses the endogenous or dependent
constructs as linear functions of the exogenous or independent
constructs), and a measurement part, linking the constructs to observed
measurements. The second part resembles a confirmatory factor analysis
model. The SEMs can be displayed in visual form; these displays are called
path diagrams. The full model is then estimated from a data set and
inferences drawn.
Classification and Discriminant Analysis
The goals of classification and discriminant analysis are to describe,
either graphically or algebraically, the different features of
observations from several known groups, and to sort new objects (whose
group membership is unknown) into the groups. For example, consider a
medical school applications office that has the test scores and other
college records of several students who became MDs and several others who
did not. When a new application comes, the office may want to classify the
applicant as likely or unlikely to become an MD based on the test scores
and college records.
Cluster Analysis
Cluster analysis is a technique to group similar observations into a
number of clusters based on the observed values of several variables for
each individual. Cluster analysis is similar in concept to discriminant
analysis. The group membership of a sample of observations is known
upfront in the latter while it is not known for any observation in the
former. As an application of cluster analysis to education, Everitt (1990)
describes a data set that has achievement test scores on reading and
arithmetic for children in the fourth and sixth grades of 25 schools; the
interest is in identifying different levels of performance and assessing
similarities and differences in the patterns of change from fourth to
sixth grade, and cluster analysis is the most appropriate technique for
the example.
Multidimensional Scaling
Multidimensional scaling is related to cluster analysis and assigns a
location to each sample observation in a low-dimensional space so that
the distances between observations are close to their actual distances in
multiple dimensions. This idea can be illustrated by reference to map
construction. Suppose that one is given a table with the distances between
several American cities (and not given the exact locations of them on a
map). One could attempt to place the cities on a map so that the distances
between the cities on the map are as close as possible to the distances in
the table. Once this step is done, the map is used to draw conclusions on
the sample; often, a cluster analysis is performed on these
low-dimensional representations of the observations.
Categorical Data Analysis
A categorical variable has a measurement scale that consists of a set of
categories. For example, the response of a student to a test question is
often measured as correct or incorrect.
Categorical variables have two types of scales: nominal (when the
categories do not have a natural ordering; e.g., gender and religion) and
ordinal (when the categories have a natural ordering; e.g., patient
condition being measured as good, fair, and serious). Several statistical
methods and concepts have been devised for categorical data. For example,
distributions such as the binomial and Poisson distributions, models such
as generalized linear models, generalized linear mixed models, logistic
regression models, and log-linear models, and methods such as analysis of
contingency tables were devised to handle categorical data.
Analysis of Variance, Analysis of Covariance,
and Multivariate Analysis of Variance
Analysis of variance (ANOVA) is the statistical procedure of comparing the
means of a variable across several groups of individuals. For example,
ANOVA may be used to compare the average SAT critical reading scores of
several schools. The name of the technique arises from the fact that the
first step in an ANOVA is to partition the variance present in the
observations into several components. The ANOVA method was the second most
frequently used data-analysis procedure in a survey of articles published
between 1971 and 1998 in three reputed educational-research journals (Hsu,
2005). Generalizability theory (Cronbach et al., 1963), which is a
competitor to the classical theory of reliability of tests, usually
applies ANOVA procedures to test scores.
Analysis of covariance (ANCOVA) is used when, as in ANOVA, the interest is
in comparing several means, but the investigator also has the values of an
additional variable that influences the variable of interest. For example,
ANCOVA may be used to compare the average SAT critical reading scores of
several schools where the preliminary scholastic aptitude test/national
merit scholarship qualifying test (PSAT/NMSQT) critical reading score of
each examinee is available in addition to the SAT critical reading score.
(The PSAT/NMSQT is supposed to provide firsthand practice for the SAT.)
Multivariate analysis of variance (MANOVA) is used to compare means of
several variables simultaneously across several groups of individuals. For
example, one could apply MANOVA to simultaneously compare the average
scores on several subjects across several schools. Longford (1990)
provides such an example.
Design of Experiments
An experiment is a test in which purposeful changes are made to the input
variables of a process or system so that one may observe and identify the
reasons for changes that may be observed in the output response. Design of
experiments is the science of planning and conducting experiments and
analyzing the resulting data so that valid and objective conclusions can
be drawn. For example, the education ministry of a country may be
interested in conducting an experiment to find out if a particular style
of teaching mathematics helps children of fourth grade to learn the
subject better than the existing style. In designing an experiment, the
ministry has to make sure that any difference that they might observe in
the outcome for students who were taught using the new style and those who
were taught using the existing style cannot be attributed to a factor
other than the teaching style (e.g., if they assign all students from
rural areas to the new style and all students from urban areas to the
existing style, then a difference can be attributed to the rural vs. urban
difference). There are three basic principles in design of experiments:
randomization (which means that the assignment of the experimental
material and the order in which the individuals receive the experimental
material are randomly determined), replication (which refers to repeats of
each experimental condition), and blocking (which is the grouping of
individuals to create several homogeneous groups before assigning the
experimental material).
Observational Studies
In some situations, it is not possible (for reasons such as budget
constraints and ethical issues) to design an experiment to answer a
question or to test a hypothesis. For example, consider that the interest
is in finding whether watching television is affecting the class grades of
students. As watching television may have adverse effects, it would be
unethical to design an experiment and randomly assign students to watch
television for different numbers of hours per day. In such situations,
often the only way is to conduct an observational study, that is, to
collect data on a sample of individuals, and apply an appropriate method
to draw conclusions. In the example above, one could record the number of
hours of television watched by the students and examine its association
with their grades. Only a limited number of conclusions can be drawn from
an observational study. Any observed difference or association has several
reasonable alternative explanations. For example, an observed lower score
of students watching television longer can be caused by such students
having less appropriate home and school inputs, such as fewer books at
home or parents who read less to them. Huang and Lee (2009) investigated
whether television watching at ages 6–7 and 8–9 affects cognitive
development measured by math and reading scores at age 8–9 using a rich
childhood longitudinal sample.
Causal Inference and Instrumental Variables
Suppose that an investigator is interested in testing a hypothesis, for
example, about the comparison of a new educational program versus the
existing program. Whether the investigator performs a randomized
experiment or an observational study, the investigator faces the question
of how to draw inferences about the causal effects of the new program. In
other words, if there is a performance difference between the students who
were administered the new educational program and those who were
administered the existing program, the investigator would like an answer
to the question "Is the difference in performance of the two groups caused
by the difference in the educational program?" An article in the
statistics section discusses how causal inferences can be made.
Instrumental variables (IVs) are used to estimate causal relationships
when controlled experiments are not feasible. An overview of instrumental
variables and of their possible applications to education is given in the
statistics section of the encyclopedia.
Sampling
There is a growing importance of survey information on individuals,
households, institutions, businesses, and environmental resources.
Typically, one wants to gather information on a large group of
individuals. However, time and cost usually do not allow obtaining
information from each individual in the group. In such cases, one usually
gathers information on only a sample, which is a small part of the large
group. Sampling plays an essential role in drawing conclusions about the
large group (which is called the population) from the information
contained in the sample. An example of an application of sampling is the
National Assessment of Educational Progress (NAEP), an educational
sampling survey (Allen et al., 2001). NAEP is the only ongoing measure of
what students in the USA know and can do in a variety of subject areas and
it reports scores for different demographic groups based on
gender, ethnicity, school type, school location, etc. NAEP draws a sample
of students that is representative of the whole student population,
applies several statistical techniques, and draws conclusions (an example
of a conclusion is that between 1992 and 2000, the percentage of
fourth-graders at or above the proficient-achievement level in reading
increased by a small, but statistically significant amount) on the whole
population based on information contained in the sample.
Bayesian and Empirical Bayes Methods
In a traditional or frequentist statistical analysis, the parameter of a
probability model is considered an unknown but nonrandom quantity and only
the information contained in the observed data is relevant for any
inference. In contrast, a Bayesian analysis (see, e.g., Gelman et al.,
2003) assumes that the parameter is a random variable with a certain
probability distribution, referred to as the prior distribution. The prior
distribution quantifies the experimenter's beliefs about the parameter
before observing the data. The next step in a Bayesian approach is to
update the prior distribution on the basis of the likelihood function of
the observed data through Bayes' theorem (Bayes, 1763). The resulting
distribution is referred to as the posterior distribution of the parameter
and summarizes the information in both the prior distribution and the
data. The influence of the prior distribution on the posterior
distribution becomes weaker as the size of the observed data sample
increases. The variation of the Bayesian methods in which the parameters
of the prior distribution are estimated from the observed data is called
empirical Bayes methods. Sinharay (2006) provided a review of the
applications of Bayesian methods to educational measurement. Novick and
Jackson (1974) included several applications of Bayesian methods to
educational measurement. Other examples of applications of Bayesian
methods to education are Rubin (1983), who applied Bayesian methods to
three problems in educational measurement, Zwick et al. (1999), who
applied an empirical Bayes method to differential item functioning, and
Sinharay (2005), who applied Bayesian model-checking methods to assess the
goodness of fit of IRT models. Further details on Bayesian methods and
empirical Bayes methods (see, for example, Carlin and Louis, 1996) are
discussed in the statistics section of the encyclopedia. Decision theory,
which is a Bayesian approach, is concerned with identifying the values,
uncertainties, and other issues relevant in a given decision and the
resulting optimal decision. An article in the statistics section provides
more details on decision theory.
Resampling Methods
Resampling methods (see, e.g., Efron, 1982) draw samples from the observed
data to draw certain conclusions about the population of interest. Two of
the most popular resampling methods are the jackknife and bootstrap. Both
of these are examples of nonparametric statistical methods.
Jackknife is used in statistical inference to estimate the bias and
standard error of a test statistic. The basic idea behind jackknife lies
in systematically recomputing the statistic a large number of times,
leaving out one observation or a group of observations at a time from the
sample. Estimates of the bias and variance of the statistic can be
calculated from this set of jackknife replications of the statistic. The
jackknife finds several applications in complex sampling schemes, such as
multistage sampling with varying sampling weights; an example of such an
application is NAEP, where the jackknife method is employed to compute
standard errors of estimates.
Bootstrap is a statistical method for estimating the sampling distribution
of an estimator by sampling with replacement from the original sample,
most often with the purpose of deriving robust estimates of standard
errors and confidence intervals of a population parameter like a mean,
median, or correlation coefficient. It is often used as a robust
alternative to procedures based on parametric assumptions, especially when
those assumptions are in doubt, or where parametric inference is
impossible or requires very complicated formulas for the calculation of
standard errors. See, for example, Hanson et al. (1993), who applied the
bootstrap method to compute the standard error of an equating method.
Nonparametric Inference
Nonparametric methods, or distribution-free methods, are statistical
methods that do not rely on assumptions that the data are drawn from a
given probability distribution. Nonparametric methods are often applied
when less is known about the data (so that a probability distribution
cannot be assumed). Due to the reliance on fewer assumptions,
nonparametric methods are more robust (i.e., less vulnerable to violations
of assumptions). They are also often applied because of their simplicity.
Examples of nonparametric methods are Pearson's χ² test for assessing
independence in a contingency table, jackknife and bootstrap methods for
estimating the bias and variance of an estimator, the
Wilcoxon–Mann–Whitney rank-sum test, the permutation test, the
Kolmogorov–Smirnov test for assessing whether two distributions are the
same, and spline regression for estimating regression curves of a
dependent variable on several independent variables.
Multiple Linear Regression Models
Multiple linear regression models have been extensively used in education
(see, e.g., Hsu, 2005). Interestingly, the name regression, borrowed from
the title of the first article on this subject (Galton, 1885), does not
reflect
either the importance or breadth of application of this method. Multiple
regression is the statistical procedure to predict the values of a
response (dependent) variable from a collection of predictor (independent)
variable values. For example, if scores on multiple predictors and one
criterion are available, multiple regression may be used to develop a
single equation to predict criterion performance from the set of
predictors. Several applications of multiple regression models can be
found in the prediction of first-year grade-point average in college from
the SAT scores and high school grade-point average (see, e.g., Kobrin
et al., 2008). Multiple regression and multivariate multiple regression,
the case when there is more than one dependent variable of interest and
the interest is in predicting them simultaneously from a set of predictor
variables, are discussed in the statistics section of the encyclopedia.
Hierarchical Linear Models and Growth Models
In an application of linear regression, the observations are assumed to be
independent. When the assumption of independence is likely to be violated,
for example, in an application in which one has data on several students
who belong to a few schools (so that the responses of the students within
each school are dependent), a popular option is to employ hierarchical
linear models (HLMs). These models are also referred to as multilevel
models and random-effects regression models. The students constitute the
lower level while the schools constitute the higher level in the example.
Note that HLMs can also
requirements (Braun, 2005). Interestingly, in this respect, some states
have taken the lead by seeking a quantitative evaluation of teachers based
on an analysis of the test-score gains of their students. Such evaluations
employ a class of models called value-added models (VAMs). These models
require data that track individual students' academic growth over several
years in different subjects in order to estimate the contributions that
teachers make to that growth. Thus, VAMs can be viewed as a special case
of growth models and, hence, of HLMs. Given their current state of
development, VAMs can be used to identify a group of teachers who may
reasonably be assumed to require targeted professional development. These
are the teachers with the lowest estimates of relative effectiveness.
Despite the enthusiasm these models have generated among many
policymakers, several technical reviews of VAMs have revealed a number of
serious concerns and it is important that such concerns be properly
addressed before VAMs are used to make important decisions.
Generalized Linear Models and Generalized
Linear Mixed Models
Linear regression models apply when the response variable can be assumed
to be a continuous variable or to be normally distributed. However, in
several applications in education, the response does not belong to either
of those types. Suppose the interest is in finding out how the
socioeconomic status and average parents' education for a class of
students affect their performance on a test. If we have the scores on the
test for each student, we can
be applied to repeated measures design or longitudinal employ a linear regression model regressing the test
studies, where individuals are followed and their re- scores on the socioeconomic status and average parent
sponses recorded several times over a certain period education. However, if we do not have the scores, but only
of time; the repeated measures constitute the lower level know who passed the test and who did not (which is a
and the individuals constitute the higher level; exam- binary response), we cannot employ linear regression.
ples of such models are growth models, where, for ex- Generalized linear models (GLMs) can be used in situa-
ample, the investigator measures the cognitive growth tions like this. GLMs are extensions of the linear regres-
of students by giving them several tests over a certain sion model to a wider class of response type such as binary
period of time. Growth models are increasingly popular or count data. A GLM requires the specification of two
in the US due to the NCLB Act of 2001 that puts special defining characteristics – the distribution of the response
emphasis on the cognitive growth of students. For exam- and the link function that describes how the mean of the
ple, in December 2007, the US Secretary of Education response is linked to a linear combination of the predic-
Margaret Spellings invited all eligible US states to submit tors. Generalized linear mixed models (GLMM) are
a growth model proposal for the 2007–08 school year. extensions of GLMs to the case when the individuals are
Growth models and HLMs are discussed in the statistics clustered (e.g., students belonging to different schools).
section of the encyclopedia. The statistics section of the encyclopedia includes two
articles, one each on GLM and GLMM.
Value-Added Models
Nonlinear Regression Methods
The NCLB Act of 2001 in the US requires states to ensure
that there are quality teachers in every classroom, with One is often interested is in studying how a set of inde-
quality defined in terms of traditional criteria such as pendent variables affect a dependent variable, but the
academic training and fully meeting the state’s licensure relationship between them cannot be assumed linear. So
the abovementioned models, all of which assume a linear relationship, cannot be applied. Nonlinear regression methods, which may be applicable in such situations to predict the dependent variable from the independent variables, and recursive partitioning (the classification and regression trees method), which is another method that may be applicable in such situations, are discussed in the statistics section of the encyclopedia.

IRT Models

These models, with numerous applications in education, are discussed in an article in the educational measurement section and are not covered here.

Latent Class Models

A latent class model (LCM) relates a set of observed discrete multivariate variables to a set of latent variables (latent variables are not directly observed but are rather inferred, mostly through a mathematical model, from other variables that are observed; e.g., quality of life or intelligence of a person is a latent variable). It is called an LCM because the latent variable is discrete and divides the population into several classes. A class is characterized by a pattern of conditional probabilities that indicate the chance that variables take certain values. For example, Dayton and Macready (2006) discuss an application in which the observed variables are the responses to ten questions on matrix algebra on a test, the latent variable refers to the knowledge of matrix algebra of students, and the latent classes refer to masters and nonmasters on matrix algebra. Given class membership, the conditional probabilities specify the chance that certain answers are chosen. Within each latent class, the observed variables are statistically independent (this is often called local independence). This is an important aspect of LCMs. Usually, the observed variables are statistically dependent. By introducing the latent variable, independence is restored in the sense that variables are independent within classes. The association between the observed variables is thus explained by the latent classes.

Time-Series Analysis

A time series is a sequence of data points, measured typically at successive time points. Time-series analysis comprises methods that attempt to understand such time series, often either to understand the underlying context of the data points, or to make forecasts (predictions). Forecasting using a time-series analysis consists of the use of a model to forecast future events based on known past events. An example in education is the prediction of the number of students who will take a test (e.g., SAT) at an administration based on the numbers from the previous administrations of the test. A time-series model generally reflects the fact that observations close together in time will be more closely related than observations further apart. Three broad classes of time-series models of practical importance are the autoregressive (AR) models, the integrated (I) models, and the moving average (MA) models. There are models, such as the autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models, that are combinations of the above three.

Model Fit and Model Selection

Model fit analysis refers to an examination of whether the statistical model employed in an application adequately explains the important features of the data set at hand. Model selection refers to the choice of the statistical model that describes the data best among several competing models. Model fit and model selection analysis for the linear models employed in education do not pose any problems and proceed in a similar manner as in any other statistics field, for example, by using residual analysis, the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) (see, e.g., Draper and Smith, 1998). However, model fit and model selection analysis for the nonlinear models, especially for the IRT models, are not trivial, primarily because the computations are not straightforward with these models, the response variable is discrete so that normality of the response cannot be assumed, and the number of possible responses is huge so that there is sparseness in the data. Fortunately, with the advent of faster computers, there has been substantial work in these areas. Swaminathan et al. (2006) provided a detailed review of the literature on model fit of IRT models. Several model fit statistics have been suggested for testing different aspects of an IRT model: statistics for testing the unidimensionality assumption of the IRT model, item fit statistics, person fit statistics, and overall model fit statistics. In applications of IRT models, it is important to employ the appropriate model fit statistics depending on the intended use of the model, and to evaluate not only statistical significance, but also practical significance. For example, it may happen that the value of a fit statistic is statistically significant so that the model does not predict an aspect of the data, but the misfit has negligible consequences operationally.

Kang and Cohen (2007) provided a detailed review of model selection methods for IRT models. More techniques such as the penalty criteria and the theoretical background behind several techniques are discussed in the statistics section of the encyclopedia.
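To make the penalty criteria concrete, the following is a minimal sketch (not taken from the article) that compares a constant-mean model with a first-order autoregressive, AR(1), model on a short series and computes AIC and BIC for each. The counts of test takers are hypothetical illustration data, the function names are invented for this example, and Gaussian errors with conditional least-squares fitting are assumed.

```python
import math

# Hypothetical data: test takers (in thousands) at ten successive administrations.
counts = [50.0, 52.1, 53.9, 56.2, 57.8, 60.1, 61.7, 64.0, 66.2, 67.9]


def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood, with the error variance set to its MLE."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k log(n) - 2 log L (smaller is better)."""
    return k * math.log(n) - 2 * loglik


def fit_mean(y):
    """Constant-mean model y_t = mu + e_t, fitted to y[1:] so that both
    models are compared on the same observations."""
    target = y[1:]
    mu = sum(target) / len(target)
    return mu, [v - mu for v in target]


def fit_ar1(y):
    """AR(1) model y_t = c + phi * y_{t-1} + e_t, fitted by least squares."""
    x, target = y[:-1], y[1:]
    n = len(target)
    mx, my = sum(x) / n, sum(target) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, target))
    phi = sxy / sxx
    c = my - phi * mx
    return c, phi, [b - (c + phi * a) for a, b in zip(x, target)]


mu, mean_resid = fit_mean(counts)
c, phi, ar_resid = fit_ar1(counts)
n = len(ar_resid)

# Parameter counts include the error variance: (mu, sigma2) versus (c, phi, sigma2).
mean_ll, ar_ll = gaussian_loglik(mean_resid), gaussian_loglik(ar_resid)
mean_aic, mean_bic = aic(mean_ll, 2), bic(mean_ll, 2, n)
ar_aic, ar_bic = aic(ar_ll, 3), bic(ar_ll, 3, n)

# One-step-ahead forecast for the next administration under the AR(1) fit.
forecast = c + phi * counts[-1]

print(f"mean model: AIC={mean_aic:.2f}  BIC={mean_bic:.2f}")
print(f"AR(1):      AIC={ar_aic:.2f}  BIC={ar_bic:.2f}")
print(f"AR(1) forecast for the next administration: {forecast:.1f}")
```

Both criteria reward a higher log-likelihood and penalize extra parameters, so the additional AR(1) coefficient is preferred only if it improves the fit by more than its penalty. For IRT and other nonlinear models the same formulas apply in principle, but the log-likelihood is harder to compute, which is the difficulty noted above.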
Other Topics in Statistics

Large-sample methods are essential in the application of several statistical methods as it is often important to know the distribution of a test statistic for a large sample. An overview of how to handle multivariate data, which is common in the field of education, is discussed in the statistics section of the encyclopedia. There are several statistics topics that are of increasing interest to the field of education and will most likely become more popular in the coming years. Examples of such topics are analysis of extreme values, association mining, the relationship between cognitive psychology and educational statistics, correspondence analysis, data mining, functional data analysis, graphical models, instrumental variables, latent class analysis, missing data analysis, Monte Carlo methods, Markov chain Monte Carlo methods, sequential probability ratio test, small-area estimation, survival analysis, and signal detection theory. Hence, articles on these topics are included in this section. Computational methods play a key role in statistics in optimizing the time required to run computationally intensive statistical analyses on data. A knowledge of matrix algebra is important in several topics in statistics (such as linear regression, GLM, and ANOVA). Meta analysis (a technique for combining information from several studies), statistical inequalities, and statistical paradoxes, three topics that any user of statistical methods should be aware of, are discussed in the statistics section of the encyclopedia. Random number generation is often an important issue in applications of statistics to education, especially those involving Monte Carlo or Markov chain Monte Carlo methods. Evaluation research, which is a form of disciplined and systematic inquiry that is carried out to arrive at an assessment or appraisal of an object, program, practice, activity, or system, is discussed in the statistics section of the encyclopedia.

Conclusions

The list of topics in statistics that is provided above, along with the examples of applications of these topics to education, demonstrates how statistics can be useful in education. As more students are tested, access to data becomes easier, computers become faster, and statistical software packages become more accessible, statistics is sure to find more applications in education. However, this situation also suggests the use of caution on several grounds. Unless the investigator is careful, the resulting inference from a statistical analysis may be inappropriate. See, for example, Kramer and Gigerenzer (2005) for how statistics can be confusing and has been misused, Haller and Kraus (2002) to see how the concept of statistical significance is misunderstood by university students and their instructors alike, and Stigler (2005) to see how correlation is often misinterpreted as causation. These works show that there is a need to apply appropriate statistical methods to educational applications. The articles in the statistics and educational measurement sections of this encyclopedia (along with the references and the further readings therein) will help practitioners in the field of education to learn more about statistics and to do a better job in choosing appropriate statistical methods in their applications.

There is often a tendency to fit complicated statistical models to real data. However, before using such a model, a researcher should always make sure that the model performs considerably better than a simpler alternative, both substantively and statistically, and that the model is well identified. It is important to remember the following comment in Rubin (1983):

William Cochran once told me that he was relatively unimpressed with statistical work that produced methods for solving non-existent problems or produced complicated methods which were at best only imperceptibly superior to simple methods already available. Cochran went on to say that he wanted to see statistical methods developed to help solve existing problems which were without currently acceptable solutions.

See also: Analysis and Interpretation of Multivariate Data; Analysis of Covariance; Bayesian Statistical Analysis; Bootstrap Method; Canonical Correlation; Categorical Data Analysis; Cluster Analysis: Overview; Computational Statistics; Continuous Probability Distributions; Decision Theory; Design of Experiments; Discrete Probability Distributions; Discrimination and Classification; Empirical Bayes Methods; Evaluation Research; Exploratory Data Analysis; Factor Analysis: An Overview and Some Contemporary Advances; Generalizability Theory; Generalized Linear Mixed Models; Generalized Linear Models; Generating Random Numbers; Goodness-of-Fit Testing; Growth Modeling; Hierarchical Linear Models; Hypothesis Testing and Confidence Intervals; Instrumental Variables; Jackknife Methods; Large-sample Statistical Methods; Latent Class Models; Matrix Algebra; Measure of Association; Measures of Central Tendency; Measures of Dispersion, Skewness and Kurtosis; Meta Analysis; Model Selection; Multidimensional Scaling; Multiple Comparisons; Multivariate Analysis of Variance; Multivariate Linear Regression; Nonlinear Regression Analysis; Nonparametric Statistical Methods; Observational Studies; Point Estimation Methods with Applications to Item Response Theory Models; Probability Theory; Recursive Partitioning; Sampling; Statistical Inequalities; Statistical Paradoxes; Statistical Power Analysis; Statistical Significance Versus Effect Size; Structural Equation Models; Time Series Analysis; Univariate Linear Regression; Value-Added Models.
Bibliography

Agresti, A. (2002). Categorical Data Analysis, 2nd edn. New York: Wiley.
Allen, N. L., Donoghue, J. R., and Schoeps, T. L. (2001). The NAEP 1998 Technical Report (NCES 2001-509). Washington, DC: National Center for Education Statistics, U.S. Department of Education.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 61.53, 370–418. [Reprinted with biographical note by Barnard, G. A. (1958) Biometrika 45, 293–315.]
Braun, H. (2005). Using Student Progress to Evaluate Teaching: A Primer on Value-Added Models. Princeton, NJ: Policy Information Center, Educational Testing Service.
Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. London: Chapman and Hall.
Cronbach, L. J., Rajaratnam, N., and Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology 16, 137–163.
Dayton, C. M. and Macready, G. B. (2006). Latent class analysis in psychometrics. In Rao, C. R. and Sinharay, S. (eds.) Handbook of Statistics, vol. 26, pp 421–446. Amsterdam: North-Holland/Elsevier.
Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, 3rd edn. New York: Wiley.
Efron, B. (1982). The Jackknife, the Bootstrap, and Other Resampling Plans. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Everitt, B. S. (1990). Cluster analysis. In Husen, T. and Postlethwaite, N. (eds.) International Encyclopedia of Education, 2nd edn., pp 825–831. Oxford: Pergamon.
Galton, F. (1885). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute 15, 246–263.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. New York: Chapman and Hall.
Haller, H. and Kraus, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research 7, 1–20.
Hanson, B. A., Zeng, L., and Kolen, M. J. (1993). Standard errors of Levine linear equating. Applied Psychological Measurement 17, 225–237.
Hsu, T. (2005). Research methods and data analysis procedures used by educational researchers. International Journal of Research and Method in Education 28(2), 109–133.
Huang, F. and Lee, M. (2009). Dynamic treatment effect analysis of TV effects on child cognitive development. No 0906, Discussion Paper Series, Institute of Economic Research, Korea University. http://econpapers.repec.org/RePEc:iek:wpaper:0906.
Jenkins, F., Kaplan, B., and Lim, Y. (2001). Data analysis for the national writing samples. In Allen, N. (ed.) The NAEP 1998 Technical Report, 359–370. Washington, DC: National Center for Education Statistics.
Johnson, R. A. and Wichern, D. W. (1998). Applied Multivariate Statistical Analysis, 4th edn. Upper Saddle River, NJ: Prentice-Hall.
Kang, T. and Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement 31(4), 331–358.
Kobrin, J. L., Patterson, B. F., Shaw, E. J., Mattern, K. D., and Barbuti, S. M. (2008). Validity of the SAT for predicting first-year college grade point average. College Board Research Report No. 2008-5. New York: College Board.
Kramer, W. and Gigerenzer, G. (2005). How to confuse with statistics or: The use and misuse of conditional probabilities. Statistical Science 20, 223–230.
Longford, N. T. (1990). Multivariate variance component analysis: An application in test development. Journal of Educational Statistics 15(2), 91–112.
Novick, M. R. and Jackson, P. H. (1974). Statistical Methods for Educational and Psychological Research. New York: McGraw-Hill.
Rubin, D. B. (1983). Some applications of Bayesian statistics to educational data. Statistician 32, 55–68.
Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement 42, 375–394.
Sinharay, S. (2006). Bayesian methods in educational measurement. In Upadhyay, S. K., Singh, U., and Dey, D. K. (eds.) Bayesian Statistics and Its Applications, pp 422–437. New Delhi: Anamaya.
Stigler, S. M. (2005). Correlation and causation: A comment. Perspectives in Biology and Medicine 48(1), 88–94.
Swaminathan, H., Hambleton, R. K., and Rogers, H. J. (2006). Assessing the fit of item response theory models. In Rao, C. R. and Sinharay, S. (eds.) Handbook of Statistics, vol. 26, pp 683–718. Amsterdam: North-Holland/Elsevier.
Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika 49, 95–110.
Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
von Davier, M., Sinharay, S., Oranje, A., and Beaton, A. (2007). The statistical procedures used in national assessment of educational progress: Recent developments and future directions. In Rao, C. R. and Sinharay, S. (eds.) Handbook of Statistics, vol. 26, pp 1039–1055. Amsterdam: Elsevier.
Wainer, H. (2000). Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot, 2nd edn. Hillsdale, NJ: Erlbaum.
Wainer, H. (2005). Graphic Discovery: A Trout in the Milk and Other Visual Adventures. Princeton, NJ: Princeton University Press.
Zwick, R., Thayer, D. T., and Lewis, C. (1999). An empirical Bayes approach to Mantel–Haenszel DIF analysis. Journal of Educational Measurement 36, 1–28.

Further Reading

Crocker, L. and Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Harcourt Brace Jovanovich College Publishers.

Relevant Website

http://www.ed.gov – US Department of Education.
