8614
Educational Statistics
Spring-2023
1st
Question#1: How do descriptive and inferential statistics help a
teacher? Explain.
Descriptive Statistics
Both descriptive and inferential statistics help make sense out of row after row of data!
Use descriptive statistics to summarize and graph the data for a group that you choose. This process
allows you to understand that specific set of observations.
Descriptive statistics describe a sample. That’s pretty straightforward. You simply take a group
that you’re interested in, record data about the group members, and then use summary statistics
and graphs to present the group properties. With descriptive statistics, there is no uncertainty
because you are describing only the people or items that you actually measure. You’re not trying
to infer properties about a larger population.
The process involves taking a potentially large number of data points in the sample and reducing
them to a few meaningful summary values and graphs. This procedure allows us to gain more
insight and visualize the data rather than simply poring over row upon row of raw numbers!
Descriptive statistics frequently use the following statistical measures to describe groups:
Central tendency: Use the mean or the median to locate the center of the dataset. This measure
tells you where most values fall.
Dispersion: How far out from the center do the data extend? You can use the range or standard
deviation to measure the dispersion. A low dispersion indicates that the values cluster more tightly
around the center. Higher dispersion signifies that data points fall further away from the center.
We can also graph the frequency distribution.
Suppose we want to describe the test scores in a specific class of 30 students. We record all of
the test scores, calculate the summary statistics, and produce graphs.
Statistic               Class value
Mean                    79.18
Minimum                 66.21
Maximum                 96.53
Proportion >= 70        86.7%
These results indicate that the mean score of this class is 79.18. The scores range from 66.21 to
96.53 and the distribution is symmetrically centered on the mean. A score of at least 70 on the
test is acceptable. The data show that 86.7% of the students have acceptable scores.
Collectively, this information gives us a pretty good picture of this specific class. There is no
uncertainty surrounding these statistics because we gathered the scores for everyone in the class.
However, we can’t take these results and extrapolate to a larger population of students.
We’ll do that later.
A good exploratory tool for descriptive statistics is the five-number summary, which presents a set
of distributional properties for your sample.
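As a sketch, the summary measures above (central tendency, dispersion, and the five-number summary) can be computed with Python's standard library. The scores here are hypothetical, not the actual class data described above.

```python
import statistics

# Hypothetical test scores for a class of 30 students (illustrative only).
scores = [66, 68, 70, 71, 72, 73, 74, 75, 76, 77,
          78, 78, 79, 79, 80, 80, 81, 81, 82, 83,
          84, 85, 86, 87, 88, 89, 91, 93, 95, 97]

mean = statistics.mean(scores)
median = statistics.median(scores)
stdev = statistics.stdev(scores)                         # sample standard deviation
data_range = max(scores) - min(scores)
acceptable = sum(s >= 70 for s in scores) / len(scores)  # proportion scoring >= 70

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, q2, q3 = statistics.quantiles(scores, n=4)
five_number = (min(scores), q1, q2, q3, max(scores))

print(f"mean={mean:.2f} median={median} range={data_range} sd={stdev:.2f}")
print(f"proportion acceptable (>=70): {acceptable:.1%}")
print("five-number summary:", five_number)
```

Because we measured every student in this hypothetical class, these values describe the group with no uncertainty.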
Inferential Statistics
Inferential statistics takes data from a sample and makes inferences about the larger population
from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions
from a sample and generalize them to a population, we need to have confidence that our sample
accurately reflects the population. This requirement affects our process. At a broad level, we must
do the following:
1. Define the population we are studying.
2. Draw a representative sample from that population.
3. Use analyses that incorporate the sampling error.
We don’t get to pick a convenient group. Instead, random sampling allows us to have confidence
that the sample represents the population. This process is a primary method for obtaining samples
that mirror the population on average. Random sampling produces statistics, such as the mean,
that do not tend to be too high or too low. Using a random sample, we can generalize from the
sample to the broader population. Unfortunately, gathering a truly random sample can be a
complicated process.
To collect a representative sample, you can use methods such as simple random sampling,
stratified sampling, cluster sampling, and systematic sampling.
While samples are much more practical and less expensive to work with, there are tradeoffs.
Typically, we learn about the population by drawing a relatively small sample from it. We are a
very long way off from measuring all people or objects in that population. Consequently, when
you estimate the properties of a population from a sample, the sample statistics are unlikely to
equal the actual population value exactly.
For instance, your sample mean is unlikely to equal the population mean exactly. The difference
between the sample statistic and the population value is the sampling error. Inferential statistics
incorporate estimates of this error into the statistical results.
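A quick simulation makes sampling error concrete. The population and its values below are entirely hypothetical; the point is only that each random sample's mean differs somewhat from the population mean.

```python
import random

random.seed(42)

# Hypothetical population of 10,000 test scores (illustrative only).
population = [random.gauss(79, 9) for _ in range(10_000)]
population_mean = sum(population) / len(population)

# Each random sample yields a slightly different mean; the gap between
# the sample mean and the population mean is the sampling error.
for n in (10, 100, 1000):
    sample = random.sample(population, n)
    sample_mean = sum(sample) / n
    print(f"n={n:4d}  sample mean={sample_mean:6.2f}  "
          f"sampling error={sample_mean - population_mean:+.2f}")
```

Notice that larger samples tend to produce smaller sampling errors, which is why inferential methods account for sample size.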
In contrast, summary values in descriptive statistics are straightforward. The average score in a
specific class is a known value because we measured all individuals in that class. There is no
uncertainty.
The most common methodologies in inferential statistics are hypothesis tests, confidence intervals,
and regression analysis. Interestingly, these inferential methods can produce similar summary
values as descriptive statistics, such as the mean and standard deviation. However, as I’ll show
you, we use them very differently when making inferences.
Hypothesis tests
Hypothesis tests use sample data to answer questions like the following:
Is the population mean greater than or less than a particular value?
Are the means of two or more populations different from each other?
For example, if we study the effectiveness of a new medication by comparing the outcomes in a
treatment and control group, hypothesis tests can tell us whether the drug’s effect that we
observe in the sample is likely to exist in the population. After all, we don’t want to use the
medication if it is effective only in our specific sample. Instead, we need evidence that it’ll be
useful in the entire population of patients. Hypothesis tests allow us to draw these types of
conclusions about entire populations.
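As a minimal sketch of the drug example, SciPy's two-sample t-test can compare a treatment and a control group. The group data here are simulated, not real clinical outcomes.

```python
import random
from scipy import stats  # SciPy's independent two-sample t-test

random.seed(1)

# Hypothetical outcomes for a treatment and a control group (simulated).
treatment = [random.gauss(52, 10) for _ in range(40)]
control = [random.gauss(45, 10) for _ in range(40)]

# Null hypothesis: the two population means are equal.
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: the effect likely exists in the population.")
```

A small p-value is evidence that the difference observed in the sample reflects a real difference in the populations, not just sampling error.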
In inferential statistics, a primary goal is to estimate population parameters. These parameters are
the unknown values for the entire population, such as the population mean and standard deviation.
These parameter values are not only unknown but almost always unknowable. Typically, it’s
impossible to measure an entire population. The sampling error I mentioned earlier produces
uncertainty, or a margin of error, around our estimates.
Suppose we define our population as all high school basketball players. Then, we draw a random
sample from this population and calculate the mean height of 181 cm. This sample estimate of 181
cm is the best estimate of the mean height of the population. However, it’s virtually guaranteed
that our estimate of the population parameter is not exactly correct.
Confidence intervals incorporate the uncertainty and sampling error to create a range of values that
the actual population value is likely to fall within. For example, a confidence interval of [176, 186]
indicates that we can be confident that the real population mean falls within this range.
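A t-based confidence interval for a mean can be sketched in a few lines. The height sample below is hypothetical; it is not the dataset behind the 181 cm example.

```python
import math
import statistics
from scipy import stats

# Hypothetical sample of player heights in cm (illustrative only).
heights = [172, 175, 178, 179, 180, 181, 182, 183, 185, 186,
           178, 184, 177, 183, 181, 186, 179, 182, 180, 188]

n = len(heights)
mean = statistics.mean(heights)
sem = statistics.stdev(heights) / math.sqrt(n)  # standard error of the mean

# 95% confidence interval using the t distribution with n - 1 df
t_crit = stats.t.ppf(0.975, df=n - 1)
low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"sample mean = {mean:.1f} cm, 95% CI = [{low:.1f}, {high:.1f}]")
```

The interval widens with more variable data and narrows as the sample size grows.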
Regression analysis
Regression analysis describes the relationship between a set of independent variables and a
dependent variable. This analysis incorporates hypothesis tests that help determine whether the
relationships observed in the sample data actually exist in the population.
For example, consider a fitted line plot displaying the relationship in a regression model between
height and weight in adolescent girls. Because the relationship is statistically significant, we have
sufficient evidence to conclude that this relationship exists in the population rather than just our
sample.
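A simple linear regression with its built-in hypothesis test can be sketched with SciPy's `linregress`. The height and weight pairs below are made up for illustration; they are not the adolescent-girls dataset referred to above.

```python
from scipy import stats

# Hypothetical height (cm) and weight (kg) pairs (illustrative only).
height = [150, 152, 155, 157, 160, 162, 165, 168, 170, 173]
weight = [45, 47, 50, 51, 55, 56, 60, 62, 65, 68]

result = stats.linregress(height, weight)
print(f"slope = {result.slope:.3f} kg/cm, intercept = {result.intercept:.1f}")
print(f"R^2 = {result.rvalue ** 2:.3f}, p = {result.pvalue:.2e}")
# The p-value for the slope is the hypothesis test that the relationship
# exists in the population, not just in this sample.
```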
Let’s define our population as 8th-grade students in public schools in the State of Pennsylvania in
the United States. We need to devise a random sampling plan to help ensure a representative
sample. This process can actually be arduous. For the sake of this example, assume that we are
provided a list of names for the entire population and draw a random sample of 100 students from
it and obtain their test scores. Note that these students will not be in one class, but from many
different classes in different schools across the state.
For inferential statistics, we can calculate the point estimate for the mean, standard deviation, and
proportion for our random sample. However, it is staggeringly improbable that any of these point
estimates are exactly correct, and there is no way to know for sure anyway. Because we can’t
measure all subjects in this population, there is a margin of error around these statistics.
Consequently, I’ll report the confidence intervals for the mean, standard deviation, and the
proportion of satisfactory scores (>=70).
Given the uncertainty associated with these estimates, we can be 95% confident that the population
mean is between 77.4 and 80.9. The population standard deviation (a measure of dispersion) is
likely to fall between 7.7 and 10.1. And, the population proportion of satisfactory scores is
expected to be between 77% and 92%.
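A confidence interval for a proportion, like the one for satisfactory scores above, can be sketched with the normal-approximation (Wald) formula. The figure of 85 successes out of 100 is hypothetical, chosen only to fall inside the range reported above.

```python
import math

# Normal-approximation (Wald) confidence interval for a proportion.
def proportion_ci(successes, n, z=1.96):
    """Return (low, high) for an approximate 95% CI (z = 1.96)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical: 85 of 100 sampled students scored >= 70.
low, high = proportion_ci(85, 100)
print(f"95% CI for the proportion: [{low:.3f}, {high:.3f}]")
```

More accurate methods (e.g., the Wilson interval) exist for small samples or extreme proportions; the Wald form is shown for clarity.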
Differences between Descriptive and Inferential Statistics
As you can see, the difference between descriptive and inferential statistics lies in the process as
much as it does the statistics that you report.
For descriptive statistics, we choose a group that we want to describe and then measure all subjects
in that group. The statistical summary describes this group with complete certainty (outside of
measurement error).
For inferential statistics, we need to define the population and then devise a sampling plan that
produces a representative sample. The statistical results incorporate the uncertainty that is inherent
in using a sample to understand an entire population. The sample size becomes a vital
characteristic. The law of large numbers states that as the sample size grows, the sample statistics
(e.g., the sample mean) will converge on the population values.
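The law of large numbers is easy to see in a simulation. Here a fair six-sided die (population mean 3.5) is rolled repeatedly; this is a generic illustration, not data from the text.

```python
import random

random.seed(0)

POP_MEAN = 3.5  # expected value of a fair six-sided die

rolls = []
for n in (10, 100, 1_000, 10_000, 100_000):
    # keep rolling until we have n total rolls, then report the running mean
    while len(rolls) < n:
        rolls.append(random.randint(1, 6))
    sample_mean = sum(rolls) / len(rolls)
    print(f"n={n:6d}  sample mean={sample_mean:.4f}  "
          f"error={sample_mean - POP_MEAN:+.4f}")
```

The running sample mean drifts closer and closer to 3.5 as the sample grows, which is exactly the convergence the law describes.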
A study using descriptive statistics is simpler to perform. However, if you need evidence that an
effect or relationship between variables exists in an entire population rather than only your sample,
you need to use inferential statistics.
Question#2: Explain non-probability sampling techniques used in
educational research.
When we are going to do an investigation, and we need to collect data, we have to know the type
of techniques we are going to use to be prepared. For this reason, there are two types of
sampling: the random or probabilistic sample and the non-probabilistic one. In this case, we will
talk in-depth about non-probability sampling.
Definition
Non-probability sampling is a method in which not all population members have an equal chance
of participating in the study, unlike probability sampling, where each member of the population
has a known chance of being selected. Non-probability sampling is most useful for exploratory studies
like a pilot survey (deploying a survey to a smaller sample compared to pre-determined sample
size). Researchers use this method in studies where it is impossible to draw random probability
sampling due to time or cost considerations.
1. Convenience sampling
Convenience sampling relies on the ease of access to subjects, such as surveying students in the
researcher's own classroom. Researchers use it because it is fast and inexpensive, and the
subjects are readily available.
2. Consecutive sampling
This non-probability sampling method is very similar to convenience sampling, with a slight
variation. Here, the researcher picks a single person or a group of a sample, conducts
research over a period, analyzes the results, and then moves on to another subject or group
if needed. Consecutive sampling technique gives the researcher a chance to work with
many topics and fine-tune his/her research by collecting results that have vital insights.
3. Quota sampling
Hypothetically, consider that a researcher wants to study the career goals of male and female
employees in an organization. There are 500 employees in the organization, also known as the
population. To understand the population better, the researcher will need only a sample, not
the entire population. Further, the researcher is interested in particular strata within the population.
Here is where quota sampling helps in dividing the population into strata or groups.
4. Judgmental or purposive sampling
In the judgmental sampling method, researchers select the samples based purely on the
researcher’s knowledge and credibility. In other words, researchers choose only those people who
they deem fit to participate in the research study. Judgmental or purposive sampling is not a
scientific method of sampling, and the downside to this sampling technique is that the
preconceived notions of a researcher can influence the results. Thus, this research technique
involves a high amount of ambiguity.
5. Snowball sampling
Snowball sampling helps researchers find a sample when subjects are difficult to locate. Researchers
use this technique when the sample size is small and not easily available. This sampling system
works like a referral program: once researchers find suitable subjects, they ask them for
assistance in seeking similar subjects to form a considerably good-sized sample.
Non-probability sampling techniques are a more conducive and practical method for
researchers deploying surveys in the real world. Although statisticians prefer probability
sampling because it yields data in the form of numbers, non-probability sampling, if done
correctly, can produce results of similar if not the same quality while avoiding sampling errors.
Getting responses using non-probability sampling is faster and more cost-effective
than probability sampling because the sample is known to the researcher. The
respondents respond quickly compared to randomly selected people, as they have a
high motivation level to participate.
Question#3: Give examples to describe variables commonly used in
educational research.
The definition of a variable in the context of a research study is some feature with the potential to
change, typically one that may influence or reflect a relationship or outcome. For example,
potential variables might be time it takes for something to occur, whether or not an object is used
within a study, or the presence of a feature among members of the sample.
Within research, independent and dependent variables are key, forming the basis on which a
study is performed. However, other types of variables may come into play within a study, such as
confounding variables, controlled variables, extraneous, and moderator variables.
Although "dependent variable" is the most commonly used term, dependent variables may also be
referred to as response variables, outcome variables, or left-hand-side variables. These alternate names help to
further illustrate their purpose: a dependent variable shows a response to changes in other
variables, displaying the outcome.
The meaning of "left-hand-side" is less immediately transparent, but becomes more obvious when
considering the format of a basic algebraic equation. Typically, the dependent variable in these is
referred to as "Y" and placed on the left-hand-side of the equation. Because of this standard,
dependent variables may also be called the Y variable as well, and the dependent variable is usually
seen on the y-axis in graphs.
One example of a dependent variable would be a student's test scores. Several factors would
influence these scores, such as the amount of time spent studying, amount of sleep, or the stress
levels of the student. Ultimately, the dependent variable is not static or controlled directly, but is
subject to change depending on the independent variables involved.
An independent variable is one that the researcher controls or otherwise manipulates within a
study. In order to determine the relationship between dependent and independent variables, a
researcher will purposefully change an independent variable, watching to see if and how the
dependent variable changes in response.
The independent variable can alternately be called the explanatory, predictor, right-hand-side, or
X variable. Similarly to dependent variables, these reflect the uses of independent variables, as
they are intended to explain or predict changes in the dependent variables. Likewise, independent
variables are often referred to as "X" in basic algebraic equations and plotted using the x-axis. In
research, the experimenters will generally control independent variables as much as possible, so
that they can understand their true relationship with the dependent variables.
For example, a research study might use age as an independent variable, since it influences some
potential dependent variables. Obviously, a researcher cannot randomly assign ages to
participants, but they could only allow participants of certain ages into a study or sort a sample
into desired age groups.
Many research studies have independent and dependent variables, since understanding cause-and-effect
between them is a key end goal. Some examples of research questions involving these
variables include:
How does sleep the night before an exam affect scores in students? The independent
variable is the amount of time slept (in hours), and the dependent variable is the test score.
How does caffeine affect hunger? The amount of caffeine consumed would be the
independent variable, and hunger would be the dependent variable.
Is quality of sleep affected by phone use before bedtime? The length of time spent on the
phone prior to sleeping would be the independent variable and the quality of sleep would
be the dependent variable.
Does listening to classical music help young children develop their reading abilities? The
frequency and level of classical music exposure would be the independent variables, and
reading scores would be the dependent variable.
While the independent and dependent variables are the most commonly discussed variables in
research, other variables can influence outcomes. These include confounding, extraneous, control,
and moderator variables.
Confounding Variables
A confounding variable, also known as a "third variable," changes the dependent variable despite
not being the independent variable being studied. This can cause issues within a study. After all,
since variation in a confounding variable causes a response in a dependent variable, that response
may be misattributed to the independent variable. In order to ensure that the observed outcome is
only due to changes in independent variables, it is crucial to determine what confounding variables
might sway experimental results.
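A small simulation can show how a confounder misleads. Everything below is fabricated for illustration: "stress" (the confounder) drives both hours studied and test score, while studying itself has no effect at all in this toy model.

```python
import random
from scipy import stats

random.seed(7)

# Simulated confounding: stress drives both variables.
stress = [random.gauss(0, 1) for _ in range(200)]
studied = [-s + random.gauss(0, 0.5) for s in stress]       # stressed students study less
score = [70 - 5 * s + random.gauss(0, 2) for s in stress]   # stress lowers scores

# The naive correlation between studying and scores looks strong...
r_naive = stats.pearsonr(studied, score)[0]
print(f"correlation(studied, score) = {r_naive:.2f}")
# ...but it is driven entirely by the confounder, since the score
# formula above never uses `studied` at all.
```

Controlling for the confounder (e.g., by including it in a regression model) would reveal that studying has no direct effect in this simulated setup.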
Question#4: Describe histogram as data interpretation technique.
Answer: A frequency distribution shows how often each different value in a set of data occurs. A
histogram is the most commonly used graph to show frequency distributions. It looks very much like
a bar chart, but there are important differences between them. This helpful data collection and analysis
tool is considered one of the seven basic quality tools.
USE A HISTOGRAM
Use a histogram when:
You want to see the shape of the data’s distribution, especially when determining whether the
output of a process is distributed approximately normally
You want to see whether a process change has occurred from one time period to another
You wish to communicate the distribution of data quickly and easily to others
HOW TO CREATE A HISTOGRAM
Collect the data, divide the range of values into intervals (bins) of equal width, count how many
data points fall into each bin, and draw a bar for each bin whose height represents its count, with
no gaps between adjacent bars.
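A minimal sketch of histogram construction in Python, using hypothetical data and text-based bars in place of a chart:

```python
import random

random.seed(3)

# Hypothetical measurements (illustrative only).
data = [random.gauss(50, 10) for _ in range(200)]

# Step 1: choose the number of bins and compute equal-width intervals.
num_bins = 10
lo, hi = min(data), max(data)
width = (hi - lo) / num_bins

# Step 2: count how many values fall into each bin.
counts = [0] * num_bins
for x in data:
    i = min(int((x - lo) / width), num_bins - 1)  # clamp the maximum value
    counts[i] += 1

# Step 3: draw one bar per bin (graphically, bars touch with no gaps).
for i, c in enumerate(counts):
    print(f"{lo + i * width:6.1f}-{lo + (i + 1) * width:6.1f} | {'#' * c}")
```

A plotting library would normally render the bars, but the counting logic is the same.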
HISTOGRAM ANALYSIS
Before drawing any conclusions from your histogram, be sure that the process was operating
normally during the time period being studied. If any unusual events affected the process during
the time period of the histogram, your analysis of the histogram shape likely cannot be
generalized to all time periods.
Analyze the meaning of your histogram's shape; typical histogram shapes and what they mean
are described below.
Normal Distribution
A common pattern is the bell-shaped curve known as the "normal distribution." In a normal
distribution, points are as likely to occur on one side of the average as on the other.
Skewed Distribution
The skewed distribution is asymmetrical because a natural limit prevents outcomes on one side. The
distribution’s peak is off center toward the limit and a tail stretches away from it. For example, a
distribution of analyses of a very pure product would be skewed, because the product cannot be more
than 100 percent pure. Other examples of natural limits are holes that cannot be smaller than the
diameter of the drill bit or call-handling times that cannot be less than zero. These distributions are
called right- or left-skewed according to the direction of the tail.
Double-Peaked or Bimodal
The bimodal distribution looks like the back of a two-humped camel. The outcomes of two processes
with different distributions are combined in one set of data. For example, a distribution of production
data from a two-shift operation might be bimodal, if each shift produces a different distribution of
results. Stratification often reveals this problem.
Plateau or Multimodal Distribution
The plateau might be called a "multimodal distribution." Several processes with normal distributions
are combined. Because there are many peaks close together, the top of the distribution resembles a
plateau.
Edge Peak Distribution
The edge peak distribution looks like the normal distribution except that it has a large peak at one tail.
Usually this is caused by faulty construction of the histogram, with data lumped together into a group
labeled "greater than."
Comb Distribution
In a comb distribution, the bars are alternately tall and short. This distribution often results from
rounded-off data and/or an incorrectly constructed histogram. For example, temperature data rounded
off to the nearest 0.2 degree would show a comb shape if the bar width for the histogram were 0.1
degree.
Truncated or Heart-Cut Distribution
The truncated distribution looks like a normal distribution with the tails cut off. The supplier might be
producing a normal distribution of material and then relying on inspection to separate what is within
specification limits from what is out of spec. The resulting shipments to the customer from inside the
specifications are the heart cut.
Dog Food Distribution
The dog food distribution is missing something—results near the average. If a customer receives this
kind of distribution, someone else is receiving a heart cut, and the customer is left with the "dog
food," the odds and ends left over after the master’s meal. Even though what the customer receives is
within specifications, the product falls into two clusters: one near the upper specification limit and one
near the lower specification limit. This variation often causes problems in the customer’s process.
Question#5: Explain different measures of dispersion used in
educational research.
Answer: A measure of dispersion explains the extent of variability. Dispersion helps to
understand the disparity or spread in a dataset. It gives us an idea about the variation around a
central value. The range, interquartile range, standard deviation, and mean deviation are the
commonly used measures of dispersion. Dispersion can be calculated and measured using these
methods.
Dispersion is the state of getting dispersed or spread. Statistical dispersion means the extent to
which numerical data is likely to vary about an average value. In other words, dispersion helps to
understand the distribution of the data.
Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how
homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or
scattered the variable is.
There are two main types of dispersion methods in statistics: absolute measures of dispersion
and relative measures of dispersion.
An absolute measure of dispersion carries the same unit as the original data set. The absolute
dispersion method expresses the variations in terms of the average of deviations of observations,
like the standard or mean deviations. It includes the range, standard deviation, quartile deviation, etc.
1. Range: It is simply the difference between the maximum value and the minimum value
given in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Deduct the mean from each data point in the set, square each difference, add
the squares, and finally divide the sum by the total number of values in the data set to get
the variance. Variance: σ² = Σ(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the standard deviation,
i.e. S.D. = √(σ²) = σ.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers
into quarters. The quartile deviation is half of the distance between the third and the first
quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and the
arithmetic mean of the absolute deviations of the observations from a measure of central
tendency is known as the mean deviation (also called the mean absolute deviation).
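The five absolute measures above can be computed directly in Python; the data here are the ten values from Example 2 later in this answer.

```python
import statistics

# Dataset from Example 2 (coefficient of range example).
data = [45, 55, 63, 76, 67, 84, 75, 48, 62, 65]
n = len(data)
mean = statistics.mean(data)

data_range = max(data) - min(data)                  # 1. Range
variance = sum((x - mean) ** 2 for x in data) / n   # 2. Population variance
std_dev = variance ** 0.5                           # 3. Standard deviation
q1, _, q3 = statistics.quantiles(data, n=4)
quartile_dev = (q3 - q1) / 2                        # 4. Quartile deviation
mean_dev = sum(abs(x - mean) for x in data) / n     # 5. Mean deviation

print(f"mean={mean} range={data_range} variance={variance:.2f} sd={std_dev:.2f}")
print(f"quartile deviation={quartile_dev:.2f} mean deviation={mean_dev:.2f}")
```

Note that `statistics.quantiles` uses the "exclusive" quartile method by default; other conventions give slightly different quartile values.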
The relative measures of dispersion are used to compare the distribution of two or more data sets.
This measure compares values without units. Common relative dispersion methods include:
1. Co-efficient of Range
2. Co-efficient of Variation
The coefficients of dispersion are calculated (along with the measure of dispersion) when two series
are compared, that differ widely in their averages. The dispersion coefficient is also used when two
series with different measurement units are compared. It is denoted as C.D.
The most important formulas for the different dispersion methods are:
Coefficient of Range = (Xmax − Xmin) / (Xmax + Xmin)
Coefficient of Variation = (σ / x̄) × 100
Solved Examples
Example 2: Calculate the range and coefficient of range for the following data
values.
Solution:
Let Xi values be: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65
Here,
Xmax = 84, Xmin = 45
Range = Xmax − Xmin = 84 − 45 = 39
Coefficient of Range = (Xmax − Xmin) / (Xmax + Xmin) = 39 / 129 = 0.302 (approx.)
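Example 2 can be checked with a few lines of arithmetic:

```python
# Verifying Example 2: range and coefficient of range.
data = [45, 55, 63, 76, 67, 84, 75, 48, 62, 65]

x_max, x_min = max(data), min(data)
rng = x_max - x_min
coeff_range = rng / (x_max + x_min)

print(f"Range = {rng}")                             # 39
print(f"Coefficient of Range = {coeff_range:.3f}")  # 0.302
```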