Statistics Midterms Reviewer
Descriptive statistics
➔ the branch of statistics concerned with describing sets of measurements, both samples
and populations.
Variable
➔ characteristics that change or vary over time and/or for different individuals or objects
under consideration.
Experimental unit / an element of the sample
➔ the individual or object on which a variable is measured.
Single measurement / data value
➔ results when a variable is actually measured on an experimental unit.
Sample
➔ any smaller subset of measurements.
Population
➔ the set of all measurements of interest to the investigator.
Sample
➔ a subset of measurements selected from the population of interest.
Univariate data
➔ result when a single variable is measured on a single experimental unit.
Bivariate data
➔ result when two variables are measured on a single experimental unit.
Multivariate data
➔ result when more than two variables are measured.
Types of Variables
Qualitative variable
➔ measure a quality or characteristic on each experimental unit
Quantitative variable
➔ measure a numerical quantity or amount on each experimental unit.
Categorical data
➔ qualitative variables produce data that can be categorized according to similarities or
differences in kind
Discrete variable
➔ can assume only a finite or countable number of values.
Continuous variable
➔ can assume the infinitely many values corresponding to the points on a line interval.
Pie chart
➔ circular graph that shows how the measurements are distributed among the categories.
➔ used to display the relationship of the parts to the whole.
Bar chart
➔ shows the same distribution of measurements among the categories
➔ the height of the bar measuring how often a particular category was observed.
➔ used to emphasize the actual quantity or frequency for each category.
Pareto chart
➔ a bar chart in which the bars are ordered from largest to smallest.
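As a rough illustration of the ordering behind a Pareto chart, the sketch below counts how often each category occurs and sorts the counts from largest to smallest. The function name `pareto_order` and the sample colors are made up for this example.

```python
from collections import Counter

def pareto_order(observations):
    """Return (category, count) pairs ordered from most to least frequent."""
    counts = Counter(observations)
    # most_common() with no argument returns every category, largest count first.
    return counts.most_common()

# Hypothetical qualitative data: one color recorded per experimental unit.
colors = ["red", "blue", "red", "green", "red", "blue"]
ordered = pareto_order(colors)
for category, count in ordered:
    print(category, count)
```

Plotting the bars in this order would give the Pareto layout.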
Line charts
Time series
➔ formed in the data set when a quantitative variable is recorded over time at equally
spaced intervals (such as daily, weekly, monthly, quarterly, or yearly).
➔ most effectively presented on a line chart with time as the horizontal axis.
Dotplots
➔ simplest graph for quantitative data.
Symmetric
➔ if the left and right sides of the distribution, when divided at the middle value, form mirror
images.
Skewed to the right
➔ if a greater proportion of the measurements lie to the right of the peak value.
➔ contain a few unusually large measurements.
Skewed to the left
➔ if a greater proportion of the measurements lie to the left of the peak value.
➔ contain a few unusually small measurements.
Unimodal
➔ has one peak.
Bimodal
➔ two peaks.
➔ often represent a mixture of two different populations in the data set.
● When comparing graphs created for two data sets, you should compare their scales of
measurement, locations, and shapes, and look for unusual measurements or outliers.
Relative frequency histogram
➔ resembles a bar chart, but it is used to graph quantitative rather than qualitative data.
➔ for a quantitative data set is a bar graph in which the height of the bar shows “how often”
(measured as a proportion or relative frequency) measurements fall in a particular class
or subinterval.
➔ the classes or subintervals are plotted along the horizontal axis.
➔ can be used to describe the distribution of a set of data in terms of its location and
shape, and to check for outliers
➔ How to construct a relative frequency histogram:
1. Choose the number of classes, usually between 5 and 12. The more data you
have, the more classes you should use.
2. Calculate the approximate class width by dividing the difference between the
largest and smallest values by the number of classes.
3. Round the approximate class width up to a convenient number.
4. If the data are discrete, you might assign one class for each integer value taken
on by the data. For a large number of integer values, you may need to group
them into classes.
5. Locate the class boundaries. The lowest class must include the smallest
measurement. Then add the remaining classes using the left inclusion method.
6. Construct a statistical table containing the classes, their frequencies, and their
relative frequencies.
7. Construct the histogram like a bar graph, plotting class intervals on the horizontal
axis and relative frequencies as the heights of the bars.
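The construction steps above can be sketched in code. This is a minimal illustration, not a definitive procedure: the function name `relative_frequency_table`, the sample data, and the choice to start the first class at the smallest measurement are all assumptions made for this example. Classes use left inclusion, meaning each class is the interval [lower, upper).

```python
import math

def relative_frequency_table(data, start, width):
    """Group data into classes [lower, lower + width) using left inclusion.

    Assumes every value in data is >= start.
    Returns rows of (lower, upper, frequency, relative frequency).
    """
    n = len(data)
    # Use enough classes that the largest measurement falls inside the last one.
    num_classes = int((max(data) - start) // width) + 1
    counts = [0] * num_classes
    for x in data:
        counts[int((x - start) // width)] += 1
    table = []
    for i, count in enumerate(counts):
        lower = start + i * width
        table.append((lower, lower + width, count, count / n))
    return table

# Hypothetical data set; aim for about 5 classes.
data = [1, 2, 2, 3, 5, 6, 6, 7, 9, 10]
approx_width = (max(data) - min(data)) / 5   # step 2: approximate class width
width = math.ceil(approx_width)              # step 3: round up to a convenient number
table = relative_frequency_table(data, start=min(data), width=width)
for lower, upper, freq, rel in table:
    print(f"[{lower}, {upper})  frequency={freq}  relative={rel:.2f}")
```

The relative frequencies in the last column are the bar heights for step 7.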
Describing Data with Numerical Measures
Numerical measures
➔ can be calculated for either a sample or a population of measurements.
Parameters
➔ numerical descriptive measures associated with a population of measurements.
Statistics
➔ computed from sample measurements.
Measures of Center
➔ a measure along the horizontal axis that locates the center of the distribution.
Arithmetic Mean / Average
➔ of a set of n measurements is equal to the sum of the measurements divided by n.
Median
➔ m of a set of n measurements is the value of x that falls in the middle position when the
measurements are ordered from smallest to largest.
Mode
➔ the category that occurs most frequently.
➔ the most frequently occurring value of x.
Modal Class
➔ the class with the highest peak or frequency.
➔ the midpoint of that class is taken to be the mode.
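The three measures of center can be computed directly from their definitions. The sketch below is illustrative only: the data are made up, and averaging the two middle values when n is even is the usual convention rather than something stated in these notes.

```python
from collections import Counter

def mean(xs):
    """Sum of the measurements divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered measurements.

    For even n, the two middle values are averaged (a common convention).
    """
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

def mode(xs):
    """Most frequently occurring value (first one found if there is a tie)."""
    return Counter(xs).most_common(1)[0][0]

data = [2, 9, 11, 5, 6, 5]   # hypothetical measurements
print(mean(data), median(data), mode(data))
```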
Measures of Variability
Range
➔ R, of a set of n measurements is defined as the difference between the largest and
smallest measurements.
Variance of a Population
➔ of N measurements is the average of the squares of the deviations of the measurements
about their mean μ. The population variance is denoted by σ² and is given by the
formula
➔ $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Variance of a Sample
➔ of n measurements is the sum of the squared deviations of the measurements about
their mean x̄ divided by (n - 1). The sample variance is denoted by s² and is given
by the formula
➔ $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
➔ of a set of measurements is equal to the positive square root of the variance.
● We always divide by (n - 1) when computing the sample variance s² and the sample
standard deviation s.
● The value of s is always greater than or equal to zero.
● The larger the value of s² or s, the greater the variability of the data set.
● If s² or s is equal to zero, all the measurements must have the same value.
● In order to measure the variability in the same units as the original observations, we
compute the standard deviation.
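To make the difference between the population and sample formulas concrete, here is a small sketch that computes the range, both variances, and the sample standard deviation. The function names and the data are illustrative assumptions.

```python
import math

def data_range(xs):
    """Difference between the largest and smallest measurements."""
    return max(xs) - min(xs)

def population_variance(xs):
    """Average squared deviation about the mean (divide by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations about the mean, divided by (n - 1)."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

data = [5, 7, 1, 2, 4]                    # hypothetical measurements
s2 = sample_variance(data)
s = math.sqrt(s2)                         # standard deviation: same units as the data
print(data_range(data), population_variance(data), s2, s)
```

For the same data the sample variance is larger than the population variance, because the sum of squared deviations is divided by a smaller number.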
Tchebysheff’s Theorem
➔ given a number k greater than or equal to 1 and a set of n measurements, at least
1 - (1/k²) of the measurements will lie within k standard deviations of their mean.
➔ applies to any set of measurements and can be used to describe either a sample or a
population.
➔ an interval is constructed by measuring a distance kσ on either side of the mean μ. The
number k can be any number as long as it is greater than or equal to 1. Then
Tchebysheff’s Theorem states that at least 1 - (1/k²) of the total number n measurements
lies in the constructed interval. For k = 1, 2, and 3:
➔ At least none of the measurements lie in the interval μ - σ to μ + σ.
➔ At least 3/4 of the measurements lie in the interval μ - 2σ to μ + 2σ.
➔ At least 8/9 of the measurements lie in the interval μ - 3σ to μ + 3σ.
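The bound is easy to check numerically. The sketch below computes 1 - 1/k² and compares it with the fraction of a made-up data set that actually falls within k standard deviations of the mean. The function names are hypothetical, and the population standard deviation is used here only to keep the example definite.

```python
import math

def tchebysheff_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations (k >= 1)."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of data lying in [mean - k*sd, mean + k*sd]."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    inside = [x for x in data if mean - k * sd <= x <= mean + k * sd]
    return len(inside) / n

data = [5, 7, 1, 2, 4]   # hypothetical measurements
for k in (1, 2, 3):
    print(k, tchebysheff_lower_bound(k), fraction_within(data, k))
```

The observed fraction is never below the bound; the theorem guarantees a minimum, so the true fraction is often much higher.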
Empirical Rule
➔ given a distribution of measurements that is approximately mound-shaped:
◆ the interval (μ ± σ) contains approximately 68% of the measurements
◆ the interval (μ ± 2σ) contains approximately 95% of the measurements.
◆ the interval (μ ± 3σ) contains approximately 99.7% of the measurements.
Normal Distribution
➔ mound-shaped distribution.
z-score
➔ measures the distance between an observation and the mean, measured in units of
standard deviation.
➔ $z = \frac{x - \bar{x}}{s}$
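The standardized distance described above (commonly called a z-score) can be sketched as follows. This version assumes sample data, so it uses the sample mean and the sample standard deviation with the (n - 1) divisor; the function name and data are illustrative.

```python
import math

def sample_z_score(x, data):
    """Distance of x from the sample mean, in units of the sample standard deviation."""
    n = len(data)
    x_bar = sum(data) / n
    s = math.sqrt(sum((v - x_bar) ** 2 for v in data) / (n - 1))
    return (x - x_bar) / s

data = [5, 7, 1, 2, 4]        # hypothetical measurements
z = sample_z_score(7, data)   # how far the value 7 sits above the mean
print(round(z, 2))
```

A positive result means the observation lies above the mean, a negative result below it.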
Percentile
➔ A set of n measurements on the variable x has been arranged in order of magnitude.
Pth Percentile
➔ the value of x that is greater than p% of the measurements and is less than the
remaining (100 - p)%.
Lower Quartile (First Quartile), Q1
➔ the value of x that is greater than one-fourth of the measurements and is less than the
remaining three-fourths.
Upper Quartile (Third Quartile), Q3
➔ the value of x that is greater than three-fourths of the measurements and is less than the
remaining one-fourth.
Interquartile Range (IQR)
➔ for a set of measurements is the difference between the upper and lower quartiles; that
is, IQR = Q3 - Q1.
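Quartiles and the IQR can be computed with Python's standard library, as in the sketch below. Note the assumption: `statistics.quantiles` with `method="exclusive"` places Q1 and Q3 at positions 0.25(n + 1) and 0.75(n + 1) in the ordered data and interpolates between neighbors when needed. Other textbooks and software use slightly different conventions, so results may differ on small data sets. The data are made up.

```python
import statistics

data = [16, 1, 11, 4, 8, 2, 13, 5, 14, 7, 10]   # hypothetical measurements, n = 11

# quantiles() sorts the data internally and returns the three cut points Q1, median, Q3.
q1, med, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, med, q3, iqr)
```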
The Five-Number Summary and The Box Plot
Five Number Summary
➔ consists of the smallest number, the lower quartile, the median, the upper quartile, and
the largest number, presented in order from smallest to largest:
➔ Min Q1 Median Q3 Max
➔ one-fourth of the measurements in the data set lie between each of the four adjacent
pairs of numbers.
➔ can be used to create a simple graph called a box plot to visually describe the data
distribution.
To construct a box plot:
1. Calculate the median, the upper and lower quartiles, and the IQR for the data set.
2. Draw a horizontal line representing the scale of measurement. Form a box just above
the horizontal line with the left and right ends at Q1 and Q3. Draw a vertical line through
the box at the location of the median.
● Any measurement beyond the upper or lower fence is an outlier; the rest of the
measurements, inside the fences, are not unusual.
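The notes do not say where the fences are placed. A widely used convention, assumed in the sketch below, puts the lower fence at Q1 - 1.5(IQR) and the upper fence at Q3 + 1.5(IQR). The function name `box_plot_summary`, the quartile method, and the data are also assumptions made for illustration.

```python
import statistics

def box_plot_summary(data):
    """Five-number summary, fences, and outliers for a box plot.

    Assumes fences at 1.5 * IQR beyond the quartiles (a common convention).
    """
    q1, med, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "five_number": (min(data), q1, med, q3, max(data)),
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,
    }

data = [1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 40]   # hypothetical measurements
summary = box_plot_summary(data)
print(summary)
```

Here the value 40 lies beyond the upper fence, so it would be marked as an outlier on the box plot.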