
Variables and Data

Descriptive statistics
➔ the branch of statistics concerned with describing sets of measurements, both samples
and populations.
Variable
➔ a characteristic that changes or varies over time and/or for different individuals or
objects under consideration.
Experimental unit / an element of the sample
➔ the individual or object on which a variable is measured.
Single measurement / data value
➔ results when a variable is actually measured on an experimental unit.
Population
➔ the set of all measurements of interest to the investigator.
Sample
➔ a subset of measurements selected from the population of interest.
Univariate data
➔ result when a single variable is measured on a single experimental unit.
Bivariate data
➔ result when two variables are measured on a single experimental unit.
Multivariate data
➔ result when more than two variables are measured.

Types of Variables
Qualitative variable
➔ measures a quality or characteristic on each experimental unit.
Quantitative variable
➔ measures a numerical quantity or amount on each experimental unit.
Categorical data
➔ the data produced by qualitative variables, which can be categorized according to
similarities or differences in kind.
Discrete variable
➔ can assume only a finite or countable number of values.
Continuous variable
➔ can assume the infinitely many values corresponding to the points on a line interval.

Graphs for Categorical Data


Statistical table
➔ summarizes the data by category; the summary can then be displayed graphically as a
data distribution.
Frequency
➔ number of measurements in each category
Relative frequency
➔ proportion of measurements in each category.
Percentage
➔ percentage of measurements in each category (100 × relative frequency).

● a measurement will belong to one and only one category


● each measurement has a category to which it can be assigned
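
The three summaries above (frequency, relative frequency, percentage) can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the grade data are made up for the example and are not part of the notes.

```python
from collections import Counter

def frequency_table(measurements):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(measurements)   # frequency of each category
    n = len(measurements)            # total number of measurements
    return {
        category: (freq, freq / n, 100 * freq / n)
        for category, freq in counts.items()
    }

# Hypothetical data: letter grades for 8 students.
grades = ["A", "B", "B", "C", "B", "A", "D", "B"]
table = frequency_table(grades)
for category, (freq, rel, pct) in sorted(table.items()):
    print(f"{category}: frequency={freq}, relative frequency={rel:.3f}, percentage={pct:.1f}%")
```

Because every measurement falls in exactly one category, the frequencies add up to n and the relative frequencies add up to 1.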

Pie chart
➔ circular graph that shows how the measurements are distributed among the categories.
➔ used to display the relationship of the parts to the whole.
Bar chart
➔ shows the same distribution of measurements among the categories
➔ the height of the bar measures how often a particular category was observed.
➔ used to emphasize the actual quantity or frequency for each category.
Pareto chart
➔ a bar chart in which the bars are ordered from largest to smallest.
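
As a rough sketch, the ordering used by a Pareto chart is just a sort of the category frequencies from largest to smallest. The helper name `pareto_order` and the color data are hypothetical, and the bars are drawn as text so no plotting library is assumed.

```python
from collections import Counter

def pareto_order(measurements):
    """Return (category, frequency) pairs ordered from largest to smallest frequency."""
    return Counter(measurements).most_common()

# Hypothetical data: favorite colors in a small survey.
colors = ["red", "blue", "red", "green", "red", "blue"]
ordered = pareto_order(colors)
for category, freq in ordered:
    # One '#' per observation stands in for the height of the bar.
    print(f"{category:<6}{'#' * freq} ({freq})")
```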

Graphs for Quantitative Data

Line charts
Time series
➔ formed in the data set when a quantitative variable is recorded over time at equally
spaced intervals (such as daily, weekly, monthly, quarterly, or yearly).
➔ most effectively presented on a line chart with time as the horizontal axis.
Bar chart
➔ shows the same distribution of measurements among the categories
➔ the height of the bar measures how often a particular category was observed.
➔ used to emphasize the actual quantity or frequency for each category.

Dotplots
➔ simplest graph for quantitative data.

Stem and Leaf Plots


➔ this plot presents a graphical display of the data using the actual numerical values of
each data point.
➔ How to construct a Stem and Leaf Plot:
1. Divide each measurement into two parts: the stem and the leaf.
2. List the stems in a column, with a vertical line to their right.
3. For each measurement, record the leaf portion in the same row as its
corresponding stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding so that the reader can re-create the
actual measurements if necessary.
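
The construction steps above can be sketched as follows. This is a minimal illustration that assumes non-negative whole-number measurements, with the tens (and higher) digits as the stem and the ones digit as the leaf; the function name `stem_and_leaf` and the scores are made up.

```python
def stem_and_leaf(values):
    """Map each stem to its ordered list of leaves (leaf = ones digit)."""
    stems = {}
    for v in sorted(values):              # sorting first keeps the leaves ordered
        stem, leaf = divmod(v, 10)        # step 1: split into stem and leaf
        stems.setdefault(stem, []).append(leaf)
    lo, hi = min(stems), max(stems)
    # List every stem between the smallest and largest, even if it has no leaves.
    return {stem: stems.get(stem, []) for stem in range(lo, hi + 1)}

# Hypothetical data: test scores.
scores = [65, 70, 68, 90, 75, 72, 68, 93, 55]
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 6 | 5 = 65")
```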

Interpreting Graphs with a Critical Eye


1. First, check the horizontal and vertical scales, so that you are clear about what is being
measured.
2. Examine the location of the data distribution. Where on the horizontal axis is the center
of the distribution? If you are comparing two distributions, are they both centered in the
same place?
3. Examine the shape of the distribution. Does the distribution have one “peak,” a point that
is higher than any other? If so, this is the most frequently occurring measurement or
category. Is there more than one peak? Are there an approximately equal number of
measurements to the left and right of the peak?
4. Look for any unusual measurements or outliers. That is, are any measurements much
bigger or smaller than all of the others? These outliers may not be representative of the
other values in the set.

● Distributions are often described according to their shapes

Symmetric
➔ if the left and right sides of the distribution, when divided at the middle value, form mirror
images.
Skewed to the right
➔ if a greater proportion of the measurements lie to the right of the peak value.
➔ contain a few unusually large measurements.
Skewed to the left
➔ if a greater proportion of the measurements lie to the left of the peak value.
➔ contain a few unusually small measurements.
Unimodal
➔ has one peak.
Bimodal
➔ two peaks.
➔ often represent a mixture of two different populations in the data set.

● When comparing graphs created for two data sets, you should compare their scales of
measurement, locations, and shapes, and look for unusual measurements or outliers.

Relative Frequency Histograms

➔ resembles a bar chart, but it is used to graph quantitative rather than qualitative data.
➔ for a quantitative data set is a bar graph in which the height of the bar shows “how often”
(measured as a proportion or relative frequency) measurements fall in a particular class
or subinterval.
➔ the classes or subintervals are plotted along the horizontal axis.
➔ can be used to describe the distribution of a set of data in terms of its location and
shape, and to check for outliers
➔ How to construct a Relative Frequency Histogram:
1. Choose the number of classes, usually between 5 and 12. The more data you
have, the more classes you should use.
2. Calculate the approximate class width by dividing the difference between the
largest and smallest values by the number of classes.
3. Round the approximate class width up to a convenient number.
4. If the data are discrete, you might assign one class for each integer value taken
on by the data. For a large number of integer values, you may need to group
them into classes.
5. Locate the class boundaries. The lowest class must include the smallest
measurement. Then add the remaining classes using the left inclusion method.
6. Construct a statistical table containing the classes, their frequencies, and their
relative frequencies.
7. Construct the histogram like a bar graph, plotting class intervals on the horizontal
axis and relative frequencies as the heights of the bars.
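
Steps 1 to 6 above can be sketched in code. The sketch makes two assumptions that the notes do not fix: the class width is rounded up to the next whole number, and the first class starts at the smallest measurement rounded down. The function name `relative_frequency_table` and the data are hypothetical.

```python
import math

def relative_frequency_table(data, num_classes):
    """Return rows of (lower boundary, upper boundary, frequency, relative frequency)."""
    lo, hi = min(data), max(data)
    # Steps 2-3: approximate width, rounded up to a whole number (an assumption).
    width = math.ceil((hi - lo) / num_classes) or 1
    start = math.floor(lo)                # lowest class includes the smallest value
    # Left inclusion: class i covers [start + i*width, start + (i+1)*width).
    # Add a class if the largest value would otherwise sit on the last upper boundary.
    while start + num_classes * width <= hi:
        num_classes += 1
    counts = [0] * num_classes
    for x in data:
        counts[int((x - start) // width)] += 1
    n = len(data)
    return [
        (start + i * width, start + (i + 1) * width, c, c / n)
        for i, c in enumerate(counts)
    ]

# Hypothetical measurements.
data = [1.2, 2.5, 3.1, 3.8, 4.4, 5.0, 5.9, 6.3, 7.7, 9.6]
table = relative_frequency_table(data, num_classes=5)
for lower, upper, freq, rel in table:
    print(f"{lower} to < {upper}: frequency={freq}, relative frequency={rel:.2f}")
```

The relative frequencies in the last column are the bar heights for step 7.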

Describing Data with Numerical Measures

Describing A Set Of Data With Numerical Measures

Numerical measures
➔ can be calculated for either a sample or a population of measurements.
Parameters
➔ numerical descriptive measures associated with a population of measurements.
Statistics
➔ computed from sample measurements.

Measures of Center

➔ a measure along the horizontal axis that locates the center of the distribution.
Arithmetic Mean / Average
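➔ of a set of n measurements is equal to the sum of the measurements divided by n. The
sample mean is denoted by x-bar and the population mean by μ:
$$\bar{x} = \frac{\sum x_i}{n}$$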

Median
➔ m of a set of n measurements is the value of x that falls in the middle position when the
measurements are ordered from smallest to largest.
Mode
➔ the category that occurs most frequently.
➔ the most frequently occurring value of x.
Modal Class
➔ the class with the highest peak or frequency.
➔ the midpoint of that class is taken to be the mode.
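
A short sketch of the three measures of center, written directly from the definitions above. The function names and the data are illustrative; for an even number of measurements the median is taken as the average of the two middle values, which is the usual convention.

```python
from collections import Counter

def mean(xs):
    """Sum of the measurements divided by n."""
    return sum(xs) / len(xs)

def median(xs):
    """Middle value of the ordered measurements."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                    # odd n: the single middle value
    return (s[mid - 1] + s[mid]) / 2     # even n: average of the two middle values

def modes(xs):
    """All values that occur most frequently (there may be more than one)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical data.
data = [5, 7, 3, 7, 9, 7, 3, 8]
print(mean(data), median(data), modes(data))
```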

Measures of Variability
Range
➔ R, of a set of n measurements is defined as the difference between the largest and
smallest measurements.
Variance of a Population
➔ of N measurements is the average of the squares of the deviations of the measurements
about their mean μ. The population variance is denoted by σ² and is given by the
formula
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Variance of a Sample
➔ of n measurements is the sum of the squared deviations of the measurements about
their mean x-bar divided by (n - 1). The sample variance is denoted by s² and is given
by the formula
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Standard Deviation
➔ of a set of measurements is equal to the positive square root of the variance.

● We always divide by (n - 1) when computing the sample variance s² and the sample
standard deviation s.
● The value of s is always greater than or equal to zero.
● The larger the value of s² or s, the greater the variability of the data set.
● If s² or s is equal to zero, all the measurements must have the same value.
● In order to measure the variability in the same units as the original observations, we
compute the standard deviation.
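
The difference between the two divisors is easy to see in code. The sketch below follows the verbal definitions above (divide by N for a population, by n - 1 for a sample); the function names and the five data values are made up for illustration.

```python
import math

def population_variance(xs):
    """Average squared deviation about the mean (divide by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

# Hypothetical data.
data = [5, 7, 1, 2, 4]
s2 = sample_variance(data)
s = math.sqrt(s2)   # standard deviation: positive square root of the variance
print(f"sample variance = {s2:.2f}, sample standard deviation = {s:.3f}")
print(f"population variance = {population_variance(data):.2f}")
```

Python's standard `statistics` module offers `variance`/`stdev` (sample) and `pvariance`/`pstdev` (population) for the same calculations.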

On the Practical Significance of the Standard Deviation

Tchebysheff’s Theorem
➔ given a number k greater than or equal to 1 and a set of n measurements, at least
[1 - (1/k²)] of the measurements will lie within k standard deviations of their mean.
➔ applies to any set of measurements and can be used to describe either a sample or a
population.
➔ an interval is constructed by measuring a distance kσ on either side of the mean μ. The
number k can be any number as long as it is greater than or equal to 1. Then
Tchebysheff’s Theorem states that at least 1 - (1/k²) of the total of n measurements
lie in the constructed interval.

➔ k = 1: at least 0 of the measurements (possibly none) lie in the interval μ - σ to μ + σ.
➔ k = 2: at least 3/4 of the measurements lie in the interval μ - 2σ to μ + 2σ.
➔ k = 3: at least 8/9 of the measurements lie in the interval μ - 3σ to μ + 3σ.
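
The bound 1 - (1/k²) and the fraction actually observed in a data set can be compared with a small sketch. The function names and the deliberately skewed data are hypothetical; the population standard deviation (`statistics.pstdev`) is used here as an assumption, since the theorem is stated with σ.

```python
import statistics

def tchebysheff_bound(k):
    """Lower bound on the fraction of measurements within k standard deviations."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2

def fraction_within(data, k):
    """Observed fraction of measurements within k standard deviations of the mean."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if mu - k * sigma <= x <= mu + k * sigma]
    return len(inside) / len(data)

# Hypothetical, strongly skewed data: the bound still holds.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
for k in (1, 2, 3):
    print(f"k={k}: bound={tchebysheff_bound(k):.3f}, observed={fraction_within(data, k):.3f}")
```

The observed fractions are usually well above the bound, which is why the theorem is described as conservative.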
Empirical Rule
➔ given a distribution of measurements that is approximately mound-shaped:
◆ the interval (μ ± σ) contains approximately 68% of the measurements
◆ the interval (μ ± 2σ) contains approximately 95% of the measurements.
◆ the interval (μ ± 3σ) contains approximately 99.7% of the measurements.
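
One way to get a feel for the Empirical Rule is to simulate mound-shaped data and count. The sketch below draws hypothetical values from a normal distribution with Python's `random.gauss` (the mean of 100, standard deviation of 15, and sample size are arbitrary choices) and reports the fraction within 1, 2, and 3 standard deviations of the sample mean.

```python
import random
import statistics

def fraction_within(data, k):
    """Fraction of measurements within k sample standard deviations of the mean."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if x_bar - k * s <= x <= x_bar + k * s]
    return len(inside) / len(data)

random.seed(0)  # fixed seed so the simulation is repeatable
data = [random.gauss(100, 15) for _ in range(10_000)]
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(data, k):.3f}")
```

The printed fractions should land close to 0.68, 0.95, and 0.997; for skewed data they can differ noticeably.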
Normal Distribution
➔ mound-shaped distribution.

Measures of Relative Standing


Sample z-score
➔ a measure of relative standing, defined as
$$z = \frac{x - \bar{x}}{s}$$
➔ measures the distance between an observation and the mean, measured in units of
standard deviation.
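
A minimal sketch of the sample z-score, using the sample mean and the sample standard deviation (divisor n - 1) from Python's `statistics` module. The function name `z_score` and the data are illustrative.

```python
import statistics

def z_score(x, data):
    """Distance of x from the sample mean, in units of the sample standard deviation."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)   # sample standard deviation (divides by n - 1)
    return (x - x_bar) / s

# Hypothetical data.
data = [5, 7, 1, 2, 4]
z = z_score(7, data)
print(f"z-score of 7: {z:.2f}")
```

A positive z-score means the observation lies above the mean, a negative one below it.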

Percentile
➔ a measure of relative standing, defined for a set of n measurements on the variable x
that has been arranged in order of magnitude.
pth Percentile
➔ the value of x that is greater than p% of the measurements and is less than the
remaining (100 - p)%.
Lower Quartile (First Quartile), Q1
➔ the value of x that is greater than one-fourth of the measurements and is less than the
remaining three-fourths.
Upper Quartile (Third Quartile), Q3
➔ the value of x that is greater than three-fourths of the measurements and is less than the
remaining one-fourth.

Calculating Sample Quartiles


➔ when the measurements are arranged in order of magnitude, the lower quartile, Q1, is
the value of x in position 0.25(n + 1), and the upper quartile, Q3, is the value of x in
position 0.75(n + 1).
➔ when 0.25(n + 1) and 0.75(n + 1) are not integers, the quartiles are found by
interpolation, using the values in the two adjacent positions.
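
The position rule and the interpolation step can be sketched as follows. The helper names `value_at_position` and `quartiles` and the ten data values are made up for illustration; positions are counted from 1, as in the rule above.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based position, interpolating between the two adjacent values."""
    lower_index = int(position)          # whole part of the position
    fraction = position - lower_index    # how far to move toward the next value
    if lower_index < 1:
        return sorted_data[0]
    if lower_index >= len(sorted_data):
        return sorted_data[-1]
    lower = sorted_data[lower_index - 1]
    upper = sorted_data[lower_index]
    return lower + fraction * (upper - lower)

def quartiles(data):
    """Return (Q1, Q3) using positions 0.25(n + 1) and 0.75(n + 1)."""
    s = sorted(data)
    n = len(s)
    return value_at_position(s, 0.25 * (n + 1)), value_at_position(s, 0.75 * (n + 1))

# Hypothetical data with n = 10, so the positions are 2.75 and 8.25.
data = [16, 25, 4, 18, 11, 13, 20, 8, 11, 9]
q1, q3 = quartiles(data)
print(f"Q1 = {q1}, Q3 = {q3}")
```

Here Q1 lies three-fourths of the way from the 2nd to the 3rd ordered value, and Q3 one-fourth of the way from the 8th to the 9th.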

Interquartile Range (IQR)
➔ for a set of measurements is the difference between the upper and lower quartiles; that
is, IQR = Q3 - Q1.

The Five-Number Summary and The Box Plot
Five Number Summary
➔ consists of the smallest number, the lower quartile, the median, the upper quartile, and
the largest number, presented in order from smallest to largest:
➔ Min Q1 Median Q3 Max
➔ one-fourth of the measurements in the data set lie between each of the four adjacent
pairs of numbers.
➔ can be used to create a simple graph called a box plot to visually describe the data
distribution.
To construct a box plot:
1. Calculate the median, the upper and lower quartiles, and the IQR for the data set.
2. Draw a horizontal line representing the scale of measurement. Form a box just above
the horizontal line with the left and right ends at Q1 and Q3. Draw a vertical line through
the box at the location of the median.
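
● The fences used to detect outliers are: lower fence = Q1 - 1.5(IQR), upper fence =
Q3 + 1.5(IQR).
● Any measurement beyond the upper or lower fence is an outlier; the rest of the
measurements, inside the fences, are not unusual.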

3. Mark any outliers with an asterisk (*) on the graph.


4. Extend horizontal lines called “whiskers” from the ends of the box to the smallest and
largest observations that are not outliers.
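
The numbers needed to draw a box plot can be collected in one pass. This sketch assumes the usual fences at 1.5 × IQR beyond the quartiles and relies on `statistics.quantiles`, whose default "exclusive" method is understood here to use positions of the form p(n + 1) with interpolation. The function name `box_plot_summary` and the data are hypothetical.

```python
import statistics

def box_plot_summary(data):
    """Five-number summary, fences, outliers, and whisker ends for a box plot."""
    s = sorted(data)
    q1, median, q3 = statistics.quantiles(s, n=4)   # default method="exclusive"
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in s if lower_fence <= x <= upper_fence]
    outliers = [x for x in s if x < lower_fence or x > upper_fence]
    return {
        "five_number": (s[0], q1, median, q3, s[-1]),
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,                     # marked with * on the plot
        "whiskers": (inside[0], inside[-1]),      # extreme values that are not outliers
    }

# Hypothetical data with one unusually large measurement.
data = [4, 8, 9, 11, 11, 13, 16, 18, 20, 25, 52]
summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this example the value 52 lies beyond the upper fence, so the right whisker stops at 25 and 52 is plotted as an outlier.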
