Quantitative Techniques in Business: Introduction To Statistics
Muehsam, 2006
Since this would take considerable time and money, and since the probability of
collecting the data necessary to determine the true annual salary of the students is small, a
sample of this population will be taken. The sample mean annual salary of the sample of
students will be determined and used to estimate the true mean annual salary of all the
students with majors in the College of Business Administration at Sam Houston State
University.
The study of statistics consists of two types: descriptive statistics and
inferential statistics. Descriptive statistics are characteristics, usually numeric, used to
describe a particular data set. An example of a descriptive statistic would be the average
final exam grade of ten students in an elementary statistics class. This average test score
is used to indicate a typical value for the exam grades of the ten students. Inferential
statistics, on the other hand, are similar to descriptive statistics in that each is calculated
from a sample, but the difference is the use of the statistic. In inferential statistics, the
statistic is used to make inference, or make decisions, about the entire population of
interest. In other words, we take a sample and calculate a statistic and use that statistic to
make inference about the actual value of the characteristic in the entire population.
For instance, there are many descriptive characteristics of a firm's customers that
management would like to know, but this information may be difficult or impossible
to determine. Measurement of each and every customer of a large retail firm is nearly
impossible. Even if the information were gathered, it would be unlikely that it would be
timely.
Unfortunately, managers do not always know what mean (average) weekly
demand for a product will be or what proportion of television viewers will watch a
particular show. Since these parameters of interest are not known, and are usually
impossible or impractical to determine, they will be estimated using partial
information gathered from a sample.
For instance, if the desired parameter is the mean annual salary of the income
earning residents of a particular county, a sample of 200 of these residents could be
obtained, the annual salary of each resident (element) in the sample could be
determined, and the mean annual salary of the sample residents calculated. If the sample is drawn in
a random fashion from a frame, or list, of the entire population, and if we use correct
statistical techniques, the sample mean annual salary (a statistic) may be a good estimate
of the true mean annual salary (a parameter) of all the residents of this county.
A population includes all the elements of interest. We use the term element to
represent each individual unit of a group in which we have interest. For instance,
elements may refer to people (e.g., customers), records (e.g., all loan accounts at a
particular bank), products (e.g., when we are interested in the proportion defective), etc. The
notation used in statistics to represent the population size is N. In our example above,
the population of interest would be all the income earning residents of the county. Each
of these residents is an element in our population. If the population of the income
earning residents in the county was 50,000 then N = 50,000. The size of the population,
N, is often not known.
A sample is a subset of the population. The notation for the sample size is n.
In our previous example, the sample would be the 200 residents we sampled out of all the
income earning residents in the county. In this case n = 200.
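The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 50,000 salaries is generated at random (it is not real data), and the seed, salary limits, and variable names are all hypothetical.

```python
import random

# Hypothetical population: N = 50,000 made-up annual salaries (not real data).
rng = random.Random(42)  # fixed seed so the sketch is repeatable
N = 50_000
population = [rng.uniform(20_000, 120_000) for _ in range(N)]

# Parameter: the true mean of the whole population (normally unknown).
population_mean = sum(population) / N

# Statistic: the mean of a random sample of n = 200 elements,
# drawn without replacement from the frame (the full list).
n = 200
sample = rng.sample(population, n)
sample_mean = sum(sample) / n

print(f"parameter (population mean): {population_mean:,.2f}")
print(f"statistic (sample mean):     {sample_mean:,.2f}")
```

In practice only the sample mean would be available; the population mean is computed here solely to show how close the estimate can be.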
use statistical techniques that allow us to estimate the probability of getting a value for
the sample statistic that is not a good estimate of the population parameter.
The use of statistics to estimate parameters of interest is not guaranteed to be
successful. If the estimate is not good the result could be a faulty decision that, in turn,
could result in loss of time and/or revenue. We must not allow quantitative techniques to
make decisions for us; we must use these techniques only as a tool to assist us in decision
making.
Nominal data is qualitative data that has no natural order. Examples of nominal data include: gender;
political affiliation; type of car owned; product model; etc. Data composed of numbers
can also be qualitative data. Zip codes, area codes, and telephone numbers are examples of
data that are qualitative. In math terms, these data are not real numbers because they
do not represent numeric measures. One way to determine whether numbers are
numeric measures is to consider whether one might be interested in an average of these
numbers. If a number can be replaced with letters, words or symbols without losing
any information then this indicates that a number is NOT a numeric measure. Ordinal
data is qualitative data that has a natural order. Examples of ordinal data include:
military rank; size of clothing using S, M, L, XL; place in which a race was finished;
condition of a used appliance using POOR, AVERAGE, GOOD, EXCELLENT; etc.
While ordinal data has an order, the intervals between the rankings are not equal
intervals. Thus, while ordinal data has more structure than nominal data, math functions
on the data, such as differences, are not valid.
Quantitative data categorizes an element by a numeric measure. Quantitative
data are true numbers and, as a result, more quantitative techniques are available for use
with this data. Quantitative data can be divided into two types of data: interval data and
ratio data. Interval data is quantitative data that has no natural starting point or zero
level. Examples of interval data include Fahrenheit temperature and scores on IQ tests.
Each of these types of data is a numeric measure but neither has a natural starting point or
zero level. Zero degrees Fahrenheit is not the absence of temperature, just as there is no
zero level for a test of intelligence. Interval data can be used for any technique that
requires quantitative data; however, we must realize that ratios have no meaning with this
type of data since there is no natural zero level. For example, 50 degrees Fahrenheit is
not twice as warm as 25 degrees Fahrenheit. Ratio data is quantitative data that has a
natural starting point or zero level. Most quantitative data falls into this scale of data
measurement. Examples of ratio scaled data include height, weight, rate of return, net
income, etc. Since there is a natural zero level, ratios have meaning.
The sample mean is $\bar{x} = \frac{\sum x}{n}$. The population
mean is denoted $\mu$ (the Greek letter mu) and is calculated the same way as the
sample mean except that all elements in the population are measured.
The mean requires at least interval scaled data which means it is only valid for
true numeric measures. The mean is often referred to as the gravitational center of the
data set which is similar to the balancing point of the data. If equal weights were
placed on a scale representing a number line for each observation in a data set, the mean
would be the point at which the scale balances. Since each observation has an equal
weight, the magnitude of the values influences the mean. The mean, while certainly the
most commonly used measure of central tendency, is not always a good measure of
typical. For instance, data sets that include extreme values relative to the rest of the
data pull the mean in that direction. Extremely small values cause the mean to be
small and extremely large values cause the mean to be large. The result is that the
mean is not a good measure of typical and in fact, may be larger or smaller than all
values except the extreme one. When extreme values occur in a data set, we often use
another measure of typical referred to as the median. For instance, a
typical income is often best expressed as the median income rather than the mean income,
since there is a lower limit (zero) but not an upper limit on income.
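The pull of an extreme value on the mean can be checked with a small sketch. The income figures (in thousands of dollars) and the function name `sample_mean` are hypothetical and chosen only for illustration.

```python
def sample_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


# Hypothetical incomes in thousands of dollars.
incomes = [30, 32, 35, 38, 40]
with_extreme = incomes + [500]  # add one extremely large value

print(sample_mean(incomes))       # 35.0
print(sample_mean(with_extreme))  # 112.5

# How many values lie below the mean once the extreme value is included?
below = sum(1 for x in with_extreme if x < sample_mean(with_extreme))
print(below)  # 5
```

With the extreme value included, the mean is larger than every value except the extreme one, so it no longer describes a typical income.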
The median is the second most commonly used measure of central tendency and
is referred to as the positional average. The median is the center value in an ordered
data set. If the data set has an odd number of observations then the median is the value
found in the center of the distribution of ordered values. If the sample set has an even
number of values then the median is the mean of the two values surrounding the center of
the data set. The median is also P50, the fiftieth percentile. This means that 50% or half
of the values are smaller than the median and half of the values or 50% are greater than
the median. The procedure for finding the median is:
1. Order the data set from smallest to largest (or largest to smallest). NOTE:
this requires that the data can be ordered so the median cannot be found for
nominal data.
2. Find i, which is the location or position of the median. This position can be
found using $i = \frac{n + 1}{2}$, where n is the size of
the sample.
3. If i is an integer then the median is the value found at the ith position in the
ordered data set. If i is not an integer, then the median is the mean of the two
values surrounding the ith position.
The median is often denoted as M or $\tilde{x}$.
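The three-step procedure for the median can be sketched as follows. The function name `median_by_position` and the example values are hypothetical; the logic follows the steps listed above, with positions counted from 1.

```python
def median_by_position(values):
    """Find the median using the position i = (n + 1) / 2."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)      # step 1: order the data
    n = len(ordered)
    i = (n + 1) / 2               # step 2: position of the median
    if i == int(i):               # step 3: i is an integer
        return ordered[int(i) - 1]
    # i is not an integer: average the two values surrounding position i
    lower = ordered[int(i) - 1]
    upper = ordered[int(i)]
    return (lower + upper) / 2


print(median_by_position([7, 1, 3, 9, 5]))  # 5
print(median_by_position([7, 1, 3, 9]))     # 5.0
```

Python's standard library offers `statistics.median`, which gives the same results for these inputs; the version here only makes the positional steps explicit.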
The last of the more common Measures of Central Tendency is called the mode.
The mode is the most commonly occurring value in a data set, in other words, the value
that occurs with the greatest frequency. The mode, unlike either the mean or the median,
does not have to be unique. A data set can have more than one mode or no mode at all. A
data set with: one mode is referred to as unimodal; two modes is referred to as bimodal;
and three or more modes is referred to as multimodal. There is no universal notation for
the mode and the mode is valid for any type of data.
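A short sketch of finding the mode or modes by counting frequencies is given below. The function name `modes` is hypothetical, and treating a data set in which every value occurs exactly once as having no mode is an assumed convention, not something stated in these notes.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if several distinct values each occur once,
    # the data set is reported as having no mode.
    if highest == 1 and len(counts) > 1:
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 2, 3, 3]))        # [2, 3]  (bimodal)
print(modes([1, 2, 3]))              # []      (no mode)
print(modes(["S", "M", "M", "L"]))   # ['M']   (works for qualitative data)
```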
Besides a measure of typical, what else might we want to know about a data
set? Do the measures of central tendency tell us all we need to know about the
observations we have collected? Certainly not; in fact, two data sets could have the same
mean and be completely different in terms of dispersion. Suppose we know the
mean depth of a lake where we plan our next office picnic. If the mean depth of
the lake is 4 feet, is this all we need to know about the depth of this lake? No. We need
to know how much the values (depths) vary around 4 feet. The depth of the lake could
be 4 feet at every point and have a mean of 4 feet or the depth of the lake could vary
greatly around four feet and still have a mean of 4 feet. There could be places where the
depth is a few inches and other places where the depth is 10 feet. This information about
how the data are dispersed is very important (especially for those of us who cannot
swim). The study of statistics could appropriately be referred to as the study of
variability since many of the techniques employ the comparison of the variability of
typical values in different groups to determine whether or not these values are the same or
different between groups.
spread) are attempts to describe how spread out, or how much the values vary, in a
particular data set. All measures of data variation or dispersion require quantitative
data to calculate and are nonnegative. The measures of data variation are zero (if all the
values are equal) or positive. A large measure of spread indicates a more dispersed
data set while a small measure indicates a more tightly grouped data set.
The easiest measure of spread to calculate is the range. The range is the
difference between the largest or maximum value and the smallest or minimum value.
The notation and formula for the range is: $R = H - L$, where H is the largest or
maximum value and L is the smallest or minimum value. The range, while simple to
calculate, is only informative if it is small. Small and large are relative terms and
must be determined relative to the magnitude of the values measured. For instance, a
range of $3 for dinner could be characterized as small if we are eating at a five-star
restaurant in a pricey hotel in New York City where the dinner entrees range in price from
$12.00 to $35.00 but may be characterized as large if we are eating at a local fast-food
restaurant. If the range is small it means that the two extreme values are very close to
each other, so the rest of the values must also be tightly grouped. If the range is large
we know that the extreme values are a long way from each other but we know nothing
about the distribution of the rest of the observations. Since the range only uses two
values in its calculation, we are provided with limited information.
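The limited information carried by the range can be seen in a small sketch. Both data sets below are hypothetical; they share the same range even though their middle values are distributed very differently.

```python
def value_range(values):
    """Return R = H - L, the largest value minus the smallest value."""
    return max(values) - min(values)


# Two hypothetical data sets with the same extremes.
tight = [1, 5, 5, 5, 5, 5, 9]    # most values sit at the center
spread = [1, 1, 2, 5, 8, 9, 9]   # values pile up near the extremes

print(value_range(tight))   # 8
print(value_range(spread))  # 8
```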
Like our favorite measure of central tendency, the mean, we might like to come
up with a measure of variability that incorporates all the values in the data set as opposed
to using only the two values needed to calculate the range. We might be interested in
finding out, on the average, how much the values vary around a typical value. In an
effort to describe the variability of a data set we could measure the distance each value is
from the mean, our standard measure of typical. The distance a value is from the mean
is called the deviation from the mean and is found by subtracting the mean from a
particular value. This deviation from the mean can be negative (if the value is smaller
than the mean), positive (if the value is bigger than the mean), or zero (if the value is
equal to the mean). To calculate the average deviation from the mean, we could sum
the deviations from the mean for each value in the data set and divide by the number of
observations in our sample. Unfortunately, although a good idea intuitively, this value
will always be zero since the mean is the gravitational center of the data set and as a
result, the deviations from the mean sum to zero, and so the average deviation is
$\frac{\sum (x - \bar{x})}{n} = 0$. The deviations from
the mean that are negative offset the deviations from the mean that are positive. We can
avoid this problem by using the absolute value or square of the deviations from the mean.
The Mean Absolute Deviation (MAD) is the sum of the absolute deviations
from the mean divided by the sample size: $\text{MAD} = \frac{\sum |x - \bar{x}|}{n}$.
The MAD is used in
financial analysis to determine the variability in stock prices from the expected price.
Unfortunately, while the MAD is the best measure of spread for descriptive purposes, it
is not useful for inferential statistics since the distribution of an absolute value function is
not smooth.
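A brief sketch can confirm that the deviations from the mean sum to zero and show how the MAD avoids the problem. The data values and function names are hypothetical.

```python
def deviations_from_mean(values):
    """Return each value minus the mean of the data set."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


def mean_absolute_deviation(values):
    """Return the sum of the absolute deviations divided by n."""
    deviations = deviations_from_mean(values)
    return sum(abs(d) for d in deviations) / len(values)


data = [2, 4, 6, 8, 10]               # hypothetical observations, mean = 6
print(deviations_from_mean(data))     # [-4.0, -2.0, 0.0, 2.0, 4.0]
print(sum(deviations_from_mean(data)))  # 0.0
print(mean_absolute_deviation(data))  # 2.4
```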
The sample variance, denoted $s^2$, is the sum of the squared deviations from the
mean divided by the sample size less one (n - 1). Continuing our effort to find an average
deviation from the mean, we square the deviations from the mean to eliminate any
negative values so our numerator is not equal to zero, and then divide by the sample size
less one. Our denominator is made smaller (hence our variance is made larger) as an
adjustment to our estimate for the true population variance, denoted $\sigma^2$ (sigma squared),
since we calculate the sample variance, $s^2$, using the sample mean, $\bar{x}$, instead of the true
population mean, $\mu$ (mu). The true measure of variability for the population should be
calculated according to each value's distance from $\mu$, the population mean. The
adjustment in the denominator makes our estimate larger than without the adjustment to
account for the estimate ($\bar{x}$) used in the numerator. Since we would prefer to have a
small measure of variability because this indicates that the mean, $\bar{x}$, is a good measure
of typical since most of the values are close to the mean, adjusting our estimate for
the variance to be larger is considered to be conservative. We are unsure of the true value
of the mean so we use the value of the sample mean to estimate the variability in the data.
The deviations from the mean are estimated using deviations from the sample mean. It is
said that we lose one degree of freedom (df) in the denominator for every estimate in the
numerator. All variances are of the form: sum of squares divided by degrees of
freedom.
The problem with the variance is that the value is in squared units. For instance,
if we are measuring the dollar amount spent on lunch, the variance will be in dollars
squared. Since squared units make interpretation difficult, we normally take the square
root of the variance to return to the original units of measurement. The positive square
root of the sample variance, $s^2$, is the sample standard deviation, s. The sample
standard deviation, s, is our estimate for the true population standard deviation,
denoted $\sigma$ (sigma), which is the positive square root of the population variance, $\sigma^2$. The
definitional formula for the sample variance, $s^2$, is given below followed by an algebraic
manipulation which we call the computational formula. The computational formula is
easier and faster to calculate but intuitively the definitional formula makes more sense as
our estimate of the average (squared) deviation from the mean.
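Definitional formula:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Computational formula:
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$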
Although we rarely calculate parameters, the following formulae are given for the
population variance and the population standard deviation.
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$$
$$\sigma = \sqrt{\sigma^2}$$
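The definitional and computational formulas for the sample variance, along with the population variance, can be compared in a short sketch. The function names and the data set are hypothetical; the point is only that both forms of the sample variance give the same answer.

```python
import math


def sample_variance_definitional(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_computational(values):
    """(sum of x^2 minus (sum of x)^2 / n), divided by n - 1."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N


data = [2, 4, 6, 8, 10]  # hypothetical observations
print(sample_variance_definitional(data))             # 10.0
print(sample_variance_computational(data))            # 10.0
print(math.sqrt(sample_variance_definitional(data)))  # sample standard deviation, s
print(population_variance(data))                      # 8.0
```

Dividing by n - 1 instead of n makes the sample variance (10.0) larger than the value obtained when the same five numbers are treated as a whole population (8.0).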
Chebyshev's Theorem states that, for any k greater than 1, within k standard deviations of the mean at least
$\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the values will fall. Since Chebyshev's Theorem applies to any distribution
regardless of shape, the information learned is less specific than we might like. For
example, using the formula with k = 2, we would discover that at least 75% of the observations (in
any distribution) lie within 2 standard deviations of the mean. This means that 75% to 100% of the values will fall within two standard deviations of the mean. While some
information is better than none, we would like to be more precise in our estimate of this
percentage. For certain known distributions, we can more precisely estimate the
percentage of values that lie within one, two or three standard deviations of the mean.
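The Chebyshev bound can be computed directly and checked against a small data set. This is a sketch under assumptions: the function names and data are hypothetical, and the check uses the population standard deviation (`statistics.pstdev`) of the data itself.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    center = mean(values)
    spread = pstdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)


data = [1, 2, 3, 4, 100]  # hypothetical, deliberately skewed data
print(chebyshev_lower_bound(2))                               # 0.75
print(fraction_within(data, 2) >= chebyshev_lower_bound(2))   # True
```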
The kth percentile, $P_k$, is that value which is equal to or greater than k% of the
observations and is less than or equal to the remaining (100 - k)% of the observations.
The procedure for calculating the kth percentile is:
1. Order the data from smallest to largest value.
2. Find $\frac{nk}{100}$, where n is the sample size and k is the percentile you are
calculating.
3. (a) if $\frac{nk}{100}$ is not an integer, then i, the position of the kth percentile, will be
the next larger integer. For example, if $\frac{nk}{100}$ = 4.5 then i = 5.
(b) if $\frac{nk}{100}$ is an integer, then i, the position of the kth percentile, will be
$\frac{nk}{100}$ + 0.5. For example, if $\frac{nk}{100}$ = 6 then i = 6.5.
4. (a) if i is an integer (3a above) then the kth percentile is the value found at the
ith position. For example, in 3a above, i = 5, so the kth percentile is the 5th
value in the ordered data set.
(b) if i is not an integer (3b above) then the kth percentile is the mean of the two
values surrounding the ith position. For example, in 3b above, i = 6.5, so
the kth percentile is the mean of the sixth and seventh values in the ordered
data set.
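The percentile procedure above can be sketched as a small function. The name `percentile` and the example data are hypothetical, and the sketch assumes k is a whole number strictly between 0 and 100 so that the position always falls inside the data set.

```python
def percentile(values, k):
    """Return the kth percentile using the position rule nk/100."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)               # step 1: order the data
    n = len(ordered)
    whole, remainder = divmod(n * k, 100)  # step 2: nk/100 as quotient and remainder
    if remainder != 0:
        # step 3a and 4a: nk/100 is not an integer, so round up to position i
        i = whole + 1
        return ordered[i - 1]
    # step 3b and 4b: nk/100 is an integer, so i = nk/100 + 0.5 and the
    # percentile is the mean of the values at positions nk/100 and nk/100 + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


data = list(range(1, 11))    # hypothetical ordered data: 1, 2, ..., 10
print(percentile(data, 45))  # nk/100 = 4.5, so i = 5   -> 5
print(percentile(data, 60))  # nk/100 = 6,   so i = 6.5 -> 6.5
print(percentile(data, 50))  # the median               -> 5.5
```

Other textbooks and software packages use slightly different position rules, so results from library functions may differ from this procedure for small data sets.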
Sometimes, instead of being interested in what data point has a certain percentage
above it or below it, researchers are interested in determining the value that is typical
for the center group of values. For example, suppose we are charged with the
the first and third quartiles. The formula for the MQR is: $\text{MQR} = \frac{Q_1 + Q_3}{2}$.
Another measure of position or location is called the Z-score or Z value. The
Z-score for a particular value in a data set indicates the number of standard deviations
that value is from the mean. Z-scores can be negative (if the value is less than the
mean), positive (if the value is larger than the mean), or equal to zero (if the value is
equal to the mean). The Z-score for the mean is always zero. For example, a value with
a Z-score of 1.35 is 1.35 standard deviations above the mean. A value with a Z-score of
-2.12 is 2.12 standard deviations below the mean.
Z-values can be calculated, and a Standard Normal Table used, to determine
approximately what proportion of the values, for a normal distribution, are above or
below a particular value, or between two values in a distribution.
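A Z-score is the value minus the mean, divided by the standard deviation. The sketch below assumes a hypothetical mean of 100 and standard deviation of 25, and uses `statistics.NormalDist` from the Python standard library in place of a printed Standard Normal Table.

```python
from statistics import NormalDist


def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


mean, std_dev = 100, 25                 # hypothetical mean and standard deviation
print(z_score(133.75, mean, std_dev))   # 1.35  (above the mean)
print(z_score(47, mean, std_dev))       # -2.12 (below the mean)
print(z_score(100, mean, std_dev))      # 0.0   (the mean itself)

# Proportion of a normal distribution that lies below Z = 1.35.
below = NormalDist().cdf(1.35)
print(round(below, 4))                  # 0.9115
```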
Frequency Distributions
Terminology:
Defn: The frequency, f, for a value or a class of values is the number of times
that value or class of values occurs in the data set.
We are simply counting how often a value or set of values occurs in the data set.
1. What is the minimum number of times a value or class of values occur(s) in a data
set? The minimum number of times a value or class of values can occur is zero
(0). What is the maximum number of times a value or class of values can occur in
the data set? The maximum number of times a value or class of values can occur
in the data set is n, or the total number of values in the data set.
$0 \le f \le n$
2. If we add the frequencies for each value or set of values it will sum to n.
$\sum f = n$
Defn: The relative frequency, f/n, for a value or a class of values is the
proportion of time that a value or class of values occurs in the data set. It is
found by dividing how often the value occurs by the total number of
observations.
1. What is the minimum proportion of time a value or class of values occur(s) in a
data set? The minimum proportion of time a value or class of values can occur is
zero (0). What is the maximum proportion of time a value or class of values can
occur in the data set? The maximum proportion of time a value or class of values
can occur in the data set is one (1).
$0 \le f/n \le 1$
2. If we add the relative frequencies for each value or set of values it will sum to one
(1).
$\sum f/n = 1$
Defn: The cumulative frequency, F, for a value or a class of values is the
number of times that value or any smaller value occurs in the data set.
We are simply keeping a running total.
1. Cumulative frequencies are non-decreasing (this means the values cannot
decrease; they can level off but they can't go down).
2. The cumulative frequency for the last value or class of values is n.
3. We must have at least ordinal scaled data to find cumulative frequencies.
Defn: The cumulative relative frequency, F/n, for a value or a class of values
is the proportion of time that value or any smaller value occurs in the data set.
We are simply keeping a running total of relative frequencies or proportions.
1. Cumulative relative frequencies are non-decreasing.
2. The cumulative relative frequency for the last value or class of values is one (1).
3. We must have at least ordinal scaled data to find cumulative relative frequencies.
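The four quantities defined above (f, f/n, F, and F/n) can be built together in one pass. This is a minimal sketch: the function name `frequency_table` and the data are hypothetical, and the values are assumed to be sortable (at least ordinal) so that the running totals make sense.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, f, f/n, F, F/n) in increasing order of value."""
    counts = Counter(values)
    n = len(values)
    rows = []
    running_total = 0                  # cumulative frequency F
    for value in sorted(counts):
        f = counts[value]
        running_total += f
        rows.append((value, f, f / n, running_total, running_total / n))
    return rows


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]  # hypothetical observations, n = 10
for row in frequency_table(data):
    print(row)
# The last row is (5, 2, 0.2, 10, 1.0): F ends at n and F/n ends at 1.
```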