Stat 1602 Business Statistics Fall 2012-2013
• The word “Statistics” is derived from the Latin word for “the state” as the first
important accumulation of data was for the purposes of the state.
• While the science of statistics was being studied in Germany, the words
“Statistics” and “Statistical” were introduced into the English language around
1787 by Eberhard A.W. von Zimmermann (1743-1815).
Elements of Statistics
Statistics
  Data Collection
    Survey Sampling
    Experimental Design
    Observational Study
  Data Analysis
    Descriptive Statistics and Statistical Graphics
    Statistical Inference
      Estimation
        Point
        Interval
      Testing Hypothesis
Statistics Users
Terminology
Instrument: device used to assist in the production of a number from the object
[Diagram: Unit 2, ..., Unit n]
Example 1.1
Student records
Dataset : student_record.dat
1. Run Microsoft® Excel 07.
3. Change directory and select the data file student_record.dat. You may need
to first change the Files of type into All Files (*.*) to show all files.
4. In the Text Import Wizard, select Delimited as the Original data type.
Click on Next.
5. Select both Tab and Space as the Delimiters. This specifies how the data
are separated in each row of the original data file. The Data preview
window shows the input data set. Clicking Finish loads the data into
the worksheet. (Clicking Next instead lets you set the Data Format of each column.)
6. From the Office Button menu, select Save as .... You can save the worksheet
in Excel Workbook (*.xlsx) format so that the data format of each variable
is stored too. You can also save the worksheet in Excel 97-2003
Workbook (*.xls) format so that the file is compatible with older
versions of Excel.
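For readers working outside Excel, the same import step can be sketched in Python. This is only an illustration: the sample rows and column names below are hypothetical stand-ins for student_record.dat, and the sketch assumes that consecutive tabs or spaces count as a single delimiter.

```python
import io


def read_delimited(handle):
    """Split each non-empty line on tabs and spaces, returning a list of rows."""
    rows = []
    for line in handle:
        fields = line.split()  # no argument: split on any run of whitespace
        if fields:
            rows.append(fields)
    return rows


# Hypothetical stand-in for student_record.dat; the column names are assumptions.
sample = io.StringIO("id\tyear overall\n1001\t1 71.5\n1002\t2 64.0\n")
header, *records = read_delimited(sample)
print(header)
print(records)
```

With a real file, `open("student_record.dat")` would take the place of the in-memory sample.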
2. Ordinal Scale
• Also tells when one unit has more of the property than does another unit, e.g.
grade, attitude (disagree, agree, strongly agree).
Quantitative Scales
3. Interval Scale
• Also tells us that one unit differs by a certain amount of the property from
another unit, e.g. temperature in degrees Celsius, altitude of a place, time of
occurrence of an event.
• Zero is just a reference point.
4. Ratio Scale
• Tells us that one unit has so many times as much of the property as does
another unit, e.g. height, weight.
• It has a meaningful zero: a value of zero means that none of the property is present.
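The difference between an interval scale and a ratio scale can be checked numerically. The sketch below is an illustration only: the two temperatures are made-up values, and the Kelvin conversion is brought in here (it is not part of these notes) as a temperature scale with a meaningful zero.

```python
def celsius_to_kelvin(celsius):
    """Shift the reference point from the Celsius zero to absolute zero."""
    return celsius + 273.15


# Two hypothetical temperatures in degrees Celsius (interval scale).
a_c, b_c = 20.0, 10.0

# Differences survive the change of reference point ...
diff_celsius = a_c - b_c
diff_kelvin = celsius_to_kelvin(a_c) - celsius_to_kelvin(b_c)

# ... but ratios do not, so "twice as hot" is not meaningful in Celsius.
ratio_celsius = a_c / b_c
ratio_kelvin = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)

print(diff_celsius, diff_kelvin)
print(ratio_celsius, ratio_kelvin)
```

The differences agree on both scales while the ratios do not, which is why ratios are only interpreted on a ratio scale.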
The distribution of a quantitative variable gives the user a rough idea of how
many units have a value of the variable falling in a certain range. It can be
represented by a histogram, a boxplot, or summary statistics.
A first glance at the above histograms gives the following impressions.
The distribution of scores in 00/01-02/03 lies to the right of the
distribution of scores in 03/04. On the other hand, the scores in 03/04 look more
symmetric and are spread out more widely.
Such comparisons can also be done through the calculation of some statistics,
which represent the location and spread of the distributions.
Suppose we have a set of data, {9, 16, 11, 19, 11, 10, 13, 12, 6, 9} which are the
sales (in $m) of 10 furniture companies in a particular year. There are several
measures of central location.
At this point it is necessary to introduce some symbols. The above ten individual
items of data (X) will be designated \(X_1, X_2, \ldots, X_{10}\), and the number of these items
of data by \(n\). The mean \(\bar{X}\), pronounced ‘x-bar’, is then computed by
\[ \bar{X} = \frac{\sum X}{n} \]
where Σ, ‘sigma’, is the summation sign and is simply a command to add up all the
10 x-values, i.e.
\[ \sum X = 9 + 16 + 11 + 19 + 11 + 10 + 13 + 12 + 6 + 9 = 116 \]
\[ \bar{X} = \frac{\sum X}{n} = 116 \div 10 = 11.6 \]
Hence the mean sales of these ten companies is 11.6 million dollars.
Similarly, the mean overall scores of the 706 students in example 1.1 is
The sum and mean can be computed by using the Excel functions:
\(\sum X\): @SUM(data range)
\(\bar{X}\): @AVERAGE(data range)
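The same sum and mean can be reproduced outside Excel. The following is a minimal Python sketch using the ten sales figures from the text; the variable names are arbitrary.

```python
# Sales (in $m) of the ten furniture companies.
sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]

total = sum(sales)         # plays the role of @SUM(data range)
mean = total / len(sales)  # plays the role of @AVERAGE(data range)

print(total, mean)
```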
B. Median
Arrange the data in ascending order. The median is the number with middle rank,
i.e.
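\[ \text{median} = \begin{cases} \left(\frac{n+1}{2}\right)\text{th number} & \text{if } n \text{ is odd} \\ \text{average of } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2}+1\right)\text{th numbers} & \text{if } n \text{ is even} \end{cases} \]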
For these ten sales of companies, {9, 16, 11, 19, 11, 10, 13, 12, 6, 9}, the sorted
dataset will be
6 9 9 10 11 11 12 13 16 19
and the median is the average of the two middle numbers, which is
\((11 + 11)/2 = 11\) million dollars.
Using Excel function @MEDIAN(data range), the median overall score of all the
706 students in the student record dataset is found to be 71.9. Comparing this
figure with the mean, we can see that both measures of location give quite close
results. However, this is not always the case.
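The odd and even cases of the median rule can be written out directly. The sketch below is one straightforward way to code it in Python; the function name is arbitrary, and the standard library's statistics.median follows the same rule.

```python
def median(values):
    """Middle-ranked value; the average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the ((n + 1) / 2)th number, counting from 1
    # average of the (n / 2)th and (n / 2 + 1)th numbers, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
print(median(sales))
```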
Example 1.2
Monthly income of five staff members (in $1000) : 13, 21, 23, 32, 35
Mean = 24.8 Median = 23
If 35 is mistyped as 135, then the data will become : 13, 21, 23, 32, 135
Mean = 44.8 Median = 23
Hence it is possible to have very different values of mean and median. From this
example we can also see that the mean is more sensitive to extreme values than the
median.
In general, the median is best for describing highly skewed (not symmetric)
distributions, or when there are one or more outlying values whose validity is
suspect. Otherwise, the mean is preferred, as it uses all the data and is easier
to study.
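The sensitivity of the mean to a single extreme value, as in Example 1.2, can be checked with a short Python sketch using the standard statistics module.

```python
import statistics

incomes = [13, 21, 23, 32, 35]    # monthly income in $1000
mistyped = [13, 21, 23, 32, 135]  # 35 entered as 135 by mistake

for label, data in (("correct", incomes), ("mistyped", mistyped)):
    print(label, statistics.mean(data), statistics.median(data))

# How far each measure moves because of the one mistyped value.
mean_shift = statistics.mean(mistyped) - statistics.mean(incomes)
median_shift = statistics.median(mistyped) - statistics.median(incomes)
print(mean_shift, median_shift)
```

Only the mean moves; the median depends on the middle rank alone and is unchanged.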
The above histograms show the scores (based on a questionnaire of ten five-point-
scale items) at five-star hotels rated by 127 male guests and 114 female guests. The
locations of both distributions are more or less the same. However, the scores
given by male guests are spread out more widely than the scores given by female
guests. That means the variation in ratings is larger among male guests than among
female guests.
There are several ways to assess variation, but by far the most useful is the statistic
known as the standard deviation, which is given the symbol S. The formula for
calculating the standard deviation is
\[ S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
Example 1.3
Sales of ten furniture companies: 9, 16, 11, 19, 11, 10, 13, 12, 6, 9
\(\bar{X} = 11.6\)
\[ \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{124.4}{9} = 13.8222, \qquad S = \sqrt{13.8222} = 3.7178 \]
@STDEV(data range)
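The calculation in Example 1.3 can be followed step by step in Python. This is a sketch of the definition with the n − 1 divisor, not of Excel's internal method; the variable names are arbitrary.

```python
import math

sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]

n = len(sales)
mean = sum(sales) / n
sum_sq_dev = sum((x - mean) ** 2 for x in sales)  # sum of squared deviations
variance = sum_sq_dev / (n - 1)                   # divide by n - 1, not n
s = math.sqrt(variance)

print(round(sum_sq_dev, 1), round(variance, 4), round(s, 4))
```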
Example 1.4
Hotel Rating Data
Dataset : hotel_scores.dat
1. Read the data into an Excel worksheet. The first column will contain the
gender (M/F) of the guests and the second column will contain the scores
given by the guests.
3. To compute the standard deviation for male scores only, we can use the
array formula:
@STDEV(IF(A2:A242="M",B2:B242))
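The idea behind the array formula, keeping only the rows whose gender is "M" before computing the standard deviation, can be sketched in Python. The six records below are hypothetical and are not taken from hotel_scores.dat.

```python
import statistics

# Hypothetical (gender, score) records; not the hotel_scores.dat data.
records = [("M", 30), ("F", 35), ("M", 42), ("F", 37), ("M", 36), ("F", 36)]


def stdev_for(records, gender):
    """Sample standard deviation of the scores given by one gender."""
    scores = [score for g, score in records if g == gender]
    return statistics.stdev(scores)


male_sd = stdev_for(records, "M")
female_sd = stdev_for(records, "F")
print(male_sd, female_sd)
```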
Remarks
\[ S = \sqrt{\frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}}. \]
Example 1.5
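Sales ($m)
   X     X^2
   9      81
  16     256
  11     121
  19     361
  11     121
  10     100
  13     169
  12     144
   6      36
   9      81
\(\sum X = 116\), \(\sum X^2 = 1470\)
\[ \sum X^2 - \frac{\left(\sum X\right)^2}{n} = 1470 - \frac{116^2}{10} = 124.4 \]
\[ S^2 = \frac{124.4}{10 - 1} = 13.8222 \]
\[ S = \sqrt{13.8222} = 3.7178 \]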
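The shortcut calculation of Example 1.5 needs only the two column totals. The Python sketch below reproduces it with the same ten sales figures; the variable names are arbitrary.

```python
import math

sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]

n = len(sales)
sum_x = sum(sales)                    # total of the X column
sum_x2 = sum(x * x for x in sales)    # total of the X^2 column

corrected_ss = sum_x2 - sum_x ** 2 / n  # equals the sum of squared deviations
variance = corrected_ss / (n - 1)
s = math.sqrt(variance)

print(sum_x, sum_x2, round(corrected_ss, 1), round(s, 4))
```

The shortcut avoids computing each deviation from the mean, which is convenient for hand calculation.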
A. Bar Chart
• Display summarized data where there is no emphasis on the percentage of a
total.
[Figure: bar chart of Mean Overall by Year (1, 2, 3, *), vertical axis from 0 to 90]
B. Pie Charts
• A simple descriptive display of data that sum to a given total.
• Most illustrative way of displaying percentages.
• For nominal or ordinal data.
[Figure: pie chart of grades; surviving labels (grade, count, percentage): A-, 59, 8%; A, 27, 4%; A+, 8, 1%; F, 20, 3%]
C. Dot Diagram
[Figure: dot diagram]
Remarks
1. It is easy to construct.
D. Frequency Distribution
e.g. Table : Flow of vehicles passing through a particular point during an hour.
Example 1.6
Student Record Data
1. Open the student record worksheet.
Remarks
1. Percentages and/or cumulative percentages should be included if they are of
interest to readers. Counts can be omitted provided that they can be recovered if
desired.
E. Histogram
1. Partition the range of data into several intervals (not necessarily of equal
widths).
To construct a histogram using Excel, the Analysis ToolPak should be loaded first:
1. Office Button -> Excel Options
2. Click the Add-ins tab on the left panel. At the Manage menu, select Excel
Add-ins, then click Go...
3. Select Analysis ToolPak and click on OK. (If it is prompted that the
Analysis ToolPak is not installed, install it).
2. Office Button -> Excel Options -> Add-ins -> Excel Add-ins -> Go...
Detailed instructions on the use of this command can be found in the book “Data
Analysis Using Microsoft Excel: Updated for Office 07”, or the following site:
http://www.treeplan.com/BetterHistogram_20041117_1555.htm
Example 1.7
Frequency histogram with equal width classes
1. Open the student record worksheet.
4. Input 0 as the Start Value, 10 as the Step Value, 100 as the Stop Value to
define the intervals.
5. Click OK to construct the graph. Format the graph (color, titles, fonts,
etc) as appropriate.
[Figure: frequency histogram of Overall, classes of width 10 from 0 to 100; vertical axis: Frequency, from 0 to 250]
From this histogram we can see that the rectangular blocks from 50 to 90 comprise
most of the area. Hence most of the students scored between 50 and 90. The graph is
not symmetric as there is a tail pointing towards the left. It indicates that there are
fewer low-scoring students than high-scoring students, relative to the average student.
This kind of deviation from symmetry is called skewness. Since the tail of the
histogram points towards the left hand side, the distribution is said to be skewed to the
left, or negatively skewed.
Since the areas represent frequencies, formally the scale of the y-axis should be
frequency per unit score rather than just frequency (height = area ÷ width). The area
under the whole distribution then represents the total number of observations, i.e. 706.
Usually the y-scale is further adjusted to make the total area equal to 1. Such a
y-scale is called the density.
Changing the y-axis scale makes no difference to these two histograms because,
with equal class widths, the heights of the rectangular blocks are directly proportional to
the areas. However, one must use density as the y-axis scale whenever there are
unequal class widths, as illustrated by the following examples.
Example 1.8
Frequency histogram with unequal width classes
(Incorrect construction of histogram)
The shape of the distribution is totally distorted. It gives the incorrect
impression that many more students scored 35 to 70 than 70 to 90.
Example 1.9
Density histogram with unequal width class
This histogram provides a more detailed description of the data distribution from 70 to
90, yet the overall shape is preserved. Preserving the perceived shape of the
distribution is the main reason for using area, rather than height, to represent
frequencies.
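The rule height = area ÷ width can be made concrete with a small Python sketch. The class boundaries and counts below are hypothetical (they are not the student record data), and the heights are scaled so that the total area equals 1.

```python
def density_heights(edges, counts):
    """Height of each block so that its area equals its relative frequency."""
    total = sum(counts)
    widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]
    return [count / (total * width) for count, width in zip(counts, widths)]


# Hypothetical unequal-width classes: [0,35), [35,70), [70,80), [80,90), [90,100]
edges = [0, 35, 70, 80, 90, 100]
counts = [10, 50, 20, 15, 5]

heights = density_heights(edges, counts)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:])]

print([round(h, 4) for h in heights])
print(round(sum(areas), 6))
```

In this made-up example the class 35 to 70 has the larger count, but the class 70 to 80 has the taller block once width is taken into account, which is the effect plotting raw counts would hide.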
We saw from the above histogram that a given class of results, say those lying
between 60 and 70, made up about 0.02 × (70 − 60) = 20% of the total. This 20% is
called the relative frequency of scores between 60 and 70. If the population of all
students (including those in future semesters) has more or less the same
distribution as this dataset, we may infer that there is about a 20% chance of
observing a student scoring in this range. The process of statistical inference (with
suitable assumptions) allows us to equate, in a numerical fashion, the relative
frequency of past events with the probability of future events.
Remarks
F. Percentile
In a dataset of n observations, the (100p)th percentile (0 < p < 1) has approximately np
observations less than or equal to it and approximately n(1-p) observations greater than it.
e.g. Test scores of ten students : 68, 75, 58, 47, 83, 34, 90, 71, 63, 79
Sorted scores : 34, 47, 58, 63, 68, 71, 75, 79, 83, 90
There are 3 students having scores less than or equal to 58. Therefore the datum 58
is a 30th percentile. Note that under the above definition, all the values between 58
and 63 are also 30th percentiles.
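The counting behind this definition can be checked directly. The Python sketch below counts the observations at or below, and above, a few candidate values; the helper name is arbitrary.

```python
scores = [68, 75, 58, 47, 83, 34, 90, 71, 63, 79]


def counts_around(data, value):
    """Return (number of observations <= value, number of observations > value)."""
    at_or_below = sum(1 for x in data if x <= value)
    above = sum(1 for x in data if x > value)
    return at_or_below, above


n, p = len(scores), 0.30
print("targets:", round(n * p), round(n * (1 - p)))
for candidate in (58, 60, 63):
    print(candidate, counts_around(scores, candidate))
```

Both 58 and 60 leave 3 observations at or below and 7 above, matching np and n(1-p). At 63 itself the count at or below rises to 4, so the tie ends at that boundary.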
Let \(X_{(i)}\) denote the ith smallest value (such that \(X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}\)). Let r and
f be the integer part and fractional part of \((n + 1)p\) respectively. The following
definition provides a formula to compute percentile uniquely:
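One common rule built on r and f interpolates linearly between \(X_{(r)}\) and \(X_{(r+1)}\), giving \(X_{(r)} + f\,(X_{(r+1)} - X_{(r)})\). The Python sketch below implements that rule as an assumption; it may differ in detail from the definition intended in these notes, and the handling of positions outside 1..n is a choice made here.

```python
import math


def percentile_n_plus_1(data, p):
    """Assumed rule: X(r) + f * (X(r+1) - X(r)), with r and f taken from (n + 1) * p."""
    ordered = sorted(data)
    n = len(ordered)
    position = (n + 1) * p
    r = math.floor(position)  # integer part
    f = position - r          # fractional part
    if r < 1:                 # position falls before the smallest value (a choice made here)
        return ordered[0]
    if r >= n:                # position falls at or beyond the largest value
        return ordered[-1]
    return ordered[r - 1] + f * (ordered[r] - ordered[r - 1])


scores = [68, 75, 58, 47, 83, 34, 90, 71, 63, 79]
print(round(percentile_n_plus_1(scores, 0.30), 2))
print(round(percentile_n_plus_1(scores, 0.50), 2))
```

Under this rule the 30th percentile of the ten test scores is a single value between 58 and 63, and the 50th percentile coincides with the median.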
In a relative frequency histogram, the 100pth percentile is the cutoff point on the x-
axis such that the total area of the histogram on the left of this cutoff point is equal
to 100p percent.
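\[ (100p)\text{th percentile} = L_p + \frac{d\,(np - F_{p-1})}{f_p} \]
[Figure: the class interval from \(L_p\) to \(L_p + d\), of width \(d\) and frequency \(f_p\), with the (100p)th percentile at distance \(a\) above \(L_p\)]
\[ \frac{a}{np - F_{p-1}} = \frac{d}{f_p} \;\Rightarrow\; a = \frac{d\,(np - F_{p-1})}{f_p} \]
\[ (100p)\text{th percentile} = L_p + a = L_p + \frac{d\,(np - F_{p-1})}{f_p} \]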
e.g. Table : Frequency table for 20 grain bullet penetration depths into oak wood
from a distance of 15 feet.
For \(p = 0.75\): \(np = 15\), \(L_p = 64\), \(F_{p-1} = 14\), \(f_p = 3\), \(d = 2\),
\[ 75\text{th percentile} = 64 + \frac{2\,(15 - 14)}{3} = 64.67 \]
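The grouped-data formula can be evaluated with a few lines of Python. The function and parameter names below are illustrative choices; the numbers are those quoted for the 75th percentile.

```python
def grouped_percentile(n, p, lower, cum_before, freq, width):
    """L_p + d * (n*p - F_{p-1}) / f_p for the class containing the percentile.

    lower      -- L_p, lower boundary of that class
    cum_before -- F_{p-1}, cumulative frequency of all classes below it
    freq       -- f_p, frequency of that class
    width      -- d, width of that class
    """
    return lower + width * (n * p - cum_before) / freq


value = grouped_percentile(n=20, p=0.75, lower=64, cum_before=14, freq=3, width=2)
print(round(value, 2))
```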
G. Boxplot
The 25th, 50th and 75th percentiles cut the data into four pieces and are given the
special names lower quartile, median and upper quartile respectively. These three
values, together with the maximum and minimum, provide a five-number summary
of the data.
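A five-number summary can be sketched in Python with the standard statistics module. The example reuses the ten test scores from the percentile section. Quartile conventions differ between packages, so other software may report slightly different quartiles from the default method used here.

```python
import statistics


def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    # quantiles(..., n=4) returns the three cut points between the four quarters.
    q1, q2, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, q2, q3, max(data)


scores = [68, 75, 58, 47, 83, 34, 90, 71, 63, 79]
summary = five_number_summary(scores)
print(summary)
```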
[Figure: boxplot drawn on an axis from 55 to 75]
Example 1.10
Boxplot of Student Record Data (created by another statistical package)
[Figure: boxplot with the lower quartile, median and upper quartile labelled]
The box comprises the middle half of the dataset, i.e. about 50% of the overall
scores, located at the centre, from 60 to 80.
The stars represent data points that are too extreme compared to the others and are
sometimes regarded as outliers.
Example 1.11
Comparison of several boxplots
We can easily compare the distributions of the data in several groups in just one
graph. For example, the scores of audit (*) and year 2 students are spread out a little
wider than those of year 1 and year 3 students. The distribution of audit
students lies to the right of all the others, which suggests that they performed
better than other students in terms of examination results.
H. Stem-and-Leaf Display
Each data value is split into two components called stem and leaf.
e.g. Data : 78, 65, 90, 86, 79, 51, 79, 62, 84, 101
Stem  Leaf
  5*  1
  6   25        n = 10
  7   899       Leaf unit = 1
  8   46
  9   0
 10*  1
A stem-and-leaf plot is more informative than a histogram as it displays the raw data.
However, it is not suitable for large datasets.
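Splitting each value into a stem and a leaf is mechanical, so it can be sketched in a few lines of Python. The function below assumes non-negative whole-number data with one-digit leaves; its name and output layout are illustrative choices.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit=1):
    """Return display lines with the tens as stems and the units as leaves."""
    groups = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value // leaf_unit, 10)
        groups[stem].append(leaf)
    lines = []
    # Walk every stem in the range so that empty stems still get a row.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
display = stem_and_leaf(data)
print("\n".join(display))
```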
 3+  8
 4*  0014       n = 15
 4+  556779     Leaf unit = 1
 5*  013        * for 0-4
 5+  8          + for 5-9

 1*  0          n = 18, Leaf unit = 1
  t  23
  f  4445       * for 0,1
  s  67777      t for 2,3
  +  889        f for 4,5
 2*  01         s for 6,7
  t  2          + for 8,9
 Group A        Group B
       84   1   122355567
      531   2   0111222346777899
   988421   3   012457               n = 72
    85200   4   11257                Leaf unit = 1
   976540   5   0236
 97655210   6   02
Example 1.12
Stem-and-leaf display of student record data (created by other statistical package)
1 2 2
2 2 6
3 3 2
4 3 5
8 4 0034
19 4 55778888999
33 5 00122233344444
52 5 5556666777888899999
71 6 0011111223333334444
(25) 6 5555566667778888888999999
94 7 00011112233333344444444
71 7 5555566677788888899999
49 8 0000000001111112334444
27 8 555666666777788999
9 9 01112223
1 9 6
The numbers in the first column are the cumulative frequencies of each class
interval, accumulating from two ends. Note that (25) is the frequency of the
median class interval [65,70).
• Statistical tests tend to be more objective than human eyes and are less prone to
deception as long as the corresponding assumptions hold.
Examples 1.13
[Figure: two pie charts with slices A 27%, B 37%, C 18%, D 18%, and two bar charts of categories A, B, C, D]
Example 1.14
Example 1.15
Example 1.16
Misleading alignment of graphs
Example 1.17
"This may well be the worst graphic ever to find its way into print." --- Tufte (1983)
This graph uses colours, 3D effects, disguised redundancy to display just five
numbers. Note the clever use of mirror-imaging -- the top series is just (100
- the bottom series) and the interesting use of curved lines, front and back to avoid
the appearance that there's a lot less here than meets the eye.
A simple bar chart displaying the same set of data is given below:
Shape of histograms:
L shape J shape
Skewness indicates how far the shape of the histogram departs from a symmetric
shape.
Measures of skewness:
\[ \gamma_1 = \frac{\frac{1}{n} \sum (X - \bar{X})^3}{S^3} \]
Another measure of skewness:
\[ \gamma_2 = \frac{\text{mean} - \text{median}}{S} \]
@SKEW(data range)
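Both skewness measures can be evaluated directly from their formulas. The Python sketch below applies them to the ten furniture sales figures used earlier; the function name is arbitrary. Excel's SKEW function applies a sample-size adjustment, so its output can differ slightly from \(\gamma_1\) as defined here.

```python
import statistics


def skewness_measures(data):
    """Return (gamma1, gamma2) computed from the two formulas in the text."""
    n = len(data)
    mean = sum(data) / n
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    gamma1 = (sum((x - mean) ** 3 for x in data) / n) / s ** 3
    gamma2 = (mean - statistics.median(data)) / s
    return gamma1, gamma2


sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
gamma1, gamma2 = skewness_measures(sales)
print(round(gamma1, 4), round(gamma2, 4))
```

Both values are positive for these data, which points to a tail on the right hand side, i.e. a positively skewed distribution.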
Example 1.18
Descriptive Statistics of Student Record Data
1. Open the student record worksheet.
4. Type F1:F707 in the Input Range field. Check the Labels in First Row box.
This will specify the first row as the label of the variable.
6. Click OK. The summary statistics will be computed and listed in a new
worksheet.