Stat 1602 Business Statistics Fall 2012-2013

Prologue Statistics – What is it?


Statistics deals with the collection and analysis of data to solve real-world
problems.

Time          Contributor               Contribution

Ancient       Greek philosophers        Ideas only; no quantitative analyses.
              Babylonians, Egyptians    Collected demographic data for tax collection and
                                        recruitment of military units.

14th Century  European marine           Marine insurance rates were set using data
              companies                 concerning the success of the transportation of goods.

17th Century  Blaise Pascal,            Studied probability through games of chance and
              Pierre de Fermat          gambling.
              Jacob Bernoulli           Proved the law of large numbers: as the number of
                                        observations increased, “the ratio of observed
                                        successful to unsuccessful occurrences will differ
                                        from the true ratio within certain small limits.”

18th Century  Pierre Simon Laplace,     Constructed the normal curve, developed the
              Karl Friedrich Gauss      application of probability ideas to astronomy.
              (general development)     Statistics underwent simultaneous horizontal and
                                        vertical development :
                                        • Horizontal : methods spread among disciplines
                                          including astronomy, geodesy, psychology, biology,
                                          social sciences, etc.
                                        • Vertical : understanding of mathematical probability
                                          theory led to the development of statistical inference.

19th Century  Adolphe Quetelet          Astronomer who first applied statistical analyses to
                                        human biology.
              Sir Francis Galton        Studied genetic variation in humans, using regression
                                        and correlation.

20th Century  Karl Pearson              Studied natural selection using correlation, formed
                                        first academic department of statistics, Biometrika
                                        journal, helped develop the Chi Square analysis.
              William Sealy Gosset      Studied process of brewing, alerted the statistics
              (Student)                 community about problems with small sample sizes,
                                        developed Student's t-test.
              Sir Ronald Fisher         Evolutionary biologist who developed ANOVA,
                                        stressed the importance of experimental design.
              Computer Technology       Provided many advantages over calculations by hand
                                        or by calculator, stimulated the growth of
                                        investigation into new techniques.


• The word “Statistics” is derived from the Latin word for “the state”, as the first
important accumulation of data was for the purposes of the state.

• “Statistik” was probably first used by the German philosopher Gottfried
Achenwall in the middle of the eighteenth century. It referred to “inquiries
respecting the Population, Political Circumstances, the Production of a Country,
and other Matters of State”.

• While the science of statistics was being studied in Germany, the words
“Statistics” and “Statistical” were introduced into the English language around
1787 by Eberhard A.W. von Zimmermann (1743-1815).

Elements of Statistics

Statistics
    Data Collection
        Survey Sampling
        Experimental Design
        Observational Study
    Data Analysis
        Descriptive Statistics and Statistical Graphics
        Statistical Inference
            Estimation
                Point
                Interval
            Testing Hypothesis

Statistics Users

Category 1 – able to understand statistical presentations

Category 2 – able to select and apply statistical procedures to a particular
problem

Category 3 – applied statisticians who help others use statistics on a particular
problem

Category 4 – mathematical statisticians who develop new statistical
techniques and discover new characteristics of old techniques


Chapter I Descriptive Statistics and Statistical Graphics

Terminology

Measurement   process of assigning a number or numerical code to an object

Instrument    device used to assist in the production of a number from the object

Variable      a piece of information that may be expressed as a number,
              changing from item to item

Data          a set of numbers representing records of observations

              Variable 1   Variable 2   ...   Variable m
Unit 1
Unit 2
...
Unit n

Example 1.1
Student records
Dataset : student_record.dat
1. Run Microsoft® Excel 07.

2. Office Button -> Open...

3. Change directory and select the data file student_record.dat. You may need
to first change the Files of type setting to All Files (*.*) to show all files.

4. In the Text Import Wizard, select Delimited as the Original data type.
Click on Next.

5. Select both Tab and Space as the Delimiters. This specifies how the data
are separated in each row of the original data file. The Data preview
window shows the input data set. Clicking Finish loads the data into
the worksheet. (Clicking Next instead lets you set the Data Format of each column.)

6. From the Office Button menu, select Save as .... You can save the worksheet
in Excel Workbook (*.xlsx) format so that the data format for each variable
is stored too. You can also save the worksheet in Excel 97-2003
Workbook (*.xls) format so that the file will be compatible with older
versions of Excel.


§ 1.1 Scales of Measurement


Qualitative Scales
1. Nominal Scale
• Tells only what class a unit falls in with respect to the property, e.g. sex,
nationality, tutorial class.
• The classes are often called categories.
• The categories have no logical order.

2. Ordinal Scale
• Also tells when one unit has more of the property than does another unit, e.g.
grade, attitude (disagree, agree, strongly agree).

Quantitative Scales
3. Interval Scale
• Also tells us that one unit differs by a certain amount of the property from
another unit, e.g. temperature in degrees Celsius, altitude of a place, time of
occurrence of an event.
• Zero is just a reference point.

4. Ratio Scale
• Tells us that one unit has so many times as much of the property as does
another unit, e.g. height, weight.
• It has a meaningful zero: zero really means none of the property.

§ 1.2 Distribution of Data

The distribution of a quantitative variable gives the user a rough idea of the
number of units whose value of the variable falls in a given range. It can be
represented by a histogram, a boxplot, or summary statistics.

A first glance at the above histograms gives the following impressions.
The distribution of scores in 00/01-02/03 is located to the right of the
distribution of scores in 03/04. On the other hand, the scores in 03/04 look more
symmetric and are spread out more widely.

Such comparisons can also be made by calculating statistics that represent
the location and spread of the distributions.

The interpretation and construction of histograms will be described in later sections.

§ 1.3 Measures of Central Tendency or Location

Suppose we have a set of data, {9, 16, 11, 19, 11, 10, 13, 12, 6, 9} which are the
sales (in $m) of 10 furniture companies in a particular year. There are several
measures of central location.

A. Average / Arithmetic Mean

At this point it is necessary to introduce some symbols. The above ten individual
items of data (X) will be designated \( X_1, X_2, \ldots, X_{10} \), and the number of these items
of data by \( n \). The mean \( \bar{X} \), pronounced ‘x-bar’, is then computed by

\[ \bar{X} = \frac{\sum X}{n} \]

where Σ, ‘sigma’, is the summation sign and is simply a command to add up all the
10 x-values, i.e.

\[ \sum X = 9 + 16 + 11 + 19 + 11 + 10 + 13 + 12 + 6 + 9 = 116 \]

\[ \bar{X} = \frac{\sum X}{n} = 116 \div 10 = 11.6 \]

Hence the mean sales of these ten companies is 11.6 million dollars.

Similarly, the mean overall score of the 706 students in Example 1.1 is

\[ \frac{96.9 + 96.5 + 96.5 + \cdots + 28.6 + 26.8 + 22.9}{706} = 69.6 \]

The sum and mean can be computed by using the Excel functions:

\( \sum X \) :    @SUM(data range)

\( \bar{X} \) :    @AVERAGE(data range)

B. Median

Arrange the data in ascending order. The median is the number with middle rank,
i.e.

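\[ \text{median} = \begin{cases} \left( \frac{n+1}{2} \right)\text{th number} & \text{if } n \text{ is odd} \\ \text{average of the } \left( \frac{n}{2} \right)\text{th and } \left( \frac{n}{2} + 1 \right)\text{th numbers} & \text{if } n \text{ is even} \end{cases} \]

where n is the size of the dataset.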

For these ten sales of companies, {9, 16, 11, 19, 11, 10, 13, 12, 6, 9}, the sorted
dataset will be

6 9 9 10 11 11 12 13 16 19

and, since n = 10 is even, the median is the average of the 5th and 6th numbers,
which is \( \frac{11 + 11}{2} = 11 \) million dollars.

Using the Excel function @MEDIAN(data range), the median overall score of all the
706 students in the student record dataset is found to be 71.9. Comparing this
figure with the mean (69.6), we can see that both measures of location give quite close
results. However, this is not always the case.

Example 1.2

Monthly income of five staff members (in $1000) : 13, 21, 23, 32, 35
Mean = 24.8 Median = 23

If 35 is mistyped as 135, then the data will become : 13, 21, 23, 32, 135
Mean = 44.8 Median = 23

Hence it is possible to have very different values of mean and median. From this
example we can also see that the mean is more sensitive to extreme values than the
median.
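
The comparison in Example 1.2 can be reproduced with a minimal Python sketch. It uses the standard-library statistics module; the variable names are illustrative and not part of the course material.

```python
from statistics import mean, median

# Monthly incomes in $1000 from Example 1.2.
incomes = [13, 21, 23, 32, 35]
# The same data with 35 mistyped as 135.
mistyped = [13, 21, 23, 32, 135]

for label, data in (("original", incomes), ("mistyped", mistyped)):
    # The mean moves with the extreme value; the median does not.
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

Only the mean changes between the two runs, which is the sensitivity to extreme values described above.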

In general, the median is the better choice for describing highly skewed (not symmetric)
distributions or when there are one or more outlying values whose validity is
suspect. Otherwise, the mean is preferred, as it fully utilizes all the data and is
easy to study.

§ 1.4 Measures of Spread (Variation)

The above histograms show the scores (based on a questionnaire of ten five-point-scale
items) at five-star hotels rated by 127 male guests and 114 female guests. The
locations of both distributions are more or less the same. However, the scores
given by male guests are spread out more widely than the scores given by female guests.
That means the variation in ratings by male guests is larger than that by female guests.

There are several ways to assess variation, but by far the most useful is the statistic
known as the standard deviation, which is given the symbol S. The formula for
calculating the standard deviation is

\[ S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]

Example 1.3

Sales of ten furniture companies: 9, 16, 11, 19, 11, 10, 13, 12, 6, 9

\( \bar{X} = 11.6 \)

Sales ($m)     Deviation from mean     Deviation squared
\( X \)        \( X - \bar{X} \)       \( (X - \bar{X})^2 \)
 9             -2.6                     6.76
16              4.4                    19.36
11             -0.6                     0.36
19              7.4                    54.76
11             -0.6                     0.36
10             -1.6                     2.56
13              1.4                     1.96
12              0.4                     0.16
 6             -5.6                    31.36
 9             -2.6                     6.76

\( \sum X = 116 \),   \( \sum (X - \bar{X}) = 0 \),   \( \sum (X - \bar{X})^2 = 124.4 \)

\[ \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{124.4}{9} = 13.8222, \qquad S = \sqrt{13.8222} = 3.7178 \]

The standard deviation can be computed by using the Excel function:

@STDEV(data range)
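
The table in Example 1.3 can be retraced step by step with a short Python sketch. It follows the definition of S with the n − 1 divisor used above; the variable names are illustrative.

```python
from math import sqrt

# Sales in $m of the ten furniture companies (Example 1.3).
sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]

n = len(sales)
x_bar = sum(sales) / n                                  # mean
squared_deviations = [(x - x_bar) ** 2 for x in sales]  # third column of the table
variance = sum(squared_deviations) / (n - 1)            # S squared
s = sqrt(variance)                                      # standard deviation

print(f"mean = {x_bar:.1f}, S^2 = {variance:.4f}, S = {s:.4f}")
```

The printed values should agree with the hand calculation (13.8222 and 3.7178) up to rounding.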


Example 1.4
Hotel Rating Data

Dataset : hotel_scores.dat
1. Read the data into an Excel worksheet. The first column will contain the
gender (M/F) of the guests and the second column will contain the scores
given by the guests.

2. In any empty cells, input @AVERAGE(B2:B242), @MEDIAN(B2:B242), and
@STDEV(B2:B242) to obtain respectively the mean, median, and standard
deviation of the scores given by all the guests.

3. To compute the standard deviation for male scores only, we can use the
array formula:
@STDEV(IF(A2:A242="M",B2:B242))

Press Ctrl-Shift-Enter instead of just Enter for an array formula. This
will compute the standard deviation of the data in the range B2:B242 that
satisfy the criterion A2:A242="M", i.e. gender=male. The other statistics,
and the statistics for female scores, can be computed similarly.
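
The idea behind the array formula, keeping only the scores whose gender matches and then taking the standard deviation, can be sketched in Python as follows. The records below are hypothetical stand-ins, since hotel_scores.dat is not reproduced in these notes, and the helper name stdev_by_group is an assumption made for illustration.

```python
from statistics import stdev

# Hypothetical (gender, score) records; NOT the real hotel_scores.dat data.
records = [("M", 32), ("F", 35), ("M", 41), ("F", 36), ("M", 27), ("F", 38)]

def stdev_by_group(rows, group):
    """Sample standard deviation of the scores whose label equals `group`."""
    selected = [score for label, score in rows if label == group]
    return stdev(selected)

sd_male = stdev_by_group(records, "M")
sd_female = stdev_by_group(records, "F")
print(f"male S = {sd_male:.4f}, female S = {sd_female:.4f}")
```

The list comprehension plays the role of IF(A2:A242="M",B2:B242): it selects the scores first, and the standard deviation is then computed on the selection only.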

Remarks

1. A better formula for calculating the standard deviation by hand would be

\[ S = \sqrt{\frac{\sum X^2 - \left( \sum X \right)^2 / n}{n - 1}} \]

2. The square of the standard deviation, \( S^2 \), is called the variance.


Example 1.5

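Sales ($m)
\( X \)      \( X^2 \)
 9            81
16           256
11           121
19           361
11           121
10           100
13           169
12           144
 6            36
 9            81

\( \sum X = 116 \),   \( \sum X^2 = 1470 \)

\[ \sum X^2 - \frac{\left( \sum X \right)^2}{n} = 1470 - \frac{116^2}{10} = 124.4 \]

\[ S^2 = \frac{124.4}{10 - 1} = 13.8222 \]

\[ S = \sqrt{13.8222} = 3.7178 \]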
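
As a quick check, the shortcut formula from the Remarks can be compared with the definition in a few lines of Python. This is only a sketch with illustrative variable names. One caveat not stated in the notes: in floating-point arithmetic the shortcut form can lose precision when the mean is large relative to the spread, so it is mainly a convenience for hand calculation.

```python
from math import isclose, sqrt

# Sales in $m of the ten furniture companies (Example 1.5).
sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
n = len(sales)

sum_x = sum(sales)                      # sum of X
sum_x2 = sum(x * x for x in sales)      # sum of X squared
corrected_ss = sum_x2 - sum_x ** 2 / n  # sum of X squared minus (sum of X) squared over n
s_shortcut = sqrt(corrected_ss / (n - 1))

# Definition-based calculation, for comparison.
x_bar = sum_x / n
s_definition = sqrt(sum((x - x_bar) ** 2 for x in sales) / (n - 1))

print(f"shortcut S = {s_shortcut:.4f}, definition S = {s_definition:.4f}")
```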

§ 1.5 Tables and Graphs for One-dimensional Data

A. Bar Chart
• Display summarized data where there is no emphasis on the percentage of a
total.

[Figure: bar chart “Mean Overall Scores of Math244 students”. x-axis: Year (1, 2, 3, *); y-axis: Mean Overall (0 to 90).]


B. Pie Charts
• A simple descriptive display of data that sum to a given total.
• Most illustrative way of displaying percentages.
• For nominal or ordinal data.

[Figure: pie chart “Grade Distribution of Math244 Students”, labelled with the following counts and percentages.]

Grade    Count    Percentage
A+          8         1%
A          27         4%
A-         59         8%
B+        107        15%
B         107        15%
B-        100        14%
C+         82        12%
C          72        10%
C-         49         7%
D          75        11%
F          20         3%

C. Dot Diagram


[Figure: dot diagram of the observations plotted along an axis running from 80 to 130.]

Remarks
1. It is easy to construct.

2. It is compact and can be used in the margins of other displays to add
information.

3. It is not suitable when there are too many points.


D. Frequency Distribution

e.g. Table : Flow of vehicles passing through a particular point during an hour.

Vehicles        Frequency    Percentage (%)
Cars               45             59
Lorries            22             29
Motorcycles         6              8
Buses               3              4
Total              76            100

e.g. Table : Number of applicants by age

Age             No. of Applicants    Cumulative Frequency
below 20                2                     2
20-22                   3                     5
22-24                  14                    19
24-26                  26                    45
26-28                  20                    65
28 or above             5                    70
Total                  70

“20-22” means “greater than or equal to 20 but less than 22”
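
A frequency table like the vehicle table above can be built from raw observations with a short Python sketch. The raw list is reconstructed from the tabulated counts, and the names observations and rows are illustrative.

```python
from collections import Counter
from itertools import accumulate

# One label per vehicle, rebuilt from the counts in the vehicle table.
observations = ["Cars"] * 45 + ["Lorries"] * 22 + ["Motorcycles"] * 6 + ["Buses"] * 3

# most_common() lists categories in decreasing order of frequency.
counts = Counter(observations).most_common()
total = sum(freq for _, freq in counts)
running = list(accumulate(freq for _, freq in counts))  # cumulative frequencies

rows = [
    (category, freq, round(100 * freq / total), cum)
    for (category, freq), cum in zip(counts, running)
]
for category, freq, pct, cum in rows:
    print(f"{category:<12}{freq:>5}{pct:>5}%{cum:>6}")
print(f"{'Total':<12}{total:>5}")
```

Counter does the job of the counting step, and accumulate produces the cumulative column.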

Example 1.6
Student Record Data
1. Open the student record worksheet.

2. Create a column of grade categories (column J of the worksheet).

3. In cell K2, input the function @COUNTIF(G$2:G$707,J2) to obtain the
frequency of A+. Copy this cell to cells K3 to K12 to obtain the
frequencies of the other grades. In cell K13, input @SUM(K2:K12) to calculate
the total frequency.

4. Use simple arithmetic operations to calculate the percentages and
cumulative percentages.


Remarks
1. Percentages and/or cumulative percentages should be included if they are what
readers are interested in. Counts can be omitted provided that they can be recovered if
desired.

2. Total size of the data should always be included.

3. For nominal data, arrange the categories in a meaningful way. If there is no natural
order, arrange the categories so that their associated frequencies are
decreasing. For ordinal data, categories should be arranged in their natural order.

4. For continuous data, information is lost through grouping of data.

5. Different choices of class-intervals may give different impressions.

E. Histogram

A histogram is a very suitable way to represent the distribution of data, especially
for large datasets. The histogram itself is a frequency diagram because it shows the
frequency of occurrence of results within particular intervals. When inspecting a
histogram, the most important point to keep in mind is that counts of data
are represented by area, rather than height.

To construct a histogram, one should:

1. Partition the range of data into several intervals (not necessarily of equal
widths).

2. Draw rectangles over the intervals. The area of each rectangle should be
proportional to the corresponding count.

To construct a histogram using Excel, the Analysis ToolPak should be loaded first:
1. Office Button -> Excel Options

2. Click the Add-ins tab on the left panel. At the Manage menu, select Excel
Add-ins, then click Go...

3. Select Analysis ToolPak and click on OK. (If prompted that the
Analysis ToolPak is not installed, install it.)

4. After the Analysis ToolPak is successfully loaded, the Data Analysis
command will be added to the Data menu. This command provides some handy
functions for performing statistical analyses. Detailed instructions can be
found at:
http://office.microsoft.com/en-us/excel-help/load-the-analysis-toolpak-HP001127724.aspx

There is a histogram function in the added Data Analysis command. However, it
produces the histogram as a column bar chart, with numerical labels misaligned. A
handy add-in built on the basis of the original histogram command was developed
by Prof. Michael R. Middleton, School of Business and Management, University
of San Francisco.
1. Download the add-in BetterHistogram_20070222_2050.xla from the course website.

2. Office Button -> Excel Options -> Add-ins -> Excel Add-ins -> Go...

3. Click Browse..., change directory to select the downloaded add-in file,
then click OK.

4. After the Better Histogram add-in is successfully loaded, the Better
Histogram command will be added to the Add-Ins menu.

Detailed instructions on the use of this command can be found in the book “Data
Analysis Using Microsoft Excel: Updated for Office 07”, or at the following site:

http://www.treeplan.com/BetterHistogram_20041117_1555.htm

Example 1.7
Frequency histogram with equal width classes
1. Open the student record worksheet.

2. Add-Ins -> Better Histogram

3. Input F2:F707 as the Data Range for constructing the histogram.

4. Input 0 as the Start Value, 10 as the Step Value, 100 as the Stop Value to
define the intervals.

5. Click OK to construct the graph. Format the graph (color, titles, fonts,
etc) as appropriate.

[Figure: frequency histogram “Overall Scores of Math244 Students (00/01-03/04)” (N = 706). x-axis: Overall (0 to 100 in steps of 10); y-axis: Frequency (0 to 250).]


From this histogram we can see that the rectangular blocks from 50 to 90 comprise
most of the area. Hence most of the students scored between 50 and 90. The graph is
not symmetric, as there is a tail pointing towards the left. It indicates that, compared
with the average students, there are fewer low-scoring students than high-scoring students.

This kind of deviation from symmetry is called skewness. Since the tail of the
histogram points towards the left-hand side, the distribution is said to be skewed to the
left, or negatively skewed.

Since area represents frequency, formally the scale of the y-axis should be
frequency per unit score rather than just frequency (height = area ÷ width). The area
under the whole distribution would then represent the total number of data, i.e. 706.
Usually the y-scale is further adjusted to make the total area equal to 1. Such a
y-scale is called the density.

\[ \text{density} = \frac{\text{frequency} / \text{data size}}{\text{width}} = \frac{\text{relative frequency}}{\text{width}} \]

Changing the y-axis scale makes no difference to the shape of these two histograms because,
with equal class widths, the heights of the rectangular blocks are directly proportional to
the areas. However, one must use density as the y-axis scale whenever there are
unequal class widths, as illustrated by the following examples.


Example 1.8
Frequency histogram with unequal width classes
(Incorrect construction of histogram)

The shape of the distribution is totally ruined. The histogram gives the incorrect
impression that many more students scored 35 to 70 than 70 to 90.

Example 1.9
Density histogram with unequal width classes

This histogram provides a more detailed description of the data distribution from 70 to
90, yet the overall shape is still preserved. Preserving the perceived shape of the
distribution is the main reason for using area, rather than height, to represent
frequencies.
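
The effect of unequal class widths can be seen numerically in a small Python sketch that applies density = relative frequency ÷ width. The class boundaries and counts are hypothetical, chosen only for illustration; they are not the course dataset.

```python
from math import isclose

# Hypothetical class boundaries and counts (NOT the student record data).
boundaries = [0, 35, 70, 80, 90, 100]
counts = [10, 90, 50, 40, 10]

total = sum(counts)
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
# density = (frequency / data size) / width
densities = [c / total / w for c, w in zip(counts, widths)]

for lo, hi, c, d in zip(boundaries, boundaries[1:], counts, densities):
    print(f"[{lo:>3}, {hi:>3})  count = {c:>3}  density = {d:.4f}")

# The areas of the blocks (density x width) add up to 1.
total_area = sum(d * w for d, w in zip(densities, widths))
print(f"total area = {total_area:.4f}")
```

In this made-up example the wide class [35, 70) has the largest count but a lower density than [70, 80). Plotting counts as heights would exaggerate the wide class, while plotting densities would not.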


Relative frequency and Probability

We saw from the above histogram that a given class of results, say those lying
between 60 and 70, made up about 0.02 × (70 − 60) = 20% of the total (density × class
width). This 20% is called the relative frequency of scores between 60 and 70. If the
population of all students (including those in future semesters) has more or less the same
distribution as this dataset, we may infer that there would be around a 20% chance of
observing a student scoring in this range. The process of statistical inference (with
suitable assumptions) allows us to equate in a numerical fashion the relative
frequency of past events and the probability of future events.

Hence it is important to understand the interpretation of a histogram, as there are
some similarities between a histogram and a probability density curve, which will be
discussed in Chapter III.

Remarks

1. The size of the dataset should be given.

2. There should be no gaps between blocks.

3. It is invalid to use a broken vertical scale in a histogram.

4. There is no good way to represent open-ended intervals graphically when
they have non-zero frequencies.

5. A histogram is a more suitable form of representation than a frequency distribution
table when the class-intervals have unequal widths.

F. Percentile
In a dataset of n observations, (100p)th percentile (0 < p < 1) has approximately np
observations less than or equal to it and also n(1-p) observations greater than it.
e.g. Test scores of ten students : 68, 75, 58, 47, 83, 34, 90, 71, 63, 79
Sorted scores : 34, 47, 58, 63, 68, 71, 75, 79, 83, 90

There are 3 students having scores less than or equal to 58. Therefore the datum 58
is a 30th percentile. Note that under the above definition, all the values between 58
and 63 are also 30th percentiles.


Let \( X_{(i)} \) denote the ith smallest value (such that \( X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)} \)). Let r and
f be the integer part and fractional part of \( (n + 1)p \) respectively. The following
definition provides a formula to compute the percentile uniquely:

\[ (100p)\text{th percentile} = X_{(r)} + f \left( X_{(r+1)} - X_{(r)} \right) \]

e.g. For the ten test scores above with p = 0.3, \( (n + 1)p = 11 \times 0.3 = 3.3 \), so r = 3 and f = 0.3:

\[ 30\text{th percentile} = X_{(3)} + 0.3 \left( X_{(4)} - X_{(3)} \right) = 58 + 0.3(63 - 58) = 59.5 \]
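
The (n + 1)p rule can be written as a small Python function. This is a sketch only: the function name percentile is illustrative, and the handling of positions that fall below the first or above the last observation (returning the minimum or maximum) is an assumption, since the notes do not cover those cases.

```python
def percentile(data, p):
    """(100p)th percentile using the (n + 1)p interpolation rule."""
    xs = sorted(data)
    n = len(xs)
    position = (n + 1) * p
    r = int(position)   # integer part
    f = position - r    # fractional part
    if r < 1:           # assumption: clamp to the minimum
        return xs[0]
    if r >= n:          # assumption: clamp to the maximum
        return xs[-1]
    # xs[r - 1] is the r-th smallest value because Python lists start at index 0.
    return xs[r - 1] + f * (xs[r] - xs[r - 1])

scores = [68, 75, 58, 47, 83, 34, 90, 71, 63, 79]
print(percentile(scores, 0.30))
print(percentile(scores, 0.50))
```

For the ten test scores this should reproduce the 30th percentile of 59.5 up to floating-point rounding. Spreadsheet percentile functions may use a different interpolation rule, so their results need not match this formula exactly.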

In a relative frequency histogram, the 100pth percentile is the cutoff point on the x-
axis such that the total area of the histogram on the left of this cutoff point is equal
to 100p percent.

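For a grouped frequency distribution table,

\[ (100p)\text{th percentile} = L_p + \frac{d \, (np - F_{p-1})}{f_p} \]

where \( L_p \) = lower class boundary of the class of this percentile
      \( n \) = number of observations
      \( F_{p-1} \) = cumulative frequency below the class of this percentile
      \( f_p \) = frequency of the class of this percentile
      \( d \) = class width of the class of this percentile

[Figure: the class of the percentile drawn as a block from \( L_p \) to \( L_p + d \) with frequency \( f_p \). The cumulative frequency is \( F_{p-1} \) at \( L_p \) and \( np \) at the (100p)th percentile, which lies a distance \( a \) to the right of \( L_p \).]

By proportion within the class,

\[ \frac{a}{np - F_{p-1}} = \frac{d}{f_p} \;\Rightarrow\; a = \frac{d \, (np - F_{p-1})}{f_p} \]

\[ (100p)\text{th percentile} = L_p + a = L_p + \frac{d \, (np - F_{p-1})}{f_p} \]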


e.g. Table : Frequency table for 20 grain bullet penetration depths into oak wood
from a distance of 15 feet.

Penetration Depth (mm)    Frequency    Cumulative Frequency
58 - 60                       5                 5
60 - 62                       3                 8
62 - 64                       6                14
64 - 66                       3                17
66 - 68                       1                18
68 - 70                       0                18
70 - 72                       2                20
Total                        20

“58 - 60” means “greater than or equal to 58 but less than 60”

For p = 0.75, \( np = 15 \), \( L_p = 64 \), \( F_{p-1} = 14 \), \( f_p = 3 \), \( d = 2 \),

\[ 75\text{th percentile} = 64 + \frac{2(15 - 14)}{3} = 64.67 \]

Excel percentile function: =PERCENTILE(data range, p)
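
The grouped-data formula can also be sketched in Python. The function name grouped_percentile and its arguments are illustrative. The sketch takes the class of the percentile to be the first class with non-zero frequency whose cumulative frequency reaches np, which is an assumption about how ties at class boundaries are resolved.

```python
def grouped_percentile(boundaries, frequencies, p):
    """(100p)th percentile from a grouped table: L_p + d * (np - F_{p-1}) / f_p."""
    n = sum(frequencies)
    target = n * p
    cumulative_below = 0  # F_{p-1}: cumulative frequency below the current class
    for lower, upper, freq in zip(boundaries, boundaries[1:], frequencies):
        if freq > 0 and cumulative_below + freq >= target:
            width = upper - lower  # d
            return lower + width * (target - cumulative_below) / freq
        cumulative_below += freq
    return boundaries[-1]

# Bullet penetration depth table: class boundaries (mm) and frequencies.
depth_boundaries = [58, 60, 62, 64, 66, 68, 70, 72]
depth_frequencies = [5, 3, 6, 3, 1, 0, 2]

for p in (0.25, 0.50, 0.75):
    value = grouped_percentile(depth_boundaries, depth_frequencies, p)
    print(f"{100 * p:.0f}th percentile = {value:.2f}")
```

For the bullet penetration table this gives 64.67 for the 75th percentile, matching the hand calculation above.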

G. Boxplot

The 25th, 50th, and 75th percentiles cut the data into four pieces and are given the
special names lower quartile, median, and upper quartile respectively. These three
values, together with the maximum and minimum, provide a five-number summary
of the data.

A boxplot is a graphical display of the five-number summary. It is simple and
compact. Although it is less informative than the histogram, it can give a good
picture of the centre and spread of the distribution.

e.g. Boxplot of the bullet penetration data.

(Min = 58, QL = 60, Median = 62.67, QU = 64.67, Max = 72)

[Figure: boxplot of the five-number summary on an axis from 55 to 75.]


Example 1.10
Boxplot of Student Record Data (created by another statistical package)

[Figure: boxplot of the overall scores, with the lower quartile, median, and upper quartile labelled.]

The box comprises the middle half of the dataset, i.e. about 50% of the overall
scores, located at the centre, from 60 to 80.

The stars represent data points that are too extreme compared to the others and are
sometimes regarded as outliers.

Example 1.11
Comparison of several boxplots

We can easily compare the distributions of the data in several groups in just one
graph. For example, the scores of audit (*) and year 2 students are spread out a little
more widely than those of year 1 and year 3 students. The distribution of audit
students is located to the right of all the others, which suggests that they performed
better than the other students in terms of examination results.

H. Stem-and-Leaf Display

Each data value is split into two components called stem and leaf.

Data                   Split    Stem    Leaf    Unit of Leaf
22.9                   22|9     22      9       (0.1)
or 22.9                2|2.9    2       29      (0.1)
or 22 (truncated)      2|2      2       2       (1)

e.g. Data : 78, 65, 90, 86, 79, 51, 79, 62, 84, 101

Stem   Leaf
  5    1
  6    25             n = 10
  7    899            Leaf unit = 1
  8    46
  9    0
 10    1

A stem-and-leaf plot is more informative than a histogram as it displays the raw data.
However, it is not suitable for large datasets.
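
A basic stem-and-leaf display with leaf unit 1 can be generated with a short Python sketch. It assumes non-negative integer data and splits each value into tens (stem) and units (leaf). The function name stem_and_leaf is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Stem-and-leaf lines for non-negative integers with leaf unit 1."""
    stems = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 78 -> stem 7, leaf 8
        stems[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
print("\n".join(stem_and_leaf(data)))
```

Sorting the data first means the leaves on each stem come out in increasing order, as in the display above.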

There are variations of stem-and-leaf displays.

Stretched stem-and-leaf display :
(n = 15, Leaf unit = 1; * for leaves 0-4, + for leaves 5-9)

3+   8
4*   0014
4+   556779
5*   013
5+   8

Squeezed stem-and-leaf display :
(n = 18, Leaf unit = 1; * for leaves 0,1; t for 2,3; f for 4,5; s for 6,7; + for 8,9)

1*   0
 t   23
 f   4445
 s   67777
 +   889
2*   01
 t   2


Back-to-back stem-and-leaf plot :

Group A Group B

84 1 122355567
531 2 0111222346777899
988421 3 012457 n = 72
85200 4 11257 Leaf unit = 1
976540 5 0236
97655210 6 02

Example 1.12
Stem-and-leaf display of student record data (created by another statistical package)

Stem-and-Leaf Display: Overall

Stem-and-leaf of Overall Year = 1 N = 190


Leaf Unit = 1.0

1 2 2
2 2 6
3 3 2
4 3 5
8 4 0034
19 4 55778888999
33 5 00122233344444
52 5 5556666777888899999
71 6 0011111223333334444
(25) 6 5555566667778888888999999
94 7 00011112233333344444444
71 7 5555566677788888899999
49 8 0000000001111112334444
27 8 555666666777788999
9 9 01112223
1 9 6

The numbers in the first column are the cumulative frequencies of each class
interval, accumulating from two ends. Note that (25) is the frequency of the
median class interval [65,70).


§ 1.6 Cautions about graphs


• Pictures can be deceptive even when there is no intention to deceive.

• “Lying with statistics”: presenting data graphically on a stretched or compressed
scale of numbers with the aim of making the data show whatever you want to show.

• Statistical tests tend to be more objective than human eyes and are less prone to
deception as long as the corresponding assumptions hold.

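Example 1.13

Four brands of cigarette : A, B, C, D

Pie chart : Market penetration of the four brands.

[Figure: two pie charts of the same data (A 27%, B 37%, C 18%, D 18%), with the slices arranged differently.]

Bar graph : Monthly sales of the four brands.

[Figure: two bar charts of monthly sales ($m) for brands A, B, C, D, drawn with different vertical scales: tick marks 0, 1.0, 1.1, 1.2 on one and 0, 0.5, 1.0, 1.5 on the other.]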


Example 1.14

The representations are not directly proportional to the quantities represented.

Example 1.15

Failure to show the relevant context produces a thoroughly misleading display.


Example 1.16
Misleading alignment of graphs


Example 1.17

"This may well be the worst graphic ever to find its way into print." --- Tufte (1983)

This graph uses colours, 3D effects, disguised redundancy to display just five
numbers. Note the clever use of mirror-imaging -- the top series is just (100
- the bottom series) and the interesting use of curved lines, front and back to avoid
the appearance that there's a lot less here than meets the eye.

A simple bar chart displaying the same set of data is given below:


§ 1.7 Measures of Skewness

Shape of histograms:

[Figure: example histograms of four shapes: L shape, J shape, bell shape, U shape.]

Skewness indicates how far the shape of the histogram departs from a symmetric
shape.

[Figure: a histogram skewed to the right (positively skewed) and a histogram skewed to the left (negatively skewed).]


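Measure of skewness:

\[ \gamma_1 = \frac{\frac{1}{n} \sum (X - \bar{X})^3}{S^3} \]

\( \gamma_1 > 0 \)   skewed to the right
\( \gamma_1 < 0 \)   skewed to the left
\( \gamma_1 = 0 \)   not skewed (symmetric)

Another measure of skewness:

\[ \gamma_2 = \frac{\text{mean} - \text{median}}{S} \]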

The skewness can be computed by using the Excel function:

@SKEW(data range)
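
Both skewness measures can be computed directly from their definitions. The sketch below applies them to the furniture sales data from Example 1.3, with S taken as the sample standard deviation (n − 1 divisor) defined in § 1.4. The function name skewness_measures is illustrative.

```python
from math import sqrt
from statistics import mean, median

def skewness_measures(data):
    """Return (gamma1, gamma2) as defined above, with S using the n - 1 divisor."""
    n = len(data)
    x_bar = mean(data)
    s = sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    gamma1 = (sum((x - x_bar) ** 3 for x in data) / n) / s ** 3
    gamma2 = (x_bar - median(data)) / s
    return gamma1, gamma2

sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]  # sales in $m (Example 1.3)
gamma1, gamma2 = skewness_measures(sales)
print(f"gamma1 = {gamma1:.4f}, gamma2 = {gamma2:.4f}")
```

Both values come out positive for this dataset, suggesting a mild skew to the right. A spreadsheet skewness function may apply a small-sample adjustment and so need not reproduce γ1 exactly.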

The Descriptive Statistics command in the Data Analysis add-in provides
an integrated calculation for all the summary statistics described in this chapter.

Example 1.18
Descriptive Statistics of Student Record Data
1. Open the student record worksheet.

2. Data -> Data Analysis...

3. Select Descriptive Statistics, then click OK.

4. Type F1:F707 in the Input Range field. Check the Labels in First Row box.
This will specify the first row as the label of the variable.

5. Check the Summary Statistics box to request the calculation of summary
statistics.

6. Click OK. The summary statistics will be computed and listed in a new
worksheet.
