
Ch2: Treatment of data

 Outline
 Pareto diagrams, dot diagrams
 Histograms (Frequency distributions)
 Stem-and-leaf display
 Box-plot (Quartiles and Percentiles)
 The calculation of the mean $\bar{x}$ and the standard
deviation $s$
What it is –
Descriptive statistics
 Descriptive statistics include the numbers, tables,
charts, and graphs used to describe, organize,
summarize, and present raw data.
 central tendency (location) of data, i.e. where data tend to fall,
as measured by the mean, median, and mode.
 dispersion (variability) of data, i.e. how spread out data are, as
measured by the variance and its square root, the standard
deviation.
 skew (symmetry) of data, i.e. how concentrated data are at the
low or high end of the scale, as measured by the skew index.
 kurtosis (peakedness) of data, i.e. how concentrated data are
around a single value, as measured by the kurtosis index.
Pareto Diagram
 A Pareto diagram orders each type of
failure or defect according to its frequency.

 For a computer-controlled lathe whose
performance was below par, workers
recorded the following
causes and their frequencies:
power fluctuations 6
controller not stable 22
operator error 13
worn tool not replaced 2
other 5
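The ordering step behind a Pareto diagram can be sketched in a few lines of Python. This is only an illustration: the function name `pareto_table` and the cumulative-percent column are assumptions added here, not part of the slides, and a strict sort by frequency places "other" ahead of the rarest named cause, whereas many Pareto charts put "other" last by convention.

```python
# Failure causes and frequencies copied from the lathe example above.
causes = {
    "power fluctuations": 6,
    "controller not stable": 22,
    "operator error": 13,
    "worn tool not replaced": 2,
    "other": 5,
}

def pareto_table(freqs):
    """Return (cause, frequency, cumulative percent) rows, most frequent first."""
    total = sum(freqs.values())
    rows, running = [], 0
    for cause, count in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((cause, count, 100.0 * running / total))
    return rows

pareto_rows = pareto_table(causes)
for cause, count, cum_pct in pareto_rows:
    # A row of '#' characters stands in for the bar of a plotted diagram.
    bar = "#" * count
    print(f"{cause:<24}{count:>3}  {bar:<22} {cum_pct:6.1f}%")
```

The first row is the most frequent cause, which is where improvement effort would normally be directed first.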
Dot diagram
 As a second step in improving the quality of the lathe,
data were collected on the deviations of cutting speed
from the target value set by the controller.
 EX. Cutting speed – target speed
 3 6 –2 4 7 4
 Dot diagram: A number line in which one dot is placed
above a value on the number line for each occurrence
of that value. That is, one dot means the value
occurred once, three dots mean the value occurred
three times, etc.
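A text-only dot diagram for the six deviations can be produced as follows. This is a rough sketch, assuming integer-valued data; the helper name `dot_diagram` is hypothetical.

```python
from collections import Counter

# Deviations (cutting speed - target speed) from the example above.
deviations = [3, 6, -2, 4, 7, 4]

def dot_diagram(values):
    """Return text rows: dots stacked above an integer number line."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    # Build from the top level down so the tallest stack is printed first.
    for level in range(max(counts.values()), 0, -1):
        marks = ["." if counts[v] >= level else " " for v in axis]
        rows.append("".join(f"{m:>4}" for m in marks))
    # The last row is the number line itself.
    rows.append("".join(f"{v:>4}" for v in axis))
    return rows

dot_rows = dot_diagram(deviations)
print("\n".join(dot_rows))
```

Five of the six dots sit to the right of zero, with two stacked above 4.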
Dot diagram
 This diagram visually summarizes the
information that the lathe is generally
running fast.
Frequency distributions
 A frequency distribution is a
tabular arrangement of data in which
the data are grouped into different
intervals, and the number of
observations that belong to each
interval is then determined.
 Data presented in this manner
are known as grouped data.
Data001.
80 measurements of emissions (in tons) of
sulfur oxides from an industrial plant
 15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9 9.0 13.2
22.7 9.8 6.2 14.7 17.5 26.1 12.8 28.6 17.6 23.7 26.8

 22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7 19.1
15.2 22.9 26.6 20.4 21.4 19.2 21.6 16.9 19.0 18.5
23.0

 24.6 20.1 16.2 18.0 7.7 13.5 23.5 14.5 14.4 29.6
19.4 17.0 20.8 24.3 22.5 24.6 18.4 18.1 8.3 21.9
12.3

 22.3 13.3 11.8 19.3 20.0 25.7 31.8 25.9 10.5 15.9
27.5 18.1 17.9 9.4 24.1 20.1 28.5
Class limits & frequency
Class limits Frequency
5.0 – 8.9 3
9.0 – 12.9 10
13.0 – 16.9 14
17.0 – 20.9 25
21.0 – 24.9 17
25.0 – 28.9 9
29.0 – 32.9 2
Total 80
Class limit and width
 Lower class limit: the smallest value that
can belong to a given interval.
 Upper class limit: the largest value that
can belong to the interval.

 Class width: the difference between the
lower class limits of consecutive classes,
e.g. 9.0 - 5.0 = 4.0. (With half-open classes
such as [5.0, 9.0), this equals the upper limit
minus the lower limit.)
Guidelines for classes
 1. There should be between 5 and 20 classes.
 2. When the data are integers, an odd class width is
convenient: it guarantees that the class midpoints are
integers instead of decimals.
 3. The classes must be mutually exclusive. This means that
no data value can fall into two different classes.
 4. The classes must be all inclusive or exhaustive. This
means that all data values must be included.
 5. The classes must be continuous. There are no gaps in a
frequency distribution. Classes that have no values in them
must be included (unless it is the first or last class, which is
dropped).
 6. The classes must be equal in width. The exception here is
the first or last class. It is possible to have a "below ..." or
"... and above" class. This is often used with ages.
Steps
 1. Find the largest and smallest values
 2. Compute the Range = Maximum -
Minimum
 3. Select the number of classes desired.
This is usually between 5 and 20.
 4. Find the class width by dividing the
range by the number of classes and
rounding up. You must round up, not off:
normally 3.2 would round to 3, but
rounding up gives 4.
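Steps 1 to 4 can be checked numerically. The sketch below assumes the extremes of the Data001 readings (6.2 and 31.8) and seven classes; the function name `class_width` is hypothetical.

```python
import math

def class_width(smallest, largest, num_classes):
    """Range divided by the number of classes, rounded up to a whole number."""
    return math.ceil((largest - smallest) / num_classes)

# Extremes of the 80 emission readings (Data001): 6.2 and 31.8 tons.
width = class_width(6.2, 31.8, 7)
print(width)

# Rounding up versus rounding off for the value quoted in step 4.
print(math.ceil(3.2), round(3.2))
```

Here the range is 25.6, and 25.6 / 7 is roughly 3.66, so rounding up gives a class width of 4.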
Class limits & frequency
Class limits Frequency
[5.0, 9.0) 3
[9.0, 13.0) 10
[13.0, 17.0) 14
[17.0, 21.0) 25
[21.0, 25.0) 17
[25.0, 29.0) 9
[29.0, 33.0) 2
Total 80
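The table above can be reproduced by tallying the Data001 readings into half-open classes. This is a minimal sketch; the function name `frequency_table` and its parameters are assumptions made for illustration.

```python
# The 80 sulfur oxide emission readings (tons) listed under Data001.
data001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2,
    22.7, 9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7,
    26.8, 22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7,
    19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0,
    18.5, 23.0, 24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5,
    14.4, 29.6, 19.4, 17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1,
    8.3, 21.9, 12.3, 22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8,
    25.9, 10.5, 15.9, 27.5, 18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]

def frequency_table(values, start, width, num_classes):
    """Count the values falling in each half-open class [lower, lower + width)."""
    counts = [0] * num_classes
    for x in values:
        for k in range(num_classes):
            lower = start + k * width
            if lower <= x < lower + width:
                counts[k] += 1
                break
    return counts

freqs = frequency_table(data001, start=5.0, width=4.0, num_classes=7)
for k, f in enumerate(freqs):
    lower = 5.0 + 4.0 * k
    print(f"[{lower:4.1f}, {lower + 4.0:4.1f})  {f:>3}")
```

Because each class includes its lower limit and excludes its upper limit, a reading such as 9.0 is counted once, in [9.0, 13.0).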
Variants of frequency distribution
 The cumulative frequency distribution is
obtained by computing the cumulative
frequency, defined as the total frequency of
all values less than the upper class limit of
a particular interval, for all intervals.
 Relative frequency: the ratio of the number
of observations in the interval to the total
number of observations
 The percentage frequency distribution is
arrived at by multiplying the relative
frequencies of each interval by 100%.
Cumulative frequency
Class limits Cumulative frequency
Less than 5 0
Less than 9 3
Less than 13 13
Less than 17 27
Less than 21 52
Less than 25 69
Less than 29 78
Less than 33 80
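The cumulative column is a running total of the class frequencies. A short sketch, using the frequencies from the class table (the variable names are illustrative):

```python
from itertools import accumulate

upper_bounds = [5, 9, 13, 17, 21, 25, 29, 33]
class_freqs = [3, 10, 14, 25, 17, 9, 2]

# Nothing lies below 5, so the running total starts at 0.
cumulative = [0] + list(accumulate(class_freqs))
for bound, total in zip(upper_bounds, cumulative):
    print(f"Less than {bound:>2}  {total:>3}")
```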
Percentage distribution
Class limits Perc. Dist. Frequency
[5.0, 9.0) 3.75% 3
[9.0, 13.0) 12.5% 10
[13.0, 17.0) 17.5% 14
[17.0, 21.0) 31.25% 25
[21.0, 25.0) 21.25% 17
[25.0, 29.0) 11.25% 9
[29.0, 33.0) 2.5% 2
Total 100% 80
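The percentages follow directly from the definition of relative frequency. A brief check, again starting from the class frequencies (variable names are illustrative):

```python
class_freqs = [3, 10, 14, 25, 17, 9, 2]
n = sum(class_freqs)

# Relative frequency = class count / total count; percentage = 100 times that.
relative = [f / n for f in class_freqs]
percent = [100 * f / n for f in class_freqs]

for f, p in zip(class_freqs, percent):
    print(f"{f:>3}  {p:6.2f}%")
```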
Histogram
 The most common form of graphical
presentation of a frequency
distribution is the histogram.
 Histogram: constructed of adjacent
rectangles; the heights of the
rectangles are the class frequencies
and the bases of the rectangles
extend between successive class
boundaries.
Density histogram
 When a histogram is constructed from a
frequency table having classes of unequal
lengths, the height of each rectangle must
be changed to
 Height = relative frequency / width.
 The area of the rectangle then represents
the relative frequency for the class and the
total area of the histogram is 1.
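The height rule can be verified with the cut points 5, 13, 17, 21, 25, 29, 33 used for the density histogram in this chapter, which merge the first two classes into one class of width 8. The merged frequency 13 = 3 + 10 is derived from the class table; the variable names are illustrative.

```python
# Cut points with one wide class [5, 13) followed by classes of width 4.
cut_points = [5, 13, 17, 21, 25, 29, 33]
# Frequencies after merging [5, 9) and [9, 13): 3 + 10 = 13.
class_freqs = [13, 14, 25, 17, 9, 2]

n = sum(class_freqs)
widths = [hi - lo for lo, hi in zip(cut_points, cut_points[1:])]

# Height = relative frequency / width, so that area = relative frequency.
heights = [(f / n) / w for f, w in zip(class_freqs, widths)]
total_area = sum(h * w for h, w in zip(heights, widths))

for lo, h in zip(cut_points, heights):
    print(f"class starting at {lo:>2}: height {h:.5f}")
print(f"total area = {total_area:.3f}")
```

Without the division by width, the wide first class would look almost as tall as the [13, 17) class even though its readings are spread over twice the interval.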
Density Histogram
 Graph->histogram->simple
 Scale->Y-Scale Type->Density
 Edit Bars->Binning->Cut point->
 5 13 17 21 25 29 33
Cumulative histogram
 1) Graph->histogram->simple
 2) Dataview->Datadisplay: check “symbols” only.
Smoother: check “lowess”, and enter “0” for the
degree of smoothing and “1” for the number of steps.
Stem-and-leaf Display
 A table of class limits and frequencies shows how many data
points fall in each class, but the original data points are lost.
 Stem-and-leaf: a data plot which uses part
of the data value as the stem and the rest of
the data value (the leaf) to form groups or
classes. This is very useful for sorting data
quickly.
 A stem-and-leaf display functions the same as a histogram but
preserves the original data points.

 Example: 11 numbers:
 12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41
 Frequency table
Class limits Frequency
10 – 19 2
20 – 29 2
30 – 39 4
40 – 49 3
Stem-and-leaf

Stem-and-leaf: each row has a stem, and
each digit to the right of the vertical
line is a leaf.
The "stem" is the left-hand column which
contains the tens digits.
The "leaves" are the lists in the right-hand
column, showing all the ones digits for each
of the tens, twenties, thirties, and forties.

Key: “4|0” means 40
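A display for the 11 numbers above can be generated with a short function. This is a sketch under the assumption of non-negative integers with tens digits as stems; the name `stem_and_leaf` is hypothetical, and stems with no leaves are simply skipped.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return rows such as '3|3457': tens digit as stem, ones digits as leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(str(leaf))
    return [str(stem) + "|" + "".join(groups[stem]) for stem in sorted(groups)]

numbers = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]
display = stem_and_leaf(numbers)
print("\n".join(display))
```

The length of each row matches the class frequency in the table (2, 2, 4, 3), while the digits themselves preserve the original values.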


Stem-and-leaf Display
 Example in P23: 20 numbers:
 29, 44, 12, 53, 21, 34, 39, 25, 48, 23
 17, 24, 27, 32, 34, 15, 42, 21, 28, 27
Frequency table
Class limits Frequency
10 – 19 3
20 – 29 9
30 – 39 4
40 – 49 3
50 – 59 1
Stem-and-leaf
1|257
2|113457789
3|2449
4|248
5|3
Stem-and-leaf in Minitab
 The display has three columns:
 The leaves (right) - Each value in the leaf
column represents a digit from one observation.
 The stem (middle) - The stem value represents
the digit immediately to the left of the leaf digit.
 Counts (left) - If the median value for the
sample is included in a row, the count for that
row is enclosed in parentheses. The values for
rows above and below the median are
cumulative.
Stem-and-leaf for DATA001
 Stem-and-leaf of frequencies N = 80
 Leaf Unit = 1.0

 2 0 67
 6 0 8999
 11 1 00111
 17 1 223333
 24 1 4445555
 32 1 66677777
 (13) 1 8888888999999
 35 2 0000000111
 25 2 222223333
 16 2 4444455
 9 2 66667
 4 2 889
 1 3 1
Ch2.5: Descriptive measures
 Mean: the sum of the observations divided
by the sample size:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
 Median: the center, or location, of a set of
data. If the observations are arranged in an
ascending or descending order:
 If the number of observations is odd, the median
is the middle value.
 If the number of observations is even, the
median is the average of the two middle values.
Example
 15 14 2 27 13
 Mean:
$\bar{x} = \frac{15 + 14 + 2 + 27 + 13}{5} = 14.2$

 Ordering the data from smallest to
largest:
 2 13 14 15 27
 The median is the third (middle) value,
14.
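The same numbers can be checked with Python's standard `statistics` module. The four-value sample used for the even case is a made-up illustration (the example data with 27 dropped), not part of the slides.

```python
import statistics

sample = [15, 14, 2, 27, 13]

mean = sum(sample) / len(sample)
median = statistics.median(sample)

# Even number of observations: average of the two middle values.
# (Hypothetical sample: the data above without 27.)
even_median = statistics.median([2, 13, 14, 15])

print(mean, median, even_median)
```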
Other central tendency
 Midrange
 The midrange is simply the midpoint
between the highest and lowest values.
 Mode
 The mode is the most frequent data
value. There may be no mode if no one
value appears more than any other.
There may also be two modes (bimodal),
three modes (trimodal), or more than
three modes (multi-modal).
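As an illustration, the midrange and the modes of the 20 numbers from the stem-and-leaf example can be computed as below. The sketch assumes Python 3.8 or later, where `statistics.multimode` is available; that function returns every value when no value repeats, so the "no mode" case has to be handled separately if needed.

```python
from statistics import multimode

# The 20 numbers from the stem-and-leaf example in this chapter.
values = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
          17, 24, 27, 32, 34, 15, 42, 21, 28, 27]

# Midrange: midpoint between the lowest and highest values.
midrange = (min(values) + max(values)) / 2

# multimode lists every value that attains the highest count.
modes = multimode(values)

print(midrange, sorted(modes))
```

Three values occur twice each, so this sample is trimodal.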
Summary
 The Mean is used in computing other statistics (such
as the variance) and does not exist for open ended
grouped frequency distributions. It is often not
appropriate for skewed distributions such as salary
information.
 The Median is the center number and is good for
skewed distributions because it is resistant to change.
 The Mode is used to describe the most typical case.
The mode can be used with nominal data whereas the
others can't. The mode may or may not exist and
there may be more than one value for the mode
 The Midrange is not used very often. It is a very
rough estimate of the average and is greatly affected
by extreme values (even more so than the mean).
Summary cont.

Property                    Mean  Median  Mode  Midrange
Always exists               No    Yes     No    Yes
Uses all data values        Yes   No      No    No
Affected by extreme values  Yes   No      No    Yes
Sample variance
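 Deviations from the mean: $x_i - \bar{x}$
 Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
or, equivalently,
$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$
 Standard deviation s:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$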
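Both forms of the variance formula can be compared on the five-value sample 15, 14, 2, 27, 13 used earlier for the mean. This is a small numerical check with illustrative variable names, not a general-purpose implementation.

```python
import math
import statistics

sample = [15, 14, 2, 27, 13]
n = len(sample)
mean = sum(sample) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Shortcut form: needs only the sum of x and the sum of x squared.
var_shortcut = (n * sum(x * x for x in sample) - sum(sample) ** 2) / (n * (n - 1))

std_dev = math.sqrt(var_definition)
print(var_definition, var_shortcut, std_dev)

# The standard library uses the same n - 1 divisor.
print(statistics.variance(sample))
```

For this sample the sum of squared deviations is 314.8, so the variance is 314.8 / 4 = 78.7 and the standard deviation is about 8.87.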
Quartiles and Percentiles
 Quartiles: values in a given set of
observations that divide the data into four equal
parts.
 The first quartile, $Q_1$, is a value that has one
fourth, or 25%, of the observations below its
value.
 The sample 100p-th percentile is a value
such that at least 100p% of the observations
are at or below this value, and at least
100(1-p)% are at or above this value.
Example
 Example in P34:
$Q_1 = \frac{14.7 + 15.2}{2} = 14.95$
$Q_2 = \frac{19.0 + 19.1}{2} = 19.05$
$Q_3 = \frac{22.9 + 23.0}{2} = 22.95$
Note: if N/4 is an integer, take the average of the
two neighboring ordered values; otherwise, round up.
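The averaging and rounding-up rule can be written as a small function. The sketch below applies it to the 80 Data001 readings, whose quartiles coincide with the values in this example; the function name `sample_percentile` and the choice to pass the percentage as a whole number (to keep the integer test exact) are assumptions.

```python
# The 80 sulfur oxide emission readings (tons) listed under Data001.
data001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2,
    22.7, 9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7,
    26.8, 22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7,
    19.1, 15.2, 22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0,
    18.5, 23.0, 24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5,
    14.4, 29.6, 19.4, 17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1,
    8.3, 21.9, 12.3, 22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8,
    25.9, 10.5, 15.9, 27.5, 18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]

def sample_percentile(values, pct):
    """Sample percentile; pct is a whole number strictly between 0 and 100."""
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * pct, 100)
    if remainder == 0:
        # n*p is an integer: average the k-th and (k+1)-th ordered values.
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise round n*p up: the (k+1)-th ordered value, index k from zero.
    return ordered[k]

q1 = sample_percentile(data001, 25)
q2 = sample_percentile(data001, 50)
q3 = sample_percentile(data001, 75)
print(q1, q2, q3)
```

With n = 80, n/4 = 20 is an integer, so Q1 averages the 20th and 21st ordered values. For the five-value sample 15, 14, 2, 27, 13, n/4 = 1.25 rounds up to 2 and the rule returns the second ordered value.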
Boxplots
 A boxplot is a way of summarizing
information contained in the quartiles
(or on an interval).
 Box length = interquartile range = $Q_3 - Q_1$
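As a quick worked case, the box length for the quartiles of the preceding example is computed below. The minimum and maximum (6.2 and 31.8) are taken from the Data001 readings on the assumption that the quartile example refers to that data set.

```python
# Quartiles from the example above; extremes from the Data001 readings.
minimum, q1, q2, q3, maximum = 6.2, 14.95, 19.05, 22.95, 31.8

# Box length = interquartile range.
box_length = q3 - q1

print(f"box from {q1} to {q3}, median line at {q2}")
print(f"interquartile range = {box_length:.2f}")
print(f"smallest and largest readings: {minimum}, {maximum}")
```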
