aghatak@iimv.ac.in
TABULAR AND GRAPHICAL METHODS FOR SUMMARIZING DATA
Scales of Measurement
• Nominal Scale - groups or classes
✓ Gender, color, professional classification, etc.
• Ordinal Scale - order matters
✓ Ranks (top ten videos, products, etc.)
• Interval Scale - difference or distance matters; has an arbitrary zero value
✓ Temperatures (°F, °C)
• Ratio Scale - ratios matter; has a natural zero value
✓ Salaries, weight, volume, area, length, etc.
TYPES OF DATA - TWO TYPES
• Qualitative - Categorical or Nominal
Examples are:
✓ Color
✓ Gender
✓ Nationality
• Quantitative - Measurable or Countable
Examples are:
✓ Temperatures
✓ Salaries
✓ Number of points scored on a 100-point exam
Summarizing Qualitative Data
Frequency Distribution
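We begin the discussion of tabular and graphical methods for summarizing qualitative data
with the definition of a frequency distribution: a tabular summary of data showing the
number (frequency) of items in each of several nonoverlapping classes.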
Let us use the following example to demonstrate the construction and interpretation of a
frequency distribution for qualitative data.
Coke Classic, Diet Coke, Dr. Pepper, Pepsi, and Sprite are five popular soft drinks.
Assume that the data show the soft drink selected in each of a sample of 50 soft drink purchases.
To develop a frequency distribution for these data, we count the number of times each soft drink appears. Coke Classic
appears 19 times, Diet Coke appears 8 times, Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5
times. These counts are summarized in the frequency distribution.
DATA FROM A SAMPLE OF 50 SOFT DRINK PURCHASES
This frequency distribution provides a summary of how the 50 soft drink purchases are distributed across the five soft drinks.
This summary offers more insight than the original data.
Relative Frequency and Percent Frequency Distributions
• The relative frequency of a class equals the fraction or proportion of items belonging
to a class. For a data set with n observations, the relative frequency of each class can
be determined as follows:
Relative frequency of a class = (frequency of the class) / n
• The percent frequency of a class is the relative frequency multiplied by 100.
• The relative frequency for Coke Classic is 19/50 = .38, the relative frequency for Diet Coke is 8/50 = .16, and so on.
• From the percent frequency distribution, we see that 38% of the purchases were Coke Classic, 16% of the
purchases were Diet Coke, and so on.
• We can also note that 38% + 26% + 16% = 80% of the purchases were the top three soft drinks (Coke Classic, Pepsi, and Diet Coke).
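The relative and percent frequencies can be reproduced from the five counts quoted in the text. The following is a minimal Python sketch; the variable names are illustrative and are not taken from the slides.

```python
# Frequencies quoted in the text for the sample of 50 purchases.
frequency = {
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # number of observations

# Relative frequency = frequency of the class / n
relative = {drink: count / n for drink, count in frequency.items()}
# Percent frequency = relative frequency * 100
percent = {drink: 100 * rel for drink, rel in relative.items()}

for drink in frequency:
    print(f"{drink:<13}{frequency[drink]:>4}{relative[drink]:>7.2f}{percent[drink]:>6.0f}%")

# Share of purchases accounted for by the three most frequent soft drinks.
top_three = sorted(percent.values(), reverse=True)[:3]
print(f"Top three: {sum(top_three):.0f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any frequency table.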
The three steps necessary to define the classes for a frequency distribution with
quantitative data are:
1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
Number of classes: classes are formed by specifying ranges that will be used to group the
data.
As a general guideline, we recommend using between 5 and 20 classes.
For a small number of data items, as few as five or six classes may be used to summarize the data. For a larger number of data
items, a larger number of classes is usually required.
Based on the class frequencies and with n = 20, the next table shows the relative
frequency distribution and percent frequency distribution for the audit time data.
RELATIVE FREQUENCY AND PERCENT FREQUENCY DISTRIBUTIONS FOR THE AUDIT TIME DATA
Dot Plot
• One of the simplest graphical summaries of data is a dot plot.
• A horizontal axis shows the range for the data. Each data value is represented by a
dot placed above the axis.
• The three dots located above 18 on the horizontal axis indicate that an audit time of
18 days occurred three times.
• Dot plots show the details of the data and are useful for comparing the distribution of
the data for two or more variables.
The adjacent rectangles of a histogram touch one another. Unlike a bar graph, a
histogram contains no natural separation between the rectangles of adjacent classes.
This format is the usual convention for histograms.
One of the most important uses of a histogram is to provide information about the shape,
or form, of a distribution. The example below contains four histograms constructed from
relative frequency distributions.
HISTOGRAMS SHOWING DIFFERING LEVELS OF SKEWNESS
• Panel A shows the histogram for a set of data moderately skewed to the left.
• A histogram is said to be skewed to the left if its tail extends farther to the left. This
histogram is typical for exam scores, with no scores above 100%, most of the
scores above 70%, and only a few really low scores.
• Panel B shows the histogram for a set of data moderately skewed to the right.
• A histogram is said to be skewed to the right if its tail extends farther to the right.
An example of this type of histogram would be for data such as housing prices; a
few expensive houses create the skewness in the right tail.
Cumulative Distributions
To understand how the cumulative frequencies are determined, consider the class with the description “less
than or equal to 24.” The cumulative frequency for this class is simply the sum of the frequencies for all
classes with data values less than or equal to 24. In the frequency distribution, the sum of the frequencies
for classes 10–14, 15–19, and 20–24 indicates that 4 + 8 + 5 = 17 data values are less than or equal to 24.
Hence, the cumulative frequency for this class is 17.
Ogive
A graph of a cumulative distribution, called an ogive, shows data values on the horizontal
axis and either the cumulative frequencies, the cumulative relative frequencies, or the
cumulative percent frequencies on the vertical axis.
OGIVE FOR THE AUDIT TIME DATA
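The cumulative values plotted on an ogive are running totals of the class frequencies. The sketch below uses only the three class frequencies quoted in the text (the remaining audit time classes are not listed, so they are left out) together with n = 20; the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies quoted in the text for the first three audit-time classes.
classes = ["10-14", "15-19", "20-24"]
frequencies = [4, 8, 5]
n = 20  # total number of audit times, as stated in the text

# Cumulative frequency of a class = sum of the frequencies up to and including it.
cumulative = list(accumulate(frequencies))
# Cumulative percent frequency = cumulative frequency / n * 100
cumulative_percent = [100 * c / n for c in cumulative]

for label, cum, pct in zip(classes, cumulative, cumulative_percent):
    upper_limit = label.split("-")[1]
    print(f"Less than or equal to {upper_limit}: {cum} ({pct:.0f}%)")
```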
Cross tabulation
QUALITY RATING AND MEAL PRICE FOR 300 LOS ANGELES RESTAURANTS
CROSSTABULATION OF QUALITY RATING & MEAL PRICE OF 300 L.A. RESTAURANTS
From the percent frequency distribution we see that 28% of the restaurants were rated
good, 50% were rated very good, and 22% were rated excellent.
Dividing the totals in the bottom row of the cross tabulation by the total for that row
provides a relative and percent frequency distribution for the meal price variable.
Below are the admission rates of men and women at a university in two subjects.
True or False?
            Men                   Women
Subject 1   14% (168 of 1200)     15% (270 of 1800)
Subject 2   50% (400 of 800)      51% (102 of 200)
Below are the admission rates of men and women at a university in two subjects.
False!
            Men                   Women
Subject 1   14% (168 of 1200)     15% (270 of 1800)
Subject 2   50% (400 of 800)      51% (102 of 200)
Total       28% (568 of 2000)     19% (372 of 2000)
Women have the higher admission rate in each subject, yet men have the higher rate overall.
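The reversal between the per-subject rates and the total rates can be checked directly from the counts in the table. This is a small Python sketch; the dictionary layout and the helper name `rate` are illustrative choices.

```python
# (admitted, applicants) taken from the table above.
admissions = {
    "Men": {"Subject 1": (168, 1200), "Subject 2": (400, 800)},
    "Women": {"Subject 1": (270, 1800), "Subject 2": (102, 200)},
}

def rate(admitted, applicants):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants

totals = {}
for group, subjects in admissions.items():
    for subject, (admitted, applicants) in subjects.items():
        print(f"{group:<6}{subject}: {rate(admitted, applicants):.0%}")
    # Pool the counts across subjects before dividing.
    total_admitted = sum(a for a, _ in subjects.values())
    total_applicants = sum(k for _, k in subjects.values())
    totals[group] = rate(total_admitted, total_applicants)
    print(f"{group:<6}Total: {totals[group]:.0%}")
```

The totals are weighted averages of the subject rates. Most men applied to the subject with the high admission rate and most women to the subject with the low one, which is what drives the reversal.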
Scatter Diagram and Trend line
A scatter diagram is a graphical presentation of the relationship between two
quantitative variables, and a trend line is a line that provides an approximation of the
relationship.
SAMPLE DATA FOR THE STEREO AND SOUND EQUIPMENT STORE
SCATTER DIAGRAM AND TRENDLINE FOR THE STEREO AND SOUND EQUIPMENT STORE
TYPES OF RELATIONSHIPS DEPICTED BY SCATTER DIAGRAMS
Measures of Central Tendency
• Mean
➢ Average
Example - Median
Billions (original order): 33, 26, 24, 21, 19, 20, 18, 18, 52, 56, 27, 22, 18, 49, 22, 20, 23, 32, 20, 18
Billions (sorted): 18, 18, 18, 18, 19, 20, 20, 20, 21, 22, 22, 23, 24, 26, 27, 32, 33, 49, 52, 56
The median is the middle value of data sorted in order of magnitude. It is the 50th percentile.
Position of the 50th percentile: (20+1)50/100 = 10.5
The 10th and 11th sorted values are both 22, so Median = 22 + (.5)(0) = 22
Example - Mode
Mode = 18
The mean of a set of observations is their average - the sum of the observed values
divided by the number of observations.
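The three measures can be computed for the 20 values of the median example with Python's standard `statistics` module. This is a minimal sketch; the list name `billions` is illustrative.

```python
import statistics

# The 20 values (in billions) from the median example.
billions = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
            27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

mean = statistics.mean(billions)      # sum of the values / number of values
median = statistics.median(billions)  # middle of the sorted values
mode = statistics.mode(billions)      # most frequent value

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

The mean is pulled above the median by the few large values (49, 52, 56), which is typical of data skewed to the right.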
• Range
✓ Difference between maximum and minimum values
• Interquartile Range
✓ Difference between third and first quartile (Q3 - Q1)
• Variance
✓ Average* of the squared deviations from the mean
• Standard Deviation
✓ Square root of the variance
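The four measures of dispersion can be sketched with the standard `statistics` module, reusing the 20 values from the median example. Whether the slides intend the sample variance (dividing by n - 1) or the population variance (dividing by n) is not stated, so both are shown; the variable names are illustrative.

```python
import statistics

# The 20 values (in billions) from the median example.
billions = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
            27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

# Range: maximum minus minimum.
data_range = max(billions) - min(billions)

# statistics.quantiles defaults to the "exclusive" method, which places the
# quartiles at positions (p/100)(n + 1) in the sorted data.
q1, q2, q3 = statistics.quantiles(billions, n=4)
iqr = q3 - q1

sample_variance = statistics.variance(billions)       # divides by n - 1
population_variance = statistics.pvariance(billions)  # divides by n
sample_std = statistics.stdev(billions)               # square root of the sample variance

print(f"range = {data_range}, IQR = {iqr}")
print(f"sample variance = {sample_variance:.2f}, sample std dev = {sample_std:.2f}")
print(f"population variance = {population_variance:.2f}")
```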
Variance and Standard Deviation
• Kurtosis
➢ Measure of flatness or peakedness of a frequency distribution
➢ Platykurtic (relatively flat)
➢ Mesokurtic (normal)
➢ Leptokurtic (relatively peaked)
Measures of Skewness & Kurtosis
Skewness:
2. Galton skewness
Kurtosis:
1. Kurtosis
2. Excess Kurtosis
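The formulas for these measures did not survive in the slides. The sketch below uses commonly quoted definitions, which are an assumption here: Galton (quartile) skewness as (Q3 + Q1 - 2·Q2)/(Q3 - Q1), kurtosis as the fourth central moment divided by the squared second central moment, and excess kurtosis as kurtosis minus 3. The data are the 20 values from the median example.

```python
import statistics

# The 20 values (in billions) from the median example.
billions = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
            27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

# Quartiles at positions (p/100)(n + 1), the default "exclusive" method.
q1, q2, q3 = statistics.quantiles(billions, n=4)

# Galton (quartile-based) skewness: positive when Q3 is farther from the median than Q1.
galton_skewness = (q3 + q1 - 2 * q2) / (q3 - q1)

# Moment-based kurtosis, using population (divide-by-n) central moments.
mean = statistics.mean(billions)
n = len(billions)
m2 = sum((x - mean) ** 2 for x in billions) / n
m4 = sum((x - mean) ** 4 for x in billions) / n
kurtosis = m4 / m2 ** 2

# Excess kurtosis subtracts 3, the kurtosis of a normal (mesokurtic) distribution.
excess_kurtosis = kurtosis - 3

print(f"Galton skewness = {galton_skewness:.3f}")
print(f"kurtosis = {kurtosis:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```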
Skewness
Symmetric Bimodal Distribution
Kurtosis
PERCENTILE
A percentile provides information about how the data are spread over the interval from
the smallest value to the largest value.
Approximately p percent of the observations have values less than the pth percentile;
approximately (100 - p) percent of the observations have values greater than the pth
percentile.
The pth percentile is a value such that at least p percent of the observations are less than
or equal to this value and at least (100 - p) percent of the observations are greater than or
equal to this value.
CALCULATING THE pth PERCENTILE
Step 1. Arrange the data in ascending order (smallest value to largest value).
Step 2. Compute an index i
$i = \left(\frac{p}{100}\right)(n+1)$
where p is the percentile of interest and n is the number of observations.
The first quartile, Q1 (25th percentile), is often called the lower quartile.
The second quartile, Q2 (50th percentile), is often called the median.
The third quartile, Q3 (75th percentile), is often called the upper quartile.
The interquartile range is the difference between the third and the first quartiles: IQR = Q3 - Q1.
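The percentile steps can be sketched as a small Python function. The index i = (p/100)(n + 1) is taken from the slides. Interpolating between neighbouring sorted values when i is not a whole number is an assumption, inferred from the median example's 22 + (.5)(0) = 22. The function name `percentile` is illustrative.

```python
def percentile(data, p):
    """p-th percentile using the index i = (p/100)(n + 1) with linear interpolation."""
    values = sorted(data)        # Step 1: arrange in ascending order
    n = len(values)
    i = (p / 100) * (n + 1)      # Step 2: index, a 1-based position in the sorted data
    k = int(i)                   # whole part of the position
    fraction = i - k             # fractional part of the position
    if k < 1:                    # position falls before the first value
        return values[0]
    if k >= n:                   # position falls at or after the last value
        return values[-1]
    # Interpolate between the k-th and (k+1)-th sorted values.
    return values[k - 1] + fraction * (values[k] - values[k - 1])


# The 20 values (in billions) from the median example.
billions = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
            27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

q1, q2, q3 = (percentile(billions, p) for p in (25, 50, 75))
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```

For p = 50 the index is 10.5, so the result lies halfway between the 10th and 11th sorted values, matching the median example.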
Finding Quartiles
Chebyshev’s Theorem
Chebyshev’s theorem enables us to make statements about the proportion of data
values that must be within a specified number of standard deviations of the mean.
• At least $(1 - 1/z^2)$ of the data values must be within z standard deviations of the
mean, where z is any value greater than 1.
Some of the implications of this theorem, with z =2, 3, and 4 standard deviations follow.
• At least .75, or 75%, of the data values must be within z =2 standard deviations of the mean.
• At least .89, or 89%, of the data values must be within z =3 standard deviations of the mean.
• At least .94, or 94%, of the data values must be within z = 4 standard deviations of the mean.
At least $(1 - 1/k^2)$ of the elements of any distribution lie within k standard deviations of
the mean.
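The bound is easy to evaluate and to compare with real data. The sketch below computes 1 - 1/z² for z = 2, 3, 4 and compares it with the observed proportion in the 20 values from the median example; the function name `chebyshev_bound` is illustrative.

```python
import statistics

def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2

# The 20 values (in billions) from the median example.
billions = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
            27, 22, 18, 49, 22, 20, 23, 32, 20, 18]
mean = statistics.mean(billions)
std = statistics.pstdev(billions)  # standard deviation of these 20 values (divides by n)

for z in (2, 3, 4):
    # Observed proportion of values within z standard deviations of the mean.
    within = sum(abs(x - mean) <= z * std for x in billions) / len(billions)
    print(f"z = {z}: at least {chebyshev_bound(z):.2f}, observed {within:.2f}")
```

The theorem gives only a lower bound that holds for any distribution, so the observed proportion is usually well above it.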
Empirical Rule
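For roughly mound-shaped and symmetric distributions, approximately:
• 68% of the data values lie within 1 standard deviation of the mean
• 95% of the data values lie within 2 standard deviations of the mean
• 99.7% (almost all) of the data values lie within 3 standard deviations of the mean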
Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A key to the development of a box plot is the computation of the median and the
quartiles, Q1 and Q3.
The interquartile range, IQR= Q3 - Q1, is also used for identifying outliers.
1. A box is drawn with the ends of the box located at the first and third quartiles.
2. A vertical line is drawn in the box at the location of the median.
3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the
box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
Data outside these limits are considered outliers.
4. The extended lines are called whiskers. The whiskers are drawn from the ends of the
box to the smallest and largest values inside the limits computed in step 3.
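The box plot steps can be followed numerically. This sketch reuses the 20 values from the median example and takes the quartiles from `statistics.quantiles`, whose default "exclusive" method uses positions (p/100)(n + 1); the variable names are illustrative.

```python
import statistics

# The 20 values (in billions) from the median example.
billions = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
            27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

# Steps 1-2: the box runs from Q1 to Q3 with a line at the median.
q1, median, q3 = statistics.quantiles(billions, n=4)

# Step 3: limits 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = sorted(x for x in billions if x < lower_limit or x > upper_limit)

# Step 4: whiskers end at the smallest and largest values inside the limits.
inside = [x for x in billions if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

print(f"box: {q1} to {q3}, median {median}")
print(f"limits: {lower_limit} to {upper_limit}, outliers: {outliers}")
print(f"whiskers: {whisker_low} to {whisker_high}")
```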
BASIC PROBABILITY
By specifying all possible experimental outcomes, we identify the sample space for an
experiment.
SAMPLE SPACE: The sample space for an experiment is the set of all experimental
outcomes.
➢ An experimental outcome is also called a sample point to identify it as an element
of the sample space.
Consider the first experiment in the preceding table—tossing a coin.
The upward face of the coin (a head or a tail) determines the experimental outcomes
(sample points). If we let S denote the sample space, we can use the following
notation to describe the sample space.
S ={Head, Tail}
The sample space for the second experiment in the table, selecting a part for inspection,
can be described as follows:
S = {Defective, Nondefective}
Both of the experiments just described have two experimental outcomes (sample points).
However, suppose we consider the fourth experiment listed in the table—rolling a die.
The possible experimental outcomes, defined as the number of dots appearing on the
upward face of the die, are the six points in the sample space for this experiment.
S ={1, 2, 3, 4, 5, 6}
Being able to identify and count the experimental outcomes is a necessary step in
assigning probabilities.
Three useful counting rules
1. Multiple-step experiments : The first counting rule applies to multiple-step
experiments.
Consider the experiment of tossing two coins.
Let the experimental outcomes be defined in terms of the pattern of heads and tails appearing on the upward faces of
the two coins.
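The two-coin sample space can be enumerated as a two-step experiment. The rule itself is not spelled out in the surviving text, so the sketch assumes its usual form: the number of outcomes is the product of the number of outcomes at each step.

```python
from itertools import product

coin = ["H", "T"]  # possible outcomes of one toss

# Each experimental outcome is a pair (first coin, second coin).
sample_space = list(product(coin, repeat=2))
print(sample_space)

# Counting rule for multiple-step experiments: multiply the number of
# outcomes at each step (2 outcomes per toss, 2 tosses).
number_of_outcomes = len(coin) * len(coin)
print(number_of_outcomes, len(sample_space))
```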
2. The sum of the probabilities for all the experimental outcomes must equal 1.
For n experimental outcomes, this requirement can be written as
$P(E_1) + P(E_2) + \cdots + P(E_n) = 1$
The classical method of assigning probabilities is appropriate when all the
experimental outcomes are equally likely. If n experimental outcomes are possible,
a probability of 1/n is assigned to each experimental outcome.
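For the die-rolling experiment, the classical method assigns 1/6 to each of the six outcomes. This sketch uses exact fractions so that the sum can be checked without rounding; the event "even number" is an illustrative choice.

```python
from fractions import Fraction

# Sample space for rolling a die, as given in the text.
sample_space = [1, 2, 3, 4, 5, 6]
n = len(sample_space)

# Classical method: each of the n equally likely outcomes gets probability 1/n.
probability = {outcome: Fraction(1, n) for outcome in sample_space}

# The probabilities of all experimental outcomes must sum to 1.
total = sum(probability.values())

# Probability of an event = sum of the probabilities of its sample points.
p_even = sum(p for outcome, p in probability.items() if outcome % 2 == 0)

print(f"P(each outcome) = {probability[1]}, total = {total}, P(even) = {p_even}")
```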
Three doors: A, B, and C.
The contestant selected door A.
Monty Hall (who knows where the car is) then opened door C.
We now know the car is either behind A or B.
At this point, the contestant is given the option to SWITCH DOORS.
DOES IT MATTER?
If contestant SWITCHES from original choice
(one cell for each of the nine combinations of the door first selected and the door hiding the car):
He LOSES    He WINS     He WINS
He WINS     He LOSES    He WINS
He WINS     He WINS     He LOSES
He WINS 6 times out of 9.
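The 6-out-of-9 count can be reproduced by enumerating the nine combinations. The sketch assumes that the car is equally likely to be behind any door and that the contestant's first pick is independent of it; the function name `switch_wins` is illustrative.

```python
from fractions import Fraction
from itertools import product

doors = ["A", "B", "C"]

def switch_wins(selected, car):
    """Return True if switching wins the car.

    The host opens a door that is neither the selected door nor the car,
    so switching lands on the car exactly when the first pick was wrong.
    """
    return selected != car

# Enumerate the nine equally likely (selected door, car door) combinations.
outcomes = {(s, c): switch_wins(s, c) for s, c in product(doors, repeat=2)}
wins = sum(outcomes.values())

p_switch = Fraction(wins, len(outcomes))  # probability of winning by switching
p_stay = 1 - p_switch                     # probability of winning by staying

print(f"switching wins {wins} times out of {len(outcomes)}")
print(f"P(win | switch) = {p_switch}, P(win | stay) = {p_stay}")
```

So it does matter: switching wins whenever the original choice was wrong, which happens two times out of three.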
A Brain Teaser!
What is the minimum number of people you need in a room, in order to say, with at least 50%
chance, that at least two of them share a common birthday?
(Alternatively)
What is the probability of at least one pair of students sharing a birthday in this class?
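Both versions of the teaser can be worked out through the complement: the probability that all birthdays are different. The sketch assumes 365 equally likely birthdays, independent across people, and ignores leap years; the function name `p_shared_birthday` is illustrative.

```python
def p_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_different = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1 - p_all_different

# Smallest group size for which the probability reaches 50%.
people = 1
while p_shared_birthday(people) < 0.5:
    people += 1

print(people, round(p_shared_birthday(people), 3))
```

For the alternative question, call `p_shared_birthday` with the number of students in the class.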
Conditional Probability
The fact that conditional probabilities can be computed as the ratio of a joint probability
to a marginal probability provides the following general formula for conditional
probability calculations for two events A and B:
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ and $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
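The ratio of a joint probability to a marginal probability can be illustrated with the die-rolling sample space, where all six outcomes are equally likely. The two events below are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Sample space for rolling a die; all six outcomes equally likely.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under the classical (equally likely) method."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # hypothetical event: the roll is even
B = {4, 5, 6}  # hypothetical event: the roll is greater than 3

joint = prob(A & B)              # joint probability P(A and B)
p_a_given_b = joint / prob(B)    # conditional probability of A given B
p_b_given_a = joint / prob(A)    # conditional probability of B given A

print(f"P(A and B) = {joint}, P(A | B) = {p_a_given_b}, P(B | A) = {p_b_given_a}")
```

Here P(A | B) = 2/3 differs from P(A) = 1/2, so knowing that B occurred changes the probability of A.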
Independence of Events
Product Rules for Independent Events
The Law of Total Probability and Bayes’ Theorem
Often, the analysis begins with initial or prior probability estimates for specific events of interest.
Then, from sources such as a sample, a special report, or a product test, we obtain
additional information about the events. Given this new information, we update the
prior probability values by calculating revised probabilities, referred to as posterior
probabilities. Bayes’ theorem provides a means for making these probability
calculations. The steps in this probability revision process