
Anirban Ghatak

aghatak@iimv.ac.in
TABULAR AND GRAPHICAL METHODS FOR SUMMARIZING DATA
Scales of Measurement
• Nominal Scale - groups or classes
✓ Gender, color, professional classification, etc.
• Ordinal Scale - order matters
✓ Ranks (top ten videos, products, etc.)
• Interval Scale - difference or distance matters; has an arbitrary zero value.
✓ Temperatures (°F, °C)
• Ratio Scale - ratio matters; has a natural zero value.
✓ Salaries, weight, volume, area, length, etc.
TYPES OF DATA - TWO TYPES

• Qualitative - Categorical or Nominal. Examples are:
✓ Color
✓ Gender
✓ Nationality
• Quantitative - Measurable or Countable. Examples are:
✓ Temperatures
✓ Salaries
✓ Number of points scored on a 100-point exam
Summarizing Qualitative Data
Frequency Distribution
We begin the discussion of how tabular and graphical methods can be used to summarize
qualitative data with the definition of a frequency distribution.

A frequency distribution is a tabular summary of data showing the number (frequency) of
items in each of several nonoverlapping classes.

Let us use the following example to demonstrate the construction and interpretation of a
frequency distribution for qualitative data.
Coke Classic, Diet Coke, Dr. Pepper, Pepsi, and Sprite are five popular soft drinks.
Assume that the data show the soft drink selected in each of a sample of 50 soft drink purchases.

To develop a frequency distribution for these data, we count the number of times each soft drink appears. Coke Classic
appears 19 times, Diet Coke appears 8 times, Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5
times. These counts are summarized in the frequency distribution
DATA FROM A SAMPLE OF 50 SOFT DRINK PURCHASES
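FREQUENCY DISTRIBUTION OF SOFT DRINK PURCHASES
Soft Drink      Frequency
Coke Classic    19
Diet Coke        8
Dr. Pepper       5
Pepsi           13
Sprite           5
Total           50
This frequency distribution provides a summary of how the 50 soft drink purchases are distributed across the five soft drinks.
This summary offers more insight than the original data.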
Relative Frequency and Percent Frequency Distributions

• A frequency distribution shows the number (frequency) of items in each of several
nonoverlapping classes.
• However, we are often interested in the proportion, or percentage, of items in each class.

• The relative frequency of a class equals the fraction or proportion of items belonging
to a class. For a data set with n observations, the relative frequency of each class can
be determined as follows:
$\text{Relative frequency of a class} = \dfrac{\text{Frequency of the class}}{n}$
• The percent frequency of a class is the relative frequency multiplied by 100.

• The relative frequency for Coke Classic is 19/50 = .38, the relative frequency for Diet Coke is 8/50 = .16, and so on.
• From the percent frequency distribution, we see that 38% of the purchases were Coke Classic, 16% of the
purchases were Diet Coke, and so on.
We can also note that 38% + 26% + 16% = 80% of the purchases were the top three soft drinks (Coke Classic, Pepsi, and Diet Coke).
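The relative and percent frequency calculations above can be sketched in a few lines of Python. This is a minimal illustration that starts from the five counts stated in the text (the raw list of 50 purchases is not reproduced here); the variable names are illustrative only.

```python
# Frequencies taken from the counts stated in the text (n = 50 purchases).
frequency = {
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency = class frequency / n; percent frequency = relative * 100.
relative = {drink: count / n for drink, count in frequency.items()}
percent = {drink: 100 * count / n for drink, count in frequency.items()}

for drink in frequency:
    print(f"{drink:<13}{frequency[drink]:>4}{relative[drink]:>7.2f}{percent[drink]:>6.0f}%")

# Share of purchases accounted for by the three most frequent drinks.
top_three = sorted(frequency.values(), reverse=True)[:3]
top_three_percent = 100 * sum(top_three) / n
print(f"Top three drinks: {top_three_percent:.0f}% of purchases")
```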

RELATIVE FREQUENCY AND PERCENT FREQUENCY DISTRIBUTIONS OF SOFT DRINK PURCHASES
Bar Graphs and Pie Charts
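A bar graph, or bar chart, is a graphical device for depicting qualitative data summarized
in a frequency, relative frequency, or percent frequency distribution. On one axis of the
graph (usually the horizontal axis), we specify the labels that are used for the classes
(categories). On the other axis (usually the vertical axis), we use a frequency, relative
frequency, or percent frequency scale, and a bar of fixed width is drawn above each class
label to the appropriate height.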
The pie chart provides another graphical device for presenting relative frequency and
percent frequency distributions for qualitative data.
• To construct a pie chart, we first draw a circle to represent all of the data. Then we use
the relative frequencies to subdivide the circle into sectors, or parts, that correspond to
the relative frequency for each class.
For example, because a circle contains 360 degrees and Coke Classic shows a relative frequency of .38, the
sector of the pie chart labeled Coke Classic consists of .38(360)= 136.8 degrees. The sector of the pie chart
labeled Diet Coke consists of .16(360)=57.6 degrees.
Similar calculations for the other classes yield the pie chart. The numerical values shown for each sector can
be frequencies, relative frequencies, or percent frequencies.
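The sector calculation above can be checked with a short sketch. It assumes the five soft drink counts stated earlier in the text and simply multiplies each relative frequency by 360 degrees; the names are illustrative.

```python
# Counts from the soft drink example (n = 50).
frequency = {
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}
n = sum(frequency.values())

# Each sector angle is the class's relative frequency times 360 degrees.
sector_degrees = {drink: (count / n) * 360 for drink, count in frequency.items()}

for drink, degrees in sector_degrees.items():
    print(f"{drink:<13}{degrees:>7.1f} degrees")
```

The angles sum to 360 degrees because the relative frequencies sum to 1.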
PIE CHART OF SOFT DRINK PURCHASES
Frequency Distribution

A frequency distribution is a tabular summary of data showing the number (frequency)
of items in each of several nonoverlapping classes.
This definition holds for quantitative as well as qualitative data. However, with quantitative data we must be more careful in
defining the nonoverlapping classes to be used in the frequency distribution.

The three steps necessary to define the classes for a frequency distribution with
quantitative data are:
1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
Number of classes: Classes are formed by specifying ranges that will be used to group the
data.
As a general guideline, we recommend using between 5 and 20 classes.
For a small number of data items, as few as five or six classes may be used to summarize the data. For a larger number of data
items, a larger number of classes is usually required.

Year End Audit Times (in Days)


12 14 19 18
15 15 18 17
20 27 22 23
22 21 33 28
14 18 16 13
• Because the number of data items in the audit time data is relatively small (n = 20), we choose to
develop a frequency distribution with five classes.
• The second step in constructing a frequency distribution for quantitative data is to
choose a width for the classes.
As a general guideline, we recommend that the width be the same for each class. Thus the choices of the number of classes and the width of classes are not
independent decisions.

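To determine an approximate class width, we begin by identifying the largest and
smallest data values. Then, with the desired number of classes specified, we can use the
following expression to determine the approximate class width.
$\text{Approximate class width} = \dfrac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$
• For the audit time data, the largest value is 33 and the smallest is 12, so the approximate
class width is (33 - 12)/5 = 4.2, which we round up to five days.
• After deciding to use five classes, each with a width of five days, the next task is to
specify the class limits for each of the classes.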
• Class limits must be chosen so that each data item belongs to one and only one class.
The lower class limit identifies the smallest possible data value assigned to the class.
The upper class limit identifies the largest possible data value assigned to the class.
In developing frequency distributions for qualitative data, we did not need to specify class limits because each data item
naturally fell into a separate class. But with quantitative data, such as the audit times, class limits are necessary to
determine where each data value belongs.
The data show that four values—12, 14, 14, and 13—belong to the 10–14 class. Thus, the
frequency for the 10–14 class is 4.
Continuing this counting process for the 15–19, 20–24, 25–29, and 30–34 classes provides
the frequency distribution. Using this frequency distribution, we can observe the
following:
1. The most frequently occurring audit times are in the class of 15–19 days. Eight of
the 20 audit times belong to this class.
2. Only one audit required 30 or more days.
Other conclusions are possible, depending on the interests of the person viewing the
frequency distribution. The value of a frequency distribution is that it provides insights
about the data that are not easily obtained by viewing the data in their original
unorganized form.
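The three steps (number of classes, class width, class limits) can be sketched in Python using the 20 audit times listed above. The starting lower limit of 10 is the judgment call made in the text, not something the code derives; the variable names are illustrative.

```python
import math

# Year-end audit times in days, as listed in the table above.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Step 1: number of classes (chosen because n = 20 is small).
num_classes = 5

# Step 2: approximate width = (largest - smallest) / number of classes.
approx_width = (max(audit_times) - min(audit_times)) / num_classes
class_width = math.ceil(approx_width)  # rounded up to a convenient whole number

# Step 3: class limits. The first lower limit (10) is a choice, as in the text.
first_lower_limit = 10
classes = [
    (first_lower_limit + k * class_width,
     first_lower_limit + (k + 1) * class_width - 1)
    for k in range(num_classes)
]

# Count how many data values fall within each pair of class limits.
frequency = {
    (lower, upper): sum(lower <= x <= upper for x in audit_times)
    for lower, upper in classes
}

# Class midpoint: halfway between the lower and upper class limits.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

for (lower, upper), count in frequency.items():
    print(f"{lower}-{upper}: {count}")
```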

In some applications, we want to know the midpoints of the classes in a frequency
distribution for quantitative data.
The class midpoint is the value halfway between the lower and upper class limits. For the
audit time data, the five class midpoints are 12, 17, 22, 27, and 32.
Relative Frequency and Percent Frequency Distributions

Recall: The relative frequency is the proportion of the observations belonging to a
class.
The percent frequency of a class is the relative frequency multiplied by 100.

Based on the class frequencies in the frequency distribution and with n = 20, the next table shows the relative
frequency distribution and percent frequency distribution for the audit time data.
RELATIVE FREQUENCY AND PERCENT FREQUENCY DISTRIBUTIONS FOR THE AUDIT
TIME DATA
Dot Plot
• One of the simplest graphical summaries of data is a dot plot.
• A horizontal axis shows the range for the data. Each data value is represented by a
dot placed above the axis.
• The three dots located above 18 on the horizontal axis indicate that an audit time of
18 days occurred three times.
• Dot plots show the details of the data and are useful for comparing the distribution of
the data for two or more variables.

DOT PLOT FOR THE AUDIT TIME DATA


Histogram
A common graphical presentation of quantitative data is a histogram. This graphical
summary can be prepared for data previously summarized in either a frequency,
relative frequency, or percent frequency distribution.

• A histogram is constructed by placing the variable of interest on the horizontal axis
and the frequency, relative frequency, or percent frequency on the vertical axis.
• The frequency, relative frequency, or percent frequency of each class is shown by
drawing a rectangle whose base is determined by the class limits on the horizontal
axis and whose height is the corresponding frequency, relative frequency, or percent
frequency.
HISTOGRAM FOR THE AUDIT TIME DATA

The adjacent rectangles of a histogram touch one another. Unlike a bar graph, a
histogram contains no natural separation between the rectangles of adjacent classes.
This format is the usual convention for histograms.
One of the most important uses of a histogram is to provide information about the shape,
or form, of a distribution. The example below contains four histograms constructed from
relative frequency distributions.
HISTOGRAMS SHOWING DIFFERING LEVELS OF SKEWNESS
• Panel A shows the histogram for a set of data moderately skewed to the left.
• A histogram is said to be skewed to the left if its tail extends farther to the left. This
histogram is typical for exam scores, with no scores above 100%, most of the
scores above 70%, and only a few really low scores.
• Panel B shows the histogram for a set of data moderately skewed to the right.
• A histogram is said to be skewed to the right if its tail extends farther to the right.
An example of this type of histogram would be for data such as housing prices; a
few expensive houses create the skewness in the right tail.
Cumulative Distributions

• A variation of the frequency distribution that provides another tabular summary of
quantitative data is the cumulative frequency distribution.
• The cumulative frequency distribution uses the number of classes, class widths, and
class limits developed for the frequency distribution.
• However, rather than showing the frequency of each class, the cumulative frequency
distribution shows the number of data items with values less than or equal to the
upper class limit of each class.
• The first two columns of the next table provide the cumulative frequency distribution
for the audit time data.
CUMULATIVE FREQUENCY, CUMULATIVE RELATIVE FREQUENCY, AND CUMULATIVE
PERCENT FREQUENCY DISTRIBUTIONS FOR THE AUDIT TIME DATA

To understand how the cumulative frequencies are determined, consider the class with the description “less
than or equal to 24.” The cumulative frequency for this class is simply the sum of the frequencies for all
classes with data values less than or equal to 24. For the frequency distribution, the sum of the frequencies
for classes 10–14, 15–19, and 20–24 indicates that 4 + 8 + 5 = 17 data values are less than or equal to 24.
Hence, the cumulative frequency for this class is 17.
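The running-sum idea behind the cumulative distributions can be sketched as follows. The class frequencies are those obtained by counting the 20 audit times into the classes 10–14 through 30–34; the names are illustrative.

```python
from itertools import accumulate

# Upper class limits and class frequencies for the audit time data.
upper_limits = [14, 19, 24, 29, 34]
frequency = [4, 8, 5, 2, 1]
n = sum(frequency)

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequency))

# Cumulative relative and percent frequencies follow by dividing by n.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * c / n for c in cumulative]

for limit, c, rel, pct in zip(upper_limits, cumulative,
                              cumulative_relative, cumulative_percent):
    print(f"Less than or equal to {limit}: {c:>2}  {rel:.2f}  {pct:.0f}%")
```

Plotting `upper_limits` against any of the three cumulative lists gives the points of the ogive described in the next section.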
Ogive
A graph of a cumulative distribution, called an ogive, shows data values on the horizontal
axis and either the cumulative frequencies, the cumulative relative frequencies, or the
cumulative percent frequencies on the vertical axis.
OGIVE FOR THE AUDIT TIME DATA
Cross tabulation

A cross tabulation is a tabular summary of data for two variables.


Let us illustrate the use of a crosstabulation by considering the following application
based on data from Zagat’s Restaurant Review.
The quality rating and the meal price data were collected for a sample of 300 restaurants located in the Los Angeles area.

QUALITY RATING AND MEAL PRICE FOR 300 LOS ANGELES RESTAURANTS
CROSSTABULATION OF QUALITY RATING & MEAL PRICE OF 300 L.A. RESTAURANTS
From the percent frequency distribution we see that 28% of the restaurants were rated
good, 50% were rated very good, and 22% were rated excellent.
Dividing the totals in the bottom row of the cross tabulation by the total for that row
provides a relative and percent frequency distribution for the meal price variable.
Below are the admission rates of Men and Women in a University in two subjects

Claim: The University was biased towards women.

True or False?

Men Women
Subject 1 14% (168 of 1200) 15% (270 of 1800)
Subject 2 50% (400 of 800) 51% (102 of 200)
Below are the admission rates of Men and Women in a University in two subjects

Claim: The University was biased towards women.

False!

            Men                  Women
Subject 1   14% (168 of 1200)    15% (270 of 1800)
Subject 2   50% (400 of 800)     51% (102 of 200)
Total       28% (568 of 2000)    19% (372 of 2000)
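The reversal in the table can be reproduced with a short sketch: women have the higher admission rate within each subject, yet the lower rate once the subjects are pooled, because most women applied to the subject with the low admission rate. This pattern is commonly known as Simpson's paradox. The counts are those in the table; the function names are illustrative.

```python
# (admitted, applicants) for each subject, taken from the table above.
men = {"Subject 1": (168, 1200), "Subject 2": (400, 800)}
women = {"Subject 1": (270, 1800), "Subject 2": (102, 200)}


def rate(admitted, applicants):
    """Admission rate within a single subject."""
    return admitted / applicants


def pooled_rate(groups):
    """Admission rate after adding up admits and applicants across subjects."""
    admitted = sum(a for a, _ in groups.values())
    applicants = sum(n for _, n in groups.values())
    return admitted / applicants


for subject in men:
    print(f"{subject}: men {rate(*men[subject]):.0%}, women {rate(*women[subject]):.0%}")

print(f"Total: men {pooled_rate(men):.0%}, women {pooled_rate(women):.0%}")
```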
Scatter Diagram and Trend line
A scatter diagram is a graphical presentation of the relationship between two
quantitative variables, and a trend line is a line that provides an approximation of the
relationship.
SAMPLE DATA FOR THE STEREO AND SOUND EQUIPMENT STORE
SCATTER DIAGRAM AND TRENDLINE FOR THE STEREO AND SOUND EQUIPMENT STORE
TYPES OF RELATIONSHIPS DEPICTED BY SCATTER DIAGRAMS
Measures of Central Tendency

• Median ➢ Middle value when sorted in order of magnitude = 50th percentile

• Mode ➢ Most frequently-occurring value

• Mean ➢ Average
Example – Median
Billions   Sorted Billions
33         18
26         18
24         18
21         18
19         19
20         20
18         20
18         20
52         21
56         22
27         22  ⇽ Median
22         23
18         24
49         26
22         27
20         32
23         33
32         49
20         52
18         56
The median is the middle value of data sorted in order of magnitude. It is the 50th percentile.
Position of the 50th percentile: (20+1)(50/100) = 10.5
The 10th and 11th sorted values are both 22, so Median = 22 + (.5)(22 - 22) = 22
Example - Mode

Mode = 18

The mode is the most frequently occurring value. It is the
value with the highest frequency.
Arithmetic Mean or Average

The mean of a set of observations is their average - the sum of the observed values
divided by the number of observations.

Population Mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
Sample Mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$


Example – Mean
Billions   Sorted Billions
33         18
26         18
24         18
21         18
19         19
20         20
18         20
18         20
52         21
56         22
27         22
22         23
18         24
49         26
22         27
20         32
23         33
32         49
20         52
18         56
Sum = 538
Mean = 538/20 = 26.9
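The three measures of central tendency for this example can be verified with Python's standard `statistics` module. This is only a quick check on the 20 values in the table, not part of the original slides.

```python
import statistics

# The 20 values (in billions) from the example table, in their original order.
data = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
        27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

total = sum(data)                  # sum of the observed values
mean = statistics.mean(data)       # sum divided by the number of observations
median = statistics.median(data)   # average of the two middle sorted values (n is even)
mode = statistics.mode(data)       # most frequently occurring value

print(f"Sum = {total}, Mean = {mean}, Median = {median}, Mode = {mode}")
```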
Measures of Variability or Dispersion

• Range
✓ Difference between maximum and minimum values
• Interquartile Range
✓ Difference between third and first quartile (Q3, Q1)
• Variance
✓ Average* of the squared deviations from the mean
• Standard Deviation
✓ Square root of the variance
Variance and Standard Deviation

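Population Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
The standard deviation ($\sigma$ or $s$) is the square root of the corresponding variance.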
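As a rough illustration of the population and sample versions, the sketch below applies both to the 20 values from the mean example. Whether those values should be treated as a population or a sample is not stated in the slides, so both results are shown; the only difference is the divisor (N versus n - 1).

```python
import math
import statistics

# The 20 values (in billions) from the mean example.
data = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
        27, 22, 18, 49, 22, 20, 23, 32, 20, 18]
n = len(data)

# Population variance divides the sum of squared deviations by N.
population_variance = statistics.pvariance(data)
# Sample variance divides the same sum by n - 1.
sample_variance = statistics.variance(data)

# Standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(f"Population: variance = {population_variance:.2f}, sd = {population_sd:.2f}")
print(f"Sample:     variance = {sample_variance:.2f}, sd = {sample_sd:.2f}")
```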


Skewness and Kurtosis
• Skewness
➢ Measure of the degree of asymmetry of a frequency distribution
➢ Skewed to left
➢ Symmetric or un-skewed
➢ Skewed to right

• Kurtosis
➢ Measure of flatness or peakedness of a frequency distribution
➢ Platykurtic (relatively flat)
➢ Mesokurtic (normal)
➢ Leptokurtic (relatively peaked)
Measures of Skewness & Kurtosis

Skewness:

1. Fisher-Pearson coefficient of skewness

2. Galton skewness

Kurtosis:

1. Kurtosis

2. Excess Kurtosis
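The slides list these measures without their formulas, so the sketch below uses the commonly cited textbook definitions as an assumption: Fisher-Pearson skewness as m3 / m2^(3/2), Galton skewness as (Q1 + Q3 - 2·Q2) / (Q3 - Q1), kurtosis as m4 / m2^2, and excess kurtosis as kurtosis minus 3, where m_k is the k-th central moment with divisor n. The quartiles come from `statistics.quantiles`, whose default method may differ slightly from the quartile rule used elsewhere in these slides. Function names are illustrative.

```python
import statistics


def central_moment(values, k):
    """k-th central moment with divisor n (an assumed convention)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def fisher_pearson_skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def galton_skewness(values):
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (q1 + q3 - 2 * q2) / (q3 - q1)


def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def excess_kurtosis(values):
    # A normal distribution has kurtosis 3, so this is 0 for mesokurtic data.
    return kurtosis(values) - 3


# The 20 values (in billions) from the mean example; a few large values
# stretch the right tail, so both skewness measures come out positive.
data = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
        27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

print(f"Fisher-Pearson skewness: {fisher_pearson_skewness(data):.3f}")
print(f"Galton skewness:         {galton_skewness(data):.3f}")
print(f"Kurtosis:                {kurtosis(data):.3f}")
print(f"Excess kurtosis:         {excess_kurtosis(data):.3f}")
```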
[Figures: skewness examples, including a symmetric bimodal distribution; kurtosis examples]
PERCENTILE
A percentile provides information about how the data are spread over the interval from
the smallest value to the largest value.

Approximately p percent of the observations have values less than the pth percentile;
approximately (100 - p) percent of the observations have values greater than the pth
percentile.

The pth percentile is a value such that at least p percent of the observations are less than
or equal to this value and at least (100 - p) percent of the observations are greater than or
equal to this value.
CALCULATING THE pth PERCENTILE

Step 1. Arrange the data in ascending order (smallest value to largest value).
Step 2. Compute an index i

$i = \left(\dfrac{p}{100}\right)(n+1)$

where p is the percentile of interest and n is the number of observations.

Step 3.
(a) If i is not an integer, round up. The next integer greater than i denotes the position
of the pth percentile.
(b) If i is an integer, the pth percentile is the average of the values in positions i and i+1.
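The three steps can be sketched as a small function that follows them as written on the slide. The function name is hypothetical, and the clamping of positions to the range 1..n is an added assumption to keep very small or very large p from indexing outside the data. Other percentile conventions (for example, interpolating between neighboring positions) can give slightly different values.

```python
import math


def percentile(values, p):
    """pth percentile following Steps 1-3 as stated above (hypothetical helper)."""
    ordered = sorted(values)           # Step 1: ascending order
    n = len(ordered)
    i = (p / 100) * (n + 1)            # Step 2: index i

    if not float(i).is_integer():      # Step 3(a): round up to the next position
        position = min(max(math.ceil(i), 1), n)   # clamp: assumption, not in slides
        return ordered[position - 1]

    # Step 3(b): average the values in positions i and i + 1
    lower = min(max(int(i), 1), n)
    upper = min(lower + 1, n)
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


# The 20 values (in billions) from the median example.
data = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
        27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

for p in (25, 50, 75):
    print(f"{p}th percentile: {percentile(data, p)}")
```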
Quartiles – Special Percentiles
Quartiles are the percentage points that break down the ordered data
set into quarters.


• The first quartile is the 25th percentile. It is the point below which lie 1/4 of
the data.
• The second quartile is the 50th percentile. It is the point below which lie 1/2
of the data. This is also called the median.
• The third quartile is the 75th percentile. It is the point below which lie 3/4 of
the data.
Quartiles and Interquartile Range

The first quartile, Q1, (25th percentile) is often called the lower quartile.

The second quartile, Q2, (50th percentile) is often called the median or
the middle quartile.

The third quartile, Q3, (75th percentile) is often called the upper quartile.

The interquartile range is the difference between the first and the third quartiles.
Finding Quartiles
Chebyshev’s Theorem
Chebyshev’s theorem enables us to make statements about the proportion of data
values that must be within a specified number of standard deviations of the mean.

• At least $(1 - 1/z^2)$ of the data values must be within z standard deviations of the
mean, where z is any value greater than 1.

Some of the implications of this theorem, with z = 2, 3, and 4 standard deviations, follow.
• At least .75, or 75%, of the data values must be within z = 2 standard deviations of the mean.
• At least .89, or 89%, of the data values must be within z = 3 standard deviations of the mean.
• At least .94, or 94%, of the data values must be within z = 4 standard deviations of the mean.
At least $\left(1 - \dfrac{1}{k^2}\right)$ of the elements of any distribution lie within k standard deviations of
the mean.
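The bounds quoted for 2, 3, and 4 standard deviations follow directly from the formula, as this short sketch shows. The function name is illustrative.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem applies only for z greater than 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    bound = chebyshev_lower_bound(z)
    print(f"z = {z}: at least {bound:.2f} of the data values")
```

Because the theorem holds for any distribution, these are guaranteed minimums; for a particular data set the actual proportion is often much higher.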
Empirical Rule
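For roughly mound-shaped and symmetric distributions, approximately:
• 68% of the data values will lie within one standard deviation of the mean.
• 95% of the data values will lie within two standard deviations of the mean.
• Almost all (about 99.7%) of the data values will lie within three standard deviations of the mean.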
Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A key to the development of a box plot is the computation of the median and the
quartiles, Q1 and Q3.

The interquartile range, IQR= Q3 - Q1, is also used for identifying outliers.

1. A box is drawn with the ends of the box located at the first and third quartiles.
2. A vertical line is drawn in the box at the location of the median.
3. By using the interquartile range, IQR = Q3 - Q1, limits are located. The limits for the
box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
Data outside these limits are considered outliers.

4. The extended lines are called whiskers. The whiskers are drawn from the ends of the
box to the smallest and largest values inside the limits computed in step 3.
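The numbers behind a box plot can be sketched without drawing anything. The example reuses the 20 values from the median example and takes the quartiles from `statistics.quantiles`, whose default method interpolates between positions; that is an assumption here, and a different quartile rule can move the limits and change which points count as outliers.

```python
import statistics

# The 20 values (in billions) from the median example.
data = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
        27, 22, 18, 49, 22, 20, 23, 32, 20, 18]

# Quartiles (default interpolating method of the statistics module).
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Step 3: limits located 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Data outside the limits are considered outliers.
outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)

# Step 4: whiskers reach the smallest and largest values inside the limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
whiskers = (min(inside), max(inside))

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Limits: {lower_limit} to {upper_limit}")
print(f"Whiskers: {whiskers}, outliers: {outliers}")
```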
BASIC PROBABILITY

Probability is a numerical measure of the likelihood that an event will occur.

• Probability values are always assigned on a scale from 0 to 1.


• A probability near zero indicates an event is unlikely to occur;
• A probability near 1 indicates an event is almost certain to occur.
• Other probabilities between 0 and 1 represent degrees of likelihood that an event will
occur.
➢ We define an experiment as a process that generates well-defined outcomes.
➢ On any single repetition of an experiment, one and only one of the possible
experimental outcomes will occur.
➢ Some examples of experiments and their associated outcomes:

By specifying all possible experimental outcomes, we identify the sample space for an
experiment.
SAMPLE SPACE: The sample space for an experiment is the set of all experimental
outcomes.
➢ An experimental outcome is also called a sample point to identify it as an element
of the sample space.
Consider the first experiment in the preceding table: tossing a coin.
The upward face of the coin, a head or a tail, determines the experimental outcomes
(sample points). If we let S denote the sample space, we can use the following
notation to describe the sample space.
S ={Head, Tail}
The sample space for the second experiment in the table, selecting a part for inspection,
can be described as follows:
S ={Defective, Nondefective}
Both of the experiments just described have two experimental outcomes (sample points).
However, suppose we consider the fourth experiment listed in the table—rolling a die.
The possible experimental outcomes, defined as the number of dots appearing on the
upward face of the die, are the six points in the sample space for this experiment.
S ={1, 2, 3, 4, 5, 6}
Being able to identify and count the experimental outcomes is a necessary step in
assigning probabilities.
Three useful counting rules
1. Multiple-step experiments : The first counting rule applies to multiple-step
experiments.
Consider the experiment of tossing two coins.
Let the experimental outcomes be defined in terms of the pattern of heads and tails appearing on the upward faces of
the two coins.

The experiment of tossing two coins can be thought of as a two-step experiment in
which step 1 is the tossing of the first coin and step 2 is the tossing of the second coin.
We can describe the sample space (S) for this coin-tossing experiment as follows:

S ={(H, H), (H, T), (T, H), (T, T)}

If an experiment can be described as a sequence of k steps with $n_1$ possible outcomes on the first step, $n_2$ possible outcomes on the
second step, and so on, then the total number of experimental outcomes is given by $(n_1)(n_2)\cdots(n_k)$.
2. Combinations: A second useful counting rule allows one to count the number of
experimental outcomes when the experiment involves selecting n objects from a (usually
larger) set of N objects. It is called the counting rule for combinations.
$C^N_n = \binom{N}{n} = \dfrac{N!}{n!\,(N-n)!}$
3. Permutations: A third counting rule that is sometimes useful is the counting rule
for permutations. It allows one to compute the number of experimental outcomes
when n objects are to be selected from a set of N objects where the order of selection
is important.
The same n objects selected in a different order are considered a different
experimental outcome.
$P^N_n = n!\binom{N}{n} = \dfrac{N!}{(N-n)!}$
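The three counting rules can be tried out with Python's standard library. The two-coin experiment is the one from the text; the choice of N = 5 and n = 2 for the combination and permutation counts is a hypothetical example, and `multistep_count` is an illustrative helper name.

```python
import math
from itertools import product

# Rule 1: multiple-step experiments. Enumerate the two-coin sample space.
coin = ["H", "T"]
two_coin_outcomes = list(product(coin, repeat=2))
print(two_coin_outcomes)


def multistep_count(*step_counts):
    """Total outcomes for a k-step experiment: (n1)(n2)...(nk)."""
    return math.prod(step_counts)


print(multistep_count(2, 2))   # two coins, two outcomes per step

# Rule 2: combinations. Select n objects from N when order does not matter.
N, n = 5, 2                    # hypothetical values for illustration
print(math.comb(N, n))         # N! / (n! (N - n)!)

# Rule 3: permutations. Select n objects from N when order matters.
print(math.perm(N, n))         # N! / (N - n)!
```

Each combination of n objects can be arranged in n! orders, which is why the permutation count equals n! times the combination count.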
BASIC REQUIREMENTS FOR ASSIGNING PROBABILITIES

1. The probability assigned to each experimental outcome must be between 0
and 1, inclusively. If we let $E_i$ denote the ith experimental outcome and $P(E_i)$
its probability, then this requirement can be written as
$0 \le P(E_i) \le 1$ for all i

2. The sum of the probabilities for all the experimental outcomes must equal 1.
For n experimental outcomes, this requirement can be written as
$P(E_1) + P(E_2) + \cdots + P(E_n) = 1$
The classical method of assigning probabilities is appropriate when all the
experimental outcomes are equally likely. If n experimental outcomes are possible,
a probability of 1/n is assigned to each experimental outcome.

The relative frequency method of assigning probabilities is appropriate when data
are available to estimate the proportion of the time the experimental outcome will
occur if the experiment is repeated a large number of times.

The subjective method of assigning probabilities is most appropriate when one
cannot realistically assume that the experimental outcomes are equally likely and
when little relevant data are available.
Complement of a Set
• Complement (Not)
▪ a set containing all elements of the universal set that are not in A.
• Intersection (And)
▪ a set containing all elements in both A and B.
• Union (Or)
▪ a set containing all elements in A or B or both.

• Mutually exclusive or disjoint sets
▪ sets having no elements in common, having no intersection, whose intersection is
the empty set.
• Partition
▪ a collection of mutually exclusive sets which together include all possible
elements, whose union is the universal set.
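These set operations map directly onto Python's built-in `set` type. The sketch below uses the die-rolling sample space from earlier as the universal set; the events A and B and the helper `is_partition` are hypothetical choices made for illustration.

```python
# Universal set: sample space for rolling a die.
universal = {1, 2, 3, 4, 5, 6}

# Hypothetical events for illustration.
A = {2, 4, 6}   # an even number is rolled
B = {4, 5, 6}   # a number greater than 3 is rolled

complement_of_A = universal - A   # elements not in A
intersection = A & B              # elements in both A and B
union = A | B                     # elements in A or B or both

# Mutually exclusive (disjoint) sets have an empty intersection.
odd = {1, 3, 5}
print(A.isdisjoint(odd))          # True: no number is both even and odd


def is_partition(parts, universe):
    """True if the parts are mutually exclusive and their union is the universe."""
    combined = set().union(*parts)
    no_overlap = sum(len(part) for part in parts) == len(combined)
    return no_overlap and combined == universe


print(complement_of_A, intersection, union)
print(is_partition([A, odd], universal))   # True
print(is_partition([A, B], universal))     # False: A and B overlap and miss 1 and 3
```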
Sets: A Intersecting with B
Sets: A Union B
Mutually Exclusive or Disjoint Sets
Sets: Partition
Basic Rules for Probability
Welcome to the Monty Hall show!

Behind one of these doors (A, B, C) is a shiny new car. Behind two of these doors are goats.

Our contestant will select a door: "I select DOOR A."

The contestant selected door A. Monty Hall (who knows where the car is) then opened door C.
We now know the car is either behind A or B.

At this point, the contestant is given the option to SWITCH DOORS.

WOULD IT IMPROVE HIS CHANCES TO SWITCH TO DOOR B?

SHOULD HE STAY WITH DOOR A?

DOES IT MATTER?
If contestant SWITCHES from original choice:

He LOSES   He WINS    He WINS
He WINS    He LOSES   He WINS
He WINS    He WINS    He LOSES

He WINS 6 times out of 9.
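The 3-by-3 grid can be reproduced by enumerating the nine equally likely combinations of where the car is and which door the contestant first picks. The sketch assumes the usual rules of the game: the car is equally likely to be behind any door, and the host always opens a goat door other than the contestant's choice, so switching wins exactly when the first pick was wrong.

```python
from fractions import Fraction
from itertools import product

doors = "ABC"

# Enumerate every (car location, original choice) pair: 3 x 3 = 9 cases.
outcomes = []
for car, choice in product(doors, repeat=2):
    # After the host reveals a goat, switching lands on the car
    # if and only if the original choice was not the car.
    switch_wins = choice != car
    outcomes.append((car, choice, switch_wins))

wins_by_switching = sum(won for _, _, won in outcomes)
p_switch = Fraction(wins_by_switching, len(outcomes))
p_stay = 1 - p_switch

for car, choice, won in outcomes:
    print(f"car behind {car}, first pick {choice}: switching {'WINS' if won else 'LOSES'}")

print(f"Switching wins {wins_by_switching} times out of {len(outcomes)} ({p_switch})")
print(f"Staying wins with probability {p_stay}")
```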
A Brain Teaser!

What is the minimum number of people you need in a room, in order to say, with at least 50%
chance, that at least two of them share a common birthday?

(Alternatively)

What is the probability of at least one pair of students sharing birthday in this class?
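One way to explore the brain teaser is to compute the probability that all birthdays in a group are different and subtract it from 1. The sketch assumes 365 equally likely birthdays and ignores leap years; the function names are illustrative.

```python
def prob_shared_birthday(k, days=365):
    """Probability that at least two of k people share a birthday."""
    p_all_distinct = 1.0
    for i in range(k):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1 - p_all_distinct


def min_group_size(threshold=0.5, days=365):
    """Smallest group size whose shared-birthday probability reaches the threshold."""
    k = 1
    while prob_shared_birthday(k, days) < threshold:
        k += 1
    return k


k = min_group_size()
print(f"Minimum group size: {k}")
print(f"P(shared birthday) with {k} people: {prob_shared_birthday(k):.4f}")
```

For the alternative question, call `prob_shared_birthday` with the number of students in the class.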
Conditional Probability
The fact that conditional probabilities can be computed as the ratio of a joint probability
to a marginal probability provides the following general formula for conditional
probability calculations for two events A and B.
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ or $P(B \mid A) = \dfrac{P(A \cap B)}{P(A)}$
Independence of Events
Product Rules for Independent Events
The Law of Total Probability and Bayes’ Theorem
Often, we begin the analysis with initial or prior probability estimates for specific events of interest.
Then, from sources such as a sample, a special report, or a product test, we obtain
additional information about the events. Given this new information, we update the
prior probability values by calculating revised probabilities, referred to as posterior
probabilities. Bayes’ theorem provides a means for making these probability
calculations. The steps in this probability revision process are: prior probabilities,
new information, application of Bayes’ theorem, and posterior probabilities.

PROBABILITY REVISION USING BAYES’ THEOREM
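The revision from prior to posterior probabilities can be sketched as follows. The priors and conditional probabilities are hypothetical numbers chosen only for illustration (two mutually exclusive events A1 and A2, and new information B); they do not come from the slides. The denominator is P(B) obtained from the law of total probability.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Posterior probabilities P(Ai | B) from priors P(Ai) and likelihoods P(B | Ai)."""
    # Joint probabilities P(Ai and B) = P(Ai) * P(B | Ai).
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Law of total probability: P(B) is the sum of the joint probabilities.
    total = sum(joint)
    # Bayes' theorem: divide each joint probability by P(B).
    return [j / total for j in joint]


# Hypothetical inputs (assumptions for illustration only).
priors = [Fraction(65, 100), Fraction(35, 100)]       # P(A1), P(A2)
likelihoods = [Fraction(2, 100), Fraction(5, 100)]    # P(B | A1), P(B | A2)

posteriors = posterior(priors, likelihoods)
for name, value in zip(("A1", "A2"), posteriors):
    print(f"P({name} | B) = {value} (about {float(value):.4f})")
```

With these inputs the new information shifts weight from A1 toward A2, because B is more likely under A2 than under A1.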
