WHAT IS STATISTICS? (DOANE & SEWARD, 2019)
● Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data.
○ Root word: State
● A statistic is a single measure, reported as a number, used to summarize a sample data set; for example, the average height of students in a university.
● Examples of statistics in use – the average height to set the length of the gowns, the maximum height to design the height of the classroom doorways, etc.
KINDS OF STATISTICS (DOANE & SEWARD, 2019)
● Descriptive Statistics refers to the collection, presentation, and summary of data (either using charts and graphs or using numerical summary).
● Inferential Statistics refers to generalizing from a sample to a population, estimating unknown population parameters, drawing conclusions, and making decisions.
STATISTICS (LEVINE ET AL., 2008)
● The branch of mathematics that transforms data into useful information for decision makers.
● Descriptive Statistics
○ Collecting, summarizing, and describing data
○ Collect data – Survey
○ Present data – Tables and graphs
○ Characterize data – Sample mean
● Inferential Statistics
○ Drawing conclusions and/or making decisions concerning a population based only on sample data
○ Estimation – Ex. Estimate the population mean weight using the sample mean weight
○ Hypothesis testing – Ex. Test the claim that the population mean weight is 120 pounds
● Drawing conclusions and/or making decisions concerning a population based on sample results.
Tasks in Statistics (Doane & Seward, 2019)
WHY STUDY STATISTICS? (LEVINE ET AL., 2008)
● Decision Makers Use Statistics To:
○ Present and describe business data and information properly
○ Draw conclusions about large populations, using information collected from samples
○ Make reliable forecasts about a business activity
○ Improve business processes
BASIC VOCABULARY OF STATISTICS (LEVINE ET AL., 2008)
● VARIABLE
○ A variable is a characteristic of an item or individual.
● DATA
○ Data are the different values associated with a variable.
● OPERATIONAL DEFINITIONS
○ Variable values are meaningless unless their variables have operational definitions, universally accepted meanings that are clear to all associated with an analysis.
● POPULATION
○ A population consists of all the items or individuals about which you want to draw a conclusion.
● SAMPLE
○ A sample is the portion of a population selected for analysis.
● PARAMETER
○ A parameter is a numerical measure that describes a characteristic of a population.
● STATISTIC
○ A statistic is a numerical measure that describes a characteristic of a sample.
POPULATION VS. SAMPLE (LEVINE ET AL., 2008)
● Population
○ Measures used to describe the population are called parameters
● Sample
○ Measures computed from sample data are called statistics
DATA MEASUREMENT
● Why collect data?
○ A marketing research analyst needs to assess the
effectiveness of a new television advertisement.
○ A pharmaceutical manufacturer needs to determine
whether a new drug is more effective than those
currently in use.
○ An operations manager wants to monitor a
manufacturing process to find out whether the
1 TJMS, JLLA
quality of product being manufactured is
conforming to company standards.
○ An auditor wants to review the financial transactions
of a company in order to determine whether the
company is in compliance with generally accepted
accounting principles.
SOURCES OF DATA
● Primary Sources: The data collector is the one using the data for analysis.
○ Data from a market survey
○ Data collected from an experiment
○ Observed data
● Secondary Sources: The person performing data analysis is not the data collector
○ Analyzing census data
○ Examining data from print journals, from government agencies or data published on the internet
TYPES OF VARIABLES
● Categorical (qualitative) variables have values that can only be placed into categories, such as “yes” and “no.”
● Numerical (quantitative) variables have values that represent quantities.
● The difference between numerical and categorical data.
NOTE: Ambiguity is introduced when continuous data are rounded to whole numbers.
● The levels of measurement in data and ways of coding data.
LEVELS OF MEASUREMENT
CATEGORICAL VARIABLES
● A nominal scale classifies data into distinct categories in which no ranking is implied.
Categorical Variables | Categories
I-Phone Ownership | Yes / No
Type of Bank Accounts | Savings / Current / Investment
Cable TV Provider | Sky / Destiny
● An ordinal scale classifies data into distinct categories in which ranking is implied.
Categorical Variables | Categories
Student class designation | Freshman, Sophomore, Junior, Senior
Product satisfaction | Satisfied, Neutral, Unsatisfied
Faculty rank | Professor, Associate Professor, Assistant Professor, Instructor
Standard & Poor’s bond ratings | AAA, AA, A, BBB, BB, B, CCC, CC, C, DDD, DD, D
Student Grades | A, B, C, D, F
LEVELS OF MEASUREMENT
NUMERICAL VARIABLES
● An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point.
● A ratio scale is an ordered scale in which the difference between the measurements is a meaningful quantity and the measurements have a true zero point.
LEVELS OF MEASUREMENT
INTERVAL MEASUREMENT
● Data can not only be ranked, but also have meaningful intervals between scale points (e.g., the difference between 60°F and 70°F is the same as the difference between 20°F and 30°F).
● Since intervals between numbers represent distances, mathematical operations can be performed (e.g., average).
● Zero point of interval scales is arbitrary, so ratios are not meaningful (e.g., 60°F is not twice as warm as 30°F).
● Likert Scales
○ A special case of interval data frequently used in survey research.
○ The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).
INTERVAL AND RATIO SCALES
NUMERICAL VARIABLES
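The claim that ratios are not meaningful on an interval scale can be checked numerically. The sketch below is illustrative only: it re-expresses the 60°F and 30°F readings in Kelvin, which is a ratio scale with a true zero. The helper name `fahrenheit_to_kelvin` is an assumption introduced here, not something from the notes.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale, true zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

# Differences are meaningful on an interval scale: both gaps are 10 degrees.
gap_high = 70 - 60
gap_low = 30 - 20

# The naive Fahrenheit ratio suggests "twice as warm" ...
naive_ratio = 60 / 30

# ... but on a scale with a true zero the ratio is only slightly above 1.
kelvin_ratio = fahrenheit_to_kelvin(60) / fahrenheit_to_kelvin(30)

print("equal gaps:", gap_high == gap_low)
print("naive Fahrenheit ratio:", naive_ratio)
print("ratio on the Kelvin scale:", round(kelvin_ratio, 3))
```

The Kelvin ratio comes out at roughly 1.06, which is why "60°F is twice as warm as 30°F" is not a valid statement.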
PARAMETER VS. STATISTIC
● A parameter is a measurement or characteristic of the population.
● A parameter is usually unknown because we can rarely observe the entire population.
● A statistic is a numerical value computed from a sample.
● Statistics can be used as estimates of parameters found in the population.
● Symbols are used to represent population parameters and sample statistics.
● From a sample of n items, chosen from a population, we compute statistics that can be used as estimates of parameters found in the population.
● To avoid confusion, we use different symbols for each parameter and its corresponding statistic.
● Thus, the population mean is denoted by μ (the lowercase Greek letter mu), while the sample mean is x̄.
● The population proportion is denoted π (the lowercase Greek letter pi), while the sample proportion is p.
● Target Population
○ The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative.
■ The target population is the population we are interested in (e.g., U.S. gasoline prices).
■ The sampling frame is the group from which we take the sample (e.g., 115,000 stations).
■ The frame should not differ from the target population.
DATA PRESENTATION USING EXCEL
UNIVARIATE AND BIVARIATE DATA
● Univariate: set of n observations in one variable, e.g., height of Grade 3 pupils, GWA of second year students, etc.
● Bivariate: set of n observations involving two variables, e.g., height and weight of newly born babies, monthly income and expenses of selected employees
ORGANIZING CATEGORICAL DATA
UNIVARIATE
SUMMARY TABLE
● A summary table indicates the frequency, amount, or percentage of items in a set of categories so that you can see differences between categories.
How do you spend the holidays? | Frequency | Percentage
At home with family | 25 | 50
Travel to visit family | 7 | 14
Vacation | 6 | 12
Catching up on work | 8 | 16
Other | 4 | 8
BAR CHART
● In a bar chart, a bar shows each category, the length of which represents the amount, frequency or percentage of values falling into a category.
PIE CHART
● The pie chart is a circle broken up into slices that represent categories. The size of each slice of the pie varies according to the percentage in each category.
BIVARIATE
CROSS TABULATIONS: THE CONTINGENCY TABLE
● A cross-classification (contingency) table presents the results of two categorical variables. The joint responses are classified so that the categories of one variable are located in the rows and the categories of the other variable are located in the columns.
● The cell is the intersection of the row and column and the value in the cell represents the data corresponding to that specific pairing of row and column categories.
● A useful way to visually display the results of cross-classification data is by constructing a side-by-side bar chart.
● Example:
○ A survey was conducted to determine whether students stay in the Philippines during holidays. The result, classified by gender, is as follows:
Where do you stay during holidays? | Male | Female | Total
In the Philippines | 16 | 20 | 36
Other countries | 2 | 12 | 14
Total | 18 | 32 | 50
ORGANIZING NUMERICAL DATA
ORDERED ARRAY
● An ordered array is a sequence of data, in rank order, from the smallest value to the largest value.
Age of Surveyed College Students
Day Students: 16 17 17 18 18 18 19 19 20 20 21 22 22 25 27 32 38 42
Night Students: 18 18 19 19 20 21 23 28 32 33 41 45
STEM AND LEAF DISPLAY
● A stem-and-leaf display organizes data into groups (called stems) so that the values within each group (the leaves) branch out to the right on each row.
Age of College Students
Day Students | Night Students
Stem | Leaf | Stem | Leaf
1 | 67788899 | 1 | 8899
2 | 0012257 | 2 | 0138
3 | 28 | 3 | 23
4 | 2 | 4 | 15
FREQUENCY DISTRIBUTION AND HISTOGRAM
● Bins and Bin Limits
○ A frequency distribution is a table formed by classifying n data values into k classes (bins)
○ Bin limits define the values to be included in each bin. Widths must all be the same.
○ Frequencies are the number of observations within each bin.
○ Express as relative frequencies (frequency divided by the total) or percentages (relative frequency times 100)
● Constructing a Frequency Distribution
1. Find the smallest and largest data values
Quiz Scores in Math 607
55 23 45 50 58
22 34 38 38 42
36 30 30 50 52
54 54 49 54 60
20 27 39 36 47
53 52 57 50 50
37 45 48 55 59
35 55 46 41 42
39 45 42 38 50
2. Choose the number of bins (k)
■ k should be much smaller than n
■ Too many bins results in sparsely populated bins, too few and dissimilar data values are lumped together
■ Herbert Sturges proposes the following rule:
Sample Size (n) | Suggested Number of Bins (k)
16 | 5
32 | 6
64 | 7
128 | 8
256 | 9
512 | 10
1024 | 11
3. Set the bin limits
Bin Width = (xmax − xmin) / k
■ For example, for k = 7 bins, the approximate bin width is:
Bin Width = (60 − 20) / 7 = 5.7 ≈ 6
4. Put the data values in the appropriate bin.
5. Create the table, you can include:
■ Frequencies - counts for each bin
■ Relative frequencies - absolute frequency divided by the total number of data
■ Cumulative frequency - accumulated relative frequency values as bin limit increases
THE HISTOGRAM (BAR CHART)
Quiz Score | Frequency | Relative Frequency | Percentage
20 - 25 | 3 | 0.07 | 6.67
26 - 31 | 3 | 0.07 | 6.67
32 - 37 | 5 | 0.11 | 11.11
38 - 43 | 9 | 0.20 | 20.00
44 - 49 | 7 | 0.16 | 15.56
50 - 55 | 14 | 0.31 | 31.11
56 - 61 | 4 | 0.09 | 8.89
Total | 45 | 1.00 | 100.00
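The five construction steps can be traced in a short Python sketch using the Math 607 quiz scores from the notes. Two points are assumptions rather than statements from the source: Sturges' rule is written as k = 1 + log2(n), which reproduces the suggested-bins table, and the bin width is rounded up to a whole number so that the integer bins match the 20 - 25, 26 - 31, ... limits in the table.

```python
import math

# Quiz scores in Math 607, copied from the notes (n = 45).
scores = [
    55, 23, 45, 50, 58, 22, 34, 38, 38, 42,
    36, 30, 30, 50, 52, 54, 54, 49, 54, 60,
    20, 27, 39, 36, 47, 53, 52, 57, 50, 50,
    37, 45, 48, 55, 59, 35, 55, 46, 41, 42,
    39, 45, 42, 38, 50,
]

# Step 1: smallest and largest data values.
n = len(scores)
x_min, x_max = min(scores), max(scores)

# Step 2: number of bins. Assumption: Sturges' rule as k = 1 + log2(n),
# which gives 5 for n = 16, 6 for n = 32, 7 for n = 64, and so on.
k_sturges = 1 + math.log2(n)   # about 6.5 for n = 45
k = 7                          # the notes' example uses k = 7

# Step 3: bin width (60 - 20) / 7 = 5.7, rounded up to 6.
width = math.ceil((x_max - x_min) / k)

# Step 4: integer bins 20-25, 26-31, ..., 56-61 (both limits inclusive).
bins = [(x_min + i * width, x_min + (i + 1) * width - 1) for i in range(k)]
freq = [sum(lo <= x <= hi for x in scores) for lo, hi in bins]

# Step 5: relative frequencies and percentages.
rel_freq = [f / n for f in freq]

print("Quiz Score  Freq  Rel.Freq  Percent")
for (lo, hi), f, r in zip(bins, freq, rel_freq):
    print(f"{lo:>3} - {hi:<3}   {f:>4}  {r:>8.2f}  {100 * r:>7.2f}")
```

The computed frequencies agree with the table above (3, 3, 5, 9, 7, 14, 4).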
GRAPHICAL ERRORS
● Chart Junk
● No Relative Basis
PRINCIPLES OF EXCELLENT GRAPHS
● The graph should not distort the data.
● The graph should not contain unnecessary adornments (sometimes referred to as chart junk).
● The scale on the vertical axis should begin at zero.
● All axes should be properly labeled.
● The graph should contain a title.
● The simplest possible graph should be used for a given set of data.
CHAPTER SUMMARY
● In this chapter, we have
○ Organized categorical data using the summary table, bar chart, pie chart, and Pareto diagram.
○ Organized numerical data using the ordered array, stem and leaf display, frequency distribution, histogram, polygon, and ogive.
○ Examined cross tabulated data using the contingency table and side-by-side bar chart.
○ Developed scatter plots and time series graphs.
○ Examined the do’s and don'ts of graphically displaying data.
MEASURES OF CENTRAL TENDENCY, POSITION, AND VARIABILITY
DESCRIPTIVE STATISTICS
● The central tendency is the extent to which all the data values group around a typical or central value.
○ Single number that represents all the distribution
● The variation is the amount of dispersion, or scattering, of values.
● The shape is the pattern of the distribution of values from the lowest value to the highest value.
○ Normality of the data. (Skewness etc.)
● Used to describe the data. (Describe first before analysis)
NUMERICAL DESCRIPTION
Characteristic | Interpretation
Central Tendency | Where are the data values concentrated? What seem to be typical or middle data values?
Dispersion or Variation | How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape | Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Statistic | Formula | Excel Formula | Pro | Con
… | … | … | … data values exist. | … be affected by gaps in data values.
Mode | Most frequently occurring data value | =MODE(Data) | Useful for attribute data or discrete data with a small range. | May not be unique, and is not helpful for continuous data.
Trimmed mean | Same as the mean except omit highest and lowest k% of data values (e.g., 5%) | =TRIMMEAN(Data,Percent) | Mitigates effects of extreme values. | Excludes some data values that could be relevant.
Geometric mean (G) | | =GEOMEAN(Data) | Useful for growth rates and mitigates high extremes. | Less familiar and requires positive data.
ARITHMETIC MEAN
● The arithmetic mean (mean) is the most common measure of central tendency.
● For a sample of size n: x̄ = (x₁ + x₂ + … + xₙ) / n
● Mean = sum of values divided by the number of values
● Affected by extreme values (outliers)
● Used if scale data
GEOMETRIC MEAN
● Geometric mean
○ Used to measure the rate of change of a variable over time.
MEASURES OF CENTRAL TENDENCY
● Use the 1-year returns to compute the arithmetic mean and the geometric mean:
LOCATING THE MEDIAN
● The median of an ordered set of data is located at the (n + 1)/2 ranked value.
● If the number of values is odd, the median is the middle number.
● If the number of values is even, the median is the average of the two middle numbers.
● Note that (n + 1)/2 is NOT the value of the median, only the position of the median in the ranked data.
MODE
● Value that occurs most often
● Not affected by extreme values
● Used for categorical data
● Used for numerical primarily when grouped
● There may be no mode
● There may be several modes
TRIMMED MEAN
● Used if your data has outliers
● To calculate the trimmed mean, first remove the highest and lowest k percent of the observations.
● To determine how many observations to trim, multiply k by n and round off the result.
● For example, if k × n = 3.4, round to 3. We would then remove the three smallest and three largest observations before averaging the remaining values.
Geometric Mean Example
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to $100,000 at the end of year two:
1. Convert the rates, –50% and 100%, to decimals (divide by 100) to get –0.5 and 1.00
2. Add 1 to each decimal to get 0.5 and 2
3. Find the geometric mean using the GEOMEAN function
4. Subtract 1 from the answer to get a decimal
5. Convert back to a rate of return (% icon).
CENTRAL TENDENCY
● Growth Rates
○ A variation on the geometric mean used to find the average growth rate for a time series.
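The geometric mean example (the $100,000 investment that falls 50% and then gains 100%) can be reproduced outside Excel. The following is a minimal Python sketch of the same five steps; it uses `math.prod` in place of the spreadsheet's GEOMEAN function, and the variable names are illustrative.

```python
import math

# Yearly rates of return from the example: -50% in year one, +100% in year two.
returns = [-0.50, 1.00]

# Arithmetic mean of the rates: (-0.5 + 1.0) / 2 = 0.25, i.e. +25% per year.
arithmetic_mean = sum(returns) / len(returns)

# Geometric mean: add 1 to each rate, take the n-th root of the product,
# then subtract 1 again to get back to a rate of return.
growth_factors = [1 + r for r in returns]            # [0.5, 2.0]
geometric_mean = math.prod(growth_factors) ** (1 / len(returns)) - 1

print(f"arithmetic mean rate: {arithmetic_mean:.0%}")
print(f"geometric mean rate:  {geometric_mean:.0%}")
```

The arithmetic mean reports a 25% average yearly return even though the investment ends where it started; the geometric mean gives 0%, which matches what actually happened to the money.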
MEDIAN
● In an ordered array, the median is the “middle”
number (50% above, 50% below)
● Not affected by extreme values.
● Used if ordinal data
LOCATING EXTREME OUTLIERS: Z-SCORE
● To compute the Z-score of a data value, subtract the mean and divide by the standard deviation.
● The Z-score is the number of standard deviations a data value is from the mean.
● A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
● The larger the absolute value of the Z-score, the farther the data value is from the mean.
● Suppose the mean math SAT score is 490, with a standard deviation of 100.
● Compute the Z-score for a test score of 620: Z = (620 − 490) / 100 = 1.3
● A score of 620 is 1.3 standard deviations above the mean and would not be considered an outlier.
● Suppose the mean math SAT score is 490, with a standard deviation of 100.
● Compute the scores that lie 3 standard deviations from the mean: 490 − 3(100) = 190 and 490 + 3(100) = 790
● A score less than 190 or greater than 790 would be considered an outlier.
QUARTILE MEASURES
● Quartiles split the ranked data into 4 segments with an equal number of values per segment.
● The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger.
● Q2 is the same as the median (50% are smaller, 50% are larger).
● Only 25% of the values are greater than the third quartile.
○ Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
■ First, note that n = 9.
■ Q1 is at the (9+1)/4 = 2.5 ranked position of the ranked data, so use the value halfway between the 2nd and 3rd ranked values: Q1 = 12.5
■ Q1 and Q3 are measures of non-central location; Q2 = median, a measure of central tendency.
DISPERSION
● Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of dispersion:
MEASURES OF VARIATION
Statistic | Formula | Excel Formula | Pro | Con
Range | | =MAX(Data) - MIN(Data) | Easy to calculate | Sensitive to extreme data values.
Variance (s²) | | =VAR(Data) | Plays a key role in mathematical statistics. | Non-intuitive meaning.
Standard deviation (s) | | =STDEV(Data) | Most common measure. Uses the same units as the raw data ($, £, ¥, etc.). | Non-intuitive meaning.
Coefficient of variation (CV) | | None | Measures relative variation in percent so can compare data sets. | Requires non-negative data.
Mean absolute deviation (MAD) | | =AVEDEV(Data) | Easy to understand. | Lacks “nice” theoretical properties.
MEASURES OF VARIATION
RANGE
● Simplest measure of variation
● Difference between the largest and the smallest values:
● Sensitive to outliers
COMPARING THE STANDARD DEVIATION
VARIANCE
● The population variance (σ²) is defined as the sum of squared deviations around the mean μ divided by the population size.
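Several of the calculations in this part of the notes (the SAT Z-score, the quartile position for the nine-value ordered array, and the dispersion measures in the table) can be checked with a short Python sketch. Some choices here are assumptions: the helper names are illustrative, fractional ranked positions are handled by linear interpolation (the notes only show the halfway case), and the sample versions of the variance and standard deviation are used to mirror Excel's =VAR and =STDEV, whereas the definition just above is the population variance.

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# SAT example from the notes: mean 490, standard deviation 100.
z_620 = z_score(620, 490, 100)                   # (620 - 490) / 100
lower, upper = 490 - 3 * 100, 490 + 3 * 100      # scores at Z = -3 and Z = +3

def value_at_position(ranked, pos):
    """Value at a 1-based ranked position.

    Assumption: a fractional position is linearly interpolated between
    the two neighbouring ranked values (2.5 -> halfway between 2nd and 3rd).
    """
    lo = int(pos)
    frac = pos - lo
    if frac == 0:
        return ranked[lo - 1]
    return ranked[lo - 1] + frac * (ranked[lo] - ranked[lo - 1])

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]      # ordered array from the notes
n = len(data)
q1 = value_at_position(data, (n + 1) / 4)        # position 2.5
median = value_at_position(data, (n + 1) / 2)    # position 5

# Dispersion measures from the table, computed for the same sample.
mean = statistics.mean(data)
data_range = max(data) - min(data)
variance = statistics.variance(data)             # sample variance, like =VAR
std_dev = statistics.stdev(data)                 # sample std. dev., like =STDEV
cv = 100 * std_dev / mean                        # coefficient of variation, %
mad = sum(abs(x - mean) for x in data) / n       # like =AVEDEV

print(f"Z(620) = {z_620:.1f}, outlier limits = {lower} and {upper}")
print(f"Q1 = {q1}, median = {median}, range = {data_range}")
print(f"s^2 = {variance:.2f}, s = {std_dev:.2f}, CV = {cv:.1f}%, MAD = {mad:.2f}")
```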
○ The coefficient compares two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).
○ Calculate the sample’s skewness coefficient as:
○ Skewness is a unit-free statistic.
● Kurtosis
○ Kurtosis is the relative length of the tails and the degree of concentration in the center.
○ Consider three kurtosis prototype shapes.
○ A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.
○ Excel and MINITAB calculate kurtosis as:
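The formulas that followed the skewness and kurtosis bullets did not survive in these notes. As a hedged stand-in, the Python sketch below implements the small-sample formulas documented for Excel's SKEW and KURT functions; treating these as the formulas the notes intended is an assumption. The function names and the five-value sample are illustrative.

```python
import math

def sample_skewness(data):
    """Adjusted skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

def sample_kurtosis(data):
    """Excess kurtosis with small-sample correction (near 0 for a normal shape)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction

values = [1, 2, 3, 4, 5]   # hypothetical, perfectly symmetric sample

skew = sample_skewness(values)
kurt = sample_kurtosis(values)
print(f"skewness = {skew:.3f}")   # symmetric data, so essentially zero
print(f"kurtosis = {kurt:.3f}")   # negative: flatter than a normal curve
```

Both statistics are unit-free because every deviation is divided by the sample standard deviation before it is raised to a power.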
3. Click OK
4. Enter Input Range
5. Check Labels in First Row if data has labels
6. Click OK to get output
ETHICAL CONSIDERATIONS
● Should document both good and bad results.
● Should be presented in a fair, objective and neutral manner.
● Should not use inappropriate summary measures to distort facts.
NORMAL DISTRIBUTION AND TEST OF NORMALITY
CONTINUOUS RANDOM VARIABLE
● A variable that can assume any value on a continuum (can assume an uncountable number of values)
● Examples are as follows:
○ thickness of an item
○ time required to complete a task
○ temperature of a solution
○ Height
● Discrete: One which may take on only a countable number of distinct values
○ Class Size
○ Shoe Size
NORMAL DISTRIBUTION
● It is the most common continuous distribution.
● Also known as the Gaussian distribution or the bell curve.
● In this distribution, the probability that various values occur within certain ranges or intervals can be calculated.
NORMAL DISTRIBUTION PROPERTIES
● ‘Bell Shaped’
● Symmetrical
● Mean, Median and Mode are equal
● Location is characterized by the mean, μ
● Spread is characterized by the standard deviation, σ
● The random variable has an infinite theoretical range: -∞ to +∞
NORMAL PROBABILITIES
● Probability is measured by the area under the curve.
● The total area under the curve is 1.0, and the curve is
symmetric, so half is above the mean, half is below.
● Try this: Given the normal probability, find the value of X.
○ Let X represent the time it takes (in seconds) to download an image file from the internet.
○ Suppose X is normal with mean 8.0 and standard deviation 5.0.
○ Find X such that 20% of download times are less than X.
ASSESSING NORMALITY
● It is important to evaluate how well the data set is approximated by a normal distribution.
● Normally distributed data should approximate the theoretical normal distribution:
○ The normal distribution is bell shaped (symmetrical) where the mean is equal to the median.
○ The empirical rule applies to the normal distribution.
○ The interquartile range of a normal distribution is approximately 1.33 standard deviations.
■ Interquartile Range: Q3 - Q1
● Construct charts or graphs
○ For small- or moderate-sized data sets, do the stem-and-leaf display and box-and-whisker plot look symmetric?
○ For large data sets, does the histogram or polygon appear bell-shaped?
● Compute descriptive summary measures
○ Do the mean, median and mode have similar values?
○ Is the interquartile range approximately 1.33 σ?
○ Is the range approximately 6 σ?
● Observe the distribution of the data set
○ Do approximately 2/3 of the observations lie within mean ± 1 standard deviation?
○ Do approximately 80% of the observations lie within mean ± 1.28 standard deviations?
○ Do approximately 95% of the observations lie within mean ± 2 standard deviations?
● Evaluate normal probability plot
○ Is the normal probability plot approximately linear with positive slope?
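Two of the calculations above can be checked with Python's standard library: the "Try this" download-time problem and the shares of a normal distribution that fall within 1, 1.28, and 2 standard deviations of the mean. This is a minimal sketch using `statistics.NormalDist`; the variable and function names are illustrative.

```python
from statistics import NormalDist

# "Try this": download time X ~ Normal(mean 8.0, standard deviation 5.0).
download = NormalDist(mu=8.0, sigma=5.0)
x_20 = download.inv_cdf(0.20)        # value with 20% of download times below it

# Share of a normal distribution within mean ± c standard deviations.
std_normal = NormalDist()            # mean 0, standard deviation 1

def share_within(c):
    """Probability that a normal value lies within c standard deviations."""
    return std_normal.cdf(c) - std_normal.cdf(-c)

for c in (1, 1.28, 2):
    print(f"within ±{c} sd: {share_within(c):.1%}")
print(f"20th percentile of download time: {x_20:.2f} seconds")
```

The shares come out close to the 2/3, 80%, and 95% benchmarks listed in the checklist, and the download-time answer is a little under 4 seconds.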
THE EMPIRICAL RULE AS APPLIED TO THE NORMAL DISTRIBUTION
● This rule states that for symmetrical bell-shaped data sets, one can find that roughly two out of every three observations are contained within a distance of 1 standard deviation around the mean and roughly …
THE NORMAL PROBABILITY PLOT
● A normal probability plot for data from a normal distribution will be approximately linear:
● The Box and central line are centered between the endpoints if data are symmetric around the median.
● A Box-and-Whisker plot can be shown in either vertical or horizontal format.