Simpson’s Paradox
➔ Can happen for cross tabulations of qualitative data, quantitative data, or a mixture of qualitative & quantitative data.
➔ It happens when conclusions drawn from two or more separate cross tabulations are reversed when the data are aggregated into a single cross tabulation.
Graphical Data Presentation
Categorical/Quantitative
● Pie chart
● Bar chart
○ Simple bar
○ Component/stacked
○ Compound/side by side
● Pictogram
● Map graphs/Cartogram
Bar Graph
Simple bar graph
➔ A graphical device for depicting categorical data that have been summarized in a frequency distribution table, wherein the emphasis is on the frequency or actual count for each category.
➔ Usually used for univariate tables.
Component/stacked bar graph
➔ Bar chart in which each bar is broken into rectangular segments of a different color showing the relative frequency of each category or class, in a manner similar to a pie chart.
Ex:
Dot plot
➔ A graphical device that summarizes data by the number of dots above each data value on the horizontal axis
Ex:
Map graphs
➔ a map is drawn and divided into the desired regions. Each region may be distinguished from other regions using varied lines, shadings with different colors, or other symbols.
➔ It is always accompanied by a legend which tells the meaning of the lines, colors, or other symbols used
Ex:
Scatter Diagram
➔ A graphical display of the relationship
between two quantitative variables. The
independent variable on the x axis and
the dependent variable on the y axis
Ex:
Pareto Diagram
➔ A type of chart that contains both bars and a line graph, where individual values are represented in descending order by bars and the cumulative total is represented by the line
Ex:
Line graph/trendline
➔ A line that provides an approximation of
the relationship between two variables
or the changes in a particular variable
over a span of time
Ex:
Polygon
➔ the line graph in the histogram or pareto
diagram representing the values of the
frequency distribution itself.
Ex:
Histogram
➔ A graphical display of the frequency, relative frequency or percent frequency constructed by placing the class intervals on the horizontal axis and the frequency distribution on the vertical axis
Ogive
➔ the line graph in the histogram or pareto diagram representing the cumulative percent frequency distribution
Ex:
General guidelines on how to increase the likelihood that the display will effectively convey the key information in the data.
1. Give the display a clear and concise title.
2. Keep the display simple. Do not use three dimensions if two dimensions are sufficient.
3. Clearly label each axis and provide the units of measure.
4. If color is used to distinguish categories, make sure that the colors are distinct.
5. If multiple colors or line types are used, use a legend to define how they are used and place the legend close to the representation of the data.
Graphical Excellence
➔ It is the term we apply to techniques that are informative and concise and that impart information clearly to their viewers.
➔ We discuss an equally important concept: Graphical Integrity and its enemy, graphical deception.
5 Characteristics that should be applied to achieve Graphical Excellence
1. The graph presents large data sets concisely and coherently.
2. The ideas & concepts the statistics practitioner wants to deliver are clearly understood by the viewer.
3. The graph encourages the viewer to compare two or more variables.
4. The display induces the viewer to address the substance of the data and not the forms of the graph.
5. There is no distortion of what the graph reveals.
Edward Tufte (professor of Statistics in Yale University) summarized graphical excellence this way:
➔ It is the well-designed presentation of interesting data, a matter of substance, of statistics, and of design.
➔ It is that which gives the viewers the greatest number of ideas in the shortest time with the least ink in the smallest space.
➔ It is nearly always multivariate.
➔ It requires telling the truth about the data.
GRAPHICAL DECEPTION:
1. Graph without a scale on one axis. There is no label or variable to measure on the y-axis although the x-axis is the time period. So what variable is being measured on the y-axis: sales, profit, expenses, or something else?
2. Same graph with different captions. Your impression of the trend might differ depending on which caption you read.
3. Perspectives of the chart are often distorted by showing changes in absolute values rather than percentage changes.
4. Distortions on the chart can be made by drastic stretching of the vertical or horizontal axis or by an expanded scale on either axis.
5. Be on the lookout for size distortions, as in the case of pictograms drawn to enhance the appeal, such that the increase in sales is manifested by an increase in both height and width (a bigger bottle of the soft drink over the years). One is less likely to be misled if the focus is on the numerical values rather than the graph representing the value.
Unit 3 Data Interpretation
DESCRIPTIVE MEASURES
● Measures of Central Tendency or Location
● Measures of Relative Location
● Measures of Dispersion or Spread or Variability
● Measures of Shape
● Measures of Association or Linear Relationship
Measures of central tendency
Mean
➔ The sum of observations divided by the number of observations
Formula:
Ungrouped Grouped
Median
➔ Middle value of the ordered observations
➔ Data must be arranged in an array (sorted order)
Formula:
Ungrouped Grouped
Mode
➔ Most frequent observation
Formula:
Ungrouped Grouped
Ungrouped: the most frequently occurring value.
Grouped: $\text{Mode} = L + \frac{d_1}{d_1 + d_2} \cdot c$, where
L = lower limit of the modal class
d1 = frequency of the modal class less the frequency of the preceding class
d2 = frequency of the modal class less the frequency of the succeeding class
c = class width (class interval size)
Midhinge
Formula:
(Q1 + Q3) / 2
Midrange
Formula:
(LV+ HV) / 2
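The measures above can be sketched in Python as a quick check. This is only an illustration: the `scores` list and the grouped-mode inputs are hypothetical, and the quartiles behind the midhinge use the "inclusive" convention of `statistics.quantiles`, which is an assumption (other quartile conventions give slightly different values).

```python
import statistics

def midrange(values):
    """(LV + HV) / 2: average of the lowest and highest values."""
    return (min(values) + max(values)) / 2

def midhinge(values):
    """(Q1 + Q3) / 2, with quartiles from the 'inclusive' convention (assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q1 + q3) / 2

def grouped_mode(lower_limit, d1, d2, class_width):
    """Mode = L + [d1 / (d1 + d2)] * c for grouped data."""
    return lower_limit + d1 / (d1 + d2) * class_width

scores = [2, 4, 4, 5, 7, 9, 11]  # hypothetical ungrouped data

print("mean:", statistics.mean(scores))
print("median:", statistics.median(scores))
print("mode:", statistics.mode(scores))
print("midhinge:", midhinge(scores))
print("midrange:", midrange(scores))
# Hypothetical modal class: L = 20, d1 = 4, d2 = 2, c = 10
print("grouped mode:", round(grouped_mode(20, 4, 2, 10), 2))
```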
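Variance
Formula:
Ungrouped
Population: $\sigma^2 = \sum (X - \mu)^2 / N$
Sample: $s^2 = \sum (X - \bar{x})^2 / (n - 1)$
Grouped
Population: $\sigma^2 = \sum f (X - \mu)^2 / N$
Sample: $s^2 = \sum f (X - \bar{x})^2 / (n - 1)$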
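Standard deviation
Formula:
Ungrouped
Population: $\sigma = \sqrt{\sum (X - \mu)^2 / N}$
Sample: $s = \sqrt{\sum (X - \bar{x})^2 / (n - 1)}$
Grouped
Population: $\sigma = \sqrt{\sum f (X - \mu)^2 / N}$
Sample: $s = \sqrt{\sum f (X - \bar{x})^2 / (n - 1)}$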
Coefficient of variation
Formula:
Ungrouped Grouped
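As a rough sketch of the dispersion measures, the functions below compute the variance for ungrouped and grouped data and the coefficient of variation. The coefficient of variation is taken here as (standard deviation / mean) × 100, which is the usual definition but is not spelled out in these notes; the data values are hypothetical.

```python
import math

def variance(values, sample=True):
    """Sum of squared deviations divided by n - 1 (sample) or N (population)."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return ss / (n - 1) if sample else ss / n

def grouped_variance(midpoints, freqs, sample=True):
    """Same idea for grouped data: each class midpoint is weighted by its frequency."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    ss = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    return ss / (n - 1) if sample else ss / n

def coefficient_of_variation(values, sample=True):
    """Standard deviation as a percentage of the mean (assumed definition)."""
    mean = sum(values) / len(values)
    return math.sqrt(variance(values, sample)) / mean * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print("population variance:", variance(data, sample=False))
print("population std dev:", math.sqrt(variance(data, sample=False)))
print("sample variance:", round(variance(data), 4))
print("population CV (%):", round(coefficient_of_variation(data, sample=False), 2))
```

The standard deviation is simply the square root of whichever variance is chosen.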
Empirical rule
➔ Applies when the data are believed to have a symmetrical or bell-shaped distribution.
➔ Can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
➔ If the skewness is equal to zero, then the basis of interpreting the standard deviation is the Empirical Rule. For the Empirical Rule, get the 1st up to the 3rd std dev values before making the interpretations.
◆ Approximately 68.26% of the data values will be within one std deviation of the mean.
◆ Approximately 95.44% of the data values will be within two std deviations of the mean.
◆ Approximately 99.74%, or almost all, of the data values will be within three std deviations of the mean.
Ex:
◆ 1st std dev: negative = Mean - Std dev = 497.24; positive = Mean + Std dev = 519.76
Approximately 68.26% of the students have Math scores from 497.24 to 519.76 points, which fall within the 1st std deviation from the mean.
◆ 2nd std dev: negative = Mean - (2*std dev) = 485.98; positive = Mean + (2*std dev) = 531.02
Approximately 95.44% of the students have Math scores from 485.98 to 531.02 points, which fall within the 2nd std deviation from the mean.
◆ 3rd std dev: negative = Mean - (3*std dev) = 474.73; positive = Mean + (3*std dev) = 542.27
Approximately 99.74% of the students have Math scores from 474.73 to 542.27 points, which fall within the 3rd std deviation from the mean.
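The interval arithmetic in this example can be sketched as follows. The mean (508.5) and std dev (11.26) are not stated in the notes; they are back-calculated from the 1st std dev limits 497.24 and 519.76 and should be treated as approximations, so later limits may differ from the example in the last decimal place.

```python
def std_dev_interval(mean, std, k):
    """Return (Mean - k*std dev, Mean + k*std dev)."""
    return (mean - k * std, mean + k * std)

# Approximate share of values within k std devs for bell-shaped data
EMPIRICAL_COVERAGE = {1: 68.26, 2: 95.44, 3: 99.74}

# Assumed values, back-calculated from the example's 1st std dev limits
mean, std = 508.5, 11.26

for k, pct in EMPIRICAL_COVERAGE.items():
    low, high = std_dev_interval(mean, std, k)
    print(f"k={k}: approximately {pct}% of scores fall from {low:.2f} to {high:.2f}")
```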
Chebyshev’s theorem
➔ It enables us to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean for any data set, regardless of the shape of the distribution.
➔ If the skewness is not equal to zero, then the basis of interpreting the standard deviation is
Chebyshev's theorem. For Chebyshev's theorem, get the 2nd up to the 4th std dev values before
making the interpretations.
◆ At least 75% of the data must be within two std deviations of the mean.
◆ At least 89% of the data must be within three std deviations of the mean.
◆ At least 94% of the data must be within four std deviations of the mean.
Ex:
◆ 2nd std dev: negative = Mean - (2*std dev) = 485.98; positive = Mean + (2*std dev) = 531.02
At least 75% of the students have Math scores between 485.98 and 531.02 points, which fall within the 2nd std deviation from the mean.
◆ 3rd std dev: negative = Mean - (3*std dev) = 474.73; positive = Mean + (3*std dev) = 542.27
At least 89% of the students have Math scores between 474.73 and 542.27 points, which fall within the 3rd std deviation from the mean.
◆ 4th std dev: negative = Mean - (4*std dev) = 463.47; positive = Mean + (4*std dev) = 553.53
At least 94% of the students have Math scores between 463.47 and 553.53 points, which fall within the 4th std deviation from the mean.
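The 75%, 89% and 94% figures come from the standard Chebyshev bound 1 - 1/k², where k is the number of std deviations (k > 1). The notes do not write the bound out, so the short sketch below is offered only as a check on those percentages.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of data within k std deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(f"within {k} std devs: at least {chebyshev_min_proportion(k):.2%}")
```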
Determining Outliers
3 approaches to determine outliers:
1. The IQR approach for both symmetrical and asymmetrical data distributions.
➔ Formula for the Lower Limit = Q1-(1.5*IQR) & for the Upper Limit = Q3+(1.5*IQR)
➔ any value beyond the lower and upper limits is considered an Outlier.
2. The Empirical rule for symmetrical data. Any value beyond the 3rd std deviation is considered an
Outlier.
3. Chebyshev's theorem for asymmetrical data. Any value beyond the 4th std deviation is considered
an Outlier.
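The IQR approach can be sketched as below. The data are hypothetical, and the quartiles use the "inclusive" convention of `statistics.quantiles`, which is an assumption; spreadsheet or textbook quartile rules may place Q1 and Q3 slightly differently.

```python
import statistics

def iqr_limits(values):
    """Lower limit = Q1 - 1.5*IQR, upper limit = Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Values beyond either limit are flagged as outliers."""
    low, high = iqr_limits(values)
    return [x for x in values if x < low or x > high]

data = [10, 12, 13, 14, 15, 16, 18, 40]  # hypothetical scores

print("limits:", iqr_limits(data))
print("outliers:", find_outliers(data))
```

For the std deviation approaches, the same check applies with the limits replaced by Mean ± 3*std dev (Empirical rule) or Mean ± 4*std dev (Chebyshev's theorem).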
Kurtosis
➔ peakedness of the distribution
◆ ku = 3: normal or mesokurtic
◆ ku > 3: leptokurtic/positive (thin)
◆ ku < 3: platykurtic/negative (flat)
Skewness
➔ Symmetry of the distribution.
➔ If sk = 0, the distribution is symmetrical, as in the bell-shaped normal distribution.
➔ It is asymmetrical if sk is not 0: sk > 0 means positively (right) skewed and sk < 0 means negatively (left) skewed.
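A minimal sketch of both shape measures, assuming the simple moment-based (population) definitions in which a normal distribution has ku = 3. Statistical software often reports sample-adjusted skewness and "excess" kurtosis (ku - 3), so values from a package may not match these exactly. The data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Third central moment divided by the cube of the std deviation."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment divided by the squared variance (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def describe_shape(sk, ku):
    """Translate sk and ku into the labels used in these notes."""
    skew_label = "symmetrical" if sk == 0 else ("right-skewed" if sk > 0 else "left-skewed")
    kurt_label = "mesokurtic" if ku == 3 else ("leptokurtic" if ku > 3 else "platykurtic")
    return skew_label, kurt_label

data = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetrical
print(describe_shape(skewness(data), kurtosis(data)))
```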
Interpretation examples:
Sum = 30510
Midhinge = 505.5
Midrange = 520.5
Correlation of Math & Science = 0.19
➔ The manner of linear relationship or linear association between Math and Science scores is positive or direct since the correlation coefficient is a positive value, but the degree is very weak or very low since the correlation value falls between 0.01-0.19.
Correlation of Math & Language = 0.12
➔ The manner of linear relationship or linear association between Math and Language scores is positive or direct since the correlation coefficient is a positive value, but the degree is very weak or very low since the correlation value falls between 0.01-0.19.
Covariance of Math & Science = 27.18
➔ The manner of linear relationship or linear association between Math & Science scores is positive or direct since the sign of the coefficient of covariance is positive.
Covariance of Math & Language = 21.78
➔ The manner of linear relationship or linear association between Math & Language scores is positive or direct since the sign of the coefficient of covariance is positive.
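To make the two measures of association concrete, here is a small sketch using the usual sample covariance and Pearson correlation formulas. The paired scores are hypothetical and are not the Math, Science or Language data summarized above.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the two sample std deviations."""
    sx = math.sqrt(sample_covariance(xs, xs))
    sy = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sx * sy)

def direction(coefficient):
    """The sign gives the manner of the linear relationship."""
    if coefficient > 0:
        return "positive or direct"
    if coefficient < 0:
        return "negative or inverse"
    return "no linear relationship"

xs = [1, 2, 3, 4, 5]  # hypothetical paired scores
ys = [2, 4, 5, 4, 5]

print("covariance:", sample_covariance(xs, ys), direction(sample_covariance(xs, ys)))
print("correlation:", round(pearson_r(xs, ys), 4), direction(pearson_r(xs, ys)))
```

The covariance only tells the direction, because its size depends on the units of the variables; the correlation is unit-free, which is why it is the one used to judge the degree of the relationship.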
Note:
➔ To interpret the Standard deviation - you need to look at the value of the skewness
➔ When asked to interpret the default std dev - use the 1st std dev rule for Empirical rule and use
the 2nd std. dev rule for the Chebyshev's theorem.
➔ When asked if there are outlier values and identify the outlier values - identify the rule that is
being used before answering Yes or No and what are the outlier values.
Unit 4 - Probability distributions
● Discrete
● Continuous
Discrete:
● The variables are countable
● The probability function f(x) provides the probability that the random variable x assumes various
values.
● Types
○ Uniform or Univariate
○ Bivariate
○ Binomial
○ Poisson
○ Hypergeometric
1. Uniform or Univariate
- Distribution of a single variable
- Ex. Develop the probability distribution of the number of televisions per household
2. Bivariate
- About the relationship of two variables
- Provides probabilities of combinations of two variables
- Ex. After analyzing several months of sales the owner of an appliance store produces the
probability distribution of the number of refrigerators and stoves sold daily
- Financial Portfolios - amount of investment and interest rate/rate of return
3. Binomial
- The binomial experiment consists of a fixed number of trials, represented by n
- Each trial has two possible outcomes: success or failure
- The probability of success is p and the probability of failure is 1-p
- The trials are independent, which means that the outcome of one trial does not affect the outcomes of any other trials
- If properties 2, 3, and 4 are present, then each trial is a Bernoulli process
- Ex. The leading brand of dishwasher detergent has a 30% market share. A sample of 25 dishwasher detergent customers was taken. What is the probability that 10 or fewer customers chose the leading brand?
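The detergent example can be worked with the standard binomial formula P(X = x) = C(n, x) p^x (1-p)^(n-x), summed from x = 0 to 10. The sketch below assumes n = 25 and p = 0.30 as read from the example; the function names are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """Probability of x or fewer successes."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

# 25 customers, 30% market share, 10 or fewer choose the leading brand
p_ten_or_fewer = binomial_cdf(10, 25, 0.30)
print(f"P(X <= 10) = {p_ten_or_fewer:.4f}")
```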
4. Poisson
- Counts the number of successes that occur in a period of time or an interval of space
- The number of successes that occur in any interval is independent of the number of successes that occur in any other interval
- The probability of success in an interval is the same for all equal-size intervals
- The probability of more than one success in an interval approaches 0 as the interval becomes smaller
- Ex. The number of students who seek assistance with their statistics assignments is Poisson distributed with a mean of 2 per day. What is the probability that no student seeks assistance tomorrow? Find the probability that 10 students seek assistance in a week.
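A sketch of the student-assistance example using the standard Poisson formula P(X = x) = e^(-μ) μ^x / x!. The weekly mean is scaled from the daily mean; treating a week as 7 days (μ = 14) is an assumption, since the example does not say how many days students can seek help.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x successes when the interval mean is mu."""
    return exp(-mu) * mu ** x / factorial(x)

daily_mean = 2
days_per_week = 7  # assumption: help can be sought every day of the week

p_none_tomorrow = poisson_pmf(0, daily_mean)
p_ten_in_week = poisson_pmf(10, daily_mean * days_per_week)

print(f"P(no student tomorrow) = {p_none_tomorrow:.4f}")
print(f"P(10 students in a week) = {p_ten_in_week:.4f}")
```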
5. Hypergeometric
- The trials are not independent and the probability of success changes from trial to trial
Ex.
More and more shoppers prefer to do their holiday shopping online from companies. Suppose that in a group of 10 shoppers, 7 prefer to do their holiday shopping online while 3 prefer to do their holiday shopping in stores. A random sample of 3 of these shoppers is selected for a more in-depth study of how the economy has impacted their shopping behavior. What is the probability that exactly 2 prefer shopping online?
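The shopper example can be checked with the standard hypergeometric formula P(X = x) = C(r, x) C(N - r, n - x) / C(N, n), where N = 10 shoppers, r = 7 online shoppers and n = 3 sampled. The parameter names below are illustrative.

```python
from math import comb

def hypergeometric_pmf(x, population, successes, sample_size):
    """Probability of x successes when sampling without replacement."""
    failures = population - successes
    return (comb(successes, x) * comb(failures, sample_size - x)
            / comb(population, sample_size))

# 10 shoppers, 7 prefer online, sample of 3, exactly 2 prefer online
p_two_online = hypergeometric_pmf(2, 10, 7, 3)
print(f"P(exactly 2 online) = {p_two_online:.3f}")  # 63/120 = 0.525
```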
Continuous
● The variables are uncountable
● In fraction or decimal form
● The probability density function f(x) does not provide the probability values directly. Instead,
probabilities are given by areas under the curve or graph of the probability density function f(x).
● Types
○ Uniform
○ Normal
○ Exponential
○ Others
1. Uniform
- The following requirements apply to a probability density function f(x) whose range is $a \le x \le b$:
- $f(x) \ge 0$ for all x between a and b
- The total area under the curve between a and b is equal to 1.0
2. Normal
- The curve is symmetric about its mean, and the random variable ranges between $-\infty$ and $+\infty$
- It is a two-parameter distribution: it is specified by its mean and standard deviation
3. Exponential
- This is a one-parameter distribution; the distribution is completely specified once the value of lambda (λ) is known
- The mean and standard deviation are equal to each other
- Ex. The time between breakdowns of aging machines is known to be exponentially distributed with a mean of 25 hours. The machine has just been repaired. Determine the probability that the next breakdown occurs more than 50 hours from now.
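The breakdown example follows from the standard exponential tail probability P(X > x) = e^(-λx), with λ = 1 / mean. A short sketch, with illustrative function and variable names:

```python
from math import exp

def exponential_tail(x, mean):
    """P(X > x) for an exponential distribution with the given mean."""
    rate = 1 / mean  # lambda, the single parameter of the distribution
    return exp(-rate * x)

# Mean time between breakdowns is 25 hours; next breakdown after more than 50 hours
p_more_than_50 = exponential_tail(50, 25)
print(f"P(X > 50) = {p_more_than_50:.4f}")  # e^(-2)
```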
4. Others
- T distribution
- Chi-square distribution
- F distribution
UNIT 5
Inferential statistics - concerned with empirical verification (trying to evaluate or verify whether your hypothesis is true, valid, and correct)
hypothesis - a tentative guess or theory that has not yet been verified
- has to be validated; true = accept, false = reject
Hypothesis testing
- the general goal of a hypothesis test is to rule out chance (sampling error) as a plausible explanation for the results from a research study
- is a technique to infer that what is true of a part is true of the whole (inductive reasoning)
- we should minimize sampling error and non-sampling error; sampling error emanates from the incorrect computation of sample size and incorrect application of the sampling methodology/technique
Test of comparisons
- the default is the population standard deviation; otherwise, the sample standard deviation or variance will be given/stated explicitly
Factorial design or two-way ANOVA or two-factor ANOVA with replication
● if the population standard deviation is unknown but the sample size is more than 100/200, the Z-test will be used (when there are many observations, the sample approximates the population)
● if the population standard deviation is known, the Z-test will be used even if the sample size is very small
● if the sample standard deviation is known and the sample size is more than 100/200, you can use the Z-test
● the T-test is used when the sample standard deviation/variance is given and the sample size is small (less than 100)
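The four rules above amount to a small decision procedure, sketched below. The function name and the `large_sample_cutoff` parameter are hypothetical; the notes give the cutoff as "100/200", so it is left adjustable and defaults to 100 here as an assumption.

```python
def choose_test(population_sd_known, sample_size, large_sample_cutoff=100):
    """Pick Z-test or T-test following the rules in these notes."""
    if population_sd_known:
        # Known population std dev: Z-test even for very small samples
        return "Z-test"
    if sample_size > large_sample_cutoff:
        # Only the sample std dev is available, but the sample is large
        return "Z-test"
    # Only the sample std dev is available and the sample is small
    return "T-test"

print(choose_test(population_sd_known=True, sample_size=12))
print(choose_test(population_sd_known=False, sample_size=250))
print(choose_test(population_sd_known=False, sample_size=30))
```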
Mean - Ratio data; the test variables are therefore ratio data
ex: GWA, Income, Prices, Height, etc.
Medians - Interval and Ordinal data
Independent
- having different number of observations
Related measures
- Matched/paired (partnered with another group)
- post test and pre test
- Repeated measures
Marascuilo pairwise comparison procedure
- Related measures for 2 groups or for 3 or more groups means that the groups are either paired or matched and a post test happened, or that it is a case of before-and-after scenarios
- multiple pairwise comparisons - statistical procedures that can be used to conduct comparisons between pairs, such as the Fisher's LSD, Tukey-Kramer, or Marascuilo procedures
Test of relationship
- Pearson's correlation coefficient: linear association between two variables that are continuous (when the variables are ratio)
- you are testing the goodness of fit for a multinomial/normal probability distribution, or we test the parametric linear association between two ratio variables