Data Analysis
• Descriptive Statistics
• Inferential Statistics
Critical Thinking
• “My aunt smoked all her life and
lived to 90. Smoking doesn’t hurt
you.”
• Five customers are asked if the new
product design is an improvement
• If 10 patients try a new medication
and one gets a rash, can we
conclude that the medication caused
the rash?
• Most shark attacks occur between
12 p.m. and 2 p.m.
• Men are taller than women
• 3 crore people are vaccinated
A lottery winner explained how he picked his
six winning numbers
(5-6-8-10-22-69):
the number of people in his family, the birth
date of his wife, the school grade of his 13-
year-old daughter, the sum of his birth date
and his wife’s, the number of years of
marriage, and the year of his birth.
A marketing manager hypothesizes that the
recent uptick in sales is due to a new ad
campaign.
Measure: 50% of all the customers who
visited the stores or website saw the ad
before buying the product.
Conclusion: Conversion rate = 50%
Cola Exclusivity Agreement
A large university with a total enrollment of about 50,000 students has
offered one Cola company (Soft) an exclusivity agreement that would give the
company exclusive rights to sell its products at all university facilities for the
next year with an option for future years.
In return, the university would receive 35% of the on-campus revenues and
an additional lump sum of 5,00,000 per year.
Time Series Data | Cross-Sectional Data
Quantitative (variable)
• Discrete (no. of customers, no. of claims)
• Continuous (salary, price)
Qualitative (attribute)
• Ordinal (customer satisfaction, efficiency of workers, bond rating)
• Nominal (gender, nationality, eye color)
Categorical Data
- Automobile style (e.g., X = full, midsize, compact, subcompact).
- Mutual fund (e.g., X = load, no-load).
- 1 = Bachelor’s, 2 = Master’s, 3 = Doctorate
- 1 = Male, 2 = Female, 3 = Others
Binary data
1 = employed, 0 = not employed
1 = married, 0 = not married
1 = stock price up, 0 = stock price down
1 = churn, 0 = no churn
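A minimal sketch of the 0/1 coding above, using made-up churn labels (the five records are hypothetical, not data from these slides). One convenient property of binary coding is that the mean of the coded variable equals the proportion of 1s.

```python
# Hypothetical customer records; the labels are illustrative only.
churn_labels = ["churn", "no churn", "no churn", "churn", "no churn"]

# Code the attribute as binary data: 1 = churn, 0 = no churn.
churn = [1 if label == "churn" else 0 for label in churn_labels]

# The mean of a 0/1 variable is the share of observations coded 1.
churn_rate = sum(churn) / len(churn)

print(churn)       # coded values
print(churn_rate)  # proportion of customers who churned
```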
Data
• Primary
• Secondary
Data
• Time series (unemployment rate, GDP)
• Cross-sectional (queue length in different SBI branches)
Likert Scales (Ordinal Data)
Primary Uses of Statistics
Data showing
the day of the week each transaction was made,
the type of browser the customer used,
the time spent on the Web site,
the number of Web site pages viewed,
the amount spent by each of the 50 customers.
Data Analysis : Problem 2
A commercial bank has faced a major issue of credit card default globally. In order
to decide whether to issue a card to an applicant, the bank wants to leverage the
database of the customers.
[Figure: share of Defaulter (30%) and Non-Defaulter (70%) customers; bar chart by group (African American, Asian, Caucasian), vertical axis from 0 to 200]
Graphical Presentation
of Quantitative Data
Histogram
Summary for WaitTime
[Figure: histogram of WaitTime, horizontal axis from 0 to 12]
Anderson-Darling Normality Test
A-Squared: 0.24
P-Value: 0.759
Mean: 5.4600
StDev: 2.4755
Variance: 6.1279
Skewness: 0.250415
Kurtosis: -0.404960
N: 100
Minimum: 0.4000
1st Quartile: 3.8000
Median: 5.2500
3rd Quartile: 7.2000
Maximum: 11.6000
95% Confidence Interval for Mean: 4.9688 to 5.9512
95% Confidence Interval for Median: 4.5742 to 5.8773
95% Confidence Interval for StDev: 2.1735 to 2.8757
[Figure: 95% confidence intervals for the mean and the median]
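As a rough check on the WaitTime summary, the 95% confidence interval for the mean can be reproduced from N, the mean, and the standard deviation. This sketch assumes the usual t-based interval, mean ± t × s/√n. The critical value 1.9842 (two-sided 95%, 99 degrees of freedom) is hard-coded here as an assumption, not taken from the slides.

```python
import math

# Values reported in the WaitTime summary.
n = 100
mean = 5.46
stdev = 2.4755

# Assumption: two-sided 95% t critical value for n - 1 = 99 degrees of freedom.
t_crit = 1.9842

standard_error = stdev / math.sqrt(n)
lower = mean - t_crit * standard_error
upper = mean + t_crit * standard_error

print(round(lower, 4), round(upper, 4))
```

The result agrees with the reported interval of 4.9688 to 5.9512 to within rounding.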
20-25 25
25-30 6
Exploratory Data
Analysis
Data and randomness
Three questions that good business managers ask themselves when
they look at “the numbers”:-
Positively/Right Skewed
Negatively/Left Skewed
Dispersion
Describes how similar a set of observations are to each other
or
the degree of deviation (spread) of a set of data from their central
value
• In general, the more spread out a distribution is, the larger the
measure of dispersion will be
Measures of Dispersion
There are four main measures of dispersion:
• Variance
• Standard Deviation
• Mean absolute Deviation
• Quartile Deviation or Semi-Interquartile Range (half of the IQR)
Mean Absolute Deviation
Variance and Standard Deviation
• The standard deviation is defined as the square root
of the variance. The units of measurement for the
standard deviation are the same as the units of the
variable.
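The measures named above can be sketched in a few lines of Python. The eight-value sample is hypothetical, and the sketch assumes the sample versions of the variance and standard deviation (dividing by n − 1), while the mean absolute deviation averages the absolute deviations over all n values.

```python
import math
import statistics

# Hypothetical sample, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)

# Mean absolute deviation: average distance of each value from the mean.
mad = sum(abs(x - mean) for x in sample) / len(sample)

# Sample variance (divides by n - 1) and its square root, the standard deviation.
variance = statistics.variance(sample)
sd = statistics.stdev(sample)

print(mean, mad, variance, sd)
```

The variance is in squared units of the variable, while the standard deviation and the mean absolute deviation are in the variable's own units.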
[Figure: Sales plotted against observation number, 1 to 199]
Interpretation
• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
Coefficient of Variation (CV)
$$CV = \frac{s}{\bar{x}} \times 100$$
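A short sketch of the CV calculation, applied to the mean and standard deviation reported in the WaitTime summary earlier in these slides. The function name is illustrative.

```python
def coefficient_of_variation(stdev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return stdev / mean * 100

# Mean and StDev from the WaitTime summary.
cv_wait_time = coefficient_of_variation(2.4755, 5.46)

print(round(cv_wait_time, 2))
```

Because the CV is unit free, it can be used to compare the relative spread of variables measured on different scales.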
Mean-Variance Analysis and
Sharpe Ratio
• Mean-variance analysis:
✓ The performance of an asset is measured by its rate of return.
✓ The rate of return may be evaluated in terms of its reward
(mean) and risk (variance).
✓ Higher average returns are often associated with higher risk.
• The Sharpe ratio uses the mean and variance to
evaluate risk.
LO 3.5
Mean-Variance Analysis and
Sharpe Ratio
• Sharpe Ratio
✓ Measures the extra reward per unit of risk.
✓ For an investment I, the Sharpe ratio is computed as:
$$\text{Sharpe Ratio} = \frac{\bar{x}_I - \bar{R}_f}{s_I}$$
where $\bar{x}_I$ is the mean return for the investment,
$\bar{R}_f$ is the mean return for a risk-free asset, and
$s_I$ is the standard deviation for the investment.
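A minimal sketch of the Sharpe ratio calculation. The two funds and the risk-free return are hypothetical figures chosen for illustration, all in percent per year.

```python
def sharpe_ratio(mean_return, risk_free_mean, stdev):
    """Extra reward (mean return above the risk-free return) per unit of risk."""
    return (mean_return - risk_free_mean) / stdev

# Hypothetical annual figures, in percent.
risk_free = 2.0
fund_a = sharpe_ratio(mean_return=10.0, risk_free_mean=risk_free, stdev=16.0)
fund_b = sharpe_ratio(mean_return=7.0, risk_free_mean=risk_free, stdev=8.0)

print(fund_a, fund_b)
```

In this made-up comparison, fund B has the lower mean return but the higher Sharpe ratio, so it offers more reward per unit of risk.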
Empirical Rule
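⚫ For roughly mound-shaped and symmetric
distributions, approximately:
• 68% of the values lie within m ± 1s
• 95% of the values lie within m ± 2s
• 99.7% of the values lie within m ± 3s
[Figure: bell curve with the x-axis marked at m – 3s, m – 2s, m – 1s, m, m + 1s, m + 2s, m + 3s]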
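The empirical rule percentages come from the normal distribution. As a sketch, the share of a standard normal distribution lying within 1, 2, and 3 standard deviations of the mean can be computed with Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.4f}")
```

For real data that are only roughly mound-shaped, the observed shares will differ somewhat from these values.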
Chebyshev’s Theorem
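⚫ At least $1 - \frac{1}{k^2}$ of the elements of any
distribution lie within k standard deviations of the
mean
At least the following share lies within k standard deviations of the mean:
k = 2: $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$
k = 3: $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 89\%$
k = 4: $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} \approx 94\%$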
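A small sketch of Chebyshev's bound, compared with the 20 sales values from the quartile example in these slides. The function name is illustrative, and the bound is only informative for k greater than 1.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

# Sales values from the quartile example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

mean = statistics.mean(sales)
sd = statistics.stdev(sales)

# Observed share of sales values within 2 standard deviations of the mean.
k = 2
within = [x for x in sales if abs(x - mean) <= k * sd]
observed_share = len(within) / len(sales)

print(chebyshev_lower_bound(k), observed_share)
```

The observed share is well above the guaranteed minimum, which is typical: Chebyshev's theorem holds for any distribution, so its bound is conservative.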
Standardization of Data
• Purpose: To compare each data point to the natural
range and variation of the dataset.
• Method: For each data value, subtract the sample
mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They
• measure how many standard deviations above or
below the mean a data point is.
• are “unit free”
• have mean zero and SD 1
Standardization of Data
$$z = \frac{x - \bar{x}}{s}$$
A z-score can be positive or negative.
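Standardization can be sketched as follows, using a hypothetical five-value sample. The sketch also checks the two properties listed earlier: the z-scores have mean zero and standard deviation 1.

```python
import statistics

# Hypothetical sample, for illustration only.
values = [10, 20, 30, 40, 50]

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation

# z-score: number of standard deviations a value lies above or below the mean.
z_scores = [(x - mean) / sd for x in values]

print([round(z, 3) for z in z_scores])
print(statistics.mean(z_scores), statistics.stdev(z_scores))
```

Values below the mean get negative z-scores, values above it get positive ones, and a value equal to the mean gets a z-score of zero.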
Measures of Location
Percentiles, Quartiles, and Box-Plots
Quartiles and other percentiles
Percentiles
• Percentiles divide the sorted data into 100 equal groups.
• For example, you score in the 83rd percentile on a standardized
test. That means that 83% of the test-takers scored below you.
• Deciles divide the sorted data into 10 equal groups.
• Quintiles divide the sorted data into 5 equal groups.
• Quartiles divide the sorted data into 4 equal groups.
Uses of Quartiles and other percentiles
Q1 Q2 Q3
• The three values that separate the four groups are called
Q1, Q2, and Q3, respectively.
Interquartile Range
Quartiles
• The second quartile Q2 is the median, a measure of central
tendency.
Q2
Lower 50% | Upper 50%
Q1 Q3
Lower 25% | Middle 50% | Upper 25%
Finding Quartiles (Example)
Position   Sales   Sorted Sales
1          9       6
2          6       9
3          12      10
4          10      12
5          13      13
6          15      14
7          16      14
8          14      15
9          14      16
10         16      16
11         17      16
12         16      17
13         24      17
14         21      18
15         22      18
16         18      19
17         19      20
18         18      21
19         20      22
20         17      24

Quartile         Position (n+1)P/100        Value
First Quartile   (20+1)25/100 = 5.25        13 + (.25)(1) = 13.25
Median           (20+1)50/100 = 10.5        16 + (.5)(0) = 16
Third Quartile   (20+1)75/100 = 15.75       18 + (.75)(1) = 18.75
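The position method used in this example can be sketched in Python. The function name is illustrative. It locates position (n+1)P/100 in the sorted data and interpolates between the two neighbouring values. Statistical packages may use other interpolation rules by default, so their quartiles can differ slightly.

```python
import math

def percentile_by_position(values, p):
    """P-th percentile using the (n + 1)P/100 position rule with interpolation."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p / 100          # 1-based position in the sorted data
    if position <= 1:
        return float(data[0])
    if position >= n:
        return float(data[-1])
    lower = math.floor(position)          # rank of the value just below the position
    fraction = position - lower           # how far to move toward the next value
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

q1 = percentile_by_position(sales, 25)
median = percentile_by_position(sales, 50)
q3 = percentile_by_position(sales, 75)
iqr = q3 - q1

print(q1, median, q3, iqr)
```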
Box Plot
[Figure: box plot with labels: outliers (*), whiskers, box, largest observation]
Elements of a Box Plot
[Figure: box plot annotated with Q1, median, Q3, and the interquartile range]
Inner fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
Outer fences: Q1 - 3(IQR) and Q3 + 3(IQR)
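A sketch of the fence calculation, using the quartiles from the sales example (Q1 = 13.25, Q3 = 18.75). The labels "mild" and "extreme" for points beyond the inner and outer fences are a common convention assumed here, not taken from the slides, and the values 30 and 40 are hypothetical.

```python
def box_plot_fences(q1, q3):
    """Return (inner_low, inner_high, outer_low, outer_high) for a box plot."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr, q1 - 3 * iqr, q3 + 3 * iqr)

def classify(value, q1, q3):
    """Label a value by where it falls relative to the fences."""
    inner_low, inner_high, outer_low, outer_high = box_plot_fences(q1, q3)
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "not an outlier"

q1, q3 = 13.25, 18.75  # quartiles from the sales example
fences = box_plot_fences(q1, q3)

print(fences)
print(classify(24, q1, q3))  # largest sales value in the example
print(classify(30, q1, q3))  # hypothetical value
print(classify(40, q1, q3))  # hypothetical value
```

With these quartiles, none of the 20 sales values (6 to 24) falls outside the inner fences.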
Outliers can be
influential…