Intro and EDA

Introduction to Data Analysis
Prof Shovan Chowdhury

Expectation
Course Outline
Evaluation Components
Project work
Data Statistics Information
• What is Data
• What is Statistics
• Statistics in Business
• Statistical Challenges
• Critical Thinking
Sample and Population
Population (N) Sample (n)

Statistics
- Descriptive Statistics
- Inferential Statistics
Critical Thinking
• “My Aunt smoked all her life and
lived to 90. Smoking doesn’t hurt
you”
• Five customers are asked if the new
product design is an improvement
• If 10 patients try a new medication
and one gets a rash, can we
conclude that the medication caused
the rash?
• Most shark attacks occur between
12p.m. and 2p.m
• Men are taller than women
• 3 crore people are vaccinated
A lottery winner told how he picked his
six-digit winning number
(5-6-8-10-22-69)
number of people in his family, birth
date of his wife, school grade of his 13-
year-old daughter, sum of his birth date
and his wife’s, number of years of
marriage, and year of his birth.
A marketing manager hypothesizes that the recent uptick in sales is due to the new ad campaign.
Measure: 50% of all the customers who visited the stores or website saw the ad before buying the product.
Conclusion: Conversion rate is 50%
Cola Exclusivity Agreement
A large university with a total enrollment of about 50,000 students has
offered one Cola company (Soft) an exclusivity agreement that would give the
company exclusive rights to sell its products at all university facilities for the
next year with an option for future years.
In return, the university would receive 35% of the on-campus revenues and
an additional lump sum of 5,00,000 per year.
Soft has been given 2 weeks to respond.

The market for soft drinks is measured in terms of 200 ml bottles.
Soft currently sells an average of 22,000 bottles per week (over the 40 weeks of the year that the university operates).
The bottles sell for an average of Rs 10 each.
Soft is unsure of its market share but suspects it is considerably less than 50%.
A quick analysis reveals that if its current market share were 25%, then, with an exclusivity agreement, Soft would sell 88,000 bottles per week (22,000 is 25% of 88,000), or 3,520,000 bottles per year.
The profit or loss can then be calculated.
The only problem is that we do not know how many soft drinks are sold weekly at the university.
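The quick analysis above can be sketched in Python. This is only an illustration: the function name `exclusivity_outcome` is hypothetical, the lump sum of 5,00,000 is read as Rs 500,000, and production costs are not given in the case, so the sketch stops at revenue net of the university's share rather than profit.

```python
# Figures taken from the case text.
CURRENT_WEEKLY_BOTTLES = 22_000   # Soft's current average weekly sales
WEEKS_PER_YEAR = 40               # weeks the university operates
PRICE_PER_BOTTLE = 10             # Rs per bottle
REVENUE_SHARE_PERCENT = 35        # share of on-campus revenue paid to the university
LUMP_SUM = 500_000                # Rs per year (assumed reading of 5,00,000)


def exclusivity_outcome(market_share):
    """Return (annual bottles, revenue net of payments to the university).

    market_share is Soft's assumed current share of campus sales (0 < share <= 1).
    Production costs are not modeled because the case does not state them.
    """
    weekly_market = CURRENT_WEEKLY_BOTTLES / market_share   # whole campus market
    annual_bottles = weekly_market * WEEKS_PER_YEAR
    gross_revenue = annual_bottles * PRICE_PER_BOTTLE
    paid_to_university = gross_revenue * REVENUE_SHARE_PERCENT / 100 + LUMP_SUM
    return annual_bottles, gross_revenue - paid_to_university


# The true market share is unknown, so try a few assumed values.
for share in (0.25, 0.40, 0.50):
    bottles, net = exclusivity_outcome(share)
    print(f"share={share:.0%}  bottles/year={bottles:,.0f}  net revenue=Rs {net:,.0f}")
```

Trying several assumed shares shows how sensitive the result is to the one unknown quantity, the total weekly campus consumption.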
Soft assigned a recent university graduate to survey the university's students to supply the missing information.
Accordingly, she organizes a survey that asks 500 students to keep track of the number of soft drinks they purchase in the next 7 days.
Inferential statistics
The information we would like to acquire is an estimate of annual profits from the exclusivity agreement. The data are the numbers of bottles of soft drinks consumed in 7 days by the 500 students in the sample.
We want to know the mean number of soft drinks consumed by all 50,000 students on campus.
To accomplish this goal we need another branch of statistics: inferential statistics.
Inferential statistics is a body of methods used to draw conclusions or inferences about characteristics of populations based on sample data. The population in question in this case is the soft drink consumption of the university's 50,000 students. The cost of interviewing each student would be prohibitive and extremely time consuming. Statistical techniques make such endeavors unnecessary. Instead, we can sample a much smaller number of students (the sample size is 500) and infer from the data the number of soft drinks consumed by all 50,000 students. We can then estimate annual profits for the cola company.
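A minimal sketch of that inference step, assuming a simple random sample and a normal-approximation 95% interval (the slides do not specify a method). The ten purchase counts are made up for illustration and are not the survey results; `estimate_weekly_total` is a hypothetical helper name.

```python
import math
import statistics

POPULATION_SIZE = 50_000   # students on campus (from the case)


def estimate_weekly_total(sample, population_size=POPULATION_SIZE, z=1.96):
    """Scale the sample mean up to the population.

    Returns (point estimate, lower bound, upper bound) for the total number
    of bottles bought per week, using mean +/- z * standard error.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    lower_mean = mean - z * std_err
    upper_mean = mean + z * std_err
    return (mean * population_size,
            lower_mean * population_size,
            upper_mean * population_size)


# Hypothetical 7-day purchase counts (NOT real survey data).
sample = [0, 2, 1, 3, 0, 4, 2, 1, 5, 2]
total, low, high = estimate_weekly_total(sample)
print(f"Estimated weekly bottles: {total:,.0f} (95% interval {low:,.0f} to {high:,.0f})")
```

With the real 500 responses in place of the made-up list, the interval would be much narrower because the standard error shrinks as the sample size grows.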
Data Classification
Qualitative variables (attributes):
- Nominal (gender, nationality, eye color)
- Ordinal (customer satisfaction, efficiency of workers, bond rating)
Quantitative variables:
- Discrete (no. of customers, no. of claims)
- Continuous (salary, price)
Data:
- Time Series Data
- Cross Sectional Data
Categorical Data
- Automobile style (e.g., X = full, midsize, compact, subcompact).
- Mutual fund (e.g., X = load, no-load).
- 1 = Bachelor’s, 2 = Master’s, 3 = Doctorate
- 1 = Male, 2 = Female, 3 = Others
Binary data
1 = employed, 0 = not employed
1 = married, 0 = not married
1 = stock price up, 0 = stock price down
1 = churn, 0 = no churn
Data
- Primary
- Secondary
Data
- Time series (unemployment rate, GDP)
- Cross sectional (queue length in different SBI branches)
Likert Scales (Ordinal Data)
Primary Uses of Statistics
• Descriptive statistics: the collection, organization, presentation and summary of data.
• Inferential statistics: generalizing from a sample to a population, estimating unknown parameters, drawing conclusions, making decisions.
Converting Business
Problem into Decisions
Data Analysis : Problem 1
One chocolate manufacturing company sells quality chocolate products at its plant and retail stores. Two years ago, the company developed a Web site and began selling its products over the Internet. Web site sales have exceeded the company's expectations, and management is now considering strategies to increase sales even further. To learn more about the Web site customers, a sample of 50 transactions was selected from the previous month's sales.
The data show:
- the day of the week each transaction was made,
- the type of browser the customer used,
- the time spent on the Web site,
- the number of Web site pages viewed,
- the amount spent by each of the 50 customers.
A commercial bank has faced a major issue of credit card default globally. To decide whether to issue a card to an applicant, the bank wants to leverage its customer database, which includes demographic and professional details of the customers. The goal is to reduce the loss due to the high default rate.

Overview: A commercial bank issues credit cards to applicants as a revenue-generating avenue. The bank maintains a database of customers that includes demographic and professional details.
Challenge: The bank holds a huge liability due to default in payments by the cardholders.
Objective: The bank wants to leverage the customer database to reduce the loss due to the high default rate.
The Credit data set consists of demographic, professional and card related information about its clients.
The bank considers balance in the credit cards as a measure of default.
Overview: FOOD4U is a major food and beverage company. It sells a number of different products across different markets. It uses "advertising" heavily to promote the products.
Challenge: The company invests a lot in advertising across different media. However, the company is not sure of the utility of advertising.
Objective: To figure out the effectiveness of advertising for the product across different media.
The Advertising data set consists of the sales (in thousands of units) of a particular product in 200 different markets.
It also contains the advertising budgets (in thousands of dollars) for the product in each of the markets for three different media: TV, Radio, and Newspaper.
1 Is there any relationship between advertising budget and sales?
2 How strong is the relationship between advertising budget and sales?
3 Which media contribute to sales?
Good Data Analysis
1 Find the right data
2 Use appropriate statistical tools
3 Clear communication of the numerical information
Graphical Presentation
of Categorical and Time
Series Data
Time Series Plot
Graphs for Categorical Data
[Figure: pie chart of Defaulter (30%) vs Non-Defaulter (70%); bar chart of counts for African American, Asian, and Caucasian]
Graphical Presentation
of Quantitative Data
Histogram
Summary for WaitTime
[Figure: histogram of WaitTime with 95% confidence intervals for the mean and median]
Anderson-Darling Normality Test
  A-Squared      0.24
  P-Value        0.759
Mean             5.4600
StDev            2.4755
Variance         6.1279
Skewness         0.250415
Kurtosis        -0.404960
N                100
Minimum          0.4000
1st Quartile     3.8000
Median           5.2500
3rd Quartile     7.2000
Maximum         11.6000
95% Confidence Interval for Mean:    4.9688 to 5.9512
95% Confidence Interval for Median:  4.5742 to 5.8773
95% Confidence Interval for StDev:   2.1735 to 2.8757
Histogram of Sales (Advertisement Data)
✓ Classes are mutually exclusive.
✓ Classes are exhaustive.
Class width = (Largest value - Smallest value) / Number of classes

Class Interval   Frequency
0-5              3
5-10             42
10-15            80
15-20            44
20-25            25
25-30            6
Exploratory Data
Analysis
Data and randomness
Three questions that good business managers ask themselves when
they look at “the numbers”:-
• What is a typical or central value?
• How much variability is present in the data set?
• Are there unusual shocks/events/cases (shape of the curve)?

Key Performance Measures
Measures of Center
There are three main measures of center:
• Mean (most useful measure)
• Median (generally used under the presence of outliers)
• Mode (used for categorical data)
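As a small illustration, the three measures can be computed with Python's standard `statistics` module. The data are the 20 sales values from the Finding Quartiles example later in these slides; applying them here is a choice made for this sketch.

```python
import statistics

# The 20 sales values from the Finding Quartiles example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

mean_sales = statistics.mean(sales)      # arithmetic average
median_sales = statistics.median(sales)  # middle value of the sorted data
mode_sales = statistics.mode(sales)      # most frequent value

print(f"mean={mean_sales}, median={median_sales}, mode={mode_sales}")
```

Here the mean (15.85) sits slightly below the median and mode (both 16), which hints at a mild pull from the small values 6 and 9.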
[Figure: three distribution shapes: Symmetrical, Positively/Right Skewed, Negatively/Left Skewed]
Dispersion
Describes how similar a set of observations are to each other
or
the degree of deviation (spread) of a set of data from their central
value
• In general, the more spread out a distribution is, the larger the
measure of dispersion will be
Measures of Dispersion
There are four main measures of dispersion:
• Variance
• Standard Deviation
• Mean absolute Deviation
• Quartile Deviation or Semi-Interquartile Range (half of the IQR)
Mean Absolute Deviation
Variance and Standard Deviation
• The standard deviation is defined as the square root of the variance. The units of measurement for the standard deviation are the same as the units of the variable.
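Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$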
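A short sketch of these dispersion measures using Python's standard `statistics` module, again on the 20 sales values from the Finding Quartiles example. The mean absolute deviation is computed by hand as the average absolute distance from the mean, which is the usual definition and is assumed here.

```python
import statistics

# The 20 sales values from the Finding Quartiles example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

mean = statistics.mean(sales)

pop_var = statistics.pvariance(sales)    # divides by N
sample_var = statistics.variance(sales)  # divides by n - 1
pop_sd = statistics.pstdev(sales)        # square root of pop_var
sample_sd = statistics.stdev(sales)      # square root of sample_var

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mean) for x in sales) / len(sales)

print(f"population variance={pop_var:.4f}, sample variance={sample_var:.4f}")
print(f"population SD={pop_sd:.4f}, sample SD={sample_sd:.4f}, MAD={mad:.2f}")
```

The sample versions are slightly larger than the population versions because they divide by n - 1 rather than N.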
[Figure: Standard Deviation, plot of Sales (vertical axis 0 to 30) against observation number]
Interpretation
• The larger the SD/variance is, the more the observations deviate, on
average, away from the mean
• The smaller the SD/variance is, the less the observations deviate, on
average, from the mean
Coefficient of Variation (CV)
• Relative measure (unit free) used for the purpose of comparison of variability.
• Relative measure = (absolute measure / average) × 100
$CV = \frac{s}{\bar{x}} \times 100$
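A minimal sketch of the CV, assuming the sample standard deviation is used for s. The helper name `coefficient_of_variation` is hypothetical. Rescaling the data (for example from thousands of units to units) changes the mean and the SD but leaves the CV unchanged, which is what "unit free" means.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# The 20 sales values from the Finding Quartiles example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

# Same data expressed in a different unit (multiplied by 1000).
sales_rescaled = [x * 1000 for x in sales]

cv_original = coefficient_of_variation(sales)
cv_rescaled = coefficient_of_variation(sales_rescaled)
print(f"CV original: {cv_original:.2f}%  CV rescaled: {cv_rescaled:.2f}%")
```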
Mean-Variance Analysis and
Sharpe Ratio
• Mean-variance analysis:
✓ The performance of an asset is measured by its rate of return.
✓ The rate of return may be evaluated in terms of its reward
(mean) and risk (variance).
✓ Higher average returns are often associated with higher risk.
• The Sharpe ratio uses the mean and variance to
evaluate risk.
• Sharpe Ratio
✓ Measures the extra reward per unit of risk.
✓ For an investment I, the Sharpe ratio is computed as:
$\text{Sharpe Ratio} = \frac{\bar{x}_I - \bar{R}_f}{s_I}$
where $\bar{x}_I$ is the mean return for the investment,
$\bar{R}_f$ is the mean return for a risk-free asset, and
$s_I$ is the standard deviation for the investment.
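The Sharpe ratio can be sketched as follows. The annual returns and the 4% risk-free mean are hypothetical numbers chosen only to illustrate the comparison, and `sharpe_ratio` is an illustrative helper name.

```python
import statistics


def sharpe_ratio(returns, risk_free_mean):
    """(mean return - mean risk-free return) / sample SD of the returns."""
    excess = statistics.mean(returns) - risk_free_mean
    return excess / statistics.stdev(returns)


# Hypothetical annual returns in percent (NOT real fund data).
fund_a = [10, 14, 6, 12, 8]     # lower mean, lower variability
fund_b = [20, -5, 30, 0, 15]    # higher mean, much higher variability
RISK_FREE_MEAN = 4              # hypothetical mean risk-free return, percent

sharpe_a = sharpe_ratio(fund_a, RISK_FREE_MEAN)
sharpe_b = sharpe_ratio(fund_b, RISK_FREE_MEAN)
print(f"Fund A: {sharpe_a:.2f}  Fund B: {sharpe_b:.2f}")
```

In this made-up case Fund B has the higher average return, but Fund A earns more extra reward per unit of risk.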
Empirical Rule
⚫ For roughly mound-shaped and symmetric distributions, approximately:
   68% of values lie within 1 standard deviation of the mean
   95% lie within 2 standard deviations of the mean
   Almost all lie within 3 standard deviations of the mean
Empirical Rule
[Figure: bell curve centered at m, with 68.26% of values within m ± 1s, 95.44% within m ± 2s, and 99.72% within m ± 3s]
Chebyshev's Theorem
⚫ At least $1 - \frac{1}{k^2}$ of the elements of any distribution lie within k standard deviations of the mean.

k   At least                                  Lie within
2   1 - 1/2^2 = 1 - 1/4  = 3/4   = 75%        2 standard deviations of the mean
3   1 - 1/3^2 = 1 - 1/9  = 8/9   = 89%        3 standard deviations of the mean
4   1 - 1/4^2 = 1 - 1/16 = 15/16 = 94%        4 standard deviations of the mean
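A short sketch that compares Chebyshev's guaranteed minimum with the share actually observed in a data set. It uses the 20 sales values from the Finding Quartiles example and the population standard deviation; both helper names are hypothetical.

```python
import statistics


def chebyshev_bound(k):
    """Minimum share of any distribution within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def share_within(values, k):
    """Observed share of values within k population SDs of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / len(values)


# The 20 sales values from the Finding Quartiles example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

for k in (2, 3, 4):
    print(f"k={k}: at least {chebyshev_bound(k):.2%}, observed {share_within(sales, k):.2%}")
```

The observed share is never below the bound; Chebyshev's theorem is conservative because it must hold for every distribution, not only mound-shaped ones.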
Standardization of Data
• Purpose: To compare each data point to the natural
range and variation of the dataset.
• Method: For each data value, subtract the sample mean and divide by the sample standard deviation.
The resulting numbers are called z-values or z-scores. They:
• measure how many standard deviations above or
below the mean a data point is.
• are “unit free”
• have mean zero and SD 1
$z = \frac{x - \bar{x}}{s}$
A z-score can be either positive or negative.
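A minimal sketch of standardization on the 20 sales values from the Finding Quartiles example, using the sample mean and sample standard deviation as described above. The helper name `z_scores` is illustrative.

```python
import statistics


def z_scores(values):
    """Standardize each value: subtract the sample mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


# The 20 sales values from the Finding Quartiles example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

z = z_scores(sales)

# Values above the mean get positive z-scores, values below get negative ones.
for x, score in zip(sales[:3], z[:3]):
    print(f"sales={x:>2}  z={score:+.2f}")

# The standardized values have mean 0 and SD 1 (up to rounding error).
print(f"mean of z = {statistics.mean(z):.6f}, SD of z = {statistics.stdev(z):.6f}")
```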
Measures of Location
Percentiles, Quartiles, and Box-Plots
Quartiles and other percentiles
Percentiles
• Percentiles divide the sorted data into 100 groups.
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles divide the sorted data into 10 groups.
• Quintiles divide the sorted data into 5 groups.
• Quartiles divide the sorted data into 4 groups.
Uses of Quartiles and other percentiles
• Percentiles may be used to establish benchmarks for comparison purposes (e.g. health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to
assess financial performance and stock portfolios.
• Percentiles can be used in employee merit evaluation and
salary benchmarking.
Quartiles
• Quartiles are scale points that divide the sorted data
into four groups of approximately equal size.
Q1 Q2 Q3
Lower 25% | Second 25% | Third 25% | Upper 25%
• The three values that separate the four groups are called
Q1, Q2, and Q3, respectively.
Quartiles and Interquartile Range
• The second quartile Q2 is the median, a measure of central tendency.
Lower 50% | Q2 | Upper 50%
• Q1 and Q3 measure dispersion since the interquartile range Q3 - Q1 measures the degree of spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
Finding Quartiles (Example)
Position of the P-th percentile in the sorted data (n = 20): (n+1)P/100

Sales   Sorted Sales
9       6
6       9
12      10
10      12
13      13
15      14
16      14
14      15
14      16
16      16
17      16
16      17
24      17
21      18
22      18
18      19
19      20
18      21
20      22
17      24

Quartile         Position                   Value
First Quartile   (20+1)25/100 = 5.25        13 + (.25)(1) = 13.25
Median           (20+1)50/100 = 10.5        16 + (.5)(0) = 16
Third Quartile   (20+1)75/100 = 15.75       18 + (.75)(1) = 18.75
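The position rule (n+1)P/100 with linear interpolation can be sketched as below. The helper name `percentile` is hypothetical; the final line cross-checks against `statistics.quantiles`, whose default "exclusive" method follows the same (n+1) positioning.

```python
import statistics


def percentile(values, p):
    """P-th percentile using position (n+1)P/100 and linear interpolation."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) * p / 100      # 1-based position in the sorted data
    whole = int(position)             # observation just below the position
    fraction = position - whole       # how far to move toward the next one
    if whole < 1:                     # position falls before the first value
        return data[0]
    if whole >= n:                    # position falls at or after the last value
        return data[-1]
    return data[whole - 1] + fraction * (data[whole] - data[whole - 1])


# The 20 sales values from the example above (unsorted column).
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

q1 = percentile(sales, 25)
q2 = percentile(sales, 50)
q3 = percentile(sales, 75)
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={q3 - q1}")

# Cross-check with the standard library (default method is 'exclusive').
print(statistics.quantiles(sales, n=4))
```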
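Box Plot
[Figure: box plot with the box, whiskers, outliers, and largest observation labeled]
Elements of a Box Plot
• Box: spans Q1 to Q3 (the interquartile range), with the median marked inside.
• Whiskers: extend to the smallest data point not below the inner fence and the largest data point not exceeding the inner fence.
• Inner fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
• Outer fences: Q1 - 3(IQR) and Q3 + 3(IQR)
• Suspected outlier (*): a point between an inner fence and an outer fence.
• Outlier (*): a point beyond an outer fence.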
Outliers can be influential…
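A sketch of the fence calculations applied to the 20 sales values from the Finding Quartiles example. The quartiles come from `statistics.quantiles`, whose default method matches the (n+1)P/100 rule. The helper names and the extra test values 30 and 40 are hypothetical and are used only to show how points are classified.

```python
import statistics


def fences(q1, q3):
    """Return ((inner low, inner high), (outer low, outer high))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer


def classify(x, inner, outer):
    """Label a point relative to the inner and outer fences."""
    if x < outer[0] or x > outer[1]:
        return "outlier"
    if x < inner[0] or x > inner[1]:
        return "suspected outlier"
    return "regular"


# The 20 sales values from the Finding Quartiles example.
sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

q1, _, q3 = statistics.quantiles(sales, n=4)
inner, outer = fences(q1, q3)
print(f"inner fences={inner}, outer fences={outer}")

flagged = [x for x in sales if classify(x, inner, outer) != "regular"]
print(f"flagged sales values: {flagged}")

# Hypothetical extra values, only to illustrate the labels.
for x in (30, 40):
    print(x, classify(x, inner, outer))
```

None of the 20 sales values falls outside the inner fences, so a box plot of these data would show no outlier markers.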
