SB 2024 Lecture1

Session 1: Introduction
Statistics for Business

Dr. Le Anh Tuan
Course Information
►Lecturer info:
►Le Anh Tuan, Ph.D. in Finance
►Email: tuan.le@isb.edu.vn
►Page: https://sites.google.com/view/anhtuanle
►Research Interests: corporate finance, corporate
governance, labor economics, banking, institutional
quality, policy uncertainty shocks, gender diversity.
Course Information
► Course requirements:
► In-class activities 10%
  ► Homework, Presentation
  ► Students who are absent from more than 3 sessions (over 20% of
    sessions) will not be allowed to sit the final examination.
► Online Quiz 10%
  ► Quiz: online via eLearning
► Group Assignment 20%
► Midterm Exam 20%
  ► 50 MCQs
► Final Exam 40%
  ► MCQs
  ► Long/short questions
Group Assignment (GA)
►Work in groups (3-5 students per group).
►Group list should be submitted in Week 2.
►GA1: Week 4
►GA2: Week 11
Exam
► Mid-term exam:
  ► An in-class closed-book exam, 90 minutes.
  ► 50 multiple choice questions.
  ► Covers Weeks 1-6.
  ► No laptop, no electronic devices.
  ► A one-page formula sheet is provided.
► Final exam:
  ► Closed-book exam, 120 minutes.
  ► 40 multiple choice questions.
  ► Short/long questions.
  ► No laptop, no electronic devices.
  ► A one-page formula sheet is provided.
Course Information
►Required reading:
►Lecture notes.
►Doane, D. P. & Seward, L. E. (2016), Applied Statistics
in Business and Economics, 5th edition, McGraw-Hill.
►Recommended reading:
►Berenson, Mark L., David M. Levine and Timothy C.
Krehbiel (2004), Basic Business Statistics, 9th edition,
Prentice-Hall
Course Information
►Software
►Microsoft Excel and MegaStat are used in the course.
►Others
Tentative schedule
►Session 1: Introduction and Data collection (Chap. 1, 2)
►Session 2: Describing data visually (Chap. 3)
►Session 3: Descriptive statistics (Chap. 4)
►Session 4: Probability (Chap. 5)
►Session 5: Discrete Probability Distributions (Chapter 6)
►Mid-term exam (Week 7)
Tentative schedule
►Session 6: Continuous Probability Distributions (Chapter 7)
►Session 7: Sampling Distributions and Estimation (Chapter 8)
►Session 8: One-sample Hypothesis Tests (Chapter 9)
►Session 9: Two-sample Hypothesis Tests (Chapter 10)
►Session 10: Regression (Chapter 12, 13)
Session 1. Introduction
Introduction
►Statistics
►Variables and Data
►Level of Measurement
►Sampling Concepts
►Sampling Methods
What is Statistics?
► Statistics is the science of collecting, organizing, analyzing,
interpreting, and presenting data, and of drawing conclusions from
data.
► Examples:
  ► Average score of students in class
  ► The maximum value of employee bonus
  ► The minimum wage of Asian countries
  ► The range of Vinamilk stock price
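As a minimal sketch of the kinds of summaries listed above, the Python snippet below computes an average, a maximum, a minimum, and a range. The scores are hypothetical values chosen only for illustration; they are not course data.

```python
from statistics import mean

# Hypothetical exam scores for a small class (illustrative values only).
scores = [6.5, 7.0, 8.5, 9.0, 5.0]

average_score = mean(scores)      # typical value of the data
highest = max(scores)             # maximum value
lowest = min(scores)              # minimum value
score_range = highest - lowest    # spread between the two extremes

print(average_score, highest, lowest, score_range)
```

Each of these numbers describes the data at hand, which is the descriptive side of statistics.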
Two Branches of Statistics
► Statistics: the branch of mathematics that transforms data into useful
information for decision makers.
► Descriptive Statistics: collecting, summarizing, presenting and
analyzing data.
► Inferential Statistics: using data collected from a small group to
draw conclusions about a larger group.
Variables and Data
►Data are facts and figures collected for analysis, presentation and
interpretation.
►Variable: a characteristic of the items that we want to study (e.g.,
name, gender, DOB, GDP growth, GDP per capita, inflation, final
score, mid-term score, spending).
►Observation: a single member of the collection of items that we want
to study, such as a student, firm, or region. An observation is a set
of variable values.
►Data set: all the values of all of the variables for all of the
observations we chose to study.
Variables and Data
Variables
               Employee Name    Gender   DOB           Annual Income in $
Observation 1  Gladys Simpson   Female   1-May-1971    120,000
Observation 2  Divid Hinds      Male     17-Dec-1968   135,000
Observation 3  Kenneth Henry    Male     3-Sep-1965     98,000
This dataset has 3 observations.
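One possible way to hold this small data set in code is sketched below, assuming each observation is stored as a Python dictionary whose keys are the variables. The key names (`name`, `gender`, `dob`, `income`) are illustrative choices, not part of the course material.

```python
# Each dictionary is one observation; each key is one variable.
employees = [
    {"name": "Gladys Simpson", "gender": "Female", "dob": "1-May-1971", "income": 120_000},
    {"name": "Divid Hinds", "gender": "Male", "dob": "17-Dec-1968", "income": 135_000},
    {"name": "Kenneth Henry", "gender": "Male", "dob": "3-Sep-1965", "income": 98_000},
]

n_observations = len(employees)            # number of rows
variables = list(employees[0].keys())      # column names
n_values = n_observations * len(variables) # size of the whole data set

print(n_observations, variables, n_values)
```

The data set is the full collection of values: 3 observations times 4 variables.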
Variables and Data
►Categorical or qualitative data have values that are
described by words; they may be coded numerically.
►Numerical or quantitative data come from counting,
measuring, or a mathematical operation.
Types of Dataset
► Cross-sectional data
► Time series data
► Pooled cross sections
► Panel/Longitudinal data
Cross-sectional data
obs    wage    educ   Exp   Age   female
1      20.40   20     10    34    1
2      10.29   15     8     30    0
…      …       …      …     …     …
500    40.39   30     12    45    1
► 1 observation = information about 1 cross-sectional unit
► Cross-sectional units: individuals, households, firms, cities, states
► Data taken at a given point in time
► Typical assumption: units form a random sample from the whole
population → the notion of independence of the units’ values
► Possible violations:
  • Censoring: wealthier families are less likely to disclose their wealth
  • Small population: neighboring states influence one another, so their
    indicators are not independent.
Time series data
Year   Stock Price   Volume    Stock Return   High Price   Low Price
       (Netflix)     (x1000)   (%)
2000   15.00         1,000     4              23           8.00
2001   16.20         7,000     8              25           10.97
…      …             …         …              …            …
2022   50.33         24,000    12             100          20.22
► Observations on economic variables over time: stock prices, money supply, CPI,
GDP, inflation rates,…
► Frequencies: daily, weekly, monthly, quarterly, annually
► Ordering is important here!
  • The behaviour of economic subjects (and the resulting indicators) evolves
    gradually over time
  • Lags in economic behaviour (stock prices today affect next month’s actions)
► Typically, observations cannot be considered independent across time →
require more complex econometric techniques.
Pooled cross sections
obs    Year   wage    educ   Exp   Age   female
1      2010   20.40   20     10    34    1
…      …      …       …      …     …     …
500    2010   10.29   15     8     30    0
501    2020   15.02   12     4     40    1
…      …      …       …      …     …     …
1000   2020   40.39   30     12    45    1
► Both cross-sectional and time-series features
► Data collected in multiple (typically, two) points in time
► Ordering is not crucial, year is recorded as an additional variable
► Often used to evaluate the effect of a policy change
  • collect data before and after the policy change and see how the
    relationship between the variables changes
► Note: in the second time period, the cross-sectional units need be neither
distinct from nor identical to those in the first period.
Panel/Longitudinal Data
Firm   Year   Size   Debt Ratio (%)   Tangible Ratio (%)   Age   Cash ratio
1      2020   4.44   15               25                   30    10
1      2021   4.98   14               39                   31    15
1      2022   5.02   23               41                   32    21
2      2020   3.43   32               20                   4     22
2      2021   7.53   4                11                   5     11
2      2022   8.49   10               18                   6     10
…      …      …      …                …                    …     …
► Panel data are a collection of cross-sectional data for at least two different
points/periods of time.
► Unlike with pooled cross sections, the same units are measured over time
  • More difficult/costly to obtain the data
  • Have several advantages over (pooled) cross sections (for problems
    where panel data make sense)
Panel/Longitudinal Data
► Use of a double index it, where i (firm) = 1,...,n and t (year) = 1,...,T.
► Typical problem: missing values - for some units and periods there are no
data.
► May be balanced panel data (every unit is observed in every period) or
unbalanced panel data.
► Example: firm performance of 500 Vietnamese listed firms from 2000 to
2019, where all firms were chosen for the sample in 2000 and kept
fixed for all subsequent years (T = 20 years, n = 500, N = n × T = 10,000)
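The double index and the count N = n × T can be sketched in Python as follows. The helper `is_balanced` is a hypothetical function written for this illustration; it treats a panel as balanced when every (firm, year) pair appears exactly once.

```python
from itertools import product


def is_balanced(panel_index, units, periods):
    """Return True when every (unit, period) pair appears exactly once."""
    return sorted(panel_index) == sorted(product(units, periods))


firms = range(1, 501)        # i = 1,...,n with n = 500
years = range(2000, 2020)    # t covers 2000 to 2019, so T = 20

index = list(product(firms, years))   # one (i, t) pair per observation
N = len(index)                        # total firm-year observations
balanced = is_balanced(index, firms, years)

# Dropping a single firm-year (a missing value) makes the panel unbalanced.
unbalanced = is_balanced(index[:-1], firms, years)

print(N, balanced, unbalanced)
```

With 500 firms and 20 years the index holds 10,000 firm-year pairs, matching the example above.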
Question
► What types of datasets are the following (cross-sectional,
pooled cross-sectional, time series, panel data)?
1. A performance survey of banks in Ho Chi Minh City in 2022. (Answer: cross-sectional)
2. Financial ratios of 2500 SMEs over the 2000-2020 period. (Answer: panel)
3. A happiness index of Vietnam between 2000 and 2020. (Answer: time series)
4. Income surveys of citizens in Ho Chi Minh City in 2000 and 2010. (Answer: pooled cross sections)
5. Economic policy uncertainty indices of 40 countries around the world for a 20-year
period. (Answer: panel; N = 40 × 20 = 800)
6. Monthly stock information (close price, high price, return) of Vinamilk in
2023. (Answer: time series; n = 12)
7. Unemployment rates of Asian countries in the year of Covid-19 (2019). (Answer: cross-sectional)
8. Sales revenue of firms in the retail industry in 2020. (Answer: cross-sectional)
9. Annual income of a group of employees at a company for four consecutive
years. (Answer: panel; e.g. with 40 employees, 40 × 4 observations)
10. Survey about customer satisfaction in a supermarket on a given day. (Answer: cross-sectional)
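The reasoning behind these answers can be sketched as a small decision rule. The function `dataset_type` below is a hypothetical helper, not a standard library routine; it assumes a data set can be summarised by how many units and periods it covers and whether the same units are followed over time.

```python
def dataset_type(n_units: int, n_periods: int, same_units: bool = True) -> str:
    """Classify a data set from its shape (a simplified rule of thumb)."""
    if n_units < 1 or n_periods < 1:
        raise ValueError("need at least one unit and one period")
    if n_periods == 1:
        return "cross-sectional"        # many units, one point in time
    if n_units == 1:
        return "time series"            # one unit followed over time
    # Several units and several periods: it depends on whether the
    # same units are observed in every period.
    return "panel" if same_units else "pooled cross sections"


# Question 2: 2500 SMEs over 2000-2020 (21 years), same firms each year.
print(dataset_type(2500, 21))
# Question 3: one country observed over 21 years.
print(dataset_type(1, 21))
# Question 4: two survey waves; 1000 respondents is an assumed figure,
# and the respondents are assumed to differ between waves.
print(dataset_type(1000, 2, same_units=False))
```

Real data sets can be less clear-cut, so the rule is a starting point rather than a definition.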
Collecting the data
Scales of Measurement
► Scales of measurement include:
► Nominal
► Ordinal
► Interval
► Ratio
► The scale determines the amount of information contained in
the data.
► The scale indicates the data summarization and statistical
analyses that are most appropriate.
Nominal Scale
► Data are labels or names used to identify attributes of the
element.
► Example:
► Students of a university are classified by school as Business,
Humanities, Education, and so on.
► Alternatively, a numeric code could be used for the school
variable (e.g. 1 denotes Business, 2 denotes Humanities, 3
denotes Education, and so on).
► No ordering.
Ordinal Scale
► The data have the properties of nominal data and the order or
rank of the data is meaningful.
► Example:
► Students of a university are classified as Freshman, Junior, or
Senior.
► Alternatively, a numeric code could be used for the class
standing variable (e.g. 2 denotes Freshman, 3 denotes Junior,
and so on).
► Ordering, but differences have no meaning.
Interval Scale
► The data have the properties of ordinal data, and the
difference between measurements is meaningful,
but the measurements have no true zero value
(absence of the quantity being measured).
Celsius   Fahrenheit
0 °C      32.0 °F
1 °C      33.8 °F
2 °C      35.6 °F
3 °C      37.4 °F
4 °C      39.2 °F
5 °C      41.0 °F
6 °C      42.8 °F
► Example:
► The difference between a temperature of 0 °C and 20 °C is
the same as the difference between 20 °C and 40 °C, but
you can't say that 40 °C is twice as hot as 20 °C
because the scale does not start at absolute zero
(where there is no temperature).
► Differences have meaning, but ratios have no meaning.
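A short numerical check of this point is sketched below. It uses the standard Celsius to Fahrenheit and Celsius to Kelvin conversions; the function names `c_to_f` and `c_to_k` are illustrative.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15


# Equal differences in Celsius remain equal differences in Fahrenheit.
diff_low = c_to_f(20) - c_to_f(0)     # 36.0
diff_high = c_to_f(40) - c_to_f(20)   # 36.0

# The "ratio" of 40 to 20 depends on the unit, so it carries no meaning.
ratio_c = 40 / 20                     # 2.0
ratio_f = c_to_f(40) / c_to_f(20)     # 104 / 68, about 1.53
ratio_k = c_to_k(40) / c_to_k(20)     # about 1.07 on the absolute scale

print(diff_low, diff_high, ratio_c, ratio_f, ratio_k)
```

Differences survive a change of unit, while the apparent "twice as hot" disappears as soon as the unit changes.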
Ratio Scale
► The data have all the properties of interval data and the ratio
of two values is meaningful.
► The measurements have a true zero value.
► Example:
► Company A has $1 million in total assets, while Company B
has $2 million in total assets. Company B’s firm size is double
that of company A.
► Company C has zero total assets.
► Ratios have meaning.
Ratio Data      Differences between measurements and ratios
Interval Data   Differences between measurements, but no ratios
Ordinal Data    Ordered categories
Nominal Data    Categories (no ordering)
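Because each scale keeps the properties of the one below it, the summaries that make sense accumulate from nominal to ratio. The sketch below encodes one common textbook convention (mode, then median, then mean, then ratios); this mapping is an assumption added for illustration and is not taken from the slides.

```python
# Scales ordered from least to most informative.
SCALES = ["nominal", "ordinal", "interval", "ratio"]


def meaningful_summaries(scale):
    """List summaries usually considered meaningful for a given scale."""
    level = SCALES.index(scale)
    summaries = ["mode"]                          # counting categories
    if level >= 1:
        summaries.append("median")                # needs an ordering
    if level >= 2:
        summaries.append("mean")                  # needs meaningful differences
    if level >= 3:
        summaries.append("ratio of two values")   # needs a true zero
    return summaries


for scale in SCALES:
    print(scale, meaningful_summaries(scale))
```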

Quiz
► Determine the level of measurement (Nominal, Ordinal,
Interval, Ratio)
1. Firms are described as small, medium, and large. (Answer: ordinal)
2. Sales revenue of firms. (Answer: ratio)
3. The number of years in operation of firms. (Answer: ratio)
4. Statistics software (EViews, SPSS, STATA). (Answer: nominal)
5. Product quality rated as fail, average, good, excellent. (Answer: ordinal)
6. Customer satisfaction based on a 7-point Likert scale. (Answer: often treated as interval; strictly ordinal)
7. Industry codes. (Answer: nominal)
8. Income of people in Ho Chi Minh City. (Answer: ratio)
9. IQ scores. (Answer: interval)
10. Weights of students in class. (Answer: ratio)
11. YES/NO question. (Answer: nominal)
12. Children in a school are evaluated and classified as non-readers (0), beginning readers (1),
grade level readers (2), or advanced readers (3). (Answer: ordinal)
Sampling Concept
Population vs. Sample
► A population is the collection of all items of interest or under
investigation; it may be finite or infinite.
► A census is an examination of all items in a defined
population.
► A sample is an observed subset of the population.
Population vs. Sample
[Figure: a population (the letters a to z) and a sample drawn from it
(b, c, g, i, n, o, r, u, y)]
► A population may be treated as infinite when the
population size N is at least 20 times the sample size n
(i.e., when N/n ≥ 20)
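The rule of thumb can be checked with a one-line function. The name `treat_as_infinite` and the example sizes are illustrative assumptions.

```python
def treat_as_infinite(N, n):
    """Apply the rule of thumb N / n >= 20."""
    return N / n >= 20


# A population of 372,000 with a sample of 3,720 gives N/n = 100.
print(treat_as_infinite(372_000, 3_720))
# A population of 500 with a sample of 50 gives N/n = 10.
print(treat_as_infinite(500, 50))
```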
Parameters vs. Statistics
► A parameter is a specific characteristic of a population

► A statistic is a specific characteristic of a sample
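The distinction can be illustrated with a simulated population, as in the sketch below. The population values are generated at random and are purely hypothetical; the population mean plays the role of the parameter and the sample mean plays the role of the statistic.

```python
import random
from statistics import mean

rng = random.Random(42)   # fixed seed so the run is reproducible

# Hypothetical population of 10,000 values (not real data).
population = [rng.gauss(10, 2) for _ in range(10_000)]
mu = mean(population)     # parameter: describes the whole population

# A sample of 100 items drawn without replacement.
sample = rng.sample(population, 100)
x_bar = mean(sample)      # statistic: describes only the sample

print(round(mu, 3), round(x_bar, 3))
```

The statistic is usually close to the parameter but not equal to it, and it changes from one sample to the next.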
Sampling Concepts
► The population must be carefully specified and the
sample must be drawn scientifically so that the sample is
representative.
► The target population is the population we are
interested in (e.g., people in Ho Chi Minh City).
► The sampling frame is the group from which we take the
sample (e.g., 21 Districts + 1 City).
Sampling Methods
► Nonstatistical Sampling: Convenience, Judgment, Focus group
► Statistical Sampling: Simple Random, Systematic, Stratified, Cluster
Nonstatistical Sampling
(Non-random Sampling)
► Convenience Sample
► Use a sample that happens to be available (e.g.,
ask co-worker opinions at lunch).
► Judgment Sample
► Use expert knowledge to choose “typical” items
(e.g., which employees to interview).
► Focus Groups
► In-depth dialog with a representative panel of
individuals (e.g. iPhone users).
Statistical Sampling
► Items of the sample are chosen based on known or
calculable probabilities (probability sampling).
► Methods: Simple Random, Systematic, Stratified, Cluster
► For example, you would like to calculate the average
income of people in District 10 of Ho Chi Minh City.
► Total population in 2021: 372,000.
Simple Random Sampling
► Every member of the population has an equal chance of being selected
► Every possible sample of a given size has an equal chance of being
selected
► Selection may be with replacement or without replacement
► In sampling without replacement, once an element is selected
from the population to be included in the sample, it is not
returned to the population, so the same element cannot be
selected more than once in the same sample.
► In sampling with replacement, the selected element is returned
to the population, so it could potentially be selected multiple
times in the same sample.
► The sample can be obtained using a table of random numbers or a computer
random number generator.
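A computer random number generator can draw both kinds of sample. The sketch below uses Python's standard `random` module and assumes, for illustration only, that the 372,000 residents are numbered 1 to 372,000 and that a sample of 10 is wanted.

```python
import random

rng = random.Random(2024)              # fixed seed for a reproducible run
population_ids = range(1, 372_001)     # assumed ID numbers 1..372,000

# Without replacement: every ID appears at most once in the sample.
without_replacement = rng.sample(population_ids, k=10)

# With replacement: the same ID may be drawn more than once.
with_replacement = rng.choices(population_ids, k=10)

# One random integer between 1 and 372,000 inclusive, similar in
# spirit to Excel's RANDBETWEEN(1, 372000).
single_draw = rng.randint(1, 372_000)

print(without_replacement)
print(with_replacement)
print(single_draw)
```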
Simple Random Sampling
► For example, I would like to calculate the average income
of people in District 10 of Ho Chi Minh City.
► Total population in 2021: 372,000.
► Excel: the function RANDBETWEEN(a, b) returns a random integer
between a and b. Press F9 to get a new result.
► Internet: the web site www.random.org will give
you many kinds of random numbers.
Systematic Random Sampling
► Decide on sample size: n
► Divide frame of N individuals into n groups of k
individuals: k = N/n
► Randomly select one individual from the first group
► Select every kth individual thereafter
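► Example: N = 372,000 and n = 3,720, so k = N/n = 100. Pick one individual
at random from the first group of 100, then take every 100th individual.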
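The steps above can be sketched in Python as follows. The function `systematic_sample` is an illustrative helper that assumes the frame is numbered 1 to N and that N is a multiple of n.

```python
import random


def systematic_sample(N, n, rng):
    """Return n positions: a random start in the first group, then every k-th."""
    k = N // n                    # group size (assumes N is divisible by n)
    start = rng.randint(1, k)     # random individual from the first group
    return [start + j * k for j in range(n)]


rng = random.Random(7)            # fixed seed for a reproducible run
positions = systematic_sample(372_000, 3_720, rng)

print(positions[:5], len(positions))
```

Only the starting point is random; every later selection follows from it.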
Stratified Random Sampling
► Divide population into subgroups (called strata) according
to some common characteristics (e.g. age, gender,
occupation)
► Select a simple random sample from each subgroup
► Combine samples from subgroups into one
[Figure: a population divided into 4 strata, with a sample drawn from each
stratum]
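A minimal sketch of stratified sampling is shown below. It assumes proportional allocation (the same sampling fraction in every stratum), which is one common choice and is not specified in the slides; the strata names and sizes are hypothetical.

```python
import random


def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the same fraction from each stratum."""
    combined = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))   # at least one per stratum
        combined.extend(rng.sample(members, k))
    return combined


# Hypothetical population of 1,000 people split into 4 age strata.
strata = {
    "18-29": list(range(0, 400)),
    "30-44": list(range(400, 700)),
    "45-59": list(range(700, 900)),
    "60+": list(range(900, 1000)),
}

rng = random.Random(1)            # fixed seed for a reproducible run
sample = stratified_sample(strata, 0.10, rng)

print(len(sample))
```

Every stratum is represented in the combined sample in proportion to its size.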
Cluster Sampling
► Divide population into several “clusters” (e.g. regions), each
representative of the population.
► Clusters are often naturally occurring groups, such as counties,
election districts, city blocks, households, or sales territories
► One-stage cluster sampling: randomly select k clusters and include all
elements of the selected clusters.
► Two-stage cluster sampling: randomly select k clusters and then
choose a random sample of elements within each cluster.
[Figure: the population in District 10 is divided into 14 wards (clusters),
then into streets (clusters); randomly selected clusters form the sample.]
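One-stage and two-stage cluster sampling can be sketched with a single helper. The function `cluster_sample` and the ward data are illustrative assumptions: 14 wards with 50 made-up residents each.

```python
import random


def cluster_sample(clusters, k, rng, per_cluster=None):
    """Select k clusters; keep all members (one-stage) or a subsample (two-stage)."""
    chosen = rng.sample(sorted(clusters), k)
    sample = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample.extend(members)                       # one-stage
        else:
            size = min(per_cluster, len(members))
            sample.extend(rng.sample(members, size))     # two-stage
    return chosen, sample


# Hypothetical frame: 14 wards, each with 50 residents.
wards = {
    f"Ward {w}": [f"W{w}-R{r}" for r in range(1, 51)]
    for w in range(1, 15)
}

rng = random.Random(10)           # fixed seed for a reproducible run
chosen_one, one_stage = cluster_sample(wards, 3, rng)
chosen_two, two_stage = cluster_sample(wards, 3, rng, per_cluster=10)

print(chosen_one, len(one_stage))
print(chosen_two, len(two_stage))
```

Only the selected wards need to be visited, which is what makes cluster sampling cheaper than sampling across the whole district.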
Sources of Error or Bias
Source of Error      Characteristics
Nonresponse bias     Respondents differ from non-respondents
Selection bias       Self-selected respondents are atypical
Response error       Respondents give false information
Coverage error       Incorrect specification of frame or population
Measurement error    Unclear survey instrument wording
Interviewer error    Responses influenced by interviewer
Sampling error       Random and unavoidable
Exercise
► Review Session 1, Online Quiz 1.
► Read Session 2: Describing data visually (Chap. 3).
► Finish installing MegaStat and bring your laptop to the
next session.
