
Session 1: Introduction

Statistics for Business


Dr. Le Anh Tuan

1
Course Information
►Lecturer info:

►Le Anh Tuan, Ph.D. in Finance

►Email: tuan.le@isb.edu.vn

►Page: https://sites.google.com/view/anhtuanle

►Research Interests: corporate finance, corporate governance, labor economics,
banking, institutional quality, policy uncertainty shocks, gender diversity.

2
Course Information
► Course requirements:

► In-class activities 10%
► Homework, Presentation
► Students who are absent from more than 3 sessions (over 20% of
sessions) will not be allowed to sit the final examination.

► Online Quiz 10%


► Quiz: online via eLearning

► Group Assignment 20%

► Midterm Exam 20%


► 50 MCQs
► Final Exam 40%
► MCQs
► Long/short questions

3
Group Assignment (GA)
►Work in groups (3-5 students per group).
►Group list should be submitted in Week 2.
►GA1: Week 4
►GA2: Week 11

4
Exam
► Mid-term exam:
► An in-class closed-book exam, 90 minutes
► The mid-term exam will include 50 multiple choice
questions.
► Covers Weeks 1-6
► No laptop, no electronic devices.
► A one-page formula sheet is provided.

► Final exam:
► Closed-book exam, 120 minutes.
► 40 multiple choice questions
► Short/long questions
► No laptop, no electronic devices.
► A one-page formula sheet is provided.

5
Course Information
►Required reading:
►Lecture notes.
►Doane, D. P. & Seward, L. E. (2016), Applied Statistics
in Business and Economics, 5th edition, McGraw-Hill.

►Recommended reading:
►Berenson, Mark L., David M. Levine and Timothy C.
Krehbiel (2004), Basic Business Statistics, 9th edition,
Prentice-Hall

6
Course Information

►Software
►Microsoft Excel and MegaStat are used in the course.
►Others

7
Tentative schedule
►Session 1: Introduction and Data collection (Chap. 1, 2)

►Session 2: Describing data visually (Chap. 3)

►Session 3: Descriptive statistics (Chap. 4)

►Session 4: Probability (Chap. 5)

►Session 5: Discrete Probability Distributions (Chapter 6)

►Mid-term exam (Week 7)

8
Tentative schedule

►Session 6: Continuous Probability Distributions (Chapter 7)

►Session 7: Sampling Distributions and Estimation (Chapter 8)

►Session 8: One-sample Hypothesis Tests (Chapter 9)

►Session 9: Two-sample Hypothesis Tests (Chapter 10)

►Session 10: Regression (Chapter 12, 13)

9
Session 1. Introduction

10
Introduction

►Statistics

►Variables and Data

►Level of Measurement

►Sampling Concepts

►Sample Methods

11
What is Statistics?
► Statistics is the science of collecting, organizing, analyzing,
interpreting, presenting data and drawing conclusions from
data.

► Average score of students in class


► The maximum value of employee bonus
► The minimum wage of Asian countries
► The range of Vinamilk stock price
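The four examples above (average, maximum, minimum, range) are all simple summaries of a list of values. A minimal sketch in Python, assuming a small set of made-up class scores (the numbers are illustrative only, not course data):

```python
import statistics

# Hypothetical scores for a small class (illustrative values only).
scores = [6.5, 7.0, 8.5, 5.0, 9.0, 7.5]

average_score = statistics.mean(scores)  # average score of students in class
highest = max(scores)                    # maximum value
lowest = min(scores)                     # minimum value
score_range = highest - lowest           # range = maximum - minimum

print(f"mean={average_score:.2f}, max={highest}, min={lowest}, range={score_range}")
```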

12
Two Branches of Statistics

Statistics
► The branch of mathematics that transforms data into useful
information for decision makers.

► Descriptive Statistics: collecting, summarizing, presenting and analyzing data.
► Inferential Statistics: using data collected from a small group to draw
conclusions about a larger group.

13
Variables and Data

►Data are facts and figures collected for analysis,
presentation and interpretation.

►Variable: a characteristic of the items that we want to
study (e.g., name, gender, DOB, GDP growth, GDP per
capita, inflation, final score, mid-term score, spending).

►Observation: a single member of the collection of items that we
want to study, such as a student, firm, or region. An observation is
a set of variable values.
►Data set: all the values of all of the variables for all of the
observations we chose to study.

14
Variables and Data
Variables:      Employee Name    Gender   DOB           Annual Income in $
Observation 1   Gladys Simpson   Female   1-May-1971    120,000
Observation 2   Divid Hinds      Male     17-Dec-1968   135,000
Observation 3   Kenneth Henry    Male     3-Sep-1965    98,000

This dataset has 3 observations
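One way to picture the vocabulary is to store the employee table as a list of records, where each record is one observation and each key is one variable. This is only a sketch; the key names (`name`, `gender`, `dob`, `annual_income`) are chosen here for illustration and are not part of the course material.

```python
# Each dict is one observation; each key is one variable.
employees = [
    {"name": "Gladys Simpson", "gender": "Female", "dob": "1-May-1971", "annual_income": 120_000},
    {"name": "Divid Hinds", "gender": "Male", "dob": "17-Dec-1968", "annual_income": 135_000},
    {"name": "Kenneth Henry", "gender": "Male", "dob": "3-Sep-1965", "annual_income": 98_000},
]

n_observations = len(employees)        # number of rows in the data set
variables = list(employees[0].keys())  # names of the columns

# Categorical variables hold labels; numerical variables can be used in arithmetic.
total_income = sum(row["annual_income"] for row in employees)

print(n_observations, variables, total_income)
```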

15
Variables and Data
►Categorical or qualitative data have values that are
described by words; the categories may be coded numerically.
►Numerical or quantitative data come from counting,
measuring, or mathematical operations.

16
Types of Dataset

► Cross-sectional data

► Time series data

► Pooled cross sections

► Panel/Longitudinal data

17
Cross-sectional data
obs    wage    educ   Exp   Age   female
1      20.40   20     10    34    1
2      10.29   15     8     30    0
…      …       …      …     …     …
500    40.39   30     12    45    1

► 1 observation = information about 1 cross-sectional unit


► Cross-sectional units: individuals, households, firms, cities, states
► Data taken at a given point in time
► Typical assumption: units form a random sample from the whole
population → the notion of independence of the units’ values
► Possible violations:
• Censoring: wealthier families are less likely to disclose their wealth
• Small population: neighboring states influence one another, their
indicators are not independent.
18
Time series data
Year   Stock Price   Volume    Stock Return   High Price   Low Price
       (Netflix)     (x1000)   (%)
2000   15.00         1,000     4              23           8.00
2001   16.20         7,000     8              25           10.97
…      …             …         …              …            …
2022   50.33         24,000    12             100          20.22

► Observations on economic variables over time: stock prices, money supply, CPI,
GDP, inflation rates,…
► Frequencies: daily, weekly, monthly, quarterly, annually
► Ordering is important here!
• The behaviour of economic subjects (and the resulting indicators) evolves
gradually over time
• Lags in economic behaviour (stock prices today affect next month’s actions)
► Typically, observations cannot be considered independent across time →
require more complex econometric techniques.
19
Pooled cross sections
obs    Year   wage    educ   Exp   Age   female
1      2010   20.40   20     10    34    1
…      …      …       …      …     …     …
500    2010   10.29   15     8     30    0
501    2020   15.02   12     4     40    1
…      …      …       …      …     …     …
1000   2020   40.39   30     12    45    1

► Both cross-sectional and time-series features


► Data collected in multiple (typically, two) points in time
► Ordering is not crucial, year is recorded as an additional variable
► Often used to evaluate the effect of a policy change
• collect data before and after the policy change and see how the
relationship between the variables changes
► Note: in the second time period, the cross-sectional units need be neither
distinct from nor identical to those in the first period.
20
Panel/Longitudinal Data
Firm   Year   Size   Debt Ratio (%)   Tangible Ratio (%)   Age   Cash ratio
1      2020   4.44   15               25                   30    10
1      2021   4.98   14               39                   31    15
1      2022   5.02   23               41                   32    21
2      2020   3.43   32               20                   4     22
2      2021   7.53   4                11                   5     11
2      2022   8.49   10               18                   6     10
…      …      …      …                …                    …     …

… … … … … … …

► Panel data are a collection of cross-sectional data for at least two different
points/periods of time.
► Unlike with pooled cross sections, the same units are measured over time
• More difficult/costly to obtain the data
• Have several advantages over (pooled) cross sections (for problems
where panel data make sense)
21
Panel/Longitudinal Data
► Use of a double index it, where i (firm) = 1,...,n and t (year) = 1,...,T.

► Typical problem: missing values - for some units and periods there are no
data.

► The panel may be balanced or unbalanced.

► Example: firm performance of 500 Vietnamese listed firms from 2000 to
2019, where all firms were chosen for the sample in 2000 and kept
fixed for all subsequent years (T = 20 years, n = 500, N = n × T = 10,000).
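The double index and the count N = n × T can be checked with a short sketch. The helper name `is_balanced` and the way a missing firm-year is simulated are assumptions made for illustration; only the figures n = 500, T = 20 and the years 2000 to 2019 come from the example.

```python
from itertools import product

n_firms, n_years = 500, 20
firms = range(1, n_firms + 1)   # i = 1, ..., n
years = range(2000, 2020)       # t = 2000, ..., 2019 (T = 20 years)

# Balanced panel: one (i, t) pair for every firm-year combination.
balanced_index = list(product(firms, years))
total_obs = len(balanced_index)  # N = n * T

def is_balanced(index, n_units, n_periods):
    """A panel is balanced when every unit is observed in every period."""
    return len(set(index)) == n_units * n_periods

# Dropping a single firm-year (a missing value) makes the panel unbalanced.
unbalanced_index = balanced_index[1:]

print(total_obs, is_balanced(balanced_index, n_firms, n_years),
      is_balanced(unbalanced_index, n_firms, n_years))
```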

22
Question
► What types of datasets are the following (cross-sectional,
pooled cross-sectional, time series, panel data)?
1. A performance survey of banks in Ho Chi Minh city in 2022 → cross-sectional
2. Financial ratios of 2500 SMEs over the 2000–2020 period → panel
3. A happiness index of Vietnam between 2000 and 2020 → time series
4. Income surveys of citizens in Ho Chi Minh city in 2000 and 2010 → pooled cross sections
5. Economic policy uncertainty indices of 40 countries around the world for a 20-
year period → panel (40 × 20 = 800 observations)
6. Monthly stock information (close price, high price, return) of Vinamilk in
2023 → time series (n = 12)
7. Unemployment rates of Asian countries in the year of Covid-19 2019 → cross-sectional
8. Sales revenue of firms in the retail industry in 2020 → cross-sectional
9. Annual income of a group of employees at a company for four consecutive
years → panel (obs = 40 × 4 for a group of 40 employees)
10. Survey about customer satisfaction in a supermarket on a given day → cross-sectional

23
Collecting the data

24
Collecting the data
Scales of Measurement

26
Scales of Measurement
► Scales of measurement include:
► Nominal
► Ordinal
► Interval
► Ratio
► The scale determines the amount of information contained in
the data.
► The scale indicates the data summarization and statistical
analyses that are most appropriate.

27
Nominal Scale
► Data are labels or names used to identify attributes of the
element.

► Example:
► Students of a university are classified by school as Business,
Humanities, Education, and so on.
► Alternatively, a numeric code could be used for the school
variable (e.g. 1 denotes Business, 2 denotes Humanities, 3
denotes Education, and so on).
► No ordering.
28
Ordinal Scale
► The data have the properties of nominal data and the order or
rank of the data is meaningful.

► Example:
► Students of a university are classified as Freshman, Junior, or
Senior.
► Alternatively, a numeric code could be used for the class
standing variable (e.g. 1 denotes Freshman, 2 denotes Junior,
and so on).
► Ordering, but differences have no meaning.
29
Interval Scale
► The data have the properties of ordinal data, and the
difference between measurements is meaningful,
but the measurements have no true zero value
(absence of the quantity being measured).

°C    °F
0     32.0
1     33.8
2     35.6
3     37.4
4     39.2
5     41.0
6     42.8

► Example:
► The difference between a temperature of 0°C and 20°C is
the same as the difference between 20°C and 40°C, but
you can't say that 40°C is twice as hot as 20°C
because the scale does not start at absolute zero
(where there is no temperature).
► Differences have meaning, but ratios have no meaning.
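A quick numerical check of this point: converting the same temperatures to Fahrenheit keeps equal differences equal, but changes the ratio, which is why "twice as hot" is not a meaningful statement on an interval scale. The function name `c_to_f` is chosen here for illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Equal differences in °C remain equal after converting to °F.
diff_low = c_to_f(20) - c_to_f(0)     # 0°C -> 20°C
diff_high = c_to_f(40) - c_to_f(20)   # 20°C -> 40°C

# The ratio of the same two temperatures depends on the unit used.
ratio_c = 40 / 20
ratio_f = c_to_f(40) / c_to_f(20)

print(diff_low, diff_high, ratio_c, round(ratio_f, 3))
```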
30
Ratio Scale
► The data have all the properties of interval data and the ratio
of two values is meaningful.
► The measurements have a true zero value.

► Example:
► Company A has $1 million in total assets, while Company B
has $2 million in total assets. Company B’s firm size is double
that of company A.
► Company C has zero total assets.
► Ratios have meaning.
31
Scales of Measurement
Ratio Data      Differences between measurements and ratios
Interval Data   Differences between measurements but no ratio
Ordinal Data    Ordered categories
Nominal Data    Categories (no ordering)


32
Scales of Measurement

33
Quiz
► Determine the level of measurement (Nominal, Ordinal,
Interval, Ratio)
1. Firms are described as small, medium, and large. → ordinal
2. Sales revenue of firms → ratio
3. The number of years in operation of firms. → ratio
4. Statistics software (EViews, SPSS, STATA) → nominal
5. Product quality being rated as fail, average, good, excellent → ordinal
6. Customer satisfaction is based on the 7-point Likert scale. → interval (strictly speaking, ordinal)
7. Industry codes → nominal
8. Income of people in Ho Chi Minh city. → ratio
9. IQ scores → interval
10. Weights of students in class → ratio
11. YES/NO question → nominal
12. Children in a school are evaluated and classified as non-readers (0), beginning readers (1),
grade level readers (2), or advanced readers (3). → ordinal
34
Sampling Concept

36
Population vs. Sample

► A population is the collection of all items of interest or under
investigation; it may be finite or infinite.

► A census is an examination of all items in a defined population.

► A sample is an observed subset of the population.

37
Population vs. Sample
Population: a b c d e f g h i j k l m n o p q r s t u v w x y z
Sample:     b c g i n o r u y

► A population may be treated as infinite when the
population size N is at least 20 times the sample size n
(i.e., when N/n ≥ 20)
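The N/n ≥ 20 rule of thumb is easy to apply mechanically. A minimal sketch follows; the function name `treat_as_infinite` is hypothetical, and the first pair of figures is borrowed from the District 10 example used in the sampling slides.

```python
def treat_as_infinite(population_size, sample_size):
    """Rule of thumb from the slide: treat the population as infinite when N/n >= 20."""
    return population_size / sample_size >= 20

# District 10 example: N = 372,000 and n = 3,720 give N/n = 100.
print(treat_as_infinite(372_000, 3_720))

# A sample of 100 from a population of 1,000 gives N/n = 10.
print(treat_as_infinite(1_000, 100))
```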

38
Parameters vs. Statistics

► A parameter is a specific characteristic of a population


► A statistic is a specific characteristic of a sample

39
Sampling Concepts

► The population must be carefully specified and the sample must be
drawn scientifically so that the sample is representative.
► The target population is the population we are
interested in (e.g., people in Ho Chi Minh city).
► The sampling frame is the group from which we take the
sample (e.g., 21 Districts + 1 City).

40
Sampling Methods

41
Sampling Methods

► Nonstatistical Sampling: Convenience, Judgment, Focus group
► Statistical Sampling: Simple Random, Systematic, Cluster, Stratified

42
Nonstatistical Sampling
(Non-random Sampling)
► Convenience Sample
► Use a sample that happens to be available (e.g.,
ask co-worker opinions at lunch).
► Judgment Sample
► Use expert knowledge to choose “typical” items
(e.g., which employees to interview).
► Focus Groups
► In-depth dialog with a representative panel of
individuals (e.g. iPhone users).

43
Statistical Sampling

► Items of the sample are chosen based on known or calculable
probabilities.

► Statistical Sampling (Probability Sampling): Simple Random, Systematic,
Stratified, Cluster

44
Statistical Sampling

► For example, you would like to calculate the average
income of people in District 10 of Ho Chi Minh city.
► Total population in 2021: 372,000.

45
Simple Random Sampling
► Every member of the population has an equal chance of being selected
► Every possible sample of a given size has an equal chance of being
selected
► Selection may be with replacement or without replacement
► In sampling without replacement, once an element is selected from
the population to be included in the sample, it is not returned to
the population, so the same element cannot be selected more than
once in the same sample.
► In sampling with replacement, a selected element is returned to
the population, so it could be selected multiple times in the
same sample.

► The sample can be obtained using a table of random numbers or a computer
random number generator.
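The difference between the two selection schemes can be sketched with Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. The numbered resident list is a hypothetical sampling frame, and the seed is fixed only so the sketch is repeatable.

```python
import random

rng = random.Random(42)  # fixed seed, for a reproducible illustration only

# Hypothetical sampling frame: resident IDs 1..372,000.
population = list(range(1, 372_001))

# Without replacement: an ID cannot appear more than once.
without_replacement = rng.sample(population, k=10)

# With replacement: each draw is made from the full population, so IDs may repeat.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```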

46
Simple Random Sampling

► For example, I would like to calculate the average income
of people in District 10 of Ho Chi Minh city.
► Total population in 2021: 372,000.

Excel      Excel function RANDBETWEEN(a, b). Press F9 to get a new result.
Internet   The web site www.random.org will give you many kinds of random numbers.
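For readers without Excel, a rough Python analogue of RANDBETWEEN(a, b) is `random.randint(a, b)`, which also includes both endpoints. The wrapper name `randbetween` is invented here for illustration; because each call is independent, the same ID can come up twice (sampling with replacement).

```python
import random

def randbetween(a, b, rng=random):
    """Return a random integer N with a <= N <= b, similar to Excel's RANDBETWEEN(a, b)."""
    return rng.randint(a, b)

rng = random.Random(2021)  # seeded only to make the sketch repeatable

# Pick 5 resident IDs out of the 372,000 residents of District 10.
resident_ids = [randbetween(1, 372_000, rng) for _ in range(5)]
print(resident_ids)
```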

47
Systematic Random Sampling
► Decide on sample size: n
► Divide frame of N individuals into n groups of k
individuals: k = N/n
► Randomly select one individual from the first group
► Select every kth individual thereafter

N = 372,000
n = 3,720
k = N/n = 100
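The four steps can be sketched as follows, using the District 10 figures (N = 372,000, n = 3,720, so k = 100). The function name `systematic_sample` is hypothetical, and the sketch assumes N is an exact multiple of n.

```python
import random

def systematic_sample(frame_size, sample_size, rng):
    """Select every k-th individual after a random start in the first group."""
    k = frame_size // sample_size   # sampling interval k = N/n
    start = rng.randint(1, k)       # random individual from the first group (1..k)
    return [start + j * k for j in range(sample_size)]

rng = random.Random(0)  # fixed seed, for a reproducible illustration only
ids = systematic_sample(372_000, 3_720, rng)

print(ids[:5], len(ids))
```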

48
Stratified Random Sampling
► Divide population into subgroups (called strata) according
to some common characteristics (e.g. age, gender,
occupation)
► Select a simple random sample from each subgroup
► Combine samples from subgroups into one

Figure: the population is divided into 4 strata, and the sample is drawn from each stratum.
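A minimal sketch of the three steps (divide, sample within each stratum, combine). The four age strata, their sizes, and the 10% sampling fraction are made-up assumptions; drawing the same fraction from every stratum (proportional allocation) is one common choice, not something the slide prescribes.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction from every stratum, then combine."""
    combined = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one item per stratum
        combined.extend(rng.sample(members, k))
    return combined

rng = random.Random(1)  # fixed seed, for a reproducible illustration only

# Hypothetical population of 100 people split into 4 age strata.
strata = {
    "age_18_29": [f"A{i}" for i in range(40)],
    "age_30_44": [f"B{i}" for i in range(30)],
    "age_45_59": [f"C{i}" for i in range(20)],
    "age_60_plus": [f"D{i}" for i in range(10)],
}

sample = stratified_sample(strata, 0.10, rng)
print(sample)
```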
49
Cluster Sampling
► Divide population into several “clusters” (e.g. regions), each
representative of the population.
► Clusters are often naturally occurring groups, such as counties,
election districts, city blocks, households, or sales territories
► One-stage cluster sampling: randomly select k clusters and include all of their elements
► Two-stage cluster sampling: randomly select k clusters and then
choose a random sample of elements within each cluster.

Population in District 10 is divided into 14 wards (clusters), then into streets (clusters).
Randomly selected clusters form the sample.
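One-stage and two-stage cluster sampling differ only in what happens inside the chosen clusters. The sketch below uses 14 wards as clusters, following the District 10 example; the 50 residents per ward, the choice of k = 3, and the function name `cluster_sample` are assumptions for illustration.

```python
import random

def cluster_sample(clusters, k, rng, per_cluster=None):
    """Randomly pick k clusters; keep all members (one-stage) or a random subset (two-stage)."""
    chosen = rng.sample(sorted(clusters), k)
    selected = []
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            selected.extend(members)  # one-stage: every element of the cluster
        else:
            take = min(per_cluster, len(members))
            selected.extend(rng.sample(members, take))  # two-stage: sample within the cluster
    return chosen, selected

rng = random.Random(7)  # fixed seed, for a reproducible illustration only

# Hypothetical frame: 14 wards with 50 residents each.
wards = {f"Ward {w}": [f"W{w}-R{r}" for r in range(50)] for w in range(1, 15)}

chosen_1, one_stage = cluster_sample(wards, 3, rng)
chosen_2, two_stage = cluster_sample(wards, 3, rng, per_cluster=5)

print(chosen_1, len(one_stage))
print(chosen_2, len(two_stage))
```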

50
Sources of Error or Bias
Source of Error      Characteristics
Nonresponse bias     Respondents differ from non-respondents
Selection bias       Self-selected respondents are atypical
Response error       Respondents give false information
Coverage error       Incorrect specification of frame or population
Measurement error    Unclear survey instrument wording
Interviewer error    Responses influenced by interviewer
Sampling error       Random and unavoidable

51
Exercise

► Review Session 1, Online Quiz 1.

► Read Session 2: Describing data visually (Chap. 3).

► Finish installing MegaStat and bring your laptop to the
next session.

52
