- “Statistics is a way to get information from data”
Data → Statistics → Information
- Statistics is a tool for creating new understanding from a set of numbers
- Example:
+ A large university with a total enrollment of about 50000 students had offered
Pepsi-Cola an exclusivity agreement that would give Pepsi exclusive rights to sell its
+ The university would receive 35% of the on-campus revenues and an additional
$200000 per year.
+ Data: Pepsi currently sells an average of 22000 cans per week over the 40 weeks of
the year that the university operates. Price: $1/can. The costs, including labor, total
30 cents per can. Current market share (without the agreement): 25%
→ Should Pepsi sign the agreement?
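
A rough way to frame the decision is to compare Pepsi's current annual profit with its profit under the agreement. The sketch below uses only the figures quoted in the example. It assumes (this is not stated in the notes) that the total campus market equals Pepsi's current sales divided by its 25% share, and that exclusivity would let Pepsi capture that whole market. The function and variable names are illustrative.

```python
# Figures taken from the example above.
CANS_PER_WEEK = 22_000   # Pepsi's current weekly sales on campus
WEEKS = 40               # weeks per year the university operates
PRICE = 1.00             # dollars per can
COST = 0.30              # dollars per can, including labor
MARKET_SHARE = 0.25      # Pepsi's current share without the agreement
REVENUE_CUT = 0.35       # share of on-campus revenue paid to the university
ANNUAL_FEE = 200_000     # additional yearly payment to the university


def annual_profit(cans_per_week, revenue_cut=0.0, fee=0.0):
    """Yearly profit: revenue minus per-can costs, revenue share and fixed fee."""
    revenue = cans_per_week * WEEKS * PRICE
    costs = cans_per_week * WEEKS * COST
    return revenue - costs - revenue_cut * revenue - fee


current = annual_profit(CANS_PER_WEEK)

# ASSUMPTION: total weekly market = current sales / market share, and
# exclusivity lets Pepsi capture all of it.
total_market = CANS_PER_WEEK / MARKET_SHARE
with_agreement = annual_profit(total_market, REVENUE_CUT, ANNUAL_FEE)

# Weekly sales at which the agreement only matches the current profit.
margin_per_can = PRICE - COST - REVENUE_CUT * PRICE
break_even_cans = (current + ANNUAL_FEE) / (margin_per_can * WEEKS)

print(round(current), round(with_agreement), round(break_even_cans))
```

Under these assumptions the agreement is attractive only if weekly sales rise well above the break-even level, which is why an estimate of the true market size (an inference from sample data) matters.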

1. Descriptive Statistics
- Descriptive statistics uses numerical techniques to summarize data.
- The mean and median are popular numerical techniques to describe the location of
the data
- The range, variance and standard deviation measure the variability of the data
- Descriptive statistics deals with methods of organizing, summarizing and
presenting data in a convenient and informative way
- One form of descriptive statistics uses graphical techniques

2. Inferential Statistics
- Inferential statistics is a body of methods used to draw conclusions or inferences
about characteristics of populations based on sample data

3. Key Statistical Concepts

- Population: the group of all items of interest to a statistics practitioner
+ frequently very large, sometimes infinite
+ ex: all the people living in Vietnam
- Sample: a set of data randomly drawn from the population
+ potentially large, but smaller than the population
+ ex: a sample of citizens living in Hanoi
- Parameter: a descriptive measure of a population
- Statistic: a descriptive measure of a sample
- Statistical inference: the process of making an estimate, prediction, or decision
about a population based on a sample
1. Definitions
- A variable is some characteristic of a population or sample
Ex: student grades
- The values of the variable are the range of possible values for a variable
Ex: student marks (0…100)
- Data are the observed values of a variable
Ex: student marks: {67, 74, 71, 83}

2. Types of data
- Interval data:
+ real numbers, e.g. heights, weights, prices, etc.
+ also referred to as quantitative or numerical
+ arithmetic operations can be performed on interval data

- Nominal data:
+ the values of nominal data are categories
+ ex: responses to questions about marital status, coded as: single = 1, married = 2,
divorced = 3, widowed = 4
+ arithmetic operations don’t make any sense (ex: does widowed / 2 = married?)
+ also called qualitative or categorical

- Ordinal data:
+ appear to be categorical in nature, but their values have an order; a ranking to them
+ ex: college course rating system: poor = 1, fair = 2, good = 3, very good = 4, excellent = 5
+ still not meaningful to do arithmetic on this data
+ we can say: excellent > poor or fair < very good
+ order is maintained no matter what numeric values are assigned to each category
PART A: Summarizing univariate data
1. Graphical & Tabular Techniques for Nominal data
- The only allowable calculation on nominal data is to count the frequency of each
value of the variable
- We can summarize the data in a table that presents the categories and their counts
called a frequency distribution
- A relative frequency distribution lists the category ………….

2. Ordinal data
- Same as for nominal data, but
+ The bars in the bar chart are arranged in ascending or descending order
+ The wedges in the pie chart are arranged clockwise in ascending or descending order

3. Graphical Techniques for Interval data

- There are several graphical methods that are used when the data are interval
- The most ………….
- Ex:
+ For nominal data, we create a frequency distribution of the categories
+ For interval data, we create a frequency distribution by counting the number of
observations that fall into a series of intervals, called classes
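
As a minimal sketch of both kinds of frequency distribution, the snippet below counts categories for nominal data and counts observations per class for interval data. The data values and the class width of 10 are hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical nominal data: marital status of eight respondents.
status = ["single", "married", "married", "divorced",
          "single", "married", "widowed", "married"]
freq = Counter(status)                                    # frequency distribution
rel_freq = {k: v / len(status) for k, v in freq.items()}  # relative frequencies

# Hypothetical interval data: marks grouped into classes of width 10.
marks = [67, 74, 71, 83, 58, 92, 77, 69]
width = 10
class_counts = Counter((m // width) * width for m in marks)

for lower in sorted(class_counts):
    print(f"{lower} to under {lower + width}: {class_counts[lower]}")
```

The class counts are what a histogram would draw as bar heights.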

4. Presentation of interval data

Stem and Leaf Display
- Retains information about individual observations that would normally be lost in
the creation of a histogram
- Split each observation into two parts, a stem and a leaf
- ex: observation value: 42.19
There are several ways to split it up
+ We could split it at the decimal point: 42 | 19
+ Or split it at the “tens” position (while rounding to the nearest integer in the
‘ones’ position): 4 | 2
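
The second way of splitting (stem = tens, leaf = rounded ones digit) can be sketched as follows. The helper name and the data are hypothetical, and the sketch assumes non-negative observations.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (ones), after rounding."""
    display = defaultdict(list)
    for v in sorted(round(x) for x in values):
        stem, leaf = divmod(v, 10)
        display[stem].append(leaf)
    return dict(display)


data = [42.19, 38.45, 45.77, 51.02, 47.3, 39.9]  # hypothetical observations
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Unlike a histogram, each printed row still shows the individual (rounded) observations.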

Ogive
- A graph of a cumulative frequency distribution
- Uses cumulative relative frequencies

- The ogive can be used to answer questions like: what value is at the 50th percentile?

Line chart
- Observations measured at successive points in time are called time-series data
- A line chart plots the value of the variable on the vertical axis against time periods on
the horizontal axis

PART B: Summarizing bivariate data

1. Cross-tabulations
- A tabular summary of data for 2 variables
- The left and top margin labels define the classes for the 2 variables

2. Scatter Diagram and Trendline

- A scatter diagram is a graphical presentation of the relationship between 2
quantitative variables
- 1 variable is shown on the horizontal axis and the other variable is shown on the
vertical axis
- The general pattern of the plotted points suggests the overall relationship between
the variables
- A trendline is an approximation of the relationship
- Table Design Principles:
+ Avoid using vertical lines in a table unless they are necessary for clarity
+ Horizontal lines are generally necessary only for separating column titles from
data values or when indicating that a calculation has taken place

1. Measures of Central Location

- The central data point reflects the locations of all the actual data points
- With 2 data points, the central location should fall in the middle between them
- If a third data point appears on the left-hand side of the midrange, it should pull
the central location to the left

- The Arithmetic Mean (average)

Mean = sum of the measurements / number of measurements

- Geometric Mean
+ A specialized measure, used to find the average growth rate or rate of change of a
variable over time
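+ Formula:
Step 1: express each period's rate of change R as a growth factor (1 + R)
Step 2: calculate the geometric mean Rg from
(1 + Rg)^n = (1 + R1)(1 + R2)...(1 + Rn), i.e. Rg = [(1 + R1)(1 + R2)...(1 + Rn)]^(1/n) - 1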
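
A minimal sketch of the two steps, assuming the usual definition of the geometric mean rate (the nth root of the product of the growth factors, minus 1). The returns used are hypothetical.

```python
import math


def geometric_mean_rate(rates):
    """Average growth rate: nth root of the product of (1 + R_i), minus 1."""
    growth_factors = [1 + r for r in rates]      # Step 1
    product = math.prod(growth_factors)
    return product ** (1 / len(rates)) - 1       # Step 2


# Hypothetical returns: +100% in year 1, then -50% in year 2.
rates = [1.00, -0.50]
print(geometric_mean_rate(rates))   # the investment ends where it began
print(sum(rates) / len(rates))      # the arithmetic mean overstates the growth
```

The comparison shows why the arithmetic mean is not suitable for averaging growth rates over time.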

- Characteristics of the mean

+ A representative of a data set
+ Takes every single value into account so it is likely to be affected by extreme values
+ Used to compare different-sized data sets

- The median
+ The median of a set of measurements is the value that falls in the middle when the
measurements are arranged in order of magnitude
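+ When determining the median, pay attention to the number of observations (k): if k is
odd, the median is the middle value; if k is even, it is the average of the two middle values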
- The mode:
+ The Mode of a set of measurements is the value that occurs most frequently
+ A set of data may have one mode (or modal class) or 2 or more modes.

- Relationship among Mean, Median and Mode

+ If a distribution is symmetrical, the mean, median and mode coincide (they have the same value)
+ If a distribution is not symmetrical, and is skewed to the left or to the right, the 3
measures differ: mean < median < mode (skewed left) or mode < median < mean (skewed right)

- Using the Mean, Median and Mode

+ The mean – is very sensitive to extreme values, is used in most statistical analyses
+ The median is not affected by extreme values, yet, does not reflect all the values
included in the data set, but rather the location of the observation in the middle
+ The mode – should be used mainly for categorical data
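
The sensitivity of the mean, and the robustness of the median, can be checked with Python's standard `statistics` module. The marks are the ones used earlier in these notes; the extreme value 0 and the small list with two modes are hypothetical additions.

```python
import statistics

marks = [67, 74, 71, 83]
mean_marks = statistics.mean(marks)
median_marks = statistics.median(marks)      # even count: average of 71 and 74

with_outlier = marks + [0]                   # hypothetical extreme value
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

modes = statistics.multimode([1, 2, 2, 3, 3])  # a data set can have 2 modes

print(mean_marks, median_marks)
print(mean_outlier, median_outlier)
print(modes)
```

One extreme value pulls the mean down sharply while the median barely moves.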

2. Measures of Variability

- Measures of central location fail to tell the whole story about the distribution
- A question of interest still remains unanswered: how much are the values of a
given set spread out around the mean value?

ARR: Annual Rate of Return

+ Considering only the average ARR, the two portfolios are equal. But are they really?
+ Is the dispersion (variability) of ARR the same for the 2 portfolios?

- The Range: the difference between the largest and smallest values; it ignores the
values in between, so it is often unrealistic as a measure of spread
+ The range of a set of measurements is the difference between the largest and
smallest measurements
+ Its major advantage is the ease with which it can be computed
+ Its major shortcoming is its failure to provide information on the dispersion of the
values between the two end points

- The Variance

+ This measure reflects the dispersion of all the measurement values
+ Population variance: σ² = Σ(xi - μ)² / N
+ Sample variance: s² = Σ(xi - x̄)² / (n - 1)

+ Problem: because the deviations are squared, the units are squared too (kg², km²)
→ hard to understand and interpret

- Standard Deviation: the standard deviation of a set of measurements is the square
root of the variance of the set

- Shortcut for the sample variance: the definition requires calculating the sample mean
(x-bar) first; a shortcut formula avoids this step:
s² = [Σxi² - (Σxi)² / n] / (n - 1)
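
As a check that the definition and the shortcut agree, here is a small sketch using the marks from earlier in the notes. It assumes the standard sample variance with divisor n - 1; the function names are illustrative.

```python
import math


def sample_variance(xs):
    """Definition: squared deviations from x-bar, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut: needs only the sum and the sum of squares."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


marks = [67, 74, 71, 83]
s2 = sample_variance(marks)
s = math.sqrt(s2)                        # standard deviation, in the original units
value_range = max(marks) - min(marks)    # range uses only the two end points

print(s2, sample_variance_shortcut(marks), round(s, 2), value_range)
```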

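- The Empirical Rule for a Bell Shaped Data Set
+ Approximately 68% of all observations fall within 1 standard deviation of the mean
+ Approximately 95% of all observations fall within 2 standard deviations of the mean
+ Approximately 99.7% of all observations fall within 3 standard deviations of the mean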

- The Chebyshev Theorem: describing any data set
+ The proportion of observations in any sample that lie within k standard deviations
of the mean is at least 1 - 1/k² for any k > 1
+ This theorem is valid for any set of measurements (sample, population) of any shape
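
To see that the Chebyshev bound holds even for a strongly skewed data set, the sketch below compares the guaranteed minimum proportion with the observed one. The data are hypothetical and the helper names are illustrative; the standard deviation here uses divisor n.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


def proportion_within(xs, k):
    """Observed proportion of xs within k standard deviations of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum(abs(x - mean) <= k * sd for x in xs) / n


data = [1, 2, 2, 3, 3, 3, 4, 4, 50]   # hypothetical, strongly skewed
for k in (2, 3):
    print(k, chebyshev_lower_bound(k), proportion_within(data, k))
```

The bound is conservative: the observed proportions are at least as large as the guaranteed ones.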

3. Measures of Relative Location and Box Plots

- Percentile
+ The pth percentile of a set of measurements is the value for which
_ At most p% of the measurements are less than that value
_ At most (100-p)% of all the measurements are greater than that value
+ A demonstration of commonly used percentile
First (lower) decile = 10th percentile
First (lower) quartile, Q1, = 25th percentile
Median = 50th percentile
Third quartile Q3 = 75th percentile
Ninth (upper) decile = 90th percentile
+ Determining Percentiles and their Location: find the location of any percentile
using the formula, then find the value

- Quartiles and Variability:

+ Quartiles can provide an idea about the shape of a histogram
+ Inter-quartile Range
This is a measure of the spread of the middle 50% of the observations
Large value indicates a large spread of the observations
Interquartile range = Q3 – Q1
- Box Plot
+ A box plot is a pictorial display that provides the main descriptive measures of the
measurement set
L: the largest measurement
Q3: the upper quartile
Q2: the median
Q1: the lower quartile
S: the smallest measurement
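
The notes do not reproduce the percentile location formula. The sketch below assumes one common textbook convention, location = (n + 1) × p / 100 in the sorted data with linear interpolation between neighbouring observations; other conventions exist and give slightly different values. The data are hypothetical.

```python
def percentile(xs, p):
    """pth percentile using the ASSUMED location rule L = (n + 1) * p / 100."""
    data = sorted(xs)
    n = len(data)
    location = (n + 1) * p / 100            # 1-based position in the sorted data
    location = min(max(location, 1), n)     # keep the position inside the data
    lower = int(location)                   # whole part of the position
    fraction = location - lower             # distance toward the next observation
    if lower == n:
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]   # hypothetical observations

# The five measures shown in a box plot, plus the interquartile range.
S, L = min(data), max(data)
q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
iqr = q3 - q1

print(S, q1, q2, q3, L, iqr)
```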
*Data Source
+ Primary: Data Collection
- Observation
- Survey
- Experiment

+ Secondary: Data Compilation: print or electronic

I. Methods of Collecting Data

1. Observation
- The investigator observes characteristics of a subset of members of one or more
existing populations
- Goals: draw conclusions about the corresponding population or about the
difference between 2 or more populations
- Advantage vs Disadvantage
+ Ad: easy to conduct, relatively inexpensive
+ Dis: provides little useful information; impossible to draw cause-and-effect
conclusions because of confounding variables

2. Experiment
- The investigator observes how a response variable behaves when the researcher
manipulates one or more explanatory variables (factors)
- Goal: determine the effect of the manipulated factors on the response variable
- Advantage vs Disadvantage
+ Ad: provides useful data, particularly for cause-and-effect relationships
+ Dis: relatively expensive and time-consuming

3. Survey
- Goal: solicit information from people concerning such things as income, family
size, opinions on various issues…
- The majority of surveys are conducted for private use
- 3 channels
+ Personal interview
_ High rate of response, fewer incorrect answers
_ Costly: people, money, time…
+ Telephone interview
_ less expensive
_ less personal, lower response rate
+ Mail survey
_ inexpensive
_ low response rate, high number of incorrect answers
- Design Steps
+ Define the issue
_ What are the purpose and objectives of the survey?
_ Identify the questions to answer
+ Decide what to measure and how to measure it
_ Decide what information is needed to answer the questions
_ Think about how you intend to tabulate and analyze the responses
+ Define the population of interest
+ Design questionnaire
_ Questionnaire should be kept as short as possible
_ The questions should be short, simple, clear, unambiguous
_ Begin with simple demographic questions
_ Use both dichotomous (closed-ended) questions and open-ended questions
_ Avoid using leading questions
+ Pre-test the survey
_ Pilot test with a small group of participants
_ Assess clarity and length
+ Determine the sample size and sampling method
+ Select sample and administer the survey

- Types of questions
+ Closed-ended questions: select from a short list of defined choices
+ Open-ended questions: respondents are free to respond with any value, words, or
statement
+ Demographic questions: about the respondents’ personal characteristics

II. Sampling Methods

1. Why Sampling?
- Less time consuming than a census
- Less costly to administer than a census
- It is possible to obtain statistical results of a sufficiently high precision based on samples
- Sometimes, it’s impossible to identify the whole population

2. Methods of Sampling
- Probability Samples:
+ Simple random
_ every individual or item from the population has an equal chance of being selected
_ Selection may be with replacement or without replacement
_ Samples can be obtained from a table of random numbers or computer random
number generators
+ Stratified random
_ Population divided into subgroups (called strata) according to some common characteristic
_ Simple random sample selected from each subgroup
_ Samples from subgroups are combined into one
_ The proportions of the subgroups are equal

+ Systematic
_ Decide on sample size: n
_ Divide frame of N individuals into groups of k individuals: k = N/n
_ Randomly select one individual from the 1st group
_ Select every kth individual thereafter

+ Cluster: less commonly used
_ Population is divided into several ‘clusters’, each representative of the population
_ A simple random sample of clusters is selected: all items in the selected clusters
can be used, or items can be chosen from a cluster using another probability
sampling technique
_ The proportion of each cluster is not important
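
The first three probability sampling methods can be sketched with Python's standard `random` module. The frame of 100 IDs, the two strata and the proportional allocation are hypothetical choices made for illustration, not part of the notes.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
frame = list(range(1, 101))      # hypothetical frame of N = 100 IDs
n = 10                           # desired sample size

# Simple random sample without replacement: every item equally likely.
simple = rng.sample(frame, n)

# Systematic sample: k = N / n, random start in the first group, then every kth.
k = len(frame) // n
start = rng.randrange(k)
systematic = frame[start::k]

# Stratified sample: a simple random sample from each stratum, then combined.
# ASSUMPTION: two hypothetical strata, sampled in proportion to their size.
strata = {"undergraduate": frame[:80], "graduate": frame[80:]}
stratified = []
for members in strata.values():
    size = round(n * len(members) / len(frame))
    stratified.extend(rng.sample(members, size))

print(simple, systematic, stratified, sep="\n")
```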

- Convenience Samples: not recommended

+ Use easily available/convenient group to form a sample
+ Voluntary response sampling, self-selected sampling…

III. Sampling and Non-Sampling Error

1. Sampling Error
- An error is expected to occur when making a statement about the population that is
based on the observations contained in a sample taken from the population
- The difference/deviation between the true (unknown) value of a population
parameter (mean, standard deviation…) and its estimate, the sample statistic, is the
sampling error
- Sampling error may be large when an unrepresentative sample is selected
- The only way to reduce sampling error is to take a larger sample

2. Non-Sampling Error
- An error that occurs when there are mistakes in the acquisition of the data or when the
sample observations are selected improperly
+ Selection bias:
_ occurs when the way the sample is selected systematically excludes some parts of
the population of interest
_ also usually occurs when only volunteer or self-selected individuals are used in a study
+ Measurement or response bias:
_ occurs when the method of observation tends to produce values that
systematically differ from the true value in some way
_ this problem might arise because: an improperly calibrated scale is used to
weigh items; questions on a survey are worded in a way that tends to influence the
response; the appearance or behavior of the interviewer/group/organization
conducting the survey influences answers; people tend not to be completely honest when
asked about sensitive issues

+ Nonresponse bias:
_ occurs when responses are not obtained from some individuals in the sample

1. Assigning probabilities to Events

- A random experiment is a process or course of action, whose outcome is uncertain
- Repeated random experiments may result in different outcomes. One can only refer
to the experiment outcome in terms of the probability of each outcome to occur.
- To determine the probabilities we need to define the sample space,
+ which is exhaustive (a list of all the possible outcomes)
+ in which the outcomes are mutually exclusive (outcomes do not overlap)
- Simple events: the individual outcomes that cannot be further decomposed
- An event is a collection of simple events
- Our objective is to determine P(A), the probability that event A will occur
- Given a sample space S = {O1, O2, …, Ok}, the probabilities assigned to the simple
events must satisfy the following requirements:
+ 0 ≤ P(Oi) ≤ 1 for each i
+ P(O1) + P(O2) + … + P(Ok) = 1

2. Joint, Marginal, and Conditional Probability

- The probability of an event: the probability P(A) of event A is the sum of the
probabilities assigned to the simple events in A
- Understanding relationships among events may reduce the computational effort in
determining the probability of events
- Two events are said to be mutually exclusive if the occurrence of one precludes the
occurrence of the other one
- Intersection
If A and B are mutually exclusive, by definition, the probability of their intersection
is equal to zero

- Joint Probability: the probability of the intersection of A and B is also called the
joint probability of A and B, written P(A and B)

- Marginal Probabilities: the probability of event C can be calculated as the sum of
the 2 joint probabilities: P(C) = P(A and C) + P(B and C), where A and B are mutually
exclusive and together cover every outcome

- Conditional Probability: frequently, information about the occurrence of event A
changes the probability of event B
Specifically, the conditional probability of event B given that event A has occurred is
calculated as follows: P(B | A) = P(A and B) / P(A)
- Dependent Events: when the occurrence of one event changes the probability of another,
the 2 events are related and are called ‘dependent events’

- Union of A and B is the event that occurs when either A or B or both occur
It is denoted ‘A or B’
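
A small sketch of joint, marginal and conditional probabilities computed from a table of joint probabilities. The table values and the event labels A1, A2, B1, B2 are hypothetical; the notes' own table is not reproduced here.

```python
# Hypothetical joint probabilities for events A1/A2 (rows) and B1/B2 (columns).
joint = {
    ("A1", "B1"): 0.11, ("A1", "B2"): 0.29,
    ("A2", "B1"): 0.06, ("A2", "B2"): 0.54,
}


def marginal(event):
    """Sum the joint probabilities of every cell that contains the event."""
    return sum(p for cell, p in joint.items() if event in cell)


def conditional(b, a):
    """P(b | a) = P(a and b) / P(a), for a row event a and a column event b."""
    return joint[(a, b)] / marginal(a)


p_a1 = marginal("A1")
p_b1 = marginal("B1")
p_b1_given_a1 = conditional("B1", "A1")

# The events are dependent when conditioning changes the probability.
dependent = abs(p_b1_given_a1 - p_b1) > 1e-9

print(round(p_a1, 2), round(p_b1, 2), round(p_b1_given_a1, 3), dependent)
```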

3. Probability Rules and Trees

- 3 rules assist us in determining the probability of complex events
+ the complement rule
+ the multiplication rule
+ the addition rule

- Complement of an Event: the complement of event A (denoted by A^C) is the event
that occurs when event A does not occur: P(A) = 1 - P(A^C)

- Multiplication Rule: for any 2 events A and B:
P(A and B) = P(A) × P(B | A) = P(B) × P(A | B)
When A and B are independent, P(B | A) = P(B), so: P(A and B) = P(A) × P(B)

- Addition Rule: for any 2 events A and B: P(A or B) = P(A) + P(B) – P(A and B)

- When A and B are mutually exclusive: P(A or B) = P(A) + P(B)

- Probability Trees
+ This is an effective graphical tool that helps apply probability rules
+ The events are represented by the tree branches
+ The given probabilities are indicated along these branches
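
The three rules can be checked on a simple hypothetical experiment: drawing one card from a standard 52-card deck. Exact fractions are used so the results are free of rounding error.

```python
from fractions import Fraction

# Hypothetical experiment: draw one card from a standard 52-card deck.
p_heart = Fraction(13, 52)
p_king = Fraction(4, 52)
p_king_given_heart = Fraction(1, 13)   # one king among the 13 hearts

# Complement rule: P(not heart) = 1 - P(heart).
p_not_heart = 1 - p_heart

# Multiplication rule: P(heart and king) = P(heart) * P(king | heart).
p_heart_and_king = p_heart * p_king_given_heart

# Addition rule: P(heart or king) = P(heart) + P(king) - P(heart and king).
p_heart_or_king = p_heart + p_king - p_heart_and_king

# Independence: P(king | heart) equals P(king), so the product rule
# P(heart and king) = P(heart) * P(king) also holds here.
independent = p_king_given_heart == p_king

print(p_not_heart, p_heart_and_king, p_heart_or_king, independent)
```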

4. Bayes' Law
- Used to find the conditional probability of a possible cause for an event that is known to
have occurred
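
A minimal sketch of Bayes' law for one cause and its complement, P(cause | event) = P(cause) × P(event | cause) / P(event), where P(event) comes from the multiplication and addition rules. The screening-test numbers are hypothetical.

```python
def bayes(prior, likelihood, likelihood_given_not):
    """P(cause | event) from P(cause), P(event | cause), P(event | not cause)."""
    p_event = prior * likelihood + (1 - prior) * likelihood_given_not
    return prior * likelihood / p_event


# Hypothetical screening test: 1% prevalence, 95% detection rate,
# 10% false-positive rate.
posterior = bayes(prior=0.01, likelihood=0.95, likelihood_given_not=0.10)
print(round(posterior, 4))
```

Even with a positive result, the probability of the cause stays below 10% here because the cause is rare to begin with.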
1. Random Variables and Discrete Probability Distribution
- When the simple events in the sample space can be assigned numerical values, a
random variable can be defined that takes on these values
- Ex: the number of cars entering a gas station in the next 5 minutes, the amount of
gas put into a tank by a driver, the grade in a test,… → all uncertain
- There are 2 types of random variables:
+ A random variable is discrete if it can assume only a countable number of values
+ A random variable is continuous if it can assume an uncountable number of values

Discrete Probability Distribution

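- A table, formula, or graph that lists all possible values of a discrete random variable,
together with the probabilities associated with each value, is called a discrete
(probability) distribution
- Requirements:
+ 0 ≤ p(x) ≤ 1 for all x
+ the probabilities p(x) sum to 1 over all x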

- To calculate the probability that the random variable X assumes the value a, P (X=a)
+ add the probabilities of all the simple events for which X is equal to ‘a’
+ use the probability calculation tools (such as tree diagram)

Developing a Probability Distribution

- Example: A mutual fund sales person knows that there is 20% chance of closing a
sale on each call she makes. What is the probability distribution of the number of
sales if she plans to call 3 customers?
- Solution
+ Use probability rules and probability tree
+ define event Ai = {a sale is made in the ith phone call}
Assuming each phone call is independent of all other phone calls, we can assign the
following probabilities to each branch on the tree
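
The tree for this example can be enumerated directly: each path is a sequence of sale / no sale over the 3 calls, its probability is the product along the branches (independent calls), and paths with the same number of sales are added. Variable names are illustrative.

```python
from itertools import product

P_SALE = 0.2   # chance of closing a sale on each call (from the example)
CALLS = 3

distribution = {x: 0.0 for x in range(CALLS + 1)}

# Each path through the tree is a tuple such as (True, False, True).
for path in product([True, False], repeat=CALLS):
    prob = 1.0
    for sale in path:
        prob *= P_SALE if sale else 1 - P_SALE   # multiply along the branches
    distribution[sum(path)] += prob              # add paths with equal X

for x, p in distribution.items():
    print(x, round(p, 3))
```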

The Expected Value (Mean)
- Given a discrete random variable X with values xi that occur with probabilities
p(xi), the population mean of X is
E(X) = μ = Σ xi · p(xi)

- The population mean is the weighted average of all of its values. The weights are
the probabilities

The Population Variance

- Let X be a discrete random variable with possible values xi that occur with
probabilities p(xi) and let E(X) = μ. The variance of X is defined by
V(X) = σ² = Σ (xi - μ)² · p(xi)

Laws of Expected Value

E(c) = c: the expected value of a constant c is just the value of the constant
E(X + c) = E(X) + c
E(cX) = cE(X)

Laws of Variance
V(c) = 0
V(X + c) = V(X)
V(cX) = c²V(X)
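
The laws can be verified numerically on a small distribution. The sketch below uses the distribution of the number of sales in 3 independent calls with a 20% success chance; the `transform` helper and the constants 2 and 5 are illustrative choices.

```python
# Distribution of X = number of sales in 3 independent calls, p = 0.2.
dist = {0: 0.512, 1: 0.384, 2: 0.096, 3: 0.008}


def expected_value(d):
    """Weighted average of the values; the weights are the probabilities."""
    return sum(x * p for x, p in d.items())


def variance(d):
    """Weighted average of the squared deviations from the mean."""
    mu = expected_value(d)
    return sum((x - mu) ** 2 * p for x, p in d.items())


def transform(d, c, shift):
    """Distribution of cX + shift (c must be non-zero to keep values distinct)."""
    return {c * x + shift: p for x, p in d.items()}


mu, var = expected_value(dist), variance(dist)
y = transform(dist, c=2, shift=5)    # Y = 2X + 5

print(round(mu, 2), round(var, 2))
print(round(expected_value(y), 2), round(variance(y), 2))
```

Shifting by a constant moves the mean but leaves the variance unchanged; scaling by c multiplies the variance by c².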

2. The Binomial Distribution

- The binomial experiment has the following characteristics:
+ There are n trials (n is finite and fixed)
+ Each trial can result in one out of two outcomes, success or failure
+ The probability p of a success is the same for all the trials
+ All the trials of the experiment are independent

Binomial Random Variable

- The binomial random variable counts the number of successes in n trials of the
binomial experiment
- The possible values of this count are 0,1,2,…,n and therefore the binomial variable
is discrete

The Binomial Probability Distribution

The binomial probability distribution is described by the following closed-form formula:
P(X = x) = [n! / (x!(n - x)!)] · p^x · (1 - p)^(n - x), for x = 0, 1, …, n

Binomial Table

- The binomial table gives cumulative probabilities for P(X <= k)

P(X=k) = P(X<=k) – P(X <= [k-1])

P(X>=k) = 1 - P(X<=[k-1])

Mean and Variance – Binomial Variable

E(X) = μ = np
V(X) = σ² = np(1 - p)
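
In place of a printed binomial table, the cumulative probabilities can be computed directly. This sketch uses the standard binomial formula with `math.comb`; the values n = 3 and p = 0.2 are chosen for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k): what a cumulative binomial table lists."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


n, p = 3, 0.2
p_exactly_1 = binomial_cdf(1, n, p) - binomial_cdf(0, n, p)   # P(X=k)
p_at_least_1 = 1 - binomial_cdf(0, n, p)                      # P(X>=k)
mean = n * p
var = n * p * (1 - p)

print(round(p_exactly_1, 3), round(p_at_least_1, 3), mean, round(var, 2))
```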

1. Probability Density Function

- A continuous random variable has an uncountably infinite number of values in the
interval (a, b)
- To calculate probabilities of continuous random variables we define a probability
density function f(x), which satisfies the following conditions
+ f(x) is non-negative
+ The total area under the curve representing f(x) is equal to 1

- The probability that a continuous variable X will assume any particular value is zero
- The probability that X falls between ‘a’ and ‘b’ is the area under the graph of f(x)
between ‘a’ and ‘b’

- Uniform Distribution:
+ A random variable X is said to be uniformly distributed if its density function is
f(x) = 1/(b - a) for a ≤ x ≤ b
with E(X) = (a + b)/2 and V(X) = (b - a)²/12
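
For the uniform distribution the area under f(x) is a rectangle, so probabilities reduce to width × height. The interval limits below are hypothetical.

```python
def uniform_probability(x1, x2, a, b):
    """P(x1 <= X <= x2) for X uniform on [a, b]: width times height 1/(b - a)."""
    lo, hi = max(x1, a), min(x2, b)       # clip the interval to [a, b]
    return max(hi - lo, 0) / (b - a)


a, b = 2000, 5000                          # hypothetical limits
p = uniform_probability(2500, 3000, a, b)
p_single_value = uniform_probability(3000, 3000, a, b)   # zero-width interval
mean = (a + b) / 2
variance = (b - a) ** 2 / 12

print(round(p, 4), p_single_value, mean, variance)
```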

2. Normal Distribution
- A random variable X with mean μ and variance σ² is normally distributed if its
probability density function is
f(x) = [1 / (σ√(2π))] · e^(-(x - μ)² / (2σ²)), for -∞ < x < ∞

We denote a normal distribution by N(μ, σ²)

Normal distributions range from minus infinity to plus infinity
Shape: bell shaped, and symmetrical around μ
Increasing the mean shifts the curve to the right
Increasing the standard deviation ‘flattens’ the curve

- 2 facts help calculate normal probabilities

+ The normal distribution is symmetrical
+ Any random variable (rv) X having N(μ, σ²) can be transformed into a rv Z
having N(0, 1), called the ‘Standard Normal Distribution’ or the Z-distribution, by
Z = (X - μ)/σ
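
Both facts can be illustrated with `statistics.NormalDist` from the Python standard library. The mean, standard deviation and x value are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 50, 10            # hypothetical mean and standard deviation
x = 60

z = (x - mu) / sigma          # standardize: Z = (X - mu) / sigma
standard = NormalDist(0, 1)

p_via_z = standard.cdf(z)                   # P(Z <= z) from the standard normal
p_direct = NormalDist(mu, sigma).cdf(x)     # P(X <= x) computed directly

# Symmetry: the lower tail below -z equals the upper tail above z.
p_lower_tail = standard.cdf(-z)

print(z, round(p_via_z, 4), round(p_direct, 4), round(p_lower_tail, 4))
```

The two routes give the same probability, which is why a single table of the Z-distribution is enough for any normal variable.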
