1. Descriptive Statistics
- Descriptive statistics uses numerical techniques to summarize data.
- The mean and median are popular numerical techniques to describe the location of
the data
- The range, variance and standard deviation measure the variability of the data
- Descriptive statistics deals with methods of organizing, summarizing and
presenting data in a convenient and informative way
- One form of descriptive statistics uses graphical techniques
2. Inferential Statistics
- Inferential statistics is a body of methods used to draw conclusions or inferences
about characteristics of populations based on sample data
2. Types of data
- Interval data:
+ real numbers, e.g. heights, weights, prices, etc.
+ also referred to as quantitative or numerical
+ arithmetic operations can be performed on interval data
- Nominal data:
+ the values of nominal data are categories
+ ex: responses to questions about marital status, coded as: single = 1, married = 2,
divorced = 3, widowed = 4
+ arithmetic operations don’t make any sense (ex: does widowed ÷ 2 = married?)
+ also called qualitative or categorical
- Ordinal data:
+ appear to be categorical in nature, but their values have an order (a ranking)
+ ex: college course rating system: poor = 1, fair = 2, good = 3, very good = 4,
excellent = 5
+ still not meaningful to do arithmetic on this data
+ we can say: excellent > poor or fair < very good
Order is maintained no matter what numeric values are assigned to each category
PART A: Summarizing univariate data
1. Graphical & Tabular Techniques for Nominal data
- The only allowable calculation on nominal data is to count the frequency of each
value of the variable
- We can summarize the data in a table that presents the categories and their counts
called a frequency distribution
- A relative frequency distribution lists the category ………….
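A minimal sketch of a frequency distribution for nominal data, reusing the marital-status coding from the notes. The list of responses is hypothetical, and the relative frequency is assumed to follow the usual definition (each category's count divided by the total number of observations).

```python
from collections import Counter

# Coding taken from the notes; the responses themselves are made up for illustration.
labels = {1: "single", 2: "married", 3: "divorced", 4: "widowed"}
responses = [1, 2, 2, 3, 1, 2, 4, 2, 1, 2]

# Frequency distribution: the count of each category
# (counting is the only calculation allowed on nominal data).
frequency = Counter(labels[code] for code in responses)

# Relative frequency (assumed definition): count / total observations.
total = sum(frequency.values())
relative_frequency = {cat: count / total for cat, count in frequency.items()}

for cat in labels.values():
    print(f"{cat:<10}{frequency[cat]:>4}{relative_frequency.get(cat, 0.0):>8.2f}")
```

The same counts are what a bar chart or pie chart of the variable would display.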
2. Ordinal data
- Same as the nominal data but
+ The bars in the bar chart are arranged in ascending or descending order
+ The wedges in the pie chart are arranged clockwise in ascending or descending
order
- The ogive can be used to answer questions like: what value is at the 50th percentile?
Line chart
- Observations measured at successive points in time are called time-series data
- Line chart plots the value of the variable on the vertical axis against time periods on
the horizontal axis
- Geometric Mean
+ A specialized measure, used to find the average growth rate or rate of change of a
variable over time
+ Formula:
Step 1: express the rate of change ( R ) as ( 1+R )
Step 2: calculate the geometric mean using the formula
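The two steps above can be sketched in Python. This assumes the standard geometric-mean formula for growth rates, Rg = [(1+R1)(1+R2)...(1+Rn)]^(1/n) - 1; the function name and the example rates are illustrative, not taken from the notes.

```python
import math

def geometric_mean_rate(rates):
    """Average per-period growth rate from a list of per-period rates of change."""
    # Step 1: express each rate of change R as a growth factor (1 + R).
    factors = [1 + r for r in rates]
    # Step 2: take the n-th root of the product of the factors, then subtract 1.
    n = len(factors)
    return math.prod(factors) ** (1 / n) - 1

# Hypothetical example: a 100% gain in year 1 followed by a 50% loss in year 2.
rates = [1.00, -0.50]
geometric = geometric_mean_rate(rates)   # 0.0: the investment ends where it started
arithmetic = sum(rates) / len(rates)     # 0.25: the arithmetic mean overstates growth
print(geometric, arithmetic)
```

The comparison with the arithmetic mean shows why the geometric mean is preferred for rates of change over time.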
- The median
+ The median of a set of measurements is the value that falls in the middle when the
measurements are arranged in order of magnitude
+ When determining the median pay attention to the number of observations (k)
- The mode:
+ The mode of a set of measurements is the value that occurs most frequently
+ A set of data may have one mode (or modal class) or 2 or more modes.
+ Considering the average ARR only the two portfolios are equal. But are they
really?
+ Is the dispersion (variability) of ARR the same for the 2 portfolios?
- The Range: the spread, i.e. the difference between the largest and smallest values;
it does not consider the values in between, so it is not very realistic
+ The range of a set of measurements is the difference between the largest and
smallest measurements
+ Its major advantage is the ease with which it can be computed
+ Its major shortcoming is its failure to provide information on the dispersion of the
values between the two end points
+ Problem with the variance: because the deviations are squared, the units are squared
too (e.g. kg², km²), which makes it hard to understand and interpret
- Variance: the sample mean (x-bar) must be calculated first in order to calculate the
sample variance; alternatively, there is a short-cut formulation:
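A short sketch of the measures of variability discussed above. The short-cut form used here is the common textbook one, s² = [Σx² - (Σx)²/n] / (n - 1); it is an assumption that this is the formulation the notes refer to. The data values are hypothetical.

```python
import math

def sample_variance(xs):
    """Definitional form: sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Short-cut form: [sum(x^2) - (sum(x))^2 / n] / (n - 1); no mean needed first."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [8, 4, 9, 11, 3]                  # hypothetical measurements, e.g. in kg
data_range = max(data) - min(data)       # uses only the two end points
variance = sample_variance(data)         # in squared units (kg^2)
std_dev = math.sqrt(variance)            # back in the original units (kg)

print(data_range, variance, sample_variance_shortcut(data), round(std_dev, 3))
```

Both variance functions give the same value; the standard deviation is easier to interpret because it is expressed in the units of the data.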
2. Experiment
- The investigator observes how a response variable behaves when the researcher
manipulates one or more explanatory variables (factors)
- Goal: determine the effect of the manipulated factors on the response variable
- Advantage vs Disadvantage
+ Advantage: provides useful data, particularly for cause-and-effect relationships
+ Disadvantage: relatively expensive and time-consuming
3. Survey
- Goal: used to solicit information from people concerning such things as income, family
size, opinions on various issues…
- The majority of surveys are conducted for private use
- 3 channels
+ Personal interview
_ High rate of response, fewer incorrect answers
_ Costly: people, money, time…
+ Telephone interview
_ less expensive
_ less personal, lower response rate
+ Mail survey
_ inexpensive
_ low response rate, high number of incorrect answers
- Design Steps
+ Define the issue
_ What are the purpose and objectives of the survey?
_ Identify the questions to answer
+ Decide what to measure and how to measure it
_ Decide what information is needed to answer the questions
_ Think about how you intend to tabulate and analyze the responses
+ Define the population of interest
+ Design questionnaire
_ Questionnaire should be kept as short as possible
_ The questions should be short, simple, clear, unambiguous
_ Begin with simple demographic questions
_ Use both dichotomous (close-ended) questions and open-ended questions
_ Avoid using leading questions
+ Pre-test the survey
_ Pilot test with a small group of participants
_ Assess clarity and length
+ Determine the sample size and sampling method
+ Select sample and administer the survey
- Types of questions
+ Close-ended questions: select from a short list of defined choices
+ Open-ended questions: respondents are free to respond with any value, words, or
statement
+ Demographic questions: about the respondents’ personal characteristics
2. Methods of Sampling
- Probability Samples:
+ Simple random
_ every individual or item from the population has an equal chance of being
selected
_ Selection may be with replacement or without replacement
_ Samples can be obtained from a table of random numbers or computer random
number generators
+ Stratified random
_ Population divided into subgroups (called strata) according to some common
characteristic
_ Simple random sample selected from each subgroup
_ Samples from subgroups are combined into one
_ The proportions of the subgroups in the sample match their proportions in the population
+ Systematic
_ Decide on sample size: n
_ Divide frame of N individuals into groups of k individuals: k = N/n
_ Randomly select one individual from the 1st group
_ Select every k-th individual thereafter
+ Cluster: unpopular
_ Population is divided into several ‘clusters’, each representative of the
population
_ A simple random sample of clusters is selected: all items in the selected clusters
can be used, or items can be chosen from a cluster using another probability
sampling technique
_ The proportion of each cluster is not important
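The probability sampling methods above can be sketched with Python's standard `random` module. The sampling frame, the strata names, and the helper functions are all hypothetical; the stratified sampler assumes proportional allocation (each stratum contributes in proportion to its size), which is one common choice rather than the only one.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every item has an equal chance of selection (here: without replacement)."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Pick a random start in the first group of k = N // n, then every k-th item."""
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n, rng):
    """Simple random sample from each stratum, combined into one sample.
    Assumes proportional allocation; rounding may shift the total slightly from n."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        share = round(n * len(members) / total)
        sample.extend(rng.sample(members, share))
    return sample

rng = random.Random(0)                        # seeded only to make runs repeatable
frame = list(range(1, 101))                   # hypothetical frame of N = 100 items
strata = {"A": frame[:60], "B": frame[60:]}   # hypothetical strata: 60% and 40%

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = stratified_sample(strata, 10, rng)
print(srs, systematic, stratified, sep="\n")
```

Cluster sampling would follow the same pattern: draw a simple random sample of cluster names, then use all or some of the items inside the selected clusters.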
2. Non-Sampling Error
- An error occurs when there are mistakes in the acquisition of the data or due to the
sample observations being selected improperly
+ Selection Bias:
_ occurs when the way the sample is selected systematically excludes some parts of
the population of interest
_ also usually occurs when only volunteers or self-selected individuals are used in a
study
+ Measurement or response bias:
_ occurs when the method of observation tends to produce values that
systematically differ from the true value in some way
_ this problem might happen when: an improperly calibrated scale is used to
weigh items; questions on a survey are worded in a way that tends to influence the
response; the appearance or behavior of the interviewer/group/organization
conducting the survey affects the answers; or people tend not to be completely honest
when asked about sensitive issues
+ Nonresponse bias:
_ occurs when responses are not obtained from some individuals of the sample
_…
CHAPTER 5: INTRODUCTION TO PROBABILITY
- Union of A and B is the event that occurs when either A or B or both occur
It is denoted ‘A or B’ (the union)
- Addition Rule: for any 2 events A and B: P(A or B) = P(A) + P(B) – P(A and B)
- Probability Trees
+ This is an effective graphical tool that helps apply probability rules
+ The events are represented by the tree branches
+ The given probabilities are indicated along these branches
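The addition rule stated earlier in this chapter can be checked by enumerating a small sample space. The experiment (two fair dice) and the events A and B are hypothetical choices for illustration; exact fractions are used so the two sides compare without rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical experiment: roll two fair dice; all 36 outcomes are equally likely.
space = set(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    return Fraction(len(event), len(space))

A = {o for o in space if o[0] == 6}       # first die shows 6
B = {o for o in space if sum(o) >= 10}    # total is at least 10

p_union = prob(A | B)                            # P(A or B), counted directly
p_rule = prob(A) + prob(B) - prob(A & B)         # P(A) + P(B) - P(A and B)
print(p_union, p_rule)
```

Subtracting P(A and B) corrects for the outcomes that belong to both events and would otherwise be counted twice.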
4. Bayes’ Law
- To find the conditional probability of a possible cause for an event that is known to
have occurred
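A minimal sketch of Bayes' Law for a single possible cause, assuming the usual form P(cause | event) = P(cause) P(event | cause) / P(event), with P(event) obtained by adding the two branches of a probability tree. The function name and all the numbers are hypothetical.

```python
def bayes_posterior(prior, p_event_given_cause, p_event_given_no_cause):
    """P(cause | event) from the prior and the two conditional probabilities."""
    # Total probability of the event: sum over both branches of the tree.
    p_event = (prior * p_event_given_cause
               + (1 - prior) * p_event_given_no_cause)
    # Share of that total that comes from the 'cause' branch.
    return prior * p_event_given_cause / p_event

# Hypothetical numbers: 1% of items are defective; a test flags 95% of defective
# items and 10% of good items. Given a flag, how likely is a defect?
posterior = bayes_posterior(0.01, 0.95, 0.10)
print(round(posterior, 4))
```

Even with a fairly accurate test, the posterior stays small here because the cause is rare to begin with.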
CHAPTER 6: RANDOM VARIABLES AND DISCRETE PROBABILITY
DISTRIBUTIONS
1. Random Variables and Discrete Probability Distribution
- When the simple events in the sample space can be assigned numerical values, a
random variable can be defined that takes on these values
- Ex: the number of cars entering a gas station in the next 5 minutes, the amount of
gas filled in a gas tank by a driver, the grade in a test,… (all of these are uncertain)
- There are 2 types of random variables:
+ A random variable is discrete if it can assume only a countable number of values
+ A random variable is continuous if it can assume an uncountable number of
values
- To calculate the probability that the random variable X assumes the value a, P (X=a)
+ add the probabilities of all the simple events for which X is equal to ‘a’
+ use the probability calculation tools (such as tree diagram)
- The population mean is the weighted average of all of its values. The weights are
the probabilities
Laws of Variance
V(c) = 0
V(X + c) = V(X)
V(cX) = c²V(X)
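The population mean as a probability-weighted average, and the three laws of variance above, can be verified on a small discrete distribution. The distribution and helper names are hypothetical; the variance is assumed to be the usual V(X) = Σ (x - μ)² p(x).

```python
from fractions import Fraction as F

def mean(dist):
    """E(X): weighted average of the values, with the probabilities as weights."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """V(X): probability-weighted average of squared deviations from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, f):
    """Distribution of f(X): apply f to each value, merging equal results."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

# Hypothetical distribution: number of heads in two fair coin tosses.
X = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}
c = 3

print(mean(X), variance(X))                       # 1 and 1/2
print(variance(transform(X, lambda x: c)))        # V(c) = 0
print(variance(transform(X, lambda x: x + c)))    # V(X + c) = V(X)
print(variance(transform(X, lambda x: c * x)))    # V(cX) = c^2 V(X)
```

Adding a constant shifts every value and the mean together, so deviations are unchanged; multiplying by c scales every deviation by c and every squared deviation by c².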
Binomial Table
- The probability that a continuous variable X will assume any particular value is
zero
- The probability that X falls between ‘a’ and ‘b’ is the area under the graph of f(x)
between ‘a’ and ‘b’
- Uniform Distribution:
+ A random variable X is said to be uniformly distributed if its density function is
f(x) = 1/(b − a), a ≤ x ≤ b
with E(X) = (a + b)/2 and V(X) = (b − a)²/12
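A brief sketch of the uniform distribution formulas above. Because the density is flat, a probability is just a rectangle area (base times height). The endpoints a and b and the function names are hypothetical.

```python
def uniform_density(x, a, b):
    """f(x) = 1 / (b - a) on [a, b], and 0 elsewhere."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_prob(x1, x2, a, b):
    """P(x1 <= X <= x2): rectangle with base clipped to [a, b] and height 1/(b - a)."""
    lo, hi = max(x1, a), min(x2, b)
    return max(hi - lo, 0) / (b - a)

a, b = 2000, 5000                       # hypothetical range, e.g. daily sales
expected = (a + b) / 2                  # E(X) = (a + b) / 2
var = (b - a) ** 2 / 12                 # V(X) = (b - a)^2 / 12

print(expected, var)
print(uniform_prob(2500, 3000, a, b))   # area between 2500 and 3000
print(uniform_prob(2500, 2500, a, b))   # a single value has zero probability
```

The last line illustrates the point made earlier: for a continuous variable, the probability of any particular value is zero.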
2. Normal Distribution
- A random variable X with mean μ and variance σ² is normally distributed if its
probability density function is
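The density itself did not survive in these notes. The standard normal density is f(x) = 1/(σ√(2π)) · exp(−½((x − μ)/σ)²) for −∞ < x < ∞, and the sketch below assumes that is the formula intended. It is cross-checked against `statistics.NormalDist` from the Python standard library; the parameter values are hypothetical.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma (sigma > 0)."""
    z = (x - mu) / sigma                  # standardized distance from the mean
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 10, 2                         # hypothetical parameters (variance = 4)
ours = normal_pdf(12, mu, sigma)
reference = NormalDist(mu, sigma).pdf(12)
print(ours, reference)
```

Note that the function takes the standard deviation σ, not the variance σ², as its second parameter.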