2. Data Selection
Methods | Advantages | Disadvantages
Experiments | Provide controls, preplanned objectives | Costly, time-consuming, requires planning
Telephone surveys | Timely, relatively inexpensive | Poor reputation, limited scope and length
Written surveys | Inexpensive, can expand length, can use open-ended questions | Low response rate, requires exceptional clarity
Direct observation, personal interview | Expands analysis opportunities, no respondent bias | Potential observer bias, costly
Possible biases
Non-response
Selection | Wrong time or target chosen
Interviewer | Voice, tone, judgement …
Observer | Any expectations, beliefs, or personal preferences of a researcher that unintentionally influence his or her recording
Poorly worded questionnaires or leading questions
Methods of randomization
Nonstatistical sampling
Convenience sampling
Statistical sampling
Simple random sample | Every possible sample of a given size has an equal chance of being selected
Stratified random sample | The population is divided into subgroups called strata, so that the population values of interest within each stratum are as alike as possible
Cluster sampling | The population is divided into mini-populations (clusters). This overcomes the geographical spread problem. Each cluster should have the same characteristics as the population as a whole
Systematic random sampling | Selects every kth item in the population after a randomly selected starting point between 1 and k
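The four statistical sampling methods above can be sketched with Python's standard `random` module. This is a minimal illustration, not a survey-grade implementation: the function names, the population of 100 item IDs, and the fixed seed are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    # Every subset of size n is equally likely to be drawn.
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    # Random start between 1 and k (1-based), then every k-th item.
    start = rng.randint(1, k)
    return population[start - 1::k]

def stratified_sample(strata, n_per_stratum, rng):
    # strata: dict mapping a stratum label to the list of its members.
    # A simple random sample is drawn inside each stratum.
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # clusters: list of lists. Whole clusters are chosen at random
    # and every member of a chosen cluster is kept.
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(0)            # fixed seed only so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 item IDs
print(simple_random_sample(population, 5, rng))
print(systematic_sample(population, 10, rng))
```

With 100 items and k = 10, the systematic sample always contains 10 items spaced exactly 10 apart; only the starting point is random.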
Data types
Quantitative: continuous or discrete
Categorical (qualitative)
Round to 3 decimal digits
Outliers:
Q1: first quartile
Q3: third quartile
IQR = Q3 - Q1
Fence = 1.5 * IQR
Upper fence = Q3 + fence
Lower fence = Q1 - fence
Any value above the upper fence or below the lower fence is an outlier
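The fence calculation can be sketched in a few lines of Python. The function names, the data, and the quartile values below are hypothetical; the quartiles are passed in rather than computed so the sketch does not depend on any particular quartile method.

```python
def outlier_fences(q1, q3):
    # The fence distance is 1.5 times the interquartile range.
    iqr = q3 - q1
    fence = 1.5 * iqr
    return q1 - fence, q3 + fence

def find_outliers(values, q1, q3):
    # Keep only the values that fall outside the two fences.
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

data = [2, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
q1, q3 = 4.5, 8.5                 # hypothetical quartiles for this data

print(outlier_fences(q1, q3))       # (-1.5, 14.5)
print(find_outliers(data, q1, q3))  # [30]
```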
Position of the 65th percentile = count of values * 65%; the percentile is the value at that position in the sorted data
Measure of Variability
Range =Max-Min
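Interquartile range | Multiply the total number of data points (COUNT function) by:
Q1: * 0.25
Q3: * 0.75
Whole-number position: average the value there with the next one. Fractional position: take the value directly
The quartile is the value at the position just found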
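The position rule can be sketched as follows. Two points are assumptions rather than something the notes state: a fractional position is rounded up to the next whole position, and `p` is given as a whole-number percentage. The function name and the data are hypothetical, and other tools (spreadsheets, `statistics.quantiles`) may use different interpolation rules and return slightly different values.

```python
def percentile_by_position(values, p):
    """Percentile of `values` by the position rule; p is an integer with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # position = n * p / 100, kept exact with integer arithmetic
    position, remainder = divmod(n * p, 100)
    if remainder == 0:
        # Whole-number position: average that value with the next one
        # (1-based positions `position` and `position + 1`).
        return (ordered[position - 1] + ordered[position]) / 2
    # Fractional position: assumed to round up to the next whole position.
    return ordered[position]

data = [2, 4, 5, 6, 7, 8, 9, 30]         # hypothetical data, n = 8
q1 = percentile_by_position(data, 25)    # position 2 -> (4 + 5) / 2
q3 = percentile_by_position(data, 75)    # position 6 -> (8 + 9) / 2
print(q1, q3, q3 - q1)                   # 4.5 8.5 4.0
print(percentile_by_position(data, 65))  # position 5.2 -> 6th value -> 8
```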
Variance =variance
Standard deviation =std
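As a rough sketch, the same measures can be computed with Python's standard `statistics` module. The notes do not say whether the sample or the population formula is intended, so both are shown; the data is hypothetical.

```python
import statistics

data = [4, 8, 6, 5, 3, 7]  # hypothetical data

data_range = max(data) - min(data)

# Sample versions divide by n - 1.
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

# Population versions divide by n.
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

print(data_range)
print(sample_var, sample_std)
print(pop_var, pop_std)
```

The standard deviation is always the square root of the matching variance, so the choice between sample and population only needs to be made once.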
Sensitivity
Mode: Least sensitive to outliers
Mean: Very sensitive to outliers
Median: Not sensitive to outliers
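A small experiment illustrates the difference in sensitivity: add one extreme value to a dataset and compare the three measures before and after. The data and the `summarize` helper are hypothetical.

```python
import statistics

def summarize(values):
    # Returns (mean, median, mode) for a list of numbers.
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

base = [3, 4, 4, 5, 6]      # hypothetical data
with_outlier = base + [60]  # same data plus one extreme value

print(summarize(base))
print(summarize(with_outlier))
```

In this example the mean jumps from 4.4 to roughly 13.7, the median moves only from 4 to 4.5, and the mode stays at 4.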
Normal distribution
Empirical Rule
What percentage of a normally distributed dataset falls within ±1 standard deviation of the mean? About 68%
And within ±2 standard deviations? About 95%
And within ±3 standard deviations? About 99.7%
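The empirical-rule percentages can be checked against the standard normal distribution using `statistics.NormalDist` from the Python standard library. The helper name `share_within` is an assumption made for this sketch.

```python
from statistics import NormalDist

def share_within(k):
    # Fraction of a normal distribution lying within k standard
    # deviations of the mean: P(-k < Z < k) for the standard normal Z.
    z = NormalDist()  # mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, f"{share_within(k) * 100:.2f}%")
```

The exact shares are about 68.27%, 95.45%, and 99.73%, which the rule rounds for easy recall.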
symmetric distribution
mean = median = mode
skewed distribution
Left skew: mean < median < mode
Right skew: mode < median < mean
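The ordering of mean and median gives a quick rule-of-thumb check for skew direction. The sketch below is only that heuristic: equal mean and median do not prove symmetry, and the function name and datasets are hypothetical.

```python
import statistics

def describe_shape(values):
    # Rule of thumb only: compare the mean with the median.
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right skew"
    if mean < median:
        return "left skew"
    return "symmetric"

right = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]  # long tail to the right
left = [16 - v for v in right]           # mirror image, long tail to the left
even = [1, 2, 3, 3, 3, 4, 5]             # balanced around 3

for name, values in (("right", right), ("left", left), ("even", even)):
    print(name, describe_shape(values))
```

For the `right` dataset the mode is 2, the median is 3, and the mean is 4.6, which matches the ordering mode < median < mean.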
CV (Coefficient of Variation) = (σ / μ) * 100
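A minimal sketch of the coefficient of variation in Python, assuming the population standard deviation is intended (since the formula uses σ and μ). The function name and data are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    # CV = (standard deviation / mean) * 100, expressed as a percentage.
    mu = statistics.mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return sigma / mu * 100

data = [8, 12, 8, 12]  # hypothetical data: mean 10, standard deviation 2
print(coefficient_of_variation(data))
```

Because the CV is a ratio, it allows the spread of datasets with different units or very different means to be compared.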