Quantitative Data Collection and Analysis I
• Measurements
• Simulations
Sampling Design
• A sample is “a smaller (but hopefully representative) collection
of units from a population used to determine truths about that
population” (Field, 2005)
• Why sample?
• Resources (time, money) and workload
• Gives results with known accuracy that can be calculated
mathematically.
• Central Limit Theorem (Reading Assignment / Programming Assignment (Bonus))
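As a starting point for the programming assignment, the Central Limit Theorem can be explored with a small simulation. This is only a sketch under stated assumptions: the population (10 000 exponentially distributed values), the sample size of 50, and the helper name `sample_means` are all made up for illustration and do not come from the slides.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw repeated random samples (with replacement) and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(num_samples)
    ]


# Hypothetical, deliberately skewed population (not data from the lecture).
pop_rng = random.Random(42)
population = [pop_rng.expovariate(1.0) for _ in range(10_000)]

sample_size = 50
means = sample_means(population, sample_size=sample_size, num_samples=2_000)

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# The sample means should cluster around the population mean, with a spread
# close to (population s.d.) / sqrt(sample size).
print("population mean:      ", round(pop_mean, 3))
print("mean of sample means: ", round(statistics.mean(means), 3))
print("s.d. of sample means: ", round(statistics.pstdev(means), 3))
print("pop s.d. / sqrt(n):   ", round(pop_sd / sample_size ** 0.5, 3))
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the population itself is strongly skewed.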
Sampling Design
• The sampling frame is the list from which the potential
respondents are drawn. Examples:
• Registrar’s office
• Class rosters
• List of certified programmers in a specific language.
• Must assess sampling frame errors.
• What is your population of interest?
• To whom do you want to generalize your results?
• Can you sample the entire population?
Sampling Design
• 3 factors that influence sample representativeness
• Sampling procedure
• Sample size
• Participation (response)
• Non-sampling error: estimation error that arises from sources other
than random variation
• non-response
• undercoverage of survey
• poorly-trained interviewers
• non-truthful answers
• non-probability sampling
Errors
• Type I errors (false positive)
• Keep it crystal clear
• Unclear: How much do you think you would have to pay to have ETC something that needs to be repaired?
• Clear: How much do you think ETC charges for a repair service call?
Individual Question Wording
• “Do not’s” for all questions
• Don’t ask leading questions
• Leading: Shouldn’t efficient designers use Agile Methods?
• Neutral: Do you think using Agile Methods […] efficiency?
Questionnaire organization
• Questionnaire organization is the sequence of statements and
questions that make up the questionnaire.
• Introduction with Cover Letter.
• Screening questions.
• Question flow pertains to the sequencing of questions or blocks of questions.
• Warm-up questions
• Transitions
• Skip questions
• Classification and demographic questions
Coding and Pretesting
• Coding: use of numbers associated with question responses
• Numbers are preferred for two reasons:
• Numbers are easier and faster to keystroke into a
computer file.
• Computer tabulation programs are more efficient when
they process numbers.
• Pretest
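The coding step above can be sketched in a few lines. The codebook, the response labels, and the missing-value code 9 are hypothetical choices for illustration, not values from the lecture.

```python
from collections import Counter

# Hypothetical codebook: each response option is assigned a numeric code.
CODEBOOK = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}
MISSING = 9  # assumed code for blank or unrecognised answers


def code_responses(responses, codebook=CODEBOOK, missing_code=MISSING):
    """Translate text responses into their numeric codes."""
    return [codebook.get(answer, missing_code) for answer in responses]


raw = ["Agree", "Neutral", "Agree", "No answer", "Strongly agree"]
coded = code_responses(raw)

# Tabulation then works directly on the numbers.
tally = Counter(coded)
print(coded)
print(tally)
```

Pretesting the questionnaire is a good moment to check that every answer actually given maps to a code in the codebook.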
Descriptive Statistics
• Lecture outline
• Features of Descriptive Statistics.
• Key measures
• Measures of central tendency
• Mean (types of mean), Mode, Median
• Measures of Variation
• Range, Percentile, Quartiles, IQR
• Measures of Shape
• Stem-and-Leaf and Box-and-Whisker Plots
Types of Variables
• Qualitative Variables
• Attributes, categories
• Examples: male/female, registered to vote/not, ethnicity, eye
color....
• Quantitative Variables
• Discrete - usually take on integer values but can take on
fractions when variable allows - counts, how many
• Continuous - can take on any value at any point along an
interval - measurements, how much
Example: Types of Variables
• Skew: measured by skewness
• Peakedness: measured by kurtosis
Descriptive Statistics
An Illustration:
Which Group is Smarter?
Class A (IQs of 13 students)    Class B (IQs of 13 students)
102   115                       127   162
128   109                       131   103
131    89                        96   111
 98   106                        80   109
140   119                        93    87
 93    97                       120   105
110                             109
Each individual may be different. If you try to understand a group by remembering the
qualities of each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics
Which group is smarter now?
Class A mean: 110.54    Class B mean: 110.23
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
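The two class means can be checked directly from the IQ scores in the illustration. This is a minimal sketch; the variable names are chosen here for readability.

```python
# IQ scores copied from the "Which Group is Smarter?" illustration.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_a = arithmetic_mean(class_a)
mean_b = arithmetic_mean(class_b)

print(f"Class A mean: {mean_a:.2f}")
print(f"Class B mean: {mean_b:.2f}")
```

A single number per class makes the comparison manageable, which is the point of the illustration.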
Mean for Frequency Distribution
Weighted mean
Geometric Mean
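The formulas for these three means did not survive on the slides, so the standard textbook definitions are given here as a reference. The symbols are assumptions: $f_j$ is the frequency of value (or class mark) $x_j$, $w_i$ is the weight attached to $x_i$, and $G$ denotes the geometric mean of $n$ positive values.

$$\bar{X} = \frac{\sum_{j} f_j x_j}{\sum_{j} f_j} \qquad \text{(mean for a frequency distribution)}$$

$$\bar{X}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \qquad \text{(weighted mean)}$$

$$G = \left( \prod_{i=1}^{n} x_i \right)^{1/n} \qquad \text{(geometric mean)}$$

The geometric mean is the natural average for quantities that combine by multiplication, such as yearly growth factors.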
• E.g., a data company increased its data capacity over three years
by the following amounts (in megabytes):
• Year 1: +300 000 000 MB
• Year 2: +200 000 000 MB
• Year 3: +100 000 000 MB
Geometric Mean
• Therefore, it is fair to say that the company increased its data
capacity by an average of 200 000 000 MB per year (the arithmetic mean of the three increases).
Example
• Now, consider another data firm whose database growth is
given in percentages:
• Year 1: +1.5%
• Year 2: +2.0%
• Year 3: +2.5%
Example
• Suppose the company started with 100 000 000 MB of data. Its data
capacity at the end of the three years would be computed as follows:
• If we use arithmetic mean:
• There would be an average of 2% increase per year.
• Hence:
• 100 000 000 * 1.020 * 1.020 * 1.020 = 106 120 800 MB of data
Example
• According to the actual figures, the total capacity of the database at the end of the three years should be:
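The comparison can be worked through numerically. The sketch below applies the three yearly percentages one after another, then compares the result with the arithmetic-mean shortcut and with the geometric mean of the growth factors. Variable names are chosen here for illustration.

```python
import math

start_mb = 100_000_000
yearly_growth = [0.015, 0.020, 0.025]  # +1.5%, +2.0%, +2.5%

# Actual capacity: apply each year's growth factor in turn
# (100 000 000 * 1.015 * 1.020 * 1.025, about 106 118 250 MB).
factors = [1 + g for g in yearly_growth]
actual_mb = start_mb * math.prod(factors)

# Arithmetic mean of the percentages (2% per year) overstates the result.
arithmetic_rate = sum(yearly_growth) / len(yearly_growth)
arithmetic_mb = start_mb * (1 + arithmetic_rate) ** len(yearly_growth)

# Geometric mean of the growth factors reproduces the actual figure.
geometric_factor = math.prod(factors) ** (1 / len(factors))
geometric_mb = start_mb * geometric_factor ** len(factors)

print(f"actual capacity:          {actual_mb:,.0f} MB")
print(f"using arithmetic mean:    {arithmetic_mb:,.0f} MB")
print(f"using geometric mean:     {geometric_mb:,.0f} MB")
print(f"geometric mean growth:    {(geometric_factor - 1) * 100:.4f}% per year")
```

The arithmetic-mean figure is a little too high, while the geometric mean (slightly below 2% per year) gives back the actual end-of-period capacity.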
• Percentiles
• To get the range for a variable, you subtract its lowest value from its
highest value.
The 25th percentile is the quartile that separates the first ¼ of cases from the remaining ¾.
The 75th percentile is the quartile that separates the first ¾ of cases from the remaining ¼.
The interquartile range (IQR) is the distance between the 25th percentile and the 75th percentile. Below, what is the
interquartile range?
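As an illustration of these measures, the range and interquartile range can be computed for the Class A IQ scores from the earlier slide. This is a sketch, not the lecture's own method: several quartile conventions exist, and the `"inclusive"` method of Python's `statistics.quantiles` is just one of them, so other tools may report slightly different quartiles.

```python
import statistics

# Class A IQ scores from the "Which Group is Smarter?" illustration.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

# Range: highest value minus lowest value.
value_range = max(class_a) - min(class_a)

# Quartiles: the three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(class_a, n=4, method="inclusive")

# Interquartile range: distance between the 25th and 75th percentiles.
iqr = q3 - q1

print("range:", value_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
```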
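Population variance and standard deviation:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$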
Variance, S.D. of a Sample
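$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}$$
Here n − 1 is the number of degrees of freedom.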
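The difference between dividing by n and dividing by n − 1 can be seen on a small data set. The sketch below uses the Class B IQ scores from the earlier slide and cross-checks the hand-written formulas against Python's `statistics` module.

```python
import math
import statistics

# Class B IQ scores from the "Which Group is Smarter?" illustration.
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

n = len(class_b)
mean = sum(class_b) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in class_b)

# Treating the data as a whole population: divide by n.
pop_var = sum_sq_dev / n
pop_sd = math.sqrt(pop_var)

# Treating the data as a sample: divide by n - 1 (degrees of freedom).
sample_var = sum_sq_dev / (n - 1)
sample_sd = math.sqrt(sample_var)

print(f"population variance {pop_var:.2f}, s.d. {pop_sd:.2f}")
print(f"sample variance     {sample_var:.2f}, s.d. {sample_sd:.2f}")
```

The sample figures are always slightly larger, and the gap shrinks as n grows.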
Example (Grouped data)
Find the variance and standard deviation of the sample data
below:
Weight (Class Interval)   Frequency, f   Class Mark, x   fx     Cumulative Frequency, F   Class Boundary   fx²
60-62                     5              61              305    5                         59.5-62.5
63-65                     18             64              1152   23                        62.5-65.5
66-68                     42             67              2814   65                        65.5-68.5
69-71                     27             70              1890   92                        68.5-71.5
72-74                     8              73              584    100                       71.5-74.5
Total                     100                            6745
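$$s^2 = \frac{\sum f x^2 - \dfrac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1} = \; ? \qquad s = \sqrt{\frac{\sum f x^2 - \dfrac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1}} = \; ?$$
Answer: s² = 8.61; s = 2.93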
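The grouped-data answer can be reproduced with the computational formula, using the class marks and frequencies from the table. This is a minimal sketch; the variable names are chosen here.

```python
import math

# (class mark x, frequency f) pairs taken from the table above.
classes = [(61, 5), (64, 18), (67, 42), (70, 27), (73, 8)]

sum_f = sum(f for _, f in classes)            # total frequency
sum_fx = sum(f * x for x, f in classes)       # sum of f * x
sum_fx2 = sum(f * x ** 2 for x, f in classes) # sum of f * x^2 (the empty column)

# Sample variance for grouped data: divide by (sum of f) - 1.
s2 = (sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1)
s = math.sqrt(s2)

print("sum f   =", sum_f)
print("sum fx  =", sum_fx)
print("sum fx2 =", sum_fx2)
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```

The intermediate sums also fill in the fx² column that the table leaves blank.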
Visualizing Data
• Stem and Leaf Plot
• A Stem and Leaf Plot is a special table where each data value is split
into a "stem" (the first digit or digits) and a "leaf" (usually the last digit).
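A stem and leaf plot is easy to build by hand or in code. The sketch below assumes whole-number data, with the last digit as the leaf and the remaining digits as the stem, and reuses the Class A IQ scores from the earlier slide. The function name `stem_and_leaf` is chosen here for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integer values by stem (all but the last digit) with sorted leaves."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 128 -> stem 12, leaf 8
        plot[stem].append(leaf)
    return dict(plot)


# Class A IQ scores from the "Which Group is Smarter?" illustration.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]

plot = stem_and_leaf(class_a)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row shows one stem followed by its leaves, so the length of a row reflects how many values fall in that ten-point band.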
Box and Whisker Plot
• A box and whisker plot is defined as a graphical method of displaying variation in a set
of data.
• The procedure to develop a box and whisker plot comes from the five statistics below.
• Minimum value: The smallest value in the data set
• First quartile: The value below which the lower 25% of the data are contained
• Median value (second quartile): The middle value of the ordered data set
• Third quartile: The value above which the upper 25% of the data are contained
• Maximum value: The largest value in the data set
Example
• Draw a box-and-whisker plot for the following data set:
• 4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4,
4.2, 4.5, 4.4
1. Order the data:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9,
5.0, 5.1
2. Find the median: 4.4 (the 9th of the 17 ordered values)
3. Find Q1 and Q3: Q1 = (4.3 + 4.3)/2 = 4.3; Q3 = (4.7 + 4.8)/2 = 4.75
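The three steps above can be sketched in code. The function below follows the same convention as the worked example: the quartiles are the medians of the lower and upper halves, with the overall median left out of both halves when the number of values is odd. Other quartile conventions exist and can give slightly different results. The name `five_number_summary` is chosen here for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)                      # step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]                        # values below the median position
    upper = data[half + 1:] if n % 2 else data[half:]  # values above it
    return (
        data[0],
        statistics.median(lower),              # Q1
        statistics.median(data),               # step 2: median
        statistics.median(upper),              # Q3
        data[-1],
    )


data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4,
        4.2, 4.5, 4.4]

minimum, q1, median, q3, maximum = five_number_summary(data)
print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
```

These five values are exactly what is needed to draw the box (Q1 to Q3, with a line at the median) and the whiskers (out to the minimum and maximum).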
Example
• Decide on the Scale;
• Connect Q1, Median and Q3 to make a box;
• Draw whiskers from the box out to the minimum and maximum values.
Example
Exercise
• What percent of the methods have fewer than 7 errors?