
Lecture 5

Quantitative Data Collection and Analysis I


Quantitative Research vs. Qualitative Research
Quantitative Data
• Data whose values take the form of counts or numbers.

• Each data set has a unique numerical value associated with it.

• It lends itself to mathematical calculation and statistical analysis.

• Examples of Quantitative Data in Computing?


Quantitative Data Examples
• Counter: Count equated with entities. For example, the number of people who download a
particular application from the App Store.
• Measurement of physical objects: Calculating measurement of any physical thing. For
example, the HR executive carefully measures the size of each cubicle assigned to the newly
joined employees.
• Sensory calculation: Mechanism to naturally “sense” the measured parameters to create a
constant source of information. For example, a digital camera converts electromagnetic
information to a string of numerical data.
• Projection of data: Future projection of data can be done using algorithms and other
mathematical analysis tools.
• Quantification of qualitative entities: Assign numbers to qualitative information. For
example, asking respondents of an online survey to rate their likelihood of recommendation
on a scale of 0-10.
Data Collection Methods
• Surveys
• Longitudinal Studies
• Cross-sectional Studies
• One-on-one Interviews

• Measurements

• Simulations
Sampling Design
• A sample is “a smaller (but hopefully representative) collection
of units from a population used to determine truths about that
population” (Field, 2005)

• Why sample?
• Resources (time, money) and workload
• Gives results with known accuracy that can be calculated
mathematically.
• Central Limit Theorem (Reading Assignment / Programming Assignment (Bonus))
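A rough simulation can make the "known accuracy" point concrete. The sketch below assumes a made-up, skewed population (exponential values) and hypothetical sample sizes of 5 and 100; it only illustrates that means of larger samples cluster more tightly around the population mean, and is not a full treatment of the theorem.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential values with mean 1.
population = [random.expovariate(1.0) for _ in range(20_000)]

def sample_means(sample_size, repeats=1_000):
    """Means of `repeats` simple random samples of the given size."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]

means_small = sample_means(5)
means_large = sample_means(100)

# The spread of the sample means shrinks as the sample size grows.
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)
print(f"spread of means, n=5:   {spread_small:.3f}")
print(f"spread of means, n=100: {spread_large:.3f}")
```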
Sampling Design
• The sampling frame is the list from which the potential
respondents are drawn: Examples
• Registrar’s office
• Class rosters
• List of certified programmers in a specific language.
• Must assess sampling frame errors.
• What is your population of interest?
• To whom do you want to generalize your results?
• Can you sample the entire population?
Sampling Design
• Three factors that influence sample representativeness
• Sampling procedure
• Sample size
• Participation (response)

• When might you sample the entire population?


• When your population is very small
• When you have extensive resources
• When you don’t expect a very high response
SAMPLING BREAKDOWN
Types of Samples
• Probability (Random) Samples
• Simple random sample
• Systematic random sample
• Stratified random sample
• Multistage sample
• Multiphase sample
• Cluster sample
• Non-Probability Samples
• Convenience sample
• Purposive sample
• Quota
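Three of the probability designs listed above can be sketched in a few lines. The sampling frame (unit IDs 1 to 100), the two strata, and the sample sizes are all hypothetical and chosen only for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

frame = list(range(1, 101))  # hypothetical sampling frame: unit IDs 1..100

# Simple random sample: every unit has the same chance of selection.
simple = random.sample(frame, 10)

# Systematic random sample: random start, then every k-th unit.
k = len(frame) // 10
start = random.randrange(k)
systematic = frame[start::k]

# Stratified random sample: a separate simple random sample inside each stratum.
strata = {"low": frame[:50], "high": frame[50:]}
stratified = {name: random.sample(units, 5) for name, units in strata.items()}

print(simple)
print(systematic)
print(stratified)
```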
Simple random sampling
Probability sampling
• Each individual has known (non-zero) probability of selection
• Precision of estimates can be quantified.
Sampling Weights
• Inverse of the net sampling probability
• Interpretation: the sampling weight for a sampled individual is the
number of individuals his/her data “represent”.
Example--sampling weights
• There are 150 employees in a firm
• stratum 1: 50 employees aged 18-29
• stratum 2: 100 employees aged 30-69
• We sample 10 from each stratum
• Sampling probabilities are
• stratum 1: 10/50 = 0.20
• stratum 2: 10/100 = 0.10
Example: sampling weights (continued)
• Sampling weights
• stratum 1: 1/0.20 = 5
• stratum 2: 1/0.10 = 10
• Interpretation:
• Each sampled employee in stratum 1 represents 5 employees
• Each sampled employee in stratum 2 represents 10 employees
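The arithmetic in this example can be checked with a short script. The dictionary layout and variable names are illustrative choices; the numbers are the ones from the example.

```python
# Stratum sizes and sample sizes from the 150-employee example.
strata = {
    "aged 18-29": {"population": 50, "sampled": 10},
    "aged 30-69": {"population": 100, "sampled": 10},
}

weights = {}
for name, stratum in strata.items():
    probability = stratum["sampled"] / stratum["population"]
    weights[name] = 1 / probability  # weight = inverse of the sampling probability

# Each sampled person "represents" weight-many employees,
# so the weighted sample should add back up to the whole firm.
represented = sum(weights[name] * strata[name]["sampled"] for name in strata)

print(weights)
print(represented)
```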
The Sampling Process
1. Defining the population of concern
2. Specifying a sampling frame, a set of items or events
possible to measure
3. Specifying a sampling method for selecting items or
events from the frame
4. Determining the sample size
5. Implementing the sampling plan
6. Sampling and data collecting
7. Reviewing the sampling process
Errors
• Sampling Error: Random variability in sample estimates that arises out
of the randomness of the sample selection process. (Can be quantified.)

• Non-Sampling Error: Estimation error that arises from sources other
than random variation
• non-response
• undercoverage of survey
• poorly-trained interviewers
• non-truthful answers
• non-probability sampling
Errors
• Type 1 errors (False Positive)

• Type 2 errors (False Negative)

• Design Effect: Relative increase in variance of an estimate due to the
sampling design.
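One common way to write the design effect is as a ratio of variances. The symbols below are not from the slides: θ̂ stands for the estimate, "design" for the sampling design actually used, and "SRS" for a simple random sample of the same size.

```latex
\[
\mathrm{deff} =
\frac{\operatorname{Var}_{\text{design}}(\hat{\theta})}
     {\operatorname{Var}_{\text{SRS}}(\hat{\theta})}
\]
```

A value above 1 means the design gives a less precise estimate than simple random sampling would with the same number of respondents.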
Questionnaire Design
• A questionnaire is the vehicle used to pose the questions that the
researcher wants respondents to answer.
• Questionnaire design is a systematic process in which the researcher
contemplates various question formats, considers a number of factors
characterizing the survey at hand, ultimately words the various
questions very carefully, and organizes the questionnaire’s layout.
• Translates the research objectives into specific questions.
Steps in the Questionnaire Development Process
Questionnaire Design
• Question development is the practice of selecting appropriate
response formats and wording questions so that they are
understandable, unambiguous, and unbiased.
• The question should be focused on a single issue or topic.
• The question should be brief.
• The question should be grammatically simple, if possible.
• The question should be crystal clear.
Question development
• Avoid words like (All, Always, Any, Anybody, Ever, Every, Never).
• The question should not “lead” the respondent to a particular answer.
• “Don’t you see any problem with using credit cards for online purchases?”
• The question should not have “loaded” wording or phrasing.
• The question should not be “double-barreled.”
• The question should not use words that overstate the condition; do
not use “dramatics.”
What is wrong with each question?
• How do you feel about Agile Methodologies?
• When some connection or data communication product in your house breaks, do you call the ETC repair service?
• If the Ethio Telecom’s repair service schedule was not convenient for you, would you consider or not consider calling a competing repair organization to fix the problem you have?
• How much do you think you would have to pay to have ETC something that needs to be repaired?
• Shouldn’t efficient designers use Agile Methods?
• Should the framework be used for small applications?
• Do good programmers and responsible designers document their programs?
• Do you believe Agile methods can protect projects from being cancelled?
Individual Question Wording
• “Do’s” for all questions
• Keep it focused on a single topic
• Keep it brief
• Keep it grammatically simple
• Keep it crystal clear
• Original wording, followed by the improved wording:
• “How do you feel about Agile Methodologies?” becomes “Please rate each aspect of Agile Methodology…”
• “If the Ethio Telecom’s repair service schedule was not convenient for you, would you consider or not consider calling a competing repair organization to fix the problem you have?” becomes “If you did not use Ethio Telecom’s support service, would you […] another support service?”
• “When some connection or data communication product in your house breaks, do you call the ETC repair service?” becomes “When you need it, do you call ETC support service?”
• “How much do you think you would have to pay to have ETC something that needs to be repaired?” becomes “How much do you think ETC charges for a repair service call?”
Individual Question Wording
• “Do not’s” for all questions
• Don’t ask leading questions
• Don’t ask loaded questions
• Don’t ask double-barreled questions
• Don’t use overstated questions
• Original wording, followed by the improved wording:
• “Shouldn’t efficient designers use Agile Methods?” becomes “Do you think using Agile Methods […] efficiency?”
• “Should the framework be used for small applications?” becomes “Do you think the framework is useful for programs with less than 20 KLOC?”
• “Do good programmers and responsible designers document their programs?” becomes “Do you think programmers who use documentation are responsible?”
• “Do you believe Agile methods can protect projects from being cancelled?” becomes “Do you think Agile methods are useful?”
Questionnaire organization
• Questionnaire organization is the sequence of statements and
questions that make up the questionnaire.
• Introduction with Cover Letter.
• Screening questions.
• Question flow pertains to the sequencing of questions or blocks of questions.
• Warm-up questions
• Transitions
• Skip questions
• Classification and demographic questions
Coding and Pretesting
• Coding: use of numbers associated with question responses
• Numbers are preferred for two reasons:
• Numbers are easier and faster to keystroke into a
computer file.
• Computer tabulation programs are more efficient when
they process numbers.
• Pretest
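The coding step described above amounts to a lookup from answer text to number. The code book and the five responses below are hypothetical; a real study would define its own codes before data entry.

```python
# Hypothetical code book for a five-point agreement question.
CODE_BOOK = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

# Hypothetical raw answers as they might appear on returned questionnaires.
responses = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"]

# Replace each answer with its numeric code so it can be tabulated.
coded = [CODE_BOOK[answer] for answer in responses]
print(coded)  # [4, 3, 5, 4, 2]
```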
Descriptive Statistics
Descriptive Statistics
• Lecture outline
• Features of Descriptive Statistics.
• Key measures
• Measures of central tendency
• Mean(Types of Mean), Mode, Median
• Measures of Variation
• Range, Percentile, Quartiles, IQR
• Measures of Shape
• Stem-and-Leaf and Box-and-Whisker Plots
Types of Variables
• Qualitative Variables
• Attributes, categories
• Examples: male/female, registered to vote/not, ethnicity, eye
color....
• Quantitative Variables
• Discrete - usually take on integer values but can take on
fractions when variable allows - counts, how many
• Continuous - can take on any value at any point along an
interval - measurements, how much
Example: Types of Variables

• For each of the following, indicate whether the
appropriate variable would be qualitative or
quantitative. If the variable is quantitative, indicate
whether it would be discrete or continuous.
Problem 1.16
• a) Whether you own an Android phone.
  • Qualitative variable
  • two levels: yes/no
  • no measurement
• b) Your status as a full-time or a part-time student
  • Qualitative variable
  • two levels: full/part
  • no measurement
• c) Number of people who attended your school’s graduation last year
  • Quantitative, discrete variable
  • a countable number
  • only whole numbers
© 2002 The Wadsworth Group
Problem 1.16 (continued)
• d) Sam’s travel time from his dorm to the Student Union
  • Quantitative, continuous variable
  • any number
  • time is measured
  • can take on any value greater than zero
Scales of Measurement
• Nominal Scale - Labels represent various levels of a
categorical variable.
• Ordinal Scale - Labels represent an order that indicates either
preference or ranking.
• Interval Scale - Numerical labels indicate order and distance
between elements. There is no absolute zero and multiples of
measures are not meaningful.
• Ratio Scale - Numerical labels indicate order and distance
between elements. There is an absolute zero and multiples of
measures are meaningful.
Descriptive Statistics
• Descriptive statistics summarizes or describes characteristics of a data
set.
• Descriptive statistics consists of two basic categories of measures:
measures of central tendency and measures of variability or spread.
• Measures of central tendency describe the center of a data set.
• Measures of variability or spread describe the dispersion of data
within the set.
• It can also indicate the major relationships between groups.
Key measures
Describing data

          Moment                          Non-mean-based measure
Center    Mean                            Mode, median
Spread    Variance (standard deviation)   Range, interquartile range
Skew      Skewness                        --
Peaked    Kurtosis                        --
Descriptive Statistics
An Illustration:
Which Group is Smarter?
Class A--IQs of 13 Students    Class B--IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109
Each individual may be different. If you try to understand a group by remembering the
qualities of each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics
Which group is smarter now?

Class A--Average IQ Class B--Average IQ

110.54 110.23

They’re roughly the same!

With a summary descriptive statistic, it is much easier to answer our
question.
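The two averages can be reproduced directly from the raw scores. The lists below read the IQ table as two columns per class; the variable names are illustrative.

```python
import statistics

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

# Arithmetic mean: sum of the values divided by how many there are.
mean_a = statistics.mean(class_a)
mean_b = statistics.mean(class_b)

print(f"Class A: {mean_a:.2f}")  # Class A: 110.54
print(f"Class B: {mean_b:.2f}")  # Class B: 110.23
```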
Key distinction
Population vs. Sample Notation

Population: Greek letters (μ, σ, β)
Sample: Roman letters (s, b)
Mean

\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Mean for Frequency Distribution
Weighted mean
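The formulas for these two means did not survive in these notes. The standard textbook definitions are given here as an assumption about what the slides showed: f is the frequency of value (or class mark) x, and w_i is the weight attached to observation x_i.

```latex
\[
\bar{X} = \frac{\sum f x}{\sum f}
\qquad\qquad
\bar{X}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
\]
```

The first is the mean for a frequency distribution and the second is the weighted mean; the first is the special case of the second in which the weights are the frequencies.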
Geometric Mean
Geometric Mean
• E.g., a data company has increased its data capacity over three years
by the following amounts (in megabytes).
• Year 1: +300 000 000 MB
• Year 2: +200 000 000 MB
• Year 3: +100 000 000 MB
Geometric Mean
• Therefore, it is fair to say that the company increased their data
capacity by an average of 200 000 000 MB
Example
• Now, consider another Data firm that has their Database increase
information given in percentages:

• Year 1: +1.5%
• Year 2: +2.0%
• Year 3: +2.5%
Example
• Suppose the company started with 100 000 000 MB of data. What data
capacity would it have at the end of the three years?
• If we use the arithmetic mean:
• There would be an average increase of 2% per year.
• Hence:

• 100 000 000 * 1.020 * 1.020 * 1.020 = 106 120 800 MB of data
Example
• According to the actual figures, the total capacity of the database at the end of the three years should be:

• 100 000 000 * 1.015 * 1.020 * 1.025 = 106 118 250 MB

• The arithmetic mean overstates the result. If we instead employ the geometric mean of the growth factors, (1.015 * 1.020 * 1.025)^(1/3) ≈ 1.01999:

• 100 000 000 * 1.01999 * 1.01999 * 1.01999 ≈ 106 118 250 MB
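The comparison can be reproduced with the standard library. This is a small sketch using the growth figures from the example; the variable names are illustrative.

```python
import statistics

start_mb = 100_000_000
growth_factors = [1.015, 1.020, 1.025]  # +1.5%, +2.0%, +2.5%

# What actually happens: apply each year's growth in turn.
actual = start_mb
for factor in growth_factors:
    actual *= factor

arithmetic = statistics.mean(growth_factors)
geometric = statistics.geometric_mean(growth_factors)  # cube root of the product

via_arithmetic = start_mb * arithmetic ** 3
via_geometric = start_mb * geometric ** 3

print(f"actual:              {actual:,.0f} MB")
print(f"via arithmetic mean: {via_arithmetic:,.0f} MB")
print(f"via geometric mean:  {via_geometric:,.0f} MB")
```

Only the geometric mean, applied three times, lands on the actual end-of-period capacity.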


Harmonic Mean
• The reciprocal of the arithmetic mean of the reciprocals of the given
set of observations.
• Gives equal weight to each data point
Example Harmonic Mean
• Example: An image processing algorithm encodes a file at 30 fps
(frames per second) and decodes the same file, with unchanged size
and on the same machine, at 10 fps. What is the average processing
speed of the algorithm?
• Harmonic mean of 30 and 10:
Arithmetic mean of reciprocals = (1/30 + 1/10) ÷ 2 = (4/30) ÷ 2 = 4/60 = 1/15
Reciprocal of arithmetic mean = 1 ÷ (1/15) = 15 fps
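The same two steps can be written out in code and cross-checked against the standard library. Variable names are illustrative.

```python
import statistics

rates_fps = [30, 10]  # encode and decode rates from the example

# Step 1: arithmetic mean of the reciprocals.
reciprocal_mean = sum(1 / rate for rate in rates_fps) / len(rates_fps)

# Step 2: reciprocal of that mean.
harmonic_manual = 1 / reciprocal_mean

# Library cross-check.
harmonic_library = statistics.harmonic_mean(rates_fps)

print(harmonic_manual, harmonic_library)
```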
Median
• The median is the middle number in a sorted, ascending or
descending, list of numbers.
• The median is sometimes used as opposed to the mean when there
are outliers in the sequence that might skew the average of the
values.
• If n is odd, the median is the middle number.
• If n is even, the median is the average of the 2 middle numbers.
Median

• Class shoe sizes: 3, 5, 5, 6, 4, 3, 2, 1, 5, 6


• Put in order: 1, 2, 3, 3, 4, 5, 5, 5, 6, 6
The class median shoe size is 4.5
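A short check of the shoe-size example, written to follow the odd/even rule stated earlier. The helper name `median_by_hand` is made up for this sketch.

```python
import statistics

def median_by_hand(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

shoe_sizes = [3, 5, 5, 6, 4, 3, 2, 1, 5, 6]

print(sorted(shoe_sizes))             # [1, 2, 3, 3, 4, 5, 5, 5, 6, 6]
print(median_by_hand(shoe_sizes))     # 4.5
print(statistics.median(shoe_sizes))  # 4.5
```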
Median for Grouped Data
• Step 1: Construct the cumulative frequency distribution.
• Step 2: Identify the class that contains the median. The median class is the
first class whose cumulative frequency is at least n/2.
• Step 3: Find the median by using the following formula:
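The formula itself is missing from these notes. A commonly used form is shown below as an assumption about what the slide contained. Here L is the lower boundary of the median class, n the total frequency, F the cumulative frequency before the median class, f the frequency of the median class, and c the class width.

```latex
\[
\text{Median} = L + \left( \frac{\frac{n}{2} - F}{f} \right) c
\]
```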
Example
• Based on the grouped data below, find the median:
Solution
Solution
When to use Median
• Means can be badly affected by outliers (data points with extreme values unlike the
rest).
• Outliers can make the mean a bad measure of central tendency or common experience.
• Example
• The median is the better choice for skewed interval/ratio data and for ordinal data.
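A small made-up data set shows the effect of a single outlier. The values are hypothetical (say, response times in milliseconds) and are not taken from the lecture.

```python
import statistics

typical = [30, 32, 35, 36, 38]   # hypothetical measurements
with_outlier = typical + [400]   # one extreme value added

print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The mean jumps from about 34 to about 95, while the median only moves from 35 to 35.5.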
Mode
• The mode is the most frequent score in our data set.
• The most popular option.
• Suitable for Nominal Variables.
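The mode works on labels as well as numbers, which is why it suits nominal variables. The shoe sizes are the ones from the median example; the phone operating system answers are hypothetical.

```python
import statistics

shoe_sizes = [3, 5, 5, 6, 4, 3, 2, 1, 5, 6]
phone_os = ["Android", "iOS", "Android", "Other", "Android", "iOS"]  # hypothetical survey answers

print(statistics.mode(shoe_sizes))  # 5
print(statistics.mode(phone_os))    # Android

# When several values tie for most frequent, multimode returns all of them.
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```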
Measures of dispersion
• Describes how the data are spread out or scattered on each side
of central value
• Both measures of central tendency & dispersion needed to
describe data
• Exam results
• Class 1 – avg. : 60.0 marks
• highest: 95
• lowest : 25
• Class 2 – avg. : 60.0 marks
• highest: 100
• lowest : 15 marks
Measures Of Dispersion
• Range

• Percentiles

• Variance and Standard Deviation


Range
• The spread, or the distance, between the lowest and highest values of
a variable.

• To get the range for a variable, you subtract its lowest value from its
highest value.

Class A--IQs of 13 Students    Class B--IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109

Class A Range = 140 - 89 = 51
Class B Range = 162 - 80 = 82
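The two ranges follow directly from the highest and lowest score in each class. The helper name `value_range` is illustrative.

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

print(value_range(class_a))  # 51
print(value_range(class_b))  # 82
```

The two classes have nearly the same mean, yet Class B is spread over a much wider interval.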
Interquartile Range
A quartile is the value that marks one of the divisions that breaks a series of values into four equal parts.

The median is a quartile and divides the cases in half.

25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.

The interquartile range is the distance or range between the 25th percentile and the 75th percentile. Below, what is the
interquartile range?

[Figure: a scale from 0 to 1000 marked at 250, 500, and 750, dividing the cases into four groups of 25% each]


IQR
• Q1 is the "middle" value in the first half of the rank-ordered data set.
• Q2 is the median value in the set.
• Q3 is the "middle" value in the second half of the rank-ordered data
set.
• The interquartile range is equal to Q3 minus Q1.
• E.g., consider the data set 1, 3, 4, 5, 5, 6, 7, 11.
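The "middle of each half" rule can be applied to this data set as follows. The function name is made up, and skipping the overall median when the number of values is odd is one common convention assumed here; other software may use a different quartile rule and give slightly different answers.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # first half of the ranked data
    upper = ordered[n - half:]   # second half (middle value skipped when n is odd)
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

data = [1, 3, 4, 5, 5, 6, 7, 11]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3.5 5.0 6.5 3.0
```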
Variance/ Standard Deviation
• These measures tell us how much the actual values differ from the
mean.
• The larger the standard deviation, the more spread out the values.
• The smaller the standard deviation, the less spread out the values.
• The standard deviation is used in conjunction with the mean to
summarise continuous data, not categorical data.
• Standard deviation, like the mean, is normally only appropriate when
the continuous data is not significantly skewed and has no outliers.
Variance/ Standard Deviation
• Standard deviation is only used to measure spread or dispersion
around the mean of a data set.
• Standard deviation is never negative.
• Standard deviation is sensitive to outliers. A single outlier can raise
the standard deviation and in turn, distort the picture of spread.
• For data with approximately the same mean, the greater the spread,
the greater the standard deviation.
• If all values of a data set are the same, the standard deviation is zero
(because each value is equal to the mean).
Variance, Standard Deviation

\[ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} \]
Variance, S.D. of a Sample

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}} \]

Degrees of freedom: n - 1
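The sample formulas can be tried on the two classes of IQ scores used earlier. The helper name `sample_variance` is illustrative, and the standard library is used only as a cross-check.

```python
import math
import statistics

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)

var_a = sample_variance(class_a)
var_b = sample_variance(class_b)
sd_a = math.sqrt(var_a)
sd_b = math.sqrt(var_b)

print(f"Class A: variance {var_a:.1f}, standard deviation {sd_a:.1f}")
print(f"Class B: variance {var_b:.1f}, standard deviation {sd_b:.1f}")

# Dividing by n instead of n - 1 gives the (smaller) population variance.
print(statistics.pvariance(class_a) < statistics.variance(class_a))  # True
```

Both classes have almost the same mean, but Class B has the larger standard deviation.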
Example (Grouped data)
Find the variance and standard deviation of the sample data
below:
Weight              Frequency,   Class      fx      Cumulative      Class         fx^2
(class interval)    f            mark, x            frequency, F    boundary
60-62               5            61         305     5               59.5-62.5
63-65               18           64         1152    23              62.5-65.5
66-68               42           67         2814    65              65.5-68.5
69-71               27           70         1890    92              68.5-71.5
72-74               8            73         584     100             71.5-74.5
Total               100                     6745

\[ s^2 = \frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1} = ? \qquad s = \sqrt{\frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{\sum f}}{\sum f - 1}} = ? \]

Answer: s^2 = 8.61; s = 2.93
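The stated answer can be verified by filling in the fx^2 column in code. The class marks and frequencies are taken from the weight table; the variable names are illustrative.

```python
import math

# (class mark x, frequency f) pairs from the weight table.
classes = [(61, 5), (64, 18), (67, 42), (70, 27), (73, 8)]

sum_f = sum(f for _, f in classes)            # total frequency
sum_fx = sum(f * x for x, f in classes)       # matches the fx column total
sum_fx2 = sum(f * x * x for x, f in classes)  # total of the fx^2 column

variance = (sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1)
std_dev = math.sqrt(variance)

print(sum_f, sum_fx, sum_fx2)                 # 100 6745 455803
print(round(variance, 2), round(std_dev, 2))  # 8.61 2.93
```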
Visualizing Data
• Stem and Leaf Plot
• A Stem and Leaf Plot is a special table where each data value is split
into a "stem" (the first digit or digits) and a "leaf" (usually the last digit).
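A plot of this kind can be built by splitting each value into its leading digits and its last digit. This sketch assumes non-negative whole numbers and reuses the Class A IQ scores from earlier in the lecture; the function name is made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

iqs = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]  # Class A
plot = stem_and_leaf(iqs)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>2} | {leaves}")
```

Each printed row is one stem, so the row for stem 9 lists the leaves 3, 7, and 8 for the scores 93, 97, and 98.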
box and whisker graph
• A box and whisker plot is defined as a graphical method of displaying variation in a set
of data.
• The procedure to develop a box and whisker plot comes from the five statistics below.
• Minimum value: The smallest value in the data set
• First quartile (Q1): The value below which the lower 25% of the data are contained
• Median value: The middle number in a range of numbers
• Third quartile: The value above which the upper 25% of the data are contained
• Maximum value: The largest value in the data set
Example
• Draw a box-and-whisker plot for the following data set:
• 4.3,  5.1,  3.9,  4.5,  4.4,  4.9,  5.0,  4.7,  4.1,  4.6,  4.4,  4.3,  4.8,  4.4,
 4.2,  4.5,  4.4
1. Order the data:
3.9,  4.1,  4.2,  4.3,  4.3,  4.4,  4.4,  4.4,  4.4,  4.5,  4.5,  4.6,  4.7,  4.8,  4.9,
 5.0,  5.1
2. Find the median: 4.4
3. Find Q1 and Q3: Q1 = (4.3 + 4.3)/2 = 4.3; Q3 = (4.7 + 4.8)/2 = 4.75
Example
• Decide on the scale.
• Connect Q1, the median, and Q3 to make a box.
• Draw whiskers from the box to the minimum and maximum values.
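The five values needed for the plot can be computed before drawing anything. The sketch below follows the worked example by leaving the median out of both halves when the number of values is odd; that convention, and the function name, are assumptions of this sketch rather than a universal rule.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])      # middle of the lower half
    q3 = statistics.median(ordered[n - half:])  # middle of the upper half
    return ordered[0], q1, statistics.median(ordered), q3, ordered[-1]

data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1,
        4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4]

minimum, q1, median, q3, maximum = five_number_summary(data)
print(minimum, q1, median, q3, maximum)
```

The box then runs from Q1 to Q3 with a line at the median, and the whiskers reach out to the minimum and maximum.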
Example
Exercise
• What percent of the method has fewer than 7 errors?
