LOUIS GRENIER
LAST UPDATED
10 SEP 2021
Open-ended questions are great for getting authentic feedback because they
give people a chance to describe what they’re experiencing in their own voice.
Analyzing such survey questions yourself is an excellent opportunity to
empathize with your audience, gather essential insights, and make the right
decisions.
How do you efficiently analyze more than 100 replies? Or even 1,000?
To help you learn this technique, we created a data sample that you can
download and use to follow along.
Now let’s begin…
Table of contents
Step 1: Get your data into the template
Step 2: Identify response categories
Step 3: Record the individual responses
Step 4: Organize your categories
Step 5: Represent your data visually
2) Copy the data from your .CSV or .XLS file and paste it into the sheet ‘CSV
Export’ of the template.
🏆 Pro tip: use 'Paste special' to paste 'Values Only' in the Hotjar analysis
template, so no formulas or formatting are copied over.
3) Copy the column from the ‘CSV Export’ sheet containing the open-
ended question you want to analyze first and paste it into the ‘Question 1’
sheet, in the cell marked with < Paste answers to first open-ended question
here >.
4) Choose wrap text for the entire column, so the data fits the column
width and is easier for you to read later on.
Step 2: Identify response categories
A response category is a set of replies that can be grouped because they are
part of the same theme, even if they’re worded differently.
In the sample dataset we use for this tutorial, we asked Hotjar customers to
explain how their employer measures their performance (e.g., revenue,
conversions, traffic). In theory, you could go through every answer to identify
your response categories one-by-one, but that wouldn’t be very efficient.
Instead, we’re going to use a series of techniques that help you identify the
broad categories.
A) Use a text analyzer: text analyzers take your data and analyze it for the
most commonly used words in your text, which helps you identify broad
categories of responses.
If you do this with the sample data we’ve provided above, you’ll find that
‘sales,’ ‘conversion,’ and ‘traffic’ are some of the most commonly used words
in the data set and could be used as response categories.
As such, they represent some of the most popular replies to the question we
asked. They don’t represent all the answers, of course, but they’re a good
place to start when building the list of response categories.
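As a rough illustration of what a text analyzer does, the sketch below counts the most common words across a list of responses. The function name, the stop-word list, and the sample responses are all made up for this example; they are not the Hotjar sample data or any particular tool's method.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analyzer would use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "by", "is", "are", "we", "then", "it", "not"}

def top_words(responses, n=5):
    """Return the n most common non-stop words across all responses."""
    counts = Counter()
    for text in responses:
        # Lowercase the text and keep only runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up responses for illustration only.
responses = [
    "Revenue, then conversion rate, then traffic.",
    "Sales and traffic",
    "Traffic to the site",
    "Sales",
    "Huh?",
]

print(top_words(responses, 2))
```

The frequent words are only candidates: synonyms still need to be merged by hand, and filler words discarded.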
Add each category to the top of a separate column (replacing the
text that reads, 'Response Category 01,' 'Response Category 02,' etc.):
Note: some of the popular words in our text analyzer mean the same thing
(e.g., 'sales' and 'revenue'), so you’ll want to create a single category for those
responses called 'Sales/Revenue.' Other popular words will NOT become
categories because, as stand-alone words, they tell us nothing useful (e.g.,
'our,' 'rate').
Scan the alphabetically sorted responses for other categories, such as 'It’s not
measured,' 'Traffic,' 'Conversions,' etc. Be on the lookout for synonyms, but
don’t worry if you create a few redundant categories for now. You will
combine the categories that mean the same thing at the end.
For example, if you sorted our sample data alphabetically, you’ll find that the
response in Row 6 reads, 'Huh?' If you added 'Did not understand the
question' to Column E (as we did in the screenshot), then you’ll place a '1' in
E36.
Note: In our example, many respondents indicate that their performance was
measured by multiple factors (e.g., lead gen + sales + customer satisfaction).
Be sure to place a '1' in each category. In other words, the row for that single
answer, 'Revenue, then conversion rate, then traffic.' will record three different
positive responses.
When you input your first '1,' the cell in Row 3 (below the category) will
change to indicate the number of positive responses in that category. Row 4
will change from a '#DIV/0' error to the percentage of responses that fall into
each category.
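The tallying described above can be sketched outside the spreadsheet as well. This is a minimal example that assumes each coded response is a dictionary of category names mapped to 1, and that the percentage is the share of all responses falling into a category; the function and variable names are hypothetical, and the template's own formulas may differ.

```python
def category_totals(rows, categories):
    """Return {category: (count, share of responses)} for coded responses.

    rows is a list of dicts; a dict holds category -> 1 for every category
    that the response was assigned to.
    """
    total = len(rows)
    summary = {}
    for cat in categories:
        count = sum(row.get(cat, 0) for row in rows)
        share = count / total if total else 0.0  # avoids a division-by-zero error
        summary[cat] = (count, share)
    return summary

# Made-up coded responses; the first one counts in three categories at once.
coded = [
    {"Sales/Revenue": 1, "Conversions": 1, "Traffic": 1},
    {"Sales/Revenue": 1},
    {"Traffic": 1},
    {"Did not understand the question": 1},
]
categories = ["Sales/Revenue", "Conversions", "Traffic", "Did not understand the question"]

for cat, (count, share) in category_totals(coded, categories).items():
    print(f"{cat}: {count} ({share:.0%})")
```

Because one answer can fall into several categories, the percentages can add up to more than 100%.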
Drag these columns next to each other, and apply a color (any color) to the
group of columns you plan to merge—this marks them as a group so you can
return to them in a bit when it’s time to combine them. Repeat this step for
each set of categories you plan to join.
Add a new column to the left-hand side of each group. For example, with
'Lead Gen' and 'Form Submissions,' you’ll create a new category called 'Lead
Gen / Form Submissions,' add up the Row 3 totals for the two old categories,
and enter the new total under the new group. Copy and paste the percentage
formula from any Row 4 cell, then delete the old categories.
⚠️ Important: when merging multiple categories, make sure to re-add the '1s'
under the newly merged category, or you run the risk of losing your data.
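One way to picture the merge is shown below. Note that this sketch is a variation on the spreadsheet steps, not a copy of them: instead of adding the two old totals, it marks a response with a single '1' in the merged category if it had a '1' in any of the old ones, so a response that mentioned both is counted once. The function name and data are hypothetical.

```python
def merge_categories(rows, old_names, new_name):
    """Return new rows in which the old categories are replaced by one merged category."""
    merged = []
    for row in rows:
        # Keep every category that is not being merged.
        new_row = {k: v for k, v in row.items() if k not in old_names}
        # Re-add a single '1' if the response was in any of the old categories.
        if any(row.get(name, 0) for name in old_names):
            new_row[new_name] = 1
        merged.append(new_row)
    return merged

# Made-up coded responses.
coded = [
    {"Lead Gen": 1, "Form Submissions": 1},
    {"Lead Gen": 1, "Traffic": 1},
    {"Traffic": 1},
]

merged = merge_categories(coded, ["Lead Gen", "Form Submissions"], "Lead Gen / Form Submissions")
print(merged)
```

If you prefer the simple sum of the old totals, as in the steps above, be aware that responses coded in both old categories are then counted twice.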
Paste them into the ‘Graph Question 1’ sheet, in cell A3, using the 'Paste special' feature
to paste only the values (so the formulas don’t copy over).
Select and copy the table you just pasted, and choose 'Paste special' again,
this time using 'Paste transposed' in cell A9 to invert the rows and columns
(this makes your data more chart-friendly).
And there you have it—a visual representation of your data! Feel free to
experiment with different formats if you’re putting the chart into a formal
presentation.
https://www.calculator.net/sample-size-calculator.
Sample Size Calculator
Find Out The Sample Size
This calculator computes the minimum number of necessary samples to meet the
desired statistical constraints.
Confidence Level: 95%
Margin of Error: 5%
Population Proportion: 50% (use 50% if not sure)
Result
Margin of error: 9.60%
This means that, in this case, there is a 95% chance that the real value is within ±9.60% of
the measured/surveyed value.
Confidence Level: 95%
Sample Size: 100
Population Proportion: 60%
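The 9.60% result can be checked by hand. The sketch below assumes the standard margin-of-error formula for a proportion with an unlimited population, z × sqrt(p(1 − p)/n), and uses z = 1.96 for the 95% confidence level; the function name is made up for illustration.

```python
import math

def margin_of_error(z, p, n):
    """Margin of error for a proportion p from a sample of size n (unlimited population)."""
    return z * math.sqrt(p * (1 - p) / n)

# 95% confidence (z = 1.96), sample size 100, population proportion 60%.
moe = margin_of_error(1.96, 0.60, 100)
print(f"{moe:.2%}")  # prints 9.60%
```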
Confidence Level
The confidence level is a measure of certainty regarding how accurately a sample
reflects the population being studied within a chosen confidence interval. The most
commonly used confidence levels are 90%, 95%, and 99%, which each have their own
corresponding z-scores (which can be found using an equation or widely available
tables like the one provided below) based on the chosen confidence level. Note that
using z-scores assumes that the sampling distribution is normally distributed, as
described above in "Statistics of a Random Sample." Given that an experiment or
survey is repeated many times, the confidence level essentially indicates the
percentage of the time that the resulting interval found from repeated tests will contain
the true result.
Confidence Level z-score (±)
0.70 1.04
0.75 1.15
0.80 1.28
0.85 1.44
0.92 1.75
0.95 1.96
0.96 2.05
0.98 2.33
0.99 2.58
0.999 3.29
0.9999 3.89
0.99999 4.42
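The z-scores in the table can also be computed rather than looked up. A minimal sketch using Python's standard library follows; it assumes a two-sided interval, so the z-score is the standard normal quantile at (1 + confidence level) / 2. The function name is hypothetical.

```python
from statistics import NormalDist

def z_score(confidence):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

for level in (0.80, 0.95, 0.99):
    print(level, round(z_score(level), 2))
```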
Confidence Interval
In statistics, a confidence interval is an estimated range of likely values for a population
parameter, for example, 40 ± 2 or 40 ± 5%. Taking the commonly used 95% confidence
level as an example, if the same population were sampled multiple times, and interval
estimates made on each occasion, in approximately 95% of the cases, the true
population parameter would be contained within the interval. Note that the 95%
probability refers to the reliability of the estimation procedure and not to a specific
interval. Once an interval is calculated, it either contains or does not contain the
population parameter of interest. Some factors that affect the width of a confidence
interval include: size of the sample, confidence level, and variability within the sample.
There are different equations that can be used to calculate confidence intervals
depending on factors such as whether the standard deviation is known or smaller
samples (n<30) are involved, among others. The calculator provided on this page
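calculates the confidence interval for a proportion and uses the following equations:
Unlimited population: $CI = \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Finite population: $CI = \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n'} \cdot \frac{N-n'}{N-1}}$
where
z is the z score
p̂ is the population proportion
n and n' are the sample sizes
N is the population size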
Within statistics, a population is a set of events or elements that have some relevance
regarding a given question or experiment. It can refer to an existing group of objects,
systems, or even a hypothetical group of objects. Most commonly, however, population
is used to refer to a group of people, whether they are the number of employees in a
company, number of people within a certain age group of some geographic area, or
number of students in a university's library at any given time.
It is important to note that the equation needs to be adjusted when considering a finite
population, as shown above. The (N-n)/(N-1) term in the finite population equation is
referred to as the finite population correction factor, and is necessary because it cannot
be assumed that all individuals in a sample are independent. For example, if the study
population involves 10 people in a room with ages ranging from 1 to 100, and one of
those chosen has an age of 100, the next person chosen is more likely to have a lower
age. The finite population correction factor accounts for factors such as these. Refer
below for an example of calculating a confidence interval with an unlimited population.
EX: Given that 120 people work at Company Q, 85 of which drink coffee daily, find the
99% confidence interval of the true proportion of people who drink coffee at Company Q
on a daily basis.
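The Company Q example can be worked through as a short sketch. It assumes the unlimited-population formula for a proportion and z = 2.58 for the 99% confidence level (from the table in the confidence level section); the function name is made up for illustration.

```python
import math

def proportion_ci(successes, n, z):
    """Return (p_hat, margin) for a proportion, assuming an unlimited population."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

# 85 of 120 people drink coffee daily; 99% confidence level.
p_hat, margin = proportion_ci(85, 120, 2.58)
print(f"{p_hat:.3f} ± {margin:.3f}")
```

Under these assumptions, the interval is roughly 0.708 ± 0.107, or about 60% to 82%.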
Sample Size Calculation
Sample size is a statistical concept that involves determining the number of
observations or replicates (the repetition of an experimental condition used to estimate
the variability of a phenomenon) that should be included in a statistical sample. It is an
important aspect of any empirical study requiring that inferences be made about a
population based on a sample. Essentially, sample sizes are used to represent parts of
a population chosen for any given survey or experiment. To carry out this calculation,
set the margin of error, ε, or the maximum distance desired for the sample estimate to
deviate from the true value. To do this, use the confidence interval equation above, but
set the term to the right of the ± sign equal to the margin of error, and solve the
resulting equation for the sample size, n. The equations for calculating sample size are
shown below.
Unlimited population: $n = \frac{z^2 \hat{p}(1-\hat{p})}{\varepsilon^2}$
Finite population: $n' = \frac{n}{1 + \frac{z^2 \hat{p}(1-\hat{p})}{\varepsilon^2 N}}$
where
z is the z score
ε is the margin of error
N is the population size
p̂ is the population proportion
EX: Determine the sample size necessary to estimate the proportion of people shopping
at a supermarket in the U.S. that identify as vegan with 95% confidence, and a margin
of error of 5%. Assume a population proportion of 0.5, and unlimited population size.
Remember that z for a 95% confidence level is 1.96. Refer to the table provided in the
confidence level section for z scores of a range of confidence levels.
$n = \frac{1.96^2 \times 0.5 \times (1 - 0.5)}{0.05^2} = 384.16$
Thus, for the case above, a sample size of at least 385 people would be necessary. In
the above example, some studies estimate that approximately 6% of the U.S.
population identify as vegan, so rather than assuming 0.5 for p̂, 0.06 would be used. If it
was known that 40 out of 500 people that entered a particular supermarket on a given
day were vegan, p̂ would then be 0.08.
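The same calculation can be sketched in code. The function below assumes the standard sample-size formula for a proportion, n = z² p̂(1 − p̂)/ε², with an optional finite-population adjustment, and rounds up to the next whole person; the function and parameter names are hypothetical.

```python
import math

def sample_size(z, p, eps, population=None):
    """Minimum sample size for margin of error eps; population=None means unlimited."""
    n = z**2 * p * (1 - p) / eps**2
    if population is not None:
        # Finite-population adjustment: a smaller population needs fewer samples.
        n = n / (1 + z**2 * p * (1 - p) / (eps**2 * population))
    return math.ceil(n)

print(sample_size(1.96, 0.5, 0.05))   # worst-case proportion of 0.5
print(sample_size(1.96, 0.06, 0.05))  # using the 6% estimate instead
```

With p̂ = 0.5 this gives 385; with the 6% estimate, the required sample is much smaller because p̂(1 − p̂) shrinks as p̂ moves away from 0.5.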
Before using the sample size calculator, there are two terms that you need to know:
confidence interval and confidence level.
The confidence interval (also called margin of error) is the plus-or-minus figure usually reported in
newspaper or television opinion poll results. For example, if you use a confidence interval of 4 and 47%
of your sample picks an answer, you can be "sure" that, if you had asked the question of the entire
relevant population, between 43% (47-4) and 51% (47+4) would have picked that answer.
The confidence level tells you how sure you can be. It is expressed as a percentage and represents how
often the true percentage of the population who would pick an answer lies within the confidence interval.
The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be
99% certain. Most researchers use the 95% confidence level.
When you put the confidence level and the confidence interval together, you can say that you are 95%
sure that the true percentage of the population is between 43% and 51%. The wider the confidence
interval you are willing to accept, the more certain you can be that the whole population answers would be
within that range.
For example, if you asked a sample of 1000 people in a city which brand of cola they preferred, and 60%
said Brand A, you can be very certain that between 40 and 80% of all the people in the city actually do
prefer that brand, but you cannot be so sure that between 59 and 61% of the people in the city prefer the
brand.
Sample size
Percentage
Population size
Sample Size
The larger your sample size, the more sure you can be that their answers truly reflect the population. This
indicates that for a given confidence level, the larger your sample size, the smaller your confidence
interval. However, the relationship is not linear (i.e., doubling the sample size does not halve the
confidence interval).
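To see why the relationship is not linear, note that the interval shrinks with the square root of the sample size. The sketch below assumes the standard formula for a proportion at the worst-case 50% and z = 1.96; the sample sizes are arbitrary examples.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Confidence interval half-width for a proportion p and sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

base = margin_of_error(500)
for n in (500, 1000, 2000):
    m = margin_of_error(n)
    print(f"n={n}: ±{m:.2%} ({m / base:.2f}x the interval at n=500)")
```

Doubling the sample from 500 to 1,000 only shrinks the interval to about 71% of its original size; it takes four times the sample (2,000) to halve it.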
Percentage
Your accuracy also depends on the percentage of your sample that picks a particular answer. If 99% of
your sample said "Yes" and 1% said "No," the chances of error are remote, irrespective of sample size.
However, if the percentages are 51% and 49% the chances of error are much greater. It is easier to be
sure of extreme answers than of middle-of-the-road ones.
When determining the sample size needed for a given level of accuracy you must use the worst case
percentage (50%). You should also use this percentage if you want to determine a general level of
accuracy for a sample you already have. To determine the confidence interval for a specific answer your
sample has given, you can use the percentage picking that answer and get a smaller interval.
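The effect of the percentage can be illustrated with the same standard formula: the term p(1 − p) is largest at 50%, which is why 50% is the worst case. The sample size of 1,000 and z = 1.96 below are arbitrary choices for illustration.

```python
import math

def margin_of_error(p, n=1000, z=1.96):
    """Confidence interval half-width for a proportion p and sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

for p in (0.50, 0.51, 0.90, 0.99):
    print(f"{p:.0%} picked the answer: ±{margin_of_error(p):.2%}")
```

The interval at 99% is a small fraction of the interval at 50%, while 51% is practically the same as the worst case.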
Population Size
How many people are there in the group your sample represents? This may be the number of people in a
city you are studying, the number of people who buy new cars, etc. Often you may not know the exact
population size. This is not a problem. The mathematics of probability proves that the size of the population
is irrelevant unless the size of the sample exceeds a few percent of the total population you are
examining. This means that a sample of 500 people is equally useful in examining the opinions of a state
of 15,000,000 as it would be for a city of 100,000. For this reason, The Survey System ignores the population
size when it is "large" or unknown. Population size is only likely to be a factor when you work with a
relatively small and known group of people (e.g., the members of an association).
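A quick numerical check of this claim is sketched below. It assumes the finite population correction factor (N − n)/(N − 1) described earlier in this document, applied to the standard formula at 50% and z = 1.96; this is an illustration, not necessarily the exact calculation The Survey System performs.

```python
import math

def margin_with_fpc(n, population, p=0.5, z=1.96):
    """Confidence interval half-width with the finite population correction applied."""
    fpc = (population - n) / (population - 1)
    return z * math.sqrt(p * (1 - p) / n * fpc)

state = margin_with_fpc(500, 15_000_000)
city = margin_with_fpc(500, 100_000)
small_group = margin_with_fpc(500, 1_000)

print(f"state of 15,000,000: ±{state:.2%}")
print(f"city of 100,000:     ±{city:.2%}")
print(f"group of 1,000:      ±{small_group:.2%}")
```

The state and the city give almost identical intervals; only when the sample is a large share of the population (500 out of 1,000) does the interval shrink noticeably.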
The confidence interval calculations assume you have a genuine random sample of the relevant
population. If your sample is not truly random, you cannot rely on the intervals. Non-random samples
usually result from some flaw or limitation in the sampling procedure. An example of such a flaw is to only
call people during the day and miss almost everyone who works. For most purposes, the non-working
population cannot be assumed to accurately represent the entire (working and non-working) population.
An example of a limitation is using an opt-in online poll, such as one promoted on a website. There is no
way to be sure an opt-in poll truly represents the population of interest.