
How to analyze open-ended questions in 5 steps [template included]


USER RESEARCH

LOUIS GRENIER

LAST UPDATED
10 SEP 2021
Open-ended questions are great for getting authentic feedback because they
give people a chance to describe what they’re experiencing in their own voice.
Analyzing such survey questions yourself is an excellent opportunity to
empathize with your audience, gather essential insights, and make the right
decisions.

But you may be wondering...

How do you efficiently analyze more than 100 replies? Or even 1,000?

Here’s a system we use at Hotjar to categorize and visually represent large
volumes of qualitative data, and it’s easier than you might think! You’ll have
to work with the technique a bit before you become comfortable with it, but
once you get it, you’ll be sorting through mountains of qualitative data in no
time.

What you’ll need:

• Working knowledge of spreadsheets (Google Sheets or Excel)
• A quiet space with some uninterrupted focus time
• Hotjar’s open-ended question analysis template

To help you learn this technique, we created a data sample that you can
download and use to follow along.
Now let’s begin…

Table of contents
• Step 1: Get your data into the template
• Step 2: Identify response categories
• Step 3: Record the individual responses
• Step 4: Organize your categories
• Step 5: Represent your data visually

Step 1: get your data into the template


1) Export the data from your survey or poll into a .CSV or .XLS file.

2) Copy the data from your .CSV or .XLS file and paste it into the sheet ‘CSV
Export’ of the template.
🏆  Pro tip: use 'Paste special' to paste 'Values Only' in the Hotjar analysis
template, so no formulas or formatting are copied over.
3) Copy the column from the ‘CSV Export’ sheet containing the open-ended
question you want to analyze first and paste it into the ‘Question 1’ sheet, in
the cell marked with < Paste answers to first open-ended question here >.
4) Choose wrap text for the entire column, so the data fits the column
width and is easier for you to read later on.
Step 2: identify response categories
A response category is a set of replies that can be grouped because they are
part of the same theme, even if they’re worded differently.
In the sample dataset we use for this tutorial, we asked Hotjar customers to
explain how their employer measures their performance (e.g., revenue,
conversions, traffic). In theory, you could go through every answer to identify
your response categories one-by-one, but that wouldn’t be very efficient.
Instead, we’re going to use a series of techniques that help you identify the
broad categories.

A) Use a text analyzer: text analyzers take your data and analyze it for the
most commonly used words in your text, which helps you identify broad
categories of responses.

🏆  Pro tip: Textalyser is a simple, free resource that does this well.


Copy and paste your data into Textalyser and click ‘analyze the text.’

If you do this with the sample data we’ve provided above, you’ll find that
‘sales,’ ‘conversion,’ and ‘traffic’ are some of the most commonly used words
in the data set:


As such, they represent some of the most popular replies to the question we
asked. They don’t represent all the answers, of course, but they’re a good
place to start when building the list of response categories.

Add each category to the top of a separate column (replacing the
text that reads, 'Response Category 01,' 'Response Category 02,' etc.).
Note: some of the popular words in our text analyzer mean the same thing
(e.g., 'sales' and 'revenue'), so you’ll want to create a single category for those
responses called 'Sales/Revenue.' Other popular words will NOT become
categories because, as stand-alone words, they tell us nothing useful (e.g.,
'our,' 'rate').
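
If you would rather script the word count than paste your data into an online tool, the idea can be sketched in a few lines of Python. This is only an illustration of counting common words, not a description of how Textalyser works; the replies, the stop-word list, and the function name top_words are made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop words: frequent words that tell us nothing on their own.
STOP_WORDS = {"our", "rate", "the", "and", "then", "by", "is", "of", "a", "i", "my"}

def top_words(responses, n=5):
    """Return the n most common words across all responses, ignoring stop words."""
    counts = Counter()
    for text in responses:
        # Lowercase the reply and keep runs of letters (and apostrophes) as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Made-up replies in the spirit of the sample dataset.
responses = [
    "Revenue, then conversion rate, then traffic.",
    "Sales and traffic",
    "Sales",
    "Huh?",
]
print(top_words(responses, n=2))
```

The most frequent words are only candidates. You still decide which ones become categories and which synonyms to merge.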

B) Sort your responses alphabetically: when you sort alphabetically, you’ll
notice that specific patterns emerge, and you can create more categories
based on the trends you spot.
In our sample data, every sentence beginning with the word 'Revenue' gets
grouped when you sort alphabetically. Of course, we already have a category
for 'Sales/Revenue,' so there’s no need to add that category in this case—but
grouping the data alphabetically will allow groups to stand out.
Alphabetical sorting will also draw your attention to certain stand-alone
responses. For example, someone replied 'Huh?' and another person told us
they didn’t understand the question. This information allows us to add a new
category called 'Didn’t understand the question.'

Scan the alphabetically sorted responses for other categories, such as 'It’s not
measured,' 'Traffic,' 'Conversions,' etc. Be on the lookout for synonyms, but
don’t worry if you create a few redundant categories for now. You will
combine the categories that mean the same thing at the end.

Step 3: record the individual responses


1) Place a '1' in each cell where a response (the row) matches a category (the
column) to identify a positive response in each category. Add categories as
you go.

For example, if you sorted our sample data alphabetically, you’ll find that the
response in Row 36 reads, 'Huh?' If you added 'Did not understand the
question' to Column E (as we did in the screenshot), then you’ll place a '1' in
E36.

Note: In our example, many respondents indicate that their performance was
measured by multiple factors (e.g., lead gen + sales + customer satisfaction).
Be sure to place a '1' in each category. In other words, the row for that single
answer, 'Revenue, then conversion rate, then traffic.' will record three different
positive responses.
When you input your first '1,' the cell in Row 3 (below the category) will
change to indicate the number of positive responses in that category. Row 4
will change from a '#DIV/0' error to the percentage of responses that fall into
each category.
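
To make the Row 3 and Row 4 arithmetic concrete, here is a small Python sketch. It assumes the percentage is the category count divided by the number of responses; the template's actual formulas may differ, and the function name and the sample matrix are hypothetical.

```python
def category_totals(matrix, categories):
    """Count the 1s per category and express them as a share of all responses.

    matrix: one row per response, one 0/1 entry per category.
    """
    n_responses = len(matrix)
    totals = {}
    for col, name in enumerate(categories):
        count = sum(row[col] for row in matrix)
        # Guard against an empty sheet (the spreadsheet shows a #DIV/0 error instead).
        pct = 100 * count / n_responses if n_responses else 0.0
        totals[name] = (count, round(pct, 1))
    return totals

categories = ["Sales/Revenue", "Conversions", "Traffic"]
matrix = [
    [1, 1, 1],  # "Revenue, then conversion rate, then traffic."
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],  # a reply that matched none of these categories
]
print(category_totals(matrix, categories))
```

Because one reply can match several categories, the percentages can add up to more than 100%.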

2) Use the 'Find' feature to search for words related to each
category: begin with the first category (in our example, that’s 'sales') and
search the data column for any response that mentions 'sales.' Read the entire
response to ensure it fits the category you searched for, then place a '1' in the
appropriate column for that response.
3) Fill in the gaps: read each row that hasn’t been categorized and place a '1'
under the appropriate category, creating new categories as necessary. As you
create new categories, search your data for those terms to quickly find similar
responses.

⚠️ Important: when adding a new category as you go through the responses,
make sure to retroactively check previous answers that might fit in this new
category.
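
The 'Find' pass and the gap-filling pass can be approximated in Python as follows. This is a rough sketch: the keyword lists and function names are assumptions made for illustration, and plain substring matching only suggests categories. You still need to read each response before recording a '1'.

```python
# Hypothetical search terms for each category.
KEYWORDS = {
    "Sales/Revenue": ["sales", "revenue"],
    "Traffic": ["traffic"],
}

def suggest_categories(response, keywords=KEYWORDS):
    """Return the categories whose search terms appear in the response."""
    text = response.lower()
    return [cat for cat, words in keywords.items() if any(w in text for w in words)]

def uncategorized(responses, keywords=KEYWORDS):
    """Return the responses no search term matched: the gaps to read by hand."""
    return [r for r in responses if not suggest_categories(r, keywords)]

responses = ["Revenue, then conversion rate, then traffic.", "Sales", "Huh?"]
for r in responses:
    print(r, "->", suggest_categories(r))
print("Still to review:", uncategorized(responses))
```

Adding a new entry to KEYWORDS and rerunning the script is the equivalent of retroactively checking earlier answers against a new category.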

Step 4: organize your categories


1) Group your data: you will almost certainly find categories that should be
grouped but ended up in different categories because respondents used
different words to describe the same concept. In our sample data, we found
the terms 'Lead Gen' and 'Form Submissions,' and these belong in the same
category.

Drag these columns next to each other, and apply a color (any color) to the
group of columns you plan to merge—this marks them as a group so you can
return to them in a bit when it’s time to combine them. Repeat this step for
each set of categories you plan to join.

Add a new column to the left-hand side of each group. For example, with
'Lead Gen' and 'Form Submissions,' you’ll create a new category called 'Lead
Gen / Form Submissions,' add up the Row 3 totals for the two old categories,
and enter the new total under the new group. Copy and paste the percentage
formula from any Row 4 cell, then delete the old categories.
⚠️Important: when merging multiple categories, make sure to re-add the '1s'
under the newly merged category, or you run the risk of losing your data.

Repeat this step for every group you plan to merge.

2) Arrange your categories from large to small: arrange your categories in
descending order from left to right. For those that only contribute to a small
percentage of the total (2% or less), use the grouping method above to merge
them into one category called 'Others,' which you’ll leave on the far right.
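
One thing to watch when merging: if a single response was marked under both old categories, adding the two totals counts that response twice. The sketch below shows one way to merge at the level of individual responses instead, so each response counts once. The row layout, the sample rows, and the function name are hypothetical.

```python
def merge_categories(rows, old_names, new_name):
    """Merge several 0/1 category columns into one.

    rows: list of dicts mapping category name -> 0 or 1 for each response.
    """
    merged = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in old_names}
        # A response gets a single '1' in the merged category,
        # even if it matched more than one of the old categories.
        new_row[new_name] = max(row.get(name, 0) for name in old_names)
        merged.append(new_row)
    return merged

rows = [
    {"Lead Gen": 1, "Form Submissions": 1, "Sales/Revenue": 0},
    {"Lead Gen": 1, "Form Submissions": 0, "Sales/Revenue": 1},
    {"Lead Gen": 0, "Form Submissions": 0, "Sales/Revenue": 1},
]
merged = merge_categories(rows, ["Lead Gen", "Form Submissions"], "Lead Gen / Form Submissions")
print(sum(r["Lead Gen / Form Submissions"] for r in merged))
```

In this made-up data the old totals are 2 and 1, but only two respondents fall into the merged category.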
Step 5: represent your data visually
1) Prep your data to create a bar chart. First, select and copy the top three
rows of your spreadsheet (those that make up the 'Response Categories,'
'Total respondents who answered X,' and '% respondents who answered X').

Paste them into the ‘Graph Question 1’ sheet, in cell A3, using the 'Paste
special' feature to paste only the values (so the formulas don’t copy over).

Select and copy the table you just pasted, and choose 'Paste special' again,
this time using 'Paste transposed' in cell A9 to invert the rows and columns
(this makes your data more chart-friendly).

You should now see a table containing your categories, the volume of
responses, and the percentage for each category.

2) Create your chart: insert your chart, selecting the percentage column as
your 'Series' and the categories as your 'X-axis.' Resize the chart however you
see fit.

And there you have it—a visual representation of your data! Feel free to
experiment with different formats if you’re putting the chart into a formal
presentation.

Analyzing open-ended questions efficiently and empathizing with your
audience take some practice, but the more you do it, the easier it becomes.
Your mind will begin to recognize patterns the more you practice this
technique, so don’t be afraid to dive into it.

Sample Size Calculator
https://www.calculator.net/sample-size-calculator.
Find Out The Sample Size
This calculator computes the minimum number of necessary samples to meet the
desired statistical constraints.

Confidence Level: 95%
Margin of Error: 5%
Population Proportion: 50% (use 50% if not sure)
Population Size: (leave blank if unlimited)

Find Out the Margin of Error


This calculator gives out the margin of error or confidence interval of observation or
survey.

Confidence Level: 95%
Sample Size: 100
Population Proportion: 60%
Population Size: 70595 (leave blank if unlimited)

Result
Margin of error: 9.60%
This means, in this case, there is a 95% chance that the real value is within ±9.60% of
the measured/surveyed value.


In statistics, information is often inferred about a population by studying a finite number
of individuals from that population, i.e. the population is sampled, and it is assumed that
characteristics of the sample are representative of the overall population. For the
following, it is assumed that there is a population of individuals where some
proportion, p, of the population is distinguishable from the other 1-p in some way;
e.g., p may be the proportion of individuals who have brown hair, while the remaining
1-p have black, blond, red, etc. Thus, to estimate p in the population, a sample
of n individuals could be taken from the population, and the sample proportion, p̂,
calculated for sampled individuals who have brown hair. Unfortunately, unless the full
population is sampled, the estimate p̂ most likely won't equal the true value p,
since p̂ suffers from sampling noise, i.e. it depends on the particular individuals that
were sampled. However, sampling statistics can be used to calculate what are called
confidence intervals, which are an indication of how close the estimate p̂ is to the true
value p.

Statistics of a Random Sample


The uncertainty in a given random sample (namely, that the proportion estimate, p̂,
is expected to be a good, but not perfect, approximation for the true proportion p) can be
summarized by saying that the estimate p̂ is normally distributed with mean p and
variance p(1-p)/n. For an explanation of why the sample estimate is normally
distributed, study the Central Limit Theorem. As defined below, confidence level,
confidence intervals, and sample sizes are all calculated with respect to this sampling
distribution. In short, the confidence interval gives an interval around p in which an
estimate p̂ is "likely" to be. The confidence level gives just how "likely" this is – e.g., a
95% confidence level indicates that it is expected that an estimate p̂ lies in the
confidence interval for 95% of the random samples that could be taken. The confidence
interval depends on the sample size, n (the variance of the sample distribution is
inversely proportional to n, meaning that the estimate gets closer to the true proportion
as n increases); thus, an acceptable error rate in the estimate can also be set, called
the margin of error, ε, and the equation solved for the sample size required for the
chosen confidence interval to be smaller than ε, a calculation known as "sample size
calculation."

Confidence Level
The confidence level is a measure of certainty regarding how accurately a sample
reflects the population being studied within a chosen confidence interval. The most
commonly used confidence levels are 90%, 95%, and 99%, which each have their own
corresponding z-scores (which can be found using an equation or widely available
tables like the one provided below) based on the chosen confidence level. Note that
using z-scores assumes that the sampling distribution is normally distributed, as
described above in "Statistics of a Random Sample." Given that an experiment or
survey is repeated many times, the confidence level essentially indicates the
percentage of the time that the resulting interval found from repeated tests will contain
the true result.
Confidence Level    z-score (±)
0.70                1.04
0.75                1.15
0.80                1.28
0.85                1.44
0.92                1.75
0.95                1.96
0.96                2.05
0.98                2.33
0.99                2.58
0.999               3.29
0.9999              3.89
0.99999             4.42
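
For confidence levels that are not in the table, the z-score can be computed from the inverse of the standard normal cumulative distribution. The sketch below uses Python's standard library; the function name z_score is an assumption for illustration, and it presumes a two-sided interval as in the table.

```python
from statistics import NormalDist

def z_score(confidence_level):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95."""
    # The probability outside the interval is split equally between the two tails.
    return NormalDist().inv_cdf((1 + confidence_level) / 2)

for level in (0.70, 0.95, 0.99):
    print(level, round(z_score(level), 2))
```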

Confidence Interval
In statistics, a confidence interval is an estimated range of likely values for a population
parameter, for example, 40 ± 2 or 40 ± 5%. Taking the commonly used 95% confidence
level as an example, if the same population were sampled multiple times, and interval
estimates made on each occasion, in approximately 95% of the cases, the true
population parameter would be contained within the interval. Note that the 95%
probability refers to the reliability of the estimation procedure and not to a specific
interval. Once an interval is calculated, it either contains or does not contain the
population parameter of interest. Some factors that affect the width of a confidence
interval include: size of the sample, confidence level, and variability within the sample.
There are different equations that can be used to calculate confidence intervals
depending on factors such as whether the standard deviation is known or smaller
samples (n<30) are involved, among others. The calculator provided on this page
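calculates the confidence interval for a proportion and uses the following equations:

For an unlimited population:
$$\text{CI} = \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

For a finite population:
$$\text{CI} = \hat{p} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n'} \cdot \frac{N-n'}{N-1}}$$

where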
z is z score
p̂ is the population proportion
n and n' are sample size
N is the population size
Within statistics, a population is a set of events or elements that have some relevance
regarding a given question or experiment. It can refer to an existing group of objects,
systems, or even a hypothetical group of objects. Most commonly, however, population
is used to refer to a group of people, whether they are the number of employees in a
company, number of people within a certain age group of some geographic area, or
number of students in a university's library at any given time.
It is important to note that the equation needs to be adjusted when considering a finite
population, as shown above. The (N-n)/(N-1) term in the finite population equation is
referred to as the finite population correction factor, and is necessary because it cannot
be assumed that all individuals in a sample are independent. For example, if the study
population involves 10 people in a room with ages ranging from 1 to 100, and one of
those chosen has an age of 100, the next person chosen is more likely to have a lower
age. The finite population correction factor accounts for factors such as these. Refer
below for an example of calculating a confidence interval with an unlimited population.
EX: Given that 120 people work at Company Q, 85 of which drink coffee daily, find the
99% confidence interval of the true proportion of people who drink coffee at Company Q
on a daily basis.
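
The Company Q example can be worked through in a few lines of Python. This is a sketch that assumes the normal-approximation interval for a proportion, p̂ ± z·sqrt(p̂(1-p̂)/n), with the optional finite population correction factor (N-n)/(N-1) described above; the function name margin_of_error is made up for illustration.

```python
from math import sqrt

def margin_of_error(p_hat, n, z, population=None):
    """Half-width of the confidence interval for a sample proportion."""
    variance = p_hat * (1 - p_hat) / n
    if population is not None:
        # Finite population correction factor (N - n) / (N - 1).
        variance *= (population - n) / (population - 1)
    return z * sqrt(variance)

# Company Q: 85 of 120 people drink coffee daily; z = 2.58 for 99% confidence.
p_hat = 85 / 120
moe = margin_of_error(p_hat, n=120, z=2.58)
print(f"{p_hat:.3f} ± {moe:.3f}")  # 0.708 ± 0.107
```

Under these assumptions, the 99% confidence interval is roughly 0.708 ± 0.107. Calling margin_of_error(0.60, 100, 1.96, population=70595) gives about 0.096, which matches the 9.60% margin of error shown in the calculator result near the top of this page.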
Sample Size Calculation
Sample size is a statistical concept that involves determining the number of
observations or replicates (the repetition of an experimental condition used to estimate
the variability of a phenomenon) that should be included in a statistical sample. It is an
important aspect of any empirical study requiring that inferences be made about a
population based on a sample. Essentially, sample sizes are used to represent parts of
a population chosen for any given survey or experiment. To carry out this calculation,
set the margin of error, ε, or the maximum distance desired for the sample estimate to
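deviate from the true value. To do this, use the confidence interval equation above, but
set the term to the right of the ± sign equal to the margin of error, and solve the
resulting equation for the sample size, n. The equations for calculating sample size are
shown below.

For an unlimited population:
$$n = \frac{z^2 \, \hat{p}(1-\hat{p})}{\varepsilon^2}$$

For a finite population, where n is the unlimited-population value above:
$$n' = \frac{n}{1 + \frac{n-1}{N}}$$

where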
z is the z score
ε is the margin of error
N is the population size
p̂ is the population proportion
EX: Determine the sample size necessary to estimate the proportion of people shopping
at a supermarket in the U.S. that identify as vegan with 95% confidence, and a margin
of error of 5%. Assume a population proportion of 0.5, and unlimited population size.
Remember that z for a 95% confidence level is 1.96. Refer to the table provided in the
confidence level section for z scores of a range of confidence levels.
Thus, for the case above, a sample size of at least 385 people would be necessary. In
the above example, some studies estimate that approximately 6% of the U.S.
population identify as vegan, so rather than assuming 0.5 for p̂, 0.06 would be used. If it
was known that 40 out of 500 people that entered a particular supermarket on a given
day were vegan, p̂ would then be 0.08.
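
The supermarket example can be checked with a short Python sketch. It assumes the unlimited-population formula n = z²·p̂(1-p̂)/ε², rounds up to the next whole person, and, for a finite population, solves the finite-population interval for the sample size. The function name sample_size and its parameters are hypothetical.

```python
from math import ceil

def sample_size(z, margin, p_hat=0.5, population=None):
    """Minimum sample size for a given z-score and margin of error (as a fraction)."""
    n = z**2 * p_hat * (1 - p_hat) / margin**2
    if population is not None:
        # Assumed finite-population adjustment: n' = n / (1 + (n - 1) / N).
        n = n / (1 + (n - 1) / population)
    # Round up, since a fraction of a respondent cannot be sampled.
    return ceil(n)

print(sample_size(1.96, 0.05))              # 385
print(sample_size(1.96, 0.05, p_hat=0.06))  # 87
```

Using the 6% estimate instead of the worst-case 0.5 reduces the required sample considerably, because p̂(1-p̂) is largest at 0.5.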

Sample Size Calculator


This Sample Size Calculator is presented as a public service of Creative Research Systems survey
software. You can use it to determine how many people you need to interview in order to get results that
reflect the target population as precisely as needed. You can also find the level of precision you have in
an existing sample.

Before using the sample size calculator, there are two terms that you need to know. These
are: confidence interval and confidence level. Both are explained in the section "Sample Size
Calculator Terms: Confidence Interval & Confidence Level" below. To learn more about the factors
that affect the size of confidence intervals, see "Factors that Affect Confidence Intervals" below.

The calculator has two modes: one finds the sample size you need, the other finds the confidence
interval you have. Leave the Population box blank if the population is very large or unknown.

Determine Sample Size
Inputs: confidence level (95% or 99%), confidence interval, and population.
Output: sample size needed.

Find Confidence Interval
Inputs: confidence level (95% or 99%), sample size, population, and percentage (default 50).
Output: confidence interval.

Sample Size Calculator Terms: Confidence Interval & Confidence Level

The confidence interval (also called margin of error) is the plus-or-minus figure usually reported in
newspaper or television opinion poll results. For example, if you use a confidence interval of 4 and 47%
of your sample picks an answer, you can be "sure" that if you had asked the question of the entire
relevant population, between 43% (47-4) and 51% (47+4) would have picked that answer.

The confidence level tells you how sure you can be. It is expressed as a percentage and represents how
often the true percentage of the population who would pick an answer lies within the confidence interval.
The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be
99% certain. Most researchers use the 95% confidence level.

When you put the confidence level and the confidence interval together, you can say that you are 95%
sure that the true percentage of the population is between 43% and 51%. The wider the confidence
interval you are willing to accept, the more certain you can be that the whole population's answers
would be within that range.

For example, if you asked a sample of 1000 people in a city which brand of cola they preferred, and 60%
said Brand A, you can be very certain that between 40 and 80% of all the people in the city actually do
prefer that brand, but you cannot be so sure that between 59 and 61% of the people in the city prefer the
brand.
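
The cola example can be made concrete with a short calculation. The sketch below assumes a 95% confidence level (z = 1.96), a genuinely random sample, a large population, and the usual normal approximation for a proportion; the function name margin_points is hypothetical.

```python
from math import sqrt

def margin_points(pct, n, z=1.96):
    """Confidence interval (margin of error) in percentage points."""
    p = pct / 100
    return 100 * z * sqrt(p * (1 - p) / n)

m = margin_points(60, 1000)
print(f"60% ± {m:.1f} points")  # 60% ± 3.0 points
```

Under these assumptions the 95% interval is roughly 57% to 63%. The range 40 to 80% is far wider than that, so you can be very certain of it, while 59 to 61% is narrower than the interval the sample supports.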

Factors that Affect Confidence Intervals


There are three factors that determine the size of the confidence interval for a given confidence level:

• Sample size
• Percentage
• Population size

Sample Size

The larger your sample size, the more sure you can be that their answers truly reflect the population. This
indicates that for a given confidence level, the larger your sample size, the smaller your confidence
interval. However, the relationship is not linear (i.e., doubling the sample size does not halve the
confidence interval).
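
A quick calculation shows why the relationship is not linear. In the usual normal approximation (an assumption here, along with the 95% level and the worst-case 50% percentage), the interval shrinks with the square root of the sample size, so the sample must be quadrupled to halve the interval. The function name is hypothetical.

```python
from math import sqrt

def margin_points(n, pct=50, z=1.96):
    """Confidence interval in percentage points for a large population."""
    p = pct / 100
    return 100 * z * sqrt(p * (1 - p) / n)

for n in (500, 1000, 2000):
    print(n, round(margin_points(n), 2))
```

Going from 500 to 1,000 respondents shrinks the interval by about 29%, and it takes 2,000 respondents to cut it in half.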

Percentage

Your accuracy also depends on the percentage of your sample that picks a particular answer. If 99% of
your sample said "Yes" and 1% said "No," the chances of error are remote, irrespective of sample size.
However, if the percentages are 51% and 49% the chances of error are much greater. It is easier to be
sure of extreme answers than of middle-of-the-road ones.

When determining the sample size needed for a given level of accuracy you must use the worst case
percentage (50%). You should also use this percentage if you want to determine a general level of
accuracy for a sample you already have. To determine the confidence interval for a specific answer your
sample has given, you can use the percentage picking that answer and get a smaller interval.
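
The effect of the percentage can be seen by holding the sample size fixed and varying the share that picked an answer. This sketch assumes a 95% confidence level, a large population, and the normal approximation; the sample size of 400 and the function name are chosen only for illustration.

```python
from math import sqrt

def margin_points(pct, n, z=1.96):
    """Confidence interval in percentage points for a given answer percentage."""
    p = pct / 100
    return 100 * z * sqrt(p * (1 - p) / n)

for pct in (50, 80, 99):
    print(pct, round(margin_points(pct, n=400), 2))
```

The interval is widest at 50%, which is why 50% serves as the worst case when planning a sample size.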

Population Size

How many people are there in the group your sample represents? This may be the number of people in a
city you are studying, the number of people who buy new cars, etc. Often you may not know the exact
population size. This is not a problem. The mathematics of probability proves that the size of the population
is irrelevant unless the size of the sample exceeds a few percent of the total population you are
examining. This means that a sample of 500 people is equally useful in examining the opinions of a state
of 15,000,000 as it would be for a city of 100,000. For this reason, The Survey System ignores the population
size when it is "large" or unknown. Population size is only likely to be a factor when you work with a
relatively small and known group of people (e.g., the members of an association).
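
To see how little the population size matters, the sketch below compares the interval for a sample of 500 across different populations. It assumes a 95% confidence level, the worst-case 50% percentage, and the standard finite population correction factor (N-n)/(N-1); this is not necessarily the exact formula The Survey System uses, and the function name is hypothetical.

```python
from math import sqrt

def margin_points(n, population, pct=50, z=1.96):
    """Confidence interval in percentage points, with finite population correction."""
    p = pct / 100
    fpc = (population - n) / (population - 1)
    return 100 * z * sqrt(p * (1 - p) / n * fpc)

for population in (1_000, 100_000, 15_000_000):
    print(population, round(margin_points(500, population), 2))
```

The intervals for 100,000 and 15,000,000 differ by about a hundredth of a percentage point. Only when the sample is a large share of the group, as with 500 out of 1,000, does the interval shrink noticeably.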
The confidence interval calculations assume you have a genuine random sample of the relevant
population. If your sample is not truly random, you cannot rely on the intervals. Non-random samples
usually result from some flaw or limitation in the sampling procedure. An example of such a flaw is to only
call people during the day and miss almost everyone who works. For most purposes, the non-working
population cannot be assumed to accurately represent the entire (working and non-working) population.
An example of a limitation is using an opt-in online poll, such as one promoted on a website. There is no
way to be sure an opt-in poll truly represents the population of interest.
