
CREATED

OCTOBER 2020

ANIL R. DOSHI

Describing a Dataset
Giving the Audience an Idea of the Data
When reporting results from nearly any data analysis, describing the data is critical to helping your
audience understand the work you have done. A sufficient explanation of the data matters for both
technical and non-technical audiences. For the former, it helps them give you feedback and ask
questions that improve the analysis or confirm its validity. For the latter, it provides a broad
understanding of the context and supports the big-picture takeaways you want them to leave with.

Key Elements in a Dataset Description


The idea of a dataset description is to convey basic information about the data assembly and
wrangling process (without dragging the audience through all the pain you experienced). Key pieces
of information include:

• The source of the dataset
• A list and explanation of variables in the dataset. This may be a brief list of variable names;
a list of the main variables and a description of each; or a reference to a table describing
variables.
• Any ancillary datasets you merged in, including their sources and the variables taken from each
• Significant data cleaning steps you took (e.g., making text data more consistent or correcting
typographical errors)
• The number of observations that were dropped (and the reasons why)
• Whether outliers are present in any variables and how they were treated (e.g., winsorizing)
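
As a rough illustration of the last two items, the sketch below tallies dropped observations by reason and winsorizes a variable. It uses only the Python standard library. The field names, sample values, helper names (`drop_with_reasons`, `winsorize`), and the percentile cutoffs are all assumptions made for this example; they are not prescribed by this note.

```python
from collections import Counter


def drop_with_reasons(rows, checks):
    """Keep rows that pass every check; count dropped rows by first failed check."""
    kept, dropped = [], Counter()
    for row in rows:
        reason = next((name for name, ok in checks if not ok(row)), None)
        if reason is None:
            kept.append(row)
        else:
            dropped[reason] += 1
    return kept, dropped


def winsorize(values, lower=0.05, upper=0.95):
    """Clip values to the observations found at the lower and upper rank cutoffs."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower * (n - 1))]
    hi = ordered[int(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical listings; the field names are illustrative only.
rows = [
    {"price": 80, "borough": "Camden"},
    {"price": 95, "borough": "Hackney"},
    {"price": None, "borough": "Camden"},
    {"price": 120, "borough": ""},
    {"price": 5000, "borough": "Westminster"},
]
checks = [
    ("missing price", lambda r: r["price"] is not None),
    ("missing location", lambda r: bool(r["borough"])),
]
kept, dropped = drop_with_reasons(rows, checks)
print(len(kept), dict(dropped))

# Hypothetical prices with one extreme value, clipped at the 10th/90th rank cutoffs.
prices = [50, 60, 70, 80, 90, 100, 110, 120, 130, 5000]
print(winsorize(prices, lower=0.1, upper=0.9))
```

Keeping the tally next to the cleaning step means the counts and reasons you report come directly from the code that did the dropping.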

After all that work, you will have the “final” dataset that you use in your analysis. At this point,
help the audience understand the structure of the dataset. You may include:

• The unit of analysis. What constitutes a unique observation in the dataset? Is it a
cross-sectional dataset at the individual level, where each observation is an individual (a
customer, for example)? Or is it a panel dataset containing the same individual over
multiple days (in which case the unit of analysis would be the customer-day)?
• The number of observations and the number of panels (if a panel dataset)
• The time period of the sample
• Tables of summary statistics (including number of observations, mean, standard deviation,
and quartiles) and a correlation matrix
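
The structural elements above can be computed with a few lines of code. The following is a minimal sketch using only the Python standard library; the helper names (`panel_shape`, `summarize`, `correlation_matrix`), the column names, and the sample values are assumptions for illustration, and in practice a data-frame library would likely do the same job.

```python
import statistics


def panel_shape(rows, unit, time):
    """Return (observations, panels) after checking unit-time pairs are unique."""
    keys = [(r[unit], r[time]) for r in rows]
    if len(set(keys)) != len(keys):
        raise ValueError("unit-time pairs are not unique")
    return len(keys), len({u for u, _ in keys})


def summarize(values):
    """Return n, mean, standard deviation, and quartiles for one variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def correlation_matrix(columns):
    """Correlate every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical customer-day panel and variables, for illustration only.
panel = [
    {"customer": "a", "day": 1},
    {"customer": "a", "day": 2},
    {"customer": "b", "day": 1},
]
data = {
    "price": [80, 95, 120, 150, 200],
    "ratings": [40, 35, 22, 18, 5],
}
print(panel_shape(panel, "customer", "day"))
for name, values in data.items():
    print(name, summarize(values))
print(correlation_matrix(data))
```

The uniqueness check in `panel_shape` is a quick way to confirm that the unit of analysis you state in the report is the one the data actually has.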

Copyright © 2020 Anil Doshi. This publication may not be digitized, photocopied, or otherwise
reproduced, posted, or transmitted, without the permission of the authors. To order copies or request
permission to reproduce materials, email anil.doshi@ucl.ac.uk.

A Word on Variable Names


The purpose of an analytics report is often to communicate the computational work you have
performed to a wider set of stakeholders (e.g., managers, executives, or investors). Part of translating
technical work into a readable report is considering how you write your variable names. One
convention is to give variables short but intuitive names in your report. These will often differ
from the variable names in the software you are using! For example, you might have a variable
priceG50 that equals 1 if price is greater than the median. However, you might write that variable
as high price in the report.
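
One way to keep the two sets of names in step is a small lookup from software names to report labels. The sketch below is an assumption-laden illustration: the sample prices, the `REPORT_LABELS` dictionary, and the `report_name` helper are hypothetical, and only the priceG50 and high price names come from the example above.

```python
import statistics

# Hypothetical prices for illustration only.
prices = [80, 95, 120, 150, 200]

# Software name: terse, equals 1 if price is greater than the median.
median_price = statistics.median(prices)
priceG50 = [1 if p > median_price else 0 for p in prices]

# Report labels: map software names to reader-friendly names.
REPORT_LABELS = {"priceG50": "high price"}


def report_name(variable):
    """Return the report label, falling back to the software name."""
    return REPORT_LABELS.get(variable, variable)


print(report_name("priceG50"), priceG50)
```

Applying the lookup when tables are generated means the code can keep its terse names while every table in the report shows the readable ones.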

An Example Using Airbnb Data


To give you an idea of what this looks like in practice, here is an example using a dataset of Airbnb
listings:1

We use listing-level data on Airbnb properties that were available on the platform from January to
December 2012. The data was acquired from InsideAirbnb.com, an independent service that provides
data on several cities. Our data consists of London rentals. For each listing, we have data on its price,
days available, and number of ratings, as well as the text description of the property. We also have
data for each review, including the reviewer, the listing, the review score, and the text of the review.
We merged this data with other social and demographic measures obtained from gov.uk, including
ratings of nearby schools and local crime rates. See Appendix Table 1 for a list and description of
variables used in our analysis.

We transformed some variables to account for missing and skewed data. First, we use the natural
log of price and number of ratings. Second, some observations had missing crime data; for those
observations, we substituted the mean crime rate of the neighboring regions.

After dropping 1,425 observations because of missing or erroneous location information, we are left
with 42,195 listings for our analysis. Table 1 presents summary statistics for the variables and
Table 2 presents a correlation matrix. The mean price of a listing in our dataset is 5.05 log pounds and
the mean number of ratings (logged) is 1.56.
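
The two transformations mentioned in the sample description could be sketched as follows. This is a minimal illustration, not the procedure behind the example: the region codes, the adjacency map, the values, and the helper names (`impute_from_neighbors`, `log_transform`) are all hypothetical.

```python
import math
import statistics

# Hypothetical regions, crime rates, and adjacency; None marks missing data.
crime_rate = {"A": 4.0, "B": None, "C": 6.0, "D": 9.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}


def impute_from_neighbors(rates, adjacency):
    """Fill each missing rate with the mean of its neighbors' observed rates."""
    filled = dict(rates)
    for region, rate in rates.items():
        if rate is None:
            observed = [rates[n] for n in adjacency[region] if rates[n] is not None]
            if observed:  # leave the gap if no neighbor has data
                filled[region] = statistics.mean(observed)
    return filled


def log_transform(values):
    """Natural log of strictly positive values."""
    return [math.log(v) for v in values]


filled = impute_from_neighbors(crime_rate, neighbors)
log_price = log_transform([80, 95, 120, 150, 200])
print(filled["B"], round(statistics.mean(log_price), 2))
```

Imputing from the original rates rather than the partly filled ones keeps each substituted value based only on observed data, which is easier to explain in a dataset description.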

Closing Thoughts
The advice in this document depends on the analysis you are conducting and who your audience is.
There will be occasions where you omit some of these elements from your dataset description, or
include others that are not listed above. And there may be occasions where using the variable names
from your software is appropriate, such as when writing up a technical document or code book. The
main point is to actively use your judgement and present your work in a way that is meaningful and
relevant to your audience.

1 Values in the sample description are solely for illustration and are not reflective of any dataset or analysis.
