OCTOBER 2020
ANIL R. DOSHI
Describing a Dataset
Giving the Audience an Idea of the Data
When reporting results from nearly any data analysis, describing the data is critical to help your
audience understand the work you have done. Having a sufficient explanation of the data is important
for both technical and non-technical audiences. For the former, it will assist them with giving you
feedback and asking questions to help improve the analysis or ensure its validity. For the latter, it
provides a broad understanding of the context and supports the big-picture takeaways you want them to
leave with. A dataset description typically includes the following elements:
• The unit of analysis. What constitutes a unique observation in the dataset? Is it a
cross-sectional dataset at the individual level, where each observation is an individual (a
customer, for example)? Or is it a panel dataset containing the same individual over
multiple days (in which case the data would be at the customer-day unit of analysis)?
• The number of observations and the number of panels (if a panel dataset)
• The time period of the sample
• Tables of summary statistics (including number of observations, mean, standard deviation,
and quartiles) and a correlation matrix
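As a minimal sketch of how the elements above might be computed, the Python below uses only the
standard library on a made-up customer-day panel. The field names (customer_id, day, spend) and the
helpers describe_panel and summarize are hypothetical, chosen for illustration, and do not come from
any particular dataset or package.

```python
from collections import Counter
from datetime import date
import statistics

# Hypothetical customer-day panel: one row per (customer_id, day).
rows = [
    {"customer_id": 1, "day": date(2020, 1, 1), "spend": 10.0},
    {"customer_id": 1, "day": date(2020, 1, 2), "spend": 14.0},
    {"customer_id": 2, "day": date(2020, 1, 1), "spend": 8.0},
    {"customer_id": 2, "day": date(2020, 1, 2), "spend": 12.0},
]


def describe_panel(rows, unit_key, time_key, panel_key):
    """Report observation count, panel count, time span, and key uniqueness."""
    key_counts = Counter(tuple(r[k] for k in unit_key) for r in rows)
    duplicates = [key for key, n in key_counts.items() if n > 1]
    return {
        "n_obs": len(rows),
        "n_panels": len({r[panel_key] for r in rows}),
        "start": min(r[time_key] for r in rows),
        "end": max(r[time_key] for r in rows),
        # True only if the claimed unit of analysis identifies each row.
        "unit_is_unique": not duplicates,
    }


def summarize(values):
    """Return the summary statistics for one numeric variable."""
    # statistics.quantiles uses the "exclusive" method by default;
    # other software may use a different quartile convention.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "median": median,
        "q3": q3,
    }


panel_info = describe_panel(
    rows, unit_key=("customer_id", "day"), time_key="day", panel_key="customer_id"
)
spend_stats = summarize([r["spend"] for r in rows])
```

Checking that the claimed unit of analysis is unique is a cheap way to catch duplicated rows before
reporting the number of observations.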
Copyright © 2020 Anil Doshi. This publication may not be digitized, photocopied, or otherwise
reproduced, posted, or transmitted, without the permission of the authors. To order copies or request
permission to reproduce materials, email anil.doshi@ucl.ac.uk.
IdeaWeb — Understanding Workplace Networks
We use listing-level data on Airbnb properties that were available on the platform from January to
December 2012. The data was acquired from InsideAirbnb.com, an independent service that provides
data on several cities. Our sample consists of London rentals. For each listing, we have data on its
price, days available, and number of ratings, as well as the text description of the property. We also
have data for each review, including the reviewer, the listing, the review score, and the text of the
review. We merged this data with other social and demographic measures obtained from gov.uk, including
quality ratings of nearby schools and local crime rates. See Appendix Table 1 for a list and description
of variables used in our analysis.
We transformed some variables to account for skewed and missing data. First, we use the natural log
of price and of number of ratings. Second, some observations had missing crime data; for those
observations, we substituted the mean crime rate of the neighboring regions.
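The two transformations in this paragraph could be sketched as follows. This is an illustration
under stated assumptions, not the procedure behind any real analysis: the region codes, the neighbors
adjacency map, and the helpers log_transform and impute_from_neighbors are all hypothetical, and the
sample description does not say how zero values were handled before taking logs.

```python
import math
import statistics

# Hypothetical region-level crime rates; None marks a missing value.
crime_rates = {"R1": None, "R2": 40.0, "R3": 60.0}

# Hypothetical adjacency map: which regions border which.
neighbors = {"R1": ["R2", "R3"], "R2": ["R1"], "R3": ["R1"]}


def log_transform(value):
    """Natural log of a strictly positive value."""
    # math.log raises ValueError for zero or negative input, so zeros
    # (for example, a listing with no ratings) need an explicit decision.
    return math.log(value)


def impute_from_neighbors(rates, neighbors):
    """Fill each missing rate with the mean rate of its neighboring regions."""
    filled = dict(rates)
    for region, rate in rates.items():
        if rate is None:
            known = [
                rates[n] for n in neighbors.get(region, [])
                if rates.get(n) is not None
            ]
            if known:  # leave the value missing if no neighbor has data
                filled[region] = statistics.mean(known)
    return filled


filled_rates = impute_from_neighbors(crime_rates, neighbors)
log_price = log_transform(156.0)  # hypothetical nightly price in pounds
```

When describing such steps to an audience, it helps to report how many observations were imputed,
since imputed values reduce the variation in that variable.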
After dropping 1,425 observations because of missing or erroneous location information, we have
42,195 listings in our analysis. Table 1 presents summary statistics for the variables, and
Table 2 presents a correlation matrix. The mean price of a listing in our dataset is 5.05 log pounds,
and the mean number of ratings (logged) is 1.56.
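Logged means can be hard for a non-technical audience to read, so it may help to translate them back
to the original scale. The short sketch below does this for the illustrative figures quoted in the
paragraph; the variable names are hypothetical. Note that exponentiating a mean of logs gives the
geometric mean, which is generally lower than the arithmetic mean of the raw values.

```python
import math

mean_log_price = 5.05     # illustrative value from the sample description
mean_log_ratings = 1.56   # illustrative value from the sample description

# Exponentiating a mean of logs yields the geometric mean on the raw scale.
geo_mean_price = math.exp(mean_log_price)      # roughly 156 pounds
geo_mean_ratings = math.exp(mean_log_ratings)  # roughly 4.8 ratings

# Reporting the sample before and after exclusions makes the drop transparent.
n_dropped = 1425
n_analyzed = 42195
n_before_drop = n_analyzed + n_dropped
share_dropped = n_dropped / n_before_drop
```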
Closing Thoughts
The advice in this document depends on the analysis you are conducting and who your audience is.
There will be occasions where you do not include some of the elements in your dataset description,
or where you include others that are not listed above. There may also be occasions where using the
variable names from your software is appropriate, such as when writing up a technical document or
code book. The main point is to actively use your judgement and present your work in a way that is
meaningful and relevant to your audience.
1 Values in the sample description are solely for illustration and are not reflective of any dataset or analysis.