
To: Airbnb Executive Committee

From: David Gantt


Date: 4/15/2021
RE: Sales Forecasting through Rental Price

As the world and mainstream commerce continue to shift toward digital platforms, the
collection of data and the implementation of analytical frameworks are changing the ways
large firms operate and strategize. Healthcare companies use AI-enabled analysis of chest X-rays to
parse large imaging datasets and reach diagnoses more quickly. Banks and financial
institutions now place some of the world's most important fiscal decisions in the hands of
data experts and machine learning algorithms. These are just a couple of examples of the rapid
change sweeping through these industries as the data science and analytics fields mature.

Overview

How does this apply to Airbnb? Airbnb's success is driven primarily by the quantity of rentals
booked and the price paid for each rental. The price of any given rental can be influenced by a
multitude of competing factors, such as location, comparable listings, time of year, and amenities.
If we are going to remain a leader in this space, Airbnb needs to find new ways to process
the vast amount of data we have already collected from our catalogue and use those insights
to isolate the most significant forces driving our listing prices.

My team has developed a forecasting model that considers the broad combination
of pricing factors and generates pricing predictions within an acceptable range of error. We
recommend that Airbnb use our model to overhaul our sales forecast reports and, in turn,
enhance our strategic decision-making capabilities. Our final RMSE of 390/413 was achieved
using the modeling and feature selection techniques that we expand upon below.

Data Preparation

Our team queried Airbnb's internal northeast database and aggregated over 26,000 rows
spanning 96 variables. We quickly realized that the dataset contained a number of irregularities,
including null values, unfit data types, and outliers, that could have a negative impact
on our final results. We quickly resolved the misaligned data types; however, the more serious
work of handling null values, whether by replacing them with zero or with the variable's
overall average, required more careful imputation methods.

Our team focused on maintaining consistency across datasets by applying our more complex
imputation techniques to a master table containing both the analysis and scoring data points.
We began this stage of the preparation process by removing variables containing free-text
fields, redundancy, or both. We then filled any remaining missing values in the dataset with the
column's overall average. We felt that replacing null values with the overall column average
would help minimize statistical error in our model while limiting our exposure to increased
bias.
Data Exploration

Before beginning the exploratory phase, we split the analysis dataset into train and test sets
in preparation for a deeper analysis. The median price across the analysis dataset was $100,
and the vast majority of listings were priced below $200, as shown in Exhibit 1 below. It is
unsurprising that Airbnb rentals congregate in the lower price ranges, as that is essentially the
market the platform targets, especially in high-throughput cities such as New York.
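A rough sketch of that step follows, assuming the price column is named "price" and an 80/20 split; neither detail is specified above.

```python
from sklearn.model_selection import train_test_split

# Pull the analysis rows back out of the master table and hold out a
# test set before any deeper exploration or modelling.
analysis_rows = master[master["split"] == "analysis"]
train, test = train_test_split(analysis_rows, test_size=0.2, random_state=42)

# Quick checks on the price distribution described above.
print(train["price"].median())        # expected to sit near $100
print((train["price"] < 200).mean())  # share of listings under $200
```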

We also began exploring some working theories in the early stages. For example, Exhibit 2
suggests that the price of an Airbnb listing is significantly correlated with the number of
guests it accommodates. Again, this seems intuitive; additional guests drive up the space
requirements, which directly affects the price of a listing, especially given that larger homes and
robust residential developments typically drive up overall prices in an area.

Exhibit 3 breaks down the relationship between price and accommodations even further.
Disregard "Room Types" for a moment and focus solely on the distribution of data points
within the histogram. Notice how the price clusters below the 2,000 mark and then breaks out
as the number of accommodations increases. This reaffirms our earlier theory that the
number of guests strongly influences our rental prices. We then broke out the population by
"Room Type." This factor contained four category groups; however, we rolled hotel rooms into
the "Private room" subset, which explains some of the outlying prices in the "1 – 2"
accommodation group. These initial findings allowed our team to gather a basic
understanding of how some of our variables interact with one another before building out
more advanced feature selection techniques.
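The snippet below sketches the two checks just described: the price/accommodates correlation and the price breakdown by room type after folding hotel rooms into "Private room." The column names and the "Hotel room" label are assumptions about our extract.

```python
# Correlation between price and the number of guests a listing
# accommodates (the relationship suggested by Exhibit 2).
print(train["price"].corr(train["accommodates"]))

# Fold the small hotel category into "Private room", then compare
# price distributions across the remaining room types (Exhibit 3).
train["room_type"] = train["room_type"].replace({"Hotel room": "Private room"})
print(train.groupby("room_type")["price"].describe())
```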

Feature Selection

Our goal during this phase was to build on the insights we uncovered in the exploratory
phase and identify more significant variables that would further enhance our predictions. We
experimented with a few feature selection methods, including forward selection,
backward selection, and the lasso. The stepwise variable selection technique generated
the best results (including a higher r-squared) by a slight margin, which was a bit unexpected.
We believe our lasso model was skewed by the imputation and data preparation process, so
we decided to scrap it and move forward with our stepwise model. This enabled us to identify
the variables and factors that significantly influence the price of rental listings. These inputs were
critical in achieving an efficient model capable of more accurate pricing predictions.
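Our stepwise routine mixed forward and backward moves; the closest readily available equivalent in scikit-learn runs one direction at a time, so the sketch below is an approximation, and the candidate column list is illustrative rather than our actual feature pool.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Illustrative candidate predictors; the real candidate pool was far wider.
candidates = ["accommodates", "number_of_reviews", "bedrooms", "bathrooms"]
X, y = train[candidates], train["price"]

# Forward selection adds whichever variable most improves cross-validated
# fit at each step; direction="backward" removes variables instead.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
)
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))
```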

Data Models

Linear Regression Model

We split the analysis dataset into train and test sets to assess our model's accuracy; recall
that these datasets were combined during the imputation process to ensure consistency.

We began building our first model during the exploration phase, developing a simple linear
regression model with price as our target and accommodates as our sole predictor. This generated a
public RMSE of 432.6, no improvement over the baseline model. After cycling through a few
more variables and seeing little improvement in our score, we decided to integrate the
variables identified by our stepwise regression function. These new variables pushed our model's
performance past the baseline value of 432.6, down to 431.3. We then decided to move toward
more advanced modeling techniques with the goal of breaking the 400 barrier.
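A minimal version of that first model is sketched below, reusing the same illustrative column names as above; note that the 432.6 and 431.3 figures quoted here came from the public leaderboard rather than from a local split like this one.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Baseline linear model: price as a function of accommodates alone.
lm = LinearRegression().fit(train[["accommodates"]], train["price"])
preds = lm.predict(test[["accommodates"]])

# Root mean squared error on the held-out rows.
rmse = mean_squared_error(test["price"], preds) ** 0.5
print(rmse)
```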
Random Forest Model

We based our next model on the random forest methodology. We started simply by adding
the variables from our feature selection process. Accommodates and number of reviews
unsurprisingly lead our variable importance plot; however, we were surprised to see
"room_type" ranked so low. The added sophistication of this package paid off immediately,
pushing us across the 400 RMSE barrier to a public score of 393 on our first submission. We
then refit the original random forest as a tuned random forest model, hoping to improve our
score even more.
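The sketch below shows the shape of that workflow in scikit-learn: a plain random forest fit on the selected features, its variable importances, and a small grid search standing in for the tuning step. The feature list and grid values are illustrative, not the settings we actually used.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

features = ["accommodates", "number_of_reviews", "bedrooms", "bathrooms"]

# Plain random forest on the selected features.
rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(train[features], train["price"])

# Variable importances, analogous to the importance plot discussed above.
print(dict(zip(features, rf.feature_importances_)))

# Tuned variant: a small grid search over depth and features per split.
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [10, 20, None], "max_features": ["sqrt", 0.5]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid.fit(train[features], train["price"])
print(grid.best_params_, grid.best_score_)
```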

Takeaways

Ultimately, we achieved a final public RMSE of 390 using a tuned random forest
approach. While our final model was substantially more effective than our original, there
are a few things we could have done differently. I believe most of the errors I encountered
during the development phase were tied to sloppy imputation. I tried some more
advanced techniques, such as CART, but it would have been wiser to stick with a simpler
approach and spend more time optimizing the data types. That being said, I ultimately
believe that sticking to a consistent approach and placing emphasis on the delivery of the data
was the correct call.
