**Objective:**

The objective of this project is to perform an Exploratory Data Analysis (EDA) on
the Airbnb NYC 2019 dataset. EDA is a critical step in data analysis that helps us
understand the dataset, identify patterns, missing values, and outliers, and
visualize data for insights. The project aims to provide valuable insights that can
assist in making data-driven decisions related to Airbnb listings in New York City.

**Project Steps:**

1. **Data Loading:**

- Import the necessary libraries: NumPy, Pandas, Matplotlib, and Seaborn.
- Mount Google Drive to access the dataset file.
- Load the Airbnb NYC 2019 dataset into a Pandas DataFrame.
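
A minimal sketch of the loading step, assuming the dataset is stored as a CSV file. The file name `AB_NYC_2019.csv`, the Drive path, and the helper `load_listings` are assumptions for illustration, since the source does not state them. The Drive mount is shown only as a comment because it works solely inside Google Colab.

```python
import io

import pandas as pd

# In Google Colab, Drive is typically mounted first (not runnable elsewhere):
#   from google.colab import drive
#   drive.mount("/content/drive")

# Hypothetical location of the CSV on the mounted Drive; adjust to your setup.
DATA_PATH = "/content/drive/MyDrive/AB_NYC_2019.csv"


def load_listings(source=DATA_PATH):
    """Read the listings CSV (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Tiny in-memory stand-in for the real file, used only to show the call.
sample = io.StringIO("id,price\n1,149\n2,225\n")
df = load_listings(sample)
print(df.shape)
```

With the real file, the call would simply be `df = load_listings()` once the path points at the dataset.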

2. **Data Exploration:**

- Display the first few rows of the dataset using `df.head()` to get an overview of
the data.
- Display the last few rows of the dataset using `df.tail()`.
- Count the number of rows and columns in the dataset using `df.shape`.
- Get an overview of the dataset's data types and missing values using
`df.info()`.
- Check for duplicate rows in the dataset using `df.duplicated()`.
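
The exploration calls above can be sketched on a toy DataFrame as follows. The values are made up and stand in for the real listings; only the method names come from the steps listed.

```python
import pandas as pd

# Toy stand-in for the listings DataFrame (values are made up).
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "host_name": ["Ann", "Bob", "Bob", None],
    "price": [120, 80, 80, 200],
})

print(df.head())  # first rows (up to 5)
print(df.tail())  # last rows (up to 5)

# shape is an attribute, not a method: (number of rows, number of columns).
n_rows, n_cols = df.shape

# info() prints column dtypes and non-null counts to stdout.
df.info()

# duplicated() returns a boolean Series; summing it counts the duplicate rows.
n_duplicates = int(df.duplicated().sum())
print(n_rows, n_cols, n_duplicates)
```

Note that `df.duplicated()` on its own only flags rows; the `.sum()` is what turns the flags into a count.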

3. **Missing Values Analysis:**

- Identify and count missing values in each column using `df.isnull().sum()`.
- Visualize missing values using a heatmap with `sns.heatmap()`.
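
One possible way to write these two steps is sketched below. The helper names `missing_counts` and `plot_missing` are hypothetical, the toy data is made up, and the heatmap options are one reasonable choice rather than the ones used in the original notebook.

```python
import pandas as pd


def missing_counts(df):
    """Return the number of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)


def plot_missing(df):
    """Draw a heatmap of the missing-value mask (one cell per value)."""
    # Plotting libraries are imported here so the counting helper
    # can be used without them.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
    ax.set_title("Missing values by column")
    plt.show()
    return ax


# Toy stand-in for the listings DataFrame (values are made up).
df = pd.DataFrame({
    "host_name": ["Ann", None, "Cy"],
    "last_review": [None, None, "2019-05-21"],
    "price": [120, 80, 200],
})
counts = missing_counts(df)
print(counts)
```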

4. **Data Cleaning:**

- Create a copy of the original DataFrame called `df_copy`.
- Remove columns 'reviews_per_month' and 'last_review' as they are not
needed for analysis.
- Handle missing values in the 'host_name' column by replacing them with
'Unknown'.
- Drop the 'name' column as it contains uninformative data.
- Calculate the mean value of the 'price' column and replace zero values with
the mean.
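
The cleaning steps above can be sketched as a single function. The function name `clean_listings` and the toy data are assumptions for illustration. As described, the mean is taken over the whole 'price' column, zero rows included.

```python
import pandas as pd


def clean_listings(df):
    """Apply the cleaning steps to a copy, leaving the original untouched."""
    df_copy = df.copy()

    # Columns not needed for the analysis.
    df_copy = df_copy.drop(columns=["reviews_per_month", "last_review"])

    # Missing host names become the placeholder 'Unknown'.
    df_copy["host_name"] = df_copy["host_name"].fillna("Unknown")

    # The listing title is dropped as uninformative.
    df_copy = df_copy.drop(columns=["name"])

    # Mean of the full price column (zeros included), then replace zeros.
    mean_price = df_copy["price"].mean()
    price = df_copy["price"].astype(float)
    df_copy["price"] = price.mask(price == 0, mean_price)
    return df_copy


# Toy stand-in for the listings DataFrame (values are made up).
df = pd.DataFrame({
    "name": ["Cozy room", None, "Loft"],
    "host_name": ["Ann", None, "Cy"],
    "last_review": ["2019-05-21", None, "2019-06-01"],
    "reviews_per_month": [0.5, None, 1.2],
    "price": [100, 0, 200],
})
cleaned = clean_listings(df)
print(cleaned)
```

Computing the mean over non-zero prices only would be a defensible alternative, but it is a different choice from the one the steps describe.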

5. **Data Visualization:**

- Create visualizations to explore the dataset:
  - Scatterplot of latitude and longitude colored by neighborhood group.
  - Boxplot of availability in days.
  - Bar chart showing the maximum number of reviews by neighborhood group.
  - Bar chart showing the top 10 host names by the number of listings.
  - Bar chart showing the top neighborhoods by the number of reviews.
  - Boxplot of minimum nights.
  - Histogram of prices.
  - Pie chart of room type distribution.
  - Correlation heatmap of numerical features.
  - Pair plot of selected numerical columns.
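
A rough sketch of how a subset of these charts might be produced. The column names (`neighbourhood_group`, `neighbourhood`, `number_of_reviews`, `room_type`, `availability_365`, `latitude`, `longitude`) are assumed from the public dataset and are not given in the text. The helpers `chart_tables` and `plot_overview` are hypothetical, and "top neighborhoods by the number of reviews" is interpreted here as the summed review count per neighborhood.

```python
import pandas as pd


def chart_tables(df):
    """Compute the tables behind the bar charts, pie chart, and heatmap."""
    reviews = df.groupby("neighbourhood")["number_of_reviews"].sum()
    return {
        "max_reviews_by_group": (
            df.groupby("neighbourhood_group")["number_of_reviews"].max()
        ),
        "top_hosts": df["host_name"].value_counts().head(10),
        "top_neighbourhoods": reviews.sort_values(ascending=False).head(10),
        "room_type_share": df["room_type"].value_counts(normalize=True),
        "correlations": df.select_dtypes("number").corr(),
    }


def plot_overview(df):
    """Draw six of the listed charts; needs the full set of assumed columns."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    t = chart_tables(df)
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    sns.scatterplot(data=df, x="longitude", y="latitude",
                    hue="neighbourhood_group", s=8, ax=axes[0, 0])
    sns.boxplot(x=df["availability_365"], ax=axes[0, 1])
    sns.histplot(df["price"], bins=50, ax=axes[0, 2])
    t["top_hosts"].plot.bar(ax=axes[1, 0], title="Top 10 hosts by listings")
    t["room_type_share"].plot.pie(ax=axes[1, 1], autopct="%1.1f%%")
    sns.heatmap(t["correlations"], annot=True, fmt=".2f", ax=axes[1, 2])
    fig.tight_layout()
    return fig


# Toy stand-in for the cleaned listings (values are made up).
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "neighbourhood": ["Harlem", "Harlem", "Williamsburg", "Bushwick"],
    "host_name": ["Ann", "Ann", "Bob", "Ann"],
    "room_type": ["Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt"],
    "number_of_reviews": [10, 5, 40, 3],
    "price": [100, 150, 200, 80],
})
tables = chart_tables(df)
print(tables["top_hosts"])
```

Separating the aggregation from the plotting keeps the numbers behind each chart easy to inspect. The minimum-nights boxplot and the pair plot are left out to keep the sketch short.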

**Insights:**

- The scatterplot shows the geographical distribution of Airbnb listings in different
neighborhood groups.
- The boxplot reveals the distribution of availability in days.
- The bar chart displays the maximum number of reviews for each neighborhood
group.
- The bar chart identifies the top 10 host names with the most listings.
- The bar chart highlights the top neighborhoods with the most reviews.
- The boxplot visualizes the distribution of minimum nights.
- The histogram shows the distribution of prices.
- The pie chart illustrates the distribution of room types.
- The correlation heatmap identifies relationships between numerical features.
- The pair plot provides scatterplots and histograms for numerical variables.

**Business Impact:**

- Insights from this EDA can help Airbnb make informed decisions such as
pricing strategies, identifying popular neighborhoods, and targeting marketing
efforts.
- By addressing missing values and understanding the dataset's characteristics,
Airbnb can improve data quality and analysis accuracy.

**Recommendations:**

- Further analysis can be done to understand the factors influencing price
variations.
- Machine learning models can be developed for predictive analytics and
recommendation systems.

**Conclusion:**

This EDA project provides valuable insights into the Airbnb NYC 2019 dataset,
helping Airbnb and other stakeholders make data-driven decisions for better
business outcomes. It highlights the importance of data preprocessing,
visualization, and analysis in understanding complex datasets.