August 4, 2023
2 Greetings!
Welcome to the Video Game Sales Analysis Tutorial! In this tutorial, we will explore and analyze
a dataset containing information about video games with sales. The dataset was generated by
scraping data from vgchartz.com, a popular source for video game sales data.
Tools Used
For this analysis, we will be using the following Python libraries:
NumPy for numerical operations and calculations.
Pandas for data manipulation and analysis.
Matplotlib for data visualization.
Seaborn for creating informative and attractive statistical graphics.
Throughout this tutorial, we will provide step-by-step code examples and visualizations to help
you understand the data analysis process thoroughly. By the end of this tutorial, you will have
a comprehensive understanding of how to perform data analysis on video game sales data using
Python’s data science libraries.
3 Dataset Description
The dataset consists of the following fields:
Rank: The ranking of overall sales for each game.
Name: The name of the video game.
Platform: The platform on which the game was released (e.g., PC, PS4, Xbox One, etc.).
Year: The year of the game’s release.
Genre: The genre of the video game (e.g., Action, Sports, RPG, etc.).
Publisher: The publisher of the video game.
NA_Sales: Sales in North America (in millions).
EU_Sales: Sales in Europe (in millions).
JP_Sales: Sales in Japan (in millions).
Other_Sales: Sales in the rest of the world (in millions).
Global_Sales: Total worldwide sales (sum of all regional sales).
The dataset contains 16,598 records, with 2 records dropped due to incomplete information.
4 Preparation
[1]: # import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
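The cells that load and inspect the data ([2] to [4]) are not present in this copy. The following is a minimal sketch of that step, assuming the data is stored as a CSV file. The file name `vgsales.csv` and the helper `load_sales` are assumptions and do not come from the original notebook. A small in-memory sample with made-up rows stands in for the file so that the snippet runs on its own.

```python
import io
import pandas as pd

def load_sales(source):
    """Read the sales table from a path or file-like object (hypothetical helper)."""
    return pd.read_csv(source)

# Two made-up rows in the column layout described above, used instead of the real file.
sample_csv = io.StringIO(
    "Rank,Name,Platform,Year,Genre,Publisher,"
    "NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales\n"
    "1,Game A,PC,2010,Action,Pub X,1.0,0.5,0.2,0.1,1.8\n"
    "2,Game B,PS4,,Sports,,0.4,0.3,0.0,0.1,0.8\n"
)

df_sample = load_sales(sample_csv)  # with the real data: load_sales("vgsales.csv")
df_sample.info()                    # column dtypes and non-null counts
print(df_sample.isnull().sum())     # number of missing values per column
```

Calling `info()` and `isnull().sum()` on the full dataset would produce summaries in the same form as the two outputs shown next.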
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 16598 non-null int64
1 Name 16598 non-null object
2 Platform 16598 non-null object
3 Year 16327 non-null float64
4 Genre 16598 non-null object
5 Publisher 16540 non-null object
6 NA_Sales 16598 non-null float64
7 EU_Sales 16598 non-null float64
8 JP_Sales 16598 non-null float64
9 Other_Sales 16598 non-null float64
10 Global_Sales 16598 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB
[5]: Rank 0
Name 0
Platform 0
Year 271
Genre 0
Publisher 58
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
Global_Sales 0
dtype: int64
We remove the rows with missing values instead of replacing them. The missing values occur only in
two columns, “Year” and “Publisher”, and both need an exact value, so filling them with 0 or with
an average would not be meaningful.
[6]: # remove missing values
df = df.dropna()
[7]: # check again for missing values
df.isnull().sum()
[7]: Rank 0
Name 0
Platform 0
Year 0
Genre 0
Publisher 0
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
Global_Sales 0
dtype: int64
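Because `Year` contained missing values, pandas stored it as `float64` (see the `info()` output above). Once the incomplete rows are gone, the column can be cast to an integer type, which makes axis labels and group keys easier to read. The sketch below uses a made-up three-row frame. Restricting `dropna` to the two affected columns is an optional variation and an assumption; it is not what cell [6] does.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Year": [2010.0, np.nan, 2015.0],
    "Publisher": ["Pub X", "Pub Y", None],
    "Global_Sales": [1.8, 0.8, 0.3],
})

# Drop rows where Year or Publisher is missing, then store Year as an integer.
clean = toy.dropna(subset=["Year", "Publisher"]).copy()
clean["Year"] = clean["Year"].astype(int)

print(len(toy) - len(clean), "rows dropped")
print(clean.dtypes)
```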
[8]: df.describe()
Other_Sales Global_Sales
count 16291.000000 16291.000000
mean 0.048426 0.540910
std 0.190083 1.567345
min 0.000000 0.010000
25% 0.000000 0.060000
50% 0.010000 0.170000
75% 0.040000 0.480000
max 10.570000 82.740000
10. Distribution of Sales: What is the distribution of global sales across different games? Are
sales concentrated in a few blockbuster titles, or is it more evenly spread?
11. Sales Over Time: Can you identify any trends in sales for specific games or genres over
time?
12. Best-Selling Genres by Region: Are there any differences in the most popular genres for
each region (NA, EU, JP)?
13. Yearly Sales by Region: How have the sales trends evolved over the years in each region?
14. Regional Market Share: What percentage of the global video game market do each of the
regions hold?
15. Platform Sales Comparison: Compare the sales performance of different platforms side
by side.
6 Let’s Start!
1. Top Selling Games: What are the top 10 best-selling games of all time based on
global sales?
[9]: # Sort the dataset by 'Global_Sales' in descending order and select the top 10
top_10_games = df.sort_values(by='Global_Sales', ascending=False).head(10)
2. Sales by Platform: Which gaming platform (e.g., PC, PS4, Xbox, etc.) has the
highest total sales?
[10]: # Group the data by 'Platform' and calculate the sum of global sales for each platform
platform_sales = df.groupby('Platform')['Global_Sales'].sum().reset_index()
plt.xlabel('Gaming Platform')
plt.ylabel('Total Sales (millions)')
plt.title('Total Sales by Gaming Platform')
plt.xticks(rotation=45)
plt.show()
3. Genre Popularity: What are the most popular genres of video games in terms of
global sales?
[11]: # Group the data by 'Genre' and calculate the sum of global sales for each genre
genre_sales = df.groupby('Genre')['Global_Sales'].sum().reset_index()
plt.axis('equal')
plt.title('Genre Popularity based on Global Sales')
plt.show()
4. Sales by Region: Which region (North America, Europe, Japan, Rest of the
World) contributes the most to global video game sales?
plt.xlabel('Region')
plt.ylabel('Total Sales (millions)')
plt.title('Total Sales by Region')
plt.show()
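Only the plot labels of this cell are present in this copy. One way to obtain the totals that such a chart would display is to sum each regional column, as sketched below on made-up numbers. The variable names `region_cols` and `region_sales` are assumptions.

```python
import pandas as pd

# Made-up regional sales (in millions) for three games.
toy = pd.DataFrame({
    "NA_Sales": [1.0, 0.4, 0.2],
    "EU_Sales": [0.5, 0.3, 0.1],
    "JP_Sales": [0.2, 0.0, 0.6],
    "Other_Sales": [0.1, 0.1, 0.0],
})

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Column-wise sum gives one total per region; sort so the largest comes first.
region_sales = toy[region_cols].sum().sort_values(ascending=False)

print(region_sales)
print("Largest region:", region_sales.idxmax())
```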
5. Publisher Performance: Which publishers have released the most successful
games in terms of global sales?
[13]: # Group the data by 'Publisher' and calculate the sum of global sales for each publisher
publisher_sales = df.groupby('Publisher')['Global_Sales'].sum().reset_index()
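Grouping alone does not rank the publishers. A possible next step, sketched here on made-up data, is to sort the grouped totals and keep the leading rows. The name `top_publishers` and the cutoff of three rows are assumptions chosen for the toy example.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Publisher": ["Pub X", "Pub Y", "Pub X", "Pub Z"],
    "Global_Sales": [1.8, 0.8, 0.3, 0.5],
})

publisher_sales = toy.groupby("Publisher")["Global_Sales"].sum().reset_index()

# Rank publishers by total sales and keep the top three.
top_publishers = publisher_sales.sort_values(by="Global_Sales", ascending=False).head(3)
print(top_publishers)
```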
6. Yearly Sales Trends: How have video game sales evolved over the years? Are
they increasing or decreasing?
[14]: # Group the data by 'Year' and calculate the sum of global sales for each year
yearly_sales = df.groupby('Year')['Global_Sales'].sum()
7. Platform vs. Genre: Which genres are most popular on specific gaming platforms?
[15]: # Group the data by 'Platform' and 'Genre' and calculate the sum of global sales for each combination
plt.xlabel('Platform')
plt.ylabel('Total Sales (millions)')
plt.title('Genre Popularity on Gaming Platforms')
plt.legend(title='Genre', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.show()
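The grouping step described in the comment of cell [15] is not shown in this copy. A minimal sketch of one way to do it follows, using made-up data. The names `platform_genre` and `top_genre_per_platform` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Platform": ["PC", "PC", "PS4", "PS4", "PS4"],
    "Genre": ["Action", "Sports", "Action", "Action", "Sports"],
    "Global_Sales": [1.0, 0.5, 2.0, 0.5, 0.25],
})

# One row per platform, one column per genre, cells hold summed global sales.
platform_genre = (
    toy.groupby(["Platform", "Genre"])["Global_Sales"]
    .sum()
    .unstack(fill_value=0)
)

# For each platform, the genre with the highest total.
top_genre_per_platform = platform_genre.idxmax(axis=1)

print(platform_genre)
print(top_genre_per_platform)
```

A table in this shape can be drawn with `platform_genre.plot(kind='bar', stacked=True)`, which would fit the axis labels and genre legend used in the cell above.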
8. Regional Sales Variation: Do certain genres or platforms perform better in specific
regions?
[16]: # Create box plots to visualize sales variation of genres and platforms in specific regions
plt.figure(figsize=(12, 8))
sns.boxplot(x='Genre', y='NA_Sales', data=df)
plt.xlabel('Genre')
plt.ylabel('Sales in North America (millions)')
plt.title('Regional Sales Variation by Genre in North America')
plt.xticks(rotation=45)
plt.show()
9. Correlations: Are there any significant correlations between game sales and other
factors like year of release, genre, or publisher?
[17]: # Create a pairplot to visualize correlations between game sales, year of release, and publisher
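The pairplot code itself is not present in this copy. As a numeric complement, the sketch below computes pairwise Pearson correlations on a made-up frame. Only numeric columns are included; genre and publisher are categorical, so they would need a different treatment (for example, comparing group totals). The column selection here is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Year": [2006, 2008, 2010, 2012, 2014],
    "NA_Sales": [4.0, 3.0, 2.5, 1.0, 0.5],
    "EU_Sales": [2.0, 1.5, 1.0, 0.6, 0.2],
    "Global_Sales": [7.0, 5.0, 4.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

Passing the same numeric columns to `sns.pairplot` would show the corresponding pairwise scatter plots.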
10. Distribution of Sales: What is the distribution of global sales across different
games? Are sales concentrated in a few blockbuster titles, or is it more evenly
spread?
[18]: # Create a histogram to visualize the distribution of global sales
plt.figure(figsize=(10, 6))
plt.hist(df['Global_Sales'], bins=30, color='orange', edgecolor='black')
plt.xlabel('Global Sales (millions)')
plt.ylabel('Frequency')
plt.title('Distribution of Global Sales')
plt.show()
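The `describe()` output earlier already hints at strong concentration: mean global sales are 0.54 million, the median is 0.17 million, and the maximum is 82.74 million. One way to put a number on this is the share of total sales contributed by the best-selling titles. The sketch below uses made-up values, and the 10% cutoff is an arbitrary assumption.

```python
import pandas as pd

# Made-up global sales (in millions) for ten games.
toy = pd.DataFrame({
    "Global_Sales": [40.0, 5.0, 2.0, 1.0, 0.5, 0.5, 0.4, 0.3, 0.2, 0.1],
})

sorted_sales = toy["Global_Sales"].sort_values(ascending=False)

# Number of titles in the top 10% (at least one).
top_n = max(1, int(len(sorted_sales) * 0.10))

# Fraction of all sales that comes from those titles.
top_share = sorted_sales.head(top_n).sum() / sorted_sales.sum()

print(f"Top {top_n} title(s) account for {top_share:.1%} of sales")
```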
11. Sales Over Time: Can you identify any trends in sales for specific games or genres
over time?
[19]: # Group the data by 'Year' and calculate the sum of global sales for each year
yearly_sales = df.groupby('Year')['Global_Sales'].sum()
12. Best-Selling Genres by Region: Are there any differences in the most popular
genres for each region (NA, EU, JP)?
[20]: # Group the data by 'Genre' and calculate the sum of global sales for each genre
genre_sales = df.groupby('Genre')['Global_Sales'].sum().reset_index()
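Summing `Global_Sales` per genre does not separate the regions. To compare regions, the regional columns can be summed per genre instead, as in this sketch on made-up data. The names `regions`, `genre_by_region`, and `best_genre` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Genre": ["Action", "Role-Playing", "Action", "Sports", "Role-Playing"],
    "NA_Sales": [2.0, 0.5, 1.0, 1.5, 0.2],
    "EU_Sales": [1.0, 0.3, 0.8, 2.5, 0.1],
    "JP_Sales": [0.2, 1.5, 0.1, 0.1, 1.0],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales"]

# One row per genre, one column per region.
genre_by_region = toy.groupby("Genre")[regions].sum()

# For each region (column), the genre with the highest total.
best_genre = genre_by_region.idxmax()

print(genre_by_region)
print(best_genre)
```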
13. Yearly Sales by Region: How have the sales trends evolved over the years in each
region?
[21]: # Group the data by 'Year' and calculate the sum of sales for each region for every year
plt.plot(yearly_sales_by_region.index, yearly_sales_by_region['EU_Sales'],
         label='Europe', marker='o', color='mediumseagreen')
plt.plot(yearly_sales_by_region.index, yearly_sales_by_region['JP_Sales'],
         label='Japan', marker='o', color='tomato')
plt.plot(yearly_sales_by_region.index, yearly_sales_by_region['Other_Sales'],
         label='Rest of the World', marker='o', color='gold')
plt.xlabel('Year')
plt.ylabel('Sales (millions)')
plt.title('Yearly Sales Trends by Region')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()
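The line that builds `yearly_sales_by_region` is not present in this copy of cell [21]. Based on the cell's comment, it could be built as in the sketch below. The data is made up, and the exact form of the original statement is an assumption.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "NA_Sales": [1.0, 0.5, 0.8, 0.2],
    "EU_Sales": [0.5, 0.25, 0.4, 0.1],
    "JP_Sales": [0.2, 0.0, 0.3, 0.1],
    "Other_Sales": [0.1, 0.1, 0.0, 0.1],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# One row per year (the index), one column per region.
yearly_sales_by_region = toy.groupby("Year")[regions].sum()
print(yearly_sales_by_region)
```

With the year as the index, each `plt.plot(yearly_sales_by_region.index, ...)` call in the cell draws one line per region.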
14. Regional Market Share: What percentage of the global video game market do
each of the regions hold?
[22]: # Calculate the total global sales
total_global_sales = df['Global_Sales'].sum()
plt.axis('equal')
plt.title('Regional Market Share of Video Game Sales')
plt.show()
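The share calculation is not shown in this copy of cell [22]. The sketch below shows one way to turn regional totals into percentages, using made-up numbers. It divides by the sum of the regional totals; dividing by `total_global_sales` should give nearly the same result if `Global_Sales` is the sum of the regional columns, as the dataset description states. The names `region_totals` and `market_share` are assumptions.

```python
import pandas as pd

# Made-up regional sales (in millions) for two games.
toy = pd.DataFrame({
    "NA_Sales": [5.0, 3.0],
    "EU_Sales": [3.0, 2.0],
    "JP_Sales": [1.5, 0.5],
    "Other_Sales": [4.0, 1.0],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Total per region, then each region's percentage of the overall total.
region_totals = toy[regions].sum()
market_share = region_totals / region_totals.sum() * 100

print(market_share.round(1))
```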
15. Platform Sales Comparison: Compare the sales performance of different platforms
side by side.
[23]: # Group the data by 'Platform' and calculate the sum of global sales for each platform
platform_sales = df.groupby('Platform')['Global_Sales'].sum().reset_index()
plt.xlabel('Platform')
plt.ylabel('Total Sales (millions)')
plt.title('Platform Sales Comparison')
plt.xticks(rotation=45)
plt.show()
7 Conclusion
This VGSales analysis provided valuable insights into the video game industry, revealing top-
performing games, popular genres, platform preferences, and trends over time.
These insights can be leveraged by game developers, publishers, and marketers to make informed
decisions and tailor their strategies for greater success in the competitive gaming market.
Keep in mind that this analysis is based on the data available in the dataset, and further research
or data collection may be required for more comprehensive conclusions.
The analysis can also be extended by incorporating additional external data sources or using
advanced statistical techniques for deeper insights.