You are on page 1of 1

TASK 6

I HAVE DONE THIS TASK IS ENTIRELY BASED UPON MY B.TECH PROJECT AND
DATSET IS FROM MY SIDE ONY. ADDITIONAL TO MY TASK GIVEN,I HAVE DONE
MACHINE LEARNING PART ALSO BY APPLYING ONLY TWO ALGORITHMS

IMPORTING MODULES

In [1]: import numpy as np


import pandas as pd
import plotly.express as px

In [2]: pip install cufflinks

Requirement already satisfied: cufflinks in c:\users\rupes\anaconda3\lib\site-packages (0.17.3)


Requirement already satisfied: numpy>=1.9.2 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (1.24.3)
Requirement already satisfied: pandas>=0.19.2 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (1.5.3)
Requirement already satisfied: plotly>=4.1.1 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (5.9.0)
Requirement already satisfied: six>=1.9.0 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (1.16.0)
Requirement already satisfied: colorlover>=0.2.1 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (0.3.0)
Requirement already satisfied: setuptools>=34.4.1 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (68.0.0)
Requirement already satisfied: ipython>=5.3.0 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (8.12.0)
Requirement already satisfied: ipywidgets>=7.0.0 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (8.0.4)
Requirement already satisfied: backcall in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: decorator in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (5.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.18.1)
Requirement already satisfied: matplotlib-inline in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.1.6)
Requirement already satisfied: pickleshare in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflink
s) (3.0.36)
Requirement already satisfied: pygments>=2.4.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (2.15.1)
Requirement already satisfied: stack-data in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: traitlets>=5 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (5.7.1)
Requirement already satisfied: colorama in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.4.6)
Requirement already satisfied: ipykernel>=4.5.1 in c:\users\rupes\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (6.19.2)
Requirement already satisfied: widgetsnbextension~=4.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (4.0.5)
Requirement already satisfied: jupyterlab-widgets~=3.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (3.0.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\rupes\anaconda3\lib\site-packages (from pandas>=0.19.2->cufflinks) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\rupes\anaconda3\lib\site-packages (from pandas>=0.19.2->cufflinks) (2022.7)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\rupes\anaconda3\lib\site-packages (from plotly>=4.1.1->cufflinks) (8.2.2)
Requirement already satisfied: comm>=0.1.1 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (0.
1.2)
Requirement already satisfied: debugpy>=1.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.
6.7)
Requirement already satisfied: jupyter-client>=6.1.12 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cuff
links) (7.4.9)
Requirement already satisfied: nest-asyncio in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.
5.6)
Requirement already satisfied: packaging in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (23.0)
Requirement already satisfied: psutil in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (5.9.0)
Requirement already satisfied: pyzmq>=17 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (23.2.
0)
Requirement already satisfied: tornado>=6.1 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (6.
3.2)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in c:\users\rupes\anaconda3\lib\site-packages (from jedi>=0.16->ipython>=5.3.0->cufflinks) (0.8.
3)
Requirement already satisfied: wcwidth in c:\users\rupes\anaconda3\lib\site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython>=5.3.0->
cufflinks) (0.2.5)
Requirement already satisfied: executing in c:\users\rupes\anaconda3\lib\site-packages (from stack-data->ipython>=5.3.0->cufflinks) (0.8.3)
Requirement already satisfied: asttokens in c:\users\rupes\anaconda3\lib\site-packages (from stack-data->ipython>=5.3.0->cufflinks) (2.0.5)
Requirement already satisfied: pure-eval in c:\users\rupes\anaconda3\lib\site-packages (from stack-data->ipython>=5.3.0->cufflinks) (0.2.2)
Requirement already satisfied: entrypoints in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-client>=6.1.12->ipykernel>=4.5.1->ipywidgets
>=7.0.0->cufflinks) (0.4)
Requirement already satisfied: jupyter-core>=4.9.2 in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-client>=6.1.12->ipykernel>=4.5.1->ip
ywidgets>=7.0.0->cufflinks) (5.3.0)
Requirement already satisfied: platformdirs>=2.5 in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-core>=4.9.2->jupyter-client>=6.1.12->i
pykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (2.5.2)
Requirement already satisfied: pywin32>=300 in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-core>=4.9.2->jupyter-client>=6.1.12->ipyker
nel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (305.1)
Note: you may need to restart the kernel to use updated packages.

In [3]: import cufflinks as cf


from plotly.offline import download_plotlyjs, init_notebook_mode,plot,iplot
import plotly.graph_objects as go
import plotly.offline as iplot
import chart_studio
init_notebook_mode(connected=True)
cf.go_offline()

C:\Users\rupes\anaconda3\Lib\site-packages\paramiko\transport.py:219: CryptographyDeprecationWarning:

Blowfish has been deprecated

Loading dataset using Pandas


In [4]: df = pd.read_csv('googleappsplaystore.csv') #Loading Dataset

In [5]: df.head() #Top 5 Rows of the Dataset

Out[5]: Content Android


App Category Rating Reviews Size Installs Type Price Genres Last Updated Current Ver
Rating Ver

Photo Editor & Candy Camera & Grid January 7, 4.0.3 and
0 ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design 1.0.0
& ScrapBook 2018 up

Art & Design;Pretend January 15, 4.0.3 and


1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone 2.0.0
Play 2018 up

U Launcher Lite – FREE Live Cool August 1, 4.0.3 and


2 ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design 1.2.4
Themes, Hide ... 2018 up

Varies with
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 4.2 and up
device

Pixel Draw - Number Art Coloring


4 ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
Book

In [6]: df.tail() #Bottom 5 Rows of Dataset

Out[6]: Content Last


App Category Rating Reviews Size Installs Type Price Genres Current Ver Android Ver
Rating Updated

July 25,
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0 Everyone Education 1.48 4.1 and up
2017

Fr. Mike Schmitz Audio


10837 FAMILY 5.0 4 3.6M 100+ Free 0 Everyone Education July 6, 2018 1.0 4.1 and up
Teachings

January 20,
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical 1.0 2.2 and up
2017

The SCP Foundation DB fr Varies with Books & January 19, Varies with Varies with
10839 BOOKS_AND_REFERENCE 4.5 114 1,000+ Free 0 Mature 17+
nn5n device Reference 2015 device device

iHoroscope - 2018 Daily July 25, Varies with Varies with


10840 LIFESTYLE 4.5 398307 19M 10,000,000+ Free 0 Everyone Lifestyle
Horoscope & Astrology 2018 device device

In [7]: df.shape #Shape of the dataset (Rows x Columns)

(10841, 13)
Out[7]:

In [8]: df.size #Total No.of Elements in dataset

140933
Out[8]:

In [9]: df.dtypes #Data type of each column

App object
Out[9]:
Category object
Rating float64
Reviews object
Size object
Installs object
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object

Checking Nullvalues
In [10]: df.isnull().sum() #Finding the count of total null values for each Attribute

App 0
Out[10]:
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64

Droping Null values


In [11]: df.dropna(inplace=True) #Droping Null values of each Attribute

In [12]: df.describe() #Observing the Statistic distribution of Data

Out[12]: Rating

count 9360.000000

mean 4.191838

std 0.515263

min 1.000000

25% 4.000000

50% 4.300000

75% 4.500000

max 5.000000

In [13]: df.Rating.describe() #Observing the Statistical distribution of "Rating" attribute

count 9360.000000
Out[13]:
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64

Filling Null values with Median value


In [14]: df['Rating'].fillna(df['Rating'].median()) #Filling null values of "Rating" attribute with median of "Rating" attribute

0 4.1
Out[14]:
1 3.9
2 4.7
3 4.5
4 4.3
...
10834 4.0
10836 4.5
10837 5.0
10839 4.5
10840 4.5
Name: Rating, Length: 9360, dtype: float64

In [15]: df['Rating'].describe() #Observing the Statistical distribution of "Rating" attribute

count 9360.000000
Out[15]:
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64

In [16]: df.isnull().sum() #Finding the count of total null values for each Attribute

App 0
Out[16]:
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
dtype: int64

Converting MB's to KB's


In [17]: df['Size']=df['Size'].replace({ 'M': '*1e3', '-':'-1','Varies with device':'0','k':'1'}, regex=True).map(pd.eval).astype(int) #Converting App Size

Removing '+' from Installs column/Attribute


In [18]: df['Installs']=df['Installs'].map(lambda x:x.replace(',','').replace('+','')).astype(int)

In [19]: df['Installs']

0 10000
Out[19]:
1 500000
2 5000000
3 50000000
4 100000
...
10834 500
10836 5000
10837 100
10839 1000
10840 10000000
Name: Installs, Length: 9360, dtype: int32

Converting Reviews attribute to Int Datatype


In [20]: df['Reviews']=pd.to_numeric(df['Reviews']) #Changing Reviews attribute to Numeric Datatype

In [21]: df.dtypes

App object
Out[21]:
Category object
Rating float64
Reviews int64
Size int32
Installs int32
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object

Removing $ and , from Price Attribute


In [22]: df['Price']=df['Price'].map(lambda x:x.replace('$',''))
df['Price']=df['Price'].astype('float')
df['Price']=df['Price'].astype('int')

In [23]: len(df[df.Rating>5]) #Finding the number of elements having Rating greater than 5

0
Out[23]:

In [24]: df_reviews = (df[df.Reviews>df.Installs]) #Finding the Count of elements in Reviews greater than Installs
len(df_reviews)

7
Out[24]:

In [25]: len(df[(df.Type=='free')&(df.Price>0)]) #Finding the count of elements which are having Type as "Free" and Price greater than 0

0
Out[25]:

Droping Reviews where value of Reviews>Installs


In [26]: df.drop(df[df['Reviews']>df['Installs']].index,inplace=True) #Droping rows where value of Reviews greater than Installs

In [27]: df['Reviews']

0 159
Out[27]:
1 967
2 87510
3 215644
4 967
...
10834 7
10836 38
10837 4
10839 114
10840 398307
Name: Reviews, Length: 9353, dtype: int64

Data Visualization

WE CAN DO DATA VISUALIZATION IN SEABORN ,MATPLOTLIB ALSO. BUT I


PREFER PLOTLY .
In [28]: fig1 = px.box(df,y='Price') #Box plot for Price Attribute
fig1.show()

400

350

300

250
Price

200

150

100

50

Observation: Yes there are outliers for Price attribute ranging from (1 to 400)

In [29]: fig = px.line(df,y="Price") #Plotting graph for finding the skewness for the price Attribute

In [30]: fig.show()

400

350

300

250
Price

200

150

100

50

0 2k 4k 6k 8k 10k

index

In [31]: fig2 = px.box(df,y="Reviews") #Boxplot for Reviews Attribute


fig2.show()

80M

70M

60M

50M
Reviews

40M

30M

20M

10M

Observation : Yes there are very high number of reviews

In [32]: fig3 = px.histogram(df,x="Rating") #Histogram for Rating attribute


fig3.show()

1000

800
count

600

400

200

0
1 1.5 2 2.5 3 3.5 4 4.5 5

Rating

Observation : It is right skewed between 4 to 5

In [33]: fig4 = px.histogram(df,x="Size") #Histogram for Size attribute


fig4.show()

2000

1500
count

1000

500

0
0 20k 40k 60k 80k 100k

Size

Observations : It is left skewed between 0 and 20k

In [34]: df[df.Price>200] #Finding the rows having price greater than 200$

Out[34]: App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver

4197 most expensive app (H) FAMILY 4.3 6 1500 100 Paid 399 Everyone Entertainment July 16, 2018 1.0 7.0 and up

4362 💎 I'm rich LIFESTYLE 3.8 718 26000 10000 Paid 399 Everyone Lifestyle March 11, 2018 1.0.0 4.4 and up

4367 I'm Rich - Trump Edition LIFESTYLE 3.6 275 7300 10000 Paid 400 Everyone Lifestyle May 3, 2018 1.0.1 4.1 and up

5351 I am rich LIFESTYLE 3.8 3547 1800 100000 Paid 399 Everyone Lifestyle January 12, 2018 2.0 4.0.3 and up

5354 I am Rich Plus FAMILY 4.0 856 8700 10000 Paid 399 Everyone Entertainment May 19, 2018 3.0 4.4 and up

5355 I am rich VIP LIFESTYLE 3.8 411 2600 10000 Paid 299 Everyone Lifestyle July 21, 2018 1.1.1 4.3 and up

5356 I Am Rich Premium FINANCE 4.1 1867 4700 50000 Paid 399 Everyone Finance November 12, 2017 1.6 4.0 and up

5357 I am extremely Rich LIFESTYLE 2.9 41 2900 1000 Paid 379 Everyone Lifestyle July 1, 2018 1.0 4.0 and up

5358 I am Rich! FINANCE 3.8 93 22000 1000 Paid 399 Everyone Finance December 11, 2017 1.0 4.1 and up

5359 I am rich(premium) FINANCE 3.5 472 9651 5000 Paid 399 Everyone Finance May 1, 2017 3.4 4.4 and up

5362 I Am Rich Pro FAMILY 4.4 201 2700 5000 Paid 399 Everyone Entertainment May 30, 2017 1.54 1.6 and up

5364 I am rich (Most expensive app) FINANCE 4.1 129 2700 1000 Paid 399 Teen Finance December 6, 2017 2 4.0.3 and up

5366 I Am Rich FAMILY 3.6 217 4900 10000 Paid 389 Everyone Entertainment June 22, 2018 1.5 4.2 and up

5369 I am Rich FINANCE 4.3 180 3800 5000 Paid 399 Everyone Finance March 22, 2018 1.0 4.2 and up

5373 I AM RICH PRO PLUS FINANCE 4.0 36 41000 1000 Paid 399 Everyone Finance June 25, 2018 1.0.2 4.1 and up

In [35]: len(df[df.Price>200]) #Finding the count of elements having Price greater than 200

15
Out[35]:

In [36]: df.drop(df[df['Price']>200].index,inplace=True) #Droping rows having Price greater than 200

In [37]: len(df[df.Price>200])

0
Out[37]:

In [38]: len(df[df.Reviews>2000000]) #Finding the number of values having Reviews value greater than 2000000

453
Out[38]:

In [39]: df.drop(df[df['Reviews']>200000].index,inplace=True) #Droping the rows having Reviews value greater than 2000000

In [40]: len(df[df.Reviews>2000000])

0
Out[40]:

In [41]: np.percentile(df['Installs'], 10) #Finding 10 percentile of Installs attribute

1000.0
Out[41]:

In [42]: np.percentile(df['Installs'], 25) #Finding 25 percentile of Installs attribute

10000.0
Out[42]:

In [43]: np.percentile(df['Installs'], 50) #Finding 50 percentile of Installs attribute

100000.0
Out[43]:

In [44]: np.percentile(df['Installs'], 70) #Finding 70 percentile of Installs attribute

1000000.0
Out[44]:

In [45]: np.percentile(df['Installs'], 90) #Finding 90 percentile of Installs attribute

5000000.0
Out[45]:

In [46]: np.percentile(df['Installs'], 95) #Finding 95 percentile of Installs attribute

10000000.0
Out[46]:

In [47]: np.percentile(df['Installs'], 99) #Finding 99 percentile of Installs attribute

10000000.0
Out[47]:

In [48]: fig_installs = px.box(df,y="Installs") #Boxplot for Installs Attribute


fig_installs.show()

100M

80M

60M
Installs

40M

20M

Droping rows which are greater than Inter quartile range


In [49]: for cols in range(len(df['Installs'])): #Finding the inter quartile range
Q3 = df['Installs'].quantile(0.75) #Inter quartile range as threshold
Q1 = df['Installs'].quantile(0.25)
IQR = Q3 - Q1

count = df.loc[(df['Installs'] < (Q1 - 1.5 * IQR)) | (df['Installs'] > (Q3 + 1.5 * IQR))]

In [50]: count

Out[50]: Content
App Category Rating Reviews Size Installs Type Price Genres Last Updated Current Ver Android Ver
Rating

U Launcher Lite – FREE Live Cool August 1,


2 ART_AND_DESIGN 4.7 87510 8700 5000000 Free 0 Everyone Art & Design 1.2.4 4.0.3 and up
Themes, Hide ... 2018

12 Tattoo Name On My Photo Editor ART_AND_DESIGN 4.2 44829 20000 10000000 Free 0 Teen Art & Design April 2, 2018 3.8 4.1 and up

August 3,
18 FlipaClip - Cartoon animation ART_AND_DESIGN 4.3 194216 39000 5000000 Free 0 Everyone Art & Design 2.2.5 4.0.3 and up
2018

Varies with
37 Floor Plan Creator ART_AND_DESIGN 4.1 36639 0 5000000 Free 0 Everyone Art & Design July 14, 2018 2.3.3 and up
device

Canva: Poster, banner, card maker &


45 ART_AND_DESIGN 4.7 174531 24000 10000000 Free 0 Everyone Art & Design July 31, 2018 1.6.1 4.1 and up
graphic de...

... ... ... ... ... ... ... ... ... ... ... ... ... ...

10714 FunForMobile Ringtones & Chat SOCIAL 4.4 68358 7200 5000000 Free 0 Mature 17+ Social May 7, 2016 3.22 4.1 and up

August 5,
10716 Free Slideshow Maker & Video Editor PHOTOGRAPHY 4.2 162564 11000 10000000 Free 0 Everyone Photography 5.2 4.0 and up
2018

February 7,
10723 Mobile Kick SPORTS 4.3 111809 40000 10000000 Free 0 Everyone Sports 1.0.21 4.1 and up
2018

October 22,
10731 FeaturePoints: Free Gift Cards FAMILY 3.9 121321 46000 5000000 Free 0 Everyone Entertainment 8.7 4.0.3 and up
2016

Frim: get new friends on local chat March 23, Varies with Varies with
10826 SOCIAL 4.0 88486 0 5000000 Free 0 Mature 17+ Social
rooms 2018 device device

1198 rows × 13 columns

In [51]: df.drop(count.index,inplace=True) #Droping the Rows with the outliers above the inter quartile range

Data Visualization after droping values having more than Inter quartile range
In [52]: fig_installs = px.box(df,y="Installs") #Boxplot for Installs attribute after droping the outliers above inter quartile range
fig_installs.show()

1M

0.8M

0.6M
Installs

0.4M

0.2M

In [53]: fig_rvsp = px.scatter(df, x="Rating", y="Price",color='Category') #Scatterplot between Rating and Price and Coloring it with Category attribute
fig_rvsp.show()

80 Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
70
BEAUTY
BOOKS_AND_REFERENCE
60 BUSINESS
COMICS
50 COMMUNICATION
DATING
Price

40 EDUCATION
ENTERTAINMENT
EVENTS
30
FINANCE
FOOD_AND_DRINK
20 HEALTH_AND_FITNESS
HOUSE_AND_HOME
10 LIBRARIES_AND_DEMO
LIFESTYLE

0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5

Rating

Rating doesn't necessarily increase with the price. There are highly rated apps for lower priced apps. So Price doesn't have a considerable effect on the rating.

In [54]: fig_rvss = px.scatter(df, x="Rating", y="Size",color='Category') #Scatterplot between Rating and Size and Coloring it with Category attribute
fig_rvss.show()

Category
100k
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
80k BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
60k
DATING
EDUCATION
Size

ENTERTAINMENT
40k EVENTS
FINANCE
FOOD_AND_DRINK
HEALTH_AND_FITNESS
20k HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE

0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5

Rating

Heavier apps does seem to be rated higher but it is not conclusive. Lighter apps also have a good rating. SO heavier apps having higher rating may not always be true.

In [55]: fig_rvsr = px.scatter(df, x="Rating", y="Reviews",color='Category') #Scatterplot between Rating VS Reviews and Coloring it with Category attribute
fig_rvsr.show()

200k Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
150k BUSINESS
COMICS
COMMUNICATION
DATING
Reviews

100k EDUCATION
ENTERTAINMENT
EVENTS
FINANCE
FOOD_AND_DRINK
50k HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE

0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5

Rating

Heavily reviewed apps does seem to have a higher rating but not all the time. For example there is a game with just 3.8 rating for 159k reviews

In [56]: fig_rvscr = px.box(df,x = 'Content Rating',y='Rating',color='Content Rating') #Boxplot between Rating and Content Rating and Coloring it with Conte
fig_rvscr.show()

5 Content Rating
Everyone
Everyone 10+
4.5
Teen
Mature 17+
4 Adults only 18+
Unrated
3.5
Rating

2.5

1.5

Everyone Everyone 10+ Teen Mature 17+ Adults only 18+ Unrated

Content Rating

Eventhough median ratings for all types above are similar, Adults only 18+ does overall have a better overall rating.

In [57]: fig_rvsc = px.box(df,x = 'Category',y='Rating',color="Category") #Boxplot between Rating and Category and Coloring it with Category attribute
fig_rvsc.show()

5 Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
4 BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
Rating

3 COMICS
COMMUNICATION
DATING
EDUCATION
2
ENTERTAINMENT
EVENTS
FINANCE
1
FOOD AND DRINK
ART_AND_DESIGN

AUTO_AND_VEHICLES

BEAUTY

BOOKS_AND_REFERENCE

BUSINESS

COMICS

COMMUNICATION

DATING

EDUCATION

ENTERTAINMENT

EVENTS

FINANCE

FOOD_AND_DRINK

HEALTH_AND_FITNESS

HOUSE_AND_HOME

LIBRARIES_AND_DEMO

LIFESTYLE

GAME

FAMILY

MEDICAL

SOCIAL

SHOPPING

PHOTOGRAPHY

SPORTS

TRAVEL_AND_LOCAL

TOOLS

PERSONALIZATION

PRODUCTIVITY

PARENTING

WEATHER

VIDEO_PLAYERS

NEWS_AND_MAGAZINES

MAPS_AND_NAVIGATION

Category

Among all categories Events category does seem have overall better ratings

Reducing Skewness by using Log transformation


In [58]: df.Reviews=df.Reviews.apply(func=np.log1p) #Reducing the Skew by applying Log transformation(np.log1p)
df.Installs=df.Installs.apply(func=np.log1p)

df.hist(column=['Reviews','Installs']) # Histogram for Reviews and Installs attributes

array([[<Axes: title={'center': 'Reviews'}>,


Out[58]:
<Axes: title={'center': 'Installs'}>]], dtype=object)

Droping unnecessary attributes


In [59]: df.drop(df.columns[[0,10,11,12]], axis=1, inplace=True) #Droping App,Last Updated, Current Ver, and Android Ver attributes

In [60]: df.head()

Out[60]: Category Rating Reviews Size Installs Type Price Content Rating Genres

0 ART_AND_DESIGN 4.1 5.075174 19000 9.210440 Free 0 Everyone Art & Design

1 ART_AND_DESIGN 3.9 6.875232 14000 13.122365 Free 0 Everyone Art & Design;Pretend Play

4 ART_AND_DESIGN 4.3 6.875232 2800 11.512935 Free 0 Everyone Art & Design;Creativity

5 ART_AND_DESIGN 4.4 5.123964 5600 10.819798 Free 0 Everyone Art & Design

6 ART_AND_DESIGN 3.8 5.187386 19000 10.819798 Free 0 Everyone Art & Design

In [61]: df=pd.get_dummies(df,drop_first=True) #Creating dummies

In [62]: df.head()

Out[62]:
Genres
Rating Reviews Size Installs Price Category_AUTO_AND_VEHICLES Category_BEAUTY Category_BOOKS_AND_REFERENCE Category_BUSINESS Category_COMICS ...

0 4.1 5.075174 19000 9.210440 0 0 0 0 0 0 ...

1 3.9 6.875232 14000 13.122365 0 0 0 0 0 0 ...

4 4.3 6.875232 2800 11.512935 0 0 0 0 0 0 ...

5 4.4 5.123964 5600 10.819798 0 0 0 0 0 0 ...

6 3.8 5.187386 19000 10.819798 0 0 0 0 0 0 ...

5 rows × 150 columns

Model building
In [63]: from sklearn.model_selection import train_test_split #Importing modules for Linear Regression
from sklearn.linear_model import LinearRegression
linreg=LinearRegression()
from statsmodels.api import OLS
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as ms

Splitting into train and test


In [64]: X=df.iloc[:,1:]
y=df.iloc[:,:1]
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=1)
X_train.shape,X_test.shape

((4545, 149), (1949, 149))


Out[64]:

Model fitting
In [65]: Model=linreg.fit(X_train, y_train)
predict=linreg.predict(X_test)

y_test=np.array(y_test)
predict=np.array(predict)

a=pd.DataFrame({'Actual':y_test.flatten(),'Predicted':predict.flatten()});a.head(10)

Out[65]: Actual Predicted

0 5.0 4.073020

1 4.3 4.404918

2 3.4 4.454015

3 4.5 4.207862

4 4.2 4.078656

5 3.5 4.064958

6 4.4 4.056430

7 4.0 4.325298

8 4.0 4.120195

9 3.2 3.633948

In [66]: fig=a.head(25)
fig.iplot(kind='bar') #Ploting bar graph for predicted and actual values

5 Actual
Predicted

0
0 5 10 15 20

Export to plot.ly »

In [67]: results=OLS( y_train,X_train).fit()


results.summary()

Out[67]: OLS Regression Results


Dep. Variable: Rating R-squared (uncentered): 0.983

Model: OLS Adj. R-squared (uncentered): 0.982

Method: Least Squares F-statistic: 2161.

Date: Mon, 21 Aug 2023 Prob (F-statistic): 0.00

Time: 21:35:25 Log-Likelihood: -3712.4

No. Observations: 4545 AIC: 7659.

Df Residuals: 4428 BIC: 8410.

Df Model: 117

Covariance Type: nonrobust

coef std err t P>|t| [0.025 0.975]

Reviews 0.1754 0.008 21.031 0.000 0.159 0.192

Size -4.311e-07 4.73e-07 -0.910 0.363 -1.36e-06 4.97e-07

Installs -0.1567 0.008 -18.987 0.000 -0.173 -0.140

Price -0.0003 0.004 -0.090 0.928 -0.007 0.007

Category_AUTO_AND_VEHICLES 1.6114 0.162 9.919 0.000 1.293 1.930

Category_BEAUTY 1.6756 0.166 10.088 0.000 1.350 2.001

Category_BOOKS_AND_REFERENCE 1.6588 0.159 10.439 0.000 1.347 1.970

Category_BUSINESS 1.5499 0.158 9.823 0.000 1.241 1.859

Category_COMICS 1.5898 0.159 9.982 0.000 1.278 1.902

Category_COMMUNICATION 1.5011 0.159 9.466 0.000 1.190 1.812

Category_DATING 1.4563 0.160 9.118 0.000 1.143 1.769

Category_EDUCATION 2.6635 0.329 8.107 0.000 2.019 3.308

Category_ENTERTAINMENT 2.5734 0.334 7.711 0.000 1.919 3.228

Category_EVENTS 1.7343 0.164 10.543 0.000 1.412 2.057

Category_FAMILY 2.6907 0.319 8.424 0.000 2.065 3.317

Category_FINANCE 1.5209 0.158 9.637 0.000 1.211 1.830

Category_FOOD_AND_DRINK 1.5523 0.161 9.646 0.000 1.237 1.868

Category_GAME 3.0806 0.314 9.796 0.000 2.464 3.697

Category_HEALTH_AND_FITNESS 1.5755 0.158 9.962 0.000 1.265 1.886

Category_HOUSE_AND_HOME 1.5847 0.162 9.778 0.000 1.267 1.902

Category_LIBRARIES_AND_DEMO 1.6527 0.164 10.099 0.000 1.332 1.973

Category_LIFESTYLE 1.5437 0.158 9.787 0.000 1.234 1.853

Category_MAPS_AND_NAVIGATION 1.4828 0.160 9.256 0.000 1.169 1.797

Category_MEDICAL 1.6262 0.157 10.328 0.000 1.318 1.935

Category_NEWS_AND_MAGAZINES 1.5278 0.158 9.644 0.000 1.217 1.838

Category_PARENTING 2.6343 0.304 8.670 0.000 2.039 3.230

Category_PERSONALIZATION 1.6504 0.158 10.451 0.000 1.341 1.960

Category_PHOTOGRAPHY 1.4885 0.159 9.361 0.000 1.177 1.800

Category_PRODUCTIVITY 1.5450 0.158 9.772 0.000 1.235 1.855

Category_SHOPPING 1.5917 0.159 10.000 0.000 1.280 1.904

Category_SOCIAL 1.5708 0.159 9.889 0.000 1.259 1.882

Category_SPORTS 2.7215 0.638 4.265 0.000 1.470 3.972

Category_TOOLS 1.5057 0.157 9.581 0.000 1.198 1.814

Category_TRAVEL_AND_LOCAL 2.0868 0.280 7.460 0.000 1.538 2.635

Category_VIDEO_PLAYERS 1.5088 0.160 9.442 0.000 1.196 1.822

Category_WEATHER 1.5942 0.163 9.766 0.000 1.274 1.914

Type_Paid -0.0628 0.036 -1.738 0.082 -0.134 0.008

Content Rating_Everyone 1.5043 0.312 4.818 0.000 0.892 2.116

Content Rating_Everyone 10+ 1.5192 0.315 4.825 0.000 0.902 2.136

Content Rating_Mature 17+ 1.5783 0.316 5.000 0.000 0.959 2.197

Content Rating_Teen 1.5327 0.313 4.898 0.000 0.919 2.146

Content Rating_Unrated 1.254e-12 1.12e-11 0.112 0.911 -2.08e-11 2.33e-11

Genres_Action;Action & Adventure 0.6245 0.290 2.155 0.031 0.056 1.193

Genres_Adventure -0.0544 0.103 -0.530 0.596 -0.256 0.147

Genres_Adventure;Action & Adventure 0.3669 0.354 1.036 0.300 -0.328 1.062

Genres_Adventure;Brain Games 0.7192 0.576 1.249 0.212 -0.409 1.848

Genres_Arcade 0.0562 0.090 0.622 0.534 -0.121 0.233

Genres_Arcade;Action & Adventure 0.6327 0.247 2.565 0.010 0.149 1.116

Genres_Arcade;Pretend Play 0.7422 0.576 1.290 0.197 -0.386 1.870

Genres_Art & Design 3.5386 0.327 10.825 0.000 2.898 4.180

Genres_Art & Design;Creativity 3.0228 0.350 8.628 0.000 2.336 3.710

Genres_Art & Design;Pretend Play 1.9058 0.435 4.379 0.000 1.052 2.759

Genres_Auto & Vehicles 1.6114 0.162 9.919 0.000 1.293 1.930

Genres_Beauty 1.6756 0.166 10.088 0.000 1.350 2.001

Genres_Board -0.0414 0.142 -0.291 0.771 -0.321 0.238

Genres_Board;Action & Adventure 0.5512 0.575 0.959 0.338 -0.576 1.679

Genres_Board;Brain Games 0.6324 0.247 2.556 0.011 0.147 1.117

Genres_Board;Pretend Play 1.1267 0.576 1.958 0.050 -0.002 2.255

Genres_Books & Reference 1.6588 0.159 10.439 0.000 1.347 1.970

Genres_Books & Reference;Education -0.0598 0.575 -0.104 0.917 -1.187 1.067

Genres_Business 1.5499 0.158 9.823 0.000 1.241 1.859

Genres_Card -0.3223 0.144 -2.234 0.026 -0.605 -0.039

Genres_Card;Action & Adventure 0.2633 0.576 0.457 0.648 -0.867 1.393

Genres_Card;Brain Games 4.772e-12 1.63e-11 0.292 0.770 -2.72e-11 3.68e-11

Genres_Casino 0.0269 0.153 0.175 0.861 -0.274 0.327

Genres_Casual 0.2699 0.160 1.688 0.091 -0.043 0.583

Genres_Casual;Action & Adventure 0.1446 0.290 0.499 0.618 -0.424 0.713

Genres_Casual;Brain Games 0.7812 0.272 2.873 0.004 0.248 1.314

Genres_Casual;Creativity 0.8776 0.420 2.088 0.037 0.053 1.702

Genres_Casual;Education 0.5850 0.420 1.392 0.164 -0.239 1.409

Genres_Casual;Pretend Play 0.3825 0.220 1.740 0.082 -0.048 0.813

Genres_Comics 1.5898 0.159 9.982 0.000 1.278 1.902

Genres_Comics;Creativity 9.534e-12 2.54e-11 0.376 0.707 -4.02e-11 5.93e-11

Genres_Communication 1.5011 0.159 9.466 0.000 1.190 1.812

Genres_Communication;Creativity 0.6307 0.575 1.096 0.273 -0.497 1.758

Genres_Dating 1.4563 0.160 9.118 0.000 1.143 1.769

Genres_Education 0.6187 0.155 3.999 0.000 0.315 0.922

Genres_Education;Action & Adventure 0.7920 0.354 2.235 0.025 0.097 1.487

Genres_Education;Brain Games 0.5182 0.580 0.893 0.372 -0.620 1.656

Genres_Education;Creativity 1.1910 0.580 2.054 0.040 0.054 2.328

Genres_Education;Education 0.7598 0.186 4.094 0.000 0.396 1.124

Genres_Education;Music & Video 0.7869 0.575 1.369 0.171 -0.340 1.914

Genres_Education;Pretend Play 0.7626 0.196 3.886 0.000 0.378 1.147

Genres_Educational 0.2810 0.191 1.474 0.141 -0.093 0.655

Genres_Educational;Action & Adventure 0.7051 0.575 1.227 0.220 -0.422 1.832

Genres_Educational;Brain Games 0.6380 0.354 1.802 0.072 -0.056 1.332

Genres_Educational;Creativity 0.4813 0.354 1.359 0.174 -0.213 1.176

Genres_Educational;Education 0.7662 0.191 4.010 0.000 0.392 1.141

Genres_Educational;Pretend Play 0.5484 0.239 2.298 0.022 0.081 1.016

Genres_Entertainment 0.4409 0.154 2.864 0.004 0.139 0.743

Genres_Entertainment;Action & Adventure 0.8631 0.575 1.501 0.134 -0.264 1.991

Genres_Entertainment;Brain Games 0.6633 0.356 1.865 0.062 -0.034 1.361

Genres_Entertainment;Creativity 0.8870 0.423 2.096 0.036 0.057 1.717

Genres_Entertainment;Education 0.8131 0.575 1.414 0.157 -0.314 1.941

Genres_Entertainment;Music & Video 0.4474 0.259 1.726 0.084 -0.061 0.956

Genres_Entertainment;Pretend Play -0.0523 0.575 -0.091 0.928 -1.180 1.075

Genres_Events 1.7343 0.164 10.543 0.000 1.412 2.057

Genres_Finance 1.5209 0.158 9.637 0.000 1.211 1.830

Genres_Food & Drink 1.5523 0.161 9.646 0.000 1.237 1.868

Genres_Health & Fitness 1.5755 0.158 9.962 0.000 1.265 1.886

Genres_Health & Fitness;Action & Adventure 0.0911 0.575 0.158 0.874 -1.037 1.219

Genres_Health & Fitness;Education 0.7196 0.575 1.251 0.211 -0.408 1.847

Genres_House & Home 1.5847 0.162 9.778 0.000 1.267 1.902

Genres_Libraries & Demo 1.6527 0.164 10.099 0.000 1.332 1.973

Genres_Lifestyle 1.5437 0.158 9.787 0.000 1.234 1.853

Genres_Lifestyle;Education 5.338e-12 1.03e-11 0.521 0.603 -1.48e-11 2.54e-11

Genres_Maps & Navigation 1.4828 0.160 9.256 0.000 1.169 1.797

Genres_Medical 1.6262 0.157 10.328 0.000 1.318 1.935

Genres_Music -0.2141 0.169 -1.270 0.204 -0.545 0.116

Genres_Music & Audio;Music & Video 0.8983 0.575 1.562 0.118 -0.229 2.026

Genres_Music;Music & Video 0.6186 0.420 1.472 0.141 -0.205 1.443

Genres_News & Magazines 1.5278 0.158 9.644 0.000 1.217 1.838

Genres_Parenting 0.8752 0.199 4.394 0.000 0.485 1.266

Genres_Parenting;Brain Games 0.3904 0.467 0.836 0.403 -0.525 1.306

Genres_Parenting;Education 0.5311 0.467 1.137 0.255 -0.384 1.447

Genres_Parenting;Music & Video 0.8376 0.308 2.722 0.007 0.234 1.441

Genres_Personalization 1.6504 0.158 10.451 0.000 1.341 1.960

Genres_Photography 1.4885 0.159 9.361 0.000 1.177 1.800

Genres_Productivity 1.5450 0.158 9.772 0.000 1.235 1.855

Genres_Puzzle 0.5096 0.157 3.255 0.001 0.203 0.817

Genres_Puzzle;Action & Adventure 0.5776 0.575 1.005 0.315 -0.549 1.705

Genres_Puzzle;Brain Games 0.6492 0.238 2.724 0.006 0.182 1.117

Genres_Puzzle;Creativity 0.5923 0.420 1.409 0.159 -0.232 1.417

Genres_Puzzle;Education 1.0383 0.575 1.806 0.071 -0.089 2.165

Genres_Racing -0.0828 0.104 -0.794 0.427 -0.287 0.122

Genres_Racing;Action & Adventure 0.8047 0.291 2.770 0.006 0.235 1.374

Genres_Racing;Pretend Play 1.1510 0.575 2.000 0.046 0.023 2.279

Genres_Role Playing 0.3399 0.159 2.140 0.032 0.029 0.651

Genres_Role Playing;Action & Adventure 0.7178 0.576 1.246 0.213 -0.411 1.847

Genres_Role Playing;Pretend Play 0.2211 0.354 0.625 0.532 -0.473 0.915

Genres_Shopping 1.5917 0.159 10.000 0.000 1.280 1.904

Genres_Simulation 0.3970 0.160 2.489 0.013 0.084 0.710

Genres_Simulation;Action & Adventure 0.4516 0.354 1.274 0.203 -0.243 1.146

Genres_Simulation;Education 0.3856 0.402 0.959 0.338 -0.403 1.174

Genres_Simulation;Pretend Play 0.3858 0.420 0.919 0.358 -0.438 1.209

Genres_Social 1.5708 0.159 9.889 0.000 1.259 1.882

Genres_Sports 0.3928 0.558 0.704 0.481 -0.701 1.486

Genres_Sports;Action & Adventure 0.4389 0.354 1.238 0.216 -0.256 1.134

Genres_Strategy 0.2159 0.163 1.321 0.187 -0.105 0.536

Genres_Strategy;Action & Adventure 0.6988 0.420 1.663 0.096 -0.125 1.522

Genres_Strategy;Creativity 0.3320 0.575 0.577 0.564 -0.796 1.460

Genres_Strategy;Education 0 0 nan nan 0 0

Genres_Tools 1.5057 0.157 9.581 0.000 1.198 1.814

Genres_Travel & Local 0.9623 0.216 4.449 0.000 0.538 1.386

Genres_Travel & Local;Action & Adventure 1.1246 0.385 2.921 0.004 0.370 1.879

Genres_Trivia -0.0398 0.149 -0.267 0.790 -0.332 0.253

Genres_Video Players & Editors 1.5088 0.160 9.442 0.000 1.196 1.822

Genres_Weather 1.5942 0.163 9.766 0.000 1.274 1.914

Genres_Word 0.1231 0.233 0.528 0.598 -0.334 0.580

Omnibus: 1428.299 Durbin-Watson: 1.980

Prob(Omnibus): 0.000 Jarque-Bera (JB): 5716.976

Skew: -1.506 Prob(JB): 0.00

Kurtosis: 7.595 Cond. No. 2.18e+21

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The smallest eigenvalue is 7.11e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

R2 Score
In [68]: print('R2_Score=',r2_score(y_test,predict))
print('Root Mean Squared Error=',np.sqrt(ms(y_test,predict)))
print('Prediction Error Percentage is',round((0.50/np.mean(y_test))*100))

R2_Score= 0.12274918284357272
Root Mean Squared Error= 0.5333728534851824
Prediction Error Percentage is 12

DECISION TREE
In [69]: #Decision Tree Model
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
#Fitting
dt.fit(X_train,y_train)

Out[69]: ▾ DecisionTreeRegressor

DecisionTreeRegressor()

In [70]: pred1 = dt.predict(X_test)


mae_dt = metrics.mean_absolute_error(y_test,pred1)
mse_dt = metrics.mean_squared_error(y_test, pred1)
rmse_dt = np.sqrt(metrics.mean_squared_error(y_test, pred1))

print( mae_dt, mse_dt, rmse_dt)

0.4865059004617753 0.5531759876859929 0.7437580168885528

FINAL OBSERVATION
we have performed both the linear regression and decision tree algorithms for model prediction .and we have observed that linear regression provides the best results with rmse (root
mean squared error)= 0.53 and error percentage of 12%. SO WE CAN SAY That regression model is an excellent start to prefdict the ratings of an apps given categories used in the
model.

In [ ]:

In [ ]:

In [ ]:

You might also like