Data Science

TASK 6
I HAVE DONE THIS TASK IS ENTIRELY BASED UPON MY B.TECH PROJECT AND
DATSET IS FROM MY SIDE ONY. ADDITIONAL TO MY TASK GIVEN,I HAVE DONE
MACHINE LEARNING PART ALSO BY APPLYING ONLY TWO ALGORITHMS
IMPORTING MODULES
In [1]: import numpy as np

import pandas as pd
import plotly.express as px
In [2]: pip install cufflinks
Requirement already satisfied: cufflinks in c:\users\rupes\anaconda3\lib\site-packages (0.17.3)

Requirement already satisfied: numpy>=1.9.2 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (1.24.3)
Requirement already satisfied: pandas>=0.19.2 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (1.5.3)
Requirement already satisfied: plotly>=4.1.1 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (5.9.0)
Requirement already satisfied: six>=1.9.0 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (1.16.0)
Requirement already satisfied: colorlover>=0.2.1 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (0.3.0)
Requirement already satisfied: setuptools>=34.4.1 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (68.0.0)
Requirement already satisfied: ipython>=5.3.0 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (8.12.0)
Requirement already satisfied: ipywidgets>=7.0.0 in c:\users\rupes\anaconda3\lib\site-packages (from cufflinks) (8.0.4)
Requirement already satisfied: backcall in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: decorator in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (5.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.18.1)
Requirement already satisfied: matplotlib-inline in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.1.6)
Requirement already satisfied: pickleshare in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflink
s) (3.0.36)
Requirement already satisfied: pygments>=2.4.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (2.15.1)
Requirement already satisfied: stack-data in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.2.0)
Requirement already satisfied: traitlets>=5 in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (5.7.1)
Requirement already satisfied: colorama in c:\users\rupes\anaconda3\lib\site-packages (from ipython>=5.3.0->cufflinks) (0.4.6)
Requirement already satisfied: ipykernel>=4.5.1 in c:\users\rupes\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (6.19.2)
Requirement already satisfied: widgetsnbextension~=4.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (4.0.5)
Requirement already satisfied: jupyterlab-widgets~=3.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipywidgets>=7.0.0->cufflinks) (3.0.5)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\rupes\anaconda3\lib\site-packages (from pandas>=0.19.2->cufflinks) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\rupes\anaconda3\lib\site-packages (from pandas>=0.19.2->cufflinks) (2022.7)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\rupes\anaconda3\lib\site-packages (from plotly>=4.1.1->cufflinks) (8.2.2)
Requirement already satisfied: comm>=0.1.1 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (0.
1.2)
Requirement already satisfied: debugpy>=1.0 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.
6.7)
Requirement already satisfied: jupyter-client>=6.1.12 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cuff
links) (7.4.9)
Requirement already satisfied: nest-asyncio in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (1.
5.6)
Requirement already satisfied: packaging in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (23.0)
Requirement already satisfied: psutil in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (5.9.0)
Requirement already satisfied: pyzmq>=17 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (23.2.
0)
Requirement already satisfied: tornado>=6.1 in c:\users\rupes\anaconda3\lib\site-packages (from ipykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (6.
3.2)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in c:\users\rupes\anaconda3\lib\site-packages (from jedi>=0.16->ipython>=5.3.0->cufflinks) (0.8.
3)
Requirement already satisfied: wcwidth in c:\users\rupes\anaconda3\lib\site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython>=5.3.0->
cufflinks) (0.2.5)
Requirement already satisfied: executing in c:\users\rupes\anaconda3\lib\site-packages (from stack-data->ipython>=5.3.0->cufflinks) (0.8.3)
Requirement already satisfied: asttokens in c:\users\rupes\anaconda3\lib\site-packages (from stack-data->ipython>=5.3.0->cufflinks) (2.0.5)
Requirement already satisfied: pure-eval in c:\users\rupes\anaconda3\lib\site-packages (from stack-data->ipython>=5.3.0->cufflinks) (0.2.2)
Requirement already satisfied: entrypoints in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-client>=6.1.12->ipykernel>=4.5.1->ipywidgets
>=7.0.0->cufflinks) (0.4)
Requirement already satisfied: jupyter-core>=4.9.2 in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-client>=6.1.12->ipykernel>=4.5.1->ip
ywidgets>=7.0.0->cufflinks) (5.3.0)
Requirement already satisfied: platformdirs>=2.5 in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-core>=4.9.2->jupyter-client>=6.1.12->i
pykernel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (2.5.2)
Requirement already satisfied: pywin32>=300 in c:\users\rupes\anaconda3\lib\site-packages (from jupyter-core>=4.9.2->jupyter-client>=6.1.12->ipyker
nel>=4.5.1->ipywidgets>=7.0.0->cufflinks) (305.1)
Note: you may need to restart the kernel to use updated packages.
In [3]: import cufflinks as cf

from plotly.offline import download_plotlyjs, init_notebook_mode,plot,iplot
import plotly.graph_objects as go
import plotly.offline as iplot
import chart_studio
init_notebook_mode(connected=True)
cf.go_offline()
C:\Users\rupes\anaconda3\Lib\site-packages\paramiko\transport.py:219: CryptographyDeprecationWarning:
Blowfish has been deprecated
Loading dataset using Pandas

In [4]: df = pd.read_csv('googleappsplaystore.csv') #Loading Dataset
In [5]: df.head() #Top 5 Rows of the Dataset
Out[5]: Content Android

App Category Rating Reviews Size Installs Type Price Genres Last Updated Current Ver
Rating Ver
Photo Editor & Candy Camera & Grid January 7, 4.0.3 and
0 ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design 1.0.0
& ScrapBook 2018 up
Art & Design;Pretend January 15, 4.0.3 and

1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone 2.0.0
Play 2018 up
U Launcher Lite – FREE Live Cool August 1, 4.0.3 and

2 ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design 1.2.4
Themes, Hide ... 2018 up
Varies with
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 4.2 and up
device
Pixel Draw - Number Art Coloring

4 ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
Book
In [6]: df.tail() #Bottom 5 Rows of Dataset
Out[6]: Content Last

App Category Rating Reviews Size Installs Type Price Genres Current Ver Android Ver
Rating Updated
July 25,
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0 Everyone Education 1.48 4.1 and up
2017
Fr. Mike Schmitz Audio

10837 FAMILY 5.0 4 3.6M 100+ Free 0 Everyone Education July 6, 2018 1.0 4.1 and up
Teachings
January 20,
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical 1.0 2.2 and up
2017
The SCP Foundation DB fr Varies with Books & January 19, Varies with Varies with
10839 BOOKS_AND_REFERENCE 4.5 114 1,000+ Free 0 Mature 17+
nn5n device Reference 2015 device device
iHoroscope - 2018 Daily July 25, Varies with Varies with

10840 LIFESTYLE 4.5 398307 19M 10,000,000+ Free 0 Everyone Lifestyle
Horoscope & Astrology 2018 device device
In [7]: df.shape #Shape of the dataset (Rows x Columns)
(10841, 13)
Out[7]:
In [8]: df.size #Total No.of Elements in dataset
140933
Out[8]:
In [9]: df.dtypes #Data type of each column
App object
Out[9]:
Category object
Rating float64
Reviews object
Size object
Installs object
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object
Checking Nullvalues
In [10]: df.isnull().sum() #Finding the count of total null values for each Attribute
App 0
Out[10]:
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64
Droping Null values

In [11]: df.dropna(inplace=True) #Droping Null values of each Attribute
In [12]: df.describe() #Observing the Statistic distribution of Data
Out[12]: Rating
count 9360.000000
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
In [13]: df.Rating.describe() #Observing the Statistical distribution of "Rating" attribute
count 9360.000000
Out[13]:
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64
Filling Null values with Median value

In [14]: df['Rating'].fillna(df['Rating'].median()) #Filling null values of "Rating" attribute with median of "Rating" attribute
0 4.1
Out[14]:
1 3.9
2 4.7
3 4.5
4 4.3
...
10834 4.0
10836 4.5
10837 5.0
10839 4.5
10840 4.5
Name: Rating, Length: 9360, dtype: float64
In [15]: df['Rating'].describe() #Observing the Statistical distribution of "Rating" attribute
count 9360.000000
Out[15]:
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64
In [16]: df.isnull().sum() #Finding the count of total null values for each Attribute
App 0
Out[16]:
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
dtype: int64
Converting MB's to KB's

In [17]: df['Size']=df['Size'].replace({ 'M': '*1e3', '-':'-1','Varies with device':'0','k':'1'}, regex=True).map(pd.eval).astype(int) #Converting App Size
Removing '+' from Installs column/Attribute

In [18]: df['Installs']=df['Installs'].map(lambda x:x.replace(',','').replace('+','')).astype(int)
In [19]: df['Installs']
0 10000
Out[19]:
1 500000
2 5000000
3 50000000
4 100000
...
10834 500
10836 5000
10837 100
10839 1000
10840 10000000
Name: Installs, Length: 9360, dtype: int32
Converting Reviews attribute to Int Datatype

In [20]: df['Reviews']=pd.to_numeric(df['Reviews']) #Changing Reviews attribute to Numeric Datatype
In [21]: df.dtypes
App object
Out[21]:
Category object
Rating float64
Reviews int64
Size int32
Installs int32
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object
Removing $ and , from Price Attribute

In [22]: df['Price']=df['Price'].map(lambda x:x.replace('$',''))
df['Price']=df['Price'].astype('float')
df['Price']=df['Price'].astype('int')
In [23]: len(df[df.Rating>5]) #Finding the number of elements having Rating greater than 5
0
Out[23]:
In [24]: df_reviews = (df[df.Reviews>df.Installs]) #Finding the Count of elements in Reviews greater than Installs
len(df_reviews)
7
Out[24]:
In [25]: len(df[(df.Type=='free')&(df.Price>0)]) #Finding the count of elements which are having Type as "Free" and Price greater than 0
0
Out[25]:
Droping Reviews where value of Reviews>Installs

In [26]: df.drop(df[df['Reviews']>df['Installs']].index,inplace=True) #Droping rows where value of Reviews greater than Installs
In [27]: df['Reviews']
0 159
Out[27]:
1 967
2 87510
3 215644
4 967
...
10834 7
10836 38
10837 4
10839 114
10840 398307
Name: Reviews, Length: 9353, dtype: int64
Data Visualization
WE CAN DO DATA VISUALIZATION IN SEABORN ,MATPLOTLIB ALSO. BUT I

PREFER PLOTLY .
In [28]: fig1 = px.box(df,y='Price') #Box plot for Price Attribute
fig1.show()
400
350
300
250
Price
200
150
100
50
Observation: Yes there are outliers for Price attribute ranging from (1 to 400)
In [29]: fig = px.line(df,y="Price") #Plotting graph for finding the skewness for the price Attribute
In [30]: fig.show()
400
350
300
250
Price
200
150
100
50
0 2k 4k 6k 8k 10k
index
In [31]: fig2 = px.box(df,y="Reviews") #Boxplot for Reviews Attribute

fig2.show()
80M
70M
60M
50M
Reviews
40M
30M
20M
10M
Observation : Yes there are very high number of reviews
In [32]: fig3 = px.histogram(df,x="Rating") #Histogram for Rating attribute

fig3.show()
1000
800
count
600
400
200
0
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Observation : It is right skewed between 4 to 5
In [33]: fig4 = px.histogram(df,x="Size") #Histogram for Size attribute

fig4.show()
2000
1500
count
1000
500
0
0 20k 40k 60k 80k 100k
Size
Observations : It is left skewed between 0 and 20k
In [34]: df[df.Price>200] #Finding the rows having price greater than 200$
Out[34]: App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
4197 most expensive app (H) FAMILY 4.3 6 1500 100 Paid 399 Everyone Entertainment July 16, 2018 1.0 7.0 and up
4362 💎 I'm rich LIFESTYLE 3.8 718 26000 10000 Paid 399 Everyone Lifestyle March 11, 2018 1.0.0 4.4 and up
4367 I'm Rich - Trump Edition LIFESTYLE 3.6 275 7300 10000 Paid 400 Everyone Lifestyle May 3, 2018 1.0.1 4.1 and up
5351 I am rich LIFESTYLE 3.8 3547 1800 100000 Paid 399 Everyone Lifestyle January 12, 2018 2.0 4.0.3 and up
5354 I am Rich Plus FAMILY 4.0 856 8700 10000 Paid 399 Everyone Entertainment May 19, 2018 3.0 4.4 and up
5355 I am rich VIP LIFESTYLE 3.8 411 2600 10000 Paid 299 Everyone Lifestyle July 21, 2018 1.1.1 4.3 and up
5356 I Am Rich Premium FINANCE 4.1 1867 4700 50000 Paid 399 Everyone Finance November 12, 2017 1.6 4.0 and up
5357 I am extremely Rich LIFESTYLE 2.9 41 2900 1000 Paid 379 Everyone Lifestyle July 1, 2018 1.0 4.0 and up
5358 I am Rich! FINANCE 3.8 93 22000 1000 Paid 399 Everyone Finance December 11, 2017 1.0 4.1 and up
5359 I am rich(premium) FINANCE 3.5 472 9651 5000 Paid 399 Everyone Finance May 1, 2017 3.4 4.4 and up
5362 I Am Rich Pro FAMILY 4.4 201 2700 5000 Paid 399 Everyone Entertainment May 30, 2017 1.54 1.6 and up
5364 I am rich (Most expensive app) FINANCE 4.1 129 2700 1000 Paid 399 Teen Finance December 6, 2017 2 4.0.3 and up
5366 I Am Rich FAMILY 3.6 217 4900 10000 Paid 389 Everyone Entertainment June 22, 2018 1.5 4.2 and up
5369 I am Rich FINANCE 4.3 180 3800 5000 Paid 399 Everyone Finance March 22, 2018 1.0 4.2 and up
5373 I AM RICH PRO PLUS FINANCE 4.0 36 41000 1000 Paid 399 Everyone Finance June 25, 2018 1.0.2 4.1 and up
In [35]: len(df[df.Price>200]) #Finding the count of elements having Price greater than 200
15
Out[35]:
In [36]: df.drop(df[df['Price']>200].index,inplace=True) #Droping rows having Price greater than 200
In [37]: len(df[df.Price>200])
0
Out[37]:
In [38]: len(df[df.Reviews>2000000]) #Finding the number of values having Reviews value greater than 2000000
453
Out[38]:
In [39]: df.drop(df[df['Reviews']>200000].index,inplace=True) #Droping the rows having Reviews value greater than 2000000
In [40]: len(df[df.Reviews>2000000])
0
Out[40]:
In [41]: np.percentile(df['Installs'], 10) #Finding 10 percentile of Installs attribute
1000.0
Out[41]:
10000.0
Out[42]:
100000.0
Out[43]:
1000000.0
Out[44]:
5000000.0
Out[45]:
10000000.0
Out[46]:
10000000.0
Out[47]:
In [48]: fig_installs = px.box(df,y="Installs") #Boxplot for Installs Attribute

fig_installs.show()
100M
80M
60M
Installs
40M
20M
Droping rows which are greater than Inter quartile range

In [49]: for cols in range(len(df['Installs'])): #Finding the inter quartile range
Q3 = df['Installs'].quantile(0.75) #Inter quartile range as threshold
Q1 = df['Installs'].quantile(0.25)
IQR = Q3 - Q1
count = df.loc[(df['Installs'] < (Q1 - 1.5 * IQR)) | (df['Installs'] > (Q3 + 1.5 * IQR))]
In [50]: count
Out[50]: Content
App Category Rating Reviews Size Installs Type Price Genres Last Updated Current Ver Android Ver
Rating
U Launcher Lite – FREE Live Cool August 1,

2 ART_AND_DESIGN 4.7 87510 8700 5000000 Free 0 Everyone Art & Design 1.2.4 4.0.3 and up
Themes, Hide ... 2018
12 Tattoo Name On My Photo Editor ART_AND_DESIGN 4.2 44829 20000 10000000 Free 0 Teen Art & Design April 2, 2018 3.8 4.1 and up
August 3,
18 FlipaClip - Cartoon animation ART_AND_DESIGN 4.3 194216 39000 5000000 Free 0 Everyone Art & Design 2.2.5 4.0.3 and up
2018
Varies with
37 Floor Plan Creator ART_AND_DESIGN 4.1 36639 0 5000000 Free 0 Everyone Art & Design July 14, 2018 2.3.3 and up
device
Canva: Poster, banner, card maker &

45 ART_AND_DESIGN 4.7 174531 24000 10000000 Free 0 Everyone Art & Design July 31, 2018 1.6.1 4.1 and up
graphic de...
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10714 FunForMobile Ringtones & Chat SOCIAL 4.4 68358 7200 5000000 Free 0 Mature 17+ Social May 7, 2016 3.22 4.1 and up
August 5,
10716 Free Slideshow Maker & Video Editor PHOTOGRAPHY 4.2 162564 11000 10000000 Free 0 Everyone Photography 5.2 4.0 and up
2018
February 7,
10723 Mobile Kick SPORTS 4.3 111809 40000 10000000 Free 0 Everyone Sports 1.0.21 4.1 and up
2018
October 22,
10731 FeaturePoints: Free Gift Cards FAMILY 3.9 121321 46000 5000000 Free 0 Everyone Entertainment 8.7 4.0.3 and up
2016
Frim: get new friends on local chat March 23, Varies with Varies with
10826 SOCIAL 4.0 88486 0 5000000 Free 0 Mature 17+ Social
rooms 2018 device device
1198 rows × 13 columns
In [51]: df.drop(count.index,inplace=True) #Droping the Rows with the outliers above the inter quartile range
Data Visualization after droping values having more than Inter quartile range
In [52]: fig_installs = px.box(df,y="Installs") #Boxplot for Installs attribute after droping the outliers above inter quartile range
fig_installs.show()
1M
0.8M
0.6M
Installs
0.4M
0.2M
In [53]: fig_rvsp = px.scatter(df, x="Rating", y="Price",color='Category') #Scatterplot between Rating and Price and Coloring it with Category attribute
fig_rvsp.show()
80 Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
70
BEAUTY
BOOKS_AND_REFERENCE
60 BUSINESS
COMICS
50 COMMUNICATION
DATING
Price
40 EDUCATION
ENTERTAINMENT
EVENTS
30
FINANCE
FOOD_AND_DRINK
20 HEALTH_AND_FITNESS
HOUSE_AND_HOME
10 LIBRARIES_AND_DEMO
LIFESTYLE
0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Rating doesn't necessarily increase with the price. There are highly rated apps for lower priced apps. So Price doesn't have a considerable effect on the rating.
In [54]: fig_rvss = px.scatter(df, x="Rating", y="Size",color='Category') #Scatterplot between Rating and Size and Coloring it with Category attribute
fig_rvss.show()
Category
100k
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
80k BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
60k
DATING
EDUCATION
Size
ENTERTAINMENT
40k EVENTS
FINANCE
FOOD_AND_DRINK
HEALTH_AND_FITNESS
20k HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Heavier apps does seem to be rated higher but it is not conclusive. Lighter apps also have a good rating. SO heavier apps having higher rating may not always be true.
In [55]: fig_rvsr = px.scatter(df, x="Rating", y="Reviews",color='Category') #Scatterplot between Rating VS Reviews and Coloring it with Category attribute
fig_rvsr.show()
200k Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
150k BUSINESS
COMICS
COMMUNICATION
DATING
Reviews
100k EDUCATION
ENTERTAINMENT
EVENTS
FINANCE
FOOD_AND_DRINK
50k HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Heavily reviewed apps does seem to have a higher rating but not all the time. For example there is a game with just 3.8 rating for 159k reviews
In [56]: fig_rvscr = px.box(df,x = 'Content Rating',y='Rating',color='Content Rating') #Boxplot between Rating and Content Rating and Coloring it with Conte
fig_rvscr.show()
5 Content Rating
Everyone
Everyone 10+
4.5
Teen
Mature 17+
4 Adults only 18+
Unrated
3.5
Rating
2.5
1.5
Everyone Everyone 10+ Teen Mature 17+ Adults only 18+ Unrated
Content Rating
Eventhough median ratings for all types above are similar, Adults only 18+ does overall have a better overall rating.
In [57]: fig_rvsc = px.box(df,x = 'Category',y='Rating',color="Category") #Boxplot between Rating and Category and Coloring it with Category attribute
fig_rvsc.show()
5 Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
4 BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
Rating
3 COMICS
COMMUNICATION
DATING
EDUCATION
2
ENTERTAINMENT
EVENTS
FINANCE
1
FOOD AND DRINK
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
DATING
EDUCATION
ENTERTAINMENT
EVENTS
FINANCE
FOOD_AND_DRINK
HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
GAME
FAMILY
MEDICAL
SOCIAL
SHOPPING
PHOTOGRAPHY
SPORTS
TRAVEL_AND_LOCAL
TOOLS
PERSONALIZATION
PRODUCTIVITY
PARENTING
WEATHER
VIDEO_PLAYERS
NEWS_AND_MAGAZINES
MAPS_AND_NAVIGATION
Category
Among all categories Events category does seem have overall better ratings
Reducing Skewness by using Log transformation

In [58]: df.Reviews=df.Reviews.apply(func=np.log1p) #Reducing the Skew by applying Log transformation(np.log1p)
df.Installs=df.Installs.apply(func=np.log1p)
df.hist(column=['Reviews','Installs']) # Histogram for Reviews and Installs attributes
array([[<Axes: title={'center': 'Reviews'}>,

Out[58]:
<Axes: title={'center': 'Installs'}>]], dtype=object)
Droping unnecessary attributes

In [59]: df.drop(df.columns[[0,10,11,12]], axis=1, inplace=True) #Droping App,Last Updated, Current Ver, and Android Ver attributes
In [60]: df.head()
Out[60]: Category Rating Reviews Size Installs Type Price Content Rating Genres
0 ART_AND_DESIGN 4.1 5.075174 19000 9.210440 Free 0 Everyone Art & Design
1 ART_AND_DESIGN 3.9 6.875232 14000 13.122365 Free 0 Everyone Art & Design;Pretend Play
4 ART_AND_DESIGN 4.3 6.875232 2800 11.512935 Free 0 Everyone Art & Design;Creativity
In [61]: df=pd.get_dummies(df,drop_first=True) #Creating dummies
In [62]: df.head()
Out[62]:
Genres
Rating Reviews Size Installs Price Category_AUTO_AND_VEHICLES Category_BEAUTY Category_BOOKS_AND_REFERENCE Category_BUSINESS Category_COMICS ...
0 4.1 5.075174 19000 9.210440 0 0 0 0 0 0 ...
1 3.9 6.875232 14000 13.122365 0 0 0 0 0 0 ...
4 4.3 6.875232 2800 11.512935 0 0 0 0 0 0 ...
5 4.4 5.123964 5600 10.819798 0 0 0 0 0 0 ...
6 3.8 5.187386 19000 10.819798 0 0 0 0 0 0 ...
5 rows × 150 columns
Model building
In [63]: from sklearn.model_selection import train_test_split #Importing modules for Linear Regression
from sklearn.linear_model import LinearRegression
linreg=LinearRegression()
from statsmodels.api import OLS
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as ms
Splitting into train and test

In [64]: X=df.iloc[:,1:]
y=df.iloc[:,:1]
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=1)
X_train.shape,X_test.shape
((4545, 149), (1949, 149))

Out[64]:
Model fitting
In [65]: Model=linreg.fit(X_train, y_train)
predict=linreg.predict(X_test)
y_test=np.array(y_test)
predict=np.array(predict)
a=pd.DataFrame({'Actual':y_test.flatten(),'Predicted':predict.flatten()});a.head(10)
Out[65]: Actual Predicted
0 5.0 4.073020
1 4.3 4.404918
2 3.4 4.454015
3 4.5 4.207862
4 4.2 4.078656
5 3.5 4.064958
6 4.4 4.056430
7 4.0 4.325298
8 4.0 4.120195
9 3.2 3.633948
In [66]: fig=a.head(25)
fig.iplot(kind='bar') #Ploting bar graph for predicted and actual values
5 Actual
Predicted
0
0 5 10 15 20
Export to plot.ly »
In [67]: results=OLS( y_train,X_train).fit()

results.summary()
Out[67]: OLS Regression Results

Dep. Variable: Rating R-squared (uncentered): 0.983
Model: OLS Adj. R-squared (uncentered): 0.982
Method: Least Squares F-statistic: 2161.
Date: Mon, 21 Aug 2023 Prob (F-statistic): 0.00
Time: 21:35:25 Log-Likelihood: -3712.4
No. Observations: 4545 AIC: 7659.
Df Residuals: 4428 BIC: 8410.
Df Model: 117
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Reviews 0.1754 0.008 21.031 0.000 0.159 0.192
Size -4.311e-07 4.73e-07 -0.910 0.363 -1.36e-06 4.97e-07
Installs -0.1567 0.008 -18.987 0.000 -0.173 -0.140
Price -0.0003 0.004 -0.090 0.928 -0.007 0.007
Category_AUTO_AND_VEHICLES 1.6114 0.162 9.919 0.000 1.293 1.930
Category_BEAUTY 1.6756 0.166 10.088 0.000 1.350 2.001
Category_BOOKS_AND_REFERENCE 1.6588 0.159 10.439 0.000 1.347 1.970
Category_BUSINESS 1.5499 0.158 9.823 0.000 1.241 1.859
Category_COMICS 1.5898 0.159 9.982 0.000 1.278 1.902
Category_COMMUNICATION 1.5011 0.159 9.466 0.000 1.190 1.812
Category_DATING 1.4563 0.160 9.118 0.000 1.143 1.769
Category_EDUCATION 2.6635 0.329 8.107 0.000 2.019 3.308
Category_ENTERTAINMENT 2.5734 0.334 7.711 0.000 1.919 3.228
Category_EVENTS 1.7343 0.164 10.543 0.000 1.412 2.057
Category_FAMILY 2.6907 0.319 8.424 0.000 2.065 3.317
Category_FINANCE 1.5209 0.158 9.637 0.000 1.211 1.830
Category_FOOD_AND_DRINK 1.5523 0.161 9.646 0.000 1.237 1.868
Category_GAME 3.0806 0.314 9.796 0.000 2.464 3.697
Category_HEALTH_AND_FITNESS 1.5755 0.158 9.962 0.000 1.265 1.886
Category_HOUSE_AND_HOME 1.5847 0.162 9.778 0.000 1.267 1.902
Category_LIBRARIES_AND_DEMO 1.6527 0.164 10.099 0.000 1.332 1.973
Category_LIFESTYLE 1.5437 0.158 9.787 0.000 1.234 1.853
Category_MAPS_AND_NAVIGATION 1.4828 0.160 9.256 0.000 1.169 1.797
Category_MEDICAL 1.6262 0.157 10.328 0.000 1.318 1.935
Category_NEWS_AND_MAGAZINES 1.5278 0.158 9.644 0.000 1.217 1.838
Category_PARENTING 2.6343 0.304 8.670 0.000 2.039 3.230
Category_PERSONALIZATION 1.6504 0.158 10.451 0.000 1.341 1.960
Category_PHOTOGRAPHY 1.4885 0.159 9.361 0.000 1.177 1.800
Category_PRODUCTIVITY 1.5450 0.158 9.772 0.000 1.235 1.855
Category_SHOPPING 1.5917 0.159 10.000 0.000 1.280 1.904
Category_SOCIAL 1.5708 0.159 9.889 0.000 1.259 1.882
Category_SPORTS 2.7215 0.638 4.265 0.000 1.470 3.972
Category_TOOLS 1.5057 0.157 9.581 0.000 1.198 1.814
Category_TRAVEL_AND_LOCAL 2.0868 0.280 7.460 0.000 1.538 2.635
Category_VIDEO_PLAYERS 1.5088 0.160 9.442 0.000 1.196 1.822
Category_WEATHER 1.5942 0.163 9.766 0.000 1.274 1.914
Type_Paid -0.0628 0.036 -1.738 0.082 -0.134 0.008
Content Rating_Everyone 1.5043 0.312 4.818 0.000 0.892 2.116
Content Rating_Everyone 10+ 1.5192 0.315 4.825 0.000 0.902 2.136
Content Rating_Mature 17+ 1.5783 0.316 5.000 0.000 0.959 2.197
Content Rating_Teen 1.5327 0.313 4.898 0.000 0.919 2.146
Content Rating_Unrated 1.254e-12 1.12e-11 0.112 0.911 -2.08e-11 2.33e-11
Genres_Action;Action & Adventure 0.6245 0.290 2.155 0.031 0.056 1.193
Genres_Adventure -0.0544 0.103 -0.530 0.596 -0.256 0.147
Genres_Adventure;Action & Adventure 0.3669 0.354 1.036 0.300 -0.328 1.062
Genres_Adventure;Brain Games 0.7192 0.576 1.249 0.212 -0.409 1.848
Genres_Arcade 0.0562 0.090 0.622 0.534 -0.121 0.233
Genres_Arcade;Action & Adventure 0.6327 0.247 2.565 0.010 0.149 1.116
Genres_Arcade;Pretend Play 0.7422 0.576 1.290 0.197 -0.386 1.870
Genres_Art & Design 3.5386 0.327 10.825 0.000 2.898 4.180
Genres_Art & Design;Creativity 3.0228 0.350 8.628 0.000 2.336 3.710
Genres_Art & Design;Pretend Play 1.9058 0.435 4.379 0.000 1.052 2.759
Genres_Auto & Vehicles 1.6114 0.162 9.919 0.000 1.293 1.930
Genres_Beauty 1.6756 0.166 10.088 0.000 1.350 2.001
Genres_Board -0.0414 0.142 -0.291 0.771 -0.321 0.238
Genres_Board;Action & Adventure 0.5512 0.575 0.959 0.338 -0.576 1.679
Genres_Board;Brain Games 0.6324 0.247 2.556 0.011 0.147 1.117
Genres_Board;Pretend Play 1.1267 0.576 1.958 0.050 -0.002 2.255
Genres_Books & Reference 1.6588 0.159 10.439 0.000 1.347 1.970
Genres_Books & Reference;Education -0.0598 0.575 -0.104 0.917 -1.187 1.067
Genres_Business 1.5499 0.158 9.823 0.000 1.241 1.859
Genres_Card -0.3223 0.144 -2.234 0.026 -0.605 -0.039
Genres_Card;Action & Adventure 0.2633 0.576 0.457 0.648 -0.867 1.393
Genres_Card;Brain Games 4.772e-12 1.63e-11 0.292 0.770 -2.72e-11 3.68e-11
Genres_Casino 0.0269 0.153 0.175 0.861 -0.274 0.327
Genres_Casual 0.2699 0.160 1.688 0.091 -0.043 0.583
Genres_Casual;Action & Adventure 0.1446 0.290 0.499 0.618 -0.424 0.713
Genres_Casual;Brain Games 0.7812 0.272 2.873 0.004 0.248 1.314
Genres_Casual;Creativity 0.8776 0.420 2.088 0.037 0.053 1.702
Genres_Casual;Education 0.5850 0.420 1.392 0.164 -0.239 1.409
Genres_Casual;Pretend Play 0.3825 0.220 1.740 0.082 -0.048 0.813
Genres_Comics 1.5898 0.159 9.982 0.000 1.278 1.902
Genres_Comics;Creativity 9.534e-12 2.54e-11 0.376 0.707 -4.02e-11 5.93e-11
Genres_Communication 1.5011 0.159 9.466 0.000 1.190 1.812
Genres_Communication;Creativity 0.6307 0.575 1.096 0.273 -0.497 1.758
Genres_Dating 1.4563 0.160 9.118 0.000 1.143 1.769
Genres_Education 0.6187 0.155 3.999 0.000 0.315 0.922
Genres_Education;Action & Adventure 0.7920 0.354 2.235 0.025 0.097 1.487
Genres_Education;Brain Games 0.5182 0.580 0.893 0.372 -0.620 1.656
Genres_Education;Creativity 1.1910 0.580 2.054 0.040 0.054 2.328
Genres_Education;Education 0.7598 0.186 4.094 0.000 0.396 1.124
Genres_Education;Music & Video 0.7869 0.575 1.369 0.171 -0.340 1.914
Genres_Education;Pretend Play 0.7626 0.196 3.886 0.000 0.378 1.147
Genres_Educational 0.2810 0.191 1.474 0.141 -0.093 0.655
Genres_Educational;Action & Adventure 0.7051 0.575 1.227 0.220 -0.422 1.832
Genres_Educational;Brain Games 0.6380 0.354 1.802 0.072 -0.056 1.332
Genres_Educational;Creativity 0.4813 0.354 1.359 0.174 -0.213 1.176
Genres_Educational;Education 0.7662 0.191 4.010 0.000 0.392 1.141
Genres_Educational;Pretend Play 0.5484 0.239 2.298 0.022 0.081 1.016
Genres_Entertainment 0.4409 0.154 2.864 0.004 0.139 0.743
Genres_Entertainment;Action & Adventure 0.8631 0.575 1.501 0.134 -0.264 1.991
Genres_Entertainment;Brain Games 0.6633 0.356 1.865 0.062 -0.034 1.361
Genres_Entertainment;Creativity 0.8870 0.423 2.096 0.036 0.057 1.717
Genres_Entertainment;Education 0.8131 0.575 1.414 0.157 -0.314 1.941
Genres_Entertainment;Music & Video 0.4474 0.259 1.726 0.084 -0.061 0.956
Genres_Entertainment;Pretend Play -0.0523 0.575 -0.091 0.928 -1.180 1.075
Genres_Events 1.7343 0.164 10.543 0.000 1.412 2.057
Genres_Finance 1.5209 0.158 9.637 0.000 1.211 1.830
Genres_Food & Drink 1.5523 0.161 9.646 0.000 1.237 1.868
Genres_Health & Fitness 1.5755 0.158 9.962 0.000 1.265 1.886
Genres_Health & Fitness;Action & Adventure 0.0911 0.575 0.158 0.874 -1.037 1.219
Genres_Health & Fitness;Education 0.7196 0.575 1.251 0.211 -0.408 1.847
Genres_House & Home 1.5847 0.162 9.778 0.000 1.267 1.902
Genres_Libraries & Demo 1.6527 0.164 10.099 0.000 1.332 1.973
Genres_Lifestyle 1.5437 0.158 9.787 0.000 1.234 1.853
Genres_Lifestyle;Education 5.338e-12 1.03e-11 0.521 0.603 -1.48e-11 2.54e-11
Genres_Maps & Navigation 1.4828 0.160 9.256 0.000 1.169 1.797
Genres_Medical 1.6262 0.157 10.328 0.000 1.318 1.935
Genres_Music -0.2141 0.169 -1.270 0.204 -0.545 0.116
Genres_Music & Audio;Music & Video 0.8983 0.575 1.562 0.118 -0.229 2.026
Genres_Music;Music & Video 0.6186 0.420 1.472 0.141 -0.205 1.443
Genres_News & Magazines 1.5278 0.158 9.644 0.000 1.217 1.838
Genres_Parenting 0.8752 0.199 4.394 0.000 0.485 1.266
Genres_Parenting;Brain Games 0.3904 0.467 0.836 0.403 -0.525 1.306
Genres_Parenting;Education 0.5311 0.467 1.137 0.255 -0.384 1.447
Genres_Parenting;Music & Video 0.8376 0.308 2.722 0.007 0.234 1.441
Genres_Personalization 1.6504 0.158 10.451 0.000 1.341 1.960
Genres_Photography 1.4885 0.159 9.361 0.000 1.177 1.800
Genres_Productivity 1.5450 0.158 9.772 0.000 1.235 1.855
Genres_Puzzle 0.5096 0.157 3.255 0.001 0.203 0.817
Genres_Puzzle;Action & Adventure 0.5776 0.575 1.005 0.315 -0.549 1.705
Genres_Puzzle;Brain Games 0.6492 0.238 2.724 0.006 0.182 1.117
Genres_Puzzle;Creativity 0.5923 0.420 1.409 0.159 -0.232 1.417
Genres_Puzzle;Education 1.0383 0.575 1.806 0.071 -0.089 2.165
Genres_Racing -0.0828 0.104 -0.794 0.427 -0.287 0.122
Genres_Racing;Action & Adventure 0.8047 0.291 2.770 0.006 0.235 1.374
Genres_Racing;Pretend Play 1.1510 0.575 2.000 0.046 0.023 2.279
Genres_Role Playing 0.3399 0.159 2.140 0.032 0.029 0.651
Genres_Role Playing;Action & Adventure 0.7178 0.576 1.246 0.213 -0.411 1.847
Genres_Role Playing;Pretend Play 0.2211 0.354 0.625 0.532 -0.473 0.915
Genres_Shopping 1.5917 0.159 10.000 0.000 1.280 1.904
Genres_Simulation 0.3970 0.160 2.489 0.013 0.084 0.710
Genres_Simulation;Action & Adventure 0.4516 0.354 1.274 0.203 -0.243 1.146
Genres_Simulation;Education 0.3856 0.402 0.959 0.338 -0.403 1.174
Genres_Simulation;Pretend Play 0.3858 0.420 0.919 0.358 -0.438 1.209
Genres_Social 1.5708 0.159 9.889 0.000 1.259 1.882
Genres_Sports 0.3928 0.558 0.704 0.481 -0.701 1.486
Genres_Sports;Action & Adventure 0.4389 0.354 1.238 0.216 -0.256 1.134
Genres_Strategy 0.2159 0.163 1.321 0.187 -0.105 0.536
Genres_Strategy;Action & Adventure 0.6988 0.420 1.663 0.096 -0.125 1.522
Genres_Strategy;Creativity 0.3320 0.575 0.577 0.564 -0.796 1.460
Genres_Strategy;Education 0 0 nan nan 0 0
Genres_Tools 1.5057 0.157 9.581 0.000 1.198 1.814
Genres_Travel & Local 0.9623 0.216 4.449 0.000 0.538 1.386
Genres_Travel & Local;Action & Adventure 1.1246 0.385 2.921 0.004 0.370 1.879
Genres_Trivia -0.0398 0.149 -0.267 0.790 -0.332 0.253
Genres_Video Players & Editors 1.5088 0.160 9.442 0.000 1.196 1.822
Genres_Weather 1.5942 0.163 9.766 0.000 1.274 1.914
Genres_Word 0.1231 0.233 0.528 0.598 -0.334 0.580
Omnibus: 1428.299 Durbin-Watson: 1.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5716.976
Skew: -1.506 Prob(JB): 0.00
Kurtosis: 7.595 Cond. No. 2.18e+21
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The smallest eigenvalue is 7.11e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
R2 Score
In [68]: print('R2_Score=',r2_score(y_test,predict))
print('Root Mean Squared Error=',np.sqrt(ms(y_test,predict)))
print('Prediction Error Percentage is',round((0.50/np.mean(y_test))*100))
R2_Score= 0.12274918284357272
Root Mean Squared Error= 0.5333728534851824
Prediction Error Percentage is 12
DECISION TREE
In [69]: #Decision Tree Model
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
#Fitting
dt.fit(X_train,y_train)
Out[69]: ▾ DecisionTreeRegressor
DecisionTreeRegressor()
In [70]: pred1 = dt.predict(X_test)

mae_dt = metrics.mean_absolute_error(y_test,pred1)
mse_dt = metrics.mean_squared_error(y_test, pred1)
rmse_dt = np.sqrt(metrics.mean_squared_error(y_test, pred1))
print( mae_dt, mse_dt, rmse_dt)
0.4865059004617753 0.5531759876859929 0.7437580168885528
FINAL OBSERVATION
we have performed both the linear regression and decision tree algorithms for model prediction .and we have observed that linear regression provides the best results with rmse (root
mean squared error)= 0.53 and error percentage of 12%. SO WE CAN SAY That regression model is an excellent start to prefdict the ratings of an apps given categories used in the
model.
In [ ]:
In [ ]:
In [ ]:

Data Science

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Science

Uploaded by

Copyright:

Available Formats

TASK 6

In [1]: import numpy as np

In [2]: pip install cufflinks

Requirement already satisfied: cufflinks in c:\users\rupes\anaconda3\lib\site-packages (0.17.3)

In [3]: import cufflinks as cf

Blowfish has been deprecated

Loading dataset using Pandas

In [5]: df.head() #Top 5 Rows of the Dataset

Out[5]: Content Android

Art & Design;Pretend January 15, 4.0.3 and

U Launcher Lite – FREE Live Cool August 1, 4.0.3 and

Pixel Draw - Number Art Coloring

In [6]: df.tail() #Bottom 5 Rows of Dataset

Out[6]: Content Last

Fr. Mike Schmitz Audio

iHoroscope - 2018 Daily July 25, Varies with Varies with

In [7]: df.shape #Shape of the dataset (Rows x Columns)

In [8]: df.size #Total No.of Elements in dataset

In [9]: df.dtypes #Data type of each column

Droping Null values

In [12]: df.describe() #Observing the Statistic distribution of Data

In [13]: df.Rating.describe() #Observing the Statistical distribution of "Rating" attribute

Filling Null values with Median value

In [15]: df['Rating'].describe() #Observing the Statistical distribution of "Rating" attribute

Converting MB's to KB's

Removing '+' from Installs column/Attribute

Converting Reviews attribute to Int Datatype

Removing $ and , from Price Attribute

Droping Reviews where value of Reviews>Installs

WE CAN DO DATA VISUALIZATION IN SEABORN ,MATPLOTLIB ALSO. BUT I

In [31]: fig2 = px.box(df,y="Reviews") #Boxplot for Reviews Attribute

Observation : Yes there are very high number of reviews

In [32]: fig3 = px.histogram(df,x="Rating") #Histogram for Rating attribute

Observation : It is right skewed between 4 to 5

In [33]: fig4 = px.histogram(df,x="Size") #Histogram for Size attribute

Observations : It is left skewed between 0 and 20k

In [36]: df.drop(df[df['Price']>200].index,inplace=True) #Droping rows having Price greater than 200

In [41]: np.percentile(df['Installs'], 10) #Finding 10 percentile of Installs attribute

In [42]: np.percentile(df['Installs'], 25) #Finding 25 percentile of Installs attribute

In [43]: np.percentile(df['Installs'], 50) #Finding 50 percentile of Installs attribute

In [44]: np.percentile(df['Installs'], 70) #Finding 70 percentile of Installs attribute

In [45]: np.percentile(df['Installs'], 90) #Finding 90 percentile of Installs attribute

In [46]: np.percentile(df['Installs'], 95) #Finding 95 percentile of Installs attribute

In [47]: np.percentile(df['Installs'], 99) #Finding 99 percentile of Installs attribute

In [48]: fig_installs = px.box(df,y="Installs") #Boxplot for Installs Attribute

Droping rows which are greater than Inter quartile range

U Launcher Lite – FREE Live Cool August 1,

Canva: Poster, banner, card maker &

1198 rows × 13 columns

Reducing Skewness by using Log transformation

df.hist(column=['Reviews','Installs']) # Histogram for Reviews and Installs attributes

array([[<Axes: title={'center': 'Reviews'}>,

Droping unnecessary attributes

In [61]: df=pd.get_dummies(df,drop_first=True) #Creating dummies

0 4.1 5.075174 19000 9.210440 0 0 0 0 0 0 ...

1 3.9 6.875232 14000 13.122365 0 0 0 0 0 0 ...

4 4.3 6.875232 2800 11.512935 0 0 0 0 0 0 ...

5 4.4 5.123964 5600 10.819798 0 0 0 0 0 0 ...

6 3.8 5.187386 19000 10.819798 0 0 0 0 0 0 ...

5 rows × 150 columns

Splitting into train and test

((4545, 149), (1949, 149))