Professional Documents
Culture Documents
I HAVE DONE THIS TASK IS ENTIRELY BASED UPON MY B.TECH PROJECT AND
DATSET IS FROM MY SIDE ONY. ADDITIONAL TO MY TASK GIVEN,I HAVE DONE
MACHINE LEARNING PART ALSO BY APPLYING ONLY TWO ALGORITHMS
IMPORTING MODULES
C:\Users\rupes\anaconda3\Lib\site-packages\paramiko\transport.py:219: CryptographyDeprecationWarning:
Photo Editor & Candy Camera & Grid January 7, 4.0.3 and
0 ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design 1.0.0
& ScrapBook 2018 up
Varies with
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 4.2 and up
device
July 25,
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0 Everyone Education 1.48 4.1 and up
2017
January 20,
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical 1.0 2.2 and up
2017
The SCP Foundation DB fr Varies with Books & January 19, Varies with Varies with
10839 BOOKS_AND_REFERENCE 4.5 114 1,000+ Free 0 Mature 17+
nn5n device Reference 2015 device device
(10841, 13)
Out[7]:
140933
Out[8]:
App object
Out[9]:
Category object
Rating float64
Reviews object
Size object
Installs object
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object
Checking Nullvalues
In [10]: df.isnull().sum() #Finding the count of total null values for each Attribute
App 0
Out[10]:
Category 0
Rating 1474
Reviews 0
Size 0
Installs 0
Type 1
Price 0
Content Rating 1
Genres 0
Last Updated 0
Current Ver 8
Android Ver 3
dtype: int64
Out[12]: Rating
count 9360.000000
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
count 9360.000000
Out[13]:
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64
0 4.1
Out[14]:
1 3.9
2 4.7
3 4.5
4 4.3
...
10834 4.0
10836 4.5
10837 5.0
10839 4.5
10840 4.5
Name: Rating, Length: 9360, dtype: float64
count 9360.000000
Out[15]:
mean 4.191838
std 0.515263
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 5.000000
Name: Rating, dtype: float64
In [16]: df.isnull().sum() #Finding the count of total null values for each Attribute
App 0
Out[16]:
Category 0
Rating 0
Reviews 0
Size 0
Installs 0
Type 0
Price 0
Content Rating 0
Genres 0
Last Updated 0
Current Ver 0
Android Ver 0
dtype: int64
In [19]: df['Installs']
0 10000
Out[19]:
1 500000
2 5000000
3 50000000
4 100000
...
10834 500
10836 5000
10837 100
10839 1000
10840 10000000
Name: Installs, Length: 9360, dtype: int32
In [21]: df.dtypes
App object
Out[21]:
Category object
Rating float64
Reviews int64
Size int32
Installs int32
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object
In [23]: len(df[df.Rating>5]) #Finding the number of elements having Rating greater than 5
0
Out[23]:
In [24]: df_reviews = (df[df.Reviews>df.Installs]) #Finding the Count of elements in Reviews greater than Installs
len(df_reviews)
7
Out[24]:
In [25]: len(df[(df.Type=='free')&(df.Price>0)]) #Finding the count of elements which are having Type as "Free" and Price greater than 0
0
Out[25]:
In [27]: df['Reviews']
0 159
Out[27]:
1 967
2 87510
3 215644
4 967
...
10834 7
10836 38
10837 4
10839 114
10840 398307
Name: Reviews, Length: 9353, dtype: int64
Data Visualization
400
350
300
250
Price
200
150
100
50
Observation: Yes there are outliers for Price attribute ranging from (1 to 400)
In [29]: fig = px.line(df,y="Price") #Plotting graph for finding the skewness for the price Attribute
In [30]: fig.show()
400
350
300
250
Price
200
150
100
50
0 2k 4k 6k 8k 10k
index
80M
70M
60M
50M
Reviews
40M
30M
20M
10M
1000
800
count
600
400
200
0
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
2000
1500
count
1000
500
0
0 20k 40k 60k 80k 100k
Size
In [34]: df[df.Price>200] #Finding the rows having price greater than 200$
Out[34]: App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
4197 most expensive app (H) FAMILY 4.3 6 1500 100 Paid 399 Everyone Entertainment July 16, 2018 1.0 7.0 and up
4362 💎 I'm rich LIFESTYLE 3.8 718 26000 10000 Paid 399 Everyone Lifestyle March 11, 2018 1.0.0 4.4 and up
4367 I'm Rich - Trump Edition LIFESTYLE 3.6 275 7300 10000 Paid 400 Everyone Lifestyle May 3, 2018 1.0.1 4.1 and up
5351 I am rich LIFESTYLE 3.8 3547 1800 100000 Paid 399 Everyone Lifestyle January 12, 2018 2.0 4.0.3 and up
5354 I am Rich Plus FAMILY 4.0 856 8700 10000 Paid 399 Everyone Entertainment May 19, 2018 3.0 4.4 and up
5355 I am rich VIP LIFESTYLE 3.8 411 2600 10000 Paid 299 Everyone Lifestyle July 21, 2018 1.1.1 4.3 and up
5356 I Am Rich Premium FINANCE 4.1 1867 4700 50000 Paid 399 Everyone Finance November 12, 2017 1.6 4.0 and up
5357 I am extremely Rich LIFESTYLE 2.9 41 2900 1000 Paid 379 Everyone Lifestyle July 1, 2018 1.0 4.0 and up
5358 I am Rich! FINANCE 3.8 93 22000 1000 Paid 399 Everyone Finance December 11, 2017 1.0 4.1 and up
5359 I am rich(premium) FINANCE 3.5 472 9651 5000 Paid 399 Everyone Finance May 1, 2017 3.4 4.4 and up
5362 I Am Rich Pro FAMILY 4.4 201 2700 5000 Paid 399 Everyone Entertainment May 30, 2017 1.54 1.6 and up
5364 I am rich (Most expensive app) FINANCE 4.1 129 2700 1000 Paid 399 Teen Finance December 6, 2017 2 4.0.3 and up
5366 I Am Rich FAMILY 3.6 217 4900 10000 Paid 389 Everyone Entertainment June 22, 2018 1.5 4.2 and up
5369 I am Rich FINANCE 4.3 180 3800 5000 Paid 399 Everyone Finance March 22, 2018 1.0 4.2 and up
5373 I AM RICH PRO PLUS FINANCE 4.0 36 41000 1000 Paid 399 Everyone Finance June 25, 2018 1.0.2 4.1 and up
In [35]: len(df[df.Price>200]) #Finding the count of elements having Price greater than 200
15
Out[35]:
In [37]: len(df[df.Price>200])
0
Out[37]:
In [38]: len(df[df.Reviews>2000000]) #Finding the number of values having Reviews value greater than 2000000
453
Out[38]:
In [39]: df.drop(df[df['Reviews']>200000].index,inplace=True) #Droping the rows having Reviews value greater than 2000000
In [40]: len(df[df.Reviews>2000000])
0
Out[40]:
1000.0
Out[41]:
10000.0
Out[42]:
100000.0
Out[43]:
1000000.0
Out[44]:
5000000.0
Out[45]:
10000000.0
Out[46]:
10000000.0
Out[47]:
100M
80M
60M
Installs
40M
20M
count = df.loc[(df['Installs'] < (Q1 - 1.5 * IQR)) | (df['Installs'] > (Q3 + 1.5 * IQR))]
In [50]: count
Out[50]: Content
App Category Rating Reviews Size Installs Type Price Genres Last Updated Current Ver Android Ver
Rating
12 Tattoo Name On My Photo Editor ART_AND_DESIGN 4.2 44829 20000 10000000 Free 0 Teen Art & Design April 2, 2018 3.8 4.1 and up
August 3,
18 FlipaClip - Cartoon animation ART_AND_DESIGN 4.3 194216 39000 5000000 Free 0 Everyone Art & Design 2.2.5 4.0.3 and up
2018
Varies with
37 Floor Plan Creator ART_AND_DESIGN 4.1 36639 0 5000000 Free 0 Everyone Art & Design July 14, 2018 2.3.3 and up
device
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10714 FunForMobile Ringtones & Chat SOCIAL 4.4 68358 7200 5000000 Free 0 Mature 17+ Social May 7, 2016 3.22 4.1 and up
August 5,
10716 Free Slideshow Maker & Video Editor PHOTOGRAPHY 4.2 162564 11000 10000000 Free 0 Everyone Photography 5.2 4.0 and up
2018
February 7,
10723 Mobile Kick SPORTS 4.3 111809 40000 10000000 Free 0 Everyone Sports 1.0.21 4.1 and up
2018
October 22,
10731 FeaturePoints: Free Gift Cards FAMILY 3.9 121321 46000 5000000 Free 0 Everyone Entertainment 8.7 4.0.3 and up
2016
Frim: get new friends on local chat March 23, Varies with Varies with
10826 SOCIAL 4.0 88486 0 5000000 Free 0 Mature 17+ Social
rooms 2018 device device
In [51]: df.drop(count.index,inplace=True) #Droping the Rows with the outliers above the inter quartile range
Data Visualization after droping values having more than Inter quartile range
In [52]: fig_installs = px.box(df,y="Installs") #Boxplot for Installs attribute after droping the outliers above inter quartile range
fig_installs.show()
1M
0.8M
0.6M
Installs
0.4M
0.2M
In [53]: fig_rvsp = px.scatter(df, x="Rating", y="Price",color='Category') #Scatterplot between Rating and Price and Coloring it with Category attribute
fig_rvsp.show()
80 Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
70
BEAUTY
BOOKS_AND_REFERENCE
60 BUSINESS
COMICS
50 COMMUNICATION
DATING
Price
40 EDUCATION
ENTERTAINMENT
EVENTS
30
FINANCE
FOOD_AND_DRINK
20 HEALTH_AND_FITNESS
HOUSE_AND_HOME
10 LIBRARIES_AND_DEMO
LIFESTYLE
0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Rating doesn't necessarily increase with the price. There are highly rated apps for lower priced apps. So Price doesn't have a considerable effect on the rating.
In [54]: fig_rvss = px.scatter(df, x="Rating", y="Size",color='Category') #Scatterplot between Rating and Size and Coloring it with Category attribute
fig_rvss.show()
Category
100k
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
80k BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
60k
DATING
EDUCATION
Size
ENTERTAINMENT
40k EVENTS
FINANCE
FOOD_AND_DRINK
HEALTH_AND_FITNESS
20k HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Heavier apps does seem to be rated higher but it is not conclusive. Lighter apps also have a good rating. SO heavier apps having higher rating may not always be true.
In [55]: fig_rvsr = px.scatter(df, x="Rating", y="Reviews",color='Category') #Scatterplot between Rating VS Reviews and Coloring it with Category attribute
fig_rvsr.show()
200k Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
150k BUSINESS
COMICS
COMMUNICATION
DATING
Reviews
100k EDUCATION
ENTERTAINMENT
EVENTS
FINANCE
FOOD_AND_DRINK
50k HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
0 GAME
FAMILY
1 1.5 2 2.5 3 3.5 4 4.5 5
Rating
Heavily reviewed apps does seem to have a higher rating but not all the time. For example there is a game with just 3.8 rating for 159k reviews
In [56]: fig_rvscr = px.box(df,x = 'Content Rating',y='Rating',color='Content Rating') #Boxplot between Rating and Content Rating and Coloring it with Conte
fig_rvscr.show()
5 Content Rating
Everyone
Everyone 10+
4.5
Teen
Mature 17+
4 Adults only 18+
Unrated
3.5
Rating
2.5
1.5
Everyone Everyone 10+ Teen Mature 17+ Adults only 18+ Unrated
Content Rating
Eventhough median ratings for all types above are similar, Adults only 18+ does overall have a better overall rating.
In [57]: fig_rvsc = px.box(df,x = 'Category',y='Rating',color="Category") #Boxplot between Rating and Category and Coloring it with Category attribute
fig_rvsc.show()
5 Category
ART_AND_DESIGN
AUTO_AND_VEHICLES
4 BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
Rating
3 COMICS
COMMUNICATION
DATING
EDUCATION
2
ENTERTAINMENT
EVENTS
FINANCE
1
FOOD AND DRINK
ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
DATING
EDUCATION
ENTERTAINMENT
EVENTS
FINANCE
FOOD_AND_DRINK
HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
GAME
FAMILY
MEDICAL
SOCIAL
SHOPPING
PHOTOGRAPHY
SPORTS
TRAVEL_AND_LOCAL
TOOLS
PERSONALIZATION
PRODUCTIVITY
PARENTING
WEATHER
VIDEO_PLAYERS
NEWS_AND_MAGAZINES
MAPS_AND_NAVIGATION
Category
Among all categories Events category does seem have overall better ratings
In [60]: df.head()
Out[60]: Category Rating Reviews Size Installs Type Price Content Rating Genres
0 ART_AND_DESIGN 4.1 5.075174 19000 9.210440 Free 0 Everyone Art & Design
1 ART_AND_DESIGN 3.9 6.875232 14000 13.122365 Free 0 Everyone Art & Design;Pretend Play
4 ART_AND_DESIGN 4.3 6.875232 2800 11.512935 Free 0 Everyone Art & Design;Creativity
5 ART_AND_DESIGN 4.4 5.123964 5600 10.819798 Free 0 Everyone Art & Design
6 ART_AND_DESIGN 3.8 5.187386 19000 10.819798 Free 0 Everyone Art & Design
In [62]: df.head()
Out[62]:
Genres
Rating Reviews Size Installs Price Category_AUTO_AND_VEHICLES Category_BEAUTY Category_BOOKS_AND_REFERENCE Category_BUSINESS Category_COMICS ...
Model building
In [63]: from sklearn.model_selection import train_test_split #Importing modules for Linear Regression
from sklearn.linear_model import LinearRegression
linreg=LinearRegression()
from statsmodels.api import OLS
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as ms
Model fitting
In [65]: Model=linreg.fit(X_train, y_train)
predict=linreg.predict(X_test)
y_test=np.array(y_test)
predict=np.array(predict)
a=pd.DataFrame({'Actual':y_test.flatten(),'Predicted':predict.flatten()});a.head(10)
0 5.0 4.073020
1 4.3 4.404918
2 3.4 4.454015
3 4.5 4.207862
4 4.2 4.078656
5 3.5 4.064958
6 4.4 4.056430
7 4.0 4.325298
8 4.0 4.120195
9 3.2 3.633948
In [66]: fig=a.head(25)
fig.iplot(kind='bar') #Ploting bar graph for predicted and actual values
5 Actual
Predicted
0
0 5 10 15 20
Export to plot.ly »
Df Model: 117
Genres_Art & Design;Pretend Play 1.9058 0.435 4.379 0.000 1.052 2.759
Genres_Health & Fitness;Action & Adventure 0.0911 0.575 0.158 0.874 -1.037 1.219
Genres_Music & Audio;Music & Video 0.8983 0.575 1.562 0.118 -0.229 2.026
Genres_Role Playing;Action & Adventure 0.7178 0.576 1.246 0.213 -0.411 1.847
Genres_Travel & Local;Action & Adventure 1.1246 0.385 2.921 0.004 0.370 1.879
Genres_Video Players & Editors 1.5088 0.160 9.442 0.000 1.196 1.822
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The smallest eigenvalue is 7.11e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
R2 Score
In [68]: print('R2_Score=',r2_score(y_test,predict))
print('Root Mean Squared Error=',np.sqrt(ms(y_test,predict)))
print('Prediction Error Percentage is',round((0.50/np.mean(y_test))*100))
R2_Score= 0.12274918284357272
Root Mean Squared Error= 0.5333728534851824
Prediction Error Percentage is 12
DECISION TREE
In [69]: #Decision Tree Model
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
#Fitting
dt.fit(X_train,y_train)
Out[69]: ▾ DecisionTreeRegressor
DecisionTreeRegressor()
FINAL OBSERVATION
we have performed both the linear regression and decision tree algorithms for model prediction .and we have observed that linear regression provides the best results with rmse (root
mean squared error)= 0.53 and error percentage of 12%. SO WE CAN SAY That regression model is an excellent start to prefdict the ratings of an apps given categories used in the
model.
In [ ]:
In [ ]:
In [ ]: