Assignment No 01 - Data Analytics
TASKS TO BE PERFORMED:
1. Find the user count for each genre.
2. How many users prefer comedy movies?
3. Find the preferred genres for each age group.
4. Which is the most preferred genre among users below 35 years of age?
5. Is there any correlation between zip code and genre?
In [1]:
import pandas as pd
column_names=['user_id','age','gender','occupation','zipcode']
user_id= pd.read_csv('C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\user_id.txt',sep='|',
names= column_names)
user_id.head()
movies = pd.read_csv("C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\movies.csv")
movies.head()
ratings=pd.read_csv('C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\ratings.csv')
In [2]:
Merge = movies.join(user_id)
MERGE = pd.merge(left=Merge,right=ratings, how='left', left_on='movieId',right_on='movieId')
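Note that `movies.join(user_id)` aligns the two tables on their row index, so movie row i is paired with user row i regardless of who rated what. A minimal sketch of a key-based alternative follows, using made-up toy tables. It assumes the ids in `user_id.txt` and `ratings.csv` refer to the same people, which the notebook does not establish.

```python
import pandas as pd

# Toy stand-ins for the three tables; every value below is invented for illustration.
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "age": [24, 53],
    "zipcode": ["85711", "94043"],
})

# Attach movie details to each rating via movieId,
# then attach user details via the user key (assumed to match).
merged = (
    ratings
    .merge(movies, on="movieId", how="left")
    .merge(users, left_on="userId", right_on="user_id", how="left")
)
print(merged[["userId", "title", "age", "zipcode", "rating"]])
```

With this layout each row is one rating, carrying the attributes of the user who gave it.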
In [3]:
MERGE.dropna(inplace=True)
In [4]:
MERGE.head(1000)
Out[4]:
     movieId                         title                                       genres  user_id   age  gender  occupation  zipcode  userId  rating     timestamp
  1        1              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     5.0     4.0  8.474350e+08
  2        1              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     7.0     4.5  1.106636e+09
  3        1              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    15.0     2.5  1.510578e+09
  4        1              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    17.0     4.5  1.305696e+09
...
995       17  Sense and Sensibility (1995)                                Drama|Romance     17.0  30.0       M  programmer    06355   350.0     2.0  8.649409e+08
996       17  Sense and Sensibility (1995)                                Drama|Romance     17.0  30.0       M  programmer    06355   357.0     5.0  1.348612e+09
997       17  Sense and Sensibility (1995)                                Drama|Romance     17.0  30.0       M  programmer    06355   389.0     4.0  8.579342e+08
998       17  Sense and Sensibility (1995)                                Drama|Romance     17.0  30.0       M  programmer    06355   404.0     4.0  8.383760e+08
999       17  Sense and Sensibility (1995)                                Drama|Romance     17.0  30.0       M  programmer    06355   412.0     5.0  9.391369e+08
In [5]:
MERGE['genres'].describe()
Out[5]:
count 27487
unique 243
top Comedy
freq 1857
Name: genres, dtype: object
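The `describe()` output treats each pipe-separated string as one value, which is why there are 243 "unique" genres and why the top entry is the exact string "Comedy". To count individual genres, the strings first have to be split into one indicator column per genre. A minimal sketch with three sample strings, using `Series.str.get_dummies`; the notebook's own splitting code is not shown in this export and may differ.

```python
import pandas as pd

# Three sample genre strings in the same pipe-separated format as movies.csv.
genres = pd.Series([
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Comedy",
    "Drama|Romance",
])

# One 0/1 column per distinct genre; a row has 1 wherever its string contains that genre.
flags = genres.str.get_dummies(sep="|")

# Column sums give the number of rows tagged with each genre.
print(flags.sum().sort_values(ascending=False))
```

The resulting columns can be concatenated to the merged frame with `pd.concat([..., flags], axis=1)`.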
In [6]:
In [7]:
test2
Out[7]:
0 1 2 3 4
In [8]:
test2[0].unique()
Out[8]:
In [9]:
import re
In [10]:
In [11]:
MERGE.head()
Out[11]:
   movieId             title                                       genres  user_id   age  gender  occupation  zipcode  userId  rating  ...  Animation  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     1.0     4.0  ...          1            0         0       0        1        0          0        0       0        0
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     5.0     4.0  ...          1            0         0       0        1        0          0        0       0        0
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     7.0     4.5  ...          1            0         0       0        1        0          0        0       0        0
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    15.0     2.5  ...          1            0         0       0        1        0          0        0       0        0
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    17.0     4.5  ...          1            0         0       0        1        0          0        0       0        0

5 rows × 28 columns
FINAL ANSWER Q1
In [12]:
Adventure
==>> 6520
Comedy
==>> 9809
Action
==>> 8072
Drama
==>> 11329
Crime
==>> 5364
Children
==>> 3276
Mystery
==>> 1852
Animation
==>> 1699
Documentary
==>> 120
Thriller
==>> 7845
Horror
==>> 1326
Fantasy
==>> 2832
Western
==>> 718
Film-Noir
==>> 183
Romance
==>> 5528
Sci-Fi
==>> 4191
Musical
==>> 1850
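The figures above match the number of merged rows flagged for each genre (the Comedy figure equals the row count computed in the next cell), so a user who rated several comedies is counted several times. If "user count" is meant as distinct users, a sketch along these lines may be closer to the question. The frame below is a toy example, and `genre_cols` is a hypothetical name for the list of genre columns.

```python
import pandas as pd

# Toy frame: one row per rating, with 0/1 genre flags (values invented).
df = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "Comedy": [1, 1, 0, 1],
    "Drama":  [0, 1, 1, 0],
})
genre_cols = ["Comedy", "Drama"]  # hypothetical name for the genre column list

# Rows per genre: counts every rating, so repeat raters are counted repeatedly.
rows_per_genre = {g: int(df[g].sum()) for g in genre_cols}

# Distinct users per genre: each user counted once per genre they rated.
users_per_genre = {g: int(df.loc[df[g] == 1, "userId"].nunique()) for g in genre_cols}

print(rows_per_genre)
print(users_per_genre)
```

Here user 1 rated two comedies, so Comedy has 3 rows but only 2 distinct users.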
In [13]:
# Count the rows where the Comedy flag is set
comedy_count = int((MERGE['Comedy'] == 1).sum())
print("Comedy\n \t==>>", comedy_count)
Comedy
==>> 9809
In [14]:
age_labels = ['1-20','21-25','26-30','31-35','36-40','41-45','46-100']
MERGE['Age_Category'] = pd.cut(MERGE['age'], bins=[0,20,25,30,35,40,45,100], labels=age_labels)
MERGE.head()
Out[14]:
   movieId             title                                       genres  user_id   age  gender  occupation  zipcode  userId  rating  ...  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical  Age_Category
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     1.0     4.0  ...            0         0       0        1        0          0        0       0        0         21-25
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     5.0     4.0  ...            0         0       0        1        0          0        0       0        0         21-25
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     7.0     4.5  ...            0         0       0        1        0          0        0       0        0         21-25
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    15.0     2.5  ...            0         0       0        1        0          0        0       0        0         21-25
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    17.0     4.5  ...            0         0       0        1        0          0        0       0        0         21-25

5 rows × 29 columns
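By default `pd.cut` builds right-inclusive intervals, so the bin edges `[0, 20, 25, ...]` mean (0, 20], (20, 25] and so on. A small sketch with hand-picked boundary ages shows where each one lands. The last label is written `46-100` here because the interval (45, 100] does not include 45.

```python
import pandas as pd

# Ages chosen to sit on or next to the bin edges.
ages = pd.Series([20, 21, 35, 45, 46, 73])

bins = [0, 20, 25, 30, 35, 40, 45, 100]
labels = ["1-20", "21-25", "26-30", "31-35", "36-40", "41-45", "46-100"]

# right=True is the default: each bin includes its upper edge, not its lower one.
groups = pd.cut(ages, bins=bins, labels=labels)

print(groups.tolist())
# ['1-20', '21-25', '31-35', '41-45', '46-100', '46-100']
```

An age of exactly 35 therefore falls in the `31-35` group, which matters for any "below 35" question.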
In [15]:
%matplotlib inline
In [16]:
[Output table: values aggregated by Age_Category, showing the groups 1-20, 21-25, 26-30, 31-35 and 36-40. Columns: movieId, user_id, age, userId, rating, timestamp, Adventure, Comedy, Action, Drama, ..., Animation, Documentary, Thriller, Horror, Fantasy, Western, Film-Noir, Romance, Sci-Fi, Musical. The cell values are not legible in this export.]

5 rows × 23 columns
In [17]:
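import matplotlib.pyplot as plt

# One pie chart per genre, showing how that genre's total splits across age groups
for a in search:
    merge_age[a].plot(kind='pie',
                      figsize=(5, 6),
                      autopct='%1.1f%%',
                      startangle=90,
                      shadow=True,
                      )
    plt.title('Age group wise preferred Genre')
    plt.axis('equal')
    plt.show()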
4. Which is the most preferred genre among users below 35 years of age?
In [18]:
bel_35 = merge_age.head(4)
In [19]:
# Sum each genre column over the first four age groups (1-20 through 31-35)
for b in search:
    print(b, "==>", bel_35[b].sum())
Adventure ==> 3678
Comedy ==> 6303
Action ==> 3878
Drama ==> 7115
Crime ==> 3101
Children ==> 2335
Mystery ==> 1032
Animation ==> 1401
Documentary ==> 47
Thriller ==> 4602
Horror ==> 1007
Fantasy ==> 1529
Western ==> 496
Film-Noir ==> 95
Romance ==> 2961
Sci-Fi ==> 2239
Musical ==> 1385
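The printed totals can be ranked directly instead of being compared by eye. The sketch below copies the figures from the output above into a Series (the name `below_35_totals` is introduced here for illustration) and picks the largest.

```python
import pandas as pd

# Totals copied from the printed output above.
below_35_totals = pd.Series({
    "Adventure": 3678, "Comedy": 6303, "Action": 3878, "Drama": 7115,
    "Crime": 3101, "Children": 2335, "Mystery": 1032, "Animation": 1401,
    "Documentary": 47, "Thriller": 4602, "Horror": 1007, "Fantasy": 1529,
    "Western": 496, "Film-Noir": 95, "Romance": 2961, "Sci-Fi": 2239,
    "Musical": 1385,
})

# idxmax returns the label of the largest value.
top_genre = below_35_totals.idxmax()

# The three largest totals, in descending order.
top_three = below_35_totals.sort_values(ascending=False).head(3)

print(top_genre)
print(top_three)
```

On these figures Drama ranks first, followed by Comedy and Thriller. Because `head(4)` keeps the 31-35 group, the totals cover users aged 35 and under.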
In [20]:
In [21]:
uni_zip = Merge['zipcode'].unique()  # call the method; without () this only stores a reference to it
uni_zip
Out[21]:
In [22]:
MERGE.dropna(inplace=True)
In [23]:
Out[23]:
       movieId                                         title                                       genres  user_id   age  gender  occupation  zipcode  userId  rating  ...  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical  Age_Category
    0        1                              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     1.0     4.0  ...            0         0       0        1        0          0        0       0        0         21-25
    1        1                              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     5.0     4.0  ...            0         0       0        1        0          0        0       0        0         21-25
    2        1                              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711     7.0     4.5  ...            0         0       0        1        0          0        0       0        0         21-25
    3        1                              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    15.0     2.5  ...            0         0       0        1        0          0        0       0        0         21-25
    4        1                              Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy      1.0  24.0       M  technician    85711    17.0     4.5  ...            0         0       0        1        0          0        0       0        0         21-25
...
27483     1243  Rosencrantz and Guildenstern Are Dead (1990)                                 Comedy|Drama    943.0  22.0       M     student    77841   469.0     5.0  ...            0         0       0        0        0          0        0       0        0         21-25
27484     1243  Rosencrantz and Guildenstern Are Dead (1990)                                 Comedy|Drama    943.0  22.0       M     student    77841   474.0     4.0  ...            0         0       0        0        0          0        0       0        0         21-25
27485     1243  Rosencrantz and Guildenstern Are Dead (1990)                                 Comedy|Drama    943.0  22.0       M     student    77841   509.0     4.0  ...            0         0       0        0        0          0        0       0        0         21-25
27486     1243  Rosencrantz and Guildenstern Are Dead (1990)                                 Comedy|Drama    943.0  22.0       M     student    77841   599.0     3.0  ...            0         0       0        0        0          0        0       0        0         21-25
27487     1243  Rosencrantz and Guildenstern Are Dead (1990)                                 Comedy|Drama    943.0  22.0       M     student    77841   606.0     4.0  ...            0         0       0        0        0          0        0       0        0         21-25
In [24]:
Out[24]:
   zipcode  Adventure  Comedy  Action  Drama  Crime  Children  Mystery  Animation  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical
0    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
1    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
2    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
3    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
4    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
5    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
6    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
7    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
8    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
9    85711          1       1       0      0      0         1        0          1            0         0       0        1        0          0        0       0        0
In [25]:
In [26]:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-26-a24e12f606ea> in <module>
1 from sklearn import linear_model
2 regr = linear_model.LinearRegression()
----> 3 x = np.asanyarray(train[['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
4 'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
5 'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical']])
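The traceback is cut off before it names the undefined variable, so the regression cell cannot be repaired from this export. Separately, a ZIP code is a label rather than a quantity, so a linear regression or Pearson correlation on its numeric value is hard to interpret. One simple alternative, sketched below on invented data, is to group ZIP codes by their first digit (a broad US region) and compare the share of rows flagged for each genre. The column name `zip_region` and the list `genre_cols` are introduced here for illustration.

```python
import pandas as pd

# Toy frame with the same shape as the zipcode/genre table above (values invented).
df = pd.DataFrame({
    "zipcode": ["85711", "85711", "94043", "94043", "06355", "06355"],
    "Comedy":  [1, 1, 0, 1, 0, 0],
    "Drama":   [0, 0, 1, 1, 1, 1],
})
genre_cols = ["Comedy", "Drama"]

# Keep ZIP codes as strings (this also preserves leading zeros)
# and use the first digit as a coarse region label.
df["zip_region"] = df["zipcode"].str[0]

# Mean of a 0/1 flag = share of that region's rows tagged with the genre.
share = df.groupby("zip_region")[genre_cols].mean()

print(share)
```

Large differences between regions would suggest an association worth testing formally; similar shares across regions would suggest little relationship between ZIP code and genre.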
In [ ]: