
ASSIGNMENT NO - 01

PYTHON FOR ANALYTICS

TASK TO BE PERFORMED:
1. Figure out genre-wise user count
2. How many users prefer comedy movies
3. Figure out age-group-wise preferred genres
4. Which is the most preferred genre among users below 35 years
5. Is there any correlation between zip code and genre

Q 1. Figure out genre-wise user count

In [1]:

import pandas as pd

# Column names supplied for the pipe-separated user file
column_names = ['user_id', 'age', 'gender', 'occupation', 'zipcode']
user_id = pd.read_csv('C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\user_id.txt',
                      sep='|', names=column_names)
user_id.head()
movies = pd.read_csv("C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\movies.csv")
movies.head()
ratings = pd.read_csv('C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\ratings.csv')

In [2]:

Merge = movies.join(user_id)
MERGE = pd.merge(left=Merge,right=ratings, how='left', left_on='movieId',right_on='movieId')
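Note that `movies.join(user_id)` with no `on` argument aligns the two frames on their row index, so movie row i is paired with user row i regardless of who rated what. A minimal sketch of a key-based alternative follows. It assumes (the source does not confirm this) that `user_id` in the user file and `userId` in the ratings file identify the same people; the toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three files; values are illustrative only.
movies = pd.DataFrame({"movieId": [1, 2],
                       "title": ["Toy Story (1995)", "Jumanji (1995)"],
                       "genres": ["Adventure|Comedy", "Adventure|Fantasy"]})
users = pd.DataFrame({"user_id": [1, 2], "age": [24, 53]})
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1, 2, 1],
                        "rating": [4.0, 3.5, 5.0]})

# Join ratings to movies on movieId, then to users on the user key,
# so every row describes one (user, movie) rating.
merged = (ratings
          .merge(movies, on="movieId", how="left")
          .merge(users, left_on="userId", right_on="user_id", how="left"))
print(merged[["userId", "title", "age", "rating"]])
```

Starting from the ratings table keeps one row per rating, which is the unit the later per-genre counts operate on.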

In [3]:

MERGE.dropna(inplace=True)

In [4]:

MERGE.head(1000)

Out[4]:

     movieId  title                         genres                                       user_id  age   gender  occupation  zipcode  userId  rating  timestamp
0    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     9.649827e+08
1    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     8.474350e+08
2    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     1.106636e+09
3    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     1.510578e+09
4    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     1.305696e+09
...  ...      ...                           ...                                          ...      ...   ...     ...         ...      ...     ...     ...
995  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    350.0   2.0     8.649409e+08
996  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    357.0   5.0     1.348612e+09
997  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    389.0   4.0     8.579342e+08
998  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    404.0   4.0     8.383760e+08
999  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    412.0   5.0     9.391369e+08

1000 rows × 11 columns

In [5]:

MERGE['genres'].describe()

Out[5]:

count 27487
unique 243
top Comedy
freq 1857
Name: genres, dtype: object

In [6]:

test2 = MERGE['genres'].str.split("|", n = 4, expand = True)

In [7]:
test2

Out[7]:

       0          1          2         3       4
0      Adventure  Animation  Children  Comedy  Fantasy
1      Adventure  Animation  Children  Comedy  Fantasy
2      Adventure  Animation  Children  Comedy  Fantasy
3      Adventure  Animation  Children  Comedy  Fantasy
4      Adventure  Animation  Children  Comedy  Fantasy
...    ...        ...        ...       ...     ...
27483  Comedy     Drama      None      None    None
27484  Comedy     Drama      None      None    None
27485  Comedy     Drama      None      None    None
27486  Comedy     Drama      None      None    None
27487  Comedy     Drama      None      None    None

27487 rows × 5 columns

In [8]:

test2[0].unique()
Out[8]:

array(['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
       'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
       'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical'],
      dtype=object)
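Because `test2[0]` holds only the first genre of each movie, `test2[0].unique()` lists just the genres that appear in first position somewhere; a genre that only ever appears later in the string would be missed. A minimal sketch of collecting genres from every position, using a made-up toy Series rather than the real data:

```python
import pandas as pd

# Toy genre strings; "War" and "Children" never appear in first position here.
genres = pd.Series(["Adventure|Animation|Children", "Comedy|Drama", "Drama|War"])

# Genres seen in the first position only (mirrors test2[0].unique()).
first_only = genres.str.split("|", expand=True)[0].unique()

# Genres seen in any position: split into lists, then one row per genre.
all_genres = genres.str.split("|").explode().unique()

print(sorted(first_only))
print(sorted(all_genres))
```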

In [9]:

import re

In [10]:

# Add one column per genre holding the number of times the genre name occurs in 'genres'
search=('Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
        'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
        'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical')
for i in search:
    MERGE[i] = MERGE['genres'].str.count(i, re.I)
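`Series.str.count` treats its argument as a regular expression and returns the number of matches, so the new columns are counts rather than strict 0/1 flags. As an alternative sketch (not the approach used in this notebook), `Series.str.get_dummies` builds one 0/1 column per genre directly from the separator, with no hand-written genre list. The toy frame below is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"genres": ["Adventure|Comedy|Fantasy", "Comedy|Drama", "Sci-Fi"]})

# One 0/1 indicator column per distinct genre, split on the "|" separator.
flags = df["genres"].str.get_dummies(sep="|")
df = df.join(flags)
print(df)
```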

In [11]:

MERGE.head()

Out[11]:

     movieId  title                         genres                                       user_id  age   gender  occupation  zipcode  userId  rating  ...  Animation  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical
0    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     ...  1          0            0         0       1        0        0          0        0       0
1    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     ...  1          0            0         0       1        0        0          0        0       0
2    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     ...  1          0            0         0       1        0        0          0        0       0
3    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     ...  1          0            0         0       1        0        0          0        0       0
4    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     ...  1          0            0         0       1        0        0          0        0       0

5 rows × 28 columns
FINAL ANSWER Q1

In [12]:

search=('Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
        'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
        'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical')
for i in search:
    # Count the rows whose value in this genre column equals 1
    j = 0
    for k in MERGE[i]:
        if k == 1:
            j = j + 1
    print(i, "\n \t==>>", j)

Adventure
==>> 6520
Comedy
==>> 9809
Action
==>> 8072
Drama
==>> 11329
Crime
==>> 5364
Children
==>> 3276
Mystery
==>> 1852
Animation
==>> 1699
Documentary
==>> 120
Thriller
==>> 7845
Horror
==>> 1326
Fantasy
==>> 2832
Western
==>> 718
Film-Noir
==>> 183
Romance
==>> 5528
Sci-Fi
==>> 4191
Musical
==>> 1850
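The nested loop above counts, for each genre, the rows of the merged table whose genre column equals 1. The same tally can be sketched without explicit loops by comparing the whole frame to 1 and summing column-wise. The toy `flags` frame below is a made-up stand-in for the genre columns of `MERGE`.

```python
import pandas as pd

# Toy stand-in for the genre columns of the merged table.
flags = pd.DataFrame({"Comedy": [1, 0, 1, 1], "Drama": [0, 1, 1, 0]})

# Element-wise comparison to 1, then a column-wise sum of the True values.
counts = flags.eq(1).sum()
for genre, n in counts.items():
    print(genre, "==>>", n)
```

Since each row of the merged table is one rating, these figures are rating counts per genre; the same user can contribute several rows.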

Q 2. How many users prefer comedy movies

In [13]:

# Count the rows whose Comedy value equals 1
l = 0
for m in MERGE['Comedy']:
    if m == 1:
        l = l + 1
print("Comedy\n \t==>>", l)

Comedy
==>> 9809
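The figure 9809 is the number of rows (ratings) with the Comedy flag set. If the question is read as asking for distinct users, one possible sketch is to count unique `userId` values among those rows. This reading of "prefer" (a user has rated at least one comedy) is an assumption, and the toy frame is illustrative only.

```python
import pandas as pd

# Toy ratings table: users 1 and 3 rated comedies, user 2 did not.
df = pd.DataFrame({"userId": [1, 1, 2, 3, 3],
                   "Comedy": [1, 1, 0, 1, 0]})

is_comedy = df["Comedy"] == 1
comedy_rows = int(is_comedy.sum())                      # rating rows flagged Comedy
comedy_users = df.loc[is_comedy, "userId"].nunique()    # distinct users behind them
print("rows:", comedy_rows, "distinct users:", comedy_users)
```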

Q 3. Figure out age-group-wise preferred genres

In [14]:
# pd.cut bins are right-closed: (0, 20], (20, 25], ..., (45, 100]
age_labels = ['1-20','21-25','26-30','31-35','36-40','41-45','46-100']
MERGE['Age_Category'] = pd.cut(MERGE['age'], bins=[0,20,25,30,35,40,45,100], labels=age_labels)
MERGE.head()

Out[14]:

     movieId  title                         genres                                       user_id  age   gender  occupation  zipcode  userId  rating  ...  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical  Age_Category
0    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     ...  0            0         0       1        0        0          0        0       0        21-25
1    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     ...  0            0         0       1        0        0          0        0       0        21-25
2    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     ...  0            0         0       1        0        0          0        0       0        21-25
3    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     ...  0            0         0       1        0        0          0        0       0        21-25
4    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     ...  0            0         0       1        0        0          0        0       0        21-25

5 rows × 29 columns
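`pd.cut` builds right-closed intervals by default, so the edges `[0, 20, 25, ..., 45, 100]` produce (0, 20], (20, 25], and so on, with the last bin covering ages 46 to 100. A small sketch on made-up boundary ages shows where each one lands; the last label is written as `46-100` here to match the interval (labels are display names only and do not affect the binning).

```python
import pandas as pd

# Ages chosen to sit on either side of two bin edges.
ages = pd.Series([20, 21, 45, 46])
labels = ["1-20", "21-25", "26-30", "31-35", "36-40", "41-45", "46-100"]

# Right-closed bins: an age equal to an edge falls in the lower group.
groups = pd.cut(ages, bins=[0, 20, 25, 30, 35, 40, 45, 100], labels=labels)
print(groups.tolist())
```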

In [15]:

%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot') # optional: for ggplot-like style

# check for latest version of Matplotlib
print('Matplotlib version: ', mpl.__version__) # >= 2.0.0

Matplotlib version: 3.1.1

In [16]:

# Sum the numeric columns within each age group
merge_age = MERGE.groupby('Age_Category').sum(numeric_only=True)
merge_age.head()
Out[16]:

[Table: per-Age_Category sums of the numeric columns (movieId, user_id, age, userId, rating, timestamp and the 17 genre count columns) for the groups 1-20, 21-25, 26-30, 31-35 and 36-40; the cell values are not legible in this copy.]

5 rows × 23 columns

In [17]:

# One pie chart per genre, showing how that genre's total is spread across the age groups
for a in search:
    merge_age[a].plot(kind='pie',
                      figsize=(5, 6),
                      autopct='%1.1f%%',
                      startangle=90,
                      shadow=True,
                      )
    plt.title('Age group wise preferred Genre')
    plt.axis('equal')
    plt.show()
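The pie charts show how each genre is distributed over age groups. To read off the preferred genre per age group directly, one option is to sum the genre columns per group and take the column with the largest total in each row. This is a minimal sketch on made-up data, and it assumes "preferred" means "most rated rows", which the source does not define.

```python
import pandas as pd

# Toy stand-in for the merged table: one row per rating.
df = pd.DataFrame({
    "Age_Category": ["1-20", "1-20", "21-25", "21-25", "21-25"],
    "Comedy": [1, 1, 0, 1, 0],
    "Drama":  [0, 1, 1, 1, 1],
    "Action": [1, 0, 0, 0, 1],
})

# Per-group totals for each genre, then the genre with the largest total per group.
per_group = df.groupby("Age_Category")[["Comedy", "Drama", "Action"]].sum()
preferred = per_group.idxmax(axis=1)
print(per_group)
print(preferred)
```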

Q 4. Which is the most preferred genre among users below 35 years

In [18]:

bel_35 = merge_age.head(4)

In [19]:

# Total per genre across the first four age groups (1-20 to 31-35)
for b in search:
    print(b, "==>", bel_35[b].sum())

Adventure ==> 3678
Comedy ==> 6303
Action ==> 3878
Drama ==> 7115
Crime ==> 3101
Children ==> 2335
Mystery ==> 1032
Animation ==> 1401
Documentary ==> 47
Thriller ==> 4602
Horror ==> 1007
Fantasy ==> 1529
Western ==> 496
Film-Noir ==> 95
Romance ==> 2961
Sci-Fi ==> 2239
Musical ==> 1385
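From the totals printed above, Drama has the highest count (7115), followed by Comedy (6303) and Thriller (4602). Note that `merge_age.head(4)` covers the groups 1-20 through 31-35, so users aged exactly 35 are included. A short sketch of picking the top genre programmatically, with the totals copied from the printed output into a Series:

```python
import pandas as pd

# Totals copied from the printed output for the four youngest age groups.
totals = pd.Series({
    "Adventure": 3678, "Comedy": 6303, "Action": 3878, "Drama": 7115,
    "Crime": 3101, "Children": 2335, "Mystery": 1032, "Animation": 1401,
    "Documentary": 47, "Thriller": 4602, "Horror": 1007, "Fantasy": 1529,
    "Western": 496, "Film-Noir": 95, "Romance": 2961, "Sci-Fi": 2239,
    "Musical": 1385,
})

top = totals.idxmax()                                 # label of the largest total
ranking = totals.sort_values(ascending=False).head(3)
print(top, totals[top])
print(ranking)
```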

Q 5. Is there any correlation between zip code and genre

In [20]:

import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

In [21]:

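# unique is referenced here without parentheses, so the result is the bound method itself
# (its repr embeds the whole Series); Merge['zipcode'].unique() would return the distinct values
uni_zip=Merge['zipcode'].unique
uni_zip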
Out[21]:

<bound method Series.unique of 0       85711
1       94043
2       32067
3       43537
4       15213
        ...
9737      NaN
9738      NaN
9739      NaN
9740      NaN
9741      NaN
Name: zipcode, Length: 9742, dtype: object>
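The output above is the repr of a bound method, which is what attribute access without a call returns. A minimal sketch of the difference between referencing `unique` and calling it, on a made-up Series of zip codes:

```python
import pandas as pd

# Toy zip codes with a duplicate and a missing value.
zipcodes = pd.Series(["85711", "94043", "85711", None])

method = zipcodes.unique              # attribute access: a bound method, nothing computed
values = zipcodes.dropna().unique()   # call: array of the distinct non-missing values

print(callable(method))
print(values)
```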

In [22]:

MERGE.dropna(inplace=True)

In [23]:

zip_df = MERGE  # .groupby('zipcode', axis=0).sum()
zip_df

Out[23]:

D Ag
oc oc Fil R e_
m us zi H Fa W M Ca
ge ge cu us ra u Th m o Sc
ov titl er ag pc or nt es us te
nr nd pa erI tin ... m rill - m i-
ieI e _i e od ro as te ic go
es er tio d g en er N an Fi
d d e r y rn al ry
n ta oir ce
ry
Ad
ve
nt
ur
e|
An
im
ati
To on
y |
te
St C
ch 85 21
or hil 1. 24 1. 4.
0 1 M ni 71 ... 0 0 0 1 0 0 0 0 0 -
y dr 0 .0 0 0
ci 1 25
(1 en
an
99 |
5) C
o
m
ed
y|
Fa
nt
as
y

Ad
ve
nt
ur
e|
An
im
ati
To on
y |
te
St C
ch 85 21
or hil 1. 24 5. 4.
1 1 M ni 71 ... 0 0 0 1 0 0 0 0 0 -
y dr 0 .0 0 0
ci 1 25
(1 en
an
99 |
5) C
o
m
ed
y|
Fa
nt
as
y

2 1 To Ad 1. 24 M te 85 7. 4. ... 0 0 0 1 0 0 0 0 0 21
y ve 0 .0 ch 71 0 5 -
St nt ni 1 25
or ur ci
y e| an
(1 An
99 im
5) ati
on
|
C
hil
dr
en
|
C
o
m
ed
y|
Fa
nt
as
y

Ad
ve
nt
ur
e|
An
im
ati
To on
y |
te
St C
ch 85 21
or hil 1. 24 15 2.
3 1 M ni 71 ... 0 0 0 1 0 0 0 0 0 -
y dr 0 .0 .0 5
ci 1 25
(1 en
an
99 |
5) C
o
m
ed
y|
Fa
nt
as
y

Ad
ve
nt
ur
e|
An
im
ati
To on
y |
te
St C
ch 85 21
or hil 1. 24 17 4.
4 1 M ni 71 ... 0 0 0 1 0 0 0 0 0 -
y dr 0 .0 .0 5
ci 1 25
(1 en
an
99 |
5) C
o
m
ed
y|
Fa
nt
as
y

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

27 12 Ro C 94 22 M st 77 46 5. ... 0 0 0 0 0 0 0 0 0 21
48 43 se o 3. .0 ud 84 9. 0 -
3 nc m 0 en 1 0 25
ra ed
nt
z
an
d
G
uil
de y|
ns Dr
ter a t
n m
Ar a
e
De
ad
(1
99
0)

Ro
se
nc
ra
nt
z
an C
d o
G m
st
27 uil ed 94 77 47 21
12 22 ud 4.
48 de y| 3. M 84 4. ... 0 0 0 0 0 0 0 0 0 -
43 .0 en 0
4 ns Dr 0 1 0 25
t
ter a
n m
Ar a
e
De
ad
(1
99
0)

Ro
se
nc
ra
nt
z
an C
d o
G m
st
27 uil ed 94 77 50 21
12 22 ud 4.
48 de y| 3. M 84 9. ... 0 0 0 0 0 0 0 0 0 -
43 .0 en 0
5 ns Dr 0 1 0 25
t
ter a
n m
Ar a
e
De
ad
(1
99
0)

27 12 Ro C 94 22 M st 77 59 3. ... 0 0 0 0 0 0 0 0 0 21
48 43 se o 3. .0 ud 84 9. 0 -
6 nc m 0 en 1 0 25
ra
nt
z
an
d
G
uil ed
de y|
ns Dr
t
ter a
n m
Ar a
e
De
ad
(1
99
0)

Ro
se
nc
ra
nt
z
an C
d o
G m
st
27 uil ed 94 77 60 21
12 22 ud 4.
48 de y| 3. M 84 6. ... 0 0 0 0 0 0 0 0 0 -
43 .0 en 0
7 ns Dr 0 1 0 25
t
ter a
n m
Ar a
e
De
ad
(1
99
0)

27487 rows × 29 columns

In [24]:

df = zip_df[['zipcode','Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
             'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
             'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical']]
m = df.head(10)
m

Out[24]:

   zipcode  Adventure  Comedy  Action  Drama  Crime  Children  Mystery  Animation  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical
0  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
1  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
2  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
3  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
4  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
5  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
6  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
7  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
8  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
9  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0

In [25]:

msk = np.random.rand(len(df)) < 0.8
train =m[msk]
test = m[~msk]
# msk has len(df) = 27487 entries, but m holds only the first 10 rows, so m[msk] raises
# the ValueError shown in the traceback; the mask must match the frame it selects from

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-fab4dc67c61f> in <module>
      1 msk = np.random.rand(len(df)) < 0.8
----> 2 train =m[msk]
      3 test = m[~msk]

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2969         # Do we have a (boolean) 1d indexer?
   2970         if com.is_bool_indexer(key):
-> 2971             return self._getitem_bool_array(key)
   2972
   2973         # We are left with two options: a single key, and a collection of keys,

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_bool_array(self, key)
   3016         elif len(key) != len(self.index):
   3017             raise ValueError(
-> 3018                 "Item wrong length %d instead of %d." % (len(key), len(self.index))
   3019             )
   3020

ValueError: Item wrong length 27487 instead of 10.
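The error message says the boolean mask has 27487 entries while the frame being indexed has 10 rows. A minimal sketch of a random 80/20 split in which the mask is built from, and applied to, the same frame; the toy `df` and the fixed seed are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the zipcode/genre table.
df = pd.DataFrame({"zipcode": ["85711"] * 50, "Comedy": [1, 0] * 25})

rng = np.random.default_rng(0)     # seeded only to make the sketch repeatable
msk = rng.random(len(df)) < 0.8    # one boolean per row of df

train = df[msk]                    # mask and frame have the same length
test = df[~msk]
print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, because `~msk` is the exact complement of `msk`.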

In [26]:

from sklearn import linear_model
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
                         'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
                         'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical']])
y = np.asanyarray(train[['zipcode']])
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)
# train was never created because the split in the previous cell failed, hence the NameError

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-26-a24e12f606ea> in <module>
      1 from sklearn import linear_model
      2 regr = linear_model.LinearRegression()
----> 3 x = np.asanyarray(train[['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
      4 'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
      5 'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical']])

NameError: name 'train' is not defined
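A zip code is a label rather than a quantity, so regression coefficients with `zipcode` as the target would be hard to interpret even once the split works. One alternative, which is not what this notebook does, is to treat both variables as categorical and measure their association with a contingency table and Cramér's V (0 means no association, 1 means perfect association). The sketch below makes several assumptions: zip codes are grouped by their first character as a coarse region, `cramers_v` is a hypothetical helper written for this example, and the data is made up.

```python
import numpy as np
import pandas as pd

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V for two categorical series (hypothetical helper)."""
    table = pd.crosstab(a, b).to_numpy().astype(float)
    n = table.sum()
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy data: zip codes as strings and a 0/1 genre flag per rating row.
df = pd.DataFrame({
    "zipcode": ["85711", "85711", "94043", "94043", "06355", "06355"],
    "Comedy":  [1, 1, 0, 0, 1, 0],
})
df["zip_region"] = df["zipcode"].str[0]   # first character as a coarse region

print(cramers_v(df["zip_region"], df["Comedy"]))
```

Grouping by region keeps the contingency table small; with thousands of distinct zip codes most cells would be nearly empty and the statistic unreliable.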
