Assignment No 01 - Data Analytics

ASSIGNMENT NO - 01
PYTHON FOR ANALYTICS
TASKS TO BE PERFORMED:
1. Figure out the genre-wise user count
2. How many users prefer comedy movies?
3. Figure out the preferred genres for each age group
4. Which is the most preferred genre among users below 35 yrs?
5. Is there any correlation between zip code and genre?
Q 1. Figure out the genre-wise user count
In [1]:
import pandas as pd
column_names=['user_id','age','gender','occupation','zipcode']
user_id= pd.read_csv('C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\user_id.txt',sep='|',
names= column_names)
user_id.head()
movies = pd.read_csv("C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\movies.csv")
movies.head()
ratings=pd.read_csv('C:\\Users\\ajink\\Desktop\\Sem 2\\Python\\ml-latest-small\\ratings.csv')
In [2]:
Merge = movies.join(user_id)
MERGE = pd.merge(left=Merge,right=ratings, how='left', left_on='movieId',right_on='movieId')
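Note that `movies.join(user_id)` aligns the two tables on their row index, so movie row i is paired with user row i regardless of who rated what. A minimal sketch of a key-based alternative follows. It assumes that `userId` in the ratings table and `user_id` in the user table identify the same people, which the source does not confirm, and all values in the toy frames are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables; every value here is made up.
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "age": [24, 53],
    "zipcode": ["85711", "94043"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Each rating row picks up its movie via movieId and its user via userId.
merged = (
    ratings
    .merge(movies, on="movieId", how="left")
    .merge(users, left_on="userId", right_on="user_id", how="left")
)
print(merged[["userId", "title", "age", "zipcode", "rating"]])
```

With this shape, one row is one rating, and the age and zip code on that row belong to the user who gave the rating.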
In [3]:
MERGE.dropna(inplace=True)
In [4]:
MERGE.head(1000)
Out[4]:
     movieId  title                         genres                                       user_id  age   gender  occupation  zipcode  userId  rating  timestamp
0    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     9.649827e+08
1    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     8.474350e+08
2    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     1.106636e+09
3    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     1.510578e+09
4    1        Toy Story (1995)              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     1.305696e+09
...  ...      ...                           ...                                          ...      ...   ...     ...         ...      ...     ...     ...
995  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    350.0   2.0     8.649409e+08
996  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    357.0   5.0     1.348612e+09
997  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    389.0   4.0     8.579342e+08
998  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    404.0   4.0     8.383760e+08
999  17       Sense and Sensibility (1995)  Drama|Romance                                17.0     30.0  M       programmer  06355    412.0   5.0     9.391369e+08
1000 rows × 11 columns
In [5]:
MERGE['genres'].describe()
Out[5]:
count 27487
unique 243
top Comedy
freq 1857
Name: genres, dtype: object
In [6]:
test2 = MERGE['genres'].str.split("|", n = 4, expand = True)
In [7]:
test2
Out[7]:
       0          1          2         3       4
0      Adventure  Animation  Children  Comedy  Fantasy
...    ...        ...        ...       ...     ...
27483  Comedy     Drama      None      None    None
In [8]:
test2[0].unique()
Out[8]:
array(['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
       'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
       'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical'],
      dtype=object)
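The list above comes from column 0 only, so it contains just the genres that appear first in at least one `genres` string. A genre that only ever appears in a later position would be missed, and with `n = 4` any genres beyond the fifth stay fused in the last column. A minimal sketch of collecting genres from every position, using a made-up three-row Series rather than the real data:

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated format.
genres = pd.Series([
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Comedy|Drama",
    "Drama|Romance",
])

# Split every string into a list, then give each list element its own row.
all_genres = genres.str.split("|").explode()

first_only = sorted(genres.str.split("|").str[0].unique())
every_position = sorted(all_genres.unique())

print(first_only)      # genres seen in the first position
print(every_position)  # genres seen in any position
```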
In [9]:
import re
In [10]:
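search=('Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
        'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
        'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical')
# Add one column per genre holding the number of case-insensitive matches of the
# genre name in the genres string (0 or 1 when each genre is listed at most once)
for i in search:
    MERGE[i]= MERGE['genres'].str.count(i, re.I)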
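The same kind of 0/1 indicator columns can also be produced in one call. The sketch below uses `Series.str.get_dummies` on a made-up Series. It is an alternative to the loop above, not the method used in this notebook, and it creates a column for every genre present in the data rather than only those named in `search`.

```python
import pandas as pd

# Made-up genre strings in the same pipe-separated format.
genres = pd.Series(["Adventure|Comedy", "Comedy|Drama", "Drama"])

# One column per distinct genre, 1 where the row lists that genre.
indicators = genres.str.get_dummies(sep="|")
print(indicators)
```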
In [11]:
MERGE.head()
Out[11]:
   movieId  title             genres                                       user_id  age   gender  occupation  zipcode  userId  rating  ...  Animation  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical
0  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     ...  1          0            0         0       1        0        0          0        0       0
1  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     ...  1          0            0         0       1        0        0          0        0       0
2  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     ...  1          0            0         0       1        0        0          0        0       0
3  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     ...  1          0            0         0       1        0        0          0        0       0
4  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     ...  1          0            0         0       1        0        0          0        0       0
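FINAL ANSWER Q1
In [12]:
search=('Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
        'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
        'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical')
# For each genre, count the rows of MERGE whose indicator equals 1
for i in search:
    j=0
    for k in MERGE[i]:
        if(k==1):
            j=j+1
    print(i,"\n \t==>>",j)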
Adventure
==>> 6520
Comedy
==>> 9809
Action
==>> 8072
Drama
==>> 11329
Crime
==>> 5364
Children
==>> 3276
Mystery
==>> 1852
Animation
==>> 1699
Documentary
==>> 120
Thriller
==>> 7845
Horror
==>> 1326
Fantasy
==>> 2832
Western
==>> 718
Film-Noir
==>> 183
Romance
==>> 5528
Sci-Fi
==>> 4191
Musical
==>> 1850
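The figures above count rows of `MERGE`, and each row is one rating, so a user who rated several movies of a genre is counted several times. If "user count" is read as the number of distinct users, one possible sketch is shown below on made-up data. The column names mirror the notebook's, but the values are illustrative only.

```python
import pandas as pd

# Made-up rating rows with two genre indicator columns.
rows = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "Comedy": [1, 1, 0, 1, 0],
    "Drama":  [0, 1, 1, 0, 1],
})

summary = {}
for genre in ["Comedy", "Drama"]:
    has_genre = rows[genre] == 1
    summary[genre] = {
        "rating_rows": int(has_genre.sum()),                            # what the loop above counts
        "distinct_users": int(rows.loc[has_genre, "userId"].nunique()),  # each user counted once
    }
    print(genre, summary[genre])
```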
2. How many users prefer comedy movies?
In [13]:
l=0
for m in MERGE['Comedy']:
    if(m == 1):
        l=l+1
print("Comedy\n \t==>>",l)
Comedy
==>> 9809
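This count tallies rating rows tagged Comedy, which measures how often comedies were rated rather than how many users prefer them. If "prefer" is taken to mean "rates comedies highly on average", a rough sketch could look like the following. The threshold of 4.0 and the data are assumptions made for illustration, not something the assignment specifies.

```python
import pandas as pd

# Made-up ratings; Comedy is the 0/1 indicator column.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "Comedy": [1, 1, 1, 0, 1],
    "rating": [4.5, 4.0, 2.0, 5.0, 3.5],
})
THRESHOLD = 4.0  # hypothetical cut-off for "prefers comedy"

# Average comedy rating per user, then count users at or above the cut-off.
comedy = ratings[ratings["Comedy"] == 1]
mean_by_user = comedy.groupby("userId")["rating"].mean()
comedy_fans = int((mean_by_user >= THRESHOLD).sum())
print("Users preferring comedy:", comedy_fans)
```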
3. Figure out the preferred genres for each age group
In [14]:
age_labels = ['1-20','21-25','26-30','31-35','36-40','41-45','46-100']
MERGE['Age_Category'] = pd.cut(MERGE['age'], bins=[0,20,25,30,35,40,45,100], labels=age_labels)
MERGE.head()
Out[14]:
   movieId  title             genres                                       user_id  age   gender  occupation  zipcode  userId  rating  ...  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical  Age_Category
0  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     ...  0            0         0       1        0        0          0        0       0        21-25
1  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     ...  0            0         0       1        0        0          0        0       0        21-25
2  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     ...  0            0         0       1        0        0          0        0       0        21-25
3  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     ...  0            0         0       1        0        0          0        0       0        21-25
4  1        Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     ...  0            0         0       1        0        0          0        0       0        21-25
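`pd.cut` builds intervals that are closed on the right by default, so the bins here are (0, 20], (20, 25], and so on up to (45, 100]. The small sketch below, on made-up ages, shows where the boundary values land. It is why a label such as 46-100 describes the last bin accurately, since age 45 falls into the 41-45 group.

```python
import pandas as pd

# Boundary ages chosen to show which side of each edge is included.
ages = pd.Series([20, 21, 45, 46, 100])
labels = ["1-20", "21-25", "26-30", "31-35", "36-40", "41-45", "46-100"]

groups = pd.cut(ages, bins=[0, 20, 25, 30, 35, 40, 45, 100], labels=labels)
print(groups.tolist())
```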
In [15]:
%matplotlib inline
import matplotlib as mpl

import matplotlib.pyplot as plt
mpl.style.use('ggplot') # optional: for ggplot-like style
# check for latest version of Matplotlib

print('Matplotlib version: ', mpl.__version__) # >= 2.0.0
Matplotlib version: 3.1.1
In [16]:
merge_age = MERGE.groupby('Age_Category', axis=0).sum()

merge_age.head()
Out[16]:
[Table: column sums per Age_Category. Rows shown: 1-20, 21-25, 26-30, 31-35, 36-40. Columns: movieId, user_id, age, userId, rating, timestamp, Adventure, Comedy, Action, Drama, ..., Animation, Documentary, Thriller, Horror, Fantasy, Western, Film-Noir, Romance, Sci-Fi, Musical.]
In [17]:
for a in search:
    # One pie per genre: how that genre's rows are distributed across the age groups
    merge_age[a].plot(kind='pie',
                      figsize=(5, 6),
                      autopct='%1.1f%%',
                      startangle=90,
                      shadow=True,
                      )
    plt.title('Age group wise preferred genre: ' + a)
    plt.axis('equal')
    plt.show()
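Each pie above shows how one genre's rows split across the age groups, which largely reflects how many ratings each age group contributed. To read off the preferred genre within each age group, one option is to normalize every row of the grouped table by its own total and take the largest share. The sketch below uses made-up counts shaped like `merge_age`, and the numbers are not from the data set.

```python
import pandas as pd

# Made-up per-age-group genre counts, shaped like the grouped sums.
counts = pd.DataFrame(
    {"Comedy": [30, 50], "Drama": [20, 150], "Action": [50, 50]},
    index=pd.Index(["1-20", "21-25"], name="Age_Category"),
)

# Divide each row by its own total so groups of different size are comparable.
shares = counts.div(counts.sum(axis=1), axis=0)

# Genre with the largest share in each age group.
top_genre = shares.idxmax(axis=1)
print(shares.round(2))
print(top_genre)
```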
4. Which is the most preferred genre among users below 35 yrs?
In [18]:
bel_35 = merge_age.head(4)  # first four age groups: 1-20, 21-25, 26-30, 31-35
In [19]:
for b in search:
    print(b,"==>",bel_35[b].sum())
Adventure ==> 3678
Comedy ==> 6303
Action ==> 3878
Drama ==> 7115
Crime ==> 3101
Children ==> 2335
Mystery ==> 1032
Animation ==> 1401
Documentary ==> 47
Thriller ==> 4602
Horror ==> 1007
Fantasy ==> 1529
Western ==> 496
Film-Noir ==> 95
Romance ==> 2961
Sci-Fi ==> 2239
Musical ==> 1385
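To turn the printed counts into an answer, the genres can be ranked directly. The sketch below copies the counts from the output above into a Series and sorts them. Note that the first four age groups include users aged exactly 35.

```python
import pandas as pd

# Counts copied from the printed output for the first four age groups.
below_35 = pd.Series({
    "Adventure": 3678, "Comedy": 6303, "Action": 3878, "Drama": 7115,
    "Crime": 3101, "Children": 2335, "Mystery": 1032, "Animation": 1401,
    "Documentary": 47, "Thriller": 4602, "Horror": 1007, "Fantasy": 1529,
    "Western": 496, "Film-Noir": 95, "Romance": 2961, "Sci-Fi": 2239,
    "Musical": 1385,
})

ranked = below_35.sort_values(ascending=False)
print(ranked.head(3))
print("Most frequent genre:", ranked.index[0])
```

On these counts Drama ranks first, followed by Comedy and Thriller.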
5. Is there any correlation between zip code and genre?
In [20]:
import matplotlib.pyplot as plt

import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline
In [21]:
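uni_zip=Merge['zipcode'].unique  # no parentheses, so this is the bound method itself; Merge['zipcode'].unique() would return the distinct values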
uni_zip
Out[21]:
<bound method Series.unique of 0 85711

1 94043
2 32067
3 43537
4 15213
...
9737 NaN
9738 NaN
9739 NaN
9740 NaN
9741 NaN
Name: zipcode, Length: 9742, dtype: object>
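The output above is the representation of a bound method, not a list of zip codes, because `unique` was referenced without being called. A short sketch on a made-up Series shows the difference between the method reference, the call, and `nunique`:

```python
import pandas as pd

# Made-up zip codes with one duplicate and one missing value.
zipcodes = pd.Series(["85711", "94043", "85711", None])

method = zipcodes.unique         # no parentheses: a reference to the method
values = zipcodes.unique()       # parentheses: the distinct values, missing value included
n_distinct = zipcodes.nunique()  # number of distinct non-missing values

print(callable(method))
print(values)
print(n_distinct)
```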
In [22]:
MERGE.dropna(inplace=True)
In [23]:
zip_df = MERGE#.groupby('zipcode', axis=0).sum()

zip_df
Out[23]:
       movieId  title                                         genres                                       user_id  age   gender  occupation  zipcode  userId  rating  ...  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical  Age_Category
0      1        Toy Story (1995)                              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    1.0     4.0     ...  0            0         0       1        0        0          0        0       0        21-25
1      1        Toy Story (1995)                              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    5.0     4.0     ...  0            0         0       1        0        0          0        0       0        21-25
2      1        Toy Story (1995)                              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    7.0     4.5     ...  0            0         0       1        0        0          0        0       0        21-25
3      1        Toy Story (1995)                              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    15.0    2.5     ...  0            0         0       1        0        0          0        0       0        21-25
4      1        Toy Story (1995)                              Adventure|Animation|Children|Comedy|Fantasy  1.0      24.0  M       technician  85711    17.0    4.5     ...  0            0         0       1        0        0          0        0       0        21-25
...    ...      ...                                           ...                                          ...      ...   ...     ...         ...      ...     ...     ...  ...          ...       ...     ...      ...      ...        ...      ...     ...      ...
27483  1243     Rosencrantz and Guildenstern Are Dead (1990)  Comedy|Drama                                 943.0    22.0  M       student     77841    469.0   5.0     ...  0            0         0       0        0        0          0        0       0        21-25
27484  1243     Rosencrantz and Guildenstern Are Dead (1990)  Comedy|Drama                                 943.0    22.0  M       student     77841    474.0   4.0     ...  0            0         0       0        0        0          0        0       0        21-25
27485  1243     Rosencrantz and Guildenstern Are Dead (1990)  Comedy|Drama                                 943.0    22.0  M       student     77841    509.0   4.0     ...  0            0         0       0        0        0          0        0       0        21-25
27486  1243     Rosencrantz and Guildenstern Are Dead (1990)  Comedy|Drama                                 943.0    22.0  M       student     77841    599.0   3.0     ...  0            0         0       0        0        0          0        0       0        21-25
27487  1243     Rosencrantz and Guildenstern Are Dead (1990)  Comedy|Drama                                 943.0    22.0  M       student     77841    606.0   4.0     ...  0            0         0       0        0        0          0        0       0        21-25
27487 rows × 29 columns
In [24]:
df = zip_df[['zipcode','Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
             'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
             'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical']]
m= df.head(10)
m
Out[24]:
   zipcode  Adventure  Comedy  Action  Drama  Crime  Children  Mystery  Animation  Documentary  Thriller  Horror  Fantasy  Western  Film-Noir  Romance  Sci-Fi  Musical
0  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
1  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
2  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
3  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
4  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
5  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
6  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
7  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
8  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
9  85711    1          1       0       0      0      1         0        1          0            0         0       1        0        0          0        0       0
In [25]:
msk = np.random.rand(len(df)) < 0.8
# The mask has one entry per row of df, so it must index df itself.
# Indexing the 10-row preview m with it raised
# "ValueError: Item wrong length 27487 instead of 10."
train = df[msk]
test = df[~msk]
In [26]:
from sklearn import linear_model

genre_cols = ['Adventure', 'Comedy', 'Action', 'Drama', 'Crime', 'Children',
              'Mystery', 'Animation', 'Documentary', 'Thriller', 'Horror',
              'Fantasy', 'Western', 'Film-Noir', 'Romance', 'Sci-Fi', 'Musical']
regr = linear_model.LinearRegression()
# zipcode was read as text (dtype object); convert it to numbers and skip
# entries that are not purely numeric
zip_num = pd.to_numeric(train['zipcode'], errors='coerce')
ok = zip_num.notna()
x = np.asanyarray(train.loc[ok, genre_cols])
y = np.asanyarray(zip_num[ok].to_frame())
# Note: this treats the zip code as a numeric quantity, which is only a rough proxy for location
regr.fit (x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)
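A zip code is a label rather than a quantity, so a linear regression with `zipcode` as the target treats the numeric distance between codes as meaningful. One alternative way to look for an association is to compare genre shares across groups of zip codes. The sketch below is only an illustration on made-up rows. It assumes the leading digit of the zip code is a reasonable stand-in for region, and `zip_region` is a hypothetical helper column.

```python
import pandas as pd

# Made-up rows: one zip code and two genre indicators per rating.
rows = pd.DataFrame({
    "zipcode": ["85711", "85711", "94043", "06355", "06355", "94043"],
    "Comedy":  [1, 0, 1, 0, 0, 1],
    "Drama":   [0, 1, 0, 1, 1, 1],
})

# Assumption: the first digit groups zip codes into broad regions.
rows["zip_region"] = rows["zipcode"].str[0]

# The mean of a 0/1 indicator is the share of rows in that region with the genre.
genre_share = rows.groupby("zip_region")[["Comedy", "Drama"]].mean()
print(genre_share)
```

If the shares look similar across regions, that suggests little association. Large differences would be worth checking with a formal test of independence.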
In [ ]:
