Lab #2

Data Preparation
Lab #2 Dataset
MovieLens Datasets

Lab #2: MovieLens

▷ Data from https://grouplens.org/datasets/movielens/latest/


○ Use the small dataset (100k ratings)

▷ Files in package
○ movies.csv
    Movie data. Contains movie ID, title and genres
○ ratings.csv
    User rating data. Contains user ID, movie ID, rating and timestamp
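
A minimal loading sketch, assuming pandas is used (the hints later in the lab suggest it). The inline text below stands in for the two files and only repeats the sample rows shown on the following slides; with the real download, the path of each extracted CSV would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Inline stand-ins for the two files so the sketch runs without the download.
# With the real dataset, pass the path of each extracted CSV to pd.read_csv.
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4,964982703\n"
    "1,3,4,964981247\n"
    "1,6,4,964982224\n"
    "1,47,5,964983815\n"
)

movies = pd.read_csv(movies_csv)
ratings = pd.read_csv(ratings_csv)

print(movies.shape, ratings.shape)
```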


movies.csv

movieId  title                    genres
1        Toy Story (1995)         Adventure|Animation|Children|Comedy|Fantasy
2        Jumanji (1995)           Adventure|Children|Fantasy
3        Grumpier Old Men (1995)  Comedy|Romance


ratings.csv

userId  movieId  rating  timestamp
1       1        4       964982703
1       3        4       964981247
1       6        4       964982224
1       47       5       964983815

Lab #2: Question

1. Split genres from each movie into multiple binary columns of genres

movieId  title                    Action  Adventure  Animation  …
1        Toy Story (1995)         0       1          1
2        Jumanji (1995)           0       1          0
3        Grumpier Old Men (1995)  0       0          0

Hint: Use Split-Expand. Then Melt/Pivot


**Column ordering doesn’t matter
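
One possible reading of the hint, sketched with pandas on the three sample rows: split the `genres` string into positional columns, melt them into a long table, then pivot back to one 0/1 column per genre. The names `long` and `genre_flags` are illustrative only, and this is a sketch rather than the expected solution.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split-Expand: one column per genre position (0, 1, 2, ...), padded with missing values.
split = movies["genres"].str.split("|", expand=True)

# Melt: stack the positional columns into one long (movieId, title, genre) table.
long = (
    pd.concat([movies[["movieId", "title"]], split], axis=1)
    .melt(id_vars=["movieId", "title"], value_name="genre")
    .dropna(subset=["genre"])
)

# Pivot: one column per genre, 1 where the movie has it and 0 elsewhere.
genre_flags = (
    long.assign(present=1)
    .pivot_table(index=["movieId", "title"], columns="genre",
                 values="present", aggfunc="max", fill_value=0)
    .astype(int)
    .rename_axis(None, axis="columns")
    .reset_index()
)

print(genre_flags)
```

Only genres that occur in the data become columns, so this three-row sample has no Action column. `movies["genres"].str.get_dummies(sep="|")` is a shorter alternative that skips the melt and pivot steps.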

2. Extract year from title into new column

movieId  title                    year  Action  Adventure  Animation  …
1        Toy Story (1995)         1995  0       1          1
2        Jumanji (1995)           1995  0       1          0
3        Grumpier Old Men (1995)  1995  0       0          0

Hint: Use String-Extract


**Column ordering doesn’t matter
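
A sketch of the String-Extract hint, assuming the year always appears as four digits in parentheses at the end of the title. The last row ("Untitled Sample") is a made-up title added only to show what happens when no year is present.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
        "Untitled Sample",  # hypothetical row without a year
    ],
})

# Capture four digits inside parentheses at the end of the title
# (trailing whitespace is tolerated); titles without a match yield NaN.
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to a nullable integer column so missing years stay missing.
movies["year"] = pd.to_numeric(year_text).astype("Int64")

print(movies)
```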

3. Merge movies into ratings. Use movieId as the key

userId  movieId  rating  title                    year  Action  Adventure
1       1        4       Toy Story (1995)         1995  0       1
1       3        4       Grumpier Old Men (1995)  1995  0       0
1       6        4       Heat (1995)              1995  1       0

**Column ordering doesn’t matter
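
A sketch of the merge with pandas. A left join is assumed here so that every rating row is kept; the sample `movies` frame deliberately has no entry for movieId 47, which shows up as missing values after the merge.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId": [1, 1, 1, 1],
    "movieId": [1, 3, 6, 47],
    "rating": [4, 4, 4, 5],
})
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 6],
    "title": ["Toy Story (1995)", "Jumanji (1995)",
              "Grumpier Old Men (1995)", "Heat (1995)"],
    "year": [1995, 1995, 1995, 1995],
})

# Left join: keep every rating, attach movie columns where movieId matches.
merged = ratings.merge(movies, on="movieId", how="left")

print(merged)
```

With `how="inner"` the unmatched rating (movieId 47 in this sample) would be dropped instead of kept with missing values.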


4. Obtain top 10 most reviewed movies of each year
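
A sketch for this question, assuming "most reviewed" means the largest number of rating rows per movie. The data and titles are made up, and `TOP_N` is set to 2 only so that the tiny sample shows the cut-off; the lab asks for 10.

```python
import pandas as pd

TOP_N = 2  # the lab asks for 10; 2 keeps the made-up sample readable

# Made-up merged data: one row per rating.
merged = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3, 10, 10, 11],
    "title": ["Film A"] * 3 + ["Film B"] * 2 + ["Film C"]
             + ["Film D"] * 2 + ["Film E"],
    "year": [1995] * 6 + [1994] * 3,
})

# Count rating rows per movie within each year.
counts = (
    merged.groupby(["year", "movieId", "title"])
    .size()
    .reset_index(name="n_ratings")
)

# Sort by count inside each year, then keep the first TOP_N rows per year.
top_per_year = (
    counts.sort_values(["year", "n_ratings"], ascending=[True, False])
    .groupby("year")
    .head(TOP_N)
)

print(top_per_year)
```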


5. Obtain top 5 movies of each user

Expected layout: one row per userId, with columns for top movies 1 - 5
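
A sketch for this question, assuming "top" means the movies a user rated highest, with ties broken alphabetically by title (the slide does not say how to break ties). The data is made up and `TOP_N` is 2 for readability; the lab asks for 5.

```python
import pandas as pd

TOP_N = 2  # the lab asks for 5

# Made-up merged data: one row per (user, movie) rating.
merged = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "title": ["Film A", "Film B", "Film C", "Film A", "Film D"],
    "rating": [4.0, 5.0, 4.0, 3.0, 4.5],
})

# Highest rating first within each user; title breaks ties (an assumption).
ranked = merged.sort_values(
    ["userId", "rating", "title"], ascending=[True, False, True]
)
top = ranked.groupby("userId").head(TOP_N).copy()

# Number the kept rows 1..TOP_N per user, then spread them into columns.
top["rank"] = top.groupby("userId").cumcount() + 1
top_wide = top.pivot(index="userId", columns="rank", values="title")

print(top_wide)
```

Users with fewer than `TOP_N` ratings end up with missing values in the later columns.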


6. Create a user profile matrix consisting of:

- User ID
- Number of movies the user rated
- Average rating the user gives
- Median year of the movies the user rated

**Column ordering doesn’t matter

Drop NaN values if necessary
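
A sketch of the profile with a pandas group-by and named aggregation. The column names `n_ratings`, `mean_rating` and `median_year` are illustrative, and the data is made up. The median skips missing years, so only users with no known year at all end up with a missing value and are dropped.

```python
import pandas as pd

# Made-up merged data; None marks a movie whose year could not be extracted.
merged = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 4.0, 5.0, 3.0, 4.0, 2.0],
    "year": [1995, 1995, 1996, 2000, None, None],
})

profile = (
    merged.groupby("userId")
    .agg(
        n_ratings=("rating", "count"),    # number of ratings the user gave
        mean_rating=("rating", "mean"),   # average rating
        median_year=("year", "median"),   # median year, ignoring missing values
    )
    .reset_index()
    .dropna()  # removes users whose median year is undefined
)

print(profile)
```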
Lab #2: What to submit?
▷ Individual
▷ PDF report
○ Code and output
▷ Name the file as ‘Lab2-your_student_id’
▷ Submit to Lab 2 on LEB2
▷ Due date: 23:59 of 2 September 2019
