
T10 - Big Data

T-DAT-902

Nostradamovies
posters standardisation you say?

1.2.5
Nostradamovies
delivery method: Github
repository name: $CourseCode-$GroupName.git
language: Python or R is advised

• All of your source files must be included in your delivery, excluding unnecessary
files (binaries, temp files, object files, ...).

From an 80 000-row dataset containing movie posters, movie synopses and full IMDB website information,
you are asked to make movie genre predictions, based on posters only.

You are free to modify the dataset genres, e.g. replacing comedy horror, action with only comedy horror.
You can also add your own genres, such as blockbuster, teen movie or Cannes Palme winner.
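
One way to apply this kind of genre clean-up is a small normalisation helper, sketched below. It assumes that the raw genres arrive as a comma-separated string, which the dataset description does not confirm; the function name `simplify_genres` and its parameters are hypothetical.

```python
def simplify_genres(raw, dropped=frozenset({"action"}), custom=()):
    """Return a cleaned tuple of genre labels.

    raw     -- comma-separated genre string (assumed input format)
    dropped -- labels you decide to discard (illustrative default)
    custom  -- extra labels of your own, e.g. "teen movie"
    """
    cleaned = []
    for label in list(raw.split(",")) + list(custom):
        # Lowercase and collapse repeated whitespace.
        label = " ".join(label.lower().split())
        if label and label not in dropped and label not in cleaned:
            cleaned.append(label)
    return tuple(cleaned)


print(simplify_genres("Comedy Horror, Action"))
print(simplify_genres("Drama", custom=["teen movie"]))
```

Keeping the mapping in one function makes it easy to apply the same relabelling to both the training set and any rows you add later.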

This training set is not exhaustive (it does not contain Bollywood movies, for instance); you
are expected to complete it.

Neural networks, deep learning and every possible algorithm are welcome, but do not
spend time on them.
Use bullet-proof libraries instead of reinventing the wheel, and focus on data.

A document synthesizing visualizations and statistics is also required.

An interactive tool would be appreciated.

It must also include the methodology and algorithms used to make your extractions/predictions, and
feature importance for both posters and synopses (for instance, a black color for horror movies, or a large
rounded title font for comedies).
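
As an illustration of a simple poster feature such as "black color", the sketch below names the dominant color of a list of RGB pixels. It is only a rough starting point: the hue bins and thresholds are arbitrary assumptions, the function names are hypothetical, and reading pixels from an actual image file would need an image library that is not shown here.

```python
import colorsys
from collections import Counter

# Upper hue bounds in degrees; these bins are arbitrary illustrative choices.
HUE_BINS = ((30, "red"), (90, "yellow"), (150, "green"),
            (210, "cyan"), (270, "blue"), (330, "magenta"))


def color_name(rgb):
    """Map one (r, g, b) pixel with 0-255 channels to a coarse color name."""
    r, g, b = (channel / 255 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if v < 0.2:           # very dark pixels count as black
        return "black"
    if s < 0.15:          # unsaturated pixels are white or grey
        return "white" if v > 0.8 else "grey"
    hue = h * 360
    for upper_bound, name in HUE_BINS:
        if hue < upper_bound:
            return name
    return "red"          # hues from 330 to 360 wrap around to red


def dominant_color(pixels):
    """Return the most frequent coarse color name among the pixels."""
    return Counter(color_name(p) for p in pixels).most_common(1)[0][0]


print(dominant_color([(255, 255, 0)] * 3 + [(0, 0, 0)]))
```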

The relevance of your visualizations is of prime importance; display any data you consider
meaningful.

Add clustering and unsupervised analysis to extract feature importance.


SHAP values would be welcome.

Last but not least, you must extract archetypal posters from your feature importance classifications.

You are expected to display the most typical poster from your database, based on the feature importance
for each movie genre.

For example, you might obtain this kind of feature importance for
“blockbusters”:

1. names in the top quarter 88%
2. central face 60%
3. large title 55%
4. title in the lowest half 44%
5. 1 to 3 faces 44%
6. 5 text lines in the bottom 37%
7. black color 34%
8. number in the title 33%
9. ...

In this example, your program should pick a poster containing as many elements as possible,
by order of importance, for instance the adjacent poster.
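
One possible reading of "as many elements as possible, by order of importance" is to score each poster by the summed importance of the features it contains and keep the highest score. The sketch below follows that assumption, using the example percentages above. The feature keys, the poster identifiers and the function names are hypothetical, and a strict lexicographic ranking would be an equally valid interpretation.

```python
# Example importances from the "blockbusters" list, as fractions.
BLOCKBUSTER_IMPORTANCE = {
    "names_top_quarter": 0.88,
    "central_face": 0.60,
    "large_title": 0.55,
    "title_lowest_half": 0.44,
    "one_to_three_faces": 0.44,
    "five_text_lines_bottom": 0.37,
    "black_color": 0.34,
    "number_in_title": 0.33,
}


def archetype_score(poster_features, importance):
    """Sum the importance of every feature present on the poster."""
    return sum(weight for name, weight in importance.items()
               if poster_features.get(name))


def pick_archetype(posters, importance):
    """Return the id of the poster with the highest archetype score."""
    return max(posters, key=lambda pid: archetype_score(posters[pid], importance))


# Hypothetical posters described by boolean features.
posters = {
    "poster_a": {"names_top_quarter": True, "central_face": True},
    "poster_b": {"large_title": True, "black_color": True},
}
print(pick_archetype(posters, BLOCKBUSTER_IMPORTANCE))
```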

Your final algorithm will be tested on recent movies. The program should contain a function able to process
a PNG or JPEG image and make the prediction along with the features extracted.

e.g. for this independent comedy drama named Little Miss Sunshine, your output could be:

∇ Terminal - + x
∼/T-DAT-902> python genre_prediction.py little_miss_sunshine.jpg
Genre predicted : Comedy Drama
Probability : 0.76
Features extracted ->
Number of faces : 5
Colorimetry : Yellow
Similar poster : xxx
...
Feature importance for the genre predicted ->
...
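
Since the function must accept a PNG or JPEG image, a small input check can reject other files before any prediction is attempted. The sketch below inspects the file signature rather than trusting the extension; the function names are hypothetical, and the prediction step itself is deliberately left out.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"   # 8-byte PNG file signature
JPEG_SIGNATURE = b"\xff\xd8\xff"       # start-of-image marker plus next marker byte


def sniff_image_format(header):
    """Return "png", "jpeg" or None based on the first bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "png"
    if header.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def load_poster_bytes(path):
    """Read a poster file and fail early if it is neither PNG nor JPEG."""
    with open(path, "rb") as handle:
        data = handle.read()
    image_format = sniff_image_format(data[:8])
    if image_format is None:
        raise ValueError(f"{path}: expected a PNG or JPEG image")
    return image_format, data
```

A script such as genre_prediction.py could call `load_poster_bytes` on its command-line argument and pass the result to the feature extraction and prediction code.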

Rather than giving a single prediction, algorithms give a probability for each
possible genre. You can display the top 3 predictions, for example, with their
associated probabilities.
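
Ranking the per-genre probabilities and keeping the first three can be sketched as follows. The probability values are made up for illustration, and `top_k` is a hypothetical helper name.

```python
def top_k(probabilities, k=3):
    """Return the k (genre, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Illustrative values only, not real model output.
example = {"comedy drama": 0.76, "horror": 0.04, "romance": 0.12, "action": 0.08}
for genre, probability in top_k(example):
    print(f"{genre}: {probability:.2f}")
```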
