
Machine Learning Engineer Technical Challenge 3.

1. Objective

The main purpose of this challenge is to assess your skill in building a scalable data pipeline for use in an ML context.

2. Data and technology

1. You will be handed a dataset from Kaggle containing movie ratings, and a Jupyter notebook containing code with some feature
engineering.
2. You will be required to use a framework for efficient data extraction and processing (such as PySpark, or another you find
convenient). You will therefore need to have it installed and configured in your environment.
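
As a quick sanity check, a minimal sketch like the following (assuming PySpark as the framework; the application name is arbitrary) confirms the environment is configured:

    # Minimal environment check, assuming PySpark is installed (pip install pyspark).
    from pyspark.sql import SparkSession

    # Create (or reuse) a local SparkSession; the app name is arbitrary.
    spark = SparkSession.builder.appName("ratings-challenge").getOrCreate()
    print(spark.version)  # prints the Spark version if the setup is correct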

3. Problem context

A data scientist has engineered two new features that he will use in the development of an ML model. The code used to compute these
features is contained in the Jupyter notebook you were handed.

The problem is that these features are not coded in an efficient way, and it takes far too long to compute them for the whole movie ratings
dataset. You will need to help the data scientist compute these features in a scalable way, so that he or she (and the rest of the data scientists)
can have them available for tasks like building training datasets.

With this context in mind, here are the detailed instructions for the challenge.

4. Instructions

4.1 Understand the features

Take a look at the Jupyter notebook, where the data scientist explains the two features he created. Try to understand the features he wants to
compute and the approach he took to compute them.

4.2 Build data pipelines to compute these features

Using PySpark or another framework you find appropriate, build a data pipeline that extracts data from the ratings dataset and creates a table
with two columns: the two features defined in the Jupyter notebook. The pipeline should be built with the following considerations:

1. The table created should contain all the rows of the movie ratings dataset (around 20M).
2. The pipeline should be optimized, so that the table can be computed in a reasonable amount of time.
3. The data scientist is planning to add, in the near future, tens of other features which are variations of the ones he already built:
for example, the number of ratings in the previous month, the standard deviation of all previous ratings, the number of previous
ratings greater than 3, etc. You should take this into consideration when building your data pipeline. Hint: all of these features
have something in common: they can be built from the same source data, namely the list of all the ratings of a given user prior to
the current rating. You could avoid repeating the same extraction of source data for each feature by creating an intermediate
layer/file/database where you extract and persist this source data only once, and from which you can compute all these features
(see the sketch after this list).
4. The table you create with the two feature columns can be stored in whatever way you find most convenient: a file (Avro, Parquet,
CSV, etc.), a SQL table, etc.
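
To make the idea concrete, here is a minimal PySpark sketch of such a pipeline. It assumes a MovieLens-style ratings file with columns userId, movieId, rating, and timestamp, and, since the actual feature definitions live in the notebook, it uses two illustrative stand-ins (the mean and the count of the user's previous ratings). The window over a user's prior ratings plays the role of the shared intermediate layer the hint refers to:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("ratings-features").getOrCreate()

    # Load the raw ratings; schema assumed to be userId, movieId, rating, timestamp.
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # All ratings by the same user strictly before the current one -- the common
    # source data from which every planned feature can be derived.
    prior = (
        Window.partitionBy("userId")
        .orderBy("timestamp")
        .rowsBetween(Window.unboundedPreceding, -1)
    )

    # Two illustrative stand-ins for the notebook's features (assumed names).
    features = ratings.select(
        F.mean("rating").over(prior).alias("prev_mean_rating"),
        F.count("rating").over(prior).alias("prev_rating_count"),
    )

    # Persist in a columnar format so the table is cheap to reuse downstream.
    features.write.mode("overwrite").parquet("features.parquet")

With this structure, the planned variations (standard deviation of prior ratings, count of prior ratings above 3, etc.) become one more column expression over the same window; time-bounded variants like the previous-month count only need a rangeBetween version of it. Either way, the expensive per-user extraction is expressed once rather than repeated per feature.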
