
Machine Learning Engineer Technical Challenge 3.

1. Objective

The main purpose of this challenge is to assess your skill in building a scalable data pipeline for use in an ML context.

2. Data and technology

1. You will be handed a dataset from Kaggle containing movie ratings, and a Jupyter notebook containing code with some feature
engineering.
2. You will be required to use a framework for efficient data extraction and processing (such as PySpark, or another you find
convenient). You will therefore need to have it installed and configured in your environment.
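
As a quick sanity check, a minimal sketch like the following (assuming PySpark as the framework; the application name is arbitrary) confirms the environment is configured:

    # Minimal environment check, assuming PySpark is installed (pip install pyspark).
    from pyspark.sql import SparkSession

    # Create (or reuse) a local SparkSession; the app name is arbitrary.
    spark = SparkSession.builder.appName("ratings-challenge").getOrCreate()
    print(spark.version)  # prints the Spark version if the setup is correct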

3. Problem context

A data scientist has engineered two new features that he will use in the development of an ML model. The code used to compute these
features is contained in the Jupyter notebook you were handed.

The problem is that these features are not coded in an efficient way, and it takes far too long to compute them for the whole movie ratings
dataset. You will need to help the data scientist compute these features in a scalable way, so that he or she (and the rest of the data scientists)
can have them available for tasks like building training datasets.

With this context in mind, here are the detailed instructions for the challenge.

4. Instructions

4.1 Understand the features

Take a look at the Jupyter notebook, where the data scientist explains the two features he created. Try to understand the features he wants to
compute and the approach he took to compute them.

4.2 Build data pipelines to compute these features

Using PySpark or another framework you find appropriate, build a data pipeline that extracts data from the ratings dataset and creates a table
with two columns: the two features defined in the Jupyter notebook. The pipeline should be built with the following considerations:

1. The table created should contain all the rows of the movie ratings dataset (around 20M).
2. The pipeline should be optimized, so that the table can be computed in a reasonable amount of time.
3. The data scientist is planning to add, in the near future, tens of other features which are variations of the ones he already built:
for example, the number of ratings in the previous month, the standard deviation of all previous ratings, the number of previous
ratings greater than 3, etc. You should take this into consideration when building your data pipeline. Hint: all of these features
have something in common: they can be built from the same source data, namely the list of all the ratings of a given user prior to
the current rating. You could avoid repeating the same extraction of source data for each feature by creating an intermediate
layer/file/database where you extract and persist this source data only once, and from which you can compute all these features
(see the sketch after this list).
4. The table you create with the two feature columns can be stored in whatever way you find most convenient: a file (Avro, Parquet,
CSV, etc.), a SQL table, etc.
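
To make the idea concrete, here is a minimal PySpark sketch of such a pipeline. It assumes a MovieLens-style ratings file with columns userId, movieId, rating, and timestamp, and, since the actual feature definitions live in the notebook, it uses two illustrative stand-ins (the mean and the count of the user's previous ratings). The window over a user's prior ratings plays the role of the shared intermediate layer the hint refers to:

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("ratings-features").getOrCreate()

    # Load the raw ratings; schema assumed to be userId, movieId, rating, timestamp.
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    # All ratings by the same user strictly before the current one -- the common
    # source data from which every planned feature can be derived.
    prior = (
        Window.partitionBy("userId")
        .orderBy("timestamp")
        .rowsBetween(Window.unboundedPreceding, -1)
    )

    # Two illustrative stand-ins for the notebook's features (assumed names).
    features = ratings.select(
        F.mean("rating").over(prior).alias("prev_mean_rating"),
        F.count("rating").over(prior).alias("prev_rating_count"),
    )

    # Persist in a columnar format so the table is cheap to reuse downstream.
    features.write.mode("overwrite").parquet("features.parquet")

With this structure, the planned variations (standard deviation of prior ratings, count of prior ratings above 3, etc.) become one more column expression over the same window; time-bounded variants like the previous-month count only need a rangeBetween version of it. Either way, the expensive per-user extraction is expressed once rather than repeated per feature.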
