
DATA SCIENCE STUDY PLAN

Type of Data Scientist Role


Data Scientist, Analytics: lots of analytics work, focusing on products, analyzing data for
business insights, partnering with PM and engineers to turn insights into actions. E.g. DS
analytics at Facebook, Airbnb, LinkedIn.

Data Scientist, Inference: develop methods and tools that increase the rigor and efficiency of
the experimentation platform using causal inference techniques, focusing on A/B testing and on
designing and analyzing experiments.

Data Scientist, ML: programming heavy, creating models, and deploying machine learning
systems to production. Similar to ML engineers.

Basic

1. Linear Algebra
Required -
Optional -

Topics to cover
● Vectors
● Matrix multiplication, factorization
● Singular value decomposition
● Distance metrics (euclidean, cosine similarity)
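
To make the two distance metrics above concrete, here is a minimal sketch in plain Python (no NumPy, so it runs anywhere). The function names and the sample vectors are made up for illustration; in practice you would likely use a library implementation.

```python
import math


def euclidean_distance(u, v):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; it ignores vector magnitude."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Two vectors pointing the same way but with different lengths:
# they are apart by Euclidean distance, yet identical by cosine similarity.
u, v = [1.0, 2.0], [2.0, 4.0]
print(euclidean_distance(u, v))
print(cosine_similarity(u, v))
```

The takeaway for interviews: Euclidean distance is sensitive to scale, while cosine similarity only compares direction, which is why the latter is common for text and embedding vectors.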

Resources:
● 3B1B: link
● Khan Academy (beginner friendly): link

2. Calculus
Required -
Optional -

Topics to cover:
● Differential
● Integral
● Single Variable Calculus
● Multivariable Calculus
Resources:
● 3B1B: link
● Khan Academy (beginner friendly): link
● Single Variable Calculus MIT course: link
● Multivariable Calculus MIT course: link

3. Probability and Statistics


Required -

Topics to cover:
● Dependence and Independence
● Conditional Probability/Bayes’ Theorem
● Random Variables
● Probability Distribution (no need to know every distribution in depth, but understand the
definition; a couple of distributions are worth knowing in depth)
○ Continuous Distribution
○ Discrete Distribution
● Normal Distribution
○ Central Limit Theorem
○ Skewness (distribution shape)
○ Central tendency (mean, mode, median)
○ Confidence interval
● Binomial Distribution
○ Central tendency (mean, mode, median)
○ Normal approximation to the binomial
● Statistical hypothesis testing (ex. ANOVA, T-test, Z-test, Chi-Square Test)
○ P-value
● Correlation (ex. Pearson’s correlation coeff)
● Concept of multicollinearity
● Randomization of sample
● Variance and standard deviation/standard error
● Law of Large Numbers
● Type I/Type II error
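
Two of the topics above (Bayes’ Theorem and the normal approximation to the binomial) are easy to check numerically. The sketch below uses only the standard library; the function names and the test-accuracy numbers are hypothetical, chosen just to illustrate the formulas.

```python
from math import comb, sqrt
from statistics import NormalDist


def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) from Bayes' theorem."""
    # Law of total probability: overall chance of a positive test.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k) with a continuity correction."""
    mean, std = n * p, sqrt(n * p * (1 - p))
    return NormalDist(mean, std).cdf(k + 0.5)


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives.
print(posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05))

# The two values should be close when n*p and n*(1-p) are both large.
print(binomial_cdf(55, 100, 0.5), normal_approx_cdf(55, 100, 0.5))
```

With a rare condition, the posterior stays low even for a fairly accurate test, which is the classic point of this kind of interview question.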

Resources:
● Open Intro to Statistics: link
○ Quoc Anh: “In 2022, I used this book to teach statistics to a complete
beginner. This book is truly beginner friendly, i.e. doesn’t require anything
beyond high school math. While light on math, it’s still conceptually deep,
building up gradually from sample vs population, to sampling distribution,
to hypothesis testing. My student was able to read the book 80%
independently, and asked me clarifying questions about the rest. The
book does have a decent lab section, but it’s a bit too hand-holding and
thus doesn’t develop strong coding.”
● MIT OCW (beginner friendly): link
● Blitzstein's Probability Course: link
● Rigollet's Statistics Course: link
● Probability and Statistics for Data Science by Carlos Fernandez-Granda: link
● Brilliant.org Prob/stats course (not free): link
● Nick Singh’s 40 Prob/Stat Questions: link
● Bayes’ rule applied (Medium article): link
● Bayes’ rule visualized: link
● Causal Inference (pick one of these books): link

4. Coding
Required -
Optional -

Topics to cover:
● Basic data structures and techniques (e.g. hash map, list, stack, recursion, etc.)
● Object-oriented programming (e.g. classes and inheritance)
● Dataframe/Data wrangling/data manipulation:
○ Packages: pandas, numpy, seaborn, pyspark, sklearn
● Coding with probability and statistics
○ Probabilities, implementing efficient calculations
● Implementation of simple/basic ML models
○ KNN
○ Linear/Logistic regression
○ K-means clustering
● Algorithms
○ Types of sorts
○ Types of searches
○ Recursion (especially for ML/AI roles)
○ Iterations
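
As an example of "implementation of simple/basic ML models" from the list above, here is a minimal from-scratch KNN classifier. It is a sketch for practice, not a production implementation; the function name, the toy points and the labels are all made up.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training points by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated groups.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (7, 7)))
```

Sorting all points costs O(n log n) per query; a natural follow-up question is how to speed this up (e.g. a heap for the top k, or a spatial index).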

Resources:
● Any of these online courses Datacamp, EdX, Udemy, or Coursera (beginner
friendly)
● Harvard CS50: link
● OOP - Class & Inheritance: link
● Kaggle: link

5. Database/SQL
Required -

Topics to cover:
● Groupby
● Join (self-join)
● Subqueries
● Window functions: link
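
A quick way to practice all four SQL topics above without installing a database is Python’s built-in sqlite3 module. The sketch below assumes the bundled SQLite is version 3.25 or newer (needed for window functions); the `orders` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 'ann', 50.0), (2, 'ann', 20.0), (3, 'bob', 70.0),
  (4, 'bob', 10.0), (5, 'bob', 30.0);
""")

# GROUP BY: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

# Self-join: every pair of distinct orders placed by the same customer.
pairs = conn.execute(
    "SELECT a.order_id, b.order_id FROM orders a "
    "JOIN orders b ON a.customer = b.customer AND a.order_id < b.order_id "
    "ORDER BY a.order_id, b.order_id"
).fetchall()

# Subquery: orders larger than the overall average amount.
above_avg = conn.execute(
    "SELECT order_id FROM orders "
    "WHERE amount > (SELECT AVG(amount) FROM orders) ORDER BY order_id"
).fetchall()

# Window function: rank each customer's orders from largest to smallest.
ranked = conn.execute("""
SELECT customer, order_id, amount,
       RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
FROM orders
ORDER BY customer, rnk
""").fetchall()

for row in ranked:
    print(row)
```

Unlike GROUP BY, the window function keeps every row and adds the rank as an extra column, which is the distinction interviewers usually probe.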

Resources:
● Coursera: link
● StrataScratch: link
● Hackerrank: link
● SQL leetcode questions(easy-medium): link

6. Product sense
Required -
Optional -

Topics to cover:
● Product Diagnostic: understanding deviations from norm (e.g. the number of views
decreased by 10% today; examine the problem and propose solutions)
● Product Improvement: how do you improve certain existing products
● Product Design: should we add more marketing promotion emails, should we
make the Submit button smaller/larger, etc.
● How to measure the success of a product (GAME template: Goal, Action, Metric,
Evaluations)
● Decision to launch/not launch a product/service (e.g. how should we decide
whether to roll out the FB campus feature or not?)
● List useful metrics for identification (e.g. how to identify small businesses on our
platform?)
● Components of A/B testing: link

Resources:
● A brief overview of the product sense interview: link
● Trustworthy Online Controlled Experiments (A Practical Guide to A/B Testing):
link
● Emma Youtube Channel: link
● Cracking the PM interview: link
● Data masked (not free): link
● StellarPeers: link

7. Machine Learning
Required -
Optional -

Topics to cover:
● Regression vs. Classification Problems
○ Multi-class classifications
○ Multi-label classifications
○ Binary classification: tailored for heavily unbalanced dataset
● Supervised vs. unsupervised learning
● Parametric vs. non-parametric models
● Comparing advantages / disadvantages across different models
● Bias-Variance Tradeoff
● Overfitting/Underfitting
● L1/L2 Regularization
● Preprocessing: normalization, standardization, and techniques for each
advanced topic
● How to deal with missing data and high dimension data
● Cross validation
● Evaluation metrics: MAE, MSE, RMSE, accuracy, precision/recall, AUC, etc.
● Data Centric: link
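
The evaluation metrics listed above are worth being able to compute by hand. Below is a small sketch in plain Python; the function names and the toy labels are assumptions for illustration. The example is deliberately imbalanced to show why accuracy alone can mislead.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Imbalanced toy labels: 2 positives out of 10, and the model finds only one.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
print(mae([3, 5], [2, 7]), rmse([3, 5], [2, 7]))
```

Here accuracy looks high while recall shows that half of the positives were missed, which is the usual argument for precision/recall on heavily unbalanced datasets.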

ML Models
● K-Nearest Neighbor
● Linear Regression
● Logistic Regression
● Decision Tree/Random Forest
● Gradient Boosting
● Support Vector Machines
● K-Means Clustering
● Principal Components Analysis

Resources:
● An Introduction to Statistical Learning with Applications in R: link
● The Elements of Statistical Learning: link
● Stanford CS229: link
● CMU 10-601: link
● Hands-On Machine Learning with Scikit-Learn: link
● Comparative study on classic ML algorithms: link
● Machine-learning-interview repo by khangic: link

Advanced

8. Neural Network/ Deep Learning


Required -
Optional -

Topics:
● Backprop
● Loss functions
○ Regression: MSE, RMSE
○ Classification: BCE/Cross-entropy loss, Hinge loss, NLL
■ Relation with logistic regression
■ Softmax formula, smoothing
○ Distribution: KL Divergence
● Optimization
○ Gradient calculation
○ Gradient descent methods
■ Stochastic gradient descent
■ Gradient descent with momentum/acceleration
● Nesterov and similar ones
■ Newton’s method
○ Gradient Clipping
● Feature engineering
○ Potentially multi-modal, e.g. image and text
● Overfitting
○ Early stopping
○ Regularization, dropout, layer or batch normalization
● Evaluation
○ Precision, recall, f-score
○ ROC, Precision-recall curve
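
For the loss-function items above (softmax formula, cross-entropy, relation with logistic regression), a small numeric sketch can help. This is plain Python with hypothetical function names; real frameworks provide fused, vectorized versions.

```python
import math


def softmax(logits):
    """Softmax with the max subtracted first to avoid overflow in exp()."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class, via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def sigmoid(z):
    """Logistic function used by logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


# Relation with logistic regression: a 2-class softmax over logits [z, 0]
# gives the same probability as sigmoid(z).
z = 2.0
print(softmax([z, 0.0])[0], sigmoid(z))

# Large logits do not overflow thanks to the max-subtraction trick.
print(softmax([1000.0, 1000.0]))
```

The equality holds because exp(z) / (exp(z) + exp(0)) simplifies to 1 / (1 + exp(-z)).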

Resources:
● Deep Learning book by Ian Goodfellow and Yoshua Bengio: link
● Deeplearning.ai: link
● Dive into Deep Learning: link
● NYU DS-GA 1008: link
● UPenn ESE546: link

9. Natural Language Processing


Optional -

Problems
● Text classification (e.g. fraud detection, sentiment analysis, spam or not spam)
● Text understanding (e.g. application in NER, recommender system, ranking and
retrieval in general)
● Text generation (e.g. question-answering, reading comprehension, text
summarization, machine translation)

Topics
● Word embedding
○ Contextual (Transformer-based)
○ Non-contextual (word2vec)
○ Skip gram and CBOW
● Statistical model (e.g. Markov Chain Model, Conditional Random Field)
● Deep Learning model
○ Recurrent: RNN/LSTM/GRU
○ Attention:
■ Encoder: BERT
■ Decoder: GPT-1/2/3
■ Encoder-Decoder: Transformers, T5, BART
● Techniques:
○ Metrics:
■ Understanding tasks: F1, Precision, Recall
■ Generation tasks: BLEU/GLEU (for translation, summarization),
Perplexity, METEOR (for Grammar Error Correction), WER (word
error rate)
○ Decoding algorithms: beam search, conditional random field, copy-pointer
(solving out-of-vocab)
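
Of the generation metrics above, WER (word error rate) is the simplest to implement: it is the word-level edit distance divided by the reference length. A minimal sketch, assuming whitespace tokenization (real toolkits usually normalize text first); the function name and the sample sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference.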

Resources:
● Stanford NLP classes and textbooks (publicly available online): link
● Georgia Tech NLP classes and textbooks (publicly available online): link
● NLP Progress: link

10. Recommender Systems


Optional -

Topics to cover:
● Collaborative filtering (user-based)
● Content-based / item-based filtering
● Evaluation metrics: Recall@k, NDCG@k, A/B testing, Hit-Ratio
● Deep Learning RecSys (e.g. Neural Collaborative Filtering, Wide-and-Deep)
● Tasks
○ Next-item Recommendation
○ Within-Basket Recommendation
○ Session-based Recommendation
○ Topic Discovery
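
The ranking metrics in the list above (recall@k, NDCG@k) can be sketched in a few lines. This version assumes binary relevance (an item is either relevant or not); the function names and the toy item IDs are hypothetical.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance: DCG of the list over the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # Ideal ranking puts as many relevant items as possible at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
print(recall_at_k(recommended, relevant, k=3))
print(ndcg_at_k(recommended, relevant, k=3))
```

Recall@k only counts hits, while NDCG@k also rewards placing the hits near the top of the list.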
Resources:
● Overall understanding of collaborative/content-based filtering: link

11. More advanced topics


Optional -

Topics:
● Multi-task learning:
○ Issues: calibrate convergence speed and scale among loss terms
● Active learning, meta learning
● Calibration
● Reinforcement learning
● VAE
● GAN
● Graph Representation Learning
○ Aggregating neighbors:
■ Issues:
● Deep layers lead to indistinguishable node features
● Sampling large K neighbors requires huge compute
resources
■ Transductive Graph Learning (e.g. GCN, GAT): use node
embedding
■ Inductive Graph Learning (e.g. GraphSAGE): use node meta
○ Graph Transformer: limited by the predefined graph-size
● Machine Learning for Chemistry
○ Tasks:
■ Property Prediction
■ Molecule Generation
■ Property-guided Molecule Generation
○ Problems:
■ Input representation:
● SMILES: a linear string form of molecules, in which any small change
leads to a completely new molecule.
● Adj. matrix to represent molecule structures.
■ Recurrent approaches:
● Any error in SMILES predictions easily leads to error in
reconstructing molecule structures.
■ Graph approaches:
● SOTA is by graph models which are heavily limited by
graph size (mostly <100 nodes). Any molecule > 100
nodes (atoms) causes performance degradation.
● Hierarchical Graph models are recently applied to generate
motifs (substructures of molecules) to overcome the graph
size but require good distribution of motifs.
● Contrastive Learning: applications in security, ranking and retrieval

Other Resources
● Steve Nouri’s 800 DS questions: link
● Ian’s Data Science cheat sheet: link
INTERVIEW EXPERIENCE
THIS IS THE TEMPLATE TO SHARE YOUR INTERVIEW EXPERIENCE

COMPANY NAME - POSITION


Timeline
Format
Online Assessment (if any)
Hiring Manager (if any)
Onsite: If you feel comfortable, feel free to share some samples of your interview
questions
Overall Experience
How did you prepare?
Pros (if accepted)
Cons (if accepted)

Notes:
● If you accepted the offer, could you please share what are the pros and cons of
working at your current team/company?
● Are you okay with sharing your experience to a larger audience?
● Do you wish to stay anonymous?
FAIRE - DATA SCIENTIST
Timeline
- 4 weeks since online application (no referral)

Format
- Online assessment: 2-hour CodeSignal
- Hiring Manager interview: 45-min
- Onsite: 4 rounds, 1 coding, 1 ML model building, 1 product sense, 1 behavioral

Online assessment
- 2-hour data science assessment via CodeSignal. Lots of SQL (3-4 questions, basic join
+ subqueries, no window function needed) and basic ML questions (how to deal with
overfitting, what is gradient descent, what is the difference between boosting and
bagging, etc.). There might also be prob/stats questions.
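
For the "what is gradient descent" type of question mentioned above, it helps to have a tiny worked example in mind. The sketch below fits a line with batch gradient descent on mean squared error; the function name, learning rate, step count and data are all assumptions chosen so the toy problem converges.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(w, b)
```

If the learning rate is too large the updates diverge, and if it is too small convergence is slow, which is a common follow-up question.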

Hiring Manager interview


- The guy has a PhD in CS from CMU, extremely smart but very friendly. He started with
some questions about why I wanted to switch careers to do data science, and what I
had done so far to make the transition. He then dived deep into my resume to gauge
my ML knowledge, and I really mean “deep”. Basically he would keep drilling down to
the point where I had to say “I don’t know”. For example, I mentioned node2vec, an
algorithmic framework for representation learning on graphs. He then asked me to
explain what it is, why I wanted to use a graph algorithm instead of classical ML, how to
train it on a large data set, how to optimize/improve it, and what the difference would be
if I used a graph neural network instead. Whenever I said “I don’t know” or “I’m not sure”,
he would promptly and happily answer the question himself. So moral of the story: had I
tried to BS, he would have known immediately and I’d have looked very bad… Please don’t
try to BS your interviewers. DS knowledge is very broad and it is okay to not know some
new algorithms/concepts.

Onsite Interview
- 1hr coding challenge. The problem doesn’t require knowledge of any
specific algorithms or data structures, but rather, it tests whether you can implement
basic logic and control flow (loops, if/else statements), write clean/organized/idiomatic
code, and debug edge cases. The question itself is not too hard but make sure to
convey your thought process throughout the interview. Communicating calmly and clearly
with the interviewer is the key to succeeding in this round (and of course your code
has to work and pass all test cases lol)

- 1hr of model building. For this exercise, you will be expected to perform exploratory data
analysis, prepare data for model building, and train/evaluate a model on the dataset
provided. You’ll have an option to train either a regression or a classification model with
techniques of your choice (logistic regression, linear regression, tree based model etc).
The interviewer allowed me to use Google or whatever templates I had. So make sure
you have some sort of sample code for EDA, data manipulation, model
training/evaluation, etc., pretty much a whole pipeline for a standard Kaggle project.
Interviewer asked me lots of “why” questions during this interview, to test my
understanding of the dataset, my ability to break down a vague question into smaller
pieces, and my logic behind using certain algorithms.

- 45 mins of product sense. I talked to one of their Product leads to share my
experience/understanding about managing cross functional relationships, so that they could get a
sense of how I approach product development, as well as which business metrics,
solutions, and focus areas I would home in on to enhance the overall customer experience
at Faire. Pretty standard product sense questions like how do you measure the success
of certain products, how do you react when certain metrics go up and down.

- 30 mins of behavioral questions. Also very standard, tell me about a time you failed, tell
me about a time you led a team, tell me about a time you had a conflict with
coworkers/boss. Just do the STAR template, have 4-6 different projects/stories ready,
practice a few times and you should be fine

Overall Experience
- Faire’s interviews were among the most challenging and well rounded/well structured
interviews I have had. My technical interviewers all have PhDs in CS/Maths/Physics from great
schools and are highly intelligent. They are also down to earth and very helpful. They
know how to ask a question to gauge an interviewee’s knowledge, and know how to lead a
conversation. I did not enjoy my behavioral round though. The interviewer seemed
disengaged, kept reading questions from a list, and did not seem to be interested in
my answers, but oh well.

- Faire is a great company with very interesting products and a very bright future. 1 month
after I rejected their offer, the company raised Series G and my offer would have
increased quite a bit had I accepted. I still regret it to this day.

How did you prepare?


- At this point, I just use prior interviews to prepare for the next ones.
- Review my class notes, the ISLR/ESL books, and various YouTube channels
JP MORGAN - MACHINE LEARNING SCIENTIST

Timeline
- Applied with referral, interviewed for 2-3 weeks, received offer a week after final round

Format
- 1st round: 20-minute phone screening
- 2nd round: 45-minute technical interview
- Final round: one 45-60 minute presentation, three 45-minute technical interviews

Onsite
- Because of Covid, my onsite was conducted via Zoom. I had one week to prepare a
45-minute long presentation about 1-2 research projects. I presented my work to the
entire team, followed by a 15-minute Q&A.
- After the presentation, I had three back-to-back 45-minute interviews with three team
members, one of whom is my current manager. Types of questions included 1/ coding
(Leetcode easy to medium and modeling), 2/ machine learning concepts (very close to
the ML, DL, NLP topics above), and 3/ technical details of my projects. For example, I
used BERT in my previous work so I got asked about BERT a lot - make sure you study
at least the Devlin paper carefully if you say you know BERT.

Overall Experience
- I enjoyed the process and types of questions they asked. The questions cover a wide
variety of skills and topics including coding, math, mostly ML/DL, and a little bit of
advanced NLP. I think it suits me better than a coding heavy one because I don’t have a
background in CS. I also think it sets a pretty accurate expectation of the requirements of
this role and type of work.

How did you prepare?


- Presentation: I spent 60% of the time I had (<1week) preparing the slides and honing my
presentation skill. My presentation covered high-level details about
- 1/ the task/problem,
- 2/ model architecture,
- 3/ results and
- 4/ key learning messages from my research.
- Technical interviews: I studied the key topics in ML, DL and NLP (i.e. the list above) and
questions that might be asked, as if studying for an exam. I did 2-3 mock interviews with
friends who also studied data science. It would also have been great to do mock
interviews with someone already working in the industry.

Notes on the role


- It’s definitely hands-on, coding heavy (in terms of ML/DL modeling, OOP for experiment
setup and deployment), researchy (reading a lot of papers, implementing ideas from the
papers) but also efficiency/practicality-oriented (since it’s applied research, not
theoretical). Pros: you always learn something new and feel like doing cool things. Cons:
in research, sometimes hard work does not guarantee great results.

MICROSOFT - DATA SCIENTIST

Timeline
- I received the Microsoft Career Opportunities email on July 23rd, 2021
- Got the Microsoft Remote interview invitation email on Sep. 1st, 2021
- First Round: Sep. 17th, 2021, Virtual Onsite: Oct. 21st, 2021, Offer: Oct. 25th, 2021
- Since I had another offer with a deadline on Nov. 1st, HR helped me speed up the process.

Format
- First round includes many short answer questions (behavioral questions with very simple coding)
- Virtual Onsite
- Two tech rounds: tests on simple SQL coding (easy level on Leetcode), simple
Python coding, machine learning concepts, professional experience, deep
learning concepts and some statistics knowledge
- Hiring Manager: professional experience and some follow-up questions based on
your responses

Overall Experience
- Virtual onsite has three back-to-back interviews, so it is kind of tiring. However, the level
of difficulty is okay. Although I am not familiar with deep learning concepts, I think it is
acceptable.

How did you prepare?


- Since I already got my internship return offer, I did not prepare a lot. I went through
machine learning concepts, SQL statements and my own professional experience before
the interview
LYFT - DATA SCIENTIST, DECISIONS

Timeline
- Applied with referral, interviewed for about 4 weeks, decided to stop because I accepted
another offer.

Format
- 1st round: 1-hour technical interview with a data scientist
- 2nd round: take home DS assignment
- Final round: one 30-minute presentation, followed by questions

First round
- This round consists of 3 main topics: prob/stats, A/B testing, and product sense
- Prob/stats: lots of questions came straight from StrataScratch (with some minor
modification), which really surprised me (because these questions were asked 2
years ago), e.g. Cost of discount coupon, Discount Coupon Usages, Two coupons
- A/B testing: everything about A/B testing. Some very tricky questions:
- How do you determine the sample size for the A/B testing?
- We usually split the control and experimental group 50-50. But if I split it
25-75, would you still feel "confident" about the result?
- Product sense: very similar to Meta DS analytics interview
- We want to launch a shared drive product, how to evaluate the success?
- Total number of booked rides decreased by 5% since last week, how to
interpret?
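
For the two A/B testing questions above (sample size, and a 25-75 split versus 50-50), here is one way to reason with numbers. This is a sketch under stated assumptions: a two-sided two-proportion z-test, the usual textbook approximation for sample size, and made-up conversion rates. The function names are hypothetical and this is not claimed to be how Lyft does it.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate n per arm for a two-sided two-proportion z-test (equal split)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def standard_error(p_control, p_treatment, total_n, control_share):
    """Standard error of the difference in rates for a given traffic split."""
    n_control = total_n * control_share
    n_treatment = total_n * (1 - control_share)
    return sqrt(
        p_control * (1 - p_control) / n_control
        + p_treatment * (1 - p_treatment) / n_treatment
    )


# Hypothetical goal: detect a lift from 10% to 12% conversion.
print(sample_size_per_group(0.10, 0.12))

# Same total traffic, different splits: the uneven split is noisier.
even = standard_error(0.10, 0.10, total_n=10_000, control_share=0.5)
uneven = standard_error(0.10, 0.10, total_n=10_000, control_share=0.25)
print(uneven / even)
```

So a 25-75 split is still a valid experiment, but for the same total traffic the estimate has a larger standard error than a 50-50 split, which means less power or a longer test.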

Take home & Onsite:


- They gave me a set of data about rider cancellation and asked me to develop and
recommend a Lyft Cancellation Fee policy. I had about 5 days to submit a PPT deck to
present to some sort of cross-functional panel
- I have heard that lots of people spent way more than the recommended 6-10 hours on
this assignment and did not get moved forward. Since I decided to not continue due to
accepting another offer, I did not spend too much time on it.

Overall Experience
- Can only speak to the first round. I really enjoyed my conversation with the DS. He
seemed to be very genuine and enjoyed what he does at Lyft. The conversation was
awkward at first as he literally picked questions from a list (the prob/stats part), but the
longer it went, the more he opened up about what he likes/doesn’t like about the
company. Lyft definitely offers a lot in terms of TC. I’m not knowledgeable enough about ride
sharing services, so I’m not sure how they fare against Uber. I hope they remove the take
home assignment though; it seems to cause lots of friction among candidates.
VERISK - DATA SCIENTIST

Timeline
- About 2 months (no referral) from September to November

Format
- Online assessment
- Video interview
- Onsite

Online Assessment
- 90-minute assessment from HackerRank with coding, statistics and ML questions
- 2 Easy coding questions
- Multiple-choice questions relating to probabilities, statistics and ML topics

Video Interview
- A quick video interview on Wepow
- Answer 3 questions from 3 data scientists, all are behavioral questions

Hiring Manager
- We hire candidates for the entire company and they are later assigned to different
business units, so there was no hiring manager. Later I had a chance to join the DS
interview committee and learned that after the onsite interview round, the interview
committee discusses and puts each candidate into one of 3 buckets: offer made, no offer
or more discussion needed. Everyone on the committee, whether entry level or director
level, has an equal voice in evaluating the candidates.

Onsite
- Round 1: Interview with HR, mostly to learn the candidates’ interest and preferred
location for team matching

- Round 2: Behavioral interview with a Senior Data Scientist.

It was a friendly conversation where the interviewer asked about my background, my
experiences and projects. My experiences working at AI startups where I developed my
own projects which have key impacts on the business seemed to catch her attention so
she asked a lot of follow-up questions. I think in this round unique experiences/projects
helped create a good impression that set me apart from the candidate pool and made
the interviewer fight for me in the selection meeting.

- Round 3: Presentation

In this round I needed to prepare a presentation and present it to a board of 3
interviewers at manager/director level. This round was to evaluate communication skills
and business mindset, so a clear presentation with a clear business objective is the goal.
I presented the image-based recommender system project I did when interning at Etsy to
improve their recommender system. I was asked a lot of follow-up questions but they
were mostly business questions instead of technical ones.

- Round 4: Technical round

I was given a dataset several days before the interview. For this round I needed to
prepare a presentation of my analysis of the data. I just prepared a Colab
notebook with my code, analysis, visualizations and detailed interpretation. The
interviewer was a Lead Data Scientist who described himself as very nerdy. He asked
me several questions about my analysis and choice of algorithms. Fortunately, for all the
questions he asked, I had already written my explanation in my Colab notebook. So
thorough preparation helped me a lot in this round. After the interview, I was told that I was
suited for engineering roles in Data Science.

Overall Experience
- The interview experience was very well-structured. But since I was not interviewed by
the team I would join, and team matching would take place later, the interview
experience was not very personal and I couldn’t ask any questions relating to team
culture.

How did you prepare?


- Know in-depth the technical details of projects I did
- Create a DS theory cheat sheet from https://ds-interviews.org/
FAIRE - DATA SCIENTIST INTERN
(anonymous for now until I sign the paper)

Timeline
- 04/02: Submitted application
- 04/05: Invited to phone screen with recruiter
- 04/08: 30 min behavioral phone screen with recruiter
- 04/13: 45 min interview with Hiring Manager
- 04/21: 45 min Stat/Business case + 60 min live ML interview
- Offer after 2 hours
- 3 weeks from application, no referral

Format
- Hiring Manager: 45 min
- Statistics/Business case: 45 min
- ML: 1hr

Online Assessment (if any) - nope

Hiring Manager (if any)


- Interviewer has an Econ background (undergrad and PhD from Top 1) and a few years
of experience at Airbnb
- She started the interview by introducing herself, the team, and some information on
recent growth and focus areas
- We discussed one of my projects and she went fairly deep, but not too deep, into the technical
details (e.g. explain intuitively how VaderSentiment works)
- A case question: how to predict whether a brand/retailer would churn in the next month.
The interviewer asked about the approach (which model), which features, how to
measure them, how to validate the model
- During the last few minutes, the interviewer told me that she would move me to the next
round and gave me tips on how to prepare for it. She also set aside time for me to ask
her questions about her and the role. I read up on her blog and connected with her on
LinkedIn before the interview so the follow-up part was quite natural.

Onsite
- 45 min Stat: a big focus on business case. The interviewer also has an Econ background
(undergrad from UChicago working/co-authoring with 3 renowned economists – one
Nobel laureate and PhD from Top 1) and a few years of prior experience. The question
started off vaguely, also related to churning. I had some trouble navigating the scope of
the problem so the first few minutes were a bit awkward. The interviewer was
understanding and trying to help/suggest ways for me to narrow down and redefine the
problem. He mentioned that there would be two cases but we only did one (I don’t know
if he changed his mind in the middle of the conversation or because I managed time
poorly and never got to a solution for the first case). He would ask me questions on my
assumptions and push deeper, like you mentioned that you would want to predict
revenue, how would you do that. Now you have a model, which features would you
include? How do you know if the model is working? Assuming you would want to test
responses to a coupon, how would you design the experiment? What information do you
need? Explain power calculation. How do you measure the components? If we run the
experiments and find no statistically significant result, however, other PMs use the same
data on a subset of retailers and got significant results and would like to roll the feature
out, what would you do? There were a few more questions along that line. Basically, any
ideas I have, he tested why and how I would realize them. We talked about predictive
model (regression models with fixed effects), XGBoost (how do you design an XGBoost,
what would be your y labels), time series forecasting (how do you treat seasonality). I
think I brought it on myself because I mentioned these words. So yeah, make sure you
know your stuff. I wasn’t too confident about the XGBoost stuff and confused myself for a
moment. My impression is that this interview was like rapid fire.

- 1hr ML interview. Interviewer has an MS from UC Berkeley + some prior work experience,
now specializes in Risk. Data set was sent by email; googling and templates were
allowed. The interviewer stated the objective of the exercise very clearly: clean the data,
split it to train/test sets, train a classifier, evaluate on the test set, print out 10 retailers
with highest probabilities of defaulting. He said not to care about feature engineering
(there were about 40 features), everything was numerical but there were a few with
missing values. The data set was on the transaction level and the goal for prediction was
on the retailer level so I needed to make a few assumptions. The interviewer muted
himself so I was kinda on my own. I got stuck when my one line of code wasn’t running
(silly basic syntax error), he offered to help but I consulted Google and got through. I did
make some mistakes in getting the data so it was not even compatible with my model, I
just went back to fix it…I made sure to think out loud and communicated what I planned
to do and why I think it might work if I had more time. He also asked about metrics, why I
use them, and what they mean. I honestly didn’t finish the last item. I was close enough
that he said he thought I was on the right track. He did help me redefine the problem and
told me I was overthinking it hahaha. So overall, I think he was quite lenient on me. He
asked me to send the code to him and I took ~10 extra minutes to brush it up a bit more.

Overall Experience
- So far this is the best company when it comes to communication. The platform to set up
interview time and communication about the interviewers, type of interviews, and next
steps are clear and smooth. I almost never had any wait time. They promised to get back
to me within 2 business days (even if I get rejected) and they got back in 2 hours.
- Everybody was on time and easy to talk to. The ML guy talks a bit faster and has some
accent but overall nothing to complain about.

How did you prepare?


- Not the best way because I was cramming. I read everything I could on Faire’s business
model on their blog posts to familiarize myself with their key metrics. I knew I was
interviewing for the Brand pillar and they deploy a lot of XGBoost models so I touched up
on that as well. I did PM-style preparation for the case part (I’m not confident in that area
at all). For the coding part, since I knew the interviewer’s specialization is in Risk, I
searched up fraud detection datasets on Kaggle and made myself a few templates/code
snippets for the task. This is where the magic of “guessing the exam and hitting exactly
what you studied” (the Vietnamese saying “đoán đề, trúng tủ”) happens.
- Within the three weeks, I had at least one interview every day, either mock or real, so I
was becoming like a robot on certain questions. I literally did a mock interview before the
real 1.75 hour sprint, just to focus on algorithms.
- I think I caught their eye and eventually got the offer because I also have an Econ +
research-y background which aligned with most of the senior team members who
interviewed me. The models I discussed and talked about are somewhat in more
econ/stat lingo than ML.
AMAZON - RESEARCH SCIENTIST INTERN
Timeline
- 2 weeks since online application (with referral)

Format
- A quick call with a recruiter
- 1st Tech round 1 hour with a Machine Learning Scientist
- 2nd Tech round 1 hour with an Applied Scientist
- Hiring Manager interview: 45-min

Tech interview
- Tech interview was a combination of machine learning/statistics questions and coding
problems. It is very important to know details of models that you wrote down on your
resume. They will ask you not only the big picture of your work, but also technical details
of ML models that you used (e.g. How does the K-means clustering algorithm work? And what is the
objective of K-means clustering?). Also, the team that I was interviewing with was focused on
inference work, so they asked me lots of regression/causal inference questions. Knowing
fundamental concepts and explaining them concisely are important as well (e.g.
How do you explain a regression coefficient to a non-technical person?)
- The first-round interviewer asked me 3-4 SQL questions (LeetCode medium).
The second-round interviewer asked me 2 Python questions (LeetCode easy). Both
interviewers also asked one or two leadership principle questions (e.g. Dive Deep or Learn
and Be Curious).
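
For the K-means question mentioned above, a minimal sketch may help when rehearsing the answer. It is not from the interview itself: the function names (`squared_distance`, `kmeans`), the toy points, and the seeded random initialization are all illustrative assumptions. It shows Lloyd's algorithm, the usual way K-means is run: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until the assignments stop changing. The objective being minimized is the within-cluster sum of squared distances (often called inertia).

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments unchanged, so we have converged
            break
        labels = new_labels
    # Objective value: within-cluster sum of squared distances (inertia).
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, inertia = kmeans(points, k=2)
print(labels, inertia)
```

Lloyd's algorithm only guarantees a local minimum of this objective, so the result can depend on the initial centroids; that is a natural follow-up to be ready for.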

Hiring manager interview


- After the tech interviews, I had an informal interview with the hiring manager. It was mostly
going through my experience, and she asked me behavioral questions. At this stage, I
think it’s important to show your personable side and be natural (you do you!) because
you have already cleared the bar. They want to see if you are going to be a good fit for the
team.

Overall Experience
- Overall, the interview experience was good. As long as you know the
models/techniques listed on your resume in detail, plus fundamental
statistics/causal inference, I think the interview should not be bad. SQL/Python
LeetCode easy/medium problems are good to prepare with. But interview types can
differ by team, so make sure to do some research on your interview team if you
can (some interviews might not be team-specific, though). Also, if you don't
know everything the interviewers ask (or don't know the answer at all), instead of just saying
that you don't know, you can first list the things you do know and say that you are looking
forward to learning more about the rest. From my interview experience, it’s okay to say this
rather than pretend that you know and get more difficult questions later.

How did you prepare?
- Machine learning/statistics courses from my education
- Some YouTube channels were helpful (https://www.youtube.com/c/joshstarmer)

META - DATA SCIENCE INTERN

Timeline
- Applied with an internal referral (i.e. I was rejected for a different role, and the
recruiter referred me to this one)

Format
- Recruiter call (which I skipped due to internal referral)
- Coding Round
- Product Sense / Modeling Round

Recruiter Call (for a different team within the same company) - 30 min
- They asked about my background, followed by a product sense question. The question was “how
would you identify real-life best friends from Instagram” (it’s okay to share now since I
interviewed with them a year ago and they have since scrapped this question). They
look for structure in your response. I would recommend studying the STAR framework
and practicing how to first state the problem, explain possible metrics, rank the metrics, and
discuss caveats. Hardly any ML goes into product sense questions; they are
rather an exercise in thinking out loud and proposing feasible rationales on the spot.
- They might also ask behavioral questions, e.g. about previous projects.

Coding Round / Product Sense - 45 min


- The test is on SQL, and you work with a DS to solve 2-3 questions using the same set of
data. The first question is likely easy and uses one table, but the questions get
progressively harder. The level of difficulty should be around Medium on LeetCode.
- Another product sense question. Expect the question to be more technical, as you are
interviewing with a DS person.
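
The easy-to-medium progression described above can be rehearsed locally. The sketch below is only an illustration, not a question from the actual interview: the `orders` table, its columns, its rows, and both queries are hypothetical. It uses Python's built-in `sqlite3` module so it runs without a database server, and it pairs an easy single-table aggregation with a medium-style question that needs a window function.

```python
import sqlite3

# In-memory database with one hypothetical table shared by both questions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id   INTEGER,
        user_id    INTEGER,
        order_date TEXT,
        amount     REAL
    );
    INSERT INTO orders VALUES
        (1, 1, '2023-01-01', 20.0),
        (2, 1, '2023-01-05', 35.0),
        (3, 2, '2023-01-02', 50.0),
        (4, 3, '2023-01-03', 10.0),
        (5, 3, '2023-01-04', 15.0),
        (6, 3, '2023-01-09', 40.0);
""")

# Easy: total spend per user (single table, GROUP BY).
easy = conn.execute("""
    SELECT user_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

# Medium: each user's most recent order, via a window function.
# Assumption: the bundled SQLite is 3.25 or newer, which added window functions.
medium = conn.execute("""
    SELECT user_id, order_id, amount
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY order_date DESC
               ) AS rn
        FROM orders
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()

print(easy)
print(medium)
```

Talking through why the second query needs `ROW_NUMBER()` rather than a plain `GROUP BY` is the kind of thinking out loud that helps when working alongside the interviewer.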

HR Round - 45 min
- What happens in this round depends heavily on your (potential) group. If you are
applying for DS Analytics, it is more or less the same (i.e. another product sense question). I
instead had a modeling interview on a hypothetical scenario regarding revenue
projection. My HR round with a different team asked about advertisement search. It’s still
very much intuition-based. They want to hear you break down a problem, think
about what could possibly be done, and list a few algorithms under certain assumptions.
The most important knowledge for me was being able to list pros and cons of different
ML algorithms and knowing when to use which.

How did you prepare?


- ISLR, a lot of Medium articles

SAMSUNG RESEARCH AMERICA - RESEARCH INTERN

Timeline
- Applied w/o referral; it took 3 weeks to get the offer

Format
- Hired directly for the Bixby Lab
- 3 technical rounds with teammates and the direct supervisor

Onsite:
- 1st round: Medium LeetCode-style coding questions
- 2nd & 3rd round:
    - 1-2 Medium LeetCode-style coding questions or implementation of an ML
      algorithm
    - Theoretical ML questions on traditional ML and Deep Learning
    - SOTA NLP questions and other advanced Deep Learning techniques

Overall Experience
- Since this was a direct hire for Bixby Lab, I could easily narrow the scope of what to study. In
my opinion, this is standard for a research internship at a big tech company.
- The interview feedback was fast and the process intense: I had 3 interviews in 2 weeks.

Pros
- I worked on SOTA techniques for virtual assistants and was directly supervised by
well-known scientists.
- Encouraged to file patents and publish papers.

Cons
- The DevOps teams are located in South Korea. Hence, it takes time to get help,
especially in obtaining computing resources to conduct experiments.
- Bixby is not a brand-name product. Hence, it may not be a big bump on a resume.

7-ELEVEN - DATA SCIENTIST INTERN

Timeline
- Applied w/ referral and received offer in 2 weeks

Format
- Hired directly for the team
- 1 technical round directly with supervisor

Onsite:
- 1st round: ~1 hr, 1 Medium LeetCode question + theoretical questions about
ML/DL/RecSys/NLP + designing ML algorithms for specific problems

Overall Experience
- My work focused on a new RecSys task that is not actively researched. During the
internship, I was also assigned mini tasks (e.g. fine-tuning algorithms currently in
deployment) in addition to the internship project.
- The supervisor was supportive. However, the team was not familiar with Deep Learning.
Hence, I didn’t receive much help from the team.
- RecSys in convenience retail is very challenging because of the 5-min visit time.

Pros
- The task is challenging but high-impact.
- Gained an understanding of e-commerce and convenience retail.

Cons
- GPU resources are only available upon request.
- The IT department is slow in resolving equipment and software requests, which slows down
progress.
- 7-Eleven is not a big name in the Deep Learning field. Hence, it won’t be a big bump on a
resume.

CREDIT
Special thanks to these wonderful data scientists who devoted their time and effort to
completing this document!

● Somang Han
● Caroline Liongosari
● Sarah Ye
● Dat Ngo
● My Phung
● Yuchen Zhang
● Ian McDonald
