https://towardsdatascience.com/10-highly-probable-data-scientist-interview-questions-fd83f7414760 1/9
06/09/2021 20:37 10 Highly Probable Data Scientist Interview Questions | by Soner Yıldırım | Sep, 2021 | Towards Data Science
The popularity of data science attracts a lot of people from a wide range of professions
to make a career change with the goal of becoming a data scientist.
Despite the high demand for data scientists, finding your first job is highly
challenging. Unless you have solid prior job experience, interviews are where you can
show your skills and impress your potential employer.
Data science is an interdisciplinary field which covers a broad range of topics and
concepts. Thus, the number of questions that you might be asked at an interview is
very high.
However, there are some questions about the fundamentals in data science and
machine learning. These are the ones you do not want to miss. In this article, we will
go over 10 questions that are likely to be asked at a data scientist interview.
The questions are grouped into 3 main categories which are machine learning, Python,
and SQL. I will try to provide a brief answer for each question. However, I suggest
reading or studying each one in more detail afterwards.
Machine Learning
1. What is overfitting?
Overfitting in machine learning occurs when your model does not generalize well. The
model is too focused on the training set. It captures a lot of detail or even noise in the
training set. Thus, it fails to capture the general trend or the relationships in the data. If
a model is too complex compared to the data, it will probably overfit.
A strong indicator of overfitting is a large gap between the accuracy on the training
and test sets. Overfit models usually have very high accuracy on the training set, but the
test accuracy is usually unpredictable and much lower than the training accuracy.
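To make the gap concrete, here is a minimal sketch in plain Python. It is not tied to any particular library: the synthetic data, the 20% label noise, and the helper names (make_data, predict_1nn, accuracy) are all assumptions made for illustration. A 1-nearest-neighbor model memorizes the training set, so it is a convenient example of a model that is too complex for noisy data.

```python
import random


def make_data(n, rng, noise=0.2):
    """Hypothetical data: the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # inject label noise
        data.append((x, label))
    return data


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

train_acc = accuracy(train, train)  # the model has memorized these points
test_acc = accuracy(train, test)    # unseen points expose the memorized noise

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training accuracy is perfect because every training point is its own nearest neighbor. The test accuracy is typically noticeably lower, since the memorized noise does not carry over to new points.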
If it is possible, collecting more data is an effective way to reduce overfitting. You will
be giving more juice to the model so it will have more material to learn from. Data is
always valuable, especially for machine learning models.
3. What is regularization?
We have mentioned that the main reason for overfitting is a model being more complex
than necessary. Regularization is a method for reducing the model complexity.
On the other hand, L2 regularization removes a small percentage from the weights at
each iteration. These weights will get closer to zero but never actually become 0.
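The shrinking effect can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: we look at a single weight, ignore the gradient that comes from the data, and pick arbitrary values for learning_rate and l2_strength. With a penalty of 0.5 * l2_strength * w**2, each gradient step multiplies the weight by (1 - learning_rate * l2_strength).

```python
learning_rate = 0.1  # assumed value, for illustration only
l2_strength = 0.5    # assumed value, for illustration only

w = 1.0
history = [w]
for step in range(50):
    # The gradient of the penalty 0.5 * l2_strength * w**2 is l2_strength * w,
    # so each step removes 5% of the current weight.
    w -= learning_rate * l2_strength * w
    history.append(w)

print(f"after 10 steps: {history[10]:.4f}")
print(f"after 50 steps: {history[50]:.4f}")
```

Because each step removes a fixed fraction of whatever is left, the weight keeps getting smaller but, in exact arithmetic, never reaches 0.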
For instance, spam email detection is a classification task. We provide a model with
several emails marked as spam or not spam. After the model is trained with those
Clustering is an unsupervised learning task so the observations do not have any labels.
The model is expected to evaluate the observations and group them into clusters.
Similar observations are placed into the same cluster.
In the optimal case, the observations in the same cluster are as close to each other as
possible and the different clusters are as far apart as possible. An example of a
clustering task would be grouping customers based on their shopping behavior.
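As a rough sketch of what a clustering model does, the following is a bare-bones k-means routine in plain Python. K-means is only one of many clustering algorithms and is our choice for this example. The customer data (store visits per month, average basket value) and the starting centroids are made up. In practice you would also scale the features and use a library implementation.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical customers: (visits per month, average basket value)
customers = [(2, 15.0), (3, 18.0), (2, 20.0), (12, 95.0), (14, 110.0), (11, 100.0)]

# Assumed starting centroids: the first and the last customer.
clusters, centroids = kmeans(customers, centroids=[customers[0], customers[-1]])

for label, cluster in enumerate(clusters):
    print(label, cluster)
```

Note that no labels are passed in. The two groups (occasional low spenders and frequent high spenders) come only from the distances between observations.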
Python
The built-in data structures are of crucial importance. Thus, you should be familiar
with what they are and how to interact with them. List, dictionary, set, and tuple are the 4
main built-in data structures in Python.
mylist = [1,2,3]
mylist.append(4)
mylist.remove(1)
print(mylist)
[2, 3, 4]
On the other hand, tuples are immutable. Although we can access each element in a
tuple, we cannot modify its content.
mytuple = (1,2,3)
mytuple.append(4)  # raises AttributeError: 'tuple' object has no attribute 'append'
One important point to mention here is that although tuples are immutable, they can
contain mutable elements such as lists or sets.
mytuple = (1,2,["a","b","c"])
mytuple[2]
mytuple[2][0] = ["A"]
print(mytuple)
text = "Python is awesome!"
mylist = list(text)
myset = set(text)
print(mylist)
['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'a', 'w', 'e',
's', 'o', 'm', 'e', '!']
print(myset)
{'t', ' ', 'i', 'e', 'm', 'P', '!', 'y', 'o', 'h', 'n', 'a', 's',
'w'}
As we notice in the resulting objects, the list contains all the characters in the string
whereas the set only contains unique values.
Another difference is that the characters in the list are ordered based on their location
in the string. However, there is no order associated with the characters in the set.
Here is a table that summarizes the main characteristics of lists, tuples, and sets.
(image by author)
mylist[1]
"b"
In a dictionary, we have keys as the index. Thus, we can access a value by using its key.
mydict["Jane"]
26
The keys in a dictionary are unique, which makes sense because they act like an address
for the values.
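A short sketch shows what key uniqueness means in practice. The dictionary name and the entries other than Jane are made up for illustration. Assigning to an existing key replaces its value instead of adding a second entry.

```python
# Hypothetical name-to-age mapping
ages = {"John": 24, "Jane": 26}

ages["Jane"] = 27   # the key already exists, so its value is overwritten
ages["Emily"] = 30  # a new key adds a new entry

print(ages["Jane"])
print(len(ages))
```

The dictionary ends up with three entries, not four, and "Jane" now points to the most recently assigned value.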
SQL
SQL is an extremely important skill for data scientists. There are quite a number of
companies that store their data in a relational database. SQL is what is needed to
interact with relational databases.
You will probably be asked a question that involves writing a query to perform a
specific task. You might also be asked a question about general database knowledge.
8. Query example 1
Suppose we have a sales table that contains daily sales quantities of products.
(image by author)
SELECT TOP 5
SUM(SalesQty) AS TotalWeeklySales
FROM
SalesTable
(image by author)
We first extract the year and week information from the date column and then use it in
the aggregation. The sum function is used to calculate the total sales quantities.
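If it helps to see the same logic outside of SQL, here is a rough Python equivalent of the steps just described. The rows are made up, and two details are assumptions: we use ISO week numbers from the standard library (SQL Server may number weeks differently), and we read the TOP 5 in the query as the five weeks with the highest totals.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows of (SalesDate, SalesQty)
sales = [
    (date(2021, 8, 30), 10),
    (date(2021, 9, 1), 5),
    (date(2021, 9, 6), 7),
]

weekly_totals = defaultdict(int)
for sales_date, qty in sales:
    iso = sales_date.isocalendar()
    year, week = iso[0], iso[1]         # extract year and week from the date
    weekly_totals[(year, week)] += qty  # sum the quantities per (year, week)

# Assumed ordering: highest weekly totals first, keep at most five weeks.
top5 = sorted(weekly_totals.items(), key=lambda item: item[1], reverse=True)[:5]
print(top5)
```

Grouping on the (year, week) pair rather than on the week alone keeps the same week number from different years apart.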
9. Query example 2
In the same sales table, find the number of unique items that are sold each month.
SELECT
MONTH(SalesDate) AS Month,
COUNT(DISTINCT(ItemNumber)) AS ItemCount
FROM
SalesTable
GROUP BY MONTH(SalesDate)
    Month  ItemCount
1   9      1021
2   8      1021
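The COUNT(DISTINCT ...) with GROUP BY pattern maps naturally onto Python sets. The following is a sketch with made-up rows, not the actual sales table: each month collects its item numbers in a set, and the size of the set is the number of unique items.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows of (SalesDate, ItemNumber)
rows = [
    (date(2021, 8, 2), 101),
    (date(2021, 8, 15), 101),  # same item sold again in the same month
    (date(2021, 8, 20), 102),
    (date(2021, 9, 1), 101),
]

items_by_month = defaultdict(set)
for sales_date, item_number in rows:
    items_by_month[sales_date.month].add(item_number)  # sets ignore duplicates

item_counts = {month: len(items) for month, items in items_by_month.items()}
print(item_counts)
```

As in the query, grouping on the month alone would merge the same month from different years, so for data that spans several years you would group on year and month together.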
Conclusion
It is a challenging task to become a data scientist. It requires time, effort, and
dedication. Without having prior job experience, the process gets harder.
Interviews are very important to demonstrate your skills. In this article, we have
covered 10 questions that you are likely to encounter in a data scientist interview.
Thank you for reading. Please let me know if you have any feedback.