40 R Programming Interview Questions & Answers For All Levels - DataCamp
Elena Kosourova
Being well-prepared for an R programming interview is crucial to succeeding in it.
Success here has two sides: for a job hunter, it means being hired by the
company, while for the company itself, it means finding the right fit for the position.
For further R programming interview preparation, consider the following
resources:
If you're a job hunter, you should think in advance about these and similar questions and
prepare your answers. Don't worry if you haven't had any real working experience in R yet:
describing your internship in R programming or your individual or group R projects that you
completed during your studies works just fine.
Besides, if you're interviewing for an entry-level position, your interviewer doesn't necessarily
expect you to have extensive (or even any) work experience in R. Remember that since you
were invited to this interview, the company already found your resume attractive.
Entry-Level R Programming Interview Questions
Let’s start with some of the basic technical R interview questions that you might face from
your potential employer. These require you to have mastered the basics and have some
practical experience of using R.
Open source
Functional and flexible (users can define their own functions, as well as tune various
parameters of existing functions)
Relatively slow
2. Integer—whole numbers.
5. Logical—the Boolean values TRUE and FALSE, represented under the hood as 1 and 0,
respectively.
2. List—a multi-dimensional data structure used for storing values of any data type and/or
other data structures.
3. Matrix—a two-dimensional data structure used for storing values of the same data
type.
4. Data frame—a two-dimensional data structure used for storing values of any data type,
but each column must store values of the same data type.
5. How to import data in R?
Base R provides essential functions for importing data:
read.table() is the most general base R function for importing data; it takes in
tabular data with any kind of field separator, including less common ones such as |.
In practice, any of these functions can be used to import tabular data with any kind of field
and decimal separators: using them for the specified file formats is only a matter of
convention and default settings. For example, here is the syntax of the first function:
read.table(file, header = FALSE, sep = "", dec = ".") . The other functions have the same
parameters with different default settings that can always be explicitly overridden.
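As a minimal sketch of this point, the snippet below reads the same small table twice with different separator settings. The inline data and variable names are hypothetical, the `text` argument is used only to avoid needing a file on disk, and read.csv() stands in here as one of the "other functions" mentioned above.

```r
# Hypothetical pipe-separated data, supplied inline through the `text` argument
raw_pipe <- "id|price\n1|2.5\n2|3.75"
df_pipe <- read.table(text = raw_pipe, header = TRUE, sep = "|", dec = ".")

# The same numbers with ";" as the field separator and "," as the decimal separator;
# read.csv() defaults to header = TRUE, and sep/dec are overridden explicitly
raw_semi <- "id;price\n1;2,5\n2;3,75"
df_semi <- read.csv(text = raw_semi, sep = ";", dec = ",")

print(df_pipe)
print(df_semi)
```

Both calls should produce the same two-column data frame, which illustrates that the functions differ mainly in their defaults.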
The tidyverse packages readr and readxl provide some other functions for importing specific
file formats. Each of those functions can be further fine-tuned by setting various optional
parameters.
readr
readxl
To dive deeper into data loading in R, you can go through the tutorial on How to Import
Data Into R.
To install an R package directly from CRAN, we need to pass the package name enclosed in
quotation marks to the install.packages() function, as follows:
install.packages("package_name") . To install more than one package from CRAN in one go,
we need to use a character vector containing the package names enclosed in quotation
marks, as follows: install.packages(c("package_name_1", "package_name_2")) . To install an
R package manually, we first need to download the package as a zip file to our computer
and then run the install.packages() function:
install.packages("path_to_the_locally_stored_zipped_package_file", repos=NULL,
To load an installed R package into the working R environment, we can use either the library()
or the require() function. Each of them takes in the package name without quotation marks and
loads the package, e.g., library(caret) . However, the behavior of these functions is different
when they can't find the necessary package: library() throws an error and stops the
program execution, while require() outputs a warning, returns FALSE, and continues the
program execution.
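The difference can be observed with a package name that isn't installed. This is a small sketch, assuming the hypothetical name below doesn't exist on the machine; `character.only = TRUE` is used so the name can be passed as a string variable.

```r
pkg <- "aPackageThatDoesNotExist"  # hypothetical name, assumed not installed

# require() returns FALSE (with a warning, suppressed here) and execution continues
loaded <- suppressWarnings(require(pkg, character.only = TRUE))
print(loaded)

# library() raises an error; we catch it so the script can keep running
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "library() failed"
)
print(result)
```

Because require() returns a logical value, it's sometimes used inside an if statement, while library() is the usual choice at the top of a script where a missing package should stop execution.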
7. How to create a data frame in R?
1. From one or more vectors of the same length, by using the data.frame() function:
df <- data.frame(vector_1, vector_2)
2. From a matrix, by using the data.frame() function:
df <- data.frame(my_matrix)
3. From a list of vectors of the same length, by using the data.frame() function:
df <- data.frame(list_of_vectors)
4. From other data frames:
To combine the data frames horizontally (only if the data frames have the same
number of rows, and the records are the same and in the same order), by using the
cbind() function:
df <- cbind(df1, df2)
To combine the data frames vertically (only if they have an equal number of identically
named columns of the same data type and appearing in the same order), by using the
rbind() function:
df <- rbind(df1, df2)
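The approaches above can be sketched in one runnable snippet. All vector, matrix, and data frame names here are hypothetical and chosen only for illustration.

```r
# From vectors of the same length
ids <- c(1, 2, 3)
labels <- c("a", "b", "c")
df_vectors <- data.frame(ids, labels)

# From a matrix (columns without names get the defaults X1, X2)
my_matrix <- matrix(1:6, nrow = 3)
df_matrix <- data.frame(my_matrix)

# From a list of equal-length vectors
list_of_vectors <- list(ids = ids, labels = labels)
df_list <- data.frame(list_of_vectors)

# From other data frames
df_wide <- cbind(df_vectors, df_matrix)  # horizontally: same number of rows
df_long <- rbind(df_vectors, df_list)    # vertically: same column names and types

print(dim(df_wide))
print(dim(df_long))
```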
Output:
  col_1 col_2
1    10     a
2    11     b
3    12     c
4    13     d
  col_1 col_2 col_3
1    10     a     5
2    11     b     1
3    12     c    18
4    13     d    16
In each of the three cases, we can assign a single value or a vector or calculate the new
column based on the existing columns of that data frame or other data frames.
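Here is a minimal sketch of adding a column, assuming the three approaches are the $ operator, square-bracket assignment, and cbind(). That choice is an assumption; the data frame and the new values simply mirror the output shown above.

```r
df <- data.frame(col_1 = 10:13, col_2 = c("a", "b", "c", "d"))
new_values <- c(5, 1, 18, 16)

# With the $ operator
df_a <- df
df_a$col_3 <- new_values

# With square brackets
df_b <- df
df_b["col_3"] <- new_values

# With cbind()
df_c <- cbind(df, col_3 = new_values)

print(df_a)
```

The right-hand side could equally be a single value (recycled for every row) or an expression built from existing columns, such as df$col_1 * 2.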
If, instead, we have too many columns to delete, it makes more sense to keep the rest of the
columns rather than delete the columns of interest. In this case, the syntax is similar, but the
names of the columns to keep aren't preceded by a minus sign.
2. By using the built-in subset() function of base R. If we need to delete only one column,
we assign to the select parameter of the function the column name preceded by a minus
sign. To delete more than one column, we assign to this parameter a vector containing the
necessary column names, with the minus sign placed before the vector.
If, instead, we have too many columns to delete, it makes more sense to keep the rest of the
columns rather than delete the columns of interest. In this case, the syntax is similar, but no
minus sign is added.
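The subset() variants described above might look as follows. This is a sketch on a hypothetical three-column data frame.

```r
df <- data.frame(
  col_1 = 1:3,
  col_2 = c("a", "b", "c"),
  col_3 = c(TRUE, FALSE, TRUE)
)

# Delete one column
drop_one <- subset(df, select = -col_1)

# Delete several columns: the minus sign goes before the vector
drop_two <- subset(df, select = -c(col_1, col_2))

# Keep columns instead: no minus sign
keep_two <- subset(df, select = c(col_1, col_3))

print(names(drop_one))
print(names(drop_two))
print(names(keep_two))
```

Note that subset() returns a new data frame, so the result has to be assigned to a variable for the change to persist.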
User-friendly
Flexible
Multifunctional
To learn more about what RStudio is and how to install it and begin using it, you can follow
the RStudio Tutorial.
The resultant documents are shareable, fully reproducible, and of publication quality.
A wide range of static and dynamic outputs and formats, such as HTML, PDF, Microsoft
Word, interactive documents, dashboards, reports, articles, books, presentations,
applications, websites, reusable templates, etc.
1. Function name—the name of the function object that will be used for calling the
function after its definition.
2. Function parameters—the variables separated with a comma and placed inside the
parentheses that will be set to actual argument values each time we call the function.
3. Function body—a chunk of code in the curly brackets containing the operations to be
performed in a predefined order on the input arguments each time we call the function.
Usually, the function body contains the return() statement (or statements) that
returns the function output, or the print() statement (or statements) to print the
output.
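To make the three components concrete, here is a small hypothetical function; the name rectangle_area and its parameters are invented for this illustration.

```r
# Function name: rectangle_area
# Function parameters: width and height (height defaults to width, giving a square)
rectangle_area <- function(width, height = width) {
  # Function body: validate the inputs, compute, and return the result
  if (width < 0 || height < 0) {
    stop("width and height must be non-negative")
  }
  area <- width * height
  return(area)
}

print(rectangle_area(3, 4))
print(rectangle_area(5))
```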
highcharter—for easy dynamic plotting, offers many flexible features, plugins, and
themes; allows charting different R objects with one function.
Leaflet—for creating interactive maps.
ggvis—for creating interactive and highly customizable plots that can be accessed in
any browser by using Shiny's infrastructure.
patchwork—for combining several plots, usually of various types, on the same graphic.
2. Using the equal operator = , e.g., my_var = 1 —for assigning values to arguments inside
a function definition.
3. Using the rightward assignment operator -> , e.g., 1 -> my_var , which can be used in pipes.
4. Using the global assignment operators, either leftward ( <<- ) or rightward ( ->> ), e.g.,
my_var <<- 1 —for creating a global variable inside a function definition.
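A short sketch of these operators in action follows. The variable and function names are hypothetical, and the script is assumed to run at the top level, so that <<- reaches the global environment.

```r
a <- 1         # leftward assignment, the conventional choice
b = 2          # the equal sign also works at the top level
3 -> c_value   # rightward assignment: value on the left, name on the right

counter <- 0
increment <- function() {
  # <<- modifies `counter` in the enclosing environment instead of creating a local copy
  counter <<- counter + 1
  invisible(counter)
}
increment()
increment()

print(c(a, b, c_value, counter))
```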
If a variable name starts with a dot, this dot can't be followed by a digit.
Reserved words in R ( TRUE , for , NULL , etc.) can't be used as variable names.
In the course Writing Efficient R Code, you'll find further best practices for writing code in R.
17. What types of loops exist in R, and what is the syntax of each type?
1. For loop: iterates over a sequence the number of times equal to its length (unless the
statements break and/or next are used) and performs the same set of operations on each
item of that sequence. This is the most common type of loop. The syntax of a for loop in R
is the following:
for (variable in sequence) {
  operations
}
2. While loop: performs the same set of operations as long as a predefined logical condition
(or several logical conditions) holds, unless the statements break and/or next are used.
Unlike for loops, we don't know in advance the number of iterations a while loop is going to
execute. Before running a while loop, we need to assign a variable (or several variables) and
then update its value inside the loop body at each iteration. The syntax of a while loop in R
is the following:
variable assignment
while (logical condition) {
  operations
  variable update
}
3. Repeat loop: repeatedly performs the same set of operations until a predefined break
condition (or several break conditions) is met. To introduce such a condition, a repeat loop
has to contain an if-statement code block, which, in turn, has to include the break
statement in its body. As with while loops, we don't know in advance the number of iterations
a repeat loop is going to execute. The syntax of a repeat loop in R is the following:
repeat {
  operations
  if (break condition) {
    break
  }
}
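The three loop types can be compared side by side in a runnable sketch. The variables and stopping conditions are hypothetical.

```r
# for loop: the number of iterations is fixed by the sequence
squares <- c()
for (i in 1:3) {
  squares <- c(squares, i^2)
}

# while loop: runs as long as the condition holds
n <- 1
while (n < 20) {
  n <- n * 2
}

# repeat loop: runs until break is reached
attempts <- 0
repeat {
  attempts <- attempts + 1
  if (attempts >= 4) {
    break
  }
}

print(squares)
print(n)
print(attempts)
```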
You can read more about Loops in R with our separate tutorial.
FUN is an aggregate function to compute the summary statistics for each group (e.g.,
mean , max , min , length , sum ).
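As an illustration, the sketch below groups a hypothetical sales table by region, assuming the x / by / FUN form of the base R aggregate() function.

```r
sales <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  amount = c(10, 20, 6, 15, 27)
)

# Mean amount per region; the summarized column is named x by default
mean_by_region <- aggregate(x = sales$amount, by = list(region = sales$region), FUN = mean)

# Number of records per region, using length as the aggregate function
count_by_region <- aggregate(x = sales$amount, by = list(region = sales$region), FUN = length)

print(mean_by_region)
print(count_by_region)
```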
2. Using the rbind() function to combine the data frames vertically—only if they have an
equal number of identically named columns of the same data type and appearing in the
same order:
3. Using the merge() function to merge data frames by a column in common, usually an ID
column:
Inner join:
df <- merge(df1, df2, by="ID")
Left join:
df <- merge(df1, df2, by="ID", all.x=TRUE)
Right join:
df <- merge(df1, df2, by="ID", all.y=TRUE)
Outer join:
df <- merge(df1, df2, by="ID", all=TRUE)
4. Using the join() function of the plyr package to merge data frames by a column in
common, usually an ID column. The type parameter takes in one of the following values:
"inner", "left", "right", or "full". The dplyr package offers the same joins as separate
functions: inner_join() , left_join() , right_join() , and full_join() .
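The four join types of merge() can be compared on two small hypothetical data frames that share an ID column; only base R is used in this sketch.

```r
df1 <- data.frame(ID = c(1, 2, 3), name = c("Ann", "Bob", "Cid"))
df2 <- data.frame(ID = c(2, 3, 4), score = c(80, 90, 70))

inner_df <- merge(df1, df2, by = "ID")                # IDs present in both
left_df  <- merge(df1, df2, by = "ID", all.x = TRUE)  # all IDs from df1
right_df <- merge(df1, df2, by = "ID", all.y = TRUE)  # all IDs from df2
full_df  <- merge(df1, df2, by = "ID", all = TRUE)    # all IDs from either

row_counts <- sapply(
  list(inner = inner_df, left = left_df, right = right_df, full = full_df),
  nrow
)
print(row_counts)
```

Rows without a match on the other side are kept with NA in the missing columns, which is why the left, right, and outer joins return more rows than the inner join here.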
Output:
  col_1 col_2
1    10    11
2    20    22
3    30    33
      [,1] [,2] [,3]
col_1   10   20   30
col_2   11   22   33
Output:
  a  b  c
1 1 11 21
2 2 12 22
3 3 13 23
4 4 14 24
  a  b
1 3 12
2 4 13
Area plot—based on a line plot, with the area below the line colored or filled with a
pattern.
Violin plot—shows both a set of descriptive statistics of the data and the
distribution shape for that data.
Heatmap—shows the magnitude of each numeric data point within the dataset.
Circular packing chart—shows an inner hierarchy of the data and the values of the
data points
etc.
The skill track Data Visualization with R will help you broaden your horizons in the field of R
graphics. If you prefer to learn data visualization in R in a broader context, explore a
thorough and beginner-friendly career track Data Scientist with R.
c(1, 2, 3, 4, 5) + c(1, 2, 3)
The second vector, due to vector recycling, will actually be converted into c(1, 2, 3, 1, 2) .
Hence, the final result of this operation will be c(2, 4, 6, 5, 7) . Because the length of the
longer vector isn't a multiple of the length of the shorter one, R also issues a warning.
While sometimes vector recycling can be beneficial (e.g., when we expect the cyclicity of
values in the vectors), more often, it's inappropriate and misleading. Hence, we should be
careful and check the vectors' lengths before performing operations on them.
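The recycling step can be made explicit with rep_len(), as in this small sketch. The variable names are hypothetical, and the warning is suppressed only to keep the output clean.

```r
long_vec <- c(1, 2, 3, 4, 5)
short_vec <- c(1, 2, 3)

# rep_len() shows what the shorter vector is recycled into
recycled <- rep_len(short_vec, length(long_vec))
print(recycled)

# R warns that 5 is not a multiple of 3, but still computes the result
total <- suppressWarnings(long_vec + short_vec)
print(total)
```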
Both next and break statements can be used in any type of loop in R: for loops, while
loops, and repeat loops. They can also be used in the same loop, e.g.:
for (i in 1:10) {
  if (i < 5)
    next
  if (i == 8)
    break
  print(i)
}
Output:
[1] 5
[1] 6
[1] 7
26. What is the difference between the str() and summary() functions
in R?
The str() function returns the structure of an R object and the overall information about it,
the exact contents of which depend on the data structure of that object. For example, for a
vector, it returns the data type of its items, the range of item indices, and the item values (or
several first values, if the vector is too long). For a data frame, it returns its class
(data.frame), the number of observations and variables, the column names, the data type of
each column, and several first values of each column.
The summary() function returns the summary statistics for an R object. It's mostly applied
to data frames and matrices, for which it returns the minimum, maximum, mean, and median
values, and the 1st and 3rd quartiles for each numeric column, while for the factor columns, it
returns the count of each level.
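To see the difference, both functions can be run on the same object. The sketch below uses a hypothetical data frame with one numeric and one factor column.

```r
df <- data.frame(
  height = c(150, 160, 170, 180),
  group = factor(c("a", "b", "a", "b"))
)

# Structure: class, dimensions, column types, and the first values
str(df)

# Summary statistics: quartiles and mean for numeric columns, level counts for factors
print(summary(df))

# summary() also works on a single column
height_stats <- summary(df$height)
group_counts <- summary(df$group)
print(height_stats)
print(group_counts)
```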
Instead, the sample() function in R can be applied only to vectors. It extracts a random
sample of the predefined size from the elements of a vector, with or without replacement.
For example, sample(my_vector, size=5, replace=TRUE)
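A short sketch of sampling with and without replacement follows; my_vector and the seed value are arbitrary choices for illustration.

```r
set.seed(42)  # fixes the random draw so the sketch is reproducible
my_vector <- 1:10

# Without replacement (the default): every element appears at most once
without_replacement <- sample(my_vector, size = 5)

# With replacement: the sample can be larger than the vector and contain repeats
with_replacement <- sample(my_vector, size = 15, replace = TRUE)

print(without_replacement)
print(with_replacement)
```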
4. Using the mutate() function of the dplyr package and the ifelse() function of the base R:
Output:
  col_1 col_2
1     1     8
2     3     6
3     5     4
4     7     2
  col_1 col_2 col_3
1     1     8     9
2     3     6     9
3     5     4    20
4     7     2    14
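The output above is consistent with a rule such as "add the two columns when col_1 is smaller than col_2, multiply them otherwise". That rule is an assumption inferred from the numbers. Under it, a base R sketch with ifelse() looks like this, with the dplyr form shown only as a comment.

```r
df <- data.frame(col_1 = c(1, 3, 5, 7), col_2 = c(8, 6, 4, 2))

# Assumed rule: sum if col_1 < col_2, product otherwise
df$col_3 <- ifelse(df$col_1 < df$col_2, df$col_1 + df$col_2, df$col_1 * df$col_2)

# With dplyr installed, the same column could be created as:
# df <- dplyr::mutate(df, col_3 = ifelse(col_1 < col_2, col_1 + col_2, col_1 * col_2))

print(df)
```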
For example, if we run the dmy() function passing to it any of the strings "05-11-2023",
"05/11/2023" or "05.11.2023", representing the same date, we'll receive the same result:
2023-11-05 . This is because in all three cases, despite having different dividing symbols, we
actually have the same pattern: the day followed by the month followed by the year.
The expression passed to the switch() function can evaluate to either a number or a
character string, and depending on this, the function behavior is different.
1. If the expression evaluates to a number, the switch() function returns the item from the
list based on positional matching (i.e., its index is equal to the number the expression
evaluates to). If the number is greater than the number of items in the list, the switch()
function returns NULL . For example:
Output:
"triangle"
2. If the expression evaluates to a character string, the switch() function returns the value
based on its name:
Output:
"tomato"
If there are multiple matches, the first matched value is returned. It's also possible to add an
unnamed item as the last argument of the switch() function that will be a default fallback
option in the case of no matches. If this default option isn't set, and if there are no matches,
the function returns NULL .
The switch() function is an efficient alternative to long if-else statements since it makes the
code less repetitive and more readable. Typically, it's used for evaluating a single expression.
We can still write more complex nested switch constructs for evaluating multiple
expressions. However, in this form, the switch() function quickly becomes hard to read and
hence loses its main advantage over if-else constructs.
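Both behaviors of switch() described above can be sketched as follows. The item values and names are hypothetical, chosen only so that the results match the "triangle" and "tomato" outputs shown earlier.

```r
# Numeric expression: positional matching
shape <- switch(2, "circle", "triangle", "square")
out_of_range <- switch(5, "circle", "triangle", "square")  # no such position

# Character expression: matching by name, with an unnamed default as the last item
food <- switch("vegetable",
  fruit = "apple",
  vegetable = "tomato",
  "unknown"
)
fallback <- switch("grain",
  fruit = "apple",
  vegetable = "tomato",
  "unknown"
)

print(shape)
print(is.null(out_of_range))
print(food)
print(fallback)
```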
apply() —takes in a data frame, a matrix, or an array and returns a vector, a list, a
matrix, or an array. This function can be applied row-wise, column-wise, or both.
lapply() —takes in a vector, a list, or a data frame and always returns a list. In the
case of a data frame as an input, this function is applied only column-wise.
sapply() takes in a vector, a list, or a data frame and returns the most simplified
data structure possible: a vector or a matrix when the individual results have a
consistent length, and a list otherwise.
tapply() —calculates summary statistics for different factors (i.e., categorical data).
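A compact sketch of the four functions on hypothetical inputs shows the different return types.

```r
m <- matrix(1:6, nrow = 2)     # 2 rows, 3 columns
row_sums <- apply(m, 1, sum)   # 1 = row-wise
col_sums <- apply(m, 2, sum)   # 2 = column-wise

df <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
as_list <- lapply(df, mean)    # always a list, one element per column
as_vector <- sapply(df, mean)  # simplified to a named numeric vector

scores <- c(80, 90, 70, 60)
team <- factor(c("a", "a", "b", "b"))
by_team <- tapply(scores, team, mean)  # mean score per factor level

print(row_sums)
print(col_sums)
print(as_vector)
print(by_team)
```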
Conditional statements:
if —tests whether a given condition is true and provides operations to perform if it's so.
if-else —tests whether a given condition is true, provides operations to perform if it's
so and another set of operations to perform in the opposite case.
if... else if... else —tests a series of conditions one by one, provides operations
to perform for each condition if it's true, and a fallback set of operations to perform if
none of those conditions is true.
switch —evaluates an expression against the items of a list and returns a value from
the list based on the results of this evaluation.
Loop statements:
while —in while loops, checks if a predefined logical condition (or several logical
conditions) is met at the current iteration.
repeat —in repeat loops, continues performing the same set of operations until a
predefined break condition (or several break conditions) is met.
Jump statements:
next —skips a particular iteration of a loop and jumps to the next one if a certain
condition is met.
break —stops and exits the loop at a particular iteration if a certain condition is met.
33. What are regular expressions, and how do you work with them in R?
A regular expression, or regex, in R or other programming languages, is a character or a
sequence of characters that describes a certain text pattern and is used for mining text
data. In R, there are two main ways of working with regular expressions:
1. Using the base R and its functions (such as grep() , regexpr() , gsub() ,
regmatches() , etc.) to locate, match, extract, and replace regex.
2. Using a specialized stringr package of the tidyverse collection. This is a more convenient
way to work with R regex since the functions of stringr have much more intuitive names
and syntax and offer more extensive functionality.
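The base R route can be sketched on a few hypothetical strings. The pattern below is a deliberately simplified email matcher, not a complete one.

```r
emails <- c("ann@example.com", "not an email", "bob@test.org")
pattern <- "[[:alnum:]._]+@[[:alnum:].]+"  # simplified, for illustration only

# Locate: which strings contain a match?
has_email <- grepl(pattern, emails)

# Extract: pull out the matched substrings
matches <- regmatches(emails, regexpr(pattern, emails))

# Replace: mask everything from the @ sign onward
masked <- gsub("@.*$", "@***", emails)

print(has_email)
print(matches)
print(masked)
```

With stringr, the corresponding functions would be str_detect(), str_extract(), and str_replace_all().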
A Guide to R Regular Expressions provides more detail about how to work with regex in R.
e1071—for support vector machines (SVM), naive Bayes classifier, bagged clustering,
fuzzy clustering, and k-nearest neighbors (KNN).
We need to create a correlation matrix of all the features and then identify the highly
correlated ones, usually those with a correlation coefficient greater than 0.75:
We need to create a training scheme to control the parameters for train, use it to build a
selected model, and then estimate the variable importance for that model:
One of the most popular methods provided by caret for automatically selecting the optimal
features is a backward selection algorithm called Recursive Feature Elimination (RFE).
We need to compute the control using a selected resampling method and a predefined list
of functions, apply the RFE algorithm passing to it the features, the target variable, the
number of features to retain, and the control, and then extract the selected predictors:
If you need to strengthen your machine learning skills in R, here is a solid and
comprehensive resource: Machine Learning Scientist with R.
36. What are correlation and covariance, and how do you calculate them
in R?
Correlation is a measure of the strength and direction of the linear relationships between
two variables. It takes values from -1 (a perfect negative correlation) to 1 (a perfect positive
correlation). Covariance is a measure of the degree of how two variables change relative to
each other and the direction of the linear relationships between them. Unlike correlation,
covariance doesn't have any range limit.
In R, we calculate the correlation with the cor() function and the covariance with the
cov() function. The syntax of both functions is identical: we need to pass in
two variables (vectors) for which we want to calculate the measure (e.g., cor(vector_1,
vector_2) or cov(vector_1, vector_2) ), or the whole data frame, if we want to calculate the
correlation or covariance between all the variables of that data frame (e.g., cor(df) or
cov(df) ). In the case of two vectors, the result is a single value; in the case of a data
frame (which must contain only numeric columns), the result is a correlation (or covariance)
matrix.
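A minimal worked example on hypothetical vectors: since y is exactly twice x, the correlation is 1, while the covariance depends on the scale of the data.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)  # y = 2 * x, a perfect positive linear relationship

r_xy <- cor(x, y)    # correlation between two vectors: a single value
cov_xy <- cov(x, y)  # covariance: 2 * var(x) = 2 * 2.5 = 5

# For a data frame, both functions return a matrix
df <- data.frame(x = x, y = y, z = rev(x))  # z decreases as x increases
cor_matrix <- cor(df)
cov_matrix <- cov(df)

print(r_xy)
print(cov_xy)
print(cor_matrix)
```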
37. List and define the various approaches to estimating model accuracy
in R.
Below are several approaches and how to implement them in the caret package of R.
Data splitting—the entire dataset is split into a training dataset and a test dataset. The
first one is used to fit the model, the second one is used to test its performance on
unseen data. This approach works particularly well on big data. To implement data
splitting in R, we need to use the createDataPartition() function and set the p
parameter to the necessary proportion of data that goes to training.
Cross-validation methods
k-fold cross-validation: the dataset is split into k subsets. The model is trained on
k-1 subsets and tested on the remaining one. The same process is repeated for all
subsets, and then the final model accuracy is estimated.
Repeated k-fold cross-validation: the principle is the same as for k-fold
cross-validation, except that the dataset is split into k subsets more than once. For each
repetition, the model accuracy is estimated, and then the final model accuracy is
calculated as the average of the model accuracy values for all repetitions.
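To show the mechanics of k-fold cross-validation, here is a sketch written in base R rather than with caret. The choice of the built-in mtcars dataset, a linear model of mpg on wt, and RMSE as the error measure are all assumptions made for illustration.

```r
set.seed(1)  # makes the random fold assignment reproducible
k <- 5
data <- mtcars

# Assign each row to one of k folds at random
fold_id <- sample(rep(1:k, length.out = nrow(data)))

# For each fold: train on the other k - 1 folds, test on the held-out fold
rmse_per_fold <- sapply(1:k, function(fold) {
  train <- data[fold_id != fold, ]
  test <- data[fold_id == fold, ]
  fit <- lm(mpg ~ wt, data = train)
  predictions <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - predictions)^2))
})

# The final estimate is the average over all folds
cv_rmse <- mean(rmse_per_fold)
print(rmse_per_fold)
print(cv_rmse)
```

Repeated k-fold cross-validation would wrap this procedure in an outer loop with a fresh fold assignment each time and average the results.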
1. Create a contingency table with the categorical variables of interest using the table()
function of base R.
2. Pass the contingency table to the chisq.test() function:
chisq.test(table)
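The two steps can be sketched end to end on hypothetical survey data with two categorical variables; the counts are invented for illustration.

```r
# Hypothetical data: 40 smokers and 60 non-smokers with their exercise habits
survey <- data.frame(
  smoker = rep(c("yes", "no"), times = c(40, 60)),
  exercise = c(
    rep(c("often", "rarely"), times = c(10, 30)),  # smokers
    rep(c("often", "rarely"), times = c(35, 25))   # non-smokers
  )
)

# Step 1: contingency table of the two categorical variables
contingency <- table(survey$smoker, survey$exercise)
print(contingency)

# Step 2: chi-squared test of independence
test_result <- chisq.test(contingency)
print(test_result$p.value)
```

A small p-value suggests that the two variables are not independent in this sample.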
You can refresh your knowledge of chi-squared tests and other hypothesis tests in our
Hypothesis Testing in R course.
with(df, a * b)
Output:
  a  b
1 1 10
2 2 20
3 3 30
10 40 90
  a  b  c
1 1 10 10
2 2 20 40
3 3 30 90
When using the within() function, to save the modifications, we need to assign the output of
the function to a variable.
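The contrast between the two functions can be sketched with the same small data frame as in the output above; the name df_new is hypothetical.

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# with() evaluates the expression using the columns and returns the result
product <- with(df, a * b)

# within() returns a modified copy of the data frame, here with a new column c
df_new <- within(df, c <- a * b)

print(product)
print(df_new)
print(names(df))  # the original data frame is unchanged
```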
Conclusion
To conclude, in this article, we considered the 40 most common R programming interview
questions and what answers are expected for each of them. Hopefully, with this information
in hand, you feel more confident and ready for a successful R interview, whether you're
looking for a job in R or the right candidate for an open position in your company.
To get some hands-on experience answering questions, check out our Practicing Statistics
Interview Questions in R course.