Data analytics is the process of examining, cleaning, transforming, and modeling data to extract useful information, draw conclusions, and support decision-making.
The four main types of data analytics are descriptive, diagnostic, predictive, and prescriptive.
Qualitative data is non-
numerical, such as text or
images, while quantitative data
is numerical, such as
measurements or counts.
Data cleansing is the process of
identifying and correcting errors,
inconsistencies, and
inaccuracies in datasets.
An outlier is a data point that
significantly differs from the rest
of the data points in a dataset.
SQL databases are relational, use
structured query language, and have a
predefined schema, while NoSQL
databases are non-relational, use
various query languages, and have a
dynamic schema.
ETL stands for Extract,
Transform, and Load. It's a
process for retrieving data from
various sources, transforming it
into a usable format, and
loading it into a database or
data warehouse.
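As a rough sketch (not the document's own example), a minimal ETL step in Python might look like the following; the file sales.csv, its columns, and the warehouse.db SQLite file are all hypothetical.

import sqlite3
import pandas as pd

# Extract: read raw data from a source file (hypothetical path and columns)
raw = pd.read_csv("sales.csv")

# Transform: tidy column names and derive a new field
raw.columns = [c.strip().lower() for c in raw.columns]
raw["revenue"] = raw["units"] * raw["unit_price"]

# Load: write the cleaned table into a SQLite "warehouse" table
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales_fact", conn, if_exists="replace", index=False)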
A primary key is a unique
identifier for each record in a
table.
A foreign key is a field in a table
that refers to the primary key of
another table, establishing a
relationship between the two
tables.
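A minimal sketch using Python's built-in sqlite3 module; the customers/orders tables are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# Primary key: uniquely identifies each customer record
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Foreign key: orders.customer_id refers back to customers.customer_id
conn.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL)""")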
An inner join returns only the records with matching values in both tables, while an outer join (left, right, or full) also returns the non-matching records from one or both tables, filling in NULL values where there is no match.
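A small illustration, again using sqlite3 with made-up customers/orders tables, showing how an inner join drops the unmatched customer while a left outer join keeps it with NULL (None in Python).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0);
""")

# Inner join: only customers with at least one matching order
print(conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.customer_id = c.id").fetchall())      # [('Ada', 25.0)]

# Left outer join: every customer, None where no order matches
print(conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id").fetchall())  # [('Ada', 25.0), ('Grace', None)]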
A histogram is a graphical
representation of the distribution
of a dataset, showing the
frequency of data points in
specified intervals.
A box plot is a graphical
representation of the distribution
of a dataset, showing the
median, quartiles, and possible
outliers.
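Assuming numpy and matplotlib are available, a quick sketch that draws both kinds of plot for the same synthetic data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(loc=50, scale=10, size=500)  # synthetic sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(data, bins=20)       # histogram: frequency of values per interval (bin)
ax1.set_title("Histogram")
ax2.boxplot(data)             # box plot: median, quartiles, whiskers, outliers
ax2.set_title("Box plot")
plt.show()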
Linear regression is a statistical
method used to model the
relationship between a
dependent variable and one or
more independent variables.
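A minimal sketch, assuming scikit-learn and synthetic data with a single independent variable:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))              # independent variable
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 100)    # dependent variable plus noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # estimates close to the true slope 3 and intercept 5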
R-squared measures the proportion of
variation in the dependent variable
explained by the independent
variables, while adjusted R-squared
adjusts for the number of independent
variables in the model.
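The adjustment can be written directly; the r2, n, and p values below are purely illustrative.

def adjusted_r_squared(r2, n, p):
    # n = number of observations, p = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(r2=0.85, n=100, p=5))   # slightly below 0.85, penalizing the extra predictors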
A confusion matrix is a table
used to evaluate the
performance of a classification
model, showing the true
positives, true negatives, false
positives, and false negatives.
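A small sketch with scikit-learn; the true and predicted labels are made up.

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[3 1], [1 3]]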
K-means clustering is an
unsupervised machine learning
algorithm used to partition data
into k clusters based on their
similarity.
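A minimal scikit-learn sketch on two synthetic blobs of points:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # blob around (0, 0)
               rng.normal(5, 1, (50, 2))])   # blob around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:5])        # cluster assignment of the first few points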
Cross-validation is a technique
used to evaluate the performance of
a model by splitting the dataset into
training and testing sets multiple
times and calculating the average
performance.
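For example, with scikit-learn's built-in iris dataset and 5-fold cross-validation (the model choice here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())   # average accuracy across the 5 train/test splits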
Overfitting occurs when a model is too complex and performs well on the training data but poorly on new, unseen data.
A decision tree is a flowchart-like
structure used in decision making and
machine learning, where each internal
node represents a feature, each
branch represents a decision rule, and
each leaf node represents an outcome.
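A short scikit-learn sketch; printing the fitted tree makes the node/branch/leaf structure visible.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node tests a feature, each branch is a decision rule,
# and each leaf holds a predicted class
print(export_text(tree))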
Supervised learning uses labeled data
and a known output, while
unsupervised learning uses unlabeled
data and discovers patterns or
structures in the data.
PCA (principal component analysis) is a dimensionality reduction technique that transforms data into a new coordinate system, reducing the number of dimensions while retaining as much information (variance) as possible.
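A minimal scikit-learn sketch reducing the 4-dimensional iris data to 2 principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 rows, 4 original dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # projected onto 2 principal components
print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance each component retains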
Time series analysis is a
statistical technique for
analyzing and forecasting data
points collected over time, such
as stock prices or weather data.
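As a rough illustration (not a full forecasting model), a pandas rolling mean on a synthetic daily series shows the basic idea of smoothing a time series to expose its trend.

import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=90, freq="D")
values = 100 + np.cumsum(np.random.default_rng(1).normal(0, 1, 90))  # synthetic price-like series
series = pd.Series(values, index=idx)

# A 7-day rolling mean smooths short-term noise and highlights the trend
print(series.rolling(window=7).mean().tail())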
A bar chart represents data using
rectangular bars, showing the
relationship between categories and
values, while a pie chart represents
data as slices of a circle, showing the
relative proportion of each category.
A pivot table is a data
summarization tool that allows
users to reorganize, filter, and
aggregate data in a spreadsheet
or database.
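A minimal pandas sketch with a made-up sales table:

import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 80, 120],
})

# Summarize total sales by region (rows) and product (columns)
print(pd.pivot_table(df, values="sales", index="region",
                     columns="product", aggfunc="sum"))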
Data normalization is the process of
scaling and transforming data to
eliminate redundancy and improve
consistency, making it easier to
compare and analyze.
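One common form is min-max scaling to the 0-1 range; a small sketch assuming scikit-learn and an illustrative single feature:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [35.0], [50.0]])     # one feature, made-up values
scaled = MinMaxScaler().fit_transform(X)           # rescales each column to [0, 1]
print(scaled.ravel())                              # [0.    0.25  0.625 1.   ]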
A data warehouse is a large,
centralized repository of data used
for reporting and analysis,
combining data from different
sources and organizing it for
efficient querying and reporting.
A data analyst collects,
processes, and analyzes data to
help organizations make
informed decisions, identify
trends, and improve efficiency.
Missing data can be handled by
imputing values (mean, median,
mode), deleting rows with
missing data, or using models
that can handle missing data.
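A pandas sketch of the first two options; the small DataFrame is made up.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["NY", "LA", None, "SF"]})

# Impute the numeric gap with the column mean (median or mode work the same way)
df["age"] = df["age"].fillna(df["age"].mean())

# Drop any rows that still contain missing values
print(df.dropna())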
Outliers can be dealt with by
deleting, transforming, or
replacing them, or by using
models that are less sensitive to
outliers.
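For instance, a common rule flags points more than 1.5 × IQR beyond the quartiles; the series below is illustrative.

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])   # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
print(s[mask])   # the outlier is dropped; it could also be capped or imputed instead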
Answer this based on your
personal experience, detailing
the problem, your approach, and
the outcome.
Ensuring data quality and accuracy
involves data cleansing, validation,
normalization, and cross-referencing
with other sources, as well as using
appropriate analytical methods and
tools.
Handling large datasets involves using
efficient data storage and processing
techniques, such as SQL databases,
parallel computing, or cloud-based
solutions, and optimizing code and
algorithms for performance.
Answer this based on your personal
experience and familiarity with the
mentioned tools, providing examples of
projects or tasks you have completed
using them.
Mention resources such as blogs,
podcasts, online courses, conferences,
and industry publications that you use
to stay informed and up-to-date.
Answer this based on your personal experience, highlighting your proficiency and giving concrete examples where possible.
By following data protection
regulations, anonymizing sensitive
data, using secure data storage and
transfer methods, and implementing
access controls and encryption when
necessary.
By setting clear goals, assessing
deadlines and project importance,
allocating resources efficiently, and
using project management tools or
techniques to stay organized.
By openly discussing the issue,
actively listening to different
perspectives, finding common ground,
and working collaboratively to reach a
resolution.
Answer this based on your personal
experience, detailing how you
simplified the information, used visual
aids, and adapted your communication
style for the audience.
By being aware of potential biases,
using diverse data sources, applying
objective analytical methods, and
cross-validating results with other
sources or techniques.
Metrics may include accuracy,
precision, recall, F1 score, R-squared,
or other relevant performance
measures, depending on the project's
goals and objectives.
By understanding the problem's
context, the nature of the data, the
desired outcome, and the assumptions
and limitations of various techniques,
selecting the most suitable method
through experimentation and
validation.
By using cross-validation, holdout
samples, comparing results with
known benchmarks, and checking
for consistency and reasonableness
in the findings.
Answer this based on your personal
experience, highlighting any
projects or tasks where you have
used APIs to gather data and the
tools or languages you used.
Mention personal strategies, such as
setting goals, focusing on incremental
progress, seeking support from
colleagues or mentors, and staying
curious and engaged with the subject
matter.
Data normalization is the process of
organizing and scaling data to improve
consistency and comparability. An
example might involve scaling the values
of a feature to a range of 0-1, making it
easier to compare with other features.
By clearly communicating the
methodology, assumptions, and limitations
of the analysis, providing evidence to
support the findings, and discussing
possible reasons for the discrepancy,
while remaining open to feedback and
further investigation.
Describe your process, which may include
breaking down the problem, identifying
relevant data and methods, iterating
through potential solutions, and seeking
input from colleagues or experts when
needed.
By prioritizing tasks, managing time
effectively, maintaining clear
communication with team members
and stakeholders, staying focused and
organized, and seeking support when
necessary.
50. What is the most important skill or quality you bring to a data analysis role?
Answer this based on your personal strengths, such as technical expertise, communication skills, problem-solving abilities, or attention to detail.
