Semester : 6
Branch : AI-DS/ML - Engineering

As per Mumbai University Syllabus

Module 1: Introduction to Data Analytics and Lifecycle
Q1: What are the key roles required for a successful analytics project?
A1: Several key roles are vital for the success of an analytics project. Data scientists are
responsible for analyzing data and extracting insights. Data engineers manage data
pipelines and infrastructure. Domain experts provide context and understanding of the
business domain. Project managers coordinate tasks and resources. Business stakeholders
provide guidance and make decisions based on analytics outcomes.

Q2: Describe the Discovery phase of the Data Analytics Lifecycle.

A2: The Discovery phase marks the beginning of an analytics project. It involves
understanding the business domain, defining the problem, identifying stakeholders,
interviewing sponsors, and formulating initial hypotheses. This phase is crucial for setting the
direction of the project and establishing a solid foundation for subsequent stages.

Q3: What activities are involved in the Data Preparation phase?

A3: The Data Preparation phase focuses on getting the data ready for analysis. This
includes setting up the analytic environment, performing data extraction, transformation, and
loading (ETL), exploring and understanding the data, cleaning and formatting it, and creating
visualizations to gain insights into its characteristics and quality.

Q4: Explain the Model Planning phase in the Data Analytics Lifecycle.
A4: The Model Planning phase is where the analytics team decides on the approach to
analyze the data. It involves exploring different variables, selecting relevant features, and
choosing appropriate models for analysis. This phase lays the groundwork for building
predictive or descriptive models that will be used to derive insights from the data.

Q5: What are the common tools used during the Model Planning phase?
A5: Common tools for the Model Planning phase include statistical software such as R or
Python, along with libraries like scikit-learn or TensorFlow. These tools provide functionalities
for data exploration, feature selection, model training, and evaluation, helping analysts in the
decision-making process.

Q6: What activities are encompassed in the Model Building phase?

A6: The Model Building phase involves implementing the selected models, training them on
the prepared data, fine-tuning model parameters for optimal performance, and evaluating
their accuracy and effectiveness. This phase requires iterative testing and refinement to
ensure the models meet the project objectives.

Q7: How do you communicate results in the Data Analytics Lifecycle?

A7: Results are communicated through clear and concise reports, presentations, or
visualizations that convey insights derived from the data analysis. Effective communication
involves tailoring the message to the audience, highlighting key findings, and providing
actionable recommendations based on the analytics outcomes.

Q8: What is the significance of the Operationalize phase?

A8: The Operationalize phase turns the results of the analysis into tangible business
outcomes. It involves deploying the developed models into production systems, integrating
them with existing workflows, and establishing mechanisms for monitoring model
performance and updating them as needed. This phase ensures that analytics solutions
deliver value in real-world scenarios.

Q9: Why is it essential to involve key stakeholders during the Discovery phase?
A9: Involving key stakeholders during the Discovery phase ensures alignment between
analytics objectives and business goals. It helps in gathering relevant domain knowledge,
clarifying requirements, and identifying potential challenges early in the project lifecycle.
Stakeholder involvement fosters collaboration and ensures that the analytics solution meets
the needs of the organization.

Q10: How does data visualization contribute to the Data Analytics Lifecycle?
A10: Data visualization plays a crucial role in the Data Analytics Lifecycle by making
complex data more accessible and understandable. It helps analysts explore data patterns,
identify trends, and communicate insights effectively to stakeholders. Visualization
techniques such as charts, graphs, and dashboards facilitate decision-making by providing
visual representations of analytical findings.

Module 2: Regression Models

Q1: What is simple Linear Regression, and what components does it involve?
A1: Simple Linear Regression is a statistical method used to model the relationship between
a single independent variable and a dependent variable. It involves fitting a regression
equation to the data, calculating fitted values and residuals, and minimizing the sum of
squared residuals through the method of least squares.
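
The least-squares fit described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `fit_simple_linear_regression` and the sample data are assumptions made for this example.

```python
from statistics import mean

def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    # Closed-form least-squares solution for one predictor
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_linear_regression(x, y)
fitted = [intercept + slope * xi for xi in x]       # fitted values
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus fitted
```

When the model includes an intercept, the residuals of a least-squares fit sum to zero (up to rounding), which is a quick sanity check on the result.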

Q2: How is Multiple Linear Regression different from Simple Linear Regression?
A2: Multiple Linear Regression extends the concept of Simple Linear Regression to include
multiple independent variables in the model. It assesses the relationship between a
dependent variable and two or more independent variables, allowing for more complex
analyses of the data.

Q3: What is Logistic Regression, and when is it used?

A3: Logistic Regression is a statistical method used for modeling the probability of a binary
outcome. It is particularly useful when the dependent variable is categorical and has only two
possible outcomes, such as “yes” or “no,” “success” or “failure.”

Q4: Describe the Logistic Response function and its significance in Logistic
Regression.
A4: The Logistic Response function, also known as the sigmoid function, maps the linear
combination of predictor variables to the probability of a binary outcome. It ensures that
predicted probabilities fall between 0 and 1, making Logistic Regression suitable for
modeling probabilities.
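
As a rough sketch of how the logistic response function turns a linear combination of predictors into a probability, consider the following. The function names, coefficients, and feature values are hypothetical and serve only to illustrate the mapping.

```python
import math

def logistic_response(z):
    """Sigmoid function: maps any real number z to a value strictly between 0 and 1."""
    # Branch on the sign of z so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_probability(intercept, coefficients, features):
    """Probability of the positive outcome for one observation."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic_response(z)

# Hypothetical model with two predictors
probability = predict_probability(-1.0, [0.8, 0.3], [2.0, 1.0])
```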

Q5: What are odds ratios, and how are they interpreted in Logistic Regression?
A5: Odds ratios represent the change in the odds of the outcome associated with a one-unit
change in the predictor variable. In Logistic Regression, they quantify the effect of each
predictor on the likelihood of the outcome, providing valuable insights into the relationship
between predictors and the outcome variable.

Q6: What are some similarities and differences between Linear Regression and
Logistic Regression?
A6: Both Linear Regression and Logistic Regression are types of regression models used
for predictive modeling. However, Linear Regression models continuous outcomes, while
Logistic Regression models binary outcomes. Additionally, the interpretation of coefficients
differs between the two models, with Linear Regression focusing on the change in the
dependent variable and Logistic Regression focusing on odds ratios.

Q7: How do you assess the performance of Regression models?
A7: Regression model performance can be assessed using various metrics such as
R-squared (for Linear Regression), accuracy, confusion matrix, and ROC curve (for Logistic
Regression). Cross-validation techniques and model selection methods help in choosing the
best-performing model.
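
The metrics named above are simple to compute by hand. The sketch below assumes binary labels coded as 0 and 1; the helper names and the sample labels and values are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 and 1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    """Share of observations that were classified correctly."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + fn + tn)

def r_squared(actual, fitted):
    """Proportion of variance in the outcome explained by the fitted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical classifier output (Logistic Regression case)
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(actual_labels, predicted_labels)
acc = accuracy(actual_labels, predicted_labels)

# Hypothetical fitted values (Linear Regression case)
r2 = r_squared([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```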

Q8: What is Stepwise Regression, and how is it used in model selection?

A8: Stepwise Regression is a method used for automatic variable selection in regression
models. It involves iteratively adding or removing predictor variables based on their
contribution to the model’s performance. Stepwise Regression helps in identifying the most
relevant variables and improving the model’s predictive accuracy.

Q9: How do you interpret the coefficients in a Logistic Regression model?

A9: The coefficients in a Logistic Regression model represent the change in the log odds of
the outcome for a one-unit change in the predictor variable. Exponentiating these
coefficients gives the odds ratios, which quantify the impact of each predictor on the
likelihood of the outcome occurring.
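
A short sketch of the exponentiation step follows. The coefficient names and values are hypothetical and do not come from any fitted model.

```python
import math

# Hypothetical log-odds coefficients from a fitted Logistic Regression model
coefficients = {"age": 0.05, "smoker": 0.7, "exercise": -0.4}

# Exponentiating a log-odds coefficient gives the odds ratio for a one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"A one-unit increase in {name} {direction} the odds by a factor of {ratio:.2f}")
```

An odds ratio above 1 means the predictor increases the odds of the outcome, a ratio below 1 means it decreases them, and a ratio of exactly 1 means no effect.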

Q10: What role does Cross-Validation play in assessing Regression models?

A10: Cross-Validation is a resampling technique used to evaluate the performance of
regression models by assessing their generalization ability to unseen data. It helps in
estimating the model’s predictive accuracy and identifying potential issues such as overfitting
or underfitting. Cross-Validation ensures that the model performs well on new data, beyond
the training dataset used for model fitting.
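
One common form of cross-validation is k-fold. The sketch below assumes contiguous folds and uses a deliberately trivial stand-in model (predicting the training mean) so that the resampling logic stays visible. The function names and data are illustrative only.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validated_mse(y, k):
    """Average held-out mean squared error of a baseline that predicts the training mean."""
    fold_errors = []
    for train, test in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)  # "fit" on the training folds
        mse = sum((y[i] - prediction) ** 2 for i in test) / len(test)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)

# Hypothetical outcome values
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]
folds = list(k_fold_indices(len(y), 3))
cv_mse = cross_validated_mse(y, k=3)
```

In practice the rows are usually shuffled before splitting (except for time series, where order matters), and the baseline would be replaced by the regression model under evaluation.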

Module 3: Time Series Analysis

Q1: What is Time Series Analysis, and why is it important?
A1: Time Series Analysis is a statistical method used to analyze data collected over time to
identify patterns, trends, and seasonality. It is crucial in various fields such as finance,
economics, and environmental science for forecasting future values based on historical data.

Q2: Explain the Box-Jenkins Methodology in Time Series Analysis.

A2: The Box-Jenkins Methodology, also known as the ARIMA modeling approach, is a
systematic procedure for building time series models. It involves three iterative
steps: identification, estimation, and diagnostic checking of the model. This methodology
helps in selecting the appropriate ARIMA model to fit the data.

Q3: What is the Autocorrelation Function (ACF), and how is it used in Time Series
Analysis?
A3: The Autocorrelation Function (ACF) measures the correlation between observations at
different time lags within a time series. It helps in identifying patterns of correlation, such as
seasonality or trend, and selecting appropriate lag values for autoregressive or moving
average models.
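
The sample autocorrelation at a given lag can be computed directly from its definition. This is a minimal sketch; the function name and the repeating series are assumptions chosen so that the seasonal pattern is easy to see.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean
    denominator = sum((value - mean) ** 2 for value in series)
    # Numerator: co-movement of the series with a copy of itself shifted by `lag`
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator

# Hypothetical series that repeats every 4 observations
series = [10, 12, 14, 12] * 6
acf_values = [autocorrelation(series, lag) for lag in range(5)]
```

For this series the autocorrelation is strongly positive at lag 4 (one full cycle) and negative at lag 2 (half a cycle), which is the kind of pattern that points to seasonality.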

Q4: What are Autoregressive (AR) Models, and how do they work in Time Series
Analysis?

A4: Autoregressive (AR) Models are time series models that use past observations of the
variable to predict future values. They assume that the current value of the variable depends
linearly on its previous values, with the addition of random error.
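
For the simplest case, an AR(1) model, the idea can be sketched as follows. The example assumes a zero-mean series with no intercept term, and the function names and data are hypothetical.

```python
def estimate_ar1_coefficient(series):
    """Least-squares estimate of phi in x_t = phi * x_(t-1) + error (no intercept)."""
    previous, current = series[:-1], series[1:]
    numerator = sum(p * c for p, c in zip(previous, current))
    denominator = sum(p * p for p in previous)
    return numerator / denominator

def forecast_next(series, phi):
    """One-step-ahead AR(1) forecast: phi times the latest observation."""
    return phi * series[-1]

# Hypothetical noise-free series that halves at every step
series = [16.0, 8.0, 4.0, 2.0, 1.0]
phi = estimate_ar1_coefficient(series)
next_value = forecast_next(series, phi)
```

Real data include random error, so the estimated coefficient will only approximate the underlying dependence, and higher-order AR(p) models use several past values instead of one.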

Q5: Describe Moving Average (MA) Models and their role in Time Series Analysis.
A5: Moving Average (MA) Models are time series models that use past forecast errors to
predict future values. They capture the short-term fluctuations in the data by modeling the
relationship between the current value and past forecast errors.

Q6: What is the difference between ARMA and ARIMA Models in Time Series
Analysis?
A6: ARMA (Autoregressive Moving Average) models combine both autoregressive and
moving average components to capture the temporal dependencies in the data. ARIMA
(Autoregressive Integrated Moving Average) models include an additional differencing step
to make the time series stationary before modeling.
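
The differencing step (the "I" in ARIMA) is easy to illustrate. The sketch below uses a hypothetical helper named `difference` and made-up series.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: each pass replaces x_t with x_t - x_(t-1)."""
    for _ in range(order):
        series = [later - earlier for earlier, later in zip(series, series[1:])]
    return series

# A linear trend becomes constant after one difference (d = 1)
linear_trend = [2, 5, 8, 11, 14]
first_difference = difference(linear_trend)

# A quadratic trend needs two differences (d = 2)
quadratic_trend = [1, 4, 9, 16, 25]
second_difference = difference(quadratic_trend, order=2)
```

Each pass shortens the series by one observation, and the number of passes needed to remove the trend corresponds to the differencing order d of the ARIMA model.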

Q7: How do you build and evaluate an ARIMA Model in Time Series Analysis?
A7: To build an ARIMA Model, you first identify the appropriate order of differencing (d),
autoregressive (p), and moving average (q) components using methods like ACF and Partial
Autocorrelation Function (PACF) plots. Then, you estimate the parameters and fit the model
to the data. Evaluation involves assessing the model’s goodness of fit using diagnostic tests
and validating its forecasting performance on holdout data.

Q8: What are some reasons to choose ARIMA models for Time Series Analysis?

A8: ARIMA models are suitable for analyzing time series data with trend, and their seasonal
extension (SARIMA) also handles seasonality patterns. They provide interpretable parameters
and can capture complex temporal
dependencies in the data. ARIMA models are widely used for forecasting applications in
various fields.

Q9: What precautions should be taken when using ARIMA models in Time Series
Analysis?
A9: When using ARIMA models, it’s essential to ensure that the time series is stationary or
can be made stationary through differencing. Care should be taken to avoid overfitting by
selecting appropriate model orders and validating the model’s performance on out-of-sample
data. Additionally, outliers and missing values should be handled appropriately before model
fitting.

Q10: How does Time Series Analysis differ from other types of data analysis?
A10: Time Series Analysis focuses specifically on data collected over time, aiming to
understand and forecast temporal patterns and trends. Unlike cross-sectional
analysis, which considers observations at a single point in time, Time Series Analysis
accounts for the sequential nature of data and the dependencies between observations.

Module 4: Text Analytics

Q1: What is the history of text mining, and how has it evolved over time?
A1: Text mining, also known as text analytics, has roots dating back to the 1960s with early
work in information retrieval and natural language processing. Over time, advancements in
computational linguistics, machine learning, and big data technologies have led to the
development of more sophisticated text mining techniques capable of extracting insights
from unstructured text data.

Q2: What are the seven practices of text analytics, and how do they contribute to the
A2: The seven practices of text analytics encompass various techniques and methodologies
used for extracting meaning from unstructured text data. These practices include text
summarization, sentiment analysis, topic modeling, named entity recognition, document
categorization, entity linking, and concept extraction. Each practice addresses different
aspects of text analysis to derive valuable insights from textual information.

Q3: What are some applications and use cases for text mining?
A3: Text mining finds applications across diverse domains such as customer feedback
analysis, market research, social media monitoring, healthcare informatics, and legal
document analysis. Use cases include sentiment analysis of product reviews, summarization
of news articles, topic modeling of research papers, and categorization of customer support
tickets.

Q4: Describe the steps involved in text analysis.

A4: Text analysis typically involves several steps, including collecting raw text data,
preprocessing and cleaning the text, representing the text in a suitable format (e.g.,
bag-of-words or word embeddings), applying text mining techniques such as TF-IDF or topic
modeling, and interpreting the results to gain insights.

Q5: Can you provide an example of text analysis?

A5: Consider the task of sentiment analysis on customer reviews of a product.
We collect raw text data from online review platforms, preprocess the text by removing
stopwords and punctuation, represent the text using TF-IDF vectors, and classify each
review as positive, negative, or neutral based on sentiment analysis algorithms. Finally, we
analyze the distribution of sentiments to understand customer opinions about the product.
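
The workflow above can be sketched end to end in plain Python. To keep the example short, it swaps the TF-IDF representation and trained classifier for a much simpler word-list (lexicon) approach. The stopword list, sentiment word lists, and reviews are all hypothetical.

```python
import string
from collections import Counter

# Hypothetical word lists, far smaller than anything used in practice
STOPWORDS = {"the", "is", "a", "and", "it", "this", "but"}
POSITIVE_WORDS = {"great", "good", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "broken"}

def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]

def classify_sentiment(text):
    """Label a review by comparing counts of positive and negative words."""
    tokens = preprocess(text)
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical customer reviews
reviews = [
    "This phone is great, and the camera is excellent!",
    "Terrible battery. It arrived broken.",
    "It is a phone.",
]
labels = [classify_sentiment(review) for review in reviews]
distribution = Counter(labels)  # how many reviews fall into each sentiment class
```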

Q6: What is Term Frequency-Inverse Document Frequency (TF-IDF), and how is it
used in text mining?
A6: Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that
reflects how important a term is to a document within a collection of documents. It is
calculated by multiplying the term frequency (how often a term appears in a document) by
the inverse document frequency (how rare the term is across all documents). TF-IDF is
commonly used for text representation and feature weighting in text mining tasks.
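
The calculation can be sketched as follows. This uses one common variant (relative term frequency multiplied by the natural log of N divided by the document frequency); libraries often apply smoothed or normalized versions, so their numbers may differ. The function name and documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical mini-corpus of product reviews
documents = [
    "the battery life is great",
    "the screen is great",
    "the battery drains fast",
]
scores = tf_idf(documents)
```

A term such as "the" that appears in every document gets a score of zero, while a term confined to one document, such as "screen", gets the highest weight there.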

Q7: How do text analytics techniques like sentiment analysis and topic modeling help
in gaining insights from textual data?
A7: Sentiment analysis helps in understanding the emotional tone of text data, enabling
businesses to gauge customer opinions, identify trends, and address issues proactively.
Topic modeling, on the other hand, organizes textual data into coherent topics or themes,
allowing analysts to uncover hidden patterns, explore relationships, and extract actionable
insights from large text collections.

Module 5: Data Analytics and Visualization with R
Q1: How can you import and export data in R?

A1: In R, you can import data from external sources using functions like read.csv() for
CSV files, read.table() for tabular data, and readRDS() for R data files. Similarly, data
can be exported using functions like write.csv() and write.table().
# Example of importing a CSV file
data <- read.csv("data.csv")
# Example of exporting data to a CSV file
write.csv(data, "exported_data.csv")

Q2: What are some common data types and attributes in R?


A2: Common data types in R include numeric, character, logical, integer, and factor.
Attributes such as names, dimensions, and class define additional properties of objects in R.

# Example of defining a numeric vector

numeric_vector <- c(1.5, 2.3, 3.7)

# Example of defining a character vector

character_vector <- c("apple", "banana", "orange")

Q3: How can descriptive statistics be computed in R?

A3: Descriptive statistics such as mean, median, standard deviation, and quartiles can be
computed in R using functions like mean(), median(), sd(), and quantile().

# Example of computing mean and standard deviation

mean_value <- mean(numeric_vector)
sd_value <- sd(numeric_vector)

Q4: What is the importance of visualization in exploratory data analysis (EDA)?

A4: Visualization is crucial in EDA as it helps in understanding the structure of the data,
identifying patterns, trends, and outliers. It provides insights that may not be apparent from
the raw numbers alone.

# Example of visualizing a histogram of a numeric variable (built-in iris dataset)
hist(iris$Sepal.Length)

Q5: How can you visualize single variables in R?

A5: Single variables can be visualized in R using histograms, boxplots, bar plots, and
density plots, among others.

# Example of visualizing a histogram of a numeric variable (built-in iris dataset)
hist(iris$Petal.Length)


Q6: What techniques are used for examining multiple variables in R?

A6: Techniques for examining multiple variables in R include scatter plots, pairs plots,
heatmaps, and correlation matrices.

# Example of creating a scatter plot matrix

pairs(iris[, 1:4])

Q7: What is the difference between data exploration and presentation in R?
A7: Data exploration in R involves understanding the structure and patterns in the data using
various visualization and statistical techniques. Presentation, on the other hand, focuses on
creating visually appealing and informative plots or reports to communicate the findings
to an audience.

# Example of data exploration with a scatter plot
plot(data$X, data$Y)

Module 6: Data Analytics and Visualization with Python


Q1: What are the essential data libraries for data analytics in Python?

A1: Essential data libraries for data analytics in Python include Pandas for data manipulation
and analysis, NumPy for numerical computing, and SciPy for scientific computing and
statistical analysis.

# Example of importing Pandas, NumPy, and SciPy

import pandas as pd
import numpy as np

import scipy

Q2: How do you perform basic plotting with Matplotlib in Python?

A2: Basic plotting with Matplotlib involves creating plots such as histograms, bar charts, pie
charts, box plots, and violin plots using functions like plt.hist(), plt.bar(),
plt.pie(), plt.boxplot(), and plt.violinplot().

# Example of creating a histogram with Matplotlib

import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # illustrative sample data
plt.hist(values, bins=5)

Q3: How can you create a box plot and a violin plot using Matplotlib in Python?
A3: You can create a box plot using plt.boxplot() and a violin plot using
plt.violinplot() functions in Matplotlib.

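# Example of creating a box plot with Matplotlib

plt.boxplot([1, 2, 2, 3, 3, 3, 4, 4, 5])  # illustrative sample data

# Example of creating a violin plot with Matplotlib

plt.violinplot([1, 2, 2, 3, 3, 3, 4, 4, 5])  # illustrative sample data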

Q4: What is the Seaborn library used for in Python?
A4: Seaborn is a Python visualization library based on Matplotlib that provides a high-level
interface for creating attractive and informative statistical graphics. It is particularly useful for
visualizing complex datasets and for creating visually appealing plots with minimal code.

# Example of importing Seaborn

import seaborn as sns

Q5: How do you create multiple plots using Seaborn in Python?

A5: You can create multiple plots using Seaborn by using functions like sns.pairplot()
for pairwise relationships between variables and sns.FacetGrid() for creating a grid of
subplots based on one or more categorical variables.

# Example of creating a pairplot with Seaborn

sns.pairplot(data)

# Example of creating multiple plots using FacetGrid in Seaborn

g = sns.FacetGrid(data, col="category")
g.map(plt.hist, "value")

Q6: What is the purpose of the regplot() function in Seaborn?

A6: The regplot() function in Seaborn is used to plot data and a linear regression model
fit. It shows the relationship between two variables along with a regression line, confidence
intervals, and a scatter plot of the data points.

# Example of creating a regression plot with Seaborn

sns.regplot(x="x_variable", y="y_variable", data=data)

Q7: How can you customize plots in Seaborn to improve their appearance?
A7: You can customize plots in Seaborn by modifying various aesthetics such as colors,
markers, line styles, labels, titles, and axis ticks using functions like sns.set_style(),
sns.set_palette(), and sns.despine().

Q8: Can you create a pie chart using Seaborn in Python?

A8: Seaborn does not have a direct function for creating pie charts. However, you can use
Matplotlib’s plt.pie() function for creating pie charts in Python.

# Example of creating a pie chart with Matplotlib

labels = ['A', 'B', 'C', 'D']
sizes = [15, 30, 45, 10]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')


Q9: How do you create a bar chart with Seaborn in Python?

A9: Seaborn provides sns.barplot() for creating bar charts, along with sns.countplot() for
plotting the counts of a categorical variable. Matplotlib’s plt.bar() function produces a
similar chart.

# Example of creating a bar chart with Seaborn

x = ['A', 'B', 'C', 'D']
y = [10, 20, 15, 25]
sns.barplot(x=x, y=y)

Q10: How does Seaborn complement Matplotlib in Python?
A10: Seaborn complements Matplotlib by providing a higher-level interface for creating
visually appealing statistical graphics with less code. It builds on top of Matplotlib’s
functionality and integrates seamlessly with Pandas data structures, making it easier to
visualize complex datasets.
