
Solution-AIDS UT2

Chp3:

Q1) Compare Data Science and Big Data.

Q2) Draw the life cycle of data science and explain in brief.

Ans:
Business Understanding:

Explanation: This initial phase involves understanding the business problem or opportunity that data
science aims to address. It requires collaboration between data scientists and stakeholders to define
project objectives, establish success criteria, and identify relevant business metrics.

Key Activities:

Identify business goals and challenges.

Define project scope and objectives.

Establish timelines and resource requirements.

Data Understanding:

Explanation: In this phase, data scientists acquire and explore the available data to gain insights into
its structure, quality, and potential usefulness for addressing the business problem. It involves
assessing data sources, understanding data types, and identifying any issues or limitations.

Key Activities:

Identify relevant data sources.

Collect and acquire data.

Identify data preprocessing needs.

Data Preparation:

Explanation: Data preparation involves cleaning, transforming, and structuring the raw data to make
it suitable for analysis. This phase is crucial for ensuring that the data is accurate, consistent, and
formatted correctly for modeling.

Key Activities:

Cleanse data by handling missing values, outliers, and inconsistencies.

Integrate and merge data from multiple sources.

Split data into training, validation, and test sets for modeling.
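A minimal sketch of these preparation activities using pandas and scikit-learn; the file name and column names here are hypothetical stand-ins:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load raw data (hypothetical file and columns)
df = pd.read_csv("customers.csv")

# Handle missing values: fill numeric gaps with the median
df["age"] = df["age"].fillna(df["age"].median())

# Handle outliers: keep incomes within the 1st-99th percentile
low, high = df["income"].quantile([0.01, 0.99])
df = df[df["income"].between(low, high)]

# Split into training (60%), validation (20%), and test (20%) sets
train, temp = train_test_split(df, test_size=0.4, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
```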

Exploratory Data Analysis (EDA):

Explanation: EDA involves visualizing and analyzing the prepared data to uncover patterns, trends,
and relationships that may inform subsequent modeling efforts. It helps data scientists gain insights
into the data and identify potential hypotheses to test.

Key Activities:

Visualize data distributions using histograms, scatter plots, and box plots.

Analyze correlations and relationships between variables.

Identify outliers and anomalies.
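A minimal sketch of these EDA activities using pandas and matplotlib; the dataset and column names are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Visualize a distribution with a histogram
df["income"].plot.hist(bins=30, title="Income distribution")
plt.show()

# Box plot to spot outliers within each category
df.boxplot(column="income", by="region")
plt.show()

# Correlations and relationships between numeric variables
print(df.corr(numeric_only=True))
```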


Data Modeling:

Explanation: In this phase, data scientists develop and train predictive or descriptive models using
machine learning algorithms or statistical techniques. The goal is to build models that accurately
capture patterns in the data and generalize well to new, unseen data.

Key Activities:

Select appropriate modeling techniques based on the nature of the problem and data.

Train models using training data.

Validate models using validation data and assess their predictive power.

Model Evaluation:

Explanation: Model evaluation involves assessing the performance of trained models using
appropriate evaluation metrics and techniques. It aims to determine how well the models generalize
to unseen data and whether they meet the predefined success criteria.

Key Activities:

Calculate performance metrics such as accuracy, precision, recall, and F1-score.

Use cross-validation or holdout validation to estimate model performance.

Compare models and select the best-performing one based on evaluation results.

Model Deployment:

Explanation: The final phase involves deploying the selected model into production environments
where it can be used to make predictions or generate insights in real-time. This phase requires
collaboration with IT teams to integrate the model into existing systems and ensure scalability and
reliability.

Key Activities:

Develop APIs or services for accessing model predictions.

Integrate the model into production systems or applications.

Implement monitoring and logging mechanisms to track model performance and detect issues.

Q3) Give case studies of applications of data science in various industries.

Ans: Doubt in this question….

1) Data Science in Education:

Case Study:

Student Performance Prediction: A university predicts student performance to provide timely support and identify at-risk students.

Personalized Learning: An online platform tailors study materials for each student and recommends personalized study plans.

Course Recommendation Systems: A college suggests courses based on student preferences and academic goals.

Resource Optimization: A school district optimizes schedules and resources efficiently and allocates
resources based on demand and utilization.

Adaptive Assessments: Educational platforms offer adaptive assessments tailored to individual learning needs and adjust difficulty levels based on student performance.

2) Data Science in Bio-Tech:

Case Study:

Drug Discovery: A pharmaceutical company accelerates drug discovery using machine learning and
simulates drug interactions.

Precision Medicine: Healthcare providers offer personalized treatment plans based on genetics and patient-specific biomarkers.

Disease Surveillance: A biotech firm tracks and predicts disease outbreaks and monitors the spread
of infectious diseases.

Biometric Analysis: A research institute explores human physiology using data analysis and identifies biomarkers for disease diagnosis.

Genomic Sequencing: A biotech company uses data science to analyze and interpret genomic data for research and medical applications and to predict disease susceptibility.

3) Predictive Modeling for Maintaining Oil and Gas Supply:

Case Study:

Predictive Maintenance: An oil company prevents equipment failures using predictive models and schedules maintenance proactively.

Production Optimization: A petroleum firm optimizes production rates and drilling locations and predicts reservoir performance.

Supply Chain Optimization: An energy company optimizes its supply chain and logistics and forecasts demand for fuel products.

Environmental Monitoring: A consulting firm predicts and mitigates environmental risks and monitors air and water quality.

Reservoir Management: Oil and gas companies use data science for reservoir characterization and
management and optimize hydraulic fracturing operations.

4) Data Science in Healthcare:

Case Study:

Disease Diagnosis and Prognosis: Healthcare providers use data science for disease diagnosis and prognosis, aiding in early detection and treatment planning.

Clinical Decision Support: Data science systems provide evidence-based recommendations to healthcare professionals during patient care, improving outcomes and safety.

Healthcare Operations Optimization: Hospitals optimize resource allocation and scheduling, reducing
wait times and improving efficiency.

Drug Discovery and Development: Pharmaceutical companies expedite drug discovery and
development, leading to faster time-to-market for new medications.

Public Health Surveillance: Public health agencies monitor disease spread and implement targeted
interventions to protect communities.

5) Data Science in Retail:

Case Study:

Customer Segmentation: A retail chain segments customers for tailored marketing and identifies high-value customer segments.

Demand Forecasting: A fashion retailer forecasts demand for products across seasons and optimizes inventory levels.

Dynamic Pricing: An e-commerce platform adjusts prices dynamically based on market conditions and competitor pricing.

Recommendation Systems: A grocery chain offers personalized product recommendations and suggests complementary items.

Inventory Management: Retailers optimize inventory levels and distribution using data-driven
insights and minimize stockouts and overstock situations.

6) Data Science in Finance:

Case Study:

Fraud Detection: A financial institution detects fraudulent activities in real time and flags suspicious transactions.

Credit Risk Assessment: A lending company assesses credit risk using predictive models and predicts default probability.

Algorithmic Trading: A hedge fund executes automated trading strategies using data science and analyzes market trends in real time.

Portfolio Optimization: An asset management firm optimizes investment portfolios for maximum returns and minimizes portfolio risk.

Customer Lifetime Value Prediction: Financial institutions predict the lifetime value of customers for targeted marketing and retention strategies and personalize offerings based on customer segments.

Q15) Explain data modelling, model evaluation and model deployment in the data science life cycle.

Ans:

1) Data Modeling:

Explanation: In this phase, data scientists develop and train predictive or descriptive models using
machine learning algorithms or statistical techniques. The goal is to build models that accurately
capture patterns in the data and generalize well to new, unseen data.

Key Activities:

Select appropriate modeling techniques based on the nature of the problem and data.

Train models using training data and tune hyperparameters to optimize performance.

Validate models using validation data and assess their predictive power.

Outcome: Trained models capable of making predictions or generating insights relevant to the
business problem, ready for evaluation and deployment.
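A minimal sketch of this phase with scikit-learn; synthetic data stands in for the prepared business dataset, and the choice of a random forest is only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data in place of the prepared business dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Select a technique and train the model on the training data
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Assess predictive power on the validation data
print("Validation accuracy:", model.score(X_val, y_val))
```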

2) Model Evaluation:

Explanation: Model evaluation involves assessing the performance of trained models using
appropriate evaluation metrics and techniques. It aims to determine how well the models generalize
to unseen data and whether they meet the predefined success criteria.

Key Activities:

Calculate performance metrics such as accuracy, precision, recall, and F1-score.

Use cross-validation or holdout validation to estimate model performance.

Compare models and select the best-performing one based on evaluation results.

Outcome: An understanding of the strengths and weaknesses of the trained models, along with
insights into their predictive capabilities and potential areas for improvement.
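Continuing the modeling sketch above, the listed metrics and validation techniques might look like this in scikit-learn:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# Performance metrics on the held-out validation data
y_pred = model.predict(X_val)
print("Accuracy :", accuracy_score(y_val, y_pred))
print("Precision:", precision_score(y_val, y_pred))
print("Recall   :", recall_score(y_val, y_pred))
print("F1-score :", f1_score(y_val, y_pred))

# 5-fold cross-validation as an alternative estimate of generalization
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```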

3) Model Deployment:

Explanation: The final phase involves deploying the selected model into production environments
where it can be used to make predictions or generate insights in real-time. This phase requires
collaboration with IT teams to integrate the model into existing systems and ensure scalability and
reliability.

Key Activities:

Develop APIs or services for accessing model predictions.

Integrate the model into production systems or applications.

Implement monitoring and logging mechanisms to track model performance and detect issues.

Outcome: A deployed and operational model that adds value to the business by providing actionable
insights or automated decision-making capabilities.
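A minimal sketch of serving model predictions through an API, here using Flask and a hypothetical serialized model file:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical serialized model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    app.logger.info("Prediction served: %s", prediction)  # basic logging/monitoring hook
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```
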
Chp5:

Q4) Discuss any four methods of univariate distribution

Ans:

Q5) Compare histogram and bar graph

Ans:

Definition: A histogram is a graphical representation of the frequency distribution of numerical data; a bar graph is a graphical representation of categorical data, showing the values of different categories as bars of equal width.

Continuous data: A histogram is suitable for continuous data; a bar graph is not.

Bar width: Histogram bars may have variable widths based on the data intervals, so the widths need not be the same; bar-graph bars have a uniform width, and each bar is distinct.

X-axis: In a histogram, the x-axis represents the range of values; in a bar graph, it represents distinct categories.

Y-axis: In a histogram, the y-axis represents the frequency or density; in a bar graph, it represents the value or count.

Gap between bars: Histogram bars are adjacent with no gaps; bar-graph bars are often separated by gaps for clarity.

Bars can be reordered: No for a histogram; yes for a bar graph.

Statistical use: A histogram is used for assessing central tendency and spread; a bar graph is not commonly used for statistical analysis.
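The contrast is easy to see side by side; a minimal matplotlib sketch with made-up data:

```python
import matplotlib.pyplot as plt
import numpy as np

ages = np.random.default_rng(0).normal(35, 10, 500)  # continuous data

fig, (ax1, ax2) = plt.subplots(1, 2)

# Histogram: adjacent bins over a continuous range of values
ax1.hist(ages, bins=20)
ax1.set_title("Histogram (continuous)")

# Bar graph: separated bars for distinct categories
ax2.bar(["A", "B", "C"], [120, 90, 150])
ax2.set_title("Bar graph (categorical)")

plt.show()
```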

Q6) Discuss any four types of MANOVA

Q7) Discuss any three types of Correlation

Ans:
Q8) Define ANOVA. What are assumptions for ANOVA test

Ans:

Q16) What is EDA? What is the importance of EDA?

Ans:
Imp of EDA:

• Understand Data Structure: EDA helps in understanding the structure, distribution, and nature of the data, including the types and ranges of variables present.
• Identify Patterns and Trends: EDA allows us to identify patterns, trends, and relationships within the data, which can provide valuable insights and inform subsequent analysis.
• Detect Anomalies and Outliers: EDA helps in detecting anomalies and outliers in the data, which may indicate errors, inconsistencies, or interesting phenomena worthy of further investigation.
• Guide Preprocessing Steps: EDA guides preprocessing steps such as handling missing values, dealing with outliers, and encoding categorical variables, ensuring that subsequent analysis is based on clean and appropriately prepared data.
• Inform Model Selection: EDA helps in selecting appropriate models and techniques for analysis by providing insights into the distributional assumptions, linear relationships, and complexity of the data.

Chp6:

Q9) What is ML? Draw and explain schematic diagram for the same

Ans:

Q10) What are the key tasks of ML? Explain same with suitable example

Ans:

Q11) Compare and contrast supervised and unsupervised learning method

Ans
Q12) Explain logistic regression

Ans: Logistic regression is used for binary classification, where a sigmoid function takes the independent variables as input and produces a probability value between 0 and 1.

For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), it belongs to Class 1; otherwise, it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.

Key Points:

• Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

Logistic Function – Sigmoid Function

• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value to a value within the range of 0 and 1. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve.
• The S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between the two classes: values above the threshold tend toward 1, and values below the threshold tend toward 0.
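The sigmoid function can be written as sigmoid(x) = 1 / (1 + e^(-x)). A minimal sketch of its behavior around the 0.5 threshold:

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into the range (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))   # 0.5    -- exactly at the threshold
print(sigmoid(4))   # ~0.982 -- tends toward 1 (Class 1)
print(sigmoid(-4))  # ~0.018 -- tends toward 0 (Class 0)
```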

Terminologies involved in logistic regression:

Independent variables: Input factors used to predict the dependent variable.

Dependent variable: Target variable being predicted.

Logistic function: Formula transforming inputs into probabilities between 0 and 1.

Odds: Ratio of something occurring to not occurring.

Log-odds (logit): Natural logarithm of the odds.

Coefficient: Estimated parameters showing how variables relate.

Intercept: Constant term representing log odds when predictors are zero.

Maximum likelihood estimation: Method for estimating model coefficients by maximizing the likelihood of observing the data given the model.
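A minimal sketch of logistic regression in practice, using scikit-learn's built-in breast cancer dataset as a stand-in binary classification problem (feature scaling is added only to help convergence):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification data: Class 0 / Class 1
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Coefficients are estimated by maximum likelihood under the hood
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# Probabilities between 0 and 1; the predicted class uses the 0.5 threshold
print(clf.predict_proba(X_test[:3]))
print(clf.predict(X_test[:3]))
```
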
Q13) Write a short note on issues in decision trees.

Q14) Describe the steps in K-means clustering algorithm.

Ans:

Select the number of clusters (K): Decide how many clusters you want to identify in the dataset.

Select random initial centroids: Choose K random points from the dataset as initial centroids. These
points represent the centers of the clusters.

Assign data points to nearest centroid: For each data point in the dataset, calculate its distance to
each centroid and assign it to the nearest centroid. This forms initial clusters.

Calculate new centroids: Calculate the mean (centroid) of the data points assigned to each cluster.
Move each centroid to the mean of its respective cluster.

Reassign data points to updated centroids: Repeat the process of assigning each data point to the
nearest centroid based on the updated centroids.

Repeat until convergence: Iterate the previous two steps until either the centroids no longer change significantly (convergence) or a specified number of iterations is reached.

Finish: The algorithm terminates when the centroids no longer change significantly between
iterations. The final centroids represent the centers of the clusters, and each data point belongs to
the cluster with the nearest centroid.

OR
Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (It can be other from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.

Step-7: The model is ready.
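A minimal NumPy sketch of these steps on made-up two-dimensional data; the cluster count k is chosen by the caller:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4-5: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: finish when the centroids no longer change (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example usage on random two-dimensional data
X = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)
```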
