
KONERU LAKSHMAIAH EDUCATION FOUNDATION

(Deemed to be University estd. u/s 3 of the UGC Act, 1956)
(NAAC Accredited “A++” Grade University)
Green Fields, Guntur District, A.P., India – 522502

Program: B.Tech. II Year
A.Y. 2023-24, EVEN Semester
22AD2227 DATA ANALYTICS AND VISUALIZATION

CO 1

Session 1: Data Science Roles and Their Responsibility

1. Course Description
Data analytics and visualization are integral components of the data-driven
decision-making process. Data analytics involves the exploration, analysis, and interpretation
of data to extract meaningful patterns and insights, utilizing techniques such as descriptive,
diagnostic, predictive, and prescriptive analytics. On the other hand, data visualization
transforms data into graphical representations, enhancing comprehension and
communication of complex information through charts, graphs, and interactive dashboards.
Together, they empower individuals and organizations to not only understand their data but
also effectively communicate their findings, facilitating informed decision-making in a wide
range of industries and applications.

2. Aim

Understand the modelling of various types of data analytics and the Visualization fundamentals.

3. Instructional Objectives (Course Objectives)

 Understand the modelling of various types of data analysis and the Visualization
fundamentals.
 Apply methods and tools in descriptive statistics to summarize and explore datasets, using
measures like mean, median, variance, and graphical representations like histograms and
box plots.
 Apply methods for Scientific/ Spatial Data Visualization and Web data visualization
 Use Dashboard and its categories.
4. Learning Outcomes (Course Outcome)

Students will be able to understand the modelling of various types of data and the
Visualization fundamentals.

5. Module Description (CO-1 Description)

Data Modeling: Conceptual models, Spread sheet models, Relational Data Models,
object-oriented models, semi structured data models, unstructured data models.
Visualization Fundamentals, Design principles, The Process of Visualization, Data
Abstraction, Visual Encodings, Use of Color, Perceptual Issues, Designing Views,
Interacting with Visualizations, Filtering and Aggregation

6. Course Description
We need data visualization because a visual summary of information makes it easier
to identify patterns and trends than looking through thousands of rows in a spreadsheet.
It's the way the human brain works. Since the purpose of data analysis is to
gain insights, data is much more valuable when it is visualized. Even if a data
analyst can pull insights from data without visualization, it will be more difficult to
communicate the meaning without visualization. Charts and graphs make
communicating data findings easier even if you can identify the patterns without
them. Data visualization has many uses. Each type of data visualization can be used
in different ways. Data visualization can also identify areas that need attention or
improvement.

7. Aim

Apply methods and tools for Non-Spatial Data Visualization.

8. Instructional Objectives (Course Objectives)

 Understand the modelling of various types of data and the Visualization fundamentals.
 Apply methods and tools for Non-Spatial Data Visualization
 Apply methods for Scientific/ Spatial Data Visualization and Web data visualization.
 Use Dashboard and its categories.

9. Learning Outcomes (Course Outcome)

Students will be able to apply methods and tools for Non-Spatial Data Visualization.

10. Module Description (CO-1 Description)

Introduction to Data Science: Evolution of Data Science, Data Science Roles, Stages in a
Data Science Project, Applications of Data Science in various fields, Data Security Issues
Data Collection Strategies. Data Pre-Processing Overview.
11. Session Introduction

Data science is a multidisciplinary field that combines expertise from computer science,
statistics, and domain-specific knowledge to extract valuable insights and knowledge from
large and complex datasets. It involves a systematic approach to data collection, analysis,
interpretation, and communication, with the goal of informing data-driven decision-
making and solving real-world problems.

12. Session description

Data Science Roles and Their Responsibility

Data science roles encompass a wide range of responsibilities and skills, and they play a
crucial role in extracting valuable insights and knowledge from data.
Here are some common data science roles and their primary responsibilities.
1. Data Scientist:
A data scientist is a professional who uses their skills and expertise to analyze and
interpret complex data sets, helping organizations make informed decisions and solve
problems. Data scientists are often responsible for collecting, cleaning, and processing
data, as well as developing and applying various statistical and machine learning
techniques to extract valuable insights and patterns from the data.
Responsibilities:
 Analyzing and interpreting complex data sets to extract insights.
 Building predictive models and machine learning algorithms.
 Data preprocessing, cleaning, and feature engineering.
 Communicating findings to non-technical stakeholders.
2. Machine Learning Engineer:
Machine Learning Engineer is a specialized role within the field of artificial intelligence
(AI) and data science. These professionals are primarily responsible for developing,
implementing, and deploying machine learning models and systems. Their focus is on
creating robust and scalable machine learning solutions to address specific business or
technical problems.
Responsibilities:
 Developing and deploying machine learning models.
 Optimizing models for performance and scalability.
 Collaborating with data scientists to operationalize their work.
 Managing and maintaining model infrastructure.

3. Data Analyst:
A Data Analyst is a professional who collects, cleans, and examines data to answer
specific business questions. Data analysts focus on describing what has happened and
why, turning raw data into reports, visualizations, and actionable insights that business
stakeholders can use in day-to-day decision-making.
Responsibilities:
 Analysing data to provide actionable insights.
 Creating visualizations and reports for decision-making.
 Identifying trends and patterns in data.
 Collaborating with business stakeholders to understand their data needs.

4. Data Engineer
A data engineer is a professional responsible for the design, construction, installation,
and maintenance of the data architecture, databases, and data pipelines that enable
organizations to efficiently and reliably store, access, and analyze large volumes of data.
Data engineers are instrumental in building the infrastructure that supports data analysis
and data-driven decision-making within an organization.
Responsibilities:
 Building and maintaining data pipelines and ETL processes.
 Managing data infrastructure and databases.
 Ensuring data quality, reliability, and availability.
 Collaborating with data scientists and analysts to provide clean and accessible
data.
5. Business Intelligence (BI) Analyst
A Business Intelligence (BI) Analyst is a professional who plays a critical role in helping
organizations make data-driven decisions by transforming raw data into meaningful
insights. BI analysts use a combination of technical and analytical skills to gather and
analyze data, create reports and dashboards, and provide actionable information to
business stakeholders.
Responsibilities:
 Creating dashboards and reports for business performance tracking.
 Designing data visualization tools for end-users.
 Identifying key performance indicators (KPIs) and metrics.
 Collaborating with business teams to support decision-making.

6. Data Architect
A data architect is a professional responsible for designing and managing an
organization's data infrastructure and systems. Their role focuses on creating and
maintaining the overall architecture that governs how data is collected, stored, organized,
integrated, and accessed. Data architects ensure that data is effectively managed and used
in a way that aligns with an organization's business goals and technical requirements.
Responsibilities:
 Designing and maintaining data architectures.
 Defining data storage, integration, and processing strategies.
 Ensuring data security and compliance.
 Collaborating with data engineers to implement data solutions.
7. Statistician
A statistician is a professional who specializes in the field of statistics, which is the study
of data, data analysis, and data interpretation. Statisticians apply statistical techniques to
collect, analyze, and interpret data in a wide range of fields, including science, business,
healthcare, social sciences, and more. Their work is essential for making informed
decisions, conducting research, and solving problems.
Responsibilities:
 Applying statistical techniques to analyze data.
 Conducting hypothesis testing and experiments.
 Designing surveys and experiments to gather data.
 Providing statistical insights to support decision-making.
8. AI/ML Researcher
An AI/ML (Artificial Intelligence/Machine Learning) researcher is a professional who
specializes in the research and development of advanced AI and machine learning
algorithms, models, and systems. These researchers work on cutting-edge technologies
and methods to advance the field of artificial intelligence and contribute to the
development of intelligent systems.
Responsibilities:
 Conducting research to advance machine learning and AI.
 Developing novel algorithms and models.
 Publishing research papers and contributing to academic or industry conferences.
 Collaborating with data scientists and engineers to implement cutting-edge solutions.
9. Quantitative Analyst (Quant)
A Quantitative Analyst, often referred to as a "Quant," is a professional who specializes
in applying mathematical and statistical techniques to financial and risk management
problems. They work in various industries, with the most common being finance, where
they develop complex models and algorithms to analyze financial markets, assess risk,
and make informed investment decisions.
Responsibilities:
 Applying quantitative and mathematical methods to financial data.
 Developing trading strategies and risk models.
 Analyzing market data to inform investment decisions.
 Implementing algorithms for trading and risk management.
10. Chief Data Officer (CDO)
A Chief Data Officer (CDO) is a high-level executive responsible for overseeing an
organization's data management and data-related strategies. The role of a CDO has
become increasingly important as organizations generate and accumulate vast amounts of
data. The primary objective of a CDO is to leverage data as a strategic asset, ensuring
that it is collected, stored, processed, and used effectively to drive business growth and
innovation.
Responsibilities:
 Setting the data strategy and governance for an organization.
 Overseeing data management, privacy, and compliance.
 Aligning data initiatives with business goals.
 Managing data-related teams and resources.

What is a Data Science Project Lifecycle?

In simple terms, a data science life cycle is the repetitive set of steps you need to take to
complete and deliver a project or product to your client. Because the projects and the
teams involved in developing and deploying the model differ, the life cycle varies slightly
from one company to another. However, most data science projects follow a broadly
similar process.
In order to start and complete a data science project, we need to understand the various
roles and responsibilities of the people involved in building and developing it. Let us take
a look at the people involved in a typical data science project:
Who Is Involved in the Project:
1. Business Analyst
2. Data Analyst
3. Data Scientists
4. Data Engineer
5. Data Architect
6. Machine Learning Engineer
Now that we have an idea of who is involved in a typical project, let us understand what a
data science project is and how its life cycle is defined in a real-world scenario such as a
fake news identifier.

In a normal case, a Data Science project contains data as its main element. Without any
data, we won’t be able to do any analysis or predict any outcome as we are looking at
something unknown. Hence, before starting any data science project received from our
clients or stakeholders, we first need to understand the underlying problem statement they
have presented. Once we understand the business problem, we have to
gather the relevant data that will help us in solving the use case.
Many questions arise at this point, and the answers may vary from person to person. To
address these concerns, we follow a pre-defined flow termed the Data Science Project Life
Cycle. The process is fairly simple: the team first gathers data, cleans it, performs EDA to
extract relevant features, and prepares the data through feature engineering and feature
scaling. In the second phase, the model is built, evaluated, and then deployed. This entire
life cycle is not a one-person job; it requires the whole team working together to achieve
the required level of efficiency for the project.
Life Cycle of a Typical Data Science Project Explained:
1) Understanding the Business Problem:
In order to build a successful model, it is very important to first understand the business
problem the client is facing. Suppose the client wants to predict the customer churn rate of
their retail business. You may first want to understand their business, their requirements,
and what they actually want to achieve from the prediction. In such cases, it is important
to consult domain experts and understand the underlying problems present in the system.
A Business Analyst is generally responsible for gathering the required details from the
client and forwarding them to the data science team for further analysis. Even a minute
error in defining the problem or understanding the requirement can be very costly for the
project, so this step must be done with maximum precision.
After asking the required questions to the company stakeholders or clients, we move to
the next process which is known as data collection.
2) Data Collection
After gaining clarity on the problem statement, we need to collect relevant data to break
the problem into small components.
The data science project starts with the identification of various data sources, which may
include web server logs, social media posts, data from digital libraries such as the US
Census datasets, data accessed through sources on the internet via APIs, web scraping, or
information that is already present in an Excel spreadsheet. Data collection entails
obtaining information from both known internal and external sources that can assist in
addressing the business issue.
Normally, the data analyst team is responsible for gathering the data. They need to figure
out proper ways to source data and collect the same to get the desired results.
There are two ways to source the data:
1. Web scraping with Python
2. Extracting data using third-party APIs
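As a minimal sketch of the second approach, the snippet below pulls records from a hypothetical JSON API and loads them into a pandas DataFrame. The endpoint URL and field layout are illustrative assumptions, not a real service.

import requests
import pandas as pd

# Hypothetical API endpoint, used purely for illustration.
API_URL = "https://api.example.com/v1/customers"

def fetch_customer_data(url: str = API_URL) -> pd.DataFrame:
    """Download JSON records from an API and return them as a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()      # fail fast if the request did not succeed
    records = response.json()        # assumes the API returns a JSON list of records
    return pd.DataFrame(records)

if __name__ == "__main__":
    df = fetch_customer_data()
    print(df.head())                 # first look at the collected data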
3) Data Preparation
After gathering the data from relevant sources, we need to move forward to data
preparation. This stage helps us gain a better understanding of the data and prepares it for
further evaluation.
Additionally, this stage is referred to as Data Cleaning or Data Wrangling. It entails steps
such as selecting relevant data, combining it by mixing data sets, cleaning it, dealing
with missing values by either removing them or imputing them with relevant data,
dealing with incorrect data by removing it, and also checking for and dealing with
outliers. By using feature engineering, you can create new data and extract new features
from existing ones. Format the data according to the desired structure and delete any
unnecessary columns or features. Data preparation is the most time-consuming process,
accounting for up to 90% of the total project duration, and this is the most crucial step
throughout the entire life cycle.
Exploratory Data Analysis (EDA) is critical at this point because summarising clean data
enables the identification of the data’s structure, outliers, anomalies, and trends. These
insights can help in identifying the optimal set of features and a suitable algorithm for
model construction.
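A minimal sketch of this stage with pandas is shown below, assuming a hypothetical customers.csv file with monthly_charges, tenure, and churn columns; the file and column names are illustrative only.

import pandas as pd

# Load a hypothetical raw dataset (file and column names are illustrative).
df = pd.read_csv("customers.csv")

# Basic inspection: structure, summary statistics, and missing values.
print(df.info())
print(df.describe())
print(df.isna().sum())

# Handle missing values: impute numeric gaps with the median,
# and drop rows where the target itself is missing.
df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())
df = df.dropna(subset=["churn"])

# Deal with outliers using the interquartile-range rule.
q1, q3 = df["monthly_charges"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["monthly_charges"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Simple feature engineering: derive a new feature from existing ones.
df["charges_per_month_of_tenure"] = df["monthly_charges"] / (df["tenure"] + 1)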
4) Data Modelling
In most projects, data modelling is regarded as the core process of data analysis. In this
process we take the prepared data as the input and, with it, try to produce the desired
output.
We first select the appropriate type of model to be implemented, depending on whether
the problem is a regression, classification, or clustering problem. Based on the type of
data received, we choose the machine learning algorithm best suited for the model. Once
this is done, we tune the hyperparameters of the chosen models to obtain a favourable
outcome.
Finally, we evaluate the model by testing its accuracy and relevance. In addition, we need
to make sure there is a correct balance between specificity and generalizability: the
created model must not be biased toward the training data.
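The sketch below illustrates this step with scikit-learn on synthetic data: a classification model is selected, one hyperparameter is tuned with a cross-validated grid search, and the result is evaluated on held-out data. The dataset and parameter grid are stand-ins, not a prescribed recipe.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Choose an algorithm suited to a classification problem and tune
# one hyperparameter with cross-validated grid search.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10, None]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Evaluate the tuned model on data it has not seen before.
y_pred = grid.best_estimator_.predict(X_test)
print(grid.best_params_)
print(classification_report(y_test, y_pred))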
5) Model Deployment
Before the model is deployed, we need to ensure that we have picked the right solution
after a rigorous evaluation has been carried out. The model is then deployed in the desired
channel and format. This is naturally the last step in the life cycle of a data science project.
Take extra care while executing each step in the life cycle to avoid unwanted errors.
For example, if you choose the wrong machine learning algorithm for data modelling, you
will not achieve the desired accuracy and it will be difficult to get approval for the
project from the stakeholders. If your data is not cleaned properly, you will have to
handle missing values or the noise present in the dataset later on. Hence, to make sure
that the model is deployed properly and accepted in the real world as an optimal use
case, you will have to do rigorous testing at every step.
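One common deployment pattern is sketched below, assuming a model has already been trained and evaluated: the fitted estimator is persisted with joblib and reloaded wherever predictions need to be served. File and variable names are illustrative.

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the model chosen and evaluated in the previous steps.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Persist the fitted model so a serving application can reuse it.
joblib.dump(model, "churn_model.joblib")

# Inside the serving application: load once, then predict on incoming records.
loaded_model = joblib.load("churn_model.joblib")
print(loaded_model.predict(X[:5]))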

13. Activities / Case studies related to the session.

Case Study: Customer Churn Prediction for a Telecom Company


Background: A telecom company is experiencing a high rate of customer churn, where
subscribers are canceling their contracts and switching to competitors. The company
wants to use data science to identify factors that contribute to churn and predict
which customers are most likely to leave. By proactively addressing these issues,
they aim to reduce churn and improve customer retention.
Objectives:
1. Analyze the data to identify key factors contributing to customer churn.
2. Build a predictive model to identify customers at risk of churn.
3. Provide actionable recommendations for reducing churn.
Data: The dataset includes historical customer information, such as contract length,
monthly charges, usage patterns, customer demographics, and whether the customer
churned (yes/no).
Data Science Process:
1. Data Collection and Exploration:
 Gather the dataset and examine its structure.
 Explore the data to understand its characteristics, including summary statistics and
visualizations.
2. Data Preprocessing:
 Handle missing data and outliers.
 Encode categorical variables.
 Split the data into training and testing sets.
3. Exploratory Data Analysis (EDA):
 Conduct EDA to identify patterns and relationships between variables.
 Use visualizations to gain insights into factors associated with customer churn.
4. Feature Selection:
 Identify important features that significantly affect churn.
 Feature selection techniques may include correlation analysis and feature importance
rankings.
5. Model Building:
 Select appropriate machine learning algorithms (e.g., logistic regression, decision
trees, random forest).
 Train the model on the training data.
6. Model Evaluation:
 Evaluate the model's performance on the testing dataset using metrics like accuracy,
precision, recall, and F1-score.
 Use a confusion matrix to understand true positives, true negatives, false positives,
and false negatives.
7. Predictive Analytics:
 Use the trained model to predict which customers are at high risk of churn.
 The model will output probabilities or predictions for each customer.
8. Actionable Insights:
 Provide actionable recommendations based on the model's insights. For example,
offer targeted incentives or retention strategies to at-risk customers.
9. Monitoring and Continuous Improvement:
 Implement the recommendations and monitor the impact on churn rates.
 Continuously update the model with new data to improve its accuracy and
effectiveness.
This case study demonstrates the application of data science basics in solving a real-
world problem. By analyzing historical data and building a predictive model, the
telecom company can take proactive steps to reduce customer churn and improve
their business outcomes.
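The sketch below strings the main steps of the process together on synthetic data: a train/test split, model building, a confusion matrix and classification report for evaluation, feature-importance ranking for feature selection, and per-customer churn probabilities for the predictive-analytics step. Real data would also need the categorical encoding mentioned in step 2 (for example with pd.get_dummies); the column names here are placeholders.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic stand-in for the telecom dataset described above.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=1)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

# Step 2: split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Steps 5 and 6: build the model and evaluate it.
model = RandomForestClassifier(random_state=1).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # true/false positives and negatives
print(classification_report(y_test, y_pred))   # precision, recall, F1-score

# Step 4 (in hindsight): rank features by importance.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head())

# Step 7: churn probability per customer, used to flag those most at risk.
churn_probability = model.predict_proba(X_test)[:, 1]
at_risk = X_test.assign(churn_probability=churn_probability).nlargest(10, "churn_probability")
print(at_risk)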

14. Examples & contemporary extracts of articles/practices to convey the idea of the session.

Case Study: Predicting Student Exam Scores


Background: A high school wants to improve its students' academic performance by
identifying factors that influence exam scores. The school collects data on student
demographics, study hours, attendance, and previous exam scores.
Objectives:
1. Analyze the data to understand the key factors that impact students' exam scores.
2. Build a predictive model to forecast a student's exam score based on the available
data.
3. Provide actionable recommendations to help students improve their performance.
Data: The dataset contains information on a sample of students, including their age,
gender, study hours, attendance, and past exam scores. The target variable is the final
exam score.
Data Science Process:
This case study demonstrates the application of data science in an educational context. By
analyzing student data and building a predictive model, the school can identify areas for
improvement and develop strategies to enhance student performance.
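A hedged sketch of how such a model might be built is shown below, using synthetically generated records in place of the school's data; the column names (study_hours, attendance, previous_score, final_score) are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the school's records (column names are illustrative).
rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "study_hours": rng.uniform(0, 30, n),
    "attendance": rng.uniform(50, 100, n),
    "previous_score": rng.uniform(30, 100, n),
})
df["final_score"] = (
    0.8 * df["study_hours"] + 0.2 * df["attendance"]
    + 0.5 * df["previous_score"] + rng.normal(0, 5, n)
).clip(0, 100)

X = df[["study_hours", "attendance", "previous_score"]]
y = df["final_score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Fit a simple regression model and check the average prediction error.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, pred))

# Coefficients hint at which factors most influence the predicted score.
print(dict(zip(X.columns, model.coef_)))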

15. SAQs - Self Assessment Questions

1. Data science is also known as:
a) Data-driven science
b) Data mining
c) Big data
d) Information retrieval

2. Data preprocessing, cleaning, and feature engineering are primarily the responsibility of a:
a) Data Scientist
b) Machine Learning Engineer
c) Data Analyst
d) None of the above
16. Summary
Data science is a multidisciplinary field that combines principles from computer science,
statistics, and domain-specific knowledge to extract valuable insights from data. It
involves a systematic process that includes data collection, cleaning, analysis, and
interpretation, with the primary aim of supporting data-driven decision-making and
solving real-world problems. The field encompasses essential components such as data
preprocessing, exploratory data analysis, statistical analysis, machine learning, data
visualization, and communication of findings. Data science has become increasingly
important in the age of big data, enabling organizations to harness the power of data to
gain a competitive edge and drive innovation in various domains. It offers a wide range of
career opportunities for individuals with skills in data analysis, modeling, and
visualization and continues to evolve with technological advancements and changing data
landscapes. Ethical considerations, including data privacy and responsible data use, are
integral to the practice of data science, ensuring that it benefits society while respecting
individual rights and values.

17. Terminal Questions

1. Can you explain the significance of data collection, data cleaning, and data
preprocessing in the data science workflow?
2. How does data visualization play a role in data science, and why is it
important?
3. What is predictive modeling, and how does it relate to data science?
4. Describe the ethical considerations associated with working with data in the
field of data science.
5. What are some key milestones in the development of data science techniques
and methodologies?
6. What are some of the challenges and ethical considerations that have emerged
over the years as data science has grown in importance and scale?
18. Case Studies (Co Wise)

Case study on data science roles and responsibilities within a fictional company,
XYZ Analytics.

Company Background:
XYZ Analytics is a medium-sized technology company that specializes in
providing data-driven solutions to various industries. The company's primary goal
is to help clients make informed business decisions through the use of advanced
analytics and data science techniques.

Scenario:
XYZ Analytics has recently onboarded a new client, a retail chain looking to
optimize its supply chain management. The client has a massive amount of data,
including sales data, inventory data, and customer data, and they are seeking
actionable insights to improve efficiency and reduce costs.

Data Science Team:


The data science team at XYZ Analytics comprises individuals with diverse skills
and expertise. The team includes data scientists, data engineers, and machine
learning engineers.

19. Answer Key


1. a    2. a

20. Glossary

Data Science: A multidisciplinary field that combines computer science, statistics, and
domain knowledge to extract insights from data.

Data Analysis: The process of examining, cleaning, transforming, and interpreting data
to discover patterns, trends, and insights.

Data Visualization: The representation of data using charts, graphs, and visual elements
to aid in understanding and communication.

Data Preprocessing: The initial step in data analysis that involves cleaning and
preparing data for analysis by addressing missing values, outliers, and inconsistencies.

Exploratory Data Analysis (EDA): The practice of visually and statistically exploring
data to understand its characteristics and relationships.

Predictive Modeling: Building models that make predictions based on historical data,
often using machine learning algorithms.
Descriptive Statistics: Numerical and graphical methods used to summarize and
describe the main features of a dataset.

Feature Engineering: Creating new variables from existing data to improve model
performance.

Machine Learning: A subset of artificial intelligence that focuses on developing algorithms
that allow computers to learn and make predictions from data.

Data Mining: The process of discovering patterns and relationships within large
datasets.

Big Data: Extremely large and complex datasets that traditional data processing tools are
inadequate to handle.

Data Scientist: A professional who specializes in analyzing and interpreting data, developing
models, and providing insights to drive data-driven decision-making.

Hypothesis Testing: A statistical technique used to test hypotheses and make inferences
about data.

Feature Selection: Identifying and choosing the most relevant variables or features for
modeling.

Cross-Validation: A technique used to assess the performance of a predictive model by
repeatedly splitting the data into training and validation folds.

Overfitting: When a model is too complex and fits the training data too closely,
potentially leading to poor generalization to new data.

Bias and Variance: Terms used to describe the sources of error in a model, with bias
indicating underfitting and variance indicating overfitting.

Algorithm: A step-by-step procedure or set of rules for solving a specific problem in data
analysis and machine learning.

Ethical Considerations: The moral and legal aspects of working with data, including
data privacy and responsible data use.

Data Ethics: A branch of ethics that deals with the moral principles governing data
collection, handling, and sharing in the context of data science.

21. References of Books, Sites, and Links

Textbooks:
1. Python Data Science Handbook, Jake VanderPlas, O'Reilly Media, Inc., November 2016. ISBN: 9781491912058.

Sites and Web Links:
Text and Annotation | Python Data Science Handbook (jakevdp.github.io)

22. Keywords
Data Science, Big Data, Data Scientist, Data Mining
