
Fundamentals of Data Science and Analytics Notes – Unit I

AD3491 - Fundamentals of Data Science and Analytics


UNIT I - Introduction to Data Science
Syllabus
Need for data science – benefits and uses – facets of data – data science process – setting
the research goal – retrieving data – cleansing, integrating, and transforming data – exploratory
data analysis – build the models – presenting and building applications.

1.1 INTRODUCTION
1.1.1 Data
Data is a collection of discrete values that convey information, describing quantity, quality,
facts, statistics, and so on. In short, data is information, such as facts and numbers, used to
analyse something or to make decisions.

The characteristics of big data are often referred to as the three Vs:
 Volume - How much data is there?
 Variety - How diverse are different types of data?
 Velocity - At what speed is new data generated?

1.1.2 Why is data so important?


Data is a precious asset of any organization. It helps firms understand and enhance their
processes, thereby saving time and money. Waste of time and money, such as excessive spending
on advertisements or improper inventory management, can deplete resources and severely impact
a business.
The efficient use of data enables businesses to reduce such wastage by analysing different
marketing channels’ performance and focusing on those offering the highest Return on Investment
(ROI). Thus, a company can generate more leads without increasing its advertising spend.

1.1.3 Characteristics of quality data


Determining the quality of data requires an examination of its characteristics, then
weighing those characteristics according to what is most important to your organization and the
application(s) for which the data will be used.
 Validity - The degree to which your data conforms to defined business rules or constraints.
 Accuracy - The degree to which your data is close to the true values.
 Completeness - The degree to which all required data is known.
 Consistency - The degree to which your data is consistent within the same dataset and/or
across multiple data sets.
 Uniformity - The degree to which the data is specified using the same unit of measure.


1.1.4 Data Science


Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find hidden patterns, derive meaningful information, and make business
decisions.

Data science can be explained as the entire process of gathering actionable insights from
raw data that involves concepts like pre-processing of data, data modelling, statistical analysis,
data analysis, machine learning algorithms, etc.

The main purpose of data science is to enable better decision making. It uses complex
machine learning algorithms to build predictive models. The data used for analysis can come from
many different sources and be presented in various formats.

The working of data science can be explained as follows:


 Raw data is gathered from various sources that explain the business problem.
 Using various statistical analysis and machine learning approaches, data modelling is
performed to get the optimum solution that best explains the business problem.
 Actionable insights that serve as a solution to the business problem are gathered
through data science.

1.1.5 Why is Data Science so important? / Need for Data Science
Data is meaningless until it is converted into valuable information. Data Science involves
mining large datasets containing structured and unstructured data and identifying hidden
patterns to extract actionable insights.

The importance of Data Science lies in its numerous uses that range from daily activities
like asking Siri or Alexa for recommendations to more complex applications like operating a self-
driving car.


Handling huge amounts of data is a challenging task for every organization. To handle,
process, and analyse this data, we need complex, powerful, and efficient algorithms, and the
technology that emerged for this purpose is data science. The following are some of the main
reasons for using data science technology:

 With the help of data science technology, we can convert the massive amount of raw and
unstructured data into meaningful insights.

 Data science technology is being adopted by various companies, whether big brands or
startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data
science algorithms to deliver a better customer experience.

 Data science is helping to automate transportation, for example by enabling self-driving
cars, which are the future of transportation.

 Data science can help with different kinds of predictions, such as survey outcomes,
election results, flight ticket confirmation, etc.

1.2 Components of Data Science:

The main components of Data Science are given below:


1. Statistics: Statistics is one of the most important components of data science. Statistics is
a way to collect and analyse numerical data in large amounts and find meaningful
insights in it.

2. Domain Expertise: Domain expertise binds data science together. It means specialized
knowledge or skill in a particular area, and in data science there are various areas for
which we need domain experts.

3. Data engineering: Data engineering is a part of data science that involves acquiring,
storing, retrieving, and transforming data. It also involves adding metadata (data about
data) to the data.


4. Visualization: Data visualization means representing data in a visual context so that
people can easily understand its significance. Visualization makes it easy to grasp huge
amounts of data through visuals.

5. Advanced computing: Advanced computing does the heavy lifting of data science. It
involves designing, writing, debugging, and maintaining the source code of computer
programs.

6. Mathematics: Mathematics is a critical part of data science. It involves the study of
quantity, structure, space, and change. For a data scientist, a good knowledge of
mathematics is essential.

7. Machine learning: Machine learning is the backbone of data science. It is about training
a machine so that it can act like a human brain. In data science, we use various machine
learning algorithms to solve problems.

1.2.1 Tools for Data Science


Following are some tools required for data science:

Data Analysis tools: Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.

Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift

Data Visualization tools: R, Jupyter, Tableau, Cognos.

Machine learning tools: Spark, Mahout, Azure ML studio.

1.3 Data Science Life Cycle


The Data Science life cycle defines the process by which a project is carried out in phases
by the professionals working on it. It is a step-by-step procedure arranged in a circular structure,
and each phase has its own characteristics and importance. The Data Science life cycle comprises
the following:


i. Formulating a Business Problem


Every data science project starts by formulating a business problem. A business
problem explains the issues that may be fixed with insights gathered from an efficient
Data Science solution.
A simple example of a business problem: you have the past year's sales data for
a retail store. Using machine learning approaches, you have to predict or forecast the sales
for the next three months, which will help the store create an inventory that reduces the
wastage of products with a shorter shelf life.

ii. Data Extraction, Transformation, Loading

The next step in the data science life cycle is to create a data pipeline where the
relevant data is extracted from the source and transformed into machine readable format,
and eventually the data is loaded into the program or the machine learning pipeline to get
things started.

For the above example – to forecast the sales, we will need data from the store
that is useful for formulating an efficient machine learning model. Keeping this in
mind, we would create separate data points that may or may not affect the sales of
that particular store.

iii. Data Preprocessing

The third step is where the magic happens. Using statistical analysis, exploratory
data analysis, and data wrangling and manipulation, we create meaningful data. Preprocessing
is done to assess the various data points and formulate hypotheses that
best explain the relationships between the various features in the data.

For example – The store sales problem will require the data to be in a time series
format to be able to forecast the sales. The hypothesis testing will test the stationarity of
the series and further computations will show various trends, seasonality and other
relationship patterns in the data.

iv. Data Modeling

This step involves advanced machine learning concepts that will be used for
feature selection, feature transformation, standardization of the data, data normalization,
etc. Choosing the best algorithms based on evidence from the above steps will help you
create a model that will efficiently create a forecast for the said months in the above
example.

For example – We can use a time series forecasting approach for the business
problem; if the data is high dimensional, we can apply various dimensionality reduction
techniques. We then create a forecasting model using an AR, MA, or ARIMA model and
forecast the sales for the next quarter.
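A minimal sketch of this forecasting step, assuming synthetic monthly sales data and the statsmodels library; the series, the ARIMA order, and the forecast horizon are all illustrative assumptions, not part of these notes:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical data: two years of monthly sales with a mild upward trend.
rng = np.random.default_rng(0)
months = pd.date_range("2022-01-01", periods=24, freq="MS")
monthly_sales = pd.Series(100 + 2 * np.arange(24) + rng.normal(0, 5, 24),
                          index=months)

# Fit an ARIMA(1, 1, 1) model and forecast the next quarter (3 months).
fitted = ARIMA(monthly_sales, order=(1, 1, 1)).fit()
print(fitted.forecast(steps=3))
```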


v. Gathering Actionable Insights

In this step we gather insights for the stated problem. We create inferences and
findings from the entire process that best explain the business problem.

For example – From the above Time series model, we will get the monthly or
weekly sales for the next 3 months. These insights will in turn help the professionals create
a strategy plan to overcome the problem at hand.

vi. Solutions For the Business Problem

The solutions for the business problem are nothing but actionable insights that
solve the problem using evidence-based information. For example – our forecast from the
time series model gives an efficient estimate of the store's sales for the next three months.
Using those insights, the store can plan its inventory to reduce the wastage of perishable
goods.

1.3.1 Prerequisite for Data Science


Technical Prerequisite:
 Machine learning: To understand data science, one needs to understand the concept
of machine learning. Data science uses machine learning algorithms to solve various
problems.

 Mathematical modelling: Mathematical modelling is required to make fast


mathematical calculations and predictions from the available data.

 Statistics: Basic understanding of statistics is required, such as mean, median, or


standard deviation. It is needed to extract knowledge and obtain better results from
the data.

 Computer programming: For data science, knowledge of at least one programming
language is required. R, Python, and Spark are some of the languages commonly
required for data science.

 Databases: A deep understanding of databases, such as SQL, is essential for data
science in order to get the data and work with it.

Non-Technical Prerequisite:
Curiosity: To learn data science, one must have curiosity. When you are curious and
ask various questions, you can understand the business problem easily.
Critical Thinking: It is also required for a data scientist so that you can find multiple new
ways to solve the problem with efficiency.
Communication skills: Communication skills are most important for a data scientist
because after solving a business problem, you need to communicate it with the team.


1.4 Benefits of Data Science / Applications of Data Science


 Data Science is widely used in the banking and finance sectors for fraud detection and
personalized financial advice.
 Retailers use Data Science to enhance customer experience and retention.
 In the healthcare industry, physicians use Data Science to analyze data from wearable
trackers to ensure their patients’ well-being and make vital decisions. Data Science also
enables hospital managers to reduce waiting time and enhance care.
 Transportation providers use Data Science to enhance the transportation journeys of their
customers. For instance, Transport for London maps customer journeys offering
personalized transportation details, and manages unexpected circumstances using
statistical data.
 Construction companies use Data Science for better decision making by tracking activities,
including average time for completing tasks, materials-based expenses, and more.
 Data Science enables tapping and analyzing massive data from manufacturing processes,
much of which has so far gone untapped.
 With Data Science, one can analyze massive graphical data, temporal data, and geospatial
data to draw insights. It also helps in seismic interpretation and reservoir characterization.
 Data Science facilitates firms to leverage social media content to obtain real-time media
content usage patterns. This enables the firms to create target audience-specific content,
measure content performance, and recommend on-demand content.
 Data Science helps study utility consumption in the energy and utility domain. This study
allows for better control of utility use and enhanced consumer feedback.
 Data Science applications in the public service field include health-related research,
financial market analysis, fraud detection, energy exploration, environmental protection,
and more.


1.5 Difference between Business Intelligence and Data Science

| Criterion | Business Intelligence | Data Science |
|---|---|---|
| Data Source | Deals with structured data, e.g., a data warehouse. | Deals with structured and unstructured data, e.g., weblogs, feedback, etc. |
| Method | Analytical (works on historical data). | Scientific (goes deeper to find the reason behind what the data reports). |
| Skills | Statistics and visualization are the two skills required for business intelligence. | Statistics, visualization, and machine learning are the required skills for data science. |
| Focus | Focuses on both past and present data. | Focuses on past and present data, and also on future predictions. |

1.6 Difference between Data Mining and Data Science

Data mining is the specific process of discovering hidden patterns in large data sets, while
data science is the broader discipline covering the complete workflow, from setting the research
goal and retrieving data through preparation and modelling to presenting results. Data mining can
therefore be viewed as one of the techniques applied within data science.

1.7 Machine learning in Data Science


To become a data scientist, one should also be aware of machine learning and its algorithms,
as data science makes broad use of various machine learning algorithms. The following are some
of the machine learning algorithms used in data science:
1. Regression
2. Decision tree
3. Clustering
4. Classification
5. Outlier Analysis
1. Linear Regression Algorithm: Linear regression is the most popular machine learning
algorithm and is based on supervised learning. It performs regression, which is a method
of modelling a target value based on independent variables. It takes the form of a linear
equation relating a set of inputs to a predicted output, and is mostly used in forecasting
and prediction. Since it models a linear relationship between the input and output
variables, it is called linear regression. The relationship between the x and y variables
can be described by the equation:
Y = mx + c
where Y => dependent variable
x => independent variable
m => slope
c => intercept
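A minimal sketch of fitting Y = mx + c with scikit-learn; the numbers are made-up illustrative data, not from these notes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x) versus sales (y).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0])        # estimated slope
print("intercept c:", model.intercept_)  # estimated intercept
print("prediction for x=6:", model.predict([[6.0]])[0])
```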

2. Decision Tree: The decision tree algorithm is another machine learning algorithm that
belongs to supervised learning. It is one of the most popular machine learning algorithms
and can be used for both classification and regression problems.

In the decision tree algorithm, we solve the problem using a tree representation in
which each internal node represents a feature, each branch represents a decision, and each
leaf represents an outcome. A typical example is deciding whether a candidate receives a
job offer.


In a decision tree, we start from the root of the tree and compare the value of
the root attribute with the record's attribute. Based on this comparison, we follow the
corresponding branch and move to the next node. We continue comparing values until
we reach a leaf node with the predicted class value.
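A minimal sketch of a decision tree classifier with scikit-learn for the job-offer example; the features, labels, and data are illustrative assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [cgpa, has_internship, communication_score]
X = [[8.5, 1, 7], [6.0, 0, 5], [9.0, 1, 9], [5.5, 0, 4], [7.5, 1, 6]]
y = [1, 0, 1, 0, 1]  # 1 = job offer, 0 = no offer

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[8.0, 1, 8]]))  # predicted outcome for a new candidate
```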

3. K-Means Clustering: K-means clustering is one of the most popular algorithms of machine
learning, which belongs to the unsupervised learning algorithm. It solves the clustering
problem.

If we are given a data set of items with certain features and values, and we need
to categorize those items into groups, such problems can be solved using the k-means
clustering algorithm.
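A minimal sketch of k-means with scikit-learn on made-up two-dimensional points; the data and the choice of two clusters are assumptions:

```python
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned cluster centres
```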

4. Classification
Classification is the act or process of dividing things into groups according to their
type. In statistics, classification is the problem of identifying which of a set of categories
(sub-populations) an observation (or observations) belongs to. There are two types of
classification: binary classification and multi-class classification.
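A minimal sketch of binary classification using logistic regression (one common classifier, used here only as an illustration); the pass/fail data is made up:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied versus pass (1) / fail (0).
hours = [[1], [2], [3], [4], [5], [6]]
passed = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[3.5]]))        # predicted class for a new student
print(clf.predict_proba([[3.5]]))  # probabilities for each class
```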


5. Outlier Analysis

Outlier analysis is a process that involves identifying anomalous observations in
a dataset. Outliers are extreme values that deviate from the other observations in the
dataset; they are observations that differ strongly (have different properties) from the
other data points in a sample of a population.

Outliers are classified into three types namely Global Outliers, Contextual Outliers
and Collective Outliers.
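A minimal sketch of detecting a global outlier with the interquartile range (IQR) rule; the values are made up, and the 1.5 * IQR fence is the conventional choice:

```python
import numpy as np

values = np.array([10, 11, 11, 12, 12, 13, 95])  # 95 is an obvious outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(values[mask])  # -> [95]
```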

So, in data science, problems are solved using algorithms chosen to match the kind
of question being asked.


1.8 Facets of Data


Different types of data are handled in the data science and big data domains, and each
of them tends to require different tools and techniques. The main categories of data are these:
 Structured Data
 Unstructured Data
 Natural language Data
 Machine-generated Data
 Graph-based Data
 Audio, video, and images
 Streaming Data
Structured Data
Structured data is data in a standardized format: it has a well-defined structure, complies
with a data model, follows a persistent order, and is easily accessed by humans and programs.
This data type is generally stored in a database, and Structured Query Language (SQL) is the
preferred way to manage and query data that resides in databases. A typical example of
structured data is a table of records, as in the small sketch below.
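A minimal sketch of storing and querying structured data with SQL, using Python's built-in sqlite3 module; the customers table and its rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ravi", "Madurai"), (3, "Meena", "Chennai")],
)
for row in conn.execute("SELECT name FROM customers WHERE city = 'Chennai'"):
    print(row)  # structured data can be queried by field
```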

Characteristics of Structured Data


Good structured data will have a range of characteristics, regardless of how the data
is stored or what the information is about. Structured data:
 has an identifiable structure that conforms to a data model and resides in fixed fields
 is presented in rows and columns, such as in a database
 is organized so that the definition, format, and meaning of the data are explicitly
understood
 has similar groups of data clustered together in classes
 is easy to access and query for humans and other programs
 has addressable elements, enabling efficient analysis and processing


Advantages of Structured Data


 Easy Storage and Access
 Ease of Updating and Deleting
 Easily Scalable
 Data Mining is Simple
 Better Business Intelligence
Disadvantages of Structured Data
 Storage Inflexibility
 Limited Use Cases

Unstructured data
Unstructured data is data that isn't easy to fit into a data model because the content is
context-specific or varying. It is information that either does not have a predefined data model
or is not organised in a predefined manner.
Unstructured information is typically text-heavy, but may contain data such as dates,
numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to
understand using traditional programs, compared with data stored in structured databases.
Common examples of unstructured data include audio and video files, and data in NoSQL databases.
Examples of unstructured data are:
 Rich media. Media and entertainment data, surveillance data, geo-spatial data,
audio, weather data
 Document collections. Invoices, records, emails, productivity applications
 Internet of Things (IoT). Sensor data, ticker data

Natural Language Data


Natural language is a special type of unstructured data; it's challenging to process because
it requires knowledge of specific data science techniques and of linguistics. Natural language
refers to the way we humans communicate with each other, namely speech and text.


Machine-generated data
Machine-generated data is information that's automatically created by a computer,
process, application, or other machine without human intervention. Examples of machine data
are web server logs, call detail records, network event logs, and telemetry.

Graph-based or network data


In graph theory, a graph is a mathematical structure used to model pairwise relationships
between objects. Graph or network data is, in short, data that focuses on the relationships or
adjacency of objects. Graph structures use nodes, edges, and properties to represent and
store graph data. Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of a person and the
shortest path between two people.
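A minimal sketch of graph-based data using the networkx library, with a tiny hypothetical social network; the names and edges are made up:

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([("Asha", "Ravi"), ("Ravi", "Meena"),
                  ("Meena", "Kumar"), ("Asha", "Meena")])

print(nx.shortest_path(g, "Asha", "Kumar"))  # e.g. ['Asha', 'Meena', 'Kumar']
print(nx.degree_centrality(g))               # a simple "influence" metric
```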

Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.


Streaming data
While streaming data can take almost any of the previous forms, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a data store
in a batch. Although this isn’t really a different type of data, we treat it here as such because you
need to adapt your process to deal with this type of information. Examples are the “What’s
trending” on Twitter, live sporting or music events, and the stock market.

1.9 The Data Science Process


The data science process typically consists of six steps, as follows,
1. Setting the research goal - Defining the what, the why, and the how of your project in
a project charter.
2. Retrieving data - Finding and getting access to data needed in your project. This data
is either found within the company or retrieved from a third party.
3. Data preparation - Checking and remediating data errors, enriching the data with data
from other data sources, and transforming it into a suitable format for your models.
4. Data exploration - Diving deeper into your data using descriptive statistics and visual
techniques.
5. Data modelling - Using machine learning and statistical techniques to achieve your
project goal.
6. Presentation and automation - Presenting your results to the stakeholders and
industrializing your analysis process for repetitive reuse and integration with other
tools.
Setting the research goal
Data science is mostly applied in the context of an organization. When the business asks
you to perform a data science project, you’ll first prepare a project charter. This charter contains
information such as what you’re going to research, how the company benefits from that, what
data and resources you need, a timetable, and deliverables.
Retrieving data
The second step is to collect data. You’ve stated in the project charter which data you
need and where you can find it. In this step you ensure that you can use the data in your program,
which means checking the existence of, quality, and access to the data. Data can also be delivered
by third-party companies and takes many forms ranging from Excel spreadsheets to different
types of databases.


Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data
and prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data integration
enriches data sources by combining information from multiple data sources, and data
transformation ensures that the data is in a suitable format for use in your models.


Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try
to understand how variables interact with each other, the distribution of the data, and whether
there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.

Data modelling or model building


In this phase you use models, domain knowledge, and insights about the data found
in the previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative
process; the way you build your model depends on whether you go with classic statistics or the
somewhat more recent machine learning school, and on the type of technique you want to use.
Either way, most models consist of the following main steps:
i. Selection of a modelling technique and variables to enter in the model
ii. Execution of the model
iii. Diagnosis and model comparison
Presentation and automation
Finally, you present the results to your business. These results can take many forms,
ranging from presentations to research reports. Sometimes you’ll need to automate the
execution of the process because the business will want to use the insights you gained in another
project or enable an operational process to use the outcome from your model.

In summary, the data science process consists of the following main steps and actions,
taken in order during a project.
1. The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project. In every
serious project this will result in a project charter.

2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.

3. The third phase is data preparation. Now that you have the raw data, it's time to prepare
it. This includes transforming the data from a raw form into data that's directly usable in
your models. To achieve this, you'll detect and correct different kinds of errors in the data,
combine data from different data sources, and transform it. If you have successfully
completed this step, you can progress to data visualization and modelling.

4. The fourth step is data exploration. The goal of this step is to gain a deep understanding
of the data. You’ll look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modelling.

5. The fifth step in the data science process is model building. It is now that you attempt to
gain the insights or make the predictions stated in your project charter.

6. The last step of the data science model is presenting your results and automating the
analysis. One goal of a project is to change a process and/or make better decisions. You
may still need to convince the business that your findings will indeed change the business
process as expected. The importance of this step is more apparent in projects on a
strategic and tactical level. Certain projects require you to perform the business process
over and over again, so automating the project will save time.

1.10 A detailed view on Data Science Process

1.10.1 Data Science Process: Defining research goals and creating a project charter
 A project starts by understanding the what, the why, and the how of your project:
"What does the company expect you to do?", "Why does management place such
value on the research?", and "Is it part of a bigger strategic picture, originating
from an opportunity someone detected?". Answering these three questions (what,
why, how) is the goal of the first phase.


 The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables, and a plan of action with a timetable. This information is
then best placed in a project charter.

 The entire first phase is divided into two parts:


i. Understanding the goals and context of your research
ii. Creating a project charter
 A project charter requires teamwork, and your input covers at least the following:
i. A clear research goal
ii. The project mission and context
iii. How you’re going to perform your analysis
iv. What resources you expect to use
v. Proof that it’s an achievable project, or proof of concepts
vi. Deliverables and a measure of success
vii. A timeline
The company can use this information to estimate the project costs and the data and
people required for the project to become a success.

1.10.2 Data Science Process: Retrieving data


 The next step in data science is to retrieve the required data. The data needed for the
analysis will either be collected directly from the company (internal data) or from
outside sources (external data).
 Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is acquiring all the data you need.


 The main challenge in data collection is identifying the sources where the required
data is actually stored, because a company may have stored its data in many places.
 Another challenge is extracting the useful data from the collection and removing the
noise and unwanted data. Many manual and automated tools are used to refine the
data.

 Many companies also publish their data in open forums for public access (open data
repositories); a small sketch of loading such data follows.
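A minimal sketch of retrieving data in practice with pandas, reading from a local file and from an open URL; both the file name and the URL are placeholders, not real sources:

```python
import pandas as pd

# Internal data: a file shared within the company (hypothetical path).
internal = pd.read_csv("sales_2023.csv")

# External data: a dataset published openly (hypothetical URL).
external = pd.read_csv("https://example.com/open-data/customers.csv")

print(internal.head())  # quick check that the data loaded as expected
print(external.shape)
```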

1.10.3 Data Science Process: Cleansing, integrating, and transforming data


 Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. When combining multiple
data sources, there are many opportunities for data to be duplicated or mislabeled. If
data is incorrect, outcomes and algorithms are unreliable, even though they may look
correct.
 The data cleaning process includes removing duplicate and irrelevant observations,
fixing structural errors, filtering unwanted outliers, handling missing data, data
validation, etc. A small sketch of some of these steps follows.
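A minimal sketch of common cleaning steps with pandas; the small dataset, with a duplicate row, a missing value, and inconsistent formatting, is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", " meena "],
    "age": [25, 25, None, 31],
})

df = df.drop_duplicates()                         # remove duplicate rows
df["name"] = df["name"].str.strip().str.title()   # fix structural errors
df["age"] = df["age"].fillna(df["age"].median())  # handle missing data
print(df)
```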


 Data integration is the process of combining data from multiple heterogeneous data
sources into a coherent data store, providing a unified view of the data. These
sources may include multiple data cubes, databases, or flat files.
 There are two major approaches to data integration – the "tight coupling approach"
and the "loose coupling approach".
 Some of the challenges in data integration are schema integration, data redundancy,
and the detection and resolution of data-value conflicts.
 Data transformation: Raw data is difficult to trace or understand, which is why it needs
to be pre-processed before any information is retrieved from it. Data transformation is
a technique used to convert the raw data into a suitable format that eases data mining
and the retrieval of strategic information.
 Data transformation includes data cleaning techniques and data reduction techniques
to convert the data into the appropriate form; a short example follows this list.
 Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data
pipeline for data analytics projects.
 Data integration, migration, data warehousing, data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic
processes, and it enables businesses to make better data-driven decisions. During the
data transformation process, an analyst will determine the structure of the data.
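A minimal sketch of simple transformations with pandas, converting a raw text column to a numeric form and normalizing it; the data and column names are assumptions:

```python
import pandas as pd

raw = pd.DataFrame({"price": ["1,200", "950", "2,100"],
                    "city": ["Chennai", "Madurai", "Salem"]})

raw["price"] = raw["price"].str.replace(",", "").astype(int)  # fix the format
raw["price_norm"] = (raw["price"] - raw["price"].min()) / (
    raw["price"].max() - raw["price"].min()
)                                                             # min-max scaling
raw["city"] = raw["city"].astype("category")                  # structural change
print(raw)
```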


1.10.4 Data Science Process: Exploratory data analysis [EDA]


 Exploratory data analysis (EDA) is an approach to analysing data sets in order to
summarize their main characteristics, often using statistical graphics and other data
visualization methods.

 Exploratory data analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, spot anomalies, test hypotheses,
and check assumptions with the help of summary statistics and graphical
representations.
 The objectives of EDA are to:
 Enable unexpected discoveries in the data
 Suggest hypotheses about the causes of observed phenomena
 Assess assumptions on which statistical inference will be based
 Support the selection of appropriate statistical tools and techniques
 Provide a basis for further data collection through surveys or experiments
 Typical graphical techniques used in EDA are the box plot, histogram, multi-vari chart,
run chart, Pareto chart, scatter plot (2D/3D), stem-and-leaf plot, parallel coordinates,
odds ratio, targeted projection pursuit, heat map, bar chart, etc. A small sketch follows.
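A minimal sketch of basic EDA with pandas and matplotlib: summary statistics, a histogram, and a box plot; the sales figures are made up, with one deliberately extreme value:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"sales": [200, 215, 190, 230, 220, 800, 210, 205]})

print(df.describe())        # summary statistics: mean, std, quartiles, ...
df["sales"].hist()          # histogram of the distribution
plt.figure()
df.boxplot(column="sales")  # the box plot exposes 800 as a likely outlier
plt.show()
```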


1.10.5 Data Science Process: Build the models

 The model building process involves setting up ways of collecting data, understanding
and paying attention to what is important in the data to answer the questions you are
asking, finding a statistical, mathematical or a simulation model to gain understanding
and make predictions.
 In regression analysis, model building is the process of developing a probabilistic
model that best describes the relationship between the dependent and independent
variables.
 The model building process consists of the following three steps (a short sketch
follows the list):
i. Selection of a modelling technique and variables to enter in the model
ii. Execution of the model
iii. Diagnosis and model comparison
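A minimal sketch of the three steps with scikit-learn on synthetic data: (i) select a technique and variables, (ii) execute the model, (iii) diagnose it on held-out data; all names and numbers here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # (i) two chosen variables
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)         # (ii) execution
mse = mean_squared_error(y_test, model.predict(X_test))  # (iii) diagnosis
print("test MSE:", mse)
```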

1.10.6 Data Science Process: Presenting findings and building applications


 Once the data has been successfully analysed and a well-performing model has been built,
the findings need to be presented to the world. This involves presenting your results to the
stakeholders and industrializing your analysis process for repetitive reuse and integration
with other tools.

 The entire process needs to be documented in a professional way, with clear
visualizations, making it easy for the audience to understand.
