
Unit I

Data Science Introduction

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.

What is Data Science?

Data Science is about data gathering, analysis and decision-making.

Data Science is about finding patterns in data through analysis and making future predictions.

By using Data Science, companies are able to make:

 Better decisions (should we choose A or B?)
 Predictive analysis (what will happen next?)
 Pattern discoveries (finding patterns, or perhaps hidden information, in the data)

Where is Data Science Needed?

Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.

Examples of where Data Science is needed:

 For route planning: To discover the best routes to ship goods
 To foresee delays for flights/ships/trains etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
 To forecast next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections

Data Science can be applied in nearly every part of a business where data is available.
Examples are:

 Consumer goods
 Stock markets
 Industry
 Politics
 Logistic companies
 E-commerce

How Does a Data Scientist Work?

A Data Scientist requires expertise in several backgrounds:


 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases

A Data Scientist must find patterns within the data. Before he/she can find the patterns,
he/she must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.


2. Explore and collect data - From database, web logs, customer feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and replace them with a
suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8
m, but the number 140 is larger than 1.8 - so scaling is important). A short sketch of
steps 5 and 6 follows this list.
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way the "company"
can understand.
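
As an illustration of steps 5 and 6 above, here is a minimal sketch using pandas with a small, hypothetical height_cm column; real projects would of course work with much larger datasets.

Example
import pandas as pd

# Hypothetical height measurements, already converted to centimetres
df = pd.DataFrame({"height_cm": [140.0, 180.0, None, 165.0, 172.0]})

# Step 5: replace the missing value with the column average
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Step 6: min-max scaling brings all values into the 0-1 range
col = df["height_cm"]
df["height_scaled"] = (col - col.min()) / (col.max() - col.min())

print(df)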

What is Data?

Data is a collection of information.

One purpose of Data Science is to structure data, making it interpretable and easy to work
with.

Data can be categorized into two groups:

 Structured data
 Unstructured data

Unstructured Data

Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data

Structured data is organized and easier to work with.


How to Structure Data?

We can use an array or a database table to structure or present data.

Example of an array:

[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

The following example shows how to create an array in Python:

Example
array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(array)

It is common to work with very large data sets in Data Science.

What are the advantages and disadvantages of Data Science?

Advantages                      Disadvantages

Better Decision-Making          Data Privacy Concerns

Improved Efficiency             Bias in Data

Enhanced Customer Experience    Misinterpretation of Data

Predictive Analytics            Data Quality Issues

What are the challenges of data science technology?

Common Data Science Problems Faced by Data Scientists


 Preparation of Data for Smart Enterprise AI
 Generation of Data from Multiple Sources
 Identification of Business Issues
 Communication of Results to Non-Technical Stakeholders
 Data Security
 Efficient Collaboration
 Selection of Non-Specific KPI Metrics

What are the advantages of data science?

The various benefits of Data Science are as follows:

 It's in Demand. Data Science is greatly in demand.
 Abundance of Positions
 A Highly Paid Career
 Data Science is Versatile
 Data Science Makes Data Better
 Data Scientists are Highly Prestigious
 No More Boring Tasks
 Data Science Makes Products Smarter
What is the difference between data analytics vs data mining?
 Data analysis involves using technologies to conduct analysis and draw hypotheses
that aid in making data-driven decisions. Data mining, on the other hand, is the
process of uncovering hidden patterns in raw data using complex machine learning
algorithms in order to make precise decisions.
What is facet in data science?
 Faceting is a good way to get an overview of a specific column of your data. Text
faceting will organize unique items in the selected column by name and will give a
count for how many rows or records possess that item name. For example, let's text
facet the column Facility Name.
What are the types of data science or facets of data?

 Data science incorporates various disciplines -- for example, data engineering, data
preparation, data mining, predictive analytics, machine learning and data
visualization, as well as statistics, mathematics and software programming.
What are the 4 major components of data science?
 The four pillars of data science are domain knowledge, math and statistics skills,
computer science, communication and visualization. Each is essential for the success
of any data scientist. Domain knowledge is critical to understanding the data, what it
means, and how to use it.
How do you set research goals in data science?
 Setting the research goal: Understanding the business or activity our data science
project is part of is key to ensuring its success and the first phase of any sound data
analytics project. Defining the what, the why, and the how of our project in a project
charter is the foremost task.
What are the goals of the data science process?
 The goal of data science is to construct the means for extracting business-focused
insights from data. This requires an understanding of how value and information
flows in a business, and the ability to use that understanding to identify business
opportunities.
What are the six steps of the data science process?
 Data science life cycle is a collection of individual steps that need to be taken to
prepare for and execute a data science project. The steps include identifying the
project goals, gathering relevant data, analyzing it using appropriate tools and
techniques, and presenting results in a meaningful way.

What Is the Data Science Process?


The data science process is a systematic approach to solving a data problem. It provides a
structured framework for articulating your problem as a question, deciding how to solve it,
and then presenting the solution to stakeholders.

Data Science Life Cycle

Another term for the data science process is the data science life cycle. The terms can be used
interchangeably, and both describe a workflow process that begins with collecting data, and
ends with deploying a model that will hopefully answer your questions. The steps include:

Framing the Problem


Understanding and framing the problem is the first step of the data science life cycle. This
framing will help you build an effective model that will have a positive impact on your
organization.

Collecting Data
The next step is to collect the right set of data. High-quality, targeted data—and the
mechanisms to collect them—are crucial to obtaining meaningful results. Since much of the
roughly 2.5 quintillion bytes of data created every day come in unstructured formats, you’ll
likely need to extract the data and export it into a usable format, such as a CSV or JSON file.
Cleaning Data

Most of the data you collect during the collection phase will be unstructured, irrelevant, and
unfiltered. Bad data produces bad results, so the accuracy and efficacy of your analysis will
depend heavily on the quality of your data.

Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types,
invalid entries, missing data, and improper formatting.

This step is the most time-intensive process, but finding and resolving flaws in your data is
essential to building effective models.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an approach that is used to analyze the data and discover trends,
patterns, or check assumptions in data with the help of statistical summaries and graphical
representations.
Now that you have a large amount of organized, high-quality data, you can begin conducting
an exploratory data analysis (EDA). Effective EDA lets you uncover valuable insights that
will be useful in the next phase of the data science lifecycle.
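
As a minimal sketch of EDA, assuming pandas, matplotlib, and a small, made-up employee dataset, statistical summaries and a simple chart might look like this:

Example
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical employee data (column names are illustrative only)
df = pd.DataFrame({
    "salary": [52000, 61000, 58000, 75000, 49000, 83000],
    "bonus": [3000, 4500, 4000, 6000, 2500, 7000],
})

# Statistical summaries: count, mean, std, quartiles, min, max
print(df.describe())

# Graphical representation: distribution of salaries
df["salary"].plot(kind="hist", bins=5, title="Salary distribution")
plt.xlabel("salary")
plt.show()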

Model Building and Deployment


Next, you’ll do the actual data modeling. This is where you’ll use machine learning,
statistical models, and algorithms to extract high-value insights and predictions.

Communicating Your Results


Lastly, you’ll communicate your findings to stakeholders. Every data scientist needs to build
their repertoire of visualization skills to do this.

Your stakeholders are mainly interested in what your results mean for their organization, and
often won’t care about the complex back-end work that was used to build your model.
Communicate your findings in a clear, engaging way that highlights their value in strategic
business planning and operation.

Data Science Process Steps and Framework


There are several different data science process frameworks that you should know. While
they all aim to guide you through an effective workflow, some methodologies are better for
certain use cases.

CRISP-DM
CRISP-DM stands for Cross Industry Standard Process for Data Mining. It’s an industry-
standard methodology and process model that’s popular because it’s flexible and
customizable. It’s also a proven method to guide data mining projects. The CRISP-DM
model includes six phases in the data process life cycle. Those six phases are:
1. Business Understanding

The first step in the CRISP-DM process is to clarify the business’s goals and bring focus to
the data science project. Clearly defining the goal should go beyond simply identifying the
metric you want to change. Analysis, no matter how comprehensive, can’t change metrics
without action.

To better understand the business, data scientists meet with stakeholders, subject matter
experts, and others who can offer insights into the problem at hand. They may also do
preliminary research to see how others have tried to solve similar problems. Ultimately,
they’ll have a clearly defined problem and a roadmap to solving it.

2. Data Understanding
The next step in CRISP-DM is understanding your data. In this phase, you’ll determine what
data you have, where you can get more of it, what your data includes, and its quality. You’ll
also decide what data collection tools you’ll use and how you’ll collect your initial data. Then
you’ll describe the properties of your initial data, such as the format, the quantity, and the
records or fields of your data sets.

Collecting and describing your data will allow you to begin exploring it. You can then
formulate your first hypothesis by asking data science questions that can be answered through
queries, visualization, or reporting. Finally, you’ll verify the quality of your data by
determining if there are errors or missing values.

3. Data Preparation

Data preparation is often the most time-consuming phase, and you may need to revisit this
phase multiple times throughout your project.
Data comes from various sources and is usually unusable in its raw state, as it often has
corrupt and missing attributes, conflicting values, and outliers. Data preparation resolves
these issues and improves the quality of your data, allowing it to be used effectively in the
modeling stage.

Data preparation involves many activities that can be performed in different ways. The main
activities of data preparation are listed below, followed by a short sketch of a few of them:

 Data cleaning: fixing incomplete or erroneous data


 Data integration: unifying data from different sources
 Data transformation: formatting the data
 Data reduction: reducing data to its simplest form
 Data discretization: reducing the number of values to make data management easier
 Feature engineering: selecting and transforming variables to work better with
machine learning.
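
The sketch below is a non-authoritative example assuming pandas and two small, made-up tables. It illustrates three of the activities listed above: data integration (merging sources), data discretization (binning ages), and feature engineering (a new total-spend variable).

Example
import pandas as pd

# Hypothetical tables from two different sources (data integration)
customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [23, 45, 31]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 3], "amount": [250, 120, 600, 90]})
merged = orders.merge(customers, on="cust_id", how="left")

# Data discretization: reduce age to a small number of bins
merged["age_group"] = pd.cut(merged["age"], bins=[0, 30, 50, 100],
                             labels=["young", "middle", "senior"])

# Feature engineering: total spend per customer as a new variable
features = merged.groupby("cust_id").agg(total_spend=("amount", "sum"),
                                         age_group=("age_group", "first"))
print(features)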

Facets of data

In data science and big data you’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories of data are these:

■ Structured

■ Unstructured

■ Natural language

■ Machine-generated

■ Graph-based

■ Audio, video, and images

■ Streaming

Exploratory Data Analysis (EDA)

Before building a model on this data, we have to analyze all the information present across the
dataset, such as the salary distribution of employees, the bonuses they receive, their starting
times, and their assigned teams. All these steps of analyzing and modifying the data come
under EDA.

Exploratory Data Analysis (EDA) is an approach that is used to analyze the data and
discover trends, patterns, or check assumptions in data with the help of statistical summaries
and graphical representations.

Types of EDA

Depending on the number of columns we are analyzing, we can divide EDA into the following types.
1. Univariate Analysis – In univariate analysis, we analyze or deal with only one
variable at a time. The analysis of univariate data is thus the simplest form of analysis
since the information deals with only one quantity that changes. It does not deal with
causes or relationships and the main purpose of the analysis is to describe the data and
find patterns that exist within it.
2. Bi-Variate analysis – This type of data involves two different variables. The analysis
of this type of data deals with causes and relationships and the analysis is done to find
out the relationship between the two variables.
3. Multivariate Analysis – When the data involves three or more variables, it is
categorized under multivariate.

Depending on the type of analysis we can also subcategorize EDA into two parts.

1. Non-graphical Analysis – In non-graphical analysis, we analyze data using statistical
tools like the mean, median, mode, or skewness.
2. Graphical Analysis – In graphical analysis, we use visualization charts to see trends
and patterns in the data (a short sketch of both follows).
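
A brief sketch of these ideas, assuming pandas and a tiny, made-up table: the first two calls are univariate (one column at a time) and the last two are bivariate (two columns together).

Example
import pandas as pd

df = pd.DataFrame({
    "team":   ["A", "A", "B", "B", "C"],
    "salary": [52000, 61000, 58000, 75000, 49000],
    "bonus":  [3000, 4500, 4000, 6000, 2500],
})

# Univariate analysis: one variable at a time
print(df["salary"].describe())        # numeric summary of a single column
print(df["team"].value_counts())      # frequency of a categorical variable

# Bivariate analysis: relationship between two variables
print(df["salary"].corr(df["bonus"]))       # correlation coefficient
print(df.groupby("team")["salary"].mean())  # salary broken down by team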

Steps Involved in Data Science Modelling

The key steps involved in Data Science Modelling are:

 Step 1: Understanding the Problem


 Step 2: Data Extraction
 Step 3: Data Cleaning
 Step 4: Exploratory Data Analysis
 Step 5: Feature Selection
 Step 6: Incorporating Machine Learning Algorithms
 Step 7: Testing the Models
 Step 8: Deploying the Model

Step 1: Understanding the Problem

The first step involved in Data Science Modelling is understanding the problem. A Data
Scientist listens for keywords and phrases when interviewing a line-of-business expert about
a business challenge. The Data Scientist breaks down the problem into a procedural flow that
always involves a holistic understanding of the business challenge, the Data that must be
collected, and the various Artificial Intelligence and Data Science approaches that can be used to
address the problem.

Step 2: Data Extraction

The next step in Data Science Modelling is Data Extraction. Not just any data is needed: the
pieces you collect, often unstructured, must be relevant to the business problem you are trying
to address. Data extraction is done from various sources, such as online sources, surveys, and
existing databases.
Step 3: Data Cleaning

Data Cleaning is useful as you need to sanitize Data while gathering it. The following are
some of the most typical causes of Data Inconsistencies and Errors:

 Duplicate items collected from a variety of databases.
 Errors in the precision of input data.
 Changes, updates, and deletions made to data entries.
 Variables with missing values across multiple databases.

Step 4: Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a robust technique for familiarising yourself with Data
and extracting useful insights. Data Scientists sift through Unstructured Data to find patterns
and infer relationships between Data elements. Data Scientists use Statistics and Visualisation
tools to summarise Central Measurements and variability to perform EDA.

If Data skewness persists, appropriate transformations are used to scale the distribution
around its mean. When Datasets have a lot of features, exploring them can be difficult. As a
result, to reduce the complexity of Model inputs, Feature Selection is used to rank them in
order of significance in Model Building for enhanced efficiency. Using Business Intelligence
tools like Tableau, MicroStrategy, etc. can be quite beneficial in this step. This step is crucial
in Data Science Modelling as the Metrics are studied carefully for validation of Data
Outcomes.

Step 5: Feature Selection

Feature Selection is the process of identifying and selecting the features that contribute the
most to the prediction variable or output that you are interested in, either automatically or
manually.

The presence of irrelevant characteristics in your data can reduce model accuracy and cause
your model to train on irrelevant features. Conversely, if the features are strong enough, the
machine learning algorithm will give excellent outcomes. Two types of characteristics must be
addressed (a short selection sketch follows the list):

 Consistent characteristics that are unlikely to change.


 Variable characteristics whose values change over time.
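
As a hedged illustration, scikit-learn's SelectKBest can rank features by a univariate score and keep only the strongest ones; the built-in iris dataset is used here purely as a stand-in for real project data.

Example
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Small example dataset (4 features, 3 classes)
X, y = load_iris(return_X_y=True)

# Rank features by their ANOVA F-score and keep the 2 most informative ones
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Scores per feature:", selector.scores_)
print("Selected shape:", X_selected.shape)   # (150, 2)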

Step 6: Incorporating Machine Learning Algorithms

This is one of the most crucial processes in Data Science Modelling as the Machine Learning
Algorithm aids in creating a usable Data Model. There are a lot of algorithms to pick from,
the Model is selected based on the problem. There are three types of Machine Learning
methods that are incorporated:
1) Supervised Learning

It is based on the results of a previous operation that is related to the existing business
operation. Based on previous patterns, Supervised Learning aids in the prediction of an
outcome. Some of the Supervised Learning Algorithms are:

 Linear Regression
 Random Forest
 Support Vector Machines
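
For instance, a Random Forest from scikit-learn can be trained on labelled data; the built-in iris dataset stands in for real business data in this sketch.

Example
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Random Forest on labelled training data (supervised learning)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))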

2) Unsupervised Learning

This form of learning has no pre-existing consequence or pattern. Instead, it concentrates on
examining the interactions and connections between the presently available data points.
Some of the Unsupervised Learning Algorithms are:

 K-means Clustering
 Hierarchical Clustering
 Anomaly Detection
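
A minimal unsupervised sketch with scikit-learn: k-means groups synthetic, unlabelled points purely by similarity, with no target labels involved.

Example
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate unlabelled points with three natural groupings
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# K-means groups points purely by similarity; no labels are used
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

print(clusters[:10])            # cluster id assigned to the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the three cluster centres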

3) Reinforcement Learning

It is a fascinating Machine Learning technique that uses a dynamic Dataset that interacts with
the real world. In simple terms, it is a mechanism by which a system learns from its mistakes
and improves over time. Some of the Reinforcement Learning Algorithms are:

 Q-Learning
 State-Action-Reward-State-Action (SARSA)
 Deep Q Network
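
As a toy sketch of Q-Learning (not tied to any particular library), an agent in a tiny, made-up five-state corridor learns by trial and error that moving right leads to reward.

Example
import random

# Toy 1-D corridor with 5 states; reaching state 4 gives reward 1.
# Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for _ in range(500):                       # episodes
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action choice
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update rule
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # learned action values; "move right" should dominate in every state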


Step 7: Testing the Models

This is the next phase, and it’s crucial to check that our Data Science Modelling efforts meet
the expectations. The Data Model is applied to the Test Data to check if it’s accurate and
houses all desirable features. You can further test your Data Model to identify any
adjustments that might be required to enhance the performance and achieve the desired
results. If the required precision is not achieved, you can go back to Step 6 (Incorporating
Machine Learning Algorithms), choose an alternate model, and then test it again. A brief
testing sketch follows.
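
A minimal sketch of this testing loop, assuming scikit-learn and its iris dataset as a stand-in for real data: two candidate models are trained and compared on held-out test data, and the better one would be carried forward to deployment.

Example
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Held-out test data is only used to judge the trained models
candidates = {"logistic_regression": LogisticRegression(max_iter=1000),
              "svm": SVC()}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, preds))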

Step 8: Deploying the Model

The Model which provides the best result based on test findings is completed and deployed in
the production environment whenever the desired result is achieved through proper testing as
per the business needs. This concludes the process of Data Science Modelling.
Applications of Data Science

Every industry benefits from the experience of Data Science companies, but the most
common areas where Data Science techniques are employed are the following:

 Banking and Finance: The banking industry can benefit from Data Science in many
aspects. Fraud Detection is a well-known application in this field that assists banks in
reducing non-performing assets.
 Healthcare: Health concerns are being monitored and prevented using Wearable
Data. The Data acquired from the body can be used in the medical field to prevent
future calamities.
 Marketing: Marketing offers a lot of potential, such as a more effective price
strategy. Pricing based on Data Science can help companies like Uber and E-
Commerce businesses enhance their profits.
 Government Policies: Based on Data gathered through surveys and other official
sources, the government can use Data Science to better build policies that cater to
the interests and wishes of the people.

What Is Data Mining?

Data mining is the process of searching and analyzing a large batch of raw data in order to
identify patterns and extract useful information.

Companies use data mining software to learn more about their customers. It can help them to
develop more effective marketing strategies, increase sales, and decrease costs. Data mining
relies on effective data collection, warehousing, and computer processing.

How Data Mining Works

Data mining involves exploring and analyzing large blocks of information to glean
meaningful patterns and trends. It is used in credit risk management, fraud detection, and
spam filtering. It also is a market research tool that helps reveal the sentiment or opinions of a
given group of people. The data mining process breaks down into four steps:

 Data is collected and loaded into data warehouses on-site or on a cloud service.
 Business analysts, management teams, and information technology professionals
access the data and determine how they want to organize it.
 Custom application software sorts and organizes the data.
 The end user presents the data in an easy-to-share format, such as a graph or table.

Data Mining Techniques

Data mining uses algorithms and various other techniques to convert large collections of data
into useful output. The most popular types of data mining techniques include:

 Association rules, also referred to as market basket analysis, search for relationships
between variables. This relationship in itself creates additional value within the data
set as it strives to link pieces of data. For example, association rules would search a
company's sales history to see which products are most commonly purchased
together; with this information, stores can plan, promote, and forecast.
 Classification uses predefined classes to assign to objects. These classes describe the
characteristics of items or represent what the data points have in common with each other.
This data mining technique allows the underlying data to be more neatly categorized
and summarized across similar features or product lines.
 Clustering is similar to classification. However, clustering identifies similarities
between objects, then groups those items based on what makes them different from
other items. While classification may result in groups such as "shampoo,"
"conditioner," "soap," and "toothpaste," clustering may identify groups such as "hair
care" and "dental health."
 Decision trees are used to classify or predict an outcome based on a set list of criteria
or decisions. A decision tree is used to ask for the input of a series of cascading
questions that sort the dataset based on the responses given. Sometimes depicted as a
tree-like visual, a decision tree allows for specific direction and user input when
drilling deeper into the data.
 K-Nearest neighbor (KNN) is an algorithm that classifies data based on its proximity
to other data. The basis for KNN is rooted in the assumption that data points that are
close to each other are more similar to each other than other bits of data. This non-
parametric, supervised technique is used to predict the features of a group based on
individual data points.
 Neural networks process data through the use of nodes. These nodes are comprised
of inputs, weights, and an output. Data is mapped through supervised learning, similar
to the ways in which the human brain is interconnected. This model can be
programmed to give threshold values to determine a model's accuracy.
 Predictive analysis strives to leverage historical information to build graphical or
mathematical models to forecast future outcomes. Overlapping with regression
analysis, this technique aims at estimating an unknown future figure based on the data
currently on hand (a short forecasting sketch follows this list).
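
Following the last item above, here is a minimal, illustrative regression sketch assuming scikit-learn and a small, made-up monthly sales series; it fits the historical trend and projects the next three months.

Example
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales history (12 months of made-up figures)
months = np.arange(1, 13).reshape(-1, 1)
sales = np.array([200, 210, 215, 230, 240, 255, 260, 275, 280, 295, 300, 315])

# Fit a regression model to the historical trend
model = LinearRegression()
model.fit(months, sales)

# Forecast the next three months from the fitted trend
future = np.arange(13, 16).reshape(-1, 1)
print(model.predict(future))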

The Data Mining Process

To be most effective, data analysts generally follow a certain flow of tasks along the data
mining process. Without this structure, an analyst may encounter an issue in the middle of
their analysis that could have easily been prevented had they prepared for it earlier. The data
mining process is usually broken into the following steps.

Step 1: Understand the Business

Before any data is touched, extracted, cleaned, or analyzed, it is important to understand the
underlying entity and the project at hand. What are the goals the company is trying to achieve
by mining data? What is their current business situation? What are the findings of a SWOT
analysis? Before looking at any data, the mining process starts by understanding what will
define success at the end of the process.

Step 2: Understand the Data

Once the business problem has been clearly defined, it's time to start thinking about data.
This includes what sources are available, how they will be secured and stored, how the
information will be gathered, and what the final outcome or analysis may look like. This step
also includes determining the limits of the data, storage, security, and collection and assesses
how these constraints will affect the data mining process.
Step 3: Prepare the Data

Data is gathered, uploaded, extracted, or calculated. It is then cleaned, standardized, scrubbed
for outliers, assessed for mistakes, and checked for reasonableness. During this stage of data
mining, the data may also be checked for size, as an oversized collection of information may
unnecessarily slow computations and analysis.

Step 4: Build the Model

With our clean data set in hand, it's time to crunch the numbers. Data scientists use the types
of data mining above to search for relationships, trends, associations, or sequential patterns.
The data may also be fed into predictive models to assess how previous bits of information
may translate into future outcomes.

Step 5: Evaluate the Results

The data-centered aspect of data mining concludes by assessing the findings of the data
model or models. The outcomes from the analysis may be aggregated, interpreted, and
presented to decision-makers that have largely been excluded from the data mining process to
this point. In this step, organizations can choose to make decisions based on the findings.

Step 6: Implement Change and Monitor

The data mining process concludes with management taking steps in response to the findings
of the analysis. The company may decide the information was not strong enough or the
findings were not relevant, or the company may strategically pivot based on findings. In
either case, management reviews the ultimate impacts of the business and recreates future
data mining loops by identifying new business problems or opportunities.

Benefits of Data Mining

Data mining ensures a company is collecting and analyzing reliable data. It is often a more
rigid, structured process that formally identifies a problem, gathers data related to the
problem, and strives to formulate a solution. Therefore, data mining helps a business become
more profitable, more efficient, or operationally stronger.

Data mining can look very different across applications, but the overall process can be used
with almost any new or legacy application. Essentially any type of data can be gathered and
analyzed, and almost every business problem that relies on quantifiable evidence can be
tackled using data mining.

The end goal of data mining is to take raw bits of information and determine if there is
cohesion or correlation among the data. This benefit of data mining allows a company to
create value with the information they have on hand that would otherwise not be overly
apparent. Though data models can be complex, they can also yield fascinating results, unearth
hidden trends, and suggest unique strategies.
Limitations of Data Mining

The complexity of data mining is one of its greatest disadvantages. Data analytics often
requires technical skill sets and certain software tools. Smaller companies may find this to be
a barrier to entry that is too difficult to overcome.

Data mining doesn't always guarantee results. A company may perform statistical analysis,
make conclusions based on strong data, implement changes, and not reap any benefits.
Through inaccurate findings, market changes, model errors, or inappropriate data
populations, data mining can only guide decisions and not ensure outcomes.

There is also a cost component to data mining. Data tools may require costly subscriptions,
and some bits of data may be expensive to obtain. Security and privacy concerns can be
addressed, though the additional IT infrastructure may be costly as well. Data mining may also be
most effective when using huge data sets; however, these data sets must be stored and require
heavy computational power to analyze.

What Is a Data Warehouse?

Data Warehouse Defined

A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities, especially analytics. Data warehouses are solely
intended to perform queries and analysis and often contain large amounts of historical data.
The data within a data warehouse is usually derived from a wide range of sources such as
application log files and transaction applications.

A data warehouse centralizes and consolidates large amounts of data from multiple sources.
Its analytical capabilities allow organizations to derive valuable business insights from their
data to improve decision-making. Over time, it builds a historical record that can be
invaluable to data scientists and business analysts. Because of these capabilities, a data
warehouse can be considered an organization’s “single source of truth.”

A typical data warehouse often includes the following elements:

 A relational database to store and manage data


 An extraction, loading, and transformation (ELT) solution for preparing the data for
analysis
 Statistical analysis, reporting, and data mining capabilities
 Client analysis tools for visualizing and presenting data to business users
 Other, more sophisticated analytical applications that generate actionable information
by applying data science and artificial intelligence (AI) algorithms, or graph and
spatial features that enable more kinds of analysis of data at scale

Benefits of a Data Warehouse

Data warehouses offer the overarching and unique benefit of allowing organizations to
analyze large amounts of variant data and extract significant value from it, as well as to keep
a historical record.
Four unique characteristics (described by computer scientist William Inmon, who is
considered the father of the data warehouse) allow data warehouses to deliver this
overarching benefit. According to this definition, data warehouses are

 Subject-oriented. They can analyze data about a particular subject or functional area
(such as sales).
 Integrated. Data warehouses create consistency among different data types from
disparate sources.
 Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
 Time-variant. Data warehouse analysis looks at change over time.

A well-designed data warehouse will perform queries very quickly, deliver high data
throughput, and provide enough flexibility for end users to “slice and dice” or reduce the
volume of data for closer examination to meet a variety of demands—whether at a high level
or at a very fine, detailed level. The data warehouse serves as the functional foundation for
middleware BI environments that provide end users with reports, dashboards, and other
interfaces.

Data Warehouse Architecture

The architecture of a data warehouse is determined by the organization’s specific needs.


Common architectures include

 Simple. All data warehouses share a basic design in which metadata, summary data,
and raw data are stored within the central repository of the warehouse. The repository
is fed by data sources on one end and accessed by end users for analysis, reporting,
and mining on the other end.
 Simple with a staging area. Operational data must be cleaned and processed before
being put in the warehouse. Although this can be done programmatically, many data
warehouses add a staging area for data before it enters the warehouse, to simplify data
preparation.
 Hub and spoke. Adding data marts between the central repository and end users
allows an organization to customize its data warehouse to serve various lines of
business. When the data is ready for use, it is moved to the appropriate data mart.
 Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly
and informally explore new datasets or ways of analyzing data without having to
conform to or comply with the formal rules and protocol of the data warehouse.

The Evolution of Data Warehouses—From Data Analytics to AI and Machine Learning

When data warehouses first came onto the scene in the late 1980s, their purpose was to help
data flow from operational systems into decision-support systems (DSSs). These early data
warehouses required an enormous amount of redundancy. Most organizations had multiple
DSS environments that served their various users. Although the DSS environments used
much of the same data, the gathering, cleaning, and integration of the data was often
replicated for each environment.

As data warehouses became more efficient, they evolved from information stores that
supported traditional BI platforms into broad analytics infrastructures that support a wide
variety of applications, such as operational analytics and performance management.
Basic statistical description of data

The three main types of descriptive statistics are frequency distribution, central tendency,
and variability of a data set. The frequency distribution records how often data occurs,
central tendency records the data's center point of distribution, and variability of a data set
records its degree of dispersion.
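
Using the array from earlier in this unit, the three types of descriptive statistics can be computed with Python's standard library alone:

Example
import statistics
from collections import Counter

data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Central tendency: where the data is centred
print("mean:", statistics.mean(data))
print("median:", statistics.median(data))

# Variability: degree of dispersion
print("variance:", statistics.variance(data))
print("std dev:", statistics.stdev(data))

# Frequency distribution: how often values fall into each bin of width 20
bins = [(v // 20) * 20 for v in data]
print("frequency by bin:", Counter(bins))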

The evolution of Data Science

The evolution of Data Science is a result of the inclusion of contemporary technologies like
Machine Learning (ML), Artificial Intelligence (AI), and the Internet of Things (IoT). The
application of data science started to spread to several other fields, such as engineering and
medicine.

Evolution of Data Science: Growth & Innovation

The term “data science” — and the practice itself — has evolved over the years. In recent
years, its popularity has grown considerably due to innovations in data collection,
technology, and mass production of data worldwide. Gone are the days when those who
worked with data had to rely on expensive programs and mainframes. The proliferation of
programming languages like Python and procedures to collect, analyze, and interpret data
paved the way for data science to become the popular field it is today.

Data science began in statistics. Part of the evolution of data science was the inclusion of
concepts such as machine learning, artificial intelligence, and the internet of things. With the
flood of new information coming in and businesses seeking new ways to increase profit and
make better decisions, data science started to expand to other fields, including medicine,
engineering, and more.

In this article, we'll share a concise summary of the evolution of data science — from its
humble beginnings as a statistician’s dream to its current state as a unique science in its own
right recognized by every imaginable industry.

Origins, Predictions, Beginnings

We could say that data science was born from the idea of merging applied statistics with
computer science. The resulting field of study would use the extraordinary power of modern
computing. Scientists realized they could not only collect data and solve statistical problems
but also use that data to solve real-world problems and make reliable fact-driven predictions.

1962: American mathematician John W. Tukey first articulated the data science dream. In his
now-famous article "The Future of Data Analysis," he foresaw the inevitable emergence of a
new field nearly two decades before the first personal computers. While Tukey was ahead of
his time, he was not alone in his early appreciation of what would come to be known as "data
science." Another early figure was Peter Naur, a Danish computer engineer whose book
Concise Survey of Computer Methods offers one of the very first definitions of data science:

"The science of dealing with data, once they have been established, while the relation of the
data to what they represent is delegated to other fields and sciences."

1977: The theories and predictions of "pre" data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing
(IASC), whose mission was "to link traditional statistical methodology, modern computer
technology, and the knowledge of domain experts in order to convert data into information
and knowledge."

1980s and 1990s: Data science began taking more significant strides with the emergence of
the first Knowledge Discovery in Databases (KDD) workshop and the founding of the
International Federation of Classification Societies (IFCS). These two societies were among
the first to focus on educating and training professionals in the theory and methodology of
data science (though that term had not yet been formally adopted).

It was at this point that data science started to garner more attention from leading
professionals hoping to monetize big data and applied statistics.

1994: BusinessWeek published a story on the new phenomenon of "Database Marketing." It
described the process by which businesses were collecting and leveraging enormous amounts
of data to learn more about their customers, competition, or advertising techniques. The only
problem at the time was that these companies were flooded with more information than they
could possibly manage. Massive amounts of data were sparking the first wave of interest in
establishing specific roles for data management. It began to seem like businesses would need
a new kind of worker to make the data work in their favor.

1990s and early 2000s: We can clearly see that data science has emerged as a recognized
and specialized field. Several data science academic journals began to circulate, and data
science proponents like Jeff Wu and William S. Cleveland continued to help develop and
expound upon the necessity and potential of data science.

2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
2005: Big data enters the scene. With tech giants such as Google and Facebook uncovering
large amounts of data, new technologies capable of processing them became necessary.
Hadoop rose to the challenge, and later on Spark and Cassandra made their debuts.

2014: Due to the increasing importance of data, and organizations’ interest in finding patterns
and making better business decisions, demand for data scientists began to see dramatic
growth in different parts of the world.

2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the
realm of data science. These technologies have driven innovations over the past decade —
from personalized shopping and entertainment to self-driven vehicles along with all the
insights to efficiently bring forth these real-life applications of AI into our daily lives.

2018: New regulations in the field are perhaps one of the biggest aspects in the evolution in
data science.

2020s: We are seeing additional breakthroughs in AI and machine learning, and an
ever-increasing demand for qualified professionals in Big Data.

The Future of Data Science

Seeing how much of our world is currently powered by data and data science, we can
reasonably ask, Where do we go from here? What does the future of data science hold? While
it's difficult to know exactly what the hallmark breakthroughs of the future will be, all signs
seem to indicate the critical importance of machine learning. Data scientists are searching for
ways to use machine learning to produce more intelligent and autonomous AI.

In other words, data scientists are working tirelessly toward developments in deep learning to
make computers smarter. These developments can bring about advanced robotics paired with
a powerful AI. Experts predict that such AI will be capable of understanding and interacting
seamlessly with humans, and that self-driving vehicles and automated public transportation
will operate in a world interconnected like never before. This new world will be made possible by data
science.

Perhaps, on the more exciting side, we may see an age of extensive automation of labor in the
near future. This is expected to revolutionize the healthcare, finance, transportation, and
defense industries.

Data Scientist Roles and Responsibilities

Data scientists collaborate closely with business leaders and other key players to comprehend
company objectives and identify data-driven strategies for achieving those objectives. A data
scientist’s job is to gather a large amount of data, analyze it, separate out the essential
information, and then utilize tools like SAS, R programming, Python, etc. to extract insights
that may be used to increase the productivity and efficiency of the business. Depending on an
organization’s needs, data scientists have a wide range of roles and responsibilities. The
following is a list of some of the data scientist roles and responsibilities:

 Collect data and identify data sources


 Analyze huge amounts of data, both structured and unstructured
 Create solutions and strategies to business problems
 Work with team members and leaders to develop data strategy
 To discover trends and patterns, combine various algorithms and modules
 Present data using various data visualization techniques and tools
 Investigate additional technologies and tools for developing innovative data strategies
 Create comprehensive analytical solutions, from data gathering to display; assist in
the construction of data engineering pipelines
 Support the data scientist, BI developer, and analyst teams as needed on their projects
 Work with the sales and pre-sales teams on cost reduction, effort estimation, and cost
optimization
 To boost general effectiveness and performance, stay current with the newest tools,
trends, and technologies
 Collaborate with the product team and partners to provide data-driven solutions
created with original concepts
 Create analytics solutions for businesses by combining various tools, applied
statistics, and machine learning
 Lead discussions and assess the feasibility of AI/ML solutions for business processes
and outcomes
 Architect, implement, and monitor data pipelines, as well as conduct knowledge
sharing sessions with peers to ensure effective data use

What Is Data Science Pipeline?

We can think of a data science pipeline as a unified system consisting of customized tools
and processes which enable an organization to get the maximum value out of its data.
Depending on factors like scale, the nature of the problem at hand, and the domain, the data
science pipeline can be as simple as an ETL process, or it could be very complex, consisting
of different stages with multiple processes working together to achieve the final objective.
Why Is the Data Science Pipeline Important?

The data science pipeline of any organization is a fair reflection of how data-driven the
organization is and how much influence derived insights have on business-critical decisions.

Here are some of the points explaining the importance of a data science pipeline for an
organization:

1. Data science pipeline enables business decisions driven by data.


2. Data science pipeline helps to identify shortcomings of current processes.
3. Data science pipeline makes an organization future ready.
4. Data science pipeline allows for innovation and creativity by unlocking the previously
inaccessible insights.
5. Without a data science pipeline, the valuable data of the organization would be
worthless.

Data Science Pipeline Stages

Following are the various stages of the Data Science Pipeline :

1. Data Acquisition: Data science pipeline starts with Data. In most companies, there are
Data engineers that create tables for data collection and in some cases, you may use
API to call data and make it available at the start of the pipeline.
2. Data Splitting: The second stage, data splitting, is very important. Here we break the
dataset into training, testing, and validation data. Training data is used to train the
model, and we do this at the initial stage so that we can avoid data leakage. In some
cases, we split the data after preprocessing.
3. Data Preprocessing: In Data science we say it often, “Garbage in, Garbage out”, hence
the quality of data matters if we want quality in outcome. This step is usually about
cleaning the data and normalizing it. Generally, it takes care of getting rid of
characters that are irrelevant. The purpose of normalization is to update the numerical
values in data and bring them to a common scale without harming the actual
difference in values. This is applicable wherever there is a huge range in any variable
values.
4. Feature Engineering: This step consists of multiple tasks which are missing value
treatment, outlier treatment, Combinations (using current features to make new
features), aggregations, transformations, and encoding for categorical data type.
5. Labeling: This step is applicable in supervised cases where labels are required but not
yet available, and you feel that the model will be better if we feed labels to it. There
are two ways to approach this: a manual method and a rule-based method.
6. Exploratory Data Analysis: Some people do this part early, but for simplicity it is
suggested to perform EDA after we have the relevant features and labels in hand. EDA
can guide you during feature engineering.
7. Model Training: Model training refers to experimenting with different ML models for
the task at hand and choosing the best model based on the problem at hand.
8. Performance monitoring: After training a model, it is important to spend time on
model performance monitoring. We can produce relevant model metrics, reports,
charts, and visuals that provide clarity about model performance.
9. Interpretation: It is critical for businesses to understand what the model is doing and
why. There are many ways to approach this; global and local interpretations are
examples.
10. Iteration: This step talks about modifying the model to get better performance. It takes
the feedback loop into consideration.
11. Deployment: The next step is deployment, which means putting the model into
production. How this is done depends on the systems involved, whether it runs on the
cloud, and how the company desires to use the built model (a short end-to-end sketch
follows this list).
12. Monitoring: Post deploying the models, one has to keep monitoring the performance
of it against unseen data. Oftentimes, the model needs to be re-trained due to a certain
data drift i.e., the distribution of the unseen data has changed compared to what was
used during training and validation phase.
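
A compressed sketch of several of these stages, assuming scikit-learn and its built-in breast cancer dataset as a stand-in for data acquired from tables or APIs: splitting comes before preprocessing to avoid leakage, scaling and training are chained together, and performance is checked on unseen data.

Example
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Data acquisition: a built-in dataset stands in for data pulled from tables or APIs
X, y = load_breast_cancer(return_X_y=True)

# Data splitting: keep test data aside before any preprocessing to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Preprocessing + model training chained as one pipeline object
pipeline = Pipeline([
    ("scale", StandardScaler()),                      # normalization to a common scale
    ("model", LogisticRegression(max_iter=5000)),     # model training
])
pipeline.fit(X_train, y_train)

# Performance monitoring on unseen data
print("Test accuracy:", pipeline.score(X_test, y_test))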

Benefits of Data Science Pipelines

Following are the benefits of the Data Science Pipelines:

1. Organizations can leverage the insights gained with the help of the pipeline and hence
make critical decisions faster.
2. Data Science pipelines allow organizations to understand the behavioral patterns of
their target audience and then recommend personalized products and services.
3. Allows for efficiency in the processes by identifying anti-patterns and bottlenecks.

Characteristics of a Data Science Pipeline

1. Customizable and Extensible: The constituent components of a data science pipeline


should be loosely coupled allowing for easy extensibility and ease of customization
when it comes to use by different teams or departments.
2. Highly available and resistant to data corruption: Depending upon the rate / amount of
ingestion, the data science pipeline should be elastic enough to handle surge in
amount of data without causing any kind of corruption of data.
3. Redundancy and recovery from disaster: An ideal data science pipeline should have
controls in place to recover from a disaster and in the event of a disaster, continuity of
business should not be impacted, or the impact should be minimized.

Application of data science in various fields


Applications of Data Science
1. In Search Engines

The most useful application of Data Science is in search engines. When we want to search
for something on the internet, we mostly use search engines like Google, Bing, Yahoo, and
DuckDuckGo. Data Science is used to return relevant results faster.

For example, when we search for something, say "Data Structure and Algorithm courses",
the first link shown is often to GeeksforGeeks courses. This happens because the
GeeksforGeeks website is visited most often for information on Data Structure courses and
computer-related subjects. This analysis is done using Data Science, which surfaces the
most-visited web links at the top.

2. In Transport

Data Science also entered into the Transport field like Driverless Cars. With the help of
Driverless Cars, it is easy to reduce the number of Accidents.

For Example, In Driverless Cars the training data is fed into the algorithm and with the help
of Data Science techniques, the Data is analyzed like what is the speed limit in Highway,
Busy Streets, Narrow Roads, etc. And how to handle different situations while driving etc.

3. In Finance

Data Science plays a key role in financial industries. Financial industries always face issues
of fraud and risk of losses. Thus, they need to automate risk-of-loss analysis in order to carry
out strategic decisions for the company. Financial industries also use Data Science analytics
tools to predict the future, which allows companies to predict customer lifetime value and
stock market moves.

For example, in the stock market Data Science plays a major part. Data Science is used to
examine past behavior with historical data, with the goal of anticipating future outcomes.
Data is analyzed in such a way that it becomes possible to predict future stock prices over a
set timeframe.

4. In E-Commerce

E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user
experience with personalized recommendations.

For example, when we search for something on e-commerce websites, we get suggestions
similar to our past choices, and we also get recommendations based on the most bought,
most rated, and most searched products. This is all done with the help of Data Science.

5. In Health Care

In the healthcare industry, data science acts as a boon. Data Science is used for:
 Detecting tumors.
 Drug discoveries.
 Medical Image Analysis.
 Virtual Medical Bots.
 Genetics and Genomics.
 Predictive Modeling for Diagnosis etc.

6. Image Recognition

Currently, Data Science is also used in image recognition. For example, when we upload an
image with a friend on Facebook, Facebook suggests tagging the people in the picture. This
is done with the help of machine learning and Data Science. When an image is recognized,
data analysis is done on one's Facebook friends, and if a face present in the picture matches
someone's profile, Facebook suggests auto-tagging.

7. Targeting Recommendation

Targeted recommendation is one of the most important applications of Data Science.
Whatever a user searches for on the internet, he/she will then see related posts everywhere.
This can be explained with an example: suppose I want a mobile phone, so I search for it on
Google, and afterwards I change my mind and decide to buy it offline. Data Science helps
the companies who pay to advertise that mobile phone, so everywhere on the internet, in
social media, on websites, and in apps, I will see recommendations for the phone I searched
for. This nudges me to buy it online.

8. Airline Route Planning

With the help of Data Science, the airline sector is also growing: it becomes easy to predict
flight delays. Data Science also helps decide whether to fly directly to the destination or take
a halt in between; for example, a flight can take a direct route from Delhi to the U.S.A. or
halt in between before reaching the destination.

9. Data Science in Gaming

In most games where a user plays against a computer opponent, data science concepts are
used with machine learning: with the help of past data, the computer improves its
performance. Many games, like Chess and EA Sports titles, use Data Science concepts.

10. Medicine and Drug Development

The process of creating medicine is very difficult and time-consuming and has to be done
with full discipline because it is a matter of someone's life. Without Data Science, it takes a
lot of time, resources, and money to develop a new medicine or drug, but with the help of
Data Science it becomes easier because the likelihood of success can be estimated from
biological data and other factors. Algorithms based on data science can forecast how a
compound will react in the human body without lab experiments.
11. In Delivery Logistics

Various Logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science
helps these companies to find the best route for the Shipment of their Products, the best time
suited for delivery, the best mode of transport to reach the destination, etc.

12. Autocomplete

The Autocomplete feature is an important application of Data Science: the user types just a
few letters or words and gets suggestions that complete the rest of the line. In Google Mail,
for instance, when we are writing a formal mail, the data-science-based Autocomplete
feature offers efficient choices to complete the whole sentence.

What is Data Security?

Data security is the process of protecting corporate data and preventing data loss through
unauthorized access. This includes protecting your data from attacks that can encrypt or
destroy data, such as ransomware, as well as attacks that can modify or corrupt your data.
Data security also ensures data is available to anyone in the organization who has access to it.

Some industries require a high level of data security to comply with data protection
regulations. For example, organizations that process payment card information must use and
store payment card data securely, and healthcare organizations in the USA must secure
private health information (PHI) in line with the HIPAA standard.

Data Science vs Business Analytics


Business Analytics vs Data Science – A Comprehensive Comparison

1. Business Analytics is the statistical study of business, business goals, and business data
to gain insights and develop better strategies and processes. Data Science is the study of
data using methods derived from computer science – like algorithms, mathematics, and
statistics – to find patterns and make future predictions.
2. Business Analytics deals primarily with structured data; Data Science works with both
unstructured and structured data.
3. Business Analytics is more statistics and analytics oriented and does not require much
programming; Data Science relies heavily on programming to create models which
identify patterns and derive insights.
4. In Business Analytics the entire analysis is statistical; in Data Science, statistics is just
one part of the process and is performed at the end – after programming the required
models.
5. Business Analytics is mostly important for industries such as healthcare, marketing,
retail, supply chain, and entertainment; Data Science is mostly important for industries
such as e-commerce, manufacturing, academics, ML/AI, and fintech.

Role of Data engineers

Data scientists and data analysts analyze data sets to glean knowledge and insights. Data
engineers build systems for collecting, validating, and preparing that high-quality data.
Data engineers gather and prepare the data and data scientists use the data to promote better
business decisions.

Data Engineer Roles and Responsibilities

Here is the list of roles and responsibilities, Data Engineers are expected to perform:

1. Work on Data Architecture

They use a systematic approach to plan, create, and maintain data architectures while also
keeping it aligned with business requirements.

2. Collect Data

Before initiating any work on the database, they have to obtain data from the right sources.
After formulating a set of dataset processes, data engineers store optimized data.

3. Conduct Research

Data engineers conduct research in the industry to address any issues that can arise while
tackling a business problem.
4. Improve Skills

Data engineers don’t rely on theoretical database concepts alone. They must have the
knowledge and prowess to work in any development environment regardless of their
programming language. Similarly, they must keep themselves up-to-date with machine
learning and its algorithms like the random forest, decision tree, k-means, and others.

They are proficient in analytics tools like Tableau, Knime, and Apache Spark. They use these
tools to generate valuable business insights for all types of industries. For instance, data
engineers can make a difference in the health industry and identify patterns in patient
behavior to improve diagnosis and treatment. Similarly, data engineers in law enforcement
can observe changes in crime rates.

5. Create Models and Identify Patterns

Data engineers use a descriptive data model for data aggregation to extract historical insights.
They also make predictive models where they apply forecasting techniques to learn about the
future with actionable insights. Likewise, they utilize a prescriptive model, allowing users to
take advantage of recommendations for different outcomes. A considerable chunk of a data
engineer’s time is spent on identifying hidden patterns from stored data.

6. Automate Tasks

Data engineers dive into data and pinpoint tasks where manual participation can be
eliminated with automation.

Data science impact on business

Data science uses a combination of various tools, algorithms, formulas, and machine
learning principles to draw hidden patterns from raw data. These patterns can then be used
to gain a better understanding of a variety of factors and influence decision making. Data
science does more than just crunch numbers — it reveals the “why” behind your data.

Data science is the key to making information actionable by using massive volumes of data
to predict behaviors and infer meaning from correlating data in a meaningful way. From
finding the best customers and charging the right prices to allocating costs accurately and
minimizing work-in-progress and inventory, data science is helping businesses maximize
innovation.

Data science tools and technologies have come a long way, but no development was more
important than the improvement of artificial intelligence (AI). AI is the ability of computers
to perform tasks that formerly were exclusive to humans. AI used to rely entirely on human
programming, but thanks to the application of machine learning, computers can now learn
from data to further develop their abilities. As a result, AIs can now read, write, listen, and
chat like a human can – though at a scope and speed that far exceeds what any one person is
capable of doing.
How Data Science Can Impact Your Business

Data science can positively impact many business functions, both customer-facing and
internally. And while the benefits and potential uses of data science are vast, here are some
of the primary ways organizations have used data science in their operations, and the
solutions they are using to get results.

Quantifiable & Data-Driven Decision Making


This is arguably the biggest reason many businesses utilize data science applications, and
it's usually also the biggest benefit. When organizations can organize, make sense of,
and leverage their data, they can make more accurate predictions, forecasts, and plans for
all areas of their operations. Using data science tools, businesses can determine what
elements they need to focus on to reach their most important targets and can then
implement the most effective plans to reach them. One relatively new but exciting feature
of this technology is the ability to analyze streaming data through time series analysis,
giving businesses real-time feedback that they can act on.
Better Understanding of Customer Intent
Organizations can now use data science tools to more effectively and accurately understand
customer intent and their data, thanks in large part to what is known as natural language
processing. Otherwise known as NLP, natural language processing utilizes AI to read,
write, understand, and ultimately extract meaning from human language to make decisions.
This is a major advancement for artificial intelligence and is changing the game for
businesses and data scientists. Using NLP, they have expanded capabilities such as topic
modeling, named entity recognition, and sentiment detection, all of which can help them
more effectively utilize their data and understand their customers.
Recruiting

Recruiting and retaining quality and skilled employees is a struggle for many businesses,
regardless of industry. NLP is also making a difference here, by automating aspects of the
recruiting process to help organizations find better candidates, faster. Using unique
algorithms, data science can “read” resumes and decide whether or not a candidate is worth
pursuing. It can even select resumes based on specific character and personality traits,
which enables businesses to get very specific about the type of person they are looking to
hire.

Opportunity Identification

Another capability of data science tools and analytics is opportunity identification. Using
historical and forecasted market data, businesses can identify geographic areas to target for
sales and marketing initiatives with greater accuracy. Data can inform new
market decisions and make predictions as to whether a new venture is likely to be cost
effective. This will ultimately help organizations determine what is worth the investment
and whether they can expect to see a return.
