1.1 INTRODUCTION
1.1.1 Data
Data is a collection of discrete values that convey information, describing quantity, quality,
facts, statistics, etc. Data is information, such as facts and numbers, used to analyse something or
make decisions.
The characteristics of big data are often referred to as the three Vs:
Volume - How much data is there?
Variety - How diverse are different types of data?
Velocity - At what speed is new data generated?
Fundamentals of Data Science and Analytics Notes – Unit I
Data science can be explained as the entire process of gathering actionable insights from
raw data that involves concepts like pre-processing of data, data modelling, statistical analysis,
data analysis, machine learning algorithms, etc.
The main purpose of data science is to support better decision making. It uses complex
machine learning algorithms to build predictive models. The data used for analysis can come from
many different sources and be presented in various formats.
1.1.5 Why is Data Science so important? (Need for Data Science)
Data is meaningless until it is converted into valuable information. Data Science involves
mining large datasets containing structured and unstructured data and identifying hidden
patterns to extract actionable insights.
The importance of Data Science lies in its numerous uses that range from daily activities
like asking Siri or Alexa for recommendations to more complex applications like operating a self-
driving car.
Handling huge amounts of data is a challenging task for every organization. To handle,
process, and analyse this data, we require complex, powerful, and efficient algorithms, and the
technology that provides them came into existence as data science. Following are some main
reasons for using data science technology:
With the help of data science technology, we can convert the massive amount of raw and
unstructured data into meaningful insights.
Data science is used to automate transportation, for example in creating self-driving cars,
which are seen as the future of transportation.
Data science can help with different predictions, such as survey results, elections, flight ticket
confirmation, etc.
2. Domain Expertise: Domain expertise binds data science together. Domain
expertise means specialized knowledge or skills in a particular area. In data science, there
are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science, which involves acquiring,
storing, retrieving, and transforming the data. Data engineering also includes adding
metadata (data about data) to the data.
6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the
study of quantity, structure, space, and change. For a data scientist, a good knowledge of
mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all
about training a machine so that it can learn from data, much as a human brain does. In data
science, we use various machine learning algorithms to solve problems.
Data Analysis tools: Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
The next step in the data science life cycle is to create a data pipeline where the
relevant data is extracted from the source and transformed into machine readable format,
and eventually the data is loaded into the program or the machine learning pipeline to get
things started.
For the above example – To forecast the sales, we will need data from the store
that will be useful for formulating an efficient machine learning model. Keeping this in
mind, we would create separate data points that may or may not be affecting the sales for
that particular store.
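The extract, transform, and load steps described above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library; the CSV layout, the column names, and the function names are assumptions made for the store sales example, not part of the original notes.

```python
import csv
import io

# Hypothetical raw export from the store's sales system (assumed layout).
RAW_CSV = """date,units_sold,unit_price,promotion
2023-01-01,120,2.50,yes
2023-01-02,95,2.50,no
2023-01-03,,2.50,no
2023-01-04,130,2.25,yes
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows and convert strings to typed values."""
    clean = []
    for row in rows:
        if not row["units_sold"]:
            continue  # skip records with a missing sales figure
        units = int(row["units_sold"])
        price = float(row["unit_price"])
        clean.append({
            "date": row["date"],
            "units_sold": units,
            "revenue": round(units * price, 2),
            "promotion": row["promotion"] == "yes",
        })
    return clean

def load(records):
    """Load: hand the records to the next stage (here, an in-memory list)."""
    return list(records)

dataset = load(transform(extract(RAW_CSV)))
for record in dataset:
    print(record)
```

In a real project the extract step would read from the store's database or files, and the load step would write to the storage used by the modelling stage.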
The third step is where the magic happens. Using statistical analysis, Exploratory
data analysis, data wrangling and manipulation, we will create meaningful data. The
preprocessing is done to assess the various data points and formulate hypotheses that
best explain the relationship between the various features in the data.
For example – The store sales problem will require the data to be in a time series
format to be able to forecast the sales. The hypothesis testing will test the stationarity of
the series and further computations will show various trends, seasonality and other
relationship patterns in the data.
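A first look at trend and seasonality can be taken with a rolling mean and a seasonal difference. The sketch below is an informal exploration under assumed data, not a formal stationarity test; the sales figures and the 4-month season length are hypothetical.

```python
from statistics import mean

# Hypothetical monthly sales with an upward trend and a 4-month repeating pattern.
sales = [100, 120, 90, 110, 108, 128, 98, 118, 116, 136, 106, 126]

def rolling_mean(series, window):
    """Average of each run of `window` consecutive values; smooths short-term swings."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]

def seasonal_difference(series, period):
    """Subtract the value one season earlier to remove the repeating pattern."""
    return [series[i] - series[i - period] for i in range(period, len(series))]

trend = rolling_mean(sales, window=4)
diffs = seasonal_difference(sales, period=4)

print("rolling mean:", trend)          # rises steadily, which suggests a trend
print("seasonal difference:", diffs)   # constant, so the pattern repeats each season
```

A rolling mean that keeps rising indicates that the raw series is not stationary, which is why series like this are usually differenced before a forecasting model is fitted.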
This step involves advanced machine learning concepts that will be used for
feature selection, feature transformation, standardization of the data, data normalization,
etc. Choosing the best algorithms based on evidence from the above steps will help you
create a model that will efficiently create a forecast for the said months in the above
example.
For example – We can use the Time Series forecasting approach for the business
problem where the presence of high dimensional data could be the case. We will use
various dimensionality reduction techniques, and create a Forecasting model using AR,
MA, or ARIMA model and forecast the sales for the next quarter.
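As an illustration of the AR family mentioned above, the following sketch fits the simplest case, an AR(1) model y[t] = c + phi * y[t-1], by ordinary least squares and rolls it forward three steps. The sales figures and function names are hypothetical, and MA and ARIMA models are not covered; in practice a statistics library would normally be used instead of hand-written code.

```python
def fit_ar1(series):
    """Estimate c and phi in y[t] = c + phi * y[t-1] by ordinary least squares."""
    x = series[:-1]   # lagged values y[t-1]
    y = series[1:]    # current values y[t]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi

def forecast(series, c, phi, steps):
    """Roll the fitted equation forward, feeding each forecast back in."""
    out = []
    last = series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

# Hypothetical monthly sales figures.
monthly_sales = [200.0, 210.0, 215.0, 217.5, 218.75, 219.375]
c, phi = fit_ar1(monthly_sales)
next_quarter = forecast(monthly_sales, c, phi, steps=3)
print(c, phi, next_quarter)
```

Each forecast depends only on the previous value, so errors accumulate as the horizon grows; this is one reason short horizons such as the next quarter are preferred.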
The final step of the data science life cycle is gathering insights from the said
problem statement. We create inferences and findings from the entire process that would
best explain the business problem.
For example – From the above Time series model, we will get the monthly or
weekly sales for the next 3 months. These insights will in turn help the professionals create
a strategy plan to overcome the problem at hand.
The solutions for the business problem are nothing but actionable insights that will
solve the problem using evidence based information. For example – Our forecast from the
Time series model will give an efficient estimate for the store sales in the next 3 months.
Using those insights, the store can plan their inventory to reduce the wastage of perishable
goods.
Databases: An in-depth understanding of databases and query languages such as SQL is
essential in data science to get the data and to work with it.
Non-Technical Prerequisite:
Curiosity: To learn data science, one must be curious. When you are curious and
ask various questions, you can understand the business problem more easily.
Critical Thinking: Critical thinking is also required of a data scientist, so that you can find
multiple new ways to solve the problem efficiently.
Communication skills: Communication skills are very important for a data scientist
because after solving a business problem, you need to communicate the solution to the team.
In the healthcare industry, physicians use Data Science to analyze data from wearable
trackers to ensure their patients’ well-being and make vital decisions. Data Science also
enables hospital managers to reduce waiting time and enhance care.
Transportation providers use Data Science to enhance the transportation journeys of their
customers. For instance, Transport for London maps customer journeys offering
personalized transportation details, and manages unexpected circumstances using
statistical data.
Construction companies use Data Science for better decision making by tracking activities,
including average time for completing tasks, materials-based expenses, and more.
Data Science enables capturing and analyzing massive data from manufacturing processes,
which has gone untapped so far.
With Data Science, one can analyze massive graphical data, temporal data, and geospatial
data to draw insights. It also helps in seismic interpretation and reservoir characterization.
Data Science enables firms to leverage social media content to obtain real-time media
content usage patterns. This enables the firms to create target audience-specific content,
measure content performance, and recommend on-demand content.
Data Science helps study utility consumption in the energy and utility domain. This study
allows for better control of utility use and enhanced consumer feedback.
Data Science applications in the public service field include health-related research,
financial market analysis, fraud detection, energy exploration, environmental protection,
and more.
Skills
Business intelligence: Statistics and Visualization are the two skills required for business intelligence.
Data science: Statistics, Visualization, and Machine learning are the required skills for data science.
2. Decision Tree: The decision tree algorithm is another machine learning algorithm, which
belongs to the family of supervised learning algorithms. It is one of the most popular machine
learning algorithms and can be used for both classification and regression problems.
In the decision tree algorithm, we solve the problem using a tree representation in
which each node represents a feature, each branch represents a decision, and each leaf
represents the outcome. Following is the example for a Job offer problem:
In the decision tree, we start from the root of the tree and compare the value of
the root attribute with the record's attribute. On the basis of this comparison, we follow the
branch that matches the value and then move to the next node. We continue comparing these
values until we reach a leaf node with the predicted class value.
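The traversal described above can be sketched with nested dictionaries. The tree below is a hypothetical version of the job offer problem: the feature names and the decisions at each branch are assumptions made for illustration, and the tree is written by hand rather than learned from data.

```python
# A hypothetical tree for the job offer problem; features and outcomes are assumed.
tree = {
    "feature": "salary_at_least_50k",
    "branches": {
        False: {"label": "decline offer"},
        True: {
            "feature": "commute_over_one_hour",
            "branches": {
                True: {"label": "decline offer"},
                False: {"label": "accept offer"},
            },
        },
    },
}

def predict(node, record):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while "label" not in node:
        value = record[node["feature"]]   # compare the node's attribute with the record
        node = node["branches"][value]    # follow the matching branch
    return node["label"]

offer = {"salary_at_least_50k": True, "commute_over_one_hour": False}
print(predict(tree, offer))
```

Internal nodes hold a feature, branches hold the possible values of that feature, and leaves hold the predicted class, which mirrors the node, branch, and leaf roles described in the text.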
3. K-Means Clustering: K-means clustering is one of the most popular machine learning
algorithms. It belongs to the family of unsupervised learning algorithms and solves the
clustering problem.
If we are given a data set of items with certain features and values, and we need
to categorize those items into groups, this type of problem can be solved using the
k-means clustering algorithm.
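A minimal sketch of k-means on two-dimensional points is given below. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. The sample points are hypothetical, and using the first k points as initial centroids is a simplification chosen to keep the result repeatable; common implementations pick starting centroids randomly.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters

# Hypothetical items described by two numeric features.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere in the code, which is what makes this an unsupervised method: the groups emerge from the distances between items alone.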
4. Classification
Classification is the act or process of dividing things into groups according to their type.
In statistics, classification is the problem of identifying which of a set of categories
(sub-populations) an observation (or observations) belongs to. There are two types of
classification: Binary Classification and Multi-class Classification.
5. Outlier Analysis
Outliers are classified into three types namely Global Outliers, Contextual Outliers
and Collective Outliers.
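Of the three types, global outliers are the simplest to detect, because a value only has to be compared with the rest of the data set. The sketch below uses the interquartile range (IQR) rule, a common convention that the notes do not prescribe; the sales figures and the factor of 1.5 are assumptions. Contextual and collective outliers need context or sequence information and are not covered here.

```python
from statistics import quantiles

def global_outliers(values, factor=1.5):
    """Flag values lying more than `factor` IQRs outside the middle 50% of the data."""
    q1, _, q3 = quantiles(values, n=4)   # first, second, and third quartiles
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily sales counts with one value far from the rest.
daily_sales = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
flagged = global_outliers(daily_sales)
print(flagged)
```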
So in data science, problems are solved using algorithms, and below is the diagram
representation for applicable algorithms for possible questions:
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying. Unstructured data is information that either does not have a
predefined data model or is not organised in a pre-defined manner.
Unstructured information is typically text-heavy, but may contain data such as dates,
numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in structured databases.
Common examples of unstructured data include audio files, video files, and NoSQL databases.
Examples of unstructured data are:
Rich media. Media and entertainment data, surveillance data, geo-spatial data,
audio, weather data
Document collections. Invoices, records, emails, productivity applications
Internet of Things (IoT). Sensor data, ticker data
Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention. Examples of machine data
are web server logs, call detail records, network event logs, and telemetry.
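Machine-generated data usually follows a fixed layout, so it can be turned into structured records with a pattern. The sketch below parses one web server log line; it assumes the widely used Common Log Format, and the sample line and field names are hypothetical.

```python
import re

# Assumed layout: host, two identity fields, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dictionary, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(sample)
print(entry)
```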
Streaming data
While streaming data can take almost any of the previous forms, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a data store
in a batch. Although this isn’t really a different type of data, we treat it here as such because you
need to adapt your process to deal with this type of information. Examples are the “What’s
trending” on Twitter, live sporting or music events, and the stock market.
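The difference from batch loading can be seen in a small sketch: each event is processed as it arrives and the result is updated immediately, without waiting for the full data set. The price feed below is a hypothetical stand-in for a live source, and the function names are assumptions.

```python
def running_average(events):
    """Consume events one at a time and yield the average seen so far."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count

def price_stream():
    """Stand-in for a live feed; a real source would wait for the next event."""
    for price in [101.0, 103.0, 102.0, 106.0]:
        yield price

averages = list(running_average(price_stream()))
print(averages)
```

Only the count and the running total are kept in memory, so the same code works however long the stream runs.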
Data preparation
Data collection is an error-prone process; in this phase you enhance the quality of the data
and prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data integration
enriches data sources by combining information from multiple data sources, and data
transformation ensures that the data is in a suitable format for use in your models.
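The three subphases can be sketched as three small functions applied in sequence. The records, field names, and the choice of min-max scaling as the transformation are assumptions made for illustration only.

```python
# Two hypothetical sources describing the same customers.
orders = [
    {"customer_id": 1, "amount": "250.0"},
    {"customer_id": 2, "amount": "-40.0"},   # impossible negative amount
    {"customer_id": 3, "amount": "120.5"},
]
profiles = {1: {"region": "North"}, 3: {"region": "South"}}

def cleanse(rows):
    """Data cleansing: drop rows whose amount is not a valid positive number."""
    valid = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        if amount > 0:
            valid.append({"customer_id": row["customer_id"], "amount": amount})
    return valid

def integrate(rows, lookup):
    """Data integration: enrich each row with fields from a second source."""
    return [{**row, **lookup.get(row["customer_id"], {})} for row in rows]

def transform(rows):
    """Data transformation: rescale amounts to the 0..1 range (min-max scaling)."""
    amounts = [row["amount"] for row in rows]
    low, high = min(amounts), max(amounts)
    span = (high - low) or 1.0   # avoid dividing by zero when all amounts are equal
    return [{**row, "amount_scaled": (row["amount"] - low) / span} for row in rows]

prepared = transform(integrate(cleanse(orders), profiles))
for row in prepared:
    print(row)
```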
Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try
to understand how variables interact with each other, the distribution of the data, and whether
there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
The above diagram summarizes the data science process and shows the main steps and actions
that will be taken during a project.
1. The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project. In every
serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. The third phase is data preparation. Now that you have the raw data, it’s time to prepare
it. This includes transforming the data from a raw form into data that’s directly usable in
your models. To achieve this, you’ll detect and correct different kinds of errors in the data,
combine data from different data sources, and transform it. If you have successfully
completed this step, you can progress to data visualization and modelling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding
of the data. You’ll look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modelling.
5. The fifth step in the data science process is model building. It is now that you attempt to
gain the insights or make the predictions stated in your project charter.
6. The last step of the data science process is presenting your results and automating the
analysis. One goal of a project is to change a process and/or make better decisions. You
may still need to convince the business that your findings will indeed change the business
process as expected. The importance of this step is more apparent in projects on a
strategic and tactical level. Certain projects require you to perform the business process
over and over again, so automating the project will save time.
1.10.1 Data Science Process: Defining research goals and creating a project charter
A project starts by understanding the what, the why, and the how of your project.
“What does the company expect you to do?”, “Why does management place such a
value on your research?”, “Is it part of a bigger strategic picture, or a project originating
from an opportunity someone detected?”. Answering these three questions (what,
why, how) is the goal of the first phase.
The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables, and a plan of action with a timetable. This information is
then best placed in a project charter.
The main challenge in data collection is identifying the data sources where the
required data is actually stored, because a company may have stored the data across
many places.
Another challenge is to extract useful data from the collected data and to remove the
noise and unwanted data from it. Many manual and automated tools are used to
refine the data.
Many companies publish their data in open forums for public access, and some of them
are as follows,
The model building process involves setting up ways of collecting data, understanding
and paying attention to what is important in the data to answer the questions you are
asking, and finding a statistical, mathematical, or simulation model to gain understanding
and make predictions.
In regression analysis, model building is the process of developing a probabilistic
model that best describes the relationship between the dependent and independent
variables.
The model building process consists of the following three steps:
i. Selection of a modelling technique and variables to enter in the model
ii. Execution of the model
iii. Diagnosis and model comparison
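The three steps can be illustrated with a small regression example: a straight line with one predictor is selected, fitted, and then compared against a baseline that always predicts the mean. The data, the variable names, and the use of mean squared error as the diagnostic are assumptions; a fuller comparison would evaluate both models on data held back from fitting.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data: one independent variable (x) and one dependent variable (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

# i. Selection: a straight line in x, with a mean-only model as the baseline.
# ii. Execution: fit the line and produce predictions from both models.
slope, intercept = fit_line(x, y)
line_pred = [intercept + slope * a for a in x]
baseline_pred = [sum(y) / len(y)] * len(y)

# iii. Diagnosis and comparison: the model with the lower error describes the data better.
line_error = mse(y, line_pred)
baseline_error = mse(y, baseline_pred)
print(slope, intercept, line_error, baseline_error)
```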
The entire process needs to be documented in a professional way and include clear
visualizations, so that it is easy for the audience to understand.