
Introduction to Data Analysis:

Knowledge Domains

 The knowledge about the environment in which the data is
processed. In other words, it is knowledge of the field that
the data belongs to and in which it is analyzed to reveal
what the data contains.
 The knowledge about the environment in which the target
(i.e., software agent) operates.
 In data science, applying such domain knowledge also
improves the accuracy of the model.
Data Analysis:

 Data analysis is the process of cleaning, transforming, and
processing raw data to extract actionable, relevant
information that helps businesses make informed decisions.
 The procedure helps reduce the risks inherent in
decision-making by providing useful insights and statistics,
often presented in charts, images, tables, and graphs.
Data Analysis Process:
 Data Requirements Specification
 Data Collection
 Data Processing
 Data Cleaning
 Data Analysis
 Communication
Data Requirements Specification:
 The data required for analysis is based on a question or
an experiment.
 Based on the requirements of those directing the
analysis, the data necessary as inputs to the analysis is
identified (e.g., Population of people).
 Specific variables regarding a population (e.g., Age and
Income) may be specified and obtained.
 Data may be numerical or categorical.

Data Collection:
 Data Collection is the process of gathering information
on targeted variables identified as data requirements.
 Data Collection should ensure that the data gathered is
accurate, so that the decisions based on it are valid.
 Data is collected from various sources ranging from
organizational databases to the information in web
pages.
 The data thus obtained may not be structured and may
contain irrelevant information.
 Hence, the collected data must be subjected to
Data Processing and Data Cleaning.
Data Processing:
 The data that is collected must be processed or organized
for analysis.
 This includes structuring the data as required for the
relevant Analysis Tools. For example, the data might
have to be placed into rows and columns in a table within
a Spreadsheet or Statistical Application.
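The structuring step above can be sketched in a few lines of Python. This is only an illustration: the record format, the field names (name, age, income), and the values are invented for the example and are not taken from any real data source.

```python
import csv
import io

# Hypothetical raw records, as they might arrive from different sources.
# Note that the fields do not always appear in the same order.
raw_records = [
    "name=Asha; age=34; income=52000",
    "age=41; name=Ben; income=61000",
    "name=Chloe; income=47000; age=29",
]

def parse_record(line):
    """Turn one 'key=value; key=value' string into a dictionary."""
    pairs = (item.split("=", 1) for item in line.split(";"))
    return {key.strip(): value.strip() for key, value in pairs}

columns = ["name", "age", "income"]  # fixed column order for the table
rows = [parse_record(line) for line in raw_records]

# Write the rows as CSV text, a row-and-column format that most
# spreadsheet and statistical applications can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
table_text = buffer.getvalue()
print(table_text)
```

Each record becomes one row and each variable one column, regardless of the order in which the fields arrived.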
Data Cleaning:
 The processed and organized data may be incomplete,
contain duplicates, or contain errors. Data Cleaning is the
process of preventing and correcting these errors.
 There are several types of Data Cleaning that depend on
the type of data. For example, while cleaning the financial
data, certain totals might be compared against reliable
published numbers or defined thresholds.
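A minimal sketch of these cleaning steps, assuming a small list of financial transactions. The transaction ids, the amounts, the reference total, and the tolerance are all made up for illustration; real cleaning rules depend on the data at hand.

```python
# Hypothetical processed records: (transaction id, amount).
# None marks a missing amount.
records = [
    ("T1", 120.0),
    ("T2", None),    # incomplete
    ("T1", 120.0),   # duplicate of the first row
    ("T3", 80.5),
    ("T4", -15.0),   # negative amount, treated as an error in this example
]

def clean(rows):
    """Drop incomplete rows, negative amounts, and exact duplicates."""
    seen = set()
    cleaned_rows = []
    for row in rows:
        _, amount = row
        if amount is None or amount < 0:
            continue
        if row in seen:
            continue
        seen.add(row)
        cleaned_rows.append(row)
    return cleaned_rows

cleaned = clean(records)
total = sum(amount for _, amount in cleaned)

# Compare the total against a reference figure within a tolerance.
# Both numbers are invented for this example.
published_total = 200.0
tolerance = 1.0
total_matches = abs(total - published_total) <= tolerance
print(cleaned, total, total_matches)
```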
Data Analysis:
 Data that has been processed, organized, and cleaned is
ready for analysis. Various data analysis techniques
are available to understand, interpret, and derive
conclusions based on the requirements.
 Data Visualization may also be used to examine the data in
graphical format, to obtain additional insight regarding the
messages within the data.
 Statistical techniques such as correlation and regression
analysis can be used to identify the relations among the
data variables.
 The process might require additional Data Cleaning or
additional Data Collection, and hence these activities are
iterative in nature.
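As a rough illustration of correlation and regression, the sketch below computes a Pearson correlation coefficient and a least-squares line for two variables. The paired values are hypothetical, and the helper names are chosen for this example only.

```python
import math

# Hypothetical paired observations of two variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

def mean(values):
    return sum(values) / len(values)

def pearson_correlation(xs, ys):
    """Strength of the linear relation between xs and ys, from -1 to 1."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

def linear_regression(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    intercept = my - slope * mx
    return slope, intercept

r = pearson_correlation(x, y)
slope, intercept = linear_regression(x, y)
print(f"r = {r:.4f}, y = {slope:.2f} * x + {intercept:.2f}")
```

A coefficient close to 1 or -1 suggests a strong linear relation, while the fitted line can be used to estimate one variable from the other.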
Communication:
 The results of the data analysis are to be reported in a
format as required by the users to support their decisions
and further action. The feedback from the users might
result in additional analysis.
 The data analysts can choose data visualization
techniques, such as tables and charts, which help in
communicating the message clearly and efficiently to the
users.
Types of Data Analysis:
 Diagnostic Analysis
 Predictive Analysis
 Prescriptive Analysis
 Statistical Analysis
 Descriptive
 Inferential
 Text Analysis
Diagnostic Analysis: 
 Diagnostic analysis answers the question, “Why did this
happen?”
 Using insights gained from statistical analysis (more on
that later!), analysts use diagnostic analysis to identify
patterns in data.
 Ideally, the analysts find similar patterns that existed in
the past and can reuse the solutions applied then to
resolve the present challenges.
Predictive Analysis: 
 Predictive analysis answers the question, “What is most
likely to happen?”
 By using patterns found in older data as well as current
events, analysts predict future events.
 While there’s no such thing as 100 percent accurate
forecasting, the odds improve if the analysts have plenty
of detailed information and the discipline to research it
thoroughly.
Prescriptive Analysis: 
 Mix all the insights gained from the other data analysis
types, and you have prescriptive analysis.
 Sometimes, an issue can’t be solved solely with one
analysis type, and instead requires multiple insights.
Statistical Analysis: 
 Statistical analysis answers the question, “What
happened?”
 This analysis covers data collection, analysis, modeling,
interpretation, and presentation using dashboards.
 The statistical analysis breaks down into two sub-
categories:
 Descriptive: 
Descriptive analysis works with either the complete data or
a summarized selection of numerical data.
It illustrates means and deviations in continuous data
and percentages and frequencies in categorical data.
 Inferential: 
Inferential analysis works with samples derived from
complete data.
An analyst can arrive at different conclusions from the
same comprehensive data set just by choosing
different samples.
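The difference between the two sub-categories can be sketched as follows. The population is randomly generated purely for illustration, and the variable names (ages, housing) are assumptions borrowed from examples elsewhere in these slides.

```python
import random
import statistics
from collections import Counter

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical complete data set: ages and housing status of 1000 people.
population_ages = [random.randint(18, 80) for _ in range(1000)]
housing = [random.choice(["own", "rent"]) for _ in range(1000)]

# Descriptive: summarize the data itself.
pop_mean = statistics.mean(population_ages)    # mean of continuous data
pop_sd = statistics.pstdev(population_ages)    # deviation of continuous data
counts = Counter(housing)                      # frequencies of categories
percentages = {k: 100 * v / len(housing) for k, v in counts.items()}

# Inferential: estimate the population mean from samples.
sample_a = random.sample(population_ages, 50)
sample_b = random.sample(population_ages, 50)
estimate_a = statistics.mean(sample_a)
estimate_b = statistics.mean(sample_b)

print(f"population mean {pop_mean:.1f}, sd {pop_sd:.1f}")
print(f"housing percentages {percentages}")
print(f"sample estimates {estimate_a:.1f} and {estimate_b:.1f}")
```

The two sample estimates generally differ from each other and from the population mean, which is why the choice of sample can change the conclusion.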
Text Analysis:
 Also called “text mining” and often treated as a branch of
data mining, text analysis uses databases and data mining
tools to discover patterns residing in large datasets.
 It transforms raw data into useful business information.
Text analysis is arguably the most straightforward and
the most direct method of data analysis.
Quantitative and Qualitative Analysis:
Quantitative Analysis:
 Quantitative analysis is often associated with numerical
analysis where data is collected, classified, and then
computed for certain findings using a set of statistical
methods.
 Data is chosen randomly in large samples and then
analyzed. The advantage of quantitative analysis is that the
findings can be applied to the general population using
the patterns found in the sample.
 Quantitative analysis is more objective in nature.
 Quantitative analysis is generally concerned with
measurable quantities such as weight, length,
temperature, speed, width, and many more.
 The data can be expressed in a tabular form or any
diagrammatic representation using graphs or charts.
 Quantitative data can be classified as continuous or
discrete, and it is often obtained using surveys,
observations, experiments or interviews.
Qualitative Analysis:
 Qualitative analysis is concerned with the analysis of
data that cannot be quantified. This type of data is about
the understanding and insights into the properties and
attributes of objects (participants).
 Qualitative analysis can provide a deeper understanding of
“why” a certain phenomenon occurs.
 Unlike quantitative analysis, which is restricted by
certain classification rules or numbers, qualitative data
analysis can be wide-ranging and multi-faceted. It is
subjective, descriptive, non-statistical, and exploratory in
nature.
 In a qualitative analysis the characteristics of objects
are often undisclosed.
 The typical data analyzed qualitatively include color,
gender, nationality, taste, appearance, and many more as
long as the data cannot be computed.
 Such data is obtained using interviews or observations.
 There are limitations in qualitative analysis. For instance,
its findings cannot be generalized to the population.
o Qualitative analysis is exploratory and subjective.
o Quantitative analysis is conclusive and objective.
Summary of Quanti… and Quali….:
 Quantitative analysis quantifies data to test hypotheses
or predict the future whereas qualitative analysis seeks
to get a deeper understanding of why certain things
occur.
 The sample is small in qualitative analysis and cannot be
used to represent the whole population while in
quantitative analysis the sample is large and can
represent the entire population.
 The researcher conducts interviews or surveys to collect
qualitative data, whereas in quantitative analysis the
researcher conducts experiments, observations, and
measurements.
 Typical data in qualitative analysis include color, race,
and gender, whereas quantitative analysis covers all
measurable quantities such as density, length, size, and
weight.
Quantitative V/S Qualitative:
Nature of Data:
 Data can be seen in two distinct ways: categorical and
numerical.
 Categorical data are values or observations that can be
sorted into groups or categories.
 There are two types of categorical values, nominal and
ordinal. A nominal variable has no intrinsic ordering to its
categories. For example, housing is a categorical variable
having two categories (own and rent).
 An ordinal variable has an established ordering. For
example, age can be a variable with three ordered categories
(young, adult, and elder).
 Numerical data are values or observations that can be
measured. There are two kinds of numerical values,
discrete and continuous.
 Discrete data are values or observations that can be
counted and are distinct and separate. For example, the
number of lines of code in a program.
 Continuous data are values or observations that may
take on any value within a finite or infinite interval. For
example, an economic time series such as historic gold
prices.
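The four kinds of values described above can be illustrated with a short Python sketch. All values, including the gold prices, are invented for the example.

```python
# Nominal: labels with no intrinsic order. Counting is meaningful,
# sorting is not.
housing = ["own", "rent", "rent", "own", "rent"]
housing_counts = {label: housing.count(label) for label in set(housing)}

# Ordinal: the order is part of the data, so it is declared explicitly
# (an alphabetical sort would give the wrong order).
age_order = ["young", "adult", "elder"]
age_groups = ["elder", "young", "adult", "young"]
sorted_groups = sorted(age_groups, key=age_order.index)

# Discrete: whole-number counts.
lines_of_code = [120, 45, 300]
total_lines = sum(lines_of_code)

# Continuous: measured values that can take any value in an interval.
gold_prices = [1802.35, 1810.10, 1795.80]  # made-up prices
average_price = sum(gold_prices) / len(gold_prices)

print(housing_counts, sorted_groups, total_lines, round(average_price, 2))
```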
 There are four types of data:
 Nominal
 Ordinal
 Interval
 Ratio
 Each offers a unique set of characteristics, which impacts
the type of analysis that can be performed.
 The distinction between the four types of scales centers
on three different characteristics:
 The order of responses – whether it matters or not.
 The distance between observations – whether it
matters or is interpretable.
 The presence or inclusion of a true zero.
Nominal Scales:
 Nominal scales measure categories and have the
following characteristics:
 Order: The order of the responses or observations does
not matter.
 Distance: Nominal scales do not hold distance. When
categories are coded as numbers, the codes are only labels,
so the difference between a 1 and a 2 has no meaning.
 True Zero: There is no true or real zero. In a nominal
scale, zero is un-interpretable.
Ordinal Scales:
 At the risk of providing a tautological definition, ordinal
scales measure, well, order. So, our characteristics for
ordinal scales are:
 Order: The order of the responses or observations
matters.
 Distance: Ordinal scales do not hold distance. The
distance between first and second is unknown, as is the
distance between first and third or between any other pair
of observations.
 True Zero: There is no true or real zero. An item,
observation, or category cannot be ranked zero.
Interval Scales:
 Interval scales provide insight into the variability of the
observations or data. Scales commonly treated as interval
are Likert scales (e.g., 1 - strongly agree and 9 - strongly
disagree) and Semantic Differential scales (e.g., 1 - dark
and 9 - light), although strictly speaking such rating
scales are ordinal.
 The characteristics of interval scales are:
 Order: The order of the responses or observations does
matter.
 Distance: Interval scales do offer distance. That is, the
distance from 1 to 2 is the same as from 4 to 5. Hence,
we can add and subtract values and compare differences.
Ratios, however, are not meaningful without a true zero:
six is not “twice as much” as three on an interval scale.
 True Zero: There is no true zero with interval scales.
However, data can be rescaled in a manner that contains zero.
 An interval scale measured from 1 to 9 remains the same
as 11 to 19 because we added 10 to all values. Similarly,
a 1 to 9 interval scale is the same as a -4 to 4 scale because
we subtracted 5 from all values.
 Although the new scale contains zero, zero remains
un-interpretable because it only appears in the scale from
the transformation.
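The rescaling described above can be checked with a small sketch: subtracting 5 from every value leaves the distances between observations unchanged, whereas the ratio between two values changes. The four responses are hypothetical.

```python
# Hypothetical responses on a 1-to-9 interval scale.
original = [1, 3, 6, 9]

# Rescale by subtracting 5, giving values on a -4 to 4 scale.
shifted = [value - 5 for value in original]

def differences(values):
    """Distances between consecutive observations."""
    return [b - a for a, b in zip(values, values[1:])]

# Distances survive the shift, so the two scales carry the same information.
same_differences = differences(original) == differences(shifted)

# Ratios do not survive the shift, which is why a zero produced by
# rescaling is not a true zero.
ratio_original = original[2] / original[1]  # 6 / 3
ratio_shifted = shifted[2] / shifted[1]     # 1 / -2

print(same_differences, ratio_original, ratio_shifted)
```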
Ratio Scales:
 Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
 Order: The order of the responses or observations
matters.
 Distance: Ratio scales do have an interpretable distance.
 True Zero: There is a true zero.
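The three characteristics of the four scales can be collected in a small lookup table. The sketch below is one possible way to encode them; the names Scale, SCALES, and ratios_are_meaningful are invented for this example.

```python
from typing import NamedTuple

class Scale(NamedTuple):
    order: bool      # does the order of observations matter?
    distance: bool   # is the distance between observations interpretable?
    true_zero: bool  # does the scale have a true zero?

SCALES = {
    "nominal":  Scale(order=False, distance=False, true_zero=False),
    "ordinal":  Scale(order=True,  distance=False, true_zero=False),
    "interval": Scale(order=True,  distance=True,  true_zero=False),
    "ratio":    Scale(order=True,  distance=True,  true_zero=True),
}

def ratios_are_meaningful(scale_name):
    """Statements such as 'twice as much' need distance and a true zero."""
    scale = SCALES[scale_name]
    return scale.distance and scale.true_zero

for name, scale in SCALES.items():
    print(name, scale, ratios_are_meaningful(name))
```

Each scale adds one characteristic to the previous one, so only the ratio scale supports statements about how many times larger one value is than another.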
