Unit-2
Descriptive Analytics
By: Deependra Singh, Assistant Professor, School of Management, The NorthCap University, Gurugram
Descriptive Analytics
• Descriptive analytics summarizes data into meaningful charts
and reports, for example, about budgets, sales, revenues, or cost.
• Descriptive analytics is a set of techniques used to explain or
quantify the past.
• Examples of descriptive analytics include data queries,
visual reports, and descriptive statistics.
Understanding Data
• Data can be defined as a systematic record of a particular quantity. It is the
different values of that quantity represented together in a set.
• It is a collection of facts and figures to be used for a specific purpose such
as a survey or analysis.
• When arranged in an organized form, data can be called information. The
source of data (primary data, secondary data) is also an important factor.
Types of Data
➢ Qualitative VS Quantitative Data
Qualitative Data: These represent characteristics or attributes. They are descriptions that can be observed but cannot be computed or calculated. They are more exploratory than conclusive in nature.
Quantitative Data: These can be measured, not simply observed. They can be represented numerically, and calculations can be performed on them.
➢ Discrete VS Continuous Data
Discrete Data: Data that can take on only integer values, such as counts.
Continuous Data: Data that can take on any value in an interval, that is, any value between the lowest and highest values of a range.
➢ Primary VS Secondary Data
Primary Data: Data that an investigator collects for the first time for a particular purpose.
Secondary Data: Data obtained from a source that originally collected them.
➢ Categorical VS Binary VS Ordinal Data
• Categorical Data: Data that can take on only a specific set of
values representing a set of possible categories.
• Binary/Dichotomous/Boolean Data: A special case of categorical
data with just two categories of values (0/1, true/false).
• Ordinal Data: Categorical data that has an explicit ordering.
Rectangular Data
• The typical frame of reference for an analysis in data science is
a rectangular data object, like a spreadsheet or database table.
• Rectangular data (like a spreadsheet) is the basic data structure
for statistical and machine learning models.
• Rectangular data is essentially a two-dimensional matrix with
rows indicating records (cases) and columns indicating features
(variables).
Sales volume generated by salesmen
          Salesmen
Region    S1   S2   S3   S4
East      24   30   26   23
West      22   32   27   25
North     23   28   25   22
South     32   31   32   34
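A minimal sketch of the sales table as a rectangular data object, assuming plain Python with no external libraries. The names `salesmen`, `sales`, `region_totals`, and `salesman_totals` are illustrative and do not come from the slides.

```python
# Sales volume table from the slide: one row per region, one column per salesman.
salesmen = ["S1", "S2", "S3", "S4"]
sales = {
    "East":  [24, 30, 26, 23],
    "West":  [22, 32, 27, 25],
    "North": [23, 28, 25, 22],
    "South": [32, 31, 32, 34],
}

# Row totals: sales volume per region.
region_totals = {region: sum(row) for region, row in sales.items()}

# Column totals: sales volume per salesman, summed across regions.
salesman_totals = {
    name: sum(row[i] for row in sales.values())
    for i, name in enumerate(salesmen)
}

print(region_totals)
print(salesman_totals)
```

Each row is one record (a region) and each column is one variable (a salesman), which is what makes row-wise and column-wise summaries straightforward.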
Non-rectangular data structures
• Time series data records successive measurements of the same variable.
It is the raw material for statistical forecasting methods.
• Spatial data are used in mapping and location analytics. They are
more complex and varied than rectangular data structures.
Data Preparation and handling
• Data cleaning is the process of detecting and then correcting or removing corrupt or incomplete records from a record set. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the unsuitable data.
• Data screening is the process of ensuring that the researcher's data are clean and ready for statistical analysis.
• Data editing is the inspection and correction of the data received from
each element of the sample.
Data Cleaning
• Under data cleaning, a researcher generally focuses on these three aspects:
❑ Missing Data: Information not available for a case about whom other information is available. It occurs when a respondent fails to answer some questions in a survey.
❑ Outliers: Outliers are observations with a unique combination of characteristics identifiable as distinctly different from the other observations.
❑ Normality: Normality is the degree to which the distribution of the sample data corresponds to a normal distribution.
Four-step process for identifying
missing data and applying remedies
• The researcher must ascertain whether the missing data process occurs in a completely random manner. When the data set is small, the researcher may be able to see such patterns visually or perform a set of simple calculations.
• However, as sample size increases, so does the need for empirical diagnostic tests. Some statistical programs add techniques specifically designed for missing data analysis, e.g., Missing Value Analysis in SPSS, which generally include one or both of the following diagnostic tests.
• The first approach assesses the missing data process of a single variable Y by forming
two groups: observations with missing data for Y and those with valid values of Y.
Statistical tests are then performed to determine whether significant differences exist
between the two groups on other variables of interest. Significant differences indicate the
possibility of a nonrandom missing data process.
• The second approach is an overall test of randomness that determines whether the
missing data can be classified as MCAR. This test analyses the pattern of missing data on
all available variables and compares it with the pattern expected for a random missing
data process. If no significant differences are found, the missing data can be classified as
MCAR. If significant differences are found, the researcher must use the approaches described above to
identify the specific missing data processes that are nonrandom.
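The first approach above can be sketched as follows. The slides only say "statistical tests"; Welch's t statistic is used here as one possible choice, and the records, variable names (`income`, `satisfaction`), and the helper `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical records: (income, satisfaction); None marks a missing income value.
records = [
    (52, 7), (48, 6), (None, 3), (61, 8), (None, 2),
    (55, 7), (None, 4), (47, 6), (58, 8), (None, 3),
]

# Split the other variable (satisfaction) by whether Y (income) is missing.
missing_group = [s for y, s in records if y is None]
valid_group = [s for y, s in records if y is not None]

def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t(missing_group, valid_group)
print(len(missing_group), len(valid_group), round(t_stat, 2))
```

A t statistic far from zero suggests that cases with missing Y differ on the other variable, which points to a nonrandom missing data process. Turning the statistic into a significance level requires the t distribution, which a statistical package would normally supply.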
Step 4: Select the imputation method
• At this step of the process, the researcher must select the approach used for
accommodating missing data in the analysis. This decision is based on
whether the missing data are missing at random (MAR) or missing
completely at random (MCAR).
• Imputation is the process of estimating the missing value based on valid
values of other variables and/or cases in the sample. The researcher has
several options for imputation.
• Imputation is generally recommended for metric data and avoided for
non-metric data.
Imputation or missing data process
In this approach, the researcher substitutes a value from another source for
the missing values.
• In the “hot deck” method, the value comes from another observation in the
sample that is deemed similar. Each observation with missing data is paired with
another case that is similar on a variable specified by the researcher. Then
missing data are replaced with valid values from the similar observation.
• In the “cold deck” method, the replacement value is derived from an external
source (e.g., prior studies or other samples). Here, the researcher must be sure that
the replacement value from an external source is more valid than an internally
generated value.
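A minimal sketch of hot deck imputation, assuming that "similar" means sharing the same value on one matching variable and that the first complete case found acts as the donor. The sample records and the helper `hot_deck` are hypothetical.

```python
# Hypothetical sample: each case has a region (matching variable) and a sales figure.
cases = [
    {"id": 1, "region": "East", "sales": 24},
    {"id": 2, "region": "West", "sales": None},   # missing value
    {"id": 3, "region": "West", "sales": 22},
    {"id": 4, "region": "East", "sales": None},   # missing value
    {"id": 5, "region": "South", "sales": 32},
]

def hot_deck(cases, match_on, target):
    """Fill missing `target` values from the first complete case that
    shares the same `match_on` value. Unmatched cases stay missing."""
    donors = {}
    for case in cases:
        if case[target] is not None:
            donors.setdefault(case[match_on], case[target])
    imputed = []
    for case in cases:
        filled = dict(case)                      # copy, leave the input untouched
        if filled[target] is None:
            filled[target] = donors.get(filled[match_on])
        imputed.append(filled)
    return imputed

imputed = hot_deck(cases, match_on="region", target="sales")
print([c["sales"] for c in imputed])
```

A cold deck version would differ only in where `donors` comes from: an external source such as a prior study instead of the sample itself.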
Case substitution
▪ If there are three items 4, 6, and 9, then their geometric mean, which is generally denoted by G, can be computed as:
$$G = \sqrt[3]{4 \times 6 \times 9} = \sqrt[3]{216} = 6$$
Computation of Geometric Mean for
Individual Series
Harmonic Mean
• The harmonic mean of a series is the reciprocal of the arithmetic mean of the reciprocals of the values of the variate. For n values x_1, x_2, ..., x_n, the harmonic mean by definition is given by:
$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
Computation of Harmonic Mean for
Individual Series
Relationship between Arithmetic mean (AM),
Geometric Mean (GM) and Harmonic Mean (HM)
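For any set of positive values, AM ≥ GM ≥ HM, with equality only when all values are equal. A minimal check of this ordering, using the items 4, 6, and 9 from the geometric mean example and the Python standard library (`statistics.geometric_mean` needs Python 3.8 or later):

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [4, 6, 9]

am = mean(values)              # (4 + 6 + 9) / 3
gm = geometric_mean(values)    # cube root of 4 * 6 * 9
hm = harmonic_mean(values)     # 3 / (1/4 + 1/6 + 1/9)

print(am, gm, hm)
print(am >= gm >= hm)
```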
Positional Averages
▪ Arithmetic mean, geometric mean, and harmonic mean are all mathematical in nature and are measures of quantitative characteristics of data.
▪ To measure the qualitative characteristics of data, other measures of central tendency, namely median and mode are used.
▪ Positional averages, as the name indicates, mainly focus on the position of the value of an observation in the data set.
Median
▪ The median may be defined as the middle or central value of the variable when values are arranged in the order of magnitude.
▪ In other words, median is defined as that value of the variable that divides the group into two equal parts, one part comprising all values greater than the median and the other all values less than the median.
Computation of Median for the Individual
Series
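• In this type of distribution, data can be arranged in ascending or descending order. If there are n terms (observations) in the data, there can be two cases:
▪ n is odd: the median is the value of the ((n + 1)/2)th term.
▪ n is even: the median is the arithmetic mean of the (n/2)th and (n/2 + 1)th terms.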
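The two cases are usually taken to be odd n and even n. A minimal sketch under that assumption; the function name `median_individual` and the sample values are illustrative.

```python
def median_individual(values):
    """Median of an individual series using the odd/even positional rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th term (1-based) is the median.
        return ordered[mid]
    # Even n: average the (n / 2)-th and (n / 2 + 1)-th terms.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_individual([7, 3, 9, 5, 1]))        # odd number of terms
print(median_individual([7, 3, 9, 5, 1, 11]))    # even number of terms
```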
Mode
▪ Mode is the variate having the maximum frequency in a data series.
▪ In the case of an individual series, data is arranged in order and mode can be determined by inspection only.
▪ The value of the variable (in data series) which occurs the most or the value of the data series with maximum frequency is the mode of the data series.
▪ For example, for a series 1, 1, 3, 3, 3, 3, 4, 5, 8, 8, 16, 16 (arranged in the order of magnitude), observation 3 has the maximum frequency 4. Therefore, mode of the series is 3.
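The inspection in the example above can be reproduced by counting frequencies. A minimal sketch using `collections.Counter` from the standard library; the variable names are illustrative.

```python
from collections import Counter

series = [1, 1, 3, 3, 3, 3, 4, 5, 8, 8, 16, 16]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(series)
mode_value, mode_frequency = counts.most_common(1)[0]

print(mode_value, mode_frequency)
```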
Empirical Relationship between Mean, Median
and Mode
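For a moderately skewed (asymmetrical) distribution, the empirical relationship usually attributed to Karl Pearson is commonly written as follows. It is an approximation, not an identity, and is stated here on the assumption that it is the relationship this heading refers to.

```latex
\text{Mode} = 3\,\text{Median} - 2\,\text{Mean}
\quad\Longleftrightarrow\quad
\text{Mean} - \text{Mode} = 3\,(\text{Mean} - \text{Median})
```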
Partition Values: Quartiles, Deciles, and
Percentiles
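▪ Partition values are measures that divide the data into several equal parts. Quartiles divide data into 4 equal parts, deciles divide data into 10 equal parts, and percentiles divide data into 100 equal parts.
▪ For an individual series of n observations arranged in ascending order, the first and third quartiles can be computed using the following formula:
$$Q_1 = \text{value of the } \left(\frac{n+1}{4}\right)\text{th item}, \qquad Q_3 = \text{value of the } \left(\frac{3(n+1)}{4}\right)\text{th item}$$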
• In a data series, when the observations are arranged in an ordered sequence, deciles divide the data into 10 equal parts. In the case of individual series and discrete frequency distribution, the generalized formula for computing deciles is given as:
$$D_k = \text{value of the } \left(\frac{k(n+1)}{10}\right)\text{th item}, \qquad k = 1, 2, \ldots, 9$$
• In a data series, when observations are arranged in an ordered sequence, percentiles divide the data into 100 equal parts. For an individual series and a discrete frequency distribution, the generalized formula for computing percentiles is given as:
$$P_k = \text{value of the } \left(\frac{k(n+1)}{100}\right)\text{th item}, \qquad k = 1, 2, \ldots, 99$$
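A minimal sketch of partition values for an individual series, assuming the common textbook convention that the k-th partition value sits at position k(n + 1)/parts in the ordered data, with linear interpolation when the position is fractional. The function `partition_value` and the sample data are illustrative.

```python
def partition_value(values, k, parts):
    """k-th partition value when the data are split into `parts` equal parts,
    using the position k * (n + 1) / parts with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / parts          # 1-based, may be fractional
    lower = int(position)                   # whole part of the position
    fraction = position - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    # Interpolate between the lower-th and (lower + 1)-th ordered items.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [12, 15, 18, 20, 22, 25, 28]         # illustrative values, n = 7

q1 = partition_value(data, 1, 4)            # first quartile
q3 = partition_value(data, 3, 4)            # third quartile
d5 = partition_value(data, 5, 10)           # fifth decile
p50 = partition_value(data, 50, 100)        # fiftieth percentile
print(q1, q3, d5, p50)
```

The fifth decile and the fiftieth percentile both coincide with the median. Other software may use a different positioning rule, so quartiles from different tools can differ slightly on small samples.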
Measures of Dispersion
▪ The meaning of dispersion is “scatteredness.” The degree to which numerical
data tends to spread around an average value is called variation or dispersion of
data.
Types of Measures of Dispersion
▪ There are two types of measures of dispersion:
1. Absolute measures of dispersion: Absolute measures of dispersion
are presented in the same unit as the unit of distribution.
2. Relative measures of dispersion: Relative measures of dispersion
are useful in comparing two sets of data which have different units of
measurement.
▪ Relative measures of dispersion are pure unitless numbers and are generally called
coefficient of dispersion.
Methods of Measuring Dispersion
The following are some of the important and widely used
methods of measuring dispersion:
▪ Range
▪ Interquartile range and quartile deviation
▪ Average absolute deviation
▪ Standard deviation
• Range
▪ Range is defined as the difference between the greatest and the smallest values in a distribution.
▪ Range is an absolute measure of dispersion. The relative measure of dispersion for range is called the coefficient of range and is calculated by the following formula, where L is the largest value and S is the smallest value:
$$\text{Coefficient of range} = \frac{L - S}{L + S}$$
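A minimal sketch of the range and the coefficient of range, assuming the usual definition (L − S)/(L + S) with L the largest and S the smallest value. The data are the East-region sales (S1 to S4) from the sales table in this unit.

```python
data = [24, 30, 26, 23]            # East-region sales, S1 to S4

largest, smallest = max(data), min(data)

value_range = largest - smallest                                    # absolute measure
coefficient_of_range = (largest - smallest) / (largest + smallest)  # relative, unitless

print(value_range, round(coefficient_of_range, 4))
```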
• Interquartile range and quartile deviation
▪ Interquartile range is the difference between the third quartile and the first quartile.
▪ Quartile deviation or semi-interquartile range can be obtained by dividing the interquartile range by 2.
▪ Quartile deviation is an absolute measure of dispersion. The corresponding relative measure is called the coefficient of quartile deviation. It can be used to compare the degree of variation in two distributions that have different units of measurement.
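A minimal sketch of these three measures, assuming the usual definition of the coefficient of quartile deviation, (Q3 − Q1)/(Q3 + Q1). The quartile values are illustrative.

```python
q1, q3 = 15, 25                    # illustrative first and third quartiles

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2      # semi-interquartile range
coefficient_of_qd = (q3 - q1) / (q3 + q1)         # relative, unitless

print(interquartile_range, quartile_deviation, coefficient_of_qd)
```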
• Average absolute deviation
Average absolute deviation is the average amount of scatter of the items in a distribution from the mean, the median, or the mode, ignoring the signs of the deviations.
• Average absolute deviation is an absolute measure of dispersion. The corresponding relative measure, known as the coefficient of average absolute deviation, is obtained by the following formula:
$$\text{Coefficient of average absolute deviation} = \frac{\text{Average absolute deviation}}{\text{Mean, median, or mode from which the deviations are taken}}$$
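A minimal sketch of the average absolute deviation and its coefficient, assuming the coefficient divides the deviation by the same average (mean or median here) from which the deviations were taken. The helper name is illustrative; the data are the East-region sales from the sales table in this unit.

```python
from statistics import mean, median

def average_absolute_deviation(values, center):
    """Mean of the absolute deviations of `values` from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

data = [24, 30, 26, 23]            # East-region sales, S1 to S4

aad_mean = average_absolute_deviation(data, mean(data))
aad_median = average_absolute_deviation(data, median(data))

# Relative measures: divide by the average the deviations were taken from.
coefficient_mean = aad_mean / mean(data)
coefficient_median = aad_median / median(data)

print(aad_mean, aad_median)
print(round(coefficient_mean, 4), round(coefficient_median, 4))
```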
Standard Deviation and Variance
• Standard deviation is the square root of the variance. For a sample, the variance is the sum of the squared deviations of the values from their arithmetic mean, divided by the sample size minus one.
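A minimal sketch of the sample variance and standard deviation computed step by step from the definition, then compared with `statistics.variance` and `statistics.stdev`, which also divide by n − 1. The data are the East-region sales from the sales table in this unit.

```python
from math import sqrt
from statistics import mean, stdev, variance

data = [24, 30, 26, 23]            # East-region sales, S1 to S4

m = mean(data)
squared_deviations = [(x - m) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by (n - 1).
sample_variance = sum(squared_deviations) / (len(data) - 1)
sample_sd = sqrt(sample_variance)

print(round(sample_variance, 4), round(sample_sd, 4))
print(round(variance(data), 4), round(stdev(data), 4))   # library equivalents
```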