
Marketing Analytics

Unit-2
Descriptive Analytics

By: Deependra Singh, Assistant Professor, School of Management, The NorthCap University, Gurugram
Descriptive Analytics
• Descriptive analytics summarizes data into meaningful charts
and reports, for example, about budgets, sales, revenues, or costs.
• Descriptive analytics is a set of techniques used to explain or
quantify the past.
• Several examples of descriptive analytics include data queries,
visual reports, and descriptive statistics.
Understanding Data
• Data can be defined as a systematic record of a particular quantity. It is the
different values of that quantity represented together in a set.
• It is a collection of facts and figures to be used for a specific purpose such
as a survey or analysis.
• When arranged in an organized form, data can be called information. The
source of data (primary data, secondary data) is also an important factor.
Types of Data

Qualitative VS Quantitative Data
Qualitative Data: They represent some characteristics or attributes. They
depict descriptions that may be observed but cannot be computed or
calculated. They are more exploratory than conclusive in nature.
Quantitative Data: These can be measured and not simply observed. They
can be numerically represented and calculations can be performed on them.
This information is numerical and can be classified as quantitative.

Discrete VS Continuous Data
Discrete Data: Data that can take on only integer values, such as counts.
Continuous Data: Data that can take on any value within an interval, i.e.,
any value between the lowest and highest values of a range.

Primary VS Secondary Data
Primary Data: Data that an investigator collects for the first time for a
particular purpose.
Secondary Data: Data obtained from a source that originally collected them.

Categorical VS Binary VS Ordinal Data
• Categorical Data: Data that can take on only a specific set of
values representing a set of possible categories.
• Binary/Dichotomous/Boolean Data: A special case of categorical
data with just two categories of values (0/1, true/false).
• Ordinal Data: Categorical data that has an explicit ordering.
Rectangular Data
• The typical frame of reference for an analysis in data science is
a rectangular data object, like a spreadsheet or database table.
• Rectangular data (like a spreadsheet) is the basic data structure
for statistical and machine learning models.
• Rectangular data is essentially a two-dimensional matrix with
rows indicating records (cases) and columns indicating features
(variables).
Sales volume generated by salesmen (S1–S4) by region

Region    S1    S2    S3    S4
East      24    30    26    23
West      22    32    27    25
North     23    28    25    22
South     32    31    32    34
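• As an illustration, the sales table above can be held as rectangular data using pandas (an assumed tool here, not part of the source); a minimal sketch:

import pandas as pd

# Rows are records (regions); columns are features (salesmen S1-S4).
sales = pd.DataFrame(
    {"S1": [24, 22, 23, 32],
     "S2": [30, 32, 28, 31],
     "S3": [26, 27, 25, 32],
     "S4": [23, 25, 22, 34]},
    index=["East", "West", "North", "South"],
)
print(sales)
print(sales.describe())  # quick descriptive summary of each column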
Non-rectangular data structures
• Time series data records successive measurements of the same variable.
It is the raw material for statistical forecasting methods.
• Spatial data are used in mapping and location analytics. They are more
complex and varied than the rectangular data structure.
Data Preparation and handling
• Data cleaning is the process of detecting and correcting or removing
corrupt or incomplete records from a record set; it refers to identifying
incomplete, incorrect, inaccurate, or irrelevant parts of the data and then
replacing, modifying, or deleting the unsuitable data.
• Data screening is the process of ensuring that the researcher’s data are
clean and ready for statistical analyses.
• Data editing is the inspection and correction of the data received from
each element of the sample.
Data Cleaning
• Under data cleaning, a researcher generally focuses on these three aspects:

Missing Data: Information not available for a case about whom other
information is available. It occurs when a respondent fails to answer some
questions in a survey.

Outliers: Outliers are observations with a unique combination of
characteristics identifiable as distinctly different from the other observations.

Normality: Normality is the degree to which the distribution of the sample
data corresponds to a normal distribution.
Four-step process for identifying
missing data and applying remedies

Source: Hair et al. (2014)


Step 1: Determine the type of missing data
• The key question is whether the missing data are part of the research design
and under the control of the researcher, or whether the causes and impacts
are truly unknown.
Step 2: Determine the extent of missing data
• Given that some of the missing data are not ignorable, the researcher must next
examine the pattern of the missing data and determine the extent of the missing
data for individual variables, individual cases, and even overall.
• If it is sufficiently low, then any of the approaches for remedying missing data
may be applied. If missing data level is not low enough, then we must first
determine the randomness of the missing data process before selecting a remedy.
• The unresolved issue at this step is the question: what is low enough? In making
the assessment of the extent of missing data, the researcher may find that the
deletion of cases and/or variables will reduce the missing data to levels that are low
enough to allow for remedies without concern for creating biases in the results.
Step 3: Diagnose the randomness of the
missing data processes
• Having determined that the extent of missing data is substantial enough to
warrant action, the next step is to determine the degree of randomness
present in the missing data, which then determines the appropriate
remedies available.
Level of randomness of the missing data processes

1. Missing at random (MAR)


• Missing at random means the propensity for a data point to be
missing is not related to the missing data, but it is related to some of
the observed data.
• It is also called Missing Conditionally at Random, because
missingness is conditional on another variable.
• Here, missing values of Y depend on X, but not on Y.
For example, assume that we know the gender of respondents (the X variable) and are
asking about household income (the Y variable). We find that the missing data are random
for both males and females but occur at a much higher frequency for males than females.
Even though the missing data process is operating in a random manner within the gender
variable, any remedy applied to the missing data will still reflect the missing data process
because gender affects the ultimate distribution of the household income values.

2. Missing completely at random (MCAR)


Under MCAR, the propensity for a data point to be missing is completely random.
There is no relationship between whether a data point is missing and any values
in the data set, missing or observed.
Diagnostic tests for level of randomness

• The researcher must ascertain whether the missing data process occurs in a
completely random manner. When the data set is small, the researcher
may be able to see the pattern visually or perform a set of simple
calculations.
• However, as sample size increases, so does the need for empirical
diagnostic tests. Some statistical programs add techniques specifically
designed for missing data analysis, e.g., Missing Value Analysis in SPSS,
which generally include one or both of the diagnostic tests described below.
• The first approach assesses the missing data process of a single variable Y by forming
two groups: observations with missing data for Y and those with valid values of Y.
Statistical tests are then performed to determine whether significant differences exist
between the two groups on other variables of interest. Significant differences indicate the
possibility of a non random missing data process.
• The second approach is an overall test of randomness that determines whether the
missing data can be classified as MCAR. This test analyses the pattern of missing data on
all available variables and compares it with the pattern expected for a random missing
data process. If no significant differences are found, the missing data can be classified as
MCAR. If significant differences are found, the researcher must use the approaches to
identify the specific missing data processes that are nonrandom.
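• A minimal sketch of the first diagnostic approach above, assuming pandas and scipy; the variable names 'income' (the variable with missing data, Y) and 'age' (another observed variable) and the data are illustrative, and a t-test stands in for the statistical tests mentioned:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age":    [25, 31, 47, 52, 29, 60, 44, 38, 33, 55],
    "income": [40, 48, np.nan, 75, 42, np.nan, 65, np.nan, 50, 80],
})

# Form two groups: cases with missing income vs. cases with valid income.
missing = df["income"].isna()
t_stat, p_value = stats.ttest_ind(df.loc[missing, "age"],
                                  df.loc[~missing, "age"],
                                  equal_var=False)
# A significant difference on 'age' between the two groups would suggest
# that the missing data process for 'income' is not completely random.
print(t_stat, p_value)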
Step 4: Select the imputation method
• At this step of process, the researcher must select the approach used for
accommodating missing data in the analysis. This decision is based on
whether the missing data are missing at random (MAR) or missing
completely at random (MCAR).
• Imputation is the process of estimating the missing value based on valid
values of other variables and/or cases in the sample. The researcher has
several options for imputation.
• The imputation process is generally avoided for non-metric data; it is
suggested primarily for metric data.
Imputation of missing data

Imputation using only valid data


• Some researchers may question whether using only valid data is actually a
form of imputation, because no data values are actually replaced.
• The basic intent of this approach is to represent the entire sample with
those observations or cases with valid data.
i. Complete case approach
The simplest and most direct approach for dealing with missing data is to include
only those observations with complete data. This method is known as LISTWISE
method in SPSS.
ii. Using all-available data
The second imputation method using only valid data also does not actually replace
the missing data, but instead imputes the distribution characteristics (e.g., means or
standard deviations) or relationships (e.g., correlations) from every valid value.
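• A minimal sketch contrasting the two approaches above, assuming pandas; the data are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.1, np.nan, 3.3, 4.2, 5.1],
    "z": [1.5, 2.4, 3.1, np.nan, 5.0],
})

# Complete case (listwise): keep only rows with valid data on every variable.
complete_cases = df.dropna()

# All-available approach: each statistic uses every valid value it can
# (pandas ignores NaN per column and computes correlations pairwise).
means_all_available = df.mean()
corr_all_available = df.corr()
print(complete_cases, means_all_available, corr_all_available, sep="\n")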
Imputation by using replacement values
• The second form of imputation involves replacing missing values
with estimated values based on other information available in the
sample.
• These methods can be classified as:
i) using a known value as a replacement, or
ii) calculating a replacement value from other observations.
• Using known replacement values

The common characteristic of these methods is that a known value, most often
from a single observation, is identified and used to replace the missing data.
The observation may be from the sample or even external to the sample.
This includes two techniques:
i) Hot or cold deck imputation
ii) Case substitution
Hot or Cold deck imputation

In this approach, the researcher substitutes a value from another source for
the missing values.
• In the “hot deck” method, the value comes from another observation in the
sample that is deemed similar. Each observation with missing data is paired with
another case that is similar on a variable specified by the researcher. Then
missing data are replaced with valid values from the similar observation.
• In the “cold deck” method, the replacement value is derived from an external
source (e.g. prior studies, other sample, etc.). Here, the researcher must be sure that
the replacement value from an external source is more valid than an internally
generated value.
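• A minimal hot-deck sketch, assuming pandas; pairing cases on 'gender' is one simple way to operationalise "similar on a variable specified by the researcher", and the data are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "income": [40, 55, np.nan, 60, 45, np.nan],
})

def hot_deck(group):
    donors = group.dropna()
    # Replace each missing value with a value drawn from a similar (same-group) case.
    return group.apply(lambda v: donors.sample(1).iloc[0] if pd.isna(v) else v)

df["income_imputed"] = df.groupby("gender")["income"].transform(hot_deck)
print(df)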
Case substitution

• In this method, entire observations with missing data are replaced by
choosing another non-sampled observation.
• An example is to replace a sampled household that cannot be contacted or
has extensive missing data with another household not in the sample,
preferably similar to the original observation.
Calculating replacement values
• The second basic approach involves calculating a replacement value
from a set of observations with valid data in the sample.
• This includes two techniques:
i) Mean substitution
ii) Regression substitution.
• Mean substitution
One of the most widely used methods, mean substitution replaces the
missing values for a variable with the mean value of that variable
calculated from all valid responses.
• Regression Imputation
In this method, regression analysis is used to predict the missing values
of a variable based on its relationship to other variables in the data set.
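• A minimal sketch of the two calculated-replacement methods above, assuming pandas and scikit-learn (the tooling choice is an assumption); the data are illustrative:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.0, 4.1, np.nan, 8.2, np.nan, 12.1],
})

# Mean substitution: replace missing y with the mean of the valid y values.
df["y_mean_sub"] = df["y"].fillna(df["y"].mean())

# Regression imputation: predict missing y from x using the valid cases.
valid = df["y"].notna()
model = LinearRegression().fit(df.loc[valid, ["x"]], df.loc[valid, "y"])
df["y_reg_imp"] = df["y"]
df.loc[~valid, "y_reg_imp"] = model.predict(df.loc[~valid, ["x"]])
print(df)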
Outliers
An observation that is substantially different from the other
observations (i.e., has an extreme value) on one or more
characteristics (variables). The issue is its representativeness of the
population.
Example: assume that we sample 20 individuals to determine the
average household income. In our sample we gather responses that
range between $20,000 and $100,000, so that the average is
$45,000. But assume that the 21st person has an income of $1
million. If we include this value in the analysis, the average income
increases to more than $90,000, since (20 × $45,000 + $1,000,000) / 21 ≈ $90,500.
Obviously, the outlier is a valid case, but what is the better estimate of the average
household income: $45,000 or $90,000? The researcher must assess whether the
outlying value is retained or eliminated due to its undue influence on the results.
In substantive terms, the outlier must be viewed in light of how representative it is
of the population. Again, using our example of household income, how
representative of the more wealthy segment is the millionaire? If the researcher
feels that it is a small, but viable segment in the population, then perhaps the value
should be retained. If, however, this millionaire is the only one in the entire
population and truly far above everyone else (i.e., unique) and represents an
extreme value, then it may be deleted.
Methods of Detecting Outliers

• Outliers can be identified from a univariate, bivariate, or multivariate
perspective based on the number of variables (characteristics) considered.
Univariate Methods of Outlier Detection

Through Standard Score (Z-Score)
• Examines the distribution of observations for each variable in the analysis
and selects as outliers those cases falling at the outer range of the
distribution.
• The typical approach first converts the data values to standard scores,
which have a mean of 0 and a standard deviation of 1. Because the values
are expressed in a standardised format, comparisons across variables are easy.
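• A minimal standard-score sketch, assuming numpy; the data and the cutoff of 2.5 (a threshold commonly suggested for small samples) are illustrative:

import numpy as np

values = np.array([24, 30, 26, 23, 22, 32, 27, 25, 23, 28, 25, 22, 120])
z_scores = (values - values.mean()) / values.std(ddof=1)  # mean 0, std 1
outliers = values[np.abs(z_scores) > 2.5]
print(z_scores.round(2))
print(outliers)  # only the extreme value 120 is flagged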

Through Box-Plot
• To detect outliers on each variable, just produce a box-plot. Outliers will
appear at the extremes, beyond the whiskers, and are typically labelled on the plot.
➢ Through the values of the quartiles and the inter-quartile range
Suppose the data are: 119, 201, 235, 269, 271, 278, 283, 291, 301, 303, 441 (n = 11).
Q1 = [(n + 1)/4]th value = [(11 + 1)/4] = 3rd value = 235
Q3 = [3(n + 1)/4]th value = [3 × (11 + 1)/4] = 9th value = 301
Inter-quartile range (IQR) = Q3 − Q1 = 301 − 235 = 66
Lower limit = Q1 − (1.5 × IQR) = 235 − (1.5 × 66) = 136
Upper limit = Q3 + (1.5 × IQR) = 301 + (1.5 × 66) = 400
Therefore, 119 and 441 are outliers.
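• A minimal sketch of the quartile/IQR rule worked out above, assuming numpy; note that numpy's default percentile interpolation differs slightly from the (n + 1)/4 positional rule, but the same two outliers are flagged here:

import numpy as np

data = np.array([119, 201, 235, 269, 271, 278, 283, 291, 301, 303, 441])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # 119 and 441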
Bivariate Method of Outlier Detection
• Pairs of variables can be assessed jointly through a scatterplot. Cases that
fall markedly outside the range of the other observations will be seen as
isolated points in the scatterplot.
• A drawback of the bivariate method in general:
The potentially large number of scatter plots that arises as the number of
variables increases.
Multivariate Method of Outlier Detection

• Because most multivariate analyses involve more than two variables,
the bivariate methods quickly become inadequate because:
i) they require a large number of graphs, and
ii) they are limited to two dimensions (variables) at a time.
• This issue is addressed by the Mahalanobis D² measure. Higher D²
values represent observations farther removed from the general
distribution of observations in this multidimensional space.
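• A minimal sketch of the Mahalanobis D² idea, assuming numpy; the small data matrix is illustrative, with the last row constructed to be a multivariate outlier:

import numpy as np

X = np.array([[24, 30, 26], [22, 32, 27], [23, 28, 25],
              [32, 31, 32], [25, 29, 26], [60, 10, 70]])
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # D^2 for each observation
print(d2.round(2))  # larger D^2 values indicate multivariate outliers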
Normality
• The most fundamental assumption in multivariate analysis is normality,
referring to the shape of the data distribution for an individual metric
variable and its correspondence to the normal distribution.
• Normal distribution: Purely theoretical continuous probability
distribution in which the horizontal axis represents all possible values of a
variable and the vertical axis represents the probability of those values
occurring. The scores on the variable are clustered around the mean in a
symmetrical, unimodal pattern known as the bell-shaped, or normal curve.
It is also called the Gaussian distribution.
Slicing and Dicing of Data
• In many marketing situations, the researcher needs to slice and dice the data.
• Software such as Excel (through pivot tables) and SPSS (through cross-tabulation)
enables the researcher to quickly summarize and describe the data in many
different ways.
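• A minimal sketch of slicing and dicing, assuming pandas as a stand-in for Excel pivot tables or SPSS cross-tabulation; the survey-style data are illustrative:

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "North", "South"],
    "gender": ["M", "F", "M", "F", "M", "F"],
    "sales":  [24, 30, 22, 32, 23, 31],
})

# Summarise sales by region (one "slice").
print(df.pivot_table(values="sales", index="region", aggfunc="mean"))

# Cross-tabulate region against gender (a "dice" across two dimensions).
print(pd.crosstab(df["region"], df["gender"]))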
Data Visualisation
• Data visualization is the process of translating large data sets and metrics
into charts, graphs and other visuals.
• The resulting visual representation of data makes it easier to identify and
share real-time trends, outliers, and new insights about the information
represented in the data.
• In the world of Big Data, data visualization tools and technologies are
essential to analyze massive amounts of information and make data-driven
decisions.
Common general types of data
visualization
• Charts
• Tables
• Graphs
• Maps
• Infographics
• Dashboards
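• As a brief illustration, a bar chart of the regional sales totals from the table earlier in this unit could be produced with matplotlib (one possible tool among many; the library choice is an assumption):

import matplotlib.pyplot as plt

regions = ["East", "West", "North", "South"]
total_sales = [103, 106, 98, 129]  # row totals from the salesmen table above
plt.bar(regions, total_sales)
plt.title("Sales volume by region")
plt.ylabel("Sales volume")
plt.show()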
Descriptive Statistics
• Descriptive statistics is the process of describing data and trying to reach
a conclusion based on it.
• Descriptive statistics includes two concepts: measures of central
tendency and measures of dispersion.
Measures of Central Tendency
1. Mathematical averages
(a) Arithmetic mean (simple and weighted)
(b) Geometric mean
(c) Harmonic mean

2. Positional averages
(a) Median
(b) Mode
(c) Quartiles
(d) Deciles
(e) Percentiles
Arithmetic Mean

The arithmetic mean (AM) of a set of observations is their sum divided by the number of
observations: for n observations x1, x2, …, xn, AM = (x1 + x2 + … + xn) / n = Σxi / n.

It is generally denoted by x̄ or AM. The population mean is denoted by μ.

Arithmetic mean is of two types:
▪ Simple arithmetic mean
▪ Weighted arithmetic mean

• Computation of arithmetic mean for a discrete frequency distribution:
x̄ = Σ(fi · xi) / Σfi, where fi is the frequency of value xi.

• Calculating the weighted mean:
x̄w = Σ(wi · xi) / Σwi, where wi is the weight assigned to value xi.
Geometric Mean

Geometric mean (GM) is the nth root of the product of the n items of a series:
G = (x1 · x2 · … · xn)^(1/n).

If there are three items 4, 6, and 9, then their geometric mean, which is generally
denoted by G, can be computed as G = (4 × 6 × 9)^(1/3) = (216)^(1/3) = 6.

Computation of Geometric Mean for Individual Series
For an individual series, GM = (x1 · x2 · … · xn)^(1/n), which can equivalently be
computed as antilog[(Σ log xi) / n].
Harmonic Mean
• The harmonic mean of any series is the reciprocal of the arithmetic
mean of the reciprocals of the variate; that is, by definition, HM = n / Σ(1/xi).
Computation of Harmonic Mean for Individual Series
For an individual series with observations x1, x2, …, xn,
HM = n / (1/x1 + 1/x2 + … + 1/xn).
Relationship between Arithmetic Mean (AM),
Geometric Mean (GM) and Harmonic Mean (HM)
For any set of positive observations, AM ≥ GM ≥ HM, with equality only when all
observations are equal. For two observations, GM² = AM × HM.
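• A minimal sketch computing the three mathematical averages for the items 4, 6, and 9 used in the geometric-mean example, assuming Python's statistics module; it also illustrates the ordering AM ≥ GM ≥ HM:

import statistics

items = [4, 6, 9]
am = statistics.mean(items)            # (4 + 6 + 9) / 3 ≈ 6.33
gm = statistics.geometric_mean(items)  # (4 * 6 * 9) ** (1/3) = 6.0
hm = statistics.harmonic_mean(items)   # 3 / (1/4 + 1/6 + 1/9) ≈ 5.68
print(am, gm, hm)                      # AM >= GM >= HM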
Positional Averages

Arithmetic mean, geometric mean, and harmonic mean are all mathematical in nature
and are measures of quantitative characteristics of data.

To measure the qualitative characteristics of data, other measures of central
tendency, namely median and mode are used.

Positional averages, as the name indicates, mainly focus on the position of the value
of an observation in the data set.
Median

The median may be defined as the middle or central value of the variable when
values are arranged in the order of magnitude.


In other words, median is defined as that value of the variable that divides the
group into two equal parts, one part comprising all values greater and the other all
values lesser than the median.
Computation of Median for the Individual
Series
• In this type of distribution, data can be arranged in ascending or
descending order. If there are n terms (observations) in the data, there can
be two cases:
i) If n is odd, the median is the [(n + 1)/2]th observation.
ii) If n is even, the median is the arithmetic mean of the (n/2)th and [(n/2) + 1]th observations.
Mode
Mode is the variate having the maximum frequency in a data series.

In the case of an individual series, data is arranged in order and mode can be determined by inspection
only.

The value of the variable (in data series) which occurs the most or the value of the data series with
maximum frequency is the mode of the data series.

For example, for a series 1, 1, 3, 3, 3, 3, 4, 5, 8, 8, 16, 16 (arranged in the order of magnitude),
observation 3 has the maximum frequency 4. Therefore, mode of the series is 3.
Empirical Relationship between Mean, Median and Mode
For a moderately skewed distribution, the three measures are approximately
related by: Mode ≈ 3 Median − 2 Mean.
Partition Values: Quartiles, Deciles, and
Percentiles

Partition values are measures that divide the data into several equal parts. Quartiles
divide data into 4 equal parts, deciles divide data into 10 equal parts, and percentiles
divide data into 100 equal parts.


For an individual series, with the observations arranged in ascending order, the first
and third quartiles can be computed using the following formulas:
Q1 = value of the [(n + 1)/4]th observation, and Q3 = value of the [3(n + 1)/4]th observation.
• In a data series, when the observations are arranged in an ordered
sequence, deciles divide the data into 10 equal parts. In the case of an
individual series and a discrete frequency distribution, the generalized
formula for computing deciles is:
Di = value of the [i(n + 1)/10]th observation, for i = 1, 2, …, 9.
• In a data series, when observations are arranged in an ordered sequence,
percentiles divide the data into 100 equal parts. For an individual series
and a discrete frequency distribution, the generalized formula for
computing percentiles is:
Pi = value of the [i(n + 1)/100]th observation, for i = 1, 2, …, 99.
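• A minimal sketch of the positional averages and partition values, assuming numpy and the statistics module; note that numpy's default percentile interpolation may differ slightly from the (n + 1) positional formulas above:

import numpy as np
import statistics

data = [119, 201, 235, 269, 271, 278, 283, 291, 301, 303, 441]

print(statistics.median(data))                    # middle value = 278
print(statistics.mode([1, 1, 3, 3, 3, 3, 4, 5]))  # most frequent value = 3
print(np.percentile(data, [25, 50, 75]))          # quartiles Q1, Q2, Q3
print(np.percentile(data, [10, 20, 30]))          # first few deciles
print(np.percentile(data, 90))                    # 90th percentile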
Measures of Dispersion
▪ The meaning of dispersion is “scatteredness.” The degree to which numerical
data tends to spread around an average value is called variation or dispersion of
data.
Types of Measures of Dispersion
▪ There are two types of measures of dispersion:
1. Absolute measures of dispersion: Absolute measures of dispersion
are presented in the same unit as the unit of distribution.
2. Relative measures of dispersion: Relative measures of dispersion
are useful in comparing two sets of data which have different units of
measurement.
▪ Relative measures of dispersion are pure unitless numbers and are generally called
coefficient of dispersion.
Methods of Measuring Dispersion
The following are some of the important and widely used
methods of measuring dispersion:
▪ Range
▪ Interquartile range and quartile deviation
▪ Average absolute deviation
▪ Standard deviation
• Range

Range is defined as the difference between the smallest and the greatest values in a
distribution.

Range is an absolute measure of dispersion. The relative measure of dispersion
for range is called the coefficient of range and is calculated as:
Coefficient of range = (L − S) / (L + S), where L is the largest value and S is the
smallest value in the distribution.
• Interquartile range and quartile deviation
Interquartile range is the difference between the third quartile and the first quartile.

Quartile deviation or semi-interquartile range can be obtained by dividing the interquartile
range by 2.

Quartile deviation is an absolute measure of dispersion. Relative measure is
called the coefficient of quartile deviation. Coefficient of quartile deviation can be
used to measure the degree of variation in two different distributions when both
have different units of measurement.
• Average absolute deviation
Average absolute deviation is the average amount of scatter of the items in a
distribution, from either the mean or the median or the mode, ignoring the
signs of deviations.
• Average absolute deviation is an absolute measure of dispersion. The
corresponding relative measure, the coefficient of average absolute
deviation, is obtained by dividing the average absolute deviation by the
mean (or median) from which the deviations were measured.
Standard Deviation and Variance
• Standard deviation is the square root of the sum of squared deviations of the
various values from their arithmetic mean divided by the sample size minus one:
s = √[Σ(xi − x̄)² / (n − 1)].
• Variance is the square of the standard deviation. Sample variance is the sum of
squared deviations of the various values from their arithmetic mean divided by the
sample size minus one: s² = Σ(xi − x̄)² / (n − 1).
• For the population standard deviation and variance, the divisor is N instead of
n − 1, and deviations are taken from the population mean μ.
Coefficient of Variation

To compare the dispersion of two distributions, the relative measure of the standard
deviation is used and is referred to as the coefficient of variation (CV):
CV = (s / x̄) × 100.

A distribution with a lower CV shows greater consistency, homogeneity, and
uniformity, whereas a distribution with a higher CV is considered more variable
than others.
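• A minimal sketch of the dispersion measures above, assuming numpy; sample (n − 1) formulas are used for the standard deviation and variance, and the data are illustrative:

import numpy as np

data = np.array([24, 30, 26, 23, 22, 32, 27, 25])

data_range = data.max() - data.min()
q1, q3 = np.percentile(data, [25, 75])
quartile_deviation = (q3 - q1) / 2
avg_abs_deviation = np.mean(np.abs(data - data.mean()))
std_dev = data.std(ddof=1)        # sample standard deviation
variance = data.var(ddof=1)       # sample variance
cv = std_dev / data.mean() * 100  # coefficient of variation (%)
print(data_range, q3 - q1, quartile_deviation, avg_abs_deviation,
      std_dev, variance, cv)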
Measures of Association

Measures of association are statistics for measuring the strength of relationship between two
variables.

Correlation measures the degree of association between two variables.

Karl Pearson’s coefficient of correlation is a quantitative measure of the degree of
relationship between two variables. Suppose these variables are x and y; then Karl
Pearson’s coefficient of correlation is calculated as:
r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² · Σ(yi − ȳ)²]
• The coefficient of correlation lies between −1 and +1.
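• A minimal sketch of Karl Pearson’s coefficient of correlation, assuming numpy; x and y are illustrative paired observations:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

r = np.corrcoef(x, y)[0, 1]  # Pearson's r, always between -1 and +1
print(round(r, 3))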
Empirical rule
• For a distribution that is approximately normal, about 68% of the observations lie
within one standard deviation of the mean, about 95% within two standard
deviations, and about 99.7% within three standard deviations.

Figure: Area under the normal curve
Measures of Shape
• Measures of shape are the tools used for describing the shape of
a distribution of the data. There are two measures of shape:
skewness and kurtosis.
• A distribution of data where the right half is the mirror image of
the left half is said to be symmetrical. If the distribution is not
symmetrical, it is said to be asymmetrical or skewed.
Figure 4.12 : (a) Left skewed distribution, (b) right skewed distribution, and (c)
symmetrical distribution
Coefficient of Skewness
▪ Karl Pearson developed a method for measuring skewness, referred to as the
Pearsonian coefficient of skewness. This coefficient compares the mean and the
mode, and the difference is divided by the standard deviation:
Pearsonian coefficient of skewness, Sk = (Mean − Mode) / Standard deviation.
Kurtosis
Kurtosis measures the amount of peakedness of a distribution. A
flatter distribution than a normal distribution is called platykurtic.
A more peaked distribution than the normal distribution is
referred to as leptokurtic. Between these two lies a distribution that
matches the normal curve in shape, referred to as the mesokurtic
distribution.
Figure : (a) Leptokurtic distribution, (b) Platykurtic
distribution, and (c) Mesokurtic distribution
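• A minimal sketch of the two measures of shape, assuming scipy; note that scipy reports "excess" kurtosis by default (0 for a normal curve):

import numpy as np
from scipy import stats

data = np.array([20, 22, 23, 24, 25, 25, 26, 27, 30, 45])

print(stats.skew(data))      # > 0 indicates a right-skewed distribution
print(stats.kurtosis(data))  # > 0 leptokurtic, < 0 platykurtic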
References
• Bajpai, N. (2017). Business Research Methods (2nd ed.). Pearson Education.
• Hair Jr., J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014).
Multivariate Data Analysis (7th ed.). USA: Pearson-Prentice Hall.
