
Marketing Analytics

Unit-2
Descriptive Analytics

By: Deependra Singh, Assistant Professor, School of Management, The NorthCap University, Gurugram
Descriptive Analytics
• Descriptive analytics summarizes data into meaningful charts and
reports, for example, about budgets, sales, revenues, or costs.
• Descriptive analytics is a set of techniques used to explain or
quantify the past.
• Examples of descriptive analytics include data queries, visual
reports, and descriptive statistics.
Understanding Data
• Data can be defined as a systematic record of a particular quantity. It is the
different values of that quantity represented together in a set.
• It is a collection of facts and figures to be used for a specific purpose such as
a survey or analysis.
• When arranged in an organized form, data can be called information. The
source of data (primary or secondary) is also an important factor.
Types of Data
➢Qualitative VS Quantitative Data
Qualitative Data: These represent characteristics or attributes. They
depict descriptions that can be observed but cannot be computed or
calculated, and they are more exploratory than conclusive in nature.
Quantitative Data: These can be measured, not simply observed. They can
be represented numerically, and calculations can be performed on them.
Because this information is numerical, it is classified as quantitative.
➢Discrete VS Continuous Data
Discrete Data: Data that can take on only specific, separate values, such as
integer counts.
Continuous Data: Data that can take on any value in an interval, that is, any
value between the lowest and highest values of a range.
➢Primary VS Secondary Data
Primary Data: Data that an investigator collects for the first time for a
particular purpose.
Secondary Data: Data obtained from a source that originally collected it.
➢Categorical VS Binary VS Ordinal Data
• Categorical Data: Data that can take on only a specific set of values
representing a set of possible categories.
• Binary/Dichotomous/Boolean Data: A special case of categorical data
with just two categories of values (0/1, true/false).
• Ordinal Data: Categorical data that has an explicit ordering.
Rectangular Data
• The typical frame of reference for an analysis in data science is a
rectangular data object, like a spreadsheet or database table.
• Rectangular data (like a spreadsheet) is the basic data structure for
statistical and machine learning models.
• Rectangular data is essentially a two-dimensional matrix with rows
indicating records (cases) and columns indicating features
(variables).
Sales volume generated by salesmen (rows: regions; columns: salesmen S1–S4)

Region    S1    S2    S3    S4
East      24    30    26    23
West      22    32    27    25
North     23    28    25    22
South     32    31    32    34
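As a rough illustration (not taken from the slides), the same rectangular structure can be held in a pandas DataFrame, with rows as records (regions) and columns as features (salesmen):

```python
# Minimal sketch: the sales-volume table above as a rectangular data object.
import pandas as pd

sales = pd.DataFrame(
    {"S1": [24, 22, 23, 32],
     "S2": [30, 32, 28, 31],
     "S3": [26, 27, 25, 32],
     "S4": [23, 25, 22, 34]},
    index=["East", "West", "North", "South"],
)
sales.index.name = "Region"

print(sales.shape)        # (4, 4): rows are records (cases), columns are features
print(sales.describe())   # quick descriptive summary of each column
```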
Non-rectangular data structures
• Time series data records successive measurements of the same variable. It
is the raw material for statistical forecasting methods.
• Spatial data are used in mapping and location analytics. They are relatively
more complex and varied than rectangular data structures.
Data Preparation and handling
• Data cleaning is the process of detecting and correcting or removing
corrupt or incomplete records from a record set. It refers to identifying
incomplete, incorrect, inaccurate, or irrelevant parts of the data and then
replacing, modifying, or deleting the unsuitable data.
• Data screening is the process of ensuring that researcher’s data is clean and
ready to go for statistical analyses.
• Data editing is the inspection and correction of the data received from each
element of the sample.
Data Cleaning
• Under data cleaning, a researcher generally focuses on these three aspects:
❑Missing Data: Information not available for a case about whom other
information is available. It occurs when a respondent fails to answer some
questions in a survey.
❑Outliers: Outliers are observations with a unique combination of
characteristics identifiable as distinctly different from the other observations.
❑Normality: Normality is the degree to which the distribution of the sample
data corresponds to a normal distribution.
Four-step process for identifying missing
data and applying remedies

Source: Hair et al. (2014)


Step 1: Determine the type of missing data
• Determine the type of missing data.
• Is the missing data part of the research design and under the control of the
researcher, or are the causes and impacts truly unknown?
Step 2: Determine the extent of missing data
• Given that some of the missing data are not ignorable, the researcher must next
examine the pattern of the missing data and determine the extent of the missing
data for individual variables, individual cases, and even overall.
• If it is sufficiently low, then any of the approaches for remedying missing data may
be applied. If missing data level is not low enough, then we must first determine the
randomness of the missing data process before selecting a remedy.
• The unresolved issue at this step is the question: what is low enough? In making
the assessment as to the extent of missing data, the researcher may find that the
deletion of cases and/or variables will reduce the missing data to levels that are low
enough to allow for remedies without concern for creating biases in the results.
Step 3: Diagnose the randomness of the
missing data processes
• Having determined that the extent of missing data is substantial enough to
warrant action, the next step is to determine the degree of randomness
present in the missing data, which then determines the appropriate remedies
available.
Level of randomness of the missing data processes

1. Missing at random (MAR)


• Missing at random means the propensity for a data point to be missing
is not related to the missing data, but it is related to some of the
observed data.
• It is also called Missing Conditionally at Random, because missingness
is conditional on another variable.
• Here, missing values of Y depend on X, but not on Y.
For example, assume that we know the gender of respondents (the X variable) and are
asking about household income (the Y variable). We find that the missing data are random
for both males and females but occur at a much higher frequency for males than females.
Even though the missing data process is operating in a random manner within the gender
variable, any remedy applied to the missing data will still reflect the missing data process
because gender affects the ultimate distribution of the household income values.

2. Missing completely at random (MCAR)


Under MCAR, the propensity for a data point to be missing is completely random: there is
no relationship between whether a data point is missing and any values in the data set,
missing or observed.
Diagnostic tests for level of randomness

• The researcher must ascertain whether the missing data process occurs in a
completely random manner. When the data set is small, the researcher may
be able to see any pattern visually or perform a set of simple calculations.
• However, as sample size increases, so does the need for empirical diagnostic
tests. Some statistical programs include techniques specifically designed for
missing data analysis (e.g., Missing Value Analysis in SPSS), which generally
include one or both of the following diagnostic tests.
• The first approach assesses the missing data process of a single variable Y by forming two
groups: observations with missing data for Y and those with valid values of Y. Statistical
tests are then performed to determine whether significant differences exist between the two
groups on other variables of interest. Significant differences indicate the possibility of a
nonrandom missing data process (a sketch of this test is given after the second approach below).
• The second approach is an overall test of randomness that determines whether the
missing data can be classified as MCAR. This test analyses the pattern of missing data on all
available variables and compares it with the pattern expected for a random missing data
process. If no significant differences are found, the missing data can be classified as MCAR.
If significant differences are found, the researcher must use the approaches to identify the
specific missing data processes that are nonrandom.
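The first diagnostic approach above can be sketched in Python, assuming a pandas DataFrame with an illustrative variable Y that has missing values and another variable of interest X; the names and values are hypothetical, not from the source:

```python
# Compare cases with missing Y against cases with valid Y on another variable X.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "X": [21, 34, 29, 45, 52, 38, 27, 31, 48, 40],
    "Y": [3.1, None, 2.8, 4.0, None, 3.5, 2.9, None, 4.2, 3.7],
})

missing_group = df.loc[df["Y"].isna(), "X"]   # observations with missing Y
valid_group = df.loc[df["Y"].notna(), "X"]    # observations with valid Y

# A significant difference on X points to a possibly nonrandom missing data process.
t_stat, p_value = stats.ttest_ind(missing_group, valid_group, equal_var=False)
print(round(t_stat, 3), round(p_value, 3))
```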
Step 4: Select the imputation method
• At this step of process, the researcher must select the approach used for
accommodating missing data in the analysis. This decision is based on
whether the missing data are missing at random (MAR) or missing
completely at random (MCAR).
• Imputation is the process of estimating the missing value based on valid
values of other variables and/or cases in the sample. The researcher has
several options for imputation.
• Imputation is generally avoided for non-metric data; it is recommended
primarily for metric data.
Imputation of missing data

Imputation using only valid data


• Some researchers may question whether using only valid data is actually a
form of imputation, because no data values are actually replaced.
• The basic intent of this approach is to represent the entire sample with those
observations or cases with valid data.
i. Complete case approach
The simplest and most direct approach for dealing with missing data is to include only
those observations with complete data. This method is known as LISTWISE method
in SPSS.
ii. Using all-available data
The second imputation method using only valid data also does not actually replace the
missing data, but instead imputes the distribution characteristics (e.g., means or
standard deviations) or relationships (e.g., correlations) from every valid value.
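A hedged sketch of these two valid-data approaches with pandas (the DataFrame, column names, and values are illustrative, not from the source):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income": [40, 55, np.nan, 62, 48, np.nan, 51],
    "spend":  [12, 18, 15, np.nan, 14, 16, 13],
    "visits": [3, 5, 4, 6, np.nan, 4, 3],
})

# i. Complete case (LISTWISE) approach: keep only rows with no missing values.
complete_cases = df.dropna()

# ii. All-available data approach: each statistic uses every valid value for the
# variables involved; pandas computes means and pairwise correlations this way.
means = df.mean()     # per-variable means over all available values
corr = df.corr()      # pairwise-complete correlations
print(len(complete_cases), means.round(2), corr.round(2), sep="\n")
```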
Imputation by using replacement values
• The second form of imputation involves replacing missing values with
estimated values based on other information available in the sample.
• These methods can be classified as:
i) using a known value as a replacement, or
ii) calculating a replacement value from other observations.
• Using known replacement values
✓ The common characteristic of these methods is to identify a known value,
most often from a single observation, that is used to replace the missing data.
✓ The observation may be from the sample or even external to the sample.
✓ This includes two techniques
i) Hot or Cold deck imputation,
ii) Case Substitution
Hot or Cold deck imputation

In this approach, the researcher substitutes a value from another source for the
missing values.
• In the “hot deck” method, the value comes from another observation in the
sample that is deemed similar. Each observation with missing data is paired with
another case that is similar on a variable specified by the researcher. Then missing
data are replaced with valid values from the similar observation.
• In the “cold deck” method, the replacement value is derived from an external
source (e.g. prior studies, other sample, etc.). Here, the researcher must be sure that
the replacement value from an external source is more valid than an internally
generated value.
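A rough sketch of hot deck imputation, assuming missing income values are replaced with a valid value from a similar respondent in the same region (the column names and data are illustrative, not from the source):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "income": [40.0, np.nan, 55.0, 60.0, 42.0, np.nan],
})

def hot_deck(group):
    donors = group.dropna()
    # Replace each missing value with a value drawn from a similar (same-group) case.
    return group.apply(lambda v: donors.sample(1).iloc[0] if pd.isna(v) else v)

df["income_imputed"] = df.groupby("region")["income"].transform(hot_deck)
print(df)
```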
Case substitution

• In this method, entire observations with missing data are replaced by
choosing another nonsampled observation.
• An example is to replace a sampled household that cannot be contacted or
has extensive missing data with another household not in the sample,
preferably similar to the original observation.
Calculating replacement values
• The second basic approach involves calculating a replacement value from a
set of observations with valid data in the sample.
• This includes two techniques:
i) Mean substitution
ii) Regression substitution.
• Mean substitution
One of the most widely used methods, mean substitution replaces the missing
values for a variable with the mean value of that variable calculated from all
valid responses.
• Regression Imputation
In this method, regression analysis is used to predict the missing values of a
variable based on its relationship to other variables in the data set.
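A minimal sketch of both calculation-based methods, assuming a metric variable income with missing values and a related predictor spend (names and values are illustrative, not from the source):

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "spend":  [12, 18, 15, 20, 14, 16, 13],
    "income": [40, 55, np.nan, 62, 48, np.nan, 51],
})

# Mean substitution: replace missing values with the mean of all valid responses.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Regression imputation: predict missing income from spend using the valid cases.
valid = df.dropna(subset=["income"])
model = LinearRegression().fit(valid[["spend"]], valid["income"])

df["income_reg"] = df["income"]
missing = df["income"].isna()
df.loc[missing, "income_reg"] = model.predict(df.loc[missing, ["spend"]])
print(df.round(1))
```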
Outliers
An observation that is substantially different from the other
observations (i.e., has an extreme value) on one or more characteristics
(variables). The issue is its representativeness of the population.
Example: assume that we sample 20 individuals to determine the
average household income. In our sample we gather responses that
range between $20,000 and $100,000, so that the average is $45,000.
But assume that the 21st person has an income of $1 million. If we
include this value in the analysis, the average income increases to more
than $90,000.
Obviously, the outlier is a valid case, but what is the better estimate of the average
household income: $45,000 or $90,000? The researcher must assess whether the
outlying value is retained or eliminated due to its undue influence on the results.
In substantive terms, the outlier must be viewed in light of how representative it is of
the population. Again, using our example of household income, how representative of
the more wealthy segment is the millionaire? If the researcher feels that it is a small,
but viable segment in the population, then perhaps the value should be retained. If,
however, this millionaire is the only one in the entire population and truly far above
everyone else (i.e., unique) and represents an extreme value, then it may be deleted.
Methods of Detecting Outliers

• Outliers can be identified from a univariate, bivariate, or multivariate
perspective based on the number of variables (characteristics) considered.
Univariate Methods of Outlier Detection
➢Through Standard Score (Z-Score)
• Examines the distribution of observations for each variable in the analysis
and selects as outliers those cases falling at the outer range of the
distribution.
• The typical approach first converts the data values to standard scores, which
have a mean of 0 and a standard deviation of 1, because values expressed in a
standardised format make comparisons across variables easy (see the sketch
after the quartile example below).
➢Through Box-Plot
• To detect outliers on each variable, produce a box-plot. Outliers will appear
at the extremes of the plot and will be labelled.
➢ Through the values of quartile and inter-quartile range
Consider an example with the following data: 119, 201, 235, 269, 271, 278, 283, 291, 301,
303, 441.
Q1=[(n+1)/4]th value= [(11+1)/4]= 3rd = 235
Q3=[3(n+1)/4]th value= [3*(11+1)/4]=9th=301
Inter-quartile range=Q3-Q1= 301-235=66
So, Lower limit= [Q1-(IQR*1.5)]=[235-(66*1.5)]=136
Upper limit=[Q3+(IQR*1.5)]=[301+(66*1.5)]=400
Therefore, 119 and 441 are outliers.
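Both univariate checks (standard scores and the quartile/IQR rule) can be sketched on the same 11 values; this is only an illustration of the calculations above:

```python
import numpy as np

data = np.array([119, 201, 235, 269, 271, 278, 283, 291, 301, 303, 441])
n = len(data)

# Standard scores: mean 0, standard deviation 1.
z = (data - data.mean()) / data.std(ddof=1)
print(np.round(z, 2))                       # 119 and 441 sit at the extremes

# Quartile/IQR rule, using the (n + 1)/4 positions from the worked example.
q1 = np.sort(data)[(n + 1) // 4 - 1]        # 3rd value = 235
q3 = np.sort(data)[3 * (n + 1) // 4 - 1]    # 9th value = 301
iqr = q3 - q1                               # 66
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 136 and 400
print(data[(data < lower) | (data > upper)])    # [119 441]
```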
Bivariate Method of Outlier Detection
• Pairs of variables can be assessed jointly through a scatterplot. Cases that fall
markedly outside the range of the other observations will be seen as isolated
points in the scatterplot.
• A drawback of the bivariate method in general:
The potentially large number of scatterplots that arises as the number of variables
increases.
Multivariate Method of Outlier Detection

• Because most multivariate analyses involve more than two variables, the
bivariate methods quickly become inadequate because:
i) they require a large number of graphs, and
ii) they are limited to two dimensions (variables) at a time.
• This issue is addressed by the Mahalanobis D2 measure. Higher D2 values
represent observations farther removed from the general distribution of
observations in this multidimensional space.
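A hedged sketch of the Mahalanobis D² calculation against the sample centroid, using NumPy only and randomly generated illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 cases on 3 metric variables

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distance

# Larger D2 values flag cases farther from the centroid in multivariate space.
print(np.argsort(d2)[-5:])               # indices of the five most extreme cases
```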
Normality
• The most fundamental assumption in multivariate analysis is normality,
referring to the shape of the data distribution for an individual metric
variable and its correspondence to the normal distribution.
• Normal distribution: Purely theoretical continuous probability distribution in
which the horizontal axis represents all possible values of a variable and the
vertical axis represents the probability of those values occurring. The scores
on the variable are clustered around the mean in a symmetrical, unimodal pattern
known as the bell-shaped, or normal, curve. It is also called the Gaussian
distribution.
Slicing and Dicing of Data
• In many marketing situations, the researcher needs to slice and dice the
data.
• Software such as Excel (through pivot tables) and SPSS (through
cross-tabulation) enables the researcher to quickly summarize and describe
the data in many different ways, as sketched below.
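A small sketch of slicing and dicing with pandas, analogous to an Excel pivot table or an SPSS cross-tabulation (the data are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "region":   ["East", "East", "West", "West", "North", "South"],
    "salesman": ["S1", "S2", "S1", "S2", "S1", "S1"],
    "volume":   [24, 30, 22, 32, 23, 32],
})

# Pivot table: total volume by region and salesman.
pivot = sales.pivot_table(values="volume", index="region",
                          columns="salesman", aggfunc="sum")

# Cross-tabulation: count of records per region/salesman combination.
counts = pd.crosstab(sales["region"], sales["salesman"])
print(pivot, counts, sep="\n\n")
```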
Data Visualisation
• Data visualization is the process of translating large data sets and metrics
into charts, graphs and other visuals.
• The resulting visual representation of data makes it easier to identify and
share real-time trends, outliers, and new insights about the information
represented in the data.
• In the world of Big Data, data visualization tools and technologies are
essential to analyze massive amounts of information and make data-driven
decisions.
Common general types of data
visualization
• Charts
• Tables
• Graphs
• Maps
• Infographics
• Dashboards
Descriptive Statistics
• Descriptive statistics is the process of describing data and trying to reach a
conclusion based on it.
• Descriptive statistics includes two concepts: measures of central tendency
and measures of dispersion.
Measures of Central Tendency
1. Mathematical averages
(a) Arithmetic mean
▪ Simple
▪ Weighted
(b) Geometric mean
(c) Harmonic mean

2. Positional averages
(a) Median
(b) Mode
(c) Quartiles
(d) Deciles
(e) Percentiles
Arithmetic Mean
▪ The arithmetic mean (AM) of a set of observations is their sum, divided by the
number of observations.
▪ It is generally denoted by x̄ (x-bar) or AM. The population mean is denoted by μ.
▪ Arithmetic mean is of two types:
Simple arithmetic mean
Weighted arithmetic mean
• Computation of Arithmetic Mean for Discrete Frequency Distribution

• Calculating Weighted Mean
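As an illustrative sketch of these two computations (the values are hypothetical, not from the slides):

```python
# Arithmetic mean for a discrete frequency distribution: sum(f * x) / sum(f).
x = [10, 20, 30, 40]        # values of the variable
f = [4, 6, 8, 2]            # corresponding frequencies
freq_mean = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)

# Weighted mean: sum(w * x) / sum(w).
scores = [70, 85, 90]
weights = [0.2, 0.3, 0.5]
weighted_mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(round(freq_mean, 2), round(weighted_mean, 2))   # 24.0 and 84.5
```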


Geometric Mean
▪ Geometric mean (GM) is the nth root of the product of n items of a series.

▪ If there are three items 4, 6, and 9, then their geometric mean, which is
generally denoted by G, can be computed as:
Computation of Geometric Mean for
Individual Series
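A small sketch of this computation for the three items mentioned above (4, 6, and 9):

```python
# Geometric mean: nth root of the product of the n items.
items = [4, 6, 9]

product = 1
for value in items:
    product *= value

gm = product ** (1 / len(items))   # cube root of 216
print(round(gm, 2))                # 6.0
```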
Harmonic Mean
• The harmonic mean of any series is the reciprocal of the arithmetic mean
of the reciprocal of the variate, that is, the harmonic mean by definition is
given by:
Computation of Harmonic Mean for
Individual Series
Relationship between Arithmetic mean (AM),
Geometric Mean (GM) and Harmonic Mean (HM)
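A brief sketch of the harmonic mean for the same three items, together with the usual ordering AM ≥ GM ≥ HM that holds for positive values:

```python
items = [4, 6, 9]
n = len(items)

am = sum(items) / n                     # arithmetic mean
gm = (4 * 6 * 9) ** (1 / n)             # geometric mean
hm = n / sum(1 / v for v in items)      # reciprocal of the mean of reciprocals

print(round(am, 2), round(gm, 2), round(hm, 2))   # 6.33, 6.0, 5.68
```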
Positional Averages

▪ Arithmetic mean, geometric mean, and harmonic mean are all mathematical
in nature and are measures of quantitative characteristics of data.
▪ To measure the qualitative characteristics of data, other measures of central
tendency, namely median and mode are used.
▪ Positional averages, as the name indicates, mainly focus on the position of
the value of an observation in the data set.
Median
▪ The median may be defined as the middle or central value of the variable
when values are arranged in the order of magnitude.

▪ In other words, median is defined as that value of the variable that divides
the group into two equal parts, one part comprising all values greater and the
other all values lesser than the median.
Computation of Median for the Individual
Series
• In this type of distribution, data can be arranged in ascending or descending
order. If there are n terms (observations) in the data, there are two cases: when n
is odd, the median is the ((n+1)/2)th value; when n is even, it is the average of
the (n/2)th and ((n/2)+1)th values.
• Mode
▪ Mode is the variate having the maximum frequency in a data series.
▪ In the case of an individual series, data is arranged in order and mode can be determined by
inspection only.
▪ The value of the variable (in data series) which occurs the most or the value of the data
series with maximum frequency is the mode of the data series.
▪ For example, for a series 1, 1, 3, 3, 3, 3, 4, 5, 8, 8, 16, 16 (arranged in the order of
magnitude), observation 3 has the maximum frequency 4. Therefore, mode of the series is 3.
Empirical Relationship between Mean, Median
and Mode
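For a moderately skewed distribution, the empirical relationship commonly used is:

```latex
\text{Mode} \approx 3\,\text{Median} - 2\,\text{Mean}
```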
Partition Values: Quartiles, Deciles, and
Percentiles
▪ Partition values are measures that divide the data into several equal parts.
Quartiles divide data into 4 equal parts, deciles divide data into 10 equal
parts, and percentiles divide data into 100 equal parts.

▪ For an individual series, the first and third quartiles can be computed using
the following formula:
• In a data series, when the observations are arranged in an ordered sequence,
deciles divide the data into 10 equal parts. In the case of individual series
and discrete frequency distribution, the generalized formula for computing
deciles is given as:
• In a data series, when observations are arranged in an ordered sequence,
percentiles divide the data into 100 equal parts. For an individual series and
a discrete frequency distribution, the generalized formula for computing
percentiles is given as:
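In the usual notation for an individual series of n ordered observations (consistent with the quartile positions used in the outlier example earlier), the positions of these partition values are:

```latex
Q_k = \text{value at position } \tfrac{k(n+1)}{4}, \quad k = 1, 2, 3 \\
D_k = \text{value at position } \tfrac{k(n+1)}{10}, \quad k = 1, \dots, 9 \\
P_k = \text{value at position } \tfrac{k(n+1)}{100}, \quad k = 1, \dots, 99
```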
Measures of Dispersion
▪ The meaning of dispersion is “scatteredness.” The degree to which
numerical data tends to spread around an average value is called variation or
dispersion of data.
Types of Measures of Dispersion
▪ There are two types of measures of dispersion:
1. Absolute measures of dispersion: Absolute measures of dispersion are
presented in the same unit as the unit of distribution.
2. Relative measures of dispersion: Relative measures of dispersion are
useful in comparing two sets of data which have different units of
measurement.
▪ Relative measures of dispersion are pure unitless numbers and are generally
called coefficient of dispersion.
Methods of Measuring Dispersion
The following are some of the important and widely used methods
of measuring dispersion:
▪ Range
▪ Interquartile range and quartile deviation
▪ Average absolute deviation
▪ Standard deviation
• Range
▪ Range is defined as the difference between the smallest and the greatest
values in a distribution.
▪ Range is an absolute measure of dispersion. The relative measure of
dispersion for range is called the coefficient of range and is calculated by the
following formula:
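In the usual notation, with X_max and X_min denoting the largest and smallest values:

```latex
\text{Range} = X_{\max} - X_{\min}, \qquad
\text{Coefficient of range} = \frac{X_{\max} - X_{\min}}{X_{\max} + X_{\min}}
```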
• Interquartile range and quartile deviation
▪ Interquartile range is the difference between the third quartile and the first quartile.
▪ Quartile deviation or semi-interquartile range can be obtained by dividing the
interquartile range by 2.
▪ Quartile deviation is an absolute measure of dispersion. Relative measure is called
the coefficient of quartile deviation. Coefficient of quartile deviation can be used to
measure the degree of variation in two different distributions when both have
different units of measurement.
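In standard notation these measures are:

```latex
\text{Interquartile range} = Q_3 - Q_1, \qquad
\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}, \qquad
\text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1}
```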
• Average absolute deviation
Average absolute deviation is the average amount of scatter of the items in a
distribution, from either the mean or the median or the mode, ignoring the
signs of deviations.
• Average absolute deviation is an absolute measure of dispersion.
In this context, a relative measure, also known as coefficient of
average absolute deviation, is obtained by the following formula:
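The coefficient is normally obtained by dividing the average absolute deviation by the average (mean, median, or mode) from which the deviations were taken:

```latex
\text{Coefficient of average absolute deviation} =
\frac{\text{Average absolute deviation}}{\text{Mean (or median, or mode) used}}
```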
Standard Deviation and Variance
• Standard deviation is the square root of the sum of square deviations of various
values from their arithmetic mean divided by the sample size minus one.

• Variance is the square of standard deviation. Sample variance is the sum of squared
deviations of various values from their arithmetic mean divided by the sample size
minus one.
• For population standard deviation, we have N instead of n-1 in formula of sample
standard deviation.
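Written out, following the verbal definitions above (sample statistics use n − 1; the population form uses N and μ):

```latex
s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}, \qquad
s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}, \qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}
```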
Coefficient of Variation
▪ To compare the dispersion of two distributions, the relative measure of
standard deviation is used and is referred to as the coefficient of variation.
▪ A distribution with lesser CV shows greater consistency, homogeneity, and
uniformity, whereas a distribution with greater CV is considered more
variable than others.
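The coefficient of variation is conventionally expressed as a percentage of the mean:

```latex
\text{CV} = \frac{s}{\bar{x}} \times 100
```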
Measures of Association
▪ Measures of association are statistics for measuring the strength of relationship
between two variables.
▪ Correlation measures the degree of association between two variables.
▪ Karl Pearson’s coefficient of correlation is a quantitative measure of the degree
of relationship between two variables. Suppose these variables are x and y, then
Karl Pearson’s coefficient of correlation is calculated as:
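In its standard deviation-score form, Karl Pearson's coefficient for variables x and y is:

```latex
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \; \sum_{i=1}^{n}(y_i - \bar{y})^2}}
```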
• The coefficient of correlation lies in between +1 and –1.
Empirical rule

Figure: Area under the normal curve
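The empirical rule illustrated by this figure states that, for an approximately normal distribution:

```latex
P(\mu - \sigma \le X \le \mu + \sigma) \approx 68\%, \quad
P(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 95\%, \quad
P(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 99.7\%
```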


Measures of Shape
• Measures of shape are the tools used for describing the shape of
a distribution of the data. There are two measures of shape:
skewness and kurtosis.
• A distribution of data where the right half is the mirror image of
the left half is said to be symmetrical. If the distribution is not
symmetrical, it is said to be asymmetrical or skewed.
Figure 4.12 : (a) Left skewed distribution, (b) right skewed distribution, and (c) symmetrical
distribution
Coefficient of Skewness
▪ Karl Pearson developed a method for measuring skewness, referred to as the
Pearsonian coefficient of skewness. This coefficient takes the difference between
the mean and the mode and divides it by the standard deviation. The Pearsonian
coefficient of skewness is given as:
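Following the description above, the Pearsonian coefficient of skewness is:

```latex
S_k = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}
```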
Kurtosis
Kurtosis measures the amount of peakedness of a distribution. A
flatter distribution than a normal distribution is called platykurtic. A
more peaked distribution than the normal distribution is referred to
as leptokurtic. Between these two types of distribution is a
distribution which is more normal in shape, referred to as
mesokurtic distribution.
Figure : (a) Leptokurtic distribution, (b) Platykurtic distribution,
and (c) Mesokurtic distribution
References
• Bajpai, N. (2017). Business Research Methods (2nd ed.). Pearson Education.
• Hair Jr., J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014). Multivariate
Data Analysis (7th ed.). USA: Pearson-Prentice Hall.
