
CHAPTER 2: DESCRIPTIVE STATISTICS

Introduction
• U.S. Census Bureau- collects data related to the population and economy of the United
States using a variety of methods and for many purposes.
- One of the most important providers of data used in business analytics
• Decennial Census- an effort to count the total U.S population
- Collects categorical data (e.g. sex, race, number of people living in the household)
• Current Population Survey (CPS)- used to estimate employment and unemployment
rates in different geographic areas.
Terminologies:
• Data- facts and figures collected, analyzed and summarized for presentation and
interpretation.
• Variable- a characteristic or a quantity of interest that can take on different values.
- Example variables from the textbook's stock data set: Symbol, Industry, Share Price, and Volume.
• Observation- set of values corresponding to a set of variables.
• Decision variables- used in optimization models
- Values of variables that are under the direct control of the decision maker.
• Variation- difference in a variable measured over observations (time, customers, items,
etc.).
- How the value of a variable can vary
• Role of descriptive analytics- to collect and analyze data to gain a better understanding
of variation and its impact on the business setting.
• Random variable/ Uncertain variable- a quantity whose values are not known with
certainty.
Types of Data:
• Population- the set of all elements of interest in a particular study.
• Sample- subset of the population. (Business analytics deals with sample data)
• Quantitative data- numeric and arithmetic operations, such as addition, subtraction,
multiplication, and division, can be performed on them.
• Categorical data- if arithmetic operations cannot be performed on the data.
- Counting the number of observations or computing the proportion of observations in
each category.
• Cross-Sectional data- collected from several entities at the same, or approximately the
same, point in time.
• Time series data- collected over several time periods.
- Graphs of time series are often used in business and economic publications.
- These graphs help analysts understand what happened in the past, identify trends over
time, and project future levels for the time series.
Sources of Data:
Experimental study
Non-experimental/Observational studies- surveys are the most common type
Terminologies:
• Distributions- help summarize many characteristics of a data set by describing how often
certain values for a variable appear in that data set.
- Can be created for both categorical and quantitative data, and they assist the analyst
in gauging variation.
- Useful for interpreting and analyzing data.
- Describes the overall variability of the observed values of a variable.

• Frequency distribution- summary of data that shows the number (frequency) of
observations in each of several non-overlapping classes (called bins); n = the number of observations.
• Relative frequency distribution- tabular summary of data showing the relative
frequency for each bin.
• Percent frequency distribution- summarizes the percent frequency of the data for each
bin
- Can be used to provide estimates of the relative likelihoods of different values for a
random variable.
Three steps necessary to define the classes for a frequency distribution with quantitative
data are as follows:
1) Determine the number of non-overlapping bins.
2) Determine the width of each bin.
3) Determine the bin limits.
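These three steps can be carried out in a few lines of code. The sketch below uses Python and NumPy rather than the textbook's Excel; the data values and the choice of five bins are made-up assumptions for illustration.

```python
import numpy as np

# Hypothetical quantitative data (20 observations)
data = np.array([12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
                 22, 23, 22, 21, 33, 28, 14, 18, 16, 13])

k = 5                                        # step 1: number of non-overlapping bins
width = (data.max() - data.min()) / k        # step 2: approximate bin width
freq, limits = np.histogram(data, bins=k)    # step 3: NumPy chooses the bin limits

n = data.size
for lo, hi, f in zip(limits[:-1], limits[1:], freq):
    # frequency, relative frequency (f/n), and percent frequency (100 * f/n)
    print(f"{lo:5.1f} - {hi:5.1f}: freq = {f:2d}, rel = {f/n:.2f}, pct = {100*f/n:.0f}%")
```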
Terminologies:
• Bins- formed by specifying the ranges used to group the data (guideline: 5-20 bins)
- For a small number of data items- 5-6 bins may be used to summarize the data.
- For a large number of data items- more bins are required.
- Goal: use enough bins to show the variation in the data, but not so many that some
contain only a few data items.
• Width of Bins
- Approximate bin width = (largest data value − smallest data value) / number of bins
• Bin limits
- Lower bin limit identifies the smallest possible data value assigned to the bin.
- Upper bin limit identifies the largest possible data value assigned to the bin.
• Histograms- a common graphical presentation of quantitative data.
• Cumulative frequency distribution- a variation of the frequency distribution that
provides another tabular summary of quantitative data.
- Uses number of classes, class widths, and class limits developed for the frequency
distribution.
Measures of Location
• Mean (Arithmetic Mean)- most commonly used measure of location.
- Provides a measure of central location for the data.
• Sample mean- point estimate of the population mean for the variable of interest (x̄).
- x̄ = (sum of the values of the n observations) / n = Σxi / n
• Population mean- computed in the same manner (over all values in the population) but denoted by the Greek letter μ.

• Median- another measure of central location; is the value in the middle when the data are
arranged in ascending (smallest to largest) order.
• Mode- third measure of location; is the value that
occurs most frequently in the data set.
• Bimodal- data contain exactly two modes
• Multimodal- data contain more than two modes
• Geometric mean- measure of location that is
calculated by finding the nth root of the product of
n values.


• Formula for sample geometric mean: x̄g = (x1 × x2 × ··· × xn)^(1/n), the nth root of the product of the n values.
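A quick Python sketch of these four measures of location using the standard-library statistics module; the sample values are hypothetical, and geometric_mean requires Python 3.8+.

```python
import statistics

x = [2.5, 3.0, 4.5, 3.0, 5.0, 3.0, 4.0]          # hypothetical sample of n = 7 values

print(statistics.mean(x))            # arithmetic mean: sum of values / n
print(statistics.median(x))          # middle value of the data arranged in ascending order
print(statistics.mode(x))            # most frequently occurring value (3.0 here)
print(statistics.geometric_mean(x))  # nth root of the product of the n values
```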

Measures of Variability
- Measures of the differences (spread) in the values of a variable over observations.
• Range- simplest measure of variability.
- Can be found by subtracting the smallest value from the largest value in a data set.
• Variance- measure of variability that utilizes all the data.
- Based on the deviations about the mean, where each deviation is the difference between the
value of an observation (xi) and the mean.
- Sample variance: s² = Σ(xi − x̄)² / (n − 1)
• Standard Deviation- the positive square root of the variance.
- s = √s² (sample standard deviation)
- σ = √σ² (population standard deviation)

• Coefficient of Variation- indicates how large the standard deviation is relative to the
mean.
- Usually expressed as a percentage: CV = (standard deviation / mean) × 100%
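The measures of variability above, sketched in Python with hypothetical sample data; statistics.variance and statistics.stdev divide by n − 1, matching the sample formulas.

```python
import statistics

x = [10, 12, 14, 15, 17, 18, 18, 24]         # hypothetical sample

r = max(x) - min(x)                          # range: largest value minus smallest value
s2 = statistics.variance(x)                  # sample variance s^2
s = statistics.stdev(x)                      # sample standard deviation, sqrt of s^2
cv = 100 * s / statistics.mean(x)            # coefficient of variation, as a percentage

print(r, s2, round(s, 2), round(cv, 1))
```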
Analyzing Distributions
• Percentile- the value of a variable at which a specified (approximate) percentage of
observations fall below that value.
- pth percentile- tells us the point in the data where approximately p% of the
observations have values less than the pth percentile; hence, approximately (100 − p)%
of the observations have values greater than the pth percentile.
• Location of the pth Percentile: Lp = (p / 100)(n + 1), where n is the number of observations arranged in ascending order.
• Quartiles- divide the data into four parts, with each part containing approximately one-
fourth, or 25%, of the observations.
- Division points: Q1 = first quartile (25th percentile), Q2 = second quartile (50th
percentile, the median), Q3 = third quartile (75th percentile).
• Interquartile range (IQR)- difference between the third and first quartiles.
• Z-score- allows us to measure the relative location of a value in the data set.
- Helps us determine how far a particular value is from the mean relative to the data
set's standard deviation.
- Often called the standardized value: zi = (xi − x̄) / s
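Percentiles, quartiles, the IQR, and z-scores in a short NumPy sketch; the data are hypothetical, and note that np.percentile interpolates slightly differently from the Lp = (p/100)(n + 1) rule above.

```python
import numpy as np

x = np.array([27, 30, 32, 35, 37, 40, 43, 44, 48, 52])   # hypothetical, already sorted

q1, q2, q3 = np.percentile(x, [25, 50, 75])   # quartiles: the three division points
iqr = q3 - q1                                 # interquartile range

z = (x - x.mean()) / x.std(ddof=1)            # z-score of each value: (xi - xbar) / s
outliers = x[np.abs(z) > 3]                   # values more than 3 std devs from the mean

print(q1, q2, q3, iqr)
print(np.round(z, 2), outliers)
```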

• Empirical Rule- applies when the data exhibit a symmetric, bell-shaped distribution.
- Used to determine the percentage of data values that are within a specified number of
standard deviations of the mean (approximately 68% within 1, 95% within 2, and 99.7%
within 3 standard deviations).
• Identifying Outliers
- Extreme values
(1) An outlier may be a data value that has been incorrectly recorded; if so, it can be
corrected before the data are analyzed further.
(2) An outlier may also be from an observation that doesn't belong to the population we
are studying and was incorrectly included in the data set; if so, it can be removed.
(3) An outlier may be an unusual data value that has been recorded correctly and is a
member of the population we are studying.
- Standardized values (z-scores)- can be used to identify outliers.
- A data value with a z-score less than −3 or greater than +3 is treated as an outlier.
• Box plots- a graphical summary of the distribution of data.
- Developed from the quartiles for a data set.
Measures of Association Between Two Variables
• Scatter charts- a useful graph for analyzing the relationship between two variables.

• Covariance- descriptive measure of the linear association between two variables.
- Sample covariance: sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)
• Correlation Coefficient- measures the strength of the linear relationship between two
variables and, unlike covariance, is not affected by the units of measurement for x and y.
- rxy = sxy / (sx sy)
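Covariance and correlation in NumPy, with hypothetical x and y values; the check line reproduces rxy from its definition.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])                 # hypothetical variable x
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])       # hypothetical variable y

sxy = np.cov(x, y, ddof=1)[0, 1]              # sample covariance s_xy
rxy = np.corrcoef(x, y)[0, 1]                 # correlation coefficient, unit-free

rxy_check = sxy / (x.std(ddof=1) * y.std(ddof=1))  # r_xy = s_xy / (s_x * s_y)
print(sxy, rxy, rxy_check)
```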
Data Cleansing
- the data in a data set are often said to be “dirty” and “raw” before they have been put
into a form that is best suited for investigation, analysis, and modeling.
- Data preparation makes heavy use of descriptive statistics and data-visualization
methods to gain an understanding of the data.
- Common tasks in data preparation include treating missing data, identifying
erroneous data and outliers, and defining the appropriate way to represent variables.
Missing Data
• Legitimately missing data- missing data that naturally occur.
- No remedial action is taken
• Illegitimately missing data- missing data that occur for different reasons
- Remedial action is considered
‣ Primary options for addressing such missing data:
(1) Discard observations (rows) with any missing values.
(2) Discard any variable (column) with missing values.
(3) Fill in missing entries with estimated values.
(4) Apply a data-mining algorithm (such as classification and regression trees) that can
handle missing values.
• Missing completely at random(MCAR)- if the tendency for an observation to be missing the
value for some variable is entirely random, then whether data are missing does not depend on
either the value of the missing data or the value of any other variable in the data.
• Missing at random (MAR)- if the tendency for an observation to be missing a value for some
variables is related to the value of some other variable(s) in the data.
• Missing not at random (MNAR)- if the tendency for the value of a variable to be missing is
related to the value that is missing.
‣ If the missing values cannot be determined, and ignoring missing values or removing a variable
with missing values from consideration is not an option, imputation (the systematic replacement of
missing values with values that seem reasonable) may be useful.
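A minimal pandas sketch of the first three remedial options listed above; the tiny data frame and the choice of mean imputation are illustrative assumptions (option 4, a data-mining algorithm that tolerates missing values, is not shown).

```python
import pandas as pd

# Hypothetical data set with illegitimately missing entries (None -> NaN)
df = pd.DataFrame({"age": [34, None, 45, 29],
                   "income": [52000, 61000, None, 48000]})

drop_rows = df.dropna()                          # option 1: discard rows with missing values
drop_cols = df.dropna(axis=1)                    # option 2: discard columns with missing values
imputed = df.fillna(df.mean(numeric_only=True))  # option 3: impute with the column mean

print(imputed)
```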
Blakely Tires- a U.S. producer of automobile tires; used as the example data set in this section.
Identification of Outliers and Other Erroneous Values
- Examining the variables in the data set by use of summary statistics, frequency
distributions, bar charts and histograms, z-scores, scatter plots, correlation
coefficients, and other tools can uncover data-quality issues and outliers.
- Conservative approach- create two data sets, one with and one without outliers and
potentially erroneous values, and then construct a model on each data set.
Variable Representation
- Dimension reduction- process of removing variables from the analysis without
losing crucial information.
- A critical part of data mining is determining how to represent the measurements of
the variables and which variables to consider.

CHAPTER 3: DATA VISUALIZATION

Data visualization - is also important in conveying your analysis to others. Although business
analytics is about making better decisions, in many cases, the ultimate decision maker is not the
person who analyzes the data. Therefore, the person analyzing the data has to make the analysis
simple for others to understand. Proper data-visualization techniques greatly improve the
ability of the decision maker to interpret the analysis easily.

Microsoft Excel – is a ubiquitous tool used in business for data visualization.


• Can make it easy for anyone to create many standard examples of data visualization.
Effective Design Techniques

• Data-Ink Ratio – first described by Edward R. Tufte in 2001 in his book The Visual
Display of Quantitative Information.
- It measures the proportion of what Tufte terms "data-ink" to the total amount of ink
used in a table or chart.
- Data-ink is the ink necessary to convey the meaning of the data to the audience.
• Non-Data-Ink – serves no useful purpose in conveying the data to the audience.

Charts can often convey information faster and more easily than tables, but in some cases a table is
more appropriate. Tables should be used when:
1. Reader needs to refer to specific numerical values.
2. Reader needs to make precise comparisons between different values and not just relative
comparisons.
3. Values being displayed have different units or very different magnitudes.

Table Design Principles


- In designing an effective table, keep in mind the data-ink ratio and avoid the use of
unnecessary ink; avoid using vertical lines in a table unless they are necessary for clarity.
- Horizontal lines are generally necessary only for separating column titles from data
values or for indicating that a calculation has taken place.

Crosstabulation – provides a tabular summary of data for two variables.
- A crosstabulation in Microsoft Excel is known as a PivotTable.
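Outside Excel, the same idea can be sketched with pandas; the quality-rating and meal-price values below are made up and only loosely echo the textbook's restaurant example.

```python
import pandas as pd

df = pd.DataFrame({"quality": ["Good", "Very Good", "Good", "Excellent", "Very Good"],
                   "meal_price": [18, 22, 28, 33, 24]})      # hypothetical observations

# Crosstabulation: quality rating (rows) vs. binned meal price (columns)
price_bin = pd.cut(df["meal_price"], bins=[10, 20, 30, 40])
print(pd.crosstab(df["quality"], price_bin))
```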

Recommended PivotTables in Excel – Excel also has the ability to recommend PivotTables for
your data set.
Charts (graphs)
- are visual methods for displaying data.
- Excel is the most commonly used software package for creating simple charts.
• Scatter chart – is a graphical presentation of the relationship between two quantitative
variables.
- Are often referred to as scatter plots or scatter diagrams.
• Trendline – is a line that provides an approximation of the relationship between the
variables.
• Chart Buttons in Excel – allow users to quickly modify and format charts.
- The Chart Elements button brings up a list of check boxes to quickly add and remove
axes, axis titles, chart titles, data labels, trendlines, and more.
- The Chart Styles button allows the user to quickly choose from many preformatted styles
to change the look of the chart.
- The Chart Filters button allows the user to select the data to be included in the chart.

Line Charts
- a line chart for time series data is often called a time series plot.
- Are similar to scatter charts, but a line connects the points in the chart.
- Are very useful for time series data collected over a period of time (minutes, hours,
days, years, etc.)
- Can also be used to graph multiple lines.
• Sparkline – minimalist type of line chart that can be placed directly into a cell in Excel.
- Contain no axes; they display only the line for the data.
- Take up very little space, and they can be effectively used to provide information on
overall trends for time series data.
Bar Charts and Column Charts – provide a graphical summary of categorical data.
- Helpful in making comparisons between categorical variables.
• Bar chart – uses horizontal bars to display the magnitude of the quantitative variable.
• Column charts – use vertical bars to display the magnitude of the quantitative variable.

A Note on Pie Charts and Three-Dimensional Charts


• Pie charts – are another common form of chart used to compare categorical data.
• Bubble chart – is a graphical means of visualizing three variables in a two-dimensional
graph and is therefore sometimes a preferred alternative to a 3-D chart.
• Heat map – a two-dimensional graphical representation of the data that uses different
shades of color to indicate magnitude.
Additional Charts for Multiple Variables
• Stacked-column chart – a basic Excel chart type that allows part-to-whole comparisons
over time or across categories.
• Stacked-bar chart – used to display the same data by using horizontal bars instead of
vertical ones.
• Clustered-column chart – an alternative chart type.
- Often superior to stacked-column and stacked-bar charts for comparing quantitative
variables, but can become cluttered with more than a few quantitative variables
per category.
- Also referred to as side-by-side column (or bar) charts.
• Scatter-chart matrix – especially useful chart for displaying multiple variables.
- Allows the reader to easily see the relationships among multiple variables.
PivotCharts in Excel
- To summarize and analyze data with both a crosstabulation and charting, Excel pairs
PivotCharts with PivotTables.
- In the textbook example, the PivotChart is a clustered-column chart whose column heights
correspond to the average wait times, clustered into the categorical groupings of Good, Very
Good, and Excellent.

Advanced Data Visualization


• Advanced charts:
• Parallel-coordinates plot – for examining data with more than two variables.
• Treemap – used for visualizing hierarchical data along multiple dimensions.
Geographic Information System (GIS) charts – merge maps and statistics to present data
collected over different geographic areas.
- Help in interpreting data and observing patterns.
• Data dashboard - a data-visualization tool that illustrates multiple metrics and
automatically updates these metrics as new data become available.
Principles of Effective Data Dashboards
• Key performance indicators (KPIs) are sometimes referred to as key performance metrics
(KPMs).
- Each KPI displayed in the data dashboard should convey meaning to its user and be related to
the decisions the user makes.
Application of Data Dashboards
CHAPTER 4: DESCRIPTIVE DATA MINING

Observation or Record – the set of recorded values of variables associated with a single entity.
- It is often displayed as a row of the values in a spreadsheet or database.
Descriptive data-mining methods – are also called unsupervised learning techniques.
- There is no outcome variable to predict; rather, the goal is to use the variable values to
identify relationships between observations.
- Can be thought of as high-dimensional descriptive analytics because they are designed to
describe patterns and relationships in large data sets with many observations of many
variables.
Cluster Analysis – segments observations into similar groups based on the observed variables.
- Can be employed during the data-preparation step to identify variables or observations
that can be aggregated or removed from consideration.
- Is commonly used in marketing to divide consumers into different homogeneous groups.
- Can also be used to identify outliers.
Market segmentation - is a marketing term that refers to aggregating prospective buyers into
groups or segments with common needs and who respond similarly to a marketing action.

Hierarchical Clustering – starts with each observation belonging to its own cluster and
sequentially merges the most similar clusters to create a series of nested clusters.
- Determines the similarity of two clusters by considering the similarity between the
observations composing either cluster.
Measuring similarity between observations
- Euclidean distance
- Matching coefficient
- Jaccard's coefficient
K-means clustering – assigns each observation to one of k clusters in a manner such that the
observations assigned to the same cluster are as similar as possible.
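A short scikit-learn sketch of k-means with k = 3; the observations are hypothetical and assumed already standardized.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 8.0],    # hypothetical observations
              [5.2, 7.9], [9.0, 1.0], [8.8, 1.2]])   # on two standardized variables

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assigned to each observation
print(km.cluster_centers_)  # centroid of each of the k clusters
```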

Measuring Similarity between observations


- The goal of cluster analysis is to group observations into clusters such that observations
within a cluster are similar and observations in different clusters are dissimilar.
- We need a way to measure similarity or, conversely, dissimilarity between observations.
Euclidean distance – is the most common method to measure dissimilarity between
observations.
- For observations u and v measured on q variables: d(u, v) = √[(u1 − v1)² + (u2 − v2)² + ... + (uq − vq)²]
- Becomes smaller as a pair of observations become more similar with respect to their
variable values.
- Is highly influenced by the scale on which the variables are measured.

Conversion to z-scores also makes it easier to identify outlier measurements, which can distort
the Euclidean distance between observations.
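The effect of scale, sketched in NumPy with hypothetical age/income observations: the raw distance is dominated by income, while converting to z-scores puts both variables on a comparable footing.

```python
import numpy as np

# Hypothetical observations: [age in years, income in dollars]
data = np.array([[25, 50000], [30, 70000], [40, 72000], [55, 90000]])

u, v = data[1], data[2]
d_raw = np.sqrt(np.sum((u - v) ** 2))        # raw Euclidean distance, dominated by income

z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)   # z-scores per variable
d_z = np.sqrt(np.sum((z[1] - z[2]) ** 2))    # distance after scaling

print(round(d_raw, 1), round(d_z, 2))
```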

Matching Coefficient – is the simplest overlap measure; the proportion of variables on which two observations have matching values.

Jaccard's coefficient – used to measure similarity between observations on binary variables; unlike the matching coefficient, it ignores matches that occur only because both observations lack a trait (0-0 matches).
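Both coefficients for two hypothetical binary observations (1 = trait present, 0 = absent), as a plain NumPy sketch:

```python
import numpy as np

a = np.array([1, 0, 1, 1, 0])                 # hypothetical binary observations
b = np.array([1, 0, 0, 1, 0])

matching = np.mean(a == b)                    # matching coefficient: share of equal entries

both = np.sum((a == 1) & (b == 1))            # variables where both record a 1
either = np.sum((a == 1) | (b == 1))          # variables where at least one records a 1
jaccard = both / either                       # Jaccard's coefficient ignores 0-0 matches

print(matching, jaccard)                      # 0.8 and 0.666... for these values
```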

Single linkage clustering method – the similarity between two clusters is defined by the
similarity of the pair of observations (one from each cluster) that are the most similar.
- Will consider two clusters to be close if just one observation in one cluster is close to an
observation in the other cluster.
Complete linkage clustering method – defines the similarity between two clusters as the
similarity of the pair of observations (one from each cluster) that are the most different.
- This method produces clusters such that all member observations of a cluster are
relatively close to each other.
Group average linkage clustering method – defines the similarity between two clusters as the
average similarity computed over all pairs of observations between the two clusters.
Median linkage method – is analogous to group average linkage except that it uses the median
of the similarities computed between all pairs of observations between the two clusters.
- Reduces the effect of outliers
Centroid linkage – uses the averaging concept of cluster centroids to define between-cluster
similarity.
Ward’s method – merges two clusters such that the dissimilarity of the observations within the
resulting single cluster increases as little as possible.
- It tends to produce clearly defined clusters of similar size.
- computes the centroid of the resulting merged cluster and then calculates the sum of
squared dissimilarity between this centroid and each observation in the union of the two
clusters.
- Representing observations within a cluster with the centroid can be viewed as a loss of
information in the sense that the individual differences in these observations will not be
captured by the cluster centroid.
- Minimizes loss of information between the individual observation level and cluster
centroid level.
McQuitty’s method – consider merging two clusters A and B, the dissimilarity of the resulting
cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) +
(dissimilarity between B and C)) / 2
- When two clusters are joined, the distance of the new cluster to any other cluster is
calculated as the average of the distances of the soon-to-be-joined clusters to that other
cluster.
Dendrogram – a chart that depicts the set of nested clusters resulting at each step of aggregation.
- Used to visually summarize the output of a hierarchical clustering; in the textbook example,
the clustering uses the matching coefficient to measure similarity between observations and the
group average linkage clustering method to measure similarity between clusters.
- The horizontal axis of the dendrogram represents the dissimilarity (distance) resulting
from a merger of two different groups of observations.
- In the textbook figure, a blue horizontal line marks the merger of two or more clusters, and the
observations composing the merged clusters are connected to the blue horizontal line with blue
vertical lines.
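Hierarchical clustering and its dendrogram can be sketched with SciPy; the observations are hypothetical, and the method argument selects the linkage ('single', 'complete', 'average', 'weighted' for McQuitty's, 'centroid', 'median', or 'ward').

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 2.0], [1.1, 1.9], [5.0, 8.0],   # hypothetical standardized observations
              [5.2, 7.8], [9.0, 1.0]])

Z = linkage(X, method="average", metric="euclidean")  # group average linkage

dendrogram(Z)      # depicts the nested clusters resulting at each step of aggregation
plt.show()
```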
Association rules – convey the likelihood of certain items being purchased together.
- An important tool in market basket analysis; also applicable to disciplines other
than marketing.
Market basket analysis - is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns.
Antecedent – is the collection of items corresponding to the if portion of the rule.
Consequent – is the item set corresponding to the then portion of the rule.
Support count – the number of transactions in the data that include a given item set; formalizes
the notion of how frequently items occur together.
Confidence – helps identify reliable association rules; for the rule "if antecedent, then
consequent," confidence = support count of (antecedent and consequent) / support count of antecedent.
Lift ratio – is used to evaluate the efficiency of a rule: lift = confidence / (support count of
consequent / total number of transactions); a lift ratio greater than 1 suggests the antecedent and
consequent appear together more often than if they were independent.
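Support count, confidence, and lift for one hypothetical rule (if bread, then milk) over a made-up set of transactions:

```python
# Hypothetical market basket transactions
transactions = [{"bread", "milk"}, {"bread", "eggs"},
                {"bread", "milk", "eggs"}, {"milk"}, {"bread", "milk"}]
n = len(transactions)

def support_count(items):
    """Number of transactions containing every item in the set."""
    return sum(items <= t for t in transactions)

antecedent, consequent = {"bread"}, {"milk"}          # rule: if bread, then milk
confidence = support_count(antecedent | consequent) / support_count(antecedent)
lift = confidence / (support_count(consequent) / n)   # > 1 suggests a useful rule

print(confidence, lift)    # 0.75 and 0.9375 for these made-up baskets
```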

Text Mining – is the process of extracting useful information from text data.
- Is more challenging than data mining with traditional numerical data.
Text data - is often referred to as unstructured data because in its raw form, it cannot be stored
in a traditional structured database.
- Audio and video are other examples of unstructured data.
Corpus – collection of text documents to be analyzed
Presence/Absence or binary term-document matrix – a matrix with the rows representing
documents and the columns representing words, and the entries indicating either the presence or
the absence of a particular word in a particular document (1 = present and 0 = not present).
Tokenization – is the process of dividing text into separate terms, referred to as tokens. The
process of identifying tokens is not straightforward. First, symbols and punctuation must be
removed from the document, and all letters should be converted to lowercase.
Stemming – process of converting a word to its stem or root word.
Frequency term-document matrix – is a matrix whose rows represent documents and columns
represent tokens, and the entries in the matrix are the frequency of occurrence of each token in
each document.
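A binary or frequency term-document matrix can be sketched with scikit-learn's CountVectorizer, which handles tokenization and lowercasing (but not stemming); the three-document corpus is made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The team won the game",          # hypothetical corpus of three documents
          "The game was delayed",
          "Fans loved the team"]

# binary=True -> presence/absence matrix; binary=False -> frequency matrix
vec = CountVectorizer(lowercase=True, binary=True)
X = vec.fit_transform(corpus)               # rows = documents, columns = tokens

print(vec.get_feature_names_out())
print(X.toarray())
```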
