You are on page 1of 61

Spreadsheet-Based Decision Support Systems

Chapter 7: Statistical Analysis

Prof. Name
Position
University Name

name@email.com
(123) 456-7890

Overview

7.1 Introduction
7.2 Understanding Data
7.3 Relationships in Data
7.4 Distributions
7.5 Summary

Introduction

Performing basic statistical analysis of data using Excel functions

Statistical features of the Data Analysis Toolpack

Trend curves for analyzing data patterns

Basic linear regression techniques in Excel

Several different distribution functions in Excel

Understanding Data

Statistical Functions

Descriptive Statistics

Histograms

Statistical Functions

AVERAGE
Finds the mean of a set of data.
=AVERAGE(range or range_name)

MEDIAN
Finds the middle number in a list of sorted data.
=MEDIAN(range or range_name)

STDEV
Finds the standard deviation of a set of data.
This is equal to the square root of the variance, which measures the
difference between the mean of the data set and the individual values.
=STDEV(range or range_name)

Figures 7.1 and 7.2

Figures 7.3 and 7.4

Analysis Toolpack

An Excel Add-In which includes several statistical analysis techniques

To ensure that it is an active Add-in, choose Tools > Add-ins from the
menu. Select Analysis Toolpack from the list.

Descriptive Statistics

Provides a list of statistical information about your data set including

Mean
Median
Standard deviation
Variance

Go to Tools > Data Analysis > Descriptive Statistics

Descriptive Statistics (cont)

The Input Range refers to the location of


the data set.

You can check whether your data is


Grouped By Columns or Rows.

If there are labels in the first row of each


column of data, then check the Labels in
First Row box.

The Output Range refers to where you


want the results of the analysis to be
displayed in the current worksheet.

The Summary Statistics box will


calculate the most commonly used
statistics from our data.

Figure 7.7

Quarterly stock returns for three different companies are recorded. We


want to know
Average stock return
Variability of stock returns
Which quarters had the highest and lowest stock returns

Figures 7.8 and 7.9

Figure 7.11

The standard deviation can be used to understand how common


outliers are in the data.

More Descriptive Statistics

Confidence Level for Mean


The mean is calculated using the specified confidence level (for example, 95% or
99%), the standard deviation, and the size of the sample data.
The confidence level and calculated mean are then added to the analysis report.
You can compare the actual mean to this calculated mean based on the specified
confidence level.

Kth Largest
Gives the largest ranked data value for a specified value of k.
For k = 1, the maximum data value would be returned.

Kth Smallest
Gives the smallest ranked data value for a specified value of k.
For k = 1, the minimum data value would be returned.

Descriptive Statistics Functions

PERCENTILE
Returns a value for which a desired percentile k of the specified data_set falls below.
=PERCENTILE(data_set, k)

For example, for the MSFT data, the value for which 95% of the data falls below
is
=PERCENTILE(B4:B27,0.95) = 0.108

PERCENTRANK
Returns the percentile of the data_set which falls below a given value.
=PERCENTRANK(data_set, value)

For example, the percent of the MSFT data which falls below the value 0.108 is
=PERCENTRANK(B4:B27, 0.108) = 0.95, or 95%

Histograms

Histograms calculate the number of occurrences, or frequency, which


values in a data set fall into various intervals.

Choose the Histogram option from the Analysis Toolpack list.

Histograms (contd)

The Input Range is the range of the data


set.

The Bin Range is used to specify the


location of the bin values.
Bins are the intervals into which values can
fall; they can be defined by the user or can
be evenly distributed among the data by
Excel.

The Output Range is the location of the


output, or the frequency calculations for
each bin.

The chart options include a simple Chart


Output (the actual histogram), Cumulative
Percentage for each bin value, and a
Pareto organization of the chart.

Figures 7.15 and 7.16

Figures 7.17 and 7.18

To create your own bin values, make a list of upper bounds for each
interval.

Figure 7.19

Histograms (contd)

Histograms can also be formatted.


Right-click on the histogram and change the Chart Options or other
parameters.

Histograms (cont)

There are four basic shapes to a histogram:


Symmetric: has only one peak; that is, there is a central high part and almost
equal lower parts to the left and right of this peak.
Positively skewed: has a peak on the left and many lower points (stretching)
to the right.
Negatively skewed: has a peak on the right and many lower points
(stretching) to the left.
Multiple peaks: imply that more than one source, or population, of data is
being evaluated.

Relationships in Data

Trend Curves

Regression

Data Relationships

Relationships in data are usually identified by comparing two variables:


the dependent variable and the independent variable.
The dependent variable is the variable we are most interested in. By
understanding its current behavior we can better predict its future behavior.
The independent variable is the variable we use as the comparison in order
to make this prediction.

Trend Curves

Trend curves are used to graph and analyze these relationships between
data.

Trend curves graph the data with


the independent variable on the x-axis
the dependent variable on the y-axis

To add a trend curve to your chart, right-click on the data points in an XY


Scatter chart and choose Add Trendline from the drop-down list of
options.

Trend Curves (contd)

There are five basic trend


curves which Excel can model:

Linear
Exponential
Power
Moving Average
Logarithmic

Trend Curves (contd)

Click on the Options tab to set


options for the trend curve.

Set the name of the trendline.

Specify a period forward or


backwards for which you want to
predict the behavior of your
dependent variable.

Check to Display Equation and


Display R-Squared Value.

Linear Trend Curves

Number of Units Produced each month


and the corresponding Monthly Plant Cost
are recorded.

The company wants to be able to estimate


their plant costs based on the planned
production amounts.

The dependent variable is therefore the


Monthly Plant Cost and the independent
variable is the Units Produced.

Figures 7.26 and 7.29

Graph the data and then add a Linear trendline.

Figure 7.30

Use the displayed equation to predict future values.


First check the accuracy of the equation by calculating the error from the
known data.

Linear trends have the relationship: y = a*x - b

Exponential Trend Curves

Sales data for ten years is recorded.

We want to be able to predict sales for the next few years.

The independent variable is Years and our dependent variable is


Sales.

Figures 7.34 and 7.35

Exponential trends have the relationship: y = a*e^(b*x) or


y = a*EXP(b*x)

Power Trend Curves

We are given yearly Production values and yearly Unit Cost for
production.

We want to determine the relationship between Unit Cost and Production


in order to be able to predict future Unit Costs.

Figures 7.39 and 7.40

Power trends have the relationship: y = a*x^b

Regression Analysis

We can use some regression analysis parameters to ensure that the


relationships we have chosen for our data are good fits.

These parameters include

R-Squared value
Standard error
Slope
Intercept

R-Squared Value

The R-Squared value measures the amount of influence the


independent variable has on the dependent variable.

The closer the R-Squared value is to 1, the stronger the relationship is


between the independent and dependent variables.

If the R-Squared value is closer to 0, then there may not be a


relationship between these two variables.

Figure 7.42

We fit a Linear trendline to the


Monthly Plant Cost per Units
Produced chart (see Figure 7.44).

The R-Squared value is 0.8137,


which is fairly close to 1, implying a
good fit.

Figure 7.45

The RSQ Excel function can calculate the


R-squared value from a set of data.
=RSQ(y_range, x_range)

Note that this function only works with


Linear trend curves.

Standard Error

The standard error measures the accuracy


of any predictions made.

It can be calculated in Excel using the


STEYX function
=STEYX(y_range, x_range)

This function can also only be used for


Linear trend curves.

Slope and Intercept

Two Excel functions can be used with a linear regression line of a


collection of data.

SLOPE function
=SLOPE(y_range, x_range)

INTERCEPT function
=INTERCEPT(y_range, x_range)

Distributions

Many distributions have Excel functions associated with them.


These functions are basically equivalent to using distribution tables.
That is, given certain parameters of a set of data for a particular
distribution, you would look at a distribution table to find the corresponding
area from the distribution curve.

Some common distributions are

Normal
Exponential
Uniform
Binomial
Poisson
Beta
Weibull

Normal Distribution

The parameters for this distribution are simply the value we are
interested in finding the probability for, and the mean and standard
deviation of the set of data.

The function we use with the Normal distribution is NORMDIST


=NORMDIST(x, mean, std_dev, cumulative)

Normal Distribution (cont)

The cumulative parameter will be seen in many Excel distribution


functions.

This parameter can take the values True or False to determine if you
want the value returned from the cumulative distribution function or
the probability density function, respectively.
The cumulative distribution function (cdf) will find the probability that a value
in the data set is less than or equal to x.
The probability density function (pdf) will find the probability that a value is
exactly equal to x.

Figure 7.48

Annual drug sales at a local drugstore are distributed Normally with a


mean of 40,000 and standard deviation of 10,000.

The probability that the actual sales for the year are 42,000 is 0.58, or
58%.

Figure 7.49

What is the probability that annual sales will be between 35,000 and
49,000?
To find this value, we will subtract the cdf values for these two bounds.
=NORMDIST(49000, 40000, 10000, True)
NORMDIST(35000, 40000, 10000, True)

This will return a 0.51 probability, or 51% chance.

Standard Normal Distribution

If the mean of your data is 0 and the standard deviation is 1, then placing
these values in the NORMDIST function with the cumulative parameter
as True will find the resulting value from the Standard Normal
distribution.

The STANDARDIZE function will convert the x value from a data set of a
mean not equal to 0 and a standard deviation not equal to 1 into a value
which does assume a mean of 0 and a standard deviation of 1.
=STANDARDIZE(x, mean, std_dev)

The resulting standardized value is then used as the main parameter in


the NORMSDIST function
=NORMSDIST(standardized_x)

Figure 7.50

Consider the same example used previously to find the probability that a
drugstores annual sales are 42,000.

Uniform Distribution

The Uniform distribution does not actually have a corresponding Excel


function; however, a simple formula can be used to model the Uniform
distribution.
1 / (b a)

Given that a value x is Uniformly distributed between a and b, we can


use this formula to determine the probability that x will have an integer
value in this interval.

Figure 7.51

Consider any values for a and b, then use the formula to calculate the
Uniform value.

Poisson Distribution

The Poisson distribution has only the mean as its parameter.

The function we use for this distribution is POISSON


=POISSON(x, mean, cumulative)

The Poisson distribution value is the probability that the number events
which occur is either between 0 and x (cdf) or equal to x (pdf).

Figure 7.52

For example, consider a bakery which serves an average of 20


customers per hour.
Find the probability that at most 35 customers will be served in the next
two hours.

Exponential Distribution

The Exponential distribution has only one parameter: lambda = 1 / mean


of the data set.

The function we use for this distribution is EXPONDIST


=EXPONDIST(x, lambda, cumulative)

The Exponential distribution is commonly used for modeling interarrival


times.

Figure 7.53

Let us use the same example with the bakery data.


Arrival rate is said to be 20 customers per hour.
Interarrival mean, or the Exponential mean, is 1 / arrival rate. Therefore,
for this example, the interarrival mean is 1/20 hours per customer arrival.
To find the probability that a customer arrives in 10 minutes, we would
set
x = 10/60 = 0.17 hours
lambda = 1/(1/20) = 20 hours
=EXPONDIST(0.17, 20, True)

Binomial Distribution

The Binomial distribution has the following parameters: the number of


trials and the probability of a success.

We are trying to determine the probability that the number of successes


is less than or equal to (using cdf) or equal to (pdf) some x value.

The function we use for this distribution is BINOMDIST


=BINOMDIST(x, trials, prob_success, cumulative)

Figure 7.54

Suppose a survey shows that 40 percent of people pay more attention to


ads in the newspaper, and 60 percent pays more attention to ads on
television.
What is the probability that out of 100 people surveyed, 50 of them
respond more to ads on television?

Beta Distribution

The Beta distribution has the following parameters: alpha, beta, A, and B.
Alpha and beta are determined from the data set
A and B are optional bounds on the x value for which you want the Beta
distribution value

The function we use for this distribution is BETADIST


=BETADIST(x, alpha, beta, A, B)

If A and B are omitted, then a standard cumulative distribution is


assumed and they are given the values 0 and 1, respectively.

Figure 7.55

Determine the probability that a team can complete a project in 10 days.

Estimate the total time needed to be 1 to 2 weeks; these estimates will


be the bound values, or the A and B parameters.

Use a mean and standard deviation of 12 and 3 days to compute the


alpha and beta parameters.

Weibull Distribution

The Weibull distribution has the parameters alpha and beta.

The function we use for this distribution is WEIBULL


=WEIBULL(x, alpha, beta, cumulative)

The Weibull distribution is most commonly used to determine reliability


functions.

Figure 7.56

On average, a lightbulb will last 1200 hours, with a standard deviation of


100 hours. We can use these values to calculate alpha and beta.

We can now use the WEIBULL distribution to determine the probability


that a lightbulb will be reliable for 55 days = 1320 hours.

Summary

The Analysis Toolpack is an Excel Add-In that includes statistical analysis


techniques such as Descriptive Statistics, Histograms, Exponential
Smoothing, Correlation, Covariance, Moving Average, and others.
The Descriptive Statistics option provides a list of statistical information
about a data set, including the mean, median, standard deviation, and
variance.
Histograms calculate the number of occurrences, or frequency, which
values in a data set fall into various intervals.
Relationships in data are usually identified by comparing the dependent
variable and the independent variable.
There are five basic trend curves that Excel can model: Linear,
Exponential, Power, Moving Average, and Logarithmic.
Some of the more common distributions that can be recognized when
performing a statistical analysis of data are the Normal, Exponential,
Uniform, Binomial, Poisson, Beta, and Weibull distributions.

Additional Links

(place links here)

61

You might also like