
BUSINESS ANALYTICS MODULE 1

Describing and Summarizing Data

→ Before gathering and analyzing data, we should always identify the question we wish to answer.

→ Graphs are very useful for examining a data set, as they often reveal patterns and trends and help us detect
outliers.
• One useful graph is a histogram.
→ A histogram’s x-axis represents bins corresponding to ranges of data; its y-axis indicates the frequency of
observations falling into each bin.
• An outlier is a value that falls far from the rest of the data.
→ We should always carefully investigate an outlier before deciding whether to leave it as is, change its
value to the correct value, or remove it.
• Graphing two variables (two data sets) on a scatter plot can reveal relationships between them.
→ Although there may be a relationship between two variables, we cannot conclude that one variable
“causes” the other. This point is best summarized in the admonition, “correlation does not imply
causation.”
→ Be alert to the possibility of hidden variables, which may be responsible for patterns we see when
graphing or examining relationships between two data sets.
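
To make the histogram idea concrete, here is a minimal Python sketch that groups values into bins and counts the frequency in each. The data, the bin width of 10, and the half-open bin ranges are assumptions chosen for illustration; they are not taken from the course and need not match how Excel's histogram tool defines its bins.

```python
from collections import Counter

# Hypothetical sample data, for illustration only.
data = [12, 15, 17, 21, 22, 24, 25, 31, 33, 95]
bin_width = 10  # assumed bin width

# Map each observation to the lower edge of its bin, e.g. 17 -> 10, 95 -> 90.
bin_counts = Counter((value // bin_width) * bin_width for value in data)

# Print a text histogram: bin range on the left, frequency as a bar.
for lower in range(min(bin_counts), max(bin_counts) + bin_width, bin_width):
    count = bin_counts.get(lower, 0)
    print(f"{lower:>3}-{lower + bin_width - 1:<3} {'#' * count} ({count})")
```

In this made-up data set, the value 95 sits alone in its bin, separated from the others by several empty bins. That is the kind of point we would investigate as a possible outlier.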

→ To summarize a data set numerically, we often use descriptive statistics, also known as summary statistics.
• Three values describe the center, or central tendency, of the data set:
→ The mean is equal to the sum of all data points in the set divided by the number of data points:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

→ The median is the middle value of the data set: half of the data set’s values lie below the median, and half
lie above the median.
→ The mode is the value that occurs most frequently in the data set. A data set may have multiple modes.
• The range, variance, and standard deviation measure the spread of the data.
→ The standard deviation is equal to the square root of the variance.
→ To compare variation in different data sets, we calculate the coefficient of variation. The coefficient of
variation measures the size of the standard deviation relative to the size of the mean (that is,
$\text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}$).
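
The summary statistics above can be sketched with Python's standard `statistics` module. The six data values are hypothetical. The sample versions of variance and standard deviation (dividing by n - 1) are used here on the assumption that the data are a sample rather than a full population.

```python
import statistics

# Hypothetical sample, for illustration only.
data = [2, 4, 4, 5, 7, 8]

# Measures of center.
mean = statistics.mean(data)          # sum of values / number of values
median = statistics.median(data)      # average of the two middle values here
modes = statistics.multimode(data)    # a list, since a data set may have several modes

# Measures of spread.
data_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance
coeff_var = std_dev / mean            # standard deviation relative to the mean

print(mean, median, modes, data_range, variance, std_dev, coeff_var)
```

For this data set the mean is 5, the median is 4.5, the only mode is 4, the range is 6, and the sample variance is 4.8.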

→ We can also calculate a conditional mean. A conditional mean is the mean of a subset of the data that includes all
values satisfying a certain condition.
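
As a small illustration of a conditional mean, the sketch below averages only the values that meet a condition. The order values and the threshold of 100 are hypothetical.

```python
# Hypothetical order values, for illustration only.
order_values = [120, 45, 300, 80, 210, 60]
threshold = 100  # assumed condition: orders of at least 100

# Keep only the values that satisfy the condition, then average that subset.
large_orders = [v for v in order_values if v >= threshold]
conditional_mean = sum(large_orders) / len(large_orders)

# The overall mean, for comparison, uses every value.
overall_mean = sum(order_values) / len(order_values)

print(conditional_mean, overall_mean)
```

Here the conditional mean (210) is well above the overall mean, because the condition excludes the three smallest orders.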

→ A percentile may be another value of interest. For example, 60% of the observations are less than or equal to the
60th percentile. The median is by definition the 50th percentile of a data set.




→ We can quantify the strength of a linear relationship between two variables by calculating the correlation
coefficient.
• The value of the correlation coefficient ranges between -1 and +1.
• A correlation coefficient near zero indicates a weak or nonexistent linear relationship. A correlation coefficient
near zero does not mean there is no relationship between the two variables; it indicates only that any
relationship that does exist is not linear.
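
The point about near-zero correlation can be checked with a short Python sketch. It assumes the standard (Pearson) correlation formula, which the text does not name explicitly, and uses made-up data: one series that is an exact linear function of x and one that is an exact but curved function of x.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]  # perfectly linear, increasing
curved = [x ** 2 for x in xs]     # exact relationship, but not linear

r_linear = correlation(xs, linear)
r_curved = correlation(xs, curved)
print(r_linear, r_curved)
```

The linear series gives a coefficient of essentially +1, while the curved series gives 0 even though y is completely determined by x.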

→ When one of the variables is time, the relationship is known as a time series. Cross-sectional data provide a
snapshot of data across multiple groups at a given point in time.

   


EXCEL SUMMARY

Recall the Excel functions and analyses covered in this course and make sure to familiarize yourself with all of the
necessary steps, syntax, and arguments. We have provided some additional information for the more complex
functions listed below. As usual, the arguments shown in square brackets are optional. The functions whose names
include “S” are applied to samples rather than populations.

→ Using the Data Analysis tool to:


• Create bins and histograms
• Create the Descriptive Statistics output table

→ Creating scatter plots

→ =AVERAGE(number 1, [number 2], …)

→ =MEDIAN(number 1, [number 2], …)

→ =MODE.SNGL(number 1, [number 2], …)

→ =AVERAGEIF(range, criteria, [average_range])


• Returns the conditional mean, or average of the cells in a specified range that meet the given criteria.
• range contains the one or more cells to which we wish to apply the criteria or condition.
• criteria is the condition that is to be applied to the range.
• [average_range] is the range of cells containing the data we wish to average. If it is omitted, the cells in range are averaged.
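
To show how the three arguments relate, here is a rough Python analogue. The helper `average_if`, the region labels, and the sales figures are all hypothetical; the sketch only mirrors the idea of testing one range and averaging another, and is not a reimplementation of Excel's criteria syntax.

```python
def average_if(criteria_range, predicate, average_range=None):
    """Average the entries whose matching criteria cell passes the test (hypothetical helper)."""
    # With no separate average_range, average the criteria cells themselves.
    if average_range is None:
        average_range = criteria_range
    selected = [
        value
        for cell, value in zip(criteria_range, average_range)
        if predicate(cell)
    ]
    # Raises ZeroDivisionError if no cell meets the condition.
    return sum(selected) / len(selected)

regions = ["East", "West", "East", "South", "East"]
sales = [100, 250, 140, 90, 120]

# Condition tested on one range, values averaged from another.
east_average = average_if(regions, lambda region: region == "East", sales)

# Condition tested on the same range that is averaged.
high_sales_average = average_if(sales, lambda amount: amount > 110)

print(east_average, high_sales_average)
```

The first call tests the region column and averages the sales column, giving 120. The second tests and averages the sales column itself, giving 170.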

→ =PERCENTILE.INC(array, k)
• Returns the k-th percentile of the values in the specified array. For example, if we want to know the 95th percentile
for an array of data, k would be 0.95.
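
One common way to compute an inclusive percentile is linear interpolation between the sorted values, with k expressed as a fraction from 0 to 1. The sketch below uses that approach as an assumption. The function name `percentile_inc` and the scores are hypothetical, and results should be checked against Excel before relying on an exact match.

```python
import math

def percentile_inc(values, k):
    """Percentile by linear interpolation, with k between 0 and 1 inclusive."""
    ordered = sorted(values)
    position = k * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Move the fractional distance from the lower neighbor toward the upper one.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

scores = [15, 20, 35, 40, 50]
p50 = percentile_inc(scores, 0.5)    # the median
p95 = percentile_inc(scores, 0.95)   # between the two largest values
print(p50, p95)
```

With these five scores, k = 0.5 returns the middle value, 35, and k = 0.95 lands four fifths of the way from 40 to 50, or about 48.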

→ =VAR.S(number 1, [number 2], …)

→ =STDEV.S(number 1, [number 2], …)

→ =SQRT(number)

→ =COUNT(value 1, [value 2], …)


→ =MIN(number 1, [number 2], …)

→ =MAX(number 1, [number 2], …)

→ =SUM(number 1, [number 2], …)

→ =CORREL(array 1, array 2)
