
Introduction to Business Analytics

Descriptive Statistics

♥ provide summarizing information on the characteristics and distribution of values
♥ allow analysts to have a quick glance at the central tendency and the degree of dispersion of values
♥ help describe data in a meaningful way, such that patterns might emerge from the data

Contents

1. Data Cleansing
2. Modifying Data in Excel
3. Creating Distributions from Data
4. Measures of Location and Variability
5. Measures of Association Between Two Variables

1. Data Cleansing

Missing Data
◦ Legitimately missing: no remedial action.
◦ Illegitimately missing:
♥ discard observations (rows) with any missing values.
♥ discard any variable (column) with missing values.
♥ fill in missing entries with estimated values.
♥ apply a data-mining algorithm that can handle missing values.

◦ Example: Blakely Tires (Data file TreadWear)
A U.S. producer of automobile tires wants to learn about the conditions of its tires on automobiles in Texas.
Step 1. Enter the heading # of Missing Values in cell G2
Step 2. Enter the heading Life of Tire (Months) in cell H1
Step 3. Enter =COUNTBLANK(C2:C457) in cell H2
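The COUNTBLANK step can be mirrored outside Excel. The following is a minimal sketch, assuming a blank cell is represented by None; the values in life_of_tire are hypothetical and are not taken from the TreadWear file.

```python
# Hypothetical column of values; None stands in for a blank cell.
life_of_tire = [60.1, None, 58.4, 61.0, None, 59.7]

# Equivalent of COUNTBLANK: count the entries that are missing.
n_missing = sum(1 for value in life_of_tire if value is None)

# Row positions (0-based) of the missing entries, useful for follow-up.
missing_rows = [i for i, value in enumerate(life_of_tire) if value is None]

print(n_missing, missing_rows)
```

Listing the row positions plays the same role as sorting the sheet to bring the blank cells together.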

Assoc.Prof. Nguyen Vinh 1


◦ Example: Blakely Tires (Data file TreadWear)
Figure 2.1: Portion of Excel Spreadsheet Showing Number of Missing Values
Sort all of the data on Miles from smallest to largest value to determine which observation is missing its value for this variable.
Figure 2.2: Data Sorted on Miles from Lowest to Highest Value
=> Sort all the data on ID Number and scroll through the data to find the four tires that belong to the automobile whose ID Number has the missing data.
Figure 2.3: Data Sorted on ID Number

◦ Identify Erroneous Outliers and other Erroneous Values
- Summary statistics, frequency distributions, bar charts, histograms, and other tools can uncover data-quality issues and outliers.
- Example: minimum and maximum values for Life of Tires (Months)
Figure 2.4: Minimum and Maximum Values for Life of Tires (Months)

| Business Analytics | Introduction to BA | Nano_PAMS Program |

Figure 2.5: Data Sorted on ID Number
The values of Life of Tires for the other three tires on the same automobile are all the same, 60.1
=> the decimal point for the suspect value is in the wrong place.

Figure 2.6: Scatter Diagram of Tread Depth and Miles
The points that lie outside of the ellipse drawn in the figure may be inaccurate and should be investigated.

Variable Representation:
In many data-mining applications, it may be prohibitive to analyze the data because of the number of variables recorded.
Dimension reduction is the process of removing variables from the analysis without losing crucial information.
A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider.
Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship.

2. Modifying Data in Excel

◦ Sorting and Filtering Data in Excel
Figure 2.7: Using Excel’s Sort Function to Sort the Top-Selling Automobiles Data
To sort the automobiles by March 2010 sales:
◦ Step 1: Select cells A1:F21.
◦ Step 2: Click the Data tab in the Ribbon.
◦ Step 3: Click Sort in the Sort & Filter group.
◦ Step 4: Select the check box for My data has headers.
◦ Step 5: In the first Sort by dropdown menu, select Sales (March 2010).
◦ Step 6: In the Order dropdown menu, select Largest to Smallest.
◦ Step 7: Click OK.
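The sort above can be sketched in code as well. This is only an illustration: the rows, model names, and the key sales_mar_2010 are hypothetical stand-ins, since the Top-Selling Automobiles data are not reproduced in these slides.

```python
# Hypothetical rows; names and figures are illustrative only.
autos = [
    {"model": "Model A", "manufacturer": "Toyota", "sales_mar_2010": 1200},
    {"model": "Model B", "manufacturer": "Honda", "sales_mar_2010": 3400},
    {"model": "Model C", "manufacturer": "Ford", "sales_mar_2010": 2100},
]

# Sort by the sales column, Largest to Smallest (reverse=True).
by_sales = sorted(autos, key=lambda row: row["sales_mar_2010"], reverse=True)

print([row["model"] for row in by_sales])
```

Like Excel's Sort, sorted() moves whole rows together, so each model keeps its own manufacturer and sales figure.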

◦ Sorting and Filtering Data in Excel
Figure 2.8: Using Excel’s Filter Function to Show Only Toyota
Using Excel’s Filter function to see the sales of models made by Toyota:
◦ Step 1: Select cells A1:F21.
◦ Step 2: Click the Data tab in the Ribbon.
◦ Step 3: Click Filter in the Sort & Filter group.
◦ Step 4: Click on the Filter Arrow in column B, next to Manufacturer.
◦ Step 5: If all choices are checked, you can easily deselect all choices by unchecking (Select All). Then select only the check box for Toyota.
◦ Step 6: Click OK.
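A filter keeps only the rows that satisfy a condition. The sketch below assumes hypothetical rows and field names; it is not the data behind Figure 2.8.

```python
# Hypothetical rows; names and figures are illustrative only.
autos = [
    {"model": "Model A", "manufacturer": "Toyota"},
    {"model": "Model B", "manufacturer": "Honda"},
    {"model": "Model C", "manufacturer": "Toyota"},
    {"model": "Model D", "manufacturer": "Ford"},
]

# Keep only the rows whose Manufacturer is Toyota.
toyota_only = [row for row in autos if row["manufacturer"] == "Toyota"]

print([row["model"] for row in toyota_only])
```

As in Excel, the original rows are left untouched; the filter only produces a restricted view.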

◦ Conditional Formatting of Data in Excel: Makes it easy to identify data that satisfy certain conditions in a data set.
Figure 2.9: Using Conditional Formatting to Highlight Automobiles with Declining Sales
To highlight the automobile models whose sales decreased from Mar 2010 to Mar 2011:
◦ Step 1: Starting with the original data, select cells F1:F21.
◦ Step 2: Click on the Home tab in the Ribbon.
◦ Step 3: Click Conditional Formatting in the Styles group.
◦ Step 4: Select Highlight Cells Rules, and click Less Than from the dropdown menu.
◦ Step 5: Enter 0% in the Format cells that are LESS THAN: box.
◦ Step 6: Click OK.
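The Less Than 0% rule amounts to flagging every model whose percent change is negative. A minimal sketch, with hypothetical model names and percent changes:

```python
# Hypothetical percent changes in sales (Mar 2010 to Mar 2011), as fractions.
pct_change = {
    "Model A": 0.12,
    "Model B": -0.05,
    "Model C": -0.20,
    "Model D": 0.00,
}

# Same rule as "Format cells that are LESS THAN 0%".
declining = [model for model, change in pct_change.items() if change < 0]

print(declining)
```

A change of exactly 0% is not flagged, which matches the strict Less Than rule.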

Figure 2.10: Using Conditional Formatting in Excel to Generate Data Bars
Figure 2.11: Using the Quick Analysis button, which offers shortcuts for Conditional Formatting, adding Data Bars, and other operations.

3. Creating Distributions from Data

Frequency distribution: A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes (bins).

Table 2.1: Data from a Sample of 50 Soft Drink Purchases (file SoftDrinks)

Coca-Cola     Sprite        Pepsi
Diet Coke     Coca-Cola     Coca-Cola
Pepsi         Diet Coke     Coca-Cola
Diet Coke     Coca-Cola     Coca-Cola
Coca-Cola     Diet Coke     Pepsi
Coca-Cola     Coca-Cola     Dr. Pepper
Dr. Pepper    Sprite        Coca-Cola
Diet Coke     Pepsi         Diet Coke
Pepsi         Coca-Cola     Pepsi
Pepsi         Coca-Cola     Pepsi
Coca-Cola     Coca-Cola     Pepsi
Dr. Pepper    Pepsi         Pepsi
Sprite        Coca-Cola     Coca-Cola
Coca-Cola     Sprite        Dr. Pepper
Diet Coke     Dr. Pepper    Pepsi
Coca-Cola     Pepsi         Sprite
Coca-Cola     Diet Coke

◦ Frequency Distributions for Categorical Data
Figure 2.11: Creating a Frequency Distribution for Soft Drinks Data in Excel (using the COUNTIF function)

Table 2.2: Frequency Distribution of Soft Drink Purchases

Soft Drink    Frequency
Coca-Cola     19
Diet Coke     8
Dr. Pepper    5
Pepsi         13
Sprite        5
Total         50
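The COUNTIF tally can be reproduced with a counter. This sketch uses the 50 purchases of Table 2.1 read row by row; collections.Counter is a swapped-in tool for illustration, not the method used in the slides.

```python
from collections import Counter

# The 50 purchases of Table 2.1, read row by row.
purchases = [
    "Coca-Cola", "Sprite", "Pepsi", "Diet Coke", "Coca-Cola", "Coca-Cola",
    "Pepsi", "Diet Coke", "Coca-Cola", "Diet Coke", "Coca-Cola", "Coca-Cola",
    "Coca-Cola", "Diet Coke", "Pepsi", "Coca-Cola", "Coca-Cola", "Dr. Pepper",
    "Dr. Pepper", "Sprite", "Coca-Cola", "Diet Coke", "Pepsi", "Diet Coke",
    "Pepsi", "Coca-Cola", "Pepsi", "Pepsi", "Coca-Cola", "Pepsi",
    "Coca-Cola", "Coca-Cola", "Pepsi", "Dr. Pepper", "Pepsi", "Pepsi",
    "Sprite", "Coca-Cola", "Coca-Cola", "Coca-Cola", "Sprite", "Dr. Pepper",
    "Diet Coke", "Dr. Pepper", "Pepsi", "Coca-Cola", "Pepsi", "Sprite",
    "Coca-Cola", "Diet Coke",
]

# One count per distinct soft drink, like one COUNTIF per category.
freq = Counter(purchases)

for drink in sorted(freq):
    print(drink, freq[drink])
```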

◦ Relative Frequency and Percent Frequency Distributions
◦ Relative frequency: fraction or proportion of items belonging to a class.
◦ Relative frequency distribution: A tabular summary of data showing the relative frequency for each bin.
◦ Percent frequency distribution: Summarizes the percent frequency of the data for each bin.

Table 2.3: Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases

Soft Drink    Relative Frequency    Percent Frequency (%)
Coca-Cola     0.38                  38
Diet Coke     0.16                  16
Dr. Pepper    0.10                  10
Pepsi         0.26                  26
Sprite        0.10                  10
Total         1.00                  100
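Relative and percent frequencies follow directly from the counts in Table 2.2. A minimal sketch:

```python
# Frequencies from Table 2.2.
freq = {"Coca-Cola": 19, "Diet Coke": 8, "Dr. Pepper": 5, "Pepsi": 13, "Sprite": 5}

n = sum(freq.values())  # total number of purchases

# Relative frequency = frequency / n; percent frequency = 100 * relative.
relative = {drink: count / n for drink, count in freq.items()}
percent = {drink: 100 * count / n for drink, count in freq.items()}

for drink in freq:
    print(drink, round(relative[drink], 2), round(percent[drink]))
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any such table.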

◦ Frequency Distributions for Quantitative Data
Three steps are necessary to define the classes for a frequency distribution with quantitative data:
1. Determine the number of nonoverlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Table 2.4: Year-End Audit Times (Days)

12    14    19    18
15    15    18    17
20    27    22    23
22    21    33    28
14    18    16    13

Applying the three steps to the audit time data of Table 2.4:
Number of bins: 5
Approximate bin width = (largest data value − smallest data value) / number of bins = (33 − 12) / 5 = 4.2, rounded up to 5
Bin limits: 10–14, 15–19, 20–24, 25–29, 30–34

Table 2.5: Frequency, Relative Frequency, and Percent Frequency Distributions for the Audit Time Data

Audit Times (days)    Frequency    Relative Frequency    Percent Frequency
10–14                 4            0.20                  20
15–19                 8            0.40                  40
20–24                 5            0.25                  25
25–29                 2            0.10                  10
30–34                 1            0.05                  5
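The three steps for the audit time data can be sketched as follows. The sketch assumes the usual rule of dividing the data range by the number of bins and rounding up, and it takes the first lower limit of 10 from Table 2.5.

```python
import math

# Year-end audit times in days (Table 2.4).
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

n_bins = 5                                                      # step 1
approx_width = (max(audit_times) - min(audit_times)) / n_bins   # step 2
width = math.ceil(approx_width)                                 # round up to whole days

lower = 10  # first lower bin limit, as in Table 2.5            # step 3
bins = [(lower + i * width, lower + (i + 1) * width - 1) for i in range(n_bins)]

# Count the observations that fall inside each bin (limits inclusive).
freq = {b: sum(b[0] <= t <= b[1] for t in audit_times) for b in bins}

for (lo, hi), count in freq.items():
    print(f"{lo}-{hi}", count, count / len(audit_times))
```

Choosing 10 as the first lower limit is a presentation choice; any limit at or below the smallest value (12) that keeps every observation in exactly one bin would also work.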

Figure 2.12: Using Excel to Generate a Frequency Distribution for Audit Times Data

◦ Histograms: A common graphical presentation of quantitative data
◦ Constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.
◦ The frequency measure of each class is shown by drawing a rectangle whose base is the class limits on the horizontal axis and whose height is the corresponding frequency measure.

Histograms can be created in Excel using the Data Analysis ToolPak.
◦ Step 1. Click the Data tab in the Ribbon
◦ Step 2. Click Data Analysis in the Analyze group
◦ Step 3. When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
  In the Input Range: box, enter A2:D6
  In the Bin Range: box, enter A10:A14
  Under Output Options:, select New Worksheet Ply:
  Select the check box for Chart Output
  Click OK
Figure 2.13: Histogram for the Audit Time Data

◦ Histograms (cont.)
Histograms provide information about the shape, or form, of a distribution.
Skewness: Lack of symmetry.
Skewness is an important characteristic of the shape of a distribution.
Figure 2.17: Histograms Showing Distributions with Different Levels of Skewness

◦ Cumulative frequency distribution: A variation of the frequency distribution that provides another tabular summary of quantitative data.

Table 2.6: Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data

Audit Time (days)           Cumulative Frequency    Cumulative Relative Frequency    Cumulative Percent Frequency
Less than or equal to 14    4                       0.20                             20
Less than or equal to 19    12                      0.60                             60
Less than or equal to 24    17                      0.85                             85
Less than or equal to 29    19                      0.95                             95
Less than or equal to 34    20                      1.00                             100

4. Measures of Location and Variability

◦ Measures of Location
Mean/Arithmetic Mean: Average value for a variable.
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Population mean: $\mu$

Mean/Arithmetic Mean: example with 12 home sales

Home Sale    Selling Price ($)
1            138,000
2            254,000
3            186,000
4            257,500
5            108,000
6            254,000
7            138,000
8            298,000
9            199,500
10           208,000
11           142,000
12           456,250

$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{138{,}000 + 254{,}000 + \cdots + 456{,}250}{12} = \frac{2{,}639{,}250}{12} = 219{,}937.50$

Excel: AVERAGE function
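The mean of the 12 selling prices can be checked with a few lines of code. This is a minimal sketch of the same arithmetic that the AVERAGE function performs.

```python
# Selling prices ($) of the 12 home sales.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

total = sum(prices)               # sum of all observations
mean_price = total / len(prices)  # divide by the number of observations

print(total, mean_price)
```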

◦ Measures of Location
◦ Median: Value in the middle when the data are arranged in ascending order.
◦ Middle value, for an odd number of observations.
◦ Average of two middle values, for an even number of observations.

Excel: MEDIAN function
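For the 12 home selling prices the number of observations is even, so the median is the average of the two middle values. A minimal sketch that applies the definition and cross-checks it against Python's statistics.median:

```python
import statistics

# Selling prices ($) of the 12 home sales.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

ordered = sorted(prices)  # arrange in ascending order
n = len(ordered)

if n % 2 == 1:
    median_price = ordered[n // 2]                             # middle value
else:
    median_price = (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # average of two middle values

print(median_price, statistics.median(prices))
```

The median (203,750) is lower than the mean (219,937.50) because the single high price of 456,250 pulls the mean upward but not the median.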

◦ Measures of Location
◦ Mode: Value that occurs most frequently in a data set.
◦ Multimodal data: Data contain at least two modes.
◦ Bimodal data: Data contain exactly two modes.
Excel: MODE.SNGL function (MODE.MULT for multimodal data)

◦ Geometric mean: A measure of location that is calculated by finding the nth root of the product of n values.
Used in analyzing growth rates in financial data.

◦ Measures of Location

Table 2.7: Percentage Annual Returns and Growth Factors for the Mutual Fund Data

Year    Return (%)    Growth Factor
1       −22.1         0.779
2       28.7          1.287
3       10.9          1.109
4       4.9           1.049
5       15.8          1.158
6       5.5           1.055
7       −37.0         0.630
8       26.5          1.265
9       15.1          1.151
10      2.1           1.021

Value of $100 after 10 years:
$\$100\,(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021) = \$100\,(1.335) = \$133.45$

Geometric mean: $\bar{x}_g = \sqrt[10]{1.335} = 1.029$, an average annual growth rate of 2.9%.
Excel: GEOMEAN function

◦ Measures of Variability
Figure 2.19: Histograms for Payouts of Past 20 Years from Fund A and Fund B

Table 2.8: Annual Payouts for Two Different Investment Funds (cont.)

Year    Fund A ($)    Fund B ($)
12      1,100         890
13      1,100         1,050
14      1,100         800
15      1,100         1,150
16      1,100         1,200
17      1,100         1,800
18      1,100         100
19      1,100         1,750
20      1,100         1,000
Mean    1,100         1,100
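The geometric-mean calculation for the mutual fund returns of Table 2.7 can be reproduced as a short sketch. It computes each growth factor as 1 + return/100, multiplies them, and takes the 10th root.

```python
import math

# Annual returns (%) for years 1 to 10, from Table 2.7.
returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8, 5.5, -37.0, 26.5, 15.1, 2.1]

# Growth factor for each year: 1 + return expressed as a fraction.
growth = [1 + r / 100 for r in returns_pct]

product = math.prod(growth)               # overall growth over 10 years
geo_mean = product ** (1 / len(growth))   # nth root of the product

print(round(100 * product, 2))   # value of $100 after 10 years
print(round(geo_mean, 3))        # average annual growth factor
```

The arithmetic mean of the returns would overstate the growth, because returns compound multiplicatively rather than additively.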

◦ Measures of Variability
The range can be found by subtracting the smallest value from the largest value in a data set.

Variance is a measure of variability that utilizes all the data.
It is based on the deviation about the mean, which is the difference between the value of each observation ($x_i$) and the mean.
The deviations about the mean are squared while computing the variance.

Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

◦ Measures of Variability
Standard deviation is the positive square root of the variance.
Measured in the same units as the original data.

Calculating Variability Measures for the Home Sales Data in Excel (practice)
Variance: VAR.S function
Standard deviation: STDEV.S function
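The practice exercise on the home sales data can be sketched in code. One point the sketch makes explicit: VAR.S and STDEV.S are the sample versions, which divide by n − 1, whereas the population variance formula divides by N.

```python
import statistics

# Selling prices ($) of the 12 home sales.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

n = len(prices)
mean_price = sum(prices) / n

value_range = max(prices) - min(prices)   # largest minus smallest

# Squared deviations about the mean.
squared_dev = [(x - mean_price) ** 2 for x in prices]

pop_var = sum(squared_dev) / n            # population variance (divide by N)
sample_var = sum(squared_dev) / (n - 1)   # sample variance, as in VAR.S
sample_sd = sample_var ** 0.5             # sample standard deviation, as in STDEV.S

print(value_range, sample_var, sample_sd)
print(statistics.variance(prices), statistics.stdev(prices))  # library cross-check
```

The standard deviation is in dollars, the same unit as the prices, while the variance is in squared dollars.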

◦ Measures of Variability
Coefficient of variation is a descriptive statistic that indicates how large the standard deviation is relative to the mean.
Expressed as a percentage.

5. Measures of Association Between Two Variables

Table 2.9: Data for Bottled Water Sales at Queensland Amusement Park for a Sample of 14 Summer Days (file Bottledwater)

High Temperature (°F)    Bottled Water Sales (cases)
78                       23
79                       22
80                       24
80                       22
82                       24
83                       26
85                       27
86                       25
87                       28
87                       26
88                       29
88                       30
90                       31
92                       31

◦ Scatter Charts: A scatter chart is a useful graph for analyzing the relationship between two variables.
Figure 2.20: Chart Showing the Positive Linear Relation Between Sales and High Temperatures

◦ Covariance is a descriptive measure of the linear association between two variables:

Population covariance: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

◦ Covariance
Table 2.10: Sample Covariance Calculations for Daily High Temperature and Bottled Water Sales at Queensland Amusement Park
• If the covariance is greater than 0, then the x and y variables are positively related.
• If the covariance is near 0, then the x and y variables are not linearly related.
• If the covariance is less than 0, then the x and y variables are negatively related.
Excel: COVARIANCE.S function
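The sample covariance for the bottled water data of Table 2.9 can be computed as a short sketch. Like COVARIANCE.S, it divides by n − 1 rather than by N.

```python
# Daily high temperature (°F) and bottled water sales (cases), Table 2.9.
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n

# Sum of products of deviations about the means, divided by n - 1.
sample_cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales)) / (n - 1)

print(round(sample_cov, 2))
```

The result is positive, consistent with the upward trend of sales with temperature; its size, however, depends on the units of both variables.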

◦ Correlation coefficient measures the relationship between two variables.
Not affected by the units of measurement for x and y.

$-1 \le r \le +1$

• The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope).
• The closer the correlation coefficient is to −1, the closer the x and y values are to forming a straight line with negative slope.

Excel: CORREL function
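The correlation coefficient for the bottled water data can be sketched by dividing the sample covariance by the product of the two sample standard deviations, which is the Pearson correlation that CORREL reports.

```python
# Daily high temperature (°F) and bottled water sales (cases), Table 2.9.
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n

# Sample covariance and sample standard deviations (all divide by n - 1).
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales)) / (n - 1)
sd_x = (sum((x - mean_x) ** 2 for x in temps) / (n - 1)) ** 0.5
sd_y = (sum((y - mean_y) ** 2 for y in sales) / (n - 1)) ** 0.5

r = cov_xy / (sd_x * sd_y)

print(round(r, 2))
```

Because the units cancel in the ratio, r would be unchanged if temperature were recorded in °C or sales in bottles instead of cases.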
