Slide PTDL.1

Introduction to Business Analytics

Descriptive Statistics
♥ provide summarizing information on the characteristics and distribution of values
♥ allow analysts to have a quick glance at the central tendency and the degree of dispersion of values
♥ help describe data in a meaningful way, such that patterns might emerge from the data

Contents
1. Data Cleansing
2. Modifying Data in Excel
3. Creating Distributions from Data
4. Measures of Location and Variability
5. Measures of Association Between Two Variables
1. Data Cleansing

Missing Data
◦ legitimately: no remedial action
◦ illegitimately:
♥ discard observations (rows) with any missing values.
♥ discard any variable (column) with missing values.
♥ fill in missing entries with estimated values.
♥ apply a data-mining algorithm that can handle missing values.

◦ Example: Blakely Tires (Data file TreadWear)
A U.S. producer of automobile tires wants to learn about the conditions of its tires on automobiles in Texas.
Step 1. Enter the heading # of Missing Values in cell G2
Step 2. Enter the heading Life of Tire (Months) in cell H1
Step 3. Enter =COUNTBLANK(C2:C457) in cell H2
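The COUNTBLANK step above can also be sketched outside Excel. The following is a minimal Python illustration, not the TreadWear file itself: the rows, the column names, and the helper `count_blank` are all hypothetical, and `None` stands in for a blank cell.

```python
# Hypothetical rows for illustration only; None marks a blank cell.
rows = [
    {"id": 1, "life_months": 58.4, "tread_depth": 2.2, "miles": 2805},
    {"id": 2, "life_months": 17.3, "tread_depth": None, "miles": 39371},
    {"id": 3, "life_months": 16.5, "tread_depth": 8.3, "miles": None},
    {"id": 4, "life_months": None, "tread_depth": 10.1, "miles": 13367},
]


def count_blank(rows, column):
    """Count missing entries in one column, similar to COUNTBLANK on a range."""
    return sum(1 for row in rows if row[column] is None)


# Number of missing values per variable (the id column is skipped).
missing = {col: count_blank(rows, col) for col in rows[0] if col != "id"}
print(missing)

# One remedy from the list above: discard rows with any missing value.
complete_rows = [row for row in rows if None not in row.values()]
print([row["id"] for row in complete_rows])
```

Discarding rows is only one of the listed options; whether it is appropriate depends on why the values are missing.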
Assoc.Prof. Nguyen Vinh 1


◦ Example: Blakely Tires (Data file TreadWear)
Figure 2.1: Portion of Excel Spreadsheet Showing Number of Missing Values
Sort all of the data on Miles from smallest to largest value to determine which observation is missing its value of this variable.
Figure 2.2: Data Sorted on Miles from Lowest to Highest Value
=> Sort all the data on ID number and scroll through the data to find the four tires that belong to the automobile with the ID Number of the missing data.
Figure 2.3: Data Sorted on ID number

◦ Identify Erroneous Outliers and other Erroneous Values
- Use of summary statistics, frequency distributions, bar charts and histograms and other tools can uncover data-quality issues and outliers
- Example: minimum and maximum values for Life of Tires (Months)
Figure 2.4: Minimum and maximum values for Life of Tires (Months)
| Business Analytics | Introduction to BA | Nano_PAMS Program |

◦ Identify Erroneous Outliers and other Erroneous Values (cont.)
The values of Life of Tires for the other three tires are the same, 60.1
=> the decimal for this value is in the wrong place.
Figure 2.5: Data Sorted on ID number


◦ Identify Erroneous Outliers and other Erroneous Values (cont.)
Figure 2.6: Scatter Diagram of Tread Depth and Miles
The points that lie outside of this ellipse may be inaccurate and should be investigated.

Variable Representation:
In many data-mining applications, it may be prohibitive to analyze the data because of the number of variables recorded.
Dimension reduction is the process of removing variables from the analysis without losing crucial information.
A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider.
Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship.
2. Modifying Data in Excel

◦ Sorting and Filtering Data in Excel
To sort the automobiles by March 2010 sales:
◦ Step 1: Select cells A1:F21.
◦ Step 2: Click the Data tab in the Ribbon.
◦ Step 3: Click Sort in the Sort & Filter group.
◦ Step 4: Select the check box for My data has headers.
◦ Step 5: In the first Sort by dropdown menu, select Sales (March 2010).
◦ Step 6: In the Order dropdown menu, select Largest to Smallest.
◦ Step 7: Click OK.
Figure 2.7: Using Excel’s Sort Function to Sort the Top-Selling Automobiles Data
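The sort steps above order the rows from largest to smallest on one column. A rough Python equivalent is sketched below; the records, model names, and the field name `sales_march_2010` are made up for illustration and are not the automobile data from the slides.

```python
# Hypothetical records; names and figures are illustrative only.
cars = [
    {"model": "Model A", "sales_march_2010": 10000},
    {"model": "Model B", "sales_march_2010": 25000},
    {"model": "Model C", "sales_march_2010": 17500},
]

# "Largest to Smallest" corresponds to a descending sort on the chosen column.
by_sales = sorted(cars, key=lambda car: car["sales_march_2010"], reverse=True)

for car in by_sales:
    print(car["model"], car["sales_march_2010"])
```

`sorted` returns a new list and leaves `cars` unchanged, which roughly mirrors keeping a copy of the original data before sorting.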


◦ Sorting and Filtering Data in Excel (cont.)
Using Excel’s Filter function to see the sales of models made by Toyota:
◦ Step 1: Select cells A1:F21.
◦ Step 2: Click the Data tab in the Ribbon.
◦ Step 3: Click Filter in the Sort & Filter group.
◦ Step 4: Click on the Filter Arrow in column B, next to Manufacturer.
◦ Step 5: If all choices are checked, you can easily deselect all choices by unchecking (Select All). Then select only the check box for Toyota.
◦ Step 6: Click OK.
Figure 2.8: Using Excel’s Filter Function to Show Only Toyota

◦ Conditional Formatting of Data in Excel: Makes it easy to identify data that satisfy certain conditions in a data set.
To highlight the automobile models whose sales decreased from Mar 2010 to Mar 2011:
◦ Step 1: Starting with the original data, select cells F1:F21.
◦ Step 2: Click on the Home tab in the Ribbon.
◦ Step 3: Click Conditional Formatting in the Styles group.
◦ Step 4: Select Highlight Cells Rules, and click Less Than from the dropdown menu.
◦ Step 5: Enter 0% in the Format cells that are LESS THAN: box.
◦ Step 6: Click OK.
Figure 2.9: Using Conditional Formatting to Highlight Automobiles with Declining Sales
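Both the Toyota filter and the "less than 0%" highlight rule amount to selecting rows by a condition. The sketch below shows that idea in Python; the records and the field names `manufacturer` and `pct_change` are hypothetical, not values from the slides.

```python
# Hypothetical records; pct_change is the relative change in sales
# from March 2010 to March 2011 (for example, -0.12 means a 12% decline).
cars = [
    {"model": "Model A", "manufacturer": "Toyota", "pct_change": -0.12},
    {"model": "Model B", "manufacturer": "Other Maker", "pct_change": 0.08},
    {"model": "Model C", "manufacturer": "Toyota", "pct_change": 0.03},
    {"model": "Model D", "manufacturer": "Other Maker", "pct_change": -0.01},
]

# Filter: keep only the rows whose manufacturer is Toyota.
toyota_only = [car for car in cars if car["manufacturer"] == "Toyota"]

# Highlight rule: flag the models whose percent change is less than 0%.
declining = [car["model"] for car in cars if car["pct_change"] < 0]

print([car["model"] for car in toyota_only])
print(declining)
```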


◦ Conditional Formatting of Data in Excel (cont.)
Figure 2.10: Using Conditional Formatting in Excel to Generate Data Bars
Figure 2.11: Using Quick Analysis button, shortcuts for Conditional Formatting, adding Data Bars, and other operations.
3. Creating Distributions from Data
Frequency distribution: A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes (bins).


Table 2.1: Data from a Sample of 50 Soft Drink Purchases (file SoftDrinks)
Coca-Cola    Sprite       Pepsi
Diet Coke    Coca-Cola    Coca-Cola
Pepsi        Diet Coke    Coca-Cola
Diet Coke    Coca-Cola    Coca-Cola
Coca-Cola    Diet Coke    Pepsi
Coca-Cola    Coca-Cola    Dr. Pepper
Dr. Pepper   Sprite       Coca-Cola
Diet Coke    Pepsi        Diet Coke
Pepsi        Coca-Cola    Pepsi
Pepsi        Coca-Cola    Pepsi
Coca-Cola    Coca-Cola    Pepsi
Dr. Pepper   Pepsi        Pepsi
Sprite       Coca-Cola    Coca-Cola
Coca-Cola    Sprite       Dr. Pepper
Diet Coke    Dr. Pepper   Pepsi
Coca-Cola    Pepsi        Sprite
Coca-Cola    Diet Coke

◦ Frequency Distributions for Categorical Data
Table 2.2: Frequency Distribution of Soft Drink Purchases
Soft Drink    Frequency
Coca-Cola     19
Diet Coke     8
Dr. Pepper    5
Pepsi         13
Sprite        5
Total         50
Figure 2.11: Creating a Frequency Distribution for Soft Drinks Data in Excel (COUNTIF function in Excel)
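The COUNTIF tally can be sketched in Python with `collections.Counter`. The list below holds the 50 purchases of Table 2.1 read row by row; the short variable names are only a convenience for this sketch.

```python
from collections import Counter

CC, DC, DP, PE, SP = "Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"

# The 50 purchases from Table 2.1, read row by row.
purchases = [
    CC, SP, PE,  DC, CC, CC,  PE, DC, CC,  DC, CC, CC,
    CC, DC, PE,  CC, CC, DP,  DP, SP, CC,  DC, PE, DC,
    PE, CC, PE,  PE, CC, PE,  CC, CC, PE,  DP, PE, PE,
    SP, CC, CC,  CC, SP, DP,  DC, DP, PE,  CC, PE, SP,
    CC, DC,
]

# Counter plays the role of one COUNTIF per soft drink.
frequency = Counter(purchases)

for drink in sorted(frequency):
    print(f"{drink:<12}{frequency[drink]:>3}")
print(f"{'Total':<12}{sum(frequency.values()):>3}")
```

The counts agree with Table 2.2.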


◦ Relative Frequency and Percent Frequency Distributions
◦ Relative frequency: fraction or proportion of items belonging to a class.
◦ Relative frequency distribution: A tabular summary of data showing the relative frequency for each bin.
◦ Percent frequency distribution: Summarizes the percent frequency of the data for each bin.

Table 2.3: Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases
Soft Drink    Relative Frequency    Percent Frequency (%)
Coca-Cola     0.38                  38
Diet Coke     0.16                  16
Dr. Pepper    0.10                  10
Pepsi         0.26                  26
Sprite        0.10                  10
Total         1.00                  100
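As a small worked sketch of the definitions above, the relative frequency of a bin is its frequency divided by the number of observations, and the percent frequency is that fraction times 100. The counts below are those of Table 2.2.

```python
# Frequencies from Table 2.2.
frequency = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency = frequency of the bin / n.
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency = relative frequency x 100.
percent = {drink: 100 * rel for drink, rel in relative.items()}

for drink in frequency:
    print(f"{drink:<12}{relative[drink]:>6.2f}{percent[drink]:>6.0f}")
```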

◦ Frequency Distributions for Quantitative Data:
Three steps necessary to define the classes for a frequency distribution with quantitative data:
1. Determine the number of nonoverlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Table 2.4: Year-End Audit Times (Days)
12   14   19   18
15   15   18   17
20   27   22   23
22   21   33   28
14   18   16   13


◦ Frequency Distributions for Quantitative Data (cont.)
Number of bins: 5
\[ \text{Approximate bin width} = \frac{\text{largest data value} - \text{smallest data value}}{\text{number of bins}} \]
Bin limits: 10–14, 15–19, 20–24, 25–29, 30–34

Table 2.5: Frequency, Relative Frequency, and Percent Frequency Distributions for the Audit Time Data
Audit Times (days)    Frequency    Relative Frequency    Percent Frequency
10–14                 4            0.20                  20
15–19                 8            0.40                  40
20–24                 5            0.25                  25
25–29                 2            0.10                  10
30–34                 1            0.05                  5
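The three binning steps can be traced on the audit times of Table 2.4. This sketch assumes the usual rule of dividing the data range by the number of bins to get an approximate width; rounding that width up to 5 and starting the first bin at 10 are choices taken from the bin limits shown in Table 2.5.

```python
# Year-end audit times (days) from Table 2.4.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# Step 1: number of nonoverlapping bins.
num_bins = 5

# Step 2: approximate width = (largest - smallest) / number of bins.
approx_width = (max(audit_times) - min(audit_times)) / num_bins
bin_width = 5  # rounded up to a convenient whole number

# Step 3: bin limits, starting at 10 as in Table 2.5.
first_lower_limit = 10
bins = [
    (first_lower_limit + i * bin_width,
     first_lower_limit + (i + 1) * bin_width - 1)
    for i in range(num_bins)
]

# Count the observations that fall inside each bin (limits inclusive).
frequency = {
    f"{lo}-{hi}": sum(lo <= t <= hi for t in audit_times)
    for lo, hi in bins
}

print(approx_width)
print(frequency)
```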

Figure 2.12: Using Excel to Generate a Frequency Distribution for Audit Times Data

◦ Histograms: A common graphical presentation of quantitative data
◦ Constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.
◦ The frequency measure of each class is shown by drawing a rectangle whose base is the class limits on the horizontal axis and whose height is the corresponding frequency measure.


◦ Histograms (cont.)
Histograms can be created in Excel using the Data Analysis ToolPak.
◦ Step 1. Click the Data tab in the Ribbon
◦ Step 2. Click Data Analysis in the Analyze group
◦ Step 3. When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
    In the Input Range: box, enter A2:D6
    In the Bin Range: box, enter A10:A14
    Under Output Options:, select New Worksheet Ply:
    Select the check box for Chart Output
    Click OK
Figure 2.13: Histogram for the Audit Time Data

◦ Histograms (cont.)
Histograms provide information about the shape, or form, of a distribution.
Skewness: Lack of symmetry.
Skewness is an important characteristic of the shape of a distribution.
Figure 2.17: Histograms Showing Distributions with Different Levels of Skewness

◦ Cumulative frequency distribution: A variation of the frequency distribution that provides another tabular summary of quantitative data.
Table 2.6: Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data
Audit Time (days)           Cumulative Frequency    Cumulative Relative Frequency    Cumulative Percent Frequency
Less than or equal to 14    4                       0.20                             20

4. Measures of Location and Variability

◦ Measures of Location
Mean/Arithmetic Mean: Average value for a variable:
Population mean: μ
Sample mean:
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{138{,}000 + 254{,}000 + \cdots + 456{,}250}{12} = \frac{2{,}639{,}250}{12} = 219{,}937.50 \]

Home Sale    Selling Price ($)
1            138,000
2            254,000
3            186,000
4            257,500
5            108,000
6            254,000
7            138,000
8            298,000
9            199,500
10           208,000
11           142,000
12           456,250
In Excel: AVERAGE function
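The mean calculation for the 12 home sales can be checked with a few lines of Python; this is only a sketch of the arithmetic, playing the same role as Excel's AVERAGE function.

```python
# Selling prices ($) of the 12 home sales listed above.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

total = sum(prices)         # sum of all observations
mean = total / len(prices)  # divide by the number of observations n

print(total)
print(mean)
```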


◦ Median: Value in the middle when the data are arranged in ascending order.
◦ Middle value, for an odd number of observations.
◦ Average of two middle values, for an even number of observations.
In Excel: MEDIAN function
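The two cases of the median rule can be written out directly. The sketch below applies them to the home selling prices from the mean example and compares the result with Python's `statistics.median`.

```python
import statistics

# Selling prices ($) of the 12 home sales from the mean example.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

ordered = sorted(prices)  # arrange in ascending order
n = len(ordered)
middle = n // 2

if n % 2 == 1:
    # Odd number of observations: the single middle value.
    median = ordered[middle]
else:
    # Even number of observations: average of the two middle values.
    median = (ordered[middle - 1] + ordered[middle]) / 2

print(median)
print(statistics.median(prices))  # library result for comparison
```

With 12 observations the even case applies, so the median is the average of the 6th and 7th ordered values.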

◦ Mode: Value that occurs most frequently in a data set.
◦ Multimodal data: Data contain at least two modes.
◦ Bimodal data: Data contain exactly two modes.
In Excel: MODE.SNGL function (MODE.MULT for multiple modes)

Geometric mean: A measure of location that is calculated by finding the nth root of the product of n values.
Used in analyzing growth rates in financial data.
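Both definitions can be tried on data from the slides. In this sketch, the mode is taken over the home selling prices from the mean example, and the geometric mean over the ten growth factors of Table 2.7; `statistics.multimode` and `math.prod` require Python 3.8 or later.

```python
import math
import statistics

# Selling prices ($) of the 12 home sales from the mean example.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# multimode returns every value tied for the highest count.
modes = statistics.multimode(prices)
print(modes)

# Growth factors from Table 2.7 (1 + annual return).
growth_factors = [0.779, 1.287, 1.109, 1.049, 1.158,
                  1.055, 0.630, 1.265, 1.151, 1.021]

# Geometric mean: nth root of the product of the n values.
product = math.prod(growth_factors)
geometric_mean = product ** (1 / len(growth_factors))

print(round(product, 3))
print(round(geometric_mean, 3))
```

Two prices each occur twice, so these data are bimodal.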


◦ Measures of Location
Table 2.7: Percentage Annual Returns and Growth Factors for the Mutual Fund Data
Year    Return (%)    Growth Factor
1       −22.1         0.779
2       28.7          1.287
3       10.9          1.109
4       4.9           1.049
5       15.8          1.158
6       5.5           1.055
7       −37.0         0.630
8       26.5          1.265
9       15.1          1.151
10      2.1           1.021

$100 (0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021) = $100 (1.335) = $133.45
\[ \bar{x}_g = \sqrt[10]{1.335} = 1.029 \]
i.e., a mean annual growth rate of 2.9%
In Excel: GEOMEAN function

◦ Measures of Variability
Table 2.8: Annual Payouts for Two Different Investment Funds (cont.)
Year    Fund A ($)    Fund B ($)
12      1,100         890
13      1,100         1,050
14      1,100         800
15      1,100         1,150
16      1,100         1,200
17      1,100         1,800
18      1,100         100
19      1,100         1,750
20      1,100         1,000
Mean    1,100         1,100
Figure 2.19: Histograms for Payouts of Past 20 Years from Fund A and Fund B

◦ Measures of Variability
The range can be found by subtracting the smallest value from the largest value in a data set.
Variance is a measure of variability that utilizes all the data.
It is based on the deviation about the mean, which is the difference between the value of each observation ($x_i$) and the mean.
The deviations about the mean are squared while computing the variance.
Population variance:
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]


◦ Measures of Variability (cont.)
Standard deviation is the positive square root of the variance.
Measured in the same units as the original data.

Calculating Variability Measures for the Home Sales Data in Excel (practice)
Variance: VAR.S function in Excel
Standard deviation: STDEV.S function in Excel
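For the home sales practice exercise, the range, variance, and standard deviation can be sketched in Python as below. Note one assumption made explicit here: the population formula divides by N, whereas Excel's VAR.S and STDEV.S treat the data as a sample and divide by n − 1, so both versions are computed.

```python
import math
import statistics

# Selling prices ($) of the 12 home sales from the mean example.
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

n = len(prices)
mean = sum(prices) / n

# Range: largest value minus smallest value.
data_range = max(prices) - min(prices)

# Sum of squared deviations about the mean.
squared_deviations = sum((x - mean) ** 2 for x in prices)

# Population variance divides by N; sample variance divides by n - 1.
population_variance = squared_deviations / n
sample_variance = squared_deviations / (n - 1)

# Standard deviation: positive square root of the variance.
sample_std = math.sqrt(sample_variance)

print(data_range)
print(sample_variance, statistics.variance(prices))
print(sample_std, statistics.stdev(prices))
```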
◦ Measures of Variability (cont.)
Coefficient of variation is a descriptive statistic that indicates how large the standard deviation is relative to the mean.
Expressed as a percentage.

5. Measures of Association Between Two Variables

Table 2.9: Data for Bottled Water Sales at Queensland Amusement Park for a Sample of 14 Summer Days (file Bottledwater)
High Temperature (°F)    Bottled Water Sales (cases)
78                       23
79                       22
80                       24
80                       22
82                       24
83                       26
85                       27
86                       25
87                       28
87                       26
88                       29
88                       30
90                       31
92                       31

◦ Scatter Charts: A useful graph for analyzing the relationship between two variables
Figure 2.20: Chart Showing the Positive Linear Relation Between Sales and High Temperatures

◦ Covariance is a descriptive measure of the linear association between two variables:
Population covariance:
\[ \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \]
◦ Covariance (cont.)
Table 2.10: Sample Covariance Calculations for Daily High Temperature and Bottled Water Sales at Queensland Amusement Park
• If the covariance is near 0, then the x and y variables are not linearly related.
• If the covariance is less than 0, then the x and y variables are negatively related.
In Excel: COVARIANCE.S function
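The sample covariance for the Table 2.9 data can be computed by hand as a check on COVARIANCE.S. This sketch assumes the sample version of the formula, which replaces the population means with sample means and divides by n − 1 instead of N.

```python
# Table 2.9: daily high temperature (°F) and bottled water sales (cases).
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n

# Sum of the products of paired deviations about the means.
cross_products = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(temps, sales))

# Sample covariance divides by n - 1.
sample_cov = cross_products / (n - 1)

print(round(sample_cov, 2))
```

The result is positive, consistent with the upward trend described for the scatter chart.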

◦ Correlation coefficient measures the linear relationship between two variables.
Not affected by the units of measurement for x and y.
\[ -1 \le r \le +1 \]
• The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope).
• The closer the correlation coefficient is to –1, the closer the x and y values are to forming a straight line with negative slope.
In Excel: CORREL function
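A hand computation of the correlation coefficient for the Table 2.9 data is sketched below. It assumes the usual sample formula, r = s_xy / (s_x s_y), which the slide text does not spell out; the helper names `sample_cov` and `sample_std` are introduced only for this example.

```python
import math

# Table 2.9: daily high temperature (°F) and bottled water sales (cases).
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]


def sample_cov(xs, ys):
    """Sample covariance: paired deviations about the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation, the square root of the sample variance."""
    return math.sqrt(sample_cov(values, values))


# Assumed formula: covariance scaled by both standard deviations.
r = sample_cov(temps, sales) / (sample_std(temps) * sample_std(sales))

print(round(r, 2))
```

Because the covariance is divided by both standard deviations, the units cancel, which is why r does not depend on the units of measurement.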
