
ETW1001

Week 3: Pre-class

This homework is designed to help you summarise numerical data by:


• Constructing a frequency distribution table and cumulative frequency distribution table
• Constructing and interpreting histograms
• Calculating descriptive statistics using Excel formulae and interpreting them

A. Tables and Charts for Numerical Data


Pivot tables and bar charts are usually used for categorical data. They are not generally
appropriate for numerical data because numerical data often takes many different values which
could be cumbersome, if not impossible, to list. The equivalent to the pivot table and bar chart
for summarising numerical data (continuous or discrete) is the frequency distribution table, and
its visual display, the histogram. Essentially, the difference between a pivot table and frequency
distribution table (and bar chart and histogram) is that the frequency distribution table uses
interval ranges, while the pivot table uses categories.

Interval ranges or groupings (bins) may be obvious or not so obvious: for example, the bin
ranges for income may be obvious if there are well-defined income groups such as low-
income, middle-income, high-income. But if this categorisation is not predefined or is
subjective, then it may be up to the analyst to determine what the cut-off value is for each
income class.

A frequency distribution table is the tabulation of bin ranges and their corresponding
frequency of occurrence in the dataset. The histogram is a graph of this frequency distribution
table, with bins on the horizontal axis and frequency on the vertical axis.

This homework will go through an example to illustrate how to use Excel to create a
frequency distribution table, a cumulative frequency distribution table and a histogram.

1. Frequency Distribution Table


The data to be analysed in this homework are in Credit Risk Data.xlsx. We will use Excel’s
Data Analysis Histogram tool to analyse the savings account balances by gender.
Determine the grouped frequency table of savings account balances by gender for the given
data. Find the frequency and hence calculate the percentage frequency for males and females.

First, we need to decide on our bin ranges – that is, what groupings will be displayed on the
x-axis of our histogram.

For this data, we will construct frequency distribution tables of savings for males and
females. To form the frequency distribution table, the steps are as follows:
• Range = Max - Min
• Class interval = Range / No of classes
◦ No of classes (bins) = 1 + 3.3 log(n), where n is the number of observations and log is
to base 10
• Grouped data = Discrete or Continuous
We will use the Credit Risk data to illustrate the frequency distribution table of savings
account balances for males and females.

Step one: sort the data by gender [highlight both the gender and saving account columns,
then sort by gender in A to Z order]. Your data will be as follows:

Step two:
Range = Maximum - Minimum of the savings amount for males and females

Step three:
Class interval = Range / No of classes
◦ No of classes (bins) = 1 + 3.3 log(n)

Maximum savings (males and females) $19,811
Minimum savings (males and females) $0

Range = Max - Min = 19811 - 0 = 19811
Class interval = Range / No of classes = 2063.64, which we round up to 2100 to give
convenient bin limits.
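
The arithmetic in steps two and three can be sketched in Python as a cross-check on the spreadsheet. This is a minimal sketch, not part of the Excel workflow: the value n = 425 is an assumption taken from the cell range B2:B426 used for the Histogram tool, the rounding step of 100 is an assumption, and the helper names are hypothetical. A slightly different n changes the raw class interval a little, but it still rounds up to the same bin width of 2100.

```python
import math

def sturges_classes(n):
    """Suggested number of classes: 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

def round_up(value, step):
    """Round value up to the next multiple of step."""
    return math.ceil(value / step) * step

def make_bins(minimum, width, maximum):
    """Return (lower, upper) pairs of equal width that cover minimum..maximum."""
    bins = []
    lower = minimum
    while lower < maximum:
        bins.append((lower, lower + width))
        lower += width
    return bins

n = 425                       # assumption: one observation per row in B2:B426
minimum, maximum = 0, 19811   # minimum and maximum savings quoted in the handout

raw_width = (maximum - minimum) / sturges_classes(n)
width = round_up(raw_width, 100)   # assumption: round up to the next 100
bins = make_bins(minimum, width, maximum)

for number, (lower, upper) in enumerate(bins, start=1):
    print(number, lower, upper)
```

The printed rows give the class number, lower bound and upper bound of each bin.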

Savings
No of class   Lower Bound   Upper Bound [Bin Range]
1             0             2100
2             2100          4200
3             4200          6300
4             6300          8400
5             8400          10500
6             10500         12600
7             12600         14700
8             14700         16800
9             16800         18900
10            18900         21000

Note that we specify a single value for each bin – the upper limit of the bin range. Note that
we can specify an integer value for our upper limit because our data only takes integer values.
Now we use Excel’s Tools, Data Analysis, Histogram tool. Excel’s Histogram tool will take
our specified bin range and automatically construct our frequency distribution, cumulative
frequency distribution and histogram in one go.

On the Data tab, select Data Analysis from the Analysis group (if you do not have the Data
Analysis tool installed, see last week’s homework for instructions on how to install it).

A pop-up box will appear. Select Histogram from the list and click OK. A second pop-up box
will appear. Complete it as follows [for females the data range is B2 to B136 and for males it
is B137 to B426]:

• Input Range: This is the cell range of your data.


• Bin Range: This is the cell range of the bins you specified in your spreadsheet. If you leave
this blank, Excel will make up its own bin ranges, but often these are not nice round numbers
so are not as clear to the reader.
• We tick the Labels checkbox because our input ranges include the names of the data we
entered.
• Ticking Cumulative Percentage tells Excel to produce a cumulative frequency distribution
table as well as a regular frequency distribution table.
• Ticking the Chart Output tells Excel to produce a histogram from the frequency distribution.
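
To make the counting rule concrete, here is a small Python sketch of how a list of upper limits turns data into frequencies. It assumes the usual Histogram tool convention that a value equal to an upper limit is counted in that bin and that anything above the last limit goes to the 'More' row. The data values and the function name are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left

def histogram_counts(values, bin_uppers):
    """Count values per bin, where bin_uppers is a sorted list of upper limits.

    A value equal to an upper limit is counted in that bin. The last entry of
    the result is the 'More' row (values above the final upper limit).
    """
    counts = [0] * (len(bin_uppers) + 1)
    for value in values:
        # Index of the first upper limit that is >= value
        counts[bisect_left(bin_uppers, value)] += 1
    return counts

savings = [0, 150, 2100, 2101, 5000, 25000]   # hypothetical balances
bin_uppers = [2100, 4200, 6300]               # hypothetical bin range

counts = histogram_counts(savings, bin_uppers)
for label, count in zip(bin_uppers + ["More"], counts):
    print(label, count)
```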

Here is the output obtained when you click OK:

Frequency for Female

From this output, we can see that, out of 135 females, 109 have savings between $0 and $2100,
and only 1 female has savings between $18900 and $21000. This shows that savings are unequally
distributed among females and that most of them have very low savings. Below is the savings
distribution for males, which appears to be almost the same as that for females.
Frequency for Male
The presentation of the default histogram is rather poor – it is not helpfully labelled, there are
gaps between bars (the class ranges of histograms are continuous by construction, unlike bar
charts for categorical data), and the chart area is far too small. But just like a regular Excel
chart, we can alter its presentation:

• Enlarge the size of the entire chart area so that the chart plot area is in better (larger)
proportion to the legend.
• Change the titles to something more meaningful to the reader.
• Remove the gaps between bars: right-click on the blue bars for Frequency, and select
Format Data Series. Under Series Options, reduce the Gap Width to 1% (it is visually more
appealing to have some gap, even though technically there should not be).

Below is the graphical presentation of the savings distribution for females.

Often it is more meaningful to specify the y-axis in percentage terms rather than as counts.
To do this we need to convert the frequencies into percentages by dividing each frequency by
the total number of observations (135 for females). The result is as follows:
Bin      Frequency   Cumulative %   % Frequency
2100     109         80.74%         81%
4200     9           87.41%         7%
6300     2           88.89%         1%
8400     3           91.11%         2%
10500    2           92.59%         1%
12600    3           94.81%         2%
14700    4           97.78%         3%
16800    0           97.78%         0%
18900    2           99.26%         1%
21000    1           100.00%        1%
More     0           100.00%        0%
Total    135                        100%
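
The percentage columns can be reproduced outside Excel as a check. The sketch below uses the female frequencies from the table above; the variable names are illustrative only.

```python
bins = [2100, 4200, 6300, 8400, 10500, 12600, 14700, 16800, 18900, 21000]
frequency = [109, 9, 2, 3, 2, 3, 4, 0, 2, 1]   # female frequencies from the table

total = sum(frequency)

# Percentage frequency: each count as a share of all observations
percent = [100 * f / total for f in frequency]

# Cumulative percentage: running total of counts as a share of all observations
cumulative = []
running = 0
for f in frequency:
    running += f
    cumulative.append(100 * running / total)

for b, f, c, p in zip(bins, frequency, cumulative, percent):
    print(f"{b:>6} {f:>4} {c:7.2f}% {p:4.0f}%")
```

Because each percentage is rounded to a whole number for display, the displayed values need not add up to exactly 100.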

B. Descriptive Statistics
1. Descriptive Statistics with Excel Formulae
We can calculate descriptive statistics using Excel formulae. Check that you can obtain the
following figures for the female customers' savings data:
Statistic            Excel function        Result
Mean                 =AVERAGE(B2:B136)
Median               =MEDIAN(B2:B136)
Maximum              =MAX(B2:B136)
Minimum              =MIN(B2:B136)
Standard Deviation   =STDEV.S(B2:B136)

Note that the mean is greater than the median, indicating positive skewness. Indeed, the
histogram in part A shows a long right tail of higher savings amounts.
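
For readers who want to cross-check the Excel formulae, the same statistics are available in Python's standard `statistics` module. This is only a sketch: the data values are hypothetical and the `describe` helper is not part of the homework. The sample standard deviation from `statistics.stdev` corresponds to Excel's STDEV.S.

```python
import statistics

def describe(values):
    """Return the statistics computed with Excel formulae in this section."""
    return {
        "mean": statistics.mean(values),       # like =AVERAGE(...)
        "median": statistics.median(values),   # like =MEDIAN(...)
        "maximum": max(values),                # like =MAX(...)
        "minimum": min(values),                # like =MIN(...)
        "stdev": statistics.stdev(values),     # sample form, like =STDEV.S(...)
    }

savings = [0, 0, 500, 680, 1200, 9000]   # hypothetical balances, not the real data
summary = describe(savings)

for name, value in summary.items():
    print(name, round(value, 2))

# A mean above the median is the signature of positive (right) skewness
print("positively skewed:", summary["mean"] > summary["median"])
```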

2. Descriptive Statistics with Excel’s Data Analysis Tool

Excel’s Data Analysis add-in will automatically produce a table of summary statistics for
you. From the Data tab, select Data Analysis, and then Descriptive Statistics from the list.
Complete the dialog box as follows:

Descriptive Statistics for Female

Mean $2,075.08
Standard Error $337.11
Median $680.00
Mode $0.00
Standard Deviation $3,916.85
Sample Variance $15,341,749.16
Kurtosis 7.11
Skewness 2.76
Range $19,568.00
Minimum $0.00
Maximum $19,568.00
Sum $280,136.00
Count 135
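
Several rows of this output are linked by simple formulae, which gives a quick way to sanity-check the table. The sketch below recomputes the mean, standard deviation, standard error and range from the other rows of the female output above. The variable names are illustrative.

```python
import math

# Figures taken from the Descriptive Statistics output for females
count = 135
total = 280136.00
sample_variance = 15341749.16
minimum, maximum = 0.00, 19568.00

mean = total / count                      # Mean = Sum / Count
std_dev = math.sqrt(sample_variance)      # Standard Deviation = sqrt(Sample Variance)
std_error = std_dev / math.sqrt(count)    # Standard Error = Standard Deviation / sqrt(Count)
value_range = maximum - minimum           # Range = Maximum - Minimum

print(round(mean, 2), round(std_dev, 2), round(std_error, 2), value_range)
```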

Based on the credit risk data, answer the following questions.

i. Construct percentage frequency polygons of the savings account balances for males and
females on one chart.
ii. Comment on the shapes of the two distributions.
iii. Compare the central locations of the distributions for males and females.
iv. List the four measures of variability and comment on the saving account.

Case study 2: based on CEO’s KLCI Data

Learning Objectives:

• Descriptive Statistics

In this case study, we will be looking at the remuneration paid to the directors of the
public listed companies in the FBM KLCI for 2014. The directors' remuneration
consists of fees, allowances, benefits, and payments. The data cover firms in various sectors:
consumer products, trading/services, properties/hotels, construction, plantations, and
industrial products. The number of directors appointed differs from firm to firm.

Column C shows the total remuneration for all the directors in a particular firm, and
column D shows the remuneration for the firm's Chief Executive Officer (CEO). The
CEOs earn significant remuneration because of the substantial shareholdings they hold and
because they are in charge of the performance of the firm.

Descriptive Statistics with Excel Data Analysis Tool

CEO's Remuneration in RM'000


Mean 9132.775
Standard Error 1811.077962
Median 3899.5
Mode #N/A
Standard Deviation 19839.36507
Sample Variance 393600406.3
Kurtosis 35.33148057
Skewness 5.596275094
Range 155794
Minimum 106
Maximum 155900
Sum 1095933
Count 120

Frequency
Bin Frequency Cumulative %
10000 96 80.00%
20000 14 91.67%
30000 4 95.00%
40000 2 96.67%
50000 0 96.67%
60000 2 98.33%
70000 1 99.17%
80000 0 99.17%
90000 0 99.17%
100000 0 99.17%
110000 0 99.17%
120000 0 99.17%
130000 1 100.00%
More 0 100.00%
120

Mean based on Grouped data

Lower Bound   Upper Bound   Frequency (f)   Midpoint (x)   fx
0             10000         96              5000           480000
10000         20000         14              15000          210000
20000         30000         4               25000          100000
30000         40000         2               35000          70000
40000         50000         0               45000          0
50000         60000         2               55000          110000
60000         70000         1               65000          65000
70000         80000         0               75000          0
80000         90000         0               85000          0
90000         100000        0               95000          0
100000        110000        0               105000         0
110000        120000        0               115000         0
120000        130000        1               125000         125000
Total                       120                            1160000

Mean = Σfx / Σf = 1160000 / 120 = 9666.667 (in RM'000), i.e. RM9,666,667.00
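
The grouped mean can be checked with a few lines of Python. The sketch below rebuilds the midpoints from the class limits in the grouped table above and uses that table's frequencies. The variable names are illustrative.

```python
# Class limits in RM'000: 0-10000, 10000-20000, ..., 120000-130000
lower = list(range(0, 130000, 10000))
upper = [lo + 10000 for lo in lower]
frequency = [96, 14, 4, 2, 0, 2, 1, 0, 0, 0, 0, 0, 1]

# Each class is represented by its midpoint
midpoint = [(lo + up) / 2 for lo, up in zip(lower, upper)]

# Grouped mean = sum of (frequency x midpoint) / sum of frequencies
fx = [f * x for f, x in zip(frequency, midpoint)]
grouped_mean = sum(fx) / sum(frequency)

print(sum(frequency), sum(fx), round(grouped_mean, 3))
```

Because every observation in a class is treated as if it sat at the class midpoint, the grouped mean is an approximation of the mean of the raw data.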

Questions

i. Discuss why the mean calculated from the raw data differs from the mean calculated from
the grouped data.
ii. Based on the summary statistics, why is there a significant difference between the mean
and the median?
iii. Why is there no mode value provided in the summary statistics?
iv. Find the standard deviation and coefficient of variation for the grouped data.
v. Describe the shape of the distribution.
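
As a hint for question iv, the grouped standard deviation and coefficient of variation follow the same midpoint idea as the grouped mean. The sketch below uses a small hypothetical table rather than the CEO data, and it assumes the sample form of the variance (dividing by n - 1). The function name is illustrative.

```python
import math

def grouped_summary(midpoints, frequencies):
    """Return (mean, standard deviation, coefficient of variation) for grouped data."""
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    squared = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints))
    variance = squared / (n - 1)   # sample form (assumption); divide by n for a population
    std_dev = math.sqrt(variance)
    return mean, std_dev, std_dev / mean

# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 5, 3
mean, std_dev, cv = grouped_summary([5, 15, 25], [2, 5, 3])
print(round(mean, 2), round(std_dev, 2), round(cv, 2))
```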

Self-Test question

The data gives the dividend yield on shareholders’ funds for Australia’s top 150 companies for
the year 2005. [Note: Dividend yield is defined as the amount of a company’s annual
dividend expressed as a percentage of the current price of the share of that company.] The
data is in the Dividend.xls worksheet of the file Week 3.
Column A stores the dividend yield for the top 1 – 50 companies (Group A) ranked by market
capitalisation, Column B stores the dividend yield for the companies 51 – 100 (Group B), and
Column C stores the dividend yield for companies 101 – 150 (Group C).
(a) Use functions or the appropriate Data Analysis tool to fill in the table of summary
statistics given in the Worksheet.

Statistic             Group A   Group B   Group C
Count
Minimum
Maximum
Range
Mean
Std.Dev.
Coefft of variation
Median
Lower Quartile
Upper Quartile
IQR

Alternatively, first use the Data Analysis button on the Data tab and select Descriptive
Statistics. Copy and paste the results appropriately, and then use the quartile function to
obtain the quartiles.

The remaining values (Range, Coefficient of variation, and IQR) must be calculated by
typing in formulae. See below.

Here are a few comments on using the Insert Function option.

■ In the Insert Function dialogue box, select ‘Statistical’ from the ‘select a category’
menu. Select the appropriate function and follow the instructions in the dialogue box.

■ Note that the Excel function ‘Average’ is actually the mean.

■ The Lower and Upper quartiles can be obtained by selecting the function ‘Quartile.inc’.
The second entry in the dialogue box (‘Quart’) should be assigned the value 1 for the
lower quartile, and 3 for the upper quartile.

■ Note that since the data concerns all the top 150 companies, rather than a sample, the
STDEV.P function should be used for standard deviation. (Sample standard
deviation is calculated by the function STDEV.S.)

■ To obtain the Range, Interquartile range, and Coefficient of Variation, use pointing to
create a formula in Excel in terms of values already in the table. For example:

Range = Maximum - Minimum
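
The three derived measures can also be sketched in Python for comparison with the Excel formulae. The data here are hypothetical. `statistics.pstdev` is the population form of the standard deviation, and `method="inclusive"` is used on the assumption that it interpolates the same way as QUARTILE.INC; treat that correspondence as something to verify against your worksheet.

```python
import statistics

def spread_measures(values):
    """Return the range, interquartile range and coefficient of variation."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)   # population standard deviation
    return {
        "range": max(values) - min(values),   # Range = Maximum - Minimum
        "iqr": q3 - q1,                       # IQR = Upper Quartile - Lower Quartile
        "cv": std_dev / mean,                 # Coefficient of variation = Std.Dev. / Mean
    }

yields = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical dividend yields (%)
measures = spread_measures(yields)
print(measures)
```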

(b) (i) Obtain a frequency distribution for each data set by following these steps.

■ By first finding the maximum and minimum values among all 150 data values, decide
how many class intervals to use, then choose the width and the lower limit of the first
interval. (The same class intervals should be used for all 3 data sets, as we wish them to
be easily compared.)

■ You should make these choices in such a way that the resulting frequency polygon or
histogram will show the right amount of detail, and the class intervals cover the full
range of the data. In addition, the limits of the class intervals should be at convenient
values. To take an extreme example 0, 5, 10, … is preferable to 0.12, 4.24, 8.36, …

■ Prepare a table (Table 1) to display your results including the following column
headings:

Lower limit; Upper limit; Midpoint; Frequency (A); Frequency (B); Frequency (C)

■ Fill in the Lower and Upper limit columns according to your choice of class intervals.
The midpoint column can be calculated as (Lower limit + Upper limit)/2.

■ To obtain the frequencies, use the Histogram tool in Excel. You will need to use it once
for each set of data.

■ From the Data tab, click on Data Analysis and select Histogram. (If Data Analysis is
not available on the Data tab, then the tool pack has to be loaded in).

■ Fill in the dialogue box as follows:

■ the Input Range should be the list of dividend yields for a given group of 50, including
the heading (specifying group A, B, or C);

■ the Bin Range should be the list of upper limits from Table 1, again including the
heading (‘Upper Limit’);

■ check the Labels box (or otherwise do not include the headings);

■ for the Output range, choose a convenient spot in the data sheet somewhere near Table
1;

■ this time you should not ask for Chart Output, as we will use the Scatter Chart button to
create three frequency polygons on the same set of axes for the three data sets;

■ in each output, delete the ‘More’ row, and then transfer the frequencies to the
appropriate column of the table you have prepared.

(ii) Use the Midpoint and Frequency columns of Table 1 to obtain frequency polygons
of all three data sets on the same axes. By definition, a frequency polygon is a closed
figure: it must reach the horizontal axis at both ends. This is achieved, if needed, by
adding classes with zero frequency at both ends.

The frequency polygon can be obtained using the XY (Scatter) plot type.
In the Charts group on the Insert tab, Click on the Scatter button and select the Scatter
with Straight Lines and Markers.

Using the Select Data button, select the midpoint column and all three frequency columns
of your data table, with headings. Investigate the Legend Entries (series) box. Here you
have the opportunity to add or remove frequency polygons from the chart, change the
legend labels (by changing the name of a series) etc. Click OK. Now choose appropriate
titles and remove the gridlines as in previous Exercises.
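
The closing rule for a frequency polygon can be illustrated with a short Python sketch that builds the (midpoint, frequency) points to be plotted. The class limits and frequencies are hypothetical, equal class widths are assumed, and the function name is illustrative.

```python
def polygon_points(lower_limits, upper_limits, frequencies):
    """Return (midpoint, frequency) points for a closed frequency polygon.

    Assumes equal class widths. A zero-frequency class is added at either end
    if the first or last class has a non-zero frequency.
    """
    width = upper_limits[0] - lower_limits[0]
    points = [((lo + up) / 2, f)
              for lo, up, f in zip(lower_limits, upper_limits, frequencies)]
    if points[0][1] != 0:
        points.insert(0, (points[0][0] - width, 0))
    if points[-1][1] != 0:
        points.append((points[-1][0] + width, 0))
    return points

# Hypothetical classes 0-5, 5-10, 10-15
points = polygon_points([0, 5, 10], [5, 10, 15], [3, 7, 2])
print(points)
```

Note that when the first class starts at zero, the added closing point has a negative midpoint, so the polygon extends to the left of zero even though no data value is negative.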

(c) In a textbox, compare the distributions by responding to the following bullet points.

To create a textbox, click on the Textbox button on the Insert tab, then click and drag to
select where and how large you want the textbox to be.

● Comment on the shapes of the frequency distributions.

● Discuss the relative values of mean and median in each data set in relation to the
shape of each distribution.

● Comment on the parameters that indicate location of the distributions (e.g. compare
mean dividend yields).

● Comment on the parameters that indicate how spread out the distributions are.
(Range, Interquartile range, Standard deviation, Coefficient of variation.)

● Are there any outliers in any of the three data sets? If so, what influence do the
outliers have?

Shape:

● The frequency distributions are all unimodal and reasonably symmetrical. Note that
there are no shares with negative yields and the appearance of a substantial-looking
tail in negative territory is an artifact of the method of constructing a frequency
polygon, with class boundary values being included in the class below. If we take
account of the fact that the values shown in the negative area are in fact exactly equal
to zero, then groups A and B are if anything skewed to the right, with C slightly left
skewed. However, the amount of skew is small in all cases.
