
Now, let's have a think about some different data.

Think about a telecommunications company investigating the amount of the first bill for new
customers. What size is the average bill for new customers? Is it too high? How are new customers
faring? Let's see if we can get some information from this. Maybe we have 200 bills from recent new
customers. So, for example, here's what it looks like: we've got 42.19 for the first, then 38.45,
29.23, etc. as we go down. So clearly not in yen, a bit small for that; it looks more like Australian
dollars or American dollars. Now, this is not really looking like a pie chart, is it, because every
bill is a different number, so how would that work? And it's not really a bar chart either, because
every number is different. Interesting.
So we need a new approach here. One approach is to summarise this in terms of categories. What if we
asked how many bills are between, say, 0 dollars and 30 dollars? How many are from 30 to 60? 60 to
90? And so on. Or we could go more detailed: how many between 0 and 15? Well, we can't see any here.
How many between 15 and 30? Well, that 29.23 is in there. How many between 30 and 45? Well, the 42
and the 38 are in there. So we can start to get frequencies, and that's one way to summarise this.
Obviously there were 200 bills, so there's more information here than before; I was just showing the
top few. Here we've got classes: everything from 0 up to 15, then from 15 up to 30, 30 up to 45,
etc. In Excel, by convention, we call these bins and we give each one its upper limit: the 15, the
30, the 45, the 60, etc. So there are 71 bills from 0 to 15 and 37 from 15 to 30.

Now have a look at those frequencies. We could graph them, couldn't we, and that would look
something like this. We can see the 71: the bar is just above 70 for the frequency from 0 to 15.
Then we have the 37, and you can see now that there's a pattern here. This starts to tell you more
information, and I'll direct you to a few things. Notice the chart title, "Histogram of first bill
size". This is a histogram, and it is great for numeric data. We break the data into categories and
then we plot the frequencies; that is a histogram.
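
As a minimal sketch of the counting step just described, the Python below sorts bills into classes
defined by their upper limits, in the spirit of Excel's bins. Only the first three bill values come
from the lecture. The remaining values, the upper limits beyond 60, the function name
`frequency_table`, and the choice to count a bill that exactly equals an upper limit in the lower
class are all assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical sample: the first three values are quoted in the lecture,
# the other seven are made up for illustration.
bills = [42.19, 38.45, 29.23, 8.50, 12.00, 14.75, 95.10, 103.40, 61.00, 15.00]

# Upper limit of each class ("bins"). Limits beyond 60 are assumed.
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]


def frequency_table(values, uppers):
    """Count how many values fall in each class, identified by its upper limit.

    A value v is counted in the first class whose upper limit is >= v, so a
    bill of exactly 15 lands in the 0-to-15 class (an assumed convention).
    """
    counts = [0] * len(uppers)
    for v in values:
        i = bisect_left(uppers, v)
        if i == len(uppers):
            raise ValueError(f"{v} is above the largest upper limit")
        counts[i] += 1
    return counts


frequencies = frequency_table(bills, upper_limits)
for upper, count in zip(upper_limits, frequencies):
    print(f"up to {upper:>3}: {count}")
```

With the real 200 bills in place of this made-up sample, the same loop would produce the kind of
frequency table the lecture shows, where the first two classes hold 71 and 37 bills.
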
Notice my y-axis: I have it nicely labelled "Frequency". My x-axis I have labelled "First monthly
bill", and notice I'm telling the reader that the value shown is the upper limit of the class,
because it's not that there were 71 bills of exactly 15 dollars. No, no, no. So I've told the reader
it's the upper limit of the class; sometimes you might instead want to write 0 up to 15, 15 up to
30, etc.

So what kind of interpretation would we be looking at here? Well, if we break this down, we can see
that a huge number of bills are very small, in fact about half. There are a few bills in the middle
range, but not many: not many bills between 30 and 75. But there's actually a relatively large
number that are quite high, in that top area above 75. So notice the histogram tells us this. This
is a great example of data going to information.
Data: oh, a bill was 42.23, the next bill was 67.25, the next bill was 40. That doesn't tell you
anything, but seeing the histogram, that tells you something. Lots of bills are small, but there are
some customers who are having really large bills initially. Are they going to leave us? Why is that?
Maybe we want to investigate more to see if we need to do something for them.

So frequency is one thing, but there's
another thing called relative frequency. Sometimes we don't so much care that there are 71 bills; we
care that about 36% of bills are in the first category. So it's not the absolute number, it's the
relative frequency that we care about. And relative frequency is simply the frequency, the 71 for
the first category, divided by the total number of observations, so 71 out of 200, which is 0.355 or
about 36%. So it tells us in percentage terms how many. And this can sometimes give us an insight
into the whole population of bills, given that we only have a sample of 200. We don't care about the
71 itself, because in fact there are thousands and thousands of bills, but we might say, oh, it
looks like about 36 percent are in that 0 to 15 category. You can also compare histograms based on
different sample sizes, being careful to consider whether that's appropriate, but you can compare in
percentage terms rather than absolute numbers, because somebody else might have taken a sample of
three hundred bills. They're obviously going to have different numbers in the different categories.

So what does this look like? Exactly the same as before, but notice now we have relative frequency.
So we can see that 36%. How is that calculated? 71 divided by 200. Now let's just do some simple
logic checks here. Firstly, we should have already checked that all the frequencies add up to... 200,
you got it right. All the relative frequencies have to add up to... 100%, or 1 exactly. So those are
some great logic checks.
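
The relative frequency calculation and the two logic checks can be sketched in a few lines of
Python. Only the first two counts, 71 and 37, come from the lecture; the other six counts are
hypothetical values chosen so that the total is 200, and the function name `relative_frequencies`
is likewise just an illustrative choice.

```python
import math

# Frequency per class. Only 71 and 37 are from the lecture; the remaining
# six counts are hypothetical and were chosen so the total comes to 200.
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]


def relative_frequencies(freqs):
    """Divide each class frequency by the total number of observations."""
    total = sum(freqs)
    return [f / total for f in freqs]


total = sum(frequencies)
relative = relative_frequencies(frequencies)

# Logic check 1: the frequencies add up to the sample size.
assert total == 200
# Logic check 2: the relative frequencies add up to 1 (that is, 100%).
assert math.isclose(sum(relative), 1.0)

print(f"{relative[0]:.3f}")  # 71 / 200, prints 0.355
```

The comparison with `math.isclose` rather than `==` is a small precaution: adding floating-point
fractions can leave a tiny rounding error even when the arithmetic is right.
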
So let's have a look at how this looks. Oh, exactly the same. What's the difference? The only
difference here: look at the y-axis, "Relative frequency". So now we can see it's 0, 0.05, and so
on; it's not the actual number, it's the relative frequency. The only thing that changes is the
y-axis.

This was all well and good, but how did we know we should have done 0 to 15, 15 to 30, 30 to 45? How
do we know that those are the categories we want? This is actually an art, not an exact science. So the
number of categories depends on the size of the data set, on how many observations you want in each
category and, more importantly, on business sense. What do you care about as a manager? What are the
categories that you might care about? If we're looking at people's ages, we often group people by
decade, so we do 20 to 29, 30 to 39, 40 to 49. For bills, maybe we care about 15 dollar increments;
maybe historically that's a key amount. So it's about using a lot of business and common sense, but
just remember that there's a general recommendation to have between 5 and 15 categories. If you have
fewer than five, you're not really showing much; you're just showing that everything is in one big
lump, and that doesn't show much about the shape, does it? And if you have more than 15 categories,
it's very hard for our minds to take in what's going on when we look at it. Notice that it's not a
hard and fast rule; getting these graphs right is a bit of an art form. Sometimes you will need
more, and maybe sometimes you only need 4, but if you have to go outside this range you're more
likely to need more than 15.

So let's summarise. For a histogram, we collect the data, we prepare a frequency distribution or
relative frequency distribution, which was just that table, and then we draw the histogram or the
relative frequency histogram. If we care about the absolute values, we'll just use a normal
histogram; if we care more about the proportion, the percentage in each category, we'll use the
relative frequency histogram.
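
The 5-to-15 guideline discussed earlier can be checked quickly before drawing anything. The sketch
below is only an illustration: the helper names `class_count` and `within_guideline` are
hypothetical, and the range of 0 to 120 dollars is an assumed range for the bills, not a figure
from the lecture.

```python
import math


def class_count(low, high, width):
    """Number of equal-width classes needed to cover the range low..high."""
    return math.ceil((high - low) / width)


def within_guideline(n_classes, minimum=5, maximum=15):
    """True if the class count falls inside the general 5-to-15 recommendation."""
    return minimum <= n_classes <= maximum


# Assumed range: bills from 0 up to 120 dollars (the upper end is hypothetical).
for width in (5, 15, 30):
    n = class_count(0, 120, width)
    print(f"width {width}: {n} classes, within guideline: {within_guideline(n)}")
```

Under that assumed range, 15 dollar classes give 8 categories, which sits inside the guideline,
while 5 dollar classes give 24 and 30 dollar classes give only 4. As the lecture stresses, this is
a rule of thumb, so a failed check is a prompt to think again rather than an error.
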
