Now, let's have a think about some different data.
Think about a telecommunications company investigating the amount of the first bill for new customers. What kind of size is the average bill for new customers? Is it too high? How are new customers faring? Let's see if we can get some information from this. Maybe we have 200 bills from recent new customers. So, for example, here's what it looks like: we've got 42.19 for the first, then 38.45, 29.23 and so on as we go down. So clearly not in yen, it's a bit small for that; it looks like it's more in Australian dollars or American dollars.

Now this is not really looking like a pie chart, is it, because every bill is a different number. It's not really a bar chart either, for the same reason. Interesting. So we need a new approach here. One approach is to summarise the data in terms of categories. So what if we asked how many bills are between, say, 0 dollars and 30 dollars? How many are from 30 to 60? From 60 to 90? And so on. Or we could go more detailed: how many between 0 and 15? Well, we can't see any here. How many between 15 and 30? Well, that 29.23 is in there. How many between 30 and 45? Well, the 42 and the 38 are in there. Okay, so we can start to get frequencies, and that's one way to summarise this.

Obviously there were 200 bills, so there's more information here; I was just showing the top few. But here we've got classes: everything from 0 up to 15, then from 15 up to 30, 30 up to 45 and so on. In Excel, by convention, we call these bins and we label each one with its upper limit: the 15, the 30, the 45, the 60 and so on. So there are 71 bills from 0 to 15 and 37 from 15 to 30. Now have a look at those frequencies. We could graph those frequencies, couldn't we, and that would look something like this. We can see the 71: the bar is just above 70 for the frequency from 0 to 15. Then we can see the 37, and you can see now that there's a pattern here.
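The counting into classes described above can be sketched in a few lines of Python. This is only an illustration, not how the lecture's table was actually built. The function name `frequency_table` is made up for this sketch, the upper limits past 60 are an assumption (the lecture only says "etc."), and a value exactly equal to an upper limit is assumed to be counted in that class. The data is just the three bills shown, not the full 200.

```python
from bisect import bisect_left


def frequency_table(values, upper_limits):
    """Count how many values fall in each class.

    Each class is identified by its upper limit (the "bin").
    Assumption: a value equal to an upper limit belongs to that class.
    upper_limits must be sorted in increasing order.
    """
    counts = [0] * len(upper_limits)
    for value in values:
        # Index of the first upper limit that is >= value.
        i = bisect_left(upper_limits, value)
        if i == len(upper_limits):
            raise ValueError(f"{value} is above the last upper limit")
        counts[i] += 1
    return counts


# Upper limits of the classes; everything past 60 is an assumption.
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]

# Only the three bills shown in the lecture, not the full sample of 200.
bills = [42.19, 38.45, 29.23]

counts = frequency_table(bills, upper_limits)
for limit, count in zip(upper_limits, counts):
    print(f"up to {limit}: {count}")
```

With only these three bills, the 29.23 lands in the class with upper limit 30, and the 42.19 and 38.45 both land in the class with upper limit 45, which matches the counting done by hand above.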
This starts to tell you more information, and I'll direct you to a few things here. Notice the chart title, "Histogram of first bill size". This is a histogram, and it is great for numeric data like this. We break the data into categories and then we plot the frequencies, and that gives us a histogram. Notice my y-axis: I have it nicely labelled "Frequency". My x-axis I have labelled "First monthly bill", and notice I'm telling the reader that the number shown is the upper limit of the class, because it's not that there were 71 bills of exactly 15 dollars, no no no. So I've told the reader it's the upper limit of the class. Sometimes you might want to write 0 up to 15, 15 up to 30 and so on instead.

So what kind of interpretation would we be looking at here? Well, if we break this down we can see that a huge number of bills are very small, in fact about half. There are a few bills in the middle range, but not many: not many bills between about 30 and 75. But there's actually a relatively large number that are quite high, in that top area above 75. So notice the histogram tells us this. This is a great example of data going to information. Data is: a bill was 42.23, the next bill was 67.25, the next bill was 40. That doesn't tell you anything, but seeing the histogram tells you something. Lots of bills are small, but there are some customers who are having really large bills initially. Are they going to leave us? Why is that? Maybe we want to investigate more to see if we need to do something for them.

So frequency is one thing, but there's another thing called relative frequency. Sometimes we don't so much care that there are 71 bills in the first category; we care that about 36% of bills are in the first category. So it's not the absolute number, it's the relative frequency that we care about.
And relative frequency is simply the frequency, the 71 for the first category, divided by the total number of observations, so 71 out of 200. It tells us in percentage terms how many bills are in the category. This can sometimes give us an insight into the whole population of bills, given that we only have a sample of 200. We don't care about the 71 as such, because in fact there are thousands and thousands of bills, but we might say it looks like about 36 percent are in that 0 to 15 category. You can also compare histograms based on different sample sizes, being careful to consider whether that's appropriate, because you compare in percentage terms rather than absolute numbers. Somebody else might have taken a sample of three hundred bills, so they're obviously going to have different numbers in each category.

So what does this look like? Exactly the same table as before, but notice now we have relative frequency. We can see that 36%. How is that calculated? 71 divided by 200, which is 0.355. Now let's just do some simple logic checks here. Firstly, we should have already checked what all the frequencies add up to: 200, you got it right. All the relative frequencies have to add up to 100%, or exactly 1. So those are some great logic checks. Now let's have a look at how this looks as a chart. Oh, exactly the same, so what's the difference? The only difference is the y-axis: relative frequency. So now we can see it's 0, 0.05 and so on. It's not the actual number, it's the relative frequency. The only thing that changes is the y-axis.

So this was all well and good, but how did we know we should have used 0 to 15, 15 to 30, 30 to 45? How do we know that those are the categories we want? This is actually an art, not an exact science. The number of categories depends on the size of the data set, on how many observations you want in each category and, more importantly, on business sense. What do you care about as a manager? What are the categories that you might care about?
If we're looking at people's ages, we often group people by decade, so we use 20 to 29, 30 to 39, 40 to 49. For bills, maybe we care about 15 dollar increments because historically that's a key amount. So it's about using a lot of business sense and common sense, but just remember that there's a general recommendation to have between 5 and 15 categories. If you have fewer than five, you're not really showing anything; you're just showing that everything is in one big lump, and that doesn't show much about the shape, does it? And if you have more than 15 categories, it's very hard for our minds to take in what's going on when we look at the chart. Notice that it's not a hard and fast rule; getting these graphs right is a bit of an art form. Sometimes you will need more than 15, and maybe sometimes you only need 4, but if you have to go outside the range it's more likely to be above 15 than below 5.

So let's just summarise the histogram here. We collect the data. We prepare a frequency distribution or a relative frequency distribution, which was just that table. Then we can draw the histogram or the relative frequency histogram. Which one we draw depends on what we care about: if we care about the absolute numbers we'll just use a normal histogram, and if we care more about the proportion, the percentage in each category, we'll use a relative frequency histogram.
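As a rough sketch of the relative frequency step and the two logic checks, here is one way it could be written in Python. The function name `relative_frequencies` is made up for this example. The only figures taken from the lecture are the 71 out of 200 and the three bills shown (one in the 15 to 30 class, two in the 30 to 45 class); the rest of the 200-bill table is not reproduced here.

```python
import math


def relative_frequencies(counts):
    """Divide each class frequency by the total number of observations."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations to summarise")
    return [count / total for count in counts]


# The figure quoted in the lecture: 71 bills out of 200 in the first class.
first_class_share = 71 / 200
print(f"{first_class_share:.1%}")  # 35.5%, which the lecture rounds to 36%

# Frequencies for the three bills shown, for the classes with
# upper limits 15, 30 and 45 (a tiny illustration, not the full sample).
sample_counts = [0, 1, 2]
sample_relative = relative_frequencies(sample_counts)

# Logic check 1: the frequencies add up to the number of observations.
assert sum(sample_counts) == 3

# Logic check 2: the relative frequencies add up to 1 (that is, 100%).
# math.isclose allows for floating-point rounding.
assert math.isclose(sum(sample_relative), 1.0)
```

Because each relative frequency is divided by its own sample size, the same function gives comparable numbers for a sample of 200 and a sample of 300, which is the point made earlier about comparing histograms in percentage terms.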