
Describing your data with graphs and numbers

1. Numbers or pictures?
To be able to see patterns in data, you need to summarise it. There are many ways of summarising data, and no one way is the correct way. Summarising data is like examining a patient. You need to use several different strategies to be able to build up a full picture. In statistics, the two main techniques of summarising data are summary statistics and graphics. You can think of these as lab tests and diagnostic images. Lab tests give you numbers which you must interpret by thinking about them. Diagnostic images, on the other hand, tap the power of your visual perception. In particular, something you didn't suspect may emerge from a diagnostic image, whereas lab tests are highly specific: you only get the results you asked for. So before you do summary statistics like averages, percentages and so on, it is always a good idea to look at the data using graphics.

2. Proportions and rates


With categorical or dichotomous data, we often summarise our findings by giving proportions or rates. A proportion is simply

proportion = (number of times something happened) / (number of times it could have happened)

This gives us a number between zero and one. Since most people hate decimals, we often quote proportions as percentages, by multiplying by 100. You are more likely to be understood if you say that 10% of women attending antenatal clinics test positive for chlamydia than if you quote a proportion of 0.1.

In medicine, many figures are proportions. Sensitivity, specificity, and positive and negative predictive values are all proportions (and are often reported as percentages to make them easier to understand). Prevalence is a proportion too, but most diseases have very small prevalences, and so we usually report prevalences as cases per thousand or per ten thousand.

A rate is a proportion per unit time.

Incidence

The incidence of a disease is the number of cases per unit population per unit time. We often use 1,000 or 10,000 as the unit of population and a year as the unit of time. Rarer diseases are reported using 100,000. The incidence of cervical cancer in the UK is 9 women per 100,000 per year, for example.
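As a small illustration of this arithmetic, here is a sketch in Python. The counts are made up for the example, not taken from the text:

```python
# A sketch of the arithmetic above (counts are hypothetical).
cases = 52            # number of times something happened
population = 520      # number of times it could have happened

proportion = cases / population          # a number between 0 and 1
percentage = 100 * proportion            # easier to read as a percentage

# An incidence rate is a proportion per unit time, scaled to a convenient
# unit of population, e.g. cases per 100,000 people per year.
new_cases_per_year = 45                  # hypothetical annual case count
population_at_risk = 500_000             # hypothetical population
incidence_per_100k = 100_000 * new_cases_per_year / population_at_risk

print(f"proportion = {proportion:.2f}, percentage = {percentage:.0f}%")
print(f"incidence  = {incidence_per_100k:.0f} per 100,000 per year")
```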

Many health indices are rates. The death rate is an obvious example, as are infant mortality, neonatal mortality and other indices of child survival. Here are incidence rates for pregnancy outcome from a study done in the Rotunda hospital.

Outcome                      Number    Percent    95% Confidence interval
Normal blood pressure        334       88.1%      84.5% to 91.0%
Pre-eclampsia                20        5.3%       3.4% to 8.0%
Gestational hypertension     25        6.6%       4.5% to 9.6%
Total                        379

Gestational hypertension is a benign condition in which the mother develops high blood pressure with a higher than normal cardiac output, but the only effect this has on the developing baby is that the baby benefits from the extra blood supply and is typically bigger at birth than babies whose mothers had normal blood pressure. Pre-eclampsia, on the other hand, is a condition in which the mother's cardiac output falls in the final trimester of pregnancy and her peripheral resistance rises. This high-pressure, low-flow blood supply is good for neither mother nor baby, and has to be managed very carefully.

There were 334 patients with a normal outcome, 25 with gestational hypertension (GH) and 20 with pre-eclampsia (PET). The rate of occurrence of each category is given in the column headed Percent: gestational hypertension occurred in 6.6% of pregnancies and pre-eclampsia in 5.3%. This rate of occurrence is the incidence - the number of cases per unit population per unit time. In this case, the unit of population is 100 and the unit of time is the duration of pregnancy, so a 6.6% incidence means an incidence of 6.6 cases per hundred pregnancies.

The table also shows a confidence interval for each proportion, which is useful. The confidence interval for the pre-eclampsia rate is 3.4% to 8.0%. This means that although we found a 5.3% incidence of pre-eclampsia, we could be wrong: the real rate could be as low as 3.4% or as high as 8.0%. It is unlikely to be lower or higher than that, where "unlikely" means "less than 5% likely".
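The text does not say how these confidence intervals were calculated, but one common method, the Wilson score interval, gives figures very close to those in the table. Here is a sketch applied to the counts above; the function name wilson_ci is just for this example:

```python
import math

def wilson_ci(events, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p = events / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Counts from the table: 379 pregnancies in total.
for label, events in [("Normal", 334), ("Pre-eclampsia", 20), ("Gestational HT", 25)]:
    lo, hi = wilson_ci(events, 379)
    print(f"{label:15s} {100*events/379:5.1f}%  ({100*lo:.1f}% to {100*hi:.1f}%)")
```

For pre-eclampsia this gives roughly 3.4% to 8.0%, matching the table, though the original analysis may have used a different method.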


Graphing proportions: bar charts


[Bar chart of outcome: number of pregnancies in each category (Normal, Gestational hypertension, Pre-eclampsia)]

Graphing proportions: pie charts

[Pie chart of outcome: Normal 88%, Gestational hypertension 7%, Pre-eclampsia 5%]

With only three categories, neither the bar chart nor the pie chart tells us anything interesting. Most women had a normal outcome. The incidence of gestational hypertension was slightly higher than the incidence of PET, which is just about visible from the pie chart, and not at all visible in the bar chart. But the graphs do not justify the space they take up by telling us something we didn't know. In this case, numbers were the best summary, and the table also allowed us to give confidence intervals.
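For completeness, here is a minimal sketch of how charts like those above could be drawn with matplotlib, using the counts from the table; the variable names are just for illustration:

```python
import matplotlib.pyplot as plt

outcomes = ["Normal", "Gestational HT", "Pre-eclampsia"]
counts = [334, 25, 20]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(9, 4))

# Bar chart of raw counts.
ax_bar.bar(outcomes, counts)
ax_bar.set_ylabel("Number of pregnancies")
ax_bar.set_title("Outcome")

# Pie chart of the same counts, labelled with percentages.
ax_pie.pie(counts, labels=outcomes, autopct="%1.0f%%")
ax_pie.set_title("Outcome")

plt.tight_layout()
plt.show()
```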


3. How two things relate to each other


We might be interested in the relationship between smoking and outcome. The stacked bar chart below shows how the outcomes were distributed among smokers and among non-smokers.
[Stacked bar chart: percentage of each outcome (Gestational HT, Normal, Pre-eclampsia) for smokers and for non-smokers]

Contingency Table (also known as a crosstabulation)

Outcome            Smoker         Non-smoker
Gestational HT     6 (3.8%)       19 (8.6%)
Normal             146 (91.8%)    188 (85.5%)
Pre-eclampsia      7 (4.4%)       13 (5.9%)

Women who smoke appear to have a lower rate of hypertension, especially of gestational hypertension. This is visible in the stacked bar chart. The table tells us that the rate of pre-eclampsia isn't strikingly lower (it's 4.4% in smokers and 5.9% in non-smokers) but the rate of gestational hypertension is quite different: 3.8% in smokers and 8.6% in non-smokers. This might lead us to a couple of questions that need formal statistical testing. The first is whether smoking really affects the risk of hypertensive complications, and the second is whether the effect is the same for both types of hypertension, or whether smoking exerts a differential effect on the risk of gestational hypertension and pre-eclampsia.
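As a sketch of how the crosstabulation, the stacked bar chart and a formal test might be set up in Python (the chi-squared test is one standard choice here, though the text does not specify which test it has in mind):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency

# Counts from the contingency table above (rows: outcome, columns: smoking status).
table = pd.DataFrame(
    {"Smoker": [6, 146, 7], "Non-smoker": [19, 188, 13]},
    index=["Gestational HT", "Normal", "Pre-eclampsia"],
)

# Column percentages, as in the table (each column sums to 100%).
percentages = 100 * table / table.sum()
print(percentages.round(1))

# A stacked bar chart of the percentages, one bar per smoking group.
percentages.T.plot(kind="bar", stacked=True)
plt.ylabel("Percent")
plt.show()

# Chi-squared test of association between smoking and outcome.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```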

4. Looking at numeric data


Here is a graph and table describing cardiac output, in litres per minute, at about the 36th week of pregnancy. On the left you can see a histogram, and a boxplot. The boxplot has some features you don't usually see in a boxplot: the diamond shape inside the box shows the mean and its confidence interval. The little scattering of dots at the upper end of the boxplot shows outliers, unusually high cardiac outputs.

[Histogram and boxplot of cardiac output (litres per minute) at 36 weeks]

Quantiles
100.0% (maximum)     10.7
99.5%                10.5
97.5%                 9.3
90.0%                 8.2
75.0% (quartile)      7.1
50.0% (median)        6.4
25.0% (quartile)      5.8
10.0%                 5.3
2.5%                  4.9
0.5%                  4.3
0.0% (minimum)        4.2

Moments
Mean                  6.6
Std Dev               1.1
95% CI for Mean       6.4 to 6.7
Number                256

On the right are quantiles of the distribution of cardiac output, as well as the mean, standard deviation and 95% confidence interval. It's a formidable table of numbers, and it is hard to read them and form a clear picture of cardiac output. When you compare the tables on the right with the graphs on the left, you can see the advantages that graphs have as a way of quickly forming a mental picture of your data.
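The raw cardiac output data aren't reproduced here, so the following sketch uses a simulated sample as a stand-in, but it shows how summaries like those above could be produced for any numeric variable:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Simulated stand-in for the 256 cardiac output measurements (litres/minute).
cardiac_output = rng.normal(loc=6.6, scale=1.1, size=256)

# Quantiles, as in the table above.
for q in [0, 0.5, 2.5, 10, 25, 50, 75, 90, 97.5, 99.5, 100]:
    print(f"{q:5.1f}%  {np.percentile(cardiac_output, q):.1f}")

# Moments: mean, standard deviation and a 95% CI for the mean.
mean = cardiac_output.mean()
sd = cardiac_output.std(ddof=1)
se = sd / np.sqrt(len(cardiac_output))
print(f"mean = {mean:.1f}, SD = {sd:.1f}, "
      f"95% CI {mean - 1.96*se:.1f} to {mean + 1.96*se:.1f}")

# Histogram and boxplot, side by side.
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 4))
ax_hist.hist(cardiac_output, bins=20)
ax_hist.set_xlabel("Cardiac output (l/min)")
ax_box.boxplot(cardiac_output)
ax_box.set_ylabel("Cardiac output (l/min)")
plt.show()
```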

The mean, and other ways of describing the centre of the data

We would like to know something about typical cardiac output in late pregnancy. The average output is 6.6 litres a minute (which is quite impressive: imagine emptying a litre carton of milk roughly every ten seconds). But what does that mean? Does it mean that a typical woman will have a cardiac output of 6.6 litres a minute? Not necessarily. The mean may not be the most frequently-occurring value, and may not even be a value that has been observed at all.

Does it mean that there are more or less as many women above 6.6 litres as below? No - that figure is the median (or 50th percentile). The median marks the value at or below which half the data fall, so half the women have an output of 6.4 litres a minute or less. We can build up a more detailed picture using other quantiles: a quarter of the women have outputs of 5.8 litres or less, while three quarters have an output of 7.1 litres or less.

So the mean can be misleading if the data are very asymmetrical, that is, if there is a long 'tail' of high or low values. In this case, the extreme values 'pull' the mean in their direction. This often happens in real life. Think of the average length of stay in hospital: it will be influenced by a 'tail' of people who stayed unusually long, and these people will make the mean length of stay longer than the median. So if you are reporting on length of stay, it is easier and less misleading to say 'half the patients were discharged in 9 days or less' than it is to say 'the average length of stay was 12 days'. The median is relatively immune to this problem; for this reason it is called a robust statistic. It is also easier to explain. Finally, there is a statistical term for asymmetrical, which you may see in published papers: skewed.
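A tiny sketch of the length-of-stay point, with made-up numbers: one long stay is enough to pull the mean well above the median.

```python
import numpy as np

# Hypothetical lengths of stay in days: most patients go home quickly,
# but one patient stays a very long time.
length_of_stay = np.array([3, 4, 5, 6, 7, 8, 9, 10, 12, 120])

print("mean   =", np.mean(length_of_stay))    # pulled up by the 120-day stay
print("median =", np.median(length_of_stay))  # barely affected by it
```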

Skewed data: an example

[Graph: distribution of social isolation scores in people with hepatitis C, with a smoothed curve and a boxplot at the top]

You can see from the graph that a lot of people report low levels of isolation, while small numbers report quite high levels. The smooth curve shows a long 'tail' on the right. These small numbers of high levels cause the data to be asymmetrical, or skewed. The tail of high values will affect the average: in this case, the mean is 6.8, while the median is only 6. You can see the mean and the median in the boxplot at the top of the graph. The median is the bar inside the box, while the diamond shape shows the mean and its confidence interval.

Quantiles
The median is an example of a quantile. Quantiles are values below which a certain proportion of the data lies. The table gives a number of quantiles, allowing you to build up a detailed picture of cardiac output. Only 2.5% of women have a cardiac output of 4.9 litres or less, so this would constitute one definition of 'abnormally low'. Likewise, 97.5% have outputs of 9.3 litres or less, so an output greater than 9.3 could be defined as 'abnormally high'. This method of defining 'normal' is purely statistical: we define 'statistically normal' so that 95% of people are normal. But we cannot assume that there are any health risks attached to either 'abnormally high' or 'abnormally low' cardiac output. An unusually high cardiac output may not be harmful (in pregnancy it isn't, in fact), and, for example, an unusually low cholesterol isn't bad either (though a sudden drop in cholesterol isn't a good sign). And unusually good eyesight is a downright advantage! Nevertheless, many laboratories give 'reference ranges' for biochemical tests, which are the range that would be expected to cover about 95% of the population.
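A short sketch of a purely statistical 95% reference range of this kind, computed from percentiles (a simulated sample again stands in for the real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=6.6, scale=1.1, size=256)  # simulated stand-in data

# The central 95% of the data: from the 2.5th to the 97.5th percentile.
low, high = np.percentile(values, [2.5, 97.5])
print(f"reference range: {low:.1f} to {high:.1f} litres/minute")
```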

Finding out about the spread of the data


Quantiles are useful for getting an idea of the spread of the data. The 'statistically normal' range is one measure of spread. Other quantiles will help to build up the idea of spread. A quarter of women had cardiac outputs of 5.8 or less and three quarters had cardiac outputs of 7.1 or less. So a typical cardiac output would be between 5.8 and 7.1; a quarter of women had cardiac outputs higher than this and a quarter lower. These numbers correspond to the ends of the 'box' part of the boxplot.

Quartiles, quintiles and their friends

We can use quantiles to divide the data into groups. The 25th, 50th and 75th percentiles divide the data into four equal-sized groups. These groups are called quartiles. Here they are for cardiac output:

Quartile 1    up to 5.8 (up to the 25th percentile)
Quartile 2    5.8 to 6.4 (25th to 50th percentile)
Quartile 3    6.4 to 7.1 (50th to 75th percentile)
Quartile 4    over 7.1 (over the 75th percentile)

So the boxplot is showing you quartiles, like rungs of a ladder. You can also make five groups by dividing your data at the 20th, 40th, 60th and 80th percentiles. These are called quintiles. Making three groups gives you tertiles (which percentiles do you need?) and making ten groups gives you deciles. Other groupings are possible. (What percentiles do you need to make 8 groups? How about six?) However, the only ones you meet in practice are tertiles, quartiles, quintiles, deciles and percentiles.
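As a sketch, dividing a numeric variable into quartile or quintile groups is straightforward with pandas (simulated data again, since the real dataset isn't reproduced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cardiac_output = pd.Series(rng.normal(loc=6.6, scale=1.1, size=256))

# Four equal-sized groups (quartiles) and five (quintiles).
quartile_group = pd.qcut(cardiac_output, 4, labels=["Q1", "Q2", "Q3", "Q4"])
quintile_group = pd.qcut(cardiac_output, 5, labels=False)  # groups numbered 0..4

print(quartile_group.value_counts().sort_index())   # 64 observations per quartile
print(cardiac_output.quantile([1/3, 2/3]).round(1))  # cut points needed for tertiles
```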

The standard deviation: a measure of spread



The standard deviation is a statistic that causes students a lot of headaches. And although it is regularly quoted in scientific papers, most people don't read or understand it. I used to teach it conscientiously. I place less emphasis on it now because a) it tells the average reader a lot less than a boxplot and b) it is misleading when data are skewed or have outliers, which is very often the case in medicine. So don't worry if you don't understand all of the next bit.

Have a look at this example.

The plot shows the depression scores of people with hepatitis C. It is clear that those who get the disease through contaminated blood products for treatment of haemophilia have lower depression scores than anyone else. But they also have less variation in their depression scores. How can we measure this variation? One solution is to find the height of the 'box' part of the boxplot. This is the 75th percentile minus the 25th. This statistic is called the interquartile range. However, the standard deviation is another way of measuring spread.

The standard deviation is a measure of variation or scatter. The problem of trying to measure the scatter or variation in the depression scores is like trying to measure how scattered the houses in a city are. Some cities are densely built, with people living very close together, while other cities are very spread out. You can think of the IVDA group as a sprawling city, and the haemophilia group as a tightly-packed city.

If we were measuring scatter in cities, the most logical thing would be to take each person and find the distance they had to travel to get to the city centre. The average distance from the city centre would be a measure of how scattered the city was. The same logic works with data. If we can define the 'centre of the data' and the 'distance from the centre', then we can find the average distance from the centre of the data. This will measure the scatter of the data.

The mean is used as the centre of the data. To find how far each piece of data is from the centre, we could just subtract (remember that distance cannot be negative, so we would have to ignore minus signs!). However, for reasons that you thankfully don't have to know about, we measure distance as the squared distance between each piece of data and the mean. The average squared distance from the mean is called the variance. The variance is a measure of variation around the mean that is used a lot in the calculation of other statistics, but when reporting on data, we often use the standard deviation, which is the square root of the variance. This is because the standard deviation is quite an informative statistic; if chance alone affects the variation of your data, then the following are true:

50% of the data lie in the range mean +/- 70% of a SD
68% (about 2/3) lie in the range mean +/- 1 SD
95% lie in the range mean +/- 2 SD
99% lie in the range mean +/- 2.6 SD
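Here is a sketch of the calculation just described, with a quick simulation to check the 'ready reckoner' coverage figures. Note that statistical software usually divides by n - 1 rather than n when averaging the squared distances; with a few hundred observations the difference is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=6.6, scale=1.1, size=256)  # simulated stand-in data

# Variance: the average squared distance from the mean.
mean = data.mean()
variance = np.mean((data - mean) ** 2)
sd = np.sqrt(variance)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, SD = {sd:.2f}")

# Check the coverage rules for normally distributed data.
for k, expected in [(0.7, "about 50%"), (1.0, "about 68%"),
                    (2.0, "about 95%"), (2.6, "about 99%")]:
    within = np.mean(np.abs(data - mean) <= k * sd)
    print(f"within mean +/- {k} SD: {100*within:.0f}% (expected {expected})")
```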

So if someone tells you that the mean storage life for an antibiotic is 120 weeks, with a standard deviation of 4 weeks, you can conclude that 50% of the antibiotics will become ineffective between 117.2 weeks and 122.8 weeks (120 minus/plus 70% of the standard deviation). About two thirds will expire between 116 and 124 weeks (and, of course, this implies that of the remaining one third, half will last longer than 124 weeks and half will expire before 116 weeks). Ninety-nine percent of the antibiotics will expire in the period 109.6 weeks to 130.4 weeks (mean minus/plus 2.6 SD). The remaining one percent will have lives shorter or longer than this period - half shorter, half longer. So there is only a half-a-percent chance that an antibiotic will become ineffective before 109.6 weeks - a useful piece of information.

Apart from the 'ready reckoner' information given above for the standard deviation, there are tables which allow you to answer questions like 'how many cases will fall further than a certain distance above the mean?' or 'what interval around the mean will contain a given percentage of cases?' By measuring the distance between a value and the mean in standard deviations, you can tell how likely or unlikely it is. An antibiotic that went off after 108 weeks is 12 weeks short of the mean - that is, 3 standard deviations below the mean. We know that 99% of cases will fall within 2.6 standard deviations of the mean, so clearly we are dealing with a very unusual case. In fact, if you use statistical tables, you will find that only 0.135% of cases will fall three standard deviations or more below the mean. This is the basis for the z-test - a statistical test found in every textbook and virtually never used in practice.

A lot of people worry about the standard deviation; it's a hard concept. Most people who give standard deviations in scientific papers do so simply because they think it's expected, and most readers who read scientific papers don't understand them. In fact, using percentiles you can communicate more about your data in a way that everyone can understand, and using boxplots is even more informative, since the eye will detect patterns.

What should you know about the standard deviation?
1. That it measures the scatter of the data around the mean.
2. That when data are normally distributed, two thirds will be within one standard deviation of the mean, and 95% within 2 standard deviations.
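The same questions can be answered without tables, using the normal distribution in scipy. This sketch works through the antibiotic example; the 120-week mean and 4-week SD are the figures used in the text.

```python
from scipy.stats import norm

mean, sd = 120, 4          # mean storage life and standard deviation, in weeks
value = 108                # an antibiotic that went off at 108 weeks

z = (value - mean) / sd    # distance from the mean, in standard deviations
tail = norm.cdf(z)         # proportion of cases at least this far below the mean

print(f"z = {z:.1f}")                         # -3.0
print(f"proportion below: {100*tail:.3f}%")   # about 0.135%

# The interval expected to contain 99% of cases: mean +/- about 2.6 SD.
low, high = norm.ppf([0.005, 0.995], loc=mean, scale=sd)
print(f"99% of storage lives fall between {low:.1f} and {high:.1f} weeks")
```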

