
COURSECODE: NCM-6308/NCM-5308 BIOSTATISTICS/PRELIMS

FREQUENCY DISTRIBUTION

DISTRIBUTIONS
The first step in turning data into information is to create a
distribution. The most primitive way to present a distribution is to
simply list, in one column, each value that occurs in the population
and, in the next column, the number of times it occurs. It is
customary to list the values from lowest to highest. This simple listing
is called a frequency distribution. A more elegant way to turn data
into information is to draw a graph of the distribution. Customarily,
the values that occur are put along the horizontal axis and the
frequency of the value is on the vertical axis.
Frequency tells you how often something happened. The
frequency of an observation tells you the number of times the
observation occurs in the data. For example, in the following list of
numbers, the frequency of the number 9 is 5 (because it occurs 5
times):

1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9.
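The listing described above can be sketched in a few lines of Python. This is only an illustration using the standard library's `Counter`; the variable names are our own choice.

```python
from collections import Counter

# The list of numbers from the example above.
data = [1, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 6, 9]

# Count how many times each value occurs.
frequency = Counter(data)

# Print the frequency distribution, values listed from lowest to highest.
print("Value  Frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}")
```

Looking up `frequency[9]` gives 5, which agrees with the count stated in the text.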

Tables can show either categorical variables (sometimes called
qualitative variables) or quantitative variables (sometimes called
numeric variables). You can think of categorical variables as
categories (like eye color or brand of dog food) and quantitative
variables as numbers.
The following table shows what family planning methods were
used by teens in Kweneng, West Botswana. The left column shows
the categorical variable (Method) and the right column is the
frequency — the number of teens using that particular method
(image courtesy of KSU).

Frequency distribution tables give you a snapshot of the data to
allow you to find patterns. A quick look at the above frequency
distribution table tells you the majority of teens don’t use any birth
control at all.

How to make a Frequency Distribution Table:


DESCRIPTIVE STATISTICS

An early and fundamental stage in any science is the descriptive
stage. Until phenomena can be accurately described, an analysis of
their causes is premature. The question "What?" comes before "How?"
Unless we know how a variable is usually distributed in a population,
we shall be unable to ascertain the effect of a given dose of a drug
upon that variable. In a sizable sample, it would be tedious to obtain
our knowledge of the material by contemplating each individual
observation. We need some form of summary to permit us to deal
with the data in manageable form, as well as to be able to share our
findings with others in scientific talks and publications. A histogram
or bar diagram of the frequency distribution would be one type of
summary. However, for most purposes, a numerical summary is
needed to describe concisely, yet accurately, the properties of the
observed frequency distribution. Quantities providing such a
summary are called descriptive statistics. This module will introduce
you to some of them and show how they are computed.
Two kinds of descriptive statistics will be discussed in this
module: statistics of location and statistics of dispersion. The
statistics of location (also known as measures of central tendency)
describe the position of a sample along a given dimension
representing a variable. For example, after we measure the length of
the animals within a sample, we will then want to know whether the
animals are closer, say, to 2 cm or to 20 cm. To express a
representative value for the sample of observations (here, the length of
the animals), we use a statistic of location. But statistics of location
will not describe the shape of a frequency distribution. The shape
may be long or very narrow, may be humped or U-shaped, may
contain two humps, or may be markedly asymmetrical. Quantitative
measures of such aspects of frequency distributions are required. To
this end we need to define and study the statistics of dispersion.

STATISTICS OF LOCATION

ARITHMETIC MEAN:
The most common statistic of location is familiar to everyone. It
is the arithmetic mean, commonly called the mean or average. The
mean is calculated by summing all the individual observations or
items of a sample and dividing this sum by the number of items in
the sample.
For instance, as the result of a gas analysis in a respirometer, an
investigator obtains the following four readings of oxygen
percentages and sums them:
14.9
10.8
12.3
23.3

Sum = 61.3
The investigator calculates the mean oxygen percentage as the
sum of the four items divided by the number of items. Thus, the
average oxygen percentage is:

Mean = 61.3 / 4 = 15.325%
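The same calculation can be checked with a short Python sketch. The variable names are illustrative only; the four readings are the ones listed above.

```python
# Oxygen percentages from the respirometer example.
readings = [14.9, 10.8, 12.3, 23.3]

total = sum(readings)   # sum of the individual observations
n = len(readings)       # number of items in the sample
mean = total / n        # arithmetic mean

print(f"Sum = {total:.1f}, n = {n}, mean = {mean:.3f}%")
```

Dividing the sum 61.3 by the four items gives 15.325, the mean oxygen percentage.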


Calculating a mean presents us with the opportunity for learning
statistical symbolism. An individual observation is symbolized by $Y_i$,
which stands for the ith observation in the sample. Four
observations could be written symbolically as follows: $Y_1, Y_2, Y_3, Y_4$.
We shall define n, the sample size, as the number of items in a
sample. In this particular instance, the sample size n is 4. Thus, in a
large sample, we can symbolize the array from the first to the nth
item as follows: $Y_1, Y_2, \ldots, Y_n$.

When we wish to sum items, we use the following notation:

$$\sum_{i=1}^{i=n} Y_i = Y_1 + Y_2 + \cdots + Y_n$$

The capital Greek sigma, $\Sigma$, simply means the sum of the items
indicated. The i = 1 means that the items should be summed, starting
with the first one and ending with the nth one, as indicated by the i =
n above the $\Sigma$. The subscript and superscript are necessary to indicate
how many items should be summed. The "i = " in the superscript is
usually omitted as superfluous. For instance, if we had wished to sum
only the first three items, we would have written $\sum_{i=1}^{3} Y_i$. On the other
hand, had we wished to sum all of them except the first one, we would
have written $\sum_{i=2}^{n} Y_i$. With some exceptions it is desirable to omit
subscripts and superscripts, which generally add to the apparent
complexity of the formula and, when they are unnecessary, distract
the student's attention from the important relations expressed by the
formula.
Below are seen increasing simplifications of the complete summation
notation shown at the extreme left:

$$\sum_{i=1}^{i=n} Y_i = \sum_{i=1}^{n} Y_i = \sum_{i} Y_i = \sum^{n} Y = \sum Y$$

The third symbol might be interpreted as meaning, "Sum the $Y_i$'s over
all available values of i". This is a frequently used notation. The next,
with “n” as a superscript, tells us to sum “n” items of Y; note that the
“i” subscript of the Y has been dropped as unnecessary. Finally, the
simplest notation is shown at the right. It merely says “sum the Y's”.
This will be the form we shall use most frequently: if a summation
sign precedes a variable, the summation will be understood to be over
“n” items (all the items in the sample) unless subscripts or
superscripts specifically tell us otherwise.
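The limits on a summation sign correspond to choosing which items are summed. A small Python sketch, reusing the four oxygen readings as $Y_1$ to $Y_4$ (the variable names are our own), shows the full sum, the sum of the first three items, and the sum of all items except the first:

```python
# Y_1 ... Y_4: the four oxygen readings used in the mean example.
y = [14.9, 10.8, 12.3, 23.3]
n = len(y)

sum_all = sum(y)                # sum over i = 1 .. n (all items)
sum_first_three = sum(y[:3])    # sum over i = 1 .. 3
sum_without_first = sum(y[1:])  # sum over i = 2 .. n

print(sum_all, sum_first_three, sum_without_first)
```

Note that Python counts list positions from 0, so the item written $Y_1$ in the text is `y[0]` in the code.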

MEDIAN
The median is used to summarize variables that are measured
with ordinal, interval, or ratio scales. It is obtained by arranging the
data from the lowest to the highest and then picking the number(s)
in the middle. If the total number of data points is odd,
the median is the middle number. If the number of data points is even,
the median is obtained by summing the two numbers in the middle
and dividing the sum by two, that is, by taking their mean.
The median is mostly used when a few data points are very
different from the rest. For example, when summarizing the age of
students entering college, there may be a group of students who are
much older than the rest. Using the mean may distort the picture, since
it will show the average age of students entering college to be higher,
whereas using the median can give a truer reflection of the situation.
For example, let’s find the median age of students entering
college for the first time, given the following ages of ten
students:
17, 17, 18, 19, 19, 20, 21, 25, 28, 32


The median of the values above is
(19+20)/2 = 19.5.
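The rule for odd and even numbers of data points can be written as a short Python sketch. The function below is our own illustration; Python's standard `statistics.median` follows the same rule and is shown for comparison, together with the mean.

```python
import statistics

ages = [17, 17, 18, 19, 19, 20, 21, 25, 28, 32]

def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print("median:", median(ages))
print("library median:", statistics.median(ages))
print("mean:", statistics.mean(ages))
```

For these ten ages the median is 19.5, while the mean is 21.6 because the few older students pull it upward.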

MODE
The mode is the value represented by the greatest
number of individuals, that is, the most frequently occurring value
in a data distribution. The mode can be used for any type of data,
including categorical data.

For example, consider a college class of 40 students. The students
are given a test, graded, and then grouped into clusters on a scale
of 1-5, starting with the students with the lowest marks.

The number of students in each cluster is as follows:

Cluster 1: 5
Cluster 2: 7
Cluster 3: 13
Cluster 4: 12
Cluster 5: 3

Cluster 3 contains the highest number of students (13) and is,
therefore, the mode. It reveals that out of 40 students, more
students were graded in cluster 3 than in any other cluster.
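A minimal Python sketch of this example follows. The dictionary simply holds the cluster counts given above, and the variable names are illustrative. It picks out the cluster with the largest frequency.

```python
# Number of students in each cluster, as listed above.
cluster_counts = {1: 5, 2: 7, 3: 13, 4: 12, 5: 3}

# The modal cluster is the one with the highest frequency.
modal_cluster = max(cluster_counts, key=cluster_counts.get)

print("Total students:", sum(cluster_counts.values()))
print("Modal cluster:", modal_cluster)
print("Students in modal cluster:", cluster_counts[modal_cluster])
```

The result is the cluster (3), not its frequency (13); the frequency only tells us how many students fall in the modal cluster.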

MEASURES OF DISPERSION

RANGE:
One simple measure of dispersion is the range, which is defined
as the difference between the largest and the smallest items in a
sample. Since the range is a measure of the span of the variates along
the scale of the variable, it is in the same units as the original
measurements. The range is clearly affected by even a single outlying
value and for this reason is only a rough estimate of the dispersion
of all the items in the sample.

The range can sometimes be misleading
when there are extremely high or low values.
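The effect of a single outlying value can be seen in a short Python sketch. The first list reuses the four oxygen readings from the mean example; the extra value of 95.0 is a hypothetical outlier added only for illustration.

```python
# Four oxygen readings from the arithmetic mean example.
values = [14.9, 10.8, 12.3, 23.3]

# Range = largest item minus smallest item.
data_range = max(values) - min(values)

# Add one hypothetical outlying reading and recompute the range.
with_outlier = values + [95.0]
range_with_outlier = max(with_outlier) - min(with_outlier)

print(f"Range: {data_range:.1f}")
print(f"Range with outlier: {range_with_outlier:.1f}")
```

One extreme value raises the range from 12.5 to 84.2, even though the other four readings have not changed.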

QUARTILES:
Quartiles are the values that divide a list of numbers into quarters:

1. Put the list of numbers in order.
2. Then cut the list into four equal parts.
3. The quartiles are at the "cuts".

Sometimes a "cut" is between two numbers; in that case the quartile is the
average of the two numbers.
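The cutting procedure above can be sketched in Python. The sketch assumes the "median of halves" convention: the middle cut is the median, and the first and third quartiles are the medians of the lower and upper halves. Other conventions exist and can give slightly different values, so treat the function `quartiles` as an illustration, not the only definition.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the middle cut
    upper = ordered[half + n % 2:]   # values above the middle cut
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# The ten student ages from the median example.
ages = [17, 17, 18, 19, 19, 20, 21, 25, 28, 32]
q1, q2, q3 = quartiles(ages)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```

For the ten ages, the cuts fall at 18, 19.5 and 25. The middle cut lies between two numbers (19 and 20), so it is their average.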

INTERQUARTILE RANGE
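
The interquartile range (IQR) is the difference between the third and the first quartile, IQR = Q3 - Q1, so it measures the spread of the middle half of the data. A minimal Python sketch, assuming the median-of-halves convention for the quartiles and reusing the ten student ages from the median example:

```python
import statistics

# The ten student ages from the median example, in order.
ages = sorted([17, 17, 18, 19, 19, 20, 21, 25, 28, 32])

half = len(ages) // 2
q1 = statistics.median(ages[:half])                   # median of the lower half
q3 = statistics.median(ages[half + len(ages) % 2:])   # median of the upper half

iqr = q3 - q1
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```

Here Q1 = 18 and Q3 = 25, so the IQR is 7 years. Unlike the range, it is not affected by the single oldest student.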

STANDARD DEVIATION:

The standard deviation is a statistic that measures the
dispersion of a dataset relative to its mean. It is calculated as the
square root of the variance, which is found from each data point's
deviation relative to the mean. If the data points are further from the
mean, there is a higher deviation within the data set; thus, the more
spread out the data, the higher the standard deviation. In finance, for
example, the standard deviation of the annual rate of return of an
investment sheds light on that investment's historical volatility. The
greater the standard deviation of securities, the greater the variance
between each price and the mean, which shows a larger price range. For
example, a volatile stock has a high standard deviation, while the
deviation of a stable blue-chip stock is usually rather low.

OVERVIEW OF HOW TO CALCULATE STANDARD DEVIATION

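The formula for standard deviation (SD) is:

$$\mathrm{SD} = \sqrt{\frac{\sum (Y - \bar{Y})^2}{n}}$$

where $\bar{Y}$ is the mean of the data points and n is the number of data points.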

The standard deviation formula may look confusing, but it will
make sense after we break it down. In the coming sections, we'll walk
through a step-by-step example. Here's a quick preview of
the steps we're about to follow:

Step 1: Find the mean.
Step 2: For each data point, find the square of its distance to the mean.
Step 3: Sum the values from Step 2.
Step 4: Divide by the number of data points.
Step 5: Take the square root.

Step-by-step example for calculating standard deviation:
First, we need a data set to work with. Let's pick something
small so we don't get overwhelmed by the number of data points.
Here's a good one:
6, 2, 3, 1
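
The five steps listed earlier can be followed in a short Python sketch. It assumes the data set is the four values 6, 2, 3, 1 and that Step 4 divides by the number of data points (the population form of the standard deviation). The variable names are our own.

```python
import math
import statistics

data = [6, 2, 3, 1]  # assumed data set for this walk-through

# Step 1: find the mean.
mean = sum(data) / len(data)                          # 3.0

# Step 2: for each data point, square its distance to the mean.
squared_distances = [(x - mean) ** 2 for x in data]   # [9.0, 1.0, 0.0, 4.0]

# Step 3: sum the values from Step 2.
total = sum(squared_distances)                        # 14.0

# Step 4: divide by the number of data points (this is the variance).
variance = total / len(data)                          # 3.5

# Step 5: take the square root.
sd = math.sqrt(variance)                              # about 1.87

print(f"mean = {mean}, variance = {variance}, SD = {sd:.2f}")

# Cross-check with the standard library's population standard deviation.
print(f"statistics.pstdev = {statistics.pstdev(data):.2f}")
```

If the data were treated as a sample used to estimate a population value, many textbooks divide by n - 1 in Step 4 instead; `statistics.stdev` implements that version.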

STANDARD DEVIATION VS. VARIANCE

Variance is derived by taking the mean of the data points,
subtracting the mean from each data point individually, squaring
each of these results, and then taking another mean of these squares.
Standard deviation is the square root of the variance.

The variance helps determine how widely the data are spread
around the mean value. As the variance gets bigger, more
variation in data values occurs, and there may be a larger gap
between one data value and another. If the data values are all close
together, the variance will be smaller. However, the variance is more
difficult to grasp than the standard deviation because it is expressed
in squared units, which may not be meaningfully shown on the same
graph as the original dataset.

Standard deviations are usually easier to picture and apply.
The standard deviation is expressed in the same unit of
measurement as the data, which is not the case with
the variance. Using the standard deviation, statisticians may
check whether the data follow a normal curve or another mathematical
relationship. If the data follow a normal curve, then about 68% of the
data points will fall within one standard deviation of the average,
or mean, data point. Larger variances mean that data points lie,
on average, farther from the mean. Smaller variances mean
that more of the data are close to the average.
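
The 68% figure for a normal curve can be checked with Python's standard `statistics.NormalDist` class. This is a sketch for illustration only; the second distribution uses made-up values for its mean and standard deviation to show that the proportion does not depend on them.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

# A normal curve with a hypothetical mean of 10 and SD of 5.
wide_normal = NormalDist(mu=10.0, sigma=5.0)
within_one_sd_wide = wide_normal.cdf(15.0) - wide_normal.cdf(5.0)

print(f"Within one SD (standard normal): {within_one_sd:.4f}")
print(f"Within one SD (mean 10, SD 5):   {within_one_sd_wide:.4f}")
```

Both proportions are about 0.68. A larger standard deviation widens the interval, but the share of a normal distribution lying inside one standard deviation of the mean stays the same.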

IDENTIFYING PARAMETERS AND STATISTICS

Parameters are numbers that summarize data for an entire
population. Statistics are numbers that summarize data from a
sample, i.e. some subset of the entire population.
A parameter is a useful component of statistical analysis. It
refers to the characteristics that are used to define a given
population. It is used to describe a specific characteristic of the entire
population. When making an inference about the population, the
parameter is unknown because it would be impossible to collect
information from every member of the population. Rather, we use a
statistic of a sample picked from the population to derive a
conclusion about the parameter.
A parameter is used to describe the entire population being
studied. For example, we want to know the average length of a
butterfly. This is a parameter because it states something
about the entire population of butterflies. Parameters are difficult to
obtain, but we use the corresponding statistic to estimate their
value. A statistic describes a sample of a population, while a
parameter describes the entire population. Since it is
impossible to catch and measure all the butterflies in the world,
we can catch 100 butterflies and measure their length. The
mean length of the 100 butterflies is a statistic that we can use to
make an inference about the length of the entire butterfly
population.
Typically, the value of a statistic can vary from one sample to
another, while the parameter remains fixed. For example, one sample
of 100 butterflies may have an average length of 6.5 mm, while
another sample of 100 butterflies from another region may have an
average length of 6.8 mm. Also, a smaller sample of 50 butterflies
may have an average length of 7.0 mm. The statistic obtained from
the sample of the population can then be used to estimate the
parameter of the entire population.
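
The idea that a statistic varies from sample to sample while the parameter stays fixed can be illustrated with a small simulation. Everything in this Python sketch is hypothetical: the population of 10,000 butterfly lengths is generated from made-up values (mean 6.5 mm, standard deviation 0.5 mm) so that, unlike in real life, the parameter can actually be computed.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of butterfly lengths in mm (made-up numbers).
population = [rng.gauss(6.5, 0.5) for _ in range(10_000)]

# The parameter: mean length of the entire population (fixed).
parameter = statistics.mean(population)

# Three statistics: mean lengths of three different samples of 100 butterflies.
sample_means = [statistics.mean(rng.sample(population, 100)) for _ in range(3)]

print(f"Population mean (parameter): {parameter:.2f} mm")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean (statistic):   {m:.2f} mm")
```

Each sample mean lands near the population mean but generally not exactly on it, which is why a statistic is treated as an estimate of the parameter.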

STEPS TO TELL THE DIFFERENCE BETWEEN A STATISTIC AND A PARAMETER:
Step 1: Ask yourself, is this a fact about the whole population?
Sometimes that’s easy to figure out. For example, with small
populations, you usually have a parameter because the groups are
small enough to measure:
Examples of parameters:
1. 10% of US senators voted for a particular measure. There are
only 100 US senators, so you can count how every single one of
them voted.
2. 40% of 1,211 students at a particular elementary school got
below a 3 on a standardized test. You know this
because you have each and every student’s test score.
3. 33% of 120 workers at a particular bike factory were paid less
than $20,000 per year. You have the payroll data for all of the
workers.
Step 2: Ask yourself, is this obviously a fact about a very large
population? If it is, you have a statistic.
Examples of statistics:
4. 0% of US residents agree with the latest health care proposal.
It’s not possible to actually ask hundreds of millions of
people whether they agree. Researchers have to take
samples and estimate the rest.
5. 45% of Jacksonville, Florida residents report that they have
been to at least one Jaguars game. It’s very doubtful that
anyone polled in excess of a million people for this data.
They took a sample, so they have a statistic.
6. 30% of dog owners poop scoop after their dog. It’s impossible to
survey all dog owners, since no one keeps an accurate track of
exactly how many people own dogs. This data had to be from a
sample, so it’s a statistic.

COEFFICIENT OF VARIATION

The coefficient of variation (CV) is a statistical measure of the
dispersion of data points in a data series around the mean. The
coefficient of variation represents the ratio of the standard deviation
to the mean, and it is a useful statistic for comparing the degree of
variation from one data series to another, even if the means are
drastically different from one another.

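Coefficient of Variation Formula:

$$\mathrm{CV} = \frac{\mathrm{SD}}{\bar{Y}}$$

where SD is the standard deviation and $\bar{Y}$ is the mean. The ratio is often multiplied by 100 and reported as a percentage.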
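
A short Python sketch shows why the ratio of standard deviation to mean is useful for comparing data series with very different means. The helper `coefficient_of_variation` is our own, it uses the population standard deviation (dividing by the number of data points), and both data series are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the (population) standard deviation to the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Two hypothetical data series: the second is the first multiplied by 100.
small_series = [6, 2, 3, 1]
large_series = [600, 200, 300, 100]

cv_small = coefficient_of_variation(small_series)
cv_large = coefficient_of_variation(large_series)

print(f"SD small = {statistics.pstdev(small_series):.2f}, CV small = {cv_small:.1%}")
print(f"SD large = {statistics.pstdev(large_series):.2f}, CV large = {cv_large:.1%}")
```

The standard deviations differ by a factor of 100, but the coefficient of variation is the same for both series, because the spread relative to the mean has not changed.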

