
Data science as part of the scientific method

The next five sessions of the scientific methods section will be devoted to how we deal with the data
that are collected when conducting experiments. These sessions continue on from the discussions that
you have already had with Prof Vine. It is very important that if you have any questions on the
course so far you get hold of Prof Vine and ask him to explain anything that is not clear. The
next five sessions will cover: 1) spreadsheets, 2) data types and graphs, 3) measures of central
tendency, 4) measures of variation and 5) interpretation of your results in the broader scientific
context. This document deals with spreadsheets in science.

Scientific method – Spreadsheets in science

Spreadsheets are electronic documents in which data is arranged in the rows and columns of a grid
and can be manipulated and used in calculations. Well known spreadsheet programs include
Microsoft Excel. Within a spreadsheet each block is called a cell and that cell is generally indexed by
a letter and number, with the letter referring to the column and the number referring to a row. Thus
the cell A2 would be the first column (A being the first letter of the alphabet) and the second row.
Columns are vertical sets of cells and rows are horizontal sets of cells.

Spreadsheets have numerous uses and these are listed below:


1) Data entry: Data that was collected in the field can be entered into an electronic format and
stored for future use. This process makes it easy to access and find data that has been
collected when you were in the field or conducting your experiment.
2) Organizing data: Sorting and arranging data so that the data are easier to use. As mentioned
above, the spreadsheet is arranged into cells that can be arranged into rows and columns.
This arrangement and the functioning of the spreadsheet program allows data to be logically
stored (to be discussed below) and thus individual data points can be easily found.
3) Subsetting and sorting data: Basic manipulation of your data for doing analysis on different
sections of your dataset. Since the data are stored in a program that allows you to select, cut,
sort, filter and manipulate them, spreadsheet programs are ideal for basic investigations into
patterns that might be emerging between variables (i.e. response and predictor variables).
4) Statistics: Finding patterns and building models in your data so that you can make inferences
from the data you have collected. Statistics is simply the branch of mathematics that deals with
the collection, organization, analysis, interpretation and presentation of data. The data
that you collect can be investigated to find patterns between the variables and
sometimes these patterns can be developed into predictive models that can be used to infer
the state of the data that you collected at broader levels. Some of this will be unpacked in
later sessions.
5) Plotting: Graphically presenting your data to make it easy to understand. Humans are
inherently better at understanding pictures, so data plots are pictures describing the
relationship between the data that you collect. We will investigate some of the basic plotting
types in a later session.

We have touched on some of the terminology of spreadsheets already by seeing what a cell, a row
and a column are. To recap, the cell is a single data point or element in a spreadsheet, indexed by
the row and column in which that cell is placed. The column is a vertical (up and down) set of cells
and the row is a horizontal (sideways) set of cells. Sometimes we may want to refer to a larger group
of cells that include more than one row and column. This group of cells is known as a range, for
example you may have rows 1 to 9 and columns A to E (as in the PowerPoint example) and the range
is then from A1 to E9 (top left hand corner to bottom right corner as if you are reading text). We saw
in the previous section that spreadsheets can help us with subsetting and sorting data and for doing
statistics. These activities (maths, statistics, data manipulation etc.) are done using functions.
Functions are built in operations from the spreadsheet program that can be used to calculate values
in cells, rows, columns and across ranges as well as do basic manipulation of the data. For example,
if you wanted to add the number in cell A1 and the number in cell A2 and then show this in cell A3
you could do so by using the function SUM by typing =SUM(A1,A2) into the function slot when you
have cell A3 highlighted in the program. These concepts will always be easier to grasp when you are
working in the spreadsheet program itself. The main thing is to know that the spreadsheet program has
these built-in functions that assist with your data manipulation and organization. The “=SUM(A1,A2)”
text we mentioned in the previous sentence is the same as writing “=A1+A2”, and this is then a
formula. A formula is the combination of functions, cells, rows, columns and ranges used to obtain a
specific result. Once again, this is best explained practically when you have the program open in front of you.
Put simply, formulas are often basic mathematical expressions used for calculations on the data that
you have entered into the spreadsheet program. Within the spreadsheet program you will have
individual worksheets or sheets and these are each page of rows and columns that you can give a
name to. Lots of sheets or worksheets are then combined to form a spreadsheet where the different
sheets are effectively the pages of a book (but the book is called the spreadsheet).
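
If you are curious, here is a minimal sketch of the SUM example built from Python with the openpyxl library (my choice of tool for illustration; typing =SUM(A1,A2) directly into the spreadsheet program gives the same result):

```python
# A sketch of the =SUM(A1,A2) example from the text, using the openpyxl
# library to build the spreadsheet file from Python.
from openpyxl import Workbook

wb = Workbook()           # a new spreadsheet (the "book")
ws = wb.active            # its first worksheet (one "page" of the book)

ws["A1"] = 4              # first column, first row
ws["A2"] = 6              # first column, second row
ws["A3"] = "=SUM(A1,A2)"  # the function; the same as the formula =A1+A2

wb.save("example.xlsx")   # open this file and A3 will display 10
```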

There are some hints and tricks that you can use to develop clean and well designed spreadsheets
and sheets that will make your life easier when using a spreadsheet program. It is often a good idea
to ensure that your sheet is well formatted before you even enter the data. This means that you will
know where each piece of data will go and what calculations you may want to make once the data
are entered into the spreadsheet. A well formatted spreadsheet will save you time once you have
collected your data and it will also guide your data collection and generation of the field data sheets.
If you don’t have a well organized database you might as well just take all your datasheets that you
have filled out in the field or lab and staple them together to make a book. Think how hard it will be
to find something from this book of data. There are two simple rules that should be adhered to
when entering your data.
1) Each cell will have an observation (one piece of data that has been collected) that needs to
have all the relevant information connected to it for it to stand alone. That means that
although the cell will be a single piece of data, the column that the cell belongs to will have a
name that explains what that information is, and the row will be the sample number. By
looking at your spreadsheet and cell you need to know exactly what that number means.
2) Enter your data in a clear and logical format. This is very important as the computer only
works on logic and will not be able to interpret how you enter your data if it is not clear and
logical. The functions and formulas that you will use, or the statistical program that you might
use to analyse your data, will require the data to be in a particular format, so you need to
understand the format and how the data should be entered into the spreadsheet.
So to recap, your individual data point will be entered into the cell, the column will normally
represent the variable of interest and the row will indicate the observation. So, to use an example,
if you are measuring the height of 10 people, the individual cell will be a height (for example 160),
the column in which that number sits will be called “Height” and the rows will be each of the 10
individuals you are measuring. It is often best practice to have a single table in a single sheet, rather
than having multiple tables measuring different things in different experiments all in the same sheet. In
most cases, additional sheets or tables can be avoided by simply adding a new column of data to
your original spreadsheet, as in the sketch below. This brings us back to having a properly structured
worksheet before you go and collect your data so that you do not have these problems once your data
have been collected. In the example on page 11 of the spreadsheets slides, an additional column for date
in the table would reduce the need to have a different sheet for each day.
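
As a hypothetical illustration of this layout, the sketch below builds the height example as a tidy table using the pandas Python library (my choice; the names and values are invented). Note how adding a Date column keeps everything in one table instead of one sheet per day:

```python
# A sketch of a well-structured sheet: one variable per column, one
# observation per row; all names and values here are invented.
import pandas as pd

data = pd.DataFrame({
    "Date":   ["2020-07-10", "2020-07-10", "2020-07-11", "2020-07-11"],
    "Person": [1, 2, 1, 2],
    "Height": [160, 172, 161, 173],  # heights in cm, one value per cell
})
print(data)
```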

When you are entering data into your spreadsheet there are a few things that you should avoid
doing. These include:
1) Not filling in zeros – if the value that you have recorded is zero then put a zero in the cell. If
you do not have a value, make sure you do not put in a zero.
2) Using formatting to make your spreadsheet and data look “pretty” – Some of the tools in
spreadsheet programs allow you to merge cells to have nice headings and group data. This
is to be avoided as much as possible as any deviation from the standard row and column
configuration (cell with data, row as observation and each column as a variable) makes it
difficult for other programs to read the data should you need to export the data to a
different program.
3) Placing comments or units in the cells – Importantly, each cell will be given a format type
(this could be number, date, time, text, percentage or currency) and if you mix formats, for
example by having a number plus text explaining the unit, the cell will no longer function as it
should. If you want to have comments about a datapoint or a reference to the units,
create an extra column for these such that the data recorded (for example number) is only a
number and then the text (for example the unit) is in the next column.
4) More than one piece of information in a cell – As mentioned above, each cell should have a
single item of information. If you have measured the height and weight of a person, do not
put both of these pieces of information into the cell. Use two adjacent columns to have a
cell for height and a cell for weight.
5) Using complicated field names – A field is another name for a variable. So field names are usually
the names of your variables, which are most often entered into row 1. It is best to use simple
names that do not have special characters or spaces when defining the name of the variable.
This is because many other programs that you may export your datasheet into may not be
able to read a space, and then the program will not be able to read the data from the
spreadsheet. The same goes for special characters in your data. For example, degrees Celsius
is normally depicted by a special character, a small circle in superscript (°). Try to avoid using
these special characters as they may not be readable in other programs; a small sketch of
cleaning such names is given below.
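
Here is a small hedged sketch (in Python, with a hypothetical field name) of cleaning field names so that spaces and special characters such as ° are removed before export:

```python
# A sketch of cleaning field (variable) names: anything that is not a
# letter, digit or underscore is replaced with an underscore. The example
# name is hypothetical.
import re

def clean_field_name(name: str) -> str:
    """Make a field name safe for export to other programs."""
    return re.sub(r"[^0-9A-Za-z_]+", "_", name.strip()).strip("_")

print(clean_field_name("Temperature (°C)"))  # prints: Temperature_C
```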

Dates are a special format in spreadsheets and you need to be very careful when you are entering
dates into a spreadsheet program. I am not sure how many of you are aware, but dates in the USA
are written differently to dates in SA. In South Africa 3/12/2020 means the 3rd of December 2020. In
the USA this means the 12th of March 2020. So firstly, always be aware of what background language
setting your program is set to (many computers default to English (US)). A further problem arises
if you do not enter the full date into the spreadsheet program, because the program may autocomplete
the remainder of the date. Thus when entering data for the 10th of July, putting Jul-10 into the cell may
result in the spreadsheet program reading this as the 1st of July 2010 (i.e. 1-Jul-10). If you can help it,
either enter your dates using full words (e.g. 10 April 2020) or make sure you know what the date
format is so that you know what should be placed where. For example, a date format of yyyy/mm/dd
means that when you enter the data you should enter the year first, then the month and then the day.
The format mm/dd/yyyy will require the date entered with the month first, then the day and then the
year. If you do not enter the date data correctly this can cause massive problems down the line when
you are trying to analyse your data.
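
To see how the same text becomes two different dates under the two conventions, here is a small sketch using Python's datetime module (the format codes %d, %m and %Y play the same role as dd, mm and yyyy in a spreadsheet date format):

```python
# The same text read as day-first (SA) and month-first (US).
from datetime import datetime

text = "3/12/2020"
sa = datetime.strptime(text, "%d/%m/%Y")  # day/month/year
us = datetime.strptime(text, "%m/%d/%Y")  # month/day/year

print(sa.strftime("%d %B %Y"))  # 03 December 2020
print(us.strftime("%d %B %Y"))  # 12 March 2020
```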

OK, you have now managed to enter all your data correctly and your formatting is fine, what now? It
is very important that you ensure that you have some sort of quality control and version control in
your data. Version control means that you are in control of what version of your data you are using.
If you have manipulated your raw data that version would be a second version and you may want to
save the file as ver 2 or ver 1.2 so that you know it is not your original dataset. Saving your data with
a date in the file name is also advisable as this makes it very easy to identify what file you worked on
last. It is very important, however, that you try to avoid doing any work on your raw data. Once you
manipulate your raw data and accidentally save it in the same file, your raw data are gone and
you cannot go back to that file. It is best practice, once you have entered your data, to make a copy
and do any further work on the data in the copy of the file – therefore never manipulating or
changing the raw data. In that way you reduce the risk of accidentally deleting or overwriting your
original data file. It is also good practice to have a separate readme txt file (one that is created in
notepad) that provides the history of what you have done to your data and what each new file is. If
you come back to your data after five years you will need this readme file to explain what you did at
each step so that you can understand the steps in your data storage and analysis. By having this file
it also makes it easy to share your files with another researcher as they will have the log of what you
have done at each step in your analysis.
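
As a small sketch of this workflow (in Python, with hypothetical file names): copy the raw file before doing any work, and append a line to the readme file recording what was done:

```python
# Protect the raw data: work only on a copy, and log the step in readme.txt.
import shutil
from datetime import date

raw = "cat_weights_raw.csv"                       # never edited again
working = f"cat_weights_ver2_{date.today()}.csv"  # dated working copy

shutil.copy(raw, working)

with open("readme.txt", "a") as log:
    log.write(f"{date.today()}: copied {raw} to {working} for cleaning\n")
```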

Finally, it is likely that all of you will work in Microsoft Excel and you will save your files in the Excel
file format (.xlsx). Although this is fine as long as Excel is your desired program of choice, it
becomes a problem when you want to open your data in another program. There are universal
file formats that can be read by many programs and it is often better to save your files in those
formats. One of these is the csv file (.csv), which is a comma-delimited file. This simply means
that the columns are separated by a comma and the data are saved in a plain text format. This type
of file can be opened by almost all programs and thus you can use your data in multiple programs.
Many programs outside the Microsoft family will struggle to recognize the Excel format.
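
For illustration, here is a minimal sketch (using Python's built-in csv module, with invented rows) of what a csv file contains; saving as CSV from your spreadsheet program produces the same kind of plain text file:

```python
# Write a tiny comma-delimited file; row 1 holds the field names.
import csv

rows = [
    ["Person", "Height"],
    [1, 160],
    [2, 172],
]

with open("heights.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# heights.csv is now plain text that almost any program can read:
# Person,Height
# 1,160
# 2,172
```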

Data and graphs

In the second session of the data component of the scientific methods course, we will be looking at
the different types of data and the different types of graphs that can be used to display these data.
Quite importantly, “data” is a plural noun (the singular is “datum”). So when I refer to data I will be
referring to them in the plural.

Data can be broadly categorized as primary data or secondary data. Primary data is data that you
have collected whereas secondary data refers to data that someone else has collected. Thus, if you
go out and collect information on the cats on campus you are collecting primary data. If you want to
relate these data to the weather but you do not have the weather data you can request the weather
data from the National Weather Service. They may provide you with the maximum temperature for
each day; these data will then be secondary as someone else has collected them. Primary data can then
be grouped into either quantitative or qualitative data. Quantitative data are measured values and
deal with numbers, for example measuring the number of cats seen or the weight of cats if you go
and catch them. Qualitative data do not involve numbers and generally deal with categories.
Using the cats again, we can separate cats into male or female or by what colour they are.
Qualitative data are also known as nominal data. Your quantitative data can be measured on either a
discrete or a continuous scale, with discrete measurements being fixed values such as the
number of cats. If you counted the number of cats that entered the res kitchen, you would record 1,
2, 3 etc. You would not be able to count 3.45 cats. Thus, a discrete measurement is a fixed value with
set intervals. The other scale is a continuous measurement. For continuous measurements, there
are no fixed intervals and the data reading could take any measurement. If we were weighing the
cats, depending on how accurate our scale was we could have 3.45345 kg for one cat and 4.23768 kg
for another cat, thus a continuous scale.

Quantitative data are commonly separated into three scales of measurement, namely the ordinal, interval
and ratio scales. Ordinal scale data are used when the events in question can be rank-ordered. With the
ordinal scale, the intervals separating the ranks do not have to be comparable. We will use the
example of the measurements used to categorise runners finishing a race to explain the ordinal
scale. For ordinal scales, we can measure that a runner came first, second, third, fourth and so on
until last. Thus, the order of the rank (where they came in the race) is important but there is no need
to understand the difference in their times. The second scale of measurement for quantitative data
is the interval scale. The interval scale is used when the events in question can be rank-ordered and
equal intervals separate adjacent events. In this case, the item being measured is measured on a
scale that has equal intervals. An example could be measurements of temperature. On hot days in
Alice, the temperature can be as high as 40 degrees, and on cold days the temperature can be as low
as -5 degrees. Thus, because we have an interval scale, we can assess the difference between the two,
which is 45 degrees. The final scale is the ratio scale. The ratio scale takes the interval scale one step further
by permitting the rank order of scores with the assumption of equal intervals between the ranks, but
it also assumes the presence of a true zero point. By this, we mean that when we measure
something we are concerned with the distance from zero. For example, we could go back to the race
example from above. We know the order of finishers (ordinal scale), we can measure the difference
in the time taken to finish the race (interval scale) and we know that the runner that won finished in
a certain time, so the time can be measured relative to a true zero, the start of the race (ratio scale). The different
types of data are important as they will dictate what type of graph you will use to present your
information.

Graphs are used to display summaries of the data that have been collected. You will come to realise
that humans are a visual species and often grasp concepts more easily when they are converted from
numbers and words into pictures. Graphs are simply pictures of the data that have been collected, used
to describe what the data look like. There are a few guiding principles that need to be considered when
developing graphs of data. First, the graph should be clear, unambiguous and accurately represent your
data. This leads to the second element, in that your graph should be free of distracting elements and
should draw the reader's attention to the most important aspects of your data. This means in most
cases that you need to produce simple graphs. You may be tempted to try to make your graph pretty
and cool, but often by adding more complexity to the graph you make it harder to understand
what you are trying to show. A key example of this is using 3D graphs. You may think the graph looks
cool, but 3D graphs are very difficult to understand and should be avoided at all costs. You may also
consider using funky patterns for your graphs, but bear in mind that these patterns may confuse the
reader. Thus, the third guiding principle of simplicity is vitally important when developing your
graphs. Finally, when you produce graphs you may need to consider who your audience is. A graph
for one audience may not be as effective as a different graph for another audience.
The main types of graphs we are going to discuss in the session today include: bar graphs, line
graphs, scatterplots and pie charts. The type of graph that a researcher chooses depends on the type
of data that they have collected. This is why it is important to understand and know the data types
first before you consider graphing your data.

Bar graph: Bar graphs are used when you are plotting a quantitative variable as a function of a
categorical variable. When we use the term “as a function of” it means dependent on, i.e. the
quantitative variable (response variable) will be dependent on the categorical variable (predictor
variable). In the bar plot, the categorical variable is expressed on the x-axis and the amount or
measurement of the quantitative variable is expressed as the height of the bar. The top of the
bar will represent a single total or some measure of central tendency (see next session for more on
these). If you have the top of the bar as the measure of central tendency it is best practice to include
your measure of the variation as well (see session 4 for more information on this topic). If you have
two different grouping variables (i.e. the categories) then you could have a second set of bars
represented in a different colour. As mentioned above, please avoid using 3D graphs as they are not
easy to read. So an example of a bar plot could be the number of people of different ages that went
to watch a movie. In the graph on page 10 of the slides, you can see that the grouping category is
the age of the people (16, 17, 18 and 19) and the height of the bars represents a number, i.e. the
count of each age class that went to the movie. In the example on page 11 of the slides, you can see
we now have two groups of categories (those that survived and those that did not survive) and we
can immediately see the difference in height indicating clear differences in the categories for those
that survived and did not survive. The final example on page 12 of the slides shows a bar plot where
the top of the bars is not a count, but rather some measure of the central tendency, in this case, the
average of a measurement. The block lines above the bar represent the measure of variation around
that average – but more about that in a later session. The key thing is to understand that the top of
the bar reflects the quantitative variable, either the total number or some measure of the central
tendency.
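
As a sketch of such a bar graph, the snippet below uses Python's matplotlib library (my choice of plotting tool) with invented counts, not the values on the slides:

```python
# A bar graph: categories on the x-axis, counts as the bar heights.
import matplotlib.pyplot as plt

ages = ["16", "17", "18", "19"]  # the categorical (predictor) variable
counts = [12, 25, 18, 7]         # the quantitative (response) variable

plt.bar(ages, counts)
plt.xlabel("Age of movie-goer")
plt.ylabel("Number of people")
plt.show()
```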

Line graph: The line graph is similar to the bar graph except that now the top of the bar is replaced by
a point (represented by a symbol like a circle or a triangle), with a line joining those points indicating
some sort of relationship between the categories. Therefore, the line plot is used when the
categories (represented on the x-axis) against which the measurements (represented on the y-axis)
are plotted can be ordered in some way. An example of a commonly used order is time. If you have
measurements for something in different months, you would ideally have the months in the correct
order on the x-axis (Jan-Feb-Mar-Apr etc) and not have the months in a random order (Apr-Feb-Jan-Mar
etc). As in the bar plot, the symbols can represent the total number or some measure of central
tendency. Again, if you are using a measure of central tendency then you should also be plotting the
measure of variation as well. On page 14 of the slides, there is an example of a line graph showing
the number (the quantitative measure) of Covid-19 cases and this is plotted per day (the categories).
The days can be logically placed in an order and thus the line can be joined to show how the number
of Covid-19 cases changes as the date changes.
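
A minimal sketch of such a line graph, again in matplotlib and with invented numbers rather than the Covid-19 data on the slide:

```python
# A line graph: ordered categories on the x-axis, points joined by a line.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]  # logically ordered categories
cases = [4, 9, 15, 22, 31]                  # the quantitative measurement

plt.plot(days, cases, marker="o")
plt.xlabel("Day")
plt.ylabel("Number of cases")
plt.show()
```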

Scatterplots: In the above two graph types the y-axis was a quantitative variable and the x-axis was a
nominal variable or category. In the scatterplot, we now have the case where both the y-axis and the
x-axis are quantitative variables and thus we cannot group the quantitative data into categories.
Therefore the scatterplot is an effective tool for representing a bivariate (two sets of continuous data)
relationship. Once we have plotted all the combinations of x and y values we will have a set of points
in the graph space – these points can be seen in the example on page 16. If we want to show that
there might be a relationship between the two continuous variables we can draw a line that
estimates the relationship between the variable plotted on the x-axis and the variable plotted on the
y-axis. This line is known as the regression line or a trend line. An example of a trend in two datasets
that might have a trendline could be the relationship between height and weight of people. In
general, taller people are heavier and shorter people are lighter. By plotting these two datasets for a
sample of people whom you have weighed and whose height you have measured, we would be able to see if
there is a relationship (or trend) between the two variables. A very good example of a trend is the
relationship between class attendance in ZOO212 and the test marks. This relationship is shown on
page 17 of the slides. The class attendance on the x-axis is the predictor variable and the class test
mark on the y-axis is the response variable. Both of these are continuous variables and each point in the
graph space is the combination of attendance percentage and test mark for each student. So, the student
that attended no classes (represented by the single dot on the y-axis) obtained a class test mark of
about 30%. The students that attended 100% of all classes (represented by the points above the 100%
mark on the x-axis) obtained test marks between 46 and 90%. By plotting all the students' pairs of data
we can see that there is a trend that students that attended class more often got slightly higher
marks. Thus a scatterplot provides some indication of the relationship between two continuous
quantitative measurements.
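
Below is a sketch of a scatterplot with a trend line, using matplotlib and a straight-line fit from numpy.polyfit; the attendance and mark values are invented, not the ZOO212 data on the slide:

```python
# A scatterplot of two continuous variables with a fitted trend line.
import numpy as np
import matplotlib.pyplot as plt

attendance = np.array([0, 20, 40, 55, 70, 85, 100])  # predictor (x-axis)
marks = np.array([30, 42, 50, 48, 61, 70, 78])       # response (y-axis)

slope, intercept = np.polyfit(attendance, marks, 1)  # straight-line fit

plt.scatter(attendance, marks)
plt.plot(attendance, slope * attendance + intercept)  # the trend line
plt.xlabel("Class attendance (%)")
plt.ylabel("Test mark (%)")
plt.show()
```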

Pie chart: The final plot type for this session is not used very often in science as it does not portray
sufficient information about what has been measured and lacks a measure of order. The pie
chart is a way to visually show the contribution of each category to the total dataset. For example, if
100 observations of 3 categories were made, the pie chart shows how many of each category are
represented in the 100 observations. This type of graph is most often used in business and, as
mentioned above, is not that useful in science. So as an example, imagine we surveyed 256 students
to find out which flavour of ice-cream out of apple, blueberry and raspberry they prefer. The pie chart
on page 19 of the slides provides the percentage of students that liked each flavour. We can
immediately see that most of the students liked raspberry and the fewest liked blueberry.
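
A short sketch of this pie chart in matplotlib; the vote counts are invented (they simply sum to the 256 students mentioned above):

```python
# A pie chart showing each category's share of the total.
import matplotlib.pyplot as plt

flavours = ["Apple", "Blueberry", "Raspberry"]
votes = [80, 40, 136]  # hypothetical counts from the 256 students

plt.pie(votes, labels=flavours, autopct="%1.0f%%")  # label slices with %
plt.show()
```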

Central tendencies

Thus far in the data section of the scientific method course, we have discussed what spreadsheets
are and what their use is, what different types of data you can collect, and from there what
graphs you can produce. In the graphs lecture, I mentioned that both bar graphs and line graphs can
represent either the absolute number of a category or they can present some measure of the central
tendency of measurements of a particular category. In this session, we will expand a bit on the idea
of a central tendency and how we can measure and present these.

The measure of central tendency is a way in which we can try to describe our data. In the
spreadsheet lecture, we saw that each row could be an observation in your dataset. If we have 1000
observations and at each observation we measure a few continuous variables, it will be very difficult
to describe our data since all the data will be in 1000 lines of our spreadsheet. When we have
collected lots and lots of data we will want to describe the data. There are three ways to describe
the data. The first will be graphically, the second will be using the central tendency and the third will
be using the measure of spread. In this session, we will look into the first two of these approaches
and then investigate measures of spread (or variability) in the next session. But before we start we
need to refresh ourselves on some terminology.
When we are doing an experiment or collecting data the best possible way to answer our question is
to collect every bit of information about what you are interested in. Thus if you wanted to know if
cats in Alice are bigger than cats in East London you would find every single cat in Alice and every
single cat in East London and weigh them all. In the above example, all the cats in Alice are called
the population of cats in Alice and all the cats in East London are called the population of cats in East
London. Easy, right? If life was that easy scientists would be out of work. It is going to take a long
time and possibly a lot of money and effort to try to measure all these cats. In most cases when doing
an experiment you have a limited amount of time and a limited amount of resources to try and
answer your question of interest. So scientists will often collect a smaller dataset from a portion of
the overall population; these data come from a sample of the population. So, the
sample is what we observe and is a portion of the overall population that we are interested in. We
can use the measurements from the sample to try to make observations (or inference) about the
larger populations. In general, the more samples you take from the population, the better the
estimate of the population the sample will be. Once we have collected data from the sample we are
going to want to summarise these data so that we can quickly and easily explain what patterns we
are observing in our data. There are two options for summarizing the data, the first is we can
calculate what is known as a summary statistic. The summary statistic is a single value that provides
a summary of all the data that you have collected for a particular variable. The second option is
drawing a graph. Now we have discussed some basic graphs in session 2, and we will now use the
bar graph to try to summarise our data so that we can see what patterns are emerging. We start
doing this by plotting a graph of how many times each observation occurs, and this is called a
frequency distribution or histogram.

The histogram is a graphical way of describing the data that we have collected. It simply provides a
bar chart of the number of each category of your data. If your data are continuous you can create
categories by grouping similar values together into what are known as bins. For example, if you have
a continuous measurement that you have made such as height of students in cm, you can group the
heights into 5 cm bins or 10 cm bins. The result of the histogram is a visual assessment of the shape
of the data that you have collected. On page five of the slides the histogram shows the number of
kids counted playing in the park. The histogram pattern suggests that there are few observations of
young children, few observations of older children and the most common age of children at the park
is between 8 and 11 years old. Slide 6 shows an example of an experiment we may have
done when we were bored at home during the lockdown. The question we ask is: do all flies have the
same length wings. We catch 30 flies, remove their wings and measure the length. To create the
frequency distribution or histogram we simply put the wing length on the x-axis and count the
number of each wing length and generate a bar chart with the heights of the bars the number of
each wing length. This shows a basic histogram of the wing lengths of the 30 flies (slide 7). From this
histogram, we can get an idea of the most common wing length, but it is not that informative. We
might not think that a 1mm difference between wings is relevant so we may want to group our wing
length data into bins of 3mm. The histogram on slide 8 does just this. From this histogram, we now
clearly see that the most common wing length is between 3.6 and 3.8 mm. The x-axis on the
histogram can be any relevant ordered category (where continuous variables can be made into
categories by creating even-sized bins). Once you have the histogram, which is the frequency
distribution of the data, you can draw a smooth curve through these data and this resulting curve is
the distribution of your data. The shape of this distribution will dictate which types of statistics you
will be able to do on your data.
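
As a sketch of building such a histogram (in matplotlib, with 30 randomly generated wing lengths rather than the measured values on the slides):

```python
# A histogram: continuous measurements grouped into bins on the x-axis,
# with the count in each bin as the bar height.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
wing_lengths = rng.normal(loc=3.7, scale=0.3, size=30)  # 30 wings, in mm

plt.hist(wing_lengths, bins=8)  # change bins to group the data differently
plt.xlabel("Wing length (mm)")
plt.ylabel("Number of flies")
plt.show()
```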
If the shape of the distribution is symmetrical (i.e. a mirror of itself if a line is drawn down the middle
of the curve) then the distribution is called a normal distribution. This is often referred to as the bell-
shaped curve. If there are lots of smaller values in the dataset, the data distribution curve will be
humped towards the smaller values and have what is known as a long tail towards the larger
numbers. This is called a positively skewed data distribution. Similarly, if the data have lots of larger
values the hump of the curve will be centred around the larger values and the tail of the data will be
longer towards the smaller values. This is known as a negatively skewed data distribution. You can
tell which way the skew is by seeing in what direction the tail points. If the tail points towards the left
(i.e. towards negative numbers eventually) the distribution is negatively skewed. If the tail points
towards larger numbers then the data are positively skewed. The distribution can also be described by
the height of the hump. A very steep hump is known as a leptokurtic distribution and a very flat
hump is known as a platykurtic distribution. The leptokurtic distribution results from a dataset
where all of the values are very similar whereas a platykurtic distribution results from a dataset
where all the values are quite spread out and not that similar.
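
These shape descriptions can also be computed. The sketch below (assuming the scipy library is available) measures skewness and kurtosis for a generated dataset with lots of small values and a long tail towards larger numbers:

```python
# Skewness and kurtosis of a positively skewed, generated dataset.
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)  # many small values, long tail

print(skew(data))      # positive: the tail points towards larger numbers
print(kurtosis(data))  # high: a steeper (more leptokurtic) hump
```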

Up to now, we have been discussing how to look at the shape of your data using histograms. In the
next section we will look at how you can describe your data with a single value called a summary
statistic. This summary statistic is often the measure of central tendency, so it gives some estimate of
the middle value or most common value of the measurements you have made. In the histogram the
most common value was the highest bar and the middle value was roughly in the middle of the
frequency distribution. The central tendency can be measured through any one of the median, mean
(average) or mode. The main difference between these measures will result from the presence of
what are known as outliers. Outliers are datapoints that are very different from the rest of the data
collected, and they will pull your mean away from the mode and the median. The value of
the summary statistic is that it very quickly gives you a measure that can be used to compare two
samples. If you recall, at the start of the session we talked about the weight of cats in East London
versus cats in Alice. If we measured 1000 cats in each location it would be very difficult to say which
has the largest cats in general. However, by using a summary statistic where the Alice data are
reduced to one value and the East London data are reduced to one value we can immediately see if
there is any difference and which might be bigger or smaller. We will look at three possible summary
statistics, the mean, median and mode.

Mean: The mean or the average is simply the sum of all measurements divided by the number of
measurements. The mean has three main advantages. First, it uses all the available data, so it is a
true reflection of the full dataset. Second, the mean can be manipulated mathematically, which makes
it very nice to work with for further analysis. Third, the mean is in general resistant to sampling
variation, so whether you have a smaller or a larger sample size the mean will often be very similar.
The mean also, however, has a disadvantage: it is extremely sensitive to outliers. Any very large or
very small value relative to the rest of the values will have a large impact on the mean. This means
that negatively or positively skewed data distributions will end up with a mean that is pulled to the
side where the tail is.

Mode: The mode is simply the most common value in the dataset. The mode is, therefore, the same
value as the highest peak in the histogram. The advantage of this is that it is very easy to calculate
and can be used with nominal data. Thus if you have five colours of sweets that are chosen by
students, the mode would be the colour most chosen. There cannot be a mean of nominal data.
However, the disadvantage of the mode is that there could be more than one mode, which could cause
confusion, especially if they are not near each other. Data can be defined by the number of modes,
i.e. unimodal = one mode, bimodal = two modes, multimodal = many modes and uniform = no real
peak and all values are fairly equally represented.

Median: The median is the middle measurement when the measurements are arranged from
smallest to largest. To estimate the median you first arrange the data from the smallest measurement
to the largest and then find the measurement that is in the middle of this array of numbers. If there
are an even number of measurements you will end up with two values in the middle of the array of
numbers, you then simply take the average of these two values. The advantage of the median is that
it is unaffected by outliers which means that highly skewed distributions are best represented by the
median value (rather than the mean). Like the mean, the median can be used with ordinal, ratio and
interval data but not nominal data. There are however some disadvantages of the median. The
median can vary quite a lot when sample sizes change, especially when sample sizes are quite small.
Ultimately, since not all the values in the dataset are used (you are technically only using the middle
one or two values), the median itself cannot be used for further mathematical analysis.
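
To see the three summary statistics side by side, here is a sketch using Python's statistics module (multimode needs Python 3.8 or later) on a small invented dataset of cat weights containing one outlier:

```python
# Mean, median and mode on a dataset with an outlier (the 25.0).
import statistics

weights = [3.1, 3.4, 3.4, 3.6, 3.8, 4.0, 25.0]  # kg; 25.0 is an outlier

print(statistics.mean(weights))       # ~6.6 - pulled towards the outlier
print(statistics.median(weights))     # 3.6  - unaffected by the outlier
print(statistics.multimode(weights))  # [3.4] - the most common value
```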

These three summary statistics can be used to effectively summarise or describe the data that you
have collected. However, is this enough to describe the data? If we go back to our cat size example,
we could have found that the cats in Alice have an average weight of 3.4 kg and the cats in East
London have an average weight of 4.2 kg, and we could then assume that cats in East London are bigger.
However, the cats in Alice might range from 3.0 kg to 3.9 kg whereas the cats in East London range
from 2.9 kg to 5.5 kg. There seems to be something happening in the spread of the data around the
central value, the mean. This spread or variability in the measurements could be important and will
be discussed in more detail in the next session.

Measures of variation

In our last session, we looked at summarizing our data using a histogram (visual display of the shape
of our data) or a summary statistic (some measure of central tendency). In this session, we will
continue our description of the data by investigating the spread or variation of the data.

Measuring the central tendency is not enough to describe the data, we need to understand
the variability or variation in the collected data as well. There is a saying, if your feet are in
the freezer and your head is in the oven, on average you are comfortable (not too cold or
too hot). Thus we need to look at the variation in the data as well as the central tendency to
fully understand the data that we have collected. To look at how spread out our data are we
can look at the range, variance or standard deviation of the data.

Range: The range is simply the difference between the lowest and highest measurement in
your dataset. The range provides a quick assessment of how spread out the data are but
does not give much information beyond that. You may have an outlier (remember an outlier
is a value that is nowhere near any of your other measurements) that skews your data in
one direction and the range would then be biased by one measurement that may be very
different to the others.

Variance: The variance provides a statistical average of the dispersion (or spread) of the data,
calculated as the average of the squared deviations from the mean. This measure is not often used
as a measure of the spread of the data
by scientists. However, the variance is an important step in the calculation of the standard
deviation which is a common measure of variation used by scientists.

Standard deviation: For this course, we are not going to go into the maths behind the
calculation of the standard deviation but for those interested, the formula is shown on slide
seven of the PowerPoint notes. To understand what the standard deviation
means we need to look at the meaning of the two words. Deviation means the difference
between individual measurements in the data distribution and the average measurement of
the entire data distribution. Standard implies the typical or average. Thus the standard
deviation (otherwise known as SD) is the typical or average difference between any
measurement and the average of all the measurements. This provides a handy measure of
how spread out your data are around your measure of central tendency. We will get to what
the spread means a bit later. Therefore, combined, the mean and the standard deviation
give a very good quick picture of what your data looks like, even if you have 1000’s of
datapoints.

To summarise the above, the range provides the total spread of the data distribution but
the variance and standard deviation provide a measure of the average spread of the data in
the data distribution. The standard deviation will always be represented by a range of values
from below the average to above the average (thus the spread on either side of the central
point of the data distribution).

In the previous section, we introduced the concept of data distribution and different types
of curves. If you remember we discussed a platykurtic and leptokurtic distribution. If you do
not remember these terms please go back and check the notes for the previous section. The
different distributions (platykurtic and leptokurtic) will result in different magnitudes or
values of the standard deviation. We saw that a platykurtic distribution was quite flat and
we concluded that this distribution has a wide range of data values. This type of distribution
will result in a high standard deviation as the data are spread out on either side of the mean.
In contrast, we concluded that the leptokurtic data distribution had little variation and the
measurements taken tended to cluster around the mean. This data distribution will result in
a small standard deviation as the majority of the measurements are quite similar to the
average measurement.

Now, we will continue going back in time, and I ask that you cast your mind back to the
second session that we did on graphs. Again, if you cannot remember what we did on
graphs please go back and revise that session. At that stage, I mentioned that bar plots and
line plots can both represent the measure of central tendency and we should always
present the spread of the data around that central tendency if we are plotting these. In
these graphs, we now plot the measure of the variation as the mean plus the SD and the
mean minus the SD to indicate the average spread of the data around the mean. The
variation estimates are often termed the error bars of a graph and these are shown on slide
10 of the PowerPoint notes. The range (from top to bottom of the error bars) indicates how
spread out your data are or how clustered your data are around the mean. A very wide SD or
error bar suggests that the data you collected are not very similar and the
measurements vary a lot. A narrow SD or error bar indicates that all your measurements
are quite similar and there is little variation between your measurements. If we are
comparing two categories and the error bars do not overlap we can start assuming that the
two things we are comparing are likely different.
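
A minimal sketch of such a graph, using matplotlib with invented means and SDs for the two cat populations discussed earlier:

```python
# A bar graph of two group means with error bars of one SD above and below.
import matplotlib.pyplot as plt

groups = ["Alice", "East London"]  # the two categories being compared
means = [3.4, 4.2]                 # mean cat weight (kg) per group
sds = [0.3, 0.7]                   # standard deviation per group

plt.bar(groups, means, yerr=sds, capsize=8)  # yerr draws the error bars
plt.ylabel("Cat weight (kg)")
plt.show()
```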

Interpretation and context

Once you have designed your experiment, collected your data and analysed your results, what does
this all mean? What do we do with the data and results that we have obtained, and what is the point
of this process? In this final session, we will look at the final step of an experiment or laboratory report,
and that is explaining what the results mean and placing these results into the broader scientific
context and finally explaining something about your initial observation, problem statement or
question.

Once you have done the experiment, collected the data, estimated summary statistics, drawn some
graphs, done some cool statistics… what now? (Note that we have not covered doing the cool
statistics; you will continue to learn this as your science career unfolds.) All of the summary statistics,
graphs and cool statistics make up your results. Your results will tell the reader what you have found
in your experiment. The final part of a scientific project is bringing these results back to what they
mean about your original hypothesis (do you remember what a hypothesis is? If not, go back and
check that lecture from Prof Vine) and ultimately your problem statement and observation. This final
step is done in the discussion section of your project or lab report. The discussion normally focuses
on the following subsections: 1) summarizing your findings, 2) relating your findings to what we
know from the broader scientific literature, 3) discussing the limitations of your research or
experiment, 4) making suggestions about future work and finally 5) drawing some conclusions.

Summarizing your findings: You will generally start your discussion by summarizing the main
findings of your study. This summary will be a recap of the results (most important of them) without
any of the statistics included. For example, in your results section you may have stated, “The average
length of male flies' wings (3.7 mm, SD = 1.01 mm) was greater than that of females' wings (2.9 mm,
SD = 0.84 mm).” This is you presenting the summary statistics (mean and SD). In the first paragraph of
your discussion, you may then state, “Our experiment indicates that males have longer wings than
females, which may be due to them needing to fly higher or faster”. Thus, you have recapped the
results but left out the actual statistics. In this section just focus on the most WOW results.

Relating your findings to previous research (the scientific context): Since science is always building
on previous work, we need to explain how our results fit in with previous research findings. This
implies that you know about the previous research (which means you would have needed to have
read previous research on your topic – preferably before you even started your experiment or
designed your study). When relating your results to previous research, you will need to say how
your results are the same or different, while citing the previous studies (if you do not remember how
to cite then please go back and check Prof Vine's notes). For example, you could state, “In our study
male flies had significantly longer wings than females and this supports the work on birds where
males similarly had longer wings than females (Bobson et al. 2005).”

Discussing the limitations of your study: As I mentioned in session three, you cannot do a complete
study that measures every single variable and every single item in the population of interest as there
are almost always logistical constraints. For our fly experiment, we only measured flies in our house,
does that mean flies across the globe are the same? Without widening our sample we cannot know
this for sure. Therefore, part of being a scientist is understanding and identifying the limitations and
then explaining how these may have influenced your results. It is important, when you are being
self-critical, that you can sufficiently evaluate your research (you will be doing this for other people's
research as well later in your career). In a lab report, this section shows your lecturer that you have a
complete understanding of the activity that was being done.

Making suggestions for future research: It is unlikely that your research has answered all the
questions that you had, and most of the time the research generates more questions that require
further research. In this paragraph, you can highlight what further work can be done to increase our
understanding of the topic.

Draw some conclusions: This paragraph will conclude the discussion and should bring the story full
circle to your observation, problem statement and hypotheses. Try to avoid repeating what you have
already said in the discussion. This is a good spot to elevate the importance and scope of the work to
a broader context.
