BEP - Summarizing Data

Information
- Result of data processing
- Data only becomes information that you can use to make decisions after it has been processed

Summarizing data
- increases the ability to interpret data
- raw data → experiment → results → summarize

Descriptive statistics
- mathematical summaries of data

Organizing data
- one common method is a line list (line listing)
o one type of epidemiologic database
o organized like a spreadsheet with rows and columns
- each row → observation or record
o represents one person or case of disease
- each column → variable
o contains information about one characteristic of the individual, such as race or date of birth
- 1st column/variable is usually a person's name, initials, or ID no.
- Other columns: demographic information, clinical details, exposures related to illness

I. VARIABLE
- can be any characteristic that differs from person to person
- e.g., height, sex, smallpox vaccination status, or physical activity
- The value of a variable – the number or descriptor that applies to a particular person
o e.g., 5'6" (168 cm), female, and never vaccinated

A. TYPES OF VARIABLE
- The type of values influences the way in which the variables can be summarized.
- Classified into one of four types depending on its scale
Qualitative/Categorical variables
- Nominal-scale variable
o Categories without any numerical ranking, such as county of residence
o Dichotomous variable – a nominal variable with two categories; very common: alive or dead, ill or well, vaccinated or unvaccinated, or did or did not eat a given food
- Ordinal-scale variable
o values that can be ranked
o not necessarily evenly spaced
o e.g., stage of cancer
Quantitative/Continuous variables
- Interval-scale variable
o measured on a scale of equally spaced units, but without a true zero point, such as date of birth
- Ratio-scale variable
o interval variable with a true zero point
o e.g., duration of illness

II. FREQUENCY DISTRIBUTIONS
- displays the values a variable can take and the number of persons or records with each value
- For example, data from a study of women with ovarian cancer and the number of times each woman has given birth (parity) – a ratio-scale variable
To construct a frequency table
1. List all the values that the variable can take, from the lowest value to the highest
2. For each value, record the number of persons who had that value (twins and other multiple-birth pregnancies count only once)

Table 2. Methods of summarizing data (table not reproduced)

III. DESCRIPTIVE STATISTICS
- 2 types that are generally most useful: measures of central tendency and measures of dispersion

A. MEASURES OF CENTRAL TENDENCY
- summaries that calculate the "middle" or "average" of data

1. MEAN
- average
- calculate the mean by adding up all of the measurements in a group and then dividing by the number of measurements

2. MEDIAN
- value at the midpoint of the group
- half of the values in the group are smaller than the median
- the other half of the values in the group are greater than the median
Odd number of measurements
- median = middle value when the values are arranged in ascending order
Even number of measurements
- median = mean of the two middle values when the values are arranged in ascending order

3. MODE
- the value that appears most frequently in the group of measurements
- It is entirely possible for a group of data to have no mode at all, or for it to have more than one mode.
- If all values occur with the same frequency (for example, if all values occur only once), then the group has no mode.
- If more than one value occurs at the highest frequency, then each of those values is a mode
o Bimodal data set – two modes
o Multimodal data set – several modes

WHICH MEASURE TO USE
- the mean, median, and mode all cluster towards the center in a graph
- Each is a slightly different measure of what happened "on average" in the experiment
Mean
- most often used to describe the central tendency
- most sensitive measure, because it reflects the contributions of each of the data values in the group
Median and mode
- less sensitive to "outliers" (data values at the extremes of a group)
- sometimes it is an advantage to have a measure of central tendency (MCT) that is less sensitive to changes in the extremes of the data
- e.g., small number of outliers at one extreme → the median is a better MCT than the mean
Categorical variables
- best MCT is the most frequent outcome (the mode)
- e.g., a survey on the most effective way to quit smoking → a reasonable MCT of the results is the method that works most frequently
If the data contain more than one mode
- summarizing them with the mean or median will obscure this fact
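As a minimal sketch of the three measures of central tendency described above, the following uses Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The sample values are made up for illustration and do not come from these notes. The helper `modes` is a hypothetical name; it applies the "no mode" rule from the notes, under which a group whose values all occur equally often has no mode.

```python
from statistics import mean, median, multimode

def modes(values):
    """Return the list of modes, or [] when all distinct values are equally frequent."""
    top = multimode(values)  # every value tied for the highest frequency
    # If several values tie and the tie includes every distinct value,
    # the notes' rule says the group has no mode.
    if len(top) > 1 and len(top) == len(set(values)):
        return []
    return top

# Made-up measurements (illustration only).
data = [2, 3, 3, 4, 5, 5, 6]
with_outlier = data + [60]  # the same data plus one extreme value

print("mean:", mean(data), "median:", median(data), "modes:", modes(data))
print("with outlier -> mean:", mean(with_outlier), "median:", median(with_outlier))
```

In this made-up sample the single outlier moves the mean from 4 to 11, while the median only moves from 4 to 4.5, which is the sensitivity difference described under WHICH MEASURE TO USE. The sample is also bimodal (3 and 5), a fact that neither the mean nor the median reveals.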
Type of data and best measure of central tendency:

Data: Groups, or classes of things. Survey results often fall in this category, such as "What is the most effective way to quit smoking?" or "Gender Differences in After-School Activities"
- Mode. In these made-up survey results, 'cold turkey' is the most frequent response.

Data: Position on a ranking scale, such as 1-5 stars for movies, books, or restaurants
- Median. The median movie ranking in this survey was 2.3 stars.

Data: Measures on a linear scale (e.g., voltage, mass, height, money, etc.)
- Mean. The shape of these data is approximately the same on the left and the right side of the graph, so we call this symmetrical data. For symmetrical data, the mean is the best measure of central tendency. In this case the mean body mass is 178 grams.
- Median. Notice how the data in this graph are non-symmetrical. The peak of the data is not centered, and the body mass values fall off more sharply on the left of the peak than on the right. When the peak is shifted like this to one side or the other, we call it skewed data. For skewed data, the median is the best choice to measure central tendency. The median body mass for this skewed population is 185 grams.
- Notice how this graph has two peaks. We call data with two prominent peaks bimodal data. In the case of a bimodal distribution, you may have two populations, each with its own separate central tendency. Here one group has a mean body mass of 147 grams and the other has a mean body mass of 178 grams.
- None. Notice how this graph has three peaks and lots of overlap between the tails of the peaks. We call this multimodal data. There is no single central tendency. It is easiest to describe data like this by referring to the graph. Don't use a measure of central tendency in this case; it would be misleading.
- None. In this case, the data are scattered all over the place. In some cases, this may indicate that you need to collect more data. In this case there is no central tendency.

B. MEASURES OF DISPERSION
- measure the spread of a data set
- summaries that indicate the "spread" of the raw measurements around the average
- these two data sets both have the same mean (5)
- values in data set 2 are much more scattered than the values in data set 1
- measures of dispersion (MOD) let us know whether the values in a data set are generally close to or far from the mean

1. RANGE
- simplest of the three measures
- defined by the smallest and largest data values in the set
- The range of data set 1 is 3–8
- gives only minimal information about the spread of the data, by defining the two extremes
- It says nothing about how the data are distributed between those two endpoints

2. VARIANCE, σ²
- measure of how far each value in the data set is from the mean
- defined by:
1. Subtract the mean from each value in the data → measure of the distance of each value from the mean
2. Square each of these distances (so that they are all positive values), and add all the squares
3. Divide the sum of squares by the number of values in the data set

3. STANDARD DEVIATION, σ
- simply the (positive) square root of the variance

Variance and standard deviation
- provide a numerical summary of how much the data are scattered

It would be useful to have a measure of scatter that has the following properties:
1. The measure should be proportional to the scatter of the data
o small when the data are clustered together, and large when the data are widely scattered
2. The measure should be independent of the number of values in the data set
o otherwise, simply taking more measurements would increase the value even if the scatter did not increase
3. The measure should be independent of the mean
o we are only interested in the spread of the data, not its central tendency
Both the variance and the standard deviation meet these three criteria for normally distributed (symmetric, "bell-curve") data sets

IV. SUMMARIZING DATA

Cleaning up the data
- to make sure it is appropriate and accurate prior to being summarized and analyzed
- assessment results from a paper-based survey or rubric may include some unclear or inaccurate responses → correct or eliminate those data from the sample
Some types of responses that may need to be addressed before summarizing data
1. Inapplicable responses
o e.g., male students answered questions meant for female students
2. Inappropriate multiple responses
o two answers checked for a question that allows only one answer
3. Responses outside the given categories
o student wrote in an answer because they didn't like the choices provided
4. "Other" responses that really aren't
o student checked "Other - Please Specify" but their comment matched one of the answers provided

1. Make a list and check it twice
- List the raw data
- Remove identifying information such as names to ensure confidentiality
- Compare the list to the source information. This will help in finding and correcting any errors
2. Tally the results or responses to get a quick picture
3. Chart the results
- tables, line graphs, or bar charts to get a look at the big picture
- The choice depends on the kind of questions the assessments are needed to answer
- Tips
o AVOID complex statistics
o Use round numbers
o simple charts – easier to read and understand
o Sort results from highest to lowest [optional]
o Percentages – more meaningful than averages
o Show trend data if assessing over time
Example 1: Table with percentages added, and a column with the total % of students successful (Exemplary + Good + Minimally Acceptable). N=18, Target=78%
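The tally and percentage steps of section IV could be sketched as follows, using Python's standard `collections.Counter`. The individual ratings are invented for illustration: the notes give only N=18 and a 78% target, not the underlying counts. The label "Unacceptable" and the names `ratings`, `SUCCESSFUL`, and `TARGET_PERCENT` are assumptions, not part of the original example.

```python
from collections import Counter

# Invented rubric ratings for 18 students (illustration only).
ratings = (["Exemplary"] * 5 + ["Good"] * 6
           + ["Minimally Acceptable"] * 4 + ["Unacceptable"] * 3)

# Ratings counted as "successful", following Example 1.
SUCCESSFUL = {"Exemplary", "Good", "Minimally Acceptable"}
TARGET_PERCENT = 78  # target stated in Example 1

tally = Counter(ratings)  # tally the responses
n = sum(tally.values())   # total number of responses

# Express each count as a rounded percentage of all responses.
percent = {label: round(100 * count / n) for label, count in tally.items()}
percent_successful = round(100 * sum(tally[label] for label in SUCCESSFUL) / n)

# most_common() lists the categories from highest to lowest count.
for label, count in tally.most_common():
    print(f"{label:<22}{count:>3}{percent[label]:>5}%")
print(f"Successful: {percent_successful}% (target {TARGET_PERCENT}%)")
```

With these invented counts, 15 of 18 students (83%) fall in a successful category, which would exceed the 78% target. Rounding to whole percentages and sorting from highest to lowest follow the charting tips in the notes.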
Example 2: Line chart using data from the tally above, with the target the program hopes to achieve.

4. Find the Story in the Data [Analyze Data]. Data summaries alone cannot fully communicate your message.
- Data summaries make it easier to see meaning, but by themselves they don't reveal the whole story.
- include an explicit narrative interpretation of the data and what you plan to do about it
- What do the data summaries reveal about students' learning? (identify meaningful information)
- What are you going to do about what you have learned?
- When, where, and how are you going to do it?

V. RESULTS AND DISCUSSION

Results
- the results section of a scientific paper is for narrating findings without trying to interpret or evaluate them
- summarize results, both in written form and visually, using graphs, figures, and tables
- e.g., report a notable correlation between two variables
o Speculating why this correlation exists belongs in the discussion section
Discussion
- interpreting results and trying to explain what they mean
- explaining whether the results support or disprove the hypothesis and why
Limitations
- weaknesses or errors in the study that influenced the results
- the type of data analysis used affects the ability to prove
o causation – one thing causes another
o association – one thing is related to another
o differences – one dataset is different from another
Conclusion
- justifies the results
- Are your results consistent with past studies? Why?
- New valuable information learned
- Disproving a hypothesis is just as significant as supporting it → revise the hypothesis and future research studies
Application
- All research studies add to the overall understanding and body of knowledge of a given topic or field
- Do not overstate the importance of your findings.
- all studies have limitations
- results and conclusions may have been different if you had used a different study site or a larger dataset
- findings might also help drive future research studies by generating new questions
- most research uncovers more questions than answers
- What might you do differently if you were to repeat the study?
- What research questions would you suggest other students studying your stream site(s) ask in the future?
Presenting the study
- Presenting in written or verbal form allows others to learn
- Who will be your audience?
o peers, scientists, professionals, general public
- Are they already familiar with your topic?
- Present material in a way appropriate for your audience
- Be clear
- Label and describe all figures
- Focus on the most important findings
- Use data and results to justify conclusions
- Be careful how you describe your results
o Did you really prove your hypothesis, or did you just find evidence supporting it?
- Ask the audience for questions or comments. They may have a different and equally valid interpretation of your results
