Bio Stat
I. Prepare
1. Context
2. Sampling Method
Were the data collected in a way that is unbiased, or were the data collected in a way
that is biased (such as a procedure in which respondents volunteer to participate)?
II. Analyze
1. Graph the Data
2. Explore the Data
Are there any outliers (numbers very far away from almost all the other data)?
What important statistics summarize the data (such as the mean and standard
deviation)?
How are the data distributed?
Are there missing data?
Did many selected subjects refuse to respond?
III. Conclude
1. Significance
DEFINITIONS
A voluntary response sample (or self-selected sample) is one in which respondents
themselves decide whether to be included.
Statistical significance is achieved in a study when we get
a result that is very unlikely to occur by chance. A common criterion is that we have
statistical significance if the likelihood of an event occurring by chance is 5% or less.
Example:
- Getting 98 girls in 100 random births is statistically significant because such an
extreme outcome is not likely to result from random chance.
- Getting 52 girls in 100 births is not statistically significant because that event could
easily occur with random chance.
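As a rough illustration, here is a minimal Python sketch (assuming a fair 50/50 chance of a girl on each birth, so the counts follow a binomial distribution) that computes how likely each of the two outcomes above is under chance alone; the function name prob_at_least is hypothetical.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) girls in n births when each birth is a girl with probability p."""
    return sum(comb(n, x) * (p ** x) * ((1 - p) ** (n - x)) for x in range(k, n + 1))

print(prob_at_least(98, 100))  # about 4e-27, far below 5%: statistically significant
print(prob_at_least(52, 100))  # about 0.38, easily explained by chance: not significant
```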
With practical significance, it is possible that some treatment or finding is effective, but
common sense might suggest that the treatment or finding does not make enough of a
difference to justify its use or to be practical.
ANALYZING DATA: POTENTIAL PITFALLS
Here are a few items that could cause problems when analyzing data.
1. Misleading Conclusions
2. Sample Data Reported Instead of Measured
When collecting data from people, it is better to take measurements yourself instead of
asking subjects to report results.
3. Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading.
4. Order of Questions
Sometimes survey questions are unintentionally loaded by such factors as the order of
the items being considered.
5. Nonresponse
6. Percentages
Some studies cite misleading or unclear percentages. Note that 100% of some quantity
is all of it, but if there are references made to percentages that exceed 100%, such
references are often not justified.
BASIC TYPES OF DATA
A parameter is a numerical measurement describing some characteristic of a
population.
A statistic is a numerical measurement describing some characteristic of a sample.
EXAMPLE:
There are 17,246,372 high school students in the United States. In a study of 8505 U.S.
high school students 16 years of age or older, 44.5% of them said that they texted while
driving at least once during the previous 30 days (based on data in "Texting While
Driving and Other Risky Motor Vehicle Behaviors Among High School Students," by
Olsen, Shults, Eaton, Pediatrics, Vol. 131, No. 6).
The count of 17,246,372 is a parameter because it describes the entire population of all high school
students in the United States, while the value of 44.5% is a statistic because it is based only on the
sample of 8505 students. If we somehow knew the percentage of all 17,246,372
high school students who reported they had texted while driving, that percentage would
also be a parameter.
QUANTITATIVE/CATEGORICAL
Quantitative (or numerical) data consist of numbers representing counts or
measurements.
Categorical (or qualitative or attribute) data consist of names or labels (not numbers
that represent counts or measurements).
EXAMPLES:
1. Discrete Data of the Finite Type: Each of several physicians plans to count the
number of physical examinations given during the next full week. The data are
discrete data because they are finite numbers, such as 27 and 46 that result from a
counting process.
2. Discrete Data of the Infinite Type: Researchers plan to test the accuracy of a
blood typing test by repeating the process of submitting a sample of the same blood
(Type O+) until the test yields an error. It is possible that each researcher could
repeat this test forever without ever getting an error, but they can still count the
number of tests as they proceed. The collection of the numbers of tests is countable,
because you can count them, even though the counting could go on forever.
3. Continuous Data: When the typical patient has blood drawn as part of a routine
examination, the volume of blood drawn is between 0 mL and 50 mL. There are
infinitely many values between 0 mL and 50 mL. Because it is impossible to count
the number of different possible values on such a continuous scale, these amounts
are continuous data.
LEVELS OF MEASUREMENT
1. NOMINAL LEVEL
Data are at the nominal level of measurement if they consist only of names, labels, or
categories and cannot be arranged in an ordering scheme (such as low to high).
2. ORDINAL LEVEL
Data are at the ordinal level of measurement if they can be arranged in some
order, but differences (obtained by subtraction) between data values either cannot be
determined or are meaningless.
EXAMPLE:
Course Grades: A biostatistics professor assigns grades of A, B, C, D, or F. These
grades can be arranged in order, but we can't determine differences between the
grades. For example, we know that A is higher than B (so there is an ordering), but we
cannot subtract B from A (so the difference cannot be found).
3. INTERVAL LEVEL
Data are at the interval level of measurement if they can be arranged in order, and
differences between data values can be found and are meaningful; but data at this level
do not have a natural zero starting point at which none of the quantity is present.
EXAMPLES:
4. RATIO LEVEL
Data are at the ratio level of measurement if they can be arranged in order,
differences can be found and are meaningful, and there is a natural zero starting point
(where zero indicates that none of the quantity is present). For data at this level,
differences and ratios are both meaningful.
EXAMPLES:
BIG DATA
Big data refers to data sets so large and so complex that their analysis is beyond the
capabilities of traditional software tools. Analysis of big data may require software
simultaneously running in parallel on many different computers.
Data science involves applications of statistics, computer science, and software
engineering, along with some other relevant fields (such as biology and epidemiology).
MISSING DATA
A data value is missing completely at random if the likelihood of its being missing is
independent of its value or any of the other values in the data set. That is, any data
value is just as likely to be missing as any other data value.
A data value is missing not at random if the missing value is related to the reason that
it is missing.
Different Methods of Correcting Missing Data:
1. Delete Cases: One very common method for dealing with missing data is to
delete all subjects having any missing values.
2. Impute Missing Values: We impute missing data values when we substitute
values for them. There are different methods of determining the replacement values,
such as using the mean of the other values, or using a randomly selected value from
other similar cases, or using a method based on regression analysis.
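A minimal sketch (using pandas, with hypothetical column names and values) contrasting the two corrections described above: deleting cases that contain missing values versus imputing the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with two missing values (NaN)
df = pd.DataFrame({
    "age": [34, 51, np.nan, 45, 29],
    "systolic_bp": [120, np.nan, 135, 142, 118],
})

# 1. Delete cases: drop every subject (row) that has any missing value
deleted = df.dropna()

# 2. Impute missing values: replace each gap with the mean of the other values in its column
imputed = df.fillna(df.mean(numeric_only=True))

print(deleted)
print(imputed)
```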
BASICS OF DESIGN OF EXPERIMENTS
The Gold Standard: Randomization with placebo/treatment groups is sometimes called
the “gold standard” because it is so effective. (A placebo, such as a sugar pill, has no
medicinal effect.)
DEFINITIONS
In an experiment, we apply some treatment and then proceed to observe its effects on
the individuals. (The individuals in experiments are called experimental units, and they
are often called subjects when they are people.)
In an observational study, we observe and measure specific characteristics, but we do
not attempt to modify the individuals being studied.
EXAMPLE:
Observational Study: Observe past data to conclude that ice cream causes drownings
(based on data showing that increases in ice cream sales are associated with increases
in drownings). The mistake is to miss the lurking variable of temperature and to fail
to see that as the temperature increases, ice cream sales increase and
drownings increase because more people swim.
Experiment: Conduct an experiment with one group treated with ice cream while
another group gets no ice cream. We would see that the rate of drowning victims is
about the same in both groups, so ice cream consumption has no effect on drownings.
Here, the experiment is clearly better than the observational study.
COLLECTING SAMPLE DATA
A simple random sample of n subjects is selected in such a way that every possible
sample of the same size n has the same chance of being chosen. (A simple random
sample is often called a random sample, but strictly speaking, a random sample has the
weaker requirement that all members of the population have the same chance of being
selected. That distinction is not so important in this text.)
In systematic sampling, we select some starting point and then select every kth (such
as every 50th) element in the population.
With convenience sampling, we simply use data that are very easy to get.
In stratified sampling, we subdivide the population into at least two different subgroups
(or strata) so that subjects within the same subgroup share the same characteristics
(such as gender). Then we draw a sample from each subgroup (or stratum).
In cluster sampling, we first divide the population area into sections (or clusters). Then
we randomly select some of those clusters and choose all the members from those
selected clusters.
In a multistage sample design, pollsters select a sample in different stages, and each
stage might use different methods of sampling.
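A minimal sketch of three of the sampling methods defined above, using pandas and NumPy. The population of 1,000 subjects with "id" and "sex" columns, the sample sizes, and the random seeds are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
population = pd.DataFrame({
    "id": range(1, 1001),
    "sex": rng.choice(["F", "M"], size=1000),
})

# Simple random sample: every possible sample of size n = 50 has the same chance of being chosen
srs = population.sample(n=50, random_state=1)

# Systematic sample: pick a random starting point, then take every kth subject
k = 20
start = int(rng.integers(0, k))
systematic = population.iloc[start::k]

# Stratified sample: subdivide by sex, then draw 25 subjects from each stratum
stratified = population.groupby("sex").sample(n=25, random_state=1)
```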
OBSERVATIONAL STUDIES
In a cross-sectional study, data are observed, measured, and collected at one point in
time, not over a period of time.
In a retrospective (or case-control) study, data are collected from a past time period
by going back in time (through examination of records, interviews, and so on).
In a prospective (or longitudinal or cohort) study, data are collected in the future
from groups that share common factors (such groups are called cohorts).
EXPERIMENTS
In a study, confounding occurs when we can see some effect, but we can’t identify the
specific factor that caused it.
Completely Randomized Experimental Design: Assign subjects to different treatment
groups through a process of random selection.
Randomized Block Design: A block is a group of subjects that are similar, but blocks
differ in ways that might affect the outcome of the experiment. Use the following
procedure: Form blocks (or groups) of subjects with similar characteristics; and
randomly assign treatments to the subjects within each block.
Matched Pairs Design: Compare two treatment groups (such as treatment and
placebo) by using subjects matched in pairs that are somehow related or have similar
characteristics.
Rigorously Controlled Design: Carefully assign subjects to different treatment groups,
so that those given each treatment are similar in the ways that are important to the
experiment. This can be extremely difficult to implement, and often we can never be
sure that we have accounted for all of the relevant factors.
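A minimal sketch of a completely randomized design, as described above: each subject is assigned to the treatment group or the placebo group purely by chance. The subject labels, group sizes, and random seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
subjects = [f"S{i:02d}" for i in range(1, 21)]  # 20 hypothetical subjects

shuffled = rng.permutation(subjects)  # a random ordering removes any systematic assignment
treatment_group = list(shuffled[:10])
placebo_group = list(shuffled[10:])

print("Treatment:", treatment_group)
print("Placebo:  ", placebo_group)
```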
SAMPLING ERRORS
A sampling error (or random sampling error) occurs when the sample has been
selected with a random method, but there is a discrepancy between a sample result and
the true population result; such an error results from chance sample fluctuations.
A nonsampling error is the result of human error, including such factors as wrong data
entries, computing errors, questions with biased wording, false data provided by
respondents, forming biased conclusions, or applying statistical methods that are not
appropriate for the circumstances.
A nonrandom sampling error is the result of using a sampling method that is not
random, such as using a convenience sample or a voluntary response sample.
Biostatistics plays a role in areas of study such as:
- Chronic diseases
- Cancer
- Human growth and development
- The relationship between genetics and the environment
- AIDS
- Environmental health (its impact and monitoring)
Biostatistics is integral to the advancement of knowledge, not only in public
health policy, but also in biology, health policy, clinical medicine, health economics,
genomics, proteomics, and a number of other disciplines.
The Role of Biostatisticians
Biostatisticians are said to be specialists in data evaluation, as it is their expertise
that allows them to take complex, mathematical findings of clinical trials and research-
related data and translate them into valuable information that is used to make public
health decisions. The work of biostatisticians is also required in government agencies
and legislative offices, where research is often used to influence change at the policy-
making level.
In short, these professionals use mathematics to enhance science and bridge the gap
between theory and practice.
Biostatisticians are required to develop statistical methods for clinical trials,
observational studies, longitudinal studies, and genomics.
What is Informatics?
Informatics, an emerging field also known as bioinformatics, is a
science that relies on the basic disciplines of science, mathematics, probability and
statistics, and computer science to build a solid statistical foundation for making
advances, improvements, and even breakthroughs in public health and medicine.
Health informatics is often said to meet at the intersection of information science,
computer science, and healthcare, as it deals with the resources, devices, and methods
required for the effective storage, use, and retrieval of information, while public health
informatics includes the application of informatics in public health areas, such as
surveillance, prevention, preparedness, and health promotion. Public health informatics
focuses on information and technology issues from the perspective of groups of
individuals.
Naturally, health informatics tools would include computers, making systems analysts
important members of public health informatics research teams. It is the responsibility of
expert informaticists to systematically apply information, computer science, and
technology into research, learning, and the practice of public health.
The Role of Systems Analysts in Informatics
Systems analysts are called upon to write and troubleshoot the software used by
biostatisticians and researchers. Their work may also include conducting their own
research, designing databases, and developing algorithms for processing and analyzing
information.
The main responsibilities of systems analysts in biostatistics and informatics include:
- The distribution indicates something about how and why that disease process
occurs.
Wikipedia defines an epidemic as the rapid spread of disease to a large number of
people in a given population within a short period of time. Epidemics occur when an
agent and susceptible hosts are present in adequate numbers, and the agent can be
effectively conveyed from a source to the susceptible hosts. It is therefore important to
determine the characteristics of epidemic diseases that cause them to die out and not
reappear for a long period of time.
The epidemic characteristics of diseases are:
1. The incubation period of the disease – refers to the time interval between the
infection and the appearance of the signs and symptoms of the disease
2. The various means by which a disease may spread
3. The speed of penetration of a disease into the community
4. The rapidity of the disappearance of a disease from a community
Steps in an Outbreak Investigation
1. Confirmation of Outbreak
2. Verify Diagnosis
3. Case Definition
4. Case Finding
5. Descriptive Epidemiology
6. Generate Hypothesis
7. Analytical Epidemiology
8. Evaluate Control Measures
9. Surveillance
Module 1 - Lesson 5: The Sample Size and Sampling Technique-2
The sample size is the portion of the general population taking part
in the study. It is an important feature of any empirical study in which the goal is
to draw conclusions about a population from a sample. The smaller your sample
size, the higher your margin of error and the lower your confidence level; this means
that your data become less reliable. Conversely, the greater the sample
size, the more “statistically significant” the result will be. In other words, if a very
large sample is used, even a small difference from the null hypothesis will be
statistically significant, even if it is not, in fact, practically important.
There are a number of different methods for calculating sample size. This
link https://www.statisticshowto.com/probability-and-statistics/find-sample-size/ shows
different methods of computing a sample size, such as Cochran's formula and Excel.
Likewise, you can also look for online calculators that can be useful for your research study.
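As a rough illustration of one of those methods, here is a minimal sketch of Cochran's formula with the finite population correction. The defaults are assumptions for illustration: z = 1.96 for a 95% confidence level, a conservative p = 0.5, and a 5% margin of error e; the function name is hypothetical.

```python
from math import ceil

def cochran_sample_size(z=1.96, p=0.5, e=0.05, population_size=None):
    """Cochran's formula: n0 = z^2 * p * (1 - p) / e^2, rounded up."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population_size is None:
        return ceil(n0)
    # Finite population correction for smaller populations
    return ceil(n0 / (1 + (n0 - 1) / population_size))

print(cochran_sample_size())                      # 385 subjects for a very large population
print(cochran_sample_size(population_size=1000))  # 278 subjects when the population has 1,000 members
```

Note how a smaller margin of error e drives the required sample size up, matching the relationship between sample size and margin of error described above.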
Once you’ve chosen the sample size for your study, you’ll need to determine
which sampling technique you’ll use to select your sample from the target
population. The sampling technique that’s right for you depends on the nature
and objectives of your study. There are several sampling techniques available,
and they can be subdivided into two groups: probability sampling and non-
probability sampling.
Click on this link https://www.slideserve.com/courtney/sampling-techniques to see the
various sampling techniques used in research. Examples are also provided for a better
understanding of the topic.
Module 1 - Lesson 6: Methods of
Collecting, Presenting, Organizing
and Summarizing Data
In this era when “information is power,” how we collect information should be one of our
issues of concern as well as which method of collecting data best answers our
individual needs. If the data collected are unreliable, they will surely affect the findings of the
study, thereby leading to false or invalid results. Conversely, if the collected data are
accurate, they can help researchers predict future occurrences and trends.
Data collection is the systematic process of gathering and measuring information from
a variety of sources to get a complete and accurate picture of an area of
interest. Surveys, interviews, and focus groups are principal tools for collecting
information. Today, with help from Web and some analytics tools, researchers are also
able to collect data from mobile devices, website traffic, server activity, and other
relevant sources, depending on the project and needs.
The video Learn Data Collection Methods | Data Science | Quantra Free Courses presents various methods of data collection.
Presentation of Data
The presentation of data is of utmost importance nowadays. After all, everything that’s
pleasing to our eyes never fails to grab our attention. The presentation of data refers
to exhibiting or arranging data in an attractive and useful manner such that they can be
easily interpreted.
The three main forms of data presentation are:
1. Textual presentation - In this method, data are presented in text format similar
to what is found in books, reports, and research papers.
2. Data tables - In this form, data are presented in rows and columns. It is a precise
way of showing all the data but it can be hard to interpret or see a pattern. It is
normally used to differentiate, classify, compare, and relate different datasets.
3. Diagrammatic Presentation - Data can further be presented in a simple and
even easier form by using diagrams, illustrations, images, or graphs. Changing raw
data into a diagrammatic form makes it quicker and easier to interpret.
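A minimal sketch showing one small data set in each of the three forms above, using pandas and matplotlib. The blood-type counts are hypothetical values used only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.Series({"O+": 38, "A+": 34, "B+": 9, "AB+": 3}, name="patients")

# 1. Textual presentation
print(f"Of {counts.sum()} patients, {counts['O+']} had blood type O+.")

# 2. Data table (rows and columns)
print(counts.to_frame())

# 3. Diagrammatic presentation (a simple bar chart)
counts.plot(kind="bar", title="Patients by blood type")
plt.xlabel("Blood type")
plt.ylabel("Number of patients")
plt.show()
```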