
**Dr. Muhamad Saiful Bahri Yusoff**

MD, MScMEd

Content

RESEARCH IN MEDICAL EDUCATION
STEPS IN RESEARCH
SAMPLING METHOD
SAMPLE SIZE
OVERVIEW ON MEDICAL STATISTICS
STUDY DESIGN
DESCRIPTIVE STATISTICS
HYPOTHESIS FORMULATION AND TESTING
CONFIDENCE INTERVAL
EXPLORATORY DATA ANALYSIS (NUMERICAL DATA)
UNIVARIATE ANALYSIS OF NUMERICAL DATA
UNIVARIATE ANALYSIS OF CATEGORICAL DATA
CORRELATION & REGRESSION
  CORRELATION
  SIMPLE LINEAR REGRESSION (SLR)
CORRELATION
NONPARAMETRIC STATISTICS
NON-PARAMETRIC TESTS
STATISTICAL ANALYSIS: WHICH TO CHOOSE?
WRITING A RESEARCH PROPOSAL
VARIABLES
DATA PRESENTATION
Z-SCORE & ITS USES
t-test
SENSITIVITY & SPECIFICITY


RESEARCH IN MEDICAL EDUCATION

1. Research and types of research:
• How do we develop knowledge?
  o Intuitive knowledge (based on "I feel" or "I think")
  o Authoritative knowledge (based on the view of an authorized person)
  o Logical knowledge (based on a reasonable and logical explanation of experience)
  o Empirical knowledge (based on judgement backed up by facts, and usually 90% correct)
• What is research?
  o Research is a systematized effort to gain new knowledge (Redman & Mory)
  o Literally, research means to search again and again.
  o Research is an organized and systematic way of finding answers to questions.
• Research comprises:
  o Defining and redefining problems
  o Formulating hypotheses or suggested solutions
  o Collecting, organizing and evaluating data
  o Making deductions and reaching conclusions

  o Testing the conclusions to determine whether they fit the formulated hypothesis (Clifford Woody)

• Types of research
  o Basic research and applied research
  o Quantitative research and qualitative research

• Qualitative research:
  o Ethnography, cognitive anthropology, etc.
  o Synthetic rather than analytic
  o Generally hypothesis generating
  o Investigative methods are non-intrusive
  o Data are more impressionistic.


  o Research in such a situation is a function of the researcher's insights and impressions.
  o Report is a detailed verbal description.

Educational Research (diagram, after Thomas K. Crowl): History, Descriptive, Correlational, Group comparison, Ethnographic, Survey, Experimental, Quasi-experimental, Ex Post Facto or Causal-comparative

• Descriptive research
  o Includes quantitative and qualitative research.
  o Methodologies include observations, surveys, self-report and tests.
  o May operate on the basis of hypotheses.
• Ethnographic research
  o Descriptive and qualitative research.
  o Deals with naturally occurring phenomena.
  o Carried out in a natural setting.
  o Researcher as participant and observer.
• Survey
  o Descriptive
  o Quantitative study
• Correlational research
  o Investigates the relationship between two or more variables.
  o Searches for the relationship of variables in a natural setting.

• Group comparison research
  o Comparing the values of two or more groups of a population.
• Ex Post Facto or Causal-comparative study
  o Ex Post Facto in Latin is "after the fact".
  o Values of the independent variable of the two groups are preset (already present).
• Experimental research
  o Random selection of the individuals forming the groups: experimental group and control group
  o Groups are randomly selected.
• Quasi-experimental research
  o A type of group comparison research.

2. Quantitative/Qualitative Research:
• Deductive
  o Begin with a theory and collect data to test it.
• Inductive
  o Begin with observations and attempt to explain them by generalizing.
• Deductive reasoning
  o A type of logic in which one goes from a general statement to a specific instance.
• Inductive reasoning
  o Involves going from a series of specific cases to a general statement.
  o The conclusion in an inductive argument is never guaranteed.
• Confirmatory
  o Experimental
  o Quasi-experimental
  o Correlational (non-experimental)
• Exploratory
  o Qualitative

3. Qualitative research methods for data collection
• Interviews
• Focus groups
• Survey: open ended questions
• Observations: recorded in field notes
• Document analysis
• What is qualitative data?
  o Data in the form of words, rather than numbers, based on:
    - Asking open ended questions in interviews, groups and surveys
    - Observation of situations and actions, recorded in field notes
    - Examination of documents
• Uses of qualitative data
  o Some social sciences, e.g. anthropology, history, psychology, sociology
  o Public health
  o Policy analysis
  o Health care evaluation

4. Types of quantitative research design:
• The research designs which are commonly used can be divided into the following groups (X = intervention, O = observation):
  o Non experimental design
    - Post-test design: X O1 (no control)
    - Pretest-post test design: O1 X O2 (no control)

    - Static group comparison: X O1 / O2 (no control)
  o True experimental design (R.A = Random Allocation)
    - Pretest-post test control group design
        Exp. group (R.A):  O1 X O2
        Control (R.A):     O3   O4
    - Post test control group design
        Exp. group (R.A):  X O1
        Control (R.A):       O2
  o Quasi-experimental design
    - Time series: O1 X O2 X O3 X O4 (no equivalent control group)
    - Non-equivalent control group
        Exp. group:  O1 X O2
        Control:     O3   O4
    - Separate sample pretest post test design
        Pretest group (R.A):    O1 X
        Post test group (R.A):     X O2

5. Purpose of Medical Education Research:
To improve the functioning of educational programmes by providing information for:
  o Decision making
  o Evaluating outcomes
  o Supporting advocacy for change
  o Contributing to the body of knowledge related to concepts and methods.

Research is like a plant that grows and grows and grows and grows…

When it is grown, it throws off seeds of all types (basic, applied and practical) which in turn sprout and create more research projects… The process continues with all of the new research 'plants' throwing off seeds, creating additional, related research projects of various types… Soon there is a body of basic, applied and practical research projects related to similar topics… And the process goes on and on…

STEPS IN RESEARCH

1. Preliminary steps:
• Clarifying the purpose
• Formulating the topic
  o State your topic idea as a question
  o Identify the main concepts or keywords

2. Finding background information
• Critically analyzing information sources
  o Initial appraisal: author, date of publication, edition or revision, publisher, title of the journal
  o Content analysis: intended audience, objective reasoning, coverage, writing style, evaluative review

3. Five steps to write a topic for better research
• Think about your topic
• Define your main concepts
• Think of synonyms
• Think of broader terms
• Think of narrower terms

4. Steps in research
• Planning
  o Formulation of the study objectives

    - General objective: what is the purpose of the study?
    - Specific objectives: what are the things you want to find in the study?
  o Planning of methods
    - Study population: selection and definition, sampling, sample size
    - Variables: selection, definition, scales of measurement
    - Method of data collection
    - Method of recording and processing
• Preparing for data collection
  o Construction of research instrument
  o Pretesting the instrument
• Collecting the data
• Processing the data
• Interpreting the data
• Writing a report

5. To prioritize a problem and select a topic for research, it is helpful to ask yourself a series of questions and then try to answer each of them:
• Is the problem a current one? Does the problem exist now?
• How widespread is the problem? Are many areas and many people affected by the problem?
• Does the problem affect social groups, such as students, teachers and patients?
• Does the problem relate to broad social, economic, and health issues, such as unemployment, income maldistribution, the status of women, education and maternal and child health?

• Who else is concerned about the problem? Are top government officials concerned? Are medical doctors or other professionals concerned?
• Are the resources available?
• Are measures available to solve the problem?

Review your answers to these questions, rank the problems, and arrange them according to the ranking.

Research cycle (diagram): Problem identification → Information gathering & knowledge building → Research question/hypothesis formulation → Planning research → Data collection → Data processing → Data analysis → Drawing inference → Confirmation or rejection of hypothesis → Report writing → Dissemination of findings

SAMPLING METHOD

Sampling Method (diagram):
• Probability sampling
  o Unrestricted sampling (simple random sampling)
  o Restricted sampling: systematic sampling, stratified random sampling, cluster sampling, area sampling, double sampling
• Non-probability sampling
  o Convenience sampling
  o Purposive sampling: judgment sampling, quota sampling

1. Population, Element and Population Frame:
• Population refers to the entire group of people, events, or things of interest that the researcher wishes to investigate.
• An element is a single member of the population.
• Population frame is a listing of all the elements in the population from which the sample is drawn.

2. Sample and Subject:
• A sample is a subset of the population; it comprises some members selected from it.
• A subject is a single member of the sample (just like an element is a single member of the population).

3. Sampling:
• Sampling is the process of selecting a sufficient number of elements from the population, so that a study of the properties or characteristics of the sample makes it possible to generalize such properties or characteristics to the population.

4. Two major types of sampling design:
• Probability sampling (sample picked at random).
  o The elements in the population have an equal chance or probability of being selected as sample subjects.
  o Probability sampling designs are used when the representativeness of the sample is of importance in the interests of wider generalisability.
• Non-probability sampling (sample not randomly picked).
  o The elements do not have a predetermined chance of being selected as subjects.
  o When time or other factors, rather than generalisability, become critical, non-probability sampling designs are chosen.

5. Probability Sampling:
• Unrestricted sampling:
  o More commonly known as simple random sampling.
  o Every element in the population has a known and equal chance of being selected as a subject.
  o Advantage: this kind of sampling method has the least bias.
  o Disadvantages: cumbersome (difficult) and expensive. An entirely updated listing (population frame) of the population may not always be available.
• Restricted (complex) random sampling:
  o Offers a viable, and sometimes more efficient, alternative to the unrestricted design.

  o Five most common complex probability sampling methods

Systematic sampling
• Drawing every nth element in the population, starting with a randomly chosen element between 1 and n.
• For example, if we want a sample of 60 students from a total population of 300 students, the sampling interval is 300/60 = 5, so we could sample every 5th student (5, 10, 15, …) until 60 students are selected.
• The starting number must be selected randomly; for example, we can take out a one-ringgit note and use the last digit of its serial number.

Stratified random sampling
• When sub-populations vary considerably, it is advantageous to sample each sub-population (stratum) independently.
• Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling.
• The strata should be mutually exclusive: every element in the population must be assigned to only one stratum.
• The strata should also be collectively exhaustive: no population element can be excluded.
• Random sampling is then applied within each stratum.

Cluster sampling
• Cluster sampling is used when natural groupings are evident in the population.
• The total population is divided into groups or clusters.
• Each cluster must be mutually exclusive and collectively exhaustive.
• Elements within a cluster should be as heterogeneous as possible, but there should be homogeneity between clusters.
• A random sampling technique is then used to choose which clusters to include in the study.

Area sampling
• One version of cluster sampling is area sampling, or geographical cluster sampling.
• Clusters consist of geographical areas.
• A geographically dispersed population can be expensive to survey.
• Greater economy than simple random sampling can be achieved by treating several respondents within a local area as a cluster.

Double sampling
• A sampling design where initially a sample is used in a study to collect some preliminary information of interest, and later a sub-sample of this primary sample is used to examine the matter in more detail.
• It is like a reverse pilot study, because double sampling first takes the whole sample and then proceeds to sample the sub-sample of interest.

6. Non-probability Sampling:
• The elements in the population do not have any probabilities attached to their being chosen as sample subjects.
• The findings from the study of the sample cannot be confidently generalized to the population.
• This method is chosen when generalisability is not critical; the focus may be on obtaining preliminary information in a quick and inexpensive way.
• 2 broad categories:
  o Convenience sampling
    - Collection of information from members of the population who are conveniently available to provide it.
  o Purposive sampling

    - The sampling is confined to specific types of people who can provide the desired information, either because they are the only ones who have it, or because they conform to some criteria set by the researcher.
    - 2 types of purposive sampling:
• Judgment sampling
  o Involves the choice of subjects who are most advantageously placed or in the best position to provide the information required.
  o Judgment sampling may curtail the generalisability of the findings because we are using a sample of experts who are conveniently available to us.
  o Judgment sampling calls for special efforts to locate and gain access to the individuals who do have the requisite information.
• Quota sampling
  o This method ensures that certain groups are adequately represented in the study through the assignment of a quota.
  o The quota fixed for each subgroup is based on the total numbers of each group in the population.
  o Considered as a form of proportionate stratified sampling, in which a predetermined proportion of people are sampled from different groups, but on a convenience basis.

SAMPLE SIZE

1. Introduction:
• Question:
  o How large should my sample be?
• Answer:
  o It depends…
    - …large enough to be an accurate representation of the population.
    - …large enough to achieve statistically significant results.

2. Determining sample size:
• What is the sample size that would be required to make reasonably precise generalizations with confidence?
• A reliable and valid sample should enable us to generalize the findings from the sample to the population under investigation.
• The sample statistics (sample findings) should be reliable estimates and reflect the population parameters (actual findings) as closely as possible, within a narrow margin of error.
• Precision:
  o Precision refers to how close our estimate is to the true population characteristic.
  o Normally, the greater the precision required, the larger is the sample size needed.
• Confidence:
  o Confidence denotes how certain we are that our estimate will really hold true for the population.
  o Confidence reflects the level of certainty with which we can state that our estimates of the population parameters, based on our sample statistics, hold true.
  o Level of confidence can range from 0 to 100%.
  o A level of confidence of 95% is conventionally acceptable.
• Sample size is a function of…
  o Variability (heterogeneity) in the population

    - The more variance we find, the bigger the sample size should be
  o Precision or accuracy needed
    - The more precise or accurate we want to be, the bigger the sample size should be
  o Confidence level desired
    - The higher the confidence level we want, the bigger the sample should be
  o Type of sampling plan used
    - Different sampling approaches will require different sample sizes
• Trade-off between confidence and precision (for a given sample size)
  o The higher the precision, the lower will our confidence level be.
  o The higher the confidence level, the lower will our precision level be.
  o That is why, in both cases, we need a bigger sample size to increase the precision and confidence.
  o If there is little variability in the population, a small sample size will be sufficient to obtain a high confidence and precision level.

The term statistically significant (p < .05) means that, if there were truly no effect in the population, findings at least as extreme as those obtained from the sample of people who participated in the study would be expected in fewer than 5 out of 100 such studies.

3. Roscoe proposes the following rules of thumb for determining sample size:
  o Sample sizes larger than 30 and less than 500 are appropriate for most research.
  o Where samples are to be broken into sub-samples, a minimum sample size of 30 for each category is necessary.
  o In multivariate research (including regression analyses) the sample size should be several times (preferably 10 times or more) as large as the number of variables in the study.
  o For simple experimental research with tight experimental controls, successful research is possible with samples as small as 10 to 20 in size.
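The trade-off between confidence and precision can be made concrete with a small sketch. It assumes the usual normal-approximation margin of error for a mean, Z × SD / √n, which is not stated in this form in the notes; the function name, the SD of 10 and the sample sizes are hypothetical.

```python
import math

# Z-scores for common two-sided confidence levels (standard normal values)
Z_BY_CONFIDENCE = {0.90: 1.645, 0.95: 1.96, 0.99: 2.58}


def margin_of_error(sd, n, confidence=0.95):
    """Half-width of the confidence interval for a mean (assumed formula Z * SD / sqrt(n))."""
    return Z_BY_CONFIDENCE[confidence] * sd / math.sqrt(n)


# Same sample (SD = 10, n = 100): higher confidence gives a wider, less precise interval
for level in (0.90, 0.95, 0.99):
    print(level, round(margin_of_error(10, 100, level), 2))

# Quadrupling the sample size halves the margin at the same confidence level
print(round(margin_of_error(10, 400, 0.95), 2))
```

With n fixed, moving from 95% to 99% confidence widens the interval; only a larger sample improves both at once.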

4. Sample size for single proportion:

$n = (Z/\Delta)^2 \, p(1 - p)$

n = sample size
$\Delta$ = precision
Z = Z-score at significance level
p = population proportion

5. Sample size for single mean:

$n = (Z\sigma/\Delta)^2$

n = sample size
$\sigma$ = population standard deviation
$\Delta$ = precision
Z = Z-score at significance level

If a response can be expected from only 80% of the sampled population, the sample size = n/0.8

6. Sample size for two proportions:

$n = (A + B)^2 \, [p_1(1 - p_1) + p_2(1 - p_2)] / (p_1 - p_2)^2$

n = sample size
A = significance level (usually 95%, equals 1.96)
B = power (usually 80%, equals 0.84)
$p_1$ = first proportion
$p_2$ = second proportion

7. Sample size for two means:

$n = (A + B)^2 \cdot 2\sigma^2 / \Delta^2$

n = sample size
$\sigma$ = population standard deviation
$\Delta$ = expected difference of means
A = significance level (usually 95%, equals 1.96)
B = power (usually 80%, equals 0.84)

Table of values for A and B:

Significance level    A
5%                    1.96
1%                    2.58

Power                 B
80%                   0.84
90%                   1.28
95%                   1.64

Power is the probability that the null hypothesis will be correctly rejected, i.e. rejected when there is indeed a real difference or association. It can also be thought of as "100 minus the percentage chance of missing a real effect"; therefore the higher the power, the lower the chance of missing a real effect.
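The four sample size formulas above translate directly into code. The sketch below is a minimal illustration: the function names are hypothetical, results are rounded up to whole subjects, and the default values of Z, A and B are the 95% and 80% values from the table.

```python
import math


def n_single_proportion(p, precision, z=1.96):
    """n = (Z / precision)^2 * p * (1 - p)"""
    return math.ceil((z / precision) ** 2 * p * (1 - p))


def n_single_mean(sd, precision, z=1.96):
    """n = (Z * SD / precision)^2"""
    return math.ceil((z * sd / precision) ** 2)


def n_two_proportions(p1, p2, a=1.96, b=0.84):
    """n = (A + B)^2 * [p1(1 - p1) + p2(1 - p2)] / (p1 - p2)^2"""
    return math.ceil((a + b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)


def n_two_means(sd, difference, a=1.96, b=0.84):
    """n = (A + B)^2 * 2 * SD^2 / difference^2"""
    return math.ceil((a + b) ** 2 * 2 * sd ** 2 / difference ** 2)


def adjust_for_response(n, response_rate=0.8):
    """Inflate n when only a fraction of the sample is expected to respond."""
    return math.ceil(n / response_rate)


# Hypothetical inputs chosen only to exercise each formula
n_prop = n_single_proportion(p=0.5, precision=0.05)
print(n_prop, adjust_for_response(n_prop))
print(n_single_mean(sd=10, precision=2))
print(n_two_proportions(p1=0.5, p2=0.3))
print(n_two_means(sd=10, difference=5))
```

For the two-group formulas the result is read here as the number of subjects needed in each group, which is the usual interpretation of this form of the formula.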

Some definitions:
Sampling error is the difference between a sample statistic and the actual parameter of the population.
Standard error is the standard deviation of a sample statistic (such as the mean) across repeated samples from the same population.
Standard deviation is a measure of how far the individual units of a sample or population deviate from their mean.

OVERVIEW ON MEDICAL STATISTICS

1. Some introduction:
• I'm interested in research…
• I'm forced to do research…
• Whatever the reason may be…

2. What should I do?
• How should I start???

3. What do I know? Be honest!!
• Do I know about research methods?
  o If not, back to basics… go back and read about research methods/approaches
• Do I know about statistical and software applications?
• Do I know how to interpret?
  o OK… I understand methods and approaches
• So… how to proceed?
• Please try to learn medical statistics
• OK… I agree to learn medical statistics
• Tell me how I should go for it (the easiest way)
• Don't make it complicated (statisticians make statistics more difficult)
• Tell me only statistics for non-statisticians

4. Let's make it easily understandable
• Research methods/approaches: leading the way/direction
• Statistical applications: tools/vehicles

5. Application of statistics in medical research
• Why use statistics?
  o Are observed differences in a medical context due to real effects, random variation, or both?

• Modern viewpoint of statistics
  o Aid for making scientific decisions in the face of uncertainty
  o A valuable tool in decision making whenever one is uncertain about the state of nature
• Statistics in medicine
  o Increasingly prevalent in medical practice
    - Hospital utility statistics, auditing, vaccination uptake, incidence/prevalence of AIDS and so on…
• Statistics is about common sense and good design
• Statistics is only the guide to make decisions
• Judgment should be made based on both biological and statistical plausibility
• Concept and applications of statistics in medical sciences
  o Let us discuss briefly
  o People say "stat is boring"
  o Let us make it interesting

6. Classification of statistics
• It consists of two parts
  o Descriptive statistics
    - Concerned with collection, organization, summarization and presentation of data, and enumeration of the frequency of characteristics.
  o Inferential statistics
    - Statistical inference
    - Analytical in nature
    - Consists of a collection of principles or theorems
    - Allows the researcher to generalize characteristics of a "population" from the observed characteristics of a "sample"
• Statistical jargon
  o Population parameter
    - A fixed numerical value which describes a particular characteristic of a population

    - E.g. 1: the mean value in the population of a particular characteristic of interest (mean systolic blood pressure of Australian adults)
    - E.g. 2: the proportion of individuals in the population with a particular characteristic of interest (the proportion of low birth weight babies born in Indonesia)
  o Sample statistics
    - Vary in value from sample to sample
    - Other terms: statistics, summary statistics, point estimate, effect size, point estimate of the effect size
  o The relationship between sample statistics and population parameters is the basis of statistical inference.

7. Statistical inference
• 2 broad categories
  o Hypothesis formulation and testing
  o Estimation
    - Point estimation
    - Interval estimation (confidence interval)

8. Concepts of populations, samples and statistical inference
• Statistical analysis of medical studies is based on the key idea that we make observations on a sample of subjects and then draw inferences about the population of all such subjects from which the sample is drawn.
• If the study sample is not representative of the population we may well be misled, and statistical procedures cannot help.
• But even a well designed study can give only an idea of the answer sought, because of random variation in the sample.
• Thus the result from a single sample is subject to statistical uncertainty, which is strongly related to the size of the sample. (Gardner MJ and Altman DG, 1988)

STUDY DESIGN

1. Think about our research question
• Identify the participants of interest
• What are the outcomes of interest?

2. How do we begin to answer the question?
• Start with the building blocks of any design
  o Participants of investigation
  o Outcomes of investigation
  o Direction of inquiry (prospective or retrospective)
  o Other considerations (e.g. possibility and resources)

3. About the investigation
• Presence of a comparison group
  o Dependent on the objective of the study
  o Generally increases the validity of an observed association
• Exposure (risk) (or intervention) and outcome
  o Must be measured with as little error as possible

4. Overview of epidemiologic studies: design strategies
• Descriptive
  o Population (prevalence, correlational studies)
  o Individual (case report, case series, cross-sectional studies, comparative studies with historical controls)
• Analytical
  o Observational studies: cohort studies, case-control studies
  o Intervention studies (experimental, RCT)

5. Case report
• Strengths
  o Hypothesis (question) generation
  o Clinical observation
• Weaknesses
  o May be a one-off
  o Nothing to compare with

6. Case series
• Strengths
  o Strengthens the hypothesis
  o Able to establish temporal relationship
• Weakness
  o Nothing to compare with

7. Comparative studies with historical controls
• Strengths
  o Like two case series
  o Have something to compare to
• Weaknesses
  o May be other differences between groups
  o Relies on recorded information being accurate

8. Randomized Controlled Trial
• 'Gold standard' test of treatment
• Selection of groups entirely random
• Control group identical to treatment group at start except for intervention
• Participants/investigators commonly 'blind' to group allocation to reduce bias
• May evaluate good and bad outcomes

• There are a few types of blinding in RCTs
  o Single blind
  o Double blind
  o Triple blind
  o Multiple blind
  o End point blinding
• End point blinding, e.g. the pathologist is not given any information about the study sample slide, so the pathologist does not know whose slide it is and decides based on his/her independent interpretation of the slide.
• There are 2 designs of RCT
  o Parallel RCT
  o Cross-over RCT

Parallel RCT (diagram): Population → Eligible subjects → Randomization → Pre-treatment assessment → Test / Control → Outcome assessment

Cross-over RCT (diagram): Population → Eligible subjects → Randomization → Pre-treatment assessment → Test / Control → Post-treatment assessment → Control / Test → Post-treatment assessment → Outcome assessment

9. Prospective cohort:
• A group of people (cohort) is assembled, none of whom has experienced the outcome of interest

• On entry, people are classified according to characteristics that might be related to outcome
• Other names: longitudinal, prospective, incidence studies.

Cohort study (diagram): Exposed → Disease / No disease; Unexposed → Disease / No disease (direction of inquiry: from exposure to outcome)

• Advantages of cohort studies
  o The only way of establishing incidence directly
  o Follow the same logic as the clinical question: if a person is exposed, do they get the disease?
  o Exposure can be elicited without the bias
  o Can assess the relationship between exposure and many diseases
  o Calculate risk directly: relative risk (RR)
• Strengths
  o Powerful design for defining incidence and investigating potential causes (aetiology questions)
  o Establishes temporal sequences
  o Appropriate for interventions where we can't randomize
  o Investigator has the opportunity to measure important variables completely (not relying on recorded information)
• Weaknesses
  o Expensive and inefficient for rare outcomes: needs more patients
  o May be other differences between groups

10. Case control studies
• Analytic study design

  o Looking back in nature
  o We were not there to measure risk directly
  o Associate outcome (disease) with prior exposure
• Calculate an indirect estimate of risk: odds ratio (OR)
• Compare the frequency of a risk factor in a group of cases and a group of controls
• There must be a comparison group that does not have the disease
• There must be enough people in the study so that chance does not play a large part in the observed results
• Groups must be comparable except for the factor of interest

Case-control study (diagram): Cases → Exposed / Unexposed; Controls → Exposed / Unexposed (direction of inquiry: from outcome back to exposure)

• Advantages of case-control studies
  o No need to wait for a long time for disease to occur (causal or prognostic factors)
  o Most important method used to study rare diseases
  o Best design for diseases with a long latent period
  o Can evaluate multiple potential exposures
• Strengths
  o Very efficient design for rare outcomes
• Weaknesses
  o Does not allow for the examination of incidence or risk
    - Cannot directly calculate incidence: the OR is an indirect estimate of risk.
  o Increased susceptibility to bias in measurement of exposure
    - Exposure and disease occurred "prior to" the study

  o More potential for biases

11. Cross-sectional study
• Distinguishing features
  o Observe at one particular time or over a period
  o Exposure and outcomes measured at the same time
  o Information obtained from the subjects only once

Cross-sectional study (diagram): Population → Samples → Observation → Exposed / Unexposed and Disease / No disease

• Categories of cross-sectional study
  o Descriptive
    - Prevalence studies
    - A point prevalence: disease occurrence at a particular time, e.g. the point prevalence of upper respiratory tract infection on 1st of July 2005
    - A period prevalence: disease occurrence over a particular period of time, e.g. the ten-year period prevalence (1996-2005) of cancer of the breast in Malaysia.
  o Analytical
    - Analytical studies are valid only when the current values of the exposure are extremely stable over time
    - Two types

      • Classical cross-sectional
      • Comparative cross-sectional study
        o A comparative way of conducting a cross-sectional study
        o Samples are drawn from two or more defined different populations
        o Measure exposure and outcome factors
        o Investigate the association between exposure and outcome
  o Strengths of cross-sectional studies
    - Very quick and inexpensive to implement
    - Useful for determining prevalence
    - Appropriate for diagnostic test validity
  o Weaknesses
    - Difficulty in establishing links of causal effect (temporal relationship)
    - Impractical for rare outcomes
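The two risk measures named in the cohort and case-control sections, relative risk (RR) and odds ratio (OR), can be computed from a 2×2 table. The sketch below assumes the conventional cell layout (a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease); the layout, the function names and the counts are illustrative and not taken from the notes.

```python
def relative_risk(a, b, c, d):
    """Cohort study: incidence in the exposed divided by incidence in the unexposed."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Case-control study: odds of exposure among cases over odds among controls."""
    return (a * d) / (b * c)


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed develop the disease
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(round(rr, 2), round(odds, 2))
```

RR needs the incidence in each exposure group, so it is only available when a cohort is followed forward; the OR can be computed from case-control data because it depends only on the cross-product of the cells.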

DESCRIPTIVE STATISTICS

1. Definition:
• Statistics
  o A field of study concerned with the collection, organization and summarization of data.
• Statistical methods
  o A scientific technique employed for collection, organization, presentation, analysis and interpretation of data.
• Biostatistics
  o Statistics applied to the biological field and medicine

2. Uses of statistical methods
• To collect data in the best possible way
  o Designing forms
  o Organizing
  o Conducting surveys
• To describe the characteristics of a group or a situation
  o Data summary
  o Data presentation
• To analyse data and to draw conclusions
  o Scientific, logical
  o Decision making

3. Classification of statistics
• Descriptive statistics
  o Concerned with collection, organization, summarization and presentation of data, and enumeration of the frequency of characteristics.
  o Describe characteristics of the observed data
    - Type of variable
    - Summary statistics
    - Distribution
    - Graphical presentation
• Inferential statistics
  o Analytical in nature

  o Involve hypothesis testing and confidence intervals
  o Allow the researcher to infer/generalize the characteristics of the sample (statistic) to the population (parameter)

4. Terms:
• Population
  o Full set of individuals
  o Collection of items, objects, things, people
  o Parameter: descriptive measure from population data
• Sample
  o Subset of the population
  o Selected to represent the population by a sampling technique
  o Statistic: descriptive measure from sample data
• Variable
  o Any characteristic of an event/object/person
  o The characteristic being observed/measured
  o E.g. age, sex, race, height, weight, etc.
• Data
  o The raw or original measurements
  o Values taken by the characteristics
  o E.g. female, Malay, 155 cm

5. Classification of variables

• Discrete
  o Characterized by gaps or interruptions in the values
  o Can assume only whole-number values
  o Mainly counts
  o E.g. no. of students, no. of teeth extracted
• Continuous
  o No gaps or interruptions
  o Can take any value within a specified interval
  o Mainly measurements
  o E.g. age, height, weight, BP, etc.
• Nominal
  o Unordered categories
  o No implied order among the categories
  o E.g. sex, race, medical diagnosis, etc.
• Ordinal
  o Ordered categories
  o Ranked according to some criterion
  o E.g. BP – high, normal, low

6. Categorical variables
• Data presentation
  o Statistics
    - Frequency
    - Percentage (%)
  o Graphical
    - Pie chart
    - Bar chart

7. Numerical variables
• Measures of central tendency
  o A measure of centrality
  o Mean
    - Arithmetic average
    - Obtained by adding all the values in a population/sample and dividing by the number of values added
    - Affected by extreme values

  o Median
    - The middle value of the data ordered from the lowest to the highest
    - Arrange all values in order: if n is an odd number, the median is the middle observation; if n is an even number, the median is the mean of the 2 middle observations
    - The 50th percentile of a set of observations
    - Useful for data with a non-normal distribution or skewed data
    - Less sensitive to extreme values than the mean
    - Reported as Median (IQR)
  o Mode
    - The most frequent observation
    - Point of maximum concentration
• Measures of dispersion/variability
  o Range = largest value – smallest value
    - Difference between the largest and smallest values in a set of observations
    - Gives an idea about the variability of the data
    - Simplest to compute
    - Sensitive to outliers
    - Least useful
    - R = Xmax – Xmin
  o Variance = s²
    - Sum of squared deviations of the observations from the mean, divided by the number of degrees of freedom
    - Average squared deviation of the observations from the sample mean
    - Measures the amount of variability or spread about the mean of a sample
    - s² = Σ(xi – x̄)² / (n – 1)

  o Standard deviation (SD)
    - The square root of the variance
    - The root mean square of the distances (differences) from the sample mean
    - A better measure of the variability of a set of data
    - A smaller SD indicates that values lie closer to the mean
    - Reported as Mean (SD)
    - s = √[Σ(xi – x̄)² / (n – 1)]
  o Interquartile range (IQR)
    - Q3 – Q1
    - Range between the 25th and 75th percentiles
    - Used along with the median
    - Not affected by outliers
  o Percentiles = 25th, 50th, 75th, 90th, 95th

8. Normal distribution
• Characteristics
  o Bell-shaped appearance
  o Symmetrical about the mean
  o Mean = median = mode
  o Total area under the curve = 1
  o The curve never touches the x-axis
  o SD usually less than 30% of the mean value
• Approximately
  o 68% of values lie within 1 SD of the mean

  o 95% of values lie within 2 SD of the mean
  o 99.7% of values lie within 3 SD of the mean
• Reported as Mean (SD)

9. Data presentation (numerical data)
• Statistics
  o Mean (SD)
  o Median (IQR)
• Graphical
  o Histogram
    - Frequency distribution of quantitative (continuous) data
    - Bars represent the frequency for each class interval
    - No spaces between bars
    - May have equal or unequal class intervals
  o Box plot

[Figure: histogram]

10. Summary
• Categorical data
  o Statistics
    - Frequency (%)
  o Graphs
    - Bar chart
    - Pie chart
• Numerical data
  o Statistics
    - Mean (SD)
    - Median (IQR)
  o Graphs
    - Histogram
    - Box plot
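The summary statistics recommended for numerical data can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the heights are hypothetical values, and the quartiles use the module's default ("exclusive") method, so other software may report a slightly different IQR.

```python
import statistics

def summarize(values):
    """Return the summary statistics used in these notes for numerical data."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),      # sample SD: divides by n - 1
        "median": statistics.median(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }

# Hypothetical heights in cm (illustration only, not data from the notes)
heights = [150, 152, 155, 155, 158, 160, 163, 165, 170]
summary = summarize(heights)
print(f"Mean (SD): {summary['mean']:.1f} ({summary['sd']:.1f})")
print(f"Median (IQR): {summary['median']} ({summary['iqr']})")
```

Mean (SD) suits roughly symmetrical data; Median (IQR) is the safer pair when the histogram is skewed.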

HYPOTHESIS FORMULATION AND TESTING

1. Hypothesis:
• A statement about one or more populations
• Research question
  o Statement
  o Research hypothesis
• Postulates the existence of:
  o A difference between groups
  o An association among factors
• Usually derived from a hunch, an educated guess based on published results or preliminary observations.
• There are 2 types:
  o Null hypothesis (H0)
    - Hypothesis of no difference
    - The hypothesis to be tested
  o Alternative hypothesis (HA)
    - The hypothesis that postulates that there is a treatment effect, an association between factors or a difference between groups.

2. Hypothesis testing:
• Aids the researcher in reaching a decision concerning a population by examining a sample.
• Observed differences or associations may have occurred by chance.
• Inferential statistics – estimating the probability that a given outcome is due to chance
• If the sample data provide sufficient evidence to discredit H0, reject H0 in favor of HA.
• H0: the proportion of patients with disease who die after treatment with the new drug is not different from the proportion of similar patients who die after treatment with placebo.

• HA: the proportion of patients with disease who die after treatment with the new drug is lower than the proportion of similar patients who die after treatment with placebo.

Statistical decision based on a sample, compared with the truth in the population:

Statistical decision | H0 true in the population | H0 false in the population
Do not reject H0 | Correct (confidence level), 1 – α: the level of certainty that our statistical data hold true | Type II error, beta (β): probability of wrongly not rejecting H0 when H0 is false
Reject H0 | Type I error (level of significance), alpha (α): probability of wrongly rejecting H0 when H0 is true | Correct (power of study), 1 – β: probability that H0 is correctly rejected

• Type I error (α)
  o The probability of wrongly rejecting the null hypothesis when the null hypothesis is true.
• Type II error (β)
  o The probability of wrongly not rejecting the null hypothesis when the null hypothesis is false.
• The test statistic
  o A value with a known distribution when the null hypothesis is true.
• Normal distribution (refer to the descriptive statistics notes)
• Level of significance (α)
  o The null hypothesis is rejected if the probability of obtaining a value as extreme as or more extreme than that observed in the sample is small when the null hypothesis is true.
  o “Small” is usually taken to be less than or equal to 5%
  o If a 2-tail test is used, α is divided by 2 (α/2 in each tail)

• The p value
  o The probability of obtaining a value as extreme as or more extreme than that observed in the sample, given that the null hypothesis is true
  o The smallest value of α for which the null hypothesis can be rejected
  o The p value is compared with the predetermined significance level (usually 0.05) to decide whether the null hypothesis should be rejected
  o If the p value is less than α, reject H0.
  o If the p value is greater than α, do not reject H0.
  o E.g. significance level α = 0.05 (2-tail test taken, so α/2 = 0.05/2 = 0.025 in each tail); p value = 0.25. Conclusion: do not reject H0.

3. Steps in hypothesis testing
• Step 1
  o Generate the null hypothesis and the alternative hypothesis
    - H0: ??
    - HA: ??
  o What are the characteristics of interest?
    - E.g. mean, proportion
  o 1-tail (one-sided) or 2-tail (two-sided)?
    - E.g. 1-tail research hypothesis: the proportion of patients with disease who die after treatment with the new drug is lower than the proportion of similar patients who die after treatment with placebo
    - E.g. 2-tail research hypothesis: the mean blood pressure of patients in the new treatment group is different from the mean blood pressure of patients in the old treatment group
  o E.g. research question:
    - Effectiveness of a new antihypertensive drug

    - H0: the mean blood pressure of patients in the new treatment group is not different from the mean blood pressure of patients in the old treatment group (µ1 = µ2)
    - HA: the mean blood pressure of patients in the new treatment group is different from the mean blood pressure of patients in the old treatment group (µ1 ≠ µ2)
  Notes:
  1. µ = mean in the population; x̄ = mean in the sample
  2. σ = standard deviation in the population; SD = standard deviation in the sample
• Step 2:
  o Set the significance level
    - Usually set at 0.05, 0.01 or 0.1
• Step 3:
  o Decide which statistical test to use and check the assumptions of the test
    - Population is approximately normally distributed
    - Data values are obtained by independent random sampling
    - Adequate sample size
  o To decide which statistical test should be used
    - E.g. mean, proportion
  o Assumptions must be adequately met
  o If they are not met, alternative procedures can be used
    - E.g. a non-parametric test would be used when the data are seriously non-normal
• Step 4:
  o Compute the test statistic and the associated p value
    - Calculate the appropriate test statistic
• Step 5:
  o Interpretation
    - Compare the p value with the level of significance
    - Decide whether or not to reject the null hypothesis
    - p value < α – reject the null hypothesis

    - p value > α – do not reject the null hypothesis
• Step 6:
  o Draw conclusions
    - Conclude accordingly, based on rejecting or not rejecting the null hypothesis
  o Decision rule
    - Rejection region: reject the null hypothesis if the value of the test statistic computed from the sample is one of the values in the rejection region
    - Acceptance (non-rejection) region: do not reject the null hypothesis if the computed value falls in the acceptance region
  o E.g. conclusion:
    - The mean blood pressure of patients in the new treatment group is different from the mean blood pressure of patients in the old treatment group
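Steps 4 to 6 reduce to computing a p value and comparing it with α. The sketch below assumes a test statistic that is standard normal under H0 (a z test, which these notes do not spell out); the function names are hypothetical. Note that a two-sided p value is compared with α itself, which is equivalent to comparing each tail area with α/2.

```python
from statistics import NormalDist

def two_sided_p_from_z(z):
    """Two-sided p value for a statistic that is standard normal under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Steps 5 and 6: compare the p value with the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"

print(decide(0.25))                      # p value from the example in the notes
print(decide(two_sided_p_from_z(2.5)))   # hypothetical z statistic
```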

CONFIDENCE INTERVAL

1. Confidence Interval
• Standard Deviation (SD) vs. Standard Error (SE)

  [Figure: a population with mean (µ) and SD (σ), and samples drawn from it, each with mean (x̄) and SD (s); the variation between samples is the standard error]

  o SD – a measure of the variability of individual observations
  o SE – a measure of the variability of summary statistics
    - E.g. variability of the sample mean or sample proportion
  o SE (SEM) – a special type of standard deviation (the standard deviation of a sample statistic); it depends on
    - Standard deviation
    - Sample size

2. Relationship between confidence interval and hypothesis test

  [Figure: 0.95 confidence interval]

  o The sample mean varies from sample to sample (as measured by the SE)

  [Figure: how close is the sample estimate to the population mean (µ) and SD (σ)?]

  o Sample-to-sample variation of the statistic (sample statistic)

  [Figure: confidence interval running from a lower limit to an upper limit, within which the population parameter is likely to fall]

3. General comments on confidence intervals
• A measure of an estimate of a population parameter (a measure of the precision of a sample statistic)
• Confidence interval = estimate ± k × (standard error)
• 90% CI, 95% CI, 99% CI
  o 95% CI interpretation – 95% certain that the population parameter lies within its limits.
• The width of the CI depends on
  o SE
  o Sample size (larger sample → narrower CI → more precise estimate)
  o Variation (more variation → wider CI → less precise estimate)
• A confidence interval can be calculated for:
  o Mean
  o Relative risk
  o Odds ratio
  o Hazard ratio
  o Correlation coefficient
  o Regression coefficient
  o Etc.
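The formula "estimate ± k × (standard error)" can be sketched for a sample mean as follows. This is a minimal illustration under stated assumptions: the blood pressure readings are hypothetical, and k is taken from the normal distribution (about 1.96 for 95%); for small samples a t-based multiplier would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, level=0.95):
    """Return (lower, upper) limits: estimate ± k × SE, with a normal-based k."""
    n = len(values)
    estimate = mean(values)
    se = stdev(values) / sqrt(n)                  # standard error of the mean
    k = NormalDist().inv_cdf(0.5 + level / 2)     # about 1.96 when level = 0.95
    return estimate - k * se, estimate + k * se

# Hypothetical systolic BP readings (mmHg), for illustration only
sbp = [118, 122, 125, 130, 131, 135, 138, 140, 142, 149]
low95, high95 = mean_confidence_interval(sbp, 0.95)
low99, high99 = mean_confidence_interval(sbp, 0.99)
print(f"95% CI: {low95:.1f} to {high95:.1f}")
print(f"99% CI: {low99:.1f} to {high99:.1f}")
```

The 99% interval is wider than the 95% interval because a larger k is needed for the higher confidence level.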

EXPLORATORY DATA ANALYSIS (NUMERICAL DATA)

1. Hypothesis tests for:
• Single mean
• Difference between two means for independent samples
• Difference between two means for dependent (or paired) samples

2. Single mean:
• Step 1: Null and alternative hypotheses
  o H0: The mean serum amylase level in the population from which the sample was drawn is 120 units/100 ml.
  o HA: The mean serum amylase level in the population from which the sample was drawn is different from 120 units/100 ml. (two-sided test)
• Step 2: Level of significance
  o Alpha = 0.05 (alpha/2 = 0.025 in each tail)
• Step 3: Check the assumptions
  o Population is approximately normally distributed
  o Random sampling
  o Independent observations
• Step 4: Statistical test (one-sample t test)
  o t = (x̄ – µ0) / (s/√n)
  o where
    - x̄ = sample mean
    - s = sample standard deviation
    - n = sample size
    - µ0 = the hypothesized mean
    - the t statistic has n – 1 degrees of freedom
• Step 5: Interpretation
  o p-value < 0.025 in one tail (equivalently, two-sided p-value < 0.05)
  o Reject H0
• Step 6: Conclusion

  o The population mean serum amylase is statistically significantly different from 120 units/100 ml.

3. Difference between two means for independent samples
• Step 1: Null and alternative hypotheses
  o H0: The mean serum amylase levels in hospitalized and healthy subjects are the same (µ1 = µ2)
  o HA: The mean serum amylase levels in hospitalized and healthy subjects are different (µ1 ≠ µ2) (two-tailed)
• Step 2: Level of significance
  o Alpha = 0.05 (alpha/2 = 0.025 in each tail)
• Step 3: Check the assumptions
  o The two populations are normally distributed
  o The two populations have equal variance (Levene’s test)
  o Both are independent samples
  o Random samples
• Step 4: Statistical test
  o Name of the test = independent t-test
  o t statistic = (x̄1 – x̄2) / √[s²p (1/n1 + 1/n2)]
    where s²p = [(n1 – 1)s1² + (n2 – 1)s2²] / (n1 + n2 – 2)
    - x̄1, x̄2 = sample means
    - s1, s2 = sample standard deviations
    - n1, n2 = sample sizes
    - the t statistic has n1 + n2 – 2 degrees of freedom (df)
  o Report:
    - Degrees of freedom
    - p-value
    - 95% confidence interval (lower and upper limits)
• Step 5: Interpretation
  o p-value < 0.025 in one tail (equivalently, two-sided p-value < 0.05)
  o Reject the null hypothesis
• Step 6: Conclusion

  o At the 5% level of significance, the mean serum amylase levels are different in healthy and hospitalized subjects.

4. Difference between two means for dependent (or paired) samples
• Research question: the investigators wanted to determine whether treatment with aminophylline altered the average number of apneic episodes per hour
• Step 1: Null and alternative hypotheses
  o H0: there is no difference in the average number of apneic episodes before and after aminophylline (difference = zero)
  o HA: the average number of apneic episodes before and after aminophylline is different (difference not equal to zero) (two-tailed)
• Step 2: Level of significance
  o Alpha = 0.05 (alpha/2 = 0.025 in each tail)
• Step 3: Check the assumptions
  o The population of differences is normally distributed
  o The two samples are dependent (paired)
  o Random sampling
• Step 4: Statistical test (paired t test)
  o t = d̄ / (sd/√n)
  o where
    - d̄ = mean of the differences
    - sd = standard deviation of the differences
    - n = number of pairs
    - the t statistic has n – 1 degrees of freedom
• Step 5: Interpretation
  o p-value < 0.025 in one tail (equivalently, two-sided p-value < 0.05)
  o Reject the null hypothesis
• Step 6: Conclusion
  o At the 5% level of significance, the average numbers of apneic episodes before and after aminophylline were different.
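The three t statistics in this chapter can be written directly from their formulas. The sketch below uses only the standard library, so it returns the t statistic and its degrees of freedom but not the p value (that needs a t-distribution table or a statistics package). Function names and all data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

def one_sample_t(x, mu0):
    """t = (x̄ - µ0) / (s / √n), with n - 1 degrees of freedom."""
    n = len(x)
    return (mean(x) - mu0) / (stdev(x) / sqrt(n)), n - 1

def independent_t(x1, x2):
    """Pooled-variance t for two independent samples (assumes equal variances)."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def paired_t(before, after):
    """t = d̄ / (s_d / √n), computed on the within-pair differences."""
    d = [b - a for b, a in zip(before, after)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1

# Hypothetical apneic episodes per hour before and after treatment
before = [5, 6, 7, 9]
after = [3, 4, 6, 6]
t, df = paired_t(before, after)
print(f"paired t = {t:.3f} with {df} degrees of freedom")
```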

UNIVARIATE ANALYSIS OF NUMERICAL DATA

1. Univariate analysis explores each variable in a data set separately:
• It looks at the range of values
• The central tendency of the values
• It describes the pattern of response to the variable
• It describes each variable on its own

2. Univariate analysis
• Categorical variable (e.g. housing)
• Numerical variable (e.g. age)

3. Univariate analysis – Numeric
• Statistical analysis
  o Point estimation
    - Count, min, max, average, median, mode
  o Dispersion
    - Range, standard deviation, variance, coefficient of variation
    - Skewness, kurtosis
  o Missing values
  o Outliers
  o Binning
• Visualization
  o Histogram, box plot, etc.

Univariate Analysis – Numeric (Age)

Count     900     Average    35.25    St Dev                     11.20
Min       19      Median     33       Variance                   125.37
Max       75      Mode       27       Coefficient of variation   32%
Range     55      Skewness   1.09
Missing   0       Kurtosis   0.88

Univariate Analysis – Challenges

Variable      Challenges
Categorical   Missing values, invalid values, numerization
Numeric       Missing values, outliers, binning

4. Missing data
• Data entry error
• Data processing error
• Certain data may not be available at the time of entry
• How to handle missing data
  o Fill in the missing values manually
  o Ignore the records with missing data
  o Fill them in automatically
    - A global constant (e.g. “?”)
    - The variable mean

5. Outliers
• Data points inconsistent with the majority of the data
• Different outliers
  o Valid: CEO’s salary
  o Noisy: one’s age = 200, widely deviated points
• Removal methods
  o Box plot
  o Clustering
  o Curve-fitting

6. Binning
• Binning is a process of transforming continuous variables into categorical counterparts
• Binning methods
  o Equal-width

  o Equal-frequency
  o Entropy-based methods
• Variable values (e.g. age)
  o 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equal-width binning
  o Bin 1: 0, 4              [-, 10] bin
  o Bin 2: 12, 16, 16, 18    [10, 20] bin
  o Bin 3: 24, 26, 28        [20, +] bin
• Equal-frequency binning
  o Bin 1: 0, 4, 12          [-, 14] bin
  o Bin 2: 16, 16, 18        [14, 21] bin
  o Bin 3: 24, 26, 28        [21, +] bin

7. Numerization
• Numerization is the process of transforming categorical variables into numerical counterparts.
• Numerization methods
  o Binary method
  o Ordinal method
• Variable values (e.g. housing)
  o For free, own, rent
• Binary method
  o For free: 1, 0, 0
  o Own: 0, 1, 0
  o Rent: 0, 0, 1
• Ordinal method
  o Own: 5
  o For free: 3
  o Rent: 1

8. Quantification
• Introduction
  o To conduct quantitative analysis, responses to open-ended questions in survey research and the raw data collected using qualitative methods must be coded numerically.

  o Most responses to survey research questions are already recorded in numerical format
    - In mailed and face-to-face surveys, responses are keypunched into a data file.
    - In telephone and internet surveys, responses are automatically recorded in numerical format.
• Developing code categories
  o Coding qualitative data can use an existing scheme or one developed by examining the data.
  o Coding qualitative data into numerical categories can sometimes be a straightforward process
    - Coding occupation, for example, can rely upon numerical categories defined by the Bureau of the Census.
  o Coding most forms of qualitative data, however, requires much effort
  o This coding typically requires an iterative procedure of trial and error
  o Consider, for example, coding responses to the question, “What is the biggest problem in attending college today?”
  o The researcher must develop a set of codes that are
    - Exhaustive of the full range of responses
    - Mutually exclusive (mostly) of one another
  o In coding responses to the question, “What is the biggest problem in attending college today?” the researcher might begin, for example, with a list of 5 categories, then realize that 8 would be better, then realize that it would be better to combine and use a total of 7 categories
  o Each time the researcher makes a change in the coding scheme, it is necessary to restart the coding process so that all responses are coded using the same scheme

9. Distribution
• Data analysis begins by examining distributions

• One might begin, for example, by examining the distribution of responses to a question about formal education, where responses are recorded within six categories
• A frequency distribution will show the number and percent of responses in each category of a variable

10. Central tendency
• A common measure of central tendency is the average or mean of the responses
• The median is the value of the middle case when all responses are rank-ordered
• The mode is the most common response
• When data are highly skewed, meaning heavily balanced toward one end of the distribution, the median or mode might better represent the most common or centered response.
• Consider this distribution of respondent ages:
  o 18, 19, 19, 19, 20, 20, 21, 22, 85
• The mean equals 27. But this number does not adequately represent the common respondent because the one person who is 85 skews the distribution toward the high end.
• The median equals 20
• This measure of central tendency gives a more accurate portrayal of the middle of the distribution

11. Dispersion
• Dispersion refers to the way the values are distributed around some central value, typically the mean.
• The range is the distance separating the lowest and highest values (e.g. the range of the ages listed previously equals 18–85)
• The standard deviation is an index of the amount of variability in a set of data
• The standard deviation represents dispersion with respect to the normal (bell-shaped) curve

• Assuming a set of numbers is normally distributed, each standard deviation equals a certain distance from the mean.
• Each standard deviation (+1, +2, etc.) is the same distance from the next on the bell-shaped curve, but represents a declining percentage of responses because of the shape of the curve.
• For example, the first standard deviation accounts for 34.1% of the values below the mean and 34.1% of the values above it
  o The figure 34.1% is derived from probability theory and the shape of the curve.
• Thus approximately 68% of all responses fall within one standard deviation of the mean
• The second standard deviation accounts for the next 13.6% of the responses on each side of the mean (27.2% of all responses), and so on.
• Dispersion measures
  o Spread around the mean
    - Variance – too abstract, a step towards the standard deviation
    - Standard deviation (from the mean) – more intuitive
  o Standard deviation
    - Average distance between the mean and each value in the data set
    - Translates the variance into the same scale as the mean and all the values
    - High values are generally bad
• If the responses are approximately normally distributed and the range of responses is low (meaning that most responses fall close to the mean), then the standard deviation will be small
  o The standard deviation of professional golfers’ scores on a golf course will be low
  o The standard deviation of amateur golfers’ scores on a golf course will be high

13. Continuous and Discrete Variables
• Continuous variables have responses that form a steady progression (e.g. age, income)
• Discrete (i.e. categorical) variables have responses that are considered to be separate from one another (e.g. sex, religion)

• Sometimes, it is a matter of debate within the community of scholars whether a measured variable is continuous or discrete
• This issue is important because different statistical procedures are appropriate for continuous-level and discrete-level data, especially as related to the measurement of the dependent variable.
• Example: suppose one measures amount of formal education within five categories (less than hs, hs, 2 years vocational/college, college, post college)
• Is this measure continuous or discrete?
• In practice, five categories seem to be the cut-off point for considering a variable as continuous
• Using a seven-point response scale will give the researcher a greater chance of deeming a variable to be continuous.

14. Subgroup comparison
• Collapsing response categories
  o Sometimes the researcher might want to analyse a variable by using fewer response categories than were used to measure it
  o In these instances, the researcher might want to collapse one or more categories into a single category
  o The researcher might want to collapse categories to simplify the presentation of the results or because few observations exist within some categories
• Collapsing response example

Response                      Frequency
Strongly disagree             2
Disagree                      22
Neither agree nor disagree    45
Agree                         31
Strongly agree                1

One might want to collapse the extreme responses and work with just three categories:

Response                      Frequency
Disagree                      24
Neither agree nor disagree    45
Agree                         32

• Handling “Don’t know”
  o When asking about knowledge of factual information (“Does your teenager drink alcohol?”) or opinion on a topic the subject might not know much about (“Do school officials do enough to discourage teenagers from drinking alcohol?”), it is wise to include a “don’t know” category as a possible response.
  o Analyzing “don’t know” responses, however, can be a difficult task
  o The research-on-research literature regarding this issue is complex and without clear-cut guidelines for decision making
  o The decisions about whether to use “don’t know” response categories and how to code and analyse them tend to be idiosyncratic to the research and the researcher.
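Two recoding steps from this chapter, binning (section 6) and collapsing response categories (section 14), can be sketched in a few lines. The function names are hypothetical, and the equal-width version assumes non-negative values with bins starting at 0 and an open-ended last bin, which matches the age example but is not a general-purpose implementation.

```python
from collections import Counter

def equal_width_bins(values, width, k):
    """Place values into k bins of the given width; the last bin is open-ended."""
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        index = min(int(v // width), k - 1)
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding (nearly) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), k)
    bins, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]        # values from the binning example
print(equal_width_bins(ages, width=10, k=3))
print(equal_frequency_bins(ages, k=3))

# Collapsing the five response categories into three
responses = {"Strongly disagree": 2, "Disagree": 22,
             "Neither agree nor disagree": 45, "Agree": 31, "Strongly agree": 1}
collapse_map = {"Strongly disagree": "Disagree", "Disagree": "Disagree",
                "Neither agree nor disagree": "Neither agree nor disagree",
                "Agree": "Agree", "Strongly agree": "Agree"}
collapsed = Counter()
for category, frequency in responses.items():
    collapsed[collapse_map[category]] += frequency
print(dict(collapsed))
```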

UNIVARIATE ANALYSIS OF CATEGORICAL DATA

1. Categorical data analysis
• One proportion
  o Chi-square goodness of fit
• Two proportions (independent samples)
  o Pearson chi-square / Fisher exact test
• Dependent samples (matched or paired)
  o McNemar’s test
• Stratified sampling to control confounder effect
  o Mantel-Haenszel test

2. Two proportions (independent samples) – Pearson chi-square & Fisher exact test
• To test the association between two categorical variables
• E.g. IHD vs. gender
  o Is gender associated with IHD status?
• Result of test
  o Not significant → no association
  o Significant → an association
• Step 1: state the hypotheses
  o H0: There is no association between gender and IHD
  o HA: There is an association between gender and IHD
• Step 2: set the significance level
  o How much error are we willing to accept in estimating the proportion in the population?
  o Usually α = 0.05
• Step 3: check the assumptions
  o The two samples are independent
  o Both variables are categorical
  o If more than 20% of cells have an expected count of less than 5, use the Fisher exact test; if fewer than 20%, use the Pearson chi-square test
  o Expected count = (row total × column total) / grand total

• Step 4: statistical test
  o Chi-square test or
  o Fisher exact test
  o χ² = Σ (O – E)² / E
  o Chi-square value:
    - As the difference between the observed and expected counts increases, the chi-square value increases, the p-value decreases and the result is more likely to be significant
• Step 5: Interpretation
  o p value = 0.123 → do not reject H0
  o There is no significant association between gender and IHD status.
• Step 6: conclusion
  o There is no significant association between gender and IHD status using the Pearson chi-square test (p-value = 0.123)
• Data presentation

Table 1: Association between IHD and gender

Variable           IHD: Yes n (%)    IHD: No n (%)    χ² stat    p-value
Gender   Male      15 (60)           10 (40)          2.381      0.123*
         Female    20 (80)           5 (20)
* Pearson chi-square test

3. Two proportions (dependent samples) – McNemar’s test
• Dependent samples (matched or paired samples)
• χ² = (b – c)² / (b + c)
• Discordant pair
  o A pair with different outcomes
  o Used to test the difference in the outcome
• Sample of 25 pairs of patients with breast cancer

              RM: Live    RM: Die
SM: Live      a**         b*
SM: Die       c*          d**
* Discordant   ** Concordant

  o Matched for age
  o Undergone
    - Simple mastectomy (SM)
    - Radical mastectomy (RM)
  o Difference in 5-year survival proportion between the two groups
• Step 1: state the null and alternative hypotheses
  o H0: there is no association between type of mastectomy and 5-year survival proportion in patients with breast cancer
  o HA: there is an association between type of mastectomy and 5-year survival proportion in patients with breast cancer.
• Step 2: set the significance level
  o α = 0.05
• Step 3: check the assumptions
  o Categorical data
  o Dependent or matched samples
• Step 4: statistical test
  o McNemar’s test
• Step 5: interpretation
  o p-value = 0.021 → reject H0
  o There is a significant association between type of mastectomy and 5-year survival proportion in patients with breast cancer.
• Step 6: conclusion
  o There is a significant association between type of mastectomy and 5-year survival proportion in patients with breast cancer using McNemar’s test (p-value = 0.021)
• Data presentation

Table 2: Association between type of mastectomy and 5-year survival proportion in patients with breast cancer

                      Radical mastectomy
Simple mastectomy     Live n (%)    Die n (%)    p-value
Live                  13 (%)        9 (%)        0.021*
Die                   1 (%)         2 (%)
* McNemar’s test
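Both worked examples in this chapter can be checked numerically. The sketch below computes the Pearson chi-square for the counts in Table 1 and an exact (binomial) McNemar p value for the discordant pairs in Table 2. Two assumptions are made here: that the discordant cells are the 9 and the 1, and that the reported p = 0.021 came from the exact version of the test, which the notes do not state (it is the version that reproduces the figure). Function names are hypothetical.

```python
from math import comb, erfc, sqrt

def chi_square_2x2(table):
    """Pearson chi-square and p value (1 df) for a 2 x 2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of a chi-square with 1 df equals P(|Z| > sqrt(chi2))
    return chi2, erfc(sqrt(chi2 / 2))

def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p value from the discordant pair counts b and c."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

chi2, p_chi2 = chi_square_2x2([[15, 10], [20, 5]])   # Table 1: male, female
p_mcnemar = mcnemar_exact_p(9, 1)                    # Table 2: discordant pairs
print(f"chi-square = {chi2:.3f}, p = {p_chi2:.3f}")
print(f"McNemar exact p = {p_mcnemar:.3f}")
```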

CORRELATION & REGRESSION

1. Relationship between two variables
• Are two variables associated with each other?
• To what degree (strength) are they associated?
• In which direction is the relationship?
  o Positive or negative
• Change in the dependent variable that corresponds to a change in the independent variable
  o Prediction

Correlation
- presence of association
- strength (degree) of association
- direction of association

Regression
- prediction

CORRELATION

1. Is a measure of the relationship between two numerical variables
   E.g. the relationship between height and weight, the relationship between cholesterol and blood pressure.

2. Pattern:
   Elliptical pattern – indicative of normally distributed variables
   Elliptical pattern – the degree of elongation of the ellipse increases with the magnitude of the correlation coefficient.

3. Correlation coefficient (r)
   r for the sample statistic, ρ (rho) for the population parameter.

   [Figure: scatter plots. r = 1 (perfect positive): as X increases, Y increases. r = –1 (perfect negative): as X increases, Y decreases. r = 0: no linear relationship]

   r              Strength
   < 0.25         poor
   0.26 – 0.50    fair
   0.51 – 0.75    good
   0.76 – 1.00    excellent

   r does not imply a cause and effect relationship
   Correlation should be assessed mathematically, not visually.

4. Correlation coefficient:

   Pearson’s correlation coefficient
   - A measure of the degree of straight-line relationship between two numerical variables
   - At least one variable has a normal distribution

   Spearman’s ranked correlation coefficient
   - Correlation coefficient calculated on the ranks of the observations of the two variables
   - Rank (Spearman’s) and Pearson’s correlations are similar when the scatter plot is elliptical
   - They differ when the scatter plot deviates from an elliptical shape

Example: Relationship between height and weight
Step 1: state the null and alternative hypotheses
• H0: There is no correlation between height and weight
• HA: There is a correlation between height and weight (2-tailed)
Step 2: set the significance level

• α = 0.05
Step 3: check the assumptions
• Both variables are numerical
• One of the variables has a normal distribution
  [Figures: histogram; box-and-whisker plot]
Step 4: statistical test
• Pearson correlation (if the assumption is met)
• Spearman’s correlation (if the assumption is not met)
Step 5: Interpretation

Correlations
                                  Height      Weight
Height    Pearson Correlation     1           .878(**)
          Sig. (2-tailed)         .           .000
          N                       100         100
Weight    Pearson Correlation     .878(**)    1
          Sig. (2-tailed)         .000        .
          N                       100         100
** Correlation is significant at the 0.01 level (2-tailed).

• p-value < 0.001 → reject H0
Step 6: conclusion
• There is a significant, positive and excellent correlation between height and weight (r = 0.88, p < 0.001)

Checklist for reporting correlation (Figure 1)
• Correlation coefficient – Pearson’s correlation coefficient / Spearman’s ranked correlation coefficient
• Actual p-value of the correlation coefficient
• Sample size
• Scatter plot
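To make the two coefficients concrete, the sketch below computes Pearson's r from its usual definition (the ratio of the co-deviation of X and Y to the product of their spreads, a standard formula not written out in these notes) and Spearman's coefficient as Pearson's r applied to ranks, as described above. The strength labels follow the table in section 3. Function names and data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

def strength(r):
    """Label |r| using the cut-offs given in the notes."""
    size = abs(r)
    if size <= 0.25:
        return "poor"
    if size <= 0.50:
        return "fair"
    if size <= 0.75:
        return "good"
    return "excellent"

# Hypothetical heights (cm) and weights (kg), for illustration only
height = [150, 155, 160, 165, 170, 175, 180]
weight = [55, 58, 60, 66, 68, 74, 77]
r = pearson_r(height, weight)
print(f"r = {r:.2f} ({strength(r)}), rho = {spearman_rho(height, weight):.2f}")
```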

Figure 1: A scatter plot showing high positive correlation between height and weight (n = 100, Pearson’s r = 0.88, p < 0.001)
[Scatter plot: Weight (55 to 80) on the vertical axis against Height (150 to 180) on the horizontal axis]

SIMPLE LINEAR REGRESSION (SLR)

1. Regression Analysis
• Regression analysis is a statistical tool that utilizes the relation between variables so that one variable can be predicted from the other or others
• Linear regression
  o Simple (one independent variable (factor) and one outcome)
  o Multiple (more than one factor and one outcome)
• Logistic regression (dichotomous dependent variable)

2. Simple Linear Regression
• Examples of research questions
  o Does a relationship exist between oral contraceptive use and the incidence of thromboembolism?
  o What is the relationship of a mother’s weight to her baby’s birth weight?
  o What is the relationship between an animal’s pulse rate and the amount of a particular drug administered?
• “Simple” because there is only one independent variable
• “Linear” means the relationship between the y (dependent/outcome) and x (independent/factor) variables can be represented by a straight line
• Analyses the linear relationship between two quantitative (numerical) variables
• Involves estimating the equation of a straight line that defines the relationship between a dependent variable and an independent variable, using a given data set
• The method involved is called the method of least squares
• We choose a line such that the sum of squares of the vertical distances of all points from the line is minimized (Q = Σ ei²)
• These vertical distances between the y values and their corresponding estimated values on the line are called residuals (ei = yi – ŷi)
• The line thus obtained is called the regression line or the least-squares line of best fit

3. Regression line (least squares line of best fit)
• Yi = β0 + β1Xi + εi
  o Yi is the value of the dependent variable when the value of the independent variable is Xi
  o β0 is the Y-intercept and is constant
  o β1 is the slope of the regression line. It is the change in Yi when Xi is increased by one unit
  o β0 and β1 are called regression coefficients
  o εi is the random error term: normally distributed, independent, with zero mean and constant variance σ²

4. Linear Regression Model
• Relationship

  Yi = β0 + β1Xi + εi

  o Yi: dependent (outcome/response) variable
  o β0: population Y-intercept
  o β1: population slope
  o Xi: independent (factor/explanatory) variable
  o εi: random error

5. True Regression line
• The random error term εi in the regression equation accounts for the scattering of the data points about the regression line
• As the mean of the εi is zero, the mean of Yi (at Xi) is:
  o E(Yi) = β0 + β1Xi
  o The notation E(Yi) means the ‘expected value’ of Yi and represents the mean of Yi
• Note that the mean of Y depends on X, and the relationship is represented by a straight line
• This equation represents the true regression line

6. Least squares estimate
• The true regression line is unknown
• Estimated regression line:
  o Ŷ = β^0 + β^1X (least squares estimate)
  o Ŷ is the estimated mean of Y
  o β^0 is the y-intercept and is constant
    - if X = 0, β^0 is the estimated mean value of Y
  o β^1 is the slope of the regression line. It is the change in Ŷ when X is increased by one unit.

7. Least Squares (LS)
• “Best fit” means the differences between the actual Y values and the predicted Y values are at a minimum:
  o Σ (Yi – Ŷi)² = Σ êi², summed over i = 1, …, n

• LS minimizes the sum of the squared differences (SSE)
8. Interpretation of Coefficient
• Slope ($\hat{\beta}_1$)
  o The change in the estimated mean value of Y when X is increased by 1 unit
  o If $\hat{\beta}_1$ = 0.05, then the estimated mean cholesterol level (Y) changes by 0.05 mmol/dl when the age (X) is increased by 1 year
• Y-intercept ($\hat{\beta}_0$)
  o Average value of Y when X = 0
  o If $\hat{\beta}_0$ = 3.3, then the mean cholesterol level (Y) is expected to be 3.3 when the age (X) is 0 (???)
8. Measures of variation in Regression
• Total variation (Total Sum of Squares (SSTOT))
  o Measures the variation of the observed $Y_i$ around the mean $\bar{Y}$
• Explained variation (Regression Sum of Squares (SSR))
  o Variation due to the relationship between X and Y
• Unexplained variation (Error Sum of Squares (SSE))
  o Variation due to other factors
9. Sum of squares
• Total sum of squares (SSTOT)
  o Measure of the total variation in the dependent variable Y
  o $SSTOT = \sum (Y_i - \bar{Y})^2 = SSR + SSE$
• Regression sum of squares (SSR)
  o Measures the variation ‘explained’ by the regression line
  o $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
• Error sum of squares (SSE)
  o Measures the ‘unexplained’ variation in Y, or the scatter around the regression line
  o $SSE = \sum (Y_i - \hat{Y}_i)^2$

[Figure: Measure of variation. Plot of Y against X with the fitted line $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_i$ and the mean $\bar{Y}$. For an observed point ($X_i$, $Y_i$) the figure marks the total sum of squares $SSTOT = \sum (Y_i - \bar{Y})^2$, the explained sum of squares $SSR = \sum (\hat{Y}_i - \bar{Y})^2$ and the unexplained sum of squares $SSE = \sum (Y_i - \hat{Y}_i)^2$.]
Notes: X, Y and slope:
• Positive slope: Y increases with increase in X
• Negative slope: Y decreases with increase in X
10. Hypothesis Testing:
• For Simple Linear Regression
  o H0: $\beta_1 = 0$ (no linear relationship)
  o HA: $\beta_1 \neq 0$ (there is a linear relationship)
  o Test statistic: t-distribution
  o Rejection rule: reject H0 if the p-value is less than 0.05 (assumed α)
• For Multiple Linear Regression
  o H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$ (no linear relationship)
  o HA: at least one $\beta_j \neq 0$ (there is a linear relationship)
  o Test statistic: F-test from the ANOVA table:
    • F = MSR/MSE
    • MSR = SSR/df(Regression)
    • MSE = SSE/df(Error)

  o Rejection rule: reject H0 if the p-value for the F-test is less than 0.05 (assumed α)
• Assumptions
  o The errors are normally distributed
  o They are independent
  o The mean of the random error term is equal to zero
  o The variance of the random error, $\sigma^2$ (sigma square), is constant
11. How to analyse
• Exploration of the data
  o Descriptive statistics
  o Scatter plot between the two variables
    • Check for distribution, relationship and outliers
• Fit the least squares line (regression line)
  o Using the least squares method
  o It is the best-fitting straight line through the data points in a scatter plot
  o It represents the least squares equation and estimates the constant (a) and slope (b) for α and β: $\hat{Y} = a + bx$
  o It is constructed by using the method of least squares, which minimizes the sum of squared deviations of each point from the regression line
• Evaluation of the model by R² (R square)

Model Summary
Model   R       R²     Adjusted R²   Std. error of the estimate
1       .592a   .350   .338          .9043
a. Predictors: (constant), Time

  o R² = 0.35, meaning that 35% of the total variation in GPA is explained by the study time
  o R² measures the closeness of fit of the sample regression equation to the observed values of Y
  o It ranges from 0 to 1
  o It is called the coefficient of determination

• Evaluation of b
  o Evaluation of β using t-statistics

Coefficients(a)
Model 1      Unstandardized coefficient B   Std. error   t       Sig.   95% CI for B (lower)   95% CI for B (upper)
(constant)   1.461                          .315         4.639   .000   .829                   2.093
Time         .389                           .073         5.342   .000   .243                   .534
a. Dependent variable: GPA

  o H0: $\beta_1 = 0$ (no linear relationship)
  o HA: $\beta_1 \neq 0$ (there is a linear relationship)
  o As the p-value is < 0.001, we reject H0 at the 5% significance level and have sufficient evidence to conclude that there is a linear relationship between study time and GPA
  o A positive β means a direct relationship
  o Estimated least squares (LS) equation
    • GPA = constant + b (study time)
    • GPA = 1.461 + 0.389 (study time)
• Diagnostic checking for assumptions
  o The assumptions:
    • The errors are normally distributed
    • They are independent
    • The mean of the random error term is equal to zero (linearity)
    • The variance of the random error, $\sigma^2$ (sigma square), is constant or equal
  o Model adequacy checks
    • After obtaining the least squares line or fit: is a linear model appropriate? Check R²
    • Investigate the model assumptions
    • Diagnostic procedures are carried out through examination of the residuals (the difference between the observed value Y and the fitted or predicted value $\hat{Y}$ at a given value of X)
    • Normality:

      • Histogram of unstandardized residuals
    • Linearity:
      • Plot of unstandardized residuals against unstandardized predicted values
    • Creating residuals: go to analyse, regression, bivariate, then save the unstandardized residuals and predicted values
    • Let us say the assumptions are met
• Interpretation and conclusion
  o 35% of the variation in GPA is explained by study time
  o There is a significant linear association between GPA and study time
  o For each 1 hour increase in study time, the GPA of a student increases by 0.39
  o We are 95% confident that for each 1 hour change in study time, the GPA increase will lie between 0.24 and 0.53
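The least squares steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure used in the handout: the data values are hypothetical (they are not the GPA data set behind the tables above), and the function name `least_squares_fit` is an assumption made for this example. It estimates the intercept and slope, then checks that SSTOT = SSR + SSE and computes R² = SSR/SSTOT.

```python
# Hypothetical data for illustration only (not the handout's data set):
# weekly study time in hours (x) and GPA (y) for five students.
study_time = [1.0, 2.0, 3.0, 4.0, 5.0]
gpa = [2.0, 2.5, 2.4, 3.3, 3.8]


def least_squares_fit(x, y):
    """Fit Y-hat = b0 + b1 * X by least squares and split the variation in Y."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Slope = sum of products of deviations / sum of squared deviations of x.
    sp = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sp / ss_x
    b0 = y_mean - b1 * x_mean

    fitted = [b0 + b1 * xi for xi in x]

    # Total, explained (regression) and unexplained (error) sums of squares.
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    ss_r = sum((fi - y_mean) ** 2 for fi in fitted)
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

    return {
        "b0": b0,
        "b1": b1,
        "ss_tot": ss_tot,
        "ss_r": ss_r,
        "ss_e": ss_e,
        "r_squared": ss_r / ss_tot,
    }


fit = least_squares_fit(study_time, gpa)
print(f"GPA = {fit['b0']:.3f} + {fit['b1']:.3f} (study time)")
print(f"R squared = {fit['r_squared']:.3f}")
```

With real data the same quantities appear in the Model Summary and Coefficients tables of a statistical package; the sketch only shows where they come from.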

CORRELATION
1. Introduction
• Correlation is used to measure and describe a relationship between two variables
• Correlation measures three characteristics of a relationship
  o The direction of the relationship
    • Positive: when the value of one variable increases, the corresponding value of the related variable also increases
    • Negative: when the value of one variable increases, the corresponding value of the related variable decreases
    • Zero correlation (no correlation): the values of one variable increase or decrease independently of the values of the other variable
  o The form of the relationship
    • When the value of one variable increases, the corresponding value of the related variable increases or decreases up to a certain value, but beyond that value the change may not follow the same trend or there may be no change at all
  o The degree of the relationship
    • It measures how strong the relationship is between the values of the two variables
2. Application of correlation
• Prediction
  o If two variables are positively or negatively related to each other, then by knowing the value of one of these variables it is possible to predict the corresponding unknown value of the other variable
• Validity
  o Validity is the measure of whether a test truly measures what it claims to measure
• Reliability

  o Reliability is the measure of whether the test instrument produces stable, consistent measurements when it is used again and again in the same group of students or people
• Theory verification
  o A theory is a statement that makes a specific prediction about the relationship between two variables
  o This predicted relation can be verified by a correlation test
3. Measures of correlation
• Pearson correlation (Pearson product-moment correlation)
  o The Pearson correlation measures the degree and the direction of the linear relationship and is denoted by the letter r (correlation coefficient)
  o r = (degree to which x and y vary together)/(degree to which x and y vary separately)
  o r = (covariability of x and y)/(variability of x and y separately)
  o $r = \frac{SP}{\sqrt{SS_x \, SS_y}}$
  o SP = sum of products of deviations: $SP = \sum xy - \frac{\sum x \sum y}{n}$
  o $SS_x$ = sum of squared deviations of x: $SS_x = \sum x^2 - \frac{(\sum x)^2}{n}$
  o $SS_y$ = sum of squared deviations of y: $SS_y = \sum y^2 - \frac{(\sum y)^2}{n}$
• Spearman correlation (Spearman rank-order correlation)
  o It is used when the data are of an ordinal variable
  o If they are not, the data must be ranked
  o Rank order the scores separately for each variable, with 1 for the smallest score

Case no.   Score for variable 1   Score for variable 2   Rank for variable 1   Rank for variable 2
1          3                      13                     1                     2
2          5                      14                     3                     3
3          4                      12                     2                     1

4          6                      15                     4                     4
5          7                      16                     5                     5

  o If more than one respondent has the same score, the final rank for those respondents is the average of their ranks

Respondent no.   Score   Rank   Final rank
1                3       2      2.5
2                5       5      5
3                2       1      1
4                3       3      2.5
5                4       4      4

  o The equation for the Spearman calculation
  o $r_s = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$
    • N is the number of pairs (x, y)
    • D is the difference between the ranks of each pair (rank of x minus rank of y)
• After calculating the value of r or $r_s$, it is compared with the critical value in the correlation table to decide whether there is a significant correlation between the variables
  o Calculated value ≥ tabulated value (significant correlation)
  o For the Pearson correlation df = n − 2, for both one-tailed and two-tailed tests
    • df = degrees of freedom
  o Coefficient of determination
    • This is the squared correlation coefficient
    • It measures the percentage of variation shared between the two variables
    • r = 0.40
    • r² = 0.16, i.e. 16%
• Points to be remembered
  o Correlation is not causation
  o Correlation is affected by the range of the data

  o Correlation is affected by outliers
4. Hypothesis tests with the Pearson correlation
• Two-tailed
  o H0: ρ = 0 (no correlation)
  o HA: ρ ≠ 0 (there is correlation)
• One-tailed
  o H0: ρ ≤ 0 (there is no positive correlation)
  o HA: ρ > 0 (there is positive correlation)
• Reporting correlation
  o r = 0.65, n = 30, p-value < 0.01, one tail or two tail
  o r² = coefficient of determination
5. Summary
• Correlation is a statistical test to assess the relation between two variables
• The relation can be positive or negative
• Two methods of testing are the Pearson and Spearman methods
• The test is used in prediction of relationships, testing validity and reliability, and verifying theories
• It can be calculated manually using different formulas or by using a computer statistical package like SPSS
• Correlation does not say anything about a cause and effect relationship
• The correlation coefficient is influenced by outliers and by the range of the data under analysis
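The two correlation formulas from section 3 can be checked with a short Python sketch. It uses the five cases from the ranking table and the tied scores from the respondent table in the handout; the function names (`pearson_r`, `average_ranks`, `spearman_rs`) are assumptions made for this illustration. Note that the shortcut formula for $r_s$ is exact only when there are no tied ranks.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r = SP / sqrt(SSx * SSy), using the computational formulas."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n
    return sp / sqrt(ss_x * ss_y)


def average_ranks(scores):
    """Rank scores with 1 for the smallest; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rs(x, y):
    """Spearman r_s = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), D = difference in ranks."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Scores for the five cases in the ranking table above.
variable_1 = [3, 5, 4, 6, 7]
variable_2 = [13, 14, 12, 15, 16]

print("Pearson r  :", round(pearson_r(variable_1, variable_2), 3))
print("Spearman rs:", round(spearman_rs(variable_1, variable_2), 3))

# Tied scores from the respondent table: the two scores of 3 share rank 2.5.
print("Final ranks:", average_ranks([3, 5, 2, 3, 4]))
```

For these five cases SP = 9 and SSx = SSy = 10, and the ranks differ by one place in only two cases, so both coefficients indicate a strong positive relationship.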

NONPARAMETRIC STATISTICS
1. Nonparametric Statistics (NPS)
• The name nonparametric indicates that no assumption is made about parameters (means, variances)
• Requires very few assumptions; it is distribution free
• Uses the median as a measure of central tendency
• Applied when
  o The data being analysed are ordinal or nominal
  o The data are on an interval or ratio scale but no assumption can be made about the population probability distribution
• Appropriate for small samples that are not normally distributed
• Computationally easier
• Less efficient than parametric counterparts
• Loses information by substituting ranks in place of scales

Parametric test                    Nonparametric test
One-sample t test                  Sign test for one sample
Paired t-test                      1. Sign test for paired samples
                                   2. Wilcoxon Signed-Rank test for paired samples
Two independent sample t-test      Mann-Whitney test (Wilcoxon Rank Sum Test)
One-way ANOVA                      Kruskal-Wallis Test
ANOVA (randomized block design)    Friedman’s Test
Pearson’s correlation coefficient  Spearman’s Rank correlation coefficient

2. Sign Test for Matched Pairs
• When the observations are matched pairs but the assumptions underlying the paired t-test are not met, or the measurement scale is weak, the Sign Test can be applied

• Hypothesis
  o H0: ∆d = 0 (the median of the differences is zero)
  o HA: ∆d ≠ 0
• Test statistic (T.S.): the smaller of n+ and n−
• Rejection rule (RR): reject H0 if the p-value is less than α (assumed alpha)
• Procedure
  o Exclude the observations for which the difference (di) is zero
  o For di > 0 assign a (+) sign and for di < 0 assign a (−) sign
3. Wilcoxon Signed-Rank Test for paired samples
• It is more sophisticated than the Sign test
• The Sign test only tells whether the sign of a difference is positive or negative
• This test makes use of both the signs and the magnitudes of the differences
• Thus for a strong measurement scale the Sign test may be undesirable, since it would not make full use of the information contained in the data
• Assumption
  o The distribution of the differences is continuous
  o The distribution of the differences is symmetric
• Hypothesis
  o H0: ∆d = 0 (the median of the differences is zero)
  o HA: ∆d ≠ 0
• Test statistic (T.S.): T = min(T+, |T−|)
• Rejection region
  o Reject H0 if T ≤ critical value, or
  o Reject H0 if the p-value is less than α (assumed alpha)
• Procedure
  o Calculate the difference for each pair of observations (di)
  o Ignore the signs of these differences
  o Rank the absolute values from smallest to largest
  o Assign the signs of the corresponding differences to these ranks

  o A difference of zero is not ranked; it is eliminated from the analysis and the sample size is reduced by one
  o Tied observations are assigned an average rank (suppose the two smallest differences are tied; each one will get the average rank (1 + 2)/2 = 1.5)
  o Assign each rank either a (+) or (−) sign corresponding to the sign of the difference
  o Compute the sum of the positive ranks (T+) and the sum of the negative ranks (T−)
  o Choose the test statistic (the smaller of T+ and |T−|)
4. Wilcoxon Rank Sum Test (Mann-Whitney U test)
• Counterpart of the t-test for two independent samples
• Assumptions
  o The two samples have been drawn independently and randomly from their respective populations
  o The measurement scale is at least ordinal
  o The distributions of the two populations have the same general shape. They differ only with respect to their medians
• Hypothesis
  o H0: the two populations are identical (∆1 = ∆2)
  o HA: populations 1 and 2 have different medians (∆1 ≠ ∆2)
• Rejection rule:
  o Reject H0 if the p-value is less than 0.05 (assumed α)
• Procedure
  o Select independent random samples from each population
  o Combine the two samples
  o Jointly rank the combined samples. If there are tied observations, assign the average rank to all observations with the same value
    • For example: if two observations are tied for ranks 3 and 4, each is given 3.5. The next higher value receives a rank of 5, and so on
  o Label the sample with the smaller sample size as sample 1; the test statistic is the sum of the ranks for sample 1
  o Determine the rejection region from the table
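The signed-rank procedure of section 3 (drop zero differences, rank the absolute differences with average ranks for ties, then sum the ranks by sign) can be sketched as follows. The blood pressure readings are hypothetical and the function names are assumptions made for this illustration; the sketch stops at the test statistic T and leaves the critical value to the table.

```python
def average_ranks(values):
    """Rank values with 1 for the smallest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (T+, T-, T) for paired samples, where T = min(T+, T-)."""
    # Differences of zero are not ranked and are dropped from the analysis.
    diffs = [b - a for b, a in zip(before, after) if b != a]
    # Rank the absolute differences, ignoring their signs.
    ranks = average_ranks([abs(d) for d in diffs])
    # Give each rank the sign of its difference and sum by sign.
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus, min(t_plus, t_minus)


# Hypothetical systolic blood pressure before and after treatment (mmHg).
before = [120, 130, 125, 140, 135, 128]
after = [115, 132, 125, 130, 131, 127]

t_plus, t_minus, t_stat = wilcoxon_signed_rank(before, after)
print("T+ =", t_plus, " T- =", t_minus, " T =", t_stat)
```

Here one pair has a zero difference and is excluded, so five differences are ranked and T+ and T− add up to 5 × 6 / 2 = 15, which is a quick check on the ranking.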

5. Kruskal-Wallis Test
• Counterpart of One-Way Analysis of Variance (ANOVA: comparing the means of more than two groups) if:
  o The normality assumption of ANOVA is not justified
  o Or the data available are ordinal (consist of ranks)
• Assumption:
  o The samples are independent and random
  o The measurement scale is at least ordinal
  o The distributions of the values in the sampled populations are identical, except for the possibility that one or more of the populations are composed of values that tend to be larger than those of the other populations
• Hypothesis
  o H0: the populations are all identical
  o HA: at least one of the populations tends to exhibit larger values than the others
• Procedure
  o If there are no ties or a moderate number of ties, the formula simplifies to: $T = \frac{12}{N(N+1)} \sum_{i=1}^{t} \frac{R_i^2}{n_i} - 3(N+1)$, where t is the number of groups, $n_i$ is the size of group i, $R_i$ is the sum of the ranks in group i, and N is the total sample size
• Rejection region
  o When the sample sizes are large ($n_i \geq 5$) the test statistic T is distributed approximately as $\chi^2$ with (t − 1) degrees of freedom
  o Reject H0 if T > $\chi^2$(t − 1)
6. Spearman’s Rank correlation coefficient
• Nonparametric alternative to Pearson’s coefficient of correlation
• Relevant when the measurement scale is at least ordinal or the relationship between the two variables is not linear
• It is denoted by $r_s$
• $r_s$ = 1 implies strictly increasing monotonicity
• $r_s$ = −1 implies strictly decreasing monotonicity
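For the Kruskal-Wallis test in section 5, the statistic can be computed by jointly ranking all observations and summing the ranks per group. The sketch below assumes the standard no-ties form of the statistic, T = 12/(N(N+1)) Σ Rᵢ²/nᵢ − 3(N+1); the scores and the function name `kruskal_wallis_statistic` are hypothetical, and the groups are kept small only to make the arithmetic easy to follow by hand.

```python
def kruskal_wallis_statistic(groups):
    """Kruskal-Wallis T for groups with no tied values (assumption of this sketch)."""
    pooled = sorted(value for group in groups for value in group)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch assumes no tied observations")

    # Joint ranking of the combined samples: 1 for the smallest value.
    rank_of = {value: position + 1 for position, value in enumerate(pooled)}
    n_total = len(pooled)

    # Sum over groups of (rank sum squared) / (group size).
    weighted = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical scores for three independent groups.
groups = [
    [12, 15, 11],
    [18, 20, 17],
    [25, 22, 27],
]

t_stat = kruskal_wallis_statistic(groups)
degrees_of_freedom = len(groups) - 1
print("T =", round(t_stat, 2), "with df =", degrees_of_freedom)
```

With groups this small the chi-square approximation mentioned above is not appropriate, so the value would be compared against an exact table rather than $\chi^2$(t − 1).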

NON-PARAMETRIC TESTS
1. Introduction
• Inferential statistics where population parameters are not required to calculate the test value
• A process which is carried out in order to find out whether or not a particular statistical hypothesis is likely to be true
• A statistical test in which no assumptions are made about any statistical parameter. This is similar to a test in which we do not assume that the data have any particular distribution
2. The $\chi^2$ test for goodness of fit
• Tests a hypothesis about the proportions of a population distribution
• Tests how well the sample proportions fit the population proportions specified by the null hypothesis
• Example:
  o H0: there is no difference in the proportion of people in the different categories
  o Observed data (fo), n = 60

    Category 1   Category 2   Category 3
    7            26           27

  o Expected data based on the hypothesis (fe): 1/3 of 60 in each category = 1/3 × 60 = 20

    Category 1   Category 2   Category 3
    20           20           20

  o Difference between observed data and expected data (fo − fe)

    Category 1     Category 2    Category 3
    7 − 20 = −13   26 − 20 = 6   27 − 20 = 7

  o Square the differences (fo − fe)²

    Category 1     Category 2   Category 3
    (−13)² = 169   6² = 36      7² = 49

  o $\chi^2 = \sum$ [(squared difference)/(expected data)]
  o $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
  o $\chi^2$ = 169/20 + 36/20 + 49/20
  o $\chi^2$ = 8.45 + 1.8 + 2.45
  o $\chi^2$ = 12.7
  o Degrees of freedom = number of columns − 1
  o Degrees of freedom = 3 − 1 = 2
3. Chi-square test for independence
• Tests a relationship between two variables
• Each individual in the sample is measured or classified on two separate variables
• Example:
  o H0: there is no relationship between preference of teaching method and gender of the students
    • Variable 1: teaching method
    • Variable 2: gender of students
  o Observed data (fo)

    Gender   Lecture   Tutorial   Total
    Male     25        25         50
    Female   35        65         100
    Total    60        90         150

    • 40% of students prefer lecture
    • 60% of students prefer tutorial

  o Expected data (fe)

    Gender   Lecture   Tutorial   Total
    Male     *20       30         **50
    Female   40        60         100
    Total    **60      90         ***150

    • fe = (**row total × **column total)/(***whole total)
    • *Example of calculation: **50 × **60/***150 = *20
  o Difference between observed and expected values (fo − fe)

    Gender   Lecture        Tutorial       Total
    Male     25 − 20 = 5    25 − 30 = −5   50
    Female   35 − 40 = −5   65 − 60 = 5    100
    Total    60             90             150

  o Square of the differences (fo − fe)²
    • 5² = 25, (−5)² = 25, (−5)² = 25, 5² = 25
  o Formula: $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
    • So $\chi^2$ = 25/20 + 25/30 + 25/40 + 25/60
    • $\chi^2$ = 1.25 + 0.83 + 0.63 + 0.42 = 3.13
    • Degrees of freedom = (no. of columns − 1) × (no. of rows − 1)
    • Degrees of freedom = (2 − 1) × (2 − 1) = 1
    • Critical value = 3.84 when α = 0.05 (refer to the $\chi^2$ statistic table)
  o Interpretation
    • The calculated value is less than the critical value
    • Therefore the null hypothesis is not rejected; there is no difference
  o Conclusion

    • There is no relationship between preference of teaching method and gender of the students
4. Chi-squared test for variance
• This is a test of the null hypothesis that the population variance is $\sigma^2$
• We have a sample of size n and we compute an unbiased estimate of the population variance, $s^2$, using the divisor n − 1
• The test statistic is $\chi^2 = \frac{(n-1) s^2}{\sigma^2}$, which follows the $\chi^2$ distribution with n − 1 degrees of freedom
• We assume that the population is normally distributed
• For a 95% level of confidence (two-tailed), the critical values are taken at the 97.5% and 2.5% probability levels
• The critical region can be found in MS Excel with CHIINV(p, df)
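The two worked chi-square examples above (goodness of fit with three categories, and independence for teaching method by gender) can be reproduced with a short Python sketch. The function names are assumptions made for this illustration; the observed counts are the ones given in the handout, and the comparison with the critical value is still done against the $\chi^2$ table.

```python
def chi_square_goodness_of_fit(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    whole_total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            # Expected count = (row total x column total) / whole total.
            fe = row_totals[i] * column_totals[j] / whole_total
            chi_square += (fo - fe) ** 2 / fe

    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, degrees_of_freedom


# Goodness of fit: observed counts in three categories, 20 expected in each.
gof = chi_square_goodness_of_fit([7, 26, 27], [20, 20, 20])
print("Goodness of fit chi-square:", round(gof, 2))

# Independence: rows are male and female, columns are lecture and tutorial.
observed = [
    [25, 25],
    [35, 65],
]
chi_square, df = chi_square_independence(observed)
print("Independence chi-square:", round(chi_square, 2), "with df =", df)
```

The unrounded independence statistic is 3.125; the handout's 3.13 comes from summing the four terms after rounding each to two decimals.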


STATISTICAL ANALYSIS: WHICH TO CHOOSE?
1. Process of data management (follow the steps below)
• Research question(s)
• Research design
• Data collection
• Data entry
• Data exploration & cleaning
• Data analysis
• Interpretation
• Writing up
2. Role of statistics in a study
• Statistical knowledge and judgment are required at every step of a study
• What statistical analysis is appropriate to answer the research question? Points to consider in selecting the right statistical test:
  o Research question/hypothesis
    • Are you clear about what you want to find out and what design you have used in your study?
  o Number of variables
  o Type of data
  o Number of groups
  o Sample distribution
  o Sample type
3. Research question
• The essential question that the study is designed to answer
• Most studies are concerned with answering one of the following four types of question
  o What is the magnitude of a health problem or health factor?
  o What is the efficacy of an intervention?
  o What is the causal relation between one factor (or factors) and the disease or outcome of interest?
  o What is the natural history of a disease?


• What is/are the research question(s)?
  o Common in medical research:
    • Difference between/among means
    • Difference between/among proportions
    • Associations between/among factors
    • Difference between/among treatment effects
• Hypothesis
  o This is a testable statement that describes the nature of the proposed relationship between two or more variables of interest
  o E.g. there is an association between smoking and coronary heart disease
4. What is the research design applied and expected result?
• Randomized controlled trial (RCT)
• Observational studies
  o Cross-sectional
  o Case-control
  o Prospective cohort
  o Retrospective cohort
• Case report/series
• Diagnostic test
• E.g. 1
  o Research question: effectiveness of a new anti-hypertensive drug
  o Research design: randomized controlled trial


• E.g. 2
  o Research question: risk factors for enteric fever
  o Research design: case-control
  [Figure: case-control design. Labels: population; people with disease; people without disease; time direction.]
• E.g. 3
  o Research question: maternal and fetal outcome in mothers with PIH
  o Research design: prospective cohort

  [Figure: prospective cohort design. Labels: population (no disease); time direction.]
5. Study factor(s)
• Variable(s) of interest that is hypothesized to be related to the health problem, disease or outcome of interest
• Also known as the independent variables/exposure variables/determinants
6. Outcome factor(s)
• The event or occurrence that is supposed to happen as a result of the study factor
• E.g. for salt intake, the outcome factor is blood pressure
• Also known as the dependent variable, as it is influenced by the study factors
7. Number of variables
• One independent variable only: univariate analysis
• More than one independent (factor) variable: multivariate analysis
• In health sciences a study is unlikely to be conducted and concluded by focusing on univariate analysis alone
• If there is a multi-factorial effect on the outcome, univariate analysis gives misleading results
• Example: risk factors for coronary heart disease
• Multivariate analysis can eliminate the confounding effect
8. Type of data
• Numerical
  o Continuous (e.g. weight)

  o Discrete (e.g. number of patients admitted)
• Categorical
  o Nominal (e.g. gender)
  o Ordinal (e.g. disease severity, socioeconomic status)
• The statistical tests applied differ based on the type of variables (both independent and dependent variables must be considered)
9. Number of groups
• Two groups (two levels) (e.g. disease and non-disease groups, diabetic and non-diabetic groups)
• More than two groups (more than two levels) (e.g. occupation, race – Malay, Chinese, Indian, Others)
10. Sample distribution
• Normal distribution → parametric test
• Non-normal distribution → non-parametric test
• Suggested procedure for assessing normality
  o Compare the mean and median (for a normal distribution mean = median)
  o Construct a histogram overlaid with a normal curve
  o Construct a box and whisker plot
  o Statistical tests
    • Kolmogorov-Smirnov test
    • Shapiro-Wilk test
• Non-parametric tests are appropriate when:
  o The data are ordinal
  o The data have a non-normal distribution and cannot be easily transformed
  o The data may contain outliers
• Non-parametric methods have two general limitations
  o Not as powerful as their parametric counterparts
  o Tests for complex designs are not readily available in standard computer packages
11. Sample type
• Independent sample (e.g. male and female)

• Dependent/paired/matched sample (e.g. difference of blood pressure measurements before and after treatment, age and sex matched samples)
12. What to be asked before choosing a statistical test?
• What is the research question/hypothesis?
• What is the outcome factor and what are the study factors?
• How many variables?
• How many groups?
• What is the distribution like?
• Are the samples independent?
• Is the data numerical/categorical?
13. Data exploration and cleaning
• Compulsory to do
• Do not rush to analyze data
• Clean and explore first
• Get acquainted with the data
• Check duplications
• Out-of-range values and location of errors
• Distribution of variables
• Missing data and checking consistency errors
• Exploring the relationship between variables
• Transformations
• To get acquainted with the data set before the major analysis is carried out
  o Read the protocol again
  o Recall the objectives
  o Identify the major outcome, exposure and potential confounders/effect modifiers
  o Check for records with duplicate ID numbers (to prevent repeated data entry)
• Error checking
  o Respondent’s mis-marking of answers

  o Coder’s miscoding of responses
  o Marking errors by data personnel
• Out-of-range values and location errors
  o Measurement error
  o Recording error
  o Genuine observation
• What to do?
  o Check the original measurements again where possible
  o If the original measurements are suspicious: repeat the measurement
  o If it is not possible to check: use common sense
  o If the value is impossible/implausible: it is justifiable to set it as “missing”
14. Distribution of the variables
  o Examine each variable
    • Continuous
      • Normal distribution
      • If not: ? transformation, ? categorization
    • Categorical
      • Frequency distribution
15. Missing data
  o Occur when the respondent would not or could not answer
  o Too much missing data
    • Threatens the study
    • Indicates a problem with a question
  o Should not be entered as a blank, as some statistical packages interpret blanks as zeroes
  o Common practice: coded as 9, 99 or 999
16. Consistency errors
  o Situations where respondents answered a question for which they were ineligible or when codes were entered incorrectly
  o Countercheck with the questionnaire/data collection form

  o Can be prevented by proper programming in some statistical software
17. Exploring the relationship between variables
• Cross tabulation is useful for categorical variables (sometimes it is better to categorize)
• Should consider confounding and interaction
• Graphs: mostly for continuous variables
• Relationship between the outcome variable and other variables
  o E.g. scatter plot
18. Transformation
• Severely skewed data: two approaches
  o Use nonparametric methods
  o Apply a transformation
• Many distributions in medicine are skewed to the right
• Involves performing a mathematical operation on every value of the variable
• Improves the symmetry of the distribution

Transformation   Name                   Effect
X³               Cube                   Reduce extreme skewness to left
X²               Square                 Reduce skewness to left
X^(1/2)          Square root            Reduce mild skewness to right
log10(X)         Log                    Reduce skewness to right
−1/√X            −ve reciprocal root    Reduce extreme skewness to right
−1/X             −ve reciprocal         Reduce even more extreme skewness to right

• Check the symmetry of the distribution after transformation
• If sufficiently improved: use the transformed data
• If resistant to transformation: use nonparametric methods
19. Interpretation
• The most confusing part for researchers

• May be the most difficult part for those who are not familiar with statistical applications
• Should interpret only the results considered to be from the final analysis stage
  o E.g. in multivariate analysis, the final model should be interpreted for writing, regardless of prior results that were more favorable towards the hypothesis
• Recall statistical theory and concepts whenever applicable
• May need help from a medical statistician
20. Univariate analysis
• Tests a hypothesis between one independent and one dependent variable
21. Multivariate analysis
• Why do we need multivariate analysis?
• Purpose of using multivariate analysis
• Common multivariate analysis methods in health sciences research
• Variables
  o Independent variables: predictor, explanatory
  o Dependent variables: outcome, response
  o Covariates: confounders, controls, effect modifiers
    • Not the primary interest
    • Must be recognized

• Confounding
  [Figure: Risk factor → ? → Disease, with a confounder linked to both the risk factor and the disease.]
  o Distortion of a risk factor-disease relationship brought about by the association of other factors with both the risk factor and the disease
  o Example of confounding:
  [Figure: Physical activity level → ? → Systolic BP, with age as the confounder.]

• Interaction
  [Figure: Risk factor → ? → Disease, with an interaction factor (effect modifier) acting on the relationship.]
  o Exists when the primary relationship of interest between a risk factor and a disease is different at different levels of the interaction factor
  o Example of interaction
  [Figure: Employment in an industry → ? → Lung cancer, with cigarette smoking as the interaction factor. Plot of risk of lung cancer against years of employment in the industry, with separate lines for smokers and non-smokers.]

• E.g. multivariate analysis: comparing outcome between a surgery group and a radiation group
  o Are these two groups comparable?
  o What is the role of covariates?
22. Purpose of Multivariate Analysis
• To statistically adjust the effect on variable Y of a change in a particular variable X when the others are controlled
  o X1 → Y (X2, X3, X4… statistically adjusted for; e.g. diet → CHD {smoking and age adjusted for})
• To discover the variable X which has the most influence on the outcome variable Y
  o E.g. diet, smoking, age → CHD
• To predict the outcome Y
  o E.g. clinical, pathology, demographic, socio-economic factors → cancer prognosis
• The whole idea of multivariate analysis is “how to separate the independent effect of each X on Y”
• Common multivariate analysis methods in health related sciences research
• Multivariate models

• Modeling strategies

Multivariate analysis          Independent variable   Dependent variable
Multiple linear reg.           >1                     1
Multiple logistic reg.         >1                     1
Log-linear analysis methods    >1                     1
Survival analysis              >1                     1

MTV (as listed in the source; the row alignment of the original table is not recoverable):
  o Methods: multiple linear regression; multiple logistic regression; log-linear regression; survival analysis
  o Independent variables: continuous; categorical; continuous; continuous/categorical; continuous/categorical
  o Dependent variables: continuous; categorical; categorical; continuous (survival time); continuous

23. General Linear Model (GLM)
• The GLM is a flexible statistical model incorporating analyses involving normally distributed dependent variables and combinations of categorical and continuous predictor variables
• The GLM Univariate procedure provides regression analysis and analysis of variance for one dependent variable by one or more factors or covariates
• The GLM Multivariate procedure provides regression analysis and analysis of variance for multiple dependent variables by one or more factors or covariates
• The GLM Repeated Measures procedure provides analysis of variance when the same measurement is made several times on each subject or case

GLM                Independent variable   Dependent variable
GLM Univariate     ≥1                     1
GLM Multivariate   ≥1                     >1

24. Repeated measures in categorical outcome
• When the dependent variable is a numerical variable

Independent variable   Dependent variable   Statistical test
Categorical            Numerical            Repeated measures ANOVA (parametric)
Categorical            Numerical            Friedman test (non-parametric)

• When the dependent variable is a categorical variable

Repeated measure
Independent variable   Dependent variable       Statistical test
2 measures             2 outcome categories     McNemar’s test
2 measures             3++ outcome categories   Test of marginal homogeneity
3++ measures           2 outcome categories     Cochran’s Q test

Repeated measure with independent variables: cross-sectional time series (xt)
Dependent variable   Statistical test
Binary, ordinal      Multiple logistic regression (xtlogit)

Count                Log-linear regression (xtpoisson)
                     Generalized estimating equation (GEE) model (xtgee)

WRITING A RESEARCH PROPOSAL
1. Introduction:
• Clear statement of the problem or issue to be analysed and the overall objective of the proposed research
• Brief summary of relevant studies and literature describing what has previously been done and what is currently known about the pattern
• Concise statement of the rationale behind the proposed approach to the problem
2. Statement of specific research goals:
• List specific objectives
• List specific hypotheses (if any) to be tested
• List the key variables and how they will be operationally defined
3. Study methodology
• Selection of study population
  o Size of study population or sample
  o Sampling procedure, if any
  o Specification of control population, if any
• Description of the experiment or data collection procedure
  o Description of research design
  o Description of method and intended research tools
  o Description of “interfering” (confounding) variables and how they will be controlled, or how their effects will be evaluated
  o If appropriate, a discussion of pitfalls that might be encountered and of limitations of the procedure proposed
• Diagram of research design (optional): a diagram is useful for clarifying points of research strategy
• Analysis plan
  o Specify the kinds of data expected to be obtained

  o Specify the means by which the data will be analysed and interpreted
• Data processing plan
  o Hand tabulation or computer
  o Analysis technique: statistical measures
  o Use of dummy tables
  o Test hypotheses or derive hypotheses to meet the objectives of the study
4. Significance of the research for both practice and theory
5. Time table (Gantt chart)
• Planning phase
• Construction and development of research instruments
• Pre-testing of research tools and techniques
• Selection of population
• Data collection
• Data preparation (coding, editing, cleaning, etc.)
• Data analysis
• Report writing
6. Personnel
• Principal investigator
• Assistants
• Supporting persons
7. Facilities available
• Office space
• Resources in field area
• Data analysis equipment
• Other assistance

8. Detailed budget
• Personnel
• Consultant fees
• Supplies
• Travel expenses
• Data processing
• Other expenses

9. Collaboration arrangement
• Describe the collaboration

VARIABLES

1. Types of variables
• Continuous or quantitative variables
• Discrete or qualitative variables

2. Continuous or quantitative variables
• Interval-scale variables
  o Interval scale data has order and equal intervals.
  o Interval scale variables are measured on a linear scale, and can take on positive or negative values.
  o It is assumed that the intervals keep the same importance throughout the scale.
  o They allow us not only to rank order the items that are measured but also to quantify and compare magnitudes of differences between them.
  o With interval data, one can perform logical operations, add, and subtract, but one cannot multiply or divide.
  o For instance, if a liquid is at 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0 degrees does not represent 'no temperature'.
• Ratio-scale variables
  o Finally, in ratio measurement there is always an absolute zero that is meaningful.
  o This means that you can construct a meaningful fraction (or ratio) with a ratio variable.
  o Weight is a ratio variable.
  o In applied social research most 'count' variables are ratio, for example, the number of clients in the past six months.
  o Why? Because you can have zero clients and because it is meaningful to say that "…we had twice as many clients in the past six months as we did in the previous six months."
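The temperature example above can be checked with a few lines of Python. This is only a sketch: it assumes the "degrees" in the example are degrees Celsius (the notes do not name the unit), and the client counts are hypothetical values chosen for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scale Celsius reading to the ratio-scale Kelvin value."""
    return celsius + 273.15

# Interval scale: differences are meaningful, so 40 + 10 = 50 degrees is fine.
warmer = 40 + 10

# Dividing Celsius readings gives 2.0, but this ratio is not meaningful
# because 0 degrees Celsius is not 'no temperature'.
ratio_celsius = 40 / 20

# On the Kelvin scale (which has an absolute zero) the ratio is only about 1.07.
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Ratio scale: counts have a true zero, so "twice as many" is meaningful.
clients_past_six_months = 30      # hypothetical count
clients_previous_six_months = 15  # hypothetical count
ratio_clients = clients_past_six_months / clients_previous_six_months

print(f"Celsius ratio: {ratio_celsius:.2f}, Kelvin ratio: {ratio_kelvin:.3f}")
print(f"Client ratio: {ratio_clients:.1f}")
```

The Kelvin ratio shows why multiplication and division are reserved for ratio-scale variables.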

3. Qualitative or discrete variables
• Discrete variables are also called categorical variables
  o Nominal variables
  o Ordinal variables
• Nominal variables
  o Nominal variables allow for only qualitative classification.
  o That is, they can be measured only in terms of whether the individual items belong to certain distinct categories, but we cannot quantify or even rank order the categories.
  o Nominal data has no order, and the assignment of numbers to categories is purely arbitrary.
  o Because of the lack of order or equal intervals, one cannot perform arithmetic (+, -, / or *) or logical operations (<, >, =) on nominal data.
  o E.g. male and female; unmarried, married, divorced or widowed.
• Ordinal variables
  o A discrete ordinal variable is a nominal variable, but its different states are ordered in a meaningful sequence.
  o Ordinal data has order, but the intervals between scale points may be uneven.
  o Because of the lack of equal distances, arithmetic operations are impossible, but logical operations can be performed on ordinal data.
  o A typical example of an ordinal variable is the socio-economic status of families.
  o We know upper middle is higher than middle, but we cannot say how much higher.
  o Ordinal variables are quite useful for subjective assessment of quality, importance or relevance.
  o Ordinal scale data are very frequently used in social and behavioral research.

  o Almost all opinion surveys today request answers on three-, five- or seven-point scales.
  o Such data are not appropriate for analysis by classical techniques, because the numbers are comparable only in terms of relative magnitude, not actual magnitude.
  o Consider for example a questionnaire item on time involvement, answered by selecting one of the following codes:
    1 = very low or nil
    2 = low
    3 = medium
    4 = great
    5 = very great

Levels of measurement:

  Ratio      absolute zero
  Interval   distance is meaningful
  Ordinal    attributes can be ordered
  Nominal    attributes are only named (weakest)

4. Response variables/target variables
• Often called a dependent variable or predicted variable.
• This is the variable that is being watched and/or measured.

5. Explanatory variables/predictor variables
• Any variable that explains the response variable; also called a predictor variable.
• This is the variable manipulated by the experimenter.
• Its values will be used to predict the value of the target variable.

6. Confounding variable
• A confounding variable (also confounding factor, lurking variable, a confound, or confounder) is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable.
• In other words, a confounder is a variable that is associated with the predictor variable and is a cause of the outcome variable.
• Extraneous variables are undesirable variables that influence the relationship between the variables that an experimenter is examining.

DATA PRESENTATION

1. Two ways of presenting data
• Tables
• Charts

2. Tables
• One-way table (Univariate)
  o Table 1: Number of respondents by gender

    Gender    No. of respondents
    Male      51
    Female    49
    Total     100

• Two-way table (Bivariate)
  o Table 2: Number of respondents by gender and their educational qualification

    Gender    Primary    Secondary    Higher    Total
    Male      15         20           16        51
    Female    14         20           12        49
    Total     29         40           38        100

    Gender    Primary (%)    Secondary (%)    Higher (%)    Total (%)
    Male      15 ( )         20 ( )           16 ( )        51 ( )
    Female    14 ( )         20 ( )           12 ( )        49 ( )
    Total     29 ( )         40 ( )           38 ( )        100 ( )
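As a rough illustration of how one-way and two-way tables are tallied from raw records, the sketch below counts a small set of respondent records and prints each count with its percentage of the total. The records are hypothetical and are not the data behind Tables 1 and 2; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical respondent records as (gender, education) pairs.
records = [
    ("Male", "Primary"), ("Male", "Secondary"), ("Male", "Higher"),
    ("Female", "Primary"), ("Female", "Secondary"), ("Female", "Secondary"),
]
total = len(records)

# One-way (univariate) table: count respondents by gender only.
one_way = Counter(gender for gender, _education in records)

# Two-way (bivariate) table: count respondents by gender and education together.
two_way = Counter(records)

print("One-way table")
for gender, count in sorted(one_way.items()):
    print(f"  {gender}: {count} ({100 * count / total:.1f}%)")

print("Two-way table")
for (gender, education), count in sorted(two_way.items()):
    print(f"  {gender} / {education}: {count} ({100 * count / total:.1f}%)")
```

The percentages here are taken over the grand total; row or column percentages would divide by the corresponding marginal total instead.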

3. Charts
• Charts are a graphical way to organize data
• Types
  o Pie chart
    - A pie chart is a graphical way to organize data
    - All pie charts compare parts of a whole
    - A pie chart uses percentages or fractions to compare data
    - A type of graph in which percentage values are represented as proportionally-sized slices of a pie
    - Pie charts are especially useful in representing proportions, percents and fractions
  o Bar chart and Histogram
    - A histogram is a bar graph that shows frequency data
    - The first step is to collect data and sort it into categories
    - Label the data as the independent set or the dependent set
    - The data group would be the independent variable and the frequency of that set would be the dependent variable
    - The horizontal axis should be labeled with the independent variable
    - The vertical axis should be labeled with the dependent variable
    - Each mark on either axis should be in equal increments, such as 2, 4, 6, 8, etc.
    - Think of a histogram as a "sorting bin"

    - You have one variable, and you sort data by this variable by placing them into "bins"
    - Then you count how many pieces of data are in each bin
    - The height of the rectangle you draw on top of each bin is proportional to the number of pieces in that bin
    - On the other hand, in a bar graph you have several measurements of different items, and you compare them
    - The main question a histogram answers is "how many measurements are there in each of the classes of measurement?"
    - The main question a bar graph answers is "what is the measurement for each item?"

    Situation: We want to compare total revenues of five different companies.
    Bar graph or histogram? Bar graph. Key question: what is the revenue for each company?

    Situation: We have measured revenues of several companies. We want to compare numbers of companies that make from 0 to 10,000, from 10,000 to 20,000, from 20,000 to 30,000 and so on.
    Bar graph or histogram? Histogram. Key question: how many companies are there in each class of revenue?

    Situation: We want to compare the heights of ten oak trees in a city park.
    Bar graph or histogram? Bar graph. Key question: what is the height of each tree?

    Situation: We have measured several trees in a city park. We want to compare numbers of trees that are from 0 to 5 meters high, from 5 to 10, from 10 to 15 and so on.
    Bar graph or histogram? Histogram. Key question: how many trees are there in each class of height?

  o Line graph
    - Line graphs are among the most widely used graphs because their visual characteristics reveal data trends clearly and they are easy to create

    - A line graph is a visual comparison of how two variables, shown on the x- and y-axes, are related or vary with each other
    - It shows related information by drawing a continuous line between all the points on a grid
    - Line graphs compare two variables: one is plotted along the x-axis (horizontal) and the other along the y-axis (vertical)
    - The y-axis in a line graph usually indicates quantity (e.g. dollars, liters) or percentage, while the horizontal x-axis often measures units of time
  o Scatter plot
    - The pattern of the data points on the scatter plot reveals the relationship between the variables
    - Scatter plots can illustrate various patterns and relationships, such as:
      • Data correlation
      • Positive or direct relationships between variables
      • Negative or inverse relationships between variables
      • Scattered data points
      • Non-linear patterns
      • Spread of data
      • Outliers
  o Pictograph
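The "sorting bin" idea behind a histogram, described under Bar chart and Histogram in this section, can be sketched in a few lines of Python. The tree heights are hypothetical values chosen for illustration, and `bin_label` is an illustrative helper name, not part of any library.

```python
from collections import Counter

# Hypothetical tree heights in metres (illustrative data only).
heights = [2.5, 4.0, 6.1, 7.3, 8.8, 11.0, 12.4, 3.2, 9.9, 14.5]
bin_width = 5  # classes: 0-5, 5-10, 10-15, ...

def bin_label(value, width):
    """Return the (lower, upper) bounds of the bin that a value falls into."""
    lower = int(value // width) * width
    return (lower, lower + width)

# Sort each measurement into its bin and count the pieces of data per bin.
counts = Counter(bin_label(h, bin_width) for h in heights)

# The bar drawn for each bin is proportional to the count in that bin.
for (lower, upper), n in sorted(counts.items()):
    print(f"{lower:>2}-{upper:<2} m: {'#' * n} ({n})")
```

A bar graph of the same data would instead draw one bar per tree, with the bar height equal to that tree's height.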

Z-SCORE & ITS USES

This is the formula for converting a given value of x into its corresponding z score for raw data:

  z = [x − µ]/σ

  x = the value that is being standardized
  µ = the mean of the distribution
  σ = standard deviation of the distribution

In every normal distribution, 0.3413 of the total area lies between the mean and z = 1.

Z-score for means:

  z = [X − µ]/σx

Standard error formula:

  σx = σ/√n

  X = sample mean
  σx = standard error
  µ = population mean
  σ = standard deviation
  n = sample size

1. Value of z-score
• The sign tells whether the score is located above (+) or below (-) the mean
• The number tells the distance between the score and the mean in terms of the number of standard deviations

2. Z-score serves 2 purposes
• Each z-score will tell the exact location of the original x value within the distribution
• The z-scores will form a standardized distribution that can be directly compared to other distributions that have also been transformed into z-scores

• The z-score for an item indicates how far, and in what direction, that item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
• The mathematics of the z score transformation are such that if every item in a distribution is converted to its z score, the transformed scores will necessarily have a mean of zero and a standard deviation of one.
• Z scores are sometimes called "standard scores".
• The z score transformation is especially useful when seeking to compare the relative standings of items from distributions with different standard deviations.
• Z scores are especially informative when the distribution to which they refer is normal.
• In every normal distribution, the distance between the mean and a given z score cuts off a fixed proportion of the total area under the curve.

3. Z-score for making comparison
• For example: Bob received a score of x = 60 on a math exam and a score of x = 56 on a biology test. In which course did he do better?
  o Suppose the biology scores had µ = 48 and σ = 4 and the math scores had µ = 50 and σ = 10.
  o Then z = [60 − 50]/10 = 1.0 for math and z = [56 − 48]/4 = 2.0 for biology, so he did relatively better in biology.
  o Suppose you use a test for your students with µ = 65 and σ = 10, and your friend uses a test with µ = 100 and σ = 15.
  o Three of your students got 75, 45 and 67 respectively in your test. What should their scores be in your friend's test if the students' performance in both tests is to be the same?
  o Formula for standardized score is x = µ + zσ
• Second example: Ho = there is no effect of PBL on the average score obtained by the students
  o Average score (µ) of USM 3rd year students is 60 with a standard deviation (σ) of 5

  o A sample of 20 students attended PBL and the average score of this group of students is 65
  o Is this increase of 5 marks in the average due to chance or the effect of PBL?
  o The answer can be obtained by z test
    - Standard error formula: σx = σ/√n
      X = sample mean
      σx = standard error
      µ = population mean
      σ = standard deviation
      n = sample size
    - Ho: µ of the students who attended PBL = 60
    - H1: µ of the students who attended PBL > 60 or ≠ 60
    - Level of significance or α level = 0.05 (usually used)
    - Critical value Z = 1.96 (two-tailed)
    - Z = [sample mean – hypothetical mean]/[standard error between sample mean and µ]
    - Z = [obtained difference]/[difference due to chance]
    - Consult the normal distribution table to see whether the calculated value is in the critical region or not, in order to reject or accept the null hypothesis
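The worked examples in this section can be followed step by step in Python. This is a minimal sketch using only the numbers given in the notes; the function names `z_score` and `from_z` are illustrative, and `statistics.NormalDist` (Python 3.8+) stands in for the printed normal distribution table.

```python
import math
from statistics import NormalDist

def z_score(x, mu, sigma):
    """z = [x - mu]/sigma: distance from the mean in standard deviations."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """x = mu + z*sigma: convert a z score back to a raw score."""
    return mu + z * sigma

# Area between the mean and z = 1 in a standard normal distribution.
area_mean_to_z1 = NormalDist().cdf(1.0) - 0.5

# Bob's scores: the larger z score marks the relatively better result.
z_math = z_score(60, mu=50, sigma=10)
z_biology = z_score(56, mu=48, sigma=4)

# Equivalent scores on the friend's test (mu = 100, sigma = 15)
# for students who scored 75, 45 and 67 on your test (mu = 65, sigma = 10).
equivalent = [from_z(z_score(x, 65, 10), 100, 15) for x in (75, 45, 67)]

# z test for the PBL example: mu = 60, sigma = 5, n = 20, sample mean = 65.
standard_error = 5 / math.sqrt(20)
z_test = (65 - 60) / standard_error
reject_null = abs(z_test) > 1.96  # two-tailed test at the 0.05 level

print(f"Area from mean to z = 1: {area_mean_to_z1:.4f}")
print(f"z (math) = {z_math:.1f}, z (biology) = {z_biology:.1f}")
print("Equivalent scores:", [round(x, 1) for x in equivalent])
print(f"z test statistic = {z_test:.2f}, reject Ho: {reject_null}")
```

With these numbers the test statistic falls well inside the critical region, so the null hypothesis of no PBL effect would be rejected at the 0.05 level.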

T-TEST

1. In calculating z-score we need
• µ = population mean
• σ = population standard deviation
• When the population standard deviation (σ) is not known, the t-test is the alternative.

2. In the simple t-test, the sample variance S² is used instead of the population variance σ².
• Sample variance (S²) = SS/[n − 1] = SS/df
  o SS = Σx² – ([Σx]²/n)
  o SS = sum of squared deviations

3. Instead of the standard error σx, the estimated standard error Sx is used.
• Estimated standard error Sx = S/√n = √(S²/n)
• t = [X − µ]/Sx
  o X = sample mean
  o µ = population mean (hypothesised mean)
  o Sx = estimated standard error from the sample
• The higher the degrees of freedom (df) (i.e. the larger the sample size), the closer S² (sample variance) is to σ² (population variance)
• Example:
  o "I prefer PBL to lecture". Response: 1 = SA, 2 = A, 3 = UD, 4 = DA, 5 = SDA
  o In this example the hypothesised mean (µ) = 3
  o But µ can also be taken from a study that has been done previously.

4. Independent measures t-test
• t = [(X1 – X2) – (µ1 − µ2)]/(standard error)
• Pooled variance, S²p = [SS1 + SS2]/[df1 + df2]
• Two-sample standard error, S(X1 – X2) = √[(S²p/n1) + (S²p/n2)]
• Ho = there is no difference in the clinical performance of students who attended the traditional curriculum and those who attended the PBL curriculum.
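The formulas above translate directly into code. The sketch below follows the SS, pooled variance and standard error definitions from this section; all data values (the Likert responses and the two groups of scores) are hypothetical and chosen only to keep the arithmetic easy to follow, and the function names are illustrative.

```python
import math

def sum_of_squares(xs):
    """SS = sum(x^2) - (sum(x))^2 / n."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)

def one_sample_t(xs, mu):
    """t = [X - mu]/Sx with Sx = sqrt(S^2/n) and S^2 = SS/df."""
    n = len(xs)
    df = n - 1
    sample_variance = sum_of_squares(xs) / df
    estimated_se = math.sqrt(sample_variance / n)
    return (sum(xs) / n - mu) / estimated_se, df

def independent_t(xs1, xs2):
    """Independent measures t with pooled variance; mu1 - mu2 = 0 under Ho."""
    n1, n2 = len(xs1), len(xs2)
    df1, df2 = n1 - 1, n2 - 1
    pooled_variance = (sum_of_squares(xs1) + sum_of_squares(xs2)) / (df1 + df2)
    standard_error = math.sqrt(pooled_variance / n1 + pooled_variance / n2)
    mean_difference = sum(xs1) / n1 - sum(xs2) / n2
    return mean_difference / standard_error, df1 + df2

# Hypothetical responses to "I prefer PBL to lecture" (1 = SA ... 5 = SDA),
# tested against the hypothesised mean of 3 (undecided).
responses = [1, 2, 2, 3, 1, 2, 4, 2, 1, 2]
t_one, df_one = one_sample_t(responses, mu=3)

# Hypothetical clinical performance scores for the two curricula.
traditional = [60, 62, 58, 64, 56]
pbl = [66, 68, 64, 70, 62]
t_two, df_two = independent_t(traditional, pbl)

print(f"One-sample t = {t_one:.3f} with df = {df_one}")
print(f"Independent t = {t_two:.3f} with df = {df_two}")
```

Each t value would then be compared with the critical value from a t table for the corresponding degrees of freedom.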


SENSITIVITY & SPECIFICITY

1. Definition
• Sensitivity
  o Proportion of subjects with a target condition who are identified by a positive test finding
  o Test's ability to correctly identify individuals with the condition
  o Test's capacity to detect the condition when it is truly present
  o Probability of a test being positive given that the condition is present
  o Also called true positive rate or hit rate
  o The test will actually classify a person (with the condition) as likely to have the condition
• Specificity
  o Proportion of subjects free of the condition who are correctly identified by a negative test result
  o Test's ability to correctly identify individuals without the condition
  o Test's capacity to exclude the condition when it is truly absent
  o Also called true negative rate or correct rejection rate
  o The test will actually classify a person (without the condition) as unlikely to have the condition

    Respondents      With the condition         Without the condition       Total
    Test positive    a = 36 (true positive)     b = 96 (false positive)     132
    Test negative    c = 4 (false negative)     d = 864 (true negative)     868
    Total            40                         960                         1000

    Sensitivity = a/(a+c) = .90
    Specificity = d/(b+d) = .90
    Positive Predictive Power (PPP) = a/(a+b) = .27
    Negative Predictive Power (NPP) = d/(c+d) = .99

2. Validity of the Test

    Result of test    True status (population)
                      Positive    Negative
    Positive          a           b
    Negative          c           d

Sensitivity: the probability of testing positive if the condition is truly present = a/(a+c)
Specificity: the probability of screening negative if the condition is truly absent = d/(b+d)

Example: Screening for breast cancer by Physical Exam & Mammography

    Respondents      With the condition    Without the condition    Total
    Test positive    a = 36                b = 96                   132
    Test negative    c = 4                 d = 864                  868
    Total            40                    960                      1000

Sensitivity: a/(a+c) = 36/(36+4) = 0.90 = 90%
Interpretation: screening by physical exam and mammography will identify 90% of all true breast cancer cases.

Specificity: d/(b+d) = 864/(96+864) = 0.90 = 90%
Interpretation: screening by physical exam and mammography will correctly classify 90% of all non-breast cancer patients as being free of the disease.

PPP = a/(a+b) = 36/(36+96) = 0.27 = 27%
NPP = d/(c+d) = 864/(864+4) ≈ 0.99 = 99%

Validity: the extent to which the test distinguishes between persons with and without the condition.
High validity requires:
• High sensitivity
• High specificity
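The four measures can be computed from the cells of the 2 x 2 table in a few lines. This sketch uses the cell labels a, b, c, d and the breast cancer screening counts from the example above; the function name `screening_metrics` is illustrative.

```python
def screening_metrics(a, b, c, d):
    """Compute test validity measures from a 2 x 2 table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "sensitivity": a / (a + c),  # positive test given condition present
        "specificity": d / (b + d),  # negative test given condition absent
        "ppp": a / (a + b),          # condition present given positive test
        "npp": d / (c + d),          # condition absent given negative test
    }

# Counts from the physical exam and mammography example.
metrics = screening_metrics(a=36, b=96, c=4, d=864)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the low PPP arises even though sensitivity and specificity are both 90%, because only 40 of the 1000 respondents have the condition.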
