A LESSON GUIDE IN BPE 104: MEASUREMENT AND EVALUATION (Pomnanor GiLA DAWA)

(MIDTERM EXAMINATION)
Chapter 1: Importance and Purpose of Measurement and Evaluation in Human Performance
1.1) Functions of measurement and evaluation
1.2) Formative and summative evaluation
1.3) Norm- and criterion-referenced standards
1.4) Models of evaluation
1.5) Computer literacy for measurement and evaluation

Chapter 2: General Measurement Concepts
2.1) Types of scores
2.2) Common units of measure
2.3) Selecting a criterion score
2.4) Types of reliability and reliability theory
2.5) Acceptable reliability and factors affecting reliability
2.6) Reliability of difference scores
2.7) Objectivity or inter-rater reliability and factors affecting objectivity
2.8) Validity and norm-referenced tests
2.9) Validity and the criterion score
2.10) Validity and criterion-referenced tests

Chapter 3: Nature and Administration of Tests
3.1) Reliability, objectivity, and validity
3.2) Content-related attributes
3.3) Student and participation concerns
3.4) Administrative concerns
3.5) Pretest procedures and pilot testing
3.6) Giving the test
3.7) Posttest procedures
3.8) Measuring individuals with disabilities

(FINAL EXAMINATION)
Chapter 4: Statistical Tools in Evaluation
4.1) Organizing and graphing data
4.2) Descriptive statistics
4.3) Measuring group position
4.4) Standard scores
4.5) Normal curve: characteristics and probability
4.6) Relationships between scores: correlation analysis
4.7) Simple prediction analysis
4.8) Measures of difference: t-tests and analysis of variance (ANOVA)
4.9) Estimating reliability, objectivity, and validity

Chapter 5: Evaluating Knowledge
5.1) Levels of knowledge
5.2) Types of knowledge tests: essay versus objective and mastery versus discrimination
5.3) Test construction: procedure and types of test items
5.4) Test administration and scoring
5.5) Test analysis and revision

Chapter 6: Evaluating Achievement and Grading
6.1) Evaluation: subjective versus objective and formative versus summative
6.2) Standards for evaluation: criterion-referenced versus norm-referenced
6.3) Grading issues
6.4) Grading methods: natural breaks, teacher's standard, rank order, and norms
6.5) Final grades: sum of letter grades, point systems, and sum of T-scores
6.6) Authentic assessment and rubrics
6.7) Cognitive and affective testing
6.8) Measuring attitudes
6.9) Semantic differential scale

Chapter 1: Importance and Purpose of Measurement and Evaluation in Human Performance

Functions of Measurement and Evaluation
Measurement is the collection of information on which a decision is based. Evaluation is the use of measurement in making decisions: it collects suitable data (measurement), judges the value of the data against some standard (i.e., a criterion-referenced or norm-referenced standard), and makes a decision based on the data. The two are interdependent concepts: evaluation is a process that uses measurements, and the purpose of measurement is to collect information accurately, using tests, for evaluation. Improved measurement leads to more accurate evaluation.

Objective and subjective test continuum:
Objective test - two or more people score the same test and assign the same grade. A defined scoring system and trained testers increase objectivity.
Subjective test - a highly subjective test lacks a standardized scoring system.
Functions of Measurement and Evaluation
1) Placement in classes/programs, or grouping based on ability
2) Diagnosis of weaknesses
3) Evaluation of achievement, to determine whether individuals have reached important objectives
4) Prediction of an individual's level of achievement in future activities, or prediction of one measure from another measure
5) Program evaluation
6) Motivation

Formative and Summative Evaluation
Formative Evaluation
1) Judgment of achievement during the process of learning or training
2) Provides feedback during the process to both the learner/athlete and the teacher/coach
3) Identifies what is successful and what needs improvement
Summative Evaluation
1) Judgment of achievement at the end of an instructional unit or program
2) Typically involves test administration at the end of an instructional unit or training period
3) Used to decide whether broad objectives have been achieved

Standards for Evaluation (Norm- and Criterion-Referenced Tests)
"Evaluation is the process of giving meaning to a measurement by judging it against some standard."

Criterion-Referenced Standard
1) Used to determine whether someone has attained a specified standard
2) Useful for setting a performance standard for all
3) A predetermined standard of performance shows that the individual has achieved a desired level of performance
4) The performance of the individual is not compared with that of other individuals
It is common practice to apply a criterion-referenced standard to a norm-referenced test. We can examine the accuracy of a criterion-referenced standard using a 2x2 contingency table, and test reliability examines the consistency of classification.
Limitations of criterion-referenced standards:
1) It is not always possible to find a criterion that explicitly defines mastery, particularly for some skills
2) Accuracy varies with the population being tested

Norm-Referenced Standard
1) Used to judge an individual's performance in relation to the performances of other members of a well-defined group
2) Valuable for comparisons among individuals when the situation requires a degree of sensitivity or discrimination in ability
3) Developed by testing a large group of people
4) Descriptive statistics are used to develop the standards
5) Percentile ranks are a common norming method (see the sketch after this section)
The characteristics used to develop norms may not result in desirable norms; body composition and blood cholesterol levels are examples where the average may not be desirable.
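As a minimal sketch of percentile-rank norming (the norming-group scores below are hypothetical, and NumPy is assumed to be available), a simple norm table can be built by computing, for each raw score, the percentage of the norming group scoring at or below it:

```python
import numpy as np

# Hypothetical raw scores (e.g., sit-up counts) from a norming group.
norm_group = np.array([18, 22, 25, 25, 27, 30, 31, 33, 35, 35,
                       36, 38, 40, 41, 42, 44, 45, 47, 50, 53])

def percentile_rank(score, scores):
    """Percent of the norming group scoring at or below the given score."""
    return 100.0 * np.mean(scores <= score)

# Simple norm table: percentile rank for each distinct raw score.
for raw in sorted(set(norm_group.tolist())):
    print(f"raw score {raw:>2}: percentile rank {percentile_rank(raw, norm_group):5.1f}")
```

In practice the norming group would be far larger, but the logic of converting raw scores to percentile ranks is the same.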
Evaluation Models
Evaluation models either describe what evaluators do or prescribe what they should do (Alkin & Ellett, 1990, p. 15).

1) Prescriptive Models
"Prescriptive models are more specific than descriptive models with respect to procedures for planning, conducting, analyzing, and reporting evaluations" (Reeves & Hedberg, 2003, p. 36).
Examples:
- Kirkpatrick: Four-Level Model of Evaluation (1959)
- Suchman: Experimental/Evaluation Model (1960s)
- Stufflebeam: CIPP Evaluation Model (1970s)

2) Descriptive Models
These are more general in that they describe the theories that undergird prescriptive models (Alkin & Ellett, 1990).
Examples:
- Patton: Qualitative Evaluation Model (1980s)
- Stake: Responsive Evaluation Model (1990s)
- Hlynka, Belland, & Yeaman: Postmodern Evaluation Model (1990s)

3) Formative Evaluation
An essential part of instructional design models. It is the systematic collection of information for the purpose of informing decisions to design and improve the product/instruction (Flagg, 1990).
Why formative evaluation? The purpose of formative evaluation is to improve the effectiveness of the instruction at its formation stage through the systematic collection of information and data (Dick & Carey, 1990; Flagg, 1990), so that learners may like the instruction and so that learners will learn from the instruction.
Guide Questions:
1) Feasibility: Can it be implemented as it is designed?
2) Usability: Can learners actually use it?
3) Effectiveness: Will learners get what they are supposed to get?
Strategies:
1) Expert review
- Content experts: the scope, sequence, and accuracy of the program's content
- Instructional experts: the effectiveness of the program
- Graphic experts: the appeal, look, and feel of the program
2) User review
- A sample of targeted learners whose background is similar to the final intended users
- Observation of users' opinions, actions, responses, and suggestions
3) Field tests
The evaluator is internal, meaning that he/she is a member of the design and development team.

4) Summative Evaluation
The collection of data to summarize the strengths and weaknesses of instructional materials in order to decide whether to maintain or adopt the materials. The strategies used are expert judgment and field trials, and the evaluator is external. The outcomes are:
1) A report or document of data
2) Recommendations
3) A rationale

A comparison between formative and summative evaluation:

Formative                                              | Summative
Revision                                               | Decision
Peer review, one-to-one, group review, & field trial   | Expert judgment, field trial
One set of materials                                   | One or several competing instructional materials
Internal evaluator                                     | External evaluator
A prescription for revising materials                  | Recommendations and rationale

Source: Dick and Carey (2003). The systematic design of instruction.

Objective-Driven Evaluation Model (1930s): R.W. Tyler
- A professor at Ohio State University
- The director of the Eight-Year Study (1934)
- Tyler's objective-driven model is derived from his Eight-Year Study
The essence: the attainment of objectives is the only criterion for determining whether a program is good or bad.
His approach: in designing and evaluating a program, set goals, derive specific behavioral objectives from the goals, establish measures of the objectives, reconcile the instruction to the objectives, and finally evaluate the program against the attainment of these objectives.
Tyler's influence: Tyler's emphasis on the importance of objectives has influenced many aspects of education.
- The specification of objectives is a major factor in virtually all instructional design models
- Objectives provide the basis for the development of measurement procedures and instruments that can be used to evaluate the effectiveness of instruction
- It is hard to proceed without the specification of objectives

Four-Level Model of Evaluation (1959): D. Kirkpatrick
Kirkpatrick's four levels:
The first level (reactions): the assessment of learners' reactions or attitudes toward the learning experience.
The second level (learning): an assessment of how well the learners grasped the instruction. Kirkpatrick suggested that a control group and a pretest/posttest design be used to assess statistically the learning of the learners as a result of the instruction.
The third level (behavior): follow-up assessment of the actual performance of the learners as a result of the instruction. It determines whether the skills or knowledge learned in the classroom setting are being used, and how well they are being used, on the job.
The final level (results):
the assessment of the changes in the organization as a result of the instruction.
"Kirkpatrick's model of evaluation expands the application of formative evaluation to the performance or job site" (Dick, 2002, p. 152).

Experimental Evaluation Model (1960s)
- The experimental model is a widely accepted and employed approach to evaluation and research.
- Suchman was identified as one of the originators and the strongest advocate of the experimental approach to evaluation.
- This approach uses such techniques as pretest/posttest and experimental group vs. control group to evaluate the effectiveness of an educational program.
- It is still popularly used today.

CIPP Evaluation Model (1970s): D.L. Stufflebeam
CIPP stands for Context, Input, Process, and Product.
- Context is about the environment in which a program would be used. This context analysis is called a needs assessment.
- Input analysis is about the resources that will be used to develop the program, such as people, funds, space, and equipment.
- Process evaluation examines the status during the development of the program (formative).
- Product evaluation assesses the success of the program (summative).
Stufflebeam's CIPP evaluation model was the most influential model in the 1970s (Reiser & Dempsey, 2002).

Qualitative Evaluation Model (1980s)
- Michael Quinn Patton, Professor, Union Institute and University, and former President of the American Evaluation Association
- Patton's model emphasizes qualitative methods, such as observations, case studies, interviews, and document analysis.
- Critics of the model claim that qualitative approaches are too subjective and that results will be biased.
- However, the qualitative approach in this model is accepted and used by many ID models, such as the Dick & Carey model.

Responsive Evaluation Model (1990s): Robert E. Stake
- He has been active in the program evaluation profession.
- He took up a qualitative perspective, particularly case study methods, in order to represent the complexity of evaluation study.
- The model emphasizes the issues, language, contexts, and standards of stakeholders.
- Stakeholders: administrators, teachers, students, parents, developers, evaluators...
- His methods are negotiated with the stakeholders in the evaluation during the development.
- Evaluators try to expose the subjectivity of their judgment, as other stakeholders do.
- The model stresses the continuous nature of observation and reporting.
- His response to critics: subjectivity is inherent in any evaluation or measurement. Evaluators endeavor to expose the origins of their subjectivity, while other types of evaluation may disguise their subjectivity by using so-called objective tests and experimental designs.

Postmodern Evaluation Model (1990s): Dennis Hlynka and Andrew R. J. Yeaman
- Advocates criticized modern technologies and positivist modes of inquiry.
- They viewed educational technologies as a series of failed innovations.
- They opposed systematic inquiry and evaluation; ID is a tool of positivists who hold onto the false hope of linear progress.
How to be a postmodernist:
- Consider concepts, ideas, and objects as texts. Textual meanings are open to interpretation.
- Look for binary oppositions in those texts.
Some usual oppositions are good/bad, progress/tradition, science/myth, love/hate, man/woman, and truth/fiction.
- Consider the critics, the minority, the alternative view; do not assume that your program is the best.
The postmodern evaluation model:
- Anti-technology, anti-progress, and anti-science
- Hard to use
- Some evaluation perspectives, such as race, culture, and politics, can be useful in the evaluation process (Reeves & Hedberg, 2003).

Fourth Generation Model: E.G. Guba and Y.S. Lincoln
Seven principles underlie their model (constructivist perspective):
1. Evaluation is a social-political process
2. Evaluation is a collaborative process
3. Evaluation is a teaching/learning process
4. Evaluation is a continuous, recursive, and highly divergent process
5. Evaluation is an emergent process
6. Evaluation is a process with unpredictable outcomes
7. Evaluation is a process that creates reality
- The outcome of evaluation is a rich, thick description based on extended observation and careful reflection.
- They recommend negotiation strategies for reaching consensus about the purposes, methods, and outcomes of evaluation.

Multiple Methods Evaluation Model: M.M. Mark and R.L. Shotland
- One plus one is not necessarily more beautiful than one.
- Multiple methods are only appropriate when they are chosen for a particularly complex program that cannot be adequately assessed with a single method.

Computer Literacy in Measurement and Evaluation
Research has found a significant relationship between teacher computer literacy and job effectiveness in secondary schools. Computer literacy is the knowledge and ability to use computers and related technology efficiently, with a range of skills covering levels from elementary use to programming and advanced problem solving. Computer literacy can also refer to the comfort level someone has with using computer programs and other applications associated with computers. Another valuable component of computer literacy is knowledge of how computers work and operate. Having basic computer skills is a significant asset in developed countries.
The following recommendations were made: The government should organize workshops and conferences for teachers on the use of modern computer application packages such as Java and Oracle, which would result in effective teaching and learning. The government should increase the number of computers available in secondary schools; when teachers have access to computers whenever the need arises, their job effectiveness would greatly improve. The government should pay more attention to teacher computer literacy so that lesson plan preparation can meet the standard set by the secondary school curriculum. The government should make it mandatory for teachers to use Microsoft PowerPoint in lesson delivery.

Chapter 2: General Measurement Concepts

Evaluation of Measurement Instruments
Reliability has to do with the consistency of the instrument.
- Internal Consistency (consistency of the items)
- Test-Retest Reliability (consistency over time)
- Interrater Reliability (consistency between raters)
- Split-Half Methods
- Alternate Forms Methods
Validity of an instrument has to do with its ability to measure what it is supposed to measure and the extent to which it predicts outcomes.
- Face Validity
- Construct & Content Validity
- Convergent & Divergent Validity
- Predictive Validity
- Discriminant Validity

Reliability is synonymous with consistency. It is the degree to which test scores for an individual test taker or group of test takers are consistent over repeated applications. No psychological test is completely consistent; however, a measurement that is unreliable is worthless.
For example, a student receives a score of 100 on one intelligence test and 114 on another; or imagine that every time you stepped on a scale it showed a different weight. Would you keep using these measurement tools? The consistency of test scores is critically important in determining whether a test can provide good measurement.

Because no unit of measurement is exact, any time you measure something (the observed score), you are really measuring two things:
True Score - the amount of the observed score that truly represents what you are intending to measure.
Error Component - the amount contributed by other variables that can affect the observed score.
For example, if you weigh yourself today at 140 lbs. and then weigh yourself tomorrow at 142 lbs., is the 2-pound increase a true measure of weight gain, or could other variables be involved? Other variables may include food intake, placement of the scale, and error in the scale itself.

Why Do Test Scores Vary? Possible Sources of Variability of Scores
- General ability to comprehend instructions
- Stable response sets (e.g., answering the "C" option more frequently)
- The element of chance in getting a question right
- Conditions of testing
- Unreliability or bias in grading or rating performance
- Motivation
- Emotional strain

Measurement Error
Any fluctuation in test scores that results from factors related to the measurement process that are irrelevant to what is being measured. The difference between the observed score and the true score is called the error score:
S_true = S_observed - S_error
Developing better tests with less random measurement error is better than simply documenting the amount of error.
Measurement error is reduced by:
- Writing items clearly
- Making instructions easily understood
- Adhering to proper test administration
- Providing consistent scoring

Determining Reliability
There are several ways that a measurement's reliability can be determined, depending on the type of measurement and the supporting data required. They include:

Internal Consistency
Measures the reliability of a test solely from the number of items on the test and the intercorrelation among the items; therefore, it compares each item to every other item. If a scale is measuring a construct, then overall the items on that scale should be highly correlated with one another.
There are two common ways of measuring internal consistency:
1. Cronbach's Alpha: .80 to .95 (excellent), .70 to .80 (very good), .60 to .70 (satisfactory), below .60 (suspect)
2. Item-Total Correlations - the correlation of each item with the remainder of the items (.30 is the minimum acceptable item-total correlation)
Internal consistency estimates are a function of:
- The number of items: if we think of each test item as an observation of behaviour, more items strengthen the estimate because there is more of the behaviour to observe.
- Average intercorrelation: the extent to which each item represents an observation of the same thing. The more consistently the same construct is observed, the higher the reliability.
Both quantities can be computed directly from an item-by-respondent score matrix, as in the sketch below.
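A minimal sketch, assuming hypothetical 5-point item responses and that NumPy is available, of how Cronbach's alpha and corrected item-total correlations might be computed from an item score matrix:

```python
import numpy as np

# Hypothetical responses: rows = test takers, columns = items (5-point scale).
X = np.array([
    [4, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 2, 3],
    [4, 4, 5, 4, 4],
    [1, 2, 2, 1, 2],
], dtype=float)

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_total_correlations(items):
    """Correlate each item with the sum of the remaining items (corrected item-total r)."""
    return [np.corrcoef(items[:, i], np.delete(items, i, axis=1).sum(axis=1))[0, 1]
            for i in range(items.shape[1])]

print(f"Cronbach's alpha: {cronbach_alpha(X):.2f}")
print("Item-total correlations:", [round(r, 2) for r in item_total_correlations(X)])
```

The resulting alpha and item-total correlations can then be compared against the rule-of-thumb thresholds listed above.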
Test-Retest Reliability
Test-retest reliability is usually measured by computing the correlation coefficient between the scores of two administrations. The amount of time allowed between measures is critical: the shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation, because the two observations are related over time. The optimum time between administrations is 2 to 4 weeks.
If a scale is measuring a construct consistently, then there should not be radical changes in the scores between administrations unless something significant happened. The rationale behind this method is that the difference between the scores of the test and the retest should be due to measurement error alone.
It is hard to specify one acceptable test-retest correlation, since what is considered acceptable depends on the type of scale, the use of the scale, and the time between testings. For example, it is not always clear whether differences in test scores should be regarded as measurement error or as real change. Possible sources of difference in scores between tests: experience, the characteristic being measured may change over time (e.g., a reading test), and carryover effects (e.g., remembering the test).
A minimum correlation of at least .50 is expected; the higher the correlation (in a positive direction), the higher the test-retest reliability. The biggest problem with this type of reliability is what is called the memory effect: a respondent may recall the answers from the original test, thereby inflating the reliability. Also, is it practical?

Interrater Reliability
Whenever you use humans as part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency: we are easily distracted, we get tired of doing repetitive tasks, we daydream, and we misinterpret. For some scales it is important to assess interrater reliability, which means that if two different raters scored the scale using the scoring rules, they should attain the same result. Interrater reliability is usually measured by computing the correlation coefficient between the scores of two raters for the same set of respondents. Here the criterion of acceptability is fairly high (e.g., a correlation of at least .90), but what is considered acceptable will vary from situation to situation.

Split-Half & Odd-Even Reliability
Split Half - refers to determining the correlation between the first half of the measurement and the second half of the measurement (i.e., we would expect answers to the first half to be similar to the second half).
Odd-Even - refers to the correlation between the even items and the odd items of a measurement tool.
In this sense, we are using a single test to create two tests, eliminating the need for additional items and multiple administrations. Since in both of these types only one administration is needed and the groups are determined by the internal components of the test, this is referred to as an internal consistency measure.
Possible Advantages
- Simplest method; easy to perform
- Time and cost effective
Possible Disadvantages
- There are many ways of splitting, each split yields a somewhat different reliability estimate, and it is unclear which is the real reliability of the test.
A short sketch of the odd-even approach follows.
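As a minimal sketch with hypothetical right/wrong item data (NumPy assumed available), an odd-even split can be correlated and then adjusted with the standard Spearman-Brown step-up formula, a correction not named in the notes above but commonly paired with split-half estimates, to estimate full-length reliability:

```python
import numpy as np

# Hypothetical item scores (1 = correct, 0 = incorrect): rows = students, columns = items.
X = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
], dtype=float)

odd_half = X[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, ...
even_half = X[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # half-test correlation
r_full = 2 * r_half / (1 + r_half)                # Spearman-Brown step-up to full length

print(f"Odd-even correlation: {r_half:.2f}")
print(f"Spearman-Brown corrected reliability: {r_full:.2f}")
```

The step-up is used because each half contains only half as many items as the full test, and shorter tests tend to be less reliable.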
Parallel/Alternate Forms Methods
This refers to the administration of two alternate forms of the same measurement device and a comparison of the scores. Both forms are administered to the same person, and the scores are correlated. If the two produce the same results, then the instrument is considered reliable. A correlation between the two forms is computed just as in the test-retest method (a brief sketch follows below).
Possible Advantages
- Eliminates the problem of the memory effect
- Reactivity effects (i.e., the experience of taking the test) are also partially controlled
- Can address a wider sampling of the entire domain than the test-retest method
Possible Disadvantages
- Are the two forms of the test actually measuring the same thing?
- More expensive; requires additional work to develop two measurement tools
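In either the test-retest or the parallel-forms case, the reliability estimate is simply the Pearson correlation between the two columns of scores; a minimal sketch with hypothetical scores (NumPy assumed available):

```python
import numpy as np

# Hypothetical scores for the same ten people on two parallel forms (or two occasions).
form_a = np.array([55, 62, 48, 70, 66, 59, 73, 51, 64, 68], dtype=float)
form_b = np.array([57, 60, 50, 72, 63, 61, 75, 49, 66, 70], dtype=float)

r = np.corrcoef(form_a, form_b)[0, 1]   # reliability estimate
print(f"Parallel-forms / test-retest reliability: r = {r:.2f}")
```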
Factors Affecting Reliability
Administrator Factors
Poor or unclear directions given during administration, or inaccurate scoring, can affect reliability. For example, say you were told that your scores on being social determined your promotion; the result is more likely to be what you think they want than what your behavior actually is.
Number of Items on the Instrument
The larger the number of items, the greater the chance for high reliability. For example, it makes sense that twenty questions on your leadership style are more likely to produce a consistent result than four questions. Remedy: use longer tests or accumulate scores from short tests.
The Instrument Taker
For example, if you took an instrument in August when you had a terrible flu and then in December when you were feeling quite good, we might see a difference in your response consistency. If you were under considerable stress of some sort, or if you were interrupted while answering the instrument questions, you might give different responses.
Heterogeneity of the Items
The greater the heterogeneity (differences in the kind of questions or the difficulty of the questions) of the items, the greater the chance for high reliability correlation coefficients.
Heterogeneity of the Group Members
The greater the heterogeneity of the group members in the preferences, skills, or behaviors being tested, the greater the chance for high reliability correlation coefficients.
Length of Time between Test and Retest
The shorter the time, the greater the chance for high reliability correlation coefficients. As we have experiences, we tend to adjust our views a little from time to time; therefore, the interval between the first time we took an instrument and the second time is really an "experience" interval. Experience happens, and it influences how we see things. Because internal consistency involves no time lapse, one can expect it to have the highest reliability correlation coefficient.
How high should reliability be? A reliability coefficient of .80 indicates that 20% of the variability in test scores is due to measurement error.

Generalizability Theory
A theory of measurement that attempts to determine the sources of consistency and inconsistency. It allows for the evaluation of interaction effects from different types of error sources. It is necessary to obtain multiple observations for the sample group of individuals on all the variables that might contribute to measurement error (e.g., scores across occasions, across scorers, across alternative forms). It is useful in complex measurement situations when:
1. The conditions of measurement affect test scores.
2. Test scores are used for several different purposes.
For example, measurement involving subjectivity (e.g., interviews, rating scales) involves bias; therefore, human judgement can be considered a "condition of measurement." If feasible, generalizability theory is a more thorough procedure for identifying the error components that may enter scores.

Standard Error of Measurement (SEM)
The SEM is a statistic used to build a confidence interval around an obtained score. It represents the hypothetical distribution we would have if someone took a test an infinite number of times. It allows one to predict the range of fluctuation that is likely to occur in a single individual's score because of irrelevant, chance factors, and it is used in analyzing the reliability of the test in obtaining the "true" score. It indicates how much variability in test scores can be expected as a result of measurement error.
The SEM is a function of two factors: the reliability of the test and the variability of the test scores. The formula is:
SEM = SD x sqrt(1 - reliability)
The most common use of the SEM is the production of confidence intervals. The SEM is an estimate of how much error there is in a test and can be interpreted like a standard deviation: sixty-eight percent of the time the true score will fall between one SEM below and one SEM above the obtained score, so we can be 68% sure that the student's true score lies within plus or minus one SEM. Within plus or minus two SEM, the true score will be found roughly 95% of the time. Put another way, if the student took the test 100 times, about 68 times the true score would fall within plus or minus one SEM.
A test must be RELIABLE to be VALID.
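A minimal worked sketch of the SEM and the resulting confidence bands, using hypothetical values (a test with a standard deviation of 10, a reliability of .80, and an obtained score of 75):

```python
import math

sd = 10.0           # hypothetical standard deviation of the test scores
reliability = 0.80  # hypothetical reliability coefficient
obtained = 75.0     # hypothetical obtained score for one student

sem = sd * math.sqrt(1 - reliability)   # SEM = SD * sqrt(1 - reliability)

print(f"SEM = {sem:.2f}")
print(f"68% band:  {obtained - sem:.1f} to {obtained + sem:.1f}")        # +/- 1 SEM
print(f"~95% band: {obtained - 2*sem:.1f} to {obtained + 2*sem:.1f}")    # +/- 2 SEM
```

With these values the SEM is about 4.5 points, so the student's true score is estimated to lie between roughly 70.5 and 79.5 about 68% of the time.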
VALIDITY
Validity depends on the PURPOSE. For example, a ruler may be a valid measuring device for length, but it is not very valid for measuring volume.
- Validity is about measuring what "it" is supposed to measure
- It is a matter of degree (how valid?)
- It is specific to a particular purpose
- It must be inferred from evidence; it cannot be directly measured
Evidence includes learning outcomes, content coverage (relevance), and the level and type of student engagement (cognitive, affective, psychomotor): is it appropriate?

Types of Validity
Does the test measure what it is supposed to measure?
Example: Let's say you are interested in measuring "propensity towards violence and aggression." By simply looking at the following items, state which ones qualify to measure the variable of interest:
- Have you been arrested?
- Have you been involved in physical fighting?
- Do you get angry easily?
- Do you sleep with your socks on?
- Is it hard to control your anger?
- Do you enjoy playing sports?

Construct Validity
Does the test measure the "human" characteristic(s) it is supposed to? Examples of constructs or "human" characteristics: mathematical reasoning, verbal reasoning, musical ability, spatial ability, mechanical aptitude, motivation. Construct validity is applicable to PBA/authentic assessment. Each construct is broken down into its component parts; e.g., "motivation" can be broken down into interest, attention span, hours spent, assignments undertaken and submitted, etc. All of these sub-constructs put together measure "motivation."

Content Validity
How well do elements of the test relate to the content domain? How closely does the content of the questions in the test relate to the content of the curriculum? It directly relates to instructional objectives and their fulfillment. It is a major concern for achievement tests (where content is emphasized): can you test students on things they have not been taught? Content validity is established through instructional objectives (looking at your list) and a Table of Specification.
Example: At the end of the chapter, the student will be able to do the following:
1. Explain what stars are
2. Discuss the types of stars and galaxies in our universe
3. Categorize different constellations by looking at the stars
4. Differentiate between our star, the sun, and all other stars

Table of Specification (content areas by categories of performance / mental skills):

Content areas             Knowledge   Comprehension   Analysis   Total
1. What are stars?
2. Our star, the sun
Grand Total

Criterion Validity
The degree to which performance on a test (the predictor) correlates with performance on relevant criterion measures (a concrete criterion in the "real" world). If they correlate highly, the test (predictor) is a valid one. For example, if you taught skills relating to public speaking and had students take a test on it, the test can be validated by looking at how it relates to the actual public-speaking performance of students inside or outside the classroom.
Two types of criterion validity:
- Concurrent criterion validity: how well performance on a test estimates current performance on some valued measure (criterion). E.g., a test of dictionary skills can estimate students' current skill in the actual use of a dictionary, verified by observation.
- Predictive criterion validity: how well performance on a test predicts future performance on some valued measure (criterion). E.g., a reading readiness test might be used to predict students' achievement in reading.
Both are only possible IF the predictors are VALID.

Consequences Validity
The extent to which the assessment served its intended purpose. Did the test improve performance? Motivation? Independent learning? Did it distort the focus of instruction? Did it encourage or discourage creativity? Exploration? Higher-order thinking?

Factors that can Lower Validity
- Unclear directions
- Difficult reading vocabulary and sentence structure
- Ambiguity in statements
- Inadequate time limits
- Inappropriate level of difficulty of the test items
- Poorly constructed test items
- Test items inappropriate for the outcomes being measured
- Tests that are too short
- Improper arrangement of items (complex to easy?)
- Identifiable patterns of answers
- Administration and scoring
- Students
- Nature of the criterion

Standard Error of Estimate
SEE = S_y x sqrt(1 - r^2)
A validity measure: the degree of error in estimating a score from the criterion relationship.

Methods of Obtaining a Criterion Measure
- Actual participation
- Performing the criterion
- Predictive measures

Interpreting "r"

Criterion-Referenced Measurement
Criterion-referenced testing is also known as mastery learning.
Standard Development
- Judgmental: uses experts; typical in human performance
- Normative: theoretically accepted criteria
- Empirical: cutoff based on available data
- Combination: expert judgment and norms, typically combined
Advantages of Criterion-Referenced Measurement
- Represents specific, desired performance levels linked to a criterion
- Independent of the percentage of the population that meets the standard
- If the standard is not met, specific diagnostic evaluations can be made
- The degree of performance is not important; reaching the standard is
- Performance is linked to specific outcomes
- Individuals know exactly what is expected of them
Limitations of Criterion-Referenced Measurement
- Cutoff scores always involve subjective judgment
- Misclassifications can be severe
- Motivation can be impacted (frustrated/bored)

Setting a Cholesterol "Cut-Off"
[Figure: number of deaths plotted against cholesterol level (about 160 to 270 mg/dl), illustrating how a cut-off score is chosen.]

Statistical Analysis of CRTs
- Nominal data (categorical: major, gender, pass/fail, etc.)
- Contingency table development (2x2)
- Chi-square analysis (used with categorical variables)
- Proportion of agreement (see below)
- Phi coefficient (correlation for dichotomous, yes/no variables); see the sketch after this list
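A minimal sketch of the chi-square analysis and phi coefficient for a 2x2 pass/fail table, assuming hypothetical counts and that NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = field test (fail, pass), columns = criterion (fail, pass).
table = np.array([[30, 10],
                  [ 8, 52]])

chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Signed phi coefficient for a 2x2 table: (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)).
a, b = table[0]
c, d = table[1]
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
print(f"phi = {phi:.2f}")
```

A large chi-square (small p) and a phi coefficient near 1 indicate strong agreement between the two dichotomous classifications.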
Proportion of Agreement (P)
Sum the correctly classified cells and divide by the total number classified; for a 2x2 table, P = (N1 + N4) / (N1 + N2 + N3 + N4), where N1 and N4 are the two agreement cells (see the tables and the sketch below). Examples are worked on the board.

Considerations with CRTs
The same as for norm-referenced testing:
Reliability (consistency)
- Equivalence: is the PACER equivalent to the 1-mile run/walk?
- Stability: does the same test give consistent findings?
Validity (truthfulness of measurement)
- Criterion-related: concurrent or predictive
- Construct-related: establish cut scores

Meeting Criterion-Referenced Standards: Possible Decisions

                             Truly Below Criterion      Truly Above Criterion
Did not achieve standard     Correct decision           False negative decision
Did achieve standard         False positive decision    Correct decision

CRT Reliability: test/retest of a single measure

                 Day 2: Fail    Day 2: Pass
Day 1: Fail      N1             N2
Day 1: Pass      N3             N4

CRT Validity: use of a field test against a criterion measure

                    Criterion: Fail    Criterion: Pass
Field test: Fail    N1                 N2
Field test: Pass    N3                 N4
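From either table above, the proportion of agreement is the sum of the two agreement cells divided by the total; a minimal sketch with hypothetical counts for N1 through N4:

```python
# Hypothetical test/retest classification counts for a criterion-referenced test.
n1 = 40   # fail on both days (agreement)
n2 = 6    # fail day 1, pass day 2
n3 = 9    # pass day 1, fail day 2
n4 = 45   # pass on both days (agreement)

total = n1 + n2 + n3 + n4
p_agreement = (n1 + n4) / total   # P = agreement cells / total classified

print(f"Proportion of agreement P = {p_agreement:.2f}")
```

Here P is 0.85, meaning the test classified 85% of the participants consistently across the two days.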
Chapter 3: Nature and Administration of Tests

Reliability, Objectivity, and Validity
- The three most important characteristics of a test
- Acceptable levels are determined by both the testing situation itself and the values others have previously obtained
- A minimal coefficient is 0.70

Content-Related Attributes
A) Important Attributes
- No more than 10% of the time should involve testing
- Only important skills and abilities should be measured
B) Discrimination
- The test should discriminate among different ability groups throughout the total range of ability
- There should be many different scores, normally distributed
- Test difficulty: hard enough that no one gets a perfect score, yet easy enough that no one gets a zero
C) Resemblance
- In order to ensure relevancy and validity, individuals taking the test must be required to use good form, follow the rules of the activity, and perform acts characteristic of the activity
D) Specificity
- The test should be as specific as possible for whatever is being measured
- Confounding variables or factors should be minimized
- For example, performing stability-ball push-ups as a measure of upper-body muscular strength and endurance is confounded, or affected, by balance
E) Unrelated Measures
- Often an attribute (e.g., health-related fitness) has several components (e.g., cardiorespiratory endurance, strength, muscular endurance, flexibility, and body composition)
- Tests in a battery should be unrelated to each other but highly related to the attribute
- If two tests in the battery are highly related to each other, they are measuring the same ability or factor
- Keep the test that is most highly related to the attribute, because it has more validity, and drop the other test

Student and Participation Concerns
A) Appropriateness to Students and Participants
- Performance is influenced by the person's maturity, gender, and experience
- For example, fitness test score standards should be based on age and gender
- Skill test standards should be based on age, gender, skill level, strength, and other capacities
- Performance is affected by disabilities; therefore, tests designed for individuals without disabilities are usually not acceptable for individuals with disabilities
B) Individual Scores
- A person's test scores should not be affected by another person's performance
- The ball should come from a machine rather than from a serve by a person when skill-testing the ability to return a tennis serve
C) Enjoyable
- Increases motivation to do well on the test, particularly when the test is well understood
- Make the test interesting, challenging, and comfortable to take
D) Safety
- Examine all aspects of the test for safety and possible risks of injury
- Confidentiality and privacy of testing should be protected
- Be sensitive to issues that may require testing people one at a time rather than in a group
- Motivation to score at maximum potential increases the reliability and validity of the tests

Administrative Concerns
A) Mass Testability
- Required for larger groups
- Decreases test administration time
- Split the group in half and have one half help test the other half, and vice versa
B) Minimal Practice
- Familiarity, either from previous testing, from the program or class, or from practice sessions prior to the testing day, will decrease both explanation and practice time
C) Minimal Equipment and Personnel
- Tests that require a lot of equipment and/or administrative personnel are often impractical
- Insufficient training of test administrators contributes to a lack of test score objectivity
D) Ease of Preparation
- Select tests that are easy to set up over ones that take more time
E) Adequate Directions
- Directions should specify how the test is set up, the preparation of the individuals to be tested, and the administration and scoring procedures
F) Norms
- Recent and appropriate norms provided with or for a test can save the time necessary to develop local norms, or at least serve as temporary standards until local norms can be developed
G) Useful Scores
- The test should yield a useful score expressed in a single unit of measurement that can be inserted into a formula with little effort

Pretest Procedures and Pilot Testing
A) Know the Test
- Carefully read the directions or test manual several times, thinking about the test as you read
- Avoid overlooking small details about procedures, as this will lead to mistakes
B) Develop Test Procedures
- Select the most efficient method of testing: group vs. individual testing; one tester, multiple testers, or testing individuals in pairs
- Identify exact scoring requirements and units of measurement (preferably a single unit of measurement)
- Anticipate mistakes and plan what you will do when mistakes are made
- Consider all aspects of safety
- Practice administration of the tests
C) Develop Directions
- Administrative procedures
- Instructions on performance
- Scoring procedure and policy on incorrect performance
- Hints on techniques to improve scores
D) Prepare the Students or Program Participants Psychologically and Physiologically for Testing
- Notify individuals about the testing in advance, what the test is, and what the test involves
E) Plan the Warm-Up and Test Trials
- Warm-up improves reliability; it should be planned and specific to the skill being tested
- A supervised warm-up led by the tester is most effective
- The first few trials of a multiple-trial test may serve as a warm-up
F) Secure the Equipment and Prepare the Testing Facility
- All equipment should be on hand and the facility prepared before the day of the test
- This saves time and avoids last-minute problems
G) Make Scoring Sheets
- Prepare group or individual scorecards
- Individual scorecards are better when a person rotates among testing stations or when individuals are testing each other
H) Estimate the Time Needed
- Minimizes confusion and maximizes the use of available time

Giving the Test
A) Preparation
- Warm-up or practice
- Explain the test instructions and procedures
- Demonstrate the test
- Let the person complete a practice trial of the test
B) Motivation
- Give all participants the same degree of motivation and encouragement
C) Safety
- Watch closely for any safety problems or potential injuries

Posttest Procedures
A) Analyze the Test Scores
- Build a database
- Enter the data into the computer
- Complete the statistical analysis
- Interpret the statistical analysis
- Report test results to the people tested in a confidential and motivating manner
B) Record Keeping
- Keep scoring sheets and data analysis documents in a secure and confidential file cabinet

Measuring Individuals with Disabilities
- Finding or developing a test or battery of tests for each possible combination of disabling conditions
- Finding norm-referenced and/or criterion-referenced standards
- Shorter attention spans and problems with complicated directions; both over- and under-verbalizing can hinder performance and reduce reliability
- It is often difficult to have individuals test each other in pairs
- Vast heterogeneity in needs and performance levels
- Wide differences in ability from individuals without disabilities, making the reliability and validity of common tests questionable
- Lack of experience with test performances, thus decreasing the reliability and validity of common tests
- Difficulty in staying on task and giving maximum exertion, because they do not understand why or how to give it, or do not understand the test protocol

Measuring Individuals From Different Populations
- All people do not respond the same way to testing
- The same test, test protocol, and performance score expectations cannot be applied to all people
- Understanding the characteristics of the people in the population tested, and experience in testing people in that population, is vital to obtaining test scores upon which you can base valid interpretations

REFERENCES
- Dick, W. (2002). Evaluation in instructional design: The impact of Kirkpatrick's four-level model. In Reiser, R.A., & Dempsey, J.V. (Eds.), Trends and issues in instructional design and technology. New Jersey: Merrill Prentice Hall.
- Dick, W., & Carey, L. (1990). The systematic design of instruction. Florida: HarperCollins Publishers.
- Reeves, T., & Hedberg, J. (2003). Interactive learning systems evaluation. Educational Technology Publications.
- Reiser, R.A. (2002). A history of instructional design and technology. In Reiser, R.A., & Dempsey, J.V. (Eds.), Trends and issues in instructional design and technology. New Jersey: Merrill Prentice Hall.
- Stake, R.E. (1990). Responsive evaluation. In Walberg, H.J., & Haertel, G.D. (Eds.), The international encyclopedia of educational evaluation (pp. 15-77). New York: Pergamon Press.
