There are different factors that affect the reliability of a measure. The reliability of a measure can
be high or low depending on the following factors:
1. The number of items in a test - The more items a test has, the higher the likelihood of reliability. The probability of obtaining consistent scores is high because of the large pool of items.
2. Individual differences of participants - Every participant possesses characteristics that affect their performance in a test, such as fatigue, concentration, innate ability, perseverance, and motivation. These individual factors change over time and affect the consistency of the answers in a test.
3. External environment - The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, which could affect changes in the responses of examinees in a test.
The different types of reliability are indicated below, along with how each is done. Notice that statistical analysis is needed to determine test reliability.
1. Test-retest
How is this reliability done? Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test, tasks in physical education).
What statistics is used? Correlate the test scores from the first and the next administration. A significant and positive correlation indicates that the test has temporal stability over time. Correlation refers to a statistical procedure in which a linear relationship is expected between two variables. You may use the Pearson Product Moment Correlation, or Pearson r, because test data are usually in an interval scale (refer to a statistics book for Pearson r).

2. Parallel Forms
How is this reliability done? Parallel forms are applicable if there are two versions of the test. This is usually done when the test is repeatedly used for different groups, such as entrance examinations and licensure examinations. Different versions of the test are given to different groups of examinees.
What statistics is used? Correlate the test results for the first form and the second form. A significant and positive correlation coefficient is expected; it indicates that the responses in the two forms are the same or consistent. Pearson r is usually used for this analysis.

3. Split-Half
How is this reliability done? Split-half is applicable when the test has a large number of items.
What statistics is used? Correlate the two sets of scores using Pearson r. After the correlation, apply another formula called the Spearman-Brown coefficient. The coefficients obtained using Pearson r and Spearman-Brown should be significant and positive to mean that the test has internal consistency reliability.

4. Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
How is this reliability done? This technique works well when the assessment tool has a large number of items. It is also applicable to scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistics is used? A statistical analysis called Cronbach's alpha or the Kuder-Richardson is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.

5. Inter-rater Reliability
How is this reliability done? Inter-rater reliability is applicable when the assessment requires multiple raters.
What statistics is used? A statistical analysis called Kendall's coefficient of concordance (Kendall's W) is used to determine whether the ratings provided by multiple raters agree with each other. A significant coefficient indicates that the raters concur or agree with each other in their ratings.
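As a rough sketch (not from the source), the Pearson r and Spearman-Brown computations mentioned above can be illustrated in Python. The score lists are invented sample data for two administrations of the same test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

def spearman_brown(r_half):
    """Full-test reliability estimated from a half-test correlation."""
    return 2 * r_half / (1 + r_half)

first = [12, 15, 9, 20, 17, 11]    # invented: first administration
second = [13, 14, 10, 19, 18, 12]  # invented: second administration

r = pearson_r(first, second)
print(round(r, 2))                 # a high positive r suggests temporal stability
print(round(spearman_brown(r), 2)) # Spearman-Brown step used in split-half analysis
```

A correlation near 1.0 here would be read as evidence of consistency across the two administrations; in practice the coefficient must also be tested for statistical significance.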
You will notice in the table that statistical analysis is required to determine the reliability of a measure.
The very basis of statistical analysis to determine reliability is the use of linear regression.
1. Linear regression
Linear regression is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants. When the two sets of scores are plotted in a graph (with an X- and a Y-axis), they tend to form a straight line. The straight line formed by the two sets of scores can produce a linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores. This correlation can be seen in the graph shown, which is called a scatterplot. Each point in the scatterplot is a respondent with two scores (one for each test).
Sample graph
Formula:
3. Difference between a positive and a negative correlation
When the value of the correlation coefficient is positive, it means that the higher the scores in X, the higher the scores in Y. This is called a positive correlation. In the case of the two spelling scores, a positive correlation is obtained. When the value of the correlation coefficient is negative, it means that the higher the scores in X, the lower the scores in Y, and vice versa. This is called a negative correlation. When the same test is administered to the same group of participants, a positive correlation usually indicates reliability or consistency of the scores.
Formula:
What is test validity?
A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, then each of the five factors should have items that are highly correlated. If an entrance exam is valid, it should predict students' grades after the first semester.
Content Validity
What it means: The items represent the domain being measured.
How it is established: The items are compared with the objectives of the program. The items need to directly measure the objectives (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face Validity
What it means: The test is presented well, free of errors, and administered well.
How it is established: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
What it means: A measure should predict a future criterion. An example is an entrance exam predicting the grades of students after the first semester.
How it is established: A correlation coefficient is obtained where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
What it means: The components or factors of the test should contain items that are strongly correlated.
How it is established: Pearson r can be used to correlate the items for each factor. However, there is also a technique called factor analysis to determine which items are highly correlated to form a factor.

Concurrent Validity
What it means: Two or more measures that assess the same characteristic are present for each examinee.
How it is established: The scores on the measures should be correlated.

Convergent Validity
What it means: The components or factors of a test are hypothesized to have a positive correlation.
How it is established: Correlation is done for the factors of the test.

Divergent Validity
What it means: The components or factors of a test are hypothesized to have a negative correlation. An example is correlating the scores in a test on intrinsic and extrinsic motivation.
How it is established: Correlation is done for the factors of the test.
How to determine if an item is easy or difficult?
An item is difficult if the majority of students are unable to provide the correct answer. An item is easy if the majority of students are able to answer it correctly. An item can discriminate if the examinees who score high in the test answer more items correctly than the examinees who got low scores.
Below is a dataset of five items on the addition and subtraction of integers.
Follow the procedure to determine the difficulty and discrimination of each item.
1. Get the total score of each student and arrange scores from highest to lowest.
2. Obtain the upper and lower 27% of the group. Multiply 0.27 by the total number of students, and you will get a value of 2.7, which rounds to 3. Get the top three students and the bottom three students based on their total scores. The top three students are students 2, 5, and 5. The bottom three students are students 7, 8, and 4. The rest of the students are not included in the item analysis.
3. Obtain the proportion correct for each item. This is computed for the upper 27% group and the lower 27% group. It is done by summing the correct answers per item and dividing by the number of students in the group.
4. The item difficulty is obtained using the following formula:

Item difficulty = (pH + pL) / 2

where pH is the proportion correct in the upper group and pL is the proportion correct in the lower group.
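The steps above can be sketched in Python. The answer matrix below is invented (1 = correct, 0 = wrong, ten students, five items), and the discrimination index used, pH minus pL, is the commonly taught one rather than a formula stated in the source.

```python
# Invented data: each row is one student's answers to five items (1 = correct)
answers = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

# Steps 1-2: rank students by total score, take the upper and lower 27%
k = round(0.27 * len(answers))            # 0.27 * 10 = 2.7, rounds to 3
ranked = sorted(answers, key=sum, reverse=True)
upper, lower = ranked[:k], ranked[-k:]

# Steps 3-4: proportion correct per item in each group, then the indices
for item in range(5):
    pH = sum(s[item] for s in upper) / k
    pL = sum(s[item] for s in lower) / k
    difficulty = (pH + pL) / 2            # formula from the text
    discrimination = pH - pL              # common discrimination index
    print(item + 1, round(difficulty, 2), round(discrimination, 2))
```

A difficulty near 1.0 marks an easy item (most answered it correctly), a value near 0 a difficult one, and a large positive discrimination means the high scorers outperformed the low scorers on that item.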
As you can see in Table 7.1, the test scores are presented as a simple list of raw scores. Raw scores are easy to get because they are the scores obtained from administering a test, a questionnaire, or any inventory or rating scale that measures knowledge, skills, or other attributes of interest. But as presented in the above table, how do these numbers appeal to you? Most likely, they look neither interesting nor meaningful.
Following are some conventions in presenting test data grouped in frequency distribution:
1. As much as possible, the sizes of the class intervals should be equal. Class intervals that are multiples of 5, 10, 100, etc. are often desirable. At times, when large gaps exist in the data and unequal class intervals are used, such intervals may cause inconvenience in the preparation of graphs and the computation of certain descriptive statistical measures. The following formula can be useful in estimating the necessary class interval:

i = (H - L) / C

where
i = size of the class intervals
H = highest test score
L = lowest test score
C = number of classes
2. Start the class interval at a value which is a multiple of the class width. In Table 7.3, we used a class interval of 5, so we start with the class value of 20, which is a multiple of 5 and where the interval 20-24 includes the lowest test score of 21, as seen in Table 7.1.
3. As much as possible, open-ended class intervals should be avoided, e.g., "100 and below" or "150 and above." These will cause some problems in graphing and in the computation of descriptive statistical measures.
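The conventions above can be sketched in Python with invented scores: estimate the class width with i = (H - L) / C, round it up to a convenient value, start the first interval at a multiple of the width, and tally the frequencies.

```python
import math

# Invented raw test scores
scores = [21, 24, 28, 30, 33, 35, 35, 38, 41, 44, 45, 47, 50, 52, 55]

H, L, C = max(scores), min(scores), 7     # aim for about 7 classes
i = math.ceil((H - L) / C)                # (55 - 21) / 7 -> width of 5
start = (L // i) * i                      # 20, a multiple of the width

# Build the grouped frequency distribution
freq = {}
lower = start
while lower <= H:
    upper = lower + i - 1                 # intervals 20-24, 25-29, ...
    freq[f"{lower}-{upper}"] = sum(lower <= s <= upper for s in scores)
    lower += i

print(freq)
```

The resulting dictionary is the frequency table itself; cumulative frequencies (for a cumulative frequency polygon) would just be a running sum over these counts.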
How do we present test data graphically?
You must be familiar with the saying, "A picture is worth a thousand words." In a similar vein, a graph can be worth a hundred or a thousand numbers. The use of tables may not be enough to give a clear picture of the properties of a group of test scores.
There are many types of graphs, but the more common methods of graphing a frequency distribution are the following:
1. Histogram. A histogram is a type of graph appropriate for quantitative data such as test scores. This graph consists of columns: each has a base that represents one class interval, and its height represents the number of observations, or simply the frequency, in that class interval.
Sample graph
2. Frequency Polygon. This is also used for quantitative data, and it is one of the most commonly used graphs for presenting test scores. It is the line-graph counterpart of the histogram.
Sample graph
3. Cumulative Frequency Polygon. This graph is quite different from a frequency polygon because the cumulative frequencies are plotted. In addition, each point is plotted above the exact upper limit of its interval. As such, a cumulative frequency polygon gives a picture of the number of observations that fall below a certain score instead of the frequency within a class interval.
Sample graph
4. Bar Graph. This graph is often used to present frequencies in categories of a qualitative variable. It looks very similar to a histogram and is constructed in the same manner, but spaces are placed between consecutive bars. The columns represent the categories, and the height of each bar, as in a histogram, represents the frequency. If experimental data are graphed, the independent variable in categories is usually plotted on the x-axis while the dependent variable, the test score, is on the y-axis.
Sample graph
5. Box-and-Whisker Plots. This is a very useful graph depicting the distribution of test scores through their quartiles. The first quartile, Q1, is the point on the test scale below which 25% of the scores lie. The second quartile is the median, which divides the scores into the upper 50% and lower 50%. The third quartile, Q3, is the point above which 25% of the scores lie. The data on the test scores of 100 college students produced this image using the box-plot approach.
Sample graph
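As a brief sketch, the three quartiles that a box-and-whisker plot depicts can be computed with Python's standard `statistics` module; the score list below is invented.

```python
import statistics

# Invented test scores for 16 students
scores = [55, 62, 64, 68, 70, 71, 73, 75, 78, 80, 82, 85, 88, 90, 95, 98]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4)
print(q1, q2, q3)  # 25% of scores lie below Q1; Q2 is the median
```

The box in the plot spans Q1 to Q3 (the interquartile range), with a line at the median Q2 and whiskers extending toward the extreme scores.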
6. Pie Graph. One commonly used method to represent categorical data is the use of a circle graph. You have learned in basic mathematics that there are 360° in a full circle. As such, the categories can be represented by slices of the circle that appear like a pie; thus, the name pie graph. The size of each slice is determined by the percentage of students who belong in each category.
Sample graph
What are the purposes of grading and reporting learners' test performance?
There are various reasons why we assign grades and report learners' test performance. Grades are alphabetical or numerical symbols/marks that indicate the degree to which learners are able to achieve the intended learning outcomes. Grades do not exist in a vacuum but are part of the instructional process and serve as a feedback loop between the teacher and learners.
Grades also give the parents, who have the greatest stake in learners' education, information about their children's achievements. They provide teachers some bases for improving their teaching and learning practices and for identifying learners who need further educational intervention.
Number Right Scoring (NR) - entails assigning positive values only to correct answers while giving a score of zero to incorrect answers. The test score is the sum of the scores for correct responses.
Negative Marking (NM) - entails assigning positive values to correct answers while penalizing learners for incorrect responses (i.e., the right-minus-wrong correction method). In this model, a fraction of the number of wrong answers is subtracted from the number of correct answers.
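The contrast between the two methods can be sketched in Python. The answer key and responses are invented, and the 1/(k-1) penalty for a k-option item is the usual right-minus-wrong correction, assumed here rather than taken from the source.

```python
def number_right(key, responses):
    """NR: one point per correct answer, zero otherwise."""
    return sum(r == k for k, r in zip(key, responses))

def negative_marking(key, responses, options=4):
    """NM: subtract a fraction of the wrong answers (omits are not penalized)."""
    right = number_right(key, responses)
    wrong = sum(r is not None and r != k for k, r in zip(key, responses))
    return right - wrong / (options - 1)

key = ["a", "c", "b", "d", "a"]          # invented answer key
responses = ["a", "c", "d", "d", None]   # None = omitted item

print(number_right(key, responses))      # 3 correct answers
print(negative_marking(key, responses))  # 3 - 1/3 for the one wrong answer
```

Note that under NM an omitted item costs nothing, while a wrong guess does; this is precisely what discourages blind guessing.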
Partial Credit Scoring Methods - attempt to determine a learner's degree or level of knowledge with respect to each response option given. This method of scoring takes into account learners' partial knowledge or mastery; it acknowledges that while learners cannot always recognize the correct response, they can discern that some response options are clearly incorrect. There are three formats of the partial credit scoring method.
•Liberal Multiple-Choice Test - It allows learners to select more than one answer to a question if they feel uncertain which option or alternative is correct.
•Elimination Testing (ET)- It instructs learners to cross out all alternatives they consider to be incorrect.
•Confidence Weighting (CW)- It asks learners to indicate what they believe is the correct answer and
how confident they are about their choice.
Multiple Answers Scoring Method - allows learners to give multiple answers for each item. With this method, learners are instructed that each item has at least one correct answer, or are told how many answers are correct. An item can be scored as solved only if all the correct response options are marked and none of the incorrect ones. Incorrect options that are marked can lead to a negative score.
Retrospective Correcting for Guessing - considers omitted or unanswered items as incorrect, forcing learners to give an answer for every item even if they do not know the correct answer. The correction for guessing is implemented later, or retroactively.
Standard-Setting - entails using standards when scoring multiple-choice items, particularly standards set through norm-referenced or criterion-referenced assessment. Standards based on norm-referenced assessment are derived from the test performance of a certain group of learners, while standards from criterion-referenced assessment are based on preset standards specified from the very start by the teacher or the school in general.
Holistic Scoring - involves giving a single, overall assessment for an essay, writing composition, or other performance-type assessment as a whole.
Analytic Scoring - involves assessing each aspect of a performance task and assigning a score for each criterion.
Primary Trait Scoring - focuses on only one aspect or criterion of a task, and a learner's performance is evaluated on only one trait. This scoring system defines a primary trait in the task that will then be scored.
Multiple-Trait Scoring - requires that an essay test or performance task is scored on more than one
aspect, with scoring criteria in place so that they are consistent with the prompt.
3.1 Pass or Fail Grade. This type of score is most appropriate if the test or assessment is used primarily or entirely to make a pass-or-fail decision. In this type of scoring, a standard or cut-off score is preset, and a learner is given a score of Pass if he or she surpasses the expected level of performance or the cut-off score. Pass or Fail scoring is most appropriate for comprehensive or licensure exams because there is no limit to the number of examinees who can pass or fail.
3.2 Letter Grade. This is one of the most commonly used grading systems. Letter grades are usually composed of a five-level grading scale labeled from A to E or F, with A representing the highest level of achievement or performance and E or F, the lowest grade, representing a failing grade. These are often used for all forms of learners' work, such as quizzes, essays, projects, and assignments.
3.3 Plus (+) and Minus (-) Letter Grades. This grading system provides a more detailed description of the level of learners' achievement or task performance by dividing each grade category into three levels, such that a grade of A can be assigned as A+, A, or A-; a grade of B as B+, B, or B-; and so on. Plus (+) and minus (-) grades provide a finer discrimination between achievement or performance levels.
3.4 Categorical Grades. This system of grading is generally more descriptive than letter grades, especially if coupled with verbal labels. Verbal labels eliminate the need for a key or legend to explain what each grade category means.
4. Norm-Referenced Grading System. In this method of grading, learners' scores are compared with those of their peers. Norm-referenced grading involves rank-ordering learners and expressing a learner's score in relation to the achievement of the rest of the group (class, grade level, etc.).
4.1 Developmental Score. This is a score that has been transformed from a raw score and reflects the average performance at a given age or grade level. There are two kinds of developmental scores: (1) grade-equivalent and (2) age-equivalent scores.
4.1.1 Grade-Equivalent Score is described as both a growth score and a status score.
4.1.2 Age-Equivalent Score indicates the age level at which a learner typically obtains such a raw score. It reflects a learner's performance in terms of chronological age as compared with those in the norm group.
4.2 Percentile Rank. This indicates the percentage of scores that fall at or below a given score.
4.3 Stanine Score. This system expresses test results in nine equal steps, which range from one (lowest) to nine (highest).
4.4 Standard Scores. These are raw scores converted into a common scale of measurement that provides a meaningful description of the individual scores within the distribution.
4.4.1 Z-score is one type of standard score. Z-scores have a mean of 0 and a standard deviation of 1.
4.4.2 T-score is another type of standard score, wherein the mean is equal to 50 and the standard deviation is equal to 10.
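A minimal sketch of these norm-referenced conversions, using invented scores: z = (X - mean) / SD, T = 50 + 10z, and, as an assumed convention not stated in the source, stanine approximated as 5 + 2z rounded and clipped to the 1-9 range.

```python
import statistics

scores = [60, 65, 70, 75, 80, 85, 90]   # invented raw scores

def z_score(x, mean, sd):
    """Standard score: distance from the mean in SD units."""
    return (x - mean) / sd

mean = statistics.mean(scores)
sd = statistics.pstdev(scores)          # population SD of this group

for x in scores:
    z = z_score(x, mean, sd)
    t = 50 + 10 * z                     # T-score: mean 50, SD 10
    stanine = min(9, max(1, round(5 + 2 * z)))  # assumed approximation
    print(x, round(z, 2), round(t, 1), stanine)
```

A raw score equal to the group mean thus maps to z = 0, T = 50, and stanine 5, which is what makes these scales directly comparable across different tests.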
1. Stick to the purpose of the assessment. Before coming up with an assessment, it is first important to determine the purpose of the test.
2. Be guided by the desired learning outcomes. The learners should be informed early on what is expected of them insofar as learning outcomes are concerned, as well as how they will be assessed and graded in the test.
3. Develop grading criteria. Grading criteria to be used in traditional tests, and performance tasks should
be made clear to the students.
4. Inform learners what scoring methods are to be used. Learners should be made aware, before the start of testing, whether their test responses are to be scored based on the number right, negative marking, or through non-conventional scoring methods. As such, the learners will be guided in marking their responses during the test.
5. Decide on what type of test scores to use. As discussed earlier, there are different ways by which students' learning can be measured and presented.
1. Identify the criteria for rating the essay. The criteria or standards for evaluating the essay should be
predetermined.
2. Determine the type of rubric to use. There are two basic types of rubric: the holistic and the analytic scoring systems. Holistic rubrics require evaluating the essay while taking into consideration all the criteria.
3. Prepare the rubric. In developing a rubric, the skills and competencies related to essay writing should first be identified.
4. Evaluate essays anonymously. Checking essays should be done anonymously; it is important that the rater does not identify whose paper he/she is rating.
5. Score one essay question at a time. This is to ensure that the same thinking and standards are applied for all learners in the class.
6. Be conscious of your own biases when evaluating a paper. The rater should not be affected by learners' handwriting, writing style, length of responses, and other factors. He/she should stick to the criteria included in the rubric when evaluating the essay.
7. Review initial scores and comments before giving the final rating. This is important especially for
essays that were initially given a barely passing or failing grade.
8. Get two or more raters for essays that are high-stakes, such as those used for admission, placement, or scholarship screening purposes. The final grade will be the average of all the ratings given.
9. Write comments next to the learner's responses to provide feedback on how well one has performed in
the essay test.