
Item Analysis

Siti Masfuah, S.Pd., M.Pd


Universitas Muria Kudus
Statistically analyzing your test items so that you can ensure that your items are effectively evaluating student learning.
Good instruments???
Advantages
1. Helps users evaluate the tests they use
2. Very relevant for preparing informal and local tests, such as the tests teachers prepare for students in the classroom
3. Supports effective item writing
4. Can materially improve classroom tests
5. Increases the validity and reliability of the questions
6. Determines whether an item functions as expected
7. Provides feedback to students about their abilities and serves as a basis for class discussion
8. Provides feedback to the teacher about student difficulties
9. Provides input on certain aspects for curriculum development
10. Allows revision of the material being assessed or measured
11. Improves question-writing skills
Qualitative vs. Quantitative

Qualitative analysis includes the consideration of content validity (the content and form of the items), as well as the evaluation of items in terms of effective item-writing procedures.

Quantitative analysis principally includes the measurement of item difficulty and item discrimination.
Quality test??
Criteria??
01 Validity
✓ Valid: the test measures what it was supposed to measure
✓ Measures exactly what is meant to be measured
✓ Uses the right measuring tools
✓ Internal & external validity

02 Reliability
✓ Reliability/consistency: the level of constancy or consistency of the measurement results of a test
✓ Consistency relates to the error rate of the results of a test in the form of scores
03 Discrimination Power
✓ Measures the extent to which an item is able to distinguish students who have mastered the competence from students who have not, based on certain criteria

04 Difficulty Level
✓ A measure of the degree of difficulty of a question. If a question has a balanced level of difficulty, the question can be said to be good

05 Distractor
✓ Analysis of the distractor answer choices
Validity

Content Validity
Expert judgment
Logic

Construct Validity
Expert judgment
Factor analysis

Criterion Validity/Predictive Validity
Empiric: correlation with criteria/standards (product moment correlation)
Validity

➢ Examination of the test items to conclude that the test measures the relevant aspects.
➢ A test is said to have construct validity if the items that make up the test measure every aspect of thinking stated in the learning objectives.
➢ Tested with the opinion of experts (expert judgment).

Logic validity
➢ Sampling validity.
➢ This validity demands careful delimitation of the area of behavior being measured and a logical design that can cover part of that area of behavior.

Content validity
➢ The suitability of the indicators in the test plan with the operational definition of the test instrument.
➢ Validity can be determined through focus group discussion (FGD) activities.
➢ Methods: the Lawshe method (1973) and the Aiken method (1985).

Construct validity
➢ Sampling validity, seen from the scope of the items in the test: are all items a representative sample of all possible items, or do they contain less relevant items?
Validity based on Zainal Arifin (2009)

Face Validity
Content Validity
Construct Validity
Empiric Validity
Factor Validity

• Face Validity

The simplest type of validity: judged from the appearance (face) of the instrument. If, at a glance, a test is considered good for revealing the phenomenon to be measured, the test can be said to meet the face-validity requirement, so no in-depth judgment is needed.
Content Validity

Content validity concerns whether the test material is relevant to the specified curriculum and covers all aspects to be measured.

How to measure: match the test material with the syllabus and test grids, hold discussions with fellow educators, and re-examine the substance of the concept to be measured.

Methods: the V-Aiken technique, the Lawshe technique, the Maslach technique.
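Since the V-Aiken technique is named above, here is a minimal sketch of computing Aiken's V for a single item from expert ratings, V = Σ(r − lo) / (n(c − 1)); the function name and the sample ratings are hypothetical.

```python
# Aiken's V: a content-validity index computed from expert ratings.
# V = sum(s_i) / (n * (c - 1)), where s_i = r_i - lo, n = number of
# raters, and c = number of rating categories.

def aiken_v(ratings, lo=1, hi=5):
    """Return Aiken's V for one item rated by several experts."""
    n = len(ratings)
    c = hi - lo + 1
    s = sum(r - lo for r in ratings)
    return s / (n * (c - 1))

# Hypothetical: five experts rate an item 4, 5, 4, 3, 5 on a 1-5 scale.
print(aiken_v([4, 5, 4, 3, 5]))  # 0.8
```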
Empiric Validity

01 This validity looks for the relationship between test scores and a certain criterion that serves as a benchmark outside the test in question. Empirical validity is also known as criterion validity or statistical validity. Its types are predictive validity, concurrent validity, and similar validity.

02 Predictive validity aims to predict student achievement: to see how far the test can predict student behavior in the future.

03 Concurrent validity applies when the standard test criteria used are of a different type (not cognate). Example: math content test scores are correlated with science content test scores.

04 Similar validity applies when the standard test criteria used are of the same type. Example: math content with math content.

05 Empirical validity testing:
1. Product moment correlation with deviation scores
2. Product moment correlation with raw (rough) scores
3. Rank-difference correlation
4. Scatter-diagram technique
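As a sketch of testing method 2, the raw-score (rough-number) product moment correlation can be computed directly from its definition; the function name and the sample data are hypothetical.

```python
# Product moment correlation with raw (rough) scores:
# r = (N*ΣXY - ΣX*ΣY) / sqrt((N*ΣX² - (ΣX)²) * (N*ΣY² - (ΣY)²))
import math

def product_moment(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

# Hypothetical scores on the new test vs. an external criterion test.
test = [45, 43, 41, 27, 26, 25]
criterion = [80, 75, 78, 55, 50, 52]
print(round(product_moment(test, criterion), 2))  # about 0.99
```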
Product Moment Correlation with Deviation Scores

r_xy = Σxy / √((Σx²)(Σy²)), where x = X − X̄ and y = Y − Ȳ are deviation scores.

Product Moment Correlation with Raw (Rough) Scores

r_xy = (NΣXY − (ΣX)(ΣY)) / √((NΣX² − (ΣX)²)(NΣY² − (ΣY)²))
Interpretation of the r Value

The interpretation of the correlation coefficient (r) is conventionally given by Guilford (1956) as follows:

r Coefficient   Criteria
0.80 – 1.00     Very good
0.60 – 0.80     Good
0.40 – 0.60     Enough
0.20 – 0.40     Less
0.00 – 0.20     Very less

The value of r can also be interpreted by comparing the calculated r with the table r: if r count > r table, the correlation is significant and the instrument is declared valid.
Construct Validity
➢ A construct is a concept that is observable and measurable.
➢ Construct validity is also known as logical validity.
➢ Construct validity relates to the extent to which the question/test can measure and observe the psychological function that describes the behavior.

Construct validity is tested by means of content validity, predictive validity, and concurrent validity. The statistical analysis used in construct validity is factor analysis, which shows:
1. What aspects are measured by each item?
2. How much of an item contains certain factors?
3. What factors are measured by an item?
With this factor analysis, you can analyze and consider whether a test can measure the psychological function that describes the student behavior the test is meant to measure.
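As an illustration of the factor-analysis step, the sketch below fits a two-factor model with scikit-learn and prints the loadings (how much each item contains each factor); the score matrix here is random placeholder data, so in practice X would be the real students × items matrix.

```python
# Exploratory factor analysis of an item-score matrix.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(100, 8)).astype(float)  # 100 students x 8 items

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)

# Each row of components_ is a factor; each column an item. Large
# absolute loadings show which factor an item measures.
print(np.round(fa.components_, 2))
```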
Factor Validity

This validity is obtained from the dimensions/indicators of the measured variables, as revealed in the theoretical construction.

Criteria: calculate the homogeneity of each factor's score with the total score, and between the scores of one factor and the scores of the other factors.
Reliability

The level/degree of consistency of an instrument: the test scores are consistent.

A test is said to be reliable if it always gives the same results when tested on the same group at different times or on different occasions.

How high or low the reliability is, is indicated empirically by a number called the reliability coefficient. High reliability is indicated by an rxx value close to 1. The general agreement is that reliability is considered satisfactory if rxx ≥ 0.700.
Reliability

Test-retest: one instrument, administered twice; the result of the first test is correlated with the result of the second test. Statistic: product moment.

Alternate (parallel forms): two instrument packages, each administered once; the results of package 1 are correlated with the results of package 2. Statistic: product moment.

Combined: two instrument packages, each administered twice; the results are combined (six relationships). Statistic: product moment.

Split half (internal consistency): one instrument, administered once; the score of the first half is correlated with the score of the second half. Statistics: Spearman-Brown, KR-20, KR-21.
Reliability Measurement Techniques

Spearman-Brown, Flanagan, Rulon, Hoyt, Kuder-Richardson (KR-20), and Cronbach's Alpha (for subjective tests)
Reliability Testing in Excel

Internal consistency reliability: a single test consisting of one set of items is given to a group of subjects in one administration, so the test produces only one group of data.

Split-half technique: divide the test into two relatively equal parts (the same number of questions), so that each test taker has two scores: the score of the first part (odd-numbered questions) and the score of the second part (even-numbered questions). The half-test reliability coefficient is denoted r1/2 1/2 and can be calculated with Pearson's raw-score correlation formula.

Non-split-half technique: two weaknesses of calculating the reliability coefficient with the split-half technique are that (1) the number of items must be even, and (2) the split can be made in different ways, producing different values (as seen in examples c.1 and c.2). Instead, use the Kuder-Richardson formulas (KR-20 and KR-21). For the reliability of a subjective test, use the Cronbach-Alpha formula.
Split-Half Technique (Internal Consistency Reliability)

The half-test correlation is stepped up to the full-test reliability with the Spearman-Brown formula:

r11 = (2 · r1/2 1/2) / (1 + r1/2 1/2)

A computational sketch follows.
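Here is a minimal sketch of three of the statistics named above, assuming X is a NumPy array of shape (students, items); the function names are illustrative, and note that textbooks differ on whether the variances use n or n − 1.

```python
import numpy as np

def split_half_spearman_brown(X):
    """Split-half reliability: correlate odd-item and even-item half
    scores, then step up with Spearman-Brown: r11 = 2r / (1 + r)."""
    odd = X[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even = X[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)

def kr20(X):
    """Kuder-Richardson 20 for dichotomous (0/1) items."""
    k = X.shape[1]
    p = X.mean(axis=0)                     # proportion correct per item
    var_total = X.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / var_total)

def cronbach_alpha(X):
    """Cronbach's alpha, suitable for subjective (polytomous) items."""
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    var_total = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / var_total)
```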
Reliability Testing of Instruments

✓ The reliability coefficient ranges from 0 to 1. The categories of reliability coefficients (Guilford, 1956: 145) are as follows:
0.80 < r11 ≤ 1.00   very high reliability
0.60 < r11 ≤ 0.80   high reliability
0.40 < r11 ≤ 0.60   moderate reliability
0.20 < r11 ≤ 0.40   low reliability
-1.00 ≤ r11 ≤ 0.20  very low reliability (unreliable)
Analysis tools: ITEMAN and others, the Rasch model, Excel, SPSS
Difficulty Level

The opportunity to answer a question correctly at a certain level of ability, usually expressed as an index.

This difficulty index is generally expressed as a proportion ranging from 0.00 to 1.00 (Aiken, 1994: 66).

The greater the difficulty index obtained, the easier the item.
The formula for calculating the level of difficulty (Tingkat Kesukaran, TK):

01 Objective test

TK = (the number of students who answered the question correctly) / (the number of students taking the test)

02 Subjective test

TK = Mean / (the maximum score set in the scoring guidelines)

Mean = (the total score on a question) / (the number of students taking the test)

A computational sketch follows the criteria table below.
Criteria

Range          Criteria    Proportion in test
0.00 – 0.30    Difficult   25%
0.31 – 0.70    Moderate    50%
0.71 – 1.00    Easy        25%
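A minimal sketch of both TK formulas, assuming X is a students × items score matrix (0/1 for an objective test, 0..max for a subjective test); the function names are illustrative.

```python
import numpy as np

def tk_objective(X):
    """TK per item for 0/1 items: proportion of students answering correctly."""
    return X.mean(axis=0)

def tk_subjective(X, max_score):
    """TK per item for essay items: mean item score / maximum score."""
    return X.mean(axis=0) / max_score

# Hypothetical 0/1 answers of 10 students on 3 items.
X = np.array([[1, 0, 1], [1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0],
              [1, 1, 1], [0, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]])
print(tk_objective(X))  # [0.8 0.4 0.7] -> easy, moderate, moderate
```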
Discrimination Power (DP)

The ability of an item to distinguish between students who have mastered the material in question and students who have not (or have only partly) mastered it.

The DP index is usually expressed as a proportion. The higher the DP index, the better the question distinguishes students who have understood the material from students who have not.

DP index: -1.00 to 1.00
Formula

The formula for the discrimination power of multiple-choice questions:

DP = (BA − BB) / (N/2)   or   DP = 2(BA − BB) / N

DP = discrimination power
BA = the number of students in the upper group who answered correctly
BB = the number of students in the lower group who answered correctly
N = the number of students

Procedure (a computational sketch follows this list):
► Sort the students' scores from highest to lowest.
► Divide the class into two groups: if there are 20 students, BA = 10 and BB = 10; if there are 30 students, BA = 15 and BB = 15. But what if the number of students is odd?
► If there are 49 students, divide by 2 and leave out the middle student: BA = numbers 1–24 (24 students) and BB = numbers 26–49 (24 students). The student with serial number 25 is not included in the calculation.
► If the number of students is > 50, then:
► BA = the upper 27% group (highest scores, counted from the top)
► BB = the lower 27% group (lowest scores, counted from the bottom)
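A minimal sketch of this procedure, assuming X is a students × items matrix of 0/1 answers and scores holds the total scores; discrimination_power is an illustrative helper. Run on the multiple-choice example below, it reproduces the BA, BB, and DP columns.

```python
import numpy as np

def discrimination_power(X, scores):
    """DP per item: split into upper/lower groups, then (BA - BB) / group size."""
    order = np.argsort(scores)[::-1]   # student indices, highest score first
    n = len(scores)
    # Halves for n <= 50 (odd middle student dropped), 27% groups otherwise.
    g = round(0.27 * n) if n > 50 else n // 2
    upper, lower = order[:g], order[-g:]
    ba = X[upper].sum(axis=0)          # correct answers in the upper group
    bb = X[lower].sum(axis=0)          # correct answers in the lower group
    return (ba - bb) / g
```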
E.g.

No   Student   1   2   3   ...   50   Score
1    A1        A   C   D   ...   C    45
2    A2        A   D   B   ...   B    43   (27% BA)
3    A3        C   A   A   ...   D    41
...  ...       ... ... ... ...   ...  ...
33   A4        C   D   D   ...   A    27
34   A5        D   A   C   ...   B    26   (27% BB)
35   A6        C   B   A   ...   B    25
Answer key     A   C   B   ...   D
Multiple-Choice Example (Contoh Soal Pilihan Ganda)

No  Student   Question number                         Total score
              1    2    3    4    5    6    7    8    9    10
1   A         1    1    1    1    1    1    1    1    0    0     8
2   B         1    1    1    1    1    1    1    1    0    0     8
3   C         1    1    1    1    1    1    1    1    0    0     8
4   D         1    1    1    1    0    1    1    1    0    0     7
5   E         1    1    1    1    0    1    1    0    0    0     6
6   F         1    0    1    0    1    0    0    0    1    0     4
7   G         1    0    0    1    1    0    0    0    1    0     4
8   H         1    0    0    1    1    0    0    0    1    0     4
9   I         1    0    0    0    1    0    0    0    1    0     3
10  J         1    0    0    0    1    0    0    0    1    0     3
Correct       10   5    6    7    8    5    5    4    5    0
Students      10   10   10   10   10   10   10   10   10   10
TK            1.0  0.5  0.6  0.7  0.8  0.5  0.5  0.4  0.5  0.0
Example of the Discrimination Power of Multiple-Choice Questions

Question number   BA   BB   DP
1                 5    5    0.00
2                 5    0    1.00
3                 5    1    0.80
4                 5    2    0.60
5                 3    5    -0.40
6                 5    0    1.00
7                 5    0    1.00
8                 4    0    0.80
9                 0    5    -1.00
10                0    0    0.00

Group sizes: BA = 5 and BB = 5, so DP = (BA − BB) / 5.
E.g.: subjective test
Formula: DP by Point-Biserial Correlation

In addition to the formula above, the discrimination power of multiple-choice items can also be determined with the point-biserial correlation (r_pbis) and the biserial correlation (r_bis):

r_pbis = ((Xb − Xs) / SDt) · √(p·q)

r_bis = ((Yb − Ys) / SDt) · (nb · ns) / (u · n · √(n² − n))

Xb = the mean score of students who answered the item correctly
Xs = the mean score of students who answered the item incorrectly
p = the proportion of correct answers
q = 1 − p
SDt = the standard deviation of the total scores
nb, ns = the numbers of students who answered correctly and incorrectly
u = the ordinate of the normal curve at the point dividing the two groups
n = the total number of students
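scipy implements the point-biserial correlation directly; the sketch below applies it to question 6 from the worked table above, correlating the 0/1 item answers with the students' total scores.

```python
# Point-biserial correlation between one item (0/1) and total scores.
import numpy as np
from scipy.stats import pointbiserialr

item6 = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])    # answers to question 6
totals = np.array([8, 8, 8, 7, 6, 4, 4, 4, 3, 3])   # total scores of A..J
r_pbis, p_value = pointbiserialr(item6, totals)
print(round(r_pbis, 2))  # about 0.94: a strongly discriminating item
```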
Formula: DP for a Subjective Test

DP = (Mean BA − Mean BB) / maximum score

► Mean = (the total score on a question) / (the number of students)

A sketch in code follows.
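A minimal sketch of this formula, using a small hypothetical score matrix of essay items scored 0–3; dp_subjective is an illustrative helper.

```python
import numpy as np

def dp_subjective(X, max_score):
    """DP per essay item: (mean of upper half - mean of lower half) / max score."""
    order = np.argsort(X.sum(axis=1))[::-1]   # students, highest total first
    g = len(order) // 2
    mean_ba = X[order[:g]].mean(axis=0)       # upper-group mean per item
    mean_bb = X[order[-g:]].mean(axis=0)      # lower-group mean per item
    return (mean_ba - mean_bb) / max_score

# Hypothetical: 6 students x 2 essay items, each scored 0-3.
X = np.array([[3, 2], [3, 3], [2, 2], [1, 1], [1, 0], [0, 1]])
print(dp_subjective(X, max_score=3))  # about [0.67 0.56]
```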

DP can draw the level of ability of the questions in distinguishing between


students who already understand the material being tested and students who
do not/do not understand the material being tested. The Distinguishing Power
Criteria are as follows:
0.40 – 1.00 questions are accepted/good
0.30 – 0.39 questions accepted but need to be improved
0.20 – 0.29 questions fixed
0.00 – 0.19 questions are not used / discarded
Example of discrimination power with r_bis
Distractor

The distribution of answer choices is used as a basis for reviewing the questions: it shows whether the available answer choices function. An answer choice (a distractor) can be said to work if:
a. it is selected by at least 5% of the test takers/students, and
b. it is chosen more often by the group of students who do not understand the material.
How to determine whether the distractors are working:

No   Group                      A    B    C    D    E*
1    BA (27% = 40 students)     4    12   16   8    0
     BB (27% = 40 students)     0    12   16   12   0

The answer key is option E.

Question number 1 is really bad: both the upper and lower groups are confused and both choose C, and choice E (the key) does not work, since no one chooses it.
No   Group                       A*   B    C    D    E
2    Upper 27% = 40 students     40   0    0    0    0
     Lower 27% = 40 students     0    8    12   10   0

The answer key is option A.

Question number 2 is a good question because it can distinguish between good and poor test takers.
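A minimal sketch of the two rules above (chosen by at least 5% of the test takers, and chosen more by the lower group than the upper group); distractor_works is an illustrative helper, the 5% threshold is applied to the students in the two groups, and the data is question 1 from the first table.

```python
# Distractor check: a distractor "works" if it is chosen by at least
# 5% of the test takers and more often by the lower group than the
# upper group. Data: question 1 from the first table (key = E).

def distractor_works(upper, lower, key, min_frac=0.05):
    """Return {option: works?} for every option except the key."""
    n = sum(upper.values()) + sum(lower.values())   # students in both groups
    return {opt: (upper[opt] + lower[opt] >= min_frac * n
                  and lower[opt] > upper[opt])
            for opt in upper if opt != key}

upper = {"A": 4, "B": 12, "C": 16, "D": 8, "E": 0}
lower = {"A": 0, "B": 12, "C": 16, "D": 12, "E": 0}
print(distractor_works(upper, lower, key="E"))
# {'A': False, 'B': False, 'C': False, 'D': True}
```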
Project
The following is a set of 8 essay questions answered by 20 students, with a minimum score of 0 and a maximum score of 3 per question. Find:
a. the discrimination power
b. the difficulty level
Please work on it in Excel!
THANK YOU
