
09/20/2023

THE PSYCHOMETRIC PROPERTIES OF THE ---- EXAMS

Alina Oshyna
D. Yavornitskogo Ave., 26, Dnipro, Ukraine
Tel +380504522012
alina@psylab.tech
psylab.tech

INTRODUCTION
The --- exams are knowledge-based assessments that are delivered to 9000 individuals biennially. Each of the
forms is administered to evaluate the assimilation of instructional material taught in two successive
courses. Each of the forms presents a set of 100 questions randomly selected from a pool of 260 items. The
primary objective of administering the exams is to classify test-takers into either pass or fail groups. The
context in which this exam is conducted assumes that a decisive majority of the test-takers are
highly motivated and well-prepared, and that the exams are administered on flexible rather than fixed exam dates.
Given these circumstances, the necessity for conducting the analysis detailed in this paper arises from the goal
of upholding and strengthening the exam’s security.

Simultaneously, the study aims to identify the most effective strategies for ensuring and further enhancing the
precision of individual scores. To fulfill this purpose, the investigation is designed to comprehensively examine
the psychometric properties of the --- exam’s --- and --- forms. It intends to discern the optimal Item Response Theory (IRT) model that most effectively describes the data-generating process and to formulate a
framework that maximizes the classification accuracy and consistency of the assessment. Beyond identifying
the optimal and most parsimonious IRT model for the description of the data, we also undertook a
comparative evaluation of various IRT models, gauging the level of precision of the ability estimates derived
from these models.

The --- form covers nine distinct topic domains, and the --- covers eight such domains. These topic domains
correspond to the modules covered in the instructional material. The composition of each --- exam form is
guided by a blueprint that outlines the proportion of exam items originating from each domain. The item
selection algorithm of the exam randomly selects the specified number of items defined by the blueprint
from the domain's item pool. Prior to the investigation of the overall coherence of the exam as a unified scale
and the assessment of whether the domains truly represent distinct latent factors, we will also evaluate
whether the item selection algorithm operates as intended.

Identifying the optimal IRT model is not an isolated objective; rather, it serves as a means to accomplish the
primary goal of this study. This goal is to establish the ideal configuration for item development, in terms of
both item characteristics and quantity, to find the optimal balance between item pool size and exam quality
and security. Given the fact that item development and implementation resources are not infinite, the current
study aims to address this issue by finding an answer to an important question: What is the minimum item
pool size required beyond which there is no significant enhancement in the exam's measurement performance,
and what types of items should be developed (i.e., of what difficulty) to create the most efficient item pool?

The overarching purpose of this endeavor is to achieve the utmost level of exam quality while minimizing the
necessary effort for item development.

Aligned with this objective, simulations were conducted using nine distinct item pool sizes, ranging from two to
ten times the number of items defined by the blueprint for the domain. Additionally, sixteen different
distributions of the difficulty parameter were incorporated. These distribution parameters were the result of
the combinations of four specific variants of the mean and four variants of the standard deviation. The mean
variants were: (1) the actual mean of the difficulty parameter as it is in the current item pool; (2) ability at the
maximum of the information function as it is now; (3) the ability at the pass-mark; and (4) the mean of the
ability distribution in our sample. Regarding the standard deviation, three multiples of the actual difficulty parameter's standard deviation were used (0.75, 1, and 1.5), along with the current standard deviation of the sample ability distribution.

The rationale for the choice of these values was as follows: the actual distribution (both its mean and standard deviation) and the maximum information point served as a baseline against which to gauge the impact of the alternative settings. The parameters matching the ability distribution (mean of zero and SD of one) were motivated by several literature sources that advocate aligning the difficulty parameter distribution with the ability distribution as a means to achieve optimal exam and item pool efficiency. Using the ability at the pass-mark as the mean of the difficulty distribution aims to maximize information at the pass-mark, a common suggestion in the psychometric literature, as pointed out by Wyse and Babcock (2016).

The decision for utilizing a range of SD values for the difficulty parameters stemmed from seemingly
contradictory findings of different studies. For example, Kezer (2021) reported that tests administered through
item pools featuring a difficulty parameter range that matched the ability level tended to yield more precise
estimations of ability. Conversely, when the difficulty parameter range was narrower, the precision of ability
estimates suffered, especially at extreme ability levels. On the other hand, a study by Demir (2022) showed that item pools with a broader distribution of the difficulty parameter exhibited reduced classification accuracy. In our study, we examine how the breadth of the difficulty parameter distribution influences the
precision of ability estimates and the accuracy of classifications made using the exam scores, and how these
qualities change with the growth of the item pool.

The performance of the simulated exam was evaluated in two aspects: measurement precision and
classification accuracy and consistency. In diagnostic contexts it is important to obtain precise scores along the
whole ability continuum, while in classification contexts it is rather more important to classify individuals
correctly and consistently if the examination is repeated. A review of the literature suggests that these criteria
might involve a trade-off between them (e.g. Kezer, 2021; Demir, 2022).

Measurement precision is the degree to which the exam recovers the true theta values (which are known when
the sample is a simulated one), and it only takes into account the difference between the real and the
estimated ability along the whole range of ability. To assess measurement precision, we employed two metrics:
bias and Root Mean Squared Difference (RMSD). RMSD quantifies the magnitude of the differences between estimated and true abilities. It does so by taking the square root of the mean of the squared differences, a methodology similar to computing a standard deviation, thereby preventing differences of opposite sign from canceling
each other out. Bias, on the other hand, is designed to identify systematic errors. It does so by examining the
simple differences between estimated and actual abilities. Thus, even when these differences are substantial, if
they average out to zero and exhibit symmetry, bias will register as zero. RMSD and bias are calculated by using
the following formulae:

$$\mathrm{RMSD}=\sqrt{\frac{\sum_{i=1}^{N}\left(\hat{\theta}_i-\theta_i\right)^{2}}{N}};\qquad \mathrm{Bias}=\frac{\sum_{i=1}^{N}\left(\hat{\theta}_i-\theta_i\right)}{N},$$

where N is the number of examinees, θ̂ᵢ is the ability estimate of examinee i, and θᵢ is this examinee’s true ability (Boyd, Dodd & Fitzpatrick, 2013).
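
As a minimal illustration of these two metrics, the following R sketch computes them for hypothetical vectors of true and estimated abilities (all object names are illustrative and not taken from the study's code):

```r
# Hypothetical true and estimated abilities for N simulated examinees
set.seed(1)
N          <- 1000
theta_true <- rnorm(N)                         # "true" abilities, known in a simulation
theta_hat  <- theta_true + rnorm(N, sd = 0.4)  # noisy ability estimates

# RMSD: magnitude of estimation error; squaring prevents errors of opposite
# sign from canceling each other out
rmsd <- sqrt(mean((theta_hat - theta_true)^2))

# Bias: systematic error; positive and negative errors can cancel out
bias <- mean(theta_hat - theta_true)

c(RMSD = rmsd, Bias = bias)
```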

Meanwhile, classification accuracy refers to the average probability, across all individuals, of falling into the category that is consistent with their ability, given their ability estimates and their corresponding standard
errors. While measurement precision takes into account only the difference between the real and the
estimated ability, classification accuracy and consistency measures also take into account the standard error of
measurement and the distance of the ability estimate from the cut-score.

Figure 1
An Illustration of the Probability of Being Classified Correctly (Green) and Incorrectly (Red) Above the Pass-
Mark (Left) and Below the Pass-Mark (Right)

The classification accuracy index, τ, is computed based on three values: the location of the cut-score, κ; the examinee’s ability estimate, θ̂ᵢ; and the estimate of the standard error of measurement for that examinee, σ̂ᵢ. Assuming conditional normality of the standard error around the ability estimate, one then finds the areas below and above the cut-score under the normal curve with θ̂ᵢ as the mean and σ̂ᵢ as the standard deviation. The obtained value represents the probability of being classified into one category or the other, as illustrated in Figure 1. The classification accuracy metric is the average of all the ‘green’ areas, or the average probability of being correctly classified, across all the examinees. Looking at that illustration should make it evident that the farther the score from the pass-mark and the smaller the standard error, the smaller the probability of being classified incorrectly. Classification consistency, γ, is the average probability across examinees of falling into the same category twice, and so it is the average of the squares of those same probabilities (Wyse & Hao, 2012). The chance consistency rate, which was also computed for each of the scenarios, is the probability of being categorized into one category on one examination attempt and into the other category on another attempt.
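
Following the description above, a sketch of how τ and γ could be computed in R is given below; all variable names are illustrative (theta_hat and se_hat would hold the ability estimates and their standard errors, and cutoff is the cut-score κ):

```r
# Illustrative inputs; in practice these come from the fitted IRT model
cutoff    <- -0.5                   # cut-score (kappa)
theta_hat <- rnorm(1000)            # ability estimates
se_hat    <- runif(1000, 0.3, 0.6)  # their standard errors

# Area of the N(theta_hat, se_hat) curve lying on the same side of the
# cut-score as the point estimate, i.e. the 'green' area in Figure 1
p_above   <- 1 - pnorm(cutoff, mean = theta_hat, sd = se_hat)
p_correct <- ifelse(theta_hat >= cutoff, p_above, 1 - p_above)

tau   <- mean(p_correct)    # classification accuracy: average correct-classification probability
gamma <- mean(p_correct^2)  # classification consistency, as described above
```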

Additionally, we have calculated a straightforward metric intended to strike a balance between conflicting score
precision and classification accuracy findings. Given the fact that our exam is intended for classification,
precision is less critical as long as the ability estimate (θ̂) and true ability (θ) are located on the same side of the
pass-mark. To address this, we introduced a metric called the "hit-rate," which counts the cases in each
simulation scenario where the estimated ability falls on the same side of the pass-mark as the true ability,
regardless of the distance between the two (contrary to the RMSD).

In addition, we formally computed the exam overlap rate, which essentially comes down to the ratio of test
length to item pool size in fixed-length exams.
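
Using the same kind of illustrative objects, the hit-rate and the overlap rate reduce to a line each; this is a sketch, not the study's code:

```r
# Illustrative inputs (simulated true abilities and noisy estimates)
set.seed(2)
theta_true <- rnorm(1000)
theta_hat  <- theta_true + rnorm(1000, sd = 0.4)
cutoff     <- -0.5

# Hit-rate: share of cases where the estimate and the true ability fall on
# the same side of the pass-mark, regardless of the size of the error
hit_rate <- mean((theta_hat >= cutoff) == (theta_true >= cutoff))

# Expected overlap rate of a fixed-length exam with random item selection:
# exam length divided by item pool size
overlap <- 100 / 260   # roughly 0.38 for the current blueprint and pool
```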

By computing these metrics for each of the difficulty distribution configurations and for each multiple of the item pool size (twice the number of items prescribed by the blueprint, three times as many, four times, and so on), we will try to determine which configuration is optimal in terms of precision and accuracy.

FORM ----

SAMPLE DESCRIPTIVE STATISTICS

The “---_---_Dat” data set consists of 4302 rows, with 3736 unique candidate IDs, and an overall pass rate of 64.16%. 3175 candidates attended the exam for their first attempt. Among those who attempted the assessment for the first time, the pass rate is similar, standing at 64.63%.

Table 1 presents the distribution of attempts in the sample, but it is important to note that this distribution does not account for all attempts made by some individuals: while some candidates appear more than once, others appear only once, but not with their first attempt.

The majority of cases (73.8%) in the sample represent individuals who took the exam for the first time.
Approximately 20% of the sample consists of cases where individuals made a second attempt. The remaining
7% of the sample includes cases of individuals who attempted the exam three or more times. Subsequent
analyses are conducted exclusively on participants who completed the test during their first attempt.

Table 1
Distribution of Attempts in the Sample
Attempt   Number of Cases   % of the Sample
1         3175              73.8%
2         810               18.83%
3         221               5.14%
4         68                1.58%
5         16                0.37%
6         8                 0.19%
7         2                 0.05%
9         1                 0.02%
10        1                 0.02%

ITEM ANALYSIS

Exposure Rates

Exposure rates of the items are not uniform. 75% of the items were exposed to between 38% and 40% of the examinees, with a maximum exposure of 42% (items X------, X------, X------). At the same time, a few items have significantly lower exposure rates, below 25% (items X------, X------, X------, X------, X------, X------, X------). Further examination revealed that those items had been breached and were therefore removed from the item pool. Because these items are no longer utilized, they were excluded from further analyses.

Item Selection Algorithm Verification

The item selection algorithm for this test is domain-weighted random. This means that each item's probability of being chosen for any examinee equals the number of items defined by the blueprint for that specific domain divided by the total number of items of that domain present in the item pool. According to this algorithm, the probability of being selected is equal across all items within a domain, and the distribution of item exposures should be uniform at the domain level. With this in mind, we simulated item exposures under this condition. Our hypothesis is that, if the algorithm works properly, the simulated distribution of item exposure rates should be statistically equivalent to the actual item exposures.

Figure 2
Item Exposure Rates in Each Domain – Simulated and Real Data

To test this hypothesis, we conducted 1000 simulations of the test, wherein items were randomly selected to adhere to the blueprint. We then compared the outcomes of these simulated tests with the results from actual tests. Both the simulated and real results are illustrated in Figure 2. On the graphs, each bar represents an item, colored by domain. The black horizontal line represents the probability of an item being selected, which also corresponds to the mean item exposure for that domain. Its level differs across domains, because the number of items in the pool and the domain representation defined by the blueprint differ. Overall item exposures were then compared using the asymptotic two-sample Kolmogorov-Smirnov test, which indicated that the simulated and real exposures belong to the same distribution (D = 0.03, p = 1), suggesting that the current item selection mechanism is working properly.
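
A simplified sketch of this check for a single hypothetical domain is shown below; the actual analysis repeats the simulation 1000 times and pools all domains before the comparison, and the counts and object names here are illustrative:

```r
set.seed(42)
pool_size   <- 36     # items in the domain's pool (illustrative)
n_blueprint <- 15     # items drawn per examinee according to the blueprint
n_examinees <- 3175

# Simulate domain-weighted random selection: each examinee receives a simple
# random sample of blueprint-size items from the domain's pool
sim_counts <- integer(pool_size)
for (e in seq_len(n_examinees)) {
  drawn <- sample.int(pool_size, n_blueprint)
  sim_counts[drawn] <- sim_counts[drawn] + 1
}
sim_exposure <- sim_counts / n_examinees

# 'real_exposure' would hold the observed exposure rates computed from the
# response data; a placeholder is used here so the sketch runs on its own
real_exposure <- sim_exposure + rnorm(pool_size, sd = 0.002)

# Asymptotic two-sample Kolmogorov-Smirnov test of the two exposure distributions
ks.test(sim_exposure, real_exposure, exact = FALSE)
```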

Item Statistics

Reliability coefficient alpha of the exam as a whole was found to be 0.96. Basic statistics for the first several items are provided in Table 2. The table presents essential item statistics and the number of problems identified. The first three columns contain the item identifier, the item key, and the domain to which the item belongs. Column ‘n’ displays the count of individuals in the sample who were presented with each respective item. The column labeled ‘Exposure rate’ shows the proportion of individuals exposed to each item, calculated by dividing the values in column ‘n’ by the total sample size.

Columns ‘A,’ ‘B,’ ‘C,’ and ‘D’ provide the proportions of choices of distractors A, B, C, and D, respectively. A choice frequency of 5% or below for all three distractors is considered a problem and adds a point to the ‘Problems’ counter. The ‘Prop. Correct’ column represents the proportion of individuals who answered an item correctly among those who were exposed to that particular item. ‘ITC’ stands for item-total correlation and is interpreted as the item discrimination index in the framework of Classical Test Theory. A value of ITC below 0.25 is considered a problem and adds a point to the ‘Problems’ counter. ‘Alpha.Drop’ refers to the coefficient alpha value that would result if a specific item were dropped from the test. If this value is larger than the full-test alpha, it adds a point to the ‘Problems’ counter.

Column ‘Problems’ represents the counter for identified issues with the items, as described above. Any items
that received a value of 2 or higher in this column were excluded from further analyses. The decision to exclude
these items was made because they were deemed to disrupt the unidimensional structure of the test.

Table 2
--- Form ---- Item Statistics

IRT ANALYSIS
Assessing the Assumptions of Item Response Theory (IRT)

ASSUMPTION OF UNIDIMENSIONALITY

In order to assess the assumption of unidimensionality, a confirmatory factor analysis (CFA) was performed to
fit a unifactorial model using the 228 retained items. The analysis was conducted using the lavaan package in
R (version 0.6-9). The results indicated inadequate fit to a unifactorial solution (TLI = 0.64, CFI = 0.63,
RMSEA = 0.12). In addition, the single-factor solution obtained with the lavaan::cfa() method explained only 9% of the observed variance, which also indicates a more complex latent structure in the data. These results do not contradict the high value of coefficient alpha: coefficient alpha depends on the length of the scale and can be high when scales are particularly long.
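
A minimal sketch of such a single-factor CFA is shown below, assuming a data frame resp of 0/1 item responses whose column names are the retained item IDs; the object names and call details are illustrative, not the exact code used for the report:

```r
library(lavaan)

# One general factor loading on all retained items; declaring the items as
# ordered makes lavaan treat them as categorical indicators
items <- colnames(resp)
model <- paste("G =~", paste(items, collapse = " + "))

fit <- cfa(model, data = resp, ordered = items, std.lv = TRUE)
fitMeasures(fit, c("tli", "cfi", "rmsea"))
```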

Based on these findings, further scale reduction was deemed necessary, leading us to adopt stricter measures for
item elimination. Instead of considering the scale as a whole, we shifted our focus to the domain level. This
approach serves two important purposes: (1) ensuring greater homogeneity within the domain-level data, and
(2) maintaining appropriate representation of each domain by avoiding a disproportionate elimination of items.
Table 3 presents the domain composition of the exam before and after dropping items. It is evident from the table that Domain A and Domain F experienced the most significant reductions in item count, while Domain I did not lose any items. Nevertheless, despite these variations, the overall representation of the domains in the test remains largely consistent, as can be seen by comparing the last two columns of the table.

To implement this approach, we re-calculated all scale and item statistics, and eliminated items exhibiting the
following properties: (1) all three distractor choice frequencies at or below 5%, (2) item-domain total
correlation (ITC) below 0.3, and (3) domain-factor loading below 0.3. Following this procedure, we were left
with a set of 138 items, as detailed in Table 4.

Table 3
Domain-Wise Item Counts Before And After Dropping Items
Domain   Original Count   Count after Dropping   Number of Items Dropped   Original Percentage   Percentage after Dropping
A        39               30                     9                         15%                   13%
B        33               31                     2                         13%                   14%
C        32               28                     4                         12%                   12%
D        29               26                     3                         11%                   11%
E        35               33                     2                         14%                   14%
F        26               19                     7                         10%                   8%
G        26               22                     4                         10%                   10%
H        20               19                     1                         8%                    8%
I        20               20                     0                         8%                    9%
Total    260              228                    32                        100%                  100%
Table 4
Domain-Wise Item Counts Before And After Dropping Items for the Second Time
Domain   Original   Percent of Total   Retained   Percent of Total   Difference   Alpha
A        39         15%                15         11%                -4%          0.72
B        33         13%                16         12%                -1%          0.77
C        32         12%                16         12%                -1%          0.73
D        29         11%                18         13%                2%           0.75
E        35         13%                24         17%                4%           0.81
F        26         10%                11         8%                 -2%          0.73
G        26         10%                18         13%                3%           0.76
H        20         8%                 10         7%                 0%           0.65
I        20         8%                 10         7%                 0%           0.71
Total    260        100%               138        100%               -            0.95

We observe a significant reduction in the number of items, nearly halving the original count. However, it is noteworthy that no domain's share of the test changed by more than 4 percentage points. Also, we can see that coefficient alpha is acceptable across all domains, while being somewhat low for Domain H.

Confirmatory factor analysis of the retained 138 items yielded only slightly better results: 13% of explained
variance, with TLI=0.748, CFI=0.752. At the same time, the value of RMSEA turned out to be 0.014, which is
good. Such results suggest that although the model fits the data well in terms of RMSEA, the low explained
variance and low TLI and CFI values indicate that the data is not completely unidimensional. Despite sharing a
common factor, it appears that other latent factors may also be influencing the relationships among the items.
Thus, the model's ability to capture the full complexity of the underlying structure of the data should be
treated as limited. It is important to carefully consider these findings when interpreting the model and making
further inferences.

Model Selection and Comparative Evaluation

Four unidimensional models were fitted to the data: 1PL (i.e. Rasch), 2PL, 3PL, and 4PL. All IRT computations
were performed using the mirt R package, version 1.39 (Chalmers, 2012).
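
A sketch of the fitting and comparison workflow in mirt is shown below, assuming the same hypothetical binary response matrix resp (with unseen items coded as NA); it is an illustration of the approach, not the report's exact script:

```r
library(mirt)

# Fit the four competing unidimensional models
mod_rasch <- mirt(resp, 1, itemtype = "Rasch", verbose = FALSE)
mod_2pl   <- mirt(resp, 1, itemtype = "2PL",   verbose = FALSE)
mod_3pl   <- mirt(resp, 1, itemtype = "3PL",   verbose = FALSE)
mod_4pl   <- mirt(resp, 1, itemtype = "4PL",   verbose = FALSE)

# Nested-model comparisons; the output also reports information criteria
# (AIC, BIC, SABIC, HQ) that feed Table 5-style summaries
anova(mod_rasch, mod_2pl)
anova(mod_2pl,   mod_3pl)
anova(mod_2pl,   mod_4pl)
```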

Due to the sparsity of the data (i.e., the large number of missing values, specifically 62%), it was impossible to obtain key fit indices such as TLI, CFI, and RMSEA. The absence of these indices restricts our ability to assess the absolute fit of each of the IRT models. This limitation further complicates conclusive inference, especially
in light of previous CFA findings. It is important to recognize this limitation and acknowledge that the adequacy
of the IRT models remains uncertain.

Table 5 presents the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Sample-Size Adjusted BIC (SABIC), and Hannan-Quinn (HQ) criterion. All of these are statistical measures used to assess and compare different models. AIC and BIC balance model fit and complexity, with lower values indicating better models. SABIC adjusts BIC for sample size, providing a more robust measure of fit. The HQ criterion applies a penalty for model complexity that is stronger than AIC's.

Table 5
Model Comparative Fit Statistics
Model AIC BIC SABIC HQ
Rasch 252346.1 253231.1 252789.5 252658.7
2PL 251324.1 253081.3 252204.3 251944.7
3PL 251385.1 254020.9 252705.4 252316
4PL 251519.4 255033.9 253279.9 252760.6
While fit indices such as RMSEA, TLI, and CFI are unavailable, other criteria for model evaluation can be considered. Notably, it has been reported that unanimous agreement among the AIC, BIC, SABIC, and HQ statistics is often observed when models describe the data well (Sen & Bradshaw, 2017). Therefore, in the absence of the usual fit indices, the unanimous agreement among these available statistics supports the possibility that the model adequately captures the characteristics of the observed data.

These statistics consistently indicate that the 2PL model is the preferred choice. Therefore, despite the absence
of the key fit indices, we rely on the unanimous recommendation of these statistics as an indirect indication of
the model’s reasonable fit to the observed data. However, it is still important to maintain awareness that we
do not have a conclusive answer regarding the fit of the model.

2PL stands for 2-parameter logistic model. It treats the probability of solving an item correctly as a logistic
function of two item parameters – discrimination and difficulty. Graphical representation of this function with
illustration of the meaning of the two parameters is provided in Figure 3. Discrimination (denoted as parameter
a) refers to the rate at which the probability of a correct response increases in relation to an individual’s level of
ability. It is represented by the slope on the item characteristic curve. Larger values of the discrimination
parameter represent steeper slope, and therefore better discrimination. The difficulty parameter (denoted as
parameter b) is identified by the location on the ability continuum where the probability of giving a correct
response equals exactly 0.5. Larger values of this parameter indicate more difficult items. Bichi and Talib (2018)
provide useful categorization of these two parameters to facilitate interpretation. This categorization is provided
in Tables 10 and 11 of this report (near the reporting on the item parameters for reader convenience).
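
For reference, the response function described above can be written in standard 2PL notation (this formula is supplied here for clarity; it is not reproduced from the report itself):

$$P\left(X_{ij}=1\mid\theta_j\right)=\frac{1}{1+\exp\left(-a_i\left(\theta_j-b_i\right)\right)},$$

where θⱼ is the ability of examinee j, aᵢ is the discrimination of item i, and bᵢ is its difficulty; at θⱼ = bᵢ the probability equals exactly 0.5, and aᵢ governs the steepness of the curve at that point.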

Figure 3
Item Characteristic Curve with Illustration of Item Parameters in the 2PL Model

THE ASSUMPTION OF LOCAL INDEPENDENCE

Local dependence refers to item dependencies that go beyond the measured latent ability, posing a challenge to
the validity of IRT models. In the presence of local dependence, the joint probability of correct responses to an
item pair deviates from the product of individual response probabilities. This additional correlation results in a
reduction of explained variance by the IRT model and affects the estimation of item parameters. To ensure
accurate estimation and valid interpretation in IRT models, and therefore the quality of the exam, it is crucial to
address local dependence issues.

The assumption of local independence can be assessed by examining the correlations of residuals. 2PL model residuals were calculated and a residual correlation analysis was performed. Out of 9384 possible item pairs, 1753 (almost 19%) had residual correlations above 0.3. Among the analyzed items, a total of 126 items demonstrated local dependence with at least one other item. Remarkably, the highest number of local dependencies observed for a single item was as high as 90.
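
One common way to obtain such residual correlations is Yen's Q3 statistic, which mirt exposes through its residuals() method; the sketch below assumes the hypothetical fitted 2PL object mod_2pl from the earlier sketch and is not necessarily the exact procedure used in the report:

```r
# Q3: correlations between item residuals after the modeled latent trait has
# been partialled out; values near zero are expected under local independence
q3 <- residuals(mod_2pl, type = "Q3")

# Count item pairs whose residual correlation exceeds 0.3 in absolute value
flagged <- abs(q3[upper.tri(q3)]) > 0.3
sum(flagged)    # number of locally dependent pairs
mean(flagged)   # proportion of all item pairs
```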

There are several potential reasons for local dependence. In knowledge exams, local dependence can arise when
multiple questions about the same concept are included. This results in technically identical, redundant items
that provide less information than what would be expected by the model. In other words, the test contains
fewer items than it appears to (Chen & Thissen, 1997). When constructing items for lengthy exams, it is crucial
to prioritize item diversity to prevent the occurrence of local dependence in the future.

Another potential cause of local dependence is when multidimensional data with correlated latent factors is
analyzed using a unidimensional item response theory model. In such cases, local dependencies can emerge
among pairs of items that belong to the same secondary trait or dimension (Yen, 1984). The fact that the confirmatory factor analysis achieved only poor to mediocre fit supports this possibility.

In summary, no conclusive evidence has been found to support the assumptions of IRT. The test likely possesses
multidimensional structure, and several items are likely to be locally dependent. When transitioning to an IRT-
based adaptive test, it is crucial to address these issues.

One potential solution to address this concern is to treat each domain as an independent scale and apply a
distinct IRT model for each domain. This approach would yield separate ability scores for each domain. To
compute an overall score that aligns with the blueprint, a weighted average can be calculated based on these
domain scores. This approach will be implemented and elaborated upon in the subsequent sections.

DOMAIN-CENTRIC ANALYSES
In the context of domain-centric analyses, we relaxed the criteria for item removal due to practical
considerations. This led to the reintegration of previously discarded items into the analyses. Additionally, we
recalculated item statistics, focusing on computations within each specific domain rather than across the entire
test.

Table 6
Domain Statistics
Domain   Item Pool   Blueprint   Exposure Probability   Mean    SD     Skew    STC    Alpha
A        36          15          0.42                   10.5    2.35   -0.54   0.71   0.72
B        30          13          0.43                   8.58    2.43   -0.4    0.76   0.77
C        24          12          0.5                    6.96    2.31   -0.22   0.74   0.70
D        24          11          0.46                   6.74    2.2    -0.29   0.74   0.72
E        34          13          0.38                   8.57    2.64   -0.54   0.8    0.83
F        25          10          0.4                    6.75    1.91   -0.6    0.68   0.74
G        26          10          0.38                   5.97    2.17   -0.26   0.73   0.78
H        19          8           0.42                   4.86    1.71   -0.25   0.65   0.64
I        20          8           0.4                    5.01    1.82   -0.39   0.7    0.73

Table 6 provides basic domain statistics. It shows the number of items for that domain in the pool and the
number of items to be included in the test as per the blueprint. Exposure probability is derived by dividing the
number of items in the blueprint by the number of items in the pool. Subsequent columns encompass the
distribution parameters for the domain sum scores. The Mean column denotes the average of total correct
responses for each domain, while SD represents the standard deviation. The Skew column indicates the asymmetry of the score distribution. Notably, all values in the Skew column are negative, implying that
the majority of subjects achieved total domain scores above the mean. Domains F, E, and A exhibit the most
negative skewness, signifying higher ease. Conversely, domain C demonstrates the least skewness. The following
column, labeled STC (scale-total correlation), showcases sufficiently high correlations across all domains.

Next, we will delve into an individual examination of each domain. We will identify items within each domain that deviate from its unidimensional structure, find the best-fitting IRT model, and calculate the corresponding item parameters. Following this, we will use the simulation results to determine the most effective strategies for enhancing the item pool.

DOMAIN A

Item Analysis

Table 7 provides basic item statistics for domain A. The columns next to Item ID represent the respective response options; the values in these cells indicate the proportion of examinees who selected each option. Items with all three distractor choices selected less frequently than 5% of the time are excluded from further analyses due to their lack of response variability, which diminishes their informative value. In addition, items exhibiting factor loadings below 0.1 are discarded, as they deviate from the unifactorial structure intended for the scale. According to these rules, items X------, X------, X------, X------, X------, X------, and X------ will be removed from the IRT analyses.

Confirmatory factor analysis was conducted for the remaining domain A items. A relatively low RMSEA value of
0.016 for the unifactorial model indicates a good fit. However, the TLI and CFI values are 0.77 and 0.79
respectively, which are below the recommended threshold of 0.9. This outcome suggests that the model fits the
observed data well in terms of relationships among variables, but not all of these relationships are strong
enough, and some items might still be improved for stronger inter-item correlations.

Table 7
Domain A Item Statistics
Item.ID   A      B      C      D      Mean   Bad Distr's   ITC    IDC    Alpha    Alpha.Drop   Fact Load
X------   0.08   0.67   0.05   0.2    0.67   1             0.29   0.36   0.7157   0.7103       0.23
X------   0.19   0.17   0.46   0.17   0.46   0             0.19   0.32   0.7157   0.7165       0.15
X------   0.13   0.73   0.06   0.08   0.73   0             0.36   0.45   0.7157   0.7006       0.43
X------   0.43   0.12   0.17   0.27   0.27   0             0.14   0.25   0.7157   0.7189       0.07
X------   0.11   0.58   0.08   0.22   0.58   0             0.38   0.42   0.7157   0.7048       0.35
X------   0.11   0.21   0.08   0.6    0.6    0             0.2    0.36   0.7157   0.7122       0.19
X------   0.05   0.01   0.91   0.04   0.91   3             0.03   0.17   0.7157   0.7174       0

IRT Analyses

After removing the misfitting items, the coefficient alpha remained unchanged at the level of 0.72, leaving the
reliability unaffected. This is yet another indication that the removal of these items did not reduce the
information obtained by this scale.

Having established unidimensionality for domain A, we can proceed to fitting IRT models. The process of selecting the most suitable IRT model involves examining the models' comparative fit indices and subsequently performing an analysis of variance to determine whether the more complex model describes the data significantly better than the simpler model. An examination of Table 8 reveals that, according to all indices except BIC, the best-fitting model is the 2PL.

Table 8
Comparative Fit Indices of the Rasch, 2PL, 3PL, and 4PL Models for Domain A
Model AIC BIC SABIC HQ
Rasch 40121.68 40302.94 40207.62 40186.76
2PL 40002.96 40353.4 40169.11 40128.78
3PL 40048.54 40574.2 40297.76 40237.27
4PL 40037.41 40738.29 40369.71 40289.06
Table 9
Analysis of Variance of Model Performance for Domain A
Model 1 Model 2 X2 df p
Rasch 2PL 174.7168 28 0
2PL 3PL 12.42209 29 0.996902
3PL 4PL 69.1299 29 0.00
2PL 4PL 81.55199 58 0.02244
Next, to further confirm this recommendation, we conducted an analysis of variance. As can be seen from Table 9, the findings provide backing for the 2PL model: it significantly outperforms the Rasch model, whereas the 3PL does not exhibit further improvement. The 4PL performs better than the 2PL at a significance level below 0.05, indicating a considerable improvement. Therefore, we will evaluate the actual performance of both the 2PL and the 4PL models.

We will examine the difficulty and discrimination parameters of the items using the classification framework proposed by Bichi and Talib (2018), as outlined in Tables 10 and 11. It should be noted that these authors developed this framework for the 1- and 2-parameter logistic models rather than for more complex IRT models. For this reason, the 2PL item parameters are the ones presented in this report, allowing their quality to be evaluated qualitatively; the elevated discrimination parameters of the more complex models are not as interpretable.

Table 10
Interpreting Discrimination Parameter Values
Discrimination Value Quality of an Item
a ≥ 1.70 Item is functioning satisfactorily
1.35 ≤ a ≤ 1.69 Good item; little or no revision is required
0.65 ≤ a ≤ 1.34 Moderate: little revision is required
0.35 ≤ a ≤ 0.64 Item is marginal and needs revision
a ≤ 0.34 Poor item; should be eliminated or revised
Table 11
Interpreting Difficulty Parameter Values
Difficulty Value Interpretation
-3.00 ≤ b < -2.00 Very easy
-2.00 ≤ b < -1.00 Easy
-1.00 ≤ b < 1.00 Moderately difficult
1.00 ≤ b ≤ 2.00 Difficult
2.00 < b Very difficult


Table 12 presents a comprehensive overview of the item parameters for domain A together with their associated interpretations. According to this interpretation, nine items would benefit from revision, specifically items X------, X------, X------, X------, X------, X------, X------, X------, and X------.

A close look at the item parameters shows that there are no excellent items in domain A, nor are there any very difficult items. The only difficult item in domain A also happens to be poor in discrimination. Among the moderately difficult items, 2 are poor, 2 are marginal, 5 are moderate, and none are good. Among the easy and very easy items, 1 is good and 14 are moderate; the remaining 4 are poor to marginal. For future revisions of the scale, it appears that domain A would benefit from the development of good to excellent difficult and moderately difficult items.

Table 12
Domain A Item Parameters
Item.ID a b Discrimination Difficulty
X-------- 0.541382 -1.42041 Marginal Easy
X-------- 0.332465 0.449927 Poor Moderately difficult
X-------- 1.042928 -1.19209 Moderate Easy
X-------- 0.864055 -0.47399 Moderate Moderately difficult
X-------- 0.441706 -0.92487 Marginal Moderately difficult
X-------- 1.139931 -0.69492 Moderate Moderately difficult
X-------- 0.212683 1.498227 Poor Difficult
X-------- 0.818085 -2.29211 Moderate Very easy
X-------- 0.930479 -1.24725 Moderate Easy
X-------- 0.615511 -0.63679 Marginal Moderately difficult
X-------- 0.526057 -2.70197 Marginal Very easy
X-------- 1.162845 -2.30821 Moderate Very easy

Figure 4

2PL (Top) and 4PL (Bottom) Test Information and Standard Error Functions and Scale Characteristic Curves for
Domain A

Figure 4 shows the test information function together with the standard error for domain A. The Test
Information Function (TIF) is a visual representation that delineates the precision of measurement across
different ability levels. This graphical depiction highlights the regions where the exam offers higher precision
scores and those where precision diminishes. Essentially, the TIF demonstrates the exam's accuracy at each level
of ability, showing ability regions where measurement precision has room for improvement. According to the graph in Figure 4, the location of maximum information, and hence the most precise scores, lies at an ability level of -1.48.

The value of the standard error at the minimum of the function equals 0.54. The Standard Error of Measurement (SEM) is the reciprocal of the square root of the information function. It denotes the degree of variability in test scores around an individual's ability level. Lower SEM values correspond to greater information, and the SEM's minimum coincides with the peak of the TIF.
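
A sketch of how the TIF and the conditional standard error can be extracted from a fitted mirt model is given below; mod_2pl_A stands for a hypothetical domain A 2PL fit:

```r
theta_grid <- seq(-4, 4, by = 0.01)

# Test information across the ability range and the corresponding conditional
# standard error, SE(theta) = 1 / sqrt(I(theta))
info <- testinfo(mod_2pl_A, Theta = matrix(theta_grid))
sem  <- 1 / sqrt(info)

theta_grid[which.max(info)]   # ability level of maximum information
min(sem)                      # standard error at that point
```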

Because domain scores are not directly related to overall exam scores, a straightforward approach to assess the
ability level at the pass-mark is by treating the domain scale as a separate exam. In this way, we can find the
proportion of individuals who correctly answered 60% or more of the items that were administered to them
within that particular domain. Based on the pass rate for this scale, which is roughly 81%, the ability level at the
pass-mark can be estimated at around -0.87, making this domain the easiest one within the exam. The standard error at this ability level stands at 0.49, indicating a borderline level of precision at this ability level. The
horizontal dotted line on the test information function plot illustrates the upper limit of the acceptable standard
error, set at 0.5. The vertical dotted line pinpoints the location of maximum information, while the vertical
dashed line denotes the ability level at the pass-mark. This representation is designed to make the distance
between maximum information and the pass-mark clearly visible. In case of the 2PL model for domain A the
distance between maximum information and the pass-mark is 0.61.

The inclined dotted line on the scale characteristic curve represents the slope at the maximum information
point. In case of domain A, the slope at that point equals approximately 5. Although there is no universally
accepted guideline for interpreting these values, they should prove valuable for comparative analysis across
various models or domains. Generally, larger slopes indicate better performance. By way of comparison, the
slope of the scale characteristic curve for the 3PL model is merely 4.5. This finding aligns with the results of the
analysis of variance, which demonstrated that its performance is not superior to that of the 2PL model.

As for the case of the 4PL model, the slope of its scale characteristic curve at the maximum information point is
7.88, significantly steeper than the slope of the 2PL SCC. However, this steep slope is confined to a very narrow
region and does not extend to the pass-mark. At the pass-mark, the standard error for the 4PL model is 0.57,
compared to the 0.49 of the 2PL. Figure 5 visually presents the test information function and the scale
characteristic curve for the domain as they manifest in the context of the 4PL model.

Looking at the test information plot, it is obvious that if the decision to employ the 4PL model is made, the
recommendation for item development here is to concentrate on items with difficulties matching the pass-mark
ability in order to maximize information between the two distinctive maxima.

The next step involves generating an item-person map for each of the two competing models. This visualization
approach aligns the distribution of examinees' scores and item difficulties on a common scale, enabling a visual
evaluation of how item difficulties correspond to the ability scale and how the latent ability estimates are
distributed among examinees.

Figure 5
Domain A Item-Person Maps for the 2PL (Left) and the 4PL (Right) Models

A close look at the item-person maps for the 2PL and 4PL models in Figure 5 unveils the cause of the steep decline in test information around the pass-mark noted above: based on the item difficulties estimated by the 4PL model, a noticeable gap exists in that region.

Figure 6 shows how the conditional reliability of the 2PL scores differs from that of the 4PL. We can see that in the center of the theta range, the 4PL provides more reliable scores throughout, except for a short segment right at and below the pass-mark (vertical dashed line). Specifically at the pass-mark, the reliability of the 4PL scores falls behind the 2PL scores, amounting to 0.80 for the 2PL and only 0.75 for the 4PL. Anywhere below an ability of -1.7 and above 1.5, the 2PL outperforms the 4PL, but the majority of examinees fall between these values, so if the gap in item difficulties were filled, the 4PL would likely lead in performance.

Figure 6
2PL and 4PL Conditional Reliability for Domain A

To determine which model is more reliable overall, we can employ a metric known as marginal reliability (Thissen & Wainer, 2001). This metric is useful when there is a need to present a single number that summarizes test precision for tests constructed using IRT. Its distinctive feature is that it weights the value of the standard error at each level of ability by the population density; as a result, it grants greater significance to regions with a higher concentration of examinees. The marginal reliability for the 2PL model is 0.73, while for the 4PL model it reaches 0.80. Thus, the superiority of the 4PL model over the 2PL model is confirmed in terms of overall reliability.
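
In mirt, this quantity is available directly via marginal_rxx(); a sketch with the two hypothetical domain A fits introduced earlier:

```r
# Marginal reliability: conditional precision integrated over the assumed
# latent trait density, so densely populated ability regions weigh more
marginal_rxx(mod_2pl_A)   # approximately 0.73 according to this report
marginal_rxx(mod_4pl_A)   # approximately 0.80 according to this report
```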

Local Independence

Computation of residual correlations shows that the problem of local dependence remains at the domain level as well. The most locally dependent items of domain A are X763714, X763578, and X763892; their residuals correlate with those of as many as 20 other items. Unlike at the full-exam level, this problem more likely stems from the presence of secondary factors than from redundant exam length. Looking at the exam blueprint, it can be seen that each domain has its own 'sub-domains', which serve as secondary factors and make the domain factor structure more complex. This can also serve as an explanation for the local dependence problem.

Simulation Results

In the previous section, we observed that both the 2PL and the 4PL models compete to explain the data-generating process for domain A. However, as we proceeded further with the analysis, a discrepancy emerged between the model recommendations based on statistical criteria (such as AIC, BIC, SABIC, HQ, and ANOVA) and the recommendations derived from the simulation results. Therefore, we will present simulation
results for all four IRT models to gain a comprehensive understanding of how the data behaves under the
current conditions and how it might behave with different item parameter settings.

In order to generate a simulated item pool, the true distribution of the item parameters has to be identified first. For that purpose, the actual distribution of item parameters was compared to a number of theoretical distributions using the Kolmogorov-Smirnov test, which evaluates whether an underlying distribution differs from a hypothesized one; a larger p-value indicates a better fit. The parameters of the fitted distributions for each of the models are provided in Table 13.
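
A sketch of this procedure for a single parameter is shown below, assuming a vector b_2pl of estimated 2PL difficulties and using maximum-likelihood fitting from the MASS package; the report may have used a different fitting routine:

```r
library(MASS)

# Fit a candidate theoretical distribution to the observed difficulty values
fit_norm <- fitdistr(b_2pl, "normal")

# One-sample Kolmogorov-Smirnov test against the fitted normal distribution;
# a large p-value means the candidate distribution is not rejected
ks.test(b_2pl, "pnorm",
        mean = fit_norm$estimate["mean"],
        sd   = fit_norm$estimate["sd"])
```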

Table 13
2PL and 4PL Item Parameter Distributions for Domain A
Model Parameter Distribution Parameter1 Parameter2 p-Value
Rasch Difficulty (b) norm -0.7039 0.7549 0.791982
2PL Discrimination (a) normal 0.7576 0.2974 0.999042
Difficulty (b) normal -1.2151 1.0378 0.783529
3PL Discrimination (a) norm 1.1212 0.8161 0.200031
Difficulty (b) norm -0.5073 1.2507 0.999995
Guessing (c) beta 0.4761 2.164 0.871083
4PL Discrimination (a) normal 5.9185 6.9796 0.07959
Difficulty (b) normal -0.4579 0.9684 0.665097
Guessing (c) beta 0.9303 1.9244 0.123378
Slipping (d) beta 4.2968 0.497 0.438694
As seen in Table 13, the difficulty parameters estimated by all the models follow a normal distribution. For the Rasch model, there is no variability in the discrimination parameter; it is fixed at 1 for all items and is therefore not provided here.

The discrimination parameter of the 4PL model does not seem to follow the normal distribution very well, and it is on average much higher than that of the 2PL (a mean of 0.76 for the 2PL vs. 5.92 for the 4PL). At the same time, it has much more variability, because there are a number of items with extremely high discrimination parameters.

The explanation of this shift in discrimination of items is as follows: the 4PL model introduces two additional
parameters to the traditional discrimination and difficulty parameters found in the 2PL model. One of them is
the "guessing" parameter (the “lower asymptote”), which accounts for the probability of a test-taker guessing
the correct answer even when their true ability is low. The second additional parameter is the "upper
asymptote," which addresses the phenomenon where individuals with very high ability might still occasionally
make mistakes on an item (so called “slipping”).

Figure 7
Item Characteristic Curves of a Sample Item as Computed by the 4PL (Left) and 2PL (right) Models

Figure 7 demonstrates an example of an item characteristic curve as computed by the 4PL and by the 2PL
model. It presents two item characteristic curves of the same item, one obtained by fitting a 4PL model and the
other obtained by fitting a 2PL model. It is apparent that for the 4PL model, the probability of correctly
answering this item levels off around 0.38 instead of zero as in the 2PL. This means that even individuals with
lower abilities have a 38% chance of answering the item correctly, introducing the guessing component.
Similarly, at the opposite extreme of the ability continuum, we observe that the probability doesn't reach 1, but
instead levels off at around 0.8. This observation suggests that even the most proficient test takers still face a
20% chance of making an incorrect response.

Comparing the 4PL model's discrimination parameter distribution to that of the 2PL model, we observe an
impressive increase from 0.95 (for the 2PL) to 7.3 (for the 4PL). However, the noticeable increase in the
discrimination parameter within the four-parameter logistic model can be largely attributed to the inclusion of
parameters for guessing and slipping. There are certain items for which the guessing parameter exceeded 0.6, indicating that the probability of guessing the correct response is greater than that of responding accurately
based on true ability. It is crucial to recognize that the discrimination parameters obtained from the 4PL model
cannot be directly compared to those from the 2PL model, and while these items may have considerably higher
values of their discrimination parameters, this doesn't necessarily translate into better distinguishing between
varying ability levels. Rather, it emphasizes the influence of heightened guessing or, in specific cases, a tendency
towards slipping or making careless errors.

Returning to our simulation, Figure 8 visualizes the distribution of the difficulty parameter as estimated by the four models. The aim of this chart is to help visualize how each of the IRT models estimated
the difficulty levels of the items and their respective distributions. We can immediately see how the models differ in their estimation of item difficulty in general. For example, the Rasch model offers the narrowest distribution of the difficulty parameter, with the majority of items around -1. The 2PL model also appears to estimate the items as somewhat easier than the other models do.

Figure 8
Rasch, 2PL, 3PL and 4PL Difficulty Parameter Distributions for Domain A

Table 14 provides the mean and standard deviation values of the simulated difficulty parameter distributions for each model. Each model has its own location of maximum information and mean difficulty, but the distribution of ability and the ability at the pass-mark are the same for all models.

Table 14
Simulated Difficulty Parameter Distributions for Domain A
Rasch 2PL 3PL 4PL
Scenario Mean SD Mean SD Mean SD Mean SD
1 -0.7798 0.566175 -1.4839 0.77835 -0.53031 0.938025 -1.3716 0.7263
2 -0.7798 0.7549 -1.4839 1 -0.53031 1 -1.3716 0.9684
3 -0.7798 1 -1.4839 1.0378 -0.53031 1.2507 -1.3716 1
4 -0.7798 1.13235 -1.4839 1.5567 -0.53031 1.87605 -1.3716 1.4526
5 -0.7039 0.566175 -1.2151 0.77835 -0.5073 0.938025 -0.86694 0.7263
6 -0.7039 0.7549 -1.2151 1 -0.5073 1 -0.86694 0.9684
7 -0.7039 1 -1.2151 1.0378 -0.5073 1.2507 -0.86694 1
8 -0.7039 1.13235 -1.2151 1.5567 -0.5073 1.87605 -0.86694 1.4526
9 -0.53031 0.566175 -0.86694 0.77835 0 0.938025 -0.4579 0.7263
10 -0.53031 0.7549 -0.86694 1 0 1 -0.4579 0.9684
11 -0.53031 1 -0.86694 1.0378 0 1.2507 -0.4579 1
12 -0.53031 1.13235 -0.86694 1.5567 0 1.87605 -0.4579 1.4526
13 0 0.566175 0 0.77835 1.4365 0.938025 0 0.7263
14 0 0.7549 0 1 1.4365 1 0 0.9684
15 0 1 0 1.0378 1.4365 1.2507 0 1
16 0 1.13235 0 1.5567 1.4365 1.87605 0 1.4526
Figure 9
Simulation Results for Domain A

The visualization in Figure 9 is designed so that the various distribution configurations can be discerned on the chart. The scenarios are ordered first by the value of the mean and then by the value of the SD, both from smallest to largest. Each mean value is represented by a contrasting color and each SD by a different shade of that color: all shades of red represent the lowest mean value for each model, followed by green, blue, and purple. Within the shades of each color, the darkest represents the smallest SD and the lightest represents the largest SD, and therefore the widest distribution. The specific values differ between models, but the general logic is that the same color represents the same mean, and the brighter the shade, the larger the SD.

The charts in Figure 9, and the corresponding charts for the other domains, are organized so that the top row represents the most crucial indicator, which, in our case, is the hit-rate. This indicator assesses the actual classification accuracy of the ability estimates. Upon reviewing the results, it becomes evident that the Rasch model outperforms the others, with a hit-rate above 70% for all difficulty distribution configurations across all item pool sizes. The superiority of the Rasch model is also evidenced by the RMSD, which equals roughly 0.5 and remains constant across all scenarios.

An interesting discovery is that all the metrics we've assessed, ranging from hit-rate to gamma (note that
overlap is not influenced by the difficulty distribution), show a lack of sensitivity to variations in difficulty
distributions when dealing with models that have higher hit-rates and lower RMSD. Furthermore, certain
metrics, such as hit-rate, RMSD, bias, and CCR, remain unaffected by changes in the size of the item pool.

The classification accuracy and consistency metrics show a consistent but gradual increase as the item pool size
expands. It's worth noting that their initial values are already relatively high. Even when the item pool size is
only twice the number of administered items, the values of tau and gamma for the Rasch model are already
above 0.85. This suggests that these metrics are sufficiently high for the current test configuration.

The results of the simulation for Domain H are consistent with all the previous ones. The model recommended as a result of the simulation is again the Rasch model, with a hit-rate of around 0.6 and an RMSD below 0.5. The classification quality metrics suggest that a desirable minimal size of the item pool is at least three times the number of administered Domain H items.

TOTAL SCORES

Once we've identified the best-performing model and applied it to all the ---_---- domains, a practical and
straightforward approach to calculating each examinee's overall ability score is to compute a weighted average of the
ability estimates across all domains. These weighting coefficients should correspond to the proportion of each domain's
items in the exam, as outlined in the blueprint. Utilizing this approach, we calculated a weighted IRT-based total ability
score for each examinee. Taking into account our sample's pass rate (which stood at 69.8%), we established the cut-off
ability score at -0.5173564.
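
A sketch of this weighted-average scoring is shown below, assuming a matrix theta_domain of per-domain ability estimates (one column per domain) and a vector blueprint with the number of exam items each domain contributes; the object names and the quantile-based way of reproducing the cut-off are illustrative:

```r
# Blueprint weights: each domain's share of the items in the exam
weights <- blueprint / sum(blueprint)

# Weighted average of the domain ability estimates for every examinee
theta_total <- as.vector(theta_domain %*% weights)

# One way to set the cut-off so that the proportion passing matches the
# observed pass rate of 69.8%
pass_rate <- 0.698
cutoff    <- quantile(theta_total, probs = 1 - pass_rate)
```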

Figure 81 provides a visual representation of the alignment between the sum score and the IRT-based classifications across
different models. It suggests that IRT-based classifications tend to be more forgiving overall. Notably, there are no
instances where examinees are classified as failing based on their IRT-based estimation while simultaneously classified as
passing based on their sum score. However, there are varying proportions of examinees who are classified as failing by the
sum score but pass based on the IRT-based estimation. Specifically, these proportions are 22% for the Rasch model and 15% for the rest of the models. These results imply that the IRT-based scoring method, which considers the difficulty levels of
correctly answered questions, exhibits greater sensitivity than the sum score and is able to identify examinees who exhibit
satisfactory ability but didn’t reach the 60% pass-mark.

Figure 81
Agreement Between Total Sum Score and Weighted Ability Estimate-Based Classifications by Each of the IRT Models

CONCLUSION

M. Stocking’s rule of thumb, as referenced by various authors (e.g., Stocking, 1994, cited in Way, 2010),
suggested that an ideal CAT item pool should contain 10-12 times the number of items to be administered in
an exam. Our study suggests that substantial savings in item development effort are possible compared to this traditional approach. Our simulation procedure revealed that for the majority of the --- ---- and ---- domains, if the Rasch model is employed, an item pool size of twice the exam length is sufficient, with larger item pools offering no substantial improvement in terms of score precision. Classification accuracy and consistency grow gradually as the item pool size increases, but no configuration is characterized by a sharp increase in these indices, and their starting values are already reasonably high. It is an essential finding of this research that employing the Rasch model will result in sufficiently precise scores and reliable classification even with modestly sized item pools, without the need to go beyond an item pool size of twice the length of the test.

Another noteworthy observation is that across all domains, the simulation results consistently demonstrated
that all variants of difficulty parameter distributions yielded similar levels of precision and accuracy at each level
of the item pool size. This implies that the specific difficulty levels of items and the range of the difficulty
distribution have limited impact on the precision of the scores. Furthermore, it became evident that, in terms of
classification accuracy, the primary factor of significance is the size of the item pool. It is essential to note that
the simulated item banks represent idealized versions of reality, and it is nearly impossible to fine-tune item
difficulty distribution during the initial stages of item development. In practical scenarios, these aspects are
typically refined after piloting and collecting data, rather than being determined beforehand. It can be suggested
that further examination is required to evaluate whether or not the score precision and classification metrics are
affected by variation in the distribution of the discrimination parameter rather than the difficulty.

It was demonstrated that IRT-based scoring exhibits greater sensitivity than the sum scores and can identify examinees who demonstrate sufficient ability despite falling below the conventional 60% pass-mark. Based on this finding, implementing weighted IRT-based scoring offers the flexibility to adjust the cut-off score independently of the total number of correctly solved questions, thereby enabling more effective management of the pass rate and of the acceptable latent ability level. As a result, it offers the potential for a system with known precision that is robust in scenarios where isolated items are compromised and become overly easy, because their influence on the total score is reduced.

An additional finding worth noting is that a potential dysfunction in the item selection algorithm was flagged for Domains E and G of form ----, but not for the others. This finding is important because it potentially entails overexposure of certain items within those domains, and this issue should be examined.

Further investigation is required to determine the most effective strategies for safeguarding exam security;
overlap control and item exposure concerns in general remained outside the scope of the current study. The
most commonly used metric for controlling item exposure and security is the average test overlap rate, which
represents the average proportion of items shared by any two examinees. For exams with a fixed length and
purely random item selection, the expected value of this metric simplifies to the ratio of exam length to item
pool size, eliminating the need to examine it separately for each domain. The recommended maximum for this
metric in high-stakes exams is typically no more than a 20% average overlap rate. This recommendation forms
the basis for suggesting a general ratio of exam length to item pool size of one to five if that level of overlap is
to be maintained. This ratio accounts for exam security only, since score precision and classification accuracy
are already satisfied with a smaller item pool, and it represents a rule of thumb rather than a fully informed
decision.
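A small worked example, under the assumption of purely random selection with equal item exposure, makes the arithmetic explicit: the expected overlap rate is the exam length divided by the pool size, so holding it at 20% requires a pool of at least five times the exam length. The exam length used below is illustrative.

# Worked example assuming purely random selection with equal item exposure;
# the exam length used here is illustrative.
exam_length = 10        # items per domain (illustrative)
target_overlap = 0.20   # recommended ceiling for high-stakes exams

def expected_overlap(exam_length, pool_size):
    # Expected average test overlap rate under the equal-exposure assumption.
    return exam_length / pool_size

min_pool = exam_length / target_overlap  # smallest pool that keeps overlap at the ceiling
print(f"required pool size: {min_pool:.0f} items, "
      f"expected overlap = {expected_overlap(exam_length, min_pool):.0%}")
# -> required pool size: 50 items, expected overlap = 20%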

Overall, the current paper serves as an exemplary case of conflicting statistical recommendations and
unexpected outcomes. Across both forms of the --- exam, nearly every estimation procedure produced
different results regarding the preferred IRT model. Information criteria often favored the 2PL model, ANOVA
procedures consistently leaned toward the 4PL model, simulations invariably recommended the Rasch model,
and the best agreement with total scores was found with the 3PL model. These remarkable findings underscore
the notion that the role of any statistical analysis should be to support decision-making rather than replace it.
They highlight the importance of considering qualitative properties of scales and items, as well as taking a
broader perspective, rather than relying solely on numerical output, to form a comprehensive understanding of
the situation and make informed decisions.

ACCOMPANYING FILES

All the results of the analyses conducted in this study are organized in two separate folders for each of the ---
exam forms. Each of the folders contains the following files:

Table 109
Files Accompanying the Report and Their Contents
No. File Name Content
1 resultControl.csv A sheet that includes all the columns from
---_----_Control.csv (---_----_Control.csv), plus all item
statistics, including all the IRT item parameters. It only includes
items that were not discarded for being breached or for
violating the criteria of inclusion into the IRT analyses.
2 resultControlOriginal.csv A sheet that includes all the columns from
---_----_Control.csv (---_----_Control.csv), plus classical item
statistics. This sheet includes all the items listed in
---_----_Control.csv (---_----_Control.csv), including those
discarded for breach or for violating the analytical criteria.
3 resultCorrs.csv A sheet that provides correlations of residuals for all pairs of
items in each domain.
4 resultDataLong.csv ---_----_Dat.csv (---_----_Dat.csv), recoded into binary
responses (where 0 represents an incorrect response and 1
represents a correct one) and transformed into a long format
where each row is a single item response of an individual
examinee.
5 resultDataYN.csv ---_----_Dat.csv (---_----_Dat.csv), recoded into binary
responses (where 0 represents an incorrect response and 1
represents a correct one).
6 resultDepsItems.csv A sheet that lists items exhibiting residual
correlations above 0.3 with one or more items within their
respective domain, with the count of locally dependent items for
each listed item.
7 resultDomain.csv Domain statistics
8 resultDomainDiff.csv Comparative statistics before and after dropping items in the
exam-as-a-single-scale approach.
9 resultLocalDep.csv A list of all locally dependent item pairs with their corresponding
residual correlation.
10 resultModelAnova.csv The results of the ANOVA model comparisons.
11 resultModelFit.csv IRT model comparative fit indices for all the exam domains.
12 resultParametersDistribution.csv Candidate distributions and their distribution parameters for
each of the IRT item parameters, together with the
corresponding p-values used to identify the actual distribution
of each item parameter under each IRT model.
13 resultSimulationBest.csv A table with the best configuration (the mean and standard
deviation of the difficulty parameter and the best multiplier for
item pool size) resulting from the simulation procedure.
14 resultSimulationParameters.csv Item difficulty distribution parameters for the subsequent
simulation.
15 resultSimulationTotal.csv All indicator values as obtained by all the simulation scenarios.
16 resultTestTotal.csv Individual total scores at the domain and exam level, together
with exam-level latent ability estimates obtained from the 1-,
2-, 3-, and 4PL models for each examinee.
17 resultThetasTemp.csv Domain level individual latent ability estimates resulting from
each of the four IRT models.

REFERENCES

Arndt, C. (2012). Information Measures: Information and Its Description in Science and
Engineering. Springer Berlin Heidelberg.

Bichi, A. A., & Talib, R. (2018). Item Response Theory: An Introduction to Latent Trait Models to Test and Item
Development. International Journal of Evaluation and Research in Education, 7(2), 142-151.

Boyd, A. M., Dodd, B., & Fitzpatrick, S. (2013). A comparison of exposure control procedures in CAT systems
based on different measurement models for testlets. Applied Measurement in Education, 26(2), 113-
135.

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal
of Statistical Software, 48, 1-29.

Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal
of Educational and Behavioral Statistics, 22(3), 265-289.

Demir, S. (2022). The effect of item pool and selection algorithms on computerized classification testing (CCT)
performance. Journal of Educational Technology & Online Learning, 5(3), 573-584.

Kang, T., & Cohen, A. S. (2007). IRT Model Selection Methods for Dichotomous Items. Applied Psychological
Measurement, 31(4), 331-358.

Kezer, F. (2021). The Effect of Item Pools of Different Strengths on the Test Results of Computerized-Adaptive
Testing. International Journal of Assessment Tools in Education, 8(1), 145-155.

Sen, S., & Bradshaw, L. (2017). Comparison of relative fit indices for diagnostic model selection. Applied
Psychological Measurement, 41(6), 422-438.

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools. ETS Research Report
Series, 1994(1), i-34.

Thissen, D., & Wainer, H. (Eds.). (2001). Test scoring. Lawrence Erlbaum Associates Publishers.

Way, W. D. (2010). Some perspectives on CAT for K-12 assessments. In National Conference on Student
Assessment, Detroit, MI.

Wyse, A. E., & Babcock, B. (2016). Does maximizing information at the cut score always maximize classification
accuracy and consistency? Journal of Educational Measurement, 53(1), 23-44.

Wyse, A. E., & Hao, S. (2012). An evaluation of item response theory classification accuracy and consistency
indices. Applied Psychological Measurement, 36(7), 602-624.

Yen, W. M. (1984). Effect of local item dependence on the fit and equating performance of the three-parameter
logistic model. Applied Psychological Measurement, 8, 125-145.
