# Examining exams using Rasch models and assessment of measurement invariance

Models from psychometric item response theory are used to analyze the results from a large introductory mathematics exam in order to gain insights about student abilities, question difficulties, and heterogeneities of these in subgroups.

## Citation

Achim Zeileis (2024). “Examining Exams Using Rasch Models and Assessment of Measurement Invariance.” *arXiv.org E-Print Archive* arXiv:2409.19522 [stat.AP]. doi:10.48550/arXiv.2409.19522

## Abstract

Many statisticians regularly teach large lecture courses on statistics, probability, or mathematics for students from other fields such as business and economics, social sciences and psychology, etc. The corresponding exams often use a multiple-choice or single-choice format and are typically evaluated and graded automatically, either by scanning printed exams or via online learning management systems. Although further examinations of these exams would be of interest, these are frequently not carried out. For example a measurement scale for the difficulty of the questions (or items) and the ability of the students (or subjects) could be established using psychometric item response theory (IRT) models. Moreover, based on such a model it could be assessed whether the exam is really fair for all participants or whether certain items are easier (or more difficult) for certain subgroups of students.

Here, several recent methods for assessing *measurement invariance* and for detecting *differential item functioning* in the Rasch IRT model are discussed and applied to results from a first-year mathematics exam with single-choice items. Several categorical, ordered, and numeric covariates like gender, prior experience, and prior mathematics knowledge are available to form potential subgroups with differential item functioning. Specifically, all analyses are demonstrated with a hands-on R tutorial using the `psycho*` family of R packages (*psychotools*, *psychotree*, *psychomix*) which provide a unified approach to estimating, visualizing, testing, mixing, and partitioning a range of psychometric models.

The paper is dedicated to the memory of Fritz Leisch (1968-2024) and his contributions to various aspects of this work are highlighted.

## Software

R packages:

- doi:10.32614/CRAN.package.psychotools
- doi:10.32614/CRAN.package.psychotree
- doi:10.32614/CRAN.package.psychomix

## Illustration

The strategies for analyzing exam results using psychometric item response theory (IRT) models are illustrated with Rasch models fitted to the results from a large introductory mathematics exam for economics and business students. Here, only a quick teaser is provided, showing how to visualize simple exploratory statistics and some model-based results. For the full analysis of the data, with special emphasis on the assessment of so-called measurement invariance, see the full paper cited above. The full replication code for all results in the paper is provided in: exams.R.

The data are available as `MathExam14W` in the psychotools package. The code below excludes the students who solved none or all of the exercises, as these responses do not discriminate between the exercises in terms of their difficulty. The response variable is `solved`, an object of class `itemresp`. Internally, it is essentially a 729 x 13 matrix with binary 0/1 coding plus some metainformation. As a first exploratory graphic, the `plot()` method shows a bar plot with the empirical frequencies of correctly solving each of the 13 exercises.

```
library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)
plot(mex$solved)
```

The plot demonstrates that most items have been solved correctly by about 40 to 80 percent of the students. The main exception is the payflow exercise (for which a certain integral had to be computed) which was solved correctly by less than 15 percent of the students.
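The empirical frequencies underlying the bar plot can also be extracted numerically. A minimal sketch, assuming the `as.matrix()` coercion for `itemresp` objects provided by psychotools:

```
library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)

## per-item proportion of correct responses (0/1 coding), sorted from
## hardest to easiest; assumes the as.matrix() method for itemresp objects
sort(colMeans(as.matrix(mex$solved)))
```

This puts numbers on the visual impression, with the payflow item at the low end of the solving frequencies.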

To establish a formal IRT model for this data, we employ a Rasch model that uses the differences between person abilities ${\mathit{\theta}}_{i}$ and item difficulties ${\mathit{\beta}}_{j}$ for describing the logit of the probability ${\mathit{\pi}}_{\mathit{ij}}$ that person $i$ correctly solves item $j$.

$$
\begin{aligned}
\pi_{ij} &= \text{Pr}(y_{ij} = 1) \\
\text{logit}(\pi_{ij}) &= \theta_i - \beta_j
\end{aligned}
$$
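For intuition, the model equation can be evaluated directly in base R: the probability of a correct response is the inverse logit of $\theta_i - \beta_j$. The parameter values below are purely illustrative:

```
## Rasch item response function: probability that a person with ability
## theta = 0.5 solves items of increasing difficulty beta
theta <- 0.5
beta <- c(easy = -1, medium = 0, hard = 2)
round(plogis(theta - beta), 3)
## ->  easy medium   hard
## -> 0.818  0.622  0.182
```

The further the ability exceeds the difficulty, the closer the solving probability gets to 1; equal ability and difficulty yield probability 0.5.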

The `raschmodel()` function estimates the item difficulties using conditional maximum likelihood, and the `plot()` method then shows the corresponding person abilities (as a bar plot) along with the item difficulties (as a dot chart) on the same latent trait scale.

```
mr <- raschmodel(mex$solved)
plot(mr, type = "piplot")
```

Qualitatively, the Rasch model-based person-item plot shows a pattern similar to the exploratory bar plot. However, due to the latent logistic scale, the most difficult item (payflow) and the easiest item (hesse) are brought out even more clearly. Also, the majority of the item difficulties are close to the median ability in this sample. Thus, the exam discriminates more sharply at the median difficulty and less sharply in the tails at very high or very low ability.
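The estimated difficulties underlying the person-item plot can also be inspected numerically via `itempar()`, which by default reports the item parameters under a mean-zero restriction:

```
library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)

## fit the Rasch model by conditional maximum likelihood and extract
## the item difficulties (mean-zero restriction by default)
mr <- raschmodel(mex$solved)
sort(coef(itempar(mr)))
```

This lists the items from easiest to hardest on the latent scale, with hesse and payflow at the two extremes.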

So far so good. However, the interpretation above is only reliable if all item difficulties are indeed the same for all students in the sample. If this is not the case, differences in the item responses would not necessarily be caused by differences in mathematics ability. The fundamental assumption that the difficulties are constant across all persons is a special case of so-called *measurement invariance*. And a violation of this assumption is known as *differential item functioning* (DIF), i.e., some item(s) is/are relatively easier for some subgroup of persons compared to others.

The main contribution of the paper is to detect such differential item functioning and investigate the potential sources of it. See the arXiv paper for all details and the full analysis.