Achim Zeileis (2024). “Examining Exams Using Rasch Models and Assessment of Measurement Invariance.” arXiv.org E-Print Archive arXiv:2409.19522 [stat.AP]. doi:10.48550/arXiv.2409.19522
Many statisticians regularly teach large lecture courses on statistics, probability, or mathematics for students from other fields such as business and economics, social sciences and psychology, etc. The corresponding exams often use a multiple-choice or single-choice format and are typically evaluated and graded automatically, either by scanning printed exams or via online learning management systems. Although further examination of these exams would be of interest, it is frequently not carried out. For example, a measurement scale for the difficulty of the questions (or items) and the ability of the students (or subjects) could be established using psychometric item response theory (IRT) models. Moreover, based on such a model it could be assessed whether the exam is really fair for all participants or whether certain items are easier (or more difficult) for certain subgroups of students.
Here, several recent methods for assessing measurement invariance and for detecting differential item functioning in the Rasch IRT model are discussed and applied to results from a first-year mathematics exam with single-choice items. Several categorical, ordered, and numeric covariates like gender, prior experience, and prior mathematics knowledge are available to form potential subgroups with differential item functioning. Specifically, all analyses are demonstrated with a hands-on R tutorial using the psycho* family of R packages (psychotools, psychotree, psychomix) which provide a unified approach to estimating, visualizing, testing, mixing, and partitioning a range of psychometric models.
The paper is dedicated to the memory of Fritz Leisch (1968-2024) and his contributions to various aspects of this work are highlighted.
The strategies for analyzing exam results using psychometric item response theory (IRT) models are illustrated with Rasch models fitted to the results from a large introductory mathematics exam for economics and business students. Here, only a quick teaser is provided, showing how to visualize simple exploratory statistics and some model-based results. For the full analysis of the data, which gives special emphasis to the assessment of so-called measurement invariance, see the full paper linked above. The full replication code for all results in the paper is provided in: exams.R.
The data are available as MathExam14W in the psychotools package. The code below excludes the students who solved none or all of the exercises, thus not discriminating between the exercises in terms of their difficulty. The response variable is solved, which is an object of class itemresp. Internally, it is essentially a 729 x 13 matrix with binary 0/1 coding plus some metainformation. As a first exploratory graphic, the plot() method shows a bar plot with empirical frequencies of correctly solving each of the 13 exercises.
library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)
plot(mex$solved)
The plot demonstrates that most items have been solved correctly by about 40 to 80 percent of the students. The main exception is the payflow exercise (for which a certain integral had to be computed) which was solved correctly by less than 15 percent of the students.
To establish a formal IRT model for this data, we employ a Rasch model that uses the differences between person abilities ${\mathit{\theta}}_{i}$ and item difficulties ${\mathit{\beta}}_{j}$ for describing the logit of the probability ${\mathit{\pi}}_{\mathit{ij}}$ that person $i$ correctly solves item $j$.
$$\text{logit}(\pi_{ij}) = \theta_i - \beta_j, \qquad \pi_{ij} = \text{Pr}(y_{ij} = 1)$$
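To make the Rasch model concrete, here is a minimal Python sketch (purely illustrative; the actual analysis uses the R function raschmodel() with conditional maximum likelihood) evaluating the solving probability implied by the logit equation above:

```python
import math

def rasch_prob(theta, beta):
    """Probability that a person with ability theta solves an item with
    difficulty beta under the Rasch model: logit(pi) = theta - beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# When ability equals difficulty, the solving probability is exactly 0.5;
# it increases towards 1 as theta exceeds beta and decreases towards 0 otherwise.
print(rasch_prob(0.0, 0.0))   # 0.5
```

This also explains why the exam discriminates best where item difficulties are close to the abilities: the logistic curve is steepest around theta - beta = 0.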
The raschmodel() function estimates the item difficulties using conditional maximum likelihood and the plot() method then shows the corresponding person abilities (as a bar plot) along with the item difficulties (as a dot chart) on the same latent trait scale.
mr <- raschmodel(mex$solved)
plot(mr, type = "piplot")
Qualitatively, the Rasch model-based person-item plot shows a similar pattern as the exploratory bar plot. However, due to the latent logistic scale the most difficult item (payflow) and the easiest item (hesse) are brought out even more clearly. Also, the majority of the item difficulties are close to the median ability in this sample. Thus, the exam discriminates more sharply around the median ability and less sharply in the tails, at very high or very low ability.
So far so good. However, the interpretation above is only reliable if all item difficulties are indeed the same for all students in the sample. If this is not the case, differences in the item responses would not necessarily be caused by differences in mathematics ability. The fundamental assumption that the difficulties are constant across all persons is a special case of so-called measurement invariance. A violation of this assumption is known as differential item functioning (DIF), i.e., some items are relatively easier for some subgroups of persons than for others.
The main contribution of the paper is to detect such differential item functioning and investigate the potential sources of it. See the arXiv paper for all details and the full analysis.
To illustrate the benefits of extended-support beta regression models, suggested in a recent arXiv paper with Ioannis Kosmidis, we revisit the analysis of a behavioral economics experiment conducted and published by Glätzle-Rützler et al. (2015, Journal of Economic Behavior & Organization, doi:10.1016/j.jebo.2014.12.021). The outcome variable is the proportion of tokens invested by high-school students in a risky lottery with positive expected payouts. Glätzle-Rützler et al. focused on the effects of several experimental factors on the mean investments, which reflect the players’ willingness to take risks. In their study they employed linear regression models, estimated by ordinary least squares (OLS) with standard errors adjusted for potential clustering and heteroscedasticity.
Here, we extend the analysis from Glätzle-Rützler et al. by employing a similar model for the mean investments but additionally exploring distributional specifications that allow for a probabilistic, rather than mean-only, interpretation of the effects. From an economic perspective this is of interest because it allows us to interpret both the mean willingness to take risks in this experiment and the probability of behaving like a rational Homo oeconomicus, who would invest (almost) all tokens in this lottery because it has positive expected payouts.
The full replication code for the analyses from the arXiv paper is available in lossaversion.R with some auxiliary functions in beta01.R. Below we show only the most important R snippets to give a feeling for the workflow in R. The rest of the discussion highlights the main insights from the analysis.
An aggregated version of the data from all nine rounds of the experiment is available as LossAversion in the betareg package. Interest is in linking the variable invest, the proportion of the total tokens invested in all nine rounds, to the following explanatory information:
- grade: Is the player from the lower grades (6-8) or the upper grades (10-12)?
- arrangement: Is the player an individual or a team of two?
- male: Is (at least one of) the player(s) male?
- age: (Average) age of the player(s) in years.

We compare four different models for invest, which all employ the same equation for the mean submodel. All models except the OLS reference additionally employ the main effects of the three experimental factors in the dispersion submodel.
Normal linear model (N) with constant variance, corresponding to the OLS approach from the original study. In R, this can be fitted with the base lm() or glm() function.
data("LossAversion", package = "betareg")
la_ols <- glm(invest ~ grade * (arrangement + age) + male, data = LossAversion)
summary(la_ols)
Heteroscedastic censored normal model (CN), also known as heteroscedastic two-limit tobit model in econometrics. This can be fitted with the crch package (for censored regression with conditional heteroscedasticity).
library("crch")
la_htobit <- crch(invest ~ grade * (arrangement + age) + male | arrangement + male + grade,
data = LossAversion, left = 0, right = 1)
summary(la_htobit)
Beta regression (B) after ad-hoc scaling of the investments to the open unit interval (to avoid the boundary observations). This can be fitted with the betareg package.
library("betareg")
LossAversion$invests <- (LossAversion$invest * (nrow(LossAversion) - 1) + 0.5)/
nrow(LossAversion)
la_beta <- betareg(invests ~ grade * (arrangement + age) + male | arrangement + male + grade,
data = LossAversion)
summary(la_beta)
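The ad-hoc rescaling used for model B above squeezes the boundary observations into the open unit interval. The same transformation, written language-agnostically as a short Python sketch (the sample size 570 below is illustrative only, not the actual number of rows in LossAversion):

```python
def squeeze(y, n):
    """Rescale y in [0, 1] into the open interval (0, 1),
    mirroring the R code above: (y * (n - 1) + 0.5) / n."""
    return (y * (n - 1) + 0.5) / n

n = 570  # illustrative sample size
# Boundary values move strictly inside (0, 1); the midpoint stays at 0.5.
print(squeeze(0.0, n), squeeze(0.5, n), squeeze(1.0, n))
```

Note that this transformation shrinks all observations slightly towards 0.5, which is exactly the ad-hoc aspect that the XBX model below avoids.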
Extended-support beta mixture model (XBX) with the same specification as B but adding an extra exceedance parameter to be estimated (instead of the ad-hoc scaling). This can also be fitted with betareg since version 3.2-0, with XBX regression being selected automatically in case of boundary observations in the response.
la_xbx <- betareg(invest ~ grade * (arrangement + age) + male | arrangement + male + grade,
data = LossAversion)
summary(la_xbx)
If you run the code and compare the model summaries, note that the coefficients from N and CN use an identity link for the mean parameter whereas B and XBX use a logit link. In addition, the log-likelihood, and hence AIC and BIC, are comparable only between CN and XBX because those two models have the same support for the response variable, that is, the unit interval with point masses at 0 and 1. See also the accompanying arXiv paper for the full summary tables. By and large, the mean parameters for the N and CN models are rather similar, and those for B and XBX are rather similar, with only some differences due to using models with point masses on the boundaries (CN and XBX) or not (N and B).
However, instead of studying the individual estimated coefficients in more detail we rather assess the models graphically by visualizing their goodness of fit and different types of fitted effects.
To illustrate how different the fitted probability distributions of the four models are, we employ so-called hanging rootograms. These compare the empirical marginal distributions of the response variable (proportion of tokens invested) to the aggregated fitted distributions from the models. The quality of the fit can be judged by the deviation of the hanging bars from the zero reference line.
An object-oriented implementation of rootograms is available, along with other tools for working with probabilistic models, in the topmodels package on R-Forge (hopefully soon to be submitted to CRAN). You can install it from R-Forge or R-universe and then create the rootograms for the four models:
install.packages("topmodels", repos = "https://zeileis.R-universe.dev")
library("topmodels")
rootogram(la_ols, breaks = -6:16 / 10, main = "N")
rootogram(la_htobit, main = "CN")
rootogram(la_beta, main = "B")
rootogram(la_xbx, main = "XBX")
A more refined version of the plots is shown below. See the full replication script linked above for the code details.
The square roots of the expected frequencies are shown as red dots and the square roots of the observed frequencies hang from these points as gray bars. The dashed lines are the Tukey warning limits at +/- 1. These plots show that models N and B fit poorly in the tails. In contrast, models CN and XBX fit very well with almost all bars hanging close to the zero reference line. The fit for XBX appears to be slightly better than for CN.
While models XBX and CN provide a much better probabilistic fit than their uncensored counterparts B and N, it turns out that the predicted mean investments from all four models are still very similar. But XBX and CN allow for interpretations beyond the mean, including economically relevant interpretations of probability effects.
For illustration, we focus on the team arrangement effect for a subsample with a large share of very rational subjects: male players or teams with at least one male, in grades 10-12, and between 15 and 17 years of age. The figure below shows the estimated arrangement effect for the mean E(Y), i.e., the expected proportion of tokens invested, and for the probability to behave very rationally and invest almost everything, i.e., P(Y > 0.95). The empirical quantities are shown in black for the subsample between 15 and 17 years of age while the model-based effects are shown at an age of 16.
The graphic shows that all models do a reasonable job in estimating E(Y) but the censored models XBX and CN are much better at estimating the probability to behave very rationally. Similarly, it could be shown that the fit for the probability P(Y < 0.05) is also much better for XBX and CN than for B and N, but this is not done here because that probability does not have such an appealing economic interpretation as P(Y > 0.95).
For obtaining the model-based effects, as shown above, the procast() function (for probabilistic forecasts) from the topmodels package can be used. This is again an object-oriented implementation that facilitates obtaining not only moments (such as means and variances) but also entire probability distributions (as S3 objects) and corresponding probabilities, densities, and quantiles.
Here, we only briefly show the code for the fitted XBX model but the same function calls can be applied to the other fitted model objects. First, we set up the new data that only varies arrangement from single to team but keeps all other variables fixed. Then, both kinds of effects are computed with procast().
la_nd <- data.frame(arrangement = c("single", "team"), male = "yes", age = 16, grade = "10-12")
procast(la_xbx, newdata = la_nd, type = "mean")
## mean
## 1 0.4713
## 2 0.6861
procast(la_xbx, newdata = la_nd, type = "cdf", at = 0.95, lower.tail = FALSE)
## probability
## 1 0.07161
## 2 0.18501
Thus, the mean invested proportion goes up from 47.1% to 68.6% for teams vs. single players in this setting, while the probability to behave almost fully rationally increases from 7.2% to 18.5%.
Again, the full code for creating the figure and underlying table is provided in the replication script linked above.
The script also includes some further illustrations, e.g., the comparison with three-part hurdle models for “zero-and-one-inflated” beta regression. However, these models do not work well here: unappealing interpretation, too many parameters, quasi-complete separation of boundary and non-boundary observations. Hence, we do not show the details in this post.
Ioannis Kosmidis, Achim Zeileis (2024). “Extended-Support Beta Regression for [0, 1] Responses.” arXiv.org E-Print Archive arXiv:2409.07233 [stat.ME]. doi:10.48550/arXiv.2409.07233
We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of (0,1). Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both 0 and 1 – known as the heteroscedastic two-limit tobit model in the econometrics literature – are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distribution for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models.
The data for modeling the occurrence and extent of loss aversion in a behavioral economics experiment is available as LossAversion in the betareg package. The corresponding examples also replicate some of the models from the paper. The full replication of the case study will be discussed in another forthcoming blog post.
Reto Stauffer, Achim Zeileis (2024). “colorspace: A Python Toolbox for Manipulating and Assessing Colors and Palettes.” arXiv.org E-Print Archive arXiv:2407.19921 [cs.GR]. doi:10.48550/arXiv.2407.19921
The Python colorspace package provides a toolbox for mapping between different color spaces which can then be used to generate a wide range of perceptually-based color palettes for qualitative or quantitative (sequential or diverging) information. These palettes (as well as any other sets of colors) can be visualized, assessed, and manipulated in various ways, e.g., by color swatches, emulating the effects of color vision deficiencies, or depicting the perceptual properties. Finally, the color palettes generated by the package can be easily integrated into standard visualization workflows in Python, e.g., using matplotlib, seaborn, or plotly.
Color is an integral element of visualizations and graphics and is essential for communicating (scientific) information. However, colors need to be chosen carefully so that they support the information displayed for all viewers (see e.g., Tufte 1990; Ware 2004; Wilke 2019). Therefore, suitable color palettes have been proposed in the literature (e.g., Brewer 1999; Ihaka 2003; Crameri, Shephard, and Heron 2020) and many software packages transitioned to better color defaults over the last decade. A prominent example from the Python community is matplotlib 2.0 (Hunter, Dale, Firing, Droettboom, and the Matplotlib Development Team 2017) which replaced the classic “jet” palette (a variation of the infamous “rainbow”) by the perceptually-based “viridis” palette. Hence a wide range of useful palettes for different purposes is provided in a number of Python packages today, including cmcrameri (Rollo 2024), colormap (Cokelaer 2024), colormaps (Patel 2024), matplotlib (Hunter 2007), palettable (Davis 2023), or seaborn (Waskom 2021).
However, in most graphics packages colors are provided as a fixed set. While this makes it easy to use them in different applications, it is usually not easy to modify the perceptual properties or to set up new palettes following the same principles. The colorspace package addresses this by supporting color descriptions using different color spaces (hence the package name), including some that are based on human color perception. One notable example is the hue-chroma-luminance (HCL) model which represents colors by coordinates on three perceptually-based axes: hue (type of color), chroma (colorfulness), and luminance (brightness). Selecting colors along paths through this space allows for the intuitive construction of palettes that closely match many of the palettes provided in the packages listed above.
In addition to functions and interactive apps for HCL-based colors, the colorspace package also offers functions and classes for handling, transforming, and visualizing color palettes (from any source). In particular, this includes the simulation of color vision deficiencies (Machado, Oliveira, and Fernandes 2009) but also contrast ratios, desaturation, lightening/darkening, etc.
The colorspace Python package was inspired by the eponymous R package (Zeileis, Fisher, Hornik, Ihaka, McWhite, Murrell, Stauffer, and Wilke 2020). It comes with extensive documentation at https://retostauffer.github.io/python-colorspace/, including many practical examples. Selected highlights are presented in the following.
The key functions and classes for constructing color palettes using hue-chroma-luminance paths (and then mapping these to hex codes) are:
- qualitative_hcl: For qualitative or unordered categorical information, where every color should receive a similar perceptual weight.
- sequential_hcl: For ordered/numeric information from high to low (or vice versa).
- diverging_hcl: For ordered/numeric information around a central neutral value, where colors diverge from neutral to two extremes.

These functions provide a range of named palettes inspired by well-established packages but actually implemented using HCL paths. Additionally, the HCL parameters can be modified or new palettes can be created from scratch.
As an example, the figure below depicts color swatches for four viridis variations. The first, pal1, sets up the palette from its name. It is identical to the second, pal2, which employs the HCL specification directly: the hue ranges from purple (300) to yellow (75), colorfulness (chroma) increases from 40 to 95, and luminance (brightness) from dark (15) to light (90). The power parameter chooses a linear change in chroma and a slightly nonlinear path for luminance.
In pal3 and pal4 most HCL properties are kept the same but some are modified: pal3 uses a triangular chroma path from 40 via 90 to 20, yielding muted colors at the end of the palette. pal4 just changes the starting hue of the palette to green (200) instead of purple. All four palettes are visualized by the swatchplot function from the package.
The objects returned by the palette functions provide a series of methods, e.g., pal1.settings for displaying the HCL parameters, pal1(3) for obtaining a number of hex colors, or pal1.cmap() for setting up a matplotlib color map, among others.
from colorspace import palette, sequential_hcl, swatchplot
pal1 = sequential_hcl(palette = "viridis")
pal2 = sequential_hcl(h = [300, 75], c = [40, 95], l = [15, 90],
power = [1., 1.1])
pal3 = sequential_hcl(palette = "viridis", cmax = 90, c2 = 20)
pal4 = sequential_hcl(palette = "viridis", h1 = 200)
swatchplot({"Viridis (and altered versions of it)": [
palette(pal1(7), "By name"),
palette(pal2(7), "By hand"),
palette(pal3(7), "With triangular chroma"),
palette(pal4(7), "With smaller hue range")
]}, figsize = (8, 1.75));
An overview of the named HCL-based palettes in colorspace is depicted below.
from colorspace import hcl_palettes
hcl_palettes(plot = True, figsize = (20, 15))
To better understand the properties of palette pal4, defined above, the following figure shows its HCL spectrum (left) and the corresponding path through the HCL space (right).
The spectrum in the first panel shows how the hue (right axis) changes from about 200 (green) to 75 (yellow), while chroma and luminance (left axis) increase from about 20 to 95. Note that the kink in the chroma curve for the greenish colors occurs because such dark greens cannot have higher chromas when represented through RGB-based hex codes. The same is visible in the second panel where the path moves along the outer edge of the HCL space.
pal4.specplot(figsize = (5, 5));
pal4.hclplot(n = 7, figsize = (5, 5));
Another important assessment of a color palette is how well it works for viewers with color vision deficiencies. This is exemplified below by depicting a demo plot (heatmap) under “normal” vision (left), deuteranomaly (colloquially known as “red-green color blindness”, center), and desaturated (gray scale, right). The palette in the top row is the traditional fully-saturated RGB rainbow, deliberately selected here as a palette with poor perceptual properties. It is contrasted with a perceptually-based sequential blue-yellow HCL palette in the bottom row.
The sequential HCL palette is monotonic in luminance so that it is easy to distinguish high-density and low-density regions under deuteranomaly and desaturation. However, the rainbow is non-monotonic in luminance and parts of the red-green contrasts collapse under deuteranomaly, making it much harder to interpret correctly.
from colorspace import rainbow, sequential_hcl
col1 = rainbow(end = 2/3, rev = True)(7)
col2 = sequential_hcl("Blue-Yellow", rev = True)(7)
from colorspace import demoplot, deutan, desaturate
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 3, figsize = (9, 4))
demoplot(col1, "Heatmap", ax = ax[0,0], ylabel = "Rainbow", title = "Original")
demoplot(col2, "Heatmap", ax = ax[1,0], ylabel = "HCL (Blue-Yellow)")
demoplot(deutan(col1), "Heatmap", ax = ax[0,1], title = "Deuteranope")
demoplot(deutan(col2), "Heatmap", ax = ax[1,1])
demoplot(desaturate(col1), "Heatmap", ax = ax[0,2], title = "Desaturated")
demoplot(desaturate(col2), "Heatmap", ax = ax[1,2])
plt.show()
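The monotonic-luminance argument can be checked numerically for any set of hex colors. A small pure-Python sketch (the hex codes are made-up sequential colors, and the luminance formula follows the common sRGB/WCAG convention, which is not necessarily the exact computation used inside colorspace):

```python
def hex_to_rgb(h):
    """Convert '#RRGGBB' to an (r, g, b) tuple of floats in [0, 1]."""
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))

def luminance(hex_color):
    """Relative luminance per the sRGB/WCAG convention."""
    def lin(c):  # undo the sRGB gamma encoding
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in hex_to_rgb(hex_color))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# made-up sequential palette from dark blue to light yellow
pal = ["#00366C", "#3E689C", "#7D9BC4", "#BCCFE3", "#F3E99E"]
lum = [luminance(c) for c in pal]
print([round(x, 2) for x in lum])  # should increase monotonically
```

A palette whose luminance values increase strictly along the palette remains interpretable under desaturation, whereas the rainbow fails exactly this check.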
To illustrate that colorspace can be easily combined with different graphics workflows in Python, the code below shows a heatmap (two-dimensional histogram) from matplotlib and a multi-group density from seaborn. It employs an example data set from the package (using pandas) with daily maximum and minimum temperatures. For matplotlib, the colormap (.cmap(); a LinearSegmentedColormap) is extracted from the adapted viridis palette pal3 defined above. For seaborn, the hex codes from a custom qualitative palette are extracted via .colors(4).
from colorspace import dataset, qualitative_hcl
import matplotlib.pyplot as plt
import seaborn as sns
df = dataset("HarzTraffic")
fig = plt.hist2d(df.tempmin, df.tempmax, bins = 20,
cmap = pal3.cmap().reversed())
plt.title("Joint density daily min/max temperature")
plt.xlabel("minimum temperature [deg C]")
plt.ylabel("maximum temperature [deg C]")
plt.show()
pal = qualitative_hcl("Dark 3", h1 = -180, h2 = 100)
g = sns.displot(data = df, x = "tempmax", hue = "season", fill = "season",
kind = "kde", rug = True, height = 4, aspect = 1,
palette = pal.colors(4))
g.set_axis_labels("temperature [deg C]")
g.set(title = "Distribution of daily maximum temperature given season")
plt.show()
The colorspace package is available from PyPI at https://pypi.org/project/colorspace. It is designed to be lightweight, requiring only numpy (Harris et al. 2020) for the core functionality. Only a few features rely on matplotlib, imageio (Klein et al. 2024), and pandas (The Pandas Development Team 2024). More information and an interactive interface can be found on https://hclwizard.org/. Package development is hosted on GitHub at https://github.com/retostauffer/python-colorspace. Bug reports, code contributions, and feature requests are warmly welcome.
This week the group stage of the UEFA Euro 2024 was concluded, so that all pairings for the round of 16 are now fixed. Therefore, today we address two questions regarding our own probabilistic forecast for the UEFA Euro 2024, based on an ensemble machine learning model that we published prior to the tournament:
TL;DR
First, we look at the results in terms of which teams successfully advanced from the group stage to the round of 16. The barplot below shows all teams along with their predicted probability of proceeding to the round of 16, in the observed ranking order, with the color highlighting which teams advanced to the knockout stage.
Clearly, all group favorites made the cut and mostly teams with lower probabilities dropped out. It may seem somewhat surprising that some of the weaker teams (especially Georgia) “survived” the group stage but with four out of six third-ranked teams advancing to the round of 16 this is not completely unexpected. The results in Groups D and E are probably more surprising: Austria came first in Group D and top favorite France only second. Similarly, Romania took the group victory in Group E ahead of the higher-ranked team from Belgium.
Next, we take a closer look at the 36 individual group-stage matches to check whether we had any major surprises. The stacked bar plot below groups all match results into three categories by their predicted goal difference for the stronger vs. the weaker team.
In the first bar the stronger team was predicted to be only slightly better, with 0 to 0.6 more predicted goals on average. In this bar we see that the stronger team won fewer than half of the matches (6 out of 15) while the other matches were either lost (2 matches) or ended in a draw (7 matches). Thus, the distribution roughly matches the predictions, although the number of draws is somewhat higher than expected.
The picture is similar in the second bar, where the predicted goal difference for the stronger team was between 0.6 and 1. The stronger team won 5 out of 11 matches, lost 1, and 5 matches, more than expected, ended in a draw.
Only in the last bar, with the highest predicted goal differences (between 1 and 2 goals), were there fewer draws (2 out of 10). Here the distribution closely matches the expectations, with 7 wins for the stronger team and only 1 loss.
As a final evaluation we check whether the observed number of goals per team in each match conforms with the expected distribution based on the Poisson model employed. This is brought out graphically by a so-called hanging rootogram.
The red line shows the square root of the expected frequencies while the “hanging” gray bars represent the square root of the observed frequencies. This shows that the predictions conform closely with the actual observations. There were only slightly more single-goal results and fewer four-goal results (none) than expected in our forecast.
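The construction behind such a hanging rootogram is simple. A minimal Python sketch with made-up goal counts (not the actual Euro 2024 data), comparing observed frequencies of goals per team to a fitted Poisson distribution:

```python
import math
from collections import Counter

def hanging_rootogram(counts):
    """For integer counts, return (k, sqrt(expected) - sqrt(observed))
    under a Poisson fit; values near 0 indicate a good fit."""
    n = len(counts)
    lam = sum(counts) / n  # maximum likelihood estimate of the Poisson mean
    obs = Counter(counts)
    out = []
    for k in range(max(counts) + 1):
        expected = n * math.exp(-lam) * lam ** k / math.factorial(k)
        out.append((k, math.sqrt(expected) - math.sqrt(obs.get(k, 0))))
    return out

# made-up goals per team and match, for illustration only
for k, hang in hanging_rootogram([0, 1, 1, 2, 0, 3, 1, 2, 2, 0, 1, 4]):
    print(k, round(hang, 2))
```

The square-root scale stabilizes the variance of the frequencies, so small and large counts can be judged against the zero line (and the Tukey warning limits at +/- 1) on a comparable footing.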
Finally, we want to look ahead and explore how the realized tournament draw based on the group stage results changes the predicted winning probabilities for the UEFA Euro 2024. We do so under the assumption that all results so far are within the range of random variation and that we do not need to adapt the predictions for all possible matches. In other words, the simulation is based on the expectation that especially the top favorites France and England can still reach their full potential in the upcoming matches.
Simulating the knockout stage 100,000 times then leads to the following winning probabilities for the tournament. (The barplot preserves the ordering of the teams from the original prediction.)
This shows clearly that England profits most and increases its winning probability for the title to 22.1% (from 16.7%). For the other five top teams France, Germany, Spain, Portugal, and the Netherlands the winning probabilities are almost equal now, all around 13%. Four of these five teams are now in the same arm of the tournament (which has also been dubbed the “shark tank” in the media) and it will certainly be exciting to see who eventually makes it to the final. In the other arm, England and the Netherlands are now the teams with the highest winning probabilities, but we should keep in mind that Austria has already beaten the Netherlands once. Only the next matches will show whether they will be able to do it again, should both teams advance to the quarterfinal.
In any case, the most exciting part of the UEFA Euro 2024 is only starting now and we can all be curious what is going to happen. Everything is still possible!
In a recent blog post, prior to the start of the tournament, probabilistic forecasts for the UEFA Euro 2024 were provided based on a machine learning approach. In short, the approach obtained a number of highly informative inputs about the 24 participating teams before the start of the tournament: historic match abilities from all national matches over the past 8 years, bookmaker consensus abilities based on quoted odds from 28 bookmakers, average player ratings from goal contributions of individual players in club and national matches, as well as further team-specific information such as market value or FIFA rank. Then an ensemble of a random forest, a lasso, and an XGBoost learner was trained on matches from the UEFA Euro 2004–2020. The outcome was a prediction for the mean goals of both teams in all potential matches at the UEFA Euro 2024. Based on these predictions the entire tournament was simulated 100,000 times, yielding probabilities for all possible outcomes of the tournament.
The prediction from the machine learning ensemble above for the match Netherlands vs. Austria is summarized in the following table.
| | Mean goals | Win probability |
|---|---|---|
| 🇳🇱 | 1.3 | 48.6% |
| Draw | – | 28.1% |
| 🇦🇹 | 0.8 | 23.4% |
This means that if the Netherlands were to play Austria in lots of matches, the Netherlands are predicted to score 1.3 goals on average in these matches while Austria scores an average of 0.8 goals. Assuming a certain probability distribution for the goals per team in each match, not only the mean goals can be predicted but also the probability for each possible combination of goals by the two teams. The probability distribution employed here is a bivariate independent Poisson model, a relatively simple and standard model that fits empirical scores in football matches very well. The resulting probabilities (for up to five goals per team) are displayed in the heatmap below. Aggregating all probabilities for a Dutch win, a draw, or an Austrian win yields the probabilities shown in the table above (which do not sum to 100% exactly due to rounding).
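This aggregation is straightforward to sketch in base R. The following snippet uses the rounded mean goals from the table above, so it reproduces the published probabilities only approximately (the exact model inputs have more decimal places):

```r
# Win/draw/loss probabilities from an independent bivariate Poisson model,
# using the (rounded) predicted mean goals from the table above.
lambda_nl <- 1.3  # predicted mean goals, Netherlands
lambda_at <- 0.8  # predicted mean goals, Austria

# Joint probability for every score combination up to 10 goals per team
goals <- 0:10
prob <- outer(dpois(goals, lambda_nl), dpois(goals, lambda_at))

# Aggregate over Dutch win (row > column), draw (diagonal), Austrian win
p_win  <- sum(prob[lower.tri(prob)])
p_draw <- sum(diag(prob))
p_loss <- sum(prob[upper.tri(prob)])
round(100 * c(win = p_win, draw = p_draw, loss = p_loss), 1)
```

Truncating at ten goals per team loses only a negligible amount of probability mass, which is why the three aggregated probabilities sum to essentially 100%.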
The Imagine conference hosted by the Austrian Ministry of Climate Action, Environment, Energy, Mobility, Innovation and Technology celebrates its 10th birthday today. The final highlight of the conference program is a public viewing of the match Netherlands vs. Austria where the forecast above will be presented alongside a live data-driven analysis by colleagues from the Rotterdam University of Applied Sciences. The presentation slides are linked from the screenshot below.
The forecast is based on an ensemble of machine learners that blend four main sources of information: an ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 28 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; and further team and country covariates (e.g., market value or GDP). An ensemble of machine learners is trained on the results of the UEFA Euro tournaments from 2004 to 2020 and then applied to current information to obtain a forecast for the UEFA Euro 2024. More specifically, the ensemble estimates the predicted number of goals for all possible matches between all 24 teams in the tournament. Based on the predicted goals, the probabilities for a win, draw, or loss in each of these matches can be computed from a bivariate Poisson distribution. This allows us to simulate all matches in the group phase, determine which teams proceed to the knockout stage, and see who eventually wins. Repeating the simulation 100,000 times yields winning probabilities for each team. The results show that France is the favorite for the European title with a winning probability of 19.2%, followed by England with 16.7%, and host Germany with 13.7%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.
Interactive full-width graphic
The study has been conducted by an international team of researchers: Florian Felice, Andreas Groll, Lars Magnus Hvattum, Christophe Ley, Gunther Schauberger, Jonas Sternemann, Achim Zeileis. The basic idea for the forecast is to proceed in two steps. In the first step, three sophisticated statistical models are employed to determine the strengths of all teams and their players using disparate sets of information. In the second step, an ensemble of machine learners decides how to best combine the three strength estimates with other information about the teams.
Historic match abilities:
An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
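The exponential weighting scheme can be sketched in a few lines of base R. The half period below (after which a match receives half the weight of a match played today) is an illustrative choice, not necessarily the value used in the actual model:

```r
# Exponential down-weighting of historic matches, as in the scheme of
# Ley, Van de Wiele, Van Eetvelde (2019): a match played 'half_period'
# days ago gets half the weight of a match played today.
# The 3-year half period is a made-up illustrative value.
match_weight <- function(days_back, half_period = 3 * 365.25) {
  0.5^(days_back / half_period)
}

# Weights for matches played today, 1 year ago, 3 years ago, 8 years ago
round(match_weight(c(0, 1, 3, 8) * 365.25), 3)
```

These weights would then enter the likelihood of the bivariate Poisson model, so that recent matches dominate the fitted team abilities.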
Bookmaker consensus abilities:
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 28 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead to the consensus winning probabilities.
Average player ratings:
To infer the contributions of individual players in a match, the plus-minus player ratings of Pantuso & Hvattum (2021) dissect all matches with a certain player (both on club and on national level) into segments, e.g., between substitutions. Subsequently, the goal difference achieved in these segments is linked to the presence of the individual players during that segment. This yields individual ratings for all players that can be aggregated to average player ratings for each team.
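The core regression idea behind plus-minus ratings can be illustrated with a toy least-squares sketch. All data below are simulated; the actual approach of Pantuso & Hvattum (2021) uses regularization and many further refinements:

```r
# Toy sketch of plus-minus player ratings: the goal difference in each match
# segment is regressed on player presence (+1 if on the pitch for the home
# side, -1 for the away side, 0 otherwise). All data here are simulated.
set.seed(1)
n_players <- 20
n_segments <- 500
true_rating <- rnorm(n_players, sd = 0.3)

# Random presence matrix: segments x players
X <- matrix(sample(c(-1, 0, 1), n_players * n_segments, replace = TRUE),
            nrow = n_segments)
goal_diff <- X %*% true_rating + rnorm(n_segments, sd = 0.5)

# Least-squares estimate of the individual player ratings
rating <- lm.fit(X, goal_diff)$coefficients
cor(rating, true_rating)
```

With enough segments the estimated ratings recover the simulated player contributions well, and averaging them within each squad yields the team-level player rating used as an input to the ensemble.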
Machine learning ensemble:
Finally, an ensemble of different machine learning methods is used to combine these three highly aggregated and informative variables above along with various further relevant variables, yielding refined probabilistic forecasts for each match. Such an approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019) and subsequently improved collaboratively. The ensemble of machine learners is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. The features considered comprise team- and country-specific details (market value, FIFA rank, UEFA points, number of Champions League players, and GDP per capita). By combining a large ensemble of machine learners, each of which employs the available information somewhat differently, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
Using the forecasts from the machine learning ensemble yields the predicted number of goals for both teams in each possible match. The explanatory information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in average player ratings of the teams, etc. Assuming a bivariate Poisson distribution with the predicted numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.
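A single knockout match can then be simulated along these lines. This is a minimal sketch in base R (the function name and the scaling of overtime means are our own simplifying assumptions, not taken from the actual implementation):

```r
# Sketch of simulating one knockout match from predicted mean goals:
# regular time, then overtime (scaled to a third of a match), then a
# coin flip for penalties. Function name and overtime scaling are
# illustrative assumptions.
simulate_match <- function(mu_a, mu_b) {
  goals_a <- rpois(1, mu_a)
  goals_b <- rpois(1, mu_b)
  if (goals_a != goals_b) return(if (goals_a > goals_b) "A" else "B")
  # overtime: roughly a third of regular playing time
  goals_a <- goals_a + rpois(1, mu_a / 3)
  goals_b <- goals_b + rpois(1, mu_b / 3)
  if (goals_a != goals_b) return(if (goals_a > goals_b) "A" else "B")
  sample(c("A", "B"), 1)  # penalty shootout as a coin flip
}

set.seed(24)
table(replicate(10000, simulate_match(1.3, 0.8)))
```

Chaining such match simulations along the tournament tree and repeating the whole procedure 100,000 times yields the winning and “survival” probabilities reported below.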
The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.
Interactive full-width graphic
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times) providing “survival” probabilities for each team across the different stages.
Interactive full-width graphic
All our forecasts are probabilistic, clearly below 100%, and by no means certain. Thus, although we can quantify this uncertainty in terms of probabilities from an ensemble of potential tournaments, it is far from being predetermined which of these potential tournaments we will eventually see during the actual tournament.
Nevertheless the probabilistic view provides us with some interesting insights: For example, while most bookmakers favor England over France, our model reverses their roles. In a potential final between the two teams, however, France would only have a small advantage with a winning probability of 53.2%. Due to the tournament draw it is relatively unlikely, though, that the two top favorites play the final and much more likely (with a probability of 12.6%) that they play the second semifinal. Somewhat surprisingly, the most likely final (5.4%) is England vs. Germany where the winning probabilities would be almost exactly fifty-fifty.
It is also somewhat unexpected that defending champion Italy has only the 7th-highest probability of winning the championship again (5.6%). This is due to the substantial changes the team underwent in the last three years.
In any case, all of this means that the probabilistic forecasts leave a lot of room for surprises and excitement during the UEFA Euro 2024. But what is absolutely certain is that we look forward to an entertaining tournament as football fans (much more than as professional forecasters).
Marjolein Fokkema, Achim Zeileis (2023). “Subgroup Detection in Linear Growth Curve Models with Generalized Linear Mixed Model (GLMM) Trees.” arXiv.org E-Print Archive arXiv:2309.05862 [stat.ME]. doi:10.48550/arXiv.2309.05862
Growth curve models are popular tools for studying the development of a response variable within subjects over time. Heterogeneity between subjects is common in such models, and researchers are typically interested in explaining or predicting this heterogeneity. We show how generalized linear mixed effects model (GLMM) trees can be used to identify subgroups with differently shaped trajectories in linear growth curve models. Originally developed for clustered cross-sectional data, GLMM trees are extended here to longitudinal data. The resulting extended GLMM trees are directly applicable to growth curve models as an important special case. In simulated and real-world data, we assess the performance of the extensions and compare them against other partitioning methods for growth curve models. Extended GLMM trees perform more accurately than the original algorithm and LongCART, and similarly accurately to structural equation model (SEM) trees. In addition, GLMM trees allow for modeling both discrete and continuous time series, are less sensitive to (mis-)specification of the random-effects structure, and are much faster to compute.
https://CRAN.R-project.org/package=glmertree
As an example, heterogeneity of science ability trajectories among a sample of 250 children is analyzed. The data are from the Early Childhood Longitudinal Study-Kindergarten (ECLS-K) class of 1998-1999 in the USA. Assessments took place from kindergarten in 1998 through 8th grade in 2007. Here we focus on assessments from kindergarten, 1st, 3rd, 5th, and 8th grade. The time since kindergarten was scaled to the number of months to the power of 2/3 in order to obtain approximately linear trajectories.
A linear mixed-effect model tree is used to detect heterogeneity in a linear model for the growth of science ability over time. This employs a random intercept for each individual in order to account for the longitudinal nature of the data. The tree tests for differences in the baseline science abilities (i.e., the fixed-effect intercepts of the growth curve models) as well as the growth over time (i.e., the corresponding fixed-effect slopes), using eleven socio-demographic and behavioral characteristics of the children, assessed at baseline, as potential splitting variables.
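This kind of analysis can be sketched with the glmertree package. Because the ECLS-K data cannot be redistributed here, the snippet below uses simulated data with two subgroups that differ in both intercept and slope; variable names and effect sizes are made up for illustration:

```r
# Sketch of a linear mixed-effects model tree for growth curves with
# glmertree, using simulated data (the ECLS-K data analyzed in the paper
# is not freely redistributable). Two subgroups, defined by x1, differ in
# baseline level and growth over time.
library("glmertree")
set.seed(123)
d <- data.frame(
  id   = factor(rep(1:100, each = 5)),
  time = rep(0:4, 100),
  x1   = rep(rnorm(100), each = 5),
  x2   = rep(rnorm(100), each = 5)
)
b <- rep(rnorm(100, sd = 0.5), each = 5)  # subject-specific random intercepts
d$y <- ifelse(d$x1 > 0, 2 + 1.0 * d$time, 0 + 0.3 * d$time) +
  b + rnorm(500, sd = 0.5)

# Fixed effect of time, random intercept per subject,
# x1 and x2 as potential splitting variables
tr <- lmertree(y ~ time | (1 | id) | x1 + x2, data = d)
plot(tr)
```

The formula has three parts separated by vertical bars: the node-specific growth curve model, the random-effects specification, and the candidate splitting variables, mirroring the setup described for the ECLS-K analysis above.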
The plot below shows the resulting tree which identifies socio-economic status (SES), gross motor skills (GMOTOR), and internalizing problems (INTERN) as the splitting variables. The x-axes represent the number of months after the baseline assessment, y-axes represent science ability. Gray lines depict observed individual trajectories, red lines depict average growth curve within each terminal node, as estimated with a linear mixed-effect model comprising node-specific fixed effects of time and a random intercept with respect to individuals. The table presents numerical estimates of fixed intercepts and slopes.
Five subgroups are identified, corresponding to the terminal nodes of the tree, each with a different estimate of the fixed intercept and slope. Groups of children with higher SES also have higher intercepts, indicating higher average science ability. The group of children with lower SES (node 2) is further split based on gross motor skills, with higher motor skills resulting in a higher intercept. The group of children with intermediate levels of SES (node 6) is further split based on internalizing problems, with lower internalizing problems resulting in a higher intercept. The two groups (or nodes) with higher intercepts also have higher slopes, indicating that children with higher ability also gain more ability over time.
The model is the so-called bookmaker consensus model which has been proposed by Leitner, Zeileis, and Hornik (2010, International Journal of Forecasting, doi:10.1016/j.ijforecast.2009.10.001) and successfully applied in previous football tournaments, either by itself or in combination with even more refined machine learning techniques.
As in the FIFA Women’s World Cup 2019, the forecast shows that the United States are the clear favorite with a forecasted winning probability of 21.5%, followed by England with a winning probability of 15.7% and Spain with 13.1%. Three other teams are still a bit ahead of the rest: Germany with 9.7%, France with 7.5%, and co-host Australia with 7.4%. More details are displayed in the following barchart.
Interactive full-width graphic
These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 8.6%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities. The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in wwc2023.csv.
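The core of this averaging can be sketched in base R. The odds below are made-up examples for a single team from three bookmakers, and applying one average overround to all bookmakers is a simplification of the actual model, which adjusts each bookmaker by its own margin:

```r
# Minimal sketch of the bookmaker consensus: quoted decimal odds are
# adjusted for the overround and then averaged on the logit scale.
# Odds are made-up examples; one shared overround is a simplification.
odds <- c(4.5, 5.0, 4.8)   # quoted decimal odds for a title win
overround <- 1.086         # average profit margin of 8.6% (see text)

# adjusted winning probabilities implied by each bookmaker
p <- 1 / (odds * overround)

# consensus: average on the log-odds scale, transform back
consensus <- plogis(mean(qlogis(p)))
consensus
```

Averaging on the log-odds scale rather than on the probability scale keeps the consensus within (0, 1) and treats long shots and favorites symmetrically.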
Although forecasting the winning probabilities for the FIFA Women’s World Cup 2023 is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:

1. Start with a set of team abilities that determine the pairwise winning probabilities for all possible matches.
2. Simulate the entire tournament repeatedly based on these probabilities.
3. Record the resulting simulated winning probabilities for all teams.

Using this idea, the abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.
A classical approach to obtain winning probabilities in pairwise comparisons (i.e., matches between teams/players) is the Bradley-Terry model, which is similar to the Elo rating, popular in sports. The Bradley-Terry approach models the probability that a Team A beats a Team B by their associated abilities (or strengths):
$$\mathrm{Pr}(A \text{ beats } B) = \frac{\mathrm{ability}_{A}}{\mathrm{ability}_{A} + \mathrm{ability}_{B}}.$$

Coupled with the “inverse” simulation of the tournament, as described in steps 1–3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match with light gray signalling approximately equal chances and green vs. purple signalling advantages for Team A or B, respectively.
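The Bradley-Terry probability is a one-liner, and `outer()` conveniently produces the full pairwise matrix. The abilities below are made-up illustrative values, not the ones inferred by the inverse simulation:

```r
# Bradley-Terry winning probability and the resulting pairwise matrix for
# a few teams with made-up abilities (the actual abilities come from the
# "inverse" tournament simulation described in the text).
bt_prob <- function(ability_a, ability_b) {
  ability_a / (ability_a + ability_b)
}

ability <- c(USA = 5, England = 4, Spain = 3.2, Germany = 2.4)
round(outer(ability, ability, bt_prob), 3)
```

Each cell gives the probability that the row team beats the column team, so the matrix is symmetric in the sense that opposite cells sum to one.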
Interactive full-width graphic
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times) providing “survival” probabilities for each team across the different stages.
Interactive full-width graphic
For example, this shows that the probability for the United States to reach any stage of the tournament is higher than for any other team to reach the same stage. In fact, their survival probabilities are decreasing rather slowly because they can most likely avoid the other favorites for the title until the semifinal. Conversely, Germany’s chances to reach the round of 16 are almost as high (87.6%) as those of the United States but their chances to reach the quarterfinal are much lower (55.7%) because they are most likely to play the strongest expected runner-up, Brazil, in the round of 16.
In addition to the curves shown in the plot above, further probabilities of interest can be obtained from the simulation. For example, the probability for the “dream final” between the top favorites, World Champion United States and European Champion England, is 9.1%. The most likely first semi-final is between the United States and Spain with a probability of 13.5%. For the second semi-final it is less clear who is the most likely opponent of England because there are three possible pairings with almost the same probability (around 7%): Against Australia, France, or Germany. This shows that this half of the tournament tree is somewhat more contested with a less certain outcome.
The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. It would also be possible to post-process the bookmaker consensus along with data from historic matches, player ratings, and other information about the teams using machine learning techniques. However, due to lack of time for more refined forecasts at the end of a busy academic year, at least the bookmaker consensus is provided as a solid basic forecast.
As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a profit margin of 8.6% which assures that the best chances of making money based on sports betting lie with them.
Enjoy the FIFA Women’s World Cup 2023!
Siranush Karapetyan, Achim Zeileis, André Henriksen, Alexander Hapfelmeier (2023). “Tree Models for Assessing Covariate-Dependent Method Agreement.” arXiv.org E-Print Archive arXiv:2306.04456 [stat.ME]. doi:10.48550/arXiv.2306.04456
Method comparison studies explore the agreement of measurements made by two or more methods. Commonly, agreement is evaluated by the well-established Bland-Altman analysis. However, the underlying assumption is that differences between measurements are identically distributed for all observational units and in all application settings. We introduce the concept of conditional method agreement and propose a respective modeling approach to alleviate this constraint. To this end, the Bland-Altman analysis is embedded in the framework of recursive partitioning to explicitly define subgroups with heterogeneous agreement in dependence of covariates in an exploratory analysis. Three different modeling approaches, conditional inference trees with an appropriate transformation of the modeled differences (CTreeTrafo), distributional regression trees (DistTree), and model-based trees (MOB) are considered. The performance of these models is evaluated in terms of type-I error probability and power in several simulation studies. Further, the adjusted Rand index (ARI) is used to quantify the models’ ability to uncover given subgroups. An application example to real data of accelerometer device measurements is used to demonstrate the applicability. Additionally, a two-sample Bland-Altman test is proposed for exploratory or confirmatory hypothesis testing of differences in agreement between subgroups. Results indicate that all models were able to detect given subgroups with high accuracy as the sample size increased. Relevant covariates that may affect agreement could be detected in the application to accelerometer data. We conclude that conditional method agreement trees (COAT) enable the exploratory analysis of method agreement in dependence of covariates and the respective exploratory or confirmatory hypothesis testing of group differences. The method is made publicly available through the R package coat.
R package: https://CRAN.R-project.org/package=coat
Presentation slides: Psychoco 2023
The paper presents an illustration in which measurements of activity energy expenditure (in 24 hours) from two different accelerometers (ActiGraph vs. Actiheart) are compared and their dependence on age, gender, weight, etc. is assessed. As the data is not freely available, we show below another illustration taken from the MethComp package.
The scint data provides measurements of the relative kidney function (renal function, percent of total) for 111 patients. The reference method is DMSA static scintigraphy and it is compared here with DTPA dynamic scintigraphy. The question we aim to answer using the new COAT method is:
Does the agreement between DTPA and DMSA depend on the age and/or the gender of the patient?
First, the package and data are loaded and reshaped to wide format:
library("coat")
data("scint", package = "MethComp")
scint_wide <- reshape(scint, v.names = "y",
timevar = "meth", idvar = "item", direction = "wide")
Then, COAT can be applied using the coat() function, by default leveraging ctree() from the partykit package in the background:
tr1 <- coat(y.DTPA + y.DMSA ~ age + sex, data = scint_wide)
print(tr1)
## Conditional method agreement tree (COAT)
##
## Model formula:
## y.DTPA + y.DMSA ~ age + sex
##
## Fitted party:
## [1] root
## | [2] age <= 35: Bias = -0.49, SD = 3.42
## | [3] age > 35: Bias = 0.25, SD = 7.04
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
This shows that the measurement differences between the two scintigraphies vary clearly between young and old patients. While the average difference between the measurements (bias) is close to zero for both age groups, the corresponding standard deviation (SD) is substantially larger (and hence the limits of agreement wider) for the older subgroup. This is better brought out graphically by the corresponding tree display with the classical Bland-Altman plots in the terminal nodes.
plot(tr1)
As the Bland-Altman plot for the older subgroup suggests that the bias between the methods may also depend on the mean measurement, we fit a second COAT tree. In addition to age and gender we also include the mean renal function measurement from DTPA and DMSA as a third potential split variable.
tr2 <- coat(y.DTPA + y.DMSA ~ age + sex, data = scint_wide, means = TRUE)
print(tr2)
## Conditional method agreement tree (COAT)
##
## Model formula:
## y.DTPA + y.DMSA ~ age + sex
##
## Fitted party:
## [1] root
## | [2] means(y.DTPA, y.DMSA) <= 31: Bias = 4.80, SD = 6.61
## | [3] means(y.DTPA, y.DMSA) > 31
## | | [4] means(y.DTPA, y.DMSA) <= 53.5: Bias = -0.38, SD = 3.33
## | | [5] means(y.DTPA, y.DMSA) > 53.5: Bias = -4.27, SD = 3.90
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
plot(tr2)
This tree reveals three subgroups where only the middle group (with renal function between 31 and 53.5 percent) has both small bias and standard deviation for the scintigraphy differences while for the other two subgroups bias and/or standard deviation are larger.
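The two-sample Bland-Altman test mentioned in the abstract is also provided by the coat package. As a sketch (see the package documentation for the exact interface), it could be used to formally test whether agreement between the two scintigraphies differs between female and male patients:

```r
# Sketch: two-sample Bland-Altman test from the coat package, testing for
# a difference in method agreement between the sexes. Usage sketched from
# the package documentation; details may differ.
library("coat")
data("scint", package = "MethComp")
scint_wide <- reshape(scint, v.names = "y",
  timevar = "meth", idvar = "item", direction = "wide")

bt <- batest(y.DTPA + y.DMSA ~ sex, data = scint_wide)
print(bt)
```

In contrast to the exploratory trees above, this tests a prespecified grouping, which makes it suitable for confirmatory analyses as well.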