Achim Zeileis

Remembering Friedrich "Fritz" Leisch

2025-01-22T00:00:00+01:00

Our friend and colleague Fritz Leisch died in April last year. In a new contribution to The R Journal we honor Fritz and commemorate his many contributions to science in general and to the R community in particular.

Citation

Bettina Grün, Kurt Hornik, Torsten Hothorn, Theresa Scharl, Achim Zeileis (2025). “Remembering Friedrich “Fritz” Leisch.” The R Journal 16(1), 5-14. doi:10.32614/RJ-2024-001

Abstract

This article remembers our friend and colleague Fritz Leisch (1968-2024) who sadly died earlier this year. Many of the readers of The R Journal will know Fritz as a member of the R Core Team and for many of his contributions to the R community. For us, the co-authors of this article, he was an important companion on our journey with the R project and other scientific endeavours over the years. In the following, we provide a brief synopsis of his career, present his key contributions to the R project and to the scientific community more generally, acknowledge his academic service, and highlight his teaching and mentoring achievements.

Read full paper ›

Examining exams using Rasch models and assessment of measurement invariance

2024-10-01T00:00:00+02:00

Models from psychometric item response theory are used to analyze the results from a large introductory mathematics exam in order to gain insights about student abilities, question difficulties, and heterogeneities of these in subgroups.

Citation

Achim Zeileis (2024). “Examining Exams Using Rasch Models and Assessment of Measurement Invariance.” arXiv.org E-Print Archive arXiv:2409.19522 [stat.AP]. doi:10.48550/arXiv.2409.19522

Abstract

Many statisticians regularly teach large lecture courses on statistics, probability, or mathematics for students from other fields such as business and economics, social sciences and psychology, etc. The corresponding exams often use a multiple-choice or single-choice format and are typically evaluated and graded automatically, either by scanning printed exams or via online learning management systems. Although further examinations of these exams would be of interest, these are frequently not carried out. For example a measurement scale for the difficulty of the questions (or items) and the ability of the students (or subjects) could be established using psychometric item response theory (IRT) models. Moreover, based on such a model it could be assessed whether the exam is really fair for all participants or whether certain items are easier (or more difficult) for certain subgroups of students.

Here, several recent methods for assessing measurement invariance and for detecting differential item functioning in the Rasch IRT model are discussed and applied to results from a first-year mathematics exam with single-choice items. Several categorical, ordered, and numeric covariates like gender, prior experience, and prior mathematics knowledge are available to form potential subgroups with differential item functioning. Specifically, all analyses are demonstrated with a hands-on R tutorial using the psycho* family of R packages (psychotools, psychotree, psychomix) which provide a unified approach to estimating, visualizing, testing, mixing, and partitioning a range of psychometric models.

The paper is dedicated to the memory of Fritz Leisch (1968-2024) and his contributions to various aspects of this work are highlighted.

Read full paper ›

Software

R packages: psychotools, psychotree, psychomix

Illustration

The strategies for analyzing exam results using psychometric item response theory (IRT) models are illustrated with Rasch models fitted to the results from a large introductory mathematics exam for economics and business students. Here, only a quick teaser is provided that shows how to quickly visualize simple exploratory statistics and some model-based results. For the full analysis of the data that gives special emphasis to the assessment of so-called measurement invariance, see the full paper linked above. The full replication code for all results in the paper is provided in: exams.R.

The data are available as MathExam14W in the psychotools package. The code below excludes the students who solved none or all of the exercises, because their responses do not discriminate between the exercises in terms of difficulty. The response variable is solved, which is an object of class itemresp. Internally, it is essentially a 729 x 13 matrix with binary 0/1 coding plus some meta-information. As a first exploratory graphic, the plot() method shows a bar plot with empirical frequencies of correctly solving each of the 13 exercises.

library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)
plot(mex$solved)

The plot demonstrates that most items have been solved correctly by about 40 to 80 percent of the students. The main exception is the payflow exercise (for which a certain integral had to be computed) which was solved correctly by less than 15 percent of the students.

To establish a formal IRT model for these data, we employ a Rasch model that uses the differences between person abilities $θ_{i}$ and item difficulties $β_{j}$ for describing the logit of the probability $π_{ij}$ that person $i$ correctly solves item $j$.

$$
\begin{aligned}
π_{ij} &= \Pr(y_{ij} = 1) \\
\mathrm{logit}(π_{ij}) &= θ_{i} - β_{j}
\end{aligned}
$$
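To make the model equation concrete, here is a minimal Python sketch of the Rasch response probability. It only evaluates the formula for given values of $θ_{i}$ and $β_{j}$ and does not estimate anything; the function names and the example values are illustrative assumptions, not part of psychotools.

```python
import math

def rasch_probability(theta, beta):
    """Probability that a person with ability theta solves an item with difficulty beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def item_information(theta, beta):
    """Fisher information of a Rasch item at ability theta, i.e., p * (1 - p)."""
    p = rasch_probability(theta, beta)
    return p * (1.0 - p)

# Hypothetical abilities and difficulties on the latent logit scale.
abilities = [-1.0, 0.0, 1.0]
difficulties = [-0.5, 0.0, 2.0]
for theta in abilities:
    probs = [round(rasch_probability(theta, beta), 3) for beta in difficulties]
    print(theta, probs)
```

Because only the difference between ability and difficulty enters the model, a person whose ability equals the item difficulty solves the item with probability 0.5, which is also where the item information is largest.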

The raschmodel() function estimates the item difficulties using conditional maximum likelihood and the plot() method then shows the corresponding person abilities (as a bar plot) along with the item difficulties (as a dot chart) on the same latent trait scale.

mr <- raschmodel(mex$solved) plot(mr, type = "piplot")

Qualitatively, the Rasch model-based person-item plot shows a similar pattern to the exploratory bar plot. However, due to the latent logistic scale the most difficult item (payflow) and the easiest item (hesse) are brought out even more clearly. Also, the majority of the item difficulties are close to the median ability in this sample. Thus, the exam discriminates more sharply at the median difficulty and less sharply in the tails at very high or very low ability.

So far so good. However, the interpretation above is only reliable if all item difficulties are indeed the same for all students in the sample. If this is not the case, differences in the item responses would not necessarily be caused by differences in mathematics ability. The fundamental assumption that the difficulties are constant across all persons is a special case of so-called measurement invariance. And a violation of this assumption is known as differential item functioning (DIF), i.e., some item(s) is/are relatively easier for some subgroup of persons compared to others.

The main contribution of the paper is to detect such differential item functioning and investigate the potential sources of it. See the arXiv paper for all details and the full analysis.

Modeling loss aversion with extended-support beta regression

2024-09-23T00:00:00+02:00

The recently-proposed extended-support beta regression model in R package betareg is illustrated by simultaneously modeling the occurrence and extent of loss aversion in a behavioral economics experiment.

Motivation

To illustrate the benefits of extended-support beta regression models, suggested in a recent arXiv paper with Ioannis Kosmidis, we revisit the analysis of a behavioral economics experiment conducted and published by Glätzle-Rützler et al. (2015, Journal of Economic Behavior & Organization, doi:10.1016/j.jebo.2014.12.021). The outcome variable is the proportion of tokens invested by high-school students in a risky lottery with positive expected payouts. Glätzle-Rützler et al. focused on the effects of several experimental factors on the mean investments, which reflect the players’ willingness to take risks. In their study they employed linear regression models, estimated by ordinary least squares (OLS) with standard errors adjusted for potential clustering and heteroscedasticity.

Here, we extend the analysis from Glätzle-Rützler et al. by employing a similar model for the mean investments but additionally exploring distributional specifications that allow for a probabilistic, rather than mean-only, interpretation of the effects. From an economic perspective this is of interest because it allows interpreting both the mean willingness to take risks in this experiment and the probability of behaving like a rational Homo oeconomicus, who would invest (almost) all tokens in this lottery because it has positive expected payouts.

The full replication code for the analyses from the arXiv paper is available in lossaversion.R with some auxiliary functions in beta01.R. Below we only show the most important R snippets to give a feeling for the workflow in R. The rest of the discussion here highlights the main insights from the analysis.

An aggregated version of the data from all nine rounds of the experiment is available as LossAversion in the betareg package. Interest is in linking the variable invest, the proportion of the total tokens invested across all nine rounds, to the following explanatory information:

grade: Is the player from lower grades 6-8 or upper grades 10-12?
arrangement: Is the player an individual or a team of two?
male: Is (at least one of) the player(s) male?
age: (Average) age of the player(s) in years.

Models

We compare four different models for invest which all employ the same equation for the mean submodel. And all except the OLS reference model employ the main effects of the three experimental factors for the dispersion submodel.

Normal linear model (N) with constant variance, corresponding to the OLS approach from the original study. In R, this can be fitted with the base lm() or glm() function.
```
data("LossAversion", package = "betareg")
la_ols <- glm(invest ~ grade * (arrangement + age) + male,
  data = LossAversion)
summary(la_ols)
```
Heteroscedastic censored normal model (CN), also known as heteroscedastic two-limit tobit model in econometrics. This can be fitted with the crch package (for censored regression with conditional heteroscedasticity).
```
library("crch")
la_htobit <- crch(invest ~ grade * (arrangement + age) + male | arrangement + male + grade,
  data = LossAversion, left = 0, right = 1)
summary(la_htobit)
```

Beta regression (B) after ad-hoc scaling of the investments to the open unit interval (to avoid the boundary observations). This can be fitted with the betareg package.

library("betareg")
LossAversion$invests <- (LossAversion$invest * (nrow(LossAversion) - 1) + 0.5) / nrow(LossAversion)
la_beta <- betareg(invests ~ grade * (arrangement + age) + male | arrangement + male + grade,
  data = LossAversion)
summary(la_beta)

Extended-support beta mixture model (XBX) with the same specification as B but adding an extra exceedance parameter to be estimated (instead of the ad-hoc scaling). This can also be fitted with betareg since version 3.2-0 with XBX regression being automatically selected in case of boundary observations in the response.
```
la_xbx <- betareg(invest ~ grade * (arrangement + age) + male | arrangement + male + grade,
  data = LossAversion)
summary(la_xbx)
```
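The ad-hoc scaling used for model B, (y * (n - 1) + 0.5) / n, can be illustrated in isolation. The following Python sketch applies the same formula to a small made-up vector (the values are hypothetical and not taken from LossAversion) to show that boundary observations at 0 and 1 are pulled slightly into the open unit interval.

```python
def squeeze_to_open_unit_interval(y):
    """Apply the ad-hoc transformation (y * (n - 1) + 0.5) / n to each value in y."""
    n = len(y)
    return [(value * (n - 1) + 0.5) / n for value in y]

# Hypothetical proportions including both boundaries.
invest = [0.0, 0.25, 0.5, 1.0]
invests = squeeze_to_open_unit_interval(invest)
print(invests)

# After scaling, no observation sits exactly on 0 or 1.
print(all(0.0 < value < 1.0 for value in invests))
```

Note that the amount of shrinkage depends on the sample size n, which is one reason why this transformation is considered ad hoc compared to estimating an exceedance parameter as in XBX.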

If you run the code and compare the model summaries, note that the coefficients from N and CN use an identity link for the mean parameter whereas B and XBX use a logit link. In addition, the log-likelihood, and, hence, AIC and BIC, are comparable only between CN and XBX because those two models have the same support for the response variable, that is the unit interval with point masses at 0 and 1. See also the accompanying arXiv paper for the full summary tables. By and large, the mean parameters for the N and CN models are rather similar and those for B and XBX are rather similar, only with some differences that occur due to using models with point masses on the boundaries (CN and XBX) or not (N and B).
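To see where the point masses at 0 and 1 come from in the censored normal case, the following Python sketch computes them for a latent normal distribution censored to the unit interval. The mean and standard deviation are hypothetical values chosen for illustration, not estimates from the CN model, and the helper names are assumptions.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def censored_normal_masses(mean, sd, left=0.0, right=1.0):
    """Probabilities of the two boundaries and the interior after two-limit censoring."""
    p_left = normal_cdf(left, mean, sd)            # latent value at or below the left limit
    p_right = 1.0 - normal_cdf(right, mean, sd)    # latent value at or above the right limit
    return {"p_left": p_left, "p_interior": 1.0 - p_left - p_right, "p_right": p_right}

# Hypothetical latent distribution: mean 0.5, standard deviation 0.5.
print(censored_normal_masses(mean=0.5, sd=0.5))
```

All latent probability mass outside the unit interval is moved onto the boundaries, so the three probabilities always sum to one.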

However, instead of studying the individual estimated coefficients in more detail we rather assess the models graphically by visualizing their goodness of fit and different types of fitted effects.

Goodness of fit

To illustrate how different the fitted probability distributions of the four models are, we employ so-called hanging rootograms. These compare the empirical marginal distributions of the response variable (proportion of tokens invested) to the aggregated fitted distributions from the models. The quality of the fit can be judged by the deviation of the hanging bars from the zero reference line.

An object-oriented implementation of rootograms is available, along with other tools for working with probabilistic models, in the topmodels package on R-Forge (hopefully soon to be submitted to CRAN). You can install it from R-Forge or R-universe and then create the rootograms for the four models:

install.packages("topmodels", repos = "https://zeileis.R-universe.dev")
library("topmodels")
rootogram(la_ols, breaks = -6:16 / 10, main = "N")
rootogram(la_htobit, main = "CN")
rootogram(la_beta, main = "B")
rootogram(la_xbx, main = "XBX")

A more refined version of the plots is shown below. See the full replication script linked above for the code details.

The square roots of the expected frequencies are shown as red dots and the square roots of the observed frequencies are hanging from the points as gray bars. The dashed lines are the Tukey warning limits at +/- 1. These plots show that models N and B fit poorly in the tails. In contrast, models CN and XBX fit very well with almost all bars hanging close to the zero reference line. The fit for XBX appears to be slightly better than for CN.
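The construction of a hanging rootogram can be sketched in a few lines of Python. This is not the topmodels implementation, only the underlying arithmetic as described above: each bar hangs from the square root of the expected frequency and has the square root of the observed frequency as its length. The frequencies below are made up for illustration.

```python
import math

def hanging_rootogram(observed, expected, limit=1.0):
    """Return (top, bottom, flagged) per bin for a hanging rootogram.

    top     = sqrt(expected), where the bar hangs from
    bottom  = sqrt(expected) - sqrt(observed), ideally close to zero
    flagged = True if the bottom lies beyond the +/- limit warning lines
    """
    bars = []
    for obs, exp in zip(observed, expected):
        top = math.sqrt(exp)
        bottom = top - math.sqrt(obs)
        bars.append((top, bottom, abs(bottom) > limit))
    return bars

# Hypothetical observed and expected frequencies for three bins.
for bar in hanging_rootogram(observed=[4, 9, 25], expected=[4, 16, 9]):
    print(bar)
```

A bar ending above zero indicates that the model expects more observations in that bin than were observed, a bar ending below zero indicates the opposite.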

Effects

While models XBX and CN provide a much better probabilistic fit than their uncensored counterparts B and N, it turns out that the predicted mean investments from all four models are still very similar. But XBX and CN allow for interpretations beyond the mean, including economically relevant interpretations of probability effects.

For illustration, we focus on the team arrangement effect for a subsample with a large share of very rational subjects: male players or teams with at least one male, in grades 10-12, and between 15 and 17 years of age. The figure below shows the estimated arrangement effect for the mean E(Y), i.e., the expected proportion of tokens invested, and for the probability to behave very rationally and invest almost everything, i.e., P(Y > 0.95). The empirical quantities are shown in black for the subsample between 15 and 17 years of age while the model-based effects are shown at an age of 16.

The graphic shows that all models do a reasonable job in estimating E(Y) but the censored models XBX and CN are much better at estimating the probability to behave very rationally. Similarly, it could be shown that the fit for the probability P(Y < 0.05) is also much better for XBX and CN than for B and N but this is not done here, because that probability does not have as appealing an economic interpretation as P(Y > 0.95).

For obtaining the model-based effects, as shown above, the procast() function (for probabilistic forecasts) from the topmodels package can be used. This is again an object-oriented implementation that facilitates obtaining not only moments (such as means and variances) but also entire probability distributions (as S3 objects) and corresponding probabilities, densities, and quantiles.

Here, we only briefly show the code for the fitted XBX model but the same function calls can be applied to the other fitted model objects. First, we set up the new data that only varies arrangement from single to team but keeps all other variables fixed. Then, both kinds of effects are computed with procast().

la_nd <- data.frame(arrangement = c("single", "team"), male = "yes", age = 16, grade = "10-12")
procast(la_xbx, newdata = la_nd, type = "mean")
##     mean
## 1 0.4713
## 2 0.6861
procast(la_xbx, newdata = la_nd, type = "cdf", at = 0.95, lower.tail = FALSE)
##   probability
## 1     0.07161
## 2     0.18501

Thus, the mean invested proportion goes up from 47.1% to 68.6% for teams vs. single players in this setting, while the probability to behave almost fully rationally increases from 7.2% to 18.5%.

Again, the full code for creating the figure and underlying table is provided in the replication script linked above.

The script also includes some further illustrations, e.g., the comparison with three-part hurdle models for “zero-and-one-inflated” beta regression. However, these models do not work well here: unappealing interpretation, too many parameters, quasi-complete separation of boundary and non-boundary observations. Hence, we do not show the details in this post.

Extended-support beta regression for [0, 1] responses

2024-09-16T00:00:00+02:00

New arXiv working paper introducing extended-support beta regression models which can capture probabilities for boundary observations at 0 and/or 1. It is available in the latest R package betareg, also accompanied by a new altdoc web page.

Citation

Ioannis Kosmidis, Achim Zeileis (2024). “Extended-Support Beta Regression for [0, 1] Responses.” arXiv.org E-Print Archive arXiv:2409.07233 [stat.ME]. doi:10.48550/arXiv.2409.07233

Abstract

We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of (0,1). Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both 0 and 1 – known as the heteroscedastic two-limit tobit model in the econometrics literature – are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distribution for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models.
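As a rough illustration of the building block described in the abstract, the following Python sketch computes the boundary probabilities of a beta distribution stretched to the interval (-u, 1 + u) and then censored to [0, 1]. Several things are assumptions for the purpose of the sketch: the mean-precision parameterization (mu, phi) of the underlying beta distribution, the fixed exceedance u (whereas XBX integrates over an exponentially distributed u), the simple midpoint rule used in place of a proper incomplete beta function, and all function names. It is not the betareg implementation.

```python
import math

def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def beta_cdf(x, a, b, steps=20000):
    """Midpoint-rule approximation of the Beta(a, b) CDF (adequate for a, b > 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    return h * sum(beta_pdf((k + 0.5) * h, a, b) for k in range(steps))

def boundary_masses(mu, phi, u):
    """Point masses at 0 and 1 after stretching to (-u, 1 + u) and censoring to [0, 1]."""
    a, b = mu * phi, (1.0 - mu) * phi
    lower = u / (1.0 + 2.0 * u)          # position of 0 on the unit scale of the beta
    upper = (1.0 + u) / (1.0 + 2.0 * u)  # position of 1 on the unit scale of the beta
    return beta_cdf(lower, a, b), 1.0 - beta_cdf(upper, a, b)

# Hypothetical parameters: symmetric distribution with moderate precision.
print(boundary_masses(mu=0.5, phi=4.0, u=0.25))
print(boundary_masses(mu=0.5, phi=4.0, u=0.0))
```

With u = 0 there is no exceedance and both point masses vanish, which corresponds to ordinary beta regression; larger values of u shift more probability onto the boundaries.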

Read full paper ›

Software

R package: https://CRAN.R-project.org/package=betareg
Documentation: https://topmodels.R-Forge.R-project.org/betareg/

Illustration

The data for modeling the occurrence and extent of loss aversion in a behavioral economics experiment is available as LossAversion in the package. The corresponding examples also replicate some of the models from the paper. The full replication of the case study will be discussed in another forthcoming blog post.

colorspace: A Python toolbox for colors and palettes

2024-07-30T00:00:00+02:00

Python package 'colorspace' with tools for manipulating and assessing colors and palettes is now available from PyPI, accompanied by a documentation web page and an arXiv paper.

Citation

Reto Stauffer, Achim Zeileis (2024). “colorspace: A Python Toolbox for Manipulating and Assessing Colors and Palettes.” arXiv.org E-Print Archive arXiv:2407.19921 [cs.GR]. doi:10.48550/arXiv.2407.19921

Abstract

The Python colorspace package provides a toolbox for mapping between different color spaces which can then be used to generate a wide range of perceptually-based color palettes for qualitative or quantitative (sequential or diverging) information. These palettes (as well as any other sets of colors) can be visualized, assessed, and manipulated in various ways, e.g., by color swatches, emulating the effects of color vision deficiencies, or depicting the perceptual properties. Finally, the color palettes generated by the package can be easily integrated into standard visualization workflows in Python, e.g., using matplotlib, seaborn, or plotly.

Read full paper ›

Software

Package (PyPI): https://pypi.org/project/colorspace/
Documentation: https://retostauffer.github.io/python-colorspace/
Interactive apps: https://hclwizard.org/
Repository (GitHub): https://github.com/retostauffer/python-colorspace/

Motivation

Color is an integral element of visualizations and graphics and is essential for communicating (scientific) information. However, colors need to be chosen carefully so that they support the information displayed for all viewers (see e.g., Tufte 1990; Ware 2004; Wilke 2019). Therefore, suitable color palettes have been proposed in the literature (e.g., Brewer 1999; Ihaka 2003; Crameri, Shephard, and Heron 2020) and many software packages transitioned to better color defaults over the last decade. A prominent example from the Python community is matplotlib 2.0 (Hunter, Dale, Firing, Droettboom, and the Matplotlib Development Team 2017) which replaced the classic “jet” palette (a variation of the infamous “rainbow”) by the perceptually-based “viridis” palette. Hence a wide range of useful palettes for different purposes is provided in a number of Python packages today, including cmcrameri (Rollo 2024), colormap (Cokelaer 2024), colormaps (Patel 2024), matplotlib (Hunter 2007), palettable (Davis 2023), or seaborn (Waskom 2021).

However, in most graphics packages colors are provided as a fixed set. While this makes it easy to use them in different applications, it is usually not easy to modify the perceptual properties or to set up new palettes following the same principles. The colorspace package addresses this by supporting color descriptions using different color spaces (hence the package name), including some that are based on human color perception. One notable example is the Hue-Chroma-Luminance (HCL) model which represents colors by coordinates on three perceptually-based axes: Hue (type of color), chroma (colorfulness), and luminance (brightness). Selecting colors along paths along these axes allows for intuitive construction of palettes that closely match many of the palettes provided in the packages listed above.

In addition to functions and interactive apps for HCL-based colors, the colorspace package also offers functions and classes for handling, transforming, and visualizing color palettes (from any source). In particular, this includes the simulation of color vision deficiencies (Machado, Oliviera, and Fernandes 2009) but also contrast ratios, desaturation, lightening/darkening, etc.

The colorspace Python package was inspired by the eponymous R package (Zeileis, Fisher, Hornik, Ihaka, McWhite, Murrell, Stauffer, and Wilke 2020). It comes with extensive documentation at https://retostauffer.github.io/python-colorspace/, including many practical examples. Selected highlights are presented in the following.

Key functionality

HCL-based color palettes

The key functions and classes for constructing color palettes using hue-chroma-luminance paths (and then mapping these to hex codes) are:

qualitative_hcl: For qualitative or unordered categorical information, where every color should receive a similar perceptual weight.
sequential_hcl: For ordered/numeric information from high to low (or vice versa).
diverging_hcl: For ordered/numeric information around a central neutral value, where colors diverge from neutral to two extremes.

These functions provide a range of named palettes inspired by well-established packages but actually implemented using HCL paths. Additionally, the HCL parameters can be modified or new palettes can be created from scratch.

As an example, the figure below depicts color swatches for four viridis variations. The first pal1 sets up the palette from its name. It is identical to the second pal2 which employs the HCL specification directly: The hue ranges from purple (300) to yellow (75), colorfulness (chroma) increases from 40 to 95, and luminance (brightness) from dark (15) to light (90). The power parameter chooses a linear change in chroma and a slightly nonlinear path for luminance.

In pal3 and pal4 most HCL properties are kept the same but some are modified: pal3 uses a triangular chroma path from 40 via 90 to 20, yielding muted colors at the end of the palette. pal4 just changes the starting hue for the palette to green (200) instead of purple. All four palettes are visualized by the swatchplot function from the package.

The objects returned by the palette functions provide a series of methods, e.g., pal1.settings for displaying the HCL parameters, pal1(3) for obtaining a number of hex colors, or pal1.cmap() for setting up a matplotlib color map, among others.

from colorspace import palette, sequential_hcl, swatchplot
pal1 = sequential_hcl(palette = "viridis")
pal2 = sequential_hcl(h = [300, 75], c = [40, 95], l = [15, 90], power = [1., 1.1])
pal3 = sequential_hcl(palette = "viridis", cmax = 90, c2 = 20)
pal4 = sequential_hcl(palette = "viridis", h1 = 200)
swatchplot({"Viridis (and altered versions of it)": [
    palette(pal1(7), "By name"),
    palette(pal2(7), "By hand"),
    palette(pal3(7), "With triangular chroma"),
    palette(pal4(7), "With smaller hue range")
]}, figsize = (8, 1.75));
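To give an idea of what a path through HCL space means, the following pure-Python sketch interpolates hue, chroma, and luminance for the viridis specification used in pal2. The interpolation formula (linear in hue, power-transformed in chroma and luminance) is an assumption about how such paths are typically constructed and may differ in detail from the package internals; the function name is hypothetical and no conversion to hex codes is attempted.

```python
def sequential_hcl_path(n, h=(300, 75), c=(40, 95), l=(15, 90), power=(1.0, 1.1)):
    """Return n (hue, chroma, luminance) triplets along a sequential HCL path.

    The path runs from (h[0], c[0], l[0]) to (h[1], c[1], l[1]); power[0] bends
    the chroma trajectory and power[1] bends the luminance trajectory.
    """
    coords = []
    for k in range(n):
        # i runs from 1 (first color) down to 0 (last color).
        i = 1.0 - k / (n - 1) if n > 1 else 1.0
        hue = h[1] - (h[1] - h[0]) * i
        chroma = c[1] - (c[1] - c[0]) * i ** power[0]
        luminance = l[1] - (l[1] - l[0]) * i ** power[1]
        coords.append((hue, chroma, luminance))
    return coords

for hue, chroma, luminance in sequential_hcl_path(7):
    print(round(hue, 1), round(chroma, 1), round(luminance, 1))
```

Under these assumptions, the luminance increases monotonically along the path, which is the property that makes sequential palettes readable in gray scale.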

An overview of the named HCL-based palettes in colorspace is depicted below.

from colorspace import hcl_palettes
hcl_palettes(plot = True, figsize = (20, 15))

Palette visualization and assessment

To better understand the properties of palette pal4, defined above, the following figure shows its HCL spectrum (left) and the corresponding path through the HCL space (right).

The spectrum in the first panel shows how the hue (right axis) changes from about 200 (green) to 75 (yellow), while chroma and luminance (left axis) increase from about 20 to 95. Note that the kink in the chroma curve for the greenish colors occurs because such dark greens cannot have higher chromas when represented through RGB-based hex codes. The same is visible in the second panel where the path moves along the outer edge of the HCL space.

pal4.specplot(figsize = (5, 5));
pal4.hclplot(n = 7, figsize = (5, 5));

Color vision deficiency

Another important assessment of a color palette is how well it works for viewers with color vision deficiencies. This is exemplified below by depicting a demo plot (heatmap) under “normal” vision (left), deuteranomaly (colloquially known as “red-green color blindness”, center), and desaturated (gray scale, right). The palette in the top row is the traditional fully-saturated RGB rainbow, deliberately selected here as a palette with poor perceptual properties. It is contrasted with a perceptually-based sequential blue-yellow HCL palette in the bottom row.

The sequential HCL palette is monotonic in luminance so that it is easy to distinguish high-density and low-density regions under deuteranomaly and desaturation. However, the rainbow is non-monotonic in luminance and parts of the red-green contrasts collapse under deuteranomaly, making it much harder to interpret correctly.

from colorspace import rainbow, sequential_hcl
col1 = rainbow(end = 2/3, rev = True)(7)
col2 = sequential_hcl("Blue-Yellow", rev = True)(7)
from colorspace import demoplot, deutan, desaturate
import matplotlib.pyplot as plt
fig, ax = plt.subplots(2, 3, figsize = (9, 4))
demoplot(col1, "Heatmap", ax = ax[0,0], ylabel = "Rainbow", title = "Original")
demoplot(col2, "Heatmap", ax = ax[1,0], ylabel = "HCL (Blue-Yellow)")
demoplot(deutan(col1), "Heatmap", ax = ax[0,1], title = "Deuteranope")
demoplot(deutan(col2), "Heatmap", ax = ax[1,1])
demoplot(desaturate(col1), "Heatmap", ax = ax[0,2], title = "Desaturated")
demoplot(desaturate(col2), "Heatmap", ax = ax[1,2])
plt.show()
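Whether a palette is monotonic in brightness can also be checked numerically. The sketch below does not use colorspace or HCL luminance; as a stand-in it uses the sRGB relative luminance known from the WCAG contrast guidelines, which is a different but related brightness measure. The two lists of hex colors are hand-picked for illustration and are not output of the package.

```python
def relative_luminance(hex_color):
    """sRGB relative luminance (WCAG definition) of a hex color like '#FF0000'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255.0 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding before weighting the channels.
    linear = [ch / 12.92 if ch <= 0.04045 else ((ch + 0.055) / 1.055) ** 2.4
              for ch in channels]
    red, green, blue = linear
    return 0.2126 * red + 0.7152 * green + 0.0722 * blue

def is_monotonic(values):
    """True if the values are entirely non-decreasing or entirely non-increasing."""
    increasing = all(a <= b for a, b in zip(values, values[1:]))
    decreasing = all(a >= b for a, b in zip(values, values[1:]))
    return increasing or decreasing

# Hand-picked illustrative colors (assumptions, not generated by colorspace).
rainbow_like = ["#FF0000", "#FFFF00", "#00FF00", "#00FFFF", "#0000FF"]
blue_yellow_like = ["#00008B", "#4169E1", "#87CEEB", "#FFFF99"]

for name, colors in [("rainbow", rainbow_like), ("blue-yellow", blue_yellow_like)]:
    lums = [relative_luminance(col) for col in colors]
    print(name, [round(lum, 3) for lum in lums], is_monotonic(lums))
```

The fully saturated rainbow colors jump up and down in brightness, whereas the blue-to-yellow sequence brightens steadily, in line with the qualitative comparison above.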

Integration with Python graphics packages

To illustrate that colorspace can be easily combined with different graphics workflows in Python, the code below shows a heatmap (two-dimensional histogram) from matplotlib and multi-group density from seaborn. The code below employs an example data set from the package (using pandas) with daily maximum and minimum temperature. For matplotlib the colormap (.cmap(); LinearSegmentedColormap) is extracted from the adapted viridis palette pal3 defined above. For seaborn the hex codes from a custom qualitative palette are extracted via .colors(4).

from colorspace import dataset, qualitative_hcl
import matplotlib.pyplot as plt
import seaborn as sns
df = dataset("HarzTraffic")
fig = plt.hist2d(df.tempmin, df.tempmax, bins = 20,
                 cmap = pal3.cmap().reversed())
plt.title("Joint density daily min/max temperature")
plt.xlabel("minimum temperature [deg C]")
plt.ylabel("maximum temperature [deg C]")
plt.show()
pal = qualitative_hcl("Dark 3", h1 = -180, h2 = 100)
g = sns.displot(data = df, x = "tempmax", hue = "season", fill = "season",
                kind = "kde", rug = True, height = 4, aspect = 1,
                palette = pal.colors(4))
g.set_axis_labels("temperature [deg C]")
g.set(title = "Distribution of daily maximum temperature given season")
plt.show()

Dependencies and availability

The colorspace package is available from PyPI at https://pypi.org/project/colorspace. It is designed to be lightweight, requiring only numpy (Harris et al. 2020) for the core functionality. Only a few features rely on matplotlib, imageio (Klein et al. 2024), and pandas (The Pandas Development Team 2024). More information and an interactive interface can be found on https://hclwizard.org/. Package development is hosted on GitHub at https://github.com/retostauffer/python-colorspace. Bug reports, code contributions, and feature requests are warmly welcome.

References

Brewer CA (1999). “Color Use Guidelines for Data Representation.” In Proceedings of the Section on Statistical Graphics, American Statistical Association, pp. 55–60. Alexandria, VA.
Cokelaer T (2024). Colormap. Version 1.1.0, Python Package Index (PyPI), URL https://pypi.org/project/colormap/.
Crameri F, Shephard GE, Heron PJ (2020). “The Misuse of Colour in Science Communication.” Nature Communications, 11(5444), 1–10. doi:10.1038/s41467-020-19160-7.
Davis M (2023). palettable: Color Palettes for Python. Version 3.3.3, Python Package Index (PyPI), URL https://pypi.org/project/palettable/.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020). “Array Programming with NumPy.” Nature, 585(7825), 357–362. doi:10.1038/s41586-020-2649-2.
Hunter JD (2007). “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering, 9(3), 90–95. doi:10.1109/mcse.2007.55.
Hunter JD, Dale D, Firing E, Droettboom M, the Matplotlib Development Team (2017). “What’s New in Matplotlib 2.0 (Jan 17, 2017), Changes to the Default Style.” Accessed 2024-07-22, URL https://matplotlib.org/stable/users/prev_whats_new/dflt_style_changes.html.
Ihaka R (2003). “Colour for Presentation Graphics.” In K Hornik, F Leisch, A Zeileis (eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria. ISSN 1609-395X, URL https://www.R-project.org/conferences/DSC-2003/Proceedings/Ihaka.pdf.
Klein A, Wallkötter S, Silvester S, Rynes A, actions-user, Müller P, Nunez-Iglesias J, Harfouche M, Schrangl L, Dennis, Lee A, Pandede, McCormick M, OrganicIrradiation, Rai A, Ladegaard A, van Kemenade H, Smith TD, Vaillant G, jackwalker64, Nises J, Komarčevič M, rreilink, Barnes C, Zulko, Hsieh PC, Rosenstein N, Górny M, scivision, Singleton J (2024). Imageio/Imageio: V2.34.2. doi:10.5281/zenodo.12514964. Version 2.34.2, Zenodo.
Machado GM, Oliveira MM, Fernandes LAF (2009). “A Physiologically-Based Model for Simulation of Color Vision Deficiency.” IEEE Transactions on Visualization and Computer Graphics, 15(6), 1291–1298. doi:10.1109/tvcg.2009.113.
Patel P (2024). Colormaps. Version 0.4.2, Python Package Index (PyPI), URL https://pypi.org/project/colormaps/.
Rollo C (2024). cmcrameri: Python Wrapper around Fabio Crameri’s Perceptually Uniform Colormaps. Version 1.9, Python Package Index (PyPI), URL https://pypi.org/project/cmcrameri/.
The Pandas Development Team (2024). pandas-Dev/Pandas: Pandas. doi:10.5281/zenodo.10957263. Version 2.2.2, Zenodo.
Tufte E (1990). Envisioning Information. Graphics Press, Cheshire.
Ware C (2004). “Color.” In Information Visualization: Perception for Design, chapter 4, pp. 103–149. Morgan Kaufmann Publishers Inc.
Waskom ML (2021). “seaborn: Statistical Data Visualization.” Journal of Open Source Software, 6(60), 3021. doi:10.21105/joss.03021.
Wilke CO (2019). Fundamentals of Data Visualization. O’Reilly Media. ISBN 1492031089. URL https://clauswilke.com/dataviz/color-basics.html.
Zeileis A, Fisher JC, Hornik K, Ihaka R, McWhite CD, Murrell P, Stauffer R, Wilke CO (2020). “colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” Journal of Statistical Software, 96(1), 1–49. doi:10.18637/jss.v096.i01.

Evaluation of the UEFA Euro 2024 group stage forecast

2024-06-28T00:00:00+02:00

A look back on the group stage of the UEFA Euro 2024 to check whether our forecasts based on an ensemble machine learning model were any good...

How surprising was the group stage?

This week the group stage of the UEFA Euro 2024 was concluded so that all pairings for the round of 16 are fixed now. Therefore, today we want to address two questions regarding our own probabilistic forecast for the UEFA Euro 2024, which is based on an ensemble machine learning model and was published prior to the tournament:

How good were the predictions for the group stage? Were the actual outcomes surprising?
How does the outcome of the group stage change the predicted winning probabilities for the tournament?

TL;DR

Overall, our predictions worked quite well and most results were within the expected range of random variation. All tournament favorites proceeded to the round of 16 and mostly the weaker teams dropped out of the tournament.
The biggest surprise was probably that Austria not only proceeded to the round of 16 but ranked first in arguably the strongest Group D, even surpassing France.
Smaller surprises were that Croatia dropped out in Group B, that Belgium came second behind Romania in Group E, and that Georgia prevailed in Group F.
However, some of the more interesting surprises did not really have big consequences, yet, in particular the poor scoring of top favorites France and England. Both of the teams made it to the knockout stage and it will be very interesting to see whether they will be able to boost their performance in the next game(s)!
If England is indeed able to unleash their full potential, they will profit most from the group stage because all the other top favorites (in particular France) are in the other arm of the tournament draw now.

Group stage results

First, we look at the results in terms of which teams successfully advanced from the group stage to the round of 16. The barplot below shows all teams along with their predicted probability to proceed to the round of 16, in the observed ranking order, with the color highlighting which teams advanced to the knockout stage.

Clearly, all group favorites made the cut and mostly teams with lower probabilities dropped out. It may seem somewhat surprising that some of the weaker teams (especially Georgia) “survived” the group stage but with four out of six third-ranked teams advancing to the round of 16 this is not completely unexpected. The results in Groups D and E are probably more surprising: Austria came first in Group D and top favorite France only second. Similarly, Romania took the group victory in Group E ahead of the higher-ranked team from Belgium.

Match results

Next, we take a closer look at the 36 individual group-stage matches to check whether we had any major surprises. The stacked bar plot below groups all match results into three categories by their predicted goal difference for the stronger vs. the weaker team.

In the first bar the stronger team was predicted to be only slightly better, with 0 to 0.6 more predicted goals on average. In this bar we see that the stronger team won fewer than half of the matches (6 out of 15) while the other matches were either lost (2 matches) or ended in a draw (7 matches). Thus, the distribution roughly matches the predictions, although the number of draws is somewhat higher than expected.

The picture is similar in the second bar where the predicted goal difference for the stronger team was between 0.6 and 1. The stronger team won 5 out of 11 matches, lost 1, and drew 5, which is again more draws than expected.

Only in the last bar with the highest predicted goal differences (between 1 and 2 goals) were there fewer draws (2 out of 10). Here the distribution closely matches the expectations with 7 wins for the stronger team and only 1 loss.

As a final evaluation we check whether the observed number of goals per team in each match conforms with the expected distribution based on the Poisson model employed. This is brought out graphically by a so-called hanging rootogram.

The red line shows the square root of the expected frequencies while the “hanging” gray bars represent the square root of the observed frequencies. This shows that the predictions conform closely with the actual observations. There were only a few more occurrences of single goals and fewer results with four goals (none) than expected in our forecast.
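The quantities behind such a hanging rootogram can be sketched in a few lines of Python. This is only an illustration: the goal counts and predicted means below are made-up values, not the tournament data, and the function name `hanging_rootogram` is our own. The expected frequency for k goals is obtained by summing the Poisson probabilities over all observations, each with its own predicted mean.

```python
import math
from collections import Counter

def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def hanging_rootogram(goals, means, max_goals=6):
    """Return rows (k, sqrt_expected, sqrt_observed, bar_bottom).

    goals: observed goals, one entry per team and match
    means: predicted mean goals, same length and order as goals
    """
    observed = Counter(goals)
    rows = []
    for k in range(max_goals + 1):
        expected = sum(poisson_pmf(k, m) for m in means)
        sqrt_exp = math.sqrt(expected)
        sqrt_obs = math.sqrt(observed.get(k, 0))
        # The bar "hangs" from the expected curve, so its bottom
        # shows the deviation from zero on the square-root scale.
        rows.append((k, sqrt_exp, sqrt_obs, sqrt_exp - sqrt_obs))
    return rows

# Hypothetical data for illustration only.
example_goals = [0, 1, 1, 2, 0, 1, 3, 1, 2, 0, 1, 1]
example_means = [0.8, 1.3, 1.1, 1.6, 0.9, 1.2, 1.8, 1.0, 1.4, 0.7, 1.2, 1.1]

for k, e, o, bottom in hanging_rootogram(example_goals, example_means):
    print(f"{k} goals: sqrt(expected)={e:.2f}  sqrt(observed)={o:.2f}  bar bottom={bottom:+.2f}")
```

A bar bottom below zero indicates more observations than expected for that number of goals, a bar bottom above zero indicates fewer.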

Updated knockout stage predictions

Finally, we want to look ahead and explore how the realized tournament draw based on the group stage results changes the predicted winning probabilities for the UEFA Euro 2024. We do so under the assumption that all results so far are within the range of random variation and that we do not need to adapt the predictions for all possible matches. In other words, the simulation is based on the expectation that especially the top favorites France and England can still reach their full potential in the upcoming matches.

Simulating the knockout stage 100,000 times then leads to the following winning probabilities for the tournament. (The barplot preserves the ordering of the teams from the original prediction.)

This shows clearly that England profits most and increases its winning probability for the title to 22.1% (from 16.7%). For the other five top teams France, Germany, Spain, Portugal, and the Netherlands the winning probabilities are almost equal now and all around 13%. Four of these five teams are now all in the same arm of the tournament (which has also been dubbed the “shark tank” in the media) and it will certainly be exciting to see who will eventually make it to the final. In the other arm England and the Netherlands are now the teams with the highest winning probability but we should keep in mind that Austria has already beaten the Netherlands once. Only the next matches will show whether they will be able to do it again, should both teams be able to advance to the quarterfinal.

In any case, the most exciting part of the UEFA Euro 2024 is only starting now and we can all be curious what is going to happen. Everything is still possible!

UEFA Euro 2024 forecast: Netherlands vs. Austria

2024-06-25T00:00:00+02:00

Detailed probabilistic forecast for the match Netherlands vs. Austria at UEFA Euro 2024 in Group D, accompanying a conference presentation at Imagine 2024.

Machine learning ensemble

In a recent blog post, prior to the start of the tournament, probabilistic forecasts for the UEFA Euro 2024 were provided based on a machine learning approach. In short, the approach obtained a number of highly informative inputs about the 24 participating teams before the start of the tournament: Historic match abilities from all national matches over the last 8 years, bookmaker consensus abilities based on quoted odds from 28 bookmakers, average player ratings from goal contributions of individual players in club and national matches, as well as further team-specific information like market value or FIFA rank. Then an ensemble of a random forest, a lasso, and an XGBoost learner was trained on matches from the UEFA Euro 2004–2020. The outcome was a prediction for the mean goals for both teams in all potential matches at the UEFA Euro 2024. Based on these predictions the entire tournament was simulated 100,000 times yielding probabilities for all possible outcomes of the tournament.

Match forecast

The prediction from the machine learning ensemble above for the match Netherlands vs. Austria is summarized in the following table.

	Mean goals	Win probability
🇳🇱	1.3	48.6%
Draw	–	28.1%
🇦🇹	0.8	23.4%

This means that if the Netherlands were to play Austria in lots of matches, the Netherlands are predicted to score 1.3 goals on average in these matches while Austria scores an average of 0.8 goals. Assuming a certain probability distribution for the goals per team in each match, not only the mean goals can be predicted but also the probability for each possible combination of goals by the two teams. The probability distribution employed here is a bivariate independent Poisson model, a relatively simple and standard model that fits empirical scores in football matches very well. The resulting probabilities (for up to five goals per team) are displayed in the heatmap below. Aggregating all probabilities for a Dutch win, a draw, or an Austrian win yields the probabilities shown in the table above (which do not sum to 100% exactly due to rounding).
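The aggregation from mean goals to win, draw, and loss probabilities can be sketched as follows. This is a minimal illustration in plain Python under the independent Poisson assumption described above; the function names are our own. Because the mean goals in the table are rounded to one decimal, the output will be close to, but not identical with, the probabilities reported in the table.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def match_probabilities(mean_a, mean_b, max_goals=20):
    """Win/draw/loss probabilities for team A under independent Poisson goals.

    Scores are enumerated up to max_goals per team; the truncated
    tail is negligible for typical football means.
    """
    win = draw = loss = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, mean_a) * poisson_pmf(goals_b, mean_b)
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                draw += p
            else:
                loss += p
    return win, draw, loss

# Rounded mean goals from the table: Netherlands 1.3, Austria 0.8.
win, draw, loss = match_probabilities(1.3, 0.8)
print(f"Netherlands win: {win:.1%}, draw: {draw:.1%}, Austria win: {loss:.1%}")
```

The individual terms `p` inside the double loop are exactly the cell probabilities that a score heatmap displays.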

Conference presentation

The Imagine conference hosted by the Austrian Ministry of Climate Action, Environment, Energy, Mobility, Innovation and Technology celebrates its 10th birthday today. The final highlight of the conference program is a public viewing of the match Netherlands vs. Austria where the forecast above will be presented alongside a live data-driven analysis by colleagues from the Rotterdam University of Applied Sciences. The presentation slides are linked from the screenshot below.

Forecasting the UEFA Euro 2024 with a machine learning ensemble

2024-06-10T00:00:00+02:00

Probabilistic forecasts for the UEFA Euro 2024 are obtained by using a hybrid model that combines data from four advanced statistical models. The favorite is France, followed by England and host Germany.

Football fans around the world are looking forward to the kick-off of the UEFA Euro 2024 in Germany later this week. 24 of the best European teams will compete from 14 June to 14 July to determine the new European Champion. In anticipation of the tournament the big question is who among the teams will succeed, who will drop out, and who will eventually prevail. While it is, of course, not yet possible to give definitive answers to these questions, we are able to provide probabilistic forecasts for all possible matches based on a refined machine learning approach. This allows us to explore the likely course of the tournament by simulation.

Winning probabilities

The forecast is based on an ensemble of machine learners that blend four main sources of information: An ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 28 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; further team and country covariates (e.g., market value or GDP). The ensemble is trained on the results of the UEFA Euro tournaments from 2004 to 2020 and then applied to current information to obtain a forecast for the UEFA Euro 2024. More specifically, the ensemble predicts the number of goals for all possible matches between all 24 teams in the tournament. Based on the predicted goals the probabilities for a win, draw, or loss in each of these matches can be computed from a bivariate Poisson distribution. This allows us to simulate all matches in the group phase, to determine which teams proceed to the knockout stage, and to see who eventually wins. Repeating the simulation 100,000 times yields winning probabilities for each team. The results show that France is the favorite for the European title with a winning probability of 19.2%, followed by England with 16.7%, and host Germany with 13.7%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.

Interactive full-width graphic

The study has been conducted by an international team of researchers: Florian Felice, Andreas Groll, Lars Magnus Hvattum, Christophe Ley, Gunther Schauberger, Jonas Sternemann, Achim Zeileis. The basic idea for the forecast is to proceed in two steps. In the first step, three sophisticated statistical models are employed to determine the strengths of all teams and their players using disparate sets of information. In the second step, an ensemble of machine learners decides how to best combine the three strength estimates with other information about the teams.

Historic match abilities:
An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
Bookmaker consensus abilities:
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 28 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to the consensus winning probabilities.
Average player ratings:
To infer the contributions of individual players in a match, the plus-minus player ratings of Pantuso & Hvattum (2021) dissect all matches with a certain player (both on club and on national level) into segments, e.g., between substitutions. Subsequently, the goal difference achieved in these segments is linked to the presence of the individual players during that segment. This yields individual ratings for all players that can be aggregated to average player ratings for each team.
Machine learning ensemble:
Finally, an ensemble of different machine learning methods is used to combine these three highly aggregated and informative variables above along with various further relevant variables, yielding refined probabilistic forecasts for each match. Such an approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019) and subsequently improved collaboratively. The ensemble of machine learners is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. The features considered comprise team- and country-specific details (market value, FIFA rank, UEFA points, number of Champions League players, and GDP per capita). By combining a large ensemble of machine learners, each of which employs the available information somewhat differently, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
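To give an idea of the exponential weighting used for the historic match abilities above, here is a small Python sketch. It is not the bivariate Poisson model itself but only illustrates how such weights favor recent matches. The half-life of three years, the function names, and the example matches are all assumptions made for illustration.

```python
def time_weight(days_ago, half_life_days=1095.0):
    """Exponential weight that halves every half_life_days.

    The half-life of roughly three years is a hypothetical choice.
    """
    return 0.5 ** (days_ago / half_life_days)

def weighted_mean_goals(matches, half_life_days=1095.0):
    """Time-weighted average of goals scored.

    matches: list of (days_ago, goals_scored) tuples
    """
    weights = [time_weight(days, half_life_days) for days, _ in matches]
    total = sum(w * goals for w, (_, goals) in zip(weights, matches))
    return total / sum(weights)

# Hypothetical match history: recent form is better than older form.
example_matches = [(30, 3), (400, 1), (2000, 0)]
plain_mean = sum(goals for _, goals in example_matches) / len(example_matches)

print(f"unweighted mean goals: {plain_mean:.2f}")
print(f"time-weighted mean goals: {weighted_mean_goals(example_matches):.2f}")
```

In a model-based setting the same weights would enter the likelihood of the fitted model rather than a simple average.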

Match probabilities

Using the forecasts from the machine learning ensemble yields the predicted number of goals for both teams in each possible match. The explanatory information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), difference in average player ratings of the teams, etc. Assuming a bivariate Poisson distribution with the predicted numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.
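A single knockout match along these lines can be simulated with a few lines of Python. The sketch below uses independent Poisson goals and only the standard library. Scaling the mean goals to one third for the 30 minutes of overtime is our own simplifying assumption, as are all function names.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw a Poisson variate (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def knockout_winner(mean_a, mean_b, rng, overtime_factor=1 / 3):
    """Return "A" or "B" for one simulated knockout match."""
    goals_a = sample_poisson(mean_a, rng)
    goals_b = sample_poisson(mean_b, rng)
    if goals_a == goals_b:
        # Overtime: 30 instead of 90 minutes (scaling is an assumption).
        goals_a += sample_poisson(mean_a * overtime_factor, rng)
        goals_b += sample_poisson(mean_b * overtime_factor, rng)
    if goals_a == goals_b:
        # Penalties are decided by a coin flip.
        return "A" if rng.random() < 0.5 else "B"
    return "A" if goals_a > goals_b else "B"

def knockout_win_probability(mean_a, mean_b, n_sim=100_000, seed=1):
    """Relative frequency of team A advancing in n_sim simulated matches."""
    rng = random.Random(seed)
    wins = sum(knockout_winner(mean_a, mean_b, rng) == "A" for _ in range(n_sim))
    return wins / n_sim

# Hypothetical mean goals for two teams.
print(knockout_win_probability(1.3, 0.8, n_sim=20_000))
```

Repeating this for every pairing of teams would give a matrix of knockout probabilities like the one in the heatmap.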

The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.

Interactive full-width graphic

Performance throughout the tournament

As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times) providing “survival” probabilities for each team across the different stages.

Interactive full-width graphic

Odds and ends

All our forecasts are probabilistic, clearly below 100%, and by no means certain. Thus, although we can quantify this uncertainty in terms of probabilities from an ensemble of potential tournaments, it is far from being predetermined which of these potential tournaments we will eventually see during the actual tournament.

Nevertheless the probabilistic view provides us with some interesting insights: For example, while most bookmakers favor England over France, our model reverses their roles. In a potential final between the two teams, however, France would only have a small advantage with a winning probability of 53.2%. Due to the tournament draw it is relatively unlikely, though, that the two top favorites play the final and much more likely (with a probability of 12.6%) that they play the second semifinal. Somewhat surprisingly, the most likely final (5.4%) is England vs. Germany where the winning probabilities would be almost exactly fifty-fifty.

It is also somewhat unexpected that defending champion Italy has only the 7th-highest probability of winning the championship again (5.6%). This is due to the substantial changes the team underwent in the last three years.

In any case, all of this means that the probabilistic forecasts leave a lot of room for surprises and excitement during the UEFA Euro 2024. But what is absolutely certain is that we look forward to an entertaining tournament as football fans (much more than as professional forecasters).

Subgroup detection in linear growth curve models

2023-11-13T00:00:00+01:00

New arXiv working paper showing how generalized linear mixed effects model (GLMM) trees, along with their R implementation in the glmertree package, can be used to identify subgroups with differently shaped trajectories in linear growth curve models.

Citation

Marjolein Fokkema, Achim Zeileis (2023). “Subgroup Detection in Linear Growth Curve Models with Generalized Linear Mixed Model (GLMM) Trees.” arXiv.org E-Print Archive arXiv:2309.05862 [stat.ME]. doi:10.48550/arXiv.2309.05862

Abstract

Growth curve models are popular tools for studying the development of a response variable within subjects over time. Heterogeneity between subjects is common in such models, and researchers are typically interested in explaining or predicting this heterogeneity. We show how generalized linear mixed effects model (GLMM) trees can be used to identify subgroups with differently shaped trajectories in linear growth curve models. Originally developed for clustered cross-sectional data, GLMM trees are extended here to longitudinal data. The resulting extended GLMM trees are directly applicable to growth curve models as an important special case. In simulated and real-world data, we assess the performance of the extensions and compare against other partitioning methods for growth curve models. Extended GLMM trees perform more accurately than the original algorithm and LongCART, and similarly accurate as structural equation model (SEM) trees. In addition, GLMM trees allow for modeling both discrete and continuous time series, are less sensitive to (mis-)specification of the random-effects structure and are much faster to compute.

Read full paper ›

Software

https://CRAN.R-project.org/package=glmertree

Illustration

As an example, heterogeneity of science ability trajectories among a sample of 250 children is analyzed. The data are from the Early Childhood Longitudinal Study-Kindergarten (ECLS-K) class of 1998-1999 in the USA. Assessments took place from kindergarten in 1998 through 8th grade in 2007. Here we focus on assessments from kindergarten, 1st, 3rd, 5th, and 8th grade. The time since kindergarten was scaled to the number of months to the power of 2/3 in order to obtain approximately linear trajectories.

A linear mixed-effect model tree is used to detect heterogeneity in a linear model for the growth of science ability over time. This employs a random intercept for each individual in order to account for the longitudinal nature of the data. The tree tests for differences in the baseline science abilities (i.e., the fixed-effect intercepts of the growth curve models) as well as the growth over time (i.e., the corresponding fixed-effect slopes), using eleven socio-demographic and behavioral characteristics of the children, assessed at baseline, as potential splitting variables.

The plot below shows the resulting tree which identifies socio-economic status (SES), gross motor skills (GMOTOR), and internalizing problems (INTERN) as the splitting variables. The x-axes represent the number of months after the baseline assessment, y-axes represent science ability. Gray lines depict observed individual trajectories, red lines depict average growth curve within each terminal node, as estimated with a linear mixed-effect model comprising node-specific fixed effects of time and a random intercept with respect to individuals. The table presents numerical estimates of fixed intercepts and slopes.

Five subgroups are identified, corresponding to the terminal nodes of the tree, each with a different estimate of the fixed intercept and slope. Groups of children with higher SES also have higher intercepts, indicating higher average science ability. The group of children with lower SES (node 2) is further split based on gross motor skills, with higher motor skills resulting in a higher intercept. The group of children with intermediate levels of SES (node 6) is further split based on internalizing problems, with lower internalizing problems resulting in a higher intercept. The two groups (or nodes) with higher intercepts also have higher slopes, indicating that children with higher ability also gain more ability over time.

Probabilistic forecasting for the FIFA Women's World Cup 2023

2023-07-17T00:00:00+02:00

Winning probabilities for all teams in the FIFA Women's World Cup are obtained using a consensus model based on quoted bookmakers' odds. The favorite is defending World Champion United States, followed by European Champion England, and Spain.

Football fans around the world anticipate the FIFA Women's World Cup 2023 that will take place in Australia and New Zealand from 20 July to 20 August 2023. 32 of the best teams in the world compete to determine the new World Champion. Here, a predictive model is established to forecast what the most likely outcome of the tournament will be. The forecast is based on the expert knowledge of 24 bookmakers and betting exchanges using a model averaging approach.

Winning probabilities

The model is the so-called bookmaker consensus model which has been proposed by Leitner, Zeileis, and Hornik (2010, International Journal of Forecasting, doi:10.1016/j.ijforecast.2009.10.001) and successfully applied in previous football tournaments, either by itself or in combination with even more refined machine learning techniques.

As in the FIFA Women’s World Cup 2019, the forecast shows that the United States are the clear favorite with a forecasted winning probability of 21.5%, followed by England with a winning probability of 15.7% and Spain with 13.1%. Three other teams are still a bit ahead of the rest: Germany with 9.7%, France with 7.5%, and co-host Australia with 7.4%. More details are displayed in the following barchart.

Interactive full-width graphic

These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 8.6%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities. The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in wwc2023.csv.
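The averaging can be sketched in Python as follows. Note the assumptions: the odds are taken to be decimal odds, the overround is removed by simple proportional normalization (which need not coincide with the exact adjustment used in the bookmaker consensus model), and the odds for four teams from three bookmakers are made up for illustration.

```python
import math

def implied_probabilities(decimal_odds):
    """Turn one bookmaker's decimal odds into probabilities summing to 1.

    Returns the adjusted probabilities and the overround.
    Proportional normalization is a simplifying assumption.
    """
    raw = [1.0 / odds for odds in decimal_odds]
    total = sum(raw)  # equals 1 + overround
    return [r / total for r in raw], total - 1.0

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def consensus(odds_by_bookmaker):
    """Average adjusted probabilities across bookmakers on the log-odds scale."""
    adjusted = [implied_probabilities(odds)[0] for odds in odds_by_bookmaker]
    n_teams = len(adjusted[0])
    result = []
    for team in range(n_teams):
        mean_logit = sum(logit(p[team]) for p in adjusted) / len(adjusted)
        result.append(inv_logit(mean_logit))
    return result

# Hypothetical decimal odds: one row per bookmaker, one column per team.
example_odds = [
    [2.20, 3.5, 4.5, 7.0],
    [2.10, 3.6, 4.8, 6.5],
    [2.25, 3.4, 4.5, 7.5],
]
for team, prob in enumerate(consensus(example_odds)):
    print(f"team {team}: {prob:.1%}")
```

After the back-transformation the consensus probabilities need not sum to exactly 100%, so a final rescaling may be applied if desired.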

Although forecasting the winning probabilities for the FIFA Women’s World Cup 2023 is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:

1. If team abilities are available, pairwise winning probabilities can be derived for each possible match (see below).
2. Given pairwise winning probabilities, the whole tournament can be easily simulated to see which team proceeds to which stage in the tournament and which team finally wins.
3. Such a tournament simulation can then be run sufficiently often (here 100,000 times) to obtain relative frequencies for each team winning the tournament.

Using this idea, abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.

Pairwise comparisons

A classical approach to obtain winning probabilities in pairwise comparisons (i.e., matches between teams/players) is the Bradley-Terry model, which is similar to the Elo rating, popular in sports. The Bradley-Terry approach models the probability that a Team A beats a Team B by their associated abilities (or strengths):

\Pr(A \text{ beats } B) = \frac{\mathit{ability}_A}{\mathit{ability}_A + \mathit{ability}_B}.

Coupled with the “inverse” simulation of the tournament, as described in steps 1-3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match with light gray signalling approximately equal chances and green vs. purple signalling advantages for Team A or B, respectively.
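To make the interplay of the Bradley-Terry model and the “inverse” simulation concrete, here is a minimal Python sketch for a hypothetical four-team knockout (two semifinals and a final). It deviates from the actual procedure in two ways that should be kept in mind: the title probabilities are computed exactly instead of by 100,000 simulation runs, which is only feasible for such a tiny bracket, and the abilities are tuned with a simple damped multiplicative update of our own choosing. All names and numbers are illustrative.

```python
def bt_prob(ability_a, ability_b):
    """Bradley-Terry probability that A beats B."""
    return ability_a / (ability_a + ability_b)

def title_probabilities(abilities):
    """Exact title probabilities for semifinals (0 vs 1), (2 vs 3) and a final."""
    a = abilities
    probs = []
    for i in range(4):
        semi_opponent = i ^ 1                   # 0<->1 and 2<->3
        x, y = (2, 3) if i < 2 else (0, 1)      # possible final opponents
        reach_final = bt_prob(a[i], a[semi_opponent])
        win_final = (bt_prob(a[x], a[y]) * bt_prob(a[i], a[x])
                     + bt_prob(a[y], a[x]) * bt_prob(a[i], a[y]))
        probs.append(reach_final * win_final)
    return probs

def fit_abilities(target, n_iter=500, damping=0.5):
    """Find abilities whose title probabilities match the target."""
    abilities = [1.0] * 4
    for _ in range(n_iter):
        current = title_probabilities(abilities)
        # Raise abilities of teams that win too rarely, lower the others.
        abilities = [a * (t / c) ** damping
                     for a, t, c in zip(abilities, target, current)]
        total = sum(abilities)
        abilities = [a / total for a in abilities]  # only ratios matter
    return abilities

# Pretend these are consensus winning probabilities (derived here from
# known abilities 4:2:2:1 so that an exact solution exists).
target = title_probabilities([4.0, 2.0, 2.0, 1.0])
fitted = fit_abilities(target)
print([round(a / fitted[3], 3) for a in fitted])
print(round(bt_prob(fitted[0], fitted[2]), 3))
```

Once abilities have been fitted in this way, `bt_prob` gives the pairwise probability for any match, including pairings that are not part of the bracket.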

Interactive full-width graphic

Performance throughout the tournament

Interactive full-width graphic

For example, this shows that the probability for the United States to reach any stage of the tournament is higher than for any other team to reach the same stage. In fact, their survival probabilities are decreasing rather slowly because they can most likely avoid the other favorites for the title until the semifinal. Conversely, Germany’s chances to reach the round of 16 are almost as high (87.6%) as those of the United States but their chances to reach the quarterfinal are much lower (55.7%) because they are most likely to play the strongest expected runner-up, Brazil, in the round of 16.

In addition to the curves shown in the plot above, further probabilities of interest can be obtained from the simulation. For example, the probability for the “dream final” between the top favorites, World Champion United States and European Champion England, is 9.1%. The most likely first semi-final is between the United States and Spain with a probability of 13.5%. For the second semi-final it is less clear who is the most likely opponent of England because there are three possible pairings with almost the same probability (around 7%): against Australia, France, or Germany. This shows that this half of the tournament tree is somewhat more contested with a less certain outcome.

Odds and ends

The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. It would also be possible to post-process the bookmaker consensus along with data from historic matches, player ratings, and other information about the teams using machine learning techniques. However, due to lack of time for more refined forecasts at the end of a busy academic year, at least the bookmaker consensus is provided as a solid basic forecast.

As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a profit margin of 8.6% which ensures that the best chances of making money based on sports betting lie with them.

Enjoy the FIFA Women’s World Cup 2023!