Marjolein Fokkema, Achim Zeileis (2023). “Subgroup Detection in Linear Growth Curve Models with Generalized Linear Mixed Model (GLMM) Trees.” arXiv.org E-Print Archive arXiv:2309.05862 [stat.ME]. doi:10.48550/arXiv.2309.05862
Growth curve models are popular tools for studying the development of a response variable within subjects over time. Heterogeneity between subjects is common in such models, and researchers are typically interested in explaining or predicting this heterogeneity. We show how generalized linear mixed effects model (GLMM) trees can be used to identify subgroups with differently shaped trajectories in linear growth curve models. Originally developed for clustered cross-sectional data, GLMM trees are extended here to longitudinal data. The resulting extended GLMM trees are directly applicable to growth curve models as an important special case. In simulated and real-world data, we assess the performance of the extensions and compare against other partitioning methods for growth curve models. Extended GLMM trees perform more accurately than the original algorithm and LongCART, and similarly accurate as structural equation model (SEM) trees. In addition, GLMM trees allow for modeling both discrete and continuous time series, are less sensitive to (mis-)specification of the random-effects structure and are much faster to compute.
https://CRAN.R-project.org/package=glmertree
As an example, heterogeneity of science ability trajectories among a sample of 250 children is analyzed. The data are from the Early Childhood Longitudinal Study-Kindergarten (ECLS-K) class of 1998-1999 in the USA. Assessments took place from kindergarten in 1998 through 8th grade in 2007. Here we focus on assessments from kindergarten, 1st, 3rd, 5th, and 8th grade. The time since kindergarten was scaled to the number of months to the power of 2/3 in order to obtain approximately linear trajectories.
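The scaling step can be sketched in a line of R; the month values below are hypothetical placeholders for the actual ECLS-K assessment times, just to illustrate the transformation:

```r
## hypothetical assessment times in months since kindergarten
## (placeholders, not the actual ECLS-K schedule)
months <- c(0, 12, 36, 60, 96)

## power transformation used to obtain approximately linear trajectories
time <- months^(2/3)
round(time, 1)
```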
A linear mixed-effect model tree is used to detect heterogeneity in a linear model for the growth of science ability over time. This employs a random intercept for each individual in order to account for the longitudinal nature of the data. The tree tests for differences in the baseline science abilities (i.e., the fixed-effect intercepts of the growth curve models) as well as the growth over time (i.e., the corresponding fixed-effect slopes), using eleven socio-demographic and behavioral characteristics of the children, assessed at baseline, as potential splitting variables.
The plot below shows the resulting tree which identifies socio-economic status (SES), gross motor skills (GMOTOR), and internalizing problems (INTERN) as the splitting variables. The x-axes represent the number of months after the baseline assessment, y-axes represent science ability. Gray lines depict observed individual trajectories, red lines depict average growth curve within each terminal node, as estimated with a linear mixed-effect model comprising node-specific fixed effects of time and a random intercept with respect to individuals. The table presents numerical estimates of fixed intercepts and slopes.
Five subgroups are identified, corresponding to the terminal nodes of the tree, each with a different estimate of the fixed intercept and slope. Groups of children with higher SES also have higher intercepts, indicating higher average science ability. The group of children with lower SES (node 2) is further split based on gross motor skills, with higher motor skills resulting in a higher intercept. The group of children with intermediate levels of SES (node 6) is further split based on internalizing problems, with lower internalizing problems resulting in a higher intercept. The two groups (or nodes) with higher intercepts also have higher slopes, indicating that children with higher ability also gain more ability over time.
The model is the so-called bookmaker consensus model which has been proposed by Leitner, Hornik, and Zeileis (2010, International Journal of Forecasting, doi:10.1016/j.ijforecast.2009.10.001) and successfully applied in previous football tournaments, either by itself or in combination with even more refined machine learning techniques.
As in the FIFA Women’s World Cup 2019, the forecast shows that the United States are the clear favorite with a forecasted winning probability of 21.5%, followed by England with a winning probability of 15.7% and Spain with 13.1%. Three other teams are still a bit ahead of the rest: Germany with 9.7%, France with 7.5%, and co-host Australia with 7.4%. More details are displayed in the following barchart.
Interactive full-width graphic
These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 8.6%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities. The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in wwc2023.csv.
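The aggregation described above can be sketched in a few lines of base R; the odds below are hypothetical decimal odds for a single team from three bookmakers, not actual quotes from wwc2023.csv:

```r
## hypothetical decimal winning odds for one team from three bookmakers
odds <- c(4.5, 4.8, 4.6)
overround <- 0.086              # average bookmakers' profit margin

## adjust odds for the profit margin and convert to winning probabilities
p <- 1 / (odds * (1 + overround))

## average on the log-odds scale, then transform back to a probability
consensus <- plogis(mean(qlogis(p)))
consensus
```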
Although forecasting the winning probabilities for the FIFA Women’s World Cup 2023 is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:

1. Start from a set of team abilities.
2. Based on these abilities, simulate all matches and hence the entire tournament many times.
3. Obtain simulated winning probabilities for each team from these tournament simulations.

Using this idea, the abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.
A classical approach to obtaining winning probabilities in pairwise comparisons (i.e., matches between teams/players) is the Bradley-Terry model, which is similar to the Elo rating popular in sports. The Bradley-Terry approach models the probability that Team A beats Team B by their associated abilities (or strengths):
$\mathrm{Pr}(A \text{ beats } B) = \frac{\mathrm{ability}_A}{\mathrm{ability}_A + \mathrm{ability}_B}.$

Coupled with the “inverse” simulation of the tournament, as described in steps 1-3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match, with light gray signalling approximately equal chances and green vs. purple signalling advantages for Team A or B, respectively.
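For illustration, the Bradley-Terry probability can be computed directly from two (hypothetical) ability values:

```r
## Bradley-Terry winning probability from two abilities
bt_prob <- function(ability_a, ability_b) {
  ability_a / (ability_a + ability_b)
}

bt_prob(2, 1)   # a team with twice the ability wins with probability 2/3
```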
Interactive full-width graphic
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.
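A minimal sketch of such a tournament simulation, here for a hypothetical four-team knockout bracket with made-up Bradley-Terry abilities (the actual forecast simulates the full World Cup bracket):

```r
set.seed(1)

## hypothetical abilities; A plays B, C plays D, winners meet in the final
ability <- c(A = 4, B = 2, C = 1.5, D = 1)

## simulate a single knockout match via the Bradley-Terry probability
beats <- function(i, j) {
  runif(1) < ability[i] / (ability[i] + ability[j])
}

simulate_bracket <- function() {
  finalist1 <- if (beats("A", "B")) "A" else "B"
  finalist2 <- if (beats("C", "D")) "C" else "D"
  if (beats(finalist1, finalist2)) finalist1 else finalist2
}

## estimated winning probabilities from repeated simulation
champs <- replicate(10000, simulate_bracket())
prop.table(table(champs))
```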
Interactive full-width graphic
For example, this shows that the probability for the United States to reach any stage of the tournament is higher than for any other team to reach the same stage. In fact, their survival probabilities are decreasing rather slowly because they can most likely avoid the other favorites for the title until the semifinal. Conversely, Germany’s chances to reach the round of 16 are almost as high (87.6%) as those of the United States but their chances to reach the quarterfinal are much lower (55.7%) because they are most likely to play the strongest expected runner-up, Brazil, in the round of 16.
In addition to the curves shown in the plot above, further probabilities of interest can be obtained from the simulation. For example, the probability for the “dream final” between the top favorites, World Champion United States and European Champion England, is 9.1%. The most likely first semi-final is between the United States and Spain with a probability of 13.5%. For the second semi-final it is less clear who is the most likely opponent of England because there are three possible pairings with almost the same probability (around 7%): Against Australia, France, or Germany. This shows that this half of the tournament tree is somewhat more contested with a less certain outcome.
The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. It would also be possible to post-process the bookmaker consensus along with data from historic matches, player ratings, and other information about the teams using machine learning techniques. However, due to lack of time for more refined forecasts at the end of a busy academic year, at least the bookmaker consensus is provided as a solid basic forecast.
As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a profit margin of 8.6%, which ensures that the best chances of making money from sports betting lie with them.
Enjoy the FIFA Women’s World Cup 2023!
Siranush Karapetyan, Achim Zeileis, André Henriksen, Alexander Hapfelmeier (2023). “Tree Models for Assessing Covariate-Dependent Method Agreement.” arXiv.org E-Print Archive arXiv:2306.04456 [stat.ME]. doi:10.48550/arXiv.2306.04456
Method comparison studies explore the agreement of measurements made by two or more methods. Commonly, agreement is evaluated by the well-established Bland-Altman analysis. However, the underlying assumption is that differences between measurements are identically distributed for all observational units and in all application settings. We introduce the concept of conditional method agreement and propose a respective modeling approach to alleviate this constraint. Therefore, the Bland-Altman analysis is embedded in the framework of recursive partitioning to explicitly define subgroups with heterogeneous agreement in dependence of covariates in an exploratory analysis. Three different modeling approaches, conditional inference trees with an appropriate transformation of the modeled differences (CTreeTrafo), distributional regression trees (DistTree), and model-based trees (MOB) are considered. The performance of these models is evaluated in terms of type-I error probability and power in several simulation studies. Further, the adjusted Rand index (ARI) is used to quantify the models’ ability to uncover given subgroups. An application example to real data of accelerometer device measurements is used to demonstrate the applicability. Additionally, a two-sample Bland-Altman test is proposed for exploratory or confirmatory hypothesis testing of differences in agreement between subgroups. Results indicate that all models were able to detect given subgroups with high accuracy as the sample size increased. Relevant covariates that may affect agreement could be detected in the application to accelerometer data. We conclude that conditional method agreement trees (COAT) enable the exploratory analysis of method agreement in dependence of covariates and the respective exploratory or confirmatory hypothesis testing of group differences. It is made publicly available through the R package coat.
R package: https://CRAN.R-project.org/package=coat
Presentation slides: Psychoco 2023
The paper presents an illustration in which measurements of activity energy expenditure (in 24 hours) from two different accelerometers (ActiGraph vs. Actiheart) are compared and their dependence on age, gender, weight, etc. is assessed. As the data is not freely available, we show below another illustration taken from the MethComp package.
The scint data provides measurements of the relative kidney function (renal function, percent of total) for 111 patients. The reference method is DMSA static scintigraphy, and it is compared here with DTPA dynamic scintigraphy. The question we aim to answer using the new COAT method is:
Does the agreement between DTPA and DMSA depend on the age and/or the gender of the patient?
First, the package and data are loaded and reshaped to wide format:
library("coat")
data("scint", package = "MethComp")
scint_wide <- reshape(scint, v.names = "y",
  timevar = "meth", idvar = "item", direction = "wide")
Then, COAT can be applied using the coat() function, by default leveraging ctree() from the partykit package in the background:
tr1 <- coat(y.DTPA + y.DMSA ~ age + sex, data = scint_wide)
print(tr1)
## Conditional method agreement tree (COAT)
##
## Model formula:
## y.DTPA + y.DMSA ~ age + sex
##
## Fitted party:
## [1] root
## | [2] age <= 35: Bias = -0.49, SD = 3.42
## | [3] age > 35: Bias = 0.25, SD = 7.04
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
This shows that the measurement differences between the two scintigraphies vary clearly between young and old patients. While the average difference between the measurements (bias) is close to zero for both age groups, the corresponding standard deviation (SD) is substantially larger (and hence the limits of agreement wider) for the older subgroup. This is better brought out graphically by the corresponding tree display with the classical Bland-Altman plots in the terminal nodes.
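The bias and SD reported in each node are simply the mean and standard deviation of the paired differences within that subgroup, with the limits of agreement following as bias ± 1.96 SD. A self-contained sketch of these Bland-Altman statistics with synthetic measurements (not the scint data):

```r
set.seed(1)

## synthetic paired measurements from two hypothetical methods
truth <- runif(100, 20, 80)
m1 <- truth + rnorm(100, mean = 0, sd = 3)
m2 <- truth + rnorm(100, mean = 0.5, sd = 3)

## Bland-Altman statistics: bias, SD, and limits of agreement
d <- m1 - m2
bias <- mean(d)
s <- sd(d)
loa <- bias + c(lower = -1.96, upper = 1.96) * s
c(bias = bias, sd = s, loa)
```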
plot(tr1)
As the Bland-Altman plot for the older subgroup suggests that the bias between the methods may also depend on the mean measurement, we fit a second COAT tree. In addition to age and gender we also include the mean renal function measurement from DTPA and DMSA as a third potential split variable.
tr2 <- coat(y.DTPA + y.DMSA ~ age + sex, data = scint_wide, means = TRUE)
print(tr2)
## Conditional method agreement tree (COAT)
##
## Model formula:
## y.DTPA + y.DMSA ~ age + sex
##
## Fitted party:
## [1] root
## | [2] means(y.DTPA, y.DMSA) <= 31: Bias = 4.80, SD = 6.61
## | [3] means(y.DTPA, y.DMSA) > 31
## | | [4] means(y.DTPA, y.DMSA) <= 53.5: Bias = -0.38, SD = 3.33
## | | [5] means(y.DTPA, y.DMSA) > 53.5: Bias = -4.27, SD = 3.90
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
plot(tr2)
This tree reveals three subgroups where only the middle group (with renal function between 31 and 53.5 percent) has both small bias and standard deviation for the scintigraphy differences while for the other two subgroups bias and/or standard deviation are larger.
Achim Zeileis, Roger Bivand, Dirk Eddelbuettel, Kurt Hornik, Nathalie Vialaneix (2023). “CRAN Task Views: The Next Generation.” arXiv.org E-Print Archive arXiv:2305.17573 [stat.CO]. doi:10.48550/arXiv.2305.17573
CRAN Task Views: https://CRAN.R-project.org/web/views/
CRAN Task View Initiative: https://github.com/cran-task-views/ctv/
R package: https://CRAN.R-project.org/package=ctv
Functions for emulating color vision deficiencies have been part of the R package colorspace for several years now (since the release of version 1.4-0 in January 2019). They are crucial for assessing how well data visualizations work for viewers affected by color vision deficiencies (about 8% of all males and 0.5% of all females) and for illustrating problems with poor color choices.
The colorspace package implements the physiologically-based model of Machado, Oliveira, and Fernandes (2009), who provide a unified approach to various forms of deficiencies, in particular encompassing deuteranomaly (green cone cells defective), protanomaly (red cone cells defective), and tritanomaly (blue cone cells defective). See the corresponding package vignette for more details.
Recently, an inaccuracy in the colorspace implementation of the Machado et al. method was reported by Matthew Petroff and fixed in colorspace 2.1.0 (released earlier this year), with some advice and guidance from Kenneth Knoblauch.
More specifically, Machado et al. provide linear transformations of RGB (red-green-blue) coordinates that simulate the different color vision deficiencies. Following some illustrations from the supplementary materials of Machado et al., earlier versions of the colorspace package had applied the transformations directly to gamma-corrected sRGB coordinates that can be obtained from color hex codes. However, the paper implicitly relies on a linear RGB space (see page 1294, column 1) where the linear matrix transformations for simulating color vision deficiencies should be applied. Therefore, a new argument linear = TRUE has been added to simulate_cvd() (and hence to deutan(), protan(), and tritan()) that first maps the provided colors to linearized RGB coordinates, applies the color vision deficiency transformation, and then maps back to gamma-corrected sRGB coordinates. Optionally, linear = FALSE can be used to restore the behavior from previous versions where the transformations are applied directly to the sRGB coordinates.
For most colors the difference between the two strategies (in linear vs. gamma-corrected RGB coordinates) is negligible but for some highly-saturated colors it becomes more noticeable, e.g., for red, purple, or orange.
To illustrate this, we set up a small convenience function cvd_compare() that contrasts both approaches for all three types of color vision deficiencies, using the swatchplot() function from colorspace.
cvd_compare <- function(pal) {
  x <- list(
    "Original" = rbind(pal),
    "Deutan" = rbind(
      "linear = TRUE " = colorspace::deutan(pal, linear = TRUE),
      "linear = FALSE" = colorspace::deutan(pal, linear = FALSE)
    ),
    "Protan" = rbind(
      "linear = TRUE " = colorspace::protan(pal, linear = TRUE),
      "linear = FALSE" = colorspace::protan(pal, linear = FALSE)
    ),
    "Tritan" = rbind(
      "linear = TRUE " = colorspace::tritan(pal, linear = TRUE),
      "linear = FALSE" = colorspace::tritan(pal, linear = FALSE)
    )
  )
  rownames(x$Original) <- deparse(substitute(pal))
  colorspace::swatchplot(x)
}
Subsequently, we apply this function to a selection of new base R palettes that have been available since R 4.0.0 in the functions palette.colors() and hcl.colors(). First, it is shown that for many palettes the two strategies lead to almost equivalent output, e.g., for the default qualitative palette in palette.colors(), Okabe-Ito (excluding black and gray), and the default sequential palette in hcl.colors(), Viridis.
cvd_compare(palette.colors()[2:8])
cvd_compare(hcl.colors(7))
The comparison shows that both emulations lead to very similar output, bringing out clearly that both palettes are rather robust under color vision deficiencies.
However, for palettes with more flashy colors (especially highly-saturated red, purple, or orange) the differences may be noticeable and practically relevant. This is illustrated using two sequential HCL palettes, PuRd (inspired from ColorBrewer.org) and Rocket (from the Viridis family):
cvd_compare(hcl.colors(7, "PuRd"))
cvd_compare(hcl.colors(7, "Rocket"))
The comparison shows that the emulation differs in particular for colors 2, 3, and 4 in both palettes, leading to slightly different insights regarding the properties of the palettes.
The differences can become even more pronounced for fully-saturated colors like those in the infamous rainbow palette, shown below.
cvd_compare(rainbow(7))
Luckily, for palettes with better perceptual properties, the differences between the old erroneous version and the new fixed one are typically rather small. Hence, we hope that the bug did not affect prior work too much and that the fixed version is even more useful for all users of the package.
Achim Zeileis, Paul Murrell (2023). “Coloring in R’s Blind Spot.” arXiv.org E-Print Archive arXiv:2303.04918 [stat.CO]. doi:10.48550/arXiv.2303.04918
Prior to version 4.0.0 R had a poor default color palette (using highly saturated red, green, blue, etc.) and provided very few alternative palettes, most of which also had poor perceptual properties (like the infamous rainbow palette). Starting with version 4.0.0 R gained a new and much improved default palette and, in addition, a selection of more than 100 well-established palettes are now available via the functions palette.colors() and hcl.colors(). The former provides a range of popular qualitative palettes for categorical data while the latter closely approximates many popular sequential and diverging palettes by systematically varying the perceptual hue, chroma, luminance (HCL) properties in the palette. This paper provides an overview of these new color functions and the palettes they provide along with advice about which palettes are appropriate for specific tasks, especially with regard to making them accessible to viewers with color vision deficiencies.
Package grDevices in base R provides palette.colors() and hcl.colors() and accompanying functionality since R version 4.0.0.

Package colorspace (CRAN, Web page) provides color vision deficiency emulation along with many other color tools. See also below for the recent bug fix in color vision deficiency emulation.
Replication code: coloring.R, paletteGrid.R
The table below provides an overview of the new base R palette functionality: For each main type of palette, the Purpose row describes what sort of data the type of palette is appropriate for, the Generate row gives the functions that can be used to generate palettes of that type, the List row names the functions that can be used to list available palettes, and the Robust row identifies two or three good default palettes of that type.
|  | Qualitative | Sequential | Diverging |
|---|---|---|---|
| Purpose | Categorical data | Ordered or numeric data (high → low) | Ordered or numeric with central value (high ← neutral → low) |
| Generate | palette.colors(), hcl.colors() | hcl.colors() | hcl.colors() |
| List | palette.pals(), hcl.pals("qualitative") | hcl.pals("sequential") | hcl.pals("diverging"), hcl.pals("divergingx") |
| Robust | "Okabe-Ito", "R4" | "Blues 3", "YlGnBu", "Viridis" | "Purple-Green", "Blue-Red 3" |
Based on this, the color defaults in base R were adapted. In particular, the old default palette was replaced by the "R4" palette, using very similar hues but avoiding the garish colors with extreme variations in brightness (see below for an example).

Recently, the recommended package lattice also changed its default color theme (in version 0.21-8), using the qualitative "Okabe-Ito" palette as the symbol and fill color and the sequential "YlGnBu" palette for shading regions.
All palettes provided by the palette.colors() function are shown below (except the old default "R3" palette, which is only implemented for backward compatibility).

Lighter palettes are typically more useful for shading areas, e.g., in bar plots or similar displays. Darker and more colorful palettes are usually better for coloring points or lines. The palettes "R4" and "Okabe-Ito" are particularly noteworthy because they have been designed to be reasonably robust under color vision deficiencies.
This is illustrated in a time series line plot of the base R EuStockMarkets data. The three rows show different palette.colors() palettes: the old "R3" default palette (top), the new "R4" default palette (middle), and the "Okabe-Ito" palette (bottom). The columns contrast normal vision (left) and emulated deuteranope vision (right), the most common type of color vision deficiency. A color legend is used in the first row and direct labels in the other rows.

We can see that the "R3" colors are highly saturated and vary in luminance (brightness). For example, the cyan line is noticeably lighter than the others. Furthermore, for deuteranope viewers, the CAC and the SMI lines are difficult to distinguish from each other (exacerbated by the use of a color legend that makes matching the lines to labels almost impossible). Moreover, the FTSE line is more difficult to distinguish from the white background, compared to the other lines. The "R4" palette is an improvement: the luminance is more even and the colors are less saturated, plus the colors are more distinguishable for deuteranope viewers (aided by the use of direct color labels instead of a legend). The "Okabe-Ito" palette works even better, particularly for deuteranope viewers.
In addition to qualitative palettes, the hcl.colors() function provides a wide range of sequential and diverging palettes designed for numeric or ordered data with or without a neutral reference value, respectively. There are more than 100 such palettes, many of which closely approximate palettes from well-established sources such as ColorBrewer.org, the Viridis family, CARTO colors, or Crameri’s scientific colors. The graphic below depicts just a subset of the multi-hue sequential palettes for illustration.
Some empirical examples and more insights are provided in the working paper linked above.
Thorsten Simon, Georg J. Mayr, Deborah Morgenstern, Nikolaus Umlauf, Achim Zeileis (2023). “Amplification of Annual and Diurnal Cycles of Alpine Lightning.” Climate Dynamics, Forthcoming. doi:10.1007/s00382-023-06786-8
The response of lightning to a changing climate is not fully understood. Historic trends of proxies known for fostering convective environments suggest an increase of lightning over large parts of Europe. Since lightning results from the interaction of processes on many scales, as many of these processes as possible must be considered for a comprehensive answer. Recent achievements of decade-long seamless lightning measurements and hourly reanalyses of atmospheric conditions including cloud micro-physics combined with flexible regression techniques have made a reliable reconstruction of cloud-to-ground lightning down to its seasonally varying diurnal cycle feasible. The European Eastern Alps and their surroundings are chosen as reconstruction region since this domain includes a large variety of land-cover, topographical and atmospheric circulation conditions. The most intense changes over the four decades from 1980 to 2019 occurred over the high Alps where lightning activity doubled in the 2010s compared to the 1980s. There, the lightning season reaches a higher maximum and starts one month earlier. Diurnally, the peak is up to 50% stronger with more lightning strikes in the afternoon and evening hours. Signals along the southern and northern alpine rim are similar but weaker whereas the flatlands surrounding the Alps have no significant trend.
R packages bamlss (CRAN, Web page) and mgcv (CRAN).
The study links two sources of information, both available at a spatio-temporal resolution of 32 km x 32 km and one hour: lightning observations and atmospheric parameters from the ERA5 reanalysis.
The idea is to learn the link between the lightning observations and the ERA5 atmospheric parameters in the time period where both data sources are available (2010-2019). Subsequently, probabilistic predictions can be made for lightning occurrence over the entire time period starting in 1980, i.e., including the period where only atmospheric parameters but no high-quality lightning detection observations are available. This then makes it possible to track how the probability of lightning occurrence has evolved over the decades, both in terms of the annual seasonal cycles and the diurnal cycle.
The probabilistic model learned on this challenging data set is a generalized additive model (GAM) using a binary logit link and smooth spline terms for all explanatory variables based on the atmospheric parameters and additional spatio-temporal information. To deal with variable selection given the large number of explanatory variables, the model is estimated by gradient boosting (as opposed to the classical maximum likelihood technique) combined with stability selection. These have been implemented using the R packages mgcv and bamlss.
Based on the probabilistic predictions from this boosted binary GAM, the figure below shows reconstructed annual cycles of probabilities for lightning events averaged over the four decades from 1980s to 2010s (color coded). The light curves in the background are aggregations to the day of the year. The dark curves in the foreground are smoothed versions of the light curves. This shows that the peak in summer is much more pronounced and starts earlier for the High Alps and the Southern Alpine rim while there are only minor changes at the Northern Alpine rim and the surrounding flatlands.
To aggregate these changes even further and capture climate changes, linear trends are fitted to the reconstructed probabilities for June (afternoons, 13-19 UTC) over time. The figure below shows the spatial distribution of these linear climate trends: Color luminance gives the slope per decade of a linear regression for mean probability of lightning within an hour in percent. Desaturated colors in the grids indicate that the linear trends for these grids are not significant at the 5% level. Again, this highlights the pronounced changes in the High Alps and the Southern Alpine rim while there are no significant changes in the surrounding flatlands.
For more details and further insights see the full paper linked above.
The forecast is based on a conditional inference random forest learner that blends information capturing the past, present, and future of the competing football teams: Insights from the past are captured in an ability estimate for every team based on historic matches. Expectations about the future in the upcoming tournament are captured in an ability estimate for every team based on odds from international bookmakers. The present status of the teams (and their countries) is represented by covariates such as market value or the types of players in the team as well as country-specific socio-economic factors like population or GDP. The random forest model is learned using the previous five FIFA World Cup tournaments from 2002 to 2018 as training data and then applied to current information to obtain a forecast for the 2022 FIFA World Cup. More precisely, the random forest is calibrated to predict the likely distribution of goals for each team in all possible matches in the tournament. This makes it possible to simulate the outcome of each match in normal time as well as potential extra time and penalties in order to obtain probabilities for a win, draw, or loss. Moreover, because every individual match can be simulated like that, a “multiverse” of potential courses of the entire tournament can be created, yielding overall winning probabilities for each team. The results show that - 20 years after winning the title the last time - Brazil is the clear favorite for the World Cup with a winning probability of 15.0%, followed by Argentina with 11.2%, the Netherlands with 9.7%, Germany with 9.2%, and France with 9.1%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.
Interactive full-width graphic
The full study has been conducted by an international team of researchers: Andreas Groll, Neele Hormann, Christophe Ley, Gunther Schauberger, Hans Van Eetvelde, Achim Zeileis. The core of the contribution is a hybrid approach that starts out from three state-of-the-art forecasting methods, based on disparate sets of information, and lets an adaptive machine learning model decide how to blend the different sources of information.
Historic information: Match abilities.
An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
Future expectation: Bookmaker consensus abilities.
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 28 international bookmakers that reflect their expert expectations for the tournament. Using an enhanced version of the bookmaker consensus model from Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To correct for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to these winning probabilities.
Combination with present status: Hybrid random forests.
Finally, machine learning is used to combine these highly aggregated ability estimates with a broad range of further relevant covariates reflecting the current states of the different teams and the countries they come from. Such a hybrid approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019). A random forest learner is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. The features considered comprise team-specific details (e.g., market value, FIFA rank, team structure) as well as country-specific socio-economic factors (population and GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importance of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in market values (on a log scale), etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated in extra time, if necessary, and a coin flip is used to decide penalties, if needed.
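This step can be sketched as follows, with two simplifying assumptions: independent Poisson margins are used instead of the bivariate Poisson of the actual model, and the function names are hypothetical:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson(lam) distribution."""
    return exp(-lam) * lam ** k / factorial(k)

def match_probabilities(lam_a, lam_b, max_goals=15):
    """Win/draw/loss probabilities for team A against team B, assuming
    independent Poisson goal counts with means lam_a and lam_b.
    (Simplification: the study uses a bivariate Poisson distribution.)"""
    win = draw = loss = 0.0
    for i in range(max_goals + 1):
        for j in range(max_goals + 1):
            p = poisson_pmf(i, lam_a) * poisson_pmf(j, lam_b)
            if i > j:
                win += p
            elif i == j:
                draw += p
            else:
                loss += p
    return win, draw, loss
```

Truncating at 15 goals per team leaves a negligible remainder for realistic expected goal counts, so the three probabilities sum to essentially one.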
The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.
Interactive full-width graphic
Based on the simulation of individual pairwise matches, as described above, we can create a “multiverse” of potential courses of the entire tournament (here: 100,000). The chances of the teams’ “survival” throughout the tournament can then be described by the proportions of multiverses in which they reach the different stages from the round of 16 to winning the overall title.
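A minimal sketch of such a multiverse simulation for a single-elimination bracket looks like this. The function names and the toy pairwise probabilities are assumptions for illustration; the actual study simulates the full group stage and knockout structure:

```python
import random

def simulate_knockout(bracket, beat_prob, rng):
    """Play out one single-elimination bracket. beat_prob[(a, b)] is the
    probability that a beats b in a knockout match (both orderings needed)."""
    teams = list(bracket)
    while len(teams) > 1:
        teams = [a if rng.random() < beat_prob[(a, b)] else b
                 for a, b in zip(teams[::2], teams[1::2])]
    return teams[0]

def title_probabilities(bracket, beat_prob, n_sim=20000, seed=42):
    """Proportion of simulated tournament 'multiverses' won by each team."""
    rng = random.Random(seed)
    wins = {t: 0 for t in bracket}
    for _ in range(n_sim):
        wins[simulate_knockout(bracket, beat_prob, rng)] += 1
    return {t: w / n_sim for t, w in wins.items()}
```

Tracking which round each team is eliminated in, instead of only the champion, yields the stage-wise “survival” proportions described above.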
Interactive full-width graphic
All our forecasts are probabilistic, clearly below 100%, and by no means certain. Thus, although we can quantify this uncertainty in terms of probabilities from a multiverse of tournaments, it is far from being predetermined which of these possible tournaments we will see in our universe.
Unfortunately, the experience of observing the actual tournament will be far less exciting and joyful than usual for us as researchers/forecasters and also as football fans due to the special circumstances. In addition to the widely discussed ethical problems regarding this FIFA World Cup, there are also sporting issues that are critical: The climate in Qatar is extraordinarily hot, which necessitated shifting the event to the winter months. Therefore, all major football leagues in Europe and South America have to interrupt their usual schedule in order to accommodate the tournament. This gives the national teams less time for preparation and the players less time for recovery before and after the World Cup. In combination with the extreme climatic conditions, this also increases the risk of injuries. Hence, having a team with many players in the international European leagues (Champions League, Europa League, Europa Conference League) might actually be a handicap rather than a strength this year.
All of these factors make the forecast of the tournament outcome more difficult, as variables that have been highly predictive in previous World Cups might not work, or might work differently.
Finally, more from the perspective of football fans (rather than professional forecasters) we are sad that all the usual joy and anticipation of a football World Cup has been crushed by the terrible circumstances this year: starting from the alleged bribery and corruption in the FIFA assignment process, to the human rights and working conditions in Qatar, and the lack of sustainability in the construction and operation of the stadiums.
The model is the so-called bookmaker consensus model which has been proposed by Leitner, Zeileis, and Hornik (2010, International Journal of Forecasting, https://doi.org/10.1016/j.ijforecast.2009.10.001) and successfully applied in previous football tournaments, either by itself or in combination with even more refined machine learning techniques.
This time the forecast shows that Spain is the favorite with a forecasted winning probability of 19.6%, closely followed by England with a winning probability of 16.6%. Four teams also have double-digit winning probabilities: France with 13.5%, the Netherlands with 13.3%, Germany with 10.3%, and Sweden with 10.1%. More details are displayed in the following barchart.
Interactive full-width graphic
These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 20.1%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities. The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in weuro2022.csv.
Although forecasting the winning probabilities for the UEFA Women’s Euro 2022 is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:
1. If team abilities are available, pairwise winning probabilities can be derived for all possible matches.
2. Using these pairwise winning probabilities, the whole tournament can be simulated explicitly.
3. Such a tournament simulation then yields winning probabilities for all teams.
Using this idea, abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.
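The idea of this “inverse” simulation can be sketched for a toy 4-team knockout bracket, where the title probabilities can be computed exactly from Bradley-Terry abilities, and the abilities are then iteratively rescaled until the implied title probabilities match given targets. The function names and the damped multiplicative update rule are assumptions for this sketch, not the published algorithm:

```python
def bt(a, b):
    """Bradley-Terry probability that a team with ability a beats one with b."""
    return a / (a + b)

def exact_title_probs(s):
    """Exact title probabilities for a 4-team bracket (1 vs 2, 3 vs 4,
    winners meet in the final), given abilities s[0..3]."""
    p12, p34 = bt(s[0], s[1]), bt(s[2], s[3])
    return [
        p12 * (p34 * bt(s[0], s[2]) + (1 - p34) * bt(s[0], s[3])),
        (1 - p12) * (p34 * bt(s[1], s[2]) + (1 - p34) * bt(s[1], s[3])),
        p34 * (p12 * bt(s[2], s[0]) + (1 - p12) * bt(s[2], s[1])),
        (1 - p34) * (p12 * bt(s[3], s[0]) + (1 - p12) * bt(s[3], s[1])),
    ]

def infer_abilities(target, n_iter=200, step=0.5):
    """'Inverse' simulation sketch: rescale abilities until the implied
    title probabilities match the (consensus) targets."""
    s = list(target)
    for _ in range(n_iter):
        probs = exact_title_probs(s)
        s = [si * (t / p) ** step for si, t, p in zip(s, target, probs)]
        total = sum(s)
        s = [si / total for si in s]  # normalize for identifiability
    return s
```

In the real application the exact probabilities are replaced by Monte Carlo simulation of the full tournament, but the matching logic is the same.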
A classical approach to obtaining winning probabilities in pairwise comparisons (i.e., matches between teams or players) is the Bradley-Terry model, which is similar to the Elo rating popular in sports. The Bradley-Terry approach models the probability that a team A beats a team B by their associated abilities (or strengths):
$$\mathrm{Pr}(A \text{ beats } B) = \frac{\mathrm{ability}_A}{\mathrm{ability}_A + \mathrm{ability}_B}.$$

Coupled with the “inverse” simulation of the tournament, as described in steps 1-3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match with light gray signalling approximately equal chances and green vs. purple signalling advantages for Team A or B, respectively.
Interactive full-width graphic
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.
Interactive full-width graphic
For example, this shows that Spain’s chances of reaching one of the quarterfinals are lower than England’s or France’s, but its chances of reaching one of the semifinals are higher. The reason is that Spain plays another one of the six strongest teams in its group (Germany) but can then likely avoid the remaining top teams in the quarterfinal. Conversely, England and France do not have another of the six top teams in their groups but most likely play one in their quarterfinals (Germany, and the Netherlands or Sweden, respectively).
This effect of the tournament draw is also brought out by another display that highlights the likely flow of all teams through the tournament simultaneously. Compared to the survival curves shown above this visualization brings out more clearly at which stages of the tournament the strong teams are most likely to meet.
Interactive full-width graphic
The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. It would also be possible to post-process the bookmaker consensus along with data from historic matches, player ratings, and other information about the teams using machine learning techniques. However, due to lack of time for more refined forecasts at the end of a busy academic year, at least the bookmaker consensus is provided as a solid basic forecast.
As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a sizeable profit margin of about 20.1% which assures that the best chances of making money based on sports betting lie with them!
In a few days we will start learning which of the probable paths through the tournament, shown above, will actually come true. Enjoy the UEFA Women’s Euro 2022!
Susanne Dandl, Torsten Hothorn, Heidi Seibold, Erik Sverdrup, Stefan Wager, Achim Zeileis (2022). “What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?.” arXiv.org E-Print Archive arXiv:2206.10323 [stat.ME]. doi:10.48550/arXiv.2206.10323
Estimation of heterogeneous treatment effects (HTE) is of prime importance in many disciplines, ranging from personalized medicine to economics among many others. Random forests have been shown to be a flexible and powerful approach to HTE estimation in both randomized trials and observational studies. In particular “causal forests”, introduced by Athey, Tibshirani, and Wager (2019), along with the R implementation in package grf, were rapidly adopted. A related approach, called “model-based forests”, that is geared towards randomized trials and simultaneously captures effects of both prognostic and predictive variables, was introduced by Seibold, Zeileis, and Hothorn (2018) along with a modular implementation in the R package model4you.
Here, we present a unifying view that goes beyond the theoretical motivations and investigates which computational elements make causal forests so successful and how these can be blended with the strengths of model-based forests. To do so, we show that both methods can be understood in terms of the same parameters and model assumptions for an additive model under L_{2} loss. This theoretical insight allows us to implement several flavors of “model-based causal forests” and dissect their different elements in silico.
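One way to make this shared representation concrete is the partially linear formulation below. The notation is ours for illustration, following the Robinson-style residualization that underlies the local centering discussed next, and is not necessarily the paper's exact symbols:

```latex
% Additive model under L2 loss: prognostic part m(x), predictive
% (treatment-effect) part tau(x), binary treatment indicator W.
Y = m(X) + \tau(X)\, W + \varepsilon,
  \qquad \mathbb{E}[\varepsilon \mid X, W] = 0.

% "Local centering" subtracts the nuisance components
% E[Y|X] = m(X) + tau(X) e(X), with propensity e(X) = E[W|X],
% which isolates the heterogeneous treatment effect tau(X):
Y - \mathbb{E}[Y \mid X] = \tau(X) \bigl( W - e(X) \bigr) + \varepsilon.
```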
The original causal forests and model-based forests are compared with the new blended versions in a benchmark study exploring both randomized trials and observational settings. In the randomized setting, both approaches performed similarly. If confounding was present in the data-generating process, we found local centering of the treatment indicator with the corresponding propensities to be the main driver of good performance. Local centering of the outcome was less important, and might be replaced or enhanced by simultaneous split selection with respect to both prognostic and predictive effects. This lays the foundation for future research combining random forests for HTE estimation with other types of models.
We demonstrate the practical aspects of such a model-agnostic approach to HTE estimation analyzing the effect of cesarean section on postpartum blood loss in comparison to vaginal delivery. Clearly, randomization is hardly possible in this setup, and we present a tailored model-based forest for skewed and interval-censored data to infer possible predictive variables and their impact on the treatment effect.
To investigate which elements of the different random forest algorithms in causal forests (cf) vs. model-based forests (mob) contribute to more precise estimation of heterogeneous treatment effects, a large simulation experiment was carried out, using normal outcomes, different predictive and prognostic effects, and a varying number of observations (N) and covariates (P).
In addition to the original cf (from grf) and mob (from model4you) algorithms, three blended versions (based on model4you) were assessed: mob(\(\widehat W\)), model-based forests after centering of the treatment indicator; mob(\(\widehat W\), \(\widehat Y\)), model-based forests after centering of both the treatment indicator and the outcome; and mobcf, model-based forests after centering of both the treatment indicator and the outcome, only testing for splits in the treatment effect.
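The local-centering step shared by these variants can be illustrated with a deliberately crude, stdlib-only sketch that estimates the nuisance functions E[W|X] and E[Y|X] by bin-wise means over a single covariate. The actual algorithms use honest random forests for these nuisance estimates, and all names here are hypothetical:

```python
def local_center(X, W, Y, n_bins=5):
    """Sketch of 'local centering': estimate E[W|X] and E[Y|X] by bin-wise
    means over one covariate, then subtract these estimates so that the
    residualized treatment indicator and outcome can be fed to the forest."""
    lo, hi = min(X), max(X)
    width = (hi - lo) / n_bins or 1.0   # guard against a constant covariate
    bins = [min(int((x - lo) / width), n_bins - 1) for x in X]
    w_mean, y_mean = {}, {}
    for b in set(bins):
        idx = [i for i, bb in enumerate(bins) if bb == b]
        w_mean[b] = sum(W[i] for i in idx) / len(idx)
        y_mean[b] = sum(Y[i] for i in idx) / len(idx)
    W_c = [W[i] - w_mean[bins[i]] for i in range(len(X))]
    Y_c = [Y[i] - y_mean[bins[i]] for i in range(len(X))]
    return W_c, Y_c
```

By construction the centered treatment indicator and outcome have mean zero within each bin, mimicking how propensity and outcome residuals remove confounding that operates through X.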
Four data-generation setups are considered, as proposed by Nie and Wager (2021): Setup A has complicated confounding but a relatively simple treatment effect function. Setup B has no confounding. Setup C has strong confounding but a constant treatment effect. In Setup D the treatment and control arms are completely unrelated.
Overall, the results in the figure below show that centering of the treatment indicator as in mob(\(\widehat W\)) is the most relevant ingredient to random forests for HTE estimation in observational studies. If possible, additionally centering the outcome in combination with simultaneous estimation of predictive and prognostic effects in mob(\(\widehat W\), \(\widehat Y\)) is recommended as it always performs as well as mob(\(\widehat W\)) and mobcf but may yield relevant improvements in some scenarios. Other technical aspects of tree and forest induction did not contribute to major performance differences. The overall strong performance of mob(\(\widehat W\), \(\widehat Y\)), combining centering of outcome and treatment from causal forests with joint estimation of prognostic and predictive effects, suggests that alternative split criteria sensitive to both intercepts and treatment effects might be able to improve the performance of causal forests.
For more details and more results see the arXiv working paper.
To illustrate how model-based causal forests can be tailored for specific situations, the effect of cesarean sections vs. vaginal deliveries (treatment) on the amount of postpartum blood loss (outcome) is investigated. Clearly, covariates like maternal age, birth weight, gestational age, or multifetal pregnancy potentially have an impact on both the treatment and the outcome. As randomizing the mode of delivery is impossible, methods for HTE estimation from observational data are needed. Moreover, blood loss is a skewed variable that is additionally impossible to measure exactly in the sometimes hectic environment of a delivery ward. It is hence treated as interval-censored. To accommodate all these features, a model-based causal forest is fitted using pmforest() from model4you in combination with a model for the skewed, interval-censored outcome.
The dependency of the treatment effect on the prepartum variables is visualized in the figure below, using scatter plots for continuous covariates and boxplots for categorical covariates. While some variables have virtually no influence on the treatment effect (e.g., mother’s age), others are associated with clear effect differences. In particular, deliveries with higher gestational age, higher neonatal weight, and no multifetal pregnancy carry a higher risk of elevated blood loss due to cesarean section compared to vaginal delivery.
For more details see the arXiv working paper.