https://www.zeileis.org/Achim Zeileis2022-08-05T16:31:15+02:00Research homepage of Achim Zeileis, Universität Innsbruck. <br/>Department of Statistics, Faculty of Economics and Statistics. <br/>Universitätsstr. 15, 6020 Innsbruck, Austria. <br/>Tel: +43/512/507-70403Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Jekyllhttps://www.zeileis.org/news/weuro2022/Probabilistic forecasting for the UEFA Women's Euro 20222022-07-04T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Using a consensus model based on quoted bookmakers' odds, winning probabilities for all competing teams in the UEFA Women's Euro are obtained: The favorite is Spain, followed by host England, France, and the Netherlands as the defending champion.<p>Using a consensus model based on quoted bookmakers' odds, winning probabilities for all competing teams in the UEFA Women's Euro are obtained: The favorite is Spain, followed by host England, France, and the Netherlands as the defending champion.</p> <div class="row t20 b20"> <div class="small-8 medium-9 large-10 columns"> Football fans throughout Europe and the world anticipate the UEFA Women's Euro 2022 that will take place in England from 6 July to 31 July 2022. Sixteen of the best European teams compete to determine the new European Champion. Here, a predictive model is established to forecast the most likely outcome of the tournament. The forecast is based on the expert knowledge of 16 bookmakers and betting exchanges using a model averaging approach. 
</div> <div class="small-4 medium-3 large-2 columns"> <a href="https://www.uefa.com/womenseuro/" alt="UEFA Women's Euro 2022 web page"><img src="https://upload.wikimedia.org/wikipedia/en/0/0b/UEFA_Women%27s_Euro_2022_logo.svg" alt="UEFA Women's Euro 2022 logo" /></a> </div> </div> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The model is the so-called bookmaker consensus model, which was proposed by Leitner, Hornik, and Zeileis (2010, <em>International Journal of Forecasting</em>, <a href="https://doi.org/10.1016/j.ijforecast.2009.10.001">https://doi.org/10.1016/j.ijforecast.2009.10.001</a>) and successfully applied in previous football tournaments, either by itself or in combination with even more refined <a href="https://www.zeileis.org/news/euro2020/">machine learning techniques</a>.</p> <p>This time the forecast shows that Spain is the favorite with a forecasted winning probability of 19.6%, closely followed by England with a winning probability of 16.6%. Four more teams have double-digit winning probabilities: France with 13.5%, the Netherlands with 13.3%, Germany with 10.3%, and Sweden with 10.1%. More details are displayed in the following barchart.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_win.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 20.1%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities. 
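</p> <p>The three steps just described can be sketched in a few lines of R. This is a simplified illustration, not the exact code behind the forecasts: the decimal odds below are made-up numbers for three hypothetical bookmakers and only four teams, whereas the actual forecasts use the full quotes for all 16 teams:</p>

```r
## Hypothetical decimal odds for four teams from three bookmakers
## (made-up numbers, for illustration only).
odds <- cbind(
  bm1 = c(ESP = 5.0, ENG = 6.0, FRA = 7.5, NED = 7.5),
  bm2 = c(4.8, 5.8, 7.0, 8.0),
  bm3 = c(5.2, 6.2, 7.2, 7.8)
)

## Step 1: adjust for the overround. The inverse odds of each bookmaker
## sum to more than 1; normalizing them removes the profit margin.
prob <- apply(odds, 2, function(o) (1/o) / sum(1/o))

## Step 2: average on the log-odds scale to a consensus rating.
consensus <- rowMeans(qlogis(prob))

## Step 3: transform back to winning probabilities (and renormalize).
pwin <- plogis(consensus)
pwin <- pwin / sum(pwin)
round(pwin, 3)
```

<p>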
The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in <a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/weuro2022.csv">weuro2022.csv</a>.</p> <p>Although forecasting the winning probabilities for the UEFA Women’s Euro 2022 is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:</p> <ol> <li>If team abilities are available, pairwise winning probabilities can be derived for each possible match (see below).</li> <li>Given pairwise winning probabilities, the whole tournament can be easily simulated to see which team proceeds to which stage in the tournament and which team finally wins.</li> <li>Such a tournament simulation can then be run sufficiently often (here 100,000 times) to obtain relative frequencies for each team winning the tournament.</li> </ol> <p>Using this idea, abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.</p> <h2 id="pairwise-comparisons">Pairwise comparisons</h2> <p>A classical approach to obtain winning probabilities in pairwise comparisons (i.e., matches between teams/players) is the Bradley-Terry model, which is similar to the Elo rating, popular in sports. 
The Bradley-Terry approach models the probability that a Team A beats a Team B by their associated abilities (or strengths):</p> <math xmlns="http://www.w3.org/1998/Math/MathML"><mstyle displaystyle="true"><mrow><mi fontstyle="normal">Pr</mi><mo stretchy="false">(</mo><mi>A</mi><mtext> beats </mtext><mi>B</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mrow><msub><mrow><mi fontstyle="italic">ability</mi></mrow><mrow><mi>A</mi></mrow></msub></mrow><mrow><msub><mrow><mi fontstyle="italic">ability</mi></mrow><mrow><mi>A</mi></mrow></msub><mo>+</mo><msub><mrow><mi fontstyle="italic">ability</mi></mrow><mrow><mi>B</mi></mrow></msub></mrow></mfrac><mo>.</mo></mrow></mstyle></math> <p>Coupled with the “inverse” simulation of the tournament, as described in steps 1-3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match with light gray signalling approximately equal chances and green vs. purple signalling advantages for Team A or B, respectively.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_match.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_surv.png" 
alt="Line plot: Survival probabilities" /></a></p> <p>For example, this shows that Spain’s chances of reaching one of the quarterfinals are lower than England’s and France’s, but its chances of reaching one of the semifinals are higher. The reason is that Spain plays another of the six strongest teams in its group (Germany) but can likely avoid another of these six teams in the quarterfinal. Conversely, England and France do not have another of the six top teams in their group but most likely play one in their quarterfinals (Germany and Netherlands or Sweden, respectively).</p> <p>This effect of the tournament draw is also brought out by another display that highlights the likely flow of all teams through the tournament simultaneously. Compared to the survival curves shown above, this visualization brings out more clearly at which stages of the tournament the strong teams are most likely to meet.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_sankey.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_sankey.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_sankey.png" alt="Sankey diagram" /></a></p> <h2 id="odds-and-ends">Odds and ends</h2> <p>The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. It would also be possible to post-process the bookmaker consensus along with data from historic matches, player ratings, and other information about the teams using <a href="https://www.zeileis.org/news/euro2020/">machine learning techniques</a>. 
However, due to lack of time for more refined forecasts at the end of a busy academic year, at least the bookmaker consensus is provided as a solid basic forecast.</p> <p>As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a sizeable profit margin of about 20.1% which assures that the best chances of making money based on sports betting lie with them!</p> <p>In a few days we will start learning which of the probable paths through the tournament, shown above, will actually come true. Enjoy the UEFA Women’s Euro 2022!</p>2022-07-04T00:00:00+02:00https://www.zeileis.org/news/causal_forests/Model-based causal forests for heterogeneous treatment effects2022-07-02T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/A new arXiv paper investigates which building blocks of random forests, especially causal forests and model-based forests, make them work for heterogeneous treatment effect estimation, both in randomized trials and observational studies.<p>A new arXiv paper investigates which building blocks of random forests, especially causal forests and model-based forests, make them work for heterogeneous treatment effect estimation, both in randomized trials and observational studies.</p> <h3 id="citation">Citation</h3> <p>Susanne Dandl, Torsten Hothorn, Heidi Seibold, Erik Sverdrup, Stefan Wager, Achim Zeileis (2022). “What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?.” <em>arXiv.org E-Print Archive</em> arXiv:2206.10323 [stat.ME]. <a href="https://doi.org/10.48550/arXiv.2206.10323">doi:10.48550/arXiv.2206.10323</a></p> <h3 id="abstract">Abstract</h3> <p>Estimation of heterogeneous treatment effects (HTE) is of prime importance in many disciplines, ranging from personalized medicine to economics among many others. 
Random forests have been shown to be a flexible and powerful approach to HTE estimation in both randomized trials and observational studies. In particular, “causal forests”, introduced by <a href="https://doi.org/10.1214/18-aos1709">Athey, Tibshirani, and Wager (2019)</a>, along with the R implementation in package <a href="https://CRAN.R-project.org/package=grf"><em>grf</em></a>, were rapidly adopted. A related approach, called “model-based forests”, which is geared towards randomized trials and simultaneously captures effects of both prognostic and predictive variables, was introduced by <a href="https://doi.org/10.1177/0962280217693034">Seibold, Zeileis, and Hothorn (2018)</a> along with a modular implementation in the R package <a href="https://CRAN.R-project.org/package=model4you"><em>model4you</em></a>.</p> <p>Here, we present a unifying view that goes beyond the <em>theoretical</em> motivations and investigates which <em>computational</em> elements make causal forests so successful and how these can be blended with the strengths of model-based forests. To do so, we show that both methods can be understood in terms of the same parameters and model assumptions for an additive model under <em>L</em><sub>2</sub> loss. This theoretical insight allows us to implement several flavors of “model-based causal forests” and dissect their different elements <em>in silico</em>.</p> <p>The original causal forests and model-based forests are compared with the new blended versions in a benchmark study exploring both randomized trials and observational settings. In the randomized setting, both approaches performed similarly. If confounding was present in the data generating process, we found local centering of the treatment indicator with the corresponding propensities to be the main driver for good performance. Local centering of the outcome was less important, and might be replaced or enhanced by simultaneous split selection with respect to both prognostic and predictive effects. 
This lays the foundation for future research combining random forests for HTE estimation with other types of models.</p> <p>We demonstrate the practical aspects of such a model-agnostic approach to HTE estimation by analyzing the effect of cesarean section on postpartum blood loss in comparison to vaginal delivery. Clearly, randomization is hardly possible in this setup, and we present a tailored model-based forest for skewed and interval-censored data to infer possible predictive variables and their impact on the treatment effect.</p> <h3 id="benchmark-study">Benchmark study</h3> <p>To investigate which elements of the different random forest algorithms in causal forests (cf) vs. model-based forests (mob) contribute to more precise estimation of heterogeneous treatment effects, a large simulation experiment was carried out, using normal outcomes, different predictive and prognostic effects, and a varying number of observations (N) and covariates (P).</p> <p>In addition to the original cf (from <em>grf</em>) and mob (from <em>model4you</em>) algorithms, three blended versions (based on <em>model4you</em>) were assessed: mob(\(\widehat W\)) (model-based forests after centering of the treatment indicator), mob(\(\widehat W\), \(\widehat Y\)) (model-based forests after centering of both the treatment indicator and the outcome), and mobcf (model-based forests after centering of both the treatment indicator and the outcome, only testing for splits in the treatment effect).</p> <p>Four data-generation setups are considered, as proposed by Nie and Wager (2021): Setup A has complicated confounding but a relatively simple treatment effect function. Setup B has no confounding. Setup C has strong confounding but a constant treatment effect. 
In Setup D the treatment and control arms are completely unrelated.</p> <p>Overall, the results in the figure below show that centering of the treatment indicator as in mob(\(\widehat W\)) is the most relevant ingredient of random forests for HTE estimation in observational studies. If possible, additionally centering the outcome in combination with simultaneous estimation of predictive and prognostic effects in mob(\(\widehat W\), \(\widehat Y\)) is recommended, as it always performs as well as mob(\(\widehat W\)) and mobcf and may yield relevant improvements in some scenarios. Other technical aspects of tree and forest induction did not contribute to major performance differences. The overall strong performance of mob(\(\widehat W\), \(\widehat Y\)), combining centering of outcome and treatment from causal forests with joint estimation of prognostic and predictive effects, suggests that alternative split criteria sensitive to both intercepts and treatment effects might be able to improve the performance of causal forests.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig1.png"><img src="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig1.png" alt="Results for the experimental setups in Section 4.1 of the arXiv working paper. Direct comparison of the adaptive versions of causal forests, model-based forests without centering (mob), mob imitating causal forests (mobcf), mob with centered W (mob(W)) and additionally centered Y (mob(W, Y))." /></a></p> <p>For more details and more results see the <a href="https://doi.org/10.48550/arXiv.2206.10323">arXiv working paper</a>.</p> <h3 id="empirical-application">Empirical application</h3> <p>To illustrate how model-based causal forests can be tailored for specific situations, the effect of cesarean sections vs. vaginal deliveries (treatment) on the amount of postpartum blood loss (outcome) is investigated. 
Clearly, covariates like maternal age, birth weight, gestational age, or multifetal pregnancy potentially have an impact on both the treatment and the outcome. As randomizing the mode of delivery is impossible, methods for HTE estimation from observational data are needed. Moreover, blood loss is a skewed variable that is additionally impossible to measure exactly in the sometimes hectic environment of a delivery ward. It is hence treated as interval-censored. To accommodate all these features, a model-based causal forest is fitted using <code class="language-plaintext highlighter-rouge">pmforest()</code> from <em>model4you</em> in combination with:</p> <ul> <li>Centering of the treatment variable to account for the observational nature of the data.</li> <li>A transformation model (based on a Bernstein polynomial) to flexibly capture the skewness of the outcome variable.</li> <li>Interval censoring of the outcome observations.</li> </ul> <p>The dependency of the treatment effect on the prepartum variables is visualized in the figure below, using scatter plots for continuous covariates and boxplots for categorical covariates. While some variables have virtually no influence on the treatment effect (e.g., mother’s age), others are associated with clear effect differences. In particular, higher gestational age, higher neonatal weight, and the absence of multifetal pregnancy are associated with a higher risk of elevated blood loss due to cesarean section compared to vaginal delivery.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig5.png"><img src="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig5.png" alt="Dependency plots of the individual treatment effects calculated by the model-based transformation forest. Values > 0 mean that cesarean section increases the blood loss compared to vaginal delivery. Blue lines and diamond points depict (smooth conditional) mean effects." 
/></a></p> <p>For more details see the <a href="https://doi.org/10.48550/arXiv.2206.10323">arXiv working paper</a>.</p>2022-07-02T00:00:00+02:00https://www.zeileis.org/news/user2022/distributions3 @ useR! 20222022-06-27T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Conference presentation about the 'distributions3' package for S3 probability distributions (and 'topmodels' for graphical model assessment) at useR! 2022: Slides, video, replication code, and vignette.<p>Conference presentation about the 'distributions3' package for S3 probability distributions (and 'topmodels' for graphical model assessment) at useR! 2022: Slides, video, replication code, and vignette.</p> <h2 id="abstract">Abstract</h2> <p><em>(Authors: <a href="https://www.zeileis.org">Achim Zeileis</a>, <a href="https://moritzlang.org/">Moritz N. Lang</a>, <a href="https://www.alexpghayes.com/">Alex Hayes</a>)</em></p> <p>The <a href="https://alexpghayes.github.io/distributions3/">distributions3</a> package provides a beginner-friendly and lightweight interface to probability distributions. It allows users to create distribution objects in the S3 paradigm that are essentially data frames of parameters, for which standard methods are available: e.g., evaluation of the probability density, cumulative distribution, and quantile functions as well as random samples. It has been designed such that it can be employed in introductory statistics and probability courses. By not only providing objects for a single distribution but also for vectors of distributions, users can transition seamlessly to a representation of probabilistic forecasts from regression models such as GLM (generalized linear model), GAMLSS (generalized additive models for location, scale, and shape), etc. 
We show how the package can be used both in teaching and in applied statistical modeling, for interpreting fitted models and assessing their goodness of fit (“by hand” and via the <a href="https://topmodels.R-Forge.R-project.org/">topmodels</a> package).</p> <h2 id="resources">Resources</h2> <p>Links to: <a href="https://www.zeileis.org/papers/useR-2022.pdf">PDF slides</a>, <a href="https://www.youtube.com/watch?v=rs7ha1F5S0k">YouTube video</a>, <a href="https://www.zeileis.org/assets/posts/2022-06-27-user2022/code.R">R code</a>, <a href="https://www.zeileis.org/news/poisson/">vignette/blog post</a>.</p> <p><a href="https://www.zeileis.org/papers/useR-2022.pdf"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/slides.png" alt="PDF slides" /></a></p> <p><a href="https://www.youtube.com/watch?v=rs7ha1F5S0k"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/youtube.png" alt="YouTube video" /></a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-06-27-user2022/code.R"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/code.png" alt="R code" /></a></p> <p><a href="https://www.zeileis.org/news/poisson/"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/vignette.png" alt="vignette/blog post" /></a></p>2022-06-27T00:00:00+02:00https://www.zeileis.org/news/poisson/The Poisson distribution: From basic probability theory to regression models2022-06-23T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Brief introduction to the Poisson distribution for modeling count data using the distributions3 package. The distribution is illustrated using the number of goals scored at the 2018 FIFA World Cup, suitable for self-study or as a classroom exercise.<p>Brief introduction to the Poisson distribution for modeling count data using the distributions3 package. 
The distribution is illustrated using the number of goals scored at the 2018 FIFA World Cup, suitable for self-study or as a classroom exercise.</p> <h2 id="the-poisson-distribution">The Poisson distribution</h2> <p>The classic basic probability distribution employed for modeling count data is the Poisson distribution. Its probability mass function \(f(y; \lambda)\) yields the probability for a random variable \(Y\) to take a count \(y \in \{0, 1, 2, \dots\}\) based on the distribution parameter \(\lambda > 0\):</p> <p>\[\text{Pr}(Y = y) = f(y; \lambda) = \frac{\exp\left(-\lambda\right) \cdot \lambda^y}{y!}.\]</p> <p>The Poisson distribution has many distinctive features, e.g., its expectation and variance are equal, both given by the parameter \(\lambda\). Thus, \(\text{E}(Y) = \lambda\) and \(\text{Var}(Y) = \lambda\). Moreover, the Poisson distribution is related to other basic probability distributions. Namely, it can be obtained as the limit of the binomial distribution when the number of attempts is high and the success probability low. Also, the Poisson distribution can be approximated by a normal distribution when \(\lambda\) is large. See <a href="#Wiki+Poisson">Wikipedia (2022)</a> for further properties and references.</p> <p>Here, we leverage the <code class="language-plaintext highlighter-rouge">distributions3</code> package (<a href="#CRAN+distributions3">Hayes <em>et al.</em> 2022</a>) to work with the Poisson distribution in R. In <code class="language-plaintext highlighter-rouge">distributions3</code>, Poisson distribution objects can be generated with the <code class="language-plaintext highlighter-rouge">Poisson()</code> function. 
Subsequently, methods for generic functions can be used to print the objects; extract mean and variance; evaluate density, cumulative distribution, or quantile function; or simulate random samples.</p> <pre><code class="language-{r}">library("distributions3")
Y <- Poisson(lambda = 1.5)
print(Y)
## [1] "Poisson distribution (lambda = 1.5)"
mean(Y)
## [1] 1.5
variance(Y)
## [1] 1.5
pdf(Y, 0:5)
## [1] 0.22313 0.33470 0.25102 0.12551 0.04707 0.01412
cdf(Y, 0:5)
## [1] 0.2231 0.5578 0.8088 0.9344 0.9814 0.9955
quantile(Y, c(0.1, 0.5, 0.9))
## [1] 0 1 3
set.seed(0)
random(Y, 5)
## [1] 3 1 1 2 3
</code></pre> <p>Using the <code class="language-plaintext highlighter-rouge">plot()</code> method, the distribution can also be visualized, which we use here to show how the probabilities for the counts \(0, 1, \dots, 15\) change when the parameter is \(\lambda = 0.5, 2, 5, 10\).</p> <pre><code class="language-{r}">plot(Poisson(0.5), main = expression(lambda == 0.5), xlim = c(0, 15))
plot(Poisson(2), main = expression(lambda == 2), xlim = c(0, 15))
plot(Poisson(5), main = expression(lambda == 5), xlim = c(0, 15))
plot(Poisson(10), main = expression(lambda == 10), xlim = c(0, 15))
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2022-06-23-poisson/density.png"><img src="https://www.zeileis.org/assets/posts/2022-06-23-poisson/density.png" alt="Probability density for Poisson distributions with means 0.5, 2, 5, and 10" /></a></p> <p>In the following we will illustrate how this infrastructure can be leveraged to obtain predicted probabilities for the number of goals in soccer matches from the 2018 FIFA World Cup.</p> <h2 id="goals-in-the-2018-fifa-world-cup">Goals in the 2018 FIFA World Cup</h2> <p>To investigate the number of goals scored per match in the 2018 FIFA World Cup, the <code class="language-plaintext highlighter-rouge">FIFA2018</code> data set provides two rows, one for each team, for each of the 64 matches during the tournament. 
In the following, we treat the goals scored by the two teams in the same match as independent, which is a realistic assumption for this particular data set. We just remark briefly that there are also bivariate generalizations of the Poisson distribution that would allow for correlated observations but which are not considered here.</p> <p>In addition to the goals, the data set provides some basic meta-information for the matches (an ID, team name abbreviations, type of match, group vs. knockout stage) as well as some further covariates that we will revisit later in this document. The data looks like this:</p> <pre><code class="language-{r}">data("FIFA2018", package = "distributions3")
head(FIFA2018)
##   goals team match type stage logability difference
## 1     5  RUS     1    A group     0.1531     0.8638
## 2     0  KSA     1    A group    -0.7108    -0.8638
## 3     0  EGY     2    A group    -0.2066    -0.4438
## 4     1  URU     2    A group     0.2372     0.4438
## 5     3  RUS     3    A group     0.1531     0.3597
## 6     1  EGY     3    A group    -0.2066    -0.3597
</code></pre> <p>For now, we will focus on the <code class="language-plaintext highlighter-rouge">goals</code> variable only. A brief summary yields</p> <pre><code class="language-{r}">summary(FIFA2018$goals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     1.0     1.3     2.0     6.0 
</code></pre> <p>showing that the teams scored between \(0\) and \(6\) goals per match with an average of \(\bar y = 1.3\) from the observations \(y_i\) (\(i = 1, \dots, 128\)). The corresponding table of observed relative frequencies is:</p> <pre><code class="language-{r}">observed <- proportions(table(FIFA2018$goals))
observed
## 
##        0        1        2        3        4        5        6 
## 0.257812 0.375000 0.250000 0.078125 0.015625 0.015625 0.007812 
</code></pre> <p>This confirms that goals are relatively rare events in a soccer game with each team scoring zero to two goals per match in almost 90 percent of the matches. 
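</p> <p>The “almost 90 percent” can be verified directly from the table of observed relative frequencies shown above:</p>

```r
## Observed relative frequencies of goals per team/match (from above).
observed <- c("0" = 0.257812, "1" = 0.375000, "2" = 0.250000,
              "3" = 0.078125, "4" = 0.015625, "5" = 0.015625,
              "6" = 0.007812)

## Share of team/match observations with at most two goals.
sum(observed[c("0", "1", "2")])
## [1] 0.882812
```

<p>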
Below we show that this observed frequency distribution can be approximated very well by a Poisson distribution, which can subsequently be used to obtain predicted probabilities for the goals scored in a match.</p> <h2 id="basic-fitted-distribution">Basic fitted distribution</h2> <p>In a first step, we simply assume that goals are scored with a constant mean over all teams and matches and hence just fit a single Poisson distribution for the number of goals. To do so, we obtain a point estimate of the Poisson parameter by using the empirical mean \(\hat \lambda = \bar y = 1.3\) and set up the corresponding distribution object:</p> <pre><code class="language-{r}">p_const <- Poisson(lambda = mean(FIFA2018$goals))
p_const
## [1] "Poisson distribution (lambda = 1.3)"
</code></pre> <p>In the technical details below we show that this actually corresponds to maximum likelihood estimation for this distribution. It could also be fitted via <code class="language-plaintext highlighter-rouge">fit_mle(Poisson(1), FIFA2018$goals)</code> in <code class="language-plaintext highlighter-rouge">distributions3</code>.</p> <p>As already illustrated above, the expected probabilities of observing counts of \(0, 1, \dots, 6\) goals for this Poisson distribution can be extracted using the <code class="language-plaintext highlighter-rouge">pdf()</code> method. A comparison with the observed empirical frequencies yields</p> <pre><code class="language-{r}">expected <- pdf(p_const, 0:6)
cbind(observed, expected)
##   observed expected
## 0 0.257812 0.273385
## 1 0.375000 0.354546
## 2 0.250000 0.229901
## 3 0.078125 0.099384
## 4 0.015625 0.032222
## 5 0.015625 0.008358
## 6 0.007812 0.001806
</code></pre> <p>By and large, all observed and expected frequencies are rather close. However, it is not reasonable that all teams score goals with the same probabilities, which would imply that winning or losing could just be attributed to “luck” or “random variation” alone. 
Therefore, while a certain level of randomness will certainly remain, we should also consider that there are stronger and weaker teams in the tournament.</p> <h2 id="poisson-regression-and-probabilistic-forecasting">Poisson regression and probabilistic forecasting</h2> <p>To account for different expected performances from the teams in the 2018 FIFA World Cup, the <code class="language-plaintext highlighter-rouge">FIFA2018</code> data provides an estimated <code class="language-plaintext highlighter-rouge">logability</code> for each team. These have been estimated by <a href="#Zeileis+Leitner+Hornik:2018">Zeileis <em>et al.</em> (2018)</a> prior to the start of the tournament (2018-05-20) based on quoted odds from 26 online bookmakers using the bookmaker consensus model of <a href="#Leitner+Zeileis+Hornik:2010">Leitner <em>et al.</em> (2010)</a>. The <code class="language-plaintext highlighter-rouge">difference</code> in <code class="language-plaintext highlighter-rouge">logability</code> between a team and its opponent is a useful predictor for the number of <code class="language-plaintext highlighter-rouge">goals</code> scored.</p> <p>Consequently, we fit a generalized linear model (GLM) to the data that links the expected number of goals per team/match \(\lambda_i\) to the linear predictor \(x_i^\top \beta\) with regressor vector \(x_i^\top = (1, \mathtt{difference}_i)\) and corresponding coefficient vector \(\beta\) using a log-link: \(\log(\lambda_i) = x_i^\top \beta\). The maximum likelihood estimator \(\hat \beta\) with corresponding inference, predictions, residuals, etc. 
can be obtained using the <code class="language-plaintext highlighter-rouge">glm()</code> function from base R with <code class="language-plaintext highlighter-rouge">family = poisson</code>:</p> <pre><code class="language-{r}">m <- glm(goals ~ difference, data = FIFA2018, family = poisson)
summary(m)
## 
## Call:
## glm(formula = goals ~ difference, family = poisson, data = FIFA2018)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.144  -1.155  -0.175   0.528   2.327  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(&gt;|z|)    
## (Intercept)   0.2127     0.0813    2.62   0.0088 ** 
## difference    0.4134     0.1058    3.91  9.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 144.20 on 127 degrees of freedom
## Residual deviance: 128.69 on 126 degrees of freedom
## AIC: 359.4
## 
## Number of Fisher Scoring iterations: 5
</code></pre> <p>Both parameters can be interpreted. First, the intercept corresponds to the expected log-goals per team in a match of two equally strong teams, i.e., with zero difference in log-abilities. 
The corresponding prediction for the number of goals can either be obtained manually from the extracted <code class="language-plaintext highlighter-rouge">coef()</code> by applying <code class="language-plaintext highlighter-rouge">exp()</code> (as the inverse of the log-link).</p> <pre><code class="language-{r}">lambda_zero <- exp(coef(m)[1])
lambda_zero
## (Intercept) 
##       1.237 
</code></pre> <p>Or, equivalently, the <code class="language-plaintext highlighter-rouge">predict()</code> function can be used with <code class="language-plaintext highlighter-rouge">type = "response"</code> in order to get the expected \(\hat \lambda_i\) (rather than just the linear predictor \(x_i^\top \hat \beta\) that is predicted by default).</p> <pre><code class="language-{r}">predict(m, newdata = data.frame(difference = 0), type = "response")
##     1 
## 1.237 
</code></pre> <p>As above, we can also set up a <code class="language-plaintext highlighter-rouge">Poisson()</code> distribution object and obtain the associated expected probability distribution for zero to six goals in a match of two equally strong teams:</p> <pre><code class="language-{r}">p_zero <- Poisson(lambda = lambda_zero)
pdf(p_zero, 0:6)
## [1] 0.290242 0.359041 0.222074 0.091571 0.028319 0.007006 0.001445
</code></pre> <p>Note that <code class="language-plaintext highlighter-rouge">distributions3</code> also provides a convenience function <code class="language-plaintext highlighter-rouge">prodist()</code> that allows one to obtain <code class="language-plaintext highlighter-rouge">p_zero</code> in a single step via <code class="language-plaintext highlighter-rouge">prodist(m, newdata = data.frame(difference = 0))</code>.</p> <p>Second, the slope of \(0.413\) can be interpreted as an ability elasticity of the number of goals scored. This is because the difference of the log-abilities can also be understood as the log of the ability ratio. 
Thus, when the ability ratio increases by \(1\) percent, the expected number of goals increases approximately by \(0.413\) percent.</p> <p>This yields a different predicted Poisson distribution for each team/match in the tournament. We can set up the vector of all \(128\) <code class="language-plaintext highlighter-rouge">Poisson()</code> distribution objects by extracting the vector of all fitted point estimates \((\hat \lambda_1, \dots, \hat \lambda_{128})^\top\):</p> <pre><code class="language-{r}">p_reg <- Poisson(lambda = fitted(m))
length(p_reg)
## [1] 128
head(p_reg)
##                                       1                                       2 
## "Poisson distribution (lambda = 1.768)" "Poisson distribution (lambda = 0.866)" 
##                                       3                                       4 
## "Poisson distribution (lambda = 1.030)" "Poisson distribution (lambda = 1.486)" 
##                                       5                                       6 
## "Poisson distribution (lambda = 1.435)" "Poisson distribution (lambda = 1.066)" 
</code></pre> <p>Again, the convenience function <code class="language-plaintext highlighter-rouge">prodist(m)</code> could also be used to directly extract <code class="language-plaintext highlighter-rouge">p_reg</code>.</p> <p>Note that specific elements from the vector <code class="language-plaintext highlighter-rouge">p_reg</code> of Poisson distributions can be extracted as usual, e.g., with an index like <code class="language-plaintext highlighter-rouge">p_reg[i]</code> or using the <code class="language-plaintext highlighter-rouge">head()</code> and <code class="language-plaintext highlighter-rouge">tail()</code> functions etc.</p> <p>As an illustration, the following goal distributions could be expected for the FIFA World Cup final (in the last two rows of the data) that France won 4-2 against Croatia:</p> <pre><code class="language-{r}">tail(FIFA2018, 2)
##     goals team match  type    stage logability difference
## 127     4  FRA    64 Final knockout     0.8866      0.629
## 128     2  CRO    64 Final knockout     0.2576     -0.629
p_final <- tail(p_reg, 2)
p_final
##                                     127                                     128 
## "Poisson distribution (lambda = 1.604)" "Poisson distribution (lambda = 0.954)" 
pdf(p_final, 0:6)
##        d_0    d_1    d_2     d_3     d_4      d_5       d_6
## 127 0.2010 0.3225 0.2587 0.13836 0.05550 0.017808 0.0047618
## 128 0.3853 0.3675 0.1752 0.05572 0.01329 0.002534 0.0004029
</code></pre> <p>This shows that France was expected to score more goals than Croatia but both teams scored more goals than expected, albeit not improbably many.</p> <h2 id="further-details-and-extensions">Further details and extensions</h2> <p>Assuming independence of the number of goals scored, we can obtain the table of possible match results (after normal time) by multiplying the marginal probabilities (again only up to six goals). In R this can be done using the <code class="language-plaintext highlighter-rouge">outer()</code> function which by default performs a multiplication of its arguments.</p> <pre><code class="language-{r}">res <- outer(pdf(p_final[1], 0:6), pdf(p_final[2], 0:6))
round(100 * res, digits = 2)
##       [,1]  [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]  7.74  7.39 3.52 1.12 0.27 0.05 0.01
## [2,] 12.43 11.85 5.65 1.80 0.43 0.08 0.01
## [3,]  9.97  9.51 4.53 1.44 0.34 0.07 0.01
## [4,]  5.33  5.08 2.42 0.77 0.18 0.04 0.01
## [5,]  2.14  2.04 0.97 0.31 0.07 0.01 0.00
## [6,]  0.69  0.65 0.31 0.10 0.02 0.00 0.00
## [7,]  0.18  0.17 0.08 0.03 0.01 0.00 0.00
</code></pre> <p>For example, we can see from this table that the expected probability for France winning against Croatia 1-0 is \(12.43\) percent while the probability that France loses 0-1 is only \(7.39\) percent.</p> <p>The advantage of France can also be brought out more clearly by aggregating the probabilities for winning (lower triangular matrix), a draw (diagonal), or losing (upper triangular matrix). 
In R these can be computed as:</p> <pre><code class="language-{r}">sum(res[lower.tri(res)]) ## France wins
## [1] 0.5245
sum(diag(res))           ## draw
## [1] 0.2498
sum(res[upper.tri(res)]) ## France loses
## [1] 0.2243
</code></pre> <p>Note that these probabilities do not sum up to \(1\) because we only considered up to six goals per team but more goals can actually occur with a small probability.</p> <p>Next, we update the expected frequencies table by averaging across the expectations per team/match from the regression model.</p> <pre><code class="language-{r}">expected <- pdf(p_reg, 0:6)
head(expected)
##      d_0    d_1    d_2     d_3     d_4      d_5       d_6
## 1 0.1707 0.3017 0.2667 0.15721 0.06949 0.024571 0.0072403
## 2 0.4208 0.3642 0.1576 0.04548 0.00984 0.001703 0.0002457
## 3 0.3571 0.3677 0.1893 0.06498 0.01673 0.003444 0.0005911
## 4 0.2262 0.3362 0.2498 0.12377 0.04599 0.013669 0.0033857
## 5 0.2380 0.3417 0.2452 0.11732 0.04210 0.012086 0.0028914
## 6 0.3444 0.3671 0.1957 0.06954 0.01853 0.003952 0.0007022
expected <- colMeans(expected)
cbind(observed, expected)
##   observed expected
## 0 0.257812 0.294374
## 1 0.375000 0.340171
## 2 0.250000 0.214456
## 3 0.078125 0.098236
## 4 0.015625 0.036595
## 5 0.015625 0.011727
## 6 0.007812 0.003333
</code></pre> <p>As before, observed and expected frequencies are reasonably close, emphasizing that the model has a good marginal fit for this data. To bring out the discrepancies graphically we show the frequencies on a square root scale using a so-called <em>hanging rootogram</em> (<a href="#Kleiber+Zeileis:2016">Kleiber & Zeileis 2016</a>). The gray bars represent the square-root of the observed frequencies “hanging” from the square-root of the expected frequencies in the red line. 
The offset around the x-axis thus shows the difference between the two frequencies which is reasonably close to zero.</p> <pre><code class="language-{r}">bp <- barplot(sqrt(observed), offset = sqrt(expected) - sqrt(observed),
  xlab = "Goals", ylab = "sqrt(Frequency)")
lines(bp, sqrt(expected), type = "o", pch = 19, lwd = 2, col = 2)
abline(h = 0, lty = 2)
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2022-06-23-poisson/rootogram.png"><img src="https://www.zeileis.org/assets/posts/2022-06-23-poisson/rootogram.png" alt="Rootogram for the number of goals in the 2018 FIFA World Cup modeled by a Poisson regression model" /></a></p> <p>Finally, we want to point out that while the log-abilities (and thus their differences) had been obtained based on bookmakers' odds prior to the tournament, the calibration of the intercept and slope coefficients was done “in-sample”. This means that we have used the data from the tournament itself for estimating the GLM and the evaluation above can only be made <em>ex post</em>. Alternatively, one could have used previous FIFA World Cups for calibrating the coefficients so that probabilistic forecasts for the outcome of all matches (and thus the entire tournament) could have been obtained <em>ex ante</em>. This is the approach used by <a href="#Groll+Ley+Schauberger:2019">Groll <em>et al.</em> (2019)</a> and <a href="#Groll+Hvattum+Ley:2021">Groll <em>et al.</em> (2021)</a> who additionally added further explanatory variables and used flexible machine learning regression techniques rather than a simple Poisson GLM.</p> <h2 id="technical-details-maximum-likelihood-estimation-of-lambda">Technical details: Maximum likelihood estimation of \(\lambda\)</h2> <p>Fitting a single Poisson distribution with constant \(\lambda\) to \(n\) independent observations \(y_1, \dots, y_n\) using maximum likelihood estimation can be done analytically using basic algebra. 
First, we set up the log-likelihood function \(\ell\) as the sum of the log-densities per observation:</p> <p>\[\begin{align*} \ell(\lambda; y_1, \dots, y_n) & = \sum_{i = 1}^n \log f(y_i; \lambda) \end{align*}\]</p> <p>For solving the first-order condition analytically below we need the score function, i.e., the derivative of the log-likelihood with respect to the parameter \(\lambda\). The derivative of the sum is simply the sum of the derivatives:</p> <p>\[\begin{align*} \ell^\prime(\lambda; y_1, \dots, y_n) & = \sum_{i = 1}^n \left\{ \log f(y_i; \lambda) \right\}^\prime \\ & = \sum_{i = 1}^n \left\{ -\lambda + y_i \cdot \log(\lambda) - \log(y_i!) \right\}^\prime \\ & = \sum_{i = 1}^n \left\{ -1 + y_i \cdot \frac{1}{\lambda} \right\} \\ & = -n + \frac{1}{\lambda} \sum_{i = 1}^n y_i \end{align*}\]</p> <p>The first-order condition for maximizing the log-likelihood sets its derivative to zero. This can be solved as follows:</p> <p>\[\begin{align*} \ell^\prime(\lambda; y_1, \dots, y_n) & = 0 \\ -n + \frac{1}{\lambda} \sum_{i = 1}^n y_i & = 0 \\ n \cdot \lambda & = \sum_{i = 1}^n y_i \\ \lambda & = \frac{1}{n} \sum_{i = 1}^n y_i = \bar y \end{align*}\]</p> <p>Thus, the maximum likelihood estimator is simply the empirical mean \(\hat \lambda = \bar y.\)</p> <p>Unfortunately, when the parameter \(\lambda\) is not constant but depends on a linear predictor through a log link \(\log(\lambda_i) = x_i^\top \beta\), the corresponding log-likelihood of the regression coefficients \(\beta\) can not be maximized as easily. 
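Before turning to the regression case, the analytical result \(\hat \lambda = \bar y\) can also be double-checked numerically. The following small sketch is an illustration only (the simulated data, seed, and \(\lambda = 1.4\) are made up, not part of the original analysis): it fits an intercept-only Poisson GLM and compares the fitted mean with the empirical mean.

```r
## Illustration only: for an intercept-only Poisson GLM the fitted mean
## exp(beta_0) coincides with the empirical mean, i.e., the analytical MLE.
set.seed(1)
y <- rpois(500, lambda = 1.4)       # simulated counts (made-up example)
m0 <- glm(y ~ 1, family = poisson)  # constant lambda, log link
exp(coef(m0))                       # fitted mean
mean(y)                             # empirical mean, identical up to numerical precision
```

The same logic underlies the intercept in the regression model above: with a zero linear predictor, `exp()` of the intercept is the fitted expected count.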
There is no closed-form solution for the maximum likelihood estimator \(\hat \beta\) which is why the <code class="language-plaintext highlighter-rouge">glm()</code> function employs an iterative numerical algorithm (so-called iteratively weighted least squares) for fitting the model.</p> <h2 id="references">References</h2> <ul> <li><span id="Groll+Hvattum+Ley:2021">Groll A, Hvattum LM, Ley C, Popp F, Schauberger G, Van Eetvelde H, Zeileis A (2021). “Hybrid Machine Learning Forecasts for the UEFA EURO 2020.” arXiv 2106.05799. arXiv.org E-Print Archive. <a href="https://arxiv.org/abs/2106.05799">https://arxiv.org/abs/2106.05799</a></span></li> <li><span id="Groll+Ley+Schauberger:2019">Groll A, Ley C, Schauberger G, Van Eetvelde H (2019). “A Hybrid Random Forest to Predict Soccer Matches in International Tournaments.” <em>Journal of Quantitative Analysis in Sports</em> <strong>15</strong>(4), 271-87. <a href="https://doi.org/10.1515/jqas-2018-0060">https://doi.org/10.1515/jqas-2018-0060</a></span></li> <li><span id="CRAN+distributions3">Hayes A, Moller-Trane R, Jordan D, Northrop P, Lang M, Zeileis A (2022). “distributions3: Probability Distributions as S3 Objects.” R package version 0.2.0, <a href="https://CRAN.R-project.org/package=distributions3">https://CRAN.R-project.org/package=distributions3</a></span></li> <li><span id="Kleiber+Zeileis:2016">Kleiber C, Zeileis A (2016). “Visualizing Count Data Regressions Using Rootograms.” <em>The American Statistician</em> <strong>70</strong>(3), 296-303. <a href="https://doi.org/10.1080/00031305.2016.1173590">https://doi.org/10.1080/00031305.2016.1173590</a></span></li> <li><span id="Leitner+Zeileis+Hornik:2010">Leitner C, Zeileis A, Hornik K (2010). “Forecasting Sports Tournaments by Ratings of (Prob)abilities: A Comparison for the EURO 2008.” <em>International Journal of Forecasting</em> <strong>26</strong>(3), 471-81. 
<a href="https://doi.org/10.1016/j.ijforecast.2009.10.001">https://doi.org/10.1016/j.ijforecast.2009.10.001</a></span></li> <li><span id="Wiki+Poisson">Wikipedia (2022). “Poisson Distribution - Wikipedia, the Free Encyclopedia.” <a href="https://en.wikipedia.org/wiki/Poisson_distribution">https://en.wikipedia.org/wiki/Poisson_distribution</a>, accessed 2022-02-21.</span></li> <li><span id="Zeileis+Leitner+Hornik:2018">Zeileis A, Leitner C, Hornik K (2018). “Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model.” Working Paper 2018-09. Working Papers in Economics & Statistics, Research Platform Empirical & Experimental Economics, Universität Innsbruck. <a href="https://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09">https://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09</a></span></li> </ul> <p>Published 2022-06-23.</p> <h1 id="euro2020knockout">Updated forecasts for the UEFA Euro 2020 knockout stage</h1> <p>2021-06-25, Achim Zeileis, <a href="https://www.zeileis.org/news/euro2020knockout/">https://www.zeileis.org/news/euro2020knockout/</a></p> <p>After all group stage matches at the UEFA Euro 2020 we have updated the knockout stage forecasts by re-training our hybrid random forest model on the extended data. This shows that England profits most from the realized tournament draw.</p> <h2 id="updates">Updates</h2> <p>After the 36 matches of the group stage were completed earlier this week, we decided to update our <a href="https://www.zeileis.org/news/euro2020/">probabilistic forecast for the UEFA Euro 2020</a>. 
As the <a href="https://www.zeileis.org/news/euro2020group/">evaluation of the group stage</a> showed that, by and large, the forecasts worked reasonably well up to this point, we kept our general strategy and just made a few updates:</p> <ul> <li>The <em>historic match abilities</em> for all teams were updated to incorporate the results from the 36 additional matches from the group stage. Given that the estimates are weighted such that the most recent results have a higher influence, this changed the estimates of the team abilities somewhat.</li> <li>The <em>average plus-minus player ratings</em> for all teams were also updated but these changed to a lesser degree given that each team only played three additional matches.</li> <li>All other covariates (bookmaker consensus, market value, etc.) were left unchanged.</li> <li>The learning data set for the hybrid random forest that combines all the predictors was extended: In addition to all the matches from the UEFA Euro 2004-2016 it now includes the group stage results from this year’s Euro.</li> <li>The resulting predicted number of goals for each team can then be used to simulate the entire knockout stage 100,000 times.</li> </ul> <p>While all the changes above have a certain influence, the biggest effect arguably comes from the last item: Because the match-ups for the round of 16 are fixed now, there is a lot less variation in the potential courses of the tournament. Specifically, it is now clear that there are more top favorites in the upper half of the tournament tableau (namely France, Spain, Italy, Belgium, Portugal) than in the lower half of the tableau (England, Germany, Netherlands). 
The consequences of this are shown in more detail below.</p> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The updated results show that England has now become the top favorite for the title with a winning probability of 17.4% because they are more likely to face weaker opponents provided they beat Germany in the round of 16. Our top favorite from the pre-tournament forecast was France, who now rank second with an almost unchanged winning probability of about 15.0%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>Somewhat surprisingly, Italy still has a rather low winning probability of only 7.3% whereas they are now among the top three teams according to most bookmaker odds. This is most likely due to the tournament draw: If they beat Austria in the round of 16, they meet either the FIFA top-ranked team Belgium or defending champion Portugal in the quarter final. In a potential semi-final they would have a high chance of facing either France or Spain.</p> <h2 id="match-probabilities">Match probabilities</h2> <p>Using the hybrid random forest an expected number of goals is obtained for both teams in each possible match. Using these, we can compute the probability that a certain match ends in a <em>win</em>, a <em>draw</em>, or a <em>loss</em> in normal time. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.</p> <p>The resulting probability that one team beats the other in a knockout match is depicted in the heatmap below. 
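The computation just described (win/draw/loss in normal time, then overtime, then a coin flip for penalties) can be sketched in R. This is a simplified toy illustration, not the authors' actual code: the expected-goals values are made up, and scaling the overtime rate to one third of the normal-time rate is an assumption for illustration only.

```r
## Toy sketch (made-up numbers): probability that team A beats team B in a
## knockout match, combining normal time, overtime, and a penalty coin flip.
p_beat <- function(lambda_a, lambda_b, max_goals = 10) {
  ## joint goal distribution after normal time (independence assumption)
  res <- outer(dpois(0:max_goals, lambda_a), dpois(0:max_goals, lambda_b))
  p_win  <- sum(res[lower.tri(res)])  # A scores more goals
  p_draw <- sum(diag(res))            # equal goals -> overtime
  ## overtime: same logic with one third of the expected goals (assumed);
  ## a remaining draw is decided by a 50:50 coin flip for penalties
  ot <- outer(dpois(0:max_goals, lambda_a / 3), dpois(0:max_goals, lambda_b / 3))
  p_win + p_draw * (sum(ot[lower.tri(ot)]) + 0.5 * sum(diag(ot)))
}
p_beat(1.6, 1.1)  # hypothetical expected goals for a slight favorite A
```

By construction the function is symmetric, so `p_beat(x, x)` is (up to truncation at `max_goals`) exactly 0.5.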
The color scheme uses green vs. brown to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match results after normal time.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, we are able to simulate the entire knockout stage 100,000 times to provide “survival” probabilities for each team across the remaining stages. Teams in the upper half of the tournament tableau are shown in orange while the lower half teams are shown in blue.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <p>This shows that England has relatively low chances of surviving the round of 16, at least compared to other top teams like France, Italy, or the Netherlands who play against weaker opponents. However, provided England proceeds to the quarter final, they have a very high probability of prevailing up to the final match.</p> <p>In summary, the forecasts changed compared to the pre-tournament version, but maybe not as much as expected. 
The most important change in information is that the remaining course of the tournament is rather clear now while the knowledge from the 36 group stage matches themselves has only moderate effects. Thus, the most exciting part of the UEFA Euro 2020 is only starting now and we can all be curious what is going to happen. Everything is still possible! (Recall that in the 2016 tournament Portugal eventually took the championship despite not winning a single group stage match and ranking third in their group.)</p> <h1 id="euro2020group">Evaluation of the UEFA Euro 2020 group stage forecast</h1> <p>2021-06-24, Achim Zeileis, <a href="https://www.zeileis.org/news/euro2020group/">https://www.zeileis.org/news/euro2020group/</a></p> <p>A look back on the group stage of the UEFA Euro 2020 to check whether our hybrid machine learning forecasts were any good...</p> <h2 id="how-surprising-was-the-group-stage">How surprising was the group stage?</h2> <p>Yesterday the group stage of the UEFA Euro 2020 was concluded with the final matches in Groups E and F so that all pairings for the round of 16 are fixed now. Therefore, today we want to address two questions regarding our own <a href="https://www.zeileis.org/news/euro2020/">probabilistic forecast for the UEFA Euro 2020</a> based on a hybrid machine learning model that we published prior to the tournament:</p> <ol> <li>How good were the predictions for the group stage? Were the actual outcomes surprising?</li> <li>How can we update the forecasts for the knockout stage starting with the round of 16 on the weekend?</li> </ol> <p>The first of these questions is answered in this post while the second question will be deferred to tomorrow’s post.</p> <p><strong>TL;DR</strong> All of our predictions worked quite well and most results were within the expected range of random variation. 
All tournament favorites proceeded to the round of 16 and mostly the weakest teams dropped out of the tournament. Only in Group E the final ranking was a bit more surprising with Spain ending up second behind Sweden and Poland finishing last and dropping out. At the individual match level there were a couple of games where the clearly stronger team failed to take the win; especially Hungary’s two draws in the “killer group” F were a bit surprising. But other than that the more exciting part of the tournament is still ahead of us!</p> <h2 id="group-stage-results">Group stage results</h2> <p>First, we look at the results in terms of which teams successfully proceeded from the group stage to the round of 16. The barplot below shows all teams along with their predicted winning probability for the entire tournament, with the color highlighting elimination from the tournament prior to the knockout stage.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/barplot.png" alt="Probabilities to win the tournament with highlighting of teams advancing to the knockout stage" /></p> <p>Clearly, only teams from the lower half were eliminated with the most unexpected drop-out being Poland. Also, it may seem somewhat surprising that both the Czech Republic and Ukraine “survived” the group stage but with four out of six third-ranked teams advancing to the round of 16 this is not very unexpected.</p> <p>Looking at the rankings in each group in a bit more detail we see that most group results are as expected. Only in Group E the ranking is really a surprise with Sweden playing stronger than expected and even winning the group. 
On the other hand, Poland’s performance was somewhat disappointing (as already mentioned above) and Spain waited until the third game (a 5-0 win against Slovakia) to show their full potential.</p> <p>The tables below list, for each group, the final ranking along with each team’s predicted probability (in percent) to advance to the round of 16; teams that actually advanced are printed in bold.</p> <div class="row"> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">A <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>ITA</strong> <br /> <strong>WAL</strong> <br /> <strong>SUI</strong> <br /> TUR</td> <td style="text-align: right"><strong>88.8</strong> <br /> <strong>53.7</strong> <br /> <strong>72.3</strong> <br /> 53.3</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">B <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> 3 <br /> 4</td> <td style="text-align: left"><strong>BEL</strong> <br /> <strong>DEN</strong> <br /> FIN <br /> RUS</td> <td style="text-align: right"><strong>91.5</strong> <br /> <strong>84.5</strong> <br /> 37.1 <br /> 52.0</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">C <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>NED</strong> <br /> <strong>AUT</strong> <br /> <strong>UKR</strong> <br /> MKD</td> <td style="text-align: right"><strong>93.4</strong> <br /> 
<strong>80.9</strong> <br /> <strong>57.4</strong> <br /> 32.9</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">D <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>ENG</strong> <br /> <strong>CRO</strong> <br /> <strong>CZE</strong> <br /> SCO</td> <td style="text-align: right"><strong>94.6</strong> <br /> <strong>78.0</strong> <br /> <strong>40.8</strong> <br /> 49.8</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">E <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> 3 <br /> 4</td> <td style="text-align: left"><strong>SWE</strong> <br /> <strong>ESP</strong> <br /> SVK <br /> POL</td> <td style="text-align: right"><strong>59.8</strong> <br /> <strong>94.0</strong> <br /> 44.9 <br /> 66.2</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">F <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>FRA</strong> <br /> <strong>GER</strong> <br /> <strong>POR</strong> <br /> HUN</td> <td style="text-align: right"><strong>89.7</strong> <br /> <strong>85.3</strong> <br /> <strong>85.3</strong> <br /> 13.9</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 
large-2 columns"> </div> <div class="t20 small-6 medium-3 large-2 columns"> </div> </div> <h2 id="match-results">Match results</h2> <p>After seeing that all the favorites prevailed and only relatively weak teams dropped out of the tournament, we take a closer look at the 36 individual group-stage matches to check whether we had any major surprises. The stacked bar plot below groups all match results into four categories by their expected goal difference for the stronger vs. the weaker team.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/match.png" alt="Observed match outcome vs. expected goal difference" /></p> <p>In the first bar the stronger team was expected to be only marginally better, with 0 to 0.25 more predicted goals on average. In this bar we see that the stronger team won half of the matches (4 out of 8) while the other half was either lost (3 matches) or ended in a draw (1 match). In short, the distribution of match outcomes conforms essentially exactly with the predictions.</p> <p>The same is true for the second and third bar where the expected goal difference for the stronger team was between 0.26 and 0.6 or between 0.6 and 1, respectively. The stronger team won 7 out of 10 matches (70.0%) and 7 out of 9 matches (77.8%), respectively, thus conforming closely with the predictions.</p> <p>Only in the last bar with the highest expected goal differences (between 1 and 2 goals) is the picture somewhat unexpected.</p> <ol> <li>There were three draws (out of nine matches), two of which were achieved by underdog Hungary against the much stronger teams France and Germany. Ultimately, Hungary nevertheless finished last in Group F.</li> <li>One of these nine matches was even lost by the clear favorite but this match was the 0-1 of Denmark vs. Finland. During this match, Danish key player Christian Eriksen suffered a cardiac arrest and had to be resuscitated in the stadium before being brought to the hospital. 
Denmark then had to continue playing the match later that evening and were clearly still under shock. Needless to say, no forecasting model (that we are aware of) would incorporate such extreme and rare events.</li> </ol> <p>As a final evaluation we check whether the observed number of goals per team in each match conforms with the expected distribution based on the Poisson model employed. This is brought out graphically by a so-called <a href="https://dx.doi.org/10.1080/00031305.2016.1173590">hanging rootogram</a>.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/goals.png" alt="Hanging rootogram with observed and expected frequencies of number of goals" /></p> <p>The red line shows the square root of the expected frequencies while the “hanging” gray bars represent the square root of the observed frequencies. This shows that the predictions conform closely with the actual observations. There were only a few more occurrences of three goals (ten times) than expected (6.1 times) but this deviation is also within the bounds of random variation.</p> <h1 id="euro2020paper">Working paper for the UEFA Euro 2020 forecast</h1> <p>2021-06-10, Achim Zeileis, <a href="https://www.zeileis.org/news/euro2020paper/">https://www.zeileis.org/news/euro2020paper/</a></p> <p>A working paper describing the data and methods used for our probabilistic UEFA Euro 2020 forecast, published earlier this week, is available now. 
Additionally, details on the predicted performance of all teams during the group stage are provided.</p> <h2 id="overview">Overview</h2> <p>Earlier this week we published our <a href="https://www.zeileis.org/news/euro2020/">probabilistic UEFA Euro 2020 forecast</a> that combines the expertise of football modelers from four different research teams with the flexibility of machine learning. To explain which data and methods were used exactly, we have also written a <a href="https://arxiv.org/abs/2106.05799">working paper</a>, now published in the <a href="https://arxiv.org/">arXiv.org</a> e-Print archive.</p> <p>Moreover, we take the opportunity to provide further insights that can be obtained from our forecast for the results of the group stage, which starts at the end of this week with the opening match between Italy and Turkey in Rome in Group A. More precisely, predicted probabilities for a <em>win</em>, <em>draw</em>, or <em>loss</em> in each of the 36 group stage matches are provided in interactive heatmaps for all groups.</p> <h2 id="working-paper">Working paper</h2> <p><em>Citation:</em><br /> Groll A, Hvattum LM, Ley C, Popp F, Schauberger G, Van Eetvelde H, Zeileis A (2021). “Hybrid Machine Learning Forecasts for the UEFA EURO 2020.” arXiv:2106.05799, arXiv.org e-Print archive. <a href="https://arxiv.org/abs/2106.05799">https://arxiv.org/abs/2106.05799</a></p> <p><em>Abstract:</em><br /> Three state-of-the-art statistical ranking methods for forecasting football matches are combined with several other predictors in a hybrid machine learning model. Namely an ability estimate for every team based on historic matches; an ability estimate for every team based on bookmaker consensus; average plus-minus player ratings based on their individual performances in their home clubs and national teams; and further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). 
The proposed combined approach is used for learning the number of goals scored in the matches from the four previous UEFA EUROs 2004-2016 and then applied to current information to forecast the upcoming UEFA EURO 2020. Based on the resulting estimates, the tournament is simulated repeatedly and winning probabilities are obtained for all teams. A random forest model favors the current World Champion France with a winning probability of 14.8% before England (13.5%) and Spain (12.3%). Additionally, we provide survival probabilities for all teams and at all tournament stages.</p> <h2 id="predicted-match-probabilities-for-the-group-stage">Predicted match probabilities for the group stage</h2> <p>Using the hybrid random forest an expected number of goals is obtained for both teams in each possible match in the group stage. As there are typically more goals in the group stage compared to the knockout stage, a different expected number of goals is fitted for the two stages by including a corresponding binary dummy variable in the regression model. While the heatmap shown in our previous blog post contained the probabilities for all possible matches in the knockout stage, we complement this information here by showing different heatmaps for all groups.</p> <p>The color scheme visualizes the winning probability of the team in the row over the team in the column. Light red or orange vs. dark green or blue signals low vs. high winning probabilities. 
The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss.</p> <p>Interactive full-width graphics: <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_a.html">Group A</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_b.html">Group B</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_c.html">Group C</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_d.html">Group D</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_e.html">Group E</a>, <a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_f.html">Group F</a>.</p> <table> <thead> <tr> <th style="text-align: center">Group A</th> <th style="text-align: center">Group B</th> <th style="text-align: center">Group C</th> </tr> </thead> <tbody> <tr> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_a.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_a.png" alt="Heatmap: Match probabilities for Group A" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_b.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_b.png" alt="Heatmap: Match probabilities for Group B" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_c.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_c.png" alt="Heatmap: Match probabilities for Group C" /></a></td> </tr> </tbody> </table> <table> <thead> <tr> <th style="text-align: center">Group D</th> <th style="text-align: center">Group E</th> <th style="text-align: center">Group F</th> </tr> </thead> <tbody> <tr> <td style="text-align: 
center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_d.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_d.png" alt="Heatmap: Match probabilities for Group D" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_e.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_e.png" alt="Heatmap: Match probabilities for Group E" /></a></td> <td style="text-align: center"><a href="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_f.html"><img src="https://www.zeileis.org/assets/posts/2021-06-10-euro2020paper/p_match_f.png" alt="Heatmap: Match probabilities for Group F" /></a></td> </tr> </tbody> </table>2021-06-10T00:00:00+02:00https://www.zeileis.org/news/euro2020/Hybrid machine learning forecasts for the UEFA Euro 20202021-06-07T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Probabilistic forecasts for the UEFA Euro 2020 are obtained by using a hybrid model that combines data from four advanced statistical models through random forests. The favorite is France, followed by England and Spain.<p>Probabilistic forecasts for the UEFA Euro 2020 are obtained by using a hybrid model that combines data from four advanced statistical models through random forests. The favorite is France, followed by England and Spain.</p> <div class="row t20 b20"> <div class="small-8 medium-9 large-10 columns"> The UEFA Euro 2020 will finally take place across Europe from 11 June to 11 July 2021 (after a year of delay due to the Covid-19 pandemic). 24 of the best European teams compete to determine the new European Champion. Football fans worldwide are curious what the most likely outcome of the tournament is. 
Hence, we employ a machine learning approach yielding probabilistic forecasts for all possible matches which can then be used to explore the likely course of the tournament by simulation. </div> <div class="small-4 medium-3 large-2 columns"> <a href="https://www.uefa.com/uefaeuro-2020/" alt="UEFA Euro 2020 web page"><img src="https://upload.wikimedia.org/wikipedia/en/9/96/UEFA_Euro_2020_Logo.svg" alt="UEFA Euro 2020 logo" /></a> </div> </div> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The forecast is based on a conditional inference random forest learner that combines four main sources of information: An ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 19 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). The random forest model is learned using the UEFA Euro tournaments from 2004 to 2016 as training data and then applied to current information to obtain a forecast for the UEFA Euro 2020. The random forest forecasts actually provide the predicted number of goals for each team in all possible matches in the tournament so that a bivariate Poisson distribution can be used to compute the probabilities for a <em>win</em>, <em>draw</em>, or <em>loss</em> in such a match. Based on these match probabilities the entire tournament can be simulated 100,000 times yielding winning probabilities for each team. The results show that the current World Champion France is also the favorite for the European title with a winning probability of 14.8%, followed by England with 13.5%, and Spain with 12.3%. 
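</p> <p>To illustrate the last step, the following sketch derives <em>win</em>/<em>draw</em>/<em>loss</em> probabilities from given expected numbers of goals. It uses independent Poisson margins and made-up expected goals purely for illustration; the actual forecast employs a bivariate Poisson distribution with the expected goals predicted by the random forest.</p> <pre><code class="language-{r}">## illustrative expected goals for a single match (made-up values, not model output)
lambda_a <- 1.6
lambda_b <- 1.1
goals <- 0:10  ## truncating the Poisson support at 10 goals loses negligible mass
## joint probability that team A scores i goals and team B scores j goals,
## assuming independent Poisson margins (a simplification of the bivariate Poisson)
p <- outer(dpois(goals, lambda_a), dpois(goals, lambda_b))
c(win = sum(p[lower.tri(p)]), draw = sum(diag(p)), loss = sum(p[upper.tri(p)]))
</code></pre> <p>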
The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_win.html"><img src="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>The full study has been conducted by an international team of researchers: <a href="https://www.statistik.tu-dortmund.de/groll.html">Andreas Groll</a>, <a href="https://home.himolde.no/hvattum/">Lars Magnus Hvattum</a>, <a href="https://users.ugent.be/~chley/">Christophe Ley</a>, <a href="https://www.xing.com/profile/Franziska_Popp20">Franziska Popp</a>, <a href="https://www.sg.tum.de/epidemiologie/team/schauberger/">Gunther Schauberger</a>, <a href="https://biblio.ugent.be/person/2C617710-F0EE-11E1-A9DE-61C894A0A6B4">Hans Van Eetvelde</a>, <a href="https://www.zeileis.org/">Achim Zeileis</a>. The corresponding working paper will be published on arXiv in the next couple of days. The core of the contribution is a hybrid approach that starts out from four state-of-the-art forecasting methods, based on disparate sets of information, and lets an adaptive machine learning model decide how to best combine these forecasts.</p> <ul> <li> <p><em>Historic match abilities:</em><br /> An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A <em>bivariate Poisson model</em> with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain <em>average</em> team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. 
This assigns more weight to more recent results and thus yields an estimate of <em>current</em> team abilities. More details can be found in <a href="https://doi.org/10.1177%2F1471082X18817650">Ley, Van de Wiele, Van Eetvelde (2019)</a>.</p> </li> <li> <p><em>Bookmaker consensus abilities:</em><br /> Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 19 international bookmakers that reflect their expert expectations for the tournament. Using the <em>bookmaker consensus model</em> of <a href="https://dx.doi.org/10.1016/j.ijforecast.2009.10.001">Leitner, Zeileis, Hornik (2010)</a>, the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (which might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to result in these winning probabilities.</p> </li> <li> <p><em>Average player ratings:</em><br /> To infer the contributions of individual players in a match, the <em>plus-minus player ratings</em> of <a href="https://doi.org/10.2478/ijcss-2019-0001">Hvattum (2019)</a> dissect all matches with a certain player (at both club and national level) into segments, e.g., between substitutions. Subsequently, the goal difference achieved in these segments is linked to the presence of the individual players during that segment. This yields individual ratings for all players that can be aggregated to average player ratings for each team.</p> </li> <li> <p><em>Hybrid random forests:</em><br /> Finally, machine learning is used to combine the three highly aggregated and informative variables above with a broad range of further relevant covariates, yielding refined probabilistic forecasts for each match. 
Such a hybrid approach was first suggested by <a href="https://arXiv.org/abs/1806.03208">Groll, Ley, Schauberger, Van Eetvelde (2019)</a>. The task the random forest learner has to accomplish is to combine the three highly-informative team variables above with further team-specific information that may or may not be relevant to the team’s performance. The covariates considered comprise team-specific details (e.g., market value, FIFA rank, team structure) as well as country-specific socio-economic factors (population and GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.</p> </li> </ul> <h2 id="match-probabilities">Match probabilities</h2> <p>Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in average player ratings of the teams, etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a <em>win</em>, a <em>draw</em>, or a <em>loss</em>. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.</p> <p>The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. brown to signal probabilities above vs. below 50%, respectively. 
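</p> <p>The pairwise probabilities behind such a knockout heatmap can be sketched directly from the three match outcomes: a team proceeds if it wins in normal time, or draws and wins in overtime, or draws twice and wins the coin flip for the penalties. The numbers below are purely hypothetical and only illustrate the arithmetic, not actual model output.</p> <pre><code class="language-{r}">## hypothetical win/draw/loss probabilities for normal time and for overtime
## (illustrative values only, not output of the actual model)
p90 <- c(win = 0.45, draw = 0.27, loss = 0.28)
pot <- c(win = 0.40, draw = 0.35, loss = 0.25)  ## draws are more likely in only 30 minutes
## probability that team A proceeds: win in 90 minutes, or draw and win in
## overtime, or two draws and win the coin flip deciding the penalties
p_beat <- p90[["win"]] + p90[["draw"]] * (pot[["win"]] + pot[["draw"]] * 0.5)
p_beat  ## 0.60525
</code></pre> <p>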
The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a <em>win</em>, <em>draw</em>, or <em>loss</em> after normal time.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_match.html"><img src="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2021-06-07-euro2020/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <h2 id="odds-and-ends">Odds and ends</h2> <p>All our forecasts are probabilistic, clearly below 100%, and thus by no means certain. Especially the results in group F are hard to predict but may play a crucial role for the tournament. The reason is that this group comprises three very strong teams: current World Champion France, defending European Champion Portugal, and Germany, which generally has an excellent record at international tournaments. Moreover, the runner-up in this group will play against the winner of group D, which features favorite England. 
Hence, it is likely that this will lead to a very tough knockout match in the round of 16, possibly even between the two top favorites France and England, but it is hard to predict the exact pair of teams that will face each other in this match.</p> <p>Another interesting observation is that the winning probability for Belgium is only moderately high at 8.3%. This is notable as Belgium currently leads the FIFA/Coca-Cola World Ranking and is also judged to have a much higher winning probability by the bookmaker consensus model, namely 12.1%.</p> <p>In any case, all of this means that even though we can quantify in terms of probabilities what is likely to happen during the UEFA Euro 2020, it is far from being predetermined. Hence, we can all look forward to finally watching this exciting tournament and hope it will bring a little bit of the joy that we have been missing over this difficult last year.</p>2021-06-07T00:00:00+02:00https://www.zeileis.org/news/ivreg/ivreg: Two-stage least-squares regression with diagnostics2021-05-31T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/The ivreg function for instrumental variables regression was first introduced in the AER package but is now developed and extended in its own package of the same name. This post provides a short overview and illustration.<p>The ivreg function for instrumental variables regression was first introduced in the AER package but is now developed and extended in its own package of the same name. This post provides a short overview and illustration.</p> <h2 id="package-overview">Package overview</h2> <p>The <strong>ivreg</strong> package (by <a href="https://socialsciences.mcmaster.ca/jfox/">John Fox</a>, <a href="https://wwz.unibas.ch/en/kleiber/">Christian Kleiber</a>, and <a href="https://www.zeileis.org">Achim Zeileis</a>) provides a comprehensive implementation of instrumental variables regression using two-stage least-squares (2SLS) estimation. 
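</p> <p>The idea behind 2SLS can be sketched “by hand” on simulated data: the endogenous regressor is first projected onto the instrument, and the outcome is then regressed on the fitted values from that first stage. This is only an illustration of the estimator (all names and numbers below are made up); <code class="language-plaintext highlighter-rouge">ivreg()</code> should be used in practice because it additionally provides correct standard errors and diagnostics.</p> <pre><code class="language-{r}">## simulated data: u confounds x and y, z is a valid instrument for x
set.seed(1)
n <- 500
z <- rnorm(n)              ## instrument, independent of the confounder
u <- rnorm(n)              ## unobserved confounder
x <- z + u + rnorm(n)      ## endogenous regressor
y <- 2 * x + u + rnorm(n)  ## outcome, true coefficient 2
stage1 <- lm(x ~ z)                ## first stage: project x onto the instrument
stage2 <- lm(y ~ fitted(stage1))   ## second stage: regress y on the fitted values
coef(stage2)  ## slope close to 2, while coef(lm(y ~ x)) is biased upward
</code></pre> <p>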
The standard regression functionality (parameter estimation, inference, robust covariances, predictions, etc.) is derived from and supersedes the <code class="language-plaintext highlighter-rouge">ivreg()</code> function in the <a href="https://CRAN.R-project.org/package=AER"><strong>AER</strong></a> package. Additionally, various regression diagnostics are supported, including hat values, deletion diagnostics such as studentized residuals and Cook’s distances; graphical diagnostics such as component-plus-residual plots and added-variable plots; and effect plots with partial residuals.</p> <p>An overview of the package along with vignettes and detailed documentation etc. is available on its web site at <a href="https://john-d-fox.github.io/ivreg/">https://john-d-fox.github.io/ivreg/</a>. This post is an abbreviated version of the “Getting started” vignette.</p> <p>The <strong>ivreg</strong> package integrates seamlessly with other packages by providing suitable S3 methods, specifically for generic functions in the <a href="https://www.R-project.org/">base-R</a> <strong>stats</strong> package, and in the <a href="https://CRAN.R-project.org/package=car"><strong>car</strong></a>, <a href="https://CRAN.R-project.org/package=effects"><strong>effects</strong></a>, <a href="https://CRAN.R-project.org/package=lmtest"><strong>lmtest</strong></a>, and <a href="https://CRAN.R-project.org/package=sandwich"><strong>sandwich</strong></a> packages, among others. Moreover, it cooperates well with other object-oriented packages for regression modeling such as <a href="https://CRAN.R-project.org/package=broom"><strong>broom</strong></a> and <a href="https://CRAN.R-project.org/package=modelsummary"><strong>modelsummary</strong></a>.</p> <h2 id="illustration-returns-to-schooling">Illustration: Returns to schooling</h2> <p>For demonstrating the <strong>ivreg</strong> package in practice, we investigate the effect of schooling on earnings in a classical model for wage determination. 
The data are from the United States, and are provided in the package as <code class="language-plaintext highlighter-rouge">SchoolingReturns</code>. This data set was originally studied by David Card, and was subsequently employed, as here, to illustrate 2SLS estimation in introductory econometrics textbooks. The relevant variables for this illustration are:</p> <pre><code class="language-{r}">data("SchoolingReturns", package = "ivreg")
summary(SchoolingReturns[, 1:8])
##       wage          education       experience      ethnicity    smsa     
##  Min.   : 100.0   Min.   : 1.00   Min.   : 0.000   other:2307   no : 864  
##  1st Qu.: 394.2   1st Qu.:12.00   1st Qu.: 6.000   afam : 703   yes:2146  
##  Median : 537.5   Median :13.00   Median : 8.000                          
##  Mean   : 577.3   Mean   :13.26   Mean   : 8.856                          
##  3rd Qu.: 708.8   3rd Qu.:16.00   3rd Qu.:11.000                          
##  Max.   :2404.0   Max.   :18.00   Max.   :23.000                          
##  south          age         nearcollege
##  no :1795   Min.   :24.00   no : 957   
##  yes:1215   1st Qu.:25.00   yes:2053   
##             Median :28.00              
##             Mean   :28.12              
##             3rd Qu.:31.00              
##             Max.   :34.00  
</code></pre> <p>A standard wage equation uses a semi-logarithmic linear regression for <code class="language-plaintext highlighter-rouge">wage</code>, estimated by ordinary least squares (OLS), with years of <code class="language-plaintext highlighter-rouge">education</code> as the primary explanatory variable, adjusting for a quadratic term in labor-market <code class="language-plaintext highlighter-rouge">experience</code>, as well as for factors coding <code class="language-plaintext highlighter-rouge">ethnicity</code>, residence in a city (<code class="language-plaintext highlighter-rouge">smsa</code>), and residence in the U.S. 
<code class="language-plaintext highlighter-rouge">south</code>:</p> <pre><code class="language-{r}">m_ols <- lm(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south,
  data = SchoolingReturns)
summary(m_ols)
## Call:
## lm(formula = log(wage) ~ education + poly(experience, 2) + ethnicity + 
##     smsa + south, data = SchoolingReturns)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59297 -0.22315  0.01893  0.24223  1.33190 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.259820   0.048871 107.626  < 2e-16 ***
## education             0.074009   0.003505  21.113  < 2e-16 ***
## poly(experience, 2)1  8.931699   0.494804  18.051  < 2e-16 ***
## poly(experience, 2)2 -2.642043   0.374739  -7.050 2.21e-12 ***
## ethnicityafam        -0.189632   0.017627 -10.758  < 2e-16 ***
## smsayes               0.161423   0.015573  10.365  < 2e-16 ***
## southyes             -0.124862   0.015118  -8.259  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3742 on 3003 degrees of freedom
## Multiple R-squared: 0.2905, Adjusted R-squared: 0.2891 
## F-statistic: 204.9 on 6 and 3003 DF, p-value: < 2.2e-16 
</code></pre> <p>Thus, OLS estimation yields an estimate of 7.4% per year for returns to schooling. This estimate is problematic, however, because it can be argued that <code class="language-plaintext highlighter-rouge">education</code> is endogenous (and hence also <code class="language-plaintext highlighter-rouge">experience</code>, which is taken to be <code class="language-plaintext highlighter-rouge">age</code> minus <code class="language-plaintext highlighter-rouge">education</code> minus 6). We therefore use geographical proximity to a college when growing up as an exogenous instrument for <code class="language-plaintext highlighter-rouge">education</code>. 
Additionally, <code class="language-plaintext highlighter-rouge">age</code> is the natural exogenous instrument for <code class="language-plaintext highlighter-rouge">experience</code>, while the remaining explanatory variables can be considered exogenous and are thus used as instruments for themselves. Although it’s a useful strategy to select an effective instrument or instruments for each endogenous explanatory variable, in 2SLS regression all of the instrumental variables are used to estimate all of the regression coefficients in the model.</p> <p>To fit this model with <code class="language-plaintext highlighter-rouge">ivreg()</code> we can simply extend the formula from <code class="language-plaintext highlighter-rouge">lm()</code> above, adding a second part after the <code class="language-plaintext highlighter-rouge">|</code> separator to specify the instrumental variables:</p> <pre><code class="language-{r}">library("ivreg")
m_iv <- ivreg(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south |
  nearcollege + poly(age, 2) + ethnicity + smsa + south, data = SchoolingReturns)
</code></pre> <p>Equivalently, the same model can also be specified slightly more concisely using three parts on the right-hand side indicating the exogenous variables, the endogenous variables, and the additional instrumental variables only (in addition to the exogenous variables).</p> <pre><code class="language-{r}">m_iv <- ivreg(log(wage) ~ ethnicity + smsa + south | education + poly(experience, 2) |
  nearcollege + poly(age, 2), data = SchoolingReturns)
</code></pre> <p>Both models yield the following results:</p> <pre><code class="language-{r}">summary(m_iv)
## Call:
## ivreg(formula = log(wage) ~ education + poly(experience, 2) + 
##     ethnicity + smsa + south | nearcollege + poly(age, 2) + ethnicity + 
##     smsa + south, data = SchoolingReturns)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82400 -0.25248  0.02286  0.26349  1.31561 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.48522    0.67538   6.641 3.68e-11 ***
## education             0.13295    0.05138   2.588 0.009712 ** 
## poly(experience, 2)1  9.14172    0.56350  16.223  < 2e-16 ***
## poly(experience, 2)2 -0.93810    1.58024  -0.594 0.552797    
## ethnicityafam        -0.10314    0.07737  -1.333 0.182624    
## smsayes               0.10798    0.04974   2.171 0.030010 *  
## southyes             -0.09818    0.02876  -3.413 0.000651 ***
## 
## Diagnostic tests:
##                                         df1  df2 statistic  p-value    
## Weak instruments (education)              3 3003     8.008 2.58e-05 ***
## Weak instruments (poly(experience, 2)1)   3 3003  1612.707  < 2e-16 ***
## Weak instruments (poly(experience, 2)2)   3 3003   174.166  < 2e-16 ***
## Wu-Hausman                                2 3001     0.841    0.432    
## Sargan                                    0   NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4032 on 3003 degrees of freedom
## Multiple R-Squared: 0.1764, Adjusted R-squared: 0.1747 
## Wald test: 148.1 on 6 and 3003 DF, p-value: < 2.2e-16 
</code></pre> <p>Thus, using two-stage least squares to estimate the regression yields a much larger coefficient for the returns to schooling, namely 13.3% per year. Notice as well that the standard errors of the coefficients are larger for 2SLS estimation than for OLS, and that, partly as a consequence, evidence for the effects of <code class="language-plaintext highlighter-rouge">ethnicity</code> and the quadratic component of <code class="language-plaintext highlighter-rouge">experience</code> is now weak. 
These differences are brought out more clearly when showing coefficients and standard errors side by side, e.g., using the <code class="language-plaintext highlighter-rouge">compareCoefs()</code> function from the <strong>car</strong> package or the <code class="language-plaintext highlighter-rouge">msummary()</code> function from the <strong>modelsummary</strong> package:</p> <pre><code class="language-{r}">library("modelsummary") m_list <- list(OLS = m_ols, IV = m_iv) msummary(m_list) </code></pre> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: right">OLS</th> <th style="text-align: right">IV</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">(Intercept)</td> <td style="text-align: right">5.260 <br />(0.049)</td> <td style="text-align: right">4.485 <br />(0.675)</td> </tr> <tr> <td style="text-align: left">education</td> <td style="text-align: right">0.074 <br />(0.004)</td> <td style="text-align: right">0.133 <br />(0.051)</td> </tr> <tr> <td style="text-align: left">poly(experience, 2)1</td> <td style="text-align: right">8.932 <br />(0.495)</td> <td style="text-align: right">9.142 <br />(0.564)</td> </tr> <tr> <td style="text-align: left">poly(experience, 2)2</td> <td style="text-align: right">-2.642 <br />(0.375)</td> <td style="text-align: right">-0.938 <br />(1.580)</td> </tr> <tr> <td style="text-align: left">ethnicityafam</td> <td style="text-align: right">-0.190 <br />(0.018)</td> <td style="text-align: right">-0.103 <br />(0.077)</td> </tr> <tr> <td style="text-align: left">smsayes</td> <td style="text-align: right">0.161 <br />(0.016)</td> <td style="text-align: right">0.108 <br />(0.050)</td> </tr> <tr> <td style="text-align: left">southyes</td> <td style="text-align: right">-0.125 <br />(0.015)</td> <td style="text-align: right">-0.098 <br />(0.029)</td> </tr> <tr> <td style="text-align: left">Num.Obs.</td> <td style="text-align: right">3010</td> <td style="text-align: right">3010</td> </tr> <tr> <td 
style="text-align: left">R2</td> <td style="text-align: right">0.291</td> <td style="text-align: right">0.176</td> </tr> <tr> <td style="text-align: left">R2 Adj.</td> <td style="text-align: right">0.289</td> <td style="text-align: right">0.175</td> </tr> <tr> <td style="text-align: left">AIC</td> <td style="text-align: right">2633.4</td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">BIC</td> <td style="text-align: right">2681.5</td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">Log.Lik.</td> <td style="text-align: right">-1308.702</td> <td style="text-align: right"> </td> </tr> <tr> <td style="text-align: left">F</td> <td style="text-align: right">204.932</td> <td style="text-align: right"> </td> </tr> </tbody> </table> <p>The change in coefficients and associated standard errors can also be brought out graphically using the <code class="language-plaintext highlighter-rouge">modelplot()</code> function from <strong>modelsummary</strong> which shows the coefficient estimates along with their 95% confidence intervals. 
Below we omit the intercept and experience terms as these are on a different scale than the other coefficients.</p> <pre><code class="language-{r}">modelplot(m_list, coef_omit = "Intercept|experience") </code></pre> <p><a href="https://www.zeileis.org/assets/posts/2021-05-31-ivreg/modelplot.png"><img src="https://www.zeileis.org/assets/posts/2021-05-31-ivreg/modelplot.png" alt="Model plot of coefficients and confidence intervals" /></a></p>2021-05-31T00:00:00+02:00https://www.zeileis.org/news/networktree100/Network trees: networktree 1.0.0, web page, and Psychometrika paper2021-02-04T00:00:00+01:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Version 1.0.0 (and actually 1.0.1) of the R package 'networktree' with tools for recursively partitioning covariance structures is now available from CRAN, accompanied by a paper in Psychometrika, and a dedicated software web page.<p>Version 1.0.0 (and actually 1.0.1) of the R package 'networktree' with tools for recursively partitioning covariance structures is now available from CRAN, accompanied by a paper in Psychometrika, and a dedicated software web page.</p> <h2 id="psychometrika-paper">Psychometrika paper</h2> <ul> <li><em>Citation:</em> Jones PJ, Mair P, Simon T, Zeileis A (2020). “Network Trees: A Method for Recursively Partitioning Covariance Structures.” <em>Psychometrika</em>, <strong>85</strong>(4), 926-945. <a href="https://doi.org/10.1007/s11336-020-09731-4">doi:10.1007/s11336-020-09731-4</a>.</li> <li><em>Preprint version:</em> <a href="https://www.zeileis.org/papers/Jones+Mair+Simon-2020.pdf">https://www.zeileis.org/papers/Jones+Mair+Simon-2020.pdf</a></li> <li><em>OSF replication materials:</em> <a href="https://osf.io/ykq2a/">https://osf.io/ykq2a/</a></li> </ul> <h2 id="abstract">Abstract</h2> <p>In many areas of psychology, correlation-based network approaches (i.e., psychometric networks) have become a popular tool. 
In this paper, we propose an approach that recursively splits the sample based on covariates in order to detect significant differences in the structure of the covariance or correlation matrix. Psychometric networks or other correlation-based models (e.g., factor models) can be subsequently estimated from the resultant splits. We adapt model-based recursive partitioning and conditional inference tree approaches for finding covariate splits in a recursive manner. The empirical power of these approaches is studied in several simulation conditions. Examples are given using real-life data from personality and clinical research.</p> <h2 id="software--web-page">Software & web page</h2> <p>All methods discussed are implemented in the R package <code class="language-plaintext highlighter-rouge">networktree</code>, which is developed on GitHub, with stable versions released on CRAN (Comprehensive R Archive Network). Version 1.0.0 accompanies the publication in Psychometrika and version 1.0.1 adds a few small enhancements and bug fixes, specifically for the plotting infrastructure. Furthermore, a nice web page with introductory examples, documentation, release notes, etc. has been produced with the wonderful <code class="language-plaintext highlighter-rouge">pkgdown</code>.</p> <ul> <li><em>CRAN release:</em> <a href="https://CRAN.R-project.org/package=networktree">https://CRAN.R-project.org/package=networktree</a></li> <li><em>Web page:</em> <a href="https://paytonjjones.github.io/networktree/">https://paytonjjones.github.io/networktree/</a></li> </ul> <h2 id="illustration">Illustration</h2> <p>The idea of psychometric networks is to provide information about the statistical relationships between observed variables. Network trees aim to reveal heterogeneities in these relationships based on observed covariates. 
This strategy is implemented in the R package <code class="language-plaintext highlighter-rouge">networktree</code> building on the general tree algorithms in the <code class="language-plaintext highlighter-rouge">partykit</code> package.</p> <p>For illustration, we consider a depression network - where the nodes represent different symptoms - and detect heterogeneities with respect to age and race. The data used below is provided by <a href="https://openpsychometrics.org/">https://openpsychometrics.org/</a> and was obtained using the Depression Anxiety and Stress Scale (DASS), a self-report instrument for measuring depression, anxiety, and tension or stress. It is available in the <code class="language-plaintext highlighter-rouge">networktree</code> package as <code class="language-plaintext highlighter-rouge">dass</code>. To make resulting graphics and summaries easier to interpret we use the following variable names for the depression symptoms that are measured with certain questions from the DASS:</p> <ul> <li><code class="language-plaintext highlighter-rouge">anhedonia</code> (Question 3: I couldn’t seem to experience any positive feeling at all.)</li> <li><code class="language-plaintext highlighter-rouge">initiative</code> (Question 42: I found it difficult to work up the initiative to do things.)</li> <li><code class="language-plaintext highlighter-rouge">lookforward</code> (Question 10: I felt that I had nothing to look forward to.)</li> <li><code class="language-plaintext highlighter-rouge">sad</code> (Question 13: I felt sad and depressed.)</li> <li><code class="language-plaintext highlighter-rouge">unenthused</code> (Question 31: I was unable to become enthusiastic about anything.)</li> <li><code class="language-plaintext highlighter-rouge">worthless</code> (Question 17: I felt I wasn’t worth much as a person.)</li> <li><code class="language-plaintext highlighter-rouge">meaningless</code> (Question 38: I felt that life was meaningless.)</li> </ul> 
<p>First, we load the data and relabel the variables for the depression symptoms:</p> <pre><code class="language-{r}">library("networktree")
data("dass", package = "networktree")
names(dass)[c(3, 42, 10, 13, 31, 17, 38)] &lt;- c("anhedonia",
  "initiative", "lookforward", "sad", "unenthused", "worthless",
  "meaningless")
</code></pre> <p>Subsequently, we fit a <code class="language-plaintext highlighter-rouge">networktree()</code> where the relationship between the symptoms (<code class="language-plaintext highlighter-rouge">anhedonia + initiative + lookforward + sad + unenthused + worthless + meaningless</code>) is “explained by” (<code class="language-plaintext highlighter-rouge">~</code>) the covariates (<code class="language-plaintext highlighter-rouge">age + race</code>). (As an alternative to this formula-based interface it is also possible to specify groups of dependent and split variables, respectively, through separate data frames.) The significance level for detecting differences in correlations is set to 1% (plus Bonferroni adjustment for testing two covariates at each step).</p> <pre><code class="language-{r}">tr &lt;- networktree(anhedonia + initiative + lookforward + sad +
  unenthused + worthless + meaningless ~ age + race,
  data = dass, alpha = 0.01)
</code></pre> <p>The resulting network tree can be easily visualized with <code class="language-plaintext highlighter-rouge">plot(tr)</code>, which would display the raw correlations. As these are generally high between all depression symptoms, we use a display with partial correlations (<code class="language-plaintext highlighter-rouge">transform = "pcor"</code>) instead. This brings out differences between the detected subgroups somewhat more clearly. 
<em>(Note that version 1.0.1 of networktree is needed for this to work correctly.)</em></p> <pre><code class="language-{r}">plot(tr, transform = "pcor")
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2021-02-04-networktree100/dasstree.png"><img src="https://www.zeileis.org/assets/posts/2021-02-04-networktree100/dasstree.png" alt="Depression network tree" /></a></p> <p>This shows that the network tree detects three subgroups. First, the correlations of the depression symptoms change across <code class="language-plaintext highlighter-rouge">age</code>, with the largest difference between “younger” and “older” persons in the sample at a split point of 30 years. Second, the correlations differ with respect to <code class="language-plaintext highlighter-rouge">race</code> for the older persons in the sample, with the largest difference between Arab/Black/Native American/White and Asian/Other. The differences in the symptom correlations affect various pairs of symptoms as brought out in the network display produced by the <a href="https://CRAN.R-project.org/package=qgraph">qgraph</a> package in the terminal nodes. For example, the “centrality” of <code class="language-plaintext highlighter-rouge">anhedonia</code> changes across the three detected subgroups: For the older Asian/Other persons it is partially correlated with most other symptoms, while this is less pronounced for the other two subgroups.</p> <p>The networks visualized in the tree can also be extracted easily using the <code class="language-plaintext highlighter-rouge">getnetwork()</code> function. 
For example, the partial correlation matrix corresponding to the older Asian/Other group (node 5) can be obtained by:</p> <pre><code class="language-{r}">getnetwork(tr, id = 5, transform = "pcor")
</code></pre> <p>To explore the returned object <code class="language-plaintext highlighter-rouge">tr</code> in some more detail, the <code class="language-plaintext highlighter-rouge">print()</code> method gives a printed version of the tree structure but does not display the associated parameters.</p> <pre><code class="language-{r}">tr
## Network tree object
## 
## Model formula:
## anhedonia + initiative + lookforward + sad + unenthused + worthless + 
##     meaningless ~ age + race
## 
## Fitted party:
## [1] root
## |   [2] age &lt;= 30
## |   [3] age &gt; 30
## |   |   [4] race in Arab, Black, Native American, White
## |   |   [5] race in Asian, Other
## 
## Number of inner nodes:    2
## Number of terminal nodes: 3
## Number of parameters per node: 21
## Objective function: 42301.84
</code></pre> <p>The estimated correlation parameters in the subgroups can be extracted with <code class="language-plaintext highlighter-rouge">coef(tr)</code>, here returning a 3 x 21 matrix for the 21 pairs of symptom correlations and the 3 subgroups. To show two symptom pairs with larger correlation differences we extract the correlations of <code class="language-plaintext highlighter-rouge">anhedonia</code> with <code class="language-plaintext highlighter-rouge">worthless</code> and <code class="language-plaintext highlighter-rouge">meaningless</code>, respectively. Note that these are the raw correlations and not the partial correlations displayed in the tree above.</p> <pre><code class="language-{r}">coef(tr)[, 5:6]
##   rho_anhedonia_worthless rho_anhedonia_meaningless
## 2               0.5595725                 0.5994682
## 4               0.6741686                 0.6339481
## 5               0.6639088                 0.7178744
</code></pre> <p>Finally, we extract the p-values of the underlying parameter instability tests to gain some insight into how the tree was constructed. 
In each step we assess whether the correlation parameters are stable across each of the two covariates <code class="language-plaintext highlighter-rouge">age</code> and <code class="language-plaintext highlighter-rouge">race</code> or whether there are significant changes. The corresponding test statistics and Bonferroni-adjusted p-values can be extracted with the <code class="language-plaintext highlighter-rouge">sctest()</code> function (for “structural change test”). For example, in Node 1 there are significant instabilities with respect to both variables, but <code class="language-plaintext highlighter-rouge">age</code> has the lower p-value and is hence selected for partitioning the data:</p> <pre><code class="language-{r}">library("strucchange")
sctest(tr, node = 1)
##                    age         race
## statistic 7.151935e+01 1.781216e+02
## p.value   1.787983e-05 3.108049e-03
</code></pre> <p>In Node 3 only <code class="language-plaintext highlighter-rouge">race</code> is significant and hence used for splitting:</p> <pre><code class="language-{r}">sctest(tr, node = 3)
##                  age         race
## statistic 42.9352852 1.728898e+02
## p.value    0.1447818 6.766197e-05
</code></pre> <p>And in Node 5 neither variable is significant and hence the splitting stops:</p> <pre><code class="language-{r}">sctest(tr, node = 5)
##                  age     race
## statistic 35.1919522 22.09555
## p.value    0.5514142  0.63279
</code></pre> <p>For more details regarding the method and the software see the Psychometrika paper and the software web page, respectively.</p>2021-02-04T00:00:00+01:00