<p>Research homepage of Achim Zeileis, Universität Innsbruck. <br/>Department of Statistics, Faculty of Economics and Statistics. <br/>Universitätsstr. 15, 6020 Innsbruck, Austria. <br/>Tel: +43/512/507-70403</p> <h1 id="simulate_cvd">Color vision deficiency emulation fixed in colorspace 2.1-0</h1> <p><em>Achim Zeileis, 2023-05-08</em></p> <p>The color vision deficiency emulation provided by R package colorspace was inaccurate for some highly-saturated colors due to a bug that was fixed in version 2.1-0. The (typically small) differences are illustrated for a range of palettes.</p> <h2 id="background">Background</h2> <p>Functions for emulating <a href="https://en.wikipedia.org/wiki/Color_blindness">color vision deficiencies</a> have been part of the R package <a href="http://colorspace.R-Forge.R-project.org/">colorspace</a> for several years now (since the release of version 1.4-0 in January 2019).
They are crucial for assessing how well data visualizations work for viewers affected by color vision deficiencies (about 8% of all males and 0.5% of all females) and for illustrating problems with <a href="http://colorspace.R-Forge.R-project.org/articles/endrainbow.html">poor color choices</a>.</p> <p>The <code class="language-plaintext highlighter-rouge">colorspace</code> package implements the physiologically-based model of <a href="https://doi.org/10.1109/TVCG.2009.113">Machado, Oliveira, and Fernandes (2009)</a>, who provide a unified approach to various forms of deficiencies, in particular encompassing deuteranomaly (green cone cells defective), protanomaly (red cone cells defective), and tritanomaly (blue cone cells defective). See the <a href="http://colorspace.R-Forge.R-project.org/articles/color_vision_deficiency.html">corresponding package vignette</a> for more details.</p> <h2 id="bug-and-fix">Bug and fix</h2> <p>Recently, an inaccuracy in the <code class="language-plaintext highlighter-rouge">colorspace</code> implementation of the Machado <em>et al.</em> method was reported by Matthew Petroff and fixed in <code class="language-plaintext highlighter-rouge">colorspace</code> 2.1-0 (released earlier this year) with some advice and guidance from Kenneth Knoblauch.</p> <p>More specifically, Machado <em>et al.</em> provide linear transformations of RGB (red-green-blue) coordinates that simulate the different color vision deficiencies. Following some illustrations from the supplementary materials of Machado <em>et al.</em>, earlier versions of the <code class="language-plaintext highlighter-rouge">colorspace</code> package had applied the transformations directly to gamma-corrected sRGB coordinates that can be obtained from color hex codes. However, the paper implicitly relies on a linear RGB space (see page 1294, column 1) where the linear matrix transformations for simulating color vision deficiencies should be applied. 
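</p> <p>The difference between the two strategies can be sketched in a few lines of R. The companding functions below are the standard sRGB formulas; <code class="language-plaintext highlighter-rouge">M</code> stands for one of the 3x3 Machado <em>et al.</em> transformation matrices and is not spelled out here (a sketch only, not the package's internal code):</p> <pre><code class="language-{r}">## standard sRGB companding and its inverse
srgb_to_linear <- function(s) ifelse(s <= 0.04045, s / 12.92, ((s + 0.055) / 1.055)^2.4)
linear_to_srgb <- function(l) ifelse(l <= 0.0031308, 12.92 * l, 1.055 * l^(1 / 2.4) - 0.055)

rgb <- col2rgb("#E16A86") / 255  ## gamma-corrected sRGB coordinates from a hex code
## old strategy:   M %*% rgb
## fixed strategy: linear_to_srgb(M %*% srgb_to_linear(rgb))
</code></pre> <p>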
Therefore, a new argument <code class="language-plaintext highlighter-rouge">linear = TRUE</code> has been added to <code class="language-plaintext highlighter-rouge">simulate_cvd()</code> (and hence to <code class="language-plaintext highlighter-rouge">deutan()</code>, <code class="language-plaintext highlighter-rouge">protan()</code>, and <code class="language-plaintext highlighter-rouge">tritan()</code>) that first maps the provided colors to linearized RGB coordinates, applies the color vision deficiency transformation, and then maps back to gamma-corrected sRGB coordinates. Optionally, <code class="language-plaintext highlighter-rouge">linear = FALSE</code> can be used to restore the behavior from previous versions where the transformations are applied directly to the sRGB coordinates.</p> <h2 id="illustration">Illustration</h2> <p>For most colors the difference between the two strategies (in linear vs. gamma-corrected RGB coordinates) is negligible but for some highly-saturated colors it becomes more noticeable, e.g., for red, purple, or orange.</p> <p>To illustrate this we set up a small convenience function <code class="language-plaintext highlighter-rouge">cvd_compare()</code> that contrasts both approaches for all three types of color vision deficiencies using the <a href="http://colorspace.R-Forge.R-project.org/reference/swatchplot.html">swatchplot()</a> function from <code class="language-plaintext highlighter-rouge">colorspace</code>.</p> <pre><code class="language-{r}">cvd_compare <- function(pal) {
  x <- list(
    "Original" = rbind(pal),
    "Deutan" = rbind(
      "linear = TRUE " = colorspace::deutan(pal, linear = TRUE),
      "linear = FALSE" = colorspace::deutan(pal, linear = FALSE)
    ),
    "Protan" = rbind(
      "linear = TRUE " = colorspace::protan(pal, linear = TRUE),
      "linear = FALSE" = colorspace::protan(pal, linear = FALSE)
    ),
    "Tritan" = rbind(
      "linear = TRUE " = colorspace::tritan(pal, linear = TRUE),
      "linear = FALSE" = colorspace::tritan(pal, linear = FALSE)
    )
  )
  rownames(x$Original) <- deparse(substitute(pal))
  colorspace::swatchplot(x)
}
</code></pre> <p>Subsequently, we apply this function to a selection of <a href="https://www.zeileis.org/news/coloring/">new base R palettes</a> that have been available since R 4.0.0 in functions <code class="language-plaintext highlighter-rouge">palette.colors()</code> and <code class="language-plaintext highlighter-rouge">hcl.colors()</code>. First, it is shown that for many palettes the two strategies lead to almost equivalent output: e.g., for the default qualitative palette in <code class="language-plaintext highlighter-rouge">palette.colors()</code>, Okabe-Ito (excluding black and gray), and the default sequential palette in <code class="language-plaintext highlighter-rouge">hcl.colors()</code>, Viridis.</p> <pre><code class="language-{r}">cvd_compare(palette.colors()[2:8])
cvd_compare(hcl.colors(7))
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_okabeito.svg"><img src="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_okabeito.svg" alt="Comparison of color vision deficiency emulations for Okabe-Ito palette" /></a> <a href="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_viridis.svg"><img src="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_viridis.svg" alt="Comparison of color vision deficiency emulations for Viridis palette" /></a></p> <p>The comparison shows that both emulations lead to very similar output, bringing out clearly that both palettes are rather robust under color vision deficiencies.</p> <p>However, for palettes with more flashy colors (especially highly-saturated red, purple, or orange) the differences may be noticeable and practically relevant. 
This is illustrated using two sequential HCL palettes, PuRd (inspired by ColorBrewer.org) and Rocket (from the Viridis family):</p> <pre><code class="language-{r}">cvd_compare(hcl.colors(7, "PuRd"))
cvd_compare(hcl.colors(7, "Rocket"))
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_purd.svg"><img src="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_purd.svg" alt="Comparison of color vision deficiency emulations for PuRd palette" /></a> <a href="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_rocket.svg"><img src="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_rocket.svg" alt="Comparison of color vision deficiency emulations for Rocket palette" /></a></p> <p>The comparison shows that the emulation differs in particular for colors 2, 3, and 4 in both palettes, leading to slightly different insights regarding the properties of the palettes.</p> <p>The differences can become even more pronounced for fully-saturated colors like those in the infamous rainbow palette, shown below.</p> <pre><code class="language-{r}">cvd_compare(rainbow(7))
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_rainbow.svg"><img src="https://www.zeileis.org/assets/posts/2023-05-08-simulate_cvd/cvd_compare_rainbow.svg" alt="Comparison of color vision deficiency emulations for rainbow palette" /></a></p> <p>Luckily, for palettes with better perceptual properties, the differences between the old erroneous version and the new fixed one are typically rather small. 
Hence, we hope that the bug did not affect prior work too much and that the fixed version is even more useful for all users of the package.</p> <h1 id="coloring">Coloring in R's blind spot</h1> <p><em>Achim Zeileis, 2023-05-05</em></p> <p>New arXiv working paper on the new color palette functions palette.colors() and hcl.colors() in base R since version 4.0.0.</p> <h2 id="citation">Citation</h2> <p>Achim Zeileis, Paul Murrell (2023). “Coloring in R’s Blind Spot.” <em>arXiv.org E-Print Archive</em> arXiv:2303.04918 [stat.CO]. <a href="https://doi.org/10.48550/arXiv.2303.04918">doi:10.48550/arXiv.2303.04918</a></p> <h2 id="abstract">Abstract</h2> <p>Prior to version 4.0.0 R had a poor default color palette (using highly saturated red, green, blue, etc.) and provided very few alternative palettes, most of which also had poor perceptual properties (like the infamous rainbow palette). Starting with version 4.0.0 R gained a new and much improved default palette and, in addition, a selection of more than 100 well-established palettes are now available via the functions <code class="language-plaintext highlighter-rouge">palette.colors()</code> and <code class="language-plaintext highlighter-rouge">hcl.colors()</code>. The former provides a range of popular qualitative palettes for categorical data while the latter closely approximates many popular sequential and diverging palettes by systematically varying the perceptual hue, chroma, luminance (HCL) properties in the palette. 
This paper provides an overview of these new color functions and the palettes they provide along with advice about which palettes are appropriate for specific tasks, especially with regard to making them accessible to viewers with color vision deficiencies.</p> <h2 id="software">Software</h2> <p>Package <code class="language-plaintext highlighter-rouge">grDevices</code> in base <a href="https://www.R-project.org/">R</a> has provided <code class="language-plaintext highlighter-rouge">palette.colors()</code> and <code class="language-plaintext highlighter-rouge">hcl.colors()</code> and accompanying functionality since version 4.0.0.</p> <p>Package <code class="language-plaintext highlighter-rouge">colorspace</code> (<a href="https://CRAN.R-project.org/package=colorspace">CRAN</a>, <a href="https://colorspace.R-Forge.R-project.org/">Web page</a>) provides color vision deficiency emulation along with many other color tools. See also below for the recent bug fix in color vision deficiency emulation.</p> <p>Replication code: <a href="https://www.zeileis.org/assets/posts/2023-05-05-coloring/coloring.R">coloring.R</a>, <a href="https://www.zeileis.org/assets/posts/2023-05-05-coloring/paletteGrid.R">paletteGrid.R</a></p> <h2 id="highlights">Highlights</h2> <p>The table below provides an overview of the new base R palette functionality: For each main type of palette, the <em>Purpose</em> row describes what sort of data the type of palette is appropriate for, the <em>Generate</em> row gives the functions that can be used to generate palettes of that type, the <em>List</em> row names the functions that can be used to list available palettes, and the <em>Robust</em> row identifies two or three good default palettes of that type.</p> <table> <thead> <tr> <th style="text-align: left"> </th> <th style="text-align: left">Qualitative</th> <th style="text-align: left">Sequential</th> <th style="text-align: left">Diverging</th> </tr> </thead> <tbody> <tr> <td style="text-align: 
left"><em>Purpose</em></td> <td style="text-align: left">Categorical data</td> <td style="text-align: left">Ordered or numeric data<br />(high → low)</td> <td style="text-align: left">Ordered or numeric with central value<br />(high ← neutral → low)</td> </tr> <tr> <td style="text-align: left"><em>Generate</em></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">palette.colors()</code>,<br /><code class="language-plaintext highlighter-rouge">hcl.colors()</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">hcl.colors()</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">hcl.colors()</code></td> </tr> <tr> <td style="text-align: left"><em>List</em></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">palette.pals()</code>,<br /><code class="language-plaintext highlighter-rouge">hcl.pals("qualitative")</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">hcl.pals("sequential")</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">hcl.pals("diverging")</code>,<br /><code class="language-plaintext highlighter-rouge">hcl.pals("divergingx")</code></td> </tr> <tr> <td style="text-align: left"><em>Robust</em></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">"Okabe-Ito"</code>, <code class="language-plaintext highlighter-rouge">"R4"</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">"Blues 3"</code>, <code class="language-plaintext highlighter-rouge">"YlGnBu"</code>, <code class="language-plaintext highlighter-rouge">"Viridis"</code></td> <td style="text-align: left"><code class="language-plaintext highlighter-rouge">"Purple-Green"</code>,<br /><code class="language-plaintext highlighter-rouge">"Blue-Red 3"</code></td> </tr> </tbody> </table> <p>Based on this, the color defaults in 
base R were adapted. In particular, the old default palette was replaced by the <code class="language-plaintext highlighter-rouge">"R4"</code> palette, using very similar hues but avoiding the garish colors with extreme variations in brightness (see below for an example).</p> <p>Recently, the recommended package <a href="https://CRAN.R-project.org/package=lattice">lattice</a> also changed its default color theme (in version 0.21-8), using the qualitative <code class="language-plaintext highlighter-rouge">"Okabe-Ito"</code> palette as the symbol and fill color and the sequential <code class="language-plaintext highlighter-rouge">"YlGnBu"</code> palette for shading regions.</p> <h2 id="qualitative-palettes-in-palettecolors">Qualitative palettes in palette.colors</h2> <p>All palettes provided by the <code class="language-plaintext highlighter-rouge">palette.colors()</code> function are shown below (except the old default <code class="language-plaintext highlighter-rouge">"R3"</code> palette which is only implemented for backward compatibility).</p> <p><a href="https://www.zeileis.org/assets/posts/2023-05-05-coloring/palette-colors.png"><img src="https://www.zeileis.org/assets/posts/2023-05-05-coloring/palette-colors.png" alt="Qualitative palettes provided in palette.colors()" /></a></p> <p>Lighter palettes are typically more useful for shading areas, e.g., in bar plots or similar displays. Darker and more colorful palettes are usually better for coloring points or lines. The palettes <code class="language-plaintext highlighter-rouge">"R4"</code> and <code class="language-plaintext highlighter-rouge">"Okabe-Ito"</code> are particularly noteworthy because they have been designed to be reasonably robust under color vision deficiencies.</p> <p>This is illustrated in a time series line plot of the base R <code class="language-plaintext highlighter-rouge">EuStockMarkets</code> data. 
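</p> <p>All of the palettes discussed here can be generated and listed with standard base R calls (available since R 4.0.0), for example:</p> <pre><code class="language-{r}">palette.pals()                       ## names of the qualitative palettes
palette.colors(8, palette = "R4")    ## eight colors of the new default palette
palette.colors(palette = "Okabe-Ito")
hcl.pals("sequential")               ## names of the sequential HCL palettes
hcl.colors(7, palette = "YlGnBu")    ## seven colors from a sequential palette
</code></pre> <p>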
The three rows show different <code class="language-plaintext highlighter-rouge">palette.colors()</code> palettes: The old <code class="language-plaintext highlighter-rouge">"R3"</code> default palette (top), the new <code class="language-plaintext highlighter-rouge">"R4"</code> default palette (middle), and the <code class="language-plaintext highlighter-rouge">"Okabe-Ito"</code> palette (bottom). The columns contrast normal vision (left) and emulated deuteranope vision (right), the most common type of color vision deficiency. A color legend is used in the first row and direct labels in the other rows.</p> <p><a href="https://www.zeileis.org/assets/posts/2023-05-05-coloring/EuStockMarkets.png"><img src="https://www.zeileis.org/assets/posts/2023-05-05-coloring/EuStockMarkets.png" alt="Illustration of qualitative palettes" /></a></p> <p>We can see that the <code class="language-plaintext highlighter-rouge">"R3"</code> colors are highly saturated and they vary in luminance (brightness). For example, the cyan line is noticeably lighter than the others. Furthermore, for deuteranope viewers, the CAC and the SMI lines are difficult to distinguish from each other (exacerbated by the use of a color legend that makes matching the lines to labels almost impossible). Moreover, the FTSE line is more difficult to distinguish from the white background, compared to the other lines. The <code class="language-plaintext highlighter-rouge">"R4"</code> palette is an improvement: the luminance is more even and the colors are less saturated, plus the colors are more distinguishable for deuteranope viewers (aided by the use of direct color labels instead of a legend). 
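</p> <p>Such deuteranope versions of a palette can be emulated directly with the colorspace package, e.g.:</p> <pre><code class="language-{r}">colorspace::deutan(palette.colors(8, "R3"))        ## old default, deuteranope view
colorspace::deutan(palette.colors(8, "R4"))        ## new default, deuteranope view
colorspace::deutan(palette.colors(8, "Okabe-Ito"))
</code></pre> <p>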
The <code class="language-plaintext highlighter-rouge">"Okabe-Ito"</code> palette works even better, particularly for deuteranope viewers.</p> <h2 id="sequential-and-diverging-palettes-in-hclcolors">Sequential and diverging palettes in hcl.colors</h2> <p>In addition to qualitative palettes, the <code class="language-plaintext highlighter-rouge">hcl.colors()</code> function provides a wide range of sequential and diverging palettes designed for numeric or ordered data with or without a neutral reference value, respectively. There are more than 100 such palettes, many of which closely approximate palettes from well-established sources such as ColorBrewer.org, the Viridis family, CARTO colors, or Crameri’s scientific colors. The graphic below depicts just a subset of the multi-hue sequential palettes for illustration.</p> <p><a href="https://www.zeileis.org/assets/posts/2023-05-05-coloring/hcl-colors.png"><img src="https://www.zeileis.org/assets/posts/2023-05-05-coloring/hcl-colors.png" alt="Some of the multi-hue sequential palettes provided in hcl.colors()" /></a></p> <p>Some empirical examples and more insights are provided in the working paper linked above.</p> <h1 id="lightning_amplification">Amplification of Lightning in the European Alps 1980-2019</h1> <p><em>Achim Zeileis, 2023-05-04</em></p> <p>Detailed measurements of lightning as well as reanalyses of atmospheric conditions enable the reconstruction of lightning probabilities over large spatial and temporal domains. 
Using flexible additive regression models it is shown that lightning activity in the high European Alps has doubled from the 1980s to the 2010s.</p> <h2 id="citation">Citation</h2> <p>Thorsten Simon, Georg J. Mayr, Deborah Morgenstern, Nikolaus Umlauf, Achim Zeileis (2023). “Amplification of Annual and Diurnal Cycles of Alpine Lightning.” <em>Climate Dynamics</em>, Forthcoming. <a href="https://doi.org/10.1007/s00382-023-06786-8">doi:10.1007/s00382-023-06786-8</a></p> <h2 id="abstract">Abstract</h2> <p>The response of lightning to a changing climate is not fully understood. Historic trends of proxies known for fostering convective environments suggest an increase of lightning over large parts of Europe. Since lightning results from the interaction of processes on many scales, as many of these processes as possible must be considered for a comprehensive answer. Recent achievements of decade-long seamless lightning measurements and hourly reanalyses of atmospheric conditions including cloud micro-physics combined with flexible regression techniques have made a reliable reconstruction of cloud-to-ground lightning down to its seasonally varying diurnal cycle feasible. The European Eastern Alps and their surroundings are chosen as reconstruction region since this domain includes a large variety of land-cover, topographical and atmospheric circulation conditions. The most intense changes over the four decades from 1980 to 2019 occurred over the high Alps where lightning activity doubled in the 2010s compared to the 1980s. There, the lightning season reaches a higher maximum and starts one month earlier. Diurnally, the peak is up to 50% stronger with more lightning strikes in the afternoon and evening hours. 
Signals along the southern and northern alpine rim are similar but weaker whereas the flatlands surrounding the Alps have no significant trend.</p> <h2 id="software">Software</h2> <p>R packages <code class="language-plaintext highlighter-rouge">bamlss</code> (<a href="https://CRAN.R-project.org/package=bamlss">CRAN</a>, <a href="http://www.bamlss.org/">Web page</a>) and <code class="language-plaintext highlighter-rouge">mgcv</code> (<a href="https://CRAN.R-project.org/package=mgcv">CRAN</a>).</p> <h2 id="highlights">Highlights</h2> <p>The study links two sources of information which are both available in a spatio-temporal resolution of 32 km x 32 km and one hour:</p> <ol> <li>Measurements from the lightning location system ALDIS, available in homogeneous quality for the period 2010-2019.</li> <li>40 single-level atmospheric parameters from ECMWF’s fifth reanalysis (ERA5), available from 1980 onward, along with 45 further atmospheric variables derived from vertical profiles etc.</li> </ol> <p>The idea is to learn the link between the lightning observations and the ERA5 atmospheric parameters over the period when both data sources are available (2010-2019). Subsequently, probabilistic predictions can be made for lightning occurrence for the entire period starting in 1980, i.e., including the period where only atmospheric parameters but no high-quality lightning detection observations are available. This then makes it possible to track how the probability for lightning occurrence has evolved over the decades, both in terms of the annual seasonal cycles and the diurnal cycle.</p> <p>The probabilistic model learned on this challenging data set is a generalized additive model (GAM) using a binary logit link and smooth spline terms for all explanatory variables based on the atmospheric parameters and additional spatio-temporal information. 
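</p> <p>A classical (non-boosted) binary GAM of this kind can be sketched with mgcv; the variable names and toy data below are illustrative placeholders, not the authors' actual specification:</p> <pre><code class="language-{r}">library("mgcv")
## toy data: convective energy (cape) and day of year (doy)
set.seed(1)
d <- data.frame(cape = rexp(500, rate = 1/300),
                doy = sample(1:365, 500, replace = TRUE))
d$lightning <- rbinom(500, size = 1, prob = plogis(-4 + 0.004 * d$cape))
## binary logit GAM with a smooth term and a cyclic seasonal effect
m <- gam(lightning ~ s(cape) + s(doy, bs = "cc"),
         family = binomial(link = "logit"), data = d)
head(predict(m, type = "response"))  ## fitted lightning probabilities
</code></pre> <p>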
In order to deal with variable selection due to the large number of explanatory variables, the model is estimated by gradient boosting (as opposed to the classical maximum likelihood technique) combined with stability selection. These have been implemented using the R packages <code class="language-plaintext highlighter-rouge">mgcv</code> and <code class="language-plaintext highlighter-rouge">bamlss</code>.</p> <p>Based on the probabilistic predictions from this boosted binary GAM, the figure below shows reconstructed annual cycles of probabilities for lightning events averaged over the four decades from 1980s to 2010s (color coded). The light curves in the background are aggregations to the day of the year. The dark curves in the foreground are smoothed versions of the light curves. This shows that the peak in summer is much more pronounced and starts earlier for the High Alps and the Southern Alpine rim while there are only minor changes at the Northern Alpine rim and the surrounding flatlands.</p> <p><a href="https://www.zeileis.org/assets/posts/2023-05-04-lightning_amplification/cycle-seasonal.png"><img src="https://www.zeileis.org/assets/posts/2023-05-04-lightning_amplification/cycle-seasonal.png" alt="Seasonal cycles of reconstructed lightning probabilities over four decades" /></a></p> <p>To aggregate these changes even further and capture climate changes, linear trends are fitted to the reconstructed probabilities for June (afternoons, 13-19 UTC) over time. The figure below shows the spatial distribution of these linear climate trends: Color luminance gives the slope per decade of a linear regression for mean probability of lightning within an hour in percent. Desaturated colors in the grids indicate that the linear trends for these grids are not significant at the 5% level. 
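</p> <p>Per grid cell, such a trend amounts to a simple linear regression of the reconstructed probabilities on time; a toy sketch with made-up numbers:</p> <pre><code class="language-{r}">set.seed(42)
years <- 1980:2019
prob <- 0.05 + 0.0008 * (years - 1980) + rnorm(40, sd = 0.01)  ## toy series
trend <- lm(prob ~ years)
10 * coef(trend)[["years"]]                        ## slope per decade
summary(trend)$coefficients["years", "Pr(>|t|)"]   ## significant at the 5% level?
</code></pre> <p>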
Again, this highlights the pronounced changes in the High Alps and the Southern Alpine rim while there are no significant changes in the surrounding flatlands.</p> <p><a href="https://www.zeileis.org/assets/posts/2023-05-04-lightning_amplification/map-slopes.png"><img src="https://www.zeileis.org/assets/posts/2023-05-04-lightning_amplification/map-slopes.png" alt="Map of linear climate change for reconstructed lightning probabilities" /></a></p> <p>For more details and further insights see the full paper linked above.</p> <h1 id="fifa2022">Machine learning of a 2022 FIFA World Cup multiverse</h1> <p><em>Achim Zeileis, 2022-11-14</em></p> <p>Probabilistic forecasts for the 2022 FIFA World Cup are obtained by using a hybrid model that combines data from three advanced statistical models through random forests. The favorite is Brazil, followed by Argentina, Netherlands, Germany, and France.</p> <div class="row t20 b20"> <div class="small-8 medium-9 large-10 columns"> The 2022 FIFA World Cup will take place in Qatar from 20 November to 18 December 2022. 32 of the best teams from all around the world compete to determine the new World Champion. Although the event is overshadowed by many issues, both ethical and sporting, we decided for scientific purposes to employ our machine learning approach that we successfully used in previous tournaments for making probabilistic forecasts. More specifically, our approach yields probabilistic forecasts for all possible matches which can then be used to explore the likely course of the tournament along with its most likely champion by simulation. 
</div> <div class="small-4 medium-3 large-2 columns"> <a href="https://www.fifa.com/fifaplus/en/tournaments/mens/worldcup/qatar2022" alt="2022 FIFA World Cup web page"><img src="https://upload.wikimedia.org/wikipedia/en/e/e3/2022_FIFA_World_Cup.svg" alt="2022 FIFA World Cup logo" /></a> </div> </div> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The forecast is based on a conditional inference random forest learner that blends information capturing the past, present, and future of the competing football teams: <em>Insights from the past</em> are captured in an ability estimate for every team based on historic matches. <em>Expectations about the future</em> in the upcoming tournament are captured in an ability estimate for every team based on odds from international bookmakers. <em>The present status</em> of the teams (and their countries) is represented by covariates such as market value or the types of players in the team as well as country-specific socio-economic factors like population or GDP. The random forest model is learned using the previous five FIFA World Cup tournaments from 2002 to 2018 as training data and then applied to current information to obtain a forecast for the 2022 FIFA World Cup. More precisely, the random forest is calibrated to predict the likely distribution of goals for each team in all possible matches in the tournament. This makes it possible to simulate the outcome of each match in normal time as well as potential extra time and penalties in order to obtain probabilities for a <em>win</em>, <em>draw</em>, or <em>loss</em>. Moreover, because every individual match can be simulated like that, a “multiverse” of potential courses of the entire tournament can be created yielding overall winning probabilities for each team. 
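</p> <p>The multiverse idea can be illustrated with a toy knockout between four hypothetical teams; the pairwise win probabilities below are made up, whereas the actual study simulates the full tournament with its fitted model:</p> <pre><code class="language-{r}">set.seed(2022)
## made-up probabilities that the row team beats the column team
p_beat <- matrix(c(  NA, 0.60, 0.70, 0.80,
                   0.40,   NA, 0.55, 0.65,
                   0.30, 0.45,   NA, 0.50,
                   0.20, 0.35, 0.50,   NA),
  nrow = 4, byrow = TRUE, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
play <- function(i, j) if (runif(1) < p_beat[i, j]) i else j
## champion of one simulated bracket: (A vs. B) winner plays (C vs. D) winner
champ <- replicate(10000, play(play("A", "B"), play("C", "D")))
prop.table(table(champ))  ## proportions approximate winning probabilities
</code></pre> <p>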
The results show that, 20 years after last winning the title, Brazil is the clear favorite for the World Cup with a winning probability of 15.0%, followed by Argentina with 11.2%, the Netherlands with 9.7%, Germany with 9.2%, and France with 9.1%. The winning probabilities for all teams are shown in the bar chart below with more information linked in the interactive full-width version.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_win.html"><img src="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>The full study has been conducted by an international team of researchers: <a href="https://www.statistik.tu-dortmund.de/groll.html">Andreas Groll</a>, <a href="https://de.linkedin.com/in/neele-hormann-70164123a">Neele Hormann</a>, <a href="https://wwwfr.uni.lu/recherche/fstm/dmath/people/christophe_ley">Christophe Ley</a>, <a href="https://www.sg.tum.de/epidemiologie/team/schauberger/">Gunther Schauberger</a>, <a href="https://biblio.ugent.be/person/2C617710-F0EE-11E1-A9DE-61C894A0A6B4">Hans Van Eetvelde</a>, <a href="https://www.zeileis.org/">Achim Zeileis</a>. The core of the contribution is a hybrid approach that starts out from three state-of-the-art forecasting methods, based on disparate sets of information, and lets an adaptive machine learning model decide how to blend the different sources of information.</p> <ul> <li> <p><em>Historic information: Match abilities.</em><br /> An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A <em>bivariate Poisson model</em> with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. 
However, rather than equally weighting all matches to obtain <em>average</em> team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of <em>current</em> team abilities. More details can be found in <a href="https://doi.org/10.1177/1471082X18817650">Ley, Van de Wiele, Van Eetvelde (2019)</a>.</p> </li> <li> <p><em>Future expectation: Bookmaker consensus abilities.</em><br /> Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 28 international bookmakers that reflect their expert expectations for the tournament. Using an enhanced version of the <em>bookmaker consensus model</em> from <a href="https://doi.org/10.1016/j.ijforecast.2009.10.001">Leitner, Zeileis, Hornik (2010)</a>, the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To correct for the effects of the tournament draw (that might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead up to these winning probabilities.</p> </li> <li> <p><em>Combination with present status: Hybrid random forests.</em><br /> Finally, machine learning is used to combine these highly aggregated ability estimates with a broad range of further relevant covariates reflecting the current states of the different teams and the countries they come from. Such a hybrid approach was first suggested by <a href="https://doi.org/10.1515/jqas-2018-0060">Groll, Ley, Schauberger, Van Eetvelde (2019)</a>. A random forest learner is trained to decide how to blend the different ability estimates with team-specific features that are typically less informative but still powerful enough to enhance the forecasts. 
The features considered comprise team-specific details (e.g., market value, FIFA rank, team structure) as well as country-specific socio-economic factors (population and GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importances of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.</p> </li> </ul> <h2 id="match-probabilities">Match probabilities</h2> <p>Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in market values (on a log scale), etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a <em>win</em>, a <em>draw</em>, or a <em>loss</em>. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.</p> <p>The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. purple to signal probabilities above vs. below 50%, respectively. 
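The win/draw/loss computation can be sketched with independent Poisson margins (a simplification of the bivariate Poisson with zero covariance; the expected goals below are made-up values, not output of the forest):</p> <pre><code class="language-{r}"># illustrative expected goals for teams A and B (hypothetical values)
lambda_A <- 1.8
lambda_B <- 1.1
# joint probabilities for all score combinations from 0 to 10 goals
# (truncating at 10 is harmless for Poisson means of this size)
p <- outer(dpois(0:10, lambda_A), dpois(0:10, lambda_B))
# team A wins below the diagonal, draws on it, loses above it
c(win = sum(p[lower.tri(p)]), draw = sum(diag(p)), loss = sum(p[upper.tri(p)]))
</code></pre> <p>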
The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a <em>win</em>, <em>draw</em>, or <em>loss</em> after normal time.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_match.html"><img src="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>Based on the simulation of individual pairwise matches, as described above, we can create a “multiverse” of potential courses of the entire tournament (here: 100,000). The chances of the teams’ “survival” throughout the tournament can then be described by the proportions of multiverses in which they reach the different stages from the round of 16 to winning the overall title.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2022-11-14-fifa2022/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <h2 id="odds-and-ends">Odds and ends</h2> <p>All our forecasts are probabilistic, clearly below 100%, and by no means certain. Thus, although we can quantify this uncertainty in terms of probabilities from a multiverse of tournaments, it is far from being predetermined which of these possible tournaments we will see in our universe.</p> <p>Unfortunately, the experience of observing the actual tournament will be far less exciting and joyful than usual for us as researchers/forecasters and also as football fans due to the special circumstances. 
In addition to the widely discussed ethical problems regarding this FIFA World Cup, there are also sporting issues that are absolutely critical: The climate in Qatar is extraordinarily hot, which necessitated shifting the event to the winter months. Therefore, all major football leagues in Europe and South America have to interrupt their usual schedule in order to accommodate the tournament. This gives the national teams less time for preparation and the players less time for recovery before and after the World Cup. In combination with the extreme climate conditions, this also increases the risk of injuries. Hence, having a team with many players in the international European leagues (Champions League, Europa League, Europa Conference League) might actually be a handicap rather than a strength this year.</p> <p>All of these factors make the forecast of the tournament outcome more difficult because variables that have been highly predictive in previous World Cups might not work or work differently.</p> <p>Finally, more from the perspective of football fans (rather than professional forecasters), we are sad that all the usual joy and anticipation of a football World Cup has been crushed by the terrible circumstances this year: starting from the alleged bribery and corruption in the FIFA assignment process, to the human rights and working conditions in Qatar, and the lack of sustainability in the construction and operation of the stadiums.</p>2022-11-14T00:00:00+01:00https://www.zeileis.org/news/weuro2022/Probabilistic forecasting for the UEFA Women's Euro 20222022-07-04T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Using a consensus model based on quoted bookmakers' odds, winning probabilities for all competing teams in the UEFA Women's Euro are obtained: The favorite is Spain, followed by host England, France, and the Netherlands as the defending champion.<p>Using a consensus model based on quoted bookmakers' odds, winning probabilities for all 
competing teams in the UEFA Women's Euro are obtained: The favorite is Spain, followed by host England, France, and the Netherlands as the defending champion.</p> <div class="row t20 b20"> <div class="small-8 medium-9 large-10 columns"> Football fans throughout Europe and the world anticipate the UEFA Women's Euro 2022 that will take place in England from 6 July to 31 July 2022. 16 of the best European teams compete to determine the new European Champion. Here, a predictive model is established to forecast what the most likely outcome of the tournament will be. The forecast is based on the expert knowledge of 16 bookmakers and betting exchanges using a model averaging approach. </div> <div class="small-4 medium-3 large-2 columns"> <a href="https://www.uefa.com/womenseuro/" alt="UEFA Women's Euro 2022 web page"><img src="https://upload.wikimedia.org/wikipedia/en/0/0b/UEFA_Women%27s_Euro_2022_logo.svg" alt="UEFA Women's Euro 2022 logo" /></a> </div> </div> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The model is the so-called bookmaker consensus model which has been proposed by Leitner, Hornik, and Zeileis (2010, <em>International Journal of Forecasting</em>, <a href="https://doi.org/10.1016/j.ijforecast.2009.10.001">https://doi.org/10.1016/j.ijforecast.2009.10.001</a>) and successfully applied in previous football tournaments, either by itself or in combination with even more refined <a href="https://www.zeileis.org/news/euro2020/">machine learning techniques</a>.</p> <p>This time the forecast shows that Spain is the favorite with a forecasted winning probability of 19.6%, closely followed by England with a winning probability of 16.6%. Four teams also have double-digit winning probabilities: France with 13.5%, the Netherlands with 13.3%, Germany with 10.3%, and Sweden with 10.1%. 
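The underlying consensus idea, adjusting the quoted odds for the bookmakers' profit margins and averaging on the log-odds scale, can be sketched for a single team; the odds and the margin below are hypothetical toy values:</p> <pre><code class="language-{r}"># hypothetical quoted odds for one team from three bookmakers
odds <- c(4.5, 5.0, 4.8)
# assumed profit margin ("overround") of 20% for each bookmaker
p_adj <- (1 / odds) / (1 + 0.2)
# average on the log-odds scale, then transform back to a probability
plogis(mean(qlogis(p_adj)))
</code></pre> <p>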
More details are displayed in the following barchart.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_win.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>These probabilistic forecasts have been obtained by model-based averaging of the quoted winning odds for all teams across bookmakers. More precisely, the odds are first adjusted for the bookmakers’ profit margins (“overrounds”, on average 20.1%), averaged on the log-odds scale to a consensus rating, and then transformed back to winning probabilities. The raw bookmakers’ odds as well as the forecasts for all teams are also available in machine-readable form in <a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/weuro2022.csv">weuro2022.csv</a>.</p> <p>Although forecasting the winning probabilities for the UEFA Women’s Euro 2022 is probably of most interest, the bookmaker consensus forecasts can also be employed to infer team-specific abilities using an “inverse” tournament simulation:</p> <ol> <li>If team abilities are available, pairwise winning probabilities can be derived for each possible match (see below).</li> <li>Given pairwise winning probabilities, the whole tournament can be easily simulated to see which team proceeds to which stage in the tournament and which team finally wins.</li> <li>Such a tournament simulation can then be run sufficiently often (here 100,000 times) to obtain relative frequencies for each team winning the tournament.</li> </ol> <p>Using this idea, abilities in step 1 can be chosen such that the simulated winning probabilities in step 3 closely match those from the bookmaker consensus shown above.</p> <h2 id="pairwise-comparisons">Pairwise comparisons</h2> <p>A classical approach to obtain winning probabilities in pairwise comparisons 
(i.e., matches between teams/players) is the Bradley-Terry model, which is similar to the Elo rating, popular in sports. The Bradley-Terry approach models the probability that a Team A beats a Team B by their associated abilities (or strengths):</p> <math xmlns="http://www.w3.org/1998/Math/MathML"><mstyle displaystyle="true"><mrow><mi fontstyle="normal">Pr</mi><mo stretchy="false">(</mo><mi>A</mi><mtext> beats </mtext><mi>B</mi><mo stretchy="false">)</mo><mo>=</mo><mfrac><mrow><msub><mrow><mi fontstyle="italic">ability</mi></mrow><mrow><mi>A</mi></mrow></msub></mrow><mrow><msub><mrow><mi fontstyle="italic">ability</mi></mrow><mrow><mi>A</mi></mrow></msub><mo>+</mo><msub><mrow><mi fontstyle="italic">ability</mi></mrow><mrow><mi>B</mi></mrow></msub></mrow></mfrac><mo>.</mo></mrow></mstyle></math> <p>Coupled with the “inverse” simulation of the tournament, as described in step 1-3 above, this yields pairwise probabilities for each possible match. The following heatmap shows the probabilistic forecasts for each match with light gray signalling approximately equal chances and green vs. 
purple signalling advantages for Team A or B, respectively.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_match.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times) providing “survival” probabilities for each team across the different stages.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <p>For example, this shows that, compared to England and France, Spain’s chances of reaching one of the quarterfinals are lower but its chances of reaching one of the semifinals are higher. The reasons for this are that Spain plays another one of the strongest six teams in their group (Germany) but can likely avoid another of these six teams in the quarterfinal. Conversely, England and France do not have another of the six top teams in their group but most likely play one in their quarterfinals (Germany and Netherlands or Sweden, respectively).</p> <p>This effect of the tournament draw is also brought out by another display that highlights the likely flow of all teams through the tournament simultaneously. 
Compared to the survival curves shown above, this visualization brings out more clearly at which stages of the tournament the strong teams are most likely to meet.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_sankey.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_sankey.html"><img src="https://www.zeileis.org/assets/posts/2022-07-04-weuro2022/p_sankey.png" alt="Sankey diagram" /></a></p> <h2 id="odds-and-ends">Odds and ends</h2> <p>The bookmaker consensus model has performed well in previous tournaments, often predicting winners or finalists correctly. However, all forecasts are probabilistic, clearly below 100%, and thus by no means certain. It would also be possible to post-process the bookmaker consensus along with data from historic matches, player ratings, and other information about the teams using <a href="https://www.zeileis.org/news/euro2020/">machine learning techniques</a>. However, due to lack of time for more refined forecasts at the end of a busy academic year, at least the bookmaker consensus is provided as a solid basic forecast.</p> <p>As a final remark: Betting on the outcome based on the results presented here is not recommended. Not only because the winning probabilities are clearly far below 100% but, more importantly, because the bookmakers have a sizeable profit margin of about 20.1%, which ensures that the best chances of making money based on sports betting lie with them!</p> <p>In a few days we will start learning which of the probable paths through the tournament, shown above, will actually come true. 
Enjoy the UEFA Women’s Euro 2022!</p>2022-07-04T00:00:00+02:00https://www.zeileis.org/news/causal_forests/Model-based causal forests for heterogeneous treatment effects2022-07-02T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/A new arXiv paper investigates which building blocks of random forests, especially causal forests and model-based forests, make them work for heterogeneous treatment effect estimation, both in randomized trials and observational studies.<p>A new arXiv paper investigates which building blocks of random forests, especially causal forests and model-based forests, make them work for heterogeneous treatment effect estimation, both in randomized trials and observational studies.</p> <h3 id="citation">Citation</h3> <p>Susanne Dandl, Torsten Hothorn, Heidi Seibold, Erik Sverdrup, Stefan Wager, Achim Zeileis (2022). “What Makes Forest-Based Heterogeneous Treatment Effect Estimators Work?.” <em>arXiv.org E-Print Archive</em> arXiv:2206.10323 [stat.ME]. <a href="https://doi.org/10.48550/arXiv.2206.10323">doi:10.48550/arXiv.2206.10323</a></p> <h3 id="abstract">Abstract</h3> <p>Estimation of heterogeneous treatment effects (HTE) is of prime importance in many disciplines, ranging from personalized medicine to economics among many others. Random forests have been shown to be a flexible and powerful approach to HTE estimation in both randomized trials and observational studies. In particular “causal forests”, introduced by <a href="https://doi.org/10.1214/18-aos1709">Athey, Tibshirani, and Wager (2019)</a>, along with the R implementation in package <a href="https://CRAN.R-project.org/package=grf"><em>grf</em></a>, were rapidly adopted. 
A related approach, called “model-based forests”, which is geared towards randomized trials and simultaneously captures effects of both prognostic and predictive variables, was introduced by <a href="https://doi.org/10.1177/0962280217693034">Seibold, Zeileis, and Hothorn (2018)</a> along with a modular implementation in the R package <a href="https://CRAN.R-project.org/package=model4you"><em>model4you</em></a>.</p> <p>Here, we present a unifying view that goes beyond the <em>theoretical</em> motivations and investigates which <em>computational</em> elements make causal forests so successful and how these can be blended with the strengths of model-based forests. To do so, we show that both methods can be understood in terms of the same parameters and model assumptions for an additive model under <em>L</em><sub>2</sub> loss. This theoretical insight allows us to implement several flavors of “model-based causal forests” and dissect their different elements <em>in silico</em>.</p> <p>The original causal forests and model-based forests are compared with the new blended versions in a benchmark study exploring both randomized trials and observational settings. In the randomized setting, both approaches performed similarly. If confounding was present in the data generating process, we found local centering of the treatment indicator with the corresponding propensities to be the main driver for good performance. Local centering of the outcome was less important, and might be replaced or enhanced by simultaneous split selection with respect to both prognostic and predictive effects. This lays the foundation for future research combining random forests for HTE estimation with other types of models.</p> <p>We demonstrate the practical aspects of such a model-agnostic approach to HTE estimation analyzing the effect of cesarean section on postpartum blood loss in comparison to vaginal delivery. 
Clearly, randomization is hardly possible in this setup, and we present a tailored model-based forest for skewed and interval-censored data to infer possible predictive variables and their impact on the treatment effect.</p> <h3 id="benchmark-study">Benchmark study</h3> <p>To investigate which elements of the different random forest algorithms in causal forests (cf) vs. model-based forests (mob) contribute to more precise estimation of heterogeneous treatment effects, a large simulation experiment was carried out, using normal outcomes, different predictive and prognostic effects, and a varying number of observations (N) and covariates (P).</p> <p>In addition to the original cf (from <em>grf</em>) and mob (from <em>model4you</em>) algorithms three blended versions (based on <em>model4you</em>) were assessed: mob(\(\widehat W\)) (model-based forests after centering of the treatment indicator), mob(\(\widehat W\), \(\widehat Y\)) (model-based forests after centering of both the treatment indicator and the outcome), mobcf (model-based forests after centering of both the treatment indicator and the outcome, only testing for splits in the treatment effect).</p> <p>Four data-generation setups are considered, as proposed by Nie and Wager (2021): Setup A has complicated confounding but a relatively simple treatment effect function. Setup B has no confounding. Setup C has strong confounding but a constant treatment effect. In Setup D the treatment and control arms are completely unrelated.</p> <p>Overall, the results in the figure below show that centering of the treatment indicator as in mob(\(\widehat W\)) is the most relevant ingredient to random forests for HTE estimation in observational studies. 
If possible, additionally centering the outcome in combination with simultaneous estimation of predictive and prognostic effects in mob(\(\widehat W\), \(\widehat Y\)) is recommended as it always performs as well as mob(\(\widehat W\)) and mobcf but may yield relevant improvements in some scenarios. Other technical aspects of tree and forest induction did not contribute to major performance differences. The overall strong performance of mob(\(\widehat W\), \(\widehat Y\)), combining centering of outcome and treatment from causal forests with joint estimation of prognostic and predictive effects, suggests that alternative split criteria sensitive to both intercepts and treatment effects might be able to improve the performance of causal forests.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig1.png"><img src="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig1.png" alt="Results for the experimental setups in Section 4.1 of the arXiv working paper. Direct comparison of the adaptive versions of causal forests, model-based forests without centering (mob), mob imitating causal forests (mobcf), mob with centered W (mob(W)) and additionally centered Y (mob(W, Y))." /></a></p> <p>For more details and more results see the <a href="https://doi.org/10.48550/arXiv.2206.10323">arXiv working paper</a>.</p> <h3 id="empirical-application">Empirical application</h3> <p>To illustrate how model-based causal forests can be tailored for specific situations, the effect of cesarean sections vs. vaginal deliveries (treatment) on the amount of postpartum blood loss (outcome) is investigated. Clearly, covariates like maternal age, birth weight, gestational age, or multifetal pregnancy potentially have an impact on both the treatment and the outcome. As randomizing the mode of delivery is impossible, methods for HTE estimation from observational data are needed. 
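The local centering emphasized above can be illustrated generically with simulated data and the <em>grf</em> package (a minimal sketch with made-up variables, not the tailored transformation forest used for the blood loss application):</p> <pre><code class="language-{r}">library("grf")
set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), n, 5)
W <- rbinom(n, 1, plogis(X[, 1]))              # treatment depends on X: confounding
Y <- X[, 2] + W * pmax(X[, 1], 0) + rnorm(n)   # heterogeneous treatment effect
W_hat <- predict(regression_forest(X, W))$predictions  # propensity estimates
Y_hat <- predict(regression_forest(X, Y))$predictions  # conditional mean estimates
cf <- causal_forest(X, Y, W, Y.hat = Y_hat, W.hat = W_hat)
head(predict(cf)$predictions)                  # individual treatment effect estimates
</code></pre> <p>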
Moreover, blood loss is a skewed variable that is additionally impossible to measure exactly in the sometimes hectic environment of a delivery ward. It is hence treated as interval-censored. To accomodate all these features, a model-based causal forest is fitted by using <code class="language-plaintext highlighter-rouge">pmforest()</code> from <em>model4you</em> in combination with:</p> <ul> <li>Centering of the treatment variable to account for the observational nature of the data.</li> <li>A transformation model (based on a Bernstein polynomial) to flexibly capture the skewness of the outcome variable.</li> <li>Interval censoring of the outcome observations.</li> </ul> <p>The dependency of the treatment effect on the prepartum variables is visualized in the figure below, using scatter plots for continuous covariates and boxplots for categorical covariates. While some variables have virtually no influence on the treatment effect (e.g., mother’s age), others are associated with clear effect differences. In particular, higher gestational age, higher neonatal weight, and no multifetal pregnancy have a higher risk for elevated blood loss due to cesarean section compared to vaginal delivery.</p> <p><a href="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig5.png"><img src="https://www.zeileis.org/assets/posts/2022-07-02-causal_forests/fig5.png" alt="Dependency plots of the individual treatment effects calculated by the model-based transformation forest. Values > 0 mean that cesarean section increases the blood loss compared to vaginal delivery. Blue lines and diamond points depict (smooth conditional) mean effects." /></a></p> <p>For more details see the <a href="https://doi.org/10.48550/arXiv.2206.10323">arXiv working paper</a>.</p>2022-07-02T00:00:00+02:00https://www.zeileis.org/news/user2022/distributions3 @ useR! 
20222022-06-27T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Conference presentation about the 'distributions3' package for S3 probability distributions (and 'topmodels' for graphical model assessment) at useR! 2022: Slides, video, replication code, and vignette.<p>Conference presentation about the 'distributions3' package for S3 probability distributions (and 'topmodels' for graphical model assessment) at useR! 2022: Slides, video, replication code, and vignette.</p> <h2 id="abstract">Abstract</h2> <p><em>(Authors: <a href="https://www.zeileis.org">Achim Zeileis</a>, <a href="https://moritzlang.org/">Moritz N. Lang</a>, <a href="https://www.alexpghayes.com/">Alex Hayes</a>)</em></p> <p>The <a href="https://alexpghayes.github.io/distributions3/">distributions3</a> package provides a beginner-friendly and lightweight interface to probability distributions. It allows users to create distribution objects in the S3 paradigm that are essentially data frames of parameters, for which standard methods are available: e.g., evaluation of the probability density, cumulative distribution, and quantile functions as well as random samples. It has been designed such that it can be employed in introductory statistics and probability courses. By not only providing objects for a single distribution but also for vectors of distributions, users can transition seamlessly to a representation of probabilistic forecasts from regression models such as GLM (generalized linear model), GAMLSS (generalized additive models for location, scale, and shape), etc. 
We show how the package can be used both in teaching and in applied statistical modeling, for interpreting fitted models and assessing their goodness of fit (“by hand” and via the <a href="https://topmodels.R-Forge.R-project.org/">topmodels</a> package).</p> <h2 id="resources">Resources</h2> <p>Links to: <a href="https://www.zeileis.org/papers/useR-2022.pdf">PDF slides</a>, <a href="https://www.youtube.com/watch?v=rs7ha1F5S0k">YouTube video</a>, <a href="https://www.zeileis.org/assets/posts/2022-06-27-user2022/code.R">R code</a>, <a href="https://www.zeileis.org/news/poisson/">vignette/blog post</a>.</p> <p><a href="https://www.zeileis.org/papers/useR-2022.pdf"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/slides.png" alt="PDF slides" /></a></p> <p><a href="https://www.youtube.com/watch?v=rs7ha1F5S0k"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/youtube.png" alt="YouTube video" /></a></p> <p><a href="https://www.zeileis.org/assets/posts/2022-06-27-user2022/code.R"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/code.png" alt="R code" /></a></p> <p><a href="https://www.zeileis.org/news/poisson/"><img src="https://www.zeileis.org/assets/posts/2022-06-27-user2022/vignette.png" alt="vignette/blog post" /></a></p>2022-06-27T00:00:00+02:00https://www.zeileis.org/news/poisson/The Poisson distribution: From basic probability theory to regression models2022-06-23T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/Brief introduction to the Poisson distribution for modeling count data using the distributions3 package. The distribution is illustrated using the number of goals scored at the 2018 FIFA World Cup, suitable for self-study or as a classroom exercise.<p>Brief introduction to the Poisson distribution for modeling count data using the distributions3 package. 
The distribution is illustrated using the number of goals scored at the 2018 FIFA World Cup, suitable for self-study or as a classroom exercise.</p> <h2 id="the-poisson-distribution">The Poisson distribution</h2> <p>The classic basic probability distribution employed for modeling count data is the Poisson distribution. Its probability mass function \(f(y; \lambda)\) yields the probability for a random variable \(Y\) to take a count \(y \in \{0, 1, 2, \dots\}\) based on the distribution parameter \(\lambda > 0\):</p> <p>[\text{Pr}(Y = y) = f(y; \lambda) = \frac{\exp\left(-\lambda\right) \cdot \lambda^y}{y!}.]</p> <p>The Poisson distribution has many distinctive features, e.g., both its expectation and variance are equal and given by the parameter \(\lambda\). Thus, \(\text{E}(Y) = \lambda\) and \(\text{Var}(Y) = \lambda\). Moreover, the Poisson distribution is related to other basic probability distributions. Namely, it can be obtained as the limit of the binomial distribution when the number of attempts is high and the success probability low. Or the Poisson distribution can be approximated by a normal distribution when \(\lambda\) is large. See <a href="#Wiki+Poisson">Wikipedia (2002)</a> for further properties and references.</p> <p>Here, we leverage the <code class="language-plaintext highlighter-rouge">distributions3</code> package (<a href="#CRAN+distributions3">Hayes <em>et al.</em> 2022</a>) to work with the Poisson distribution in R. In <code class="language-plaintext highlighter-rouge">distributions3</code>, Poisson distribution objects can be generated with the <code class="language-plaintext highlighter-rouge">Poisson()</code> function. 
Subsequently, methods for generic functions can be used to print the objects; extract mean and variance; evaluate density, cumulative distribution, or quantile function; or simulate random samples.</p> <pre><code class="language-{r}">library("distributions3")
Y <- Poisson(lambda = 1.5)
print(Y)
## [1] "Poisson distribution (lambda = 1.5)"
mean(Y)
## [1] 1.5
variance(Y)
## [1] 1.5
pdf(Y, 0:5)
## [1] 0.22313 0.33470 0.25102 0.12551 0.04707 0.01412
cdf(Y, 0:5)
## [1] 0.2231 0.5578 0.8088 0.9344 0.9814 0.9955
quantile(Y, c(0.1, 0.5, 0.9))
## [1] 0 1 3
set.seed(0)
random(Y, 5)
## [1] 3 1 1 2 3
</code></pre> <p>Using the <code class="language-plaintext highlighter-rouge">plot()</code> method the distribution can also be visualized, which we use here to show how the probabilities for the counts \(0, 1, \dots, 15\) change when the parameter is \(\lambda = 0.5, 2, 5, 10\).</p> <pre><code class="language-{r}">plot(Poisson(0.5), main = expression(lambda == 0.5), xlim = c(0, 15))
plot(Poisson(2), main = expression(lambda == 2), xlim = c(0, 15))
plot(Poisson(5), main = expression(lambda == 5), xlim = c(0, 15))
plot(Poisson(10), main = expression(lambda == 10), xlim = c(0, 15))
</code></pre> <p><a href="https://www.zeileis.org/assets/posts/2022-06-23-poisson/density.png"><img src="https://www.zeileis.org/assets/posts/2022-06-23-poisson/density.png" alt="Probability density for Poisson distributions with means 0.5, 2, 5, and 10" /></a></p> <p>In the following we will illustrate how this infrastructure can be leveraged to obtain predicted probabilities for the number of goals in soccer matches from the 2018 FIFA World Cup.</p> <h2 id="goals-in-the-2018-fifa-world-cup">Goals in the 2018 FIFA World Cup</h2> <p>To investigate the number of goals scored per match in the 2018 FIFA World Cup, the <code class="language-plaintext highlighter-rouge">FIFA2018</code> data set provides two rows, one for each team, for each of the 64 matches during the tournament. 
In the following, we treat the goals scored by the two teams in the same match as independent, which is a realistic assumption for this particular data set. We just remark briefly that there are also bivariate generalizations of the Poisson distribution that would allow for correlated observations but which are not considered here.</p> <p>In addition to the goals, the data set provides some basic meta-information for the matches (an ID, team name abbreviations, type of match, group vs. knockout stage) as well as some further covariates that we will revisit later in this document. The data looks like this:</p> <pre><code class="language-{r}">data("FIFA2018", package = "distributions3")
head(FIFA2018)
##   goals team match type stage logability difference
## 1     5  RUS     1    A group     0.1531     0.8638
## 2     0  KSA     1    A group    -0.7108    -0.8638
## 3     0  EGY     2    A group    -0.2066    -0.4438
## 4     1  URU     2    A group     0.2372     0.4438
## 5     3  RUS     3    A group     0.1531     0.3597
## 6     1  EGY     3    A group    -0.2066    -0.3597
</code></pre> <p>For now, we will focus on the <code class="language-plaintext highlighter-rouge">goals</code> variable only. A brief summary yields</p> <pre><code class="language-{r}">summary(FIFA2018$goals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     1.0     1.3     2.0     6.0
</code></pre> <p>showing that the teams scored between \(0\) and \(6\) goals per match with an average of \(\bar y = 1.3\) from the observations \(y_i\) (\(i = 1, \dots, 128\)). The corresponding table of observed relative frequencies is:</p> <pre><code class="language-{r}">observed <- proportions(table(FIFA2018$goals))
observed
## 
##        0        1        2        3        4        5        6 
## 0.257812 0.375000 0.250000 0.078125 0.015625 0.015625 0.007812
</code></pre> <p>This confirms that goals are relatively rare events in a soccer game with each team scoring zero to two goals per match in almost 90 percent of the matches. 
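The “almost 90 percent” can be verified directly from the table of observed relative frequencies:</p> <pre><code class="language-{r}"># proportion of team/match observations with at most two goals
sum(observed[c("0", "1", "2")])
## [1] 0.8828125
</code></pre> <p>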
Below we show that this observed frequency distribution can be approximated very well by a Poisson distribution which can subsequently be used to obtain predicted probabilities for the goals scored in a match.</p> <h2 id="basic-fitted-distribution">Basic fitted distribution</h2> <p>In a first step, we simply assume that goals are scored with a constant mean over all teams and matches and hence just fit a single Poisson distribution for the number of goals. To do so, we obtain a point estimate of the Poisson parameter by using the empirical mean \(\hat \lambda = \bar y = 1.3\) and set up the corresponding distribution object:</p> <pre><code class="language-{r}">p_const <- Poisson(lambda = mean(FIFA2018$goals))
p_const
## [1] "Poisson distribution (lambda = 1.3)"
</code></pre> <p>In the technical details below we show that this actually corresponds to the maximum likelihood estimation for this distribution. It could also be fitted via <code class="language-plaintext highlighter-rouge">fit_mle(Poisson(1), FIFA2018$goals)</code> in <code class="language-plaintext highlighter-rouge">distributions3</code>.</p> <p>As already illustrated above, the expected probabilities of observing counts of \(0, 1, \dots, 6\) goals for this Poisson distribution can be extracted using the <code class="language-plaintext highlighter-rouge">pdf()</code> method. A comparison with the observed empirical frequencies yields</p> <pre><code class="language-{r}">expected <- pdf(p_const, 0:6)
cbind(observed, expected)
##   observed expected
## 0 0.257812 0.273385
## 1 0.375000 0.354546
## 2 0.250000 0.229901
## 3 0.078125 0.099384
## 4 0.015625 0.032222
## 5 0.015625 0.008358
## 6 0.007812 0.001806
</code></pre> <p>By and large, all observed and expected frequencies are rather close. However, it is not reasonable that all teams score goals with the same probabilities, which would imply that winning or losing could just be attributed to “luck” or “random variation” alone. 
Therefore, while a certain level of randomness will certainly remain, we should also consider that there are stronger and weaker teams in the tournament.</p> <h2 id="poisson-regression-and-probabilistic-forecasting">Poisson regression and probabilistic forecasting</h2> <p>To account for different expected performances from the teams in the 2018 FIFA World Cup, the <code class="language-plaintext highlighter-rouge">FIFA2018</code> data provides an estimated <code class="language-plaintext highlighter-rouge">logability</code> for each team. These have been estimated by <a href="#Zeileis+Leitner+Hornik:2018">Zeileis <em>et al.</em> (2018)</a> prior to the start of the tournament (2018-05-20) based on quoted odds from 26 online bookmakers using the bookmaker consensus model of <a href="#Leitner+Zeileis+Hornik:2010">Leitner <em>et al.</em> (2010)</a>. The <code class="language-plaintext highlighter-rouge">difference</code> in <code class="language-plaintext highlighter-rouge">logability</code> between a team and its opponent is a useful predictor for the number of <code class="language-plaintext highlighter-rouge">goals</code> scored.</p> <p>Consequently, we fit a generalized linear model (GLM) to the data that links the expected number of goals per team/match \(\lambda_i\) to the linear predictor \(x_i^\top \beta\) with regressor vector \(x_i^\top = (1, \mathtt{difference}_i)\) and corresponding coefficient vector \(\beta\) using a log-link: \(\log(\lambda_i) = x_i^\top \beta\). The maximum likelihood estimator \(\hat \beta\) with corresponding inference, predictions, residuals, etc. 
can be obtained using the <code class="language-plaintext highlighter-rouge">glm()</code> function from base R with <code class="language-plaintext highlighter-rouge">family = poisson</code>:</p> <pre><code class="language-{r}">m <- glm(goals ~ difference, data = FIFA2018, family = poisson) summary(m) ## ## Call: ## glm(formula = goals ~ difference, family = poisson, data = FIFA2018) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.144 -1.155 -0.175 0.528 2.327 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.2127 0.0813 2.62 0.0088 ** ## difference 0.4134 0.1058 3.91 9.3e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 144.20 on 127 degrees of freedom ## Residual deviance: 128.69 on 126 degrees of freedom ## AIC: 359.4 ## ## Number of Fisher Scoring iterations: 5 </code></pre> <p>Both parameters can be interpreted. First, the intercept corresponds to the expected log-goals per team in a match of two equally strong teams, i.e., with zero difference in log-abilities. 
The corresponding prediction for the number of goals can be obtained manually from the extracted <code class="language-plaintext highlighter-rouge">coef()</code> by applying <code class="language-plaintext highlighter-rouge">exp()</code> (the inverse of the log-link).</p> <pre><code class="language-{r}">lambda_zero <- exp(coef(m)[1]) lambda_zero ## (Intercept) ## 1.237 </code></pre> <p>Equivalently, the <code class="language-plaintext highlighter-rouge">predict()</code> function can be used with <code class="language-plaintext highlighter-rouge">type = "response"</code> in order to get the expected \(\hat \lambda_i\) (rather than just the linear predictor \(x_i^\top \hat \beta\) that is predicted by default).</p> <pre><code class="language-{r}">predict(m, newdata = data.frame(difference = 0), type = "response") ## 1 ## 1.237 </code></pre> <p>As above, we can also set up a <code class="language-plaintext highlighter-rouge">Poisson()</code> distribution object and obtain the associated expected probability distribution for zero to six goals in a match of two equally strong teams:</p> <pre><code class="language-{r}">p_zero <- Poisson(lambda = lambda_zero) pdf(p_zero, 0:6) ## [1] 0.290242 0.359041 0.222074 0.091571 0.028319 0.007006 0.001445 </code></pre> <p>Note that <code class="language-plaintext highlighter-rouge">distributions3</code> also provides a convenience function <code class="language-plaintext highlighter-rouge">prodist()</code> that allows one to obtain <code class="language-plaintext highlighter-rouge">p_zero</code> in a single step via <code class="language-plaintext highlighter-rouge">prodist(m, newdata = data.frame(difference = 0))</code>.</p> <p>Second, the slope of \(0.413\) can be interpreted as an ability elasticity of the number of goals scored. This is because the difference of the log-abilities can also be understood as the log of the ability ratio. 
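</p> <p>This elasticity interpretation can be verified numerically. The following Python sketch uses the rounded coefficients from the model summary above, so all values are approximate:</p> <pre><code class="language-python">
from math import exp, log

# Rounded coefficients from the Poisson GLM above: log(lambda) = b0 + b1 * difference.
b0, b1 = 0.2127, 0.4134

def expected_goals(difference):
    """Expected goals via the inverse of the log-link."""
    return exp(b0 + b1 * difference)

# A 1 percent increase in the ability ratio shifts the log-ability
# difference by log(1.01), changing the expected goals by about b1 percent.
pct_change = 100 * (expected_goals(log(1.01)) / expected_goals(0) - 1)
# expected_goals(0) is about 1.237, pct_change about 0.41
</code></pre> <p>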
Thus, when the ability ratio increases by \(1\) percent, the expected number of goals increases approximately by \(0.413\) percent.</p> <p>This yields a different predicted Poisson distribution for each team/match in the tournament. We can set up the vector of all \(128\) <code class="language-plaintext highlighter-rouge">Poisson()</code> distribution objects by extracting the vector of all fitted point estimates \((\hat \lambda_1, \dots, \hat \lambda_{128})^\top\):</p> <pre><code class="language-{r}">p_reg <- Poisson(lambda = fitted(m)) length(p_reg) ## [1] 128 head(p_reg) ## 1 2 ## "Poisson distribution (lambda = 1.768)" "Poisson distribution (lambda = 0.866)" ## 3 4 ## "Poisson distribution (lambda = 1.030)" "Poisson distribution (lambda = 1.486)" ## 5 6 ## "Poisson distribution (lambda = 1.435)" "Poisson distribution (lambda = 1.066)" </code></pre> <p>Again, the convenience function <code class="language-plaintext highlighter-rouge">prodist(m)</code> could also be used to directly extract <code class="language-plaintext highlighter-rouge">p_reg</code>.</p> <p>Note that specific elements from the vector <code class="language-plaintext highlighter-rouge">p_reg</code> of Poisson distributions can be extracted as usual, e.g., with an index like <code class="language-plaintext highlighter-rouge">p_reg[i]</code> or using the <code class="language-plaintext highlighter-rouge">head()</code> and <code class="language-plaintext highlighter-rouge">tail()</code> functions etc.</p> <p>As an illustration, the following goal distributions could be expected for the FIFA World Cup final (in the last two rows of the data) that France won 4-2 against Croatia:</p> <pre><code class="language-{r}">tail(FIFA2018, 2) ## goals team match type stage logability difference ## 127 4 FRA 64 Final knockout 0.8866 0.629 ## 128 2 CRO 64 Final knockout 0.2576 -0.629 p_final <- tail(p_reg, 2) p_final ## 127 128 ## "Poisson distribution (lambda = 1.604)" "Poisson distribution (lambda = 0.954)" 
pdf(p_final, 0:6) ## d_0 d_1 d_2 d_3 d_4 d_5 d_6 ## 127 0.2010 0.3225 0.2587 0.13836 0.05550 0.017808 0.0047618 ## 128 0.3853 0.3675 0.1752 0.05572 0.01329 0.002534 0.0004029 </code></pre> <p>This shows that France was expected to score more goals than Croatia, but both teams scored more goals than expected, albeit not improbably many.</p> <h2 id="further-details-and-extensions">Further details and extensions</h2> <p>Assuming independence of the number of goals scored, we can obtain the table of possible match results (after normal time) by multiplying the marginal probabilities (again only up to six goals). In R this can be done using the <code class="language-plaintext highlighter-rouge">outer()</code> function, which by default performs a multiplication of its arguments.</p> <pre><code class="language-{r}">res <- outer(pdf(p_final[1], 0:6), pdf(p_final[2], 0:6)) round(100 * res, digits = 2) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] ## [1,] 7.74 7.39 3.52 1.12 0.27 0.05 0.01 ## [2,] 12.43 11.85 5.65 1.80 0.43 0.08 0.01 ## [3,] 9.97 9.51 4.53 1.44 0.34 0.07 0.01 ## [4,] 5.33 5.08 2.42 0.77 0.18 0.04 0.01 ## [5,] 2.14 2.04 0.97 0.31 0.07 0.01 0.00 ## [6,] 0.69 0.65 0.31 0.10 0.02 0.00 0.00 ## [7,] 0.18 0.17 0.08 0.03 0.01 0.00 0.00 </code></pre> <p>For example, we can see from this table that the expected probability for France winning against Croatia 1-0 is \(12.43\) percent while the probability that France loses 0-1 is only \(7.39\) percent.</p> <p>The advantage of France can also be brought out more clearly by aggregating the probabilities for winning (lower triangular matrix), a draw (diagonal), or losing (upper triangular matrix). 
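</p> <p>Because the result matrix is just the outer product of the two marginal densities, this aggregation is easy to reproduce outside of R as well. A hedged Python sketch with the rounded means \(1.604\) and \(0.954\) from above (so the sums only approximately match the R output):</p> <pre><code class="language-python">
from math import exp, factorial

def dpois(y, lam):
    """Poisson density f(y; lambda) = exp(-lambda) * lambda^y / y!."""
    return exp(-lam) * lam**y / factorial(y)

lam_fra, lam_cro = 1.604, 0.954   # rounded fitted means from above
goals = range(7)                  # again only up to six goals per team

# aggregate the outer product of the marginal densities by match result
win  = sum(dpois(f, lam_fra) * dpois(c, lam_cro) for f in goals for c in goals if f > c)
draw = sum(dpois(g, lam_fra) * dpois(g, lam_cro) for g in goals)
loss = sum(dpois(f, lam_fra) * dpois(c, lam_cro) for f in goals for c in goals if f < c)
# win is about 0.52, draw about 0.25, loss about 0.22
</code></pre> <p>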
In R these can be computed as:</p> <pre><code class="language-{r}">sum(res[lower.tri(res)]) ## France wins ## [1] 0.5245 sum(diag(res)) ## draw ## [1] 0.2498 sum(res[upper.tri(res)]) ## France loses ## [1] 0.2243 </code></pre> <p>Note that these probabilities do not sum up to \(1\) because we only considered up to six goals per team but more goals can actually occur with a small probability.</p> <p>Next, we update the expected frequencies table by averaging across the expectations per team/match from the regression model.</p> <pre><code class="language-{r}">expected <- pdf(p_reg, 0:6) head(expected) ## d_0 d_1 d_2 d_3 d_4 d_5 d_6 ## 1 0.1707 0.3017 0.2667 0.15721 0.06949 0.024571 0.0072403 ## 2 0.4208 0.3642 0.1576 0.04548 0.00984 0.001703 0.0002457 ## 3 0.3571 0.3677 0.1893 0.06498 0.01673 0.003444 0.0005911 ## 4 0.2262 0.3362 0.2498 0.12377 0.04599 0.013669 0.0033857 ## 5 0.2380 0.3417 0.2452 0.11732 0.04210 0.012086 0.0028914 ## 6 0.3444 0.3671 0.1957 0.06954 0.01853 0.003952 0.0007022 expected <- colMeans(expected) cbind(observed, expected) ## observed expected ## 0 0.257812 0.294374 ## 1 0.375000 0.340171 ## 2 0.250000 0.214456 ## 3 0.078125 0.098236 ## 4 0.015625 0.036595 ## 5 0.015625 0.011727 ## 6 0.007812 0.003333 </code></pre> <p>As before, observed and expected frequencies are reasonably close, emphasizing that the model has a good marginal fit for this data. To bring out the discrepancies graphically we show the frequencies on a square root scale using a so-called <em>hanging rootogram</em> (<a href="#Kleiber+Zeileis:2016">Kleiber & Zeileis 2016</a>). The gray bars represent the square-root of the observed frequencies “hanging” from the square-root of the expected frequencies in the red line. 
The offset around the x-axis thus shows the difference between the two frequencies, which is reasonably close to zero.</p> <pre><code class="language-{r}">bp <- barplot(sqrt(observed), offset = sqrt(expected) - sqrt(observed), xlab = "Goals", ylab = "sqrt(Frequency)") lines(bp, sqrt(expected), type = "o", pch = 19, lwd = 2, col = 2) abline(h = 0, lty = 2) </code></pre> <p><a href="https://www.zeileis.org/assets/posts/2022-06-23-poisson/rootogram.png"><img src="https://www.zeileis.org/assets/posts/2022-06-23-poisson/rootogram.png" alt="Rootogram for the number of goals in the 2018 FIFA World Cup modeled by a Poisson regression model" /></a></p> <p>Finally, we want to point out that while the log-abilities (and thus their differences) had been obtained based on bookmakers’ odds prior to the tournament, the calibration of the intercept and slope coefficients was done “in-sample”. This means that we have used the data from the tournament itself for estimating the GLM and the evaluation above can only be made <em>ex post</em>. Alternatively, one could have used previous FIFA World Cups for calibrating the coefficients so that probabilistic forecasts for the outcome of all matches (and thus the entire tournament) could have been obtained <em>ex ante</em>. This is the approach used by <a href="#Groll+Ley+Schauberger:2019">Groll <em>et al.</em> (2019)</a> and <a href="#Groll+Hvattum+Ley:2021">Groll <em>et al.</em> (2021)</a>, who added further explanatory variables and used flexible machine learning regression techniques rather than a simple Poisson GLM.</p> <h2 id="technical-details-maximum-likelihood-estimation-of-lambda">Technical details: Maximum likelihood estimation of \(\lambda\)</h2> <p>Fitting a single Poisson distribution with constant \(\lambda\) to \(n\) independent observations \(y_1, \dots, y_n\) using maximum likelihood estimation can be done analytically using basic algebra. 
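</p> <p>Before going through the algebra, the claim can be sanity-checked numerically: for any sample, the Poisson log-likelihood should peak exactly at the sample mean. A small Python sketch with a hypothetical toy sample (not the FIFA data):</p> <pre><code class="language-python">
from math import factorial, log

def loglik(lam, ys):
    """Poisson log-likelihood: sum of log f(y_i; lambda) over the sample."""
    return sum(-lam + y * log(lam) - log(factorial(y)) for y in ys)

ys = [0, 1, 1, 2, 3, 0, 1]        # hypothetical toy sample
ybar = sum(ys) / len(ys)

# the log-likelihood at the sample mean beats nearby candidate values
best = max([ybar - 0.1, ybar, ybar + 0.1], key=lambda lam: loglik(lam, ys))
# best equals ybar
</code></pre> <p>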
First, we set up the log-likelihood function \(\ell\) as the sum of the log-densities per observation:</p> <p>\[\ell(\lambda; y_1, \dots, y_n) = \sum_{i = 1}^n \log f(y_i; \lambda)\]</p> <p>For solving the first-order condition analytically below we need the score function, i.e., the derivative of the log-likelihood with respect to the parameter \(\lambda\). The derivative of the sum is simply the sum of the derivatives:</p> <p>\[\begin{align*} \ell^\prime(\lambda; y_1, \dots, y_n) & = \sum_{i = 1}^n \left\{ \log f(y_i; \lambda) \right\}^\prime \\ & = \sum_{i = 1}^n \left\{ -\lambda + y_i \cdot \log(\lambda) - \log(y_i!) \right\}^\prime \\ & = \sum_{i = 1}^n \left\{ -1 + y_i \cdot \frac{1}{\lambda} \right\} \\ & = -n + \frac{1}{\lambda} \sum_{i = 1}^n y_i \end{align*}\]</p> <p>The first-order condition for maximizing the log-likelihood sets its derivative to zero. This can be solved as follows:</p> <p>\[\begin{align*} \ell^\prime(\lambda; y_1, \dots, y_n) & = 0 \\ -n + \frac{1}{\lambda} \sum_{i = 1}^n y_i & = 0 \\ n \cdot \lambda & = \sum_{i = 1}^n y_i \\ \lambda & = \frac{1}{n} \sum_{i = 1}^n y_i = \bar y \end{align*}\]</p> <p>Thus, the maximum likelihood estimator is simply the empirical mean \(\hat \lambda = \bar y.\)</p> <p>Unfortunately, when the parameter \(\lambda\) is not constant but depends on a linear predictor through a log link \(\log(\lambda_i) = x_i^\top \beta\), the corresponding log-likelihood of the regression coefficients \(\beta\) can not be maximized as easily. 
There is no closed-form solution for the maximum likelihood estimator \(\hat \beta\) which is why the <code class="language-plaintext highlighter-rouge">glm()</code> function employs an iterative numerical algorithm (so-called iteratively weighted least squares) for fitting the model.</p> <h2 id="references">References</h2> <ul> <li><span id="Groll+Hvattum+Ley:2021">Groll A, Hvattum LM, Ley C, Popp F, Schauberger G, Van Eetvelde H, Zeileis A (2021). “Hybrid Machine Learning Forecasts for the UEFA EURO 2020.” arXiv 2106.05799. arXiv.org E-Print Archive. <a href="https://arxiv.org/abs/2106.05799">https://arxiv.org/abs/2106.05799</a></span></li> <li><span id="Groll+Ley+Schauberger:2019">Groll A, Ley C, Schauberger G, Van Eetvelde H (2019). “A Hybrid Random Forest to Predict Soccer Matches in International Tournaments.” <em>Journal of Quantitative Analysis in Sports</em> <strong>15</strong>(4), 271-87. <a href="https://doi.org/10.1515/jqas-2018-0060">https://doi.org/10.1515/jqas-2018-0060</a></span></li> <li><span id="CRAN+distributions3">Hayes A, Moller-Trane R, Jordan D, Northrop P, Lang M, Zeileis A (2022). “distributions3: Probability Distributions as S3 Objects.” R package version 0.2.0, <a href="https://CRAN.R-project.org/package=distributions3">https://CRAN.R-project.org/package=distributions3</a></span></li> <li><span id="Kleiber+Zeileis:2016">Kleiber C, Zeileis A (2016). “Visualizing Count Data Regressions Using Rootograms.” <em>The American Statistician</em> <strong>70</strong>(3), 296-303. <a href="https://doi.org/10.1080/00031305.2016.1173590">https://doi.org/10.1080/00031305.2016.1173590</a></span></li> <li><span id="Leitner+Zeileis+Hornik:2010">Leitner C, Zeileis A, Hornik K (2010). “Forecasting Sports Tournaments by Ratings of (Prob)abilities: A Comparison for the EURO 2008.” <em>International Journal of Forecasting</em> <strong>26</strong>(3), 471-81. 
<a href="https://doi.org/10.1016/j.ijforecast.2009.10.001">https://doi.org/10.1016/j.ijforecast.2009.10.001</a></span></li> <li><span id="Wiki+Poisson">Wikipedia (2022). “Poisson Distribution - Wikipedia, the Free Encyclopedia.” <a href="https://en.wikipedia.org/wiki/Poisson_distribution">https://en.wikipedia.org/wiki/Poisson_distribution</a>, accessed 2022-02-21.</span></li> <li><span id="Zeileis+Leitner+Hornik:2018">Zeileis A, Leitner C, Hornik K (2018). “Probabilistic Forecasts for the 2018 FIFA World Cup Based on the Bookmaker Consensus Model.” Working Paper 2018-09. Working Papers in Economics & Statistics, Research Platform Empirical & Experimental Economics, Universität Innsbruck. <a href="https://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09">https://EconPapers.RePEc.org/RePEc:inn:wpaper:2018-09</a></span></li> </ul>2022-06-23T00:00:00+02:00https://www.zeileis.org/news/euro2020knockout/Updated forecasts for the UEFA Euro 2020 knockout stage2021-06-25T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/After all group stage matches at the UEFA Euro 2020 we have updated the knockout stage forecasts by re-training our hybrid random forest model on the extended data. This shows that England profits most from the realized tournament draw.<p>After all group stage matches at the UEFA Euro 2020 we have updated the knockout stage forecasts by re-training our hybrid random forest model on the extended data. This shows that England profits most from the realized tournament draw.</p> <h2 id="updates">Updates</h2> <p>After the 36 matches of the group stage were completed earlier this week, we decided to update our <a href="https://www.zeileis.org/news/euro2020/">probabilistic forecast for the UEFA Euro 2020</a>. 
As the <a href="https://www.zeileis.org/news/euro2020group/">evaluation of the group stage</a> showed that, by and large, the forecasts worked reasonably well up to this point, we kept our general strategy and just made a few updates:</p> <ul> <li>The <em>historic match abilities</em> for all teams were updated to incorporate the results from the 36 additional matches from the group stage. Given that the estimates are weighted such that the most recent results have a higher influence, this changed the estimates of the team abilities somewhat.</li> <li>The <em>average plus-minus player ratings</em> for all teams were also updated but these changed to a lesser degree given that each team only played three additional matches.</li> <li>All other covariates (bookmaker consensus, market value, etc.) were left unchanged.</li> <li>The learning data set for the hybrid random forest that combines all the predictors was extended: In addition to all the matches from the UEFA Euro 2004-2016 it now includes the group stage results from this year’s Euro.</li> <li>The resulting predicted number of goals for each team can then be used to simulate the entire knockout stage 100,000 times.</li> </ul> <p>While all the changes above have a certain influence, the biggest effect arguably comes from the last item: Because the match-ups for the round of 16 are fixed now, there is a lot less variation in the potential courses of the tournament. Specifically, it is now clear that there are more top favorites in the upper half of the tournament tableau (namely France, Spain, Italy, Belgium, Portugal) than in the lower half of the tableau (England, Germany, Netherlands). 
The consequences of this are shown in more detail below.</p> <h2 id="winning-probabilities">Winning probabilities</h2> <p>The updated results show that England has now become the top favorite for the title with a winning probability of 17.4% because they are more likely to face weaker opponents provided they beat Germany in the round of 16. Our top favorite from the pre-tournament forecast was France and they now rank second with an almost unchanged winning probability of about 15.0%. The winning probabilities for all teams are shown in the barchart below with more information linked in the interactive full-width version.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_win.png" alt="Barchart: Winning probabilities" /></a></p> <p>Somewhat surprisingly, Italy still has a rather low winning probability of only 7.3% whereas they are now among the top three teams according to most bookmaker odds. This is most likely due to the tournament draw: If they beat Austria in the round of 16, they meet either the FIFA top-ranked team Belgium or defending champion Portugal in the quarter final. In a potential semi-final they would have a high chance of facing either France or Spain.</p> <h2 id="match-probabilities">Match probabilities</h2> <p>Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. Using these, we can compute the probability that a certain match ends in a <em>win</em>, a <em>draw</em>, or a <em>loss</em> in normal time. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.</p> <p>The resulting probability that one team beats the other in a knockout match is depicted in the heatmap below. 
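</p> <p>The mechanics described above can be sketched as follows, assuming (as in the Poisson model employed) independent Poisson goal counts per team; the truncation at ten goals and the scaling of the expected goals by one third for the 30 minutes of overtime are simplifying assumptions for illustration, not the published model:</p> <pre><code class="language-python">
from math import exp, factorial

def dpois(y, lam):
    """Poisson density f(y; lambda)."""
    return exp(-lam) * lam**y / factorial(y)

def outcome_probs(lam_a, lam_b, max_goals=10):
    """Win/draw/loss probabilities for team A, assuming independent
    Poisson goal counts (truncated at max_goals)."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = dpois(a, lam_a) * dpois(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss

def knockout_win_prob(lam_a, lam_b):
    """P(team A advances): win in normal time, or draw and then win in
    overtime (one third of the expected goals), or a coin flip for penalties."""
    win90, draw90, _ = outcome_probs(lam_a, lam_b)
    win30, draw30, _ = outcome_probs(lam_a / 3, lam_b / 3)
    return win90 + draw90 * (win30 + draw30 * 0.5)
</code></pre> <p>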
The color scheme uses green vs. brown to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match results after normal time.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_match.png" alt="Heatmap: Match probabilities" /></a></p> <h2 id="performance-throughout-the-tournament">Performance throughout the tournament</h2> <p>As every single match can be simulated with the pairwise probabilities above, we are able to simulate the entire knockout stage 100,000 times to provide “survival” probabilities for each team across the remaining stages. Teams in the upper half of the tournament tableau are shown in orange while the lower half teams are shown in blue.</p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.html">Interactive full-width graphic</a></p> <p><a href="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.html"><img src="https://www.zeileis.org/assets/posts/2021-06-25-euro2020knockout/p_surv.png" alt="Line plot: Survival probabilities" /></a></p> <p>This shows that England has relatively low chances of surviving the round of 16 - at least compared to other top teams like France, Italy, or Netherlands who play against weaker opponents. However, provided England proceeds to the quarter final, they have a really high probability of prevailing up to the final match.</p> <p>In summary, the updates compared to the pre-tournament forecast changed but maybe not as much as expected. 
The most important change in information is that the remaining course of the tournament is rather clear now while the knowledge from the 36 group stage matches themselves has only moderate effects. Thus, the most exciting part of the UEFA Euro 2020 is only starting now and we can all be curious what is going to happen. Everything is still possible! (Recall that in the 2016 tournament Portugal eventually took the championship despite not winning a single group stage match and ranking third in their group.)</p>2021-06-25T00:00:00+02:00https://www.zeileis.org/news/euro2020group/Evaluation of the UEFA Euro 2020 group stage forecast2021-06-24T00:00:00+02:00Achim ZeileisAchim.Zeileis@R-project.orghttps://www.zeileis.org/A look back on the group stage of the UEFA Euro 2020 to check whether our hybrid machine learning forecasts were any good...<p>A look back on the group stage of the UEFA Euro 2020 to check whether our hybrid machine learning forecasts were any good...</p> <h2 id="how-surprising-was-the-group-stage">How surprising was the group stage?</h2> <p>Yesterday the group stage of the UEFA Euro 2020 was concluded with the final matches in Groups E and F so that all pairings for the round of 16 are fixed now. Therefore, today we want to address two questions regarding our own <a href="https://www.zeileis.org/news/euro2020/">probabilistic forecast for the UEFA Euro 2020</a> based on a hybrid machine learning model that we published prior to the tournament:</p> <ol> <li>How good were the predictions for the group stage? Were the actual outcomes surprising?</li> <li>How can we update the forecasts for the knockout stage starting with the round of 16 on the weekend?</li> </ol> <p>The first of these questions is answered in this post while the second question will be deferred to tomorrow’s post.</p> <p><strong>TL;DR</strong> All of our predictions worked quite well and most results were within the expected range of random variation. 
All tournament favorites proceeded to the round of 16 and mostly the weakest teams dropped out of the tournament. Only in Group E was the final ranking a bit more surprising, with Spain ending up second behind Sweden and Poland finishing last and dropping out. At the individual match level there were a couple of games where the clearly stronger team failed to take the win; especially Hungary’s two draws in the “killer group” F were a bit surprising. But other than that the more exciting part of the tournament is still ahead of us!</p> <h2 id="group-stage-results">Group stage results</h2> <p>First, we look at the results in terms of which teams successfully proceeded from the group stage to the round of 16. The barplot below shows all teams along with their predicted winning probability for the entire tournament, with the color highlighting elimination from the tournament prior to the knockout stage.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/barplot.png" alt="Probabilities to win the tournament with highlighting of teams advancing to the knockout stage" /></p> <p>Clearly, only teams from the lower half were eliminated with the most unexpected drop-out being Poland. Also, it may seem somewhat surprising that both the Czech Republic and Ukraine “survived” the group stage but with four out of six third-ranked teams advancing to the round of 16 this is not very unexpected.</p> <p>Looking at the rankings in each group in a bit more detail we see that most group results are as expected. Only in Group E is the ranking really a surprise, with Sweden playing stronger than expected and even winning the group. 
On the other hand, Poland’s performance was somewhat disappointing (as already mentioned above) and Spain waited until the third game (a 5-0 win against Slovakia) to show their full potential.</p> <div class="row"> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">A <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>ITA</strong> <br /> <strong>WAL</strong> <br /> <strong>SUI</strong> <br /> TUR</td> <td style="text-align: right"><strong>88.8</strong> <br /> <strong>53.7</strong> <br /> <strong>72.3</strong> <br /> 53.3</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">B <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> 3 <br /> 4</td> <td style="text-align: left"><strong>BEL</strong> <br /> <strong>DEN</strong> <br /> FIN <br /> RUS</td> <td style="text-align: right"><strong>91.5</strong> <br /> <strong>84.5</strong> <br /> 37.1 <br /> 52.0</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">C <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>NED</strong> <br /> <strong>AUT</strong> <br /> <strong>UKR</strong> <br /> MKD</td> <td style="text-align: right"><strong>93.4</strong> <br /> 
<strong>80.9</strong> <br /> <strong>57.4</strong> <br /> 32.9</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">D <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>ENG</strong> <br /> <strong>CRO</strong> <br /> <strong>CZE</strong> <br /> SCO</td> <td style="text-align: right"><strong>94.6</strong> <br /> <strong>78.0</strong> <br /> <strong>40.8</strong> <br /> 49.8</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">E <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> 3 <br /> 4</td> <td style="text-align: left"><strong>SWE</strong> <br /> <strong>ESP</strong> <br /> SVK <br /> POL</td> <td style="text-align: right"><strong>59.8</strong> <br /> <strong>94.0</strong> <br /> 44.9 <br /> 66.2</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 large-2 columns"> <table> <thead> <tr> <th style="text-align: left">F <br /> Rank</th> <th style="text-align: left"> <br /> Team</th> <th style="text-align: right"> <br /> Prob.</th> </tr> </thead> <tbody> <tr> <td style="text-align: left"><strong>1</strong> <br /> <strong>2</strong> <br /> <strong>3</strong> <br /> 4</td> <td style="text-align: left"><strong>FRA</strong> <br /> <strong>GER</strong> <br /> <strong>POR</strong> <br /> HUN</td> <td style="text-align: right"><strong>89.7</strong> <br /> <strong>85.3</strong> <br /> <strong>85.3</strong> <br /> 13.9</td> </tr> </tbody> </table> </div> <div class="t20 small-6 medium-3 
large-2 columns"> </div> <div class="t20 small-6 medium-3 large-2 columns"> </div> </div> <h2 id="match-results">Match results</h2> <p>After seeing that all the favorites prevailed and only relatively weak teams dropped out of the tournament, we take a closer look at the 36 individual group-stage matches to check whether we had any major surprises. The stacked bar plot below groups all match results into four categories by their expected goal difference for the stronger vs. the weaker team.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/match.png" alt="Observed match outcome vs. expected goal difference" /></p> <p>In the first bar, the stronger team was expected to be only marginally better, with 0 to 0.25 more predicted goals on average. In this bar we see that the stronger team won half of the matches (4 out of 8) while the other half was either lost (3 matches) or ended in a draw (1 match). In short, the distribution of match outcomes conforms almost exactly with the predictions.</p> <p>The same is true for the second and third bars, where the expected goal difference for the stronger team was between 0.26 and 0.6 or between 0.6 and 1, respectively. The stronger team won 70.0% of the ten matches and 77.7% of the nine matches, respectively, thus conforming closely with the predictions.</p> <p>Only in the last bar, with the highest expected goal differences (between 1 and 2 goals), is the picture somewhat unexpected.</p> <ol> <li>There were three draws (out of nine matches), two of which by underdog Hungary against the much stronger teams France and Germany. Ultimately, Hungary nevertheless finished last in Group F.</li> <li>One of these nine matches was even lost by the clear favorite but this match was the 0-1 of Denmark vs. Finland. During this match, Danish key player Christian Eriksen suffered a cardiac arrest and had to be resuscitated in the stadium before being brought to the hospital. 
Denmark then had to continue playing the match later that evening and were clearly still in shock. Needless to say, no forecasting model (that we are aware of) would incorporate such extreme and rare events.</li> </ol> <p>As a final evaluation we check whether the observed number of goals per team in each match conforms with the expected distribution based on the Poisson model employed. This is brought out graphically by a so-called <a href="https://dx.doi.org/10.1080/00031305.2016.1173590">hanging rootogram</a>.</p> <p><img src="https://www.zeileis.org/assets/posts/2021-06-24-euro2020group/goals.png" alt="Hanging rootogram with observed and expected frequencies of number of goals" /></p> <p>The red line shows the square root of the expected frequencies while the “hanging” gray bars represent the square root of the observed frequencies. This shows that the predictions conform closely with the actual observations. There were only a few more occurrences of three goals (ten times) than expected (6.1 times) but this deviation is also within the bounds of random variation.</p>2021-06-24T00:00:00+02:00