Earlier this week we published our probabilistic UEFA Euro 2020 forecast, which combines the expertise of football modelers from four different research teams with the flexibility of machine learning. To explain exactly which data and methods were used, we have also written a working paper, now published in the arXiv.org e-Print archive.
Moreover, we take the opportunity to provide further insights that can be obtained from our forecast for the results of the group stage, which starts at the end of this week with the opening match between Italy and Turkey in Rome in Group A. More precisely, predicted probabilities for a win, draw, or loss in each of the 36 group-stage matches are provided in interactive heatmaps for all groups.
Citation:
Groll A, Hvattum LM, Ley C, Popp F, Schauberger G, Van Eetvelde H, Zeileis A (2021). “Hybrid Machine Learning Forecasts for the UEFA EURO 2020.” arXiv:2106.05799, arXiv.org e-Print archive. https://arxiv.org/abs/2106.05799
Abstract:
Three state-of-the-art statistical ranking methods for forecasting football matches are combined with several other predictors in a hybrid machine learning model. Namely an ability estimate for every team based on historic matches; an ability estimate for every team based on bookmaker consensus; average plus-minus player ratings based on their individual performances in their home clubs and national teams; and further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). The proposed combined approach is used for learning the number of goals scored in the matches from the four previous UEFA EUROs 2004-2016 and then applied to current information to forecast the upcoming UEFA EURO 2020. Based on the resulting estimates, the tournament is simulated repeatedly and winning probabilities are obtained for all teams. A random forest model favors the current World Champion France with a winning probability of 14.8% before England (13.5%) and Spain (12.3%). Additionally, we provide survival probabilities for all teams and at all tournament stages.
Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match in the group stage. As there are typically more goals in the group stage than in the knockout stage, a different expected number of goals is fitted for the two stages by including a corresponding binary dummy variable in the regression model. While the heatmap shown in our previous blog post contained the probabilities for all possible matches in the knockout stage, we complement this information here by showing separate heatmaps for all groups.
The color scheme visualizes the winning probability of the team in the row over the team in the column. Light red or orange vs. dark green or blue signals low vs. high winning probabilities. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss.
Interactive full-width graphics: Group A, Group B, Group C, Group D, Group E, Group F.
The forecast is based on a conditional inference random forest learner that combines four main sources of information: an ability estimate for every team based on historic matches; an ability estimate for every team based on odds from 19 bookmakers; average ratings of the players in each team based on their individual performances in their home clubs and national teams; and further team covariates (e.g., market value, team structure) and country-specific socio-economic factors (population, GDP). The random forest model is learned using the UEFA Euro tournaments from 2004 to 2016 as training data and then applied to current information to obtain a forecast for the UEFA Euro 2020. The random forest forecasts actually provide the predicted number of goals for each team in all possible matches in the tournament, so that a bivariate Poisson distribution can be used to compute the probabilities for a win, draw, or loss in such a match. Based on these match probabilities the entire tournament can be simulated 100,000 times, yielding winning probabilities for each team. The results show that the current World Champion France is also the favorite for the European title with a winning probability of 14.8%, followed by England with 13.5%, and Spain with 12.3%. The winning probabilities for all teams are shown in the bar chart below with more information linked in the interactive full-width version.
Interactive full-width graphic
The full study has been conducted by an international team of researchers: Andreas Groll, Lars Magnus Hvattum, Christophe Ley, Franziska Popp, Gunther Schauberger, Hans Van Eetvelde, Achim Zeileis. The corresponding working paper will be published on arXiv in the next couple of days. The core of the contribution is a hybrid approach that starts out from four state-of-the-art forecasting methods, based on disparate sets of information, and lets an adaptive machine learning model decide how to best combine these forecasts.
Historic match abilities:
An ability estimate is obtained for every team based on “retrospective” data, namely all historic national matches over the last 8 years. A bivariate Poisson model with team-specific fixed effects is fitted to the number of goals scored by both teams in each match. However, rather than equally weighting all matches to obtain average team abilities (or team strengths) over the entire history period, an exponential weighting scheme is employed. This assigns more weight to more recent results and thus yields an estimate of current team abilities. More details can be found in Ley, Van de Wiele, Van Eetvelde (2019).
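The weighting idea can be sketched in a few lines of R. Note that the function name and the half period below are illustrative placeholders, not the values estimated in the actual model:

```r
## Illustrative sketch of the exponential weighting scheme: each match's weight
## is halved after a fixed "half period" (placeholder value, not from the paper).
match_weight <- function(days_ago, half_period = 3 * 365) {
  0.5^(days_ago / half_period)
}
## Yesterday's match counts almost fully, a match from six years ago only 25%:
round(match_weight(c(1, 3 * 365, 6 * 365)), 3)
# -> 0.999 0.500 0.250
```
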
Bookmaker consensus abilities:
Another ability estimate for every team is obtained based on “prospective” data, namely the odds of 19 international bookmakers that reflect their expert expectations for the tournament. Using the bookmaker consensus model of Leitner, Zeileis, Hornik (2010), the bookmaker odds are first adjusted for the bookmakers’ profit margins (“overround”) and then averaged (on a logit scale) to obtain a consensus for the winning probability of each team. To adjust for the effects of the tournament draw (which might have led to easier or harder groups for some teams), an “inverse” simulation approach is used to infer which team abilities are most likely to lead to these winning probabilities.
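A heavily simplified sketch of the margin adjustment and logit averaging is given below. The odds are made up and a fixed 5% margin is assumed per bookmaker, whereas the actual model estimates each bookmaker's overround jointly across all outcomes:

```r
## Sketch of the consensus idea with made-up decimal odds from three bookmakers
## for one team; a fixed 5% profit margin per bookmaker is assumed for simplicity.
odds <- c(7.5, 8.0, 7.0)                  # hypothetical winning odds
p_adj <- 1 / (odds * 1.05)                # margin-adjusted winning probabilities
consensus <- plogis(mean(qlogis(p_adj)))  # average on the logit scale
round(consensus, 3)
```
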
Average player ratings:
To infer the contributions of individual players in a match, the plus-minus player ratings of Hvattum (2019) dissect all matches with a certain player (both on club and on national level) into segments, e.g., between substitutions. Subsequently, the goal difference achieved in these segments is linked to the presence of the individual players during that segment. This yields individual ratings for all players that can be aggregated to average player ratings for each team.
Hybrid random forests:
Finally, machine learning is used to combine these three highly aggregated and informative variables above along with a broad range of further relevant covariates, yielding refined probabilistic forecasts for each match. Such a hybrid approach was first suggested by Groll, Ley, Schauberger, Van Eetvelde (2019). The task the random forest learner has to accomplish is to combine the three highly-informative team variables above with further team-specific information that may or may not be relevant to the team’s performance. The covariates considered comprise team-specific details (e.g., market value, FIFA rank, team structure) as well as country-specific socio-economic factors (population and GDP per capita). By combining a large ensemble of rather weakly informative regression trees in a random forest, the relative importance of all the covariates can be inferred automatically. The resulting predicted number of goals for each team can then finally be used to simulate the entire tournament 100,000 times.
Using the hybrid random forest, an expected number of goals is obtained for both teams in each possible match. The covariate information used for this is the difference between the two teams in each of the variables listed above, i.e., the difference in historic match abilities (on a log scale), the difference in bookmaker consensus abilities (on a log scale), the difference in average player ratings of the teams, etc. Assuming a bivariate Poisson distribution with the expected numbers of goals for both teams, we can compute the probability that a certain match ends in a win, a draw, or a loss. The same can be repeated in overtime, if necessary, and a coin flip is used to decide penalties, if needed.
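The computation of win/draw/loss probabilities from expected goals can be sketched as follows. For simplicity, independent Poisson margins are used here instead of the bivariate Poisson distribution employed in the actual forecast, and both the function name and the expected-goal values are illustrative:

```r
## Sketch: win/draw/loss probabilities from the expected number of goals of the
## two teams, using independent Poisson margins as a simplification of the
## bivariate Poisson distribution in the study.
match_probs <- function(lambda1, lambda2, maxgoals = 15) {
  ## joint probability of (goals1, goals2) combinations
  p <- outer(dpois(0:maxgoals, lambda1), dpois(0:maxgoals, lambda2))
  c(win = sum(p[lower.tri(p)]), draw = sum(diag(p)), lose = sum(p[upper.tri(p)]))
}
round(match_probs(1.5, 1.1), 3)
```
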
The following heatmap shows for each possible combination of teams the probability that one team beats the other team in a knockout match. The color scheme uses green vs. brown to signal probabilities above vs. below 50%, respectively. The tooltips for each match in the interactive version of the graphic also print the probabilities for the match to end in a win, draw, or loss after normal time.
Interactive full-width graphic
As every single match can be simulated with the pairwise probabilities above, it is also straightforward to simulate the entire tournament (here: 100,000 times), providing “survival” probabilities for each team across the different stages.
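The simulation of a single knockout tie can be sketched like this. The probabilities are made up, not taken from the forecast, and the overtime/penalty stage is approximated by a coin flip as described above:

```r
## Sketch: simulating one knockout tie 100,000 times from given win/draw/lose
## probabilities after normal time (made-up values, not from the forecast);
## a draw is resolved in overtime/penalties, approximated here by a coin flip.
set.seed(42)
p <- c(win = 0.45, draw = 0.25, lose = 0.30)
outcome <- sample(names(p), 100000, replace = TRUE, prob = p)
advance <- mean(outcome == "win") + 0.5 * mean(outcome == "draw")
advance  # close to 0.45 + 0.25 / 2 = 0.575
```
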
Interactive full-width graphic
All our forecasts are probabilistic, clearly below 100%, and thus by no means certain. Especially the results in group F are hard to predict but may play a crucial role for the tournament. The reason is that this group comprises three very strong teams with current World Champion France, defending European Champion Portugal, and Germany which generally has an excellent record at international tournaments. Moreover, the runner-up in this group will play against the winner from group D with favorite England. Hence, it is likely that this will lead to a very tough knockout match in the round of 16, possibly even between the two top favorites France and England, but it is hard to predict the exact pair of teams that will face each other in this match.
Another interesting observation is that the winning probability for Belgium is only moderately high with 8.3%. This is notable as Belgium currently leads the FIFA/Coca-Cola World Ranking and is also judged to have a much higher winning probability by the bookmaker consensus model with 12.1%.
In any case, all of this means that even though we can quantify in terms of probabilities what is likely to happen during the UEFA Euro 2020, the outcome is far from predetermined. Hence, we can all look forward to finally watching this exciting tournament and hope it will bring a little bit of the joy that we have been missing over this difficult last year.
The ivreg package (by John Fox, Christian Kleiber, and Achim Zeileis) provides a comprehensive implementation of instrumental variables regression using two-stage least-squares (2SLS) estimation. The standard regression functionality (parameter estimation, inference, robust covariances, predictions, etc.) is derived from and supersedes the ivreg() function in the AER package. Additionally, various regression diagnostics are supported, including hat values; deletion diagnostics such as studentized residuals and Cook’s distances; graphical diagnostics such as component-plus-residual plots and added-variable plots; and effect plots with partial residuals.
An overview of the package along with vignettes and detailed documentation etc. is available on its web site at https://john-d-fox.github.io/ivreg/. This post is an abbreviated version of the “Getting started” vignette.
The ivreg package integrates seamlessly with other packages by providing suitable S3 methods, specifically for generic functions in the base-R stats package, and in the car, effects, lmtest, and sandwich packages, among others. Moreover, it cooperates well with other object-oriented packages for regression modeling such as broom and modelsummary.
For demonstrating the ivreg package in practice, we investigate the effect of schooling on earnings in a classical model for wage determination. The data are from the United States and are provided in the package as SchoolingReturns. This data set was originally studied by David Card and was subsequently employed, as here, to illustrate 2SLS estimation in introductory econometrics textbooks. The relevant variables for this illustration are:
data("SchoolingReturns", package = "ivreg")
summary(SchoolingReturns[, 1:8])
## wage education experience ethnicity smsa
## Min. : 100.0 Min. : 1.00 Min. : 0.000 other:2307 no : 864
## 1st Qu.: 394.2 1st Qu.:12.00 1st Qu.: 6.000 afam : 703 yes:2146
## Median : 537.5 Median :13.00 Median : 8.000
## Mean : 577.3 Mean :13.26 Mean : 8.856
## 3rd Qu.: 708.8 3rd Qu.:16.00 3rd Qu.:11.000
## Max. :2404.0 Max. :18.00 Max. :23.000
## south age nearcollege
## no :1795 Min. :24.00 no : 957
## yes:1215 1st Qu.:25.00 yes:2053
## Median :28.00
## Mean :28.12
## 3rd Qu.:31.00
## Max. :34.00
A standard wage equation uses a semi-logarithmic linear regression for wage, estimated by ordinary least squares (OLS), with years of education as the primary explanatory variable, adjusting for a quadratic term in labor-market experience, as well as for factors coding ethnicity, residence in a city (smsa), and residence in the U.S. south:
m_ols <- lm(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south,
data = SchoolingReturns)
summary(m_ols)
## Call:
## lm(formula = log(wage) ~ education + poly(experience, 2) + ethnicity +
## smsa + south, data = SchoolingReturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59297 -0.22315 0.01893 0.24223 1.33190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.259820 0.048871 107.626 < 2e-16 ***
## education 0.074009 0.003505 21.113 < 2e-16 ***
## poly(experience, 2)1 8.931699 0.494804 18.051 < 2e-16 ***
## poly(experience, 2)2 -2.642043 0.374739 -7.050 2.21e-12 ***
## ethnicityafam -0.189632 0.017627 -10.758 < 2e-16 ***
## smsayes 0.161423 0.015573 10.365 < 2e-16 ***
## southyes -0.124862 0.015118 -8.259 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3742 on 3003 degrees of freedom
## Multiple R-squared: 0.2905, Adjusted R-squared: 0.2891
## F-statistic: 204.9 on 6 and 3003 DF, p-value: < 2.2e-16
Thus, OLS estimation yields an estimate of 7.4% per year for returns to schooling. This estimate is problematic, however, because it can be argued that education is endogenous (and hence also experience, which is taken to be age minus education minus 6). We therefore use geographical proximity to a college when growing up as an exogenous instrument for education. Additionally, age is the natural exogenous instrument for experience, while the remaining explanatory variables can be considered exogenous and are thus used as instruments for themselves.
Although it’s a useful strategy to select an effective instrument or instruments for each endogenous
explanatory variable, in 2SLS regression all of the instrumental variables are used to estimate all
of the regression coefficients in the model.
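The mechanics behind this can be illustrated with a deliberately simplified “by hand” two-stage computation on simulated data with a single endogenous regressor x and a single instrument z. All names and numbers below are purely illustrative; in practice ivreg() should be used, not least because the second-stage lm() standard errors are not the correct 2SLS ones:

```r
## Sketch of the 2SLS mechanics on simulated data: x is endogenous because it
## shares the unobserved confounder u with y, while z is a valid instrument.
set.seed(1)
n <- 1000
z <- rnorm(n)                   # instrument
u <- rnorm(n)                   # unobserved confounder
x <- z + u + rnorm(n)           # endogenous regressor
y <- 1 + 0.5 * x + u + rnorm(n) # true coefficient of x is 0.5
coef(lm(y ~ x))["x"]            # OLS: biased away from 0.5
s1 <- lm(x ~ z)                 # stage 1: project x on the instrument
coef(lm(y ~ fitted(s1)))[2]     # stage 2: approximately recovers 0.5
```
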
To fit this model with ivreg() we can simply extend the formula from lm() above, adding a second part after the | separator to specify the instrumental variables:
library("ivreg")
m_iv <- ivreg(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south |
nearcollege + poly(age, 2) + ethnicity + smsa + south,
data = SchoolingReturns)
Equivalently, the same model can also be specified slightly more concisely using three parts on the right-hand side, indicating the exogenous variables, the endogenous variables, and the additional instrumental variables (beyond the exogenous variables), respectively.
m_iv <- ivreg(log(wage) ~ ethnicity + smsa + south | education + poly(experience, 2) |
nearcollege + poly(age, 2), data = SchoolingReturns)
Both models yield the following results:
summary(m_iv)
## Call:
## ivreg(formula = log(wage) ~ education + poly(experience, 2) +
## ethnicity + smsa + south | nearcollege + poly(age, 2) + ethnicity +
## smsa + south, data = SchoolingReturns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82400 -0.25248 0.02286 0.26349 1.31561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.48522 0.67538 6.641 3.68e-11 ***
## education 0.13295 0.05138 2.588 0.009712 **
## poly(experience, 2)1 9.14172 0.56350 16.223 < 2e-16 ***
## poly(experience, 2)2 -0.93810 1.58024 -0.594 0.552797
## ethnicityafam -0.10314 0.07737 -1.333 0.182624
## smsayes 0.10798 0.04974 2.171 0.030010 *
## southyes -0.09818 0.02876 -3.413 0.000651 ***
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments (education) 3 3003 8.008 2.58e-05 ***
## Weak instruments (poly(experience, 2)1) 3 3003 1612.707 < 2e-16 ***
## Weak instruments (poly(experience, 2)2) 3 3003 174.166 < 2e-16 ***
## Wu-Hausman 2 3001 0.841 0.432
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4032 on 3003 degrees of freedom
## Multiple R-Squared: 0.1764, Adjusted R-squared: 0.1747
## Wald test: 148.1 on 6 and 3003 DF, p-value: < 2.2e-16
Thus, using two-stage least squares to estimate the regression yields a much larger coefficient for the returns to schooling, namely 13.3% per year. Notice as well that the standard errors of the coefficients are larger for 2SLS estimation than for OLS, and that, partly as a consequence, evidence for the effects of ethnicity and the quadratic component of experience is now weak. These differences are brought out more clearly when showing coefficients and standard errors side by side, e.g., using the compareCoefs() function from the car package or the msummary() function from the modelsummary package:
library("modelsummary")
m_list <- list(OLS = m_ols, IV = m_iv)
msummary(m_list)
OLS | IV | |
---|---|---|
(Intercept) | 5.260 (0.049) |
4.485 (0.675) |
education | 0.074 (0.004) |
0.133 (0.051) |
poly(experience, 2)1 | 8.932 (0.495) |
9.142 (0.564) |
poly(experience, 2)2 | -2.642 (0.375) |
-0.938 (1.580) |
ethnicityafam | -0.190 (0.018) |
-0.103 (0.077) |
smsayes | 0.161 (0.016) |
0.108 (0.050) |
southyes | -0.125 (0.015) |
-0.098 (0.029) |
Num.Obs. | 3010 | 3010 |
R2 | 0.291 | 0.176 |
R2 Adj. | 0.289 | 0.175 |
AIC | 2633.4 | |
BIC | 2681.5 | |
Log.Lik. | -1308.702 | |
F | 204.932 |
The change in coefficients and associated standard errors can also be brought out graphically using the modelplot() function from modelsummary, which shows the coefficient estimates along with their 95% confidence intervals. Below we omit the intercept and experience terms as these are on a different scale than the other coefficients.
modelplot(m_list, coef_omit = "Intercept|experience")
In many areas of psychology, correlation-based network approaches (i.e., psychometric networks) have become a popular tool. In this paper, we propose an approach that recursively splits the sample based on covariates in order to detect significant differences in the structure of the covariance or correlation matrix. Psychometric networks or other correlation-based models (e.g., factor models) can be subsequently estimated from the resultant splits. We adapt model-based recursive partitioning and conditional inference tree approaches for finding covariate splits in a recursive manner. The empirical power of these approaches is studied in several simulation conditions. Examples are given using real-life data from personality and clinical research.
All methods discussed are implemented in the R package networktree, which is developed on GitHub with stable versions released on CRAN (Comprehensive R Archive Network). Version 1.0.0 accompanies the publication in Psychometrika and version 1.0.1 adds a few small enhancements and bug fixes, specifically for the plotting infrastructure. Furthermore, a nice web page with introductory examples, documentation, release notes, etc. has been produced with the wonderful pkgdown.
The idea of psychometric networks is to provide information about the statistical relationships between observed variables. Network trees aim to reveal heterogeneities in these relationships based on observed covariates. This strategy is implemented in the R package networktree, building on the general tree algorithms in the partykit package.
For illustration, we consider a depression network, where the nodes represent different symptoms, and detect heterogeneities with respect to age and race. The data used below is provided by https://openpsychometrics.org/ and was obtained using the Depression Anxiety and Stress Scale (DASS), a self-report instrument for measuring depression, anxiety, and tension or stress. It is available in the networktree package as dass. To make resulting graphics and summaries easier to interpret we use the following variable names for the depression symptoms, which are measured with certain questions from the DASS:
- anhedonia (Question 3: I couldn’t seem to experience any positive feeling at all.)
- initiative (Question 42: I found it difficult to work up the initiative to do things.)
- lookforward (Question 10: I felt that I had nothing to look forward to.)
- sad (Question 13: I felt sad and depressed.)
- unenthused (Question 31: I was unable to become enthusiastic about anything.)
- worthless (Question 17: I felt I wasn’t worth much as a person.)
- meaningless (Question 38: I felt that life was meaningless.)

First, we load the data and relabel the variables for the depression symptoms:
library("networktree")
data("dass", package = "networktree")
names(dass)[c(3, 42, 10, 13, 31, 17, 38)] <- c("anhedonia", "initiative", "lookforward",
"sad", "unenthused", "worthless", "meaningless")
Subsequently, we fit a networktree() where the relationship between the symptoms (anhedonia + initiative + lookforward + sad + unenthused + worthless + meaningless) is “explained by” (~) the covariates (age + race). (As an alternative to this formula-based interface it is also possible to specify groups of dependent and split variables, respectively, through separate data frames.) The threshold for detecting significant differences in correlations is set to 1% (plus Bonferroni adjustment for testing two covariates at each step).
tr <- networktree(anhedonia + initiative + lookforward + sad + unenthused +
worthless + meaningless ~ age + race, data = dass, alpha = 0.01)
The resulting network tree can be easily visualized with plot(tr), which would display the raw correlations. As these are generally high between all depression symptoms, we use a display with partial correlations (transform = "pcor") instead. This brings out differences between the detected subgroups somewhat more clearly. (Note that version 1.0.1 of networktree is needed for this to work correctly.)
plot(tr, transform = "pcor")
This shows that the network tree detects three subgroups. First, the correlations of the depression symptoms change across age, with the largest difference between “younger” and “older” persons in the sample at a split point of 30 years. Second, the correlations differ with respect to race for the older persons in the sample, with the largest difference between Arab/Black/Native American/White and Asian/Other. The differences in the symptom correlations affect various pairs of symptoms, as brought out in the network display produced by the qgraph package in the terminal nodes. For example, the “centrality” of anhedonia changes across the three detected subgroups: for the older Asian/Other persons it is partially correlated with most other symptoms, while this is less pronounced for the other two subgroups.
The networks visualized in the tree can also be extracted easily using the getnetwork() function. For example, the partial correlation matrix corresponding to the older Asian/Other group (node 5) can be obtained by:
getnetwork(tr, id = 5, transform = "pcor")
To explore the returned object tr in some more detail, the print() method gives a printed version of the tree structure but does not display the associated parameters.
tr
## Network tree object
##
## Model formula:
## anhedonia + initiative + lookforward + sad + unenthused + worthless +
## meaningless ~ age + race
##
## Fitted party:
## [1] root
## | [2] age <= 30
## | [3] age > 30
## | | [4] race in Arab, Black, Native American, White
## | | [5] race in Asian, Other
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
## Number of parameters per node: 21
## Objective function: 42301.84
The estimated correlation parameters in the subgroups can be extracted with coef(tr), here returning a 3 x 21 matrix for the 21 pairs of symptom correlations and the 3 subgroups. To show two symptom pairs with larger correlation differences we extract the correlations of anhedonia with worthless and meaningless, respectively. Note that these are the raw correlations and not the partial correlations displayed in the tree above.
coef(tr)[, 5:6]
## rho_anhedonia_worthless rho_anhedonia_meaningless
## 2 0.5595725 0.5994682
## 4 0.6741686 0.6339481
## 5 0.6639088 0.7178744
Finally, we extract the p-values of the underlying parameter instability tests to gain some insight into how the tree was constructed. In each step we assess whether the correlation parameters are stable across each of the two covariates age and race or whether there are significant changes. The corresponding test statistics and Bonferroni-adjusted p-values can be extracted with the sctest() function (for “structural change test”). For example, in Node 1 there are significant instabilities with respect to both variables, but age has the lower p-value and is hence selected for partitioning the data:
library("strucchange")
sctest(tr, node = 1)
## age race
## statistic 7.151935e+01 1.781216e+02
## p.value 1.787983e-05 3.108049e-03
In Node 3 only race is significant and hence used for splitting:
sctest(tr, node = 3)
## age race
## statistic 42.9352852 1.728898e+02
## p.value 0.1447818 6.766197e-05
And in Node 5 neither variable is significant and hence the splitting stops:
sctest(tr, node = 5)
## age race
## statistic 35.1919522 22.09555
## p.value 0.5514142 0.63279
For more details regarding the method and the software see the Psychometrika paper and the software web page, respectively.
Köll S, Kosmidis I, Kleiber C, Zeileis A (2021). “Bias Reduction as a Remedy to the Consequences of Infinite Estimates in Poisson and Tobit Regression.” arXiv:2101.07141, arXiv.org e-Print archive. https://arXiv.org/abs/2101.07141
Data separation is a well-studied phenomenon that can cause problems in the estimation and inference from binary response models. Complete or quasi-complete separation occurs when there is a combination of regressors in the model whose value can perfectly predict one or both outcomes. In such cases, and such cases only, the maximum likelihood estimates and the corresponding standard errors are infinite. It is less widely known that the same can happen in further microeconometric models. One of the few works in the area is Santos Silva and Tenreyro (2010), who note that the finiteness of the maximum likelihood estimates in Poisson regression depends on the data configuration and propose a strategy to detect and overcome the consequences of data separation. However, their approach can lead to notable bias in the parameter estimates when the regressors are correlated. We illustrate how bias-reducing adjustments to the maximum likelihood score equations can overcome the consequences of separation in Poisson and Tobit regression models.
R package brglm2 from CRAN: https://CRAN.R-project.org/package=brglm2
R package brtobit from R-Forge: https://R-Forge.R-project.org/R/?group_id=2305
The simplest but arguably often-encountered occurrence of data separation in practice is when there is a binary regressor such that the response y = 0 (or another boundary value) whenever the regressor is 1. If P(y = 0) is monotonically decreasing in the linear predictor of the model, then the coefficient of the binary regressor will diverge to minus infinity in order to push P(y = 0) in this subgroup as close to 1 as possible.
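A quick numeric illustration of this mechanism for the Poisson case with log link, evaluated at a unit intercept as in the simulation below:

```r
## Numeric illustration for the Poisson case: with a log link,
## P(y = 0) = exp(-exp(eta)), so driving the coefficient of the separating
## regressor towards minus infinity pushes P(y = 0) towards 1 when x3 = 1.
beta3 <- c(0, -5, -10, -20)
eta <- 1 + beta3        # linear predictor for observations with x3 = 1
round(dpois(0, exp(eta)), 4)
# -> 0.0660 0.9819 0.9999 1.0000
```
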
To illustrate this phenomenon in R for both Poisson and Tobit regression we employ a simple data-generating process: In addition to the intercept we generate a continuous regressor x_{2} uniformly distributed on [-1, 1] and a binary regressor x_{3}. The latter comes from a Bernoulli distribution with probability 0.25 if x_{2} is positive and with probability 0.75 otherwise. Thus, x_{2} and x_{3} are correlated.
The linear predictor employed for both Poisson and Tobit is: 1 + x_{2} - 10 x_{3}, where the extreme coefficient of -10 assures that there is almost certainly data separation. In the full paper linked above we also consider less extreme scenarios where separation may or may not occur. The Poisson response is then drawn from a Poisson distribution using a log link between mean and linear predictor. The Tobit response is drawn from a normal distribution censored at zero with identity link and constant variance of 2. Here, we draw two samples with 100 observations from both models:
dgp <- function(n = 100, coef = c(1, 1, -10, 2), prob = 0.25, dist = "poisson") {
  ## continuous regressor on [-1, 1]
  x2 <- runif(n, -1, 1)
  ## binary regressor, correlated with x2
  x3 <- rbinom(n, size = 1, prob = ifelse(x2 > 0, prob, 1 - prob))
  ## response from Poisson (log link) or normal (identity link) distribution
  y <- switch(match.arg(tolower(dist), c("poisson", "tobit")),
    "poisson" = rpois(n, exp(coef[1] + coef[2] * x2 + coef[3] * x3)),
    "tobit" = rnorm(n, mean = coef[1] + coef[2] * x2 + coef[3] * x3, sd = sqrt(coef[4]))
  )
  ## censoring at zero (only relevant for the tobit case)
  y[y <= 0] <- 0
  data.frame(y, x2, x3)
}
set.seed(2020-10-29)
d1 <- dgp(dist = "poisson")
set.seed(2020-10-29)
d2 <- dgp(dist = "tobit")
Both of these data sets exhibit quasi-complete separation of y with respect to x_{3}, i.e., y is always 0 if x_{3} is 1.
xtabs(~ x3 + factor(y == 0), data = d1)
## factor(y == 0)
## x3 FALSE TRUE
## 0 47 8
## 1 0 45
We then compare four different modeling approaches in this situation:
- ML: standard maximum likelihood estimation of the full model with regressors x_{2} and x_{3}.
- BR: bias-reduced estimation of the same full model.
- ML/sub: ML estimation of the model without x_{3}, using only the subset of observations with x_{3} = 0.
- ML/SST: ML estimation of the model without x_{3} on the full sample, i.e., the strategy of Santos Silva and Tenreyro (2010).
For Poisson regression, all these models can be fitted with the standard glm() function in R. To obtain the BR estimate, method = "brglmFit" can be plugged in using the brglm2 package (by Ioannis Kosmidis).
install.packages("brglm2")
library("brglm2")
m12_ml <- glm(y ~ x2 + x3, data = d1, family = poisson)
m12_br <- update(m12_ml, method = "brglmFit")
m1_all <- glm(y ~ x2, data = d1, family = poisson)
m1_sub <- update(m1_all, subset = x3 == 0)
m1 <- list("ML" = m12_ml, "BR" = m12_br, "ML/sub" = m1_sub, "ML/SST" = m1_all)
This yields the following results (shown with the wonderful modelsummary package):
library("modelsummary")
msummary(m1)
ML | BR | ML/sub | ML/SST | |
---|---|---|---|---|
(Intercept) | 0.951 | 0.958 | 0.951 | 0.350 |
(0.100) | (0.099) | (0.100) | (0.096) | |
x2 | 1.011 | 1.006 | 1.011 | 1.662 |
(0.158) | (0.157) | (0.158) | (0.144) | |
x3 | -20.907 | -5.174 | ||
(2242.463) | (1.416) | |||
Num.Obs. | 100 | 100 | 55 | 100 |
Log.Lik. | -107.364 | -107.869 | -107.364 | -169.028 |
The following remarks can be made:
- The ML estimate of the x_{3} coefficient diverges towards minus infinity; the reported value of -20.907 with a huge standard error of 2242.463 simply reflects where the iterative fitting terminated.
- The BR estimates are all finite, including the x_{3} coefficient, while being very close to the ML estimates for the remaining coefficients.
- ML and ML/sub yield identical estimates for the intercept and x_{2}, as the separated observations carry no additional information about these coefficients.
- The ML/SST estimates of the intercept and x_{2} are far from the true values of 1 and 1, because omitting x_{3} on the full sample induces omitted-variable bias when the regressors are correlated.
Moreover, in more extensive simulation experiments in the paper it is shown that the BR estimates are always finite, and result in Wald-type intervals with better coverage probabilities.
Analogous results can be obtained for Tobit regression with our brtobit package, currently available from R-Forge. This provides both ML and BR estimation for homoscedastic Tobit models. (Some tools are re-used from our crch package that implements various estimation techniques, albeit not BR, for Tobit models with conditional heteroscedasticity.) Below we fit the same four models as in the Poisson case above.
install.packages("brtobit", repos = "http://R-Forge.R-project.org")
library("brtobit")
m22_ml <- brtobit(y ~ x2 + x3, data = d2, type = "ML", fsmaxit = 28)
m22_br <- brtobit(y ~ x2 + x3, data = d2, type = "BR")
m2_all <- brtobit(y ~ x2, data = d2, type = "ML")
m2_sub <- update(m2_all, subset = x3 == 0)
m2 <- list("ML" = m22_ml, "BR" = m22_br, "ML/sub" = m2_sub, "ML/SST" = m2_all)
Because brtobit does not yet provide a direct interface for modelsummary (via broom) we go through the coeftest() results as an intermediate step. These can then be rendered by modelsummary:
library("lmtest")
m2 <- lapply(m2, coeftest)
msummary(m2)
| | ML | BR | ML/sub | ML/SST |
|---|---|---|---|---|
| (Intercept) | 1.135 | 1.142 | 1.135 | -0.125 |
| | (0.208) | (0.210) | (0.208) | (0.251) |
| x2 | 0.719 | 0.705 | 0.719 | 2.074 |
| | (0.364) | (0.359) | (0.364) | (0.404) |
| x3 | -11.238 | -4.218 | | |
| | (60452.270) | (0.891) | | |
| (Variance) | 1.912 | 1.970 | 1.912 | 3.440 |
| | (0.422) | (0.434) | (0.422) | (0.795) |
| Num.Obs. | 100 | 100 | 55 | 100 |
| Log.Lik. | -87.633 | -88.101 | -87.633 | -118.935 |
The results show exactly the same pattern as for the Poisson regression above: ML, BR, and ML/sub yield results close to the true coefficients for intercept, x2, and the variance, while the ML/SST estimates are far from the true values. For x3 only the BR estimate is finite while the ML estimate diverges towards minus infinity. Actually, the estimates would have diverged even more if we hadn't stopped the Fisher scoring early (via fsmaxit = 28 instead of the default 100).
Overall this clearly indicates that bias-reduced (BR) estimation is a convenient way to avoid infinite estimates and standard errors in these models and to enable standard inference even when data separation occurs. In contrast, the common recommendation to omit the regressor associated with the separation should either be avoided or be applied to the non-separated subset of observations only. Otherwise it can give misleading results when regressors are correlated.
The R package colorspace provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (hue-chroma-luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space. Using the HCL color model, general strategies for three types of palettes are implemented: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes. To aid selection and application of these palettes, the package also contains scales for use with ggplot2, shiny and tcltk apps for interactive exploration, visualizations of palette properties, accompanying manipulation utilities (like desaturation and lighten/darken), and emulation of color vision deficiencies.
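A brief sketch of typical usage (the palette names are just examples from the package's built-in sets):

```r
library("colorspace")

## overview of all HCL-based palettes
hcl_palettes(plot = TRUE)

## qualitative, sequential, and diverging palettes with 5 colors each
q5 <- qualitative_hcl(5, palette = "Dark 3")
s5 <- sequential_hcl(5, palette = "Blues 3")
d5 <- diverging_hcl(5, palette = "Blue-Red")

## manipulation utilities and color vision deficiency emulation
lighten(q5, 0.3)
desaturate(q5, 0.5)
deutan(q5)
```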
Zeileis A, Fisher JC, Hornik K, Ihaka R, McWhite CD, Murrell P, Stauffer R, Wilke CO (2020). “colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” Journal of Statistical Software, 96(1), 1-49. doi:10.18637/jss.v096.i01.
The release of version 2.0-0 on CRAN (Comprehensive R Archive Network) concludes more than a decade of development and substantial updates since the release of version 1.0-0. The JSS paper above gives a detailed overview of the package’s features. The full list of changes over the different releases is provided in the package’s NEWS.
Even more details and links along with the full software manual are available on the package web page on R-Forge at https://colorspace.R-Forge.R-project.org/ (produced with pkgdown).
The sandwich package provides model-robust covariance matrix estimators for cross-sectional, time series, clustered, panel, and longitudinal data. The implementation is modular due to an object-oriented design with support for many model objects, including: lm, glm, survreg, coxph, mlogit, polr, hurdle, zeroinfl, and beyond.
The release of version 3.0-0 on CRAN (Comprehensive R Archive Network) completes the substantial updates and improvements started in the 2.4-x and 2.5-x releases: especially clustered, panel, and bootstrap covariances. In addition to the new pkgdown web page and paper in the Journal of Statistical Software (JSS), described below, the new release includes some smaller improvements: in some equations in the vignettes (suggested by Bettina Grün and Yves Croissant), in the kernel weights function kweights() (suggested by Christoph Hanck), in the formula handling (suggested by David Hugh-Jones), and in the bread() method for weighted mlm objects (suggested by James Pustejovsky). The full list of changes can be seen in the package’s NEWS.
The package now comes with a dedicated pkgdown website on R-Forge: https://sandwich.R-Forge.R-project.org/. This includes a nice logo, kindly provided by Reto Stauffer.
The web page essentially uses the previous content of the package (documentation, vignettes, NEWS) but also adds a nice overview of the package to help new users to “Get started”.
Citation:
Zeileis A, Köll S, Graham N (2020). “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software, 95(1), 1-36. doi:10.18637/jss.v095.i01.
Abstract:
Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, and other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to “the” clustered standard errors, there is a surprisingly wide variety of clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g., for zero-inflated, censored, or limited responses).
In R, functions for covariances in clustered or panel models have been somewhat scattered or available only for certain modeling functions, notably the (generalized) linear regression model. In contrast, an object-oriented approach to “robust” covariance matrix estimation - applicable beyond lm() and glm() - is available in the sandwich package but has been limited to the case of cross-section or time series data. Starting with sandwich 2.4.0, this shortcoming has been corrected: Based on methods for two generic functions (estfun() and bread()), clustered and panel covariances are provided in vcovCL(), vcovPL(), and vcovPC(). Moreover, clustered bootstrap covariances are provided in vcovBS(), using model update() on bootstrap samples. These are directly applicable to models from packages including MASS, pscl, countreg, and betareg, among many others. Some empirical illustrations are provided as well as an assessment of the methods’ performance in a simulation study.
Structural equation models (SEMs) are a popular class of models, especially in the social sciences, to model correlations and dependencies in multivariate data, often involving latent variables. To account for individual heterogeneities in the SEM parameters sometimes finite-mixture models are used, in particular when there are no covariates available to explain the source of the heterogeneity. More recently, starting from the work of Brandmaier et al. (2013, Psychological Methods, doi:10.1037/a0030001) tree-based modeling of SEMs has also been receiving increasing interest in the literature. Based on available covariates SEM trees can capture the heterogeneity by recursively partitioning the data into subgroups. Brandmaier et al. also provide an R implementation for their algorithm in their semtree package available from CRAN.
Their original SEM tree algorithm relied on selecting the variables for recursive partitioning based on likelihood ratio tests along with somewhat ad hoc adjustments. Recently, the group around Brandmaier proposed to use score-based tests instead that account more formally for selecting the maximal statistic across a range of possible split points (see Arnold et al. 2020, PsyArXiv Preprints, doi:10.31234/osf.io/65bxv). They show that this not only improves the accuracy of the method but can also greatly alleviate the computational burden.
The score-based tests draw on the work started by us in Merkle & Zeileis (2013, Psychometrika, doi:10.1007/s11336-012-9302-4) which in fact had already long been available in a general model-based tree algorithm (called MOB for short), proposed by us in Zeileis et al. (2008, Journal of Computational and Graphical Statistics, doi:10.1198/106186008X319331) and available in the R package partykit (and party before that).
In this blog post I show how the general mob() function from partykit can be easily coupled with the lavaan package (Rosseel 2012, Journal of Statistical Software, doi:10.18637/jss.v048.i02) as an alternative approach to fitting SEM trees.
MOB is a very broad tree algorithm that can capture subgroups in general parametric models (e.g., probability distributions, regression models, measurement models, etc.). While it can be applied to M-type estimators in general, it is probably easiest to outline the algorithm for maximum likelihood models. The algorithm assumes that there is some data of interest along with a suitable model that can fit the data, at least locally in subgroups. Additionally, there are further covariates that can be used for splitting the data to find these subgroups. It proceeds in the following steps:

1. Fit the model to all observations in the current sample by maximum likelihood.
2. Assess whether the estimated parameters are stable across each of the partitioning covariates, using score-based parameter instability tests.
3. If there is significant instability, split the sample along the covariate associated with the strongest instability, choosing the split point that maximizes the partitioned (log-)likelihood.
4. Repeat steps 1-3 recursively in each of the resulting subgroups until there are no further significant instabilities (or the subgroups become too small).
The mob() function in partykit implements this general algorithm and allows plugging in different model-fitting functions, provided they allow extracting the estimated parameters, the maximized log-likelihood, and the corresponding matrix of score (or gradient) contributions for each observation. The details are described in a vignette within the package: Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in R.
As the lavaan package readily provides the quantities that MOB needs as input, we can easily set up a “mobster” function for SEMs. The lavaan_fit() function below takes a lavaan model definition and returns the actual fitting function with the interface as required by mob():
lavaan_fit <- function(model) {
function(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ..., estfun = FALSE, object = FALSE) {
sem <- lavaan::lavaan(model = model, data = y, start = start)
list(
coefficients = stats4::coef(sem),
objfun = -as.numeric(stats4::logLik(sem)),
estfun = if(estfun) sandwich::estfun(sem) else NULL,
object = if(object) sem else NULL
)
}
}
The fitting function just calls lavaan() using the model, the data y, and optionally the starting values, ignoring other arguments that mob() could handle. It then extracts the parameters via coef(), the log-likelihood via logLik(), and the score matrix via estfun(), using the generic functions from the corresponding packages, and returns them in a list.
To illustrate fitting SEM trees with partykit and lavaan, we consider the example from the Using lavaan with semtree tutorial provided by Brandmaier et al. It is a linear growth curve model for data measured at five time points: X1, X2, X3, X4, and X5. The main parameters of interest are the intercept and the slope of the growth curves while accounting for random variations and correlations among the involved variables according to this SEM. In lavaan notation:
growth_curve_model <- '
inter =~ 1*X1 + 1*X2 + 1*X3 + 1*X4 + 1*X5;
slope =~ 0*X1 + 1*X2 + 2*X3 + 3*X4 + 4*X5;
inter ~~ vari*inter; inter ~ meani*1;
slope ~~ vars*slope; slope ~ means*1;
inter ~~ cov*slope;
X1 ~~ residual*X1; X1 ~ 0*1;
X2 ~~ residual*X2; X2 ~ 0*1;
X3 ~~ residual*X3; X3 ~ 0*1;
X4 ~~ residual*X4; X4 ~ 0*1;
X5 ~~ residual*X5; X5 ~ 0*1;
'
The model can also be visualized using the following graphic taken from the tutorial:
In addition to the measurements at the five time points, the data set example1.txt provides three covariates (agegroup, training, and noise) that can be used to capture individual differences in the model parameters. The data can be read and transformed to appropriate classes by:
ex1 <- read.csv(
  "https://brandmaier.de/semtree/wp-content/uploads/downloads/2012/07/example1.txt",
  sep = "\t")
ex1 <- transform(ex1,
agegroup = factor(agegroup),
training = factor(training),
noise = factor(noise))
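Before growing the tree, it can be useful to check that the mobster function returns everything mob() needs when applied to the full data set (a quick sanity check, not part of the original analysis):

```r
## apply the mobster once to all observations
fit <- lavaan_fit(growth_curve_model)(
  y = ex1[, c("X1", "X2", "X3", "X4", "X5")],
  estfun = TRUE, object = TRUE)

names(fit$coefficients)  # the 10 model parameters
fit$objfun               # negative log-likelihood
dim(fit$estfun)          # one row of score contributions per observation
```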
With the data, model, and mobster function available, it is easy to fit the MOB tree with SEMs in every node of the tree. The five measurements are the dependent variables (y) that need to be passed to the model as a "data.frame", and the three covariates are the explanatory variables:
library("partykit")
tr <- mob(X1 + X2 + X3 + X4 + X5 ~ agegroup + training + noise, data = ex1,
fit = lavaan_fit(growth_curve_model),
control = mob_control(ytype = "data.frame"))
The resulting tree tr correctly detects the three subgroups that were simulated for the data by Brandmaier et al. It can be visualized (with somewhat larger terminal nodes, all dropped to the bottom of the display):
plot(tr, drop = TRUE, tnex = 2)
The parameter estimates can also be extracted by coef(tr):
t(coef(tr))
##               2      4      5
## vari      0.086  0.080  0.105
## meani     5.020  2.003  1.943
## vars      0.500  1.627  0.675
## means    -0.144 -1.082 -0.495
## cov      -0.013 -0.041  0.028
## residual  0.050  0.047  0.052
## residual  0.050  0.047  0.052
## residual  0.050  0.047  0.052
## residual  0.050  0.047  0.052
## residual  0.050  0.047  0.052
The main parameters of interest are meani, the mean intercept, and means, the mean slope, both of which vary across the subgroups defined by agegroup and training: In node 2 the intercept is about 5 while in nodes 4 and 5 it is around 2. The slope is almost zero in node 2, about -1 in node 4, and about -0.5 in node 5. The residual variance is restricted to be constant across the five time points and hence repeated in the output.
By extracting the node-specific meani and means parameters, the expected growth can also be visualized in the following way:
gr <- coef(tr)[, "meani"] + outer(coef(tr)[, "means"], 0:4)
cl <- palette.colors(4, "Okabe-Ito")[-1]
matplot(t(gr), type = "o", pch = 19, col = cl,
ylab = "Expected growth", xlab = "Time", xlim = c(1, 5.2))
text(5, gr[, 5], paste("Node", rownames(gr)), col = cl, pos = 3)
Finally, using a custom printing function that only shows the subgroup size and the first six parameters, the tree can be nicely printed as:
node_format <- function(node) {
c("",
sprintf("n = %s", node$nobs),
capture.output(print(cbind(node$coefficients[1:6]), digits = 2L))[-1L])
}
print(tr, FUN = node_format)
## Model-based recursive partitioning (lavaan_fit(growth_curve_model))
##
## Model formula:
## X1 + X2 + X3 + X4 + X5 ~ agegroup + training + noise
##
## Fitted party:
## [1] root
## | [2] agegroup in 0
## | n = 200
## | vari 0.086
## | meani 5.020
## | vars 0.500
## | means -0.144
## | cov -0.013
## | residual 0.050
## | [3] agegroup in 1
## | | [4] training in 0
## | | n = 100
## | | vari 0.080
## | | meani 2.003
## | | vars 1.627
## | | means -1.082
## | | cov -0.041
## | | residual 0.047
## | | [5] training in 1
## | | n = 100
## | | vari 0.105
## | | meani 1.943
## | | vars 0.675
## | | means -0.495
## | | cov 0.028
## | | residual 0.052
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
## Number of parameters per node: 10
## Objective function: 1330.735
The main purpose of this blog post was to show that it is relatively simple to fit model-based trees with custom models using the general mob() infrastructure from the partykit package. Specifically, it is easy to fit SEM trees because lavaan readily provides all necessary components. As I had provided this as feedback to Arnold et al. and encouraged them to drill a bit deeper to better understand the differences between their adapted SEM tree algorithm and MOB, I thought I should share the code as it might be useful to others as well.
One important difference between the new SEM tree algorithm and the current MOB implementation is the determination of the best split point. The new SEM tree algorithm also uses the scores for this, while MOB is based on the log-likelihood in the subgroups and hence is slower when searching for splits in numeric covariates with many possible split points. While we had also experimented with score-based split point estimation in party, this has never been released and is currently not available in partykit. However, we are working on making the split point selection more flexible in partykit.
Of course, fitting the tree model is actually just the first step in an analysis of subgroups in a SEM. The subsequent steps for analyzing and interpreting the resulting tree model are at least as important. The work by Brandmaier and his co-authors and their semtree package provide much more guidance on this.
Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). “lmSubsets: Exact Variable-Subset Selection in Linear Regression for R.” Journal of Statistical Software, 93(3), 1-21. doi:10.18637/jss.v093.i03.
An R package for computing the all-subsets regression problem is presented. The proposed algorithms are based on computational strategies recently developed. A novel algorithm for the best-subset regression problem selects subset models based on a predetermined criterion. The package user can choose from exact and from approximation algorithms. The core of the package is written in C++ and provides an efficient implementation of all the underlying numerical computations. A case study and benchmark results illustrate the usage and the computational efficiency of the package.
https://CRAN.R-project.org/package=lmSubsets
Advances in numerical weather prediction (NWP) have played an important role in the increase of weather forecast skill over the past decades. Numerical models simulate physical systems that operate at a large, typically global, scale. The horizontal (spatial) resolution is limited by the computational power available today and hence, typically, the NWP outputs are post-processed to correct for local and unresolved effects in order to obtain forecasts for specific locations. So-called model output statistics (MOS) develops a regression relationship based on past meteorological observations of the variable to be predicted and forecasted NWP quantities at a certain lead time. Variable-subset selection is often employed to determine which NWP outputs should be included in the regression model for a specific location.
Here, the lmSubsets package is used to build a MOS regression model predicting temperature at Innsbruck Airport, Austria, based on data from the Global Ensemble Forecast System. The data frame IbkTemperature contains 1824 daily cases for 42 variables: the temperature at Innsbruck Airport (observed), 36 NWP outputs (forecasted), and 5 deterministic time trend/season patterns. The NWP variables include quantities pertaining to temperature (e.g., 2-meter above ground, minimum, maximum, soil), precipitation, wind, and fluxes, among others.
First, package and data are loaded and the few missing values are omitted for simplicity.
library("lmSubsets")
data("IbkTemperature", package = "lmSubsets")
IbkTemperature <- na.omit(IbkTemperature)
A simple output model for the observed temperature (temp) is constructed, which will serve as the reference model. It consists of the 2-meter temperature NWP forecast (t2m), a linear trend component (time), as well as seasonal components with annual (sin, cos) and bi-annual (sin2, cos2) harmonic patterns.
MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2,
data = IbkTemperature)
When looking at summary(MOS0) or the coefficient table below, it can be observed that despite the inclusion of the NWP variable t2m, the coefficients for the deterministic components remain significant, which indicates that the seasonal temperature fluctuations are not fully resolved by the numerical model.
| | MOS0 | MOS1 | MOS2 |
|---|---|---|---|
| (Intercept) | -345.252 ** (109.212) | -666.584 *** (95.349) | -661.700 *** (95.225) |
| t2m | 0.318 *** (0.016) | 0.055 (0.029) | |
| time | 0.132 * (0.054) | 0.149 ** (0.047) | 0.147 ** (0.047) |
| sin | -1.234 *** (0.126) | 0.522 *** (0.147) | 0.811 *** (0.120) |
| cos | -6.329 *** (0.164) | -0.812 ** (0.273) | |
| sin2 | 0.240 * (0.110) | -0.794 *** (0.119) | -0.870 *** (0.118) |
| cos2 | -0.332 ** (0.109) | -1.067 *** (0.101) | -1.128 *** (0.097) |
| sshnf | | 0.016 *** (0.004) | 0.018 *** (0.004) |
| vsmc | | 20.200 *** (3.115) | 20.181 *** (3.106) |
| tmax2m | | 0.145 *** (0.037) | 0.181 *** (0.023) |
| st | | 1.077 *** (0.051) | 1.142 *** (0.043) |
| wr | | 0.450 *** (0.109) | 0.505 *** (0.103) |
| t2pvu | | 0.064 *** (0.011) | 0.149 *** (0.028) |
| mslp | | | -0.000 *** (0.000) |
| p2pvu | | | -0.000 ** (0.000) |
| AIC | 9493.602 | 8954.907 | 8948.182 |
| BIC | 9537.650 | 9031.992 | 9025.267 |
| RSS | 19506.469 | 14411.122 | 14357.943 |
| Sigma | 3.281 | 2.825 | 2.820 |
| R-squared | 0.803 | 0.854 | 0.855 |

*** p < 0.001; ** p < 0.01; * p < 0.05.
Next, the reference model is extended with selected regressors taken from the remaining 35 NWP variables.
MOS1_best <- lmSelect(temp ~ ., data = IbkTemperature,
include = c("t2m", "time", "sin", "cos", "sin2", "cos2"),
penalty = "BIC", nbest = 20)
MOS1 <- refit(MOS1_best)
Best-subset regression with respect to the BIC criterion is employed to determine pertinent variables in addition to the regressors already used in MOS0. The 20 best submodels are computed; the selected variables can be visualized by image(MOS1_best, hilite = 1) while the corresponding BIC values can be visualized by plot(MOS1_best). All in all, these 20 best models are very similar, with only a few variables switching between being included and excluded. Using the refit() method the best submodel can be extracted and fitted via lm(). Summary statistics are shown in the table above. Overall, the model MOS1 improves the model fit considerably compared to the basic MOS0 model.
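The visualizations mentioned above can be produced directly from the objects computed so far:

```r
## selected variables of the 20 best submodels, highlighting the best one
image(MOS1_best, hilite = 1)

## BIC values of the 20 best submodels
plot(MOS1_best)

## coefficients of the refitted best submodel
summary(MOS1)
```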
Finally, an all-subsets regression is conducted instead of the cheaper best-subset regression. It considers all 41 variables without any restrictions to determine the best model in terms of BIC that can be found for this data set.
MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature)
MOS2 <- refit(lmSelect(MOS2_all, penalty = "BIC"))
Again, the best model is refitted with lm() to facilitate further inspections; see above for the summary table.
The best-BIC models MOS1 and MOS2 both have 13 regressors. The deterministic trend and all but one of the harmonic seasonal components are retained in MOS2 even though they are not forced into the model (as in MOS1). In addition, MOS1 and MOS2 share six NWP outputs relating to temperature (tmax2m, st, t2pvu), hydrology (vsmc, wr), and heat flux (sshnf), while MOS2 additionally selects two pressure quantities (mslp, p2pvu). However, and most remarkably, MOS2 does not include the direct 2-meter temperature output from the NWP model (t2m). In fact, t2m is not included by any of the 20 submodels (sizes 8 to 27) shown by image(MOS2_all, size = 8:27, hilite = 1, hilite_penalty = "BIC") whereas the temperature quantities tmax2m, st, and t2pvu are included by all. (Additionally, plot(MOS2_all) would show the associated BIC and residual sum of squares across the different model sizes.) The summary statistics reveal that both MOS1 and MOS2 significantly improve over the simple reference model MOS0, with MOS2 being only slightly better than MOS1.
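The model comparison from the table can also be computed directly from the fitted lm objects:

```r
## compare information criteria and fit statistics of the three models
sapply(list(MOS0 = MOS0, MOS1 = MOS1, MOS2 = MOS2), function(m)
  c(AIC = AIC(m), BIC = BIC(m), RSS = deviance(m),
    Sigma = summary(m)$sigma, "R-squared" = summary(m)$r.squared))
```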
Lang MN, Schlosser L, Hothorn T, Mayr GJ, Stauffer R, Zeileis A (2020). “Circular Regression Trees and Forests with an Application to Probabilistic Wind Direction Forecasting”, arXiv:2001.00412, arXiv.org E-Print Archive. https://arXiv.org/abs/2001.00412
While circular data occur in a wide range of scientific fields, the methodology for distributional modeling and probabilistic forecasting of circular response variables is rather limited. Most of the existing methods are built on the framework of generalized linear and additive models, which are often challenging to optimize and interpret. Therefore, building on previous ideas for trees modeling circular means, we suggest a distributional approach for regression trees and random forests yielding probabilistic forecasts based on the von Mises distribution. The resulting tree-based models simplify the estimation process by using the available covariates for partitioning the data into sufficiently homogeneous subgroups so that a simple von Mises distribution without further covariates can be fitted to the circular response in each subgroup. These circular regression trees are straightforward to interpret, can capture nonlinear effects and interactions, and automatically select the relevant covariates that are associated with either location and/or scale changes in the von Mises distribution. Combining an ensemble of circular regression trees to a circular regression forest can regularize and smooth the covariate effects. The new methods are evaluated in a case study on probabilistic wind direction forecasting at two Austrian airports, considering other common approaches as a benchmark.
R package circtree from the R-Forge project partykit: https://R-Forge.R-project.org/R/?group_id=261
Basic examples using artificial data:
install.packages("partykit")
install.packages("disttree", repos = "http://R-Forge.R-project.org")
install.packages("circtree", repos = "http://R-Forge.R-project.org")
library("circtree")
example("circtree", ask = FALSE)
vignette("circtree", package = "circtree")
The basis for the proposed distributional modeling of the circular responses is the von Mises distribution, also known as the “circular normal distribution”. It is based on a location parameter μ in [0, 2 π) and a concentration parameter κ > 0.
The figure below illustrates a model, fitted by maximum likelihood, for circular data in the interval [0, 2 π). It can either be drawn on a linearized scale (left) or circular scale (right). In both cases the empirical histogram (gray bars) and fitted von Mises density (red line) are depicted along with the estimated location parameter (red hand).
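To make the distribution concrete, the von Mises density and its maximum likelihood fit can be sketched in a few lines of base R (this is just an illustration on artificial data, not the package implementation):

```r
## von Mises density: f(y; mu, kappa) = exp(kappa * cos(y - mu)) / (2 * pi * I_0(kappa))
dvonmises <- function(y, mu, kappa, log = FALSE) {
  d <- kappa * cos(y - mu) - log(2 * pi) - log(besselI(kappa, nu = 0))
  if (log) d else exp(d)
}

## fit location and concentration by maximum likelihood
## (parametrizing kappa = exp(theta) keeps the concentration positive)
fit_vonmises <- function(y) {
  nll <- function(theta) -sum(dvonmises(y, theta[1], exp(theta[2]), log = TRUE))
  opt <- optim(c(0, 0), nll)
  c(mu = opt$par[1] %% (2 * pi), kappa = exp(opt$par[2]))
}

## artificial angles concentrated around 2 radians
set.seed(1)
y <- (2 + rnorm(500, sd = 0.3)) %% (2 * pi)
fit_vonmises(y)
```

The estimated location should be close to 2 radians and the concentration roughly 1/0.3^2, since for large kappa the von Mises distribution approaches a normal distribution with variance 1/kappa.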
The regression trees and forests extend this approach by employing an adaptive local likelihood approach: For each observation, the parameters μ and κ are estimated only locally in a neighborhood, defined either by the nodes of a single tree or weighted by the nodes of a forest.
To provide a first impression of the methodology in practice (motivated by air traffic management), a circular regression tree is employed for probabilistic wind direction forecasting. More specifically, we obtain 1-hourly nowcasts of wind direction at Innsbruck Airport. As the airport is located at the bottom of a narrow valley within the European Alps, it is natural to employ tree-based regression models as there can be abrupt changes in the wind direction rather than smooth changes.
Due to the short lead time only observation data is employed for the predictions (41,979 data points) but no numerical weather predictions. The data is obtained from 4 stations at Innsbruck Airport as well as 6 nearby weather stations. The base variables are: wind direction, wind (gust) speed, temperature, (reduced) air pressure, and relative humidity. Based on these, 260 covariates are computed via means/minima/maxima, temporal changes, and spatial differences towards the airport. The resulting regression tree is shown below along with the empirical (gray) and fitted von Mises (red) wind direction distribution in each terminal node.
Based on the fitted location parameters μ, the subgroups can be distinguished into the following wind regimes:
In terms of covariates, the lagged wind direction (also known as “persistence”) is mostly responsible for distinguishing the broad range of wind regimes listed above, while the pressure gradients and wind speed separate the data into subgroups with high vs. low precision.
A more extensive case study of circular regression trees and also circular random forests applied to probabilistic wind direction forecasting at Innsbruck Airport and Vienna International Airport is presented in Section 4 of the paper, along with a benchmark against commonly-used alternative approaches.