*Citation:* Jones PJ, Mair P, Simon T, Zeileis A (2020). “Network Trees: A Method for Recursively Partitioning Covariance Structures.” *Psychometrika*, **85**(4), 926-945. doi:10.1007/s11336-020-09731-4.

*Preprint version:* https://www.zeileis.org/papers/Jones+Mair+Simon-2020.pdf

*OSF replication materials:* https://osf.io/ykq2a/

In many areas of psychology, correlation-based network approaches (i.e., psychometric networks) have become a popular tool. In this paper, we propose an approach that recursively splits the sample based on covariates in order to detect significant differences in the structure of the covariance or correlation matrix. Psychometric networks or other correlation-based models (e.g., factor models) can be subsequently estimated from the resultant splits. We adapt model-based recursive partitioning and conditional inference tree approaches for finding covariate splits in a recursive manner. The empirical power of these approaches is studied in several simulation conditions. Examples are given using real-life data from personality and clinical research.

All methods discussed are implemented in the R package `networktree`, which is developed on GitHub; stable versions are released on CRAN (Comprehensive R Archive Network). Version 1.0.0 accompanies the publication in Psychometrika and version 1.0.1 adds a few small enhancements and bug fixes, specifically for the plotting infrastructure. Furthermore, a nice web page with introductory examples, documentation, release notes, etc. has been produced with the wonderful `pkgdown`.

*CRAN release:* https://CRAN.R-project.org/package=networktree

*Web page:* https://paytonjjones.github.io/networktree/

The idea of psychometric networks is to provide information about the statistical relationships between observed variables. Network trees aim to reveal heterogeneities in these relationships based on observed covariates. This strategy is implemented in the R package `networktree`, building on the general tree algorithms in the `partykit` package.

For illustration, we consider a depression network - where the nodes represent different symptoms - and detect heterogeneities with respect to age and race. The data used below is provided by https://openpsychometrics.org/ and was obtained using the Depression Anxiety and Stress Scale (DASS), a self-report instrument for measuring depression, anxiety, and tension or stress. It is available in the `networktree` package as `dass`. To make the resulting graphics and summaries easier to interpret, we use the following variable names for the depression symptoms that are measured with certain questions from the DASS:

- `anhedonia` (Question 3: I couldn’t seem to experience any positive feeling at all.)
- `initiative` (Question 42: I found it difficult to work up the initiative to do things.)
- `lookforward` (Question 10: I felt that I had nothing to look forward to.)
- `sad` (Question 13: I felt sad and depressed.)
- `unenthused` (Question 31: I was unable to become enthusiastic about anything.)
- `worthless` (Question 17: I felt I wasn’t worth much as a person.)
- `meaningless` (Question 38: I felt that life was meaningless.)

First, we load the data and relabel the variables for the depression symptoms:

```
library("networktree")
data("dass", package = "networktree")
names(dass)[c(3, 42, 10, 13, 31, 17, 38)] <- c("anhedonia", "initiative", "lookforward",
"sad", "unenthused", "worthless", "meaningless")
```

Subsequently, we fit a `networktree()` where the relationship between the symptoms (`anhedonia + initiative + lookforward + sad + unenthused + worthless + meaningless`) is “explained by” (`~`) the covariates (`age + race`). (As an alternative to this formula-based interface it is also possible to specify groups of dependent and split variables, respectively, through separate data frames.) The threshold for detecting significant differences in correlations is set to 1% (plus a Bonferroni adjustment for testing two covariates at each step).

```
tr <- networktree(anhedonia + initiative + lookforward + sad + unenthused +
worthless + meaningless ~ age + race, data = dass, alpha = 0.01)
```

The resulting network tree can be easily visualized with `plot(tr)`, which would display the raw correlations. As these are generally high between all depression symptoms, we use a display with partial correlations (`transform = "pcor"`) instead. This brings out differences between the detected subgroups somewhat more clearly. *(Note that version 1.0.1 of networktree is needed for this to work correctly.)*

```
plot(tr, transform = "pcor")
```

This shows that the network tree detects three subgroups. First, the correlations of the depression symptoms change across `age`, with the largest difference between “younger” and “older” persons in the sample at a split point of 30 years. Second, the correlations differ with respect to race for the older persons in the sample, with the largest difference between Arab/Black/Native American/White and Asian/Other. The differences in the symptom correlations affect various pairs of symptoms, as brought out in the network display produced by the qgraph package in the terminal nodes. For example, the “centrality” of `anhedonia` changes across the three detected subgroups: for the older Asian/Other persons it is partially correlated with most other symptoms, while this is less pronounced for the other two subgroups.

The networks visualized in the tree can also be extracted easily using the `getnetwork()` function. For example, the partial correlation matrix corresponding to the older Asian/Other group (node 5) can be obtained by:

```
getnetwork(tr, id = 5, transform = "pcor")
```

To explore the returned object `tr` in some more detail, the `print()` method gives a printed version of the tree structure but does not display the associated parameters.

```
tr
## Network tree object
##
## Model formula:
## anhedonia + initiative + lookforward + sad + unenthused + worthless +
## meaningless ~ age + race
##
## Fitted party:
## [1] root
## | [2] age <= 30
## | [3] age > 30
## | | [4] race in Arab, Black, Native American, White
## | | [5] race in Asian, Other
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
## Number of parameters per node: 21
## Objective function: 42301.84
```

The estimated correlation parameters in the subgroups can be extracted with `coef(tr)`, here returning a 3 x 21 matrix for the 21 pairs of symptom correlations and the 3 subgroups. To show two symptom pairs with larger correlation differences, we extract the correlations of `anhedonia` with `worthless` and `meaningless`, respectively. Note that these are the raw correlations and not the partial correlations displayed in the tree above.

```
coef(tr)[, 5:6]
## rho_anhedonia_worthless rho_anhedonia_meaningless
## 2 0.5595725 0.5994682
## 4 0.6741686 0.6339481
## 5 0.6639088 0.7178744
```
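The partial correlations shown in the tree can be recovered from such raw correlations by inverting and rescaling the correlation matrix. Below is a small base-R sketch of this standard identity; the `pcor()` helper and the toy matrix are illustrative only and not part of `networktree`, which performs this conversion internally when `transform = "pcor"` is used:

```
## partial correlations from a correlation matrix R:
## negate the scaled inverse and reset the diagonal to 1
pcor <- function(R) {
  P <- -cov2cor(solve(R))
  diag(P) <- 1
  P
}

## toy 3 x 3 correlation matrix
R <- matrix(c(1.0, 0.6, 0.5,
              0.6, 1.0, 0.4,
              0.5, 0.4, 1.0), nrow = 3)
pcor(R)
```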

Finally, we extract the p-values of the underlying parameter instability tests to gain some insight into how the tree was constructed. In each step we assess whether the correlation parameters are stable across each of the two covariates `age` and `race` or whether there are significant changes. The corresponding test statistics and Bonferroni-adjusted p-values can be extracted with the `sctest()` function (for “structural change test”). For example, in Node 1 there are significant instabilities with respect to both variables, but `age` has the lower p-value and is hence selected for partitioning the data:

```
library("strucchange")
sctest(tr, node = 1)
## age race
## statistic 7.151935e+01 1.781216e+02
## p.value 1.787983e-05 3.108049e-03
```

In Node 3 only `race` is significant and hence used for splitting:

```
sctest(tr, node = 3)
## age race
## statistic 42.9352852 1.728898e+02
## p.value 0.1447818 6.766197e-05
```

And in Node 5 neither variable is significant and hence the splitting stops:

```
sctest(tr, node = 5)
## age race
## statistic 35.1919522 22.09555
## p.value 0.5514142 0.63279
```

For more details regarding the method and the software see the Psychometrika paper and the software web page, respectively.

Köll S, Kosmidis I, Kleiber C, Zeileis A (2021). *“Bias Reduction as a Remedy to the Consequences of Infinite Estimates in Poisson and Tobit Regression”*, arXiv:2101.07141, arXiv.org E-Print Archive. https://arXiv.org/abs/2101.07141

Data separation is a well-studied phenomenon that can cause problems in the estimation of and inference from binary response models. Complete or quasi-complete separation occurs when there is a combination of regressors in the model whose value can perfectly predict one or both outcomes. In such cases, and such cases only, the maximum likelihood estimates and the corresponding standard errors are infinite. It is less widely known that the same can happen in further microeconometric models. One of the few works in the area is Santos Silva and Tenreyro (2010) who note that the finiteness of the maximum likelihood estimates in Poisson regression depends on the data configuration and propose a strategy to detect and overcome the consequences of data separation. However, their approach can lead to notable bias in the parameter estimates when the regressors are correlated. We illustrate how bias-reducing adjustments to the maximum likelihood score equations can overcome the consequences of separation in Poisson and Tobit regression models.

R package `brglm2` from CRAN: https://CRAN.R-project.org/package=brglm2

R package `brtobit` from R-Forge: https://R-Forge.R-project.org/R/?group_id=2305

The simplest but arguably often-encountered occurrence of data separation in practice is when there is a binary regressor such that the response y = 0 (or another boundary value) whenever the regressor is 1. If P(y = 0) is monotonically decreasing in the linear predictor of the model, then the coefficient of the binary regressor will diverge to minus infinity in order to push P(y = 0) in this subgroup as close to 1 as possible.
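For the Poisson case with log link this mechanism is easy to see numerically: P(y = 0) = exp(-lambda) with lambda = exp(eta), so pushing the coefficient of the binary regressor towards minus infinity drives P(y = 0) towards 1. A quick base-R sketch with illustrative coefficient values (not from the paper):

```
## linear predictor 1 + x2 (here with x2 = 0.5) plus increasingly
## negative coefficients for the binary regressor (x3 = 1)
beta3 <- c(-2, -5, -10, -20)
lambda <- exp(1 + 0.5 + beta3)
dpois(0, lambda)  ## P(y = 0) approaches 1 as beta3 decreases
```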

To illustrate this phenomenon in R for both Poisson and Tobit regression we employ a simple data-generating process: In addition to the intercept we generate a continuous regressor x_{2} uniformly distributed on [-1, 1] and a binary regressor x_{3}. The latter comes from a Bernoulli distribution with probability 0.25 if x_{2} is positive and with probability 0.75 otherwise. Thus, x_{2} and x_{3} are correlated.

The linear predictor employed for both Poisson and Tobit is: 1 + x_{2} - 10 x_{3}, where the extreme coefficient of -10 assures that there is almost certainly data separation. In the full paper linked above we also consider less extreme scenarios where separation may or may not occur. The Poisson response is then drawn from a Poisson distribution using a log link between mean and linear predictor. The Tobit response is drawn from a normal distribution censored at zero with identity link and constant variance of 2. Here, we draw two samples with 100 observations from both models:

```
dgp <- function(n = 100, coef = c(1, 1, -10, 2), prob = 0.25, dist = "poisson") {
  x2 <- runif(n, -1, 1)
  x3 <- rbinom(n, size = 1, prob = ifelse(x2 > 0, prob, 1 - prob))
  y <- switch(match.arg(tolower(dist), c("poisson", "tobit")),
    "poisson" = rpois(n, exp(coef[1] + coef[2] * x2 + coef[3] * x3)),
    "tobit" = rnorm(n, mean = coef[1] + coef[2] * x2 + coef[3] * x3, sd = sqrt(coef[4]))
  )
  y[y <= 0] <- 0
  data.frame(y, x2, x3)
}
set.seed(2020-10-29)
d1 <- dgp(dist = "poisson")
set.seed(2020-10-29)
d2 <- dgp(dist = "tobit")
```

Both of these data sets exhibit quasi-complete separation of y with respect to x_{3}, i.e., y is always 0 if x_{3} is 1.

```
xtabs(~ x3 + factor(y == 0), data = d1)
## factor(y == 0)
## x3 FALSE TRUE
## 0 47 8
## 1 0 45
```
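The analogous check for the Tobit data (where y is continuous but censored at zero) shows the same quasi-complete separation pattern; this sketch simply reuses the `d2` data frame generated above:

```
xtabs(~ x3 + factor(y == 0), data = d2)
```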

We then compare four different modeling approaches in this situation:

- ML: Standard maximum likelihood estimation.
- BR: Bias-reduced estimation based on adjusted score equations, first suggested by David Firth and later refined by David Firth and Ioannis Kosmidis.
- ML/sub: ML estimation on the subset not affected by separation, i.e., omitting both the regressor and all observations affected.
- ML/SST: ML estimation omitting only the regressor affected by the separation but keeping all observations. This strategy is recommended more commonly (compared to ML/sub) in the literature, specifically for Poisson in a paper by Santos Silva and Tenreyro (2010, *Economics Letters*).

For Poisson regression, all these models can be fitted with the standard `glm()` function in R. To obtain the BR estimate, `method = "brglmFit"` can be plugged in using the `brglm2` package (by Ioannis Kosmidis).

```
install.packages("brglm2")
library("brglm2")
m12_ml <- glm(y ~ x2 + x3, data = d1, family = poisson)
m12_br <- update(m12_ml, method = "brglmFit")
m1_all <- glm(y ~ x2, data = d1, family = poisson)
m1_sub <- update(m1_all, subset = x3 == 0)
m1 <- list("ML" = m12_ml, "BR" = m12_br, "ML/sub" = m1_sub, "ML/SST" = m1_all)
```

This yields the following results (shown with the wonderful modelsummary package):

```
library("modelsummary")
msummary(m1)
```

|             | ML         | BR       | ML/sub   | ML/SST   |
|-------------|------------|----------|----------|----------|
| (Intercept) | 0.951      | 0.958    | 0.951    | 0.350    |
|             | (0.100)    | (0.099)  | (0.100)  | (0.096)  |
| x2          | 1.011      | 1.006    | 1.011    | 1.662    |
|             | (0.158)    | (0.157)  | (0.158)  | (0.144)  |
| x3          | -20.907    | -5.174   |          |          |
|             | (2242.463) | (1.416)  |          |          |
| Num.Obs.    | 100        | 100      | 55       | 100      |
| Log.Lik.    | -107.364   | -107.869 | -107.364 | -169.028 |

The following remarks can be made:

- Standard ML estimation using all observations leads to a large estimate for x_{3} with an even larger standard error. As a result, a standard Wald test finds no evidence against the hypothesis that x_{3} should not be in the model, despite the fact that the coefficient of -10 used when generating the data makes x_{3} perhaps the most influential regressor.
- The ML/sub strategy, i.e., estimating the model without x_{3} only for the non-separated observations (with x_{3} = 0), yields exactly the same estimates as ML.
- Compared to ML and ML/sub, BR has the advantage of returning a finite estimate and standard error for x_{3}. Hence a Wald test can be directly used to examine the evidence against x_{3} = 0. The other parameter estimates and the log-likelihood are close to ML. BR slightly shrinks the parameter estimates of x_{2} and x_{3} towards zero.
- Finally, the estimates from ML/SST, where regressor x_{3} is omitted and all observations are used, are far from the values used to generate the data. This is due to the fact that x_{3} is not only highly informative but also correlated with x_{2}.

Moreover, more extensive simulation experiments in the paper show that the BR estimates are always finite and result in Wald-type intervals with better coverage probabilities.

Analogous results can be obtained for Tobit regression with our `brtobit` package, currently available from R-Forge. It provides both ML and BR estimation for homoscedastic Tobit models. (Some tools are re-used from our `crch` package, which implements various estimation techniques, albeit not BR, for Tobit models with conditional heteroscedasticity.) Below we fit the same four models as in the Poisson case above.

```
install.packages("brtobit", repos = "http://R-Forge.R-project.org")
library("brtobit")
m22_ml <- brtobit(y ~ x2 + x3, data = d2, type = "ML", fsmaxit = 28)
m22_br <- brtobit(y ~ x2 + x3, data = d2, type = "BR")
m2_all <- brtobit(y ~ x2, data = d2, type = "ML")
m2_sub <- update(m2_all, subset = x3 == 0)
m2 <- list("ML" = m22_ml, "BR" = m22_br, "ML/sub" = m2_sub, "ML/SST" = m2_all)
```

Because `brtobit` does not yet provide a direct interface for `modelsummary` (via `broom`), we go through the `coeftest()` results as an intermediate step. These can then be rendered by `modelsummary`:

```
library("lmtest")
m2 <- lapply(m2, coeftest)
msummary(m2)
```

|             | ML          | BR      | ML/sub  | ML/SST   |
|-------------|-------------|---------|---------|----------|
| (Intercept) | 1.135       | 1.142   | 1.135   | -0.125   |
|             | (0.208)     | (0.210) | (0.208) | (0.251)  |
| x2          | 0.719       | 0.705   | 0.719   | 2.074    |
|             | (0.364)     | (0.359) | (0.364) | (0.404)  |
| x3          | -11.238     | -4.218  |         |          |
|             | (60452.270) | (0.891) |         |          |
| (Variance)  | 1.912       | 1.970   | 1.912   | 3.440    |
|             | (0.422)     | (0.434) | (0.422) | (0.795)  |
| Num.Obs.    | 100         | 100     | 55      | 100      |
| Log.Lik.    | -87.633     | -88.101 | -87.633 | -118.935 |

The results show exactly the same pattern as for the Poisson regression above: ML, BR, and ML/sub yield results close to the true coefficients for the intercept, x_{2}, and the variance, while the ML/SST estimates are far from the true values. For x_{3} only the BR estimates are finite while the ML estimates diverge towards minus infinity. Actually, the estimates would have diverged even more if we hadn’t stopped the Fisher scoring early (via `fsmaxit = 28` instead of the default `100`).

Overall this clearly indicates that bias-reduced (BR) estimation is a convenient way to avoid infinite estimates and standard errors in these models and to enable standard inference even when data separation occurs. In contrast, the common recommendation to omit the regressor associated with the separation should either be avoided or be applied only to the non-separated subset of observations. Otherwise it can give misleading results when regressors are correlated.

The R package colorspace provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (hue-chroma-luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space. Using the HCL color model, general strategies for three types of palettes are implemented: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes. To aid selection and application of these palettes, the package also contains scales for use with ggplot2, shiny and tcltk apps for interactive exploration, visualizations of palette properties, accompanying manipulation utilities (like desaturation and lighten/darken), and emulation of color vision deficiencies.

Zeileis A, Fisher JC, Hornik K, Ihaka R, McWhite CD, Murrell P, Stauffer R, Wilke CO (2020). “colorspace: A Toolbox for Manipulating and Assessing Colors and Palettes.” *Journal of Statistical Software*, **96**(1), 1-49. doi:10.18637/jss.v096.i01.

The release of version 2.0-0 on CRAN (Comprehensive R Archive Network) concludes more than a decade of development and substantial updates since the release of version 1.0-0. The JSS paper above gives a detailed overview of the package’s features. The full list of changes over the different releases is provided in the package’s NEWS.

Even more details and links along with the full software manual are available on the package web page on R-Forge at https://colorspace.R-Forge.R-project.org/ (produced with `pkgdown`).
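As a brief illustration of the three palette types and the manipulation utilities, the following sketch uses the package’s main constructor functions; the palette names are examples from the package’s built-in collection:

```
library("colorspace")

## qualitative, sequential, and diverging HCL palettes
qualitative_hcl(4, palette = "Dark 3")
sequential_hcl(5, palette = "Blues 3")
diverging_hcl(7, palette = "Blue-Red 3")

## manipulation utilities and color vision deficiency emulation
desaturate(qualitative_hcl(4, palette = "Dark 3"))
deutan(qualitative_hcl(4, palette = "Dark 3"))
```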

The sandwich package provides model-robust covariance matrix estimators for cross-sectional, time series, clustered, panel, and longitudinal data. The implementation is modular due to an object-oriented design with support for many model objects, including `lm`, `glm`, `survreg`, `coxph`, `mlogit`, `polr`, `hurdle`, `zeroinfl`, and beyond.

The release of version 3.0-0 on CRAN (Comprehensive R Archive Network) completes the substantial updates and improvements started in the 2.4-x and 2.5-x releases, especially for clustered, panel, and bootstrap covariances. In addition to the new pkgdown web page and the paper in the Journal of Statistical Software (JSS), both described below, the new release includes some smaller improvements: in some equations in the vignettes (suggested by Bettina Grün and Yves Croissant), in the kernel weights function `kweights()` (suggested by Christoph Hanck), in the formula handling (suggested by David Hugh-Jones), and in the `bread()` method for weighted `mlm` objects (suggested by James Pustejovsky). The full list of changes can be seen in the package’s NEWS.

The package now comes with a dedicated `pkgdown` website on R-Forge: https://sandwich.R-Forge.R-project.org/. This includes a nice logo, kindly provided by Reto Stauffer.

The web page essentially uses the previous content of the package (documentation, vignettes, NEWS) but also adds a nice overview of the package to help new users to “Get started”.

*Citation:*

Zeileis A, Köll S, Graham N (2020). “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” *Journal of Statistical Software*, **95**(1), 1-36. doi:10.18637/jss.v095.i01.

*Abstract:*

Clustered covariances or clustered standard errors are very widely used to account for correlated or clustered data, especially in economics, political sciences, and other social sciences. They are employed to adjust the inference following estimation of a standard least-squares regression or generalized linear model estimated by maximum likelihood. Although many publications just refer to “the” clustered standard errors, there is a surprisingly wide variety of clustered covariances, particularly due to different flavors of bias corrections. Furthermore, while the linear regression model is certainly the most important application case, the same strategies can be employed in more general models (e.g., for zero-inflated, censored, or limited responses).

In R, functions for covariances in clustered or panel models have been somewhat scattered or available only for certain modeling functions, notably the (generalized) linear regression model. In contrast, an object-oriented approach to “robust” covariance matrix estimation - applicable beyond `lm()` and `glm()` - is available in the *sandwich* package but has been limited to the case of cross-section or time series data. Starting with *sandwich* 2.4.0, this shortcoming has been corrected: Based on methods for two generic functions (`estfun()` and `bread()`), clustered and panel covariances are provided in `vcovCL()`, `vcovPL()`, and `vcovPC()`. Moreover, clustered bootstrap covariances are provided in `vcovBS()`, using model `update()` on bootstrap samples. These are directly applicable to models from packages including *MASS*, *pscl*, *countreg*, and *betareg*, among many others. Some empirical illustrations are provided as well as an assessment of the methods’ performance in a simulation study.
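To sketch the basic usage (with simulated data and a hypothetical cluster variable `id`, not taken from the paper), a clustered covariance from `vcovCL()` can be passed to `coeftest()` from *lmtest*:

```
library("sandwich")
library("lmtest")

## simulated clustered data: 50 clusters with 10 observations each,
## plus a cluster-level random effect inducing within-cluster correlation
set.seed(1)
d <- data.frame(id = rep(1:50, each = 10), x = rnorm(500))
d$y <- 1 + 2 * d$x + rnorm(50)[d$id] + rnorm(500)

m <- lm(y ~ x, data = d)
coeftest(m, vcov = vcovCL, cluster = ~ id)
```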

Structural equation models (SEMs) are a popular class of models, especially in the social sciences, to model correlations and dependencies in multivariate data, often involving latent variables. To account for individual heterogeneities in the SEM parameters sometimes finite-mixture models are used, in particular when there are no covariates available to explain the source of the heterogeneity. More recently, starting from the work of Brandmaier *et al.* (2013, *Psychological Methods*, doi:10.1037/a0030001) tree-based modeling of SEMs has also been receiving increasing interest in the literature. Based on available covariates SEM trees can capture the heterogeneity by recursively partitioning the data into subgroups. Brandmaier *et al.* also provide an R implementation for their algorithm in their *semtree* package available from CRAN.

Their original SEM tree algorithm relied on selecting the variables for recursive partitioning based on likelihood ratio tests along with somewhat ad hoc adjustments. Recently, the group around Brandmaier proposed to use score-based tests instead that account more formally for selecting the maximal statistic across a range of possible split points (see Arnold *et al.* 2020, PsyArXiv Preprints, doi:10.31234/osf.io/65bxv). They show that this not only improves the accuracy of the method but can also greatly alleviate the computational burden.

The score-based tests draw on the work started by us in Merkle & Zeileis (2013, *Psychometrika*, doi:10.1007/s11336-012-9302-4) which in fact had already long been available in a general model-based tree algorithm (called MOB for short), proposed by us in Zeileis *et al.* (2008, *Journal of Computational and Graphical Statistics*, doi:10.1198/106186008X319331) and available in the R package *partykit* (and *party* before that).

In this blog post I show how the general `mob()` function from *partykit* can be easily coupled with the *lavaan* package (Rosseel 2012, *Journal of Statistical Software*, doi:10.18637/jss.v048.i02) as an alternative approach to fitting SEM trees.

MOB is a very broad tree algorithm that can capture subgroups in general parametric models (e.g., probability distributions, regression models, measurement models, etc.). While it can be applied to M-type estimators in general, it is probably easiest to outline the algorithm for maximum likelihood models. The algorithm assumes that there is some data of interest along with a suitable model that can fit the data, at least locally in subgroups. And additionally there are further covariates that can be used for splitting the data to find these subgroups. It proceeds in the following steps.

1. Estimate the model parameters by maximum likelihood for the observations in the current subsample.
2. Test for associations (or instabilities) between the corresponding model scores and each of the covariates available for splitting.
3. Split the sample along the covariate with the strongest association or instability, choosing the breakpoint with the highest improvement in the log-likelihood.
4. Repeat steps 1-3 recursively in the subsamples until these become too small or there is no significant association/instability (or some other stopping criterion is reached).

*Optionally:* Reduce the size of the tree by pruning branches of splits that do not improve the model fit sufficiently (e.g., based on information criteria).

The `mob()` function in *partykit* implements this general algorithm and allows different model-fitting functions to be plugged in, provided they allow extracting the estimated parameters, the maximized log-likelihood, and the corresponding matrix of score (or gradient) contributions for each observation. The details are described in a vignette within the package: Parties, Models, Mobsters: A New Implementation of Model-Based Recursive Partitioning in R.
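To make this interface concrete before turning to SEMs, here is a minimal toy mobster (purely illustrative, not from the paper) that fits a univariate normal distribution by maximum likelihood and returns the list that `mob()` expects:

```
normal_fit <- function(y, x = NULL, start = NULL, weights = NULL, offset = NULL,
                       ..., estfun = FALSE, object = FALSE) {
  m <- mean(y)
  v <- mean((y - m)^2)  ## ML estimate of the variance
  list(
    coefficients = c(mu = m, sigma2 = v),
    ## mob() minimizes, so return the negative log-likelihood
    objfun = -sum(dnorm(y, mean = m, sd = sqrt(v), log = TRUE)),
    ## observation-wise scores: d log f / d mu and d log f / d sigma2
    estfun = if (estfun) cbind((y - m) / v, ((y - m)^2 - v) / (2 * v^2)) else NULL,
    object = NULL
  )
}
```

Calling `mob(y ~ z1 + z2, data = d, fit = normal_fit)` with a data frame `d` containing `y` and hypothetical covariates `z1` and `z2` would then partition the mean and variance of the normal distribution with respect to these covariates.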

As the *lavaan* package readily provides the quantities that MOB needs as input, we can easily set up a “mobster” function for SEMs. The `lavaan_fit()` function below takes a *lavaan* `model` definition and returns the actual fitting function with the interface required by `mob()`:

```
lavaan_fit <- function(model) {
  function(y, x = NULL, start = NULL, weights = NULL, offset = NULL, ...,
           estfun = FALSE, object = FALSE) {
    sem <- lavaan::lavaan(model = model, data = y, start = start)
    list(
      coefficients = stats4::coef(sem),
      objfun = -as.numeric(stats4::logLik(sem)),
      estfun = if(estfun) sandwich::estfun(sem) else NULL,
      object = if(object) sem else NULL
    )
  }
}
```

The fitting function just calls `lavaan()` using the `model`, the data `y`, and optionally the `start`ing values, ignoring the other arguments that `mob()` could handle. It then extracts the parameters (`coef()`), the log-likelihood (`logLik()`), and the score matrix (`estfun()`) using the generic functions from the corresponding packages and returns them in a list.

To illustrate fitting SEM trees with *partykit* and *lavaan*, we consider the example from the Using *lavaan* with *semtree* tutorial provided by Brandmaier *et al.* It is a linear growth curve model for data measured at five time points: `X1`, `X2`, `X3`, `X4`, and `X5`. The main parameters of interest are the intercept and the slope of the growth curves, while accounting for random variations and correlations among the involved variables according to this SEM. In *lavaan* notation:

```
growth_curve_model <- '
inter =~ 1*X1 + 1*X2 + 1*X3 + 1*X4 + 1*X5;
slope =~ 0*X1 + 1*X2 + 2*X3 + 3*X4 + 4*X5;
inter ~~ vari*inter; inter ~ meani*1;
slope ~~ vars*slope; slope ~ means*1;
inter ~~ cov*slope;
X1 ~~ residual*X1; X1 ~ 0*1;
X2 ~~ residual*X2; X2 ~ 0*1;
X3 ~~ residual*X3; X3 ~ 0*1;
X4 ~~ residual*X4; X4 ~ 0*1;
X5 ~~ residual*X5; X5 ~ 0*1;
'
```

The model can also be visualized using the following graphic taken from the tutorial:

In addition to the measurements at the five time points, the data set example1.txt provides three covariates (`agegroup`, `training`, and `noise`) that can be used to capture individual differences in the model parameters. The data can be read and transformed to appropriate classes by:

```
ex1 <- data.frame(read.csv(
"https://brandmaier.de/semtree/wp-content/uploads/downloads/2012/07/example1.txt",
sep = "\t"))
ex1 <- transform(ex1,
agegroup = factor(agegroup),
training = factor(training),
noise = factor(noise))
```

With the data, model, and mobster function available, it is easy to fit the MOB tree with SEMs in every node of the tree. The five measurements are the dependent variables (`y`) that need to be passed to the model as a `"data.frame"`; the three covariates are the explanatory variables:

```
library("partykit")
tr <- mob(X1 + X2 + X3 + X4 + X5 ~ agegroup + training + noise, data = ex1,
fit = lavaan_fit(growth_curve_model),
control = mob_control(ytype = "data.frame"))
```

The resulting tree `tr` correctly detects the three subgroups that were simulated for the data by Brandmaier *et al.* It can be visualized (with somewhat larger terminal nodes, all dropped to the bottom of the display):

```
plot(tr, drop = TRUE, tnex = 2)
```

The parameter estimates can also be extracted by `coef(tr)`:

```
t(coef(tr))
## 2 4 5
## vari 0.086 0.080 0.105
## meani 5.020 2.003 1.943
## vars 0.500 1.627 0.675
## means -0.144 -1.082 -0.495
## cov -0.013 -0.041 0.028
## residual 0.050 0.047 0.052
## residual 0.050 0.047 0.052
## residual 0.050 0.047 0.052
## residual 0.050 0.047 0.052
## residual 0.050 0.047 0.052
```

The main parameters of interest are `meani`, the mean intercept, and `means`, the mean slope, both of which vary across the subgroups defined by `agegroup` and `training`: In node 2 the intercept is about 5 while in nodes 4 and 5 it is around 2. The slope is almost zero in node 2, about -1 in node 4, and about -0.5 in node 5. The `residual` variance is restricted to be constant across the five time points and hence repeated in the output.

By extracting the node-specific `meani` and `means` parameters, the expected growth can also be visualized in the following way:

```
gr <- coef(tr)[, "meani"] + outer(coef(tr)[, "means"], 0:4)
cl <- palette.colors(4, "Okabe-Ito")[-1]
matplot(t(gr), type = "o", pch = 19, col = cl,
ylab = "Expected growth", xlab = "Time", xlim = c(1, 5.2))
text(5, gr[, 5], paste("Node", rownames(gr)), col = cl, pos = 3)
```

Finally, using a custom printing function that only shows the subgroup size and the first six parameters, the tree can be nicely printed as:

```
node_format <- function(node) {
  c("",
    sprintf("n = %s", node$nobs),
    capture.output(print(cbind(node$coefficients[1:6]), digits = 2L))[-1L])
}
print(tr, FUN = node_format)
## Model-based recursive partitioning (lavaan_fit(growth_curve_model))
##
## Model formula:
## X1 + X2 + X3 + X4 + X5 ~ agegroup + training + noise
##
## Fitted party:
## [1] root
## | [2] agegroup in 0
## | n = 200
## | vari 0.086
## | meani 5.020
## | vars 0.500
## | means -0.144
## | cov -0.013
## | residual 0.050
## | [3] agegroup in 1
## | | [4] training in 0
## | | n = 100
## | | vari 0.080
## | | meani 2.003
## | | vars 1.627
## | | means -1.082
## | | cov -0.041
## | | residual 0.047
## | | [5] training in 1
## | | n = 100
## | | vari 0.105
## | | meani 1.943
## | | vars 0.675
## | | means -0.495
## | | cov 0.028
## | | residual 0.052
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
## Number of parameters per node: 10
## Objective function: 1330.735
```

The main purpose of this blog post was to show that it is relatively simple to fit model-based trees with custom models using the general `mob()` infrastructure from the *partykit* package. Specifically, it is easy to fit SEM trees because the *lavaan* package readily provides all necessary components. As I had provided this as feedback to Arnold *et al.* and encouraged them to drill a bit deeper to better understand the differences between their adapted SEM tree algorithm and MOB, I thought I should share the code as it might be useful to others as well.

One important difference between the new SEM tree algorithm and the current MOB implementation is the determination of the best split point. The new SEM tree also uses the scores for this, while MOB is based on the log-likelihood in the subgroups and is hence slower when searching for splits in numeric covariates with many possible split points. While we had also experimented with score-based split point estimation in *party*, this has never been released and is currently not available in *partykit*. However, we are working on making the split point selection more flexible in *partykit*.

Of course, fitting the tree model is actually just the first step in an analysis of subgroups in an SEM. The subsequent steps for analyzing and interpreting the resulting tree model are at least as important. The work by Brandmaier and his co-authors and their *semtree* package provide much more guidance on this.

Hofmann M, Gatu C, Kontoghiorghes EJ, Colubi A, Zeileis A (2020). “lmSubsets: Exact Variable-Subset Selection in Linear Regression for R.” *Journal of Statistical Software*, **93**(3), 1-21. doi:10.18637/jss.v093.i03

An R package for computing the all-subsets regression problem is presented. The proposed algorithms are based on computational strategies recently developed. A novel algorithm for the best-subset regression problem selects subset models based on a predetermined criterion. The package user can choose from exact and from approximation algorithms. The core of the package is written in C++ and provides an efficient implementation of all the underlying numerical computations. A case study and benchmark results illustrate the usage and the computational efficiency of the package.

https://CRAN.R-project.org/package=lmSubsets

Advances in numerical weather prediction (NWP) have played an important role in the increase of weather forecast skill over the past decades. Numerical models simulate physical systems that operate at a large, typically global, scale. The horizontal (spatial) resolution is limited by the computational power available today and hence, typically, the NWP outputs are post-processed to correct for local and unresolved effects in order to obtain forecasts for specific locations. So-called model output statistics (MOS) develops a regression relationship based on past meteorological observations of the variable to be predicted and forecasted NWP quantities at a certain lead time. Variable-subset selection is often employed to determine which NWP outputs should be included in the regression model for a specific location.
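As a toy illustration of the MOS idea (not part of the `lmSubsets` case study), the following base-R sketch with simulated data regresses an "observed" temperature on an imperfect NWP forecast plus harmonic seasonal terms; all variable names and numeric values below are made up for illustration:

```r
## Simulated MOS example: observed temperature is explained by an NWP
## forecast plus harmonic seasonal terms that the NWP does not fully resolve.
set.seed(42)
doy <- 1:730                                     # two years of daily data
truth <- 10 + 8 * sin(2 * pi * doy / 365)        # underlying seasonal signal
nwp <- truth + rnorm(730, sd = 2)                # imperfect NWP forecast
obs <- 0.9 * nwp + 2 * cos(2 * pi * doy / 365) + rnorm(730)
d <- data.frame(obs = obs, nwp = nwp,
                sin1 = sin(2 * pi * doy / 365),
                cos1 = cos(2 * pi * doy / 365))
## the MOS regression recovers the NWP weight and the residual seasonality
mos <- lm(obs ~ nwp + sin1 + cos1, data = d)
round(coef(mos), 2)
```

The significant harmonic terms next to `nwp` mirror the situation in the case study: seasonal fluctuations not fully resolved by the numerical model are corrected by the regression.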

Here, the `lmSubsets` package is used to build a MOS regression model predicting temperature at Innsbruck Airport, Austria, based on data from the Global Ensemble Forecast System. The data frame `IbkTemperature` contains 1824 daily cases for 42 variables: the temperature at Innsbruck Airport (observed), 36 NWP outputs (forecasted), and 5 deterministic time trend/season patterns. The NWP variables include quantities pertaining to temperature (e.g., 2-meter above ground, minimum, maximum, soil), precipitation, wind, and fluxes, among others.

First, package and data are loaded and the few missing values are omitted for simplicity.

```
library("lmSubsets")
data("IbkTemperature", package = "lmSubsets")
IbkTemperature <- na.omit(IbkTemperature)
```

A simple output model for the observed temperature (`temp`) is constructed, which will serve as the reference model. It consists of the 2-meter temperature NWP forecast (`t2m`), a linear trend component (`time`), as well as seasonal components with annual (`sin`, `cos`) and bi-annual (`sin2`, `cos2`) harmonic patterns.

```
MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2,
data = IbkTemperature)
```

When looking at `summary(MOS0)` or the coefficient table below, it can be observed that despite the inclusion of the NWP variable `t2m`, the coefficients for the deterministic components remain significant, which indicates that the seasonal temperature fluctuations are not fully resolved by the numerical model.

|             | MOS0         |           | MOS1         |          | MOS2         |          |
|-------------|--------------|-----------|--------------|----------|--------------|----------|
| (Intercept) | -345.252 **  | (109.212) | -666.584 *** | (95.349) | -661.700 *** | (95.225) |
| t2m         | 0.318 ***    | (0.016)   | 0.055        | (0.029)  |              |          |
| time        | 0.132 *      | (0.054)   | 0.149 **     | (0.047)  | 0.147 **     | (0.047)  |
| sin         | -1.234 ***   | (0.126)   | 0.522 ***    | (0.147)  | 0.811 ***    | (0.120)  |
| cos         | -6.329 ***   | (0.164)   | -0.812 **    | (0.273)  |              |          |
| sin2        | 0.240 *      | (0.110)   | -0.794 ***   | (0.119)  | -0.870 ***   | (0.118)  |
| cos2        | -0.332 **    | (0.109)   | -1.067 ***   | (0.101)  | -1.128 ***   | (0.097)  |
| sshnf       |              |           | 0.016 ***    | (0.004)  | 0.018 ***    | (0.004)  |
| vsmc        |              |           | 20.200 ***   | (3.115)  | 20.181 ***   | (3.106)  |
| tmax2m      |              |           | 0.145 ***    | (0.037)  | 0.181 ***    | (0.023)  |
| st          |              |           | 1.077 ***    | (0.051)  | 1.142 ***    | (0.043)  |
| wr          |              |           | 0.450 ***    | (0.109)  | 0.505 ***    | (0.103)  |
| t2pvu       |              |           | 0.064 ***    | (0.011)  | 0.149 ***    | (0.028)  |
| mslp        |              |           |              |          | -0.000 ***   | (0.000)  |
| p2pvu       |              |           |              |          | -0.000 **    | (0.000)  |
| AIC         | 9493.602     |           | 8954.907     |          | 8948.182     |          |
| BIC         | 9537.650     |           | 9031.992     |          | 9025.267     |          |
| RSS         | 19506.469    |           | 14411.122    |          | 14357.943    |          |
| Sigma       | 3.281        |           | 2.825        |          | 2.820        |          |
| R-squared   | 0.803        |           | 0.854        |          | 0.855        |          |

*** p < 0.001; ** p < 0.01; * p < 0.05.
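As an aside, the fit statistics in the bottom rows of the table can be obtained from any fitted `lm` object; `fit_stats()` below is a hypothetical helper for illustration, not part of `lmSubsets`:

```r
## Hedged helper: compute the fit statistics reported in the table
## (AIC, BIC, residual sum of squares, residual sd, R-squared) for an lm fit.
fit_stats <- function(m) {
  c(AIC = AIC(m), BIC = BIC(m), RSS = deviance(m),
    Sigma = summary(m)$sigma, "R-squared" = summary(m)$r.squared)
}
## e.g., round(fit_stats(MOS0), 3) with MOS0 fitted as above
```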

Next, the reference model is extended with selected regressors taken from the remaining 35 NWP variables.

```
MOS1_best <- lmSelect(temp ~ ., data = IbkTemperature,
include = c("t2m", "time", "sin", "cos", "sin2", "cos2"),
penalty = "BIC", nbest = 20)
MOS1 <- refit(MOS1_best)
```

Best-subset regression with respect to the BIC criterion is employed to determine pertinent variables in addition to the regressors already used in `MOS0`. The 20 best submodels are computed; the selected variables can be visualized by `image(MOS1_best, hilite = 1)` (see below) and the corresponding BIC values by `plot(MOS1_best)`. All in all, these 20 best models are very similar, with only a few variables switching between being included and excluded. Using the `refit()` method, the best submodel can be extracted and fitted via `lm()`. Summary statistics are shown in the table above. Overall, the model `MOS1` improves the model fit considerably compared to the basic `MOS0` model.

Finally, an all-subsets regression is conducted instead of the cheaper best-subset regression. It considers all 41 variables without any restrictions to determine the best model in terms of BIC that can be found for this data set.

```
MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature)
MOS2 <- refit(lmSelect(MOS2_all, penalty = "BIC"))
```

Again, the best model is refitted with `lm()` to facilitate further inspection; see the summary table above.

The best-BIC models `MOS1` and `MOS2` both have 13 regressors. The deterministic trend and all but one of the harmonic seasonal components are retained in `MOS2` even though they are not forced into the model (as in `MOS1`). In addition, `MOS1` and `MOS2` share six NWP outputs relating to temperature (`tmax2m`, `st`, `t2pvu`), hydrology (`vsmc`, `wr`), and heat flux (`sshnf`), while `MOS2` additionally selects the pressure quantities `mslp` and `p2pvu`. However, and most remarkably, `MOS2` does not include the direct 2-meter temperature output from the NWP model (`t2m`). In fact, `t2m` is not included in any of the 20 submodels (sizes 8 to 27) shown by `image(MOS2_all, size = 8:27, hilite = 1, hilite_penalty = "BIC")`, whereas the temperature quantities `tmax2m`, `st`, and `t2pvu` are included in all of them. (Additionally, `plot(MOS2_all)` would show the associated BIC and residual sum of squares across the different model sizes.) The summary statistics reveal that both `MOS1` and `MOS2` improve considerably over the simple reference model `MOS0`, with `MOS2` being only slightly better than `MOS1`.

Lang MN, Schlosser L, Hothorn T, Mayr GJ, Stauffer R, Zeileis A (2020). *“Circular Regression Trees and Forests with an Application to Probabilistic Wind Direction Forecasting”*, arXiv:2001.00412, arXiv.org E-Print Archive. https://arXiv.org/abs/2001.00412

While circular data occur in a wide range of scientific fields, the methodology for distributional modeling and probabilistic forecasting of circular response variables is rather limited. Most of the existing methods are built on the framework of generalized linear and additive models, which are often challenging to optimize and interpret. Therefore, building on previous ideas for trees modeling circular means, we suggest a distributional approach for regression trees and random forests yielding probabilistic forecasts based on the von Mises distribution. The resulting tree-based models simplify the estimation process by using the available covariates for partitioning the data into sufficiently homogeneous subgroups so that a simple von Mises distribution without further covariates can be fitted to the circular response in each subgroup. These circular regression trees are straightforward to interpret, can capture nonlinear effects and interactions, and automatically select the relevant covariates that are associated with either location and/or scale changes in the von Mises distribution. Combining an ensemble of circular regression trees to a circular regression forest can regularize and smooth the covariate effects. The new methods are evaluated in a case study on probabilistic wind direction forecasting at two Austrian airports, considering other common approaches as a benchmark.

R package `circtree` from the R-Forge project `partykit`: https://R-Forge.R-project.org/R/?group_id=261

Basic examples using artificial data:

```
install.packages("partykit")
install.packages("disttree", repos = "http://R-Forge.R-project.org")
install.packages("circtree", repos = "http://R-Forge.R-project.org")
library("circtree")
example("circtree", ask = FALSE)
vignette("circtree", package = "circtree")
```

The basis for the proposed distributional modeling of the circular responses is the von Mises distribution, also known as the “circular normal distribution”. It is based on a location parameter μ in [0, 2 π) and a concentration parameter κ > 0.
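For intuition, the von Mises density and its maximum likelihood estimates can be sketched in base R only (the helper names below are hypothetical and not the `circtree` API): the location μ is estimated by the circular mean and the concentration κ solves A₁(κ) = I₁(κ)/I₀(κ) = R̄, the mean resultant length.

```r
## von Mises density with location mu in [0, 2*pi) and concentration kappa > 0
dvonmises <- function(x, mu, kappa) {
  exp(kappa * cos(x - mu)) / (2 * pi * besselI(kappa, 0))
}

## ML estimation: circular mean for mu, root of A1(kappa) = Rbar for kappa
fit_vonmises <- function(x) {
  S <- mean(sin(x)); C <- mean(cos(x))
  mu <- atan2(S, C) %% (2 * pi)        # circular mean in [0, 2*pi)
  Rbar <- sqrt(S^2 + C^2)              # mean resultant length in [0, 1]
  A1 <- function(k) besselI(k, 1, expon.scaled = TRUE) /
                    besselI(k, 0, expon.scaled = TRUE)
  kappa <- uniroot(function(k) A1(k) - Rbar, c(1e-6, 500))$root
  c(mu = mu, kappa = kappa)
}

set.seed(1)
## wrapped-normal draws around pi, only an approximation to von Mises data
x <- (pi + rnorm(1000, sd = 0.3)) %% (2 * pi)
fit_vonmises(x)
```

The recovered location is close to π and the concentration is high, matching the narrow spread of the simulated sample.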

The figure below illustrates a model, fitted by maximum likelihood, for circular data in the interval [0, 2 π). It can either be drawn on a linearized scale (left) or circular scale (right). In both cases the empirical histogram (gray bars) and fitted von Mises density (red line) are depicted along with the estimated location parameter (red hand).

The regression trees and forests extend this approach by employing an adaptive local likelihood approach: For each observation, the parameters μ and κ are estimated only locally in a neighborhood, defined either by the nodes of a single tree or weighted by the nodes of a forest.

To provide a first impression of the methodology in practice (motivated by air traffic management), a circular regression tree is employed for probabilistic wind direction forecasting. More specifically, we obtain 1-hourly nowcasts of wind direction at Innsbruck Airport. As the airport is located at the bottom of a narrow valley within the European Alps, it is natural to employ tree-based regression models as there can be abrupt changes in the wind direction rather than smooth changes.

Due to the short lead time, only observation data (41,979 data points) but no numerical weather predictions are employed. The data are obtained from 4 stations at Innsbruck Airport as well as 6 nearby weather stations. The base variables are: wind direction, wind (gust) speed, temperature, (reduced) air pressure, and relative humidity. Based on these, 260 covariates are computed via means/minima/maxima, temporal changes, and spatial differences towards the airport. The resulting regression tree is shown below along with the empirical (gray) and fitted von Mises (red) wind direction distribution in each terminal node.

Based on the fitted location parameters μ, the subgroups can be distinguished into the following wind regimes:

- Up-valley winds blowing from the valley mouth towards the upper valley (from east to west, nodes 4 and 5).
- Downslope winds blowing across the Alpine crest along the intersecting valley towards Innsbruck (from south-east to north-west, node 8).
- Down-valley winds blowing in the direction of the valley mouth (from west to east, nodes 10, 12 and 13).
- Node 7 captures observations with rather low wind speeds that cannot be clearly distinguished into specific wind regimes and are consequently associated with a very low estimated concentration parameter κ, i.e., a high estimated variance.

In terms of covariates, the lagged wind “direction” (also known as “persistence”) is mostly responsible for distinguishing the broad range of wind regimes listed above while the pressure gradients and wind speed separate the data into subgroups with high vs. low precision.

A more extensive case study of circular regression trees and also circular random forests applied to probabilistic wind direction forecasting at Innsbruck Airport and Vienna International Airport is presented in Section 4 of the paper, along with a benchmark against commonly-used alternative approaches.

Umlauf N, Klein N, Simon T, Zeileis A (2019). *“bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond).”* arXiv:1909.11784, arXiv.org E-Print Archive. https://arxiv.org/abs/1909.11784

Over the last decades, the challenges in applied regression and in predictive modeling have been changing considerably: (1) More flexible model specifications are needed as big(ger) data become available, facilitated by more powerful computing infrastructure. (2) Full probabilistic modeling rather than predicting just means or expectations is crucial in many applications. (3) Interest in Bayesian inference has been increasing both as an appealing framework for regularizing or penalizing model estimation as well as a natural alternative to classical frequentist inference. However, while there has been a lot of research in all three areas, also leading to associated software packages, a modular software implementation that allows to easily combine all three aspects has not yet been available. For filling this gap, the R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly-efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as “Lego bricks” encompassing various distributions (exponential family, Cox, joint models, …), regression terms (linear, splines, random effects, tensor products, spatial fields, …), and estimators (MCMC, backfitting, gradient boosting, lasso, …). It is demonstrated how these can be easily recombined to make classical models more flexible or create new custom models for specific modeling challenges.

CRAN package: https://CRAN.R-project.org/package=bamlss

Replication script: bamlss.R

Project web page: http://www.bamlss.org/

To illustrate that `bamlss` follows the same familiar workflow as other regression packages, such as the basic `stats` package or the well-established `mgcv` or `gamlss`, two quick examples are provided: a Bayesian logit model and a location-scale model where both mean and variance of a normal response depend on a smooth term.

The logit model is a basic labor force participation model, a standard application in microeconometrics. Here, the data are loaded from the `AER` package and the same model formula is specified that would also be used for `glm()` (as shown on `?SwissLabor`).

```
data("SwissLabor", package = "AER")
f <- participation ~ income + age + education + youngkids + oldkids + foreign + I(age^2)
```

Then, the model can be estimated with `bamlss()` using essentially the same look-and-feel as for `glm()`. The default is to use Markov chain Monte Carlo after obtaining initial parameters via backfitting.

```
library("bamlss")
set.seed(123)
b <- bamlss(f, family = "binomial", data = SwissLabor)
summary(b)
## Call:
## bamlss(formula = f, family = "binomial", data = SwissLabor)
## ---
## Family: binomial
## Link function: pi = logit
## ---
## Formula pi:
## ---
## participation ~ income + age + education + youngkids + oldkids +
## foreign + I(age^2)
## -
## Parametric coefficients:
## Mean 2.5% 50% 97.5% parameters
## (Intercept) 6.15503 1.55586 5.99204 11.11051 6.196
## income -1.10565 -1.56986 -1.10784 -0.68652 -1.104
## age 3.45703 2.05897 3.44567 4.79139 3.437
## education 0.03354 -0.02175 0.03284 0.09223 0.033
## youngkids -1.17906 -1.51099 -1.17683 -0.83047 -1.186
## oldkids -0.24122 -0.41231 -0.24099 -0.08054 -0.241
## foreignyes 1.16749 0.76276 1.17035 1.55624 1.168
## I(age^2) -0.48990 -0.65660 -0.49205 -0.31968 -0.488
## alpha 0.87585 0.32301 0.99408 1.00000 NA
## ---
## Sampler summary:
## -
## DIC = 1033.325 logLik = -512.7258 pd = 7.8734
## runtime = 1.417
## ---
## Optimizer summary:
## -
## AICc = 1033.737 converged = 1 edf = 8
## logLik = -508.7851 logPost = -571.3986 nobs = 872
## runtime = 0.012
## ---
```

The summary is based on the MCMC samples, which suggest “significant” effects for all covariates except `education`, since its 95% credible interval contains zero. In addition, the acceptance probabilities `alpha` are reported and indicate proper behavior of the MCMC algorithm. The column `parameters` shows the respective posterior mode estimates of the regression coefficients, which are calculated by the upstream backfitting algorithm.

To show a more flexible regression model, we fit a distributional location-scale model to the well-known simulated motorcycle accident data, provided as `mcycle` in the `MASS` package.

Here, the relationship between head acceleration and time after impact is captured by smooth relationships in both mean and variance. See also `?gaulss` in the `mgcv` package for the same type of model estimated with REML rather than MCMC. We load the data, set up a list of two formulas with smooth terms (and an increased number of knots `k` for more flexibility), fit the model almost as usual, and then visualize the fitted terms along with 95% credible intervals.

```
data("mcycle", package = "MASS")
f <- list(accel ~ s(times, k = 20), sigma ~ s(times, k = 20))
set.seed(456)
b <- bamlss(f, data = mcycle, family = "gaussian")
plot(b, model = c("mu", "sigma"))
```

Finally, we show a more challenging case study. Here, emphasis is given to the illustration of the workflow. For more details on the background for the data and interpretation of the model, see Section 5 in the full paper linked above. The goal is to establish a probabilistic model linking positive counts of cloud-to-ground lightning discharges in the European Eastern Alps to atmospheric quantities from a reanalysis dataset.

The lightning measurements form the response variable and regressors are taken from atmospheric quantities in ECMWF’s ERA5 reanalysis data. Both have a temporal resolution of 1 hour for the years 2010-2018 and a spatial mesh size of approximately 32 km. The subset of the data analyzed, along with the fitted `bamlss` model, is provided in the `FlashAustria` data on R-Forge, which can be installed by

```
install.packages("FlashAustria", repos = "http://R-Forge.R-project.org")
```

To model only counts with at least one lightning discharge, we employ a negative binomial count distribution truncated at zero. The data can be loaded and the regression formula set up as follows:

```
data("FlashAustria", package = "FlashAustria")
f <- list(
counts ~ s(d2m, bs = "ps") + s(q_prof_PC1, bs = "ps") +
s(cswc_prof_PC4, bs = "ps") + s(t_prof_PC1, bs = "ps") +
s(v_prof_PC2, bs = "ps") + s(sqrt_cape, bs = "ps"),
theta ~ s(sqrt_lsp, bs = "ps")
)
```

The expectation `mu` of the underlying untruncated negative binomial model is modeled by various smooth terms for the atmospheric variables, while the overdispersion parameter `theta` only depends on one smooth regressor. To fit this challenging model, gradient boosting is employed in a first step to obtain initial values for the subsequent MCMC sampler. Running the model takes about 30 minutes on a well-equipped standard PC. In order to move quickly through the example we load the pre-computed model from the `FlashAustria` package:

```
data("FlashAustriaModel", package = "FlashAustria")
b <- FlashAustriaModel
```

But, of course, the model can also be refitted:

```
set.seed(111)
b <- bamlss(f, family = "ztnbinom", data = FlashAustriaTrain,
optimizer = boost, maxit = 1000, ## Boosting arguments.
thin = 5, burnin = 1000, n.iter = 6000) ## Sampler arguments.
```

To explore this model in some more detail, we show a couple of visualizations. First, the contributions of the individual model terms to the log-likelihood during gradient boosting are depicted.

```
pathplot(b, which = "loglik.contrib", intercept = FALSE)
```

Subsequently, we show traceplots of the MCMC samples (left) along with autocorrelations for two spline coefficients of the term `s(sqrt_cape)` in the model for `mu`.

```
plot(b, model = "mu", term = "s(sqrt_cape)", which = "samples")
```

Next, the effects of the terms `s(sqrt_cape)` and `s(q_prof_PC1)` from the model for `mu` and of the term `s(sqrt_lsp)` from the model for `theta` are shown along with 95% credible intervals derived from the MCMC samples.

```
plot(b, term = c("s(sqrt_cape)", "s(q_prof_PC1)", "s(sqrt_lsp)"),
rug = TRUE, col.rug = "#39393919")
```

Finally, estimated probabilities for observing 10 or more lightning counts (within one grid box) are computed and visualized. The reconstructions for four time points on September 15-16, 2001 are shown.

```
fit <- predict(b, newdata = FlashAustriaCase, type = "parameter")
fam <- family(b)
FlashAustriaCase$P10 <- 1 - fam$p(9, fit)
world <- rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")
library("ggplot2")
ggplot() + geom_sf(aes(fill = P10), data = FlashAustriaCase) +
colorspace::scale_fill_continuous_sequential("Oslo", rev = TRUE) +
geom_sf(data = world, col = "white", fill = NA) +
coord_sf(xlim = c(7.95, 17), ylim = c(45.45, 50), expand = FALSE) +
facet_wrap(~time, nrow = 2) + theme_minimal() +
theme(plot.margin = margin(t = 0, r = 0, b = 0, l = 0))
```
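For intuition, the zero-truncated negative binomial probability that `fam$p()` evaluates above can be sketched in base R; the parameter values `mu = 4` and `theta = 1` below are purely hypothetical:

```r
## CDF of a negative binomial truncated at zero, using only base R:
## P(Y <= q | Y >= 1) = (F(q) - F(0)) / (1 - F(0)) with untruncated CDF F
p_ztnbinom <- function(q, mu, theta) {
  p0 <- dnbinom(0, mu = mu, size = theta)
  (pnbinom(q, mu = mu, size = theta) - p0) / (1 - p0)
}

## probability of observing 10 or more counts (hypothetical parameters)
P10 <- 1 - p_ztnbinom(9, mu = 4, theta = 1)
P10
```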

Over the last week a big controversy over Hurricane Dorian emerged after US President Donald Trump tweeted on September 1 that Alabama (and other states) “will most likely be hit (much) harder than anticipated”. And after the Birmingham, Alabama, office of the National Weather Service contradicted Trump on Twitter, the US president defended his tweet claiming that earlier forecasts showed a high probability of Alabama being hit. The various pieces of “evidence” for this included a map, manually modified by a marker, leading to the hashtag #sharpiegate trending on Twitter.

Here, we won’t comment further on the controversy as it is undisputed among scientists that on September 1 the forecast path did not include Alabama. However, we will look into the maps that Trump claimed his tweet was based on and we will investigate whether poor color choice may have contributed to a misinterpretation of the maps. Specifically, on September 5 Trump tweeted:

Just as I said, Alabama was originally projected to be hit. The Fake News denies it! pic.twitter.com/elJ7ROfm2p

— Donald J. Trump (@realDonaldTrump) September 5, 2019

These maps convey the impression that there is an increased risk for Alabama and especially the three maps with the color coding are rather suggestive. A closer look, though, reveals that the maps are from August 30, have a 5-day forecasting horizon, and pertain to probabilities for tropical-storm-force winds (i.e., not the cone of the hurricane!), with South-East Alabama only having a 5-20% probability.

Although the information in the maps can be correctly decoded using their titles and legends, it can be argued that this may require some expertise or experience and that there is some potential for misinterpretations. For example, data visualization expert Alberto Cairo writes on Twitter: *“I just want to give him the benefit of the doubt, honestly. These maps are difficult to understand. For me the bad thing isn’t misinterpreting. It’s not apologizing […]”*

And one aspect that makes the maps prone to misinterpretations is the color choice for coding the probabilities. This is a so-called “rainbow color map” going from dark green over bright yellow to red and dark purple. Such color maps are still widely used although it has been widely recognized that they have a number of disadvantages. In the following, Reto Stauffer and I illustrate in detail what the specific problems of the top right map are and suggest a better alternative color choice.

On the left is the original map that was included in Trump’s tweet and on the right is our version with alternative colors. The main problem with the original colors is that the entire area with more than 5% probability is shaded with highly-saturated colors. Some would argue that the traffic light system (green-yellow-red) signals that the green areas are relatively “low risk”. However, we argue that the bright colors and the abrupt transition from “no color” (less than 5%) to “dark green” (for 5-10%) conveys a substantially increased risk for the entire shaded area.

One way to avoid this misinterpretation is to choose colors that go from light (low risk) to dark and colorful (high risk). This is what we have done in the map on the right, while preserving the hues from green over yellow and red to purple. The probabilities represented in the map are exactly the same, but the alternative color choice conveys much more intuitively which areas are affected by increased probabilities beyond 50% or 60% (which do not include Alabama).

In summary, the information in the map certainly does not represent strong evidence for Alabama being likely “hit hard” by Hurricane Dorian. However, the poor color choice facilitates such misinterpretations and better, more intuitive color alternatives are easily available.

Further problems with the original colors can be brought out by converting both maps to grayscale. This shows that not only the transition from below to above 5% is emphasized too much but also the discontinuous transitions between dark and light are very counterintuitive. In contrast, our alternative colors are much more intuitive because they become darker with increasing risk.
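The grayscale conversion can be emulated in base R; the sketch below uses standard luma weights as a rough approximation (in the blog we use the perceptual desaturation from the `colorspace` package instead):

```r
## Approximate grayscale conversion of a color palette using base R only.
## The weights are the common NTSC luma coefficients, a crude stand-in for
## the HCL-based desaturation provided by colorspace::desaturate().
to_gray <- function(cols) {
  m <- col2rgb(cols) / 255
  lum <- 0.299 * m["red", ] + 0.587 * m["green", ] + 0.114 * m["blue", ]
  gray(lum)                                  # gray with that luminance level
}
to_gray(c("#228B22", "#FFD700", "#8B0000"))  # green, yellow, dark red
```

Converting a rainbow palette this way makes the non-monotonic light-dark pattern immediately visible.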

Another related problem can be demonstrated by emulating green-deficient vision (deuteranopia), also showing discontinuities in the original colors.

Finally, we briefly comment on some technical details for constructing the alternative color map. We have used our R software package colorspace, which facilitates choosing color palettes via the HCL color model, capturing the perceptual dimensions “hue” (type of color, dominant wavelength), “chroma” (colorfulness), and “luminance” (brightness). In the two plots below we show the HCL spectrum of both sets of colors.

For the original colors on the left we see that luminance (blue line) is non-monotonic, chroma (green line) is high throughout, and hue (red line) goes from green to purple. For our alternative colors we have used essentially the same hues. However, luminance covers a similar range as in the original colors but in a monotonic fashion. And chroma is low for colors associated with low risk.

The R code snippet below shows how the alternative colors can be computed using our `colorspace` package:

```
colorspace::sequential_hcl(10, palette = "Purple-Yellow", rev = TRUE,
c1 = 70, cmax = 100, l2 = 80, h2 = 500)
```

The starting point is the sequential `Purple-Yellow` palette that we have used previously for risk maps. However, we modify the low-risk hue from yellow to green (hue = 140) and go in the opposite direction through the color wheel (hence hue = 500 = 140 + 360 is used). Moreover, we increase chroma for the high-risk colors and decrease luminance somewhat for the low-risk colors (to be of similar brightness as the gray map in the background). Further illustrations of problems with rainbow color maps, along with more details and explanations, are available on our web site http://colorspace.R-Forge.R-project.org/articles/endrainbow.html.

*(Authors: Achim Zeileis, Jason C. Fisher, Kurt Hornik, Ross Ihaka, Claire D. McWhite, Paul Murrell, Reto Stauffer, Claus O. Wilke)*

The R package “colorspace” (http://colorspace.R-Forge.R-project.org/) provides a flexible toolbox for selecting individual colors or color palettes, manipulating these colors, and employing them in statistical graphics and data visualizations. In particular, the package provides a broad range of color palettes based on the HCL (Hue-Chroma-Luminance) color space. The three HCL dimensions have been shown to match those of the human visual system very well, thus facilitating intuitive selection of color palettes through trajectories in this space.

Namely, general strategies for three types of palettes are provided: (1) Qualitative for coding categorical information, i.e., where no particular ordering of categories is available and every color should receive the same perceptual weight. (2) Sequential for coding ordered/numeric information, i.e., going from high to low (or vice versa). (3) Diverging for coding ordered/numeric information around a central neutral value, i.e., where colors diverge from neutral to two extremes.
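As a base-R illustration of the sequential strategy (using `grDevices::hcl()` directly rather than the ready-made palettes shipped in `colorspace`), luminance increases and chroma decreases monotonically at a fixed hue:

```r
## Sketch of a sequential HCL palette: fixed hue, monotonic trajectories
## in chroma (colorful -> grayish) and luminance (dark -> light).
seq_hcl <- function(n, h = 260) {
  hcl(h = h,
      c = seq(80, 10, length.out = n),
      l = seq(30, 90, length.out = n))
}
seq_hcl(5)  # five hex colors from dark blue to light gray-blue
```

Diverging palettes follow the same idea with two such trajectories meeting in a light, neutral color in the middle.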

To aid selection and application of these palettes, the package provides scales for use with ggplot2; shiny (and tcltk) apps for interactive exploration (see also http://hclwizard.org/); visualizations of palette properties; accompanying manipulation utilities (like desaturation and lighten/darken); and emulation of color vision deficiencies.

Links to: PDF slides, YouTube video, R code, arXiv working paper.

Furthermore, replication code for the introductory example (influenza risk map) was already provided in the recent endrainbow blog post.
