### Table of Contents

## Allowing $\tau^2$ to Differ Across Subgroups

In a meta-analysis, we often want to examine if the size of a particular effect differs across different groups of studies. While the focus in such a 'subgroup analysis' tends to be on the size of the effect across the different groups, we might also be interested in examining whether the amount of heterogeneity differs across groups (i.e., whether the effect sizes are more/less consistent in some of the groups). Below, I illustrate different methods for conducting such a subgroup analysis that not only allows the size of the effect but also the amount of heterogeneity to differ across groups.

### Data Preparation

For this illustration, we will use the data from the meta-analysis by Bangert-Drowns et al. (2004) on the effectiveness of writing-to-learn interventions on academic achievement.

library(metafor) dat <- dat.bangertdrowns2004 dat

id author year grade . ni yi vi 1 Ashworth 1992 4 . 60 0.650 0.070 2 Ayers 1993 2 . 34 -0.750 0.126 3 Baisch 1990 2 . 95 -0.210 0.042 4 Baker 1994 4 . 209 -0.040 0.019 5 Bauman 1992 1 . 182 0.230 0.022 . . . . . . . . 44 Weiss & Walters 1980 4 . 25 0.630 0.168 45 Wells 1986 1 . 250 0.040 0.016 46 Willey 1988 3 . 51 1.460 0.099 47 Willey 1988 2 . 46 0.040 0.087 48 Youngberg 1989 4 . 56 0.250 0.072

Variable `yi`

contains the standardized mean differences (with positive values indicating a higher mean level of academic achievement in the intervention group) and variable `vi`

contains the corresponding sampling variances. Variable `grade`

indicates the grade level of the participants (1 = elementary school; 2 = middle school; 3 = high-school; 4 = college). The variable is coded numerically, but we want to treat it as a categorical grouping variable in the analyses below, so we will turn it into a factor.

dat$grade <- factor(dat$grade)

### Random-Effects Model

To start, we fit a standard random-effects model to the data.

res <- rma(yi, vi, data=dat) res

Random-Effects Model (k = 48; tau^2 estimator: REML) tau^2 (estimated amount of total heterogeneity): 0.0499 (SE = 0.0197) tau (square root of estimated tau^2 value): 0.2235 I^2 (total heterogeneity / total variability): 58.37% H^2 (total variability / sampling variability): 2.40 Test for Heterogeneity: Q(df = 47) = 107.1061, p-val < .0001 Model Results: estimate se zval pval ci.lb ci.ub 0.2219 0.0460 4.8209 <.0001 0.1317 0.3122 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results indicate that the estimated (average) effect size is positive (0.2219) and significantly different from 0 (p < .0001). For standardized mean differences, the size of the effect could be considered relatively small (although such labels are somewhat arbitrary). The Q-test suggests that the true effects are heterogeneous (i.e., that the effectiveness of writing-to-learn interventions differs across studies), with an estimate of $\tau^2$ (i.e., the amount of variance in the true effects) approximately equal to 0.05.

### Subgroup Analysis via Meta-Regression

To examine whether the size of the effect differs across grade levels, we can fit a meta-regression model that uses the grade factor as a moderator.

res <- rma(yi, vi, mods = ~ grade, data=dat) res

Mixed-Effects Model (k = 48; tau^2 estimator: REML) tau^2 (estimated amount of residual heterogeneity): 0.0539 (SE = 0.0216) tau (square root of estimated tau^2 value): 0.2322 I^2 (residual heterogeneity / unaccounted variability): 59.15% H^2 (unaccounted variability / sampling variability): 2.45 R^2 (amount of heterogeneity accounted for): 0.00% Test for Residual Heterogeneity: QE(df = 44) = 102.0036, p-val < .0001 Test of Moderators (coefficients 2:4): QM(df = 3) = 5.9748, p-val = 0.1128 Model Results: estimate se zval pval ci.lb ci.ub intrcpt 0.2639 0.0898 2.9393 0.0033 0.0879 0.4399 ** grade2 -0.3727 0.1705 -2.1856 0.0288 -0.7069 -0.0385 * grade3 0.0248 0.1364 0.1821 0.8555 -0.2425 0.2922 grade4 -0.0155 0.1160 -0.1338 0.8935 -0.2429 0.2118 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The omnibus test of the factor is not significant (p = .1128) and hence we do not have statistically significant evidence to reject the null hypothesis that the average effect is the same for all grade levels, although the individual contrasts suggest that the average effect for grade level 2 (middle school) may be quite a bit lower (and even significantly so, with p = .0288) than the average effect for grade level 1 (elementary school).^{1)}

We can use the `predict()`

function to obtain the estimated average effect for each grade level.

predict(res, newmods=rbind(c(0,0,0),diag(3)))

pred se ci.lb ci.ub pi.lb pi.ub 1 0.2639 0.0898 0.0879 0.4399 -0.2241 0.7519 2 -0.1088 0.1450 -0.3929 0.1753 -0.6454 0.4278 3 0.2888 0.1027 0.0874 0.4901 -0.2090 0.7865 4 0.2484 0.0735 0.1044 0.3924 -0.2290 0.7258

Hence, we see that the estimated average effect size is quite similar in grade levels 1, 3, and 4, but a bit lower (and in fact negative) in grade level 2 (middle school). However, as noted above, since the test of the factor as a whole is not actually significant, we should not overinterpret this finding.

### Separate Meta-Analyses

A subgroup analysis, comparing the size of the (average) effect across the four grade levels, could also be conducted by simply fitting separate random-effects models in the four groups. We can do this by using the `subset`

argument and collecting the four models in a list.^{2)}

res1 <- list(rma(yi, vi, data=dat, subset=grade==1), rma(yi, vi, data=dat, subset=grade==2), rma(yi, vi, data=dat, subset=grade==3), rma(yi, vi, data=dat, subset=grade==4))

For easier inspection of the results, we then extract the estimated average effect in each group, the corresponding standard error, the estimate of $\tau^2$, its standard error, and the number of studies included in each model, and put everything into a data frame.

comp1 <- data.frame(estimate = sapply(res1, coef), stderror = sqrt(sapply(res1, vcov)), tau2 = sapply(res1, \(x) x$tau2), se.tau2 = sapply(res1, \(x) x$se.tau2), k = sapply(res1, \(x) x$k)) rownames(comp1) <- paste0("grade", 1:4) round(comp1, digits=4)

estimate stderror tau2 se.tau2 k grade1 0.2571 0.0823 0.0412 0.0319 11 grade2 -0.0998 0.1367 0.0421 0.0688 6 grade3 0.3111 0.1300 0.1150 0.0790 10 grade4 0.2409 0.0694 0.0434 0.0297 21

These results are similar, but not quite identical to the ones we obtained above using the meta-regression approach. The reason for this is that the meta-regression approach above assumes that the amount of heterogeneity is the same within all four grade levels (i.e., the amount of 'residual heterogeneity' is assumed to be homoscedastic), while we obtain different estimates of $\tau^2$ if we fit a separate random-effects model within each of the levels of the grouping factor. Also, it is worth noting that some of the groups are quite small (especially the one for grade level 2), which should lead to further caution not to overinterpret the lower estimated effect for grade level 2.

We can now conduct a Wald-type test to test the null hypothesis that the average effect is identical across the different subgroups (analogous to the test of the factor as a whole) as follows.

wld <- rma(estimate, sei=stderror, mods = ~ rownames(comp1), data=comp1, method="FE") anova(wld)

Test of Moderators (coefficients 2:4): QM(df = 3) = 6.2496, p-val = 0.1001

The test is not significant (p = .1001), so we again do not obtain statistically significant evidence that the average effect differs across the groups.

We can also compare the various estimates of $\tau^2$. The values indicate quite similar amounts of heterogeneity in grade levels 1, 2, and 4, but a higher amount of heterogeneity in grade level 3 (i.e., this suggests that the effectiveness of writing-to-learn interventions might be more variable in high-school students compared to the other grade levels). Again, we should be cautious not to overinterpret this finding.

In principle, we could also do a Wald-type test using the $\tau^2$ estimates (and their corresponding standard errors) as input in the same manner as was done above.

wld <- rma(tau2, sei=se.tau2, mods = ~ rownames(comp1), data=comp1, method="FE") anova(wld)

Test of Moderators (coefficients 2:4): QM(df = 3) = 0.7939, p-val = 0.8509

Based on this, we cannot actually reject the null hypothesis $H_0: \tau^2_1 = \tau^2_2 = \tau^2_3 = \tau^2_4$ (p = .8510). However, I would generally not recommend this approach, because the test assumes that the sampling distributions of the $\tau^2$ estimates are approximately normal, which is unlikely to be the case (variance components tend to have skewed sampling distributions). Hence, Wald-type tests of variance components are unlikely to perform well. We will examine other ways of testing whether the amount of heterogeneity differs across groups further below.

### Model with Different Amounts of Heterogeneity

Instead of the `rma()`

function, we can use the `rma.mv()`

function to fit a meta-regression model that not only allows the average effect to differ across the subgroups, but also the amount of heterogeneity. Note that the data do not have a multilevel/multivariate structure – we are simply using the `rma.mv()`

function to fit a model that allows $\tau^2$ to differ across the subgroups. The appropriate syntax for this is as follows.

res2 <- rma.mv(yi, vi, mods = ~ 0 + grade, random = ~ grade | id, struct="DIAG", data=dat, cvvc=TRUE)

Instead of examining the full output, let us again just pull out the relevant pieces of information as we did above.

comp2 <- data.frame(estimate = coef(res2), stderror = sqrt(diag(vcov(res2))), tau2 = res2$tau2, se.tau2 = sqrt(diag(res2$vvc)), k = c(res2$g.levels.k)) round(comp2, digits=4)

estimate stderror tau2 se.tau2 k grade1 0.2571 0.0823 0.0412 0.0304 11 grade2 -0.0998 0.1367 0.0421 0.0647 6 grade3 0.3111 0.1300 0.1150 0.0869 10 grade4 0.2409 0.0694 0.0434 0.0336 21

These results are identical to the ones we obtained when fitting a separate random-effects model within each grade level (`comp1`

), except for the standard errors of the $\tau^2$ estimates. The reason for this will be discussed later on.

### Location-Scale Model with a Scale Factor

Another possibility to allow $\tau^2$ to differ across the subgroups is to make use of a location-scale model, where we use the grouping variable not only as a moderator of the average effect, but also as a predictor for the (log transformed) amount of heterogeneity in the effects.

res3 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat)

Again, let's collect the relevant pieces into a data frame.

comp3 <- data.frame(estimate = coef(res3)$beta, stderror = sqrt(diag(vcov(res3)$beta)), tau2 = predict(res3, newscale=diag(4), transf=exp)$pred) comp3$se.tau2 <- predict(res3, newscale=diag(4))$se * comp3$tau2 comp3$k <- c(table(dat$grade)) round(comp3, digits=4)

estimate stderror tau2 se.tau2 k grade1 0.2571 0.0823 0.0412 0.0304 11 grade2 -0.0998 0.1367 0.0421 0.0647 6 grade3 0.3111 0.1300 0.1150 0.0869 10 grade4 0.2409 0.0694 0.0434 0.0336 21

Note that the default 'link function' for the scale part of the model is the log link. Therefore, when using `predict()`

to compute the estimated $\tau^2$ value of each grade level, we use `transf=exp`

to transform the estimated log transformed $\tau^2$ values back to the original scale. We can see that this yields the same estimates for the amount of heterogeneity within the subgroups as we obtained above using the other two approaches. The estimated average effects for the different grade levels are also the same as the ones we obtained earlier.

**Sidenote:** The standard errors of the back-transformed $\tau^2$ estimates were obtained by taking the standard errors of the log transformed estimates and using the delta method (for an exponential transformation). Note that this yields the same values as obtained above using the `rma.mv()`

function.

### Location-Scale Model with an Identity Link

Instead of using a log link for the scale part of the model, it is possible to also use an identity link. In this case, the scale part directly predicts the value of $\tau^2$ based on the scale variable(s) in the model. This can be done with:

res4 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat, link="identity") comp4 <- data.frame(estimate = coef(res4)$beta, stderror = sqrt(diag(vcov(res4)$beta)), tau2 = predict(res4, newscale=diag(4))$pred, se.tau2 = predict(res4, newscale=diag(4))$se, k = c(table(dat$grade))) round(comp4, digits=4)

estimate stderror tau2 se.tau2 k grade1 0.2571 0.0823 0.0412 0.0304 11 grade2 -0.0998 0.1367 0.0421 0.0647 6 grade3 0.3111 0.1300 0.1150 0.0869 10 grade4 0.2409 0.0694 0.0434 0.0336 21

These results are again identical to the ones we obtained above using the previous two approaches.

### Comparing the Fit of the Models

We can compare the log likelihoods of the different models with:

c(res1 = sum(sapply(res1, logLik)), res2 = logLik(res2), res3 = logLik(res3), res4 = logLik(res4))

res1 res2 res3 res4 -15.5456 -15.5456 -15.5456 -15.5456

Note that for the first approach (where we fitted separate random-effects models within the various subgroups), we can simply add up the log likelihoods. As we can see above, the values are identical, indicating that we are essentially fitting the same model, just using different approaches/parameterizations thereof.

### Testing for Differences in $\tau^2$ Across Subgroups

At the beginning, we used a Wald-type test to test the null hypothesis of no differences in $\tau^2$ values across subgroups. This is not a recommended approach for the reasons outlined earlier.

As a better alternative, we can conduct a likelihood ratio test to compare a model that allows $\tau^2$ to differ across subgroups (as we have done above) with a model that assumes a common (homoscedastic) value of $\tau^2$.

For the approach using `rma.mv()`

, we can do this as follows:

res2 <- rma.mv(yi, vi, mods = ~ 0 + grade, random = ~ grade | id, struct="DIAG", data=dat) res0 <- update(res2, struct="ID") anova(res0, res2)

df AIC BIC AICc logLik LRT pval QE Full 8 47.0912 61.3647 51.2055 -15.5456 102.0036 Reduced 5 42.2069 51.1278 43.7858 -16.1034 1.1157 0.7733 102.0036

For the location-scale models, we can do this with:

res3 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat) res0 <- update(res3, scale = ~ 1) anova(res0, res3)

and

res4 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat, link="identity") res0 <- update(res3, scale = ~ 1) anova(res0, res4)

The output (not shown) is identical to the one above. The test is not significant (p = .7733) and hence we do not have statistically significant evidence for any differences in $\tau^2$ values across the different grade levels.

As we can see with this example, a nice property of the likelihood ratio test is that it is invariant under different parameterizations of the model. Hence, we obtain the same result with all three approaches. Moreover, the likelihood ratio test is likely (no pun intended) to have better statistical properties than the Wald-type test in general (although in this example, the conclusion happens to be the same).

An interesting (and generally even better) approach for testing for differences in $\tau^2$ values across subgroups is to use a permutation test. Such a test can be carried out with the `permutest()`

function. Note that the following code takes a few minutes to run.^{3)}

res3 <- rma(yi, vi, mods = ~ grade, scale = ~ grade, data=dat) permutest(res3, seed=1234)

Test of Scale Coefficients (coefficients 2:4):¹ QS(df = 3) = 1.2195, p-val = 0.5406

Here, we are interested in the test of the scale coefficients (and only the output for this test is shown). The permutation test is not significant (p = .5406), again indicating lack of evidence for differences in the amount of heterogeneity across the different grade levels. While computationally intensive, the permutation test is in a certain sense 'exact' and (assuming that it is run with a sufficiently large number of iterations) should provide the best statistical properties.

We can also run the permutation test based on the location-scale model that uses an identity link.

res4 <- rma(yi, vi, mods = ~ grade, scale = ~ grade, data=dat, link="identity") permutest(res4, seed=1234)

Test of Scale Coefficients (coefficients 2:4):¹ QS(df = 3) = 0.6652, p-val = 0.7982

Again, the test is not significant (p = .7982). This example demonstrates that the permutation test is not invariant under different parameterizations of the same model (note that the p-value differs from the one we obtained above using the default log link), which could be argued to be a disadvantage compared to the likelihood ratio test. However, generally, I would expect the conclusions from the permutation test to coincide irrespective of the parameterization in most cases.

### Standard Errors of $\tau^2$ Estimates

As pointed out earlier, the standard errors of the $\tau^2$ estimates obtained by fitting separate random-effects models within the subgroups (`comp1`

) with the `rma()`

function are slightly different than the ones we obtained from the other approaches. The reason for this is somewhat technical. The `rma()`

function computes the standard error of a $\tau^2$ estimate based on the Fisher information matrix, while the other approaches compute the standard errors based on the Hessian matrix (i.e., based on the observed Fisher information matrix). Both approaches provide estimates of the standard errors, but do not yield identical values. Which values are to be preferred is debatable. However, note that we do not actually need the standard errors when conducting a likelihood ratio test and while the permutation test does make use of the standard errors, it only does so for constructing the permutation distribution of the test statistic under the permuted data, not for conducting the test itself.

### Conclusion

As we can see above, models can be fitted that relax the assumption that $\tau^2$ is identical within subgroups when conducting a subgroup analysis.

A question that often comes up in this context is whether one should do so (as opposed to sticking to the more standard meta-regression model that assumes a single homoscedastic value for $\tau^2$ within subgroups). Two papers that addresses this issue to some extent are:

- Rubio-Aparicio, M., Sánchez-Meca, J., López-López, J. A., Botella, J. & Marín-Martínez, F. (2017). Analysis of categorical moderators in mixed-effects meta-analysis: Consequences of using pooled versus separate estimates of the residual between-studies variances. British Journal of Mathematical and Statistical Psychology, 70(3), 439-456. https://doi.org/https://doi.org/10.1111/bmsp.12092
- Rubio-Aparicio, M., López-López, J. A., Viechtbauer, W., Marín-Martínez, F., Botella, J., & Sánchez-Meca, J. (2020). Testing categorical moderators in mixed-effects meta-analysis in the presence of heteroscedasticity.
*Journal of Experimental Education, 88*(2), 288-310. https://doi.org/10.1080/00220973.2018.1561404

Generally, even if the $\tau^2$ values are heteroscedastic but we ignore this (and hence assume a common $\tau^2$ across subgroups), then this tends to have a relatively minor impact on the statistical properties of the test of the group factor (i.e., whether the size of the effect differs across the groups).^{4)} However, if we are specifically interested in examining whether the amount of heterogeneity differs across the subgroups (e.g., whether the effectiveness of writing-to-learn interventions tends to be more/less consistent within a particular grade level), then of course we need to allow $\tau^2$ to differ across the grouping variable.

As a final note, the example above also provides evidence for the correctness of the underlying code for fitting the various models. The algorithm underlying the `rma()`

function for fitting a standard random-effects model is entirely different than what is used in the `rma.mv()`

function and also what is used within the `rma()`

function for fitting location-scale models, yet all three approaches yielded identical results, at least when rounded to four decimal places (with less rounding, one will notice small discrepancies between the results, which are a consequence of using different model fitting algorithms).

^{1)}

^{2)}

`res1 <- lapply(1:4, \(g) rma(yi, vi, data=dat, subset=grade==g))`

to accomplish the same thing. Even more general would be `res1 <- lapply(sort(unique(dat$grade)), \(g) rma(yi, vi, data=dat, subset=grade==g))`

which doesn't require us to pre-specify the various values of the grouping variable.^{3)}

^{4)}

`test="knha"`

when fitting the meta-regression model with the `rma()`

function).