# The metafor Package

A Meta-Analysis Package for R

### Site Tools

tips:different_tau2_across_subgroups

## Allowing $\tau^2$ to Differ Across Subgroups

In a meta-analysis, we often want to examine if the size of a particular effect differs across different groups of studies. While the focus in such a 'subgroup analysis' tends to be on the size of the effect across the different groups, we might also be interested in examining whether the amount of heterogeneity differs across groups (i.e., whether the effect sizes are more/less consistent in some of the groups). Below, I illustrate different methods for conducting such a subgroup analysis that not only allows the size of the effect but also the amount of heterogeneity to differ across groups.

### Data Preparation

For this illustration, we will use the data from the meta-analysis by Bangert-Drowns et al. (2004) on the effectiveness of writing-to-learn interventions on academic achievement.

library(metafor)
dat <- dat.bangertdrowns2004
dat
id          author year grade .  ni     yi    vi
1        Ashworth 1992     4 .  60  0.650 0.070
2           Ayers 1993     2 .  34 -0.750 0.126
3          Baisch 1990     2 .  95 -0.210 0.042
4           Baker 1994     4 . 209 -0.040 0.019
5          Bauman 1992     1 . 182  0.230 0.022
.               .    .     . .   .      .     .
44 Weiss & Walters 1980     4 .  25  0.630 0.168
45           Wells 1986     1 . 250  0.040 0.016
46          Willey 1988     3 .  51  1.460 0.099
47          Willey 1988     2 .  46  0.040 0.087
48       Youngberg 1989     4 .  56  0.250 0.072

Variable yi contains the standardized mean differences (with positive values indicating a higher mean level of academic achievement in the intervention group) and variable vi contains the corresponding sampling variances. Variable grade indicates the grade level of the participants (1 = elementary school; 2 = middle school; 3 = high-school; 4 = college). The variable is coded numerically, but we want to treat it as a categorical grouping variable in the analyses below, so we will turn it into a factor.

dat$grade <- factor(dat$grade)

### Random-Effects Model

To start, we fit a standard random-effects model to the data.

res <- rma(yi, vi, data=dat)
res
Random-Effects Model (k = 48; tau^2 estimator: REML)

tau^2 (estimated amount of total heterogeneity): 0.0499 (SE = 0.0197)
tau (square root of estimated tau^2 value):      0.2235
I^2 (total heterogeneity / total variability):   58.37%
H^2 (total variability / sampling variability):  2.40

Test for Heterogeneity:
Q(df = 47) = 107.1061, p-val < .0001

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub
0.2219  0.0460  4.8209  <.0001  0.1317  0.3122  ***

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results indicate that the estimated (average) effect size is positive (0.2219) and significantly different from 0 (p < .0001). For standardized mean differences, the size of the effect could be considered relatively small (although such labels are somewhat arbitrary). The Q-test suggests that the true effects are heterogeneous (i.e., that the effectiveness of writing-to-learn interventions differs across studies), with an estimate of $\tau^2$ (i.e., the amount of variance in the true effects) approximately equal to 0.05.

### Subgroup Analysis via Meta-Regression

To examine whether the size of the effect differs across grade levels, we can fit a meta-regression model that uses the grade factor as a moderator.

res <- rma(yi, vi, mods = ~ grade, data=dat)
res
Mixed-Effects Model (k = 48; tau^2 estimator: REML)

tau^2 (estimated amount of residual heterogeneity):     0.0539 (SE = 0.0216)
tau (square root of estimated tau^2 value):             0.2322
I^2 (residual heterogeneity / unaccounted variability): 59.15%
H^2 (unaccounted variability / sampling variability):   2.45
R^2 (amount of heterogeneity accounted for):            0.00%

Test for Residual Heterogeneity:
QE(df = 44) = 102.0036, p-val < .0001

Test of Moderators (coefficients 2:4):
QM(df = 3) = 5.9748, p-val = 0.1128

Model Results:

estimate      se     zval    pval    ci.lb    ci.ub
intrcpt    0.2639  0.0898   2.9393  0.0033   0.0879   0.4399  **
grade2    -0.3727  0.1705  -2.1856  0.0288  -0.7069  -0.0385   *
grade3     0.0248  0.1364   0.1821  0.8555  -0.2425   0.2922
grade4    -0.0155  0.1160  -0.1338  0.8935  -0.2429   0.2118

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The omnibus test of the factor is not significant (p = .1128) and hence we do not have statistically significant evidence to reject the null hypothesis that the average effect is the same for all grade levels, although the individual contrasts suggest that the average effect for grade level 2 (middle school) may be quite a bit lower (and even significantly so, with p = .0288) than the average effect for grade level 1 (elementary school).1)

We can use the predict() function to obtain the estimated average effect for each grade level.

predict(res, newmods=rbind(c(0,0,0),diag(3)))
pred     se   ci.lb  ci.ub   pi.lb  pi.ub
1  0.2639 0.0898  0.0879 0.4399 -0.2241 0.7519
2 -0.1088 0.1450 -0.3929 0.1753 -0.6454 0.4278
3  0.2888 0.1027  0.0874 0.4901 -0.2090 0.7865
4  0.2484 0.0735  0.1044 0.3924 -0.2290 0.7258

Hence, we see that the estimated average effect size is quite similar in grade levels 1, 3, and 4, but a bit lower (and in fact negative) in grade level 2 (middle school). However, as noted above, since the test of the factor as a whole is not actually significant, we should not overinterpret this finding.

### Separate Meta-Analyses

A subgroup analysis, comparing the size of the (average) effect across the four grade levels, could also be conducted by simply fitting separate random-effects models in the four groups. We can do this by using the subset argument and collecting the four models in a list.2)

res1 <- list(rma(yi, vi, data=dat, subset=grade==1),

For easier inspection of the results, we then extract the estimated average effect in each group, the corresponding standard error, the estimate of $\tau^2$, its standard error, and the number of studies included in each model, and put everything into a data frame.

comp1 <- data.frame(estimate = sapply(res1, coef),
stderror = sqrt(sapply(res1, vcov)),
tau2     = sapply(res1, \(x) x$tau2), se.tau2 = sapply(res1, \(x) x$se.tau2),
k        = sapply(res1, \(x) x$k)) rownames(comp1) <- paste0("grade", 1:4) round(comp1, digits=4) estimate stderror tau2 se.tau2 k grade1 0.2571 0.0823 0.0412 0.0319 11 grade2 -0.0998 0.1367 0.0421 0.0688 6 grade3 0.3111 0.1300 0.1150 0.0790 10 grade4 0.2409 0.0694 0.0434 0.0297 21 These results are similar, but not quite identical to the ones we obtained above using the meta-regression approach. The reason for this is that the meta-regression approach above assumes that the amount of heterogeneity is the same within all four grade levels (i.e., the amount of 'residual heterogeneity' is assumed to be homoscedastic), while we obtain different estimates of$\tau^2$if we fit a separate random-effects model within each of the levels of the grouping factor. Also, it is worth noting that some of the groups are quite small (especially the one for grade level 2), which should lead to further caution not to overinterpret the lower estimated effect for grade level 2. We can now conduct a Wald-type test to test the null hypothesis that the average effect is identical across the different subgroups (analogous to the test of the factor as a whole) as follows. wld <- rma(estimate, sei=stderror, mods = ~ rownames(comp1), data=comp1, method="FE") anova(wld) Test of Moderators (coefficients 2:4): QM(df = 3) = 6.2496, p-val = 0.1001 The test is not significant (p = .1001), so we again do not obtain statistically significant evidence that the average effect differs across the groups. We can also compare the various estimates of$\tau^2$. The values indicate quite similar amounts of heterogeneity in grade levels 1, 2, and 4, but a higher amount of heterogeneity in grade level 3 (i.e., this suggests that the effectiveness of writing-to-learn interventions might be more variable in high-school students compared to the other grade levels). Again, we should be cautious not to overinterpret this finding. In principle, we could also do a Wald-type test using the$\tau^2$estimates (and their corresponding standard errors) as input in the same manner as was done above. wld <- rma(tau2, sei=se.tau2, mods = ~ rownames(comp1), data=comp1, method="FE") anova(wld) Test of Moderators (coefficients 2:4): QM(df = 3) = 0.7939, p-val = 0.8509 Based on this, we cannot actually reject the null hypothesis$H_0: \tau^2_1 = \tau^2_2 = \tau^2_3 = \tau^2_4$(p = .8510). However, I would generally not recommend this approach, because the test assumes that the sampling distributions of the$\tau^2$estimates are approximately normal, which is unlikely to be the case (variance components tend to have skewed sampling distributions). Hence, Wald-type tests of variance components are unlikely to perform well. We will examine other ways of testing whether the amount of heterogeneity differs across groups further below. ### Model with Different Amounts of Heterogeneity Instead of the rma() function, we can use the rma.mv() function to fit a meta-regression model that not only allows the average effect to differ across the subgroups, but also the amount of heterogeneity. Note that the data do not have a multilevel/multivariate structure – we are simply using the rma.mv() function to fit a model that allows$\tau^2$to differ across the subgroups. The appropriate syntax for this is as follows. res2 <- rma.mv(yi, vi, mods = ~ 0 + grade, random = ~ grade | id, struct="DIAG", data=dat, cvvc=TRUE) Instead of examining the full output, let us again just pull out the relevant pieces of information as we did above. comp2 <- data.frame(estimate = coef(res2), stderror = sqrt(diag(vcov(res2))), tau2 = res2$tau2,
se.tau2  = sqrt(diag(res2$vvc)), k = c(res2$g.levels.k))
round(comp2, digits=4)
estimate stderror   tau2 se.tau2  k
grade1   0.2571   0.0823 0.0412  0.0304 11
grade2  -0.0998   0.1367 0.0421  0.0647  6
grade3   0.3111   0.1300 0.1150  0.0869 10
grade4   0.2409   0.0694 0.0434  0.0336 21

These results are identical to the ones we obtained when fitting a separate random-effects model within each grade level (comp1), except for the standard errors of the $\tau^2$ estimates. The reason for this will be discussed later on.

### Location-Scale Model with a Scale Factor

Another possibility to allow $\tau^2$ to differ across the subgroups is to make use of a location-scale model, where we use the grouping variable not only as a moderator of the average effect, but also as a predictor for the (log transformed) amount of heterogeneity in the effects.

res3 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat)

Again, let's collect the relevant pieces into a data frame.

comp3 <- data.frame(estimate = coef(res3)$beta, stderror = sqrt(diag(vcov(res3)$beta)),
tau2     = predict(res3, newscale=diag(4), transf=exp)$pred) comp3$se.tau2 <- predict(res3, newscale=diag(4))$se * comp3$tau2
comp3k <- c(table(dat$grade)) round(comp3, digits=4) estimate stderror tau2 se.tau2 k grade1 0.2571 0.0823 0.0412 0.0304 11 grade2 -0.0998 0.1367 0.0421 0.0647 6 grade3 0.3111 0.1300 0.1150 0.0869 10 grade4 0.2409 0.0694 0.0434 0.0336 21 Note that the default 'link function' for the scale part of the model is the log link. Therefore, when using predict() to compute the estimated$\tau^2$value of each grade level, we use transf=exp to transform the estimated log transformed$\tau^2$values back to the original scale. We can see that this yields the same estimates for the amount of heterogeneity within the subgroups as we obtained above using the other two approaches. The estimated average effects for the different grade levels are also the same as the ones we obtained earlier. Sidenote: The standard errors of the back-transformed$\tau^2$estimates were obtained by taking the standard errors of the log transformed estimates and using the delta method (for an exponential transformation). Note that this yields the same values as obtained above using the rma.mv() function. Instead of using a log link for the scale part of the model, it is possible to also use an identity link. In this case, the scale part directly predicts the value of$\tau^2$based on the scale variable(s) in the model. This can be done with: res4 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat, link="identity") comp4 <- data.frame(estimate = coef(res4)$beta,
stderror = sqrt(diag(vcov(res4)$beta)), tau2 = predict(res4, newscale=diag(4))$pred,
se.tau2  = predict(res4, newscale=diag(4))$se, k = c(table(dat$grade)))
round(comp4, digits=4)
estimate stderror   tau2 se.tau2  k
grade1   0.2571   0.0823 0.0412  0.0304 11
grade2  -0.0998   0.1367 0.0421  0.0647  6
grade3   0.3111   0.1300 0.1150  0.0869 10
grade4   0.2409   0.0694 0.0434  0.0336 21

These results are again identical to the ones we obtained above using the previous two approaches.

### Comparing the Fit of the Models

We can compare the log likelihoods of the different models with:

c(res1 = sum(sapply(res1, logLik)),
res2 = logLik(res2),
res3 = logLik(res3),
res4 = logLik(res4))
res1     res2     res3     res4
-15.5456 -15.5456 -15.5456 -15.5456

Note that for the first approach (where we fitted separate random-effects models within the various subgroups), we can simply add up the log likelihoods. As we can see above, the values are identical, indicating that we are essentially fitting the same model, just using different approaches/parameterizations thereof.

### Testing for Differences in $\tau^2$ Across Subgroups

At the beginning, we used a Wald-type test to test the null hypothesis of no differences in $\tau^2$ values across subgroups. This is not a recommended approach for the reasons outlined earlier.

As a better alternative, we can conduct a likelihood ratio test to compare a model that allows $\tau^2$ to differ across subgroups (as we have done above) with a model that assumes a common (homoscedastic) value of $\tau^2$.

For the approach using rma.mv(), we can do this as follows:

res2 <- rma.mv(yi, vi, mods = ~ 0 + grade, random = ~ grade | id,
struct="DIAG", data=dat)
res0 <- update(res2, struct="ID")
anova(res0, res2)
df     AIC     BIC    AICc   logLik    LRT   pval       QE
Full     8 47.0912 61.3647 51.2055 -15.5456               102.0036
Reduced  5 42.2069 51.1278 43.7858 -16.1034 1.1157 0.7733 102.0036

For the location-scale models, we can do this with:

res3 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade, data=dat)
res0 <- update(res3, scale = ~ 1)
anova(res0, res3)

and

res4 <- rma(yi, vi, mods = ~ 0 + grade, scale = ~ 0 + grade,
res0 <- update(res3, scale = ~ 1)
anova(res0, res4)

The output (not shown) is identical to the one above. The test is not significant (p = .7733) and hence we do not have statistically significant evidence for any differences in $\tau^2$ values across the different grade levels.

As we can see with this example, a nice property of the likelihood ratio test is that it is invariant under different parameterizations of the model. Hence, we obtain the same result with all three approaches. Moreover, the likelihood ratio test is likely (no pun intended) to have better statistical properties than the Wald-type test in general (although in this example, the conclusion happens to be the same).

An interesting (and generally even better) approach for testing for differences in $\tau^2$ values across subgroups is to use a permutation test. Such a test can be carried out with the permutest() function. Note that the following code takes a few minutes to run.3)

res3 <- rma(yi, vi, mods = ~ grade, scale = ~ grade, data=dat)
permutest(res3, seed=1234)
Test of Scale Coefficients (coefficients 2:4):¹
QS(df = 3) = 1.2195, p-val = 0.5406

Here, we are interested in the test of the scale coefficients (and only the output for this test is shown). The permutation test is not significant (p = .5406), again indicating lack of evidence for differences in the amount of heterogeneity across the different grade levels. While computationally intensive, the permutation test is in a certain sense 'exact' and (assuming that it is run with a sufficiently large number of iterations) should provide the best statistical properties.

We can also run the permutation test based on the location-scale model that uses an identity link.

permutest(res4, seed=1234)
Test of Scale Coefficients (coefficients 2:4):¹
QS(df = 3) = 0.6652, p-val = 0.7982

Again, the test is not significant (p = .7982). This example demonstrates that the permutation test is not invariant under different parameterizations of the same model (note that the p-value differs from the one we obtained above using the default log link), which could be argued to be a disadvantage compared to the likelihood ratio test. However, generally, I would expect the conclusions from the permutation test to coincide irrespective of the parameterization in most cases.

### Standard Errors of $\tau^2$ Estimates

As pointed out earlier, the standard errors of the $\tau^2$ estimates obtained by fitting separate random-effects models within the subgroups (comp1) with the rma() function are slightly different than the ones we obtained from the other approaches. The reason for this is somewhat technical. The rma() function computes the standard error of a $\tau^2$ estimate based on the Fisher information matrix, while the other approaches compute the standard errors based on the Hessian matrix (i.e., based on the observed Fisher information matrix). Both approaches provide estimates of the standard errors, but do not yield identical values. Which values are to be preferred is debatable. However, note that we do not actually need the standard errors when conducting a likelihood ratio test and while the permutation test does make use of the standard errors, it only does so for constructing the permutation distribution of the test statistic under the permuted data, not for conducting the test itself.

### Conclusion

As we can see above, models can be fitted that relax the assumption that $\tau^2$ is identical within subgroups when conducting a subgroup analysis.

A question that often comes up in this context is whether one should do so (as opposed to sticking to the more standard meta-regression model that assumes a single homoscedastic value for $\tau^2$ within subgroups). Two papers that addresses this issue to some extent are:

• Rubio-Aparicio, M., Sánchez-Meca, J., López-López, J. A., Botella, J. & Marín-Martínez, F. (2017). Analysis of categorical moderators in mixed-effects meta-analysis: Consequences of using pooled versus separate estimates of the residual between-studies variances. British Journal of Mathematical and Statistical Psychology, 70(3), 439-456. https://doi.org/https://doi.org/10.1111/bmsp.12092
• Rubio-Aparicio, M., López-López, J. A., Viechtbauer, W., Marín-Martínez, F., Botella, J., & Sánchez-Meca, J. (2020). Testing categorical moderators in mixed-effects meta-analysis in the presence of heteroscedasticity. Journal of Experimental Education, 88(2), 288-310. https://doi.org/10.1080/00220973.2018.1561404

Generally, even if the $\tau^2$ values are heteroscedastic but we ignore this (and hence assume a common $\tau^2$ across subgroups), then this tends to have a relatively minor impact on the statistical properties of the test of the group factor (i.e., whether the size of the effect differs across the groups).4) However, if we are specifically interested in examining whether the amount of heterogeneity differs across the subgroups (e.g., whether the effectiveness of writing-to-learn interventions tends to be more/less consistent within a particular grade level), then of course we need to allow $\tau^2$ to differ across the grouping variable.

As a final note, the example above also provides evidence for the correctness of the underlying code for fitting the various models. The algorithm underlying the rma() function for fitting a standard random-effects model is entirely different than what is used in the rma.mv() function and also what is used within the rma() function for fitting location-scale models, yet all three approaches yielded identical results, at least when rounded to four decimal places (with less rounding, one will notice small discrepancies between the results, which are a consequence of using different model fitting algorithms).

1)
The possibility that the omnibus test can be non-significant when one or more of the individual model coefficients differ significantly from zero is discussed here. Also, the somewhat peculiar fact that the estimate of $\tau^2$ can be larger in a meta-regression model compared to the random-effects model is discussed here.
2)
We could also use the more concise syntax res1 <- lapply(1:4, \(g) rma(yi, vi, data=dat, subset=grade==g)) to accomplish the same thing.
3)
Since the test involves random permutations of the data, setting the seed of the random number generator ensures that the results are fully reproducible.
4)
The full story is a bit more complex and depends on whether the subgroups are of roughly equal size or not and, when they are unbalanced in sizes, how differences in the amount of heterogeneity are paired together with the subgroup sizes. Also, in general, the standard Wald-type test of the fixed effects in a meta-regression model tends to have a slightly inflated Type I error rate especially when $k$ is small, but this can be counteracted by using the 'Knapp-Hartung method' (i.e., by setting test="knha" when fitting the meta-regression model with the rma() function).