Introductory note

This eBook is based on a section of a short course on concepts and methods in causal inference run by the PATHWAYS node. The original practical, on mediation, asked participants to fit models using Stata themselves, whereas this eBook fits models in Stata on the user's behalf. You therefore have to have access to Stata, and to tell Stat-JR where it is: see the third-party software page on the Stat-JR site for more information on how to specify the path to Stata and other software packages.

Question

Our causal question of interest concerns the effect of alcohol consumption on systolic blood pressure. In particular, we wish to investigate to what extent this effect is mediated by gamma-glutamyl transpeptidase .

Causal diagram

Body mass index is known to affect the level of gamma-glutamyl transpeptidase, but is also potentially affected by alcohol consumption. Socio-economic position is potentially a common cause of alcohol consumption, body mass index and systolic blood pressure.

For simplicity let's assume that the following diagram represents the relationships between the variables (where BMI = body mass index, GGT = gamma-glutamyl transpeptidase, SEP = socio-economic position, and SBP = systolic blood pressure):


Causal diagram

Data

We will use a simulated dataset, inspired by the population-based controls from a case-control study (the Izhevsk Family Study), which was carried out to assess the effects of hazardous alcohol-drinking on premature mortality in men living in Izhevsk, Russia.

Here are the variables:

Please wait a moment whilst the dataset summary loads...


Traditional mediation analysis

First, we will perform a traditional mediation analysis, as described, for example, in the PATHWAYS workshop 'Current and new approaches to estimating causal pathways from observational data'.

What you need to do

Below you'll see some quiz questions...

  1. Read through these, and think about what sort of models you need to fit to answer them.

  2. Then, underneath the quiz questions, you'll see an input box allowing you to specify your response variables and explanatory variables to fit a linear regression model.

  3. Once you have made your choices from the drop-down box, and pressed Submit, a do-file will then be created: this will be automatically run in Stata and the coefficient estimates and standard errors will be returned for each model you fit* in the yellow table, with a more complete table of results appearing, in sequence for each model, further below.

* NB: up to six model-fitting opportunities are available, but note you don't have to use all of these - fitting four should be sufficient to answer the questions.

Quiz questions

The quiz questions will appear here once they've loaded...

How many quiz questions did you get right?

Once you have submitted your choices, it'll indicate here how many you got correct...

Fitting your models

Table of coefficient estimates (with SE) from all models run

Once the first model has been run, the compiled table of coefficient estimates and standard errors will appear here. As you then run each subsequent model (up to six), the estimates from those will be added too...


Parameter MODEL 1, response: MODEL 2, response: MODEL 3, response: MODEL 4, response: MODEL 5, response: MODEL 6, response:
cons ()   ()   ()   ()   ()   ()  
sep_2 ()   ()   ()   ()   ()   ()  
sep_3 ()   ()   ()   ()   ()   ()  
alc ()   ()   ()   ()   ()   ()  
bmi ()   ()   ()   ()   ()   ()  
log_ggt ()   ()   ()   ()   ()   ()  
sbp ()   ()   ()   ()   ()   ()  
log_ggt_alc ()   ()   ()   ()   ()   ()  

Choose your explanatory variables

The list of explanatory variables will appear here; you can then choose which of these you want to include in your model...

Using the box below you can choose which response variable and explanatory variables to include in your linear regression models...

...once you have made your choices, and pressed Submit, a do-file will be created; this will be run in Stata, and the results will be compiled in a table, above, as well as in more detailed results tables returned below.

Remember, you have up to six model-fitting opportunities.

Model 1: 6 model-fitting opportunities remaining...

Model 2: 5 model-fitting opportunities remaining...

Model 3: 4 model-fitting opportunities remaining...

Model 4: 3 model-fitting opportunities remaining...

Model 5: 2 model-fitting opportunities remaining...

Model 6: this is the last model-fitting opportunity...


The specific results table for Model 1 will appear here.

Model 1: Results Table

Parameter Coefficient Standard Error 95% C.I. lower 95% C.I. upper t p>|t|

Again, once you have made your choices, the specific results table for Model 2 will appear here.

Model 2: Results Table

Parameter Coefficient Standard Error 95% C.I. lower 95% C.I. upper t p>|t|

Again, once you have made your choices, the specific results table for Model 3 will appear here.

Model 3: Results Table

Parameter Coefficient Standard Error 95% C.I. lower 95% C.I. upper t p>|t|

Again, once you have made your choices, the specific results table for Model 4 will appear here.

Model 4: Results Table

Parameter Coefficient Standard Error 95% C.I. lower 95% C.I. upper t p>|t|

Again, once you have made your choices, the specific results table for Model 5 will appear here.

Model 5: Results Table

Parameter Coefficient Standard Error 95% C.I. lower 95% C.I. upper t p>|t|

Again, once you have made your choices, the specific results table for Model 6 will appear here.

Model 6: Results Table

Parameter Coefficient Standard Error 95% C.I. lower 95% C.I. upper t p>|t|

Answers

Once you have submitted your answers on the previous page, you'll find out here which questions you got correct / incorrect, and why...

Working with the gformula command in Stata

We can be more flexible in our modelling approach using the g-computation formula (Robins, 1986), recently implemented in a Stata routine called gformula. This is what we will do next, comparing our inferences with those from the traditional approach explored on the previous page.

The gformula command is "an implementation of the g-computation procedure...used to estimate the causal effect of time-varying exposures on an outcome in the presence of time-varying confounders that are themselves also affected by the exposures. The procedure also addresses the related problem of estimating direct and indirect effects when the causal effect of the exposures on an outcome is mediated by intermediate variables, and in particular when confounders of the mediator-outcome relationships are themselves affected by the exposures" (Daniel et al, 2011).

If you wish to use it yourself, you can download it within any version of Stata, provided that you have an internet connection, by typing:

. ssc install gformula

We'll fit our first model using the gformula command on the next page...

Reference

Daniel, R.M., De Stavola, B.L. and Cousens, S.N. (2011) gformula: Estimating causal effects in the presence of time-varying confounding or mediation using the g-computation formula. The Stata Journal, 11(4), 479-517.

Robins, J (1986) A new approach to causal inference in mortality studies with a sustained exposure period - applicaiton to control of the healthy worker survivor effect. Mathmematical Modelling, 7, 1393-1512.

Model A: excluding exposure-mediator interaction



. gformula sep alc bmi log_ggt sbp, mediation out(Q1) ///
eq(bmi:i.sep alc,
Q2, sbp:bmi alc log_ggt i.sep) ///
com(bmi:regress, log_ggt:regress, sbp:regress) ex(
Q3) mediator(Q4) ///
control(log_ggt:3) baseline(alc:0) post_confs(bmi) base_confs(sep) sam(50)


Quiz questions

What's missing? Enter your answers here...

The quiz questions will appear here once they've loaded...

How many quiz questions did you get right?

Once you have submitted your choices, it'll indicate here how many you got correct...

Running the model

How did you get on? Here's the complete command:


. gformula sep alc bmi log_ggt sbp, mediation out(sbp) ///
eq(bmi:i.sep alc, log_ggt:bmi alc, sbp:bmi alc log_ggt i.sep) ///
com(bmi:regress, log_ggt:regress, sbp:regress) ex(alc) mediator(log_ggt) ///
control(log_ggt:3) baseline(alc:0) post_confs(bmi) base_confs(sep) sam(50)


This command will be automatically run, in Stata, once you indicate you want it to do so, beneath.

The results will then be returned, below, but note this will take some time...


Once the model has run (this will take some time), the results will appear here...


Results

Understanding the output

  • The total causal effect (TCE) is the difference between the mean of the observed outcomes and the mean potential outcome if, contrary to fact, all subjects' exposure(s) were set at the baseline values. Writing \(X\) for the exposure(s), \(M\) for the mediator(s), \(Y\) for the outcome and \(0\) for the baseline value(s) of the exposure(s), then: \[\mbox{TCE} = E\:[Y\:] - E\:[\:Y\:\left\{0, M(0)\right\}] = E\:[\:Y\:\left\{X, M(X)\right\}] - E\:[\:Y\:\left\{0, M(0)\right\}]\]
  • The natural direct effect (NDE) is the difference between the mean of two potential outcomes. The first is the potential outcome if, contrary to fact, all subjects' mediator(s) were set to their potential value(s) under the baseline value(s) of the exposure, but the exposure value(s) are those actually observed in the observational data. The second is the potential outcome if, contrary to fact, all subjects' exposure(s) were set at the baseline value(s). That is: \[\mbox{NDE} = E\:[\:Y\:\left\{X, M(0)\right\}] - E\:[Y\:\left\{0, M(0)\right\}]\]
  • The natural indirect effect (NIE) is the difference between the TCE and the NDE. That is: \[\mbox{NIE} = \mbox{TCE} - \mbox{NDE} = E\:[\:Y\:\left\{X, M(X)\right\}] - E\:[\:Y\:\left\{X, M(0)\right\}]\]
  • The controlled direct effect (CDE) is the difference between the mean potential outcome when subjects' exposure value(s) were those actually observed under the observational regime and the mean potential outcome when, contrary to fact, all subjects' exposure(s) were set at the baseline value(s); and, in addition, in both cases, the mediator(s) were set to their control value(s). Write \(m\) for the control value(s) of the mediator(s), then: \[\mbox{CDE} = E\left\{Y\:(X, m)\right\} - E\left\{Y\:(0, m)\right\}\]

Results table

  G-computation estimate Bootstrap Std. Err.
TCE
NDE
NIE
CDE

Note, due to the Monte Carlo nature of the procedure, you will obtain slightly different results each time you fit this model.

Quiz questions

Once you have submitted your choice, it'll indicate here whether you were correct or not...

Make sure you understand the comparison being made here; we are comparing the population under different hypothetical conditions.

To test your understanding, choose the correct 'hypothetical populations', from the list that follows, to answer the question beneath:

  1. the population (of 10,000) with the alcohol intake that they were actually observed to have in the observational data;
  2. the population (of 10,000), intervened on, so that all 10,000 subjects consume no alcohol at all;
  3. the population (of 10,000), with the assumption that all have consumed the maximum observed alcohol intake.

Once you have submitted your choice, it'll indicate here whether you were correct or not...

Model B: allowing for exposure-mediator interaction

Now we'll repeat the gformula analysis, allowing for exposure-mediator interaction as follows:

. gformula sep alc bmi log_ggt sbp log_ggt_alc, mediation out(sbp) ///
eq(bmi:i.sep alc, log_ggt:bmi alc, sbp:bmi alc log_ggt log_ggt_alc i.sep) ///
com(bmi:regress, log_ggt:regress, sbp:regress) ex(alc) mediator(log_ggt) ///
control(log_ggt:3) baseline(alc:0) post_confs(bmi) base_confs(sep) derived(log_ggt_alc) ///
derrules(log_ggt_alc:log_ggt*alc) sam(50)


Once the model has run (again, this will take some time!), the results will appear here...


Results

  G-computation estimate Bootstrap Std. Err.
TCE
NDE
NIE
CDE

Quiz questions

Once you have submitted your choice, it'll indicate here whether you were correct or not...

Turn to the next page to see our concluding remarks...

Concluding remarks

Comparing analytical approaches

Let's now compare the results from the analyses we ran in gformula with those obtained from the traditional approach we adopted earlier in the eBook.

First of all, note that the results from the traditional analyses are not easily comparable with the ones from gformula since they relate to a unit change in alc and so the hypothetical worlds being compared are different. Nevertheless, we see that indirect effects implied by the standard approaches are rather larger than was found using the g-formula. This can be explained by the uncontrolled confounding due to bmi. Higher alcohol intake leads to higher BMI, but since high alcohol and high BMI both lead to higher GGT, controlling for log_GGT reduces the association between alcohol and BMI, and so the direct effect of alcohol appears reduced.

Controlling for BMI in fact makes things worse in this example. The direct effect of alcohol on SBP no longer includes the part of this effect which acts through BMI.

There is also apparently an important exposure--mediator interaction which cannot be taken into account in the traditional analysis.

So, all things considered, we would prefer the analysis using the gformula command which allows for intermediate confounding and exposure-mediator interaction.

A note on measurement error

Given that alcohol intake is notoriously poorly estimated, if there is measurement error, then GGT could in fact be a better proxy of true alcohol intake than the alcohol variable itself. Then, even if there were no mediation (suppose GGT has no causal effect on SBP), measurement error may still lead us to believe that an indirect effect (through GGT) exists.