Lab 4:Mapping causal & predictive approaches

Practice session

Luisa M. Mimmi —   https://luisamimmi.org/

February 26, 2025

GOAL OF TODAY’S PRACTICE SESSION

  • Causal analysis: how experimental studies uncover the association between exposure and outcome
  • Visually explore causal pathways and influential variables with Directed Acyclic Graphs (DAGs):
    • Collider: a variable that is influenced by both the exposure and the outcome
    • Confounder: a variable that influences both the exposure and the dependent variable outcome
    • Mediator: a variable that lies on the causal path between the exposure and the outcome
  • Practically illustrate causal modeling in the presence of influential causal variables:
    • [Colliders can introduce bias if conditioned upon]
    • Confounders must be adjusted for
    • Mediators require careful consideration in analyses
  • Calculate commonly used estimands: ATE, ATT, ATU

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session

# --- General 
library(here) # A Simpler Way to Find Your Files   
library(skimr)    # Compact and Flexible Summaries of Data
library(paint)   # Fancy structure for data frames 
library(janitor)  # Simple Tools for Examining and Cleaning Dirty Data
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(gt) # Easily Create Presentation-Ready Display Tables
library(marginaleffects) # Marginal Effects for Model Objects
library(WeightIt) # Weighting for Covariate Balance in Observational Studies
# --- Plotting & data visualization
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(ggpubr)        # 'ggplot2' Based Publication Ready Plots
library(patchwork) # The Composer of Plots
# --- Descriptive statistics and regressions
library(broom) # Convert Statistical Objects into Tidy Tibbles
library(Hmisc) # Harrell Miscellaneous
library(estimatr) # Fast Estimators for Design-Based Inference
library(modelsummary) # Summary Tables and Plots for Statistical Models and Data:
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
library(sandwich) #for robust variance estimation
library(survey) # Analysis of Complex Survey Samples
library(haven) # Import and Export 'SPSS', 'Stata' and 'SAS' Files
library(simstudy) # Simulation of Study Data
library(NHANES) # Data from the US National Health and Nutrition Examination Study  
library(mediation) # Causal Mediation Analysis

DATASETS for today

Importing Dataset 1 (NHANES)

Name: NHANES, accessible from package NHANES
Documentation: See reference on the data downloaded with ?NHANES
Sampling details: We’ll use a subset of the NHANES dataset, focusing on variables related to smoking status (SmokeNow), systolic blood pressure (BPSysAve), Body Mass Index (BMI) and age (Age).

set.seed(123) # Set seed for reproducibility
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_data <- NHANES %>%
  dplyr::select(ID, Gender, Age, Race1, Height, Weight, BMI, SmokeNow, PhysActive, PhysActiveDays,
         BPSysAve, BPDiaAve, TotChol, DirectChol, Diabetes, HealthGen,
         Education, HHIncomeMid, Poverty) %>%
  drop_na() %>% 
  # make it smaller 
  slice_sample(n = 1000) # Randomly select 1000 rows using

# Take a quick look at the data
#paint(nhanes_data)

Dataset 1 (NHANES) Variables and their description

EXCERPT: see complete file in Input Data Folder

Variable Type Description
ID int xxxxx
Gender chr Gender (sex) of study participant coded as male or female
Age int ##
AgeDecade chr yy-yy es 20-29
Race1 chr Reported race of study participant: Mexican, Hispanic, White, #Black, or Other
Education chr [>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
HHIncome chr Total annual gross income for the household in US dollars
HHIncomeMid int Numerical version of HHIncome derived from the middle income # in each category. Ex. 12500 40000
Poverty dbl A ratio of family income to poverty guidelines. Smaller # numbers indicate more poverty Ex.. 0.95 1.74 4.99
Weight dbl Weight in kg
Height dbl Standing height in cm. Reported for participants aged 2 years or older.
BMI dbl Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse int 60 second pulse rate
BPSysAve int Combined systolic blood pressure reading, following the # procedure outlined for BPXSAR
BPDiaAve int Combined diastolic blood pressure reading, following the # procedure outlined for BPXDAR
DirectChol dbl Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol dbl Total HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
Diabetes chr Study participant told by a doctor or health professional that they have diabetes
HealthGen chr Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor Fair
PhysActive chr Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
SmokeNow chr Study participant currently smokes cigarettes regularly. (Yes or No)

VISUALLY EXPLORE CAUSAL PATHWAYS WITH DAGS

  • using R packages:
    • dagitty for creating and analyzing DAGs
    • ggdag for plotting DAGs

DAG with collider

# df DAG 
dag_collid <- ggdag::dagify( Y ~ X + e, Z ~ X + Y, 
                      exposure = "X",outcome = "Y", #latent = "e",
                      # labels
                      labels = c("Z" = "Collider",  "X" = "Exposure",
                                 "Y" = "Outcome", "e" = "Unobserved \nerror"),
                      # positions
                      coords = list(
                        x = c(X = 0, Y= 4, Z = 2 , e = 5),
                        y = c(X = 0, Y = 0, Z = 2, e = 1) 
                      )) %>% 
  ggdag::tidy_dagitty() 

# Plot DAG 
dag_col_p <-  dag_collid %>% 
  ggdag::ggdag(., layout = "circle") +  # decided in dagify
  theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
   ggdag::geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
   ggdag::geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  
   ggdag::geom_dag_text(color="white") +
   ggplot2::labs(title = "Causal map with COLLIDER (Z)" , 
       #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", 
       caption = ) +
  # Map colors to specific nodes
   ggplot2::scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")    # Remove legend

dag_col_p

DAG with collider

DAG with confounder

# df DAG 
dag_confounder <- ggdag::dagify(Y ~ X + Z + e, X ~  Z, 
                                exposure = "X", outcome = "Y", #latent = "e",
                                # labels
                                labels = c(
                                  "Z" = "Confounder",
                                  "X" = "Exposure",
                                  "Y" = "Outcome",
                                  "e" = "Unobserved \nerror"),
                                # positions
                                coords = list(
                                  x = c(X = 0, Y= 4, Z = 2 , e = 5),
                                  y = c(X = 0, Y = 0, Z = 2, e = 1) 
                                )) %>% 
  ggdag::tidy_dagitty()     # Add edge_type within pipe

# Plot DAG 
dag_conf_p <- dag_confounder %>% 
  ggdag::ggdag(., layout = "circle") + 
  ggdag::theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  ggdag::geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  ggdag::geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  ggdag::geom_dag_text(color="white") +
  ggplot2::labs(title = "Causal map with CONFOUNDER (Z)" , #subtitle = " ",
        caption = ) +
  # Map colors to specific nodes
  ggplot2::scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")   
dag_conf_p

DAG with confounder

DAG with mediator

# df DAG 
dag_med <- ggdag::dagify( Y ~ Z + e,  Z ~ X, exposure = "X", outcome = "Y",  #latent = "e",
                   # labels
                   labels = c(
                     "Z" = "Mediator",
                     "X" = "Exposure",
                     "Y" = "Outcome",
                     "e" = "Unobserved \nerror"),
                   # positions
                   coords = list(
                     x = c(X = 0, Y= 4, Z = 2 , e = 5),
                     y = c(X = 0, Y = 0, Z = 2, e = 1) 
                   )) %>% 
  ggdag::tidy_dagitty()     # Add edge_type within pipe

# Plot DAG 
dag_med_p <-  dag_med %>% 
  ggdag::ggdag(., layout = "circle") +  # decided in dagify
  ggdag::theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  ggdag::geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  ggdag::geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  
  ggdag::geom_dag_text(color="white") +
  ggplot2::labs(title = "Causal map with MEDIATOR (Z)" , 
       #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", 
       caption = ) +
  # Map colors to specific nodes
  ggplot2::scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), guide = "none")    # Remove legend
dag_med_p

DAG with mediator

CAUSAL MODELING WITH CONFOUNDER

DAG with confounder

Z = Age = confounder
X = SmokeNow
Y = BPSysAve (blood pressure)

Example of CONFOUNDER

  • Z = Age = confounder for the relationship between X = SmokeNow and Y = BPSysAve (blood pressure)
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_conf <- NHANES %>%
  dplyr::select(ID, Age, SmokeNow, BPSysAve) %>%
  tidyr::drop_na()
  
# Take a quick look at the data
paint::paint(nhanes_conf)
tibble [3108, 4]
ID       int 51624 51624 51624 51630 51654 51666
Age      int 34 34 34 49 66 58
SmokeNow fct No No No Yes No Yes
BPSysAve int 113 113 113 112 111 127

In this context:

  • Age influences both smoking (SmokeNow) and blood pressure (BPSysAve), making it a confounder.
  • SmokeNow directly affects BPSysAve.

Visualisation of CONFOUNDER

# Scatterplot of Age and Blood Pressure by Smoking Status
ggplot2::ggplot(nhanes_conf, aes(x = Age, y = BPSysAve, color = SmokeNow)) +
  geom_point(alpha = 0.6) +
  # linear rel 
  geom_smooth(method = "lm", se = FALSE,  size = 1.5,  linetype = "dashed", color = "black") +
  scale_color_manual(
    values = c("No" = "#1b9e77", "Yes" = "#d95f02")) +
  facet_wrap(~ SmokeNow) + 
  labs(
    title = "Scatterplot of Age and Blood Pressure by Smoking Status",
    x = "Age",
    y = "Blood Pressure (mmHg)",
    color = "Smokers"
  )

Visualisation of CONFOUNDER

ONE WAY OF DEALING WITH CONFOUNDERS

  • introducing the confounder variable as a term in the regression model to control for its effect on the outcome.



Regression analysis: controlling for the confounder

Unadjusted model:

# Unadjusted linear model 
model_unadjusted <- lm(BPSysAve ~ SmokeNow, data = nhanes_conf)
summary(model_unadjusted)$coefficients
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   124.38      0.432  287.97  0.0e+00
SmokeNowYes    -4.26      0.643   -6.64  3.8e-11

Adjusted model:

  • includes the confounder Age in the model
# Adjusted model
model_adjusted <- lm(BPSysAve ~ SmokeNow + Age, data = nhanes_conf)
summary(model_adjusted)$coefficients
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  100.997     1.0886   92.78  0.00e+00
SmokeNowYes    0.684     0.6313    1.08  2.79e-01
Age            0.432     0.0187   23.09 5.19e-109

Compare models (without and with confounder)

  • In the Unadjusted Model any observed relationship between smoking and blood pressure might be confounded.
  • In the Adjusted Model (which ‘controls for’ Age) a more accurate estimate of the causal effect of smoking (SmokeNowYes) on blood pressure (BPSysAve) accounts for the confounding influence of Age.
# Create the table with a custom statistic formatter
conf_sum <- modelsummary::msummary(
  list(
    "NO Confounder" = model_unadjusted, 
    "Confounder" = model_adjusted
  ),
  output = "gt",
  statistic = c(
    "conf.int",
    "s.e. = {std.error}"
  ), 
  fmt = "%.3f", 
  gof_omit = 'DF|Deviance|Log.Lik.|F|R2 Adj.|AIC|BIC|RMSE',
  stars = c('*' = .05, '**' = .01)
)

# Render the table directly
conf_sum  %>%    # text and background color
  tab_style(
    style = cell_text(weight = "bold", color = "#005ca1"),
    locations = cells_column_labels(
      columns = c("NO Confounder", "Confounder"))) %>% 
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(rows = c(1,4,7))) %>% 
  tab_style(style = cell_fill(color = 'lightblue'),
            locations = cells_body(rows = 4:5)) %>% 
  # Make 'Age' font color red
  tab_style(style = cell_text(color = "#d02e4c"),
    locations = cells_body(rows = c(7,8,9))
  )
NO Confounder Confounder
(Intercept) 124.384** 100.997**
[123.537, 125.231] [98.863, 103.131]
s.e. = 0.432 s.e. = 1.089
SmokeNowYes -4.264** 0.684
[-5.524, -3.004] [-0.554, 1.922]
s.e. = 0.643 s.e. = 0.631
Age 0.432**
[0.395, 0.468]
s.e. = 0.019
Num.Obs. 3108 3108
R2 0.014 0.158
* p < 0.05, ** p < 0.01

Compare predicted values from 2 models

  • With tidyr::pivot_longer() we reshape the dataset so we can plot the predicted values from both models in a faceted plot.
# Add predicted values from the unadjusted and adjusted models
nhanes_conf <- nhanes_conf %>%
  dplyr::mutate(
    pred_unadjusted = predict(model_unadjusted),  # Predicted BPSysAve from unadjusted model
    pred_adjusted = predict(model_adjusted)       # Predicted BPSysAve from adjusted model
  )

# Reshape the data into long format using pivot_longer()
nhanes_long <- nhanes_conf %>%
  tidyr::pivot_longer(cols = c(pred_unadjusted, pred_adjusted), 
               names_to = "model", values_to = "predicted") %>%
  dplyr::mutate(model = recode(model, 
                        pred_unadjusted = "Unadjusted: SmokeNow on BPSysAve", 
                        pred_adjusted = "Adjusted: SmokeNow + Age on BPSysAve")) %>% 
  #Reorder the 'model' factor to change the order of the facets
  dplyr::mutate(model = factor(
   model,   levels = c("Unadjusted: SmokeNow on BPSysAve", "Adjusted: SmokeNow + Age on BPSysAve")
))
  • check the df new size
dim(nhanes_conf)
[1] 3108    6
dim(nhanes_long)
[1] 6216    6

Plot predicted values from 2 models

  • Using nhanes_long df in “long shape”
# Violin plot with dashed line connecting the two conditions
ggplot(nhanes_long, aes(x = factor(SmokeNow), y = BPSysAve, fill = model)) +
  # Create the violin plot to show the distribution of blood pressure
  geom_point(color = "grey", alpha = 0.4, position = position_dodge(width = 0.3) ) +
  # Overlay the predicted dashed line between the two smoking conditions
  geom_line(aes(y = predicted, group = model, color = model), 
            size = 0.8, linetype = "dashed", position = position_dodge(width = 0.3)) +
  # Facet vertically for unadjusted and adjusted models
  facet_wrap(model ~ ., scales = "free_y",ncol = 2) +
  # Add labels and title
  labs(title = "Confounder Analysis: Predicted values in Unadjusted vs Adjusted Regression models",
       subtitle = "Age = Confounder",
       x = "Smoking Status",
       y = "Systolic Blood Pressure (BPSysAve)",
       color = "Model",
       fill = "Model"
       ) +
  # Customize colors for better contrast
  scale_fill_manual(values = c("Unadjusted: SmokeNow on BPSysAve" ="lightcoral" , 
                               "Adjusted: SmokeNow + Age on BPSysAve" = "lightblue")) +
  scale_color_manual(values = c("Unadjusted: SmokeNow on BPSysAve" = "red", 
                                "Adjusted: SmokeNow + Age on BPSysAve" = "blue")) +
  # Minimal theme for clarity
  theme_minimal()+
  theme(legend.position = "none")

Plot predicted values from 2 models

Figure 1: As age causes high blood pressure, but also determines smoking status, the Adjusted model (right) gives us a more credible relationship between smoking and blood pressure.

CAUSAL MODELING WITH MEDIATOR

DAG with mediator

M = BMI is a mediator
X = PhysActiveDays
Y = BPSysAve (blood pressure)

The case of a mediator

  • M = BMI is a mediator for the effect of X = PhysActiveDays on Y = BPSysAve
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_med <- NHANES %>%
  dplyr::select(Gender, Age, BPSysAve, BMI, PhysActiveDays, Diabetes, SmokeNow) %>%
  drop_na()

# Take a quick look at the data
summary(nhanes_med)
    Gender         Age          BPSysAve        BMI       PhysActiveDays
 female:632   Min.   :20.0   Min.   : 81   Min.   :15.0   Min.   :1.0   
 male  :831   1st Qu.:34.0   1st Qu.:111   1st Qu.:23.6   1st Qu.:2.0   
              Median :47.0   Median :119   Median :27.0   Median :3.0   
              Mean   :47.8   Mean   :122   Mean   :27.9   Mean   :3.6   
              3rd Qu.:60.0   3rd Qu.:131   3rd Qu.:31.3   3rd Qu.:5.0   
              Max.   :80.0   Max.   :221   Max.   :59.1   Max.   :7.0   
 Diabetes   SmokeNow 
 No :1312   No :834  
 Yes: 151   Yes:629  
                     
                     
                     
                     

Fitting unadjusted (total) model

Before adjusting for any mediator, we fit a simple linear model to estimate the total effect of PhysActiveDays on Systolic Blood Pressure (disregarding any possible mediation of BMI).

  • Total Effect Model (Unadjusted)
    • This model gives the total effect of of PhysActiveDays on Systolic Blood Pressure (including any indirect effects through M = BMI or other variables that haven’t been included.)
# Unadjusted model: total effect of smoking on blood pressure
total_effect_model <- lm(BPSysAve ~ PhysActiveDays, data = nhanes_med)
summary(total_effect_model)$coefficients
               Estimate Std. Error t value Pr(>|t|)
(Intercept)      120.12      1.055  113.88   0.0000
PhysActiveDays     0.59      0.262    2.25   0.0246

Set up the mediation models 1/2

Then, we fit the model to estimate the effect of X = PhysActiveDays on M = BMI:

  • Mediator model
    • This model allows us to see whether more physically active days are associated with (lower) BMI values (which could then affect blood pressure).
# Mediator model: SmokeNow affects BMI
mediator_model <- lm(BMI ~ PhysActiveDays, data = nhanes_med)
summary(mediator_model)$coefficients
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     28.2176     0.3444  81.934    0.000
PhysActiveDays  -0.0821     0.0856  -0.959    0.338

Set up the mediation models 2/2

Lastly, we fit the model to estimate the effect of X = PhysActiveDays on Y = BPSysAve while adjusting for M = BMI:

  • Outcome Model (Adjusted for BMI)
    • This model estimates the effect of PhysActiveDays and BMI on Systolic Blood Pressure (BPSysAve). This is the adjusted model.
# Outcome model: PhysActiveDays and BMI affect BPSysAve
outcome_model <- lm(BPSysAve ~  PhysActiveDays + BMI, data = nhanes_med)
summary(outcome_model)$coefficients
               Estimate Std. Error t value  Pr(>|t|)
(Intercept)     109.353     2.4763   44.16 3.15e-271
PhysActiveDays    0.621     0.2603    2.39  1.71e-02
BMI               0.382     0.0795    4.80  1.77e-06

Compare 3 models

  • The Total (unadjusted) effect model estimates the effect of PhysActiveDays on BPSysAve.
  • The Mediator model estimates the effect of PhysActiveDays on BMI (M).
  • The Outcome model estimates the effect of (PhysActiveDays + BMI) on BPSysAve.
med_sum <- modelsummary::msummary(
  list("BPSysAve ~ Total Effect" = total_effect_model,
       "BMI ~ Mediator Effect" = mediator_model,
       "BPSysAve ~ Outcome" = outcome_model), 
  output = "gt",
  statistic = c(
    "conf.int",
    "s.e. ={std.error}"),
  fmt = "%.3f", # format
  gof_omit = 'DF|Deviance|Log.Lik.|F|R2 Adj.|AIC|BIC|RMSE',
  stars = c('*' = .05, '**' = .01))

# Render the table
med_sum %>% 
  tab_spanner(label = "Tot (unadj) Mod.",columns = 2) %>%
  tab_spanner(label = "Med Mod.", columns = 3) %>%
  tab_spanner(label = "Outcome Mod.",columns = 4) %>%
   tab_style(
    style = cell_text(weight = "bold", color = "#7f173d"),
    locations = cells_column_labels(columns = 3)) %>% 
  tab_style(
    style = cell_text(weight = "bold", color = "#005ca1"),
    locations = cells_column_labels(columns = c(2,4))) %>% 
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(rows = c(1,4,7))) %>% 
  # Highlight X 
   tab_style(style = cell_fill(color = 'lightblue'),
            locations = cells_body(rows = 4:6)) %>% 
  # Make 'MEdiator' font color red
  tab_style(style = cell_text(color = "#d02e4c"),
    locations = cells_body(rows = 7:9)) 

Compare 3 models

Tot (unadj) Mod.
Med Mod.
Outcome Mod.
BPSysAve ~ Total Effect BMI ~ Mediator Effect BPSysAve ~ Outcome
(Intercept) 120.120** 28.218** 109.353**
[118.051, 122.189] [27.542, 28.893] [104.495, 114.210]
s.e. =1.055 s.e. =0.344 s.e. =2.476
PhysActiveDays 0.590* -0.082 0.621*
[0.076, 1.104] [-0.250, 0.086] [0.111, 1.132]
s.e. =0.262 s.e. =0.086 s.e. =0.260
BMI 0.382**
[0.226, 0.538]
s.e. =0.080
Num.Obs. 1463 1463 1463
R2 0.003 0.001 0.019
* p < 0.05, ** p < 0.01

Calculate the Indirect, Direct, and Total Effects (1/2)

Using the coefficients from the models, we can calculate the mediation effects manually.

  • Direct Effect: This is the coefficient of PhysActiveDays in the outcome model, which gives the direct effect of exercise on blood pressure (controlling for BMI = M)
  • Indirect Effect: This is the product of the coefficient from the mediator model (PhysActiveDaysBMI) and the coefficient from the outcome model (BMIBPSysAve).
  • Total Effect: This is the coefficient of PhysActiveDays in the unadjusted model (NOT controlling for BMI = M).

Calculate the Indirect, Direct, and Total Effects (2/2)

Breakdown of the estimated effects:

# TOTAL effect: The effect of PhysActiveDays on BPSysAve without adjusting for BMI
total_effect <- coef(total_effect_model)["PhysActiveDays"]# 0.59
total_effect
PhysActiveDays 
          0.59 
# DIRECT effect: The effect of PhysActiveDays on BPSysAve after adjusting for BMI.
direct_effect <- coef(outcome_model)["PhysActiveDays"]# 0.621 
direct_effect
PhysActiveDays 
         0.621 
# Indirect effect: The portion of the effect on BPSysAve that is mediated through BMI
indirect_effect <- coef(mediator_model)["PhysActiveDays"] * coef(outcome_model)["BMI"] # [-0.0821 X 0.382]  
indirect_effect
PhysActiveDays 
       -0.0313 
# Proportion mediated: The proportion of the total effect that is mediated through BMI
proportion_mediated <- glue::glue("{round(indirect_effect / total_effect *100, 2)}%")
proportion_mediated #  -0.0531
-5.31%

Visualise the Indirect, Direct, and Total Effects

# Create a data frame for the effects
effects_df <- data.frame(
  Effect = c("Total Effect", "Direct Effect", 
             "Indirect Effect"),
  Value = c(total_effect, direct_effect, 
            indirect_effect)
)

# Create the bar plot
ggplot(effects_df, aes(x = Effect, y = Value,
                       fill = Effect)) +
  geom_bar(stat = "identity") +
  labs(title = "Mediation Analysis: Total, Direct, and Indirect Effects", 
       subtitle = "Effect (coefficient) of Physical Activity on Systolic Blood Pressure",
       y = "", x = "") +
  theme_minimal()

Visualise the Indirect, Direct, and Total Effects

Interpreting the results

  • The total effect tells us the overall relationship between PhysActiveDays and BPSysAve.

  • The direct effect represents the relationship between PhysActiveDays and BPSysAve, controlling for BMI.

  • The indirect effect (mediated effect) shows how much of the relationship between PhysActiveDays and BPSysAve is explained through BMI (AKA the coef of PhysActiveDays in the mediator model times the coef of BMI in the outcome model)

  • The proportion mediated indicates how much of the total effect is due to the mediation by BMI.

Causal mediation analysis using mediation



Using mediation package

The mediation package can be used to estimate various quantities for causal mediation analysis

  • The function mediate() needs two regression models: 1) \((X → M)\), and 2) \((X + M → Y)\)
  • It returns
    • the average causal mediation effect (ACME),
    • average direct effect (ADE),
    • and total effect.
# recall
#mediator_model[["call"]][["formula"]]
#outcome_model[["call"]][["formula"]]

# Calculate the mediation effect
mediation_model <- mediate(mediator_model, outcome_model, 
                           treat = "PhysActiveDays", 
                           mediator = "BMI", 
                           boot=TRUE, sims=500)

summary(mediation_model)$d1  # Direct effect
[1] -0.0313
summary(mediation_model)$z1  # Indirect effect
[1] 0.621
summary(mediation_model)$tau.coef  # Total effect
[1] 0.59

Using mediation pckg - results

summary(mediation_model)

Causal Mediation Analysis 

Nonparametric Bootstrap Confidence Intervals with the Percentile Method

               Estimate 95% CI Lower 95% CI Upper p-value  
ACME            -0.0313      -0.1067         0.03   0.256  
ADE              0.6213       0.1246         1.16   0.016 *
Total Effect     0.5899       0.0576         1.10   0.020 *
Prop. Mediated  -0.0531      -0.5263         0.06   0.276  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sample Size Used: 1463 


Simulations: 500 

Using mediation pckg - prediction

# Add predicted values for BMI, BPSysAve (adjusted), and BPSysAve (total effect)
nhanes_med_pred <- nhanes_med %>%
  mutate(
    pred_bmi = predict(mediator_model),         # Predicted BMI (mediator model)
    pred_bps_adj = predict(outcome_model),      # Predicted BPSysAve (adjusted for Age)
    pred_bps_total = predict(total_effect_model) # Predicted BPSysAve (total effect)
  )

nhanes_med_pred[1:3, c(1:2, 8:10)]
# A tibble: 3 × 5
  Gender   Age pred_bmi pred_bps_adj pred_bps_total
  <fct>  <int>    <dbl>        <dbl>          <dbl>
1 male      66     27.6         123.           124.
2 female    58     28.1         121.           121.
3 female    57     28.0         119.           122.

Plot prediction results 1/2

Now, let’s create a plot that includes:

  • The total effect of BMI on blood pressure (without adjusting for Age).
  • The adjusted effect of BMI on blood pressure (controlling for Age).
  • The mediator effect (BMI’s effect on Age).
# Check column names
colnames(nhanes_med_pred)
 [1] "Gender"         "Age"            "BPSysAve"       "BMI"           
 [5] "PhysActiveDays" "Diabetes"       "SmokeNow"       "pred_bmi"      
 [9] "pred_bps_adj"   "pred_bps_total"
# Reshape the data into long format for faceting
nhanes_long <- nhanes_med_pred %>%
  gather(key = "model", value = "predicted",  pred_bps_adj, pred_bps_total) %>%
  mutate(model = recode(model, 
                        #pred_bmi = "Mediator Effect: BPSysAve on BMI", 
                        pred_bps_adj = "Adj. Effect: PhysActiveDays + BMI on BPSysAve", 
                        pred_bps_total = "Total Effect: PhysActiveDays on BPSysAve"))

Plot prediction results 2/2

# Plot with vertically aligned facets
ggplot(nhanes_long, aes(x = BMI)) +
  # Actual data points for BPSysAve
  geom_point(aes(y = BPSysAve), color = "black", shape = 16, alpha = 0.5) +  
  # Predicted values from different models
  geom_line(aes(y = predicted, color = model), size = 1) +  
  # Facet vertically
  facet_grid(model ~ ., scales = "free_y") +  
  labs(title = "Mediation Analysis: Total, Adjusted, and Mediator Effects",
       x = "BMI",
       y = "Blood Pressure | Age",
       color = "Effect Type") +
  theme_minimal()

Plot prediction results 2/2

CALCULATE COMMONLY USED ESTIMANDs (ATE, ATT, ATU)



The explanation and examples below follow very closely two sources:

  1. Noah Greifer, Elizabeth A. Stuart, Choosing the Causal Estimand for Propensity Score Analysis of Observational Studies
  2. Andrew Heiss, Demystifying causal inference estimands: ATE, ATT, and ATU

Establishing some common symbols

  • \(X\) = treatment or intervention
  • \(Y^1\) = potential outcome if treated
  • \(Y^0\) = potential outcome if not treated
  • \(\delta\) = difference \((Y^1 - Y^0)\), or the individual-level causal effect.

EXAMPLE with toy case

Assuming we could observe the potential outcomes for each individual:

basic_po <- tribble(
  ~id, ~age,    ~treated, ~outcome_1, ~outcome_0,
  1,   "Old",   1,        80,         60,
  2,   "Old",   1,        75,         70,
  3,   "Old",   1,        85,         80,
  4,   "Old",   0,        70,         60,
  5,   "Young", 1,        75,         70,
  6,   "Young", 0,        80,         80,
  7,   "Young", 0,        90,         100,
  8,   "Young", 0,        85,         80
) |> 
  dplyr::mutate(
    ice = outcome_1 - outcome_0,
    outcome = ifelse(treated == 1, outcome_1, outcome_0)
  )

ITE in toy example example (all outcomes!)

  • Each person (ID) has two potential outcomes and an individual causal effect (\(ITE_i= Y_{i}^1 − Y_{i}^0\) = \(\delta_i\)).
  • \(\delta_i\) would in fact be measurable only with 2 parallel universes!
Table 1
Confounder
Treatment
Unobservable
Realized
ID
Age
Treated
Potential outcomes
ITE or \(\delta_i\)
Outcome
\[Z_i\] \[X_i\] \[Y^1_i\] \[Y^0_i\] \[Y^1_i - Y^0_i\] \[Y_i\]
1 Old 1 80 60 20 80
2 Old 1 75 70 5 75
3 Old 1 85 80 5 85
4 Old 0 70 60 10 60
5 Young 1 75 70 5 75
6 Young 0 80 80 0 80
7 Young 0 90 100 −10 100
8 Young 0 85 80 5 80

ATE = Average Treatment Effect

If we could compare all the same people in different universes and measure their ITE, we could calculate the Average Treatment Effect (ATE), or the effect of the intervention across the whole population.


\[ \begin{aligned} \text{ATE} &= E[\delta_i] & \text{ or} \\ \text{ATE} &= E[Y^1_i - Y^0_i] \end{aligned} \]

Given the data of the \(\delta\) column in Table 1, the ATE is 5.

\[ \text{ATE} = \frac{20 + 5 + 5 + 5 + 10 + 0 + -10 + 5}{8} = 5 \]

ATT = Average Treatment Effect on Treated

We might want to know how big the effect is just for those who received the treatment.

This is called the Average Treatment effect on the Treated (ATT), and is the average of the individual causal effects just among those who were treated.

\[ \begin{aligned} \text{ATT} &= E[\delta_i \mid X_i = 1] & \text{or} \\ \text{ATT} &= E[Y^1_i - Y^0_i \mid X_i = 1] \end{aligned} \]

In this case, that means we’re only looking at the average \(\delta\) for rows 1, 2, 3, and 5 in Table 1:

\[ \text{ATT} = \frac{20 + 5 + 5 + 5}{4} = 8.75 \]

ATU = Average Treatment Effect on Untreated

We can also calculate the Average Treatment effect on the Untreated (ATU; sometimes called the ATC (for effect on the control group)) by finding the average of the individual causal effects among those who were not treated.

\[ \begin{aligned} \text{ATU} &= E[\delta_i \mid X_i = 0] & \text{or} \\ \text{ATU} &= E[Y^1_i - Y^0_i \mid X_i = 0] \end{aligned} \]

Here, we’re only looking at the average \(\delta\) for rows 4, 6, 7, and 8 in Table 1:

\[ \text{ATU} = \frac{10 + 0 - 10 + 5}{4} = 1.25 \]

Relation among ATE, ATT and ATU

In fact, the ATE is technically a weighted average of the ATT and the ATU (here \(\pi\) indicates “proportion”):

\[ \text{ATE} = (\pi_\text{Treated} \times \text{ATT}) + (\pi_\text{Untreated} \times \text{ATU}) \]

In our example, Table 1, we can get the same ATE:

\[ \begin{aligned} \text{ATE} &= (\frac{4}{8} \times 8.75) + (\frac{4}{8} \times 1.25) \\ &= 4.375 + 0.625 \\ &= 5 \end{aligned} \]

Decomposing the ATE into these two parts like this shows that there’s some systematic difference between the treated and untreated people. This difference is called selection bias.

Relation among ATE, ATT and ATU

We can see the selection bias in an alternative decomposition of the ATE:

\[ \text{ATE} = \text{ATT} + \text{Selection bias} \]

The fact that the ATT (8.75) is bigger than the ATE (5) here is a sign the two groups are different. This intervention, whatever it is, has a big effect on the treated people who signed up for it, likely because they somehow knew that the intervention would be helpful for them.

Those who were untreated have a really low ATU (1.25)—they likely didn’t sign up for the intervention because they knew that it wouldn’t do much for them.

Relation among ATE, ATT and ATU (R)

# OBTAIN ATT, ATU
effect_types <- basic_po |> 
  dplyr::group_by(treated) |> 
  dplyr::summarize(effect = mean(ice), n = n() ) |> 
  dplyr::mutate(prop = n / sum(n),
    weighted_effect = effect * prop) |> 
  dplyr::mutate(estimand = case_match(treated, 0 ~ "ATU", 1 ~ "ATT"), .before = 1)

effect_types
# A tibble: 2 × 6
  estimand treated effect     n  prop weighted_effect
  <chr>      <dbl>  <dbl> <int> <dbl>           <dbl>
1 ATU            0   1.25     4   0.5           0.625
2 ATT            1   8.75     4   0.5           4.38 
# OBTAIN ATE
effect_types |> dplyr::summarize(ATE = sum(weighted_effect))
# A tibble: 1 × 1
    ATE
  <dbl>
1     5