Lab 4:Mapping causal & predictive approaches

Practice session

Luisa M. Mimmi — https://luisamimmi.org/

March 28, 2025

GOAL OF TODAY’S PRACTICE SESSION

Causal analysis: how experimental studies uncover the association between exposure and outcome
Visually explore causal pathways and influential variables with Directed Acyclic Graphs (DAGs):
- Collider: a variable that is influenced by both the exposure and the outcome
- Confounder: a variable that influences both the exposure and the dependent variable outcome
- Mediator: a variable that lies on the causal path between the exposure and the outcome
Practically illustrate causal modeling in the presence of influential causal variables:
- [Colliders can introduce bias if conditioned upon]
- Confounders must be adjusted for
- Mediators require careful consideration in analyses
Calculate commonly used estimands: ATE, ATT, ATU

R ENVIRONMENT SET UP & DATA

Needed R Packages

We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We may also use the packages below (specifying package::function for clarity).

# Load pckgs for this R session

# --- General 
library(here) # A Simpler Way to Find Your Files   
library(skimr)    # Compact and Flexible Summaries of Data
library(paint)   # Fancy structure for data frames 
library(janitor)  # Simple Tools for Examining and Cleaning Dirty Data
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(gt) # Easily Create Presentation-Ready Display Tables
library(marginaleffects) # Marginal Effects for Model Objects
library(WeightIt) # Weighting for Covariate Balance in Observational Studies
# --- Plotting & data visualization
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(ggpubr)        # 'ggplot2' Based Publication Ready Plots
library(patchwork) # The Composer of Plots
# --- Descriptive statistics and regressions
library(broom) # Convert Statistical Objects into Tidy Tibbles
library(Hmisc) # Harrell Miscellaneous
library(estimatr) # Fast Estimators for Design-Based Inference
library(modelsummary) # Summary Tables and Plots for Statistical Models and Data:
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
library(sandwich) #for robust variance estimation
library(survey) # Analysis of Complex Survey Samples
library(haven) # Import and Export 'SPSS', 'Stata' and 'SAS' Files
library(simstudy) # Simulation of Study Data
library(NHANES) # Data from the US National Health and Nutrition Examination Study  
library(mediation) # Causal Mediation Analysis

DATASETS for today

Importing Dataset 1 (NHANES)

Name: NHANES, accessible from package NHANES
Documentation: See reference on the data downloaded with ?NHANES
Sampling details: We’ll use a subset of the NHANES dataset, focusing on variables related to smoking status (SmokeNow), systolic blood pressure (BPSysAve), Body Mass Index (BMI) and age (Age).

set.seed(123) # Set seed for reproducibility
data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_data <- NHANES %>%
  dplyr::select(ID, Gender, Age, Race1, Height, Weight, BMI, SmokeNow, PhysActive, PhysActiveDays,
         BPSysAve, BPDiaAve, TotChol, DirectChol, Diabetes, HealthGen,
         Education, HHIncomeMid, Poverty) %>%
  drop_na() %>% 
  # make it smaller 
  slice_sample(n = 1000) # Randomly select 1000 rows using

# Take a quick look at the data
#paint(nhanes_data)

Dataset 1 (NHANES) Variables and their description

EXCERPT: see complete file in Input Data Folder

Variable	Type	Description
ID	int	xxxxx
Gender	chr	Gender (sex) of study participant coded as male or female
Age	int	##
AgeDecade	chr	yy-yy es 20-29
Race1	chr	Reported race of study participant: Mexican, Hispanic, White, #Black, or Other
Education	chr	[>= 20 yro]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad.
HHIncome	chr	Total annual gross income for the household in US dollars
HHIncomeMid	int	Numerical version of HHIncome derived from the middle income # in each category. Ex. 12500 40000
Poverty	dbl	A ratio of family income to poverty guidelines. Smaller # numbers indicate more poverty Ex.. 0.95 1.74 4.99
Weight	dbl	Weight in kg
Height	dbl	Standing height in cm. Reported for participants aged 2 years or older.
BMI	dbl	Body mass index (weight/height2 in kg/m2). Reported for participants aged 2 years or older
Pulse	int	60 second pulse rate
BPSysAve	int	Combined systolic blood pressure reading, following the # procedure outlined for BPXSAR
BPDiaAve	int	Combined diastolic blood pressure reading, following the # procedure outlined for BPXDAR
DirectChol	dbl	Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
TotChol	dbl	Total HDL cholesterol in mmol/L. Reported for participants aged 6 years or older
Diabetes	chr	Study participant told by a doctor or health professional that they have diabetes
HealthGen	chr	Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor Fair
PhysActive	chr	Participant does moderate or vigorous-intensity sports, fitness or recreational activities (Yes or No).
SmokeNow	chr	Study participant currently smokes cigarettes regularly. (Yes or No)
…	…	…

VISUALLY EXPLORE CAUSAL PATHWAYS WITH DAGS

using R packages:
- dagitty for creating and analyzing DAGs
- ggdag for plotting DAGs

DAG with `collider`

# df DAG 
dag_collid <- ggdag::dagify( Y ~ X + e, Z ~ X + Y, 
                      exposure = "X",outcome = "Y", #latent = "e",
                      # labels
                      labels = c("Z" = "Collider",  "X" = "Exposure",
                                 "Y" = "Outcome", "e" = "Unobserved \nerror"),
                      # positions
                      coords = list(
                        x = c(X = 0, Y= 4, Z = 2 , e = 5),
                        y = c(X = 0, Y = 0, Z = 2, e = 1) 
                      )) %>% 
  ggdag::tidy_dagitty() 

# Plot DAG 
dag_col_p <-  dag_collid %>% 
  ggdag::ggdag(., layout = "circle") +  # decided in dagify
  theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
   ggdag::geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
   ggdag::geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  
   ggdag::geom_dag_text(color="white") +
   ggplot2::labs(title = "Causal map with COLLIDER (Z)" , 
       #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", 
       caption = ) +
  # Map colors to specific nodes
   ggplot2::scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")    # Remove legend

dag_col_p

DAG with `collider`

DAG with `confounder`

# df DAG 
dag_confounder <- ggdag::dagify(Y ~ X + Z + e, X ~  Z, 
                                exposure = "X", outcome = "Y", #latent = "e",
                                # labels
                                labels = c(
                                  "Z" = "Confounder",
                                  "X" = "Exposure",
                                  "Y" = "Outcome",
                                  "e" = "Unobserved \nerror"),
                                # positions
                                coords = list(
                                  x = c(X = 0, Y= 4, Z = 2 , e = 5),
                                  y = c(X = 0, Y = 0, Z = 2, e = 1) 
                                )) %>% 
  ggdag::tidy_dagitty()     # Add edge_type within pipe

# Plot DAG 
dag_conf_p <- dag_confounder %>% 
  ggdag::ggdag(., layout = "circle") + 
  ggdag::theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  ggdag::geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  ggdag::geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  ggdag::geom_dag_text(color="white") +
  ggplot2::labs(title = "Causal map with CONFOUNDER (Z)" , #subtitle = " ",
        caption = ) +
  # Map colors to specific nodes
  ggplot2::scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), 
                     guide = "none")   
dag_conf_p

DAG with `confounder`

DAG with `mediator`

# df DAG 
dag_med <- ggdag::dagify( Y ~ Z + e,  Z ~ X, exposure = "X", outcome = "Y",  #latent = "e",
                   # labels
                   labels = c(
                     "Z" = "Mediator",
                     "X" = "Exposure",
                     "Y" = "Outcome",
                     "e" = "Unobserved \nerror"),
                   # positions
                   coords = list(
                     x = c(X = 0, Y= 4, Z = 2 , e = 5),
                     y = c(X = 0, Y = 0, Z = 2, e = 1) 
                   )) %>% 
  ggdag::tidy_dagitty()     # Add edge_type within pipe

# Plot DAG 
dag_med_p <-  dag_med %>% 
  ggdag::ggdag(., layout = "circle") +  # decided in dagify
  ggdag::theme_dag_blank(plot.caption = element_text(hjust = 1)) +
  # Nodes 
  ggdag::geom_dag_node(aes(color = name)) +  
  # Labels dodged to avoid overlap
  ggdag::geom_dag_label(aes(label = label), color = "#4c4c4c", nudge_x = 0.7, nudge_y = 0.2) +  
  
  ggdag::geom_dag_text(color="white") +
  ggplot2::labs(title = "Causal map with MEDIATOR (Z)" , 
       #subtitle = "X = exposure, Y = outcome, Z = collider, e = random error", 
       caption = ) +
  # Map colors to specific nodes
  ggplot2::scale_color_manual(values = c("X" = "#9b6723", "Y" = "#72aed8", "Z" = "#d02e4c",
                                "e" = "#A6A6A6"), guide = "none")    # Remove legend
dag_med_p

DAG with `mediator`

CAUSAL MODELING WITH `CONFOUNDER`

DAG with `confounder`

Z = Age = confounder
X = SmokeNow
Y = BPSysAve (blood pressure)

Example of CONFOUNDER

Z = Age = confounder for the relationship between X = SmokeNow and Y = BPSysAve (blood pressure)

data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_conf <- NHANES %>%
  dplyr::select(ID, Age, SmokeNow, BPSysAve) %>%
  tidyr::drop_na()
  
# Take a quick look at the data
paint::paint(nhanes_conf)

tibble [3108, 4]
ID       int 51624 51624 51624 51630 51654 51666
Age      int 34 34 34 49 66 58
SmokeNow fct No No No Yes No Yes
BPSysAve int 113 113 113 112 111 127

In this context:

Age influences both smoking (SmokeNow) and blood pressure (BPSysAve), making it a confounder.
SmokeNow directly affects BPSysAve.

Visualisation of CONFOUNDER

# Scatterplot of Age and Blood Pressure by Smoking Status
ggplot2::ggplot(nhanes_conf, aes(x = Age, y = BPSysAve, color = SmokeNow)) +
  geom_point(alpha = 0.6) +
  # linear rel 
  geom_smooth(method = "lm", se = FALSE,  size = 1.5,  linetype = "dashed", color = "black") +
  scale_color_manual(
    values = c("No" = "#1b9e77", "Yes" = "#d95f02")) +
  facet_wrap(~ SmokeNow) + 
  labs(
    title = "Scatterplot of Age and Blood Pressure by Smoking Status",
    x = "Age",
    y = "Blood Pressure (mmHg)",
    color = "Smokers"
  )

Visualisation of CONFOUNDER

ONE WAY OF DEALING WITH CONFOUNDERS

introducing the confounder variable as a term in the regression model to control for its effect on the outcome.

Regression analysis: `controlling for` the confounder

Unadjusted model:

# Unadjusted linear model 
model_unadjusted <- lm(BPSysAve ~ SmokeNow, data = nhanes_conf)
summary(model_unadjusted)$coefficients

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   124.38      0.432  287.97  0.0e+00
SmokeNowYes    -4.26      0.643   -6.64  3.8e-11

Adjusted model:

includes the confounder Age in the model

# Adjusted model
model_adjusted <- lm(BPSysAve ~ SmokeNow + Age, data = nhanes_conf)
summary(model_adjusted)$coefficients

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)  100.997     1.0886   92.78  0.00e+00
SmokeNowYes    0.684     0.6313    1.08  2.79e-01
Age            0.432     0.0187   23.09 5.19e-109

Compare models (without and with `confounder`)

In the Unadjusted Model any observed relationship between smoking and blood pressure might be confounded.
In the Adjusted Model (which ‘controls for’ Age) a more accurate estimate of the causal effect of smoking (SmokeNowYes) on blood pressure (BPSysAve) accounts for the confounding influence of Age.

# Create the table with a custom statistic formatter
conf_sum <- modelsummary::msummary(
  list(
    "NO Confounder" = model_unadjusted, 
    "Confounder" = model_adjusted
  ),
  output = "gt",
  statistic = c(
    "conf.int",
    "s.e. = {std.error}"
  ), 
  fmt = "%.3f", 
  gof_omit = 'DF|Deviance|Log.Lik.|F|R2 Adj.|AIC|BIC|RMSE',
  stars = c('*' = .05, '**' = .01)
)

# Render the table directly
conf_sum  %>%    # text and background color
  tab_style(
    style = cell_text(weight = "bold", color = "#005ca1"),
    locations = cells_column_labels(
      columns = c("NO Confounder", "Confounder"))) %>% 
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(rows = c(1,4,7))) %>% 
  tab_style(style = cell_fill(color = 'lightblue'),
            locations = cells_body(rows = 4:5)) %>% 
  # Make 'Age' font color red
  tab_style(style = cell_text(color = "#d02e4c"),
    locations = cells_body(rows = c(7,8,9))
  )

	NO Confounder	Confounder
(Intercept)	124.384**	100.997**
	[123.537, 125.231]	[98.863, 103.131]
	s.e. = 0.432	s.e. = 1.089
SmokeNowYes	-4.264**	0.684
	[-5.524, -3.004]	[-0.554, 1.922]
	s.e. = 0.643	s.e. = 0.631
Age		0.432**
		[0.395, 0.468]
		s.e. = 0.019
Num.Obs.	3108	3108
R2	0.014	0.158
* p < 0.05, ** p < 0.01

Compare predicted values from 2 models

With tidyr::pivot_longer() we reshape the dataset so we can plot the predicted values from both models in a faceted plot.

# Add predicted values from the unadjusted and adjusted models
nhanes_conf <- nhanes_conf %>%
  dplyr::mutate(
    pred_unadjusted = predict(model_unadjusted),  # Predicted BPSysAve from unadjusted model
    pred_adjusted = predict(model_adjusted)       # Predicted BPSysAve from adjusted model
  )

# Reshape the data into long format using pivot_longer()
nhanes_long <- nhanes_conf %>%
  tidyr::pivot_longer(cols = c(pred_unadjusted, pred_adjusted), 
               names_to = "model", values_to = "predicted") %>%
  dplyr::mutate(model = recode(model, 
                        pred_unadjusted = "Unadjusted: SmokeNow on BPSysAve", 
                        pred_adjusted = "Adjusted: SmokeNow + Age on BPSysAve")) %>% 
  #Reorder the 'model' factor to change the order of the facets
  dplyr::mutate(model = factor(
   model,   levels = c("Unadjusted: SmokeNow on BPSysAve", "Adjusted: SmokeNow + Age on BPSysAve")
))

check the df new size

dim(nhanes_conf)

[1] 3108    6

dim(nhanes_long)

[1] 6216    6

Plot predicted values from 2 models

Using nhanes_long df in “long shape”

# Violin plot with dashed line connecting the two conditions
ggplot(nhanes_long, aes(x = factor(SmokeNow), y = BPSysAve, fill = model)) +
  # Create the violin plot to show the distribution of blood pressure
  geom_point(color = "grey", alpha = 0.4, position = position_dodge(width = 0.3) ) +
  # Overlay the predicted dashed line between the two smoking conditions
  geom_line(aes(y = predicted, group = model, color = model), 
            size = 0.8, linetype = "dashed", position = position_dodge(width = 0.3)) +
  # Facet vertically for unadjusted and adjusted models
  facet_wrap(model ~ ., scales = "free_y",ncol = 2) +
  # Add labels and title
  labs(title = "Confounder Analysis: Predicted values in Unadjusted vs Adjusted Regression models",
       subtitle = "Age = Confounder",
       x = "Smoking Status",
       y = "Systolic Blood Pressure (BPSysAve)",
       color = "Model",
       fill = "Model"
       ) +
  # Customize colors for better contrast
  scale_fill_manual(values = c("Unadjusted: SmokeNow on BPSysAve" ="lightcoral" , 
                               "Adjusted: SmokeNow + Age on BPSysAve" = "lightblue")) +
  scale_color_manual(values = c("Unadjusted: SmokeNow on BPSysAve" = "red", 
                                "Adjusted: SmokeNow + Age on BPSysAve" = "blue")) +
  # Minimal theme for clarity
  theme_minimal()+
  theme(legend.position = "none")

Plot predicted values from 2 models

Figure 1: As age causes high blood pressure, but also determines smoking status, the Adjusted model (*right*) gives us a more credible relationship between smoking and blood pressure.

CAUSAL MODELING WITH `MEDIATOR`

DAG with `mediator`

M = BMI is a mediator
X = PhysActiveDays
Y = BPSysAve (blood pressure)

The case of a mediator

M = BMI is a mediator for the effect of X = PhysActiveDays on Y = BPSysAve

data(NHANES)
# Select relevant columns and drop rows with missing values
nhanes_med <- NHANES %>%
  dplyr::select(Gender, Age, BPSysAve, BMI, PhysActiveDays, Diabetes, SmokeNow) %>%
  drop_na()

# Take a quick look at the data
summary(nhanes_med)

    Gender         Age          BPSysAve        BMI       PhysActiveDays
 female:632   Min.   :20.0   Min.   : 81   Min.   :15.0   Min.   :1.0   
 male  :831   1st Qu.:34.0   1st Qu.:111   1st Qu.:23.6   1st Qu.:2.0   
              Median :47.0   Median :119   Median :27.0   Median :3.0   
              Mean   :47.8   Mean   :122   Mean   :27.9   Mean   :3.6   
              3rd Qu.:60.0   3rd Qu.:131   3rd Qu.:31.3   3rd Qu.:5.0   
              Max.   :80.0   Max.   :221   Max.   :59.1   Max.   :7.0   
 Diabetes   SmokeNow 
 No :1312   No :834  
 Yes: 151   Yes:629

Fitting unadjusted (total) model

Before adjusting for any mediator, we fit a simple linear model to estimate the total effect of PhysActiveDays on Systolic Blood Pressure (disregarding any possible mediation of BMI).

Total Effect Model (Unadjusted)
- This model gives the total effect of of PhysActiveDays on Systolic Blood Pressure (including any indirect effects through M = BMI or other variables that haven’t been included.)

# Unadjusted model: total effect of smoking on blood pressure
total_effect_model <- lm(BPSysAve ~ PhysActiveDays, data = nhanes_med)
summary(total_effect_model)$coefficients

               Estimate Std. Error t value Pr(>|t|)
(Intercept)      120.12      1.055  113.88   0.0000
PhysActiveDays     0.59      0.262    2.25   0.0246

Set up the mediation models 1/2

Then, we fit the model to estimate the effect of X = PhysActiveDays on M = BMI:

Mediator model
- This model allows us to see whether more physically active days are associated with (lower) BMI values (which could then affect blood pressure).

# Mediator model: SmokeNow affects BMI
mediator_model <- lm(BMI ~ PhysActiveDays, data = nhanes_med)
summary(mediator_model)$coefficients

               Estimate Std. Error t value Pr(>|t|)
(Intercept)     28.2176     0.3444  81.934    0.000
PhysActiveDays  -0.0821     0.0856  -0.959    0.338

Set up the mediation models 2/2

Lastly, we fit the model to estimate the effect of X = PhysActiveDays on Y = BPSysAve while adjusting for M = BMI:

Outcome Model (Adjusted for BMI)
- This model estimates the effect of PhysActiveDays and BMI on Systolic Blood Pressure (BPSysAve). This is the adjusted model.

# Outcome model: PhysActiveDays and BMI affect BPSysAve
outcome_model <- lm(BPSysAve ~  PhysActiveDays + BMI, data = nhanes_med)
summary(outcome_model)$coefficients

               Estimate Std. Error t value  Pr(>|t|)
(Intercept)     109.353     2.4763   44.16 3.15e-271
PhysActiveDays    0.621     0.2603    2.39  1.71e-02
BMI               0.382     0.0795    4.80  1.77e-06

Compare 3 models

The Total (unadjusted) effect model estimates the effect of PhysActiveDays on BPSysAve.
The Mediator model estimates the effect of PhysActiveDays on BMI (M).
The Outcome model estimates the effect of (PhysActiveDays + BMI) on BPSysAve.

med_sum <- modelsummary::msummary(
  list("BPSysAve ~ Total Effect" = total_effect_model,
       "BMI ~ Mediator Effect" = mediator_model,
       "BPSysAve ~ Outcome" = outcome_model), 
  output = "gt",
  statistic = c(
    "conf.int",
    "s.e. ={std.error}"),
  fmt = "%.3f", # format
  gof_omit = 'DF|Deviance|Log.Lik.|F|R2 Adj.|AIC|BIC|RMSE',
  stars = c('*' = .05, '**' = .01))

# Render the table
med_sum %>% 
  tab_spanner(label = "Tot (unadj) Mod.",columns = 2) %>%
  tab_spanner(label = "Med Mod.", columns = 3) %>%
  tab_spanner(label = "Outcome Mod.",columns = 4) %>%
   tab_style(
    style = cell_text(weight = "bold", color = "#7f173d"),
    locations = cells_column_labels(columns = 3)) %>% 
  tab_style(
    style = cell_text(weight = "bold", color = "#005ca1"),
    locations = cells_column_labels(columns = c(2,4))) %>% 
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(rows = c(1,4,7))) %>% 
  # Highlight X 
   tab_style(style = cell_fill(color = 'lightblue'),
            locations = cells_body(rows = 4:6)) %>% 
  # Make 'MEdiator' font color red
  tab_style(style = cell_text(color = "#d02e4c"),
    locations = cells_body(rows = 7:9))

Compare 3 models

	Tot (unadj) Mod.	Med Mod.	Outcome Mod.
	BPSysAve ~ Total Effect	BMI ~ Mediator Effect	BPSysAve ~ Outcome
(Intercept)	120.120**	28.218**	109.353**
	[118.051, 122.189]	[27.542, 28.893]	[104.495, 114.210]
	s.e. =1.055	s.e. =0.344	s.e. =2.476
PhysActiveDays	0.590*	-0.082	0.621*
	[0.076, 1.104]	[-0.250, 0.086]	[0.111, 1.132]
	s.e. =0.262	s.e. =0.086	s.e. =0.260
BMI			0.382**
			[0.226, 0.538]
			s.e. =0.080
Num.Obs.	1463	1463	1463
R2	0.003	0.001	0.019
* p < 0.05, ** p < 0.01

Calculate the Indirect, Direct, and Total Effects (1/2)

Using the coefficients from the models, we can calculate the mediation effects manually.

Direct Effect: This is the coefficient of PhysActiveDays in the outcome model, which gives the direct effect of exercise on blood pressure (controlling for BMI = M)
Indirect Effect: This is the product of the coefficient from the mediator model (PhysActiveDays → BMI) and the coefficient from the outcome model (BMI → BPSysAve).
Total Effect: This is the coefficient of PhysActiveDays in the unadjusted model (NOT controlling for BMI = M).

Calculate the Indirect, Direct, and Total Effects (2/2)

Breakdown of the estimated effects:

# TOTAL effect: The effect of PhysActiveDays on BPSysAve without adjusting for BMI
total_effect <- coef(total_effect_model)["PhysActiveDays"]# 0.59
total_effect

PhysActiveDays 
          0.59

# DIRECT effect: The effect of PhysActiveDays on BPSysAve after adjusting for BMI.
direct_effect <- coef(outcome_model)["PhysActiveDays"]# 0.621 
direct_effect

PhysActiveDays 
         0.621

# Indirect effect: The portion of the effect on BPSysAve that is mediated through BMI
indirect_effect <- coef(mediator_model)["PhysActiveDays"] * coef(outcome_model)["BMI"] # [-0.0821 X 0.382]  
indirect_effect

PhysActiveDays 
       -0.0313

# Proportion mediated: The proportion of the total effect that is mediated through BMI
proportion_mediated <- glue::glue("{round(indirect_effect / total_effect *100, 2)}%")
proportion_mediated #  -0.0531

-5.31%

Visualise the Indirect, Direct, and Total Effects

# Create a data frame for the effects
effects_df <- data.frame(
  Effect = c("Total Effect", "Direct Effect", 
             "Indirect Effect"),
  Value = c(total_effect, direct_effect, 
            indirect_effect)
)

# Create the bar plot
ggplot(effects_df, aes(x = Effect, y = Value,
                       fill = Effect)) +
  geom_bar(stat = "identity") +
  labs(title = "Mediation Analysis: Total, Direct, and Indirect Effects", 
       subtitle = "Effect (coefficient) of Physical Activity on Systolic Blood Pressure",
       y = "", x = "") +
  theme_minimal()

Visualise the Indirect, Direct, and Total Effects

Interpreting the results

The total effect tells us the overall relationship between PhysActiveDays and BPSysAve.
The direct effect represents the relationship between PhysActiveDays and BPSysAve, controlling for BMI.
The indirect effect (mediated effect) shows how much of the relationship between PhysActiveDays and BPSysAve is explained through BMI (AKA the coef of PhysActiveDays in the mediator model times the coef of BMI in the outcome model)
The proportion mediated indicates how much of the total effect is due to the mediation by BMI.

Causal mediation analysis using `mediation`

Using `mediation` package

The mediation package can be used to estimate various quantities for causal mediation analysis

The function mediate() needs two regression models: 1) \((X → M)\), and 2) \((X + M → Y)\)
It returns
- the average causal mediation effect (ACME),
- average direct effect (ADE),
- and total effect.

# recall
#mediator_model[["call"]][["formula"]]
#outcome_model[["call"]][["formula"]]

# Calculate the mediation effect
mediation_model <- mediate(mediator_model, outcome_model, 
                           treat = "PhysActiveDays", 
                           mediator = "BMI", 
                           boot=TRUE, sims=500)

summary(mediation_model)$d1  # Direct effect

[1] -0.0313

summary(mediation_model)$z1  # Indirect effect

[1] 0.621

summary(mediation_model)$tau.coef  # Total effect

[1] 0.59

Using `mediation` pckg - results

summary(mediation_model)


Causal Mediation Analysis 

Nonparametric Bootstrap Confidence Intervals with the Percentile Method

               Estimate 95% CI Lower 95% CI Upper p-value  
ACME            -0.0313      -0.1067         0.03   0.256  
ADE              0.6213       0.1246         1.16   0.016 *
Total Effect     0.5899       0.0576         1.10   0.020 *
Prop. Mediated  -0.0531      -0.5263         0.06   0.276  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sample Size Used: 1463 


Simulations: 500

Using `mediation` pckg - prediction

# Add predicted values for BMI, BPSysAve (adjusted), and BPSysAve (total effect)
nhanes_med_pred <- nhanes_med %>%
  mutate(
    pred_bmi = predict(mediator_model),         # Predicted BMI (mediator model)
    pred_bps_adj = predict(outcome_model),      # Predicted BPSysAve (adjusted for Age)
    pred_bps_total = predict(total_effect_model) # Predicted BPSysAve (total effect)
  )

nhanes_med_pred[1:3, c(1:2, 8:10)]

# A tibble: 3 × 5
  Gender   Age pred_bmi pred_bps_adj pred_bps_total
  <fct>  <int>    <dbl>        <dbl>          <dbl>
1 male      66     27.6         123.           124.
2 female    58     28.1         121.           121.
3 female    57     28.0         119.           122.

Plot prediction results 1/2

Now, let’s create a plot that includes:

The total effect of BMI on blood pressure (without adjusting for Age).
The adjusted effect of BMI on blood pressure (controlling for Age).
The mediator effect (BMI’s effect on Age).

# Check column names
colnames(nhanes_med_pred)

 [1] "Gender"         "Age"            "BPSysAve"       "BMI"           
 [5] "PhysActiveDays" "Diabetes"       "SmokeNow"       "pred_bmi"      
 [9] "pred_bps_adj"   "pred_bps_total"

# Reshape the data into long format for faceting
nhanes_long <- nhanes_med_pred %>%
  gather(key = "model", value = "predicted",  pred_bps_adj, pred_bps_total) %>%
  mutate(model = recode(model, 
                        #pred_bmi = "Mediator Effect: BPSysAve on BMI", 
                        pred_bps_adj = "Adj. Effect: PhysActiveDays + BMI on BPSysAve", 
                        pred_bps_total = "Total Effect: PhysActiveDays on BPSysAve"))

Plot prediction results 2/2

# Plot with vertically aligned facets
ggplot(nhanes_long, aes(x = BMI)) +
  # Actual data points for BPSysAve
  geom_point(aes(y = BPSysAve), color = "black", shape = 16, alpha = 0.5) +  
  # Predicted values from different models
  geom_line(aes(y = predicted, color = model), size = 1) +  
  # Facet vertically
  facet_grid(model ~ ., scales = "free_y") +  
  labs(title = "Mediation Analysis: Total, Adjusted, and Mediator Effects",
       x = "BMI",
       y = "Blood Pressure | Age",
       color = "Effect Type") +
  theme_minimal()

Plot prediction results 2/2

DEFINING POPULAR CAUSAL OUTCOME EFFECTS

CALCULATE COMMONLY USED ESTIMANDs (ATE, ATT, ATU)

The explanation and examples below follow very closely two sources:

Establishing some common symbols

\(X\) = treatment or intervention
\(Y^1\) = potential outcome if treated
\(Y^0\) = potential outcome if not treated
\(\delta\) = difference \((Y^1 - Y^0)\), or the individual-level causal effect.

EXAMPLE with toy case

Assuming we could observe the potential outcomes for each individual:

basic_po <- tribble(
  ~id, ~age,    ~treated, ~outcome_1, ~outcome_0,
  1,   "Old",   1,        80,         60,
  2,   "Old",   1,        75,         70,
  3,   "Old",   1,        85,         80,
  4,   "Old",   0,        70,         60,
  5,   "Young", 1,        75,         70,
  6,   "Young", 0,        80,         80,
  7,   "Young", 0,        90,         100,
  8,   "Young", 0,        85,         80
) |> 
  dplyr::mutate(
    ice = outcome_1 - outcome_0,
    outcome = ifelse(treated == 1, outcome_1, outcome_0)
  )

ITE in toy example example (all outcomes!)

Each person (ID) has two potential outcomes and an individual causal effect (\(ITE_i= Y_{i}^1 − Y_{i}^0\) = \(\delta_i\)).
\(\delta_i\) would in fact be measurable only with 2 parallel universes!

Table 1

	Confounder	Treatment	Unobservable			Realized
ID	Age	Treated	Potential outcomes		ITE or \(\delta_i\)	Outcome
ID	\[Z_i\]	\[X_i\]	\[Y^1_i\]	\[Y^0_i\]	\[Y^1_i - Y^0_i\]	\[Y_i\]
1	Old	1	80	60	20	80
2	Old	1	75	70	5	75
3	Old	1	85	80	5	85
4	Old	0	70	60	10	60
5	Young	1	75	70	5	75
6	Young	0	80	80	0	80
7	Young	0	90	100	−10	100
8	Young	0	85	80	5	80

ATE = Average Treatment Effect

If we could compare all the same people in different universes and measure their ITE, we could calculate the Average Treatment Effect (ATE), or the effect of the intervention across the whole population.

\[ \begin{aligned} \text{ATE} &= E[\delta_i] & \text{ or} \\ \text{ATE} &= E[Y^1_i - Y^0_i] \end{aligned} \]

Given the data of the \(\delta\) column in Table 1, the ATE is 5.

\[ \text{ATE} = \frac{20 + 5 + 5 + 5 + 10 + 0 + -10 + 5}{8} = 5 \]

ATT = Average Treatment Effect on Treated

We might want to know how big the effect is just for those who received the treatment.

This is called the Average Treatment effect on the Treated (ATT), and is the average of the individual causal effects just among those who were treated.

\[ \begin{aligned} \text{ATT} &= E[\delta_i \mid X_i = 1] & \text{or} \\ \text{ATT} &= E[Y^1_i - Y^0_i \mid X_i = 1] \end{aligned} \]

In this case, that means we’re only looking at the average \(\delta\) for rows 1, 2, 3, and 5 in Table 1:

\[ \text{ATT} = \frac{20 + 5 + 5 + 5}{4} = 8.75 \]

ATU = Average Treatment Effect on Untreated

We can also calculate the Average Treatment effect on the Untreated (ATU; sometimes called the ATC (for effect on the control group)) by finding the average of the individual causal effects among those who were not treated.

\[ \begin{aligned} \text{ATU} &= E[\delta_i \mid X_i = 0] & \text{or} \\ \text{ATU} &= E[Y^1_i - Y^0_i \mid X_i = 0] \end{aligned} \]

Here, we’re only looking at the average \(\delta\) for rows 4, 6, 7, and 8 in Table 1:

\[ \text{ATU} = \frac{10 + 0 - 10 + 5}{4} = 1.25 \]

Relation among ATE, ATT and ATU

In fact, the ATE is technically a weighted average of the ATT and the ATU (here \(\pi\) indicates “proportion”):

\[ \text{ATE} = (\pi_\text{Treated} \times \text{ATT}) + (\pi_\text{Untreated} \times \text{ATU}) \]

In our example, Table 1, we can get the same ATE:

\[ \begin{aligned} \text{ATE} &= (\frac{4}{8} \times 8.75) + (\frac{4}{8} \times 1.25) \\ &= 4.375 + 0.625 \\ &= 5 \end{aligned} \]

Decomposing the ATE into these two parts like this shows that there’s some systematic difference between the treated and untreated people. This difference is called selection bias.

Relation among ATE, ATT and ATU

We can see the selection bias in an alternative decomposition of the ATE:

\[ \text{ATE} = \text{ATT} + \text{Selection bias} \]

The fact that the ATT (8.75) is bigger than the ATE (5) here is a sign the two groups are different. This intervention, whatever it is, has a big effect on the treated people who signed up for it, likely because they somehow knew that the intervention would be helpful for them.

Those who were untreated have a really low ATU (1.25)—they likely didn’t sign up for the intervention because they knew that it wouldn’t do much for them.

Relation among ATE, ATT and ATU (R)

# OBTAIN ATT, ATU
effect_types <- basic_po |> 
  dplyr::group_by(treated) |> 
  dplyr::summarize(effect = mean(ice), n = n() ) |> 
  dplyr::mutate(prop = n / sum(n),
    weighted_effect = effect * prop) |> 
  dplyr::mutate(estimand = case_match(treated, 0 ~ "ATU", 1 ~ "ATT"), .before = 1)

effect_types

# A tibble: 2 × 6
  estimand treated effect     n  prop weighted_effect
  <chr>      <dbl>  <dbl> <int> <dbl>           <dbl>
1 ATU            0   1.25     4   0.5           0.625
2 ATT            1   8.75     4   0.5           4.38

# OBTAIN ATE
effect_types |> dplyr::summarize(ATE = sum(weighted_effect))

# A tibble: 1 × 1
    ATE
  <dbl>
1     5

Lab 4:Mapping causal & predictive approaches

GOAL OF TODAY’S PRACTICE SESSION

R ENVIRONMENT SET UP & DATA

Needed R Packages

DATASETS for today

Importing Dataset 1 (NHANES)

Dataset 1 (NHANES) Variables and their description

VISUALLY EXPLORE CAUSAL PATHWAYS WITH DAGS

DAG with collider

DAG with collider

DAG with confounder

DAG with confounder

DAG with mediator

DAG with mediator

CAUSAL MODELING WITH CONFOUNDER

DAG with confounder

Example of CONFOUNDER

Visualisation of CONFOUNDER

Visualisation of CONFOUNDER

ONE WAY OF DEALING WITH CONFOUNDERS

Regression analysis: controlling for the confounder

Compare models (without and with confounder)

Compare predicted values from 2 models

Plot predicted values from 2 models

Plot predicted values from 2 models

CAUSAL MODELING WITH MEDIATOR

DAG with mediator

The case of a mediator

Fitting unadjusted (total) model

Set up the mediation models 1/2

Set up the mediation models 2/2

Compare 3 models

Compare 3 models

Calculate the Indirect, Direct, and Total Effects (1/2)

Calculate the Indirect, Direct, and Total Effects (2/2)

Visualise the Indirect, Direct, and Total Effects

Visualise the Indirect, Direct, and Total Effects

Interpreting the results

Causal mediation analysis using mediation

Using mediation package

Using mediation pckg - results

Using mediation pckg - prediction

Plot prediction results 1/2

Plot prediction results 2/2

Plot prediction results 2/2

DEFINING POPULAR CAUSAL OUTCOME EFFECTS

CALCULATE COMMONLY USED ESTIMANDs (ATE, ATT, ATU)

Establishing some common symbols

EXAMPLE with toy case

ITE in toy example example (all outcomes!)

ATE = Average Treatment Effect

ATT = Average Treatment Effect on Treated

ATU = Average Treatment Effect on Untreated

Relation among ATE, ATT and ATU

Relation among ATE, ATT and ATU

Relation among ATE, ATT and ATU (R)

DAG with `collider`

DAG with `collider`

DAG with `confounder`

DAG with `confounder`

DAG with `mediator`

DAG with `mediator`

CAUSAL MODELING WITH `CONFOUNDER`

DAG with `confounder`

Regression analysis: `controlling for` the confounder

Compare models (without and with `confounder`)

CAUSAL MODELING WITH `MEDIATOR`

DAG with `mediator`

Causal mediation analysis using `mediation`

Using `mediation` package

Using `mediation` pckg - results

Using `mediation` pckg - prediction