Aggregating Time-to-Event Data: Creating Survival Curves for Multiply Imputed Datasets in R
Dealing with missing data in survival analysis is a frequent challenge, particularly when the missingness is not completely at random. Multiple Imputation by Chained Equations (MICE) is the gold standard for addressing this, but it presents a unique visualization problem: how do you draw a single Kaplan-Meier survival curve when you have 5, 10, or 100 different imputed versions of your dataset? Simply averaging the curves or picking one at random violates the principles of Rubin’s Rules. This guide demonstrates the technical pipeline for pooling survival estimates across imputed sets to create a statistically sound "average" survival curve with appropriate confidence intervals.
Table of Contents
- Purpose of Pooling Imputed Survival Curves
- Common Use Cases
- Step-by-Step Implementation in R
- Best Results: Visualization Strategies
- FAQ
- Disclaimer
Purpose
The primary purpose of this method is to maintain statistical power and validity in the presence of missing covariates or outcomes. Survival curves generated from a complete case analysis (dropping rows with missing values) are often biased and underpowered. By using multiple imputation, we account for the uncertainty introduced by missing data. Pooling these curves ensures that our final visualization reflects the "average" survival probability while incorporating the between-imputation variance into the confidence bands, providing a more honest representation of our uncertainty than any single imputed curve could.
Use Case
Pooling survival curves is essential for:
- Clinical Trials: Presenting survival outcomes where baseline patient characteristics (like smoking status or BMI) were partially missing.
- Registry Studies: Analyzing long-term cancer registry data where certain diagnostic markers have high missingness rates.
- Public Health: Visualizing mortality rates in longitudinal surveys where attrition has led to missing follow-up data.
Step-by-Step
1. Impute the Data Using the mice Package
Ensure that your imputation model includes the event indicator and the Nelson-Aalen cumulative hazard estimator (or the log of survival time) so that the imputations capture the survival structure.
- Example:
imp <- mice(df, m=10, method='pmm')
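The one-liner above can be expanded to add the survival structure before imputing. This sketch assumes a data frame `df` with columns named `time` (follow-up time) and `status` (event indicator); adjust the names to your data.

```r
library(mice)

# Add the marginal Nelson-Aalen cumulative hazard as an auxiliary column
# so the imputation model "sees" the survival structure
df$cumhaz <- nelsonaalen(df, time, status)

# Impute m = 10 datasets with predictive mean matching;
# the seed is set only for reproducibility of this sketch
imp <- mice(df, m = 10, method = "pmm", seed = 2024, printFlag = FALSE)
```

Including the cumulative hazard rather than the raw event time in the predictor set is a common recommendation for imputing covariates that will later enter a survival model.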
2. Generate Survival Fits for Each Imputation
Instead of pooling coefficients (like in a Cox model), we need to extract the survival probabilities at specific time points for each of the $m$ datasets.
- Convert the `mids` object to a list of completed datasets with `mice::complete()`.
- Run `survfit()` on each dataset.
- Extract the survival probabilities and time points from each fit.
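The steps above can be sketched as follows, assuming the `imp` object from Step 1 and columns named `time` and `status`:

```r
library(mice)
library(survival)

# One Kaplan-Meier fit per completed dataset
fits <- lapply(seq_len(imp$m), function(i) {
  d <- complete(imp, action = i)
  survfit(Surv(time, status) ~ 1, data = d)
})

# Common time grid: the union of event/censoring times across all fits
grid <- sort(unique(unlist(lapply(fits, function(f) f$time))))

# Matrices with one row per grid time and one column per imputation:
# survival probabilities and their standard errors
surv_mat <- sapply(fits, function(f) summary(f, times = grid, extend = TRUE)$surv)
se_mat   <- sapply(fits, function(f) summary(f, times = grid, extend = TRUE)$std.err)
```

Evaluating every fit on a shared grid (`summary.survfit` with `times =` and `extend = TRUE`) is what makes the probabilities comparable across imputations in the pooling step.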
3. Pool the Survival Probabilities
Rubin's Rules are traditionally applied to regression coefficients, but for curves, we apply them to the complementary log-log transformed survival probabilities to ensure the pooled estimates stay between 0 and 1.
- Calculate the mean survival probability at each time step across all imputed sets.
- Calculate the total variance (Within-imputation + Between-imputation variance).
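Assuming `surv_mat` and `se_mat` are the per-imputation matrices described in Step 2 (survival probabilities and standard errors on a common time grid; the names are illustrative), Rubin's Rules on the complementary log-log scale look like this:

```r
# Transform to the cloglog scale, g(S) = log(-log(S)), where pooling
# is approximately normal; clamp S away from exactly 0 and 1 first
s   <- pmin(pmax(surv_mat, 1e-10), 1 - 1e-10)
cll <- log(-log(s))

# Delta-method SE on the cloglog scale: se_g = se_S / (S * |log(S)|)
cll_se <- se_mat / (s * abs(log(s)))

m    <- ncol(cll)
qbar <- rowMeans(cll)               # pooled point estimate
ubar <- rowMeans(cll_se^2)          # within-imputation variance
bvar <- apply(cll, 1, var)          # between-imputation variance
tvar <- ubar + (1 + 1/m) * bvar     # Rubin's total variance

# Back-transform; note the sign flip: a larger cloglog value
# corresponds to a lower survival probability
pooled_surv <- exp(-exp(qbar))
ci_low  <- exp(-exp(qbar + 1.96 * sqrt(tvar)))
ci_high <- exp(-exp(qbar - 1.96 * sqrt(tvar)))
```

Back-transforming the pooled estimate and its interval with `exp(-exp(.))` is what guarantees the final curve and confidence bands stay between 0 and 1.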
4. Construct the Final Curve
Using survminer's ggsurvplot() or manual ggplot2 construction:
- Plot the pooled survival means as the primary line.
- Plot the confidence intervals derived from the pooled total variance.
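A sketch of the final plot in plain ggplot2, assuming `grid`, `pooled_surv`, `ci_low`, and `ci_high` carry over from the pooling step (these names are illustrative):

```r
library(ggplot2)

pooled_df <- data.frame(time = grid, surv = pooled_surv,
                        lower = ci_low, upper = ci_high)

ggplot(pooled_df, aes(x = time, y = surv)) +
  # Step function, as for a standard Kaplan-Meier curve
  geom_step(linewidth = 0.8) +
  # geom_ribbon interpolates linearly between points; for strictly
  # stepped bands, pammtools::geom_stepribbon() is an alternative
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  coord_cartesian(ylim = c(0, 1)) +
  labs(x = "Time", y = "Pooled survival probability",
       title = "Pooled Kaplan-Meier curve across imputations")
```

Because the bands come from Rubin's total variance, they will typically be wider than the bands of any single imputed curve, which is the honest representation of uncertainty this method aims for.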
Best Results
| Approach | Statistical Validity | Visual Clarity |
|---|---|---|
| Single Imputation | Low (Underestimates variance) | High (One clean line) |
| Overlaying All Curves | Moderate (Shows spread) | Low (Visually "messy") |
| Pooled Mean (Rubin's) | High (Accounts for uncertainty) | High (Clean, valid line) |
FAQ
Can I use the `pool()` function in mice for survival curves?
The pool() function is designed for mipo objects (models with coefficients). Since a Kaplan-Meier fit is a non-parametric summary and not a model with coefficients, you must manually pool the survival estimates using the mice::complete() long-format data or specialized packages like survMICE.
Should the survival time itself be imputed?
Generally, it is safer to impute predictors and indicators. If survival time is missing, it is often treated as censored at time zero or handled through more complex imputation models. Imputing the "outcome" requires a very strong imputation model to avoid circular reasoning.
How many imputations (m) do I need for a stable curve?
While $m=5$ was historically sufficient, modern standards suggest $m=20$ to $m=50$ for survival curves to ensure the tails of the distribution (where data is sparse) are stabilized and the pooled confidence intervals are reliable.
Disclaimer
Standard errors for pooled survival curves can be sensitive to the number of time points chosen for pooling. This tutorial reflects R programming and survival analysis standards as of March 2026. Always ensure your imputation model is "congenial" with your survival analysis (i.e., it includes all variables used in the final survival model).
Tags: SurvivalAnalysis, MultipleImputation, RStats, KaplanMeier
