Aggregating Time-to-Event Data: Creating Survival Curves for Multiply Imputed Datasets in R
Dealing with missing data in survival analysis is a frequent challenge, particularly when the missingness is not completely at random. Multiple Imputation by Chained Equations (MICE) is the gold standard for addressing this, but it presents a unique visualization problem: how do you draw a single Kaplan-Meier survival curve when you have 5, 10, or 100 different imputed versions of your dataset? Simply averaging the curves or picking one at random violates the principles of Rubin’s Rules. This guide demonstrates the technical pipeline for pooling survival estimates across imputed sets to create a statistically sound "average" survival curve with appropriate confidence intervals.
Table of Contents
- Purpose of Pooling Imputed Survival Curves
- Common Use Cases
- Step-by-Step Implementation in R
- Best Results: Visualization Strategies
- FAQ
- Disclaimer
Purpose
The primary purpose of this method is to maintain statistical power and validity in the presence of missing covariates or outcomes. Survival curves generated from a complete case analysis (dropping rows with missing values) are often biased and underpowered. By using multiple imputation, we account for the uncertainty introduced by missing data. Pooling these curves ensures that our final visualization reflects the "average" survival probability while incorporating the between-imputation variance into the confidence bands, providing a more honest representation of our uncertainty than any single imputed curve could.
Use Case
Pooling survival curves is essential for:
- Clinical Trials: Presenting survival outcomes where baseline patient characteristics (like smoking status or BMI) were partially missing.
- Registry Studies: Analyzing long-term cancer registry data where certain diagnostic markers have high missingness rates.
- Public Health: Visualizing mortality rates in longitudinal surveys where attrition has led to missing follow-up data.
Step-by-Step
1. Impute the Data Using the mice Package
Ensure that your imputation model includes the event indicator and the Nelson-Aalen cumulative hazard estimator (or the log of survival time) so that the imputations capture the survival structure.
- Example:
imp <- mice(df, m=10, method='pmm')
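The one-liner above can be expanded to add the survival structure before imputing. This sketch assumes a data frame `df` with columns named `time` (follow-up time) and `status` (event indicator); adjust the names to your data.

```r
library(mice)

# Add the marginal Nelson-Aalen cumulative hazard as an auxiliary column
# so the imputation model "sees" the survival structure
df$cumhaz <- nelsonaalen(df, time, status)

# Impute m = 10 datasets with predictive mean matching;
# the seed is set only for reproducibility of this sketch
imp <- mice(df, m = 10, method = "pmm", seed = 2024, printFlag = FALSE)
```

Including the cumulative hazard rather than the raw event time in the predictor set is a common recommendation for imputing covariates that will later enter a survival model.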
2. Generate Survival Fits for Each Imputation
Instead of pooling coefficients (like in a Cox model), we need to extract the survival probabilities at specific time points for each of the $m$ datasets.
- Convert the `mids` object to a list of completed datasets with `mice::complete()`.
- Run `survfit()` on each dataset.
- Extract the survival probabilities and time points from each fit.
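The steps above can be sketched as follows, assuming the `imp` object from Step 1 and columns named `time` and `status`:

```r
library(mice)
library(survival)

# One Kaplan-Meier fit per completed dataset
fits <- lapply(seq_len(imp$m), function(i) {
  d <- complete(imp, action = i)
  survfit(Surv(time, status) ~ 1, data = d)
})

# Common time grid: the union of event/censoring times across all fits
grid <- sort(unique(unlist(lapply(fits, function(f) f$time))))

# Matrices with one row per grid time and one column per imputation:
# survival probabilities and their standard errors
surv_mat <- sapply(fits, function(f) summary(f, times = grid, extend = TRUE)$surv)
se_mat   <- sapply(fits, function(f) summary(f, times = grid, extend = TRUE)$std.err)
```

Evaluating every fit on a shared grid (`summary.survfit` with `times =` and `extend = TRUE`) is what makes the probabilities comparable across imputations in the pooling step.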
3. Pool the Survival Probabilities
Rubin's Rules are traditionally applied to regression coefficients, but for curves, we apply them to the complementary log-log transformed survival probabilities to ensure the pooled estimates stay between 0 and 1.
- Calculate the mean survival probability at each time step across all imputed sets.
- Calculate the total variance (Within-imputation + Between-imputation variance).
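Assuming `surv_mat` and `se_mat` are the per-imputation matrices described in Step 2 (survival probabilities and standard errors on a common time grid; the names are illustrative), Rubin's Rules on the complementary log-log scale look like this:

```r
# Transform to the cloglog scale, g(S) = log(-log(S)), where pooling
# is approximately normal; clamp S away from exactly 0 and 1 first
s   <- pmin(pmax(surv_mat, 1e-10), 1 - 1e-10)
cll <- log(-log(s))

# Delta-method SE on the cloglog scale: se_g = se_S / (S * |log(S)|)
cll_se <- se_mat / (s * abs(log(s)))

m    <- ncol(cll)
qbar <- rowMeans(cll)               # pooled point estimate
ubar <- rowMeans(cll_se^2)          # within-imputation variance
bvar <- apply(cll, 1, var)          # between-imputation variance
tvar <- ubar + (1 + 1/m) * bvar     # Rubin's total variance

# Back-transform; note the sign flip: a larger cloglog value
# corresponds to a lower survival probability
pooled_surv <- exp(-exp(qbar))
ci_low  <- exp(-exp(qbar + 1.96 * sqrt(tvar)))
ci_high <- exp(-exp(qbar - 1.96 * sqrt(tvar)))
```

Back-transforming the pooled estimate and its interval with `exp(-exp(.))` is what guarantees the final curve and confidence bands stay between 0 and 1.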
4. Construct the Final Curve
Using survminer's ggsurvplot() or manual ggplot2 construction:
- Plot the pooled survival means as the primary line.
- Plot the confidence intervals derived from the pooled total variance.
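A sketch of the final plot in plain ggplot2, assuming `grid`, `pooled_surv`, `ci_low`, and `ci_high` carry over from the pooling step (these names are illustrative):

```r
library(ggplot2)

pooled_df <- data.frame(time = grid, surv = pooled_surv,
                        lower = ci_low, upper = ci_high)

ggplot(pooled_df, aes(x = time, y = surv)) +
  # Step function, as for a standard Kaplan-Meier curve
  geom_step(linewidth = 0.8) +
  # geom_ribbon interpolates linearly between points; for strictly
  # stepped bands, pammtools::geom_stepribbon() is an alternative
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
  coord_cartesian(ylim = c(0, 1)) +
  labs(x = "Time", y = "Pooled survival probability",
       title = "Pooled Kaplan-Meier curve across imputations")
```

Because the bands come from Rubin's total variance, they will typically be wider than the bands of any single imputed curve, which is the honest representation of uncertainty this method aims for.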
Best Results
| Approach | Statistical Validity | Visual Clarity |
|---|---|---|
| Single Imputation | Low (Underestimates variance) | High (One clean line) |
| Overlaying All Curves | Moderate (Shows spread) | Low (Visually "messy") |
| Pooled Mean (Rubin's) | High (Accounts for uncertainty) | High (Clean, valid line) |
FAQ
Can I use the `pool()` function in mice for survival curves?
The pool() function is designed for mipo objects (models with coefficients). Since a Kaplan-Meier fit is a non-parametric summary and not a model with coefficients, you must manually pool the survival estimates using the mice::complete() long-format data or specialized packages like survMICE.
Should the survival time itself be imputed?
Generally, it is safer to impute predictors and indicators. If survival time is missing, it is often treated as censored at time zero or handled through more complex imputation models. Imputing the "outcome" requires a very strong imputation model to avoid circular reasoning.
How many imputations (m) do I need for a stable curve?
While $m=5$ was historically sufficient, modern standards suggest $m=20$ to $m=50$ for survival curves to ensure the tails of the distribution (where data is sparse) are stabilized and the pooled confidence intervals are reliable.
Disclaimer
Standard errors for pooled survival curves can be sensitive to the number of time points chosen for pooling. This tutorial reflects R programming and survival analysis standards as of March 2026. Always ensure your imputation model is "congenial" with your survival analysis (i.e., it includes all variables used in the final survival model).
Tags: SurvivalAnalysis, MultipleImputation, RStats, KaplanMeier
