Synthetic Patient Data for Better ML in Healthcare

Published on 14/11/2025. Reading time: 5 minutes

Generating Synthetic Patient Data to Overcome Machine Learning Limitations in Healthcare Research

Authors: M. Swital, T. Porte, N. Sedmak, C. Bouvard, A. Gougeon, F. Roux, F. Mistretta, A. Lajoinie

Affiliations: RCTs, Lyon, France & Laboratory of Biometry and Evolutionary Biology, UMR 5558, CNRS, University of Lyon 1

Our poster was presented at the ISPOR Europe 2025 congress in Glasgow. This work explores how synthetic patient data can enhance machine learning applications in healthcare, offering promising solutions for data privacy, model robustness, and scalability in research.

Study Objective

Primary Objective

To explore current methods for generating synthetic patient data to enhance the performance of machine learning (ML) models in healthcare research.

Specific Focus

To analyze scenarios where datasets are small or imbalanced, which often limits the robustness of traditional ML approaches.

Methodology

Design: Systematic literature review
Database: MEDLINE search
Period: Studies published since 2020
Selection: 176 studies initially selected → 6 studies included after full-text review

Key Findings

Key Analysis Points

Synthetic data improves ML model robustness
Preserves patient privacy while enriching datasets
Mitigates limitations due to small or biased datasets
Validated techniques: GANs, SMOTE, CTGAN, and Bayesian simulation

Data Sources

Electronic Health Records (n=3)
Clinical Registries (n=2)
Medical Imaging Datasets (n=1)

Identified Techniques

GANs (Generative Adversarial Networks)
SMOTE (Synthetic Minority Over-sampling Technique)
CTGAN (Conditional Tabular GAN)
Bayesian Simulation
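
To make the oversampling idea concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is an illustration under stated assumptions, not the method used in any of the reviewed studies: the function name `smote`, its parameters, and the example values are hypothetical, and all features are assumed to be numeric. Each synthetic row is placed at a random point on the segment between a minority-class row and one of its nearest minority-class neighbors.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors.

    Illustrative sketch only: assumes numeric features and Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base row, excluding the row itself
        others = [row for j, row in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority class: (age, biomarker) pairs, values are made up
minority_rows = [[62.0, 1.8], [58.0, 2.1], [71.0, 1.5], [66.0, 2.4]]
new_rows = smote(minority_rows, n_new=6, k=2, seed=42)
```

Because every synthetic row is a convex combination of two real minority rows, each feature stays within the range observed in the minority class. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would usually be preferable to hand-written code.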

Demonstrated Benefits

• Enriching datasets
• Improving model training
• Enhancing model evaluation
• Preserving patient privacy

Robustness Assessment

• Cross-validation
• Comparison with real-world data
• Sensitivity analyses
• Performance testing
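
As one possible way to carry out the "comparison with real-world data" step, the sketch below computes the two-sample Kolmogorov-Smirnov statistic between a real and a synthetic sample of a single numeric variable. This is an assumption about how such a comparison could be done, not a description of the reviewed studies; the function name `ks_statistic` and the age values are hypothetical. The statistic is the largest gap between the two empirical cumulative distribution functions, so 0 means the samples have identical empirical distributions and 1 means they do not overlap at all.

```python
import bisect


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # Both empirical CDFs only change at observed values, so checking those is enough
    for x in set(real_sorted + synth_sorted):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical patient ages, values are made up for illustration
real_ages = [54, 61, 63, 67, 70, 72]
synthetic_close = [55, 60, 64, 66, 71, 73]
synthetic_shifted = [80, 82, 85, 88, 90, 91]

close_gap = ks_statistic(real_ages, synthetic_close)      # small: distributions are similar
shifted_gap = ks_statistic(real_ages, synthetic_shifted)  # 1.0: the samples do not overlap
```

A check like this only covers one variable at a time; it says nothing about whether correlations between variables are preserved. For routine use, `scipy.stats.ks_2samp` provides the same statistic together with a p-value.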

Conclusion and Perspectives

Synthetic patient data generation is a promising strategy to enhance the reliability and performance of machine learning models in healthcare. It supports privacy-preserving model development and addresses data limitations. However, standardized evaluation frameworks and real-world implementation are essential to fully realize its potential in clinical decision-making and health technology assessment.