Problem
Real-world tabular datasets are often limited by class imbalance, sparsity, and privacy constraints, which makes reliable experimentation difficult, especially for churn prediction, where the class of interest (churners) is typically the underrepresented one.
Additionally, naïve resampling or duplication can cause overfitting and, when applied before the train/test split, data leakage, both of which reduce the validity of downstream models.
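To make the leakage risk concrete, here is a minimal sketch of the safe ordering: hold out the test set first, then oversample only the training portion. The function name, the `churned` label column, and the toy data are hypothetical and not part of the original pipeline; the sketch also assumes label 1 is the minority class.

```python
import random

def split_then_oversample(rows, label_key="churned", test_frac=0.25, seed=0):
    """Hold out a stratified test set, then duplicate minority rows in train only."""
    rng = random.Random(seed)
    train, test = [], []
    # Stratified split: hold out test_frac of each class before any resampling.
    for label in (0, 1):
        group = [r for r in rows if r[label_key] == label]
        rng.shuffle(group)
        n_test = int(len(group) * test_frac)
        test += group[:n_test]
        train += group[n_test:]

    minority = [r for r in train if r[label_key] == 1]
    majority = [r for r in train if r[label_key] == 0]
    # Sample minority rows with replacement until the training classes balance.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra, test

# Toy data: 100 customers, 10% churners (hypothetical schema).
rows = [{"id": i, "churned": int(i % 10 == 0)} for i in range(100)]
train, test = split_then_oversample(rows)
```

Because duplication happens after the split, no copy of a test row can end up in the training set. Reversing the order would let the model see duplicates of rows it is later evaluated on.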
Solution
Designed and implemented a TabDiff-inspired synthetic data pipeline, leveraging recent research in generative modeling for tabular data (e.g., diffusion models, CTGAN, TVAE). The system generates synthetic records and evaluates them using both:
- Statistical fidelity (distributional similarity to real data)
- Model utility (impact on downstream predictive performance)
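As an illustration of the statistical fidelity check, the sketch below compares one real column with its synthetic counterpart using two common per-column measures: the two-sample Kolmogorov-Smirnov statistic for numeric columns and total variation distance for categorical ones. These specific measures and function names are assumptions chosen for illustration; the source does not say which fidelity metrics the pipeline uses.

```python
from bisect import bisect_right
from collections import Counter

def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    # The empirical CDFs only change at observed values, so check each one.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap

def total_variation(real, synth):
    """Total variation distance between two categorical frequency distributions."""
    p, q = Counter(real), Counter(synth)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / len(real) - q[c] / len(synth)) for c in categories)

# Hypothetical columns: tenure in months (numeric) and plan type (categorical).
ks = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
tvd = total_variation(["a", "a", "b", "b"], ["a", "a", "a", "b"])
```

Both measures range from 0 (identical distributions) to 1 (no overlap), so lower values indicate higher fidelity. They only compare marginal distributions; checking relationships between columns would need additional measures.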
Benchmarked synthetic datasets against real baselines using classification metrics such as ROC-AUC, PR-AUC, F1, and recall, with a focus on improving minority-class performance.
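For reference, the metrics named above can be computed from labels and model scores as in the following dependency-free sketch. The function names and toy values are hypothetical. PR-AUC is approximated here by average precision, and tied scores are not grouped, so results may differ slightly from library implementations when ties occur.

```python
def recall_precision_f1(y_true, y_pred):
    """Recall, precision, and F1 for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

def average_precision(y_true, scores):
    """PR-AUC estimate: mean of the precision at each true-positive rank."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def roc_auc(y_true, scores):
    """ROC-AUC as the chance that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and churn scores; predictions use a 0.5 threshold.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]
recall, precision, f1 = recall_precision_f1(y_true, y_pred)
pr_auc = average_precision(y_true, scores)
auc = roc_auc(y_true, scores)
```

Recall, F1, and PR-AUC are all computed with respect to the positive class, which is why they are more informative than accuracy or ROC-AUC alone when churners are rare.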