Research Notes

Independent Study: TabDiff

Research-backed engineering for synthetic tabular data generation

Stack: Python • NumPy • Pandas • Jupyter • scikit-learn

Problem

Real-world tabular datasets are often limited by class imbalance, sparsity, and privacy constraints, which makes reliable experimentation difficult, especially for churn prediction tasks where the minority class is underrepresented.

Naïve resampling or duplication can also cause overfitting and, when applied before the train/test split, leak copies of the same record into both sets. That inflates evaluation scores and reduces the validity of downstream models.
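
One way to avoid the duplication leak is to order the steps: split first, then resample only the training fold. The sketch below is a minimal illustration on toy data, not the project's actual pipeline; the function name `split_then_oversample` and the toy arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_then_oversample(X, y, test_size=0.25, seed=0):
    """Split first, then duplicate minority rows in the training fold only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y_train, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y_train == cls)
        # Sample extra rows (with replacement) until the class reaches `target`.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X_train[idx], y_train[idx], X_test, y_test


# Toy data (hypothetical): 200 rows, 10% positives standing in for churners.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = np.zeros(200, dtype=int)
y[:20] = 1

X_bal, y_bal, X_test, y_test = split_then_oversample(X, y)
```

Because the test fold is carved out before any duplication, no test row can reappear in the balanced training set, and the test fold keeps the original class ratio.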

Solution

Designed and implemented a TabDiff-inspired synthetic data pipeline, drawing on recent research in generative modeling for tabular data (e.g., diffusion models, CTGAN, TVAE). The system generates synthetic records and evaluates them on two criteria:

Statistical fidelity (distributional similarity to real data)
Model utility (impact on downstream predictive performance)
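
Statistical fidelity can be scored per column. The sketch below assumes two common choices that the write-up does not name explicitly: the two-sample Kolmogorov-Smirnov statistic for numeric columns and total variation distance for categorical ones. Both range from 0 (identical distributions) to 1 (no overlap). The function names are illustrative.

```python
import numpy as np


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (numeric columns)."""
    a = np.sort(np.asarray(real, dtype=float))
    b = np.sort(np.asarray(synthetic, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def total_variation(real, synthetic):
    """Half the L1 distance between category frequencies (categorical columns)."""
    a = np.asarray(real)
    b = np.asarray(synthetic)
    categories = np.union1d(a, b)
    p = np.array([(a == c).mean() for c in categories])
    q = np.array([(b == c).mean() for c in categories])
    return 0.5 * float(np.abs(p - q).sum())


# Toy columns (hypothetical values).
ks_same = ks_statistic([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
ks_disjoint = ks_statistic([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
tv_half = total_variation(["a", "a", "b", "b"], ["a", "a", "a", "a"])
```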

Benchmarked synthetic datasets against real baselines using classification metrics such as AUC, PR-AUC, F1, and recall, with a focus on improving minority-class performance.
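
The benchmark described above can be sketched as a "train on synthetic, test on real" comparison against a real-data baseline. This is an illustrative setup, not the project's code: the data comes from `make_classification`, the classifier is logistic regression, and the "synthetic" set is only a noisy copy of the training data standing in for generator output. PR-AUC is approximated here with scikit-learn's average precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Score a fitted classifier on the held-out real test set."""
    proba = model.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)
    return {
        "auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "f1": f1_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
    }


# Imbalanced toy problem: roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Placeholder for generator output (assumption): real training rows plus noise.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.3, size=X_train.shape)
y_synth = y_train.copy()

real_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
synth_model = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)

# Both models are scored on the same real test rows.
results = {
    "train_real": evaluate(real_model, X_test, y_test),
    "train_synthetic": evaluate(synth_model, X_test, y_test),
}
```

Scoring both models on the same untouched real test set keeps the comparison fair: any difference comes from the training data, not from the evaluation rows.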

What I Did

Designed the experimental framework, including train/test splits and leakage-safe evaluation pipelines
Implemented data preprocessing and feature handling for mixed-type tabular data
Built scripts to compare real vs synthetic distributions and visualize embedding behavior (PCA / UMAP)
Evaluated model performance across datasets to quantify the utility–fidelity tradeoff
Documented findings and engineering decisions in a structured research report (ACM-style)
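
The preprocessing and embedding comparison steps above could look roughly like the following. It is a sketch under assumptions: the column names and toy tables are invented, and only PCA is shown (UMAP needs a separate package). The encoder and projection are fit on real data only, and the synthetic rows are projected into the same space.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)


def toy_table(n, shift=0.0):
    """Hypothetical mixed-type churn table (two numeric, one categorical column)."""
    return pd.DataFrame({
        "tenure_months": rng.integers(1, 72, size=n).astype(float) + shift,
        "monthly_charge": rng.normal(70.0, 20.0, size=n),
        "contract": rng.choice(["monthly", "annual", "two_year"], size=n),
    })


real = toy_table(300)
synthetic = toy_table(300, shift=2.0)  # stand-in for generator output

encoder = ColumnTransformer(
    [
        ("num", StandardScaler(), ["tenure_months", "monthly_charge"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
    ],
    sparse_threshold=0.0,  # always return a dense array for PCA
)
projector = make_pipeline(encoder, PCA(n_components=2))

real_2d = projector.fit_transform(real)      # fit on real rows only
synthetic_2d = projector.transform(synthetic)  # reuse the fitted projection

# A crude summary of how far apart the two point clouds sit.
centroid_gap = float(
    np.linalg.norm(real_2d.mean(axis=0) - synthetic_2d.mean(axis=0))
)
```

The two 2-D arrays can then be plotted on shared axes to see where the synthetic rows fall relative to the real ones.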

Challenges

Preventing data leakage while preserving meaningful statistical structure in generated data
Balancing fidelity against utility: highly realistic data does not always improve model performance
Handling mode collapse and instability in generative approaches for tabular domains
Ensuring synthetic data remained useful for experimentation without overfitting to training artifacts
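
Two of these challenges lend themselves to simple diagnostics. The sketch below assumes checks that the write-up does not specify: distance to the closest training record, where near-zero values suggest memorized rows, and category coverage, where missing categories hint at mode collapse. It uses brute-force NumPy broadcasting, which is only practical for small tables. All names and data are illustrative.

```python
import numpy as np


def distance_to_closest_record(train, synthetic):
    """Euclidean distance from each synthetic row to its nearest training row."""
    diffs = synthetic[:, None, :] - train[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)


def category_coverage(real_col, synth_col):
    """Fraction of real categories that appear at least once in the synthetic column."""
    real_cats = set(real_col)
    return len(real_cats & set(synth_col)) / len(real_cats)


# Toy data (hypothetical): 5 exact copies of training rows plus 5 shifted rows.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 3))
synthetic = np.vstack([train[:5], train[5:10] + 5.0])

dcr = distance_to_closest_record(train, synthetic)
memorized = dcr < 1e-9  # flags rows that coincide with a training record

coverage = category_coverage(["a", "b", "c", "a"], ["a", "a", "b"])
```

For larger tables, a nearest-neighbor index would be a more memory-friendly way to compute the same distances.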

Key Insight

Synthetic data is not inherently valuable. Its effectiveness depends on how much it improves downstream model behavior, not just on how closely it matches the original distribution.
