Research Notes

Independent Study: TabDiff

Research-backed engineering for synthetic tabular data generation

Stack: Python • NumPy • Pandas • Jupyter • scikit-learn

Problem

Real-world tabular datasets are often limited by class imbalance, sparsity, and privacy constraints, which makes reliable experimentation difficult. This is especially true for churn prediction, where the minority class has few examples to learn from.

Additionally, naïve resampling or duplication can cause overfitting, and when applied before the train/test split it leaks copies of training rows into the test set, which undermines the validity of downstream models.

Solution

Designed and implemented a TabDiff-inspired synthetic data pipeline, leveraging recent research in generative modeling for tabular data (e.g., diffusion models, CTGAN, TVAE). The system generates synthetic records and evaluates them using both:

  • Statistical fidelity (distributional similarity to real data)
  • Model utility (impact on downstream predictive performance)
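
A minimal sketch of what a per-column fidelity check could look like, assuming numeric columns are compared with the two-sample Kolmogorov-Smirnov statistic and categorical columns with total variation distance. Both function names are illustrative and are not taken from the project code; the actual fidelity metrics used in the study may differ.

```python
import numpy as np


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real, synth])
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synth = np.searchsorted(synth, grid, side="right") / synth.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def total_variation(real, synth):
    """Total variation distance between the category frequencies of two columns."""
    real = np.asarray(real)
    synth = np.asarray(synth)
    categories = np.union1d(real, synth)
    p = np.array([np.mean(real == c) for c in categories])
    q = np.array([np.mean(synth == c) for c in categories])
    return float(0.5 * np.sum(np.abs(p - q)))
```

Both scores lie in [0, 1], with 0 meaning the two columns have identical marginal distributions. They only capture per-column similarity, so cross-column structure (correlations, joint modes) needs a separate check.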

Benchmarked synthetic datasets against real-data baselines on downstream classification metrics (AUC, PR-AUC, F1, and recall), with a focus on improving minority-class performance.
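
The benchmark can be sketched as a train-on-synthetic, test-on-real comparison. This is a hedged illustration, not the study's code: the dataset comes from scikit-learn's make_classification, the "synthetic" training set is a noise-perturbed stand-in rather than generator output, the classifier is an arbitrary choice, and evaluate_on_real is a hypothetical helper name. PR-AUC is estimated here with average precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    f1_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate_on_real(X_train, y_train, X_test, y_test):
    """Fit on the given training set and score on the held-out real test set."""
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    preds = model.predict(X_test)
    return {
        "auc": roc_auc_score(y_test, scores),
        "pr_auc": average_precision_score(y_test, scores),
        "f1": f1_score(y_test, preds),
        "recall": recall_score(y_test, preds),
    }


# Imbalanced toy data (about 10% positives) standing in for a churn dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stand-in for generated data: real training rows plus small Gaussian noise.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_synth = y_train.copy()

results = {
    "real": evaluate_on_real(X_train, y_train, X_test, y_test),
    "synthetic": evaluate_on_real(X_synth, y_synth, X_test, y_test),
}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Both models are scored on the same real test set, so any gap between the two rows reflects the training data rather than the evaluation data.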

What I Did

  • Designed the experimental framework, including train/test splits and leakage-safe evaluation pipelines
  • Implemented data preprocessing and feature handling for mixed-type tabular data
  • Built scripts to compare real vs synthetic distributions and visualize embedding behavior (PCA / UMAP)
  • Evaluated model performance across datasets to quantify the utility–fidelity tradeoff
  • Documented findings and engineering decisions in a structured research report (ACM-style)
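
A leakage-safe setup for mixed-type data could look like the sketch below: split first, then fit all preprocessing on the training fold only. The column names and the toy DataFrame are invented for illustration and do not come from the study's dataset; in this arrangement, a generator would also be fitted on train_df only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type table (hypothetical columns) with an imbalanced target.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n),
    "monthly_charge": rng.normal(70.0, 20.0, size=n),
    "plan": rng.choice(["basic", "plus", "premium"], size=n),
    "churned": rng.choice([0, 1], size=n, p=[0.85, 0.15]),
})
numeric_cols = ["tenure_months", "monthly_charge"]
categorical_cols = ["plan"]

# 1. Split before anything else so the test rows never influence training.
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=df["churned"], random_state=0
)

# 2. Fit scaling and encoding on the training fold only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_train = preprocess.fit_transform(train_df)

# 3. Apply the already-fitted transform to the test fold.
X_test = preprocess.transform(test_df)
```

Any resampling or synthetic augmentation would then be applied to the training fold alone, leaving test_df as untouched real data.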

Challenges

  • Preventing data leakage while preserving meaningful statistical structure in generated data
  • Balancing fidelity vs. utility: highly realistic data does not always improve model performance
  • Handling mode collapse and instability in generative approaches for tabular domains
  • Ensuring synthetic data remained useful for experimentation without overfitting to training artifacts
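
Two simple diagnostics can make the last two challenges measurable. The sketch below assumes category coverage as a rough mode-collapse signal and distance to the closest training record as a rough memorization signal; both helper names are hypothetical, and the study may have used different checks.

```python
import numpy as np


def category_coverage(real, synth):
    """Fraction of real categories that appear at least once in the synthetic column."""
    real_cats = set(np.asarray(real).tolist())
    synth_cats = set(np.asarray(synth).tolist())
    return len(real_cats & synth_cats) / len(real_cats)


def distance_to_closest_record(synth, train):
    """Euclidean distance from each synthetic row to its nearest training row."""
    synth = np.asarray(synth, dtype=float)
    train = np.asarray(train, dtype=float)
    # Pairwise differences with shape (n_synth, n_train, n_features).
    diffs = synth[:, None, :] - train[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
```

Coverage well below 1.0 suggests the generator is dropping rare categories, while many near-zero distances suggest it is copying training rows. The distance check assumes numeric, comparably scaled features, and the pairwise computation uses memory proportional to n_synth × n_train × n_features, so large tables would need batching.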

Key Insight

Synthetic data is not inherently valuable: its effectiveness depends on how well it improves downstream model behavior, not just on how closely it matches the original distribution.