This paper teaches LLMs to generate tabular data while preserving real-world feature relationships
📚 https://arxiv.org/abs/2410.21717
🎯 Original Problem:
Existing LLM methods struggle to generate synthetic tabular data that maintains correct correlations between features and target variables, limiting their usefulness in predictive tasks.
-----
🔧 Solution in this Paper:
→ A novel permutation strategy that keeps the target variable at the end of the serialized sequence during fine-tuning, ensuring proper attention flow from features to the target (see the serialization sketch after this list)
→ A feature-conditional sampling approach that conditions generation on individual feature values rather than on the target variable
→ A label-querying mechanism that prompts the fine-tuned LLM to generate labels, instead of relying on an external classifier
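A minimal sketch of the target-last serialization idea, assuming a "column is value" text format; `serialize_row`, the column names, and the example row are hypothetical illustrations, not the paper's code.

```python
import random

def serialize_row(row: dict, target_col: str, rng: random.Random) -> str:
    """Serialize one table row as text for LLM fine-tuning.

    Feature columns are randomly permuted, but the target column is
    always placed last so that, under causal attention, the target token
    can attend to every feature token during training.
    """
    features = [c for c in row if c != target_col]
    rng.shuffle(features)                                # permute features only
    parts = [f"{c} is {row[c]}" for c in features]
    parts.append(f"{target_col} is {row[target_col]}")   # target stays at the end
    return ", ".join(parts)

rng = random.Random(0)
row = {"age": 42, "education": "Bachelors", "hours_per_week": 50, "income": ">50K"}
print(serialize_row(row, target_col="income", rng=rng))
# e.g. "hours_per_week is 50, age is 42, education is Bachelors, income is >50K"
```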
-----
💡 Key Insights:
→ Traditional permutation strategies that shuffle the target along with the features break the attention links between features and targets
→ Conditioning sampling on feature values instead of the target variable improves data quality
→ Generating labels by prompting the fine-tuned LLM outperforms external classifier predictions (see the prompt sketch after this list)
→ The method works well even with reduced training data sizes
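A minimal sketch of the two sampling-time prompts, assuming the same "column is value" serialization as above; `build_feature_condition_prompt`, `build_label_query`, and the example values are hypothetical placeholders, and the call to the fine-tuned model itself is omitted.

```python
def build_feature_condition_prompt(feature: str, value) -> str:
    """Condition generation on a single feature value (not the target);
    the fine-tuned LLM then completes the remaining columns."""
    return f"{feature} is {value},"

def build_label_query(feature_text: str, target_col: str) -> str:
    """Query the label for an already-generated feature string by
    prompting the same fine-tuned LLM instead of an external classifier."""
    return f"{feature_text}, {target_col} is"

print(build_feature_condition_prompt("age", 42))
# -> "age is 42,"  (the LLM continues with the remaining features)

print(build_label_query("age is 42, education is Bachelors, hours_per_week is 50", "income"))
# -> "..., income is"  (the LLM's next tokens are taken as the label)
```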
-----
📊 Results:
→ Outperforms 10 state-of-the-art baselines across 20 datasets
→ Achieves 0.8299 density score and 0.7501 coverage score
→ Only 73% of synthetic samples are detectable vs 92% for baselines
→ Matches real-data performance on half of the benchmark datasets