"Generating Realistic Tabular Data with Large Language Models"

The podcast for this paper was generated with Google's Illuminate.

This paper teaches LLMs to generate tabular data while preserving real-world feature relationships

📚 https://arxiv.org/abs/2410.21717

🎯 Original Problem:

Existing LLM methods struggle to generate synthetic tabular data that maintains correct correlations between features and target variables, limiting their usefulness in predictive tasks.

-----

🔧 Solution in this Paper:

→ A novel permutation strategy that keeps the target variable at the end of the sequence during fine-tuning, ensuring attention flows properly from features to the target (see the serialization sketch after this list)

→ A feature-conditional sampling approach that conditions generation on individual feature values rather than on the target variable

→ A label-querying mechanism that constructs prompts so the fine-tuned LLM generates the labels itself, instead of relying on an external classifier
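
To make the permutation strategy concrete, below is a minimal sketch of the row-serialization idea: feature columns are shuffled for each training example, but the target column is always placed last so a causal LLM attends from every feature to the target. The `column is value` text encoding and the income-style column names are illustrative assumptions, not the paper's exact format.

```python
import random

def serialize_row(row: dict, target_col: str, rng: random.Random) -> str:
    """Encode one tabular row as text for LLM fine-tuning.

    Feature columns are permuted so the model does not memorize a fixed
    column order, but the target column is always appended last, keeping
    the feature-to-target attention path intact in a causal LLM.
    """
    features = [c for c in row if c != target_col]
    rng.shuffle(features)                    # permute features only
    ordered = features + [target_col]        # target stays at the end
    return ", ".join(f"{c} is {row[c]}" for c in ordered)

rng = random.Random(0)
row = {"age": 39, "education": "Bachelors", "hours_per_week": 40, "income": ">50K"}
print(serialize_row(row, target_col="income", rng=rng))
# e.g. "education is Bachelors, age is 39, hours_per_week is 40, income is >50K"
```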

-----

💡 Key Insights:

→ Random feature permutations used by earlier methods break the attention links from features to targets

→ Using features as sampling conditions instead of target variables improves data quality

→ Generating labels by prompting the fine-tuned LLM outperforms predictions from external classifiers (see the sampling sketch after this list)

→ The method works well even with reduced training data sizes
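
As a rough illustration of feature-conditional sampling and label querying at generation time, the sketch below uses a Hugging Face causal LM. The `gpt2` checkpoint is a placeholder for the paper's fine-tuned model, and the prompt texts and column names are assumptions for illustration, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def complete(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Let the (fine-tuned) causal LM continue a textual prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Placeholder checkpoint; in practice this would be the LLM fine-tuned on
# serialized rows like the ones shown above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1) Feature-conditional sampling: seed generation with a single feature
#    value drawn from the real data's marginal, not with the target.
row_text = complete(model, tokenizer, "age is 39,")

# 2) Label querying: keep only the feature part of the generated row, then
#    prompt the same model for the target value (no external classifier).
features_text = row_text.split(", income is")[0]
label_text = complete(model, tokenizer, features_text + ", income is",
                      max_new_tokens=5)
print(label_text)
```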

-----

📊 Results:

→ Outperforms 10 state-of-the-art baselines across 20 datasets

→ Achieves a density score of 0.8299 and a coverage score of 0.7501

→ Only 73% of its synthetic samples are detectable as synthetic, versus 92% for baselines

→ Matches the performance of real data on half of the benchmark datasets
