Clean data beats complex architecture for better multilingual embeddings
KaLM-Embedding achieves state-of-the-art performance in multilingual text embedding by focusing on superior training data quality and innovative data processing techniques.
-----
https://arxiv.org/abs/2501.01028
🤔 Original Problem:
→ Current embedding models often overlook training data quality, leading to suboptimal performance in multilingual retrieval tasks
→ False negatives in fine-tuning data introduce noise in representation learning
-----
🛠️ Solution in this Paper:
→ Builds on Qwen2-0.5B as the base architecture, with mean pooling and a 512-token limit (a mean-pooling sketch follows this list)
→ Implements persona-based synthetic data generation using LLMs to create diverse training examples
→ Uses ranking consistency filtering to drop likely false negatives and other noisy samples (filtering sketch below)
→ Applies semi-homogeneous task batch sampling to balance training efficacy and efficiency (sampler sketch below)
→ Incorporates over 20 categories of pre-training data and 70 categories of fine-tuning data
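
Here is a minimal sketch of the mean-pooling setup from the first bullet, assuming the public Qwen/Qwen2-0.5B checkpoint and standard Hugging Face APIs; the released KaLM-Embedding weights and exact pooling details may differ:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B"  # base architecture named in the paper
MAX_LEN = 512                   # token limit mentioned above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Mean-pool the last hidden states over non-padding tokens, then L2-normalize."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=MAX_LEN, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return torch.nn.functional.normalize(pooled, dim=-1)

q, d = embed(["what is mean pooling?", "Mean pooling averages token hidden states."])
print(float(q @ d))  # cosine similarity of the two normalized embeddings
```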
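A hedged sketch of the ranking consistency filtering step as I read it: score each (query, positive, negatives) triple with an off-the-shelf embedder and keep it only if the labeled positive ranks at the top of its own candidate list. The scoring model and top_k threshold here are illustrative assumptions, not the paper's exact settings:

```python
from sentence_transformers import SentenceTransformer

scorer = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative off-the-shelf embedder

def keep_sample(query, positive, negatives, top_k=1):
    """Keep the triple only if the labeled positive ranks within top_k of all candidates."""
    texts = [query, positive] + list(negatives)
    vecs = scorer.encode(texts, normalize_embeddings=True)
    sims = vecs[1:] @ vecs[0]                 # similarity of each candidate to the query
    beaten_by = int((sims > sims[0]).sum())   # candidates scoring above the positive
    return beaten_by < top_k

sample = {
    "query": "capital of France",
    "positive": "Paris is the capital of France.",
    "negatives": ["Berlin is the capital of Germany.", "France borders Spain."],
}
print(keep_sample(**sample))  # True -> keep; False -> likely noisy or a false negative
```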
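And one plausible reading of semi-homogeneous task batch sampling: most batches come from a single task so in-batch negatives stay on-topic, while the rest mix tasks for diversity. The probability and mixing rule below are assumptions for illustration, not the paper's exact procedure:

```python
import random

def sample_batch(task_data: dict, batch_size: int, homogeneous_p: float = 0.8):
    """task_data maps task name -> list of training samples (hypothetical structure)."""
    if random.random() < homogeneous_p:
        # Homogeneous batch: all samples share one task, keeping in-batch negatives meaningful.
        task = random.choice(list(task_data))
        return [random.choice(task_data[task]) for _ in range(batch_size)]
    # Mixed batch: draw from the pooled tasks for diversity.
    pool = [s for samples in task_data.values() for s in samples]
    return [random.choice(pool) for _ in range(batch_size)]

print(sample_batch({"retrieval": ["q1", "q2"], "sts": ["p1", "p2"]}, batch_size=4))
```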
-----
💡 Key Insights:
→ Data quality proves more crucial than model architecture
→ Prepending task instructions to queries significantly enhances performance (see the formatting example after this list)
→ Pre-training contributes less to final performance than previous studies suggest
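
On the task-instruction point, the usual pattern is to prepend an instruction string to the query at encode time; the template below is illustrative, not the paper's exact wording:

```python
def with_instruction(task_description: str, query: str) -> str:
    # Hypothetical template; the paper's exact instruction format is not quoted here.
    return f"Instruct: {task_description}\nQuery: {query}"

print(with_instruction(
    "Given a web search query, retrieve relevant passages that answer it",
    "how does mean pooling work?",
))
```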
-----
📊 Results:
→ Outperforms other models under 1B parameters on the MTEB benchmark
→ Achieves 64.13% on Chinese MTEB and 64.94% on English MTEB
→ Shows strong multilingual capabilities despite primary training on Chinese and English
-----
Are you into AI and LLMs❓ Join me on X/Twitter with 52K+ others to stay on the bleeding edge of AI every day.
𝕏/🐦 https://x.com/rohanpaul_ai