Making code generation cheaper by cutting the fluff from documentation, while keeping the good stuff
ShortenDoc, proposed in this paper, trims DocStrings intelligently, keeping only the parts most important for code generation
https://arxiv.org/abs/2410.22793
🤖 Original Problem:
DocStrings in code-generation prompts often contain redundant information, which inflates computational cost and token usage when calling third-party LLM APIs. Existing prompt compression methods achieve only about a 10% reduction before code-generation performance degrades.
-----
🛠️ Solution in this Paper:
→ ShortenDoc dynamically compresses DocStrings by analyzing token importance rather than using fixed ratios
→ It first decomposes prompts into signature and docstring components
→ Then preprocesses the docstring to improve its quality, guided by empirical findings
→ Assigns an importance score to each token and ranks the tokens accordingly
→ Builds a search space of compression-eligible tokens under predefined constraints
→ Iteratively removes the least important tokens until compression and generation quality reach the best balance (see the sketch below)
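To make the loop concrete, here is a minimal Python sketch of importance-based token removal. It is an illustration only, not the paper's implementation: the whitespace tokenizer, the toy token_importance heuristic, the FILLER word list, and the fixed 40% removal cap are assumptions standing in for the paper's actual tokenizer, scoring model, and dynamic stopping criterion.

```python
# Minimal sketch of importance-based docstring compression (illustration only).
# Assumptions, not from the paper: whitespace tokenization, a toy importance
# heuristic, and a fixed removal cap stand in for the paper's tokenizer,
# scoring model, and dynamic stopping criterion.

FILLER = {"the", "a", "an", "of", "to", "in", "that", "this", "is", "are"}

def token_importance(token: str) -> float:
    """Toy score: code-like tokens rank high, filler words rank low."""
    if token.lower() in FILLER:
        return 0.1
    if any(c in token for c in "()_[]."):
        return 1.0   # likely identifiers, calls, or code references
    return 0.5       # ordinary descriptive words

def shorten_docstring(docstring: str, max_removal_ratio: float = 0.4) -> str:
    tokens = docstring.split()                     # stand-in tokenizer
    # Rank token positions from least to most important.
    ranked = sorted(range(len(tokens)), key=lambda i: token_importance(tokens[i]))
    budget = int(len(tokens) * max_removal_ratio)  # predefined constraint on removals
    removable = set(ranked[:budget])
    # Keep the surviving tokens in their original order.
    return " ".join(t for i, t in enumerate(tokens) if i not in removable)

if __name__ == "__main__":
    doc = "Return the sum of all even numbers in the given list nums."
    print(shorten_docstring(doc))
    # -> "Return sum all even numbers given list nums."
```

In the actual method, token importance and the stopping point are determined dynamically rather than by the fixed heuristics used here.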
-----
💡 Key Insights:
→ DocStrings carry redundant information: up to 40% of their tokens can be removed without significant performance loss
→ Function signatures alone contain enough semantic information to guide code generation
→ Token importance varies significantly: articles and prepositions matter far less than code-specific tokens
→ Dynamic compression outperforms fixed-ratio methods
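As a hand-made illustration of these insights (not taken from the paper's evaluation), compare a verbose HumanEval-style prompt with a compressed version in which articles, prepositions, and restated context are dropped while the signature, code-specific tokens, and the doctest survive:

```python
# Hypothetical before/after prompt (illustration only). The compressed
# docstring drops filler and restated context but keeps the signature,
# key nouns, and the doctest example intact.

ORIGINAL_PROMPT = '''
def rolling_max(numbers: list[int]) -> list[int]:
    """From a given list of integers, generate a list of the rolling maximum
    element found until the given moment in the sequence of the list.
    >>> rolling_max([1, 2, 3, 2, 3, 4, 2])
    [1, 2, 3, 3, 3, 4, 4]
    """
'''

COMPRESSED_PROMPT = '''
def rolling_max(numbers: list[int]) -> list[int]:
    """Generate list of rolling maximum element found until given moment.
    >>> rolling_max([1, 2, 3, 2, 3, 4, 2])
    [1, 2, 3, 3, 3, 4, 4]
    """
'''

# Rough word counts show the reduction in the docstring portion.
print(len(ORIGINAL_PROMPT.split()), "->", len(COMPRESSED_PROMPT.split()))
```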
-----
📊 Results:
→ Achieves 25-40% compression while maintaining code generation quality
→ Tested across 6 datasets and 6 LLMs (1B to 10B parameters)
→ Outperforms baseline methods like Selective_Context and LLMLingua2
→ Works effectively across multiple programming languages