A keyphrase extractor that actually reads your whole document, not just the first page
LongKey is a novel framework for extracting keyphrases from documents that exceed the traditional 512-token limit. It pairs an enhanced encoder capable of processing up to 96K tokens with a max-pooling embedder that builds context-aware keyphrase representations.
-----
https://arxiv.org/abs/2411.17863
🔍 Original Problem:
Most keyphrase extraction methods handle only short documents (up to 512 tokens). This leaves longer texts such as research papers, legal documents, and technical reports poorly served.
-----
🛠️ Solution in this Paper:
→ LongKey extends token support up to 96K tokens using a modified Longformer architecture with expanded positional embeddings.
→ It implements a max-pooling embedder that combines multiple occurrences of keyphrases into single, comprehensive representations.
→ The system processes documents in chunks of 8,192 tokens, concatenating embeddings to maintain context across the entire text.
→ A convolutional network generates embeddings for n-gram keyphrases up to length 5.
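The chunk-and-concatenate pipeline above can be sketched in a toy form. All sizes here are illustrative stand-ins (a chunk of 8 tokens instead of 8,192, an identity function instead of the Longformer encoder, a sliding-window mean instead of the paper's learned convolutional embedder); the function names are hypothetical, not LongKey's API.

```python
import numpy as np

# Toy sketch of LongKey-style long-document processing:
# 1) split the token embeddings into fixed-size chunks,
# 2) "encode" each chunk (a real system runs a Longformer here),
# 3) concatenate chunk outputs back into one full-document sequence,
# 4) form candidate n-gram embeddings (n = 1..5) with a sliding window,
#    standing in for the paper's convolutional n-gram embedder.

CHUNK = 8   # stands in for the real 8,192-token chunk size
MAX_N = 5   # keyphrase candidates up to 5-grams, as in the paper
DIM = 4     # embedding dimension (illustrative)

def encode_chunk(chunk: np.ndarray) -> np.ndarray:
    """Placeholder encoder: identity. A real system runs Longformer here."""
    return chunk

def embed_document(tokens: np.ndarray) -> np.ndarray:
    """Encode a long document chunk by chunk, then concatenate the outputs
    so downstream scoring sees one contiguous sequence."""
    chunks = [tokens[i:i + CHUNK] for i in range(0, len(tokens), CHUNK)]
    return np.concatenate([encode_chunk(c) for c in chunks], axis=0)

def ngram_embeddings(token_embs: np.ndarray, n: int) -> np.ndarray:
    """Mean over a sliding window of n tokens; a stand-in for a learned conv."""
    T = len(token_embs)
    return np.stack([token_embs[i:i + n].mean(axis=0) for i in range(T - n + 1)])

tokens = np.random.default_rng(0).normal(size=(20, DIM))   # 20-token "document"
doc_embs = embed_document(tokens)                          # shape (20, DIM)
candidates = {n: ngram_embeddings(doc_embs, n) for n in range(1, MAX_N + 1)}
print({n: c.shape for n, c in candidates.items()})
```

The key point the sketch preserves: chunking happens only at the encoder boundary, while candidate extraction runs over the concatenated sequence, so n-grams spanning chunk borders are still represented.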
-----
💡 Key Insights:
→ Document chunking with proper embedding concatenation can effectively handle long documents
→ Max-pooling across keyphrase occurrences captures better context than individual instance scoring
→ The model generalizes zero-shot to new domains and document types without additional training
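The second insight, pooling across occurrences, can be illustrated in a few lines. This is a minimal sketch: the elementwise max over occurrence embeddings follows the paper's idea, but the dot-product scorer and all names are illustrative assumptions, not LongKey's actual ranking head.

```python
import numpy as np

# Max-pooling embedder sketch: a candidate keyphrase that occurs several
# times in the document gets ONE representation, the elementwise max over
# all of its occurrence embeddings, which is then scored once; this
# contrasts with scoring each occurrence independently.

def keyphrase_embedding(occurrences: np.ndarray) -> np.ndarray:
    """Elementwise max over all occurrence embeddings of one candidate."""
    return occurrences.max(axis=0)

def score(phrase_emb: np.ndarray, scorer: np.ndarray) -> float:
    """Toy linear scorer (illustrative stand-in for a learned head)."""
    return float(phrase_emb @ scorer)

rng = np.random.default_rng(1)
occ = rng.normal(size=(3, 4))     # 3 occurrences of one candidate, dim 4
scorer = rng.normal(size=4)

pooled = keyphrase_embedding(occ)  # one vector per candidate phrase
print(score(pooled, scorer))       # single max-pooled score
print(np.mean([score(o, scorer) for o in occ]))  # per-occurrence baseline
```

Because the max is taken per dimension, the pooled vector can keep the strongest contextual signal from each occurrence, which is the intuition behind why it beats scoring instances individually.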
-----
📊 Results:
→ Achieved 39.55% F1@5 on the LDKP3K dataset, outperforming existing methods
→ Reached 41.81% F1@5 on the LDKP10K dataset
→ Demonstrated superior performance across 6 different domains without additional training