
"LongKey: Keyphrase Extraction for Long Documents"

The podcast on this paper is generated with Google's Illuminate.

A keyphrase extractor that actually reads your whole document, not just the first page

LongKey is a novel framework for extracting keyphrases from documents that exceed the traditional 512-token limit of most encoder models. It uses an enhanced encoder capable of processing up to 96K tokens and introduces a max-pooling embedder for better context-aware keyphrase extraction.

-----

https://arxiv.org/abs/2411.17863

🔍 Original Problem:

Most keyphrase extraction methods handle only short documents (up to 512 tokens). This limitation leaves a significant gap for longer documents such as research papers, legal contracts, and technical reports.

-----

🛠️ Solution in this Paper:

→ LongKey extends token support up to 96K tokens using a modified Longformer architecture with expanded positional embeddings.

→ It implements a max-pooling embedder that combines multiple occurrences of keyphrases into single, comprehensive representations.
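A minimal sketch of that max-pooling step, using toy vectors (the function name `max_pool_keyphrase` is illustrative, not the paper's API): every occurrence of a candidate keyphrase contributes an embedding, and an element-wise max collapses them into one representation.

```python
from typing import List

def max_pool_keyphrase(occurrences: List[List[float]]) -> List[float]:
    """Collapse all occurrence embeddings of one candidate keyphrase
    into a single vector via element-wise max pooling."""
    dim = len(occurrences[0])
    return [max(vec[d] for vec in occurrences) for d in range(dim)]

# Three occurrences of the same candidate phrase across a document.
occ = [[0.1, 0.9, -0.2], [0.4, 0.3, 0.5], [0.2, 0.1, 0.0]]
pooled = max_pool_keyphrase(occ)  # → [0.4, 0.9, 0.5]
```

Because the max is taken per dimension, the pooled vector keeps the strongest signal each occurrence provides, rather than scoring occurrences in isolation.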

→ The system processes documents in chunks of 8,192 tokens, concatenating embeddings to maintain context across the entire text.
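The chunk-and-concatenate step can be sketched as follows. The toy `encoder` here is a stand-in for the paper's Longformer-based model; the point is only that per-token embeddings from each 8,192-token chunk are concatenated so the full document is represented.

```python
from typing import Callable, List

def embed_long_document(
    token_ids: List[int],
    encoder: Callable[[List[int]], List[List[float]]],
    chunk_size: int = 8192,
) -> List[List[float]]:
    """Encode a long document chunk by chunk, then concatenate the
    per-token embeddings so downstream layers see the whole text."""
    embeddings: List[List[float]] = []
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        embeddings.extend(encoder(chunk))  # one embedding per token
    return embeddings

# Toy encoder: maps each token id to a 2-d "embedding".
toy_encoder = lambda chunk: [[float(t), float(t) * 0.5] for t in chunk]
doc = list(range(20000))  # a 20K-token document, i.e. three chunks
emb = embed_long_document(doc, toy_encoder, chunk_size=8192)
```

A real implementation would batch the chunks and run them through the expanded-position-embedding Longformer, but the concatenation logic is the same.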

→ A convolutional network generates embeddings for n-gram keyphrases up to length 5.
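As a simplified stand-in for that convolutional embedder, the sketch below builds one embedding per candidate span of 1 to 5 tokens by pooling the span's token embeddings element-wise (the paper uses a learned convolution per n-gram size; max pooling here is an assumption made for brevity).

```python
from typing import Dict, List, Tuple

def ngram_embeddings(
    token_embs: List[List[float]], max_n: int = 5
) -> Dict[Tuple[int, int], List[float]]:
    """Build one embedding per candidate n-gram span (start, length),
    for lengths 1..max_n, by element-wise max over the span's tokens."""
    spans: Dict[Tuple[int, int], List[float]] = {}
    dim = len(token_embs[0])
    for n in range(1, max_n + 1):
        for i in range(len(token_embs) - n + 1):
            window = token_embs[i:i + n]
            spans[(i, n)] = [max(v[d] for v in window) for d in range(dim)]
    return spans

# Three token embeddings; spans of length 1 and 2 are generated.
embs = [[0.0, 1.0], [2.0, 0.5], [1.0, 3.0]]
spans = ngram_embeddings(embs, max_n=2)  # 3 unigrams + 2 bigrams
```

Each span embedding can then be matched against the max-pooled keyphrase representations for scoring.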

-----

💡 Key Insights:

→ Document chunking with proper embedding concatenation can effectively handle long documents

→ Max-pooling across keyphrase occurrences captures better context than individual instance scoring

→ The trained model transfers zero-shot to unseen domains and document types, without domain-specific fine-tuning

-----

📊 Results:

→ Achieved 39.55% F1@5 score on LDKP3K dataset, outperforming existing methods

→ Reached 41.81% F1@5 on the larger LDKP10K dataset

→ Demonstrated superior performance across 6 different domains without additional training
