Trading speed for power: Full attention matrices, instead of causal masks, outshine causal masking in complex tasks.
Share this post
ENTP: Encoder-only Next Token Prediction
Share this post
Trading speed for power: Full attention matrices, instead of causal masks, outshine causal masking in complex tasks.