Trading speed for power: Full attention matrices, instead of causal masks, outshine causal masking in complex tasks.
ENTP: Encoder-only Next Token Prediction
Trading speed for power: Full attention matrices, instead of causal masks, outshine causal masking in complex tasks.