Discussion about this post

Neural Foundry:

Love the doubly stochastic constraint here. The drop in gradient amplification from 3000x to 1.6x is nuts; it basically turns an unstable mess into something deployable at scale. I've seen gradient spikes kill long runs before, and this Birkhoff polytope approach feels like the right geometric intuition. The fused kernel work is probably what makes this actually usable, though: without it keeping the overhead to 6.7%, the memory wall would eat any quality gains.
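For anyone curious what the constraint looks like concretely, here's a minimal sketch of enforcing doubly stochasticity with log-space Sinkhorn iteration. This is an assumption on my part: the post may use a different projection onto the Birkhoff polytope, and the function name and iteration count below are illustrative, not the post's actual fused kernel.

```python
import torch

def sinkhorn_normalize(logits: torch.Tensor, n_iters: int = 5) -> torch.Tensor:
    # Illustrative sketch, not the post's kernel: alternately normalize
    # rows and columns in log space so the result approaches a doubly
    # stochastic matrix (every row and every column sums to 1).
    log_p = logits
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # columns sum to 1
    return log_p.exp()
```

The intuition for the stability claim: a doubly stochastic matrix has operator norm at most 1 along both dimensions, so repeated attention applications can't blow gradients up the way unconstrained row-stochastic softmax can.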
