OMCAT, proposed in this paper teaches multi-modal model to understand when things happen in videos by linking sounds and visuals together
Share this post
OMCAT: Omni Context Aware Transformer
Share this post
OMCAT, proposed in this paper teaches multi-modal model to understand when things happen in videos by linking sounds and visuals together