OMCAT, proposed in this paper teaches multi-modal model to understand when things happen in videos by linking sounds and visuals together
OMCAT: Omni Context Aware Transformer
OMCAT, proposed in this paper teaches multi-modal model to understand when things happen in videos by linking sounds and visuals together