Rethinking the Transformer: Toward Native Multimodal Architectures
Bowen Peng, Nous Research

Transformers have driven breakthroughs in language and vision, but their limitations become clear when they are extended to multimodal data. This session explores architectural innovations such as tokenizer-free transformers, mixture-of-experts, and hierarchical attention that better capture the differing granularities across modalities. We’ll discuss how these architectural improvements could redefine the foundations of large-scale model training and open new frontiers beyond language.