Date: 2023-12-18

In natural language processing (NLP), it has become increasingly common to train large language models (LLMs) that can handle a wide range of tasks without extensive task-specific adjustment. Vision still lacks equally flexible and scalable models. Such models must be able to handle diverse sensory inputs and perform many different tasks.

Traditional training methods for vision, such as training on RGB images for a single objective, have not matched the success of language modeling in NLP. To address this, researchers from the Swiss Federal Institute of Technology Lausanne (EPFL) and Apple have focused on building a scalable vision model that can handle different input types.

Their strategy is to train a single unified Transformer encoder-decoder with a multimodal masked modeling objective, which they call 4M (Massively Multimodal Masked Modeling). The approach combines masked modeling with multimodal learning, giving the model strong cross-modal predictive coding abilities and shared scene representations. Because masked tokens can be predicted a few at a time, iterative sampling also lets the model be used for generative tasks.
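The iterative sampling idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `predict` callable, the `MASK` placeholder, the random unmasking order, and the `tokens_per_step` parameter are all assumptions made for illustration. The loop starts from a fully masked target, asks the model for a few positions per step, and feeds the filled-in tokens back as context.

```python
import random
from typing import Callable, Dict, List, Optional

MASK = None  # hypothetical placeholder for a token that has not been generated yet


def iterative_decode(
    predict: Callable[[List[Optional[int]], List[int]], Dict[int, int]],
    num_tokens: int,
    tokens_per_step: int,
    seed: int = 0,
) -> List[Optional[int]]:
    """Fill a fully masked sequence a few positions at a time.

    `predict(sequence, positions)` stands in for a trained masked model: it
    sees the partially filled sequence and returns a token id for each
    requested position.
    """
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    rng = random.Random(seed)
    sequence: List[Optional[int]] = [MASK] * num_tokens
    while any(token is MASK for token in sequence):
        masked = [i for i, token in enumerate(sequence) if token is MASK]
        # Assumption: positions are unmasked in random order.
        chosen = rng.sample(masked, min(tokens_per_step, len(masked)))
        proposals = predict(sequence, chosen)
        for position in chosen:
            sequence[position] = proposals[position]
    return sequence


# Toy stand-in for a model: the token at each position is its index modulo 4.
def toy_predict(sequence, positions):
    return {p: p % 4 for p in positions}


generated = iterative_decode(toy_predict, num_tokens=10, tokens_per_step=3)
```

A real model would condition its predictions on the tokens already filled in, which is what makes the multi-step procedure useful. The toy predictor ignores that context and only demonstrates the control flow.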

Scalability in this approach rests on three factors: data scalability, architectural scalability, and training objective scalability. Data scalability means that performance improves as more training samples are used. Architectural scalability means that performance improves as model size increases. Training objective scalability means that the objective can efficiently handle a growing number of input modalities without excessive computational cost.

To achieve this, the researchers use modality-specific tokenizers to convert different input modalities into sets or sequences of discrete tokens, enabling a single Transformer model to be trained on various types of data. This tokenization approach eliminates the need for task-specific encoders and heads, allowing the Transformer to be used with any modality while retaining parameter-sharing.
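As a rough illustration of this tokenization step, the sketch below turns two toy modalities into discrete tokens and flattens them into one sequence that a single model could consume. Everything in it is hypothetical: the function names, the whitespace word tokenizer, and the uniform quantizer are simple stand-ins for the learned modality-specific tokenizers the article describes.

```python
from typing import Dict, List, Tuple


def tokenize_caption(caption: str, vocab: Dict[str, int]) -> List[int]:
    """Toy text tokenizer: one id per whitespace-separated word.

    New words are added to `vocab` on first sight.
    """
    return [vocab.setdefault(word, len(vocab)) for word in caption.lower().split()]


def tokenize_dense(values: List[float], num_bins: int = 8) -> List[int]:
    """Toy stand-in for an image-like tokenizer.

    Uniformly quantizes values in [0, 1] into `num_bins` discrete ids.
    """
    return [min(int(v * num_bins), num_bins - 1) for v in values]


def to_unified_sequence(
    tokens_by_modality: Dict[str, List[int]]
) -> List[Tuple[str, int, int]]:
    """Flatten per-modality tokens into (modality, position, token_id) triples."""
    return [
        (modality, position, token_id)
        for modality, tokens in tokens_by_modality.items()
        for position, token_id in enumerate(tokens)
    ]


vocab: Dict[str, int] = {}
sample = {
    "caption": tokenize_caption("a dog chases a ball", vocab),
    "depth": tokenize_dense([0.0, 0.5, 1.0]),
}
sequence = to_unified_sequence(sample)
```

Once every modality is a list of integers tagged with its source, the same model can read and predict any of them, which is why no task-specific encoder or head is needed in this setup.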

Additionally, the researchers use input and target masking to train the model efficiently, even with a large number of modalities: only a subset of tokens is fed to the model as input and only a subset is predicted as targets. They create aligned binding data across modalities using pseudo-labeling networks, which allows training on diverse, large-scale datasets without requiring multimodal or multitask annotations.
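Input and target masking can be sketched as follows. This is a simplified illustration, not the authors' sampling scheme: the function name, the fixed `input_budget` and `target_budget` parameters, and the uniform random choice over all tokens are assumptions. The point it shows is that the number of tokens the model processes per sample stays fixed no matter how many modalities contribute tokens.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_input_and_target(
    tokens: Sequence[T],
    input_budget: int,
    target_budget: int,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T]]:
    """Split tokens into a visible input subset and a disjoint target subset.

    Tokens in neither subset are simply dropped for this training step.
    """
    rng = random.Random(seed)
    indices = list(range(len(tokens)))
    rng.shuffle(indices)
    # Clamp the budgets so short sequences do not raise errors.
    n_in = min(input_budget, len(indices))
    n_out = min(target_budget, len(indices) - n_in)
    input_idx = sorted(indices[:n_in])
    target_idx = sorted(indices[n_in:n_in + n_out])
    return [tokens[i] for i in input_idx], [tokens[i] for i in target_idx]


# Three hypothetical modalities with ten tokens each.
all_tokens = [(m, i) for m in ("rgb", "depth", "caption") for i in range(10)]
inputs, targets = sample_input_and_target(all_tokens, input_budget=8, target_budget=4, seed=0)
```

Adding a fourth modality to `all_tokens` would leave the sizes of `inputs` and `targets` unchanged, so the per-step cost does not grow with the number of modalities.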

The results show that 4M performs well on a variety of vision tasks and can be fine-tuned to achieve strong performance on unseen downstream tasks and input modalities. With its ability to handle multimodal inputs, 4M holds promise for future developments in vision.

In conclusion, scalable and versatile multimodal models are crucial for handling diverse input types and performing varied tasks in vision. The 4M model, with its unified Transformer encoder-decoder and multimodal masked modeling objective, offers a promising way to reach that goal.