Researchers from Zhipu AI and Tsinghua University have introduced CogVLM, a new visual language model (VLM) that aims to improve the visual understanding of large language models without sacrificing their natural language processing (NLP) capabilities. The researchers attribute the subpar performance of current shallow alignment techniques primarily to the lack of deep integration between language and visual information.

Shallow alignment techniques, such as BLIP-2, map image features into the language model’s input embedding space using a trainable Q-Former or a linear layer. However, this approach does not perform as well as training the language and vision modules jointly. In chat-style VLMs trained with shallow alignment, weak visual understanding shows up as hallucinations.
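
To make the idea concrete, here is a minimal NumPy sketch of the linear-layer variant of shallow alignment. It is an illustration under stated assumptions, not the BLIP-2 implementation: the feature sizes, sequence lengths, and the `shallow_align` helper are all hypothetical, and random arrays stand in for the outputs of a real vision encoder and language model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_model = 8, 16   # hypothetical feature sizes
n_patches, n_text = 4, 3    # hypothetical sequence lengths

# Stand-in for the output of a vision encoder: one vector per image patch.
image_features = rng.normal(size=(n_patches, d_vision))
# Stand-in for the language model's embeddings of the text tokens.
text_embeddings = rng.normal(size=(n_text, d_model))

# The only trainable piece in this sketch: a linear map from the
# vision feature space into the language model's embedding space.
projection = rng.normal(size=(d_vision, d_model)) * 0.02


def shallow_align(image_features, text_embeddings, projection):
    """Project image features and prepend them to the text embeddings."""
    visual_tokens = image_features @ projection
    return np.concatenate([visual_tokens, text_embeddings], axis=0)


lm_input = shallow_align(image_features, text_embeddings, projection)
print(lm_input.shape)
```

In this setup the language model only ever sees the image through its input sequence; none of its internal layers are adapted to visual features.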

CogVLM addresses this issue by introducing deep integration between language and visual information. The model augments the language model with a trainable visual expert: in each layer, image features are processed by their own QKV matrix and MLP layer, while text features continue to use the language model's original weights. This deep integration allows the model to maintain its performance on text-centered tasks while improving visual understanding.
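
The routing idea behind the visual expert can be sketched in a few lines of NumPy. This is a simplified illustration, not CogVLM's code: it uses a single attention head, omits the output projection and the MLP expert, and the names `route` and `visual_expert_attention` are hypothetical. Image positions use one QKV matrix, text positions use another, and all tokens then attend to each other in the same attention step.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def route(x, is_image, w_text, w_image):
    """Apply w_image at image positions and w_text at text positions."""
    return np.where(is_image[:, None], x @ w_image, x @ w_text)


def visual_expert_attention(x, is_image, text_qkv, image_qkv):
    """Single-head causal attention with modality-specific QKV weights."""
    seq_len, d = x.shape
    qkv = route(x, is_image, text_qkv, image_qkv)   # (seq_len, 3 * d)
    q, k, v = np.split(qkv, 3, axis=-1)
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: each position attends to itself and earlier positions.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                          # 3 image tokens, 2 text tokens
is_image = np.array([True, True, True, False, False])
text_qkv = rng.normal(size=(d, 3 * d))               # stands in for the original LM weights
image_qkv = rng.normal(size=(d, 3 * d))              # stands in for the visual expert

out = visual_expert_attention(x, is_image, text_qkv, image_qkv)
print(out.shape)
```

One consequence of this design is visible in the sketch: when the input contains no image tokens, the image weights are never selected, so the output depends only on the text weights. That is consistent with the claim that text-only behavior is preserved.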

CogVLM has achieved state-of-the-art or second-best performance on 14 cross-modal benchmarks, including image captioning, visual question answering, multiple-choice, and visual grounding datasets. The model has also been trained to support both Chinese and English for commercial use.

The open-sourcing of CogVLM is expected to have a significant positive impact on visual understanding research and industrial applications. This research provides valuable insights into enhancing the capabilities of visual language models by integrating language and visual information more deeply.

