Date: 2023-10-01

Researchers from Soochow University have introduced OpenBA, an open-source bilingual language model. The model is pretrained from scratch and helps fill a gap in open-source exploration of the encoder-decoder framework. Alongside the model checkpoints, the release documents data collection and processing, as well as the motivations behind the model architecture and its design.

To support Chinese language modeling, the researchers gathered pre-training data balanced between English and Chinese tokens. They also included additional English data from the Bilingual-Flan corpus to cover a wide range of tasks and settings. OpenBA uses a shallow-encoder, deep-decoder structure, which improves its generation capability.

OpenBA is trained in three stages: UL2 pre-training, length adaptation, and Flan training. The researchers also apply enhancement strategies to the model architecture and the training process to improve stability and effectiveness.
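UL2-style pre-training mixes several denoising objectives that differ mainly in how long the masked spans are and how much of the input is corrupted. As a rough illustration of one such objective, the sketch below applies span corruption to a token list. It is not taken from the OpenBA code: the function names `corrupt_spans` and `sample_spans` are hypothetical, and the `<extra_id_i>` sentinel format is borrowed from the T5 convention as an assumption.

```python
import random
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each (start, length) span with a sentinel token.

    Returns (encoder_input, decoder_target). The sentinel naming is an
    assumption borrowed from T5, not OpenBA's actual vocabulary.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, length) in enumerate(sorted(spans)):
        if start < cursor or length <= 0 or start + length > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # mark where the span was removed
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:start + length])  # ...then the removed tokens
        cursor = start + length
    inputs.extend(tokens[cursor:])
    return inputs, targets


def sample_spans(
    n_tokens: int, span_len: int, corrupt_rate: float, rng: random.Random
) -> List[Tuple[int, int]]:
    """Pick non-overlapping fixed-length spans covering roughly corrupt_rate of the tokens.

    Simplification: starts are drawn from aligned slots, so spans may be
    adjacent but never overlap.
    """
    n_spans = max(1, round(n_tokens * corrupt_rate / span_len))
    slots = list(range(0, n_tokens - span_len + 1, span_len))
    starts = rng.sample(slots, min(n_spans, len(slots)))
    return [(s, span_len) for s in sorted(starts)]


tokens = "OpenBA is a bilingual encoder decoder model".split()
enc_in, dec_out = corrupt_spans(tokens, [(1, 2), (5, 1)])
print(enc_in)   # ['OpenBA', '<extra_id_0>', 'bilingual', 'encoder', '<extra_id_1>', 'model']
print(dec_out)  # ['<extra_id_0>', 'is', 'a', '<extra_id_1>', 'decoder']
```

Calling `sample_spans` with a short `span_len` and a low `corrupt_rate` approximates a regular denoiser, while longer spans or higher rates approximate a more extreme one. The specific settings used for OpenBA are described in the paper and are not reproduced here.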

OpenBA has been evaluated on a range of benchmarks and tasks in zero-shot, few-shot, held-in, and held-out settings. Despite being trained on fewer tokens, it outperformed other representative models on benchmarks such as BELEBELE, MMLU, CMMLU, and C-Eval. The researchers also report an environmental benefit: training OpenBA produced significantly lower carbon emissions than training other models.

All implementation details, including data collection and processing, model checkpoints, and evaluations, are publicly available. The researchers welcome feedback and suggestions as they continue to improve OpenBA, and they look forward to collaborating with the open-source community.

Source: Researchers from Soochow University, OpenBA GitHub repository, Paper on OpenBA

