Date: 2023-10-10

Large Language Models (LLMs) are increasingly used for natural language processing tasks. However, pretrained LLMs have limitations when applied to long input streams. Researchers from MIT, Meta AI, and Carnegie Mellon University explore these challenges and propose a solution in their study.

Two main issues arise when applying LLMs to unbounded input streams. First, during decoding, Transformer-based LLMs cache the Key and Value states (KV) of all prior tokens, so memory usage and decoding latency grow with the length of the stream. Second, existing models degrade once the sequence length exceeds the attention window size set during pre-training.
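To make the first issue concrete, the sketch below estimates how much memory a full KV cache needs as the stream grows. The model dimensions (32 layers, 32 heads, head size 128, 2 bytes per value) are illustrative assumptions for a mid-sized model, not figures from the study, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    All default dimensions are assumptions chosen for illustration.
    """
    # Factor 2: one key vector and one value vector per head, per layer.
    per_token = 2 * num_layers * num_heads * head_dim * bytes_per_value
    return num_tokens * per_token


for n in (4_096, 100_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:.1f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so the cost grows linearly with stream length and cannot be sustained for an endless stream.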

The study introduces StreamingLLM, a framework that enables LLMs trained with a finite attention window to handle text of indefinite length without fine-tuning. StreamingLLM exploits “attention sinks”: initial tokens that receive large attention scores even though they carry little semantic value. By keeping the KVs of a few attention sink tokens alongside the KVs of a sliding window of recent tokens, StreamingLLM stabilizes the attention computation and keeps model performance from degrading as the stream grows.
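The cache policy described above can be sketched as a small data structure. This is a toy illustration, not the authors' implementation: the class name `SinkWindowCache`, the choice of four sinks, and the window of eight entries are all assumptions, and integers stand in for real key and value tensors. It also ignores how positional information is handled for the retained entries.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV cache: keeps the first `num_sinks` entries plus a rolling window."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # initial tokens, never evicted
        self.recent = deque(maxlen=window)  # oldest entry is dropped automatically

    def append(self, kv):
        # The first few tokens fill the sink slots; later ones go to the window.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        # What attention would see: sinks first, then the most recent tokens.
        return self.sinks + list(self.recent)


cache = SinkWindowCache(num_sinks=4, window=8)
for position in range(100):
    cache.append(position)  # stand-in for that token's (key, value) tensors

kept = cache.entries()
print(kept)
```

After 100 tokens the cache holds only 12 entries: positions 0 to 3 as sinks and positions 92 to 99 as the recent window. The cache size stays constant however long the stream runs.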

StreamingLLM outperforms earlier techniques such as sliding window with recomputation, with the study reporting speedups of up to 22.2x, which makes streaming use of LLMs practical. The study reports stable language modeling over streams of 4 million tokens and more. It also finds that a model can be pre-trained with a dedicated attention sink token, so that streaming deployment needs only that single sink token. Vanilla models, by contrast, need several initial tokens kept as attention sinks to maintain performance.
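A rough way to see why recomputation is slower is to count query-key score computations per generated token. The sketch below is a simplified cost model under stated assumptions: it counts only attention score pairs, ignores feed-forward layers and memory traffic, and is not meant to predict measured speedups. The function names are hypothetical.

```python
def recompute_pairs(window):
    # Rebuilding the window from scratch with causal attention:
    # the i-th token attends to i positions, so 1 + 2 + ... + window pairs.
    return window * (window + 1) // 2


def cached_pairs(window):
    # With cached KVs, only the new token's query is scored against the cache.
    return window


for w in (256, 1024, 4096):
    ratio = recompute_pairs(w) / cached_pairs(w)
    print(f"window={w}: recompute={recompute_pairs(w)}, cached={cached_pairs(w)}, ratio={ratio}")
```

In this model recomputation costs grow quadratically with the window size while the cached approach grows linearly, so the gap widens as the window gets larger.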

The proposed approach has significant implications for applications that require LLMs to handle long streams of text effectively and consistently. With StreamingLLM, LLMs can be used for tasks such as code completion, question answering, document summarization, and dialogue systems in real-world streaming applications.

Definitions:

– Large Language Models (LLMs): neural language models trained on large text corpora and used for natural language processing tasks.

– Key and Value states (KV): the per-token key and value vectors computed by a Transformer's attention layers, cached during decoding so they are not recomputed for every new token.

– Attention window: the maximum number of tokens the model attends to at once, fixed by the sequence length used during pre-training.

– StreamingLLM: a framework that enables LLMs with a finite attention window to process text of indefinite length without fine-tuning.

