Researchers from MIT and other institutions have found a way to keep AI chatbots, like ChatGPT, from crashing during prolonged conversations. These large language models tend to degrade when a dialogue runs on continuously, and the researchers traced the failure to a surprising culprit, then introduced a method called StreamingLLM to address it.

Many large language models rely on a key-value cache that acts as a kind of conversation memory. When the cache fills up and the earliest entries are evicted to make room, the model's performance collapses. StreamingLLM keeps those first few entries in memory, which lets a chatbot carry on an uninterrupted conversation that stretches past 4 million words. It also ran more than 22 times faster than an alternative approach that avoids failure by constantly recomputing parts of the past conversation.
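To make the caching idea concrete, here is a minimal, hypothetical sketch of a rolling key-value cache that pins a few "sink" entries and evicts only the oldest of the remaining ones. The class name, parameters, and default sizes are illustrative assumptions, not the researchers' implementation.

```python
from collections import deque

class SinkKVCache:
    """Toy key-value cache: the first `num_sinks` entries are never evicted,
    while the rest live in a fixed-size sliding window."""

    def __init__(self, num_sinks=4, window=1024):
        self.num_sinks = num_sinks
        self.sinks = []                       # earliest tokens, kept forever
        self.recent = deque(maxlen=window)    # rolling window of recent tokens

    def append(self, kv_entry):
        # Fill the sink slots first; afterwards, each new entry pushes the
        # oldest non-sink entry out of the window automatically.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        # Everything the model can attend to at the next decoding step.
        return self.sinks + list(self.recent)

# Example: after a very long conversation, the cache still holds the four
# sink entries plus only the most recent `window` entries.
cache = SinkKVCache(num_sinks=4, window=1024)
for token_id in range(1_000_000):
    cache.append(token_id)
print(cache.entries()[:6])   # -> [0, 1, 2, 3, 998976, 998977]
```

The point of the sketch is that memory use stays bounded no matter how long the conversation runs, while the handful of pinned entries prevents the quality collapse described above.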
This advance could let an AI assistant hold long conversations throughout the workday without needing to be restarted, supporting tasks such as copywriting, editing, or code generation. Guangxuan Xiao, lead author and a graduate student in electrical engineering and computer science, envisions deploying these large language models persistently for a range of new applications.

The researchers found that keeping the first token in the sliding cache, which they call an "attention sink," is what preserves the model's performance. Even though the first token usually has no semantic connection to words generated much later, every other token can attend to it, so the model uses it to absorb attention weight it has nowhere else to put, since attention scores must sum to one. By keeping four attention-sink tokens at the start of the cache and assigning positional encodings according to a token's position in the cache rather than in the original text, StreamingLLM outperforms methods that rely on constant recomputation.
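The positional-encoding detail can be illustrated with a short, hypothetical helper: position ids are assigned by a token's slot within the cache rather than by its index in the original conversation, so they stay in a small, fixed range no matter how long the dialogue runs. The function name and example numbers are assumptions for illustration only.

```python
def positions_for_cache(cached_original_indices):
    """Position ids fed to the positional encoding: one per cache slot,
    counted from the start of the cache, not the start of the text."""
    return list(range(len(cached_original_indices)))

# Hypothetical example: the cache holds the 4 sink tokens plus a recent
# window that originally occupied indices 999_000 through 1_000_023.
cached = [0, 1, 2, 3] + list(range(999_000, 1_000_024))
print(positions_for_cache(cached)[:8])   # -> [0, 1, 2, 3, 4, 5, 6, 7]
```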
This approach keeps memory usage and performance stable even when processing texts up to 4 million tokens long. Experts not involved in the research have praised StreamingLLM as a significant step forward for AI-driven generation applications, and the method has been incorporated into NVIDIA's large language model optimization library, TensorRT-LLM, underscoring its practical value. One current limitation is that the model cannot remember words that have been evicted from the cache; future work aims to address this by retrieving evicted tokens or improving the model's ability to recall earlier parts of a conversation. The project received funding from the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.