
Researchers at the University of Edinburgh have developed a way to make AI models both more sustainable and more accurate: compress their memory. Working with experts at NVIDIA, the researchers found that large language models (LLMs) using memory 8x smaller than that of an uncompressed LLM scored better on math, science and coding tests while spending the same amount of time reasoning.
By “thinking” about more complex hypotheses or exploring more hypotheses concurrently, AI models improve their problem-solving abilities. In practice, this is achieved by generating more reasoning threads (step-by-step attempts to solve a problem) in text form. The more threads there are, and the longer they are, the more memory is required. And the more memory is used, the longer the LLM takes to retrieve the KV cache, the model’s memory, from the part of the AI device where it is stored.
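To give a rough sense of how thread count and thread length drive memory use, the sketch below estimates the size of a KV cache. The function name and the model dimensions (32 layers, 8 key-value heads, a head dimension of 128, 16-bit values) are illustrative assumptions, not figures from the study.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   n_threads=1, bytes_per_value=2, compression_ratio=1):
    """Estimate KV cache size in bytes for a transformer LLM.

    Each token stores one key and one value vector per layer and per
    key-value head, hence the leading factor of 2.
    All parameter values used below are hypothetical.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * n_threads // compression_ratio


GIB = 2 ** 30

# Assumed model shape: 32 layers, 8 KV heads, head dimension 128, fp16 values.
one_thread = kv_cache_bytes(32, 8, 128, seq_len=8192)
eight_threads = kv_cache_bytes(32, 8, 128, seq_len=8192, n_threads=8)
compressed = kv_cache_bytes(32, 8, 128, seq_len=8192, n_threads=8,
                            compression_ratio=8)

print(f"{one_thread / GIB:.1f} GiB")     # 1.0 GiB
print(f"{eight_threads / GIB:.1f} GiB")  # 8.0 GiB
print(f"{compressed / GIB:.1f} GiB")     # 1.0 GiB
```

Under these assumed dimensions, eight reasoning threads with an 8x smaller cache occupy the same memory as a single uncompressed thread.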
To overcome this, the team developed a method for compressing the models’ memory, called Dynamic Memory Sparsification (DMS). Instead of keeping every token (the units of data that an AI model processes), DMS decides which ones are important enough to keep and which can be deleted. By managing which tokens to keep and which to discard, DMS lets the AI model “think” in more depth or explore more possible solutions without needing extra computing power.
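The general idea of keeping only the most important tokens can be sketched as follows. This is a simplified illustration, not the DMS method itself: the article does not describe how DMS scores tokens, so the importance scores, the function name and the fixed keep-the-top-fraction rule are all assumptions made for the example.

```python
import math


def sparsify_cache(entries, importance, compression_ratio=8):
    """Keep roughly 1/compression_ratio of the cache entries.

    entries:    cached items, one per token, in generation order
    importance: hypothetical per-token scores (higher = more important)
    Returns the retained entries in their original order.
    """
    if len(entries) != len(importance):
        raise ValueError("entries and importance must have the same length")

    keep = max(1, math.ceil(len(entries) / compression_ratio))
    # Indices of the highest-scoring tokens.
    top = sorted(range(len(entries)), key=lambda i: importance[i],
                 reverse=True)[:keep]
    # Restore generation order so the retained tokens stay in sequence.
    return [entries[i] for i in sorted(top)]


# 16 cached tokens with made-up scores; tokens 3 and 12 score highest.
tokens = [f"t{i}" for i in range(16)]
scores = [0.1] * 16
scores[3] = 0.9
scores[12] = 0.8

print(sparsify_cache(tokens, scores))  # ['t3', 't12']
```

With a compression ratio of 8, the 16 cached tokens shrink to 2, which frees memory that could be spent on longer or additional reasoning threads.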
The researchers tested DMS on different versions of the AI models Llama and Qwen. The results showed that even with memories compressed to one eighth of their original size, LLMs fully retain their original accuracy on difficult tasks while reasoning faster than non-compressed models.
Data from University of Edinburgh