
3 posts tagged with "News"



360° Attention

· 2 min read
Bilel Saghrouchni
PhD Student in Machine Learning

Everyone needs attention, even an LLM

The attention mechanism lets an LLM identify which parts of a sequence are relevant to each token, which largely explains its impressive performance. However, while it revolutionized the field a few years ago, it is also a major source of computational and memory cost: standard self-attention compares every token with every other token, so its cost grows quadratically with sequence length. Google Research recently unveiled TurboQuant, a method designed to compress the attention cache.
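To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention for a single query in plain Python. It deliberately leaves out batching, multiple heads and the learned projections of a real LLM; these simplifications, and the function names, are ours. Each score measures how relevant a previous token (its key) is to the current query, and the output is the correspondingly weighted average of the values.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    query: list of d floats; keys, values: lists of n vectors.
    Returns the attention-weighted average of the value vectors.
    """
    d = len(query)
    # Relevance of each previous token: <q, k_i> / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, coordinate by coordinate
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

# Example: the query points toward the first key, so the first value dominates
out = attend([5.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)  # roughly [9.72, 0.28]
```

Each call touches all n keys, so generating a sequence token by token this way performs on the order of n² dot products in total, which is where the quadratic cost comes from.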

The Attention Cache?

During the generation phase (producing tokens one by one), recomputing the keys and values of the entire context at every step would be computationally prohibitive. Instead, the model stores the key and value projections of previous tokens in what is known as the KV Cache (Key-Value Cache), so each new token only computes its own projections and attends to the cached ones. This reduces computation but shifts the burden to memory: the size of the cached key and value matrices grows linearly with both the sequence length and the number of simultaneous requests (batch size). In practice, this is often the main bottleneck that saturates GPU VRAM.
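The linear growth is easy to quantify with a back-of-the-envelope estimate. The sketch below assumes a standard decoder-only transformer that caches one key and one value vector per layer, per KV head and per token; the configuration used in the example (32 layers, 32 KV heads of dimension 128, 16-bit values) is a hypothetical ~7B-scale model, not TurboQuant's benchmark setup.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Rough KV cache size in bytes for a decoder-only transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to fp16/bf16; pass a fraction
    (e.g. 0.5 for 4-bit) to model a quantized cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

GIB = 1024 ** 3

# Hypothetical ~7B-scale configuration: 32 layers, 32 KV heads of dimension 128
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
batched = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)
print(per_request / GIB, batched / GIB)  # 2.0 16.0
```

Doubling either the context length or the batch size doubles the cache, while lowering `bytes_per_value` shrinks it proportionally, which is exactly the lever that cache-compression methods pull.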

Cache is Money

The idea is simple: instead of representing the cached vectors with Cartesian coordinates, we use polar coordinates. Put simply, instead of saying “Go 3 blocks east and 4 blocks north,” we say “Go 5 blocks at an angle of about 37° east of north” (PolarQuant).
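The toy sketch below illustrates only the intuition on a single 2-D vector: convert it to a radius and an angle, then store the angle with a handful of bits. The uniform angle quantizer and the helper names are our own simplification for illustration, not the PolarQuant algorithm as published, which works on high-dimensional key vectors rather than a single 2-D point. Note that `math.atan2` measures the angle counterclockwise from the east axis, giving about 53°, which is the same direction as about 37° east of north.

```python
import math

def to_polar(x, y):
    # Radius (length) and angle (direction, in radians, counterclockwise from east)
    return math.hypot(x, y), math.atan2(y, x)

def from_polar(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

def quantize_angle(theta, bits):
    # Toy uniform quantizer: snap the angle to one of 2**bits evenly spaced directions
    levels = 2 ** bits
    step = 2 * math.pi / levels
    code = round(theta / step) % levels  # integer code that fits in `bits` bits
    return code, code * step             # stored code and reconstructed angle

r, theta = to_polar(3.0, 4.0)   # 3 blocks east, 4 blocks north
print(r, math.degrees(theta))   # 5.0 and about 53.13 (about 36.87 degrees east of north)
for bits in (4, 8):
    _, approx = quantize_angle(theta, bits)
    # The reconstruction gets closer to (3, 4) as more bits are spent on the angle
    print(bits, from_polar(r, approx))
```

With 4 bits the direction is snapped to the nearest multiple of 22.5°, which is already a recognizable approximation; with 8 bits the reconstructed point lands within a few hundredths of a block of (3, 4).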

This reformulation captures the essence of the information, the vector's direction, using very few bits. A correction stage, QJL (Quantized Johnson-Lindenstrauss), then compensates for the remaining quantization error so that the attention scores computed from the compressed cache are not biased. According to Google, the result is a significant reduction in memory usage and processing time, with no meaningful loss in model quality.
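As a rough illustration of the QJL idea, the sketch below projects a key with a random Gaussian matrix and keeps only the sign (1 bit) of each projected coordinate, plus the key's norm. The inner product with a full-precision query is then estimated with the factor sqrt(π/2)/m, which makes the estimate unbiased for Gaussian projections. In TurboQuant this correction is applied to the residual error of the first stage; here, for simplicity, we apply it directly to a key vector. The helper names and the choice of m are ours.

```python
import math
import random

def gaussian_matrix(m, d, seed=0):
    # Random projection S (m x d) with i.i.d. standard normal entries,
    # shared between keys and queries
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def matvec(S, x):
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def qjl_encode(S, key):
    # Keep only the sign of each projected coordinate (1 bit each) plus the key norm
    signs = [1 if v >= 0 else -1 for v in matvec(S, key)]
    norm = math.sqrt(sum(v * v for v in key))
    return signs, norm

def qjl_inner_product(S, query, signs, norm):
    # Unbiased estimate of <query, key>: sqrt(pi/2) / m * ||key|| * <S query, signs>
    m = len(signs)
    projected = matvec(S, query)
    return math.sqrt(math.pi / 2) / m * norm * sum(p * s for p, s in zip(projected, signs))

S = gaussian_matrix(10000, 8, seed=42)
query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
key = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
signs, norm = qjl_encode(S, key)
print(qjl_inner_product(S, query, signs, norm))  # close to the exact value, 1.0
```

The query stays in full precision while the key is reduced to m sign bits and one norm; increasing m trades memory for a lower-variance estimate.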

📚 Sources

  1. TurboQuant: Redefining AI efficiency with extreme compression. Amir Zandieh (Research Scientist) and Vahab Mirrokni (VP and Google Fellow), Google Research, March 24, 2026.

Agents — A Real Innovation?

· 3 min read
Bilel Saghrouchni
PhD Student in Machine Learning

We’ve been hearing a lot about agents lately: articles and posts are overflowing with the term “Agentic AI” (and now this one too 👀). But is it really something new? Back in the 1990s, people were already talking about agents, describing them as entities that operate continuously and autonomously within a dynamic, evolving environment. Others took a more philosophical approach, defining an agent as an entity whose internal state is represented by mental concepts such as beliefs, abilities, choices, and commitments, with its actions constrained and governed by fixed rules.

Model Collapse

· 2 min read
Bilel Saghrouchni
PhD Student in Machine Learning

The Silent Invasion of AI Content

A growing share of content on the Internet is AI-generated. As LLMs (Large Language Models) gradually become part of our daily lives, they are increasingly accessible and easy to use. Social media platforms have become true gold mines for this type of content: more and more users rely on ChatGPT or Mistral’s Le Chat to write or rewrite their posts before publishing them (maybe this post went through that too? 👀). A striking example is Quora, where the share of AI-generated content rose from 2% in 2022 to ~38% in 2024!