🔗 Shared Links

08 Apr 2025

🔗 [Quote] LMArena on X

Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.

We now have acknowledgement from LMArena of what we already knew: AI labs are cheating to get their models as high as possible on the LMArena leaderboard and on benchmarks in general.

This is inevitable: all of them want to win the AI race at any cost. If you don't want to be fooled by ever so slightly increasing benchmark scores, you should set up your own benchmarks that measure model performance on your own use cases.
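
A private benchmark does not need to be elaborate. Below is a minimal sketch, assuming a model can be wrapped as a plain function from prompt to text. The `Case` type, the example prompts and `dummy_model` are all hypothetical placeholders, not part of any real evaluation library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True when the answer is acceptable


def run_benchmark(model: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Hypothetical cases; replace them with prompts taken from your own workload.
CASES = [
    Case("What is 17 * 3? Reply with the number only.",
         lambda out: out.strip() == "51"),
    Case("Reply with the ISO 3166 alpha-2 code for Spain.",
         lambda out: "ES" in out.upper()),
]


def dummy_model(prompt: str) -> str:
    # Stand-in for a real API call, so the sketch runs offline.
    return "51" if "17 * 3" in prompt else "I don't know"


score = run_benchmark(dummy_model, CASES)
print(f"{score:.0%} of cases passed")
```

Swapping `dummy_model` for a function that calls the model you are evaluating lets you rerun the same cases on every new release and compare scores that actually matter to you.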
[[ Visit external link ]]
06 Apr 2025

🔗 [Link] The Llama 4 herd

Meta has finally released the Llama 4 family of models that Zuckerberg hyped up so much. The Llama 4 models are open-source, multimodal, mixture-of-experts models. First impression: these models are massive. None of them will run on an average computer with a decent GPU, or on a single Mac Mini. This is what we have:

Llama 4 Scout

The small model in the family: a mixture-of-experts model with 16 experts, totaling 109B parameters. According to Meta, after int4 quantization it fits on a single H100 GPU, which has 80GB of VRAM. Meta advertises a supported context window of 10M tokens, the largest of any model at release. However, a large context window takes a big toll on the already high VRAM requirements, so you might want to keep the context length contained. As they themselves write in their new cookbook example notebook for Llama 4:

Scout supports up to 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.
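
The memory figures above can be sanity-checked with some quick arithmetic. This is a rough sketch that counts the weights only, using the 109B parameter count and 80GB of VRAM quoted above. It ignores activations, the KV cache and framework overhead, so real requirements will be higher.

```python
GB = 1e9  # decimal gigabytes, as GPU memory is usually advertised


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB."""
    return n_params * bits_per_param / 8 / GB


SCOUT_PARAMS = 109e9  # total parameters, as quoted above
H100_VRAM_GB = 80

int4 = weight_memory_gb(SCOUT_PARAMS, 4)
bf16 = weight_memory_gb(SCOUT_PARAMS, 16)
left_on_8_gpus = 8 * H100_VRAM_GB - bf16

print(f"int4 weights: {int4:.1f} GB (fits on one H100: {int4 < H100_VRAM_GB})")
print(f"bf16 weights: {bf16:.1f} GB")
print(f"left on 8xH100 after bf16 weights: {left_on_8_gpus:.1f} GB")
```

Whatever is left after the weights is what the KV cache and activations have to share, which is why the usable context on a given machine can be far below the advertised maximum.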

Llama 4 Maverick

The mid-sized model. This one has 128 experts, totaling 400B parameters, and "only" features a 1M context window due to its larger size. As of today, Maverick has reached second place on LMArena with an Elo score of 1417, surpassed only by Gemini 2.5 Pro, which is scary knowing this is not even the best model in the family.
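
To get a feel for what an Elo score means, the textbook Elo formula turns a rating gap into an expected win rate. This is a sketch of the standard formula only; the leaderboard's actual rating computation may differ, and the opponent rating of 1400 is a hypothetical value chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


MAVERICK = 1417              # score quoted above
HYPOTHETICAL_RIVAL = 1400    # made-up rating, not a real leaderboard entry

p = expected_score(MAVERICK, HYPOTHETICAL_RIVAL)
print(f"expected win rate against the rival: {p:.3f}")
```

Under this formula, a gap of a few points corresponds to a win rate only slightly above 50%, so small differences near the top of a leaderboard say little about which model is better for a given task.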

Llama 4 Behemoth

The big brother in the family: 16 experts and 2 TRILLION parameters. In size, it easily surpasses Llama 3.1 405B, which was the largest Llama model until today. This model has not been released yet because, according to Meta, it is still training, so we don't know anything about its capabilities.

Llama 4 Reasoning

We have no details on what it's going to be, just the announcement that it's coming soon.

Overall, these look like very capable frontier models that can compete with OpenAI, Anthropic and Google while being open-source, which is a huge win. Check out Meta's post on the models' architecture and benchmarks, and also check the models on Hugging Face.
[[ Visit external link ]]
30 Mar 2025

🔗 [Link] Circuit Tracing: Revealing Computational Graphs in Language Models

A group of Anthropic-affiliated scientists has released a paper where they study how human concepts are represented across Claude 3.5 Haiku's neurons and how these features interact to produce model outputs.

This is an especially difficult task since these concepts are not contained within a single neuron. Neurons are polysemantic, meaning that each one encodes multiple unrelated concepts in its representation. To make matters worse, superposition means that the representation of a feature is built from a combination of multiple neurons, not just one.
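
A toy example may help make these two ideas concrete. The sketch below is not the paper's method: it simply stores three hypothetical features in two "neurons" by giving each feature a direction 120 degrees apart from the others. Every neuron then responds to several features (polysemanticity), and every feature is spread over both neurons (superposition).

```python
import math

# Three hypothetical features packed into only two neurons.
FEATURES = ["cat", "car", "king"]
DIRECTIONS = [
    (math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
    for i in range(len(FEATURES))
]


def encode(active: dict[str, float]) -> tuple[float, float]:
    """Neuron activations: a weighted sum of the active features' directions."""
    n0 = sum(v * DIRECTIONS[FEATURES.index(f)][0] for f, v in active.items())
    n1 = sum(v * DIRECTIONS[FEATURES.index(f)][1] for f, v in active.items())
    return (n0, n1)


def readout(neurons: tuple[float, float]) -> dict[str, float]:
    """Project the neuron activations back onto each feature direction."""
    return {
        f: neurons[0] * d[0] + neurons[1] * d[1]
        for f, d in zip(FEATURES, DIRECTIONS)
    }


neurons = encode({"cat": 1.0})
scores = readout(neurons)
print("neurons:", neurons)
print("readout:", scores)
```

Reading the activations back gives the strongest score for the active feature, but the other two features also get a nonzero (negative) score. That interference is the price of packing more features than neurons, and it is why no single neuron can be inspected in isolation.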

In this paper, the researchers build a Local Replacement Model, where they replace the neural network's components with a simpler, interpretable function that mimics its behavior. For each prompt, they also build attribution graphs that help visualize how the model processes information and how the features smeared across the model's neurons influence its outputs.

Also check out the companion paper, On the Biology of a Large Language Model. In it, the researchers use interactive attribution graphs to study how models plan ahead when generating text and how they work through multiple steps to reach an answer.
[[ Visit external link ]]
22 Mar 2025

🔗 [Link] Claude Think Tool

The Anthropic team has discovered an interesting approach to LLM thinking capabilities. Instead of making the model think deeply before answering or taking an action, they experimented with giving the model a think tool. The think tool does nothing but register a thought in the state. However, it does allow the model to decide when it's appropriate to stop and think more carefully about the current state and the best approach to move forward.

The thinking done using the think tool will not be as deep as thinking before answering, and it will be more focused on newly obtained information. Therefore, the think tool is especially useful when the model has to carefully analyze the outputs of complex tools and act on them thoughtfully.
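
As a rough illustration of how little such a tool needs to do, here is a minimal sketch. The tool definition follows a generic JSON-Schema style, and its wording, the `AgentState` class and the `handle_tool_call` function are assumptions made for this example, not Anthropic's exact definition or API.

```python
import json

# Hypothetical tool definition; the name, description and fields are
# illustrative assumptions.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It does not fetch new "
        "information or change any data; it only records the thought."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to record."}
        },
        "required": ["thought"],
    },
}


class AgentState:
    """Minimal stand-in for whatever state an agent loop keeps."""

    def __init__(self) -> None:
        self.thoughts: list[str] = []


def handle_tool_call(state: AgentState, name: str, arguments: dict) -> str:
    """Dispatch a tool call; the think tool only appends to the state."""
    if name == "think":
        state.thoughts.append(arguments["thought"])
        return "Thought recorded."
    raise ValueError(f"unknown tool: {name}")


state = AgentState()
result = handle_tool_call(
    state, "think", {"thought": "The tool output is ambiguous; re-check it."}
)
print(json.dumps(THINK_TOOL, indent=2))
print(result, state.thoughts)
```

The handler has no side effects beyond storing the text. The value comes entirely from the model choosing when to call it, typically right after a tool returns something that needs careful reading.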
[[ Visit external link ]]
12 Mar 2025

🔗 [Quote] 🔭 The Einstein AI model

These benchmarks test if AI models can find the right answers to a set of questions we already know the answer to. However, real scientific breakthroughs will come not from answering known questions, but from asking challenging new questions and questioning common conceptions and previous ideas. - Thomas Wolf

Interesting reflection from Thomas Wolf of Hugging Face. Current LLMs have limited potential to make breakthroughs since they cannot think outside the box of their training data. We might be able to give LLMs the ability to explore outside their known world through mechanisms like reinforcement learning plus live environment feedback, or through other mechanisms that we haven't thought about yet. Still, significant breakthroughs will be hard for LLMs, since the breakthroughs that make a huge impact are usually very far from established knowledge, and therefore very far from the AI model's current probability space.
[[ Visit external link ]]
