Recent Posts
-
🔗 [Quote] LMArena on X
Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.
We now have acknowledgement from LMArena of what we already knew: AI labs are cheating to push their models as high as possible on the LMArena leaderboard and benchmarks.
This is inevitable: all of them want to win the AI race at any cost. If you don't want to be fooled by ever-slightly-increasing benchmark scores, you should set up your own benchmarks that measure the models' performance on your own use cases.
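If you want to go down that route, a personal benchmark can be as simple as a handful of prompts from your own workload paired with programmatic checks. Here's a minimal sketch; `call_model` is a hypothetical placeholder you'd wire up to whatever client you actually use:

```python
# A minimal personal benchmark harness: your own prompts, your own checks.
from typing import Callable

CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Extract the invoice total from: 'Total due: $1,234.50'",
     lambda out: "1,234.50" in out or "1234.50" in out),
    ("Translate to French: 'good morning'",
     lambda out: "bonjour" in out.lower()),
]

def call_model(prompt: str) -> str:
    # Placeholder: call OpenAI, Anthropic, a local server, etc.
    raise NotImplementedError

def run_benchmark() -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(check(call_model(prompt)) for prompt, check in CASES)
    return passed / len(CASES)

# score = run_benchmark()  # run this against each new model release
```

Rerun the same cases against every new model release, and you'll know whether the shiny new leaderboard numbers mean anything for you.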
-
🔗 [Link] The Llama 4 herd
Meta has finally released the Llama 4 family of models that Zuckerberg hyped up so much. The Llama 4 models are open-source, multi-modal, mixture-of-experts models. First impression: these models are massive. None of them will run on the average computer with a decent GPU, or on any single Mac Mini. This is what we have:
Llama 4 Scout
The small model in the family: a mixture-of-experts with 16 experts, totaling 109B parameters. According to Meta, after int4 quantization it fits on a single H100 GPU, which has 80GB of VRAM. It officially has the largest context window of any model to date: 10M tokens. However, a large context window takes a big toll on the already high VRAM requirements, so you might want to keep the context window contained. As they themselves write in their new cookbook example notebook for Llama 4:
Scout supports up to 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.
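A rough weights-only sanity check of the single-H100 claim (deliberately ignoring the KV cache and activations, which are precisely what balloons at long contexts):

```python
# Weights-only VRAM estimate for Llama 4 Scout. KV cache and activations
# are excluded, which is why long contexts still need far more memory.
params = 109e9  # total parameters, all 16 experts included

for name, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.0f} GB")

# bf16: ~203 GB  -> needs multiple GPUs
# int8: ~102 GB  -> still over a single H100
# int4: ~51 GB   -> fits in one 80 GB H100, matching Meta's claim
```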
Llama 4 Maverick
The mid-sized model. This one has 128 experts, totaling 400B parameters, and "only" features a 1M context window due to its larger size. As of today, Maverick has reached second place in LMArena with an ELO of 1417, surpassed only by Gemini 2.5 Pro. That's scary, knowing this is not even the best model in the family.
Llama 4 Behemoth
The big brother of the family: 16 experts, 2 TRILLION parameters. It easily surpasses Llama 3.1 405B, which was the largest Llama model until today. This model has not been released yet, as, according to Meta, it is still training, so we don't know anything about its capabilities.
Llama 4 Reasoning
We have no details on what it's going to be, just the announcement that it's coming soon.
Overall, these look like very capable frontier models that can compete with OpenAI, Anthropic, and Google while being open-source, which is a huge win. Check out Meta's post on the models' architecture and benchmarks, and find the models on HuggingFace.
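Since all of these are mixture-of-experts models, here's a toy sketch of what MoE routing looks like. This uses top-1 routing and tiny illustrative dimensions; Meta's actual Llama 4 implementation differs (a shared expert, different routing details, and vastly larger sizes):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-1 routing.

    Illustrative only: each token is sent to a single expert chosen by
    a learned router, so only a fraction of parameters is active per token.
    """
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                   # (tokens, n_experts)
        weight, idx = logits.softmax(-1).max(-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                       # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

x = torch.randn(8, 64)       # 8 tokens
print(MoELayer()(x).shape)   # torch.Size([8, 64])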
-
🔗 [Link] Circuit Tracing: Revealing Computational Graphs in Language Models
A group of Anthropic-affiliated scientists has released a paper where they study how human concepts are represented across Claude 3.5 Haiku's neurons and how these features interact to produce model outputs.
This is an especially difficult task since these concepts are not contained within a single neuron. Neurons are polysemantic, meaning that a single neuron encodes multiple unrelated concepts in its representation. To make matters worse, superposition means that the representation of a feature is built from a combination of multiple neurons, not just one.
In this paper, the researchers build a Local Replacement Model, where they replace the neural network's components with simpler, interpretable functions that mimic their behavior. Also, for each prompt they show Attribution Graphs that help visualize how the model processes information and how the features smeared across the model's neurons influence its outputs.
Also check out the companion paper: On the Biology of a Large Language Model. There, the researchers use interactive Attribution Graphs to study how models can plan ahead when performing complex text generations that require thinking through many steps to answer.
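To build some intuition for superposition, here's a toy numpy experiment (mine, not from the paper): pack more features than neurons by giving each feature a random direction, and sparse combinations of features can still be read out approximately:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 512  # 128 "neurons" hold 512 features: more features than neurons
features = rng.normal(size=(n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Activate a sparse handful of features and superpose their directions.
active = rng.choice(n, size=5, replace=False)
activation = features[active].sum(axis=0)  # what the "neurons" actually carry

# Random directions in high dimension are nearly orthogonal, so a dot
# product per feature mostly recovers the active set, despite every
# neuron participating in many features at once.
scores = features @ activation
top5 = np.argsort(scores)[-5:]
print("active: ", sorted(active.tolist()))
print("readout:", sorted(top5.tolist()))
```

The readout is only approximate, and the interference gets worse as more features activate at once, which is part of why untangling real models is so hard.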
-
How to Write a Good index.html File
Every web developer has been there: you're starting a new project and staring at an empty file called index.html. You try to remember: which tags were meant to go in the <head> again? Which meta tags are best practice, and which ones are deprecated? Recently, I ...
-
🚧 [Project] HTML Starter Template
A comprehensive index.html template (with comments) to get up and running quickly when starting a new website.
-
🔗 [Link] Claude Think Tool
The Anthropic team has discovered an interesting approach to LLM thinking capabilities. Instead of making the model think deeply before answering or taking an action, they experimented with giving the model a think tool. The think tool does nothing but register a thought in the state. However, it does allow the model to decide when it's appropriate to stop and think more carefully about the current state and the best approach to move forward.
The thinking done with the think tool won't be as deep, and it will be more focused on newly obtained information. Therefore, the think tool is especially useful when the model has to carefully analyze the outputs of complex tools and act on them thoughtfully.
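Here's a sketch of what registering such a tool could look like with the Anthropic Messages API; the tool description text and model name below are illustrative, not Anthropic's exact published version:

```python
import anthropic

# A no-op tool: it changes nothing, it just gives the model a place to
# pause and record a thought between other tool calls.
think_tool = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It will not obtain new "
        "information or change any state; it just logs the thought. Use it "
        "when you need to reason carefully about prior tool results."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to think about."}
        },
        "required": ["thought"],
    },
}

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    tools=[think_tool],
    messages=[{"role": "user", "content": "Plan a multi-step refund workflow."}],
)
print(response.content)
```

The trick is that the model itself decides when to call it, so the extra thinking lands exactly where the task gets complicated.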
-
🔗 [Quote] 🔭 The Einstein AI model
These benchmarks test if AI models can find the right answers to a set of questions we already know the answer to. However, real scientific breakthroughs will come not from answering known questions, but from asking challenging new questions and questioning common conceptions and previous ideas. - Thomas Wolf
Interesting reflection from Thomas Wolf of HuggingFace. Current LLMs have limited potential to make breakthroughs since they cannot "think out-of-the-box" beyond their training data. We might be able to give LLMs the ability to explore outside their known world through mechanisms like reinforcement learning with live environment feedback, or other mechanisms we haven't thought of yet. Still, significant breakthroughs will be hard for LLMs, since the real breakthroughs that make a huge impact are usually very far from established knowledge - very far from the AI model's current probability space.
-
About the Dead Internet Theory and AI
The Dead Internet Theory is an idea that has gained a lot of traction recently. I have to admit, the first time it was explained to me, I felt an eerie realization, like I had already been experiencing it but hadn't paid much attention to it. The first moment, I felt sc...
-
The Rise Of Reasoner Models: Scaling Test-Time Compute
A new kind of LLM has recently been popping up everywhere: reasoner models. Kickstarted by OpenAI's o1 and o3, these models are a bit different from the rest. They particularly shine when dealing with mathematical problems and coding challenges, where success depends on...
-
AI in 2024: Year in Review and Predictions for 2025
The past year has been transformative for artificial intelligence, marked by breakthrough innovations, emerging regulations, and a shift toward practical AI tools that enhance productivity. As we look ahead to 2025, let's review the major developments of 2024 and explore what th...