Recent Posts
-
🔗 [Quote] LMArena on X
Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.
We now have acknowledgement from LMArena of what we already knew: AI labs are cheating to push their models as high as possible on the LMArena leaderboard and benchmarks.
This is inevitable: all of them want to win the AI race at any cost. If you don't want to be fooled by ever-slightly-increasing benchmark scores, you should set up your own benchmarks that measure the models' performance on your own use cases.
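If you want to go down that route, a personal benchmark can be as simple as a handful of prompts from your own workload paired with programmatic checks. Here's a minimal sketch; `call_model` is a hypothetical placeholder you'd wire up to whatever client you actually use:

```python
# A minimal personal benchmark harness: your own prompts, your own checks.
from typing import Callable

CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Extract the invoice total from: 'Total due: $1,234.50'",
     lambda out: "1,234.50" in out or "1234.50" in out),
    ("Translate to French: 'good morning'",
     lambda out: "bonjour" in out.lower()),
]

def call_model(prompt: str) -> str:
    # Placeholder: call OpenAI, Anthropic, a local server, etc.
    raise NotImplementedError

def run_benchmark() -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(check(call_model(prompt)) for prompt, check in CASES)
    return passed / len(CASES)

# score = run_benchmark()  # run this against each new model release
```

Rerun the same cases against every new model release, and you'll know whether the shiny new leaderboard numbers mean anything for you.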
-
🔗 [Link] The Llama 4 herd
Meta has finally released the Llama 4 family of models that Zuckerberg hyped up so much. The Llama 4 models are open-source, multi-modal, mixture-of-experts models. First impression: these models are massive. None of them will run on the average computer with a decent GPU, or on any single Mac Mini. This is what we have:
Llama 4 Scout
The small model in the family: a mixture-of-experts with 16 experts, totaling 109B parameters. According to Meta, after int4 quantization it fits on a single H100 GPU, which has 80GB of VRAM. It officially has the largest context window of any model to date: 10M tokens. However, a large context window takes a big toll on the already high VRAM requirements, so you might want to keep the context window contained. As they themselves write in their new cookbook example notebook for Llama 4:
Scout supports up to 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.
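A rough weights-only sanity check of the single-H100 claim (deliberately ignoring the KV cache and activations, which are precisely what balloons at long contexts):

```python
# Weights-only VRAM estimate for Llama 4 Scout. KV cache and activations
# are excluded, which is why long contexts still need far more memory.
params = 109e9  # total parameters, all 16 experts included

for name, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gb:.0f} GB")

# bf16: ~203 GB  -> needs multiple GPUs
# int8: ~102 GB  -> still over a single H100
# int4: ~51 GB   -> fits in one 80 GB H100, matching Meta's claim
```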
Llama 4 Maverick
The mid-sized model. This one has 128 experts, totaling 400B parameters, and "only" features a 1M context window due to its larger size. As of today, Maverick has reached second place in LMArena with an ELO of 1417, surpassed only by Gemini 2.5 Pro. That's scary, knowing this is not even the best model in the family.
Llama 4 Behemoth
The big brother of the family: 16 experts, 2 TRILLION parameters. It easily surpasses Llama 3.1 405B, which was the largest Llama model until today. This model has not been released yet, as, according to Meta, it is still training, so we don't know anything about its capabilities.
Llama 4 Reasoning
We have no details on what it's going to be, just the announcement that it's coming soon.
Overall, these look like very capable frontier models that can compete with OpenAI, Anthropic, and Google while being open-source, which is a huge win. Check out Meta's post on the models' architecture and benchmarks, and find the models on HuggingFace.
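Since all of these are mixture-of-experts models, here's a toy sketch of what MoE routing looks like. This uses top-1 routing and tiny illustrative dimensions; Meta's actual Llama 4 implementation differs (a shared expert, different routing details, and vastly larger sizes):

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-1 routing.

    Illustrative only: each token is sent to a single expert chosen by
    a learned router, so only a fraction of parameters is active per token.
    """
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                   # (tokens, n_experts)
        weight, idx = logits.softmax(-1).max(-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                       # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

x = torch.randn(8, 64)       # 8 tokens
print(MoELayer()(x).shape)   # torch.Size([8, 64])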
-
🔗 [Link] Circuit Tracing: Revealing Computational Graphs in Language Models
A group of Anthropic-affiliated scientists has released a paper where they study how human concepts are represented across Claude 3.5 Haiku's neurons and how these features interact to produce model outputs.
This is an especially difficult task since these concepts are not contained within a single neuron. Neurons are polysemantic, meaning that a single neuron encodes multiple unrelated concepts in its representation. To make matters worse, superposition means that the representation of a feature is built from a combination of multiple neurons, not just one.
In this paper, the researchers build a Local Replacement Model, where they replace the neural network's components with simpler, interpretable functions that mimic their behavior. Also, for each prompt they show Attribution Graphs that help visualize how the model processes information and how the features smeared across the model's neurons influence its outputs.
Also check out the companion paper: On the Biology of a Large Language Model. There, the researchers use interactive Attribution Graphs to study how models can plan ahead when performing complex text generations that require thinking through many steps to answer.
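To build some intuition for superposition, here's a toy numpy experiment (mine, not from the paper): pack more features than neurons by giving each feature a random direction, and sparse combinations of features can still be read out approximately:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 512  # 128 "neurons" hold 512 features: more features than neurons
features = rng.normal(size=(n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Activate a sparse handful of features and superpose their directions.
active = rng.choice(n, size=5, replace=False)
activation = features[active].sum(axis=0)  # what the "neurons" actually carry

# Random directions in high dimension are nearly orthogonal, so a dot
# product per feature mostly recovers the active set, despite every
# neuron participating in many features at once.
scores = features @ activation
top5 = np.argsort(scores)[-5:]
print("active: ", sorted(active.tolist()))
print("readout:", sorted(top5.tolist()))
```

The readout is only approximate, and the interference gets worse as more features activate at once, which is part of why untangling real models is so hard.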
-
How to Write a Good index.html File
Every web developer has been there: you're starting a new project and staring at an empty file called index.html. You try to remember: which tags were meant to go in the <head> again? Which meta tags are best practice, and which ones are deprecated? Recently, I ...
-
🚧 [Project] HTML Starter Template
A comprehensive index.html template (with comments) to get up and running quickly when starting a new website.
-
🔗 [Link] Claude Think Tool
The Anthropic team has discovered an interesting approach to LLM thinking capabilities. Instead of making the model think deeply before answering or taking an action, they experimented with giving the model a think tool. The think tool does nothing but register a thought in the state. However, it does allow the model to decide when it's appropriate to stop and think more carefully about the current state and the best approach to move forward.
The thinking done with the think tool won't be as deep, and it will be more focused on newly obtained information. Therefore, the think tool is especially useful when the model has to carefully analyze the outputs of complex tools and act on them thoughtfully.
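Here's a sketch of what registering such a tool could look like with the Anthropic Messages API; the tool description text and model name below are illustrative, not Anthropic's exact published version:

```python
import anthropic

# A no-op tool: it changes nothing, it just gives the model a place to
# pause and record a thought between other tool calls.
think_tool = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It will not obtain new "
        "information or change any state; it just logs the thought. Use it "
        "when you need to reason carefully about prior tool results."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to think about."}
        },
        "required": ["thought"],
    },
}

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative model name
    max_tokens=1024,
    tools=[think_tool],
    messages=[{"role": "user", "content": "Plan a multi-step refund workflow."}],
)
print(response.content)
```

The trick is that the model itself decides when to call it, so the extra thinking lands exactly where the task gets complicated.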
-
🔗 [Quote] 🔭 The Einstein AI model
These benchmarks test if AI models can find the right answers to a set of questions we already know the answer to. However, real scientific breakthroughs will come not from answering known questions, but from asking challenging new questions and questioning common conceptions and previous ideas. - Thomas Wolf
Interesting reflection from Thomas Wolf of HuggingFace. Current LLMs have limited potential to make breakthroughs since they cannot "think out-of-the-box" beyond their training data. We might be able to give LLMs the ability to explore outside their known world through mechanisms like reinforcement learning with live environment feedback, or other mechanisms we haven't thought of yet. Still, significant breakthroughs will be hard for LLMs, since the real breakthroughs that make a huge impact are usually very far from established knowledge - very far from the AI model's current probability space.
-
About the Dead Internet Theory and AI
The Dead Internet Theory is an idea that has gained a lot of traction recently. I have to admit, the first time it was explained to me, I felt an eerie realization, like I had already been experiencing it but hadn't paid much attention to it. The first moment, I felt sc...
-
The Rise Of Reasoner Models: Scaling Test-Time Compute
A new kind of LLM has recently been popping up everywhere: reasoner models. Kickstarted by OpenAI's o1 and o3, these models are a bit different from the rest. They particularly shine when dealing with mathematical problems and coding challenges, where success depends on...
-
AI in 2024: Year in Review and Predictions for 2025
The past year has been transformative for artificial intelligence, marked by breakthrough innovations, emerging regulations, and a shift toward practical AI tools that enhance productivity. As we look ahead to 2025, let's review the major developments of 2024 and explore what th...