AI Arena Elo History: Track Model Performance Degradation Over Time


There’s a recurring complaint in the AI space: models get worse after launch. The GPT-4 you used in March isn’t the same as the GPT-4 running today. It’s been quantized, censored, or subtly lobotomised to save compute. This is the “nerfing” thesis, and it has been background noise at every major model launch for the last two years.

But proving it is hard. You need longitudinal data, ideally from a consistent blind evaluation pipeline. The LMSYS Chatbot Arena provides exactly that (thousands of crowdsourced head-to-head votes, updated daily), but nobody had plotted it as a timeline until Erwin Mayer built this.

AI Arena Elo History (github.com/mayerwin/AI-Arena-History) is a one-page chart that plots each lab’s highest-rated flagship model over time. Data comes fresh from the official Arena dataset on Hugging Face every day.

Why This Exists

The core question: do AI models degrade post-launch? Users report that models like Claude, GPT-4, and Gemini feel worse months after release: more refusals, weaker reasoning, a different personality. Labs rarely acknowledge these changes. The Arena dataset is the closest thing to an impartial witness.

The chart doesn’t answer the question definitively, but it surfaces the data in a form where trends are visible at a glance.

How the Chart Works

Each lab gets exactly one curve: the highest-rated flagship-eligible model at any given point. A few design decisions worth understanding:

  1. Flagship tracking only. If a lab ships Sonnet while Opus still ranks above it, the curve stays on Opus. You don’t get noise from mid-tier releases.

  2. Inference modes merged. Suffixes like -thinking, -reasoning, -high are collapsed into the parent model. The chart doesn’t flip-flop between modes.

  3. Release markers. New model launches appear as labeled points on the curve, often with a visible Elo jump.

  4. Downward trends visible. Between releases, you can see scores drift, but the caveats below matter.
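The first two design decisions can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the row layout, the suffix list, the lab and model names, the ratings, and the choice to give a merged model the best score among its variants are all assumptions, and the flagship-eligibility filter is not modeled.

```python
import re
from collections import defaultdict

# Assumed suffix list; the project's real list may be longer or different.
MODE_SUFFIX = re.compile(r"(?:-(?:thinking|reasoning|high))+$")

def base_model(name: str) -> str:
    """Collapse an inference-mode variant into its parent model name."""
    return MODE_SUFFIX.sub("", name)

def flagship_curve(rows):
    """Build one curve per lab from (date, lab, model, rating) rows.

    For each lab and date, keep only the highest-rated model after
    merging inference-mode variants (assumption: best variant wins).
    """
    best = {}
    for date, lab, model, rating in rows:
        key = (lab, date)
        candidate = (rating, base_model(model))
        if key not in best or candidate > best[key]:
            best[key] = candidate

    curves = defaultdict(list)
    # ISO date strings sort chronologically.
    for (lab, date), (rating, model) in sorted(best.items(), key=lambda kv: kv[0][1]):
        curves[lab].append((date, model, rating))
    return dict(curves)

# Made-up ratings for a hypothetical lab, purely for illustration.
rows = [
    ("2025-01-01", "LabA", "opus", 1300),
    ("2025-01-01", "LabA", "opus-thinking", 1310),
    ("2025-01-01", "LabA", "sonnet", 1280),
    ("2025-02-01", "LabA", "opus", 1295),
    ("2025-02-01", "LabA", "sonnet", 1285),
]
curves = flagship_curve(rows)
```

In this toy input the curve stays on the higher-rated model on both dates, and the -thinking variant is reported under its parent name rather than as a separate line.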

Caveats Worth Reading

The project is transparent about its limitations, which is rare and appreciated:

Web UIs vs API. Arena tests models through API endpoints. Consumer chat interfaces (chatgpt.com, gemini.google.com, claude.ai) add system prompts, safety filters, and wrappers that can change behaviour independently of the underlying model. A perceived nerf on the website may not show up here, and vice versa.

Elo is relative. When stronger models enter the arena, an unchanged model’s Elo can drift down even though the model itself hasn’t changed. Conversely, if every model regresses in parallel, the Elo won’t reveal it. A fixed-benchmark longitudinal dataset would be cleaner, but none exists publicly.
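Both halves of that caveat follow from the textbook Elo formulas, which the sketch below uses. It is a toy under stated assumptions: the Arena's actual rating computation may differ, and the starting ratings, the K-factor, and the 200-point "true" gap are invented for illustration. Game outcomes are replaced by their long-run averages so the run is deterministic.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# 1) Only the rating difference matters: if both models drop by
#    50 points, the predicted win rate is unchanged.
before = expected_score(1300, 1200)
after = expected_score(1250, 1150)

# 2) A stronger newcomer enters at a low default rating and climbs
#    by taking points from an opponent that has not changed at all.
def simulate_newcomer(old_rating=1200.0, new_rating=1000.0,
                      true_gap=200.0, k=32.0, games=500):
    # Old model's real win rate against a newcomer that is truly
    # `true_gap` points stronger (hypothetical value).
    s_old = expected_score(0.0, true_gap)
    for _ in range(games):
        e_old = expected_score(old_rating, new_rating)
        delta = k * (s_old - e_old)  # averaged update, no randomness
        old_rating += delta
        new_rating -= delta          # two-player updates are zero-sum
    return old_rating, new_rating

old_final, new_final = simulate_newcomer()
```

In this two-model pool the old model ends up well below its starting rating even though its strength never moved, while the gap between the two ratings settles at the assumed true gap.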

The page links to marginlab.ai’s Claude tracker as a complementary view that is more focused and potentially more sensitive to Claude-specific changes.

What Makes This Useful

The value is in having the data in one place, updated daily, with a consistent methodology. Before this, if you wanted to know whether a model had drifted, you’d rely on forum posts and vibes. Now there’s a chart.

It’s also open source and accepts PRs, which is good because the problem it tracks is only going to get more important as more models ship post-launch updates silently.


Crepi il lupo! 🐺