230B vs 235B: Why no comparison against Qwen3-235B-A22B-Thinking-2507?
Why does your model card have no direct comparison against the very similarly sized Qwen3-235B-A22B-Thinking-2507?
(I saw your reference to it being benchmarked by ArtificialAnalysis: artificialanalysis.ai/?models=minimax-m2%2Cglm-4-6-reasoning%2Cqwen3-235b-a22b-instruct-2507-reasoning)
PS: In the comments I saw people comparing MiniMax-M2 against GLM-4.6 (maybe because you also compare against it in your model card), but as far as I can see, most say that GLM-4.6 performed better for their task; well, this is no surprise, as GLM-4.6 is a much larger LLM at 357B.
MiniMax M2 is better than Qwen 235B A22B Thinking 2507 (the new one), I'm pretty sure.
There are indications that the opposite can also be true.
https://livebench.ai/
https://www.youtube.com/watch?v=XHbuFRupSvk
@Enderchef, not so fast: did you see the independent benchmarks where Qwen 235B A22B Thinking 2507 scores higher than MiniMax M2 (and yes, I still expected Qwen to win, despite Qwen being older)?
(And again, I'm not sure whether MiniMax M2 was benchmaxxed on ArtificialAnalysis, or whether that would even matter. I use ArtificialAnalysis quite a bit and I think it does provide a good first basic look: a bad score on AA means an LLM can't solve certain basic-ish things. The only question is how much AA can be benchmaxxed, but that's for another discussion.)
My theory: given that both LLMs have pretty much the same parameter count, are based on very roughly the same architecture, were released at roughly the same time (I know the field moves fast), and neither brings a major technological breakthrough (A22B vs A10B is a nice improvement, but it mainly affects the t/s speed; see the sketch below), the expected case is that they would score about the same overall (LLM A wins on certain tasks, while LLM B wins on others).
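To make the "it mainly affects the t/s speed" point concrete, here is a crude back-of-envelope sketch. The assumption that decode throughput scales roughly inversely with active parameters (memory-bandwidth-bound decoding, ignoring attention/KV-cache costs) is mine, not something from either model card; the parameter counts are the publicly stated specs (MiniMax M2: 230B total / ~10B active, Qwen3-235B-A22B: 235B total / ~22B active).

```python
# Rough sketch, under my own assumption: in a memory-bandwidth-bound decode
# regime, tokens/sec scales roughly with 1 / (active parameters per token),
# so the A10B vs A22B gap shows up mainly as generation speed, not capability.

def relative_decode_speed(active_params_a: float, active_params_b: float) -> float:
    """Approximate speed ratio of model A vs model B, assuming bandwidth-bound decode."""
    return active_params_b / active_params_a

# Publicly stated active-parameter counts (total params are nearly identical).
minimax_m2_active = 10e9   # MiniMax M2: ~10B active
qwen3_235b_active = 22e9   # Qwen3-235B-A22B-Thinking-2507: ~22B active

ratio = relative_decode_speed(minimax_m2_active, qwen3_235b_active)
print(f"MiniMax M2 could decode roughly {ratio:.1f}x faster per token "
      "under this crude assumption.")  # ~2.2x, ignoring attention/KV-cache costs
```

So, under this sketch, the A10B design buys speed and serving cost, which is exactly why I wouldn't expect it alone to move total benchmark scores much in either direction.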
@inputout, ty, I was about to post livebench. Global Average scores, for what it's worth:
Qwen 3 235B A22B Thinking 2507: 69.11
Qwen 3 Next 80B A3B Thinking: 64.57
Minimax M2: 64.26
GPT OSS 120b: 55.56 (interestingly)
Another one, for what it's worth (this one might be just a meta-benchmark): https://llm-stats.com/benchmarks/llm-leaderboard-full (click the Open button to show the open-weight LLMs); there, Qwen 235B A22B Thinking 2507 also scores higher than MiniMax M2.
In my own evaluation (tested through lmarena.ai), MiniMax-M2 answers the hardest question wrongly the majority of the time, and gets another one wrong as well, both of which Gpt-Oss-20b and Qwen3-30B-A3B-Instruct-2507 answer correctly on the first try. But MiniMax-M2 passes the rest of the easier questions, at least the ones I tried.
I know; I don't like to look at benchmarks, I use my own tests. In my tests, Qwen 235B A22B Thinking 2507 is a very low-intelligence model, and it can consume tens of dozens of thinking tokens on just "wait a minute".