Model eval

by kth8

Since this model is not currently supported by llama.cpp, I can't run most of my usual benchmarks against it (e.g., with openbench), so I thought I'd have the other hottest model out right now, Minimax M2.1, evaluate it instead.

Log: https://gist.github.com/kth8/24ed2ce338c4392b78d66b68f20fcb51

tl;dr of the issues encountered during this eval:

- Incomplete Implementations: The model frequently failed to write code when given a template with placeholders (e.g., methods containing only `pass`), instead returning the empty template unchanged.
- Repetitive Looping: The model got stuck in loops on several occasions, repeating the same line of code (like a function call) over and over until it hit its token limit.
- Generation of Irrelevant Code: After producing a correct answer, the model would often append thousands of characters of completely unrelated code, typically Django models, polluting the output.
- Bugs in Advanced Code: The model generated functionally incorrect code for advanced Python features. For instance, its implementation of a descriptor class was missing a required protocol method, so it failed at runtime (see the descriptor sketch after this list).
- Flawed Logic: The model produced code that ran without errors but did not behave as intended. A key example was an asynchronous rate-limiting function that accidentally serialized requests instead of running them concurrently (see the rate-limiter sketch below).
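
For context, here is a minimal sketch of what a complete Python data descriptor looks like; the class and attribute names are my own, not taken from the log, and the point is only that omitting one of the protocol methods (`__get__`/`__set__`) produces exactly the kind of runtime failure mentioned above:

```python
# Hypothetical example: a data descriptor that validates assignments.
# Names (Positive, Order, quantity) are illustrative, not from the eval log.
class Positive:
    def __set_name__(self, owner, name):
        # Remember the attribute name so values can be stored per instance.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:          # accessed on the class itself
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError(f"{self.name} must be positive")
        instance.__dict__[self.name] = value


class Order:
    quantity = Positive()


order = Order()
order.quantity = 3
print(order.quantity)  # 3
# order.quantity = -1  # would raise ValueError
# If __get__ were omitted, reading order.quantity before any assignment would
# return the Positive descriptor object itself; if __set__ were omitted, the
# validation would silently never run. Either omission only shows up at runtime.
```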

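And a minimal sketch of the rate-limiting flaw: the buggy pattern awaits each request inside the loop, which serializes everything, while the intended pattern bounds concurrency with a semaphore and gathers the tasks. Function names and timings here are my own illustration, not the code from the log:

```python
# Hypothetical illustration of the serialization bug; not the eval's actual code.
import asyncio


async def fetch(i: int) -> int:
    await asyncio.sleep(0.1)  # stand-in for a network call
    return i


async def run_serialized(n: int, limit: int) -> list[int]:
    # Buggy pattern: awaiting inside the loop runs requests one at a time,
    # so `limit` never matters and the total time is roughly n * 0.1s.
    results = []
    for i in range(n):
        results.append(await fetch(i))
    return results


async def run_concurrent(n: int, limit: int) -> list[int]:
    # Intended pattern: at most `limit` requests in flight at once.
    sem = asyncio.Semaphore(limit)

    async def guarded(i: int) -> int:
        async with sem:
            return await fetch(i)

    return await asyncio.gather(*(guarded(i) for i in range(n)))


async def main() -> None:
    for runner in (run_serialized, run_concurrent):
        start = asyncio.get_running_loop().time()
        await runner(10, limit=5)
        elapsed = asyncio.get_running_loop().time() - start
        print(f"{runner.__name__}: {elapsed:.2f}s")


if __name__ == "__main__":
    # Expect roughly 1.0s for the serialized version and 0.2s for the concurrent one.
    asyncio.run(main())
```
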
There's also a `# Final Report` section at the end of the log if you want to skip straight to that.
