Model eval

by kth8

Since this model is not currently supported by llama.cpp, I can't run most of my usual benchmarks against it (e.g., with openbench), so I thought I'd have the other hottest model out right now, Minimax M2.1, evaluate it instead.

Log: https://gist.github.com/kth8/24ed2ce338c4392b78d66b68f20fcb51

tl;dr of the issues encountered during this eval:

- Incomplete Implementations: The model frequently failed to write code when given a template with placeholders (e.g., methods containing only `pass`), instead returning the empty template unchanged.
- Repetitive Looping: The model got stuck in loops on several occasions, repeating the same line of code (like a function call) over and over until it hit its token limit.
- Generation of Irrelevant Code: After producing a correct answer, the model would often append thousands of characters of completely unrelated code, typically Django models, polluting the output.
- Bugs in Advanced Code: The model generated functionally incorrect code for advanced Python features. For instance, its implementation of a descriptor class was missing a required protocol method, so it failed at runtime (see the descriptor sketch after this list).
- Flawed Logic: The model produced code that ran without errors but did not behave as intended. A key example was an asynchronous rate-limiting function that accidentally serialized requests instead of running them concurrently (see the rate-limiter sketch below).
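
For context, here is a minimal sketch of what a complete Python data descriptor looks like; the class and attribute names are my own, not taken from the log, and the point is only that omitting one of the protocol methods (`__get__`/`__set__`) produces exactly the kind of runtime failure mentioned above:

```python
# Hypothetical example: a data descriptor that validates assignments.
# Names (Positive, Order, quantity) are illustrative, not from the eval log.
class Positive:
    def __set_name__(self, owner, name):
        # Remember the attribute name so values can be stored per instance.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:          # accessed on the class itself
            return self
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError(f"{self.name} must be positive")
        instance.__dict__[self.name] = value


class Order:
    quantity = Positive()


order = Order()
order.quantity = 3
print(order.quantity)  # 3
# order.quantity = -1  # would raise ValueError
# If __get__ were omitted, reading order.quantity before any assignment would
# return the Positive descriptor object itself; if __set__ were omitted, the
# validation would silently never run. Either omission only shows up at runtime.
```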

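And a minimal sketch of the rate-limiting flaw: the buggy pattern awaits each request inside the loop, which serializes everything, while the intended pattern bounds concurrency with a semaphore and gathers the tasks. Function names and timings here are my own illustration, not the code from the log:

```python
# Hypothetical illustration of the serialization bug; not the eval's actual code.
import asyncio


async def fetch(i: int) -> int:
    await asyncio.sleep(0.1)  # stand-in for a network call
    return i


async def run_serialized(n: int, limit: int) -> list[int]:
    # Buggy pattern: awaiting inside the loop runs requests one at a time,
    # so `limit` never matters and the total time is roughly n * 0.1s.
    results = []
    for i in range(n):
        results.append(await fetch(i))
    return results


async def run_concurrent(n: int, limit: int) -> list[int]:
    # Intended pattern: at most `limit` requests in flight at once.
    sem = asyncio.Semaphore(limit)

    async def guarded(i: int) -> int:
        async with sem:
            return await fetch(i)

    return await asyncio.gather(*(guarded(i) for i in range(n)))


async def main() -> None:
    for runner in (run_serialized, run_concurrent):
        start = asyncio.get_running_loop().time()
        await runner(10, limit=5)
        elapsed = asyncio.get_running_loop().time() - start
        print(f"{runner.__name__}: {elapsed:.2f}s")


if __name__ == "__main__":
    # Expect roughly 1.0s for the serialized version and 0.2s for the concurrent one.
    asyncio.run(main())
```
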
There's also a `# Final Report` section at the end of the log if you want to skip straight to that.
