Can Your LLM Think Like a Professional? Introducing ProfBench
Today's large language models (LLMs) are incredibly "book smart." They can write beautiful essays, answer trivia questions, and even pass bar exams—all within the boundaries of curated benchmarks. But do these benchmark results truly reflect strong performance on the real-world tasks handled by PhD- and MBA-level professionals?
The SWE-Bench dataset moves evaluation closer to real-world workflows by testing how well LLMs fix software bugs and build new features. However, a significant gap remains: the lack of high-quality, text-only datasets that mirror the complex reasoning tasks faced by professionals in fields like finance and materials science. We're not talking about simple Q&A or retrieval-based tasks. We're talking about multi-page assignments that require deep domain knowledge and reasoning. Can AI generate comprehensive reports by applying the nuanced reasoning of a PhD-level physicist or chemist, or an MBA-level consultant or financier? To accurately measure these advanced capabilities, we need a new benchmark: ProfBench, now supported directly within the NVIDIA NeMo Evaluator SDK.
The NeMo Evaluator SDK provides a scalable, reproducible way to run hundreds of benchmarks, built on top of popular evaluation harnesses including LM-eval-harness, simple-evals, and BigCode, and to compare model performance.
What is ProfBench?
ProfBench is a new benchmark designed to evaluate LLMs on complex, open-ended tasks that require professional-grade knowledge. The dataset contains over 7,000 response-criterion pairs across four deep-expertise domains:
- Finance MBA
- Consulting MBA
- Chemistry PhD
- Physics PhD
To understand the complexity, let's look at a Finance MBA example.
A user, acting as a senior partner at an investment bank, asks the AI to assess a potential new business unit focused on global health empowerment. The prompt isn't a single question; it's a multi-step assignment that includes:
1. Analyzing the history of the International Finance Facility for Immunization (IFFIm) and how it used securitization to raise money for the GAVI vaccine alliance
2. Detailing the technical aspects, factors for success, and risks involved
3. Assessing if IFFIm can serve as a "blueprint" for other global health initiatives
4. Identifying 3-5 other organizations that could use a similar model
5. Delivering the entire analysis in the style of a detailed investment memo, not just a list of answers
This is the kind of task that demands analysis, synthesis, and domain-specific knowledge far beyond simple fact retrieval.
Similarly, in a Chemistry PhD example, a user in a research lab asks the AI to perform the calculations required for a titration of a 100 mL mixture of acetic acid (0.5 M) and formic acid (0.1 M) with 0.5 M NaOH. The assignment includes the following steps (a worked sketch of one of them appears after the list):
1. Calculating the volume of NaOH titrant required to reach the point where the two conjugate bases have equal concentrations.
2. Calculating the concentrations of the acids and their conjugate bases at the point referenced in part 1.
3. Calculating the concentration of hydronium ions and the pH of the analyte at the point referenced in part 1.
4. Calculating the volume of NaOH titrant required to reach the point where the pH of the analyte is 7.0.
5. Calculating the concentrations of the acids and their conjugate bases at the point referenced in part 4.
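To give a sense of the reasoning these tasks probe, here is a minimal, independent sketch of part 4 of the assignment. It is not the reference solution: the dissociation constants (Ka ≈ 1.8e-5 for acetic acid, Ka ≈ 1.8e-4 for formic acid) are assumed textbook values, so the final digits can differ slightly from the rubric's reference answer.

```python
import math
from scipy.optimize import brentq

# Assumed textbook dissociation constants (not taken from the benchmark itself)
KA_ACETIC = 1.8e-5   # acetic acid
KA_FORMIC = 1.8e-4   # formic acid
KW = 1.0e-14

N_ACETIC = 0.100 * 0.5   # mol of acetic acid in the 100 mL analyte
N_FORMIC = 0.100 * 0.1   # mol of formic acid
C_NAOH = 0.5             # mol/L titrant
V_ANALYTE = 0.100        # L

def charge_balance(h, v_naoh):
    """[Na+] + [H3O+] - [CH3COO-] - [HCOO-] - [OH-] after adding v_naoh litres of NaOH."""
    v_total = V_ANALYTE + v_naoh
    acetate = (N_ACETIC / v_total) * KA_ACETIC / (KA_ACETIC + h)
    formate = (N_FORMIC / v_total) * KA_FORMIC / (KA_FORMIC + h)
    sodium = C_NAOH * v_naoh / v_total
    return sodium + h - acetate - formate - KW / h

def ph_after(v_naoh):
    """pH of the analyte after adding v_naoh litres of 0.5 M NaOH."""
    h = brentq(charge_balance, 1e-14, 1.0, args=(v_naoh,))
    return -math.log10(h)

# Part 4: volume of NaOH at which the analyte reaches pH 7.0
v_ph7 = brentq(lambda v: ph_after(v) - 7.0, 1e-4, 0.15)
print(f"V(NaOH) at pH 7.0: {v_ph7:.5f} L")  # ~0.119 L, consistent with the rubric's 0.11938 +/- 0.001 L
```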
What Makes ProfBench Special?
Fig 1. Distribution of Rubrics across categories and sub-categories.
The grading system is not just about getting a multiple-choice or short answer right. Instead, human experts write rubrics that evaluate the AI's work along three dimensions, using a diverse set of criteria:
- Extraction: Did it get the right data and details?
- Reasoning: Is the logic sound? Is the math correct? Are the conclusions justified?
- Style: Is the answer presented clearly, and in the requested format?
The LLM response is then graded on whether it fulfills rubric criteria such as the ones below (a sketch of how such grading can be automated follows the examples):
For the Finance MBA:
- Extraction: States that a breach of IFFIm’s liquidity policy could negatively impact IFFIm’s rating profile.
- Reasoning: States that vaccines are one of the most successful and cost-effective health investments in the world.
- Style: Present findings clearly to allow for effective use.
For the Chemistry PhD:
- Extraction: Determines the volume of NaOH titrant required to reach the point where the pH of the analyte is 7.0 as 0.11938 +/- 0.001 L.
- Reasoning: Determines the pH of the analyte at the point at which both acids are neutralized as 9.05 +/- 0.05.
- Style: The molecular weight is rounded to 1 decimal place.
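To illustrate how rubric-based grading of this kind can be automated with an LLM judge, here is a minimal sketch. The criterion text is taken from the Finance MBA example above, but the data schema, judge prompt, and query_llm helper are hypothetical placeholders rather than ProfBench's actual format or the NeMo Evaluator implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    domain: str       # e.g. "Finance MBA"
    dimension: str    # "Extraction", "Reasoning", or "Style"
    description: str  # what a fulfilling response must contain

# Example criterion drawn from the Finance MBA task above
criterion = RubricCriterion(
    domain="Finance MBA",
    dimension="Extraction",
    description=("States that a breach of IFFIm's liquidity policy could "
                 "negatively impact IFFIm's rating profile."),
)

JUDGE_PROMPT = """You are grading a professional report against one rubric criterion.
Criterion ({dimension}): {description}

Report:
{response}

Does the report fulfill the criterion? Answer YES or NO."""

def grade(response: str, criterion: RubricCriterion, query_llm) -> bool:
    """Ask a judge model whether the response satisfies a single criterion.

    query_llm is a placeholder callable (prompt -> completion string) that you
    would wire up to your judge model of choice.
    """
    prompt = JUDGE_PROMPT.format(
        dimension=criterion.dimension,
        description=criterion.description,
        response=response,
    )
    return query_llm(prompt).strip().upper().startswith("YES")
```

An overall score could then be computed, for example, as the fraction of criteria a response fulfills, broken down by dimension or domain as in Figure 1.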
How Was the Benchmark Created?
ProfBench was built by the very professionals whose expertise it is designed to measure. We recruited 38 professionals from 8 countries, all holding PhDs, MBAs, or equivalent work experience in their respective fields. Together, these experts contributed over 7,000 rubrics across 80 tasks.
These experts curated the prompts themselves, basing them on tasks they might assign to a junior colleague. Most importantly, they also wrote the detailed, multi-point grading rubrics from scratch.
To ensure true human-level authenticity and prevent model bias, we disallowed the use of LLMs at any stage of the annotation process. This is a benchmark built by human professionals for evaluating professional-grade AI.
Why Release ProfBench — and Why Now?
The lack of robust evaluation datasets is one of the biggest bottlenecks to advancing open-source models on complex, professional tasks. To make these evaluations seamless and reproducible, ProfBench is now fully supported through the NeMo Evaluator SDK, enabling automated, rubric-based scoring and side-by-side model comparisons out of the box.
Our primary goal is to spur progress across the open-source community by providing a clear, public benchmark—a true north for developing models and agentic systems that can tackle real-world business and science research challenges. Just as datasets like SWE-Bench have pushed the field forward, we see ProfBench as the next step in our contribution to the ecosystem, building on our work with open-source NVIDIA Nemotron models and training data.
This work also has immediate benefits for enterprise users, helping businesses that use AI apply rubric-based evaluations more effectively and giving them confidence in workflows and tools like LLM-as-a-Judge. Long term, this benchmark is foundational for building the next generation of models, ones that can provide real-world value to human professionals. For AI to become a true professional partner, it must move beyond simple knowledge recall to master complex, real-world reasoning. ProfBench provides that critical roadmap, showing where today’s AI stands and lighting the path toward solving problems that, until now, only human experts could solve.
How Do Today's Models Perform?
Fig 2. Cost of running full evaluation (16 samples per prompt) with human-identified reference documents against performance on ProfBench.
ProfBench poses a significant challenge even for state-of-the-art models. The top-performing model, GPT-5-High, scored just 65.9% overall when provided with human-identified reference documents (the easiest setting) and 49.4% in an LLM-only setup (the hardest setting). This demonstrates the substantial gap that remains between current AI models and expert-level professional performance. Notably, the model struggled most with the Physics domain, scoring only 49.3% even when provided with reference documents.
How to Use This Dataset
We're excited to see how the community uses ProfBench to test, fine-tune, and build the next generation of models and generative AI systems.
You can use the dataset out-of-the-box with the newly released NeMo Evaluator SDK. Note that because the reference documents are not included in this dataset, NeMo Evaluator only supports running these benchmarks with an LLM-only setup (hardest setting).
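If you want to explore the raw rubrics outside the SDK, a short sketch like the one below may be a useful starting point. The Hugging Face dataset path and split names are assumptions; check the ProfBench paper and release page for the authoritative location and schema.

```python
# Minimal exploration sketch. The dataset identifier below is an assumption;
# consult the ProfBench release notes for the actual path and field names.
from datasets import load_dataset

ds = load_dataset("nvidia/ProfBench")   # hypothetical Hugging Face path
print(ds)                               # available splits and column names
first_split = next(iter(ds.values()))
print(first_split[0])                   # inspect one response-criterion record
```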
ProfBench is released under the NVIDIA Evaluation Dataset License. Learn more in the paper. We look forward to seeing what you build with it.

