Update /test dataset to standardize difficulty measurement
Follow-up from #515925 (closed)
Background
Currently, our evaluation dataset for /test measures input difficulty in two inconsistent ways:
- Qualitatively: Using subjective "easy", "medium", and "hard" categories
- Quantitatively: Using a numerical scale from 0-20
This dual approach creates several issues:
- Inconsistent analysis across models and test cases
- No straightforward way to compare results measured on the two different scales
- Lack of clarity on what constitutes difficulty in each system
- Potential for subjective bias in qualitative assessments
Proposal
Develop and implement a unified, objective difficulty measurement system based on measurable code characteristics that can replace the current dual approach.
Implementation Plan/Idea
- Research and define objective metrics that correlate with code difficulty, for example:
  - Number of dependencies
  - Lines of code / token count
  - Nesting depth of control structures
  - Variable scope complexity
- Develop a formula that combines these metrics into a single difficulty score (a rough sketch of both steps follows this list)
- Validate the new scoring system against human assessments (e.g., via rank correlation; see the second sketch below)
- Update the evaluation pipeline to use the new difficulty measurement
- Migrate existing test cases to the new system
- Assess whether the scoring system can be reused by other features
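
As a starting point for the first two steps, here is a minimal sketch of metric extraction and a combined score for Python inputs, using only the standard-library `ast` module. The chosen metrics, the `DifficultyMetrics` container, and the placeholder `WEIGHTS` are illustrative assumptions rather than a settled design; real weights would come out of the validation step.

```python
import ast
from dataclasses import dataclass


@dataclass
class DifficultyMetrics:
    dependency_count: int   # number of import statements
    line_count: int         # non-blank source lines
    max_nesting_depth: int  # deepest nested control structure
    name_count: int         # distinct names, a rough proxy for scope complexity


def extract_metrics(source: str) -> DifficultyMetrics:
    """Derive objective metrics from a Python snippet via its AST."""
    tree = ast.parse(source)

    dependency_count = sum(
        isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)
    )
    line_count = sum(1 for line in source.splitlines() if line.strip())
    name_count = len({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})

    control_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.With)

    def depth(node: ast.AST, current: int = 0) -> int:
        # Track the deepest chain of nested control-flow nodes.
        best = current
        for child in ast.iter_child_nodes(node):
            next_level = current + 1 if isinstance(child, control_nodes) else current
            best = max(best, depth(child, next_level))
        return best

    return DifficultyMetrics(
        dependency_count=dependency_count,
        line_count=line_count,
        max_nesting_depth=depth(tree),
        name_count=name_count,
    )


# Placeholder weights; real values would be calibrated against
# human-labelled test cases during validation.
WEIGHTS = {
    "dependency_count": 1.0,
    "line_count": 0.05,
    "max_nesting_depth": 2.0,
    "name_count": 0.1,
}


def difficulty_score(metrics: DifficultyMetrics) -> float:
    """Collapse the individual metrics into a single difficulty score."""
    return sum(WEIGHTS[name] * getattr(metrics, name) for name in WEIGHTS)
```

The weighted sum is only the simplest possible combination; a calibrated regression or per-metric normalization could replace it once we have labelled data to fit against.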
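For the validation step, one option is to check rank agreement between the computed scores and the existing qualitative labels. The data below is invented purely to show the shape of the check and assumes the `difficulty_score()` sketch above:

```python
from scipy.stats import spearmanr

# Hypothetical example data: one entry per existing test case.
human_labels = [1, 1, 2, 3, 2, 3]                    # easy=1, medium=2, hard=3
computed_scores = [4.2, 5.1, 9.8, 14.0, 8.7, 12.3]   # output of difficulty_score()

# High rank correlation suggests the formula orders cases the way humans do.
rho, p_value = spearmanr(human_labels, computed_scores)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
```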
Success Criteria
- A single, standardized difficulty score for each test case
- Clear documentation on how difficulty is calculated
- Implementation in the evaluation framework
- Migration or update of existing test cases to the new system
Benefits
- More accurate performance comparisons across models
- Better identification of strengths/weaknesses in handling complex code
- Reduced subjectivity in test case creation and evaluation
- Improved ability to target specific complexity factors for improvement