Update /test dataset to standardize difficulty measurement
Follow-up from #515925 (closed)
Background
Currently, our evaluation dataset for /test measures input difficulty in two inconsistent ways:
- Qualitatively: Using subjective "easy", "medium", and "hard" categories
- Quantitatively: Using a numerical scale from 0-20
This dual approach creates several issues:
- Inconsistent analysis across models and test cases
- No straightforward way to compare results measured on the two different scales
- Lack of clarity on what constitutes difficulty in each system
- Potential for subjective bias in qualitative assessments
Proposal
Develop and implement a unified, objective difficulty measurement system based on measurable code characteristics that can replace the current dual approach.
Implementation Plan/Idea
- Research and define objective metrics that correlate with code difficulty, for example:
  - Number of dependencies
  - Lines of code / token count
  - Nesting depth of control structures
  - Variable scope complexity
- Develop a formula that combines these metrics into a single difficulty score (a rough sketch of both steps follows this list)
- Validate the new scoring system against human assessments (e.g., via rank correlation; see the second sketch below)
- Update the evaluation pipeline to use the new difficulty measurement
- Migrate existing test cases to the new system
- Assess whether the scoring system can be reused by other features
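
As a starting point for the first two steps, here is a minimal sketch of metric extraction and a combined score for Python inputs, using only the standard-library `ast` module. The chosen metrics, the `DifficultyMetrics` container, and the placeholder `WEIGHTS` are illustrative assumptions rather than a settled design; real weights would come out of the validation step.

```python
import ast
from dataclasses import dataclass


@dataclass
class DifficultyMetrics:
    dependency_count: int   # number of import statements
    line_count: int         # non-blank source lines
    max_nesting_depth: int  # deepest nested control structure
    name_count: int         # distinct names, a rough proxy for scope complexity


def extract_metrics(source: str) -> DifficultyMetrics:
    """Derive objective metrics from a Python snippet via its AST."""
    tree = ast.parse(source)

    dependency_count = sum(
        isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)
    )
    line_count = sum(1 for line in source.splitlines() if line.strip())
    name_count = len({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})

    control_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.With)

    def depth(node: ast.AST, current: int = 0) -> int:
        # Track the deepest chain of nested control-flow nodes.
        best = current
        for child in ast.iter_child_nodes(node):
            next_level = current + 1 if isinstance(child, control_nodes) else current
            best = max(best, depth(child, next_level))
        return best

    return DifficultyMetrics(
        dependency_count=dependency_count,
        line_count=line_count,
        max_nesting_depth=depth(tree),
        name_count=name_count,
    )


# Placeholder weights; real values would be calibrated against
# human-labelled test cases during validation.
WEIGHTS = {
    "dependency_count": 1.0,
    "line_count": 0.05,
    "max_nesting_depth": 2.0,
    "name_count": 0.1,
}


def difficulty_score(metrics: DifficultyMetrics) -> float:
    """Collapse the individual metrics into a single difficulty score."""
    return sum(WEIGHTS[name] * getattr(metrics, name) for name in WEIGHTS)
```

The weighted sum is only the simplest possible combination; a calibrated regression or per-metric normalization could replace it once we have labelled data to fit against.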
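For the validation step, one option is to check rank agreement between the computed scores and the existing qualitative labels. The data below is invented purely to show the shape of the check and assumes the `difficulty_score()` sketch above:

```python
from scipy.stats import spearmanr

# Hypothetical example data: one entry per existing test case.
human_labels = [1, 1, 2, 3, 2, 3]                    # easy=1, medium=2, hard=3
computed_scores = [4.2, 5.1, 9.8, 14.0, 8.7, 12.3]   # output of difficulty_score()

# High rank correlation suggests the formula orders cases the way humans do.
rho, p_value = spearmanr(human_labels, computed_scores)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
```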
Success Criteria
- A single, standardized difficulty score for each test case
- Clear documentation on how difficulty is calculated
- Implementation in the evaluation framework
- Migration or update of existing test cases to the new system
Benefits
- More accurate performance comparisons across models
- Better identification of strengths/weaknesses in handling complex code
- Reduced subjectivity in test case creation and evaluation
- Improved ability to target specific complexity factors for improvement