Update /test dataset to standardize difficulty measurement

Follow up from: #515925 (closed)

Background

Currently, our evaluation dataset for /test measures input difficulty in two inconsistent ways:

  1. Qualitatively: Using subjective "easy", "medium", and "hard" categories
  2. Quantitatively: Using a numerical scale from 0-20

This dual approach creates several issues:

  • Inconsistent analysis across models and test cases
  • Difficulty comparing results measured on the two different scales
  • Lack of clarity on what constitutes difficulty in each system
  • Potential for subjective bias in qualitative assessments

Proposal

Develop and implement a unified, objective difficulty measurement system based on measurable code characteristics that can replace the current dual approach.

Implementation Plan/Idea

  1. Research and define objective metrics that correlate with code difficulty. For example:
    • Number of dependencies
    • Lines of code / token count
    • Nested control structures
    • Variable scope complexity
  2. Develop a formula that combines these metrics into a single difficulty score (a rough sketch follows this list)
  3. Validate the new scoring system against human assessments (see the second sketch below)
  4. Update the evaluation pipeline to use the new difficulty measurement
  5. Migrate existing test cases to the new system
  6. Assess if the system can be leveraged across other features
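As a rough illustration of step 2, the sketch below combines a few of the candidate metrics into one score. The weights, the per-metric caps, and the choice to normalise onto the existing 0-20 scale are all assumptions made for illustration, not a proposed final formula; it uses Python's ast module only as one way the metrics could be extracted.

```python
import ast

# Hypothetical weights and caps; the real values would come out of the validation step.
WEIGHTS = {"dependencies": 0.25, "loc": 0.25, "nesting": 0.30, "scope": 0.20}
CAPS = {"dependencies": 10, "loc": 500, "nesting": 6, "scope": 20}


def max_nesting_depth(tree: ast.AST) -> int:
    """Deepest level of nested control structures (if/for/while/try/with)."""
    control = (ast.If, ast.For, ast.While, ast.Try, ast.With)

    def depth(node, current=0):
        current += isinstance(node, control)
        children = [depth(child, current) for child in ast.iter_child_nodes(node)]
        return max([current] + children)

    return depth(tree)


def difficulty_score(source: str) -> float:
    """Combine objective metrics into a single score on the existing 0-20 scale."""
    tree = ast.parse(source)
    raw = {
        "dependencies": sum(isinstance(n, (ast.Import, ast.ImportFrom)) for n in ast.walk(tree)),
        "loc": len([line for line in source.splitlines() if line.strip()]),
        "nesting": max_nesting_depth(tree),
        "scope": sum(isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef,
                                    ast.ClassDef, ast.Lambda)) for n in ast.walk(tree)),
    }
    # Normalise each metric to [0, 1] against its cap, then take a weighted sum rescaled to 0-20.
    return 20 * sum(WEIGHTS[k] * min(raw[k] / CAPS[k], 1.0) for k in WEIGHTS)
```

In this shape, each test case's input would simply be passed through `difficulty_score` when the dataset is built, so the score is reproducible from the code itself rather than assigned by hand.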
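For step 3, one possible validation check is rank correlation between the automated scores and the legacy easy/medium/hard labels. The mapping of labels to ordinal ranks and the use of scipy's Spearman correlation are assumptions for this sketch; the actual validation protocol is still to be defined.

```python
from scipy.stats import spearmanr

# Hypothetical mapping of the legacy qualitative labels onto an ordinal scale.
LABEL_RANK = {"easy": 0, "medium": 1, "hard": 2}


def validate_against_labels(scores: list[float], labels: list[str]) -> tuple[float, float]:
    """Spearman rank correlation between automated scores and legacy labels.

    A strong positive correlation would suggest the formula preserves the
    ordering the qualitative categories were meant to capture.
    """
    ranks = [LABEL_RANK[label] for label in labels]
    rho, p_value = spearmanr(scores, ranks)
    return rho, p_value
```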

Success Criteria

  • A single, standardized difficulty score for each test case
  • Clear documentation on how difficulty is calculated
  • Implementation in the evaluation framework
  • Migration or update of existing test cases to the new system

Benefits

  • More accurate performance comparisons across models
  • Better identification of strengths/weaknesses in handling complex code
  • Reduced subjectivity in test case creation and evaluation
  • Improved ability to target specific complexity factors for improvement
