feat: add benchmark support for code and agentic capabilities

- Add HumanEval and BFCL benchmarks to ModelSpec interface
- Populate benchmark scores with verified data from online sources
- Add collapsible benchmark explanations section to reference docs
- Make all reference documentation sections collapsible
- Add sortable benchmark columns to model specifications table
- Add benchmark selector dropdown to performance chart
- Filter legacy models out of charts so only current agentic models are shown
- Display models without scores in a separate section below the chart

Benchmark data sources are documented in lib/data.ts.
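
For reviewers, a rough sketch of how the chart-side behavior described above could fit together. This is illustrative only and is not the code in this commit: the field names `humanEval`, `bfcl`, and `legacy`, the `BenchmarkKey` type, and the `partitionByBenchmark` helper are all assumptions, and the real ModelSpec in lib/data.ts may be shaped differently.

```typescript
// Hypothetical shape; the actual ModelSpec interface may differ.
interface ModelSpec {
  name: string;
  legacy?: boolean;   // assumed flag marking legacy (non-agentic) models
  humanEval?: number; // HumanEval score, undefined when no score is available
  bfcl?: number;      // BFCL score, undefined when no score is available
}

// Benchmarks a selector dropdown could offer (assumed keys).
type BenchmarkKey = "humanEval" | "bfcl";

// Drop legacy models, then split the rest into models that have a score
// for the selected benchmark (sorted highest first) and models that do not.
function partitionByBenchmark(
  models: ModelSpec[],
  key: BenchmarkKey
): { scored: ModelSpec[]; unscored: ModelSpec[] } {
  const current = models.filter((m) => !m.legacy);
  const scored = current
    .filter((m) => m[key] !== undefined)
    .sort((a, b) => (b[key] as number) - (a[key] as number));
  const unscored = current.filter((m) => m[key] === undefined);
  return { scored, unscored };
}

// Placeholder entries with made-up numbers, not real benchmark results.
const sample: ModelSpec[] = [
  { name: "model-a", humanEval: 50 },
  { name: "model-b" },
  { name: "model-c", humanEval: 80 },
  { name: "model-old", legacy: true, humanEval: 99 },
];

const { scored, unscored } = partitionByBenchmark(sample, "humanEval");
console.log(scored.map((m) => m.name), unscored.map((m) => m.name));
```

Keeping the scores optional lets a single helper feed both the chart (scored) and the separate list below it (unscored) without sentinel values such as 0 or -1.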