Replace lang with data-lang in Rouge and AsciiDoc
Part of: lang attribute must have a valid value
Depends on: Step 1 (CodeLanguageFilter refactoring), Step 2 (html_gitlab.rb refactoring), and Step 3 (Commonmarker feature flag)
What does this MR do and why?
This MR completes the transition from invalid lang attributes to valid data-lang attributes for accessibility compliance by:
- Rouge formatter: Updates `lib/rouge/formatters/html_gitlab.rb` to output `data-lang` instead of `lang`
- AsciiDoc adapter: Updates `lib/gitlab/asciidoc/syntax_highlighter/html_pipeline_adapter.rb` to use `data-lang`
- CodeLanguageFilter: Implements complete language extraction with a prioritization system
- Sanitization: Allows `data-lang` attributes in the AsciiDoc sanitization filter
- Tests: Updates ~25 spec files with new attribute expectations
Key Changes
Backend Changes:
- Rouge formatter: `<span class="line" lang="ruby">` → `<span class="line" data-lang="ruby">`
- AsciiDoc adapter: `<pre lang="ruby">` → `<pre data-lang="ruby">`
- CodeLanguageFilter: Complete prioritization system for language extraction:
  - Priority 1: Existing `data-lang` attributes
  - Priority 2: CSS classes (`class="language-ruby"`)
  - Priority 3: Legacy `lang` attributes
- Sanitization: Allows `data-lang` on `<pre>` tags
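The priority order above can be illustrated with a minimal sketch. This is not the actual `CodeLanguageFilter` code: `extract_language` and `LANGUAGE_CLASS_PREFIX` are hypothetical names, and the node is modelled as a plain Hash of attributes rather than a parsed HTML element.

```ruby
# Hypothetical illustration of the three-level priority; not the real filter.
# `attrs` is a plain Hash of attribute name => value for one <pre>/<code> node.
LANGUAGE_CLASS_PREFIX = 'language-'

def extract_language(attrs)
  # Priority 1: an existing data-lang attribute wins.
  data_lang = attrs['data-lang'].to_s.strip
  return data_lang unless data_lang.empty?

  # Priority 2: a CSS class such as "language-ruby".
  css_class = attrs['class'].to_s.split.find { |c| c.start_with?(LANGUAGE_CLASS_PREFIX) }
  if css_class
    from_class = css_class.delete_prefix(LANGUAGE_CLASS_PREFIX)
    return from_class unless from_class.empty?
  end

  # Priority 3: the legacy lang attribute.
  legacy = attrs['lang'].to_s.strip
  legacy.empty? ? nil : legacy
end
```

With this sketch, a node that carries all three sources resolves to its `data-lang` value, and a node with none of them yields `nil`.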
Test Updates:
- Updates expectations from `lang="..."` to `data-lang="..."`
- Adds comprehensive tests for language prioritization
- Tests CSS class extraction and cleanup logic
Why this approach?
This step directly addresses the accessibility issue identified in the audit:
❌ Before: `lang="ruby"` (invalid - not a human language code)
✅ After: `data-lang="ruby"` (valid HTML5 data attribute)
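As a rough illustration of the "after" form, the sketch below builds the opening tag for a highlighted line. `line_open_tag` is a hypothetical helper, not the method used in `lib/rouge/formatters/html_gitlab.rb`; it only shows the attribute name change and escapes the value with the standard library's `ERB::Util.html_escape`.

```ruby
require 'erb'

# Hypothetical helper; the real formatter is lib/rouge/formatters/html_gitlab.rb.
# Emits the code language as a data attribute instead of the lang attribute,
# which HTML reserves for human language codes such as "en".
def line_open_tag(language)
  %(<span class="line" data-lang="#{ERB::Util.html_escape(language)}">)
end
```

Escaping the value keeps the attribute well-formed even if a language name contains quotes or angle brackets.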
The prioritization system ensures compatibility with:
- Step 3 output: CSS classes from Commonmarker (`class="language-ruby"`)
- Legacy formats: Existing `lang` attributes from AsciiDoc/Org mode
- Future formats: `data-lang` attributes from other sources
How to set up and validate locally
- Test Rouge formatter output:

      # Open any code file in GitLab
      # Inspect HTML - should see data-lang="language" instead of lang="language"

- Test AsciiDoc code blocks:

      # Create AsciiDoc file with code block:
      # [source,ruby]
      # ----
      # puts "hello"
      # ----
      # Should render with data-lang="ruby"

- Test language prioritization:

      # Test markdown with various language formats
      # CSS classes, lang attributes, data-lang should all work

- Run tests:

      bundle exec rspec spec/lib/rouge/formatters/html_gitlab_spec.rb
      bundle exec rspec spec/lib/banzai/filter/code_language_filter_spec.rb
      bundle exec rspec spec/lib/banzai/filter/syntax_highlight_filter_spec.rb
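For the manual inspection steps above, a small script can flag leftover `lang` attributes in copied HTML. This is a heuristic sketch, not part of the MR: `legacy_lang_attribute?` and `LEGACY_LANG_ATTR` are hypothetical names, and the regular expression only looks at `<pre>`, `<code>`, and `<span>` opening tags rather than parsing the document.

```ruby
# Hypothetical check for pasted HTML; a regex heuristic, not an HTML parser.
# Matches a bare lang= attribute on <pre>, <code>, or <span>. The attribute
# data-lang= does not match, because "lang=" must directly follow whitespace.
LEGACY_LANG_ATTR = /<(?:pre|code|span)\b[^>]*\slang=/i

def legacy_lang_attribute?(html)
  html.match?(LEGACY_LANG_ATTR)
end
```

Passing the rendered markup of a code block to `legacy_lang_attribute?` should return `false` once the formatter and adapter changes are in place; a legitimate `<html lang="en">` is ignored because only code-related tags are checked.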
Verification
- Rouge formatter outputs `data-lang` instead of `lang`
- AsciiDoc code blocks use `data-lang` attributes
- CodeLanguageFilter handles all input formats (CSS classes, `lang` attributes, `data-lang`)
- Language prioritization works correctly
- All syntax highlighting still works properly
- Copy-as-GFM functionality preserved (handled in previous steps)
- No accessibility violations for code language attributes
Related
- Previous steps: Step 1 (CodeLanguageFilter prep), Step 2 (html_gitlab.rb cleanup), Step 3 (Commonmarker feature flag)
- Next step: Step 5 - Frontend `copy_as_gfm.js` simplification
- Final goal: Complete elimination of invalid `lang` attributes for WCAG compliance
- Addresses: lang attribute must have a valid value
Edited by Joseph Fletcher