Replace lang with data-lang in Rouge and AsciiDoc

Part of: lang attribute must have a valid value
Depends on: Step 1 (CodeLanguageFilter refactoring), Step 2 (html_gitlab.rb refactoring), and Step 3 (Commonmarker feature flag)

What does this MR do and why?

This MR completes the transition from invalid lang attributes to valid data-lang attributes for accessibility compliance by:

  1. Rouge formatter: Updates lib/rouge/formatters/html_gitlab.rb to output data-lang instead of lang
  2. AsciiDoc adapter: Updates lib/gitlab/asciidoc/syntax_highlighter/html_pipeline_adapter.rb to use data-lang
  3. CodeLanguageFilter: Implements complete language extraction with a prioritization system
  4. Sanitization: Allows data-lang attributes in AsciiDoc sanitization filter
  5. Tests: Updates ~25 spec files with new attribute expectations

Key Changes

Backend Changes:

  • Rouge formatter: <span class="line" lang="ruby"> → <span class="line" data-lang="ruby">
  • AsciiDoc adapter: <pre lang="ruby"> → <pre data-lang="ruby">
  • CodeLanguageFilter: Complete prioritization system for language extraction:
    1. Priority 1: Existing data-lang attributes
    2. Priority 2: CSS classes (class="language-ruby")
    3. Priority 3: Legacy lang attributes
  • Sanitization: Allows data-lang on <pre> tags
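
As a rough illustration of the sanitization change, the sketch below models an attribute allowlist as a plain Hash and filters a node's attributes against it. The constant ALLOWED_ATTRIBUTES and the helper sanitize_attributes are hypothetical names for this example only and are not the actual AsciiDoc sanitization filter. Whether the legacy lang attribute stays on the allowlist is not stated here, so dropping it is an assumption.

```ruby
# Hypothetical allowlist: element name => attribute names that survive sanitization.
# Assumption: data-lang is permitted on <pre>; lang is deliberately left out.
ALLOWED_ATTRIBUTES = {
  'pre'  => %w[class data-lang],
  'code' => %w[class]
}.freeze

# Returns only the attributes that the allowlist permits for the given element.
# `attrs` is a plain Hash standing in for a parsed node's attributes.
def sanitize_attributes(element, attrs)
  allowed = ALLOWED_ATTRIBUTES.fetch(element, [])
  attrs.select { |name, _value| allowed.include?(name) }
end

# Usage: data-lang survives on <pre>, everything not allowlisted is dropped.
sanitize_attributes('pre', 'data-lang' => 'ruby', 'onclick' => 'x')
```

Keeping the allowlist per element means data-lang is accepted only where code blocks are expected, rather than on every tag.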

Test Updates:

  • Updates expectations from lang="..." to data-lang="..."
  • Adds comprehensive tests for language prioritization
  • Tests CSS class extraction and cleanup logic

Why this approach?

This step directly addresses the accessibility issue identified in the audit:

  • Before: lang="ruby" (invalid - not a human language code)
  • After: data-lang="ruby" (valid HTML5 data attribute)

The prioritization system ensures compatibility with:

  • Step 3 output: CSS classes from Commonmarker (class="language-ruby")
  • Legacy formats: Existing lang attributes from AsciiDoc/Org mode
  • Future formats: data-lang attributes from other sources
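
The priority order described in this MR can be sketched in a few lines of plain Ruby. This is a minimal illustration and not the CodeLanguageFilter implementation: the method extract_language, the LANGUAGE_CLASS pattern, and the use of a Hash in place of a parsed HTML node are all assumptions made for the example.

```ruby
# Hypothetical pattern for a class token such as "language-ruby".
LANGUAGE_CLASS = /(?:\A|\s)language-(\S+)/

# Returns the language name for a code block, or nil if none is present.
# `attrs` is a plain Hash of attribute name => value (a stand-in for a node).
def extract_language(attrs)
  # Priority 1: an existing data-lang attribute wins.
  data_lang = attrs['data-lang'].to_s.strip
  return data_lang unless data_lang.empty?

  # Priority 2: a CSS class such as class="language-ruby".
  match = LANGUAGE_CLASS.match(attrs['class'].to_s)
  return match[1] if match

  # Priority 3: a legacy lang attribute.
  legacy = attrs['lang'].to_s.strip
  legacy.empty? ? nil : legacy
end

# Usage: all three sources are present, so data-lang takes precedence.
extract_language('data-lang' => 'ruby', 'class' => 'language-python', 'lang' => 'go')
```

Checking the sources in a fixed order keeps the result deterministic when a node carries more than one of them, for example a legacy lang attribute alongside a Commonmarker CSS class.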

How to set up and validate locally

  1. Test Rouge formatter output:

    # Open any code file in GitLab
    # Inspect HTML - should see data-lang="language" instead of lang="language"
  2. Test AsciiDoc code blocks:

    # Create AsciiDoc file with code block:
    # [source,ruby]
    # ----
    # puts "hello"
    # ----
    # Should render with data-lang="ruby"
  3. Test language prioritization:

    # Render Markdown whose code blocks specify the language in different ways
    # CSS classes, lang attributes, and data-lang attributes should all be recognized
  4. Run tests:

    bundle exec rspec spec/lib/rouge/formatters/html_gitlab_spec.rb
    bundle exec rspec spec/lib/banzai/filter/code_language_filter_spec.rb
    bundle exec rspec spec/lib/banzai/filter/syntax_highlight_filter_spec.rb
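
The manual HTML inspection in steps 1 and 2 can be approximated with a small script that scans a rendered fragment for leftover lang attributes. This is a rough sketch, not part of the MR: the helper legacy_lang_attributes and the LEGACY_LANG pattern are hypothetical, and a regex scan is only a quick check, not a substitute for a real HTML parser. It also flags legitimate human-language attributes such as lang="en", so it is meant for code-block fragments only.

```ruby
# Matches lang="..." but skips data-lang="..." via the negative lookbehind,
# which rejects a preceding word character or hyphen.
LEGACY_LANG = /(?<![\w-])lang\s*=\s*["'][^"']*["']/

# Returns every bare lang attribute found in the HTML string.
def legacy_lang_attributes(html)
  html.scan(LEGACY_LANG)
end

# Usage: an empty result means no bare lang attribute remains in the fragment.
legacy_lang_attributes('<pre data-lang="ruby"><span class="line" data-lang="ruby">')
```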

Verification

  • Rouge formatter outputs data-lang instead of lang
  • AsciiDoc code blocks use data-lang attributes
  • CodeLanguageFilter handles all input formats (CSS classes, lang attributes, data-lang)
  • Language prioritization works correctly
  • All syntax highlighting still works properly
  • Copy-as-GFM functionality preserved (handled in previous steps)
  • No accessibility violations for code language attributes

Related

  • Previous steps: Step 1 (CodeLanguageFilter prep), Step 2 (html_gitlab.rb cleanup), Step 3 (Commonmarker feature flag)
  • Next step: Step 5 - Frontend copy_as_gfm.js simplification
  • Final goal: Complete elimination of invalid lang attributes for WCAG compliance
  • Addresses: lang attribute must have a valid value

Edited by Joseph Fletcher
