
SYCL: Add PackedFloat3 for AMD CDNA2 devices

Merged Andrey Alekseenko requested to merge aa-unsplit-fcibuf-fastfloat3 into main

AMD CDNA2 devices, such as the MI250, reach their full advertised FP32 throughput only when operating on packed FP32 values, i.e. processing two floats per instruction in a 2-wide SIMD fashion. The compiler can apply this optimization automatically in most cases, but as of ROCm 5.7 the automatic packing introduces extra inefficiencies. Therefore, we use packed float2 explicitly in a few critical places.
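To illustrate the idea of using packed float2 explicitly, here is a minimal sketch of a 3-vector that keeps x and y together in one 2-wide value and z as a scalar. It is not the code from this merge request: `Float2` is a hypothetical stand-in for a vector type such as `sycl::float2` so the example builds without a SYCL toolchain, and `PackedFloat3Sketch`, its layout, and its operators are illustrative assumptions.

```cpp
#include <cassert>

// Hypothetical stand-in for a 2-wide vector type such as sycl::float2,
// used here so that the sketch compiles with any C++ compiler.
struct Float2
{
    float x, y;
};

inline Float2 operator+(Float2 a, Float2 b)
{
    return { a.x + b.x, a.y + b.y };
}

inline Float2 operator*(Float2 a, float s)
{
    return { a.x * s, a.y * s };
}

// Illustrative packed 3-vector (assumed layout): x and y share one 2-wide
// value, z is stored as a separate scalar.
struct PackedFloat3Sketch
{
    Float2 xy;
    float  z;

    PackedFloat3Sketch(float x, float y, float zIn) : xy{ x, y }, z(zIn) {}

    // Read-only component access; the index must be 0, 1, or 2.
    float operator[](int i) const
    {
        assert(i >= 0 && i < 3);
        return i == 0 ? xy.x : (i == 1 ? xy.y : z);
    }

    PackedFloat3Sketch& operator+=(const PackedFloat3Sketch& other)
    {
        xy = xy + other.xy; // x and y are updated through one 2-wide operation
        z += other.z;       // z is updated separately
        return *this;
    }

    PackedFloat3Sketch operator*(float s) const
    {
        PackedFloat3Sketch result(*this);
        result.xy = result.xy * s;
        result.z *= s;
        return result;
    }
};
```

With a real 2-wide vector type in place of `Float2`, the x and y updates are written as a single vector operation in the source, which is the form the description above refers to as packed float2. Whether the compiler then emits packed instructions depends on the toolchain and target.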

2-5% speed-up of NBNXM kernels on MI250X.

See also #4874

Fixes #4854 (closed)

Edited by Andrey Alekseenko


Activity

  • added 1 commit

    • fd660059 - Add asserts and inlining attributes

  • added 1 commit

    • 04adcba1 - Add asserts and inlining attributes

  • Andrey Alekseenko marked this merge request as ready

  • Szilárd Páll resolved all threads

  • Szilárd Páll approved this merge request

  • Andrey Alekseenko changed the description

  • added 1 commit

    • 3155dea1 - Add a macro to toggle use of packed float3

  • added 1 commit

    • ddb78ba2 - Move the struct to sycl_kernel_utils

  • Magnus Lundborg approved this merge request

  • Szilárd Páll approved this merge request

  • Szilárd Páll added 13 commits

  • Szilárd Páll resolved all threads
