---
layout: post
title: Vulkan compute convolution/cross-correlation
lang: en
category: code
tags: code
---
A simple headless Vulkan compute shader example using the
C++ [vulkan.hpp](https://github.com/KhronosGroup/Vulkan-Hpp) API.
The example computes the valid 2D cross-correlation
(i.e. with a non-flipped kernel, which is sometimes also called convolution)
of some sample data with a kernel. The square input and kernel sizes,
as well as the work group size, can be specified at runtime.
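To make precise what is computed, here is a minimal CPU reference sketch of the
valid cross-correlation; the function name and data layout are illustrative only,
not the shader or host code from the repository:

```cpp
#include <cstddef>
#include <vector>

// Valid 2D cross-correlation of a square n x n input with a square k x k
// kernel. The kernel is not flipped (unlike a true convolution), and only
// positions where the kernel lies entirely inside the input are computed,
// so the output shrinks to (n - k + 1) x (n - k + 1).
std::vector<float> cross_correlate(const std::vector<float>& input, std::size_t n,
                                   const std::vector<float>& kernel, std::size_t k)
{
    const std::size_t out = n - k + 1;
    std::vector<float> output(out * out, 0.0f);
    for (std::size_t y = 0; y < out; ++y)
        for (std::size_t x = 0; x < out; ++x)
            for (std::size_t j = 0; j < k; ++j)
                for (std::size_t i = 0; i < k; ++i)
                    output[y * out + x] += input[(y + j) * n + (x + i)] * kernel[j * k + i];
    return output;
}
```

In the Vulkan version the same arithmetic runs in a compute shader, where the
per-output-element loops are typically replaced by the dispatch over shader
invocations.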
## Building
The source code is available from
[GitLab](https://gitlab.com/tobiasrautenkranz/vulkan_compute_convolution),
or with pre-built binaries as
[vulkan_compute_convolution.tar.gz]({{ "/assets/vulkan_compute_cross_correlation.tar.gz"
| relative_url }}).
To build the code you need:
* GNU Make
* a C++14 compiler
* the Vulkan 1.0 library and headers (e.g. from Debian stretch-backports)
* glslangValidator to compile the shader
  (e.g. from the [LunarG Vulkan SDK](https://vulkan.lunarg.com/sdk/home))
For benchmarking:
* [Google benchmark](https://github.com/google/benchmark)
(libbenchmark-dev on Debian)
* [R](https://www.r-project.org/)
## Performance
On an Intel i5-3317U, running the computation on the GPU via the Vulkan API
is up to twice as fast as running it on the CPU.
![]({{ "/img/vk_convolution/convolution_gpu_cpu.svg" | relative_url }})
The FLOPS (floating-point operations per second) are relative to the mean of all
CPU samples. The plot is split by the two kernel sizes, 3 and 5.
As expected, the FLOPS of the CPU are relatively constant
with respect to the input size, since the CPU is easily fully utilized.
The GPU, on the other hand, appears to be underutilized, as its performance
increases with increasing input size and thus workload.
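How the floating-point operations are counted for this normalization is not
spelled out here; a plausible count for a valid cross-correlation (one multiply
and one add per kernel tap and output element) is sketched below, with the
function name purely illustrative:

```cpp
#include <cstddef>
#include <cstdio>

// Plausible FLOP count for a valid cross-correlation: each of the
// (n - k + 1)^2 output elements takes k^2 multiplies and k^2 additions.
// This is an assumption for illustration, not necessarily the exact
// formula used by the benchmark.
std::size_t correlation_flops(std::size_t n, std::size_t k)
{
    const std::size_t out = n - k + 1;
    return 2 * k * k * out * out;
}

int main()
{
    std::printf("input 512, kernel 5: %zu FLOPs per run\n", correlation_flops(512, 5));
    return 0;
}
```

Dividing such a count by the measured run time gives the FLOPS that are then
normalized to the CPU mean.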
### OpenCL
Since the GPU is the main limitation on performance, we expect similar
performance from Vulkan as from OpenCL.
The convolution samples of the
[AMD-APP-SDK-v3.0.130.136](https://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/#appsdkdownloads)
are used for comparison.
The OpenCL sample is somewhat different: the input is padded such that the
output has the same size as the input, which is compensated for in the FLOPS
calculation. Furthermore, the data type is _int_ instead of _float_; it is
converted to _float_ for the computation and then back to _int_
for storage in the output.
Comparing the Vulkan sample with the AMD APP SDK OpenCL sample for an input size of 512,
there is a notable difference for kernel size 5 (~7%).
Whether this difference is due to the drivers, the different data types,
or something else entirely is not clear.
Interestingly, the OpenCL local data store (LDS) seems to have little effect (~3%).
![]({{ "/img/vk_convolution/convolution_vk_cl.svg" | relative_url }})
## Code size
Comparing source code line counts, the complexity of the two implementations is
comparable: 750 lines of code for the Vulkan sample and 880 for the OpenCL one.
Again, the OpenCL example is a bit different, mainly in that it also supports a
separable convolution.
## Links
Some references that were used:
* [Vulkan Tutorial](https://vulkan-tutorial.com/)
* [Vulkan C++ examples and demos](https://github.com/SaschaWillems/Vulkan)
  by Sascha Willems
* [Vulkan Minimal Compute](https://github.com/Erkaman/vulkan_minimal_compute)
  by Eric Arnebäck
* [Vulkan® 1.0.71 - A Specification](https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html)
  by the Khronos Vulkan Working Group