Guide

Backend

The OpenCL backend implements the ACC interface, which is exposed in Fortran and used throughout DBCSR's code base to drive (GPU-)acceleration based on ACC's device enumeration, data movement, and synchronization functionality. By design, DBCSR activates one device per rank (process). For instance, multiple GPUs can be used by running multiple ranks per system, with at least one rank per device. The LIBSMM library complements the backend and implements the ACC LIBSMM interface.
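The one-device-per-rank design can be pictured with a small sketch. The round-robin policy and the function name device_for_rank are assumptions made for illustration only; they are not taken from DBCSR's code, which may assign devices differently.

```python
def device_for_rank(rank: int, ndevices: int) -> int:
    """Return a zero-based device ID for a rank.

    Assumption: ranks are spread over devices round-robin. This is an
    illustrative policy, not necessarily what DBCSR does internally.
    """
    if ndevices <= 0:
        raise ValueError("no devices available")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return rank % ndevices


# Four ranks on a system with two GPUs: each device serves two ranks.
assignment = {rank: device_for_rank(rank, 2) for rank in range(4)}
```

With fewer ranks than devices, some devices stay idle under this policy, which is why at least one rank per device is needed to use all GPUs.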

All major GPU vendors support OpenCL even if the vendor-preferred programming model suggests otherwise. On Nvidia GPUs, the OpenCL backend can be used alongside CUDA-based GPU code in other portions of CP2K. The OpenCL-based backend provides the following benefits:

Code portability between GPU vendors (if not performance portability). For instance, performance of the OpenCL backend matches the performance of the CUDA backend or exceeds it.
Acceptable performance for kernels not covered by specifically tuned parameters, and the ability to run on GPU if no tuned parameters are present.
Auto-tuning kernels within an acceptable time limit, along with handy scripts to retune parameters and to carry forward an existing set of parameters (e.g., to a new GPU).

Runtime settings are made by means of environment variables. The OpenCL backend provides acc_getenv.sh to list all occurrences of getenv categorized into "OpenCL Backend environment variables" and "OpenCL LIBSMM environment variables". Common backend-related settings are:

ACC_OPENCL_DEVSPLIT: integer enabling devices to be split into subdevices (non-zero/default: subdevices, zero: aggregated).
ACC_OPENCL_DEVTYPE: character string selecting "cpu", "gpu", "all" (unfiltered), or any other string (neither CPU nor GPU).
ACC_OPENCL_DEVICE: non-negative integer number to select a device from the (internally enumerated) list of devices.
ACC_OPENCL_VENDOR: character string matching the vendor of the OpenCL device in a case-insensitive fashion, e.g., "intel".
ACC_OPENCL_VERBOSE: verbosity level (integer) with console output on stderr.
- ACC_OPENCL_VERBOSE=1: outputs the number of devices found and the name of the selected device.
- ACC_OPENCL_VERBOSE=2: outputs the duration needed to generate a requested kernel.
- ACC_OPENCL_VERBOSE=3: outputs device-side performance of kernels (every launch profiled).
ACC_OPENCL_DUMP: dump preprocessed kernel source code (1) or dump compiled OpenCL kernels (2).
- ACC_OPENCL_DUMP=1: dump preprocessed kernel source code and use it for JIT compilation. Instantiates the original source code using preprocessor definitions (-D) and collapses the code accordingly.
- ACC_OPENCL_DUMP=2: dump compiled OpenCL kernels (depends on OpenCL implementation), e.g., PTX code on Nvidia.
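As a minimal sketch of how these variables might be assembled before launching an application, the helper below builds an environment dictionary. The function name backend_env and its defaults are hypothetical; only the variable names come from the list above.

```python
import os


def backend_env(devtype="gpu", device=0, vendor=None, verbose=1, base=None):
    """Return a copy of an environment with ACC_OPENCL_* settings applied.

    Hypothetical helper: the defaults chosen here are illustrative and
    are not the backend's own defaults.
    """
    if device < 0:
        raise ValueError("ACC_OPENCL_DEVICE must be non-negative")
    env = dict(os.environ if base is None else base)
    env["ACC_OPENCL_DEVTYPE"] = devtype
    env["ACC_OPENCL_DEVICE"] = str(device)
    env["ACC_OPENCL_VERBOSE"] = str(verbose)
    if vendor is not None:
        # Matched case-insensitively by the backend, e.g., "intel".
        env["ACC_OPENCL_VENDOR"] = vendor
    return env


# Start from an empty base so the result contains only the settings made here.
env = backend_env(devtype="gpu", device=1, vendor="intel", verbose=2, base={})
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting whichever executable links against DBCSR.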

The OpenCL backend enumerates and orders devices by kind, i.e., GPU, CPU, and "other" (primary criterion), and by memory capacity (secondary criterion). Device IDs are zero-based as defined by the ACC interface, i.e., a valid ID is less than the number of devices reported by acc_get_ndevices.
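The two-level ordering can be sketched as a sort with a composite key. The Device type, the sample devices, and the choice to place larger memory first are assumptions for illustration; the text above does not state the direction of the secondary criterion.

```python
from typing import NamedTuple

# Primary criterion: GPUs first, then CPUs, then everything else ("other").
KIND_ORDER = {"gpu": 0, "cpu": 1}


class Device(NamedTuple):
    name: str       # hypothetical label
    kind: str       # "gpu", "cpu", or any other string
    mem_bytes: int  # memory capacity


def order_devices(devices):
    """Sort by kind, then by memory capacity (assumed: larger first)."""
    return sorted(devices, key=lambda d: (KIND_ORDER.get(d.kind, 2), -d.mem_bytes))


def is_valid_device_id(device_id, ndevices):
    """Zero-based IDs must be smaller than the device count."""
    return 0 <= device_id < ndevices


GIB = 1 << 30
found = [
    Device("cpu", "cpu", 64 * GIB),
    Device("gpu-small", "gpu", 8 * GIB),
    Device("accelerator", "other", 4 * GIB),
    Device("gpu-large", "gpu", 16 * GIB),
]
ordered = [d.name for d in order_devices(found)]
```

Under these assumptions, device ID 0 refers to the GPU with the most memory, and the largest valid ID is one less than the device count.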

LIBSMM

The LIBSMM library implements the ACC LIBSMM interface, and depends on the OpenCL backend.

Compile-time settings are (implicitly) documented and can be adjusted by editing opencl_libsmm.h, e.g., OPENCL_LIBSMM_VALIDATE is disabled by default but can be enabled for debugging purposes. The OPENCL_LIBSMM_VALIDATE compile-time setting enables side-by-side validation of matrix transpose and multiply operations between device and host. For example, running DBCSR's unit tests with OPENCL_LIBSMM_VALIDATE enabled produces console output that makes it possible to pinpoint a kernel that fails validation. Runtime settings are made by means of environment variables. The OpenCL backend provides acc_getenv.sh to list all occurrences of getenv categorized into "OpenCL Backend environment variables" and "OpenCL LIBSMM environment variables".
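The idea behind the side-by-side validation enabled by OPENCL_LIBSMM_VALIDATE can be illustrated with a host-only sketch: compute a reference on the host and compare it element-wise with the result to be checked. All function names and the tolerance are hypothetical, and the "device" results are plain lists standing in for data copied back from a device; this is not LIBSMM's validation code.

```python
def host_transpose(a):
    """Reference transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def host_multiply(a, b):
    """Reference product of an (m x k) and a (k x n) matrix."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def validate(result, reference, tol=1e-10):
    """Return True if both matrices have equal shape and agree within tol."""
    if len(result) != len(reference):
        return False
    for r_row, ref_row in zip(result, reference):
        if len(r_row) != len(ref_row):
            return False
        if any(abs(u - v) > tol for u, v in zip(r_row, ref_row)):
            return False
    return True


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
reference = host_multiply(a, b)

good = [[19.0, 22.0], [43.0, 50.0]]  # stand-in for a correct device result
bad = [[19.0, 22.0], [43.0, 51.0]]   # stand-in for a faulty kernel's output
ok_good = validate(good, reference)
ok_bad = validate(bad, reference)
```

Reporting which kernel produced a mismatching result is what allows a failing kernel to be pinpointed.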

There are two categories for the two domains in LIBSMM, i.e., matrix transpose (OPENCL_LIBSMM_TRANS_*) and matrix multiplication (OPENCL_LIBSMM_SMM_*). For transposing matrices, the settings are:

OPENCL_LIBSMM_TRANS_BUILDOPTS: character string with build options (compile and link) supplied to the OpenCL runtime compiler.
OPENCL_LIBSMM_TRANS_INPLACE: Boolean value (zero or non-zero integer) for in-place matrix transpose (no local memory needed).
OPENCL_LIBSMM_TRANS_BM: non-negative integer number (less than or equal to the M-extent) denoting the blocksize in M-direction.

The most common settings for multiplying matrices are:

OPENCL_LIBSMM_SMM_BUILDOPTS: character string with build options (compile and link) supplied to the OpenCL runtime compiler.
OPENCL_LIBSMM_SMM_PARAMS: Disable embedded/auto-tuned parameters (0), or load CSV-file (e.g., path/to/tune_multiply.csv).
OPENCL_LIBSMM_SMM_BS: non-negative integer number denoting the intra-kernel (mini-)batchsize mainly used to amortize atomic updates of data in global/main memory. The remainder with respect to the "stacksize" is handled by the kernel.
OPENCL_LIBSMM_SMM_BM: non-negative integer number (less than or equal to the M-extent) denoting the blocksize in M-direction.
OPENCL_LIBSMM_SMM_BN: non-negative integer number (less than or equal to the N-extent) denoting the blocksize in N-direction.
OPENCL_LIBSMM_SMM_AP: specifies access to array of parameters (batch or "stack").
OPENCL_LIBSMM_SMM_AA: specifies access to array of A-matrices.
OPENCL_LIBSMM_SMM_AB: specifies access to array of B-matrices.
OPENCL_LIBSMM_SMM_AC: specifies access to array of C-matrices.
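The value constraints stated for the blocksize settings can be checked before a run. The sketch below is a hypothetical helper (smm_env is not part of LIBSMM) that validates BS, BM, and BN against the matrix extents M and N and returns the corresponding environment entries.

```python
def smm_env(m, n, bs=None, bm=None, bn=None):
    """Return OPENCL_LIBSMM_SMM_* entries after checking the stated limits.

    Hypothetical helper: settings left as None are omitted so that the
    library's embedded or tuned parameters remain in effect.
    """
    env = {}
    if bs is not None:
        if bs < 0:
            raise ValueError("BS must be non-negative")
        env["OPENCL_LIBSMM_SMM_BS"] = str(bs)
    if bm is not None:
        if not 0 <= bm <= m:
            raise ValueError("BM must be in [0, M]")
        env["OPENCL_LIBSMM_SMM_BM"] = str(bm)
    if bn is not None:
        if not 0 <= bn <= n:
            raise ValueError("BN must be in [0, N]")
        env["OPENCL_LIBSMM_SMM_BN"] = str(bn)
    return env


# Example for a 23x23 result block; the values are illustrative, not tuned.
settings = smm_env(23, 23, bs=8, bm=23, bn=4)
```

Rejecting out-of-range values early avoids spending a run on a configuration that violates the documented limits.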

The full list of tunable parameters, with a short description, default settings, and accepted values for each, can be obtained with smm/tune_multiply.py --help.

NOTE: LIBSMM's tunable runtime settings can behave non-smoothly, i.e., neighboring values can select distinct code paths, e.g., OPENCL_LIBSMM_SMM_BS=1 vs. OPENCL_LIBSMM_SMM_BS=2.
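The arithmetic behind the mini-batchsize (OPENCL_LIBSMM_SMM_BS) and the remainder with respect to the stacksize can be sketched as follows. This only illustrates how a stack divides into mini-batches; it is not the kernel's implementation, split_stack is a hypothetical name, and only positive batchsizes are modeled (what a value of zero selects is not covered here).

```python
def split_stack(stacksize, bs):
    """Return the sizes of consecutive mini-batches covering a stack."""
    if stacksize < 0:
        raise ValueError("stacksize must be non-negative")
    if bs < 1:
        raise ValueError("this sketch models positive batchsizes only")
    full, rest = divmod(stacksize, bs)
    # Full mini-batches first, followed by one shorter batch for the remainder.
    return [bs] * full + ([rest] if rest else [])


batches = split_stack(10, 4)     # two full mini-batches plus a remainder
unbatched = split_stack(10, 1)   # every stack entry is its own batch
```

With a batchsize of one there is nothing to amortize across entries, which gives some intuition for why BS=1 and BS=2 may take different code paths.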

Auto Tuning

To tune and optimize a kernel and generate kernel parameters, please refer to the Auto Tuning guide. To update or retune an entire set of kernels (optimized parameters), please refer to the Bulk Tuning guide.