CUDA/HIP

GPU-Accelerated Small Matrix Multiplications

libsmm_acc is a library for small matrix-matrix multiplications on GPU accelerators. Stacks of matrix-matrix multiplication indices are passed from DBCSR to libsmm_acc, which performs the multiplications on the GPU.

For a description of the library (some details are outdated, but this nevertheless provides a very good introduction), see Chapter 8.4 of:

Walker, R. C., & Goetz, A. W. (2016). Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics.

Compilation

libsmm_acc is compiled from within DBCSR; there is no separate compilation.

Directory Organization

  • kernels/: GPU kernels (CUDA- and HIP-compatible) for matrix-matrix multiplication, and the Python interface to the autotuning and predictive code.
  • notebooks/: Jupyter notebooks for exploring data generated from autotuning and prediction.
  • generate_*.py: utility scripts for libsmm_acc compilation.
  • libsmm_acc*: libsmm_acc C++ and CUDA/HIP code.
  • parameters/: contains the parameters_GPU.json files (where GPU stands for the name of a GPU card). These are sets of matrix-matrix multiplication parameters for different (m, n, k)-triplets, optimized for the given GPU card. You can explore these parameters interactively using the provided Jupyter notebook.
  • predict/: scripts for prediction of optimal parameter sets; see predictive modeling of kernel parameters.
  • tune/: scripts for autotuning of optimal parameter sets; see autotuning of kernel parameters.

Matrix-matrix Multiplication Kernels and Parameters

For a given matrix-matrix multiplication triplet characterized by dimensions

  • m
  • n
  • k,

libsmm_acc can run 5 different matrix-matrix multiplication kernels, which take between 3 and 7 parameters (see the figure at the top):

  • threads: number of threads per block in the execution configuration of the CUDA/HIP kernels
  • grouping: how many stack entries are grouped together into a CUDA/HIP thread block (the larger the grouping, the fewer blocks are launched)
  • minblocks: the desired minimum number of resident blocks per multiprocessor
  • tile_m: (M in the figure); tile_m * tile_n gives the dimensions of the result block T
  • tile_n: (N in the figure)
  • w: input slab width (width of slabs P_A and P_B)
  • v: output slab width (width of slab P_C)

The performance of the matrix-matrix multiplication kernels is highly dependent on the choice of algorithm and parameters. For this reason, libsmm_acc provides lists of optimal parameters for different GPU cards and different (m, n, k)-triplets. These sets of optimal parameters can be found either through autotuning or predictive modeling.
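
To make the role of these parameters concrete, the sketch below shows how a single tuned parameter set for one (m, n, k)-triplet might be represented and looked up. This is a minimal Python illustration; the field names and values are placeholders, and the actual schema is the one used by the parameters_GPU.json files and the scripts in predict/ and tune/.

    # Hypothetical example of one tuned parameter record; field names and values
    # are illustrative only, not the authoritative parameters_GPU.json schema.
    example_record = {
        "m": 23, "n": 23, "k": 23,   # dimensions of the multiplication triplet
        "algorithm": "medium",       # which multiplication kernel to use (illustrative)
        "threads": 96,               # threads per CUDA/HIP thread block
        "grouping": 16,              # stack entries grouped into one thread block
        "minblocks": 1,              # desired resident blocks per multiprocessor
        "tile_m": 2, "tile_n": 2,    # dimensions of the result block T
        # "w" and "v" (slab widths) only apply to kernels that use input/output slabs
    }

    def lookup(records, m, n, k):
        """Return the record matching a given (m, n, k)-triplet, or None."""
        return next(
            (r for r in records if (r["m"], r["n"], r["k"]) == (m, n, k)), None
        )

    print(lookup([example_record], 23, 23, 23))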

Contributing to libsmm_acc

We expect users to contribute to the library by providing new optimized kernels and support for new GPUs.

Autotuning procedure

Follow the autotuning procedure

Predictive modeling of kernel parameters

Follow the predictive modeling procedure

Adding a new kernel

  1. Choose a kernel name

  2. Add the kernel's code (it must compile with both nvcc and the HIP compiler) in the file kernels/smm_acc_dnt_name.h

  3. Add a Python kernel class inheriting from the base class, in the file kernels/smm_acc_dnt_name.py (a purely illustrative skeleton follows this list)

  4. Add the new kernel to the kernel_algorithm data structure in kernels/smm_acc_predict.py
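
As a rough illustration of step 3, a new kernel's Python class mainly identifies the kernel by name and declares which launch parameters it accepts, so that the autotuning and prediction scripts can generate candidate parameter sets for it. The skeleton below is purely illustrative: the class names, attribute names and base-class interface are placeholders, not the actual classes defined in the kernels/ directory.

    # Purely illustrative skeleton; the real base class and its interface are
    # defined in the kernels/ directory of libsmm_acc and may differ.
    class KernelBase:
        """Placeholder standing in for the libsmm_acc kernel base class."""

        def __init__(self, **params):
            self.params = params

        def launch_parameters(self):
            raise NotImplementedError


    class KernelDntName(KernelBase):
        """Hypothetical class for a new kernel 'name' (kernels/smm_acc_dnt_name.h)."""

        algorithm = "name"  # must match the kernel name chosen in step 1

        def launch_parameters(self):
            # The subset of parameters this algorithm actually takes (3 to 7 of them).
            return ["m", "n", "k", "threads", "grouping", "minblocks"]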

Adding support for a new GPU card

  1. Add the GPU's compute architecture properties to kernels/gpu_properties.json (an illustrative sketch of such an entry follows this list). For more information on where to find these properties, please refer to the "info" field of kernels/gpu_properties.json.

  2. Add the GPU to the gpu_architectures data structure in kernels/smm_acc.py.

  3. Add the necessary code for setting ARCH_NUMBER correctly in the CMakeLists. Also add this GPU to the list of SUPPORTED_CUDA_ARCHITECTURES or SUPPORTED_HIP_ARCHITECTURES in the CMakeLists.

  4. Add a minimal JSON file parameters_GPU.json (where GPU is the name of the new card), containing:

{
}

then add matrix-matrix multiplication parameters for this GPU using the autotuning and predictive modeling procedures described above.
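
To give an idea of steps 1 and 2, the sketch below shows, in Python form, what a compute-architecture entry for a new GPU and the corresponding architecture mapping might look like. All keys and values are illustrative placeholders; the authoritative field names are those already present in kernels/gpu_properties.json and kernels/smm_acc.py.

    # Illustrative only: keys and values are placeholders, not the actual schema
    # of kernels/gpu_properties.json or kernels/smm_acc.py.
    new_gpu_properties = {
        "Hypothetical_GPU": {
            "Compute_Capability": "9.0",
            "Threads_per_Warp": 32,
            "Max_Threads_per_Multiprocessor": 2048,
            "Max_Thread_Blocks_per_Multiprocessor": 32,
            "Shared_Memory_per_Multiprocessor_KB": 228,
        }
    }

    # Hypothetical counterpart of the gpu_architectures mapping: it associates a
    # compiler architecture flag with the entry above.
    gpu_architectures = {
        "sm_90": "Hypothetical_GPU",
    }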