autotuning_properties.json
Properties of the autotuning procedure, read from DBCSR source code
gpu_properties.json
GPU card properties
smm_acc_common.h
Functionalities common to kernel CUDA/HIP codes
smm_acc_dnt_base.py
Kernel base class
smm_acc_dnt_ALGORITHM.py
Kernel class in python
smm_acc_dnt_ALGORITHM.h
Batched Multiply Kernel CUDA/HIP code
smm_acc_predict.py
Class and helper functions for parameter prediction procedure
smm_acc_transpose.h
Transpose CUDA/HIP code
All kernels have following signature:
template <int m, int n, int k, int M, int N, int w, int v, int threads, int grouping, int minblocks >
__global__ void
__launch_bounds__(threads, minblocks)
smm_acc_dnt_ALGORITHM
(const int *__restrict__ param_stack, const int stack_size,
const double* __restrict__ a_data, const double* __restrict__ b_data, double* c_data);
At kernel launch time, the A, B, and C matrices, as well as the product descriptors (the so-called stacks) are all located in global memory on the GPU. Each entry in the stack describes one matrix-matrix product: it contains three pointers to the blocks in the A, B, and C matrices. After the kernel has read a stack entry, it fetches the blocks in matrices A and B from global to shared memory, and updates the C matrix with the product of A and B.
libsmm_acc
provides 5 different kernels for this operation (tiny, small, medium, largeDB1, largeDB2), which are optimized for different block sizes. Please refer to the documentation inside the respective .h
files for more details.