gpu_properties.json GPU card properties
smm_acc_common.h Functionalities common to kernel CUDA/HIP codes
smm_acc_dnt_base.py Kernel base class
smm_acc_dnt_ALGORITHM.py Kernel class in python
smm_acc_dnt_ALGORITHM.h Batched Multiply Kernel CUDA/HIP code
smm_acc_predict.py Class and helper functions for parameter prediction procedure
smm_acc_transpose.h Transpose CUDA/HIP code
All kernels have following signature:
template <int m, int n, int k, int M, int N, int w, int v, int threads, int grouping, int minblocks > __global__ void __launch_bounds__(threads, minblocks) smm_acc_dnt_ALGORITHM (const int *__restrict__ param_stack, const int stack_size, const double* __restrict__ a_data, const double* __restrict__ b_data, double* c_data);
At kernel launch time, the A, B, and C matrices, as well as the product descriptors (the so-called stacks) are all located in global memory on the GPU. Each entry in the stack describes one matrix-matrix product: it contains three pointers to the blocks in the A, B, and C matrices. After the kernel has read a stack entry, it fetches the blocks in matrices A and B from global to shared memory, and updates the C matrix with the product of A and B.
libsmm_acc provides 5 different kernels for this operation (tiny, small, medium, largeDB1, largeDB2), which are optimized for different block sizes. Please refer to the documentation inside the respective
.h files for more details.