libsmm_acc
The performance of the matrix-matrix multiplication kernels is highly dependent on the choice of algorithm and parameters. This is why auto-tuning is used to find optimal kernel parameters.
Python version required: python 3.6+
If you are about to autotune parameters for a new GPU (i.e. a GPU for which there are no auto-tuned parameters yet), please first follow the instructions for a new GPU.
Install all required Python packages (if you do not want this project's requirements to interfere with your other Python projects, consider installing them in a virtual environment), using
pip install -r requirements.txt
Go to the libsmm_acc/tune directory:
$ cd dbcsr/src/acc/libsmm_acc/tune
The parameters.h file (a C++ header file generated from the JSON record of multiplication kernels and their optimal parameters) is needed for the auto-tuning procedure. It can be copied over from a build directory, for example:
$ cp ~/dbcsr/build_dir/src/acc/libsmm_acc/parameters.h ../
Adapt tune_setup.py to your environment
The tune_setup.py script generates job files. You have to adapt the script to the environment of your supercomputer and your personal settings.
...
def gen_jobfile(outdir, m, n, k):
    ...
    output = "#!/bin/bash -l\n"
    output += "#SBATCH --nodes=%d\n" % num_nodes
    output += "#SBATCH --ntasks-per-core=1\n"
    output += "#SBATCH --ntasks-per-node=1\n"
    output += "#SBATCH --cpus-per-task=" + "%d\n" % cpus_per_node
    output += "#SBATCH --time=%s\n" % time
    output += "#SBATCH --partition=normal\n"
    output += "#SBATCH --constraint=gpu\n"
    output += "\n"
    output += "source ${MODULESHOME}/init/sh;\n"
    output += "module load daint-gpu\n"
    output += "module unload PrgEnv-cray\n"
    output += "module load PrgEnv-gnu\n"
    if compiler == "nvcc":
        output += "module load cudatoolkit/8.0.61_2.4.9-6.0.7.0_17.1__g899857c\n"
    else:  # i.e. compiler = hipcc
        output += "module load hip\n"
    output += "module list\n"
    output += "export CRAY_CUDA_MPS=1\n"
    output += "cd $SLURM_SUBMIT_DIR \n"
    output += "\n"
    output += "date\n"
    ...
...
Run the script tune_setup.py
Specify which GPU you are auto-tuning for by passing the appropriate parameters_GPU.json file as an argument with -p/--params. Specify which compiler to use for compiling kernel code by passing nvcc or hipcc as an argument with -b/--compiler.
More arguments can be set; run
$ ./tune_setup.py --help
for more information.
In addition, the script takes as arguments the block sizes you want to add to libsmm_acc
. You can specify these as a list of integers or provide the parameter file of a different GPU from which to read the block sizes to autotune.
For example, if the system you want to autotune for contains blocks of size 5 and 8, run:
$ ./tune_setup.py 5 8 -p ../parameters/parameters_P100.json
Reading parameters from parameters_P100.json
libsmm_acc: Found 74096 existing parameter sets, of which 1641 are autotuned and 72455 are predicted.
Requested to autotune 8 triplets
Found 41824 parameter sets for 5x5x5
Found 83648 parameter sets for 5x5x8
Found 103072 parameter sets for 5x8x5
Found 103072 parameter sets for 5x8x8
Found 103072 parameter sets for 8x5x5
Found 103072 parameter sets for 8x5x8
Found 125344 parameter sets for 8x8x5
Found 125344 parameter sets for 8x8x8
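The 8 requested triplets are simply all (m, n, k) combinations of the given block sizes. A minimal illustration of this expansion (plain Python, not taken from tune_setup.py):

# Two block sizes expand into 2**3 = 8 (m, n, k) triplets, matching the output above.
from itertools import product

block_sizes = [5, 8]
triplets = sorted(product(block_sizes, repeat=3))
print(len(triplets))  # 8
print(triplets[:3])   # [(5, 5, 5), (5, 5, 8), (5, 8, 5)]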
Or, if you want to obtain, for the NVIDIA P100, the parameters of the same block sizes as recorded for the NVIDIA K40, run:
$ ./tune_setup.py -p ../parameters/parameters_P100.json ../parameters/parameters_K40.json
Reading parameters from parameters_P100.json
libsmm_acc: Found 74093 existing parameter sets, of which 1638 are autotuned and 72455 are predicted.
Reading parameters to autotune from parameters_K40.json
Requested to autotune 19 triplets
Found 41824 parameter sets for 5x5x5
Found 95648 parameter sets for 6x6x6
Found 110496 parameter sets for 7x7x7
Found 125344 parameter sets for 8x8x8
Found 173764 parameter sets for 9x9x9
...
The script will create a directory for each combination of the block sizes:
$ ls -d tune_*
tune_5x5x5 tune_5x5x8 tune_5x8x5 tune_5x8x8 tune_8x5x5 tune_8x5x8 tune_8x8x5 tune_8x8x8
Each directory contains a number of files:
$ ls -1 tune_8x8x8/
Makefile
tune_8x8x8_exe0_main.cu/cpp
tune_8x8x8_exe0_part0.cu/cpp
tune_8x8x8_exe0_part1.cu/cpp
tune_8x8x8_exe0_part2.cu/cpp
tune_8x8x8_exe0_part3.cu/cpp
tune_8x8x8_exe0_part4.cu/cpp
tune_8x8x8.job
For each possible parameter set, a launcher is generated. A launcher is a small snippet of C++ code that launches the kernel using the CUDA-specific <<< >>> notation or HIP's hipLaunchKernelGGL function, and instantiates the C++ template which contains the actual kernel code.
In order to parallelize the benchmarking, the launchers are distributed over multiple executables. Currently, up to 10,000 launchers are benchmarked by one executable. Each executable is linked together from several tune_*_part???.o files and a tune_*_main.o. Each part file contains up to 100 launchers, which allows the compilation to be parallelized over multiple CPU cores.
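As a rough sketch of this distribution scheme (the helpers below are illustrative and not the actual tune_setup.py code; only the limits of 10,000 launchers per executable and 100 per part file come from the description above):

# Illustrative only: split the launchers of one (m, n, k) triplet into
# executables of at most 10,000 launchers, each built from part files of at most 100.
LAUNCHERS_PER_EXE = 10000
LAUNCHERS_PER_PART = 100

def chunk(items, size):
    """Yield successive slices of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def distribute(launchers):
    """Group launchers into executables, each consisting of several part files."""
    for exe_idx, exe_launchers in enumerate(chunk(launchers, LAUNCHERS_PER_EXE)):
        parts = list(chunk(exe_launchers, LAUNCHERS_PER_PART))
        yield exe_idx, parts  # one executable, compiled from len(parts) part files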
Adapt tune_submit.py to your environment
The script tune_submit.py was written for the SLURM batch system, as used e.g. on Cray supercomputers. If your computer runs a different batch system, you have to adapt tune_submit.py accordingly.
Each tune directory contains a job file. Since there might be many tune directories, the convenience script tune_submit.py can be used to submit jobs. It goes through all tune_* directories and checks whether their jobs have already been submitted or run. For this, the script calls squeue in the background and searches for slurm-*.out files. In order to limit the number of jobs submitted at a time, a maximum number of jobs to submit can be specified with -j.
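A minimal sketch of that check (the function and option choices below are assumptions for illustration, not the actual tune_submit.py implementation):

# Illustrative sketch: a tune_* directory is skipped if its job is already
# queued/running (found via squeue) or has already produced slurm-*.out files.
import glob
import subprocess

# Names of currently queued or running jobs, obtained once via squeue:
queued = set(subprocess.check_output(["squeue", "-o", "%j", "-h"],
                                     universal_newlines=True).split())

def needs_submission(tune_dir):
    """Submit only if no job is queued for this directory and none has run yet."""
    return tune_dir not in queued and not glob.glob(tune_dir + "/slurm-*.out")

to_submit = [d for d in sorted(glob.glob("tune_*")) if needs_submission(d)]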
When tune_submit.py is called without arguments, it will just list the jobs that could be submitted:
$ ./tune_submit.py
tune_5x5x5: Would submit, run with "doit!"
tune_5x5x8: Would submit, run with "doit!"
tune_5x8x5: Would submit, run with "doit!"
tune_5x8x8: Would submit, run with "doit!"
tune_8x5x5: Would submit, run with "doit!"
tune_8x5x8: Would submit, run with "doit!"
tune_8x8x5: Would submit, run with "doit!"
tune_8x8x8: Would submit, run with "doit!"
Number of jobs submitted: 8
Only when tune_submit.py is called with doit! as its first argument will it actually submit jobs:
$ ./tune_submit.py doit!
tune_5x5x5: Submitting
Submitted batch job 277987
tune_5x5x8: Submitting
Submitted batch job 277988
tune_5x8x5: Submitting
Submitted batch job 277989
tune_5x8x8: Submitting
Submitted batch job 277990
tune_8x5x5: Submitting
Submitted batch job 277991
tune_8x5x8: Submitting
Submitted batch job 277992
tune_8x8x5: Submitting
Submitted batch job 277993
tune_8x8x8: Submitting
Submitted batch job 277994
Number of jobs submitted: 8
Run tune_collect.py to parse all log files and determine the best kernel for each block size:
$ ./tune_collect.py
Reading: tune_5x5x5/tune_5x5x5_exe0.log
Reading: tune_5x5x8/tune_5x5x8_exe0.log
Reading: tune_5x8x5/tune_5x8x5_exe0.log
Reading: tune_5x8x8/tune_5x8x8_exe0.log
Reading: tune_8x5x5/tune_8x5x5_exe0.log
Reading: tune_8x5x8/tune_8x5x8_exe0.log
Reading: tune_8x8x5/tune_8x8x5_exe0.log
Reading: tune_8x8x8/tune_8x8x8_exe0.log
Kernel_dnt_tiny(m=5, n=5, k=5, split_thread=32, threads=64, grouping=16, minblocks=1) , # 27.9623 GFlops
Kernel_dnt_tiny(m=5, n=5, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 37.8978 GFlops
Kernel_dnt_medium(m=5, n=8, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=8) , # 32.9231 GFlops
Kernel_dnt_tiny(m=5, n=8, k=8, split_thread=32, threads=96, grouping=16, minblocks=1) , # 47.0366 GFlops
Kernel_dnt_medium(m=8, n=5, k=5, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 33.1999 GFlops
Kernel_dnt_medium(m=8, n=5, k=8, tile_m=1, tile_n=1, threads=96, grouping=16, minblocks=12) , # 49.3499 GFlops
Kernel_dnt_tiny(m=8, n=8, k=5, split_thread=32, threads=96, grouping=16, minblocks=1) , # 62.8469 GFlops
Kernel_dnt_tiny(m=8, n=8, k=8, split_thread=32, threads=128, grouping=16, minblocks=1) , # 90.7763 GFlops
Wrote parameters.json
The file parameters.json
in dbcsr/src/acc/libsmm_acc/parameters
now contains the newly autotuned parameters.
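As a rough sketch of what this collection step does (the log format and parsing below are guesses for illustration, not the real tune_collect.py): read every exe log, extract the measured performance of each parameter set, and keep the fastest one per (m, n, k) before writing the result to parameters.json.

# Hypothetical sketch: keep the best-performing kernel per triplet.
# The GFlops line format assumed here is illustrative only.
import glob
import re

best = {}  # (m, n, k) -> (gflops, kernel description)
for logfile in glob.glob("tune_*/tune_*_exe*.log"):
    m, n, k = map(int, re.search(r"tune_(\d+)x(\d+)x(\d+)", logfile).groups())
    with open(logfile) as handle:
        for line in handle:
            match = re.search(r"(\d+(?:\.\d+)?)\s*GFlop", line)
            if match and float(match.group(1)) > best.get((m, n, k), (0.0,))[0]:
                best[(m, n, k)] = (float(match.group(1)), line.strip())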
Run tune_merge.py to merge the new parameters with the original ones in the parameters directory:
$ ./tune_merge.py
Merging parameters.json with parameters_P100.json
Wrote parameters.new.json
The file parameters.new.json can now be used as a parameter file. Rename it to parameters_GPU.json, substituting GPU with the appropriate GPU name.
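A minimal sketch of the merge idea (the JSON layout assumed below is an assumption, not necessarily the real file format): entries from the freshly autotuned parameters.json replace entries with the same (m, n, k) in the existing per-GPU file, and all other entries are kept.

# Hypothetical sketch of the merge; assumes each file holds a list of
# parameter dicts with "m", "n", "k" keys, which may differ from the real format.
import json

with open("parameters_P100.json") as f:
    existing = json.load(f)
with open("parameters.json") as f:
    autotuned = json.load(f)

merged = {(p["m"], p["n"], p["k"]): p for p in existing}
merged.update({(p["m"], p["n"], p["k"]): p for p in autotuned})  # newly autotuned entries win

with open("parameters.new.json", "w") as f:
    json.dump([merged[key] for key in sorted(merged)], f, indent=2)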
Contribute new optimal parameters
Submit a pull request updating the appropriate parameters_GPU.json
file to the DBCSR repository.
Contribute autotuning data
See instructions in DBCSR's data repository.