At the end of an output file, a report of DBCSR's statistics and timings can be found.
The STATISTICS section of the output file provides information on the matrix-matrix multiplications that were run and their performance characteristics.
Example:
```
 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops    23 x    23 x    23         687272462200       0.0%      0.0%    100.0%
 flops inhomo. stacks                           0       0.0%      0.0%      0.0%
 flops total                       687.272462E+09       0.0%      0.0%    100.0%
 flops max/rank                    687.272462E+09       0.0%      0.0%    100.0%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                           28243300       0.0%      0.0%    100.0%
 number of processed stacks                  1600       0.0%      0.0%    100.0%
 average stack size                                     0.0       0.0   17652.1
 marketing flops                     1.076458E+12
 -------------------------------------------------------------------------------
 # multiplications                             50
 max memory usage/rank             16.650822E+09
 # max total images/rank                        1
 # max 3D layers                                1
 # MPI messages exchanged                       0
 MPI messages size (bytes):
  total size                        0.000000E+00
  min size                          0.000000E+00
  max size                          0.000000E+00
  average size                      0.000000E+00
 MPI breakdown and total messages size (bytes):
             size <= 128                       0        0
       128 < size <= 8192                     0        0
      8192 < size <= 32768                    0        0
     32768 < size <= 131072                   0        0
    131072 < size <= 4194304                  0        0
   4194304 < size <= 16777216                 0        0
  16777216 < size                             0        0
 -------------------------------------------------------------------------------
```
The columns describe:

TOTAL
: total flops

BLAS
: percentage of flops run on BLAS (this could be cuBLAS or hipBLAS)

SMM
: percentage of flops run on SMM (libsmm or libxsmm, on CPU)

ACC
: percentage of flops run on ACC (libsmm_acc, DBCSR's GPU-accelerated backend)

Every time "matrix-matrix multiplication" is mentioned in this section, it refers not to the sparse multiplication of the large matrices, but to the multiplication of the small dense blocks that the large sparse matrix was decomposed into.

flops 23 x 23 x 23
: indicates that batched matrix-matrix multiplication kernels with matrix dimensions (m, n, k) = (23, 23, 23) were run, and reports their flops. If batched matrix-matrix multiplications with other matrix dimensions (m, n, k) were also run, they would appear as subsequent separate rows.

flops inhomo. stacks
: flops of so-called "inhomogeneous stacks", i.e. stacks of batched matrix-matrix multiplications in which not all contained multiplications have the same matrix dimensions (m, n, k)

flops total
: total flops over all stacks of matrix-matrix multiplications

flops max/rank
: flops of the MPI rank with the most flops

matmuls inhomo. stacks
: number of matrix-matrix multiplications run in inhomogeneous stacks

matmuls total
: total number of matrix-matrix multiplications run

number of processed stacks
: number of stacks of batched matrix-matrix multiplications

average stack size
: the stack size (i.e. the number of matrix-matrix multiplications that a stack contains), averaged over all stacks

Example of the timing section of the output file:
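The counters in the example statistics above are consistent with each other: a single (m, n, k) block multiplication costs 2·m·n·k flops, so "flops total" is 2·23³ times "matmuls total", and "average stack size" is "matmuls total" divided by "number of processed stacks". A quick sketch, using the numbers from the example output:

```python
# Consistency check of the STATISTICS counters from the example output above.
m = n = k = 23            # block dimensions from the "flops 23 x 23 x 23" row
matmuls_total = 28243300  # "matmuls total"
stacks = 1600             # "number of processed stacks"

flops_per_matmul = 2 * m * n * k   # one multiply and one add per inner-product term
flops_total = flops_per_matmul * matmuls_total
print(flops_total)                 # 687272462200, matching "flops total"

avg_stack_size = matmuls_total / stacks
print(round(avg_stack_size, 1))    # 17652.1, matching "average stack size"
```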
```
 -------------------------------------------------------------------------------
 -                                                                             -
 -                                T I M I N G                                  -
 -                                                                             -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME  MAXRANK
                                 MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 dbcsr_performance_driver             1  1.0    0.000    0.000  102.563  102.563        0
 dbcsr_perf_multiply_low              1  2.0    0.002    0.002  102.563  102.563        0
 perf_multiply                        1  3.0    0.003    0.003  102.077  102.077        0
 [...]
 -------------------------------------------------------------------------------
```
The columns describe:
SUBROUTINE
: the name of the timed Fortran subroutine (or C++ function)

CALLS
: the number of times the subroutine was called

ASD
: average stack depth: the average number of entries on the call stack when this subroutine is called

SELF TIME
: time spent in the subroutine itself, or in non-timed subroutines called by it. AVERAGE is the self time averaged over all MPI ranks, and MAXIMUM is the maximum self time over all MPI ranks; comparing the two can be used to locate load imbalance or synchronization points.

TOTAL TIME
: time spent in the subroutine, including the time spent in timed subroutines it calls. AVERAGE and MAXIMUM are again the average and maximum over all MPI ranks, and comparing them can likewise be used to locate load imbalance or synchronization points.

MAXRANK
: the MPI rank on which the maximum time was measured

For performance debugging, and in order to check how much time a program spends doing JIT compilation, look for the functions `jit_kernel_multiply` and `jit_kernel_transpose`.
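Since both AVERAGE and MAXIMUM are reported, a simple derived metric for load imbalance is the relative gap between them. A small sketch (the helper name and the sample numbers are illustrative, not part of DBCSR's output):

```python
def load_imbalance(avg_time, max_time):
    """Relative load imbalance: 0.0 means perfectly balanced across ranks,
    larger values mean the slowest rank is far above the average."""
    if avg_time == 0.0:
        return 0.0
    return (max_time - avg_time) / avg_time

# Hypothetical example: a subroutine with an average total time of 80 s
# and a maximum total time of 100 s across MPI ranks.
print(load_imbalance(80.0, 100.0))  # 0.25 -> the slowest rank is 25% above average
```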
By default, the most important subroutines are timed in DBCSR.
If you want to time a subroutine or function that is not timed already, call `timeset` with a routine name and a handle at the beginning of the function, and `timestop` with the same handle at the end of the function.
For examples, just `grep` for `timeset` and `timestop` in the codebase.
This can be done both in the Fortran code and in the C++ code.
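In Fortran, the pattern looks roughly like this (a sketch only; `my_routine` is a placeholder, and existing call sites of `timeset`/`timestop` in the DBCSR sources are the authoritative reference for the exact convention):

```fortran
SUBROUTINE my_routine()
   CHARACTER(len=*), PARAMETER :: routineN = 'my_routine'
   INTEGER :: handle

   CALL timeset(routineN, handle)   ! start timing, receive a handle
   ! ... the work to be timed ...
   CALL timestop(handle)            ! stop timing with the same handle
END SUBROUTINE my_routine
```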