The performance of the matrix-matrix multiplication kernels is highly dependent on the choice of algorithm and parameters; this is why auto-tuning is used to find optimal kernel parameters.
However, the auto-tuning procedure is expensive, and the space of (m,n,k)-triplets to explore is large. The following predictive modeling procedure is set up to predict optimal parameters for (m,n,k)-triplets that have not been auto-tuned from the data gathered from auto-tuning other (m,n,k)-triplets.
Python version required:
Install all required Python packages (if you do not want this project's requirements to interfere with your other Python projects, consider installing them in a virtual environment), using
pip install -r requirements.txt
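For example, the virtual-environment route mentioned above could look like this (the environment name is arbitrary):

```shell
# Create and activate an isolated environment, then install the pinned
# requirements into it (environment name is just an example)
python3 -m venv dbcsr-predict-env
. dbcsr-predict-env/bin/activate
pip install -r requirements.txt
```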
The input features for the predictive models can be 'raw' parameters, or hand-engineered features 'derived' from the raw features (matrix sizes, launch parameters and resource usage estimations).
Get the data to be used for training, either by downloading data from the dedicated repository, or by auto-tuning new kernels yourself and combining them with pre-existing data.
wget https://github.com/cp2k/dbcsr-data/blob/master/GPU/raw_training_data_ALGORITHM.csv # for ALGORITHM = tiny, small, medium, largeDB1, largeDB2
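To fetch all five files in one go, the same URL pattern can be wrapped in a small loop:

```shell
# Download the raw training data for each algorithm in turn
for algo in tiny small medium largeDB1 largeDB2; do
    wget "https://github.com/cp2k/dbcsr-data/blob/master/GPU/raw_training_data_${algo}.csv"
done
```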
Then run `prepare_training_data.py`, providing the CUDA/HIP architecture number and the location of the downloaded data:
./prepare_training_data.py # --arch 60 --folder /scratch/autotuning_dataset, e.g.
We would appreciate it if you uploaded the data resulting from your auto-tuning procedure to the dedicated repository. For this, please take note, at this stage, of the information required to upload your data.
If you're auto-tuning data for a new GPU, make sure that the GPU's compute architecture properties are given in the file `kernels/gpu_properties.json`. If not, please add them.
Follow the instructions for auto-tuning.
If all went well, you now have directories named `tune_mxnxk` containing log files in which parameter sets and their corresponding measured performances are recorded.
Collect the information in all the `tune_mxnxk` directories into CSV files by running `predict_collect.py`, providing the location of the auto-tuning data:
./predict_collect.py # --folder /scratch/autotuning_dataset, e.g.
You should now have 5 CSV files containing raw data, `raw_training_data_ALGORITHM.csv`, one per algorithm (ALGORITHM = tiny, small, medium, largeDB1, largeDB2).
A few steps are needed to make the data ready for training:
./prepare_training_data.py # --folder /scratch/autotuning_dataset -a 60 -j12, e.g. to run with 12 threads
The data preparation is relatively computationally expensive, especially for large data sets. A good way of running it is to:

1. run it for each algorithm separately (`-l ALGORITHM --skip_derived_data=True`), adjusting the `-j` parameter so it runs fast enough while not running into "out-of-memory" errors,
2. then run it once with `--skip_derived_data=True` to create the files that aggregate maximum and baseline performances for all algorithms,
3. finally run the derived-data computation for each algorithm separately (`-l ALGORITHM`), adjusting the `-j` parameter as before.
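Concretely, the three passes could be scripted as follows (dataset location and architecture number are the example values used elsewhere in this document):

```shell
DATA=/scratch/autotuning_dataset   # example dataset location
ARCH=60                            # example CUDA architecture number

# Pass 1: prepare raw data per algorithm, skipping the derived data
for algo in tiny small medium largeDB1 largeDB2; do
    ./prepare_training_data.py --folder "$DATA" -a "$ARCH" -j12 \
        -l "$algo" --skip_derived_data=True
done

# Pass 2: aggregate maximum and baseline performances over all algorithms
./prepare_training_data.py --folder "$DATA" -a "$ARCH" -j12 --skip_derived_data=True

# Pass 3: compute the derived data per algorithm
for algo in tiny small medium largeDB1 largeDB2; do
    ./prepare_training_data.py --folder "$DATA" -a "$ARCH" -j12 -l "$algo"
done
```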
- `raw_training_data_ALGORITHM.csv` (containing all raw parameters for training a model for algorithm ALGORITHM, obtained in step 1)
- `training_data_ALGORITHM.csv` (containing all derived parameters for training a model for algorithm ALGORITHM)
- `training_data_ALGORITHM.parquet` (containing all raw and derived parameters for training a model for algorithm ALGORITHM in Parquet files, convenient for reading in parallel using Dask)
- `baseline_performances_by_algo.json` (containing, for each (m, n, k)-triplet in the training data, its baseline performance, i.e. its performance were it to be run with a set of parameters that are an expert's "best guess"). Additionally, the baseline performances are plotted in
- `max_performances.json` (containing, for each (m, n, k)-triplet, its maximum performance). Additionally, the maximum performances are plotted in
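As a sketch of how these two JSON files relate, one can compute, per (m, n, k)-triplet, how far the baseline "best guess" lies below the auto-tuned maximum. The flat `{"mxnxk": GFlop/s}` layout and the toy numbers below are assumptions for illustration, not the documented format:

```python
# Hypothetical sketch: ratio of auto-tuned maximum performance to baseline
# ("expert best guess") performance, per (m, n, k)-triplet.
def speedup_over_baseline(max_performances, baseline_performances):
    """Return {triplet: max / baseline} for triplets present in both dicts."""
    return {
        mnk: max_performances[mnk] / baseline_performances[mnk]
        for mnk in max_performances
        if baseline_performances.get(mnk, 0) > 0
    }

# Toy numbers, not real measurements:
max_perf = {"4x4x4": 120.0, "8x8x8": 330.0}
base_perf = {"4x4x4": 100.0, "8x8x8": 300.0}
print(speedup_over_baseline(max_perf, base_perf))  # {'4x4x4': 1.2, '8x8x8': 1.1}
```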
Explore the data interactively using the provided Jupyter notebook.
For each algorithm, build a predictive model using decision trees and feature selection based on the features' permutation importance.
./predict_train.py # --algo medium --folder /scratch/autotuning_dataset, e.g.
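To illustrate the technique named above (not the actual `predict_train.py` pipeline), here is a minimal scikit-learn sketch on synthetic data: fit a decision tree, then rank features by permutation importance:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)                       # 4 toy features
y = 3.0 * X[:, 0] + 0.1 * rng.rand(200)   # only feature 0 is informative

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Features sorted from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking[0])  # feature 0 should come out on top
```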
Use the command-line parameters `--folder` and `--destination_folder` to choose the folder from which data is read, as well as the folder to which models, logs, etc. are written.
Repeat this step for all algorithms.
This may take several hours. For example, training algorithm 'medium' for the P100 took 11 hours on a single Greina (CSCS) node.
Moreover, depending on the size of the training data, large amounts of memory may be needed. For example, training algorithm 'medium' for the P100 was run on a 192 GB node.
Given predictive models (in the form of serialized scikit-learn model objects) for all algorithms, generate or update a file of optimal parameters for the unseen (m,n,k)-triplets:
./predict_genpars.py -c 5000 \  # chunk size
    -j 12 \  # 12 threads
    --largeDB2 /scratch/largeDB2/feature_tree_refit.p \  # paths to models
    --largeDB1 /scratch/largeDB1/feature_tree_refit.p \
    --medium /scratch/medium/feature_tree_refit.p \
    --small /scratch/small/feature_tree_refit.p \
    --tiny /scratch/tiny/feature_tree_refit.p
This may take several hours. For example, generating parameters for the P100 took 8 hours on a single Piz Daint (CSCS) node. For this reason, intermediate results are stored in JSON files in a folder `predict_genpars_ckpt`. Once this script has finished running, and you've successfully obtained a new `parameters_GPU.json` file, you may delete the checkpoint folder.
Evaluate the predicted parameters against the baseline:

./predict_evaluate.py -f libsmm_acc_predicted.out -n libsmm_acc_baseline.out
Submit a pull request updating the `parameters_GPU.json` file in question.
class PredictiveParameters, named