Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high-performance computing (HPC) domain. The problem of finding the optimal partition shape of matrices for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first thread focused purely on rectangular partitions, whereas the second relaxed the rectangular constraint to allow non-rectangular partitions. The research in the second thread, however, is entirely theoretical: there is no software implementation that would facilitate experimental studies of the practical performance and optimality of the proposed partition shapes. We address this gap in this work. We propose SummaGen, an implementation of PMM based on non-rectangular partitions. To study its efficacy, we compare the performance of PMM for four partition shapes proven optimal for the three-processor case, where the speeds of the processors are represented by positive real numbers. We conduct the experiments on a hybrid heterogeneous multi-accelerator NUMA node comprising three heterogeneous devices: a dual-socket Intel Haswell multicore CPU, an Nvidia K40 GPU, and an Intel Xeon Phi 3120P. We show that the four shapes exhibit equal performance (with an average percentage difference of 8%) for a range of problem sizes where the speeds are constant, confirming the optimality of these shapes in practice. We further demonstrate that the four shapes exhibit equal dynamic energy consumption for this case. We also present a study of the performance of PMM for the same partition shapes with a matrix decomposition obtained using a load-imbalancing data partitioning algorithm that employs functional performance models (FPMs). The peak and average performance of the implementation are 80% and 70%, respectively, of the theoretical peak floating-point performance of the machine.
|Title of host publication||2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)|
|Publication status||Published - 29 Jul 2019|
|Event||33rd IEEE International Parallel and Distributed Processing Symposium Workshops: IPDPSW 2019 - Rio de Janeiro, Brazil; Duration: 20 May 2019 → 24 May 2019|
|Conference||33rd IEEE International Parallel and Distributed Processing Symposium Workshops|
|City||Rio de Janeiro|
|Period||20/05/19 → 24/05/19|