SummaGen: parallel matrix-matrix multiplication based on non-rectangular partitions for heterogeneous HPC platforms

Stephen Patton, Hamidreza Khaleghzadeh, Ravi Reddy Manumachu, Alexey Lastovetsky

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high-performance computing (HPC) domain. The problem of finding the optimal shape of matrices for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first thread focused purely on rectangular partitions, whereas the second relaxed the rectangular partition constraint to allow non-rectangular partitions. The research works in the second thread, however, are entirely theoretical. There is no software implementation that would facilitate experimental studies of the practical performance and optimality of the proposed partition shapes. We address this gap in this work. We propose an implementation of PMM based on non-rectangular partitions called SummaGen. To study its efficacy, we compare the performance of PMM for four partition shapes proven optimal for the three-processor case, where the speeds of the processors are represented by positive real numbers. We conduct the experiments on a hybrid heterogeneous multi-accelerator NUMA node comprising three heterogeneous devices: a dual-socket Intel Haswell multicore CPU, an Nvidia K40 GPU, and an Intel Xeon Phi 3120P. We show that the four shapes exhibit equal performance (with an average percentage difference of 8%) for a range of problem sizes where the speeds are constant, confirming the optimality of these shapes in practice. We demonstrate further that the four shapes exhibit equal dynamic energy consumption for this case. We also present a study of the performance of PMM for the same partition shapes for a matrix decomposition using a load-imbalancing data partitioning algorithm employing functional performance models (FPMs). The peak and average performances of the implementation are 80% and 70%, respectively, of the theoretical peak floating-point performance of the machine.
Original language: English
Title of host publication: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Publisher: IEEE
Pages: 57-68
ISBN (Electronic): 9781728135106
ISBN (Print): 9781728135113
Publication status: Published - 29 Jul 2019
Event: 33rd IEEE International Parallel and Distributed Processing Symposium Workshops: IPDPSW 2019 - Rio de Janeiro, Brazil
Duration: 20 May 2019 to 24 May 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium Workshops
Country/Territory: Brazil
City: Rio de Janeiro
Period: 20/05/19 to 24/05/19
