Out-of-core implementation for accelerator kernels on heterogeneous clouds

Hamidreza Khaleghzadeh, Ziming Zhong, Ravi Reddy Manumachu, Alexey Lastovetsky

Research output: Contribution to journal › Article › peer-review


Cloud environments increasingly feature hybrid nodes containing multicore CPUs and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs), to ease the migration of HPC workloads to the cloud. While virtualization of accelerators in clouds is a leading research challenge, in this paper we address the programming challenges that hinder execution of large instances of data-parallel applications on these accelerators. In a typical hybrid node in a cloud, the tight integration of accelerators with multicore CPUs via PCI-E communication links has inherent limitations, such as the limited main memory of the accelerators and the limited bandwidth of the PCI-E links. These limitations pose formidable programming challenges to the execution of large problem sizes on these accelerators. In this paper, we describe a library containing interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between the host CPU and the accelerator with computations on the accelerator. It is designed using fundamental building blocks: OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events, which allow concurrent utilization of the copy and execution engines provided in NVidia GPUs. We elucidate the key features of our library using an out-of-core implementation of matrix multiplication of large dense matrices on a hybrid node: an Intel Haswell multicore CPU server hosting three accelerators, namely an NVidia K40c GPU, an Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of the peak double-precision floating-point performance of the GPU and a speedup of 2.7 times over NVidia's out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits a 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe the same 0% drop for our implementations on the Intel Xeon Phi and the Xilinx FPGA.
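
The benefit of overlapping transfers with computation, as described in the abstract, can be illustrated with a small timing model. The sketch below is not the HCLOOC interface and does not reproduce the paper's measurements. It assumes a hypothetical device with one host-to-device copy engine, one execution engine, and one device-to-host copy engine that can all run concurrently, equal-sized chunks, and enough device buffers that a copy never waits for free memory. The function names and the timings in the usage note are illustrative assumptions.

```python
def serial_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Total time when each chunk is copied in, computed, and copied out
    strictly one step after another (no overlap)."""
    return n_chunks * (t_h2d + t_kernel + t_d2h)


def pipelined_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Total time when the three engines work concurrently on different chunks.

    Assumption (hypothetical device model): each engine handles one chunk at
    a time, in order, and a stage starts as soon as both its input is ready
    and its engine is free.
    """
    h2d_free = 0.0     # time at which the host-to-device copy engine is idle
    kernel_free = 0.0  # time at which the execution engine is idle
    d2h_free = 0.0     # time at which the device-to-host copy engine is idle

    for _ in range(n_chunks):
        # Copy the next input chunk to the device.
        h2d_done = h2d_free + t_h2d
        h2d_free = h2d_done

        # Compute once the input has arrived and the execution engine is idle.
        kernel_done = max(h2d_done, kernel_free) + t_kernel
        kernel_free = kernel_done

        # Copy the result back once it exists and the copy engine is idle.
        d2h_done = max(kernel_done, d2h_free) + t_d2h
        d2h_free = d2h_done

    return d2h_free
```

With illustrative timings of 1.0 for each copy and 3.0 for the kernel, four chunks take 20.0 time units serially and 14.0 when pipelined. In this compute-bound case only the first input copy and the last output copy remain exposed, which is the intuition behind a pipeline that hides PCI-E transfer time.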
Original language: English
Pages (from-to): 551-568
Journal: The Journal of Supercomputing
Issue number: 2
Early online date: 13 Sept 2017
Publication status: Published - Feb 2018


  • Heterogeneous clouds
  • GPU
  • Intel Xeon Phi
  • Matrix multiplication
  • Out-of-core
  • CUDA
  • Intel MKL

