Programming Reconfigurable Heterogeneous Computing Clusters Using MPI With Transpilation

Transpilation principle (copyright IEEE)

How can the execution of a collective program be optimized for heterogeneous CPU + FPGA clusters?

Part of the answer is transpilation. You can find out more in my recent paper, Programming Reconfigurable Heterogeneous Computing Clusters Using MPI With Transpilation (DOI: 10.1109/H2RC51942.2020.00006), published in the Proceedings of the Sixth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC’20).

Abstract

With the slowdown of Moore’s law and the end of Dennard scaling, energy efficiency of compute hardware translates to compute power. Therefore, High-Performance Computing (HPC) systems tend to rely more and more on accelerators such as Field-Programmable Gate Arrays (FPGAs) to fuel highly demanding workloads, like Big Data applications or Deep Neural Networks. These FPGAs are reconfigurable and sometimes no longer bus-attached to a CPU but directly connected to the data center network fabric as standalone nodes. This mix of CPUs and FPGAs leads to the creation of Reconfigurable Heterogeneous HPC (ReH2PC) clusters for which no established programming model exists, despite many proposals in the past. In contrast to this, the Message Passing Interface (MPI) has evolved as the de facto standard to program classical HPC clusters, due to its high reusability and fast development of applications. This paper revisits the programming model of ReH2PC clusters and argues that MPI is suitable for programming heterogeneous clusters of FPGAs and CPUs. We demonstrate a one-click solution for compiling and deploying a standard MPI application on ReH2PC clusters. Our framework implements a High-Level Synthesis (HLS) library, a specific run-time environment for FPGAs and CPUs, and a transpiler that closes the semantic gap between the MPI API and FPGA designs. Our experiments with 31 FPGAs show an average speedup of 4 and a 90% reduction of power consumption compared to a cluster of CPUs.
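A note on the two headline numbers: the abstract reports a speedup and a power reduction separately, and it can help to see what they might imply together for energy per run. The sketch below is only a back-of-the-envelope reading. It assumes that the 90% figure refers to average power draw during the run and that both figures apply to the same workload; the paper may define them differently. The function name `relative_energy` is made up for this illustration.

```python
def relative_energy(speedup: float, power_reduction: float) -> float:
    """Energy of the accelerated run as a fraction of the baseline run.

    Assumption (not taken from the paper): energy = average power * run time,
    so the ratio is (1 - power_reduction) / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0 <= power_reduction < 1:
        raise ValueError("power_reduction must be in [0, 1)")
    return (1.0 - power_reduction) / speedup


# Figures quoted in the abstract: average speedup of 4, 90% lower power.
ratio = relative_energy(speedup=4.0, power_reduction=0.90)
print(f"energy relative to the CPU cluster: {ratio:.3f}")
```

Under these assumptions the FPGA cluster would need roughly one fortieth of the energy of the CPU cluster for the same job. If the 90% figure in the paper already describes energy rather than power, this multiplication does not apply.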

Paper

You can find the PDF here.


