Distributed large-scale graph processing on FPGAs | JoVE Visualize

Area of Science:

Computer Science
Hardware Acceleration
High-Performance Computing

Background:

Large-scale graph processing faces challenges due to irregular memory access patterns, leading to performance degradation on CPUs and GPUs.
Field-Programmable Gate Arrays (FPGAs) offer parallel processing capabilities but are limited by on-chip memory, causing data transfer bottlenecks.
Efficient graph partitioning and distributed multi-FPGA architectures are crucial for overcoming resource limitations and improving data locality.
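
To illustrate why partitioning helps locality, the following minimal Python sketch groups an edge list by destination-vertex interval, so that the vertex data touched by one partition stays within a fixed budget. The function name `partition_by_destination` and the parameter `vertices_per_partition` are hypothetical choices for illustration; the summary does not specify the partitioning scheme the study uses.

```python
from collections import defaultdict


def partition_by_destination(edges, vertices_per_partition):
    """Group (src, dst) edges by the vertex interval that contains dst.

    Illustrative assumption: each partition covers a contiguous range of
    `vertices_per_partition` destination vertices.
    """
    if vertices_per_partition <= 0:
        raise ValueError("vertices_per_partition must be positive")
    partitions = defaultdict(list)
    for src, dst in edges:
        # Integer division maps a destination vertex to its interval index.
        partitions[dst // vertices_per_partition].append((src, dst))
    return dict(partitions)


# Small example graph with 4 vertices and 5 directed edges.
edges = [(0, 1), (1, 2), (2, 0), (3, 1), (0, 3)]
parts = partition_by_destination(edges, vertices_per_partition=2)
```

Every edge in a partition writes to the same small range of destination vertices, which is the property that lets a device with limited on-chip memory keep that range resident while it streams the edges.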

Purpose of the Study:

To propose an FPGA processing engine that overlaps, hides, and customizes data transfers so that the FPGA is fully utilized.
To integrate this engine into a framework for FPGA clusters, enabling efficient distribution of large-scale graphs using offline partitioning.
To demonstrate high-performance graph processing on massive datasets that exceed single-device memory capacity.

Main Methods:

Development of an FPGA processing engine designed to overlap, hide, and customize data transfers.
Integration of the engine into a framework utilizing FPGA clusters and an offline partitioning method for graph distribution.
Leveraging Hadoop for higher-level graph mapping and data distribution to the FPGA layer.
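
The idea of overlapping data transfers with computation can be sketched in software as double buffering: a background thread loads the next chunk while the current one is being processed. This is only a host-side analogy in Python, not the engine described in the study; `process_with_overlap`, `load`, `compute`, and `depth` are hypothetical names introduced for illustration.

```python
import queue
import threading


def process_with_overlap(chunks, load, compute, depth=2):
    """Prefetch chunks on a background thread while earlier ones are computed.

    `depth` bounds how many loaded buffers may wait at once (2 = double
    buffering). Results are returned in the original chunk order.
    """
    buffers = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the stream

    def loader():
        try:
            for chunk in chunks:
                # Blocks when `depth` loaded buffers are already waiting.
                buffers.put(load(chunk))
        finally:
            buffers.put(sentinel)

    threading.Thread(target=loader, daemon=True).start()

    results = []
    while True:
        item = buffers.get()
        if item is sentinel:
            break
        results.append(compute(item))
    return results


# Toy usage: "loading" copies a chunk, "computing" sums it.
totals = process_with_overlap(
    chunks=[[1, 2], [3, 4], [5]],
    load=lambda c: list(c),
    compute=sum,
)
```

When loading and computing take similar time, the loader can keep a buffer ready for the consumer, so much of the transfer time is hidden behind computation instead of being added to it.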

Main Results:

The proposed FPGA solution achieves significant speedups for graph algorithms like PageRank, outperforming state-of-the-art CPU and GPU implementations.
For large-scale graphs, the FPGA solution reaches a 26x speedup, compared with 12x for the CPU solution, while the GPU solution is held back by memory limitations.
The proposed method is 28x faster than other FPGA solutions, and a multi-FPGA system offers an additional 12x performance improvement.
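
For readers unfamiliar with the benchmark, the following is a minimal CPU reference sketch of PageRank by power iteration in Python. It is not the FPGA implementation evaluated in the study; the damping factor of 0.85 and the fixed iteration count are conventional defaults assumed here, and the function name `pagerank` is illustrative.

```python
def pagerank(edges, num_vertices, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed graph given as (src, dst) edges."""
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        # Rank held by vertices without out-edges is spread uniformly.
        dangling = sum(rank[v] for v in range(num_vertices) if out_degree[v] == 0)
        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices
        new_rank = [base] * num_vertices
        # Each edge passes an equal share of its source's rank to its destination.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


ranks = pagerank([(0, 1), (1, 2), (2, 0), (3, 1), (0, 3)], num_vertices=4)
cycle_ranks = pagerank([(0, 1), (1, 2), (2, 0)], num_vertices=3)
```

The inner loop reads `rank[src]` and updates `new_rank[dst]` for arbitrary vertex pairs, which is the irregular memory access pattern that makes the algorithm hard to accelerate.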

Conclusions:

Graph partitioning combined with the proposed FPGA architecture delivers high performance for graphs with millions of vertices and billions of edges.
The framework effectively addresses the challenge of limited on-chip memory in FPGAs by optimizing data transfers and enabling distributed processing.
The research highlights the efficiency of the FPGA implementation for large datasets, showing its potential for next-generation graph processing acceleration.