The paper is devoted to a fundamental problem in the area of combinatorial optimization--the assignment problem. The goal is to find a maximum (or minimum) weight matching in a weighted bipartite graph. The authors focus on the Hungarian algorithm, which solves the linear assignment problem within polynomial time with respect to the number of agents. Alternative approaches that could be used instead of the Hungarian algorithm include, for example, adaptations of the primal simplex algorithm and the auction algorithm.
The main contribution of the paper is the development and evaluation of parallel versions of the Hungarian algorithm. The parallelization is performed in CUDA and targets graphics processing units (GPU) by the NVIDIA Corporation. The most time-intensive phase of the Hungarian algorithm is the augmenting path search; therefore, the authors develop its efficient implementation and they provide detailed pseudocode for both the classical Hungarian algorithm and the alternating tree Hungarian algorithm. The paper contains a detailed description of parallelization for each step (initial reduction, optimality check, augmenting patch search, dual update, and so on) of the algorithm.
In order to target a broad variety of parallel architectures, the authors develop parallel variants for: a single central processing unit (CPU) using OpenMP, a single GPU using CUDA, and multiple GPUs using CUDA. The paper addresses the challenging aspect of storing the cost matrix in the GPU by suggesting a grid-like architecture with multiple compute nodes, each containing one CPU-GPU pair. For implementing the communication between CPUs, the message passing interface (MPI) is employed. The experimental evaluation is conducted on problems of small and large size; it shows that the implementation versions using GPU are much faster than both the sequential and the OpenMP implementation using a common multicore CPU.