Parallel GPU implementation of reconstruction algorithms


While the imaging methods developed in TR&D #1 significantly speed up data acquisition, this comes at the cost of increased computational complexity during reconstruction. A reconstruction time of several hours is more an inconvenience than a major restriction for research applications, but it prohibits the use of these approaches in clinical settings, where reconstructed images have to be available immediately after the scan is completed. However, because the basic data structures in medical imaging are arrays of images, the core of nearly all image reconstruction problems consists of FFTs and basic matrix operations. These have very high parallelization potential and can therefore be sped up significantly by parallel implementations on multi-core systems.

Figure 1: Comparison of the temporal evolution of computational power of GPUs and CPUs (Source: NVIDIA CUDA 5 C Programming Guide)

A recent trend in high-performance computing is the use of modern graphics processing units (GPUs) for general-purpose computing. Although the capabilities of an individual GPU core are limited, the hardware can run a large number of threads in parallel. This makes a GPU very well suited for most matrix-vector and vector-vector operations, as they mostly rely on the multiplication and summation of elements. Figure 1 shows a comparison of the temporal evolution of the potential computational power of GPUs and CPUs from 2001 to 2009.

The development of code suitable for GPUs has been eased by the release of modern programming interfaces such as NVIDIA’s CUDA toolkit and OpenCL, and of compiler directives such as OpenACC. However, not all code that runs on an ordinary PC is suitable for graphics hardware. First, a GPU runs a massive number of threads, which means that only parallel algorithms will see a speed-up. Serial algorithms should still be run on the host CPU to avoid overhead such as memory transfers to and from the graphics card. Second, care must be taken when accessing elements in GPU memory. Although this requirement is relaxed somewhat on the latest graphics hardware, efficient code still requires that a block of threads access an aligned block of memory. Writing GPU implementations that achieve the computational performance needed for the necessary acceleration therefore remains a challenging task.
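The memory-access requirement can be made concrete with a small counting model. The following Python sketch is not GPU code and is not taken from the reconstruction software; it only counts how many aligned memory segments one group of threads would touch under different indexing patterns. The group size of 32 threads, the 4-byte element size and the 128-byte segment size are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Simplified model of aligned, contiguous GPU memory access (illustrative only).
# Assumptions, not taken from the text above: 32 threads per group (warp),
# 4-byte elements, and memory served in aligned 128-byte segments.

WARP_SIZE = 32
ELEMENT_BYTES = 4
SEGMENT_BYTES = 128


def segments_touched(element_indices):
    """Count the distinct aligned segments hit by one group's accesses."""
    return len({(i * ELEMENT_BYTES) // SEGMENT_BYTES for i in element_indices})


def warp_indices(first_element, stride):
    """Element index read by each thread of the group for a given stride."""
    return [first_element + t * stride for t in range(WARP_SIZE)]


aligned = segments_touched(warp_indices(0, 1))     # consecutive and aligned
misaligned = segments_touched(warp_indices(1, 1))  # consecutive, shifted by one element
strided = segments_touched(warp_indices(0, 32))    # e.g. a column of a row-major matrix

print(aligned, misaligned, strided)
```

Under these assumptions the aligned pattern is served from a single segment, a shift by one element already doubles that, and the strided pattern needs one segment per thread. Real hardware adds caches and other effects, so the counts indicate the trend rather than measured performance.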

Figure 2 illustrates the potential of parallel GPU implementations for a well-known problem in MR image reconstruction: regridding of 3D non-Cartesian data. Computation times are shown for regridding a 3D radial sampling trajectory to a 160x160x160 image matrix with 10 receiver coils, for different amounts of subsampling (from 0.0625, corresponding to an acceleration rate of 16, to full sampling). The comparison of CPU (Intel XEON 3GHz with 8 Cores and 12GB of memory using Matlab) and GPU (NVIDIA GTX 680 with 4GB of graphics memory using CUDA) computation times illustrates the pronounced speedup of the GPU implementation, as well as the increase in speedup as the data set gets larger. The reason is that for larger data sets, the higher computational performance of the GPU increasingly outweighs the overhead of data transfer between host and GPU.

Figure 2: Comparison of CPU (Intel XEON 3GHz with 8 Cores and 12GB of memory using Matlab) and GPU (NVIDIA GTX 680 with 4GB of graphics memory using CUDA) computation times of regridding from a 3D radial sampling pattern.
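To show why regridding parallelizes well, the operation can be sketched in a few lines of Python. This is a minimal 2D illustration under stated assumptions, not the CUDA code referred to on this page: sample positions are given in grid units, a simple bilinear kernel stands in for the wider kernels (for example Kaiser-Bessel) typically used in practice, density compensation is omitted, and the function name `grid_samples` and the example spoke are hypothetical.

```python
# Minimal 2D gridding sketch (illustrative only, not the CUDA implementation).
# Assumptions: positions are in grid units, a bilinear (width-2) kernel
# replaces the usual gridding kernel, and density compensation is omitted.
import math


def grid_samples(samples, size):
    """Spread non-Cartesian samples (kx, ky, value) onto a size x size grid."""
    grid = [[0j] * size for _ in range(size)]
    for kx, ky, value in samples:  # each sample is an independent unit of work
        x0, y0 = math.floor(kx), math.floor(ky)
        fx, fy = kx - x0, ky - y0
        for dx, wx in ((0, 1.0 - fx), (1, fx)):
            for dy, wy in ((0, 1.0 - fy), (1, fy)):
                x, y = x0 + dx, y0 + dy
                if 0 <= x < size and 0 <= y < size and wx * wy > 0.0:
                    # In a parallel version, several threads may update the
                    # same cell, so this accumulation would have to be atomic.
                    grid[y][x] += wx * wy * value
    return grid


# Hypothetical samples along one radial spoke through the grid centre.
size = 8
angle = math.pi / 6
spoke = [(size / 2 + r * math.cos(angle), size / 2 + r * math.sin(angle), 1 + 0j)
         for r in (-3, -2, -1, 0, 1, 2, 3)]
grid = grid_samples(spoke, size)
total = sum(sum(row) for row in grid)
```

Because the loop body for one sample does not depend on any other sample, the outer loop is the natural place to assign one thread per sample. The only shared state is the grid itself, which is why the accumulation step is the part that needs care in a parallel implementation.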

CUDA reconstruction code can be downloaded from the resources section of our webpage.

Key Personnel: 
Alicia Yang, Li Feng, Florian Knoll, Ricardo Otazo, Tobias Block, Daniel Sodickson



