No one can contest the benefits of running complex, computationally intensive engineering applications on GPUs rather than CPUs. Algorithms such as video processing for image detection and tracking, or pattern recognition in 3D space, can all be executed at high speed, and GPUs can drastically reduce the processing time of these applications.
Many OEMs and ODMs manufacturing products and solutions that require heavy processing are switching to GPUs. However, these companies face the challenge of porting their existing, tried-and-tested algorithms onto GPUs. Modern programming frameworks such as NVIDIA's Compute Unified Device Architecture (CUDA) have made GPU programming more straightforward and approachable; the challenge, however, is to optimize the code so that it performs effectively on GPUs. In the projects we have executed at eInfochips, we have observed that the benefits of fast processing are realized only if adequate steps are followed from the beginning of the porting effort. We have therefore derived and documented our tried-and-tested methodology for porting applications to GPUs. The methodology is carried out in three phases to help derive the benefits of GPUs.
This phase is aimed at identifying whether any parallelism is possible and at weighing the costs against the benefits.
This is done by identifying the parts of the program in which the CPU spends the most processing time, and the threads that can be created from the serial algorithm. In some applications there may be no scope for parallelization because events occur serially; porting such an application to the GPU will not yield any performance improvement. Performing this step at the beginning of the porting effort helps us predict the possible performance gain and decide whether it is worth the effort.
In many cases where the feasibility of a performance gain could not be established, and even in cases where it has been, modifying the existing algorithm can lead to additional performance gains. When redesigning the algorithm, the following steps have to be taken care of:
In many applications only parts of the algorithm need intensive computation; identifying these parts helps focus the optimization effort where it matters most. To profile the application on the CPU we use valgrind/callgrind and gprof. These utilities highlight the areas of the application that can be targeted for optimization in the early phases.
Once these benefits are established, we move on to the actual porting of the algorithm.
This phase involves the following steps:
NVIDIA and third parties provide drop-in GPU-accelerated libraries that can be used to port applications to the GPU quickly. Libraries such as cuFFT can replace the open-source FFTW, and cuBLAS can replace MKL BLAS.
This is the most critical phase of algorithm porting for ensuring that a performance gain is achieved. It requires a thorough understanding of the GPU execution model (warps, threads, blocks, kernels, streams, etc.), of the tools provided by the GPU manufacturer, and of the relative benefits and trade-offs of each type of optimization.
Programmers should always keep in mind the importance of testing, and it should be done iteratively at each stage.
eInfochips is a consulting partner of NVIDIA and helps customers port applications to various platforms including Tesla, Tegra and Quadro. Apart from application porting, eInfochips also supports customers using GPUs in their product development programs with the following services: