Methodology for Application Porting on GPUs

Running complex, computationally intensive engineering applications on GPUs rather than CPUs offers clear benefits. Algorithms such as video processing for image detection and tracking, or pattern recognition in 3D space, can be performed at high speed, and GPUs can drastically reduce the processing time of these applications.

Many OEMs and ODMs whose products and solutions require heavy processing are switching to GPUs. However, these companies face the challenge of porting their existing, tried-and-tested algorithms to GPUs. Modern programming frameworks such as NVIDIA’s Compute Unified Device Architecture (CUDA) have made GPU programming more straightforward and friendly; the remaining challenge is to optimize the code so that it performs effectively on GPUs. In the projects we have executed at eInfochips, we have noticed that the benefits of fast processing are realized only if adequate steps are followed from the very beginning of the porting effort. We have therefore derived and documented our tried-and-tested methodology for application porting on GPUs. The methodology is carried out in three phases.

Phase 1 – Feasibility

This phase aims to identify whether any parallelism is possible and to weigh the costs against the benefits. It answers two questions:

  1. Is it possible to parallelize the application?
  2. Do the benefits justify the cost of implementation?

This can be done by identifying the part of the program in which the CPU spends the most time and identifying the threads that can be created from the serial algorithm. Some applications cannot be parallelized because their events must occur serially; porting such an application to a GPU will not yield any performance improvement. Performing this step at the beginning of porting helps predict the achievable performance improvement and whether it is worth the effort.

In many cases, whether or not feasibility has been established, modifying the existing algorithm can lead to additional performance improvement. Re-designing the algorithm involves the following steps.

Re-designing the existing algorithm

  1. Before re-designing the algorithm, it is important to understand:
    • The actual requirements
    • The current design/implementation
    • How parallelism can be achieved (using kernels and/or creating new CPU threads that drive the GPU cores)
    • The hardware capabilities and limitations
  2. The new design must not leave any requirement unmet
  3. Optimization should be done in this phase

In many applications only parts of the algorithm need intensive computation. Identifying these parts helps focus the optimization effort where it matters most. To profile the application on the CPU we use Valgrind's Callgrind tool and gprof. These utilities highlight the areas of the application that are good targets for optimization in the early phases.
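The cost-versus-benefit question of this phase can be roughed out with Amdahl's law once the profiler has shown how much of the run time sits in the parallelizable part. The symbols are introduced here for illustration and are not part of the original methodology: p is the fraction of run time that can be parallelized and s is the speedup achieved on that fraction. The values 0.8 and 20 are made up.

```latex
S(p, s) = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad
S(0.8, 20) = \frac{1}{0.2 + 0.04} \approx 4.17, \qquad
\lim_{s \to \infty} S(0.8, s) = \frac{1}{0.2} = 5
```

In this made-up case, even an arbitrarily fast GPU port cannot exceed a 5x overall gain. The estimate also ignores host-to-device transfer time, so it should be read as an upper bound rather than a prediction.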

Once these benefits are established, we move on to the actual porting of the algorithm.

Phase 2 – Porting

This phase involves the following steps:

  1. Convert the algorithm to a multi-threaded algorithm running on a single CPU
    • Verify that the transformation is correct. This is necessary because the transformation may involve redesigning parts of the algorithm
    • Keep in mind that synchronization and atomicity are critical and have to be ensured. This intermediate step is worthwhile because the implementation can be verified for correctness on workstation systems, where debugging tools and methodologies are well established, proven and familiar to programmers
  2. Convert the algorithm from step 1 to GPU-level parallelism, possibly involving multiple kernels and state machines
    • Ensure correctness of the transformation, including proof of termination and proof of correctness, because a small amount of further algorithm change may be involved to better utilize the GPU
    • Modify the algorithm to ensure equivalence of the synchronization and atomicity primitives between CPU and GPU
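Step 1 might look roughly like the following C++ sketch. The `work` function, the chunking scheme and the names `sum_serial` and `sum_threaded` are hypothetical stand-ins chosen for illustration; a real port would substitute the application's own hot loop. The serial version is kept as a reference so that the threaded version can be checked against it.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-element work, standing in for the real algorithm's hot loop.
inline std::uint64_t work(std::uint32_t x) {
    return static_cast<std::uint64_t>(x) * x + 1;
}

// Serial reference, left unchanged so the transformed version can be verified.
std::uint64_t sum_serial(const std::vector<std::uint32_t>& in) {
    std::uint64_t total = 0;
    for (std::uint32_t x : in) total += work(x);
    return total;
}

// Multi-threaded version: each thread reduces its own contiguous chunk into a
// local variable, then publishes the result with a single atomic add.
std::uint64_t sum_threaded(const std::vector<std::uint32_t>& in,
                           unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> pool;
    const std::size_t n = in.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        pool.emplace_back([&in, &total, begin, end] {
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) local += work(in[i]);
            // Atomicity: the shared total is only ever touched through fetch_add.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    // Synchronization: join() makes every thread's add visible before the load.
    for (std::thread& th : pool) th.join();
    return total.load();
}
```

Comparing `sum_threaded(data, 4)` with `sum_serial(data)` on the workstation gives the correctness check described in step 1. Because this sketch sums unsigned integers, the two results match exactly; a floating-point reduction would typically need a tolerance, since the order of additions changes. The per-chunk loop is the part that would later be mapped to a GPU kernel in step 2.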

NVIDIA and third parties provide drop-in GPU-accelerated libraries that can be used to port applications to the GPU quickly. For example, cuFFT can replace the open-source FFTW, and cuBLAS can replace MKL BLAS.

Phase 3 – Optimization

This is the most critical phase of porting, because it determines whether a performance gain is actually achieved. It requires a thorough understanding of GPU execution (warps, threads, blocks, kernels, streams, etc.), of the tools provided by the GPU manufacturer, and of the relative benefits and trade-offs of each type of optimization.

Programmers should always keep in mind the importance of testing, which should be done iteratively at each stage.

eInfochips is a consulting partner for NVIDIA and helps customers port applications to various platforms, including Tesla, Tegra and Quadro. Apart from application porting, eInfochips also supports customers using GPUs in their product development programs with the following services:

  • Custom Hardware Design using GPUs
  • Software and Application Development
  • CUDA Consulting
  • Application Porting
  • Product QA and Test Automation
  • CUDA Training
