We received a tremendous response to our previous blog post, "Image Processing using CUDA Programming on NVIDIA". This next installment answers in detail the questions that many readers raised. With processing requirements surging and multimedia workloads booming, companies across verticals are looking to GPUs to accelerate parallel processing. Here are the steps that we followed to achieve a 25x performance advantage over the CPU for algorithms designed for oil and gas exploration analysis.
|Steps||Execution Time (Seconds)||Performance Factor|
|CPU execution time||448||1X|
|Basic algorithm porting||92||4.8X|
|Register usage optimization||75||5.9X|
|Improve usage of L1 Cache||24||18X|
|Concurrent kernel execution||9||49X|
|Increase Memory Bandwidth||7.3||61X|
|Use Fast Math Option||6.6||67X|
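The performance factor in the table is the CPU execution time divided by the execution time after each step. As a minimal sketch, using only the timings from the table (the function and variable names are illustrative and not part of the application), the same division also gives the incremental gain of each step over the one before it:

```python
# Timings (seconds) copied from the table above.
STEPS = [
    ("CPU execution time", 448.0),
    ("Basic algorithm porting", 92.0),
    ("Register usage optimization", 75.0),
    ("Improve usage of L1 Cache", 24.0),
    ("Concurrent kernel execution", 9.0),
    ("Increase Memory Bandwidth", 7.3),
    ("Use Fast Math Option", 6.6),
]


def speedups(steps):
    """Return (name, factor vs. CPU baseline, factor vs. previous step) per row."""
    baseline = steps[0][1]
    previous = baseline
    rows = []
    for name, seconds in steps:
        rows.append((name, baseline / seconds, previous / seconds))
        previous = seconds
    return rows


if __name__ == "__main__":
    for name, cumulative, incremental in speedups(STEPS):
        print(f"{name:30s} {cumulative:6.1f}X total  {incremental:5.2f}X over previous step")
```

Read this way, the largest single gain after the initial port comes from the L1 cache step (75 s down to 24 s, about 3.1X), followed by concurrent kernel execution (24 s down to 9 s, about 2.7X). The table appears to truncate its factors rather than round them: 448 / 6.6 is about 67.9.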
The first step is to identify the functions that take up a large share of the processing time and to port their for loops to the GPU. The following table shows gprof profiling output of the Kirchhoff’s Depth Migration application. We see that the mig2d and sum2 functions account for around 94% of the total execution time, so we port the for loops of these two functions first. The data structures used by these loops are created on the GPU, and every loop in these functions is converted into a GPU kernel.
|%Time||Cumulative Seconds||Self Seconds||Calls||Self s/call||Total s/call||Name|
The NVIDIA Visual Profiler shows that the number of registers used per thread is preventing the maximum number of blocks (16) from executing concurrently on each SMX. To reduce registers per thread, kernel local variables and parameters are optimized and maxrregcount is set to 32. This limits each thread to 32 registers; variables that cannot be held in registers go to local memory, which is called register spilling. Now 32 registers x 2048 threads (the maximum per SMX) = 65536, which is exactly the number of registers available per SMX. This allows 16 blocks to execute per SMX, and hence occupancy is improved.
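The register arithmetic above can be checked with a short sketch. It assumes the per-SMX limits quoted in the text (65536 registers, 2048 resident threads, 16 resident blocks) and 128-thread blocks, and it ignores the hardware's register allocation granularity, so it is a simplified model rather than a substitute for the CUDA occupancy calculator. The function names and the pre-optimization register counts (64 and 40) are hypothetical; the source does not state the original register usage.

```python
# Per-SMX limits quoted in the text for the Kepler-based Tesla K40.
REGISTERS_PER_SMX = 65536
MAX_THREADS_PER_SMX = 2048
MAX_BLOCKS_PER_SMX = 16


def max_registers_for_full_occupancy():
    """Largest per-thread register count that still lets 2048 threads be resident."""
    return REGISTERS_PER_SMX // MAX_THREADS_PER_SMX


def resident_blocks(registers_per_thread, threads_per_block):
    """Blocks per SMX when limited by registers, threads and the block cap.

    Simplification: ignores register allocation granularity and shared memory.
    """
    by_registers = REGISTERS_PER_SMX // (registers_per_thread * threads_per_block)
    by_threads = MAX_THREADS_PER_SMX // threads_per_block
    return min(by_registers, by_threads, MAX_BLOCKS_PER_SMX)


if __name__ == "__main__":
    print("register budget per thread:", max_registers_for_full_occupancy())
    # 64 and 40 are hypothetical register counts; 32 is the maxrregcount setting.
    for regs in (64, 40, 32):
        print(regs, "registers/thread ->", resident_blocks(regs, 128), "blocks per SMX")
```

Under these assumptions, only a cap of 32 registers per thread leaves room for all 16 blocks; at 64 registers per thread the register file would hold just 8 blocks of 128 threads.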
The Tesla K40 has two clocks: the Base Clock and the Boost Clock. On power up, the Base Clock (745 MHz) is selected, which is based on the worst-case reference workload. Boost clocks can be selected for workloads that draw less power than that reference and can therefore run at a more aggressive clock. The power consumption of this application does not exceed 145 W, so the Graphics Clock frequency is set to 875 MHz, which increases power consumption by 20 W but improves performance.
Our application uses more than 128 threads per block (around 750), which prevents the maximum number of blocks (16) from executing concurrently on an SMX, because the maximum number of threads per SMX is 2048. This reduces device occupancy, as fewer blocks are assigned to each SMX, and the SMX sits idle whenever the instructions of its assigned blocks are not ready. To improve occupancy, the number of blocks per SMX is increased by splitting the work into blocks of 128 threads, so that the warp scheduler has more blocks from which to choose ready instructions.
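The effect of block size on the number of resident blocks can be sketched as follows. The model considers only the thread and block limits quoted in the text (2048 threads and 16 blocks per SMX) and counts threads rather than warps, so the percentages are approximate. The function name is illustrative, and 750 is the approximate block size mentioned above.

```python
# Per-SMX limits quoted in the text for the Kepler SMX.
MAX_THREADS_PER_SMX = 2048
MAX_BLOCKS_PER_SMX = 16


def thread_limited_occupancy(threads_per_block):
    """Return (resident blocks, resident threads, occupancy) for one SMX.

    Considers only the thread and block limits, not registers or shared memory.
    """
    blocks = min(MAX_THREADS_PER_SMX // threads_per_block, MAX_BLOCKS_PER_SMX)
    threads = blocks * threads_per_block
    return blocks, threads, threads / MAX_THREADS_PER_SMX


if __name__ == "__main__":
    # 750 is the approximate original block size; 128 is the size after the change.
    for size in (750, 128):
        blocks, threads, occupancy = thread_limited_occupancy(size)
        print(f"{size} threads/block: {blocks} blocks, {threads} threads, {occupancy:.0%} occupancy")
```

With roughly 750 threads per block only two blocks fit on an SMX, whereas with 128 threads per block all 16 block slots are filled and the 2048-thread limit is reached.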
By default, L1 caching in Kepler GPUs is reserved for local memory accesses, such as register spills and stack data, and global loads are cached in L2 only. Our application uses large global arrays, so caching these arrays in L1 as well improves data access performance. This mode is activated by passing the -Xptxas -dlcm=ca flag to nvcc during code compilation.
The Visual Profiler shows that overall GPU utilization is only 45%, because the GPU remains idle while the host is performing other work. To solve this problem, the task is divided among multiple host threads, and each thread operates on its own stream. While one thread is busy with host work, another thread can keep the GPU occupied. In our application the task is divided across 4 streams, which increases GPU utilization to as much as 90%.
Pinned memory is used for small buffers that are repeatedly transferred from host to device. Pinned memory provides higher data transfer bandwidth because it is page-locked and is never swapped out of physical memory. Some large buffers that are used by both host and device are instead allocated as managed memory, so that host and device can both access them without explicit transfers.
The compiler has an option, -use_fast_math, that forces some math functions to compile to their intrinsic counterparts. The intrinsics are faster because they map to fewer native instructions. Note that this option may reduce the accuracy of the mathematical computations in the code.
Using NVIDIA CUDA programming and best practices, we were able to port the Kirchhoff depth migration application to the GPU in a short period of time and achieve a 25x improvement in execution performance. eInfochips offers CUDA Consulting, Porting and System Design Services for companies looking to use NVIDIA GPUs for their products. To estimate the performance benefit for your algorithm or application, please write to firstname.lastname@example.org