CUDA Optimization Steps for NVIDIA GPUs

We received a tremendous response to our previous blog post, Image Processing using CUDA Programming on NVIDIA. This next installment answers in detail the questions that many readers raised. With processing requirements surging and multimedia booming, companies across verticals are looking to GPUs to accelerate parallel workloads. Here are the steps we followed to achieve a 25x performance advantage over the CPU for algorithms designed for Oil & Gas exploration analysis.

| Step | Execution Time (Seconds) | Performance Factor |
|---|---|---|
| CPU execution time | 448 | 1X |
| Basic algorithm porting | 92 | 4.8X |
| Register usage optimization | 75 | 5.9X |
| GPU Boost | 72 | 6.2X |
| Increase Occupancy | 66 | 6.7X |
| Improve usage of L1 Cache | 24 | 18X |
| Concurrent kernel execution | 9 | 49X |
| Increase Memory Bandwidth | 7.3 | 61X |
| Use Fast Math Option | 6.6 | 67X |

1. Basic Algorithm Porting

The first step is to identify the functions that consume most of the processing time and to port their for loops to the GPU. The following table shows the gprof profiling output of the Kirchhoff Depth Migration application. The mig2d and sum2 functions together account for about 94% of total execution time, so we port the for loops of these two functions first. The data structures used by these loops are created on the GPU, and every loop in these functions is converted into a GPU kernel.

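| % Time | Cumulative Seconds | Self Seconds | Calls | Self ms/call | Total ms/call | Name |
|---|---|---|---|---|---|---|
| 83.23 | 383.83 | 383.83 | 23040 | 16.66 | 16.66 | mig2d |
| 10.42 | 428.64 | 44.81 | 69120 | 0.65 | 0.65 | sum2 |
| 0.24 | 429.66 | 1.02 | | | | Pfacc |
| 0.04 | 429.84 | 0.18 | | | | Pfacr |
| 0.02 | 429.92 | 0.08 | 23040 | 0.00 | 0.00 | Filt |
| 0.01 | 429.97 | 0.05 | | | | Pfarc |
| 0.01 | 430.01 | 0.04 | | | | xdrhdrsub |
| 0.01 | 430.04 | 0.03 | | | | alloc2 |
| 0.01 | 430.07 | 0.03 | | | | fgettr_internal |
| 0.00 | 430.09 | 0.02 | | | | Efread |
| 0.00 | 430.11 | 0.02 | | | | main |
| 0.00 | 430.12 | 0.01 | 93 | 0.11 | 0.11 | resit |
| 0.00 | 430.13 | 0.01 | 1 | 10.00 | 10.00 | timeb |
| 0.00 | 430.14 | 0.01 | | | | fgettr |
| 0.00 | 430.15 | 0.01 | | | | free1 |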
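As a rough illustration of what porting a for loop means, here is a minimal sketch. The loop body is a hypothetical stand-in (it is not the real mig2d or sum2 code), and the names accumulateCpu, accumulateKernel and blocksFor are assumptions made for this example. Each GPU thread takes over one iteration of the CPU loop.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// CPU reference loop (a hypothetical stand-in, not the real mig2d body).
void accumulateCpu(const float* trace, float* image, float weight, int n) {
    for (int i = 0; i < n; ++i) image[i] += weight * trace[i];
}

// GPU port: each thread executes one iteration of the loop above.
__global__ void accumulateKernel(const float* trace, float* image, float weight, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) image[i] += weight * trace[i];  // guard the last, partial block
}

// Blocks needed so that blocks * threadsPerBlock covers all n iterations.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1 << 20, threads = 128;
    const size_t bytes = n * sizeof(float);
    std::vector<float> trace(n, 1.0f), image(n, 0.0f), reference(n, 0.0f);
    accumulateCpu(trace.data(), reference.data(), 0.5f, n);

    // Create the loop's data structures on the GPU and copy the inputs over.
    float *dTrace = nullptr, *dImage = nullptr;
    cudaMalloc(&dTrace, bytes);
    cudaMalloc(&dImage, bytes);
    cudaMemcpy(dTrace, trace.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dImage, image.data(), bytes, cudaMemcpyHostToDevice);

    accumulateKernel<<<blocksFor(n, threads), threads>>>(dTrace, dImage, 0.5f, n);
    cudaMemcpy(image.data(), dImage, bytes, cudaMemcpyDeviceToHost);

    printf("CPU %f, GPU %f, status: %s\n", reference[0], image[0],
           cudaGetErrorString(cudaGetLastError()));
    cudaFree(dTrace);
    cudaFree(dImage);
    return 0;
}
```

Comparing the GPU result against the unchanged CPU loop, as main does here, is a cheap way to catch porting mistakes before any tuning starts.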

2. Register usage per thread

Using the NVIDIA Visual Profiler, we found that register usage per thread was preventing the maximum number of blocks (16) from executing on each SMX. To reduce registers per thread, we optimized kernel local variables and parameters and set the nvcc option maxrregcount to 32. This limits each thread to 32 registers; variables that no longer fit in registers go to local memory, which is called register spilling. With 32 x 2048 (registers per thread x max threads per SMX) = 65536 (max registers per SMX), 16 blocks can now execute per SMX, and occupancy improves.
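The register budget above can be checked from code as well as from the profiler. The sketch below is an illustration under assumptions: demoKernel and regsPerThreadBudget are hypothetical names, and the file name in the build line is made up. It uses cudaFuncGetAttributes to read back how many registers and how much local memory (which includes spills) the compiler assigned per thread.

```cuda
// Build (assumed file name): nvcc -maxrregcount=32 -Xptxas -v regs.cu
// -Xptxas -v also prints register and spill counts at compile time.
#include <cstdio>
#include <cuda_runtime.h>

// Registers each thread may use if maxThreadsPerSM threads are to be resident.
constexpr int regsPerThreadBudget(int regsPerSM, int maxThreadsPerSM) {
    return regsPerSM / maxThreadsPerSM;
}

// Hypothetical kernel, present only so there is something to inspect.
__global__ void demoKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;
}

int main() {
    // SMX figures quoted in the text: 65536 registers, 2048 threads.
    printf("budget: %d registers per thread\n", regsPerThreadBudget(65536, 2048));

    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demoKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // numRegs: registers per thread. localSizeBytes: per-thread local memory,
    // which includes any spilled registers.
    printf("registers per thread: %d, local memory per thread: %zu bytes\n",
           attr.numRegs, attr.localSizeBytes);
    return 0;
}
```

A non-zero local memory figure after lowering the limit suggests spilling, so it is worth re-timing the kernel rather than assuming the lower limit is a net win.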

3. GPU Boost

The Tesla K40 has two kinds of clocks: a Base Clock and Boost Clocks. On power-up, the Base Clock (745 MHz) is selected; it is chosen for the worst-case reference workload. Boost Clocks can be selected when the workload draws less power than that worst case, which leaves headroom to run faster. This application draws no more than 145 W, so we set the Graphics Clock frequency to 875 MHz, which increases power consumption by 20 W but improves performance.
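One way to request the higher clock programmatically is through NVML's application-clock calls; the sketch below is an assumption about how this could be done, not a record of how it was done for this project. It keeps the current memory clock and asks for the 875 MHz graphics clock from the text. Changing clocks normally requires administrator rights, and the command-line equivalent is nvidia-smi with its -ac option.

```cuda
// Build (assumed file name): nvcc boost.cu -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        printf("NVML could not be initialised\n");
        return 1;
    }
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) {
        printf("no device 0\n");
        nvmlShutdown();
        return 1;
    }

    // Read the application clocks currently in effect.
    unsigned int memMHz = 0, gfxMHz = 0;
    nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_MEM, &memMHz);
    nvmlDeviceGetApplicationsClock(dev, NVML_CLOCK_GRAPHICS, &gfxMHz);
    printf("current application clocks: memory %u MHz, graphics %u MHz\n", memMHz, gfxMHz);

    // Keep the memory clock, request the 875 MHz graphics clock from the text.
    // The call fails if the pair is not supported or permissions are missing.
    nvmlReturn_t rc = nvmlDeviceSetApplicationsClocks(dev, memMHz, 875);
    printf("set graphics clock to 875 MHz: %s\n", nvmlErrorString(rc));

    nvmlShutdown();
    return 0;
}
```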

4. Increased Occupancy

Our application originally used far more than 128 threads per block (around 750). Because an SMX can hold at most 2048 threads, only two such blocks fit at a time, well short of the 16-block maximum. This lowers device occupancy: with so few blocks resident, the SMX sits idle whenever the instructions of its assigned blocks are not ready. To improve occupancy, we split the work into blocks of 128 threads, so that more blocks are resident per SMX and the warp scheduler has more warps from which to pick ready instructions.
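The block-count arithmetic can be sketched as follows. The helper residentBlocks and the kernel demoKernel are hypothetical names for this example; the helper considers only the thread and block limits quoted in the text and ignores register and shared-memory limits. For comparison, the program also asks the runtime via cudaOccupancyMaxActiveBlocksPerMultiprocessor, whose answer depends on the GPU it runs on.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Kepler SMX limits quoted in the text.
constexpr int kMaxThreadsPerSM = 2048;
constexpr int kMaxBlocksPerSM = 16;
constexpr int kWarpSize = 32;

// Blocks that fit on one SMX when only the thread and block limits apply.
int residentBlocks(int threadsPerBlock) {
    // Threads are scheduled in whole warps, so round up to a warp multiple.
    int warps = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    return std::min(kMaxBlocksPerSM, kMaxThreadsPerSM / (warps * kWarpSize));
}

// Hypothetical kernel, present only so the runtime has something to analyse.
__global__ void demoKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int sizes[] = {750, 128};
    for (int k = 0; k < 2; ++k) {
        int byRuntime = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&byRuntime, demoKernel, sizes[k], 0);
        printf("%d threads/block: %d blocks by hand (Kepler limits), %d reported for this GPU\n",
               sizes[k], residentBlocks(sizes[k]), byRuntime);
    }
    return 0;
}
```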

5. L1 Cache Usage

By default, Kepler GPUs reserve the L1 cache for local memory accesses such as register spills and stack data; global loads are cached in L2 only. Our application uses large global arrays, so caching global loads in L1 as well improves data access performance. Enable this mode by passing the -Xptxas -dlcm=ca flag to nvcc at compile time.
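A simple way to see whether the flag helps a given access pattern is to build the same kernel both ways and time it. The sketch below is only an illustration: smooth3 is a hypothetical kernel in which neighbouring threads re-read overlapping global elements, the file name is made up, and the build lines assume a toolkit that still supports the sm_35 target.

```cuda
// Build both ways and compare the reported time (file name is an assumption):
//   nvcc -arch=sm_35 -Xptxas -dlcm=cg l1.cu   // global loads cached in L2 only
//   nvcc -arch=sm_35 -Xptxas -dlcm=ca l1.cu   // global loads cached in L1 as well
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each input element is read by three neighbouring threads.
__global__ void smooth3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(float);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);
    cudaMemset(out, 0, bytes);

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    smooth3<<<(n + 127) / 128, 128>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```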

6. Concurrent Kernel Execution

The Visual Profiler shows that overall GPU utilization is only 45%, because the GPU sits idle while the host is performing other work. To solve this, the task is divided among multiple host threads, each operating on its own stream. While one thread is busy with host work, another can keep the GPU busy. In our application the task is divided across 4 streams, which increased GPU utilization to 90%.
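The thread-per-stream arrangement might look like the sketch below. Everything here is illustrative: worker, runWithStreams and scaleKernel are hypothetical names, and the host-side work is reduced to filling a buffer. Each host thread prepares its own chunk and submits it on a private stream, so GPU work from one thread can run while another thread is still busy on the CPU.

```cuda
// Build (assumed file name): nvcc -std=c++11 -Xcompiler -pthread streams.cu
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// One host thread: does its host-side work, then runs its chunk on its own stream.
void worker(float* hostChunk, int n) {
    const size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float* dev = nullptr;
    cudaMalloc(&dev, bytes);

    for (int i = 0; i < n; ++i) hostChunk[i] = 1.0f;  // stand-in for host work

    cudaMemcpyAsync(dev, hostChunk, bytes, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(n + 127) / 128, 128, 0, stream>>>(dev, 2.0f, n);
    cudaMemcpyAsync(hostChunk, dev, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait only for this thread's stream

    cudaFree(dev);
    cudaStreamDestroy(stream);
}

// Splits the work across nStreams host threads and returns the sum of all results.
double runWithStreams(int nStreams, int chunk) {
    std::vector<float> host(static_cast<size_t>(nStreams) * chunk, 0.0f);
    std::vector<std::thread> threads;
    for (int s = 0; s < nStreams; ++s)
        threads.emplace_back(worker, host.data() + static_cast<size_t>(s) * chunk, chunk);
    for (auto& t : threads) t.join();

    double sum = 0.0;
    for (float v : host) sum += v;
    return sum;
}

int main() {
    printf("sum = %.0f\n", runWithStreams(4, 1 << 20));
    return 0;
}
```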

7. Increase Memory Bandwidth

Pinned memory is used for small buffers that are repeatedly transferred from host to device. It provides higher transfer bandwidth because it is page-locked: it is never swapped out of physical memory, so the GPU can copy from it directly. Large buffers that are used by both host and device are instead allocated as managed memory, so that host and device can access them without explicit transfers.
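A minimal sketch of the two allocation styles follows, with hypothetical names (addKernel, pinnedAndManagedDemo) and made-up buffer sizes. The small buffer is pinned with cudaMallocHost and copied to the device on every round; the large buffer is allocated with cudaMallocManaged and touched by both host and device with no explicit copy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds one element of the small buffer to every element of the big buffer.
__global__ void addKernel(const float* small, float* big, int nSmall, int nBig) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nBig) big[i] += small[i % nSmall];
}

// Returns the sum over the managed buffer after `rounds` kernel launches.
double pinnedAndManagedDemo(int nBig, int rounds) {
    const int nSmall = 256;
    const size_t smallBytes = nSmall * sizeof(float);
    float *hSmall = nullptr, *dSmall = nullptr, *big = nullptr;

    cudaMallocHost(&hSmall, smallBytes);            // pinned (page-locked) host buffer
    cudaMalloc(&dSmall, smallBytes);                // its device-side copy
    cudaMallocManaged(&big, nBig * sizeof(float));  // shared by host and device

    for (int i = 0; i < nBig; ++i) big[i] = 0.0f;   // host writes, no copy call

    for (int r = 0; r < rounds; ++r) {
        for (int i = 0; i < nSmall; ++i) hSmall[i] = 1.0f;
        // Repeated small transfer from pinned memory.
        cudaMemcpy(dSmall, hSmall, smallBytes, cudaMemcpyHostToDevice);
        addKernel<<<(nBig + 127) / 128, 128>>>(dSmall, big, nSmall, nBig);
    }
    cudaDeviceSynchronize();  // finish GPU work before the host reads managed memory

    double sum = 0.0;
    for (int i = 0; i < nBig; ++i) sum += big[i];

    cudaFreeHost(hSmall);
    cudaFree(dSmall);
    cudaFree(big);
    return sum;
}

int main() {
    printf("sum = %.0f\n", pinnedAndManagedDemo(1 << 20, 3));
    return 0;
}
```

Pinned memory is a limited resource because it cannot be paged out, which fits the text's choice to pin only the small, frequently transferred buffers.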

8. Fast Math Option

The compiler option -use_fast_math forces some math functions to compile to their intrinsic counterparts. The intrinsics are faster because they map to fewer native instructions. Note that this may reduce the accuracy of the mathematical results.
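To get a feel for the accuracy trade-off before switching the option on, one can compare a standard function with its intrinsic directly. The sketch below is an illustration with hypothetical names (sinBoth, maxFastMathError); it evaluates sinf and the intrinsic __sinf on the same inputs and reports the largest difference. If the file itself is built with -use_fast_math, both calls compile to the intrinsic and the difference is zero.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sinBoth(const float* x, float* precise, float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = sinf(x[i]);  // standard version
        fast[i] = __sinf(x[i]);   // intrinsic that -use_fast_math substitutes for sinf
    }
}

// Largest absolute difference between sinf and __sinf over n samples in [0, 3).
float maxFastMathError(int n) {
    const size_t bytes = n * sizeof(float);
    float *x = nullptr, *precise = nullptr, *fast = nullptr;
    cudaMallocManaged(&x, bytes);
    cudaMallocManaged(&precise, bytes);
    cudaMallocManaged(&fast, bytes);
    for (int i = 0; i < n; ++i) x[i] = 3.0f * i / n;

    sinBoth<<<(n + 127) / 128, 128>>>(x, precise, fast, n);
    cudaDeviceSynchronize();  // results must be complete before the host reads them

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) worst = fmaxf(worst, fabsf(precise[i] - fast[i]));

    cudaFree(x);
    cudaFree(precise);
    cudaFree(fast);
    return worst;
}

int main() {
    printf("max |sinf - __sinf| = %g\n", maxFastMathError(1 << 16));
    return 0;
}
```

Running a check like this on the value ranges the application actually uses is a reasonable way to decide whether the accuracy loss is acceptable.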


Using NVIDIA CUDA programming and best practices, we were able to port the Kirchhoff depth migration application to the GPU in a short period of time and achieve a 25x improvement in execution performance. eInfochips offers CUDA Consulting, Porting and System Design Services for companies looking to use NVIDIA GPUs for their products. To estimate the performance benefit for your algorithm or application, please write to

Prerit Kapadia

Prerit Kapadia is a Technical Lead with eInfochips. Prerit has more than 7 years of experience in application design and performance improvement for various domains.
