Boosting medical imaging application performance through CUDA optimization | My journey with NVIDIA’s latest GPUs

Optical Coherence Tomography (OCT) is used primarily in the medical imaging field for non-invasive, three-dimensional imaging of biological tissue, which supports the diagnosis of the patient. Scan speed and scan depth remain the key parameters by which OCT technology advances. To improve them without impacting the cost structure, it becomes important to select the right hardware and software. That is where NVIDIA desktop GPUs (single PCIe slot) come into the picture.

For one such use case, I was tasked with improving the scan rate delivered by already optimized CUDA code that implements the OCT algorithm. One use of OCT in medical imaging is capturing high-resolution images of the retina. It is also successfully employed to aid angiography and eye surgery.

The existing system used a Quadro M4000 GPU, which is based on the Maxwell architecture.

I started with the NVIDIA Visual Profiler to find the hotspots. Interestingly, the profiler showed usage of the double-precision function unit, which raised an alarm because the code was written using floats. It appeared that the unsuffixed constants in the code were being treated as “double”, so the arithmetic around them was promoted to double precision. The fix was to qualify the constants as float using the “f” suffix (for example, 2.0f instead of 2.0).
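As a rough illustration of the constant promotion described above, here is a minimal sketch. The kernels, the helper `scale_on_gpu`, and the constants 1.1 and 0.3 are hypothetical and are not taken from the OCT code; they only show how an unsuffixed literal changes the type of a float expression.

```cuda
#include <cuda_runtime.h>
#include <type_traits>

// Language-level check: an unsuffixed literal is a double, so mixing it
// with a float promotes the whole expression to double.
static_assert(std::is_same<decltype(1.0f * 1.1), double>::value,
              "float * double literal is double");
static_assert(std::is_same<decltype(1.0f * 1.1f), float>::value,
              "float * float literal stays float");

// Hypothetical kernel: the unsuffixed constants force double-precision math.
__global__ void scale_promoted(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 1.1 + 0.3;    // evaluated in double, then narrowed to float
}

// Same kernel with the constants qualified as float.
__global__ void scale_float(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 1.1f + 0.3f;  // stays in single precision
}

// Illustrative host helper: runs one of the kernels on a single value.
float scale_on_gpu(float x, bool use_float_constants) {
    float *d_in = nullptr, *d_out = nullptr, result = 0.0f;
    cudaMalloc(&d_in, sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, &x, sizeof(float), cudaMemcpyHostToDevice);
    if (use_float_constants) scale_float<<<1, 32>>>(d_in, d_out, 1);
    else                     scale_promoted<<<1, 32>>>(d_in, d_out, 1);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Both kernels give nearly the same answer, for example `scale_on_gpu(2.0f, true)` is about 2.5. The difference is which function unit does the work, and that is what shows up in the profiler.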

Apart from that, I also tried various other techniques, listed below, to improve the performance.

  • Use of fast math
  • Use of L1 cache
  • Use of texture memory
  • Use of shuffle instructions to do reduction
  • Reduce register counts inside the kernel
  • Streams
  • Club (merge) kernels working on the same data set
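To make one item from the list above concrete, here is a minimal sketch of a reduction built on warp shuffles. It is not the OCT code: the names `warp_reduce_sum`, `sum_kernel` and `sum_on_gpu` are hypothetical, the launch configuration is arbitrary, and it assumes CUDA 9 or later (for `__shfl_down_sync`) and a block size that is a multiple of the warp size.

```cuda
#include <cuda_runtime.h>

// Sum a value across one warp using shuffles; lane 0 ends up with the total.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Each thread accumulates a partial sum, each warp reduces its partials
// with shuffles, and lane 0 of every warp adds the result to *out.
__global__ void sum_kernel(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)          // grid-stride loop
        v += in[i];
    v = warp_reduce_sum(v);
    if ((threadIdx.x & (warpSize - 1)) == 0)   // lane 0 of each warp
        atomicAdd(out, v);
}

// Illustrative host helper: copies the input over, runs the kernel,
// and returns the sum.
float sum_on_gpu(const float* h_in, int n) {
    float *d_in = nullptr, *d_out = nullptr, total = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    sum_kernel<<<4, 128>>>(d_in, d_out, n);    // 128 threads = 4 full warps per block
    cudaMemcpy(&total, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return total;
}
```

The shuffle version exchanges values directly between registers, so it avoids the shared-memory traffic and synchronization of a classic tree reduction. Whether that is a net win depends on the kernel, which matches the mixed results reported here.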

A few of these methods gave a boost, while others degraded performance. While making changes to the code, I ran into CUDA out-of-bounds memory access errors. Luckily, running the application under cuda-memcheck helped me resolve these issues.
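The article does not show the code that faulted, so the following is only a hypothetical sketch of one common cause of out-of-bounds accesses: a grid size rounded up past the data size with no bounds guard in the kernel. The names `add_one`, `add_one_on_gpu` and `CUDA_OK` are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Report a CUDA runtime error with file and line; returns true on success.
#define CUDA_OK(call) cuda_ok((call), __FILE__, __LINE__)
inline bool cuda_ok(cudaError_t err, const char* file, int line) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s:%d: %s\n", file, line, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // without this guard, the last block writes past the buffer
        data[i] += 1.0f;
}

// Illustrative host helper: returns true if every step completed without error.
bool add_one_on_gpu(float* h_data, int n) {
    float* d = nullptr;
    size_t bytes = n * sizeof(float);
    if (!CUDA_OK(cudaMalloc(&d, bytes))) return false;
    bool ok = CUDA_OK(cudaMemcpy(d, h_data, bytes, cudaMemcpyHostToDevice));
    int threads = 256;
    int blocks = (n + threads - 1) / threads;      // rounded up: may launch more threads than n
    add_one<<<blocks, threads>>>(d, n);
    ok = ok && CUDA_OK(cudaGetLastError());        // errors from the launch itself
    ok = ok && CUDA_OK(cudaDeviceSynchronize());   // errors raised while the kernel ran
    ok = ok && CUDA_OK(cudaMemcpy(h_data, d, bytes, cudaMemcpyDeviceToHost));
    cudaFree(d);
    return ok;
}
```

Running the binary under cuda-memcheck (in recent CUDA toolkits its successor, compute-sanitizer, fills the same role) reports the offending kernel and thread when the guard is missing. The error checks after the launch catch faults that would otherwise only surface at some later, unrelated API call.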

I also tried using half-precision data types. Unfortunately, for the current application half precision did not help, and I had to revert to FP32.

Further, using the guided performance analysis in the Visual Profiler, I improved the scan rate from 100 kHz to around 200 kHz. This would have done the job; however, it needed to be improved further to cover the additional cycles needed to transfer the data to the application for display.

In a short brainstorming session, it emerged that the Pascal-based Quadro P4000 GPU was available in the market. A basic comparison showed that the P4000 offers roughly double the FP32 teraflops of the M4000.

Specification            Quadro M4000    Quadro P4000
CUDA Cores               1664            1792
FP32 TFLOPS              2.66            5.3
Max Power Consumption    120 Watt        105 Watt
Architecture             Maxwell 2       Pascal

With the new GPU, the scan rate went from 200 kHz to 300 kHz! The P4000 made a really big difference.

If you are looking to accelerate medical image classification with NVIDIA Tesla, you can check my previous blog: Tesla K20 & Tesla K40. eInfochips offers CUDA Consulting, Migration, and System Design Services for companies looking to use NVIDIA GPUs in their products.

Visit our Medical Devices web page to know more, or contact us with your queries.
