
Accelerate Medical Image Classification with the NVIDIA Tesla K20

This blog is about our journey porting the WndChrm application to the NVIDIA Tesla K20 GPGPU.

Biological Image Analysis

WndChrm is a medical image classification application developed at the National Institute on Aging. It can be used effectively in biological image analysis involving high volumes of image data. For example, it has been used to identify similarities between genes based on phenotypes observed in RNAi (RNA interference) experiments. Such analysis is useful for identifying potential therapeutic targets, for drug development, and for other applications such as exploring virus-host interactions.


WndChrm is an open-source application; its name is the abbreviated form of Weighted Neighbour Distances using a Compound Hierarchy of Algorithms Representing Morphology. The application uses a diverse set of algorithms to extract features (content descriptors) and builds a training database, called a feature set, to classify an input image. Some of the algorithms/features used include:

  • Radon transform features
  • Chebyshev Statistics
  • Gabor Filters
  • First 4 Moments
  • Tamura texture features
  • Zernike features
  • Haralick features


Compared with other published approaches, WND-CHARM achieves higher accuracy on standard benchmark datasets:

Dataset   Benchmark Algorithm     Features Used   Benchmark Accuracy   WND-CHARM Accuracy
HeLa      Murphy (2004)           37              83%                  86%
Pollen    France et al. (1997)    8               79%                  96%
CHO       Boland et al. (1998)    37              87%                  95%




We began by setting up the NVIDIA Tesla K20 on a Supermicro server. The Tesla K20 has 13 next-generation streaming multiprocessors (SMX), each with 192 cores, and 5 GB of memory; the card occupies two PCIe slots and must be powered through auxiliary power connectors. Ubuntu 14.04 was used as the operating system, with CUDA Toolkit 6.5. The server was equipped with a 2.5 GHz Intel quad-core processor and 16 GB of RAM.

Porting Methodology

The porting process starts with measuring the application's current performance on the CPU. Processing a single image took 82 seconds. Processing a set of 200 RNAi images took around 4 hours 38 minutes on a single console and 2 hours 15 minutes on 4 consoles. We capped the number of consoles at 4, as scaling further did not improve the processing time.

The next step was to use tools like gprof and valgrind to identify the hotspots in the WndChrm application.


The following table shows the gprof output:

% time   cumulative s   self s     calls      self s/call   total s/call   name
53.78    135.92         135.92     6184       0.02          0.02           getChCoeff1D
30.71    213.55         77.63      8          9.70          9.70           conv2comp
 8.99    236.29         22.73      6          3.79          3.79           CombFirst4Moments2D
 1.72    240.63         4.34       1764       0.00          0.00           FeatureCentroid
 1.10    243.40         2.77       6          0.46          0.57           coarseness
 0.73    245.24         1.84       12         0.15          0.16           ImageMatrix::convolve
 0.66    246.91         1.67       2          0.84          0.84           mb_zernike2D
 0.33    247.75         0.84       24         0.04          0.05           f14_maxcorr
 0.26    248.40         0.65       44270464   0.00          0.00           efficientLocalMean
 0.23    248.97         0.57       4          0.14          0.14           TNx
 0.17    249.41         0.44       4          0.11          0.11           radon
 0.15    249.79         0.38       6          0.06          0.06           contrast
 0.08    250.00         0.21       24         0.01          0.01           hessenberg


The gprof output gives good insight into the application's hotspots. One can start by examining the functions that consume the most time, and then port that code to the GPU using CUDA.

(CUDA is a parallel computing platform and programming model that provides C/C++, Fortran, and Python programming interfaces for developing GPU applications and porting applications from CPUs to GPUs.)

After identifying the hotspots, the CUDA programming begins. At this point, one needs to understand the source code layout and the various compilation options of nvcc, the CUDA compiler driver. During development, cuda-memcheck and cuda-gdb are handy tools for identifying and isolating issues in the CUDA code. The NVIDIA Visual Profiler can then be used to profile the code running on the GPU.

The initial development simply gets the code running on the GPU; this first version will not be fully optimized. One then uses the profiler to understand the characteristics of the code running on the GPU and applies different techniques to achieve better performance.

Some of the techniques that can be used to improve performance include:

  • Streams: allow more than one kernel to run at a time
  • Dynamic parallelism: allows a kernel to launch other kernels, best used to make optimal use of GPU resources
  • Use of shared and texture memory
  • Register usage control using __launch_bounds__ or compiler options like -maxrregcount

Also, keep in mind that builds with debug options enabled during development do not reflect real performance; the meaningful measurements are those taken with the release version of the application.

Apart from the techniques mentioned above, we also found that GPU-accelerated libraries like cuBLAS were an easy replacement for tasks like matrix multiplication.

Once all of these techniques have been applied, one may have to go further and look for algorithmic improvements to raise overall performance.

For example, in one section of the code, the output pixel calculations were spread across multiple threads and accumulated with the atomicAdd function. Because of atomicAdd, performance was bound to be low. To eliminate atomicAdd and perform each calculation in a single thread, we mapped out the output element calculations on a smaller scale, as shown in the figure below.

With this workaround, we identified the pattern by which each pixel value is calculated and modified the CUDA code to remove the atomicAdd function. With atomicAdd eliminated, the execution time of that code improved from 6.2 seconds to 1.5 seconds.

The journey does not end here. As the CUDA best practices guide suggests, the approach to take is APOD (Assess, Parallelize, Optimize, Deploy), and the cycle starts all over. In each pass, more avenues for optimization can be found, and improvements from the milliseconds level down to the microseconds level can be achieved.

Actual Performance

At the time of writing, the performance achieved was as follows:


On the CPU only:

Dataset Used   Number of Consoles   Time
RNAi Images    1                    4 hrs 38 mins
RNAi Images    4                    2 hrs 15 mins

With the Tesla K20:

Dataset Used   Number of Consoles   Time (Speedup)
RNAi Images    1                    31 mins (9x)
RNAi Images    4                    11 mins 39 secs (12x)

The CPU code converted to the GPU totaled 526 lines, while the resulting CUDA code came to 2,557 lines. Relative to the size of the overall C++ codebase, the CUDA code added to port the application amounted to only about 2%, yet it achieved a 12x speedup.


Using NVIDIA best practices and GPU-accelerated libraries, one can port CPU applications to GPUs in a short period of time and achieve performance improvements that save ample time and energy. With GPGPU and CUDA, tasks that take up hours of productivity can be optimized to run within minutes!

eInfochips offers CUDA Consulting, Migration and System Design Services for companies looking to use NVIDIA GPUs for their products. For more information, please write to


Lalit Chandivade

Lalit Chandivade works as a Technical Manager at eInfochips. He has been leading a team at eInfochips on building automated NVMe test suites and enhancing the NVMe test suites on Linux & Windows OS. Lalit has also successfully executed projects in the Linux device drivers & applications in the Storage Area Network domain.
