This blog describes our journey porting the WNDCHRM application to the NVIDIA Tesla K20 GPGPU.
WNDCHRM is a medical image classification application developed at the National Institute on Aging. It can be used effectively in biological image analysis involving high volumes of image data. For example, it is used to identify similarity between genes based on phenotype, using RNAi (RNA interference). Such analysis is useful in identifying potential therapeutic targets, in drug development, and in other applications such as virus-host interaction exploration.
WNDCHRM is an open-source application; the name stands for Weighted Neighbour Distances using a Compound Hierarchy of Algorithms Representing Morphology. The application uses diverse algorithms to extract features (content descriptors) and creates a training database, called a feature set, to classify an input image. Some of the algorithms/features used include Zernike features, Chebyshev statistics, Gabor filters, Haralick and Tamura textures, multi-scale histograms, and edge statistics.
Compared with other applications, WNDCHRM achieves higher classification accuracy.
[Table: WND-CHARM accuracy and number of features used, compared against the benchmarks of France et al. (1997) and Boland et al. (1998)]
We began by setting up the NVIDIA Tesla K20 in a Supermicro server. The Tesla K20 has 13 next-generation Streaming Multiprocessors (SMX) with 192 cores each and 5 GB of memory; the card occupies two PCIe slots and must be powered through auxiliary power connectors. Ubuntu 14.04 was used as the operating system, with CUDA Toolkit 6.5. The server was equipped with a 2.5 GHz Intel quad-core processor and 16 GB of RAM.
The porting process starts with measuring the application's current performance on the CPU. Processing a single image took 82 seconds. Processing a set of 200 RNAi images took around 4 hours 38 minutes on a single console and 2 hours 15 minutes on 4 consoles. We capped the number of consoles at 4, since scaling further did not reduce the processing time.
The next step was to use profiling tools such as gprof and valgrind to identify the hotspots in the WNDCHRM application.
The following table shows the gprof output. It delivers good insight into the application's hotspots: one can start by examining the functions that take up the most time, and then begin porting that code to the GPU using CUDA.
(CUDA is a parallel computing platform and programming model that provides C/C++ and Fortran programming interfaces, with Python bindings available, for developing new GPU applications or porting existing CPU applications to GPUs.)
After identifying the hotspots, the CUDA programming begins. At this point, one needs to understand the source code layout and the various compilation options of nvcc, the CUDA compiler driver. During development, cuda-memcheck and cuda-gdb are handy tools to identify and isolate issues in the CUDA code, and the NVIDIA Visual Profiler can be used to profile the code running on the GPU.
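As a minimal sketch (the file name and kernel are hypothetical, not wndchrm code), the build commands and the error-checking pattern that make cuda-memcheck and cuda-gdb sessions productive look like this:

```cuda
// scale.cu -- hypothetical minimal kernel, shown to illustrate the workflow.
// Debug build:   nvcc -G -g -o scale scale.cu    (device debug info for cuda-gdb)
// Checked run:   cuda-memcheck ./scale           (flags out-of-bounds accesses)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA API error so failures surface at the call site.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(1);                                               \
        }                                                          \
    } while (0)

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard against out-of-range threads
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous kernel errors
    CUDA_CHECK(cudaFree(d));
    printf("done\n");
    return 0;
}
```

Checking every API call and synchronizing after launches during development keeps errors close to their source; the checks can be compiled out for release builds.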
The initial development involves simply getting the code running on the GPU; this first version is rarely optimized. One then uses the profiler to characterize how the code behaves on the GPU and applies different techniques to achieve better performance.
Some of the techniques that can be used to improve performance include coalescing global-memory accesses, using shared memory, minimizing host-device transfers, and tuning occupancy.
Note that debug options remain enabled during development; to see the real performance achieved, measurements should be taken with the release build of the application.
Apart from the above-mentioned techniques, we also found that GPU-accelerated libraries like cuBLAS were an easy replacement for tasks like matrix multiplication.
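As a sketch of such a replacement (the matrix size is illustrative and this is not wndchrm's actual code), a hand-written multiply can be swapped for a single cuBLAS SGEMM call:

```cpp
// gemm.cu -- illustrative cuBLAS replacement for a hand-rolled matrix multiply.
// Build: nvcc -o gemm gemm.cu -lcublas
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 512;                       // illustrative square size
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n);

    float *da, *db, *dc;
    cudaMalloc(&da, n * n * sizeof(float));
    cudaMalloc(&db, n * n * sizeof(float));
    cudaMalloc(&dc, n * n * sizeof(float));
    cudaMemcpy(da, a.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C (cuBLAS assumes column-major storage)
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, da, n, db, n, &beta, dc, n);

    cudaMemcpy(c.data(), dc, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(h);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Beyond saving development time, the library kernels are already tuned per GPU architecture, so they typically outperform a first-pass hand-written kernel.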
Once these techniques are exhausted, one may have to go further and search for algorithmic improvements to raise overall performance.
For example, in one code section the output pixel calculations were spread across multiple threads and combined using the atomicAdd function. Because atomicAdd serializes concurrent updates, performance was bound to be low. So, to eliminate atomicAdd and perform each calculation in a single thread, we mapped out the output-element calculations on a smaller scale, as shown in the figure below.
With this workaround, we identified the pattern by which each pixel value is calculated and then modified the CUDA code to remove the atomicAdd function. With the elimination, the code execution time improved from 6.2 seconds to 1.5 seconds.
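The restructuring described above can be sketched as follows (the kernel and variable names are hypothetical; wndchrm's real kernels are more involved). Instead of many threads accumulating into one output pixel through atomicAdd, each thread owns one output pixel and accumulates its contributions in a private register:

```cuda
// Before (sketch): one thread per (output, term) pair, serialized by atomicAdd.
__global__ void accumulate_atomic(const float *in, float *out, int n, int k) {
    int o = blockIdx.x;          // output element index
    int t = threadIdx.x;         // contribution (term) index
    if (o < n && t < k)
        atomicAdd(&out[o], in[o * k + t]);  // contended: threads serialize here
}

// After (sketch): one thread per output element; the running sum stays in a
// register, so no atomics are needed and each output needs a single store.
__global__ void accumulate_private(const float *in, float *out, int n, int k) {
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= n) return;
    float sum = 0.0f;
    for (int t = 0; t < k; ++t)  // each thread walks only its own terms
        sum += in[o * k + t];
    out[o] = sum;                // one uncontended store per output pixel
}
```

The trade-off is less parallelism per output element, but as long as there are enough output elements to fill the GPU, removing the contended atomic wins.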
The journey does not end here. As the CUDA best practices guide suggests, the approach to take is APOD (Assess, Parallelize, Optimize, Deploy), and the cycle starts all over. In each phase, more avenues for optimization can be found, and improvements from the milliseconds level down to the microseconds level can be achieved.
As of the time of writing of this blog, the performance achieved was as follows:

Number of Consoles | CPU Time      | GPU Time (Speedup)
------------------ | ------------- | ----------------------
1                  | 4 hrs 38 mins | 31 mins (9x)
4                  | 2 hrs 15 mins | 11 mins 39 secs (12x)
In total, 526 lines of CPU code were converted to the GPU, and the resulting CUDA code came to 2,557 lines. Compared with the size of the existing C++ code base, only about 2% of CUDA code was added to port the application to the GPU and achieve the 12x speedup.
Using NVIDIA best practices and GPU-accelerated libraries, one can port CPU applications to the GPU in a short period of time and achieve performance improvements that save ample time and energy. With GPGPU and CUDA, tasks that consume hours of productivity can be optimized to run within minutes!
eInfochips offers CUDA Consulting, Migration and System Design Services for companies looking to use NVIDIA GPUs for their products. For more information, please write to email@example.com