Thursday, March 25, 2010

Avoiding if conditions in CUDA

Branching is one of the most time-consuming operations in CUDA (and in general as well).

For example, see the CPU code and GPU code below ...

CPU code:

void CPUCode( int* input, int* output, int length)
{
    // Note: reads input[ i + 1 ], so input is assumed to hold at least length + 1 elements.
    for ( int i = 0; i < length; ++i )
    {
        output[ i ] = input[ i ] + 2 * input[ i + 1 ];
    }
}

GPU code:

__global__
void GPUCode( int* input, int* output, int length)
{
    int idx = __umul24( blockDim.x, blockIdx.x) + threadIdx.x;

    if ( idx < length )
    {
        output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
    }
}
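
For context, here is a minimal host-side launch sketch (not in the original post; blockSize, d_input and d_output are assumed names). The grid is rounded up to whole blocks, so the last block can contain threads whose idx is past the end of the arrays, which is why the kernel checks idx < length.

// Hypothetical launch; d_input and d_output are assumed to be device
// pointers already allocated with cudaMalloc and filled with data.
int blockSize = 256;
int gridSize  = ( length + blockSize - 1 ) / blockSize;   // round up to whole blocks
GPUCode<<< gridSize, blockSize >>>( d_input, d_output, length );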

In the above GPU code, there is an if condition that every thread executes. Execution is fastest when every thread executes the same instruction at the same time; i.e., the kernel code (the __global__ function code) should be straight-line, with no branching inside it.
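
As a rough illustration (not from the original post) of why branching can hurt: when threads within the same warp take different paths, the hardware executes both paths one after the other, with part of the warp idle each time. DivergentKernel below is a hypothetical example.

__global__
void DivergentKernel( int* data )
{
    int idx = __umul24( blockDim.x, blockIdx.x) + threadIdx.x;

    // Neighbouring threads of the same warp take different branches,
    // so the warp runs both branches serially (divergence).
    if ( threadIdx.x % 2 == 0 )
        data[ idx ] = data[ idx ] * 2;
    else
        data[ idx ] = data[ idx ] + 1;
}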

Look at the modified version of the GPU code ...

__global__
void GPUCode( int* input, int* output, int length)
{
    int idx = __umul24( blockDim.x, blockIdx.x) + threadIdx.x;

    // Clamp the index into [0, length - 1] instead of branching on it;
    // out-of-range threads simply recompute the last element.
    idx = max( idx, 0);
    idx = min( idx, length - 1);

    output[ idx ] = input[ idx ] + 2 * input[ idx + 1 ];
}

There is no branching in the modified GPU code above. The kernel code is straight-line: every thread executes the same instruction at a time. This type of code executes very fast on the GPU.

The above technique is applicable to general CPU code as well.
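
As a small sketch of the same idea on the CPU (hypothetical, not from the original post): a conditional assignment can often be replaced by max/min, which compilers typically turn into branch-free instructions. ClampIndex is an assumed name.

#include <algorithm>

// Branch-free clamp of an index into [0, length - 1], mirroring the
// max/min trick used in the modified GPU kernel above.
int ClampIndex( int idx, int length )
{
    idx = std::max( idx, 0 );
    idx = std::min( idx, length - 1 );
    return idx;
}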
