CUDA small-kernel 2D convolution - how to do it

Posted by paulAl on Stack Overflow, 2012-04-13

I've been experimenting with CUDA kernels for days to perform a fast 2D convolution between a 500x500 image (though I could also vary the dimensions) and a very small 2D kernel (a 3x3 Laplacian kernel, so it's too small to take full advantage of all the CUDA threads).
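
For reference, a 3x3 Laplacian kernel looks something like this (the exact coefficients I used aren't listed above, so this is only an assumed example):

    // A common 3x3 Laplacian kernel (4-neighbour variant).
    // The exact coefficients used in my code may differ; this is just illustrative.
    const float laplacian3x3[3][3] = {
        {  0.0f, -1.0f,  0.0f },
        { -1.0f,  4.0f, -1.0f },
        {  0.0f, -1.0f,  0.0f },
    };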

I wrote a classic CPU implementation (two nested for loops, as straightforward as you would expect, sketched below) and then started writing CUDA kernels.
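
The CPU version I'm comparing against is essentially this kind of naive nested-loop convolution (a minimal sketch; the variable names and the border handling are assumptions, not my exact code):

    // Naive CPU 2D convolution with a 3x3 kernel (sketch).
    // 'in' and 'out' are width*height float images in row-major order;
    // border pixels are simply skipped here for brevity.
    void convolveCPU(const float* in, float* out,
                     int width, int height, const float kernel[3][3])
    {
        for (int y = 1; y < height - 1; ++y) {
            for (int x = 1; x < width - 1; ++x) {
                float sum = 0.0f;
                for (int ky = -1; ky <= 1; ++ky)
                    for (int kx = -1; kx <= 1; ++kx)
                        sum += in[(y + ky) * width + (x + kx)] * kernel[ky + 1][kx + 1];
                out[y * width + x] = sum;
            }
        }
    }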

After a few disappointing attempts to get a faster convolution, I ended up with the code at http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section): it basically has each 16x16 thread block load all the convolution data it needs into shared memory and then perform the convolution.
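
To show what I mean, this is roughly the kind of shared-memory kernel that page describes (my own simplified sketch, not the exact code from the link): each 16x16 block loads an 18x18 tile, i.e. the block plus a 1-pixel halo, and then convolves it.

    #define BLOCK_SIZE    16
    #define KERNEL_RADIUS 1
    #define TILE_SIZE     (BLOCK_SIZE + 2 * KERNEL_RADIUS)

    __constant__ float d_kernel[3][3];   // 3x3 kernel kept in constant memory

    __global__ void convolveShared(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TILE_SIZE][TILE_SIZE];

        int baseX = blockIdx.x * BLOCK_SIZE;
        int baseY = blockIdx.y * BLOCK_SIZE;

        // Each thread loads one or more elements of the 18x18 tile,
        // clamping reads at the image border.
        for (int ty = threadIdx.y; ty < TILE_SIZE; ty += BLOCK_SIZE) {
            for (int tx = threadIdx.x; tx < TILE_SIZE; tx += BLOCK_SIZE) {
                int ix = min(max(baseX + tx - KERNEL_RADIUS, 0), width - 1);
                int iy = min(max(baseY + ty - KERNEL_RADIUS, 0), height - 1);
                tile[ty][tx] = in[iy * width + ix];
            }
        }
        __syncthreads();

        // Each thread then convolves its own output pixel from shared memory.
        int x = baseX + threadIdx.x;
        int y = baseY + threadIdx.y;
        if (x < width && y < height) {
            float sum = 0.0f;
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    sum += tile[threadIdx.y + ky][threadIdx.x + kx] * d_kernel[ky][kx];
            out[y * width + x] = sum;
        }
    }

It would be launched with a 16x16 block and a grid of (width+15)/16 by (height+15)/16 blocks, after copying the 3x3 kernel to d_kernel with cudaMemcpyToSymbol.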

Still nothing: the CPU is a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is only efficient for large kernel sizes.

Whether or not you read everything I wrote, my question is:

how can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?
