How to properly cast a global memory array to uint4 in CUDA to increase memory throughput?

There are generally two techniques to increase the memory throughput of global memory accesses in a CUDA kernel: coalescing memory accesses and accessing words of at least 4 bytes. With the first technique, accesses to the same memory segment by threads of the same half-warp are coalesced into fewer transactions, while by accessing words of at least 4 bytes, the size of this memory segment is effectively increased from 32 bytes to 128 bytes.

To access 16-byte words instead of 1-byte words when unsigned chars are stored in global memory, the uint4 vector type is commonly used, by casting the memory array to uint4:

uint4 *text4 = ( uint4 * ) d_text;
var = text4[i];
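
For context, here is a stripped-down kernel sketch showing how the cast might be used (load16, d_out and n4 are placeholder names, not from my actual code):

__global__ void load16( const unsigned char *d_text, unsigned int *d_out, int n4 )
{
    // Reinterpret the byte buffer as an array of 16-byte uint4 words so that
    // each thread issues one 16-byte load instead of sixteen 1-byte loads
    const uint4 *text4 = ( const uint4 * ) d_text;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n4 ) {
        uint4 var = text4[i];                        // one coalesced 16-byte load per thread
        d_out[i] = var.x ^ var.y ^ var.z ^ var.w;    // placeholder use of the loaded word
    }
}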

In order to extract the 16 chars from var, I am currently using bitwise operations. For example:

s_array[j * 16 + 0] = var.x & 0x000000FF;
s_array[j * 16 + 1] = (var.x >> 8) & 0x000000FF;
s_array[j * 16 + 2] = (var.x >> 16) & 0x000000FF;
s_array[j * 16 + 3] = (var.x >> 24) & 0x000000FF;

My question is: is it possible to recast var (or, for that matter, *text4) to unsigned char in order to avoid the additional overhead of the bitwise operations?
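
Something along the following lines is what I have in mind (an untested sketch, not working code):

unsigned char *c = ( unsigned char * ) &var;    // reinterpret the 16-byte value as individual bytes
for ( int k = 0; k < 16; k++ )
    s_array[j * 16 + k] = c[k];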
