CUDA - multiple kernels to compute a single value
- by Roger
Hey, I'm trying to write a kernel that essentially does the following C computation:
 float sum = 0.0f;
 for (int i = 0; i < N; i++) {
     sum += valueArray[i] * valueArray[i];
 }
 sum = sum / N;
At the moment I have this inside my kernel, but it is not giving correct values:
 int i0 = blockIdx.x * blockDim.x + threadIdx.x;
 for (int i = i0; i < N; i += blockDim.x * gridDim.x) {
     *d_sum += d_valueArray[i] * d_valueArray[i];
 }
 *d_sum = __fdividef(*d_sum, N);
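From what I've read, the `+=` on `*d_sum` is a data race: many threads read-modify-write the same address at once, so updates get lost. A minimal fix I've seen suggested (untested sketch; the kernel name and argument order are just copied from my code above, and `atomicAdd` on `float*` needs compute capability 2.0 or later) is to zero `d_sum` before launch, accumulate into a register per thread, and fold the partials in with one atomic each:

```cuda
__global__ void kernelName(int N, const float *d_valueArray, float *d_sum)
{
    int i0 = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread accumulates its own partial sum in a register first.
    float local = 0.0f;
    for (int i = i0; i < N; i += blockDim.x * gridDim.x) {
        local += d_valueArray[i] * d_valueArray[i];
    }

    // One atomic per thread folds the partial sums into the single result.
    atomicAdd(d_sum, local);
}
```

The divide by N would then have to happen after all threads are done, e.g. on the host after the cudaMemcpy (`sum /= N;`), rather than inside the kernel where every thread races to do it.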
The code used to call the kernel is
  kernelName<<<64,128>>>(N, d_valueArray, d_sum);
  cudaMemcpy(&sum, d_sum, sizeof(float) , cudaMemcpyDeviceToHost);
I think that each thread is calculating a partial sum, but the final divide statement does not take the values accumulated by the other threads into account. Is every thread producing its own final value for d_sum?
Does anyone know how I could go about doing this efficiently? Maybe using shared memory between the threads? I'm very new to GPU programming. Cheers
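Edit: here's my current attempt at a shared-memory version, pieced together from the reduction example in the CUDA SDK (untested, so I'm not sure it's right; the kernel name `sumSquares` is mine, it assumes blockDim.x is a power of two, and `atomicAdd` on floats needs compute capability 2.0+):

```cuda
__global__ void sumSquares(int N, const float *d_valueArray, float *d_sum)
{
    extern __shared__ float cache[];   // one float per thread in the block
    int tid = threadIdx.x;

    // Grid-stride accumulation into a per-thread register.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < N;
         i += blockDim.x * gridDim.x) {
        local += d_valueArray[i] * d_valueArray[i];
    }
    cache[tid] = local;
    __syncthreads();

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }

    // One atomic per block combines the per-block sums.
    if (tid == 0) atomicAdd(d_sum, cache[0]);
}
```

Called with the shared-memory size as the third launch parameter, with d_sum zeroed first, and with the divide moved to the host:

```cuda
cudaMemset(d_sum, 0, sizeof(float));
sumSquares<<<64, 128, 128 * sizeof(float)>>>(N, d_valueArray, d_sum);
cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
sum /= N;
```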