Search Results

Search found 244 results on 10 pages for 'cuda'.

Page 4/10 | < Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >

  • CUDA: Memory copy to GPU 1 is slower in multi-GPU

    - by zenna
    My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem. //#include <stdio.h> //#include <stdlib.h> //#include <cuda_runtime.h> #include <iostream> #include <fstream> #include <sstream> #include <string> #include <cutil.h> __global__ void test_kernel(float *d_data) { int tid = blockDim.x*blockIdx.x + threadIdx.x; for (int i=0;i<10000;++i) { d_data[tid] = float(i*2.2); d_data[tid] += 3.3; } } int main(int argc, char* argv[]) { int deviceCount; cudaGetDeviceCount(&deviceCount); int device = 0; //SELECT GPU HERE cudaSetDevice(device); cudaEvent_t start, stop; unsigned int num_vals = 200000000; float *h_data = new float[num_vals]; for (int i=0;i<num_vals;++i) { h_data[i] = float(i); } float *d_data = NULL; float malloc_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaMalloc((void**)&d_data, sizeof(float)*num_vals); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float mem_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); float kernel_timer; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord( start, 0 ); test_kernel<<<1000,256>>>(d_data); cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop ); cudaEventDestroy( start ); cudaEventDestroy( stop ); printf("cudaMalloc took %f ms\n",malloc_timer); printf("Copy to the GPU took %f ms\n",mem_timer); printf("Test Kernel took %f ms\n",kernel_timer); cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost); delete[] h_data; return 0; } The results are GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took[b] 663.182251 ms[/b] Test Kernel took 326.710785 ms GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.

    Read the article

  • CUDA: injecting my own PTX function?

    - by shoosh
    I would like to be able to use a feature in PTX 1.3 which is not yet implemented it the C interface. Is there a way to write my own function in PTX and inject into an existing binary? The feature I'm looking for is getting the value of %smid

    Read the article

  • cuda/thrust: Trying to sort_by_key 2.8GB of data in 6GB of gpu RAM throws bad_alloc

    - by Sven K
    I have just started using thrust and one of the biggest issues I have so far is that there seems to be no documentation as to how much memory operations require. So I am not sure why the code below is throwing bad_alloc when trying to sort (before the sorting I still have 50% of GPU memory available, and I have 70GB of RAM available on the CPU)--can anyone shed some light on this? #include <thrust/device_vector.h> #include <thrust/sort.h> #include <thrust/random.h> void initialize_data(thrust::device_vector<uint64_t>& data) { thrust::fill(data.begin(), data.end(), 10); } #define BUFFERS 3 int main(void) { size_t N = 120 * 1024 * 1024; char line[256]; try { std::cout << "device_vector" << std::endl; typedef thrust::device_vector<uint64_t> vec64_t; // Each buffer is 900MB vec64_t c[3] = {vec64_t(N), vec64_t(N), vec64_t(N)}; initialize_data(c[0]); initialize_data(c[1]); initialize_data(c[2]); std::cout << "initialize_data finished... Press enter"; std::cin.getline(line, 0); // nvidia-smi reports 48% memory usage at this point (2959MB of // 6143MB) std::cout << "sort_by_key col 0" << std::endl; // throws bad_alloc thrust::sort_by_key(c[0].begin(), c[0].end(), thrust::make_zip_iterator(thrust::make_tuple(c[1].begin(), c[2].begin()))); std::cout << "sort_by_key col 1" << std::endl; thrust::sort_by_key(c[1].begin(), c[1].end(), thrust::make_zip_iterator(thrust::make_tuple(c[0].begin(), c[2].begin()))); } catch(thrust::system_error &e) { std::cerr << "Error: " << e.what() << std::endl; exit(-1); } return 0; }

    Read the article

  • Bind texture with pinned mapped memory in CUDA

    - by sjchoi
    I was trying to bind a host memory that was mapped for zero-copy to a texture, but it looks like it isn't possible. Here is a code sample: float* a; float* d_a; cudaSetDeviceFlags(cudaDeviceMapHost); cudaHostAlloc( (void **)&a, bytes, cudaHostAllocMapped); cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0); texture<float, 2, cudaReadModeElementType> tex; cudaBindTexture2D( 0, &tex, d_a, &channelDesc, width, height, pitch); Is it recommended that you used pinned memory and just copy it over to device memory that is bind to texture?

    Read the article

  • OpenCL or CUDA Which way to go?

    - by holydiver
    I'm investigating ways of using GPU in order to process streaming data. I had two choices but couldn't decide which way to go? My criterias are as below: Ease of use.(good API) Community and Documentation. Performance Future I'll code in C and C++.

    Read the article

  • CUDA - multiple kernels to compute a single value

    - by Roger
    Hey, I'm trying to write a kernel to essentially do the following in C float sum = 0.0; for(int i = 0; i < N; i++){ sum += valueArray[i]*valueArray[i]; } sum += sum / N; At the moment I have this inside my kernel, but it is not giving correct values. int i0 = blockIdx.x * blockDim.x + threadIdx.x; for(int i=i0; i<N; i += blockDim.x*gridDim.x){ *d_sum += d_valueArray[i]*d_valueArray[i]; } *d_sum= __fdividef(*d_sum, N); The code used to call the kernel is kernelName<<<64,128>>>(N, d_valueArray, d_sum); cudaMemcpy(&sum, d_sum, sizeof(float) , cudaMemcpyDeviceToHost); I think that each kernel is calculating a partial sum, but the final divide statement is not taking into account the accumulated value from each of the threads. Every kernel is producing it's own final value for d_sum? Does anyone know how could I go about doing this in an efficient way? Maybe using shared memory between threads? I'm very new to GPU programming. Cheers

    Read the article

  • cuda 5.0 namespaces for contant memory variable usage

    - by Psypher
    In my program I want to use a structure containing constant variables and keep it on device all long as the program executes to completion. I have several header files containing the declaration of 'global' functions and their respective '.cu' files for their definitions. I kept this scheme because it helps me contain similar code in one place. e.g. all the 'device' functions required to complete 'KERNEL_1' are separated from those 'device' functions required to complete 'KERNEL_2' along with kernels definitions. I had no problems with this scheme during compilation and linking. Until I encountered constant variables. I want to use the same constant variable through all kernels and device functions but it doesn't seem to work. ########################################################################## CODE EXAMPLE ########################################################################### filename: 'common.h' -------------------------------------------------------------------------- typedef struct { double height; double weight; int age; } __CONSTANTS; __constant__ __CONSTANTS d_const; --------------------------------------------------------------------------- filename: main.cu --------------------------------------------------------------------------- #include "common.h" #include "gpukernels.h" int main(int argc, char **argv) { __CONSTANTS T; T.height = 1.79; T.weight = 73.2; T.age = 26; cudaMemcpyToSymbol(d_consts, &T, sizeof(__CONSTANTS)); test_kernel <<< 1, 16 >>>(); cudaDeviceSynchronize(); } --------------------------------------------------------------------------- filename: gpukernels.h --------------------------------------------------------------------------- __global__ void test_kernel(); --------------------------------------------------------------------------- filename: gpukernels.cu --------------------------------------------------------------------------- #include <stdio.h> #include "gpukernels.h" #include "common.h" __global__ void test_kernel() { printf("Id: %d, height: %f, weight: %f\n", threadIdx.x, d_const.height, d_const.weight); } When I execute this code, the kernel executes, displays the thread ids, but the constant values are displayed as zeros. How can I fix this?

    Read the article

  • CUDA threads for inner loop

    - by Manolete
    I've got this kernel __global__ void kernel1(int keep, int include, int width, int* d_Xco, int* d_Xnum, bool* d_Xvalid, float* d_Xblas) { int i, k; i = threadIdx.x + blockIdx.x * blockDim.x; if(i < keep){ for(k = 0; k < include ; k++){ int val = (d_Xblas[i*include + k] >= 1e5); int aux = d_Xnum[i]; d_Xblas[i*include + k] *= (!val); d_Xco[i*width + aux] = k; d_Xnum[i] +=val; d_Xvalid[i*include + k] = (!val); } } } launched with int keep = 9000; int include = 23000; int width = 0.2*include; int threads = 192; int blocks = keep+threads-1/threads; kernel1 <<< blocks,threads >>>( keep, include, width, d_Xco, d_Xnum, d_Xvalid, d_Xblas ); This kernel1 works fine but it is obviously not totally optimized. I thought it would be straight forward to eliminate the inner loop k but for some reason it doesn't work fine. My first idea was: __global__ void kernel2(int keep, int include, int width, int* d_Xco, int* d_Xnum, bool* d_Xvalid, float* d_Xblas) { int i, k; i = threadIdx.x + blockIdx.x * blockDim.x; k = threadIdx.y + blockIdx.y * blockDim.y; if((i < keep) && (k < include) ) { int val = (d_Xblas[i*include + k] >= 1e5); int aux = d_Xnum[i]; d_Xblas[i*include + k] *= (float)(!val); d_Xco[i*width + aux] = k; atomicAdd(&d_Xnum[i], val); d_Xvalid[i*include + k] = (!val); } } launched with a 2D grid: int keep = 9000; int include = 23000; int width = 0.2*include; int th = 32; dim3 threads(th,th); dim3 blocks (keep+threads.x-1/threads.x, include+threads.y-1/threads.y); kernel2 <<< blocks,threads >>>( keep, include, width, d_Xco, d_Xnum, d_Xvalid, d_Xblas ); Although I believe the idea is fine, it does not work and I am running out of ideas here. Could you please help me out here? I also think the problem could be in d_Xco which stores the position k in a smaller array , so the order matters, but I can't think of any other way of doing it...

    Read the article

  • Strange behaviour of CUDA kernel

    - by username_4567
    I'm writing code for calculating prefix sum. Here is my kernel __global__ void prescan(int *indata,int *outdata,int n,long int *sums) { extern __shared__ int temp[]; int tid=threadIdx.x; int offset=1,start_id,end_id; int *global_sum=&temp[n+2]; if(tid==0) { temp[n]=blockDim.x*blockIdx.x; temp[n+1]=blockDim.x*(blockIdx.x+1)-1; start_id=temp[n]; end_id=temp[n+1]; //cuPrintf("Value of start %d and end %d\n",start_id,end_id); } __syncthreads(); start_id=temp[n]; end_id=temp[n+1]; temp[tid]=indata[start_id+tid]; temp[tid+1]=indata[start_id+tid+1]; for(int d=n>>1;d>0;d>>=1) { __syncthreads(); if(tid<d) { int ai=offset*(2*tid+1)-1; int bi=offset*(2*tid+2)-1; temp[bi]+=temp[ai]; } offset*=2; } if(tid==0) { sums[blockIdx.x]=temp[n-1]; temp[n-1]=0; cuPrintf("sums %d\n",sums[blockIdx.x]); } for(int d=1;d<n;d*=2) { offset>>=1; __syncthreads(); if(tid<d) { int ai=offset*(2*tid+1)-1; int bi=offset*(2*tid+2)-1; int t=temp[ai]; temp[ai]=temp[bi]; temp[bi]+=t; } } __syncthreads(); if(tid==0) { outdata[start_id]=0; } __threadfence_block(); __syncthreads(); outdata[start_id+tid]=temp[tid]; outdata[start_id+tid+1]=temp[tid+1]; __syncthreads(); if(tid==0) { temp[0]=0; outdata[start_id]=0; } __threadfence_block(); __syncthreads(); if(blockIdx.x==0 && threadIdx.x==0) { for(int i=1;i<gridDim.x;i++) { sums[i]=sums[i]+sums[i-1]; } } __syncthreads(); __threadfence(); if(blockIdx.x==0 && threadIdx.x==0) { for(int i=0;i<gridDim.x;i++) { cuPrintf("****sums[%d]=%d ",i,sums[i]); } } __syncthreads(); __threadfence(); if(blockIdx.x!=gridDim.x-1) { int tid=(blockIdx.x+1)*blockDim.x+threadIdx.x; if(threadIdx.x==0) cuPrintf("Adding %d \n",sums[blockIdx.x]); outdata[tid]+=sums[blockIdx.x]; } __syncthreads(); } In above kernel, sums array will accumulate prefix sum per block and and then first thread will calculate prefix sum of this sum array. Now if I print this sum array from device side it'll show correct results while in cuPrintf("Adding %d \n",sums[blockIdx.x]); this line it prints that it is taking old value. What could be the reason?

    Read the article

  • How to activate nVidia cards programmatically on new MacBookPros for CUDA programming?

    - by Carsten Kuckuk
    The new MacBookPros come with two graphic adapters, the Intel HD Graphics, and the NVIDIA GeForce GT 330M. OS X switches back and forth between them, depending on the workload, detection of an external monitor, or activation of Rosetta. I want to get my feet wet with CUDA programming, and unfortunately the CUDA SDK doesn't seem to take care of this back-and-forth switching. When Intel is active, no CUDA device gets detected, and when the NVidia card is active, it gets detected. So my current work-around is to use the little tool gfxCardStatus (http://codykrieger.com/gfxCardStatus/) to force the card on or off, just as I need it, but that's not satisfactory. Does anybody here know what the Apple-blessed, Apple-recommended way is to (1) detect the presence of a CUDA card, (2) to activate this card when present?

    Read the article

  • identifier ... is undefined when trying to run pure C code in Cuda using nvcc

    - by Lostsoul
    I'm new and learning Cuda. A approach that I'm trying to use to learn is to write code in C and once I know its working start converting it to Cuda since I read that nvcc compiles Cuda code but complies everything else using plain old c. My code works in c(using gcc) but when I try to compile it using nvcc(after changing the file name from main.c to main.cu) I get main.cu(155): error: identifier "num_of_rows" is undefined main.cu(155): error: identifier "num_items_in_row" is undefined 2 errors detected in the compilation of "/tmp/tmpxft_00002898_00000000-4_main.cpp1.ii". Basically in my main method I send data to a function like this: process_list(count, countListItem, list); the first two items are ints and the last item(list) is a matrix. Then I create my function like this: void process_list(int num_of_rows, int num_items_in_row, int current_list[num_of_rows][num_items_in_row]) { This line is where I get my errors when using nvcc(line 155). I need to convert this code to cuda anyway so no need to troubleshoot this specific issue(plus code is quite large) but I'm wondering if I'm wrong about nvcc treating the C part of your code like plain C. In the book cuda by example I just used nvcc to compile but do I need any extra flags when just using pure c?

    Read the article

  • CUDA 4.1 Update

    - by N0xus
    I'm currently working on porting a particle system to update on the GPU via the use of CUDA. With CUDA, I've already passed over the required data I need to the GPU and allocated and copied the date via the host. When I build the project, it all runs fine, but when I run it, the project says I need to allocate my h_position pointer. This pointer is my host pointer and is meant to hold the data. I know I need to pass in the current particle position to the required cudaMemcpy call and they are currently stored in a list with a for loop being created and interated for each particle calling the following line of code: m_particleList[i].positionY = m_particleList[i].positionY - (m_particleList[i].velocity * frameTime * 0.001f); My current host side cuda code looks like this: float* h_position; // Your host pointer. This holds the data (I assume it's already filled with the data.) float* d_position; // Your device pointer, we will allocate and fill this float* d_velocity; float* d_time; int threads_per_block = 128; // You should play with this value int blocks = m_maxParticles/threads_per_block + ( (m_maxParticles%threads_per_block)?1:0 ); const int N = 10; size_t size = N * sizeof(float); cudaMalloc( (void**)&d_position, m_maxParticles * sizeof(float) ); cudaMemcpy( d_position, h_position, m_maxParticles * sizeof(float), cudaMemcpyHostToDevice); Both of which were / can be found inside my UpdateParticle() method. I had originally thought it would be a simple case of changing the h_position variable in the cudaMemcpy to m_particleList[i] but then I get the following error: no suitable conversion function from "ParticleSystemClass::ParticleType" to "const void *" exists I've probably messed up somewhere, but could someone please help fix the issues I'm facing. Everything else seems to running fine, it's just when I try to run the program that certain things hit the fan.

    Read the article

  • CUDA 4.0 : sortie de la Release Candidate de l'architecture de calcul parallèle de NVIDIA

    CUDA 4.0 : sortie de la Release Candidate De l'architecture de calcul parallèle de NVIDIA La sortie cette semaine aux développeurs enregistrés de CUDA est l'accomplissement de milliers d'heures dédiées à ce projet. Cette technologie peut sembler jeune, sa première version, la 1.0, n'a été sortie qu'en 2007. L'effet de cette sortie sur le monde entier n'a, sans nul doute possible, pas été nul. Elle n'est pas non plus le seul terrain d'avance pour les technologies massivement parallèles sur matériel NVIDIA : les API comme OpenCL et DirectCompute sont elles aussi supportées et améliorées, en plus d'un accès direct au GPU en C, C++ et Fortran. Comme toujours, NVIDIA est à la recherch...

    Read the article

  • How to treat 64-bit words on a CUDA device?

    - by pikkio
    Hi, I'd like to handle directly 64-bit words on the CUDA platform (eg. uint64_t vars). I understand, however, that addressing space, registers and the SP architecture are all 32-bit based. I actually found this to work correctly (on my CUDA cc1.1 card): __global__ void test64Kernel( uint64_t *word ) { (*word) <<= 56; } but I don't know, for example, how this affects registers usage and the operations per clock cycle count.

    Read the article

  • Trying to 'Make' CUDA SDK, ld cannot find library, ldconfig says it can.

    - by Andrew Bolster
    I know there are many other questions similar to this one, but none of the solutions posited there are working for me Basically, making the SDK sample files, i get /usr/bin/ld: cannot find -lcuda which would be an easy enough 'find the library and throw it to ldconfig', except ldconfig already says it has it... $ sudo ldconfig -v | grep cuda /usr/local/cuda/lib64: libcudartemu.so.3 -> libcudartemu.so.3.0.14 libcudart.so.3 -> libcudart.so.3.0.14 /usr/local/cuda/lib: libcudartemu.so.3 -> libcudartemu.so.3.0.14 libcudart.so.3 -> libcudart.so.3.0.14 libcuda.so.1 -> libcuda.so.195.36.15 libcuda.so.1 -> libcuda.so.195.36.15 libicudata.so.42 -> libicudata.so.42.1 And I checked, there is a symlink libcuda.so -> libcuda.so.1 but I'm still confused as to why libcuda.so -> ... doesnt show up I must be missing something really obvious. Any ideas?

    Read the article

  • CUDA 4.1 Particle Update

    - by N0xus
    I'm using CUDA 4.1 to parse in the update of my Particle system that I've made with DirectX 10. So far, my update method for the particle systems is 1 line of code within a for loop that makes each particle fall down the y axis to simulate a waterfall: m_particleList[i].positionY = m_particleList[i].positionY - (m_particleList[i].velocity * frameTime * 0.001f); In my .cu class I've created a struct which I copied from my particle class and is as follows: struct ParticleType { float positionX, positionY, positionZ; float red, green, blue; float velocity; bool active; }; Then I have an UpdateParticle method in the .cu as well. This encompass the 3 main parameters my particles need to update themselves based off the initial line of code. : __global__ void UpdateParticle(float* position, float* velocity, float frameTime) { } This is my first CUDA program and I'm at a loss to what to do next. I've tried to simply put the particleList line in the UpdateParticle method, but then the particles don't fall down as they should. I believe it is because I am not calling something that I need to in the class where the particle fall code use to be. Could someone please tell me what it is I am missing to get it working as it should? If I am doing this completely wrong in general, the please inform me as well.

    Read the article

  • nvcc not found, but only when using sudo

    - by dsp_099
    I can't get ANYTHING working on linux. I'm trying to compile CudaMiner. sudo make: ypt-jane.o `test -f 'scrypt-jane.cpp' || echo './'`scrypt-jane.cpp mv -f .deps/cudaminer-scrypt-jane.Tpo .deps/cudaminer-scrypt-jane.Po nvcc -g -O2 -Xptxas "-abi=no -v" -arch=compute_10 --maxrregcount=64 --ptxas-options=-v -I./compat/jansson -o salsa_kernel.o -c salsa_kernel.cu /bin/bash: nvcc: command not found make[2]: *** [salsa_kernel.o] Error 127 make[2]: Leaving directory `/var/progs/CudaMiner' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/var/progs/CudaMiner' make: *** [all] Error 2 So, kind of interesting: nvcc: nvcc fatal : No input files specified; use option --help for more information Whereas sudo nvcc: sudo: nvcc: command not found Huh?? I have identical exports listed in ~/.bashrc AND /etc/bash.bashrc. (Nvcc is located in: /usr/local/cuda-5.0/bin/nvcc) I also tried changing the current path, to no avail: $ sudo bash -c 'echo $PATH' /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin $ PATH=$PATH:/usr/local/cuda-5.0/bin/nvcc $ sudo bash -c 'echo $PATH' /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin Thanks in advance!

    Read the article

  • Can I program Nvidia's CUDA using only Python or do I have to learn C?

    - by Aquateenfan
    I guess the question speaks for itself. I'm interested in doing some serious computations but am not a programmer by trade. I can string enough python together to get done what I want. But can I write a program in python and have the GPU execute it using CUDA? Or do I have to use some mix of python and C? The examples on Klockner's (sp) "pyCUDA" webpage had a mix of both python and C, so I'm not sure what the answer is. If anyone wants to chime in about Opencl, feel free. I heard about this CUDA business only a couple of weeks ago and didn't know you could use your video cards like this. thx

    Read the article

  • CUDA 3.0 est sorti, avec le support de la nouvelle architecture de NVIDIA, Fermi

    CUDA 3.0 est sorti très récemment, avec le support de la plateforme Fermi, très attendue. Elle n'est pas encore disponible, mais ce n'est plus qu'une affaire de quelques semaines. Cette sortie permet de déjà préparer son code pour la prochaine architecture, tout en bénéficiant d'ores et déjà de grandes améliorations. Citation: Envoyé par Professor Bower, chercheur dans le Quantum ChromoDynamics

    Read the article

  • Geforce(410m with CUDA) screen resolution on Ubuntu 12.04 issue

    - by Marco K
    I made a succesful installation of Ubuntu 12.04 64-bit on my Sony Vaio PCG-71811M. I have a Geforce 410M with CUDA,it seems works fine and i have already installed packages nvidia-current and nvidia-settings at version 302.17 (I think it's the latest in this moment), but my maximum screen resolution is 1366x768(and in the native display settings it's the same thing). How can I switch it to an highest resolution, like 1920x1080?

    Read the article

  • How to optimize Conway's game of life for CUDA?

    - by nlight
    I've written this CUDA kernel for Conway's game of life: global void gameOfLife(float* returnBuffer, int width, int height) { unsigned int x = blockIdx.x*blockDim.x + threadIdx.x; unsigned int y = blockIdx.y*blockDim.y + threadIdx.y; float p = tex2D(inputTex, x, y); float neighbors = 0; neighbors += tex2D(inputTex, x+1, y); neighbors += tex2D(inputTex, x-1, y); neighbors += tex2D(inputTex, x, y+1); neighbors += tex2D(inputTex, x, y-1); neighbors += tex2D(inputTex, x+1, y+1); neighbors += tex2D(inputTex, x-1, y-1); neighbors += tex2D(inputTex, x-1, y+1); neighbors += tex2D(inputTex, x+1, y-1); __syncthreads(); float final = 0; if(neighbors < 2) final = 0; else if(neighbors 3) final = 0; else if(p != 0) final = 1; else if(neighbors == 3) final = 1; __syncthreads(); returnBuffer[x + y*width] = final; } I am looking for errors/optimizations. Parallel programming is quite new to me and I am not sure if I get how to do it right. The rest of the app is: Memcpy input array to a 2d texture inputTex stored in a CUDA array. Output is memcpy-ed from global memory to host and then dealt with. As you can see a thread deals with a single pixel. I am unsure if that is the fastest way as some sources suggest doing a row or more per thread. If I understand correctly NVidia themselves say that the more threads, the better. I would love advice on this on someone with practical experience.

    Read the article

  • What kind of data processing problems would CUDA help with?

    - by Chris McCauley
    Hi, I've worked on many data matching problems and very often they boil down to quickly and in parallel running many implementations of CPU intensive algorithms such as Hamming / Edit distance. Is this the kind of thing that CUDA would be useful for? What kinds of data processing problems have you solved with it? Is there really an uplift over the standard quad-core intel desktop? Chris

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >