Search Results

Search found 244 results on 10 pages for 'cuda'.

Page 5 of 10

  • How to return a single variable from a CUDA kernel function?

    - by Pooya
    I have a CUDA search function which calculates a single variable. How can I return it back to the host? __global__ void G_SearchByNameID(node* Node, long nodeCount, long start, char* dest, long answer){ answer = 2; } cudaMemcpy(h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost); cudaFree(d_answer); For both of these lines I get this error: error: argument of type "long" is incompatible with parameter of type "const void *"
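
    A minimal sketch of the usual pattern, assuming the result is written through a device pointer rather than a by-value parameter (the kernel and variable names here are illustrative, not taken from the question):

        #include <cstdio>
        #include <cuda_runtime.h>

        // Hypothetical kernel: the result is written through a device pointer,
        // not a by-value parameter (a by-value "long answer" is lost when the
        // kernel returns and cannot be handed to cudaMemcpy).
        __global__ void SearchKernel(long* answer) {
            *answer = 2;
        }

        int main() {
            long h_answer = 0;
            long* d_answer = NULL;
            cudaMalloc((void**)&d_answer, sizeof(long));   // device-side storage for the result
            SearchKernel<<<1, 1>>>(d_answer);
            cudaMemcpy(&h_answer, d_answer, sizeof(long), cudaMemcpyDeviceToHost);
            cudaFree(d_answer);
            printf("answer = %ld\n", h_answer);
            return 0;
        }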

    Read the article

  • How many processors can I get in a block on a CUDA GPU?

    - by Vickey
    Hi all, I have two questions. 1) If I create only one block of threads in CUDA and execute my parallel program on it, is it possible that more than one processor will be assigned to that single block, so that my program gets some benefit from the multiprocessor platform? 2) Can I synchronize the threads of different blocks? If yes, please give some hints. Thanks in advance; I know I'll get replies, as I always do.
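
    For question 2, one widely used way to get a grid-wide synchronization point is to split the work into consecutive kernel launches on the same stream; the sketch below illustrates the idea with placeholder kernels and is not taken from the thread:

        #include <cuda_runtime.h>

        __global__ void phase1(int* data) { data[blockIdx.x] = blockIdx.x; }

        __global__ void phase2(const int* data, int* out) {
            // By now every block of phase1 has finished, so any block may
            // safely read what any other block wrote.
            *out = data[0] + data[1];
        }

        int main() {
            int *d_data, *d_out;
            cudaMalloc(&d_data, 2 * sizeof(int));
            cudaMalloc(&d_out, sizeof(int));
            phase1<<<2, 1>>>(d_data);
            // The boundary between the two launches acts as the grid-wide barrier.
            phase2<<<1, 1>>>(d_data, d_out);
            cudaDeviceSynchronize();
            cudaFree(d_data);
            cudaFree(d_out);
            return 0;
        }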

    Read the article

  • Parallelize code using CUDA [migrated]

    - by user878944
    If I have code which takes a struct variable as input and manipulates its elements, how can I parallelize this using CUDA? void BackpropagateLayer(NET* Net, LAYER* Upper, LAYER* Lower) { INT i,j; REAL Out, Err; for (i=1; i<=Lower->Units; i++) { Out = Lower->Output[i]; Err = 0; for (j=1; j<=Upper->Units; j++) { Err += Upper->Weight[j][i] * Upper->Error[j]; } Lower->Error[i] = Net->Gain * Out * (1-Out) * Err; } } Where NET and LAYER are structs defined as: typedef struct { /* A LAYER OF A NET: */ INT Units; /* - number of units in this layer */ REAL* Output; /* - output of ith unit */ REAL* Error; /* - error term of ith unit */ REAL** Weight; /* - connection weights to ith unit */ REAL** WeightSave; /* - saved weights for stopped training */ REAL** dWeight; /* - last weight deltas for momentum */ } LAYER; typedef struct { /* A NET: */ LAYER** Layer; /* - layers of this net */ LAYER* InputLayer; /* - input layer */ LAYER* OutputLayer; /* - output layer */ REAL Alpha; /* - momentum factor */ REAL Eta; /* - learning rate */ REAL Gain; /* - gain of sigmoid function */ REAL Error; /* - total net error */ } NET; What I could think of is to first convert the 2d Weight into 1d and then send it to the kernel to take the product, or just use the CUBLAS library. Any suggestions?
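
    A minimal sketch of one way to parallelize the outer loop over the lower layer's units, assuming the relevant arrays have already been flattened and copied to the device (the 0-based indexing, the row-major weight layout and all names below are assumptions, not part of the original code):

        // One thread per lower-layer unit i; Weight is assumed flattened row-major
        // as upperWeight[j * lowerUnits + i] rather than the original REAL**.
        __global__ void BackpropagateLayerKernel(const float* upperWeight,
                                                 const float* upperError,
                                                 const float* lowerOutput,
                                                 float* lowerError,
                                                 int upperUnits, int lowerUnits,
                                                 float gain) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= lowerUnits) return;
            float out = lowerOutput[i];
            float err = 0.0f;
            for (int j = 0; j < upperUnits; ++j)
                err += upperWeight[j * lowerUnits + i] * upperError[j];
            lowerError[i] = gain * out * (1.0f - out) * err;
        }

    Each thread then owns one element of Lower->Error, so no synchronization is needed between threads; flattening Weight is exactly the step the question already anticipates.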

    Read the article

  • How to have many Ubuntu workstations centrally managed?

    - by Richard Zak
    I have about a dozen Alienware workstations that are used for CUDA development and for execution of MPI jobs. What is the best way to manage them? I'd like to have something like an apt-get but for several systems, and a way to reimage a system simply and centrally. It seems that a combination of Landscape and Canonical's MAAS would be a good fit, but I need an open source solution. Any thoughts?

    Read the article

  • How can I use GLUT with CUDA on MACOSX?

    - by omegatai
    Hi, I'm having problems compiling a CUDA program that uses GLUT on Mac OS X. Here is the command line I use to compile the source: nvcc main.c -o main -Xlinker "-L/System/Library/Frameworks/OpenGL.framework/Libraries -lGL -lGLU" "-L/System/Library/Frameworks/GLUT.framework" And here are the errors I get: Undefined symbols: "_glutInitWindowSize", referenced from: _main in tmpxft_00001612_00000000-1_main.o "_glutInitWindowPosition", referenced from: _main in tmpxft_00001612_00000000-1_main.o "_glutDisplayFunc", referenced from: _main in tmpxft_00001612_00000000-1_main.o "_glutInitDisplayMode", referenced from: _main in tmpxft_00001612_00000000-1_main.o "_glutCreateWindow", referenced from: _main in tmpxft_00001612_00000000-1_main.o "_glutMainLoop", referenced from: _main in tmpxft_00001612_00000000-1_main.o "_glutInit", referenced from: _main in tmpxft_00001612_00000000-1_main.o ld: symbol(s) not found collect2: ld returned 1 exit status I am aware that I haven't specified any lib for GLUT, but I just can't find it! Does anybody know where it is? By the way, there doesn't seem to be a way to use the GLUT.framework when compiling with nvcc. Thanks a lot, omegatai

    Read the article

  • Cuda program results are always zero in HW, correct in EMU??

    - by Orion Nebula
    Hi all! I am having a weird problem. I have written CUDA code which executes correctly in emulation and all results show up; however, when executed on hardware (a G210), the results in the result memory are always 0. I am passing two vectors to the kernel, one with random variables, the other initialized to zero. The code copies the first vector to shared memory, does some swapping and other operations, and then writes the results back to the second vector (the one with the initial 0's). I am using double precision, the -arch sm13 flag is used, and all memory allocations also use sizeof(double). I have checked that the kernel is invoked, and it is, so no problems there; the cudaMemcpy has no problems either. What could be the problem? Why would it work in emulation but not on hardware? I am quite confused. Any ideas?
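
    For symptoms like this, the usual first step is to check the error state after the launch, since a kernel the hardware rejects (for example because of a mismatched -arch) fails silently otherwise. A minimal sketch with a placeholder kernel, not taken from the question:

        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void MyKernel(double* out) { out[threadIdx.x] = 1.0; }

        int main() {
            double* d_out;
            cudaMalloc(&d_out, 32 * sizeof(double));
            MyKernel<<<1, 32>>>(d_out);
            // A launch the hardware rejects produces no output and no message
            // unless the error state is inspected explicitly.
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
            err = cudaDeviceSynchronize();
            if (err != cudaSuccess) printf("execution error: %s\n", cudaGetErrorString(err));
            cudaFree(d_out);
            return 0;
        }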

    Read the article

  • How to make different threads execute different parts in CUDA?

    - by Vickey
    I am working on CUDA and I have a problem related to thread synchronization. In my code I need threads to execute different parts, like one thread - all threads - one thread; this is what I want. In the initial part of the code only one thread will execute, then some part will be executed by all threads, then again a single thread. Also, the threads are executing in a loop. Can anyone tell me how to do that? It's kind of urgent. I'll be grateful for any help. Thanks
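
    Within a single block, that one thread - all threads - one thread shape is usually expressed with __syncthreads() between the phases; here is a minimal sketch with placeholder work (note this only synchronizes threads of the same block):

        __global__ void phased_work(int* data, int n, int iterations) {
            for (int it = 0; it < iterations; ++it) {
                if (threadIdx.x == 0) {
                    data[0] = it;                      // part executed by one thread
                }
                __syncthreads();                       // wait before the parallel part

                for (int i = threadIdx.x; i < n; i += blockDim.x) {
                    data[i] += 1;                      // part executed by all threads
                }
                __syncthreads();                       // wait before the serial part

                if (threadIdx.x == 0) {
                    data[1] = data[0] + data[n - 1];   // one thread again
                }
                __syncthreads();                       // keep iterations in lockstep
            }
        }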

    Read the article

  • How to synchronize cuda threads when they are in the same loop and we need to synchronize them to ex

    - by Vickey
    Hi all, I have written a code and Now I want to implement this on cuda GPU but I'm new to synchronization so please help me with this, It's little urgent to me. Below I'm presenting the code and I want to that LOOP1 to be executed by all threads (heance I want to this portion to take advantage of cuda and the remaining portion (the portion other from the LOOP1) is to be executed by only a single thread. do{ point_set = master_Q[(*num_mas) - 1].q; List* temp = point_set; List* pa = point_set; if(master_Q[num_mas[0] - 1].max) max_level = (int) (ceilf(il2 * log(master_Q[num_mas[0] - 1].max))); *num_mas = (*num_mas) - 1; while(point_set){ List* insert_ele = temp; while(temp){ insert_ele = temp; if((insert_ele->dist[insert_ele->dist_index-1] <= pow(2, max_level-1)) || (top_level == max_level)){ if(point_set == temp){ point_set = temp->next; pa = temp->next; } else{ pa->next = temp->next; } temp = NULL; List* new_point_set = point_set; float maximum_dist = 0; if(parent->p_index != insert_ele->point_index){ List* tmp = new_point_set; float *b = &(data[(insert_ele->point_index)*point_len]); **LOOP 1:** while(tmp){ float *c = &(data[(tmp->point_index)*point_len]); float sum = 0.; for(int j = 0; j < point_len; j+=2){ float d1 = b[j] - c[j]; float d2 = b[j+1] - c[j+1]; d1 *= d1; d2 *= d2; sum = sum + d1 + d2; } tmp->dist[tmp->dist_index] = sqrt(sum); if(maximum_dist < tmp->dist[tmp->dist_index]) maximum_dist = tmp->dist[tmp->dist_index]; tmp->dist_index = tmp->dist_index+1; tmp = tmp->next; } max_distance = maximum_dist; } while(new_point_set || insert_ele){ List* far, *par, *tmp, *tmp_new; far = NULL; tmp = new_point_set; tmp_new = NULL; float level_dist = pow(2, max_level-1); float maxdist = 0, maxp = 0; while(tmp){ if(tmp->dist[(tmp->dist_index)-1] > level_dist){ if(maxdist < tmp->dist[tmp->dist_index-1]) maxdist = tmp->dist[tmp->dist_index-1]; if(tmp == new_point_set){ new_point_set = tmp->next; par = tmp->next; } else{ par->next = tmp->next; } if(far == NULL){ far = tmp; tmp_new = far; } else{ tmp_new->next = tmp; tmp_new = tmp; } if(parent->p_index != insert_ele->point_index) tmp->dist_index = tmp->dist_index - 1; tmp = tmp->next; tmp_new->next = NULL; } else{ par = tmp; if(maxp < tmp->dist[(tmp->dist_index)-1]) maxp = tmp->dist[(tmp->dist_index)-1]; tmp = tmp->next; } } if(0 == maxp){ tmp = new_point_set; aloc_mem[*tree_index].p_index = insert_ele->point_index; aloc_mem[*tree_index].no_child = 0; aloc_mem[*tree_index].level = max_level--; parent->children_index[parent->no_child++] = *tree_index; parent = &(aloc_mem[*tree_index]); tree_index[0] = tree_index[0]+1; while(tmp){ aloc_mem[*tree_index].p_index = tmp->point_index; aloc_mem[(*tree_index)].no_child = 0; aloc_mem[(*tree_index)].level = master_Q[(*cur_count_Q)-1].level; parent->children_index[parent->no_child] = *tree_index; parent->no_child = parent->no_child + 1; (*tree_index)++; tmp = tmp->next; } cur_count_Q[0] = cur_count_Q[0]-1; new_point_set = NULL; } master_Q[*num_mas].q = far; master_Q[*num_mas].parent = parent; master_Q[*num_mas].valid = true; master_Q[*num_mas].max = maxdist; master_Q[*num_mas].level = max_level; num_mas[0] = num_mas[0]+1; if(0 != maxp){ aloc_mem[*tree_index].p_index = insert_ele->point_index; aloc_mem[*tree_index].no_child = 0; aloc_mem[*tree_index].level = max_level; parent->children_index[parent->no_child++] = *tree_index; parent = &(aloc_mem[*tree_index]); tree_index[0] = tree_index[0]+1; if(maxp){ int new_level = ((int) (ceilf(il2 * log(maxp)))) +1; if (new_level < (max_level-1)) max_level = new_level; 
else max_level--; } else max_level--; } if( 0 == maxp ) insert_ele = NULL; } } else{ if(NULL == temp->next){ master_Q[*num_mas].q = point_set; master_Q[*num_mas].parent = parent; master_Q[*num_mas].valid = true; master_Q[*num_mas].level = max_level; num_mas[0] = num_mas[0]+1; } pa = temp; temp = temp->next; } } if((*num_mas) > 1){ List *temp2 = master_Q[(*num_mas)-1].q; while(temp2){ List* temp3 = master_Q[(*num_mas)-2].q; master_Q[(*num_mas)-2].q = temp2; if((master_Q[(*num_mas)-1].parent)->p_index != (master_Q[(*num_mas)-2].parent)->p_index){ temp2->dist_index = temp2->dist_index - 1; } temp2 = temp2->next; master_Q[(*num_mas)-2].q->next = temp3; } num_mas[0] = num_mas[0]-1; } point_set = master_Q[(*num_mas)-1].q; temp = point_set; pa = point_set; parent = master_Q[(*num_mas)-1].parent; max_level = master_Q[(*num_mas)-1].level; if(master_Q[(*num_mas)-1].max) if( max_level > ((int) (ceilf(il2 * log(master_Q[(*num_mas)-1].max)))) +1) max_level = ((int) (ceilf(il2 * log(master_Q[(*num_mas)-1].max)))) +1; num_mas[0] = num_mas[0]-1; } }while(*num_mas > 0);
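
    As a rough, hedged sketch of the part that maps naturally onto all threads of a block: LOOP 1 is a distance computation over the current point set, and it could be shared across threads if the linked-list entries were first staged into plain arrays (that staging step, the names and the 0-based indexing below are all assumptions, not part of the original code):

        // Assumed staging: a single thread copies the list entries' point indices
        // into point_index[] so that LOOP 1 can be divided among the threads.
        __global__ void loop1_kernel(const float* data, const int* point_index,
                                     float* dist, int num_points, int point_len,
                                     int insert_point_index) {
            const float* b = &data[insert_point_index * point_len];
            for (int p = threadIdx.x; p < num_points; p += blockDim.x) {
                const float* c = &data[point_index[p] * point_len];
                float sum = 0.0f;
                for (int j = 0; j < point_len; ++j) {
                    float d = b[j] - c[j];
                    sum += d * d;
                }
                dist[p] = sqrtf(sum);
            }
        }

    The running maximum_dist would then need a small reduction of its own, and the remaining bookkeeping would stay with a single thread, as in the pattern shown for the previous question.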

    Read the article

  • How to properly cast a global memory array using the uint4 vector in CUDA to increase memory throughput?

    - by charis
    There are generally two techniques to increase the memory throughput of global memory in a CUDA kernel: memory access coalescing and accessing words of at least 4 bytes. With the first technique, accesses to the same memory segment by threads of the same half-warp are coalesced into fewer transactions, while by accessing words of at least 4 bytes this memory segment is effectively increased from 32 bytes to 128. To access 16-byte instead of 1-byte words when there are unsigned chars stored in global memory, the uint4 vector is commonly used by casting the memory array to uint4: uint4 *text4 = ( uint4 * ) d_text; var = text4[i]; In order to extract the 16 chars from var, I am currently using bitwise operations. For example: s_array[j * 16 + 0] = var.x & 0x000000FF; s_array[j * 16 + 1] = (var.x >> 8) & 0x000000FF; s_array[j * 16 + 2] = (var.x >> 16) & 0x000000FF; s_array[j * 16 + 3] = (var.x >> 24) & 0x000000FF; My question is: is it possible to recast var (or, for that matter, *text4) to unsigned char in order to avoid the additional overhead of the bitwise operations?
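
    Syntactically it can be done by viewing the 16-byte value through an unsigned char pointer, as in the hedged sketch below; whether that is actually cheaper than the shifts is not guaranteed, since taking the address of the local may cause it to be spilled (the function and variable names are illustrative):

        __global__ void copy16(const unsigned char* d_text, unsigned char* s_array,
                               int i, int j) {
            const uint4* text4 = reinterpret_cast<const uint4*>(d_text);
            uint4 var = text4[i];                     // one 16-byte load
            // Reinterpret the 16-byte value as bytes. Char access is well-defined,
            // but addressing a local may force it out of registers.
            const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&var);
            for (int k = 0; k < 16; ++k)
                s_array[j * 16 + k] = bytes[k];
        }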

    Read the article

  • Ubuntu 11.10 doesn't detect Intel integrated graphics (i7-2670QM CPU)

    - by Telmo Marques
    The laptop I'm using is an MSI GT683DX-847PT that comes with an NVIDIA GeForce GTX570M discrete GPU and an Intel Core i7-2670QM CPU. According to Intel's description of the Core i7-2670QM CPU, it has an HD Graphics 3000 integrated GPU. The problem is that the Intel integrated graphics GPU doesn't come up in lspci or lshw; only the NVIDIA GPU shows up. Here is the output of both commands: sudo lspci: http://pastebin.com/raw.php?i=9AZg8bJy sudo lshw: http://pastebin.com/raw.php?i=6cAMFQsY I was counting on having two GPUs so I could run CUDA programs on the discrete NVIDIA GPU while X was handled by the integrated Intel GPU, to prevent kernel execution timeouts. Why doesn't the Intel HD Graphics 3000 GPU show up? Are there any tests I could run to verify the presence of an integrated GPU?

    Read the article

  • How can I upgrade from Ubuntu 9.10 to 11.10?

    - by Chinnu
    We need to program in CUDA 5.0, which can be installed only on Ubuntu 11.10 or 12.04. Our current version, 9.10, is no longer supported, so we chose to proceed with a clean installation. Since we have a shared workstation, we used Clonezilla for cloning the system. However, booting from the LiveCD showed an unexpected error. We also tried to install 11.10 on an external HDD by partitioning it, but GParted could not be installed and terminated with the error "installArchives() failed", which we couldn't solve even after modifying the sources.list. Is there a way to proceed with this upgrade?

    Read the article

  • Why is rvalue write in shared memory array serialised?

    - by CJM
    I'm using CUDA 4.0 on a GPU with compute capability 2.1. One of my device functions is the following: __device__ void test(int n, int* itemp) // itemp is a shared memory pointer { const int tid = threadIdx.x; const int bdim = blockDim.x; int i, j, k; bool flag = 0; itemp[tid] = 0; for(i=tid; i<n; i+=bdim) { // { code that produces some values of "flag" } } itemp[tid] = flag; } Each thread is checking some conditions and producing a 0/1 flag. Then each thread is writing flag at the tid-th location of a shared int array. The write statement "itemp[tid] = flag;" gets serialized -- though "itemp[tid] = 0;" is not. This is causing a huge performance lag which technically should not be there -- I want to avoid it. Please help.

    Read the article

  • Financial applications on GPGPU

    - by CUDA-dev
    I want to know what sort of financial applications can be implemented using a GPGPU. I'm aware of option pricing / stock price estimation using Monte Carlo simulation on a GPGPU with CUDA. Can someone enumerate the various possibilities for utilizing a GPGPU for any application in the finance domain?
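
    As a concrete point of reference for the Monte Carlo case mentioned above, here is a hedged sketch of a one-path-per-thread European call pricer (a standard textbook formulation, not taken from the question; the price estimate is the mean of the payoffs, computed by a separate reduction):

        #include <curand_kernel.h>

        // Hypothetical Monte Carlo pricer: each thread simulates one terminal
        // price under geometric Brownian motion and stores a discounted payoff.
        __global__ void mc_european_call(float* payoffs, int nPaths, float S0,
                                         float K, float r, float sigma, float T,
                                         unsigned long long seed) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= nPaths) return;
            curandState state;
            curand_init(seed, i, 0, &state);
            float z = curand_normal(&state);
            float ST = S0 * expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z);
            payoffs[i] = expf(-r * T) * fmaxf(ST - K, 0.0f);   // discounted payoff
        }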

    Read the article

  • CUDA: When to use shared memory and when to rely on L1 caching?

    - by Roger Dahl
    Since Compute Capability 2.0 (Fermi) was released, I've wondered if there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background? Is shared memory simply there to let algorithms designed for CC < 2.0 run efficiently without modifications? To collaborate via shared memory, threads in a block write to shared memory and synchronize with __syncthreads(). Why not simply write to global memory (through L1), and synchronize with __threadfence_block()? The latter option should be easier to implement since it doesn't have to deal with two different locations for the values, and it should be faster because there is no explicit copying from global to shared memory. Since the data gets cached in L1, threads don't have to wait for data to actually make it all the way out to global memory. With shared memory, one is guaranteed that a value that was put there remains there throughout the duration of the block. This is as opposed to values in L1, which get evicted if they are not used often enough. Are there any cases where it's better to cache such rarely used data in shared memory than to let the L1 manage them based on the usage pattern that the algorithm actually has?
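
    For reference, this is the collaboration pattern the question describes (explicit staging into shared memory plus __syncthreads()), shown here as a minimal block-wide sum; blockDim.x is assumed to be a power of two and the shared allocation is sized at launch:

        // Block-wide sum via shared memory: each thread stages one value,
        // then the block cooperatively halves the range until s[0] holds the sum.
        __global__ void block_sum(const float* in, float* out, int n) {
            extern __shared__ float s[];
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
            __syncthreads();                           // every value is staged
            for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
                if (threadIdx.x < stride)
                    s[threadIdx.x] += s[threadIdx.x + stride];
                __syncthreads();                       // wait before the next halving
            }
            if (threadIdx.x == 0) out[blockIdx.x] = s[0];
        }

    A launch would pass the dynamic shared-memory size as the third configuration argument, e.g. block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n).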

    Read the article

  • CUDA, more threads for same work = Longer run time despite better occupancy, Why?

    - by zenna
    I encountered a strange problem where increasing my occupancy by increasing the number of threads reduced performance. I created the following program to illustrate the problem: #include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h> __global__ void less_threads(float * d_out) { int num_inliers; for (int j=0;j<800;++j) { //Do 12 computations num_inliers += threadIdx.x*1; num_inliers += threadIdx.x*2; num_inliers += threadIdx.x*3; num_inliers += threadIdx.x*4; num_inliers += threadIdx.x*5; num_inliers += threadIdx.x*6; num_inliers += threadIdx.x*7; num_inliers += threadIdx.x*8; num_inliers += threadIdx.x*9; num_inliers += threadIdx.x*10; num_inliers += threadIdx.x*11; num_inliers += threadIdx.x*12; } if (threadIdx.x == -1) d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers; } __global__ void more_threads(float *d_out) { int num_inliers; for (int j=0;j<800;++j) { // Do 4 computations num_inliers += threadIdx.x*1; num_inliers += threadIdx.x*2; num_inliers += threadIdx.x*3; num_inliers += threadIdx.x*4; } if (threadIdx.x == -1) d_out[blockIdx.x*blockDim.x+threadIdx.x] = num_inliers; } int main(int argc, char* argv[]) { float *d_out = NULL; cudaMalloc((void**)&d_out,sizeof(float)*25000); more_threads<<<780,128>>>(d_out); less_threads<<<780,32>>>(d_out); return 0; } Note both kernels should do the same amount of work in total (the if threadIdx.x == -1 is a trick to stop the compiler optimising everything out and leaving an empty kernel). The work should be the same, as more_threads is using 4 times as many threads but with each thread doing 4 times less work. Significant results from the profiler are as follows: more_threads: GPU runtime = 1474 us, reg per thread = 6, occupancy=1, branch=83746, divergent_branch = 26, instructions = 584065, gst request=1084552 less_threads: GPU runtime = 921 us, reg per thread = 14, occupancy=0.25, branch=20956, divergent_branch = 26, instructions = 312663, gst request=677381 As I said previously, the run time of the kernel using more threads is longer; this could be due to the increased number of instructions. Why are there more instructions? Why is there any branching, let alone divergent branching, considering there is no conditional code? Why are there any gst requests when there is no global memory access? What is going on here? Thanks

    Read the article

  • CG miner "configure: error: No mining configured in "

    - by Jorma
    Nvidia GT 630, CUDA 5.5: the CUDA examples run fine, but CGminer does not. Should CGminer work, or are there limitations to it? sudo ./autogen.sh --disable-cpumining --enable-opencl && make Configuration Options Summary: libcurl(GBT+getwork).: Enabled: -lcurl curses.TUI...........: FOUND: -lncurses Avalon.ASICs.........: Disabled BlackArrow.ASICs.....: Disabled BFL.ASICs............: Disabled BitForce.FPGAs.......: Disabled BitFury.ASICs........: Disabled Hashfast.ASICs.......: Disabled Icarus.ASICs/FPGAs...: Disabled Klondike.ASICs.......: Disabled KnC.ASICs............: Disabled ModMiner.FPGAs.......: Disabled configure: error: No mining configured in

    Read the article

  • Nsight Edition 4.0 adds support for Visual Studio 2013 and CUDA 6; NVIDIA releases a new version of its development platform

    Nsight Visual Studio Edition 4.0 adds support for Visual Studio 2013, as well as support for the Maxwell architecture and CUDA 6. NVIDIA's development platform, Nsight, now arrives in version 4.0 and supports Visual Studio 2013. This tool, which integrates fully into Visual Studio, lets you debug and profile your DirectX, OpenGL and CUDA applications. In addition to supporting the latest version of Visual Studio, this new release also supports CUDA 6,...

    Read the article

  • cmake: Target-specific preprocessor definitions for CUDA targets seems not to work

    - by Nils
    I'm using cmake 2.8.1 on Mac OS X 10.6 with CUDA 3.0. So I added a CUDA target which needs BLOCK_SIZE set to some number in order to compile. cuda_add_executable(SimpleTestsCUDA SimpleTests.cu BlockMatrix.cpp Matrix.cpp ) set_target_properties(SimpleTestsCUDA PROPERTIES COMPILE_FLAGS -DBLOCK_SIZE=3) When running make VERBOSE=1 I noticed that nvcc is invoked w/o -DBLOCK_SIZE=3, which results in an error, because BLOCK_SIZE is used in the code but defined nowhere. Now I used the same definition for a CPU target (using add_executable(...)) and there it worked. So now the questions: How do I figure out what cmake does with the set_target_properties line if it points to a CUDA target? Googling around didn't help so far, and a workaround would be cool.

    Read the article

  • Trying to install Proprietary Nvidia Graphics Drivers

    - by Peter Snow
    After reading and trying many different suggestions for some hours, I returned to this how-to: https://help.ubuntu.com/community/BinaryDriverHowto/Nvidia The first problem I encounter is how to identify which of the listed drivers supports my Nvidia GeForce 630M graphics card. Following the links doesn't really help, since it is not stated there either (except where support for a new driver was added later, which is explicitly stated, but the original devices covered are not). However, even if I knew, if it doesn't appear in the 'Additional Drivers' dialogue (see below), how will I install it? Second issue: the article goes on to say that available drivers for my hardware are usually listed in 'Additional Drivers'. In my case, they aren't. Unfortunately, it doesn't tell me how to correct that or work around it. I've checked the BIOS and there is no way offered there to disable the integrated graphics, only the Nvidia graphics. I've also tried each available option in this: $ sudo update-alternatives --config i386-linux-gnu_gl_conf My system is an Acer Aspire 4752G bought May 2012. I'm running Ubuntu 12.04 LTS. uname -a: 3.2.0-38-generic-pae #61-Ubuntu SMP Tue Feb 19 12:39:51 UTC 2013 i686 i686 i386 GNU/Linux It's 64-bit hardware but I installed a 32-bit OS for greater software compatibility. Running $ sudo tail -fn 500 /var/log/Xorg.0.log | grep '(EE)' returns: (WW) warning, (EE) error, (NI) not implemented, (??) unknown. [ 28.886] (EE) Failed to initialize GLX extension (Compatible NVIDIA X driver not found) The reason for wanting the proprietary drivers is that my laptop comes with a 3D accelerated graphics adaptor, and rather than confining myself to struggling with the on-board graphics, I would rather use it. I also want to experiment with using it for bitmining (which uses the GPU for computing power).

    Read the article

  • Encoding movie files into h264

    - by Shiki
    Found some topics about archiving into h264, but those were about the generic questions (is it worth it, which codec to use). I want to use h264 (with CUDA, if possible). So far the only usable encoder with x264 I have found is Avidemux, but it makes an unwatchable video file after the encoding (using the best profile, all settings maxed out), really blurry. Please describe in detail what to use, where to get it (whether it's free doesn't matter), what to set, etc. Thanks in advance. (OS: Windows 7 Ultimate x64, VGA is VP2 capable with CUDA, GTX260 XFX.) Of course, if there is an up-to-date duplicate, just comment with the link and I'll remove the question ASAP.

    Read the article

  • How to copy the memory allocated in a device function back to main memory

    - by xhe8
    I have a CUDA program containing a host function and a device function Execute(). In the host function, I allocate a global memory output which will then be passed to the device function and used to store the address of the global memory allocated within the device function. I want to access the in-kernel allocated memory in the host function. The following is the code: #include <stdio.h> typedef struct { int * p; int num; } Structure_A; \__global__ void Execute(Structure_A *output); int main(){ Structure_A *output; cudaMalloc((void***)&output,sizeof(Structure_A)*1); dim3 dimBlockExecute(1,1); dim3 dimGridExecute(1,1); Execute<<<dimGridExecute,dimBlockExecute>>>(output); Structure_A * output_cpu; int * p_cpu; cudaError_t err; output_cpu= (Structure_A*)malloc(1); err=cudaMemcpy(output_cpu,output,sizeof(Structure_A),cudaMemcpyDeviceToHost); if( err != cudaSuccess) { printf("CUDA error a: %s\n", cudaGetErrorString(err)); exit(-1); } p_cpu=(int *)malloc(1); err=cudaMemcpy(p_cpu,output_cpu[0].p,sizeof(int),cudaMemcpyDeviceToHost); if( err != cudaSuccess) { printf("CUDA error b: %s\n", cudaGetErrorString(err)); exit(-1); } printf("output=(%d,%d)\n",output_cpu[0].num,p_cpu[0]); return 0; } \__global__ void Execute(Structure_A *output){ int thid=threadIdx.x; output[thid].p= (int*)malloc(thid+1); output[thid].num=(thid+1); output[thid].p[0]=5; } I can compile the program. But when I run it, I got a error showing that there is a invalid argument in the following memory copy function. "err=cudaMemcpy(p_cpu,output_cpu[0].p,sizeof(int),cudaMemcpyDeviceToHost);" CUDA version is 4.2. CUDA card: Tesla C2075 OS: x86_64 GNU/Linux

    Read the article

  • Why aren't we programming on the GPU???

    - by Chris
    So I finally took the time to learn CUDA and get it installed and configured on my computer, and I have to say, I'm quite impressed! Here's how it does at rendering the Mandelbrot set at 1280 x 678 pixels on my home PC with a Q6600 and a GeForce 8800GTS (max of 1000 iterations): Maxing out all 4 CPU cores with OpenMP: 2.23 fps Running the same algorithm on my GPU: 104.7 fps And here's how fast I got it to render the whole set at 8192 x 8192 with a max of 1000 iterations: Serial implementation on my home PC: 81.2 seconds All 4 CPU cores on my home PC (OpenMP): 24.5 seconds 32 processors on my school's super computer (MPI with master-worker): 1.92 seconds My home GPU (CUDA): 0.310 seconds 4 GPUs on my school's super computer (CUDA with static domain decomposition): 0.0547 seconds So here's my question - if we can get such huge speedups by programming the GPU instead of the CPU, why is nobody doing it??? I can think of so many things we could speed up like this, and yet I don't know of many commercial apps that are actually doing it. Also, what kinds of other speedups have you seen by offloading your computations to the GPU?
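
    For context, the kind of kernel being benchmarked here is the classic escape-time computation, one thread per pixel; a hedged sketch (the parameters and the pixel-to-plane mapping are illustrative, not the poster's code):

        // Hypothetical escape-time kernel: one thread per pixel, up to maxIter iterations.
        __global__ void mandelbrot(int* iterations, int width, int height, int maxIter) {
            int px = blockIdx.x * blockDim.x + threadIdx.x;
            int py = blockIdx.y * blockDim.y + threadIdx.y;
            if (px >= width || py >= height) return;
            float x0 = -2.5f + 3.5f * px / width;    // map the pixel to the complex plane
            float y0 = -1.0f + 2.0f * py / height;
            float x = 0.0f, y = 0.0f;
            int it = 0;
            while (x * x + y * y <= 4.0f && it < maxIter) {
                float xt = x * x - y * y + x0;
                y = 2.0f * x * y + y0;
                x = xt;
                ++it;
            }
            iterations[py * width + px] = it;
        }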

    Read the article

  • Porting a project to OpenGL3

    - by Decapsuleur
    Hi everyone, I'm working on a C++ cross-platform OpenGL application (Windows, Linux and MacOS) and I am wondering if some of you could share some advice on porting a large application to OpenGL 3. The reason I am looking into OpenGL 3 is that I think we could benefit a lot from using the new "Sync objects". Nvidia has supported such an extension since the Geforce 256 days (gl_nv_fences), but there seems to be no equivalent functionality on ATI hardware before OpenGL 3.0+... Our code makes quite heavy use of glut/freeglut, glu functions, OpenGL 2 extensions and CUDA (on supported hardware). The problem I am now facing is that "gl3.h" and "gl.h" are mutually incompatible (as stated in gl3.h). Do you guys know if there is a GL3 glut equivalent? Also, looking at the CUDA toolkit header files, it seems that GL-CUDA interoperability is only available when using older versions of OpenGL... (cuda_gl_interop.h includes gl.h...). Am I missing something? Thanks a lot for your help.

    Read the article

  • Transform OpenCV image data type to DevIL image format and vice-versa

    - by D.K
    I want to use a CUDA-enabled SIFT library, but I am using the OpenCV driver to get images from the webcam. The CUDA library uses the DevIL library for its image data types. Should I transform the images from OpenCV data types to DevIL? Or should I use another method for getting images from the webcam (one with DevIL-compatible data types)? Thanks for your attention.

    Read the article
