Search Results

Search found 244 results on 10 pages for 'cuda'.

Page 2/10 | < Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >

  • CUDA driver installation on a laptop with nVidia NVS140M card

    - by stanigator
    I'm trying to first figure out if my computer contains a CUDA-enabled card. It has an nVidia NVS 140M card, but I can't seem to figure out if it is the 128 MB version or 256 MB version. On the laptop purchase receipt, I found out that I ordered the 128 MB version, but the control panel description of the card said otherwise as shown below: When I ran the CUDA driver from nVidia's site, it cannot find a hardware compatible with CUDA (even though the product series is CUDA-enabled, the card does not have 256 MB minimum of memory to do so). What would be your recommendations in this case with trying to use CUDA on this computer (I'm not sure if nothing can be done at this point)? Thanks in advance.

    Read the article

  • CUDA small kernel 2d convolution - how to do it

    - by paulAl
    I've been experimenting with CUDA kernels for days to perform a fast 2D convolution between a 500x500 image (but I could also vary the dimensions) and a very small 2D kernel (a laplacian 2d kernel, so it's a 3x3 kernel.. too small to take a huge advantage with all the cuda threads). I created a CPU classic implementation (two for loops, as easy as you would think) and then I started creating CUDA kernels. After a few disappointing attempts to perform a faster convolution I ended up with this code: http://www.evl.uic.edu/sjames/cs525/final.html (see the Shared Memory section), it basically lets a 16x16 threads block load all the convolution data he needs in the shared memory and then performs the convolution. Nothing, the CPU is still a lot faster. I didn't try the FFT approach because the CUDA SDK states that it is efficient with large kernel sizes. Whether or not you read everything I wrote, my question is: how can I perform a fast 2D convolution between a relatively large image and a very small kernel (3x3) with CUDA?

    Read the article

  • Do the amount of CUDA cores matter for Sony Vegas

    - by Saif Bechan
    I am building a new PC, and I am wondering if I am making the right choices. The PC will mainly be used for video editing, and mainly in Sony Vegas. Now I have the choice of going with a video that that has more of less CUDO cores. To be precise I am choosing between a Ti and non-Ti version of an nVidea video card. Now I have read that sony vegas can partially use the CUDA cores to it's benefit, but not that much, and only if you use GPU acceleration. On the other hand I have heard it's better to leave GPU acceleration off as it will effect the quality of the output. Making it so that the CUDA cores actually don't make much of a difference after all.

    Read the article

  • Best linux distro for cuda development

    - by user18151
    Can someone suggest the best linux distro for CUDA development. The reason why I'm asking is that I tried installing the latest cuda SDK in Fedora 12, and it was a real pain in the neck. It took me 8 hours to get the nouveau driver removed and the nvidia-driver installed. After that somehow the OS decides to act up and blow up the /var/log/message file to 9 GB and eat up all my remaining space, with strange errors. I don't even understand what happened further, but my Nvidia drives don't work anymore. Please don't flame me, I'm NOT a windows fanboy or anything. I've been using Linux since 2002, and actually like it. Its just my personal experience. Would be really helpful for positive suggestions. Fanboys, please stay aside. Thanks in advance.

    Read the article

  • Can't install CUDA drivers for GeForce GT555M

    - by saeed
    I've just bought a new Asus n55 laptop. It has 2 graphics cards from Intel and NVIDIA. But when I try to install CUDA's developer driver for my GPU I get this error: "This graphics driver could not find compatible graphics hardware". I have downloaded both of the following files but both of them get mentioned error: Developer Drivers for WinVista and Win7 (270.81) Notebook Developer Drivers for WinVista and Win7 (275.33) How can I fix this problem? Actually how can I develop CUDA programs on my NVIDIA GPU?

    Read the article

  • How to get Nvidia 555M working on Ubuntu 12.10 (drivers, optimus, cuda)?

    - by bluudz
    I'm trying for few days to get my GPU working properly on Alienware M14x with GPU NVidia 555M, but I have no luck at all. After fresh ubuntu install I did follow the guide here NVIDIA Optimus and Ubuntu 12.10 and istalled Bumblebee without problems. Tested glxspheres/optirun glxspheres both working fine. Now I was continuing to install CUDA as is said here How can I get nVidia CUDA or OpenCL working on a laptop with nVidia discrete card/Intel Integrated Graphics? but I'm getting: Driver: Not Selected Toolkit: Installation Failed. Using unsupported Compiler. Samples: Installation Failed. Missing required libraries. I did not select the driver as I though Bumblebee installed driver already. How should I proceed? And also at what point is the NVidia driver being installed and how can I try its working? Bumblebee seems to be installing the driver, CUDA wants to do the same, its all a bit confusing really. Sorry if its lame question, but I would really want to at least get graphic card and second screen working.. Thank you for any help.

    Read the article

  • How Do You Profile & Optimize CUDA Kernels?

    - by John Dibling
    I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code. There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from. How do you identify ways to make your CUDA kernels perform faster?

    Read the article

  • How do CUDA devices handle immediate operands?

    - by Jack Lloyd
    Compiling CUDA code with immediate (integer) operands, are they held in the instruction stream, or are they placed into memory? Specifically I'm thinking about 24 or 32 bit unsigned integer operands. I haven't been able to find information about this in any of the CUDA documentation I've examined so far. So references to any documents on specific uarch details like this would be perfect, as I don't currently have a good model for how CUDA works at this level.

    Read the article

  • Creating .lib files in CUDA Toolkit 5

    - by user1683586
    I am taking my first faltering steps with CUDA Toolkit 5.0 RC using VS2010. Separate compilation has me confused. I tried to set up a project as a Static Library (.lib), but when I try to build it, it does not create a device-link.obj and I don't understand why. For instance, there are 2 files: A caller function that uses a function f #include "thrust\host_vector.h" #include "thrust\device_vector.h" using namespace thrust::placeholders; extern __device__ double f(double x); struct f_func { __device__ double operator()(const double& x) const { return f(x); } }; void test(const int len, double * data, double * res) { thrust::device_vector<double> d_data(data, data + len); thrust::transform(d_data.begin(), d_data.end(), d_data.begin(), f_func()); thrust::copy(d_data.begin(),d_data.end(), res); } And a library file that defines f __device__ double f(double x) { return x+2.0; } If I set the option generate relocatable device code to No, the first file will not compile due to unresolved extern function f. If I set it to -rdc, it will compile, but does not produce a device-link.obj file and so the linker fails. If I put the definition of f into the first file and delete the second it builds successfully, but now it isn't separate compilation anymore. How can I build a static library like this with separate source files? [Updated here] I called the first caller file "caller.cu" and the second "libfn.cu". The compiler lines that VS2010 outputs (which I don't fully understand) are (for caller): nvcc.exe -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -G --keep-dir "Debug" -maxrregcount=0 --machine 32 --compile -g -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o "Debug\caller.cu.obj" "G:\Test_Linking\caller.cu" -clean and the same for libfn, then: nvcc.exe -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2010 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin" -rdc=true -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -G --keep-dir "Debug" -maxrregcount=0 --machine 32 --compile -g -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o "Debug\caller.cu.obj" "G:\Test_Linking\caller.cu" and again for libfn.

    Read the article

  • How can I get nVidia CUDA or OpenCL working on a laptop with nVidia discrete card/Intel Integrated Graphics?

    - by PeterDC
    Background: I'm a 3D artist (as a hobby) and have recently started using Ubuntu 12.04 LTS as a dual-boot with Windows 7. It's running on my a fairly new 64-bit Toshiba laptop with an nVidia GeForce GT 540M GPU (graphics card). It also, however has Intel Integrated Graphics (which I suspect Ubuntu's been using). So, when I render my 3D scenes to images on Windows, I am able to choose between using my CPU or my nVidia GPU (faster). From the 3D application, I can set the GPU to use either CUDA or OpenCL. In Ubuntu, there's no GPU option. After doing (too much?) research on the issues with Linux and the nVidia Optimus technology, I am slightly more enlightened, but a lot more confused. I don't care one bit about the Optimus technology, as battery life is not by any means an issue for me. Here's my question: What can I do to be able to use CUDA-utilizing programs (such as Blender) on my nVidia GPU in Ubuntu? Will I need nVidia drivers? (I have heard they don't play nicely with Optimus setups on Linux.) Is there at least a way to use OpenCL on my GPU in Ubuntu?

    Read the article

  • cmake, gcc, cuda and -m32 wtf

    - by Nils
    Hi all I figured out that CUDA does not work in 64bit mode on my mac (or couldn't get it running so far). Therefore I decided to compile everything for 32bit. I use cmake 2.8 and added the following options add_definitions(-Wall -m32) set(CUDA_64_BIT_DEVICE_CODE OFF) set(CMAKE_MODULE_LINKER_FLAGS -m32) However when it tries to link it it does something like this: /usr/bin/c++ -mmacosx-version-min=10.6 -Wl,-search_paths_first -headerpad_max_install_names CMakeFiles/SimpleTestsCUDA.dir/BlockMatrix.cpp.o CMakeFiles/SimpleTestsCUDA.dir/Matrix.cpp.o ./SimpleTestsCUDA_generated_SimpleTests.cu.o ./SimpleTestsCUDA_generated_BlockMatrix.cu.o -o SimpleTestsCUDA /usr/local/cuda/lib/libcudart.dylib /usr/local/cuda/lib/libcuda.dylib Which fails with a lot of "file is not of required architecture" warnings from ld. Now if I add manually -m32 to the command above it works. However I have no idea how to teach cmake to add -m32 to every gcc (or ld) invocation. So far it does it for nvcc and gcc, but not for linking..

    Read the article

  • cuda program on VMware

    - by scatman
    i wrote a cuda program and i am testing it on ubuntu as a virtual machine. the reason for this is i have windows 7, i don't want to install ubuntu as a secondary operating system, and i need to use a linux operating system for testing. my question is: will the virtual machine limit the gpu resources? So will my cuda code be faster if i run it under my primary operating system than running it on a virtual machine?

    Read the article

  • CUDA linking error - Visual Express 2008 - nvcc fatal due to (null) configuration file

    - by Josh
    Hi, I've been searching extensively for a possible solution to my error for the past 2 weeks. I have successfully installed the Cuda 64-bit compiler (tools) and SDK as well as the 64-bit version of Visual Studio Express 2008 and Windows 7 SDK with Framework 3.5. I'm using windows XP 64-bit. I have confirmed that VSE is able to compile in 64-bit as I have all of the 64-bit options available to me using the steps on the following website: (since Visual Express does not inherently include the 64-bit packages) http://jenshuebel.wordpress.com/2009/02/12/visual-c-2008-express-edition-and-64-bit-targets/ I have confirmed the 64-bit compile ability since the "x64" is available from the pull-down menu under "Tools-Options-VC++ Directories" and compiling in 64-bit does not result in the entire project being "skipped". I have included all the needed directories for 64-bit cuda tools, 64 SDK and Visual Express (\VC\bin\amd64). Here's the error message I receive when trying to compile in 64-bit: 1>------ Build started: Project: New, Configuration: Release x64 ------ 1>Compiling with CUDA Build Rule... 1>"C:\CUDA\bin64\nvcc.exe" -arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT " -maxrregcount=32 --compile -o "x64\Release\template.cu.obj" "c:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\src\CUDA_Walkthrough_DeviceKernels\template.cu" 1>nvcc fatal : Visual Studio configuration file '(null)' could not be found for installation at 'C:/Program Files (x86)/Microsoft Visual Studio 9.0/VC/bin/../..' 1>Linking... 1>LINK : fatal error LNK1181: cannot open input file '.\x64\Release\template.cu.obj' 1>Build log was saved at "file://c:\Documents and Settings\Administrator\My Documents\Visual Studio 2008\Projects\New\New\x64\Release\BuildLog.htm" 1>New - 1 error(s), 0 warning(s) ========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ========== Here's the simple code I'm trying to compile/run in 64-bit: #include <stdlib.h> #include <stdio.h> #include <string.h> #include <math.h> #include <cuda.h> void mypause () { printf ( "Press [Enter] to continue . . ." ); fflush ( stdout ); getchar(); } __global__ void VecAdd1_Kernel(float* A, float* B, float* C, int N) { int i = blockDim.x*blockIdx.x+threadIdx.x; if (i<N) C[i] = A[i] + B[i]; //result should be a 16x1 array of 250s } __global__ void VecAdd2_Kernel(float* B, float* C, int N) { int i = blockDim.x*blockIdx.x+threadIdx.x; if (i<N) C[i] = C[i] + B[i]; //result should be a 16x1 array of 400s } int main() { int N = 16; float A[16];float B[16]; size_t size = N*sizeof(float); for(int i=0; i<N; i++) { A[i] = 100.0; B[i] = 150.0; } // Allocate input vectors h_A and h_B in host memory float* h_A = (float*)malloc(size); float* h_B = (float*)malloc(size); float* h_C = (float*)malloc(size); //Initialize Input Vectors memset(h_A,0,size);memset(h_B,0,size); h_A = A;h_B = B; printf("SUM = %f\n",A[1]+B[1]); //simple check for initialization //Allocate vectors in device memory float* d_A; cudaMalloc((void**)&d_A,size); float* d_B; cudaMalloc((void**)&d_B,size); float* d_C; cudaMalloc((void**)&d_C,size); //Copy vectors from host memory to device memory cudaMemcpy(d_A,h_A,size,cudaMemcpyHostToDevice); cudaMemcpy(d_B,h_B,size,cudaMemcpyHostToDevice); //Invoke kernel int threadsPerBlock = 256; int blocksPerGrid = (N+threadsPerBlock-1)/threadsPerBlock; VecAdd1(blocksPerGrid, threadsPerBlock,d_A,d_B,d_C,N); VecAdd2(blocksPerGrid, threadsPerBlock,d_B,d_C,N); //Copy results from device memory to host memory //h_C contains the result in host memory cudaMemcpy(h_C,d_C,size,cudaMemcpyDeviceToHost); for(int i=0; i<N; i++) //output result from the kernel "VecAdd" { printf("%f ", h_C[i] ); printf("\n"); } printf("\n"); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); free(h_A); free(h_B); free(h_C); mypause(); return 0; }

    Read the article

  • Optimize CUDA with Thrust in a loop

    - by macs
    Given the following piece of code, generating a kind of code dictionary with CUDA using thrust (C++ template library for CUDA): thrust::device_vector<float> dCodes(codes->begin(), codes->end()); thrust::device_vector<int> dCounts(counts->begin(), counts->end()); thrust::device_vector<int> newCounts(counts->size()); for (int i = 0; i < dCodes.size(); i++) { float code = dCodes[i]; int count = thrust::count(dCodes.begin(), dCodes.end(), code); newCounts[i] = dCounts[i] + count; //Had we already a count in one of the last runs? if (dCounts[i] > 0) { newCounts[i]--; } //Remove thrust::detail::normal_iterator<thrust::device_ptr<float> > newEnd = thrust::remove(dCodes.begin()+i+1, dCodes.end(), code); int dist = thrust::distance(dCodes.begin(), newEnd); dCodes.resize(dist); newCounts.resize(dist); } codes->resize(dCodes.size()); counts->resize(newCounts.size()); thrust::copy(dCodes.begin(), dCodes.end(), codes->begin()); thrust::copy(newCounts.begin(), newCounts.end(), counts->begin()); The problem is, that i've noticed multiple copies of 4 bytes, by using CUDA visual profiler. IMO this is generated by The loop counter i float code, int count and dist Every access to i and the variables noted above This seems to slow down everything (sequential copying of 4 bytes is no fun...). So, how i'm telling thrust, that these variables shall be handled on the device? Or are they already? Using thrust::device_ptr seems not sufficient for me, because i'm not sure whether the for loop around runs on host or on device (which could also be another reason for the slowliness).

    Read the article

  • Best approach for GPGPU/CUDA/OpenCL in Java?

    - by Frederik
    General-purpose computing on graphics processing units (GPGPU) is a very attractive concept to harness the power of the GPU for any kind of computing. I'd love to use GPGPU for image processing, particles, and fast geometric operations. Right now, it seems the two contenders in this space are CUDA and OpenCL. I'd like to know: Is OpenCL usable yet from Java on Windows/Mac? What are the libraries ways to interface to OpenCL/CUDA? Is using JNA directly an option? Am I forgetting something? Any real-world experience/examples/war stories are appreciated.

    Read the article

  • Simultaneous launch of Multiple Kernels using CUDA for a GPU

    - by cudadev
    Is it possible to launch two kernels that do independent tasks, simultaneously. For example if I have this Cuda code // host and device initialization ....... ....... // launch kernel1 myMethod1 <<<.... (params); // launch kernel2 myMethod2 <<<..... (params); Assuming that these kernels are independent, is there a facility to launch them at the same time allocating few grids/blocks for each. Does CUDA/OpenCL have this provision.

    Read the article

  • CUDA: accumulate data into a large histogram of floats

    - by shoosh
    I'm trying to think of a way to implement the following algorithm using CUDA: Working on a large volume of voxels, for each voxel I calculate an index i and a value c. after the calculation I need to perform histogram[i] += c c is a float value and the histogram can have up to 15,000 bins. I'm looking for a way to implement this efficiently using CUDA. The first obvious problem is that with compute capabilities 1.3 which is what I'm using I can't even do an atomicAdd() of floats so how can I accumulate anything reliably? This example by nVidia does something somewhat simpler. The histograms are saved in the shared memory (which I can't do due to its size) and it only accumulates integers. Can this approach be generalized to my case?

    Read the article

  • How can i recover a zip password using CUDA (GPU) ?

    - by marc
    How can i recover a zip password on linux using CUDA (GPU). For the past two days i tried using "fcrackzip" but it's too slow Few months back i saw some application that can use GPU / CUDA and get large performance boost in comparison to CPU. If brute-force using cuda is not possible, please tell me what's the best application for performing a dictionary attack, and where can i find best (largest) dictionary. Regards

    Read the article

  • How to convert a C++ program that uses CUDA into MEX

    - by Harold Wellington Graves
    For work, I am converting the Image Denoising program that comes with the CUDA SDK into a MATLAB program. As far as I know, I have made all the necessary changes required by MATLAB, but when I try to call mex on it, MATLAB returns a bunch of linkage errors that I have no idea how to fix. If anyone has any suggestions on what I might be doing wrong, I would greatly appreciate it. The command I am giving MATLAB is: mex imageDenoisingGL.cpp -I..\..\common\inc -IC:\CUDA\include -L..\..\common\lib -lglut32 And the output from MATLAB is a bunch of these: imageDenoisingGL.obj : error LNK2019: unresolved external symbol __imp__cutCheckCmdLineFlag@12 referenced in function "void __cdecl __cutilExit(int,char * *)" (?__cutilExit@@YAXHPAPAD@Z) I am running: Windows XP x32 Visual Studio 2005 MATLAB 2007a

    Read the article

  • CUDA & VS2010 problem

    - by Kristian D'Amato
    I have scoured the internets looking for an answer to this one, but couldn't find any. I've installed the CUDA 3.2 SDK (and, just now, CUDA 4.0 RC) and everything seems to work fine after long hours of fooling around with include directories, NSight, and all the rest. Well, except this one thing: it keeps highlighting the <<< >>> operator as a mistake. Only on VS2010--not on VS2008. On VS2010 I also get several warnings of the following sort: C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include\xdebug(109): warning C4251: 'std::_String_val<_Ty,_Alloc>::_Alval' : class 'std::_DebugHeapAllocator<_Ty>' needs to have dll-interface to be used by clients of class 'std::_String_val<_Ty,_Alloc>' Anyone know how this can be fixed?

    Read the article

  • Solving problems involving more complex data structures with CUDA

    - by Nils
    So I read a bit about CUDA and GPU programming. I noticed a few things such that access to global memory is slow (therefore shared memory should be used) and that the execution path of threads in a warp should not diverge. I also looked at the (dense) matrix multiplication example, described in the programmers manual and the nbody problem. And the trick with the implementation seems to be the same: Arrange the calculation in a grid (which it already is in case of the matrix mul); then subdivide the grid into smaller tiles; fetch the tiles into shared memory and let the threads calculate as long as possible, until it needs to reload data from the global memory into shared memory. In case of the nbody problem the calculation for each body-body interaction is exactly the same (page 682): bodyBodyInteraction(float4 bi, float4 bj, float3 ai) It takes two bodies and an acceleration vectors. The body vector has four components it's position and the weight. When reading the paper, the calculation is understood easily. But what is if we have a more complex object, with a dynamic data structure? For now just assume that we have an object (similar to the body object presented in the paper) which has a list of other objects attached and the number of objects attached is different in each thread. How could I implement that without having the execution paths of the threads to diverge? I'm also looking for literature which explains how different algorithms involving more complex data structures can be effectively implemented in CUDA.

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10  | Next Page >