Search Results

Search found 73 results on 3 pages for 'openmp'.

Page 1/3 | 1 2 3 | Next Page >

Can I implement the readers and writers algorithm in OpenMP by replacing counting semaphores with another feature?

- by DeveloperDon

After reading about OpenMP and not finding functions to support semaphores, I did an internet search for OpenMP and the readers and writers problem, but found no suitable matches. Is there a general method for replacing counting semaphores in OpenMP with something that it supports? Or is there just a gap in the environment where it does not permit things that are asymmetrical like the third readers and writers problem shown on the following page? http://en.wikipedia.org/wiki/Readers-writers_problem#The_third_readers-writers_problem

Read the article
How to deal with OpenMP thread pool contention

- by dpe82

I'm working on an application that uses both coarse and fine grained multi-threading. That is, we manage scheduling of large work units on a pool of threads manually, and then within those work units certain functions utilize OpenMP for finer grain multithreading. We have realized gains by selectively using OpenMP in our costliest loops, but are concerned about creating contention for the OpenMP worker pool as we add OpenMP blocks to cheaper loops. Is there a way to signal to OpenMP that a block of code should use the pool if it is available, and if not it should process the loop serially?

Read the article
OpenMP in Fortran

- by user345293

I very rarely use fortran, however I have been tasked with taking legacy code rewriting it to run in parallel. I'm using gfortran for my compiler choice. I found some excellent resources at https://computing.llnl.gov/tutorials/openMP/ as well as a few others. My problem is this, before I add any OpenMP directives, if I simply compile the legacy program: gfortran Example1.F90 -o Example1 everything works, but turning on the openmp compiler option even without adding directives: gfortran -openmp Example1.F90 -o Example1 ends up with a Segmentation fault when I run the legacy program. Using smaller test programs that I wrote, I've successfully compiled other programs with -openmp that run on multiple threads, but I'm rather at a loss why enabling the option alone and no directives is resulting in a seg fault. I apologize if my question is rather simple. I could post code but it is rather long. It faults as I assign initial values: REAL, DIMENSION(da,da) :: uconsold REAL, DIMENSION(da,da,dr,dk) :: uconsolde ... uconsold=0.0 uconsolde=0.0 The first assignment to "uconsold" works fine, the second seems to be the source of the fault as when I comment the line out the next several lines execute merrily until "uconsolde" is used again. Thank you for any help in this matter.

Read the article
Controlling FPU behavior in an OpenMP program?

- by STingRaySC

I have a large C++ program that modifies the FPU control word (using _controlfp()). It unmasks some FPU exceptions and installs a SEHTranslator to produce typed C++ exceptions. I am using VC++ 9.0. I would like to use OpenMP (v.2.0) to parallelize some of our computational loops. I've already successfully applied it to one, but the numerical results are slightly different (though I understand it could also be due to calculations being performed in a different order). I'm assuming this is because the FPU state is thread-specific. Is there some way to have the OpenMP threads inherit that state from the master thread? Or is there some way to specify using OpenMP that new threads execute a particular function that sets up the correct state? What is the idiomatic way to deal with this situation?

Read the article
No speed-up with useless printf's using OpenMP

- by t2k32316

I just wrote my first OpenMP program that parallelizes a simple for loop. I ran the code on my dual core machine and saw some speed up when going from 1 thread to 2 threads. However, I ran the same code on a school linux server and saw no speed-up. After trying different things, I finally realized that removing some useless printf statements caused the code to have significant speed-up. Below is the main part of the code that I parallelized: #pragma omp parallel for private(i) for(i = 2; i <= n; i++) { printf("useless statement"); prime[i-2] = is_prime(i); } I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?

Read the article
Linker library for OpenMP for Snow Leopard?

- by unknownthreat

Currently, I am trying out OpenMP on XCode 3.2.2 on Snow Leopard: #include <omp.h> #include <iostream> #include <stdio.h> int main (int argc, char * const argv[]) { #pragma omp parallel printf("Hello from thread %d, nthreads %d\n", omp_get_thread_num(), omp_get_num_threads()); return 0; } I didn't include any linking libraries yet, so the linker complains: "_omp_get_thread_num", referenced from: _main in main.o "_omp_get_num_threads", referenced from: _main in main.o OK, fine, no problem, I take a look in the existing framework, looking for keywords such as openmp or omp... here comes the problem, where is the linking library? Or should I say, what is the name of the linking library for openMP? Is it dylib, framework or what? Or do I need to get it from somewhere first?

Read the article
openmp vs opencl for computer vision

- by user1235711

I am creating a computer vision application that detect objects via a web camera. I am currently focusing on the performance of the application My problem is in a part of the application that generates the XML cascade file using Haartraining file. This is very slow and takes about 6days . To get around this problem I decided to use multiprocessing, to minimize the total time to generate Haartraining XML file. I found two solutions: opencl and (openMp and openMPI ) . Now I'm confused about which one to use. I read that opencl is to use multiple cpu and GPU but on the same machine. Is that so? On the other hand OpenMP is for multi-processing and using openmpi we can use multiple CPUs over the network. But OpenMP has no GPU support. Can you please suggest the pros and cons of using either of the libraries.

Read the article
openmp program elapsed time not scaling with increased threads

- by Griff

I've got this openmp fortran program doing an embarrassingly parallel problem - do loop over 512^3 elements. See output below. Why would there be such strange behavior in the elapsed time as a function of threads? I thought it would peak at a sweet spot then slowly degrade. This clearly isn't happening. Perhaps I misunderstand something about openmp. Threads, omp_get_wtime 1, 103.76298500015400 2, 65.346454000100493 4, 45.923643999965861 7, 38.074195000110194 8, 36.968765000114217 9, 39.45981499995105 10,40.753379000118002 12,39.577559999888763 14,37.909950000001118

Read the article
Segmentation fault on MPI, runs properly on OpenMP

- by Bellman

Hi, I am trying to run a program on a computer cluster. The structure of the program is the following: PROGRAM something ... CALL subroutine1(...) ... END PROGRAM SUBROUTINE subroutine1(...) ... DO i=1,n CALL subroutine2(...) ENDDO ... END SUBROUTINE SUBROUTINE subroutine2(...) ... CALL subroutine3(...) CALL subroutine4(...) ... END SUBROUTINE The idea is to parallelize the loop that calls subroutine2. Main program basically only makes the call to subroutine1 and only its arguments are declared. I use two alternatives. On the one hand, I write OpenMP clauses arround the loop. On the other hand, I add an IF conditional branch arround the call and I use MPI to share the results. In the OpenMP case, I add CALL KMP_SET_STACKSIZE(402653184) at the beginning of the main program and I can run it with 8 threads on an 8 core machine. When I run it (on the same 8 core machine) with MPI (either using 8 or 1 processors) it crashes just when makes the call to subroutine3 with a segmentation fault (signal 11) error. If I comment subroutine4, then it doesn't crash (notice that it crashed just when calling subroutine3 and it works when commenting subroutine4). I compile with mpif90 using MPICH2 libraries and the following flags: -O3 -fpscomp logicals -openmp -threads -m64 -xS. The machine has EM64T architecture and I use a Debian Linux distribution. I set ulimit -s hard before running the program. Any ideas on what is going on? Has it something to do with stack size? Thanks in advance

Read the article
Iteration through std containers in openmp

- by Sasun Hambardzumyan

Hi, people. I try to use openmp for multithreading the loop through std::set. When I write the following code - #pragma omp parallel for for (std::set<A>::const_iterator i = s.begin(); i != s.end(); ++i) { const A a = *i; operate(a); } I get an error - error: invalid type for iteration variable 'i' error: invalid controlling predicate error: invalid increment expression. So is there an another way to correct iteration in std containers using openmp? There is a workaround to use int i and iterate from 0 to s.size() and using iterator inside a loop body, but this is not looks good.

Read the article
OpenMP + SSE gives no speedup

- by Sayan Ghosh

Hi, My Professor found out this interesting experiment of 3D Linearly separable Kernel Convolution using SSE and OpenMP, and gave the task to me to benchmark the statistics on our system. The author claims a crazy 18 fold speedup from the serial approach! Might not be always, but we were expecting at least a 2-4 times speedup running this on a Dual Core Intel. http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/#comment-41994 Alas, we could find exactly no speedup. The serial code performs always better, with or without OpenMP. I am using Linux, and observed a certain trend...when no other processes are running on the system, after a while the loadavg starts increasing, and the the %CPU utilization falls down. Another probable false positive which I ran into accidentally...I started the program, then immediately paused it. Then I ran it on background with bg, and saw a speedup of more than 2. This happens all the time! Any advice would be great. Thanks, Sayan

Read the article
Why aren't unsigned OpenMP index variables allowed?

- by Moe

I have a loop in my C++/OpenMP code that looks like this: #pragma omp parallel for for(unsigned int i=0; i<count; i++) { // do stuff } When I compile it (with Visual Studio 2005) I get the following error: error C3016: 'i' : index variable in OpenMP 'for' statement must have signed integral type I understand that the error occurs because i is unsigned instead of signed, and changing i to be signed removed this error. What I want to know is why is this an error? Why aren't unsigned index variables allowed? Looking at the MSDN page for this error gives me no clues.

Read the article
OpenMP - running things in parallel and some in sequence within them

- by Sayan Ghosh

Hi, I have a scenario like: for (i = 0; i < n; i++) { for (j = 0; j < m; j++) { for (k = 0; k < x; k++) { val = 2*i + j + 4*k if (val != 0) { for(t = 0; t < l; t++) { someFunction((i + t) + someFunction(j + t) + k*t) } } } } } Considering this is block A, Now I have two more similar blocks in my code. I want to put them in parallel, so I used OpenMP pragmas. However I am not able to parallelize it, because I am a tad confused that which variables would be shared and private in this case. If the function call in the inner loop was an operation like sum += x, then I could have added a reduction clause. In general, how would one approach parallelizing a code using OpenMP, when we there is a nested for loop, and then another inner for loop doing the main operation. I tried declaring a parallel region, and then simply putting pragma fors before the blocks, but definitely I am missing a point there! Thanks, Sayan

Read the article
[C++][OpenMP] Proper use of "atomic directive" to lock STL container

- by conradlee

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically. I am using OpenMP's "parallel for" construct. For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating to my integer sets, however, I'm not sure whether this is possible. Basically, I want to know whether the following code could lead to a race condition vector< set<int>* > membershipDirectory(numSets, new set<int>); #pragma omp for schedule(guided,expandChunksize) for(int i=0; i<100; i++) { set<int>* sp = membershipDirectory[5]; #pragma omp atomic sp->insert(45); } (Apologies for any syntax errors in the code---I hope you get the point) I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.

Read the article
What limits scaling in this simple OpenMP program?

- by Douglas B. Staple

I'm trying to understand limits to parallelization on a 48-core system (4xAMD Opteron 6348, 2.8 Ghz, 12 cores per CPU). I wrote this tiny OpenMP code to test the speedup in what I thought would be the best possible situation (the task is embarrassingly parallel): // Compile with: gcc scaling.c -std=c99 -fopenmp -O3 #include <stdio.h> #include <stdint.h> int main(){ const uint64_t umin=1; const uint64_t umax=10000000000LL; double sum=0.; #pragma omp parallel for reduction(+:sum) for(uint64_t u=umin; u<umax; u++) sum+=1./u/u; printf("%e\n", sum); } I was surprised to find that the scaling is highly nonlinear. It takes about 2.9s for the code to run with 48 threads, 3.1s with 36 threads, 3.7s with 24 threads, 4.9s with 12 threads, and 57s for the code to run with 1 thread. Unfortunately I have to say that there is one process running on the computer using 100% of one core, so that might be affecting it. It's not my process, so I can't end it to test the difference, but somehow I doubt that's making the difference between a 19~20x speedup and the ideal 48x speedup. To make sure it wasn't an OpenMP issue, I ran two copies of the program at the same time with 24 threads each (one with umin=1, umax=5000000000, and the other with umin=5000000000, umax=10000000000). In that case both copies of the program finish after 2.9s, so it's exactly the same as running 48 threads with a single instance of the program. What's preventing linear scaling with this simple program?

Read the article
OpenMP timer doesn't work on inline assembly code?

- by Brett

I'm trying to compare some code samples for speed, and I decided to use the OpenMP timer since I'll eventually be multi threading the code. The timer works great on two of my four code snippets, but not on the other two start=omp_get_wtime(); /*code here*/ finish = omp_get_wtime() - start_time; The four code here sections are serial code, xmmintrin.h code, and two inline assembly codes. The serial and xmminstrin.h code are able to be timed, but the inline assembly codes returns -1.#IND00 for a time. I can't seem to figure out why this is? Thanks for any help or suggestions!

Read the article
Cilk or Cilk++ or OpenMP

- by Aman Deep Gautam

I'm creating a multi-threaded application in Linux. here is the scenario: Suppose I am having x instance of a class BloomFilter and I have some y GB of data(greater than memory available). I need to test membership for this y GB of data in each of the bloom filter instance. It is pretty much clear that parallel programming will help to speed up the task moreover since I am only reading the data so it can be shared across all processes or threads. Now I am confused about which one to use Cilk, Cilk++ or OpenMP(which one is better). Also I am confused about which one to go for Multithreading or Multiprocessing

Read the article
Dividing sections inside an omp parallel for : OpenMP

- by Sayan Ghosh

Hi, I have a situation like: #pragma omp parallel for private(i, j, k, val, p, l) for (i = 0; i < num1; i++) { for (j = 0; j < num2; j++) { for (k = 0; k < num3; k++) { val = m[i + j*somenum + k*2] if (val != 0) for (l = start; l <= end; l++) { someFunctionThatWritesIntoGlobalArray((i + l), j, k, (someFunctionThatGetsValueFromAnotherArray((i + l), j, k) * val)); } } } for (p = 0; p < num4; p++) { m[p] = 0; } } Thanks for reading, phew! Well I am noticing a very minor difference in the results (0.999967[omp] against 1[serial]), when I use the above (which is 3 times faster) against the serial implementation. Now I know I am doing a mistake here...especially the connection between loops is evident. Is it possible to parallelize this using omp sections? I tried some options like making shared(p) {doing this, I got correct values, as in the serial form}, but there was no speedup then. Any general advice on handling openmp pragmas over a slew of for loops would also be great for me!

Read the article
OpenMP: Get total number of running threads

- by Konrad Rudolph

I need to know the total number of threads that my application has spawned via OpenMP. Unfortunately, the omp_get_num_threads() function does not work here since it only yields the number of threads in the current team. However, my code runs recursively (divide and conquer, basically) and I want to spawn new threads as long as there are still idle processors, but no more. Is there a way to get around the limitations of omp_get_num_threads and get the total number of running threads? If more detail is required, consider the following pseudo-code that models my workflow quite closely: function divide_and_conquer(Job job, int total_num_threads): if job.is_leaf(): # Recurrence base case. job.process() return left, right = job.divide() current_num_threads = omp_get_num_threads() if current_num_threads < total_num_threads: # (1) #pragma omp parallel num_threads(2) #pragma omp section divide_and_conquer(left, total_num_threads) #pragma omp section divide_and_conquer(right, total_num_threads) else: divide_and_conquer(left, total_num_threads) divide_and_conquer(right, total_num_threads) job = merge(left, right) If I call this code with a total_num_threads value of 4, the conditional annotated with (1) will always evaluate to true (because each thread team will contain at most two threads) and thus the code will always spawn two new threads, no matter how many threads are already running at a higher level. I am searching for a platform-independent way of determining the total number of threads that are currently running in my application.

Read the article
parallelizing code using openmp

- by anubhav

Hi, The function below contains nested for loops. There are 3 of them. I have given the whole function below for easy understanding. I want to parallelize the code in the innermost for loop as it takes maximum CPU time. Then i can think about outer 2 for loops. I can see dependencies and internal inline functions in the innermost for loop . Can the innermost for loop be rewritten to enable parallelization using openmp pragmas. Please tell how. I am writing just the loop which i am interested in first and then the full function where this loop exists for referance. Interested in parallelizing the loop mentioned below. //* LOOP WHICH I WANT TO PARALLELIZE *// for (y = 0; y < 4; y++) { refptr = PelYline_11 (ref_pic, abs_y++, abs_x, img_height, img_width); LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; } The full function where this loop exists is below for referance. /*! *********************************************************************** * \brief * Setup the fast search for an macroblock *********************************************************************** */ void SetupFastFullPelSearch (short ref, int list) // <-- reference frame parameter, list0 or 1 { short pmv[2]; pel_t orig_blocks[256], *orgptr=orig_blocks, *refptr, *tem; // created pointer tem int offset_x, offset_y, x, y, range_partly_outside, ref_x, ref_y, pos, abs_x, abs_y, bindex, blky; int LineSadBlk0, LineSadBlk1, LineSadBlk2, LineSadBlk3; int max_width, max_height; int img_width, img_height; StorablePicture *ref_picture; pel_t *ref_pic; int** block_sad = BlockSAD[list][ref][7]; int search_range = max_search_range[list][ref]; int max_pos = (2*search_range+1) * (2*search_range+1); int list_offset = ((img->MbaffFrameFlag)&&(img->mb_data[img->current_mb_nr].mb_field))? img->current_mb_nr%2 ? 4 : 2 : 0; int apply_weights = ( (active_pps->weighted_pred_flag && (img->type == P_SLICE || img->type == SP_SLICE)) || (active_pps->weighted_bipred_idc && (img->type == B_SLICE))); ref_picture = listX[list+list_offset][ref]; //===== Use weighted Reference for ME ==== if (apply_weights && input->UseWeightedReferenceME) ref_pic = ref_picture->imgY_11_w; else ref_pic = ref_picture->imgY_11; max_width = ref_picture->size_x - 17; max_height = ref_picture->size_y - 17; img_width = ref_picture->size_x; img_height = ref_picture->size_y; //===== get search center: predictor of 16x16 block ===== SetMotionVectorPredictor (pmv, enc_picture->ref_idx, enc_picture->mv, ref, list, 0, 0, 16, 16); search_center_x[list][ref] = pmv[0] / 4; search_center_y[list][ref] = pmv[1] / 4; if (!input->rdopt) { //--- correct center so that (0,0) vector is inside --- search_center_x[list][ref] = max(-search_range, min(search_range, search_center_x[list][ref])); search_center_y[list][ref] = max(-search_range, min(search_range, search_center_y[list][ref])); } search_center_x[list][ref] += img->opix_x; search_center_y[list][ref] += img->opix_y; offset_x = search_center_x[list][ref]; offset_y = search_center_y[list][ref]; //===== copy original block for fast access ===== for (y = img->opix_y; y < img->opix_y+16; y++) for (x = img->opix_x; x < img->opix_x+16; x++) *orgptr++ = imgY_org [y][x]; //===== check if whole search range is inside image ===== if (offset_x >= search_range && offset_x <= max_width - search_range && offset_y >= search_range && offset_y <= max_height - search_range ) { range_partly_outside = 0; PelYline_11 = FastLine16Y_11; } else { range_partly_outside = 1; } //===== determine position of (0,0)-vector ===== if (!input->rdopt) { ref_x = img->opix_x - offset_x; ref_y = img->opix_y - offset_y; for (pos = 0; pos < max_pos; pos++) { if (ref_x == spiral_search_x[pos] && ref_y == spiral_search_y[pos]) { pos_00[list][ref] = pos; break; } } } //===== loop over search range (spiral search): get blockwise SAD ===== **// =====THIS IS THE PART WHERE NESTED FOR STARTS=====** for (pos = 0; pos < max_pos; pos++) // OUTERMOST FOR LOOP { abs_y = offset_y + spiral_search_y[pos]; abs_x = offset_x + spiral_search_x[pos]; if (range_partly_outside) { if (abs_y >= 0 && abs_y <= max_height && abs_x >= 0 && abs_x <= max_width ) { PelYline_11 = FastLine16Y_11; } else { PelYline_11 = UMVLine16Y_11; } } orgptr = orig_blocks; bindex = 0; for (blky = 0; blky < 4; blky++) // SECOND FOR LOOP { LineSadBlk0 = LineSadBlk1 = LineSadBlk2 = LineSadBlk3 = 0; for (y = 0; y < 4; y++) //INNERMOST FOR LOOP WHICH I WANT TO PARALLELIZE { refptr = PelYline_11 (ref_pic, abs_y++, abs_x, img_height, img_width); LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk0 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk1 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk2 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; LineSadBlk3 += byte_abs [*refptr++ - *orgptr++]; } block_sad[bindex++][pos] = LineSadBlk0; block_sad[bindex++][pos] = LineSadBlk1; block_sad[bindex++][pos] = LineSadBlk2; block_sad[bindex++][pos] = LineSadBlk3; } } //===== combine SAD's for larger block types ===== SetupLargerBlocks (list, ref, max_pos); //===== set flag marking that search setup have been done ===== search_setup_done[list][ref] = 1; } #endif // _FAST_FULL_ME_

Read the article
Segmentation fault while matrix multiplication using openMp?

- by harshit

My matrix multiplication code is int matMul(int ld, double** matrix) { //local variables initialize omp_set_num_threads(nthreads); #pragma omp parallel private(tid,diag,ld) shared(i,j,k,matrix) { /* Obtain and print thread id */ tid = omp_get_thread_num(); for ( k=0; k<ld; k++) { if (matrix[k][k] == 0.0) { error = 1; return error; } diag = 1.0 / matrix[k][k]; #pragma omp for for ( i=k+1; i < ld; i++) { matrix[i][k] = diag * matrix[i][k]; } for ( j=k+1; j<ld; j++) { for ( i=k+1; i<ld; i++) { matrix[i][j] = matrix[i][j] - matrix[i][k] * matrix[k][j]; } } } } return error; } I assume that it is because of matrix object only but why will it be null even though it is passed as a parameter..

Read the article
openmp in mex : stackoverflow error

- by Edwin

i have got the following fraction of code that getting me the stack overflow error #pragma omp parallel shared(Mo1, Mo2, sum_normalized_p_gn, Data, Mean_Out,Covar_Out,Prior_Out, det) private(i) num_threads( number_threads ) { //every thread has a new copy double* normalized_p_gn = (double*)malloc(NMIX*sizeof(double)); #pragma omp critical { int id = omp_get_thread_num(); int threads = omp_get_num_threads(); mexEvalString("drawnow"); } #pragma omp for //some parallel process..... } the variables declared in the shared are created by malloc. and they consumes with large amount of memory there are 2 questions regarding to the above code. 1) why this would generate the stack overflow error( i.e. segmentation fault) before it goes into the parallel for loop? it works fine when it runs in the sequential mode.... 2) am i right to dynamic allocate memory for each thread like "normalized_p_gn" above? Regards Edwin

Read the article
OpenMP implementations in VC++ 2008, 2010

- by John

Depending on implementation, OMP can be quite useful to parallelize fairly arbitrary bits of code - e.g a parallel section inside a method that calls two independent methods - or it can be bad. It depends on how threads are created/cached, I think. How does the VC++ 2008 implementation work? And is the 2010 implementation significantly different in terms of features and performance/flexibility?

Read the article
Small openmp programm freezes sometimes (gcc, c, linux)

- by osgx

Hello Just write a small omp test, and it does not work correctly all the times: #include <omp.h> int main() { int i,j=0; #pragma omp parallel for(i=0;i<1000;i++) { #pragma omp barrier j+= j^i; } return j; } The usage of j for writing from all threads is incorrect in this example, BUT there must be only nondeterministic value of j I have a freeze. Compiled with gcc-4.3.1 -fopenmp a.c -o gcc -static Run on 4-core x86_Core2 Linux server: $ ./gcc and got freeze (sometimes; like 1 freeze for 4-5 fast runs). Strace: [pid 13118] <... futex resumed> ) = 0 [pid 13118] futex(0x80d3014, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 13120] <... futex resumed> ) = 0 [pid 13119] futex(0x80d3014, FUTEX_WAIT, 2, NULL <unfinished ...> [pid 13120] futex(0x80d3014, FUTEX_WAKE, 1) = 1 [pid 13120] futex(0x80cd798, FUTEX_WAIT, 1, NULL <unfinished ...> [pid 13109] <... futex resumed> ) = 0 [pid 13109] futex(0x80d3014, FUTEX_WAKE, 1) = 1 [pid 13109] futex(0x80d3020, FUTEX_WAIT, 251, NULL <unfinished ...> [pid 13118] <... futex resumed> ) = 0 [pid 13118] futex(0x80d3014, FUTEX_WAKE, 1) = 1 [pid 13119] <... futex resumed> ) = 0 [pid 13118] futex(0x80d3020, FUTEX_WAIT, 251, NULL <unfinished ...> [pid 13119] futex(0x80d3014, FUTEX_WAKE, 1) = 0 [pid 13119] futex(0x80d3020, FUTEX_WAIT, 251, NULL <freeze> Why do I have a freeze (deadlock)?

Read the article
openmp sections running sequentially

- by chi42

I have the following code: #pragma omp parallel sections private(x,y,cpsrcptr) firstprivate(srcptr) lastprivate(srcptr) { #pragma omp section { //stuff } #pragma omp section { //stuff } } According to the Zoom profiler, two threads are created, one thread executes both the sections, and the other thread simply blocks! Has anyone encountered anything like this before? (And yes, I do have a dual core machine).

Read the article

1 2 3 | Next Page >