Polite busy-waiting with WRPAUSE on SPARC

Posted by Dave on Oracle Blogs
Published on Wed, 24 Oct 2012 04:37:02 +0000

Unbounded busy-waiting is a poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads (ostensibly doing useful work and making progress) that run on the same core. On a SPARC T4, for instance, 8 strands share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator that temporarily surrenders compute resources to threads on sibling strands, which helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0", which is a high-latency no-op. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps.
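As a rough illustration of the idioms just described, here is a minimal sketch of a "polite pause" helper and a bounded spin loop built on it. The names polite_pause, spin_until_set and the USE_WRPAUSE build macro are hypothetical and not taken from the benchmark. The inline-assembly forms assume a GCC-compatible compiler, and the WRPAUSE path should only be enabled on hardware that actually implements the PAUSE register.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One polite pause. The cycle count is only honored on the WRPAUSE path;
   the other idioms have a fixed, implementation-defined latency. */
static inline void polite_pause(unsigned cycles)
{
#if defined(__sparc__) && defined(USE_WRPAUSE)   /* USE_WRPAUSE: hypothetical macro */
    /* T4-style: write the requested delay into the PAUSE register (ASR27). */
    __asm__ __volatile__("wr %%g0, %0, %%asr27" : : "r"(cycles) : "memory");
#elif defined(__sparc__)
    /* T2-style idiom: a high-latency no-op. */
    (void)cycles;
    __asm__ __volatile__("rd %%ccr, %%g0" : : : "memory");
#elif defined(__x86_64__) || defined(__i386__)
    (void)cycles;
    __asm__ __volatile__("pause" : : : "memory");
#else
    (void)cycles;   /* no known hint for this target: behave as a plain no-op */
#endif
}

/* Spin until *flag becomes true or max_spins pauses have elapsed.
   Returns true if the flag was observed set. */
static bool spin_until_set(atomic_bool *flag, unsigned max_spins, unsigned cycles)
{
    assert(flag != NULL);
    for (unsigned i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return true;
        polite_pause(cycles);
    }
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

The spin is bounded on purpose, so a caller can fall back to blocking when spin_until_set returns false.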

Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads that are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect.
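A spin-then-block wait might look something like the following sketch, which uses C11 atomics plus a pthread mutex and condition variable. The event_t type, its functions and the spin_limit parameter are illustrative assumptions and not code from any particular runtime. A real implementation would tune the spin budget and place a polite pause in the spin loop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A one-shot event that waiters can spin on briefly before blocking. */
typedef struct {
    atomic_bool     ready;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event_t;

static void event_init(event_t *e)
{
    atomic_init(&e->ready, false);
    int rc = pthread_mutex_init(&e->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&e->cond, NULL);
    assert(rc == 0);
    (void)rc;
}

static void event_signal(event_t *e)
{
    /* Set the flag under the lock so a waiter cannot miss the wakeup. */
    pthread_mutex_lock(&e->lock);
    atomic_store_explicit(&e->ready, true, memory_order_release);
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

/* Returns true if the spin phase sufficed, false if the caller fell
   through to the blocking path. */
static bool event_wait(event_t *e, unsigned spin_limit)
{
    /* Phase 1: bounded spinning. */
    for (unsigned i = 0; i < spin_limit; i++) {
        if (atomic_load_explicit(&e->ready, memory_order_acquire))
            return true;
        /* A polite pause (PAUSE on x86, WRPAUSE on a T4) would go here. */
    }
    /* Phase 2: block voluntarily instead of waiting to be preempted. */
    pthread_mutex_lock(&e->lock);
    while (!atomic_load_explicit(&e->ready, memory_order_acquire))
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
    return false;
}
```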

Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0, 1, 2 and 3 were assigned to the first pipeline, and CPUs 4, 5, 6 and 7 were assigned to the second, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances, when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines.

Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10-second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. We can easily compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T.

I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference.
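The worker side of such a benchmark could be sketched roughly as below. This is not the original benchmark source. The 13/17/5 32-bit xorshift variant, the seed, and the names xorshift32, worker_t and run_workers are all assumptions made for illustration. Processor binding and the 8-T waiting threads are omitted to keep the sketch portable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define MAX_WORKERS 8

/* One step of a Marsaglia 32-bit xorshift generator (13/17/5 variant, assumed). */
static inline uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

typedef struct {
    atomic_bool *stop;    /* shared stop flag set by the main thread */
    uint64_t     steps;   /* PRNG steps completed by this worker */
    uint32_t     state;   /* PRNG state; must be nonzero */
} worker_t;

static void *worker_main(void *arg)
{
    worker_t *w = arg;
    uint32_t x = w->state;
    uint64_t n = 0;
    while (!atomic_load_explicit(w->stop, memory_order_relaxed)) {
        x = xorshift32(x);
        n++;
    }
    w->state = x;   /* store the result so the loop has an observable effect */
    w->steps = n;
    return NULL;
}

/* Run t workers for interval_ms milliseconds; return the aggregate step count. */
static uint64_t run_workers(int t, long interval_ms)
{
    assert(t >= 1 && t <= MAX_WORKERS);
    atomic_bool stop = false;
    worker_t w[MAX_WORKERS];
    pthread_t tid[MAX_WORKERS];

    for (int i = 0; i < t; i++) {
        w[i] = (worker_t){ .stop = &stop, .steps = 0,
                           .state = 2463534242u + (uint32_t)i };
        int rc = pthread_create(&tid[i], NULL, worker_main, &w[i]);
        assert(rc == 0);
        (void)rc;
    }

    struct timespec ts = { interval_ms / 1000, (interval_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
    atomic_store(&stop, true);

    uint64_t total = 0;
    for (int i = 0; i < t; i++) {
        pthread_join(tid[i], NULL);
        total += w[i].steps;
    }
    return total;
}
```

Checking the stop flag on every iteration adds a load to the loop, so absolute counts from this sketch would not be comparable to the figures reported in the post.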

*Aggregate progress* (millions of PRNG steps per 10-second interval)
T = number of worker threads

Wait mechanism for 8-T threads     T=1   T=2   T=3   T=4   T=5   T=6   T=7   T=8
Park thread in poll()             3265  3347  3348  3348  3348  3348  3348  3348
no-op                              415   831  1243  1648  2060  2497  2930  3349
RD %ccr,%g0 "pause"               1426  2429  2692  2862  3013  3162  3255  3349
PRNG step                          412   829  1246  1670  2092  2510  2930  3348
WRPause(8000)                     3244  3361  3331  3348  3349  3348  3348  3348
WRPause(4000)                     3215  3308  3315  3322  3347  3348  3347  3348
WRPause(1000)                     3085  3199  3224  3251  3310  3348  3348  3348
WRPause(500)                      2917  3070  3150  3222  3270  3309  3348  3348
WRPause(250)                      2694  2864  2949  3077  3205  3388  3348  3348
WRPause(100)                      2155  2469  2622  2790  2911  3214  3330  3348
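
The per-thread rate mentioned earlier is just the aggregate score divided by T. The small sketch below applies that to two rows copied from the table. The helper names per_thread_rate and print_per_thread are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define NCOLS 8   /* columns T = 1..8 */

/* Average progress of one worker: aggregate score divided by T. */
static double per_thread_rate(int aggregate, int t)
{
    assert(t >= 1 && t <= NCOLS);
    return (double)aggregate / t;
}

/* Print one table row converted to per-thread rates. */
static void print_per_thread(const char *label, const int aggregate[NCOLS])
{
    printf("%-16s", label);
    for (int t = 1; t <= NCOLS; t++)
        printf(" %7.1f", per_thread_rate(aggregate[t - 1], t));
    putchar('\n');
}

/* Two rows copied from the table above (millions of steps per 10 seconds). */
static const int row_noop[NCOLS]      = { 415, 831, 1243, 1648, 2060, 2497, 2930, 3349 };
static const int row_wrpause8k[NCOLS] = { 3244, 3361, 3331, 3348, 3349, 3348, 3348, 3348 };
```

Calling print_per_thread("no-op", row_noop) and print_per_thread("WRPause(8000)", row_wrpause8k) makes the contrast easier to see. With no-op waiters each worker stays near 415 regardless of T, whereas with WRPause(8000) a lone worker at T=1 reaches 3244, roughly 7.8 times as much.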
