Search Results

Search found 54 results on 3 pages for 'numa'.

Page 1/3 | 1 2 3  | Next Page >

  • NUMA-aware constructs for java.util.concurrent

    - by Dave
    The constructs in the java.util.concurrent JSR-166 "JUC" concurrency library are currently NUMA-oblivious. That's because we currently don't have the topology discovery infrastructure and underpinnings in place that would allow and enable NUMA-awareness. But some quick throw-away prototypes show that it's possible to write NUMA-aware library code. I happened to use JUC Exchanger as a research vehicle. Another interesting idea is to adapt fork-join work-stealing to favor stealing from queues associated with 'nearby' threads.

    Read the article

  • Thread placement policies on NUMA systems - update

    - by Dave
    In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded node NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect. On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creating between the processes, so I was relatively sure the effect didn't related to thread creation time, but rather that placement was a function of process membership. I this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different that what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more. To recap, the 'new' policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code. Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could the topic of a future blog entry. The scheduler is work-conserving. The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, following by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly. Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management polices constitute yet another confounding factor with respect to scheduling and dispatching. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand if your threads in your process communicate rarely, then it's possible the new packing policy might result on contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to optimize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could augment the CPU with performance counters for misses where a thread evicts a line it installed vs misses where a thread displaces a line installed by some other thread.)

    Read the article

  • NUMA-aware placement of communication variables

    - by Dave
    For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonable well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node. But when thinking about accesses to heavily-written communication variables we normally consider what caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses and cache probe activity and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables has very little impact on performance. The variables are frequently accessed so likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line. Or at least that's what I thought. Recently, though, I was exploring a simple shared memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that was not the case and that NUMA placement of the communication variables was actually quite important. For background an X4800 system consists of 8 Intel X7560 Xeons . Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted ladder "mobius" topology. The cache coherence fabric is glueless -- there's not central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant. Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip-time and throughput rate between the client and server. In this benchmark the server simply acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies. Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their response not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts and ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needed to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observe the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node, as the writer doesn't need the home to respond over the QPI link, as the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path. Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their response to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. In home-based snooping the snoop probe goes directly to the home node and are not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd party owner of the line, if any, can respond either to the home or the original requester (or even to both) according to the protocol policies. There are myriad variations that have been implemented, and unfortunately vendor terminology doesn't always agree between vendors or with the academic taxonomy papers. The key is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it also may require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping. While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned in banks or slices, which with each slice being associated with a core on the ring topology. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communications, although the latencies are not nearly the magnitude we see inter-package. I've not seen such communication distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more regular.

    Read the article

  • ESX Scheduler and NUMA issue

    - by babyg_wc
    On our 24 core bl685 (4sockets x 6core), we find that NUMA nodes 0 and 1 are pretty busy (unfortunately resulting in elevated cpu ready times on the VMS), whilst NUMA nodes 2 and 3 are almost unused. I thought this just maybe a ESX4 U1 issue, so I had a colleague with a 32 core (dl785) farm investigate, and it seems that his last 3 or 4 NUMA nodes are also not really being utilised. ESX seems to have a weakness when it comes to balancing lightly loaded NUMA boxes, Im going to enabled node interleaving in the BIOS and see if the scheduler balancers across all 24 cores, instead of just 12!... For those of you with large core counts, I would suggest you fire up you viclient, and check Physical CPU useage (or esxtop), I would be interested to hear what your results are. Please note, that its only the lightly loaded (eg less than 30% cpu load on the esx host) that seems to have the biggest issue with load imbalance. Thoughts/comments. PS ive logged a SR with vmware to assist, also the other "problem" could be that we have 128gb of ram in each host, and therefore the scheduler sees no good reason why it shouldnt try and cram all vms's into the first two NUMA nodes, as we only have around 50gb of ram worth of vms on each host...

    Read the article

  • Can the STREAM and GUPS (single CPU) benchmark use non-local memory in NUMA machine

    - by osgx
    Hello I want to run some tests from HPCC, STREAM and GUPS. They will test memory bandwidth, latency, and throughput (in term of random accesses). Can I start Single CPU test STREAM or Single CPU GUPS on NUMA node with memory interleaving enabled? (Is it allowed by the rules of HPCC - High Performance Computing Challenge?) Usage of non-local memory can increase GUPS results, because it will increase 2- or 4- fold the number of memory banks, available for random accesses. (GUPS typically limited by nonideal memory-subsystem and by slow memory bank opening/closing. With more banks it can do update to one bank, while the other banks are opening/closing.) Thanks. UPDATE: (you may nor reorder the memory accesses that the program makes). But can compiler reorder loops nesting? E.g. hpcc/RandomAccess.c /* Perform updates to main table. The scalar equivalent is: * * u64Int ran; * ran = 1; * for (i=0; i<NUPDATE; i++) { * ran = (ran << 1) ^ (((s64Int) ran < 0) ? POLY : 0); * table[ran & (TableSize-1)] ^= stable[ran >> (64-LSTSIZE)]; * } */ for (j=0; j<128; j++) ran[j] = starts ((NUPDATE/128) * j); for (i=0; i<NUPDATE/128; i++) { /* #pragma ivdep */ for (j=0; j<128; j++) { ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0); Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)]; } } The main loop here is for (i=0; i<NUPDATE/128; i++) { and the nested loop is for (j=0; j<128; j++) {. Using 'loop interchange' optimization, compiler can convert this code to for (j=0; j<128; j++) { for (i=0; i<NUPDATE/128; i++) { ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0); Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)]; } } It can be done because this loop nest is perfect loop nest. Is such optimization prohibited by rules of HPCC?

    Read the article

  • Cache coherence literature for big (>=16CPU) systems

    - by osgx
    Hello What books and articles can you recommend to learn basis of cache coherence problems in big SMP systems (which are NUMA and ccNUMA really) with =16 cpu sockets? Something like SGI Altix architecture analysis may be interesting. What protocols (MOESI, smth else) can scale up well?

    Read the article

  • AMD 24 core server memory bandwidth

    - by ntherning
    I need some help to determine whether the memory bandwidth I'm seeing under Linux on my server is normal or not. Here's the server spec: HP ProLiant DL165 G7 2x AMD Opteron 6164 HE 12-Core 40 GB RAM (10 x 4GB DDR1333) Debian 6.0 Using mbw on this server I get the following numbers: foo1:~# mbw -n 3 1024 Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory. Using 262144 bytes as blocks for memcpy block copy test. Getting down to business... Doing 3 runs per test. 0 Method: MEMCPY Elapsed: 0.58047 MiB: 1024.00000 Copy: 1764.082 MiB/s 1 Method: MEMCPY Elapsed: 0.58012 MiB: 1024.00000 Copy: 1765.152 MiB/s 2 Method: MEMCPY Elapsed: 0.58010 MiB: 1024.00000 Copy: 1765.201 MiB/s AVG Method: MEMCPY Elapsed: 0.58023 MiB: 1024.00000 Copy: 1764.811 MiB/s 0 Method: DUMB Elapsed: 0.36174 MiB: 1024.00000 Copy: 2830.778 MiB/s 1 Method: DUMB Elapsed: 0.35869 MiB: 1024.00000 Copy: 2854.817 MiB/s 2 Method: DUMB Elapsed: 0.35848 MiB: 1024.00000 Copy: 2856.481 MiB/s AVG Method: DUMB Elapsed: 0.35964 MiB: 1024.00000 Copy: 2847.310 MiB/s 0 Method: MCBLOCK Elapsed: 0.23546 MiB: 1024.00000 Copy: 4348.860 MiB/s 1 Method: MCBLOCK Elapsed: 0.23544 MiB: 1024.00000 Copy: 4349.230 MiB/s 2 Method: MCBLOCK Elapsed: 0.23544 MiB: 1024.00000 Copy: 4349.359 MiB/s AVG Method: MCBLOCK Elapsed: 0.23545 MiB: 1024.00000 Copy: 4349.149 MiB/s On one of my other servers (based on Intel Xeon E3-1270): foo2:~# mbw -n 3 1024 Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory. Using 262144 bytes as blocks for memcpy block copy test. Getting down to business... Doing 3 runs per test. 0 Method: MEMCPY Elapsed: 0.18960 MiB: 1024.00000 Copy: 5400.901 MiB/s 1 Method: MEMCPY Elapsed: 0.18922 MiB: 1024.00000 Copy: 5411.690 MiB/s 2 Method: MEMCPY Elapsed: 0.18944 MiB: 1024.00000 Copy: 5405.491 MiB/s AVG Method: MEMCPY Elapsed: 0.18942 MiB: 1024.00000 Copy: 5406.024 MiB/s 0 Method: DUMB Elapsed: 0.14838 MiB: 1024.00000 Copy: 6901.200 MiB/s 1 Method: DUMB Elapsed: 0.14818 MiB: 1024.00000 Copy: 6910.561 MiB/s 2 Method: DUMB Elapsed: 0.14820 MiB: 1024.00000 Copy: 6909.628 MiB/s AVG Method: DUMB Elapsed: 0.14825 MiB: 1024.00000 Copy: 6907.127 MiB/s 0 Method: MCBLOCK Elapsed: 0.04362 MiB: 1024.00000 Copy: 23477.623 MiB/s 1 Method: MCBLOCK Elapsed: 0.04262 MiB: 1024.00000 Copy: 24025.151 MiB/s 2 Method: MCBLOCK Elapsed: 0.04258 MiB: 1024.00000 Copy: 24048.849 MiB/s AVG Method: MCBLOCK Elapsed: 0.04294 MiB: 1024.00000 Copy: 23847.599 MiB/s For reference here's what I get on my Intel based laptop: laptop:~$ mbw -n 3 1024 Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory. Using 262144 bytes as blocks for memcpy block copy test. Getting down to business... Doing 3 runs per test. 0 Method: MEMCPY Elapsed: 0.40566 MiB: 1024.00000 Copy: 2524.269 MiB/s 1 Method: MEMCPY Elapsed: 0.38458 MiB: 1024.00000 Copy: 2662.638 MiB/s 2 Method: MEMCPY Elapsed: 0.38876 MiB: 1024.00000 Copy: 2634.043 MiB/s AVG Method: MEMCPY Elapsed: 0.39300 MiB: 1024.00000 Copy: 2605.600 MiB/s 0 Method: DUMB Elapsed: 0.30707 MiB: 1024.00000 Copy: 3334.745 MiB/s 1 Method: DUMB Elapsed: 0.30425 MiB: 1024.00000 Copy: 3365.653 MiB/s 2 Method: DUMB Elapsed: 0.30342 MiB: 1024.00000 Copy: 3374.849 MiB/s AVG Method: DUMB Elapsed: 0.30491 MiB: 1024.00000 Copy: 3358.328 MiB/s 0 Method: MCBLOCK Elapsed: 0.07875 MiB: 1024.00000 Copy: 13003.670 MiB/s 1 Method: MCBLOCK Elapsed: 0.08374 MiB: 1024.00000 Copy: 12228.034 MiB/s 2 Method: MCBLOCK Elapsed: 0.07635 MiB: 1024.00000 Copy: 13411.216 MiB/s AVG Method: MCBLOCK Elapsed: 0.07961 MiB: 1024.00000 Copy: 12862.006 MiB/s So according to mbw my laptop is 3 times faster than the server!!! Please help me explain this. I've also tried to mount a ram disk and use dd to benchmark it and I get similar differences so I don't think mbw is to blame. I've checked the BIOS settings and the memory seem to be running at full speed. According to the hosting company the modules are all OK. Could this have something to do with NUMA? It seems like Node Interleaving is disabled on this server. Will enabling it (thus turning off NUMA) make a difference? foo1:~# numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 node 0 size: 8190 MB node 0 free: 7898 MB node 1 cpus: 6 7 8 9 10 11 node 1 size: 12288 MB node 1 free: 12073 MB node 2 cpus: 18 19 20 21 22 23 node 2 size: 12288 MB node 2 free: 12034 MB node 3 cpus: 12 13 14 15 16 17 node 3 size: 8192 MB node 3 free: 8032 MB node distances: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 20 2: 20 20 10 20 3: 20 20 20 10

    Read the article

  • Mapping of memory addresses to physical modules in Windows XP

    - by Josef Grahn
    I plan to run 32-bit Windows XP on a workstation with dual processors, based on Intel's Nehalem microarchitecture, and triple channel RAM. Even though XP is limited to 4 GB of RAM, my understanding is that it will function with more than 4 GB installed, but will only expose 4 GB (or slightly less). My question is: Assuming that 6 GB of RAM is installed in six 1 GB modules, which physical 4 GB will Windows actually map into its address space? In particular: Will it use all six 1 GB modules, taking advantage of all memory channels? (My guess is yes, and that the mapping to individual modules within a group happens in hardware.) Will it map 2 GB of address space to each of the two NUMA nodes (as each processor has it's own memory interface), or will one processor get fast access to 3 GB of RAM, while the other only has 1 GB? Thanks!

    Read the article

  • Mapping of memory addresses to physical modules in Windows XP

    - by Josef Grahn
    I plan to run 32-bit Windows XP on a workstation with dual processors, based on Intel's Nehalem microarchitecture, and triple channel RAM. Even though XP is limited to 4 GB of RAM, my understanding is that it will function with more than 4 GB installed, but will only expose 4 GB (or slightly less). My question is: Assuming that 6 GB of RAM is installed in six 1 GB modules, which physical 4 GB will Windows actually map into its address space? In particular: Will it use all six 1 GB modules, taking advantage of all memory channels? (My guess is yes, and that the mapping to individual modules within a group happens in hardware.) Will it map 2 GB of address space to each of the two NUMA nodes (as each processor has it's own memory interface), or will one processor get fast access to 3 GB of RAM, while the other only has 1 GB? Thanks!

    Read the article

  • Why are my Opteron cores running at only 75% capacity each? (25% CPU idle)

    - by Tim Cooper
    We've just taken delivery of a powerful 32-core AMD Opteron server with 128Gb. We have 2 x 6272 CPU's with 16 cores each. We are running a big long-running java task on 30 threads. We have the NUMA optimisations for Linux and java turned on. Our Java threads are mainly using objects that are private to that thread, sometimes reading memory that other threads will be reading, and very very occasionally writing or locking shared objects. We can't explain why the CPU cores are 25% idle. Below is a dump of "top": top - 23:06:38 up 1 day, 23 min, 3 users, load average: 10.84, 10.27, 9.62 Tasks: 676 total, 1 running, 675 sleeping, 0 stopped, 0 zombie Cpu(s): 64.5%us, 1.3%sy, 0.0%ni, 32.9%id, 1.3%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 132138168k total, 131652664k used, 485504k free, 92340k buffers Swap: 5701624k total, 230252k used, 5471372k free, 13444344k cached ... top - 22:37:39 up 23:54, 3 users, load average: 7.83, 8.70, 9.27 Tasks: 678 total, 1 running, 677 sleeping, 0 stopped, 0 zombie Cpu0 : 75.8%us, 2.0%sy, 0.0%ni, 22.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 77.2%us, 1.3%sy, 0.0%ni, 21.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 77.3%us, 1.0%sy, 0.0%ni, 21.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 77.8%us, 1.0%sy, 0.0%ni, 21.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 76.9%us, 2.0%sy, 0.0%ni, 21.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 76.3%us, 2.0%sy, 0.0%ni, 21.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 12.6%us, 3.0%sy, 0.0%ni, 84.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 8.6%us, 2.0%sy, 0.0%ni, 89.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 77.0%us, 2.0%sy, 0.0%ni, 21.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu9 : 77.0%us, 2.0%sy, 0.0%ni, 21.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu10 : 77.6%us, 1.7%sy, 0.0%ni, 20.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu11 : 75.7%us, 2.0%sy, 0.0%ni, 21.4%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu12 : 76.6%us, 2.3%sy, 0.0%ni, 21.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu13 : 76.6%us, 2.3%sy, 0.0%ni, 21.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu14 : 76.2%us, 2.6%sy, 0.0%ni, 15.9%id, 5.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu15 : 76.6%us, 2.0%sy, 0.0%ni, 21.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu16 : 73.6%us, 2.6%sy, 0.0%ni, 23.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu17 : 74.5%us, 2.3%sy, 0.0%ni, 23.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 73.9%us, 2.3%sy, 0.0%ni, 23.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 72.9%us, 2.6%sy, 0.0%ni, 24.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 72.8%us, 2.6%sy, 0.0%ni, 24.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 72.7%us, 2.3%sy, 0.0%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 72.5%us, 2.6%sy, 0.0%ni, 24.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 73.0%us, 2.3%sy, 0.0%ni, 24.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu24 : 74.7%us, 2.7%sy, 0.0%ni, 22.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu25 : 74.5%us, 2.6%sy, 0.0%ni, 22.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu26 : 73.7%us, 2.0%sy, 0.0%ni, 24.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu27 : 74.1%us, 2.3%sy, 0.0%ni, 23.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu28 : 74.1%us, 2.3%sy, 0.0%ni, 23.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu29 : 74.0%us, 2.0%sy, 0.0%ni, 24.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu30 : 73.2%us, 2.3%sy, 0.0%ni, 24.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu31 : 73.1%us, 2.0%sy, 0.0%ni, 24.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 132138168k total, 131711704k used, 426464k free, 88336k buffers Swap: 5701624k total, 229572k used, 5472052k free, 13745596k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 13865 root 20 0 122g 112g 3.1g S 2334.3 89.6 20726:49 java 27139 jayen 20 0 15428 1728 952 S 2.6 0.0 0:04.21 top 27161 sysadmin 20 0 15428 1712 940 R 1.0 0.0 0:00.28 top 33 root 20 0 0 0 0 S 0.3 0.0 0:06.24 ksoftirqd/7 131 root 20 0 0 0 0 S 0.3 0.0 0:09.52 events/0 1858 root 20 0 0 0 0 S 0.3 0.0 1:35.14 kondemand/0 A dump of the java stack confirms that none of the threads are anywhere near the few places where locks are used, nor are they anywhere near any disk or network i/o. I had trouble finding a clear explanation of what 'top' means by "idle" versus "wait", but I get the impression that "idle" means "no more threads that need to be run" but this doesn't make sense in our case. We're using a "Executors.newFixedThreadPool(30)". There are a large number of tasks pending and each task lasts for 10 seconds or so. I suspect that the explanation requires a good understanding of NUMA. Is the "idle" state what you see when a CPU is waiting for a non-local access? If not, then what is the explanation?

    Read the article

  • mcelog doesn't fails to start PUIAS 6.4 amd hardware

    - by Predrag Punosevac
    Folks, I am a total Linux n00b. I am trying to deploy mcelog on one of my computing nodes running PUIAS 6.4 (i86_64) [root@lov3 edac]# uname -a Linux lov3.mylab.org 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 22:40:32 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux a free clone of Red Hat 6.4 on AMD hardware [root@lov3 mcelog]# lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 64 On-line CPU(s) list: 0-63 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 4 NUMA node(s): 8 Vendor ID: AuthenticAMD CPU family: 21 Model: 2 Stepping: 0 CPU MHz: 1400.000 BogoMIPS: 4999.30 Virtualization: AMD-V L1d cache: 16K L1i cache: 64K L2 cache: 2048K L3 cache: 6144K NUMA node0 CPU(s): 0-7 NUMA node1 CPU(s): 8-15 NUMA node2 CPU(s): 16-23 NUMA node3 CPU(s): 24-31 NUMA node4 CPU(s): 32-39 NUMA node5 CPU(s): 40-47 NUMA node6 CPU(s): 48-55 NUMA node7 CPU(s): 56-63 My mcelog.conf file is more or less default apart of the fact that I would like to run mcelog as a daemon and to log errors. When I start mcelog [root@lov3 mcelog]# mcelog --config-file mcelog.conf AMD Processor family 21: Please load edac_mce_amd module. However the module is present [root@lov3 mcelog]# locate edac_mce_amd.ko /lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/edac/edac_mce_amd.ko /lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/edac/edac_mce_amd.ko and loaded [root@lov3 edac]# lsmod | grep mce edac_mce_amd 14705 1 amd64_edac_mod Is there anything that I can do to get mcelog working? The only reference I found is this thread http://lists.centos.org/pipermail/centos/2012-November/130226.html

    Read the article

  • How to force two process to run on the same CPU?

    - by kovan
    Context: I'm programming a software system that consists of multiple processes. It is programmed in C++ under Linux. and they communicate among them using Linux shared memory. Usually, in software development, is in the final stage when the performance optimization is made. Here I came to a big problem. The software has high performance requirements, but in machines with 4 or 8 CPU cores (usually with more than one CPU), it was only able to use 3 cores, thus wasting 25% of the CPU power in the first ones, and more than 60% in the second ones. After many research, and having discarded mutex and lock contention, I found out that the time was being wasted on shmdt/shmat calls (detach and attach to shared memory segments). After some more research, I found out that these CPUs, which usually are AMD Opteron and Intel Xeon, use a memory system called NUMA, which basically means that each processor has its fast, "local memory", and accessing memory from other CPUs is expensive. After doing some tests, the problem seems to be that the software is designed so that, basically, any process can pass shared memory segments to any other process, and to any thread in them. This seems to kill performance, as process are constantly accessing memory from other processes. Question: Now, the question is, is there any way to force pairs of processes to execute in the same CPU?. I don't mean to force them to execute always in the same processor, as I don't care in which one they are executed, altough that would do the job. Ideally, there would be a way to tell the kernel: If you schedule this process in one processor, you must also schedule this "brother" process (which is the process with which it communicates through shared memory) in that same processor, so that performance is not penalized.

    Read the article

  • [Python] Tips for making a fraction calculator code more optimized (faster and using less memory)

    - by Logic Named Joe
    Hello Everyone, Basicly, what I need for the program to do is to act a as simple fraction calculator (for addition, subtraction, multiplication and division) for the a single line of input, for example: -input: 1/7 + 3/5 -output: 26/35 My initial code: import sys def euclid(numA, numB): while numB != 0: numRem = numA % numB numA = numB numB = numRem return numA for wejscie in sys.stdin: wyjscie = wejscie.split(' ') a, b = [int(x) for x in wyjscie[0].split("/")] c, d = [int(x) for x in wyjscie[2].split("/")] if wyjscie[1] == '+': licz = a * d + b * c mian= b * d nwd = euclid(licz, mian) konA = licz/nwd konB = mian/nwd wynik = str(konA) + '/' + str(konB) print(wynik) elif wyjscie[1] == '-': licz= a * d - b * c mian= b * d nwd = euclid(licz, mian) konA = licz/nwd konB = mian/nwd wynik = str(konA) + '/' + str(konB) print(wynik) elif wyjscie[1] == '*': licz= a * c mian= b * d nwd = euclid(licz, mian) konA = licz/nwd konB = mian/nwd wynik = str(konA) + '/' + str(konB) print(wynik) else: licz= a * d mian= b * c nwd = euclid(licz, mian) konA = licz/nwd konB = mian/nwd wynik = str(konA) + '/' + str(konB) print(wynik) Which I reduced to: import sys def euclid(numA, numB): while numB != 0: numRem = numA % numB numA = numB numB = numRem return numA for wejscie in sys.stdin: wyjscie = wejscie.split(' ') a, b = [int(x) for x in wyjscie[0].split("/")] c, d = [int(x) for x in wyjscie[2].split("/")] if wyjscie[1] == '+': print("/".join([str((a * d + b * c)/euclid(a * d + b * c, b * d)),str((b * d)/euclid(a * d + b * c, b * d))])) elif wyjscie[1] == '-': print("/".join([str((a * d - b * c)/euclid(a * d - b * c, b * d)),str((b * d)/euclid(a * d - b * c, b * d))])) elif wyjscie[1] == '*': print("/".join([str((a * c)/euclid(a * c, b * d)),str((b * d)/euclid(a * c, b * d))])) else: print("/".join([str((a * d)/euclid(a * d, b * c)),str((b * c)/euclid(a * d, b * c))])) Any advice on how to improve this futher is welcome. Edit: one more thing that I forgot to mention - the code can not make use of any libraries apart from sys.

    Read the article

  • SQL Server 2008 uses half the CPU’s

    - by ACALVETT
    I recently got my hands on a couple of 4 socket servers with Intel E7-4870's (10 cores per cpu) and with hyper threading enabled that gave me 80 logical CPU's. The server has Windows 2008 R2 SP1 along with SQL 2008 (Currently we can not deploy SQL 2008 R2 for the application being hosted). When SQL Server started I noticed only 2 NUMA nodes were configured and 40 logical cores where there should have been 4 NUMA nodes and 80 logical cores (see below). The problem is caused by that fact that...(read more)

    Read the article

  • AMD-V is not enable in virtualbox in amd APU

    - by shantanu
    I am running Dual core AMD E450 APU. When i tried to run a 64-bit OS that requires hardware virtualization using virtual-box it showed me an error "AMD-V is not enable". My AMD processor should provide AMD-V support. And i can find no option for AMD-V in BIOS. How can i solve this problem? How could i enable AMD-V for my APU? Thanks in advance lscpu :- Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 2 On-line CPU(s) list: 0,1 Thread(s) per core: 1 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 20 Model: 2 Stepping: 0 CPU MHz: 1650.000 BogoMIPS: 3291.72 Virtualization: AMD-V L1d cache: 32K L1i cache: 32K L2 cache: 512K NUMA node0 CPU(s): 0,1 EDITED:- Error of virtualBOX:- Failed to open a session for the virtual machine XXX. AMD-V is disabled in the BIOS. (VERR_SVM_DISABLED). Result Code: NS_ERROR_FAILURE (0x80004005) Component: Console Interface: IConsole {1968b7d3-e3bf-4ceb-99e0-cb7c913317bb}

    Read the article

  • Distrilogie muda de nome para Altimate

    - by Paulo Folgado
     O Grupo Distrilogie entra numa nova dimensão O Distribuidor de valor acrescentado em TI aposta numa mudança radical: muda de nome e de imagem, para passar a ser Altimate - Smart IT Distributor   Lisboa, 5 de Maio de 2010 - Para o grupo de reconhecido sucesso, o principal ponto forte está na mudança: a partir de hoje, a Distrilogie Portugal, Espanha, Bélgica, Luxemburgo, Holanda e França, bem como todas as suas aquisições, deixam o seu nome e formam o novo grupo Altimate. Na Península Ibérica, esta mudança afecta o grupo Distrilogie Iberia, formado pela Distrilogie Portugal, Distrilogie Espanha e Mambo Technology, o distribuidor especializado em segurança do grupo.   Altimate: uma marca com grandes ambições europeias Esta mudança assenta na vontade de reforçar um grupo de longo e frutífero trajecto, que conta com os melhores talentos e uma diversificada gama de soluções altamente complementares. "Continuar a crescer ao nosso ritmo (+27% este ano), em tempos como os de agora, passa por desenvolver todas as sinergias possíveis dentro do nosso grupo, e não só a nível nacional e regional, mas também pan-europeu. O nosso grupo goza, a nível internacional, de uma grande diversidade de soluções, que se complementam entre si. É uma riqueza que queremos aproveitar e desenvolver a nível de cada país, consolidando o nosso portfólio pan-europeu. Trata-se de um ponto fundamental para o crescimento futuro, agora que o mercado dos principais fabricantes tende à concentração", explica Alexis Brabant, Director-Geral da Altimate Iberia e membro do Comité Executivo Europeu do Grupo Altimate.   Por outro lado, a criação da Altimate assenta numa ambiciosa estratégia de expansão e consolidação por todo o continente. Entre outros objectivos fundamentais, a Altimate pretende estabelecer-se em 4 novos países da União Europeia nos próximos 2 anos. Assim o ilustra Patrice Arzillier, fundador da Distrilogie e PDG do grupo Altimate: "Graças ao apoio incondicional do nosso accionista DCC, o nosso grupo conheceu um desenvolvimento notável. Hoje, a criação da Altimate marca uma nova etapa de crescimento combinando solidez económica, ambição de expansão europeia e manutenção dos nossos valores fundadores."  Altimate: alta proximidade Tal como a Distrilogie, o novo grupo Altimate tem como missão o sucesso dos seus parceiros e fabricantes. Para a cumprir, continuará a potenciar a proximidade das suas equipas - altamente qualificadas e voltadas para a identificação das soluções mais inteligentes, inovadoras e adequadas.  Para mais informações acerca da Altimate, visite o novo site . http://www.altimate-group.com  

    Read the article

  • Enable VT-x in HP 8300 elite

    - by lang2
    I have a HP 8300 elite (Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz). I'm trying to run a virtual machine via VirtualBox. But every time I start the VM, it says: VT-x is disabled in the BIOS. (VERR_VMX_MSR_VMXON_DISABLED). My lscpu output is like this: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 2 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 58 Stepping: 9 CPU MHz: 1600.000 BogoMIPS: 6784.74 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 8192K NUMA node0 CPU(s): 0-7 I went into the BIOS but the things the can be tweaked is very limited and I couldn't find the VT-x setting. Anybody know how to do this in this setup?

    Read the article

  • New SQLOS features in SQL Server 2012

    - by SQLOS Team
    Here's a quick summary of SQLOS feature enhancements going into SQL Server 2012. Most of these are already in the CTP3 pre-release, except for the Resource Governor enhancements which will be in the release candidate. We've blogged about a couple of these items before. I plan to add detail. Let me know which ones you'd like to see more on: - Memory Manager Redesign: Predictable sizing and governing SQL memory consumption: sp_configure ‘max server memory’ now limits all memory committed by SQL ServerResource Governor governs all SQL memory consumption (other than special cases like buffer pool) Improved scalability of complex queries and operations that make >8K allocations Improved CPU and NUMA locality for memory accesses Single memory manager that handles page allocations of all sizes Consistent Out-of-memory handling & management across different internal components - Optimized Memory Broker for Column Store indexes (Project Apollo) - Resource Governor Support larger scale multi-tenancy by increasing Max. number of resource pools20 -> 64 [for 64-bit] Enable predictable chargeback and isolation by adding a hard cap on CPU usage Enable vertical isolation of machine resources Resource pools can be affinitized to individual or groups of schedulers or to NUMA nodes New DMV for resource pool affinity  - CLR 4 support, adds .NET Framework 4 advantages - sp_server_dianostics Captures diagnostic data and health information about SQL Server to detect potential failures Analyze internal system state Reliable when nothing else is working   - New SQLOS DMVs (in 2008 R2SP1) SQL Server related configuration - New DMVsys.dm_server_services OS related resource configurationNew DMVssys.dm_os_volume_statssys.dm_os_windows_infosys.dm_server_registry XEvents for SQL and OS related Perfmon counters Extend sys.dm_os_sys_info See previous blog posts here and here. - Scale / Mission critical Increased scalability: Support Windows 8 max memory and logical processorsDynamic Memory support in Standard Edition - Hot-Add Memory enabled when virtualized - Various Tier1 Performance Improvements, including reduced instructions for superlatches. Originally posted at http://blogs.msdn.com/b/sqlosteam/

    Read the article

  • Storage Configuration

    - by jchang
    Storage performance is not inherently complicated subject. The concepts are relatively simple. In fact, scaling storage performance is far easier compared with the difficulties encounters in scaling processor performance in NUMA systems. Storage performance is achieved by properly distributing IO over: 1) multiple independent PCI-E ports (system memory and IO bandwith is key) 2) multiple RAID controllers or host bus adapters (HBAs) 3) multiple storage IO channels (SAS or FC, complete path) most importantly,...(read more)

    Read the article

  • Storage Configuration

    - by jchang
    Storage performance is not inherently complicated subject. The concepts are relatively simple. In fact, scaling storage performance is far easier compared with the difficulties encounters in scaling processor performance in NUMA systems. Storage performance is achieved by properly distributing IO over: 1) multiple independent PCI-E ports (system memory and IO bandwith is key) 2) multiple RAID controllers or host bus adapters (HBAs) 3) multiple storage IO channels (SAS or FC, complete path) most importantly,...(read more)

    Read the article

  • DBCC MEMUSAGE in 2005/8 ?

    - by steveh99999
    I used to like using undocumented command DBCC MEMUSAGE in SQL 2000 to see which tables were using space in SQL data cache. In SQL 2005, this command is not longer present. Instead a DMV – sys.dm_os_buffer_descriptors – can be used to display data cache contents,  but this doesn’t quite give you the same output as DBCC MEMUSAGE. I’m also aware that you can use Quest’s spotlight tool to view a summary of data cache contents. Using  this post by Umachandar Jayachandran  of Microsoft, I was able to create the following equivalent for SQL 2005/8. I’ve wrapped Umachandar’s original query in a CTE to produce summary information :- ;WITH memusage_CTE AS (SELECT bd.database_id, bd.file_id, bd.page_id, bd.page_type , COALESCE(p1.object_id, p2.object_id) AS object_id , COALESCE(p1.index_id, p2.index_id) AS index_id , bd.row_count, bd.free_space_in_bytes, CONVERT(TINYINT,bd.is_modified) AS 'DirtyPage' FROM sys.dm_os_buffer_descriptors AS bd JOIN sys.allocation_units AS au ON au.allocation_unit_id = bd.allocation_unit_id OUTER APPLY ( SELECT TOP(1) p.object_id, p.index_id FROM sys.partitions AS p WHERE p.hobt_id = au.container_id AND au.type IN (1, 3) ) AS p1 OUTER APPLY ( SELECT TOP(1) p.object_id, p.index_id FROM sys.partitions AS p WHERE p.partition_id = au.container_id AND au.type = 2 ) AS p2 WHERE  bd.database_id = DB_ID() AND bd.page_type IN ('DATA_PAGE', 'INDEX_PAGE') ) SELECT TOP 20 DB_NAME(database_id) AS 'Database',OBJECT_NAME(object_id,database_id) AS 'Table Name', index_id,COUNT(*) AS 'Pages in Cache', SUM(dirtyPage) AS 'Dirty Pages' FROM memusage_CTE GROUP BY database_id, object_id, index_id ORDER BY COUNT(*) DESC I’m not 100% happy with the results of the above query however… I’ve noticed that on a busy BizTalk messageBox database  it will return information on pages that contain GHOST rows – . ie where data has already been deleted but has yet to be cleaned-up by a background process – I’m need to investigate further why cache on this server apparently contains so much GHOST data… For more information on the background ghost cleanup process, see this article by Paul Randall. However, I think the results of this query should still be of interest to a DBA. I have another post to come shortly regarding an example I encountered where this information proved useful to me… I notice in SQL 2008, sys.dm_os_buffer_descriptors gained an extra column – numa_mode – I’m interested to see how this is populated and how useful this column can be on a NUMA-enabled system. I’m assuming in theory you could use this column to help analyse how your tables are spread across Numa-enabled data-cache ?

    Read the article

  • Decoding an affinity mask

    - by GavinPayneUK
    Recently, in preparation for my SQLBits NUMA internals session I began looking at some of the SQLOS DMVs and trying to understand how their contents directly related to the physical server architecture that SQL Server was running on. While their contents used regular terms such as node and affinity mask the results were often in an “internals” format that can be distracting to the human reader.  An example of this is the DMV sys.dm_os_nodes (link to Technet here ), or more specifically the column...(read more)

    Read the article

  • Coeo sessions at SQLSaturday Cambridge

    - by GavinPayneUK
    This weekend saw the UK’s first SQLSaturday organised by Mark Broadbent, and held in Cambridge, that was without doubt a huge success. Coeo were lucky to have four of us present a staggering five sessions on the day; so thank you to the SQLSaturday team for selecting our sessions, and to those who chose to attend them. I’ve put a link to the presentation slides for all of our sessions below: I want to be a better architect - Gavin Payne Slides here NUMA internals of SQL Server 2012 – Gavin Payne...(read more)

    Read the article

  • Webcenter 11g Patch Set 3 (PS3) - Formação para Parceiros - 19/Jan/11

    - by Claudia Costa
    O Oracle WebCenter Suite consiste num conjunto integrado de soluções  baseadas em standards de mercado, com o objectivo de criar aplicações de negócio, portais corporativos, comunidades de interesse, grupos colaborativos, redes sociais, consolidar sistemas aplicacionais web e potenciar as funcionalidades web 2.0 dentro e fora de uma organização. Toda a solução está em conformidade com os principais Standards de mercado e baseia-se numa arquitectura totalmente orientada ao serviço (SOA). Venha conhecer nesta sessão as novas funcionalidades do Webcenter Suite 11g PS3. Agenda 09.30h Boas Vindas e Introdução 09.40h Oracle & Enterprise 2.0 - João Borrego 10.00h Introduction to Oracle WebCenter Suite as a User Experience Platform (UXP) - Monte Kluemper 10.40h Coffee Break 11.00h Building Rich UIs with Oracle WebCenter Portal - Monte Kluemper 12.00h Interactive Q&A Session - João Borrego e Monte Kluemper 12.30h Almoço 14.00h Deep-dive into WebCenter Technical Architecture - Monte Kluemper 15.30h Final da Sessão Target: - Equipas de Venda e Comerciais (manhã)- Equipas Técnicas e de pré-venda (tarde) Data e Local:19 de JaneiroLagoas Park Hotel Para este Workshop os lugares estão limitados. Por favor aguarde um email de confirmação de sua inscrição.Inscrições : Email Para mais informações, por favor contacte: Claudia Costa / 21 423 50 27

    Read the article

  • O modelo diamante para gerenciamento de projetos

    - by fernando.galdino
    Este ano comecei a fazer o mestrado em Gestão de Projetos. No decorrer deste período estudamos vários assuntos envolvendo abordagens de gerenciamento de projetos. Uma dessas abordagens é o Modelo Diamante. Elaborada por Aaron Shenhar e Dov Dvir, e explicada em detalhes no livro “Reinventando Gerenciamento de Projetos”, trata-se de uma estrutura que permite avaliar um projeto, e com base nos resultados, permite que o gerente de projetos possa usar uma abordagem como o descrito no PMBOK (PMI), de modo a aproveitar da melhor forma possível, as boas práticas listadas. A apresentação abaixo foi realizada por mim, numa das aulas do curso. Explica com alguns detalhes, e ao mesmo tempo fornece uma visão geral, sobre o modelo NTCP, que é uma estrutura que permite avaliar um projeto em termos de novidade, incerteza tecnológica, complexidade e ritmo.   Modelo NTCP View more presentations from Fernando Galdino.

    Read the article

1 2 3  | Next Page >