Search Results

Search found 8 results on 1 pages for 'x4800'.

Page 1/1 | 1 

  • Sun Fire X4800 M2 Delivers World Record TPC-C for x86 Systems

    - by Brian
    Oracle's Sun Fire X4800 M2 server equipped with eight 2.4 GHz Intel Xeon Processor E7-8870 chips obtained a result of 5,055,888 tpmC on the TPC-C benchmark. This result is a world record for x86 servers. Oracle demonstrated this world record database performance running Oracle Database 11g Release 2 Enterprise Edition with Partitioning. The Sun Fire X4800 M2 server delivered a new x86 TPC-C world record of 5,055,888 tpmC with a price performance of $0.89/tpmC using Oracle Database 11g Release 2. This configuration is available 06/26/12. The Sun Fire X4800 M2 server delivers 3.0x times better performance than the next 8-processor result, an IBM System p 570 equipped with POWER6 processors. The Sun Fire X4800 M2 server has 3.1x times better price/performance than the 8-processor 4.7GHz POWER6 IBM System p 570. The Sun Fire X4800 M2 server has 1.6x times better performance than the 4-processor IBM x3850 X5 system equipped with Intel Xeon processors. This is the first TPC-C result on any system using eight Intel Xeon Processor E7-8800 Series chips. The Sun Fire X4800 M2 server is the first x86 system to get over 5 million tpmC. The Oracle solution utilized Oracle Linux operating system and Oracle Database 11g Enterprise Edition Release 2 with Partitioning to produce the x86 world record TPC-C benchmark performance. Performance Landscape Select TPC-C results (sorted by tpmC, bigger is better) System p/c/t tpmC Price/tpmC Avail Database MemorySize Sun Fire X4800 M2 8/80/160 5,055,888 0.89 USD 6/26/2012 Oracle 11g R2 4 TB IBM x3850 X5 4/40/80 3,014,684 0.59 USD 7/11/2011 DB2 ESE 9.7 3 TB IBM x3850 X5 4/32/64 2,308,099 0.60 USD 5/20/2011 DB2 ESE 9.7 1.5 TB IBM System p 570 8/16/32 1,616,162 3.54 USD 11/21/2007 DB2 9.0 2 TB p/c/t - processors, cores, threads Avail - availability date Oracle and IBM TPC-C Response times System tpmC Response Time (sec) New Order 90th% Response Time (sec) New Order Average Sun Fire X4800 M2 5,055,888 0.210 0.166 IBM x3850 X5 3,014,684 0.500 0.272 Ratios - Oracle Better 1.6x 1.4x 1.3x Oracle uses average new order response time for comparison between Oracle and IBM. Graphs of Oracle's and IBM's response times for New-Order can be found in the full disclosure reports on TPC's website TPC-C Official Result Page. Configuration Summary and Results Hardware Configuration: Server Sun Fire X4800 M2 server 8 x 2.4 GHz Intel Xeon Processor E7-8870 4 TB memory 8 x 300 GB 10K RPM SAS internal disks 8 x Dual port 8 Gbs FC HBA Data Storage 10 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with 1 x 3.06 GHz Intel Xeon X5675 processor 8 GB memory 10 x 2 TB 7.2K RPM 3.5" SAS disks 2 x Sun Storage F5100 Flash Array storage (1.92 TB each) 1 x Brocade 5300 switches Redo Storage 2 x Sun Fire X4270 M2 servers configured as COMSTAR heads, each with 1 x 3.06 GHz Intel Xeon X5675 processor 8 GB memory 11 x 2 TB 7.2K RPM 3.5" SAS disks Clients 8 x Sun Fire X4170 M2 servers, each with 2 x 3.06 GHz Intel Xeon X5675 processors 48 GB memory 2 x 300 GB 10K RPM SAS disks Software Configuration: Oracle Linux (Sun Fire 4800 M2) Oracle Solaris 11 Express (COMSTAR for Sun Fire X4270 M2) Oracle Solaris 10 9/10 (Sun Fire X4170 M2) Oracle Database 11g Release 2 Enterprise Edition with Partitioning Oracle iPlanet Web Server 7.0 U5 Tuxedo CFS-R Tier 1 Results: System: Sun Fire X4800 M2 tpmC: 5,055,888 Price/tpmC: 0.89 USD Available: 6/26/2012 Database: Oracle Database 11g Cluster: no New Order Average Response: 0.166 seconds Benchmark Description TPC-C is an OLTP system benchmark. It simulates a complete environment where a population of terminal operators executes transactions against a database. The benchmark is centered around the principal activities (transactions) of an order-entry environment. These transactions include entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. Key Points and Best Practices Oracle Database 11g Release 2 Enterprise Edition with Partitioning scales easily to this high level of performance. COMSTAR (Common Multiprotocol SCSI Target) is the software framework that enables an Oracle Solaris host to serve as a SCSI Target platform. COMSTAR uses a modular approach to break the huge task of handling all the different pieces in a SCSI target subsystem into independent functional modules which are glued together by the SCSI Target Mode Framework (STMF). The modules implementing functionality at SCSI level (disk, tape, medium changer etc.) are not required to know about the underlying transport. And the modules implementing the transport protocol (FC, iSCSI, etc.) are not aware of the SCSI-level functionality of the packets they are transporting. The framework hides the details of allocation providing execution context and cleanup of SCSI commands and associated resources and simplifies the task of writing the SCSI or transport modules. Oracle iPlanet Web Server middleware is used for the client tier of the benchmark. Each web server instance supports more than a quarter-million users while satisfying the response time requirement from the TPC-C benchmark. See Also Oracle Press Release -- Sun Fire X4800 M2 TPC-C Executive Summary tpc.org Complete Sun Fire X4800 M2 TPC-C Full Disclosure Report tpc.org Transaction Processing Performance Council (TPC) Home Page Ideas International Benchmark Page Sun Fire X4800 M2 Server oracle.com OTN Oracle Linux oracle.com OTN Oracle Solaris oracle.com OTN Oracle Database 11g Release 2 Enterprise Edition oracle.com OTN Sun Storage F5100 Flash Array oracle.com OTN Disclosure Statement TPC Benchmark C, tpmC, and TPC-C are trademarks of the Transaction Processing Performance Council (TPC). Sun Fire X4800 M2 (8/80/160) with Oracle Database 11g Release 2 Enterprise Edition with Partitioning, 5,055,888 tpmC, $0.89 USD/tpmC, available 6/26/2012. IBM x3850 X5 (4/40/80) with DB2 ESE 9.7, 3,014,684 tpmC, $0.59 USD/tpmC, available 7/11/2011. IBM x3850 X5 (4/32/64) with DB2 ESE 9.7, 2,308,099 tpmC, $0.60 USD/tpmC, available 5/20/2011. IBM System p 570 (8/16/32) with DB2 9.0, 1,616,162 tpmC, $3.54 USD/tpmC, available 11/21/2007. Source: http://www.tpc.org/tpcc, results as of 7/15/2011.

    Read the article

  • Sun Fire X4800 M2 Posts World Record x86 SPECjEnterprise2010 Result

    - by Brian
    Oracle's Sun Fire X4800 M2 using the Intel Xeon E7-8870 processor and Sun Fire X4470 M2 using the Intel Xeon E7-4870 processor, produced a world record single application server SPECjEnterprise2010 benchmark result of 27,150.05 SPECjEnterprise2010 EjOPS. The Sun Fire X4800 M2 server ran the application tier and the Sun Fire X4470 M2 server was used for the database tier. The Sun Fire X4800 M2 server demonstrated 63% better performance compared to IBM P780 server result of 16,646.34 SPECjEnterprise2010 EjOPS. The Sun Fire X4800 M2 server demonstrated 4% better performance than the Cisco UCS B440 M2 result, both results used the same number of processors. This result used Oracle WebLogic Server 12c, Java HotSpot(TM) 64-Bit Server 1.7.0_02, and Oracle Database 11g. This result was produced using Oracle Linux. Performance Landscape Complete benchmark results are at the SPEC website, SPECjEnterprise2010 Results. The table below compares against the best results from IBM and Cisco. SPECjEnterprise2010 Performance Chart as of 3/12/2012 Submitter EjOPS* Application Server Database Server Oracle 27,150.05 1x Sun Fire X4800 M2 8x 2.4 GHz Intel Xeon E7-8870 Oracle WebLogic 12c 1x Sun Fire X4470 M2 4x 2.4 GHz Intel Xeon E7-4870 Oracle Database 11g (11.2.0.2) Cisco 26,118.67 2x UCS B440 M2 Blade Server 4x 2.4 GHz Intel Xeon E7-4870 Oracle WebLogic 11g (10.3.5) 1x UCS C460 M2 Blade Server 4x 2.4 GHz Intel Xeon E7-4870 Oracle Database 11g (11.2.0.2) IBM 16,646.34 1x IBM Power 780 8x 3.86 GHz POWER 7 WebSphere Application Server V7 1x IBM Power 750 Express 4x 3.55 GHz POWER 7 IBM DB2 9.7 Workgroup Server Edition FP3a * SPECjEnterprise2010 EjOPS, bigger is better. Configuration Summary Application Server: 1 x Sun Fire X4800 M2 8 x 2.4 GHz Intel Xeon processor E7-8870 256 GB memory 4 x 10 GbE NIC 2 x FC HBA Oracle Linux 5 Update 6 Oracle WebLogic Server 11g Release 1 (10.3.5) Java HotSpot(TM) 64-Bit Server VM on Linux, version 1.7.0_02 (Java SE 7 Update 2) Database Server: 1 x Sun Fire X4470 M2 4 x 2.4 GHz Intel Xeon E7-4870 512 GB memory 4 x 10 GbE NIC 2 x FC HBA 2 x Sun StorageTek 2540 M2 4 x Sun Fire X4270 M2 4 x Sun Storage F5100 Flash Array Oracle Linux 5 Update 6 Oracle Database 11g Enterprise Edition Release 11.2.0.2 Benchmark Description SPECjEnterprise2010 is the third generation of the SPEC organization's J2EE end-to-end industry standard benchmark application. The SPECjEnterprise2010 benchmark has been designed and developed to cover the Java EE 5 specification's significantly expanded and simplified programming model, highlighting the major features used by developers in the industry today. This provides a real world workload driving the Application Server's implementation of the Java EE specification to its maximum potential and allowing maximum stressing of the underlying hardware and software systems. The workload consists of an end to end web based order processing domain, an RMI and Web Services driven manufacturing domain and a supply chain model utilizing document based Web Services. The application is a collection of Java classes, Java Servlets, Java Server Pages, Enterprise Java Beans, Java Persistence Entities (pojo's) and Message Driven Beans. The SPECjEnterprise2010 benchmark heavily exercises all parts of the underlying infrastructure that make up the application environment, including hardware, JVM software, database software, JDBC drivers, and the system network. The primary metric of the SPECjEnterprise2010 benchmark is jEnterprise Operations Per Second ("SPECjEnterprise2010 EjOPS"). This metric is calculated by adding the metrics of the Dealership Management Application in the Dealer Domain and the Manufacturing Application in the Manufacturing Domain. There is no price/performance metric in this benchmark. Key Points and Best Practices Sixteen Oracle WebLogic server instances were started using numactl, binding 2 instances per chip. Eight Oracle database listener processes were started, binding 2 instances per chip using taskset. Additional tuning information is in the report at http://spec.org. See Also Oracle Press Release -- SPECjEnterprise2010 Results Page Sun Fire X4800 M2 Server oracle.com OTN Sun Fire X4270 M2 Server oracle.com OTN Sun Storage 2540-M2 Array oracle.com OTN Oracle Linux oracle.com OTN Oracle Database 11g Release 2 Enterprise Edition oracle.com OTN WebLogic Suite oracle.com OTN Disclosure Statement SPEC and the benchmark name SPECjEnterprise are registered trademarks of the Standard Performance Evaluation Corporation. Sun Fire X4800 M2, 27,150.05 SPECjEnterprise2010 EjOPS; IBM Power 780, 16,646.34 SPECjEnterprise2010 EjOPS; Cisco UCS B440 M2, 26,118.67 SPECjEnterprise2010 EjOPS. Results from www.spec.org as of 3/27/2012.

    Read the article

  • Migrating from IBM AIX/DB2 Power systems to Oracle Technologies

    - by zeynep.koch(at)oracle.com
    If you are planning to migrate from  IBM DB2 on AIX Power Systems to more open and better-performing computing environment--one that offers enhanced flexibility, clustering, availability, and security, as well as lower maintenance than download this guide that outlines migrating to Oracle Database 11g and Oracle Linux running on Oracle's Sun Fire X4800 server.This guide shows you how to:Move sample applications with an IBM DB2 on an IBM Power System to Oracle Database 11g Release 2Install Oracle Linux and Oracle Database Release 2 on the Oracle's Sun Fire X4800 serverMigrate user databases from the IBM Power System to Oracle's Sun Fire X4800 serverDownload

    Read the article

  • Viewing at Impossible Angles

    - by kemer
    The picture of the little screwdriver with the Allen wrench head to the right is bound to invoke a little nostalgia for those readers who were Sun customers in the late 80s. This tool was a very popular give-away: it was essential for installing and removing Multibus (you youngsters will have to look that up on Wikipedia…) cards in our systems. Back then our mid-sized systems were gargantuan: it was routine for us to schlep around a 200 lb. desk side box and 90 lb. monitor to demo a piece of software your smart phone will run better today. We were very close to the hardware, and the first thing a new field sales systems engineer had to learn was how put together a system. If you were lucky, a grizzled service engineer might run you through the process once, then threaten your health and existence should you ever screw it up so that he had to fix it. Nowadays we make it much easier to learn the ins and outs of our hardware with simulations–3D animations–that take you through the process of putting together or replacing pieces of a system. Most recently, we have posted three sophisticated PDFs that take advantage of Acrobat 9 features to provide a really intelligent approach to documenting hardware installation and repair: Sun Fire X4800/X4800 M2 Animations for Chassis Components Sun Fire X4800/X4800 M2 Animations for Sub Assembly Module (SAM) Sun Fire X4800/X4800 M2 Animations for CMOD Download one of these documents and take a close look at it. You can view the hardware from any angle, including impossible ones. Each document has a number of procedures, that break down into steps. Click on a procedure, then a step and you will see it animated in the drawing. Of course hardware design has generally eliminated the need for things like our old giveaway tools: components snap and lock in. Often you can replace redundant units while the system is hot, but for heaven’s sake, you’ll want to verify that you can do that before you try it! Meanwhile, we can all look forward to a growing portfolio of these intelligent documents. We would love to hear what you think about them. –Kemer

    Read the article

  • Thread placement policies on NUMA systems - update

    - by Dave
    In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded node NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect. On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creating between the processes, so I was relatively sure the effect didn't related to thread creation time, but rather that placement was a function of process membership. I this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different that what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more. To recap, the 'new' policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code. Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could the topic of a future blog entry. The scheduler is work-conserving. The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, following by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly. Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management polices constitute yet another confounding factor with respect to scheduling and dispatching. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand if your threads in your process communicate rarely, then it's possible the new packing policy might result on contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to optimize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could augment the CPU with performance counters for misses where a thread evicts a line it installed vs misses where a thread displaces a line installed by some other thread.)

    Read the article

  • NUMA-aware placement of communication variables

    - by Dave
    For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonable well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node. But when thinking about accesses to heavily-written communication variables we normally consider what caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses and cache probe activity and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables has very little impact on performance. The variables are frequently accessed so likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line. Or at least that's what I thought. Recently, though, I was exploring a simple shared memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that was not the case and that NUMA placement of the communication variables was actually quite important. For background an X4800 system consists of 8 Intel X7560 Xeons . Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted ladder "mobius" topology. The cache coherence fabric is glueless -- there's not central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant. Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip-time and throughput rate between the client and server. In this benchmark the server simply acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies. Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their response not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts and ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needed to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observe the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node, as the writer doesn't need the home to respond over the QPI link, as the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path. Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their response to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. In home-based snooping the snoop probe goes directly to the home node and are not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd party owner of the line, if any, can respond either to the home or the original requester (or even to both) according to the protocol policies. There are myriad variations that have been implemented, and unfortunately vendor terminology doesn't always agree between vendors or with the academic taxonomy papers. The key is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it also may require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping. While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned in banks or slices, which with each slice being associated with a core on the ring topology. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communications, although the latencies are not nearly the magnitude we see inter-package. I've not seen such communication distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more regular.

    Read the article

  • 4.8M wasn't enough so we went for 5.055M tpmc with Unbreakable Enterprise Kernel r2 :-)

    - by wcoekaer
    We released a new set of benchmarks today. One is an updated tpc-c from a few months ago where we had just over 4.8M tpmc at $0.98 and we just updated it to go to 5.05M and $0.89. The other one is related to Java Middleware performance. You can find the press release here. Now, I don't want to talk about the actual relevance of the benchmark numbers, as I am not in the benchmark team. I want to talk about why these numbers and these efforts, unrelated to what they mean to your workload, matter to customers. The actual benchmark effort is a very big, long, expensive undertaking where many groups work together as a big virtual team. Having the virtual team be within a single company of course helps tremendously... We already start with a very big server setup with tons of storage, many disks, lots of ram, lots of cpu's, cores, threads, large database setups. Getting the whole setup going to start tuning, by itself, is no easy task, but then the real fun starts with tuning the system for optimal performance -and- stability. A benchmark is not just revving an engine at high rpm, it's actually hitting the circuit. The tests require long runs, require surviving availability tests, such as surviving crashes -and- recovery under load. In the TPC-C example, the x4800 system had 4TB ram, 160 threads (8 sockets, hyperthreaded, 10 cores/socket), tons of storage attached, tons of luns visible to the OS. flash storage, non flash storage... many things at high scale that all have to be perfectly synchronized. During this process, we find bugs, we fix bugs, we find performance issues, we fix performance issues, we find interesting potential features to investigate for the future, we start new development projects for future releases and all this goes back into the products. As more and more customers, for Oracle Linux, are running larger and larger, faster and faster, more mission critical, higher available databases..., these things are just absolutely critical. Unrelated to what anyone's specific opinion is about tpc-c or tpc-h or specjenterprise etc, there is a ton of effort that the customer benefits from. All this work makes Oracle Linux and/or Oracle Solaris better platforms. Whether it's faster, more stable, more scalable, more resilient. It helps. Another point that I always like to re-iterate around UEK and UEK2 : we have our kernel source git repository online. Complete changelog of the mainline kernel, and our changes, easy to pull, easy to dissect, easy to know what went in when, why and where. No need to go log into a website and manually click through pages to hopefully discover changes or patches. No need to untar 2 tar balls and run a diff.

    Read the article

  • Information Indepth Newsletter - Linux Edition

    - by Paulo Folgado
    INFORMATION INDEPTH NEWSLETTERLinux Edition February 2011 Stay Connected:  NEWS Now Available: Oracle Linux 6 Get the latest release of Oracle Linux 6, which includes Unbreakable Enterprise Kernel.Download Oracle Linux 6 Read More Customers Succeed by Using Oracle Exadata with Oracle Linux Watch IT executives from Bank of America, Linkshare, and Johns Hopkins as they talk about the business challenges they faced and why they chose to use Oracle Linux along with Oracle Exadata as the solution. Watch Now Video Interview: Oracle Senior Vice President Wim Coekaerts Watch Wim Coekaerts, senior vice president, Linux and Virtualization Engineering, as he talks about use cases for Oracle VM Templates as well as the Unbreakable Enterprise Kernel for Linux.Watch Now Hot Off the Press: Migrate Your IBM AIX Environment to Oracle Linux This new white paper provides recommendations for planning and implementing the migration of applications from an IBM Power System running AIX to Oracle's Sun Fire X4800 Server with Intel Xeon 7560 Processor running Oracle Linux 5.5.Read More  Back to Top BLOGOSPHERE Just Launched: The Oracle Linux Blog Follow our new Oracle Linux blog  to hear the latest updates, product news, upcoming events, and all the latest happenings, directly from the Linux team at Oracle. Back to Top TECH DIVE NEW: Linux/Oracle Solaris CommandComparo Site from Oracle Technology NetworkThis site gives equivalent command syntax in Oracle Solaris 10 and Oracle Enterprise Linux 5 for common administrative tasks--focusing particularly on tasks that have tricky syntax or that you frequently need to double check. It acts as a quick reference for administrators who operate in these two OS environments. Free Download: Oracle Linux Release 5.6Did you know that by using Oracle Linux 5.5 or 5.6 along with the Unbreakable Enterprise Kernel, you can get all the benefits of Linux mainline kernel 2.6.32 and more, right now, without the need to reinstall or migrate to a new operating system such as RHEL6?Read Release NotesDownload Oracle Linux 5.6 LSB 4.0 Certification Completed for Oracle Linux 5.5Oracle Linux 5.5 with Unbreakable Enterprise Kernel successfully completed the LSB 4.0 certification.  Back to Top WEBCASTS Boost Your Linux Performance with Oracle's Enhancements in Infiniband and RDSRegister to hear Director of Kernel Engineering Chris Mason cover scalability and performance improvements in Linux environment. Get the Facts Oracle's Unbreakable Enterprise KernelSVP Wim Coekaerts and Senior Director Monica Kumar cover the facts about and benefits of using Unbreakable Enterprise Kernel.  View Other Webcasts on Demand   Back to Top EVENTS Collaborate 2011April 10-14 Orlando, Florida Cloud Summit Events, WorldwideVarious dates (check the city for date/time of event) Datacenter Efficiency Events WorldwideThese events include Linux and Oracle VM sessions.Various dates (check the city for date/time of event) Virtualization Events in North America Find an Oracle Event  Back to Top EDUCATION Get Oracle Linux Certified from Oracle University Oracle University offers courses in both Oracle Linux and the administration of Oracle Database on Linux.  Back to Top CUSTOMER SPOTLIGHT Pella Corporation Improves IT Performance and Efficiency with Oracle Linux and Oracle VM To improve IT performance and efficiency and lower operational costs, Pella Corporation, has standardized on Oracle VM and Oracle Linux. Read More Disney Store Deploys POS in 330 Stores and 7 Countries on Oracle Linux Disney Store is running 1,500 registers worldwide on a broad Oracle technology software stack including Oracle Database 11g, Oracle Fusion Middleware, and Oracle Linux. Read More Back to Top PARTNER SPOTLIGHT Emulex and Oracle Announce Data Integrity Features The Unbreakable Enterprise Kernel provides data integrity checking between Oracle Database applications and Emulex 8Gb/s LightPulse Fibre Channel Host Bus Adapters. Read More Dell Inc. Dell Inc. tested and validated configurations support Oracle Linux. Back to Top STAY IN TOUCH Follow @ORCL_Linux on Twitter for the latest penguin tweets Bookmark Oracle.com/Linux Read the Oracle Linux blog Back to Top  Oracle Information InDepth newsletters bring targeted news, articles, customer stories, and special offers to business people who want to find out how to streamline enterprise information management, measure results, improve business processes, and communicate a single truth to their constituents. Please send questions or comments to [email protected]. For answers to questions about subscribing, unsubscribing, and managing your Oracle e-mail communications preferences, please see the Oracle E-Mail Communications page. Copyright © 2011, Oracle Corporation and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. This document is provided for information purposes only, and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor is it subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document, and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without our prior written permission. 

    Read the article

1