Search Results

Search found 854 results on 35 pages for 'cores'.

Page 8/35

SOLR multicore shared configuration

- by Mark

I'm using multiple cores in SOLR to enable offline population of indices (and then using SWAP to swap out the active core). I want to use the same solrconfig.xml file for both cores - can someone tell me where I should put this so it can be picked up by SOLR?

Read the article
How to use hardware threads in C# .NET code running on a multicore machine?

- by Techee

How can I use hardware threads in C# code running on a multi-core machine? Any example will be appreciated. I wish to run two threads in parallel on two cores of my machine. If I create normal software threads in C#, they may run on a single core. Is it possible to run these two threads implicitly in parallel on two cores so as to get better performance?

Read the article
What programming language are you using today for multicore platforms?

- by Seymour Cakes

I was reading this blog post http://www.cilk.com/multicore-blog/bid/8097/Don-t-get-caught-with-your-multicore-pants-down and it got me asking this question. 4-core or 8-core machines will be a common thing in 12-24 months' time, and I got a chill realizing that I don't have an answer for that yet.

Read the article
What's a MAC address? What is a unique ID for a PC?

- by Frank

I have a PC with dual cores; it has two MAC addresses: 00-1D.... & 00-21..... Are these IDs for the two cores? If I want to get hold of a unique ID for this PC, how do I get it with a Java call? Maybe there is something in Java like "System.getId()"? Frank
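A note on the premise: MAC addresses identify network interfaces, not CPU cores, so a PC with two adapters reports two of them regardless of its core count. As a rough illustration (in Python rather than Java), the standard-library call `uuid.getnode()` returns one hardware address as a 48-bit integer. The helper name `format_mac` is made up for this sketch, and `getnode()` may fall back to a random number when no address can be read, so treat the result as a hint rather than a guaranteed unique ID.

```python
import uuid


def format_mac(node: int) -> str:
    """Render a 48-bit integer as a dash-separated MAC string (hypothetical helper)."""
    raw = f"{node:012X}"  # 12 hex digits, zero-padded
    return "-".join(raw[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    # uuid.getnode() picks one interface's address; which one is platform-dependent.
    print(format_mac(uuid.getnode()))
```

Because the value comes from whichever interface the platform reports, swapping or disabling a network adapter can change it.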

Read the article
How to solve High Load average issue in Linux systems?

- by RoCkStUnNeRs

The following shows the load and CPU time at different times. The output below was parsed from the top command.

TIME      LOAD    US      SY      NICE    ID       WA      HI      SI      ST
12:02:27  208.28  4.2%us  1.0%sy  0.2%ni  93.9%id  0.7%wa  0.0%hi  0.0%si  0.0%st
12:23:22  195.48  4.2%us  1.0%sy  0.2%ni  93.9%id  0.7%wa  0.0%hi  0.0%si  0.0%st
12:34:55  199.15  4.2%us  1.0%sy  0.2%ni  93.9%id  0.7%wa  0.0%hi  0.0%si  0.0%st
13:41:50  203.66  4.2%us  1.0%sy  0.2%ni  93.8%id  0.8%wa  0.0%hi  0.0%si  0.0%st
13:42:58  278.63  4.2%us  1.0%sy  0.2%ni  93.8%id  0.8%wa  0.0%hi  0.0%si  0.0%st

The following is additional information about the system.

cat /proc/cpuinfo

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 10
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4658.69
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 10
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4655.00
clflush size : 64
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 10
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4655.00
clflush size : 64
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
stepping : 10
cpu MHz : 1992.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4654.99
clflush size : 64
power management:

Memory:
       total  used  free  shared  buffers  cached
Mem:   2      1     1     0       0        0
Swap:  5      0     5

Please let me know why the system is showing such an abnormally high load.
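One way to read these figures: the Linux load average counts tasks that are runnable plus tasks in uninterruptible sleep (state D, usually waiting on I/O). A load near 200 on 4 cores that are about 94% idle therefore hints at many blocked tasks rather than CPU saturation. The sketch below is only a starting point for checking that, assuming a Linux /proc filesystem; the helper names are illustrative, and it inspects one stat line per process, not per thread.

```python
import glob


def load_per_core(load: float, cores: int) -> float:
    """Normalize a load average by the number of cores."""
    return load / cores


def state_from_stat(stat_line: str) -> str:
    """Extract the state letter from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or ')', so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_uninterruptible() -> int:
    """Count processes currently in state D (returns 0 where /proc is absent)."""
    count = 0
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                if state_from_stat(fh.read()) == "D":
                    count += 1
        except (OSError, IndexError):
            continue  # process exited or entry unreadable
    return count


if __name__ == "__main__":
    print("load per core:", load_per_core(208.28, 4))
    print("processes in D state:", count_uninterruptible())
```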

Read the article
CPU ordering in Linux (with hyper threading)

- by Jason

I'm curious what the CPU ordering is in Linux. Say I bind a thread to cpu0 and another to cpu1 on a hyperthreaded system: are they both going to be on the same physical core? Given a Core i7 920 with 4 cores and hyperthreading, the output of /proc/cpuinfo has me thinking that cpu0 and cpu1 are different physical cores, and cpu0 and cpu4 are on the same physical core. Thanks.
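Rather than inferring the layout from /proc/cpuinfo, the kernel's sysfs topology files can be read directly. The sketch below groups logical CPUs that share a physical core by reading `thread_siblings_list` under /sys/devices/system/cpu; the function names are illustrative. The numbering depends on how the firmware enumerates CPUs, so the "cpu0 pairs with cpu4" pattern is common on such systems but should be checked per machine.

```python
import glob


def parse_cpu_list(text: str) -> list[int]:
    """Parse a sysfs CPU list such as '0,4' or '0-3,8' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups() -> list[list[int]]:
    """Return one list of logical CPU ids per physical core (empty off Linux)."""
    groups = set()
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        with open(path) as fh:
            groups.add(tuple(parse_cpu_list(fh.read())))
    return [list(g) for g in sorted(groups)]


if __name__ == "__main__":
    for group in sibling_groups():
        print("logical CPUs sharing a core:", group)
```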

Read the article
Thread placement policies on NUMA systems - update

- by Dave

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded NUMA node (the node with lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though. This description is incorrect. On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle.
I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creation between the processes, so I was relatively sure the effect wasn't related to thread creation time, but rather that placement was a function of process membership. At this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different from what we've seen on Solaris 10. So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically.
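The packing behavior described above can be reproduced with a toy model. This is only a sketch under simplifying assumptions, not the Solaris dispatcher: it treats "load" as a simple count of placed threads, uses a hypothetical function `place_threads`, and starts every process on node 0. What happens once every node has reached the threshold is not covered by the description, so the model's behavior there is arbitrary.

```python
def place_threads(n_threads, n_nodes=4, strands_per_node=64, threshold=0.5):
    """Toy model: pack one process's threads onto a home node until that node
    reaches `threshold` of its strands, then move to the least loaded node."""
    load = [0] * n_nodes
    home = 0  # assumed home node of the primordial thread
    limit = int(strands_per_node * threshold)
    for _ in range(n_threads):
        if load[home] >= limit:
            # Home node is "full enough": pick the least loaded node as new home.
            home = min(range(n_nodes), key=lambda node: load[node])
        load[home] += 1
    return load


if __name__ == "__main__":
    for count in (20, 48, 64):
        print(count, "threads ->", place_threads(count))
```

With the T4-4 figures from the text (4 nodes, 64 strands each, 50% threshold), the model gives 20 threads on a single node, a 32/16 split for 48 threads, and 32/32 for 64 threads, matching the observations reported above.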
In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. Lgrp_expand_proc_thresh controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more. To recap, the 'new' policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code. Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could be the topic of a future blog entry. The scheduler is work-conserving. The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links.
As an aside, the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, followed by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows. Bit [6:6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly. Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD) which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management policies constitute yet another confounding factor with respect to scheduling and dispatching. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand, if the threads in your process communicate rarely, then it's possible the new packing policy might result in contention on shared computing resources.
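The x4800 CPUID layout given above (bit 6 for the hyperthread, bits 5:3 for the socket, bits 2:0 for the core) can be unpacked with a few shifts and masks. A minimal sketch; the type and function names are illustrative, and the 0-127 range check simply follows from the 7 bits described.

```python
from typing import NamedTuple


class X4800Cpu(NamedTuple):
    hyperthread: int  # bit [6]
    socket: int       # bits [5:3]
    core: int         # bits [2:0]


def decode_cpuid(cpuid: int) -> X4800Cpu:
    """Split a logical CPUID into (hyperthread, socket, core) per the layout above."""
    if not 0 <= cpuid < 128:
        raise ValueError("expected a 7-bit CPUID (0-127)")
    return X4800Cpu(
        hyperthread=(cpuid >> 6) & 0x1,
        socket=(cpuid >> 3) & 0x7,
        core=cpuid & 0x7,
    )


if __name__ == "__main__":
    for cpuid in (0, 64, 83):
        print(cpuid, decode_cpuid(cpuid))
```

Under this layout, CPUIDs 0 and 64 are the two hyperthreads of the same core (socket 0, core 0), so consecutive CPUIDs land on different cores.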
Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to optimize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could augment the CPU with performance counters for misses where a thread evicts a line it installed vs misses where a thread displaces a line installed by some other thread.)

Read the article
J2EE Applications, SPARC T4, Solaris Containers, and Resource Pools

- by user12620111

I've obtained a substantial performance improvement on a SPARC T4-2 Server running a J2EE Application Server Cluster by deploying the cluster members into Oracle Solaris Containers and binding those containers to cores of the SPARC T4 Processor. This is not a surprising result; in fact, it is consistent with other results that are available on the Internet. See the "references", below, for some examples. Nonetheless, here is a summary of my configuration and results.
(1.0) Before deploying a J2EE Application Server Cluster into a virtualized environment, many decisions need to be made. I'm not claiming that all of the decisions that I have made will work well for every environment. In fact, I'm not even claiming that all of the decisions are the best possible for my environment. I'm only claiming that of the small sample of configurations that I've tested, this is the one that is working best for me. Here are some of the decisions that needed to be made:
(1.1) Which virtualization option? There are several virtualization options and isolation levels that are available. Options include:
  - Hard partitions: Dynamic Domains on Sun SPARC Enterprise M-Series Servers
  - Hypervisor based virtualization such as Oracle VM Server for SPARC (LDOMs) on SPARC T-Series Servers
  - OS Virtualization using Oracle Solaris Containers
  - Resource management tools in the Oracle Solaris OS to control the amount of resources an application receives, such as CPU cycles, physical memory, and network bandwidth.
Oracle Solaris Containers provide the right level of isolation and flexibility for my environment. To borrow some words from my friends in marketing, "The SPARC T4 processor leverages the unique, no-cost virtualization capabilities of Oracle Solaris Zones"
(1.2) How to associate Oracle Solaris Containers with resources? There are several options available to associate containers with resources, including (a) resource pool association (b) dedicated-cpu resources and (c) capped-cpu resources. I chose to create resource pools and associate them with the containers because I wanted explicit control over the cores and virtual processors.
(1.3) Cluster Topology? Is it best to deploy (a) multiple application servers on one node, (b) one application server on multiple nodes, or (c) multiple application servers on multiple nodes? After a few quick tests, it appears that one application server per Oracle Solaris Container is a good solution.
(1.4) Number of cluster members to deploy? I chose to deploy four big 64-bit application servers. I would like to go back and test many 32-bit application servers, but that is left for another day.
(2.0) Configuration tested.
(2.1) I was using a SPARC T4-2 Server which has 2 CPUs and 128 virtual processors. To understand the physical layout of the hardware on Solaris 10, I used the OpenSolaris psrinfo perl script available at http://hub.opensolaris.org/bin/download/Community+Group+performance/files/psrinfo.pl:
  test# ./psrinfo.pl -pv
  The physical processor has 8 cores and 64 virtual processors (0-63)
    The core has 8 virtual processors (0-7)
    The core has 8 virtual processors (8-15)
    The core has 8 virtual processors (16-23)
    The core has 8 virtual processors (24-31)
    The core has 8 virtual processors (32-39)
    The core has 8 virtual processors (40-47)
    The core has 8 virtual processors (48-55)
    The core has 8 virtual processors (56-63)
      SPARC-T4 (chipid 0, clock 2848 MHz)
  The physical processor has 8 cores and 64 virtual processors (64-127)
    The core has 8 virtual processors (64-71)
    The core has 8 virtual processors (72-79)
    The core has 8 virtual processors (80-87)
    The core has 8 virtual processors (88-95)
    The core has 8 virtual processors (96-103)
    The core has 8 virtual processors (104-111)
    The core has 8 virtual processors (112-119)
    The core has 8 virtual processors (120-127)
      SPARC-T4 (chipid 1, clock 2848 MHz)
(2.2) The "before" test: without processor binding. I started with a 4-member cluster deployed into 4 Oracle Solaris Containers. Each container used a unique gigabit Ethernet port for HTTP traffic. The containers shared a 10 gigabit Ethernet port for JDBC traffic.
(2.3) The "after" test: with processor binding. I ran one application server in the Global Zone and another application server in each of the three non-global zones (NGZ):
(3.0) Configuration steps. The following steps need to be repeated for all three Oracle Solaris Containers.
(3.1) Stop AppServers from the BUI.
(3.2) Stop the NGZ.
  test# ssh test-z2 init 5
(3.3) Enable resource pools:
  test# svcadm enable pools
(3.4) Create the resource pool:
  test# poolcfg -dc 'create pool pool-test-z2'
(3.5) Create the processor set:
  test# poolcfg -dc 'create pset pset-test-z2'
(3.6) Specify the maximum number of CPUs that may be added to the processor set:
  test# poolcfg -dc 'modify pset pset-test-z2 (uint pset.max=32)'
(3.7) bash syntax to add Virtual CPUs to the processor set:
  test# (( i = 64 )); while (( i < 96 )); do poolcfg -dc "transfer to pset pset-test-z2 (cpu $i)"; (( i = i + 1 )) ; done
(3.8) Associate the resource pool with the processor set:
  test# poolcfg -dc 'associate pool pool-test-z2 (pset pset-test-z2)'
(3.9) Tell the zone to use the resource pool that has been created:
  test# zonecfg -z test-z2 set pool=pool-test-z2
(3.10) Boot the Oracle Solaris Container:
  test# zoneadm -z test-z2 boot
(3.11) Save the configuration to /etc/pooladm.conf:
  test# pooladm -s
(4.0) Results. Using the resource pools improves both throughput and response time:
(5.0) References:
  - System Administration Guide: Oracle Solaris Containers-Resource Management and Oracle Solaris Zones
  - Capitalizing on large numbers of processors with WebSphere Portal on Solaris
  - WebSphere Application Server and T5440 (Dileep Kumar's Weblog)
  - http://www.brendangregg.com/zones.html
  - Reuters Market Data System, RMDS 6 Multiple Instances (Consolidated), Performance Test Results in Solaris, Containers/Zones Environment on Sun Blade X6270 by Amjad Khan, 2009.
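Since the goal is to bind a container to whole cores, it can help to check that a virtual CPU range lines up with core boundaries before running the transfer loop. The sketch below does that arithmetic for the layout reported by psrinfo (8 virtual processors per core, 8 cores per chip). The function names and the per-chip core index are assumptions made for illustration; psrinfo itself does not number the cores.

```python
VCPUS_PER_CORE = 8   # from the psrinfo output above
CORES_PER_CHIP = 8   # from the psrinfo output above


def locate(vcpu: int) -> tuple[int, int]:
    """Return (chipid, core index within that chip) for a virtual CPU id."""
    core = vcpu // VCPUS_PER_CORE
    return core // CORES_PER_CHIP, core % CORES_PER_CHIP


def covers_whole_cores(first: int, last: int) -> bool:
    """True if the inclusive range [first, last] starts and ends on core boundaries."""
    return first % VCPUS_PER_CORE == 0 and (last + 1) % VCPUS_PER_CORE == 0


if __name__ == "__main__":
    first, last = 64, 95  # the range used in the transfer loop of step (3.7)
    print("first vcpu:", locate(first), "last vcpu:", locate(last))
    print("whole cores only:", covers_whole_cores(first, last))
```

For the 64-95 range this reports four complete cores on chipid 1, which matches a pset.max of 32.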

Read the article
SQL 2012 Licensing Thoughts

- by Geoff N. Hiten

The only thing more controversial than new Federal Tax plans is new Licensing plans from Microsoft. In both cases, everyone calculates several numbers. First, will I pay more or less under this plan? Second, will my competition pay more or less than now? Third, will <insert interesting person/company here> pay more or less? Not that items 2 and 3 are meaningful, that is just how people think. Much like tax plans, the devil is in the details, so let’s see how this looks. Microsoft shows it here: http://www.microsoft.com/sqlserver/en/us/future-editions/sql2012-licensing.aspx First up is a switch from per-socket to per-core licensing. Anyone who didn’t see something like this coming should rapidly search for a new line of work because you are not paying attention. The explosion of multi-core processors has made SQL Server a bargain. Microsoft is in business to make money and the old per-socket model was not going to do that going forward. Per-core licensing also simplifies virtualization licensing. Physical Core = Virtual Core, at least for licensing. Oversubscribe your processors, that’s your lookout. You still pay for what is exposed to the VM. The cool part is you can seamlessly move physical and virtual workloads around and the licenses follow. The catch is you have to have Software Assurance to make the licenses mobile. Nice touch there. Let’s have a moment of silence for the late, unlamented, largely ignored Workgroup Edition. To quote the Microsoft FAQ: “Standard becomes our sole edition for basic database needs”. Considering I haven’t encountered a single instance of SQL Server Workgroup Edition in the wild, I don’t think this will be all that controversial. As for pricing, it looks like a wash with current per-socket pricing based on four-core sockets. Interestingly, that is the minimum core count Microsoft proposes to swap to transition per-socket to per-core if you are on Software Assurance.
Reading the fine print shows that if you are using more, you will get more core licenses: From the licensing FAQ. 15. How do I migrate from processor licenses to core licenses? What is the migration path? Licenses purchased with Software Assurance (SA) will upgrade to SQL Server 2012 at no additional cost. EA/EAP customers can continue buying processor licenses until your next renewal after June 30, 2012. At that time, processor licenses will be exchanged for core-based licenses sufficient to cover the cores in use by processor-licensed databases (minimum of 4 cores per processor for Standard and Enterprise, and minimum of 8 EE cores per processor for Datacenter). Looks like the folks who invested in the AMD 12-core chips will make out like bandits. Now, on to something new: SQL Server Business Intelligence Edition. Yep, finally a BI-specific SKU licensed for server+CAL configurations only. Note that Enterprise Edition still supports the complete feature set; the BI Edition is intended for smaller shops who want to use the full BI feature set but without needing Enterprise Edition scale (or costs). No, you don’t get ColumnStore, Compression, or Partitioning in the BI Edition. Those are Enterprise scale features, ThankYouVeryMuch. Then again, your starting licensing costs are about one sixth of an Enterprise Edition system (based on an 8 core server). The only part of the message I am missing is if the current Failover Licensing Policy will change. Do we need to fully or partially license failover servers? That is a detail I definitely want to know.

Read the article
Concurrent Affairs

- by Tony Davis

I once wrote an editorial, multi-core mania, on the conundrum of ever-increasing numbers of processor cores, but without the concurrent programming techniques to get anywhere near exploiting their performance potential. I came to the controversial conclusion that, while the problem loomed for all procedural languages, it was not a big issue for the vast majority of programmers. Two years later, I still think most programmers don't concern themselves overly with this issue, but I do think that's a bigger problem than I originally implied. Firstly, is the performance boost from writing code that can fully exploit all available cores worth the cost of the additional programming complexity? Right now, with quad-core processors that, at best, can make our programs four times faster, the answer is still no for many applications. But what happens in a few years, as the number of cores grows to 100 or even 1000? At this point, it becomes very hard to ignore the potential gains from exploiting concurrency. Possibly, I was optimistic to assume that, by the time we have 100-core processors, and most applications really needed to exploit them, some technology would be around to allow us to do so with relative ease. The ideal solution would be one that allows programmers to forget about the problem, in much the same way that garbage collection removed the need to worry too much about memory allocation. From all I can find on the topic, though, there is only a remote likelihood that we'll ever have a compiler that takes a program written in a single-threaded style and "auto-magically" converts it into an efficient, correct, multi-threaded program. At the same time, it seems clear that what is currently the most common solution, multi-threaded programming with shared memory, is unsustainable. As soon as a piece of state can be changed by a different thread of execution, the potential number of execution paths through your program grows exponentially with the number of threads.
If you have two threads, each executing n instructions, then there are C(2n, n) = (2n)!/(n!)^2 possible "interleavings" of those instructions, a number that grows roughly as 4^n. Of course, many of those interleavings will have identical behavior, but several won't. Not only does this make understanding how a program works an order of magnitude harder, but it will also result in irreproducible, non-deterministic bugs. And of course, the problem will be many times worse when you have a hundred or a thousand threads. So what is the answer? All of the possible alternatives require a change in the way we write programs and, currently, seem to be plagued by performance issues. Software transactional memory (STM) applies the ideas of database transactions, and optimistic concurrency control, to memory. However, working out how to break down your program into sufficiently small transactions, so as to avoid contention issues, isn't easy. Another approach is concurrency with actors, where instead of having threads share memory, each thread runs in complete isolation, and communicates with others by passing messages. It simplifies concurrent programs but still has performance issues if the threads need to operate on the same large piece of data. There are doubtless other possible solutions that I haven't mentioned, and I would love to know to what extent you, as a developer, are considering the problem of multi-core concurrency, what solution you currently favor, and why. Cheers, Tony.
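To put a number on the interleaving growth: for two threads whose instructions are each executed atomically and in program order, the count of distinct interleavings of n and m instructions is the binomial coefficient C(n + m, n). The short sketch below tabulates it for equal-length threads; the function name is illustrative.

```python
from math import comb


def interleavings(n: int, m: int) -> int:
    """Number of ways to merge two instruction sequences of lengths n and m
    while preserving the order within each sequence."""
    return comb(n + m, n)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        print(f"n = {n:2d}: {interleavings(n, n)} interleavings")
```

Even at ten instructions per thread the count is already 184756, which is why exhaustive reasoning about shared-state threads stops being practical so quickly.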

Read the article
Linux 2.6.31 Scheduler and Multithreaded Jobs

- by dsimcha

I run massively parallel scientific computing jobs on a shared Linux computer with 24 cores. Most of the time my jobs are capable of scaling to 24 cores when nothing else is running on this computer. However, it seems like when even one single-threaded job that isn't mine is running, my 24-thread jobs (which I set for high nice values) only manage to get ~1800% CPU (using Linux notation). Meanwhile, about 500% of the CPU cycles (again, using Linux notation) are idle. Can anyone explain this behavior and what I can do about it to get all of the 23 cores that aren't being used by someone else? Notes: In case it's relevant, I have observed this on slightly different kernel versions, though I can't remember which off the top of my head. The CPU architecture is x64. Is it at all possible that the fact that my 24-core jobs are 32-bit and the other jobs I'm competing w/ are 64-bit is relevant? Edit: One thing I just noticed is that going up to 30 threads seems to alleviate the problem to some degree. It gets me up to ~2100% CPU.

Read the article
Multicore solr on Ubuntu 10.04 working for anyone?

- by coleifer

Following instructions from the two sites below, I've installed tomcat6 and solr 1.4:
http://gist.github.com/204638
https://wiki.fourkitchens.com/display/TECH/Solr+1.4+on+Ubuntu+9.10+and+CentOS+5
I have successfully got it up and running on a server running 9.04 with multicore support, but on 10.04 I can't seem to get it to work. I am able to reach localhost:xxxx/solr/ on the 10.04 box and see a single link to the Solr Admin, but following the link takes me to a 404 page with the following output:
  /solr/admin/
  HTTP Status 404 - missing core name in path
  The requested resource (missing core name in path) is not available
I am also unable to access /solr/site1/ as I would expect; it similarly returns a 404.
<cores adminPath="/admin/cores">
  <core name="site1" instanceDir="site1" />
  <core name="site2" instanceDir="site2" />
</cores>
<Context docBase="/var/solr/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/var/solr" override="true" />
</Context>

Read the article
Memory Bandwidth Performance for Modern Machines

- by porgarmingduod

I'm designing a real-time system that occasionally has to duplicate a large amount of memory. The memory consists of non-tiny regions, so I expect the copying performance will be fairly close to the maximum bandwidth the relevant components (CPU, RAM, MB) can do. This led me to wonder what kind of raw memory bandwidth modern commodity machines can muster. My aging Core2Duo gives me 1.5 GB/s if I use 1 thread to memcpy() (and understandably less if I memcpy() with both cores simultaneously). While 1.5 GB is a fair amount of data, the real-time application I'm working on will have something like 1/50th of a second, which means 30 MB. Basically, almost nothing. And perhaps worst of all, as I add multiple cores, I can process a lot more data without any increased performance for the needed duplication step. But a low-end Core2Duo isn't exactly hot stuff these days. Are there any sites with information, such as actual benchmarks, on raw memory bandwidth on current and near-future hardware? Furthermore, for duplicating large amounts of data in memory, are there any shortcuts, or is memcpy() as good as it will get? Given a bunch of cores with nothing to do but duplicate as much memory as possible in a short amount of time, what's the best I can do?
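The 30 MB figure follows directly from dividing the bandwidth by the frame rate: 1.5 GB/s over 50 frames per second. The sketch below reproduces that arithmetic and adds a crude single-threaded copy timing for comparison. It is a rough stand-in for a memcpy() benchmark, with illustrative function names; a Python-level copy includes allocation overhead, so it will tend to understate what tuned native code achieves.

```python
import time


def bytes_per_frame(bandwidth_bytes_per_s: float, frames_per_s: float) -> float:
    """How many bytes can be copied within one frame at a given bandwidth."""
    return bandwidth_bytes_per_s / frames_per_s


def measure_copy_bandwidth(size_bytes: int = 16 * 1024 * 1024, repeats: int = 3) -> float:
    """Best observed bytes/second for copying one buffer (single thread)."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full copy of the buffer
        best = min(best, time.perf_counter() - start)
        del dst
    return size_bytes / max(best, 1e-9)  # guard against a zero timer reading


if __name__ == "__main__":
    print("budget per 1/50 s at 1.5 GB/s:", bytes_per_frame(1.5e9, 50), "bytes")
    print("measured copy bandwidth:", measure_copy_bandwidth() / 1e9, "GB/s")
```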

Read the article
How do I fix this installation problem with multicore Solr on Ubuntu 10.04?

- by coleifer

Following instructions from the two sites below, I've installed Tomcat 6 and Solr 1.4.
http://gist.github.com/204638
https://wiki.fourkitchens.com/display/TECH/Solr+1.4+on+Ubuntu+9.10+and+CentOS+5
I have successfully got it up and running on a server running 9.04 with multicore support, but on 10.04 I can't seem to get it to work. I am able to reach localhost:xxxx/solr/ on the 10.04 box and see a single link to the Solr Admin, but following the link takes me to a 404 page with the following output:
  /solr/admin/
  HTTP Status 404 - missing core name in path
  The requested resource (missing core name in path) is not available
I am also unable to access /solr/site1/ as I would expect; it similarly returns a 404.
<cores adminPath="/admin/cores">
  <core name="site1" instanceDir="site1" />
  <core name="site2" instanceDir="site2" />
</cores>
<Context docBase="/var/solr/solr.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/var/solr" override="true" />
</Context>

Read the article
Performance degrades for more than 2 threads on Xeon X5355

- by zoolii

Hi all, I am writing an application using Boost threads, with Boost barriers to synchronize the threads. I have two machines to test the application. Machine 1 is a Core 2 Duo (T8300) machine (Windows XP Professional, 4 GB RAM) where I am getting the following performance figures: Number of threads: 1, TPS: 21; Number of threads: 2, TPS: 35 (66% improvement). Increasing the number of threads further decreases the TPS, but that is understandable as the machine has only two cores. Machine 2 is a dual quad-core (Xeon X5355) machine (Windows Server 2003 with 4 GB RAM) and has 8 effective cores: Number of threads: 1, TPS: 21; Number of threads: 2, TPS: 27 (28% improvement); Number of threads: 4, TPS: 25; Number of threads: 8, TPS: 24. As you can see, performance degrades beyond 2 threads (even though the machine has 8 cores). If the program had some bottleneck, it should have degraded with 2 threads as well. Any ideas or explanations? Does the OS play some role in performance? It seems that the Core 2 Duo (2.4 GHz) scales better than the Xeon X5355 (2.66 GHz), even though the Xeon has the higher clock speed. Thank you, Zoolii
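The figures in the question can be turned into speedup and per-thread efficiency, which makes the difference between the two machines easier to see. This is a minimal sketch using only the TPS numbers quoted above; the function and variable names are made up for illustration.

```python
def speedup(tps_one_thread, tps_n_threads):
    """Throughput relative to the single-threaded run."""
    return tps_n_threads / tps_one_thread


def efficiency(tps_one_thread, tps_n_threads, n_threads):
    """Speedup divided by thread count; 1.0 would be perfect scaling."""
    return speedup(tps_one_thread, tps_n_threads) / n_threads


# TPS per thread count, as reported in the question.
MACHINE_1 = {1: 21, 2: 35}
MACHINE_2 = {1: 21, 2: 27, 4: 25, 8: 24}


def scaling_table(results):
    """Map thread count to (speedup, efficiency), both rounded to 2 places."""
    base = results[1]
    return {
        n: (round(speedup(base, tps), 2), round(efficiency(base, tps, n), 2))
        for n, tps in results.items()
    }


if __name__ == "__main__":
    print("Core 2 Duo:", scaling_table(MACHINE_1))
    print("Xeon X5355:", scaling_table(MACHINE_2))
```

Since every thread waits at a barrier, throughput is bounded by the slowest thread in each round, so a falling efficiency figure points at synchronization or shared-resource contention rather than at raw clock speed.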

Read the article
java GC periodically enters into several full GC cycles

- by Peter

Environment: Sun JDK 1.6.0_16
VM settings: -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -Xms1024 -Xmx1024M -XX:MaxNewSize=448m -XX:NewSize=448m -XX:SurvivorRatio=4 (6 also checked) -XX:MaxPermSize=128M
OS: Windows Server 2003
Processor: 4 cores of Intel Xeon 5130, 2000 MHz
Application description: a high intensity of concurrent operations (using Java 5 concurrency), each completed by a commit to Oracle. About 20-30 threads run non-stop, doing tasks. The application runs in the JBoss web container. At first the GC works normally: I see a lot of small GCs, and during that time the CPU shows a good load, with all 4 cores loaded to 40-50% and a stable CPU graph. Then, after 1 minute of good work, the CPU drops to 0% on 2 of the 4 cores and its graph becomes unstable, going up and down ("teeth"). I can see that my threads work slower (I have monitoring), and that the GC produces a lot of Full GCs during that time. This situation lasts for the next 4-5 minutes; then, for a short period of about 1 minute, it gets back to normal, but shortly after that the bad behaviour repeats. Question: why do I have such frequent Full GCs, and how can I prevent them? I played with SurvivorRatio, which does not help. I noticed that the application behaves normally until the first Full GC occurs, while I have enough memory. Then it runs badly.
my GC LOG: starts good then long period of FULL GCs (many of them)
1027.861: [GC 942200K->623526K(991232K), 0.0887588 secs]
1029.333: [GC 803279K(991232K), 0.0927470 secs]
1030.551: [GC 967485K->625549K(991232K), 0.0823024 secs]
1030.634: [GC 625957K(991232K), 0.0763656 secs]
1033.126: [GC 969613K->632963K(991232K), 0.0850611 secs]
1033.281: [GC 649899K(991232K), 0.0378358 secs]
1035.910: [GC 813948K(991232K), 0.3540375 secs]
1037.994: [GC 967729K->637198K(991232K), 0.0826042 secs]
1038.435: [GC 710309K(991232K), 0.1370703 secs]
1039.665: [GC 980494K->972462K(991232K), 0.6398589 secs]
1040.306: [Full GC 972462K->619643K(991232K), 3.7780597 secs]
1044.093: [GC 620103K(991232K), 0.0695221 secs]
1047.870: [Full GC 991231K->626514K(991232K), 3.8732457 secs]
1053.739: [GC 942140K(991232K), 0.5410483 secs]
1056.343: [Full GC 991232K->634157K(991232K), 3.9071443 secs]
1061.257: [GC 786274K(991232K), 0.3106603 secs]
1065.229: [Full GC 991232K->641617K(991232K), 3.9565638 secs]
1071.192: [GC 945999K(991232K), 0.5401515 secs]
1073.793: [Full GC 991231K->648045K(991232K), 3.9627814 secs]
1079.754: [GC 936641K(991232K), 0.5321197 secs]
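To quantify how much time goes into the Full GC entries, the log can be summarized with a short script. This is a sketch written against the entry shapes visible in the post, not a general GC log parser; the function name and the returned keys are hypothetical.

```python
import re

# Matches entries such as
#   1040.306: [Full GC 972462K->619643K(991232K), 3.7780597 secs]
# The ">" of the arrow is optional because some copies of the log lose it,
# and the "before" size is optional because some entries show one size only.
ENTRY = re.compile(
    r"(?P<ts>\d+\.\d+): \[(?P<kind>Full GC|GC) "
    r"(?:(?P<before>\d+)K->?)?(?P<after>\d+)K\((?P<heap>\d+)K\), "
    r"(?P<secs>\d+\.\d+) secs\]"
)


def summarize_full_gcs(log_text):
    """Count Full GC entries, sum their pauses, and report heap share retained."""
    fulls = [m for m in ENTRY.finditer(log_text) if m.group("kind") == "Full GC"]
    return {
        "count": len(fulls),
        "pause_secs": sum(float(m.group("secs")) for m in fulls),
        # Fraction of the total heap still occupied after each Full GC.
        "retained": [int(m.group("after")) / int(m.group("heap")) for m in fulls],
    }


if __name__ == "__main__":
    import sys
    print(summarize_full_gcs(sys.stdin.read()))
```

Run on the log in the post, this reports five Full GC entries totalling roughly 19.5 seconds of pause, each leaving well over 60% of the heap occupied, which suggests the collections reclaim little and the heap is close to full of live data.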

Read the article
How to benchmark on multi-core processors

- by Pascal Cuoq

I am looking for ways to perform micro-benchmarks on multi-core processors. Context: At about the same time desktop processors introduced out-of-order execution that made performance hard to predict, they, perhaps not coincidentally, also introduced special instructions to get very precise timings. Examples of these instructions are rdtsc on x86 and mftb on PowerPC. These instructions gave timings that were more precise than could ever be allowed by a system call, and allowed programmers to micro-benchmark their hearts out, for better or for worse. On a yet more modern processor with several cores, some of which sleep some of the time, the counters are not synchronized between cores. We are told that rdtsc is no longer safe to use for benchmarking, but I must have been dozing off when the alternative solutions were explained to us. Question: Some systems may save and restore the performance counter and provide an API call to read the proper sum. If you know what this call is for any operating system, please let us know in an answer. Some systems may allow turning off cores, leaving only one running. I know Mac OS X Leopard does when the right Preference Pane is installed from the Developer Tools. Do you think that this makes rdtsc safe to use again? More context: Please assume I know what I am doing when trying to do a micro-benchmark. If you are of the opinion that if an optimization's gains cannot be measured by timing the whole application, it's not worth optimizing, I agree with you, but I cannot time the whole application until the alternative data structure is finished, which will take a long time. In fact, if the micro-benchmark were not promising, I could decide to give up on the implementation now; I need figures to provide in a publication whose deadline I have no control over.
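One commonly used alternative to reading the counter directly is the operating system's monotonic clock (for example clock_gettime(CLOCK_MONOTONIC) on POSIX systems or QueryPerformanceCounter on Windows), combined with pinning the process to one core. The sketch below illustrates that idea in Python; the helper names are made up, and the pinning call is Linux-only, so it is treated as best effort.

```python
import os
import time


def pin_to_one_core(core=0):
    """Try to restrict this process to a single core; True if it worked.

    os.sched_setaffinity exists only on some platforms (Linux), so this
    is a best-effort step rather than a portable guarantee.
    """
    if hasattr(os, "sched_setaffinity"):
        try:
            os.sched_setaffinity(0, {core})
            return True
        except OSError:
            return False
    return False


def best_of(func, repeats=1000):
    """Minimum elapsed nanoseconds over several runs of func()."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()  # OS-provided monotonic clock
        func()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


if __name__ == "__main__":
    print("pinned:", pin_to_one_core())
    print("best ns:", best_of(lambda: sum(range(1000))))
```

Reporting the minimum rather than the mean is a design choice: interference from other processes, interrupts and frequency changes can only add time, so the smallest sample is usually the closest to the cost of the code itself.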

Read the article
.NET 4 ... Parallel.ForEach() question

- by CirrusFlyer

I understand that the new TPL (Task Parallel Library) has implemented Parallel.ForEach() such that it works with "expressed parallelism." Meaning, it does not guarantee that your delegates will run in multiple threads, but rather it checks to see if the host platform has multiple cores, and if true, only then does it distribute the work across the cores (essentially 1 thread per core). If the host system does not have multiple cores (getting harder and harder to find such a computer) then it will run your code sequentially like a "regular" foreach loop would. Pretty cool stuff, frankly. Normally I would do something like the following to place my long-running operation on a background thread from the ThreadPool: ThreadPool.QueueUserWorkItem( new WaitCallback(targetMethod), new Object2PassIn() ); In a situation where the host computer only has a single core, does the TPL's Parallel.ForEach() automatically place the invocation on a background thread? Or should I manually invoke any TPL calls from a background thread, so that if I am executing on a single-core computer at least that logic will be off the GUI's dispatching thread? My concern is that if I leave the TPL in charge of all this, I want to ensure that when it determines it's a single-core box it still marshals the code inside the Parallel.ForEach() loop onto a background thread like I would have done, so as not to block my GUI. Thanks for any thoughts or advice you may have ...
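As far as I know, Parallel.ForEach() returns only once the whole loop has finished, whatever the core count, so calling it from the GUI thread would block the GUI either way. The general pattern of wrapping a blocking parallel loop in a background thread can be sketched as follows. This is a Python analogue, not TPL code, and every name in it is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_for_each(items, action, max_workers=None):
    """Blocking parallel loop: returns only after every item is processed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Consuming the iterator waits for all results and re-raises errors.
        list(pool.map(action, items))


def run_in_background(items, action, on_done):
    """Start the blocking loop on a worker thread so the caller returns at once."""
    def body():
        parallel_for_each(items, action)
        on_done()  # e.g. post a "finished" message back to the UI thread

    worker = threading.Thread(target=body, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    done = threading.Event()
    run_in_background(range(5), print, done.set)
    done.wait()
```

The same shape applies in .NET: queue a work item (as in the ThreadPool call above) whose body runs the parallel loop, and marshal only the completion notification back to the dispatcher.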

Read the article
Recommendation for PHP-FPM pm.max_children, PHP-FPM pm.start_servers and others

- by jaypabs

I have the following server: Intel® Xeon® E3-1270 v2 Single Processor - Quad Core Dedicated Server; CPU Speed: 4 x 3.5 GHz w/ 8MB Smart Cache; Motherboard: SuperMicro X9SCM-F; Total Cores: 4 Cores + 8 Threads; RAM: 32 GB DDR3 1333 ECC; Hard Drive: 120GB; Smart Cache: 8MB. I am using Ubuntu 12.04 with nginx, PHP and MySQL, managed by ISPConfig 3. Under the ISPConfig 3 website settings I have these default values: PHP-FPM pm.max_children = 10; PHP-FPM pm.start_servers = 2; PHP-FPM pm.min_spare_servers = 1; PHP-FPM pm.max_spare_servers = 5; PHP-FPM pm.max_requests = 0. My question is: what are the recommended settings for the above variables? I have found others using different settings.
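A common rule of thumb is to size pm.max_children from the memory you are willing to give PHP divided by the average size of one PHP-FPM worker, and to keep the spare-server values consistent with it. The sketch below illustrates that arithmetic. The 8192 MB budget and 64 MB per worker are placeholder numbers, not measurements from this server, and the function names are hypothetical.

```python
def max_children_for(ram_budget_mb, avg_worker_mb):
    """Workers that fit in the memory budget (at least 1)."""
    return max(1, ram_budget_mb // avg_worker_mb)


def default_start_servers(min_spare, max_spare):
    """PHP-FPM's documented default: min_spare + (max_spare - min_spare) / 2."""
    return min_spare + (max_spare - min_spare) // 2


def is_consistent(max_children, start, min_spare, max_spare):
    """Check the ordering the pm.* values are expected to respect."""
    return 1 <= min_spare <= start <= max_spare <= max_children


if __name__ == "__main__":
    # Assumed inputs: 8 GB of the 32 GB reserved for PHP, 64 MB per worker.
    print("pm.max_children ~", max_children_for(8192, 64))
    print("default pm.start_servers for 1..5:", default_start_servers(1, 5))
    print("ISPConfig defaults consistent:", is_consistent(10, 2, 1, 5))
```

The per-worker figure is best measured on the running system (for example from the resident memory of the php-fpm processes under load), since it depends heavily on the application.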

Read the article
Command-line: What is this machine ?

- by lindenb

Hi all, is there a standard Linux command that prints a description of the server (model, number of cores(?), speed(?)...)? Thanks, Pierre
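On Linux, tools such as lscpu and dmidecode, or simply reading /proc/cpuinfo, are the usual starting points. As an illustration of what /proc/cpuinfo contains, the following sketch extracts the model name and core counts from its text; the function name and the returned keys are made up for this example.

```python
import os


def describe_cpus(cpuinfo_text):
    """Summarize /proc/cpuinfo text: model, logical CPUs, cores per package."""
    logical = 0
    model = None
    cores_per_package = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "model name" and model is None:
            model = value
        elif key == "cpu cores" and cores_per_package is None:
            cores_per_package = int(value)
    return {
        "model": model,
        "logical_cpus": logical,
        "cores_per_package": cores_per_package,
    }


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):  # only present on Linux
        with open(path) as f:
            print(describe_cpus(f.read()))
    else:
        print("logical CPUs:", os.cpu_count())
```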

Read the article
Which factors determine the speed of a processor? [closed]

- by Deb

I think that the clock rate of a processor determines the speed of a core; in my case it is 1.86 GHz. But if I am not wrong, it also determines how much energy the processor consumes: the higher the frequency, the more power it uses. I chose the Power Saver scheme to increase my battery life; however, it reduces my core speed to half of the actual speed. I understand this happens because of SpeedStep, but I don't see any slowdown of my computer. So my question is: why do we have such high-frequency cores if they use so much power? We could use low-frequency cores instead. I actually get confused between the two terms, the speed of a processor and its frequency. So how important is the frequency of a core for a processor?
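The link between frequency and power is often approximated by the dynamic power relation P ≈ a·C·V²·f (activity factor, switched capacitance, supply voltage, clock frequency). The sketch below applies that approximation to the halved clock described in the question. The 0.8 voltage ratio is an assumed example value, not a figure for this processor, and static (leakage) power is ignored.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio=1.0):
    """Dynamic power relative to the baseline, using P ~ C * V^2 * f."""
    return freq_ratio * voltage_ratio ** 2


if __name__ == "__main__":
    # Half the clock at unchanged voltage: dynamic power roughly halves.
    print(relative_dynamic_power(0.5))
    # Half the clock plus an assumed 20% voltage drop: the saving is larger.
    print(relative_dynamic_power(0.5, 0.8))
```

This is why frequency scaling features usually lower the voltage together with the clock, and why a machine that is mostly idle feels no slower: the clock only limits how fast work completes when there is work to do.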

Read the article
Intel Core i5-2467m - Turbo Boost not activating?

- by Trevor Sullivan

I have a Samsung Series 5 laptop with an Intel Core i5-2467m processor @ 1.6 GHz. The processor supports Intel Turbo Boost up to 2.30 GHz according to the specifications. The i5-2467m is a dual-core processor with HyperThreading, so there is a total of four (4) virtual cores in Windows 7 SP1. http://ark.intel.com/products/56858/ I've installed the Intel Turbo Boost Technology Monitor v2.6 to monitor whether Turbo Boost is enabled, and set it to "Always On Top." I followed this process to max out the CPU: (1) open four PowerShell instances; (2) set each instance's affinity to a distinct CPU vCore; (3) run this code in each instance: while (1 -eq 1) { } Unfortunately, after maxing out all 4 cores, my laptop got hot, but Turbo Boost never kicked in. Any ideas on how to ensure that I'm getting the 2.3 GHz Turbo Boost capability of my laptop?

Read the article
Improving SAS multipath to JBOD performance on Linux

- by user36825

Hello all I'm trying to optimize a storage setup on some Sun hardware with Linux. Any thoughts would be greatly appreciated. We have the following hardware: Sun Blade X6270 2* LSISAS1068E SAS controllers 2* Sun J4400 JBODs with 1 TB disks (24 disks per JBOD) Fedora Core 12 2.6.33 release kernel from FC13 (also tried with latest 2.6.31 kernel from FC12, same results) Here's the datasheet for the SAS hardware: http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf It's using PCI Express 1.0a, 8x lanes. With a bandwidth of 250 MB/sec per lane, we should be able to do 2000 MB/sec per SAS controller. Each controller can do 3 Gb/sec per port and has two 4 port PHYs. We connect both PHYs from a controller to a JBOD. So between the JBOD and the controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth. With write caching enabled and when doing big writes, each disk can sustain about 80 MB/sec (near the start of the disk). With 24 disks, that means we should be able to do 1920 MB/sec per JBOD. multipath { rr_min_io 100 uid 0 path_grouping_policy multibus failback manual path_selector "round-robin 0" rr_weight priorities alias somealias no_path_retry queue mode 0644 gid 0 wwid somewwid } I tried values of 50, 100, 1000 for rr_min_io, but it doesn't seem to make much difference. Along with varying rr_min_io I tried adding some delay between starting the dd's to prevent all of them writing over the same PHY at the same time, but this didn't make any difference, so I think the I/O's are getting properly spread out. According to /proc/interrupts, the SAS controllers are using a "IR-IO-APIC-fasteoi" interrupt scheme. For some reason only core #0 in the machine is handling these interrupts. 
I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller: echo 2 > /proc/irq/24/smp_affinity and echo 4 > /proc/irq/26/smp_affinity. Using dd to write to the disk generates "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off this core too. I run 48 dd's (one for each disk), assigning them to cores not dealing with interrupts like so: taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M. The oflag=direct option prevents any kind of buffer cache from getting involved. None of my cores seem maxed out. The cores dealing with interrupts are mostly idle and all the other cores are waiting on I/O as one would expect.
Cpu0 : 0.0%us, 1.0%sy, 0.0%ni, 91.2%id, 7.5%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu1 : 0.0%us, 0.8%sy, 0.0%ni, 93.0%id, 0.2%wa, 0.0%hi, 6.0%si, 0.0%st
Cpu2 : 0.0%us, 0.6%sy, 0.0%ni, 94.4%id, 0.1%wa, 0.0%hi, 4.8%si, 0.0%st
Cpu3 : 0.0%us, 7.5%sy, 0.0%ni, 36.3%id, 56.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 1.3%sy, 0.0%ni, 85.7%id, 4.9%wa, 0.0%hi, 8.1%si, 0.0%st
Cpu5 : 0.1%us, 5.5%sy, 0.0%ni, 36.2%id, 58.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 5.0%sy, 0.0%ni, 36.3%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 5.1%sy, 0.0%ni, 36.3%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.1%us, 8.3%sy, 0.0%ni, 27.2%id, 64.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.1%us, 7.9%sy, 0.0%ni, 36.2%id, 55.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 7.8%sy, 0.0%ni, 36.2%id, 56.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 7.3%sy, 0.0%ni, 36.3%id, 56.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 5.6%sy, 0.0%ni, 33.1%id, 61.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.1%us, 5.3%sy, 0.0%ni, 36.1%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 4.9%sy, 0.0%ni, 36.4%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.1%us, 5.4%sy, 0.0%ni, 36.5%id, 58.1%wa, 0.0%hi, 0.0%si, 0.0%st
Given all this, the throughput reported by running "dstat 10" is in the range of 2200-2300 MB/sec.
Given the math above I would expect something in the range of 2*1920 ~= 3600+ MB/sec. Does anybody have any idea where my missing bandwidth went? Thanks!
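The bandwidth estimate in the question can be laid out as a chain of ceilings, where the slowest link bounds each JBOD. The sketch below uses only the figures quoted in the post, plus one assumption that is labeled in the code: 3 Gb/s SAS links use 8b/10b encoding, so 10 line bits carry one data byte. The function names are hypothetical.

```python
def pcie_mb_s(lanes, mb_s_per_lane=250):
    """PCI Express 1.0a payload bandwidth for a given lane count."""
    return lanes * mb_s_per_lane


def sas_mb_s(phys, ports_per_phy, gbit_per_port):
    """SAS payload bandwidth, assuming 8b/10b encoding (10 line bits per byte)."""
    line_mbit = phys * ports_per_phy * gbit_per_port * 1000
    return line_mbit // 10


def disks_mb_s(disks, mb_s_per_disk):
    """Aggregate sequential write rate of all disks in one JBOD."""
    return disks * mb_s_per_disk


def jbod_ceiling_mb_s():
    """Slowest link between one controller and its JBOD."""
    return min(pcie_mb_s(8), sas_mb_s(2, 4, 3), disks_mb_s(24, 80))


if __name__ == "__main__":
    per_jbod = jbod_ceiling_mb_s()
    print("per JBOD:", per_jbod, "MB/s; both JBODs:", 2 * per_jbod, "MB/s")
```

With these inputs the disks (1920 MB/s) sit just under the PCI Express figure (2000 MB/s), so any protocol overhead on the bus would cut into the expected total, and the measured 2200-2300 MB/sec should be compared against a ceiling somewhat below 2 x 1920.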

Read the article
SQL Performance Problem IA64

- by Vendoran

We’ve got a performance problem in production. QA and DEV environments are 2 instances on the same physical server: Windows 2003 Enterprise SP2, 32 GB RAM, 1 Quad 3.5 GHz Intel Xeon X5270 (4 cores x64), SQL 2005 SP3 (9.0.4262), SAN Drives. Prod: Windows 2003 Datacenter SP2, 64 GB RAM, 4 Dual Core 1.6 GHz Intel Family 80000002, Model 6 Itanium (8 cores IA64), SQL 2005 SP3 (9.0.4262), SAN Drives, Veritas Cluster. I am seeing excessive Signal Wait Percentages ( 250%), and Page Reads /s (50) and Page Writes /s (25) are both high occasionally. I tested this query on both QA and PROD and it has the same execution plan and even the same stats: SELECT top 40000000 * INTO dbo.tmp_tbl FROM dbo.tbl GO Scan count 1, logical reads 429564, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. As you can see it’s just logical reads; however, the elapsed times are QA: 0:48, Prod: 2:18. So it seems like a processor-related issue, but I’m not sure where to go next. Any ideas? Thanks, Aaron

Read the article
Performance problem with Win Server 2008 at vbox on Ubuntu 9.10 Server

- by Diskilla

Hey, I've got an Ubuntu 9.10 server and tried to install Windows Server 2008 in a virtual machine. The problem is that the server has 8 cores, but the virtual machine seems to have problems with that. The VM is very slow and not really usable if I set the VM config to more than one core. If it is set to only one core, everything works fine, but that's not a solution because I bought the server with several cores for a reason... Does anybody know this problem or have a solution? I feel like I have read the entire internet via Google... no solution found. Greetz, Diskilla P.S.: I'm from Germany and it's kind of hard to ask in English :-) I hope I didn't make too many mistakes.

Read the article
