Search Results

Search found 51 results on 3 pages for 'hyperthreading'.

Page 2/3 | < Previous Page | 1 2 3  | Next Page >

  • Context switches much slower in new linux kernels

    - by Michael Goldshteyn
    We are looking to upgrade the OS on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, it seems that the latency to run a thread that has become runnable has significantly increased from the 2.6 kernel to the 3.2 kernel. In fact, the latency numbers we are getting are hard to believe. Let me be more specific about the test. We have a program that has two threads. The first thread gets the current time (in ticks, using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks, using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this, the second thread waits on the condition variable once more. So, we get a thread-to-thread signaling latency measurement once a second as a result. In Linux 2.6.32, this latency is somewhere on the order of 2.8-3.5 us, which is reasonable. In Linux 3.2.0, this latency is somewhere on the order of 40-100 us. I have excluded any differences in hardware between the two hosts. They run on identical hardware (dual-socket X5687 {Westmere-EP} processors running at 3.6 GHz with hyperthreading, SpeedStep and all C-states turned off). We are setting the affinity to run both threads on physical cores of the same socket (i.e., the first thread runs on Core 0 and the second thread runs on Core 1), so there is no bouncing of threads between cores or bouncing/communication between sockets. The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context switch box). Have there been any changes in the kernel that could account for this dramatic slowdown in how long it takes for a thread to be scheduled to run?
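
    For reference, here is a minimal sketch of the kind of measurement described - assuming pthreads and the compiler's __rdtsc() intrinsic, and leaving out the core pinning and the tick-to-microsecond conversion. It is illustrative only, not the poster's actual program:

        /* Minimal sketch of the latency test described above: thread A stamps
         * the TSC and signals a condition variable once a second; thread B
         * waits on it, stamps the TSC again on wake-up and prints the delta
         * in raw ticks. Illustrative only (no core pinning, no tick->us). */
        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <x86intrin.h>          /* __rdtsc() */

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
        static unsigned long long t_signal;
        static int ready;

        static void *signaller(void *arg)
        {
            (void)arg;
            for (;;) {
                sleep(1);
                pthread_mutex_lock(&lock);
                t_signal = __rdtsc();            /* stamp just before signalling */
                ready = 1;
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        static void *waiter(void *arg)
        {
            (void)arg;
            for (;;) {
                pthread_mutex_lock(&lock);
                while (!ready)
                    pthread_cond_wait(&cond, &lock);
                ready = 0;
                unsigned long long delta = __rdtsc() - t_signal;
                pthread_mutex_unlock(&lock);
                printf("wakeup latency: %llu ticks\n", delta);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;
            pthread_create(&a, NULL, signaller, NULL);
            pthread_create(&b, NULL, waiter, NULL);
            pthread_join(a, NULL);               /* runs until killed */
            return 0;
        }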

    Read the article

  • Occasional InterruptedException when quitting a Swing application

    - by Joonas Pulakka
    I recently updated my computer to a more powerful one, with a quad-core hyperthreading processor (i7), thus plenty of real concurrency available. Now I'm occasionally getting the following error when quitting (System.exit(0)) an application (with a Swing GUI) that I'm developing:

        Exception while removing reference: java.lang.InterruptedException
        java.lang.InterruptedException
            at java.lang.Object.wait(Native Method)
            at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
            at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
            at sun.java2d.Disposer.run(Disposer.java:125)
            at java.lang.Thread.run(Thread.java:619)

    Well, given that it started to happen with more concurrency-capable hardware, and it has to do with threads, and it happens occasionally, it's obviously some kind of timing thing. But the problem is that the stack trace is so short. All I have is the listing above. It doesn't include my own code at all, so it's somewhat hard to guess where the bug is. Has anyone experienced something like this before? Any ideas how to start solving it?

    Read the article

  • Video stutter when using external drive

    - by psion
    When using Boxee to play video files off of an external Western Digital 1TB drive formatted NTFS, I notice a slight stutter in the video every 5-10 seconds. When using mplayer, it doesn't stutter as often, but it still stutters occasionally. If I play the video off of the local SATA drive, it plays fine, even in Boxee. I use this computer as my HTPC and I just switched from Windows to Linux on it. In Windows, I never had any sort of stutter playing movies from the drive. I am using the latest Intel graphics drivers (for the Intel GMA 950).

        root@eee-htpc:/home/htpc# grep wd /etc/mtab
        /dev/sdb1 /mnt/wd2 fuseblk rw,nosuid,nodev,allow_other,blksize=512 0 0

    I notice that despite trying to use ntfs or ntfs-3g, Ubuntu uses ntfs-fuse, which I've heard is slower.

        /dev/sdb1: Timing buffered disk reads: 80 MB in 3.07 seconds = 26.08 MB/sec
        root@eee-htpc:/mnt/wd2# dd if=/dev/zero of=./120mb bs=1024 count=120000
        root@eee-htpc:/mnt/wd2# time mv ./120mb /home/htpc
        real  0m2.095s
        user  0m0.016s
        sys   0m0.736s

    Even though FUSE has a reputation for being slow, it should easily be fast enough for playing standard-definition video files. So why the video stutter?

    Edit: The issue seems to be CPU overhead from either playing off of a USB device or from ntfs/fuse. Watching CPU usage with top, local files use 10-40% CPU. Watching the same video on the external NTFS-formatted drive, it spikes to 170% (over 100% because of hyperthreading). To me it seems like it must be overhead from the FUSE driver, though I don't know if it has more or less overhead than ntfs-3g. It's an EEEBox B202 with an Atom 270, so not exactly the most powerful hardware out there.

    Edit 2: I believe the solution would be to use non-FUSE drivers or different FUSE drivers; so far I have not been able to.

    Edit 3: I've probably edited this more times than I should, but as an update I have upgraded the NTFS drivers to "ntfs-3g 2010.8.8 external FUSE 28 - Third Generation NTFS Driver" using the following PPA: ppa:x3lectric/team-iquik-releases. When first opening a video file in Boxee that's on NTFS there's still the same amount of lag. After a few minutes of video, the lag seems to go away and the CPU usage comes down to 10-40%. Every so often, though, it begins to stutter again. Also, if I skip ahead/back in the file, it begins to stutter a lot.

    Read the article

  • Why would more CPU cores on virtual machine slow compile times?

    - by Sid
    [edit#2] If anyone from VMware can hit me up with a copy of VMware Fusion, I'd be more than happy to do the same as a VirtualBox vs VMware comparison. Somehow I suspect the VMware hypervisor will be better tuned for hyperthreading (see my answer too). I'm seeing something curious. As I increase the number of cores on my Windows 7 x64 virtual machine, the overall compile time increases instead of decreasing. Compiling is usually very well suited to parallel processing, as in the middle part (post dependency mapping) you can simply call a compiler instance on each of your .c/.cpp/.cs/whatever files to build partial objects for the linker to take over. So I would have imagined that compiling would actually scale very well with the number of cores. But what I'm seeing is:

        8 cores: 1.89 sec
        4 cores: 1.33 sec
        2 cores: 1.24 sec
        1 core:  1.15 sec

    Is this simply a design artifact due to a particular vendor's hypervisor implementation (type 2: VirtualBox in my case), or something more pervasive across VMs that keeps hypervisor implementations simpler? With so many factors, I seem to be able to make arguments both for and against this behavior - so if someone knows more about this than me, I'd be curious to read your answer. Thanks, Sid

    [edit: addressing comments]
    @MartinBeckett: Cold compiles were discarded.
    @MonsterTruck: Couldn't find an open-source project to compile directly. Would be great, but I can't screw up my dev environment right now.
    @Mr Lister, @philosodad: I have 8 hardware threads and am using VirtualBox, so it should be a 1:1 mapping without emulation.
    @Thorbjorn: I have 6.5GB for the VM and a smallish VS2012 project - it's quite unlikely that I'm swapping in/out and thrashing the page file.
    @All: If someone can point to an open-source VS2010/VS2012 project, that might be a better community reference than my (proprietary) VS2012 project. Orchard and DNN seem to need environment tweaking to compile in VS2012. I really would like to see if someone with VMware Fusion also sees this (for VMware vs VirtualBox compartmentalization).

    Test details:
    Hardware: MacBook Pro Retina
    CPU: Core i7 @ 2.3GHz (quad core, hyperthreaded = 8 cores in Windows Task Manager)
    Memory: 16 GB
    Disk: 256GB SSD
    Host OS: Mac OS X 10.8
    VM type: VirtualBox 4.1.18 (type 2 hypervisor)
    Guest OS: Windows 7 x64 SP1
    Compiler: VS2012 compiling a solution with 3 C# Azure projects
    Compile times measured by a VS2012 plugin called 'VSCommands'
    All tests run 5 times, first 2 runs discarded, last 3 averaged

    Read the article

  • ESXi Server with 12 physical cores maxed out with only 8 cores assigned in virtual machines

    - by Sam
    I have an ESXi 5 server running on a 2-processor, 12-core system with hyperthreading enabled. So: 12 physical cores, 24 logical ones. On this server are 4 Windows 7 VMs, each configured for 2 processors, each running VMware Tools. Looking at my stats in vSphere, my "core utilization" is constantly maxed out. Yes, these machines are working hard, but only 8 cores have been allocated. How is this possible? Should I look into reducing the processor count per machine as in this post: VMware ESX server? I checked to ensure that hardware virtualization is enabled in the BIOS of the machine (a DELL R410). I've also started reading up on configuration, but being a newbie there's a lot of material to catch up on. It also seems I should only bother with advanced settings and pools if I'm really pushing the load, and I don't think that I should be pushing it with so few VMs. I suspect that I have some basic, incorrect configuration setting, but it's also possible that I have some giant misconceptions about virtualization. Any pointers? EDIT: Given the responses I've gotten so far, it seems that this is a measurement problem and not a configuration problem, making this less critical. Perhaps the real question is: How does the core utilization of the server reach a higher percentage than all individual cores' core utilization, and given that this possibility makes the metric useless for overall server load, what is the best global metric for measuring CPU load on hyper-threaded systems?

    Read the article

  • Does my machine configuration make sense?

    - by user1227914
    I couldn't think of a better place to ask this question, so here it goes. We're putting together a dedicated server for a website that will initially host the web server and the MySQL database. As the website grows, we'll move the database to a different server, and this machine will eventually only serve the actual website. So the question is: does my configuration look okay? It's the first time I'm building a server from scratch, so I want to make sure I don't combine components that don't fit or something - things like: do the drives I picked work for the hot swap, etc. What do you guys think? Am I good to go with this configuration? :)

    Chassis: Supermicro SuperServer 6016T-MTHF (6x DDR3 SDRAM - ECC DIMM 240-pin, 2x LGA1366 Socket, Power Provided: 600 Watt, 4 (free) x hot-swap - 3.5")
    CPU: Intel BX80614E5620 Xeon E5620 Processor - 4 Core, 2.40GHz, LGA 1366, 5.86GT/s QPI, 12MB Cache, 64-Bit, 80W, HyperThreading
    Memory: Crucial CT51272BB1339 4GB PC10600 DDR3 Memory - 1333MHz, ECC, Registered, 1x4096MB (possibly 3 or 4 of them)
    Hard Drives: Western Digital WD2002FAEX Caviar Black Hard Drive - 2TB, 3.5", SATA 6Gbps, 7200 RPM, 64MB (possibly 2 or 3)

    Thank you very much for any professional advice :)

    Read the article

  • Nice level not working on linux

    - by xioxox
    I have some highly floating point intensive processes doing very little I/O. One is called "xspec", which calculates a numerical model and returns a floating point result back to a master process every second (via stdout). It is niced at the 19 level. I have another simple process "cpufloattest" which just does numerical computations in a tight loop. It is not niced. I have a 4-core i7 system with hyperthreading disabled. I have started 4 of each type of process. Why is the Linux scheduler (Linux 3.4.2) not properly limiting the CPU time taken up by the niced processes?

        Cpu(s): 56.2%us, 1.0%sy, 41.8%ni, 0.0%id, 0.0%wa, 0.9%hi, 0.1%si, 0.0%st
        Mem:  12297620k total, 12147472k used,   150148k free,  831564k buffers
        Swap:  2104508k total,    71172k used,  2033336k free, 4753956k cached

          PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
        32399 jss   20   0 44728  32m  772 R 62.7  0.3  4:17.93 cpufloattest
        32400 jss   20   0 44728  32m  744 R 53.1  0.3  4:14.17 cpufloattest
        32402 jss   20   0 44728  32m  744 R 51.1  0.3  4:14.09 cpufloattest
        32398 jss   20   0 44728  32m  744 R 48.8  0.3  4:15.44 cpufloattest
         3989 jss   39  19 1725m 690m 7744 R 44.1  5.8  1459:59 xspec
         3981 jss   39  19 1725m 689m 7744 R 42.1  5.7  1459:34 xspec
         3985 jss   39  19 1725m 689m 7744 R 42.1  5.7  1460:51 xspec
         3993 jss   39  19 1725m 691m 7744 R 38.8  5.8  1458:24 xspec

    The scheduler does what I expect if I start 8 of the cpufloattest processes, with 4 of them niced (i.e. 4 with most of the CPU, and 4 with very little).
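
    For anyone wanting to reproduce this kind of setup, here is a hypothetical stand-in for the "cpufloattest" workload - a tight floating-point loop that can optionally renice itself to 19, the way the niced xspec processes are run. The program name and structure are assumptions, not the poster's actual code (compile with gcc -O2 floattest.c -lm):

        /* Hypothetical stand-in for the "cpufloattest" workload above: a tight
         * floating-point loop. Run with any argument to renice the process to
         * 19, mimicking how the niced xspec processes are started. */
        #include <math.h>
        #include <stdio.h>
        #include <sys/resource.h>

        int main(int argc, char **argv)
        {
            (void)argv;
            if (argc > 1 && setpriority(PRIO_PROCESS, 0, 19) != 0)
                perror("setpriority");

            volatile double x = 1.0;
            for (;;)                             /* burn CPU until killed */
                x = sin(x) + cos(x);
        }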

    Read the article

  • Very uneven CPU utilization with SQL Server 2012 on 2 processor computer with 16 cores / processor

    - by cooplarsh
    After installing SQL Server Enterprise 2012 with the Server + CAL license model on a computer with 2 processors, each with 16 cores (and no hyperthreading involved), and putting the server under extremely heavy load, the 16 cores on the first processor were very underutilized, the first 4 cores on the 2nd CPU were heavily utilized, and the last 12 cores were not used at all (because of the 20-core limit for this SQL Server version). Total CPU utilization was displaying as around 25%. Unfortunately, the server suffered from extremely poor performance, even though if the tasks had been evenly distributed across the 20 cores it wouldn't have been anywhere near as bad. The Windows Server was running on a VMware virtual image under ESX Server, but all of the CPU was allocated to the Windows server. We tried changing affinity settings (e.g., allocating most cores to CPU and the others to I/O), but that didn't help solve the performance problems. Upgrading the product edition to SQL Server Enterprise Core 2012 not only allowed SQL Server to utilize the 12 previously unused cores on the 2nd processor, but it also resulted in a much more even distribution of tasks across all of the processors. To get through the backlog of requests, CPU utilization jumped to around 90%, and then came down to around 33% once it was caught up; performance improved dramatically once we failed over to the newly upgraded version, and the performance issues went away. I was wondering if anyone knows what might cause SQL Server to unevenly distribute the load, relying almost exclusively on the first 4 cores of the 2nd processor (which had 12 cores idle), and allocate only a few tasks to each of the 16 cores on the first processor. Also, is there any way we could have more evenly distributed the load across the 20 cores that were being used without the product edition upgrade? The flip side of that question is: what did the product upgrade do that caused SQL Server to start evenly distributing the load across all of the cores that it recognized? Thanks for any insight and/or links that might help me better understand how to make sense of what was happening.

    Read the article

  • Is this processor burned?

    - by Jhonnytunes
    I've recently exchanged the processors in two PCChips boards. Both boards are LGA775. Board A is a P17G (Pentium 4 HyperThreading 3GHz) and board B is a P49G (Pentium Dual Core 3GHz). I use board A to watch videos, and some of them are 3GB in size, which is why I exchanged the CPUs. I installed the Dual Core in board A and it worked out of the box; now 3GB videos use 5% of the CPU instead of 50%. When I installed the Pentium in board B, I forgot to connect the 4-pin power and, when I powered on the PC, the CPU fan stayed off. Then I connected everything correctly this time, and now the board doesn't show video. I think the CPU is not working, but I'm not sure about that. The PC turns on and the HD spins, the CPU fan spins, the network socket is blinking, but there is no video and the case power LED isn't blinking either. I tried with another PSU and everything was the same. I noticed that the CPU has that paste on top of it. I really don't know what's happening; I hope I don't have to buy another CPU. Is it burned?

    Read the article

  • What are the biggest, best CPUs that support Physical Address Extension?

    - by Giffyguy
    I'm looking for a CPU that will support PAE and fit into an LGA775 socket. This combination of technology is very much preferred for my current server hardware/software setup. My priorities, in order of highest to lowest:

    PAE & LGA775
    At least 1066MHz FSB
    Largest CPU cache possible
    Multiple cores if possible
    HyperThreading if possible

    Most other factors are of little-to-no consequence. I'm finding it very difficult to figure out what my options are. Intel doesn't have much useful information on PAE (since x64 is so dominant), and Wikipedia simply says that "PAE is provided by Intel Pentium Pro (and above) CPUs - including all later Pentium-series processors except the 400 MHz bus versions of the Pentium M." All of Intel's listed Pentium CPUs support Intel64, which makes me seriously doubt they will support PAE with a 32-bit OS. And Wikipedia's claim is so vague, I have no idea if it means up-to-and-including the x64 Prescott CPUs. PAE is supposed to be an aspect of the x86 architecture, and I believe it is no longer supported in an x64 environment. Please correct me if I am wrong.

    Read the article

  • How to find out what is causing a slow down of the application on this server?

    - by Jan P.
    This is not the typical Server Fault question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.

    Situation: We have this web application that uses Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching. The application has a very unique usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to do and sends this information to an (external) notification server, which pushes these notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.

    Problem: In the last few weeks, usage of the application really started to grow. In the last few days we encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests, it just gets slower and slower and often takes 20 minutes to recover - until the same thing starts again at the full hour. We have extensive monitoring in place (New Relic, collectd) but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in: can you help me figure out what's wrong and maybe how to fix it?

    Additional information: The server is a 16-core Intel Xeon (8 cores with hyperthreading, I think) with 12GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is version 5.3.2-1ubuntu4.11. If any configuration information would help analyze the problem, just comment and I will add it.

    Graphs: info (phpinfo(), APC status, memcache status), collectd (processes, CPU, Apache, load, MySQL, vmem, disk), New Relic (application performance, server overview, processes, network, disks). (Sorry the graphs are GIFs and not the same time period, but I think the most important info is in there.)

    Read the article

  • SQL Server 2005 SE SP3 on Windows Server 2008 R2 x64 premature query disconnections

    - by southernpost
    New Dell PowerEdge R910, 4x8 Intel X7560, 192GB RAM, hardware NUMA, local RAID, Broadcom NetExtreme II multiport NIC, unteamed, TCP Offload disabled, RSS disabled, NetDMA disabled, Hyperthreading disabled. SQL Server 2005 SE x64 SP3 on Windows Server 2008 R2 EE x64. No other apps on the server. Max Mem = 180GB, Max DOP = 4. Existing Windows Server 2003 R2 EE x64 app server connecting to the Dell via firewall using SQL-authenticated logins.

    Symptoms: Intermittent errors at the app server: "A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)"

    Findings: Queries run from SSMS on another machine within the same domain as the SQL Server complete without error. SQLIO showed good performance. Windows and SQL logs show no related messages. Microsoft reviewed a PssDiag trace and stated that "We are not seeing timeouts from SQL Side. The queries being run against the database are timing out within 9secs. This is a database connectivity error." and "we can also see from the AttnSeq column that we are also not seeing any Attentions from the SQL Side." Dell has confirmed that we are using the latest Broadcom drivers.

    Read the article

  • 3dmark score abnormal

    - by Sean
    I just bought a new laptop. It is a Core i7 3610QM (Ivy Bridge, 4 cores plus hyperthreading), 8GB RAM, Nvidia GT 610M 2GB. The OS is Win7 64-bit. I ran 3DMark 11, but the score is frustrating. For 1280*720, the score is P730. That is too low, right? I searched online, and the score should be at least above 1000. Am I right? I have never used this software before. I know Nvidia has Optimus, so I put the laptop in the high-performance state and white-listed the 3DMark program, but it didn't help. I am guessing 3DMark is using the i7's graphics module and cannot switch over to the Nvidia card. In the running details of 3DMark, the graphics card cannot be identified (the row remains blank). Can anybody tell me if this is the normal case? If not, can I use some other software to test whether my Nvidia card is working fine? If this Nvidia card cannot work, I will return the new laptop ASAP. Thanks.

    Read the article

  • How to fix Windows 7 device removal notification loop

    - by Barry Kelly
    Bit of an odd one, this. One of our PCs is getting caught in a loop some time after being turned on, usually after a USB storage device has been attached - sometimes an iPod, sometimes a GPS. Specifically, Windows Explorer starts showing a drive icon and letter (E:, as of right now) for the System partition (the small hidden one at the start of the boot drive). Then the icon disappears. Then it reappears again. And disappears. It does this very quickly, at what looks like maybe 50 times a second. CPU usage in this loop is also very high; it averages about 66%. This machine has an i7 920 CPU, which is quad-core with hyperthreading, so this usage rate works out to about five 100%-busy threads, along with whatever the normal idle load is (particularly Task Manager itself). Inspecting with Process Explorer shows that the device removal notification infrastructure has gone berserk. The threads in system service processes (i.e. apart from Windows Explorer) which are using all the CPU power relate to device notification. The Disk Management MMC snap-in also fails to run when the loop starts. The only way to break the loop, it seems, is to reboot the machine. Anyone seen anything similar to this, and know of a way to fix it?

    Machine details:
    Windows 7 x64, fully patched
    i7 920, 12GB RAM
    Intel SSD 80GB (X25-M, I believe; not G2)
    2TB 5.2K disk for bulk storage
    AMD HD 5870

    Further hardware details await. I'm going to go through and update all the drivers I can find.

    Read the article

  • Linux 'top' utility wildly inaccurate (more so for multi-CPU/core hardware)?

    - by amn
    Hi all. After using 'top' for a long time, albeit basically, I have grown to distrust its '% CPU' column reports. I have 8-core (quad-core Intel i7 920 with hyperthreading) hardware, and see some wild numbers when running a process that should not use more than 5% overall. top happily reports 50%, and I suspect it is not so. My question is: is it a known fact that it's inaccurate when several CPUs/cores are present? I used 'mpstat' from the 'sysstat' package, and its numbers are much more conservative, certainly within my expectations. I did press '1' for 'top' to switch it to show all the cores and the us/sy/io stats, but the numbers are substantially higher than with 'mpstat'... I know that my expectations can be unfounded as well, but my gut feeling tells me 'top' is wrong! :-) The reason I need to know is because the process I am monitoring only guarantees quality of service with CPU usage "less than 80%" (however vague that sounds), and I need to know how much headroom I have left. It's a streaming server.
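
    As a point of comparison, both top and mpstat derive their figures from /proc/stat; a rough sketch of computing the overall (all-core average) CPU percentage yourself looks like the following. It assumes the standard /proc/stat field layout and is only an illustration:

        /* Rough sketch: compute overall CPU utilisation the way top/mpstat do,
         * by sampling the aggregate "cpu" line of /proc/stat twice and
         * comparing idle time to total time across all cores. */
        #include <stdio.h>
        #include <unistd.h>

        static int read_cpu(unsigned long long *idle, unsigned long long *total)
        {
            /* cpu user nice system idle iowait irq softirq steal guest guest_nice */
            unsigned long long v[10] = {0};
            FILE *f = fopen("/proc/stat", "r");
            if (!f)
                return -1;
            int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                           &v[0], &v[1], &v[2], &v[3], &v[4],
                           &v[5], &v[6], &v[7], &v[8], &v[9]);
            fclose(f);
            if (n < 4)
                return -1;
            *idle = v[3] + v[4];                 /* idle + iowait */
            *total = 0;
            for (int i = 0; i < 10; i++)
                *total += v[i];
            return 0;
        }

        int main(void)
        {
            unsigned long long i0, t0, i1, t1;
            if (read_cpu(&i0, &t0)) return 1;
            sleep(1);
            if (read_cpu(&i1, &t1)) return 1;
            double busy = 100.0 * (1.0 - (double)(i1 - i0) / (double)(t1 - t0));
            printf("overall CPU usage: %.1f%% (averaged over all cores)\n", busy);
            return 0;
        }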

    Read the article

  • Which Computer Organization & Architecture book is good for me?

    - by claws
    I'm always interested in learning the inner workings of things. I started with C programming, then learnt operating systems (from Stallings), then linkers & loaders, and then assembly language; after reading these, now I want to go into a little more depth: computer architecture. I feel that makes everything clear. As per the SO archives, these are the two good books:

    Computer Architecture: A Quantitative Approach, 4th Edition
    Computer Organization and Design, Fourth Edition ~ David A. Patterson, John L. Hennessy

    But I've browsed through the contents of these books and found that they don't exactly meet my needs. I want to learn more about caches, the Memory Management Unit, and the mapping between virtual memory & physical memory. I'm in no way interested in other ISAs like MIPS etc. I'm an IA32 and x86-64 fan and I want to stick to it. I'm not a hardware developer, so I don't want details like circuit diagrams or how the L1, L2 & L3 caches are implemented. I want to know about parallel processing technologies like HyperThreading at the architecture level, but again I don't want to design them. I liked the table of contents of Computer Architecture: A Quantitative Approach, 4th Edition, but Quantitative Approach? Seriously?? I want to know the details of current technologies, and I don't want to spend time reading 200 pages of outdated old technologies (I experienced this while learning ASM).

    Read the article

  • Help with Neuroph neural network

    - by user359708
    For my graduate research I am creating a neural network that trains to recognize images. I am going much more complex than just taking a grid of RGB values, downsampling, and sending them to the input of the network, like many examples do. I actually use over 100 independently trained neural networks that detect features, such as lines, shading patterns, etc. Much more like the human eye, and it works really well so far! The problem is I have quite a bit of training data. I show it over 100 examples of what a car looks like. Then 100 examples of what a person looks like. Then over 100 of what a dog looks like, etc. This is quite a bit of training data! Currently it takes about one week to train the network. This is kind of killing my progress, as I need to adjust and retrain. I am using Neuroph as the low-level neural network API. I am running a dual quad-core machine (16 cores with hyperthreading), so this should be fast, yet my processor usage is at only 5%. Are there any tricks to improve Neuroph performance? Or Java performance in general? Suggestions? I am a cognitive psych doctoral student, and I am decent as a programmer, but do not know a great deal about performance programming.

    Read the article

  • Where are possible locations of queueing/buffering delays in Linux multicast?

    - by Matt
    We make heavy use of multicast messaging across many Linux servers on a LAN. We are seeing a lot of delays. We basically send an enormous number of small packets. We are more concerned with latency than throughput. The machines are all modern, multi-core (at least four, generally eight, 16 if you count hyperthreading) machines, always with a load of 2.0 or less, usually with a load less than 1.0. The networking hardware is also under 50% capacity. The delays we see look like queueing delays: the packets will quickly start increasing in latency, until it looks like they jam up, then return back to normal. The messaging structure is basically this: in the "sending thread", pull messages from a queue, add a timestamp (using gettimeofday()), then call send(). The receiving program receives the message, timestamps the receive time, and pushes it into a queue. In a separate thread, the queue is processed, analyzing the difference between the sending and receiving timestamps. (Note that our internal queues are not part of the problem, since the timestamps are added outside of our internal queuing.) We don't really know where to start looking for an answer to this problem. We're not familiar with Linux internals. Our suspicion is that the kernel is queuing or buffering the packets, either on the send side or the receive side (or both). But we don't know how to track this down and trace it. For what it's worth, we're using CentOS 4.x (RHEL kernel 2.6.9).
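
    For readers who want to see the shape of the measurement, here is a minimal sketch of the send/receive timestamping described above - a sender that stamps each multicast packet with gettimeofday() and a receiver that stamps arrival and prints the difference. The group address, port and send rate are placeholders, and it assumes the sender and receiver clocks are comparable (or that both run on one host):

        /* Minimal sketch of the measurement described above. Sender mode
         * (any argument): stamp each multicast packet with gettimeofday()
         * just before send. Receiver mode (no argument): join the group,
         * stamp arrival, print the one-way delay. Group/port are placeholders. */
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/time.h>
        #include <unistd.h>

        #define GROUP "239.1.1.1"                /* example group address */
        #define PORT  5000

        int main(int argc, char **argv)
        {
            (void)argv;
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in addr;
            memset(&addr, 0, sizeof addr);
            addr.sin_family = AF_INET;
            addr.sin_port = htons(PORT);

            if (argc > 1) {                      /* sender */
                addr.sin_addr.s_addr = inet_addr(GROUP);
                for (;;) {
                    struct timeval tv;
                    gettimeofday(&tv, NULL);     /* timestamp just before send */
                    sendto(s, &tv, sizeof tv, 0,
                           (struct sockaddr *)&addr, sizeof addr);
                    usleep(1000);                /* ~1000 small messages/sec */
                }
            }

            /* receiver: bind, join the multicast group, timestamp arrivals */
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(s, (struct sockaddr *)&addr, sizeof addr);
            struct ip_mreq mreq;
            mreq.imr_multiaddr.s_addr = inet_addr(GROUP);
            mreq.imr_interface.s_addr = htonl(INADDR_ANY);
            setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

            for (;;) {
                struct timeval sent, now;
                if (recv(s, &sent, sizeof sent, 0) != (ssize_t)sizeof sent)
                    continue;
                gettimeofday(&now, NULL);
                long us = (now.tv_sec - sent.tv_sec) * 1000000L
                        + (now.tv_usec - sent.tv_usec);
                printf("latency: %ld us\n", us);
            }
        }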

    Read the article

  • How to make Linux reliably boot on multi-cpu machines?

    - by Adam Tabi
    I've got two machines, one with 4x12 AMD Opteron cores (AMD Opteron(tm) Processor 6176), one with 2x8 Xeon cores (HT disabled; Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz). On both machines I experience difficulties during boot of Linux using recent kernels. The system hangs during the initialization of the kernel, before or just when initramfs starts initializing the hardware. The last thing that got displayed was a stack trace like this:

        CPU: 31 PID: 0 Comm: swapper/31 Tainted: G D 3.11.6-hardened #11
        Hardware name: Supermicro X9DRT-HF+/X9DRT-HF+, BIOS 3.00 07/08/2013
        task: ffff880854695500 ti: ffff880854695a28 task.ti: ffff880854695a28
        RIP: 0010:[<ffffffff8100a82e>] [<ffffffff8100a82e>] default_idle+0x6/0xe
        RSP: 0000:ffff8808546b3ec8 EFLAGS: 00000286
        RAX: ffffffff8100a828 RBX: ffff880854695a28 RCX: 00000000ffffffff
        RDX: 0100000000000000 RSI: 0000000000000000 RDI: ffff88107fdec690
        RBP: ffff8808546b3ec8 R08: 0000000000000000 R09: ffff880854695500
        R10: ffff880854695500 R11: 0000000000000001 R12: ffff880854695a28
        R13: ffff880854695a28 R14: ffff880854695a28 R15: 0000000000000000
        FS: 0000000000000000(0000) GS:ffff88107fde0000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000002b43256a960 CR3: 00000000016b5000 CR4: 00000000000607f0
        Stack:
         ffff8808546b3ed8 ffffffff8100aec9 ffff8808546b3f10 ffffffff8109ce25
         334ab55852ec7aef 000000000000001f ffffffff8102d6c0 0000000000000000
         0000000000000000 ffff8808546b3f48 ffffffff810276e0 ffff8808546b3f28
        Call Trace:
         [<ffffffff8100aec9>] arch_cpu_idle+0x20/0x2b
         [<ffffffff8109ce25>] cpu_startup_entry+0xed/0x138
         [<ffffffff8102d6c0>] ? flat_init_apic_ldr+0x80/0x80
         [<ffffffff810276e0>] start_secondary+0x2c9/0x2f8

    I compiled the kernel myself and it works fine if I boot with nolapic. Yet only one core is used. Also, the RHEL6 kernel seems to work fine. I suspect that there are some patches used to make things work. Using the kernel config file from RHEL6 and building a more recent kernel yields the same problems. On the Xeon machine, things got better by disabling Hyperthreading completely. The machine now boots successfully at least 4 out of 5 times. And if it boots, multicore stuff works just fine. However, I'm wondering what to do about the AMD machine. So to sum it up:

    Gentoo kernels 3.6 - 3.11 won't reliably boot these machines unless you reduce the number of cores (e.g. via nolapic).
    The RHEL6 kernel (which is 2.6.32) boots just fine.
    The RH kernel config used to build a 3.x kernel won't yield a working kernel.
    Not distribution specific (apart from the kernel being used).

    These stack traces get printed every minute or so. The kernel seems to be stuck in an endless loop. Yet a recent kernel is needed for various reasons. So the question is: what does the RHEL6 kernel do that vanilla or Gentoo kernels don't do? Is there a boot option that might lead to a reliable boot with all the cores enabled? Best, Adam

    Read the article

  • Poor performance of single-processor 32-bit Windows XP compared to SMP in VBA+Excel

    - by Adam Ryczkowski
    Welcome! On many computers I have experienced poor performance of 32-bit guests running on a 64-bit Linux host (I used only the Debian family). At last I managed to collect benchmark data. I made the benchmark by running a custom VBA macro (which we use in our company) that generates a 284-page-long Word document full of Excel pie charts, tables and comments. The macro is run as a single task (excluding the standard services) on a set of identically configured Windows XP 32-bit systems. I measured the time (in sec.) needed to perform the test. The computer (i.e. my notebook, an Asus P53E) supports both VT-d extensions and native Windows XP. It has a 2-core processor, each core is hyperthreaded, so in total we have 4 mostly independent execution units. I use the latest VirtualBox 4.2 and VMware Workstation 9.0 for Linux, installed together on the same host (running Mint 13 Maya) but never run simultaneously. The results (in column Time) are no less accurate than ± 10%. Here are the results (sorry for the format, but I couldn't find a better solution for tables in SO):

        +---------------+-------------+------------------------------------------------------+---------+------------+----------------+------+
        | Host software | # processor | Windows kernel                                       | IO APIC | VT-x/AMD-V | 2D Video Accel | Time |
        +---------------+-------------+------------------------------------------------------+---------+------------+----------------+------+
        | VirtualBox    | 1           | Advanced Configuration and Power Interface (ACPI) PC | 0       | 1          | 0              | 1139 |
        | VirtualBox    | 1           | Advanced Configuration and Power Interface (ACPI) PC | 0       | 1          | 1              | 1050 |
        | VirtualBox    | 1           | Advanced Configuration and Power Interface (ACPI) PC | 0       | 0          | 1              | 1644 |
        | VirtualBox    | 4           | ACPI Multiprocessor PC                               | 1       | 1          | 1              | 6809 |
        | VMWare        | 1           | ACPI Uniprocessor PC                                 |         | 1          | 1              | 1175 |
        | VMWare        | 4           | ACPI Multiprocessor PC                               |         | 1          | 1              | 3412 |
        | Native        | 4           | ACPI Multiprocessor PC                               |         |            |                | 1693 |
        | Native        | 1           | Advanced Configuration and Power Interface (ACPI) PC |         |            |                | 1170 |
        +---------------+-------------+------------------------------------------------------+---------+------------+----------------+------+

    Here are the striking conclusions: Although I've read in the VirtualBox fora about abysmal performance with a 32-bit guest on a 64-bit host, VMware also has problems compared to a native run, while still being twice as fast(!) as VBox. Although VBA is inherently single-threaded, the Excel calculations, which take much more than half of the total computation time, supposedly aren't. So one would expect some speed gain when running on 2+ cores ("+" for hyperthreading). What we see is a speed loss. And quite a big one too. For VirtualBox the VT-d extension isn't a big deal. Can anyone shed some light on why the single-threaded Windows kernel is so much faster than the SMP one?

    Read the article

  • How to get the best LINPACK result and conquer the Top500?

    - by knweiss
    Given a large Linux HPC cluster with hundreds or thousands of nodes: what are your best practices for getting the best possible LINPACK benchmark (HPL) result to submit to the Top500 supercomputer list? To give you an idea what kind of answers I would appreciate, here are some sub-questions (with links):

    How do you tune the parameters (N, NB, P, Q, memory alignment, etc.) for the HPL.dat file (without spending too much time trying each possible permutation - especially with large problem sizes N)?
    Are there any Top500 submission rules to be aware of? What is allowed, what isn't?
    Which MPI product, which version? Does it make a difference?
    Any special host order in your MPI machine file? Do you use CPU pinning?
    How do you configure your interconnect? Which interconnect?
    Which BLAS package do you use for which CPU model? (Intel MKL, AMD ACML, GotoBLAS2, etc.)
    How do you prepare for the big run (on all nodes)? Start with small runs on a subset of nodes and then scale up? Is it really necessary to run LINPACK with a big run on all of the nodes (or is extrapolation allowed)?
    How do you optimize for the latest Intel/AMD CPUs? Hyperthreading? NUMA?
    Is it worth it to recompile the software stack or do you use precompiled binaries? Which settings? Which compiler optimizations, which compiler? (What about profile-based compilation?)
    How do you get the best result given only a limited amount of time to do the benchmark run? (You can block a huge cluster forever.)
    How do you prepare the individual nodes (stopping system daemons, freeing memory, etc.)?
    How do you deal with hardware faults (ruining a huge run)?
    Are there any must-read documents or websites about this topic? E.g. I would love to hear some background stories of some of the current Top500 systems and how they did their LINPACK benchmark.

    I deliberately don't want to mention concrete hardware details or discuss hardware recommendations because I don't want to limit the answers. However, feel free to mention hints e.g. for specific CPU models.

    Read the article

  • Why lock-free data structures just aren't lock-free enough

    - by Alex.Davies
    Today's post will explore why the current ways to communicate between threads don't scale, and show you a possible way to build scalable parallel programming on top of shared memory.

    The problem with shared memory

    Soon, we will have dozens, hundreds and then millions of cores in our computers. It's inevitable, because individual cores just can't get much faster. At some point, that's going to mean that we have to rethink our architecture entirely, as millions of cores can't all access a shared memory space efficiently. But millions of cores are still a long way off, and in the meantime we'll see machines with dozens of cores, struggling with shared memory.

    Alex's tip: The best way for an application to make use of that increasing parallel power is to use a concurrency model like actors, that deals with synchronisation issues for you. Then, the maintainer of the actors framework can find the most efficient way to coordinate access to shared memory to allow your actors to pass messages to each other efficiently.

    At the moment, NAct uses the .NET thread pool and a few locks to marshal messages. It works well on dual and quad core machines, but it won't scale to more cores. Every time we use a lock, our core performs an atomic memory operation (eg. CAS) on a cell of memory representing the lock, so it's sure that no other core can possibly have that lock. This is very fast when the lock isn't contended, but we need to notify all the other cores, in case they held the cell of memory in a cache. As the number of cores increases, the total cost of a lock increases linearly.

    A lot of work has been done on "lock-free" data structures, which avoid locks by using atomic memory operations directly. These give fairly dramatic performance improvements, particularly on systems with a few (2 to 4) cores. The .NET 4 concurrent collections in System.Collections.Concurrent are mostly lock-free. However, lock-free data structures still don't scale indefinitely, because any use of an atomic memory operation still involves every core in the system.

    A sync-free data structure

    Some concurrent data structures are possible to write in a completely synchronization-free way, without using any atomic memory operations. One useful example is a single producer, single consumer (SPSC) queue. It's easy to write a sync-free fixed size SPSC queue using a circular buffer*. Slightly trickier is a queue that grows as needed. You can use a linked list to represent the queue, but if you leave the nodes to be garbage collected once you're done with them, the GC will need to involve all the cores in collecting the finished nodes. Instead, I've implemented a proof of concept inspired by this Intel article which reuses the nodes by putting them in a second queue to send back to the producer.

    * In all these cases, you need to use memory barriers correctly, but these are local to a core, so don't have the same scalability problems as atomic memory operations.

    Performance tests

    I tried benchmarking my SPSC queue against the .NET ConcurrentQueue, and against a standard Queue protected by locks. In some ways, this isn't a fair comparison, because both of these support multiple producers and multiple consumers, but I'll come to that later. I started on my dual-core laptop, running a simple test that had one thread producing 64 bit integers, and another consuming them, to measure the pure overhead of the queue. So, nothing very interesting here. Both concurrent collections perform better than the lock-based one as expected, but there's not a lot to choose between the ConcurrentQueue and my SPSC queue. I was a little disappointed, but then, the .NET Framework team spent a lot longer optimising it than I did.

    So I dug out a more powerful machine that Red Gate's DBA tools team had been using for testing. It is a 6 core Intel i7 machine with hyperthreading, adding up to 12 logical cores. Now the results get more interesting. As I increased the number of producer-consumer pairs to 6 (to saturate all 12 logical cores), the locking approach was slow, and got even slower, as you'd expect. What I didn't expect to be so clear was the drop-off in performance of the lock-free ConcurrentQueue. I could see the machine only using about 20% of available CPU cycles when it should have been saturated. My interpretation is that as all the cores used atomic memory operations to safely access the queue, they ended up spending most of the time notifying each other about cache lines that need invalidating. The sync-free approach scaled perfectly, despite still working via shared memory, which after all, should still be a bottleneck. I can't quite believe that the results are so clear, so if you can think of any other effects that might cause them, please comment!

    Obviously, this benchmark isn't realistic because we're only measuring the overhead of the queue. Any real workload, even on a machine with 12 cores, would dwarf the overhead, and there'd be no point worrying about this effect. But would that be true on a machine with 100 cores?

    Still to be solved

    The trouble is, you can't build many concurrent algorithms using only an SPSC queue to communicate. In particular, I can't see a way to build something as general purpose as actors on top of just SPSC queues. Fundamentally, an actor needs to be able to receive messages from multiple other actors, which seems to need an MPSC queue. I've been thinking about ways to build a sync-free MPSC queue out of multiple SPSC queues and some kind of sign-up mechanism. Hopefully I'll have something to tell you about soon, but leave a comment if you have any ideas.
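
    For readers outside .NET, here is a rough C analogue of the fixed-size circular-buffer SPSC queue the post describes (this is not the author's code). The producer only ever writes the head index and the consumer only ever writes the tail index, so no atomic read-modify-write operations are needed - only release/acquire ordering on the two indices, matching the footnote about memory barriers:

        /* C analogue of the fixed-size single-producer/single-consumer queue
         * described above: a circular buffer where only the producer writes
         * `head` and only the consumer writes `tail`, so no atomic
         * read-modify-write (CAS) is ever needed -- just release/acquire
         * ordering on the two indices (the "memory barriers" footnote). */
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stddef.h>

        #define QSIZE 1024                       /* must be a power of two */

        typedef struct {
            long           buf[QSIZE];
            _Atomic size_t head;                 /* written only by producer */
            _Atomic size_t tail;                 /* written only by consumer */
        } spsc_queue;

        static bool spsc_push(spsc_queue *q, long v)
        {
            size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
            size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
            if (h - t == QSIZE)
                return false;                    /* queue full */
            q->buf[h & (QSIZE - 1)] = v;
            atomic_store_explicit(&q->head, h + 1, memory_order_release);
            return true;
        }

        static bool spsc_pop(spsc_queue *q, long *v)
        {
            size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
            size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
            if (t == h)
                return false;                    /* queue empty */
            *v = q->buf[t & (QSIZE - 1)];
            atomic_store_explicit(&q->tail, t + 1, memory_order_release);
            return true;
        }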

    Read the article

  • .NET development on a Retina MacBook Pro with Windows 8

    - by Jeff
    I remember sitting in Building 5 at Microsoft with some of my coworkers, when one of them came in with a shiny new 11” MacBook Air. It was nearly two years ago, and we found it pretty odd that the OEM’s building Windows machines sucked at industrial design in a way that defied logic. While Dell and HP were in a race to the bottom building commodity crap, Apple was staying out of the low-end market completely, and focusing on better design. In the process, they managed to build machines people actually wanted, and maintain an insanely high margin in the process. I stopped buying the commodity crap and custom builds in 2006, when Apple went Intel. As a .NET guy, I was still in it for Microsoft’s stack of development tools, which I found awesome, but had back to back crappy laptops from HP and Dell. After that original 15” MacBook Pro, I also had a Mac Pro tower (that I sold after three years for $1,500!), a 27” iMac, and my favorite, a 17” MacBook Pro (the unibody style) with an SSD added from OWC. The 17” was a little much to carry around because it was heavy, but it sure was nice getting as much as eight hours of battery life, and the screen was amazing. When the rumors started about a 15” model with a “retina” screen inspired by the Air, I made up my mind I wanted one, and ordered it the day it came out. I sold my 17”, after three years, for $750 to a friend who is really enjoying it. I got the base model with the upgrade to 16 gigs of RAM. It feels solid for being so thin, and if you’ve used the third generation iPad or the newer iPhone, you’ll be just as thrilled with the screen resolution. I’m typically getting just over six hours of battery life while running a VM, but Parallels 8 allegedly makes some power improvements, so we’ll see what happens. (It was just released today.) The nice thing about VM’s are that you can run more than one at a time. Primarily I run the Windows 8 VM with four cores (the laptop is quad-core, but has 8 logical cores due to hyperthreading or whatever Intel calls it) and 8 gigs of RAM. I also have a Windows Server 2008 R2 VM I spin up when I need to test stuff in a “real” server environment, and I give it two cores and 4 gigs of RAM. The Windows 8 VM spins up in about 8 seconds. Visual Studio 2012 takes a few more seconds, but count part of that as the “ReSharper tax” as it does its startup magic. The real beauty, the thing I looked most forward to, is that beautifully crisp C# text. Consolas has never looked as good as it does at 10pt. as it does on this display. You know how it looks great at 80pt. when conference speakers demo stuff on a projector? Think that sharpness, only tiny. It’s just gorgeous. Beyond that, everything is just so responsive and fast. Builds of large projects happen in seconds, hundreds of unit tests run in seconds… you just don’t spend a lot of time waiting for stuff. It’s kind of painful to go back to my 27” iMac (which would be better if I put an SSD in it before its third birthday). Are there negatives? A few minor issues, yes. As is the case with OS X, not everything scales right. You’ll see some weirdness at times with splash screens and icons and such. Chrome’s text rendering (in Windows) is apparently not aware of how to deal with higher DPI’s, so text is fuzzy (the OS X version is super sharp, however). You’ll also have to do some fiddling with keyboard settings to use the Windows 8 keyboard shortcuts. Overall, it’s as close to a no-compromise development experience as I’ve ever had. 
I’m not even going to bother with Boot Camp because the VM route already exceeds my expectations. You definitely get what you pay for. If this one also lasts three years and I can turn around and sell it, it’s worth it for something I use every day.

    Read the article

  • Strange Upload Problem on Hyper-V

    - by Ring0
    Hi, this one is driving me totally nuts. I have been trying to upload a file to www.virustotal.com (it's a harmless exe, I have since found out - DiskWipe.exe from diskwipe.org), using IE8. From Win 7 and Win 2008 R2 Datacenter (which I select to boot from VHDs) on my main machine's hardware, and also on another Win 7 PC elsewhere on my network, the upload to virustotal.com works perfectly. So, using my native NICs everything is fine, and using another machine it is also perfect. Right. OK, from my boot menu the default is my main development machine - the one I'm typing on now. This runs on the metal and has the Hyper-V role, and I have some guests. All guests are not running. Amazingly, from my console (root partition to be exact) or any guest OS (2003 / XP / 2008 R2 etc.), my upload to virustotal.com slows at 32%, then HANGS at 38.something% and never finishes!! Here is the kicker: I have another box (my main server) running Hyper-V on the metal with three live guests. Identical hardware to my main dev machine, in another room (except the OS is Datacenter - mine is Enterprise). If I try to upload this file from its bare-metal console or any guest to virustotal.com using IE8, it stops in exactly the same place!! As for "steps I have tried etc.", those are kind of blown out of the water, as my server box is doing the precise same thing as the machine in my room here. OK, commonalities: Mobo: Gigabyte GA-X58-UD5, 12GB Kingston RAM, Core i7 920 (4 cores with hyperthreading = 8) & Realtek RTL8168D/8111D Family PCI-E Gigabit Ethernet NICs. All 3 machines have this same motherboard - revision F11 BIOS, all have 12GB RAM, all have the Realtek NICs. All x64, by the way. As I mentioned before, I have a Win 7 box also with the UD5 m/board and 12 GB RAM - a bit of an overkill. :-) All these machines, when NOT running Hyper-V, can upload this file. Perhaps you may like to try it on a Hyper-V (2008 R2) yourselves with IE8 and the Desktop Experience on, and see if it works or fails for you. Root OS or any guest. So, it is looking like: this NIC + Hyper-V = cannot upload this file (any file, I must add). The Realtek NIC driver is ver. 7.002.1125.2008. Using IE8. I see in the NIC settings there are the usual parameters for jumbo frames / checksum offloading etc. and several others. Should I fiddle with these? I ran Netmon 3.3 in a guest and the TCP session halted as the upload failed. I suppose I could study that further. I don't have Netmon on the root partition machine (yet)! All OS's are fully patched - including today's Defender definitions. My box is running Office 2007 - but the identical server in another room is not. Also, if I fire up a VPN to a distant client and do the upload, it works! Of course it's a different network path. Suggestions welcome please. If I left out anything important - please yell at me. Many thanks.

    Read the article

  • Parallelism in .NET – Part 5, Partitioning of Work

    - by Reed
    When parallelizing any routine, we start by decomposing the problem.  Once the problem is understood, we need to break our work into separate tasks, so each task can be run on a different processing element.  This process is called partitioning. Partitioning our tasks is a challenging feat.  There are opposing forces at work here: too many partitions adds overhead, too few partitions leaves processors idle.  Trying to work the perfect balance between the two extremes is the goal for which we should aim.  Luckily, the Task Parallel Library automatically handles much of this process.  However, there are situations where the default partitioning may not be appropriate, and knowledge of our routines may allow us to guide the framework to making better decisions. First off, I’d like to say that this is a more advanced topic.  It is perfectly acceptable to use the parallel constructs in the framework without considering the partitioning taking place.  The default behavior in the Task Parallel Library is very well-behaved, even for unusual work loads, and should rarely be adjusted.  I have found few situations where the default partitioning behavior in the TPL is not as good or better than my own hand-written partitioning routines, and recommend using the defaults unless there is a strong, measured, and profiled reason to avoid using them.  However, understanding partitioning, and how the TPL partitions your data, helps in understanding the proper usage of the TPL. I indirectly mentioned partitioning while discussing aggregation.  Typically, our systems will have a limited number of Processing Elements (PE), which is the terminology used for hardware capable of processing a stream of instructions.  For example, in a standard Intel i7 system, there are four processor cores, each of which has two potential hardware threads due to Hyperthreading.  This gives us a total of 8 PEs – theoretically, we can have up to eight operations occurring concurrently within our system. In order to fully exploit this power, we need to partition our work into Tasks.  A task is a simple set of instructions that can be run on a PE.  Ideally, we want to have at least one task per PE in the system, since fewer tasks means that some of our processing power will be sitting idle.  A naive implementation would be to just take our data, and partition it with one element in our collection being treated as one task.  When we loop through our collection in parallel, using this approach, we’d just process one item at a time, then reuse that thread to process the next, etc.  There’s a flaw in this approach, however.  It will tend to be slower than necessary, often slower than processing the data serially. The problem is that there is overhead associated with each task.  When we take a simple foreach loop body and implement it using the TPL, we add overhead.  First, we change the body from a simple statement to a delegate, which must be invoked.  In order to invoke the delegate on a separate thread, the delegate gets added to the ThreadPool’s current work queue, and the ThreadPool must pull this off the queue, assign it to a free thread, then execute it.  If our collection had one million elements, the overhead of trying to spawn one million tasks would destroy our performance. The answer, here, is to partition our collection into groups, and have each group of elements treated as a single task.  
By adding a partitioning step, we can break our total work into small enough tasks to keep our processors busy, but large enough tasks to avoid overburdening the ThreadPool.  There are two clear, opposing goals here: Always try to keep each processor working, but also try to keep the individual partitions as large as possible. When using Parallel.For, the partitioning is always handled automatically.  At first, partitioning here seems simple.  A naive implementation would merely split the total element count up by the number of PEs in the system, and assign a chunk of data to each processor.  Many hand-written partitioning schemes work in this exactly manner.  This perfectly balanced, static partitioning scheme works very well if the amount of work is constant for each element.  However, this is rarely the case.  Often, the length of time required to process an element grows as we progress through the collection, especially if we’re doing numerical computations.  In this case, the first PEs will finish early, and sit idle waiting on the last chunks to finish.  Sometimes, work can decrease as we progress, since previous computations may be used to speed up later computations.  In this situation, the first chunks will be working far longer than the last chunks.  In order to balance the workload, many implementations create many small chunks, and reuse threads.  This adds overhead, but does provide better load balancing, which in turn improves performance. The Task Parallel Library handles this more elaborately.  Chunks are determined at runtime, and start small.  They grow slowly over time, getting larger and larger.  This tends to lead to a near optimum load balancing, even in odd cases such as increasing or decreasing workloads.  Parallel.ForEach is a bit more complicated, however. When working with a generic IEnumerable<T>, the number of items required for processing is not known in advance, and must be discovered at runtime.  In addition, since we don’t have direct access to each element, the scheduler must enumerate the collection to process it.  Since IEnumerable<T> is not thread safe, it must lock on elements as it enumerates, create temporary collections for each chunk to process, and schedule this out.  By default, it uses a partitioning method similar to the one described above.  We can see this directly by looking at the Visual Partitioning sample shipped by the Task Parallel Library team, and available as part of the Samples for Parallel Programming.  When we run the sample, with four cores and the default, Load Balancing partitioning scheme, we see this: The colored bands represent each processing core.  You can see that, when we started (at the top), we begin with very small bands of color.  As the routine progresses through the Parallel.ForEach, the chunks get larger and larger (seen by larger and larger stripes). Most of the time, this is fantastic behavior, and most likely will out perform any custom written partitioning.  However, if your routine is not scaling well, it may be due to a failure in the default partitioning to handle your specific case.  With prior knowledge about your work, it may be possible to partition data more meaningfully than the default Partitioner. There is the option to use an overload of Parallel.ForEach which takes a Partitioner<T> instance.  The Partitioner<T> class is an abstract class which allows for both static and dynamic partitioning.  
By overriding Partitioner<T>.SupportsDynamicPartitions, you can specify whether a dynamic approach is available.  If not, your custom Partitioner<T> subclass would override GetPartitions(int), which returns a list of IEnumerator<T> instances.  These are then used by the Parallel class to split work up amongst processors.  When dynamic partitioning is available, GetDynamicPartitions() is used, which returns an IEnumerable<T> for each partition.  If you do decide to implement your own Partitioner<T>, keep in mind the goals and tradeoffs of different partitioning strategies, and design appropriately. The Samples for Parallel Programming project includes a ChunkPartitioner class in the ParallelExtensionsExtras project.  This provides example code for implementing your own, custom allocation strategies, including a static allocator of a given chunk size.  Although implementing your own Partitioner<T> is possible, as I mentioned above, this is rarely required or useful in practice.  The default behavior of the TPL is very good, often better than any hand written partitioning strategy.
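
    To make the contrast concrete, here is an illustrative sketch (in C with pthreads, not the TPL) of the naive static partitioning scheme described above: split the element count evenly by the number of processing elements and hand each worker one contiguous chunk. As the post notes, this only balances well when every element costs roughly the same amount of work:

        /* Illustrative sketch of naive static partitioning (C + pthreads, not
         * the TPL): N elements are split evenly across NTHREADS workers, one
         * contiguous chunk per processing element. Works well only when each
         * element costs about the same amount of work. */
        #include <pthread.h>
        #include <stdio.h>

        #define N        1000000
        #define NTHREADS 8                       /* e.g. 4 cores x 2 HT threads */

        static double data[N];

        typedef struct { size_t begin, end; } chunk;

        static void *worker(void *arg)
        {
            chunk *c = arg;
            for (size_t i = c->begin; i < c->end; i++)
                data[i] = data[i] * 2.0 + 1.0;   /* stand-in for per-element work */
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];
            chunk     part[NTHREADS];
            size_t    per = (N + NTHREADS - 1) / NTHREADS;

            for (int i = 0; i < NTHREADS; i++) {
                part[i].begin = (size_t)i * per;
                part[i].end   = part[i].begin + per > N ? N : part[i].begin + per;
                pthread_create(&tid[i], NULL, worker, &part[i]);
            }
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

            printf("first element: %f\n", data[0]);
            return 0;
        }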

    Read the article
