iostat - Page 3 - Developer IT

How can I measure actual memory usage from my running processes?

- by NullUser

I have two servers, server1 and server2. Both of them are identical HP blades, running the exact same OS (RHEL 5.5). Here's the output of free for both of them: ### server1: total used free shared buffers cached Mem: 8017848 2746596 5271252 0 212772 1768800 -/+ buffers/cache: 765024 7252824 Swap: 14188536 0 14188536 ### server2: total used free shared buffers cached Mem: 8017848 4494836 3523012 0 212724 3136568 -/+ buffers/cache: 1145544 6872304 Swap: 14188536 0 14188536 If I understand correctly, server2 is using significantly more memory for disk I/O caching, which still counts as memory used. But both are running the same OS and if I remember correctly, I configured both with the same parameters when they were installed. I did a diff on /etc/sysctl.conf and they are identical. The problem is, I am collecting memory usage and other metrics over a period of time, (eg: vmstat, iostat, etc.) while a load is generated on the system. The memory used for caching is throwing off my calculations on the results. How can I measure actual memory usage from my running processes, rather than system usage? Is used - (buffers + cached) a valid way to measure this?

Read the article

What may the reason of slowness be (see details in message body)?

- by Ivan

I've got a really weird situation I'm beating to solve. A performance problem which looks really like an empty waiting sequence set in code (while it probably isn't so). I've got a pretty powerful dedicated server (10 GB RAM, eight Xeon cores, etc) running Ubuntu 10.04 with all the functionality services (except OpenVPN server used to provide secure access to clients) deployed in separate VirtualBox (vboxheadless) machines (one for the company e-mail server, one for web server and one for accounting/crm server (Firebird + proprietary app server working with Delphi-made clients)). CPU load (as "top" says) is almost always near zero. Host system RAM is close to 100% usage but not overloaded (as very little swapping gets used, and freed (by stopping one of VMs) memory doesn't get reused any quickly). Approximately 50% of guests RAM is used. iostat usually shows near zero %util. Network bandwidth seems to be underused. But the accounting/crm client (a Win32 Delphi application run on WinXP machines) software works hell-slow with this server (and works much better using an inside-LAN Windows server). I just can't imagine what can make it be slow if there are so plenty of CPU, RAM, HDD and bandwidth resources available on clients and on the server even in their hardest moments. Saying bandwidth is underused I not only know that clients and the server are connected to the Internet with a bigger channels than really used (which leaves the a chance they may have a bottleneck of a sort on the route between them), I've tested bandwidth between clients and the server by copying files among them.

Read the article

ubuntu's average load never below "0.00 0.01 0.05"

- by Karma Fusebox

I have several ubuntu 12.04 VMs running on a ubuntu 12.04 KVM host. Those of the virtual machines that are totally idle with no services (except syslog and the other "small" standard stuff of a fresh installation) show a constant load of "0.00 0.01 0.05" in top/htop as average 1/5/15. When there are "real" applications running, the load averages behave perfectly normal but they never fall below the mentioned values. While this doesn't affect performance at all and could easily be ignored, it screws up the monitoring graphs in a very annoying way: (Notice how load15 behaves nicely if 0.05 for a short time in the right half of the pic) Unfortunately I don't know what diagnostic outputs might be helpful for you, so here's some default stuff: # top top - 16:31:01 up 1:05, 1 user, load average: 0.00, 0.01, 0.05 Tasks: 62 total, 1 running, 61 sleeping, 0 stopped, 0 zombie Cpu(s): 0.2%us, 0.2%sy, 0.0%ni, 99.2%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1019464k total, 73452k used, 946012k free, 6140k buffers Swap: 0k total, 0k used, 0k free, 22504k cached . # free -m total used free shared buffers cached Mem: 995 72 923 0 6 21 -/+ buffers/cache: 43 951 Swap: 0 0 0 . # iostat -x /dev/vda Linux 3.2.0-32-virtual (vm3) 11/15/2012 _x86_64_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 0.25 0.00 0.65 0.20 0.24 98.66 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util vda 0.14 0.12 0.51 0.22 6.74 1.46 22.50 0.02 23.26 20.64 29.30 7.63 0.56 Need something else? Has anyone ever seen this behavior? Might this be a bug in kvm/ubuntu/kernel 3.x in the end? Thanks a lot!

Read the article

kill -9 doesn't work

- by Daniel

I have a server with 3 oracle instances on it, and the file system is nfs with netapp. After shutdown the databases, one process for each database doesn't quit for a long time. Each kill -i doesn't work. I tried to truss, pfile it, the command through error. And iostat shows there are lots of IO to the netapp server. So someone said the process was busy writing data to remote netapp server, and before the write complete, it won't quit. So what need to be done was just wait until all the IO was done. After wait for longer time (about 1.5 hours), the processes exit. So my question is: how can a process ignore the kill signal? As far as I know, if we kill -9, it will stop immediately. Do you encounter such situation kill -i doesn't kill the process right away? TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1469 25053 0 22:36:53 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ kill -9 1051 TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1493 25053 0 22:37:07 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ kill -9 471 TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 oracle 1495 25053 0 22:37:22 pts/1 0:00 grep dbw0 TEST7-stdby-phxdbnfs11$ ps -ef|grep smon oracle 1524 25053 0 22:38:02 pts/1 0:00 grep smon TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1526 25053 0 22:38:06 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ kill -9 1051 471 26795 TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1528 25053 0 22:38:19 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ truss -p 26795 truss: unanticipated system error: 26795 TEST7-stdby-phxdbnfs11$ pfiles 26795 pfiles: unanticipated system error: 26795

Read the article

Very poor read performance compared to write performance on md(raid1) / crypt(luks) / lvm

- by Android5360

I'm experiencing very poor read performance over raid1/crypt/lvm. In the same time, write speeds are about 2x+ faster on the same setup. On another raid1 setup on the same machine I get normal read speeds (maybe because I'm not using cryptsetup). OS related disks: sda + sdb. I have raid1 configuration with two disks, both are in place. I'm using LVM over the RAID. No encryption. Both disks are WD Green, 5400 rpm. IO test results on this raid1: dd if=/dev/zero of=/tmp/output.img3 bs=8k count=256k conv=fsync - 2147483648 bytes (2.1 GB) copied, 22.3392 s, 96.1 MB/s sync echo 3 > /proc/sys/vm/drop_caches dd if=/tmp/output.img3 of=/dev/null bs=8k - 2147483648 bytes (2.1 GB) copied, 15.9 s, 135 MB/s And here is the problematic setup (on the same machine). Currently I have only one sdc (WD Green, 5400rpm) configured in software raid1 + crypt (luks, serpent-xts-plain) + lvm. Tomorrow I will attach another disk (sdd) to complete this two-disk raid1 setup. IO tests results on this raid1: dd if=/dev/zero of=output.img3 bs=8k count=256k conv=fsync 2147483648 bytes (2.1 GB) copied, 17.7235 s, 121 MB/s sync echo 3 > /proc/sys/vm/drop_caches dd if=output.img3 of=/dev/null bs=8k 2147483648 bytes (2.1 GB) copied, 36.2454 s, 59.2 MB/s We can see that the read performance is very very bad (59MB/s compared to 135MB/s when using no encryption). Nothing is using the disks during benchmark. I can confirm this because I checked with iostat and dstat. Details on the hardware: disks: all are WD green, 5400rpm, 64mb cache. cpu: FX-8350 at stock speed ram: 4x4GB at 1066Mhz. Details on the software: OS: Debian Wheezy 7, amd64 mdadm: v3.2.5 - 18th May 2012 LVM version: 2.02.95(2) (2012-03-06) LVM Library version: 1.02.74 (2012-03-06) LVM Driver version: 4.22.0 cryptsetup: 1.4.3 Here is how I configured the slow raid1+crypt+lvm setup: parted /dev/sdc mklabel gpt type: ext4 start: 2048s end: -1 Now the raid, crypt and the lvm configuration: mdadm --create /dev/md1 --level=1 --raid-disks=2 missing /dev/sdc cryptsetup --cipher serpent-xts-plain luksFormat /dev/md1 cryptsetup luksOpen /dev/md1 md1_crypt vgcreate vg_sql /dev/mapper/md1_crypt lvcreate -l 100%VG vg_sql -n lv_sql mkfs.ext4 /dev/mapper/vg_sql-lv-sql mount /dev/mapper/vg_sql-lv_sql /sql So guys, can you help me identify the reason and fix it? It has to be something with the cryptsetup as there is no such read slowdown on the other setup (sda+sdb) where no encryption is present. But I have no idea what to do. Thanks!

Read the article

How to find the process(es) which are hogging the machine

- by Aaron Digulla

Scenario: All of a sudden, my computer feels sluggish. Mouse moves but windows take ages to open, etc. uptime says the load is 7.69 and raising. What is the fastest way to find out which process(es) are the cause of the load? Now, "top" and similar tools isn't the answer because they either show CPU or memory usage but not both at the same time. What I need is the single command which I might be able to type as it happens - something that will figure out any of System is trying to swap 8GB of RAM to disk because process X ... or process X seeks all over the disk or process X uses 400% CPU" So what I'm looking for is iostat, htop/atop and similar tools run into one with an output like this: 1235 cp - Disk trashing 87 chrome - Uses 2 GB of RAM 137 nfs_bench - Uses 95% of the network bandwidth I don't want a tool that gives me some numbers which I can analyze but a tool that tells me exactly which process causes the current load. Assume that the user in front of the keyboard barely knows how to write "process", but the user is quickly overwhelmed when it comes to "resident size", "virtual memory" or "process life cycle". My argument goes like this: A user notices a problem. There can be thousands of reasons ... well, almost :-) The user wants to know the source of the problem. The current solutions give me lots of numbers, and I need to know what these numbers mean. What I'm looking for is a meta tool. 99% of the data is irrelevant to the problem. So what the tool should do is look for processes which hog some resource and list only those along with "this process needs a lot of CPU, this produces many IRQs, this process allocates a lot of RAM (and it's still growing)". This will be a relatively short list. It will be much more simple for someone new to this to locate the culprit from this list than from the output of, say, htop which gives me about 5000 numbers but requires me to fold multi-threaded processes myself (I have 50 lines which say VIRT 2750M but only 16 GB of RAM - the machine ought to swap itself to death but of course, this is a misinterpretation of the data that can happen quickly).

Read the article

kill -9 doesn't work

- by Daniel

I have a server with 3 oracle instances on it, and the file system is nfs with netapp. After shutdown the databases, one process for each database doesn't quit for a long time. Each kill -i doesn't work. I tried to truss, pfile it, the command through error. And iostat shows there are lots of IO to the netapp server. So someone said the process was busy writing data to remote netapp server, and before the write complete, it won't quit. So what need to be done was just wait until all the IO was done. After wait for longer time (about 1.5 hours), the processes exit. So my question is: how can a process ignore the kill signal? As far as I know, if we kill -9, it will stop immediately. Do you encounter such situation kill -i doesn't kill the process right away? TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1469 25053 0 22:36:53 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ kill -9 1051 TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1493 25053 0 22:37:07 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ kill -9 471 TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 oracle 1495 25053 0 22:37:22 pts/1 0:00 grep dbw0 TEST7-stdby-phxdbnfs11$ ps -ef|grep smon oracle 1524 25053 0 22:38:02 pts/1 0:00 grep smon TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1526 25053 0 22:38:06 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ kill -9 1051 471 26795 TEST7-stdby-phxdbnfs11$ ps -ef|grep dbw0 oracle 1528 25053 0 22:38:19 pts/1 0:00 grep dbw0 oracle 26795 1 0 21:55:23 ? 0:00 ora_dbw0_TEST7 oracle 1051 1 0 Apr 08 ? 3958:51 ora_dbw0_TEST2 oracle 471 1 0 Apr 08 ? 6391:43 ora_dbw0_TEST1 TEST7-stdby-phxdbnfs11$ truss -p 26795 truss: unanticipated system error: 26795 TEST7-stdby-phxdbnfs11$ pfiles 26795 pfiles: unanticipated system error: 26795

Read the article

2 drives, slow software RAID1 (md)

- by bart613

Hello, I've got a server from hetzner.de (EQ4) with 2* SAMSUNG HD753LJ drives (750G 32MB cache). OS is CentOS 5 (x86_64). Drives are combined together into two RAID1 partitions: /dev/md0 which is 512MB big and has only /boot partitions /dev/md1 which is over 700GB big and is one big LVM which hosts other partitions Now, I've been running some benchmarks and it seems like even though exactly the same drives, speed differs a bit on each of them. # hdparm -tT /dev/sda /dev/sda: Timing cached reads: 25612 MB in 1.99 seconds = 12860.70 MB/sec Timing buffered disk reads: 352 MB in 3.01 seconds = 116.80 MB/sec # hdparm -tT /dev/sdb /dev/sdb: Timing cached reads: 25524 MB in 1.99 seconds = 12815.99 MB/sec Timing buffered disk reads: 342 MB in 3.01 seconds = 113.64 MB/sec Also, when I run eg. pgbench which is stressing IO quite heavily, I can see following from iostat output: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 231.40 0.00 298.00 0.00 9683.20 32.49 0.17 0.58 0.34 10.24 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda2 0.00 231.40 0.00 298.00 0.00 9683.20 32.49 0.17 0.58 0.34 10.24 sdb 0.00 231.40 0.00 301.80 0.00 9740.80 32.28 14.19 51.17 3.10 93.68 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb2 0.00 231.40 0.00 301.80 0.00 9740.80 32.28 14.19 51.17 3.10 93.68 md1 0.00 0.00 0.00 529.60 0.00 9692.80 18.30 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 529.00 0.00 9688.00 18.31 24.51 49.91 1.81 95.92 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 152.40 0.00 330.60 0.00 5176.00 15.66 0.19 0.57 0.19 6.24 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda2 0.00 152.40 0.00 330.60 0.00 5176.00 15.66 0.19 0.57 0.19 6.24 sdb 0.00 152.40 0.00 326.20 0.00 5118.40 15.69 19.96 55.36 3.01 98.16 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb2 0.00 152.40 0.00 326.20 0.00 5118.40 15.69 19.96 55.36 3.01 98.16 md1 0.00 0.00 0.00 482.80 0.00 5166.40 10.70 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 482.80 0.00 5166.40 10.70 30.19 56.92 2.05 99.04 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 181.64 0.00 324.55 0.00 5445.11 16.78 0.15 0.45 0.21 6.87 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda2 0.00 181.64 0.00 324.55 0.00 5445.11 16.78 0.15 0.45 0.21 6.87 sdb 0.00 181.84 0.00 328.54 0.00 5493.01 16.72 18.34 61.57 3.01 99.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdb2 0.00 181.84 0.00 328.54 0.00 5493.01 16.72 18.34 61.57 3.01 99.00 md1 0.00 0.00 0.00 506.39 0.00 5477.05 10.82 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 506.39 0.00 5477.05 10.82 28.77 62.15 1.96 99.00 And this is completely getting me confused. How come two exactly the same specced drives have such a difference in write speed (see util%)? I haven't really paid attention to those speeds before, so perhaps that something normal -- if someone could confirm I would be really grateful. Otherwise, if someone have seen such behavior again or knows what is causing such behavior I would really appreciate answer. I'll also add that both "smartctl -a" and "hdparm -I" output are exactly the same and are not indicating any hardware problems. The slower drive was changed already two times (to new ones). Also I asked to change the drives with places, and then sda were slower and sdb quicker (so the slow one was the same drive). SATA cables were changed two times already.

Read the article

Mongodb Slave replication lag

- by Leonid Bugaev

We using standard mongo setup: 2 replicas + 1 arbiter. Both replica servers use same AWS m1.medium with RAID10 EBS. We experiencing constantly growing replication lag on secondary replica. I tried to do full-resync, you can see it on graph, but it helped only for some hours. Our mongo usage is really low now, and frankly i can't understan why it can be. iostat 1 for secondary: avg-cpu: %user %nice %system %iowait %steal %idle 80.39 0.00 2.94 0.00 16.67 0.00 Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn xvdap1 0.00 0.00 0.00 0 0 xvdb 0.00 0.00 0.00 0 0 xvdfp4 12.75 0.00 189.22 0 193 xvdfp3 12.75 0.00 189.22 0 193 xvdfp2 7.84 0.00 40.20 0 41 xvdfp1 7.84 0.00 40.20 0 41 md127 19.61 0.00 219.61 0 224 mongostat for secondary (why 100% locks? i guess its the problem): insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time *10 *0 *16 *0 0 2|4 0 30.9g 62.4g 1.65g 0 107 0 0|0 0|0 198b 1k 16 replset-01 SEC 06:55:37 *4 *0 *8 *0 0 12|0 0 30.9g 62.4g 1.65g 0 91.7 0 0|0 0|0 837b 5k 16 replset-01 SEC 06:55:38 *4 *0 *7 *0 0 3|0 0 30.9g 62.4g 1.64g 0 110 0 0|0 0|0 342b 1k 16 replset-01 SEC 06:55:39 *4 *0 *8 *0 0 1|0 0 30.9g 62.4g 1.64g 0 82.9 0 0|0 0|0 62b 1k 16 replset-01 SEC 06:55:40 *3 *0 *7 *0 0 5|0 0 30.9g 62.4g 1.6g 0 75.2 0 0|0 0|0 466b 2k 16 replset-01 SEC 06:55:41 *4 *0 *7 *0 0 1|0 0 30.9g 62.4g 1.64g 0 138 0 0|0 0|1 62b 1k 16 replset-01 SEC 06:55:42 *7 *0 *15 *0 0 3|0 0 30.9g 62.4g 1.64g 0 95.4 0 0|0 0|0 342b 1k 16 replset-01 SEC 06:55:43 *7 *0 *14 *0 0 1|0 0 30.9g 62.4g 1.64g 0 98 0 0|0 0|0 62b 1k 16 replset-01 SEC 06:55:44 *8 *0 *17 *0 0 3|0 0 30.9g 62.4g 1.64g 0 96.3 0 0|0 0|0 342b 1k 16 replset-01 SEC 06:55:45 *7 *0 *14 *0 0 3|0 0 30.9g 62.4g 1.64g 0 96.1 0 0|0 0|0 186b 2k 16 replset-01 SEC 06:55:46 mongostat for primary insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time 12 30 20 0 0 3 0 30.9g 62.6g 641m 0 0.9 0 0|0 0|0 212k 619k 48 replset-01 M 06:56:41 5 17 10 0 0 2 0 30.9g 62.6g 641m 0 0.5 0 0|0 0|0 159k 429k 48 replset-01 M 06:56:42 9 22 16 0 0 3 0 30.9g 62.6g 642m 0 0.7 0 0|0 0|0 158k 276k 48 replset-01 M 06:56:43 6 18 12 0 0 2 0 30.9g 62.6g 640m 0 0.7 0 0|0 0|0 93k 231k 48 replset-01 M 06:56:44 6 12 8 0 0 3 0 30.9g 62.6g 640m 0 0.3 0 0|0 0|0 80k 125k 48 replset-01 M 06:56:45 8 21 14 0 0 9 0 30.9g 62.6g 641m 0 0.6 0 0|0 0|0 118k 419k 48 replset-01 M 06:56:46 10 34 20 0 0 6 0 30.9g 62.6g 640m 0 1.3 0 0|0 0|0 164k 527k 48 replset-01 M 06:56:47 6 21 13 0 0 2 0 30.9g 62.6g 641m 0 0.7 0 0|0 0|0 111k 477k 48 replset-01 M 06:56:48 8 21 15 0 0 2 0 30.9g 62.6g 641m 0 0.7 0 0|0 0|0 204k 336k 48 replset-01 M 06:56:49 4 12 8 0 0 8 0 30.9g 62.6g 641m 0 0.5 0 0|0 0|0 156k 530k 48 replset-01 M 06:56:50 Mongo version: 2.0.6

Read the article

What is causing the unusual high load average and IOwait?

- by James

I noticed on Tuesday night of last week, the load average went up sharply and it seemed abnormal since the traffic is small. Usually, the numbers usually average around .40 or lower and my server stuff (mysql, php and apache) are optimized. I noticed that the IOWait is unusually high even though the processes is barely using any CPU. top - 01:44:39 up 1 day, 21:13, 1 user, load average: 1.41, 1.09, 0.86 Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 91.5%id, 8.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 331944k used, 716632k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.92 init 1656 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 9323 root 18 0 13652 5212 664 S 0 0.5 0:00.00 apache2 10079 root 18 0 3972 1248 972 S 0 0.1 0:00.00 su 10080 root 15 0 4612 1956 1448 S 0 0.2 0:00.01 bash 11298 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 11778 chikorit 15 0 2344 1092 884 S 0 0.1 0:00.05 top 15384 root 18 0 17544 13m 1568 S 0 1.3 0:02.28 miniserv.pl 15585 root 15 0 8280 2736 2168 S 0 0.3 0:00.02 sshd 15608 chikorit 15 0 8280 1436 860 S 0 0.1 0:00.02 sshd Here is the VMStat procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 768644 0 0 0 0 14 23 0 10 1 0 99 0 IOStat - Nothing unusal Total DISK READ: 67.13 K/s | Total DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 19496 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19501 be/4 mysql 3.95 K/s 0.00 B/s 0.00 % 0.00 % mysqld 19568 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19569 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19570 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19571 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19573 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 11778 be/4 chikorit 0.00 B/s 0.00 B/s 0.00 % 0.00 % top 19470 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.00 % mysqld Load Average Chart - http://i.stack.imgur.com/kYsD0.png I want to be sure if this is not a MySQL problem before making sure. Also, this is a Ubuntu 10.04 LTS Server on OpenVZ. Edit: This will probably give a good picture on the IO Wait top - 22:12:22 up 17:41, 1 user, load average: 1.10, 1.09, 0.93 Tasks: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie Cpu(s): 0.6%us, 0.2%sy, 0.0%ni, 89.0%id, 10.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 260708k used, 787868k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.88 init 5849 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 8063 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 9732 root 16 0 8280 2728 2168 S 0 0.3 0:00.02 sshd 9746 chikorit 18 0 8412 1444 864 S 0 0.1 0:01.10 sshd 9747 chikorit 18 0 4576 1960 1488 S 0 0.2 0:00.24 bash 13706 chikorit 15 0 2344 1088 884 R 0 0.1 0:00.03 top 15745 chikorit 15 0 12968 5108 1280 S 0 0.5 0:00.00 apache2 15751 chikorit 15 0 72184 25m 18m S 0 2.5 0:00.37 php5-fpm 15790 chikorit 18 0 12472 4640 1192 S 0 0.4 0:00.00 apache2 15797 chikorit 15 0 72888 23m 16m S 0 2.3 0:00.06 php5-fpm 16038 root 15 0 67772 2848 592 D 0 0.3 0:00.00 php5-fpm 16309 syslog 18 0 24084 1316 992 S 0 0.1 0:00.07 rsyslogd 16316 root 15 0 5472 908 500 S 0 0.1 0:00.00 sshd 16326 root 15 0 2304 908 712 S 0 0.1 0:00.02 cron 17464 root 15 0 10252 7560 856 D 0 0.7 0:01.88 psad 17466 root 18 0 1684 276 208 S 0 0.0 0:00.31 psadwatchd 17559 root 18 0 11444 2020 732 S 0 0.2 0:00.47 sendmail-mta 17688 root 15 0 10252 5388 1136 S 0 0.5 0:03.81 python 17752 teamspea 19 0 44648 7308 4676 S 0 0.7 1:09.70 ts3server_linux 18098 root 15 0 12336 6380 3032 S 0 0.6 0:00.47 apache2 18099 chikorit 18 0 10368 2536 464 S 0 0.2 0:00.00 apache2 18120 ntp 15 0 4336 1316 984 S 0 0.1 0:00.87 ntpd 18379 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 18387 mysql 15 0 62796 36m 5864 S 0 3.6 1:43.26 mysqld 19584 root 15 0 12336 4028 668 S 0 0.4 0:00.02 apache2 22498 root 16 0 12336 4028 668 S 0 0.4 0:00.00 apache2 24260 root 15 0 67772 3612 1356 S 0 0.3 0:00.22 php5-fpm 27712 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 27730 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30343 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30366 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2

Read the article

Mysql performance problem & Failed DIMM

- by murdoch

Hi I have a dedicated mysql database server which has been having some performance problems recently, under normal load the server will be running fine, then suddenly out of the blue the performance will fall off a cliff. The server isn't using the swap file and there is 12GB of RAM in the server, more than enough for its needs. After contacting my hosting comapnies support they have discovered that there is a failed 2GB DIMM in the server and have scheduled to replace it tomorow morning. My question is could a failed DIMM result in the performance problems I am seeing or is this just coincidence? My worry is that they will replace the ram tomorrow but the problems will persist and I will still be lost of explanations so I am just trying to think ahead. The reason I ask is that there is plenty of RAM in the server, more than required and simply missing 2GB should be a problem, so if this failed DIMM is causing these performance problems then the OS must be trying to access the failed DIMM and slowing down as a result. Does that sound like a credible explanation? This is what DELLs omreport program says about the RAM, notice one dimm is "Critical" Memory Information Health : Critical Memory Operating Mode Fail Over State : Inactive Memory Operating Mode Configuration : Optimizer Attributes of Memory Array(s) Attributes : Location Memory Array 1 : System Board or Motherboard Attributes : Use Memory Array 1 : System Memory Attributes : Installed Capacity Memory Array 1 : 12288 MB Attributes : Maximum Capacity Memory Array 1 : 196608 MB Attributes : Slots Available Memory Array 1 : 18 Attributes : Slots Used Memory Array 1 : 6 Attributes : ECC Type Memory Array 1 : Multibit ECC Total of Memory Array(s) Attributes : Total Installed Capacity Value : 12288 MB Attributes : Total Installed Capacity Available to the OS Value : 12004 MB Attributes : Total Maximum Capacity Value : 196608 MB Details of Memory Array 1 Index : 0 Status : Ok Connector Name : DIMM_A1 Type : DDR3-Registered Size : 2048 MB Index : 1 Status : Ok Connector Name : DIMM_A2 Type : DDR3-Registered Size : 2048 MB Index : 2 Status : Ok Connector Name : DIMM_A3 Type : DDR3-Registered Size : 2048 MB Index : 3 Status : Critical Connector Name : DIMM_B1 Type : DDR3-Registered Size : 2048 MB Index : 4 Status : Ok Connector Name : DIMM_B2 Type : DDR3-Registered Size : 2048 MB Index : 5 Status : Ok Connector Name : DIMM_B3 Type : DDR3-Registered Size : 2048 MB the command free -m shows this, the server seems to be using more than 10GB of ram which would suggest it is trying to use the DIMM total used free shared buffers cached Mem: 12004 10766 1238 0 384 4809 -/+ buffers/cache: 5572 6432 Swap: 2047 0 2047 iostat output while problem is occuring avg-cpu: %user %nice %system %iowait %steal %idle 52.82 0.00 11.01 0.00 0.00 36.17 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 47.00 0.00 576.00 0 576 sda1 0.00 0.00 0.00 0 0 sda2 1.00 0.00 32.00 0 32 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 46.00 0.00 544.00 0 544 avg-cpu: %user %nice %system %iowait %steal %idle 53.12 0.00 7.81 0.00 0.00 39.06 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 49.00 0.00 592.00 0 592 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 49.00 0.00 592.00 0 592 avg-cpu: %user %nice %system %iowait %steal %idle 56.09 0.00 7.43 0.37 0.00 36.10 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 232.00 0.00 64520.00 0 64520 sda1 0.00 0.00 0.00 0 0 sda2 159.00 0.00 63728.00 0 63728 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 73.00 0.00 792.00 0 792 avg-cpu: %user %nice %system %iowait %steal %idle 52.18 0.00 9.24 0.06 0.00 38.51 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 49.00 0.00 600.00 0 600 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 49.00 0.00 600.00 0 600 avg-cpu: %user %nice %system %iowait %steal %idle 54.82 0.00 8.64 0.00 0.00 36.55 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 100.00 0.00 2168.00 0 2168 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 100.00 0.00 2168.00 0 2168 avg-cpu: %user %nice %system %iowait %steal %idle 54.78 0.00 6.75 0.00 0.00 38.48 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 84.00 0.00 896.00 0 896 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 84.00 0.00 896.00 0 896 avg-cpu: %user %nice %system %iowait %steal %idle 54.34 0.00 7.31 0.00 0.00 38.35 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 81.00 0.00 840.00 0 840 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 81.00 0.00 840.00 0 840 avg-cpu: %user %nice %system %iowait %steal %idle 55.18 0.00 5.81 0.44 0.00 38.58 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 317.00 0.00 105632.00 0 105632 sda1 0.00 0.00 0.00 0 0 sda2 224.00 0.00 104672.00 0 104672 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 93.00 0.00 960.00 0 960 avg-cpu: %user %nice %system %iowait %steal %idle 55.38 0.00 7.63 0.00 0.00 36.98 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 74.00 0.00 800.00 0 800 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 74.00 0.00 800.00 0 800 avg-cpu: %user %nice %system %iowait %steal %idle 56.43 0.00 7.80 0.00 0.00 35.77 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 72.00 0.00 784.00 0 784 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 72.00 0.00 784.00 0 784 avg-cpu: %user %nice %system %iowait %steal %idle 54.87 0.00 6.49 0.00 0.00 38.64 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 80.20 0.00 855.45 0 864 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 80.20 0.00 855.45 0 864 avg-cpu: %user %nice %system %iowait %steal %idle 57.22 0.00 5.69 0.00 0.00 37.09 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 33.00 0.00 432.00 0 432 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 33.00 0.00 432.00 0 432 avg-cpu: %user %nice %system %iowait %steal %idle 56.03 0.00 7.93 0.00 0.00 36.04 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 41.00 0.00 560.00 0 560 sda1 0.00 0.00 0.00 0 0 sda2 2.00 0.00 88.00 0 88 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 39.00 0.00 472.00 0 472 avg-cpu: %user %nice %system %iowait %steal %idle 55.78 0.00 5.13 0.00 0.00 39.09 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 29.00 0.00 392.00 0 392 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 29.00 0.00 392.00 0 392 avg-cpu: %user %nice %system %iowait %steal %idle 53.68 0.00 8.30 0.06 0.00 37.95 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn sda 78.00 0.00 4280.00 0 4280 sda1 0.00 0.00 0.00 0 0 sda2 0.00 0.00 0.00 0 0 sda3 0.00 0.00 0.00 0 0 sda4 0.00 0.00 0.00 0 0 sda5 78.00 0.00 4280.00 0 4280

Read the article

Why is the latency on one LVM volume consistently higher?

- by David Schmitt

I've got a server with LVM over RAID1. One of the volumes has a consistently higher IO latency (as measured by the diskstats_latency munin plugin) than the other volumes from the same group. As you can see, the dark orange /root volume has consistently high IO latency. Actually ten times the average latency of the physical devices. It also has the highest Min and Max values. My main concern are not the peaks, which occur under high load, but the constant load on (semi-)idle. The server is running Debian Squeeze with the VServer kernel and has four VServer containers and one KVM guest. I'm looking for ways to fix - or at least understand - this situation. Here're some parts of the system configuration: root@kvmhost2:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/system--host-root 19G 3.8G 14G 22% / tmpfs 16G 0 16G 0% /lib/init/rw udev 16G 224K 16G 1% /dev tmpfs 16G 0 16G 0% /dev/shm /dev/md0 942M 37M 858M 5% /boot /dev/mapper/system--host-isos 28G 19G 8.1G 70% /srv/isos /dev/mapper/system--host-vs_a 30G 23G 6.0G 79% /var/lib/vservers/a /dev/mapper/system--host-vs_b 5.0G 594M 4.1G 13% /var/lib/vservers/b /dev/mapper/system--host-vs_c 5.0G 555M 4.2G 12% /var/lib/vservers/c /dev/loop0 4.4G 4.4G 0 100% /media/debian-6.0.0-amd64-DVD-1 /dev/loop1 4.4G 4.4G 0 100% /media/debian-6.0.0-i386-DVD-1 /dev/mapper/system--host-vs_d 74G 55G 16G 78% /var/lib/vservers/d root@kvmhost2:~# cat /proc/mounts rootfs / rootfs rw 0 0 none /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0 none /proc proc rw,nosuid,nodev,noexec,relatime 0 0 none /dev devtmpfs rw,relatime,size=16500836k,nr_inodes=4125209,mode=755 0 0 none /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0 /dev/mapper/system--host-root / ext3 rw,relatime,errors=remount-ro,data=ordered 0 0 tmpfs /lib/init/rw tmpfs rw,nosuid,relatime,mode=755 0 0 tmpfs /dev/shm tmpfs rw,nosuid,nodev,relatime 0 0 fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0 /dev/md0 /boot ext3 rw,sync,relatime,errors=continue,data=ordered 0 0 /dev/mapper/system--host-isos /srv/isos ext3 rw,relatime,errors=continue,data=ordered 0 0 /dev/mapper/system--host-vs_a /var/lib/vservers/a ext3 rw,relatime,errors=continue,data=ordered 0 0 /dev/mapper/system--host-vs_b /var/lib/vservers/b ext3 rw,relatime,errors=continue,data=ordered 0 0 /dev/mapper/system--host-vs_c /var/lib/vservers/c ext3 rw,relatime,errors=continue,data=ordered 0 0 /dev/loop0 /media/debian-6.0.0-amd64-DVD-1 iso9660 ro,relatime 0 0 /dev/loop1 /media/debian-6.0.0-i386-DVD-1 iso9660 ro,relatime 0 0 /dev/mapper/system--host-vs_d /var/lib/vservers/d ext3 rw,relatime,errors=continue,data=ordered 0 0 root@kvmhost2:~# cat /proc/mdstat Personalities : [raid1] md1 : active raid1 sda2[0] sdb2[1] 975779968 blocks [2/2] [UU] md0 : active raid1 sda1[0] sdb1[1] 979840 blocks [2/2] [UU] unused devices: <none> root@kvmhost2:~# iostat -x Linux 2.6.32-5-vserver-amd64 (kvmhost2) 06/28/2012 _x86_64_ (8 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 3.09 0.14 2.92 1.51 0.00 92.35 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 23.25 161.12 7.46 37.90 855.27 1596.62 54.05 0.13 2.80 1.76 8.00 sdb 22.82 161.36 7.36 37.66 850.29 1596.62 54.35 0.54 12.01 1.80 8.09 md0 0.00 0.00 0.00 0.00 0.14 0.02 38.44 0.00 0.00 0.00 0.00 md1 0.00 0.00 53.55 198.16 768.01 1585.25 9.35 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.48 20.21 16.70 161.71 8.62 0.26 12.72 0.77 1.60 dm-1 0.00 0.00 3.62 10.03 28.94 80.21 8.00 0.19 13.68 1.59 2.17 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 9.17 0.00 9.64 6.42 0.00 dm-3 0.00 0.00 6.73 0.41 53.87 3.28 8.00 0.02 3.44 0.12 0.09 dm-4 0.00 0.00 17.45 18.18 139.57 145.47 8.00 0.42 11.81 0.76 2.69 dm-5 0.00 0.00 2.50 46.38 120.50 371.07 10.06 0.69 14.20 0.46 2.26 dm-6 0.00 0.00 0.02 0.10 0.67 0.81 12.53 0.01 75.53 18.58 0.22 dm-7 0.00 0.00 0.00 0.00 0.00 0.00 7.99 0.00 11.24 9.45 0.00 dm-8 0.00 0.00 22.69 102.76 407.25 822.09 9.80 0.97 7.71 0.39 4.95 dm-9 0.00 0.00 0.06 0.08 0.50 0.62 8.00 0.07 481.23 11.72 0.16 root@kvmhost2:~# ls -l /dev/mapper/ total 0 crw------- 1 root root 10, 59 May 11 11:19 control lrwxrwxrwx 1 root root 7 Jun 5 15:08 system--host-kvm1 -> ../dm-4 lrwxrwxrwx 1 root root 7 Jun 5 15:08 system--host-kvm2 -> ../dm-3 lrwxrwxrwx 1 root root 7 Jun 5 15:06 system--host-isos -> ../dm-2 lrwxrwxrwx 1 root root 7 May 11 11:19 system--host-root -> ../dm-0 lrwxrwxrwx 1 root root 7 Jun 5 15:06 system--host-swap -> ../dm-9 lrwxrwxrwx 1 root root 7 Jun 5 15:06 system--host-vs_d -> ../dm-8 lrwxrwxrwx 1 root root 7 Jun 5 15:06 system--host-vs_b -> ../dm-6 lrwxrwxrwx 1 root root 7 Jun 5 15:06 system--host-vs_c -> ../dm-7 lrwxrwxrwx 1 root root 7 Jun 5 15:06 system--host-vs_a -> ../dm-5 lrwxrwxrwx 1 root root 7 Jun 5 15:08 system--host-kvm3 -> ../dm-1 root@kvmhost2:~#

Read the article

How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt?

- by Stu Thompson

Prelude: I'm a code-monkey that's increasingly taken on SysAdmin duties for my small company. My code is our product, and increasingly we provide the same app as SaaS. About 18 months ago I moved our servers from a premium hosting centric vendor to a barebones rack pusher in a tier IV data center. (Literally across the street.) This ment doing much more ourselves--things like networking, storage and monitoring. As part the big move, to replace our leased direct attached storage from the hosting company, I built a 9TB two-node NAS based on SuperMicro chassises, 3ware RAID cards, Ubuntu 10.04, two dozen SATA disks, DRBD and . It's all lovingly documented in three blog posts: Building up & testing a new 9TB SATA RAID10 NFSv4 NAS: Part I, Part II and Part III. We also setup a Cacit monitoring system. Recently we've been adding more and more data points, like SMART values. I could not have done all this without the awesome boffins at ServerFault. It's been a fun and educational experience. My boss is happy (we saved bucket loads of $$$), our customers are happy (storage costs are down), I'm happy (fun, fun, fun). Until yesterday. Outage & Recovery: Some time after lunch we started getting reports of sluggish performance from our application, an on-demand streaming media CMS. About the same time our Cacti monitoring system sent a blizzard of emails. One of the more telling alerts was a graph of iostat await. Performance became so degraded that Pingdom began sending "server down" notifications. The overall load was moderate, there was not traffic spike. After logging onto the application servers, NFS clients of the NAS, I confirmed that just about everything was experiencing highly intermittent and insanely long IO wait times. And once I hopped onto the primary NAS node itself, the same delays were evident when trying to navigate the problem array's file system. Time to fail over, that went well. Within 20 minuts everything was confirmed to be back up and running perfectly. Post-Mortem: After any and all system failures I perform a post-mortem to determine the cause of the failure. First thing I did was ssh back into the box and start reviewing logs. It was offline, completely. Time for a trip to the data center. Hardware reset, backup an and running. In /var/syslog I found this scary looking entry: Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_00], 6 Currently unreadable (pending) sectors Nov 15 06:49:44 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_07], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 171 to 170 Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 16 Currently unreadable (pending) sectors Nov 15 06:49:45 umbilo smartd[2827]: Device: /dev/twa0 [3ware_disk_10], 4 Offline uncorrectable sectors Nov 15 06:49:45 umbilo smartd[2827]: Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error Nov 15 06:49:45 umbilo smartd[2827]: # 1 Short offline Completed: read failure 90% 6576 3421766910 Nov 15 06:49:45 umbilo smartd[2827]: # 2 Short offline Completed: read failure 90% 6087 3421766910 Nov 15 06:49:45 umbilo smartd[2827]: # 3 Short offline Completed: read failure 10% 5901 656821791 Nov 15 06:49:45 umbilo smartd[2827]: # 4 Short offline Completed: read failure 90% 5818 651637856 Nov 15 06:49:45 umbilo smartd[2827]: So I went to check the Cacti graphs for the disks in the array. Here we see that, yes, disk 7 is slipping away just like syslog says it is. But we also see that disk 8's SMART Read Erros are fluctuating. There are no messages about disk 8 in syslog. More interesting is that the fluctuating values for disk 8 directly correlate to the high IO wait times! My interpretation is that: Disk 8 is experiencing an odd hardware fault that results in intermittent long operation times. Somehow this fault condition on the disk is locking up the entire array Maybe there is a more accurate or correct description, but the net result has been that the one disk is impacting the performance of the whole array. The Question(s) How can a single disk in a hardware SATA RAID-10 array bring the entire array to a screeching halt? Am I being naïve to think that the RAID card should have dealt with this? How can I prevent a single misbehaving disk from impacting the entire array? Am I missing something?

Read the article

e2fsck extremly slow, although enough memory exists

- by kaefert

I've got this external USB-Disk: kaefert@blechmobil:~$ lsusb -s 2:3 Bus 002 Device 003: ID 0bc2:3320 Seagate RSS LLC As can be seen in this dmesg output, there are some problems that prevents that disk from beeing mounted: kaefert@blechmobil:~$ dmesg | grep sdb [ 114.474342] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB) [ 114.475089] sd 5:0:0:0: [sdb] Write Protect is off [ 114.475092] sd 5:0:0:0: [sdb] Mode Sense: 43 00 00 00 [ 114.475959] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 114.477093] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB) [ 114.501649] sdb: sdb1 [ 114.502717] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB) [ 114.504354] sd 5:0:0:0: [sdb] Attached SCSI disk [ 116.804408] EXT4-fs (sdb1): ext4_check_descriptors: Checksum for group 3976 failed (47397!=61519) [ 116.804413] EXT4-fs (sdb1): group descriptors corrupted! So I went and fired up my favorite partition manager - gparted, and told it to verify and repair the partition sdb1. This made gparted call e2fsck (version 1.42.4 (12-Jun-2012)) e2fsck -f -y -v /dev/sdb1 Although gparted called e2fsck with the "-v" option, sadly it doesn't show me the output of my e2fsck process (bugreport https://bugzilla.gnome.org/show_bug.cgi?id=467925 ) I started this whole thing on Sunday (2012-11-04_2200) evening, so about 48 hours ago, this is what htop says about it now (2012-11-06-1900): PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 3704 root 39 19 1560M 1166M 768 R 98.0 19.5 42h56:43 e2fsck -f -y -v /dev/sdb1 Now I found a few posts on the internet that discuss e2fsck running slow, for example: http://gparted-forum.surf4.info/viewtopic.php?id=13613 where they write that its a good idea to see if the disk is just that slow because maybe its damaged, and I think these outputs tell me that this is not the case in my case: kaefert@blechmobil:~$ sudo hdparm -tT /dev/sdb /dev/sdb: Timing cached reads: 3562 MB in 2.00 seconds = 1783.29 MB/sec Timing buffered disk reads: 82 MB in 3.01 seconds = 27.26 MB/sec kaefert@blechmobil:~$ sudo hdparm /dev/sdb /dev/sdb: multcount = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 364801/255/63, sectors = 5860533160, start = 0 However, although I can read quickly from that disk, this disk speed doesn't seem to be used by e2fsck, considering tools like gkrellm or iotop or this: kaefert@blechmobil:~$ iostat -x Linux 3.2.0-2-amd64 (blechmobil) 2012-11-06 _x86_64_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 14,24 47,81 14,63 0,95 0,00 22,37 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,59 8,29 2,42 5,14 43,17 160,17 53,75 0,30 39,80 8,72 54,42 3,95 2,99 sdb 137,54 5,48 9,23 0,20 587,07 22,73 129,35 0,07 7,70 7,51 16,18 2,17 2,04 Now I researched a little bit on how to find out what e2fsck is doing with all that processor time, and I found the tool strace, which gives me this: kaefert@blechmobil:~$ sudo strace -p3704 lseek(4, 41026998272, SEEK_SET) = 41026998272 write(4, "\212\354K[_\361\3nl\212\245\352\255jR\303\354\312Yv\334p\253r\217\265\3567\325\257\3766"..., 4096) = 4096 lseek(4, 48404766720, SEEK_SET) = 48404766720 read(4, "\7t\260\366\346\337\304\210\33\267j\35\377'\31f\372\252\ffU\317.y\211\360\36\240c\30`\34"..., 4096) = 4096 lseek(4, 41027002368, SEEK_SET) = 41027002368 write(4, "\232]7Ws\321\352\t\1@[+5\263\334\276{\343zZx\352\21\316`1\271[\202\350R`"..., 4096) = 4096 lseek(4, 48404770816, SEEK_SET) = 48404770816 read(4, "\17\362r\230\327\25\346//\210H\v\311\3237\323K\304\306\361a\223\311\324\272?\213\tq \370\24"..., 4096) = 4096 lseek(4, 41027006464, SEEK_SET) = 41027006464 write(4, "\367yy>x\216?=\324Z\305\351\376&\25\244\210\271\22\306}\276\237\370(\214\205G\262\360\257#"..., 4096) = 4096 lseek(4, 48404774912, SEEK_SET) = 48404774912 read(4, "\365\25\0\21|T\0\21}3t_\272\373\222k\r\177\303\1\201\261\221$\261B\232\3142\21U\316"..., 4096) = 4096 ^CProcess 3704 detached around 16 of these lines every second, so 4 read and 4 write operations every second, which I don't consider to be a lot.. And finally, my question: Will this process ever finish? If those numbers from fseek (48404774912) represent bytes, that would be something like 45 gigabytes, with this beeing a 3 terrabyte disk, which would give me 134 days to go, if the speed stays constant, and he scans the disk like this completly and only once. Do you have some advice for me? I have most of the data on that disk elsewhere, but I've put a lot of hours into sorting and merging it to this disk, so I would prefer to getting this disk up and running again, without formatting it anew. I don't think that the hardware is damaged since the disk is only a few months and since I can't see any I/O errors in the dmesg output. UPDATE: I just looked at the strace output again (2012-11-06_2300), now it looks like this: lseek(4, 1419860611072, SEEK_SET) = 1419860611072 read(4, "3#\f\2447\335\0\22A\355\374\276j\204'\207|\217V|\23\245[\7VP\251\242\276\207\317:"..., 4096) = 4096 lseek(4, 43018145792, SEEK_SET) = 43018145792 write(4, "]\206\231\342Y\204-2I\362\242\344\6R\205\361\324\177\265\317C\334V\324\260\334\275t=\10F."..., 4096) = 4096 lseek(4, 1419860615168, SEEK_SET) = 1419860615168 read(4, "\262\305\314Y\367\37x\326\245\226\226\320N\333$s\34\204\311\222\7\315\236\336\300TK\337\264\236\211n"..., 4096) = 4096 lseek(4, 43018149888, SEEK_SET) = 43018149888 write(4, "\271\224m\311\224\25!I\376\16;\377\0\223H\25Yd\201Y\342\r\203\271\24eG<\202{\373V"..., 4096) = 4096 lseek(4, 1419860619264, SEEK_SET) = 1419860619264 read(4, ";d\360\177\n\346\253\210\222|\250\352T\335M\33\260\320\261\7g\222P\344H?t\240\20\2548\310"..., 4096) = 4096 lseek(4, 43018153984, SEEK_SET) = 43018153984 write(4, "\360\252j\317\310\251G\227\335{\214`\341\267\31Y\202\360\v\374\307oq\3063\217Z\223\313\36D\211"..., 4096) = 4096 So this number of the lseeks before the reads, like 1419860619264 are already a lot bigger, standing for 1.29 terabytes if the numbers are bytes, so it doesn't seem to be a linear progress on a big scale, maybe there are only some areas that need work, that have big gaps in between them. (times are in CET)

Read the article

What is causing the unusual high load average?

- by James

I noticed on Tuesday night of last week, the load average went up sharply and it seemed abnormal since the traffic is small. Usually, the numbers usually average around .40 or lower and my server stuff (mysql, php and apache) are optimized. I noticed that the IOWait is unusually high even though the processes is barely using any CPU. top - 01:44:39 up 1 day, 21:13, 1 user, load average: 1.41, 1.09, 0.86 Tasks: 60 total, 1 running, 59 sleeping, 0 stopped, 0 zombie Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.0%us, 0.3%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 91.5%id, 8.5%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 331944k used, 716632k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.92 init 1656 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 9323 root 18 0 13652 5212 664 S 0 0.5 0:00.00 apache2 10079 root 18 0 3972 1248 972 S 0 0.1 0:00.00 su 10080 root 15 0 4612 1956 1448 S 0 0.2 0:00.01 bash 11298 root 15 0 13652 5212 664 S 0 0.5 0:00.00 apache2 11778 chikorit 15 0 2344 1092 884 S 0 0.1 0:00.05 top 15384 root 18 0 17544 13m 1568 S 0 1.3 0:02.28 miniserv.pl 15585 root 15 0 8280 2736 2168 S 0 0.3 0:00.02 sshd 15608 chikorit 15 0 8280 1436 860 S 0 0.1 0:00.02 sshd Here is the VMStat procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 768644 0 0 0 0 14 23 0 10 1 0 99 0 IOStat - Nothing unusal Total DISK READ: 67.13 K/s | Total DISK WRITE: 0.00 B/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 19496 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19501 be/4 mysql 3.95 K/s 0.00 B/s 0.00 % 0.00 % mysqld 19568 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19569 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19570 be/4 chikorit 11.85 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19571 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 19573 be/4 chikorit 7.90 K/s 0.00 B/s 0.00 % 0.00 % apache2 -k start 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 11778 be/4 chikorit 0.00 B/s 0.00 B/s 0.00 % 0.00 % top 19470 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.00 % mysqld Load Average Chart - http://i.stack.imgur.com/kYsD0.png I want to be sure if this is not a MySQL problem before making sure. Also, this is a Ubuntu 10.04 LTS Server on OpenVZ. Edit: This will probably give a good picture on the IO Wait top - 22:12:22 up 17:41, 1 user, load average: 1.10, 1.09, 0.93 Tasks: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie Cpu(s): 0.6%us, 0.2%sy, 0.0%ni, 89.0%id, 10.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 1048576k total, 260708k used, 787868k free, 0k buffers Swap: 0k total, 0k used, 0k free, 0k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1 root 15 0 2468 1376 1140 S 0 0.1 0:00.88 init 5849 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 8063 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 9732 root 16 0 8280 2728 2168 S 0 0.3 0:00.02 sshd 9746 chikorit 18 0 8412 1444 864 S 0 0.1 0:01.10 sshd 9747 chikorit 18 0 4576 1960 1488 S 0 0.2 0:00.24 bash 13706 chikorit 15 0 2344 1088 884 R 0 0.1 0:00.03 top 15745 chikorit 15 0 12968 5108 1280 S 0 0.5 0:00.00 apache2 15751 chikorit 15 0 72184 25m 18m S 0 2.5 0:00.37 php5-fpm 15790 chikorit 18 0 12472 4640 1192 S 0 0.4 0:00.00 apache2 15797 chikorit 15 0 72888 23m 16m S 0 2.3 0:00.06 php5-fpm 16038 root 15 0 67772 2848 592 D 0 0.3 0:00.00 php5-fpm 16309 syslog 18 0 24084 1316 992 S 0 0.1 0:00.07 rsyslogd 16316 root 15 0 5472 908 500 S 0 0.1 0:00.00 sshd 16326 root 15 0 2304 908 712 S 0 0.1 0:00.02 cron 17464 root 15 0 10252 7560 856 D 0 0.7 0:01.88 psad 17466 root 18 0 1684 276 208 S 0 0.0 0:00.31 psadwatchd 17559 root 18 0 11444 2020 732 S 0 0.2 0:00.47 sendmail-mta 17688 root 15 0 10252 5388 1136 S 0 0.5 0:03.81 python 17752 teamspea 19 0 44648 7308 4676 S 0 0.7 1:09.70 ts3server_linux 18098 root 15 0 12336 6380 3032 S 0 0.6 0:00.47 apache2 18099 chikorit 18 0 10368 2536 464 S 0 0.2 0:00.00 apache2 18120 ntp 15 0 4336 1316 984 S 0 0.1 0:00.87 ntpd 18379 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 18387 mysql 15 0 62796 36m 5864 S 0 3.6 1:43.26 mysqld 19584 root 15 0 12336 4028 668 S 0 0.4 0:00.02 apache2 22498 root 16 0 12336 4028 668 S 0 0.4 0:00.00 apache2 24260 root 15 0 67772 3612 1356 S 0 0.3 0:00.22 php5-fpm 27712 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 27730 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30343 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 30366 root 15 0 12336 4028 668 S 0 0.4 0:00.00 apache2 This is the free ram as of today. total used free shared buffers cached Mem: 1024 302 721 0 0 0 -/+ buffers/cache: 302 721 Swap: 0 0 0 Update: Looking into the logs, particularly the PHP5-FPM, which is causing the CPU spike. I found that its segment faulting for some apparent reason. [03-Jun-2012 06:11:20] NOTICE: [pool www] child 14132 started [03-Jun-2012 06:11:25] WARNING: [pool www] child 13664 exited on signal 11 (SIGSEGV) after 53.686322 seconds from start [03-Jun-2012 06:11:25] NOTICE: [pool www] child 14328 started [03-Jun-2012 06:11:25] WARNING: [pool www] child 14132 exited on signal 11 (SIGSEGV) after 4.708681 seconds from start [03-Jun-2012 06:11:25] NOTICE: [pool www] child 14329 started [03-Jun-2012 06:11:58] WARNING: [pool www] child 14328 exited on signal 11 (SIGSEGV) after 32.981228 seconds from start [03-Jun-2012 06:11:58] NOTICE: [pool www] child 15745 started [03-Jun-2012 06:12:25] WARNING: [pool www] child 15745 exited on signal 11 (SIGSEGV) after 27.442864 seconds from start [03-Jun-2012 06:12:25] NOTICE: [pool www] child 17446 started [03-Jun-2012 06:12:25] WARNING: [pool www] child 14329 exited on signal 11 (SIGSEGV) after 60.411278 seconds from start [03-Jun-2012 06:12:25] NOTICE: [pool www] child 17447 started [03-Jun-2012 06:13:02] WARNING: [pool www] child 17446 exited on signal 11 (SIGSEGV) after 36.746793 seconds from start [03-Jun-2012 06:13:02] NOTICE: [pool www] child 18133 started [03-Jun-2012 06:13:48] WARNING: [pool www] child 17447 exited on signal 11 (SIGSEGV) after 82.710107 seconds from start I'm thinking that this might be causing the problem. If that is the cause, probably switching it off that to fastcgi/fcgid might resolve it... but still, I want to see if something else might be causing it to do this.

Read the article

e2fsck extremely slow, although enough memory exists

- by kaefert

I've got this external USB-Disk: kaefert@blechmobil:~$ lsusb -s 2:3 Bus 002 Device 003: ID 0bc2:3320 Seagate RSS LLC As can be seen in this dmesg output, there is some problem that prevents that disk from beeing mounted: kaefert@blechmobil:~$ dmesg ... [ 113.084079] usb 2-1: new high-speed USB device number 3 using ehci_hcd [ 113.217783] usb 2-1: New USB device found, idVendor=0bc2, idProduct=3320 [ 113.217787] usb 2-1: New USB device strings: Mfr=2, Product=3, SerialNumber=1 [ 113.217790] usb 2-1: Product: Expansion Desk [ 113.217792] usb 2-1: Manufacturer: Seagate [ 113.217794] usb 2-1: SerialNumber: NA4J4N6K [ 113.435404] usbcore: registered new interface driver uas [ 113.455315] Initializing USB Mass Storage driver... [ 113.468051] scsi5 : usb-storage 2-1:1.0 [ 113.468180] usbcore: registered new interface driver usb-storage [ 113.468182] USB Mass Storage support registered. [ 114.473105] scsi 5:0:0:0: Direct-Access Seagate Expansion Desk 070B PQ: 0 ANSI: 6 [ 114.474342] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB) [ 114.475089] sd 5:0:0:0: [sdb] Write Protect is off [ 114.475092] sd 5:0:0:0: [sdb] Mode Sense: 43 00 00 00 [ 114.475959] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 114.477093] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB) [ 114.501649] sdb: sdb1 [ 114.502717] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB) [ 114.504354] sd 5:0:0:0: [sdb] Attached SCSI disk [ 116.804408] EXT4-fs (sdb1): ext4_check_descriptors: Checksum for group 3976 failed (47397!=61519) [ 116.804413] EXT4-fs (sdb1): group descriptors corrupted! ... So I went and fired up my favorite partition manager - gparted, and told it to verify and repair the partition sdb1. This made gparted call e2fsck (version 1.42.4 (12-Jun-2012)) e2fsck -f -y -v /dev/sdb1 Although gparted called e2fsck with the "-v" option, sadly it doesn't show me the output of my e2fsck process (bugreport https://bugzilla.gnome.org/show_bug.cgi?id=467925 ) I started this whole thing on Sunday (2012-11-04_2200) evening, so about 48 hours ago, this is what htop says about it now (2012-11-06-1900): PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command 3704 root 39 19 1560M 1166M 768 R 98.0 19.5 42h56:43 e2fsck -f -y -v /dev/sdb1 Now I found a few posts on the internet that discuss e2fsck running slow, for example: http://gparted-forum.surf4.info/viewtopic.php?id=13613 where they write that its a good idea to see if the disk is just that slow because maybe its damaged, and I think these outputs tell me that this is not the case in my case: kaefert@blechmobil:~$ sudo hdparm -tT /dev/sdb /dev/sdb: Timing cached reads: 3562 MB in 2.00 seconds = 1783.29 MB/sec Timing buffered disk reads: 82 MB in 3.01 seconds = 27.26 MB/sec kaefert@blechmobil:~$ sudo hdparm /dev/sdb /dev/sdb: multcount = 0 (off) readonly = 0 (off) readahead = 256 (on) geometry = 364801/255/63, sectors = 5860533160, start = 0 However, although I can read quickly from that disk, this disk speed doesn't seem to be used by e2fsck, considering tools like gkrellm or iotop or this: kaefert@blechmobil:~$ iostat -x Linux 3.2.0-2-amd64 (blechmobil) 2012-11-06 _x86_64_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 14,24 47,81 14,63 0,95 0,00 22,37 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0,59 8,29 2,42 5,14 43,17 160,17 53,75 0,30 39,80 8,72 54,42 3,95 2,99 sdb 137,54 5,48 9,23 0,20 587,07 22,73 129,35 0,07 7,70 7,51 16,18 2,17 2,04 Now I researched a little bit on how to find out what e2fsck is doing with all that processor time, and I found the tool strace, which gives me this: kaefert@blechmobil:~$ sudo strace -p3704 lseek(4, 41026998272, SEEK_SET) = 41026998272 write(4, "\212\354K[_\361\3nl\212\245\352\255jR\303\354\312Yv\334p\253r\217\265\3567\325\257\3766"..., 4096) = 4096 lseek(4, 48404766720, SEEK_SET) = 48404766720 read(4, "\7t\260\366\346\337\304\210\33\267j\35\377'\31f\372\252\ffU\317.y\211\360\36\240c\30`\34"..., 4096) = 4096 lseek(4, 41027002368, SEEK_SET) = 41027002368 write(4, "\232]7Ws\321\352\t\1@[+5\263\334\276{\343zZx\352\21\316`1\271[\202\350R`"..., 4096) = 4096 lseek(4, 48404770816, SEEK_SET) = 48404770816 read(4, "\17\362r\230\327\25\346//\210H\v\311\3237\323K\304\306\361a\223\311\324\272?\213\tq \370\24"..., 4096) = 4096 lseek(4, 41027006464, SEEK_SET) = 41027006464 write(4, "\367yy>x\216?=\324Z\305\351\376&\25\244\210\271\22\306}\276\237\370(\214\205G\262\360\257#"..., 4096) = 4096 lseek(4, 48404774912, SEEK_SET) = 48404774912 read(4, "\365\25\0\21|T\0\21}3t_\272\373\222k\r\177\303\1\201\261\221$\261B\232\3142\21U\316"..., 4096) = 4096 ^CProcess 3704 detached around 16 of these lines every second, so 4 read and 4 write operations every second, which I don't consider to be a lot.. And finally, my question: Will this process ever finish? If those numbers from fseek (48404774912) represent bytes, that would be something like 45 gigabytes, with this beeing a 3 terrabyte disk, which would give me 134 days to go, if the speed stays constant, and e2fsck scans the disk like this completly and only once. Do you have some advice for me? I have most of the data on that disk elsewhere, but I've put a lot of hours into sorting and merging it to this disk, so I would prefer to getting this disk up and running again, without formatting it anew. I don't think that the hardware is damaged since the disk is only a few months and since I can't see any I/O errors in the dmesg output. UPDATE: I just looked at the strace output again (2012-11-06_2300), now it looks like this: lseek(4, 1419860611072, SEEK_SET) = 1419860611072 read(4, "3#\f\2447\335\0\22A\355\374\276j\204'\207|\217V|\23\245[\7VP\251\242\276\207\317:"..., 4096) = 4096 lseek(4, 43018145792, SEEK_SET) = 43018145792 write(4, "]\206\231\342Y\204-2I\362\242\344\6R\205\361\324\177\265\317C\334V\324\260\334\275t=\10F."..., 4096) = 4096 lseek(4, 1419860615168, SEEK_SET) = 1419860615168 read(4, "\262\305\314Y\367\37x\326\245\226\226\320N\333$s\34\204\311\222\7\315\236\336\300TK\337\264\236\211n"..., 4096) = 4096 lseek(4, 43018149888, SEEK_SET) = 43018149888 write(4, "\271\224m\311\224\25!I\376\16;\377\0\223H\25Yd\201Y\342\r\203\271\24eG<\202{\373V"..., 4096) = 4096 lseek(4, 1419860619264, SEEK_SET) = 1419860619264 read(4, ";d\360\177\n\346\253\210\222|\250\352T\335M\33\260\320\261\7g\222P\344H?t\240\20\2548\310"..., 4096) = 4096 lseek(4, 43018153984, SEEK_SET) = 43018153984 write(4, "\360\252j\317\310\251G\227\335{\214`\341\267\31Y\202\360\v\374\307oq\3063\217Z\223\313\36D\211"..., 4096) = 4096 So the numbers in the lseek lines before the reads, like 1419860619264 are already a lot bigger, standing for 1.29 terabytes if those numbers are bytes, so it doesn't seem to be a linear progress on a big scale, maybe there are only some areas that need work, that have big gaps in between them. UPDATE2: Okey, big disappointment, the numbers are back to very small again (2012-11-07_0720) lseek(4, 52174548992, SEEK_SET) = 52174548992 read(4, "\374\312\22\\\325\215\213\23\0357U\222\246\370v^f(\312|f\212\362\343\375\373\342\4\204mU6"..., 4096) = 4096 lseek(4, 46603526144, SEEK_SET) = 46603526144 write(4, "\370\261\223\227\23?\4\4\217\264\320_Am\246CQ\313^\203U\253\274\204\277\2564n\227\177\267\343"..., 4096) = 4096 so either e2fsck goes over the data multiple times, or it just hops back and forth multiple times. Or my assumption that those numbers are bytes is wrong. UPDATE3: Since it's mentioned here http://forums.fedoraforum.org/showthread.php?t=282125&page=2 that you can testisk while e2fsck is running, i tried that, though not with a lot of success. When asking testdisk to display the data of my partition, this is what I get: TestDisk 6.13, Data Recovery Utility, November 2011 Christophe GRENIER <[email protected]> http://www.cgsecurity.org 1 P Linux 0 4 5 45600 40 8 732566272 Can't open filesystem. Filesystem seems damaged. And this is what strace currently gives me (2012-11-07_1030) lseek(4, 212460343296, SEEK_SET) = 212460343296 read(4, "\315Mb\265v\377Gn \24\f\205EHh\2349~\330\273\203\3375\206\10\r3=W\210\372\352"..., 4096) = 4096 lseek(4, 47347830784, SEEK_SET) = 47347830784 write(4, "]\204\223\300I\357\4\26\33+\243\312G\230\250\371*m2U\t_\215\265J \252\342Pm\360D"..., 4096) = 4096 (times are in CET)

Read the article

ZFS for Database Log Files

- by user12620111

I've been troubled by drop outs in CPU usage in my application server, characterized by the CPUs suddenly going from close to 90% CPU busy to almost completely CPU idle for a few seconds. Here is an example of a drop out as shown by a snippet of vmstat data taken while the application server is under a heavy workload. # vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr s3 s4 s5 s6 in sy cs us sy id 1 0 0 130160176 116381952 0 16 0 0 0 0 0 0 0 0 0 207377 117715 203884 70 21 9 12 0 0 130160160 116381936 0 25 0 0 0 0 0 0 0 0 0 200413 117162 197250 70 20 9 11 0 0 130160176 116381920 0 16 0 0 0 0 0 0 1 0 0 203150 119365 200249 72 21 7 8 0 0 130160176 116377808 0 19 0 0 0 0 0 0 0 0 0 169826 96144 165194 56 17 27 0 0 0 130160176 116377800 0 16 0 0 0 0 0 0 0 0 1 10245 9376 9164 2 1 97 0 0 0 130160176 116377792 0 16 0 0 0 0 0 0 0 0 2 15742 12401 14784 4 1 95 0 0 0 130160176 116377776 2 16 0 0 0 0 0 0 1 0 0 19972 17703 19612 6 2 92 14 0 0 130160176 116377696 0 16 0 0 0 0 0 0 0 0 0 202794 116793 199807 71 21 8 9 0 0 130160160 116373584 0 30 0 0 0 0 0 0 18 0 0 203123 117857 198825 69 20 11 This behavior occurred consistently while the application server was processing synthetic transactions: HTTP requests from JMeter running on an external machine. I explored many theories trying to explain the drop outs, including: Unexpected JMeter behavior Network contention Java Garbage Collection Application Server thread pool problems Connection pool problems Database transaction processing Database I/O contention Graphing the CPU %idle led to a breakthrough: Several of the drop outs were 30 seconds apart. With that insight, I went digging through the data again and looking for other outliers that were 30 seconds apart. In the database server statistics, I found spikes in the iostat "asvc_t" (average response time of disk transactions, in milliseconds) for the disk drive that was being used for the database log files. Here is an example: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 2053.6 0.0 8234.3 0.0 0.2 0.0 0.1 0 24 c3t60080E5...F4F6d0s0 0.0 2162.2 0.0 8652.8 0.0 0.3 0.0 0.1 0 28 c3t60080E5...F4F6d0s0 0.0 1102.5 0.0 10012.8 0.0 4.5 0.0 4.1 0 69 c3t60080E5...F4F6d0s0 0.0 74.0 0.0 7920.6 0.0 10.0 0.0 135.1 0 100 c3t60080E5...F4F6d0s0 0.0 568.7 0.0 6674.0 0.0 6.4 0.0 11.2 0 90 c3t60080E5...F4F6d0s0 0.0 1358.0 0.0 5456.0 0.0 0.6 0.0 0.4 0 55 c3t60080E5...F4F6d0s0 0.0 1314.3 0.0 5285.2 0.0 0.7 0.0 0.5 0 70 c3t60080E5...F4F6d0s0 Here is a little more information about my database configuration: The database and application server were running on two different SPARC servers. Storage for the database was on a storage array connected via 8 gigabit Fibre Channel Data storage and log file were on different physical disk drives Reliable low latency I/O is provided by battery backed NVRAM Highly available: Two Fibre Channel links accessed via MPxIO Two Mirrored cache controllers The log file physical disks were mirrored in the storage device Database log files on a ZFS Filesystem with cutting-edge technologies, such as copy-on-write and end-to-end checksumming Why would I be getting service time spikes in my high-end storage? First, I wanted to verify that the database log disk service time spikes aligned with the application server CPU drop outs, and they did: At first, I guessed that the disk service time spikes might be related to flushing the write through cache on the storage device, but I was unable to validate that theory. After searching the WWW for a while, I decided to try using a separate log device: # zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0 The ZFS log device is configured in a similar manner as described above: two physical disks mirrored in the storage array. This change to the database storage configuration eliminated the application server CPU drop outs: Here is the zpool configuration: # zpool status ZFS-db-41 pool: ZFS-db-41 state: ONLINE scan: none requested config: NAME STATE ZFS-db-41 ONLINE c3t60080E5...F4F6d0 ONLINE logs c3t60080E5...F8A7d0 ONLINE Now, the I/O spikes look like this: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1053.5 0.0 4234.1 0.0 0.8 0.0 0.7 0 75 c3t60080E5...F8A7d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1131.8 0.0 4555.3 0.0 0.8 0.0 0.7 0 76 c3t60080E5...F8A7d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1167.6 0.0 4682.2 0.0 0.7 0.0 0.6 0 74 c3t60080E5...F8A7d0s0 0.0 162.2 0.0 19153.9 0.0 0.7 0.0 4.2 0 12 c3t60080E5...F4F6d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1247.2 0.0 4992.6 0.0 0.7 0.0 0.6 0 71 c3t60080E5...F8A7d0s0 0.0 41.0 0.0 70.0 0.0 0.1 0.0 1.6 0 2 c3t60080E5...F4F6d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1241.3 0.0 4989.3 0.0 0.8 0.0 0.6 0 75 c3t60080E5...F8A7d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1193.2 0.0 4772.9 0.0 0.7 0.0 0.6 0 71 c3t60080E5...F8A7d0s0 We can see the steady flow of 4k writes to the ZIL device from O_SYNC database log file writes. The spikes are from flushing the transaction group. Like almost all problems that I run into, once I thoroughly understand the problem, I find that other people have documented similar experiences. Thanks to all of you who have documented alternative approaches. Saved for another day: now that the problem is obvious, I should try "zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning Guide. References: The ZFS Intent Log Solaris ZFS, Synchronous Writes and the ZIL Explained ZFS Evil Tuning Guide: Cache Flushes ZFS Evil Tuning Guide: Tuning ZFS for Database Performance

Read the article

Kernel oops on Linux running in VirtualBox breaks some IO-related functionality on the server

- by Kristoffer E

We are having problems with CentOS release 6.3 running in VirtualBox on Windows 7 machines. The symptoms are the following: Everything works as normal for several hours, even days. Then something happens which breaks the system. What we still can do after this something happens: Access the web server Use existing SSH sessions to run top and free What does not work: Starting new SSH sessions (hangs after username and password is entered) Running ls in existing SSH sessions (hangs) SSI includes from our web servers that fetch data from remote machines probably more What we see on the server when this something happens is the following: Load average go from basically nothing to around 3 CPU usage is still low (5%) Disk activity is low (running iostat) Plenty of memory available Plenty of disk space available In /var/log/messages we get the following: Jun 14 01:10:48 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang Jun 14 01:10:48 devvm kernel: Tx Queue <0> Jun 14 01:10:48 devvm kernel: TDH <2e> Jun 14 01:10:48 devvm kernel: TDT <30> Jun 14 01:10:48 devvm kernel: next_to_use <30> Jun 14 01:10:48 devvm kernel: next_to_clean <2e> Jun 14 01:10:48 devvm kernel: buffer_info[next_to_clean] Jun 14 01:10:48 devvm kernel: time_stamp <1038284db> Jun 14 01:10:48 devvm kernel: next_to_watch <2f> Jun 14 01:10:48 devvm kernel: jiffies <103828b42> Jun 14 01:10:48 devvm kernel: next_to_watch.status <0> Jun 14 01:10:50 devvm kernel: e1000 0000:00:03.0: eth0: Detected Tx Unit Hang Jun 14 01:10:50 devvm kernel: Tx Queue <0> Jun 14 01:10:50 devvm kernel: TDH <2e> Jun 14 01:10:50 devvm kernel: TDT <30> Jun 14 01:10:50 devvm kernel: next_to_use <30> Jun 14 01:10:50 devvm kernel: next_to_clean <2e> Jun 14 01:10:50 devvm kernel: buffer_info[next_to_clean] Jun 14 01:10:50 devvm kernel: time_stamp <1038284db> Jun 14 01:10:50 devvm kernel: next_to_watch <2f> Jun 14 01:10:50 devvm kernel: jiffies <103829312> Jun 14 01:10:50 devvm kernel: next_to_watch.status <0> Jun 14 01:10:52 devvm kernel: ------------[ cut here ]------------ Jun 14 01:10:52 devvm kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted) Jun 14 01:10:52 devvm kernel: Hardware name: VirtualBox Jun 14 01:10:52 devvm kernel: NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out Jun 14 01:10:52 devvm kernel: Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Jun 14 01:10:52 devvm kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1 Jun 14 01:10:52 devvm kernel: Call Trace: Jun 14 01:10:52 devvm kernel: <IRQ> [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0 Jun 14 01:10:52 devvm kernel: [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50 Jun 14 01:10:52 devvm kernel: [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280 Jun 14 01:10:52 devvm kernel: [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110 Jun 14 01:10:52 devvm kernel: [<ffffffff81459390>] ? dev_watchdog+0x0/0x280 Jun 14 01:10:52 devvm kernel: [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340 Jun 14 01:10:52 devvm kernel: [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0 Jun 14 01:10:52 devvm kernel: [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30 Jun 14 01:10:52 devvm kernel: [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0 Jun 14 01:10:52 devvm kernel: [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250 Jun 14 01:10:52 devvm kernel: [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 Jun 14 01:10:52 devvm kernel: [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 Jun 14 01:10:52 devvm kernel: [<ffffffff81073ca5>] ? irq_exit+0x85/0x90 Jun 14 01:10:52 devvm kernel: [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b Jun 14 01:10:52 devvm kernel: [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20 Jun 14 01:10:52 devvm kernel: <EOI> [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10 Jun 14 01:10:52 devvm kernel: [<ffffffff810149cd>] ? default_idle+0x4d/0xb0 Jun 14 01:10:52 devvm kernel: [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110 Jun 14 01:10:52 devvm kernel: [<ffffffff814e433a>] ? rest_init+0x7a/0x80 Jun 14 01:10:52 devvm kernel: [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430 Jun 14 01:10:52 devvm kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129 Jun 14 01:10:52 devvm kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109 Jun 14 01:10:52 devvm kernel: ---[ end trace 2c7bb984812cf120 ]--- Jun 14 01:10:52 devvm kernel: e1000 0000:00:03.0: eth0: Reset adapter Jun 14 01:10:53 devvm abrtd: Directory 'oops-2013-06-14-01:10:53-1537-0' creation detected Jun 14 01:10:53 devvm abrt-dump-oops: Reported 1 kernel oopses to Abrt Jun 14 01:10:53 devvm abrtd: Can't open file '/var/spool/abrt/oops-2013-06-14-01:10:53-1537-0/uid': No such file or directory Jun 14 01:10:55 devvm kernel: Bridge firewalling registered After this we see for a while, every two minutes: Jun 14 01:14:22 devvm kernel: INFO: task events/0:19 blocked for more than 120 seconds. Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 14 01:14:22 devvm kernel: events/0 D 0000000000000000 0 19 2 0x00000000 Jun 14 01:14:22 devvm kernel: ffff880116c4fb90 0000000000000046 00000000ffffffff 0000000000000008 Jun 14 01:14:22 devvm kernel: 0000000000016680 0000000000016680 ffff880028210400 0000000000016680 Jun 14 01:14:22 devvm kernel: ffff880116c4daf8 ffff880116c4ffd8 000000000000fb88 ffff880116c4daf8 Jun 14 01:14:22 devvm kernel: Call Trace: Jun 14 01:14:22 devvm kernel: [<ffffffff8105b483>] ? perf_event_task_sched_out+0x33/0x80 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100975d>] ? __switch_to+0x13d/0x320 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180 Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d093>] __cancel_work_timer+0x1b3/0x1e0 Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d0f0>] cancel_work_sync+0x10/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffffa01c5ca5>] e1000_down_and_stop+0x25/0x50 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffffa01cb695>] e1000_down+0x155/0x200 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbcb0>] ? e1000_reset_task+0x0/0xe0 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffffa01cbd1e>] e1000_reset_task+0x6e/0xe0 [e1000] Jun 14 01:14:22 devvm kernel: [<ffffffff8108c760>] worker_thread+0x170/0x2a0 Jun 14 01:14:22 devvm kernel: [<ffffffff810920d0>] ? autoremove_wake_function+0x0/0x40 Jun 14 01:14:22 devvm kernel: [<ffffffff8108c5f0>] ? worker_thread+0x0/0x2a0 Jun 14 01:14:22 devvm kernel: [<ffffffff81091d66>] kthread+0x96/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20 Jun 14 01:14:22 devvm kernel: INFO: task parted:8069 blocked for more than 120 seconds. Jun 14 01:14:22 devvm kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jun 14 01:14:22 devvm kernel: parted D 0000000000000003 0 8069 7994 0x00000080 Jun 14 01:14:22 devvm kernel: ffff8800908b3bb8 0000000000000082 0000000000000000 ffff88010ab50080 Jun 14 01:14:22 devvm kernel: ffff880116c7d500 0000000000000001 0000000000000000 0000000000000000 Jun 14 01:14:22 devvm kernel: ffff88010ab50638 ffff8800908b3fd8 000000000000fb88 ffff88010ab50638 Jun 14 01:14:22 devvm kernel: Call Trace: Jun 14 01:14:22 devvm kernel: [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe323>] wait_for_common+0x123/0x180 Jun 14 01:14:22 devvm kernel: [<ffffffff81060250>] ? default_wake_function+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8112b6d0>] ? lru_add_drain_per_cpu+0x0/0x10 Jun 14 01:14:22 devvm kernel: [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d177>] flush_work+0x77/0xc0 Jun 14 01:14:22 devvm kernel: [<ffffffff8108cbe0>] ? wq_barrier_func+0x0/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff8108d2f3>] schedule_on_each_cpu+0x133/0x180 Jun 14 01:14:22 devvm kernel: [<ffffffff811ad440>] ? invalidate_bh_lru+0x0/0x50 Jun 14 01:14:22 devvm kernel: [<ffffffff8112ae35>] lru_add_drain_all+0x15/0x20 Jun 14 01:14:22 devvm kernel: [<ffffffff811adf6a>] invalidate_bdev+0x2a/0x50 Jun 14 01:14:22 devvm kernel: [<ffffffff8125e9a4>] blkdev_ioctl+0x3b4/0x6e0 Jun 14 01:14:22 devvm kernel: [<ffffffff811b381c>] block_ioctl+0x3c/0x40 Jun 14 01:14:22 devvm kernel: [<ffffffff8118dec2>] vfs_ioctl+0x22/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8118e064>] do_vfs_ioctl+0x84/0x580 Jun 14 01:14:22 devvm kernel: [<ffffffff8118e5e1>] sys_ioctl+0x81/0xa0 Jun 14 01:14:22 devvm kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b In /var/spool/abrt/oops-2013-06-14-01:10:53-1537-0 we can see the following information: In backtrace: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted) Hardware name: VirtualBox NETDEV WATCHDOG: eth0 (e1000): transmit queue 0 timed out Modules linked in: vboxsf(U) ipv6 ppdev parport_pc parport microcode sg vboxguest(U) i2c_piix4 i2c_core e1000 snd_intel8x0 snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc pcnet32 mii ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1 Call Trace: <IRQ> [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0 [<ffffffff8106b836>] ? warn_slowpath_fmt+0x46/0x50 [<ffffffff814595fd>] ? dev_watchdog+0x26d/0x280 [<ffffffff81099138>] ? sched_clock_cpu+0xb8/0x110 [<ffffffff81459390>] ? dev_watchdog+0x0/0x280 [<ffffffff8107e897>] ? run_timer_softirq+0x197/0x340 [<ffffffff810a21c0>] ? tick_sched_timer+0x0/0xc0 [<ffffffff8102b40d>] ? lapic_next_event+0x1d/0x30 [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0 [<ffffffff81096c50>] ? hrtimer_interrupt+0x140/0x250 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0 [<ffffffff81073ca5>] ? irq_exit+0x85/0x90 [<ffffffff81505be0>] ? smp_apic_timer_interrupt+0x70/0x9b [<ffffffff8100bc13>] ? apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff810387cb>] ? native_safe_halt+0xb/0x10 [<ffffffff810149cd>] ? default_idle+0x4d/0xb0 [<ffffffff81009e06>] ? cpu_idle+0xb6/0x110 [<ffffffff814e433a>] ? rest_init+0x7a/0x80 [<ffffffff81c21f7b>] ? start_kernel+0x424/0x430 [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129 [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109 In cmdline: ro root=/dev/mapper/vg_01-lv_root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=sv-latin1 rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=vg_01/lv_root crashkernel=129M@0M rhgb quiet rd_LVM_LV=vg_01/lv_swap rd_NO_DM rhgb quie Additional information: # uname -a Linux devvm 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux # cat /etc/redhat-release CentOS release 6.3 (Final) VirtualBox version 4.2.6. Any insight in how we can proceed with troubleshooting this is appreciated. If you need more information, just let me know.

Read the article

High Linux loads on low CPU/memory usage

- by user13323

Hi. I have quite strange situation, where my CentOS 5.5 box loads are high, but the CPU and memory used are pretty low: top - 20:41:38 up 42 days, 6:14, 2 users, load average: 19.79, 21.25, 18.87 Tasks: 254 total, 1 running, 253 sleeping, 0 stopped, 0 zombie Cpu(s): 3.8%us, 0.3%sy, 0.1%ni, 95.0%id, 0.6%wa, 0.0%hi, 0.1%si, 0.0%st Mem: 4035284k total, 4008084k used, 27200k free, 38748k buffers Swap: 4208928k total, 242576k used, 3966352k free, 1465008k cached free -mt total used free shared buffers cached Mem: 3940 3910 29 0 37 1427 -/+ buffers/cache: 2445 1495 Swap: 4110 236 3873 Total: 8050 4147 3903 Iostat also shows good results: avg-cpu: %user %nice %system %iowait %steal %idle 3.83 0.13 0.41 0.58 0.00 95.05 Here is the ps aux output: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.0 0.0 10348 80 ? Ss 2010 2:11 init [3] root 2 0.0 0.0 0 0 ? S< 2010 0:00 [migration/0] root 3 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/0] root 4 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/0] root 5 0.0 0.0 0 0 ? S< 2010 0:02 [migration/1] root 6 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/1] root 7 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/1] root 8 0.0 0.0 0 0 ? S< 2010 0:02 [migration/2] root 9 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/2] root 10 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/2] root 11 0.0 0.0 0 0 ? S< 2010 0:02 [migration/3] root 12 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/3] root 13 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/3] root 14 0.0 0.0 0 0 ? S< 2010 0:03 [migration/4] root 15 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/4] root 16 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/4] root 17 0.0 0.0 0 0 ? S< 2010 0:01 [migration/5] root 18 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/5] root 19 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/5] root 20 0.0 0.0 0 0 ? S< 2010 0:11 [migration/6] root 21 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/6] root 22 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/6] root 23 0.0 0.0 0 0 ? S< 2010 0:01 [migration/7] root 24 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/7] root 25 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/7] root 26 0.0 0.0 0 0 ? S< 2010 0:00 [migration/8] root 27 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/8] root 28 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/8] root 29 0.0 0.0 0 0 ? S< 2010 0:00 [migration/9] root 30 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/9] root 31 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/9] root 32 0.0 0.0 0 0 ? S< 2010 0:08 [migration/10] root 33 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/10] root 34 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/10] root 35 0.0 0.0 0 0 ? S< 2010 0:05 [migration/11] root 36 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/11] root 37 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/11] root 38 0.0 0.0 0 0 ? S< 2010 0:02 [migration/12] root 39 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/12] root 40 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/12] root 41 0.0 0.0 0 0 ? S< 2010 0:14 [migration/13] root 42 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/13] root 43 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/13] root 44 0.0 0.0 0 0 ? S< 2010 0:04 [migration/14] root 45 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/14] root 46 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/14] root 47 0.0 0.0 0 0 ? S< 2010 0:01 [migration/15] root 48 0.0 0.0 0 0 ? SN 2010 0:00 [ksoftirqd/15] root 49 0.0 0.0 0 0 ? S< 2010 0:00 [watchdog/15] root 50 0.0 0.0 0 0 ? S< 2010 0:00 [events/0] root 51 0.0 0.0 0 0 ? S< 2010 0:00 [events/1] root 52 0.0 0.0 0 0 ? S< 2010 0:00 [events/2] root 53 0.0 0.0 0 0 ? S< 2010 0:00 [events/3] root 54 0.0 0.0 0 0 ? S< 2010 0:00 [events/4] root 55 0.0 0.0 0 0 ? S< 2010 0:00 [events/5] root 56 0.0 0.0 0 0 ? S< 2010 0:00 [events/6] root 57 0.0 0.0 0 0 ? S< 2010 0:00 [events/7] root 58 0.0 0.0 0 0 ? S< 2010 0:00 [events/8] root 59 0.0 0.0 0 0 ? S< 2010 0:00 [events/9] root 60 0.0 0.0 0 0 ? S< 2010 0:00 [events/10] root 61 0.0 0.0 0 0 ? S< 2010 0:00 [events/11] root 62 0.0 0.0 0 0 ? S< 2010 0:00 [events/12] root 63 0.0 0.0 0 0 ? S< 2010 0:00 [events/13] root 64 0.0 0.0 0 0 ? S< 2010 0:00 [events/14] root 65 0.0 0.0 0 0 ? S< 2010 0:00 [events/15] root 66 0.0 0.0 0 0 ? S< 2010 0:00 [khelper] root 107 0.0 0.0 0 0 ? S< 2010 0:00 [kthread] root 126 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/0] root 127 0.0 0.0 0 0 ? S< 2010 0:03 [kblockd/1] root 128 0.0 0.0 0 0 ? S< 2010 0:01 [kblockd/2] root 129 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/3] root 130 0.0 0.0 0 0 ? S< 2010 0:05 [kblockd/4] root 131 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/5] root 132 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/6] root 133 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/7] root 134 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/8] root 135 0.0 0.0 0 0 ? S< 2010 0:02 [kblockd/9] root 136 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/10] root 137 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/11] root 138 0.0 0.0 0 0 ? S< 2010 0:04 [kblockd/12] root 139 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/13] root 140 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/14] root 141 0.0 0.0 0 0 ? S< 2010 0:00 [kblockd/15] root 142 0.0 0.0 0 0 ? S< 2010 0:00 [kacpid] root 281 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/0] root 282 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/1] root 283 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/2] root 284 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/3] root 285 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/4] root 286 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/5] root 287 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/6] root 288 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/7] root 289 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/8] root 290 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/9] root 291 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/10] root 292 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/11] root 293 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/12] root 294 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/13] root 295 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/14] root 296 0.0 0.0 0 0 ? S< 2010 0:00 [cqueue/15] root 299 0.0 0.0 0 0 ? S< 2010 0:00 [khubd] root 301 0.0 0.0 0 0 ? S< 2010 0:00 [kseriod] root 490 0.0 0.0 0 0 ? S 2010 0:00 [khungtaskd] root 493 0.1 0.0 0 0 ? S< 2010 94:48 [kswapd1] root 494 0.0 0.0 0 0 ? S< 2010 0:00 [aio/0] root 495 0.0 0.0 0 0 ? S< 2010 0:00 [aio/1] root 496 0.0 0.0 0 0 ? S< 2010 0:00 [aio/2] root 497 0.0 0.0 0 0 ? S< 2010 0:00 [aio/3] root 498 0.0 0.0 0 0 ? S< 2010 0:00 [aio/4] root 499 0.0 0.0 0 0 ? S< 2010 0:00 [aio/5] root 500 0.0 0.0 0 0 ? S< 2010 0:00 [aio/6] root 501 0.0 0.0 0 0 ? S< 2010 0:00 [aio/7] root 502 0.0 0.0 0 0 ? S< 2010 0:00 [aio/8] root 503 0.0 0.0 0 0 ? S< 2010 0:00 [aio/9] root 504 0.0 0.0 0 0 ? S< 2010 0:00 [aio/10] root 505 0.0 0.0 0 0 ? S< 2010 0:00 [aio/11] root 506 0.0 0.0 0 0 ? S< 2010 0:00 [aio/12] root 507 0.0 0.0 0 0 ? S< 2010 0:00 [aio/13] root 508 0.0 0.0 0 0 ? S< 2010 0:00 [aio/14] root 509 0.0 0.0 0 0 ? S< 2010 0:00 [aio/15] root 665 0.0 0.0 0 0 ? S< 2010 0:00 [kpsmoused] root 808 0.0 0.0 0 0 ? S< 2010 0:00 [ata/0] root 809 0.0 0.0 0 0 ? S< 2010 0:00 [ata/1] root 810 0.0 0.0 0 0 ? S< 2010 0:00 [ata/2] root 811 0.0 0.0 0 0 ? S< 2010 0:00 [ata/3] root 812 0.0 0.0 0 0 ? S< 2010 0:00 [ata/4] root 813 0.0 0.0 0 0 ? S< 2010 0:00 [ata/5] root 814 0.0 0.0 0 0 ? S< 2010 0:00 [ata/6] root 815 0.0 0.0 0 0 ? S< 2010 0:00 [ata/7] root 816 0.0 0.0 0 0 ? S< 2010 0:00 [ata/8] root 817 0.0 0.0 0 0 ? S< 2010 0:00 [ata/9] root 818 0.0 0.0 0 0 ? S< 2010 0:00 [ata/10] root 819 0.0 0.0 0 0 ? S< 2010 0:00 [ata/11] root 820 0.0 0.0 0 0 ? S< 2010 0:00 [ata/12] root 821 0.0 0.0 0 0 ? S< 2010 0:00 [ata/13] root 822 0.0 0.0 0 0 ? S< 2010 0:00 [ata/14] root 823 0.0 0.0 0 0 ? S< 2010 0:00 [ata/15] root 824 0.0 0.0 0 0 ? S< 2010 0:00 [ata_aux] root 842 0.0 0.0 0 0 ? S< 2010 0:00 [scsi_eh_0] root 843 0.0 0.0 0 0 ? S< 2010 0:00 [scsi_eh_1] root 844 0.0 0.0 0 0 ? S< 2010 0:00 [scsi_eh_2] root 845 0.0 0.0 0 0 ? S< 2010 0:00 [scsi_eh_3] root 846 0.0 0.0 0 0 ? S< 2010 0:00 [scsi_eh_4] root 847 0.0 0.0 0 0 ? S< 2010 0:00 [scsi_eh_5] root 882 0.0 0.0 0 0 ? S< 2010 0:00 [kstriped] root 951 0.0 0.0 0 0 ? S< 2010 4:24 [kjournald] root 976 0.0 0.0 0 0 ? S< 2010 0:00 [kauditd] postfix 990 0.0 0.0 54208 2284 ? S 21:19 0:00 pickup -l -t fifo -u root 1013 0.0 0.0 12676 8 ? S<s 2010 0:00 /sbin/udevd -d root 1326 0.0 0.0 90900 3400 ? Ss 14:53 0:00 sshd: root@notty root 1410 0.0 0.0 53972 2108 ? Ss 14:53 0:00 /usr/libexec/openssh/sftp-server root 2690 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/0] root 2691 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/1] root 2692 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/2] root 2693 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/3] root 2694 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/4] root 2695 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/5] root 2696 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/6] root 2697 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/7] root 2698 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/8] root 2699 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/9] root 2700 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/10] root 2701 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/11] root 2702 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/12] root 2703 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/13] root 2704 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/14] root 2705 0.0 0.0 0 0 ? S< 2010 0:00 [kmpathd/15] root 2706 0.0 0.0 0 0 ? S< 2010 0:00 [kmpath_handlerd] root 2755 0.0 0.0 0 0 ? S< 2010 4:35 [kjournald] root 2757 0.0 0.0 0 0 ? S< 2010 3:38 [kjournald] root 2759 0.0 0.0 0 0 ? S< 2010 4:10 [kjournald] root 2761 0.0 0.0 0 0 ? S< 2010 4:26 [kjournald] root 2763 0.0 0.0 0 0 ? S< 2010 3:15 [kjournald] root 2765 0.0 0.0 0 0 ? S< 2010 3:04 [kjournald] root 2767 0.0 0.0 0 0 ? S< 2010 3:02 [kjournald] root 2769 0.0 0.0 0 0 ? S< 2010 2:58 [kjournald] root 2771 0.0 0.0 0 0 ? S< 2010 0:00 [kjournald] root 3340 0.0 0.0 5908 356 ? Ss 2010 2:48 syslogd -m 0 root 3343 0.0 0.0 3804 212 ? Ss 2010 0:03 klogd -x root 3430 0.0 0.0 0 0 ? S< 2010 0:50 [kondemand/0] root 3431 0.0 0.0 0 0 ? S< 2010 0:54 [kondemand/1] root 3432 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/2] root 3433 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/3] root 3434 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/4] root 3435 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/5] root 3436 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/6] root 3437 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/7] root 3438 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/8] root 3439 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/9] root 3440 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/10] root 3441 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/11] root 3442 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/12] root 3443 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/13] root 3444 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/14] root 3445 0.0 0.0 0 0 ? S< 2010 0:00 [kondemand/15] root 3461 0.0 0.0 10760 284 ? Ss 2010 3:44 irqbalance rpc 3481 0.0 0.0 8052 4 ? Ss 2010 0:00 portmap root 3526 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/0] root 3527 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/1] root 3528 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/2] root 3529 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/3] root 3530 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/4] root 3531 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/5] root 3532 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/6] root 3533 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/7] root 3534 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/8] root 3535 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/9] root 3536 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/10] root 3537 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/11] root 3538 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/12] root 3539 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/13] root 3540 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/14] root 3541 0.0 0.0 0 0 ? S< 2010 0:00 [rpciod/15] root 3563 0.0 0.0 10160 8 ? Ss 2010 0:00 rpc.statd root 3595 0.0 0.0 55180 4 ? Ss 2010 0:00 rpc.idmapd dbus 3618 0.0 0.0 21256 28 ? Ss 2010 0:00 dbus-daemon --system root 3649 0.2 0.4 563084 18796 ? S<sl 2010 179:03 mfsmount /mnt/mfs -o rw,mfsmaster=web1.ovs.local root 3702 0.0 0.0 3800 8 ? Ss 2010 0:00 /usr/sbin/acpid 68 3715 0.0 0.0 31312 816 ? Ss 2010 3:14 hald root 3716 0.0 0.0 21692 28 ? S 2010 0:00 hald-runner 68 3726 0.0 0.0 12324 8 ? S 2010 0:00 hald-addon-acpi: listening on acpid socket /var/run/acpid.socket 68 3730 0.0 0.0 12324 8 ? S 2010 0:00 hald-addon-keyboard: listening on /dev/input/event0 root 3773 0.0 0.0 62608 332 ? Ss 2010 0:00 /usr/sbin/sshd ganglia 3786 0.0 0.0 24704 988 ? Ss 2010 14:26 /usr/sbin/gmond root 3843 0.0 0.0 54144 300 ? Ss 2010 1:49 /usr/libexec/postfix/master postfix 3855 0.0 0.0 54860 1060 ? S 2010 0:22 qmgr -l -t fifo -u root 3877 0.0 0.0 74828 708 ? Ss 2010 1:15 crond root 3891 1.4 1.9 326960 77704 ? S<l 2010 896:59 mfschunkserver root 4122 0.0 0.0 18732 176 ? Ss 2010 0:10 /usr/sbin/atd root 4193 0.0 0.8 129180 35984 ? Ssl 2010 11:04 /usr/bin/ruby /usr/sbin/puppetd root 4223 0.0 0.0 18416 172 ? S 2010 0:10 /usr/sbin/smartd -q never root 4227 0.0 0.0 3792 8 tty1 Ss+ 2010 0:00 /sbin/mingetty tty1 root 4230 0.0 0.0 3792 8 tty2 Ss+ 2010 0:00 /sbin/mingetty tty2 root 4231 0.0 0.0 3792 8 tty3 Ss+ 2010 0:00 /sbin/mingetty tty3 root 4233 0.0 0.0 3792 8 tty4 Ss+ 2010 0:00 /sbin/mingetty tty4 root 4234 0.0 0.0 3792 8 tty5 Ss+ 2010 0:00 /sbin/mingetty tty5 root 4236 0.0 0.0 3792 8 tty6 Ss+ 2010 0:00 /sbin/mingetty tty6 root 5596 0.0 0.0 19368 20 ? Ss 2010 0:00 DarwinStreamingServer qtss 5597 0.8 0.9 166572 37408 ? Sl 2010 523:02 DarwinStreamingServer root 8714 0.0 0.0 0 0 ? S Jan31 0:33 [pdflush] root 9914 0.0 0.0 65612 968 pts/1 R+ 21:49 0:00 ps aux root 10765 0.0 0.0 76792 1080 ? Ss Jan24 0:58 SCREEN root 10766 0.0 0.0 66212 872 pts/3 Ss Jan24 0:00 /bin/bash root 11833 0.0 0.0 63852 1060 pts/3 S+ 17:17 0:00 /bin/sh ./launch.sh root 11834 437 42.9 4126884 1733348 pts/3 Sl+ 17:17 1190:50 /usr/bin/java -Xms128m -Xmx512m -XX:+UseConcMarkSweepGC -jar /JavaCore/JavaCore.jar root 13127 4.7 1.1 110564 46876 ? Ssl 17:18 12:55 /JavaCore/fetcher.bin root 19392 0.0 0.0 90108 3336 ? Rs 20:35 0:00 sshd: root@pts/1 root 19401 0.0 0.0 66216 1640 pts/1 Ss 20:35 0:00 -bash root 20567 0.0 0.0 90108 412 ? Ss Jan16 1:58 sshd: root@pts/0 root 20569 0.0 0.0 66084 912 pts/0 Ss Jan16 0:00 -bash root 21053 0.0 0.0 63856 28 ? S Jan30 0:00 /bin/sh /usr/bin/WowzaMediaServerd /usr/local/WowzaMediaServer/bin/setenv.sh /var/run/WowzaM root 21054 2.9 10.3 2252652 418468 ? Sl Jan30 314:25 java -Xmx1200M -server -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote=true - root 21915 0.0 0.0 0 0 ? S Feb01 0:00 [pdflush] root 29996 0.0 0.0 76524 1004 pts/0 S+ 14:41 0:00 screen -x Any idea what could this be, or where I should look for more diagnostic information? Thanks.

Search Results

Search found 69 results on 3 pages for 'iostat'.

Page 3/3 | < Previous Page | 1 2 3

- by NullUser

- by Ivan

- by Karma Fusebox

- by Daniel

- by Android5360

- by Aaron Digulla

- by Daniel

- by bart613

- by Leonid Bugaev

- by James

- by murdoch

- by David Schmitt

- by Stu Thompson

- by kaefert

- by James

- by kaefert

- by user12620111

- by Kristoffer E

- by user13323

< Previous Page | 1 2 3