jemmille - Developer IT

iSCSI, failover and XenServer

- by jemmille

I have an iSCSI fail over implementation setup so if one of my storage units fails the other takes over immediately (it also runs the NFS shares). When fail over occurs, volumes are exported, the IP is switched to the other machine and the targets are reconfigured. The fail over of the storage system itself works just fine. I use NexentaStor for my filer. When I do a test (manual) fail over of my storage the following occurs: Note: I run the admin VM's on NFS and customer based VM's on iSCSI All NFS based VM's remain up and working perfectly through the failover and after All VM 's running on iSCSI eventually report the following: An error about not being able to write to a particular block An error about journaling not working Then the file system goes RO To get the VM's working again I have to do the following: Force shutdown of the "broken" VM's. Detach the iSCSI SR Re-attach the iSCSI SR Boot the VM on a different server (5 in my pool) If I don't boot on a different server I get this error "Internal error: Failure("The VDI <uuid> is already attached in RW mode; it can't be attached in RO mode!")" The only way I have found to fix that error is to reboot the entire server it was running on previously which is obviously a huge pain. Currently multipathing is NOT enabled (but can be and the same thing still occurs). I have edited much of the /etc/iscsid.conf file to work with the timeout settings but to no avail. In short, my storage fails over properly but XenServer does not keep the connection alive. As a thought, the error that shows up in #4 above might be the ultimate cause and fixing that would fix everything? Any help would be appreciated more than you know.

Read the article

ZFS - zpool ARC cache plus L2ARC benchmarking

- by jemmille

I have been doing lots of I/O testing on a ZFS system I will eventually use to serve virtual machines. I thought I would try adding SSD's for use as cache to see how much faster I can get the read speed. I also have 24GB of RAM in the machine that acts as ARC. vol0 is 6.4TB and the cache disks are 60GB SSD's. The zvol is as follows: pool: vol0 state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM vol0 ONLINE 0 0 0 c1t8d0 ONLINE 0 0 0 cache c3t5001517958D80533d0 ONLINE 0 0 0 c3t5001517959092566d0 ONLINE 0 0 0 The issue is I'm not seeing any difference with the SSD's installed. I've tried bonnie++ benchmarks and some simple dd commands to write a file then read the file. I have run benchmarks before and after adding the SSD's. I've ensured the file sizes are at least double my RAM so there is no way it can all get cached locally. Am I missing something here? When am I going to see benefits of having all that cache? Am I simply not under these circumstances? Are the benchmark programs not good for testing the effect of cache because of the the way (and what) it writes and reads?

Read the article

iSCSI, failover and XenServer

- by jemmille

I have an iSCSI fail over implementation setup so if one of my storage units fails the other takes over immediately (it also runs the NFS shares). When fail over occurs, volumes are exported, the IP is switched to the other machine and the targets are reconfigured. The fail over of the storage system itself works just fine. I use NexentaStor for my filer. When I do a test (manual) fail over of my storage the following occurs: Note: I run the admin VM's on NFS and customer based VM's on iSCSI All NFS based VM's remain up and working perfectly through the failover and after All VM 's running on iSCSI eventually report the following: An error about not being able to write to a particular block An error about journaling not working Then the file system goes RO To get the VM's working again I have to do the following: Force shutdown of the "broken" VM's. Detach the iSCSI SR Re-attach the iSCSI SR Boot the VM on a different server (5 in my pool) If I don't boot on a different server I get this error "Internal error: Failure("The VDI <uuid> is already attached in RW mode; it can't be attached in RO mode!")" The only way I have found to fix that error is to reboot the entire server it was running on previously which is obviously a huge pain. Currently multipathing is NOT enabled (but can be and the same thing still occurs). I have edited much of the /etc/iscsid.conf file to work with the timeout settings but to no avail. In short, my storage fails over properly but XenServer does not keep the connection alive. As a thought, the error that shows up in #4 above might be the ultimate cause and fixing that would fix everything? Any help would be appreciated more than you know.

Read the article

Apache process consuming all memory on the server

- by jemmille

I have an apache process that suddenly appears on a particular server. When it shows up it starts consuming memory at a very rapid rate, then moves on to all the swap. In all it consumes about 11GB (including swap) of memory and the server eventually becomes unresponsive. The load on the server is under 1 at all other times. The process runs as nobody and I am having a hard time tracking down the source. If i run an strace on the process and all it did was continuously dump out mprotect over and over again If i run lsof -p <pid>, I get this, but only sometimes: httpd 19229 nobody 152u IPv4 175050 crawl-66-249-67-216.googlebot.com:62336 (CLOSE_WAIT) httpd 19229 nobody 153u IPv4 179104 crawl-66-249-71-167.googlebot.com:58012 (ESTABLISHED) As long as I catch it, I can kill the process and the server almost immediately stabilizes. I have on site on the server that is getting a few thousand hits a a day that I think might be the source, but I still can't find the exact reason. Also, this is a cPanel server and I have upcp'd the server, rebuilt apache with easy apache, and rebuilt httpd.conf. It is not spawing any related processes, meaning I can find any php, mysql, cgi, etc. processes that relate to this process. It's just a loner process that balloons fast and consumes ever last MB of memory. This is on a XenServer 5.6 based VM. No other servers in the cluster are having this issue.

Read the article

SAS Expanders vs Direct Attached (SAS)?

- by jemmille

I have a storage unit with 2 backplanes. One backplane holds 24 disks, one backplane holds 12 disks. Each backplane is independently connected to a SFF-8087 port (4 channel/12Gbit) to the raid card. Here is where my question really comes in. Can or how easily can a backplane be overloaded? All the disks in the machine are WD RE4 WD1003FBYX (black) drives that have average writes at 115MB/sec and average read of 125 MB/sec I know things would vary based on the raid or filesystem we put on top of that but it seems to be that a 24 disk backplane with only one SFF-8087 connector should be able to overload the bus to a point that might actually slow it down? Based on my math, if I had a RAID0 across all 24 disks and asked for a large file, I should, in theory should get 24*115 MB/sec wich translates to 22.08 GBit/sec of total throughput. Either I'm confused or this backplane is horribly designed, at least in a perfomance environment. I'm looking at switching to a model where each drive has it's own channel from the backplane (and new HBA's or raid card). EDIT: more details We have used both pure linux (centos), open solaris, software raid, hardware raid, EXT3/4, ZFS. Here are some examples using bonnie++ 4 Disk RAID-0, ZFS WRITE CPU RE-WRITE CPU READ CPU RND-SEEKS 194MB/s 19% 92MB/s 11% 200MB/s 8% 310/sec 194MB/s 19% 93MB/s 11% 201MB/s 8% 312/sec --------- ---- --------- ---- --------- ---- --------- 389MB/s 19% 186MB/s 11% 402MB/s 8% 311/sec 8 Disk RAID-0, ZFS WRITE CPU RE-WRITE CPU READ CPU RND-SEEKS 324MB/s 32% 164MB/s 19% 346MB/s 13% 466/sec 324MB/s 32% 164MB/s 19% 348MB/s 14% 465/sec --------- ---- --------- ---- --------- ---- --------- 648MB/s 32% 328MB/s 19% 694MB/s 13% 465/sec 12 Disk RAID-0, ZFS WRITE CPU RE-WRITE CPU READ CPU RND-SEEKS 377MB/s 38% 191MB/s 22% 429MB/s 17% 537/sec 376MB/s 38% 191MB/s 22% 427MB/s 17% 546/sec --------- ---- --------- ---- --------- ---- --------- 753MB/s 38% 382MB/s 22% 857MB/s 17% 541/sec Now 16 Disk RAID-0, it's gets interesting WRITE CPU RE-WRITE CPU READ CPU RND-SEEKS 359MB/s 34% 186MB/s 22% 407MB/s 18% 1397/sec 358MB/s 33% 186MB/s 22% 407MB/s 18% 1340/sec --------- ---- --------- ---- --------- ---- --------- 717MB/s 33% 373MB/s 22% 814MB/s 18% 1368/sec 20 Disk RAID-0, ZFS WRITE CPU RE-WRITE CPU READ CPU RND-SEEKS 371MB/s 37% 188MB/s 22% 450MB/s 19% 775/sec 370MB/s 37% 188MB/s 22% 447MB/s 19% 797/sec --------- ---- --------- ---- --------- ---- --------- 741MB/s 37% 376MB/s 22% 898MB/s 19% 786/sec 24 Disk RAID-1, ZFS WRITE CPU RE-WRITE CPU READ CPU RND-SEEKS 347MB/s 34% 193MB/s 22% 447MB/s 19% 907/sec 347MB/s 34% 192MB/s 23% 446MB/s 19% 933/sec --------- ---- --------- ---- --------- ---- --------- 694MB/s 34% 386MB/s 22% 894MB/s 19% 920/sec 28 Disk RAID-0, ZFS 32 Disk RAID-0, ZFS 36 Disk RAID-0, ZFS More details: Here is the exact unit: http://www.supermicro.com/products/chassis/4U/847/SC847E1-R1400U.cfm

Developer IT