Linux HA cluster with Xen, Heartbeat, Pacemaker: domU does not fail over to secondary node
- by Kendall
I am having the following problem with an OpenSuSE + Heartbeat + Pacemaker + Xen HA cluster: when the node a Xen domU is running on is "dead",
the domU is not restarted on the second node.
The cluster is set up with two nodes, each running OpenSuSE-11.3, Heartbeat 3.0, and Pacemaker 1.0 in CRM mode.  For storage I am using a LUN on an iSCSI
SAN device; the LUN is formatted with OCFS2 and managed with LVM.  The Xen domU has two logical volumes: one for root and the other for swap.
I am using IPMI cards as STONITH devices, and a dedicated Ethernet link for heartbeat communications.
The ha.cf file is as follows:
    
    keepalive 1
    deadtime 10
    warntime 5
    udpport 694
    ucast eth1 
    auto_failback off
    node dhcp-166
    node stage
    use_logd yes
    crm yes
    
My resources look as follows:
  
    crm(live)configure# show
    node $id="5c1aa924-bba4-4f95-a367-6c9a58ac4a38" dhcp-166
    node $id="cebc92eb-af24-4833-aaf0-672adf80b58e" stage
    primitive Xen-Util ocf:heartbeat:Xen \
      meta target-role="Started" \
      operations $id="Xen-Util-operations" \
      op start interval="0" timeout="60" start-delay="0" \
      op stop interval="0" timeout="120" \
      params xmfile="/etc/xen/vm/xen-util"
    primitive my-stonith stonith:external/ipmi \
      params hostname="dhcp-166" ipaddr="192.168.3.106" userid="ADMIN" passwd="xxx" \
      op monitor interval="2m" timeout="60s"
    primitive my-stonith2 stonith:external/ipmi \
      params hostname="stage" ipaddr="192.168.3.105" userid="ADMIN" passwd="xxx" \
      op monitor interval="2m" timeout="60s"
    property $id="cib-bootstrap-options" \
      dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
      cluster-infrastructure="Heartbeat"
  
The Xen domU config file is as follows:
  
    name = "xen-util"
    bootloader = "/usr/lib/xen/boot/domUloader.py"
    #bootargs = "xvda1:/vmlinuz-xen,/initrd-xen"
    bootargs = "--entry=xvda1:/boot/vmlinuz-xen,/boot/initrd-xen"
    memory = 4096
    disk = [ 'phy:vg_xen/xen-util-root,xvda1,w',
      'phy:vg_xen/xen-util-swap,xvda2,w', ]
    root = "/dev/xvda1"
    vif = [ 'mac=00:16:3e:42:42:06' ]
    #vfb = [ 'type=vnc,vncunused=0,vnclisten=192.168.3.172' ]
    extra = ""
  
Say domU "Xen-Util" is running on node "stage"; if "stage" goes down, "Xen-Util" does not restart on node "dhcp-166".  It does appear to try:
"xm list" shows it for a few seconds, and "xm console xen-util" gives a message like "copying /boot/kernel.gz from xvda1
to /var/lib/xen/tmp/kernel.a53gs for booting".  However, it never gets past that; it eventually gives up and no longer appears in "xm list".
When node "stage" comes back online after being power cycled, it detects that "Xen-Util" isn't running and starts it (on "stage").
I've also tried starting "Xen-Util" on node "dhcp-166" without the cluster running, and it works fine with no problems, so I know the domU itself can run on that node.
Any ideas?  Thanks!