Shriram,
thanks for all the suggestions. I have tried them all, and remus still does not replicate two VMs. Sometimes the two remus replications run for a few seconds before they abort; usually one remus aborts and the other continues for a few seconds before also aborting, leaving the two VMs in a state where I can no longer xm destroy them. Sometimes running /etc/init.d/xend restart on both nodes fixes it, but sometimes I just have to reboot dom0 on both nodes.
As part of upgrading to DRBD 8.3.11, I also updated to CentOS 6.3, Linux kernel 3.4.32-6.el6.x86_64, and Xen 4.2.2
remus -i 100 vm1 node2 > /var/log/vm1.log 2>&1 &
remus -i 100 vm2 node2 > /var/log/vm2.log 2>&1 &
vm1.log:
PROF: flushed memory at 1379585535.750035
PROF: suspending at 1379585535.838686
issuing HVM suspend hypercall
suspend hypercall returned 0
pausing QEMU
PROF: resumed at 1379585535.849508
resuming QEMU
Sending 5873 bytes of QEMU state
PROF: flushed memory at 1379585535.852089
PROF: suspending at 1379585535.946905
issuing HVM suspend hypercall
suspend hypercall returned 0
domain 1 not shut down
xc: error: Suspend request failed: Internal error
xc: error: Domain appears not to have suspended: Internal error
PROF: resumed at 1379585535.967212
resuming QEMU
vm2.log:
PROF: flushed memory at 1379585536.483855
PROF: suspending at 1379585536.575694
issuing HVM suspend hypercall
suspend hypercall returned 0
pausing QEMU
PROF: resumed at 1379585536.583224
resuming QEMU
Sending 5873 bytes of QEMU state
PROF: flushed memory at 1379585536.585965
PROF: suspending at 1379585536.679800
issuing HVM suspend hypercall
suspend hypercall returned 0
domain 2 not shut down
xc: error: Suspend request failed: Internal error
xc: error: Domain appears not to have suspended: Internal error
qemu logdirty mode: disable
PROF: resumed at 1379585536.688845
resuming QEMU
xend.log:
[2013-09-19 06:12:15 3318] INFO (XendDomainInfo:2079) Domain has shutdown: name=vm1 id=1 reason=suspend.
[2013-09-19 06:12:16 3318] INFO (XendDomainInfo:2079) Domain has shutdown: name=vm2 id=2 reason=suspend.
After remus exits, vm1 and vm2 exist on both nodes (node1 and node2), and I get several messages on the dom0 console that look as follows:
INFO: task qemu-dm: N blocked for more than 120 seconds.
"echo - > /proc/sys/kernel/hung_task_timeout_secs" disables this message
After a couple of minutes I get another series of messages on the dom0 console:
node1:
block drbd2: [drbd2_worker/N] sock_sendmsg time expired, ko =3
block drbd2: [drbd2_worker/N] sock_sendmsg time expired, ko =2
block drbd2: meta connection shutdown by peer.
block drbd2: sock_sendmsg returned -104
block drbd2: error receiving Data, l: 4120!
block drbd2: Split-Brain detected but unresolved, dropping connection!
block drbd2: error receiving ReportState, l: 4!
node2:
block drbd2: [drbd2_worker/N] sock_sendmsg time expired, ko =3
block drbd2: [drbd2_worker/N] sock_sendmsg time expired, ko =2
block drbd2: error receiving Data, l: 4120!
block drbd2: Split-Brain detected but unresolved, dropping connection!
block drbd2: meta connection shutdown by peer.
block drbd2: error receiving ReportState, l: 4!
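(For now I reconnect by hand after the split-brain; DRBD 8.3 can also be told to auto-resolve it. A sketch for the net section of the resource in drbd.conf -- the resource name r2 is my guess for the drbd2 device, and discarding one side's changes is lossy, so this is only reasonable on a test setup:)

```
# drbd.conf fragment (sketch): automatic split-brain recovery policies.
# The option names are real DRBD 8.3 net options; the resource name
# is assumed. discard-secondary throws away data written on the
# secondary, so use with care.
resource r2 {
  net {
    after-sb-0pri discard-zero-changes;   # no primaries: keep the side that has changes
    after-sb-1pri discard-secondary;      # one primary: drop the secondary's changes
    after-sb-2pri disconnect;             # two primaries: refuse, stay disconnected
  }
}
```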
I am now trying different DRBD sync rates to see if DRBD protocol D takes too much of the available bandwidth, but I doubt this is a performance/resource issue, since:
1) I can xm migrate --live vm1 node2 & xm migrate --live vm2 node2 & without any problem.
2) I have in the past been able to run remus vm1 node2 > /var/log/vm1.log 2>&1 & and remus vm2 node1 > /var/log/vm2.log 2>&1 & without any problem, that is, two remus replications in opposite directions.
3) I can remus one VM for several days, but when I start the second remus, not only does the second remus abort, the first one, which had been running for days, also aborts, sometimes even before the second one does.
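(For what it's worth, I currently notice the aborts with a trivial grep wrapper over the remus logs, sketched below; the function name and paths are made up, only the error string is taken from the logs above.)

```shell
#!/bin/sh
# Hypothetical watchdog sketch (not part of remus): scan a remus log
# for the suspend failure shown above, so a monitoring job can notice
# the abort and alert or restart replication.
check_remus_log() {
    # returns 0 if the log looks healthy, 1 if the suspend error appears
    if grep -q "Domain appears not to have suspended" "$1"; then
        echo "remus aborted: suspend failure in $1"
        return 1
    fi
    return 0
}

# demo on a scratch copy of the error line from vm1.log
printf 'xc: error: Domain appears not to have suspended: Internal error\n' \
    > /tmp/remus-demo.log
check_remus_log /tmp/remus-demo.log || true
```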
Any more ideas are greatly appreciated.
uhakansson