Hello,
I have a system configured following the wiki instructions. The system has a SAS-based RAID5 array, and I'm getting ~150 MB/s write rates inside the VM when writing 6 GB (six 1 GB blocks of zeros) from /dev/zero to a file with dd. The nodes are connected by gigabit Ethernet.
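For reference, the write test looks like the sketch below (the target path /tmp/ddtest is just an example, and the sizes are scaled down here to bs=1M count=64 so it runs anywhere; the actual test used bs=1G count=6). conv=fdatasync forces a flush before dd reports the rate, so the figure reflects disk throughput rather than page-cache speed:

```shell
# Scaled-down version of the in-VM throughput test.
# The original run used bs=1G count=6 (~6 GB of zeros).
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fdatasync
```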
While performing the write, Remus bails out. The same scenario repeats under "real" heavy I/O load.
(cannot post links, remove space in URLs)
/var/log/messages pastebin. com/SdRJRmg4
remus.out pastebin. com/LGvyyhnb (snipped)
I'm using 40ms barriers.
From my reading of the logs, it seems that, as one might expect, the link between the machines becomes saturated:
- The ifs queue becomes unavailable (?)
- DRBD figures it got disconnected from the other node
- DRBD reconnects, and gets turned down again
- DRBD reconnects and starts syncing
- After a little more than 1 minute, DRBD eventually syncs
- Meanwhile, Remus apparently times out waiting for the end of the barrier and exits uncleanly, leaving the ifs queue behind.
- DRBD never gets back to dual Primary mode
Does anyone have an idea of how best to deal with this scenario? Is it possible to increase the Remus timeout, or to forcefully limit the write rate of the DRBD block device?
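On the DRBD side, one knob I'm aware of is the resync rate cap in drbd.conf, which would at least keep the post-disconnect resynchronisation from competing with Remus traffic on the same gigabit link. A sketch for DRBD 8.x, assuming a resource named r0 (the rate value is illustrative, and note this limits background resync only, not ordinary replicated writes):

```
resource r0 {
  syncer {
    rate 30M;   # cap background resync at ~30 MB/s, leaving headroom on the 1 Gb link
  }
}
```

Whether the normal replication stream itself can be throttled is exactly what I'm asking about.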