Quorum, How does it work? SVC and Storwize

Happy New Year to all. I’ve been getting a few questions relating to quorum devices – in particular the IP quorum, and what happens in various failure scenarios – so I thought it was worth detailing things here. First, some background.

What are quorum devices?

All Spectrum Virtualize systems are clusters. Even a single control enclosure of a Storwize system is a 2 node cluster. The system uses a voting set of nodes to ensure that the majority of the cluster nodes continue when there is a failure. This is fine when there is a larger number of nodes remaining compared to those that have failed or are missing. The majority always wins, so even if there is just a communication failure between parts of the system, if 5 nodes out of 8 can see each other, those 5 win; and even if the other 3 nodes can see each other, they know they would need at least 4 before a tie break (quorum) device comes into play.
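
To make the counting concrete, here is a minimal Python sketch of the majority rule described above. It is purely illustrative – the function name and numbers are mine, not anything from the product code.

```python
# Illustrative sketch of the node-count majority rule: a partition can
# continue on its own only if it holds a strict majority of the voting set.

def has_majority(visible_nodes: int, total_voting_nodes: int) -> bool:
    """True if this partition can continue without consulting a tie break."""
    return visible_nodes > total_voting_nodes // 2

# With 8 voting nodes: 5 visible nodes win outright, 3 cannot win at all,
# and 4 is the even split where the tie break (quorum) device decides.
assert has_majority(5, 8) is True
assert has_majority(3, 8) is False
assert has_majority(4, 8) is False   # even split -> tie break needed
```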

The real fun begins when there is an even split in the cluster – most commonly caused by a split-brain scenario, where the communication between the two halves of the cluster fails. With 8 nodes, this means 4 nodes in each half can see each other, so neither half has a majority. At this point the active quorum device is used as the tie break – effectively a 9th member of the voting set – and whichever half locks the quorum wins and continues.
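
The tie break itself can be thought of as a simple exclusive lock, as in the sketch below. This is a simplification with made-up names, not the real locking protocol – the point is just that the first half to lock the active quorum gains the deciding vote.

```python
# Sketch of the tie break as a single exclusive lock on the active quorum
# device (illustrative only).

class ActiveQuorumDevice:
    def __init__(self) -> None:
        self._locked_by = None

    def try_lock(self, partition_id: str) -> bool:
        """First partition to lock wins the tie break; later callers lose."""
        if self._locked_by is None:
            self._locked_by = partition_id
            return True
        return self._locked_by == partition_id

quorum = ActiveQuorumDevice()
print(quorum.try_lock("half_A"))   # True  -> this half gains the extra vote and continues
print(quorum.try_lock("half_B"))   # False -> this half halts I/O
```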

The system automatically defines 3 quorum devices from the attached storage. For SVC, this is 3 mdisks, which it attempts to spread over 3 different storage controllers. For Storwize, this is 3 ‘used’ drives (i.e. not unused drives – only drives that are part of arrays or are spares).

In addition, more recently we added the ability to define up to 5 additional IP Quorum devices. These are applications running on servers that are IP connected to the same management IP network as the cluster itself. This is most commonly used when deploying a Stretched or HyperSwap cluster.

Now this means you could have up to 8 quorum devices. So how does the cluster know which one to use? Well, at any given time only one quorum device is the active quorum. Every node in the cluster knows which is the active voting set of nodes, and which is the active quorum. The active quorum can only be changed when all nodes in the active voting set agree and can confirm the change.
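
That "all nodes must agree" rule matters in the failure examples later on, so here is a tiny sketch of it. Again, the helper and node names are made up purely for illustration.

```python
# Sketch: the active quorum can only be re-assigned while every node in the
# active voting set is reachable and can confirm the change.

def can_change_active_quorum(voting_set: list[str], reachable: set[str]) -> bool:
    return all(node in reachable for node in voting_set)

voting_set = ["node1", "node2", "node3", "node4"]
print(can_change_active_quorum(voting_set, {"node1", "node2", "node3", "node4"}))  # True
print(can_change_active_quorum(voting_set, {"node3", "node4"}))                    # False - no change possible
```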

Worked Examples

Let’s go through an example to explain some of the uses and failure scenarios.

Assume we have a 4 node cluster (Storwize V7000) made up of two control enclosures, deployed as a HyperSwap solution – so one control enclosure (2 nodes) at each of sites A and B.

The quorum devices are defined as:

Type      Site   Active Quorum
IP        C      Yes
Drive 0   A      No
Drive 1   B      No
Drive 2   A      No

Assume this is the starting point for each of the failures below: the cluster is whole, with all 4 nodes online and the IP quorum device active.
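
For the walkthroughs below it can help to picture the quorum configuration as a simple table in code. This snippet just mirrors the starting state above; the field names are mine.

```python
# The starting quorum configuration for the worked examples (illustrative).

quorum_devices = [
    {"type": "IP",      "site": "C", "active": True},
    {"type": "Drive 0", "site": "A", "active": False},
    {"type": "Drive 1", "site": "B", "active": False},
    {"type": "Drive 2", "site": "A", "active": False},
]

def active_quorum(devices):
    return next(d for d in devices if d["active"])

print(active_quorum(quorum_devices)["type"])   # IP
```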

Failure 1. HyperSwap Intersite Link down – Site A to Site B links dead (Split-Brain)

Starts from a fully online state as described above.

The cluster here suffers a split-brain scenario: although both Sites A and B are online, the cluster has been split and we have two online halves, so the active quorum is used as the tie break device. Whichever half talks to the IP quorum first and locks it wins – it now has 3 votes and continues. The other site contacts the IP quorum, sees that it is locked, and halts any I/O through its nodes.
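
Putting the pieces together, the whole of Failure 1 fits in a few lines of the same illustrative style: a 2-2 split where only the half that locks the active quorum reaches 3 votes.

```python
# Toy run of Failure 1: a 4 node cluster split 2-2, with a single lock on
# the active IP quorum deciding the winner (illustrative only).

locked_by = None

def try_lock(site: str) -> bool:
    global locked_by
    if locked_by is None:
        locked_by = site
    return locked_by == site

for site, nodes in (("A", 2), ("B", 2)):
    wins = try_lock(site)               # race to lock the active quorum
    votes = nodes + (1 if wins else 0)  # the winner gets the tie-break vote
    print(site, "continues" if votes > 4 // 2 else "halts I/O")

# In this toy run Site A asks first and wins; in reality it is simply
# whichever site reaches the IP quorum first.
```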

Failure 2. IP Quorum Connection Lost, Subsequent Split-Brain

Starts from a fully online state as described above.

The cluster loses IP connectivity to Site C, and so loses access to the IP quorum device. Since this is the active quorum device and the cluster can no longer communicate with it, the cluster will re-assign one of the drives as the active quorum. In this case, Drive 0 becomes the active quorum. This re-assignment is possible because the 4 voting members (nodes) in Sites A and B are still online and communicating.

Quorum now looks like:

Type      Site   Active Quorum
IP        C      Offline
Drive 0   A      Yes
Drive 1   B      No
Drive 2   A      No

Now if we have a split-brain scenario, Site A will always win, because only Site A can communicate with the active quorum device (Drive 0).

Worst case here: if Site A actually fails (a power failure or similar), then Site B will also halt, because it can’t see the active quorum device (which was at Site A). You can get Site B online again by manually running the quorum override command – essentially you have become the active quorum device and are telling Site B to continue, because you know Site A is offline.
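
A short sketch of why the behaviour changes once Drive 0 (in Site A) is the active quorum. The reachability rule here is a deliberate simplification: a quorum drive is assumed to be visible only to the nodes in its own site, and only while that site is powered on.

```python
# Sketch of Failure 2's split brain and worst case (illustrative only).

active_quorum_site = "A"   # Drive 0 became the active quorum

def can_lock_active_quorum(site: str, sites_online: set) -> bool:
    # Simplification: only nodes in the same site as the quorum drive can
    # reach it, and only while that site is still powered on.
    return site == active_quorum_site and active_quorum_site in sites_online

# Split brain with both sites up: only Site A can lock Drive 0, so A wins.
print(can_lock_active_quorum("A", {"A", "B"}))   # True  -> Site A continues
print(can_lock_active_quorum("B", {"A", "B"}))   # False -> Site B halts

# Worst case: Site A loses power. Site B still cannot lock the active
# quorum, so it halts until an administrator runs the quorum override and
# effectively becomes the tie break by hand.
print(can_lock_active_quorum("B", {"B"}))        # False -> manual override needed
```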

Failure 3. Simultaneous IP Quorum Connection Loss, and Site A Power Loss

Starts from a fully online state as described above.

There is a simultaneous failure of the communication to the IP quorum device and a power loss at Site A, so all of Site A goes offline. This leaves Site B looking for the active quorum device, but it has failed; and while there are other quorum devices available (Drive 1), it was not marked as the active quorum device, so Site B will halt. Again, you can become the quorum and enable Site B to continue by using the quorum override command to regain access.
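
The subtle point here is that a healthy candidate (Drive 1) does not help, because promoting a new active quorum needs the whole voting set, and half of it is down. A sketch, in the same illustrative style as before:

```python
# Why Site B halts in Failure 3 even though Drive 1 is healthy (illustrative).

voting_set = ["A1", "A2", "B1", "B2"]   # two nodes per site
reachable  = {"B1", "B2"}               # Site A has lost power

active_quorum_reachable = False         # the IP quorum link failed at the same time
can_promote_new_active  = all(n in reachable for n in voting_set)   # False

if not active_quorum_reachable and not can_promote_new_active:
    print("Site B halts; Drive 1 remains a candidate, not the active quorum")
    # Recovery: the administrator runs the quorum override command and, in
    # effect, becomes the tie break that lets Site B continue.
```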

Conclusions

You can manually assign the quorum devices (mdisks and drives) on any system with the chquorum command. This can be done to ensure the quorum devices are spread over the available controllers or enclosures if the system hasn’t done this automatically – for example if you added more controllers after the initial setup.
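
As a rough illustration of the kind of spread check that is worth doing after adding controllers, the snippet below flags candidates that share a controller. It is a sketch only – in practice you would read the layout from the system’s quorum listing and correct it with chquorum.

```python
# Illustrative check that quorum candidates are spread over different
# controllers (SVC) or enclosures / SAS chains (Storwize).

quorum_candidates = [
    {"id": 0, "controller": "ctrl_1"},
    {"id": 1, "controller": "ctrl_1"},   # two candidates share a controller
    {"id": 2, "controller": "ctrl_2"},
]

controllers = {c["controller"] for c in quorum_candidates}
if len(controllers) < len(quorum_candidates):
    print("Quorum candidates share a controller - consider re-assigning with chquorum")
```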

The system will define the active quorum automatically, and will move it as needed.

For normal (non-Stretched/HyperSwap) systems you generally don’t need to worry as much. Just check that the system has spread the quorum devices over multiple controllers in the SVC case, and over the SAS chains in the Storwize case.

For HyperSwap/Stretched systems it is critical that you ensure the quorum setup is correct for your configuration. As you can see from the examples, if you are using IP Quorum we’d recommend you set up at least 2 – probably 3 – IP quorum devices across different servers, so that if one fails you maintain the ‘3rd site’ nature of the configuration. That is, the quorum remains at a 3rd site and doesn’t end up local to one of the cluster sites – as that will force that site to become the one that continues, and could result in a temporary loss of access while you manually override the quorum. Also ensure that the drive or mdisk based backup quorum devices are spread over the sites, so that in the event of multiple failures a quorum device can still be assigned even after a site failure.

PS. One final thing: there are two uses for the quorum drives/mdisks. They act as the tie break quorum (as discussed here), but they also store the virtualization table maps and other such cluster ‘recovery’ information, for use in the event of all nodes having their brains blown out. The IP quorum devices ONLY act as tie break devices, so both types of quorum device are needed even when using IP quorum.

Hope this is useful, and feel free to ask any questions it raises.