NetApp MetroCluster Overview – Part 5 – Failure Scenarios for MetroCluster

 

Failover/Failure Scenarios for MetroCluster

I’m not going to re-invent the wheel here. These failure scenarios are all pretty self-explanatory and can be found in TR-3788.pdf. There are far more scenarios in that document, but here I’ll cover some of the most common types.

Scenario: Loss of power to disk shelf

Expected behaviour: The relevant disks go offline and the plex is broken. There’s no disruption to data availability to hosts running HA (VMware High Availability) or FT (Fault Tolerance), and no change is detected by the ESXi server. When the shelf is powered back on, the plexes resync automatically.
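
As a sanity check, the resync can be watched from the controller itself. Below is a minimal sketch, assuming a 7-Mode system like the one TR-3788 covers, reachable over SSH; the hostname and aggregate name are placeholders, not values from this environment.

```python
# Minimal sketch: watch a mirrored aggregate's plexes resync on a 7-Mode
# controller. Hostname and aggregate name below are placeholders.
import subprocess

CONTROLLER = "netapp-site1"   # hypothetical controller hostname
AGGREGATE = "aggr0"           # hypothetical mirrored aggregate

def run_ontap(cmd: str) -> str:
    """Run a console command on the controller over SSH and return its output."""
    result = subprocess.run(
        ["ssh", CONTROLLER, cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# "aggr status -r" lists both plexes of the aggregate; after the shelf is
# powered back on, the previously failed plex should show as resyncing and
# then return to normal.
print(run_ontap(f"aggr status -r {AGGREGATE}"))
```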

Impact on data availability: None

 

Scenario: Loss of one link in one disk loop

Expected behaviour: A notification appears on the controller advising that the disks are only accessible via one switch. There’s no disruption to data availability to hosts running HA or FT, and no change is detected by the ESXi server. When the link is restored, an alert on the controller advises that connectivity across both switches is back.
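
To confirm which paths the disks are actually using while the loop is degraded, something like the sketch below can be used; again this assumes a 7-Mode controller reachable over SSH, and the hostname is a placeholder.

```python
# Minimal sketch: list disk paths on a 7-Mode controller to confirm which
# disks are currently visible over only one loop/switch. Hostname below is
# a placeholder.
import subprocess

CONTROLLER = "netapp-site1"   # hypothetical controller hostname

# "storage show disk -p" prints the primary and secondary path for each
# disk; in this scenario the affected disks show only a single path.
paths = subprocess.run(
    ["ssh", CONTROLLER, "storage show disk -p"],
    capture_output=True, text=True, check=True,
).stdout
print(paths)
```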

Impact on data availability: None

 

Scenario: Failure and Failback of Storage Controller

Expected behaviour on failover: There’s no disruption to data availability to hosts running HA or FT, and no interruption to VMs running on the ESXi servers. The partner node reports an outage. There’s a momentary pause in disk activity while datastore connectivity (iSCSI, NFS, FC) is refreshed as the connection moves via the other controller. After the takeover has completed, normal activity resumes.

Expected behaviour on failback: There’s no disruption to data availability to hosts running HA or FT, and no interruption to VMs running on the ESXi servers. There’s a momentary pause in disk activity while datastore connectivity (iSCSI, NFS, FC) is refreshed as the connection is enabled on the original controller again. After the giveback has completed, normal activity resumes.
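
The HA state around a takeover and giveback can be checked with the 7-Mode cf commands. Here’s a sketch with a placeholder hostname; the disruptive takeover/giveback calls are left commented out on purpose.

```python
# Minimal sketch: check failover state around a takeover/giveback using the
# 7-Mode "cf" commands. Hostname below is a placeholder, and the disruptive
# commands are deliberately commented out.
import subprocess

PARTNER = "netapp-site2"   # hypothetical partner (surviving) controller

def run_ontap(cmd: str) -> str:
    """Run a console command on the partner controller over SSH."""
    out = subprocess.run(
        ["ssh", PARTNER, cmd],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

# "cf status" reports whether the partner is up, whether a takeover is in
# progress, and whether a giveback is still pending.
print(run_ontap("cf status"))

# For a planned failover and failback the sequence would be, from the partner:
# run_ontap("cf takeover")   # take over the failed/maintenance node
# run_ontap("cf giveback")   # return services once the node is healthy again
```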

Impact on data availability: None

 

Scenario: Mirrored storage network isolation (Cluster Interconnects down)

Expected behaviour: There’s no disruption to data availability to hosts running HA (VMware High Availability) or FT (Fault Tolerance), and no change is detected by the ESXi server. A “VIA Interconnect is down” alert will appear on the controllers.

Impact on data availability: None

 

Scenario: Total ESXi host failure on one site

Expected behaviour: FT VMs move automatically to hosts at the remaining site, and HA restarts the affected VMs on hosts at the secondary site. When the ESXi hosts come back online, the VMs can be migrated back manually, or this happens automatically depending on the DRS group rules.

Impact on data availability: None

 

Scenario: Total Network Isolation on ESXi hosts and loss of hard drive

Expected behaviour: The relevant disks go offline and the plex is broken. FT VMs move automatically to hosts at the remaining site, and HA restarts the affected VMs on hosts at the secondary site. When the ESXi hosts come back online, the VMs can be migrated back manually, or this happens automatically depending on the DRS group rules. When the storage shelves are replaced, the plexes resync automatically.

Impact on data availability: None

 

Scenario: Loss of one Fabric Interconnect switch

Expected behaviour: The controller displays a message that some disks are connected via only one switch and that the cluster interconnects are down. There’s no change to the ESXi servers or VMs. When the switch comes back online, the controllers display a message that the fabric interconnects are back online.

Impact on data availability: None

 

Scenario: Failure of entire Data Center

Expected behaviour: Chaos!!! Not really. If you’re looking at DR testing, or actually need to perform a failover, I’d advise checking out another blog series I did regarding MetroCluster failover. If the failure is in Site 1, all ESXi hosts there will show as offline or not responding. VMware HA will kick in and restart all the affected VMs at the other site. Alerts will appear on the controller in Site 2 that the Site 1 controller and fabric interconnects are offline and that the paths to remote storage are offline.

The plexes are broken and the mirrored plex for Site 1 becomes writeable. There is a pause on disk access while the datastore links are refreshed. Once a failover is performed it takes some time for the plexes to sync, but once that completes the entire environment will be running from the ESXi servers in Site 2. The Site 1 ESXi servers will still appear offline.

Once the issues in Site 1 have been resolved, the interconnects are back online and the remote storage can be reached again, the plex in Site 2 will automatically resync with its mirrored plex from Site 1. The ESXi hosts will appear back online and the VMs will migrate back automatically if DRS rules are in place for that; otherwise they can be migrated back manually. The mirrored plex for Site 1, now running under and owned by Site 2, will need to be resynced to the primary plex in Site 1, which is a manual command. A giveback command then needs to be run to make the plex in Site 1 the primary again and re-enable the mirror. N.B. This scenario can cause the plex numbers to change on resync.
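
As a rough outline of the manual steps just described, the sketch below strings together the 7-Mode commands documented for MetroCluster site disasters. Hostnames and aggregate names are placeholders, the destructive commands are commented out, and exact syntax can vary by ONTAP release, so treat it as an illustration rather than a runbook.

```python
# Minimal sketch of the manual recovery sequence described above, using the
# 7-Mode MetroCluster disaster-recovery commands. Hostname and aggregate
# names are placeholders; the disruptive steps are commented out on purpose.
import subprocess

SURVIVOR = "netapp-site2"   # hypothetical controller at the surviving site

def run_ontap(cmd: str) -> str:
    """Run a console command on the surviving controller over SSH."""
    out = subprocess.run(
        ["ssh", SURVIVOR, cmd],
        capture_output=True, text=True, check=True,
    )
    return out.stdout

# Confirm the current failover state before doing anything disruptive.
print(run_ontap("cf status"))

# 1. Declare the site disaster so the survivor takes over the failed node and
#    the mirrored plexes for Site 1 become writable:
# run_ontap("cf forcetakeover -d")

# 2. Once Site 1 storage is reachable again, re-establish the mirror manually
#    (the manual resync referred to above), e.g. for a split aggregate:
# run_ontap("aggr mirror aggr0 -v aggr0(1)")

# 3. After the plexes are back in sync, return control to the repaired node:
# run_ontap("cf giveback")
```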

Impact on data availability: None

There are also a number of scenarios for rolling failures, but I’m not going to go into those here. Really, MetroCluster is designed to handle all types of failures, so it’s no surprise that if it can handle the scenarios above it can also take care of rolling failures.

