
Am I doing this wrong? Networking - High Availability: SAN, MSFT Failover Cluster, NIC Teaming

ayi

I have been working in a lab environment to figure out how to design a SAN with a Microsoft failover cluster using multiple NICs, MPIO and iSCSI. I'll try to add as much useful information as I can; please forgive me if I miss any details . . . This is all a new learning experience for me.

Objective:
Increase availability/fault tolerance by adding a second path (switch) from each node in the cluster to storage. Allow clients on the network to access resources/services running in the cluster.

What I've done so far:
I have successfully set up and configured a 2-node cluster with MPIO and iSCSI. The nodes share storage via a CSV on a QNAP NAS. Each node has 2 NICs in a Switch Embedded Teaming (SET) team, with one NIC connected to each switch. Each of the QNAP's 2 Ethernet ports is likewise connected to one of the switches. I've created the following virtual network adapters in the host OS...

Management 192.168.1.XX
CSV 10.10.10.XX
Live Migration 10.10.40.XX
iSCSI-01 10.10.20.XX
iSCSI-02 10.10.30.XX
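
A quick sanity check for an address plan like this is to confirm that none of the five networks overlap. Here is a minimal Python sketch, assuming every network is a /24 (the post doesn't state the masks) and treating `XX` as the host part:

```python
import ipaddress
from itertools import combinations

# Assumption: every network is a /24; the original post does not give the masks.
NETWORKS = {
    "Management": "192.168.1.0/24",
    "CSV": "10.10.10.0/24",
    "Live Migration": "10.10.40.0/24",
    "iSCSI-01": "10.10.20.0/24",
    "iSCSI-02": "10.10.30.0/24",
}


def overlapping_pairs(networks):
    """Return the (name, name) pairs whose subnets overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for a, b in combinations(parsed, 2)
        if parsed[a].overlaps(parsed[b])
    ]


if __name__ == "__main__":
    print(overlapping_pairs(NETWORKS) or "no overlaps")
```

Keeping the two iSCSI networks in separate subnets is what lets MPIO treat them as two distinct paths, so an accidental overlap there would be worth catching early.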

The switches that I'm using in the lab are pretty basic layer 3 switches. I have done zero config on them as I don't know if there is anything I must do. Everything appears to work okay so long as I have both of the switches linked to each other via an Ethernet cable. Since the QNAP only has 2 Ethernet ports, I had to create a path for the nodes to get to the QNAP's 2nd iSCSI target on the second switch. It's a janky setup, but it works. I've simulated failures by pulling one of the connections to a node...all good there.
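
The cable-pulling test can also be tried on paper. The sketch below models only the physical cabling described above as a graph and checks whether each node still reaches the QNAP after any single link or switch failure. The device labels are illustrative, and it deliberately ignores VLANs, SET hashing and MPIO path selection, so treat it as a rough aid rather than a validation of the design:

```python
from collections import deque

# Physical links as described in the post; the device names are illustrative labels.
LINKS = [
    ("node1", "sw1"), ("node1", "sw2"),
    ("node2", "sw1"), ("node2", "sw2"),
    ("qnap", "sw1"), ("qnap", "sw2"),
    ("sw1", "sw2"),  # inter-switch cable
]


def reachable(links, start, goal, down=frozenset()):
    """Breadth-first search from start to goal, skipping failed devices."""
    if start in down or goal in down:
        return False
    adjacency = {}
    for a, b in links:
        if a in down or b in down:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for neighbour in adjacency.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def survives_single_failures(links, hosts, target, switches):
    """True if every host reaches target after any one link or one switch fails."""
    for i in range(len(links)):
        remaining = links[:i] + links[i + 1:]
        if not all(reachable(remaining, h, target) for h in hosts):
            return False
    for sw in switches:
        if not all(reachable(links, h, target, down={sw}) for h in hosts):
            return False
    return True


if __name__ == "__main__":
    print(survives_single_failures(LINKS, ["node1", "node2"], "qnap", ["sw1", "sw2"]))
```

Note that the target device itself is never failed in this check, so it says nothing about what happens when the NAS goes down.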

What I don't know:
My lab is limited to the nodes, the 2 switches and the NAS. I don't know what the proper next steps would be in order to integrate this into a network so clients could access resources in the cluster. If there were just one SAN/"fabric" switch (is that the right term?), no problem. I'd just link that up to another switch that connects to the clients...That's how our current setup is.

At first, I thought that it was just going to work via some sort of magic (that's what networking is like to me), but I feel like I'm missing something. Will there be an issue with the way traffic flows by linking both SAN switches to a single LAN switch? I'm not even sure if that is the "correct" thing to do.

I know that I'm missing a router in this equation . . . Didn't seem to need it for this lab experiment since I wasn't doing any kind of routing. In production we have our network segmented with vlans.

Anyway . . . Here's a diagram of what I'm thinking this will look like in production.
[Image: network diagram]

Edit: Or should I be doing something like this?? Should the SAN connect to the LAN via the SAN switches or should the servers directly connect to the LAN via a switch?

ayi

(1) Storage. A QNAP would make a decent backup target, say over iSCSI with Veeam, but it's quite a poor choice for primary storage. Performance isn't great, but the worst part is that it's a single-controller unit, so any planned or unplanned storage downtime takes your whole cluster down. Not good!

https://en.wikipedia.org/wiki/Single_point_of_failure

(2) Switches. For iSCSI they should have pretty beefy Tx/Rx buffers, or you'll see random lockups and I/O latency spikes under heavy load, which can lead to VMs freezing and sometimes crashing. There's some good reading on the topic:

https://community.arubanetworks.com/discussion/iscsi-and-packet-buffer

https://community.cisco.com/t5/switching/iscsi-switch-recommendations/td-p/1771478
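
To get a feel for why buffer size matters, here is a back-of-the-envelope sketch in Python. It assumes a simple model where traffic arrives at one egress port faster than the port can send, so the buffer fills at the difference between the two rates. The rates and buffer sizes you plug in are hypothetical inputs, not figures for any particular switch:

```python
def time_to_fill_buffer_ms(buffer_bytes, ingress_gbps, egress_gbps):
    """Milliseconds until a port buffer overflows under sustained oversubscription.

    Simplified model: the buffer fills at (ingress - egress) bits per second.
    Returns None when ingress does not exceed egress (the buffer never fills).
    """
    excess_bits_per_s = (ingress_gbps - egress_gbps) * 1e9
    if excess_bits_per_s <= 0:
        return None
    return buffer_bytes * 8 / excess_bits_per_s * 1000


if __name__ == "__main__":
    # Hypothetical case: two 10 Gb/s senders bursting into one 10 Gb/s port.
    for size_mib in (1, 16):
        ms = time_to_fill_buffer_ms(size_mib * 1024 * 1024, 20, 10)
        print(f"{size_mib} MiB buffer fills in about {ms:.2f} ms")
```

In this model a 1 MiB buffer absorbs a burst like that for roughly 0.84 ms before frames start dropping, which is why shallow-buffered switches tend to struggle when storage traffic bursts.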

Bottom line: what you should be doing is getting rid of your current configuration and going with a 100% hyperconverged setup. Put some flash and spindles into your two cluster nodes and enable either Storage Spaces Direct or StarWind Virtual SAN (free?). Both support switchless configurations just fine, so you can direct-connect your two cluster nodes with a pair of 10 GbE or 25 GbE NICs, with no switched fabric involved! No switch can beat a copper cable in terms of latency and cost! Your existing switches would then handle front-end connectivity, giving you two redundant paths from your clients to the HCI cluster itself. It's a very neat design and way more reliable.

https://learn.microsoft.com/en-us/azure-stack/hci/concepts/storage-spaces-direct-overview

https://www.starwindsoftware.com/starwind-virtual-san

Storage Spaces Direct is a first-class citizen, so it should be mentioned, but it's rather fragile, requires the expensive Datacenter edition, and a two-node S2D configuration with nested resiliency can't scale beyond the initial two nodes. StarWind is way more reliable, works with Standard edition, can be 100% free, and scales the way you want.

(3) Backup. You can repurpose the QNAP and use free Veeam to handle that. Stick with the 3-2-1 backup rule and you're golden!

https://www.backblaze.com/blog/the-3-2-1-backup-strategy/

https://www.veeam.com/virtual-machine-backup-solution-free.html
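
If it helps, the 3-2-1 rule (3 copies of the data, on 2 different kinds of media, with 1 copy off-site) can be written down as a tiny check. This is only an illustrative Python sketch; the `BackupCopy` type and the location/media labels are made up for the example and don't come from Veeam or any other tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical model for illustration)."""
    location: str  # e.g. "cluster", "qnap", "cloud"
    media: str     # e.g. "local-disk", "nas", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 off-site."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


if __name__ == "__main__":
    plan = [
        BackupCopy("cluster", "local-disk", offsite=False),  # production data
        BackupCopy("qnap", "nas", offsite=False),            # on-site backup
        BackupCopy("cloud", "object-storage", offsite=True), # off-site backup
    ]
    print(satisfies_3_2_1(plan))
```

With only the cluster and the QNAP in the plan, the check fails on both the copy count and the off-site requirement.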

Hope this all helped at least a bit...