Catalyst 2960S iSCSI Optimization
By stretch | Monday, January 16, 2012 at 1:15 a.m. UTC
I was recently tasked with configuring a number of 24-port Catalyst 2960S switches for deployment as standalone iSCSI switches for a storage area network (SAN). I haven't dealt much with SAN architecture so I wasn't sure what was needed. Obviously, just about any switch will support iSCSI right out of the box (it's just TCP/IP traffic, after all), but there are certain tweaks necessary to achieve the best possible performance.
Dell's PS Series Array Network Performance Guidelines outlines its recommendations for EqualLogic SAN arrays, including network configuration. This article parallels the Network Requirements and Recommendations section of that document.
Evaluating Switch Performance
According to the Catalyst 2960S data sheet, the 2960S-24xS-L series is capable of moving 41.7 Mpps worth of 64-byte packets, and its forwarding ceiling of 88 Gbps (which we could never reach with 24 GigE ports) leaves plenty of headroom. Since the traffic traversing this switch will be mostly iSCSI, which uses very long frames, the overall forwarding rate is much more important to us than the 64-byte packets-per-second limit (which is fairly high anyway).
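As a quick sanity check on those data-sheet figures, here is a sketch of the standard Ethernet framing arithmetic (the per-port numbers below are derived from first principles, not taken from the data sheet):

```python
# Back-of-the-envelope check of the 2960S data sheet figures.
# A 64-byte frame occupies 84 bytes on the wire: 7-byte preamble +
# 1-byte start-of-frame delimiter + 64-byte frame + 12-byte inter-frame gap.
WIRE_OVERHEAD = 7 + 1 + 12          # bytes surrounding every frame
GIG_E = 1_000_000_000               # bits per second per port

def line_rate_pps(frame_bytes: int) -> float:
    """Maximum frames per second one GigE port can carry."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD) * 8
    return GIG_E / bits_on_wire

ports = 24
worst_case_mpps = ports * line_rate_pps(64) / 1e6
print(f"24 ports at line rate, 64-byte frames: {worst_case_mpps:.1f} Mpps")
# ~35.7 Mpps, comfortably under the switch's 41.7 Mpps limit.

# Full-duplex aggregate: 24 ports x 1 Gbps in each direction = 48 Gbps,
# well under the 88 Gbps forwarding ceiling.
print(f"Aggregate full-duplex demand: {ports * 2} Gbps")
```

In other words, even a worst-case flood of minimum-size frames on all 24 ports stays below the switch's packet-per-second limit.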
Similar to their big brother, the Catalyst 3750, multiple 2960S switches can be combined into a single managed unit through the use of proprietary stacking cables. Although the requirement here is for only a single switch, it's worth keeping in mind that the stack backplane introduces a potential 20 Gbps choke point for traffic switched among stack members. Obviously, this is more of a concern regarding 48-port switches than it is for 24-port switches.
From the Dell tech report:
Because STP can increase the time it takes to recover from a PS Series array control module failover or a network switch failure, Dell recommends that you do not use STP on switch ports that connect end nodes (iSCSI initiators or storage array network interfaces).
The above quote is a great example of why you should never blindly follow networking advice from server admins. There is a far more elegant way to address this issue than simply turning off spanning tree (which the guide does reluctantly acknowledge). First, because the default spanning tree protocol on the 2960S is still legacy IEEE 802.1D-1998, we'll enable rapid spanning tree (IEEE 802.1D-2004, which incorporates 802.1w):
Switch(config)# spanning-tree mode rapid-pvst
Switch(config)# ^Z
Switch# show spanning-tree

VLAN0001
  Spanning tree enabled protocol rstp
  Root ID    Priority    32768
             Address     0026.622f.4788
             Cost        38
             Port        23 (GigabitEthernet1/0/23)
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec
...
Next, we'll issue the command spanning-tree portfast on all interfaces which connect to endpoints; that is, anything other than another switch. This command designates an interface as an edge port and reduces the time it takes to transition to the forwarding state to about a second.
Switch(config)# interface range g1/0/1 -22
Switch(config-if-range)# spanning-tree portfast
%Warning: portfast should only be enabled on ports connected to a single
 host. Connecting hubs, concentrators, switches, bridges, etc... to this
 interface when portfast is enabled, can cause temporary bridging loops.
 Use with CAUTION

%Portfast will be configured in 22 interfaces due to the range command
 but will only have effect when the interfaces are in a non-trunking mode.
Flow Control
Dell's recommendation is to enable bidirectional flow control on all interfaces. Unfortunately, the 2960S supports only the reception of flow control signals; it won't generate PAUSE frames itself. We'll enable negotiation of the "receive" direction of flow control on all server- and SAN-facing interfaces:
Switch(config-if-range)# flowcontrol receive desired
Unicast Storm Control
The recommendation here is to disable unicast storm control. This feature is actually disabled by default on our switch, so we can move right along.
You can verify that storm control has been disabled with the command show storm-control; no interfaces should be listed.
Switch# show storm-control
Interface  Filter State   Upper        Lower        Current
---------  -------------  -----------  -----------  ----------
"Jumbo frame" refers to an Ethernet frame which has a maximum transmission unit (MTU) of greater than 1500 bytes, typically around 9000 bytes. It is recommended to enable jumbo frames on all hosts and switches comprising the iSCSI network because available throughput is consumed more efficiently using longer frames (a lower header-to-payload ratio).
When dealing with GigE Catalyst switches, there are two classifications of MTU with which we are concerned: the system MTU and the jumbo MTU. The system MTU applies to frames which are processed by the switch CPU (e.g. SSH, SNMP, etc.). The system MTU has a limitation of 1998 bytes and is set to 1500 bytes by default.
The jumbo MTU applies to transit traffic (traffic which is received on one interface and switched out via another) and is also set to 1500 bytes by default. However, we can raise the jumbo MTU to a maximum of 9198 bytes. We'll set it to 9000 bytes as that seems to be the standard.
Switch# show system mtu

System MTU size is 1500 bytes
System Jumbo MTU size is 1500 bytes
System Alternate MTU size is 1500 bytes
Routing MTU size is 1500 bytes
Switch# configure terminal
Switch(config)# system mtu ?
  <1500-1998>  MTU size in bytes
  jumbo        Set Jumbo MTU value for GigabitEthernet or TenGigabitEthernet interfaces
Switch(config)# system mtu jumbo ?
  <1500-9198>  Jumbo MTU size in bytes
Switch(config)# system mtu jumbo 9000
Changes to the system jumbo MTU will not take effect until the next reload is done
Note that it is necessary to reload the switch in order for the new MTU to take effect. Note also that the MTU setting does not appear in the running or startup configuration; this is important to remember when replicating configurations among multiple switches.
Remember to verify that the new jumbo MTU has taken effect after a reload:
Switch# show system mtu

System MTU size is 1500 bytes
System Jumbo MTU size is 9000 bytes
System Alternate MTU size is 1500 bytes
Routing MTU size is 1500 bytes
VLANs and QoS
Use VLANs to segment iSCSI traffic into separate layer two domains as appropriate. Limiting each VLAN to a single IP network is a good rule of thumb but your SAN design may dictate otherwise.
Dell recommends not implementing QoS on iSCSI networks and I have to agree. Your storage network should only carry storage traffic, all of which you want to move from source to destination as fast as possible. Even management traffic to and from the switches themselves is best isolated to the out-of-band management interface.
That's what I've managed to come up with, anyway. What have I missed?
About the Author
Jeremy Stretch is a network engineer living in the Raleigh-Durham, North Carolina area. He is known for his blog and cheat sheets here at Packet Life. You can reach him by email or follow him on Twitter.
Posted in Switching
January 16, 2012 at 2:17 a.m. UTC
Great article. The only thing I have to contribute is a limitation of the 2960S: it supports a maximum of six port channels, which could prohibit some deployment scenarios (such as mine).
January 16, 2012 at 5:34 a.m. UTC
Thanks for the article Stretch.
I am interested in how you'd manage performance monitoring, to see if these config changes were improving things. Would iperf be the tool to reach for? I note netflow isn't supported on 2960, and perhaps it is the wrong tool anyway.
I would think that switching on jumbo frames would make a very noticeable difference, but I'm not sure how to prove that.
I also wonder whether your disks will be able to even reach the maximum throughput on each server NIC and switch port.
January 16, 2012 at 7:13 a.m. UTC
Thanks, Stretch. Good article.
Keep up the good work
January 16, 2012 at 10:19 a.m. UTC
I had such a deployment with the exact same switches (except with two 48-port models).
On our side we used an HP P2000 G3 SAN with 8 host ports.
We had a problem with host ports going down and then coming back up a few seconds later. HP's solution was: disable auto-negotiation on the ports where the SAN is plugged in.
So far, so good (even if I still think that disabling auto-negotiation is not a best practice).
Anyway, great article!
January 16, 2012 at 2:52 p.m. UTC
I think that forcing speed/duplex on server ports is a good practice. Also, has anyone used the trunk portfast configuration in scenarios like this where fast STP convergence is needed?
January 16, 2012 at 5:43 p.m. UTC
Are you sure you want to use a low-end Cisco switch for iSCSI traffic at all?
Quote from a recent cisco-nsp post:
On the 2960 the buffer depth is 77 or 78.
On the 2960S the buffer depth is closer to 40 (don't have an exact figure).
The buffers on these are extremely small for a gig/10GE switch (less than 1ms).
January 16, 2012 at 7:38 p.m. UTC
Outbound flow control is necessary with a blocking backplane only, when input buffers are congested. Most Cisco switches (like the 2960S) have a non-blocking backplane and no input buffers at all, thus outbound flow control doesn't make any sense.
Please also read Vendors on flow control.
Flow control is not intended to solve the problem of steady-state overloaded networks or links
Of course you can saturate the intra-stack links, so when using 2960S switches in a stack, intra-stack bandwidth requirements need to be considered. For storage with stacked switches I would stick to the 3750X series; I wouldn't use stacked 2960S switches in environments with high bandwidth requirements like storage anyway.
As for disabling auto-negotiation being best practice: this is a myth! In fact, flow control requires auto-negotiation. Also read Blessay: Autonegotiation on Ethernet – It Works, It Should Be Mandatory!
The above quote is a great example of why you should never blindly follow networking advice from server admins
I can't agree more with you on this, but it is also valid for the networking 'myths' we still see floating around on the internet and in real life, like auto-negotiation.
January 17, 2012 at 4:02 a.m. UTC
Thanks for the post Stretch.
My company uses the 2960G at three sites for EqualLogic SANs and they have performed just fine, though on a few support calls EQL has started down the road of pointing at the switch as "not on their supported hardware list", so that has been a concern in the back of my mind.
We've set up our switches pretty much the same way outlined here and I see you referenced Dell's guidelines so nothing much new but it is good to see that others are using these switches in similar scenarios.
Some other notes:
We have the switches set up in a redundant fashion, utilizing EtherChannel between switches. I'm considering using the stackable flavor in a new implementation, but I'm concerned about the bandwidth and am currently investigating a single chassis (Force10) in lieu of multiple chassis, stacked or EtherChanneled. I'm wondering if anyone has any comments on the pros and cons here; some people really seem to hate stackables, but two large chassis for true redundancy just isn't in the budget. Is a single large chassis with dual everything good enough?
Also, you mentioned QoS, and that made me think of some problems we have that are worth mentioning. In order to prevent bandwidth overages I use ingress policing on the next-hop switch at two sites to rate limit. This is maybe more of an EQL issue, but it seems to wreak havoc with the communication between groups at different sites; they constantly balk about timeouts, etc. I wish EQL would build in a rate-limit feature for replication; then I wouldn't need to see a bunch of misleading errors about retransmits.
January 17, 2012 at 8:00 a.m. UTC
A little side note to "The jumbo MTU applies to transit traffic": it doesn't apply to transit traffic on an interface operating at 100 Mb/s, which follows the system MTU instead. If jumbo traffic received on a Gig interface is destined for an FE interface, it will be dropped.
January 17, 2012 at 4:14 p.m. UTC
To continue with what D Stromland has noted, you should also pay careful attention to MTU if these switches are trunked to other switches. Obviously, VLAN-separated traffic carried on a trunk adds a 4-byte VLAN tag, so if you only allow 9000-byte frames and your endpoints send 9000-byte frames, those frames will be dropped if they need to traverse a trunk. Personally, I always recommend configuring the highest available MTU on routers and switches and giving the customer a lower MTU SLA. For switches this is usually not a big deal, and as you noted, the system and jumbo MTU are different. On routers you obviously also need to consider routing protocols and the MTU they will attempt to use. I find that MTU is a simple thing that seems to come up on a regular basis ;-)
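A quick sketch of the arithmetic behind this point (the 9000-byte values are illustrative, not specific to any one platform):

```python
# Why the switch MTU needs headroom above the host MTU on trunk links.
DOT1Q_TAG = 4            # bytes added by a single 802.1Q VLAN tag
host_mtu = 9000          # payload size the endpoints are configured to send
switch_jumbo_mtu = 9000  # switch limit with no headroom for tags

tagged_size = host_mtu + DOT1Q_TAG
if tagged_size > switch_jumbo_mtu:
    print(f"{tagged_size}-byte tagged frame exceeds the "
          f"{switch_jumbo_mtu}-byte switch MTU: dropped on the trunk")
# Configuring the switch at its maximum (9198 on the 2960S, 9216 on
# larger platforms) leaves room for tags while hosts stay at 9000.
```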
January 17, 2012 at 7:13 p.m. UTC
D Stromblad said: "jumbo traffic on an Gig interface that is destined for a FE interface it will be dropped"
January 18, 2012 at 7:56 a.m. UTC
We always use HP Procurve 2910-AL switches for our SAN, but it's worth taking a look at the 2960S switches.
Great Article btw
January 18, 2012 at 10:01 a.m. UTC
Here's a great link from Cisco: http://www.cisco.com/en/US/products/hw/switches/ps700/products_configuration_example09186a008010edab.shtml#c4a
Chrismarget: this is where I found that info: "If Gigabit Ethernet interfaces are configured to accept frames greater than the 10/100 interfaces, jumbo frames received on a Gigabit Ethernet interface and sent on a 10/100 interface are dropped."
January 18, 2012 at 7:10 p.m. UTC
Generally, I've found jumbo frames benefit FCoE and FC much more than iSCSI.
January 19, 2012 at 9:46 a.m. UTC
I see you exited config mode to run a show command:
Switch(config)# spanning-tree mode rapid-pvst
Switch# show spanning-tree
You are aware you can do the following right?:
Switch(config)# spanning-tree mode rapid-pvst
Switch(config)# do show spanning-tree
Does anyone know the routing performance of the 2960s? When you change the SDM mode to routing, does this happen in software or hardware?
January 19, 2012 at 11:45 p.m. UTC
@g1: Yes, I'm very familiar with the 'do' command. ;) But it can throw off people who aren't well-versed in IOS command syntax when used in examples.
January 20, 2012 at 6:02 a.m. UTC
So I wouldn't need (or want) outbound flow control on 3750/2960S switches, as they are non-blocking. What I'm unsure of is: do I want "ingress" flow control configured on the port (assuming the iSCSI target and initiators both support sending PAUSE frames), i.e. "flowcontrol receive desired"?
Where does TCP windowing come into play with regard to flow control? It seems an interesting topic, and many articles imply that flow control is a good thing when using iSCSI:
Multivendor article on iscsi
Cisco article re: Dell EQ SAN
January 20, 2012 at 9:33 p.m. UTC
"Does anyone know of the routing performance in the 2960's? when you change the SDM mode to routing, does this happen in software or hardware?"
I hadn't heard of the 2960S before now, but I'm assuming they're still L2-only switches? As such, I'm also assuming there are no SDM and/or "routing" issues or concerns, right? I guess I can always visit Cisco's website...
January 21, 2012 at 10:04 p.m. UTC
I'm curious what folks think about routing storage traffic vs. L2 forwarding. This avoids STP altogether, as you can localize VLANs at the TOR switch. I realize this is not really an option for FC/FCoE, but NFS/iSCSI should be able to use this method. Has anyone tried this type of test, and what did you find out?
January 22, 2012 at 8:14 p.m. UTC
At my site we are working with 3750's and EMC SAN. We had some issues, too, and it seems that the 9000 MTU size is standard.
Your recommendations are right on target. We are moving from 3750's to Nexus 5K soon to minimize latency on the iSCSI side. So, from where I stand, you got it!
January 23, 2012 at 8:59 a.m. UTC
Yes, the 2960(S) is still considered L2 by Cisco, but it supports static unicast routing on SVIs: http://www.cisco.com/en/US/docs/switches/lan/catalyst2960/software/release/12.2_55_se/configuration/guide/swipstatrout.html
I was hoping someone could confirm if the switches could handle full throughput when being used as a router?
January 26, 2012 at 11:12 a.m. UTC
Hmm, I don't think that auto-negotiation just works. Maybe in SAN scenarios it does, but in "normal" (i.e. host access) scenarios I have seen hosts have trouble negotiating correctly and need to be forced (later, after a NIC driver upgrade, all was well).
February 13, 2012 at 10:46 a.m. UTC
As usual, a very good article; many thanks, Jeremy, for the effort. This article helped me fine-tune a pair of 3750X Catalysts with StackWise.
Did you ever setup StackPower and/or EnergyWise on the 3750X platform?
February 14, 2012 at 5:35 p.m. UTC
I would recommend amending your QoS piece and looking into VLAN-based QoS, so that your iSCSI VLAN takes priority when traversing trunks.
This might be good for a Layer 3 switch article...
March 4, 2012 at 4:19 a.m. UTC
I'm in the middle of looking at switches for an HP LeftHand iSCSI solution, and the bloated marketing numbers in the vendors' spec sheets have to be translated to reality.
I know at least the following items need to be considered: blocking, flow-control, queueing method and depth, latency, and backplane bandwidth.
Switches under consideration are HP 2910AL, Cisco 4948, and a Cisco WS-X6748-GE-TX 6500 blade.
Jumbo frames are also being considered, but I wasn't a fan of making a system MTU change that required a reload on my aggregation switches, and apparently my Cisco 6500s already default jumbo frames to 9216.
s-oc4-n2-agg1 uptime is 5 years, 2 weeks, 3 hours, 41 minutes
...
s-oc4-n2-agg1(config)#system jumbomtu ?
  <1500-9216>  Jumbo mtu size in Bytes, default is 9216
March 29, 2012 at 2:19 a.m. UTC
This article is well written and exactly what one needs to do on 2960-S to prepare for iSCSI; particularly EQL iSCSI.
There is a big miss though.
The 2960-S is an access switch. Over-saturate it or encounter problems with it, and you'll end up with a TAC rep telling you that you're using an access switch in the distribution layer. They will be of little help.
Skip the 2960-S as an iSCSI switch. A low to medium storage load will bring it to its knees. The inability to transmit PAUSE frames keeps it from tapping out when an array or initiator (or both) blasts it with packets, and it misreports its buffer usage in a way that makes it nearly impossible to troubleshoot.
Why pay $60k/array and hook them to a $3k switch? You're asking for heartbreak.
May 18, 2012 at 8:28 p.m. UTC
It may be advisable to include something on applying QoS to an iSCSI VLAN when the switch is handling iSCSI as well as other traffic. In my case we have a two-switch stack handling front-end server traffic as well as back-end iSCSI traffic.
August 3, 2012 at 3:54 p.m. UTC
Breakfix is correct. The 2960s (G and S) do not have enough internal buffering to handle the overlapping 300 ms bursts of 1 Gbps traffic that occur on iSCSI SANs. It's fine as long as only one iSCSI stream is active on a given port, but if (for example) a host requests data from two or more iSCSI LUNs using the same NIC port, the concurrent 1 Gbps streams from the LUNs fill the internal 2960 buffers, causing drops.
If you are running iSCSI on an older 2960G and you are not seeing QDROPs on the interface displays, it's because your IOS is old and not counting platform drops. Use "sh platform port-asic stats drop" and you will see them.
Supposedly, the 3750/49xx etc. have 32 MB or more of internal buffering, which is right around 300 ms of 1 Gbps traffic.
May 15, 2013 at 11:07 a.m. UTC
Just to resurrect this: we're now running into a problem where we use dot1q trunks to the VMware ESX hosts, but it seems the SVI MTU on a 3750X takes the same value as the system MTU, and since this can't be raised above 1998 it presents a serious problem.
July 24, 2013 at 9:27 p.m. UTC
Vincent, typically iSCSI traffic is placed on dedicated access ports, not trunk ports. With the importance of availability and speed on your storage connection, you don't want it competing for resources. If possible, I would recommend you configure your infrastructure that way. Otherwise, you just don't use jumbo frames and add a little latency. Jumbo frames optimize the traffic by repeating the frame and packet headers less often.
November 30, 2013 at 2:40 p.m. UTC
I have done benchmarks on jumbo frames "increasing performance" in our DC in a pure 10G environment. The results show almost no change in bandwidth or CPU load. I tested Windows, *nix, and ESXi VMs, different driver versions, point-to-point flows, many-to-one flows, and many-to-many flows. However, I did manage to get a great increase in bandwidth by fine-tuning TCP and offload options.
May 19, 2014 at 7:15 p.m. UTC
Great article. Now, if someone wants to monitor the switch (the ports for I/O, or something else), how would he do that? SNMP? I am running into sluggish performance and I can't point the finger at the SAN, the switch, or the host. How would someone go about finding where the bottleneck is? Thanks
June 12, 2015 at 11:57 p.m. UTC
I use 2960s for iSCSI and they work flawlessly. To get more bang for the buck, don't forget to tune your queues so that ALL the buffers are lumped into one queue. That will not only consolidate your available buffers, but you can also set it so the reserved per-port buffers are minimal, allowing them to shift as needed between ports.
This can be done with the following commands:
mls qos
mls qos queue-set output 1 threshold 1 100 100 1 400
mls qos queue-set output 1 threshold 2 3200 3200 10 3200
mls qos queue-set output 1 threshold 3 100 100 1 400
mls qos queue-set output 1 threshold 4 100 100 1 400
You'll notice I gave the second queue most of the buffers. That's because this is the default queue that outgoing traffic without any QoS markings maps to.
You can verify that the traffic is in fact using the second queue via this command:
show mls qos int gi0/1 stat
At the bottom of the output you will see the number of packets enqueued. Note that the output shows queues 0-3; however, when you configure it, you use queues 1-4. It's a little confusing at first.
Also, buffers are per ASIC and each switch ASIC is attached to several physical ports. So if you have several high traffic ports that get overrun often because they hook to multiple iSCSI targets, find out which ports group to each ASIC, and distribute the high traffic load between them. That will give you better use of the buffers.