I'm looking for some post-incident advice so this doesn't happen again.

We have a network core of two Cisco 4500x switches configured for VSS redundancy. Off those we have our iSCSI devices, our HP bladecenter for vSphere, aggregated links to our user access switches, and a pair of 4948e switches for copper devices in our server room. Off the 4948es we have a pair of 2960 switches for two ISP links, and a pair of ASAs as firewalls. Pretty decent redundancy, except that a lot of the devices connecting to the 4948es only have single NICs - only so much we can do.
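For reference, a rough sketch of the topology as described - a reader aid, not a precise diagram:

4500x VSS core
 - 10Gb fiber: iSCSI devices, HP bladecenter (vSphere)
 - aggregated links to user access switches
 - 4948e pair (copper server room; many single-NIC hosts)
    - 2960 pair (two ISP links)
    - ASA pair (crossover between the two units)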

We're preparing to replace our current user access switches (old Extremes) with Meraki. We're also implementing Meraki APs to replace our current Arubas. Part of the wireless project involves making some new VLANs and subnets, for AP management and guest wireless.

We had two defined VLANs (20 and 40) on the 4500x that were not used anywhere - I confirmed the subnets were empty, no ports were using them, etc. I went into the 4500x, issued "no interface vlan 20", and then rebuilt it with the subnet I wanted. I then added it to the two 10Gb ports that connect to the Merakis:

switchport trunk allowed vlan <previous list plus two VLANs above plus existing wireless VLAN>
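One gotcha worth calling out for anyone following along: switchport trunk allowed vlan <list> replaces the entire allowed list, so retyping the whole list by hand risks knocking a production VLAN off the trunk with one typo. The add keyword appends instead. A minimal sketch, with a hypothetical interface name:

interface TenGigabitEthernet1/1/1
 switchport trunk allowed vlan add 20,40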

I noticed that VLANs 20 and 40 were shut down, so I issued no shutdown on them. I lost access to the Merakis at that point, which is when I realized I hadn't added the VLANs to the port-channel interface for that link.
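For completeness, the piece I missed would have been something like this (port-channel number is hypothetical). On IOS, when links are bundled, the trunk change belongs on the port-channel interface, and it propagates down to the member ports:

interface Port-channel10
 switchport trunk allowed vlan add 20,40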

Half of our environment became unreachable at this point.

Our Internet link went extremely flaky. Our Avaya VoIP phones couldn't dial in or out. A couple of copper-connected iSCSI devices became unavailable - no outage for anything user-facing, but our backups and mail archive were impacted. I went into the server room and disconnected the Merakis from the 4500x (unplugged both 10Gb fiber ports) in case I had somehow created a loop - no change. I admit to simply staring at this for a while at that point.

I pulled up Orion and noted that one of our external switches (a Cat2960) and one of our ASA pair were down as well. It was apparent that we had some sort of partial LAN connectivity loss, but the ASAs are cross-connected to each other for failover, and since their uplinks never went down, they never failed over to the unit our internal devices could still reach. I shut down the "down" ASA and the internet became reachable again.
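One follow-up I'm considering from this: as I understand ASA failover, monitored interfaces are actively health-checked, so a unit whose inside interface stops passing traffic can trigger failover even when the link itself stays up. A hedged sketch, assuming an interface nameif of "inside" and that one failed monitored interface should force the switchover:

monitor-interface inside
failover interface-policy 1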

I called TAC, and after a couple of hours of wrestling with a tech who kept nitpicking the port config of every downed host I showed him on the 4500x, I logged into one of our 4948e switches and showed him that it couldn't ping things that were directly connected and up - one of our Windows-based copper iSCSI devices, an iLO interface on our bladecenter, etc.
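For the post-incident notes, these are the standard IOS checks I'd run from the 4948e next time before anyone reaches for a reboot (the interface name is hypothetical):

show spanning-tree vlan 20
show mac address-table interface GigabitEthernet1/1
show etherchannel summary
show logging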

He had looked over the logs and hadn't found anything, but at this point he said, "Looks like a spanning-tree bug, even if I don't see it in the logs." We rebooted the 4948e, and all of its directly-connected hosts came right back up - including the Avaya cabinet, so our phones started working again. We still had problems with the fiber-connected devices on the 4500x - dead paths, though since everything there was redundant, nothing was down outright. He wanted to power-cycle the 4500x ungracefully, but it carries all of our 10Gb iSCSI, and that would have given our vSphere environment (essentially all of our servers) a very bad week. I talked him into doing a graceful redundancy switchover instead, which took care of the remaining problems.
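For anyone who lands in the same spot: the graceful option on a 4500x VSS pair is an SSO switchover to the standby rather than a hard power cycle - roughly:

! confirm the standby peer is in hot standby first
show redundancy
redundancy force-switchover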

TL;DR: I made a fairly innocuous change to our core and caused a hideous problem. Did I make a config mistake that should have been predicted to cause this - e.g., if I had no-shutdown the VLANs first, added them to the port-channel, and then added them to the ports, would this have been avoided? The Cisco tech didn't say so; he said that with uptimes over a year and old IOS versions, situations like this aren't surprising.
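For the record, here's the tighter sequence I'd use next time - define the L2 VLAN, stage the SVI shut down, open the trunk on the port-channel (which propagates to the member ports), verify, then no shut. Whether it would have dodged a spanning-tree bug I can't say, but it removes the ordering mistake. The subnet, name, and port-channel number here are hypothetical:

vlan 20
 name AP-Mgmt
interface Vlan20
 ip address 10.20.0.1 255.255.255.0
 shutdown
interface Port-channel10
 switchport trunk allowed vlan add 20,40
! verify with: show interfaces trunk
interface Vlan20
 no shutdown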