In recent months I seem to have hit a lot of bugs in Cisco software. Across the board on the main software releases of IOS, NX-OS or IOS-SX I seem to be hitting a wide range of bugs, and some of them are pretty stupid. And I’ve realised that, in recent years, it has become so commonplace, so accepted that we actually plan our projects with time to test, locate and check for bugs. And that’s become an expensive and time-consuming problem.

Why do we put up with this?

How is testing done?

I’m told that Cisco does very little of their own testing. It’s outsourced to a number of specialist companies in India who then perform specified test plans (under the terms of the scope of works) and hand the results back to the developers. However, IOS development consists of many different teams, e.g. IP Multicast and BGP are developed by two different teams, and tested by two different companies. Each team writes the test plans and then tests only its own code.

It appears that no-one is testing the convergence of BGP and Multicast because there is no team responsible for overall testing.

Process Failure

As a more concrete example, consider this bug report: CSCts80367 Bug Details – AnyConnect 3.0 for Mac gets “Certificate Validation Failure” w/ ASA 8.4.

Symptom: AnyConnect 3.x for Mac gets “Certificate Validation Failure”

Conditions: AnyConnect 3.x for Mac connecting to ASA running 8.4 and using certificates to authenticate

Workaround: Downgrade ASA to 8.3

In other words, the certificate validation code in ASA 8.4 has not been fully tested against Cisco AnyConnect. That’s astonishing.

Letting this bug into a shipping version of the Cisco ASA code is a serious failure of process.

Let’s consider the prerequisites, just to check whether this is a unique problem:

AnyConnect 3.x: common, and Cisco is trying to “educate” users to move to AnyConnect.

SSL VPNs: very common.

ASA 8.4: latest code with a lot of bug fixes for 8.3.

Certificates for authentication.

Mac OS X.

These features are not some sort of unique corner condition that very few people use. This is an everyday, garden-variety use case that should be part of the standard acceptance testing. OK, so Macs aren’t common, but they are NOT rare.
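To show how small this testing problem is, here is a minimal Python sketch that enumerates an interoperability matrix from the prerequisites above and reports which combinations no test plan has covered. The dimension values and the names `build_matrix` and `untested` are hypothetical, chosen purely for illustration; this is not Cisco’s actual support matrix or test process.

```python
from itertools import product

# Hypothetical test dimensions, loosely based on the bug report above.
# These values are illustrative assumptions, not a real support matrix.
DIMENSIONS = {
    "client": ["AnyConnect 2.x", "AnyConnect 3.x"],
    "client_os": ["Windows", "Mac OS X", "Linux"],
    "asa": ["ASA 8.3", "ASA 8.4"],
    "auth": ["password", "certificate"],
}


def build_matrix(dimensions):
    """Return every combination of the given dimensions as a list of dicts."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


def untested(matrix, tested):
    """Return the combinations that do not appear in the tested list."""
    return [case for case in matrix if case not in tested]


if __name__ == "__main__":
    matrix = build_matrix(DIMENSIONS)
    # Pretend only the Windows combinations were ever run (an assumption).
    tested = [case for case in matrix if case["client_os"] == "Windows"]
    gaps = untested(matrix, tested)
    print(f"{len(matrix)} combinations, {len(gaps)} never tested")
    for case in gaps:
        print(" / ".join(case.values()))
```

Even with these made-up dimensions the full matrix is only a couple of dozen cases, which is the point: the combination in the bug report is one row in a small table, not an exotic corner.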

That this bug exists tells me that internally, Cisco isn’t doing testing properly.

Does the TAC allow bad testing?

In a word, yes. Cisco has rightly been lauded for the value of the TAC to customers for solving problems and building one of the best support organisations in the world. But I’m beginning to take the view that fixing problems that I should not have is not valuable. That is, the TAC has to be good because there are so many problems with the products.

In fact, if you aren’t thinking about it, you might say to yourself “Hey, Cisco fixed my problem so it’s all good”. What you should really be saying is “Another bug in Cisco software, time to lodge a complaint for shipping faulty product!”



Ten years ago, I could accept some bugs. I needed the new features, fast and furious. Today, I need stability and reliability because of the cost of managing faulty product. Consider the cost of managing MS Windows servers: just the monthly patching cost (which should be mostly unnecessary) is enormous. It’s so bad that there are entire companies devoted to patch management. A poor product is causing this, and you are paying to fix it.

In the same way, the TAC kind of hides the fact that IOS quality is low. At least, unless you have Partner maintenance, in which case you are really going to have a bad time getting fixes.

Lessons from Apple

Once upon a time, I believed that my laptop needed several things:

Windows OS needed patching every month

Hardware replacing every nine months

Expect at least one, maybe two hardware failures in those nine months.

Reinstalling the OS to a blank formatted HDD, installing apps, restoring data, and resetting all defaults.

Since I moved to an Apple Mac about five years ago, I’ve not had a hardware failure, I’m only on my second laptop, I’ve never had to reinstall the OS, and every upgrade has allowed me to carry my settings and data forward. Big difference.

That’s a qualitative experience where quality hardware and quality software gave me a great experience. That’s how I measure my vendors today. I demand nothing less.

What I want

I want the confidence to say that it is unlikely that I will hit a bug. I accept that bugs are inevitable, but they should be an exceptional event, not something that we plan for. Can you believe that we actually expect bugs to be in the software and spend money to LOOK for them? None of us should pay for buggy software. We should not pay maintenance to fix a defective product; the vendor should make good on the promise of quality software and hardware. It should work as documented. Don’t accept second best. Lodge complaints with your account manager (if there are any left) or by some other means. Blog about your bugs, and your experiences in getting them fixed.

Logically, Cisco must want customers to detect bugs for them – in which case I want free access to code updates AT LEAST. Or cheaper prices. Or both.

The EtherealMind View

In the past, Cisco has always shipped code early and told customers to ‘fill their boots’ and let them know if they find anything. I think it’s time for Cisco to take code and product quality seriously. Instead of relying on customers to find bugs and report them to TAC, Cisco needs to do their own testing.

And customers need to tell Cisco to improve their products. I can’t afford to be spending hundreds of hours testing for bugs; that’s what my vendor should do. If Cisco isn’t doing it properly, then I should be going somewhere else.

Quality matters. It really matters.

Postscript 20111116-1541

I should point out that most of this article applies equally to other vendors, but I haven’t had the privilege lately to work on other vendors’ kit. My customers are heavily into Cisco, and therefore so am I. My previous experiences with Foundry (now Brocade), Juniper EX, and HP ProCurve switches, as examples, in years past have left a bitter taste and lots of ‘war stories’.

Vendors need to focus on software quality and great user experiences, and less on the rollout of ‘exciting marketing’ features. In the current economic climate, let’s talk about new features but be careful about introducing them. Or create stable and unstable code versions, so we know when we are at risk.

I am calling out Cisco, because the TAC isn’t as good today as years gone by (budget cuts probably) and my bad experiences are certainly piling up. I’m hoping for some change.

Hoping.

Related

Ethan has blogged on similar topics at Packet Pushers about software problems at Cisco.
