Once you have confirmed that the router is the problem and settled on an efficient way to approach it, this guide walks you through the troubleshooting. You can also use it as a template for structuring your own work and for keeping colleagues informed when they want to know what’s going on.

This guide assumes that you are familiar with the commands and interface for the type of router you are operating, so the focus is on ways of thinking and acting when it comes to tracking down and resolving issues.

Physical changes

Start by examining physical changes. If cables or network interfaces were changed, the router was replaced, or there was any other physical movement of the equipment, there is a good chance that one of the cables is not connected properly or is broken.

In the midst of an incident, resist the urge to tidy up a crow’s nest of patch cables connected to the router, unless you have a fresh set of cables next to you and know how to ensure full functionality (half-duplex is not full).

Hardware failure


A quick and surprisingly accurate way of determining if there is a hardware issue is to check the LED lights on the router. Red is usually bad, but not always, so check the manual for your particular router to find out which LED lights should be active, what color they should have, and if they should be solid light or blinking.

Also, if you can connect to the router through one interface, you can view the router’s general status. That action will show you if any of the other interfaces have a hardware failure. See Anthony Critelli’s article, A beginner’s guide to network troubleshooting in Linux, for more details.

Router basics

Once you are connected to the router and have looked for hardware failures, check the other basics and then work from there. Here are the commands used by Cisco, as an example:

Command              Description
show version         Provides an overview of the router.
show interfaces      Provides an overview of all interfaces in the router.
show logging         Tells you what kind of logging is configured.
show tech-support    Provides information about CPU and memory utilization through a combination of commands.



Note: More Cisco router commands can be found here.

So, now you should know if the software is up-to-date, if the interfaces are configured and active, and if the router is overloaded or not. This is a good start.
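If you save the output of show interfaces to a text file, a short script can flag every interface that is not fully up. The following is a minimal sketch in Python, assuming the status lines follow the common Cisco IOS form "<name> is <state>, line protocol is <state>". The function name and the sample text are illustrative and were not captured from a real device.

```python
import re

# Matches status lines such as
# "GigabitEthernet0/1 is administratively down, line protocol is down".
STATUS_LINE = re.compile(
    r"^(?P<name>\S+) is (?P<link>[\w ]+?), line protocol is (?P<proto>\w+)"
)


def find_problem_interfaces(show_interfaces_output):
    """Return (name, link_state, protocol_state) for every interface not up/up."""
    problems = []
    for line in show_interfaces_output.splitlines():
        match = STATUS_LINE.match(line)
        if match is None:
            continue  # indented detail lines (counters, addresses) are skipped
        link, proto = match["link"], match["proto"]
        if link != "up" or proto != "up":
            problems.append((match["name"], link, proto))
    return problems


# Illustrative sample only; real output contains many more detail lines.
SAMPLE = """\
GigabitEthernet0/0 is up, line protocol is up
  Hardware is iGbE, address is 0000.0000.0001
GigabitEthernet0/1 is administratively down, line protocol is down
GigabitEthernet0/2 is up, line protocol is down
"""

if __name__ == "__main__":
    for name, link, proto in find_problem_interfaces(SAMPLE):
        print(f"{name}: link {link}, protocol {proto}")
```

An interface that is up with its line protocol down usually deserves attention first, since an administratively down interface was most likely shut on purpose.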

If you have verified that the firmware is out-of-date, don’t just slap on the latest version unless you have a verified plan of how to return to the previous state. During a high profile incident, you are probably in no fit state to read pages of release notes, leaving you unable to assess what other issues might be introduced with the new firmware. Should something go wrong and you don’t know how to back out, the problem just got bigger.

The worst scenario is if the router updates, and then on restart, refuses to come back up. In a big company scenario, you typically have double routers and failover, so you can take the misbehaving router offline while troubleshooting and have the secondary router manage the full load. This situation also means that you must ensure that the secondary router has the latest routing tables.

Never assume. Always check first.

Firmware updates might also reset or change configured values, which means (again) that you need a plan for reapplying the current configuration or returning to a previous state. This is where software like Red Hat Ansible comes in handy. Ansible can version, store, and apply configurations and software/firmware for all infrastructure components. Doing this will save you a lot of time and trouble. It will also provide sufficient logs as part of the documentation to show what was changed, by whom, and when.
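One way to see whether an update silently changed any values is to compare the configuration saved before the update with the one read back afterward. The sketch below uses Python's standard difflib module. The configuration snippets and the function name are hypothetical, and how you export the configuration from your router is outside its scope.

```python
import difflib


def config_diff(before, after):
    """Return unified-diff lines describing how `after` differs from `before`."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before-update",
            tofile="after-update",
            lineterm="",
        )
    )


# Hypothetical configuration snippets, for illustration only.
BEFORE = """\
hostname edge-router-1
ip domain-name example.com
logging buffered 16384
"""

AFTER = """\
hostname edge-router-1
ip domain-name example.com
logging buffered 4096
"""

if __name__ == "__main__":
    print("\n".join(config_diff(BEFORE, AFTER)))
```

An empty result means the two configurations are identical line for line. Anything else is a candidate for your list of values to reapply.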

Another anonymous admin

Larger companies usually have more than one sysadmin managing network components, and with shared accounts like sudo and admin, it is not always clear who did what, even with log analysis.

With anonymous accounts and a high-pace workload, plus the added pressure of the internet not being available, it’s even harder to remember who did what and when. If your company is not already using a change management process, this is a good opportunity to consider doing so. In its simplest form, change management is a document where you write down which component you are working on and the date, and then start each line with a timestamp followed by what you did. Even this simplest form of record is better than none.

With the help of documentation, you can go back and check if any change was completed recently that could potentially cause problems. Even if you are the only admin, I would recommend that you use a change management process (however simple) and document what you plan to do, what you did, and what the outcome was.
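The simplest form of change log described above can be kept with a few lines of Python. This is a minimal sketch, not a replacement for a real change management system. The field layout, the function name, and the example values are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_change(log_path, component, admin, action):
    """Append one timestamped line to the change log and return that line."""
    # UTC avoids confusion when admins work from different time zones.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    entry = f"{timestamp} | {component} | {admin} | {action}"
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(entry + "\n")
    return entry
```

Calling log_change("change-log.txt", "edge-router-1", "jdoe", "Planned: replace uplink patch cable") before the work, and again with the outcome afterward, gives you the plan, the action, and the result in one file. Because each entry names a person, it also helps with the shared-account problem.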

Primary router disagreement

Most corporate networks have more than one router, and close to the internet connection, there is usually fail-over functionality to ensure high availability. If incorrectly configured, this setup can cause arguments among the routers (trust me) about which is the primary router, and if a change is implemented in one router the other might not accept it. A classic example is when the primary router is taken off-line and the secondary rises to power with obsolete routing information.
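To catch a secondary router with obsolete routing information before it takes over, you can compare the routes known to both routers. The sketch below assumes you have already exported each routing table into a Python dictionary that maps prefix to next hop. That export step, the function name, and the sample addresses (taken from documentation ranges) are all hypothetical.

```python
def compare_routes(primary_routes, secondary_routes):
    """Compare two {prefix: next_hop} mappings and report where they disagree."""
    primary_prefixes = set(primary_routes)
    secondary_prefixes = set(secondary_routes)
    shared = primary_prefixes & secondary_prefixes
    return {
        "missing_on_secondary": sorted(primary_prefixes - secondary_prefixes),
        "extra_on_secondary": sorted(secondary_prefixes - primary_prefixes),
        "different_next_hop": sorted(
            prefix
            for prefix in shared
            if primary_routes[prefix] != secondary_routes[prefix]
        ),
    }


# Hypothetical routing data, for illustration only.
PRIMARY = {
    "0.0.0.0/0": "203.0.113.1",
    "192.0.2.0/24": "10.0.0.2",
    "198.51.100.0/24": "10.0.0.3",
}
SECONDARY = {
    "0.0.0.0/0": "203.0.113.1",
    "192.0.2.0/24": "10.0.0.9",
}

if __name__ == "__main__":
    for finding, prefixes in compare_routes(PRIMARY, SECONDARY).items():
        print(f"{finding}: {prefixes}")
```

If all three lists are empty, the secondary holds the same routes as the primary. Not every difference is a fault, but each one is worth explaining before a failover.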

Protocol perception

There are several protocols routers can use, such as OSPF, RIP, EIGRP, and BGP. If a configuration error leaves routers using different protocols, the resulting issues can be catastrophic or intermittent, so make sure every router uses the protocol that your standard specifies.
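A quick way to spot a router that has drifted from the standard is to scan saved configurations for their routing process definitions. This sketch assumes Cisco IOS-style configuration text, where a routing process starts with an unindented line such as "router ospf 1" or "router bgp 65000". The router names, function names, and configuration snippets are hypothetical.

```python
import re

# An unindented "router <protocol> ..." line opens a routing process.
ROUTER_LINE = re.compile(r"^router (\w+)", re.MULTILINE)


def routing_protocols(config_text):
    """Return the set of routing protocols enabled in one configuration."""
    return set(ROUTER_LINE.findall(config_text))


def protocol_mismatches(configs, standard):
    """Map router name -> enabled protocols, for routers that deviate from `standard`."""
    expected = set(standard)
    mismatches = {}
    for name, text in configs.items():
        enabled = routing_protocols(text)
        if enabled != expected:
            mismatches[name] = sorted(enabled)
    return mismatches


# Hypothetical configuration snippets, for illustration only.
CONFIGS = {
    "core-1": "hostname core-1\nrouter ospf 1\n network 10.0.0.0 0.0.0.255 area 0\n",
    "core-2": "hostname core-2\nrouter eigrp 100\n network 10.0.0.0\n",
}

if __name__ == "__main__":
    print(protocol_mismatches(CONFIGS, {"ospf"}))
```

The standard is passed in as a set so that a design that deliberately runs more than one protocol can still be checked the same way.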

Security breach

This is a scary one and should send you off to change passwords and SSH keys immediately to prevent additional damage and block the risk of being locked out of the system. Storing configurations in GitHub and using Ansible to retrieve them, and enforcing the desired state on network components, is a great way to prevent bad configurations from gaining a foothold in the network.

A former employee, perhaps even a sysadmin, with a grudge might have the ability to wreak havoc in the network unless you have a policy of regularly changing passwords and keeping strict control of user accounts. Many years ago, a company had the generic account "admin" and the secret password "penguin" on all network components, and guess what? Operations were disrupted for almost 24 hours, and thanks to everything being 100% manual, it took almost six months to weed out the old admin password from all components.

Communicate with the team, align with the security officer, gather evidence (document), and follow company routines (which, most likely, involve a police report).

Physically disconnect different segments of the network to contain the damage. Doing so gives you time to assess the damage and work out a plan of action to restore operations. Remember that panic is also your enemy, especially in these situations.

Any change management record that involves a router and is listed as "zero impact" should never be allowed or trusted. Updating a router is by default “high impact” because it has the same destructive potential as an excavator going through a porcelain shop. Make sure you have a working backup of the router configuration before attempting any sort of configuration change.

However, if a change is implemented and the network goes belly up, and you have no backup and are not sure what the configuration looked like before the change, you have to adopt a "simpler is better" strategy. Work your way toward a basic level of routing and then take it from there. A full restore in this scenario will most likely take time. It should involve more than one admin and have both backup and documentation as part of the result.
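Taking a backup before every change is easier to stick to when it is a single function call. The following Python sketch stores a configuration, assumed to be already exported from the router as text, in a timestamped file and reads it back to verify the write. The function name, file naming scheme, and directory layout are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def save_backup(config_text, router_name, backup_dir):
    """Write config_text to a new timestamped file and return its path."""
    directory = Path(backup_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"{router_name}-{stamp}.cfg"
    data = config_text.encode("utf-8")
    path.write_bytes(data)
    # Read the file back so a failed or truncated write is caught now,
    # not in the middle of a restore.
    if path.read_bytes() != data:
        raise OSError(f"backup verification failed for {path}")
    return path
```

The timestamp has a resolution of one second, so two backups of the same router within the same second would overwrite each other. A backup only counts as working once you have restored from it at least once, so treat the read-back check as a minimum.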

Preventative measures

A tool like Ansible can keep track of which firmware and configurations are deployed, by whom, and at what time. You can use Ansible and enforce that all changes go through this tool.

Ansible also lets you keep a component in a "desired state," meaning that if someone manually changes the configuration by logging into the device, Ansible restores the intended configuration the next time it runs, for example on a schedule of every 60 seconds.
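The idea behind a desired state can be illustrated in a few lines of Python. This is a conceptual sketch only and is not how Ansible is implemented. The fetch_running and apply_config callables are hypothetical stand-ins for whatever reads and writes the configuration on your device.

```python
def enforce_desired_state(desired, fetch_running, apply_config):
    """Re-apply `desired` if the device has drifted; return True if a fix was pushed."""
    if fetch_running() == desired:
        return False  # nothing to do, the device matches the desired state
    apply_config(desired)
    return True


if __name__ == "__main__":
    # An in-memory stand-in for a device, used only to demonstrate the check.
    device = {"config": "hostname edge-router-1\n"}
    desired = "hostname edge-router-1\n"

    # Simulate a manual change made by someone logged in to the device.
    device["config"] = "hostname test\n"

    fixed = enforce_desired_state(
        desired,
        fetch_running=lambda: device["config"],
        apply_config=lambda text: device.update(config=text),
    )
    print("drift corrected" if fixed else "no drift")
```

Run on a schedule, a check like this bounds how long a manual change can survive, which is where a figure such as 60 seconds comes from.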

Roundup

You can use this document as a checklist and work your way through it in order to avoid unstructured troubleshooting, which can lead to even more issues. It is essential to avoid adding stress and confusion to an already stressful situation. Make sure that the work you and your team perform is well structured and closely aligned. The worst scenario is, of course, a security breach, in which case you need to contain the damage by physically disconnecting network segments.
