The Basics of Network Troubleshooting

The following post is something I wrote as an in-house primer for our help desk staff. While it is a bit more basic than a lot of the content here, I find more and more that picking a troubleshooting methodology and reliably sticking with it is somewhat of a lost art. If you are just getting started in networking, or are troubleshooting connectivity issues at your home or SMB, this would be a great place to start.

We often get issues that are reported as application issues but end up being network related. There are a number of steps and logical thought processes that can make even the most difficult network issues easy to troubleshoot. The purpose of this post is to outline many of the basic steps of troubleshooting network issues; past that, it’s time to reach out and ask for assistance.

  1. Understand the basics of OSI model based troubleshooting

    The conceptual idea of how a network operates within a single node (computer, smartphone, printer, etc.) is defined by something called the OSI reference model. The OSI model breaks down the operations of a network into 7 layers, each of which is reliant on success at the layers below it (inbound traffic) and above it (outbound traffic). The layers (with some corresponding protocols you’ll recognize) are:

    7. Application: app needs to send/receive something (HTTP, HTTPS, FTP, anything that the user touches and begins/ends network transmission)
    6. Presentation: formatting & encryption (VPN and DNS host names)
    5. Session: interhost communication (nothing to see here:))
    4. Transport: end to end negotiations, reliability (the age old TCP vs. UDP debate)
    3. Network: path and logical addressing (IP addresses & routing)
    2. Data Link: physical addressing (MAC addresses & switches)
    1. Physical: physical connectivity (Is it plugged in?)

    The image below is a great cheat card for keeping these somewhat clear:

    [Image: OSI_2014]

    Image source: http://www.gargasz.info/osi-model-how-internet-works/

    Today the OSI model is used as a template for understanding, and thus troubleshooting, networking issues. The best way to troubleshoot any IT problem that could be network related is from the bottom of the stack upwards. Here are a few basic steps to get you going.
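    As a rough illustration of the bottom-up idea, here is a minimal Python sketch. The check functions are hypothetical stand-ins (simple lambdas here), and the function name is made up for this example; the point is only that checks run from the lowest layer upwards and stop at the first one that fails.

```python
from typing import Callable, List, Optional, Tuple

# Each entry is (OSI layer number, description, check function).
# The check functions are hypothetical stand-ins; swap in real tests.
Check = Tuple[int, str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[Check]:
    """Run checks from the lowest layer upwards and return the first failure."""
    for check in sorted(checks, key=lambda c: c[0]):
        _layer, _description, test = check
        if not test():
            return check
    return None


if __name__ == "__main__":
    example_checks: List[Check] = [
        (3, "Can you ping it by IP?", lambda: False),
        (1, "Is it plugged in?", lambda: True),
        (2, "Is the network interface enabled?", lambda: True),
    ]
    failure = first_failing_layer(example_checks)
    if failure is None:
        print("All checks passed")
    else:
        print(f"Start looking at layer {failure[0]}: {failure[1]}")
```

    Sorting by layer number means the order in which the checks are listed does not matter; the lowest failing layer is always the one reported.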

  2. Is it plugged in?

    This may seem like a smart ass answer, but many times this is exactly the case. Somebody’s unplugged the cable, or the clip has broken off the Cat6 cable and every time somebody touches the desk it wiggles out. Most of the time you will have some form of link light to tell you both that you have connectivity to the network (usually green) and that you are transmitting on the network (usually orange).

    This is layer 1 troubleshooting.

  3. Is the network interface enabled?

    So the cable is in and maybe you’ve tried plugging the same cable from the wall into multiple devices; you get link lights on other devices but no love on the device you need. This may represent a Data Link issue where the Network Interface Card (NIC) has been disabled in the OS. From the client standpoint this would be within Windows or Mac OS X or whatever. On the other side, it’s possible that the physical interface on the switch at the other end of the wire has been disabled. Check the OS first and then reach out to your network guy to check the switch if need be.

  4. Can the user ping it?

    Moving up to the Network layer, the next step is to test if the user can ping the device which they are having an issue with. Have the user bring up a command prompt and ping the IP address of the far end device.
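    If you would rather script this than talk the user through a command prompt, something like the following Python sketch could work. It assumes a `ping` executable is on the PATH and that Windows takes `-n` for the packet count while most other systems take `-c`; the function names and the address in the usage note are placeholders, not part of any existing tool.

```python
import platform
import subprocess
from typing import List, Optional


def build_ping_command(host: str, count: int = 4, system: Optional[str] = None) -> List[str]:
    """Build a ping command line; Windows uses -n for the count, most others -c."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def can_ping(host: str, count: int = 4) -> bool:
    """Return True if the ping command exits with status 0."""
    result = subprocess.run(
        build_ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

    Calling `can_ping("192.0.2.10")` would then give a quick yes/no. The exit status is only a rough signal, so when the answer is surprising it is still worth reading the actual ping output.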

  5. Can you ping it?

    By the very nature of you being an awesomesauce IT person you are going to have more ability to test than the user. To start with, see if you can ping it from your workstation. This will rule out user error and potentially any number of other issues as well. Next, if you can’t, check whether you are on the same subnet/VLAN as the device you are trying to access. If not, try to get onto a device in the same subnet as the endpoint you are testing and ping it from there. That may give you some insight into default gateway configuration or underlying routing (aka Layer 3) issues.
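    Working out whether two devices share a subnet is easy to get wrong by eye. A small sketch using Python’s standard `ipaddress` module is shown here; the function name and the sample addresses are made up for illustration.

```python
import ipaddress


def same_subnet(local_interface: str, remote_ip: str) -> bool:
    """Return True if remote_ip falls inside the network of local_interface.

    local_interface is an address with a prefix length or netmask,
    for example "192.168.10.25/24" or "192.168.10.25/255.255.255.0".
    """
    network = ipaddress.ip_interface(local_interface).network
    return ipaddress.ip_address(remote_ip) in network


if __name__ == "__main__":
    print(same_subnet("192.168.10.25/24", "192.168.10.200"))
    print(same_subnet("192.168.10.25/24", "192.168.20.5"))
```

    If this comes back false, traffic to the far device has to go through your default gateway, so a failed ping points at gateway or routing configuration rather than at the local segment.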

  6. Can you ping it by name?

    Let’s say you can ping it by IP address from all of the above. If the user is trying to access something by name, say server1.foo.com, have them ping that as well. It’s possible that while the lower three layers of the stack are operating well, something has gone awry with DNS or other forms of naming that happen at the Presentation layer.
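    To separate "the name does not resolve" from "the host does not answer", you can test name resolution on its own. The sketch below uses Python’s `socket.getaddrinfo`, which asks the operating system’s resolver (hosts file, DNS, and so on); the `resolve` helper is a name invented for this example.

```python
import socket
from typing import List


def resolve(name: str) -> List[str]:
    """Return the unique IP addresses a name resolves to, or [] on failure."""
    try:
        results = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in results:
        # sockaddr[0] is the address string for both IPv4 and IPv6 results.
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


if __name__ == "__main__":
    print(resolve("server1.foo.com"))
```

    An empty list suggests a naming problem. A list containing an address you did not expect suggests a stale or wrong record, which can look exactly like a network outage to the user.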

  7. Application firewalls and the like

    Finally we’ve reached the top of the stack and we need to take a look at the individual applications. So far you’ve verified that the cable’s plugged in, the NICs on both sides are enabled and you can ping between the user and the far device by both IP and hostname but still the application won’t work so now’s when we look at the actual application and immediately start rebooting things.

    Just kidding 🙂 No, now we need to look at the services that are being presented to the network. If we are troubleshooting an e-mail issue, is the service running on the server and can we connect to it? When talking about TCP/IP-based traffic (meaning nearly all traffic), application layer traffic generally occurs over either a TCP or a UDP port. This isn’t something you physically plug into, but rather a logical slot that an application is known to talk on, kind of like a CB radio channel. For example SMTP typically runs on TCP port 25, FTP on 21, and printing usually on 9100. If you are troubleshooting an e-mail issue, bring up a command prompt and try to connect to the device via telnet, like “telnet server1.foo.com 25”. If the SMTP server is running on that port at the far end then it will answer; if not, the connection will be refused or time out.
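    Many newer systems no longer ship a telnet client, so here is a rough Python equivalent of the telnet test. It only checks whether a TCP connection can be opened; it does not speak SMTP or any other protocol. The function name and the returned strings are choices made for this sketch.

```python
import socket


def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and report "open", "refused", "timeout" or "error"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all; often a firewall silently dropping the traffic.
        return "timeout"
    except OSError:
        # Anything else, such as a name that does not resolve.
        return "error"


if __name__ == "__main__":
    print(check_tcp_port("server1.foo.com", 25))
```

    The difference between "refused" and "timeout" is a useful hint: a refusal usually means the service is not running, while a timeout more often means something in between is blocking the port.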

  8. Call in reinforcements

    If you’ve got this far it’s going to take a combination of multiple brains and probably some application owners/vendors to unwrangle the mess those crazy users have made. Reach out to your network and application teams or call in vendor support at this point.

Network troubleshooting isn’t hard, you just have to know where to start.

Configuring Networking for Nimble-vSphere iSCSI

One of my last tasks for 2014 was integrating a new Nimble Storage array into our environment. As this is the first of these I’ve encountered, and I haven’t been able to take the free one day Nimble Installation and Operation Professional (NIOP) course they provide, I was left feeling my way through it with great help from their documentation, and only ended up calling support to resolve a bug related to upgrading from 2.14 of the Nimble OS. On the network side our datacenter is powered by Cisco Nexus 3000 series switches, also a recent addition for us. These allowed us to use our existing Cat6 copper infrastructure while increasing our bandwidth to 10 GbE. In this post I’m going to document some of the setup required to meet the best practices outlined in Nimble’s Networking Best Practices Guide when setting up your system with redundant NX-OS switches.

Updating the Code of an ipbase Licensed Cisco Catalyst 3750X Switch Stack

Here in the office the Access Layer of our switching infrastructure is handled completely by a 7 unit stack of Cisco 3750X switches.  There is no need for these to do any routing other than inter-VLAN, so when we purchased them 3 years ago we just ordered the IP Base licensing level.  Well, from what I can tell there is a universal code base and a licensed feature level for each code revision.  The universal naming convention looks like c3750e-universalk9-mz.122-55.SE1 while the ipbase looks like c3750e-ipbasek9-mz.150-2.SE6.  What I found is that I do not have the ability to download the universal code of later releases due to my licensing level, and possibly the lack of SmartNet on these, but I do have access to the ipbase code.  When attempting to update the code on this stack I was presented with the error


After some searching I found references to others trying to go from IP Base to Advanced IP Services code who had to put the /allow-feature-upgrade switch on the archive download-sw command in order to allow the upgrade, and it seems the same applies to a downgrade.  Evidently this feature came about with IOS version 12.2(35).  With that the upgrade progressed and I have happy little upgraded switches.


Another note about this upgrade that I found in the official release notes: any upgrade from 15.0(2)SE to a later release will result in a microcode upgrade which, when unmitigated, will lead to an exceptionally long restart of the switch.  You can mitigate this either by using the /force-ucode-reload parameter when downloading the code to the devices or by using the archive download-sw /upgrade-ucode privileged EXEC mode command afterwards.