About 6 months ago we updated 3/4 of our Cisco Telephony environment from 8.5 to 11.5. The only reason we didn’t do it all is because UCCX 11.5 wasn’t out yet so it went to 11. While there were a few bumps in the road; resizing VMs, some COP files, etc. the update went well. Unfortunately once it was done we starting having a glorious issue where after a reboot the servers sometimes failed to boot, presenting “FATAL: Could not load /lib/modules/2.6.32-573.18.1.el6.x86_64/modules.dep: No such file or directory”. Any way you put it, this sucked.
The first time this happened I call TAC and while they had seen it, they had no good answer except for rebuild the VM, restore from backup. Finally after the 3rd time (approximately 3 months after install) the bug had been officially documented and (yay) it included a work around. The good news is that the underlying issue at this point has been fixed in 11.5(1.11900.5) and forward so if you are already there, no problems.
The issue lies with the fact that the locked down build of RHEL 6 that any of the Cisco Voice server platforms are built on don’t handle VMware Tools updates well. It’s all good when you perform a manual update from their CLI and use their “utils vmtools refresh” utility, but many organizations, mine included, choose to make life easier and enable vCenter Update Manager to automatically upgrade the VMware tools each time a new version is available and the VM is rebooted.
So how do you fix it? While the bug ID has the fix in it, if you aren’t a VMware regular they’ve left out a few steps and it may not be the easiest thing to follow. So here I’m going to run down the entire process and get you (and chances are, myself when this happens in the future) back up and running.
We’ve been dealing with an issue for past few runs of our monthly SureBackup jobs where the Domain Controller boots into Safe Mode and stays there. This is no good because without the DC booting normally you have no DNS, no Global Catalog or any of the other Domain Controller goodness for the rest of your servers launching behind it in the lab. All of this seems to have come from a change in how domain controller recover is done in Veeam Backup and Replication 9.0, Update 2 as discussed in a post on the Veeam Forums. Further I can verify that if you call Veeam Support you get the same answer as outlined here but there is no public KB about the issue. There are a couple of ways to deal with this, either each time or permanently, and I’ll outline both in this post.
The booting into Safe Mode is totally expected, as a recovered Domain Controller object should boot into Directory Services Restore mode the first time. What is missing though is that as long as you have the Domain Controller box checked for the VM in your application group setup then once booted Veeam should modify the boot setup and reboot the system before presenting it to you as a successful launch. This in part explains why when you check the Domain Controller box it lengthens the boot time allowed from 600 seconds to 1800 seconds by default.
On the Fly Fix
If you are like me and already have the lab up and need to get it fixed without tearing it back down you simply need to clear the Safe Boot bit and reboot from Remote Console. I prefer to
- Make a Remote Console connection to the lab booted VM and login
- Go to Start, Run and type “msconfig”
- Click on the Boot tab and uncheck the “Safe boot” box. You may notice that Active Directory repair option is selected
- Hit Ok and select to Restart
Alternatively if you are command inclined a method is available via Veeam KB article 1277 where you just run these commands
bcdedit /deletevalue safeboot
shutdown -t 01 -r
it will reboot itself into normal operation. Just to be clear, either of these fixes are temporary. If you tear down the lab and start it back to the same point in time you will experience the same issue.
The Permanent Fix
The problem with either of the above methods is that while they will get you going on a lab that is already running about 50% of the time I find that once I have my DC up and running well I have to reboot all the other VMs in the lab to fix dependency issues. By the time I’m done with that I could have just relaunched the whole thing. To permanently fix the root issue is you can revert the way DCs are handled by creating a single registry entry as shown below on the production copy of each Domain Controller you run in the lab.
[HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication]
Once you have this key in place on your production VM you won’t have any issues with it going forward as long as the labs you launch are from backups made after that change is put in use. My understanding is this is a known issue and will eventually be fixed but at least as of 9.5 RTM it is not.
Hi all, just a quick post to serve as both a reminder to me and hopefully something helpful for you. For some reason Microsoft has decided to make installing .Net 3.5 on anything after Windows Server 2012 (or Windows 8 on the client side) harder than it has to be. While it is included in the regular Windows Features GUI it is not included in the on-disk sources for features to be installed automatically. In a perfect world you just choose to source from Windows Update and go about your day, but in my experience this is a hit or miss solution as many times for whatever reason it errors out when attempting to access.
The fix is to install via the Deployment Image Servicing and Management tool better known as DISM and provide a local source for the file. .Net 3.5 is included in every modern Windows CD/ISO under the sources\sxs directory. When I do this installation I typically use the following command set from an elevated privilege command line or PowerShell window:
dism /online /enable-feature /featurename:NetFX3 /all /Source:<WIN10DISKLETTER>:\sources\sxs /LimitAccess
When done the window should look like the window to the left. Pretty simple, right? While this is all you really need to know to get it installed let’s go over what all these parameters are that you just fed into your computer.
- /online – This refers to the idea that you are changing the installed OS as opposed to an image
- /enable-feature – the is the CLI equivalent of choosing Add Roles and Features from Server Manager
- /featurename – this is where we are specifying which role or feature we want to install. This can be used for any Windows feature
- /all – here we are saying we not only want the base component but all components underneath it
- /Source:d:\sources\sxs – This is specifying where you want DISM to look for media to install for. You could also copy this to a network share, map a drive and use it as the source.
- /Limit Access – This simply tells DISM not to query Windows Update as a source
While DISM is available both in the command line as well as PowerShell there is a PS specific command that works here as well that is maybe a little easier to read, but I tend to use DISM just because it’s what I’m used to. To do the same in PowerShell you would use:
Add-WindowsFeature NET-Framework-Core -Source d:\sources\sxs
The following post is something I wrote as an in-house primer for our help desk staff. While it a bit down level from a lot of the content here I find more and more the picking and reliably going with a troubleshooting methodology is somewhat of a lost art. If you are just getting started in networking or are troubleshooting connectivity issues at your home or SMB this would be a great place to start.
We often get issues which are reported as application issues but end up being network related. There are a number steps and logical thought processes that can make dealing with even the most difficult network issues easy to troubleshoot. The purpose of this post is to outline many of the basic steps of troubleshooting network issues, past that it’s time to reach out and ask for assistance.
Understand the basics of OSI model based troubleshooting
The conceptual idea of how a network operates within a single node (computer, smartphone, printer, etc.) is defined by something called the OSI reference model. The OSI model breaks down the operations of a network into 7 layers, each of which is reliant on success at the layers below it (inbound traffic) and above it (outbound traffic). The layers (with some corresponding protocols you’ll recognize) are:
7. Application: app needs to send/receive something (HTTP, HTTPS, FTP, anything that the user touches and begins/ends network transmission)
6. Presentation: formatting & encryption (VPN and DNS host names)
5. Session: interhost communication (nothing to see here:))
4. Transport: end to end negotiations, reliability (the age old TCP vs. UDP debate)
3. Network: path and logical addressing (IP addresses & routing)
2. Data Link: physical addressing (MAC addresses & switches)
1. Physical: physical connectivity (Is it plugged in?)
The image below is a great cheat card for keeping these somewhat clear:
Image source: http://www.gargasz.info/osi-model-how-internet-works/
How OSI is used today is as a template for how to understand and thus troubleshoot networking issues. The best way to troubleshoot any IT problem that has the potential to have a network issue is from the bottom of the stack upwards. Here are a few basic steps to get you going with troubleshooting.
Is it plugged in?
This may seem like a smart ass answer, but many times this is just the case. Somebody’s unplugged the cable or the clip has broken off the Cat6 cable and every time somebody touches the desk it wiggles out. Most of the time you will have some form of a light to tell you that you have both connectivity to the network (usually green) and are transmitting on the network (usually orange).
This troubleshooting represents layer 1 troubleshooting.
Is the network interface enabled?
So the cable is in and maybe you’ve tried to plug the same cable from the wall into multiple devices; you get link lights on other devices but no love on the device you need. This may represent a Data Link issue where the Network Interface Card (NIC) has been disabled in the OS. From the client standpoint this would be within Windows or Mac OSX or whatever, on the other side it’s possible the physical interface on the switch that represents the other end of the wire may be disabled. Check out the OS first and then reach out to your network guy to check the switch if need be.
Can the user ping it?
Moving up to the Network layer, the next step is to test if the user can ping the device which they are having an issue with. Have the user bring up a command prompt and ping the IP address of the far end device.
Can you ping it?
By the very nature of you being an awesomesauce IT person you are going to have more ability to test than the user. To start with, see if you can ping it from your workstation. This will rule out user error and potentially any number of other issues as well. Next if you can’t, are you on the same subnet/VLAN as the device you are trying to access? If not try to access a device in the same subnet as the endpoint device you are testing and ping it from there. That may give you some insight into having issues with default gateway configuration or underlying routing (aka Layer 3) issues.
Can you ping it by name?
Let’s say you can ping it by IP address from all of the above. If the user is trying to access something by name, say server1.foo.com have them ping that as well. It’s possible that while the lower three layers of the stack are operating well, something has gone awry with DNS or other forms of naming that happen at the Presentation layer.
Application firewalls and the like
Finally we’ve reached the top of the stack and we need to take a look at the individual applications. So far you’ve verified that the cable’s plugged in, the NICs on both sides are enabled and you can ping between the user and the far device by both IP and hostname but still the application won’t work so now’s when we look at the actual application and immediately start rebooting things.
Just kidding 🙂 No now we need to look at services that are being present to the network. If we are troubleshooting an e-mail issue is the services running on the server and can we connect to it. When talking about TCP/IP-based traffic (meaning all traffic) all application layer traffic occurs over either a TCP or UDP protocol port. This isn’t something you physically plug-in, but rather it is a logical slot that an application is known to talk on, kind of like a CB radio channel. For example SMTP typically runs on TCP port 25, FTP 21, printing usually on 9100. If you are troubleshooting an e-mail issue bring up a command prompt and try to connect to the device via telnet like “telnet server1.foo.com 25.” If the SMTP server is running on that port at the far end then it will answer, if not the connection will time out.
Call in reinforcements
If you’ve got this far it’s going to take a combination of multiple brains and probably some application owners/vendors to unwrangle the mess those crazy users have made. Reach out to your network and application teams or call in vendor support at this point.
Network troubleshooting isn’t hard, you just have to know where to start.
As a SysAdmin I’m used to waking up, grabbing my phone and seeing the 20 or so e-mails that the various systems and such have sent me over night, gives me an idea of how the day will go and what I need start with. Every so often though you get that morning where the 20 becomes 200 and you just want to roll over and go back to bed. This morning I had about 200, the vast majority of which was from my Cisco Unified Contact Center Express server with the subject “LogPartitionLowWaterMarkExceeded.” Luckily I’ve had this before and know what to do with it but on the chance you are getting it too here’s what it means and how to deal with it in an efficient manner.
WTF Is This?!?
Or at least that was my response the first time I ran into this. If you are a good little voice administrator one of the first things you do when installing your phone system or taking one over due to job change is setup the automatic alerting capability in the Cisco Unified Real Time Monitoring Tool (or RTMT, you did install that, right?) so that when things go awry you know in theory before the users do. One of the downsides to this system is it is an either on or off alerting system meaning what ever log events are saved within the system are automatically e-mailed at the same frequency.
This particular error message is the by-product of a bug (CSCul18667) in the 9.0.x releases of all the Cisco IP Telephony products in which the JMX logs produced by the at the time new Unified Intelligence Center didn’t get automatically deleted to maintain space on the log partition. While this has long since been fixed phone systems are one of those things that don’t get updated as regularly as they should and such it is still and issue. The resulting effect is that when you reach the “warning” level of partition usage (Low Water Mark) it starts logging ever 5 minutes that the level has been reached.
Just Make the Screaming Stop
Now that we know what the issue is how do we fix it?
|Go back to the RTMT application, and connect to the affected component server. Once there you will need to navigate to the Trace & Log Central tool then double-click on the Remote Browse option.
|Once in the Remote Browse dialog box choose “Trace Files” and then we really only need one of the services selected, Cisco Unified Intelligence Center Serviceability Service and then Next, Next, Finish.
|Once it is done gathering all of the log files it will tell you your browse is ready. You then need to drill all the way down through the menu on each node until you reach “jmx.” Once you double-click on jmx you will see the bonanza of logs. It is best to just click one, Ctrl+A to select all and then just hit the Delete button.
|After you hit delete it will probably take it quite a while to process through. You will then want to click on the node name and hit refresh to check but when done you should be left with just the currently active log file. Afterwards if you have multiple nodes of the application you will need to repeat this process for the other.
And that’s it really. Once done the e-mail bleeding will stop and you can go about the other 20 things you need to get done this day. If you are experiencing this and if possible I would recommend being smarter than me and just update your CIPT components to a version newer than 9.0 (11.5 is the current release), something I am hoping to begin the process of in the next month or so.