
Cisco Call Manager database replication failure due to incorrect MTU April 28, 2012

Posted by jamesisaac in Uncategorized.

This weekend we changed from one WAN ethernet circuit to a different one for our metro ethernet. We have a Call Manager publisher on one side and a subscriber on the other side. After changing the circuit, we discovered that we couldn’t dial out from the subscriber side to anything other than local (in the building) phones. What’s going on? The servers were unchanged; we used the same routers and interfaces for the new circuit. The only thing that happened was unplugging one cable and plugging in another one. Why would the phone routing fail in such a way?

Here’s the other interesting bit of business: during our troubleshooting, we rebooted the local (subscriber) call manager. We then tested the phones and found that we could dial out! Great, problem solved. Except that 10 minutes later we couldn’t dial out again. Putting two and two together, we realized that when the phone was registered to the publisher, we could dial out. When the subscriber rebooted and the phones re-registered to the local CCM server, our calls failed again. Hmm. It didn’t seem to be a routing problem, because all of our ping testing was successful. And when the phone is registered to the publisher, everything works! So clearly it can’t be a network problem.

At this point TAC transferred us to the database replication team and we started getting packet dumps and analyzing them with Wireshark. The smoking gun appeared. During the database replication from the publisher to the subscriber, there's a large packet containing the CCM certificate. This packet is tagged as "do not fragment". It was not reaching the subscriber, even though other packets around it were. Thus: the MTU.
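
A quick way to confirm this sort of path MTU problem from any Cisco router along the path is to send pings with the DF bit set and vary the size until they start failing. A sketch; the address and sizes here are placeholders:

    router# ping 10.1.2.3 size 1500 df-bit
    router# ping 10.1.2.3 size 1400 df-bit

If the 1500-byte DF pings die while the 1400-byte ones survive, something in the path is dropping large frames rather than fragmenting them.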

We changed the MTU on both servers to 1400 and retested the database replication. The network tests immediately passed, and the replication began. Once the pub and sub were in sync, our calls outside the building were successful.
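
For reference, on the appliance-model (Linux-based) Call Manager servers the MTU is set from the OS admin CLI. A sketch, assuming the platform CLI on your version (the node warns you and has to restart, so plan a window):

    admin:set network mtu 1400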

So why would this change? My guess is that our telco provisioned the replacement ethernet circuit with their own VLAN tags on it, to run multiple companies' traffic over the same circuit, which results in a smaller total frame available to us. Plus we had our own VLAN tags inserted (VLAN within VLAN). A large packet with the "do not fragment" bit set gets dropped instead of fragmented. We had to change the server behavior to build a smaller packet, and once we did that, everything worked.
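
The rough arithmetic, assuming the provider's gear enforces the standard 1518-byte maximum Ethernet frame (many carriers support larger frames, but apparently ours didn't):

    1500 payload + 18 header/CRC           = 1518 bytes (standard frame)
    + 4 bytes for our 802.1Q tag           = 1522 bytes
    + 4 bytes for the provider's outer tag = 1526 bytes

Anything that pushes the frame past the carrier's limit, with DF set, simply vanishes; dropping the server MTU to 1400 keeps the double-tagged frame comfortably under the cap.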

 

VMWare ESX CHAP password recovery January 20, 2012

Posted by jamesisaac in Uncategorized.

Just a quick note, once again brought about by necessity.

This evening I needed to hook up a new VMWare ESX host to an old SAN (and by old, I mean it was about four years old). The original configuration notes were long gone. After plugging the new host into the SAN switch and configuring an iSCSI ip address, I was faced with the dilemma of finding out what the iSCSI CHAP authentication username and password were. I could see from the existing ESX hosts that we were using CHAP authentication, and I could get the username from the GUI. But what's the password? I tried a few of our "tried and true" passwords, but had no luck.

Option 1: log into the SAN and reset the CHAP authentication. Change the passwords on the VMWare hosts at the same time. Downside: outage. Upside: now we know what the password is.

There’s got to be another way… and there is!

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003095

This KB gave me the info in a roundabout way. If you can read the /etc/vmkiscsi.conf file on an existing VMWare host, then the CHAP username and password are listed in cleartext. Yay.
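
If you have console access to an existing host, pulling the values is a one-liner. A sketch; the exact key names in the file vary by ESX version, so grep loosely:

    # on the existing ESX host's service console
    grep -i chap /etc/vmkiscsi.conf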

I grabbed this information, put it into the new host, rescanned the iSCSI adapter, and away we went.

 

Can’t delete snapshots after failed Veeam job May 4, 2011

Posted by jamesisaac in Uncategorized.

http://www.kendrickcoleman.com/index.php?/Tech-Blog/issues-with-consolidate-helper-0-snapshot.html

Ran into this issue today after testing a Veeam backup job in “hot add” mode. The job failed and left several snapshots in the VM’s directory. I attempted to delete the snapshots using the VI Client snapshot manager, and it showed that the snapshots were deleted, but they were still present on the disk.

The issue was that the disk was still attached to the Veeam virtual machine. The quick fix is to edit the properties of your Veeam guest and remove the other VM's disk, then remove the snapshots. In my case I had already jumped the gun by attempting to delete the snaps through snapshot manager. I followed the clone process listed in the link above, then removed the hard disk from the Veeam guest, and was then able to delete the source of the clone. Finally, to restore my environment to its original condition, I cloned the clone back to the original server name and deleted the clone.
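
For reference, the clone step in that process comes down to vmkfstools from the service console; a sketch with placeholder datastore and VM names:

    # clone the stuck disk to a fresh VMDK
    vmkfstools -i /vmfs/volumes/datastore1/myvm/myvm.vmdk \
               /vmfs/volumes/datastore1/myvm-clone/myvm-clone.vmdk
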
And all is well!

IIS5 and intermediate certificates April 25, 2011

Posted by jamesisaac in Uncategorized.

Notes regarding renewing an SSL certificate – we use Thawte for some of our SSL certs and in the past year they have moved to using intermediate certificates (along with everyone else). When you renew the certificate in IIS5, on a Windows 2000 server, you can either get the PKCS #7 cert which contains the intermediary certificates, or do like I did and get the x.509 cert because that’s what you used last year.

Once applied, you will find that SSL breaks because the certificate path can't be verified. Oh noes. So quickly go back to IIS and replace your current certificate with the previous one (you do have a couple of days before it expires, right?).

Then go to the SSL provider’s website and download the intermediate certificates. In Thawte’s case, they are located here:

https://search.thawte.com/support/ssl-digital-certificates/index?page=content&actp=CROSSLINK&id=SO14996

Download the SSL CA and the Primary Root CA. Create a new MMC for the Certificates snap-in (specifying the Local Computer as the target), and then import the certificates. They should import successfully.
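
If you would rather script the import than click through the MMC, and your Windows build ships certutil (it isn't on every Windows 2000 box), something like this should land the certs in the machine store; the filenames are hypothetical:

    certutil -addstore CA thawte_ssl_ca.cer
    certutil -addstore Root thawte_primary_root.cer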

After they are imported, you can go back to IIS and replace your old certificate with the new one you added earlier, and then verify that the SSL path is verified from the root to the intermediate to your new cert.

Barracuda WAF and intermediate certificates October 4, 2010

Posted by jamesisaac in Uncategorized.

We purchased a Barracuda Web Application Firewall in order to protect our web servers from layer 7 attacks (SQL injection attacks, cross-site-scripting, etc.). The WAF works just like a network firewall in that you define an inside server (the “real” server), an outside ip address that it will respond to (the “virtual ip”), and the WAF then receives all traffic to that public ip, inspects it, and passes it along to the inside server.

The head-scratching question that I had during my research was this: if the inside server is presenting an SSL service, how does the WAF inspect the traffic? The answer is pretty straightforward: you move the certificate from the server to the WAF. It then unpacks the encrypted packets, inspects them, and forwards them to the server. You can specify whether you want to forward the traffic encrypted, in which case the WAF re-encrypts the packet and sends it to the server, or unencrypted (on port 80). Doing this reduces the load on the server since it doesn’t have to decrypt and encrypt the packets, but it does expose the traffic to your internal network in an unencrypted state.

Straightforward, I said, right? Well, yes. Except for the parts which aren’t straightforward, which is nearly everything in this case.

The Barracuda WAF accepts certificates in .PEM format, which our third-party CA doesn't provide. There's some documentation on how to get the certificate from your server, but it's none too clear. In our case, we are using IIS, so I went to the IIS Admin applet and exported the certificate. It exports in a .pfx format, which can't be read directly by the Barracuda WAF. You need to use OpenSSL to change the format of the key from .pfx to .pem, and also to extract a non-encrypted key-only version, as the two go together. This isn't listed in the documentation, by the way; only some serious Google-mojo turned up the interesting documents. Here's Scott Lowe's documentation on how to do it; you should end up with a .pem file containing the cert itself and then a .key file containing the key.
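
The OpenSSL part boils down to two commands; a sketch with placeholder filenames:

    # extract the certificate from the IIS .pfx export as PEM
    openssl pkcs12 -in exported.pfx -clcerts -nokeys -out server.pem
    # extract the private key, unencrypted (guard this file)
    openssl pkcs12 -in exported.pfx -nocerts -nodes -out server.key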

Now upload the certificate into the Barracuda, create the https service, and away we go, right? No. Because earlier this year Thawte (as well as most other CAs) changed from having a single root certificate to using a chain of intermediary certificates. This improves security, so I guess it's a good thing. But it does make installing the certificate more difficult. The Barracuda WAF includes a browse-box selector for intermediate certificates, so I went to Thawte's root certificate page, downloaded the relevant certs, and included them. No dice. The client browser claims that the certificate is untrusted because the entire chain isn't listed.

By the way, here’s a really good site to test your certificates after they are installed on your server; it provided some helpful error messages in my troubleshooting.

Finally, after more Google-mojo, I stumbled upon the answer: a certificate bundle. Unlike Matt’s problem which he detailed on the Standalone Sysadmin (great blog, by the way), I actually had the reverse problem – the individual intermediate certs didn’t work, and only the bundle saved me. Where to find it? Back to Thawte’s knowledgebase, and I downloaded the bundle listed for Apache. Then back to the Barracuda WAF, upload, and… yes! Verified!
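
If your CA doesn't publish a ready-made bundle, you can build one yourself: a PEM bundle is just the intermediate certificates concatenated, issuing CA first and root last. A sketch with hypothetical filenames:

    cat thawte_ssl_ca.pem thawte_primary_root.pem > thawte_bundle.pem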

It seems to me that this whole process should be much, much easier to deal with, but right now there are too many standards and formats and methods. Hopefully this will help anyone else with a Barracuda Web Application Firewall who is stymied, like I was, and is looking for some assistance.

Cisco Dropped Calls fix January 22, 2010

Posted by jamesisaac in Uncategorized.

All of my time has been occupied recently with a pretty significant upgrade of Cisco Call Manager from 3.3 to 7.1, skipping all of the in-between steps. Due to hardware EOL and new requirements, we purchased five new servers for our various bits and pieces. This gave us the advantage of building our new system side-by-side, dropping it in on a segmented network, and actually being able to test our call routing by having our telco provide a test number on the new PRI. The actual change-over simply involved calling the telco and having them switch our number from one circuit to the other, and us unplugging the old master Call Manager server from the network and directing phones to the new Call Manager server.

One issue that we did run into post-install had to do with dropped calls. With the way our system is designed, our phones are in our offices while the PRI itself is at the data center. Thus any call quality issues have to be debugged through several routers and switches.

The symptoms were hard to pin down – we got reports that calls were suddenly disconnecting with a fast busy. We looked at QoS stats, line utilization, and PRI slips, and all the while tried to reproduce the errors. Finally we did, by happy accident. We had two calls conferenced together through the outside lines, both on speakerphone. After playing with the echo and delay for a bit, we muted both phones – and within 60 seconds one failed with a fast busy. The lightbulb went on! After making a few more test calls, we realized that the calls drop when both phones are muted or both parties are silent for a period of time. Some googling brought this result:

https://supportforums.cisco.com/thread/298440.pdf;jsessionid=071EDC2C7CBD6F0A1DB040E0100AD3FB.node0

And the fix, clipped from the page:

parshah, on the Cisco support forums ("Re: Call Manager – Call Disconnect", Jun 8, 2009):

Hi,

For the disconnect issue with external calls, looks like either the GW or PSTN disconnects the call when there is silence in the RTP.

Can you check if you have

no mgcp timer receive-rtcp

on the MGCP gateway. If not, can you add it and see if that helps.

And in fact, that was the solution. We added the “no mgcp timer receive-rtcp” on our PRI gateway router, and that fixed the dropped calls.
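
For the record, applying it is a one-liner in global configuration on the gateway; a sketch with prompts abbreviated:

    gateway# configure terminal
    gateway(config)# no mgcp timer receive-rtcp
    gateway(config)# end
    gateway# write memory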

vSphere Essentials licensing and vmotion rant September 9, 2009

Posted by jamesisaac in Uncategorized.

Apparently I almost made up a new product a few days ago.

After running my vSphere servers in eval mode for 50+ days, I was getting ready to apply the licenses to them and the vCenter server which was managing them. I logged into my VMWare account, found the license code, and applied it to the vSphere servers. They happily said “O Hai, I’m now a vSphere Essentials Plus server!”

So then I used the same code on my vCenter server, and it said, “O Noes! You can’t use that code here.”

I was confused, as I thought that was the only code I had. It turns out that I was wrong, as there was a code listed in my VMWare account that said “vCenter Essentials”. I thought that was the wrong thing, as I had purchased Essentials Plus. Time to chat with support.

The support guy looks up my account and gives me the vCenter Essentials keycode. “But that’s not Essentials Plus,” I respond. “I bought Essentials Plus.”

“Yeah, you’re correct… this will not allow you to use HA or DRS. You’d better talk to your reseller to find out why you have Essentials Plus vSphere but only Essentials vCenter.”

After further research, I have drawn this conclusion: there is no such thing as vCenter for Essentials Plus. There’s only vCenter Essentials. And after licensing the server, it does in fact say that it’s vCenter Essentials – but it does allow you to configure HA. Which is pretty much what I would expect, as HA is really a vSphere server feature, not so much a vCenter feature. You can’t configure DRS, as that fails with an improper license error message. So I guess the tech guy I was working with assumed (just like I did) that because you can buy vSphere Essentials Plus, you should also get a vCenter Essentials Plus. There’s no such thing.

Here's one of the things that bothers me about Essentials and Essentials Plus: vmotion is disabled (which I understand), but storage vmotion is also disabled. Now I seem to remember that back in the 3.5 days, storage vmotion was kind of an undocumented feature available only through the CLI, so a number of authors wrote plug-ins, either scripts or GUIs, and eventually a vCenter plugin. But storage vmotion didn't depend on a license; it just worked.

Now VMWare has integrated storage vmotion into vCenter directly, but turned it off for the low-end (SMB-target) systems. I understand that they see this as a feature worthy of purchasing the pricier versions to get, but it kinda sucks to lose what was a “free” feature by upgrading to 4.0.

Furthermore, the Essentials packages don't appear to have the "à la carte" pricing model (unlike their high-end brothers), so I can't just go out and buy a storage vmotion license. Grr. It just frustrates me because there's no technical reason why VMWare can't do it.

Storage vmotion is also one of those features that would appeal to the SMB market even more, as we are more apt to be dealing with low-end storage, where you’re juggling 1 TB on one IET OpenFiler, 500 GB on a Windows Storage Server, 500 GB on a NAS box, and so on… being able to seamlessly move servers between dissimilar storage media was a KILLER feature. I found it even more useful than vmotion, as I could relocate virtual drives to where the best i/o performance was. But what VMWare giveth, VMWare taketh away.

Unless someone’s writing a script…

TLS circuits and autonegotiation August 30, 2009

Posted by jamesisaac in Uncategorized.

Our main office and our data center are linked via fiber provided by Cox Communications; we’re on their Metro WAN fiber ring around the city. The service is called “TLS” which apparently stands for Transparent LAN Service circuit. They have gear in our telco room and hand us an ethernet cable; at the data center we have a matching ethernet cable that drops from the raceway into the rack. From our equipment’s perspective, the cable is functionally equivalent to a straight-through ethernet cable.

For all practical purposes, we could just slap a switch into the datacenter rack, plug the TLS cable into it, plug the other end of the TLS cable into our main office switch stack, and have the data center just be another distribution point attached to our switches. This is not really a good way to go, because the TLS segment adds about 2-3 ms of latency, plus it is bandwidth-constrained to 10 Mb/s. Prior to our routers arriving, I did in fact run this very scenario, and it worked pretty well. I could tell that we were leaking broadcast traffic across the fiber, plus our network analysis tools would complain about unexpected latency.

Once the routers arrived, I set them up back-to-back in our server room using a crossover cable between the FastEthernet0/1 ports (fe0/0 is “inside”, fe0/1 is “outside” on both routers), and made sure that my ip changes would go smoothly. The plan was to take one router over to the datacenter, mount it, then have a guy at the main office and another one at the datacenter, and at a coordinated time, unplug the TLS cable from the respective switches and plug into the routers’ “outside” ports. They worked perfectly back-to-back, so everything should work fine.

Except that it didn’t.

The symptom is this: line is up, protocol is down on both routers.

Well, crap. What’s going on? When the TLS is plugged directly into the switches, it works great. Plugged into router ports, though, it doesn’t work. Up/Down usually means there’s something physically wrong with the line, but I know that both the TLS line and the interfaces work.

So, call the Cox engineers. “Oh, yeah, that’ll happen. You’re not using the right encapsulation. We’ll have our engineer call you back.” Hmm… encapsulation… HDLC? PPP? L2TPv3? I’m not doing any encapsulation on the switches, so why does it work there? This doesn’t make any sense.

Call Cisco. The helpful tech looks at my configs and they look correct. Perhaps it’s something with the interface negotiation. I try autonegotiate. 100/Full. 100/Half. 10/Full. 10/Half. Nothing.

Cox calls back. Ok, it’s not an encapsulation issue per se. We need to be doing 802.1q encapsulation. Their gear is a “layer 1 transport”, and we need to use subinterfaces with vlan tags on our traffic. Ok, great… except I’m already doing that, and I KNOW that my vlans work as they were tested in the back-to-back crossover cable scenario.
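
For anyone following along, the subinterface configuration looks roughly like this; the VLAN number and addressing here are placeholders:

    interface FastEthernet0/1
     no ip address
    !
    interface FastEthernet0/1.10
     encapsulation dot1Q 10
     ip address 192.0.2.1 255.255.255.252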

The Cox engineer calls to the SOC to get someone on the line who can look at the TLS port itself. He calls back an hour later with interesting news. “The line at the datacenter is down but the one at your office is up.” Strange, it looks down to me. “Well, it’s up physically at the datacenter but since it’s down at your office the link can’t come up so it looks down at both places.” He suggests we check the negotiation. “What are you set for?” “Auto/auto.” “Me too.” “Let’s try hard-coding it.”

And the line comes up.

We bring it down, check it again, bring it up again.

GAAAAHHHH!

Nothing wrong with the configurations, just two devices that didn’t want to negotiate with each other. Hard-code the settings, and they’re fine. But – hard-code on both sides, which means my equipment and the ISP’s equipment. That was the missing step here.
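
The actual fix is two lines per side once you know it; a sketch, assuming a 100 Mb/s full-duplex handoff (match whatever your circuit is provisioned at):

    interface FastEthernet0/1
     speed 100
     duplex full

The same settings have to go on the provider's port, which is the part you can't do yourself.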

vmware tools and msvcp71.dll August 22, 2009

Posted by jamesisaac in Uncategorized.

One aspect of this data center move is moving servers from our VMWare 3.5 stack to our vSphere 4.0 stack. (By the way, what do you call a group of VMWare hosts? A farm? A cluster? A data center? I usually refer to it as the stack.) Our office and data center are separated and even though there’s a fiber link between them, I get faster bandwidth by copying the virtual machines to a USB drive, transporting the disk across town to the data center, plugging it in to a server at the data center, and uploading the vm image to the 4.0 stack. This process has worked reliably for probably 20 virtual machines, until I ran into a brand new error tonight.

The server in question was a Windows 2003 SP2 server with SQL 2000 installed. The usual process is to upload the vm folder to a datastore, register the machine, then turn it on. Due to networking changes, I log in and change the ip address, then install the vmware tools, reboot, then shut down, migrate the hardware to VM type 7, then power it back up.

This time, after upgrading I noticed a new error on boot – “at least one service could not be started.” An investigation into the event log was in order. Red Flag #2 was an error that popped up when starting the event log – something about msvcp71.dll not being available. I assumed that this was due to some software which a developer had probably installed which plugged into the mmc framework. Browsing the event log, I found that the service which failed to start was MS SQL. That’s a problem.

I tried to start SQL from the services mmc – no good. Rebooted. Still no good. Checked the SQL logs and the last time SQL had run was – interestingly enough – after the migration. What changed? IP address? Why would that cause a problem?

I tried some other SQL tools, such as query analyzer, and ran into the msvcp71.dll error again. Ah ha – if these tools can’t run, then it’s likely that SQL can’t run either. What is this msvcp71.dll anyway? I found a few references in Google to spyware and system scans and missing dll’s. (One helpfully suggested that the user may have deleted the file themselves, so check the recycle bin to see if it’s there. Like that ever happens.)

I narrowed my search to the one thing that I knew had changed – the VMWare Tools installation – and sure enough, there it was. The vSphere 4.0 tools installer deletes msvcp71.dll.

What?

Why?

I browsed to another server, found the dll, copied it to \winnt\system32, rebooted, and SQL came right up. The server is back in business.
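
The repair amounts to a single copy from any machine that still has the file; a sketch with a hypothetical server name:

    rem run on the broken server; pull the dll from a healthy one
    copy \\goodserver\c$\winnt\system32\msvcp71.dll c:\winnt\system32\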

But why would this ever happen? What caused the VMWare developers to think it was a good idea to delete a file which (according to the searches) lots of other programs depend on? I’ve no idea.

Anyway, file this away in your mental checklist – if you upgrade to vSphere, verify that things work before you upgrade the VMWare tools, and then upgrade and check again. If your OS has a problem, odds are that this guy is responsible.

HP NIC nonsense August 17, 2009

Posted by jamesisaac in Uncategorized.

A frustrating hour spent tonight on something that should have taken five minutes… and the issue isn’t even solved, yet.

I have an HP DL580 G2 with the standard motherboard Broadcom gigabit NICs, NC7782 in HP terminology. The NICs are teamed for fault-tolerance. In order to set up our backup-to-disk connection, I planned on adding a virtual NIC by using the VLAN feature of the HP network configuration utility and converting my “untagged” physical network ports to “tagged” VLAN ports, with one virtual NIC retaining the same ip information as the teamed adapter previously had, and the second virtual NIC acquiring an address in the backup network scope. Sounds pretty simple, right? Just go into HP network config, add the VLAN interfaces, and address them, and you’re done.

Now, from previous experience, I know that adding the VLAN configuration should create a brand-new virtual NIC with no ip addressing, as well as disabling the current NICs, so this isn't a job to be done remotely (unless you have the RIB all configured. Actually, come to think of it, I could have done this through our KVM-over-IP, but it turns out that it was a good thing that I didn't). So, a trip to the data center was in order.

Five minutes after arrival, I have the VLAN 802.1q information entered and hit “apply” in the HP network configuration utility. And wait.

And wait.

10 minutes later, the HP network configuration utility is still on “configuring adapters, please wait… this may take a few seconds per adapter” or some such verbiage. Clearly something is not going well.

The Network control panel shows that my physical NICs are now marked as “disabled” and what used to be the “Teamed” virtual adapter has turned into one of the VLAN-assigned adapters, but is also disabled. A fourth NIC has arrived – the second VLAN-assigned adapter. It is enabled but unplugged, which is odd because nothing has changed with the cabling. I’m guessing that it is unplugged because the physical NICs are disabled.

I use Task Manager to kill the HP network configuration utility. I can’t enable the network adapters. They show as disabled, and when I right-click and choose “enable”, Windows tells me that they are enabling, but they stay disabled.

Reboot.

Seven minutes later, I’m looking at the “Preparing network connections.” Crap. This stays stuck for about three minutes. I was about ready to power off and boot into safe mode. Log in, find the adapters – they’re still disabled. Not good. Device Manager shows them as NC7782 with little red X’s through them. Why? What happened?

The HP network configuration utility shows NO adapters – probably because, again, they're disabled. But the HP network configuration utility was the tool that disabled them! So now I'm stuck.

I log into a different machine, go to http://itrc.hp.com, and search for drivers. I download the latest Broadcom driver and the HP ncu onto a USB thumb drive that I happen to have in the data center kit box (note: always have a box of spare bits and pieces tucked away in your rack). I load the drivers and the HP ncu, and reboot.

And wait.

10 minutes later, I’m again faced with “Preparing network connections…” and after logging in, I have disabled network adapters – they’re using the latest versions of the drivers, but still disabled. Crap! Now what? Do I have to regedit something to enable these drivers? I look around and find a post that suggests uninstalling the drivers and rebooting, then going through a whole routine of flashing BIOS, firmware, and reloading drivers. I removed one NIC, uninstalled the driver from the other one, then – on a whim – rescanned the hardware in device manager. What’s this? New NIC detected?

Huge sigh of relief. Windows, all by itself, sees the NICs as new devices and loads the HP driver and enables the network cards. I start up the HP ncu (actually, I seriously considered just leaving it all alone and not dealing with ncu anymore), team the adapters, cross my fingers as the pNICs are disabled and the new Teamed interface is created, then everything is enabled again. I set the ip address and we’re back in business.

I’m still at a loss to explain why the server disabled the NICs and could not re-enable them. I’ve used VLANs through the HP ncu before, and it is usually a straight-ahead configuration. My guess is that this is an old server, probably not up to par with regards to both Windows patches and HP driver updates, and something in the OS / driver stack conflicted with the changes I was attempting to make, which left it in a disabled state.

Lessons learned:

  • Nothing is ever as easy as you think it should be.
  • Don’t do maintenance on a customer-facing server on Sunday night.
  • Purchase a new NIC for the backup network instead of trying to “virtualize” the existing NICs by adding VLAN tags.