
TLS circuits and autonegotiation August 30, 2009

Posted by jamesisaac in Uncategorized.

Our main office and our data center are linked via fiber provided by Cox Communications; we’re on their Metro WAN fiber ring around the city. The service is called a “TLS” circuit, which apparently stands for Transparent LAN Service. Cox has gear in our telco room and hands us an Ethernet cable; at the data center we have a matching Ethernet cable that drops from the raceway into the rack. From our equipment’s perspective, the circuit behaves like a single straight-through Ethernet cable between the two sites.

For all practical purposes, we could just slap a switch into the data center rack, plug the TLS cable into it, plug the other end of the TLS circuit into our main office switch stack, and have the data center act as just another distribution point hanging off our switches. That’s not really a good way to go, though: the TLS segment adds about 2-3 ms of latency, and it is bandwidth-constrained to 10 Mb/s. Before our routers arrived I did in fact run exactly this setup, and it worked pretty well, although I could tell we were leaking broadcast traffic across the fiber, and our network analysis tools complained about unexpected latency.

Once the routers arrived, I set them up back-to-back in our server room using a crossover cable between the FastEthernet0/1 ports (fe0/0 is “inside”, fe0/1 is “outside” on both routers), and made sure that my IP changes would go smoothly. The plan was to take one router over to the data center and mount it, then have one guy at the main office and another at the data center, and at a coordinated time unplug the TLS cable from each site’s switch and plug it into that router’s “outside” port. The routers worked perfectly back-to-back, so everything should have worked fine.

Except that it didn’t.

The symptom is this: line is up, protocol is down on both routers.
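
On the router console, that state reads roughly like this (interface name taken from the back-to-back test above; the counters and the rest of the output are omitted):

    router# show interfaces FastEthernet0/1
    FastEthernet0/1 is up, line protocol is down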

Well, crap. What’s going on? When the TLS is plugged directly into the switches, it works great. Plugged into router ports, though, it doesn’t work. Up/Down usually means there’s something physically wrong with the line, but I know that both the TLS line and the interfaces work.

So, call the Cox engineers. “Oh, yeah, that’ll happen. You’re not using the right encapsulation. We’ll have our engineer call you back.” Hmm… encapsulation… HDLC? PPP? L2TPv3? I’m not doing any encapsulation on the switches, so why does it work there? This doesn’t make any sense.

Call Cisco. The helpful tech looks at my configs and they look correct. Perhaps it’s something with the interface negotiation. I try autonegotiate. 100/Full. 100/Half. 10/Full. 10/Half. Nothing.

Cox calls back. OK, it’s not an encapsulation issue per se; we need to be doing 802.1q encapsulation. Their gear is a “layer 1 transport”, and we need to use subinterfaces with VLAN tags on our traffic. OK, great… except I’m already doing that, and I KNOW that my VLANs work, because they were tested in the back-to-back crossover cable scenario.
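
For reference, the relevant piece of each router’s “outside” configuration looked something like the sketch below; the VLAN ID and addressing here are placeholders, not our real values:

    interface FastEthernet0/1
     no ip address
    !
    interface FastEthernet0/1.10
     description TLS circuit - tagged traffic to the other site
     encapsulation dot1Q 10
     ip address 192.0.2.1 255.255.255.252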

The Cox engineer calls the SOC to get someone on the line who can look at the TLS port itself. He calls back an hour later with interesting news. “The line at the datacenter is down but the one at your office is up.” Strange, it looks down to me. “Well, it’s up physically at the datacenter, but since it’s down at your office the link can’t come up, so it looks down at both places.” He suggests we check the negotiation. “What are you set for?” “Auto/auto.” “Me too.” “Let’s try hard-coding it.”

And the line comes up.

We bring it down, check it again, bring it up again.

GAAAAHHHH!

Nothing was wrong with the configurations; it was just two devices that didn’t want to negotiate with each other. Hard-code the settings and they’re fine. But you have to hard-code on both sides, which means both my equipment and the ISP’s equipment, and that was the missing step here.
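
On the Cisco side, hard-coding is just a couple of interface commands; a sketch follows, with the caveat that the speed and duplex values have to match whatever the carrier pins on their port:

    interface FastEthernet0/1
     ! pinning both values disables autonegotiation on this port;
     ! the carrier has to pin the matching values on their equipment
     speed 100
     duplex full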


vmware tools and msvcp71.dll August 22, 2009

Posted by jamesisaac in Uncategorized.
1 comment so far

One aspect of this data center move is moving servers from our VMware 3.5 stack to our vSphere 4.0 stack. (By the way, what do you call a group of VMware hosts? A farm? A cluster? A data center? I usually refer to it as the stack.) Our office and data center are in separate locations, and even though there’s a fiber link between them, I get better throughput by copying the virtual machines to a USB drive, transporting the disk across town to the data center, plugging it into a server there, and uploading the VM image to the 4.0 stack. This process has worked reliably for probably 20 virtual machines, until I ran into a brand new error tonight.

The server in question was a Windows 2003 SP2 server with SQL 2000 installed. The usual process is to upload the VM folder to a datastore, register the machine, and turn it on. Because of networking changes, I then log in and change the IP address, install the VMware Tools, reboot, shut down, upgrade the virtual hardware to version 7, and power it back up.
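
For what it’s worth, most of that sequence can be scripted with PowerCLI rather than clicked through in the client. This is only a sketch, assuming the VM folder has already been uploaded to the datastore; the host name, datastore path, and VM name are placeholders, and the guest has to finish shutting down before the hardware upgrade:

    # Sketch only: the host, datastore path, and VM name below are made up
    Connect-VIServer -Server esx01.example.com
    # Register the uploaded .vmx as a new VM and power it on
    $vm = New-VM -VMFilePath "[datastore1] dbserver/dbserver.vmx" -VMHost esx01.example.com
    Start-VM -VM $vm
    # ...log in and fix the IP address by hand here...
    # Upgrade VMware Tools, shut the guest down (wait for it to power off),
    # move to virtual hardware version 7, and power back up
    Update-Tools -VM $vm
    Shutdown-VMGuest -VM $vm -Confirm:$false
    Set-VM -VM $vm -Version v7 -Confirm:$false
    Start-VM -VM $vm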

This time, after upgrading, I noticed a new error on boot: “at least one service could not be started.” An investigation into the event log was in order. Red Flag #2 was an error that popped up when I opened the event log: something about msvcp71.dll not being available. I assumed this was due to some software a developer had installed that plugged into the MMC framework. Browsing the event log, I found that the service which failed to start was MS SQL. That’s a problem.

I tried to start SQL from the Services MMC; no good. Rebooted. Still no good. I checked the SQL logs, and the last time SQL had run was, interestingly enough, after the migration. What changed? The IP address? Why would that cause a problem?

I tried some other SQL tools, such as Query Analyzer, and ran into the msvcp71.dll error again. Aha: if these tools can’t run, then it’s likely that SQL can’t run either. What is this msvcp71.dll anyway? I found a few references in Google to spyware, system scans, and missing DLLs. (One helpfully suggested that the user may have deleted the file themselves, so check the Recycle Bin to see if it’s there. Like that ever happens.)

I narrowed my search to the one thing I knew had changed, the VMware Tools installation, and sure enough, there it was: the vSphere 4.0 tools installer deletes msvcp71.dll.

What?

Why?

I browsed to another server, found the DLL, copied it into \winnt\system32, rebooted, and SQL came right up. The server is back in business.
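
In other words, the fix is a one-line copy from any machine that still has the file, followed by a reboot; the source server name here is a placeholder:

    rem Pull msvcp71.dll from a healthy server's admin share
    copy \\GOODSERVER\c$\winnt\system32\msvcp71.dll C:\winnt\system32\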

But why would this ever happen? What made the VMware developers think it was a good idea to delete a file which, according to those searches, lots of other programs depend on? I’ve no idea.

Anyway, file this away in your mental checklist: if you upgrade to vSphere, verify that things work before you upgrade the VMware Tools, then upgrade and check again. If your OS suddenly has a problem, odds are that this missing DLL is responsible.

HP NIC nonsense August 17, 2009

Posted by jamesisaac in Uncategorized.

A frustrating hour spent tonight on something that should have taken five minutes… and the issue isn’t even solved yet.

I have an HP DL580 G2 with the standard onboard Broadcom gigabit NICs, NC7782 in HP terminology, and the NICs are teamed for fault tolerance. To set up our backup-to-disk connection, I planned to add a virtual NIC by using the VLAN feature of the HP network configuration utility: convert the “untagged” physical network ports to “tagged” VLAN ports, with one virtual NIC keeping the same IP information the teamed adapter previously had and a second virtual NIC getting an address in the backup network scope. Sounds pretty simple, right? Just go into the HP network config, add the VLAN interfaces, address them, and you’re done.

Now, from previous experience, I know that adding the VLAN configuration should create a brand-new virtual NIC with no IP addressing, as well as disable the current NICs, so this isn’t a job to be done remotely (unless you have the RIB all configured; actually, come to think of it, I could have done this through our KVM-over-IP, but it turns out it was a good thing that I didn’t). So, a trip to the data center was in order.

Five minutes after arrival, I have the VLAN 802.1q information entered and hit “apply” in the HP network configuration utility. And wait.

And wait.

10 minutes later, the HP network configuration utility is still on “configuring adapters, please wait… this may take a few seconds per adapter” or some such verbiage. Clearly something is not going well.

The Network control panel shows that my physical NICs are now marked as “disabled” and what used to be the “Teamed” virtual adapter has turned into one of the VLAN-assigned adapters, but is also disabled. A fourth NIC has arrived – the second VLAN-assigned adapter. It is enabled but unplugged, which is odd because nothing has changed with the cabling. I’m guessing that it is unplugged because the physical NICs are disabled.

I use Task Manager to kill the HP network configuration utility. I can’t enable the network adapters. They show as disabled, and when I right-click and choose “enable”, Windows tells me that they are enabling, but they stay disabled.
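
(For what it’s worth, the command-line equivalent of that right-click is netsh; the connection name below is a placeholder for whatever the adapter is called in Network Connections, and I didn’t get the chance to find out whether it fares any better once the HP utility has wedged things.)

    rem Attempt to re-enable a disabled connection from the command line
    netsh interface set interface name="Local Area Connection" admin=ENABLED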

Reboot.

Seven minutes later, I’m still looking at “Preparing network connections.” Crap. It stays stuck there for about three minutes, and I was about ready to power off and boot into safe mode. I log in and find the adapters; they’re still disabled. Not good. Device Manager shows them as NC7782 with little red X’s through them. Why? What happened?

The HP network configuration utility shows NO adapters, probably because, again, they’re disabled. But the HP network configuration utility was the tool that disabled them! So now I’m stuck.

I log into a different machine, go to http://itrc.hp.com, and search for drivers. I download the latest Broadcom driver and the HP ncu onto a USB thumb drive that I happen to have in the data center kit box (note: always have a box of spare bits and pieces tucked away in your rack). I load the drivers and the HP ncu, and reboot.

And wait.

10 minutes later, I’m again faced with “Preparing network connections…” and, after logging in, disabled network adapters; they’re running the latest versions of the drivers, but they’re still disabled. Crap! Now what? Do I have to regedit something to enable these drivers? I look around and find a post that suggests uninstalling the drivers and rebooting, then going through a whole routine of flashing the BIOS and firmware and reloading drivers. I removed one NIC in Device Manager, uninstalled the driver from the other one, then, on a whim, rescanned for hardware changes. What’s this? New NIC detected?

Huge sigh of relief. Windows, all by itself, sees the NICs as new devices, loads the HP driver, and enables the network cards. I start up the HP ncu (actually, I seriously considered just leaving it all alone and not dealing with the ncu anymore), team the adapters, cross my fingers as the pNICs are disabled and the new teamed interface is created, and then everything is enabled again. I set the IP address and we’re back in business.

I’m still at a loss to explain why the server disabled the NICs and could not re-enable them. I’ve used VLANs through the HP ncu before, and it is usually a straightforward configuration. My guess is that this is an old server, probably not up to date on either Windows patches or HP drivers, and something in the OS / driver stack conflicted with the changes I was attempting, which left it in a disabled state.

Lessons learned:

  • Nothing is ever as easy as you think it should be.
  • Don’t do maintenance on a customer-facing server on Sunday night.
  • Purchase a new NIC for the backup network instead of trying to “virtualize” the existing NICs by adding VLAN tags.

“cannot change the host configuration” August 14, 2009

Posted by jamesisaac in Uncategorized.
2 comments

Here’s a bizarre problem I encountered with vSphere and VMFS.

I have an iSCSI SAN presenting five LUNs to vSphere. I set up the first server, added the LUNs, formatted and named them, and all was well. Then I added the second server and could only add two of the five LUNs. With the other three, I could see the LUN in the “add storage” configuration page, but after going through the wizard, vSphere errored out with “cannot change the host configuration”. There’s not a lot of documentation on this error, but I found the secret here: it appears to be a bug in Virtual Center.

The solution is to use the vSphere client to connect directly to the host rather than to Virtual Center, and add the LUNs there. It works perfectly, and they show up in Virtual Center as you add them.