
Cisco Call Manager database replication failure due to incorrect MTU April 28, 2012

Posted by jamesisaac in Uncategorized.

This weekend we changed from one WAN ethernet circuit to a different one for our metro ethernet. We have a Call Manager publisher on one side and a subscriber on the other side. After changing the circuit, we discovered that we couldn’t dial out from the subscriber side to anything other than local (in the building) phones. What’s going on? The servers were unchanged; we used the same routers and interfaces for the new circuit. The only thing that happened was unplugging one cable and plugging in another one. Why would the phone routing fail in such a way?

Here’s the other interesting bit of business: during our troubleshooting, we rebooted the local (subscriber) call manager. We then tested the phones and found that we could dial out! Great, problem solved. Except that 10 minutes later we couldn’t dial out again. Putting two and two together, we realized that while the subscriber was down and the phones were registered to the publisher, we could dial out. Once the subscriber finished rebooting and the phones re-registered to the local CCM server, our calls failed again. Hmm. It didn’t seem to be a routing problem, because all of our ping testing was successful. And when the phone is registered to the publisher, everything works! So clearly it can’t be a network problem.

At this point TAC transferred us to the database replication team and we started getting packet dumps and analyzing them with Wireshark. The smoking gun appeared. During the database replication from the publisher to the subscriber, there’s a large packet containing the CCM certificate. This packet is tagged as “do not fragment”. It was not reaching the subscriber, even though other packets around it were. Thus – the MTU.
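A faster way to spot this kind of failure is to ping across the circuit with “do not fragment” set and increasing sizes, since default pings are small and will pass even when full-size packets do not. The following is only a rough sketch of that idea, not what TAC ran: it assumes a Linux host with the iputils `ping` (the `-M do` option), and the function names `df_ping` and `largest_passing_mtu` are made up for illustration.

```python
import subprocess

# 20-byte IPv4 header + 8-byte ICMP header sit on top of the ping payload.
IP_ICMP_OVERHEAD = 28


def df_ping(host, payload):
    """Send one ping with DF set. Assumes Linux iputils ping syntax."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_passing_mtu(probe, low=576, high=1500):
    """Binary-search the largest IP packet size (in bytes) that gets through.

    probe(payload) must return True when a DF ping with that payload succeeds.
    Returns None if even the smallest size fails.
    """
    if not probe(low - IP_ICMP_OVERHEAD):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid - IP_ICMP_OVERHEAD):
            low = mid       # mid fits, so the answer is mid or larger
        else:
            high = mid - 1  # mid is too big
    return low
```

Run from a machine on one side of the circuit, it would be called as `largest_passing_mtu(lambda size: df_ping("192.0.2.10", size))`, where the address is a placeholder for a host on the far side. A result below 1500 means full-size packets are not making it across.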

We changed the MTU on both servers to 1400 and retested the database replication. The network tests immediately passed, and the replication began. Once the pub and sub were in sync, our calls outside the building were successful.

So why would the MTU change? My guess is that our telco provisioned the replacement ethernet circuit with their own VLAN tags on it, to run multiple companies’ traffic over the same circuit, which results in a smaller total frame available to us. Plus we had our own VLAN tags inserted (VLAN within VLAN). A large packet with the “do not fragment” flag set will get dropped instead of fragmented. We had to change the server behavior to build smaller packets, and once we did that, everything worked.
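To put rough numbers on that guess: each 802.1Q VLAN tag adds 4 bytes to a frame. The sketch below assumes the carrier did not raise its maximum frame size to make room for the tags, which is speculation, since the real circuit parameters were never confirmed. The names `effective_mtu` and `forward` are illustrative only, and `forward` models how a routed hop treats an oversize packet (a pure layer 2 device would simply discard the oversize frame).

```python
ETHERNET_IP_MTU = 1500  # standard IP MTU on untagged Ethernet
DOT1Q_TAG_BYTES = 4     # size of one 802.1Q VLAN tag


def effective_mtu(unaccommodated_tags, base=ETHERNET_IP_MTU):
    """IP MTU left over when the path has no extra room for VLAN tags."""
    return base - DOT1Q_TAG_BYTES * unaccommodated_tags


def forward(packet_size, dont_fragment, link_mtu):
    """What happens to an IP packet at a hop with the given MTU."""
    if packet_size <= link_mtu:
        return "forwarded"
    return "dropped" if dont_fragment else "fragmented"


one_tag = effective_mtu(1)   # 1496 bytes
two_tags = effective_mtu(2)  # 1492 bytes

print(forward(1500, True, two_tags))   # full-size DF packet: dropped
print(forward(1500, False, two_tags))  # same size without DF: fragmented
print(forward(1400, True, two_tags))   # after setting MTU 1400: forwarded
```

Under these assumptions, a server MTU of 1400 leaves plenty of headroom below either figure, which is consistent with replication starting as soon as we made the change.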