09/24/2014

Support Files: Network Nettle

Alertra has thousands of customers, but we only hear from a small percentage of them. Just the other day I looked at our customer list and found a customer who has been with us for more than 10 years, and none of us can recall them ever opening a ticket. That doesn't mean the system admins have nothing to do, though. We have a full day every day, including weekends and some vacations, answering your questions. We thought it might be fun and helpful to share some of them. All support requests and answers have had identifying information removed so that our lawyer can stay at the bar spending our money instead of having to come into the office. Let's dive in.

sreqA

Uh oh, it looks like we've told someone their site is down when they don't think it was. Worse, apparently we've been vacillating between saying it is up and down for an hour and a half. I don't want to say we're perfect, but we have been doing this for 14 years. It happens, ever so occasionally, that an update to our system breaks on a few sites. I went back to the server room and checked on the hamsters and they were the same hamsters we had the day before. I put up a picture on the monitor so they would be properly motivated.

kliban-catsa-1

Hamster focus intensifies

So it wasn't an update on our part that caused this outage. Alrighty then, I'm pretty confident this is a legit outage. Our system captures the packets from each check, so that's where we get all CSI: Nerd to find out what happened. Our Orlando station had this interesting conversation with the customer's server:

sreqA-pktcap

"lol, whut?", Orlando conjectured

Orlando didn't know what to make of it. Orlando opens the TCP connection with a SYN packet, then everything kind of goes nuts. A SYN,ACK is received to officially open the connection, but things never recover and the connection eventually times out. Las Vegas and Rotterdam were similarly confused. Shanghai got through to the server with only minor trouble:

sreqA-pktcap2

The outage then resumed with different mixes of stations, some reporting the site down and others saying it was up. This kind of problem is usually explained by a misbehaving bit of network equipment, either at the webserver or upstream from it. If the problem is upstream, sometimes customers won't even realize there is a problem since they can connect just fine. A response was prepared explaining the outage to the customer. Almost as soon as the "Send" button on the reply was clicked, the worst happened.

sreqB

Another customer. Another unexplained outage during the same time frame as the last. This is a verbatim transcript of the conversation I had with my boss:

me: I checked the whois on both target webservers' IPs. They are registered to different companies.
boss: hmmmm.
me: They are both in the U.S.
boss: hmmmm.
me: They are separated by several states and hundreds of miles.
boss: I was hoping you wouldn't say that.

That's "boss" for "I really didn't want to have to find a new server admin this week.":

me: But I tracerouted both customers' IPs and both pass through ########'s network near the last hop.
boss: I'm hitting the links.
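For the curious, the comparison I did by eyeball can be sketched in a few lines of Python. This is only an illustration, not our tooling: the function name `shared_tail_hops`, the `tail` parameter, and every address below are made up for the example (the addresses come from ranges reserved for documentation and private use), and each path is assumed to be a plain list of hop IPs already pulled out of traceroute output.

```python
def shared_tail_hops(path_a, path_b, tail=3):
    """Return hops that appear near the end of both paths.

    Only the last `tail` hops of each path are compared, because the
    first few hops belong to the monitoring station's own network and
    would match for every target.
    """
    tail_b = set(path_b[-tail:])
    return [hop for hop in path_a[-tail:] if hop in tail_b]


# Hypothetical hop lists for two unrelated targets (not real data).
path_a = ["10.0.0.1", "10.0.0.254", "198.51.100.1",
          "198.51.100.7", "198.51.100.20", "192.0.2.10"]
path_b = ["10.0.0.1", "10.0.0.254", "198.51.100.1",
          "198.51.100.7", "198.51.100.44", "203.0.113.200"]

print(shared_tail_hops(path_a, path_b))  # ['198.51.100.7']
```

A hop that shows up near the end of both paths is a candidate for the shared piece of equipment, nothing more. It still takes a human to decide whether it is actually the culprit.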

I can't show you the traceroute results because our lawyer is at Eskimo Joe's and just ordered another drink so he won't be available to take your call for some time. However, it seems that an upstream router shared by both customers is in fact the problem. The first customer responded to us:

sreqA2

Outages like this are hard for our customers to see. If they are on the same local network as the server, it will behave normally. If they are remote, but don't happen to try the server during the few minutes the outage is going on, then it looks like a false report. But Alertra's network is always watching. Our stations are located all over the world and on different ISPs with differing paths to the Internet backbone. If three or more of our stations say a site is down, it is down. If it is not down for you, it is down for your customers.
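The "three or more stations" rule is simple enough to sketch in Python. To be clear, this is a toy illustration and not Alertra's actual code: the only thing taken from this post is the threshold of three, while the name `site_is_down` and the shape of the results dictionary are assumptions made for the example.

```python
DOWN_QUORUM = 3  # "three or more" stations, per the rule above


def site_is_down(station_results, quorum=DOWN_QUORUM):
    """Decide whether a site counts as down.

    `station_results` maps a station name to True if that station
    reached the site and False if its check failed (assumed format).
    """
    failures = sum(1 for reachable in station_results.values() if not reachable)
    return failures >= quorum


# Roughly the situation from the first ticket: three stations failing,
# one getting through.
results = {"Orlando": False, "Las Vegas": False, "Rotterdam": False, "Shanghai": True}
print(site_is_down(results))  # True
```

Requiring agreement from several stations on different networks is what keeps one station's local hiccup from turning into an alert in your inbox.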