If you wake up and see IPs you support routing to China, it’s going to be a rough day. Today was a rough day.
- At 4:35AM EDT my network monitoring system alarmed that a client’s site-to-site VPN connection was down between the client’s office in NC and our data center in Atlanta, GA.
- At ~6:15AM EDT I woke up and saw the alarm. I immediately began testing and collecting data. It quickly became obvious that this was a routing issue. Connectivity from some networks (Road Runner and several others) to our client’s data center IPs was broken. Curiously – traffic from Road Runner / Time Warner Cable was routing out to a router in Los Angeles, CA, then dying.
- In order to open trouble tickets for a routing issue, you need trace routes. So I collected several, showing networks that worked and ones that did not – in both directions. Then I opened tickets with Road Runner / Time Warner Cable (the client’s ISP) and the data center (which provides us IPs as part of a BGP mix of bandwidth they maintain and optimize).
- After some additional troubleshooting while waiting to hear back on my trouble tickets, I noticed that a new BGP advertisement which included our IPs was published at nearly the exact same time that the site-to-site VPN failed. I’ve sanitized the screenshot to protect the innocent (my client) and the guilty (a Chinese ISP). The red blocks contain IP details I’ve intentionally removed.
- After some troubleshooting we were able to determine that a Chinese ISP had published a bogus BGP advertisement. The Chinese ISP was wrongly advertising a /20 block of IPs (which included some of ours). They actually own a /20 that is one character different from the block they advertised. It appears they simply made a typo somewhere and caused all of this.
- Our data center NOC team reached out to the Chinese ISP NOC to see if they could get them to remove this wrong advertisement.
- At 10:25AM EDT our monitoring system recorded the site-to-site VPN coming back online.
- When I arrived at the client site (where I was scheduled to be today anyway) – I tested and the bogus BGP advertisement had been removed.
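To make the typo scenario concrete, here is a minimal Python sketch using the standard `ipaddress` module. All of the prefixes and the address are hypothetical stand-ins from the RFC 2544 benchmarking range (198.18.0.0/15), since the real ones are sanitized. The post also doesn't say how long the data center's legitimate prefix was, so the /19 below is purely an assumption, used to show how a more specific bogus /20 would win a longest-prefix match.

```python
import ipaddress

# All values below are hypothetical, taken from the RFC 2544 benchmarking
# range; they stand in for the sanitized prefixes in the incident.
OWNED = ipaddress.ip_network("198.18.32.0/20")       # what the ISP meant to announce
ANNOUNCED = ipaddress.ip_network("198.19.32.0/20")   # what it announced (one character off)
LEGITIMATE = ipaddress.ip_network("198.19.32.0/19")  # assumed data center announcement
OUR_ADDRESS = ipaddress.ip_address("198.19.40.10")   # assumed address of ours


def characters_changed(a, b):
    """Count the character positions at which two strings differ."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def longest_match(routes, address):
    """Return the most specific route covering the address, or None."""
    candidates = [route for route in routes if address in route]
    return max(candidates, key=lambda route: route.prefixlen, default=None)


if __name__ == "__main__":
    print("typo size:", characters_changed(str(OWNED), str(ANNOUNCED)))
    print("bogus prefix covers us:", OUR_ADDRESS in ANNOUNCED)
    print("route chosen:", longest_match([LEGITIMATE, ANNOUNCED], OUR_ADDRESS))
```

Under these assumptions a single mistyped character is enough to put our address inside someone else's announcement, and any network that accepts the bogus route sends our traffic toward the wrong origin.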
So – what is the takeaway from this? What can be learned? Here are a few things – several of which I already knew intellectually and now know at more of a gut level.
- False BGP advertisements can create a real mess. I knew this previously – but it had never impacted me as harshly as it did today. Want to read more on how bad this can be? Check out the BGPMON blog here: http://www.bgpmon.net/blog/.
- It seems some ISPs filter or manage BGP more carefully than others. For example – Level 3 never seemed to be affected by this bogus BGP update. Time Warner / Road Runner apparently accepted it nearly immediately. I’m no BGP guru at all – but wow, improvement is needed here.
- In the future before I open a routing issue ticket, I’ll take a look not only at trace routes, but also at BGP advertisements. Huge thanks to Hurricane Electric for a great looking glass tool that ultimately helped me get to the bottom of this.
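The "check the BGP advertisements first" habit can be sketched as a small Python check. This is only an illustration, not how any ISP or looking glass actually works: the prefix, the AS numbers (from the documentation range 64496–64511), and the idea of copying observed announcements by hand from a looking glass are all assumptions.

```python
import ipaddress

# Hypothetical table: each prefix we care about and the AS number that is
# supposed to originate it (documentation ASN, not a real network).
EXPECTED_ORIGINS = {
    ipaddress.ip_network("198.19.32.0/19"): 64496,
}


def check_announcement(prefix_text, origin_asn, expected=EXPECTED_ORIGINS):
    """Classify an observed announcement as 'ok', 'suspicious' or 'unrelated'."""
    prefix = ipaddress.ip_network(prefix_text)
    for ours, asn in expected.items():
        # Any announcement overlapping our space must come from our origin AS.
        if prefix.overlaps(ours):
            return "ok" if origin_asn == asn else "suspicious"
    return "unrelated"


if __name__ == "__main__":
    # Announcements as they might be copied by hand from a looking glass.
    observed = [("198.19.32.0/19", 64496), ("198.19.32.0/20", 64511)]
    for prefix_text, asn in observed:
        print(prefix_text, "from AS", asn, "->", check_announcement(prefix_text, asn))
```

A "suspicious" result alongside the trace routes would have pointed at the bogus advertisement right away, and it is the kind of evidence worth attaching to a routing ticket.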