If you wake up and see IPs you support routing to China, it’s going to be a rough day.

If you wake up and see IPs you support routing to China, it’s going to be a rough day.  Today – was a rough day.

  • At 4:35AM EDT my network monitoring system alarmed that a client’s site-to-site VPN connection was down between the client’s office in NC and our data center in Atlanta, GA.
  • At ~ 6:15AM EDT I woke up and saw the alarm.  I immediately began testing and collecting data.  It quickly became obvious that this was a routing issue.  Connectivity from some networks (Road Runner and several others) to our client’s data center IPs was broken.  Curiously – traffic from Road Runner / Time Warner Cable was routing out to a router in Los Angeles, CA and then dying.
  • In order to open trouble tickets for a routing issue, you need trace routes.  So I collected several, showing networks that worked and ones that did not – in both directions.  Then I opened tickets with Road Runner / Time Warner Cable (the client’s ISP) and the data center (which provides us IPs as part of a BGP mix of bandwidth they maintain and optimize).
  • After some additional troubleshooting while waiting to hear back on my trouble tickets, I noticed that a new BGP advertisement which included our IPs had been published at nearly the exact same time that the site-to-site VPN failed.  I’ve sanitized the screenshot to protect the innocent (my client) and the guilty (a Chinese ISP).  The red blocks cover IP details I’ve intentionally removed.
    [Screenshot: bgp_update]
  • After some more digging we were able to determine that a Chinese ISP had published a bogus BGP advertisement.  The Chinese ISP was wrongly advertising a /20 block of IPs (which included some of ours).  They actually own a /20 that is one character different from the block they advertised.  It appears they simply made a typo somewhere and caused all of this.
  • Our data center NOC team reached out to the Chinese ISP NOC to see if they could get them to remove this wrong advertisement.
  • At 10:25AM EDT our monitoring system recorded the site-to-site VPN coming back online.
  • When I arrived at the client site (where I was scheduled to be today anyway), I tested and confirmed that the bogus BGP advertisement had been removed.
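
To make the overlap concrete, here is a minimal Python sketch of the check I was effectively doing by eye: does a given advertised prefix cover any of the IPs I care about?  The prefix and addresses below are placeholders built around documentation ranges, not the real (sanitized) values from this incident, and the function name is my own invention for illustration.

```python
import ipaddress


def addresses_covered(advertised_prefix, addresses):
    """Return the addresses that fall inside the advertised prefix."""
    network = ipaddress.ip_network(advertised_prefix)
    return [a for a in addresses if ipaddress.ip_address(a) in network]


# Hypothetical values for illustration only: a /20 that happens to
# contain the 198.51.100.0/24 documentation range, plus one address
# from an unrelated documentation range.
bogus_prefix = "198.51.96.0/20"
monitored_ips = ["198.51.100.10", "198.51.100.25", "203.0.113.5"]

for ip in addresses_covered(bogus_prefix, monitored_ips):
    print(f"{ip} is covered by {bogus_prefix}")
```

A /20 spans 4,096 addresses, so a single wrong character in the prefix can easily swallow someone else's addresses.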

So – what is the takeaway from this?  What can be learned?  Here are a few things – several of which I previously knew intellectually and now know at more of a gut level.

  • False BGP advertisements can create a real mess.  I knew this previously – but it had never impacted me as harshly as it did today.  Want to read more on how bad this can be?  Check out the BGPMON blog here: http://www.bgpmon.net/blog/.
  • It seems some ISPs filter or manage BGP more carefully than others.  For example – Level 3 never seemed to be affected by this bogus BGP update.  Time Warner / Road Runner apparently accepted it nearly immediately.  I’m no BGP guru at all – but wow, improvement is needed here.
  • In the future before I open a routing issue ticket, I’ll take a look not only at trace routes, but also at BGP advertisements.  Huge thanks to Hurricane Electric for a great looking glass tool that ultimately helped me get to the bottom of this.
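
For that last point, here is a rough Python sketch of the kind of sanity check I have in mind: given a list of advertisements copied by hand from a looking glass, flag any that overlap our prefix but come from an origin AS we don't expect.  Everything here is an assumption for illustration: the prefixes are documentation-style placeholders, the AS numbers come from the range reserved for documentation, and the function does not query any real looking glass.

```python
import ipaddress


def unexpected_origins(routes, our_prefix, expected_asns):
    """Return (prefix, origin_asn) pairs that overlap our prefix
    but originate from an AS we do not expect."""
    ours = ipaddress.ip_network(our_prefix)
    flagged = []
    for prefix, origin_asn in routes:
        advertised = ipaddress.ip_network(prefix)
        if advertised.overlaps(ours) and origin_asn not in expected_asns:
            flagged.append((prefix, origin_asn))
    return flagged


# Hypothetical, hand-entered data; not taken from the incident above.
seen_routes = [
    ("198.51.100.0/24", 64496),  # the advertisement we expect
    ("198.51.96.0/20", 64511),   # covers our /24, unfamiliar origin
    ("203.0.113.0/24", 64500),   # unrelated prefix
]

for prefix, asn in unexpected_origins(seen_routes, "198.51.100.0/24", {64496}):
    print(f"Suspicious: {prefix} originated by AS{asn}")
```

Anything this flags is worth attaching to the trouble ticket alongside the trace routes.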
