oneZero letter to clients as outages continue – LeapRate Exclusive

Further to our exclusive coverage of the outages suffered over the past week by oneZero and by clients of its hosted services, LeapRate has learned that oneZero management has sent out a series of letters (by email) to clients about the situation.

The emails detail nine outages from October 5 through yesterday (October 13), one of them a scheduled weekend maintenance window, with the timed outages ranging from six to 72 minutes.

We also understand from those affected that another outage occurred today, lasting about 20 minutes.

LeapRate has received a copy of one of the oneZero letters to clients detailing the downtime, which reads as follows:

--------

Subject: oneZero Internet Connectivity Outage

oneZero Hosted Clients,

The purpose of this email is to succinctly define and outline the outages over the past eight days in a means that is easily transferable to both your clients and regulators (if needed). I intend to schedule time this week with each of our hosted clients to discuss in-person, or via phone, the broader nature of the incidents themselves, the failures in the redundancy model we had in place and the inevitable impact this has taken on your client relationships, and confidence in oneZero’s infrastructure. I spent the day with numerous providers, both existing and new, to ensure the necessary steps are also being taken to avoid further issues.

If we do not already have a scheduled time to speak directly, please respond to this email to ********@onezero.com and I will make time before the end of the week to speak with you in person, during your normal business hours.

The outage times over the past week in terms of Internet based connectivity to our servers in NY4 are as follows:

 

10-05-2015 – 13:26 – 14:38 = 72 Min

Reason: Core router, hardware failure.

Recovery: Issue isolated to redundant supervisor on core router, router rebooted, failure ensued, faulty redundant supervisor removed, router rebooted.

 

10-09-2015 – 13:51 – 13:57 = 6 Min

Reason: ISP Blip (LightTower), edge router reboot, core router BGP re-build

Recovery: Automatic, via BGP re-build.

 

10-09-2015 – 14:44 – 15:06 = 22 Min

Reason: ISP Failure (LightTower), repeated edge router reboots.

Recovery: Edge router (LightTower) disconnected from external, core router BGP re-build.

 

(Weekend 10-10-2015 <-> 10-11-2015)

Reason: Intermittent scheduled downtime for maintenance.

Recovery: Redundant supervisor replaced on core router, LightTower issues reported resolved, routing via 3xISP aggregate tested and certified.

 

10-12-2015 – 06:31 – 06:55 = 24 Min

Reason: ISP Failure (LightTower), repeated edge router reboots.

Recovery: Edge router disconnected from external, core router BGP re-build. ISPs isolated to 1 (Hurricane Electric), running as a single provider via BGP.

 

10-13-2015 – 06:44 – 06:59 = 15 Min

Reason: Edge router reboots, despite external connectivity disabled, forces BGP rebuild on core router.

Recovery: Automatic, via BGP.

 

(At this point in time our IT decided to (a) implement static routes in our Core Router to force the BGP mapping to our 2 non-failing ISPs / edge routers, and we dispatched L2 techs onsite).

 

10-13-2015 – 10:16 – 10:22 = 6 Min

Reason: Edge router reboots, despite external connectivity disabled, forces BGP rebuild on core router.

Recovery: Automatic, via BGP (this occurred as our techs arrived on scene).

 

10-13-2015 – 10:48 – 11:09 = 21 Min

Reason: Placement of static routes, ISP reconfiguration, BGP re-build.

Recovery: On-site techs worked through multiple packet loss scenarios as the result of implementing static routes within the core router, and completely powering down the failing edge router (LightTower).

 

10-13-2015 – 11:14 – 11:45 = 31 Min

Reason: Routing loop between edge routers after static route configuration.

Recovery: Additional static routes implemented to ensure core<->edge relationships across both functioning ISPs.
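
The durations quoted in the letter can be cross-checked against the listed start and end times. The short Python sketch below is not part of oneZero's email; the names `WINDOWS` and `minutes` are illustrative only, and it assumes each window starts and ends on the same calendar day, as the letter's entries suggest. The scheduled weekend maintenance and today's roughly 20-minute outage are left out because the letter gives no exact times for them.

```python
from datetime import datetime

# Timed outage windows as listed in the letter: (date, start, end).
# The weekend maintenance entry is omitted because it has no stated times.
WINDOWS = [
    ("10-05-2015", "13:26", "14:38"),
    ("10-09-2015", "13:51", "13:57"),
    ("10-09-2015", "14:44", "15:06"),
    ("10-12-2015", "06:31", "06:55"),
    ("10-13-2015", "06:44", "06:59"),
    ("10-13-2015", "10:16", "10:22"),
    ("10-13-2015", "10:48", "11:09"),
    ("10-13-2015", "11:14", "11:45"),
]


def minutes(date: str, start: str, end: str) -> int:
    """Return the length of one outage window in whole minutes."""
    fmt = "%m-%d-%Y %H:%M"
    t0 = datetime.strptime(f"{date} {start}", fmt)
    t1 = datetime.strptime(f"{date} {end}", fmt)
    return int((t1 - t0).total_seconds() // 60)


durations = [minutes(*w) for w in WINDOWS]
total = sum(durations)

for (date, start, end), d in zip(WINDOWS, durations):
    print(f"{date} {start}-{end}: {d} min")
print(f"Total timed downtime: {total} min")
```

Run as written, the per-window figures match those in the letter (72, 6, 22, 24, 15, 6, 21 and 31 minutes), for a total of 197 minutes of timed downtime across the eight unscheduled outages.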
