On the 29th August, we experienced a major disruption to our hosted service. The outage started at 5pm AEST.
Infrastructure failures and maintenance outages are uncommon, but they do occur, and we apologise for any impact this outage caused.
Osmotion seeks to provide transparency for our clients in these situations. With this in mind, we have detailed what occurred below:
Osmotion uses a third-party datacenter provider called SoftLayer (www.softlayer.com). When providing hosting services, Osmotion commissions and maintains our clients' installations on Virtual Machines hosted on SoftLayer infrastructure.
Osmotion received two notices of an impending outage – one on the 3rd July and another on the 27th August.
The notice received on 3rd July indicated that the routers would be upgraded and that some customers may experience a loss of network connectivity.
The second notice, on 27th August, indicated that SAN-based CCI and VPX instances in the affected POD would need to be shut down during the router upgrade to guarantee SAN data integrity. It also indicated a maximum downtime of 60 minutes for the CCI and VPX instances, and stated that this maintenance could not be deferred or rescheduled.
Unfortunately, the second notice reproduced the first notice, with the additional information added as a further paragraph. Osmotion system administrators did not notice the additional paragraph and wrongly assumed it was a duplicate or reminder notice, one of many we receive from SoftLayer.
Once the outage was detected, the Osmotion technical and support teams worked together closely to keep our clients abreast of the situation.
Our major focus is always to determine whether we have worked within our SLA and whether we could have done anything better. With this in mind, we have conducted a review of the incident.
The outage in this instance was scheduled maintenance. This maintenance window was critical and could not be rescheduled. Because of the time zone difference, the 5pm AEST start fell at 3pm local time in Western Australia, so the impact on our WA clients was significant.
This highlighted the potential for human error: because the notice was not properly detected, we were unable to work through our options together with our customers.
We have raised these concerns with SoftLayer and have also modified our internal processes to reduce the risk of this recurring. In addition to working on the issues associated with the notification process, we are examining other options for limiting the windows for planned maintenance that is not within our control.
As we solve these problems we will communicate the solutions to our hosted customers.
Many of you will recall that earlier in the year we sent out requests for emergency contact details. At that time, we also asked for your preferred contact method in the event of an emergency.
We are pleased to say we were able to use these details to contact all affected clients and provide half-hourly updates via email.
We have identified the areas where we can improve our internal processes and have a plan in place.
Dealing with our clients directly is, and will always be, an integral part of our support value proposition. As mentioned above, when we contacted clients earlier in the year for emergency contact details, we asked you to nominate your preferred method of contact.
We understand that communication methods other than phone are often more suitable, particularly where there are significant time zone differences or where people may not always be available. In these instances, notifications will be sent via email and text message where requested.
We have maintained phone calls as the initial contact method for those who have specified this preference.
We will implement the improvements identified in our review. We apologise for the disruption and thank you for your co-operation.